Highlights

• We propose an end-to-end method for Multispectral Photometric Stereo, without any extra information.
• For the first time, our MPS-Net takes the initial surface normal into account, which provides a state-of-the-art estimation.
• We design a localized convolutional neural network to establish flexible mapping considering the adjacent structural feature.
MPS-Net: Learning to Recover Surface Normal for Multispectral Photometric Stereo

Yakun Ju^a, Lin Qi^a, Jichao He^a, Xinghui Dong^b, Feng Gao^a, Junyu Dong^a,∗

a Department of Computer Science and Technology, Ocean University of China, Qingdao, China
b Centre for Imaging Sciences, The University of Manchester, Manchester, UK
∗ Corresponding author. Email address: [email protected] (Junyu Dong)

Abstract

Multispectral Photometric Stereo (MPS) estimates per-pixel surface normals from one single image captured under three colored (red, green and blue) light sources. Unlike traditional Photometric Stereo, MPS can therefore be used in dynamic scenes for single-frame reconstruction. However, MPS is challenging due to the tangle of the illumination, surface reflectance and camera response, which causes inaccurate estimation of the surface normal. Existing approaches rely on either extra depth information or material calibration strategies, thus limiting their usage in practical applications. In this paper, we propose a Multispectral Photometric Stereo Network (MPS-Net) to solve this under-determined system. The MPS-Net takes the single multispectral image and an initial surface normal estimation obtained from this image
itself, and outputs an accurate surface normal map; no extra depth or material calibration information is required. We show that the MPS-Net is not constrained to Lambertian surfaces and can be applied to surfaces with complex reflectance. We evaluated the MPS-Net using both synthetic and real objects of various materials. Our experimental results show that the MPS-Net outperforms the state-of-the-art approaches.

Keywords: Surface normal estimation, Multispectral photometric stereo, Neural network
1. Introduction
Multispectral photometric stereo (MPS) can estimate surface normals from a single image of an object illuminated simultaneously by three colored (red, green and blue) light sources. Therefore, it allows single-frame reconstruction in dynamic scenes. This idea was first demonstrated in [1, 2, 3] and has been shown to be able to efficiently produce surface normal estimation in dynamic scenes [4, 5, 6].
However, the major weakness of the existing MPS methods is the assumption of Lambertian reflectance and constant chromaticity of the target object. For objects with varying chromaticities, existing methods appeal to extra depth information [7], regularization of the normal field [8] or time multiplexing [9].
In this paper, we propose an innovative end-to-end solution using deep neural networks, the multispectral photometric stereo network (MPS-Net), to predict surface normals from the multispectral image and an initial normal map (as shown in Fig. 1). We use a localized convolutional neural network (CNN) to establish a flexible mapping from the input data to pixel-wise dense surface normals. MPS-Net uses only the information in the image itself, without extra depth or material calibration. The input includes two components: the observed image and the initial normal map estimated by applying three-channel photometric stereo to the observed image. We believe that the initial surface normal provides better prior information to the network and is then corrected through MPS-Net under the constraint of the original input image.
A variety of bidirectional reflectance distribution functions (BRDFs) from the MERL database [10] are used for training so that our network can deal with objects with complex reflectance rather than only Lambertian surfaces. We trained the network on the Blobby Shape Dataset [11], and it works well on both synthetic and real datasets, including the Stanford 3D Scanning Dataset [12], the Web 3D models and the DiLiGenT Benchmark [13]. We also tested the generalization ability of the MPS-Net with respect to illumination directions and found that it can still predict satisfactory results when the illumination directions differ from those used in the training stage. The proposed method is therefore more practical than existing learning-based approaches, which require the same light directions in the training and prediction stages [14].

Figure 1: The overview of the proposed method. Given a multispectral image with pre-defined light directions and the initial normal map (obtained by three-channel separated photometric stereo on the r, g, b channels, then fused with the image) as input, MPS-Net estimates an accurate normal map of the object. In MPS-Net, the multispectral image corrects the initial normal map to produce the accurate results (see Section 3.2).
2. Related work

Recently, estimating the surface normals of deforming objects has drawn increasing attention among researchers in the computer vision community. Orientation-sensing techniques based on photometry are good at processing high-frequency information, while range-sensing technologies, such as multiview stereo, are suitable for dealing with low-frequency information [15, 16]. In this section, we focus on reviewing photometric stereo methods.
Conventional photometric stereo methods [17] produce pixel-wise surface normals based on the Lambertian model. For deforming objects, multispectral photometric stereo was first introduced by Petrov et al. [2] and has been used in many applications [18, 19]. Some researchers [20] employed coarse depth information obtained using a Kinect or binocular stereo to iteratively search for an optimized solution. Hernandez et al. [21] utilized a planar calibration object with special markings that allow the plane orientation to be estimated. However, these methods require a lot of prior knowledge and need to incorporate additional cameras.
Ozawa et al. [19] estimated the surface normals by exploiting the reflectance norm distribution, under the assumption that the surface is colored with a finite number of materials and the surface regions of the same reflectance are sufficiently curved. Fyffe et al. [22] simultaneously estimated the colors and surface normals of textured surfaces using a multispectral camera. Kawabata et al. [23] moved a step further by adding a reflectance basis set obtained from principal component analysis. Both of them added a smoothness constraint on the surfaces. Ozawa et al. [6] successfully estimated the reflectance spectra and surface normals of an arbitrarily colored surface from a single hyperspectral image, where both spectrally and spatially arranged illuminations serve as light sources for measuring reflectance spectra and shading images. However, multispectral cameras are prone to suffer from spatial, spectral and temporal resolution issues. Moreover, these methods can hardly handle objects with non-Lambertian surfaces.
With the development of deep learning techniques in recent years, these techniques have also been applied to normal estimation. Yoon et al. [24] utilized Generative Adversarial Networks (GANs) to estimate surface normals from a single image. However, this work was restricted to infrared images, ignoring the color information. Recently, some researchers [14, 25, 26] proposed deep neural networks to regress per-pixel normals. These works can handle non-Lambertian surfaces and achieve dense results. However, they are hardly able to tackle dynamic objects. Lu et al. [27] estimated the initial depth of multispectral images and fed this data to a classical algorithm as prior information. However, this method was restricted to specific albedos and generated rough results. Ju et al. [28] used a two-step pipeline to estimate surface normals from a single colored image by demultiplexing the multiplexed multispectral image. Nevertheless, this method performed estimation based solely on the reflectance observations of a single pixel, and cannot take full advantage of the information embedded in local surface points. Besides, the learning of [28] is unstable and its second photometric stereo step is intractable. To solve these issues, we introduce a new method which robustly estimates the surface normals from a single multispectral image.
3. MPS-Net

In this section, we first introduce the theoretical background. Then, we describe the learning framework of the MPS-Net. Finally, we present the details of the network architecture.
3.1. Preliminaries

We consider a Lambertian surface lit by three light sources with different spectra. Following the work introduced in [2], the intensity of the pixel (x, y) in an observed image c can be described as:

$$c_i = \sum_{k} \mathbf{l}_k^T \mathbf{n} \int E_k(\lambda) R(\lambda) S_i(\lambda)\, d\lambda, \qquad (1)$$
where $c_i$ ($i \in \{r, g, b\}$) represents the $i$th channel in c, $\mathbf{n}$ and $R$ are the surface normal and the spectral reflectance of the surface respectively, $\mathbf{l}_k$ and $E_k$ ($k = 1, 2, 3$) are the $k$th light vector and its energy distribution respectively, and $S_i$ is the camera-sensitivity function of the $i$th channel.

It can be observed that the tangle of the illumination, surface reflectance and camera response in Eq. 1 is caused by the non-ideal camera and light sources as well as the under-constrained surface reflectance. Classical methods [29] require extra information in order to calibrate the tangled part. If we ignore the aliasing between channels, we can simplify Eq. 1 into the following three-channel separated photometric stereo:

$$c = \rho\, \mathbf{l}_k^T \mathbf{n}_{init}, \qquad (2)$$
where $\rho$ is a fixed scalar that replaces the spectral reflectance. The initial normal $\mathbf{n}_{init}$ in Eq. 2 can then be easily solved based on [17]. As expected, the initial normal $\mathbf{n}_{init}$ obtained under this simplified condition is erroneous. We show the error in Fig. 2.
Figure 2: Examples of the error under the simplified condition. In each row, the observed multispectral image, the inaccurate initial normal obtained using Eq. (2), the ground truth and the error map are shown. The mean angular error (13.87° and 16.53° for the two examples) is shown at the bottom-left corner of each error map.
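For concreteness, Eq. (2) can be solved per pixel by inverting the stacked 3×3 light matrix, as in classical photometric stereo [17]. Below is a minimal NumPy sketch of this baseline; the function and variable names are ours, and the scale ρ is recovered as the norm of the unnormalized solution:

```python
import numpy as np

def initial_normal(image, light_dirs):
    """Initial per-pixel normals from one RGB multispectral image (Eq. 2).

    image:      (H, W, 3) array; channel i is assumed to respond only to light i.
    light_dirs: (3, 3) array; row k is the unit direction of the k-th light.
    """
    H, W, _ = image.shape
    L_inv = np.linalg.inv(light_dirs)               # invert the light matrix
    b = image.reshape(-1, 3) @ L_inv.T              # solve rho * L n = c per pixel
    rho = np.linalg.norm(b, axis=1, keepdims=True)  # the fixed scalar rho (albedo-like)
    n_init = b / np.maximum(rho, 1e-8)              # unit-length initial normals
    return n_init.reshape(H, W, 3)
```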
In addition, non-Lambertian surfaces are common in the real world and challenging for multispectral photometric stereo. Hence, we consider a more flexible and robust learning solution for estimating normals from a multispectral image.
3.2. Learning framework

A novel learning framework (MPS-Net) is proposed to tackle the tangle and non-Lambertian problems. Given an inaccurate initial normal $\mathbf{n}_{init}$, we first combine it with the multispectral image c, and then the mapping from the fused input to the surface normal is learned by the MPS-Net (see Fig. 1). The MPS-Net can be treated as a function that essentially corrects the inaccurate initial normal from the observed image and predicts the accurate surface normal. This function f is approximated by the deep neural network MPS-Net as:

$$\mathbf{n}_{est} = f(\mathbf{n}_{init}, (c_r, c_g, c_b)), \qquad (3)$$

where $\mathbf{n}_{est}$ represents the estimated surface normal map.
There are two reasons to fuse the initial normal with the multispectral image. First, from the model perspective, the error of the initial normal is highly related to the multispectral image: it is the colored or non-Lambertian surface in the observed image that causes the deviation of the initial normal. The image and the initial normal thus always complement each other. Therefore, we decided to use the multispectral image to correct the error in the initial normal, relying on the strong fitting ability of the deep neural network, rather than establish an unstable mapping directly from image to normal map. Second, the addition of the initial normal makes the network converge faster due to the intrinsic properties of neural networks. The initial normal acts as prior information. Compared to an input with only the multispectral image, the fused counterpart carries information much closer to the accurate predicted normal, which lets the network concentrate on learning the difference between the initial and accurate normals.
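Conceptually, the fused input of Eq. (3) can be viewed as a channel-wise concatenation of the image patch and the initial normal patch (in the network of Section 3.3, the two patches are first lifted by 1×1 convolutions before being concatenated). A minimal sketch of this view, with placeholder arrays:

```python
import numpy as np

c_patch = np.zeros((5, 5, 3))       # multispectral image patch (r, g, b)
n_init_patch = np.zeros((5, 5, 3))  # initial normal patch from Eq. (2)

# The fused input stacks the two patches along the channel axis.
fused = np.concatenate([c_patch, n_init_patch], axis=-1)   # shape (5, 5, 6)
```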
3.3. Network architecture

Unlike existing methods, we choose a local image patch instead of a single pixel, considering the advantage of the information embedded in the neighborhood surface. In MPS-Net, the introduction of neighborhood pixels allows the continuity of the image to also be considered as a constraint. Generally, complex surfaces are composed of simple and small surface patches. However, the general shape of the object, which is characterized by a large patch or the whole image, is not informative for learning pixel-wise normals. The conventional patch size [25, 26] is too large: it reduces the diversity of the data and may introduce over-fitting problems. Therefore, we use a novel unequal input and output (IO) convolutional neural network to automatically learn features from local patches and guide surface normal estimation. The default size of the input local patches is 5×5 pixels, while the output is the estimated normal of the center pixel. In order to map the per-pixel normals back to the original image, the stride between neighboring patches at test time is 1, which ensures that the estimated normals (at the center positions of the patches) are densely arranged. The architecture is presented in Fig. 3.
Figure 3: The network architecture of the proposed MPS-Net. BN represents the batch normalization operation. "Image" denotes the multispectral image. The red numbers give the sizes of the feature maps, and the bold numbers after "Conv" give the kernel size and the output dimension of each convolutional layer. Dropout layers are introduced to simulate the cast shadow.
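As a concrete reading of the stride-1 test-time sliding described above, the following NumPy sketch enumerates the 5×5 windows so that every pixel receives a prediction at its patch center (the border padding mode is our assumption; the paper does not specify how boundary pixels are handled):

```python
import numpy as np

def extract_patches(fused, k=5):
    """Enumerate k x k patches with stride 1; one patch per output pixel."""
    r = k // 2
    padded = np.pad(fused, ((r, r), (r, r), (0, 0)), mode="reflect")
    H, W, C = fused.shape
    patches = np.empty((H * W, k, k, C), dtype=fused.dtype)
    i = 0
    for y in range(H):
        for x in range(W):
            patches[i] = padded[y:y + k, x:x + k, :]   # window centered at (y, x)
            i += 1
    return patches  # network outputs reshape back into an (H, W, 3) normal map
```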
We now describe the network architecture in detail. As shown in Fig. 3, the fused input consists of the multispectral image patch and the initial normal patch. We call the first part of the network the "Feature Extractor" and the last two layers the "Normal Generator".

The first eight layers in the "Feature Extractor" are composed of 1×1 and 3×3 filters with "SAME" padding, which means that the feature maps of these layers always remain 5×5. The 1×1 convolutional layers are used to increase the dimensions of the two inputs, which are then fused by a concatenation operation. The randomization after "Conv2" means that we randomize the order of the feature maps of "Conv2". This process guarantees that the MPS-Net does not depend on the order of the feature maps, so the MPS-Net can learn from multispectral images captured under different light directions. We also use a shortcut connection to link "Conv7" and "Conv8", where "Conv8" represents the features of the initial normal (its dimension is increased to 1024 to enable the shortcut connection). Thus, the first eight layers can be treated as a residual block [30] focused on feature extraction.
Then, the last two convolutional layers reduce the feature map size from 5×5 to 1×1. Note that the stride applied in all layers is 1 pixel except for the first layer in the "Normal Generator", which utilizes a stride of 2 pixels. Therefore, the size is decreased to 3×3 at "Conv9". A convolutional layer with padding="VALID" is utilized afterwards, so the result of "Conv10" is a single pixel. An L2-normalization operation is appended at the end of the network to ensure a normalized output.
The Elu activation function [31] is used at each layer. Dropout layers [32] are introduced after "Conv3" and "Conv5"; we utilize these two dropout layers to simulate the inevitable cast shadows in an MPS system [14]. We also apply batch normalization [33] to each layer except the first two layers, where it may otherwise break low-level feature detectors.
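The following Keras sketch gives one concrete reading of Fig. 3 under the description above. The assignment of "Conv1" to the initial-normal branch and "Conv2" to the image branch, the dropout rates, the widths of Conv3–Conv7 and the additive form of the shortcut are our assumptions; only the patch sizes, padding, stride, activations, dropout/BN placement and the final L2 normalization are stated in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv(x, filters, kernel, stride=1, padding="same", bn=True):
    """Conv + Elu, optionally followed by batch normalization (Fig. 3 legend)."""
    x = layers.Conv2D(filters, kernel, strides=stride, padding=padding)(x)
    x = layers.Activation("elu")(x)
    return layers.BatchNormalization()(x) if bn else x

def shuffle_channels(x):
    # Randomize the order of Conv2's feature maps (light-order invariance).
    perm = tf.random.shuffle(tf.range(tf.shape(x)[-1]))
    return tf.gather(x, perm, axis=-1)

nrm = layers.Input((5, 5, 3))                      # initial normal patch (Eq. 2)
img = layers.Input((5, 5, 3))                      # multispectral image patch

f_nrm = conv(nrm, 256, 1, bn=False)                # Conv1: 1x1x256 (no BN)
f_img = conv(img, 256, 1, bn=False)                # Conv2: 1x1x256 (no BN)
f_img = layers.Lambda(shuffle_channels)(f_img)     # randomization after Conv2

x = layers.Concatenate()([f_nrm, f_img])           # fuse the two branches
x = conv(x, 512, 3)                                # Conv3: 3x3x512
x = layers.Dropout(0.2)(x)                         # dropout (rate assumed)
x = conv(x, 512, 3)                                # Conv4: 3x3x512
x = conv(x, 1024, 3)                               # Conv5: 3x3x1024
x = layers.Dropout(0.2)(x)                         # dropout (rate assumed)
x = conv(x, 1024, 3)                               # Conv6: 3x3x1024
x = conv(x, 1024, 3)                               # Conv7: 3x3x1024

skip = conv(f_nrm, 1024, 1)                        # Conv8: 1x1x1024 on normal branch
x = layers.Add()([x, skip])                        # shortcut linking Conv7 and Conv8

x = conv(x, 1024, 3, stride=2)                     # Conv9: stride 2, 5x5 -> 3x3
x = layers.Conv2D(3, 3, padding="valid",
                  activation="elu")(x)             # Conv10: VALID, 3x3 -> 1x1x3
out = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(x)

mps_net = Model([nrm, img], layers.Reshape((3,))(out))
```

At test time, the stride-1 patches from extract_patches above can be batched through mps_net to produce the dense normal map.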
Our network is trained with a pixel-by-pixel mean angular error (MAE) loss, i.e., the angular deviation between the estimated normal $\mathbf{n}_{est}$ and the ground truth $\mathbf{n}$. It can be written as:

$$L_{MAE}(\mathbf{n}_{est}, \mathbf{n}) = \arccos\left( \frac{\langle \mathbf{n}_{est}, \mathbf{n} \rangle}{\|\mathbf{n}_{est}\|\, \|\mathbf{n}\|} \right), \qquad (4)$$

where $\mathbf{n}$ is the ground-truth normal, $\mathbf{n}_{est}$ is the estimated normal and $\langle \cdot, \cdot \rangle$ denotes the dot product. $L_{MAE}$ is minimized using Adam [34] with the suggested default settings.
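A direct TensorFlow transcription of Eq. (4), usable as a Keras loss (the clamp on the cosine, which keeps arccos finite and differentiable, is our addition):

```python
import tensorflow as tf

def angular_loss(n_gt, n_est):
    """Pixel-wise angular error of Eq. (4), averaged over the batch."""
    cos = tf.reduce_sum(n_gt * n_est, axis=-1) / (
        tf.norm(n_gt, axis=-1) * tf.norm(n_est, axis=-1) + 1e-8)
    cos = tf.clip_by_value(cos, -1.0 + 1e-7, 1.0 - 1e-7)   # keep arccos finite
    return tf.reduce_mean(tf.acos(cos))  # radians; multiply by 180/pi for degrees
```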
4. Datasets

Deep learning approaches normally require a huge number of training samples for learning a regressor. For training the MPS-Net, a dataset which includes the images, initial normal maps and ground-truth normal maps of objects is needed. However, the ground-truth normals of real objects and ideal light sources are usually not available. In this study, we train the MPS-Net using widely-used synthetic data and evaluate it using both synthetic and real datasets from previous studies [26, 13]. Experimental results show that the MPS-Net trained using the synthetic datasets generalizes well to the real datasets.
4.1. Synthetic datasets

The synthetic datasets include the Blobby Shape dataset [11], the Stanford 3D Scanning dataset [12] and dozens of 3D models downloaded from Sketchfab (https://sketchfab.com/) and Free3d (https://free3d.com/), which we name the "Web 3D" dataset. Following the work introduced in [14, 26], we employ the MERL dataset [10], which contains the BRDFs of 100 different materials. We render the Blobby, Stanford and Web 3D datasets with the MERL dataset under the pre-defined light sources by following the method used in [14]. The three pre-defined lights have the same slant angle (30°) and are evenly separated by a tilt angle (120°). In order to simulate real conditions, each rendered image contains at least two materials. Note that the MERL dataset is a dictionary which contains every incident and exit angle under white light. Therefore, we combine the channels of three white-light images according to a proportion into the R, G and B channels, to derive a pseudo multispectral image as performed in [28]. This method effectively simulates the tangle of the illumination, surface reflectance and camera response.
The training set comprises eight models contained in the Blobby Shape
213
dataset and six models included in the Web 3D dataset. The rest two models
214
contained in the Blobby Shape dataset are used as the validation set. The
215
remaining six models of the Web 3D dataset and the Stanford 3D Scanning
216
dataset are utilized for testing.
4.2. Real datasets

First, we employed the DiLiGenT Benchmark [13] as a real dataset for testing. This dataset contains 10 objects made from complex non-Lambertian materials. For each object, 96 images were captured under different light directions. In order to obtain multispectral images, we use the method proposed in [28] with the same configurations as for the synthetic datasets (three images are selected for each object and the light intensities are normalized). It is worth noting that the pre-defined light directions in the DiLiGenT Benchmark are different from those used in our training set. We will analyze the performance of the proposed network under different light directions and evaluate its robustness.

Furthermore, we also built an MPS system to capture real fabrics in order to demonstrate the generalization ability of MPS-Net on real objects. Fabrics are deformable materials, whose surface normals are always challenging for traditional PS to recover.
4.2.1. The experimental setup of the MPS system

Our experimental setup is shown in Fig. 4. We used an IDS UI-358xCP-C camera placed at the top-center of a circular orbit. The lights were placed on the circular orbit around the camera to provide varying illumination directions. The three lights have the same slant angle (30°) and are evenly separated by a tilt angle (120°).

In this experiment, we use the fixed weights learned from the training set to demonstrate the robustness of our network. We compare it against PS-FCN, the Demultiplexer and the baseline. It should be noted that the lighting intensity, spectral distribution and camera response are all changed in our real MPS system.
Figure 4: Experimental device. The red box represents the camera and the yellow circles represent the (red, green and blue) lights.
5. Experiments

In this section, we describe the implementation details of the proposed network and evaluate it using different setups. Regarding the evaluation, we first conduct a network analysis for the MPS-Net on the validation set and then compare it with the state-of-the-art methods using both synthetic and real datasets. We employ the angular error (in degrees) as the performance metric to measure the accuracy of the estimated normal maps.
5.1. Implementation details

The MPS-Net is implemented using Tensorflow on Ubuntu 16.04. The training set includes 1.5 × 10^6 input patches with a size of 5×5 pixels and the corresponding ground-truth normal data. We train our model on two NVIDIA GTX 1080Ti GPUs using a batch size of 500 for 20 epochs. The initial learning rate is set to 0.001, with the Adam [34] default parameters (β1 = 0.9 and β2 = 0.999).
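These hyper-parameters translate directly into a Keras training configuration (a sketch; mps_net and angular_loss refer to the sketches above, and the training tensor names are placeholders):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
mps_net.compile(optimizer=optimizer, loss=angular_loss)

# train_normals, train_images: (N, 5, 5, 3) patch tensors; train_gt: (N, 3) normals
mps_net.fit([train_normals, train_images], train_gt, batch_size=500, epochs=20)
```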
5.2. Network analysis

We quantitatively analyze our network using the validation set. Fixed-size image patches centered at the estimated normal pixels are used as the input of the MPS-Net. The fused input is composed of the observed image and the inaccurate initial normal estimated using the three-channel separated photometric stereo (see Eq. 2). Therefore, the effects of the input size and of the fusion with the initial normal are analyzed in this subsection.

We assess the effectiveness of the 1×1 patch (i.e., the single pixel), the 3×3 patch, the 5×5 patch, the 7×7 patch and the 9×9 patch, as well as the effectiveness of the fusion operation. For the 9×9, 7×7 and 3×3 patches, we tune the number of convolutional layers with "VALID" padding in order to ensure a single-pixel estimated normal. For the 1×1 patch, we replace all the convolutional layers by fully-connected layers with the same dimensions; in this case, the structure of our network is similar to that used in [14]. For the network without the fused input, only image patches are used, while the concatenation and shortcut connection operations are discarded. We randomly select 5 × 10^5 patches sampled from the validation set and report the mean angular error and the max angular error. These results are summarized in Table 1.

Table 1: Results of the network analysis. The digits represent the mean angular error (MeAE) or the max angular error (MaAE) across all the selected patches (the lower the better). I and N stand for the multispectral image and the initial normal respectively.
Metrics  Patch type   1×1      3×3      5×5      7×7      9×9
MeAE     I            10.09°   9.18°    8.75°    8.53°    10.20°
MaAE     I            57.25°   49.93°   43.24°   44.93°   44.71°
MeAE     I+N          8.28°    7.41°    7.03°    7.11°    7.72°
MaAE     I+N          42.61°   37.31°   31.82°   32.05°   33.09°
5.2.1. Effect of different patch sizes

It can be observed that the mean angular error and the max angular error decrease with increasing patch size until the size reaches 5×5; the errors then tend to become stable. This finding supports our hypothesis that the local patch takes advantage of the information embedded in the neighborhood surface points. Moreover, the local patch is able to represent the non-Lambertian surface, avoiding the influence of shadows or highlights which completely cover the information of a single pixel. On the other hand, the 7×7 and 9×9 patches encode redundant information and increase the computational cost, which may introduce extra error. More importantly, larger patches lead to blurring of the estimated normal map because farther pixels interfere with the center pixel. Therefore, we choose the 5×5 patch as the default setting of the MPS-Net.
5.2.2. Effect of fusion with the initial normal

Referring to the results shown in Table 1, we can see that both the mean angular error and the max angular error decrease across all patch sizes when the fused input (I+N) is used. Compared with the input which only uses the image patch, the initial normal provides information closer to the ground-truth data. This results in a more stable learning process and more rapid convergence (see Section 3.2). In addition, the initial normal map and the corresponding multispectral image are complementary to each other: when the initial normal map is highly inaccurate, its corresponding multispectral image provides different patterns to the MPS-Net, ensuring an accurate prediction.
5.3. Evaluation on different materials

In Fig. 5, we compare the MPS-Net with the Demultiplexer [28], PS-FCN [26] and the baseline results derived using Eq. 2 on the validation set (Blobs8). Note that PS-FCN is a photometric stereo method, but it allows input of arbitrary size. Therefore, we use the three channels of the multispectral image as three input images during the training and test stages.

It can be seen that the MPS-Net significantly outperforms the baseline results and the Demultiplexer. The Demultiplexer ignores the information embedded in local surface points, while MPS-Net obtains continuity constraints and spatial information from neighboring pixels. With the help of the fusion with the initial normal, our method is stable across different materials and is superior to PS-FCN in most cases. For PS-FCN, we believe that the max-pooling strategy [35] it uses can achieve good results when there are a large number of input channels (e.g., 96), while the effect becomes worse in an MPS system (only three input channels).

Figure 5: Comparison between MPS-Net, Demultiplexer, PS-FCN and baseline (initial normal) on the samples of Blobs8 in the Blobby Shape Dataset [11] rendered with 100 different BRDFs of the MERL dataset [10]. Images in the top-left corner show several rendered samples.
5.4. Evaluation on synthetic datasets

We use the Stanford 3D Scanning dataset and the Web 3D dataset to quantitatively evaluate the proposed MPS-Net. The comparison between the MPS-Net, the Demultiplexer, PS-FCN and the baseline is shown in Fig. 6. The selected objects in Fig. 6 are representative, ranging from simple to complex, and covering both Lambertian and non-Lambertian materials.
Figure 6: Quantitative results obtained from the synthetic datasets. GT means the ground-truth data and the digits in the error maps represent mean angular errors in degrees (MPS-Net / PS-FCN / Demultiplexer / Baseline for the four objects: 7.06°/10.11°/11.19°/13.15°, 5.38°/7.66°/7.71°/8.41°, 8.96°/10.71°/13.90°/14.47°, 9.12°/10.45°/10.84°/15.05°). Note that the third object "Dragon" has been rotated for better display.

Compared with the other methods, the MPS-Net produces better results on all objects, whether of complicated or simple shape. It can be observed that the MPS-Net is more robust in regions with multiple BRDFs (see the first two objects in Fig. 6): the normal maps estimated by our method are almost unaffected when the material changes, and the boundaries of the material changes are hardly visible in the MPS-Net results. In addition, the surface normals generated by PS-FCN are accompanied by much noise (see the last object in Fig. 6) when the inputs are complex objects. This may be because the large patch input of the CNN weakens its fitting and generalization ability.
5.5. Evaluation on real datasets

5.5.1. DiLiGenT Benchmark

In order to further evaluate the proposed MPS-Net method, we compare it against the Demultiplexer, PS-FCN and the baseline on the DiLiGenT Benchmark [13], which comes with ground-truth data. It should be noted that the pre-defined light directions in the DiLiGenT Benchmark are different from our training counterparts. The 96 light directions of the DiLiGenT Benchmark dataset are shown in Fig. 7. In fact, for the DiLiGenT Benchmark, the lighting intensity, spectral distribution and camera response all differ from the training dataset. However, we did not retrain the network but used the fixed weights learned from the training set. The results demonstrate the robustness of our network.
In the rendering, we use three light directions that are as evenly distributed as possible (e.g., ⟨48, 1, 96⟩). Since the surface of the object is not flat, there are cast shadows in a particular illumination direction, where parts of the surface can be occluded from the light source by other parts [36]. When the three illumination directions are close to each other, there will be severe cast shadows. As a convention, we therefore use evenly distributed lights to avoid such cast shadows. (There is a very small difference in the illumination directions of each object in DiLiGenT. Here, we select the exact illumination directions of each object.)
Figure 7: The 96 illumination directions of "Bear" in the DiLiGenT Benchmark [13]. Each number represents the image sequence index of the corresponding light direction.
First, we set the light directions to ⟨48, 1, 96⟩. The experiment results are shown in Fig. 8. The selected objects in Fig. 8 are representative, ranging from simple to complex, and covering both Lambertian and non-Lambertian materials. Compared with PS-FCN and the Demultiplexer, MPS-Net performs well on all objects. In particular, MPS-Net achieves smoother results and fewer errors. It generates the best result even on the most non-Lambertian surface, e.g., the "Reading" object: when PS-FCN has only three input channels (the MPS system), max-pooling may preserve the highlight area as the maximum response, affecting the quality of the generated normal map. The reason that our method achieves better results is that the initial normal map and the image in the highlight region can interact with and constrain each other. The Demultiplexer produces a worse result than the initial normal map on this object. This may be attributed to the incompetency of the Demultiplexer when it deals with strongly non-Lambertian and dark surfaces. Note that our method has a large error at the top of the head of "Reading". This is because our method is also a single-image based algorithm and lacks enough information for such a complicated structure.

Second, we analyze the influence of different light directions. We choose different combinations of directions and examine the mean angular error of MPS-Net. We randomly selected three non-coplanar lighting locations in each group, and the directions of the four groups together cover a circle of lights. The results are reported in Table 2.
Figure 8: Quantitative results obtained from the DiLiGenT Benchmark [13]. GT means the ground-truth data and the numbers in the error maps represent the mean angular error in degrees (MPS-Net / PS-FCN / Demultiplexer / Baseline: Goblet 11.43°/13.92°/12.46°/15.80°, Reading 17.10°/19.74°/22.62°/21.81°, Bear 8.40°/12.29°/18.77°/18.87°, Buddha 9.54°/12.69°/14.10°/16.31°).

Table 2: Comparison of different light direction combinations on DiLiGenT. The four combinations are drawn from the light directions shown in Fig. 7. The numbers represent the mean angular error in degrees.

No.          Ball    Bear    Pot1    Pot2    Goblet  Reading  Cow     Harvest  Cat    Buddha
⟨48,1,96⟩    5.39°   9.74°   10.61°  10.69°  11.43°  17.10°   10.78°  16.02°   9.90°  11.54°
⟨8,41,89⟩    5.35°   9.91°   10.50°  10.99°  12.07°  17.31°   10.29°  15.44°   9.31°  11.58°
⟨73,72,27⟩   5.17°   9.44°   10.29°  10.93°  11.40°  17.85°   9.92°   15.81°   8.97°  10.73°
⟨22,50,78⟩   5.52°   10.03°  10.51°  11.36°  12.25°  17.41°   10.13°  16.55°   9.46°  10.93°

It can be observed that MPS-Net is insensitive to the change of light directions: the error of each combination remains relatively consistent. The incorporation of the initial normal should account for the robust results of MPS-Net, since the initial normal is not affected by the change of light directions and MPS-Net can be treated as a correction to the initial normal map. Moreover, the application of the randomization after "Conv2" further increases the robustness under different light directions (see Section 3.3).
5.5.2. Real objects

We captured images of real objects (multicolored fabrics). The comparison between the MPS-Net, the Demultiplexer, PS-FCN and the baseline is shown in Fig. 9.

Figure 9: Comparison between MPS-Net, Demultiplexer, PS-FCN and baseline (initial normal) on the real captured objects.

As shown in Fig. 9, the boundaries of the multicolored fabrics can be clearly seen. The baseline method, which directly uses the three channels of the multispectral image, causes discontinuities in the normal map. The error comes from the deviation of the surface albedo estimation and the tangle among illumination, surface reflectance and camera response, which leads to an under-determined system. The experiment shows that the results of the Demultiplexer are better than the baseline, but there are still obvious discontinuous boundaries on the normal map, because that method only uses single pixels, ignoring the information embedded in the local surface and causing information uncertainty. The results of PS-FCN show smoother normal maps, but fuzzy boundaries still exist. This is attributed to the fact that larger input patches reduce the generalization of the network to multicolored surfaces.

In contrast, MPS-Net performs best on the multicolored real captured objects. The reason is the use of the 5×5 patch: with a 5×5 input patch, changes of BRDFs more than 2 pixels away from the center pixel do not affect the estimated normal. Thus, our network is robust to multicolored surfaces. We also note that, compared with PS-FCN, MPS-Net utilizes the initial normal map, which provides a more robust approach: we use the multispectral image to correct the error in the initial normal rather than establish an unstable mapping between image and normal map.
6. Conclusion

In this paper, we proposed a novel learning framework for multispectral photometric stereo, namely MPS-Net. The proposed MPS-Net is able to estimate an accurate normal map from a single multispectral image. MPS-Net does not require any extra prior information and can be used with light directions that differ slightly from those learned during training (and likewise with a different camera and light intensities). Experimental results have demonstrated the excellent performance of MPS-Net on various BRDFs. The results obtained from the synthetic and real datasets indicate the power of MPS-Net compared with the state-of-the-art methods.

In future work, we plan to design a multi-scale pyramid network to provide different receptive fields, which will take multi-scale local context information into account. We believe this will further improve the accuracy of our surface normal recovery.
Conflicts of interest

There are no conflicts of interest.
References
[1] M. S. Drew, L. L. Kontsevich, Closed-form attitude determination under spectrally varying illumination, Computer Vision and Pattern Recognition (1994) 985–990.
[2] A. P. Petrov, I. S. Vergelskaya, L. L. Kontsevich, Reconstruction of shape from shading in color images, Journal of the Optical Society of America A 12 (1994) 1047–1052.
[3] R. J. Woodham, Gradient and curvature from the photometric-stereo method, including local confidence estimation, Journal of the Optical Society of America A 11 (1994) 3050–3068.
[4] H. Kim, B. Wilburn, M. Benezra, Photometric stereo for dynamic surface orientations, European Conference on Computer Vision (2010) 59–72.
[5] G. J. Brostow, C. Hernandez, G. Vogiatzis, B. Stenger, R. Cipolla, Video normals from colored lights, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011) 2104–2114.
[6] K. Ozawa, I. Sato, M. Yamaguchi, Hyperspectral photometric stereo for a single capture, Journal of the Optical Society of America A 34 (2017) 384–394.
[7] R. Anderson, B. Stenger, R. Cipolla, Color photometric stereo for multicolored surfaces, International Conference on Computer Vision (2011) 2182–2189.
[8] Z. Jankó, A. Delaunoy, E. Prados, Colour dynamic photometric stereo for textured surfaces, Asian Conference on Computer Vision (2010) 55–66.
[9] B. De Decker, J. Kautz, T. Mertens, P. Bekaert, Capturing multiple illumination conditions using time and color multiplexing, Computer Vision and Pattern Recognition (2009) 2536–2543.
[10] W. Matusik, H. Pfister, M. Brand, L. McMillan, A data-driven reflectance model, ACM Transactions on Graphics (2003) 759–769.
[11] M. K. Johnson, E. H. Adelson, Shape estimation in natural illumination, Computer Vision and Pattern Recognition (2011) 2553–2560.
[12] B. Curless, M. Levoy, A volumetric method for building complex models from range images, Conference on Computer Graphics and Interactive Techniques (1996) 303–312.
[13] B. Shi, Z. Mo, Z. Wu, D. Duan, S. K. Yeung, P. Tan, A benchmark dataset and evaluation for non-Lambertian and uncalibrated photometric stereo, IEEE Transactions on Pattern Analysis and Machine Intelligence PP (2018).
[14] H. Santo, M. Samejima, Y. Sugano, B. Shi, Y. Matsushita, Deep photometric stereo network, International Conference on Computer Vision Workshop (2017) 501–509.
[15] J. L. Schönberger, J. M. Frahm, Structure-from-motion revisited, IEEE Conference on Computer Vision and Pattern Recognition (2016).
[16] J. Vongkulbhisal, R. Cabral, F. D. L. Torre, J. P. Costeira, Motion from structure (MfS): Searching for 3D objects in cluttered point trajectories, Computer Vision and Pattern Recognition (2016) 5639–5647.
[17] R. J. Woodham, Photometric method for determining surface orientation from multiple images, Optical Engineering 19 (1980) 139–144.
[18] B. Bringier, D. Helbert, M. Khoudeir, Photometric reconstruction of a dynamic textured surface from just one color image acquisition, Journal of the Optical Society of America A 25 (2008) 566.
[19] K. Ozawa, I. Sato, M. Yamaguchi, Single color image photometric stereo for multi-colored surfaces, Computer Vision and Image Understanding (2018).
[20] R. Anderson, B. Stenger, R. Cipolla, Augmenting depth camera output using photometric stereo, MVA 1 (2011).
[21] C. Hernandez, G. Vogiatzis, G. J. Brostow, B. Stenger, R. Cipolla, Non-rigid photometric stereo with colored lights, IEEE International Conference on Computer Vision (2007) 1–8.
[22] G. Fyffe, Single-shot photometric stereo by spectral multiplexing, IEEE International Conference on Computational Photography (2010) 1–6.
[23] T. Kawabata, F. Sakaue, J. Sato, One shot photometric stereo from reflectance classification, 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (2016) 620–627.
[24] Y. Yoon, G. Choe, N. Kim, J.-Y. Lee, I. S. Kweon, Fine-scale surface normal estimation using a single NIR image, European Conference on Computer Vision (2016) 486–500.
[25] T. Taniai, T. Maehara, Neural inverse rendering for general reflectance photometric stereo, International Conference on Machine Learning (2018) 4864–4873.
[26] G. Chen, K. Han, K.-Y. K. Wong, PS-FCN: A flexible learning framework for photometric stereo, European Conference on Computer Vision (2018) 3–19.
[27] L. Lu, L. Qi, Y. Luo, H. Jiao, J. Dong, Three-dimensional reconstruction from single image base on combination of CNN and multi-spectral photometric stereo, Sensors 18 (2018) 764.
[28] Y. Ju, L. Qi, H. Zhou, J. Dong, L. Lu, Demultiplexing colored images for multispectral photometric stereo via deep neural networks, IEEE Access 6 (2018) 30804–30818.
[29] H. Jiao, Y. Luo, N. Wang, L. Qi, J. Dong, H. Lei, Underwater multi-spectral photometric stereo reconstruction from a single RGBD image, Signal and Information Processing Association Summit and Conference (2017) 1–4.
[30] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 770–778.
[31] D. A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), International Conference on Machine Learning (2015).
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958.
[33] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning (2015) 448–456.
[34] D. Kingma, J. Ba, Adam: A method for stochastic optimization, Proceedings of the International Conference on Learning Representations (2014).
[35] O. Wiles, A. Zisserman, SilNet: Single- and multi-view reconstruction by learning from silhouettes, British Machine Vision Conference (2017).
[36] L. Wu, A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, Y. Ma, Robust photometric stereo via low-rank matrix completion and recovery, Asian Conference on Computer Vision (2010) 703–717.
Yakun Ju received the B.Sc. degree of engineering in industrial design from Sichuan University, Chengdu, China, in 2016. He is currently pursuing the Ph.D. degree in computer application technology with the Department of Computer Science and Technology, Ocean University of China, Qingdao, China. His research interests include 3D reconstruction, machine learning and image processing.

Lin Qi received his B.Sc. and M.Sc. degrees from Ocean University of China in 2005 and 2008 respectively, and received his Ph.D. in computer science from Heriot-Watt University in 2012. He is now an associate professor in the Department of Computer Science and Technology in Ocean University of China. His research interests include computer vision and visual perception.

Jichao He received the B.Sc. degree in Information Security at Sichuan University, Chengdu, China, in 2018. Currently, he is pursuing the Master's degree at Ocean University of China, Qingdao, China. His research interests include computer vision, machine learning and deep learning.

Xinghui Dong received the Ph.D. degree from Heriot-Watt University, U.K., in 2014. He is currently a Research Associate with the Centre for Imaging Sciences, The University of Manchester, U.K. His research interests include automatic defect detection, image representation, texture analysis, and visual perception.

Feng Gao received his B.Sc. degree from the Department of Computer Science, Chongqing University, Chongqing, China in 2008, and received the Ph.D. degree from the Department of Computer Science and Engineering, Beihang University, Beijing, China in 2015. He is currently an associate professor in the Department of Computer Science and Technology in Ocean University of China. His research interests include computer vision and remote sensing.

Junyu Dong received the B.Sc. and M.Sc. degrees from the Department of Applied Mathematics, Ocean University of China, Qingdao, China, in 1993 and 1999, respectively, and the Ph.D. degree in image processing from the Department of Computer Science, Heriot-Watt University, U.K., in 2003. He joined Ocean University of China in 2004, and he is currently a Professor and the Head of the Department of Computer Science and Technology. His research interests include machine learning, big data, computer vision, and underwater vision.