Highlights

• A novel dual-cue fused network is proposed for surface normal recovery, which exploits specular highlights, shadows and interreflections appearing in local image patches while maintaining high-frequency details.

• Compared to previous multispectral photometric stereo algorithms, the proposed method requires no extra information and breaks the limitation of Lambertian surfaces.

• The dual-cue fused network outperforms existing approaches in robustness under complex illumination.
A Dual-Cue Network for Multispectral Photometric Stereo

Yakun Ju^a, Xinghui Dong^b, Yingyu Wang^a, Lin Qi^a, Junyu Dong^a,∗

^a Department of Computer Science and Technology, Ocean University of China, Qingdao, China
^b Centre for Imaging Sciences, The University of Manchester, Manchester, UK
Abstract

Estimating pixel-wise surface normals from a single image is a challenging task but offers great value to computer vision and robotics applications. By using spectrally and spatially variant illumination, multispectral photometric stereo can produce pixel-wise surface normals from just one image. However, multispectral photometric stereo methods may encounter the tangle of illumination, surface reflectance and camera response, which leads to an under-determined system. Existing approaches rely on either extra depth information or material calibration strategies, assuming a Lambertian surface condition, which limits their application in practical systems. Previous learning-based methods employ fully-connected or CNN architectures to estimate surface normals. Compared with a fully-connected framework, a CNN takes advantage of the information embedded in the neighborhood of a surface point, but loses high-frequency surface normal details. In this paper, we present a new method that addresses this task by designing two stacked deep networks. We first apply a CNN-based structural cue network to approximate coarse surface normals on small patches. Then, we use a pixel-level fully-connected photometric cue network to further refine surface normal details and correct errors from the first step. The fused network is robust to non-Lambertian surfaces and complex illumination environments, such as ambient light and variant light directions. Experimental results show that our dual-cue fused network outperforms existing methods.

Keywords: Multispectral photometric stereo, Normal estimation, Deep neural networks, Networks fusion

∗Corresponding author. Email address: [email protected] (Junyu Dong)
1. Introduction

Recovering dense 3D shapes is a fundamental and challenging problem in the field of computer vision [1]. Traditional photometric stereo methods can produce pixel-wise surface normal estimates using multiple images captured with a stationary camera under changing illumination [2]. However, these requirements limit its use in dynamic applications. Multispectral photometric stereo is a popular method for handling non-rigid/moving objects using a single image [3, 4], which only requires three colored lights (i.e., red, green and blue) to illuminate the target simultaneously. Generally, photometric stereo takes a time-division multiplexing strategy, whereas multispectral photometric stereo uses a spectral-division multiplexing strategy.

The biggest challenge for multispectral photometric stereo is the tangle of illumination, surface reflectance and camera response, which leads to an under-determined system. Mathematically, it is hard to solve for the normals of varying chromaticities. Previous researchers have investigated different approaches, including prior depth information [5], calibration of the surface material [6] and regularization of the normal field [7]. However, existing methods bear the following limitations. First, pre-calibration and prior depth information may be unavailable in many circumstances. Second, those methods are time-consuming and their accuracy can be further improved. More importantly, the existing methods require the Lambertian surface assumption.

Deep learning methods have been widely employed in computer vision tasks. CNNs have been successfully applied to dense regression problems such as depth estimation [8] and surface normal estimation [9, 10]. CNN-based normal estimation methods like PS-FCN [11] can better handle specular highlights, shadows and interreflections [12], as they all form the appearance of a local image patch. However, according to the results and analysis of our experiments, CNN-based methods produce relatively fuzzy surface normal output, losing high-frequency details. This is partly caused by the increased receptive field in deeper convolution layers, which involves irrelevant pixels far away from the convolution center in the image lattice.

To solve the above problems, we designed a dual-cue network that combines the advantages of a CNN and a fully-connected network, called the structural cue network and the photometric cue network, respectively. Like [8], we present a new method that addresses this task by employing two stacked deep networks: we first apply a CNN-based structural cue network to approximate the coarse normal. Then, we apply a pixel-level fully-connected photometric cue network to refine the coarse normal. The photometric cue network enhances high-frequency details and further corrects the errors introduced by the structural cue network. An overview of the proposed network is shown in Figure 1.

The main contributions of this work are summarized as follows.

• A novel dual-cue fused network is proposed for surface normal recovery, which exploits specular highlights, shadows and interreflections appearing in local image patches while maintaining high-frequency details.

• Compared to previous approaches, our method requires no extra information and breaks the limitation of Lambertian surfaces.

• The dual-cue fused network outperforms existing approaches in robustness under complex illumination.

Figure 1: The overview of the dual-cue fused network. It consists of two components: the structural cue network and the photometric cue network. Given a single image, our structural cue network first generates the coarse surface normal data. These normal data are then used as the input of a photometric cue network which produces a fine surface normal map. The structural cue network is implemented as a convolutional neural network (CNN) while the photometric cue network is built by a fully-connected network (FC Net).
2. Related Work

Photometric stereo [2] methods were designed based on the Lambertian model [13], which provides a dense surface normal estimation. Some researchers [14, 11] introduced learning-based frameworks and achieved better results in non-Lambertian cases. However, these methods require many images of a target object and cannot handle non-rigid or moving objects.

To estimate the surface normal from a single image, many multispectral photometric stereo methods have been proposed over the last 20 years [15], and have been used in different applications [16, 17]. Hernandez et al. [6] used a pre-calibration approach and obtained accurate surface normals of fabrics, where a plane with special marks was used. Some researchers employed a coarse surface estimate as the initial input and iteratively searched for an optimized solution [5]. However, these methods require prior information for solving the under-determined equations (see Eq. 5) and are affected by non-Lambertian surfaces.

On the other hand, the field of 3D reconstruction has also benefited from learning-based techniques. Recently, some researchers investigated deep learning techniques in the context of multispectral photometric stereo. Ju et al. [18] used a fully-connected neural network to estimate surface normals from a single colored image. In contrast, Lu et al. [19] estimated the coarse depth of multispectral images using a CNN and used the coarse depth to solve an under-determined system. Antensteiner et al. [20] proposed a Unet-like network to solve multispectral photometric stereo, where they only tested on images of coins with uniform albedo (surface reflectance).
3. The Dual-Cue Fused Network

The proposed dual-cue fused network comprises two modules: the structural cue network and the photometric cue network, as shown in Figure 1. The structural cue network predicts the coarse normal using patches. A CNN-based framework can better handle specular highlights, shadows and interreflections, as they all contribute to the appearance of a local image patch. The coarse normal is then combined with the input image and passed to the photometric cue network to learn a fine per-pixel normal. The photometric cue network enhances high-frequency details and further corrects the errors brought by the irrelevant pixels far away from the convolution center in the image lattice. Our dual-cue fused network can be written as a function:

$$ n_{est} = f_{pcn}(c, f_{scn}(c)), \qquad (1) $$

where n_est represents the estimated surface normal map, c represents the input multispectral image, f_scn represents the structural cue network and f_pcn represents the photometric cue network.
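To make the data flow of Eq. (1) concrete, the following is a minimal Python sketch; `scn` and `pcn` are placeholder callables standing in for the trained structural and photometric cue networks, and the channel-wise concatenation and final L2-normalization follow the descriptions in Sections 3.1 and 3.2 (the function and argument names are ours, not code from the paper).

```python
import numpy as np

def estimate_normal(image, scn, pcn):
    """Minimal sketch of Eq. (1): n_est = f_pcn(c, f_scn(c)).

    image: H x W x 3 multispectral observation c (r, g, b channels).
    scn:   callable standing in for the structural cue network f_scn,
           mapping an image patch to a coarse normal map of the same size.
    pcn:   callable standing in for the photometric cue network f_pcn,
           mapping the concatenated (c, coarse normal) tensor to normals.
    """
    coarse = scn(image)                                       # coarse surface normal, H x W x 3
    fused_input = np.concatenate([image, coarse], axis=-1)    # H x W x 6 input to the refiner
    normal = pcn(fused_input)                                 # refined surface normal, H x W x 3
    # Normalize to unit length, as the paper's L2-normalization layer does.
    return normal / np.linalg.norm(normal, axis=-1, keepdims=True)
```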
3.1. The Structural Cue Network

Unlike the previous works [14, 18], which only map from a single pixel value to the surface normal, we introduce the structural cue network, which takes full advantage of the information embedded in the neighborhood of a surface point. Additionally, the features extracted from the global information are seldom affected by ambient light.

Using the whole image as input reduces the diversity of the data and introduces an over-fitting problem. Considering that complex surfaces comprise simple and small surface patches, we choose a patch rather than the whole image as the input. The structural cue network takes an r, g, b-channel image patch within a neighborhood C ∈ R^{40×40×3} as the input. The structural cue network only consists of convolutional layers, including an Interleaved Group Convolution (IGC) [21] and a Unet-based network [22]. The kernel size of each layer is 3 × 3. The structural cue network is described in Figure 2.

Figure 2: The architecture of the structural cue network. The red digits represent the dimensions of feature maps. The 1×1 Conv denotes the kernel size of a convolutional layer.

For the image captured under the red, green and blue lights, we separate the three channels and feed them into three convolutional layers, respectively. Since multispectral photometric stereo uses the spectral-division multiplexing strategy, each channel has different features. Therefore, we apply a multi-branch network to extract the unique features from each channel independently, rather than processing the three channels as a single entity.
Then, IGC [21] is applied to the concatenation of the three channels. IGC creates an interleaved group convolution block: channels contained in the same partition of the secondary group convolution come from different partitions used in the primary group convolution. It addresses the redundancy problem of convolutional filters in the channel domain and enhances robustness by disrupting the features extracted from the r, g, b channels. As a result, our network avoids being affected by the order of the lights, which has not been considered by other methods.
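As an illustration of this interleaving idea, here is a rough TensorFlow sketch of an interleaved group convolution block in the spirit of [21]; the partition counts, filter widths and layer arrangement are illustrative assumptions, not the exact configuration used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def igc_block(x, primary_partitions=3, kernel_size=3):
    """Sketch of an interleaved group convolution block (illustrative settings only)."""
    channels = x.shape[-1]
    group_size = channels // primary_partitions

    # Primary group convolution: a spatial convolution applied to each partition independently.
    primary = [layers.Conv2D(group_size, kernel_size, padding="same", activation="relu")(g)
               for g in tf.split(x, primary_partitions, axis=-1)]

    # Interleave channels: channels landing in the same secondary partition
    # come from different primary partitions (a channel shuffle between the two group convs).
    stacked = tf.stack(primary, axis=-1)            # (B, H, W, group_size, partitions)
    shuffled = tf.reshape(stacked, tf.shape(x))     # back to (B, H, W, channels), interleaved order

    # Secondary group convolution: 1x1 convolutions mixing channels inside each new partition.
    secondary = [layers.Conv2D(primary_partitions, 1, activation="relu")(g)
                 for g in tf.split(shuffled, group_size, axis=-1)]
    return tf.concat(secondary, axis=-1)

# Example: a 40x40 feature map with 96 channels, as might follow the three-branch stage.
features = tf.random.normal([1, 40, 40, 96])
out = igc_block(features)  # shape (1, 40, 40, 96)
```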
The Unet-based network is applied after IGC, where the feature maps are down-sampled three times by convolutions with stride 2 and are then up-sampled three times by deconvolutions. This increases the size of the receptive field and preserves the spatial information with a small memory consumption, allowing a deeper network.

To train the structural cue network, we define a loss function which consists of a gradient loss and a content loss. We write the total loss L_scn as the minimized difference between the coarse surface normal predicted by the structural cue network, n_coarse, and the ground-truth normal data n:

$$ L_{scn}(n_{coarse}, n) = \lambda_{grad} L_{grad}(n_{coarse}, n) + \lambda_{cont} L_{cont}(n_{coarse}, n), \qquad (2) $$
where λ_grad and λ_cont are the weights for the gradient loss L_grad(n_coarse, n) and the content loss L_cont(n_coarse, n), respectively. In this paper, we set λ_grad = 0.1 and λ_cont = 0.9 by experiments.

For the gradient loss, we utilize the combined two-directional gradient $\nabla = \sqrt{\nabla_x^2 + \nabla_y^2}$ used in depth estimation [23]. ∇_x and ∇_y represent the gradients in the horizontal and vertical directions, respectively. They are used to penalize the boundaries of the coarse surface normal. We use the gradient loss L_grad(n_coarse, n) to fulfil this constraint as:

$$ L_{grad}(n_{coarse}, n) = \| \nabla(n_{coarse}) - \nabla(n) \|_2^2. \qquad (3) $$

Furthermore, we introduce a content loss in order to further constrain the structural cue network. The content loss is implemented as the Euclidean distance between n_coarse and n, which is calculated as:

$$ L_{cont}(n_{coarse}, n) = \| n_{coarse} - n \|_2^2. \qquad (4) $$
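The following is a small NumPy sketch of the structural cue loss in Eqs. (2)-(4). The finite-difference discretization of the gradient operator and the use of a per-pixel mean rather than a sum are our assumptions, since the paper does not spell these details out.

```python
import numpy as np

def gradient_magnitude(n):
    """Combined two-directional gradient sqrt(grad_x^2 + grad_y^2) of a normal map (H x W x 3)."""
    gx = np.diff(n, axis=1, append=n[:, -1:, :])   # horizontal finite difference
    gy = np.diff(n, axis=0, append=n[-1:, :, :])   # vertical finite difference
    return np.sqrt(gx ** 2 + gy ** 2)

def structural_cue_loss(n_coarse, n_gt, lam_grad=0.1, lam_cont=0.9):
    """Sketch of Eq. (2): weighted sum of the gradient loss (3) and content loss (4)."""
    l_grad = np.mean((gradient_magnitude(n_coarse) - gradient_magnitude(n_gt)) ** 2)
    l_cont = np.mean((n_coarse - n_gt) ** 2)
    return lam_grad * l_grad + lam_cont * l_cont
```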
3.2. The Photometric Cue Network

In a multispectral photometric stereo system [15], the measurement of a single point can be represented as:

$$ c_i = l^{T} n \int E(\lambda) R(\lambda) S_i(\lambda) \, d\lambda, \qquad (5) $$
where c_i is the intensity of the pixel in channel i (i ∈ {r, g, b}), E(λ) represents the energy distribution of the incident illumination at wavelength λ, R(λ) represents the spectral reflectance function of the object's surface, S_i(λ) is the camera sensor response for channel i, and l and n represent the incident illumination direction and the pixel's surface normal. The tangle of illumination, surface reflectance and camera response can be seen in Eq. 5, which leads to an under-determined system.
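For intuition, Eq. (5) can be evaluated numerically once the spectra are sampled on a wavelength grid; the sketch below does this under the Lambertian assumption, and every spectrum in the usage example is made up purely for illustration.

```python
import numpy as np

def pixel_intensity(l, n, wavelengths, E, R, S_i):
    """Numerical evaluation of Eq. (5): c_i = (l^T n) * integral of E(l) R(l) S_i(l) dl.
    l, n: 3-vectors (light direction, surface normal); the spectra are arrays sampled
    on the same wavelength grid."""
    shading = float(np.dot(l, n))                   # shading term l^T n
    spectral = np.trapz(E * R * S_i, wavelengths)   # spectral integral
    return shading * spectral

# Illustrative (made-up) spectra on a 400-700 nm grid.
wl = np.linspace(400.0, 700.0, 61)
E = np.exp(-0.5 * ((wl - 620.0) / 25.0) ** 2)       # a red-ish light source
R = np.full_like(wl, 0.6)                           # flat surface reflectance
S_r = np.exp(-0.5 * ((wl - 600.0) / 40.0) ** 2)     # red channel sensor response
c_r = pixel_intensity([0.0, 0.5, 0.866], [0.0, 0.0, 1.0], wl, E, R, S_r)
```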
Differing from photometric stereo, which uses illumination with the same spectrum across a set of images, multispectral photometric stereo recovers the surface normal of a moving/non-rigid object from one image captured under three lights with different spectra (red, green, blue) simultaneously. Due to the tangle in multispectral photometric stereo and the single input image, we have to design a more powerful per-pixel fully-connected network to refine the results, combined with the coarse prediction from the structural cue network.

The photometric cue network extracts the photometric information using a fully-connected network, which learns a mapping from the measurement C ∈ R^3 and the coarse normal n_coa ∈ R^3 to the fine normal n ∈ R^3. The architecture of the photometric cue network is shown in Figure 3. The network takes a 40 × 40 × 6 patch as the input (which concatenates the observation patch C with n_coa). In the photometric cue network, we divide a mini-batch tensor into multiple parts. Each part contains an image patch. Then we reshape the image patch to a 1600 × 6 tensor and feed it into the fully-connected (FC) layers. The photometric cue network applies a multi-branch FC architecture, which significantly reduces the number of network parameters. The ReLU activation function is used and an L2-normalization layer is appended to the end of the network to ensure a normalized output.

The photometric cue network is trained with a combined loss L_pcn(n_est, n), which consists of a structural similarity index (SSIM) loss L_SSIM(n_est, n) and a mean angle error (MAE) loss L_MAE(n_est, n):

$$ L_{pcn}(n_{est}, n) = \lambda_{SSIM} L_{SSIM}(n_{est}, n) + \lambda_{MAE} L_{MAE}(n_{est}, n), \qquad (6) $$
Figure 3: The architecture of the photometric cue network. BS means the mini-batch size. FC means fully-connected layers.
where λ_SSIM and λ_MAE are the weights for the SSIM loss L_SSIM(n_est, n) and the MAE loss L_MAE(n_est, n), respectively. We set λ_SSIM = 0.84 and λ_MAE = 0.16 following the research in [24].

For the SSIM loss L_SSIM(n_est, n), the SSIM index [25] is used to measure the structural similarity. This loss is given by:

$$ L_{SSIM}(n_{est}, n) = 1 - \mathrm{SSIM}(n_{est}, n). \qquad (7) $$

In Eq. (7), SSIM is defined as:

$$ \mathrm{SSIM}(n_{est}, n) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \qquad (8) $$

where μ_x and μ_y are the means of n_est and n respectively, σ_x^2 and σ_y^2 are the variances of n_est and n respectively, σ_xy is their corresponding covariance, and C_1 and C_2 are constants used to keep stability. The SSIM value ranges from 0 to 1. The larger the value is, the more similar the images are.
SSIM may cause changes of brightness and shifts of pixel colors, due to its insensitivity to a uniform bias [26]. Therefore, we also apply an MAE loss as a constraint to improve the predicted results. The MAE loss is defined as the angular deviation between the estimated normal n_est and the ground-truth normal n. The MAE loss can be written as:

$$ L_{MAE}(n_{est}, n) = \arccos\left( \frac{\langle n_{est}, n \rangle}{\| n_{est} \| \, \| n \|} \right), \qquad (9) $$

where ⟨·, ·⟩ denotes the dot product.
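A compact NumPy sketch of the combined loss in Eqs. (6), (7) and (9) follows. Averaging the angular error over pixels and delegating SSIM to an external implementation (e.g. skimage.metrics.structural_similarity) are our assumptions rather than details given in the paper.

```python
import numpy as np

def mae_loss(n_est, n_gt, eps=1e-8):
    """Sketch of Eq. (9): per-pixel angular deviation between estimated and ground-truth normals."""
    dot = np.sum(n_est * n_gt, axis=-1)
    norms = np.linalg.norm(n_est, axis=-1) * np.linalg.norm(n_gt, axis=-1) + eps
    cos = np.clip(dot / norms, -1.0, 1.0)   # clamp to avoid NaNs from rounding
    return np.mean(np.arccos(cos))          # radians; multiply by 180/pi for degrees

def photometric_cue_loss(n_est, n_gt, ssim_fn, lam_ssim=0.84, lam_mae=0.16):
    """Sketch of Eq. (6). ssim_fn is any SSIM implementation returning a scalar in [0, 1];
    it is not reimplemented here."""
    return lam_ssim * (1.0 - ssim_fn(n_est, n_gt)) + lam_mae * mae_loss(n_est, n_gt)
```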
4. Datasets

For training our network, a dataset that includes multispectral images and the ground-truth normal maps of objects is needed. However, the ground-truth normals of real objects and ideal light sources are difficult to measure. Therefore, we use synthetic images obtained from the Blobby Shape Dataset [27], which contains ten synthetic objects. Moreover, we employ the Stanford 3D Scanning dataset [28] and ten 3D models downloaded from the Internet (https://sketchfab.com/, https://free3d.com/), which we name the "Web 3D dataset".

Following the work introduced in [14], we employ the MERL dataset [29], which contains 100 different bidirectional reflectance distribution functions (BRDFs). We render the Blobby Shape Dataset with the MERL dataset under the pre-defined light sources by following the method used in [14]. The three pre-defined lights have the same slant angle (30°) and are evenly separated by a tilt angle (120°). To simulate real-world conditions, each object was rendered using at least two materials. The MERL dataset is a dictionary which records the reflection at different incident and exit angles. We combine the RGB channels of the three white-light images according to a combined proportion (92%, 4%, 4%) to derive a pseudo multispectral image, as performed in [18]. The training set comprises nine models contained in the Blobby Shape dataset. The other model contained in this dataset is used as the validation set. The Stanford dataset and Web 3D dataset are utilized for testing.
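The sketch below is our reading of the 92%/4%/4% channel-mixing step; the exact mixing protocol of [18] may differ in detail, so treat the matrix layout and channel selection as assumptions.

```python
import numpy as np

def pseudo_multispectral(img_r, img_g, img_b, dominant=0.92, leak=0.04):
    """Sketch of the 92%/4%/4% channel mixing used to build a pseudo multispectral image.
    img_r, img_g, img_b: H x W x 3 renderings under a white light placed at the red/green/blue
    light positions. The mixing details are assumptions, not code from the paper or [18]."""
    mix = np.array([[dominant, leak,     leak],
                    [leak,     dominant, leak],
                    [leak,     leak,     dominant]])
    # Take the matching camera channel of each white-light rendering.
    channels = np.stack([img_r[..., 0], img_g[..., 1], img_b[..., 2]], axis=-1)
    # Each output channel is a 92/4/4 blend of the three lights (simulated spectral crosstalk).
    return channels @ mix.T
```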
168
We also acquire a real photoed dataset for testing. This dataset contains
169
ten objects and a color chart under the illuminations of red, green and blue
170
lights simultaneously. The lights have the same slant angle (30◦ ) and are
171
evenly separated by a tilt angle (120◦ ), as we did for the synthetic datasets.
172
5. Experiments
We first perform a network analysis of our method on the validation set and compare our method with state-of-the-art approaches on both the synthetic and real datasets. We report quantitative results on the synthetic dataset and qualitative results on the real photographed dataset. Finally, we further analyze the light robustness of our method.
5.1. Implementation Details

Our method is implemented using TensorFlow 1.4.0. The training set includes 1.3 × 10^5 patches with a size of 40 × 40 pixels. We trained our model on two NVIDIA GTX 1080Ti GPUs using a mini-batch size of 24. The initial learning rate was set to 0.001, with the default Adam parameters (β1 = 0.9 and β2 = 0.999). All the network analyses were measured on the validation set with 1.44 × 10^4 patches.
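As a reference point, the reported hyper-parameters translate into the following short sketch; it is written against the tf.keras API rather than the TensorFlow 1.4 API the authors used, and the model construction itself is elided.

```python
import tensorflow as tf

# Optimizer matching the reported settings: Adam with lr = 0.001, beta1 = 0.9, beta2 = 0.999.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

BATCH_SIZE = 24            # mini-batch size used in the paper
PATCH_SHAPE = (40, 40, 3)  # input patch size for the structural cue network
```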
5.2. Network analysis

5.2.1. The analysis of structural and photometric cues

We first compare our network with its different components. For the comparison "only using the structural cue network", we add the L2-normalization to the end of that network and remove the photometric cue network. For the comparison "only using the photometric cue network", we remove the coarse normal input of the photometric cue network. All the networks are trained with the same training set and parameters. The results on the validation set are shown in Table 1. In this experiment, we evaluate all the results with three metrics: mean angular error, max angular error and the structural similarity index (SSIM).

Table 1: Quantitative evaluation using the validation set. We validate the proposed method with different components. Lower angular errors and higher SSIM are better.

Methods                         Mean angular error (°)   Max angular error (°)   SSIM index
Only photometric cue network    8.09                     32.58                   0.9091
Only structural cue network     8.42                     34.70                   0.8742
Dual-cue fused network          6.81                     25.31                   0.9504
It can be seen that the performance is improved when both cues are considered: the structural cue network takes advantage of the information embedded in the patch, such as specular highlights, and the photometric cue network refines the results. When only the structural cue network is used, the errors become larger. This is partly caused by the increased receptive field in the deeper convolution layers, which involves irrelevant pixels far away from the convolution center in the image lattice. When only the photometric cue network is used, the mean and max angular errors are the largest. The reason is that the global constraint cannot be taken into account, resulting in failures on strongly non-Lambertian surfaces (e.g., where the pixel value is saturated due to specular highlights).
5.2.2. The analysis of different backbone models

We then compared the effects of different CNN modules in the structural cue network, including Unet-based [22], ResNet-based [30] and DenseNet-based [31] modules. We replaced the module after IGC in the structural cue network. For the ResNet-based module, we designed a part with four residual blocks (each block contains 3 convolutional layers) which down-samples the spatial size from 40 × 40 to 5 × 5 using average-pooling, followed by three fully-connected layers to produce the 40 × 40 × 3 coarse surface normal estimate. In terms of the DenseNet-based module, we designed a similar structure but used dense connections instead of residual blocks. All the networks were trained with the same training set and parameters.

Table 2: The results of using different modules. MAE is short for mean angular error.

Models    ResNet-based   DenseNet-based   Unet-based (used)
MAE (°)   7.24           6.96             6.81
Table 2 illustrates that the Unet-based module achieved better results on MAE. We believe that Unet is more suitable for normal estimation because it performs image segmentation at the pixel level, while ResNet and DenseNet conduct classification at the image level. The pixel-level regression of a normal map is more similar to a pixel-wise process. In addition, the multiple fully-connected layers discard the embedded local context information.
5.2.3. The analysis of different loss functions

To compare the loss functions, we use the same network architecture but change the loss functions. L_grad, L_cont, L_SSIM and L_MAE are listed in Table 3 for comparison. We evaluate two metrics: MAE and SSIM.

Table 3: Evaluation of loss functions in the dual-cue network.

Structural cue network loss   Photometric cue network loss   MAE (°)   SSIM
L_cont                        L_MAE                          7.43      0.8839
L_grad + L_cont               L_MAE                          7.14      0.9080
L_cont                        L_SSIM + L_MAE                 6.96      0.9402
L_grad + L_cont               L_SSIM + L_MAE                 6.81      0.9504
As shown in the table, the combined loss functions outperform the others. We conclude that both the global loss and the pixel-by-pixel loss are beneficial for surface normal estimation. Firstly, the gradient loss extracts the gradient information from the coarse surface normal and the ground truth, guiding the whole surface normal recovery process. It penalizes discontinuous boundaries in the estimated surface normal caused by varying surface materials. This global constraint information cannot be obtained from the Euclidean distance loss L_cont. Secondly, the SSIM loss penalizes the surface normal in terms of contrast and structure. It has been widely shown that incorporating SSIM has a beneficial effect [24]. In this paper, the SSIM loss is a supplement to the angular error constraint, especially at edges and in regions with complex structure.
5.3. Comparisons with other methods

We also present a comparison of our method with state-of-the-art methods, including Demultiplexer [18], Semi-learning [19], DPSN [14], PS-FCN [11] and a baseline. We kept their original training settings and only adjusted the form of the dataset to suit the compared methods. Demultiplexer maps the image illuminated by red, green and blue lights into three images illuminated by white lights. We therefore re-render the training set to be illuminated by white lights at the same positions. We then reconstruct depth from the surface normal n with the method of [32] to train the Semi-learning method, which establishes an initial depth estimation network for multispectral photometric stereo. Note that PS-FCN and DPSN are two methods for photometric stereo reconstruction, and the dimensions of their inputs are larger than our single three-channel image. PS-FCN is a photometric stereo method, but it allows an arbitrary number of inputs. Therefore, we use the three channels of the multispectral image as three input images during the training and test stages. For DPSN, we copy each channel of the image 32 times in the training set, making a 96-channel input in total to comply with its settings. In this paper, Baseline represents the method calculating the surface normal by photometric stereo [2] using the three channels of an image directly. We conduct the comparison on both synthetic images and real photographed images.
5.3.1. Quantitative analysis on synthetic images

We first compare the reconstruction performance on the Stanford 3D Scanning dataset and the Web 3D dataset; the results are shown in Figure 4.

As shown in Figure 4, the comparison clearly demonstrates that our method outperforms the others, particularly on objects with complex surface structures. For the shown objects, our method achieved an average MAE of 8.65° on the synthetic test set, better than the 9.89° of Demultiplexer, 21.88° of Semi-learning, 9.81° of DPSN, 9.78° of PS-FCN and 14.03° of Baseline. PS-FCN [11] shows fuzzier surfaces, losing high-frequency details on the surface normal. Meanwhile, the performance of Semi-learning [19] on all objects is unsatisfactory.
Figure 4: Quantitative results on the synthetic objects. The numbers represent the Mean Angular Error (MAE) in degrees. The object names are displayed in the first column. The per-object MAE values shown in the figure are:

Object    Ours     Demultiplexer   Semi-learning   DPSN     PS-FCN   Baseline
Dragon    6.94°    7.51°           19.88°          7.92°    8.13°    9.73°
Rabbit    8.49°    10.24°          22.80°          10.03°   9.75°    16.01°
Lion      8.44°    9.81°           24.97°          9.07°    9.02°    13.73°
Sitting   8.87°    9.20°           19.31°          10.86°   9.93°    20.16°
Monkey    8.10°    9.74°           21.42°          10.04°   10.78°   11.65°
Buddha    7.05°    7.29°           19.02°          7.91°    7.18°    10.07°
Goddess   10.71°   14.36°          20.42°          11.72°   11.16°   14.01°
Man       10.55°   10.92°          27.22°          10.93°   12.36°   16.84°
For multicolored objects such as "Rabbit" and "Lion", it can be seen that the Demultiplexer [18] and DPSN [14] methods show discontinuous boundaries on the estimated normal maps, which affects the accuracy. This is because these two methods apply networks based solely on fully connected layers, which lack neighborhood information and constraints. It can also be found that most methods achieve satisfactory results on Lambertian surfaces such as "Buddha". However, previous methods fail when facing strongly non-Lambertian surfaces such as "Monkey" and "Goddess". This demonstrates the fitting ability of our dual-cue network, which benefits from the local context information encoded in the structural cue network and thus reduces estimation failures caused by specular highlights and shadows.
5.3.2. Qualitative analysis on real objects

In addition to using synthetic data to evaluate our method, we also utilize real objects for experiments. We first take images of real objects illuminated by RGB lights. Figure 5 shows the comparative results.

Figure 5: Qualitative results on the multicolored fabrics.
As shown in Figure 5, the boundaries of the multicolored fabrics and objects can be clearly seen in the input images. The Baseline method, which uses the three channels of the multispectral image directly, shows discontinuities on the normal maps. This error is due to the assumption of constant albedo in the Baseline method, which ignores the tangle of illumination, surface reflectance and camera response. It can be seen from the experiment that the results of DPSN and Demultiplexer are better than the Baseline, but there are still obvious discontinuous boundaries on the normal maps. The results of the Semi-learning method are noisy. For our dual-cue fused network and PS-FCN, the normal maps show almost no discontinuous boundaries. However, the results of PS-FCN are fuzzier. In addition, due to the unavoidable spectral errors caused by real colored lights, the predictions for real objects are worse than for synthetic images. Nevertheless, our network still achieves the best results compared with the other methods trained on the same synthetic training dataset.
We also evaluate our method on a more convincing object. Figure 6 shows the qualitative results on a color chart in different orientations. Since the color chart is a planar plate, the normal should be the same everywhere on it. Compared with the others, our method predicts more accurate normal maps, which are almost unaffected by the change of colors and reflect the correct surface normal. The results indicate that our network outperforms the other methods when facing the reflectance spectra distributions of various albedos.

Figure 6: Qualitative results on a color chart in different orientations.
5.4. Evaluation of robustness

Previous photometric stereo methods [2, 33] are affected by complex illumination conditions such as additional light and varied lighting directions. We also analyze the impact of a complex illumination environment on our method.

5.4.1. Additional light

We use a High Dynamic Range (HDR) environment map (see Figure 7(a)) with an additional light source to introduce illumination noise into the ideal darkroom condition. To evaluate the robustness of the methods, we only apply the additional light to the validation set and keep the darkroom environment for training. The results are shown in Figure 8. We compare the MAE of the normals produced by different methods under the additional light condition and the darkroom condition, as well as the fluctuation ratio α:

$$ \alpha = \frac{| \chi - \chi' |}{\chi}, \qquad (10) $$
where χ represents the MAE under the standard condition and χ' represents the MAE under the complex illumination condition (additional light or changed RGB light directions). The fluctuation ratio measures the robustness of a method: the smaller the value, the stronger the robustness of the evaluated method.
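A one-line helper makes Eq. (10) concrete; the numbers in the comment are illustrative only, not results from the paper.

```python
def fluctuation_ratio(mae_standard, mae_complex):
    """Eq. (10): relative MAE change between the standard (darkroom) condition
    and a complex illumination condition."""
    return abs(mae_standard - mae_complex) / mae_standard

# Illustrative values only: a method with an MAE of 8.0 deg in the darkroom and
# 9.0 deg under an additional light has a fluctuation ratio of 0.125.
```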
Figure 7: (a) The High Dynamic Range environment map used for simulating the additional illumination. (b) Schematic diagram of the changed slant angle. When the light moves from position A to position A', the slant angle changes from a to a'.
It can be seen in Figure 8 that our dual-cue fused network is more robust under the additional light condition, with a smaller fluctuation ratio. The structural cue network considers local context and neighborhood information, which is less affected by the lighting conditions. Therefore, our dual-cue network is more robust than the methods that only consider one cue. Note that Demultiplexer and DPSN perform worst in terms of robustness. This might be explained by the fact that a single fully-connected network lacks the structural constraint embedded in local pixels. As a result, the changed measurement of a single pixel under complex illumination directly influences the estimation.

Figure 8: Evaluation for the additional light condition. The left chart shows the MAE under the darkroom and additional light conditions; the right chart shows the corresponding fluctuation ratios.
5.4.2. Changed RGB lighting directions

In this experiment, we keep the slant angle of the RGB lights in the training set at 30°, while we adjust it in the validation set to 20°, 25°, 30°, 35° and 40°, respectively (see Figure 7(b)). We show the MAE and fluctuation ratios under different slant angles in Figure 9.

Figure 9: Evaluation for changed RGB light directions. The left chart shows the MAE under different slant angles; the right chart shows the corresponding fluctuation ratios.

From Figure 9, it is obvious that our method better resists variations in the slant lighting direction. The robustness obtained using the ResNet-based module and the DenseNet-based module is almost as good as that of the Unet-based module (Proposed). Moreover, it can be seen that the fluctuation ratio of all the deep learning methods is smaller than that of the Baseline, which uses the three channels of a multispectral image directly. This demonstrates the generalization ability of data-driven deep learning methods.
6. Conclusion

In this paper, we proposed a dual-cue fused network to estimate surface normals from a single multispectral photometric stereo image. Unlike previous algorithms, which require extra depth data or pre-calibration, our method estimates surface normals without any prior information. We first apply a CNN-based structural cue network to approximate the coarse surface normal on small patches. Then, we apply a pixel-level fully-connected photometric cue network to refine the coarse surface normal. The structural cue network exploits the neighborhood-embedded features and local context information for the coarse normal estimation, while the photometric cue network further learns finer surface details from a per-pixel perspective. This dual-cue network guarantees the accuracy of surface normal recovery and the robustness to noisy illumination environments. Compared with traditional algorithms, our method is able to handle objects with non-Lambertian surfaces. The experiments on both synthetic images and real photographed images showed that our dual-cue fused network outperformed the existing methods.

Despite offering state-of-the-art performance, our method suffers from a limitation: it assumes light directions close to those used in training. Although our method is robust to changed lighting directions, the deviation of the slant angle is at most 10°, and our method cannot handle completely arbitrary lighting directions. In fact, this issue has remained a difficult problem for learning-based surface normal recovery methods. In the future, we aim to develop more stable methods for arbitrary lighting.
363
7. Acknowledgement
364
This work was supported by the National Key Scientific Instrument De-
365
velopment Project (No.41927805), the International Science & Technology
366
Cooperation Program of China (ISTCP) (No. 2014DFA10410) and the Na-
367
tional Natural Science Foundation of China (NSFC) (No.61501417).
References

[1] M. Jian, Y. Yin, J. Dong, W. Zhang, Comprehensive assessment of non-uniform illumination for 3D heightmap reconstruction in outdoor environments, Computers in Industry 99 (2018) 110–118.

[2] R. J. Woodham, Photometric method for determining surface orientation from multiple images, Optical Engineering 19 (1980) 139–144.

[3] K. Ozawa, I. Sato, M. Yamaguchi, Hyperspectral photometric stereo for a single capture, Journal of the Optical Society of America A 34 (2017) 384–394.

[4] T. Kawabata, F. Sakaue, J. Sato, One shot photometric stereo from reflectance classification, 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (2016) 620–627.

[5] H. Jiao, Y. Luo, N. Wang, L. Qi, J. Dong, H. Lei, Underwater multispectral photometric stereo reconstruction from a single RGBD image, Signal and Information Processing Association Summit and Conference (2017) 1–4.

[6] C. Hernandez, G. Vogiatzis, G. J. Brostow, B. Stenger, R. Cipolla, Non-rigid photometric stereo with colored lights, International Conference on Computer Vision (2007).

[7] Z. Jankó, A. Delaunoy, E. Prados, Colour dynamic photometric stereo for textured surfaces, Asian Conference on Computer Vision (2010) 55–66.

[8] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, Advances in Neural Information Processing Systems (2014) 2366–2374.

[9] T. Taniai, T. Maehara, Neural inverse rendering for general reflectance photometric stereo, International Conference on Machine Learning (2018) 4864–4873.

[10] Y. Ju, L. Qi, J. He, X. Dong, F. Gao, J. Dong, MPS-Net: Learning to recover surface normal for multispectral photometric stereo, Neurocomputing 375 (2020) 62–70.

[11] G. Chen, K. Han, K.-Y. K. Wong, PS-FCN: A flexible learning framework for photometric stereo, European Conference on Computer Vision (2018) 3–19.

[12] S. K. Nayar, K. Ikeuchi, T. Kanade, Shape from interreflections, International Journal of Computer Vision 6 (1991) 173–195.

[13] J. A. Smith, T. L. Lin, K. L. Ranson, The Lambertian assumption and Landsat data, Photogrammetric Engineering & Remote Sensing 46 (1980) 1183–1189.

[14] H. Santo, M. Samejima, Y. Sugano, B. Shi, Y. Matsushita, Deep photometric stereo network, International Conference on Computer Vision Workshop (2017) 501–509.

[15] L. L. Kontsevich, A. P. Petrov, I. S. Vergelskaya, Reconstruction of shape from shading in color images, Journal of the Optical Society of America A 12 (1994) 1047–1052.

[16] H. Kim, B. Wilburn, M. Ben-Ezra, Photometric stereo for dynamic surface orientations, European Conference on Computer Vision (2010) 59–72.

[17] G. J. Brostow, C. Hernandez, G. Vogiatzis, B. Stenger, R. Cipolla, Video normals from colored lights, IEEE Transactions on Pattern Analysis & Machine Intelligence 33 (2011) 2104–2114.

[18] Y. Ju, L. Qi, H. Zhou, J. Dong, L. Lu, Demultiplexing colored images for multispectral photometric stereo via deep neural networks, IEEE Access 6 (2018) 30804–30818.

[19] L. Lu, L. Qi, Y. Luo, H. Jiao, J. Dong, Three-dimensional reconstruction from single image based on combination of CNN and multi-spectral photometric stereo, Sensors (2018).

[20] D. Antensteiner, S. Stolc, D. Soukup, Single image multi-spectral photometric stereo using a split U-shaped CNN, Computer Vision and Pattern Recognition Workshops (2019).

[21] T. Zhang, G. J. Qi, B. Xiao, J. Wang, Interleaved group convolutions, International Conference on Computer Vision (2017) 4383–4392.

[22] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention (2015) 234–241.

[23] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, T. Brox, DeMoN: Depth and motion network for learning monocular stereo, Computer Vision and Pattern Recognition (2017) 5038–5047.

[24] H. Zhao, O. Gallo, I. Frosio, J. Kautz, Loss functions for image restoration with neural networks, IEEE Transactions on Computational Imaging 3 (2016) 47–57.

[25] Z. Wang, A. Bovik, H. Sheikh, E. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (2004) 600–612.

[26] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, A. C. Kot, CRRN: Multi-scale guided concurrent reflection removal network, Computer Vision and Pattern Recognition (2018) 4777–4785.

[27] M. K. Johnson, E. H. Adelson, Shape estimation in natural illumination, Computer Vision and Pattern Recognition (2011) 2553–2560.

[28] B. Curless, M. Levoy, A volumetric method for building complex models from range images, Conference on Computer Graphics and Interactive Techniques (1996) 303–312.

[29] W. Matusik, H. Pfister, M. Brand, L. McMillan, A data-driven reflectance model, ACM Transactions on Graphics (2003) 759–769.

[30] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Computer Vision and Pattern Recognition (2016) 770–778.

[31] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, Computer Vision and Pattern Recognition (2017) 4700–4708.

[32] A. Agrawal, R. Raskar, R. Chellappa, What is the range of surface reconstructions from a gradient field?, European Conference on Computer Vision (2006) 578–591.

[33] X. Jiang, H. Bunke, On error analysis for surface normals determined by photometric stereo, Signal Processing 23 (1991) 221–226.
Biography

Yakun Ju received the B.Sc. degree in industrial design from Sichuan University, Chengdu, China, in 2016. He is currently pursuing the Ph.D. degree in computer application technology with the Department of Computer Science and Technology, Ocean University of China, Qingdao, China. His research interests include 3D reconstruction, machine learning and image processing.

Xinghui Dong received the Ph.D. degree from Heriot-Watt University, U.K., in 2014. He is currently a Research Associate with the Centre for Imaging Sciences, The University of Manchester, U.K. His research interests include automatic defect detection, image representation, texture analysis, and visual perception.

Yingyu Wang received the B.Sc. degree in computer science and technology from the Chengdu University of Technology, Chengdu, China, in 2017. He is currently pursuing the master's degree at Ocean University of China, Qingdao, China. His research interests include computer vision, robotics and deep learning.

Lin Qi received his B.Sc. and M.Sc. degrees from Ocean University of China in 2005 and 2008, respectively, and received his Ph.D. in computer science from Heriot-Watt University in 2012. He is now an associate professor in the Department of Computer Science and Technology at Ocean University of China. His research interests include computer vision and visual perception.

Junyu Dong received the B.Sc. and M.Sc. degrees from the Department of Applied Mathematics, Ocean University of China, Qingdao, China, in 1993 and 1999, respectively, and the Ph.D. degree in image processing from the Department of Computer Science, Heriot-Watt University, U.K., in 2003. He joined Ocean University of China in 2004, and he is currently a Professor and the Head of the Department of Computer Science and Technology. His research interests include machine learning, big data, computer vision, and underwater vision.