Highlights

• A novel dual-cue fused network is proposed for surface normal recovery, which exploits specular highlights, shadows and interreflections appearing in local image patches while maintaining high-frequency details.

• Compared to previous multispectral photometric stereo algorithms, the proposed method requires no extra information and breaks the limitation of Lambertian surfaces.

• The dual-cue fused network outperforms existing approaches in robustness under complex illumination.


A Dual-Cue Network for Multispectral Photometric Stereo

Yakun Ju^a, Xinghui Dong^b, Yingyu Wang^a, Lin Qi^a, Junyu Dong^a,*

^a Department of Computer Science and Technology, Ocean University of China, Qingdao, China
^b Centre for Imaging Sciences, The University of Manchester, Manchester, UK
* Corresponding author. Email address: [email protected] (Junyu Dong)

Abstract

Estimating pixel-wise surface normals from a single image is a challenging task, but it offers great value to computer vision and robotics applications. By using spectrally and spatially varying illumination, multispectral photometric stereo can produce pixel-wise surface normals from just one image. However, multispectral photometric stereo methods may encounter the tangling of illumination, surface reflectance and camera response, which leads to an under-determined system. Existing approaches rely on either extra depth information or material calibration strategies and assume a Lambertian surface, which limits their application in practical systems. Previous learning-based methods employ fully-connected or CNN architectures to estimate surface normals. Compared with a fully-connected framework, a CNN takes advantage of the information embedded in the neighborhood of a surface point, but it loses high-frequency surface normal details. In this paper, we present a new method that addresses this task with two stacked deep networks. We first apply a CNN-based structural cue network to approximate coarse surface normals on small patches. Then, we use a pixel-level fully-connected photometric cue network to further refine surface normal details and correct errors from the first step. The fused network is robust to non-Lambertian surfaces and complex illumination environments, such as ambient light and varying light directions. Experimental results show that our dual-cue fused network outperforms existing methods.

Keywords: Multispectral photometric stereo, Normal estimation, Deep neural networks, Network fusion

1. Introduction

Recovering dense 3D shapes is a fundamental and challenging problem in the field of computer vision [1]. Traditional photometric stereo methods can produce pixel-wise surface normal estimates using multiple images captured with a stationary camera under changing illumination [2]. However, these requirements limit their use in dynamic applications. Multispectral photometric stereo is a popular method to handle non-rigid/moving objects using a single image [3, 4]; it only requires three colored lights (i.e., red, green and blue) to illuminate the target simultaneously. Generally, photometric stereo takes a time-division multiplexing strategy, whereas multispectral photometric stereo uses a spectral-division multiplexing strategy.

The biggest challenge for multispectral photometric stereo is the tangling of illumination, surface reflectance and camera response, which leads to an under-determined system. Mathematically, it is hard to solve for the normals of surfaces with varying chromaticities. Previous researchers have investigated different approaches, including prior depth information [5], calibration of the surface material [6] and regularization of the normal field [7]. However, existing methods bear the following limitations. First, pre-calibration and prior depth information may be unavailable in many circumstances. Second, those methods are time-consuming and their accuracy can be further improved. More importantly, the existing methods require the Lambertian surface assumption.

Deep learning methods have been widely employed in computer vision tasks. CNNs have been successfully applied to dense regression problems such as depth estimation [8] and surface normal estimation [9, 10]. CNN-based normal estimation methods like PS-FCN [11] can better handle specular highlights, shadows and interreflections [12], as they all contribute to the appearance of a local image patch. However, according to the results and analysis of our experiments, CNN-based methods produce relatively fuzzy surface normal output, losing high-frequency details. This is partly caused by the increased receptive field in deeper convolutional layers, which involves irrelevant pixels far away from the convolution center in the image lattice.

To solve the above problems, we design a dual-cue network that combines the advantages of a CNN and a fully-connected network, called the structural cue network and the photometric cue network, respectively. Like [8], we present a new method that addresses this task by stacking two deep networks: we first apply a CNN-based structural cue network to approximate the coarse normals. Then, we apply a pixel-level fully-connected photometric cue network to refine the coarse normals. The photometric cue network enhances high-frequency details and further corrects the errors introduced by the structural cue network. An overview of the proposed network is shown in Figure 1.

Figure 1: The overview of the dual-cue fused network. It consists of two components: the structural cue network and the photometric cue network. Given a single image, the structural cue network first generates the coarse surface normals. These normals are then used as the input to the photometric cue network, which produces a fine surface normal map. The structural cue network is implemented as a convolutional neural network (CNN), while the photometric cue network is built from a fully-connected network (FC Net).

The main contributions of this work are summarized as follows.

• A novel dual-cue fused network is proposed for surface normal recovery, which exploits specular highlights, shadows and interreflections appearing in local image patches while maintaining high-frequency details.

• Compared to previous approaches, our method requires no extra information and breaks the limitation of Lambertian surfaces.

• The dual-cue fused network outperforms existing approaches in robustness under complex illumination.

2. Related Work

Photometric stereo [2] methods were originally designed based on the Lambertian model [13], which provides dense surface normal estimation. Some researchers [14, 11] introduced learning-based frameworks and achieved better results in non-Lambertian cases. However, these methods require many images of a target object and cannot handle non-rigid or moving objects.

To estimate surface normals from a single image, many multispectral photometric stereo methods have been proposed over the last 20 years [15] and used in different applications [16, 17]. Hernandez et al. [6] used a pre-calibration approach with a planar board bearing special marks and obtained accurate surface normals of fabrics. Some researchers employed a coarse surface estimate as the initial input and iteratively searched for an optimized solution [5]. However, these methods require prior information to solve the under-determined equation (see Eq. 5) and are affected by non-Lambertian surfaces.

On the other hand, the field of 3D reconstruction has also benefited from learning-based techniques. Recently, some researchers investigated deep learning techniques in the context of multispectral photometric stereo. Ju et al. [18] used a fully-connected neural network to estimate surface normals from a single colored image. In contrast, Lu et al. [19] estimated the coarse depth of multispectral images using a CNN and used the coarse depth to solve the under-determined system. Antensteiner et al. [20] proposed a Unet-like network for multispectral photometric stereo, but only tested it on images of coins with uniform albedo (surface reflectance).

3. The Dual-Cue Fused Network

The proposed dual-cue fused network comprises two modules: the structural cue network and the photometric cue network, as shown in Figure 1. The structural cue network predicts coarse normals from patches; a CNN-based framework can better handle specular highlights, shadows and interreflections, as they all contribute to the appearance of a local image patch. The coarse prediction is then combined with the input image and passed to the photometric cue network to learn a fine per-pixel normal. The photometric cue network enhances high-frequency details and further corrects the errors brought by the irrelevant pixels far away from the convolution center in the image lattice. Our dual-cue fused network can be written as a function:

n_{est} = f_{pcn}(c, f_{scn}(c)),    (1)

where n_{est} represents the estimated surface normal map, c represents the input multispectral image, f_{scn} represents the structural cue network and f_{pcn} represents the photometric cue network.
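For illustration, Eq. (1) amounts to composing the two modules at inference time. The short sketch below assumes `structural_cue_net` and `photometric_cue_net` are already-trained TensorFlow models with the input/output shapes described in Sections 3.1 and 3.2; the function name and wiring are illustrative, not the authors' released code.

```python
import tensorflow as tf

def dual_cue_predict(image, structural_cue_net, photometric_cue_net):
    """Sketch of Eq. (1): n_est = f_pcn(c, f_scn(c)).

    image: a [batch, 40, 40, 3] multispectral patch c.
    Both networks are assumed to produce/consume [batch, 40, 40, 3] normal maps.
    """
    coarse_normal = structural_cue_net(image)              # f_scn(c)
    combined = tf.concat([image, coarse_normal], axis=-1)   # 40 x 40 x 6 input
    fine_normal = photometric_cue_net(combined)             # f_pcn(c, f_scn(c))
    return fine_normal
```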

3.1. The Structural Cue Network

Unlike the previous works [14, 18], which only map a single pixel value to a surface normal, we introduce the structural cue network, which takes full advantage of the information embedded in the neighborhood of a surface point. Additionally, the features extracted from this wider context are seldom affected by ambient light.

Taking the whole image as input reduces the diversity of the data and introduces an over-fitting problem. Considering that complex surfaces comprise simple and small surface patches, we choose a patch rather than the whole image as the input. The structural cue network takes an r, g, b-channel image patch within a neighborhood C ∈ R^{40×40×3} as the input. The structural cue network only consists of convolutional layers, including an Interleaved Group Convolution (IGC) [21] block and a Unet-based network [22]. The kernel size of each layer is 3 × 3. The structural cue network is described in Figure 2.

Figure 2: The architecture of the structural cue network. The red digits represent the dimensions of the feature maps. "1×1 Conv" denotes the kernel size of a convolutional layer.

For the image captured under the red, green and blue lights, we separate the three channels and feed them into three convolutional layers, respectively. Since multispectral photometric stereo uses a spectral-division multiplexing strategy, each channel has different characteristics. We therefore apply a multi-branch network to extract the unique features from each channel independently, rather than processing the three channels as a single entity. Then, IGC [21] is applied to the concatenation of the three branches. IGC creates an interleaved group convolution block: channels contained in the same partition of the secondary group convolution come from different partitions used in the primary group convolution. It addresses the redundancy problem of convolutional filters in the channel domain and enhances robustness by shuffling the features extracted from the r, g, b channels. As a result, our network avoids being affected by the order of the lights, which has not been considered by other methods. The Unet-based network is applied after IGC, where the feature maps are down-sampled three times by convolutions with stride 2 and are then up-sampled three times by deconvolutions. This increases the size of the receptive field and preserves the spatial information with a small memory footprint, allowing a deeper network.

To train the structural cue network, we define a loss function that consists of a gradient loss and a content loss. We write the total loss L_{scn} as the difference to be minimized between the coarse surface normal predicted by the structural cue network, n_{coarse}, and the ground-truth normal data n:

L_{scn}(n_{coarse}, n) = \lambda_{grad} L_{grad}(n_{coarse}, n) + \lambda_{cont} L_{cont}(n_{coarse}, n),    (2)

where λ_{grad} and λ_{cont} are the weights for the gradient loss L_{grad}(n_{coarse}, n) and the content loss L_{cont}(n_{coarse}, n), respectively. In this paper, we set λ_{grad} = 0.1 and λ_{cont} = 0.9 by experiments.

For the gradient loss, we utilize the combined two-directional gradient \nabla = \sqrt{\nabla_x^2 + \nabla_y^2} used in depth estimation [23], where \nabla_x and \nabla_y represent the gradients in the horizontal and vertical directions, respectively. They are used to penalize the boundaries of the coarse surface normal. We use the gradient loss L_{grad}(n_{coarse}, n) to fulfil this constraint:

L_{grad}(n_{coarse}, n) = ||\nabla(n_{coarse}) - \nabla(n)||_2^2.    (3)

Furthermore, we introduce a content loss in order to further constrain the structural cue network. The content loss is implemented as the Euclidean distance between n_{coarse} and n:

L_{cont}(n_{coarse}, n) = ||n_{coarse} - n||_2^2.    (4)
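The following is a minimal Keras sketch in the spirit of the architecture and loss described in this subsection (Figure 2 and Eqs. (2)-(4)). The exact filter widths, the use of a single grouped convolution as a stand-in for the IGC block, and all function names are our assumptions for illustration; a recent TensorFlow 2.x is assumed.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv(x, filters, stride=1):
    # 3x3 convolution + ReLU, the basic unit used throughout Figure 2.
    return layers.Conv2D(filters, 3, strides=stride, padding="same",
                         activation="relu")(x)

def build_structural_cue_network(patch_size=40):
    inp = layers.Input((patch_size, patch_size, 3))              # r, g, b patch

    # Multi-branch: each spectral channel gets its own convolution first.
    branches = [conv(inp[..., i:i + 1], 32) for i in range(3)]
    x = layers.Concatenate()(branches)                           # 96 feature maps

    # Stand-in for the Interleaved Group Convolution (IGC) block:
    # a grouped convolution that mixes features across channel partitions.
    x = layers.Conv2D(96, 3, padding="same", groups=3, activation="relu")(x)

    # Unet-style encoder: three stride-2 convolutions (40 -> 20 -> 10 -> 5).
    e1 = conv(x, 64)
    e2 = conv(e1, 128, stride=2)
    e3 = conv(e2, 256, stride=2)
    e4 = conv(e3, 512, stride=2)

    # Decoder: three deconvolutions with skip connections back to the encoder.
    d3 = layers.Conv2DTranspose(256, 3, strides=2, padding="same",
                                activation="relu")(e4)
    d3 = layers.Concatenate()([d3, e3])
    d2 = layers.Conv2DTranspose(128, 3, strides=2, padding="same",
                                activation="relu")(d3)
    d2 = layers.Concatenate()([d2, e2])
    d1 = layers.Conv2DTranspose(64, 3, strides=2, padding="same",
                                activation="relu")(d2)
    d1 = layers.Concatenate()([d1, e1])

    # 1x1 convolution down to a 3-channel coarse normal map.
    coarse = layers.Conv2D(3, 1, padding="same")(d1)
    return Model(inp, coarse, name="structural_cue_network")

def structural_cue_loss(n_coarse, n_gt, w_grad=0.1, w_cont=0.9):
    """Eq. (2): weighted sum of the gradient loss (3) and the content loss (4)."""
    dy_c, dx_c = tf.image.image_gradients(n_coarse)
    dy_g, dx_g = tf.image.image_gradients(n_gt)
    grad_c = tf.sqrt(tf.square(dx_c) + tf.square(dy_c) + 1e-8)   # combined gradient
    grad_g = tf.sqrt(tf.square(dx_g) + tf.square(dy_g) + 1e-8)
    l_grad = tf.reduce_mean(tf.square(grad_c - grad_g))          # Eq. (3)
    l_cont = tf.reduce_mean(tf.square(n_coarse - n_gt))          # Eq. (4)
    return w_grad * l_grad + w_cont * l_cont
```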

3.2. The Photometric Cue Network

In a multispectral photometric stereo system [15], the measurement of a single point can be represented as:

c_i = \mathbf{l}^{T} \mathbf{n} \int E(\lambda) R(\lambda) S_i(\lambda) \, d\lambda,    (5)

where c_i is the intensity of the pixel in channel i (i ∈ {r, g, b}), E(λ) represents the energy distribution of the incident illumination at wavelength λ, R(λ) represents the spectral reflectance function of the object's surface, S_i(λ) is the camera sensor response for channel i, and l and n represent the incident illumination direction and the pixel's surface normal, respectively. Eq. 5 shows the tangling of illumination, surface reflectance and camera response, which leads to an under-determined system.
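To make the coupling in Eq. (5) concrete, the sketch below evaluates a discretized version of the equation for one pixel; the sampled spectra E, R and S_r are made-up illustrative curves, not measured data.

```python
import numpy as np

# Discretized Eq. (5) for a single Lambertian pixel:
# c_i = (l . n) * sum_lambda E(lambda) * R(lambda) * S_i(lambda) * d_lambda
wavelengths = np.linspace(400, 700, 31)           # nm, 10 nm steps
d_lambda = wavelengths[1] - wavelengths[0]

E = np.exp(-((wavelengths - 620) / 40.0) ** 2)    # toy red-ish light spectrum
R = np.full_like(wavelengths, 0.6)                # toy flat surface reflectance
S_r = np.exp(-((wavelengths - 600) / 50.0) ** 2)  # toy red-channel sensitivity

l = np.array([0.0, 0.5, np.sqrt(0.75)])           # unit light direction
n = np.array([0.0, 0.0, 1.0])                     # surface normal

c_r = max(l @ n, 0.0) * np.sum(E * R * S_r) * d_lambda
print(c_r)
# The spectral integral and (l . n) collapse into a single scalar, so the
# illumination, reflectance and camera response cannot be separated from one
# observation alone -- the under-determined "tangle" described above.
```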

Differing from photometric stereo, which uses illumination with the same spectrum across a set of images, multispectral photometric stereo recovers the surface normals of a moving/non-rigid object from one image captured under three lights with different spectra (red, green, blue) simultaneously. Due to this tangling and the single input image, we design a more powerful per-pixel fully-connected network to refine the results, combined with the coarse prediction from the structural cue network. The photometric cue network extracts the photometric information using a fully-connected network, which learns a mapping from the measurement C ∈ R^3 and the coarse normal n_{coa} ∈ R^3 to the fine normal n ∈ R^3.

The architecture of the photometric cue network is shown in Figure 3. The network takes a 40 × 40 × 6 patch as the input, which concatenates the observation patch C with n_{coa}. In the photometric cue network, we divide a mini-batch tensor into multiple parts, each containing one image patch. We then reshape each image patch to a 1600 × 6 tensor and feed it into the fully-connected (FC) layers. The photometric cue network applies a multi-branch FC architecture, which significantly reduces the number of network parameters. The ReLU activation function is used, and an L2-normalization layer is appended to the end of the network to ensure a normalized output.

Figure 3: The architecture of the photometric cue network. BS means the mini-batch size. FC means fully-connected layers.

The photometric cue network is trained with a combined loss L_{pcn}(n_{est}, n), which consists of a structural similarity index (SSIM) loss L_{SSIM}(n_{est}, n) and a mean angular error (MAE) loss L_{MAE}(n_{est}, n):

L_{pcn}(n_{est}, n) = \lambda_{SSIM} L_{SSIM}(n_{est}, n) + \lambda_{MAE} L_{MAE}(n_{est}, n),    (6)

where λ_{SSIM} and λ_{MAE} are the weights for the SSIM loss L_{SSIM}(n_{est}, n) and the MAE loss L_{MAE}(n_{est}, n), respectively. We set λ_{SSIM} = 0.84 and λ_{MAE} = 0.16 following the research in [24].

For the SSIM loss L_{SSIM}(n_{est}, n), the SSIM index [25] is used to measure the structural similarity. This loss is given by:

L_{SSIM}(n_{est}, n) = 1 - SSIM(n_{est}, n).    (7)

In Eq. (7), SSIM is defined as:

SSIM(n_{est}, n) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},    (8)

where μ_x and μ_y are the means of n_{est} and n respectively, σ_x^2 and σ_y^2 are their variances, σ_{xy} is their covariance, and C_1 and C_2 are constants used to keep the computation stable. The SSIM value ranges from 0 to 1; the larger the value, the more similar the images are.

SSIM may cause changes of brightness and shifts of pixel colors due to its insensitivity to a uniform bias [26]. Therefore, we also apply an MAE loss as a constraint to improve the predicted results. The MAE loss is defined as the angular deviation between the estimated normal n_{est} and the ground-truth normal n:

L_{MAE}(n_{est}, n) = \arccos\left( \frac{\langle n_{est}, n \rangle}{\|n_{est}\| \, \|n\|} \right),    (9)

where ⟨·,·⟩ denotes the dot product.
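A minimal sketch of a per-pixel refinement network and the combined loss of Eqs. (6)-(9) is given below. The layer widths follow Figure 3, but the function names, the shift-and-max_val handling for SSIM and other details are illustrative assumptions (TensorFlow 2.x syntax), not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_photometric_cue_network(patch_size=40):
    # Input: 40 x 40 x 6 concatenation of the observation and the coarse normals.
    inp = layers.Input((patch_size, patch_size, 6))
    x = layers.Reshape((patch_size * patch_size, 6))(inp)    # 1600 x 6 per patch
    for units in (1024, 1024, 1024, 512, 512):                # FC layers of Figure 3
        x = layers.Dense(units, activation="relu")(x)
    x = layers.Dense(3)(x)                                    # per-pixel normal
    x = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(x)  # unit length
    out = layers.Reshape((patch_size, patch_size, 3))(x)
    return Model(inp, out, name="photometric_cue_network")

def mae_loss(n_est, n_gt):
    """Eq. (9): mean angular deviation (radians) between estimated and true normals."""
    cos = tf.reduce_sum(n_est * n_gt, axis=-1) / (
        tf.norm(n_est, axis=-1) * tf.norm(n_gt, axis=-1) + 1e-8)
    return tf.reduce_mean(tf.acos(tf.clip_by_value(cos, -1.0 + 1e-7, 1.0 - 1e-7)))

def photometric_cue_loss(n_est, n_gt, w_ssim=0.84, w_mae=0.16):
    """Eq. (6): combination of the SSIM loss (7) and the MAE loss (9)."""
    # Normal components lie in [-1, 1]; shift to [0, 2] and use max_val=2
    # (an assumption about how SSIM is applied to normal maps).
    ssim = tf.reduce_mean(tf.image.ssim(n_est + 1.0, n_gt + 1.0, max_val=2.0))
    return w_ssim * (1.0 - ssim) + w_mae * mae_loss(n_est, n_gt)
```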

4. Datasets


To train our network, datasets that include multispectral images and the ground-truth normal maps of objects are needed. However, the ground-truth normals of real objects and ideal light sources are difficult to measure. Therefore, we use synthetic images obtained from the Blobby Shape Dataset [27], which contains ten synthetic objects. Moreover, we employ the Stanford 3D Scanning dataset [28] and ten 3D models downloaded from the Internet (https://sketchfab.com/, https://free3d.com/), which we name the "Web 3D dataset".

Following the work introduced in [14], we employ the MERL dataset [29], which contains 100 different bidirectional reflectance distribution functions (BRDFs). We render the Blobby Shape Dataset with the MERL dataset under the pre-defined light sources by following the method used in [14]. The three pre-defined lights have the same slant angle (30°) and are evenly separated by a tilt angle of 120°. To simulate real-world conditions, each object was rendered using at least two materials. The MERL dataset is a dictionary that records the reflectance at different incident and exit angles. We combine the RGB channels of the images captured under three white lights according to the proportions (92%, 4%, 4%) to derive a pseudo multispectral image, as performed in [18]. The training set comprises nine models contained in the Blobby Shape dataset. The remaining model in this dataset is used as the validation set. The Stanford dataset and the Web 3D dataset are utilized for testing.
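The pseudo multispectral images can be produced by simple per-channel mixing. The sketch below shows one plausible reading of the (92%, 4%, 4%) proportion, assuming img_r_light, img_g_light and img_b_light are renderings of the same object under the three white lights; the variable names and the exact mixing pattern are assumptions rather than the exact procedure of [18].

```python
import numpy as np

def pseudo_multispectral(img_r_light, img_g_light, img_b_light, w=(0.92, 0.04, 0.04)):
    """Mix the RGB renderings of three white-light images into one pseudo
    multispectral image: each output channel is dominated (92%) by the image
    of "its" light and receives small (4%) contributions from the other two."""
    r = w[0] * img_r_light[..., 0] + w[1] * img_g_light[..., 0] + w[2] * img_b_light[..., 0]
    g = w[1] * img_r_light[..., 1] + w[0] * img_g_light[..., 1] + w[2] * img_b_light[..., 1]
    b = w[1] * img_r_light[..., 2] + w[2] * img_g_light[..., 2] + w[0] * img_b_light[..., 2]
    return np.stack([r, g, b], axis=-1)
```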

We also acquire a real photographed dataset for testing. This dataset contains ten objects and a color chart captured under the illumination of red, green and blue lights simultaneously. The lights have the same slant angle (30°) and are evenly separated by a tilt angle of 120°, as in the synthetic datasets.

5. Experiments

We first perform a network analysis of our method on the validation set and then compare our method with state-of-the-art approaches on both the synthetic and real datasets. We report quantitative results on the synthetic dataset and qualitative results on the real photographed dataset. Finally, we further analyze the robustness of our method to lighting.

5.1. Implementation Details

Our method is implemented using TensorFlow 1.4.0. The training set includes 1.3 × 10^5 patches of size 40 × 40 pixels. We trained our model on two NVIDIA GTX 1080Ti GPUs using a mini-batch size of 24. The initial learning rate was set to 0.001, with the default Adam parameters (β_1 = 0.9 and β_2 = 0.999). All network analyses were measured on the validation set, which contains 1.44 × 10^4 patches.
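For reference, the optimizer settings above correspond to a configuration like the following (modern Keras syntax shown only for illustration; the original implementation used TensorFlow 1.4).

```python
import tensorflow as tf

# Adam with the values reported above: lr = 0.001, beta1 = 0.9, beta2 = 0.999.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999)

BATCH_SIZE = 24   # mini-batch size used for training
PATCH_SIZE = 40   # each training sample is a 40 x 40 patch

# Hypothetical wiring of the two stages (model and loss names from the sketches above):
# structural_model.compile(optimizer=optimizer, loss=structural_cue_loss)
# photometric_model.compile(optimizer=optimizer, loss=photometric_cue_loss)
```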

5.2. Network analysis

5.2.1. The analysis of structural and photometric cues

We first compare our network with its different components. For the comparison "only using the structural cue network", we add the L2-normalization to the end and remove the photometric cue network. For the comparison "only using the photometric cue network", we remove the coarse normal input of the photometric cue network. All networks are trained with the same training set and parameters. The results on the validation set are shown in Table 1. In this experiment, we evaluate all results with three metrics: mean angular error, max angular error and the structural similarity index (SSIM).

Table 1: Quantitative evaluation using the validation set. We validate the proposed method with different components.

Methods                       | Mean angular error (°, lower better) | Max angular error (°, lower better) | SSIM index (higher better)
Only photometric cue network  | 8.09                                 | 32.58                               | 0.9091
Only structural cue network   | 8.42                                 | 34.70                               | 0.8742
Dual-cue fused network        | 6.81                                 | 25.31                               | 0.9504

It can be seen that performance improves when both cues are considered: the structural cue network takes advantage of the information embedded in the patch, such as specular highlights, and the photometric cue network refines the results. When only the structural cue network is used, the errors become larger. This is partly caused by the increased receptive field in the deeper convolutional layers, which involves irrelevant pixels far away from the convolution center in the image lattice. When only the photometric cue network is used, the mean and max angular errors are the largest. The reason is that the global constraint cannot be taken into account, resulting in failures on strongly non-Lambertian surfaces (e.g., where the pixel value is saturated due to specular reflection).

5.2.2. The analysis of different backbone models

We then compared the effects of different CNN modules in the structural cue network, including Unet-based [22], ResNet-based [30] and DenseNet-based [31] modules. We replaced the module after IGC in the structural cue network. For the ResNet-based module, we designed a part with four residual blocks (each block contains 3 convolutional layers) that down-samples the spatial size from 40 × 40 to 5 × 5 using average pooling, followed by three fully-connected layers to produce the 40 × 40 × 3 coarse surface normal estimate. For the DenseNet-based module, we designed a similar structure but used dense connections instead of residual blocks. All networks were trained with the same training set and parameters.

Table 2: The results of using different modules. MAE is short for mean angular error.

Models  | ResNet-based | DenseNet-based | Unet-based (used)
MAE (°) | 7.24         | 6.96           | 6.81

Table 2 illustrates that the Unet-based module achieved the best MAE. We believe that Unet is more suitable for normal estimation because it performs pixel-level image segmentation, while ResNet and DenseNet conduct image-level classification; the regression of a normal map is closer to a pixel-wise process. In addition, the multiple fully-connected layers in those modules discard the embedded local context information.

5.2.3. The analysis of different loss functions

To compare the loss functions, we use the same network architecture but change the loss functions. L_{grad}, L_{cont}, L_{SSIM} and L_{MAE} are listed in Table 3 for comparison. We evaluate two metrics: MAE and SSIM.

Table 3: Evaluation of loss functions in the dual-cue network.

Structural cue network | Photometric cue network | MAE (°) | SSIM
L_{cont}               | L_{MAE}                 | 7.43    | 0.8839
L_{grad} + L_{cont}    | L_{MAE}                 | 7.14    | 0.9080
L_{cont}               | L_{SSIM} + L_{MAE}      | 6.96    | 0.9402
L_{grad} + L_{cont}    | L_{SSIM} + L_{MAE}      | 6.81    | 0.9504

As shown in the table, the combined loss functions outperform the others. We conclude that both the global loss and the pixel-by-pixel loss are beneficial for surface normal estimation. Firstly, the gradient loss extracts gradient information from the coarse surface normal and the ground truth, guiding the whole surface normal recovery process. It penalizes discontinuous boundaries in the estimated surface normal caused by varying surface materials. This global constraint cannot be obtained from the Euclidean distance loss L_{cont}. Secondly, the SSIM loss penalizes the surface normal in terms of contrast and structure. It has been widely shown that adding an SSIM term improves results [24]. In this paper, the SSIM loss is a supplement to the angular error constraint, especially at edges and in regions with complex structure.

5.3. Comparisons with other methods

We also compare our method with state-of-the-art methods, including Demultiplexer [18], Semi-learning [19], DPSN [14], PS-FCN [11] and a baseline. We keep their original training settings and only adjust the form of the dataset to suit the compared methods. Demultiplexer maps the image illuminated by red, green and blue lights into three images illuminated by white lights; we therefore re-render the training set under white lights at the same positions. We then reconstruct the surface normals n to depth with the method of [32] to train the Semi-learning method, which establishes an initial depth estimation network for multispectral photometric stereo. Note that PS-FCN and DPSN are methods for conventional photometric stereo, whose input dimensions are larger than our single three-channel image. PS-FCN allows an arbitrary number of input images, so we use the three channels of the multispectral image as three input images during the training and test stages. For DPSN, we copy each channel of the image 32 times in the training set, making 96 input channels in total to comply with its settings. In this paper, Baseline denotes the method that calculates the surface normals by photometric stereo [2] using the three channels of an image directly. We conduct the comparison on both synthetic and real photographed images.
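For clarity, the Baseline applies classic Lambertian photometric stereo [2] directly to the three color channels. A minimal sketch of that computation is given below, assuming the 3 × 3 light-direction matrix L is known from calibration; the function name is illustrative.

```python
import numpy as np

def baseline_normals(image, L):
    """Classic photometric stereo applied directly to the three channels.

    image: [H, W, 3] multispectral observation, one channel per colored light.
    L:     [3, 3] matrix whose rows are the (unit) light directions.
    Solves L @ (rho * n) = c for every pixel and normalizes the result.
    """
    h, w, _ = image.shape
    c = image.reshape(-1, 3).T                       # 3 x (H*W) stacked measurements
    g = np.linalg.solve(L, c)                        # scaled normals rho * n
    norm = np.linalg.norm(g, axis=0, keepdims=True) + 1e-8
    return (g / norm).T.reshape(h, w, 3)             # unit surface normals
```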

5.3.1. Quantitative analysis on synthetic images

We first compare the reconstruction performance on the Stanford 3D Scanning dataset and the Web 3D dataset; the results are shown in Figure 4.

Figure 4: Quantitative results on the synthetic objects. The numbers represent the Mean Angular Error (MAE) in degrees. The object names are displayed in the first column.

Objects | Ours  | Demultiplexer | Semi-learning | DPSN  | PS-FCN | Baseline
Dragon  | 6.94  | 7.51          | 19.88         | 7.92  | 8.13   | 9.73
Rabbit  | 8.49  | 10.24         | 22.80         | 10.03 | 9.75   | 16.01
Lion    | 8.44  | 9.81          | 24.97         | 9.07  | 9.02   | 13.73
Sitting | 8.87  | 9.20          | 19.31         | 10.86 | 9.93   | 20.16
Monkey  | 8.10  | 9.74          | 21.42         | 10.04 | 10.78  | 11.65
Buddha  | 7.05  | 7.29          | 19.02         | 7.91  | 7.18   | 10.07
Goddess | 10.71 | 14.36         | 20.42         | 11.72 | 11.16  | 14.01
Man     | 10.55 | 10.92         | 27.22         | 10.93 | 12.36  | 16.84

As shown in Figure 4, the comparison clearly demonstrates that our method outperforms the others, particularly on objects with complex surface structures. For the shown objects, our method achieved an average MAE of 8.65° on the synthetic test set, better than the 9.89° of Demultiplexer, 21.88° of Semi-learning, 9.81° of DPSN, 9.78° of PS-FCN and 14.03° of Baseline. PS-FCN [11] produces fuzzier surfaces, losing high-frequency details of the surface normals. The performance of Semi-learning [19] is unsatisfactory on all objects. For multicolored objects such as "Rabbit" and "Lion", the Demultiplexer [18] and DPSN [14] methods show discontinuous boundaries on the estimated normal maps, which affects the accuracy. This is because these two methods rely on networks based solely on fully-connected layers, which lack neighborhood information and constraints. It can also be seen that most methods achieve satisfactory results on nearly Lambertian surfaces such as "Buddha", whereas previous methods fail on strongly non-Lambertian surfaces such as "Monkey" and "Goddess". This demonstrates the fitting ability of our dual-cue network, which benefits from the local context information encoded by the structural cue network and thereby reduces estimation failures caused by specular highlights and shadows.

5.3.2. Qualitative analysis on real objects

In addition to using synthetic data to evaluate our method, we also use real objects for experiments. We first take images of real objects illuminated by the RGB lights. Figure 5 shows the comparative results.

Figure 5: Qualitative results on the multicolored fabrics.

As shown in Figure 5, the boundaries between the multicolored fabrics and objects can be clearly seen in the input images. The Baseline method, which uses the three channels of the multispectral image directly, shows discontinuities on the normal maps. The error is due to the constant-albedo assumption of the Baseline method, which ignores the tangling of illumination, surface reflectance and camera response. The experiment shows that the results of DPSN and Demultiplexer are better than the Baseline, but there are still obvious discontinuous boundaries on the normal maps. The results of the Semi-learning method are noisy. For our dual-cue fused network and PS-FCN, the normal maps show almost no discontinuous boundaries; however, the results of PS-FCN are fuzzier. In addition, due to the unavoidable spectral errors introduced by real colored lights, the predictions on real objects are worse than those on synthetic images. Nevertheless, our network still achieves the best results compared with the other methods trained on the same synthetic training dataset.

We also evaluate our method with a more convincing test object. Figure 6 shows the qualitative results on a color chart in different orientations. Since the color chart is a planar plate, the normals on it should all be identical. Compared with the others, our method predicts more accurate normal maps, which are almost unaffected by the changing colors and reflect the correct surface normals. The results indicate that our network outperforms the other methods in the face of the reflectance spectra of various albedos.

Figure 6: Qualitative results on a color chart in different orientations.

5.4. Evaluation of robustness

Previous photometric stereo methods [2, 33] are affected by complex illumination conditions such as additional light sources and varied lighting directions. We therefore also analyze the impact of complex illumination environments on our method.

5.4.1. Additional light

We use a High Dynamic Range (HDR) environment map (see Figure 7(a)) with an additional light source to introduce illumination noise into the ideal darkroom condition. To evaluate the robustness of the methods, we only apply the additional light to the validation set and keep the darkroom environment during training. The results are shown in Figure 8. We compare the MAE of the normals produced by different methods under the additional-light condition and the darkroom condition, as well as the fluctuation ratio α:

\alpha = \frac{|\chi - \chi'|}{\chi},    (10)

where χ represents the MAE under the standard condition and χ' represents the MAE under the complex illumination condition (additional light or changed RGB lighting directions). The fluctuation ratio measures the robustness of a method: the smaller the value, the stronger the robustness of the evaluated method.
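As a small worked example of Eq. (10):

```python
def fluctuation_ratio(mae_standard, mae_perturbed):
    # Eq. (10): relative change of MAE when the illumination is perturbed.
    return abs(mae_standard - mae_perturbed) / mae_standard

# e.g. a method whose MAE rises from 7.0 to 7.7 degrees has ratio 0.1 (10%).
print(fluctuation_ratio(7.0, 7.7))
```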

Figure 7: (a) The High Dynamic Range environment map used for simulating the additional illumination. (b) Schematic diagram of the changed slant angle. When the light moves from position A to position A', the slant angle changes from a to a'.

It can be seen in Figure 8 that our dual-cue fused network is more robust under the additional-light condition, with a smaller fluctuation ratio. The structural cue network considers local context and neighborhood information, which is less affected by lighting conditions; therefore, our dual-cue network is more robust than methods that consider only one cue. Note that Demultiplexer and DPSN perform worst in robustness. This might be explained by the fact that a purely fully-connected network lacks the structural constraint embedded in local pixels. As a result, a changed measurement at a single pixel due to complex illumination directly influences the estimation.

Figure 8: Evaluation for the additional-light condition. The left charts show the MAE under the darkroom and additional-light conditions; the right charts show the corresponding fluctuation ratios.

5.4.2. Changed RGB lighting directions

In this experiment, we keep the slant angle of the RGB lights in the training set at 30°, while we adjust it in the validation set to 20°, 25°, 30°, 35° and 40°, respectively (see Figure 7(b)). We show the MAE and fluctuation ratios under the different slant angles in Figure 9.

Figure 9: Evaluation for changed RGB lighting directions. The left charts show the MAE under different slant angles; the right charts show the corresponding fluctuation ratios.

From Figure 9, it is obvious that our method better resists variations in the slant lighting direction. The robustness when using the ResNet-based or DenseNet-based module is almost as good as with the Unet-based module (proposed). Moreover, it can be seen that the fluctuation ratios of all deep learning methods are smaller than that of the Baseline, which uses the three channels of a multispectral image directly. This demonstrates the generalization ability of data-driven deep learning methods.

6. Conclusion

In this paper, we proposed a dual-cue fused network to estimate surface normals from a single multispectral photometric stereo image. Unlike previous algorithms, which require extra depth data or pre-calibration, our method estimates surface normals without any prior information. We first apply a CNN-based structural cue network to approximate the coarse surface normals on small patches. Then, we apply a pixel-level fully-connected photometric cue network to refine the coarse surface normals. The structural cue network exploits neighborhood-embedded features and local context information for the coarse normal estimation, while the photometric cue network further learns finer surface details from a per-pixel perspective. This dual-cue network guarantees the accuracy of surface normal recovery and robustness to noisy illumination environments. Compared with traditional algorithms, our method is able to handle objects with non-Lambertian surfaces. Experiments on both synthetic images and real photographed images showed that our dual-cue fused network outperforms the existing methods.

Despite offering state-of-the-art performance, our method suffers from a limitation: the lighting directions at test time must be almost the same as those used in training. Although our method is robust to changed lighting directions, the tolerated deviation of the slant angle is at most 10°. Our method cannot adapt to completely arbitrary lighting directions. In fact, this issue remains a difficult problem for learning-based surface normal recovery methods. In the future, we aim to develop more stable methods for arbitrary lighting.

7. Acknowledgement

This work was supported by the National Key Scientific Instrument Development Project (No. 41927805), the International Science & Technology Cooperation Program of China (ISTCP) (No. 2014DFA10410) and the National Natural Science Foundation of China (NSFC) (No. 61501417).

References

[1] M. Jian, Y. Yin, J. Dong, W. Zhang, Comprehensive assessment of non-uniform illumination for 3D heightmap reconstruction in outdoor environments, Computers in Industry 99 (2018) 110–118.
[2] R. J. Woodham, Photometric method for determining surface orientation from multiple images, Optical Engineering 19 (1980) 139–144.
[3] K. Ozawa, I. Sato, M. Yamaguchi, Hyperspectral photometric stereo for a single capture, Journal of the Optical Society of America A 34 (2017) 384–394.
[4] T. Kawabata, F. Sakaue, J. Sato, One shot photometric stereo from reflectance classification, 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (2016) 620–627.
[5] H. Jiao, Y. Luo, N. Wang, L. Qi, J. Dong, H. Lei, Underwater multi-spectral photometric stereo reconstruction from a single RGBD image, Signal and Information Processing Association Summit and Conference (2017) 1–4.
[6] C. Hernandez, G. Vogiatzis, G. J. Brostow, B. Stenger, R. Cipolla, Non-rigid photometric stereo with colored lights, International Conference on Computer Vision (2007).
[7] Z. Jankó, A. Delaunoy, E. Prados, Colour dynamic photometric stereo for textured surfaces, Asian Conference on Computer Vision (2010) 55–66.
[8] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, Advances in Neural Information Processing Systems (2014) 2366–2374.
[9] T. Taniai, T. Maehara, Neural inverse rendering for general reflectance photometric stereo, International Conference on Machine Learning (2018) 4864–4873.
[10] Y. Ju, L. Qi, J. He, X. Dong, F. Gao, J. Dong, MPS-Net: Learning to recover surface normal for multispectral photometric stereo, Neurocomputing 375 (2020) 62–70.
[11] G. Chen, K. Han, K.-Y. K. Wong, PS-FCN: A flexible learning framework for photometric stereo, European Conference on Computer Vision (2018) 3–19.
[12] S. K. Nayar, K. Ikeuchi, T. Kanade, Shape from interreflections, International Journal of Computer Vision 6 (1991) 173–195.
[13] J. A. Smith, T. L. Lin, K. L. Ranson, The Lambertian assumption and Landsat data, Photogrammetric Engineering & Remote Sensing 46 (1980) 1183–1189.
[14] H. Santo, M. Samejima, Y. Sugano, B. Shi, Y. Matsushita, Deep photometric stereo network, International Conference on Computer Vision Workshops (2017) 501–509.
[15] L. L. Kontsevich, A. P. Petrov, I. S. Vergelskaya, Reconstruction of shape from shading in color images, Journal of the Optical Society of America A 12 (1994) 1047–1052.
[16] H. Kim, B. Wilburn, M. Ben-Ezra, Photometric stereo for dynamic surface orientations, European Conference on Computer Vision (2010) 59–72.
[17] G. J. Brostow, C. Hernandez, G. Vogiatzis, B. Stenger, R. Cipolla, Video normals from colored lights, IEEE Transactions on Pattern Analysis & Machine Intelligence 33 (2011) 2104–2114.
[18] Y. Ju, L. Qi, H. Zhou, J. Dong, L. Lu, Demultiplexing colored images for multispectral photometric stereo via deep neural networks, IEEE Access 6 (2018) 30804–30818.
[19] L. Lu, L. Qi, Y. Luo, H. Jiao, J. Dong, Three-dimensional reconstruction from single image base on combination of CNN and multi-spectral photometric stereo, Sensors (2018).
[20] D. Antensteiner, S. Stolc, D. Soukup, Single image multi-spectral photometric stereo using a split U-shaped CNN, Computer Vision and Pattern Recognition Workshops (2019).
[21] T. Zhang, G. J. Qi, B. Xiao, J. Wang, Interleaved group convolutions, International Conference on Computer Vision (2017) 4383–4392.
[22] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention (2015) 234–241.
[23] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, T. Brox, DeMoN: Depth and motion network for learning monocular stereo, Computer Vision and Pattern Recognition (2017) 5038–5047.
[24] H. Zhao, O. Gallo, I. Frosio, J. Kautz, Loss functions for image restoration with neural networks, IEEE Transactions on Computational Imaging 3 (2016) 47–57.
[25] Z. Wang, A. Bovik, H. Sheikh, E. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (2004) 600–612.
[26] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, A. C. Kot, CRRN: Multi-scale guided concurrent reflection removal network, Computer Vision and Pattern Recognition (2018) 4777–4785.
[27] M. K. Johnson, E. H. Adelson, Shape estimation in natural illumination, Computer Vision and Pattern Recognition (2011) 2553–2560.
[28] B. Curless, M. Levoy, A volumetric method for building complex models from range images, Conference on Computer Graphics and Interactive Techniques (1996) 303–312.
[29] W. Matusik, H. Pfister, M. Brand, L. McMillan, A data-driven reflectance model, ACM Transactions on Graphics (2003) 759–769.
[30] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Computer Vision and Pattern Recognition (2016) 770–778.
[31] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, Computer Vision and Pattern Recognition (2017) 4700–4708.
[32] A. Agrawal, R. Raskar, R. Chellappa, What is the range of surface reconstructions from a gradient field?, European Conference on Computer Vision (2006) 578–591.
[33] X. Jiang, H. Bunke, On error analysis for surface normals determined by photometric stereo, Signal Processing 23 (1991) 221–226.

Biography

Yakun Ju received the B.Sc. degree of engineering in industrial design from Sichuan University, Chengdu, China, in 2016. He is currently pursuing the Ph.D. degree in computer application technology with the Department of Computer Science and Technology, Ocean University of China, Qingdao, China. His research interests include 3D reconstruction, machine learning and image processing.

Xinghui Dong received the Ph.D. degree from Heriot-Watt University, U.K., in 2014. He is currently a Research Associate with the Centre for Imaging Sciences, The University of Manchester, U.K. His research interests include automatic defect detection, image representation, texture analysis, and visual perception.

Yingyu Wang received the B.Sc. degree in computer science and technology from the Chengdu University of Technology, Chengdu, China, in 2017, and is currently pursuing the Master's degree at Ocean University of China, Qingdao, China. His research interests include computer vision, robotics and deep learning.

Lin Qi received his B.Sc. and M.Sc. degrees from Ocean University of China in 2005 and 2008, respectively, and received his Ph.D. in computer science from Heriot-Watt University in 2012. He is now an associate professor in the Department of Computer Science and Technology at Ocean University of China. His research interests include computer vision and visual perception.

Junyu Dong received the B.Sc. and M.Sc. degrees from the Department of Applied Mathematics, Ocean University of China, Qingdao, China, in 1993 and 1999, respectively, and the Ph.D. degree in image processing from the Department of Computer Science, Heriot-Watt University, U.K., in 2003. He joined Ocean University of China in 2004, and he is currently a Professor and the Head of the Department of Computer Science and Technology. His research interests include machine learning, big data, computer vision, and underwater vision.