
Highlights

• We propose an end-to-end method for multispectral photometric stereo that requires no extra information.
• For the first time, our MPS-Net takes the initial surface normal into account, which yields state-of-the-art estimation.
• We design a localized convolutional neural network that establishes a flexible mapping while considering adjacent structural features.


MPS-Net: Learning to Recover Surface Normal for Multispectral Photometric Stereo

Yakun Ju, Lin Qi, Jichao He, Xinghui Dong, Feng Gao, Junyu Dong

Department of Computer Science and Technology, Ocean University of China, Qingdao, China; Centre for Imaging Sciences, The University of Manchester, Manchester, UK

Abstract

Multispectral Photometric Stereo (MPS) estimates per-pixel surface normals from a single image captured under three colored (red, green and blue) light sources. Unlike traditional Photometric Stereo, MPS can therefore be used in dynamic scenes for single-frame reconstruction. However, MPS is challenging because of the tangle of the illumination, surface reflectance and camera response, which causes inaccurate estimation of the surface normal. Existing approaches rely on either extra depth information or material calibration strategies, limiting their usage in practical applications. In this paper, we propose a Multispectral Photometric Stereo Network (MPS-Net) to solve this under-determined system. The MPS-Net takes the single multispectral image and an initial surface normal estimation obtained from this image itself, and outputs an accurate surface normal map; no extra depth or material calibration information is required. We show that the MPS-Net is not constrained to Lambertian surfaces and can be applied to surfaces with complex reflectance. We evaluated the MPS-Net using both synthetic and real objects of various materials. Our experimental results show that the MPS-Net outperforms the state-of-the-art approaches.

Keywords: Surface normal estimation, Multispectral photometric stereo, Neural network

Corresponding author: Junyu Dong ([email protected])

1. Introduction

Multispectral photometric stereo (MPS) can estimate surface normal from a single image of an object illuminated simultaneously by three colored (red, green and blue) light sources. Therefore, it allows single-frame reconstruction in dynamic scenes. This idea was first demonstrated in [1, 2, 3] and has been shown to be able to efficiently produce surface normal estimation in dynamic scenes [4, 5, 6].

However, the major weakness of the existing MPS methods is the assumption of Lambertian reflectance and constant chromaticity of the target object. For objects with varying chromaticities, existing methods appeal to extra depth information [7], regularization of the normal field [8] or time multiplexing [9].

In this paper, we propose an innovative end-to-end solution, the multispectral photometric stereo network (MPS-Net), which uses deep neural networks to predict surface normal from the multispectral image and an initial normal map (as shown in Fig.1). We use a localized convolutional neural network (CNN) to establish a flexible mapping from the input data to pixel-wise dense surface normals. MPS-Net uses only the information of the image itself, without extra depth or material calibration. The input includes two components: the observed image and the initial normal map estimated from the observed image using three-channel separated photometric stereo. We believe that the initial surface normal provides better prior information to the network and is corrected through MPS-Net under the constraint of the original input image.

A variety of bidirectional reflectance distribution functions (BRDFs) from the MERL database [10] are used for training so that our network can deal with objects with complex reflectance rather than only Lambertian surfaces. We trained the network on the Blobby Shape dataset [11], and it works well on both synthetic and real datasets, including the Stanford 3D Scanning dataset [12], the Web 3D models and the DiLiGenT Benchmark [13].

Figure 1: Overview of the proposed method. Given a multispectral image with pre-defined light directions and the initial normal map as input, MPS-Net estimates an accurate normal map of the object. In MPS-Net, the multispectral image corrects the initial normal map to produce the accurate result (see Section 3.2).

We also tested the generalization ability of the MPS-Net with respect to illumination directions and found that it can still predict satisfactory results when the illumination directions differ from those used in the training stage. The proposed method is thus more practical than existing learning-based approaches, which have to keep the same light directions in the training and prediction stages [14].

2. Related work

Recently, estimating the surface normal of deforming objects has drawn increasing attention among researchers in the computer vision community. Orientation-sensing techniques based on photometric cues are good at processing high-frequency information, while range-sensing technologies, such as multiview stereo, are suitable for dealing with low-frequency information [15, 16]. In this section, we focus on reviewing photometric stereo methods.

Conventional photometric stereo methods [17] produce pixel-wise surface normals based on the Lambertian model. For deforming objects, multispectral photometric stereo was first introduced by Petrov et al. [2] and has been used in many applications [18, 19]. Some researchers [20] employed coarse depth information obtained using a Kinect or binocular stereo to iteratively search for an optimized solution. Hernandez et al. [21] utilized a planar calibration object with special markings that allow the plane orientation to be estimated. However, these methods require a lot of prior knowledge and need to incorporate more cameras.

Ozawa et al. [19] estimated the surface normal by exploiting the reflectance norm distribution, under the assumption that the surface is colored with a finite number of materials and that surface regions of the same reflectance are sufficiently curved. Fyffe et al. [22] simultaneously estimated the colors and surface normals of textured surfaces using a multispectral camera. Kawabata et al. [23] moved a step further by adding a reflectance basis set obtained from principal component analysis. Both of them added a smoothness constraint on the surfaces. Ozawa et al. [6] successfully estimated the reflectance spectra and surface normal of an arbitrarily colored surface from a single hyperspectral image, where both spectrally and spatially arranged illuminations work as light sources for measuring reflectance spectra and shading images. However, multispectral cameras are prone to suffer from spatial, spectral and temporal resolution issues. Moreover, these methods can hardly handle objects with non-Lambertian surfaces.

With the development of deep learning techniques in recent years, these techniques have also been applied to normal estimation. Yoon et al. [24] utilized Generative Adversarial Networks (GANs) to estimate surface normal from a single image. However, this work was restricted to infrared images and ignored color information. Recently, some researchers [14, 25, 26] proposed deep neural networks to regress per-pixel normals. These works can handle non-Lambertian surfaces and achieve dense results; however, they are hardly able to tackle dynamic objects. Lu et al. [27] estimated the initial depth of multispectral images and fed this data to a classical algorithm as prior information. However, this method was restricted to specific albedos and generated rough results. Ju et al. [28] used a two-step pipeline to estimate surface normal from a single colored image by demultiplexing the multiplexed multispectral image. Nevertheless, this method performs estimation based solely on the reflectance observations of a single pixel, and cannot fully take advantage of the information embedded in local surface points. Besides, the learning in [28] is unstable, and its errors are intractable for the second photometric stereo step. To solve these issues, we introduce a new method which robustly estimates the surface normal from a single multispectral image.

85

3. MPS-Net

86

In this section, we first introduce the theory background. Then, we de-

87

scribe the learning framework of the MSP-Net. Finally, we present the details

88

of the network architecture.

89

3.1. Preliminaries

90

We consider a Lambertian surface lit by three light sources with different

91

spectra. Following the work introduced in [2], the intensity of the pixel (x, y) 8

92

in an observed image c can be described as:

ci =

X k

lT kn

Z

Ek (λ)R(λ)Si (λ)dλ,

(1)

93

where ci (i ∈ r, g, b) represents the ith channel in c, n and R are the surface

94

normal and spectral reflectance of the surface respectively, lk and Ek (k = 1,

95

2, 3) are the kth light vector and its energy distribution respectively, and Si

96

is the camera-sensitivity function of the ith channel.

97

It can be observed that the tangle of the illumination, surface reflectance

98

and camera response in Eq.1 are cased by non-ideal camera and light sources

99

as well as under-constrained surface reflectance. Classical methods [29] re-

100

101

102

quire the extra information in order to calibrate the tangled part. If we ignore the aliasing between channels, we can simplify Eq.1 as the following three-channel separated photometric stereo:

c = ρ lT k ninit ,

(2)

103

where ρ is a fixed scalar that replaces the spectral reflectance. Then the

104

initial normal ninit in Eq.2 can be easily solved based on [17]. The initial

105

normal ninit under the simplified condition is erroneous as expected. We

106

show the error in Fig.2. 9

Figure 2: Examples of the error under the simplified condition. Each row shows the observed multispectral image, the inaccurate initial normal obtained using Eq.(2), the ground truth and the error map; the mean angular error (13.87° and 16.53° for the two examples) is shown at the bottom-left corner of each error map.
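For concreteness, Eq.(2) reduces to a per-pixel 3×3 linear system once the three light directions are stacked into a matrix. The following is a minimal NumPy sketch of this baseline solve, in the spirit of [17]; the function name and interface are ours, not from the paper.

```python
import numpy as np

def initial_normal(image, L):
    """Per-pixel least-squares solve of Eq.(2): c = rho * L @ n_init.

    image: (H, W, 3) observation, one intensity per color channel.
    L:     (3, 3) matrix whose k-th row is the light direction l_k.
    Returns an (H, W, 3) unit-normal map (the inaccurate initial estimate).
    """
    H, W, _ = image.shape
    c = image.reshape(-1, 3).T                  # (3, H*W) channel intensities
    rho_n = np.linalg.solve(L, c)               # rho * n_init for every pixel
    norm = np.linalg.norm(rho_n, axis=0, keepdims=True)
    n = rho_n / np.maximum(norm, 1e-8)          # normalizing removes rho
    return n.T.reshape(H, W, 3)
```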

In addition, non-Lambertian surfaces are common in the real world and challenging for multispectral photometric stereo. Hence, we consider a more flexible and robust learning solution for estimating normals from a multispectral image.

3.2. Learning framework

A novel learning framework (MPS-Net) is proposed to tackle the tangle and non-Lambertian problems. Given an inaccurate initial normal $\mathbf{n}_{init}$, we first combine it with the multispectral image $c$, and the mapping from the fused input to the surface normal is then learned by the MPS-Net (see Fig.1). The MPS-Net can be treated as a function that essentially corrects the inaccurate initial normal using the observed image and predicts the accurate surface normal. This function $f$ is approximated by the deep neural network MPS-Net as:

$$\mathbf{n}_{est} = f\left(\mathbf{n}_{init}, (c_r, c_g, c_b)\right), \qquad (3)$$

where $\mathbf{n}_{est}$ represents the estimated surface normal map.
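As a small data-layout illustration of Eq.(3), the two inputs can simply be stacked channel-wise before patch extraction; note that inside the network the actual fusion happens at the feature level via concatenation (Section 3.3), so this sketch and its function name are only our assumption about the pre-processing.

```python
import numpy as np

# Sketch: stack the multispectral image and the initial normal map of Eq.(2)
# channel-wise, giving one (H, W, 6) array from which patches are later cut.
def fuse_input(image, n_init):
    assert image.shape == n_init.shape          # both (H, W, 3)
    return np.concatenate([image, n_init], axis=-1)
```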

There are two reasons to fuse the initial normal with the multispectral image. First, from the model perspective, the error of the initial normal is highly related to the multispectral image: it is the colored or non-Lambertian surface in the observed image that causes the deviation of the initial normal, so the image and the initial normal always complement each other. We therefore use the multispectral image to correct the error in the initial normal through a deep neural network with strong fitting ability, rather than establishing an unstable mapping between the image and the normal map. Second, the addition of the initial normal makes the network converge faster owing to the intrinsic properties of neural networks: the initial normal acts as prior information. Compared to an input containing only the multispectral image, the fused counterpart carries information more similar to the accurate predicted normal, which makes the network effective at learning the difference between normals.

3.3. Network architecture

Unlike existing methods, we choose a local image patch instead of a single pixel, exploiting the advantage of the information embedded in the neighborhood surface. In MPS-Net, the introduction of neighborhood pixels means that the continuity of the image is also considered as a constraint. Generally, complex surfaces are composed of simple and small surface patches. However, the general shape of the object, which is characterized by a large patch or the whole image, is not informative for learning pixel-wise normals. The conventional patch size [25, 26] is too large: this reduces the diversity of the data and may introduce over-fitting. Therefore, we use a novel unequal input and output (IO) convolutional neural network to automatically learn features from local patches and guide surface normal estimation. The default size of the input local patches is 5×5 pixels, while the output is the estimated normal of the center pixel. In order to map the per-pixel normals back to the original image, the stride between neighboring patches at test time is 1, which ensures that the estimated normals (at the center positions of the patches) are closely arranged; a sketch of this dense patch extraction is given below. The architecture is presented in Fig.3.
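To make the test-time procedure concrete, the sketch below extracts one 5×5 patch per pixel with a stride of 1; the border handling (reflection padding here) is our assumption, as the paper does not state it.

```python
import numpy as np

def dense_patches(fused, k=5):
    """Slide a k x k window with stride 1 over the fused (H, W, C) input so
    that every pixel becomes the center of exactly one patch."""
    pad = k // 2
    padded = np.pad(fused, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    H, W, C = fused.shape
    patches = np.empty((H * W, k, k, C), dtype=fused.dtype)
    idx = 0
    for y in range(H):
        for x in range(W):
            patches[idx] = padded[y:y + k, x:x + k]
            idx += 1
    return patches   # one (k, k, C) patch per estimated center normal
```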

[Figure 3 layout (image residue): Feature Extraction — Conv1: 1×1×256 and Conv2: 1×1×256 fused by concatenation (with randomization after Conv2), Conv3: 3×3×512, Conv4: 3×3×512, Conv5: 3×3×1024, further 3×3 layers of 512–1024 channels (Conv6, Conv7), and a shortcut connection into Conv8: 1×1×1024; Normal Generator — Conv9: 3×3×1024 and Conv10: 3×3×3. 'SAME'-padded layers use Elu (most with BN), the final layer uses 'VALID' padding, dropout layers are inserted, and the output passes through L2 normalization; feature maps stay 5×5 until the Normal Generator.]

Figure 3: The network architecture of the proposed MPS-Net. BN represents the batch normalization operation and Image denotes the multispectral image. The red numbers give the sizes of the feature maps, and the bold numbers after "Conv:" give the kernel size and output dimensions of each convolutional layer. Dropout layers are introduced to simulate cast shadow.

We now describe the network architecture in detail. As shown in Fig.3, the fused input consists of the multispectral image patch and the initial normal patch. We call the first part the "Feature Extractor" and term the last two layers the "Normal Generator".

The first eight layers in the "Feature Extractor" are composed of 1×1 and 3×3 filters with "SAME" padding, which means that the feature maps of these layers always remain 5×5. The 1×1 convolutional layers are used to increase the dimensions, and their outputs are fused by a concatenation operation. The randomization after "Conv2" means that we randomize the feature maps of "Conv2"; this guarantees that the MPS-Net does not depend on the order of the feature maps, so the MPS-Net can learn from multispectral images under different light directions. We also use a shortcut connection to link "Conv7" and "Conv8", which represents the feature of the initial normal (its dimension is increased to 1024 to enable the shortcut connection). Thus, the first eight layers can be treated as a residual block [30] focused on feature extraction.

The last two convolutional layers then reduce the feature map size from 5×5 to 1×1. Note that the stride applied in all layers is 1 pixel, except for the first layer in the "Normal Generator", which utilizes a stride of 2 pixels; the size is therefore decreased to 3×3 at "Conv9". A convolutional layer with padding="VALID" is utilized afterwards, so the result of "Conv10" is a single pixel. An L2-normalization operation is appended to the end of the network to ensure a normalized output.

The Elu activation function [31] is used at each layer. Dropout layers [32] are introduced after "Conv3" and "Conv5"; we utilize these two dropout layers to simulate the inevitable cast shadow in an MPS system [14]. We also apply batch normalization [33] to each layer except the first two, where it might otherwise break low-level feature detectors.
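Reading this description together with Fig.3, a compact Keras sketch of the layout is given below. This is our reconstruction under stated assumptions, not released code: the exact channel counts per layer and the form of the shortcut (an additive residual here) follow our interpretation of the figure, the randomization step and the dropout layers after "Conv3" and "Conv5" are omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mps_net(k=5):
    img = layers.Input((k, k, 3))    # multispectral image patch
    n0 = layers.Input((k, k, 3))     # initial normal patch from Eq.(2)

    f_img = layers.Conv2D(256, 1, padding="same", activation="elu")(img)  # Conv1
    f_n0 = layers.Conv2D(256, 1, padding="same", activation="elu")(n0)    # Conv2
    x = layers.Concatenate()([f_img, f_n0])     # fusion by concatenation

    # Conv3-Conv7: 3x3 "SAME" layers keep the 5x5 spatial size.
    for width in (512, 512, 1024, 1024, 1024):
        x = layers.Conv2D(width, 3, padding="same", activation="elu")(x)
        x = layers.BatchNormalization()(x)

    # Shortcut: lift the initial-normal features to 1024 channels (Conv8)
    # and add them back, so the first eight layers act as a residual block.
    skip = layers.Conv2D(1024, 1, padding="same", activation="elu")(f_n0)
    x = layers.Add()([x, skip])

    # Normal Generator: stride 2 shrinks 5x5 -> 3x3 (Conv9); a 3x3 "VALID"
    # convolution (Conv10) collapses 3x3 -> the single center pixel.
    x = layers.Conv2D(1024, 3, strides=2, padding="same", activation="elu")(x)
    x = layers.Conv2D(3, 3, padding="valid")(x)
    x = layers.Reshape((3,))(x)
    out = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(x)
    return tf.keras.Model([img, n0], out)
```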

Our network is trained with a pixel-by-pixel MAE loss, i.e., the angular deviation between the estimate $\mathbf{n}_{est}$ and the ground truth $\mathbf{n}$. It can be written as:

$$L_{MAE}(\mathbf{n}_{est}, \mathbf{n}) = \arccos\left(\frac{\langle \mathbf{n}_{est}, \mathbf{n}\rangle}{\|\mathbf{n}_{est}\|\,\|\mathbf{n}\|}\right), \qquad (4)$$

where $\mathbf{n}$ is the ground-truth normal, $\mathbf{n}_{est}$ is the estimated normal, and $\langle \cdot, \cdot \rangle$ denotes the dot product. $L_{MAE}$ is minimized using Adam [34] with the suggested default settings.
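A direct TensorFlow transcription of Eq.(4) might look as follows; the clipping of the cosine is our addition to keep arccos numerically safe.

```python
import tensorflow as tf

def mae_loss(n_true, n_est):
    # Angular deviation of Eq.(4), averaged over the batch.
    n_true = tf.math.l2_normalize(n_true, axis=-1)
    n_est = tf.math.l2_normalize(n_est, axis=-1)
    cos = tf.reduce_sum(n_true * n_est, axis=-1)   # <n_est, n> term
    cos = tf.clip_by_value(cos, -1.0, 1.0)         # guard arccos domain
    return tf.reduce_mean(tf.acos(cos))
```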

4. Datasets

Deep learning approaches normally require a huge number of training samples to learn a regressor. Training the MPS-Net requires a dataset that includes the images, initial normal maps and ground-truth normal maps of objects. However, the ground-truth normals of real objects and ideal light sources are usually not available. In this study, we train the MPS-Net using widely-used synthetic data and evaluate it using both synthetic and real datasets from previous studies [26, 13]. Experimental results show that the MPS-Net trained on the synthetic datasets generalizes well to the real datasets.

4.1. Synthetic datasets

The synthetic datasets include the Blobby Shape dataset [11], the Stanford 3D Scanning dataset [12] and dozens of 3D models downloaded from Sketchfab (https://sketchfab.com/) and Free3D (https://free3d.com/), which we name the "Web 3D" dataset. Following the work introduced in [14, 26], we employ the MERL dataset [10], which contains the BRDFs of 100 different materials. We render the Blobby, Stanford and Web 3D datasets with the MERL dataset under the pre-defined light sources, following the method used in [14]. The three pre-defined lights have the same slant angle (30°) and are evenly separated by a tilt angle (120°). In order to simulate real conditions, each rendered image contains at least two materials. Note that the MERL dataset is a dictionary which contains every incident and exit angle under white light. We therefore combine the RGB channels of three white-light images according to a proportion to derive a pseudo multispectral image, as performed in [28]. This method effectively simulates the tangle of the illumination, surface reflectance and camera response.
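As an illustration of this pseudo multispectral rendering, the sketch below mixes the R, G and B channels of three white-light renderings (one per light direction) with a proportion matrix; the matrix values are a hypothetical example, not the proportions actually used in [28].

```python
import numpy as np

def pseudo_multispectral(img1, img2, img3, M=None):
    """img1..img3: (H, W, 3) renderings under white light from the three
    pre-defined directions. M: 3x3 mixing proportions between channels."""
    if M is None:
        M = np.array([[0.8, 0.1, 0.1],   # illustrative leakage of the red
                      [0.1, 0.8, 0.1],   # light into the G and B channels,
                      [0.1, 0.1, 0.8]])  # and so on; values are made up
    # Take R from light 1, G from light 2, B from light 3, then tangle them.
    stack = np.stack([img1[..., 0], img2[..., 1], img3[..., 2]], axis=-1)
    return stack @ M.T                   # (H, W, 3) pseudo multispectral image
```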

The training set comprises eight models from the Blobby Shape dataset and six models from the Web 3D dataset. The remaining two models of the Blobby Shape dataset are used as the validation set. The remaining six models of the Web 3D dataset and the Stanford 3D Scanning dataset are utilized for testing.

4.2. Real datasets

First, we employ the DiLiGenT Benchmark [13] as a real dataset for testing. This dataset contains 10 objects made of complex non-Lambertian materials; for each object, 96 images were captured under different light directions. In order to obtain multispectral images, we use the method proposed in [28] with the same configuration as for the synthetic datasets (three images are selected for each object and the light intensities are normalized). It is worth noting that the pre-defined light directions in the DiLiGenT Benchmark are different from those used in our training set. We will analyze the performance of the proposed network under different light directions and evaluate its robustness.

Furthermore, we also built an MPS system to capture real fabrics in order to demonstrate the generalization ability of MPS-Net on real objects. Fabrics are deformable materials, which are always challenging for traditional PS surface normal recovery.

4.2.1. The experimental setup of the MPS system

Our experimental setup is shown in Fig.4. We used an IDS UI-358xCP-C camera placed at the top center of a circular orbit. The lights were placed on the circular orbit around the camera to provide varying illumination directions. The three lights have the same slant angle (30°) and are evenly separated by a tilt angle (120°).

In this experiment, we use the fixed weights learned from the training set to demonstrate the robustness of our network. We compare it against PS-FCN, Demultiplexer and the baseline. It should be noted that the lighting intensity, spectral distribution and camera response are all changed in our real MPS system.

Figure 4: Experimental device. The red box represents the camera and the yellow circles represent the lights.

5. Experiments

In this section, we describe the implementation details of the proposed network and evaluate it under different setups. Regarding the evaluation, we first conduct a network analysis of the MPS-Net on the validation set and then compare it with the state-of-the-art methods using both synthetic and real datasets. We employ the angular error (in degrees) as the performance metric to measure the accuracy of the estimated normal maps.

5.1. Implementation details

The MPS-Net is implemented using TensorFlow on Ubuntu 16.04. The training set includes 1.5 × 10^6 input patches of size 5×5 pixels together with the corresponding ground-truth normal data. We train our model on two NVIDIA GTX 1080Ti GPUs using a batch size of 500 for 20 epochs. The initial learning rate is set to 0.001, with the default Adam [34] parameters (β1 = 0.9 and β2 = 0.999).
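Under these settings, and reusing the `build_mps_net` and `mae_loss` sketches from Section 3.3, a minimal training setup could look as follows; the patch arrays are assumed to come from a data pipeline the paper does not detail.

```python
import tensorflow as tf

# image_patches, normal_patches: (N, 5, 5, 3) arrays cut from the training
# renderings and their Eq.(2) initial normals; gt_normals: (N, 3) labels.
model = build_mps_net(k=5)
opt = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=opt, loss=mae_loss)
model.fit([image_patches, normal_patches], gt_normals,
          batch_size=500, epochs=20)
```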

5.2. Network analysis

We quantitatively analyze our network using the validation set. Fixed-size image patches centered at the estimated normal pixels are used as the input of the MPS-Net. The fused input is composed of the observed image and the inaccurate initial normal estimated using the three-channel separated photometric stereo (see Eq.(2)). The effects of the input size and of the fusion with the initial normal are therefore analyzed in this subsection.

We assess the effectiveness of the 1×1 patch (i.e., the single pixel) and of the 3×3, 5×5, 7×7 and 9×9 patches, as well as the effectiveness of the fusion operation. For the 9×9, 7×7 and 3×3 patches, we tune the number of convolutional layers with "VALID" padding so that the network still estimates the normal of a single pixel. For the 1×1 patch, we replace all convolutional layers by fully-connected layers with the same dimensions; in this case, the structure of our network is similar to that used in [14]. For the network without the fused input, only image patches are used, while the concatenation and shortcut connection operations are discarded. We randomly select 5 × 10^5 patches sampled from the validation set and report the mean angular error and the max angular error. These results are summarized in Table 1.

Table 1: Results of the network analysis. The digits represent the mean angular error (MeAE) or the max angular error (MaAE) across all the selected patches (the lower the better). I and N stand for the multispectral image and the initial normal respectively.

Patch size    1×1      3×3      5×5      7×7      9×9
MeAE (I)      10.09°   9.18°    8.75°    8.53°    10.20°
MaAE (I)      57.25°   49.93°   43.24°   44.93°   44.71°
MeAE (I+N)    8.28°    7.41°    7.03°    7.11°    7.72°
MaAE (I+N)    42.61°   37.31°   31.82°   32.05°   33.09°

5.2.1. Effect of different patch sizes

It can be observed that the mean angular error and the max angular error decrease with increasing patch size until the size reaches 5×5, after which the errors tend to stabilize. This finding supports our hypothesis that a local patch takes advantage of the information embedded in the neighborhood surface points. Moreover, the local patch is able to represent the non-Lambertian surface, avoiding the influence of a shadow or highlight that completely covers the information of a single pixel. On the other hand, the 7×7 and 9×9 patches encode redundant information and increase the computational cost, which may introduce extra error. More importantly, larger patches lead to blurring of the estimated normal map, because farther pixels interfere with the center pixel. Therefore, we choose the 5×5 patch as the default setting of the MPS-Net.

5.2.2. Effect of fusion with the initial normal

Referring to the results in Table 1, we can see that both the mean angular error and the max angular error decrease across all patch sizes when the fused input (I+N) is used. Compared with the input that only uses the image patch, the initial normal provides information closer to the ground-truth data, which results in a more stable learning process and more rapid convergence (see Section 3.2). In addition, the initial normal map and the corresponding multispectral image are complementary to each other: when the initial normal map is very bad, its corresponding multispectral image provides different patterns for the MPS-Net, ensuring an accurate prediction.

5.3. Evaluation on different materials

In Fig.5, we compare the MPS-Net with Demultiplexer [28], PS-FCN [26] and the baseline results derived using Eq.(2) on the validation set (Blobs8). Note that PS-FCN is a photometric stereo method, but it allows input of arbitrary size; we therefore use the three channels of the multispectral image as three input images during the training and test stages.

It can be seen that the MPS-Net significantly outperforms the baseline results and Demultiplexer. The Demultiplexer ignores the information embedded in local surface points, while the MPS-Net obtains continuity constraints and spatial information from neighboring pixels. With the help of the fusion with the initial normal, our method is stable across different materials and is superior to PS-FCN in most cases. For PS-FCN, we believe that the Max-pooling strategy [35] it uses can achieve good results when there are a large number of input channels (e.g., 96), while the effect becomes worse in an MPS system (only three input channels).

Figure 5: Comparison between MPS-Net, Demultiplexer, PS-FCN and the baseline (initial normal) on the samples of Blobs8 in the Blobby Shape dataset [11] rendered with 100 different BRDFs of the MERL dataset [10]. Images in the top-left corner show several rendered samples.

5.4. Evaluation on synthetic datasets

We use the Stanford 3D Scanning dataset and the Web 3D dataset to quantitatively evaluate the proposed MPS-Net. The comparison between the MPS-Net, Demultiplexer, PS-FCN and the baseline is shown in Fig.6. The objects selected for Fig.6 are representative, ranging from simple to complex and covering both Lambertian and non-Lambertian materials.

Figure 6: Quantitative results obtained from the synthetic datasets. Here, GT means the ground-truth data and the digits in the error maps represent mean angular errors (in degrees); per object, the errors for MPS-Net / PS-FCN / Demultiplexer / Baseline are 7.06° / 10.11° / 11.19° / 13.15°, 5.38° / 7.66° / 7.71° / 8.41°, 8.96° / 10.71° / 13.90° / 14.47° and 9.12° / 10.45° / 10.84° / 15.05°. Note that the third object "Dragon" has been rotated for better display.

Compared with the other methods, the MPS-Net produces better results on all objects, whether of complicated or simple shape. It can be observed that the MPS-Net is more robust in regions with multiple BRDFs (see the first two objects in Fig.6): the normal maps estimated by our method are almost unaffected when the material changes, and it is harder to find the boundary of the material change in the MPS-Net results. In addition, the surface normals generated by PS-FCN are accompanied by much noise when the inputs are complex objects (see the last object in Fig.6). This may be because the large patch input of that CNN weakens its fitting and generalization ability.

5.5. Evaluation on real datasets

5.5.1. DiLiGenT Benchmark

In order to further evaluate the proposed MPS-Net, we compare it against Demultiplexer, PS-FCN and the baseline on the DiLiGenT Benchmark [13] along with the ground-truth data. It should be noted that the pre-defined light directions in the DiLiGenT Benchmark are different from their training counterparts; the 96 light directions of the DiLiGenT Benchmark are shown in Fig.7. In fact, for the DiLiGenT Benchmark, the lighting intensity, spectral distribution and camera response are all changed relative to the training dataset. However, we did not retrain the network, instead using the fixed weights learned from the training set; the results demonstrate the robustness of our network.

In the rendering, we use three light directions that are as evenly spread as possible (e.g., ⟨48, 1, 96⟩). Since the surface of an object is not flat, there are cast shadows under a particular illumination direction, where parts of the surface can be occluded from the light source by other parts [36]. When the three illumination directions are close, there will be severe cast shadows. As a convention, we therefore use evenly distributed lights to avoid such obscuring shadows. (There is a very small difference in the illumination directions of each object in DiLiGenT; here, we select the exact illumination directions of each object.)

Figure 7: The 96 illumination directions of "Bear" in the DiLiGenT Benchmark [13]. Each number represents the corresponding image sequence of that light direction.

First, we set the light directions to ⟨48, 1, 96⟩. The experimental results are shown in Fig.8. The objects selected for Fig.8 are representative, ranging from simple to complex and covering both Lambertian and non-Lambertian materials. Compared with PS-FCN and Demultiplexer, MPS-Net performs well on all objects; in particular, MPS-Net achieves smoother results and fewer errors. It generates the best result even on the most non-Lambertian surface, e.g., the "Reading" object: when PS-FCN has only three input channels (the MPS setting), Max-pooling may preserve the highlight area as the maximum response, affecting the quality of the generated normal map. The reason our method achieves better results is that the initial normal map and the image content generated by the highlight region can interact with and constrain each other. The Demultiplexer method produces a result worse than the initial normal map on this object, which may be attributed to its incompetence when dealing with strongly non-Lambertian and dark surfaces. Note that our method has a large error at the top of the head of "Reading"; this is because our method is also a single-image based algorithm and lacks the information required for such a complicated structure.

Second, we analyze the influence of different light directions. We choose different combinations of directions and examine the mean angular error of MPS-Net. We randomly selected three non-planar lighting locations in each group, and the directions of the four groups together cover a circle of lights. The results are reported in Table 2.

Figure 8: Quantitative results obtained from the DiLiGenT Benchmark [13]. GT means the ground-truth data and the numbers in the error maps represent the mean angular error in degrees; per object (MPS-Net / PS-FCN / Demultiplexer / Baseline): Goblet 11.43° / 13.92° / 12.46° / 15.80°, Reading 17.10° / 19.74° / 22.62° / 21.81°, Bear 8.40° / 12.29° / 18.77° / 18.87°, Buddha 9.54° / 12.69° / 14.10° / 16.31°.

Table 2: Comparison of different light direction combinations on DiLiGenT. The four combinations are drawn from the directions shown in Fig.7. The numbers represent the mean angular error in degrees.

No.          Ball    Bear     Pot1     Pot2     Goblet   Reading  Cow      Harvest  Cat     Buddha
⟨48,1,96⟩    5.39°   9.74°    10.61°   10.69°   11.43°   17.10°   10.78°   16.02°   9.90°   11.54°
⟨8,41,89⟩    5.35°   9.91°    10.50°   10.99°   12.07°   17.31°   10.29°   15.44°   9.31°   11.58°
⟨73,72,27⟩   5.17°   9.44°    10.29°   10.93°   11.40°   17.85°   9.92°    15.81°   8.97°   10.73°
⟨22,50,78⟩   5.52°   10.03°   10.51°   11.36°   12.25°   17.41°   10.13°   16.55°   9.46°   10.93°

It can be observed that MPS-Net is insensitive to the change of light directions: the error of each combination remains relatively consistent. The incorporation of the initial normal should account for the robust results of MPS-Net, since the initial normal is not affected by the change of light directions and MPS-Net can be treated as a correction to the initial map. Moreover, the randomization applied after "Conv2" further increases the robustness under different light directions (see Section 3.3).

5.5.2. Real objects

We capture images of real objects (multicolored fabrics). The comparison between the MPS-Net, Demultiplexer, PS-FCN and the baseline is shown in Fig.9.

As shown in Fig.9, the boundaries of the multicolored fabrics can be clearly seen. The baseline method, which uses the three channels of the multispectral image directly, causes discontinuities in the normal map. This error stems from the deviation of the surface albedo estimation and the tangle among illumination, surface reflectance and camera response, which leads to an under-determined system. The experiment shows that the results of Demultiplexer are better than the baseline, but there are still obvious discontinuous boundaries in the normal map, because that method only uses single pixels, ignoring the information embedded in the local surface and causing information uncertainty.

Figure 9: Comparison between MPS-Net, Demultiplexer, PS-FCN and the baseline (initial normal) on the real photographed objects.

The results of PS-FCN show smoother normal maps, but they still exhibit fuzzy boundaries; this is attributed to the fact that larger input patches reduce the generalization of the network to multicolored surfaces.

In contrast, MPS-Net performs best on the multicolored real photographed objects. The reason is the use of the 5×5 patch: with a 5×5 input patch, a change of BRDF more than two pixels away from the center pixel does not affect the normal, so our network is robust to multicolored surfaces. We also note that, compared with PS-FCN, MPS-Net utilizes the initial normal map, which provides a more robust approach: we use the multispectral image to correct the error in the initial normal rather than establishing an unstable mapping between the image and the normal map.

6. Conclusion

In this paper, we proposed a novel learning framework for multispectral photometric stereo, namely MPS-Net. The proposed MPS-Net is able to estimate an accurate normal map from a single multispectral image. MPS-Net does not require any extra prior information and can be used with light directions that are slightly different from those learned in the training stage (as well as with a different camera and light intensity). Experimental results have demonstrated the excellent performance of MPS-Net on various BRDFs, and the results obtained from the synthetic and real datasets indicate the power of MPS-Net compared with the state-of-the-art methods.

In future work, we plan to design a multi-scale pyramid network to provide different receptive fields, which will take multi-scale local context information into account. We believe this will further improve the accuracy of our surface normal recovery.

Conflicts of interest

There are no conflicts of interest.

References

[1] M. S. Drew, L. L. Kontsevich, Closed-form attitude determination under spectrally varying illumination, Computer Vision and Pattern Recognition (1994) 985–990.
[2] A. P. Petrov, I. S. Vergelskaya, L. L. Kontsevich, Reconstruction of shape from shading in color images, Journal of the Optical Society of America A 12 (1994) 1047–1052.
[3] R. J. Woodham, Gradient and curvature from the photometric-stereo method, including local confidence estimation, Journal of the Optical Society of America A 11 (1994) 3050–3068.
[4] H. Kim, B. Wilburn, M. Benezra, Photometric stereo for dynamic surface orientations, European Conference on Computer Vision (2010) 59–72.
[5] G. J. Brostow, C. Hernandez, G. Vogiatzis, B. Stenger, R. Cipolla, Video normals from colored lights, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011) 2104–2114.
[6] K. Ozawa, I. Sato, M. Yamaguchi, Hyperspectral photometric stereo for a single capture, Journal of the Optical Society of America A 34 (2017) 384–394.
[7] R. Anderson, B. Stenger, R. Cipolla, Color photometric stereo for multicolored surfaces, International Conference on Computer Vision (2011) 2182–2189.
[8] Z. Jankó, A. Delaunoy, E. Prados, Colour dynamic photometric stereo for textured surfaces, Asian Conference on Computer Vision (2010) 55–66.
[9] B. D. Decker, J. Kautz, T. Mertens, P. Bekaert, Capturing multiple illumination conditions using time and color multiplexing, Computer Vision and Pattern Recognition (2009) 2536–2543.
[10] W. Matusik, H. Pfister, M. Brand, L. McMillan, A data-driven reflectance model, ACM Transactions on Graphics (2003) 759–769.
[11] M. K. Johnson, E. H. Adelson, Shape estimation in natural illumination, Computer Vision and Pattern Recognition (2011) 2553–2560.
[12] B. Curless, M. Levoy, A volumetric method for building complex models from range images, Conference on Computer Graphics and Interactive Techniques (1996) 303–312.
[13] B. Shi, Z. Mo, Z. Wu, D. Duan, S. K. Yeung, P. Tan, A benchmark dataset and evaluation for non-Lambertian and uncalibrated photometric stereo, IEEE Transactions on Pattern Analysis and Machine Intelligence PP (2018).
[14] H. Santo, M. Samejima, Y. Sugano, B. Shi, Y. Matsushita, Deep photometric stereo network, International Conference on Computer Vision Workshop (2017) 501–509.
[15] J. L. Schönberger, J. M. Frahm, Structure-from-motion revisited, IEEE Conference on Computer Vision and Pattern Recognition (2016).
[16] J. Vongkulbhisal, R. Cabral, F. De la Torre, J. P. Costeira, Motion from structure (MfS): Searching for 3D objects in cluttered point trajectories, Computer Vision and Pattern Recognition (2016) 5639–5647.
[17] R. J. Woodham, Photometric method for determining surface orientation from multiple images, Optical Engineering 19 (1980) 139–144.
[18] B. Bringier, D. Helbert, M. Khoudeir, Photometric reconstruction of a dynamic textured surface from just one color image acquisition, Journal of the Optical Society of America A 25 (2008) 566.
[19] K. Ozawa, I. Sato, M. Yamaguchi, Single color image photometric stereo for multi-colored surfaces, Computer Vision and Image Understanding (2018).
[20] R. Anderson, B. Stenger, R. Cipolla, Augmenting depth camera output using photometric stereo, MVA 1 (2011).
[21] C. Hernandez, G. Vogiatzis, G. J. Brostow, B. Stenger, R. Cipolla, Non-rigid photometric stereo with colored lights, IEEE International Conference on Computer Vision (2007) 1–8.
[22] G. Fyffe, Single-shot photometric stereo by spectral multiplexing, IEEE International Conference on Computational Photography (2010) 1–6.
[23] T. Kawabata, F. Sakaue, J. Sato, One shot photometric stereo from reflectance classification, 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (2016) 620–627.
[24] Y. Yoon, G. Choe, N. Kim, J.-Y. Lee, I. S. Kweon, Fine-scale surface normal estimation using a single NIR image, European Conference on Computer Vision (2016) 486–500.
[25] T. Taniai, T. Maehara, Neural inverse rendering for general reflectance photometric stereo, International Conference on Machine Learning (2018) 4864–4873.
[26] G. Chen, K. Han, K.-Y. K. Wong, PS-FCN: A flexible learning framework for photometric stereo, European Conference on Computer Vision (2018) 3–19.
[27] L. Lu, L. Qi, Y. Luo, H. Jiao, J. Dong, Three-dimensional reconstruction from single image based on combination of CNN and multi-spectral photometric stereo, Sensors 18 (2018) 764.
[28] Y. Ju, L. Qi, H. Zhou, J. Dong, L. Lu, Demultiplexing colored images for multispectral photometric stereo via deep neural networks, IEEE Access 6 (2018) 30804–30818.
[29] H. Jiao, Y. Luo, N. Wang, L. Qi, J. Dong, H. Lei, Underwater multi-spectral photometric stereo reconstruction from a single RGBD image, Signal and Information Processing Association Summit and Conference (2017) 1–4.
[30] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 770–778.
[31] D. A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), International Conference on Machine Learning (2015).
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958.
[33] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning (2015) 448–456.
[34] D. Kingma, J. Ba, Adam: A method for stochastic optimization, Proceedings of the International Conference on Learning Representations (2014).
[35] O. Wiles, A. Zisserman, SilNet: Single- and multi-view reconstruction by learning from silhouettes, British Machine Vision Conference (2017).
[36] L. Wu, A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, Y. Ma, Robust photometric stereo via low-rank matrix completion and recovery, Asian Conference on Computer Vision (2010) 703–717.

Yakun Ju received the B.Sc. degree in industrial design from Sichuan University, Chengdu, China, in 2016. He is currently pursuing the Ph.D. degree in computer application technology with the Department of Computer Science and Technology, Ocean University of China, Qingdao, China. His research interests include 3D reconstruction, machine learning and image processing.

Lin Qi received his B.Sc. and M.Sc. degrees from Ocean University of China in 2005 and 2008 respectively, and received his Ph.D. in computer science from Heriot-Watt University in 2012. He is now an associate professor in the Department of Computer Science and Technology at Ocean University of China. His research interests include computer vision and visual perception.

Jichao He received the B.Sc. degree in information security from Sichuan University, Chengdu, China, in 2018. He is currently pursuing the Master's degree at Ocean University of China, Qingdao, China. His research interests include computer vision, machine learning and deep learning.

Xinghui Dong received the Ph.D. degree from Heriot-Watt University, U.K., in 2014. He is currently a Research Associate with the Centre for Imaging Sciences, The University of Manchester, U.K. His research interests include automatic defect detection, image representation, texture analysis, and visual perception.

Feng Gao received his B.Sc. degree from the Department of Computer Science, Chongqing University, Chongqing, China, in 2008, and received the Ph.D. degree from the Department of Computer Science and Engineering, Beihang University, Beijing, China, in 2015. He is currently an associate professor in the Department of Computer Science and Technology at Ocean University of China. His research interests include computer vision and remote sensing.

Junyu Dong received the B.Sc. and M.Sc. degrees from the Department of Applied Mathematics, Ocean University of China, Qingdao, China, in 1993 and 1999, respectively, and the Ph.D. degree in image processing from the Department of Computer Science, Heriot-Watt University, U.K., in 2003. He joined Ocean University of China in 2004, and he is currently a Professor and the Head of the Department of Computer Science and Technology. His research interests include machine learning, big data, computer vision, and underwater vision.