Depth-guided view synthesis for light field reconstruction from a single image





Wenhui Zhou, Gaomin Liu, Jiangwei Shi, Hua Zhang, Guojun Dai

PII: S0262-8856(20)30006-8
DOI: https://doi.org/10.1016/j.imavis.2020.103874
Reference: IMAVIS 103874
To appear in: Image and Vision Computing
Received date: 27 September 2019
Accepted date: 7 January 2020

Please cite this article as: W. Zhou, G. Liu, J. Shi, et al., Depth-Guided View Synthesis for Light Field Reconstruction From a Single Image, Image and Vision Computing (2020), https://doi.org/10.1016/j.imavis.2020.103874

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Published by Elsevier.


Depth-Guided View Synthesis for Light Field Reconstruction From a Single Image

Wenhui Zhou (a,b), Gaomin Liu (a), Jiangwei Shi (a), Hua Zhang (a,*), Guojun Dai (a)

(a) School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
(b) Zhejiang Provincial Key Laboratory of Information Processing, Communication and Networking, Zhejiang, China

Abstract

Light field imaging has recently become a promising technology for 3D rendering and displaying. However, capturing real-world light field images still faces many challenges in both quantity and quality. In this paper, we develop a learning-based technique to reconstruct a light field from a single 2D RGB image. It includes three steps: unsupervised monocular depth estimation, view synthesis, and depth-guided view inpainting. We first propose a novel monocular depth estimation network to predict the disparity map of each sub-aperture view from the central view of the light field. Then we synthesize the initial sub-aperture views by using a warping scheme. Considering that occlusion makes synthesis ambiguous for pixels invisible in the central view, we present a simple but effective fully convolutional network (FCN) for view inpainting. Note that the proposed network architecture is a general framework for light field reconstruction, which can be extended to take a sparse set of views as input without changing any structure or parameters of the network. Comparison experiments demonstrate that our method outperforms the state-of-the-art light field reconstruction methods with single-view input, and achieves comparable results with the multi-input methods.

Keywords: Light field, Convolutional neural network, Depth estimation, View synthesis, View inpainting

* Corresponding author. Email address: [email protected] (Hua Zhang)

1. Introduction

With the deepening of light field research and applications, light field imaging has become a promising imaging technology for 3D rendering and display, especially for virtual reality and augmented reality (VR and AR) [1]. Limited by the inherent trade-off between angular and spatial resolution, however, the quantity and quality of the light field images captured by commercial plenoptic cameras cannot meet the application requirements. As an alternative, light field reconstruction from a single image or from several images has attracted extensive attention in recent years [2, 3].

By exploiting the powerful ability of convolutional neural networks (CNNs), various learning based light field reconstruction methods have been proposed to recover a high angular resolution (densely sampled) light field [4, 5, 6], and significant performance improvements have been achieved. Most of these methods take a sparse set of sub-aperture views as input, and utilize the geometric consistency between views to alleviate the ambiguity caused by occlusions and textureless regions. Yeung et al. investigated the impact of different sparse input patterns on the reconstruction quality [7]. However, such sparse-view input is hardly available in many visual applications, since the baselines between views are extremely narrow. Therefore, this paper focuses on light field reconstruction from one single image, which has a wider application range. Specifically, for a given 2D central view of a light field, we synthesize a dense collection of new views with light field benefits, such as digital refocusing and aperture adjustment.

Because of the lack of geometric consistency between views, single-input methods face severe challenges, such as subpixel displacements [8] and occlusions [9, 10]. Obviously, tiny errors in the disparity maps and occlusions may result in evident artifacts and distortion in the synthesized views, as shown in Figure 1.


[Figure 1 panels: result of Srinivasan et al.; our result; ground-truth view.]

Figure 1: Exemplary illustration of artifacts and distortion in a synthesized view. In the result of Srinivasan et al. [11], an artifact occurs at the edge of a petal (red square), and a geometric distortion appears at the bottom edge of the curtain (blue rectangle).

To deal with this issue, we propose a depth-guided learning method for light field reconstruction, which can be divided into the following three aspects: unsupervised monocular depth estimation, view synthesis, and depth-guided view inpainting. Specifically, we first predict a disparity map for each virtual view without any ground-truth depth information. Then the initial synthesized views are generated by the light field warping scheme. Obviously, when we synthesize the initial sub-aperture views from the given view, occlusion raises a critical issue. Moreover, photometric inconsistency also results in warping errors. These warping errors are difficult to handle in a monocular visual system because of the lack of multiview information. We therefore treat them as a type of noise, and train a view inpainting network to inpaint the warping error regions. Our main contributions are twofold:

1) An unsupervised monocular depth estimation method with multi-cue losses for light field. It has an encoder-decoder structure and takes the central view of the light field as input. To improve the accuracy of the prediction, a multi-cue loss is introduced according to the multi-orientation epipolar geometry of the light field.

2) A depth-guided view inpainting network with a residual structure. It is a 10-layer FCN with residual connections. Following the supervised learning paradigm, it trains an inpainting model from the initial synthesized views and the corresponding disparity maps.


Our network architecture is a general framework that can be extended to take a sparse set of views as input without changing any structure or parameters of the network. We perform a comprehensive evaluation of our model on public datasets. Our method outperforms the state-of-the-art light field reconstruction methods with single-view input, and achieves comparable results with the multi-input methods. The comparison experiments manifest the effectiveness and advantages of our paradigm.


2. Related Works

Early work on light field reconstruction can be traced back to research on light field super-resolution techniques [12, 13]. In the last decade, many innovations in light field reconstruction or light field synthesis have been proposed. They can be roughly divided into two categories: non-learning-based and learning-based. The former usually formulates view synthesis as an optimization problem [14] with various priors or assumptions, such as Lambertian and textural priors [15], Gaussian mixture models [16], variational models [17], etc. The latter designs an end-to-end learning framework to automatically learn depth cues and geometric information from a training set, and then reconstructs a dense set of sub-aperture views. For brevity, we herein only discuss the deep learning based approaches that are associated with our work.

2.1. Depth based View Synthesis

These approaches are usually based on the estimation of scene depth: they synthesize the virtual views by using the estimated depth and the given images. Flynn et al. [18] proposed a deep network to synthesize novel views from a sequence of images with wide baselines. Kalantari et al. proposed a learning based method to synthesize new views from a sparse set of input views [2]. They broke the task of view synthesis down into disparity and color estimation components, and used two sequential convolutional neural networks (CNNs) to model these two components. Vadathya et al. [19] built their network on a similar pipeline: a DispNet taking focus-defocus pairs as input, and a RefineNet predicting occluded pixels and non-Lambertian effects. Zhou et al. [20] trained a network that infers alpha and multiplane images; the novel views are synthesized by using homography and alpha composition.

Recently, several light field synthesis methods with single-view input have been proposed. Srinivasan et al. [11] presented an end-to-end CNN framework to synthesize a 4D RGBD light field from a single 2D RGB image. They factorized the light field synthesis problem into three subproblems: scene depth estimation, light field rendering, and occluded ray prediction. Ivan et al. [21] proposed an appearance flow learning network, which synthesizes a light field from a single image robustly and directly, without requiring any physics-based priors or post-processing subnetworks.

2.2. Sampling based Angular Interpolation

These approaches do not require depth information as an auxiliary mapping. Instead, they take a sparse set of sub-aperture views as input, and formulate view synthesis as sampling and consecutive reconstruction (interpolation) of the plenoptic function.

Following the advancement of CNN based single image super-resolution (SISR), Yoon et al. [22, 23] developed a CNN model that jointly super-resolves the light field in both the spatial and angular domains.

Most recently, many state-of-the-art methods have been proposed based on Epipolar Plane Images (EPIs), and they achieve far superior performance over the traditional approaches. Wu et al. [4, 5] modeled light field reconstruction as a learning-based angular detail restoration on 2D EPIs. Ruan et al. [24] proposed an improved Wasserstein Generative Adversarial Network with gradient penalty to upscale a low-resolution RGB image to a high-resolution light field image; they introduced the EPI constraint of the light field into the loss function. Guo et al. [25] presented an EPI-based method to reconstruct new light fields between two mutually independent light fields. Wang et al. [6] built a Pseudo 4DCNN to synthesize new views of a dense 4D light field from sparse input views. Their Pseudo 4DCNN assembles 2D strided convolutions and two detail-restoration 3D CNNs.

[Figure 2 diagram: the DispNet (a ResNet-style encoder with a 7×7 Conv + Maxpooling stem and Encode1–4 ResBlk stages of ×3, ×4, ×6, ×3 blocks, followed by DecodeBlock1–6 built from 1×1/3×3 Conv, ELU, IN and up-sampling) predicts disparity maps from the input central view under the unsupervised multi-cue loss. The warping operator W synthesizes the initial views, which are concatenated (C) with the disparity maps and fed to the InpaintingNet (stacked InpaintBlocks of 1×1 and 3×3 Conv, ELU, IN with residual connections), trained with the inpainting loss against the ground-truth views.]

Figure 2: Our network architecture.

3. Our Network Architecture


Different from light field reconstruction from sparse input views, approaches that reconstruct the light field from a single image lack the geometric constraints of EPIs or multiple views. In order to address this issue, we propose a depth-guided view synthesis scheme and a view inpainting method for light field reconstruction. The former learns the geometric cues with a disparity estimation network, DispNet. The latter applies an FCN, InpaintingNet, to deal with ambiguous regions such as occlusions.

The schematic architecture of our network is depicted in Figure 2. The DispNet takes the central view of the light field as input and predicts a disparity map for each sub-aperture view. These disparity maps are used to warp the central view to generate the initial synthesized views. The warping operator is implemented by bilinear interpolation.
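As an illustration of this warping step (detailed in Section 3.1), the following minimal sketch resamples the central view with a per-view disparity map using bilinear interpolation. It is written with NumPy/SciPy for clarity rather than taken from our training code, and the function and variable names are chosen only for exposition.

```python
# Minimal sketch of the depth-guided warping operator of Section 3.1 (Eq. (3)):
# sample the central view at x - k * d(x) with bilinear interpolation.
import numpy as np
from scipy.ndimage import map_coordinates

def warp_view(central_view, disparity, k_row, k_col):
    """Synthesize the sub-aperture view at angular offset k = (k_row, k_col)."""
    h, w, _ = central_view.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates in the central view for every target pixel.
    src_rows = rows - k_row * disparity
    src_cols = cols - k_col * disparity
    warped = np.empty_like(central_view)
    for c in range(central_view.shape[2]):          # per colour channel
        warped[..., c] = map_coordinates(central_view[..., c],
                                         [src_rows, src_cols],
                                         order=1, mode="nearest")
    return warped

# Example: synthesize the view at angular offset k = (2, -1) from a toy central view.
central = np.random.rand(376, 541, 3)
disp = np.random.uniform(-1.5, 1.5, size=(376, 541))
view = warp_view(central, disp, k_row=2, k_col=-1)
```

In the network itself the same resampling has to be differentiable (e.g., a bilinear sampling layer), so that the unsupervised losses defined below can back-propagate into the DispNet.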


Since some pixels in a real sub-aperture view are not visible in the central view, warping errors inevitably occur in the initial synthesized views. We treat these warping errors as noise. Inspired by CNN based image denoising [26] and disocclusion inpainting in free-viewpoint rendering [27, 28], the InpaintingNet is used to inpaint the ambiguous regions and the warping error regions.

3.1. Light Field Warping Operator

Let LF(x, u) be the 4D light field data represented by the two-plane parametrization [29, 30], where x and u are the spatial and angular coordinates, respectively. The central view I_{u0}(x) is formed by the rays passing through the optical center u_0 of the main lens:

\[ I_{u_0}(x) = LF(x, u_0). \tag{1} \]

Given a sub-aperture image I_u(x), the central view reconstructed by warping I_u(x) can be expressed as follows:

\[ \tilde{I}_{u \mapsto u_0}(x) = I_{u}\big(x + k \, d_{u_0}(x)\big), \tag{2} \]

where k = u_0 − u, and d_{u0}(x) is the disparity map of the central view. Similarly, we can reconstruct a virtual sub-aperture view from the central view I_{u0}(x) as follows:

\[ \tilde{I}_{u_0 \mapsto u}(x) = I_{u_0}\big(x - k \, d_{u}(x)\big), \tag{3} \]

where d_u(x) is the disparity map of the sub-aperture view I_u(x).

3.2. DispNet

Our DispNet is an unsupervised monocular disparity estimation network with an encoder-decoder structure. It takes the central view of the light field as input, and predicts the disparity maps of all sub-aperture views without any ground-truth information.


We use ResNet50 [31] as the encoder for its effective residual learning. The decoder is made up of deconvolution layers and uses nearest-neighbor up-sampling to enlarge the spatial resolution of the feature maps back to the full input scale. We add long skip connections between the encoder and the decoder in order to forward both global high-level and local detailed information. Similar structures are widely used in many unsupervised methods, such as [32, 33].
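For illustration, one decoder stage could be sketched as below. This is a simplified tf.keras reconstruction based on Figure 2 and Table 1 rather than our exact implementation; it realizes the deconvolution as nearest-neighbor up-sampling followed by convolution and fuses the corresponding encoder feature through a long skip connection.

```python
# Illustrative sketch of one decoder stage with a long skip connection (tf.keras);
# layer choices follow Figure 2 / Table 1 loosely and are a reconstruction only.
import tensorflow as tf
from tensorflow.keras import layers

def decode_block(x, skip, filters):
    """Nearest-neighbour upsample, fuse the encoder skip feature, then convolve."""
    x = layers.UpSampling2D(size=2, interpolation="nearest")(x)
    if skip is not None:
        x = layers.Concatenate(axis=-1)([x, skip])   # long skip connection
    x = layers.Conv2D(filters, 3, padding="same", activation="elu")(x)
    return x

# Toy usage: a low-resolution bottleneck feature upsampled and fused with a
# higher-resolution encoder feature.
bottleneck = tf.random.normal([1, 6, 9, 256])
skip_feat = tf.random.normal([1, 12, 18, 128])
out = decode_block(bottleneck, skip_feat, filters=512)   # -> (1, 12, 18, 512)
```

According to Table 1, the last decoder stage outputs 49 channels through a sigmoid, i.e., one disparity map for each of the 7 × 7 sub-aperture views.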

The key difference between our DispNet and previously proposed methods is that we propose a novel unsupervised multi-cue loss function that combines the characteristics of light field geometry with a data consistency measure. Figure 3 shows that our disparity estimation method achieves significantly superior performance over the state-of-the-art approaches [2, 11].

Our unsupervised loss includes three terms: a photometric loss L_p, a defocus loss L_r and a disparity-consistency loss L_c:

\[ L_{total} = L_p + \alpha_1 L_r + \alpha_2 L_c, \tag{4} \]

where α_1 and α_2 are set to 10 and 0.001, respectively, in our experiments.

We adopt the L1 distance and the single-scale structural similarity index measure (SSIM) [34] to compute the image similarity between an image I and its reconstructed version Ĩ:

\[ \psi\big(I, \tilde{I}\big) = \beta \, \frac{1 - \mathrm{SSIM}\big(I, \tilde{I}\big)}{2} + (1 - \beta) \, \big\| I - \tilde{I} \big\|_1, \tag{5} \]

where β is set to 0.85. The photometric loss L_p is defined as follows:

\[ L_p = \sum_{u} \Big[ \psi\big(I_{u_0}(x), \tilde{I}_{u \mapsto u_0}(x)\big) + \psi\big(I_{u}(x), \tilde{I}_{u_0 \mapsto u}(x)\big) \Big]. \tag{6} \]
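The measure in Eq. (5) and the loss in Eq. (6) translate almost directly into TensorFlow. The snippet below is an illustrative sketch (not our exact training code) that assumes images are float tensors normalized to [0, 1].

```python
# Sketch of the image similarity psi of Eq. (5) and the photometric loss of Eq. (6),
# assuming tensors of shape (batch, H, W, 3) with values in [0, 1].
import tensorflow as tf

BETA = 0.85

def psi(img, recon):
    ssim = tf.image.ssim(img, recon, max_val=1.0)          # per-image SSIM
    dssim = (1.0 - tf.reduce_mean(ssim)) / 2.0
    l1 = tf.reduce_mean(tf.abs(img - recon))
    return BETA * dssim + (1.0 - BETA) * l1

def photometric_loss(central, views, central_from_views, views_from_central):
    """Eq. (6): accumulate psi over all views, in both warping directions."""
    loss = 0.0
    for v, c_hat, v_hat in zip(views, central_from_views, views_from_central):
        loss += psi(central, c_hat) + psi(v, v_hat)
    return loss
```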

[Figure 3 panels: central views of the light fields; disparity results of Srinivasan et al.; disparity results of Kalantari et al.; our disparity results.]

Figure 3: Disparity estimation comparisons among Srinivasan et al. [11], Kalantari et al. [2] and our method. Our method generates more accurate disparity estimates not only on object surfaces but also around object boundaries.

One of the most distinctive properties of a light field image is that it contains enough angular resolution for refocusing, which usually implies an important defocus cue associated with the depth information. Herein, we compute an integral image by accumulating all the reconstructed central views Ĩ_{u→u0}(x):

\[ I_{integ}(x) = \frac{1}{N_u} \sum_{u} \tilde{I}_{u \mapsto u_0}(x), \tag{7} \]


where N_u is the number of sub-aperture views.

Obviously, the integral image is an all-in-focus image (an image without defocus blur) if the disparity map of the central view is exactly correct, whereas inaccurate disparity estimation of the central view results in defocus blur in the integral image. Therefore, the defocus loss L_r is simply defined as the image similarity between the integral image and the central view:

\[ L_r = \psi\big(I_{u_0}(x), I_{integ}(x)\big). \tag{8} \]
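The integral image of Eq. (7) and the defocus loss of Eq. (8) can be sketched as follows; this is again an illustrative sketch, where warped_centrals denotes the list of central-view reconstructions produced by the warping operator and psi is the similarity of Eq. (5).

```python
# Sketch of the integral image (Eq. (7)) and the defocus loss (Eq. (8)).
import tensorflow as tf

def defocus_loss(central_view, warped_centrals, psi):
    integral = tf.add_n(warped_centrals) / float(len(warped_centrals))  # Eq. (7)
    return psi(central_view, integral)                                  # Eq. (8)
```

If the predicted central-view disparities are exact, all warped central views align and the integral image remains sharp, so any residual defocus blur is penalized through ψ.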

By mimicking the traditional forward-backward or left-right consistency check [32, 33, 35], we use the L1 distance to compute our disparity-consistency loss L_c as follows:

\[ L_c = \sum_{u} \big\| d_{u}(x) - \tilde{d}_{u_0 \mapsto u}(x) \big\|_1, \qquad \tilde{d}_{u_0 \mapsto u}(x) = d_{u_0}\big(x - k \, d_{u_0}(x)\big). \tag{9} \]
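The consistency term of Eq. (9) warps the central-view disparity map with its own disparities and compares it against the disparity predicted for view u. A per-view NumPy sketch (illustrative only; names are chosen for exposition) is given below.

```python
# Sketch of the per-view disparity-consistency term of Eq. (9).
import numpy as np
from scipy.ndimage import map_coordinates

def consistency_loss(d_central, d_view, k_row, k_col):
    h, w = d_central.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_rows = rows - k_row * d_central
    src_cols = cols - k_col * d_central
    d_warped = map_coordinates(d_central, [src_rows, src_cols],
                               order=1, mode="nearest")   # central disparity warped to view u
    return np.mean(np.abs(d_view - d_warped))             # L1 term of Eq. (9)
```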


3.3. InpaintingNet


Different from the occlusion prediction network [11] and RefineNet [33, 19], we treat the warping error as a type of noise, and try to learn a denoising model from the image structure and depth information.


Specifically, following the supervised learning paradigm, we train a simple but effective FCN, named InpaintingNet. It includes three InpaintBlocks with residual connections, as shown in Figure 2. We concatenate each initial synthesized view with its corresponding disparity map, and then send them into the InpaintingNet in turn. The inpainting loss is defined as follows:

\[ L_i = \frac{1}{N_u} \sum_{u} \psi\big(\hat{I}_{u}(x), I_{u}(x)\big), \tag{10} \]

where Î_u(x) is one of the final synthesized views produced by the InpaintingNet, and N_u is the number of sub-aperture views.
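One InpaintBlock can be sketched as follows. This is an illustrative tf.keras reconstruction from Table 2 and Figure 2 rather than our exact implementation; the instance normalization shown in Figure 2 is omitted here for brevity, and the 1×1 projection on the skip path is an assumption used to match channel counts.

```python
# Illustrative sketch of one InpaintBlock: 1x1 conv, two 3x3 convs, residual add.
import tensorflow as tf
from tensorflow.keras import layers

def inpaint_block(x, filters):
    y = layers.Conv2D(filters, 1, padding="same", activation="elu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="elu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if x.shape[-1] != filters:                       # match channels for the skip
        x = layers.Conv2D(filters, 1, padding="same")(x)
    return layers.Activation("elu")(layers.Add()([x, y]))

# The InpaintingNet input is a synthesized view concatenated with its disparity map.
view_and_disp = tf.random.normal([1, 376, 541, 4])  # (B, H, W, RGB + disparity)
x = inpaint_block(view_and_disp, 8)
x = inpaint_block(x, 16)
x = inpaint_block(x, 8)
out = layers.Conv2D(3, 3, padding="same", activation="relu")(x)  # final 3x3 conv
```

The channel progression 4 → 8 → 16 → 8 → 3 follows Table 2.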


[Figure 4 panels: disparity maps; initial synthesized views; final synthesized views.]

Figure 4: Visual comparisons between the initial synthesized views and the inpainted synthesized views. To make the differences clear, we set the maximum permissible per-pixel error to 15 and mark the bad pixels in yellow. It is obvious that the bad pixels are mainly located on the object boundaries, and our InpaintingNet can correct most of them.

In order to preserve the high-frequency details in the synthesized views, we remove the pooling operations throughout the network and keep the input and output of each layer at the same size. The DispNet and the light field warping operator are based on the Lambertian assumption; therefore, warping errors inevitably occur at occlusions and non-Lambertian reflectance regions. The InpaintingNet cascades the residual blocks to handle these warping errors. In order to emphasize the importance of enforcing disparity consistency in the predictions, the predicted disparity maps are also fed into the InpaintingNet, so that it has the depth information needed to understand which rays in the initial synthesized views are incorrect. Figure 4 demonstrates that our InpaintingNet can correct the warping errors. The network ablation experiments in Section 4.4 also verify its effectiveness.

4. Experimental Evaluation


In this section, we first introduce the implementation details of our network. Then we evaluate our method on two public datasets [2, 11] and analyze the impact of the different network components on the performance of our method.

4.1. Datasets and Pre-processing

The public "Flowers" dataset [11] and "30 Scenes" dataset [2], both captured by a Lytro Illum camera, are selected for evaluation. The size of these light field images is 14 × 14 × 376 × 541. In order to reduce the impact of ambiguous correspondences, we exclude training data that contain large texture-less or weakly-textured regions caused by low-light or unsuitable exposure conditions, as shown in Figure 5.

Figure 5: Examples of the "bad" data removed from the training set. Low light results in large texture-less or weakly-textured regions.

Data augmentation is performed to prevent overfitting and improve the generalization ability of our model. We apply the following general augmentation techniques, each with a 50% chance: random gamma changes (within [0.8, 1.2]), random brightness changes (within [0.5, 2.0]), and random multiplicative color changes for each color channel separately (within [0.8, 1.2]).
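The augmentation described above can be sketched as follows. This is an illustrative sketch rather than our exact pipeline: it interprets the brightness change as a multiplicative factor, and in practice the same random factors would be applied consistently to every view of a light field sample.

```python
# Sketch of the data augmentation (each transform applied with a 50% chance),
# assuming float32 RGB images in [0, 1].
import tensorflow as tf

def augment(image):
    if tf.random.uniform([]) < 0.5:                       # random gamma in [0.8, 1.2]
        image = image ** tf.random.uniform([], 0.8, 1.2)
    if tf.random.uniform([]) < 0.5:                       # multiplicative brightness in [0.5, 2.0]
        image = image * tf.random.uniform([], 0.5, 2.0)
    if tf.random.uniform([]) < 0.5:                       # per-channel colour jitter in [0.8, 1.2]
        image = image * tf.random.uniform([3], 0.8, 1.2)
    return tf.clip_by_value(image, 0.0, 1.0)
```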

Table 1: The detailed architecture of the DispNet. Input (BatchNum, Height, Width, Channels): (1, 376, 541, 3).

Layer name | Module | Output size | Activate/Norm
7×7 Conv | 7×7 | (1, 192, 288, 64) | ELU/IN
Maxpooling | 3×3 | (1, 96, 144, 64) | ELU/IN
Encode1 | ResBlk ×3 | (1, 48, 72, 256) |
Encode2 | ResBlk ×4 | (1, 24, 36, 512) |
Encode3 | ResBlk ×6 | (1, 12, 18, 1024) |
Encode4 | ResBlk ×3 | (1, 6, 9, 2018) |
DecodeBlock1 | 3×3 | (1, 12, 18, 512) | ELU/IN
DecodeBlock2 | 3×3 | (1, 24, 36, 256) | ELU/IN
DecodeBlock3 | 3×3 | (1, 48, 72, 128) | ELU/IN
DecodeBlock4 | 3×3 | (1, 96, 144, 64) | ELU
DecodeBlock5 | 3×3 | (1, 192, 288, 32) | ELU
DecodeBlock6 | 3×3 | (1, 384, 576, 49) | Sigmoid

Table 2: The detailed architecture of the InpaintingNet. Input (BatchNum, Height, Width, Channels): (1, 376, 541, 4).

Layer name | Module | Output size | Activate/Norm
InpaintBlock1 | 1×1, 3×3, 3×3 | (1, 376, 574, 8) | ELU
InpaintBlock2 | 1×1, 3×3, 3×3 | (1, 376, 574, 16) | ELU
InpaintBlock3 | 1×1, 3×3, 3×3 | (1, 376, 574, 8) | ELU
3×3 Conv | 3×3 | (1, 376, 574, 3) | ReLU

4.2. Implementation Details

The input of our network is the central view of a light field with a spatial resolution of 376 × 541, and the output is a 7 × 7 × 376 × 541 light field. More detailed layer-by-layer definitions are listed in Table 1 and Table 2, respectively. Our model contains about 58.64 million trainable parameters.

Our network, implemented in Python with TensorFlow, is trained on a machine with an Intel Core i7-5930K CPU at 3.5 GHz, 80 GB of DDR4 memory and an NVIDIA GTX 1080Ti GPU. We train it with the Adam optimizer (β1 = 0.9, β2 = 0.999) [36]. The initial learning rate is set to 0.0004 and kept unchanged for the first 30,000 iterations; it is then halved every 30,000 iterations until the end of training. It takes about 40 hours for our network to converge.
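The optimization setup described above corresponds to the following sketch; the staircase exponential decay reproduces the stated halving schedule.

```python
# Sketch of the optimizer and learning-rate schedule: Adam (beta1=0.9, beta2=0.999)
# with a rate of 4e-4 that is halved every 30,000 iterations.
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=4e-4,
    decay_steps=30_000,
    decay_rate=0.5,
    staircase=True,          # keep the rate constant within each 30k-step block
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule,
                                     beta_1=0.9, beta_2=0.999)
```

With staircase decay the rate is 4e-4 for steps 0–29,999, 2e-4 for steps 30,000–59,999, and so on.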

4.3. Performance Comparison

We compare our method with five state-of-the-art methods: two methods with single-view input (Srinivasan et al. [11] and Ivan et al. [21]), and three methods with sparse-view input (Kalantari et al. [2], Wu et al. [4, 5], and Wang et al. [6]).

For fair comparisons, we also extend our original network to a sparse-input network without changing any structure or parameters of the network. Note that our network architecture is a general framework for light field reconstruction, which can be extended to take a sparse set of views as input simply by concatenating the input views along the channel dimension. That is, the input dimension of our sparse-input network is BatchNum × Height × Width × (Num · Channels), where Num is the number of input views. Without loss of generality, we use a 3 × 3 sparse input here, which is the same degree of sparsity as that of Wu et al. [4, 5] and Wang et al. [6]. So the input of our sparse-input model is 1 × 376 × 541 × 27 in the evaluation experiments.
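Packing a 3 × 3 set of views into the 27-channel input tensor amounts to a simple channel-wise concatenation, as sketched below (illustrative; array names are chosen for exposition).

```python
# Sketch of building the 27-channel sparse input (1 x 376 x 541 x 27)
# from a 3x3 grid of RGB sub-aperture views.
import numpy as np

views = [np.random.rand(376, 541, 3) for _ in range(9)]   # 3x3 sparse RGB views
sparse_input = np.concatenate(views, axis=-1)[None, ...]  # -> (1, 376, 541, 27)
assert sparse_input.shape == (1, 376, 541, 27)
```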

Following the numerical evaluation methods proposed in previous papers [4, 5, 6], the gray-scale SSIM [34] and the peak signal-to-noise ratio (PSNR) are used to evaluate the results of the light field reconstruction. The SSIM is a well-known similarity metric based on the quality perception of the human visual system (HVS). It models image quality distortion as a combination of three different factors: loss of correlation, luminance distortion and contrast distortion. The value of the SSIM metric is between 0 and 1; the higher the SSIM value, the more similar the images. The PSNR metric reflects the numerical differences between the synthesized sub-aperture views and the ground truths; a higher PSNR value denotes a better quality of the reconstructed light field. To better demonstrate the performance of the reconstruction methods, the input view(s) are excluded when we compute the SSIM and PSNR of the synthesized sub-aperture views, instead of using all views of the reconstructed light field as most existing methods did. The evaluation metrics of a reconstructed light field can be formulated as follows:

\[ \mathrm{MSSIM} = \frac{1}{N_s} \sum_{u_s \in LF} \mathrm{SSIM}\big(\tilde{I}'_{u_s}(x), I'_{u_s}(x)\big), \qquad \mathrm{MPSNR} = \frac{1}{N_s} \sum_{u_s \in LF} \mathrm{PSNR}\big(\tilde{I}'_{u_s}(x), I'_{u_s}(x)\big), \tag{11} \]

where N_s is the number of synthesized views, u_s is the angular coordinate of a synthesized view, and Ĩ' and I' are the luminance components of the images Ĩ and I, respectively.
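Eq. (11) can be computed, for example, with TensorFlow's built-in SSIM and PSNR operators on the luminance images of the synthesized views only, as sketched below (an illustrative sketch, not our exact evaluation script).

```python
# Sketch of MSSIM / MPSNR over the synthesized views only (Eq. (11)); `synth` and
# `gt` are lists of luminance images of shape (H, W, 1) with values in [0, max_val].
import tensorflow as tf

def mssim_mpsnr(synth, gt, max_val=1.0):
    ssims, psnrs = [], []
    for s, g in zip(synth, gt):
        s, g = s[None, ...], g[None, ...]            # add a batch dimension
        ssims.append(tf.image.ssim(s, g, max_val))
        psnrs.append(tf.image.psnr(s, g, max_val))
    return float(tf.reduce_mean(ssims)), float(tf.reduce_mean(psnrs))
```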


4.3.1. “Flowers” Dataset

The "Flowers" dataset contains 3343 light fields of flowers and plants. We randomly divide this dataset into two groups: 3009 light fields for training and 334 for testing.

We compute the MSSIM and MPSNR over all 334 testing images, and compare the performance of our original method with that of the two state-of-the-art single-view input methods [11, 21]. Note that we retrained the model of Srinivasan et al. [11] with their source code, which was downloaded from the authors' webpage¹.

Figure 6 shows some visual comparisons between our original model and the method of Srinivasan et al. [11]. Notable artifacts and distortion can be observed in the synthesized results of [11], while our results are almost free of artifacts and much closer to the ground-truth views in appearance. Figure 7 compares the statistical histograms of MSSIM and MPSNR between our original model and the method of Srinivasan et al. [11]. It is obvious that our histograms have a more compact distribution and higher mean values. For numerical comparison, the average values of MSSIM and MPSNR are listed in Table 3. It is evident that our original method is superior to the two state-of-the-art single-view input methods [11, 21]. Without loss of generality, we also verify that our sparse-input model, with a 3 × 3 sparse-view input, obtains the best performance.

¹ https://people.eecs.berkeley.edu/~pratul/

[Figure 6 panels: ground-truth views; disparity maps of Srinivasan et al.; synthesized views of Srinivasan et al.; our disparity maps; our synthesized views.]

Figure 6: Visual comparisons between our original method and [11] on the "Flowers" dataset. The image regions in the small color squares are zoomed in to reveal more details. By comparing the big squares of the same color in each row, it is easy to conclude that our synthesized results are almost free of artifacts and much closer to the ground-truth views in appearance.

Figure 7: Quantitative comparison via the statistical histograms of MSSIM and MPSNR.

Table 3: Quantitative results on the "Flowers" dataset [11]. Bold indicates the best performance.

Methods | Input Mode | MSSIM | MPSNR
Srinivasan et al. [11] | single input | 0.9192 | 33.04
Ivan et al. [21] | single input | 0.93 | 36.31
Our original model | single input | 0.9718 | 39.05
Our sparse-input model | 3 × 3 sparse input | 0.9948 | 49.60

4.3.2. "30 Scenes" Dataset

The "30 Scenes" dataset [2] contains 130 light fields of various scenes, of which 100 scenes are used for training and 30 for testing.

Herein we test the generalization ability of our original model and our sparse-input model, both of which are trained on the "Flowers" dataset. The reconstruction results of the three state-of-the-art sparse-view input methods [2, 4, 5, 6] are obtained directly from the models provided by the authors (Kalantari et al.², Wu et al.³, and Wang et al.⁴).

We compute the MSSIM and MPSNR over the 30 testing scenes. Figure 8 depicts the quantitative comparisons of the MSSIM and MPSNR for each testing scene. Obviously, our sparse-input model performs significantly better than the other methods, and our original model achieves comparable results with the sparse-view input methods. This is also demonstrated by the average values of MSSIM and MPSNR listed in Table 4.

² http://cseweb.ucsd.edu/~viscomp/projects/LF/papers/SIGASIA16/
³ https://github.com/GaochangWu/lfepicnn/tree/master
⁴ https://github.com/DalonsmaryCASIA/Pseudo4DCNN


[Figure 8: per-scene MSSIM (top, approximately 0.65–1.0) and MPSNR (bottom, approximately 20–50 dB) curves over the 30 scenes for our sparse-input model, our original model, Wu et al., Wang et al., Kalantari et al. and Srinivasan et al.]

Figure 8: Quantitative comparisons by the MSSIM and MPSNR of each testing scene on the "30 Scenes" dataset.

Table 4: Quantitative results on the "30 Scenes" dataset [2]. Bold indicates the best performance.

Methods | Input Mode | MSSIM | MPSNR
Srinivasan et al. [11] | single input | 0.8501 | 28.82
Ivan et al. [21] | single input | 0.86 | 29.82
Kalantari et al. [2] | 2 × 2 sparse input | 0.9693 | 36.98
Wang et al. [6] | 3 × 3 sparse input | 0.9698 | 33.96
Wu et al. [4, 5] | 3 × 3 sparse input | 0.9823 | 42.24
Our original model | single input | 0.9317 | 33.68
Our sparse-input model | 3 × 3 sparse input | 0.9913 | 44.92

Table 5: Average runtime comparisons on the "30 Scenes" dataset [2]. Bold indicates the shortest runtime.

Methods | CUDA | Avg. runtime (s)
Kalantari et al. [2] | – | 16.2032
Wu et al. [4, 5] | √ | 6.8571
Wang et al. [6] | √ | 2.0435
Srinivasan et al. [11] | √ | 0.0146
Our original model | √ | 0.0072
Our sparse-input model | √ | 0.0110

Table 6: Performance comparisons of different configurations of the DispNet loss terms (L_p, L_r, L_c) and the InpaintingNet; the last row corresponds to the full configuration. Bold indicates the best performance.

Configuration | "Flowers" MSSIM | "Flowers" MPSNR | "30 Scenes" MSSIM | "30 Scenes" MPSNR
1 | 0.9475 | 37.56 | 0.9064 | 31.22
2 | 0.9551 | 37.80 | 0.9117 | 31.89
3 | 0.9520 | 37.45 | 0.9135 | 31.90
4 | 0.9609 | 38.04 | 0.9258 | 32.31
5 (L_p + L_r + L_c + Inpainting) | 0.9718 | 39.05 | 0.9317 | 33.68

Table 5 provides the average runtime comparisons on the 30 testing scenes. Because our original model takes the whole central view as input and synthesizes all views simultaneously, our prediction time is significantly shorter than those of the state-of-the-art methods.

4.4. Network Ablation Experiments

To analyze the impact of the different loss configurations and the InpaintingNet on the performance of light field reconstruction, ablation experiments are carried out on the "Flowers" and "30 Scenes" datasets. Table 6 provides the performance comparisons of the different configurations. The quantitative improvements verify the effectiveness of the proposed loss functions and the InpaintingNet.


5. Conclusions and Future Work

We have designed a depth-guided view synthesis and inpainting method for light field reconstruction from a single image. It includes two sequential CNNs (DispNet and InpaintingNet). The proposed network is a general framework for light field reconstruction and can easily be extended to a multi-input network. We have evaluated our method on the public "Flowers" and "30 Scenes" datasets. Both quantitative evaluations and visual comparisons demonstrate the superiority of our method. Our original model outperforms the state-of-the-art methods with single-view input, and achieves comparable results with the multi-input methods. Our sparse-input model is significantly better than the existing methods. For future work, we will try to introduce low-level structure information of objects to refine the object boundaries.

6. Acknowledgment

This work is supported in part by the National Key R&D Program of China (2017YFE0118200), the Open Project of the Zhejiang Provincial Key Laboratory of Information Processing, Communication and Networking, and the Key Program of the Zhejiang Provincial Natural Science Foundation of China (LZ14F020003). The authors are grateful to the anonymous reviewers for their constructive comments.

References

[1] F.-C. Huang, K. Chen, G. Wetzstein, The light field stereoscope: Immersive computer graphics via factored near-eye light field displays with focus cues, ACM Transactions on Graphics (SIGGRAPH) 34 (4) (2015) 60:1–60:12.

[2] N. K. Kalantari, T.-C. Wang, R. Ramamoorthi, Learning-based view synthesis for light field cameras, ACM Transactions on Graphics (SIGGRAPH Asia) 35 (6) (2016) 193:1–193:10.

[3] S. Vagharshakyan, R. Bregovic, A. Gotchev, Light field reconstruction using shearlet transform, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (1) (2018) 133–147.

[4] G. Wu, M. Zhao, L. Wang, Q. Dai, T. Chai, Y. Liu, Light field reconstruction using deep convolutional network on EPI, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1–15.

[5] G. Wu, Y. Liu, L. Fang, Q. Dai, T. Chai, Light field reconstruction using convolutional network on EPI and extended applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (7) (2019) 1681–1694.

[6] Y. Wang, F. Liu, Z. Wang, G. Hou, Z. Sun, T. Tan, End-to-end view synthesis for light field imaging with pseudo 4DCNN, in: European Conference on Computer Vision (ECCV), 2018, pp. 1–16.

[7] H. W. F. Yeung, J. Hou, J. Chen, Y. Y. Chung, X. Chen, Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues, in: European Conference on Computer Vision (ECCV), 2018, pp. 1–17.

[8] H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, I. Kweon, Accurate depth map estimation from a lenslet light field camera, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.

[9] T.-C. Wang, A. Efros, R. Ramamoorthi, Occlusion-aware depth estimation using light-field cameras, in: IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3487–3495.

[10] T.-C. Wang, A. Efros, R. Ramamoorthi, Depth estimation with occlusion modeling using light-field cameras, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38 (11) (2016) 2170–2181.

[11] P. P. Srinivasan, T. Wang, A. Sreelal, R. Ramamoorthi, R. Ng, Learning to synthesize a 4D RGBD light field from a single image, in: International Conference on Computer Vision (ICCV), 2017, pp. 1–9.

[12] A. Levin, F. Durand, Linear view synthesis using a dimensionality gap light field prior, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1831–1838.

[13] T. E. Bishop, P. Favaro, The light field camera: Extended depth of field, aliasing, and superresolution, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34 (5) (2012) 972–986.

[14] S. Wanner, B. Goldluecke, Variational light field analysis for disparity estimation and superresolution, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 36 (3) (2014) 606–619.

[15] T. E. Bishop, S. Zanetti, P. Favaro, Light field superresolution, in: IEEE International Conference on Computational Photography (ICCP), 2009, pp. 1–9.

[16] K. Mitra, A. Veeraraghavan, Light field denoising, light field superresolution and stereo camera based refocussing using a GMM light field patch prior, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 22–28.

[17] S. Wanner, B. Goldluecke, Spatial and angular variational superresolution of 4D light fields, in: European Conference on Computer Vision (ECCV), 2012, pp. 608–621.

[18] J. Flynn, I. Neulander, J. Philbin, N. Snavely, DeepStereo: Learning to predict new views from the world's imagery, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–10.

[19] A. K. Vadathya, S. Girish, K. Mitra, A deep learning framework for light field reconstruction from focus-defocus pair: A minimal hardware approach, in: Computational Cameras and Displays (CCD) Workshop (CVPRW), 2018.

[20] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, N. Snavely, Stereo magnification: Learning view synthesis using multiplane images, ACM Transactions on Graphics (SIGGRAPH) 37 (4) (2018) 65:1–65:10.

[21] A. Ivan, Williem, I. K. Park, Synthesizing a 4D spatio-angular consistent light field from a single image, arXiv preprint, https://arxiv.org/abs/1903.12364 (2019).

[22] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, I. S. Kweon, Learning a deep convolutional network for light-field image super-resolution, in: IEEE International Conference on Computer Vision Workshop (ICCVW), 2015, pp. 1–9.

[23] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, I. S. Kweon, Light-field image super-resolution using convolutional neural network, IEEE Signal Processing Letters 24 (6) (2017) 848–852.

[24] L. Ruan, B. Chen, M.-L. Lam, Light field synthesis from a single image using improved Wasserstein generative adversarial network, in: EUROGRAPHICS, 2018, pp. 1–2.

[25] M. Guo, H. Zhu, G. Zhou, Q. Wang, Dense light field reconstruction from sparse sampling using residual network, in: Asian Conference on Computer Vision (ACCV), 2018, pp. 1–16.

[26] K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising, IEEE Transactions on Image Processing (TIP) 26 (7) (2017) 3142–3155.

[27] P. Buyssens, O. L. Meur, M. Daisy, D. Tschumperle, O. Lezoray, Depth-guided disocclusion inpainting of synthesized RGB-D images, IEEE Transactions on Image Processing (TIP) 26 (2) (2017) 525–538.

[28] J. Dai, T. Nguyen, View synthesis with hierarchical clustering based occlusion filling, in: IEEE International Conference on Image Processing (ICIP), 2017, pp. 1–5.

[29] Y.-R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, P. Hanrahan, Light field photography with a handheld plenoptic camera, Computer Science Technical Report (CSTR) 2 (11) (2005) 1–11.

[30] A. Lumsdaine, T. Georgiev, The focused plenoptic camera, in: IEEE International Conference on Computational Photography (ICCP), San Francisco, USA, 2009, pp. 1–8.

[31] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[32] C. Godard, O. M. Aodha, G. J. Brostow, Unsupervised monocular depth estimation with left-right consistency, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1–10.

[33] Z. Yin, J. Shi, GeoNet: Unsupervised learning of dense depth, optical flow and camera pose, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1–10.

[34] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: From error visibility to structural similarity, IEEE Transactions on Image Processing (TIP) 13 (4) (2004) 600–612.

[35] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, W. Xu, Occlusion aware unsupervised learning of optical flow, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1–10.

[36] D. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations (ICLR), 2015, pp. 1795–1812.