Journal Pre-proof

Depth-Guided View Synthesis for Light Field Reconstruction From a Single Image
Wenhui Zhou, Gaomin Liu, Jiangwei Shi, Hua Zhang, Guojun Dai

PII: S0262-8856(20)30006-8
DOI: https://doi.org/10.1016/j.imavis.2020.103874
Reference: IMAVIS 103874
To appear in: Image and Vision Computing
Received date: 27 September 2019
Accepted date: 7 January 2020

Please cite this article as: W. Zhou, G. Liu, J. Shi, et al., Depth-Guided View Synthesis for Light Field Reconstruction From a Single Image, Image and Vision Computing (2020), https://doi.org/10.1016/j.imavis.2020.103874
© 2020 Published by Elsevier.
Depth-Guided View Synthesis for Light Field Reconstruction From a Single Image

Wenhui Zhou a,b, Gaomin Liu a, Jiangwei Shi a, Hua Zhang a,∗, Guojun Dai a

a School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
b Zhejiang Provincial Key Laboratory of Information Processing, Communication and Networking, Zhejiang, China
Abstract
Light field imaging has recently become a promising technology for 3D rendering and display. However, capturing real-world light field images still faces many challenges in both quantity and quality. In this paper, we develop a learning-based technique to reconstruct a light field from a single 2D RGB image. It includes three steps: unsupervised monocular depth estimation, view synthesis and depth-guided view inpainting. We first propose a novel monocular depth estimation network to predict the disparity map of each sub-aperture view from the central view of the light field. Then we synthesize the initial sub-aperture views using a warping scheme. Considering that occlusion makes synthesis ambiguous for pixels invisible in the central view, we present a simple but effective fully convolutional network (FCN) for view inpainting. Note that the proposed network architecture is a general framework for light field reconstruction, which can be extended to take a sparse set of views as input without changing any structure or parameters of the network. Comparison experiments demonstrate that our method outperforms the state-of-the-art light field reconstruction methods with single-view input, and achieves comparable results with the multi-input methods.

Keywords: Light field, Convolutional neural network, Depth estimation, View synthesis, View inpainting

∗ Corresponding author. Email address: [email protected] (Hua Zhang)
1. Introduction

With the deepening of light field research and applications, light field imaging has become a promising imaging technology for 3D rendering and display, especially for virtual reality and augmented reality (VR and AR) [1]. Limited by the inherent trade-off between angular and spatial resolution, the quantity and quality of the light field images captured by commercial plenoptic cameras cannot meet application requirements. As an alternative, light field reconstruction from a single image or several images has attracted extensive attention in recent years [2, 3].

By exploiting the powerful ability of convolutional neural networks (CNNs), various learning-based light field reconstruction methods have been proposed to recover a high angular resolution (densely sampled) light field [4, 5, 6], and significant performance improvements have been achieved. Most of these methods take a sparse set of sub-aperture views as input, and utilize the geometric consistency between views to alleviate the ambiguity caused by occlusions and textureless regions. Yeung et al. investigated the impact of different sparse input patterns on reconstruction quality [7]. However, such sparse-view input is hardly available in many visual applications since the baselines between views are extremely narrow. Therefore, this paper focuses on light field reconstruction from one single image, which has a wider application range. Specifically, for a given 2D central view of a light field, we synthesize a dense collection of new views with light field benefits, such as digital refocusing and aperture adjustment. Because of the lack of geometric consistency between views, single-input methods face severe challenges, such as subpixel displacements [8] and occlusions [9, 10]. Obviously, tiny errors in the disparity maps and occlusions may result in evident artifacts and distortion in the synthesized views, as shown in Figure 1.
Figure 1: Exemplary illustration of artifacts and distortion in a synthesized view (panels: result of Srinivasan et al., our result, ground-truth view). In the result of Srinivasan et al. [11], an artifact occurs at the edge of a petal (red square), and a geometric distortion appears at the bottom edge of the curtain (blue rectangle).
To deal with this issue, we propose a depth-guided learning method for light field reconstruction, which can be divided into the following three aspects: unsupervised monocular depth estimation, view synthesis and depth-guided view inpainting. Specifically, we first predict a disparity map for each virtual view without any ground-truth depth information. Then the initial synthesized views are generated by the light field warping scheme. Obviously, when we synthesize the initial sub-aperture views from the given view, occlusion becomes a critical issue. Moreover, photometric inconsistency also results in warping errors. It is difficult to handle these warping errors in a monocular visual system because of the lack of multiview information. We treat them as a type of noise, and train a view inpainting network to inpaint these warping-error regions. Our main contribution is twofold: 1) An unsupervised monocular depth estimation method with multi-cue losses for light field. It has an encoder-decoder structure and takes the central view of the light field as input. To improve the prediction accuracy, a multi-cue loss is introduced according to the multi-orientation epipolar geometry of the light field. 2) A depth-guided view inpainting network with a residual structure. It is a 10-layer FCN with residual connections. Following the supervised learning paradigm, it trains an inpainting model from the initial synthesized views and the corresponding disparity maps.
Our network architecture is a general framework that can be extended to take a sparse set of views as input without changing any structure or parameters of the network. We perform a comprehensive evaluation of our model on public datasets. Our method outperforms the state-of-the-art light field reconstruction methods with single-view input, and achieves comparable results with the multi-input methods. Comparison experiments manifest the effectiveness and advantages of our paradigm.
2. Related Works
Early work on light field reconstruction may be traced back to research on light field super-resolution techniques [12, 13]. In the last decade, many innovations on light field reconstruction or light field synthesis have been proposed. They can be roughly divided into two categories: non-learning-based and learning-based. The former usually formulates view synthesis as an optimization problem [14] with various priors or assumptions, such as Lambertian and textural priors [15], Gaussian mixture models [16], variational models [17], etc. The latter designs an end-to-end learning framework to automatically learn depth cues and geometric information from a training set, and then reconstructs a dense set of sub-aperture views. For brevity, we herein only discuss the deep learning based approaches that are associated with our work.

2.1. Depth based View Synthesis

These approaches are usually based on the estimation of scene depth. They synthesize virtual views by using the estimated depth and the given images. Flynn et al. [18] proposed a deep network to synthesize novel views from a sequence of images with wide baselines. Kalantari et al. proposed a learning-based method to synthesize new views from a sparse set of input views [2]. They broke down the task of view synthesis into disparity and color estimation components, and used two sequential convolutional neural networks (CNNs) to model these two components.
Vadathya et al. [19] built their network on a similar pipeline: a DispNet taking focus-defocus pairs as input, and a RefineNet predicting occluded pixels and non-Lambertian effects. Zhou et al. [20] trained a network that infers alpha and multiplane images. The novel views are then synthesized by homography and alpha compositing.

Recently, several light field synthesis methods with single-view input have been proposed. Srinivasan et al. [11] presented an end-to-end CNN framework to synthesize a 4D RGBD light field from a single 2D RGB image. They factorized the light field synthesis problem into three subproblems: scene depth estimation, light field rendering, and occluded ray prediction. Ivan et al. [21] proposed an appearance flow learning network. It was used to synthesize a light field from a single image robustly and directly, without requiring any physics-based approaches or post-processing subnetworks.

2.2. Sampling based Angular Interpolation

These approaches do not require depth information as an auxiliary mapping. Instead, they take a sparse set of sub-aperture views as input, and formulate view synthesis as sampling and consecutive reconstruction (interpolation) of the plenoptic function.

Following the advancement of CNN based single image super-resolution (SISR), Yoon et al. [22, 23] developed a CNN model that jointly super-resolves the light field in both the spatial and angular domains.

Most recently, many state-of-the-art methods have been proposed based on Epipolar Plane Images (EPIs), and have achieved far superior performance over traditional approaches. Wu et al. [4, 5] modeled light field reconstruction as a learning-based angular detail restoration on 2D EPIs. Ruan et al. [24] proposed an improved Wasserstein Generative Adversarial Network with gradient penalty to upscale a low-resolution RGB image to a high-resolution light field image. They introduced the EPI constraint of light field into the loss function. Guo et al. [25] presented an EPI-based method to reconstruct new light fields between two mutually independent light fields. Wang et al. [6] built a Pseudo 4DCNN to synthesize new views of a dense 4D light field from sparse input views. Their Pseudo 4DCNN assembles 2D strided convolutions and two detail-restoration 3D CNNs.
Figure 2: Our network architecture. The DispNet (an encoder-decoder with four ResBlk encoding stages and six DecodeBlocks) takes the input central view and predicts the disparity maps; the warping operator W generates the initial synthesized results, which are concatenated (C) with the disparity maps and refined by the InpaintingNet (a stack of InpaintBlocks built from 1×1 and 3×3 convolutions with ELU, instance normalization and residual connections). The DispNet is trained with the unsupervised multi-cue loss, and the InpaintingNet with the inpainting loss against the ground-truth views.
3. Our Network Architecture
Different from light field reconstruction from sparse input views, approaches that reconstruct the light field from a single image lack the geometry constraints of EPIs or multiple views. In order to address this issue, we propose a depth-guided view synthesis scheme and a view inpainting method for light field reconstruction. The former learns geometric cues with a disparity estimation network, DispNet. The latter applies an FCN, InpaintingNet, to deal with ambiguous regions, such as occlusions. The schematic architecture of our network is depicted in Figure 2. The DispNet takes the central view of the light field as input, and predicts a disparity map for each sub-aperture view. These disparity maps are used to warp the central view to generate the initial synthesized views. The warping operator is implemented by bilinear interpolation.
Since some pixels lying in the real sub-aperture views are not visible in the central view, warping errors inevitably occur in the initial synthesized views. We treat these warping errors as noise. Inspired by CNN based image denoising [26] and disocclusion inpainting in free-viewpoint rendering [27, 28], the InpaintingNet is used to inpaint the ambiguous regions and the warping-error regions.
3.1. Light Field Warping Operator
Let $LF(x, u)$ be the 4D light field data, represented by the two-plane parametrization [29, 30], where $x$ and $u$ are the spatial and angular coordinates, respectively. The central view $I_{u_0}(x)$ is formed by the rays passing through the main lens optical center $u_0$:

$$I_{u_0}(x) = LF(x, u_0). \tag{1}$$

Given a sub-aperture image $I_u(x)$, the central view reconstructed by warping $I_u(x)$ can be expressed as follows:

$$\tilde{I}_{u \mapsto u_0}(x) = I_u\big(x + k \ast d_{u_0}(x)\big), \tag{2}$$

where $k = u_0 - u$, and $d_{u_0}(x)$ is the disparity map of the central view.

Similarly, we can reconstruct a virtual sub-aperture view from the central view $I_{u_0}(x)$ as follows:

$$\tilde{I}_{u_0 \mapsto u}(x) = I_{u_0}\big(x - k \ast d_u(x)\big), \tag{3}$$

where $d_u(x)$ is the disparity map of the sub-aperture view $I_u(x)$.
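As an illustration of Eq. (3), the following Python/NumPy sketch warps the central view to a target sub-aperture position with bilinear interpolation. The function name, array layout and border clamping are our own assumptions, not details specified in the paper.

```python
import numpy as np

def warp_central_to_view(central, disp, k_u, k_v):
    """Sketch of Eq. (3): sample the central view I_{u0} at x - k * d_u(x).
    central: (H, W, 3) central view; disp: (H, W) disparity map d_u of the
    target view; (k_u, k_v): components of the angular offset k = u0 - u."""
    H, W, _ = central.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Source coordinates in the central view (Eq. 3), clamped to the border.
    src_x = np.clip(xs - k_u * disp, 0, W - 1.001)
    src_y = np.clip(ys - k_v * disp, 0, H - 1.001)

    x0, y0 = np.floor(src_x).astype(int), np.floor(src_y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx = (src_x - x0)[..., None]
    wy = (src_y - y0)[..., None]

    # Bilinear blend of the four neighbouring central-view pixels.
    top = (1 - wx) * central[y0, x0] + wx * central[y0, x1]
    bottom = (1 - wx) * central[y1, x0] + wx * central[y1, x1]
    return (1 - wy) * top + wy * bottom
```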
3.2. DispNet

Our DispNet is an unsupervised monocular disparity estimation network with an encoder-decoder structure. It takes the central view of the light field as input, and predicts the disparity maps of all sub-aperture views without any ground-truth information.
We use ResNet50 [31] as the encoder for its effective residual learning. The decoder is made up of deconvolution layers. It uses nearest-neighbor up-sampling to enlarge the spatial resolution of the feature maps back to the full input scale. We add long skip connections between the encoder and decoder in order to forward both global high-level and local detailed information. Similar structures are widely used in many unsupervised methods, such as [32, 33]. The key difference between our DispNet and the previously proposed methods is that we propose a novel unsupervised multi-cue loss function that combines the characteristics of light field geometry with a data consistency measure. Figure 3 shows that our disparity estimation method achieves significantly superior performance over the state-of-the-art approaches [2, 11].

Our unsupervised loss includes three terms: photometric loss $L_p$, defocus loss $L_r$ and disparity-consistency loss $L_c$:

$$L_{total} = L_p + \alpha_1 L_r + \alpha_2 L_c, \tag{4}$$

where $\alpha_1$ and $\alpha_2$ are set to 10 and 0.001, respectively, in our experiments.
We adopt the L1 distance and the single-scale structural similarity index measure (SSIM) [34] to compute the image similarity between an image $I$ and its reconstructed version $\tilde{I}$:

$$\psi\big(I, \tilde{I}\big) = \beta\,\frac{1 - \mathrm{SSIM}\big(I, \tilde{I}\big)}{2} + (1 - \beta)\,\big\|I - \tilde{I}\big\|_1, \tag{5}$$

where $\beta$ is set to 0.85. The photometric loss $L_p$ is defined as follows:

$$L_p = \sum_{u}\Big[\psi\big(I_{u_0}(x), \tilde{I}_{u \mapsto u_0}(x)\big) + \psi\big(I_u(x), \tilde{I}_{u_0 \mapsto u}(x)\big)\Big]. \tag{6}$$
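A minimal TensorFlow sketch of the similarity measure of Eq. (5) and the photometric loss of Eq. (6) is given below. The function names and the reduction over pixels and batches are our assumptions, and tf.image.ssim stands in for the single-scale SSIM of [34].

```python
import tensorflow as tf

def psi(img, img_rec, beta=0.85):
    """Eq. (5): beta * (1 - SSIM)/2 + (1 - beta) * L1, for float images in
    [0, 1] with shape (B, H, W, 3)."""
    ssim = tf.image.ssim(img, img_rec, max_val=1.0)               # (B,)
    l1 = tf.reduce_mean(tf.abs(img - img_rec), axis=[1, 2, 3])    # (B,)
    return tf.reduce_mean(beta * (1.0 - ssim) / 2.0 + (1.0 - beta) * l1)

def photometric_loss(central, views, central_recons, view_recons):
    """Eq. (6): for every sub-aperture view, compare the central view with its
    warped-back reconstruction and the view with its synthesis from the
    central view, then sum the similarities."""
    loss = 0.0
    for view, c_rec, v_rec in zip(views, central_recons, view_recons):
        loss += psi(central, c_rec) + psi(view, v_rec)
    return loss
```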
Figure 3: Disparity estimation comparisons among Srinivasan et al. [11], Kalantari et al. [2] and our method (columns: central views of the light field, disparity results of Srinivasan et al., disparity results of Kalantari et al., our disparity results). Our method generates more accurate disparity estimation not only on the object surfaces but also around the object boundaries.

One of the most distinctive properties of a light field image is that it contains enough angular resolution for refocusing, which usually implies an important defocus cue associated with depth information. Herein, we compute an integral image by accumulating all the reconstructed central views $\tilde{I}_{u \mapsto u_0}(x)$:

$$I_{integ}(x) = \frac{1}{N_u}\sum_{u}\tilde{I}_{u \mapsto u_0}(x), \tag{7}$$
where $N_u$ is the number of sub-aperture views. Obviously, $I_{integ}$ is an all-in-focus image (an image without defocus blur) if the disparity map of the central view is absolutely correct. Inaccurate disparity estimation of the central view results in defocus blur in the integral image. Therefore, the defocus loss $L_r$ is simply defined as the image similarity between the integral image and the central view:

$$L_r = \psi\big(I_{u_0}(x), I_{integ}(x)\big). \tag{8}$$
By mimicking the traditional forward-backward or left-right consistency checks [32, 33, 35], we use the L1 distance to compute our disparity-consistency loss $L_c$ as follows:

$$L_c = \sum_{u}\big\|d_u(x) - \tilde{d}_{u_0 \mapsto u}(x)\big\|_1, \qquad \tilde{d}_{u_0 \mapsto u}(x) = d_{u_0}\big(x - k \ast d_u(x)\big). \tag{9}$$
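Continuing the same sketch (and reusing the hypothetical psi helper from the earlier block), the defocus loss of Eqs. (7)-(8), the disparity-consistency loss of Eq. (9) and the total loss of Eq. (4) could be assembled as follows; the pixel-wise reductions are our assumptions.

```python
import tensorflow as tf
# `psi` is the Eq. (5) similarity defined in the earlier sketch.

def defocus_loss(central, central_recons):
    """Eqs. (7)-(8): average the central views reconstructed from every
    sub-aperture position into an integral image, then compare it with the
    real central view; residual blur signals inaccurate disparity."""
    integral = tf.reduce_mean(tf.stack(central_recons, axis=0), axis=0)
    return psi(central, integral)

def disparity_consistency_loss(view_disps, central_disp_warped):
    """Eq. (9): L1 distance between each predicted view disparity d_u and the
    central-view disparity warped to that view."""
    loss = 0.0
    for d_u, d_warp in zip(view_disps, central_disp_warped):
        loss += tf.reduce_mean(tf.abs(d_u - d_warp))
    return loss

def multi_cue_loss(l_p, l_r, l_c, alpha1=10.0, alpha2=0.001):
    """Eq. (4) with the weights reported in the paper's experiments."""
    return l_p + alpha1 * l_r + alpha2 * l_c
```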
3.3. InpaintingNet

Different from the occlusion prediction network [11] and RefineNet [19, 33], we treat the warping error as a type of noise, and try to learn a denoising model from the image structure and the depth information.
Specifically, we train a simple but effective FCN, named InpaintingNet, by following the supervised learning paradigm. It includes three InpaintBlocks with residual connections, as shown in Figure 2. We concatenate the initial synthesized views and the corresponding disparity maps, and then send them into the InpaintingNet in turn. The inpainting loss is defined as follows:

$$L_i = \frac{1}{N_u}\sum_{u}\psi\big(I_u(x), \hat{I}_u(x)\big), \tag{10}$$

where $\hat{I}_u(x)$ is one of the final synthesized results of the InpaintingNet, and $N_u$ is the number of sub-aperture views.
Figure 4: Visualization comparisons between the initial synthesized views and the inpainted synthesized views (columns: disparity maps, initial synthesized views, final synthesized views). To make it clear, we set the maximum permissible pixel error to 15, and mark the bad pixels in yellow. It is obvious that the bad pixels are mainly located on the object boundaries, and our InpaintingNet can correct most of them.
In order to preserve the high-frequency details in the synthesized views, we remove the pooling operations throughout the network and keep the input and output of each layer at the same size. The DispNet and the light field warping operator are based on the Lambertian assumption. Therefore, warping errors inevitably occur at occlusions and non-Lambertian reflectance regions. The InpaintingNet cascades residual blocks to handle these warping errors. In order to emphasize the importance of enforcing disparity consistency in the predictions, the predicted disparity maps are also fed into the InpaintingNet, so that the InpaintingNet has the depth information needed to understand which rays in the initial synthesized views are incorrect. Figure 4 demonstrates that our InpaintingNet can correct the warping errors. Network ablation experiments in Section 4.4 also verify its effectiveness.

4. Experimental Evaluation

In this section, we first introduce the implementation details of our network. Then we evaluate our method on two public datasets [2, 11], and analyze the impact of different network components on the performance of our method.

4.1. Datasets and Pre-processing

The public "Flowers" dataset [11] and "30 Scenes" dataset [2], captured by the Lytro Illum camera, are selected for evaluation. The size of these light field images is 14 × 14 × 376 × 541. In order to reduce the impact of ambiguous correspondences, we exclude some training data that contain large texture-less or weakly-textured regions caused by low-light or unsuitable exposure conditions, as shown in Figure 5.

Figure 5: Examples of the "bad" data removed from the training set. Low light results in large texture-less or weakly-textured regions.

Data augmentation is performed to prevent overfitting and improve the generalization ability of our model. We use general augmentation techniques, each applied with a 50% chance. Specifically, we perform random gamma changes (within [0.8, 1.2]), random brightness changes (within [0.5, 2.0]), and random multiplicative color changes for each color channel separately (within [0.8, 1.2]).
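The augmentation described above can be sketched as follows (eager-mode TensorFlow); the 50% application chance and the value ranges come from the paper, while the function name and the clipping to [0, 1] are our assumptions.

```python
import tensorflow as tf

def augment(image):
    """Sec. 4.1 augmentation sketch: gamma, brightness and per-channel color
    scaling, each applied with a 50% chance; `image` is a float tensor in
    [0, 1] with shape (H, W, 3)."""
    if tf.random.uniform([]) < 0.5:                      # random gamma
        image = image ** tf.random.uniform([], 0.8, 1.2)
    if tf.random.uniform([]) < 0.5:                      # random brightness
        image = image * tf.random.uniform([], 0.5, 2.0)
    if tf.random.uniform([]) < 0.5:                      # per-channel color scaling
        image = image * tf.random.uniform([3], 0.8, 1.2)
    return tf.clip_by_value(image, 0.0, 1.0)
```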
Table 1: The detailed architecture of the DispNet.
Input (BatchNum, Height, Width, Channels): (1, 376, 541, 3)

Layer name     Module      Output size          Activate/Norm
7×7 Conv       7×7         (1, 192, 288, 64)    ELU/IN
Maxpooling     3×3         (1, 96, 144, 64)     ELU/IN
Encode1        ResBlk ×3   (1, 48, 72, 256)     –
Encode2        ResBlk ×4   (1, 24, 36, 512)     –
Encode3        ResBlk ×6   (1, 12, 18, 1024)    –
Encode4        ResBlk ×3   (1, 6, 9, 2048)      –
DecodeBlock1   3×3         (1, 12, 18, 512)     ELU/IN
DecodeBlock2   3×3         (1, 24, 36, 256)     ELU/IN
DecodeBlock3   3×3         (1, 48, 72, 128)     ELU
DecodeBlock4   3×3         (1, 96, 144, 64)     ELU
DecodeBlock5   3×3         (1, 192, 288, 32)    ELU
DecodeBlock6   3×3         (1, 384, 576, 49)    Sigmoid
Table 2: The detailed architecture of the InpaintingNet.
Input (BatchNum, Height, Width, Channels): (1, 376, 541, 4)

Layer name      Module   Output size          Activate/Norm
InpaintBlock1   1×1      (1, 376, 541, 8)     ELU
InpaintBlock2   3×3      (1, 376, 541, 16)    ELU
InpaintBlock3   3×3      (1, 376, 541, 8)     ELU
3×3 Conv        3×3      (1, 376, 541, 3)     ReLU
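For illustration, a Keras sketch of the layer stack suggested by Table 2 and Figure 2 is given below. The exact composition of each InpaintBlock (ordering of the 1×1/3×3 convolutions, placement of instance normalization, and how the residual shortcut matches channel counts) is our assumption rather than the authors' released implementation; instance normalization is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inpaint_block(x, channels):
    """One assumed InpaintBlock: a 1x1 and a 3x3 convolution with ELU
    activations and a residual connection (shortcut projected to `channels`)."""
    shortcut = layers.Conv2D(channels, 1, padding="same")(x)
    y = layers.Conv2D(channels, 1, padding="same", activation="elu")(x)
    y = layers.Conv2D(channels, 3, padding="same", activation="elu")(y)
    return layers.Add()([shortcut, y])

def build_inpainting_net(height=376, width=541):
    """Input: an initial synthesized view concatenated with its disparity map
    (4 channels); output: the refined 3-channel view. No pooling, so every
    layer keeps the input resolution."""
    inp = layers.Input(shape=(height, width, 4))
    x = inpaint_block(inp, 8)
    x = inpaint_block(x, 16)
    x = inpaint_block(x, 8)
    out = layers.Conv2D(3, 3, padding="same", activation="relu")(x)
    return tf.keras.Model(inp, out)
```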
4.2. Implementation Details

The input of our network is a central view of the light field with a spatial resolution of 376 × 541, and the output is a 7 × 7 × 376 × 541 light field. More detailed layer-by-layer definitions are listed in Table 1 and Table 2, respectively. Our model contains about 58.64 million trainable parameters.

Our network, implemented with TensorFlow in Python, is trained on a machine with an Intel Core i7-5930K 3.5 GHz CPU, 80 GB of DDR4 memory and an NVIDIA GTX 1080Ti GPU. We train it with the Adam optimizer (β1 = 0.9, β2 = 0.999) [36]. The initial learning rate is set to 0.0004, kept unchanged for the first 30,000 iterations, and then halved every 30,000 iterations until the end. It takes about 40 hours for our network to converge.
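The learning-rate schedule above can be written as a small helper; the function form is ours.

```python
def learning_rate(step, base_lr=4e-4):
    """Sec. 4.2 schedule: constant for the first 30,000 iterations, then
    halved every further 30,000 iterations until training ends."""
    return base_lr * 0.5 ** (step // 30_000)

# e.g. learning_rate(10_000) == 4e-4, learning_rate(45_000) == 2e-4
```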
4.3. Performance Comparison

We compare our method with five state-of-the-art methods: two methods with single-view input (Srinivasan et al. [11] and Ivan et al. [21]), and three methods with sparse-view input (Kalantari et al. [2], Wu et al. [4, 5], and Wang et al. [6]). For fair comparisons, we also extend our original network to a sparse-input network without changing any structure or parameters of the network. Note that our network architecture is a general framework for light field reconstruction, which can be simply extended to take a sparse set of views as input, just by concatenating the input views in the channel dimension. That is, the input dimension of our sparse-input network is BatchNum × Height × Width × (Num · Channels), where Num is the number of input views. Without loss of generality, herein we use a 3 × 3 sparse input, which has the same degree of sparsity as those of Wu et al. [4, 5] and Wang et al. [6]. So the input of our sparse-input model is 1 × 376 × 541 × 27 in the evaluation experiments.

Following the numerical evaluation methods proposed in previous papers [4, 5, 6], the gray-scale SSIM [34] and the peak signal-to-noise ratio (PSNR) are used to evaluate the light field reconstruction results. The SSIM is a well-known similarity metric based on the quality perception of the human visual system (HVS). It models image quality distortion as a combination of three different factors: loss of correlation, luminance distortion and contrast distortion. The value of the SSIM metric is between 0 and 1; the higher the SSIM value, the more similar the images. The PSNR metric measures the numerical difference between the synthesized sub-aperture views and the ground truths; a higher PSNR value denotes a better quality of the reconstructed light field. To better demonstrate the performance of the reconstruction methods, the input view(s) are excluded when we compute the SSIM and PSNR of the synthesized sub-aperture views, instead of using all views of the reconstructed light field as most existing methods did. The evaluation metrics of a reconstructed
light field can be formulated as follows:

$$\mathrm{MSSIM} = \frac{1}{N_s}\sum_{u_s \in LF}\mathrm{SSIM}\big(\tilde{I}'_{u_s}(x), I'_{u_s}(x)\big), \qquad \mathrm{MPSNR} = \frac{1}{N_s}\sum_{u_s \in LF}\mathrm{PSNR}\big(\tilde{I}'_{u_s}(x), I'_{u_s}(x)\big), \tag{11}$$

where $N_s$ is the number of synthesized views, and $u_s$ is the angular coordinate of a synthesized view. $\tilde{I}'$ and $I'$ are the luminance components of the images $\tilde{I}$ and $I$, respectively.
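A sketch of how Eq. (11) could be computed with TensorFlow's built-in SSIM and PSNR is shown below; the luminance conversion via tf.image.rgb_to_yuv and the list-based interface are our assumptions, and the input view is assumed to be removed from both lists beforehand.

```python
import tensorflow as tf

def evaluate_light_field(synth_views, gt_views):
    """Eq. (11): gray-scale SSIM and PSNR averaged over the synthesized views
    only. Both arguments are lists of (H, W, 3) float tensors in [0, 1]."""
    mssim, mpsnr = 0.0, 0.0
    for synth, gt in zip(synth_views, gt_views):
        synth_y = tf.image.rgb_to_yuv(synth)[..., :1]    # luminance component
        gt_y = tf.image.rgb_to_yuv(gt)[..., :1]
        mssim += tf.image.ssim(synth_y[None], gt_y[None], max_val=1.0)[0]
        mpsnr += tf.image.psnr(synth_y[None], gt_y[None], max_val=1.0)[0]
    n = float(len(synth_views))
    return mssim / n, mpsnr / n
```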
4.3.1. "Flowers" Dataset

The "Flowers" dataset contains 3343 light fields of flowers and plants. We randomly divide this dataset into two groups: 3009 for training and 334 for testing.

We compute the MSSIM and MPSNR over all 334 testing images, and compare the performance between our original method and two state-of-the-art single-view input methods [11, 21]. Note that we retrained the model of Srinivasan et al. [11] with their source code, which was downloaded from the authors' webpage1.

Figure 6 shows some visualization comparisons between our original model and the method of Srinivasan et al. [11]. Notable artifacts and distortion can be observed in the synthesized results of [11], while our results are almost free of artifacts and much closer to the ground-truth views in appearance. Figure 7 compares the statistical histograms of MSSIM and MPSNR between our original model and the method of Srinivasan et al. [11]. It is obvious that our histograms have a more compact distribution and higher mean values. For numerical comparisons, the average values of MSSIM and MPSNR are listed in Table 3. It is evident that our original method is superior to the two state-of-the-art single-view input methods [11, 21]. Without loss of generality, we verify that our sparse-input model, with a 3 × 3 sparse-view input, obtains the best performance.

1 https://people.eecs.berkeley.edu/~pratul/
Figure 6: Visualization comparisons between our original method and [11] on the "Flowers" dataset (columns: ground-truth views, disparity maps of Srinivasan et al., synthesized views of Srinivasan et al., our disparity maps, our synthesized views). The image regions in the small color squares are zoomed in to reveal more details. By comparing the big squares with the same color in each row, it is easy to conclude that our synthesized results are almost free of artifacts and much closer to the ground-truth views in appearance.
4.3.2. "30 Scenes" Dataset

The "30 Scenes" dataset [2] contains 130 light fields of various scenes, of which 100 scenes are for training and 30 scenes for testing. Herein we test the generality of our original model and our sparse-input model, both trained on the "Flowers" dataset. The reconstructed results of the three state-of-the-art sparse-view input methods [2, 4, 5, 6] are directly obtained with the models provided by the authors (Kalantari et al.2, Wu et al.3, and Wang et al.4).

Figure 7: Quantitative comparison by the statistical histograms of MSSIM and MPSNR.

Table 3: Quantitative results on the "Flowers" dataset [11]. Bold indicates the best performance.

Methods                   Input Mode           MSSIM     MPSNR
Srinivasan et al. [11]    single input         0.9192    33.04
Ivan et al. [21]          single input         0.93      36.31
Our original model        single input         0.9718    39.05
Our sparse-input model    3 × 3 sparse input   0.9948    49.60

We compute the MSSIM and MPSNR over the 30 testing scenes. Figure 8 depicts the quantitative comparisons of the MSSIM and MPSNR for each testing scene. Obviously, our sparse-input model performs significantly better than the other methods, and our original model achieves comparable results with the sparse-view input methods. This is also demonstrated by the average values of MSSIM and MPSNR listed in Table 4.

2 http://cseweb.ucsd.edu/~viscomp/projects/LF/papers/SIGASIA16/
3 https://github.com/GaochangWu/lfepicnn/tree/master
4 https://github.com/DalonsmaryCASIA/Pseudo4DCNN
Figure 8: Quantitative comparisons by the MSSIM and MPSNR of each testing scene on the "30 Scenes" dataset (curves: our sparse-input model, our original model, Wu et al., Wang et al., Kalantari et al., Srinivasan et al.).
Table 4: Quantitative results on the "30 Scenes" dataset [2]. Bold indicates the best performance.

Methods                   Input Mode           MSSIM     MPSNR
Srinivasan et al. [11]    single input         0.8501    28.82
Ivan et al. [21]          single input         0.86      29.82
Kalantari et al. [2]      2 × 2 sparse input   0.9693    36.98
Wang et al. [6]           3 × 3 sparse input   0.9698    33.96
Wu et al. [4, 5]          3 × 3 sparse input   0.9823    42.24
Our original model        single input         0.9317    33.68
Our sparse-input model    3 × 3 sparse input   0.9913    44.92
Table 5: Average runtime comparisons on the "30 Scenes" dataset [2]. Bold indicates the shortest runtime.

Methods                   CUDA    Avg. runtime (s)
Kalantari et al. [2]      –       16.2032
Wu et al. [4, 5]          –       6.8571
Wang et al. [6]           √       2.0435
Srinivasan et al. [11]    √       0.0146
Our original model        √       0.0072
Our sparse-input model    √       0.0110
Table 6: Performance comparisons of different configurations. Bold indicates the best performance.

DispNet losses       Inpainting   "Flowers"           "30 Scenes"
Lp    Lr    Lc                    MSSIM    MPSNR      MSSIM    MPSNR
√                                 0.9475   37.56      0.9064   31.22
√     √                           0.9551   37.80      0.9117   31.89
√           √                     0.9520   37.45      0.9135   31.90
√     √     √                     0.9609   38.04      0.9258   32.31
√     √     √        √            0.9718   39.05      0.9317   33.68
Table 5 provides the average runtime comparisons on the 30 testing scenes. Because our original model takes the whole central view as input and synthesizes all views simultaneously, our prediction time is significantly less than those of the state-of-the-art methods.

4.4. Network Ablation Experiments

To analyze the impact of the different loss configurations and the InpaintingNet on the performance of light field reconstruction, ablation experiments are carried out on the "Flowers" and "30 Scenes" datasets. Table 6 provides the performance comparisons of different configurations. The quantitative improvements verify the effectiveness of the proposed loss functions and the InpaintingNet.
5. Conclusions and Future Work

We have designed a depth-guided view synthesis and inpainting method for light field reconstruction from a single image. It includes two sequential CNNs (DispNet and InpaintingNet). The proposed network is a general framework for light field reconstruction, and can easily be extended to a multi-input network. We have evaluated our method on the public "Flowers" and "30 Scenes" datasets. Both quantitative evaluation and visual comparisons demonstrate the superiority of our method. Our original model outperforms the state-of-the-art methods with single-view input, and achieves comparable results with the multi-input methods. Our sparse-input model is significantly better than the existing methods. For future work, we will try to introduce the low-level structure information of objects to refine the object boundaries.
6. Acknowledgment

This work is supported in part by the National Key R&D Program of China (2017YFE0118200), the Open Project of the Zhejiang Provincial Key Laboratory of Information Processing, Communication and Networking, and the Key Program of the Zhejiang Provincial Natural Science Foundation of China (LZ14F020003). The authors are grateful to the anonymous reviewers who made constructive comments.
References
[1] F.-C. Huang, K. Chen, G. Wetzstein, The light field stereoscope: Immersive computer graphics via factored near-eye light field displays with focus cues, ACM Transactions on Graphics (SIGGRAPH) 34 (4) (2015) 60:1–60:12.
[2] N. K. Kalantari, T.-C. Wang, R. Ramamoorthi, Learning-based view synthesis for light field cameras, ACM Transactions on Graphics (SIGGRAPH Asia) 35 (6) (2016) 193:1–193:10.
[3] S. Vagharshakyan, R. Bregovic, A. Gotchev, Light field reconstruction using shearlet transform, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (1) (2018) 133–147.
[4] G. Wu, M. Zhao, L. Wang, Q. Dai, T. Chai, Y. Liu, Light field reconstruction using deep convolutional network on EPI, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1–15.
[5] G. Wu, Y. Liu, L. Fang, Q. Dai, T. Chai, Light field reconstruction using convolutional network on EPI and extended applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (7) (2019) 1681–1694.
[6] Y. Wang, F. Liu, Z. Wang, G. Hou, Z. Sun, T. Tan, End-to-end view synthesis for light field imaging with pseudo 4DCNN, in: European Conference on Computer Vision (ECCV), 2018, pp. 1–16.
[7] H. W. F. Yeung, J. Hou, J. Chen, Y. Y. Chung, X. Chen, Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues, in: European Conference on Computer Vision (ECCV), 2018, pp. 1–17.
[8] H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, I. Kweon, Accurate depth map estimation from a lenslet light field camera, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[9] T.-C. Wang, A. Efros, R. Ramamoorthi, Occlusion-aware depth estimation using light-field cameras, in: IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3487–3495.
[10] T.-C. Wang, A. Efros, R. Ramamoorthi, Depth estimation with occlusion modeling using light-field cameras, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38 (11) (2016) 2170–2181.
[11] P. P. Srinivasan, T. Wang, A. Sreelal, R. Ramamoorthi, R. Ng, Learning to synthesize a 4D RGBD light field from a single image, in: International Conference on Computer Vision (ICCV), 2017, pp. 1–9.
[12] A. Levin, F. Durand, Linear view synthesis using a dimensionality gap light field prior, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1831–1838.
[13] T. E. Bishop, P. Favaro, The light field camera: Extended depth of field, aliasing, and superresolution, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34 (5) (2012) 972–986.
[14] S. Wanner, B. Goldluecke, Variational light field analysis for disparity estimation and superresolution, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 36 (3) (2014) 606–619.
[15] T. E. Bishop, S. Zanetti, P. Favaro, Light field superresolution, in: IEEE International Conference on Computational Photography (ICCP), 2009, pp. 1–9.
[16] K. Mitra, A. Veeraraghavan, Light field denoising, light field superresolution and stereo camera based refocussing using a GMM light field patch prior, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 22–28.
[17] S. Wanner, B. Goldluecke, Spatial and angular variational superresolution of 4D light fields, in: European Conference on Computer Vision (ECCV), 2012, pp. 608–621.
[18] J. Flynn, I. Neulander, J. Philbin, N. Snavely, DeepStereo: Learning to predict new views from the world's imagery, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–10.
[19] A. K. Vadathya, S. Girish, K. Mitra, A deep learning framework for light field reconstruction from focus-defocus pair: A minimal hardware approach, in: Computational Cameras and Displays (CCD) Workshop (CVPRW), 2018.
[20] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, N. Snavely, Stereo magnification: Learning view synthesis using multiplane images, ACM Transactions on Graphics (SIGGRAPH) 37 (4) (2018) 65:1–65:10.
[21] A. Ivan, Williem, I. K. Park, Synthesizing a 4D spatio-angular consistent light field from a single image, arXiv.org, https://arxiv.org/abs/1903.12364 (2019).
[22] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, I. S. Kweon, Learning a deep convolutional network for light-field image super-resolution, in: IEEE International Conference on Computer Vision Workshop (ICCVW), 2015, pp. 1–9.
[23] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, I. S. Kweon, Light-field image super-resolution using convolutional neural network, IEEE Signal Processing Letters 24 (6) (2017) 848–852.
[24] L. Ruan, B. Chen, M.-L. Lam, Light field synthesis from a single image using improved Wasserstein generative adversarial network, in: EUROGRAPHICS, 2018, pp. 1–2.
[25] M. Guo, H. Zhu, G. Zhou, Q. Wang, Dense light field reconstruction from sparse sampling using residual network, in: Asian Conference on Computer Vision (ACCV), 2018, pp. 1–16.
[26] K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising, IEEE Transactions on Image Processing (TIP) 26 (7) (2017) 3142–3155.
[27] P. Buyssens, O. L. Meur, M. Daisy, D. Tschumperle, O. Lezoray, Depth-guided disocclusion inpainting of synthesized RGB-D images, IEEE Transactions on Image Processing (TIP) 26 (2) (2017) 525–538.
[28] J. Dai, T. Nguyen, View synthesis with hierarchical clustering based occlusion filling, in: IEEE International Conference on Image Processing (ICIP), 2017, pp. 1–5.
[29] Y.-R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, P. Hanrahan, Light field photography with a handheld plenoptic camera, Computer Science Technical Report (CSTR) 2 (11) (2005) 1–11.
[30] A. Lumsdaine, T. Georgiev, The focused plenoptic camera, in: IEEE International Conference on Computational Photography (ICCP), San Francisco, USA, 2009, pp. 1–8.
[31] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[32] C. Godard, O. M. Aodha, G. J. Brostow, Unsupervised monocular depth estimation with left-right consistency, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1–10.
[33] Z. Yin, J. Shi, GeoNet: Unsupervised learning of dense depth, optical flow and camera pose, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1–10.
[34] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing (TIP) 13 (4) (2004) 600–612.
[35] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, W. Xu, Occlusion aware unsupervised learning of optical flow, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1–10.
[36] D. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations (ICLR), 2015, pp. 1795–1812.