ISPRS Journal of Photogrammetry and Remote Sensing 157 (2019) 201–215

Context pyramidal network for stereo matching regularized by disparity gradients


Junhua Kang a,b,⁎, Lin Chen b, Fei Deng a,⁎, Christian Heipke b

a School of Geodesy and Geomatics, Wuhan University, Wuhan, China
b Institute of Photogrammetry and GeoInformation (IPI), Leibniz Universität Hannover, Hannover, Germany

ARTICLE INFO

Keywords: Stereo matching; Dilated convolution; Structure preserving; Gradient regularizer

ABSTRACT

Even after many years of research, stereo matching remains a challenging task in photogrammetry and computer vision. Recent work has achieved great progress by formulating dense stereo matching as a pixel-wise learning task to be resolved with a deep convolutional neural network (CNN). However, most estimation methods, including traditional and deep learning approaches, still have difficulty handling challenging real-world scenarios, especially those including large depth discontinuities and low-texture areas. To tackle these problems, we investigate a recently proposed end-to-end disparity learning network, DispNet (Mayer et al., 2015), and improve it to yield better results in these problematic areas. The improvements consist of three major contributions. First, we use dilated convolutions to develop a context pyramidal feature extraction module. A dilated convolution expands the receptive field when extracting features and aggregates more contextual information, which makes our network more robust in weakly textured areas. Second, we construct the matching cost volume with patch-based correlation to handle larger disparities. We also modify the basic encoder-decoder module to regress detailed disparity images at full resolution. Third, instead of using post-processing steps to impose smoothness in the presence of depth discontinuities, we incorporate disparity gradient information as a gradient regularizer into the loss function to preserve local structure details in areas with large depth discontinuities. We evaluate our model in terms of end-point error on several challenging stereo datasets including Scene Flow, Sintel and KITTI. Experimental results demonstrate that our model decreases the estimation error compared with DispNet on most datasets (e.g. we obtain an improvement of 46% on Sintel) and estimates better structure-preserving disparity maps. Moreover, our proposal also achieves competitive performance compared to other methods.

1. Introduction

Over the last decades, stereo matching has continuously been an active research area in photogrammetry and computer vision. It is broadly applied in many applications, such as topographic mapping, robotics and autonomous driving, 3D model reconstruction, object detection and recognition. The core task of stereo matching is to find point-wise correspondences between images, and thus to calculate the parallax (called disparity in computer vision) of corresponding points between images. Once the parallax is calculated, the depth of the point can be derived and 3D information is thus retrieved from stereo images. A traditional pipeline for stereo matching includes four steps (Scharstein and Szeliski, 2002), namely matching cost computation to measure the similarity of points from two images, cost aggregation to smooth matching costs in the local neighborhood, disparity calculation to predict an initial disparity from the cost volume and, finally, disparity refinement to obtain sub-pixel results.

Traditional stereo matching methods are well studied and follow the above popular stereo pipeline. However, such a stage-wise pipeline is prone to errors in each step and lacks an overall objective for optimization. Most methods have limited performance in challenging situations, such as areas with poor texture, repetitive patterns, occlusions and areas with large height discontinuities. Recently, deep learning techniques have shown powerful capabilities for various vision tasks, such as object detection (Ren et al., 2015), recognition (He et al., 2016), image segmentation and classification (Badrinarayanan et al., 2017; Chen et al., 2018). For stereo matching, convolutional neural networks (CNNs) have first been introduced to calculate matching costs in Matching Cost CNN (MC-CNN) (Žbontar and Lecun, 2016).



Corresponding authors at: School of Geodesy and Geomatics, Wuhan University, Wuhan, China (F. Deng and J. Kang). E-mail addresses: [email protected] (J. Kang), [email protected] (F. Deng).

https://doi.org/10.1016/j.isprsjprs.2019.09.012 Received 17 March 2019; Received in revised form 18 September 2019; Accepted 19 September 2019 Available online 27 September 2019 0924-2716/ © 2019 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.


Instead of using hand-crafted matching cost metrics, the authors present a Siamese CNN for measuring the similarity between image patches. Most other patch-based stereo methods also focus on using CNNs to generate unary terms as a similarity measure (Chen and Yuan, 2016; Luo et al., 2016). Though CNN based methods generally outperform the traditional ones, these algorithms suffer from a high computational burden because they calculate matching costs at all potential disparities. In addition, they require extra post-processing steps and hand-crafted regularization to produce complete disparity results, because matching cost computation is only one step of stereo matching. Therefore, it was suggested to integrate all steps into an end-to-end network to directly estimate the disparity from stereo images. DispNet is the first such end-to-end learning framework (Mayer et al., 2015); it was derived from FlowNet (Dosovitskiy et al., 2015). Both methods are restricted to rectified stereo images. The network architecture of DispNet follows a coarse-to-fine fashion called an encoder-decoder structure. It encodes information of a wider context at low resolution through successive convolutions and activations and then decodes the result back to the original resolution by successive deconvolutions.

In this paper we use DispNet as the basic architecture and propose an end-to-end context pyramidal network for stereo matching regularized by disparity gradients. In order to improve the performance in areas of weak texture, we use dilated convolution (Holschneider et al., 1990; Yu and Koltun, 2016) to encode richer context cues when extracting features. Dilated convolution enlarges the receptive field of a CNN without requiring extra learning parameters. We construct a context pyramidal module in our network with different dilated convolutional layers to exploit multi-scale context. In this way, local priors and multi-scale context cues are combined for disparity estimation. In addition, we modify the correlation layer in the cost volume construction module to deal with large disparities. At the same time, we also modify the structure of the encoder-decoder module to preserve more spatial information and output a full resolution disparity map. Finally, the horizontal and vertical gradients of the disparity map convey information about significant depth differences in the scene and about local structure, which can be used to improve estimated disparity maps. In order to avoid over-smoothing in the output disparities, especially around large disparity discontinuities, we add a regularizer based on depth gradient information to our network to preserve sharp structure details.

Compared to our prior work (Kang et al., 2019), in this paper we extend our method by employing dilated convolutions. In addition, we conduct ablation experiments to investigate the effectiveness of our modifications. Moreover, we present a detailed analysis of the results. In summary, the contributions of the paper are as follows: (i) we propose an end-to-end context pyramidal network with dilated convolution for stereo matching; (ii) we modify the correlation layer of DispNet in the cost volume construction module to deal with large disparities; (iii) we introduce a gradient regularizer to preserve local structure details in the output disparity map, especially in areas with large depth discontinuities.
This paper is organised as follows: we review related work on CNN based stereo matching in Section 2 and present the details of our methodology in Section 3, followed by experimental results and analysis in Section 4; Section 5 concludes our work.

2. Related work

There is a lot of literature focusing on stereo matching research. As mentioned above, most methods follow the classical four-step pipeline. The landmark algorithm along this pipeline is Semi-Global Matching (SGM) (Hirschmuller, 2008). It calculates the matching cost using Mutual Information (Viola and Wells III, 1997) and seeks an optimal disparity assignment by 1D optimizations of a global energy function from different directions in image space using dynamic programming.

However, since the emphasis of stereo matching has shifted from the four-step pipeline to CNN based stereo matching techniques, albeit with the need for ground truth and a training phase prior to disparity estimation, we restrict our review to CNN based methods. These approaches estimate disparities in a way that can reflect part or all of the aforementioned four steps. They can be roughly divided into three categories: patch-based matching cost learning, post-processed regularity learning, and end-to-end disparity learning.

Patch-based matching cost learning. In this category, CNNs are introduced to compute the matching cost of image patches. MC-CNN (Žbontar and Lecun, 2016) is a Siamese network composed of a series of stacked convolutional layers to extract descriptors of each image patch, followed by a simple dot product (MC-CNN-fst) or a number of fully-connected layers (MC-CNN-art) to derive the similarity measure. Luo et al. (2016) expand MC-CNN and propose a notably faster Siamese network to learn a probability distribution over all possible disparities without manually pairing patch candidates. Chen and Yuan (2016) propose a multi-scale CNN to introduce global context by employing down-sampled images and increase the matching accuracy without enlarging the input patch. The dilated convolution (Holschneider et al., 1990) has been employed in the semantic segmentation network DeepLab (Chen et al., 2018) to capture contextual information at multiple scales and has been shown to perform well in this dense pixel prediction task. Li and Yu (2018) introduce it to compute matching costs and find that dilated convolutions are helpful for stereo matching. Inspired by these two contributions, we use dilated convolutions in our end-to-end network to better aggregate context information and generate more accurate disparities. Although patch-based methods outperform most traditional stereo matching methods using hand-crafted features, they still require subsequent post-processing to produce complete results. Furthermore, most of them still struggle in challenging regions, mainly because, with the restriction of small receptive fields in image patches, only limited context information is employed for disparity estimation.

Post-processed regularity learning. This category learns regularization and focuses on post-processing of disparity maps. Scharstein and Pal (2007) learn the parameters of conditional random fields (CRFs) to replace heuristic priors on disparities. Li and Huttenlocher (2008) train a non-parametric CRF model with explicit occlusion labelling by using a structured support vector machine (Cortes and Vapnik, 1995). Guney and Geiger (2015) incorporate semantic segmentation and object recognition in a super-pixel based CRF framework to learn a regularizer and to resolve ambiguities in regions with total reflection and poor texture. Seki and Pollefeys (2017) propose SGM-Net to learn penalty parameters for different 3D structures. They obtain better penalties than SGM (Hirschmuller, 2008) and mitigate streaking artefacts that appear in MC-CNN.

End-to-end disparity learning. Approaches in this category combine the matching cost computation and the hand-crafted post-processing in a single neural network and learn them jointly by training the whole network in an end-to-end mode. The first end-to-end stereo matching network is DispNet (Mayer et al., 2015), which has a structure similar to FlowNet (Dosovitskiy et al., 2015). DispNet utilizes an encoder-decoder architecture for disparity regression. Given a pair of rectified images, it explicitly extracts features in the encoder part and then directly estimates the disparity map in the decoder part by minimizing a regression training loss based on the absolute difference between the prediction and the ground truth disparity. DispNet has achieved prominent performance and has become a baseline network in stereo matching. However, the loss function employed in DispNet results in over-smoothed output disparities, which leads to a loss of local detail, especially in areas with large disparity discontinuities. In addition, we find that DispNet has lower accuracy for large disparities. Several studies have improved the performance of DispNet, e.g. by stacking multiple networks based on this baseline architecture. For instance, CRL (Pang et al., 2018) is a cascade residual learning network,


stacking an advanced DispNet and a residual network for explicitly refining the initial disparity. DispNet_css (Ilg et al., 2018) combines three separate networks (each of them similar to DispNet) with residual connections to predict more accurate disparity. However, these methods use multiple deep neural networks and contain a large number of training parameters, which leads to complex multi-stage training and expensive computations. GC-Net (Kendall et al., 2017) regresses disparity by employing 3D convolutions to learn to regularize the results, thus being able to incorporate more context directly from the data. Similar to GC-Net, PSM-Net (Shaked and Wolf, 2017) uses spatial pyramid pooling and 3D convolutions to incorporate contextual information on different scales. However, applying 3D convolutions to high-dimensional feature volumes is computationally expensive. Instead of using 3D convolutions, dilated convolutions are introduced in our paper to exploit multi-scale context cues.

DenseMapNet (Atienza, 2018) uses Dense Convolutional Networks (DenseNet) (Huang et al., 2017) instead of an encoder-decoder structure to reduce the number of learning parameters. Although this network is fast to train, the results show limitations in terms of preserving details. Cheng et al. (2018a) learn a unified deep network model that predicts a disparity gradient map and a confidence map. The two maps are integrated into a continuous Markov Random Field (MRF) for estimating a structure-preserving disparity map. This method needs multiple networks and extra complex global optimization steps to refine disparity maps. To enforce smooth disparities, a smoothness loss is introduced in an unsupervised deep neural network for depth estimation (Godard et al., 2017). This smoothness loss is combined with an edge-aware term based on the original image gradients. Inspired by this method, we apply a gradient regularizer on disparity estimation in a supervised way, based on the ground truth disparity gradients, to preserve local details.

3. Methodology

In this study, given a pair of rectified images, we aim to improve the quality of stereo matching results based on CNNs. We do so by enhancing DispNet in three ways. First, our method employs dilated convolutions to exploit multi-scale context cues without extra learning parameters. Second, we modify the correlation module of DispNet to deal with larger disparities and we also modify the structure of the encoder-decoder part to obtain a resulting disparity map with the same resolution as the input. Third, under the assumption that the desired disparity map should be locally smooth except at actual depth discontinuities, we present a gradient regularizer to preserve sharp structure details.

3.1. Overall network architecture

The schematic structure of our proposed network is depicted in Fig. 1. This is a data-driven model that enables end-to-end disparity learning. From Fig. 1, it can be observed that our encoder-decoder network is composed of three main parts, namely context pyramidal feature extraction, cost volume construction, and disparity estimation. In the feature extraction part, we learn features based on multi-scale context cues by using pyramidal dilated convolutions. In the cost volume construction part, a cost volume is generated based on the learnt features using a correlation layer within a pre-defined disparity range. Disparity estimation contains an encoder to perform matching at multiple resolutions, followed by a decoder with skip connections to incorporate information from the various resolutions and to estimate coarse-to-fine disparities. A more detailed definition of the individual layers is presented in Table 1, and implementations of each component are introduced in the following sections.

3.2. Context pyramidal feature extraction

3.2.1. Dilated convolution

Learning an effective representation, especially in regions of poor texture, usually requires a large receptive field of the CNN to combine information from a larger context window, because a larger context window normally delivers higher discriminability. To this end, the common way is to use larger convolution kernels or to conduct more consecutive down-sampling operations. However, both solutions lead to an increase in the number of learning parameters and a reduction of the spatial resolution, neither of which is desired in dense disparity estimation. Since the dilated convolution can enlarge the receptive field of a CNN without increasing the number of learning parameters, it is employed in our network to explicitly extract features on multiple scales.

The dilated convolution was originally developed in an "à trous" scheme for efficient computation of wavelet decomposition (Holschneider et al., 1990). The intuitive idea is to insert zeros between every two consecutive filter values in standard convolutional filters ("trou" means "hole" in French). Considering the two-dimensional (2D) case, given a 2D feature map F and a convolutional kernel w(i, j) of size (2n + 1) × (2n + 1), a dilated convolution operator at position (p, q) in F can be defined as follows:

C(p, q) = \sum_{i=-n}^{n} \sum_{j=-n}^{n} F(p + i \cdot r,\ q + j \cdot r) \cdot w(i, j)        (1)

where r is the dilation rate with which we sample the input feature map. When r = 1, a dilated convolution turns into a standard convolution.

Fig. 1. Overall network architecture of our method. The output disparity is estimated with respect to the left image.


Table 1. Detailed parameters of our network. We use RGB images as input. The convolution kernels are presented in the form [width, height, #fan_in, #fan_out] (#fan_in is the number of input channels, #fan_out the number of output channels). H and W are the height and width of the original image.

| Name | Input | Layer description | Output dimension |
|---|---|---|---|
| Context pyramidal feature extraction (Section 3.2) | | | |
| Input left | – | – | H × W × 3 |
| Input right | – | – | H × W × 3 |
| Conv_1a | Left image | 7 × 7 × 3 × 64, stride 2 | ½H × ½W × 64 |
| Conv_1b | Right image | 7 × 7 × 3 × 64, stride 2 | ½H × ½W × 64 |
| Conv_dilated_a | Conv_1a | 3 × 3 × 64 × 64, dilation 2 + 1 × 1 × 64 × 32; 3 × 3 × 64 × 64, dilation 3 + 1 × 1 × 64 × 32; 3 × 3 × 64 × 64, dilation 6 + 1 × 1 × 64 × 32; [1 × 1 × 96 × 64], stride 1 | ½H × ½W × 64 |
| Conv_dilated_b | Conv_1b | same as Conv_dilated_a | ½H × ½W × 64 |
| Conv_2a | Conv_dilated_a | 5 × 5 × 64 × 128, stride 2 | ¼H × ¼W × 128 |
| Conv_2b | Conv_dilated_b | 5 × 5 × 64 × 128, stride 2 | ¼H × ¼W × 128 |
| Conv_redir | Conv_2a | 1 × 1 × 128 × 64, stride 1 | ¼H × ¼W × 64 |
| Cost volume construction (Section 3.3) | | | |
| Corr | Conv_2a + Conv_2b | Correlation layer | ¼H × ¼W × 81 |
| Encoder-Decoder (Section 3.4) | | | |
| Conv3 | Corr + Conv_redir | 5 × 5 × 145 × 256, stride 2 | ⅛H × ⅛W × 256 |
| Conv3_1 | Conv3 | 3 × 3 × 256 × 256, stride 1 | ⅛H × ⅛W × 256 |
| Conv4 | Conv3_1 | 3 × 3 × 256 × 512, stride 2 | 1/16H × 1/16W × 512 |
| Conv4_1 | Conv4 | 3 × 3 × 512 × 512, stride 1 | 1/16H × 1/16W × 512 |
| Conv5 | Conv4_1 | 3 × 3 × 512 × 1024, stride 2 | 1/32H × 1/32W × 1024 |
| Conv5_1 | Conv5 | 3 × 3 × 1024 × 1024, stride 1 | 1/32H × 1/32W × 1024 |
| Pre5 + loss5 | Conv5_1 | 3 × 3 × 1024 × 1, stride 1 | 1/32H × 1/32W × 1 |
| upconv5 | Conv5_1 | 4 × 4 × 1024 × 512, stride 2 | 1/16H × 1/16W × 512 |
| iconv5 | Conv4_1 + Pre5 + upconv5 | 3 × 3 × 1025 × 512, stride 1 | 1/16H × 1/16W × 512 |
| Pre4 + loss4 | iconv5 | 3 × 3 × 512 × 1, stride 1 | 1/16H × 1/16W × 1 |
| upconv4 | iconv5 | 4 × 4 × 512 × 256, stride 2 | ⅛H × ⅛W × 256 |
| iconv4 | Conv3_1 + Pre4 + upconv4 | 3 × 3 × 513 × 256, stride 1 | ⅛H × ⅛W × 256 |
| Pre3 + loss3 | iconv4 | 3 × 3 × 256 × 1, stride 1 | ⅛H × ⅛W × 1 |
| upconv3 | iconv4 | 4 × 4 × 256 × 128, stride 2 | ¼H × ¼W × 128 |
| iconv3 | Conv_2a + Pre3 + upconv3 | 3 × 3 × 257 × 128, stride 1 | ¼H × ¼W × 128 |
| Pre2 + loss2 | iconv3 | 3 × 3 × 128 × 1, stride 1 | ¼H × ¼W × 1 |
| upconv2 | iconv3 | 4 × 4 × 128 × 64, stride 2 | ½H × ½W × 64 |
| iconv2 | Conv_1a + Pre2 + upconv2 | 3 × 3 × 129 × 64, stride 1 | ½H × ½W × 64 |
| Pre1 + loss1 | iconv2 | 3 × 3 × 64 × 1, stride 1 | ½H × ½W × 1 |
| upconv1 | iconv2 | 4 × 4 × 64 × 32, stride 2 | H × W × 32 |
| iconv1 | Left + Pre1 + upconv1 | 3 × 3 × 36 × 32, stride 1 | H × W × 32 |
| Pre0 + loss0 | iconv1 | 3 × 3 × 32 × 1, stride 1 | H × W × 1 |

In dilated convolutions, a small kernel with k × k elements is enlarged to k + (k − 1)(r − 1) with dilation rate r. Fig. 2 illustrates the dilated convolution of a 3 × 3 kernel with different dilation rates. When the feature map is convolved with a dilated filter of rate 1, the receptive field size is 3 × 3 with 9 parameters being learnt. When the rate is 2 or 3, the corresponding receptive field increases to 5 × 5 and 7 × 7, respectively, but with a constant number of parameters (namely 9). Therefore, the dilated convolution enables the receptive field to grow without introducing additional learning parameters and allows flexible aggregation of multi-scale contextual information.
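To make the receptive-field arithmetic above concrete, the short sketch below (our own illustration in TensorFlow/Keras, not code released by the authors) applies 3 × 3 convolutions with different dilation rates to a dummy feature map; the layer sizes are chosen arbitrarily for the example.

```python
import tensorflow as tf

def effective_kernel(k, r):
    # Effective extent of a k x k kernel with dilation rate r: k + (k - 1)(r - 1)
    return k + (k - 1) * (r - 1)

# Dummy feature map: batch 1, 64 x 64 pixels, 64 channels.
feature_map = tf.random.normal([1, 64, 64, 64])

for rate in (1, 2, 3, 6):
    conv = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding="same",
                                  dilation_rate=rate, activation="relu")
    out = conv(feature_map)
    # The number of learnt weights stays 3 * 3 * 64 * 64 regardless of the rate;
    # only the spatial extent covered by the kernel grows.
    e = effective_kernel(3, rate)
    print(f"rate={rate}: effective kernel {e}x{e}, output shape {out.shape}")
```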


3.2.2. Feature extraction

For feature extraction, we utilize a series of dilated convolutions with different dilation rates in parallel to incorporate hierarchical context information (see Fig. 3). More specifically, we extract features using a Siamese network with two branches. This means the network learns the same convolution parameters and thus the same type of features from the two input images.

In each branch of the network, we employ a convolutional layer followed by a rectified linear unit (ReLU) to learn the unary features. Let F ∈ R^{w × h × m} denote the 3D feature tensor, where w, h and m are the width, height and number of channels, respectively. The context pyramidal module is applied to capture multi-scale context cues. As shown in Fig. 3, this module is formulated with three pyramid levels, each consisting of a dilated convolution layer with a kernel C_k^d and a different dilation rate r_k (k ∈ [1, 3]). A dilated convolution layer with kernel size c × c and dilation rate r_k thus has a receptive field of [(c − 1) r_k + 1] × [(c − 1) r_k + 1] pixels. The size of the receptive field gives an indication of how much context is used and differs significantly with r_k. Each dilated convolution layer is followed by a 1 × 1 standard convolution with a kernel C^s to reduce the feature dimension. Hence the output feature map of each layer, T_k ∈ R^{w × h × m_T}, is:

T_k = C^s ⊗ (C_k^d ⊛ F)        (2)

Here ⊗ indicates the standard convolution operator and ⊛ denotes the dilated convolution operation.


Fig. 2. Dilated Convolution in 2D with kernel size 3 × 3 and different dilation rates. Employing large values enlarges the receptive field, enabling feature encoding at multiple scales.

Fig. 3. Architecture details of our context pyramidal feature extraction.

Finally, the three output spatial features {T_k}_{k=1}^{3} from the different branches are concatenated to form multi-scale context cues and fed into the next convolutional layer. The concatenated feature X ∈ R^{w × h × m_X} is calculated as:

X = T_1 ⊕ T_2 ⊕ T_3        (3)

where ⊕ represents the concatenation operation. The output feature maps from this module are then fed into the correlation module to extract the correspondence prior between the left and right images.
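As an illustration of how such a three-branch module could be wired, the sketch below follows the layer sizes listed in Table 1 (3 × 3 dilated convolutions with rates 2, 3 and 6, each followed by a 1 × 1 reduction to 32 channels, concatenation to 96 channels and a 1 × 1 fusion back to 64 channels). It is a sketch under these assumptions, not the authors' released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def context_pyramid(x, rates=(2, 3, 6)):
    """Three parallel dilated-convolution branches (Eq. 2), concatenated (Eq. 3)."""
    branches = []
    for r in rates:
        t = layers.Conv2D(64, 3, padding="same", dilation_rate=r,
                          activation="relu")(x)          # C_k^d with dilation r
        t = layers.Conv2D(32, 1, activation="relu")(t)   # 1x1 reduction C^s
        branches.append(t)
    x = layers.Concatenate()(branches)                   # 3 x 32 = 96 channels
    return layers.Conv2D(64, 1, activation="relu")(x)    # 1x1 fusion to 64 channels

# Example: unary features of one branch at half resolution (output of Conv_1a).
inputs = tf.keras.Input(shape=(None, None, 64))
pyramid = tf.keras.Model(inputs, context_pyramid(inputs))
```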

3.3. Cost volume construction

After having obtained deep unary features from the Siamese network, the cost volume is constructed based on these features using a correlation layer. As the input stereo images are rectified, the y-coordinates of conjugate points are identical. Similar to DispNet, we use a 1D correlation layer along the x-direction (epipolar line) to construct the cost volume. First, let M_L and M_R denote the left and right feature maps, with w, h and m representing their width, height and number of channels, respectively. Then, the cost volume C is created by correlating the left and right feature maps M_L, M_R up to the maximum disparity d_m. The correlation of two patches (i.e. context windows) centered at x_1 in M_L and x_2 in M_R is defined as

C(x_1, x_2) = \sum_{o \in [-k, k] \times [-k, k]} [M_L(x_1 + o) ⊗ M_R(x_2 + o)]        (4)

where k is an index, K = 2k + 1 is the patch size and ⊗ denotes the convolution operation. We restrict the search space of possible patch pairs by setting the maximum disparity along the epipolar line. For each location x_1 in M_L, we compute the correlation C(x_1, x_2) only in the interval [x_2 = x_1, x_2 = x_1 + 2d_m], which implies a one-sided search on M_R. We set d_m to 40 pixels and increase the stride from 1 to 2 when changing the convolution layer. Because we use two convolution layers and down-sample the feature maps (M_L, M_R) by a factor of 4 between the layers, our network can handle disparities of 40 · 4 · 2 = 320 pixels. After creating the resulting multi-channel maps and organizing the relative displacements in channels, we obtain a 3D cost volume of size (w × h × (d_m + 1)).
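A minimal sketch of this one-dimensional correlation is given below; it shifts the right feature map along the epipolar line and reduces Eq. (4) to a per-pixel dot product over channels (i.e. patch size K = 1). The displacement sampling (one output channel per candidate shift) is an illustrative simplification of the stride handling described above, not the authors' implementation.

```python
import tensorflow as tf

def correlation_1d(left, right, max_disp=40, stride=2):
    """Cost volume of shape (B, H, W, max_disp + 1) for rectified stereo features.

    left, right: feature maps of shape (B, H, W, C).
    stride:      spacing of the disparity candidates in feature-map pixels,
                 so candidate shifts are 0, stride, ..., stride * max_disp.
    """
    slices = []
    for d in range(max_disp + 1):
        shift = d * stride
        # Move the right features left by `shift` pixels; zero-pad the far border.
        shifted = tf.pad(right[:, :, shift:, :],
                         [[0, 0], [0, 0], [0, shift], [0, 0]])
        # Dot product over channels at every pixel (Eq. 4 with k = 0).
        slices.append(tf.reduce_sum(left * shifted, axis=-1))
    return tf.stack(slices, axis=-1)
```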

3.4. Encoder-Decoder module

Given the disparity cost volume, the next step is to learn a regularization function to refine our disparity estimation. We modify the deep encoder-decoder module of DispNet to output detailed disparity with the same resolution as the input. The architecture of our encoder-decoder network is presented in Fig. 1 and the layer settings are shown in Table 1. The encoder part encodes sub-sampled features and enables the network to explicitly leverage context with a wide field of view. However, it results in reduced resolution due to multiple convolutions of stride 2. Therefore, unlike DispNet, which uses four groups of convolutional layers to down-sample the features by a factor of 64, we only stack three groups of convolutional layers in the encoder to preserve more spatial context. Each group contains two 3 × 3 convolutions with strides of 2 and 1, respectively, yielding an encoded feature map with dimension (W/32 × H/32 × C/32), where W, H, C represent the width, height, and number of channels of the feature map, respectively.

In order to obtain dense per-pixel predictions at the original input resolution, we apply five up-sampling blocks corresponding to six scales (1/32, 1/16, 1/8, 1/4, 1/2, and 1× the input size) in the decoder part to refine the coarse representation. Each block consists of a 4 × 4 deconvolution layer with stride 2 to up-sample the encoded output. This encoder-decoder module has six outputs (Pre0 to Pre5 in Table 1) and losses (loss0 to loss5 in Table 1); the related loss function is described in Section 3.5. During the training phase, the disparity loss is calculated as the weighted sum of these six losses. Since the encoder part reduces spatial accuracy and fine-grained details through the loss of resolution, skip connections are also used in our decoder part, similar to DispNet, to preserve both the high-level coarse and the low-level fine information. In addition, we concatenate the left original image with the deconvolution features, as shown in Fig. 1 and Table 1, which is different from DispNet.
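One decoder stage of this module can be pictured as in the following sketch (our own reading of the upconvN/iconvN/PreN rows of Table 1, with bilinear up-sampling of the coarse prediction as an assumption): a 4 × 4 deconvolution doubles the resolution, the result is concatenated with the corresponding encoder feature map and the up-sampled intermediate disparity, a 3 × 3 convolution fuses them, and a further 3 × 3 convolution produces the disparity prediction at this scale.

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder_block(x, skip, pred_prev, channels):
    """One up-sampling block of the decoder (illustrative sketch only).

    x:         features from the previous, coarser decoder stage
    skip:      encoder feature map at the target resolution (skip connection)
    pred_prev: disparity predicted at the coarser stage (1 channel)
    """
    up = layers.Conv2DTranspose(channels, 4, strides=2, padding="same",
                                activation="relu")(x)              # upconvN
    pred_up = layers.UpSampling2D(interpolation="bilinear")(pred_prev)
    merged = layers.Concatenate()([skip, pred_up, up])
    fused = layers.Conv2D(channels, 3, padding="same",
                          activation="relu")(merged)               # iconvN
    pred = layers.Conv2D(1, 3, padding="same")(fused)              # PreN (disparity)
    return fused, pred
```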


Table 2. Overview of datasets used in our experiments. Note that the Driving and Monkaa datasets do not provide images for testing.

| Dataset | Subset | # Frames for training | # Frames for testing | Ground truth disparity coverage | Type of imagery | Resolution |
|---|---|---|---|---|---|---|
| Scene Flow | FlyingThings3D | 21,818 | 4248 | 100% | Synthetic | 960 × 540 |
| Scene Flow | Driving | 4392 | – | 100% | Synthetic | 960 × 540 |
| Scene Flow | Monkaa | 8591 | – | 100% | Synthetic | 960 × 540 |
| MPI Sintel | – | 1064 | 564 | 100% | Synthetic | 1024 × 436 |
| KITTI dataset | KITTI2015 | 200 | 200 | 30% (sparse) | Real world | 1242 × 375 |
| KITTI dataset | KITTI2012 | 194 | 195 | 30% (sparse) | Real world | 1226 × 370 |
| HCI | – | – | 330 | – | Real world | 656 × 541 |

Table 3. Quantitative comparison of models with and without dilated convolutions. The metrics are the percentage of bad pixels (> 3 px) and the end-point error (EPE), respectively. Results on FlyingThings3D refer to validation, while the results on the other datasets are test results. Bold marks the largest improvement among all datasets for each metric.

| Dataset | Model_NoD > 3 px | Model_NoD EPE | Model_Final > 3 px | Model_Final EPE | Improvement > 3 px | Improvement EPE |
|---|---|---|---|---|---|---|
| FlyingThings3D (validation) | 9.50 | 1.74 | 8.02 | 1.48 | 15.6% | 14.9% |
| Monkaa | 17.60 | 4.78 | 13.10 | 3.92 | 25.7% | **17.9%** |
| Driving | 44.65 | 8.97 | 37.91 | 8.31 | 15.1% | 7.4% |
| Sintel | 18.45 | 3.63 | 13.65 | 3.07 | 26.0% | 15.4% |
| KITTI2015 | 10.55 | 1.58 | 7.54 | 1.37 | **28.5%** | 13.3% |
| KITTI2012 | 9.65 | 1.45 | 7.02 | 1.21 | 27.3% | 16.6% |

3.5. Training loss

We train our model end to end in supervised mode using ground truth disparity data. In order to preserve local structure details in the output disparity, we present a gradient regularizer as an auxiliary loss. The loss function therefore contains two parts: the disparity regression term and the gradient regression term.

3.5.1. Disparity regression loss

For the disparity regression loss L_r, we use the end-point error (EPE), i.e. the absolute Euclidean distance between the disparity D predicted by the model and the ground truth disparity \hat{D}, averaged over the valid pixels of the whole image. The latter is done because ground truth disparity maps are sometimes sparse (e.g. the KITTI dataset (Geiger et al., 2012; Menze and Geiger, 2015; Menze et al., 2018)). We adopt the l1 norm to regularize the prediction, which is known to be less sensitive to outliers than the l2 norm and is also used in other methods (Mayer et al., 2015; Pang et al., 2018). Thus, the disparity regression loss L_r is formulated as:

L_r = \frac{1}{N_v} \sum_{(i,j) \in v} \| D_{i,j} - \hat{D}_{i,j} \|_1        (5)

where ‖·‖1 denotes the l1 norm, v represents all valid disparity pixels in \hat{D} and N_v is the number of valid pixels.

3.5.2. Gradient regression loss

Besides the above disparity regression term, a new gradient term is introduced into the total loss in our work in order to account for large disparity discontinuities. We apply a gradient regularizer on the disparity field to encourage a similar change of disparities in the predicted and the ground truth disparity map (and thus scene depth). This is one of the key differences between our method and DispNet. We minimize the differences between the gradients of the estimated disparity map and those of the ground truth and again apply the l1 norm, with the gradient regression loss L_g defined as:

L_g = \frac{1}{N_v} \sum_{(i,j) \in v} \left[ \| \partial_x D_{i,j} - \partial_x \hat{D}_{i,j} \|_1 + \| \partial_y D_{i,j} - \partial_y \hat{D}_{i,j} \|_1 \right]        (6)

where ‖·‖1 and v again denote the l1 norm and all valid disparity pixels in \hat{D}, and \partial_x and \partial_y are the horizontal and vertical gradient operators of the disparity maps.

3.5.3. Total training loss

To fuse the results of the different scales, the losses for all intermediate predictions are combined by taking a weighted average. For one scale i, the loss is a weighted sum of the above two terms, where λ_g controls the relative importance of the gradient regularizer in the optimization:

E_i = L_r + λ_g L_g        (7)

The network is trained by minimizing the total loss function E, which is a weighted sum of the losses of all scales:

E = \sum_i α_i E_i        (8)

where E_i is the loss from Eq. (7), evaluated at layer i, and α_i denotes the weighting factor for the i-th scale.

4. Experiments and results

In this section, we evaluate our method on several publicly available synthetic and real stereo datasets. Two ablation experiments were conducted to evaluate the influence of the newly introduced dilated convolution and the gradient regularization on the disparity estimation.


Fig. 4. Comparison of results with and without dilated convolution. Column 1: left image; Column 2: ground truth; Column 3: results predicted by Model_NoD (without dilated convolution); Column 4: results predicted by our final model (with dilated convolution).

We also compared our method with the baseline model and with other traditional and CNN based stereo matching methods. The primary goal of these experiments is to assess the performance of our method in a qualitative and quantitative way.

4.1. Experimental settings

4.1.1. Datasets

Unlike traditional approaches such as SGM, CNN based matching approaches require training data with ground truth for learning the network parameters and optimizing hyper-parameters. In this paper, we use part of the synthetic Scene Flow dataset to train our model, and then evaluate it on a number of publicly available, competitive synthetic and real stereo datasets. An overview of these datasets is given in Table 2.

Scene Flow¹ (Mayer et al., 2015) is a large synthetic dataset for stereo matching, first designed and used in DispNet for training CNNs to estimate disparity. This dataset is rendered by computer graphics methods and provides accurate dense ground truth, which is large enough to train a complex network. It is composed of three subsets and contains more than 39,000 stereo frames at 960 × 540 pixel resolution. In this paper, we only use the FlyingThings3D subset to train our model. The Driving and Monkaa datasets are used for performance evaluation.

MPI Sintel² (Wulff et al., 2012) is also an entirely synthetic dataset, which was created in the Blender software by rendering artificial scenes from a short open source animated 3D movie. It has 1064 training frames and provides dense ground truth disparities with large displacement; it is a very reliable dataset for testing different methods. In this work, we use the final version, which contains sufficiently realistic scenes including natural image degradations.

The KITTI³ dataset was produced in 2012 (Geiger et al., 2012) and extended in 2015 (Menze et al., 2018; Menze and Geiger, 2015). It contains stereo images of complex real-world road scenes collected from a calibrated stereo camera pair mounted on a driving car. It provides 200 stereo frames with ground truth obtained from a 3D laser scanner. Since the laser only provides sparse data up to a certain distance and height, ground truth is only available for these areas.

HCI⁴ (Meister et al., 2012) is a challenging outdoor dataset, which contains eleven real-world scenes with highly variable weather conditions, motions and depth. It has 330 frames and no ground truth. In this work, we only use it to show the visual quality of the results of our method.

¹ https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html (accessed on Sep 18, 2019).
² http://sintel.is.tue.mpg.de/stereo (accessed on Sep 18, 2019).
³ http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo (accessed on Sep 18, 2019).
⁴ https://hci.iwr.uni-heidelberg.de/benchmarks/Challenging_Data_for_Stereo_and_Optical_Flow (accessed on Sep 18, 2019).


Table 4. Quantitative comparison of models with and without the gradient regularizer. The metrics are the percentage of bad pixels (> 3 px) and the end-point error (EPE), respectively. Results on FlyingThings3D refer to validation, while the results on the other datasets are test results.

| Dataset | Model_NoG > 3 px | Model_NoG EPE | Model_Final > 3 px | Model_Final EPE | Improvement > 3 px | Improvement EPE |
|---|---|---|---|---|---|---|
| FlyingThings3D (validation) | 8.56 | 1.59 | 8.02 | 1.48 | 6.3% | 6.9% |
| Monkaa | 13.69 | 4.07 | 13.10 | 3.92 | 4.3% | 3.7% |
| Driving | 40.37 | 8.46 | 37.91 | 8.31 | 6.1% | 1.8% |
| Sintel | 13.91 | 3.21 | 13.65 | 3.07 | 1.9% | 4.4% |
| KITTI2015 | 7.78 | 1.44 | 7.54 | 1.37 | 3.1% | 4.9% |
| KITTI2012 | 7.35 | 1.25 | 7.02 | 1.21 | 4.5% | 3.2% |

Table 5. Comparison of the quality of the final disparity gradients. The metric is the EPE of both the horizontal and the vertical disparity gradients. Results on FlyingThings3D refer to validation, while the results on the other datasets are test results. The smallest gradient EPE on each dataset is shown in bold. "–" means the result of the method on this dataset is not available.

| Model | FlyingThings3D (validation) | Monkaa | Driving | Sintel | KITTI2015 | KITTI2012 |
|---|---|---|---|---|---|---|
| Model_NoG | 1.12 | 0.86 | 2.73 | 1.03 | – | – |
| Model_Final | **0.99** | **0.72** | **2.57** | **0.84** | – | – |
| Improvement | 11.6% | 16.4% | 6.1% | 17.9% | – | – |

4.1.2. Data augmentation

Even though our training dataset has many stereo images, we still find data augmentation necessary to alleviate overfitting and to improve the generalization capabilities of our network. We perform online augmentation during training to introduce more variation into the training data. The data augmentation approach we use is the same as the one employed in DispNet (Mayer et al., 2015), which includes geometric and radiometric transformations. Identical geometric transformations are applied to both stereo images, including translation and scaling, to make sure the disparities between augmented image pairs undergo only a common 3 degree-of-freedom transformation (2D translation + scale) with respect to the disparities of the original image pair. We uniformly sample translation values from the range [−0.4, −0.4] to [0.4, 0.4] pixels and scaling values from the range [0.8, 1.6]. For the radiometric transformations, we apply additive Gaussian noise and random changes in brightness, contrast, gamma and colour to the left and right images independently. The standard deviation σ of the Gaussian noise is uniformly sampled from (0, 0.06). The brightness, contrast and gamma values are sampled randomly from a Gaussian with σ = 0.01. Colour channels are multiplied separately by a factor randomly sampled from a Gaussian with σ = 0.01.

4.1.3. Implementation details

Training: We implemented our architecture using the Tensorflow framework (Abadi et al., 2016) and optimized our model end-to-end using the Adam optimizer (Kingma and Ba, 2015) with default momentum parameters, β1 = 0.9 and β2 = 0.999. Due to hardware limitations (we use a Titan X GPU), we trained the network with a mini-batch size of 4. The training images were resized to 368 × 760 pixels and pre-processed by normalizing them to zero mean and a standard deviation of 1. We set the starting learning rate λ to 10⁻⁴ and then halved it every 150 k iterations after the first 200 k iterations. To further avoid overfitting, we employed l2 regularization with a weight decay strength d = 0.0004. We used fixed weights for the different scales in the loss function during training, because we observed that using dynamic weights can lead to unstable convergence. The loss weights (α1, α2, α3, α4, α5, α6) were fixed and set to (1, 0.5, 0.2, 0.2, 0.2, 0.2). The disparity gradient regression loss weight λ_g was set to 2. We validated our model in each epoch and used an early stopping strategy (Goodfellow et al., 2016) to obtain the training weights of our final model. We recorded the smallest validation error during training. Then, if there was no improvement for 5 consecutive validations compared with the current smallest validation error, training was stopped and the recorded best validation model was used as the final trained model for testing.

Testing criteria: We trained our model on the FlyingThings3D dataset and tested it on the other datasets. Following the convention of DispNet, we used the test set of FlyingThings3D as our validation dataset, so the test results on FlyingThings3D are actually validation results. For the evaluation of results, we used the end-point error (EPE) measure. We also used the t-pixel error to compute the percentage of "bad" pixels among all valid pixels. A bad pixel is a pixel with an absolute disparity error larger than a threshold t, which is set to 3 pixels in our paper. We also compared the performance of our method with other disparity estimation methods in terms of EPE values.

Fine-tuning: We fine-tune the pre-trained model on the KITTI training dataset (2015) using the stochastic gradient descent optimizer with momentum parameter 0.9 and a learning rate of 10⁻⁶ (for the experiment reported in Table 8 only). Fine-tuning stops after 94 k iterations to alleviate the problem of over-fitting. Since the KITTI dataset includes sparsely labelled disparities, pixels for which no ground truth is provided are ignored when computing the loss according to Eq. (6).
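The optimisation settings above could be expressed roughly as follows; the schedule boundaries beyond the first 200 k iterations and the use of Keras callbacks are illustrative assumptions, not the authors' code.

```python
import tensorflow as tf

# Piecewise constant learning rate: 1e-4 for the first 200k iterations,
# then halved every 150k iterations (later boundaries shown are examples).
boundaries = [200_000, 350_000, 500_000]
values = [1e-4, 5e-5, 2.5e-5, 1.25e-5]
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)

# Adam with the default momentum parameters given in the text.
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule,
                                     beta_1=0.9, beta_2=0.999)

# Early stopping: keep the best validation model and stop after
# 5 validations without improvement.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
```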

4.2. Ablation experiments

Based on our preliminary experiments, we found that dilated convolutions and the gradient regularizer provide the most important improvements in terms of the quality of the results.




Fig. 5. Comparison of results with and without the gradient regularizer. Column 1: left image; Column 2: ground truth; Column 3: results predicted by Model_NoG (without the gradient regularizer); Column 4: results predicted by our final model (with the gradient regularizer).

In this section, we show ablation experiments with respect to these two factors.

4.2.1. Ablation study for dilated convolutions

In order to explore the effectiveness of our context pyramidal feature extraction module, we test our network architecture with and without this module on the aforementioned stereo datasets. Table 3 reports the corresponding results in terms of 3-pixel error and EPE, where "Model_Final" represents our final model as described in Section 3 and "Model_NoD" is the model without the dilated convolutions, but including gradient regularization. One can see that on all datasets, large improvements result from employing the dilated convolutions (e.g. 17.9% on Monkaa in terms of EPE and 28.5% on KITTI2015 in terms of 3-pixel error). Since Monkaa includes images with texture-less areas and the KITTI datasets contain strong depth differences, they can benefit from the global context information obtained by dilated convolution. These results confirm that using multiple dilated convolutions instead of a single convolution during feature extraction helps the network to produce better disparity predictions.


Table 6. Comparison of results between our model and the baseline model. The metrics are the percentage of bad pixels (> 3 px) and the end-point error (EPE), respectively. Results on FlyingThings3D refer to validation, while the results on the other datasets are test results.

| Dataset | Baseline > 3 px | Baseline EPE | Model_Final > 3 px | Model_Final EPE | Improvement > 3 px | Improvement EPE |
|---|---|---|---|---|---|---|
| FlyingThings3D (validation) | 9.45 | 1.68 | 8.02 | 1.48 | 15% | 12% |
| Monkaa | 18.44 | 5.78 | 13.10 | 3.92 | 29% | 32% |
| Driving | 48.46 | 12.46 | 37.91 | 8.31 | 22% | 33% |
| Sintel | 22.91 | 5.66 | 13.65 | 3.07 | 40% | 46% |
| KITTI2015 | 10.73 | 1.59 | 7.54 | 1.37 | 30% | 14% |
| KITTI2012 | 10.38 | 1.55 | 7.02 | 1.21 | 32% | 22% |

Fig. 6. Error distribution of the Sintel and Driving datasets: (a) Baseline (Sintel); (b) our model (Sintel); (c) Baseline (Driving); (d) our model (Driving).

From the results in Table 3, we see that the improvements in terms of 3-pixel error are more pronounced than those in terms of EPE. This indicates that multi-scale context priors help in particular to remove bad pixels from the final disparity maps. Some visual results are presented in Fig. 4 to illustrate the effectiveness of dilated convolution. From the results in the red boxes in Fig. 4, we can see that the model with dilated convolution ("Model_Final") can estimate disparities close to the ground truth, while the model without dilated convolution ("Model_NoD") struggles to maintain fine details and generates noisy disparity estimates, especially in weakly textured areas. That is because, in these weakly textured regions, disparities are difficult to infer locally. Therefore, global context cues (or large receptive fields) are required in these areas to compute more accurate disparities. The visual results confirm again that multi-scale context priors are beneficial for accurate and complete disparity estimation.

4.2.2. Ablation study for gradient regularization

As mentioned above, we use an additional gradient regularizer in the loss function to preserve discontinuities in the disparity maps. Table 4 shows the quantitative results in terms of EPE and 3-pixel error, where "Model_NoG" represents the model without the gradient regularizer, but including dilated convolutions. By comparing the results of "Model_Final" to those of "Model_NoG", we investigate the effectiveness of this gradient regularizer. From Table 4, we observe that the EPE values and the 3-pixel errors of the model with the gradient regression loss (Model_Final) are slightly smaller than those of the model without it. In addition, in order to evaluate the structure-preserving performance resulting from the gradient regularizer, we compute the sum of the EPE of both the horizontal and the vertical disparity gradients. Table 5 shows the EPE of the final gradients of the disparity maps. Note that the two KITTI datasets have sparse ground truth disparities and the ground truth in some areas is not available, as mentioned in Section 4.1.1. For this reason, we do not give results for the disparity gradients of these two datasets in Table 5. We observe that the model with the gradient regression loss has the best performance on all datasets. This proves that our method can effectively preserve structures in the final disparity map.

Since the EPE metric often favours over-smoothed solutions, it is interesting to further inspect qualitative results. Fig. 5 shows visual examples from "Model_Final" and "Model_NoG". As illustrated in the red box, by using the gradient regularizer, our model performs well around the boundaries of objects. It is able to regularize the output effectively while learning to maintain sharpness and local structure details in the output disparity map. This is especially noticeable in areas with large disparity discontinuities. These results indicate that utilizing the gradient regularizer has a positive impact on the performance.
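As a concrete reading of these metrics, the sketch below computes the EPE, the 3-pixel bad-pixel rate and the summed gradient EPE used in Table 5 over the valid ground-truth pixels; it is our own illustration of the definitions in Sections 4.1.3 and 4.2.2, not the evaluation code of the paper.

```python
import numpy as np

def epe(pred, gt, valid):
    """End-point error: mean absolute disparity error over valid pixels."""
    return np.abs(pred - gt)[valid].mean()

def bad_pixel_rate(pred, gt, valid, t=3.0):
    """Percentage of valid pixels whose absolute error exceeds t (here 3 px)."""
    return 100.0 * (np.abs(pred - gt)[valid] > t).mean()

def gradient_epe(pred, gt, valid):
    """Sum of the EPE of the horizontal and vertical disparity gradients.

    pred, gt: 2D disparity maps; valid: boolean mask of labelled pixels.
    """
    gx_err = np.abs(np.diff(pred, axis=1) - np.diff(gt, axis=1))
    gy_err = np.abs(np.diff(pred, axis=0) - np.diff(gt, axis=0))
    mask_x = valid[:, 1:] & valid[:, :-1]   # both neighbours must be labelled
    mask_y = valid[1:, :] & valid[:-1, :]
    return gx_err[mask_x].mean() + gy_err[mask_y].mean()
```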


Fig. 7. Visual results of our model and the baseline network. Column 1: left image; Column 2: ground truth (except for the last row); Column 3: results predicted by the baseline network; Column 4: results predicted by our model. Note that we also give an example of the real-world dataset "HCI" (without ground truth) in the last row; there, Column 2 contains the right image.

Table 7. Comparison of results of our model with other methods. The metric is the end-point error (EPE). Results on FlyingThings3D of "Baseline", "DispNet_css" and "Ours" refer to validation, while the other figures in this column are test results. The smallest EPE on each dataset is shown in bold. "–" means the result of the method on this dataset is not available.

| Method | FlyingThings3D | Monkaa | Driving | Sintel | KITTI2015 | KITTI2012 | No. of parameters |
|---|---|---|---|---|---|---|---|
| SGM | 8.70 | 20.16 | 40.19 | 19.62 | 7.21 | 10.06 | – |
| MC-CNN-fst | 4.09 | 6.71 | 19.58 | 11.94 | – | – | 0.6 M |
| Baseline | 1.68 | 5.78 | 12.46 | 5.66 | 1.59 | 1.55 | 36 M |
| DenseMapNet | 5.07 | 4.45 | **6.56** | 4.41 | 2.52 | – | 0.3 M |
| DispNet_css | **1.34** | 4.23 | 8.68 | **2.95** | 1.52 | 1.57 | 108 M |
| Ours | 1.48 | **3.92** | 8.31 | 3.07 | **1.37** | **1.21** | 37 M |



4.3. Comparison with the baseline network

In this section, we adopt DispNet as the baseline model and compare its performance to that of our model to illustrate the effectiveness of our modifications. The results are presented in Table 6. By comparing the results of "Model_Final" to those of the "Baseline", we see that our model outperforms the baseline network in all cases. As shown in Table 6, for the two KITTI datasets, the improvement in terms of the 3-pixel error is larger than that of the EPE values. The KITTI dataset contains real-world street scenes, which have larger depth differences than the other datasets. The significant reduction of the 3-pixel error on these two datasets indicates that our method can remove a large number of bad pixels in these areas. We notice that the EPE values of our model are improved significantly, especially on the Driving and the Sintel datasets (33% and 46%, respectively).

Fig. 6 illustrates the error distributions of the Sintel and the Driving datasets with respect to the disparity size. Compared with the baseline model, accuracy improvements are mainly found for large disparities (i.e. from 150 to 300 pixels in Fig. 6(a) and (b)). These significant improvements can be directly attributed to our modification of the correlation module of DispNet (the original version is only able to deal with the disparity range from 0 to 160 pixels). In addition, we notice that our method yields more robust results. As indicated in Fig. 6, for smaller disparities (i.e. from 0 to 160 pixels in Fig. 6(c) and (d)), there are fewer outliers in our results and the overall accuracy is better than that of the baseline network. That is mainly because we employ dilated convolutions to encode multi-scale context information as priors, which is beneficial for accurate disparity estimation, particularly in complex and texture-less areas; see the improvements shown in Tables 3 and 4.

The improvement in visual quality is also distinct in the qualitative results, as shown in Fig. 7. Compared with the baseline, our method performs noticeably better. We see that not only the resolution of the disparity maps is improved, but also more detailed structures of the scenes are captured. For instance, the disparity estimates within the red boxes are improved by our method. Furthermore, in areas with large disparity discontinuities, our method can preserve clear edge details and produce correct disparity estimates because of the gradient regularizer, as indicated by the red rectangles.

Table 8. Results on the KITTI 2015 stereo online leaderboard. The best results among all methods in terms of the different metrics are shown in bold.

| Method | All pixels: D1-bg | All pixels: D1-fg | All pixels: D1-all | Non-occluded: D1-bg | Non-occluded: D1-fg | Non-occluded: D1-all |
|---|---|---|---|---|---|---|
| SGM | 8.92 | 20.59 | 10.86 | 7.62 | 18.81 | 9.47 |
| M2S_CSPN (Cheng et al., 2018b) | 1.51 | **2.88** | **1.74** | 1.40 | **2.67** | **1.61** |
| GANet-deep (Zhang et al., 2019) | **1.48** | 3.46 | 1.81 | **1.34** | 3.11 | 1.63 |
| PSM-Net | 1.86 | 4.62 | 2.32 | 1.71 | 4.31 | 2.14 |
| CRL | 2.48 | 3.59 | 2.67 | 2.32 | 3.12 | 2.45 |
| GC-Net | 2.21 | 6.16 | 2.87 | 2.02 | 5.58 | 2.61 |
| MADnet (Tonioni et al., 2019) | 3.75 | 9.20 | 4.66 | 3.45 | 8.41 | 4.27 |
| Baseline | 4.32 | 4.41 | 4.34 | 4.11 | 3.72 | 4.05 |
| Ours | 3.61 | 7.14 | 4.20 | 3.41 | 6.59 | 3.94 |


Fig. 8. Results of disparity estimation for KITTI 2015 training images: (a) left image; (b) ground truth; (c) our prediction; (d) error map (red pixels mean higher errors). Note that our method can predict a correct depth boundary, although the ground truth label marks it as wrong.


Fig. 9. Example of disparity completion for KITTI 2015 training images: (a) left image; (b) ground truth; (c) filled disparity.

4.4. Comparison with other methods

In this section, we investigate how well our method performs when compared with previously published methods, including SGM (Hirschmuller, 2008), MC-CNN (Žbontar and Lecun, 2016), DispNet (Mayer et al., 2015), DenseMapNet (Atienza, 2018) and DispNet_css (Ilg et al., 2018). These methods represent the three typical categories of stereo matching approaches; refer to Table 7 for a detailed evaluation of the results. It can be observed that our end-to-end model achieves the best disparity estimation performance in terms of EPE in most cases, which demonstrates the generalization capabilities of our model. We note that SGM performs worst. This confirms that CNN based stereo matching algorithms have a distinct advantage for disparity estimation compared with more traditional methods.

From Table 7, we see that the EPE values of MC-CNN-fst⁵ are larger than those of any other end-to-end method. A possible reason is the fact that MC-CNN needs extra post-processing steps, which is often less accurate than an end-to-end solution. The main contribution of DenseMapNet is to reduce the number of parameters and the computation time. With 0.3 million parameters, this network has the smallest number of parameters and performs best on the Driving dataset. However, its EPE values on the other datasets are worse than those of our network. DispNet_css combines three individual networks to refine disparities. The results show that this network has a slightly better performance than our model on the FlyingThings3D and the Sintel datasets. This small improvement may come from its deep architecture with over 75 convolutional layers, resulting in a huge number of parameters (108 M). Our model requires fewer parameters (37 M) and achieves better results on the other datasets. The above comparison demonstrates that our model achieves competitive performance with respect to the current state of the art.

Furthermore, we also fine-tuned our network on the training dataset of KITTI2015 and then submitted our results to the KITTI online leaderboard.⁶ The results are shown in Table 8.

Table 9. Comparison of fine-tuning results using sparse disparities and interpolated disparities. The best results in terms of "D1-all" are shown in bold.

| Method | All pixels: D1-bg | All pixels: D1-fg | All pixels: D1-all | Non-occluded: D1-bg | Non-occluded: D1-fg | Non-occluded: D1-all |
|---|---|---|---|---|---|---|
| Baseline | 4.32 | 4.41 | 4.34 | 4.11 | 3.72 | 4.05 |
| Our | 3.61 | 7.14 | 4.20 | 3.41 | 6.59 | 3.94 |
| Our-filled | 3.60 | 6.89 | **4.15** | 3.41 | 6.46 | **3.91** |


⁵ To the best of our knowledge, results for MC-CNN-art on our testing datasets are not available in the literature.
⁶ http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo.



"D1-bg" denotes the 3-pixel error in the background, "D1-fg" the 3-pixel error in the foreground, and "D1-all" the 3-pixel error over all pixels. From Table 8, we can see that our method achieves a 3-pixel error of 4.20% and outperforms SGM (10.86%). However, we have not achieved a large margin of improvement over the baseline network (4.34%), and we perform worse than the state of the art (M2S_CSPN). A few issues can be mentioned in this regard. First, some detailed structures (e.g. a road sign in the red rectangle area in Fig. 8(a)) are contained in the KITTI training images, but not always present in the provided ground truth label images (e.g. only sparse pixels are provided in the red rectangle area in Fig. 8(b)). We actually observed that some of our predictions captured detailed structures in the scene (e.g. clear edges of the road sign in the red rectangle, as shown in Fig. 8(c)), but these are judged as wrong predictions, as this detail is missing in the ground truth label (e.g. high error values around the edges of the road sign in the red rectangle area in Fig. 8(d)). Second, as the ground truth labels are sparse (see e.g. Fig. 9(b)), it is impossible to obtain accurate ground truth disparity gradients for a number of locations in the image. For example, the disparity on the road changes smoothly from parts closer to the camera to parts further away. However, the ground truth labels (see Fig. 9(b)) do not reflect this smooth change, because of the sparsity of the laser scanning data. Because of these limitations in the ground truth, we believe that the KITTI dataset is not necessarily suitable for assessing our network.

We tried to overcome the problems related to the sparse ground truth by interpolation. We used a depth filling method (Ku et al., 2018) to complete the ground truth disparity of KITTI 2015. This method uses basic image processing operations to perform depth completion. An example of the completion results is shown in Fig. 9(c). As shown in the figure, most of the areas in the disparity map are interpolated. The resulting disparity maps were then used to fine-tune our model, and the results are shown in Table 9. From the fine-tuning results, we found that using interpolated disparities instead of sparse disparities indeed improves the results slightly, but the model still performs worse than other state-of-the-art methods. It seems that the ground truth of KITTI 2015 is too sparse to recover accurate disparities, at least when using simple interpolation methods. As can be observed in Fig. 9(c), the interpolation introduced a few wrong edges into the disparity map, e.g. on the trunks of trees, and also the road does not look as smooth as one could expect. Because of these limitations in the ground truth, we conclude that the KITTI dataset is not necessarily suitable for assessing the performance of our network.

5. Conclusion

In this paper, we have introduced significant improvements to the state-of-the-art CNN DispNet for dense stereo matching. We add dilated convolutions to exploit multi-scale context information, which is beneficial for disparity estimation in weakly textured areas. In addition, we increase the disparity range when constructing the cost volume to handle large disparities. Finally, a gradient regression loss derived from disparity gradient information is combined with the disparity regression loss to preserve sharper local structure details in areas with large depth discontinuities. The performance of our approach was evaluated on several challenging stereo datasets. The experiments demonstrate that our method achieves competitive performance and predicts more accurate and, in particular, more detailed disparity maps in the problematic areas discussed above, i.e. weakly textured regions and depth discontinuities.

In future work, we will focus on transferring our model to aerial images. In addition, as occlusions play an important role in stereo matching, we aim to extend our model by considering the occlusion problem explicitly, and we will also incorporate the left-right consistency constraint. Finally, it would also be interesting to investigate multi-view stereo matching based on deep learning methods.
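As a rough illustration of the combined loss summarized above, the sketch below shows how a disparity regression term and a gradient regularization term could be combined in TensorFlow. It is a minimal sketch under our own assumptions (L1 penalties, a weighting factor lambda_grad, and disparity maps of shape [batch, height, width, 1]); it is not the authors' exact formulation, which is given earlier in the paper.

```python
import tensorflow as tf

def combined_disparity_loss(pred_disp, gt_disp, lambda_grad=0.5):
    """L1 disparity regression loss plus an L1 penalty on the difference
    of the horizontal/vertical disparity gradients (gradient regularizer).

    pred_disp, gt_disp: tensors of shape [batch, height, width, 1].
    lambda_grad: weight of the gradient term (assumed value).
    """
    # standard disparity regression term
    disp_loss = tf.reduce_mean(tf.abs(pred_disp - gt_disp))

    # image_gradients returns finite differences along y and x
    pred_dy, pred_dx = tf.image.image_gradients(pred_disp)
    gt_dy, gt_dx = tf.image.image_gradients(gt_disp)
    grad_loss = (tf.reduce_mean(tf.abs(pred_dy - gt_dy)) +
                 tf.reduce_mean(tf.abs(pred_dx - gt_dx)))

    return disp_loss + lambda_grad * grad_loss
```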

Acknowledgements

The author Junhua Kang would like to thank the China Scholarship Council (CSC) for financially supporting her study at the Institute of Photogrammetry and GeoInformation, Leibniz Universität Hannover, Germany, as a visiting PhD student. Furthermore, we gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Atienza, R., 2018. Fast disparity estimation using dense networks. In: Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 3207–3212. https://doi.org/10.1109/ICRA.2018.8463172.
Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495.
Chen, J., Yuan, C., 2016. Convolutional neural network using multi-scale information for stereo matching cost computation. In: 2016 IEEE International Conference on Image Processing (ICIP). IEEE, pp. 3424–3428.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2018. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 834–848.
Cheng, F., He, X., Zhang, H., 2018a. Learning to refine depth for robust stereo estimation. Pattern Recognit. 74, 122–133.
Cheng, X., Wang, P., Yang, R., 2018b. Learning depth with convolutional spatial propagation network. arXiv preprint arXiv:1810.02695.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20, 273–297.
Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T., 2015. FlowNet: Learning optical flow with convolutional networks. In: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 2758–2766. https://doi.org/10.1109/ICCV.2015.316.
Geiger, A., Lenz, P., Urtasun, R., 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 3354–3361.
Godard, C., Mac Aodha, O., Brostow, G.J., 2017. Unsupervised monocular depth estimation with left-right consistency. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR 2017), pp. 6602–6611. https://doi.org/10.1109/CVPR.2017.699.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press.
Guney, F., Geiger, A., 2015. Displets: Resolving stereo ambiguities using object knowledge. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4165–4175.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hirschmuller, H., 2008. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30, 328–341.
Holschneider, M., Kronland-Martinet, R., Morlet, J., Tchamitchian, P., 1990. A real-time algorithm for signal analysis with the help of the wavelet transform. In: Wavelets. Springer, pp. 286–297.
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., 2017. Densely connected convolutional networks. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR 2017), pp. 2261–2269. https://doi.org/10.1109/CVPR.2017.243.
Ilg, E., Saikia, T., Keuper, M., Brox, T., 2018. Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation. In: Computer Vision – ECCV 2018. Springer International Publishing, Cham, pp. 626–643.
Kang, J., Chen, L., Deng, F., Heipke, C., 2019. Encoder-decoder network for local structure preserving stereo matching. In: Dreiländertagung der DGPF, der OVG und der SGPF, Wien, Österreich – Publ. der DGPF 28, 2019.
Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A., 2017. End-to-end learning of geometry and context for deep stereo regression. In: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 66–75. https://doi.org/10.1109/ICCV.2017.17.
Kingma, D.P., Ba, J., 2015. Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR).
Ku, J., Harakeh, A., Waslander, S.L., 2018. In defense of classical image processing: Fast depth completion on the CPU. In: 2018 15th Conference on Computer and Robot Vision (CRV). IEEE, pp. 16–22.
Li, Y., Huttenlocher, D.P., 2008. Learning for stereo vision using the structured support vector machine. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 1–8.
Li, Z., Yu, L., 2018. Compare stereo patches using atrous convolutional neural networks. In: Proceedings of the 2018 ACM International Conference on Multimedia Retrieval. ACM, pp. 473–480.
Luo, W., Schwing, A.G., Urtasun, R., 2016. Efficient deep learning for stereo matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703.
Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T., 2015. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. pp. 4040–4048. https://doi.org/10.1109/CVPR.2016.438.
Meister, S., Jähne, B., Kondermann, D., 2012. Outdoor stereo camera system for the generation of real-world benchmark data sets. Opt. Eng. 51, 21107.
Menze, M., Geiger, A., 2015. Object scene flow for autonomous vehicles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3061–3070.
Menze, M., Heipke, C., Geiger, A., 2018. Object scene flow. ISPRS J. Photogramm. Remote Sens. 140, 60–76.
Pang, J., Sun, W., Ren, J.S.J., Yang, C., Yan, Q., 2018. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In: Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW 2017), pp. 878–886. https://doi.org/10.1109/ICCVW.2017.108.
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99.
Scharstein, D., Pal, C., 2007. Learning conditional random fields for stereo. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 1–8.
Scharstein, D., Szeliski, R., 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47, 7–42.
Seki, A., Pollefeys, M., 2017. SGM-Nets: Semi-global matching with neural networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, pp. 21–26.
Shaked, A., Wolf, L., 2017. Improved stereo matching with constant highway networks and reflective confidence learning. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR 2017), pp. 6901–6910. https://doi.org/10.1109/CVPR.2017.730.
Tonioni, A., Tosi, F., Poggi, M., Mattoccia, S., Di Stefano, L., 2019. Real-time self-adaptive deep stereo. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 195–204.
Viola, P., Wells III, W.M., 1997. Alignment by maximization of mutual information. Int. J. Comput. Vis. 24, 137–154.
Wulff, J., Butler, D.J., Stanley, G.B., Black, M.J., 2012. Lessons and insights from creating a synthetic optical flow benchmark. In: European Conference on Computer Vision. Springer, pp. 168–177.
Yu, F., Koltun, V., 2016. Multi-scale context aggregation by dilated convolutions. In: International Conference on Learning Representations (ICLR).
Žbontar, J., LeCun, Y., 2016. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 17, 2287–2318.
Zhang, F., Prisacariu, V., Yang, R., Torr, P.H.S., 2019. GA-Net: Guided aggregation net for end-to-end stereo matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 185–194.
