Deep hierarchical guidance and regularization learning for end-to-end depth estimation

Deep hierarchical guidance and regularization learning for end-to-end depth estimation

Pattern Recognition 83 (2018) 430–442 Contents lists available at ScienceDirect Pattern Recognition journal homepage: www.elsevier.com/locate/patcog...

3MB Sizes 1 Downloads 36 Views

Pattern Recognition 83 (2018) 430–442

Contents lists available at ScienceDirect

Pattern Recognition journal homepage: www.elsevier.com/locate/patcog

Deep hierarchical guidance and regularization learning for end-to-end depth estimation Zhenyu Zhang, Chunyan Xu∗, Jian Yang∗, Ying Tai, Liang Chen School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, PR China

a r t i c l e

i n f o

Article history: Received 3 March 2017 Revised 1 May 2018 Accepted 13 May 2018

Keywords: Depth estimation Multi-regularization Deep neural network

a b s t r a c t In this work, we propose a novel deep Hierarchical Guidance and Regularization (HGR) learning framework for end-to-end monocular depth estimation, which well integrates a hierarchical depth guidance network and a hierarchical regularization learning method for fine-grained depth prediction. The two properties in our proposed HGR framework can be summarized as: (1) the hierarchical depth guidance network automatically learns hierarchical depth representations by supervision guidance and multiple side conv-operations from the basic CNN, leveraging the learned hierarchical depth representations to progressively guide the upsampling and prediction process of upper deconv-layers; (2) the hierarchical regularization learning method integrates various-level information of depth maps, optimizing the network to predict depth maps with similar structure to ground truth. Comprehensive evaluations over three public benchmark datasets (including NYU Depth V2, KITTI and Make3D datasets) well demonstrate the state-of-the-art performance of our proposed depth estimation framework. © 2018 Elsevier Ltd. All rights reserved.

1. Introduction Estimating depth from monocular RGB images is a fundamental problem in computer vision and many tasks can benefit from the depth information such as scene understanding [1], 3D modeling [2,3], robotics [4,5], action recognition [6], etc. The depth of a scene can be inferred from stereo cues [7,8]. However, when it comes to monocular scenes, depth estimation becomes actually an inherently ambiguous and ill-posed problem since a given image may correspond to an infinite number of possible world scenes [9]. To deal with the depth estimation problem, existing previous works often rely on strong priors and focus on the geometric knowledge [10–14]. Other works benefitting from RGB-D data show an improvement on dense depth maps estimation [15– 18]. Additionally, some efforts [19,20] have been made to leverage the labeling information which also contributes to depth estimation tasks. However, these above works rely on hand-craft features, strong priors and pre-processing to solve the depth estimation task, which have no universality for different real-world scenes. In recent years, deep learning methods have reached a big breakthrough on computer vision tasks [21] such as image clas-



Corresponding authors. E-mail addresses: [email protected] (C. Xu), [email protected] (J. Yang).

https://doi.org/10.1016/j.patcog.2018.05.016 0031-3203/© 2018 Elsevier Ltd. All rights reserved.

sification [22–24], semantic segmentation and scene parsing [25– 27] and pose estimation [28]. Efforts based on CNNs have also been successfully introduced to monocular depth estimation tasks [9,29–31], which significantly improve the estimation performance. However, these methods still rely on multi-stage processing pipelines, e.g., super-pixel and CRF, respectively trained multiscale networks. Moreover, their predictions are not sufficiently high resolution. Motivated by these witnesses, a more straightforward method for high resolution pixel-wise prediction is necessary. Previous works often employ feedforward networks [9,29,30,34] for the problem of depth estimation, which do not well consider multi-scale features of previous lower layers with abundant image detail information. The recent work [34] proposed a deeper network architecture for depth estimation and obtained good performance, however, the predictions of their method are too smooth and lack details, which is caused by the pooling operations. Eigen and Fergus [32] have adopted skip connections to introduce multi-scale information in process of estimating depth maps, but these operations also introduce much noise to the final prediction. As illustrated in Fig. 1, the predictions of [32] contain inaccurate geometric structures and coarse boundaries, especially at the location of the table lamp. The more recent work [33] introduced the detailed information using multi-Scale continuous CRFs with a sequential deep networks, but the predictions of their method also contain much noise. As illustrated in Fig. 1, the predictions of [33] obtains much obvious discontinuity

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442

431

Fig. 1. Illustration of predicted depth maps with different methods. (a) input RGB image; (b) ground truth; (c) depth maps predicted by Eigen and Fergus [32]; (d) predictions of [33]; (e) predictions of our HGR framework. The predictions of our HGR framework are more fine-grained and detailed.

and noise at the location of bed, walls and table lamp. The main problem above is that the semantic information contained in image details is much different from the depth information, which we call a “semantic gap”. For example, in Fig. 1, the regions of bed and curtain in the RGB image contain much texture information because of rivel, and the pedestal of the table lamps has similar color compared with the curtain. However, in the depth map, the depths of bed and curtain are smooth and continuous without much texture, but the pedestal has very different depths compared with the curtain. As a result, the texture and detailed information of RGB images may not be always proper and useful for depth estimation because of such semantic gap. As the semantic gap is large, the successful method [35] in super-resolution or other approaches in semantic segmentation [36,37] and edge detection [38] may not be suitable for depth estimation task. Inspired by these witnesses, we argue that a feedforward network without lower feature guidance may be difficult to better predict depth maps with fine-grained details, and meanwhile the lower feature guidance may not be suitable to directly used in depth estimation tasks. We propose a novel hierarchical network architecture which is significantly suitable for depth estimation or other dense map prediction task. This proposed network architecture integrates hierarchical features to progressively refine the estimation process and guide the network to produce fine-grained prediction. Further, it solves the ‘semantic gap’ problem between image texture and depth information, improving the performance of the final prediction. As illustrated in Fig. 1, the predictions of our approach contain fine-grained details and match the geometric structure exactly. To our knowledge, this effort is first been made in depth estimation task. While estimating depth maps, the depth structure information is of great importance. For example, depth values are similar on the positions of an object, but possibly containing huge variations on the surroundings of its boundaries. Previous works attempt to directly extract this depth structure information from RGB images by super-pixel CRF models [29–31] or using hand-craft feature (HSV color histogram) based local image patch correlation [39]. However, these methods only utilize pixel-level depth information, failing to predict accurate depth maps on the textured images containing many objects. In the work of [9], Eigen et al. present a scale invariant loss which leverages the depth relations of different pixels to optimize their network. This is a more direct way to use depth information, but actually scale invariant loss introduces no region-level depth structure and its advantages are not demonstrated experimentally. Due to these limitations of considering less depth structure information, the above approaches are difficult to improve the performance.

In this paper, we propose a novel deep Hierarchical Guidance and Regularization (HGR) framework for depth estimation, which well integrates multi-scale depth semantic features and variouslevel information of depth maps to guide the estimation processing. Given an input monocular RGB image, our method can predict a correspondingly-sized depth map in an end-to-end way, as illustrated in Fig. 2. The basic feedforward network mainly associates to multiple convolutional and deconvolutional operations, and the architecture fashion is based on ResNet [40]. To resolve the texture loss caused by pooling operations, we develop a hierarchical depth guidance and regularization strategy to utilize multi-level details. First, refining networks are build on side of conv-net of each scale, predicting multi-scale depth maps guided by supervisions. In this way, the features of these refining networks are highly correlative with depth information. Second, we combine the depth semantic features with upper deconv-features of the corresponding scale. This concatenation operation provides hierarchical depth guidance for the estimation process, progressively resolve the ambiguity and smoothness by hierarchical depth details. Finally, a multi-regularized network learning method is applied to optimize our network, which leverage hierarchical structure information from depth maps to guide the network optimization. The whole framework is trained in an end-to-end way, predicting depth maps without any pre-and post-processing operations. To sum up, the major contributions of this work can be summarized as follows:







We propose a novel hierarchical guidance network architecture for depth estimation, which well leverages hierarchical lower depth semantic features to progressively guide the predicting processing of deconv-upper layers. Then our method can well consider the semantic difference between images texture and depth details and predict depth maps with more fine details. A novel multi-regularized network learning strategy is introduced to optimize the parameters of our depth estimation network. This learning method employs various-level information of depth maps, which contributes to high quality estimation. We demonstrate theoretically and experimentally why it is suitable for this task. Depth maps are predicted end-to-end by our framework, without any pre-and post-processing operation. We demonstrate that the proposed method obtains state-of-the-art depth estimation performance over three public benchmark datasets (including NYU Depth V2, KITTI and Make3D datasets), especially in outdoor scene datasets.

432

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442

Fig. 2. Overview of our end-to-end depth estimation framework. The feedforward backbone is build based on ResNet-50, associating to multiple conv-and deconv-residual blocks. Each refining network contains five conv-residual blocks which has a same fashion as those in ResNet-50 and several 1 × 1 conv-layers, predicting depth maps guided by supervisions. A fusion module is developed to improve the quality of each scale prediction, obtaining more valuable semantical features. We combine these depth feature maps of refining network with corresponding scale deconv-features, providing depth details for the deconv-net and progressively guiding the estimation process to predict fine-grained depth maps. The whole depth estimation network is regularized by our hierarchical regularization strategy which integrates pixel correlation, local structure and global consistency of depth maps. Given an RGB image, the proposed framework predicts a correspondingly-sized depth map in an end-to-end way. Example images are from NYU Depth V2 dataset [1].

2. Related work

3. HGR learning for end-to-end estimation

Estimating depth from monocular image is inherently ambiguous and ill-posed. A key strategy in early works for handling these problems was to use prior knowledge and strong assumptions [15–18]. Some other works like [19,20] rely on semantic labels to predict depth maps. Deep learning based depth estimation has achieved great success during the past few years. Eigen et al. [9] proposed a multi-scale deep network for dense depth map estimation. They used a simple deep network to predict dense depth maps and another network to refine this coarse output. Additionally, they proposed the scale invariant loss to optimize their network, leveraging depth correlations between different pixels. Liu et al. [29] used super-pixel in the same way as[19] and combined CNN and CRFs in a unified framework. They learned unary and pairwise potentials jointly with CNN and found a close-form solution for their CRF model to predict a depth value for each superpixel. In the work of [30], they used a similar model as[29], but applying a refinement method to produce pixel-wise depth maps from the superpixel-wise ones. Wang et al. [31] combined semantic segmentation and depth estimation tasks using CNN and hierarchical CRF. More recently, Roy and Todorovic [39] proposed a neural regression forest model which combined CNN and regression forest to deal with depth estimation problem. These above previous works make it possible to learn depth from monocular RGB images. But unfortunately, all the models mentioned above need strong priors [15,16,17,18] or contain preand post-processing, such as super-pixel [29], CRF [30], respectively trained multi-scale networks [9] and hand-craft features. In contrast, our method has an end-to-end architecture without any pre-or post-processing, predicting dense depth maps with higher resolution and better quality. Moreover, these above approaches learn respective networks under the guide of limited RGB image local features [31] or pixel-wise depth information [9], and all of the above deep learning based methods contain feedforward architectures, which fails to utilize depth structure information of ground truth or multi-scale features of the networks, bringing difficulty to improve the performance. In contrast, our method combines hierarchical depth semantic conv-features with upper deconv-features to predict fine-grained depth maps, meanwhile it integrates various-level information of depth maps to optimize the network, predicting depth maps with similar structure as ground truth.

In this section, we propose a novel Hierarchical Guidance and Regularization (HGR) framework for the depth estimation task, as illustrated in Fig. 2. The proposed HGR framework learns a mapping between RGB images and corresponding depth maps, predicting depth in an end-to-end way. We describe the two properties of the HGR in the following.

3.1. Hierarchical depth guidance We aim to utilize various-level texture information to progressively guide the estimation process. Hence, we develop a novel network architecture to provide hierarchical depth guidance for estimation process. Backbone: The network backbone contains a feedforward architecture, associating to multiple convolution and deconvolution operations. The convolution network is build based on ResNet-50, and we remove the fully connected layers. The deconvolution network has a same fashion as the convolution network, but only containing 3 residual blocks in each scale. The feedforward backbone can be used as a baseline, and its predictions are usually coarse and smooth due to lack of depth details. See Section 4.3 for experimentally demonstration. Hierarchical guidance: To counteract the texture loss caused by pooling operations, we develop a hierarchical guidance strategy to guide the estimation process, predicting fine-grained depth maps. As illustrated in Fig. 2, we apply side output on the last conv-layer of 3 scales including 120 × 160, 60 × 80 and 30 × 40, employing these features to a refining network of corresponding scale respectively. Each refining network contains 5 conv-residual blocks and several following 1 × 1 convolution layers, predicting depth maps of the corresponding scale guided by supervision. In this way, the features of the refining networks are highly correlative with depth information, meanwhile containing different-level texture and details. Then, in each refining network, we concatenate the features to deconv-features in the deconv-network of corresponding scale, as shown by blue arrows. We denote the features of the ith convGj

layer of refining network or deconv-layer of the scale j is Fi

or

D

Fi j , respectively, then the hierarchical guidance can be expressed as D

D

D

G

D

j F2 j = σ (W2 j  (F1 j , Fr−1 ) + b2 j ),

(1)

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442

Where r is the number of conv-layer of each refining network and

σ is ReLU, while  represents the deconv-operation. D

D W2 j

D F2 j

spectively. In this way, in each scale j, the deconv-features obtain the depth guidance which contains depth details and texture. Hence, the deconv-network will utilize the hierarchical depth guidance to progressively refine the prediction, capturing depth information of different scales and producing fine-grained depth maps. Fusion module: While predicting depth maps by our refining networks, we find that it is difficult to obtain satisfactory prediction in the scale of 120 × 160 and 60 × 80, compared with those in the scale of 30 × 40. The reason is that while predicting depth maps of different scales, the information which can be utilized is unfair. The predictions of 30 × 40 are usually good because the network is deep which contains much useful information. However, compared with the condition of the 30 × 40 predicting process, the network is much shallower when predicting 60 × 80 or 120 × 160 depth maps, which means much less useful information can be used. In this case, the depth representations of 120 × 160 and 60 × 80 refining network may also be less valuable, providing ambiguous guidance for the deconv-estimating process. To resolve this ambiguity, we develop a fusion module to combine the features of different scale refining network, which just contains simple operations. Denote j1 , j2 and j3 represent the scale of 30 × 40, 60 × 80 and 120 × 160 respectively. The features of the ith G

conv-layer of refining network in the scale j are denoted as Fi j , while W and b are denote as filter and bias in corresponding layer respectively. Then the total fusion module can be expressed as Gj ↑

Gj

Gj

Gj

Gj ↑

(2) Gj

j2 2 Fr−12 = σ (Wr−1 ∗ (Fr−22 , Fr−21 ) + br−1 ), Gj ↑

(3)

Gj

j3 = σ (Wr−1 ∗(

Gj ↑

Gj Gj ↑ Fr−23 , Fr−12

)+

To utilize the depth structure information in ground truth, we propose a novel hierarchical regularization learning method. This proposed regularization method contributes to the point-to-point, shape and distribution similarity between predictions and ground truth. We introduce these properties in the following. Pixel-to-pixel regularization: We consider to leverage pixellevel difference between predictions and the corresponding ground truth. To obtain the pixel-level difference, we use the L2-norm between this two depth maps. The normalized total sum reveals the pixel-level correlation between a predicted depth map and its corresponding ground truth. In this way, the pixel-to-pixel regularization (Lpixel ) between these two depth maps can be formulated as:

Lpixel =

n 1 1 P − G22 = ( pi − gi )2 , n n

where p is a predicted depth map and g is the corresponding ground truth, and i is pixel index while n is the number of pixel in each map. This regularization is a reliable strategy for such regression task, but considering only point-to-point correlation without any depth structure information. Hence only using this regularization will bring ambiguity while optimizing the network and prediction poor depth maps on complicated scenes. Shape-guaranteed regularization: The depth of a scene is usually discrete and changes a lot on the positions of the object boundaries. To predict depth of these boundaries, we need to utilize the local information of the depth maps. The discreteness of depths at the boundaries reveal conspicuous gradient information, and we apply this local information as a shape-guaranteed regularization. We denote three shape-guaranteed regularizations, one is the L2 norm based regularization which can be formulated as:

),

(5)

Gj ↑

Where Fr−21 and Fr−12 mean the output of each deconv-operation respectively, and σ means ReLU. The operators ∗ and  represent convolution and deconvolution respectively. In this way, the predictions of 60 × 80 become better than those of 30 × 40 Gj

due to fusing 30 × 40 depth feature maps Fr−21 , while the predictions of 120 × 160 become the best among those of these 3 Gj

scales due to fusing the features Fr−12 which combine the information of the other two scales. This proposed fusion module guarantees the ”quality” of the hierarchical depth guidance, making it authentically guide the estimation process. This proposed fusion module is significantly important in our framework, experiment in Section 4.3 will demonstrate the effect of this fusion module. Discussion: Our hierarchical network architecture contains multiple fusion and refining processes, which is significantly different with previous directly multi-scale fusion networks, such as [38]. We do not follow the direct skip connection used in [35] or top-down refinement architecture used in [37], because the features of our backbone conv-network and deconv-network contains highly different semantic information (appearance and depth), hence directly combining these features will bring ambiguity and difficulty to estimation process. We apply multi-scale supervisions to make the features of refining network approach the depth representation, rather than applying the supervisions to deconv-estimation process which still fails to utilize the image texture and details.

n 1 2 (∇x di + ∇y2 di )2 n

(7)

i=1

(4) G j3 br−1

(6)

i

Llocal-L2 =

Fr−12 = σ (Wdj2  Fr−12 + bdj2 ), Gj Fr−13

3.2. Hierarchical regularization learning

is the

deconv-filter and b2 j is the bias of the corresponding layer, re-

Fr−21 = σ (Wdj1  Fr−21 + bdj1 ),

433

after normalized, where d is the difference map of the prediction and ground truth, n is the valid pixel number of a depth map and i is the index. Inspired by the convex form total variation, we denote the second shape-guaranteed regularization as:

Llocal-L1 =

n 1 2 |∇x di | + |∇y2 di | n

(8)

i=1

where the gradient information is represented by the L1 norm. The third shape-guaranteed regularization is inspired by the Sobel operator, which can be formulated as:

Llocal-Sobel =

n 1  sobel |∇x di | + |∇ysobel di | n

(9)

i=1

where the ∇xsobel and ∇ysobel represent the horizontal and vertical convolutional Sobel operator, which calculate the gradient information with more adjacent pixels. In this way, the gradient information at each pixel will influence the optimization process, especially at the positions of object boundaries. As this regularization will lead each pixel of predictions to approach the corresponding ground truth in a gradient-approximation way, the object shapes will be guaranteed in the predictions. Global consistency regularization: The depth of different scenes usually contains different distributions, e.g., the scene of the urban environment and that of a bedroom. To obtain predictions close to ground truth, we also need to compare their depth distributions. The normalized cross-covariance, which describe the relations of two variables, is a good means to measure the overall similarity of a prediction p and corresponding ground truth g,

434

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442

which is defined as

1 nσ p σg

n 

( pi − μ p )(g∗ − μg ),

i=1

where i is the pixel index, μp and μg are the mean depth values of p and g respectively, while σ p and σ g are corresponding standard deviation. The statistical characters of cross covariance (mean depths and standard deviations) reveal the global structure of depth maps. The normalized cross-covariance between a good prediction and the corresponding ground truth is close to 1. Hence, we define the squared error



1

Lglobal =

nσy σy ∗

n 

2

(yi − μy )(y∗i − μy∗ ) − 1

.

(10)

i=1

to describe the global consistency of predictions and ground truth. To obtain low error, each pixel pi in the prediction must differ in mean depth μp by an amount similar to that of the corresponding pixel gi in the ground truth. In fact, transforming the prediction and the corresponding ground truth into vectors and setting P =  = [ g 1 − μg , g 2 − μg , . . . , g n − [ p1 − μ p , p2 − μ p , . . . , pn − μ p ] and G μg ], we have



Lglobal =

 P · G −1  PG 

2

.

(11)

The global consistency regularization Lglobal is equivalent to a cosine similarity, which describes the difference more integrally without absolute element errors. It describes the angular distance between the two variables. Sometimes though the Euclidean distance between two variables is large, the cross-covariance may be small due to the small angular distance between the two variables. While optimizing the network weights, the gradient of Lglobal at each pixel contains the global characters (mean depth and variance) of the depth maps. Hence, it will provide a reliable gradient reference which combines the global structure information, contributing to further updating. We combine the three properties together to utilize hierarchical depth information for network optimization, and the proposed Hierarchical Regularization learning method is formulated as

LHGR () = λ1 Lpixel + λ2 Llocal + λ3 Lglobal ,

(12)

where  denotes the parameters of network, λ1 , λ2 and λ3 are constants to weight the different regularizations. In this way, the proposed hierarchical regularization learning method integrates the pixel-level correlation, local structure and global consistency in depth maps. During training, this multi-regularization will optimize the parameters of our networks to predict depth maps.

p(G|P ) = N (P, ε 2 ) = N ((x; ), ε 2 ),

Our network has a same fashion as ResNet, associating to multiple conv-and deconv-residual blocks. The supervisions which are applied to guide the refining networks are pooled directly from ground truth. We use a pre-trained model on imagenet to initialize the weights of convolution backbone. While training our network, we use the hierarchical regularization method defined as Eq. (12). It is noted that λ1 , λ2 and λ3 are three constants that are used as weights to balance the three regularizations. If just setting the three weights as constants, these regularizations will provide invariable influence during optimizing process, which is improper as the updating goes on. Simply, we first use an adaptive way to set these weights. At the iteration step t, we can get the values (t ) (t ) (t ) of the three regularizations: Lpixel , Llocal , Lglobal . Define the sum (t ) Lpixel

L(t )

, λ2 =

(t ) Llocal , L(t )

λ3 =

(t ) Lglobal

L(t )

. In

(13)

where ε is the observed noise scalar. To maximize the Gaussian likelihood, the log likelihood can be written as:

log p(G|P ) ∝ −

1 G − (x; )2 − log ε 2 , 2ε 2

(14)

with the parameter ε which captures how much noise we have in the prediction. Note that maximizing Eqn. (14) is similar to minimize the pixel-to-pixel regularization, except for the noise parameter ε . Inspired by Kendall et al. [41], if we can transform our objective functions into the form of Eq. (14) with given likelihoods, we can regard the noise parameter ε as part of the balancing weight and optimize it during training. If using Llocal-L2 defined in Eq. (7) as the shape-guaranteed regularization, we can also define the likelihood of the depth gradient regression task as a Gaussian with mean given by the depth gradient of network predictions, and then the log likelihood can also be written as the form of Eq. (14) with ground truth depth gradient and predicted depth gradient. For Llocal-L1 and Llocal-sobel , the defined likelihood turns to be Laplacian and the corresponding log likelihood can be also written as the form of Eq. (14). For the Lglobal , setting y˜i =

Lglobal =

yi −μy

σy

and y˜∗i =

y∗i −μy∗

σy ∗

, we have

n 1 (y˜i − y˜∗i )2 , n

(15)

i=1

which has a similar form to pixel-to-pixel regularization. That is to say, with defined likelihoods, minimizing the proposed Shapeguaranteed and Global consistency regularization can all be extended to maximizing log likelihoods in a similar form as Eq. (14) with different norms and noise parameters. Inspired by Kendall et al. [41], denote λi = 21ε , the final objective function to be optimized can be defined as:

LHGR (, λ1 , λ2 , λ3 ) =

3.3. End-to-end depth estimation

of them as L(t ) , then we set λ1 =

this way, the balance weights will keep changing, depending on the corresponding values of regularizations. This adaptive balance setting will help the network to update in its most needed way. Experiments in Section 4 demonstrate its appropriateness. In fact, we can also optimize the balancing parameters during the training process from the perspective of maximum likelihood inference, according to [41]. Here we make inference based on the theory of [41]. Defined the ground truth as G and the prediction of our network as P = (x; ) where x is the input image and  is the network transformation with network parameter , we can define the likelihood of the depth regression task as a Gaussian with mean given by the network prediction:

i

λ1 Lpixel + λ2 Llocal + λ3 Lglobal + ψ (λ1 , λ2 , λ3 ),

(16)

where the parameter λi also need to be learned and optimized during the training process, and ψ (λ1 , λ2 , λ3 ) = log 2λ1 + 1

log 2λ1 + log 2λ1 . In this way, the relative weights of different reg2 3 ularizations can be learned in a principled and well-founded way. When the parameters λi are too large or too small, it will bring significant penalty to the object function, so as to avoid making improper and extreme balancing weights during the training. Relative experiment will be illustrated in Section 4. Most ground truth depth maps have missing values at some pixels, especially near object boundaries and positions of sky. During training, we simply mask them out and calculate the loss only on the pixels with depth values. That is to say, the n in Eq. (12) does not express the total number but the valid number of pixels in a depth map.

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442

4. Experiments In this section, we test our model on different datasets and compare the results with those of state-of-the-art methods. We also analyze the effectiveness of different network architectures and regularizations. The experiment details are shown in the following. 4.1. Experimental settings Dataset: We evaluate the effectiveness of our proposed model on NYU Depth v2 [1], KITTI [42] Eigen split (selected by Eigen et al. [9]), KITTI official depth prediction dataset [43] (KITTI 2018) and Make3D [3] datasets. The pixels that have missing depth values remain unfilled. These missing values have no influence on our model. The NYU Depth v2 dataset [1] consists of RGB-D images of 464 indoor scenes. We use the raw data of official train split which contains 249 scenes for training and official test split for testing respectively. The original images are 480 × 640 and are cropped to 427 × 561 to remove the boundaries where the depth values do not exist. The KITTI dataset [42] is composed of several urban scenes captured by LIDAR scanner and car-mounted cameras. We train our model on KITTI to prove that our model has the ability to predict outdoor depth maps. For the Eigen split, following [9], we use the raw data selected from the “city”, “residential” and “road” categories. The whole data consists of 56 scenes and we split them into 28 for training and 28 for testing. The original RGB images are 368 × 1224, and we cut off a small upper boundary of each image as the pixels in the boundary usually represent sky and can not be scanned by LIDAR. The final images are 256 × 1224 and cropped into 6 256 × 256 images to form the network inputs. For the KITTI official depth prediction benchmark, we use the standard training set to train our model and evaluate our method on KITTI online benchmark. The Make3D dataset [3] consists of 400 training images and 134 testing images, gathered using a custom 3D scanner. As the ground truth depth map is restricted to 305 × 55 while the original RGB images are 1704 × 2272, following [17], we resize all the RGB images and ground truth depth maps to 345 × 460. The predictions are depth maps with half size of the input images, and we resize them to the input size using bilinear upsampling while testing. Same as [34], we mask out the pixels deeper than 70m in both training and testing to avoid the limitation of the old type scanner. Implementation details: For NYU Depth V2, the training set has 7K original images selected from the raw data of official train split and each image is cropped to several images in a size of 288 × 288. The training set contains 7k and 30k unique images in different settings respectively. Following [34], we use the same strategy to augment the training set. The images for test are in original size (480 × 640). We use the official test split which contains 654 RGB images for test. For KITTI Eigen split, following [9], we use the same training scenes and the training set has 6K images randomly selected from these scenes, while the same testing set which contains 697 images is use to evaluate our method. During testing, as the original images are too large, we crop a test image into 6 256 × 256 images. These cropped images have small overlaps, but these overlaps do not influence the test results. For KITTI official depth prediction dataset, we use a 352 × 1216 center crop from each image as input, and upsample our predictions to the original size for evaluation. 
For Make3D dataset, we use the same augmentation strategy as [34] and augment the training images to 15k samples. A pretrained model on the ImageNet classification task [44] is used to initialize our network. Learning rates are 10−5 for basic convolution network and 0.01 for refining network and deconvolution layers, reducing with a scale of 0.1 after

435

2 epochs. The momentum is 0.9 and weight decay is 0.0 0 05. We train the network using SGD with batches of size 12 for 8 epochs on NYU Depth v2, 4 epochs on KITTI Eigen split, 8 epochs on KITTI official depth prediction dataset and 15 epochs on Make3D. The models are trained on a NVidia K80 GPU. Evaluation criteria: We compare our method against the published results of current methods on NYU depth V2 and KITTI datasets. Some commonly used measures are applied here for quantitative evaluations: • • • • •

y

y∗

threshold (δ ): % of yi s.t. max( y∗i , yi )=δ i i  |y −y∗ | average relative error (rel): 1n i i y∗ i ;



i

 root mean squared error (rms): 1n i (yi − y∗i )2 ;  average log10 error (log10): n1 i | log10 yi − log10 y∗i |; scale invariant error (scale-inv):



1  ((log yi − log y j ) − (log y∗i − log y∗j ))2 2n2 i, j

• • • •

For KITTI 2018 dataset, it uses a different evaluation criteria as following: Scale invariant logarithmic error (SILog) [log(m)∗ 100] Relative squared error (percent) (sqErrorRel) Relative absolute error (percent) (absErrorRel) Root mean squared error of the inverse depth [1/km] (iRMSE)

Note that the ground truth depth maps may have miss values at some pixels, we mask out these invalid pixels during testing and n in these measures means the number of valid pixels. 4.2. Results and comparison We compare the proposed end-to-end depth estimation model with current state-of-the-art approaches on three datasets. NYU Depth V2 Dataset: We compare our approach with the state of the art on NYU Depth v2 dataset. The results are shown on Table 1, and the values are those reported by the authors in their respective paper. As we observed, our model obtains stateof-the-art performance in rms, scale-inv, δ > 1.25 and δ > 1.252 metrics. Compared with the traditional methods [16,17,45,46], our method (Ours 7k) obtains 64.2% average relative gain in all metrics. Compared with deep learning based method [9,29,30,31,39], our method obtains 23.54% average relative gain in all metrics. These results reveal that our method is much more efficient than these traditional or deep learning based methods. Compared with Eigen and Fergus [32], our method with 7k training images obtains 15.65% average relative gain in all metrics. These results demonstrate that our method can obtain better performance. Compared with the more recent work [33], our method with 7k training images is weaker on rel, but getting better performance in rms and δ metrics, obtaining 7.64% relative gain. Note that in [33], they used 95k unique image and depth pairs to train their final network. However, we only use 7K unique pairs (around 1/14 of their training set) to train our method and still get a competitive performance. While increasing the training set, the performance of our method gets obvious gains in all metrics. Our method with 30k training images significantly outperforms the approach of [33] with 95k training images, obtaining 3.08% relative gain totally. These results demonstrate the superiority of our method. Compared with [34], our method with 7k training images is slightly weaker in rel metric but outperforms in other metrics, obtaining 9.08% relative gain in these metrics. This proves that our method can achieve competitive or even better results. While comparing qualitative results, the advantages of our method can be easily observed. As shown in Fig. 3, the results of

436

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442 Table 1 Comparisons between our approach and the state-of-the-art on NYU Depth V2 Dataset. rms

rel

log10

scale-inv

δ > 1.25

δ > 1.252

δ > 1.253

Karsch et al. [16] Liu et al. [17] Li et al. [30] FCDLR [45] RCL [46] Liu et al. [29] Wang et al. [31] Zhuo et al. [18] Eigen et al. [9] Roy and Todorovic [39] Eigen and Fergus [32]

1.200 1.060 0.821 0.839 0.802 0.824 0.745 1.040 0.877 0.744 0.641

0.250 0.335 0.232 0.260 0.254 0.230 0.220 0.305 0.214 0.187 0.158

– 0.1270 0.0940 0.0998 0.0960 0.0950 0.0940 0.1220 – 0.0780 –

– – – 0.242 0.236 – – – 0.219 –

– – 0.621 0.598 0.610 0.614 0.605 0.525 0.611 –

– – 0.886 – – 0.883 0.890 0.838 0.887 –

– – 0.968 – – 0.971 0.970 0.962 0.971 –

Xu et al. (5k) [33] Xu et al. (95k) [33] Laina et al. [34] Ours (7k) Ours (30k)

0.613 0.586 0.573 0.550 0.540

0.143 0.121 0.127 0.158 0.134

0.0650 0.0520 0.0550 0.0520 0.0512

0.171 – – – 0.127 0.124

0.769 0.789 0.811 0.811 0.824 0.830

0.950 0.946 0.954 0.953 0.955 0.964

0.988 0.984 0.987 0.988 0.988 0.992

Fig. 3. Some examples of depth estimation result on NYU Depth V2. (a)RGB input; (b)ground truth depth maps; (c) results of Eigen and Fergus [32]; (d) results of [34]; (e) results of our method.

[32] contain not only some depth details but also much noise, failing to predict exact geometric structure at the positions of doors, bookrack and the wall. The results of [34] are too smooth, lacking details at the positions of object boundaries. Hence, it cannot predict object shape exactly but coarse and smooth depth maps instead. However, our method can predict fine-grained depth maps, meanwhile the geometric structures are exactly matched as observed in Fig. 3. The results of our method contain exact object shapes and depth details (e.g., at the positions of chairs, tables and bookrack, etc.) compared with the ground truth, which demonstrates the effectiveness of our hierarchical guidance and regularizations. KITTI eigen split: We compare our model with several stateof-the-art methods on KITTI Eigen split. The results are shown in Table 2. The results of make3D [15] are reported in [9], and the results of Liu et al. [29] and the recomputed results of [9] are reported in [47]. For the results in [47], we compare with the ones calculated on the position where the depth values are smaller than 80 m. As we observed, our method obtains best performance in rms, log10, scale-inv, δ > 1.252 and δ > 1.253 metrics. Compared with traditional methods [15,45,46], our method obtains 37.06% av-

erage relative gain in all metrics. Compared with deep learning based methods [9,29,48,49], our method obtains 22.07% average relative gain in all metrics. These results reveal that our method does obtain better prediction on KITTI dataset than the previous traditional or deep learning based approaches. Compared with the more recent work [47], our method is slightly weaker in rel and δ > 1.25 metric (7.35% relative loss), but obtaining 22.8% relative gain in rms, δ > 1.252 and δ > 1.253 metrics, which demonstrates that our predictions are more reliable. In fact, the outdoor depth maps contain less texture (the LiDar points are sparse compared with the RGB pixels) which makes the outdoor depth scenes significantly different from indoor depth scenes. Similarly, our method still obtains good performance, which demonstrates its robustness. Qualitative results are illustrated in Fig. 4. Comparing to the results of [9], our predictions are highly close to ground truth and one can easily recognize the depth information of objects, such as cars, trees and road signs. Even when the environment is complicated (e.g., at the intersection), our method can still obtain good performance. These demonstrates that our predictions are reliable on automobile environment perception.

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442

437

Table 2 Comparisons between our approach and the state of the art on KITTI dataset.

make3D [15] CCDLR [45] FCDLR [45] GCL [46] RCL [46] Liu et al. [29] Eigen coarse [9] Eigen fine [9] Cadena et al. [48] Garg et al. [49] Godard et al. [47] Ours

rms

log10

rel

scale-inv

δ > 1.25

δ > 1.252

δ > 1.253

8.734 6.607 6.589 6.608 6.437 6.986 6.414 6.179 6.960 5.104 5.381 4.310

– 0.0833 0.0828 – – – – – – – – 0.0382

0.280 0.183 0.216 0.218 0.206 0.217 – 0.197 0.251 0.169 0.126 0.136

0.327 0.262 0.261 0.262 0.260 – 0.206 0.246 – – – 0.178

0.601 0.679 0.692 0.691 0.699 0.647 0.679 0.692 0.643 0.740 0.843 0.833

0.820 – – – – 0.882 0.897 0.899 0.833 0.904 0.941 0.957

0.926 – – – – 0.961 0.967 0.967 0.925 0.962 0.972 0.987

Fig. 4. Some examples of depth estimation result on KITTI. (a) RGB input; (b) ground truth depth maps (interpolated for visualization); (c) results of [9]; (d) results of our method. We can observe that the results of our method are very close to the ground truth visually. Table 3 Comparisons between the proposed method and the state-of-the-art approaches on KITTI 2018. Method

SILog

sqErrorRel (%)

absErrorRel (%)

iRMSE

Runtime (s)

DORN_v1 DHGRL (our method) LSIM GoogleNet V1 TASKNetV1_ROB GoogLeNetV1_ROB

11.80 15.47 17.92 18.19 23.96 33.49

2.19 4.04 6.88 7.32 7.24 24.06

8.93 12.52 14.04 14.24 15.37 29.72

13.22 15.72 17.62 18.50 22.87 35.22

0.2 / GPU 0.2 / GPU 0.08 / GPU 0.2 / GPU 0.18 / GPU 0.05 / 1 core

KITTI 2018: The results of our method on KITTI 2018 can be observed in Table 3. The model is built based on ResNet-50. As currently none of the method is published, here we directly compare the performance. Our model is slightly weaker than DORN_v1 but outperforms all other approaches in all the metrics, especially the GoogLeNetV1_ROB, TASKNetV1_ROB and GoogleNet v1. The visual results can be viewed in Fig. 5. As the ground truth is anonymous, the benchmark only provided the error image between our predictions and ground truth. It can be observed that our approach can predict good depth values at the near regions, but weaker depth values at the far regions. Note that in the far regions, the LiDAR points are usually more sparse and incomplete than those in the near regions. Such phenomenon may influence the network learning of the depth of the far regions. Despite such imperfect ground truth our method can provide satisfactory predictions. Make3D dataset: We compare our method with some state-ofthe-art methods on Make3D dataset. The results are illustrated in Table 4 and we only compare the pixels with depths less than 70m.

Table 4 Comparisons between our approach and the state of the arts on Make3D dataset.

karsch [16] Liu et al. [17] Liu et al.[29] Li et al. [30] Xu et al. [33] Laina et al. [34] Ours

rms

rel

log10

9.20 9.49 8.60 7.19 4.38 4.46 4.36

0.355 0.335 0.314 0.278 0.184 0.176 0.181

0.127 0.137 0.119 0.092 0.065 0.072 0.066

As it can be observed, our method obtains significantly better results in all metrics compared with the approaches of [16,17,29,30]. Compared with the results of [34], our method obtains 2.7% relative loss in rel metric but gets superior results in rms and log10 metric with 10.5% relative gain. Such performance reveals that our method obtains better predictions on Make3D dataset compared

438

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442

Fig. 5. Some examples of depth estimation result on KITTI 2018. (a) RGB input; (b) predictions of our method; (c) error image compared with ground truth (red regions represent larger errors). We can observe that the results of our method can produce satisfactory results.

Fig. 6. Some examples of depth estimation results on Make3D. (a) RGB input; (b) ground truth depth maps; (c) results of our method. We can observe that the results of our method are very close to the ground truth visually.

with [34]. Compared with the approach in [33], our method obtains weaker results in log10 metric but gets better performance in rms and rel metrics, obtaining a slightly 0.5% average relative gain. Such performance reveals that our method obtains at least competitive predictions on Make3D dataset compared with [33]. Qualitative results on Make3D can be observed in Fig. 6 . The object boundaries and depths are well predicted in different scenes. Although the training images are limited, our method can still obtain visually good predictions. 4.3. Algorithm analysis Network architecture: To demonstrate the effect of our HGR learning regime, we build models based on different network architectures, and the results are shown in Table 5. As illustrated in Table 5, our ResNet-18 based model slightly outperforms the AlexNet and VGG based models. For the models based on ResNet, Ours(ResNet-50) outperforms our method based on ResNet-18 in all metrics, obtaining an 27.1% average relative gain, while the ResNet-101 based model gets best performance in all the three metrics. These results indicate that our method based on ResNet

Table 5 Comparisons between approaches based on different architectures on NYU Depth V2 dataset.

Laina et al. [34] AlexNet upconv VGG l2 ResNet-50 Ours AlexNet VGG ResNet-18 ResNet-50 ResNet-101

rms

rel

log10

0.853 0.746 0.573 0.802 0.705 0.690 0.550 0.534

0.218 0.194 0.127 0.235 0.214 0.209 0.158 0.130

0.0940 0.0830 0.0550 0.0942 0.0802 0.0823 0.0520 0.0508

does make better predictions on indoor scenes than our method based on AlexNet and VGG, and deeper ResNet leads to better results because of the stronger model ability. Meanwhile, as the depth of the network increases, the adaptive field also increases to capture larger scale inputs, which is significant on depth prediction task, and as a result, the performance improves. To further analyze our HGR learning framework, we also compare AlexNet and VGG based model in [34]. As observed, our model based on AlexNet ob-

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442

439

Fig. 7. The predictions of different network architectures (a) input RGB images. (b) ground-truth. (c) the predictions of ours (w/o HGR). (d) the predictions of [33]. (e) the predictions of our method. Our method predicts fine-grained depth maps with less noise, which demonstrated the advantages of our HGR learning framework.

Table 6 Comparisons between baselines and our approach. NYU-D v2

Ours Ours Ours Ours Ours Ours

(scale-1) (scale-2) (scale-3) (w/o HGR) (w/o FM)

rms

log10

rel

scale-inv

0.583 0.578 0.605 0.615 0.573 0.550

0.0580 0.0570 0.0592 0.0608 0.0563 0.0520

0.186 0.179 0.190 0.195 0.175 0.158

0.142 0.140 0.145 0.150 0.138 0.127

tains 5.98% relative gain in rms metric, and our model based on VGG obtains 5.49% and 3.37% relative gain in rms and log10 metric, respectively. Although our method is usually weaker in rel metric, these facts prove that it is competitive or even better in the same condition. Hierarchical guidance: To analyse the effectiveness of our hierarchical guidance, we compare the performance of our method with those of several baselines. As illustrated in Table 6, ours (scale-1), ours (scale-2) and ours (scale-3) represent the network which only contains depth guidance in a scale of 30 × 40, 60 × 80 and 120 × 160, respectively. Ours (w/o HGR) means the network which is only composed of the backbone convolution and deconvolution network without the hierarchical guidance, which has a feedforward architecture. Ours (w/o FM) represents our network which contains no fusion module. The experiment settings are all the same. As observed, compared with ours (w/o HGR), our method obtains 14.82% average relative gain on NYU-Depth V2. Compared with the models with depth guidance of only one scale, our method obtains 9.23% average relative gain on all metrics. These results reveal that our hierarchical depth guidance strategy does improve the quality of predictions. Qualitative results can be observed in Fig. 7. The predictions of ours (w/o HGR) contain less details, which is caused by multiple pooling operations and lack of use of multi-level information. Xu et al. [33] proposed a sequential network combining multi-scale continuous CRFs to utilize the texture information while predicting depth. However, their CRF based method tends to focus on image texture or even some irrelevant details too much, leading to predictions with much noise, as shown in Fig. 7 (d). Our approach predicts fine-grained depth maps, making substantial improvement in detail sharpness over ours (w/o HGR), meanwhile less noise than the results of [33]. Compared with Ours (w/o MF), our method obtain obvious better performance, obtaining 5.49% average relative gain on NYU-

Depth V2. Here we discuss the effect of our fusion module. As observed in Table. 6, the methods containing depth guidance of different scales obtain different performance. It is because that the lower features of different scales contain information of different levels. While the layers are lower, the features of it will contain more texture information that extracted from input images but more noise meanwhile. If without a proper way to fuse these multi-level features, the noise and semantic difference will lead to a unsatisfactory final depth prediction. Hence, without the fusion module, our method can only obtain limited improvement, getting 3.5% average relative gain compared with ours (scale-1), ours (scale-2) and ours (scale-3). However, with our fusion module, the relative gain comes to 8.50%, which demonstrate its effectiveness. For qualitatively analysis, as illustrated in Fig. 8, the predictions of our method with fusion module are high-quality at each scale. The prediction of 30 × 40 captures the global depth structure of the scene, while the predictions of 60 × 80 and 120 × 160 recover more fine details. In this case, the hierarchical guidance brings valuable features to the predicting process, progressively reconstructing the depth details, and the final prediction contains similar structure with ground truth. However, the predictions of our method without fusion module are noisy. Though the prediction of 30 × 40 can cover the global structure, the predictions of 60 × 80 and 120 × 160 are ambiguous with much noise. These results are caused by the unfair usable information while predicting depth maps from the two lower layers. In this case, the hierarchical depth guidance will bring ambiguity to the predicting process and the final prediction contains much noise as illustrated. Hence, the witness of Fig. 8 can also demonstrate the effect of our fusion module, which guarantees the quality of final prediction. Regularization: During training, we use different regularizations to optimize our network and compare the performance. The results are presented in Table 7. We use the Lpixel regularization as the baseline. We denote Lpixel + Llocal as the average performance of the models with pixelto-pixel regularization and three shape-guaranteed regularizations. Lpixel + Lglobal represents the pixel-to-pixel regularization with shape-guaranteed regularization or global consistency regularization. LHGR (fixed weight) denotes the multi-regularization with invariable weights (λ1 = λ2 = λ3 = 13 ). LHGR with simply adaptive weight denotes the multi-regularization with weight λ1 =

λ2 =

(t )

Llocal , L(t )

(t ) Lglobal

(t ) Lpixel

L(t )

,

λ3 = L(t ) . The LHGR with optimized weight is the strategy defined in Eq. (16). All the LHGR based models contain Llocal-sobel as the local regularization. We use these above

440

Z. Zhang et al. / Pattern Recognition 83 (2018) 430–442

formance. The better performance of Llocal-sobel may caused by the selection of adjacent pixels while calculating gradient information. The Llocal-sobel considers more adjacent pixels around the target pixel to calculate the gradient, which makes the gradient of depth maps more reliable and robust. To evaluate the influence of different balance weights, we use different strategies to balance each regularization. Compared with the model with fixed weights, our model with LHGR (optimized weight) obtains 5.37% average relative gain in all metrics. Further, The fixed weights based model obtains similar results as Lpixel + Lglobal and Lpixel + Llocal based models. These results reveal that the fixed weight has limitation to balance these three regularizations well to highlight the respective advantage, failing to combine them flexibly. Compared with the simply adaptive strategy, the optimized strategy obtains 6.06% relative gain totally. The simply adaptive balancing strategy can vary the weights during the training process, however, it may bring the weights into very different scales, which may significantly decrease the influence of the regularization with very small weights. In contrast, the optimized weight strategy will balance each regularization to give proper complement of the total training loss and optimize the weight according to the contribution of corresponding regularization, which helps the network updating. 5. Conclusion and future work

Fig. 8. The effect of our fusion module. The left column is the prediction of our method with fusion module, while the right column is the prediction of our method without fusion module. (a) the prediction of 30 × 40. (b) the prediction of 60 × 80. (c) the prediction of 120 × 160. (d) final prediction of 480 × 640. (e) ground truth. Table 7 Depth prediction with different regularization on NYU Depth V2 dataset.

Lpixel Llocal-L2 Llocal-L1 Llocal-sobel Lglobal Lpixel + Llocal Lpixel + Lglobal LHGR (fixed weight) LHGR (simply adaptive weight) LHGR (optimized weight)

rms

rel

log 10

scale-inv

0.584 0.590 0.582 0.580 0.580 0.570 0.568 0.564 0.553 0.550

0.186 0.192 0.191 0.191 0.182 0.173 0.180 0.171 0.162 0.158

0.0563 0.0581 0.0586 0.0581 0.0559 0.0553 0.0560 0.0538 0.0522 0.0518

0.142 0.146 0.143 0.139 0.142 0.136 0.138 0.136 0.130 0.127

regularizing strategies to optimize our model, and the experiment settings are all the same. As illustrated in Table 7, the model with LHGR (optimized weight) regularization obtains best performance in all metrics, obtaining 9.50% average relative gain over those single-regularized model. Compared with the model with Lpixel regularization, the models with Lpixel + Llocal and Lpixel + Lglobal regularization both obtain better performance, getting 2.65% and 2.33% average relative gain in all metrics, respectively. These results demonstrate that the pixel-to-pixel, shape-guaranteed and global consistency regularizations are all useful for the depth estimation task, and each of them contributes to improving the final performance. Integrating the three level regularizations conduces to more reliable depth predictions. For different local regularizations, the Llocal-sobel based model obtains slightly better performance, getting totally 7.00% relative gain compared with Llocal-L2 based model and 3.70% relative gain compared with Llocal-L1 based model. The performances of the Llocal-L1 regularization are slightly better than that of Llocal-L2 . It reveals that the more robust L1-norm based local regularization is more robust to capture the depth discontinuity and gradient information, which improves the final per-

In this paper, we propose a novel Hierarchical Guidance and Regularization (HGR) learning framework for end-to-end depth estimation. The hierarchical guidance integrates various levels of depth detail to guide the prediction network, progressively refining the estimation process. The hierarchical regularization learning method combines the hierarchical depth structure of the ground truth, i.e., pixel-level correlation, local structure and global consistency, to optimize our network towards reliable depth predictions. Experimental results on the NYU Depth V2 and KITTI datasets clearly demonstrate that the proposed end-to-end framework achieves state-of-the-art performance, and the analyses of the network architecture and regularization strategy show that the HGR framework is responsible for this gain. The proposed HGR framework is also relevant to other dense prediction tasks; in future work we will apply it to semantic segmentation and super-resolution, and weakly supervised or unsupervised depth estimation based on our framework is also worth exploring.

Acknowledgements

The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions. This work was supported by the National Science Fund of China under Grant Nos. U1713208, 61472187, 61602244 and 61772276, the 973 Program No. 2014CB349303, the Program for Changjiang Scholars, the Natural Science Foundation of Jiangsu Province under Grant No. BK20170857, and the Fundamental Research Funds for the Central Universities Nos. 30918011321 and 30918011320.

References

[1] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGBD images, in: Proceedings of the European Conference on Computer Vision, 2012, pp. 746–760.
[2] D. Hoiem, A.A. Efros, M. Hebert, Automatic photo pop-up, ACM Trans. Graph. (TOG) 24 (3) (2005) 577–584.
[3] A. Saxena, M. Sun, A.Y. Ng, Make3D: learning 3D scene structure from a single still image, IEEE Trans. Pattern Anal. Mach. Intell. 31 (5) (2009) 824–840.
[4] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, Y. LeCun, Learning long-range vision for autonomous off-road driving, J. Field Rob. 26 (2) (2009) 120–144.
[5] J. Michels, A. Saxena, A.Y. Ng, High speed obstacle avoidance using monocular vision and reinforcement learning, in: Proceedings of the Twenty-Second International Conference on Machine Learning, ACM, 2005, pp. 593–600.

[6] J. Zhang, W. Li, P.O. Ogunbona, P. Wang, C. Tang, RGB-D-based action recognition datasets: a survey, Pattern Recognit. 60 (2016) 86–105.
[7] T. Aydin, Y.S. Akgul, Stereo depth estimation using synchronous optimization with segment based regularization, Pattern Recognit. Lett. 31 (15) (2010) 2389–2396.
[8] F. Cheng, X. He, H. Zhang, Learning to refine depth for robust stereo estimation, Pattern Recognit. 74 (2018) 122–133.
[9] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, in: Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2366–2374.
[10] V. Hedau, D. Hoiem, D. Forsyth, Thinking inside the box: using appearance models and context based on room geometry, in: Proceedings of the European Conference on Computer Vision, Springer, 2010, pp. 224–237.
[11] A. Gupta, M. Hebert, T. Kanade, D.M. Blei, Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces, in: Proceedings of the Advances in Neural Information Processing Systems, 2010, pp. 1288–1296.
[12] A. Gupta, A.A. Efros, M. Hebert, Blocks world revisited: image understanding using qualitative geometry and mechanics, in: Proceedings of the European Conference on Computer Vision, Springer, 2010, pp. 482–496.
[13] A.S. Malik, T.S. Choi, A novel algorithm for estimation of depth map using image focus for 3D shape recovery in the presence of noise, Pattern Recognit. 41 (7) (2008) 2200–2225.
[14] A. Sellent, P. Favaro, Optimized aperture shapes for depth estimation, Pattern Recognit. Lett. 40 (2014) 96–103.
[15] A. Saxena, S.H. Chung, A.Y. Ng, Learning depth from single monocular images, in: Proceedings of the Advances in Neural Information Processing Systems, 2005, pp. 1161–1168.
[16] K. Karsch, C. Liu, S.B. Kang, Depth transfer: depth extraction from video using non-parametric sampling, IEEE Trans. Pattern Anal. Mach. Intell. 36 (11) (2014) 2144–2158.
[17] M. Liu, M. Salzmann, X. He, Discrete-continuous depth estimation from a single image, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 716–723.
[18] W. Zhuo, M. Salzmann, X. He, M. Liu, Indoor scene structure analysis for single image depth estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 614–622.
[19] B. Liu, S. Gould, D. Koller, Single image depth estimation from predicted semantic labels, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1253–1260.
[20] L. Ladicky, J. Shi, M. Pollefeys, Pulling things out of perspective, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 89–96.
[21] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, L. Wang, G. Wang, Recent advances in convolutional neural networks, Pattern Recognit. 77 (2018) 354–377.
[22] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[23] C. Xu, C. Lu, X. Liang, J. Gao, W. Zheng, T. Wang, S. Yan, Multi-loss regularized deep neural network, IEEE Trans. Circuits Syst. Video Technol. (2015).
[24] K. Nogueira, O.A.B. Penatti, J.A.D. Santos, Towards better exploiting convolutional neural networks for remote sensing scene classification, Pattern Recognit. 61 (2017) 539–556.
[25] Q. Zhou, B. Zheng, W. Zhu, L.J. Latecki, Multi-scale context for scene labeling via flexible segmentation graph, Pattern Recognit. 59 (C) (2016) 312–324.
[26] F. Liu, G. Lin, C. Shen, CRF learning with CNN features for image segmentation, Pattern Recognit. 48 (2015) 2983–2992.
[27] S. Bu, P. Han, Z. Liu, J. Han, Scene parsing using inference embedded deep networks, Pattern Recognit. 59 (C) (2016) 188–198.
[28] M. Patacchiola, A. Cangelosi, Head pose estimation in the wild using convolutional neural networks and adaptive gradient methods, Pattern Recognit. 71 (2017) 132–143.
[29] F. Liu, C. Shen, G. Lin, I. Reid, Learning depth from single monocular images using deep convolutional neural fields, IEEE Trans. Pattern Anal. Mach. Intell. 38 (10) (2016) 2024–2039.
[30] B. Li, C. Shen, Y. Dai, A. van den Hengel, M. He, Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1119–1127.
[31] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, A. Yuille, Towards unified depth and semantic prediction from a single image, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2800–2809.
[32] D. Eigen, R. Fergus, Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.
[33] D. Xu, E. Ricci, W. Ouyang, X. Wang, N. Sebe, Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[34] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab, Deeper depth prediction with fully convolutional residual networks, in: Proceedings of the Fourth International Conference on 3D Vision, 2016, pp. 239–248.


[35] X.J. Mao, C. Shen, Y.B. Yang, Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections, in: Proceedings of the Advances in Neural Information Processing Systems, 2016.
[36] B. Hariharan, P.A. Arbeláez, R.B. Girshick, J. Malik, Hypercolumns for object segmentation and fine-grained localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 447–456.
[37] P.O. Pinheiro, T. Lin, R. Collobert, P. Dollár, Learning to refine object segments, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 75–91.
[38] S. Xie, Z. Tu, Holistically-nested edge detection, in: Proceedings of the International Conference on Computer Vision, 2015, pp. 1395–1403.
[39] A. Roy, S. Todorovic, Monocular depth estimation using neural regression forest, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[40] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, arXiv preprint arXiv:1512.03385.
[41] A. Kendall, Y. Gal, R. Cipolla, Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, arXiv preprint arXiv:1705.07115.
[42] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the KITTI dataset, Int. J. Robot. Res. 32 (11) (2013) 1231–1237.
[43] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, A. Geiger, Sparsity invariant CNNs, in: Proceedings of the International Conference on 3D Vision (3DV), 2017.
[44] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[45] M.H. Baig, L. Torresani, Coarse-to-fine depth estimation from a single image via coupled regression and dictionary learning, arXiv preprint arXiv:1501.04537 (2015).
[46] B.M. Haris, T. Lorenzo, Coupled depth learning, in: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2016, pp. 1–10.
[47] C. Godard, O.M. Aodha, G.J. Brostow, Unsupervised monocular depth estimation with left-right consistency, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[48] C. Cadena, A. Dick, I.D. Reid, Multi-modal auto-encoders as joint estimators for robotics scene understanding, in: Proceedings of Robotics: Science and Systems, 2016.
[49] R. Garg, K.B.G. Vijay, G. Carneiro, I. Reid, Unsupervised CNN for single view depth estimation: geometry to the rescue, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 740–756.

Zhenyu Zhang is a Ph.D. student in the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests mainly include depth estimation, machine learning and deep neural networks.

Chunyan Xu received the B.Sc. degree from Shandong Normal University in 2007, the M.Sc. degree from Huazhong Normal University in 2010, and the Ph.D. degree from the School of Computer Science and Technology, Huazhong University of Science and Technology, in 2015. She was a visiting scholar at the National University of Singapore from 2013 to 2015. She is now a lecturer in the School of Computer Science and Engineering, Nanjing University of Science and Technology.

Jian Yang received the Ph.D. degree from Nanjing University of Science and Technology in 2002. He is a Changjiang Professor and Dean of the School of Computer Science and Engineering, Nanjing University of Science and Technology. He was a Research Associate and Postdoctoral Fellow at the Biometric Centre of the Department of Computing, Hong Kong Polytechnic University, from 2004 to 2006, and a Postdoctoral Fellow in the Department of Computer Science, New Jersey Institute of Technology, from 2006 to 2007.


Ying Tai is a Ph.D. student in the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests mainly include face recognition, image super-resolution and deep neural networks.

Liang Chen is a Ph.D. student at the School of Computer Science and Engineering, Nanjing University of Science and Technology. Before joining the School of Computer Science and Engineering, he received his Bachelor's degree from the School of Science at Nanjing University of Science and Technology in 2013. His research interests include autonomous driving, robotics and deep learning.