ISPRS Journal of Photogrammetry and Remote Sensing 161 (2020) 13–26
3D map-guided single indoor image localization refinement

Qing Li a,b, Jiasong Zhu a, Jun Liu d,f, Rui Cao c,d, Hao Fu e, Jonathan M. Garibaldi b, Qingquan Li a, Bozhi Liu d, Guoping Qiu b,d,f,⁎

a Guangdong Key Laboratory of Urban Informatics & Shenzhen Key Laboratory of Spatial Smart Sensing and Services & Key Laboratory for Geo-Environmental Monitoring of Coastal Zone of the Ministry of Natural Resources, Shenzhen University, Shenzhen 518060, China
b School of Computer Science, The University of Nottingham, NG8 1BB, UK
c School of Computer Science, The University of Nottingham – Ningbo China, 315100, China
d The College of Electronics and Information Engineering and Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China
e College of Intelligence Science and Technology, National University of Defense Technology, 410000, China
f Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China

⁎ Corresponding author at: School of Computer Science, The University of Nottingham, NG8 1BB, UK. E-mail address: [email protected] (G. Qiu).
ARTICLE INFO

Keywords: Single image indoor localization; Depth prediction; 3D geometry matching

ABSTRACT
Image localization is an important supplement to GPS-based methods, especially in indoor scenes. Traditional methods depending on image retrieval or structure from motion (SfM) techniques either suffer from low accuracy or even fail to work due to texture-less or repetitive indoor surfaces. With the development of range sensors, 3D colourless maps are easily constructed for indoor scenes. How to utilize such a 3D colourless map to improve single image localization performance is a timely but unsolved research problem. In this paper, we present a new approach to addressing this problem by inferring the 3D geometry from a single image with an initial 6DOF pose estimated by a neural network based method. In contrast to previous methods that rely on multiple overlapping images or videos to generate sparse point clouds, our new approach can produce a dense point cloud from only a single image. We achieve this through estimating the depth map of the input image and performing geometry matching in the 3D space. We have developed a novel depth estimation method by utilizing both the 3D map and RGB images, where we use the RGB image to estimate a dense depth map and use the 3D map to guide the depth estimation. We will show that our new method significantly outperforms current RGB image based depth estimation methods for both indoor and outdoor datasets. We also show that utilizing the depth map predicted by the new method for single indoor image localization can improve both position and orientation localization accuracy over state-of-the-art methods.
1. Introduction

Single image localization is a promising alternative to GPS for indoor localization, as GPS signals are mostly blocked in indoor environments. It is also a key component of many computer vision tasks such as structure from motion (SfM) and simultaneous localization and mapping (SLAM), as well as of applications such as robotics and autonomous driving. It refers to the problem of estimating the 6 DoF pose parameters of a query image. Traditional methods address it either through image matching (Kröse et al., 2001; Menegatti et al., 2004; Murillo and Kosecka, 2009) or by constructing point-to-point associations between the query images and a 3D model built with SfM algorithms (Sattler et al., 2012; Sattler et al., 2017; Sun et al., 2017; Taira et al., 2018; Uyttendaele et al., 2012). Learning-based methods try to directly
predict the camera pose by training a CNN-based regressor (Kendall et al., 2015). However, these methods are not feasible for many indoor scenes: image matching-based methods are not accurate, and 3D models are difficult to construct using SfM if the environment is composed of texture-less surfaces such as white walls or repetitive decoration. The rapid development of range sensing instruments makes it easy to build 3D models of indoor scenes, as such sensors rely only on geometry and place no requirements on surface texture. However, point-to-point matching methods still do not work on 3D models built from range sensors, which lack the colour information of models generated from SfM. Single image localization in a 3D map is a hot research topic due to widely available 3D maps and the cameras embedded in smart phones. Directly matching 2D images and a 3D model is a very challenging problem as the image geometry is ambiguous
compared to 3D models. Two strategies can be used to tackle it: (1) matching in 2D space; (2) matching in 3D space. Methods based on matching in 2D space are similar to image retrieval methods; their key problem is to design a similarity metric that compares the two sources of information. The other strategy is to match in 3D space: the depth of the RGB image is inferred to generate a 3D point cloud, which is matched against the 3D map through 3D geometry matching. The key problem is to accurately estimate the depth of the RGB image. Traditional methods estimate image depth with SfM algorithms and require multiple overlapping images (Schönberger and Frahm, 2016; Schönberger et al., 2016), but they fail in low-texture indoor scenes and are time-consuming when estimating dense depth.
In this paper, we present a new approach to addressing single image localization in 3D maps through depth inference from RGB images. The proposed method exploits deep learning to perform single image depth inference and localizes the query image based on 3D geometry alignment. Our method first estimates a coarse pose using our previous work on camera 6 DoF relocalization (Li et al., 2019) and warps the 3D map into an initial depth image of the RGB image. Instead of inferring only from the RGB image, we predict depth from the RGB image together with this initial depth. Attaching the additional depth information enhances the depth prediction performance, as it gives initial guidance instead of leaving the network totally blind. An example is shown in Fig. 1. Compared to the real depth map, the initial depth is sparser and its structure has a small misalignment. Given the predicted dense depth, a 3D point cloud is generated and aligned to the 3D map to finally localize the image. The process of the proposed method is illustrated in Fig. 3. The whole approach is a coarse-to-fine process: first, we estimate a coarse pose with a deep regression-based method (Li et al., 2019); then, the ICP algorithm is used to align the point cloud produced from the predicted depth and correct the initial pose. Compared to methods based on scene coordinate prediction (Brachmann et al., 2017), which require 2D-3D matches to train the network, our method estimates the depth map of the RGB image to refine the initial pose. Direct learning-based methods such as PoseNet (Kendall et al., 2015) can also be used for pose initialization. In summary, we make the following contributions in this paper:
1. We present a new approach for image-based indoor localization in a 3D map. In contrast to previous methods that require multiple overlapping images or videos, our new approach can achieve high localization accuracy using only a single image. We achieve this through estimating the depth map of the input image and performing geometry matching in the 3D space.
2. We propose a novel depth estimation approach by utilizing both the 3D map and RGB images. We use the 3D map to generate an initial depth map that guides the RGB image to produce a fine depth map. Our new method significantly outperforms current RGB image based depth estimation methods for both indoor and outdoor datasets.
3. We present extensive experimental results to demonstrate the effectiveness of our new depth estimation method and the new single indoor image localization approach.
The rationale behind our approach is that we believe monocular depth prediction and image-based 3D localization are two interleaved problems: once the depth is accurately predicted, the image localization accuracy should be on par with 3D point cloud registration approaches; once the image is accurately localized, it should produce an accurate depth prediction that is well aligned with the 3D point cloud map.
The rest of the paper is organized as follows. Section 2 reviews related work on depth prediction and image localization in 3D maps. Section 3 describes each component of the proposed single indoor image localization approach. Section 4 elaborates the details of the proposed depth prediction method. Experimental results are presented in Section 5. Finally, we conclude our work in Section 6.

2. Related work

In this section, we review related work on single image depth prediction and image localization in 3D maps.

2.1. Monocular image depth prediction

Early works on estimating image depth depend either on parametric learning techniques (MRF and CRF) (Liu et al., 2010; Saxena et al., 2006) or on non-parametric learning approaches (Karsch et al., 2014; Konrad et al., 2013; Liu et al., 2014). Parametric learning-based methods predict the depth from features derived from pixels and their surroundings. Non-parametric learning-based methods estimate depth
Fig. 1. An illustration example of the results of the proposed deep learning-based RGB image depth prediction approach. (a) Original RGB image. (b) Initial depth map generated from a 3D map. (c) Predicted depth map. (d) Ground truth depth map. (e) Error map of initial depth map. (f) Error map of the predicted depth map. The error maps are the differences between the ground truth and the initial depth map (b) and the predicted depth map (c), respectively. It should be noted that the streaks in (b) are caused by depth map warping. A point (130, 105) is highlighted in (a, b, c, d) to demonstrate the misalignment of the initial depth map and the predicted one. Blue and red colors in (e, f) represent low and high errors, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
through point matching between the query images and similar images from a database. All these methods usually suffer from a heavy computational burden during inference and post-processing. Recent works exploit deep learning techniques to address the depth prediction problem. Eigen et al. (Eigen and Fergus, 2015; Eigen et al., 2014) propose a method that first performs a coarse estimation and then refines it with a second network; they further enhance the prediction quality by jointly learning surface normals and semantic labels. Liu et al. (Liu et al., 2015) estimate single image depth by combining a deep network with continuous CRF regression. Laina et al. (Laina et al., 2016) propose a residual, fully convolutional network to model the mapping between the monocular image and the depth image; the reverse Huber loss and their newly designed up-sampling blocks are shown to be effective. Jiao et al. (Jiao et al., 2018) present an approach that jointly predicts depth and semantic labels, and design an attention-driven loss based on the fact that the distribution of depth values exhibits a long tail. All the aforementioned methods are supervised and require ground truth depth labels.
Some works exploit unsupervised learning to avoid the demand for ground truth depth labels. Garg et al. (Garg et al., 2016) use stereo pairs to train a network to predict depth, with a loss formulated from the photometric difference between the true right image and the one synthesized from the left image and the predicted depth. Godard et al. (Godard et al., 2017) improve the depth estimation by introducing a symmetric left-right consistency loss. Kuznietsov et al. (Kuznietsov et al., 2017) propose a semi-supervised framework that uses sparse depth maps for supervised learning and a dense photometric error for unsupervised learning. Zhou et al. (Zhou et al., 2017) jointly predict the image depth and its pose in a single network.
Additional information has also been exploited together with RGB data for depth prediction with convolutional neural networks. Ma et al. (Ma and Karaman, 2018) predict full-resolution depth from a few depth samples and images. Liao et al. (Liao et al., 2017) utilize sparse laser scanner points to aid RGB image depth prediction. Cadena et al. (Cadena et al., 2016) propose a network to learn depth from the RGB image as well as semantic labels. Zhang et al. (Zhang and Funkhouser, 2018) generate dense depth maps by taking RGB-D images as input. In our method, we also utilize depth information to guide the RGB image depth prediction. Unlike other RGBD-based methods (Liao et al., 2017; Ma and Karaman, 2018; Zhang and Funkhouser, 2018), our depth information is dense but not accurate: the initial depth images are generated by projecting the 3D points onto a plane defined by a pose close to the real one, so the corresponding depth value for each RGB pixel is quite close to the ground truth and provides coarse guidance for the depth prediction.
2.2. Image localization in 3D maps

Conventional image localization is conducted based on three schemes: image retrieval, 2D-3D matching, and learning-based regression (Brachmann et al., 2017; Kendall et al., 2015; Li et al., 2019). Learning-based methods either directly estimate the pose through regression (Kendall et al., 2015) or indirectly predict scene coordinates for pose estimation (Brachmann et al., 2017), which requires the scene coordinates of the image pixels. 3D information collected with LiDAR devices can increase the localization performance. Traditional methods predict the location of the query image in a 3D map by establishing 2D-3D correspondences from local features such as SIFT (Lowe, 1999), SURF (Bay et al., 2006) or ORB (Rublee et al., 2011; Sattler et al., 2017; Wang et al., 2006). Those approaches are not feasible for localizing against our 3D map, as it lacks local visual features. The main difficulty of localizing a single image within a 3D map is to handle the inherent modal difference between the 2D RGB image and the 3D
point cloud. Recent works can be divided into two categories: matching in 2D space and matching in 3D space. Methods based on 2D matching synthesize images from the 3D points based on LiDAR reflectance or distance and compare them with the query RGB images. For instance, Wolcott et al. (Wolcott and Eustice, 2014) construct a LiDAR reflectance image database and perform localization under an image retrieval framework, with a similarity metric based on normalized mutual information (NMI). Stewart and Newman (Stewart and Newman, 2012) propose a method that matches the query images against generated LiDAR intensity images and solves the localization problem through Quasi-Newton optimization. Neubert et al. (Neubert et al., 2017) produce a depth image from two images and match it to the intensity image from the LiDAR 3D map. Xu et al. (Xu et al., 2017) present a method that matches depth images against LiDAR reflectance images. Kim et al. (Kim et al., 2018) synthesize depth images and formulate the cost function as the difference between the synthesized depth image and depth images generated from a stereo camera; they also use Quasi-Newton optimization to localize the query image. 2D matching involves rendering a huge number of images on-line or off-line and thus suffers from efficiency issues; moreover, it is vulnerable to scene changes.
3D matching-based methods perform localization by exploiting geometry in 3D space. They generate a sparse point cloud through SfM or bundle adjustment and perform localization using 3D point cloud registration approaches (Segal et al., 2009). Forster et al. (Forster et al., 2013) localize the query images by aligning the generated 3D points to a 3D map constructed from a depth sensor. Caselitz et al. (Caselitz et al., 2016) use a similar strategy and align the 3D sparse structure to a prior 3D map. Bao et al. (Bao et al., 2016) utilize a stereo camera to reconstruct the side view of the scene and match it against the map. DSAC (Brachmann et al., 2017) solves the problem by training a CNN to directly estimate the 3D coordinates of the RGB image pixels in a trainable RANSAC-based framework, which is easily affected by repetitive texture. Our approach also belongs to the 3D geometry-based methods. Instead of using traditional bundle adjustment, which is vulnerable in scenes with texture-less surfaces, we generate dense depth maps using a CNN-based method and perform localization with the ICP algorithm.

3. Single image localization within a 3D map

We consider a scenario, shown in Fig. 2, where we are given a single 2D RGB indoor color image and the 3D map of the scene, and our
Fig. 2. Demonstration of single indoor image localization in a 3D map.
Fig. 3. The proposed 3D map-guided image localization process. It includes four stages: (1) initial pose estimation, (2) local map extraction, (3) point cloud generation, (4) geometry matching.
aim is to estimate the 6 DoF pose of the 2D image. While past research has considered the case where multiple overlapping images or a video is available, we consider the more challenging case in which only a single RGB image is available. This is a very difficult problem, as it tries to register an image to a point cloud and the two come from different modalities: images contain color information in 2D space, whereas the point cloud contains geometric information. Since no color information can be utilized from the point cloud, we propose to address this problem by inferring the geometry information of the 2D color image. This is achieved by predicting the corresponding depth image of the RGB image. Given the depth image, we can obtain a point cloud from it, and the iterative closest point (ICP) algorithm is then applied to register the produced point cloud with the 3D map through geometry matching.
This section describes the proposed approach for localizing a single image in a 3D map. Fig. 3 shows the process of the proposed method, which includes four steps: pose initialization, local map extraction, point cloud generation and ICP-based geometry matching. The pose
initialization step provides a coarse pose. Given the coarse pose, we extract a local map and perform geometry matching against it instead of the global map, for efficiency reasons. The local 3D map is also utilized to generate the initial depth, and the initial depth is used together with the RGB image to perform dense depth prediction. The point cloud generation step produces a point cloud from the coarse pose and the dense depth image. Eventually, we exploit the ICP matching strategy to align the generated point cloud with the local 3D map and obtain a pose correction. By adding this correction to the initial pose, we obtain the accurate pose in the 3D map. In the rest of this section, we describe the details of each step.

3.1. Pose initialization

Pose initialization is a key component of the proposed approach. It provides the initial guess used to extract the local 3D map from the global one, and the ICP algorithm heavily relies on it to achieve good results. In this step, we utilize our previous approach (Li et al., 2019) to initialize
the pose of the image; it can also be replaced with other localization methods. The pose initialization approach is also a learning-based method. A Siamese neural network structure is designed to exploit the relative geometry between images in both the feature space and the label space. The network consists of two weight-sharing ResNet50 branches, two global pose regression units and a relative pose regression unit, each made of three fully connected layers. Three loss functions are designed in conjunction with the global pose loss to train the network. The network is capable of estimating the global poses and the relative pose of two images; in addition, it can predict the pose of a single image by feeding the image into one branch of the network. This strategy performs multi-task learning and adds relative geometric constraints to regularize the network, and it is capable of performing localization in low-texture indoor environments. Although the localization accuracy is not as high as that of traditional 3D model-based methods, it is sufficient to provide a coarse pose estimate, which is refined in later steps.
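To make the structure concrete, the following PyTorch sketch outlines a Siamese pose-regression network of this kind. It is only an illustrative approximation under stated assumptions: the 7-dimensional pose output (3D position plus quaternion), the sizes of the fully connected layers and the way the two branch features are combined are our assumptions, not the exact configuration of Li et al. (2019).

```python
import torch
import torch.nn as nn
import torchvision

class SiamesePoseNet(nn.Module):
    """Two weight-sharing ResNet50 branches with global and relative pose heads (sketch)."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=False)
        # drop the final fc layer; both branches share these weights
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        # global pose head: 3-D position + 4-D quaternion (assumed parameterisation)
        self.global_head = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 7))
        # relative pose head operates on the concatenated features of the two branches
        self.relative_head = nn.Sequential(
            nn.Linear(2 * feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 7))

    def forward(self, img_a, img_b=None):
        f_a = self.encoder(img_a).flatten(1)
        pose_a = self.global_head(f_a)
        if img_b is None:          # single-image inference for pose initialization
            return pose_a
        f_b = self.encoder(img_b).flatten(1)
        pose_b = self.global_head(f_b)
        rel = self.relative_head(torch.cat([f_a, f_b], dim=1))
        return pose_a, pose_b, rel
```

At test time only one branch is used, which is what provides the coarse pose for the steps below.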
3.2. Local 3D map extraction

The global map contains a large number of points, and matching against the global map is inefficient because many potential matches are unnecessary: only a small portion of the points fall within the field of view of the query image. To increase the efficiency of warping the 3D point cloud and of the ICP matching, we extract a local 3D map based on the initial pose. Kim et al. generate a local 3D map by choosing points within a distance threshold of the initial position (Kim et al., 2018). Many points that are out of view still appear in such a map, which results in low efficiency, while distant points, which are important for the subsequent localization, are filtered out. Therefore, we propose an approach based on the image field of view that avoids both problems. Given the initial pose, we calculate the angle of each global point in a polar coordinate system, and the points within an angular window are selected as the local 3D map. The size of the window is determined by the initial pose and the image size. Given the camera intrinsic parameters, the field of view (FOV) can be computed using Eqs. (1) and (2):

fov_h = \arctan\left(\frac{w}{2f}\right),   (1)

fov_v = \arctan\left(\frac{h}{2f}\right),   (2)

where fov_h and fov_v represent the horizontal and vertical view angles of the camera, f is the focal length, and w and h are the width and height of the image. To include points that appear in the camera view but fall slightly outside this nominal FOV, we set the window larger than the camera FOV; empirically, an extra 15° for both the vertical and horizontal windows works well in all experimental settings.
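The angular-window selection can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (map points given in world coordinates, R_init mapping world to camera coordinates, a symmetric margin added to each half-angle); the function name and interface are hypothetical.

```python
import numpy as np

def extract_local_map(points_world, R_init, C_init, f, w, h, margin_deg=15.0):
    """Select map points inside the camera's angular window (Eqs. (1)-(2) plus a margin).

    points_world : (N, 3) global map points
    R_init, C_init : initial camera orientation (3x3) and centre (3,)
    f, w, h : focal length (pixels) and image width/height
    """
    # express the points in the camera frame of the initial pose
    pts_cam = (R_init @ (points_world - C_init).T).T
    x, y, z = pts_cam[:, 0], pts_cam[:, 1], pts_cam[:, 2]

    fov_h = np.arctan(w / (2.0 * f)) + np.deg2rad(margin_deg)
    fov_v = np.arctan(h / (2.0 * f)) + np.deg2rad(margin_deg)

    # horizontal and vertical angles of each point relative to the optical axis
    ang_h = np.arctan2(np.abs(x), z)
    ang_v = np.arctan2(np.abs(y), z)
    in_window = (z > 0) & (ang_h < fov_h) & (ang_v < fov_v)
    return points_world[in_window]
```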
3.3. Point cloud generation

We estimate the corresponding point cloud of the query image in two steps. Firstly, we predict the depth image of the query image using the approach proposed in Section 4 and generate its corresponding point cloud with Eq. (4). Secondly, we filter the point cloud based on the density distribution of indoor 3D points. The filtering is essential since not all depth values are predicted accurately, which would affect the final result of the geometry matching in 3D space. Depth values with large errors appear as floating points in 3D space. To eliminate them, we use a simple point cloud filtering strategy based on the number of points within a given radius. Let N_i denote the number of points near point i within radius R, and let T be a threshold. If N_i < T, point i is discarded; otherwise it is kept. In our approach, the radius R is set to 1 m and the point count threshold T to 100.
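The two steps can be sketched as below, assuming the pinhole model of Eq. (4) for back-projection and the radius-count filter with the thresholds given above (R = 1 m, T = 100). The helper names are hypothetical and the SciPy k-d tree is our choice, not necessarily the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def depth_to_point_cloud(depth, K, R, C):
    """Back-project a dense depth map (H, W) into world coordinates using the
    pinhole model of Eq. (4): X = R^T K^{-1} (d * x) + C."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0
    pix = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=0)  # homogeneous pixels
    rays = np.linalg.inv(K) @ (pix * depth[valid])                      # d * K^{-1} x
    return (R.T @ rays).T + C

def filter_floating_points(points, radius=1.0, min_neighbors=100):
    """Drop isolated 'floating' points that have fewer than min_neighbors other
    points within the given radius (Section 3.3 uses R = 1 m, T = 100)."""
    tree = cKDTree(points)
    counts = np.array([len(n) - 1 for n in tree.query_ball_point(points, r=radius)])
    return points[counts >= min_neighbors]
```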
3.4. ICP-based geometry matching

Given the local 3D map and the point cloud predicted from an image, we apply the ICP algorithm (Segal et al., 2009) to align them. Assume that P is the point cloud generated from the predicted depth map and Q is the extracted local map. There exists a transformation consisting of a rotation R and a translation T such that

q_i = R p_i + T,   (3)

where p_i is a 3D point in the predicted point cloud and q_i is the corresponding point in the local map. ICP obtains R and T by minimizing E(R, T) = \frac{1}{n} \sum_{i=1}^{n} \| q_i - (R p_i + T) \|^2 in an iterative manner. The detailed process is given in Algorithm 1. It contains four main steps: update the transformation, construct point pairs, compute R and T, and determine whether the process stops.

Algorithm 1 (ICP-based image pose refinement).
Require: the point cloud P generated from the predicted depth map; the local point cloud map Q; the initial pose of the image R_0, T_0.
Ensure: the final pose of the image R, T.
1: Transform the generated point cloud with the pose R_{n-1}, T_{n-1} in the n-th iteration.
2: Construct the point pairs (p_i, q_i) such that \| q_i - (R p_i + T) \|^2 is minimal, i.e. q_i is the closest map point to the transformed p_i.
3: Minimize the objective function E(R, T) to obtain the new pose R_n, T_n.
4: Transform the generated point cloud with the pose R_n, T_n and compute the average distance over all point pairs d = \frac{1}{n} \sum_{i=1}^{n} \| p_i' - q_i \|^2, where p_i' is the transformed point p_i.
5: If d is less than the given threshold or n equals the maximum number of iterations, stop; otherwise return to step 2.
6: Return R_n, T_n.

The key of the proposed approach is estimating the depth image of the 2D RGB image. In the next section, we present a new approach that fuses the RGB image with the 3D map information to estimate the depth image of the RGB image.
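For illustration, a plain point-to-point ICP loop in the spirit of Algorithm 1 is sketched below, with the stopping parameters of Section 5.2.3 (30 iterations, 0.01 threshold) as defaults. The paper relies on the (generalized) ICP of Segal et al. (2009); this NumPy version with a closed-form SVD update is only a simplified stand-in.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_refine(P, Q, R0, T0, max_iter=30, tol=0.01):
    """Minimal point-to-point ICP following the steps of Algorithm 1.

    P : (n, 3) point cloud from the predicted depth map
    Q : (m, 3) local 3D map
    R0, T0 : initial rotation (3x3) and translation (3,)
    Returns the refined pose R, T.
    """
    R, T = R0.copy(), T0.copy()
    tree = cKDTree(Q)
    for _ in range(max_iter):
        P_t = P @ R.T + T                      # step 1: transform with the current pose
        _, idx = tree.query(P_t)               # step 2: closest-point pairs
        Q_match = Q[idx]
        # step 3: closed-form update of R, T (Kabsch / SVD)
        mu_p, mu_q = P.mean(axis=0), Q_match.mean(axis=0)
        H = (P - mu_p).T @ (Q_match - mu_q)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        T = mu_q - R @ mu_p
        # steps 4-5: stop when the mean pair distance is small enough
        if np.mean(np.linalg.norm(P @ R.T + T - Q_match, axis=1)) < tol:
            break
    return R, T
```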
4. Single image depth prediction with 3D map guidance

Traditional single image depth prediction methods, which operate directly on RGB images, suffer from the scale ambiguity problem. To relieve this, we exploit the 3D map to guide the process. The 3D map information is utilized by generating an initial depth image of the input RGB image. Then both the RGB image and the initial depth image are fed into a neural network to infer the correct depth image. In the rest of this section, we elaborate on the details of warping the 3D map to provide the initial depth image, and we describe the architecture of the proposed convolutional neural network for depth prediction and the formulation of the loss functions.
4.1. Depth of points

Depth images are generated by projecting the local 3D points onto a plane defined by the camera intrinsic parameters and the coarse pose. The 3D coordinates and the corresponding 2D points obey the pinhole camera geometry, which can be expressed as

d\,x = P(X - C) = KR(X - C),   (4)

where d is the depth value, x = (x, y, 1)^T denotes the homogeneous image coordinates, X = (X, Y, Z)^T represents the 3D point coordinates in space, and C is the camera centre in 3D space. The projection P is determined by the camera intrinsic matrix K, the camera orientation matrix R and the
Fig. 4. The architecture of the proposed network. The red blocks are the feature maps of residual blocks in ResNet (He et al., 2016), and blue blocks indicate the feature maps of the upconv up-sampling layers. We fuse the initial depth information and RGB information by concatenating the feature maps of the Conv1 layers. We also conduct experiments that fuse the two sources of information at later layers; the results in Section 5.1.8 demonstrate that fusion after Conv1 is the most effective strategy for the two sources of information. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
camera centre position C. Given the camera intrinsic matrix, the orientation matrix and the centre position, we can compute the image coordinates and depth values of all the 3D points in the local point cloud. If d < 0, the point lies behind the principal plane and is abandoned; otherwise the depth is kept for depth map generation.
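A minimal sketch of this warping step is given below. It assumes the pinhole model of Eq. (4), resolves pixel collisions by keeping the smallest depth, and marks empty pixels with zero; these conventions are assumptions for illustration.

```python
import numpy as np

def warp_initial_depth(points_world, K, R, C, width, height):
    """Project the local 3D map into an initial depth image with Eq. (4),
    keeping the smallest depth when several points fall on the same pixel."""
    pc = (K @ R @ (points_world - C).T).T        # d * (x, y, 1)^T for every point
    d = pc[:, 2]
    valid = d > 0                                # points behind the principal plane are dropped
    u = np.round(pc[valid, 0] / d[valid]).astype(int)
    v = np.round(pc[valid, 1] / d[valid]).astype(int)
    d = d[valid]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, d = u[inside], v[inside], d[inside]

    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), d)              # z-buffer: keep the closest point per pixel
    depth[np.isinf(depth)] = 0.0                 # 0 marks pixels without any map point
    return depth
```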
4.2. Depth prediction network

4.2.1. Network architecture
The network architecture is illustrated in Fig. 4. Our network structure is designed based on (Ma and Karaman, 2018), which achieved state-of-the-art results in depth prediction from RGB images. We slightly change the network structure for our application: in (Ma and Karaman, 2018) the sparse depth map is already aligned with the RGB image, whereas our initial depth map is not aligned with the RGB image, so we need an extra module to extract effective guidance information to assist the depth prediction. In (Ma and Karaman, 2018) the RGB and sparse depth are stacked at the input layer, while we fuse the depth and RGB after the first convolutional block. The network is composed of two components: an encoder and a decoder. We use a modified ResNet50 to encode the image information for NYU-Depth-v2. For the KITTI dataset, we use ResNet18 because of GPU limitations, since the KITTI images are too large to process with a ResNet50 encoder. We modify ResNet by replacing the last pooling layer and the fully connected layer with a convolutional layer and a batch normalization layer. We fuse the initial depth images and RGB images after the first convolutional layer. The decoder consists of four successive uppool up-sampling layers and a bilinear up-sampling layer.
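The following PyTorch sketch illustrates the fuse-after-Conv1 design with a ResNet50 encoder and a simple up-sampling decoder. The 1 × 1 fusion convolution, the channel widths and the plain convolution-plus-bilinear up-sampling blocks are illustrative assumptions standing in for the uppool/upconv blocks of the actual network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class MapGuidedDepthNet(nn.Module):
    """Encoder-decoder sketch: the RGB image and the warped initial depth are
    passed through separate first convolutions and fused after Conv1 (Fig. 4)."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=False)
        self.conv1_rgb = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)
        self.conv1_d = nn.Sequential(                     # 1-channel branch for the initial depth
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(128, 64, kernel_size=1)     # merge the two 64-channel maps
        self.encoder = nn.Sequential(backbone.maxpool, backbone.layer1,
                                     backbone.layer2, backbone.layer3, backbone.layer4)

        def up(cin, cout):                                # simple conv + bilinear upsampling block
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
                                 nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False))

        self.decoder = nn.Sequential(up(2048, 1024), up(1024, 512), up(512, 256), up(256, 64),
                                     nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, rgb, init_depth):
        x = torch.cat([self.conv1_rgb(rgb), self.conv1_d(init_depth)], dim=1)
        x = self.encoder(self.fuse(x))
        depth = self.decoder(x)
        return F.interpolate(depth, size=rgb.shape[-2:], mode='bilinear', align_corners=False)
```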
4.2.2. Loss function
The loss function is designed based on the difference between the estimated depth image and the ground truth. Three common depth-wise losses are exploited for training the network, i.e. the mean squared error (l2), the mean absolute error (l1), and the reversed Huber loss (berHu) (Laina et al., 2016). The berHu loss can be seen as a compromise between l2 and l1: it is equivalent to l1 when the absolute error is no larger than c, and approximates l2 otherwise. It is defined as

L_{berHu}(err) = |err|, if |err| \le c; \quad \frac{err^2 + c^2}{2c}, otherwise,   (5)

where c is a parameter computed as 20% of the maximum absolute depth error over all pixels of a batch, and err represents the depth error.

Depth-wise losses lead to smooth boundaries. To reduce this effect, we attempt to add stronger constraints, namely a gradient loss and the SSIM loss (Wang et al., 2004), to keep the boundaries sharp. The SSIM loss constrains the difference between the predicted depth and the ground truth at the whole-image appearance level and is formulated as in Eq. (6):

SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},   (6)

where \mu_x, \mu_y are the average depth values of the two images, \sigma_x, \sigma_y, \sigma_{xy} are their standard deviations and covariance, and C_1, C_2 are constants equal to 0.01^2 and 0.03^2, respectively. The gradient loss is defined as in Eq. (7) and tries to preserve detail in the depth images:

L_g = \| \nabla_x d - \nabla_x \hat{d} \| + \| \nabla_y d - \nabla_y \hat{d} \|,   (7)

where L_g denotes the gradient loss, \nabla_x d, \nabla_y d are the gradients of the ground truth depth image in the x and y directions, and \nabla_x \hat{d}, \nabla_y \hat{d} are the gradients of the predicted depth image. The pixel-wise gradient loss encourages the gradient of the predicted depth image to be consistent with that of the ground-truth depth image. In Section 5.1.7, our experiments show that the SSIM loss and the gradient loss decrease the depth prediction performance; we therefore choose the l2 loss to train the network and leave the over-smoothed boundaries to be filtered in 3D space.
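For reference, minimal PyTorch versions of the berHu loss of Eq. (5) and the gradient loss of Eq. (7) are sketched below for (N, 1, H, W) depth tensors; the batch-wise choice of c follows Laina et al. (2016), and the finite-difference gradients are an implementation assumption.

```python
import torch

def berhu_loss(pred, target):
    """Reverse Huber (berHu) loss of Eq. (5); c is 20% of the largest absolute error in the batch."""
    err = torch.abs(pred - target)
    c = torch.clamp(0.2 * err.max().detach(), min=1e-6)
    return torch.where(err <= c, err, (err ** 2 + c ** 2) / (2 * c)).mean()

def gradient_loss(pred, target):
    """Gradient consistency loss of Eq. (7) using finite differences along x and y."""
    def grads(d):
        return d[..., :, 1:] - d[..., :, :-1], d[..., 1:, :] - d[..., :-1, :]
    px, py = grads(pred)
    tx, ty = grads(target)
    return torch.abs(px - tx).mean() + torch.abs(py - ty).mean()
```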
5. Experiments

In this section, we evaluate the performance of depth prediction and image localization respectively. We compare the depth estimation performance with the state-of-the-art on two benchmarks: NYU-Depth-v2 (Nathan Silberman et al., 2012) and the KITTI dataset (Uhrig et al., 2017). Ablation studies are conducted on the network structure and the loss functions. For localization, we evaluate our method on the 7Scene dataset (Shotton et al., 2013).
5.1. Depth prediction
5.1.1. Datasets
The NYU-Depth-v2 dataset is collected from 464 different scenes with a Kinect device. It is officially split into training and testing sets, where 249 scenes are selected for training and 215 scenes for testing. To facilitate comparison with previous methods, we also evaluate our method on the same 654 images as in previous works (Eigen and Fergus, 2015; Eigen et al., 2014; Karsch et al., 2014; Liu et al., 2014). Following previous work (Ma and Karaman, 2018), we resize the images to 320 × 240 and crop a patch of 304 × 228 from the center. The KITTI dataset is collected from a mobile car and the depth is obtained using a Velodyne LiDAR sensor. We use the split proposed by (Eigen et al., 2014), in which 22,600 images are used for training and 697 images for testing. Only the bottom crop (912 × 228) is used, to eliminate the sky, where no depth information is acquired by the sensor. Since both datasets have no accurate pose information, we simulate the initial pose. Three random numbers within [−3t, 3t] are used to simulate the initial position and three random numbers within [−3θ, 3θ] act as the initial orientation, where t and θ are the median position and orientation errors of our previous work (Li et al., 2019). For the NYU-Depth-v2 dataset, t and θ are 0.2 m and 10°, respectively, according to the localization results on indoor scenes. For the KITTI dataset, t and θ are 1.2 m and 3.2°, according to our outdoor localization results.
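The pose perturbation described above can be sketched as follows; the uniform sampling and the X-Y-Z composition order of the rotational noise are assumptions for illustration.

```python
import numpy as np

def simulate_coarse_pose(R_true, C_true, t_med, theta_med_deg, rng=np.random.default_rng()):
    """Perturb the ground-truth pose with uniform noise in [-3t, 3t] per position axis
    and [-3*theta, 3*theta] per rotation axis, as described in Section 5.1.1."""
    t_noise = rng.uniform(-3 * t_med, 3 * t_med, size=3)
    angles = np.deg2rad(rng.uniform(-3 * theta_med_deg, 3 * theta_med_deg, size=3))

    def rot(axis, a):
        c, s = np.cos(a), np.sin(a)
        mats = {'x': [[1, 0, 0], [0, c, -s], [0, s, c]],
                'y': [[c, 0, s], [0, 1, 0], [-s, 0, c]],
                'z': [[c, -s, 0], [s, c, 0], [0, 0, 1]]}
        return np.array(mats[axis])

    R_noise = rot('x', angles[0]) @ rot('y', angles[1]) @ rot('z', angles[2])
    # the positional noise is added directly to the camera centre for simplicity,
    # which differs slightly from the camera-frame formulation of Eq. (8)
    return R_noise @ R_true, C_true + t_noise
```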
5.1.2. Initial depth map generation
To generate the initial depth map, we project the local point cloud at the coarse pose. The coarse pose is obtained by adding random noise to both position and orientation within certain thresholds T_pos and T_ori. We randomly generate three numbers in [−T_pos, T_pos] to form a three-dimensional vector representing the positional noise along the three axes, and three random numbers in [−T_ori, T_ori] to represent the angular noise around the three axes. The three rotational noises are transformed into rotation matrices and the combined noise is generated by multiplying them together. The coarse pose is obtained with Eq. (8):

C_{coarse} = C_{true} + R^{-1} K^{-1} T_{noise}, \quad R_{coarse} = R_{noise} R_{true},   (8)

where R_{noise} = R_x R_y R_z, and R_x, R_y, R_z are the rotation matrices of the three rotational noise angles. When multiple 3D points project to the same position, the smallest depth is assigned to that image position. Another issue concerning depth map generation is that depths of background points can appear among those of the foreground points. In our experiments, the global map is generated at high density with a minimum point distance of 0.001 m, which relieves this problem, as only a tiny number of background points are not occluded by the dense foreground points. These few background depth values hardly affect the depth prediction in our experiments.

5.1.3. Setup
We follow the same data augmentation strategy as in (Ma and Karaman, 2018), with random transformations of scale, rotation, color, and flips on the RGB images. Training images are shuffled before they are fed to the network. We choose the Adam optimizer to train the network with parameters β1 = 0.9 and β2 = 0.999. The weight decay is 1 × 10^−5. We train the network with a learning rate of 1 × 10^−4 and a batch size of 12. We implement the network with PyTorch and train it on an Ubuntu 16.04 LTS system with an NVIDIA GTX 1080Ti GPU. Training is stopped once the network has converged.

5.1.4. Evaluation metrics
We evaluate the depth prediction performance with the following metrics.

The root mean square error (rmse):

rmse = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\hat{d}_i - d_i)^2 }.   (9)

The mean relative error (rel):

rel = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{d}_i - d_i|}{d_i}.   (10)

The percentage of predictions whose relative depth error is within a threshold 1.25^j:

\delta_j = \frac{ |\{ d_i : \max(\hat{d}_i / d_i, \, d_i / \hat{d}_i) < 1.25^j \}| }{ N(d_i) },   (11)

where d_i and \hat{d}_i are the ground truth depth values and the predicted ones respectively, N(·) denotes the number of elements of a set, and j = 1, 2, 3. A higher \delta_j indicates better prediction.
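A small reference implementation of Eqs. (9)-(11) over the valid pixels might look as follows; masking out pixels without ground truth is an assumption about the evaluation protocol.

```python
import numpy as np

def depth_metrics(pred, gt):
    """rmse, rel and delta_j accuracies of Eqs. (9)-(11), computed over valid pixels."""
    mask = gt > 0                      # ignore pixels with no ground truth depth
    d, d_hat = gt[mask], pred[mask]
    rmse = np.sqrt(np.mean((d_hat - d) ** 2))
    rel = np.mean(np.abs(d_hat - d) / d)
    ratio = np.maximum(d_hat / d, d / d_hat)
    deltas = [np.mean(ratio < 1.25 ** j) for j in (1, 2, 3)]
    return rmse, rel, deltas
```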
5.1.5. Comparison with the state-of-the-art
We compare with existing monocular image depth prediction methods (Eigen and Fergus, 2015; Eigen et al., 2014; Fu et al., 2018; Karsch et al., 2014; Li et al., 2015; Liao et al., 2017; Liu et al., 2014; Liu et al., 2015; Ma and Karaman, 2018; Roy and Todorovic, 2016; Xu et al., 2017) on NYU-Depth-v2 (Nathan Silberman et al., 2012) and the KITTI dataset (Uhrig et al., 2017). Among them, (Liao et al., 2017) and (Ma and Karaman, 2018) also utilize depth input, whilst all other approaches use only the RGB image. (Liao et al., 2017) use a single line of correct depth values to assist the depth prediction, while (Ma and Karaman, 2018) utilize a sparse map of correct depth values to help the depth prediction from RGB images. Ma_Initial represents the case in which the depth input consists of 200 depth values randomly selected from our initial map. Only our method exploits the dense initial depth map to guide the depth prediction. The comparative results are shown in Table 1.
It can be seen from Table 1 that, compared to RGB-based methods, RGBD-based methods achieve better performance in both error and accuracy: they jointly utilize the texture information from RGB images and the absolute scale information from the additional depth input. Some qualitative examples are shown in Fig. 5. Comparing the RGBD-based methods, our method achieves performance comparable to Ma et al. (Ma and Karaman, 2018) and outperforms the approach in (Liao et al., 2017). The main difference is that we do not use correct depth values to guide the prediction, whereas (Ma and Karaman, 2018) and (Liao et al., 2017) employ part of the real depth values as guidance. The difference between the two is that in (Ma and Karaman, 2018) the real depth values are randomly distributed over the image, while (Liao et al., 2017) limit them to a line; randomly distributed values can guide large areas of the RGB image, while line-distributed values only affect the image along that line. Our initial depth is much denser than both, although its values contain slight errors, which we correct by training the network. Covering a larger area of the RGB image than (Ma and Karaman, 2018) and (Liao et al., 2017) helps to improve the depth prediction, which demonstrates the effectiveness of our proposed method. It can also be seen that a sparse initial depth map already improves depth prediction compared to the RGB-based methods, and comparing Ma_Initial with our method shows that a dense initial depth map enhances the performance further.
The KITTI dataset is more challenging than NYU-Depth-v2 because its depth range (up to 100 m) is much larger than that of NYU-Depth-v2 (10 m). Besides, it was collected in outdoor environments, whose scene geometry is complex and challenging
Table 1
Comparison with the state-of-the-art on the NYU-Depth-v2 dataset. The reported values are taken from the respective papers. The best performance is highlighted in bold. rmse and rel: lower is better; δ1, δ2, δ3: higher is better.

Problem | Method | rmse | rel | δ1 | δ2 | δ3
RGB | Karsch et al., 2014 | 1.200 | 0.250 | – | – | –
RGB | Liu et al., 2014 | 1.060 | 0.335 | – | – | –
RGB | Li et al., 2015 | 0.821 | 0.232 | 62.1 | 88.6 | 96.8
RGB | Roy and Todorovic, 2016 | 0.744 | 0.187 | – | – | –
RGB | Liu et al., 2015 | 0.824 | 0.230 | 61.4 | 88.3 | 97.1
RGB | Eigen et al., 2014 | 0.877 | 0.214 | 61.4 | 88.8 | 97.1
RGB | Eigen and Fergus, 2015 | 0.641 | 0.158 | 76.9 | 95.0 | 98.8
RGB | Laina et al., 2016 | 0.573 | 0.127 | 81.1 | 95.3 | 98.8
RGB | Xu et al., 2017 | 0.586 | 0.121 | 81.1 | 95.4 | 98.7
RGB | Fu et al., 2018 | 0.509 | 0.115 | 82.8 | 96.5 | 99.2
RGBD | Liao et al., 2017 | 0.442 | 0.104 | 87.8 | 96.4 | 98.9
RGBD | Ma and Karaman, 2018 | 0.230 | 0.044 | 97.1 | 99.4 | 99.8
RGBD | Ma_Initial | 0.409 | 0.134 | 86.4 | 96.2 | 99.0
RGBD | Ours | 0.225 | 0.070 | 94.9 | 99.1 | 99.7

due to plants and shadows. We also compare both RGB-based methods and RGBD-based methods on the KITTI dataset. A similar conclusion can be drawn from Table 2: RGBD-based methods achieve significantly better performance than RGB-based methods in the outdoor environment. Compared with the other two RGBD-based methods, our approach obtains better performance. The KITTI images are larger than the NYU-Depth-v2 images and their maximum depth is larger, so sparsely labelled depth inputs need more real depth values to achieve good results, whereas our initial depth images projected from LiDAR data are relatively denser and therefore work better for large images and large scenes. We also give some qualitative prediction results in Fig. 6, which demonstrate that our method can effectively infer the depth of the RGB images.
5.1.6. Analysis of different input data
We compare the depth prediction results of three kinds of input data: the initial depth map (I), the RGB image (RGB), and the RGB image together with its corresponding initial depth map (RGBD). The rest of the network is identical and the setup is the same as in Section 5.1.3. The results are listed in Tables 3 and 4, respectively, where I_org denotes the original initial depth image. The results for I in Table 3 demonstrate that the proposed method is able to correct the depth values predicted from the inaccurate initial depth maps; the probable reason is that the structure of the depth information is learned by the convolutional neural network once it is properly trained. Better results are obtained when the initial depth is fused with the RGB image, as RGBD contains both the correct structure information and global scale information. For the KITTI dataset, the depth prediction results from the initial depth alone are significantly worse than those from the other input data: unlike NYU-Depth-v2, the ground truth depth used for training is sparse, and structure information cannot be inferred by the network from sparsely distributed ground truth depth values. However, although the initial map alone cannot restore the dense depth, it still improves the results compared to using the RGB image alone.

5.1.7. Loss analysis
To find the proper loss function to train the network, we conduct ablation experiments to validate the effectiveness of the three depth-wise losses as well as the gradient loss and the SSIM loss. The experimental setup is repeated for each experiment as in Section 5.1.3 except for the loss function. To compare the loss functions, we use the same network architecture, in which the initial depth and RGB information are concatenated after the first convolutional layer. On the NYU-Depth-v2 dataset, the l1, l2 and berHu losses are compared as well as their combinations with the gradient loss and the SSIM loss. For the KITTI dataset, we only evaluate the l1, l2 and berHu losses, because the ground truth depth is sparsely distributed in KITTI while the prediction is dense, so the SSIM loss and the gradient loss do not help in this case. The results are listed in Tables 5 and 6, respectively. From Tables 5 and 6, it can be seen that the l1, l2 and berHu losses alone achieve good results, with the l2 loss performing slightly better than the other two. We can also see that the gradient loss and the SSIM loss fail to help, as the performance decreases when they are added to the depth-wise loss. The reason might be that the additional losses make the network overfit to the training data and reduce its generalization to the testing data.

5.1.8. Analysis of the fusion strategy
To find the best strategy for fusing the initial depth image and the RGB image, we conduct experiments that fuse the initial depth and RGB data at different layers, listed as follows.
1. Input: the initial depth image and the RGB image are concatenated before being fed into the network.
2. Conv1: the initial depth image and the RGB image are fused after the Conv1 layer.
3. Res1: the initial depth image and the RGB image are fused after the Res1 block.
4. Res2: the initial depth image and the RGB image are fused after the Res2 block.
5. Res3: the initial depth image and the RGB image are fused after the Res3 block.
6. Output: the initial depth image and the RGB image are fused before the last convolutional layer.
We use the same training setup as in Section 5.1.3 on the NYU-Depth-v2 dataset, with the l1 training loss. The results are shown in Table 7. It can be seen from Table 7 that fusion at later layers gives worse results; in general, the later the fusion layer, the worse the results. The best performance is achieved with the Input and Conv1 layers. Earlier network layers contain the spatial information, which is highly related to depth prediction. To fully maintain the spatial information, we also conduct an experiment that adds the initial depth information just before the last depth prediction layer; its results are the worst, which implies that the network requires sufficiently complex operations to fuse the two sources of information well.
Fig. 5. Qualitative depth prediction results on the NYU-Depth-v2 dataset. The first column shows the RGB images, columns (b)-(d) show the results of comparison methods, column (e) shows the results of the proposed method, and (f) shows the real depth images.
Table 2
Comparison with the state-of-the-art on the KITTI dataset. The reported values are taken from the respective papers. The best performance is highlighted in bold. rmse and rel: lower is better; δ1, δ2, δ3: higher is better.

Problem | Method | rmse | rel | δ1 | δ2 | δ3
RGB | Liu et al., 2015 | 6.986 | 0.217 | 64.7 | 88.2 | 96.1
RGB | Eigen et al., 2014 | 6.179 | 0.197 | 69.2 | 89.9 | 96.7
RGB | Cao et al., 2017 | 4.712 | 0.115 | 88.7 | 96.3 | 98.2
RGB | Garg et al., 2016 | 5.104 | 0.169 | 74.0 | 90.4 | 96.2
RGB | Godard et al., 2017 | 5.381 | 0.126 | 84.3 | 94.1 | 97.2
RGB | Zhang and Funkhouser, 2018 | 4.310 | 0.136 | 83.3 | 95.7 | 98.7
RGB | Fu et al., 2018 | 2.271 | 0.071 | 93.6 | 98.5 | 99.5
RGBD | Ma and Karaman, 2018 | 3.378 | 0.073 | 93.5 | 97.6 | 98.9
RGBD | Liao et al., 2017 | 4.50 | 0.113 | 87.4 | 96.0 | 98.4
RGBD | Ours | 2.710 | 0.068 | 95.1 | 98.3 | 99.3
Fig. 6. Qualitative depth prediction results on the KITTI dataset. (a) RGB images; (b) prediction results of Eigen (Eigen et al., 2014); (c) predictions of our method; (d) ground truth depth images.

Table 3
Results of different input data on the NYU-Depth-v2 dataset. rmse and rel: lower is better; δ1, δ2, δ3: higher is better.

Input | rmse | rel | δ1 | δ2 | δ3
I_org | 0.733 | 0.133 | 86.9 | 90.5 | 91.6
I | 0.320 | 0.108 | 89.1 | 98.4 | 99.7
RGB | 0.514 | 0.143 | 81.0 | 95.9 | 98.9
RGBD | 0.228 | 0.070 | 94.3 | 98.9 | 99.8

Table 4
Results of different input data on the KITTI dataset. rmse and rel: lower is better; δ1, δ2, δ3: higher is better.

Input | rmse | rel | δ1 | δ2 | δ3
I_org | 20.25 | 0.943 | 4.7 | 6.8 | 7.1
I | 16.38 | 0.56 | 6.6 | 24.4 | 41.0
RGB | 6.33 | 0.21 | 53.7 | 92.5 | 97.8
RGBD | 2.71 | 0.068 | 95.1 | 98.4 | 99.3
5.2. Localization

5.2.1. Datasets
7Scene (Shotton et al., 2013) is used to evaluate the localization performance. 7Scene is an indoor image dataset for camera relocalization and trajectory tracking, collected with a hand-held Kinect. The ground truth poses and the 3D maps are generated using the KinectFusion approach (Newcombe et al., 2011). The dataset is captured in 7 indoor scenes; each scene contains several image sequences together with the corresponding depth image sequences, already divided into training and testing sets. The images are taken at a resolution of 640 × 480 with known intrinsic parameters. We use the official split of training and testing sets to train and test our depth estimation network.
Table 5
Evaluation of loss functions on the NYU-Depth-v2 dataset. rmse and rel: lower is better; δ1, δ2, δ3: higher is better.

Loss | Additional loss | rmse | rel | δ1 | δ2 | δ3
l1 | +SSIM loss | 0.304 | 0.091 | 95.6 | 97.7 | 99.9
l1 | +Gradient loss | 0.275 | 0.092 | 91.5 | 98.7 | 99.8
l1 | – | 0.228 | 0.070 | 94.3 | 98.9 | 99.8
berHu | +SSIM loss | 0.598 | 0.153 | 80.5 | 97.6 | 99.3
berHu | +Gradient loss | 0.323 | 0.109 | 90.3 | 97.5 | 99.6
berHu | – | 0.243 | 0.075 | 94.1 | 98.9 | 99.8
l2 | +SSIM loss | 0.327 | 0.097 | 93.2 | 98.9 | 99.8
l2 | +Gradient loss | 0.303 | 0.104 | 90.2 | 98.0 | 99.7
l2 | – | 0.225 | 0.070 | 94.9 | 99.1 | 99.8
Table 6
Comparison of loss functions on the KITTI dataset. rmse and rel: lower is better; δ1, δ2, δ3: higher is better.

Loss | rmse | rel | δ1 | δ2 | δ3
l1 | 2.73 | 0.057 | 95.9 | 98.3 | 99.1
berHu | 2.74 | 0.058 | 95.8 | 98.2 | 99.2
l2 | 2.71 | 0.068 | 95.1 | 98.4 | 99.3
Table 7
Evaluation of different fusion strategies on the NYU-Depth-v2 dataset. rmse and rel: lower is better; δ1, δ2, δ3: higher is better.

Architecture | rmse | rel | δ1 | δ2 | δ3
Input | 0.231 | 0.066 | 96.2 | 99.3 | 99.8
Conv1 | 0.228 | 0.070 | 94.3 | 98.9 | 99.8
Res1 | 0.242 | 0.075 | 94.2 | 99.2 | 99.9
Res2 | 0.297 | 0.107 | 89.4 | 97.1 | 99.5
Res3 | 0.400 | 0.134 | 84.9 | 96.9 | 99.5
Output | 0.600 | 0.191 | 71.8 | 93.0 | 98.0
5.2.2. Depth prediction results
In this section, we evaluate the RGB-based method and the RGBD-based methods with the initial depth generated from the local 3D map (RGBD/L) and from the global 3D point cloud map (RGBD/G). We use the same setup for depth prediction on the 7Scene dataset as for NYU-Depth-v2 and perform depth prediction from RGB images and from RGBD data. The initial map is generated from the corresponding depth image by adding a certain position and orientation noise to the real pose, as in the NYU-Depth-v2 experiments. The l2 loss is used to train the network. The results are shown in Table 8. Comparing the depth prediction results on 7Scene from RGB and RGBD information, we find that RGBD achieves better performance, which further verifies our idea on more datasets. Another observation is that the performance of RGBD/G is slightly better than that of RGBD/L; the probable reason is that the global map provides more information than the local 3D map, especially at the boundaries of the images. Some qualitative prediction examples for the 7 scenes are shown in Fig. 7, which show that the RGBD-based method also improves the structural details compared to using RGB data alone.

Table 8
Depth prediction results on the 7Scene dataset. rmse and rel: lower is better; δ1, δ2, δ3: higher is better.

Dataset | Input | rmse | rel | δ1 | δ2 | δ3
chess | RGB | 0.215 | 0.076 | 93.9 | 98.8 | 99.7
chess | RGBD/L | 0.186 | 0.167 | 95.5 | 99.1 | 99.9
chess | RGBD/G | 0.207 | 0.060 | 94.0 | 98.8 | 99.9
fire | RGB | 0.108 | 0.048 | 97.5 | 99.5 | 99.9
fire | RGBD/L | 0.140 | 0.070 | 95.2 | 99.4 | 99.9
fire | RGBD/G | 0.111 | 0.046 | 97.4 | 99.6 | 99.9
heads | RGB | 0.160 | 0.140 | 78.7 | 94.1 | 98.8
heads | RGBD/L | 0.112 | 0.109 | 90.3 | 98.7 | 99.8
heads | RGBD/G | 0.149 | 0.130 | 82.0 | 95.4 | 99.1
office | RGB | 0.243 | 0.093 | 91.2 | 98.6 | 99.7
office | RGBD/L | 0.184 | 0.069 | 95.4 | 99.1 | 99.8
office | RGBD/G | 0.165 | 0.050 | 95.9 | 99.3 | 99.9
pumpkin | RGB | 0.155 | 0.054 | 97.6 | 99.5 | 99.8
pumpkin | RGBD/L | 0.149 | 0.049 | 97.4 | 99.4 | 99.8
pumpkin | RGBD/G | 0.092 | 0.025 | 99.5 | 99.9 | 99.9
redkitchen | RGB | 0.228 | 0.087 | 92.6 | 99.0 | 99.9
redkitchen | RGBD/L | 0.208 | 0.079 | 94.6 | 99.3 | 99.9
redkitchen | RGBD/G | 0.164 | 0.044 | 97.0 | 99.2 | 99.9
stairs | RGB | 0.344 | 0.091 | 87.4 | 96.4 | 99.3
stairs | RGBD/L | 0.268 | 0.072 | 91.6 | 98.0 | 99.4
stairs | RGBD/G | 0.380 | 0.102 | 88.4 | 95.3 | 98.4
5.2.3. Localization results
Given the predicted depth, the ICP algorithm is applied to perform localization as described in Section 3. The overlap parameter is set to 0.7 to eliminate depth predictions with large errors, the maximum number of iterations is set to 30, and the threshold to stop the iteration is set to 0.01. The localization results are shown in Table 9, reported as median errors to facilitate comparison. Comparing the CNN-based pose regression methods (Kendall et al., 2017; Laskar et al., 2017; Li et al., 2019) with the depth prediction-based methods, we can conclude that depth prediction helps increase the localization performance in both position and orientation. More accurate depth prediction leads to better localization, as seen by comparing the RGB depth prediction-based method with the RGBD depth prediction-based method: the positional error decreases from 0.177 m to 0.102 m and the orientational error drops from 9.39° to 3.40°. The table also shows that DSAC achieves slightly better performance on the first six scenes, but its performance drops significantly on the stairs scene, which has repetitive structure; this limits its application in indoor scenes. In contrast, the proposed method is more robust, with comparable performance. Besides, one advantage of our method is that other localization strategies can be used to provide the initialization pose.

5.2.4. Localization performance over 3D maps
In this section, we compare the localization performance on four types of 3D point clouds: a 3D map generated from the training depth images, a 3D map from the testing depth images, a 3D map from both training and testing images, and 3D maps built from the corresponding depth image of each query. For the first three 3D maps, we first produce the local point clouds and transform them with the pose of each depth image; we then remove duplicated points from the dense point cloud by setting the minimum distance between points to 0.001 m. The fourth 3D map is comprised of the local point clouds generated from the testing images. The ICP setup is the same as in Section 5.2.3 for all four experiments. The localization performance is shown in Table 10, with errors again reported as median values. It can be seen from Table 10 that the localization performance is very close in both positional and orientational accuracy for all seven scenes. This is because, although the training and testing sets are captured along different trajectories, the 3D maps from the training set overlap strongly with the validation data. The table also shows that Map_global obtains better localization results than Map_val, and Map_train gives the worst performance. The probable reason is that Map_train does not cover all the
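For completeness, the per-image errors summarized (as medians) in Tables 9 and 10 can be computed as sketched below; the rotation-angle formula from the trace of the relative rotation is a standard choice and an assumption about the exact protocol.

```python
import numpy as np

def pose_errors(R_est, T_est, R_gt, T_gt):
    """Positional error (metres) and orientational error (degrees) for one image;
    the tables report the per-scene median of these values."""
    pos_err = np.linalg.norm(T_est - T_gt)
    # angle of the relative rotation R_est * R_gt^T
    cos_angle = np.clip((np.trace(R_est @ R_gt.T) - 1.0) / 2.0, -1.0, 1.0)
    ori_err = np.degrees(np.arccos(cos_angle))
    return pos_err, ori_err
```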
Fig. 7. Qualitative depth prediction on the 7Scenes dataset. The top row shows the RGB data, the second row the prediction results from the RGB image, the third row the results from RGBD data, and the bottom row the ground truth depth.

Table 9
Comparison with CNN-based localization over the 7Scene dataset (median orientation, position errors per scene). The best localization results are highlighted in bold; the second best are in italic.

Dataset | PoseNet2 (Kendall et al., 2017) | Relnet (Laskar et al., 2017) | Siamese CNN (Li et al., 2019) | DSAC (Brachmann et al., 2017) | RGB-depth | RGBD-depth (ours)
chess | 4.48°, 0.13 m | 6.46°, 0.13 m | 2.99°, 0.07 m | 1.2°, 0.02 m | 5.19°, 0.099 m | 2.49°, 0.077 m
fire | 11.3°, 0.27 m | 12.72°, 0.26 m | 3.12°, 0.07 m | 1.5°, 0.04 m | 11.64°, 0.253 m | 1.22°, 0.035 m
heads | 13.0°, 0.17 m | 12.34°, 0.14 m | 16.77°, 0.29 m | 2.7°, 0.03 m | 13.20°, 0.126 m | 6.44°, 0.140 m
office | 5.55°, 0.19 m | 7.35°, 0.21 m | 4.94°, 0.15 m | 1.6°, 0.04 m | 7.71°, 0.161 m | 4.66°, 0.141 m
pumpkin | 4.75°, 0.26 m | 6.35°, 0.24 m | 4.60°, 0.15 m | 2.0°, 0.05 m | 6.61°, 0.163 m | 4.03°, 0.154 m
redkitchen | 5.35°, 0.23 m | 8.03°, 0.24 m | 2.86°, 0.15 m | 2.0°, 0.05 m | 8.24°, 0.174 m | 2.45°, 0.086 m
stairs | 12.4°, 0.35 m | 11.82°, 0.27 m | 7.56°, 0.23 m | 33.1°, 1.17 m | 13.13°, 0.260 m | 2.48°, 0.078 m
average | 8.12°, 0.23 m | 9.30°, 0.21 m | 6.12°, 0.17 m | 6.2°, 0.2 m | 9.39°, 0.177 m | 3.40°, 0.102 m
Table 10
Localization performance with respect to the 3D map constructed from the training depth images (Map_train), the testing depth images (Map_val), all depth images (Map_global), and the local 3D points from the corresponding depth images (Map_local). Median orientation, position errors per scene.

Dataset | Map_train | Map_val | Map_global | Map_local
chess | 2.89°, 0.081 m | 2.57°, 0.078 m | 2.46°, 0.081 m | 2.49°, 0.077 m
fire | 1.66°, 0.043 m | 1.56°, 0.036 m | 1.36°, 0.039 m | 1.22°, 0.035 m
heads | 6.33°, 0.150 m | 6.80°, 0.143 m | 6.6°, 0.143 m | 6.44°, 0.140 m
office | 5.02°, 0.186 m | 5.00°, 0.174 m | 4.86°, 0.177 m | 4.66°, 0.141 m
pumpkin | 4.605°, 0.170 m | 4.76°, 0.168 m | 4.14°, 0.159 m | 4.03°, 0.154 m
redkitchen | 2.85°, 0.173 m | 2.75°, 0.175 m | 2.67°, 0.176 m | 2.45°, 0.086 m
stairs | 2.60°, 0.108 m | 2.66°, 0.056 m | 2.65°, 0.055 m | 2.48°, 0.078 m
average | 3.71°, 0.130 m | 3.73°, 0.119 m | 3.53°, 0.119 m | 3.40°, 0.102 m
points than appear in the generated local map while the Map_val is capable of it and the Map_global is better than the Map_val. The Map_local achieves the best performance because it is not only cover the generated points clouds but also are not down-sampled.
Map_local
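As referenced above, the following is a minimal sketch of how the posed depth images can be merged into such a map: each depth image is back-projected with the camera intrinsics, transformed by its camera-to-world pose, and the merged cloud is thinned to a 1 mm point spacing. The helper names (`backproject`, `build_map`, `K`, `poses`) and the use of Open3D voxel down-sampling to approximate the duplicate removal are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of global map construction from posed depth images (assumed Open3D API).
import numpy as np
import open3d as o3d

def backproject(depth, K):
    """Back-project a depth image (metres) into camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)[valid]

def build_map(depth_images, poses, K, min_dist=0.001):
    """Merge posed local point clouds and thin them to a minimum point spacing."""
    merged = []
    for depth, T in zip(depth_images, poses):   # T: 4x4 camera-to-world pose
        pts = backproject(depth, K)
        pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
        merged.append((pts_h @ T.T)[:, :3])     # transform into the map frame
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.vstack(merged))
    # Voxel down-sampling with a 1 mm voxel approximates the duplicate removal.
    return pcd.voxel_down_sample(voxel_size=min_dist)
```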
5.2.5. Localization performance over initialization pose
In this section, we evaluate the localization performance when different levels of positional and orientational noise are added to the initial pose along the X, Y and Z axes. The positional noise ranges from −0.2 m to 0.2 m with a step of 0.05 m, and the orientational noise ranges from −11.46° to 11.46° with a step of 2.87°. With each perturbed initial pose, we fine-tune the trained models with the new initial depth maps on the chess dataset, and the predicted depth map is used to generate the point cloud for pose refinement. Fig. 8a shows the positional error with respect to the positional noise and Fig. 8b the orientational error with respect to the orientational noise.
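One possible way to generate the perturbed initial poses is to compose the CNN-estimated pose with axis-specific offsets, as in the sketch below (assuming SciPy ≥ 1.4 is available); the function name and the single-axis axis-angle convention are assumptions made for illustration, not the paper's exact procedure.

```python
# Sketch of per-axis pose perturbation for the robustness study (illustrative only).
import numpy as np
from scipy.spatial.transform import Rotation as R

def perturb_pose(T_init, axis, trans_noise=0.0, rot_noise_deg=0.0):
    """Add translation noise (metres) and rotation noise (degrees) about one axis.

    T_init : 4x4 camera-to-world pose; axis : 0, 1 or 2 for X, Y, Z.
    """
    T = T_init.copy()
    # Translation offset along the chosen axis.
    T[axis, 3] += trans_noise
    # Rotation offset about the chosen axis, composed with the initial rotation.
    axis_vec = np.zeros(3)
    axis_vec[axis] = np.deg2rad(rot_noise_deg)
    T[:3, :3] = R.from_rotvec(axis_vec).as_matrix() @ T[:3, :3]
    return T

# Example: sweep positional noise on the Z axis from -0.2 m to 0.2 m in 0.05 m steps.
noisy_poses = [perturb_pose(np.eye(4), axis=2, trans_noise=d)
               for d in np.arange(-0.2, 0.2001, 0.05)]
```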
Fig. 8. The positional and orientational error over the positional and orientational noise. The green line represents the localization error with respect to the noise added on the X axis, the red line the Y axis and the blue line the Z axis. The grey line represents the error of the initial pose. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
The first conclusion we can draw from the two figures is that the more accurate the initial pose is, the better the method works. It can also be seen that orientational noise between −11.46° and 11.46° can be correctly refined for all three axes, and that noise on the Y and Z axes is refined better than noise on the X axis. For the positional error, noise on the Y axis is not refined; noise on the Z axis is refined within the range −0.2 m to 0.2 m but tends not to be refined beyond 0.2 m; and the X axis can tolerate large positional noise.
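For completeness, positional and orientational errors such as those plotted in Fig. 8 are commonly computed as the Euclidean distance between camera centres and the geodesic angle of the relative rotation; the sketch below shows these standard metrics, which may differ in detail from the exact definitions used in the paper.

```python
# Hedged sketch of standard positional / orientational error metrics.
import numpy as np

def pose_errors(T_est, T_gt):
    """Return (position error in metres, orientation error in degrees)."""
    # Positional error: Euclidean distance between camera centres.
    t_err = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    # Orientational error: geodesic angle of the relative rotation.
    R_rel = T_est[:3, :3].T @ T_gt[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.degrees(np.arccos(cos_angle))
    return t_err, r_err

# Median errors over a test sequence, as reported in Tables 9 and 10:
# errors = [pose_errors(T_e, T_g) for T_e, T_g in zip(estimates, ground_truths)]
# median_t = np.median([e[0] for e in errors]); median_r = np.median([e[1] for e in errors])
```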
6. Concluding remarks
Single indoor image localization in a 3D map is important for many applications. This paper presents a new framework that localizes single images in 3D maps by inferring depth from RGB images and matching the resulting point clouds to the map based on their geometric similarity in 3D space. Moreover, we propose a new depth prediction method that warps the 3D map information into an initial depth map to guide the prediction; the depth prediction results outperform the state of the art. We also evaluate our localization approach on the 7Scenes dataset, and the experimental results demonstrate the effectiveness of our method in improving localization accuracy. In principle, our method can be equally applied to single outdoor image localization. We have in fact tested the algorithm on an outdoor dataset, but due to the difficulty of obtaining an accurate 3D map the performance is not as good as for indoor images. Our future work will focus on applying the method to outdoor scenarios. A limitation of the proposed method is its large memory consumption, which depends on the scene. ICP dominates the runtime of the localization: in our experiments, estimating the initial pose takes about 0.01 s, generating the initial depth map from the local map costs about 0.5 s on average, and the depth map refinement takes 0.02 s, while the ICP-based localization takes the bulk of the time, ranging from 0.1 s to 0.7 s depending on the number of ICP iterations.

Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 41871329, in part by the Science and Technology Planning Project of Guangdong Province under Grant 2018B020207005, in part by the Shenzhen Scientific Research and Development Funding Program under Grant JCYJ20170818092931604, and in part by the Horizon Centre for Doctoral Training at the University of Nottingham (RCUK Grant No. EP/L015463/1).

References
Bao, J., Gu, Y., Hsu, L.-T., Kamijo, S., 2016. Vehicle self-localization using 3D building map and stereo camera. In: 2016 IEEE Intelligent Vehicles Symposium (IV). IEEE, pp. 927–932.
Bay, H., Tuytelaars, T., Van Gool, L., 2006. SURF: Speeded up robust features. In: European Conference on Computer Vision. Springer, pp. 404–417.
Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S., Rother, C., 2017. DSAC - differentiable RANSAC for camera localization. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Cadena, C., Dick, A.R., Reid, I.D., 2016. Multi-modal auto-encoders as joint estimators for robotics scene understanding. In: Robotics: Science and Systems.
Cao, Y., Wu, Z., Shen, C., 2017. Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol. 28 (11), 3174–3182.
Caselitz, T., Steder, B., Ruhnke, M., Burgard, W., 2016. Monocular camera localization in 3D lidar maps. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 1926–1931.
Eigen, D., Fergus, R., 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2650–2658.
Eigen, D., Puhrsch, C., Fergus, R., 2014. Depth map prediction from a single image using a multi-scale deep network. In: International Conference on Neural Information Processing Systems, pp. 2366–2374.
Forster, C., Pizzoli, M., Scaramuzza, D., 2013. Air-ground localization and map augmentation using monocular dense reconstruction. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, pp. 3971–3978.
Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D., 2018. Deep ordinal regression network for monocular depth estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Garg, R., Bg, V.K., Carneiro, G., Reid, I., 2016. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision. Springer, pp. 740–756.
Godard, C., Mac Aodha, O., Brostow, G.J., 2017. Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Jiao, J., Cao, Y., Song, Y., Lau, R., 2018. Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In: European Conference on Computer Vision (ECCV), pp. 1–17.
Karsch, K., Liu, C., Kang, S.B., 2014. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Trans. Pattern Anal. Mach. Intell. 36 (11), 2144–2158.
Kendall, A., Cipolla, R., 2017. Geometric loss functions for camera pose regression with deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Kendall, A., Grimes, M., Cipolla, R., 2015. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2938–2946.
Kim, Y., Jeong, J., Kim, A., 2018. Stereo camera localization in 3D LiDAR maps. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–9.
Konrad, J., Wang, M., Ishwar, P., Wu, C., Mukherjee, D., 2013. Learning-based, automatic 2D-to-3D image and video conversion. IEEE Trans. Image Process. 22 (9), 3485–3496.
Kröse, B.J., Vlassis, N., Bunschoten, R., Motomura, Y., 2001. A probabilistic model for appearance-based robot localization. Image Vis. Comput. 19 (6), 381–391.
Kuznietsov, Y., Stuckler, J., Leibe, B., 2017. Semi-supervised deep learning for monocular depth map prediction. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2215–2223.
Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N., 2016. Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV). IEEE, pp. 239–248.
Laskar, Z., Melekhov, I., Kalia, S., Kannala, J., 2017. Camera relocalization by computing pairwise relative poses using convolutional neural network. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). IEEE, pp. 920–929.
Li, Q., Zhu, J., Cao, R., Sun, K., Garibaldi, J.M., Li, Q., Liu, B., Qiu, G., 2019. Relative geometry-aware siamese neural network for 6DOF camera relocalization. arXiv:1901.01049.
Li, B., Shen, C., Dai, Y., Van Den Hengel, A., He, M., 2015. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1119–1127.
Liao, Y., Huang, L., Wang, Y., Kodagoda, S., Yu, Y., Liu, Y., 2017. Parse geometry from a line: Monocular depth estimation with partial laser observation. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 5059–5066.
Liu, B., Gould, S., Koller, D., 2010. Single image depth estimation from predicted semantic labels. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, pp. 1253–1260.
Liu, M., Salzmann, M., He, X., 2014. Discrete-continuous depth estimation from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–723.
Liu, F., Shen, C., Lin, G., Reid, I., 2015. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 38 (10), 2024–2039.
Lowe, D.G., 1999. Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision. IEEE, pp. 1150–1157.
Ma, F., Karaman, S., 2018. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 1–8.
Menegatti, E., Zoccarato, M., Pagello, E., Ishiguro, H., 2004. Image-based Monte Carlo localisation with omnidirectional images. Rob. Auton. Syst. 48 (1), 17–30.
Murillo, A.C., Kosecka, J., 2009. Experiments in place recognition using gist panoramas. In: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, pp. 2196–2203.
Neubert, P., Schubert, S., Protzel, P., 2017. Sampling-based methods for visual navigation in 3D maps by synthesizing depth images. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 2492–2498.
Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohi, P., Shotton, J., Hodges, S., Fitzgibbon, A., 2011. KinectFusion: Real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, pp. 127–136.
Roy, A., Todorovic, S., 2016. Monocular depth estimation using neural regression forest. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5506–5514.
Rublee, E., Rabaud, V., Konolige, K., Bradski, G., 2011. ORB: An efficient alternative to SIFT or SURF. In: 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, pp. 2564–2571.
Sattler, T., Leibe, B., Kobbelt, L., 2012. Improving image-based localization by active correspondence search. In: European Conference on Computer Vision. Springer, pp. 752–765.
Sattler, T., Leibe, B., Kobbelt, L., 2017. Efficient & effective prioritized matching for large-scale image-based localization. IEEE Trans. Pattern Anal. Mach. Intell. 39 (9), 1744–1756.
Sattler, T., Torii, A., Sivic, J., Pollefeys, M., Taira, H., Okutomi, M., Pajdla, T., 2017. Are large-scale 3D models really necessary for accurate visual localization? In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1637–1646. https://doi.org/10.1109/CVPR.2017.654.
Saxena, A., Chung, S.H., Ng, A.Y., 2006. Learning depth from single monocular images. In: Advances in Neural Information Processing Systems, pp. 1161–1168.
Schönberger, J.L., Frahm, J.-M., 2016. Structure-from-motion revisited. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.-M., 2016. Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision (ECCV).
Segal, A., Haehnel, D., Thrun, S., 2009. Generalized-ICP. In: Robotics: Science and Systems.
Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A., 2013. Scene coordinate regression forests for camera relocalization in RGB-D images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2930–2937.
Silberman, N., Hoiem, D., Kohli, P., Fergus, R., 2012. Indoor segmentation and support inference from RGBD images. In: European Conference on Computer Vision (ECCV).
Stewart, A.D., Newman, P., 2012. LAPS - localisation using appearance of prior structure: 6-DOF monocular camera localisation using prior pointclouds. In: 2012 IEEE International Conference on Robotics and Automation. IEEE, pp. 2625–2632.
Sun, X., Xie, Y., Luo, P., Wang, L., 2017. A dataset for benchmarking image-based localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7436–7444.
Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., Pajdla, T., Torii, A., 2018. InLoc: Indoor visual localization with dense matching and view synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7199–7209.
Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A., 2017. Sparsity invariant CNNs. In: International Conference on 3D Vision (3DV).
Uyttendaele, M., Cohen, M., Sinha, S., Lim, H., 2012. Real-time image-based 6-DOF localization in large-scale environments. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 1043–1050.
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), 600–612.
Wang, J., Zha, H., Cipolla, R., 2006. Coarse-to-fine vision-based localization by indexing scale-invariant features. IEEE Trans. Syst. Man Cybern. B Cybern. 36 (2), 413–422. https://doi.org/10.1109/TSMCB.2005.859085.
Wolcott, R.W., Eustice, R.M., 2014. Visual localization within lidar maps for automated urban driving. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, pp. 176–183.
Xu, Y., John, V., Mita, S., Tehrani, H., Ishimaru, K., Nishino, S., 2017. 3D point cloud map based vehicle localization using stereo camera. In: 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, pp. 487–492.
Xu, D., Ricci, E., Ouyang, W., Wang, X., Sebe, N., 2017. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5354–5362.
Zhang, Y., Funkhouser, T., 2018. Deep depth completion of a single RGB-D image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 175–185.
Zhou, T., Brown, M., Snavely, N., Lowe, D.G., 2017. Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858.