DRM-SLAM: Towards Dense Reconstruction of Monocular SLAM with Scene Depth Fusion

Communicated by Bin Fan

PII: S0925-2312(20)30234-4
DOI: https://doi.org/10.1016/j.neucom.2020.02.044
Reference: NEUCOM 21912
To appear in: Neurocomputing
Received date: 25 April 2019
Revised date: 16 September 2019
Accepted date: 8 February 2020

Please cite this article as: Xinchen Ye, Xiang Ji, Baoli Sun, Shenglun Chen, Zhihui Wang, Haojie Li, DRM-SLAM: Towards Dense Reconstruction of Monocular SLAM with Scene Depth Fusion, Neurocomputing (2020), doi: https://doi.org/10.1016/j.neucom.2020.02.044


Xinchen Ye a,b,*, Xiang Ji a, Baoli Sun a, Shenglun Chen a, Zhihui Wang a,b and Haojie Li a,b

a DUT-RU International School of Information Science & Engineering, Dalian University of Technology, Liaoning, China
b Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, China

ARTICLE INFO

Keywords: Depth fusion, SLAM, Monocular, Depth estimation

ABSTRACT

Monocular visual SLAM methods can accurately track the camera pose and infer the scene structure by building sparse correspondences between two or multiple views of the scene. However, the reconstructed 3D maps of these methods are extremely sparse. On the other hand, deep learning is widely used to predict dense depth maps from single-view color images, but the results suffer from blurry depth boundaries, which severely deform the 3D scene structure. Therefore, this paper proposes a dense reconstruction method under the monocular SLAM framework (DRM-SLAM), in which a novel scene depth fusion scheme is designed to fully utilize both the sparse depth samples from monocular SLAM and the dense depth maps predicted by a convolutional neural network (CNN). In the scheme, a CNN architecture is carefully designed for robust depth estimation. Our approach also accounts for the scale ambiguity inherent in monocular SLAM. Extensive experiments on benchmark datasets and our captured dataset demonstrate the accuracy and robustness of the proposed DRM-SLAM. The evaluation of runtime and adaptability under challenging environments also verifies the practicability of our method.

1. Introduction

The rapid development of visual simultaneous localization and mapping (vSLAM) [54, 55, 24] has created a new visualization and sensing wave for the computer vision community. The footprints of robust tracking and 3D mapping have influenced a broad spectrum of technological frontiers, e.g., autonomous navigation, mobile robots and augmented reality [42, 17]. Existing SLAM techniques can achieve accurate scene reconstruction by relying on RGB-D or stereo cameras [23, 48, 49, 37]. However, the dependence on capturing equipment with special imaging mechanisms may bring limitations in real conditions. Therefore, SLAM based on a monocular camera, which is free from such constraints on the capturing equipment, is widely used; it is basically classified into feature-based methods and direct methods. Feature-based monocular techniques extract sparse interest points [32, 1, 41] in every video frame, construct correspondences between these features across different frames, and then use them to estimate the camera pose and reconstruct the 3D scene [6, 14, 7, 25, 36]. Although the tracking performance of such algorithms is impressive, the generated 3D map is extremely sparse and cannot be used in practical applications. On the other hand, direct methods have been proposed to construct semi-dense and high-quality 3D scene models in real time. Rather than optimizing on robust feature points, these approaches directly use raw pixel intensities on the basis of stereo matching [38, 40, 39, 16, 10, 11].

This work was supported by National Natural Science Foundation of China (NSFC) under Grant 61702078, 61772108, and by the Fundamental Research Funds for the Central Universities.
* Corresponding author: [email protected] (X. Ye)
ORCID(s): 0000-0001-5328-3911 (X. Ye)



Figure 1: A dense reconstruction example of the proposed DRM-SLAM. (a) A keyframe chosen from input video sequence; (b) Reconstructed sparse point cloud by feature-based ORB-SLAM [36]; (c) Dense depth map estimated by the proposed depth fusion scheme; (d) Reconstructed dense 3D scene by our DRM-SLAM approach.

However, the methods of this category also struggle to reconstruct a dense 3D map. The missing points in the generated 3D map mainly correspond to the textureless areas of the raw image, where the accuracy of stereo matching can be very low.

Dense solutions for monocular depth reconstruction can be classified into two categories, i.e., hand-crafted graphical models and CNN-based methods. The first category constructs graphical models and uses hand-crafted priors [5, 13, 4, 38, 22, 39] to regularize the depth reconstruction process. However, these hand-crafted priors are not optimal, and high-level scene context needs to be extracted to better understand the scene geometry from the single-view color image. Recently, CNNs have demonstrated a powerful representation ability to extract high-level features, and are used to estimate scene depth or disparity [29, 9] with higher accuracy.

Table 1: Properties of different depth estimation methods using heterogeneous depth data sources.

Method                       | 3D map                           | Scale ambiguity | Accuracy                 | Time complexity
-----------------------------|----------------------------------|-----------------|--------------------------|------------------------
Feature-based monocular SLAM | Sparse (only key-points)         | Yes             | High                     | real time
CNN-based depth estimation   | Dense (estimation of each pixel) | No              | Medium (scene-dependent) | real time on key-frames
Our DRM-SLAM                 | Dense (estimation of each pixel) | No              | High                     | nearly real time

Although the depth map predicted by a CNN is globally accurate, details such as depth boundaries and tiny objects cannot be well preserved, which has a big impact on other tasks that rely on the recovered depth map, such as object detection or image rendering for 3DTV. Therefore, Tateno et al. [46] proposed a dense depth reconstruction method in which the CNN-inferred depth map and depth measurements from direct monocular SLAM [11] are fused together to achieve accurate reconstruction. Weerasekera et al. [47] formulated depth estimation as an energy minimization process, in which the data term is constructed from the photometric difference while the smoothness term is regularized by CNN-inferred surface normals. Luo et al. [33] proposed a dense monocular SLAM system, which fuses direct SLAM with an online-adapted depth prediction network to achieve accurate depth prediction. Note that all the aforementioned dense solutions for monocular SLAM are based on direct SLAM [11]¹, since direct methods can optimize on raw pixels with sufficient image gradients to compute semi-dense scene depth, which makes dense reconstruction much easier. Compared to direct SLAM, the map-points computed by feature-based SLAM are relatively irregular and extremely sparse, so dense reconstruction based on feature-based SLAM is more challenging². Therefore, in this paper, we build our dense scene reconstruction on ORB-SLAM [36], a state-of-the-art method from the family of feature-based SLAMs. ORB-SLAM generates extremely sparse map-points, and most of them correspond to features extracted from high-texture image regions. These features are important cues for capturing the geometric structure of the 3D scene. In contrast, the depth map inferred from a CNN is dense and globally accurate in smooth image regions, but cannot preserve details and fine structures well due to the feature aggregation caused by repeated downsampling operations in CNN layers. Considering the above problems, we propose a dense reconstruction approach under the monocular SLAM framework (DRM-SLAM), in which a novel scene depth fusion scheme is designed to fully exploit both the sparse depth samples from ORB-SLAM and the CNN-inferred depth.

¹ Note that [47] employs a similar feature extraction and depth estimation framework based on graphical models as other direct methods.
² The difference between direct SLAM and feature-based SLAM is given in detail in Sec. 2.


Table 1 shows how the two kinds of depth estimation methods complement each other: CNN-inferred depth is dense but has lower accuracy, while depth from feature-based SLAM is more accurate but too sparse. Our DRM-SLAM achieves both dense and highly accurate depth estimation and scene reconstruction in nearly real time. Fig. 1 shows the original reconstruction of ORB-SLAM [36] and the dense reconstruction of the proposed DRM-SLAM. Note that ORB-SLAM recovers the exact geometric structure of the 3D scene but fails to generate a dense 3D reconstruction, while DRM-SLAM obtains the complete scene depth map and achieves accurate dense 3D reconstruction. We demonstrate on three benchmark datasets and our captured dataset that our approach outperforms other CNN-based methods in scene depth estimation and generates dense 3D reconstruction results comparable to other dense solutions. The main contributions are summarized as follows:

1) A depth fusion scheme based on a depth reconstruction model is proposed to fully exploit the sparse depth samples generated by ORB-SLAM and the depth map inferred by a CNN to achieve dense and accurate reconstruction. Before fusing the heterogeneous depth sources, the scale ambiguity problem and the uncertainty of the CNN-inferred depth around object boundaries are taken into consideration.

2) A deep CNN is designed based on the ResNet architecture [21] to learn depth maps from monocular color images. Dilated convolutions [53] are used to maintain a relatively high output resolution, and a multi-scale scheme [2] is employed to distinguish scene objects at different scales.

3) An effective parameter adaptation scheme is proposed to achieve stable and accurate dense reconstruction. The evaluation of runtime and adaptability under challenging environments also verifies the practicability of the proposed DRM-SLAM.

2. Related Work

In this section, we give an overview of related work in three aspects, i.e., the classification of monocular SLAM, dense solutions of monocular SLAM, and depth estimation methods.



2.1. Monocular Visual SLAM

The first monocular SLAM system, MonoSLAM, was proposed by Davison et al. [8] in 2007; it uses an extended Kalman filter to recursively estimate the camera pose and the sparse positions of features in each frame. After that, key-frame based approaches [25, 26, 36, 4] were proposed to estimate the map and pose using only key-frames, which is less costly yet more accurate than filtering-based methods [44]. Among feature-based monocular SLAM systems, a category of key-frame based approaches, ORB-SLAM [36] achieves state-of-the-art performance in pose estimation and tracking. Sparse features in high-texture regions are first extracted with ORB descriptors. Then, local bundle adjustment and pose graph optimization are used to reconstruct the 3D scene by successively tracking the sparse ORB features. Real-time camera relocalization with invariance to viewpoint and illumination allows successful recovery from tracking failures and also increases the probability of map reuse. Direct methods such as large-scale direct SLAM (LSD-SLAM) [11] keep track of depth values only on raw pixels with sufficient image gradients instead of sparse features. Their results are very impressive, as the system is able to build semi-dense maps in real time without GPU acceleration. Nevertheless, they still need features for loop detection, and their camera localization accuracy is significantly lower than that of ORB-SLAM. Moreover, because of the dependence on the intensity-consistency assumption, the accuracy of LSD-SLAM decreases when the scene has large illumination variation.

2.2. Dense Solutions of Monocular SLAM

Solutions achieving dense depth reconstruction with a monocular camera are mainly classified into two categories, i.e., pure geometry-based approaches and recent deep-learning based approaches. Combining parallel processing techniques with short-baseline multi-view stereo matching under a regularization scheme, dense tracking and mapping (DTAM) [38] followed the structure of PTAM [4] to achieve dense tracking and mapping in real time on a GPU. The reconstruction framework uses a photometric data term enhanced by hundreds of narrow-baseline images and a globally spatially regularized energy function to improve the quality of the optimization. However, since DTAM computes the depth of every single pixel and uses global optimization, its efficiency is low even with GPU acceleration. Moreover, Engel et al. [10] demonstrated that, in DTAM, small image intensity errors have a large effect on the accuracy of the estimated disparity in regions with small gradients; that is why LSD-SLAM only computes depth on pixels with sufficient gradients, improving both efficiency and accuracy. Another dense reconstruction work is probabilistic monocular dense reconstruction (Remode) [39], proposed by Pizzoli et al., which estimates dense and accurate depth maps from a single camera. A probabilistic depth measurement is carried out in real time on a per-pixel basis.

The computed uncertainty is used to reject erroneous estimations and provide live feedback on the reconstruction process. Each depth point is described and updated by a parametric model under a Bayesian estimation framework, and smoothness is finally applied to the depth map to obtain better results. Since the framework requires the camera pose at each frame and the depth range of the scene as inputs, the applicability of Remode to different scenes is limited. The two approaches above are pure geometry-based solutions, which usually ignore high-level scene context and suffer in low-texture regions. Recently, Weerasekera et al. [47] used a surface normal map predicted by a learned CNN [9] as a strong prior and constructed a graphical model to estimate a dense depth map constrained by a photometric cost and surface normal consistency. Following the model of DTAM, their surface normal prior replaces the inverse-depth smoothness prior used in DTAM. CNN-SLAM, proposed by Tateno et al. [46], presented a method where CNN-predicted dense depth maps are naturally fused with depth measurements obtained from direct monocular SLAM. A particularly important stage of the framework is the scheme employed to refine the CNN-predicted depth map associated with each key-frame via small-baseline stereo matching, by enforcing color consistency minimization between a key-frame and its associated input frames. Yang et al. [51] proposed a GAN-based method for real-time dense mapping with a monocular camera; it takes a semi-dense map obtained from motion stereo matching as guidance to supervise dense depth prediction from a single RGB image, and an adversarial loss and a pixel-wise mean squared error loss are used to train the generator. Based on CNN-SLAM, Luo et al. [33] proposed a novel dense monocular SLAM system, which fuses direct SLAM with an online-adapted depth prediction network (OADPN for short) to achieve accurate depth prediction for scenes of different types. The depth prediction network is tuned on-the-fly toward better generalization to different scene types, and a stage-wise stochastic gradient descent algorithm is used for efficient convergence of the tuning process. Meanwhile, the dense map produced by the CNN is used to handle the scale ambiguity problem, which in turn improves the accuracy of both tracking and the overall reconstruction. Note that the above methods are all based on direct SLAM. Unlike feature-based SLAM, which computes sparse feature points for subsequent matching, direct SLAM can directly optimize on raw pixels with sufficient image gradients to reconstruct a semi-dense map, which makes it easier to achieve dense reconstruction. Yet, direct techniques have lower accuracy and robustness than feature-based methods because of their dependence on the intensity-consistency assumption. Motivated by this, we build our dense scene reconstruction framework on the more challenging feature-based SLAM, i.e., ORB-SLAM [36], in which a novel scene depth fusion scheme is designed to fully utilize both the sparse depth samples from monocular SLAM and the dense depth maps predicted by a CNN.


Figure 2: System Overview. Our DRM-SLAM incrementally generates a fully dense reconstruction of a scene given sparse depth map from monocular SLAM and CNN-inferred depth of key-frames.

2.3. CNN-based Depth Estimation

Traditional depth estimation methods mainly focused on using graphical models to recast the task as an energy optimization problem [5, 13, 4, 38, 22, 39]. Due to the powerful representation ability of CNNs, Eigen et al. [9] used a single multi-scale CNN for depth prediction, progressively refining the predicted depth with a sequence of scale sub-networks. Liu et al. [31] presented a deep structured learning scheme for estimating depth from single monocular images; they learn the unary and pairwise potentials of a continuous conditional random field (CRF) in a unified deep CNN framework to jointly exploit the capacity of the CNN and the CRF. Laina et al. [29] resorted to a deeper fully convolutional residual network to predict the depth map. They also designed an up-projection module to address the downsampling problem, i.e., the decrease of feature map resolution, and used the reverse Huber loss to train the whole network. However, the resolution of the predictions is still inferior to that of the input color image and is subject to severe blurring artifacts around depth boundaries. Recently, many dense depth reconstruction methods have appeared that take a sparse set of depth measurements and a single RGB image as input, which can be categorized as depth super-resolution or depth completion with sparse depth measurements. The Sparse-to-Dense (StD) prediction method [34] presented a deep regression network that learns directly from raw RGB-D data and explored the impact of the number of depth samples on prediction accuracy. Chen et al. [3] also designed an end-to-end CNN to estimate the depth map from RGB and sparse sensing (RSS for short), which works for both indoor and outdoor scenes; a parameterization of the sparse depth input is proposed to accommodate it. However, the methods of this category rely on a specific sampling pattern, e.g., a regular grid or Bernoulli sampling, which does not match real conditions such as sampling at ORB feature locations. They paid more attention to verifying the effectiveness of their methods under ideal conditions, but applied only a few

simple experiments under SLAM environments, which are not enough to demonstrate their practicability.

3. The Proposed Method

Fig. 2 illustrates the pipeline of the proposed framework. During camera tracking, color key-frames are chosen and a sparse depth map is generated for each of them. For each color key-frame, we first estimate a depth map via the CNN and apply scale consensus between the CNN-inferred depth map and the sparse depth map. Then, we formulate a depth reconstruction model to obtain a high-quality dense depth map by fully exploiting both the sparse depth samples from monocular SLAM and the dense depth map predicted by the CNN. Finally, the dense depth maps generated on key-frames are transformed into a point cloud representation and assembled into a globally consistent model to achieve the final dense reconstruction of the 3D scene. In the whole processing pipeline, camera tracking and the depth fusion and reconstruction procedures run on the main CPU thread. Depth map training and prediction run on the GPU, and depth prediction on key-frames is managed in parallel with the main thread. Each stage of the framework is presented in detail in the following subsections.
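As a rough illustration of this thread layout (not the authors' code), the following Python sketch runs CNN depth prediction for key-frames on a worker thread while the main loop continues per-frame tracking and fusion; all function names, timings and the key-frame rule are hypothetical stand-ins.

```python
# Minimal sketch of the two-thread layout: tracking/fusion on the main thread,
# key-frame depth prediction on a worker thread. predict_depth is a stub.
import queue, threading, time

def predict_depth(frame_id):            # stub for the CNN forward pass (GPU)
    time.sleep(0.2)                      # pretend inference takes ~0.2 s / key-frame
    return {"keyframe": frame_id}

keyframes, results = queue.Queue(), queue.Queue()

def depth_worker():
    while True:
        kf = keyframes.get()
        if kf is None:                   # sentinel: shut the worker down
            break
        results.put((kf, predict_depth(kf)))

threading.Thread(target=depth_worker, daemon=True).start()

for frame_id in range(100):              # main thread: per-frame tracking loop
    # ... track the camera pose for frame_id (ORB-SLAM style) ...
    if frame_id % 10 == 0:                # suppose every 10th frame is a key-frame
        keyframes.put(frame_id)
    while not results.empty():            # collect any finished CNN depths
        kf, depth = results.get()
        # ... fuse sparse SLAM depth with 'depth' and update the 3D model ...
keyframes.put(None)
```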

3.1. Camera Tracking and Key-frame Selection

We employ the feature-based tracking framework of ORB-SLAM [36] to estimate the camera poses and select key-frames from the input video. For each key-frame k_i, we obtain the corresponding sparse depth map D′_{k_i} and camera pose P_{k_i} = [R_{k_i}, t_{k_i}], where R and t are the 3×3 rotation matrix and the 3-dimensional translation vector, respectively. To optimize the camera pose P during tracking, bundle adjustment is used to minimize the reprojection error between paired 3D points X_j ∈ ℝ³ in world coordinates and 2D feature points x_j ∈ ℝ² in the image plane.


Figure 3: The proposed deep architecture to estimate depth from a monocular RGB image. s-DConv denotes a dilated convolution with dilation factor s; all successive skip structures in the ResNet-101 model are uniformly marked as residual blocks.

Here, j is the pixel index from the matching set 𝒳:

$$\{\mathbf{R},\mathbf{t}\} = \arg\min_{\mathbf{R},\mathbf{t}} \sum_{j\in\mathcal{X}} \rho\!\left(\left\|\mathbf{x}_j - \tau\!\left(\mathbf{R}\mathbf{X}_j + \mathbf{t}\right)\right\|^2_{\Sigma}\right), \tag{1}$$

where ρ is the robust Huber cost function and Σ is the information matrix associated with the scale of the keypoint. The perspective projection function τ is defined as follows:

$$\tau\!\left(\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}\right) = \begin{bmatrix} f_x \dfrac{X}{Z} + c_x \\[4pt] f_y \dfrac{Y}{Z} + c_y \end{bmatrix}, \tag{2}$$

where (f_x, f_y) and (c_x, c_y) are the focal lengths and the principal point along the x- and y-axes, all obtained from camera calibration. The computed camera pose P_{k_i} and sparse depth map D′_{k_i} of key-frame k_i are then used in the following depth fusion and reconstruction.
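To make the geometry concrete, the following NumPy sketch implements the projection function τ of Eq. (2) and a Huber-robustified reprojection cost in the spirit of Eq. (1); the intrinsic values are placeholders, and the per-keypoint information matrix Σ is omitted for brevity.

```python
# Sketch of tau (Eq. 2) and a robust reprojection cost (cf. Eq. 1).
import numpy as np

fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5   # example pinhole intrinsics

def tau(P_cam):
    """Project a 3D point [X, Y, Z] in camera coordinates to pixel coordinates."""
    X, Y, Z = P_cam
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def huber(r, delta=1.0):
    """Huber cost rho(.) applied to a scalar residual."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

def reprojection_cost(R, t, points_3d, points_2d):
    """Sum of robust reprojection errors over matched 3D-2D pairs."""
    cost = 0.0
    for X_w, x_obs in zip(points_3d, points_2d):
        x_proj = tau(R @ X_w + t)                 # world -> camera -> image
        cost += huber(np.linalg.norm(x_obs - x_proj))
    return cost

# Tiny usage example with an identity pose and one correspondence (cost = 0).
R, t = np.eye(3), np.zeros(3)
print(reprojection_cost(R, t, [np.array([0.1, 0.2, 2.0])], [tau([0.1, 0.2, 2.0])]))
```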

3.2. Depth Estimation based on Deep CNN

Fig. 3 shows our proposed CNN architecture for depth estimation. The ResNet-101 model [21] is selected as our backbone to infer the depth map from a single-view color image, with several modifications (marked in yellow in Fig. 3) as follows. 1) All the fully connected layers designed for the classification problem are removed from our regression network. For the loss function, the L2 norm is used to penalize the regression error between the ground truth and the corresponding prediction. 2) Different from [29], which designs a complex up-projection module to prevent the feature map resolution from becoming even smaller, we use dilated convolutions (DConv) [53] to replace the pooling and stride operations (/2) at the last two downsampling layers. The dilation rate is set to 2 for both DConv layers (2-DConv). Note that dilated convolution increases the receptive field of the network exponentially while keeping the number of parameters and the feature map resolution unchanged. It is therefore suitable for our depth estimation task under a SLAM framework with real-time requirements, which cares more about integrating knowledge of the wider context at less cost. 3) The multi-scale scheme [2] is introduced by concatenating it to the last residual block. Four dilated convolutions

with dilation rates set to 6, 12, 18, and 24 (marked in the red dashed box in Fig. 3) are employed to form a spatial pyramid pooling structure that extracts four feature maps with different receptive fields. This ensures that scene objects of arbitrary scales can be identified and their corresponding depth accurately inferred by pooling convolutional features at different scales. The four feature maps are then fused to generate the final depth map (at 1/8 of the input color image resolution). Note that we do not use complex strategies such as deconvolution or up-projection [29] to upsample the inferred depth map, since our CNN architecture aims at inferring depth values from a global perspective rather than removing the blurring of local structures, which is handled by the following depth fusion scheme. Therefore, we upsample the inferred depth map by simply inserting zeros at missing locations to get a high-resolution sparse depth map D̃_{k_i}.
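As a hedged illustration of this multi-scale block (not the authors' implementation), the following tf.keras sketch builds four parallel dilated convolutions with rates 6, 12, 18 and 24 on a 1/8-resolution feature map and fuses them into a single-channel prediction; the input feature shape and the summation-based fusion are assumptions.

```python
# Sketch of a spatial-pyramid multi-scale head with dilation rates 6/12/18/24.
import tensorflow as tf

def multi_scale_depth_head(features):
    """features: backbone output at 1/8 resolution, e.g. shape (H/8, W/8, 2048)."""
    branches = []
    for rate in (6, 12, 18, 24):
        b = tf.keras.layers.Conv2D(1, 3, padding="same",
                                   dilation_rate=rate)(features)
        branches.append(b)
    # Fuse the four single-channel predictions into the final 1/8-scale depth map.
    return tf.keras.layers.Add()(branches)

inputs = tf.keras.Input(shape=(30, 40, 2048))     # example 1/8-scale feature map
depth = multi_scale_depth_head(inputs)
model = tf.keras.Model(inputs, depth)
model.summary()
```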

3.3. Scale Consensus

There always exists a scale ambiguity problem in monocular ORB-SLAM, since the absolute scale cannot be obtained from a monocular video captured by a single camera. In contrast, the CNN learns the correlations between visual cues and absolute depths from a large number of training image pairs, and can thus provide absolute scale information as a reference for tracking and mapping. Therefore, the sparse depth map D′_{k_i} obtained from SLAM and the dense depth map D̃_{k_i} inferred by the CNN for key-frame k_i are not on the same scale. Since D̃_{k_i} is learned by the CNN from absolute depth values, we take it as the reference and transform D′_{k_i} to the scale of D̃_{k_i} by multiplying by a scale factor. As a preprocessing step, we upsample D̃_{k_i} by bicubic interpolation and select the valid pixels at the same positions from both D̃_{k_i} and D′_{k_i} as two candidate pixel sets 𝐷̃ and 𝐷′, respectively. The most direct way is to compute the mean ratio of intensities between the two pixel sets and use it as the scale factor. More advanced methods fit the two pixel sets by least squares to find a global solution. However, both methods are inaccurate because they ignore the existence of outliers. Similar to Luo et al. [33], who also use the RANSAC algorithm to regress a correct scale in the presence of a nontrivial amount of outliers, we employ a robust RANSAC-based least-squares strategy to find the optimal scale factor.



Given a number of randomly chosen pairs of depth samples d′ and d̃ from the pixel sets 𝐷′ and 𝐷̃, respectively, we solve the following optimization to obtain a scale factor by least-squares fitting:

$$\arg\min_{s} \sum_{j=1}^{n} \left\| s \cdot d'_j - \tilde{d}_j \right\|^2, \quad d'_j \in D', \ \tilde{d}_j \in \tilde{D}, \tag{3}$$

Here, s is the scalar scale to be estimated and n is the number of pairs of depth samples used in each iteration. We then repeatedly solve the above function to obtain multiple scale factors based on different subsets of depth samples chosen from 𝐷̃ and 𝐷′. Finally, we choose the scale value s* corresponding to the iteration with the most inliers as our optimal solution. The detailed scale computation is presented in Algorithm 1, where s_i is the output scale of the i-th least-squares fit, Inlier_count(·) counts the number of inliers under threshold δ, and Inliers(s_i) denotes the number of inliers under the current scale s_i.

Algorithm 1: RANSAC-based Scale Computation
Require: 𝐷̃, 𝐷′, inlier threshold δ
1: while i < max_iteration do
2:   Randomly select n pairs of depth samples
3:   Compute s_i using (3)
4:   Inliers(s_i) = Inlier_count(𝐷̃, 𝐷′, δ, s_i)
5:   if Inliers(s_i) > Inliers(s*) then
6:     s* ← s_i
7:   end if
8: end while
9: return s*, Inliers(s*)

Note that the scale factor between the CNN-inferred depth and monocular SLAM changes along the tracking process. Therefore, the scale is re-computed and updated whenever a new color key-frame is inserted, so that the scale adjustment is performed throughout the whole SLAM system to deal with scale variation.
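The following NumPy sketch mirrors Algorithm 1 under the assumption that the closed-form least-squares solution of Eq. (3), s = Σ_j d′_j d̃_j / Σ_j d′_j², is used in each iteration; the values of n, the iteration budget and the inlier threshold δ are illustrative only.

```python
# Sketch of RANSAC-based least-squares scale estimation between SLAM depths
# d' and CNN depths d~ (Algorithm 1 / Eq. 3).
import numpy as np

def ransac_scale(d_slam, d_cnn, n=5, max_iter=200, delta=0.2, seed=0):
    """d_slam, d_cnn: 1D arrays of co-located valid depth samples."""
    rng = np.random.default_rng(seed)
    best_s, best_inliers = 1.0, -1
    for _ in range(max_iter):
        idx = rng.choice(len(d_slam), size=n, replace=False)
        a, b = d_slam[idx], d_cnn[idx]
        s = np.dot(a, b) / np.dot(a, a)              # closed-form fit of s*d' ~ d~
        inliers = np.sum(np.abs(s * d_slam - d_cnn) < delta)
        if inliers > best_inliers:
            best_s, best_inliers = s, inliers
    return best_s, best_inliers

# Example: recover a synthetic scale of 2.5 despite 10% gross outliers.
rng = np.random.default_rng(1)
d_slam = rng.uniform(0.5, 4.0, 500)
d_cnn = 2.5 * d_slam + rng.normal(0, 0.02, 500)
d_cnn[:50] += rng.uniform(1.0, 3.0, 50)              # corrupt 10% of the samples
print(ransac_scale(d_slam, d_cnn))
```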

3.4. Depth Fusion and Reconstruction

In this stage, we first fuse the sparse depth map D′ and the CNN-inferred depth map D̃³ and, meanwhile, compute a confidence map H that indicates the accuracy of each pixel in the fused depth map D̄. Then, we formulate a graphical model to reconstruct the final dense depth map D from the fused initial depth observation D̄.

At a given pixel index p, the fused depth value D̄_p is defined as:

$$\bar{\mathbf{D}}_p = \begin{cases} \mathbf{D}'_p & \mathbf{D}'_p \neq 0 \\ \tilde{\mathbf{D}}_p & \mathbf{D}'_p = 0,\ \tilde{\mathbf{D}}_p \neq 0 \\ 0 & \mathbf{D}'_p = 0,\ \tilde{\mathbf{D}}_p = 0 \end{cases} \tag{4}$$

³ To avoid confusion of symbols, we drop the subscript k_i in D′ and D̃ for simplicity.


Figure 4: Statistical analysis of the error distribution by presenting the correlation between the distance of a point to the depth boundary and the estimation error.

If both values D′_p and D̃_p appear at the location of pixel p, D′_p is retained because of its high accuracy, being computed by pure-geometry ORB-SLAM.

To construct the confidence map H, we set H_p = 0 for pixels p with no valid depth value, i.e., p ∈ {p | D̄_p = 0}. As we observe, pure geometry-based methods usually suffer from low parallax: a few extremely large outliers appear among the depth samples produced by ORB-SLAM when the translation during camera motion is not sufficiently large. Therefore, for pixels p coming from ORB-SLAM, i.e., p ∈ {p | D′_p ≠ 0}, we compute the confidence as:

$$\mathbf{H}_p = \min\!\left(\left(\frac{D_{max}}{\bar{\mathbf{D}}_p}\right)^{2},\ 1\right), \quad p \in \{\mathbf{D}'_p \neq 0\}, \tag{5}$$

where D_max represents the maximum depth value learned by the CNN. Depth samples obtained from ORB-SLAM with values larger than D_max are very likely to be outliers, so their confidences are assigned lower values. The remaining problem is to determine the confidence values of the CNN-inferred depth samples. As we observe, pixels close to depth boundaries are less reliably predicted by the CNN than those far away, and should be assigned a lower reliability. We perform a statistical analysis of the error distribution, i.e., the correlation between the distance of a point to the boundary and the error, to validate this observation. The region of depth boundaries is computed by applying an edge detector to the depth maps recovered by our CNN. We then compute the mean error over pixels with the same distance to depth boundaries for the whole test dataset. The statistical result, shown in Fig. 4, verifies our observation. Therefore, the confidence values are computed as:

$$\mathbf{H}_p = \min\!\left(\left(\frac{d_p}{d_{Th}}\right)^{2},\ 1\right), \quad p \in \{\mathbf{D}'_p = 0,\ \tilde{\mathbf{D}}_p \neq 0\}, \tag{6}$$


where d_p is the distance between the current pixel p and its nearest pixel in the depth-boundary set, and d_Th is a maximum threshold. Finally, our confidence map H becomes:

$$\mathbf{H}_p = \begin{cases} 0 & \bar{\mathbf{D}}_p = 0 \\ \min\!\left(\left(\dfrac{D_{max}}{\bar{\mathbf{D}}_p}\right)^{2},\ 1\right) & \mathbf{D}'_p \neq 0 \\ \min\!\left(\left(\dfrac{d_p}{d_{Th}}\right)^{2},\ 1\right) & \mathbf{D}'_p = 0,\ \tilde{\mathbf{D}}_p \neq 0 \end{cases} \tag{7}$$
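A compact NumPy sketch of Eqs. (4)-(7) is given below; the numeric values of the depth cap D_max and the distance threshold d_Th, as well as the use of a Euclidean distance transform to measure the distance to the nearest boundary pixel, are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of the depth fusion (Eq. 4) and confidence map (Eqs. 5-7).
import numpy as np
from scipy import ndimage

def fuse_depth_and_confidence(D_slam, D_cnn, boundary_mask, D_max=10.0, d_th=20.0):
    """D_slam: sparse (scale-corrected) SLAM depths, 0 = no sample.
    D_cnn: upsampled CNN depth, 0 = no prediction.
    boundary_mask: boolean mask, True on detected depth-boundary pixels."""
    # Eq. (4): prefer the SLAM sample where available, otherwise the CNN depth.
    D_bar = np.where(D_slam > 0, D_slam, D_cnn).astype(float)

    H = np.zeros(D_bar.shape)
    # Eq. (5): SLAM samples, down-weighted when they exceed D_max.
    slam = D_slam > 0
    H[slam] = np.minimum((D_max / D_bar[slam]) ** 2, 1.0)
    # Eq. (6): CNN samples, down-weighted near depth boundaries.
    # distance_transform_edt gives each pixel's distance to the nearest zero
    # entry, so passing ~boundary_mask yields the distance to the boundary set.
    dist = ndimage.distance_transform_edt(~boundary_mask)
    cnn_only = (D_slam == 0) & (D_cnn > 0)
    H[cnn_only] = np.minimum((dist[cnn_only] / d_th) ** 2, 1.0)
    return D_bar, H            # H stays 0 where D_bar == 0, matching Eq. (7)
```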

Then, according to the fused depth map D̄ and the confidence map H, we formulate our dense depth reconstruction as the following model:

$$\min_{\mathbf{D}} \sum_{p} \left( \mathbf{H}_p \left(\mathbf{D}_p - \bar{\mathbf{D}}_p\right)^{2} + \lambda \sum_{q \in \mathcal{N}_4(p)} w_{p,q} \left(\mathbf{D}_p - \mathbf{D}_q\right)^{2} \right), \tag{8}$$

where 4 (𝑝) is 4-connected neighborhood of pixel 𝑝, πœ† is a balance parameter. The observation constraint is enforced by the confidence map H, while the smoothness constraint is adaptively enforced using spatially varying weighting function 𝑀𝑝,π‘ž defined on the color key-frames. Usually, exploiting depth-color correlation is quite informative for depth reconstruction when the accompanied color images are available. So, 𝑀𝑝,π‘ž is defined according to the similarity computed based on the color image C: ) ( ( )2 (9) 𝑀𝑝,π‘ž = exp βˆ’ C𝑝 βˆ’ Cπ‘ž βˆ•πœŽ 2 , where 𝜎 is a variance parameter. By setting the gradient of the optimization function (8) to zero, the solution d is obtained by solving a linear system based on inversing a large sparse matrix: ( ) Μƒ + πœ†W βˆ’1 H Μƒ d, Μ„ d= H

(10)

where d and d̄ are the vector forms of D and D̄, respectively. H̃ is a diagonal matrix whose diagonal elements are given by h, i.e., the vector form of the confidence map H. W denotes the spatially varying Laplacian matrix defined by w_{p,q}:

$$\mathbf{W}(r, c) = \begin{cases} \sum_{l \in \mathcal{N}_4(r)} w_{r,l} & c = r \\ -w_{r,c} & c \in \mathcal{N}_4(r) \\ 0 & \text{otherwise,} \end{cases} \tag{11}$$

where π‘Ÿ and 𝑐 denote one-dimensional scalar pixel indexes corresponding to a pixel p and its neighborhoods, respectively. W is a five-point sparse matrix including diagonal elements. By solving the linear system, the sparse data dΜ„ is propagated to the whole image domain guided by the confidence map H and the weighting matrix W. However, the matrix Μƒ + πœ†W is highly ill-conditioned due to the sparse input H Μƒ is severely rank-deficient), directly reversing it to data (H compute the solution is very unstable. Inspired by [30] that YE et al.: Preprint submitted to Elsevier

transforms data interpolation task into simple filtering subproblems, Eq. (10) can be decomposed into the following formula correspondingly: ) ( (I + πœ†W)βˆ’1 dΜ„ (π‘Ÿ) (12) d (π‘Ÿ) = (S𝑑̄.βˆ•Sβ„Ž )(π‘Ÿ) = ( ) , (I + πœ†W)βˆ’1 h (π‘Ÿ) where I is identity matrix. Note that, the original optimization problem is decoupled into two simple subproblems in the numerator and denominator to solve S𝑑̄ and Sβ„Ž separately. The matrix (I + πœ†W)βˆ’1 can be regarded as a filtering matrix applied on both dΜ„ and h. The final fused depth map d can be computed by element-wise division β€˜./’ between S𝑑̄ and Sβ„Ž on each pixel of the same index π‘Ÿ. Besides, solving Eq. (12) directly is also time-consuming. Several methods [27, 28] have been proposed to solve the linear system, but they are still much slower than local filtering methods [15, 19], which impede the practical use in realtime SLAM. Therefore, instead of directly performing the inversion of the sparse matrix (I + πœ†W), we resort to the fast solving strategy [30, 35] to accelerate the solving process. In essence, the algorithm is to break down the solving proΜ„ and H in our cess of directly smoothing on a 2D image (D framework) into multiple 1D smoothing processes applied on rows and the columns of the image sequentially using the proposed horizontal and vertical 1D solvers. The algorithm can achieve a comparable performance to the local filters. To briefly illustrate the 1D fast solver, we define the linear function for 1D signal along the π‘₯ (horizontal) dimension as follow: ( ) π‘₯ I + πœ†π‘₯ Wπ‘₯ Sπ‘₯𝑑̄ = dΜ„ (13) where dΜ„ is an 1D horizontal signal extracted from the row Μ„ Wπ‘₯ is a three-point Laplacian matrix constructed on of D. 2 (π‘Ÿ) neighborhood that containing two neighbors for π‘Ÿ (i.e., π‘Ÿ βˆ’ 1 and π‘Ÿ + 1). The 1D output solution Sπ‘₯𝑑̄ can be obtained by solving the linear system 4 . In fact, solving (13) becomes much easier than directly solving S𝑑̄ in Eq. (12), since the Laplacian matrix Wπ‘₯ becomes a tridiagonal matrix, whose nonzero elements exist only in the diagonal, the left and right diagonals. Such a matrix has an exact solution obtained using the Gaussian elimination algorithm, and thus the function can be solved in a recursive manner with a 𝑂(𝑁) complexity (here 𝑁 is the width of the image) 5 . To avoid appearing β€˜streaking artifact’, we perform 2D smoothing by applying sequential 1D global smoothing operations for a multiple number of iterations to propagate information across edges. The number of iterations is set at 𝑇 = 3 based on experimental performance. Note that, owing to the accelerated algorithm, our depth reconstruction framework achieves a reasonable running time without decreasing the reconstruction performance. π‘₯

4 Sπ‘₯ can be obtained with the similar function, and thus is not written β„Ž for saving place. 5 We direct readers to refer to Ref. [35] for more details about the algorithm.



Figure 5: Average key-frame reconstruction rmse (left vertical axes, in red) and accuracy with threshold 1.25 (right vertical axes, in blue) with respect to the parameters: (a) λ and (b) σ. The first row shows RGB patches and the corresponding ground-truth depth of selected key-frames. Some representative recovered depth maps at the key points of the rmse curve are presented for clearer comparison and analysis of the parameter σ.

Finally, the dense depth maps generated on key-frames are fused into a globally consistent model based on a point cloud representation to achieve accurate dense reconstruction of the 3D scene.

4. Evaluation

In this section, we first present the parameter analysis and adaptation scheme for depth reconstruction in Sec. 4.1. The influence of scale consensus is evaluated in Sec. 4.2. Then, we evaluate the reconstruction density and the quality of depth estimation in Sec. 4.3 and Sec. 4.4, respectively. An ablation study and the evaluation of running speed are given in Sec. 4.5 and Sec. 4.6, respectively. We also verify the effectiveness of our method under challenging situations, i.e., pure rotational camera motion and low-texture environments, in Sec. 4.7. Evaluation on our captured dataset is shown in Sec. 4.8. Three public benchmark datasets, i.e., the NYU RGB-D V2 dataset ('bathroom 0003 0007', 'kitchen 0046 0037', and 'bedroom 0037 0041') [43], the TUM RGB-D SLAM dataset ('fr1 rpy', 'fr2 dishes', 'fr2 desk', 'fr3 long office household', 'fr3 nostructure texture near withloop' and 'fr3 structure texture far') [45], and the ICL-NUIM dataset ('lr kt0', 'lr kt1', 'lr kt2', 'of kt0', 'of kt1' and 'of kt2') [18], are used in our experiments. The first two datasets are acquired with a Kinect sensor, while the last one is synthetic. When testing on a specific dataset, we directly use its given intrinsic values, e.g., focal length and principal point, to compute the camera pose during tracking. All experiments are implemented in TensorFlow and run on a desktop with an Intel 2.4 GHz CPU, 32 GB RAM and an Nvidia TitanX GPU with 12 GB of memory.

We use the NYU RGB-D V2 dataset as our preliminary training dataset. There are in total 1449 RGB and depth images in the NYU dataset. Following the official split, we use 795 and 654 image pairs for training and testing, respectively. We augment the training data with rotation and flipping operations up to 14K images. We initialize the network with ResNet-101 parameters pretrained on ImageNet and randomly initialize the other modules, then train the model weights with an L2-norm regression loss. We use the SGD optimizer with a momentum of 0.9. The learning rate is initialized to 1e-4 for all layers and decreased by a factor of 0.9 every epoch. The trained model is then fine-tuned on either the TUM or the ICL-NUIM dataset when testing on the respective dataset⁶.

Four commonly used measurements are applied for quantitative comparison:

• Root mean squared error (rmse): $\sqrt{\frac{1}{N}\sum_{p}\left(d_p^{gt} - d_p\right)^2}$

• Average log error (log): $\sqrt{\frac{1}{N}\sum_{p}\left(\log(d_p^{gt}) - \log(d_p)\right)^2}$

• Absolute relative error (abs.rel): $\frac{1}{N}\sum_{p}\frac{|d_p^{gt} - d_p|}{d_p^{gt}}$

• Accuracy with threshold thr: percentage (%) of $d_p$ s.t. $\max\!\left(\frac{d_p^{gt}}{d_p}, \frac{d_p}{d_p^{gt}}\right) = \delta < thr$,

where d_p^gt and d_p denote the ground-truth and estimated depth of pixel p, respectively.

⁶ The TUM and ICL-NUIM test sequences are excluded from the fine-tuning data to ensure fairness.
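For reference, the four measurements above can be computed as in the following NumPy sketch; restricting the evaluation to pixels with valid (non-zero) ground truth is an assumption reflecting the usual convention, not a detail stated in the paper.

```python
# Sketch of the rmse, log, abs.rel and threshold-accuracy measures.
import numpy as np

def depth_metrics(d_gt, d_pred, thr=1.25):
    valid = d_gt > 0
    gt, pr = d_gt[valid], d_pred[valid]
    rmse = np.sqrt(np.mean((gt - pr) ** 2))
    log_err = np.sqrt(np.mean((np.log(gt) - np.log(pr)) ** 2))
    abs_rel = np.mean(np.abs(gt - pr) / gt)
    delta = np.maximum(gt / pr, pr / gt)
    acc = [np.mean(delta < thr ** k) for k in (1, 2, 3)]   # delta < 1.25, 1.25^2, 1.25^3
    return rmse, log_err, abs_rel, acc
```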


Table 2: Comparison in terms of reconstruction density on benchmark datasets. The bold values in the original table represent the best one in the respective category, while the underlined values are the cases in which the corresponding method is better than ours across categories. (The first five method columns are feature-based SLAM; the last four are direct-based SLAM.)

Sequence      | ORB [36] | Remode [39] | StD [34] | RSS [3] | Ours  | LSD [11] | LSD-BS [11] | CNN-SLAM [46] | OADPN [33]
--------------|----------|-------------|----------|---------|-------|----------|-------------|---------------|-----------
ICL/ 'lr kt0' | 0.03     | 4.48        | 22.35    | 24.02   | 24.28 | 0.36     | 1.43        | 12.84         | 22.70
ICL/ 'lr kt1' | 0.02     | 2.43        | 34.27    | 35.44   | 37.10 | 0.06     | 3.03        | 13.04         | 25.72
ICL/ 'lr kt2' | 0.01     | 8.68        | 27.53    | 27.93   | 28.69 | 0.17     | 1.81        | 26.56         | 22.80
ICL/ 'of kt0' | 0.02     | 4.48        | 10.65    | 12.78   | 13.65 | 0.33     | 0.60        | 19.41         | 22.94
ICL/ 'of kt1' | 0.02     | 3.13        | 41.33    | 42.10   | 43.46 | 0.04     | 4.76        | 29.15         | 34.65
ICL/ 'of kt2' | 0.04     | 16.71       | 35.67    | 37.87   | 39.59 | 0.08     | 1.44        | 37.23         | 22.06
TUM/ fr3_long | 0.03     | 9.55        | 17.23    | 18.64   | 19.26 | 0.09     | 3.80        | 12.48         | 20.14
TUM/ fr3_str  | 0.03     | 6.74        | 38.29    | 39.18   | 40.07 | 0.04     | 6.45        | 27.40         | 35.77
Average       | 0.03     | 7.03        | 28.42    | 29.75   | 30.76 | 0.15     | 2.92        | 22.26         | 25.85

4.1. Parameters Adaptation

To test the influence of the parameters on stability and recovery quality, we evaluate the depth maps reconstructed by our method with different parameter settings on the chosen datasets. We compute the average key-frame reconstruction rmse and accuracy for each video sequence with respect to specific values of λ and σ. The results are presented in Fig. 5. We analyze the sensitivity of each parameter and its adaptation as follows.

1) λ: This parameter controls the balance between the data term and the regularization term. A suitable value of λ ensures the smoothing property, i.e., rejecting more outliers in the output depth map. However, excessively large values cause an obvious decrease in the accuracy of the recovered results. Fig. 5(a) shows that λ ∈ [5², 35²] yields different measurements in terms of both rmse and accuracy. Note that there is a knee point at 25²: for all the displayed examples, the lowest rmse and the highest accuracy are achieved at that knee point. We therefore set λ = 25² in our implementation.

2) σ: The smoothness constraint of the energy function (8) is adaptively enforced using the spatially varying weighting function (9) defined on the color image, and σ controls the influence of the weight on the sharpness of depth boundaries. As shown in Fig. 5(b), the rmse curve decreases significantly before σ reaches 0.1. When stepping into the interval [0.1, 0.2], the curve starts to fluctuate. According to the depth patches displayed for different values of σ, the best visual quality appears at 0.1; it preserves the depth boundaries well, which contributes greatly to the subsequent scene reconstruction. Considering that further improvement in accuracy comes at the cost of degrading the 3D scene structure when σ is greater than 0.1, we make a trade-off between accuracy and structural completeness and choose 0.1 in our implementation.

Figure 6: Percentage of inlier pixels over the whole pixel set under different thresholds (x-axis: threshold in meters, from 0.18 to 0.46, plus the average; y-axis: percentage of inliers). Three methods are compared in the bar chart: mean ratio, least-square, and our RANSAC-based least-square.

4.2. Evaluation on Scale Consensus

To evaluate the effectiveness of the scale computation, we compare our RANSAC-based least-square method with the mean-ratio and least-square methods introduced in Sec. 3.3. We unify the scale between 𝐷′ and 𝐷̃ using each of the three methods and compute the deviation for each depth pair at the same location in both sets. We then count the number of inliers whose deviation errors fall within a threshold range of [0.18, 0.46] and compute the percentage over the whole pixel set. Fig. 6 shows the statistical results on the NYU dataset. Our method outperforms the other two methods under all thresholds. Specifically, with a very small tolerance (threshold set to 0.18 m), our method has far more inliers (approaching 55%) than the other two methods. When the threshold is set to 0.46 m, our percentage of inliers reaches 85%. Note that setting the threshold to 0.46 m is reasonable, since the depth estimation error (rmse) of the best method on the NYU dataset is around 0.50 m (see Table 3 for details).

4.3. Evaluation on Reconstruction Density

In this section, we assess reconstruction density by evaluating the percentage of correct depth values, i.e., those whose difference from the corresponding ground-truth depth is less than 10%.


Table 3: Quantitative results of depth estimation on three benchmark datasets. Columns: Error (lower is better): rmse (m), log, abs.rel; Accuracy (higher is better): δ < 1.25, 1.25², 1.25³.

NYU RGB-D V2 dataset ('bathroom 0003', 'bathroom 0007', 'kitchen 0046', 'kitchen 0037', 'bedroom 0037', and 'bedroom 0041')
Method            | rmse (m) | log  | abs.rel | δ<1.25 | δ<1.25² | δ<1.25³
Eigen et al. [9]  | 0.64     | 0.23 | 0.16    | 0.74   | 0.94    | 0.98
Liu et al. [31]   | 0.73     | 0.33 | 0.33    | 0.59   | 0.81    | 0.91
Laina et al. [29] | 0.51     | 0.22 | 0.18    | 0.84   | 0.94    | 0.97
StD-RGB [34]      | 0.51     | 0.21 | 0.14    | 0.81   | 0.96    | 0.98
RSS-RGB [3]       | 0.73     | 0.19 | 0.15    | 0.67   | 0.90    | 0.97
Our_C             | 0.50     | 0.19 | 0.15    | 0.82   | 0.95    | 0.98
PE_S [47]         | 0.52     | 0.21 | 0.12    | 0.83   | 0.95    | 0.98
PE_N [47]         | 0.45     | 0.17 | 0.09    | 0.89   | 0.96    | 0.99
StD [34]          | 0.48     | 0.17 | 0.13    | 0.82   | 0.95    | 0.98
RSS [3]           | 0.45     | 0.18 | 0.14    | 0.87   | 0.93    | 0.99
Our_F             | 0.42     | 0.16 | 0.08    | 0.91   | 0.97    | 0.99

TUM RGB-D SLAM dataset ('fr1_rpy', 'fr2_dishes', 'fr2_desk', 'fr3_long_office_household', 'fr3_nostructure_texture_near_withloop' and 'fr3_structure_texture_far')
Method            | rmse (m) | log  | abs.rel | δ<1.25 | δ<1.25² | δ<1.25³
Eigen et al. [9]  | 1.41     | 0.37 | 0.23    | 0.54   | 0.82    | 0.92
Liu et al. [31]   | 0.86     | 0.29 | 0.25    | 0.54   | 0.87    | 0.90
Laina et al. [29] | 1.07     | 0.39 | 0.25    | 0.49   | 0.75    | 0.88
Our_C             | 0.70     | 0.28 | 0.20    | 0.63   | 0.88    | 0.93
PE_S [47]         | 0.69     | 0.25 | 0.13    | 0.79   | 0.89    | 0.96
PE_N [47]         | 0.65     | 0.24 | 0.12    | 0.83   | 0.90    | 0.96
StD [34]          | 0.70     | 0.27 | 0.13    | 0.78   | 0.89    | 0.96
RSS [3]           | 0.65     | 0.24 | 0.12    | 0.81   | 0.92    | 0.97
Our_F             | 0.62     | 0.23 | 0.10    | 0.83   | 0.95    | 0.97

ICL-NUIM dataset ('lr kt0', 'lr kt1', 'lr kt2', 'of kt0', 'of kt1' and 'of kt2')
Method            | rmse (m) | log  | abs.rel | δ<1.25 | δ<1.25² | δ<1.25³
Eigen et al. [9]  | 0.83     | 0.43 | 0.30    | 0.47   | 0.78    | 0.90
Liu et al. [31]   | 0.81     | 0.41 | 0.45    | 0.47   | 0.71    | 0.87
Laina et al. [29] | 0.54     | 0.28 | 0.23    | 0.59   | 0.83    | 0.95
Our_C             | 0.36     | 0.18 | 0.16    | 0.74   | 0.96    | 0.98
PE_S [47]         | 0.32     | 0.18 | 0.12    | 0.83   | 0.97    | 0.99
PE_N [47]         | 0.22     | 0.12 | 0.07    | 0.93   | 0.99    | 0.99
StD [34]          | 0.36     | 0.18 | 0.15    | 0.84   | 0.95    | 0.98
RSS [3]           | 0.33     | 0.19 | 0.15    | 0.85   | 0.95    | 0.97
Our_F             | 0.30     | 0.13 | 0.14    | 0.89   | 0.99    | 0.99

The compared methods can be classified into two categories: feature-based methods, including ORB-SLAM [36], Remode [39], StD [34] and RSS [3], and direct methods, including LSD-SLAM [11] and its improved version bootstrapped with its initial scale from the ground-truth depth map (LSD-BS), CNN-SLAM [46] and OADPN [33]. Among them, ORB-SLAM, LSD, and LSD-BS represent the original monocular SLAMs without dense mapping solutions. For a fair comparison, we re-train the StD [34] and RSS [3] models using the sparse depth points extracted by ORB features as input, based on the source code provided by the authors; their training and fine-tuning modes are kept the same as ours. Table 2 reports the comparison results of all the aforementioned methods. Apparently, the depth maps reconstructed by our method are much denser than those reported by all the other feature-based SLAMs. Surprisingly, our approach also achieves far better performance than CNN-SLAM, and comparable results to OADPN. Note that the reconstruction density of LSD-SLAM is on average far higher than that of ORB-SLAM, because direct SLAM can directly optimize on raw pixels with sufficient image gradients to reconstruct a semi-dense map. Theoretically, CNN-SLAM and OADPN, which use direct techniques, should find it easier to obtain dense and accurate reconstructions than our method, which uses feature-based SLAM.


But in real situations, benefiting from the performance of every module in our framework, i.e., depth prediction, scale consensus, and depth fusion, our method achieves superior results in most cases.

4.4. Evaluation on Depth Estimation Accuracy

In this section, we compare our depth fusion method (Our_F) to the photometric-error method (PE) with a smoothness constraint (PE_S) [47] and with a surface normal constraint (PE_N) [47], as well as to StD [34] and RSS [3]. The former two are dense solutions based on direct SLAM but with different priors, while the latter two can be classified as feature-based depth fusion methods that take a sparse depth map and a color image as input and predict a dense depth map with a CNN. Besides, we also compare pure CNN-based depth estimation methods, i.e., Eigen et al. [9], Liu et al. [31] and Laina et al. [29], with our proposed CNN architecture (Our_C). StD-RGB and RSS-RGB are the versions of StD and RSS that perform pure depth estimation without sparse depth maps as input. All the other CNN architectures are retrained on the specific dataset where necessary. Quantitative results on the benchmark sequences are given in Table 3 and analyzed in detail as follows. Firstly, our CNN-inferred depth maps have almost the lowest error and highest accuracy compared to those estimated by other CNN-based methods on all three benchmark datasets.


Figure 7: Visual comparison on three benchmark datasets, from top-to-bottom are NYU-D V2 β€˜bedroom_0041’ and β€˜bathroom_0003’ sequences, TUM β€˜fr2_desk’ sequence, ICL-NUIM β€˜lr kt0’ sequence. (a) Input image; (b) Ground truth depth; (c) Depth estimation by Liu et al. [31]; (d) Depth estimation by Laina et al. [29]; (e) Our CNN-inferred depth map; (f) Our fused depth map. Regions in red rectangle are enlarged for better visualization.

Fig. 7 also demonstrates this visually. In Fig. 7, the depth maps estimated by Liu et al. contain large incorrectly estimated depth areas. The results of Laina et al. achieve relatively higher accuracy than those of Liu et al., but tend to be blurred at depth boundaries and suffer from the loss of scene structure.

Secondly, our depth fusion method (Our_F) achieves superior performance on the NYU and TUM datasets, but is slightly inferior to the two PE methods on the ICL-NUIM dataset. Note that PE formulates the data term of the depth estimation problem based on the photometric difference; the assumption of photometric consistency is not always satisfied in real environments, leading to relatively larger estimation errors on the first two real datasets.



Figure 8: Visual comparison on an example from NYU dataset. (a) Color image; Depth maps estimated by (b) StD [34], (c) RSS [3], and (d) Our_F; (e) GT depth map.

Figure 9: Qualitative depth estimation results on three benchmark datasets, from top-to-bottom are NYU-D V2 β€˜bedroom_0041’ and ’bathroom_0003’ sequences, TUM ’fr2_desk’ sequence, ICL-NUIM ’lr kt0’ sequence, ’lr kt3’ sequence and ’of kt2’ sequence. An alternative view of our fusion result is shown for better visualization.

For the synthetic ICL-NUIM dataset, the illumination is fixed across the whole scene, which makes it easier for the PE methods to find intensity consistency. On the contrary, our method is not constrained by the photometric-consistency assumption and achieves satisfying quantitative performance using an extremely sparse set of depth map-points, i.e., far fewer available depth samples, while obtaining results comparable to PE. In Fig. 8, our results preserve the depth boundaries well and, at the same time, achieve the highest accuracy compared to StD [34] and RSS [3]. Lastly, although the ORB features are sparse, the positions and values of these features contain important depth cues that are complementary to the CNN-inferred depth. The ORB features are usually extracted from high-texture regions

like object boundaries, where the CNN-inferred depth map tends to be blurred. Conversely, the CNN provides a dense and globally accurate estimation in the low-texture image regions where ORB-SLAM cannot. The experimental results in Table 3 and Fig. 7(e)(f) also validate this. Thanks to the fusion of these two heterogeneous depth sources, our fused results perform far better than results that only use the CNN to estimate scene depth. Fig. 9 further shows 3D reconstruction results obtained with our CNN alone and with our depth fusion, respectively. The 3D reconstructions based on the fused depth present precise and undistorted scenes from both the original view and the alternative view, and are more similar to the ground truth than those from our CNN alone. The visual results demonstrate the effectiveness of our depth fusion and reconstruction framework.

Table 4: Quantitative results of the ablation study on three benchmark datasets. Columns: Error (lower is better): rmse (m), log, abs.rel; Accuracy (higher is better): δ < 1.25, 1.25², 1.25³.

NYU RGB-D
Method                     | rmse (m) | log  | abs.rel | δ<1.25 | δ<1.25² | δ<1.25³
Liu et al. [31]            | 0.73     | 0.33 | 0.33    | 0.59   | 0.81    | 0.91
Liu et al. [31] + Fusion   | 0.65     | 0.30 | 0.29    | 0.62   | 0.83    | 0.94
Laina et al. [29]          | 0.51     | 0.22 | 0.18    | 0.84   | 0.94    | 0.97
Laina et al. [29] + Fusion | 0.44     | 0.19 | 0.16    | 0.85   | 0.95    | 0.98
Our_C                      | 0.50     | 0.19 | 0.16    | 0.81   | 0.95    | 0.98
Our_F w/o Confidence       | 0.48     | 0.20 | 0.16    | 0.83   | 0.95    | 0.98
Our_F                      | 0.44     | 0.16 | 0.09    | 0.90   | 0.97    | 0.99

TUM RGB-D
Method                     | rmse (m) | log  | abs.rel | δ<1.25 | δ<1.25² | δ<1.25³
Liu et al. [31]            | 0.86     | 0.29 | 0.25    | 0.54   | 0.87    | 0.90
Liu et al. [31] + Fusion   | 0.81     | 0.28 | 0.24    | 0.56   | 0.89    | 0.95
Laina et al. [29]          | 1.07     | 0.39 | 0.25    | 0.49   | 0.75    | 0.88
Laina et al. [29] + Fusion | 0.91     | 0.32 | 0.22    | 0.57   | 0.82    | 0.92
Our_C                      | 0.70     | 0.28 | 0.20    | 0.63   | 0.88    | 0.93
Our_F w/o Confidence       | 0.67     | 0.26 | 0.18    | 0.67   | 0.90    | 0.94
Our_F                      | 0.62     | 0.23 | 0.10    | 0.83   | 0.95    | 0.97

ICL-NUIM
Method                     | rmse (m) | log  | abs.rel | δ<1.25 | δ<1.25² | δ<1.25³
Liu et al. [31]            | 0.81     | 0.41 | 0.45    | 0.47   | 0.71    | 0.87
Liu et al. [31] + Fusion   | 0.64     | 0.32 | 0.34    | 0.55   | 0.82    | 0.92
Laina et al. [29]          | 0.54     | 0.28 | 0.23    | 0.59   | 0.83    | 0.95
Laina et al. [29] + Fusion | 0.41     | 0.23 | 0.19    | 0.65   | 0.89    | 0.98
Our_C                      | 0.36     | 0.18 | 0.16    | 0.74   | 0.96    | 0.98
Our_F w/o Confidence       | 0.35     | 0.17 | 0.16    | 0.76   | 0.97    | 0.98
Our_F                      | 0.30     | 0.13 | 0.14    | 0.89   | 0.99    | 0.99


4.5. Ablation Study

To further demonstrate the superiority of our depth reconstruction framework, we combine other depth estimation methods, i.e., Liu et al. [31] and Laina et al. [29], with our fusion module, denoted as 'Liu et al. + Fusion' and 'Laina et al. + Fusion', respectively. The results are presented in Table 4. The performance improves noticeably on all three benchmark datasets compared with the corresponding depth estimation methods alone, which further verifies the effectiveness and adaptability of the proposed depth fusion module. Besides, we also evaluate the effectiveness of our confidence map H. We simply replace the confidence map H with a binary mask that only indicates the valid depth pixels without confidence, denoted as 'Our_F w/o Confidence'. The results are also reported in Table 4. Compared with the CNN-estimated results (Our_C), the improvement is extremely limited without the confidence map, since all observed depth pixels are treated equally without considering their relative significance and accuracy. Actually, the sparse ORB depth samples have higher accuracy than those from the CNN prediction; on the other hand, the CNN-inferred depth is comparatively dense but exhibits blurry depth boundaries due to the repeated combination of max-pooling and downsampling in the CNN layers. Therefore, by exploiting the confidence map to distinguish different depth pixels and assign weights reasonably, the performance (Our_F) improves obviously. Moreover, we visualize our confidence maps for the CNN-inferred depth maps and compare them with the method of Yang et al. [52], who proposed a Bayesian DeNet to concurrently output a depth map and its corresponding uncertainty map for each video frame, which can be classified as a learning-based way to infer pixel confidence.


Figure 10: Visualization of the confidence maps of Yang et al. [52] and ours. (a) GT depth maps, (b) inferred depth maps, (c) our confidence maps, (d) results generated by Yang et al. [52].

Yang et al. [52] proposed a Bayesian DeNet that concurrently outputs a depth map and its corresponding uncertainty map for each video frame, which can be regarded as a learning-based way to infer pixel confidence. As shown in Fig. 10, the depth maps inferred by our CNN are subject to large blurring artifacts along depth boundaries, and thus pixels close to depth boundaries are assigned small confidence values. Similar behavior can be observed in the results of Yang et al. [52].

4.6. Runtime

The proposed DRM-SLAM has two main components that consume processing time, i.e., camera tracking and depth reconstruction. Camera tracking runs at a frame rate of around 25-30 fps. For depth reconstruction, the scale factor is computed once tracking has been initialized successfully and runs steadily, and is updated along with local bundle adjustment (BA) to deal with scale drift. Since the update frequency is low, it adds negligible processing time. Therefore, the speed of depth estimation and depth fusion becomes the bottleneck of time efficiency in our framework. Depth estimation from the CNN is only applied to selected key-frames, at a rate of up to 5 key-frames per second. For depth fusion, three algorithms for solving the depth interpolation problem are compared in terms of running time and accuracy in Table 5: two local filtering methods (guided filter (GF) [20] and domain transform (DT) [15]), and the original weighted least squares (WLS) [12], which uses the same modeling as ours but solves the linear system of Eq. (10) directly. The proposed method has a runtime comparable to the local filtering-based algorithms, but its global optimization formulation overcomes the short-sighted local judgement of these filters and achieves higher accuracy. It also achieves results of similar quality to the state-of-the-art WLS method while running about 26 times faster. Overall, the proposed DRM-SLAM runs nearly in real time by combining feature-based camera tracking (ORB-SLAM) with our depth reconstruction framework.
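The scale factor mentioned above aligns the scale-ambiguous ORB-SLAM depths with the metric CNN-inferred depths. The snippet below sketches one common robust choice, a median-ratio alignment over the sparse pixels where both sources are available; it may differ from the estimator actually used in the paper, and all names are ours.

```python
import numpy as np

def estimate_scale(orb_depth, cnn_depth, orb_mask):
    """Robust scale aligning ORB-SLAM depths to CNN (metric) depths,
    computed over the sparse pixels where both are defined."""
    valid = orb_mask & (orb_depth > 0) & (cnn_depth > 0)
    ratios = cnn_depth[valid] / orb_depth[valid]
    return np.median(ratios)              # robust to outlier ORB points

# The scale would typically be re-estimated whenever local BA updates the map:
# scaled_orb = estimate_scale(orb_depth, cnn_depth, orb_mask) * orb_depth
```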



4.7. Evaluation under Challenging Situations


Figure 11: Comparison under pure rotational camera motion on the sequence 'fr1_rpy': reconstruction results from the ground truth, our approach, and LSD-SLAM.

(Figure 12 panels: CNN-SLAM, Ours, Ground Truth, Remode.)

Figure 12: Comparison under a low-texture situation on the sequence 'fr2_dishes': reconstruction results from the ground truth, our approach, and Remode.

Table 5
Comparison of different solutions to depth reconstruction in terms of time and accuracy.

  Properties              GF [20]   DT [15]   WLS [12]   Ours
  Runtime (s)             0.16      0.09      3.9        0.15
  RMSE (m)                0.45      0.47      0.36       0.39
  Accuracy (δ < 1.25)     75.2%     77.3%     88.4%      85.6%
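To make the comparison in Table 5 concrete, the sketch below shows the generic structure of a WLS-type depth interpolation such as [12]: a data term at observed pixels weighted by their confidence, plus color-guided smoothness between neighboring pixels, solved as one sparse linear system. This is a simplified stand-in written by us, not the exact system of Eq. (10); it uses a direct sparse solve, whereas the methods in Table 5 differ mainly in how such a system is approximated or solved.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def wls_interpolate(sparse_depth, conf, guide, lam=10.0, sigma=0.1):
    """Fill a dense depth map from sparse samples by minimizing
    sum_i conf_i (d_i - obs_i)^2 + lam * sum_{i~j} w_ij (d_i - d_j)^2,
    where w_ij decays with the color-guide difference (edge-aware)."""
    h, w = guide.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)

    rows, cols, vals = [], [], []
    def add_edge(a, b, wgt):
        # graph-Laplacian contribution of one smoothness edge
        rows.extend([a, b, a, b]); cols.extend([a, b, b, a])
        vals.extend([wgt, wgt, -wgt, -wgt])

    for i in range(h):
        for j in range(w):
            if j + 1 < w:   # horizontal neighbour
                wij = np.exp(-abs(guide[i, j] - guide[i, j + 1]) / sigma)
                add_edge(idx[i, j], idx[i, j + 1], lam * wij)
            if i + 1 < h:   # vertical neighbour
                wij = np.exp(-abs(guide[i, j] - guide[i + 1, j]) / sigma)
                add_edge(idx[i, j], idx[i + 1, j], lam * wij)

    A = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
    A = A + sp.diags(conf.ravel() + 1e-8)   # data term (epsilon keeps A non-singular)
    b = (conf * sparse_depth).ravel()
    return spla.spsolve(A, b).reshape(h, w)
```

Local filters such as GF [20] and DT [15] can be viewed as fast approximate alternatives to solving such a system globally, which is consistent with the runtime/accuracy trade-off reported in Table 5.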


As mentioned, one of the advantages of our DRM-SLAM over traditional monocular SLAM is that, under pure rotational motion and in low-texture situations, a reconstruction can still be obtained by relying on the CNN-predicted depth map. To illustrate this benefit, we evaluate our method on the sequences 'fr1_rpy' and 'fr2_dishes' from the TUM dataset. The sequence 'fr1_rpy' is captured under pure rotational camera motion, while 'fr2_dishes' includes large low-texture areas such as walls, the floor, and a desktop. The reconstruction results on 'fr1_rpy' obtained by our approach and LSD-SLAM are shown in Fig. 11. Our method can reconstruct the rough scene structure even when the camera motion is purely rotational, while LSD-SLAM fails to produce a 3D reconstruction. Traditional monocular SLAM approaches estimate depth largely by building stereo correspondences or epipolar geometry between two views. However, the stereo baseline and the geometric relationship are destroyed by pure rotational motion, leading to a chaotic reconstruction. In contrast, the depth estimated by the CNN is not affected by this, and therefore our approach achieves relatively better performance. The reconstruction results on 'fr2_dishes' obtained by our approach and Remode are shown in Fig. 12. Note that Remode is based on direct SLAM and uses stereo matching techniques to compute a semi-dense depth map. The accuracy of stereo matching suffers from unreliable matching in textureless areas. Therefore, under the low-texture situation, the result of Remode deforms badly, while our approach generates a result consistent with the ground truth.
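The geometric argument can be made explicit with the standard two-view projection equations (a brief derivation of our own, in generic pinhole notation):

```latex
% Projections of a 3D point X in two views (intrinsics K, relative pose [R | t]):
\[
\mathbf{x}_1 \simeq K\,[\,I \mid \mathbf{0}\,]\,\mathbf{X}, \qquad
\mathbf{x}_2 \simeq K\,[\,R \mid \mathbf{t}\,]\,\mathbf{X}.
\]
% Writing the 3D point as X = Z K^{-1} x_1 for a pixel x_1 with depth Z,
% the second projection becomes
\[
\mathbf{x}_2 \;\simeq\; Z\,K R K^{-1}\,\mathbf{x}_1 + K\,\mathbf{t}.
\]
% For pure rotation (t = 0) the unknown depth Z cancels in homogeneous
% coordinates, leaving the depth-independent infinite homography:
\[
\mathbf{x}_2 \;\simeq\; K R K^{-1}\,\mathbf{x}_1.
\]
```

Since this mapping no longer depends on Z, two views related by pure rotation provide no parallax from which depth could be triangulated, whereas the CNN prediction does not rely on parallax at all.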

4.8. Evaluation on Our Captured Dataset

To further verify the effectiveness of our method, we capture a real video sequence in our laboratory (called 'LAB' for short) using a Point Grey Flea3 color camera. The video is about 40 seconds long, with a resolution of 480×640 and a frame rate of 25 Hz. The CNN-inferred depth maps for this video are obtained directly using the model trained on the NYU dataset. Then, scale adjustment and depth fusion are performed on each color key-frame extracted from the video to obtain the final results. The recovered depth maps and the corresponding dense reconstruction results are shown in Fig. 13. The fused depth maps show better performance than those estimated directly by the CNN and yield accurate, geometry-preserving dense scene reconstructions, which demonstrates the practicability of our method on real video sequences.

Figure 13: Evaluation on our captured 'LAB' dataset. (a) Color key-frames, (b) sparse depth points extracted from ORB-SLAM, (c) depth maps estimated by our CNN, (d) our fused results, (e) the reconstruction result from LSD-SLAM, (f) and (g) our dense reconstruction from different views.

5. Conclusion and Future Work

This paper proposes a dense reconstruction method under the monocular SLAM framework (DRM-SLAM), in which a novel scene depth fusion scheme is designed to fully utilize both the sparse depth samples from monocular SLAM and the predicted dense depth maps from CNN. In the scheme, a CNN architecture is carefully designed for robust depth estimation. Besides, our approach also accounts for the problem of scale ambiguity existing in monocular SLAM. Extensive experiments and an ablation study demonstrate the accuracy and robustness of the proposed DRM-SLAM. Our DRM-SLAM still has some limitations, which can be addressed in future work. Firstly, the reconstruction density and depth estimation accuracy still have room for improvement. The current depth reconstruction framework is split into two separate parts, i.e., the depth estimation and depth fusion modules. We could design a novel CNN architecture without hand-crafted modules, trained end-to-end to output a high-quality depth map by taking both the color image and the ORB depth samples as input. Besides, as in [50], we could utilize multiple intermediate multi-modal outputs, e.g., contour prediction and semantic parsing, from multi-task predictions as guidance to facilitate the final depth estimation task. Secondly, since the absolute scale in the current framework is estimated from the CNN-inferred depth and thus largely depends on the accuracy of the CNN depth estimation, we expect that the scale could be estimated more precisely with the help of an inertial measurement unit (IMU). The IMU sensor provides absolute measurements of the camera state; by fusing pre-integrated IMU measurements and feature observations, the odometry can achieve higher accuracy without scale ambiguity. Finally, the estimated depth could also be used in turn to better estimate the camera pose, and therefore make the tracking process more robust even under pure rotational motion and in low-texture situations.





References

[1] Bay, H., Ess, A., Tuytelaars, T., Van Gool, L., 2008. Speeded-up robust features (SURF). Computer Vision and Image Understanding 110, 346–359.
[2] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, 1–1.
[3] Chen, Z., Badrinarayanan, V., Drozdov, G., Rabinovich, A., 2018. Estimating depth from RGB and sparse sensing, in: European Conference on Computer Vision.
[4] Concha, A., Civera, J., 2015. DPPTAM: Dense piecewise planar tracking and mapping from a monocular sequence, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5686–5693.
[5] Concha, A., Hussain, M.W., Montano, L., Civera, J., 2014. Manhattan and piecewise-planar constraints for dense monocular mapping, in: Robotics: Science and Systems.
[6] Scaramuzza, D., Fraundorfer, F., 2011. Visual odometry: Part 1: The first 30 years and fundamentals. IEEE Robotics & Automation Magazine.
[7] Davison, A.J., 2008. Real-time simultaneous localisation and mapping with a single camera, in: IEEE International Conference on Computer Vision, p. 1403.
[8] Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O., 2007. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1052–1067.
[9] Eigen, D., Fergus, R., 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, in: IEEE International Conference on Computer Vision, pp. 2650–2658.
[10] Engel, J., Cremers, D., 2013. Semi-dense visual odometry for a monocular camera, in: IEEE International Conference on Computer Vision, pp. 1449–1456.
[11] Engel, J., Schöps, T., Cremers, D., 2014. LSD-SLAM: Large-scale direct monocular SLAM, in: European Conference on Computer Vision, Springer, pp. 834–849.
[12] Farbman, Z., Fattal, R., Lischinski, D., 2008. Edge-preserving decompositions for multi-scale tone and detail manipulation, in: ACM SIGGRAPH, p. 67.
[13] Flint, A., Murray, D., Reid, I., 2011. Manhattan scene understanding using monocular, stereo, and 3D features, in: IEEE International Conference on Computer Vision, pp. 2228–2235.
[14] Fraundorfer, F., Scaramuzza, D., 2012. Visual odometry: Part 2: Matching, robustness, optimization, and applications. IEEE Robotics & Automation Magazine 19, 78–90.
[15] Gastal, E.S.L., Oliveira, M.M., 2011. Domain transform for edge-aware image and video processing. ACM Transactions on Graphics 30, 1–12.
[16] Graber, G., Pock, T., Bischof, H., 2011. Online 3D reconstruction using convex optimization, in: IEEE International Conference on Computer Vision Workshops, pp. 708–711.
[17] Guan, T., Wang, C., 2009. Registration based on scene recognition and natural features tracking techniques for wide-area augmented reality systems. IEEE Transactions on Multimedia 11, 1393–1406.
[18] Handa, A., Whelan, T., McDonald, J., Davison, A.J., 2014. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM, in: IEEE International Conference on Robotics and Automation, pp. 1524–1531.
[19] He, K., Sun, J., Tang, X., 2010. Guided image filtering, in: European Conference on Computer Vision, pp. 1–14.
[20] He, K., Sun, J., Tang, X., 2013. Guided image filtering. IEEE Transactions on Pattern Analysis & Machine Intelligence 35, 1397–1409.
[21] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[22] Herrera, D., Kannala, J., Heikkilä, J., et al., 2013. Depth map inpainting under a second-order smoothness prior, in: Scandinavian Conference on Image Analysis, Springer, pp. 555–566.
[23] Keller, M., Lefloch, D., Lambers, M., Izadi, S., Weyrich, T., Kolb, A., 2013. Real-time 3D reconstruction in dynamic scenes using point-based fusion, in: International Conference on 3D Vision (3DV), IEEE, pp. 1–8.
[24] Khan, I., 2017. Robust sparse and dense non-rigid structure from motion. IEEE Transactions on Multimedia, 1–1.
[25] Klein, G., Murray, D., 2007. Parallel tracking and mapping for small AR workspaces, in: IEEE International Symposium on Mixed and Augmented Reality, pp. 225–234.
[26] Klein, G., Murray, D., 2008. Improving the agility of keyframe-based SLAM, in: European Conference on Computer Vision, pp. 802–815.
[27] Koutis, I., Miller, G.L., Tolliver, D., 2009. Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing, in: International Symposium on Advances in Visual Computing, pp. 1067–1078.
[28] Krishnan, D., Fattal, R., Szeliski, R., 2013. Efficient preconditioning of Laplacian matrices for computer graphics. ACM Transactions on Graphics 32, 1–15.
[29] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N., 2016. Deeper depth prediction with fully convolutional residual networks, in: International Conference on 3D Vision, IEEE, pp. 239–248.
[30] Lang, M., Wang, O., Aydin, T., Smolic, A., Gross, M., 2012. Practical temporal consistency for image-based graphics applications. ACM Transactions on Graphics 31, 1–8.
[31] Liu, F., Shen, C., Lin, G., Reid, I., 2016. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2024–2039.
[32] Lowe, D.G., 1999. Object recognition from local scale-invariant features, in: IEEE International Conference on Computer Vision, pp. 1150–1157.
[33] Luo, H., Gao, Y., Wu, Y., Liao, C., Yang, X., Cheng, K., 2019. Real-time dense monocular SLAM with online adapted depth prediction network. IEEE Transactions on Multimedia 21, 470–483.
[34] Ma, F., Karaman, S., 2017. Sparse-to-dense: Depth prediction from sparse depth samples and a single image, in: IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
[35] Min, D., Choi, S., Lu, J., Ham, B., Sohn, K., Do, M.N., 2014. Fast global image smoothing based on weighted least squares. IEEE Transactions on Image Processing 23, 5638.
[36] Mur-Artal, R., Montiel, J.M.M., Tardos, J.D., 2015. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31, 1147–1163.
[37] Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A., 2011. KinectFusion: Real-time dense surface mapping and tracking, in: IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136.
[38] Newcombe, R.A., Lovegrove, S.J., Davison, A.J., 2011. DTAM: Dense tracking and mapping in real-time, in: IEEE International Conference on Computer Vision, pp. 2320–2327.
[39] Pizzoli, M., Forster, C., Scaramuzza, D., 2014. REMODE: Probabilistic, monocular dense reconstruction in real time, in: IEEE International Conference on Robotics and Automation, pp. 2609–2616.
[40] Pradeep, V., Rhemann, C., Izadi, S., Zach, C., 2013. MonoFusion: Real-time 3D reconstruction of small scenes with a single web camera, in: IEEE International Symposium on Mixed and Augmented Reality, pp. 83–88.
[41] Rublee, E., Rabaud, V., Konolige, K., Bradski, G., 2011. ORB: An efficient alternative to SIFT or SURF, in: IEEE International Conference on Computer Vision, pp. 2564–2571.
[42] Shum, H.Y., Ng, K.T., Chan, S.C., 2005. A virtual reality system using the concentric mosaic: Construction, rendering, and data compression. IEEE Transactions on Multimedia 7, 85–95.
[43] Silberman, N., Hoiem, D., Kohli, P., Fergus, R., 2012. Indoor segmentation and support inference from RGBD images, in: European Conference on Computer Vision, pp. 746–760.
[44] Strasdat, H., Montiel, J.M.M., Davison, A.J., 2012. Visual SLAM: Why filter? Image and Vision Computing 30, 65–77.
[45] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D., 2012. A benchmark for the evaluation of RGB-D SLAM systems, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573–580.
[46] Tateno, K., Tombari, F., Laina, I., Navab, N., 2017. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6565–6574.
[47] Weerasekera, C.S., Latif, Y., Garg, R., Reid, I., 2017. Dense monocular reconstruction using surface normals, in: IEEE International Conference on Robotics and Automation, pp. 2524–2531.
[48] Whelan, T., Kaess, M., Johannsson, H., Fallon, M., Leonard, J.J., McDonald, J., 2015. Real-time large-scale dense RGB-D SLAM with volumetric fusion. The International Journal of Robotics Research 34, 598–626.
[49] Whelan, T., Salas-Moreno, R.F., Glocker, B., Davison, A.J., Leutenegger, S., 2016. ElasticFusion: Real-time dense SLAM and light source estimation. The International Journal of Robotics Research.
[50] Xu, D., Ouyang, W., Wang, X., Sebe, N., 2018. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. arXiv preprint arXiv:1805.04409.
[51] Yang, X., Chen, J., Wang, Z., Zhang, Q., Liu, W., Liao, C., Cheng, K., 2018. Monocular camera based real-time dense mapping using generative adversarial network, in: ACM Multimedia Conference (MM), pp. 896–904.
[52] Yang, X., Gao, Y., Luo, H., Liao, C., Cheng, K.T., 2019. Bayesian DeNet: Monocular depth prediction and frame-wise fusion with synchronized uncertainty. IEEE Transactions on Multimedia PP, 1–1.
[53] Yu, F., Koltun, V., 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
[54] Zhou, H., Li, X., Sadka, A.H., 2012. Nonrigid structure-from-motion from 2-D images using Markov chain Monte Carlo. IEEE Transactions on Multimedia 14, 168–177.
[55] Zhou, Z., Shi, F., Xiao, J., Wu, W., 2015. Non-rigid structure-from-motion on degenerate deformations with low-rank shape deformation model. IEEE Transactions on Multimedia 17, 171–185.



Xinchen Ye (M'17) received the B.E. and Ph.D. degrees from Tianjin University, Tianjin, China, in 2012 and 2016, respectively. He was with the Signal Processing Laboratory, EPFL, Lausanne, Switzerland, in 2015 under a grant from the Swiss federal government. He has been a faculty member of Dalian University of Technology, Dalian, Liaoning, China, since 2016, where he is currently an Assistant Professor with the DUT-RU International School of Information Science and Engineering. His current research interests include image/video processing and 3D imaging. As a co-author, he received the Platinum Best Paper Award at IEEE ICME 2017.

Xiang Ji received the B.S. degree in software engineering in 2016 from Tianjin Normal University, Tianjin, China. He is currently a graduate student at the School of Software, Dalian University of Technology. His research interests include SLAM, computer vision, and deep learning.

Zhihui Wang received the B.S. degree in software engineering in 2004 from Northeastern University, Shenyang, China. She received the M.S. degree in software engineering in 2007 and the Ph.D. degree in computer software and theory in 2010, both from the Dalian University of Technology, Dalian, China. Since November 2011, she has been a visiting scholar at the University of Washington. Her current research interests include information hiding and image compression.

Haojie Li is a Professor in the School of Software, Dalian University of Technology. His research interests include social media computing and multimedia information retrieval. He has co-authored over 50 journal and conference papers in these areas, including IEEE TCSVT, TMM, TIP, ACM Multimedia, ACM ICMR, etc. Dr. Li received the B.E. and Ph.D. degrees from Nankai University, Tianjin, and the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, in 1996 and 2007, respectively. From 2007 to 2009, he was a Research Fellow in the School of Computing, National University of Singapore. He is a member of IEEE and ACM.

Baoli Sun received the B.S. degree in microelectronics science and engineering in 2018 from Hefei University of Technology, Anhui, China. He is currently a graduate student at the School of Software, Dalian University of Technology, Liaoning, China. His research interests include image processing, computer vision, and deep learning.

Shenglun Chen received the B.S. degree in software engineering in 2017 from the Dalian University of Technology, Dalian, China. He is currently a graduate student at the School of Software, Dalian University of Technology, Liaoning, China. His research interests include simultaneous localization and mapping (SLAM) and 3D reconstruction.


We wish to draw the attention of the Editor to the following facts which may be considered as potential conflicts of interest and to significant financial contributions to this work. [OR] We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.

We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He/she is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from ([email protected]).

Signed by all authors as follows: Xinchen Ye, Xiang Ji, Baoli Sun, Shenglun Chen, Zhihui Wang, Haojie Li. Sep. 16, 2019.