DRM-SLAM: Towards Dense Reconstruction of Monocular SLAM with Scene Depth Fusion
Xinchen Ye^{a,b,*}, Xiang Ji^{a}, Baoli Sun^{a}, Shenglun Chen^{a}, Zhihui Wang^{a,b} and Haojie Li^{a,b}

a DUT-RU International School of Information Science & Engineering, Dalian University of Technology, Liaoning, China
b Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, China
ARTICLE INFO

Keywords: Depth fusion; SLAM; Monocular; Depth estimation

ABSTRACT
Monocular visual SLAM methods can accurately track the camera pose and infer the scene structure by building sparse correspondences between two or multiple views of the scene. However, the reconstructed 3D maps of these methods are extremely sparse. On the other hand, deep learning is widely used to predict dense depth maps from single-view color images, but the results suffer from blurry depth boundaries, which severely deform the structure of the 3D scene. Therefore, this paper proposes a dense reconstruction method under the monocular SLAM framework (DRM-SLAM), in which a novel scene depth fusion scheme is designed to fully utilize both the sparse depth samples from monocular SLAM and the dense depth maps predicted by a convolutional neural network (CNN). In the scheme, a CNN architecture is carefully designed for robust depth estimation. Besides, our approach also accounts for the scale ambiguity inherent in monocular SLAM. Extensive experiments on benchmark datasets and our captured dataset demonstrate the accuracy and robustness of the proposed DRM-SLAM. The evaluation of runtime and of adaptability under challenging environments also verifies the practicability of our method.
1. Introduction
The rapid development of visual simultaneous localization and mapping (vSLAM) [54, 55, 24] has created a new visualization and sensing wave for the computer vision community. The footprints of robust tracking and 3D mapping have influenced a broad spectrum of technological frontiers, e.g., autonomous navigation, mobile robots and augmented reality [42, 17]. Existing SLAM techniques can achieve accurate scene reconstruction relying on RGB-D or stereo cameras [23, 48, 49, 37]. However, the over-dependence on such capturing devices with special imaging mechanisms may bring limitations in real conditions. Therefore, SLAM based on a monocular camera, which is free from constraints imposed by the capturing equipment, is widely used; it is basically classified into feature-based methods and direct methods. Feature-based monocular techniques mainly extract sparse interest points [32, 1, 41] in every video frame, construct correspondences between these features across different frames, and then use them to estimate the camera pose and reconstruct the 3D scene [6, 14, 7, 25, 36]. Although the tracking performance of such algorithms is impressive, the generated 3D map is extremely sparse and cannot be used in practical applications. On the other hand, direct methods have been proposed to construct semi-dense, high-quality 3D scene models in real time. Rather than optimizing over robust feature points, these approaches directly use raw pixel intensities on the basis of stereo matching [38, 40, 39, 16, 10, 11].

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61702078 and 61772108, and by the Fundamental Research Funds for the Central Universities.
* Corresponding author: [email protected] (X. Ye). ORCID: 0000-0001-5328-3911 (X. Ye).
Figure 1: A dense reconstruction example of the proposed DRM-SLAM. (a) A keyframe chosen from input video sequence; (b) Reconstructed sparse point cloud by feature-based ORB-SLAM [36]; (c) Dense depth map estimated by the proposed depth fusion scheme; (d) Reconstructed dense 3D scene by our DRM-SLAM approach.
However, the methods of this category also struggle to reconstruct a dense 3D map. The missing points in the generated 3D map mainly correspond to textureless areas in the raw image, where the accuracy of stereo matching can be very low.

Dense solutions for monocular depth reconstruction can be classified into two categories, i.e., hand-crafted graphical models and CNN-based methods. The first category constructs graphical models and uses hand-crafted priors [5, 13, 4, 38, 22, 39] to regularize the depth reconstruction process. However, these hand-crafted priors are not optimal, and high-level scene context needs to be extracted to better understand the scene geometry from the single-view color image. Recently, CNNs have demonstrated their powerful representation ability to extract high-level features, and are widely used to estimate scene depth or disparity [29, 9] with higher accuracy.
Table 1: Properties of different depth estimation methods using heterogeneous depth data sources.

Method                       | 3D map                           | Scale ambiguity | Accuracy                 | Time complexity
Feature-based monocular SLAM | Sparse (only key-points)         | Yes             | High                     | Real time
CNN-based depth estimation   | Dense (estimation of each pixel) | No              | Medium (scene-dependent) | Real time on key-frames
Our DRM-SLAM                 | Dense (estimation of each pixel) | No              | High                     | Nearly real time
Although the depth maps predicted by CNNs are globally accurate, details such as depth boundaries and tiny objects cannot be well preserved, which has a big impact on other tasks relying on the recovered depth map, such as object detection or image rendering for 3DTV. Therefore, Tateno et al. [46] proposed a dense depth reconstruction method in which the CNN-inferred depth map and the depth measurements from direct monocular SLAM [11] are fused together to achieve accurate reconstruction. Weerasekera et al. [47] formulated depth estimation as an energy minimization process, in which the data term is constructed based on the photometric difference while the smoothness term is regularized by CNN-inferred surface normals. Luo et al. [33] proposed a dense monocular SLAM system, which fuses direct SLAM with an online-adapted depth prediction network to achieve accurate depth prediction. Note that all the aforementioned dense solutions for monocular SLAM are based on direct SLAM [11]^1, since direct methods can optimize over raw pixels with sufficient image gradients to compute semi-dense scene depth, which makes dense reconstruction much easier. Compared with direct SLAM, the map-points computed by feature-based SLAM are relatively irregular and extremely sparse, so dense reconstruction based on feature-based SLAM is more challenging^2. In this paper, we therefore build our dense scene reconstruction on ORB-SLAM [36], a state-of-the-art feature-based SLAM. ORB-SLAM generates extremely sparse map-points, and most of them correspond to features extracted from high-texture image regions; these features are important cues for capturing the geometric structure of the 3D scene. In contrast, the depth map inferred by the CNN is dense and globally accurate in smooth image regions, but cannot preserve details and fine structures well due to the feature aggregation caused by repeated downsampling in the CNN layers. Considering the above problems, we propose a dense reconstruction approach under the monocular SLAM framework (DRM-SLAM), in which a novel scene depth fusion scheme is designed to fully exploit both the sparse depth samples from ORB-SLAM and the CNN-inferred depth.

^1 Note that [47] employs a feature extraction and depth estimation framework based on graphical models similar to other direct methods.
^2 The difference between direct SLAM and feature-based SLAM is detailed in Sec. 2.
Table 1 shows that the two kinds of depth estimation methods are complementary to each other, i.e., the CNN-inferred depth is dense but has lower accuracy, while the depth from feature-based SLAM is more accurate but too sparse. Our DRM-SLAM achieves both dense and highly accurate depth estimation and scene reconstruction in nearly real time. Fig. 1 shows the original reconstruction of ORB-SLAM [36] and the dense reconstruction of the proposed DRM-SLAM. Note that ORB-SLAM recovers the exact geometric structure of the 3D scene but fails to generate a dense 3D reconstruction, while DRM-SLAM obtains the complete scene depth map and achieves an accurate dense 3D reconstruction. We demonstrate on three benchmark datasets and our captured dataset that our approach outperforms other CNN-based methods in terms of scene depth estimation, and generates dense 3D reconstruction results comparable to other dense solutions. The main contributions are summarized as follows:

1) A depth fusion scheme based on a depth reconstruction model is proposed to fully exploit the sparse depth samples generated by ORB-SLAM and the depth map inferred by the CNN to achieve dense and accurate reconstruction. Before fusing the heterogeneous depth sources, the problem of scale ambiguity and the uncertainty of the CNN-inferred depth around object boundaries are taken into consideration.

2) A deep CNN is designed based on the ResNet architecture [21] to learn depth maps from monocular color images. Dilated convolutions [53] are used to maintain a relatively high output resolution, and a multi-scale scheme [2] is employed to distinguish scene objects at different scales.

3) An effective parameter adaptation scheme is proposed to achieve stable and accurate dense reconstruction. The evaluation of runtime and of adaptability under challenging environments also verifies the practicability of the proposed DRM-SLAM.
2. Related Work
In this section, we give an overview of related work in three aspects, i.e., the classification of monocular SLAM, dense solutions of monocular SLAM, and depth estimation methods.
2.1. Monocular Visual SLAM
The first monocular SLAM system, MonoSLAM, was proposed by Davison et al. [8] in 2007; it uses an extended Kalman filter to recursively estimate the camera pose and the sparse positions of features in each frame. After that, key-frame based approaches [25, 26, 36, 4] were proposed to estimate the map and pose using only key-frames, which is less costly yet more accurate than filtering-based methods [44]. Among feature-based monocular SLAM systems, a category of key-frame based approaches, ORB-SLAM [36] achieves state-of-the-art performance in pose estimation and tracking. Sparse features in high-texture regions are first extracted with ORB descriptors; then local bundle adjustment and pose graph optimization are used to reconstruct the 3D scene by successively tracking the sparse ORB features. Real-time camera relocalization with invariance to viewpoint and illumination enables successful recovery from tracking failures and also enhances the probability of map reuse.

Direct methods such as large-scale direct SLAM (LSD-SLAM) [11] keep track of depth values only at raw pixels with sufficient image gradients instead of sparse features. Their results are very impressive, as the system is able to build semi-dense maps in real time without GPU acceleration. Nevertheless, they still need features for loop detection, and their camera localization accuracy is significantly lower than that of ORB-SLAM. Moreover, because of its dependence on the intensity-consistency assumption, the accuracy of LSD-SLAM decreases when the scene exhibits large illumination variation.
2.2. Dense Solutions of Monocular SLAM
Solutions achieving dense depth reconstruction with a monocular camera are mainly classified into two categories, i.e., pure geometry-based approaches and the more recent deep learning based approaches.

Combining parallel processing techniques with short-baseline multi-view stereo matching under a regularization scheme, dense tracking and mapping (DTAM) [38] followed the structure of PTAM [4] to achieve dense tracking and mapping in real time on a GPU. Its reconstruction framework uses a photometric data term enhanced by hundreds of narrow-baseline images and a globally spatially regularized energy function to improve the quality of the optimization. However, since DTAM computes the depth of every single pixel and uses global optimization, its efficiency is low even with GPU acceleration. Moreover, Engel et al. [10] demonstrated that, in DTAM, small image intensity errors have a large effect on the accuracy of the estimated disparity in regions with small gradients; that is why LSD-SLAM only computes depth at pixels with sufficient gradients, improving both efficiency and accuracy. Another dense reconstruction work is probabilistic monocular dense reconstruction (Remode) [39] proposed by Pizzoli et al., which estimates dense and accurate depth maps from a single camera. A probabilistic depth measurement is carried out in real time on a per-pixel basis, and the computed uncertainty is used to reject erroneous estimations and provide live feedback on the reconstruction process. Each depth point is described and updated by a parametric model under a Bayesian estimation framework, and finally a smoothness prior is applied to the depth map to obtain better results. Since the framework requires the camera pose at each frame and the depth range of the scene as inputs, the applicability of Remode to different scenes is limited. Both of the above approaches are pure-geometry based solutions, which usually ignore high-level scene context and suffer in low-texture regions.

Recently, Weerasekera et al. [47] used a surface normal map predicted by a learned CNN [9] as a strong prior and constructed a graphical model to estimate a dense depth map constrained by a photometric cost and a surface-normal consistency term. Following the model of DTAM, their surface normal prior replaces the inverse-depth smoothness prior used in DTAM. CNN-SLAM, proposed by Tateno et al. [46], presented a method in which CNN-predicted dense depth maps are naturally fused with depth measurements obtained from direct monocular SLAM. A particularly important stage of the framework is the scheme employed to refine the CNN-predicted depth map associated with each key-frame via small-baseline stereo matching, by enforcing color-consistency minimization between a key-frame and its associated input frames. Yang et al. [51] proposed a GAN-based method for real-time dense mapping with a monocular camera, which takes a semi-dense map obtained from motion stereo matching as guidance to supervise the dense depth prediction of a single RGB image; an adversarial loss and a pixel-wise mean squared error loss are used to train the generator. Based on CNN-SLAM, Luo et al. [33] proposed a novel dense monocular SLAM system which fuses direct SLAM with an online-adapted depth prediction network (OADPN for short) to achieve accurate depth prediction for scenes of different types. The depth prediction network is tuned on-the-fly for better generalization to different scene types, and a stage-wise stochastic gradient descent algorithm is used for efficient convergence of the tuning process. Meanwhile, the dense map produced by the CNN is used to deal with the scale ambiguity problem, which in turn improves the accuracy of both tracking and overall reconstruction.

Note that the above methods are all based on direct SLAM. Unlike feature-based SLAM, which computes sparse feature points for subsequent matching, direct SLAM can directly optimize over raw pixels with sufficient image gradients to reconstruct a semi-dense map, which makes dense reconstruction easier than with feature-based SLAM. Yet, direct techniques have lower accuracy and robustness than feature-based methods because of their dependence on the intensity-consistency assumption. Motivated by this, we build our dense scene reconstruction framework on the more challenging feature-based SLAM, i.e., ORB-SLAM [36], in which a novel scene depth fusion scheme is designed to fully utilize both the sparse depth samples from monocular SLAM and the dense depth maps predicted by the CNN.
Figure 2: System Overview. Our DRM-SLAM incrementally generates a fully dense reconstruction of a scene given sparse depth map from monocular SLAM and CNN-inferred depth of key-frames.
2.3. CNN-based Depth Estimation
Traditional depth estimation methods mainly focused on using graphical models to recast the task as an energy optimization problem [5, 13, 4, 38, 22, 39]. Owing to the powerful representation ability of CNNs, Eigen et al. [9] used a single multi-scale CNN for depth prediction, progressively refining the predicted depth with a sequence of scale sub-networks. Liu et al. [31] presented a deep structured learning scheme for estimating depth from single monocular images; they learn the unary and pairwise potentials of a continuous conditional random field (CRF) in a unified deep CNN framework to jointly explore the capacity of CNN and CRF. Laina et al. [29] resorted to a deeper fully convolutional residual network to predict the depth map. They also designed an up-projection module to address the downsampling problem, i.e., the decrease of the feature map resolution, and used the reverse Huber loss to train the whole network. However, the resolution of the predicted results is still inferior to that of the input color image and is subject to severe blurring artifacts around depth boundaries.

Recently, many dense depth reconstruction methods have appeared that take a sparse set of depth measurements and a single RGB image as input, which can be regarded as a depth super-resolution or depth completion problem with very sparse depth measurements. The Sparse-to-Dense (StD) prediction method [34] presented a deep regression network that learns directly from raw RGB-D data and explores the impact of the number of depth samples on prediction accuracy. Chen et al. [3] also designed an end-to-end CNN to estimate the depth map from RGB and sparse sensing (RSS for short), which works simultaneously for indoor and outdoor scenes; a parameterized representation of the sparse depth input is proposed to accommodate it. However, the methods of this category rely on a specific sampling pattern, e.g., a regular grid or Bernoulli sampling, which is not suitable for real conditions such as sampling by ORB features. They pay more attention to verifying the effectiveness of their methods under ideal conditions, but report only a few simple experiments under SLAM environments, which are not enough to demonstrate their practicability.
3. The Proposed Method
Fig. 2 illustrates the pipeline of our proposed framework. During the camera tracking process, color key-frames are chosen and the sparse depth map for each key-frame is generated. For each color key-frame, we first estimate a depth map via the CNN and apply scale consensus between the CNN-inferred depth map and the sparse depth map. Then, we formulate a depth reconstruction model to obtain a high-quality dense depth map by fully exploiting both the sparse depth samples from monocular SLAM and the dense depth map predicted by the CNN. Finally, the generated dense depth maps of the key-frames are converted into point clouds and assembled into a globally consistent model to achieve the final dense reconstruction of the 3D scene. Within the whole processing procedure, camera tracking, depth fusion and reconstruction run on the main CPU thread. Depth map training and prediction run on the GPU, and the depth prediction on key-frames is managed in parallel with the main thread. Each stage of the framework is presented in detail in the following subsections.
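To make the data flow concrete, the following Python sketch outlines the per-key-frame processing described above. It is only an illustration of our description: the helper names (predict_depth_cnn, estimate_scale_ransac, fuse_depth) are hypothetical placeholders for the modules detailed in Sections 3.2-3.4, not the actual implementation.

```python
# Hypothetical sketch of the per-key-frame pipeline of Fig. 2; the helper
# functions are placeholders for the modules of Sections 3.2-3.4.

def process_keyframe(rgb, sparse_depth, predict_depth_cnn,
                     estimate_scale_ransac, fuse_depth):
    """Turn one key-frame into a dense depth map ready for 3D assembly."""
    # 1. Dense but boundary-blurred depth from the single-view CNN (Sec. 3.2).
    cnn_depth = predict_depth_cnn(rgb)
    # 2. Resolve the monocular scale ambiguity against the CNN depth (Sec. 3.3).
    scale = estimate_scale_ransac(sparse_depth, cnn_depth)
    # 3. Fuse the rescaled sparse SLAM depth with the CNN depth under a
    #    confidence map (Sec. 3.4) to obtain the final dense depth map.
    dense_depth = fuse_depth(scale * sparse_depth, cnn_depth, rgb)
    # The dense depth map is then back-projected with the tracked camera pose
    # and merged into the globally consistent point-cloud model.
    return dense_depth
```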
3.1. Camera Tracking and Key-frame Selection
We employ the feature-based tracking framework of ORB-SLAM [36] to estimate the camera poses and select key-frames from the input video. For each key-frame k_i, we obtain the corresponding sparse depth map D'_{k_i} and camera pose P_{k_i} = [R_{k_i}, t_{k_i}], where R and t are the 3x3 rotation matrix and the 3-dimensional translation vector, respectively. To optimize the camera pose P during tracking, bundle adjustment is used to minimize the reprojection error between paired 3D points X_i ∈ ℝ³ in world coordinates and 2D feature points x_i ∈ ℝ² in the image plane, where i is the pixel index from the matching set 𝒳:
Figure 3: The proposed deep architecture to estimate depth from a monocular RGB image. s-Dconv denotes dilated convolution with dilation factor s; all the successive skip structures in the ResNet-101 model are uniformly marked as a residual block.
$$\{\mathbf{R},\mathbf{t}\} = \arg\min_{\mathbf{R},\mathbf{t}} \sum_{i\in\mathcal{X}} \rho\!\left(\left\| \mathbf{x}_i - \pi\!\left(\mathbf{R}\mathbf{X}_i + \mathbf{t}\right)\right\|_{\Sigma}^{2}\right), \qquad (1)$$

where ρ is the robust Huber cost function and Σ is the information matrix associated with the scale of the keypoint. The perspective projection function π is defined as follows:

$$\pi\!\left(\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}\right) = \begin{bmatrix} f_x\, X/Z + c_x \\ f_y\, Y/Z + c_y \end{bmatrix}, \qquad (2)$$

where (f_x, f_y) and (c_x, c_y) are the focal lengths and the principal point along the x- and y-axes, all obtained from camera calibration. The computed camera pose P_{k_i} and sparse depth map D'_{k_i} of key-frame k_i are then used in the following depth fusion and reconstruction.
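As a concrete illustration of Eqs. (1) and (2), the following minimal NumPy sketch evaluates the perspective projection and the robust reprojection cost that bundle adjustment minimizes. It is an assumption-laden illustration (e.g., a scalar information weight inv_sigma2 and a unit Huber threshold), not the tracking code of ORB-SLAM.

```python
import numpy as np

def project(point_cam, fx, fy, cx, cy):
    """Perspective projection pi of Eq. (2): camera-frame 3D point -> pixel."""
    X, Y, Z = point_cam
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def huber(r, delta=1.0):
    """Robust Huber cost rho used in Eq. (1); the unit threshold is illustrative."""
    a = abs(r)
    return 0.5 * a**2 if a <= delta else delta * (a - 0.5 * delta)

def reprojection_cost(R, t, points_world, points_image,
                      fx, fy, cx, cy, inv_sigma2=1.0):
    """Sum of robust, information-weighted reprojection errors of Eq. (1)."""
    cost = 0.0
    for X_w, x_obs in zip(points_world, points_image):
        x_proj = project(R @ X_w + t, fx, fy, cx, cy)
        r2 = inv_sigma2 * np.sum((x_obs - x_proj) ** 2)  # ||.||^2_Sigma
        cost += huber(np.sqrt(r2))
    return cost
```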
3.2. Depth Estimation based on Deep CNN
Fig. 3 shows our proposed CNN architecture for depth estimation. The ResNet-101 model [21] is selected as our backbone to infer the depth map from a single-view color image, with several modifications (marked in yellow in Fig. 3), as follows.

1) All the fully connected layers designed for the classification problem are removed from our regression network. For the loss function, the L2 norm is used to penalize the regression error between the ground truth and its corresponding prediction.

2) Different from [29], which designs a complex up-projection module to prevent the feature map resolution from becoming even smaller, we resort to dilated convolutions (Dconv) [53] to replace the pooling and stride operators (/2) at the last two downsampling layers. The dilation rate is set to 2 for both Dconv layers (2-Dconv). Note that dilated convolution increases the receptive field of the network exponentially while keeping the number of parameters and the feature map resolution unchanged. It is therefore suitable for our depth estimation task under the SLAM framework with its real-time requirement, which cares more about integrating knowledge of the wider context at low cost.

3) The multi-scale scheme [2] is introduced by concatenating it to the last residual block. Four dilated convolutions with dilation rates set to 6, 12, 18, and 24 (marked in the red dashed box in Fig. 3) are employed to form a spatial pyramid pooling structure that extracts four feature maps with different receptive fields. This ensures that scene objects of arbitrary scale can be identified and their depth accurately inferred by pooling convolutional features at different scales. The four feature maps are then fused together to generate the final depth map at 1/8 of the input resolution. Note that we do not use complex strategies such as deconvolution or up-projection [29] to upsample the inferred depth map, since our CNN architecture aims at inferring depth values from a global perspective rather than removing the blurring at local structures, which is handled by the subsequent depth fusion scheme. Therefore, we upsample the inferred depth map by simply inserting zeros at the missing locations to get a high-resolution sparse depth map D̃_{k_i}.
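Since the experiments are implemented in TensorFlow, the multi-scale block can be sketched in tf.keras as below. The per-branch channel width (one channel per dilation rate, following Fig. 3) is taken from the figure, while the 1x1-convolution fusion operator and the input feature shape are assumptions made for illustration.

```python
import tensorflow as tf

def multi_scale_block(features, rates=(6, 12, 18, 24)):
    """Sketch of the multi-scale head of Fig. 3: parallel dilated convolutions
    with different dilation rates, fused into a single-channel depth map at
    1/8 of the input resolution. The 1x1-conv fusion is illustrative."""
    branches = []
    for r in rates:
        # Each r-Dconv branch predicts a one-channel depth hypothesis with a
        # receptive field that grows with the dilation rate r.
        branches.append(
            tf.keras.layers.Conv2D(1, 3, padding="same", dilation_rate=r)(features))
    # Fuse the four hypotheses into the final depth prediction.
    fused = tf.keras.layers.Concatenate()(branches)
    return tf.keras.layers.Conv2D(1, 1, padding="same")(fused)

# Usage sketch: attach the block to the last residual block of the backbone.
backbone_out = tf.keras.Input(shape=(60, 80, 512))   # assumed 1/8-resolution features
depth = multi_scale_block(backbone_out)
model = tf.keras.Model(backbone_out, depth)
```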
3.3. Scale Consensus
There always exists the problem of scale ambiguity in monocular ORB-SLAM, since the algorithm cannot recover the absolute scale from a monocular video captured by a single camera. In contrast, the CNN learns correlations between visual cues and absolute depths from a large number of training image pairs, and can thus provide absolute scale information as a reference for tracking and mapping. Therefore, the sparse depth map D'_{k_i} obtained from SLAM and the dense depth map D̃_{k_i} inferred by the CNN for key-frame k_i are not on the same scale. Since D̃_{k_i} is learned from absolute depth values, we take it as the reference and transform D'_{k_i} to the scale of D̃_{k_i} by multiplying a scale factor. As a pre-processing step, we upsample D̃_{k_i} by bicubic interpolation, and select the valid pixels at the same positions from both D̃_{k_i} and D'_{k_i} as two candidate pixel sets Ω_D̃ and Ω_D', respectively.

The most direct way is to compute the mean ratio of the values between the two pixel sets and use it as the scale factor. A more advanced alternative is to fit the two pixel sets by least squares to find a global solution. However, both methods are inaccurate because they ignore the existence of outliers. Similar to Luo et al. [33], who also use the RANSAC algorithm to regress a correct scale in the presence of a nontrivial amount of outliers, we employ a robust RANSAC-based least-squares strategy to find the optimal scale factor. Given a number of randomly chosen pairs of depth samples d' and d̃ from the pixel sets Ω_D' and Ω_D̃, respectively, we solve the following optimization problem to obtain a scale factor by least-squares fitting:

$$\arg\min_{s} \sum_{i=1}^{N} \left\| s \cdot d'_i - \tilde{d}_i \right\|^{2}, \quad d' \in \Omega_{D'},\ \tilde{d} \in \Omega_{\tilde{D}}, \qquad (3)$$

where s is the scalar scale to be estimated and N is the number of pairs of depth samples used in each iteration. We repeatedly solve the above function to obtain multiple scale factors based on different subsets of depth samples chosen from Ω_D̃ and Ω_D'. Finally, we choose the scale value s* corresponding to the iteration containing the most inliers as our optimal solution. The detailed process of scale computation is presented in Algorithm 1, where s_j denotes the output scale of the j-th least-squares fitting, Inlier_count(·) counts the number of inliers under the threshold δ, and Inliers(s_j) denotes the number of inliers under the current scale s_j.

Algorithm 1: RANSAC-based Scale Computation
Require: Ω_D̃, Ω_D', inlier threshold δ
1: while j < max_iteration do
2:   Randomly select N pairs of depth samples
3:   Compute s_j using Eq. (3)
4:   Inliers(s_j) = Inlier_count(Ω_D̃, Ω_D', δ, s_j)
5:   if Inliers(s_j) > Inliers(s*) then
6:     s* ← s_j
7:   end if
8: end while
9: return s*, Inliers(s*)

Note that the scale factor between the CNN-inferred depth and monocular SLAM keeps changing along the tracking process. Therefore, the scale is re-computed and updated whenever a new color key-frame is inserted, so that the scale adjustment is performed throughout the whole SLAM system to deal with scale variation.
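A minimal NumPy sketch of Algorithm 1 is given below, assuming the two candidate pixel sets have already been collected as co-located 1D arrays; the subset size, iteration count and inlier threshold are illustrative values rather than the paper's settings.

```python
import numpy as np

def ransac_scale(d_slam, d_cnn, n_pairs=5, max_iter=200, delta=0.2, seed=0):
    """Estimate the scalar s minimizing ||s * d' - d~||^2 (Eq. 3) with RANSAC.
    d_slam and d_cnn are 1D arrays of co-located valid depth samples;
    n_pairs, max_iter and delta (meters) are illustrative settings."""
    rng = np.random.default_rng(seed)
    best_s, best_inliers = 1.0, -1
    for _ in range(max_iter):
        idx = rng.choice(len(d_slam), size=n_pairs, replace=False)
        a, b = d_slam[idx], d_cnn[idx]
        # Closed-form least-squares solution of Eq. (3) on the sampled subset.
        s = np.dot(a, b) / np.dot(a, a)
        # Inlier_count(.): pairs whose rescaled deviation is below delta.
        inliers = int(np.count_nonzero(np.abs(s * d_slam - d_cnn) < delta))
        if inliers > best_inliers:
            best_s, best_inliers = s, inliers
    return best_s, best_inliers
```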
3.4. Depth Fusion and Reconstruction
In this stage, we first fuse the sparse depth map D' and the CNN-inferred depth map D̃ ^3, and meanwhile compute a confidence map H indicating the accuracy of each pixel in the fused depth map D̄. Then, we formulate a graphical model to reconstruct the final dense depth map D from the fused initial depth observation D̄. At a given pixel index p, the fused depth value D̄_p is defined as:

$$\bar{\mathbf{D}}_p = \begin{cases} \mathbf{D}'_p & \mathbf{D}'_p \neq 0 \\ \tilde{\mathbf{D}}_p & \mathbf{D}'_p = 0,\ \tilde{\mathbf{D}}_p \neq 0 \\ 0 & \mathbf{D}'_p = 0,\ \tilde{\mathbf{D}}_p = 0 \end{cases} \qquad (4)$$

^3 To avoid symbol confusion, we omit the subscript k_i in D' and D̃ for simplification.
Figure 4: Statistical analysis of the error distribution by presenting the correlation between the distance of a point to the depth boundary and the estimation error.
If both values D'_p and D̃_p are present at pixel p, D'_p is retained because of the high accuracy of the pure-geometry computation in ORB-SLAM. To construct the confidence map H, we set H_p = 0 for pixels with no valid depth value, i.e., p ∈ {p | D̄_p = 0}. As we observe, the pure-geometry based method usually suffers from low parallax: a few extremely large outliers appear among the depth samples produced by ORB-SLAM when the translation during camera motion is not sufficiently large. Therefore, for the pixels coming from ORB-SLAM, i.e., p ∈ {p | D'_p ≠ 0}, we compute the confidence as:

$$\mathbf{H}_p = \min\!\left(\left(\frac{D_{max}}{\bar{\mathbf{D}}_p}\right)^{2}, 1\right), \quad p \in \left\{ p \mid \mathbf{D}'_p \neq 0 \right\}, \qquad (5)$$

where D_max represents the maximum depth value learned by the CNN. Depth samples obtained from ORB-SLAM with values larger than D_max are very likely to be outliers, so their confidences are assigned lower values.

The remaining problem is to determine the confidence values of the CNN-inferred depth samples. As we observe, pixels close to depth boundaries are less reliably predicted by the CNN than those far away, and should be assigned a lower reliability. We perform a statistical analysis of the error distribution, i.e., the correlation between the distance of a point to the boundary and its error, to validate this observation. The region of depth boundaries is computed by applying an edge detector to the depth maps recovered by our proposed CNN. We then compute the mean error over the pixels with the same distance to depth boundaries for the whole test dataset. The statistical result, shown in Fig. 4, verifies our observation. Therefore, the confidence values are computed as:

$$\mathbf{H}_p = \min\!\left(\left(\frac{r_p}{r^{*}}\right)^{2}, 1\right), \quad p \in \left\{ p \mid \mathbf{D}'_p = 0,\ \tilde{\mathbf{D}}_p \neq 0 \right\}, \qquad (6)$$
DRM-SLAM: Towards Dense Reconstruction of Monocular SLAM with Scene Depth Fusion
where r_p is the distance between the current pixel p and its nearest pixel in the depth boundary set ℬ, and r* is a maximum distance threshold. Finally, our confidence map H becomes:

$$\mathbf{H}_p = \begin{cases} 0 & \bar{\mathbf{D}}_p = 0 \\ \min\!\left(\left(\dfrac{D_{max}}{\bar{\mathbf{D}}_p}\right)^{2}, 1\right) & \mathbf{D}'_p \neq 0 \\ \min\!\left(\left(\dfrac{r_p}{r^{*}}\right)^{2}, 1\right) & \mathbf{D}'_p = 0,\ \tilde{\mathbf{D}}_p \neq 0 \end{cases} \qquad (7)$$
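The fusion rule of Eq. (4) and the confidence definitions of Eqs. (5)-(7) can be summarized by the following NumPy sketch; the inputs (rescaled sparse depth, upsampled CNN depth, per-pixel distance to the nearest depth boundary) are assumed to be precomputed arrays of the same size.

```python
import numpy as np

def fuse_and_confidence(d_slam, d_cnn, dist_to_edge, d_max, r_star):
    """Sketch of Eqs. (4)-(7). d_slam: rescaled sparse SLAM depth (0 where
    empty); d_cnn: upsampled CNN depth (0 where empty); dist_to_edge:
    per-pixel distance to the nearest CNN depth boundary; d_max: maximum
    depth the CNN can predict; r_star: distance saturation threshold."""
    has_slam = d_slam > 0
    has_cnn = d_cnn > 0

    # Eq. (4): prefer the geometrically accurate SLAM samples where available.
    d_fused = np.where(has_slam, d_slam, np.where(has_cnn, d_cnn, 0.0))

    h = np.zeros_like(d_fused, dtype=float)
    # Eq. (5): SLAM samples far beyond d_max are likely low-parallax outliers.
    h[has_slam] = np.minimum((d_max / d_fused[has_slam]) ** 2, 1.0)
    # Eq. (6): CNN depth close to depth boundaries is less reliable.
    cnn_only = has_cnn & ~has_slam
    h[cnn_only] = np.minimum((dist_to_edge[cnn_only] / r_star) ** 2, 1.0)
    # Eq. (7) is exactly the combination of the three cases above.
    return d_fused, h
```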
Then, according to the fused depth map D̄ and the confidence map H, we formulate our dense depth reconstruction as the following model:

$$\min_{\mathbf{D}} \sum_{p} \left( \mathbf{H}_p \left( \mathbf{D}_p - \bar{\mathbf{D}}_p \right)^{2} + \lambda \sum_{q \in \mathcal{N}_4(p)} w_{p,q} \left( \mathbf{D}_p - \mathbf{D}_q \right)^{2} \right), \qquad (8)$$

where 𝒩₄(p) is the 4-connected neighborhood of pixel p and λ is a balance parameter. The observation constraint is enforced by the confidence map H, while the smoothness constraint is adaptively enforced using the spatially varying weighting function w_{p,q} defined on the color key-frame. Exploiting the depth-color correlation is quite informative for depth reconstruction when the accompanying color image is available, so w_{p,q} is defined according to the similarity computed on the color image C:

$$w_{p,q} = \exp\!\left( - \left( \mathbf{C}_p - \mathbf{C}_q \right)^{2} / \sigma^{2} \right), \qquad (9)$$

where σ is a variance parameter. By setting the gradient of the objective function (8) to zero, the solution d is obtained by solving a linear system that requires inverting a large sparse matrix:

$$\mathbf{d} = \left( \tilde{\mathbf{H}} + \lambda \mathbf{W} \right)^{-1} \tilde{\mathbf{H}}\, \bar{\mathbf{d}}, \qquad (10)$$

where d and d̄ are the vector forms of D and D̄, respectively, H̃ is a diagonal matrix whose diagonal elements are given by h, the vector form of the confidence map H, and W denotes the spatially varying Laplacian matrix defined by w_{p,q}:

$$\mathbf{W}(m,n) = \begin{cases} \sum_{q \in \mathcal{N}_4(p)} w_{p,q} & m = n \\ -w_{p,q} & q \in \mathcal{N}_4(p) \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$
where m and n denote the one-dimensional scalar pixel indexes corresponding to a pixel p and its neighbors, respectively. W is a five-point sparse matrix including the diagonal elements. By solving the linear system, the sparse data d̄ are propagated to the whole image domain, guided by the confidence map H and the weighting matrix W. However, the matrix H̃ + λW is highly ill-conditioned due to the sparse input data (H̃ is severely rank-deficient), so directly inverting it to compute the solution is very unstable. Inspired by [30], which transforms the data interpolation task into simple filtering sub-problems, Eq. (10) can be decomposed correspondingly into:

$$\mathbf{d}(p) = \left( \mathbf{S}_{\bar{d}}\ ./\ \mathbf{S}_{h} \right)(p) = \frac{\left( \left( \mathbf{I} + \lambda \mathbf{W} \right)^{-1} \bar{\mathbf{d}} \right)(p)}{\left( \left( \mathbf{I} + \lambda \mathbf{W} \right)^{-1} \mathbf{h} \right)(p)}, \qquad (12)$$

where I is the identity matrix. Note that the original optimization problem is decoupled into two simple sub-problems in the numerator and denominator, which solve S_d̄ and S_h separately. The matrix (I + λW)⁻¹ can be regarded as a filtering matrix applied to both d̄ and h. The final fused depth map d is computed by element-wise division "./" between S_d̄ and S_h at each pixel of the same index p.

Besides, solving Eq. (12) directly is also time-consuming. Several methods [27, 28] have been proposed to solve such linear systems, but they are still much slower than local filtering methods [15, 19], which impedes practical use in real-time SLAM. Therefore, instead of directly inverting the sparse matrix (I + λW), we resort to the fast solving strategy of [30, 35] to accelerate the process. In essence, the algorithm breaks down the direct 2D smoothing of an image (D̄ and H in our framework) into multiple 1D smoothing processes applied sequentially to the rows and columns of the image, using horizontal and vertical 1D solvers, and achieves performance comparable to local filters. To briefly illustrate the 1D fast solver, we define the linear system for a 1D signal along the x (horizontal) dimension as follows:

$$\left( \mathbf{I} + \lambda_x \mathbf{W}_x \right) \mathbf{S}^{x}_{\bar{d}} = \bar{\mathbf{d}}^{x}, \qquad (13)$$

where d̄ˣ is a 1D horizontal signal extracted from a row of D̄, and W_x is a three-point Laplacian matrix constructed on the 𝒩₂(p) neighborhood containing the two horizontal neighbors of p (i.e., p − 1 and p + 1). The 1D output S^x_d̄ is obtained by solving this linear system ^4. In fact, solving Eq. (13) is much easier than directly solving S_d̄ in Eq. (12), since the Laplacian matrix W_x becomes a tridiagonal matrix whose nonzero elements exist only on the main diagonal and the two adjacent diagonals. Such a system has an exact solution obtained by Gaussian elimination, and can thus be solved recursively with O(n) complexity (here n is the width of the image) ^5. To avoid "streaking artifacts", we perform the 2D smoothing by applying the sequential 1D global smoothing operations for several iterations to propagate information across edges; the number of iterations is set to T = 3 based on experimental performance. Owing to the accelerated algorithm, our depth reconstruction framework achieves a reasonable running time without decreasing the reconstruction performance.
^4 S^x_h can be obtained from the analogous system and is therefore omitted for brevity.
^5 We refer readers to Ref. [35] for more details about the algorithm.
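For illustration, the horizontal 1D system of Eq. (13) can be prototyped with an off-the-shelf banded solver as below; the reference method instead uses the recursive Gaussian-elimination (Thomas) solver of [35] for O(n) runtime. The same routine is applied to h, to every row and column, and repeated for T = 3 iterations before the element-wise division of Eq. (12).

```python
import numpy as np
from scipy.linalg import solve_banded

def solve_1d_row(signal, guide, lam, sigma):
    """Solve (I + lam * W_x) s = signal for one image row (Eq. 13), where W_x
    is the three-point Laplacian weighted by the color similarity of Eq. (9).
    guide is the corresponding grayscale row of the color key-frame."""
    n = len(signal)
    # Neighbour weights w_{p,p+1} between adjacent pixels, Eq. (9).
    w = np.exp(-((guide[1:] - guide[:-1]) ** 2) / sigma ** 2)       # length n-1
    # Banded storage of the tridiagonal matrix A = I + lam * W_x.
    main = 1.0 + lam * (np.concatenate(([0.0], w)) + np.concatenate((w, [0.0])))
    ab = np.zeros((3, n))
    ab[0, 1:] = -lam * w       # super-diagonal
    ab[1, :] = main            # main diagonal
    ab[2, :-1] = -lam * w      # sub-diagonal
    return solve_banded((1, 1), ab, signal)
```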
Figure 5: Average key-frame reconstruction rmse (left vertical axis, in red) and accuracy with threshold 1.25 (right vertical axis, in blue) with respect to the parameters (a) λ and (b) σ. The first row shows RGB patches and the corresponding ground-truth depth of selected key-frames. Some representative recovered depth maps at the knee point of the rmse curve are presented for clearer comparison and analysis of the parameter σ.
Finally, the generated dense depth maps on key-frames are fused into a globally consistent model based on a point cloud representation to achieve an accurate dense reconstruction of the 3D scene.
4. Evaluation
In this section, we first present the parameter analysis and adaptation scheme for depth reconstruction in Sec. 4.1. The influence of scale consensus is evaluated in Sec. 4.2. Then, we evaluate the reconstruction density and the quality of depth estimation in Sec. 4.3 and Sec. 4.4, respectively. The ablation study and the evaluation of running speed are given in Sec. 4.5 and Sec. 4.6, respectively. We also verify the effectiveness of our method under challenging situations, i.e., pure rotational camera motion and low-texture environments, in Sec. 4.7. Evaluation on our captured dataset is shown in Sec. 4.8.

Three public benchmark datasets are used in our experiments: the NYU RGB-D V2 dataset ("bathroom 0003", "bathroom 0007", "kitchen 0046", "kitchen 0037", "bedroom 0037", "bedroom 0041") [43], the TUM RGB-D SLAM dataset ("fr1_rpy", "fr2_dishes", "fr2_desk", "fr3_long_office_household", "fr3_nostructure_texture_near_withloop" and "fr3_structure_texture_far") [45], and the ICL-NUIM dataset ("lr kt0", "lr kt1", "lr kt2", "of kt0", "of kt1" and "of kt2") [18]. The first two datasets are acquired with a Kinect sensor while the last one is synthetic. When testing on a specific dataset, we directly use its given intrinsic parameters, e.g., focal length and principal point, to compute the camera pose during tracking. All experiments are implemented in TensorFlow and run on a desktop with an Intel 2.4 GHz CPU, 32 GB RAM and an Nvidia Titan X 12 GB GPU.

We use the NYU RGB-D V2 dataset as our preliminary training dataset. There are in total 1449 RGB and depth image pairs in the NYU dataset. Following the official split, we use 795 and 654 image pairs for training and testing, respectively. We augment the training data with rotation and flipping operations, up to 14K images. We use the ResNet-101 network parameters pretrained on ImageNet to initialize the network, and randomly initialize the other modules. We then train the model weights with an L2-norm regression loss. We use the SGD optimizer with momentum 0.9. The learning rate is initialized to 1e-4 for all layers and decreased by a factor of 0.9 every epoch. The trained model is then fine-tuned on either the TUM or the ICL-NUIM dataset when testing on the corresponding dataset ^6.

Four commonly used measurements are applied for quantitative comparison:

- Root mean squared error (rmse): $\sqrt{\frac{1}{N}\sum_{p}\left(d_p^{gt} - d_p\right)^{2}}$
- Average log error (log): $\sqrt{\frac{1}{N}\sum_{p}\left(\log(d_p^{gt}) - \log(d_p)\right)^{2}}$
- Absolute relative error (abs.rel): $\frac{1}{N}\sum_{p}\frac{|d_p - d_p^{gt}|}{d_p^{gt}}$
- Accuracy with threshold thr: percentage (%) of $d_p$ s.t. $\max\!\left(\frac{d_p^{gt}}{d_p}, \frac{d_p}{d_p^{gt}}\right) = \delta < thr$

where $d_p^{gt}$ and $d_p$ denote the ground-truth and estimated depth of pixel p, respectively.

^6 The TUM or ICL-NUIM test sequences are excluded from the fine-tuning data to ensure fairness.
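For reference, the four measurements can be computed as in the following sketch; masking out invalid (zero) depth pixels is our assumption about the evaluation protocol.

```python
import numpy as np

def depth_metrics(d_gt, d_pred):
    """Compute rmse, log error, abs.rel and threshold accuracies for one
    depth map pair; invalid (zero) pixels are masked out by assumption."""
    mask = (d_gt > 0) & (d_pred > 0)
    gt, pr = d_gt[mask], d_pred[mask]
    rmse = np.sqrt(np.mean((gt - pr) ** 2))
    log_err = np.sqrt(np.mean((np.log(gt) - np.log(pr)) ** 2))
    abs_rel = np.mean(np.abs(pr - gt) / gt)
    delta = np.maximum(gt / pr, pr / gt)
    acc = [float(np.mean(delta < 1.25 ** k)) for k in (1, 2, 3)]
    return rmse, log_err, abs_rel, acc
```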
Table 2: Comparison in terms of reconstruction density (%) on benchmark datasets. The first five methods are feature-based SLAM systems and the last four are direct-based SLAM systems.

Sequence      | ORB [36] | Remode [39] | StD [34] | RSS [3] | Ours  | LSD [11] | LSD-BS [11] | CNN-SLAM [46] | OADPN [33]
ICL/"lr kt0"  | 0.03     | 4.48        | 22.35    | 24.02   | 24.28 | 0.36     | 1.43        | 12.84         | 22.70
ICL/"lr kt1"  | 0.02     | 2.43        | 34.27    | 35.44   | 37.10 | 0.06     | 3.03        | 13.04         | 25.72
ICL/"lr kt2"  | 0.01     | 8.68        | 27.53    | 27.93   | 28.69 | 0.17     | 1.81        | 26.56         | 22.80
ICL/"of kt0"  | 0.02     | 4.48        | 10.65    | 12.78   | 13.65 | 0.33     | 0.60        | 19.41         | 22.94
ICL/"of kt1"  | 0.02     | 3.13        | 41.33    | 42.10   | 43.46 | 0.04     | 4.76        | 29.15         | 34.65
ICL/"of kt2"  | 0.04     | 16.71       | 35.67    | 37.87   | 39.59 | 0.08     | 1.44        | 37.23         | 22.06
TUM/fr3_long  | 0.03     | 9.55        | 17.23    | 18.64   | 19.26 | 0.09     | 3.80        | 12.48         | 20.14
TUM/fr3_str   | 0.03     | 6.74        | 38.29    | 39.18   | 40.07 | 0.04     | 6.45        | 27.40         | 35.77
Average       | 0.03     | 7.03        | 28.42    | 29.75   | 30.76 | 0.15     | 2.92        | 22.26         | 25.85
4.1. Parameters Adaptation

To test the influence of the parameters on the stability and the recovery quality, we evaluate the depth maps reconstructed by our method under different parameter settings on the chosen datasets. We compute the average key-frame reconstruction rmse and accuracy for each video sequence with respect to specific values of λ and σ. Results are presented in Fig. 5. We analyze the sensitivity of each parameter as well as its adaptation as follows.

1) λ: This parameter controls the balance between the data term and the regularization term. A suitable value of λ ensures the smoothing property, i.e., rejecting more outliers in the output depth map. However, an excessively large value causes an obvious decrease in the accuracy of the recovered results. Fig. 5(a) shows that λ ∈ [5², 35²] yields different measurements in terms of both rmse and accuracy. Note that there is a knee point at 25²: for all the displayed examples, the lowest rmse and highest accuracy are achieved at that knee point. We therefore set λ = 25² in our implementation.

2) σ: The smoothness constraint of the energy function (8) is adaptively enforced using the spatially varying weighting function (9) defined on the color image, and σ controls the influence of the weight on the sharpness of depth boundaries. As shown in Fig. 5(b), the rmse curve decreases significantly before σ reaches 0.1. When stepping into the interval [0.1, 0.2], the curve starts to fluctuate. According to the displayed depth patches at different values of σ, the best visual effect appears at 0.1, which well preserves the depth boundaries and contributes greatly to the subsequent scene reconstruction. Since further improving the accuracy comes at the cost of degrading the 3D scene structure when σ exceeds 0.1, we make a trade-off between accuracy and structural completeness and choose σ = 0.1 in our implementation.

Figure 6: Percentage of the inlier pixels over the whole pixel set under different thresholds (0.18-0.46 m). Three methods (mean ratio, least-square, and RANSAC) are compared in the bar chart.
4.2. Evaluation on Scale Consensus
To evaluate the effectiveness of the scale computation, we compare our RANSAC-based least-squares strategy with the mean-ratio and least-squares methods introduced in Sec. 3.3. We unify the scale between Ω_D' and Ω_D̃ using the three methods, and compute the deviation for each depth pair at the same location in both sets. Then, we count the number of inliers whose deviation errors fall within a threshold range of [0.18, 0.46] m and compute the percentage over the whole pixel set. Fig. 6 shows the statistical results on the NYU dataset. Our method outperforms the other two methods under all thresholds. Specifically, with a very small tolerance (a threshold of 0.18 m), our method has far more inliers (approaching 55%) than the other two. When the threshold is set to 0.46 m, our percentage of inliers reaches 85%. Note that setting the threshold to 0.46 m is reasonable, since the depth estimation error (rmse) of the best method on the NYU dataset is around 0.50 m (see Table 3 for details).
4.3. Evaluation on Reconstruction Density
In this section, we assess reconstruction density by evaluating the percentage of correct depth values, i.e., those whose difference from the corresponding ground-truth depth is less than 10%. The compared methods can be classified into two categories: feature-based methods, including ORB-SLAM [36], Remode [39], StD [34] and RSS [3], and direct-based methods, including LSD-SLAM [11] and its improved version obtained by bootstrapping its initial scale with the ground-truth depth map (LSD-BS), CNN-SLAM [46] and OADPN [33]. Among them, ORB-SLAM, LSD, and LSD-BS represent the original monocular SLAM systems without dense mapping solutions. For a fair comparison, we re-train the StD [34] and RSS [3] models using the sparse depth points extracted by ORB features as input, based on the source code provided by the authors; the training and fine-tuning settings for their networks are kept the same as ours.

Table 2 reports the comparison results of all aforementioned methods. The depth maps reconstructed by our method are much denser than those of all other feature-based SLAMs. Surprisingly, our approach also achieves far better performance than CNN-SLAM, and results comparable to OADPN. Note that the reconstruction density of LSD-SLAM is on average far higher than that of ORB-SLAM, because direct SLAM can optimize directly on raw pixels with sufficient image gradients to reconstruct a semi-dense map. Theoretically, CNN-SLAM and OADPN, which use direct techniques, can obtain dense and accurate reconstruction more easily than our method, which uses feature-based SLAM. But in real situations, benefiting from the performance of every module designed in our framework, i.e., depth prediction, scale consensus, and depth fusion, our method achieves superior results in most cases.

Table 3: Quantitative results of depth estimation on three benchmark datasets. Errors (rmse, log, abs.rel): lower is better; accuracies (δ < 1.25, 1.25², 1.25³): higher is better.

NYU RGB-D V2 ("bathroom 0003", "bathroom 0007", "kitchen 0046", "kitchen 0037", "bedroom 0037", "bedroom 0041"):

Method            | rmse (m) | log  | abs.rel | 1.25 | 1.25² | 1.25³
Eigen et al. [9]  | 0.64     | 0.23 | 0.16    | 0.74 | 0.94  | 0.98
Liu et al. [31]   | 0.73     | 0.33 | 0.33    | 0.59 | 0.81  | 0.91
Laina et al. [29] | 0.51     | 0.22 | 0.18    | 0.84 | 0.94  | 0.97
StD-RGB [34]      | 0.51     | 0.21 | 0.14    | 0.81 | 0.96  | 0.98
RSS-RGB [3]       | 0.73     | 0.19 | 0.15    | 0.67 | 0.90  | 0.97
Our_C             | 0.50     | 0.19 | 0.15    | 0.82 | 0.95  | 0.98
PE_S [47]         | 0.52     | 0.21 | 0.12    | 0.83 | 0.95  | 0.98
PE_N [47]         | 0.45     | 0.17 | 0.09    | 0.89 | 0.96  | 0.99
StD [34]          | 0.48     | 0.17 | 0.13    | 0.82 | 0.95  | 0.98
RSS [3]           | 0.45     | 0.18 | 0.14    | 0.87 | 0.93  | 0.99
Our_F             | 0.42     | 0.16 | 0.08    | 0.91 | 0.97  | 0.99

TUM RGB-D SLAM ("fr1_rpy", "fr2_dishes", "fr2_desk", "fr3_long_office_household", "fr3_nostructure_texture_near_withloop", "fr3_structure_texture_far"):

Method            | rmse (m) | log  | abs.rel | 1.25 | 1.25² | 1.25³
Eigen et al. [9]  | 1.41     | 0.37 | 0.23    | 0.54 | 0.82  | 0.92
Liu et al. [31]   | 0.86     | 0.29 | 0.25    | 0.54 | 0.87  | 0.90
Laina et al. [29] | 1.07     | 0.39 | 0.25    | 0.49 | 0.75  | 0.88
Our_C             | 0.70     | 0.28 | 0.20    | 0.63 | 0.88  | 0.93
PE_S [47]         | 0.69     | 0.25 | 0.13    | 0.79 | 0.89  | 0.96
PE_N [47]         | 0.65     | 0.24 | 0.12    | 0.83 | 0.90  | 0.96
StD [34]          | 0.70     | 0.27 | 0.13    | 0.78 | 0.89  | 0.96
RSS [3]           | 0.65     | 0.24 | 0.12    | 0.81 | 0.92  | 0.97
Our_F             | 0.62     | 0.23 | 0.10    | 0.83 | 0.95  | 0.97

ICL-NUIM ("lr kt0", "lr kt1", "lr kt2", "of kt0", "of kt1", "of kt2"):

Method            | rmse (m) | log  | abs.rel | 1.25 | 1.25² | 1.25³
Eigen et al. [9]  | 0.83     | 0.43 | 0.30    | 0.47 | 0.78  | 0.90
Liu et al. [31]   | 0.81     | 0.41 | 0.45    | 0.47 | 0.71  | 0.87
Laina et al. [29] | 0.54     | 0.28 | 0.23    | 0.59 | 0.83  | 0.95
Our_C             | 0.36     | 0.18 | 0.16    | 0.74 | 0.96  | 0.98
PE_S [47]         | 0.32     | 0.18 | 0.12    | 0.83 | 0.97  | 0.99
PE_N [47]         | 0.22     | 0.12 | 0.07    | 0.93 | 0.99  | 0.99
StD [34]          | 0.36     | 0.18 | 0.15    | 0.84 | 0.95  | 0.98
RSS [3]           | 0.33     | 0.19 | 0.15    | 0.85 | 0.95  | 0.97
Our_F             | 0.30     | 0.13 | 0.14    | 0.89 | 0.99  | 0.99
4.4. Evaluation on Depth Estimation Accuracy
In this section, we compare our depth fusion method (Our_F) with the photometric error method (PE) with smoothness constraint (PE_S) [47] and with surface normal constraint (PE_N) [47], as well as StD [34] and RSS [3]. The former two are dense solutions based on direct SLAM but with different priors, while the last two can be classified as feature-based depth fusion methods that take the sparse depth map and a color image as input and predict a dense depth map with a CNN. Besides, we also compare pure CNN-based depth estimation methods, i.e., Eigen et al. [9], Liu et al. [31] and Laina et al. [29], with our proposed CNN architecture (Our_C). StD-RGB and RSS-RGB are the pure depth estimation versions of StD and RSS without sparse depth maps as input, respectively. All the other CNN architectures are re-trained on the specific dataset where necessary. Quantitative results on the benchmark sequences are given in Table 3, and the analysis is given in detail as follows.
Figure 7: Visual comparison on three benchmark datasets, from top-to-bottom are NYU-D V2 βbedroom_0041β and βbathroom_0003β sequences, TUM βfr2_deskβ sequence, ICL-NUIM βlr kt0β sequence. (a) Input image; (b) Ground truth depth; (c) Depth estimation by Liu et al. [31]; (d) Depth estimation by Laina et al. [29]; (e) Our CNN-inferred depth map; (f) Our fused depth map. Regions in red rectangle are enlarged for better visualization.
Firstly, our CNN-inferred depth maps have almost the lowest error and highest accuracy compared with those estimated by other CNN-based methods on all three benchmark datasets, which Fig. 7 also demonstrates visually. In Fig. 7, the depth maps estimated by Liu et al. suffer from large wrongly estimated depth areas. The results of Laina et al. achieve relatively higher accuracy than Liu et al., but tend to be blurred at depth boundaries and lose scene structures.
Secondly, our depth fusion method (Our_F) achieves superior performance on the NYU and TUM datasets, but is slightly inferior to the two PE methods on the ICL-NUIM dataset. Note that PE formulates the data term of the depth estimation problem based on the photometric difference, and the assumption of photometric consistency is not always satisfied in real environments.
Figure 8: Visual comparison on an example from NYU dataset. (a) Color image; Depth maps estimated by (b) StD [34], (c) RSS [3], and (d) Our_F; (e) GT depth map.
Figure 9: Qualitative depth estimation results on three benchmark datasets, from top-to-bottom are NYU-D V2 βbedroom_0041β and βbathroom_0003β sequences, TUM βfr2_deskβ sequence, ICL-NUIM βlr kt0β sequence, βlr kt3β sequence and βof kt2β sequence. An alternative view of our fusion result is shown for better visualization.
This leads to relatively larger estimation errors on the first two real datasets. For the synthetic ICL-NUIM dataset, the illumination is fixed throughout the scene, which makes it easier for the PE methods to find intensity consistency. On the contrary, our method is not constrained by the photometric-consistency assumption and achieves satisfying quantitative performance using extremely sparse depth map-points, i.e., far fewer available depth samples, to obtain results comparable to PE. In Fig. 8, our results well preserve the depth boundaries and at the same time achieve the highest accuracy compared with StD [34] and RSS [3]. Lastly, although the ORB features are sparse, the positions and values of these features contain important depth cues that are complementary to the CNN-inferred depth. The ORB features are usually extracted from high-texture regions
like object boundaries, where the CNN-inferred depth map tends to be blurred. On the contrary, the CNN provides dense and globally accurate estimates in low-texture image regions where ORB-SLAM cannot. The experimental results in Table 3 and Fig. 7(e)(f) also validate this. Thanks to the fusion of these two heterogeneous depth sources, our fused results perform far better than results that use only the CNN to estimate scene depth. Fig. 9 further shows the 3D reconstruction results obtained via our CNN and our depth fusion, respectively. The reconstructions based on the fused depth present precise and undistorted scenes from both the original and the alternative view, and are more similar to the ground truth than those from our CNN alone. These visual results demonstrate the effectiveness of our depth fusion and reconstruction framework.
Table 4: Quantitative results of the ablation study on three benchmark datasets. Errors (rmse, log, abs.rel): lower is better; accuracies (δ < 1.25, 1.25², 1.25³): higher is better.

NYU RGB-D:
Method                     | rmse (m) | log  | abs.rel | 1.25 | 1.25² | 1.25³
Liu et al. [31]            | 0.73     | 0.33 | 0.33    | 0.59 | 0.81  | 0.91
Liu et al. [31] + Fusion   | 0.65     | 0.30 | 0.29    | 0.62 | 0.83  | 0.94
Laina et al. [29]          | 0.51     | 0.22 | 0.18    | 0.84 | 0.94  | 0.97
Laina et al. [29] + Fusion | 0.44     | 0.19 | 0.16    | 0.85 | 0.95  | 0.98
Our_C                      | 0.50     | 0.19 | 0.16    | 0.81 | 0.95  | 0.98
Our_F w/o Confidence       | 0.48     | 0.20 | 0.16    | 0.83 | 0.95  | 0.98
Our_F                      | 0.44     | 0.16 | 0.09    | 0.90 | 0.97  | 0.99

TUM RGB-D:
Method                     | rmse (m) | log  | abs.rel | 1.25 | 1.25² | 1.25³
Liu et al. [31]            | 0.86     | 0.29 | 0.25    | 0.54 | 0.87  | 0.90
Liu et al. [31] + Fusion   | 0.81     | 0.28 | 0.24    | 0.56 | 0.89  | 0.95
Laina et al. [29]          | 1.07     | 0.39 | 0.25    | 0.49 | 0.75  | 0.88
Laina et al. [29] + Fusion | 0.91     | 0.32 | 0.22    | 0.57 | 0.82  | 0.92
Our_C                      | 0.70     | 0.28 | 0.20    | 0.63 | 0.88  | 0.93
Our_F w/o Confidence       | 0.67     | 0.26 | 0.18    | 0.67 | 0.90  | 0.94
Our_F                      | 0.62     | 0.23 | 0.10    | 0.83 | 0.95  | 0.97

ICL-NUIM:
Method                     | rmse (m) | log  | abs.rel | 1.25 | 1.25² | 1.25³
Liu et al. [31]            | 0.81     | 0.41 | 0.45    | 0.47 | 0.71  | 0.87
Liu et al. [31] + Fusion   | 0.64     | 0.32 | 0.34    | 0.55 | 0.82  | 0.92
Laina et al. [29]          | 0.54     | 0.28 | 0.23    | 0.59 | 0.83  | 0.95
Laina et al. [29] + Fusion | 0.41     | 0.23 | 0.19    | 0.65 | 0.89  | 0.98
Our_C                      | 0.36     | 0.18 | 0.16    | 0.74 | 0.96  | 0.98
Our_F w/o Confidence       | 0.35     | 0.17 | 0.16    | 0.76 | 0.97  | 0.98
Our_F                      | 0.30     | 0.13 | 0.14    | 0.89 | 0.99  | 0.99
4.5. Ablation Study
To further demonstrate the superiority of our depth reconstruction framework, we combine other depth estimation methods, i.e., Liu et al. [31] and Laina et al. [29], with our fusion module, denoted as "Liu et al. + Fusion" and "Laina et al. + Fusion", respectively. The results are presented in Table 4. The performance improves obviously on all three benchmark datasets compared with the corresponding depth estimation methods alone, which further verifies the effectiveness and adaptability of the proposed depth fusion module. Besides, we also evaluate the effectiveness of our confidence map H. We simply replace the confidence map H with a binary mask that only indicates the valid depth pixels without confidence, denoted as "Our_F w/o Confidence". The results are also reported in Table 4. In contrast with the CNN-estimated results (Our_C), the improvements are extremely limited without the confidence map, since all observed depth pixels are treated equally without considering their relative significance and accuracy. Actually, the sparse ORB depth samples have higher accuracy than those from the CNN prediction; on the other hand, the CNN-inferred depth is comparatively dense but has blurry depth boundaries due to the repeated combination of max-pooling and downsampling in the CNN layers. Therefore, by exploiting the confidence map to distinguish different depth pixels and assign weights reasonably, the performance (Our_F) is improved obviously. Moreover, we visualize our confidence maps for the CNN-inferred depth maps and compare them with the method proposed by Yang et al. [52].
Figure 10: Visualization of the confidence maps between Yang et al. [52] and ours. (a) GT depth maps, (b) The inferred depth maps, (c) Our confidence maps, (d) The results generated by Yang et al. [52].
Yang et al. [52] proposed a Bayesian DeNet to concurrently output a depth map and its corresponding uncertainty map for each video frame, which can be regarded as a learning-based way to infer pixel confidence. As shown in Fig. 10, the depth maps inferred by our CNN are subject to large blurring artifacts along depth boundaries, and thus pixels approaching depth boundaries are assigned small confidence values. Similar behavior can be seen in the results of Yang et al. [52].
4.6. Runtime
The proposed DRM-SLAM has two main time-consuming components: camera tracking and depth reconstruction. Camera tracking runs at a frame rate of around 25-30 fps. For depth reconstruction, the scale factor is computed once tracking has been initialized successfully and runs stably, and it is updated along with local bundle adjustment (BA) to handle scale drift. Since this update is infrequent, it takes a negligible share of the processing time. The speed of depth estimation and depth fusion therefore becomes the bottleneck of time efficiency in our framework. Depth estimation from the CNN is applied only on selected key-frames, at a rate of up to 5 key-frames per second. For depth fusion, three algorithms for solving the depth interpolation problem are compared in terms of running time and accuracy in Table 5: two local filtering methods (the guided filter (GF) [20] and the domain transform (DT) [15]), and the original weighted least squares (WLS) [12], which uses the same modeling as ours but solves the linear system of Eq. (10) directly. The proposed method has a runtime comparable to the local filtering-based algorithms, but its global optimization formulation overcomes the short-sighted local judgement of these filters and achieves higher accuracy. It also achieves quality close to that of the state-of-the-art WLS method while running about 26 times faster. Overall, the proposed DRM-SLAM runs nearly in real time by combining feature-based camera tracking (ORB-SLAM) with our depth reconstruction framework.

Table 5
Comparison of different solutions to depth reconstruction in terms of time and accuracy.

                       GF [20]    DT [15]    WLS [12]    Ours
Runtime (s)            0.16       0.09       3.9         0.15
RMSE (m)               0.45       0.47       0.36        0.39
Accuracy (δ < 1.25)    75.2%      77.3%      88.4%       85.6%
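For reference, the "WLS [12]" baseline in Table 5 corresponds to assembling the weighted least-squares depth interpolation as one large sparse linear system and solving it directly. The sketch below is a simplified 4-neighbour version with illustrative edge-aware weights; it does not reproduce Eq. (10) or the faster solver used in our implementation, and the parameters lam and eps are assumptions.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def wls_depth_interpolation(sparse_depth, confidence, guide, lam=1.0, eps=1e-4):
    """Solve (W + lam * L) x = W d for a dense depth map x (direct sparse solve).

    sparse_depth: HxW observed depths (0 where unobserved).
    confidence:   HxW data-term weights (0 where unobserved); at least one
                  pixel must be observed for the system to be well posed.
    guide:        HxW grayscale image guiding the edge-aware smoothness weights.
    """
    h, w = sparse_depth.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    g = guide.astype(np.float64).ravel()

    # 4-neighbour edges of the pixel grid (right and down).
    i = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    j = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])
    s = 1.0 / (np.abs(g[i] - g[j]) + eps)          # illustrative edge-aware weights

    # Weighted graph Laplacian L: +s on the diagonal, -s off-diagonal (duplicates sum).
    rows = np.concatenate([i, j, i, j])
    cols = np.concatenate([i, j, j, i])
    vals = np.concatenate([s, s, -s, -s])
    L = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()

    W = sp.diags(confidence.ravel())               # data-term weights
    A = (W + lam * L).tocsc()
    b = confidence.ravel() * sparse_depth.ravel()
    return spla.spsolve(A, b).reshape(h, w)
```

A direct solve of this kind scales poorly with image resolution, which is consistent with Table 5: the WLS baseline takes seconds per frame, while the filtering-based alternatives and our formulation run an order of magnitude faster.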
4.7. Evaluation under Challenging Situations
Figure 11: Comparison under pure rotational camera motion on the sequence "fr1 rpy": reconstruction results obtained from the ground truth, our approach, and LSD-SLAM.
Figure 12: Comparison under the low-texture situation on the sequence "fr2 dishes": reconstruction results obtained from the ground truth, our approach, and Remode.
As mentioned, one advantage of our DRM-SLAM over traditional monocular SLAM is that, under pure rotational motion and in low-texture situations, a reconstruction can still be obtained by relying on the CNN-predicted depth map. To demonstrate this benefit, we evaluate our method on the sequences "fr1 rpy" and "fr2 dishes" from the TUM dataset. The sequence "fr1 rpy" is recorded under pure rotational camera motion, while "fr2 dishes" contains large low-texture areas such as walls, the floor, and desktops. The reconstruction results on "fr1 rpy" obtained by our approach and LSD-SLAM are shown in Fig. 11. Our method reconstructs the rough scene structure even though the camera motion is purely rotational, whereas LSD-SLAM fails to produce a 3D reconstruction. Traditional monocular SLAM approaches estimate depth largely by building stereo correspondences or epipolar geometry between two views. However, the stereo baselines and the geometric relationship are destroyed by pure rotational motion, leading to a chaotic reconstruction. In contrast, the depth estimated by the CNN is not affected by this, and our approach therefore achieves a clearly better result. The reconstruction results on "fr2 dishes" obtained by our approach and Remode are shown in Fig. 12. Note that Remode is based on direct SLAM and uses stereo matching techniques to compute a semi-dense depth map. The accuracy of stereo matching suffers from matching ambiguity in textureless areas. Therefore, under the low-texture situation, the result of Remode deforms badly, while our approach generates a result consistent with the ground truth.
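The failure mode under pure rotation can be seen directly from the two-view geometry: triangulation becomes ill-conditioned as the baseline between the two camera centres shrinks, and under pure rotation that baseline is essentially zero. A minimal check of this quantity, assuming 4x4 world-from-camera pose matrices, is sketched below; it is illustrative only and not part of our pipeline.

```python
import numpy as np

def stereo_baseline(pose_a, pose_b):
    """Distance between two camera centres, given 4x4 world-from-camera poses.
    Under pure rotational motion this is ~0, so two-view triangulation is
    ill-conditioned, while single-image CNN depth prediction is unaffected."""
    return np.linalg.norm(pose_a[:3, 3] - pose_b[:3, 3])
```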
4.8. Evaluation on Our Captured Dataset
To further verify the effectiveness of our method, we capture a real video sequence in our laboratory (called "LAB" for short) using a Point Grey Flea3 color camera. The video is about 40 seconds long, with a resolution of 480×640 and a frame rate of 25 Hz. The CNN-inferred depth maps for this video are obtained directly with the model trained on the NYU dataset. Then, scale adjustment and depth fusion are performed on each extracted color key-frame to obtain the final results. The recovered depth maps and the corresponding dense reconstruction results are shown in Fig. 13. The fused depth maps show superior quality compared with those estimated directly by the CNN, and yield an accurate, geometry-preserving dense scene reconstruction, which demonstrates the practicability of our method on real video sequences.
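The scale adjustment step mentioned above aligns the up-to-scale ORB-SLAM depths with the metric CNN prediction before fusion. The estimator below, a median of per-sample depth ratios, is one simple and robust possibility shown only for illustration; the exact scheme used in our framework is the one described earlier in the paper.

```python
import numpy as np

def estimate_global_scale(orb_depth, cnn_depth):
    """Illustrative global scale aligning ORB-SLAM depths to the CNN's metric scale.

    orb_depth: HxW sparse depth from ORB-SLAM (0 where no sample).
    cnn_depth: HxW dense CNN-predicted depth for the same key-frame.
    """
    valid = (orb_depth > 0) & (cnn_depth > 0)
    return np.median(cnn_depth[valid] / orb_depth[valid])   # robust to outlier samples

# Usage sketch: rescale the sparse samples before depth fusion.
# orb_metric = estimate_global_scale(orb_depth, cnn_depth) * orb_depth
```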
Figure 13: Evaluation on our captured "LAB" dataset. (a) Color key-frames, (b) sparse depth points extracted from ORB-SLAM, (c) depth maps estimated by our CNN, (d) our fused results, (e) the reconstruction result from LSD-SLAM, (f) and (g) our dense reconstruction from different views.

5. Conclusion and Future Work

This paper proposes a dense reconstruction method under the monocular SLAM framework (DRM-SLAM), in which a novel scene depth fusion scheme is designed to fully utilize both the sparse depth samples from monocular SLAM
and the predicted dense depth maps from the CNN. In the scheme, a CNN architecture is carefully designed for robust depth estimation. Besides, our approach also accounts for the problem of scale ambiguity in monocular SLAM. Extensive experiments and the ablation study demonstrate the accuracy and robustness of the proposed DRM-SLAM. Our DRM-SLAM still has some limitations, which we plan to address in future work. First, the reconstruction density and depth estimation accuracy still have room for improvement. The current depth fusion framework is split into two separate parts, i.e., the depth estimation and depth fusion modules. A natural next step is to design a CNN architecture without hand-crafted modules that can be trained end-to-end to output a high-quality depth map by taking both the color image and the ORB depth samples as input. In addition, as in [50], we could utilize intermediate multi-modal outputs from multi-task predictions, e.g., contour prediction and semantic parsing, as guidance to facilitate the final depth estimation task. Second, since the absolute scale in the current framework is estimated from the CNN-inferred depth and therefore depends heavily on the accuracy of the CNN depth estimation, we expect that the scale could be estimated more precisely with the help of an inertial measurement unit (IMU). The IMU sensor provides absolute measurements of the camera state; by fusing pre-integrated IMU measurements with feature observations, the odometry can achieve higher accuracy without scale ambiguity. Finally, the estimated dense depth can in turn be used to improve camera pose estimation, making the tracking process more robust even under pure rotational motion and in low-texture situations.
References
[1] Bay, H., Ess, A., Tuytelaars, T., Van Gool, L., 2008. Speeded-up robust features (SURF). Computer Vision and Image Understanding 110, 346–359.
[2] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, 1–1.
[3] Chen, Z., Badrinarayanan, V., Drozdov, G., Rabinovich, A., 2018. Estimating depth from RGB and sparse sensing, in: European Conference on Computer Vision.
[4] Concha, A., Civera, J., 2015. DPPTAM: Dense piecewise planar tracking and mapping from a monocular sequence, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5686–5693.
[5] Concha, A., Hussain, M.W., Montano, L., Civera, J., 2014. Manhattan and piecewise-planar constraints for dense monocular mapping, in: Robotics: Science and Systems.
[6] Scaramuzza, D., Fraundorfer, F., 2011. Visual odometry: Part 1: The first 30 years and fundamentals. IEEE Robotics & Automation Magazine.
[7] Davison, A.J., 2008. Real-time simultaneous localisation and mapping with a single camera, in: IEEE International Conference on Computer Vision, p. 1403.
[8] Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O., 2007. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1052–1067.
[9] Eigen, D., Fergus, R., 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, in: IEEE International Conference on Computer Vision, pp. 2650–2658.
[10] Engel, J., Cremers, D., 2013. Semi-dense visual odometry for a monocular camera, in: IEEE International Conference on Computer Vision, pp. 1449–1456.
[11] Engel, J., Schöps, T., Cremers, D., 2014. LSD-SLAM: Large-scale direct monocular SLAM, in: European Conference on Computer Vision, Springer, pp. 834–849.
[12] Farbman, Z., Fattal, R., Lischinski, D., 2008. Edge-preserving decompositions for multi-scale tone and detail manipulation, in: ACM SIGGRAPH, p. 67.
[13] Flint, A., Murray, D., Reid, I., 2011. Manhattan scene understanding using monocular, stereo, and 3D features, in: IEEE International Conference on Computer Vision, pp. 2228–2235.
[14] Fraundorfer, F., Scaramuzza, D., 2012. Visual odometry: Part 2: Matching, robustness, optimization, and applications. IEEE Robotics & Automation Magazine 19, 78–90.
[15] Gastal, E.S.L., Oliveira, M.M., 2011. Domain transform for edge-aware image and video processing. ACM Transactions on Graphics 30, 1–12.
[16] Graber, G., Pock, T., Bischof, H., 2011. Online 3D reconstruction using convex optimization, in: IEEE International Conference on Computer Vision Workshops, pp. 708–711.
[17] Guan, T., Wang, C., 2009. Registration based on scene recognition and natural features tracking techniques for wide-area augmented reality systems. IEEE Transactions on Multimedia 11, 1393–1406.
[18] Handa, A., Whelan, T., McDonald, J., Davison, A.J., 2014. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM, in: IEEE International Conference on Robotics and Automation, pp. 1524–1531.
[19] He, K., Sun, J., Tang, X., 2010. Guided image filtering, in: European Conference on Computer Vision, pp. 1–14.
[20] He, K., Sun, J., Tang, X., 2013. Guided image filtering. IEEE Transactions on Pattern Analysis & Machine Intelligence 35, 1397–1409.
[21] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[22] Herrera, D., Kannala, J., Heikkilä, J., et al., 2013. Depth map inpainting under a second-order smoothness prior, in: Scandinavian Conference on Image Analysis, Springer, pp. 555–566.
[23] Keller, M., Lefloch, D., Lambers, M., Izadi, S., Weyrich, T., Kolb, A., 2013. Real-time 3D reconstruction in dynamic scenes using point-based fusion, in: International Conference on 3D Vision (3DV), IEEE, pp. 1–8.
[24] Khan, I., 2017. Robust sparse and dense non-rigid structure from motion. IEEE Transactions on Multimedia, 1–1.
[25] Klein, G., Murray, D., 2007. Parallel tracking and mapping for small AR workspaces, in: IEEE International Symposium on Mixed and Augmented Reality, pp. 225–234.
[26] Klein, G., Murray, D., 2008. Improving the agility of keyframe-based SLAM, in: European Conference on Computer Vision, pp. 802–815.
[27] Koutis, I., Miller, G.L., Tolliver, D., 2009. Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing, in: International Symposium on Advances in Visual Computing, pp. 1067–1078.
[28] Krishnan, D., Fattal, R., Szeliski, R., 2013. Efficient preconditioning of Laplacian matrices for computer graphics. ACM Transactions on Graphics 32, 1–15.
[29] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N., 2016. Deeper depth prediction with fully convolutional residual networks, in: International Conference on 3D Vision, IEEE, pp. 239–248.
[30] Lang, M., Wang, O., Aydin, T., Smolic, A., Gross, M., 2012. Practical temporal consistency for image-based graphics applications. ACM Transactions on Graphics 31, 1–8.
[31] Liu, F., Shen, C., Lin, G., Reid, I., 2016. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2024–2039.
[32] Lowe, D.G., 1999. Object recognition from local scale-invariant features, in: IEEE International Conference on Computer Vision, pp. 1150–1157.
[33] Luo, H., Gao, Y., Wu, Y., Liao, C., Yang, X., Cheng, K., 2019. Real-time dense monocular SLAM with online adapted depth prediction network. IEEE Transactions on Multimedia 21, 470–483.
[34] Ma, F., Karaman, S., 2017. Sparse-to-dense: Depth prediction from sparse depth samples and a single image, in: IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
[35] Min, D., Choi, S., Lu, J., Ham, B., Sohn, K., Do, M.N., 2014. Fast global image smoothing based on weighted least squares. IEEE Transactions on Image Processing 23, 5638.
[36] Mur-Artal, R., Montiel, J.M.M., Tardos, J.D., 2015. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31, 1147–1163.
[37] Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohi, P., Shotton, J., Hodges, S., Fitzgibbon, A., 2011a. KinectFusion: Real-time dense surface mapping and tracking, in: IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136.
[38] Newcombe, R.A., Lovegrove, S.J., Davison, A.J., 2011b. DTAM: Dense tracking and mapping in real-time, in: IEEE International Conference on Computer Vision, pp. 2320–2327.
[39] Pizzoli, M., Forster, C., Scaramuzza, D., 2014. REMODE: Probabilistic, monocular dense reconstruction in real time, in: IEEE International Conference on Robotics and Automation, pp. 2609–2616.
[40] Pradeep, V., Rhemann, C., Izadi, S., Zach, C., 2013. MonoFusion: Real-time 3D reconstruction of small scenes with a single web camera, in: IEEE International Symposium on Mixed and Augmented Reality, pp. 83–88.
[41] Rublee, E., Rabaud, V., Konolige, K., Bradski, G., 2011. ORB: An efficient alternative to SIFT or SURF, in: IEEE International Conference on Computer Vision, pp. 2564–2571.
[42] Shum, H.Y., Ng, K.T., Chan, S.C., 2005. A virtual reality system using the concentric mosaic: construction, rendering, and data compression. IEEE Transactions on Multimedia 7, 85–95.
[43] Silberman, N., Hoiem, D., Kohli, P., Fergus, R., 2012. Indoor segmentation and support inference from RGBD images, in: European Conference on Computer Vision, pp. 746–760.
[44] Strasdat, H., Montiel, J.M.M., Davison, A.J., 2012. Visual SLAM: Why filter? Image and Vision Computing 30, 65–77.
[45] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D., 2012. A benchmark for the evaluation of RGB-D SLAM systems, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573–580.
[46] Tateno, K., Tombari, F., Laina, I., Navab, N., 2017. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6565–6574.
[47] Weerasekera, C.S., Latif, Y., Garg, R., Reid, I., 2017. Dense monocular reconstruction using surface normals, in: IEEE International Conference on Robotics and Automation, pp. 2524–2531.
[48] Whelan, T., Kaess, M., Johannsson, H., Fallon, M., Leonard, J.J., McDonald, J., 2015. Real-time large-scale dense RGB-D SLAM with volumetric fusion. The International Journal of Robotics Research 34, 598–626.
[49] Whelan, T., Salas-Moreno, R.F., Glocker, B., Davison, A.J., Leutenegger, S., 2016. ElasticFusion: Real-time dense SLAM and light source estimation. International Journal of Robotics Research.
[50] Xu, D., Ouyang, W., Wang, X., Sebe, N., 2018. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. arXiv preprint arXiv:1805.04409.
[51] Yang, X., Chen, J., Wang, Z., Zhang, Q., Liu, W., Liao, C., Cheng, K., 2018. Monocular camera based real-time dense mapping using generative adversarial network, in: ACM Multimedia Conference (MM), pp. 896–904.
[52] Yang, X., Gao, Y., Luo, H., Liao, C., Cheng, K.T., 2019. Bayesian DeNet: Monocular depth prediction and frame-wise fusion with synchronized uncertainty. IEEE Transactions on Multimedia PP, 1–1.
[53] Yu, F., Koltun, V., 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
[54] Zhou, H., Li, X., Sadka, A.H., 2012. Nonrigid structure-from-motion from 2-D images using Markov chain Monte Carlo. IEEE Transactions on Multimedia 14, 168–177.
[55] Zhou, Z., Shi, F., Xiao, J., Wu, W., 2015. Non-rigid structure-from-motion on degenerate deformations with low-rank shape deformation model. IEEE Transactions on Multimedia 17, 171–185.
Xinchen Ye (M'17) received the B.E. degree and Ph.D. degree from Tianjin University, Tianjin, China, in 2012 and 2016, respectively. He was with the Signal Processing Laboratory, EPFL, Lausanne, Switzerland, in 2015 under a grant from the Swiss federal government. He has been a faculty member of Dalian University of Technology, Dalian, Liaoning, China, since 2016, where he is currently an Assistant Professor with the DUT-RU International School of Information Science and Engineering. His current research interests include image/video processing and 3D imaging. As a co-author, he received the Platinum Best Paper Award at IEEE ICME 2017.
Xiang Ji received the B.S. degree in software engineering in 2016 from the Tianjin Normal University, Tianjin, China. He is currently a graduate student at the School of Software in Dalian University of Technology. His research interests include SLAM, computer vision and deep learning.
Zhihui Wang received the B.S. degree in software engineering in 2004 from Northeastern University, Shenyang, China. She received the M.S. degree in software engineering in 2007 and the Ph.D. degree in software and theory of computer in 2010, both from the Dalian University of Technology, Dalian, China. Since November 2011, she has been a visiting scholar at the University of Washington. Her current research interests include information hiding and image compression.
Haojie Li is a Professor in the School of Software, Dalian University of Technology. His research interests include social media computing and multimedia information retrieval. He has co-authored over 50 journal and conference papers in these areas, including IEEE TCSVT, TMM, TIP, ACM Multimedia, ACM ICMR, etc. Dr. Li received the B.E. and the Ph. D. degrees from Nankai University, Tianjin and the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, in 1996 and 2007 respectively. From 2007 to 2009, he was a Research Fellow in the School of Computing, National University of Singapore. He is a member of IEEE and ACM.
Baoli Sun received the B.S. degree in microelectronics science and engineering in 2018 from the Hefei University of Technology, Anhui, China. He is currently a graduate student at the School of Software, Dalian University of Technology, Liaoning, China. His research interests include image processing, computer vision and deep learning.
Shenglun Chen received the B.S. degree in software engineering in 2017 from the Dalian University of Technology, Dalian, China. He is currently a graduate student at the School of Software, Dalian University of Technology, Liaoning, China. His research interests include simultaneous localization and mapping (SLAM) and 3D reconstruction.
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.

We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He/she is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from (
[email protected]). Signed by all authors: Xinchen Ye, Xiang Ji, Baoli Sun, Shenglun Chen, Zhihui Wang, Haojie Li. Sep. 16, 2019.