
Knowledge-Based Systems (2019), DOI: https://doi.org/10.1016/j.knosys.2019.105357
Received 13 February 2019; Revised 31 October 2019; Accepted 6 December 2019

UnLearnerMC: Unsupervised Learning of Dense Depth and Camera Pose Using Mask and Multiple Consistency Losses

Junning Zhang (a), Qunxing Su (a,b), Pengyuan Liu (a,*), Chao Xu (c), Zhengjun Wang (d)

a Army Engineering University, Shijiazhuang 050003, China
b Army Command College, Nanjing 210016, China
c 32181 Troops, Xi'an 710032, China
d 32181 Troops, Shijiazhuang 050003, China

* Corresponding author. E-mail address: [email protected].

Abstract

In this paper, we propose an unsupervised learning framework, named UnLearnerMC, for jointly training monocular depth, camera pose, and segmentation from videos. Existing unsupervised methods typically exploit optical flow consistency to train a segmentation network and eliminate interference from moving objects. Our key idea is to create SegMaskNet and corresponding training losses that distinguish moving objects from static scenes. Specifically, we eliminate misleading supervision by exploiting boundary masks when moving objects fall beyond the view boundary. During training, SegMaskNet continually adjusts the allocation of pixels between static scenes and moving objects via a cooperative loss so as to minimize the total loss. For the moving areas, re-estimated static depth and pose are used to eliminate their interference through the proposed photometric loop consistency loss. Experiments on the KITTI dataset show that UnLearnerMC achieves state-of-the-art results in single-view depth and camera ego-motion estimation, which illustrates the benefits of our approach.

Keywords: Deep Learning; Depth Estimation; Camera Pose; Photometric Loop Consistency Loss; Cooperative Loss

1. Introduction

Single-view depth prediction and camera pose estimation are important and long-standing tasks in 3D environment reconstruction [1-2] and autonomous driving [3-4]. However, due to the interference of various factors in realistic environments, implementations of these technologies have some inherent limitations [5]. Convolutional neural networks (CNNs) are now widely used for both tasks owing to their learning ability and robustness in complicated environments [6]. Impressive progress on depth estimation and camera pose has been made with supervision from large amounts of data [7-10]. However, annotating such datasets comes at a price, for example, using expensive laser scanners or depth cameras to capture depth. Based on geometric constraints, several studies [11-13] have proposed training CNN models in an unsupervised manner. Their core idea is the image warping technique "spatial transformer" [11], based on luminance constancy between parallax maps. For instance, Zhan et al. [13] used both spatial and temporal photometric warp errors to constrain depth and camera pose estimation.

Problem. These assumptions, however, generally do not hold in situations such as scenes with moving objects or occluded areas, which makes training unstable. Previous studies [15-17] tackled the problem by adding another training phase (RefineNet, FlowNet) to fill the gap, but the precision of these methods remains limited. A key reason is that they assume ideal depth and camera ego-motion estimates when synthesizing optical flow, whereas their static-scene constraints are disturbed by factors such as moving objects and occluded areas. In general, the constraints do not hold over the whole KITTI dataset (the most popular standard benchmark for training depth and pose), parts of which may disrupt CNN training. For example, training data for depth and camera ego-motion should not contain independently moving objects. Errors in either depth or camera pose lead to inaccurate optical flow prediction, so the performance of masking moving objects based on optical flow is also unsatisfactory [18]. In addition, jointly training depth, camera pose, and optical flow requires large hardware computing resources, which makes deployment on autonomous vehicles cumbersome.

Approach. If all video sequences were static scenes, learning depth and camera pose could be regarded as a simple geometric problem. To learn depth and camera motion only from static scenes, without interference from moving objects, our focus is to create a SegMaskNet that distinguishes moving objects from static scenes, which turns the learning of the scene into purely geometric transformation learning. The foundation of SegMaskNet is the motion characteristic that the difference between a moving object and its hypothetical static counterpart is consistent over equal time intervals. This property provides the basis for segmenting moving objects.

Fig. 1. Illustration of the proposed algorithm. A network taking an unlabeled image sequence as input estimates the target-view depth and the relative camera pose.

Therefore, to overcome the interference of moving objects, a novel unsupervised network training pipeline is proposed, as shown in Fig. 1. Specifically, the contributions of this paper are as follows:

- Without using optical flow network information, we propose UnLearnerMC, a jointly unsupervised learning framework for monocular depth, camera motion estimation, and moving object segmentation from videos.
- We combine SegMaskNet with the cooperative loss to constrain the moving object areas and restrict factors, such as occluded areas, that are not considered in the mask network.
- The photometric loop consistency loss is proposed to overcome moving object interference, which is not handled by a pure view synthesis task.
- Evaluations on the KITTI dataset show that the proposed method outperforms previous unsupervised approaches and is comparable to supervised results.

The rest of the paper is organized as follows. Section 2 reviews related work on depth and camera pose estimation, focusing on how the state of the art deals with moving object interference and invalid view regions. Section 3 presents SegMaskNet and the losses designed to ignore moving objects. Section 4 validates the algorithm on the KITTI and Cityscapes datasets. Section 5 concludes and outlines future work.

2. Related work

Estimating scene depth and camera ego-motion from images has been a central issue in computer vision [2, 5], and there have been successful cases that directly use popular deep learning networks. Here, we review the most relevant works in several areas.

Supervised depth and camera pose estimation with CNNs. Ref. [27] was the first work to estimate depth with ConvNets: the authors extracted features from global and local perspectives, improving the accuracy of pixel-block depth estimation. Brahmbhatt et al. [5] established more geometric constraints from multiple sensor data to improve depth estimation performance. Others [16-17] used more robust loss functions or combined conditional random fields (CRFs) with convolutional neural networks, resulting in better performance.

Unsupervised depth and camera pose estimation. Recent works [19, 28, 29, 33, 34] jointly estimated multiple tasks by making full use of the geometric relationships between them in an unsupervised learning framework. A novel depth estimation method was proposed based on the left-right luminosity constraint of stereo image pairs. Inspired by recent advances in direct visual odometry (DVO), Wang et al. [12] proposed a depth CNN predictor that does not use a pose CNN. However, these unsupervised learning schemes only considered rigid scenes and neglected many key problems, such as invalid view regions due to occlusion or independently moving objects.

To segment the invalid view regions, some works [14, 15, 20] used an explainability mask to remove evidence that cannot be explained by the static-scene hypothesis. For example, the invalid view region was handled by adding a universal mask to the CNN model [15] or by applying post-processing to remove edge artifacts [20]. Mahjourian et al. [14] used a principled mask to jointly train depth and camera pose on monocular video streams. A fly-out mask in [21] was applied to filter out the invalid areas caused by the motion parallax of the camera. A similar approach by Yang et al. [22] adopted 3D-ASAP, an occlusion mask, and geometry (depth, normal) to resolve fine detailed structures. In summary, these masks perform well at capturing ineffective regions resulting from camera motion parallax or from factors not considered by the model. However, they cannot control the degree of relaxation, which reduces the accuracy of the estimated depth and camera pose.

To handle interference from moving objects, additional training phases (i.e., RefineNet, FlowNet) have been introduced. For example, Yin et al. [16] proposed an optical flow consistency loss to learn depth and camera motion, which makes the synthesized target image conform better to the spatial geometric transformation. Similarly, Yang et al. [17] addressed the problem by fusing holistic 3D motion understanding into the deep learning framework. Zou et al. [18] jointly trained optical flow and depth using the consistency between them, which helped to improve depth estimation accuracy. However, if the training accuracy of the first stage (DepthNet and PoseNet) is poor, the precision of the optical flow network (FlowNet) is also reduced, preventing the other networks from using optical flow information to improve themselves. Our UnLearnerMC attempts to ignore moving objects without using an additional optical flow network, which makes deployment on autonomous vehicles feasible. The key missing part we add is a SegMaskNet that focuses only on moving objects, allowing the network to use strict geometric constraints to accurately infer the complete geometry and motion of the scene except for the moving objects.

3. Approach

The training of the whole model focuses on minimizing the reconstruction error of adjacent frames, while allowing a certain amount of relaxation to discount factors not considered in the model. In other words, the interference from moving areas is segmented by the mask network, and no image reconstruction loss is imposed on them. To prevent the model from minimizing the loss through excessive relaxation, we constrain the relaxation factor with the photometric loop consistency loss E_o, the interpretable mask loss E_m, and the cooperative loss E_c. E_m encourages the image reconstruction loss to be imposed on all pixels, while E_o and E_c are designed to segment moving areas from the scene pixels and eliminate the interference of moving objects. In this section, we first explain the symbols used in the rest of the paper and give an overview of UnLearnerMC, as shown in Fig. 2. To avoid confusion with invalid regions caused by camera motion parallax, a simple and effective marking method [14] is introduced to mark the portion beyond the view boundary. We then propose SegMaskNet and the corresponding training losses for distinguishing moving objects, which are the core of UnLearnerMC.

Fig. 2. The pipeline of UnLearnerMC: depth, camera pose, SegMaskNet, and the loss constraints.

3.1 Symbolic interpretation

Our goal is to develop an unsupervised learning framework for jointly estimating scene depth and camera ego-motion from unlabeled monocular video sequences. Specifically, given a sequence of three consecutive frames [I_{t-1}, I_t, I_{t+1}] with the target frame I_t and the source views I_{t-1} and I_{t+1}, we estimate the view depth and the camera ego-motion (i.e., position and orientation). For convenience of description, we define I'_{t-1→t} as the image synthesized from frame t-1 to frame t; I'_{t+1→t}, I'_{t→t-1}, and I'_{t→t+1} are defined analogously.

3.2 Marking the valid boundary region due to motion parallax

Given the estimated depth D_{t+1} at time t+1 and the ego-motion T_{t+1→t}, the target image I'_{t+1→t} is warped from the source image I_{t+1}. The essential idea of view synthesis is

the image warping technique "spatial transformer" [11], based on the brightness constancy assumption. However, this assumption does not hold for moving objects: once the sequence contains dynamic objects, their reconstruction from the source image cannot coincide with the dynamic objects in the target image. Since UnLearnerMC assumes that the difference in the position of the same object between two frames is caused by the camera's ego-motion, the reconstruction errors caused by dynamic objects confuse and mislead UnLearnerMC when it learns ego-motion during training. We therefore need to identify the motion regions and avoid imposing photometric losses on them. Zhou et al. [15] used an explainability mask to remove evidence that cannot be explained under the all-static-scene assumption. However, this method is limited: the invalid areas marked by the explainability mask also contain invalid boundaries induced by camera motion parallax, which leads the mask to misunderstand moving objects. Assuming perfect scene depth and camera pose, Fig. 3(a) shows the target view synthesized from the source view. As the example shows, the remaining differences come mainly from motion parallax and from moving objects. When a moving object is located at the view boundary, the mask confuses the two cases.

Fig. 3. View synthesis differences computed without the boundary mask (a) and with the boundary mask (b). (Panels show the input frames t and t+1, the warped views, the boundary mask, and the resulting losses; the marked differences are due to motion parallax and to moving objects.)

Therefore, to overcome this interference, we first calculate the visible range of the view based on spatial geometry, similar to [4]. Let D_t denote the depth of pixel (u_t, v_t) at time t, let P_w^t be the position of the pixel in the world, and let K be the intrinsic camera matrix. The invalid areas caused by camera motion are described as:

D_t [u_t, v_t, 1]^T = K P_w^t,   D_{t+1} [u_{t+1}, v_{t+1}, 1]^T = K T_{t→t+1} P_w^t    (1)

M_C^{t+1→t}(u_{t+1}, v_{t+1}) = { 1, if 0 ≤ (u_t, v_t) ≤ (I_h, I_w);  0, otherwise }    (2)

where T_{t→t+1} is the relative pose from frame t to frame t+1, and M_C^{t+1→t} is the boundary mask obtained when frame I_{t+1} is warped to frame I_t. M_C(u, v) = 0 indicates an invalid pixel (u, v) in the warped image, whereas M_C(u, v) = 1 indicates a valid pixel. I_h and I_w denote the height and width of the target view, respectively. The boundary masks shown in Fig. 3(b) are examples of M_C^{t+1→t} and M_C^{t-1→t}; they mark the invalid areas beyond the visual boundary in the synthesized target view I_t. As the example shows, M_C^{t+1→t} discards the reconstruction loss of pixel coordinates that fall outside the boundary of the synthesized image. Furthermore, the invalid regions remain the same when frame t+1 is reconstructed inversely, which helps distinguish areas beyond the visual boundary induced by camera motion parallax from those induced by a car's own motion.
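For concreteness, the boundary mask of Eqs. (1)-(2) can be computed directly from the projected pixel coordinates. The NumPy sketch below is illustrative only; it is not the authors' implementation, and the function name and in-bounds convention are our assumptions:

```python
import numpy as np

def boundary_mask(depth_t, K, T_t_to_t1):
    """Project every pixel of frame t into frame t+1 (Eq. (1)) and mark
    projections that leave the image as invalid (Eq. (2)).
    depth_t: (H, W) depth map; K: (3, 3) intrinsics; T_t_to_t1: (4, 4) pose."""
    H, W = depth_t.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)

    # Back-project to 3D points in the frame-t camera (left part of Eq. (1)).
    cam_t = np.linalg.inv(K) @ (pix * depth_t.reshape(1, -1))
    cam_t = np.vstack([cam_t, np.ones((1, cam_t.shape[1]))])

    # Transform into the frame-(t+1) camera and project (right part of Eq. (1)).
    proj = K @ (T_t_to_t1 @ cam_t)[:3]
    u1, v1, z1 = proj[0] / proj[2], proj[1] / proj[2], proj[2]

    # Eq. (2): a pixel is valid only if its projection stays inside the view.
    valid = (z1 > 0) & (u1 >= 0) & (u1 <= W - 1) & (v1 >= 0) & (v1 <= H - 1)
    return valid.reshape(H, W).astype(np.float32)
```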

3.3 Movement characteristics of moving objects

Unlike the invalid areas due to camera motion parallax, capturing moving objects is more difficult and may encounter problems such as the directional uncertainty of moving objects and their occlusion. To mark the invalid regions caused by these factors, we propose a simple but effective mask network (SegMaskNet), which allocates the moving areas in the video scene by using object moving differences during training. The object moving differences, comprising the depth transformation difference and the photometric loop difference, are proposed to constrain moving object areas and distinguish them from static scenes.

Definition 1. Depth Transformation Difference. Assume that the scene in the video contains a car driving in the opposite direction, and that the scene depth and camera pose are perfectly inferred by the network. Fig. 4 shows the difference in the warped target view obtained from different source views. If the car were static, the warped static car C in the target view I_t could be calculated (blue in Fig. 4) with the image warping technique [11]; the position difference between the static car and the actual car is then the distance the car moves from time t to t+1. Similarly, when the static car is warped from the source view I_{t-1}, the position difference between the static car and the actual car is due to the car's own movement from frame t-1 to frame t.

Fig. 4. The pixel position difference of the moving car in the warped target view.

Since the sampling rate of the KITTI dataset is 10 Hz, the interval between two consecutive frames is 0.1 s; the speed between two consecutive frames can therefore be regarded as uniform, and the position difference of the car synthesized from the source view I_{t-1} should be consistent with that synthesized from the source view I_{t+1}. The same conclusion holds when the scene contains a car traveling in the same direction. Therefore, the difference in the warped view can be used to capture movement areas. In addition, considering the complexity of comparing the views directly (3 channels), we use the depth map (1 channel) to compare the differing regions. Let D_t denote the depth of pixel (u_t, v_t) at time t; the motion regions based on the depth transformation consistency are described as:

L_dld = | D'_{t+1→t} − D'_{t-1→t} |    (3)

M_dld^{t+1→t}(u_{t+1}, v_{t+1}) = { 1, if (u_{t+1}, v_{t+1}) ∈ M_c and L_dld < ε;  0, otherwise }    (4)

M_dld^{t-1→t}(u_{t-1}, v_{t-1}) = { 1, if (u_{t-1}, v_{t-1}) ∈ M_c and L_dld < ε;  0, otherwise }    (5)

where L_dld denotes the depth transformation consistency, and D'_{t+1→t} denotes the estimated depth D_{t+1} of frame I_{t+1} transformed into the coordinate system of frame I_t. |D'_{t+1→t} − D_t| is the depth difference in frame I_t. The threshold ε forces a pixel whose depth D'_{t+1→t} is close to D'_{t-1→t} to be marked as a valid (static) pixel.

Fig. 5. Depth error images produced by the difference between the original depth and the transformed depth (Difference: D_{t+1→t} − D_t and D_{t-1→t} − D_t).

Fig. 5 shows the depth difference of the car in the transformed target view obtained from the two source views, where white indicates no difference and black indicates areas that cannot be matched to the target view. In both cases, the differences between car depths transformed from different source views into the target view can distinguish moving objects.
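Definition 1 can be sketched as follows. The helper `transform_depth` is a hypothetical nearest-neighbour forward warp standing in for the rigid depth transformation of Eq. (1), and the threshold `eps` is an illustrative value, not a number taken from the paper:

```python
import numpy as np

def transform_depth(D_src, T_src_to_tgt, K):
    """Rigidly transform a source depth map into the target view by
    forward-projecting every source pixel (nearest-neighbour scatter)."""
    H, W = D_src.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)
    cam = np.linalg.inv(K) @ (pix * D_src.reshape(1, -1))
    cam = T_src_to_tgt @ np.vstack([cam, np.ones((1, cam.shape[1]))])
    proj = K @ cam[:3]
    z = np.maximum(proj[2], 1e-6)
    u_t = np.round(proj[0] / z).astype(int)
    v_t = np.round(proj[1] / z).astype(int)
    D_tgt = np.full((H, W), np.inf)           # pixels never hit stay at infinity
    ok = (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)
    np.minimum.at(D_tgt, (v_t[ok], u_t[ok]), z[ok])   # keep the nearest surface
    return D_tgt

def depth_transform_mask(D_prev, D_next, T_prev_to_t, T_next_to_t, K,
                         boundary_mask, eps=0.1):
    """Indicator mask of Eqs. (3)-(5): pixels whose depths, transformed from
    the two source frames into frame t, agree within eps are treated as static."""
    D_next_to_t = transform_depth(D_next, T_next_to_t, K)   # D'_{t+1->t}
    D_prev_to_t = transform_depth(D_prev, T_prev_to_t, K)   # D'_{t-1->t}
    L_dld = np.abs(D_next_to_t - D_prev_to_t)               # Eq. (3)
    return ((L_dld < eps) & (boundary_mask > 0)).astype(np.float32)
```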

Tt 1t transformt 1t

s

 I t  I t t 1t  I t

re-

t 1t

M dld

Lophl  I t't 1t

pro of

the speed between two consecutive frames can be regarded as uniform, and the position difference of the car synthesized from the source view I t 1 should be consistent with that from the source view I t 1 . The above conclusion is also true when the scene contains a car traveling in the same direction. Therefore, the difference in this warped view can be used to capture movement areas. In addition, considering the complexity of the difference in the view (consisting of 3 channels), we use the depth map (consisting of 1 channel) to compare the different regions. Let Dt denote the depth of the pixel ( ut , vt ) at time t, the motion regions based on the depth transformation consistency are described as:

Tt t 1

transformt 1t

Dt 1

Difference : Dt 1t  Dt

Fig. 5. shows the depth error image produced by the difference between original depth and the transformed depth.

Jo

Fig. 5 demonstrates the depth difference of car in the transformed target view from different source views, where white indicates no difference and black color indicates that the area cannot match the target view. We can observe that in both cases, the differences in car depth from different source views transform to the target view can distinguish moving objects. Definition 2. Photometric Loop Difference. To capture invalid areas caused by moving objects, we propose a second component, namely, the photometric loop difference. Let Lophl denote photometric loop difference to target view I t , and  is valid pixel threshold, the motion regions based on the photometric loop difference are described as:

VOLUME XX, 2017

Fig. 6. The process of marking valid pixels via photometric loop difference.

Fig. 6 illustrates the contribution of the re-estimated s depth Dt 1 for searching the invalid pixels caused by ' moving objects. In the view I t t 1 synthesis process based on image wrap technology, the difference of object location in different frame views is purely caused by camera ego-motion, while ' the motion of moving objects can interfere with view I t t 1 synthesis. To address the issues, we separately use the re-estimated depth Dts 1 and camera pose Ttst 1 as the depth and pose' inputs in the process of self-synthesis target image I t t 1t at time t+1 , and the Dts 1 , Ttst 1 are respectively close to the static depth value and actual camera pose at time t+1. s Based on the re-estimated depth Dt 1 , pose Ttst 1 , and 9
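The loop described above can be summarized with the following schematic sketch. It reflects one possible reading of the procedure: `warp(image, depth, pose)` stands in for the spatial-transformer-style view synthesis, `depth_net` for DepthNet, and the exact arguments of each warp are our assumptions rather than the authors' code:

```python
import numpy as np

def photometric_loop_mask(I_t, I_t1, D_t, T_t_to_t1, T_t1_to_t,
                          depth_net, warp, delta=0.1):
    """Schematic of Definition 2. warp(image, depth, pose) synthesizes a view;
    depth_net(image) re-estimates a depth map from a synthesized view."""
    # 1. Synthesize the view at t+1 from the source image and the predicted depth.
    I_syn_t1 = warp(I_t1, D_t, T_t_to_t1)            # I'_{t->t+1}
    # 2. Re-estimate a "static" depth from the synthesized view.
    D_s_t1 = depth_net(I_syn_t1)                     # D^s_{t+1}
    # 3. Re-synthesize the view at t+1 with the re-estimated static depth.
    I_s_t1 = warp(I_t1, D_s_t1, T_t_to_t1)           # I^s_{t->t+1}
    # 4. Warp both synthesized views back to the target frame t.
    I_loop = warp(I_syn_t1, D_t, T_t1_to_t)          # I'_{t->t+1->t}
    I_loop_s = warp(I_s_t1, D_t, T_t1_to_t)          # I^s_{t->t+1->t}
    # 5. Photometric loop difference (Eq. (6)); small values are treated as static.
    L_ophl = np.abs(I_loop - I_t) + np.abs(I_loop_s - I_t)
    return (L_ophl.mean(axis=-1) < delta).astype(np.float32)   # Eqs. (7)-(8)
```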

3.4 Objective losses for unsupervised training

Our final loss function E_final can be formulated as:

E_final = λ_r E_r + λ_o E_o + λ_m E_m + λ_c E_c + λ_s E_s    (9)

where λ_r, λ_o, λ_m, λ_c, and λ_s are the corresponding loss weights. The static image reconstruction loss E_r is designed to minimize the error of reconstructing static scenes. The photometric loop consistency loss E_o is used to reason about pixels in the independently moving areas. To drive more pixels towards the static scene reconstructor, a larger weight λ_m is set for the regularization term E_m. The cooperative loss E_c constrains the mask to segment the moving objects, and E_s is an edge-aware depth smoothness loss, as shown in Fig. 2.

Definition 1. Image Reconstruction Loss. The image reconstruction loss E_r is defined using the view synthesis technique on two consecutive monocular views:

E_r = M_c^{t→t+1} M_{t→t+1} Γ(I_{t+1}, I'_{t→t+1}) + M_c^{t+1→t} M_{t+1→t} Γ(I_t, I'_{t+1→t})    (10)

Γ(u, v) = α |u − v| + (1 − α) [ 1 − (2 μ_u μ_v + c_1)(2 σ_uv + c_2) / ((μ_u² + μ_v² + c_1)(σ_u² + σ_v² + c_2)) ]    (11)

where M_c^{t→t+1} is a boundary mask representing the boundary range of the camera's visualization, and M_{t→t+1} represents the valid static-scene pixels in the synthesis process from I_t to I'_{t+1}. Γ denotes the error loss function used in the classical work [31]: its second term is the SSIM index [14, 16], where μ_u and σ_u are the local mean and variance describing the pixel neighborhood, and the parameters are set to α = 0.25, β = 0.001, c_1 = 0.01², c_2 = 0.03².
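The reconstructed error function Γ of Eq. (11) blends an L1 term with an SSIM dissimilarity term. The following sketch assumes single-channel images and a 3×3 averaging window, which are our choices and are not stated in the paper:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def gamma_error(u, v, alpha=0.25, c1=0.01 ** 2, c2=0.03 ** 2, win=3):
    """Per-pixel photometric error of Eq. (11) for single-channel images:
    alpha * |u - v| + (1 - alpha) * (1 - SSIM(u, v))."""
    mu_u, mu_v = uniform_filter(u, win), uniform_filter(v, win)
    var_u = uniform_filter(u * u, win) - mu_u ** 2        # local variances
    var_v = uniform_filter(v * v, win) - mu_v ** 2
    cov_uv = uniform_filter(u * v, win) - mu_u * mu_v     # local covariance

    ssim = ((2 * mu_u * mu_v + c1) * (2 * cov_uv + c2)) / \
           ((mu_u ** 2 + mu_v ** 2 + c1) * (var_u + var_v + c2))
    return alpha * np.abs(u - v) + (1 - alpha) * (1 - ssim)
```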

Eo = M t t 1 M t 1t   I t ,I t t 1t   M t t 1 M t 1t   I t ,I t t 1t  s

s

 M t t 1 M t 1t   I t 1 ,I ts1t t 1   M t t 1 M t 1t   I t 1 ,I ts1t t 1  (13) where M t  t 1 is the reverse of the mask M t  t 1 , and moving areas in the synthesis M t  t 1 represents the ' s process from I t to I t 1 . I t t 1t represents a loop warping s process between I t and I t 1 . First, I t 1t can be warped s using the re-estimated depth Dt and source view I t 1 , and s I t 1t can be understood as a static view at the t frame, then s inversely warped It 1t t 1 based on the static view depth Dt and the target view I t 1 . Similarly, to improve the learning efficiency of the proposed pipeline to the moving areas, we train the loop warping between I t and I t 1 simultaneously. While this loss is to resolve moving object areas in the view, the essential supervision of the loss comes from static view synthesis techniques based on the estimated depth D and camera pose T . Thus, the use of this loss not only segments the dynamically moving object, but also significantly adjusts T and D to a better initial alignment. Furthermore, when the actual scene is all static, this loss 9

Definition 3. Interpretable Mask Loss. To bias the model towards the reconstruction of static scenes, rather than letting SegMaskNet discard pixels from the computation, we also add an interpretable mask loss, defined as the cross-entropy of the mask with the constant label 1:

E_m = H(1, M_{t→t+1}) + H(1, M̄_{t→t+1}) + H(1, M_{t+1→t}) + H(1, M̄_{t+1→t})    (14)

where H denotes the cross-entropy function and 1 is the mask with constant label 1. By setting a larger weight λ_m, the interpretable mask loss E_m encourages all pixels to be used for training the static scene reconstructor.

Definition 4. Cooperative Loss. The cooperative loss E_c divides the pixels of moving objects by balancing the static scene reconstruction loss E_r against the dynamic object loop loss E_o. It can be expressed as:

E_c = H(M_dld^{t→t+1} ∧ M_ophl^{t→t+1}, M_{t→t+1}) + H(M̄_dld^{t→t+1} ∧ M̄_ophl^{t→t+1}, M̄_{t→t+1}) + H(M_dld^{t+1→t} ∧ M_ophl^{t+1→t}, M_{t+1→t}) + H(M̄_dld^{t+1→t} ∧ M̄_ophl^{t+1→t}, M̄_{t+1→t})    (15)

where M_dld is an indicator mask marking static pixels that satisfy the depth transformation consistency, and M_ophl is another indicator mask marking static pixels that satisfy the photometric loop consistency. In the cooperative loss, the first indicator mask M_dld treats pixels with the same transformed depth in the target view as static, and pixels with a loop loss smaller than the threshold δ are indicated as static by the second indicator mask M_ophl. The symbol ∧ denotes the logical AND between the indicator masks.
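As a sketch of Definitions 3 and 4 (an illustrative NumPy reconstruction; the way the reversed masks enter Eq. (15) follows the reconstruction given above), the cross-entropy terms can be written as:

```python
import numpy as np

def cross_entropy(target, mask, eps=1e-7):
    """Pixel-wise binary cross-entropy H(target, mask), averaged over the image."""
    mask = np.clip(mask, eps, 1.0 - eps)
    return -np.mean(target * np.log(mask) + (1 - target) * np.log(1 - mask))

def interpretable_mask_loss(masks):
    """Eq. (14): push every predicted mask towards the constant label 1 so that,
    by default, all pixels feed the static-scene reconstructor."""
    return sum(cross_entropy(np.ones_like(m), m) for m in masks)

def cooperative_loss(m_dld, m_ophl, mask):
    """One directional term of Eq. (15): the SegMaskNet output is supervised by
    the agreement (logical AND) of the two static indicators, and its reverse by
    the agreement of the two reversed indicators."""
    static_indicator = m_dld * m_ophl                   # M_dld AND M_ophl
    moving_indicator = (1.0 - m_dld) * (1.0 - m_ophl)   # reversed indicators
    return (cross_entropy(static_indicator, mask) +
            cross_entropy(moving_indicator, 1.0 - mask))
```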

Moreover, to maintain clear details in the estimated depth and segmentation, an edge-aware depth smoothness loss E_s is introduced:

E_s = Σ_{p_t} | ∇D(p_t) | · ( e^{−| ∇I(p_t) |} )^T    (16)

where ∇ is the vector differential operator and T denotes transposition applied to the image gradient weighting.
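A minimal sketch of Eq. (16), assuming a single-channel depth map and an RGB image (illustrative, not the authors' implementation):

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """Eq. (16): penalize depth gradients, down-weighted where the image itself
    has strong gradients, so that depth edges may align with image edges."""
    d_dx = np.abs(np.diff(depth, axis=1))
    d_dy = np.abs(np.diff(depth, axis=0))
    i_dx = np.mean(np.abs(np.diff(image, axis=1)), axis=-1)   # averaged over colour channels
    i_dy = np.mean(np.abs(np.diff(image, axis=0)), axis=-1)
    return np.sum(d_dx * np.exp(-i_dx)) + np.sum(d_dy * np.exp(-i_dy))
```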

4. Experiments

In this section, we provide the training details of UnLearnerMC and compare its dense depth and camera pose estimates with supervised and unsupervised algorithms, both qualitatively and quantitatively.

4.1 Implementation details

Network Architecture: The architecture of UnLearnerMC consists of three main parts: DepthNet, PoseNet, and SegMaskNet. As described in [24], we choose Vgg-64 and ResNet-50 [24] as the encoder parts of DepthNet, followed by a corresponding number of task-specific convolutional decoders, which take the enlarged spatial feature maps as input. For estimating the camera pose between consecutive frames, the recent pose estimation operators of [15, 32] are adopted in our framework. For SegMaskNet, we use up-convolutional layers [15] to produce the masks.

Training Details: The training process of UnLearnerMC consists of two stages. In the first stage, we use all data containing moving objects as inputs for training DepthNet and PoseNet. Then DepthNet and PoseNet, together with SegMaskNet, collaborate to infer static scenes and moving objects by forming a consensus on minimizing the losses. UnLearnerMC is implemented on the TensorFlow platform [25]. The images used in the experiments are taken by a single camera. For all experiments, the images are resized to 128 × 416, and the initial learning rate and mini-batch size are set to r = 0.0002 and b = 4, respectively. The loss weights are set to λ_polcl = 0.2, λ_phlcl = 0.3, λ_smooth = 0.5.

Runtime and Memory Requirements: During training, the number of parameters required by UnLearnerMC is approximately 66.46 million (similar to GeoNet [16]). During testing, UnLearnerMC estimates depth and camera pose in 12 ms and 4.75 ms per frame, respectively. Table 1 shows the test-time efficiency of different algorithms for depth and pose estimation. Our UnLearnerMC is faster than other mainstream methods [15, 30] and on par with the recent work [17]; this is because our model is similar to the network structure of [17], and the testing time depends on the model network. However, our model obtains more accurate depth and pose results thanks to the proposed photometric loop consistency loss and cooperative loss used during training.

Table 1. Testing time (ms) of different algorithms on the KITTI dataset, where the depth and pose tests use the split testing images (refer to [27]) and the KITTI odometry set.

Algorithm          | Depth | Pose(09) | Pose(10) | Pose(mean)
Godard et al. [15] | 22    | -        | -        | -
Li et al. [30]     | 16    | 6.7      | 5.8      | 6.25
Yin et al. [16]    | 11    | 5.2      | 4.4      | 4.8
Ours               | 12    | 5.0      | 4.5      | 4.75

4.2 Monocular depth estimation

We take the KITTI dataset [26], which provides 40109 training images and 4031 test images, excluding static frames and setting the length of image sequences to 3. We evaluate UnLearnerMC against methods with different types of self-supervision: monocular, stereo, and optical flow. The standard depth metrics are used for the detailed comparison, as in [30]:

ARD: (1/T) Σ_{i=1}^{T} | U(x_i) − Z(x_i) | / Z(x_i)

SRD: (1/T) Σ_{i=1}^{T} | U(x_i) − Z(x_i) |² / Z(x_i)

RMSE: sqrt( (1/T) Σ_{i=1}^{T} | U(x_i) − Z(x_i) |² )

RMSE(log): sqrt( (1/T) Σ_{i=1}^{T} | log U(x_i) − log Z(x_i) |² )

Accuracy: fraction of pixels with max( U(x_i)/Z(x_i), Z(x_i)/U(x_i) ) = δ < thr, i ∈ {1, ..., T}

where T is the number of pixels in a camera view, U(x_i) is the predicted depth, and Z(x_i) is the depth of the corresponding pixel in the actual scene.
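These metrics can be sketched as follows (per-image median scaling and the depth capping used in the standard KITTI protocol are omitted for brevity; this is an illustrative implementation, not the authors' evaluation code):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard single-view depth metrics, with U = pred and Z = gt."""
    pred, gt = pred.ravel(), gt.ravel()
    ard = np.mean(np.abs(pred - gt) / gt)                          # ARD (Abs Rel)
    srd = np.mean((pred - gt) ** 2 / gt)                           # SRD (Sq Rel)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                      # RMSE
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))  # RMSE (log)
    ratio = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]          # delta < 1.25^k
    return ard, srd, rmse, rmse_log, acc
```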

We first evaluate our model on depth prediction with the test set of the Eigen split [27]. The quantitative comparison results are shown in Table 2. Note that Yang et al. [17], Zou et al. [18], Godard et al. [28], and Ranjan et al. [32] integrate FlowNet into a depth-pose framework to divide the scene into static and independently moving areas. It is hard to compare our method directly with these optical-flow-based methods, since they use optical flow information to improve PoseNet and DepthNet, whereas our model is trained without FlowNet. Nevertheless, our UnLearnerMC outperforms other popular unsupervised methods [14, 15, 22] and the optical-flow-based methods [16, 18, 28] on several metrics (bold in Table 2), while it is slightly worse than the recent work [32] that utilizes independent motion segmentation from optical flow. It is also worth noting that in terms of pose, UnLearnerMC outperforms these methods (see Section 4.3). Fig. 8 depicts a qualitative comparison with other algorithms. We can observe how learning with the cooperative loss helps to improve depth predictions, especially on moving objects. In addition, the depth of slender structures such as telegraph poles is finely estimated, which can be clearly perceived by observing the penultimate frame.

Table 2. Results of different algorithms on the KITTI dataset using the Eigen split (refer to [27]), where the best results are shown in bold. For training, K = KITTI and CS = Cityscapes [29].

Algorithm               | Supervision | Dataset | ARD   | SRD   | RMSE  | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³
Eigen et al. [7] Coarse | Depth       | K       | 0.214 | 1.605 | 6.563 | 0.292    | 0.673  | 0.884   | 0.957
Eigen et al. [7] Fine   | Depth       | K       | 0.203 | 1.548 | 6.307 | 0.282    | 0.702  | 0.890   | 0.958
Liu et al. [11]         | Depth       | K       | 0.202 | 1.614 | 6.523 | 0.275    | 0.678  | 0.895   | 0.965
Yang et al. [17]        | Stereo      | K       | 0.127 | 1.239 | 6.247 | 0.214    | 0.847  | 0.926   | 0.969
Yang et al. [17]        | Stereo      | CS + K  | 0.114 | 1.074 | 5.836 | 0.208    | 0.856  | 0.939   | 0.976
Ranjan et al. [32]      | Mono        | K       | 0.140 | 1.070 | 5.326 | 0.217    | 0.826  | 0.941   | 0.975
Zhou et al. [15]        | Mono        | K       | 0.208 | 1.768 | 6.856 | 0.283    | 0.678  | 0.885   | 0.957
Zhou et al. [15] updated| Mono        | K       | 0.183 | 1.595 | 6.709 | 0.270    | 0.734  | 0.902   | 0.959
Mahjourian et al. [14]  | Mono        | K       | 0.163 | 1.240 | 6.220 | 0.250    | 0.762  | 0.916   | 0.968
Yang et al. [22]        | Mono        | K       | 0.162 | 1.352 | 6.276 | 0.252    | -      | -       | -
Yin et al. [16]         | Mono        | K       | 0.155 | 1.296 | 5.857 | 0.233    | 0.793  | 0.931   | 0.973
Godard et al. [28]      | Mono        | K       | 0.154 | 1.218 | 5.699 | 0.231    | 0.798  | 0.932   | 0.973
Zou et al. [18]         | Mono        | K       | 0.150 | 1.124 | 5.507 | 0.223    | 0.806  | 0.933   | 0.973
Ours                    | Mono        | K       | 0.147 | 1.194 | 5.537 | 0.220    | 0.812  | 0.935   | 0.975
Ranjan et al. [32]      | Mono        | CS + K  | 0.139 | 1.032 | 5.199 | 0.213    | 0.827  | 0.943   | 0.977
Zhou et al. [15]        | Mono        | CS + K  | 0.198 | 1.836 | 6.565 | 0.275    | 0.718  | 0.901   | 0.960
Mahjourian et al. [14]  | Mono        | CS + K  | 0.159 | 1.231 | 5.912 | 0.243    | 0.784  | 0.923   | 0.970
Yang et al. [22]        | Mono        | CS + K  | 0.159 | 1.345 | 6.254 | 0.247    | -      | -       | -
Yin et al. [16]         | Mono        | CS + K  | 0.153 | 1.328 | 5.737 | 0.232    | 0.802  | 0.934   | 0.972
Zou et al. [18]         | Mono        | CS + K  | 0.146 | 1.182 | 5.215 | 0.213    | 0.818  | 0.943   | 0.978
Ours                    | Mono        | CS + K  | 0.143 | 1.176 | 5.218 | 0.212    | 0.822  | 0.943   | 0.978


Fig. 8. Depth examples of different algorithms on the KITTI test set.

4.3 Evaluation of ego-motion


We use the official KITTI odometry dataset [26] and set the length of image sequences to 3. To compare with the other algorithms [14, 15, 16, 18, 28], we use the monocular sequences 00-08 for training and 09-10 for testing. In addition, the traditional SLAM framework ORB-SLAM [3] is compared with UnLearnerMC. As described in [15], the average absolute trajectory error is evaluated on five- or three-frame overlapping snippets in the test sequences, after aligning with the ground truth via a scaling factor. As demonstrated in Table 3, our algorithm is superior to the other popular competing baselines [14, 15, 16, 18, 28].
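The snippet-based ATE protocol described above can be sketched as follows (an illustrative reconstruction of the common evaluation used in [15]; both trajectories are assumed to be expressed relative to the first frame of the snippet):

```python
import numpy as np

def snippet_ate(pred_xyz, gt_xyz):
    """Absolute Trajectory Error on a short (3- or 5-frame) snippet.
    pred_xyz, gt_xyz: (N, 3) camera positions relative to the first frame.
    The predicted translations are aligned to the ground truth with a single
    least-squares scaling factor before the error is measured."""
    scale = np.sum(gt_xyz * pred_xyz) / np.sum(pred_xyz ** 2)
    err = scale * pred_xyz - gt_xyz
    return np.sqrt(np.mean(np.sum(err ** 2, axis=1)))
```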

Table 3. Absolute Trajectory Error of different algorithms on the KITTI odometry set.

Algorithm                          | Seq(09)          | Seq(10)
ORB-SLAM (full)                    | 0.014 ± 0.008    | 0.012 ± 0.011
ORB-SLAM (short)                   | 0.064 ± 0.141    | 0.064 ± 0.130
Godard et al. [28]                 | 0.023 ± 0.013    | 0.018 ± 0.014
Zou et al. [18]                    | 0.017 ± 0.007    | 0.015 ± 0.009
Zhou et al. [15] updated (5-frame) | 0.016 ± 0.009    | 0.013 ± 0.009
Yin et al. [16] (5-frame)          | 0.012 ± 0.007    | 0.012 ± 0.009
Mahjourian et al. [14] (3-frame)   | 0.013 ± 0.010    | 0.012 ± 0.011
Ours (3-frame)                     | 0.0095 ± 0.0049  | 0.0085 ± 0.0069

The visual comparison of the estimated trajectories produced by different methods is shown in Fig. 10. It is important to note that this evaluation protocol is disadvantageous to the proposed method, because it does not allow correction of drift or translation scale. Nevertheless, the trajectory of UnLearnerMC is significantly better than that of the SFMLearner algorithm [15]. The better visual trajectory results compared with the popular competitive baselines again reveal the superiority of the proposed UnLearnerMC.

Fig. 11. Translational error on Sequences 09 and 10 ((a) Sequence 09, (b) Sequence 10).

4.4 Evaluation of moving object segmentation

To test the capability of SegMaskNet to segment moving objects, we use the KITTI 2015 dataset [6] to evaluate the segmentation performance. Following [4], we evaluate object segmentation on car pixels. Table 4 reports the IoU scores of the segmentation masks obtained by different algorithms. Our UnLearnerMC outperforms the depth-pose-mask method [15], while it is slightly worse than the recent work [17]. This difference is most likely caused by the optical flow segmentation using the forward-backward consistency loss. Nevertheless, the comparison of "Ours with CCNet" with the other algorithms [15, 17] already demonstrates the advantages of our SegMaskNet within a learning framework based on optical flow information.

Table 4. Segmentation results of different algorithms on the KITTI 2015 dataset. "Zhou et al. [15]*" trained the segmentation mask using the pipeline of Yang et al. [17]. "Ours with CCNet" is trained with a FlowNet that shares the third phase (CCNet) [32].

Algorithm         | pixel.acc. | mean.acc. | mean.IoU | f.w.IoU
Zhou et al. [15]* | 70.32      | 58.24     | 41.95    | 67.56
Ours*             | 73.82      | 62.36     | 44.83    | 68.72
Yang et al. [17]  | 88.71      | 74.59     | 52.25    | 86.53
Ours with CCNet   | 86.59      | 75.67     | 53.38    | 87.45

4.5 Evaluation of the proposed loss

We train and evaluate a series of models to show the importance of each component of the loss function: (1) the first baseline is trained without the interpretable mask loss; the remaining loss weights are set to λ_r = 1.0, λ_o = 0.5, λ_c = 0.03, λ_s = 0.3. (2) The photometric loop consistency loss is excluded from our framework ("No E_o"); the remaining loss weights are set to λ_r = 1.0, λ_m = 0.03, λ_c = 0.03, λ_s = 0.3. (3) "No E_c" is our model without the cooperative loss; the remaining loss weights are set to λ_r = 1.0, λ_o = 0.5, λ_m = 0.03, λ_s = 0.3. Fig. 10 shows the evolution of the depth estimation error over time when training these models. Using the first baseline ("No E_o" loss) improves depth estimation accuracy slightly, and using the second model ("No E_m" loss) gives a further boost. The first curve of the figure reflects that the cooperative loss E_c is essential to the overall performance of the model. In the visualization results of Fig. 11, we can observe how the depth maps estimated by the model with the E_c loss provide better moving-object shapes. This is most apparent on transparent surfaces such as car windows, as shown in the last column.

Fig. 10. The depth evaluation error of the training models with and without the proposed losses.

Fig. 11. Example depth estimation results from a series of models, where each tested model adds one component of the loss function.

In addition, to show that E_o does not interfere with model performance when training on purely static scenes, we evaluate and compare UnLearnerMC (Ours) and the "Ours No E_o" model trained on static scene sequences from the KITTI odometry dataset [26]. The experimental results are presented in Table 5. The performance of the "Ours No E_o" model is almost the same as that of UnLearnerMC, which shows that the loss works well in static scenes. Furthermore, the comparison between "Ours*" and the method of Li et al. [30] already demonstrates the advantages of the proposed losses.

Table 5. Depth evaluations of different algorithms on the KITTI dataset using the split of Eigen et al. [27]. The methods marked with "*" are trained with all KITTI odometry training sets [26].

Algorithm       | Dataset | ARD   | SRD   | RMSE | RMSE log
Li et al. [30]* | K(odo)  | 0.183 | 1.73  | 6.57 | 0.268
Ours*           | K(odo)  | 0.163 | 1.256 | 6.21 | 0.250
Ours No E_o     | K(odo)  | 0.160 | 1.22  | 6.09 | 0.248
Ours            | K(odo)  | 0.161 | 1.23  | 6.12 | 0.248

5. Conclusion

In this paper, we presented an unsupervised learning pipeline for single-view depth, camera pose, and segmentation tasks that does not use optical flow information. We learn to segment the moving areas of video scenes via the proposed SegMaskNet. For the independently moving areas, the photometric loop consistency loss is proposed to constrain the moving areas and improve segmentation accuracy. To facilitate this process, we proposed a cooperative loss and SegMaskNet, which constantly adjust the allocation of pixels to static scenes and moving areas and thereby help all tasks constrain each other cooperatively. On the KITTI dataset, our model achieves performance competitive with state-of-the-art methods. In future work, it would be interesting to explore the effects of image blur, camera parameters, and illumination changes on our system.

Acknowledgments

The authors would like to thank Zhichao Yin and Tinghui Zhou for helpful discussions and for sharing their code. The authors also thank the anonymous reviewers for their instructive comments.

Compliance with ethical standards

Conflict of interest: The authors declare that they have no conflict of interest.

Ethical approval: This article does not contain any studies with animals performed by any of the authors.

References

[1] X. Li, Q. Liu, N. Fan, Z. Y. He, H. Z. Wang, Hierarchical spatial-aware Siamese network for thermal infrared object tracking, Knowledge-Based Systems, 16 (6) (2019) 71-81.
[2] G. Klein, D. Murray, Parallel tracking and mapping for small AR workspaces, in: IEEE & ACM International Symposium on Mixed and Augmented Reality, 2008, 1-10.
[3] R. Mur-Artal, J. Montiel, J. D. Tardos, ORB-SLAM: a versatile and accurate monocular SLAM system, IEEE Transactions on Robotics, 31 (5) (2015) 1147-1163.
[4] R. A. Newcombe, S. J. Lovegrove, A. J. Davison, DTAM: Dense tracking and mapping in real-time, in: The IEEE International Conference on Computer Vision (ICCV), 2011.
[5] J. Engel, T. Schöps, D. Cremers, LSD-SLAM: Large-scale direct monocular SLAM, in: European Conference on Computer Vision (ECCV), 2014.
[6] J. Engel, V. Koltun, D. Cremers, MSST-ResNet: Deep multi-scale spatiotemporal features for robust visual object tracking, Knowledge-Based Systems, 16 (4) (2019) 235-252.
[7] A. Kendall, M. Grimes, R. Cipolla, PoseNet: A convolutional network for real-time 6-DOF camera relocalization, in: The IEEE International Conference on Computer Vision (ICCV), 2015.
[8] R. Li, Q. Liu, J. Gui, D. Gu, H. Hu, Indoor relocalization in challenging environments with dual-stream convolutional neural networks, IEEE Transactions on Automation Science and Engineering, 2017.
[9] R. Clark, S. Wang, A. Markham, N. Trigoni, H. Wen, VidLoc: 6-DoF video-clip relocalization, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[10] G. Costante, M. Mancini, P. Valigi, T. A. Ciarfuglia, Exploring representation learning with CNNs for frame-to-frame ego-motion estimation, IEEE Robotics and Automation Letters, 1 (1) (2016) 18-25.
[11] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, in: Advances in Neural Information Processing Systems, 2015, 2017-2025.

[12] C. Wang, J. M. Buenaposada, R. Zhu, et al., Learning depth from monocular videos using direct methods, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] H. Zhan, R. Garg, C. S. Weerasekera, et al., Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[14] R. Mahjourian, M. Wicke, A. Angelova, Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[15] T. Zhou, M. Brown, N. Snavely, D. G. Lowe, Unsupervised learning of depth and ego-motion from video, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[16] Z. Yin, J. Shi, GeoNet: Unsupervised learning of dense depth, optical flow and camera pose, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] Z. Yang, P. Wang, Y. Wang, et al., Every pixel counts: Unsupervised geometry learning with holistic 3D motion understanding, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[18] Y. Zou, Z. Luo, J. B. Huang, DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[19] S. Pillai, J. J. Leonard, Towards visual ego-motion learning in robots, in: The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[20] C. Godard, O. M. Aodha, G. J. Brostow, Unsupervised monocular depth estimation with left-right consistency, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[21] Z. Yang, P. Wang, W. Xu, L. Zhao, N. Ram, Unsupervised learning of geometry from videos with edge-aware depth-normal consistency, in: Conference on the Association for the Advancement of Artificial Intelligence (AAAI), 2018.
[22] Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, LEGO: Learning edge with geometry all at once by watching videos, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[23] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing, 13 (4) (2004) 600-612.
[24] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[25] M. Abadi, A. Agarwal, P. Barham, et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems, arXiv preprint arXiv:1603.04467, 2016.
[26] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[27] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, in: The 27th International Conference on Neural Information Processing Systems, 2014.
[28] C. Godard, O. M. Aodha, G. J. Brostow, Digging into self-supervised monocular depth estimation, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[29] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The Cityscapes dataset for semantic urban scene understanding, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[30] R. H. Li, S. Wang, Z. Q. Long, D. B. Gu, UnDeepVO: Monocular visual odometry through unsupervised deep learning, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[31] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing, 13 (4) (2004) 600-612.
[32] A. Ranjan, V. Jampani, L. Balles, et al., Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Highlights:

(1) UnLearnerMC is able to estimate monocular depth and camera motion and to segment moving objects without using optical flow network information.
(2) The photometric loop consistency loss is proposed to overcome moving object interference, which is not handled by a pure view synthesis task.
(3) We combine SegMaskNet with the cooperative loss to constrain the moving object areas and restrict factors not considered in the mask network.
(4) UnLearnerMC achieves state-of-the-art results in pose and depth estimation, performing better than previous unsupervised methods.