Robust RGB-D Visual Odometry Based on Edges and Points

Erliang Yao a,*, Hexin Zhang a, Hui Xu a, Haitao Song a, Guoliang Zhang b

a Department of Control Engineering, High-Tech Institute of Xi'an, Xi'an, China
b College of Control Engineering, Chengdu University of Information Technology, Chengdu, China
Abstract—Localization in unknown environments is a fundamental requirement for robots. Egomotion estimation based on visual information is an active research topic. However, most visual odometry (VO) or visual Simultaneous Localization and Mapping (vSLAM) approaches assume static environments. To achieve robust and precise localization in dynamic environments, we propose a novel VO based on edges and points for RGB-D cameras. In contrast to dense motion segmentation, sparse edge alignment with distance transform (DT) errors is adopted to detect the states of image areas. Features in dynamic areas are ignored in the egomotion estimation with reprojection errors, and static weights calculated from the DT errors are added to the pose estimation. Furthermore, local bundle adjustment is utilized to improve the consistency of the local map and the camera localization. The proposed approach runs in real time. Experiments are conducted on the challenging sequences of the TUM RGB-D dataset. The results demonstrate that the proposed robust VO achieves more accurate and more stable localization than state-of-the-art robust VO or SLAM approaches in dynamic environments.

Index Terms—Localization, Visual Odometry, Dynamic Environments, Edge Alignment, Bundle Adjustment
1 Introduction
In unknown or GPS-denied environments, accurate localization with visual data is one of the most active topics in robotics. RGB-D cameras directly provide depth information of the environment and have been used in more and more robotic applications. As a part of vSLAM (visual Simultaneous Localization and Mapping) [1], VO (Visual Odometry) [2] focuses on the egomotion estimation between consecutive images. To simplify the formulation, most SLAM and VO approaches assume static environments, namely, that the information among consecutive images is consistent. However, moving objects exist everywhere, such as pedestrians, animals and cars. The changes observed between consecutive images stem from both the camera motion and the moving objects. The moving objects violate the static environment assumption and have negative effects on VO; traditional VO methods therefore obtain poor localization in dynamic environments.

Based on the optimized errors, traditional VO or SLAM methods can be roughly categorized into four types. The first type is based on reprojection errors, namely, indirect methods, such as PTAM [3] and ORB-SLAM2 [4]. Point features of consecutive images are extracted and matched with invariant feature descriptors, and the reprojection errors of the point features are minimized for precise egomotion estimation. Due to the robust point features, these methods allow large inter-frame movements, and the features can be used for loop detection to further improve the localization accuracy. The second type is based on photometric errors, namely, direct methods, e.g., DVO [5], LSD-SLAM [6] and DSO [7]. Direct methods apply the intensity values of the raw image data to estimate the camera motion without feature extraction. Instead of the reprojection location of features as in indirect methods, direct methods adopt the intensity gradient direction and magnitude to guide the egomotion estimation. However, they assume small inter-frame motion and photometric invariance. In addition, effective loop closure is difficult to integrate into direct methods. The third type is based on Euclidean space errors with the ICP (iterative closest point) algorithm, such as RGB-D SLAM [8] and KinectFusion [9]. Instead of aligning image information as in the above two types, this approach registers point clouds. Due to the depth uncertainty of the 3D features [10], the egomotion estimation is less accurate than that of minimizing reprojection errors. However, optimization with large sets of points and a 3D Gaussian error model can improve the performance of the egomotion
estimation [11]. The fourth type of VO or SLAM method is based on distance transform (DT) errors, e.g., REVO (Robust Edge-based Visual Odometry) [12] and D-EA (Direct Edge Alignment) [13][14]. These methods are considered crossovers between direct and indirect methods [12]. They extract edges in images and calculate the distance transform [15], which provides an edge similarity measure without correspondence matching. The edges of a source frame and a target frame are aligned for egomotion estimation by minimizing DT errors. Compared with indirect methods, no matching step is needed for the calculation of the DT errors, which saves computation. In addition, REVO exhibits a larger convergence basin than direct and indirect methods when the inter-frame motion increases. However, due to the large number of 3D edge points, bundle adjustment (BA) is hardly applicable to edge alignment [14]. Besides, further effort is required to integrate loop closure into edge alignment.

In highly dynamic environments, the standard RANSAC (RANdom SAmple Consensus) [16] approach does not handle dynamic objects well [17][18][19]. Traditional robust kernel functions, such as the Huber kernel, cannot eliminate the influence of highly dynamic objects, as shown in Section 4.1. To handle moving objects in dynamic environments, different strategies are adopted for robust VO or SLAM. Namdev et al. [20] segment moving objects based on dense optical flow and two-view geometry. The optical flow potential and the geometry potential are calculated for a graph-based segmentation algorithm; similar potentials are clustered together and the motion segmentation is obtained. Their method achieves precise segmentation of moving objects; however, processing a single frame takes seven minutes. Sun et al. [21] calculate the difference image between a source image and a target image to detect the boundaries of moving objects. Motion segmentation is then achieved based on vector-quantized depth images, but each frame still takes almost half a second. Moreover, its performance in low-dynamic scenes is not satisfactory. These two dense motion segmentation methods are time-consuming. RDSLAM [18] compares appearance and structure to detect changed GPU-SIFT features. Outliers are removed by a prior-based adaptive RANSAC algorithm. However, RDSLAM is restricted to small environments and is not sufficiently accurate. Li et al. [19] extract foreground depth edges and calculate the static point weights of the depth edge points based on Student's t-distribution. The static point weights, intensity weights and geometric weights are added to ICP for the registration. Experiments on the TUM RGB-D dataset [22] show that this method achieves excellent results. However, its loop closure based on 3D points is more rudimentary than loop closure based on point features.

In contrast to previous robust approaches to egomotion estimation in dynamic environments, we propose a novel robust VO based on edges and points for RGB-D cameras. A two-stage process is adopted: edge alignment with DT errors detects dynamic areas in the images, and camera poses are estimated with static weights from the reprojection errors of non-dynamic areas. Compared with dense motion segmentation, edge alignment dealing with sparse 3D points guarantees real-time performance. Besides, egomotion estimation with little static information in dynamic environments is a challenging task.
Reprojection error optimization with a small number of static and accurate point features can obtain better localization results than the other three types of methods. The main contributions of this work are as follows:
1) To handle motion blur in images, a novel reference frame selection strategy is proposed for edge alignment in dynamic environments.
2) As a form of sparse motion segmentation, a novel area state detection scheme is presented based on DT errors and priors, and the states are further divided based on reprojection errors.
3) Experiments on the TUM dataset show that the proposed method obtains more precise localization than other robust approaches.

To the best of our knowledge, the proposed robust VO is the first method combining edges and points in dynamic environments.

2 Preliminaries
The world coordinate system and the camera coordinate system are denoted by $W$ and $C$, respectively. A camera pose in the world coordinate system is defined as a transformation matrix $T_W^C \in SE(3)$, consisting of an orthogonal rotation matrix $R_W^C \in SO(3)$ and a translation vector $t_W^C$:

$$T_W^C = \begin{bmatrix} R_W^C & t_W^C \\ 0 & 1 \end{bmatrix} \qquad (1)$$

For a 3D point $P_W = (x_W, y_W, z_W)$ in the world coordinate system, the corresponding position $P_C = (x_C, y_C, z_C)$ in the camera coordinate system can be obtained by $T_W^C$:

$$P_C = T_W^C P_W = R_W^C P_W + t_W^C \qquad (2)$$

Accordingly, $P_C$ is projected into the image coordinate system to obtain the pixel position $u = (u, v)$ by $\pi(P_C)$:

$$\pi(P_C) = \left( \frac{f_x x_C}{z_C} + c_x,\; \frac{f_y y_C}{z_C} + c_y \right) \qquad (3)$$

where $f_x$ and $f_y$ are the focal lengths, and $c_x$ and $c_y$ are the optical centres of the pinhole camera model. Conversely, if the depth $z_C$ of a pixel $u$ is known, $P_C$ can be calculated as follows:

$$P_C = \left( \frac{(u - c_x)\, z_C}{f_x},\; \frac{(v - c_y)\, z_C}{f_y},\; z_C \right) \qquad (4)$$

Ordinarily, optimization-based VO treats the egomotion estimation as an energy minimization problem, which needs a minimal representation of a camera pose. However, $T_W^C$ is over-parameterized and cannot be used directly in the optimization. Therefore, the Lie algebra is adopted to parameterize the camera pose. An element $\xi \in \mathfrak{se}(3)$ of the Lie algebra can be mapped to $T \in SE(3)$ by the exponential map; conversely, $T$ can be mapped to $\xi$ by the logarithmic map:

$$T = \exp_{\mathfrak{se}(3)}(\xi) \qquad (5)$$

$$\xi = \log_{SE(3)}(T) \qquad (6)$$

A minimized function $F(T)$ can be solved with an iterative non-linear least squares method. The element $\xi \in \mathfrak{se}(3)$ acts as a perturbation of the transformation matrix $T$. In each iteration, $\xi$ is optimized to minimize the function energy, and then $T$ is updated with $\xi$.
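As a concrete illustration of the camera model and the exponential map in (1)-(5), the following minimal Python sketch implements the projection, the back-projection and the $\mathfrak{se}(3)$ exponential (the intrinsics are illustrative values in the style of the TUM dataset, not parameters reported in this paper; scipy is used only for the matrix exponential).

```python
import numpy as np
from scipy.linalg import expm

# Illustrative pinhole intrinsics (fx, fy, cx, cy); real values come from calibration.
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5

def project(P_c):
    """pi(P_C): map a 3D point in the camera frame to pixel coordinates, Eq. (3)."""
    x, y, z = P_c
    return np.array([FX * x / z + CX, FY * y / z + CY])

def back_project(u, v, z):
    """Inverse projection of a pixel (u, v) with known depth z, Eq. (4)."""
    return np.array([(u - CX) * z / FX, (v - CY) * z / FY, z])

def se3_exp(xi):
    """Exponential map of Eq. (5): a twist xi = (v, w) in se(3) -> 4x4 matrix in SE(3)."""
    v, w = xi[:3], xi[3:]
    twist = np.zeros((4, 4))
    twist[:3, :3] = np.array([[0.0, -w[2], w[1]],
                              [w[2], 0.0, -w[0]],
                              [-w[1], w[0], 0.0]])
    twist[:3, 3] = v
    return expm(twist)

# Example: perturb an identity pose and reproject a point, as done in each iteration.
T = se3_exp(np.array([0.01, 0.0, 0.0, 0.0, 0.002, 0.0]))
P_c = T[:3, :3] @ back_project(320.0, 240.0, 1.5) + T[:3, 3]   # Eq. (2)
print(project(P_c))
```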
3 Robust visual odometry
3.1 Overview
ORB features [23] are selected as the representation of points. Fig. 1 shows the overview of the proposed robust VO method. Similar to ORB-SLAM2, the VO approach consists of a tracking thread and a local mapping thread. First, ORB features and edge information of the current frame are extracted, and the DT of the edge information is calculated. Edge alignment based on DT errors is executed between the current frame and a reference frame to acquire the transformation matrix $T_{cur}^{ref}$. Static weights $\omega_s$ are calculated from the DT errors and the prior states from the last frame. The states of the current image areas are determined by $\omega_s$. Then, the current camera pose is initialized with $T_{cur}^{ref}$ and is optimized based on the ORB features in non-dynamic areas with $\omega_s$ by motion-only BA (bundle adjustment). Finally, if the current frame is chosen as a keyframe, it is sent to the local mapping
thread to adjust the local map and to execute local BA. Note that reference frames are different from keyframes: reference frames are used for edge alignment, and keyframes participate in the local mapping.

Fig. 1. Overview of the robust VO method. The tracking thread extracts ORB features and 3D edge information from the current frame, aligns the edges against the DT of the reference frame, detects area states, optimizes the pose with static weights by motion-only BA, and selects the reference frame; the local mapping thread manages the local map (3D point culling) and executes local BA on the keyframes.
3.2 Edge weight estimation
An image is divided into 20*20 pixel blocks. For example, a pixel whose location is (10,10) in an image at 640*480 resolution belongs to the pixel block (1,1). Each pixel block corresponds to an area of the image, which has one of the following three states: static, unknown and dynamic. Unknown areas are areas that may be either static or dynamic and whose state cannot yet be determined. The generation of unknown areas depends on the DT errors in Section 3.3, and the unknown areas are further divided into static areas or dynamic areas by the reprojection errors of the point features, as described in Section 3.5.
Fig. 2 Edge extraction and calculation of DT: (a) gray image; (b) 2D edge image; (c) DT image.
As mentioned in [12], deep-learned edge features, such as Structured Edges [24] and Holistically-Nested Edges [25], omit weak edges in images. Their performance in edge alignment is better than that of the Canny algorithm [26]. However, the lack of weak edges is unfit for the subsequent area state detection. Therefore, 2D edges are still extracted by the Canny edge detector for each frame in this paper, as shown in Fig. 2(b). If a 2D edge has a reliable depth measure from the RGB-D camera, the corresponding 3D edge point is obtained by (4).

After edge extraction, the DT information based on the 2D edges is calculated with the OpenCV library, as shown in Fig. 2(c). The intensity of each pixel in the DT image indicates the distance to the closest edge: the brighter the pixel in the DT image, the farther away it is from the closest edge. The intensity of the pixel serves as the measurement for the subsequent edge alignment, namely, the DT value.

A good edge distribution is of importance for edge alignment. However, when moving objects have many textures (such as the checked shirt in Fig. 2(c)), the 3D edge points from the dynamic textures will introduce an uneven spatial distribution, which heavily affects edge alignment. Therefore, each pixel block contains at most $N_{th}$ 3D edge points.
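For concreteness, the edge map and its distance transform described above can be computed with OpenCV roughly as follows (a minimal sketch; the Canny thresholds are illustrative choices, not values reported in this paper).

```python
import cv2

def edge_and_dt(gray_image):
    """Extract Canny edges and the distance transform used for edge alignment.

    Returns the binary edge map and a float image whose value at each pixel is
    the Euclidean distance to the closest edge pixel (the DT value)."""
    edges = cv2.Canny(gray_image, 100, 200)               # 2D edge map (255 on edges)
    # distanceTransform measures the distance to the nearest zero pixel,
    # so the edge map is inverted to make edge pixels the zero set.
    dt = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)
    return edges, dt

# Usage with an RGB-D color image (hypothetical file name):
# gray = cv2.cvtColor(cv2.imread("rgb_frame.png"), cv2.COLOR_BGR2GRAY)
# edges, dt = edge_and_dt(gray)
```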
The transformation matrix $T_{cur}^{ref}$ between a reference frame and the current frame is initialized with the motion model of the camera. The motion model assumes that the camera motion is identical between adjacent time intervals. Assuming that the current time is $k$, we can obtain $T_{k-1}^{k-2}$. Then, $T_{cur}^{ref}$ is initialized as follows:

$$T_{k-1}^{k-2} = T_W^{k-2}\,(T_W^{k-1})^{-1} \qquad (7)$$

$$T_{cur}^{ref} = T_W^{ref}\,(T_W^{k-1})^{-1}\,T_{k-1}^{k-2} \qquad (8)$$

Then, $T_{cur}^{ref}$ is estimated by the edge alignment, which minimizes the DT errors of the current 3D edge points $P_C^{cur}$. For a current 3D edge point $P_C^{cur\_i}$, its position in the coordinate system of the reference frame is obtained by $T_{cur}^{ref} P_C^{cur\_i}$. It is then projected into the reference frame to obtain the pixel coordinate $\pi(T_{cur}^{ref} P_C^{cur\_i})$, and its DT value is $DT_{ref}[\pi(T_{cur}^{ref} P_C^{cur\_i})]$. If $T_{cur}^{ref}$ is precise, the reprojected pixel $\pi(T_{cur}^{ref} P_C^{cur\_i})$ of $P_C^{cur\_i}$ should be an edge pixel, which means that $DT_{ref}[\pi(T_{cur}^{ref} P_C^{cur\_i})]$ is zero. Therefore, $DT_{ref}[\cdot]$ measures the error introduced by $T_{cur}^{ref}$, which is termed the DT error. The edge alignment is expressed as follows:

$$T_{cur}^{ref} = \arg\min_{T_{cur}^{ref}} \sum_i \omega_{H_i} \left\| DT_{ref}\big[\pi(T_{cur}^{ref} P_C^{cur\_i})\big] \right\|^2 \qquad (9)$$

where $\omega_{H_i}$ is calculated by the Huber kernel function as follows:

$$\omega_{H_i} = \begin{cases} 1 & DT_{ref}[\cdot] \le \delta_{th} \\ \dfrac{\delta_{th}}{DT_{ref}[\cdot]} & DT_{ref}[\cdot] > \delta_{th} \end{cases} \qquad (10)$$

where $\delta_{th}$ is the threshold of the Huber kernel function.

Edge repeatability is important for edge alignment. Following REVO, an edge filter is utilized to enhance reliability. However, in contrast to REVO, which takes all current 3D edge points into the edge alignment, the proposed method only takes into account the current 3D edge points that are reprojected into the non-dynamic (namely, static and unknown) areas of the reference frame.

Edge alignment can be solved by an iterative non-linear optimization, such as the Levenberg-Marquardt (LM) algorithm. Then, the egomotion estimate $T_{cur}^{ref}$ for the current frame and a weight $\omega_{H_i}$ for each 3D edge point $P_C^{cur\_i}$ are obtained. The reprojections of the current 3D edge points are shown in Fig. 3. Compared with static edges, the edges of dynamic objects obtain higher DT errors. We can conclude that edge alignment based on DT errors works well in dynamic environments.
Fig. 3. Current 3D edge points projected into reference frames ((a)-(d)). The sequences are from the TUM RGB-D dataset; the persons move in the environments. Red edges indicate high DT errors and yellow edges express low DT errors.
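As an illustration of (9) and (10), the following sketch evaluates the DT residual and the Huber weight for a set of current 3D edge points under a candidate transformation (a hedged example under assumed intrinsics and threshold; bilinear interpolation of the DT image is replaced by a nearest-neighbour lookup for brevity).

```python
import numpy as np

def dt_errors_and_weights(T_cur_ref, edge_points_cur, dt_ref, delta_th=5.0,
                          fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """DT errors DT_ref[pi(T P)] and Huber weights of Eqs. (9)-(10) (sketch).

    T_cur_ref       : 4x4 transformation from the current to the reference frame
    edge_points_cur : (N, 3) current 3D edge points in the camera frame
    dt_ref          : distance-transform image of the reference frame
    """
    R, t = T_cur_ref[:3, :3], T_cur_ref[:3, 3]
    P_ref = edge_points_cur @ R.T + t                    # points in the reference frame
    u = fx * P_ref[:, 0] / P_ref[:, 2] + cx              # pinhole projection, Eq. (3)
    v = fy * P_ref[:, 1] / P_ref[:, 2] + cy

    h, w = dt_ref.shape
    valid = (P_ref[:, 2] > 0) & (u >= 0) & (u < w - 1) & (v >= 0) & (v < h - 1)
    errors = dt_ref[v[valid].astype(int), u[valid].astype(int)]  # nearest-neighbour lookup

    # Huber-style weight of Eq. (10): 1 below the threshold, delta_th / error above it.
    weights = np.where(errors <= delta_th, 1.0, delta_th / np.maximum(errors, 1e-9))
    return valid, errors, weights
```

In the full method these residuals are minimized over $\xi \in \mathfrak{se}(3)$ with the LM algorithm, restricted to edge points that project into non-dynamic areas of the reference frame.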
3.3 Area state detection
In this section, the areas are preliminarily divided into static areas, dynamic areas and unknown areas. Here, an example is given to illustrate the area state detection, as shown in Fig. 4. Let us assume that there are a static area A, a dynamic area B and an unknown area C in the reference frame. For the current frame, there are extracted edges in the D, E and F areas. First, the 3D edge points in D and E are projected into A in the reference frame. Depending on the DT errors, D and E are divided into a static area and a dynamic area, respectively. Second, we assume that the 3D edge points in F are projected into B or C. Due to the uncertainty of dynamic objects, the states of current 3D edge points that are projected into B cannot be determined by the DT errors of the reference frame. Similarly, the state of C may be dynamic or static. Therefore, the state of F cannot be determined from B and C. Third, there are no edges in G, which means that DT errors cannot be calculated, so G is classified as an unknown area. As we can see, the unknown areas consist of the following two types: current areas without edges (namely, G), and current areas whose corresponding areas of the reference frame are non-static (namely, F).

Fig. 4. Illustration of area state detection. The reference frame contains a static area A, a dynamic area B and an unknown area C; the current areas D, E and F (with edges) and G (without edges) are classified by projection into the reference frame.
First, we consider static areas and dynamic areas. For each pixel block $(i, j)$ of the current frame, the weight and depth statistics of the 3D edge points are analysed to judge dynamic areas. The average weight $\bar{\omega}(i,j)$ and average depth $\bar{d}(i,j)$ of a pixel block $(i, j)$ are computed as follows:

$$\bar{\omega}(i,j) = \sum_k \omega_{H_k} / N_{i,j} \qquad (11)$$

$$\bar{d}(i,j) = \sum_k d_k / N_{i,j} \qquad (12)$$

where $N_{i,j}$ is the total number of effective 3D edge points in pixel block $(i, j)$, i.e., those projected into the static areas of the reference frame, and $d_k$ is the corresponding depth value of a 3D edge point in the current frame.

The DT errors of the current edges are not enough to determine that a pixel block of the current frame is a dynamic area. Fig. 3 shows that, due to dynamic objects, edge alignment is not completely precise, which results in large DT errors for several static edges. Thus, the states of the corresponding pixel blocks in the last frame are taken into account. The static weight and average depth of the last frame are $\omega_{ls}(i,j)$ and $\bar{d}_l(i,j)$, respectively. If the average depths $\bar{d}(i,j)$ and $\bar{d}_l(i,j)$ are similar, the states of the current pixel block and the last pixel block should be consistent. Thus, the current static weight $\omega_s(i,j)$ of the pixel block is calculated as follows:

$$\omega_s(i,j) = \big(1 - e^{-d_{diff}}\big)\,\bar{\omega}(i,j) + e^{-d_{diff}}\,\omega_{ls}(i,j) \qquad (13)$$

$$d_{diff} = \mathrm{abs}\big(\bar{d}(i,j) - \bar{d}_l(i,j)\big) \qquad (14)$$

where $\omega_{th}$ is a threshold to decide the state of a pixel block, $\mathrm{abs}(\cdot)$ obtains the absolute value of a number, and $d_{diff}$ is the depth difference. The smaller $d_{diff}$ is, the more consistent $\omega_{ls}(i,j)$ and $\omega_s(i,j)$ are. In contrast, if $d_{diff}$ is large, $\omega_{ls}(i,j)$ has little impact on $\omega_s(i,j)$. If $\omega_s(i,j)$ is less than $\omega_{th}$, the pixel block $(i, j)$ is considered a dynamic area (namely, E in Fig. 4), as the red squares shown in Fig. 5. Otherwise, it is a candidate static area.
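The per-block statistics can be sketched as follows, following (11)-(14) as given above (the block size matches the paper's 20*20 pixels, while the threshold value, the block index convention and the data layout are illustrative assumptions).

```python
import numpy as np

BLOCK = 20  # pixel block size used in the paper

def block_static_weight(points_uv, huber_w, depths, w_ls, d_l, i, j, omega_th=0.5):
    """Average weight, average depth and static weight of one pixel block (i, j).

    points_uv : (N, 2) pixel locations of the effective 3D edge points
    huber_w   : (N,) Huber weights from the edge alignment, Eq. (10)
    depths    : (N,) depths of the corresponding current 3D edge points
    w_ls, d_l : static weight and average depth of the same block in the last frame
    """
    # Assumed convention: block row i from the v coordinate, block column j from u.
    in_block = ((points_uv[:, 0] // BLOCK).astype(int) == j) & \
               ((points_uv[:, 1] // BLOCK).astype(int) == i)
    n = np.count_nonzero(in_block)
    if n == 0:
        return None                                    # no edges: unknown area

    w_bar = huber_w[in_block].sum() / n                # Eq. (11)
    d_bar = depths[in_block].sum() / n                 # Eq. (12)

    d_diff = abs(d_bar - d_l)                          # Eq. (14)
    w_s = (1.0 - np.exp(-d_diff)) * w_bar + np.exp(-d_diff) * w_ls   # Eq. (13)

    state = "dynamic" if w_s < omega_th else "static-candidate"
    return w_bar, d_bar, w_s, state
```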
Fig. 5. Dynamic area detection based on static weights ((a)-(d)). Red squares indicate dynamic areas. Yellow squares imply static areas and unknown areas. To show the states of the areas clearly, each square does not cover the whole pixel block.

Then, unknown areas are analysed among the non-dynamic areas. It is worthwhile to note that several dynamic areas are not detected, such as parts of the checked shirt in Fig. 5(a) and the arm in Fig. 5(b). These areas are regarded as unknown areas in our method (namely, G in Fig. 4). These unknown areas result from non-existent edges. In fact, the subsequent pose optimization based on features depends on the non-dynamic areas. If there are no edges in a pixel block, features are nearly non-existent in this area. Therefore, the states of areas without edges hardly reduce the localization accuracy. Furthermore, if the projection of the current 3D edge point $P_C^{cur\_i}$ belongs to the non-static areas of the reference frame, the state of $P_C^{cur\_i}$ cannot be determined. If half of the 3D edge points in the current pixel block are unknown, the pixel block is considered an unknown area (namely, F in Fig. 4). The static weight $\omega_s(i,j)$ of an unknown area is set to 0.5 to model the uncertainty for the pose optimization in Section 3.5. Candidate static areas that are not categorized as unknown areas are regarded as static areas (namely, D in Fig. 4).

3.4 Reference frame selection
In REVO [12], the tracking quality is measured after edge alignment between the current frame and a reference frame. If the quality is not valid, the last frame is selected as the new reference frame. The DT of the new reference frame is then calculated and aligned with the current frame. Namely, two edge alignments and one calculation of the DT are executed whenever the reference frame is updated. This strategy increases computation costs. Moreover, in dynamic environments, the poor egomotion estimation provided by edge alignment cannot accurately project 3D edge points for the tracking quality appraisal, which limits the evaluation of the strict overlap of edges between the current frame and the reference frame. This results in frequent reference frame insertion.
Fig. 6. Strategy of the proposed reference frame selection. Weighted DT errors among the current frame $F_k$ and the candidate reference frames in $\mathcal{F} = \{F_{cur\_ref}, \ldots, F_{k-2}, F_{k-1}\}$ are evaluated. The frame with the least weighted DT error is selected as a new reference frame for the next frame $F_{k+1}$.
A novel reference frame selection is proposed for edge alignment in dynamic environments, as shown in Fig. 6. Contrary to REVO, a new reference frame is selected for the edge alignment of the next frame. It is assumed that the image information between two successive frames changes little. Weighted DT errors are adopted for the matching evaluation between the current frame and adjacent frames. The adjacent frame with the least weighted DT error is selected as the new reference frame. In general, the weighted DT error between successive frames is the least. However, if motion blur exists in the last frame, edge alignment between the last frame and the current frame becomes intolerable. As a result, area state detection based on DT errors is inaccurate. Therefore, a reference frame is selected from the adjacent frames with the least weighted DT error. First, we assume that the current frame is the $k$-th frame, termed $F_k$. Then, a set of candidate reference frames is obtained:

$$\mathcal{F} = \{F_{cur\_ref}, \ldots, F_{k-2}, F_{k-1}\} \qquad (15)$$

Note that $\mathcal{F}$ does not contain $F_k$, which means that edge alignment between two successive frames is not utilized by this method. The reason for this limit is that the dynamic areas can hardly be detected by edge alignment if the dynamic objects move slowly. However, when the camera stays still in static environments, $\mathcal{F}$ may contain many candidate reference frames, which results in high computation costs in reference frame selection. For this reason, $\mathcal{F}$ is restricted to a maximum of four candidate reference frames according to the adjacent time frames.
Second, the current 3D edge points in the non-dynamic areas are projected into the candidate reference frames by the corresponding transformations. DT errors are obtained and the weights $\omega_{H_i}$ are calculated by (10). Then, the weighted DT error between the current frame and a candidate reference frame is evaluated by $V_r$, which has an inverse correlation with it:

$$V_r = \sum_i c_i\, \omega_{H_i} \qquad (16)$$

where $\omega_{H_i}$ has an inverse correlation with the DT error, and $c_i$ adjusts the importance of a 3D edge point based on the numerical interval of $\omega_{H_i}$, as shown in (17). It is clear that $c_i$ encourages large $\omega_{H_i}$ and restrains small $\omega_{H_i}$, which means that a smaller DT error is more acceptable. The reason for this behaviour is that the DT errors obtained from images with motion blur are generally large, as shown in the middle image of Fig. 7(a), and the proposed method prefers the reference frame with little or no motion blur.

$$c_i = \begin{cases} 1.5 & \omega_{H_i} \in [0.8, 1] \\ 1.25 & \omega_{H_i} \in [0.6, 0.8) \\ 1 & \omega_{H_i} \in [0.4, 0.6) \\ 0.5 & \omega_{H_i} \in [0, 0.4) \end{cases} \qquad (17)$$

Fig. 7. Area state detection with different reference frame selection: (a) and (b) are obtained by REVO and our method, respectively. The images in the first column are the selected reference frames. The images in the second column are the results of edge alignment. The images in the third column show the dynamic areas by red squares.
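A minimal sketch of the candidate evaluation in (16)-(17) is given below; it reuses the illustrative `dt_errors_and_weights` function from the Section 3.2 sketch, and the assumption that each candidate is stored as a pair of the relative transformation and the candidate's DT image.

```python
import numpy as np

def c_of(w):
    """Piecewise factor c_i of Eq. (17): large Huber weights (small DT errors) are favoured."""
    if w >= 0.8:
        return 1.5
    if w >= 0.6:
        return 1.25
    if w >= 0.4:
        return 1.0
    return 0.5

def select_reference_frame(edge_points_cur, candidates):
    """Pick the candidate frame with the largest weighted DT score V_r, Eq. (16).

    candidates: list of (T_cur_cand, dt_image) pairs for at most four adjacent frames."""
    best_frame, best_score = None, -np.inf
    for idx, (T_cur_cand, dt_img) in enumerate(candidates):
        _, _, weights = dt_errors_and_weights(T_cur_cand, edge_points_cur, dt_img)
        v_r = sum(c_of(w) * w for w in weights)        # Eq. (16)
        if v_r > best_score:
            best_frame, best_score = idx, v_r
    return best_frame, best_score
```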
Finally, the candidate reference frame with the largest $V_r$ is selected as the new reference frame. Fig. 7 shows an example of reference frame selection with motion blur. The frames are from the "Fr3/walking_static" sequence of the TUM dataset. As we can see in Fig. 7(a), the edge alignment with a reference frame affected by motion blur (the 37th frame in the sequence) is unsatisfactory: many static areas are misjudged as dynamic areas. In contrast, the reference frame obtained by the proposed method (the 35th frame in the sequence) is aligned properly.

3.5 Egomotion estimation based on point features with static weights
The tracking thread executes the motion-only BA for pose optimization, while the local mapping thread implements the local BA of camera poses and 3D feature points. Following ORB-SLAM2 [4], the motion-only BA consists of tracking the last frame and the local map. Contrary to ORB-SLAM2, the states of image areas and the static weights from edge alignment and area state detection are considered in the optimization.
First, feature matching is conducted for non-dynamic areas by descriptors. Each matched feature whose location in the image is $u_i$ is associated with a corresponding 3D feature point $P_W^i$ from the last frame or the keyframes.

Second, the reprojection errors of the corresponding 3D feature points are minimized to estimate the egomotion. If only the ORB features in static pixel blocks took part in the pose estimation, the number of ORB features might sometimes not be sufficient for accurate optimization. Therefore, the ORB features in both static and unknown pixel blocks participate in the optimization of the motion-only BA. Due to the uncertainties of the ORB features in unknown pixel blocks, the ORB features in static pixel blocks should play a more important role in the egomotion estimation. To adjust the importance of ORB features in static and unknown pixel blocks, the static weight $\omega_s$ is adopted in the optimization.

$T_W^{cur}$ is initialized as $T_{ref}^{cur} T_W^{ref}$ using $T_{cur}^{ref}$ from the edge alignment. The motion-only BA is solved by the LM algorithm:

$$T_W^{cur} = \arg\min_{T_W^{cur}} \sum_i \omega_s\big(\mathrm{Grid}(u_i)\big)\, \omega_{pyr} \left\| u_i - \pi(T_W^{cur} P_W^i) \right\|^2 \qquad (18)$$

where $\mathrm{Grid}(\cdot)$ obtains the pixel block position to which $u_i$ belongs, $\omega_s$ determines the static weight of the pixel block, and $\omega_{pyr}$ is the weight of the image pyramid according to ORB-SLAM2:

$$\omega_{pyr} = \frac{1}{s^{l_i}} \qquad (19)$$

where $s$ is the scale of the image pyramid and $l_i$ is the level of the ORB feature in the image pyramid. The higher the level of the image pyramid, the more uncertain the ORB feature.

Third, after the pose optimization, a chi-square test on the reprojection errors is adopted to reject outliers. Outliers are divided into the following two types: outliers in static areas resulting from mismatches, and outliers in unknown areas resulting from dynamic objects. The associations between these 3D feature points and the features of the current frame are deleted. Moreover, for outliers of the second type, the 3D feature points do not participate in future pose optimization.

Lastly, the states of unknown areas are updated by the ratio of outliers in the pixel blocks. If half of the features in an unknown pixel block are outliers, the area is changed to a dynamic area. Otherwise, the unknown area becomes a static area. Fig. 8 shows the ORB features in static areas.
Fig. 8. Static ORB features for tracking the local map in static areas ((a)-(d)). There are almost no features on dynamic objects, namely, the moving persons.
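The weighted reprojection cost of (18)-(19) can be written out as follows (a sketch of the cost only; in the paper it is minimized over the pose with the LM algorithm inside the motion-only BA, and the pyramid scale factor and intrinsics below are assumed values).

```python
import numpy as np

BLOCK = 20          # pixel block size
SCALE = 1.2         # ORB image-pyramid scale factor (assumed value)

def reprojection_cost(T_w_cur, points_w, obs_uv, pyr_levels, static_w,
                      fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Weighted reprojection cost of Eq. (18) for one pose hypothesis.

    points_w   : (N, 3) 3D feature points in the world frame
    obs_uv     : (N, 2) matched ORB feature locations u_i in the current image
    pyr_levels : (N,) pyramid level l_i of each ORB feature
    static_w   : dict mapping a pixel-block index (i, j) to its static weight w_s
    """
    R, t = T_w_cur[:3, :3], T_w_cur[:3, 3]
    P_c = points_w @ R.T + t
    proj = np.stack([fx * P_c[:, 0] / P_c[:, 2] + cx,
                     fy * P_c[:, 1] / P_c[:, 2] + cy], axis=1)   # pi(T P), Eq. (3)

    cost = 0.0
    for u_i, p_i, l_i in zip(obs_uv, proj, pyr_levels):
        block = (int(u_i[1] // BLOCK), int(u_i[0] // BLOCK))     # Grid(u_i)
        w_s = static_w.get(block, 0.5)                           # unknown areas -> 0.5
        w_pyr = 1.0 / (SCALE ** l_i)                             # Eq. (19)
        cost += w_s * w_pyr * np.sum((u_i - p_i) ** 2)           # squared residual
    return cost
```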
The local mapping thread is similar to that of ORB-SLAM2. However, only the features in static areas are used when creating new 3D feature points by triangulation. A local map is obtained from the keyframe covisibility information. Then, local BA is implemented to enhance the consistency of the keyframes and 3D feature points in the local map.

4 Experiment
The proposed robust VO method is tested on the TUM RGB-D benchmark, which is collected by a moving Kinect. Its time-synchronized ground truth is obtained by a motion capture system. It contains several challenging sequences for SLAM
evaluation, such as large loop sequences, kidnapped sequences and sequences with dynamic objects. To validate the performance of the proposed VO, we principally focus on the static and dynamic sequences. The dynamic sequences are categorized into several types by the movement styles of the Kinect and the dynamic objects:
1. Sitting: persons sitting on chairs pick and place objects, or talk to others.
2. Walking: persons move around the areas.
3. Xyz: the Kinect moves along the x-y-z axes.
4. Rpy: the Kinect rotates along the roll-pitch-yaw axes.
5. Halfsphere: the Kinect moves along halfsphere-like trajectories.
6. Static: the Kinect remains nearly static.
The TUM RGB-D dataset provides various dynamic environments, which can be applied to evaluate the accuracy and stability of the robust VO. For comparison, the traditional VO, robust VO and robust SLAM approaches are tested on the same sequences.

4.1 Comparisons with traditional VO
Among traditional approaches, ORB-SLAM2 is a precise SLAM system based on indirect methods, which provides satisfactory results for the static scenes of the TUM RGB-D dataset. It utilizes the Huber kernel function and the visibility of 3D feature points to reject outliers, which can handle low-dynamic environments. The open-source implementation [27] is adopted for ORB-SLAM2. The proposed method and ORB-SLAM2 are evaluated with several static scenes, a low-dynamic scene and a highly dynamic scene, as shown in Tab. 1. For a fair comparison, the loop closure detection of ORB-SLAM2 is disabled in the experiments. The influence of dynamic objects increases gradually from the static environments to the dynamic environments.

To quantify the localization accuracy, the Absolute Trajectory Error (ATE) [22], which represents global consistency, is calculated with respect to the ground truth. Meanwhile, the standard deviation is applied to evaluate the stability of the approaches. The quantitative results of the root mean squared error (RMSE) and the standard deviation are shown in Tab. 1. More detailed statistics and distributions of the ATE, obtained by the evaluation tool [28], are shown in Fig. 9. The estimated trajectories are shown in Fig. 10: the first row shows the trajectories estimated by our robust approach, and the second row shows the trajectories estimated by ORB-SLAM2. The shorter the red lines in Fig. 10, the better the localization.

The two methods obtain similar localization in static environments that do not contain any dynamic factors. Compared with ORB-SLAM2, the increase in the RMSE does not exceed 10% in static environments. The reason for the performance degradation is that the proposed method wrongly eliminates static ORB features according to the area state detection. Fig. 11 shows that only a few areas are wrongly determined as dynamic areas in "Fr2/desk"; on average, only 6.04 ORB features per frame are determined as dynamic features in this sequence. From another perspective, the proposed algorithm positions the camera nearly as accurately as ORB-SLAM2 in static environments. However, with dynamic factors, the performance of ORB-SLAM2 obviously degrades. Especially in the "Fr3/walking-static" sequence, a moving person with a checked shirt provides a large number of dynamic features. Those features confound the causes of the observed feature motion, which seriously breaks the assumption of static environments. Compared with ORB-SLAM2, the proposed algorithm obtains a 98.16% improvement in the RMSE for the "Fr3/walking-static" sequence.
Moreover, this robust approach is more stable than ORB-SLAM2 in dynamic environments, achieving 46.76% and 97.81% standard deviation improvements for the low-dynamic and highly dynamic environments, respectively.
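For reference, the ATE statistics reported in Tab. 1 can be reproduced with a few lines once the estimated trajectory has been time-associated and rigidly aligned to the ground truth (a simplified sketch; the TUM benchmark tools additionally perform the time association and the alignment with Horn's method).

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error statistics for time-associated, aligned positions.

    est_xyz, gt_xyz: (N, 3) estimated and ground-truth camera positions."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)    # per-frame translational error
    rmse = np.sqrt(np.mean(err ** 2))
    return rmse, err.std()
```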
Note that the proposed method is based on indirect methods; therefore, the loop closure of ORB-SLAM2 can be used. The proposed method with the loop closure is tested on the static sequence "Fr2/desk", which contains loops, and the RMSE of ATE decreases from 0.0135 m to 0.0088 m. However, loops have not been detected in the dynamic sequences.

Table 1 ATE [m] of ORB-SLAM2 and the proposed method in static and dynamic environments.

Category         Seq.                      ORB-SLAM2            Proposed method      Improvement
                                           RMSE     Std. dev.   RMSE     Std. dev.   RMSE      Std. dev.
Static           Fr1/desk                  0.0160   0.0090      0.0149   0.0086       7.06%     4.44%
                 Fr1/desk2                 0.0237   0.0117      0.0246   0.0127      -3.84%    -8.55%
                 Fr1/room                  0.0934   0.0196      0.0989   0.0254      -5.89%   -29.60%
                 Fr2/desk                  0.0128   0.0060      0.0135   0.0052      -5.47%    13.3%
                 Fr2/xyz                   0.0035   0.0017      0.0036   0.0017      -2.86%     0%
                 Fr3/office                0.0140   0.0064      0.0140   0.0070       0%       -9.38%
                 Fr3/nst                   0.0192   0.0076      0.0208   0.0103      -8.33%   -35.53%
Low-dynamic      Fr3/sitting-halfsphere    0.0264   0.0139      0.0148   0.0074      43.94%    46.76%
Highly dynamic   Fr3/walking-static        0.4181   0.1828      0.0077   0.0040      98.16%    97.81%
Fig. 9 The statistics and distributions of ATE on the Fr2/desk, Fr3/sitting-halfsphere and Fr3/walking-static sequences: (a) statistics of ATE; (b) distributions of ATE.
Fig. 10 Kinect trajectories estimated by (a) our method and (b) ORB-SLAM2 on the Fr2/desk, Fr3/sitting-halfsphere and Fr3/walking-static sequences. The proposed method maintains more robust localization than ORB-SLAM2.
Fig. 11 Area state detection in static environments from the "Fr2/desk" sequence: (a) the 10th frame; (b) the 176th frame; (c) the 280th frame; (d) the 416th frame; (e) the 576th frame; (f) the 1257th frame. Only a few areas are wrongly regarded as dynamic areas.
4.2 Comparisons with robust VO
Furthermore, to compare the performance of existing VO methods in dynamic environments, more sequences are tested, consisting of static, low-dynamic and highly dynamic sequences. DVO (Dense Visual Odometry) is a direct method based on RGB-D sensors; it handles dynamic environments with a small quantity of moving objects by a robust weighting function. BaMVO [29] is particularly designed to handle dynamic environments with a background model. SPW-IAICP (Intensity-Assisted Iterative Closest Point with Static Point Weighting) is the VO of the work [19] and provides detailed experimental results on the TUM dataset. Since the sources of most robust VO approaches are closed, we adopt the evaluation results from their papers.

To quantify the odometry drift, the Relative Pose Error (RPE) [22] is adopted. The RMSE values of the translational drift and rotational drift are shown in Tab. 2. It is clear that SPW-IAICP achieves better results than DVO and BaMVO for most sequences; the corresponding analyses have been discussed in [19]. Significantly, due to the precise egomotion estimation by minimizing the reprojection errors of static features, the proposed method outperforms the three methods mentioned above for all of the sequences in Tab. 2.

Table 2 RPE of DVO, BaMVO, SPW-IAICP and the proposed method.

RMSE of translational drift [m/s]
Category         Seq.                      DVO [17]   BaMVO [29]   SPW-IAICP [19]   Proposed method
Static           Fr2/desk                  0.0296     0.0299       0.0173           0.0072
                 Fr3/long-office           0.0231     0.0332       0.0168           0.0084
Low-dynamic      Fr2/desk-person           0.0354     0.0352       0.0173           0.0068
                 Fr3/sitting-static        0.0157     0.0248       0.0231           0.0089
                 Fr3/sitting-xyz           0.0453     0.0482       0.0219           0.0109
                 Fr3/sitting-rpy           0.1735     0.1872       0.0843           0.0499
                 Fr3/sitting-halfsphere    0.1005     0.0589       0.0389           0.0168
Highly dynamic   Fr3/walking-static        0.3818     0.1339       0.0327           0.0101
                 Fr3/walking-xyz           0.4360     0.2326       0.0651           0.0292
                 Fr3/walking-rpy           0.4038     0.3584       0.2252           0.0561
                 Fr3/walking-halfsphere    0.2628     0.1738       0.0527           0.0352

RMSE of rotational drift [deg/s]
Category         Seq.                      DVO [17]   BaMVO [29]   SPW-IAICP [19]   Proposed method
Static           Fr2/desk                  1.3920     1.1167       0.7266           0.4506
                 Fr3/long-office           1.5689     2.1583       0.8012           0.4627
Low-dynamic      Fr2/desk-person           1.5368     1.2159       0.8213           0.4359
                 Fr3/sitting-static        0.6084     0.6977       0.7228           0.2809
                 Fr3/sitting-xyz           1.4980     1.3885       0.8466           0.4717
                 Fr3/sitting-rpy           6.0164     5.9834       5.6258           0.7830
                 Fr3/sitting-halfsphere    4.6490     2.8804       1.8836           0.5690
Highly dynamic   Fr3/walking-static        6.3502     2.0833       0.8085           0.2571
                 Fr3/walking-xyz           7.6669     4.3911       1.6442           0.5847
                 Fr3/walking-rpy           7.0662     6.3398       5.6902           1.021
                 Fr3/walking-halfsphere    5.2197     4.2863       2.4048           0.7618
4.3 Comparisons with robust SLAM
Finally, the robust VO in this work is compared with the complete SLAM systems of [17] and [19] in dynamic environments. In DVO SLAM, a metrical nearest neighbor search is adopted for candidate keyframes, and the average entropy and an entropy ratio test are utilized for loop detection, while SPW-IAICP SLAM uses geometric proximity, a common visible part and a forward-backward consistency check to detect loops. Both loop closures employ a pose graph for loop optimization. However, the optimization based on the pose graph does not adjust the environment map.

The RMSE and standard deviation of the ATE are shown in Tab. 3. It is worth noting that the proposed VO method still outperforms the two SLAM systems, which are designed for dynamic environments. Compared with SPW-IAICP SLAM, the proposed VO method obtains a 68% improvement in RMSE and a 78% improvement in standard deviation on average. Some of the trajectories are shown in Fig. 12, and the positioning results can be found on GitHub [30].

The proposed VO is implemented on a laptop with Ubuntu 14.04, equipped with an Intel i7-4720HQ CPU (2.60 GHz) and 16 GB RAM. ORB features and edges are extracted synchronously in different child threads. Compared with the tracking thread of ORB-SLAM2, the increased computation cost of the proposed method lies primarily in the edge alignment, which requires 6 ms per frame on average. The tracking thread can run at 20 Hz or greater.

Table 3 ATE [m] of the SLAM approaches and the proposed VO approach.

Seq.                      Motion Removal +       SPW-IAICP SLAM [19]   Our VO method        Improvement vs. [19]
                          DVO SLAM [17]
                          RMSE     Std. dev.     RMSE     Std. dev.    RMSE     Std. dev.   RMSE     Std. dev.
Fr3/sitting-xyz           0.0482   0.0282        0.0397   0.0206       0.0087   0.0041      78.1%    80.10%
Fr3/sitting-halfsphere    0.0470   0.0249        0.0432   0.0246       0.0148   0.0074      65.7%    69.92%
Fr2/desk-person           0.0596   0.0239        0.0484   0.0237       0.0060   0.0027      87.6%    88.61%
Fr3/walking-static        0.0656   0.0536        0.0261   0.0122       0.0078   0.0040      70.1%    67.21%
Fr3/walking-xyz           0.0932   0.0534        0.0601   0.0330       0.0222   0.0122      63.1%    63.03%
Fr3/walking-rpy           0.1333   0.0839        0.1791   0.1161       0.0388   0.0241      78.3%    79.24%
Fr3/walking-halfsphere    0.1252   0.0903        0.0489   0.7266       0.0328   0.0184      32.9%    97.47%

Fig. 12 Kinect trajectories estimated by (a) the proposed VO and (b) SPW-ICP SLAM [19] on the Fr2/desk_with_person, Fr3/walking_xyz and Fr3/walking_halfsphere sequences.
5 Conclusion
To overcome the localization accuracy degradation in dynamic environments, a novel VO method based on edges and points is proposed in this work. Dynamic image areas are detected by DT errors from the edge alignment with a reference frame. To handle motion blur in images, an improved strategy of reference frame selection is presented. Static weights from DT errors are added to the egomotion estimation based on reprojection errors. Compared with other motion removal approaches, the proposed method focuses on sparse edge alignment without dense segmentation of dynamic areas, which results in real-time operation. Experiments on the TUM dataset show that the method in this work outperforms state-of-the-art approaches. Especially for the highly dynamic sequence "Fr3/walking-rpy", the proposed VO method reduces the RMSE of ATE from 0.1791 m [19] to 0.0388 m, which improves the performance by 78.3%. Compared with traditional VO, the proposed method also achieves comparable performance in static environments. Future work will focus on loop closure in dynamic environments to enhance robustness for a complete SLAM system. In addition, the depth uncertainty of edges introduced by Kinect-like sensors will be taken into consideration for further improvement.

References
[1] C. Cadena, L. Carlone, H. Carrillo, et al., "Simultaneous localization and mapping: Present, future, and the robust-perception age," IEEE Trans. Robotics, vol. 32, no. 6, pp. 1309-1332, 2016.
[2] D. Scaramuzza and F. Fraundorfer, "Visual odometry [tutorial] Part I: The first 30 years and fundamentals," IEEE Robotics and Automation Magazine, vol. 18, no. 4, pp. 80-92, 2011.
[3] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. IEEE Int. Symp. Mixed and Augmented Reality, 2007, pp. 225-234.
[4] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robotics, vol. 33, no. 5, pp. 1255-1262, 2017.
[5] C. Kerl, J. Sturm, and D. Cremers, "Robust odometry estimation for RGB-D cameras," in Proc. IEEE Int. Conf. Robot. Autom., 2013, pp. 3748-3754.
[6] J. Engel, J. Stueckler, and D. Cremers, "Large-scale direct SLAM with stereo cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 1935-1942.
[7] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," arXiv:1607.02565 [cs.CV].
[8] F. Endres, J. Hess, J. Sturm, et al., "3-D mapping with an RGB-D camera," IEEE Trans. Robotics, vol. 30, no. 1, pp. 177-187, 2014.
[9] R. A. Newcombe et al., "KinectFusion: Real-time dense surface mapping and tracking," in Proc. IEEE Int. Symp. Mixed and Augmented Reality, 2011, pp. 127-136.
[10] I. Cvišić and I. Petrović, "Stereo odometry based on careful feature selection and tracking," in Proc. European Conf. Mobile Robots (ECMR), 2015, pp. 1-6.
[11] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry for ground vehicle applications," Journal of Field Robotics, vol. 23, no. 1, pp. 3-20, 2006.
[12] F. Schenk and F. Fraundorfer, "Robust edge-based visual odometry using machine-learned edges," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2017, pp. 1-8.
[13] M. Kuse and S. Shen, "Robust camera motion estimation using direct edge alignment and sub-gradient method," in Proc. IEEE Int. Conf. Robot. Autom., 2016, pp. 573-579.
[14] Y. Ling, M. Kuse, and S. Shen, "Edge alignment-based visual-inertial fusion for tracking of aggressive motions," Autonomous Robots, pp. 1-16, 2017.
[15] P. F. Felzenszwalb and D. P. Huttenlocher, "Distance transforms of sampled functions," Theory of Computing, vol. 8, no. 1, pp. 415-428, 2012.
[16] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.
[17] D.-H. Kim, S.-B. Han, and J.-H. Kim, "Visual odometry algorithm using an RGB-D sensor and IMU in a highly dynamic environment," in Robot Intelligence Technology and Applications 3, 2015, pp. 11-26.
[18] W. Tan, H. Liu, Z. Dong, et al., "Robust monocular SLAM in dynamic environments," in Proc. IEEE Int. Symp. Mixed and Augmented Reality, 2013, pp. 209-218.
[19] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2263-2270, 2017.
[20] R. K. Namdev, A. Kundu, K. M. Krishna, et al., "Motion segmentation of multiple objects from a freely moving monocular camera," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 4092-4099.
[21] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robot. Auton. Syst., vol. 89, pp. 110-122, 2017.
[22] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 573-580.
[23] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Computer Vision, 2011, pp. 2564-2571.
[24] P. Dollár and C. L. Zitnick, "Fast edge detection using structured forests," IEEE Trans. Pattern Anal. Machine Intell., vol. 37, no. 8, pp. 1558-1570, 2015.
[25] S. Xie and Z. Tu, "Holistically-nested edge detection," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 1395-1403.
[26] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-8, no. 6, pp. 679-698, 1986.
[27] ORB-SLAM2 website. [Online]. Available: https://github.com/raulmur/ORB_SLAM2.
[28] EVO website. [Online]. Available: https://github.com/konanrobot/evo.
[29] D.-H. Kim and J.-H. Kim, "Effective background model-based RGB-D dense visual odometry in a dynamic environment," IEEE Trans. Robotics, vol. 32, no. 6, pp. 1565-1573, 2016.
[30] Robust VO website. [Online]. Available: https://github.com/familyyao/Results_robust_VO.
Corresponding author: Erliang Yao. E-mail: [email protected]
Erliang Yao received his B.S. degree in 2008 and his M.Sc. degree in 2012 from High-Tech Institute of Xi'an, Shaanxi, China. He is now a Ph.D. candidate at High-Tech Institute of Xi'an. His latest research interests include robotic vision, path planning and applications to mobile robotics.

Hexin Zhang received his M.Sc. degree in 1994 from Northwestern Polytechnical University, and his Ph.D. degree in 2001 from High-Tech Institute of Xi'an, Shaanxi, China. He is now a doctoral supervisor and a professor at High-Tech Institute of Xi'an. His research interests include nonlinear control, fault tolerant control and computer vision.

Hui Xu received his B.Sc. degree in 2016 from High-Tech Institute of Xi'an, Shaanxi, China. He is now a postgraduate student at High-Tech Institute of Xi'an. His main research interest is robot vision navigation.

Haitao Song received his B.S. degree in 2005 and his M.Sc. degree in 2008 from High-Tech Institute of Xi'an, Shaanxi, China. He received his Ph.D. degree in 2015 from Tsinghua University, Beijing, China. He is now a lecturer at High-Tech Institute of Xi'an. His current research interests include system modelling, nonlinear control and intelligent decision.

Guoliang Zhang received his Ph.D. degree in 2003 from High-Tech Institute of Xi'an, Shaanxi, China. He became a doctoral supervisor and a professor at High-Tech Institute of Xi'an in 2009. He has published three books and about 50 articles. He is now a professor at Chengdu University of Information Technology, Sichuan, China. His latest research interests include advanced control, environment modeling, multi-sensor fusion and autonomous navigation.
Research highlights
- A novel reference frame selection scheme is proposed, which can handle motion blur in images.
- As sparse motion segmentation, a novel area state detection scheme is presented based on DT errors and priors, and the states are further divided based on reprojection errors.
- Experiments on the TUM dataset show that the proposed method obtains more precise localization than other robust approaches.