Delayed Inverse Depth Monocular SLAM


Proceedings of the 17th World Congress of the International Federation of Automatic Control, Seoul, Korea, July 6-11, 2008

Rodrigo Munguia, Antoni Grau
Department of Automatic Control, Technical University of Catalonia (UPC), c/ Pau Gargallo 5, E-08028 Barcelona, Spain, {rodrigo.munguia;antoni.grau}@upc.edu.

Abstract: The 6-DOF monocular camera case is possibly the hardest variant of the simultaneous localization and mapping (SLAM) problem. Several advances have appeared in this area in recent years; however, applying these techniques to real-world problems is still difficult. Recently, the unified inverse depth parametrization has been shown to be a good option for this challenging problem, within an EKF scheme for estimating the stochastic map and the camera pose. In this paper a new delayed initialization scheme is proposed for adding new features to the stochastic map. The results show that delayed initialization can improve some aspects without losing the performance and unified character of the original method when initial reference points are used to fix a metric scale in the map.

1. INTRODUCTION

In recent works, (Montiel, 2006a,b) and (Eade, 2006) have shown that an inverse depth parametrization for monocular SLAM improves the linearity of the measurement equation even when small changes in the camera position yield small changes in the parallax angle; this allows a Gaussian distribution to cover an uncertainty in depth that spans a range from nearby to infinity. In the unified inverse depth method presented by (Montiel, 2006a), the transition from partially to fully initialized features need not be explicitly tackled, which makes it suitable for direct use in an EKF framework for sparse mapping. In this approach the features are initialized in the first frame in which they are observed (undelayed initialization), with a fixed initial depth and uncertainty determined heuristically to cover ranges from nearby to infinity, so that distant points can also be coded. Due to its clarity and scalability, this approach is a good option for monocular SLAM implementations.

This work is motivated by the problems of vision-based robot map building and localization; if monocular SLAM is to be applied in this context, retrieving the metric scale of the world is very important. Experiments with the unified inverse depth method show that, when initial reference points are used to establish a metric scale in the map, the initial feature depths have to be tuned; otherwise it is likely that new features added to the map never converge with respect to the metric reference. On the other hand, initializing features far from the camera optical center can increase the possibility that a feature depth becomes negative after a Kalman update step.

Initializing features in the first observed frame (undelayed initialization) avoids keeping pre-initialized features in the state and allows all the information available in a feature to be used from the moment it is detected. On the other hand, when features are detected in the image with a saliency operator in order to be added automatically to the map, weak long-term image features are usually added as well, and it is then difficult to match them in subsequent frames. Moreover, when a minimum number of active image features has to be maintained, unnecessary initializations may occur, and every new feature initialization introduces biases into the system (Davison, 2001). These issues suggest that the initial inverse depth of new features, and its associated initial uncertainty, could be estimated before the features are added to the system state, instead of using a fixed initial depth and uncertainty; at the same time, candidate features can be tested before being added to the map in order to prune weak long-term features.

2. RELATED WORK

In (Davison, 2003) a multi-hypothesis method based on a particle filter is proposed to represent the initial depth of a feature. This work gives good results; however, its application to large environments is not straightforward, as it would require a huge number of particles. In (Lemaire, 2005) a delayed multi-hypothesis method based on a sum-of-Gaussians depth estimation is proposed, but it uses odometry as an additional sensor. The work in (Eade, 2006) is based on the FastSLAM algorithm, where the pose of the robot is represented by particles and a set of Kalman filters refines the estimation of the features; this approach is unable to code distant points.

In the work presented in this paper, a delayed feature initialization is proposed for adding new features to the stochastic map in a monocular SLAM context using the inverse depth parametrization. The experimental results show that delayed initialization can improve some aspects without losing the performance and unified aspect of the original (undelayed) method presented by (Montiel, 2006a), where initial reference points are used in order to fix a metric scale in the map.

3. INVERSE DEPTH MONOCULAR SLAM

3.1 Camera motion model

A free camera moving in any direction in ℝ³ × SO(3) is considered. The camera state x_v is defined by

x_v = \left[\, r^{WC} \;\; q^{WC} \;\; v^{W} \;\; \omega^{W} \,\right]^{T}   (1)

where r^{WC} = [x, y, z] is the camera optical center position, q^{WC} = [q_0, q_1, q_2, q_3] represents the camera orientation as a quaternion, and v^{W} = [v_x, v_y, v_z] and ω^{W} = [ω_x, ω_y, ω_z] denote the linear and angular velocities, respectively. At every step, unknown linear and angular accelerations a^{W} and α^{W}, modeled as zero-mean Gaussian processes with known covariance, are assumed to produce an impulse of linear and angular velocity:

n = \begin{pmatrix} V^{W} \\ \Omega^{W} \end{pmatrix} = \begin{pmatrix} a^{W}\,\Delta t \\ \alpha^{W}\,\Delta t \end{pmatrix}   (2)

The camera motion prediction model is

f_v = \begin{pmatrix} r^{WC}_{k+1} \\ q^{WC}_{k+1} \\ v^{W}_{k+1} \\ \omega^{W}_{k+1} \end{pmatrix} = \begin{pmatrix} r^{WC}_{k} + (v^{W}_{k} + V^{W})\,\Delta t \\ q^{WC}_{k} \times q\big((\omega^{W}_{k} + \Omega^{W})\,\Delta t\big) \\ v^{W}_{k} + V^{W} \\ \omega^{W}_{k} + \Omega^{W} \end{pmatrix}   (3)

where q((ω^{W}_{k} + Ω^{W}) Δt) is the quaternion defined by the rotation vector (ω^{W}_{k} + Ω^{W}) Δt. An Extended Kalman Filter propagates the camera pose and velocity estimates, as well as the feature estimates.

3.2 Features definition and measurement

The complete state, including the features y, is

x = \left[\, x_v^{T}, y_1^{T}, y_2^{T}, \ldots, y_n^{T} \,\right]^{T}   (4)

where a feature y_i represents a scene 3D point defined by the 6-dimensional state vector

y_i = \left[\, x_i, y_i, z_i, \theta_i, \phi_i, \rho_i \,\right]^{T}   (5)

where x_i, y_i, z_i are the camera optical center coordinates when the feature was first observed, and θ_i, φ_i are the azimuth and elevation (with respect to the world reference W) of the directional vector m(θ_i, φ_i). The point depth d_i along the ray is coded by its inverse ρ_i = 1/d_i, so the feature models the 3D point located at

\begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix} + \frac{1}{\rho_i}\, m(\theta_i, \phi_i)   (6)

The different locations of the camera, along with the locations of the already mapped features, are used to predict the feature position h_i. The observation of a point y_i from a camera location defines a ray expressed in the camera frame as h^{C} = [h_x, h_y, h_z]:

h^{C} = R^{CW} \left( \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix} + \frac{1}{\rho_i}\, m(\theta_i, \phi_i) - r^{WC} \right)   (7)

h^{C} is observed by the camera through its projection in the image. The projection is modeled using a full-perspective wide-angle camera. First the projection is modeled on the normalized retina:

\begin{pmatrix} \upsilon \\ \nu \end{pmatrix} = \begin{pmatrix} h_x / h_z \\ h_y / h_z \end{pmatrix}   (8)

The camera calibration model is then applied to produce the pixel coordinates of the predicted point:

h = \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} u_0 - \frac{f}{d_x}\,\upsilon \\ v_0 - \frac{f}{d_y}\,\nu \end{pmatrix}   (9)

where u_0, v_0 is the camera center in pixels, f is the focal length, and d_x, d_y are the pixel sizes. Finally, a radial distortion model is applied (Davison, 2004):

h_i = \begin{pmatrix} u_d \\ v_d \end{pmatrix} = \begin{pmatrix} \frac{u - u_0}{1 + 2 K_1 r^2} + u_0 \\ \frac{v - v_0}{1 + 2 K_1 r^2} + v_0 \end{pmatrix}   (10)

where r = \sqrt{(u - u_0)^2 + (v - v_0)^2} and K_1 is the distortion coefficient.

The feature search is constrained to elliptical regions around the predicted h_i. The elliptical regions are defined by the innovation covariance matrix S_i = H_i P_{k+1} H_i^T + R, where H_i is the Jacobian of the sensor model with respect to the state, P_{k+1} is the prior state covariance, and the measurements z are assumed to be corrupted by zero-mean Gaussian noise with covariance R.
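To make the measurement model concrete, the following sketch implements the feature prediction of equations (6)-(10). It is a minimal illustration, not the authors' code; the directional-vector convention and all function names are assumptions.

```python
import numpy as np

def directional_vector(theta, phi):
    # Unit ray m(theta, phi) in the world frame; azimuth/elevation
    # convention assumed (theta is azimuth about y, phi is elevation).
    return np.array([np.cos(phi) * np.sin(theta),
                     -np.sin(phi),
                     np.cos(phi) * np.cos(theta)])

def predict_feature_pixel(y_i, r_wc, R_cw, u0, v0, f, dx, dy, K1):
    """Project an inverse-depth feature y_i = [x, y, z, theta, phi, rho]
    into distorted pixel coordinates, following equations (6)-(10)."""
    x, y, z, theta, phi, rho = y_i
    # Equation (7): ray from the current camera to the 3D point, camera frame.
    h_c = R_cw @ (np.array([x, y, z]) + directional_vector(theta, phi) / rho - r_wc)
    hx, hy, hz = h_c
    # Equation (8): projection onto the normalized retina.
    u_n, v_n = hx / hz, hy / hz
    # Equation (9): calibration model gives undistorted pixel coordinates.
    u = u0 - (f / dx) * u_n
    v = v0 - (f / dy) * v_n
    # Equation (10): radial distortion model with coefficient K1.
    r2 = (u - u0) ** 2 + (v - v0) ** 2
    ud = (u - u0) / (1.0 + 2.0 * K1 * r2) + u0
    vd = (v - v0) / (1.0 + 2.0 * K1 * r2) + v0
    return np.array([ud, vd])
```

In an EKF implementation this prediction, together with its Jacobian H_i, defines the innovation covariance S_i that bounds the elliptical search regions described above.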

4. DELAYED FEATURE INITIALIZATION

4.1 Candidate points

We require a minimum number of predicted features y_i to appear in the image; otherwise, new features have to be added to the map. In the latter case, new points are detected in the image with a saliency operator; specifically, we use the Harris corner detector, although more robust detectors could be used. If the data association problem is to be addressed in a more robust way, feature descriptors could be used; we treated this problem in previous work (Munguía, 2006a,b). Only areas of the image free of previously detected points or of features already mapped are considered for detecting new points. We call these image points, which are not yet added to the map, candidate points λ. When a point is first detected by the saliency operator in a frame k, the candidate point is composed of

\lambda_i = \left( x_1, y_1, z_1, \sigma_{x_1}, \sigma_{y_1}, \sigma_{z_1}, q_1^0, q_1^1, q_1^2, q_1^3, \sigma_{q_1^0}, \sigma_{q_1^1}, \sigma_{q_1^2}, \sigma_{q_1^3}, u_1, v_1 \right)   (11)

The values x_1, y_1, z_1 represent the camera optical center position and σ_{x_1}, σ_{y_1}, σ_{z_1} their associated variances, taken from the state covariance matrix P_k; q_1^0, q_1^1, q_1^2, q_1^3 is the quaternion representing the current camera orientation and σ_{q_1^0}, σ_{q_1^1}, σ_{q_1^2}, σ_{q_1^3} its associated variances, also taken from P_k; and u_1, v_1 are the current pixel coordinates of the point λ_i. In subsequent frames λ_i is tracked; in practice some λ_i points cannot be tracked, and this is used to prune the weakest image features. Any tracking method can be used. Each candidate point λ_i is tracked until it is either pruned or initialized in the system. In practice, at every frame some new candidate points λ_i may be detected, others may be pruned, and others may be considered for addition to the map. In our experiments an average of 5 to 15 points λ_i is maintained at every step.
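As an illustration of the bookkeeping implied by equation (11), a candidate point only needs to store the camera pose, the corresponding variances and the pixel location at the frame of first detection. The sketch below uses hypothetical names and assumes the state ordering r, q, v, ω of equation (1).

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CandidatePoint:
    """Candidate point lambda_i of equation (11): data from the first observation."""
    r1: np.ndarray        # camera optical center (x1, y1, z1)
    sigma_r1: np.ndarray  # variances of (x1, y1, z1), taken from P_k
    q1: np.ndarray        # camera orientation quaternion (q0, q1, q2, q3)
    sigma_q1: np.ndarray  # variances of the quaternion components, taken from P_k
    uv1: np.ndarray       # pixel coordinates (u1, v1) of the detected point

def make_candidate(x_v, P_k, u, v):
    # Build lambda_i from the current camera state x_v and covariance P_k.
    return CandidatePoint(r1=x_v[0:3].copy(),
                          sigma_r1=np.diag(P_k)[0:3].copy(),
                          q1=x_v[3:7].copy(),
                          sigma_q1=np.diag(P_k)[3:7].copy(),
                          uv1=np.array([u, v], dtype=float))
```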

4.2 Adding features to the state

As the camera moves freely through its environment, the translation produces parallax in the features, and parallax is the key that allows feature depths to be estimated. For indoor sequences, a few centimeters of translation are enough to produce parallax; on the other hand, the more distant the features, the more the camera has to be translated to produce parallax.

In our approach we want to estimate dynamically an initial depth, and its associated uncertainty, for the features added to the map. For near features, a small translation is enough to produce some parallax, so we use a minimum parallax threshold α_min for considering a candidate point λ_i to be added to the map as a feature y_i. On the other hand, distant features will not produce parallax but are useful for estimating the camera orientation, so it is advantageous to include some distant features in the map with a large depth uncertainty; a minimum baseline camera translation |b|_min is therefore also considered for adding a candidate point to the map. Fig. 2 shows a simulation of the decrease of the feature depth uncertainty as the parallax angle grows; a few degrees of parallax are enough to reduce the depth uncertainty significantly. In the experiments α_min = 3° is used. The minimum baseline b_min was heuristically set to the baseline necessary to produce a parallax α ≈ 6° on the initial reference points; for example, if the camera initial position is on average one meter away from the initial reference points, then b_min = 8 cm.

Fig. 1. Feature parametrization and initialization.

Fig. 2. Simulation of the estimated feature depth uncertainty σ_p for parallax angles α from 0.1° to 10°. An increase in the uncertainty σ_θ of the measured angle θ is considered as the parallax grows. Note that a few degrees of parallax are enough to reduce the uncertainty in the estimation.

So far the uncertainty of the measurements is not considered, and the parallax α is estimated using i) the baseline b, ii) λ_i with its associated data (x_1, y_1, z_1, q_1^0, q_1^1, q_1^2, q_1^3, u_1, v_1), and iii) the current state (x_k, y_k, z_k, q_k^0, q_k^1, q_k^2, q_k^3, u_k, v_k). The parallax angle for a λ_i can be estimated as (Fig. 1)

\alpha = \pi - (\beta + \gamma)   (12)

The angle β is determined from the directional projection ray vector h_1 and the vector b_1, which defines the baseline b in the direction of the camera trajectory:

\beta = \cos^{-1}\!\left( \frac{h_1 \cdot b_1}{\|h_1\|\,\|b_1\|} \right)   (13)

where the directional projection ray vector h_1, expressed in the absolute frame, is computed from the camera position and the image coordinates of the point when it was first observed, using the data stored in λ_i (equation (14) below).
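The parallax test of equations (12)-(13) and the thresholds α_min and b_min can be sketched as below; the ray vectors h_1, h_2 and baseline vectors b_1, b_2 are constructed as in equations (14)-(18) that follow, and the default threshold values mirror the ones used in the experiments. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def angle_between(a, b):
    # Angle between two vectors, as used in equations (13) and (16).
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

def parallax_test(h1, b1, h2, b2, alpha_min_deg=3.0, b_min=0.08):
    """Return (initialize?, alpha, beta, baseline) for a candidate point."""
    beta = angle_between(h1, b1)          # equation (13)
    gamma = angle_between(h2, b2)         # equation (16)
    alpha = np.pi - (beta + gamma)        # equation (12)
    baseline = np.linalg.norm(b1)         # equation (19)
    initialize = np.degrees(alpha) > alpha_min_deg or baseline > b_min
    return initialize, alpha, beta, baseline
```

The returned α, β and baseline are exactly the quantities later required by equations (20)-(23) to initialize the feature.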


h_1 = R^{WC}(q_1^{WC}) \, h_1^{C}(u_1, v_1)   (14)

with R^{WC}(q_1^{WC}) being the rotation matrix corresponding to the stored camera orientation quaternion q_1^{WC} = (q_1^0, q_1^1, q_1^2, q_1^3), and h_1^{C}(u_1, v_1) the directional vector in the camera frame obtained from the camera model of equation (7). b_1 is the vector representing the camera baseline b between the camera optical center position x_1, y_1, z_1, where the point was first observed, and the current optical center x_k, y_k, z_k:

b_1 = \left[ (x_k - x_1),\; (y_k - y_1),\; (z_k - z_1) \right]   (15)

The angle γ is determined in a similar way as β, but using the directional projection ray vector h_2 and the vector b_2, which defines the baseline in the opposite direction of the camera trajectory:

\gamma = \cos^{-1}\!\left( \frac{h_2 \cdot b_2}{\|h_2\|\,\|b_2\|} \right)   (16)

The directional projection ray vector h_2, expressed in the absolute frame, is computed in a similar way as (14) but using the current camera position x_v and the point coordinates u, v; b_2 is equal to b_1 but points in the opposite direction:

h_2 = R^{WC}(q_k^{WC}) \, h_k^{C}(u, v)   (17)

b_2 = \left[ (x_1 - x_k),\; (y_1 - y_k),\; (z_1 - z_k) \right]   (18)

The baseline b is the modulus of b_2 or b_1:

b = \|b_2\| = \|b_1\|   (19)

If α > α_min or b > b_min, then λ_i is initialized as a new map feature:

\hat{y}_i = \left[\, \hat{x}_i, \hat{y}_i, \hat{z}_i, \hat{\theta}_i, \hat{\phi}_i, \hat{\rho}_i \,\right]^{T}   (20)

where the first three elements are taken directly from the current camera optical center position:

\begin{pmatrix} \hat{x}_i \\ \hat{y}_i \\ \hat{z}_i \end{pmatrix} = \begin{pmatrix} x_k \\ y_k \\ z_k \end{pmatrix}   (21)

The angles can be derived as

\begin{pmatrix} \hat{\theta}_i \\ \hat{\phi}_i \end{pmatrix} = \begin{pmatrix} \arctan\!\left( h_{2x},\; h_{2z} \right) \\ \arctan\!\left( -h_{2y},\; \sqrt{h_{2x}^2 + h_{2z}^2} \right) \end{pmatrix}   (22)

where h_2 = [h_{2x}, h_{2y}, h_{2z}] is obtained from equation (17). Finally, the inverse depth ρ̂_i is derived from the sine law:

\hat{\rho}_i = \frac{\sin \alpha}{b\,\sin \beta}   (23)

4.3 Updating the covariance matrix

The covariance for x̂_i, ŷ_i, ẑ_i, θ̂_i, φ̂_i and ρ̂_i is derived from the diagonal measurement error covariance matrix R_j and the state covariance estimate P_k:

R_j = \mathrm{diag}\left( \sigma_{u_2}^2, \sigma_{v_2}^2, \sigma_{u_1}^2, \sigma_{v_1}^2, \sigma_{x_1}, \sigma_{y_1}, \sigma_{z_1}, \sigma_{q_1^0}, \sigma_{q_1^1}, \sigma_{q_1^2}, \sigma_{q_1^3} \right)   (24)

R_j is composed of the image measurement error variances σ_{u_2}², σ_{v_2}², σ_{u_1}², σ_{v_1}² and the variances stored in λ_i: σ_{x_1}, σ_{y_1}, σ_{z_1}, σ_{q_1^0}, σ_{q_1^1}, σ_{q_1^2}, σ_{q_1^3}. The state covariance matrix after initialization is

P_k^{new} = J \begin{pmatrix} P_k & 0 \\ 0 & R_j \end{pmatrix} J^{T}   (25)

J = \begin{pmatrix} I & 0 \\ \frac{\partial y}{\partial x_v}, 0, \ldots, 0 & \frac{\partial y}{\partial h} \end{pmatrix}   (26)

where I is the identity matrix with the same dimension as P_k, ∂y/∂x_v are the derivatives of y_i with respect to the camera state x_v, and ∂y/∂h the derivatives of y_i with respect to the measurement terms collected in R_j. The Jacobian calculation is involved but a tractable matter of differentiation; its explicit form is not reproduced here.
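A sketch of the state augmentation of equations (25)-(26) is given below, assuming the two Jacobian blocks of the new feature (with respect to the camera state and with respect to the measurement terms of R_j) are available; as in the paper, their analytic form is not reproduced, and all names are illustrative.

```python
import numpy as np

def augment_covariance(P_k, R_j, dy_dxv, dy_dh):
    """State covariance after adding one 6-dimensional feature (eqs. 25-26).

    P_k    : n x n state covariance before initialization
    R_j    : m x m diagonal covariance of equation (24)
    dy_dxv : 6 x 13 Jacobian of the new feature w.r.t. the camera state x_v
    dy_dh  : 6 x m  Jacobian of the new feature w.r.t. the terms in R_j
    """
    n, m = P_k.shape[0], R_j.shape[0]
    n_cam = dy_dxv.shape[1]
    # J = [[I, 0], [dy/dx_v, 0, ..., 0, dy/dh]]   (equation 26)
    J = np.zeros((n + 6, n + m))
    J[:n, :n] = np.eye(n)
    J[n:, :n_cam] = dy_dxv
    J[n:, n:] = dy_dh
    # Block-diagonal [P_k, R_j] mapped through J   (equation 25)
    P_aug = np.zeros((n + m, n + m))
    P_aug[:n, :n] = P_k
    P_aug[n:, n:] = R_j
    return J @ P_aug @ J.T
```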

5. EXPERIMENTAL RESULTS

Real image sequences of 320 × 240 pixels, acquired with a monochrome IEEE1394 webcam at 30 fps, were used to test the performance of the method. The experiments were developed in MATLAB; the part of the code related to Section 3 was based on the code provided by the authors of (Montiel, 2006a). The initial reference consists of three spatial points forming a triangle of known dimensions (see Figs. 3 and 4). Before the first Kalman step, these three points are selected in the image, their 3D positions with respect to the camera are computed using an optimization technique, and they are finally included in the system state with zero uncertainty.

Several image sequences were recorded, moving the camera along different predefined trajectories, and the undelayed and delayed initializations were compared. The trajectories were designed so that a feature left behind by the camera motion does not appear in the image again in subsequent frames.

The original method has a drawback when an initial metric reference is used: if the features are initialized with an initial depth that is small compared with the distance to the reference points, the features never converge with respect to the reference, and the Kalman filter does not even converge to an unscaled trajectory. Figure 3 illustrates the initialization of the first features after the three reference points are introduced in the system, for the undelayed and delayed methods. The central graphics show the undelayed method with an initial feature depth of 50 cm; at frame 2 (central upper) the reference points are located approximately 80 cm from the initial camera position and the first observed points are immediately initialized. However, at frame 320 (central lower) the mapped features have not converged with respect to the metric reference.


Sequence | Method    | σ_x,y,z | Nf | %c | Nfc | E    | Nf<0
1        | Undelayed | 4       | 47 | 42 | 114 | 5.7  | 0
1        | Delayed   | 4.1     | 35 | 27 | 110 | 1.44 | 0
2        | Undelayed | 2.1     | 46 | 76 | 36  | 11.2 | 0
2        | Delayed   | 2.4     | 28 | 82 | 45  | 9.12 | 0
3        | Undelayed | 1.4     | 34 | 44 | 45  | 17   | 2
3        | Delayed   | 2.5     | 27 | 55 | 58  | 19   | 1

Table 1. Results at the end of the three sequences. (σ_x,y,z): summed standard deviation of the camera x, y, z position. (Nf): total number of features added to the system. (%c): percentage of features that converge. (Nfc): average number of frames needed for feature convergence. (E): metric error distance, in cm, from the real to the final estimated trajectory. (Nf<0): number of negative inverse depths estimated at the end of the trajectory.

The camera trajectory does not converge either; note the four points corresponding to the printer located beside the initial three-point reference. On the other hand, when an initial depth of 60 cm is used (right upper and lower graphics), the map and the camera trajectory converge reasonably well. In the delayed approach (left graphics) the first feature is not added to the map until frame 125, in this case with a very large initial uncertainty (upper left graphic); however, at frame 320 (lower left graphic) the map and trajectory converge. Note that the first added feature was initialized very close to its final position, and its uncertainty was quickly reduced. For both methods, new points are detected with the Harris corner detector whenever the number of active features in the image falls below 30; in that case the detector is applied over the image regions free of features.

Figure 4 shows the results for three different sequences. The real final camera position and trajectory were added manually to the graphics (in black) to ease the comparison; the initial and final frames of each sequence are shown in the center. Table 1 summarizes the results for each sequence. In our experiments we consider that a feature converges when its depth uncertainty σ represents less than 5% of its depth; in this way the convergence measure is proportional to the distance, since the depth of a near feature should be estimated more accurately than that of a distant feature.

6. CONCLUSIONS

In this work a method for delayed feature initialization with the inverse depth parametrization in monocular SLAM has been presented. The experimental results show that this method can be a good choice for monocular SLAM. The method appears to be more robust than the undelayed method when initial metric reference points are used to scale the map. In our experiments the camera trajectory estimated with the delayed method was similar to that estimated with the undelayed method, and the results regarding feature depth convergence were also similar for both methods. Since the delayed method is more restrictive when adding new features, a reduced number of new features (20-40% fewer) is added to the map with respect to the undelayed method, without losing the quality of the map. This is desirable, because larger environments can be mapped with the same number of features. On the other hand, the delayed method clearly adds a computational cost, since the candidate points have to be tested before being added to the map, and the Jacobian used to estimate the new covariance matrix is more complex than the one used in the undelayed method. However, the Kalman filter computational cost is known to scale poorly with the size of the state, and the savings obtained by using 20-40% fewer features can outweigh the extra cost of the delayed method.

REFERENCES

Davison, A. J., Kita, N. (2001). Sequential localization and map-building for real-time computer vision and robotics. Robotics and Autonomous Systems.
Davison, A. J. (2003). Real-time simultaneous localization and mapping with a single camera. In Proc. International Conference on Computer Vision.
Davison, A. J., Gonzales, Y., Kita, N. (2004). Real-time 3D SLAM with wide-angle vision. In Proc. Symposium on Intelligent Autonomous Vehicles.
Eade, E., Drummond, T. (2006). Scalable monocular SLAM. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
Hartley, R. I., Zisserman, A. (2004). Multiple View Geometry in Computer Vision. Cambridge University Press.
Lemaire, T., Lacroix, S., Sola, J. (2005). A practical 3D bearing-only SLAM algorithm. In Proc. International Conference on Intelligent Robots and Systems.
Montiel, J. M. M., Civera, J., Davison, A. J. (2006). Unified inverse depth parametrization for monocular SLAM. In Robotics: Science and Systems Conference.
Montiel, J. M. M., Davison, A. J. (2006). A visual compass based on SLAM. In Proc. International Conference on Robotics and Automation.
Munguía, R., Grau, A., Sanfeliu, A. (2006). Matching images features in a wide base line with ICA descriptors. In Proc. IEEE International Conference on Pattern Recognition (ICPR).
Munguía, R., Grau, A. (2006). Learning variability of image feature appearance using statistical methods. Lecture Notes in Computer Science, 4225.


Figure 3. Delayed and undelayed methods, using a three-point reference to establish the metric scale. Feature positions are represented by green solid circles and their uncertainties by red ellipses. The camera position is represented by a blue solid circle and its orientation by a blue line emerging from the camera position. The camera trajectory is indicated by the blue path from the initial position (x = 0, z = 0) to the final camera position. For simplicity all the maps are viewed in the x-z plane.

Figure 4. Camera trajectory and map for three sequences: undelayed method (upper graphics) and delayed method (lower graphics). The first sequence corresponds to 760 frames of a house living room and is the same sequence used in the previous experiment. The second sequence corresponds to 480 frames taken in a laboratory; note that a PC monitor was used as the initial metric reference. The third sequence, of 360 frames, was taken following a simple linear path in a more occluded building-terrace environment, with both very near and very distant features.
