Incremental image set querying based localization

Lei Deng a,b, Zhixiang Chen a,b, Baohua Chen a,b, Yueqi Duan a,b, Jie Zhou a,b,c,*

a Department of Automation, Tsinghua University, Beijing, China
b Tsinghua National Laboratory for Information Science and Technology, Beijing, China
c State Key Laboratory of Intelligent Technology and Systems, Beijing, China
Article history: Received 29 September 2015. Received in revised form 13 November 2015. Accepted 14 November 2015.

Abstract
Image based localization has been developed for many applications such as mobile localization, auto-navigation, augmented reality and photo tourism. When a querying image is matched against a pre-built 3D feature point cloud, its pose can be estimated for later use. However, when the querying image is distant from the pre-built 3D point cloud, conventional single image based localization methods fail. To address this problem, we present an incremental image set querying based localization framework. When single image localization fails, the system incrementally asks the user to input more auxiliary images until the localization is successful and stable. The main idea is that an image set, instead of a single image, is matched against the pre-built 3D point cloud to meet this challenge. The image set is incrementally enlarged and aggregated to form a local 3D model. Compared with a single querying image, the querying 3D model contains more information and geometric constraints, which are essential for localization. Experiments demonstrate the effectiveness and feasibility of the proposed framework.
Keywords: Incremental image set localization; Structure-from-motion; Camera set pose estimation
1. Introduction

Image based localization has been widely used in many applications such as mobile localization [1], auto-navigation [2], augmented reality [3], and photo tourism [4]. The aim of image based localization is to retrieve the camera's pose (orientation and position) in an area of interest from a single querying image. Based on the pose of the image, higher level analysis can be applied [5-16]. Generally, there are three key steps in a single image based localization system [17,18]: (1) 2D feature extraction (e.g., SIFT [19]) from the querying image, (2) matching between these 2D features and the pre-built 3D feature point cloud, and (3) camera pose estimation by solving a perspective-n-point (PnP) problem [20-22]. The 3D feature point cloud is usually reconstructed from many captured images at an offline stage using a conventional 3D reconstruction algorithm [4,23].

It is challenging to directly match a querying image to the pre-built 3D point cloud of the scene, especially when there are large variations between them. The reason is that the environment for 3D point cloud reconstruction is usually fixed and different from that of the querying image, which may be captured at a distant location or under different illumination conditions. For instance, a pre-built 3D point cloud may be reconstructed from high quality street views, while the surveillance cameras to be localized are under highly different illumination conditions and distant from the street views, as shown in Fig. 1.
☆ An earlier version of this paper was accepted by VCIP 2015.
* Corresponding author at: Department of Automation, Tsinghua University, Beijing, China.
Under such scenarios, conventional single image localization methods [17,18] fail, as they rely heavily on sufficient 2D-3D feature matches between the querying image and the pre-built 3D point cloud, which are rarely available. To address this, we present an incremental image set querying based localization framework, as shown in Fig. 2. When pose estimation by the conventional single image localization method is unsuccessful, the user is asked to capture additional auxiliary bridging images to assist the localization task. These bridging images, together with the querying image, are matched with each other and then aggregated to form a local 3D model. A 3D-to-3D feature matching scheme is adopted to obtain reliable matches between the querying 3D model and the pre-built 3D feature point cloud. Next, the pose of the local 3D model is estimated by solving a nonlinear optimization problem. Besides using the reconstructed camera poses in the local camera set coordinate system, local 3D point information is exploited in the nonlinear optimization stage to further minimize the re-projection error. Since the image set not only contains more information for localization but also carries stronger inherent geometric constraints, better localization performance can be obtained. Our main contributions are as follows: (1) a new incremental image set querying based localization framework is proposed, (2) a new camera set pose estimation solver is presented, and (3) the quality criteria of camera set pose
Fig. 1. Illustration of the 2D-3D feature matching challenge. The pre-built 3D feature point cloud (yellow dots on the main building) is reconstructed from street view panoramas (bottom). Surveillance cameras from the West Hallway (red), West Building (blue) and East Hallway (green) need to be localized. There are only a few feature matches between the West Hallway image and the pre-built point cloud, and no feature matches can be found for the West Building and East Hallway images because they are distant from the scene.
Fig. 2. The framework of the proposed incremental image set querying based localization. [Flowchart: an image from the target camera first goes through single image localization against the pre-built scene 3D point cloud and panorama cameras, followed by a pose estimation quality test. On failure, new images are added one by one to a local incremental 3D reconstruction of local 3D points and a camera set; the local points are matched to the scene points by 3D-to-3D feature matching, the camera set pose is estimated and tested again, and on success the target camera pose is extracted.]
estimation are defined. An earlier version of this work was accepted by VCIP 2015 [24] (see the Appendix).

This paper is organized as follows. Section 2 presents the incremental 3D reconstruction algorithm used in both the offline scene 3D feature point cloud reconstruction stage and the online incremental local 3D reconstruction of the image set. Section 3 details the proposed incremental image set querying based localization procedure. Section 4 introduces a new camera set pose
Fig. 3. Illustration of the mapping among panoramic, fisheye and rectilinear images. The principal point is shown as a red dot and lies on the z-axis in camera coordinates.
solver. Section 5 defines the quality criteria of the camera set pose estimation. Section 6 presents the experimental results, and Section 7 concludes this paper.
2. Incremental pinhole model based 3D reconstruction

In this section, an incremental pinhole camera model based 3D reconstruction algorithm is presented for building the pre-built 3D feature point cloud. The algorithm is also used in the local incremental 3D reconstruction stage of the online camera set localization procedure. Compared with the ground area, the surrounding buildings usually carry rich information for localization; for example, a person who does not know where he is will look around to determine his position. To capture the surrounding information of the scene, a 360° panorama or fisheye camera is a good choice due to its large field of view. Additionally, with the help of existing high quality street view panoramas at regularly distributed locations, a high quality scene 3D point cloud can be reconstructed and human effort is greatly reduced.

Conventional structure-from-motion methods [4,23] used for 3D reconstruction assume a rectilinear camera model and minimize pixel re-projection errors. To obtain a unified representation of panoramic, fisheye and rectilinear cameras, we use the pinhole camera model instead. The pinhole camera model considers each 2D pixel as a ray passing through a single projection center (optical center), represented as a 3D coordinate $x(u, v, w)$ lying on the unit sphere in the camera coordinate system. The calibration function $x = \kappa(u, K)$, $u = \kappa^{-1}(x, K)$ defines the mapping between a ray $x(u, v, w)$ and its corresponding pixel $u(u, v)$. Eqs. (1), (2) and (3) are the calibration functions for the panoramic, fisheye and rectilinear cameras, respectively, as shown in Fig. 3.

For the panoramic camera:
$$p = \frac{u - u_c}{f},\quad t = \frac{v - v_c}{f},\quad u_c = (u_c, v_c),$$
$$\kappa(u, (f, u_c)) = (\cos t \sin p,\ \sin t,\ \cos t \cos p); \tag{1}$$

for the fisheye camera:
$$u_1 = \frac{u - u_c}{f},\quad v_1 = \frac{v - v_c}{f},\quad r_1 = \sqrt{u_1^2 + v_1^2},$$
$$\theta = 2\arctan\frac{r_1}{2},\quad \phi = \operatorname{arctan2}(v_1, u_1),$$
$$\kappa(u, (f, u_c)) = (\cos\phi \sin\theta,\ \cos\theta,\ \sin\phi \sin\theta); \tag{2}$$

and for the rectilinear camera:
$$p = \arctan\frac{u - u_c}{f},\quad t = \arctan\frac{v - v_c}{f},$$
$$\kappa(u, (f, u_c)) = (\cos t \sin p,\ \sin t,\ \cos t \cos p). \tag{3}$$
In the equations, $u_c$ is the principal point's pixel coordinate (for panorama and fisheye cameras, any point can theoretically serve as the principal point), $f$ is the focal length (for a 360° panorama, $f = \frac{\text{image width}}{2\pi}$), $p$ is the panning angle around the y-axis, and $t$ is the tilting angle around the x-axis. The stereographic projection is applied for the fisheye camera. The geometric re-projection angle error then becomes

$$e_{ij} = \left\lVert \kappa(u_{ij}, K_i) - \frac{P_i \tilde{X}_j}{\lVert P_i \tilde{X}_j \rVert} \right\rVert, \tag{4}$$

where $\tilde{X}_j = (X, 1)$ is the homogeneous coordinate of the jth 3D point, $P_i = [R, t]$ is the ith camera projection matrix, and the elements of $K_i$ are the intrinsic parameters of the ith camera.
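To make the pinhole model concrete, the following Python sketch transcribes the calibration functions of Eqs. (1)-(3) and the angular re-projection error of Eq. (4). The function names and array conventions are illustrative choices of ours, not the authors' code.

```python
import numpy as np

def kappa_panoramic(u, f, uc):
    # Eq. (1): equirectangular pixel -> unit ray
    p = (u[0] - uc[0]) / f          # panning angle around the y-axis
    t = (u[1] - uc[1]) / f          # tilting angle around the x-axis
    return np.array([np.cos(t) * np.sin(p), np.sin(t), np.cos(t) * np.cos(p)])

def kappa_fisheye(u, f, uc):
    # Eq. (2): stereographic fisheye pixel -> unit ray
    u1, v1 = (u[0] - uc[0]) / f, (u[1] - uc[1]) / f
    r1 = np.hypot(u1, v1)
    theta = 2.0 * np.arctan(r1 / 2.0)
    phi = np.arctan2(v1, u1)
    return np.array([np.cos(phi) * np.sin(theta), np.cos(theta),
                     np.sin(phi) * np.sin(theta)])

def kappa_rectilinear(u, f, uc):
    # Eq. (3): rectilinear pixel -> unit ray
    p = np.arctan((u[0] - uc[0]) / f)
    t = np.arctan((u[1] - uc[1]) / f)
    return np.array([np.cos(t) * np.sin(p), np.sin(t), np.cos(t) * np.cos(p)])

def angle_error(ray, P, X):
    # Eq. (4): distance between the observed ray and the normalized projection
    Xh = np.append(X, 1.0)          # homogeneous 3D point
    proj = P @ Xh                   # P is the 3x4 projection matrix [R, t]
    return np.linalg.norm(ray - proj / np.linalg.norm(proj))
```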
The reconstruction pipeline is as follows: (1) feature extraction for every image involved in the 3D reconstruction, (2) feature matching for each image pair, (3) estimating the essential matrix for each pair with enough feature matches within a RANSAC robust framework and removing the outlier feature matches in each pair, (4) decomposing these essential matrices into relative poses, and (5) rearranging these feature matches and the matched image pairs to form a reconstruction graph prepared for the final 3D reconstruction algorithm.

An undirected reconstruction graph is defined as $G = \{NP, NX, EP, EX\}$, shown in Fig. 4. $NP$ is the set of camera nodes and $NX$ is the set of 3D point nodes. $EP$ is the set of camera-camera edges with attributes of feature matches ($EP_{matches}(i,k) = \{(x_{ij_0}, x_{kj_0}), \ldots, (x_{ij_n}, x_{kj_n})\}$, $n > th_{match}$) and estimated relative pose ($EP_{relpose}(i,k) = (R_{ik}, C_{ik})$). $EX$ is the set of camera-point edges with the attribute of an observed ray, $EX_{ox}(i,j) = x_{ij}$. From the reconstruction graph we can define the visibility functions $vis_X(X_j, Ps) = \{i : (i,j) \in EX,\ i \in Ps\}$ and $vis_P(P_i, Xs) = \{j : (i,j) \in EX,\ j \in Xs\}$, which give the set of cameras that see $X_j$ and the set of points seen by $P_i$, respectively. This reconstruction graph contains all the information needed for the incremental 3D reconstruction algorithm.
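A minimal sketch of one possible data structure for this graph follows; the class layout and field names are our own assumptions, intended only to mirror the definitions above.

```python
# Hypothetical sketch of the reconstruction graph G = {NP, NX, EP, EX}.
from dataclasses import dataclass, field

@dataclass
class ReconstructionGraph:
    cameras: set = field(default_factory=set)       # NP: camera ids
    points: set = field(default_factory=set)        # NX: 3D point ids
    ep_matches: dict = field(default_factory=dict)  # EP: (i, k) -> list of (x_ij, x_kj) ray pairs
    ep_relpose: dict = field(default_factory=dict)  # EP: (i, k) -> (R_ik, C_ik)
    ex_rays: dict = field(default_factory=dict)     # EX: (i, j) -> observed unit ray x_ij

    def vis_X(self, j, Ps):
        """Cameras in Ps that observe 3D point j."""
        return {i for (i, jj) in self.ex_rays if jj == j and i in Ps}

    def vis_P(self, i, Xs):
        """3D points in Xs observed by camera i."""
        return {j for (ii, j) in self.ex_rays if ii == i and j in Xs}
```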
Fig. 4. A reconstruction graph G = {NP, NX, EP, EX} contains camera nodes NP (blue), 3D point nodes NX (red), camera-camera edges EP representing the relative rotation R_ik and relative translation C_ik of an overlapped camera pair P_i, P_k, and camera-point edges EX representing an observation x_ij of a 3D point X_j from one of its visible cameras P_i.
According to the geometric properties of the pinhole camera model, the conventional rectilinear camera model based building blocks for 3D reconstruction, such as two view geometry calculation, triangulation, camera pose solvers [20-22] and bundle adjustment [25], should be adjusted accordingly. The incremental pinhole camera model based 3D reconstruction procedure is shown in Algorithm 1.

Algorithm 1. Incremental pinhole model based 3D reconstruction algorithm.

Input: Reconstruction graph G = {NP, NX, EP, EX}
Output: 3D points Xs = {X_1, ..., X_n}, camera poses Ps = {P_1, ..., P_m}

1: Initialize the camera set Ps by choosing a good pair {P_i, P_k}; Xs = {}, n_X = 0, n_P = |Ps|
2: loop
3:   Triangulate new 3D point candidates Xs_new from the existing camera set Ps based on the pinhole camera model: Xs_new = {X_j : |vis_X(X_j, Ps)| > th_tri_nr, j ∉ Xs}; n_X ← |Xs_new|
4:   if n_X > 0 then
5:     Nonlinearly optimize the camera poses Ps and the 3D point cloud Xs to minimize the sum of the re-projection errors, Eq. (4)
6:   else if n_X = 0 and n_P = 0 then
7:     break
8:   end if
9:   Estimate new camera candidates Ps_new from the existing 3D point set Xs using the pinhole camera pose solver: Ps_new = {P_i : |vis_P(P_i, Xs)| > th_pnp_nr, i ∉ Ps}; n_P ← |Ps_new|
10:  if n_P > 0 then
11:    Nonlinearly optimize the camera poses Ps and the 3D point cloud Xs to minimize the sum of the re-projection errors, Eq. (4)
12:    Filter out bad P and X from Ps, Xs
13:  else if n_X = 0 and n_P = 0 then
14:    break
15:  end if
16: end loop
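For clarity, here is a minimal Python skeleton of the same loop. The geometric subroutines (triangulation, pose solving, bundle adjustment, outlier filtering) are passed in as callables, and the thresholds default to illustrative values; none of this reflects the authors' actual implementation.

```python
def incremental_reconstruction(G, good_pair, triangulate, solve_pose,
                               bundle_adjust, filter_outliers,
                               th_tri=2, th_pnp=6):
    """Skeleton of Algorithm 1; geometric subroutines are injected callables."""
    Ps = dict(good_pair)                   # step 1: {i: P_i, k: P_k} from a good pair
    Xs = {}
    while True:
        # Step 3: triangulate points visible from enough registered cameras.
        new_X = {j: triangulate(G, j, Ps) for j in G.points
                 if j not in Xs and len(G.vis_X(j, Ps.keys())) > th_tri}
        Xs.update(new_X)
        if new_X:
            bundle_adjust(Ps, Xs, G)       # minimize the sum of Eq. (4) errors
        # Step 9: register cameras that see enough reconstructed points.
        new_P = {i: solve_pose(G, i, Xs) for i in G.cameras
                 if i not in Ps and len(G.vis_P(i, Xs.keys())) > th_pnp}
        Ps.update(new_P)
        if new_P:
            bundle_adjust(Ps, Xs, G)
            filter_outliers(Ps, Xs)
        if not new_X and not new_P:        # a full pass added nothing: converged
            return Ps, Xs
```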
With this algorithm, the 3D feature point cloud of the scene of interest and the cameras can be reconstructed, as shown in Fig. 2. Each 3D point may correspond to several 2D feature descriptors (SIFT) from the reconstructed images. These 2D features are indexed by a kd-tree to accelerate the online 2D-3D feature matching stage.
3. Incremental camera set localization

When a querying image from the target camera arrives, we first apply the conventional single image based localization technique, followed by a pose estimation quality test (detailed in Section 5). If the pose estimation fails, the querying image has large variations with respect to the pre-built 3D point cloud. In this case, the system iteratively asks the user to capture additional bridging images to help match the querying image to the 3D point cloud. These auxiliary images can be captured in the area between the target camera and the scene, acting as a bridge. Together with the querying image, the bridging images are aggregated into a local 3D model, including a 3D point cloud and the camera poses, using the incremental 3D reconstruction procedure in Algorithm 1. Compared with the independent images in the set, the inherent geometric constraints are strengthened in the local 3D model. After that, a 3D-to-3D feature matching scheme is applied and the local 3D model is matched as a query against the pre-built 3D point cloud. The pose of the querying local 3D model is then estimated using the proposed camera set pose solver described in Section 4. If the pose estimation quality test fails again, the system goes back to ask the user for more bridging images until it succeeds. Finally, the target camera's pose is extracted, as shown in Fig. 2. During this incremental process, the local 3D model is enlarged and accumulates more and more information for localization, steadily raising the success rate of the image set localization.

The 3D-to-3D matching stage works as follows: for each 3D point in the local 3D model, its two nearest neighbors in the scene 3D point cloud are first identified, and the ratio of the distances to the nearest and second nearest neighbor is tested. The ratio test is then applied in the reverse direction to filter out bad local 3D points, yielding enough reliable 3D-to-3D feature matches.
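A minimal sketch of this mutual ratio test, assuming matching is done on the descriptors attached to the 3D points: scipy's cKDTree stands in for FLANN, the 0.6 threshold follows Section 6, and reading the reverse test as a mutual-consistency check is our interpretation.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_3d_to_3d(local_desc, scene_desc, ratio=0.6):
    """Mutual ratio-test matching between local and scene point descriptors."""
    d_f, i_f = cKDTree(scene_desc).query(local_desc, k=2)  # local -> scene
    d_r, i_r = cKDTree(local_desc).query(scene_desc, k=2)  # scene -> local
    matches = []
    for i in range(len(local_desc)):
        j = i_f[i, 0]
        if d_f[i, 0] < ratio * d_f[i, 1]:                  # forward ratio test
            # reverse ratio test filters out unreliable local points
            if d_r[j, 0] < ratio * d_r[j, 1] and i_r[j, 0] == i:
                matches.append((i, j))
    return matches
```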
4. Camera set pose estimation

A set of pinhole cameras can be considered as a generic camera represented by a bag of rays [26]. Fig. 5 illustrates the basic idea of using a set of pinhole cameras to form a generic camera whose rays do not necessarily pass through the same optical center, and shows how the pose of the generic camera is estimated and optimized. Similar to single camera pose estimation, the camera set pose estimation problem can be defined as: given some rays (directions $r_i$ with projection centers $C_i$) and their corresponding fixed global 3D points $X^g_i$, find the camera set pose, i.e., the transformation $T = sR[I \mid -C]$ that maps the matched 3D points from the global coordinate system to the camera set's local coordinate system. After the local-to-global 3D-3D matching step, the local 2D features corresponding to the matched 3D points are retrieved and converted to non-concentric rays. These correspondences between the rays and the global 3D points are the inputs to the camera set pose estimation solver. The geometric constraint yields

$$r_i \times (T\tilde{X}^g_i - C_i) = 0, \tag{5}$$

where $C_i$ is the projection center and $r_i$ is the ray's direction. There are 12 unknown variables and 7 DOFs (6 for pose and 1 for scale) in the transformation $T$, so the direct linear transformation (DLT) method can be applied to solve the estimation problem. Rearranging Eq. (5) gives

$$[r_i]_\times T\tilde{X}^g_i = [r_i]_\times C_i. \tag{6}$$
Fig. 5. A pinhole camera set forms a generic camera, and an observed ray r_i of the generic camera emanates from center C_i toward the fixed global 3D point X_gi. By applying a similarity transformation T, the corresponding 3D point X_gi in the global coordinate system is mapped into the local camera set coordinate system as TX_gi (top right). The similarity transformation T can be seen as the pose of the camera set and can be estimated with the proposed DLT solver. With a nonlinear optimization, the constraints introduced by the 3D points X_lj in the local camera set coordinate system are exploited: the relative poses among the cameras C_i and the local 3D points X_lj can be adjusted (red) for further re-projection error minimization (bottom right).
By applying the Kronecker product property, Eq. (6) can be rewritten as

$$\left(\tilde{X}^{g\top}_i \otimes [r_i]_\times\right)\operatorname{vec}(T) = [r_i]_\times C_i. \tag{7}$$
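The DLT of Eqs. (6) and (7) stacks one 3x12 block per ray/point correspondence and solves for vec(T) in the least-squares sense. A minimal numpy sketch, assuming column-major vec and at least 6 correspondences; the function names are ours:

```python
import numpy as np

def skew(r):
    """Cross-product matrix [r]_x."""
    return np.array([[0, -r[2], r[1]],
                     [r[2], 0, -r[0]],
                     [-r[1], r[0], 0]])

def solve_T_dlt(rays, centers, X_global):
    """Solve Eq. (7) for the 3x4 transform T from >= 6 ray/point pairs."""
    A, b = [], []
    for r, C, X in zip(rays, centers, X_global):
        Xh = np.append(X, 1.0)                   # homogeneous global point
        A.append(np.kron(Xh[None, :], skew(r)))  # (X^T kron [r]_x), a 3x12 block
        b.append(skew(r) @ C)                    # right-hand side [r]_x C_i
    A, b = np.vstack(A), np.hstack(b)
    vecT, *_ = np.linalg.lstsq(A, b, rcond=None)
    return vecT.reshape(3, 4, order="F")         # undo column-major vec
```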
Since one observation $(r_i, C_i, X^g_i)$ provides two independent constraints, at least 6 points are required to solve the problem with DLT. Having obtained the transformation $T$, we need to project the 12-DOF estimate onto the 7-DOF space of valid similarity transformations. In analogy to decomposing a camera matrix $P$ into $K[R \mid t]$ [27], the transformation $T$ can be decomposed as

$$K, [R \mid t] = \mathrm{rq\_decomposition}(T). \tag{8}$$

Then the valid 7-DOF transform $T_{DLT}$ is obtained as

$$T_{DLT} = s[R \mid t] = \frac{\operatorname{trace}(K)}{3}[R \mid t], \tag{9}$$

where the scale factor $s$ is the average of $K$'s diagonal elements. After that, the Levenberg-Marquardt algorithm is used to minimize the re-projection error, which is the gold standard in geometry estimation [27]. Initializing $T$ as $T_{DLT}$, the optimization objective is formulated as

$$T_{LM} = \arg\min_T \sum_{i,j}\left\lVert r_{ij} - \frac{P_i(T\tilde{X}^g_j)}{\left\lVert P_i(T\tilde{X}^g_j)\right\rVert} \right\rVert, \tag{10}$$

where $P_i$ is the ith camera matrix, $X^g_j$ is the jth 3D point in global coordinates, $r_{ij}$ is the corresponding observed ray in the camera set coordinate system, and $T$ is the camera set pose to be optimized.
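A sketch of the projection step of Eqs. (8)-(9), using scipy's RQ decomposition on the left 3x3 block of T; the sign fixing that keeps the diagonal of K positive is a standard detail we add, not spelled out in the paper.

```python
import numpy as np
from scipy.linalg import rq

def project_to_similarity(T):
    """Project a 12-DOF DLT estimate onto a 7-DOF similarity s[R|t], Eqs. (8)-(9)."""
    K, R = rq(T[:, :3])                 # T[:, :3] = K R with K upper triangular
    S = np.diag(np.sign(np.diag(K)))    # make diag(K) positive; note S @ S = I
    K, R = K @ S, S @ R
    t = np.linalg.solve(K, T[:, 3])     # translation so that T = K [R | t]
    s = np.trace(K) / 3.0               # Eq. (9): scale = mean of K's diagonal
    return s, np.hstack([R, t[:, None]])
```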
The above solver and optimization treat the camera set as a single rigid object, so the relative poses among the cameras cannot change. However, if the poses of these cameras are themselves reconstructed by a 3D reconstruction algorithm, the relative poses among them may not be accurate, and a further optimization is needed to adjust the inner relative poses for better re-projection error minimization. Besides the corresponding global 3D points $X^g$, the local 3D points $X^l$ reconstructed in the camera set coordinate system from the image set are also involved in this optimization step.
Fig. 6. The estimated pose quality of a single camera is determined by two factors: (1) the distance from the camera center to the center of the matched 3D points, and (2) the solid angle (dark blue region) spanned by the rays directed at the matched 3D points. The smaller the distance and the larger the solid angle, the better the quality, indicating a more reliable estimated camera pose.
The nonlinear optimization then becomes

$$T_{OPT} = \arg\min_T \sum_{i,j}\left\lVert r_{ij} - \frac{P_i(T\tilde{X}^g_j)}{\left\lVert P_i(T\tilde{X}^g_j)\right\rVert} \right\rVert + \sum_{i,j}\left\lVert r^l_{ij} - \frac{P_i\tilde{X}^l_j}{\left\lVert P_i\tilde{X}^l_j\right\rVert} \right\rVert, \tag{11}$$
where $r^l_{ij}$ is the observed ray from the ith camera directed at the jth locally reconstructed 3D point $X^l_j$. After this optimization, more geometric constraints introduced by $X^l$ are exploited and a better pose estimate is achieved, as illustrated in Fig. 5.
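A compact sketch of the refinement of Eq. (10) with scipy's Levenberg-Marquardt solver, parameterizing the similarity by a rotation vector, translation and log-scale (7 DOF). Extending the parameter vector with the per-camera poses and the local points X^l would give the joint objective of Eq. (11); the parameterization is our assumption.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_T(params0, cams, rays, X_global):
    """Minimize Eq. (10): angular residuals between observed rays and projections.

    params0: 7-vector [rotvec (3), t (3), log s]; cams: list of 3x4 matrices P_i.
    """
    def residuals(p):
        R = Rotation.from_rotvec(p[:3]).as_matrix()
        t, s = p[3:6], np.exp(p[6])                 # log-scale keeps s > 0
        T = s * np.hstack([R, t[:, None]])          # similarity T = s[R | t]
        res = []
        for P, r, X in zip(cams, rays, X_global):
            x = P @ np.append(T @ np.append(X, 1.0), 1.0)
            res.append(r - x / np.linalg.norm(x))   # r_ij - normalized projection
        return np.concatenate(res)
    return least_squares(residuals, params0, method="lm").x
```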
5. Quality criteria of pose estimation

In the localization framework, it is important to evaluate the quality of the estimated pose of a camera or camera set, an issue rarely addressed in previous work [17,18]. From the geometric constraints on camera pose estimation (Fig. 6), under the same Gaussian noise on the 3D points and rays, two geometric properties can be observed: (1) when the camera is close to its matched global 3D points, the error of the camera pose estimation is small; in other words, the camera pose is
Fig. 7. Two layouts (a) and (b) of camera sets with matched global 3D points, illustrating the inference of the target camera pose quality. The target camera (purple) cannot be localized directly since it sees no global 3D points; its pose can only be inferred from high quality pose cameras (red), which see more 3D points, through low quality bridge cameras (black) in between. A high quality camera supports the target camera along the shortest path in the reconstruction graph of the image set.
more heavily influenced by the nearer 3D points; (2) when the solid angle spanned by the observed rays is larger, in other words, when the 2D observations are spread over the whole image rather than concentrated in a small region, the error of the camera pose estimation is small. Hence, we define the pose quality of a single camera as

$$Q_{cam}(C, \{X^g_i,\ i = 1\ldots n\}) = \frac{\angle_{solid}(\{X_i - C,\ i = 1\ldots n\})}{\left[\{\lVert X_i - C\rVert,\ i = 1\ldots n\}\right]_q}, \tag{12}$$

where $C$ is the center of the camera of interest, $X^g_i$ is the ith matched 3D point, $[\cdot]_q$ is the qth quantile of a set of scalars (0.25 in this paper), and $\angle_{solid}(\cdot)$ is the solid angle spanned by a set of vectors, computed as

$$\angle_{solid}(\{X_i,\ i = 1\ldots n\}) = \int \max_i \exp\!\left(-\theta\!\left(x, \frac{X_i}{\lVert X_i\rVert}\right)^{2}\!/\sigma^{2}\right) dx, \tag{13}$$

which measures the area of the region near the existing rays, under a Gaussian distribution, in the solid angle space (the spherical surface).
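A Monte Carlo sketch of this single-camera quality: the solid angle term of Eq. (13) is approximated by sampling directions uniformly on the unit sphere and weighting each sample by its Gaussian angular proximity to the nearest observed ray. The sampling scheme and the sigma value are our assumptions, since the paper does not specify how the integral is evaluated.

```python
import numpy as np

def solid_angle(dirs, sigma=0.1, n_samples=20000, rng=np.random.default_rng(0)):
    """Approximate Eq. (13): sphere area near the given ray directions."""
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    x = rng.normal(size=(n_samples, 3))
    x /= np.linalg.norm(x, axis=1, keepdims=True)          # uniform sphere samples
    theta = np.arccos(np.clip(x @ dirs.T, -1.0, 1.0))      # angles to every ray
    w = np.exp(-(theta.min(axis=1) ** 2) / sigma ** 2)     # proximity to nearest ray
    return 4.0 * np.pi * w.mean()                          # scale by sphere area

def q_cam(C, X_matched, q=0.25):
    """Eq. (12): quality grows with spanned solid angle, shrinks with distance."""
    vecs = X_matched - C
    dist_q = np.quantile(np.linalg.norm(vecs, axis=1), q)  # 0.25 quantile of distances
    return solid_angle(vecs) / dist_q
```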
Based on the single camera pose quality, we next define the pose quality of a camera set, used to iteratively judge whether the whole localization procedure has succeeded and the system should exit. In a camera set with an estimated pose, only some of the cameras have high quality poses; the others, including the target camera, have low quality poses, as shown in Fig. 7. This is exactly why single image based localization fails. The pose of the target camera can only be inferred from the high quality cameras. Hence, the pose quality of the target camera is influenced by two factors: (1) the pose quality and the number of high quality cameras in the set, and (2) the length and direction of the supporting paths provided by the high quality pose cameras. The shortest path from a high quality camera to the target camera in the reconstruction graph is called a supporting path. The length and direction of a supporting path indicate the reliability of pose transfer from a high pose quality camera to the target camera. When the image set contains more high quality cameras and the supporting paths are shorter and come from more directions, the target camera gains a better pose quality.
The camera set quality is defined as

$$Q_{camset}(C^{target}, \{C^{hq}_i,\ i = 1\ldots n\}) = \angle_{solid}\!\left(\{path(C^{hq}_i, C^{target}),\ i = 1\ldots n\}\right) \cdot \sum_i \frac{1}{\left|path(C^{hq}_i, C^{target})\right|}, \tag{14}$$

where $path(C^{hq}_i, C^{target})$ is the shortest path in the reconstruction graph from the ith high quality camera node $C^{hq}_i$ to the target camera node $C^{target}$, and $|path(\cdot,\cdot)|$ is the length of that path. For instance, a camera set whose supporting paths come from more directions provides a better inference of the target camera pose, and vice versa, as shown in Fig. 7.
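A sketch of this camera-set quality: supporting paths are found by BFS in the unweighted reconstruction graph, and the direction term reuses the solid_angle helper from the sketch above, applied to the directions from the target camera toward the first node of each path; this reading of Eq. (14) is our interpretation.

```python
import numpy as np
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path in the unweighted reconstruction graph (adj: node -> neighbors)."""
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nb in adj.get(node, ()):
            if nb not in prev:
                prev[nb] = node
                queue.append(nb)
    return None

def q_camset(adj, centers, target, hq_cams):
    """Eq. (14): many short supporting paths from diverse directions score high."""
    paths = [shortest_path(adj, c, target) for c in hq_cams]
    paths = [p for p in paths if p]
    dirs = np.array([centers[p[0]] - centers[target] for p in paths])
    return solid_angle(dirs) * sum(1.0 / len(p) for p in paths)
```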
6. Experiments

We build the scene 3D feature point cloud for the Main Building of Tsinghua University, a scene hundreds of meters in size, consisting of 23 street view panoramas, 3067 3D points and 14,330 feature descriptors, as shown in Fig. 8. The 3D-to-3D ratio test threshold is set to 0.6 and the scene 3D point cloud kd-tree is built by FLANN [29] with 95% accuracy. Three image sets are tested: West Hallway (14 images), East Hallway (15 images) and West Building (21 images); the querying images are distant from the point cloud. Experiments are conducted with the conventional single image based method [17] and the proposed camera set pose estimation with and without nonlinear optimization (camset and camset+opt). To evaluate the localization accuracy, as done in [17], a whole 3D model reconstructed using all images is taken as the ground truth, and all localization results are further checked manually. The minimum number of 2D-3D matched inliers for conventional single image based localization is 12 (as in [17]), below which the pose estimation is regarded as failed.

Fig. 9 shows the layout of a localized camera set in which the high quality cameras are highlighted (red). Table 1 gives the statistics of high quality cameras and matched 3D points in each querying image set. As the table shows, it is the existence of enough high quality cameras and matched global 3D points that allows the target camera pose to be successfully inferred. The conventional single image based localization method [17] cannot estimate the camera poses in most cases due to the lack of enough 2D-3D feature matches with the pre-built 3D point cloud, whereas the proposed method localizes every image in the querying image sets successfully, as shown in Table 2. The qualitative localization results are shown in Fig. 10 and the quantitative results are listed in Table 3.
Fig. 9. High quality pose cameras (red) in the West Building camera set are localized by seeing several 3D points (red dots). The target camera's pose is inferred from these high quality pose cameras through the remaining bridging cameras (blue). The background shows the dense point cloud and the corresponding satellite map for visualization.
Table 1 High quality camera count in the three querying image sets (hq. stands for high quality camera counts). West Hallway
East Hallway
West Building
#hq./#total
3D points
#hq./#total
3D points
#hq./#total
3D points
7/14
20
13/15
28
4/21
11
Table 2
Successful registration rate in the three querying image sets.

Method                     West Hallway          East Hallway          West Building
                           #reg./#total  Rate(%) #reg./#total  Rate(%) #reg./#total  Rate(%)
Single image based [17]    4/14          28.57   12/15         80.00   1/21          4.76
Proposed                   14/14         100.00  15/15         100.00  21/21         100.00
Fig. 8. The pre-built scene used in the experiments. The sparse 3D feature point cloud (left) is reconstructed from the street view panoramas (yellow boxes). The dense point cloud over the satellite map (right) is reconstructed with the PMVS algorithm [28] for visualization.
Fig. 10. Illustration of the localization results. Top left: the reconstructed scene of the Main Building dataset. Top middle: the ground truth scene reconstructed using all three image sets. Top right: the final localization result. The bottom two rows show details of the West Hallway, West Building and East Hallway image sets with the different methods: single image based [17] (blue), camset (yellow) and camset+opt (red). Images in the bottom row are enlarged from those in the row above. The ground truth is linked with the corresponding estimated camera center to visualize the displacement.
Table 3
Evaluation of the location error on the image sets and the querying images. The statistics for the single querying image based method [17] are computed from successfully registered images only. Errors are given as position/orientation (m/deg); the Min, Median, Max and Mean columns refer to the image set, and the last column to the target image.

West Hallway
Method                    #reg./#total  Min          Median       Max           Mean         Target image recon. err
Single image based [17]   4/14          2.081/1.001  2.708/1.019  5.029/1.092   3.131/1.033  5.028/1.092
Camset                    14/14         2.175/0.949  3.031/0.999  3.973/1.546   3.030/1.042  3.973/1.092
Camset+opt                14/14         1.770/0.915  2.455/0.970  3.248/1.546   2.457/1.042  3.248/1.095

East Hallway
Method                    #reg./#total  Min          Median       Max           Mean         Target image recon. err
Single image based [17]   12/15         1.015/0.335  3.672/1.278  23.014/7.815  8.418/2.827  -
Camset                    15/15         2.833/1.116  3.297/1.129  3.697/1.154   3.293/1.131  3.641/1.154
Camset+opt                15/15         2.489/0.980  2.908/0.991  3.273/1.104   2.904/0.991  3.273/1.001

West Building
Method                    #reg./#total  Min          Median       Max           Mean         Target image recon. err
Single image based [17]   1/21          4.729/3.847  4.729/3.847  4.729/3.847   4.729/3.847  -
Camset                    21/21         2.167/2.402  4.516/3.334  4.889/3.658   4.247/3.320  4.664/3.414
Camset+opt                21/21         1.638/2.418  4.463/2.925  4.963/3.347   4.197/2.960  4.487/2.979
From these results we can see that: (1) the querying image is successfully localized by the proposed framework, and the orientation errors are usually very small (less than 4°); (2) only a few images can be localized by the conventional single image based localization method, while all of them can be successfully localized by the proposed approach; and (3) the location error of the proposed methods is much smaller than that of the conventional single image based method, and the performance improves further when the nonlinear optimization is applied.
7. Conclusion

In this paper, we proposed a new incremental image set querying based localization framework to handle the challenging scenarios where conventional single image localization methods fail, which usually occurs when the target camera is distant from, or in a different environment than, the pre-built scene 3D point cloud. Compared with a single image, the image set not only contains more information for localization through more feature matches, but also lets us enforce stronger inherent geometric constraints via a local 3D reconstruction, which improves the localization performance. Experiments have shown the effectiveness and feasibility of the proposed approach. In future work, we will study how to capture the auxiliary bridging images efficiently to obtain better results over larger scene areas.
Acknowledgment This work is supported by the National Natural Science Foundation of China under Grants 61225008, 61373074 and 61373090, the National Basic Research Program of China under Grant 2014CB349304, the Ministry of Education of China under Grant 20120002110033, and the Tsinghua University Initiative Scientific Research Program.
References

[1] J. Ventura, C. Arth, G. Reitmayr, D. Schmalstieg, Global localization from monocular SLAM on a mobile phone, IEEE Trans. Vis. Comput. Graph. 20 (4) (2014) 531-539.
[2] A.J. Davison, I.D. Reid, N. Molton, O. Stasse, MonoSLAM: real-time single camera SLAM, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 1052-1067.
[3] R.O. Castle, G. Klein, D.W. Murray, Video-rate localization in multiple maps for wearable augmented reality, in: 12th IEEE International Symposium on Wearable Computers, 2008, pp. 15-22.
[4] N. Snavely, S.M. Seitz, R. Szeliski, Photo tourism: exploring photo collections in 3D, ACM Trans. Graph. 25 (3) (2006) 835-846.
[5] J. Lu, V.E. Liong, X. Zhou, J. Zhou, Learning compact binary face descriptor for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (10) (2015) 2041-2056.
[6] J. Lu, X. Zhou, Y. Tan, Y. Shang, J. Zhou, Neighborhood repulsed metric learning for kinship verification, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2) (2014) 331-345.
[7] J. Lu, Y. Tan, G. Wang, Discriminative multimanifold analysis for face recognition from a single training sample per person, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 39-51.
[8] J. Lu, G. Wang, P. Moulin, Human identity and gender recognition from gait sequences with arbitrary walking directions, IEEE Trans. Inf. Forensics Secur. 9 (1) (2014) 51-61.
[9] J. Lu, Y. Tan, Cost-sensitive subspace analysis and extensions for face recognition, IEEE Trans. Inf. Forensics Secur. 8 (3) (2013) 510-519.
[10] J. Lu, Y. Tan, Regularized locality preserving projections and its extensions for face recognition, IEEE Trans. Syst. Man Cybern. Part B 40 (3) (2010) 958-963.
[11] Y. Yan, E. Ricci, G. Liu, N. Sebe, Egocentric daily activity recognition via multitask clustering, IEEE Trans. Image Process. 24 (10) (2015) 2984-2995.
[12] Y. Yan, Y. Yang, D. Meng, G. Liu, W. Tong, A.G. Hauptmann, N. Sebe, Event oriented dictionary learning for complex event detection, IEEE Trans. Image Process. 24 (6) (2015) 1867-1878.
[13] Y. Yan, G. Liu, E. Ricci, N. Sebe, Multi-task linear discriminant analysis for multi-view action recognition, in: IEEE International Conference on Image Processing (ICIP), 2013, pp. 2842-2846.
[14] Y. Yan, E. Ricci, S. Ramanathan, O. Lanz, N. Sebe, No matter where you are: flexible graph-guided multi-task learning for multi-view head pose classification under target motion, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1177-1184.
[15] Y. Yan, H. Shen, G. Liu, Z. Ma, C. Gao, N. Sebe, Glocal tells you more: coupling glocal structural for feature selection with sparsity for image and video classification, Comput. Vis. Image Underst. 124 (2014) 99-109.
[16] Q. Zhao, D. Meng, Z. Xu, W. Zuo, Y. Yan, L1-norm low-rank matrix factorization by variational Bayesian method, IEEE Trans. Neural Netw. Learn. Syst. 26 (4) (2015) 825-839.
[17] T. Sattler, B. Leibe, L. Kobbelt, Fast image-based localization using direct 2D-to-3D matching, in: International Conference on Computer Vision, 2011, pp. 667-674.
[18] Y. Li, N. Snavely, D. Huttenlocher, P. Fua, Worldwide pose estimation using 3D point clouds, in: European Conference on Computer Vision, 2012, pp. 15-29.
[19] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91-110.
[20] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (6) (1981) 381-395.
[21] M. Bujnak, Z. Kukelova, T. Pajdla, A general solution to the P4P problem for camera with unknown focal length, in: Computer Vision and Pattern Recognition, 2008, pp. 1-8.
[22] Z. Kukelova, M. Bujnak, T. Pajdla, Real-time solution to the absolute pose problem with unknown radial distortion and focal length, in: International Conference on Computer Vision, 2013, pp. 2816-2823.
[23] S. Agarwal, N. Snavely, I. Simon, S.M. Seitz, R. Szeliski, Building Rome in a day, in: International Conference on Computer Vision, 2009, pp. 72-79.
[24] VCIP 2015, 〈http://vcip2015.org〉.
[25] B. Triggs, P.F. McLauchlan, R.I. Hartley, A.W. Fitzgibbon, Bundle adjustment - a modern synthesis, in: Vision Algorithms: Theory and Practice, International Workshop on Vision Algorithms, Corfu, Greece, 1999, pp. 298-372.
[26] J. Ponce, What is a camera?, in: Computer Vision and Pattern Recognition, 2009, pp. 1526-1533.
[27] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, Cambridge, U.K., 2006.
[28] Y. Furukawa, J. Ponce, Accurate, dense, and robust multiview stereopsis, IEEE Trans. Pattern Anal. Mach. Intell. 32 (8) (2010) 1362-1376.
[29] M. Muja, D.G. Lowe, Scalable nearest neighbor algorithms for high dimensional data, IEEE Trans. Pattern Anal. Mach. Intell. 36 (11) (2014) 2227-2240.
Lei Deng received the B.S. degree from the Department of Automation, Beijing Institute of Technology, in 2009. He is currently pursuing the Ph.D. degree in the Department of Automation, Tsinghua University, Beijing, China. His current research interests include computer vision, image processing and 3D reconstruction.
Zhixiang Chen received the B.S. degree in Microelectronics from Xi’an Jiaotong University, Xi’an, China, in 2010. He is currently pursuing the Ph.D. degree in the Department of Automation, Tsinghua University, Beijing, China. His current research interests include 3D reconstruction and image retrieval.
Baohua Chen is currently pursuing the Ph.D. degree in the Department of Automation, Tsinghua University, Beijing, China. His current research interests include computer vision, robotics and image processing.
Yueqi Duan received the B.S. degree in Automation from Tsinghua University, Beijing, China, in 2014. He is currently pursuing the Ph.D. degree in the Department of Automation, Tsinghua University, Beijing, China. His current research interests include 3D reconstruction, face recognition and re-identification.
Jie Zhou received the B.S. and M.S. degrees from the Department of Mathematics, Nankai University, Tianjin, China, in 1990 and 1992, respectively, and the Ph.D. degree from the Institute of Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology (HUST), Wuhan, China, in 1995. From 1995 to 1997, he served as a postdoctoral fellow in the Department of Automation, Tsinghua University, Beijing, China. Since 2003, he has been a full professor in the Department of Automation, Tsinghua University, where he is currently the head of the department. His research interests include computer vision, pattern recognition, and image processing. In recent years, he has authored more than 100 papers in peer-reviewed journals and conferences, more than 40 of them in top journals and conferences. He is an associate editor for the International Journal of Robotics and Automation and two other journals. He received the National Outstanding Youth Foundation of China Award.