Pattern Recognition Letters 31 (2010) 112–118
Global registration of overlapping images using accumulative image features

Karthik Krish, Stuart Heinrich, Wesley E. Snyder, Halil Cakir, Siamak Khorram

North Carolina State University, Raleigh, NC 27695, USA
Article history: Received 26 March 2008; received in revised form 10 September 2009; available online 18 September 2009. Communicated by B. Kamgar-Parsi.

Keywords: Image registration; Feature matching; Accumulator-based methods; Feature correspondence; Evidence accumulation
Abstract

This paper introduces a new feature-based image registration technique which registers images by finding rotation- and scale-invariant features and matching them using a novel feature matching algorithm based on an evidence accumulation process reminiscent of the generalized Hough transform. Once feature correspondence has been established, the transformation parameters are estimated using non-linear least squares (NLLS) and the standard RANSAC (random sample consensus) algorithm. The technique is evaluated under similarity transforms – translation, rotation and scale (zoom) – and also under illumination changes.

© 2009 Elsevier B.V. All rights reserved.
1. Introduction

Image registration is the process of geometrically aligning two images taken at different times, from different orientations, or by different sensors. Zitova and Flusser (2003) provide a comprehensive survey of existing image registration algorithms. According to them, there are essentially four steps to any feature-based image registration algorithm:

(1) Feature detection: This involves finding salient features in the two images to be registered. An interest point detector is first employed to detect characteristic points in the image. A feature consists of measurements made in the neighborhood of such an interest point. These may include corner points, edges, etc. Ideally, we want these points to be invariant to geometric and photometric transformations. A survey of the literature shows many approaches, including the Harris corner detector (Harris and Stephens, 1988) and the scale-invariant keypoints introduced by Lowe (1999). A complete review of affine invariant region detectors can be found in (Mikolajczyk et al., 2005).

(2) Feature matching: Once features have been detected in the two images, the next step is to match them or establish correspondence. This is an important step because the percentage of correct matches identified here determines how
well the transformation can be estimated in the next step. The common approach to feature matching is to build local descriptors around the feature point and then to match these descriptors. A popular local descriptor is the scale-invariant feature transform (SIFT) (Lowe, 1999, 2003). An exhaustive review of local descriptors for feature matching can be found in (Mikolajczyk and Schmid, 2005). Viola and Wells (1997) proposed an area-based feature matching method based on mutual information. Flusser and Suk (1993) proposed a set of affine invariant moments and used them to register images with affine distortions (Flusser and Suk, 1994).

(3) Transform model estimation: Once feature correspondence has been established, the next step is to solve for the parameters of some global transformation. Usually this involves finding the translation, rotation and scale parameters that transform one image to another. Local deformations are usually mapped using techniques such as elastic registration (Bajcsy and Kovačič, 1989) and thin-plate splines (Bookstein, 1989).

(4) Image re-sampling and transformation: The final step of any registration algorithm involves the actual mapping of one image to the other.

This paper introduces a new feature-based image registration technique which matches a target image to a reference (or source) image by finding rotation- and scale-invariant features which are then matched using a novel feature matching algorithm based on an evidence accumulation process reminiscent of the generalized
Hough transform (Ballard, 1981). Accumulator-based approaches have been shown to be very robust to noise and highly parallel in nature (Ecabert and Thiran, 2004; Kiryati et al., 2000). The algorithm, which we will henceforth refer to as the Simple K-Space (SKS) feature matching algorithm, is shown to be highly robust and to outperform conventional SIFT-based feature matching algorithms.

This paper is organized as follows: Section 2 looks at the feature extraction process. Section 3 describes the new feature matching algorithm, which forms the primary contribution of this paper. This is followed, in Section 4, by the transform model estimation using non-linear least squares (NLLS) and random sample consensus (RANSAC). Section 5 looks at the performance of the algorithm on aerial imagery, followed by the conclusion in Section 6.

2. Feature extraction

The first step in the image registration process is feature extraction. Features could include salient points and distinctive objects in the image like closed contours, corners, etc. We use the Harris–Laplace (Mikolajczyk and Schmid, 2001) interest point detector in our algorithm. The Harris–Laplace detector uses the multi-scale Harris corner detector with the Laplacian operator for characteristic scale selection (Mikolajczyk and Schmid, 2001; Lindeberg, 1998). The characteristic scale is the scale at which an interest point is a local extremum across scale. Characteristic scales were extensively studied in (Lindeberg, 1998). Theoretically, the ratio of the scales for two corresponding points at their characteristic scales is equal to the scale factor between the two images. Mikolajczyk and Schmid (2001) analyzed four different functions to determine the local maxima of a feature point across scales: the Laplacian, the Difference of Gaussians (DoG), the simple gradient and the Harris function. They concluded that the Laplacian detects the highest percentage of points that are correctly at the characteristic scale. By reducing the set of interest points to characteristic-scale Harris corner points, we reduce the computational complexity of the problem without loss of scale invariance.

3. Feature matching

3.1. Model building

Once interest points have been detected in the images, the next step is to match them by building a descriptor (or model). Consider a set of J interest points in the source image, ${}^S F = \{{}^S F_1, {}^S F_2, \dots, {}^S F_j, \dots, {}^S F_J\}$, where ${}^S F_j$ is an ordered pair of coordinates $({}^S x_j, {}^S y_j)$. We first establish a rotation-invariant coordinate system at each interest point ${}^S F_j$ by finding the dominant orientation $\phi_j$ in the circular neighborhood of radius $r_j$, where $r_j$ is the scale at which the interest point was found (see Section 3.3 for more detail on finding the dominant orientation). We now build a model at each feature point ${}^S F_j$ as follows. We formulate a set of K feature vectors $v_{jk} = (r_{jk}, \theta_{jk}, \phi_{jk})$ at each point k in a circular radius $r_j$ around ${}^S F_j$, as shown in Fig. 1. $(r_{jk}, \theta_{jk})$ are the polar coordinates of the point with respect to the invariant coordinate system at the interest point ${}^S F_j$; $\phi_{jk}$ is the difference between the gradient direction $\phi_k$ at the point k and the dominant orientation $\phi_j$:
$$\phi_{jk} = \phi_k - \phi_j \qquad (1)$$
Fig. 1. The elements of the feature vector $v$ at a point in the neighborhood of an interest point (shown in blue). $(r, \theta)$ are the polar coordinates of the point with respect to the interest point; $\phi$ is the gradient direction with respect to the dominant direction. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The model at the feature point ${}^S F_j$ is given by:

$${}^S w_j(v) = \max_{k=1,\dots,K} \exp\!\left( -\frac{\| v - v_{jk} \|^2}{2\omega^2} \right) \qquad (2)$$

The model function ${}^S w_j(v)$ can be viewed as a function which estimates the closeness of a given feature vector $v$ to all the feature vectors around the interest point ${}^S F_j$. $\omega$ is a constant which controls the amount of deviation allowed between two feature vectors below which they are considered close. The model function can also be precomputed and stored as a look-up table, which considerably speeds up the matching process.

3.2. Matching

The matching process uses evidence accumulation similar to the generalized Hough transform to determine the correspondence between two sets of interest points. Assume the source image (or reference image) has a set of J interest points ${}^S F = \{{}^S F_1, {}^S F_2, \dots, {}^S F_j, \dots, {}^S F_J\}$, ${}^S F_j = ({}^S x_j, {}^S y_j)$, and a corresponding model ${}^S w_j$, $j = 1, \dots, J$, for each interest point ${}^S F_j$. Let the target image have M interest points ${}^T F = \{{}^T F_1, {}^T F_2, \dots, {}^T F_m, \dots, {}^T F_M\}$. Consider an interest point ${}^T F_m$ in the target image. To find the best point in the source similar to ${}^T F_m$, we first compute (as in the model building process) L feature vectors $v_{ml} = (r_{ml}, \theta_{ml}, \phi_{ml})$ in a circular radius $r_m$ around the interest point. $(r_{ml}, \theta_{ml})$ are the polar coordinates of the point with respect to the invariant coordinate system at the interest point ${}^T F_m$; $\phi_{ml}$ is the difference between the gradient direction at the point l and the dominant orientation $\phi_m$ at ${}^T F_m$. Now, we compute a match between the interest point ${}^T F_m$ in the target image and the model ${}^S w_j(v)$ of the interest point ${}^S F_j$ in the source image:

$$A_{mj} = \frac{1}{L} \sum_{l=1}^{L} {}^S w_j(v_{ml}) \qquad (3)$$

The best match for the point ${}^T F_m$ in the source image is given by:

$${}^S F_i = \arg\max_{j} A_{mj}, \quad \forall A_{mj} > T \qquad (4)$$

where T is a user-defined threshold which determines the minimum match value for two points to be considered similar. If no match is found, it is assumed that the interest point ${}^T F_m$ does not exist in the source image.
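To make the accumulation in Eqs. (2)-(4) concrete, the following is a minimal NumPy sketch of the SKS model building and matching steps. It assumes interest points, their scales and dominant orientations are already available (e.g. from a Harris–Laplace style detector), evaluates the model directly on the raw feature vectors rather than through the precomputed look-up table mentioned above, and uses illustrative names and an assumed default for the constant $\omega$.

```python
import numpy as np

def sks_model(grad_dir, point, radius, dominant):
    """Collect feature vectors v = (r, theta, phi) in a circular neighborhood
    of `point`, expressed in its rotation-invariant coordinate system."""
    x0, y0 = point
    ys, xs = np.mgrid[0:grad_dir.shape[0], 0:grad_dir.shape[1]]
    dx, dy = xs - x0, ys - y0
    r = np.hypot(dx, dy)
    mask = (r <= radius) & (r > 0)
    theta = (np.arctan2(dy, dx) - dominant) % (2 * np.pi)   # polar angle in invariant frame
    phi = (grad_dir - dominant) % (2 * np.pi)               # relative gradient direction, Eq. (1)
    return np.stack([r[mask], theta[mask], phi[mask]], axis=1)

def match_score(model_vs, target_vs, omega=1.0):
    """Evidence accumulation A_mj of Eqs. (2) and (3): each target feature
    vector votes with a Gaussian kernel on its closest model vector."""
    score = 0.0
    for v in target_vs:
        d2 = np.sum((model_vs - v) ** 2, axis=1)         # ||v - v_jk||^2 for all k
        score += np.exp(-d2.min() / (2.0 * omega ** 2))  # = max_k exp(...), Eq. (2)
    return score / len(target_vs)

def best_match(source_models, target_vs, T=0.5):
    """Eq. (4): index of the best source point for one target point, or None."""
    scores = [match_score(m, target_vs) for m in source_models]
    j = int(np.argmax(scores))
    return j if scores[j] > T else None
```

In a full implementation the model ${}^S w_j$ would be quantized over $(r, \theta, \phi)$ and stored as a look-up table, as the text describes, rather than re-evaluated per feature vector.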
Fig. 2. Two images from the dataset where one is a rotated and scaled version of the other.
3.3. Dominant orientation estimation

To achieve invariance to rotation, we establish a reference coordinate system at each interest point ${}^S F_j$ by estimating the dominant gradient orientation. We do this by building a histogram of the gradient directions in a circular neighborhood of radius $r_j$, where $r_j$ is the scale of the interest point ${}^S F_j$. Histogram-based approaches have been shown to give the most stable results (Lowe, 2003). The accuracy of the actual direction of the dominant orientation is irrelevant as long as it is consistent, i.e. the neighborhoods of two corresponding interest points must always yield the same dominant orientation. Assume that the histogram h has N bins and that it is smoothed with a Gaussian window of width N/3. The dominant orientation $\phi_j$ is then taken as the highest peak of the histogram:

$$\phi_j = \arg\max_{n=1,\dots,N} h(\phi_n), \qquad \phi_j \in [0, 2\pi] \qquad (5)$$
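The orientation estimate of Eq. (5) translates into a few lines of NumPy/SciPy. This is a rough sketch under some assumptions of ours: the gradient-direction image is precomputed, the histogram is unweighted, and the mapping from the "width N/3" smoothing window to a Gaussian sigma is our own choice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def dominant_orientation(grad_dir, point, radius, n_bins=36):
    """Eq. (5): dominant gradient orientation inside a circular neighborhood.
    `grad_dir` holds gradient directions in [0, 2*pi)."""
    x0, y0 = point
    ys, xs = np.mgrid[0:grad_dir.shape[0], 0:grad_dir.shape[1]]
    mask = (xs - x0) ** 2 + (ys - y0) ** 2 <= radius ** 2
    hist, edges = np.histogram(grad_dir[mask], bins=n_bins, range=(0.0, 2 * np.pi))
    # Circular smoothing; sigma chosen so the window spans roughly N/3 bins (assumption).
    hist = gaussian_filter1d(hist.astype(float), sigma=n_bins / 6.0, mode='wrap')
    peak = int(np.argmax(hist))
    return 0.5 * (edges[peak] + edges[peak + 1])   # centre of the highest-peak bin
```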
4. Transformation model

Once feature correspondence has been established, the next step is to transform the target image to the reference image using a global transformation. The transformation should ideally minimize some error function between all correspondences. Since the images must be overlapping for our algorithm to work, the focal points of the camera when taking the two images can be assumed to be close (globally), so distortion differences between the two images can be assumed to be low. Therefore, we assume a similarity transform between the two images – the model parameters are translation $(t_x, t_y)$, rotation $\theta$ and scale $s$ only.

Let $({}^S F_1, {}^T F_1), ({}^S F_2, {}^T F_2), \dots, ({}^S F_n, {}^T F_n)$ be a set of n point correspondences between the source and the target image. ${}^S F_i$ and ${}^T F_i$ are feature points in the source and target images, respectively, and are tuples of x and y coordinates. Let H be the transformation which maps $H: {}^T F_i \mapsto {}^S F_i$. We look for the transformation model which minimizes the sum of squared errors between the correspondences:

$$\hat{H}(\theta, t_x, t_y, s) = \arg\min_{\theta, t_x, t_y, s} \sum_{i=1}^{n} \| H({}^T F_i) - {}^S F_i \|^2 \qquad (6)$$

In the 1D case, the set of equations obtained from all the correspondences can be solved using non-linear least squares regression (NLLS; see http://mathworld.wolfram.com/NonlinearLeastSquaresFitting.html). In the 2D case, we solve a set of 2n equations, where each correspondence contributes two separate equations, in the x and y dimensions, that share the same model parameters. For a similarity transform, the transformation equations along x and y are given by:

$$H_x = s(x \cos\theta - y \sin\theta) + t_x \qquad (7)$$

$$H_y = s(x \sin\theta + y \cos\theta) + t_y \qquad (8)$$

The number of correspondences n is assumed to be greater than the number of model parameters to be estimated, i.e. n > 4. We construct the Jacobian matrix of partial derivatives using all the points in the target:

$$A = \begin{bmatrix}
\frac{\partial H_x}{\partial \theta}\big|_1 & \frac{\partial H_x}{\partial t_x}\big|_1 & 0 & \frac{\partial H_x}{\partial s}\big|_1 \\
\frac{\partial H_y}{\partial \theta}\big|_1 & 0 & \frac{\partial H_y}{\partial t_y}\big|_1 & \frac{\partial H_y}{\partial s}\big|_1 \\
\frac{\partial H_x}{\partial \theta}\big|_2 & \frac{\partial H_x}{\partial t_x}\big|_2 & 0 & \frac{\partial H_x}{\partial s}\big|_2 \\
\frac{\partial H_y}{\partial \theta}\big|_2 & 0 & \frac{\partial H_y}{\partial t_y}\big|_2 & \frac{\partial H_y}{\partial s}\big|_2 \\
\vdots & \vdots & \vdots & \vdots \\
\frac{\partial H_x}{\partial \theta}\big|_n & \frac{\partial H_x}{\partial t_x}\big|_n & 0 & \frac{\partial H_x}{\partial s}\big|_n \\
\frac{\partial H_y}{\partial \theta}\big|_n & 0 & \frac{\partial H_y}{\partial t_y}\big|_n & \frac{\partial H_y}{\partial s}\big|_n
\end{bmatrix} \qquad (9)$$

where, for correspondence i with target coordinates (x, y),

$$\frac{\partial H_x}{\partial \theta}\Big|_i = s(-x \sin\theta - y \cos\theta), \qquad \frac{\partial H_y}{\partial \theta}\Big|_i = s(x \cos\theta - y \sin\theta),$$

$$\frac{\partial H_x}{\partial t_x}\Big|_i = 1, \qquad \frac{\partial H_y}{\partial t_y}\Big|_i = 1,$$

$$\frac{\partial H_x}{\partial s}\Big|_i = x \cos\theta - y \sin\theta, \qquad \frac{\partial H_y}{\partial s}\Big|_i = x \sin\theta + y \cos\theta.$$

Let $\lambda = [\theta \; t_x \; t_y \; s]$ be the model parameter vector. We solve for $\lambda$ iteratively until convergence using the following recurrence relation:

$$\lambda_{i+1} = \lambda_i + \delta\lambda \qquad (10)$$

$\delta\lambda$ is computed as follows:

$$\alpha = A^T A \qquad (11)$$

$$\beta = A^T \, \delta B \qquad (12)$$

$$\delta\lambda = \alpha^{-1} \beta \qquad (13)$$

The error matrix $\delta B$ is given by:

$$\delta B = \begin{bmatrix}
{}^S F_1 - H({}^T F_1) \\
{}^S F_2 - H({}^T F_2) \\
\vdots \\
{}^S F_n - H({}^T F_n)
\end{bmatrix} \qquad (14)$$

$\delta B$ is a matrix of size $2n \times 1$. We use LU decomposition to solve for $\alpha^{-1}$.
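The update of Eqs. (9)-(14) amounts to a Gauss–Newton iteration on the similarity parameters. Below is a compact NumPy sketch of that iteration; the initial guess, the convergence test and the use of a generic linear solver in place of an explicit LU decomposition are our own choices, and the function name is illustrative.

```python
import numpy as np

def estimate_similarity(src, tgt, n_iter=50, tol=1e-9):
    """Fit theta, tx, ty, s so that H(tgt) ~= src (Eqs. (7)-(8)) by
    Gauss-Newton iteration (Eqs. (9)-(14)). src, tgt are (n, 2) arrays."""
    theta, tx, ty, s = 0.0, 0.0, 0.0, 1.0
    for _ in range(n_iter):
        x, y = tgt[:, 0], tgt[:, 1]
        c, si = np.cos(theta), np.sin(theta)
        Hx = s * (x * c - y * si) + tx
        Hy = s * (x * si + y * c) + ty
        # Jacobian A of Eq. (9): rows alternate between x and y equations.
        A = np.zeros((2 * len(x), 4))
        A[0::2, 0] = s * (-x * si - y * c)   # dHx/dtheta
        A[1::2, 0] = s * (x * c - y * si)    # dHy/dtheta
        A[0::2, 1] = 1.0                     # dHx/dtx
        A[1::2, 2] = 1.0                     # dHy/dty
        A[0::2, 3] = x * c - y * si          # dHx/ds
        A[1::2, 3] = x * si + y * c          # dHy/ds
        # Residual vector dB of Eq. (14).
        dB = np.empty(2 * len(x))
        dB[0::2] = src[:, 0] - Hx
        dB[1::2] = src[:, 1] - Hy
        # Normal equations, Eqs. (11)-(13).
        dk = np.linalg.solve(A.T @ A, A.T @ dB)
        theta, tx, ty, s = theta + dk[0], tx + dk[1], ty + dk[2], s + dk[3]
        if np.linalg.norm(dk) < tol:
            break
    return theta, tx, ty, s
```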
To reduce the influence of outliers on the accuracy of the solution, we use RANSAC (Fischler and Bolles, 1981), which involves finding the best model parameters using random subsets drawn from the set of correspondences instead of the entire set. The model parameters estimated from the subset which gives the least sum of squared error are then taken as the best fit.
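A minimal RANSAC loop around the solver sketched above might look as follows; the subset size and iteration count are illustrative choices, not values from the paper, and subsets are scored, as described above, by the total squared error over all correspondences.

```python
import numpy as np

def ransac_similarity(src, tgt, n_iter=500, subset=4):
    """Fit the similarity model on random correspondence subsets and keep
    the parameters that give the least sum of squared error overall."""
    rng = np.random.default_rng(0)
    best, best_err = None, np.inf
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=subset, replace=False)
        try:
            theta, tx, ty, s = estimate_similarity(src[idx], tgt[idx])
        except np.linalg.LinAlgError:
            continue   # degenerate subset, try another
        c, si = np.cos(theta), np.sin(theta)
        Hx = s * (tgt[:, 0] * c - tgt[:, 1] * si) + tx
        Hy = s * (tgt[:, 0] * si + tgt[:, 1] * c) + ty
        err = np.sum((src[:, 0] - Hx) ** 2 + (src[:, 1] - Hy) ** 2)
        if err < best_err:
            best, best_err = (theta, tx, ty, s), err
    return best
```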
5. Results

We evaluate the performance of the algorithm in two steps. First, we look at the performance of the feature matcher by comparing it with the state of the art. Then we look at the performance of the registration technique itself by registering images from a dataset of eight aerial image pairs.

5.1. Feature matching performance

To evaluate the performance of the feature matcher, one might use ROC curves, but ROC analysis is not well suited to this problem because the concept of a "negative" is not well defined. This problem is best summed up by Agarwal and Roth (2002). A feature point has one correct match and many negative matches, causing the total number of negative matches to be typically large (and difficult to measure accurately) compared to the number of positive (or correct) matches. This gives rise to a large one-sided error when using ROC curves, as the false positive rate will appear to be very small despite the high number of absolute false matches. Instead, we look at the performance of the feature matcher using precision–recall curves, which avoid this problem while still expressing the desired trade-off. Recall is defined as:
$$\text{recall} = \frac{\text{Number of correct matches found}}{\text{Actual number of correct matches}} \qquad (15)$$

Precision is defined as:

$$\text{precision} = \frac{\text{Number of correct matches found}}{\text{Total number of matches found}} \qquad (16)$$

We may plot recall versus imprecision, which is defined as:

$$\text{imprecision} = 1 - \text{precision} = \frac{\text{Number of false matches found}}{\text{Total number of matches found}} \qquad (17)$$
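For concreteness, Eqs. (15)-(17) reduce to a trivial helper over the raw match counts reported by the matcher; the function and argument names below are ours.

```python
def match_metrics(n_correct_found, n_false_found, n_correct_total):
    """Recall, precision and imprecision (Eqs. (15)-(17)) from match counts."""
    found = n_correct_found + n_false_found
    recall = n_correct_found / n_correct_total
    precision = n_correct_found / found
    return recall, precision, 1.0 - precision   # imprecision = 1 - precision
```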
We use the dataset2 used in (Mikolajczyk and Schmid, 2005). Fig. 2 shows a reference and a target image from the dataset, where the target image is a scaled and rotated version of the reference image. Fig. 3 shows the features matched between the two images. In order to determine the actual number of correct matches between the two images and to verify the correctness of a correspondence, we use the method described in (Mikolajczyk and Schmid, 2005). Let A and B respectively be the two regions in the reference and the target images whose correspondence is to be checked. Let H be the known homography provided between the reference and the target images. We calculate the overlap error between the two regions by computing the areas of the union and intersection of the regions. This is done by first projecting the region B from the transformed image to the reference image using the homography H. The overlap error is defined as:

$$\epsilon = 1 - \frac{\mathrm{Area}(A \cap H^T B H)}{\mathrm{Area}(A \cup H^T B H)} \qquad (18)$$
2 The dataset used is available at http://www.robots.ox.ac.uk/~vgg/research/affine/.
Fig. 3. The features matched between the two images. The total number of correspondences found was 121. The number of correct correspondences was 120 with one false correspondence. The identified correspondences (both correct and incorrect) are marked using the same numbers (drawn in red) in both images. A subset of the correct matches are shown using yellow lines for easy visualization. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
A match is assumed to be correct if the overlap error is less than 0.5. The actual number of correct matches is estimated by finding, for each region around the corner points in the target image, the region around the corner point in the reference image with the lowest overlap error. The region size is proportional to the scale of the corner point.
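The overlap error of Eq. (18) can be approximated by rasterizing the two regions as binary masks, warping one of them with the homography, and comparing areas. The rough OpenCV/NumPy sketch below assumes the regions are already given as binary masks in their own images (the original evaluation defines regions via second-moment matrices, i.e. ellipses); the names are ours.

```python
import cv2
import numpy as np

def overlap_error(mask_a, mask_b, H):
    """Approximate Eq. (18): 1 - area(A intersect H(B)) / area(A union H(B)),
    where H maps target-image coordinates to reference-image coordinates."""
    h, w = mask_a.shape
    # Project region B from the target image into the reference image.
    warped_b = cv2.warpPerspective(mask_b.astype(np.uint8), H, (w, h),
                                   flags=cv2.INTER_NEAREST) > 0
    a = mask_a > 0
    inter = np.logical_and(a, warped_b).sum()
    union = np.logical_or(a, warped_b).sum()
    return 1.0 - inter / union if union > 0 else 1.0

# A match is accepted when overlap_error(...) < 0.5, as stated above.
```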
Fig. 4. Precision–recall curve for scale and rotation changes.
Fig. 5. Registration results.
Fig. 6. Registration results.
Fig. 7. Registration results.
The goal of any feature matcher is to maximize the number of correct matches and minimize the number of false matches. Precision–recall curves are plotted by varying a threshold which controls the total number of matches found. In general, recall increases with increasing imprecision. If the threshold is relaxed until everything is considered a match, precision drops towards zero; so a recall of 1 is achieved only at precisions close to zero. Fig. 4 shows the performance of the SKS feature matcher compared with SIFT-based matching. SIFT-based descriptors with Euclidean distance for feature matching have been shown to outperform other methods in (Mikolajczyk and Schmid, 2005). Ideally, we want a feature matching algorithm to detect many more correct matches at lower imprecision (<0.4), because at higher imprecision we have many more false matches despite having higher recall in general. Notice that SKS clearly has higher recall rates at
low imprecision than SIFT-based matchers. This translates to a higher percentage of correct matches relative to false matches in that region. Most real-world feature matchers would operate in this region, and SKS clearly demonstrates its superiority here. SIFT-based matchers, however, achieve higher recall rates at high imprecision, but this comes at the cost of more false matches. In the case of image registration, the step after feature matching is usually transformation model estimation, where high false positive rates are clearly not acceptable. Therefore, the SKS feature matcher is clearly the better choice of the two here.

5.2. Registration performance

We test the performance of the algorithm on registering images by using the feature matcher to determine correspondence points
Fig. 8. Registration results.
Fig. 9. Registration results.
Fig. 10. Registration results.
Fig. 11. Registration results.
Fig. 12. Registration results.
and then estimating the similarity transform using non-linear least squares regression (NLLS) as described in Section 4. We use a database of eight synthetic aerial image pairs (generated from real images). Each image pair contains one target image, with extreme rotation, scale and brightness changes, which we attempt to register with a reference image. Also note that there is some time lapse between the image pairs. The results of the registration are shown in Figs. 5–12. Each figure shows the reference image, the target image and the difference image after registration. The results are close to perfect and show no registration anomalies. Notice that the difference image is not solid black; this is because of the extreme brightness changes present in the target image. Since the emphasis of this paper is on feature matching, we did not consider the sub-pixel accuracy of the actual registration, but computed it to integer pixel precision only.

6. Conclusion and future work

We have introduced a powerful new feature matching algorithm which matches features using an evidence accumulation process and is invariant to translation, scale and rotation, and robust to illumination changes. The algorithm also outperforms conventional feature matching algorithms based on the SIFT descriptor. The correspondence points provided by the feature matcher were then used to estimate a similarity transformation to register images with impressive results. Future work will involve extending the algorithm to model more complex transformations.

Acknowledgement

This work was supported by the United States Air Force Office of Scientific Research under Grant No. FA9550-07-1-0176.
References

Agarwal, S., Roth, D., 2002. Learning a sparse representation for object detection. In: ECCV '02: Proc. 7th European Conf. on Computer Vision – Part IV. Springer-Verlag, London, UK, pp. 113–130.
Bajcsy, R., Kovačič, S., 1989. Multiresolution elastic matching. Comput. Vision Graphics Image Process. 46 (1), 1–21.
Ballard, D., 1981. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 13 (2), 111–122.
Bookstein, F., 1989. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Machine Intell. 11 (6), 567–585.
Ecabert, O., Thiran, J., 2004. Adaptive Hough transform for the detection of natural shapes under weak affine transformations. Pattern Recognition Lett. 25 (12), 1411–1419.
Fischler, M., Bolles, R., 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM 24, 381–395.
Flusser, J., Suk, T., 1993. Pattern recognition by affine moment invariants. Pattern Recognition 26 (1), 167–174.
Flusser, J., Suk, T., 1994. A moment-based approach to registration of images with affine geometric distortion. IEEE Trans. Geosci. Remote Sens. 32 (2), 382–387.
Harris, C., Stephens, M., 1988. A combined corner and edge detector. In: Proc. 4th Alvey Vision Conf., pp. 147–152.
Kiryati, N., Kälviäinen, H., Alaoutinen, S., 2000. Randomized or probabilistic Hough transform: Unified performance evaluation. Pattern Recognition Lett. 21 (13–14), 1157–1164.
Lindeberg, T., 1998. Feature detection with automatic scale selection. Internat. J. Comput. Vision 30 (2), 79–116.
Lowe, D.G., 1999. Object recognition from local scale-invariant features. In: Proc. Internat. Conf. on Computer Vision (ICCV), Corfu, pp. 1150–1157.
Lowe, D., 2003. Distinctive image features from scale-invariant keypoints. Internat. J. Comput. Vision 20, 91–110.
Mikolajczyk, K., Schmid, C., 2001. Indexing based on scale invariant interest points. In: Proc. 8th IEEE Internat. Conf. on Computer Vision (ICCV 2001), vol. 1, pp. 525–531.
Mikolajczyk, K., Schmid, C., 2005. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Machine Intell. 27 (10), 1615–1630.
Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Gool, L.V., 2005. A comparison of affine region detectors. Internat. J. Comput. Vision 65 (1–2), 43–72.
Viola, P., Wells III, W., 1997. Alignment by maximization of mutual information. Internat. J. Comput. Vision 24 (2), 137–154.
Zitova, B., Flusser, J., 2003. Image registration methods: A survey. Image Vision Comput. 21 (11), 977–1000.