Optik 124 (2013) 4685–4692
Efficient numerical analysis of optical imaging data: A comparative study

David Duarte-Correa a, Alberto Pastrana-Palma a, Carlos A. Olvera-Olvera a, Sergio R. Ramírez-Rodríguez c, Daniel Alaniz-Lumbreras b, Domingo Gómez-Meléndez d, Ismael de la Rosa b, Salvador Noriega e, Vianey Torres e, Victor M. Castaño a,∗

a Universidad Autónoma de Querétaro, Cerro de las Campanas s/n, 76010 Querétaro, Mexico
b Unidad Académica de Ingeniería Eléctrica, Doctorado en Ingeniería, Universidad Autónoma de Zacatecas, Jardín Juárez 146, Centro Histórico, Zacatecas, Mexico
c Tlachia Systems S.A. de C.V., Calle 13 de Septiembre No. 1, Niños Héroes, C.P. 76010 Querétaro, Mexico
d Universidad Politécnica de Querétaro, Carretera Estatal 420 S/N, El Rosario, C.P. 76240 El Marqués, Querétaro, Mexico
e Departamento de Ingeniería Industrial y Manufactura, IIT, Universidad Autónoma de Ciudad Juárez, Av. P.E. Calles 1210, Fovissste Chamizal, C.P. 32310 Juárez, Chihuahua, Mexico
Article info
Article history: Received 5 September 2012 Accepted 20 January 2013
Keywords: Interest points; Detector; Computing time
Abstract

The computational efficiency of 14 optical detectors over six types of transformations, namely blur, illumination, rotation, viewpoint, zoom, and zoom-rotation changes, was analyzed. Images with the same resolution (750 × 500 pixels) were studied in terms of correspondences, repeatability, and computing time, and the correspondence was measured by using homographies, i.e., projective transformations, to obtain the best efficiency for imaging applications. Results show that the multi-scale Harris–Hessian detector is the most efficient for blur, illumination, and zoom-rotation changes. Meanwhile, multi-scale Hessian and Hessian–Laplace are the best methods for rotation, viewpoint, and zoom changes.

© 2013 Elsevier GmbH. All rights reserved.
1. Introduction

A fundamental issue in computer vision is the matching of optical image data. The main problem in object recognition is finding correspondences between two optical images of the same scene, taken from arbitrary viewpoints, with different cameras, scaling, rotation, and illumination conditions. Different solutions have been developed over the past few years by using interest point detectors. These approaches first detect characteristic features and then compute a set of descriptors for these features [1–5]. Among the different approaches to image recognition, feature detection has become the most widely used. In this method, at least a few features must be present in both images in order to allow correspondences. Features shown to be particularly appropriate are called keypoints [5]. These features have also been referred to as salient points or interest points in the literature [1]. These interest points are typically blobs, corners, and junctions. Additionally, there is no universal detector or descriptor, but a combination of complementary operators seems to be a reasonable solution [9]. If the change of scale between images is unknown,
∗ Corresponding author at: On sabbatical leave at CIATEQ, Centro de Tecnología Avanzada, Querétaro, Mexico.
E-mail address: [email protected] (V.M. Castaño).
http://dx.doi.org/10.1016/j.ijleo.2013.01.116
a simple way to deal with this change is to extract points at several scales and use them to represent the image, i.e., a multi-scale approach. In this approach, a local image structure is generally represented over a defined scale range: points are detected at each scale within this range. As a consequence, many points represent the same structure, although their location and scale differ slightly. This unnecessarily high number of points increases the probability of mismatches and the complexity of the matching algorithms, so efficient methods for rejecting false matches and for verifying the results are required [6] (a short sketch of the multi-scale idea follows the detector list below). On the other hand, optical detectors and descriptors are relevant methods for extracting meaningful features for image recognition. Studies that measure correspondence, occurrence, and accuracy within images have recently been reported [6–10]. However, only a limited number of works actually compare the efficiency of these methods. Generally speaking, the main goal of detection methods is to recognize image regions covariant with transformations, which are then used as support regions to compute invariant descriptors. In this work we present evaluations of detection methods in different contexts. The same scene or object is observed under different viewing conditions: blur, illumination, viewpoint, rotation, zoom, and zoom-rotation changes. Accordingly, this paper compares the efficiency and repeatability of the leading detector algorithms reported in the literature. The evaluation was carried out by using the following detectors [6]:
Fig. 1. Detection methods evaluated in this paper. These colors are used in all the figures to identify each method. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
Harris, Hessian, Harris–Laplace, Hessian–Laplace, multi-scale Harris, multi-scale Hessian, multi-scale Harris–Hessian, Difference of Gaussian, Maximally Stable Extremal Regions (MSER), Harris Affine, and Hessian Affine [3]; MSER [2], Intensity Extrema Based (IBR), and Edge Based (EBR) by Tuytelaars [12], the last three taken from the Affine Covariant Features website [11].
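To make the multi-scale idea from the introduction concrete, the following is a minimal sketch of a multi-scale Harris detector in Python (our illustration, not the code evaluated in this paper); the scale range, the 1.4× integration-scale factor, and the response threshold are assumptions:

```python
# Minimal multi-scale Harris sketch (illustrative only, not the evaluated code).
# Assumed parameters: scale range, integration scale = 1.4 * derivative scale,
# and the relative response threshold.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_response(img, sigma_d, sigma_i, k=0.04):
    """Harris cornerness at derivative scale sigma_d, integration scale sigma_i."""
    Ix = gaussian_filter(img, sigma_d, order=(0, 1))  # x-derivative
    Iy = gaussian_filter(img, sigma_d, order=(1, 0))  # y-derivative
    # Second-moment matrix entries, smoothed at the integration scale.
    Sxx = gaussian_filter(Ix * Ix, sigma_i)
    Syy = gaussian_filter(Iy * Iy, sigma_i)
    Sxy = gaussian_filter(Ix * Iy, sigma_i)
    return Sxx * Syy - Sxy**2 - k * (Sxx + Syy)**2  # det(M) - k * trace(M)^2

def multiscale_harris(img, scales=(1.0, 1.4, 2.0, 2.8, 4.0), rel_thresh=1e-2):
    """Return (x, y, scale) triples: local maxima of the response at each scale."""
    points = []
    for s in scales:
        R = harris_response(img.astype(float), sigma_d=s, sigma_i=1.4 * s)
        peaks = (R == maximum_filter(R, size=3)) & (R > rel_thresh * R.max())
        ys, xs = np.nonzero(peaks)
        points += [(x, y, s) for x, y in zip(xs, ys)]
    return points
```

As noted above, the same structure typically fires at several neighboring scales with slightly shifted locations, which is precisely why mismatch rejection is needed downstream.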
2. Experimental

Fourteen detectors that have previously shown good performance in such environments were selected. The performance was measured according to the repeatability rate, which is the percentage of points simultaneously present in a pair of images. The better this matching, the better the recognition results.
Fig. 3. Blur change. (a) Correspondences, (b) repeatability, (c) computing time of correspondences, (d) computing time of repeatability.
Fig. 1 shows the 14 detection methods and the color used for each. Note that there are two MSER detectors: MSER 1 and MSER 2 refer to the Mikolajczyk [3] and Matas [2] implementations, respectively. The image sets were taken from the database on Mikolajczyk's site [4]. Fig. 2 shows samples from the image sets used to evaluate the detectors. The images were chosen to represent six different types of transformations: blur (a), illumination (b), rotation (c), viewpoint (d), zoom (e), and zoom-rotation (f). The blur sequences
Fig. 2. Image set examples: (a) image blur, (b) light change, (c) rotation, (d) viewpoint change, (e) zoom, (f) zoom-rotation. In these examples, there is only one change for each type of transformation. The left image of each set was used as the reference image.
Fig. 4. Illumination change. (a) Correspondences, (b) repeatability, (c) computing time of correspondences, (d) computing time of repeatability.
were obtained by varying the camera focus, the illumination by the camera aperture, and the rotation by turning the plane of the picture. In the viewpoint change test, the camera varies from a fronto-parallel view to approximately 60° from the reference viewpoint. Zoom change is obtained by adjusting the camera lens; in the zoom sequences, the scale changes by a factor of four [8]. The sequences for blur, light, viewpoint, and zoom-rotation contain 6 images each, the rotation sequence contains 18 images, and the zoom set includes 21 images. All of them exhibit a gradual geometric or photometric transformation. The original images have different resolutions and formats. To solve this issue, the images were cropped from the center to a homogeneous resolution of 750 × 500 pixels in "Portable Pixel Map" (PPM) format, so each image has a 1.1 MB file size. Having the same image size and format makes the comparison among images easier, especially when evaluating the computational time over the six types of transformations.

The pictures in an image set are related by plane projective transformations: homographies. The mapping that associates each image with the reference image (the left image of each set in Fig. 2) is known; hence, this mapping is used to determine ground-truth matches for the affine covariant detectors [8]. In all our experiments, the same set of parameters is used for each detector: the default parameters given by the authors. The analysis was made in order to find the most efficient methods to detect interest points. First, we compared the behavior of the 14 methods visually and observed their tendencies with respect to each other. As a second study, we used the methodology proposed and the images described above. The experiments were run on an Intel® Core™ i3 M370 2.4 GHz Linux PC.
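As a sketch of this preprocessing step (the file names are hypothetical and the exact tooling used is not stated in the paper), each image can be center-cropped and saved as PPM with Pillow:

```python
# Center-crop images to a common 750x500 resolution and save as PPM.
# Preprocessing sketch with hypothetical file names; requires Pillow.
from PIL import Image

def center_crop_to_ppm(src_path, dst_path, width=750, height=500):
    img = Image.open(src_path).convert("RGB")
    w, h = img.size
    left = (w - width) // 2   # horizontal offset of the centered crop window
    top = (h - height) // 2   # vertical offset of the centered crop window
    img.crop((left, top, left + width, top + height)).save(dst_path, format="PPM")

center_crop_to_ppm("graffiti_1.png", "graffiti_1.ppm")  # hypothetical names
```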
Fig. 5. Rotation angle. (a) Correspondences, (b) repeatability, (c) computing time of correspondences, (d) computing time of repeatability.
3. Results and discussion

The correspondence, repeatability, and computing time of the detectors were evaluated using, in part, the repeatability criterion presented by Mikolajczyk [6] and Schmid [10]. The most important parameters considered for characterizing a feature detector were:

1. The number of correspondences, which is a measure of localization accuracy.
2. The repeatability, which is the average number of corresponding points detected in the images under different geometric and photometric transformations.
3. The computing time needed to execute the detection algorithm.

3.1. Accuracy

The measure of accuracy is the relative amount of overlap between the evaluated image and the reference image. The evaluated image projects a region over the area detected in the reference image, and these projected regions are obtained using the homographies that relate the pair of images. The accuracy is a measure of the localization of interest points, and it is a region estimator. Here, two points are considered to correspond if their relative location error is less than 1.5 pixels [5]:

$$\|x_a - H \cdot x_b\| < 1.5 \qquad (1)$$
where $x_a$ is a detected point in the reference image, $x_b$ is the corresponding point detected in the evaluated image, and $H$ is the homography relating those images. Additionally, the error $\varepsilon_s$ in the image surface covered by the neighborhoods must be less than 40%:
$$\varepsilon_s = 1 - \frac{\mu_a \cap (A^T \mu_b A)}{\mu_a \cup (A^T \mu_b A)} < 0.4 \qquad (2)$$
where $A$ is the linearization of $H$ at point $x_b$, $\mu_a$ and $\mu_b$ are elliptical regions defined by $x^T \mu x = 1$, $\mu_a \cap (A^T \mu_b A)$ is the intersection of the regions, and $\mu_a \cup (A^T \mu_b A)$ is their union [5]. The areas of the union and the intersection are computed numerically.
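Since the union and intersection are computed numerically, one simple approach is to rasterize both ellipses on a common grid. The sketch below is our illustration of Eq. (2), assuming both regions have already been re-centered at a common point; it is not the authors' implementation:

```python
# Numerical overlap error (Eq. (2)) between two elliptical regions, computed
# by rasterizing both ellipses on a pixel grid. Illustrative sketch only;
# assumes the regions were re-centered at a common point beforehand.
import numpy as np

def ellipse_mask(mu, center, grid):
    """Points x with (x - c)^T mu (x - c) <= 1 for a 2x2 matrix mu."""
    d = grid - center                       # shape (N, 2)
    q = np.einsum('ni,ij,nj->n', d, mu, d)  # quadratic form per grid point
    return q <= 1.0

def overlap_error(mu_a, mu_b, A, center, half_size=50, step=0.25):
    """1 - |a ∩ (A^T b A)| / |a ∪ (A^T b A)|, estimated on a regular grid."""
    xs = np.arange(-half_size, half_size, step)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2) + center
    mask_a = ellipse_mask(mu_a, center, grid)
    mask_b = ellipse_mask(A.T @ mu_b @ A, center, grid)  # region b projected by A
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return 1.0 - inter / union

# Example: concentric circles of radius 10 and 12.
mu_a = np.eye(2) / 10.0**2
mu_b = np.eye(2) / 12.0**2
print(overlap_error(mu_a, mu_b, np.eye(2), np.zeros(2)))  # ~0.31 < 0.4: corresponds
```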
Fig. 6. Viewpoint change. (a) Correspondences, (b) repeatability, (c) computing time of correspondences, (d) computing time of repeatability.
The repeatability is a measure of the number of corresponding interest points between two images. This score is computed as the ratio between the number of correspondences $N_c$ and the smaller number of detected points $D_p$ in the pair of images; the criterion thus takes into account only the regions located in both images. The repeatability rate, $R$, is usually expressed as a percentage:

$$R = \frac{N_c}{D_p} \qquad (3)$$
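A compact sketch (ours; the array layouts are assumptions) that counts correspondences with the 1.5-pixel criterion of Eq. (1) and derives the repeatability of Eq. (3):

```python
# Correspondences via Eq. (1) and repeatability via Eq. (3). Sketch only;
# a full evaluation would also apply the overlap criterion of Eq. (2).
import numpy as np

def project(H, pts):
    """Map Nx2 points through the 3x3 homography H (homogeneous coordinates)."""
    hom = np.hstack([pts, np.ones((len(pts), 1))])
    proj = hom @ H.T
    return proj[:, :2] / proj[:, 2:3]

def repeatability(pts_ref, pts_eval, H, tol=1.5):
    """Return (Nc, R): correspondence count and repeatability rate."""
    mapped = project(H, pts_eval)  # evaluated points mapped into the reference
    # Eq. (1): a reference point corresponds if some mapped point lies
    # within 1.5 pixels (a greedy simplification of one-to-one matching).
    dists = np.linalg.norm(pts_ref[:, None, :] - mapped[None, :, :], axis=2)
    n_c = int((dists.min(axis=1) < tol).sum())
    return n_c, n_c / min(len(pts_ref), len(pts_eval))
```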
As observed, the accuracy is implicit in this rate. Higher repeatability indicates better detection methods for recognition tasks. However, if the number of correspondences is very low, a detection method could show the same or greater repeatability than another with a high number of correspondences. Nevertheless, we use this percentage to compare the methods' behavior in the plots. The computational evaluation consists of measuring the response time of the detection methods, done by successively applying a particular method over the same set of images at different transformations. Given the multiple interest point approaches, the need for independent performance evaluations was identified early, and many experimental tests have been performed over the last three decades using various experimental frameworks and criteria. In the earliest papers, often only visual inspection was done [7]; others performed more quantitative evaluations, providing scores for individual images or for small test sets. Different image processing applications have different computing time requirements: often a fast response is desirable in critical applications, while other applications need precise results even if they take more time. These times may range from hundreds of microseconds up to seconds, a wide span that directly affects the expected performance. Considering the aforementioned criteria, the need to select an algorithm with a fast response and good accuracy is clear. The chosen method is decisive for the final results, since it defines the number and position of interest points for the subsequent matching and recognition steps. This selection affects the subsequent calculations and the time to process them, in addition to the computing time of the specific detector itself. If the chosen detector for the six types of transformations studied results in poor
Fig. 7. Zoom factor. (a) Correspondences, (b) repeatability, (c) computing time of correspondences, (d) computing time of repeatability.
recognition and takes a long time, the implementation can become unviable. Furthermore, there is a relationship among the correspondence, the repeatability, and the computing time that determines the most efficient detection method. For a set of $n$ sequences $I_1, I_2, I_3, \ldots, I_n$, we denote $A_i$ as the $i$th change relating $I_1$ to the image $I_{i+1}$. Additionally, if $j$ is a detection method for any pair of images, $A_{i,j}$ is the $i$th change with the $j$th detection method, relating the image pair $I_1$ and $I_{i+1}$.
Fig. 8. Zoom and rotation change. (a) Correspondences, (b) repeatability, (c) computing time of correspondences, (d) computing time of repeatability.
Fig. 9. Blur change performance. (a) Average efficiency and (b) efficiency ratio.
Now, for a collection $A_{i,j}$ of changes and algorithms, $C(A)$ represents their correspondence. By normalizing these correspondence values, we obtain a relationship between them:

$$\|C(A_j)\| = \frac{\mathrm{average}(C(A_{i,j}))}{\max(C(A_{i,j}))} \qquad (4)$$
where $\max(C(A_{i,j}))$ is the highest number of correspondences over all transformations and methods of one image set, and $\mathrm{average}(C(A_{i,j}))$ is the average correspondence of the $j$th method over all $i$th sequences; thus $\|C(A_j)\|$ is the normalized correspondence. Similarly, the repeatability is represented by $R(A)$, and its normalization $\|R(A_j)\|$ is:

$$\|R(A_j)\| = \frac{\mathrm{average}(R(A_{i,j}))}{\max(R(A_{i,j}))} \qquad (5)$$
where $\max(R(A_{i,j}))$ is the highest repeatability in $A_{i,j}$, and $\mathrm{average}(R(A_{i,j}))$ is the average repeatability of the $j$th method over all $i$th sequences. For Eqs. (4) and (5), a value near unity indicates desirable behavior; in the case of time, a shorter response is better. The computing time is given by $T(A)$, and its normalization is:

$$\|T(A_j)\| = \frac{\mathrm{average}(T(A_{i,j}))}{\max(T(A_{i,j}))} \qquad (6)$$
where $\max(T(A_{i,j}))$ is the highest computing time in $A_{i,j}$, and $\mathrm{average}(T(A_{i,j}))$ is the average computing time of the $j$th method over all $i$th sequences.
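Eqs. (4)–(6) share the same form, so a single helper covers all three. A sketch, assuming the scores are stored in a matrix with one row per transformation $i$ and one column per method $j$:

```python
# Normalization of Eqs. (4)-(6): per-method average over all transformations,
# divided by the maximum over all transformations and methods. Sketch only.
import numpy as np

def normalized_scores(values):
    """values[i, j] = score of method j on transformation i; returns ||.|| per j."""
    return values.mean(axis=0) / values.max()
```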
Fig. 10. Illumination change performance. (a) Average efficiency and (b) efficiency ratio.
Fig. 11. Rotation performance. (a) Average efficiency and (b) efficiency ratio.
The average efficiency ($A_E$) is defined as the average of the normalized correspondence, repeatability, and time:

$$A_E = \frac{\|C(A_j)\| + \|R(A_j)\| + \|T(A_j)\|}{3} \qquad (7)$$
The ideal $A_E$ is unity, which occurs for only one combination of values: $\|C(A_j)\| = 1$, $\|R(A_j)\| = 1$, and $\|T(A_j)\| = 1$. However, while a detector $j$ can have high repeatability, its number of correspondences or its computing-time score can be low, and vice versa. The final criterion to select the most efficient algorithm is the efficiency ratio ($E_R$), the product of the normalized scores of Eqs. (4), (5), and (6):

$$E_R = \|C(A)\| \cdot \|R(A)\| \cdot \|T(A)\| \qquad (8)$$
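Building on the `normalized_scores` helper sketched above, Eqs. (7) and (8) reduce to a few lines; `C`, `R`, and `T` are the raw score matrices with the same assumed layout:

```python
# Average efficiency (Eq. (7)) and efficiency ratio (Eq. (8)) per method,
# using normalized_scores() from the previous sketch.
def efficiency(C, R, T):
    nC, nR, nT = (normalized_scores(M) for M in (C, R, T))
    AE = (nC + nR + nT) / 3.0  # Eq. (7): arithmetic mean of the three scores
    ER = nC * nR * nT          # Eq. (8): product; unity is the ideal value
    return AE, ER
```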
The product $E_R$ couples the three behaviors more tightly. A value of unity indicates the best performance, implying the fastest detector with the highest repeatability and the maximum number of correspondences. To obtain the best efficiency ratio, there should be a balance among the three components.

In the blur change, two groups are observed according to the number of correspondences on a logarithmic scale (Fig. 3a). The multi-scale Harris–Hessian method presented the largest number of correspondences (1100–2500). However, its repeatability is lower than that of other detectors (Fig. 3b). Fig. 3c and d plot the computing time vs. the number of correspondences and the repeatability, respectively; these plots show all $A_{i,j}$ transformations over the blur image set. The greater the dispersion of points for a specific method, the more sensitive it is to blur changes. In panels (c) and (d) of Figs. 3–8, the horizontal black line is the median of the computing time and the vertical line is the median of the correspondences or repeatability, depending on the plot; thus, quadrants are obtained to locate the detectors' behavior in the experiment. Quadrants C1 and C4 indicate low computing time, while quadrants C3 and C4 indicate a high number of correspondences or repeatability, as the case may be. The detectors located in quadrant C1 represent the worst response; for blur change these are EBR and IBR. In C3 it is observed that the best response is obtained with methods based on Harris and Hessian.
Under illumination change (Fig. 4a and b), the correspondences and repeatability decrease faster than under blur changes. As above, the black lines are the medians in the corresponding figure. Fig. 4c and d show wide dispersion. Here, Harris Affine is clearly the fastest detector and multi-scale Harris–Hessian has the most correspondences. The slowest methods are EBR, MSER 1, MSER 2, and Difference of Gaussian.

The rotation image set has 17 transformations, with angles varying from 10° to 170°. Near 90°, the number of correspondences decreases, reaching zero in some cases (Fig. 5a). Repeatability is lower than for the other five transformations; Harris is the best, with 10–30% (Fig. 5b). MSER 1 and MSER 2 presented zero correspondences, and Hessian Affine is among the slowest detectors in quadrants C2 and C3 (Fig. 5c, d). Harris Affine is ten times faster than the others. In general, the methods based on Hessian give the best results.

For viewpoint changes, Hessian–Laplace and multi-scale Hessian have the largest number of correspondences, but the counts quickly drop to zero as the viewpoint change grows. On the other hand, Harris–Laplace and multi-scale Hessian are more stable to viewpoint changes (Fig. 6a, b). EBR takes 100 times longer than the others and yields fewer correspondences, while Hessian and Harris are the fastest detectors (Fig. 6c, d).

There are 21 variations, from 0.4× to 4.4×, in the zoom changes. The number of correspondences ranges from thousands to zero, with repeatability decreasing quickly at the 1× zoom factor (Fig. 7a, b). The computing time is not significantly affected by the zoom factor, as observed from the horizontal trends in Fig. 7c and d. Hessian and multi-scale Hessian are the fastest detectors; however, there is no clear distinction in repeatability or correspondences.

The combination of zoom and rotation is displayed in Fig. 8. These plots show the same behavior as the zoom changes, with the number of correspondences decreasing quickly. Again, it is difficult to determine which method is best.

The results based on the criteria described above are summarized in Table 1. From these data, two series of plots can be extracted; they are presented in Figs. 9–14. The average efficiency $A_E$ is presented in plot (a) of every figure. As detailed before, $A_E$ corresponds to the average of the normalized number of correspondences, repeatability, and computing time (Eq. (7)).
Table 1. Summary of results. The values are averages over all image set sequences. C(A) – number of correspondences, R(A) – repeatability percentage, T(A) – computing time in seconds, AE – average efficiency (Eq. (7)), ER – efficiency ratio (Eq. (8)).

Detector | C(A) | R(A)% | T(A) s | AE | ER
Difference of Gaussian | 61.6 | 32 | 2.28 | 0.498 | 0.058
EBR | 37.3 | 25 | 22.82 | 0.171 | 0.001
Harris Affine | 410.3 | 30.7 | 1.31 | 0.667 | 0.276
Multi-scale Harris–Hessian | 637.8 | 35.2 | 0.58 | 0.874 | 0.643
Harris–Laplace | 475.7 | 34.1 | 0.52 | 0.789 | 0.481
Multi-scale Harris | 475.7 | 34.1 | 0.52 | 0.79 | 0.481
Harris | 136 | 32 | 0.8 | 0.608 | 0.151
Hessian Affine | 312.7 | 34.6 | 1.01 | 0.656 | 0.25
Hessian–Laplace | 410.2 | 40.9 | 0.38 | 0.858 | 0.605
Multi-scale Hessian | 410.2 | 40.9 | 0.38 | 0.858 | 0.605
Hessian | 88.9 | 40.3 | 0.77 | 0.652 | 0.155
IBR | 48.6 | 21.4 | 3.86 | 0.26 | 0.001
MSER 1 | 90.2 | 22.9 | 2.61 | 0.362 | 0.028
MSER 2 | 29.1 | 24.3 | 0.19 | 0.467 | 0.019
Fig. 12. Viewpoint performance. (a) Average efficiency and (b) efficiency ratio.
Fig. 13. Zoom performance. (a) Average efficiency and (b) efficiency ratio.
Fig. 14. Zoom and rotation performance. (a) Average efficiency and (b) efficiency ratio.
The plots designated (b) present the efficiency ratio ($E_R$), the triple product of the same normalized parameters (Eq. (8)). Note that the MSER methods have relatively good repeatability and computing time for all transformations, as observed in all (a) panels. However, their low numbers of correspondences make their $E_R$ drop to very low values in the (b) panels. On the other hand, the EBR and IBR detectors showed few correspondences and long operation times, making their $E_R$ values close to zero; indeed, EBR proved to be the slowest detector (Table 1).
4. Conclusions

The results of Table 1 show that multi-scale Harris–Hessian yields the highest number of correspondences, while Hessian–Laplace and multi-scale Hessian have better repeatability. The computational time varies widely among detectors, with MSER 2 being the fastest. The multi-scale Harris–Hessian detector is the most efficient for blur, illumination, and zoom-rotation changes (Figs. 9b, 10b, and 14b). Meanwhile, multi-scale Hessian and Hessian–Laplace are better for rotation, viewpoint, and zoom changes (Figs. 11b, 12b, and 13b). The information presented gives a reasonable indication of typical computational time, even though the timings are not for optimized code and change depending on the implementation; furthermore, the image content, hardware, and software should be considered. This work is useful for choosing a detection method according to the geometric or photometric transformations involved. Considering the number of correspondences, repeatability, and computing time, a relationship exists among them. The correct detector selection is
based on the specific application, its response time, and the required efficiency.

Acknowledgements

This work was supported by Universidad Autónoma de Querétaro and CONACYT.

References

[1] J. Fauqueur, N. Kingsbury, R. Anderson, Multiscale keypoint detection using the dual-tree complex wavelet transform, Int. Conf. Image Process. 1 (2006) 1625–1628.
[2] J. Matas, Robust wide-baseline stereo from maximally stable extremal regions, Image Vis. Comput. 22 (10) (2004) 761–767.
[3] K. Mikolajczyk, Feature detectors and descriptors: the state of the art and beyond, 2010. http://www.featurespace.org/
[4] K. Mikolajczyk, Personal homepage, 2010. http://lear.inrialpes.fr/people/mikolajczyk/
[5] K. Mikolajczyk, C. Schmid, An affine invariant interest point detector, in: 7th European Conference on Computer Vision (ECCV 2002), vol. 2350, Springer, Copenhagen, Denmark, May 2002, pp. 128–142.
[6] K. Mikolajczyk, C. Schmid, Scale & affine invariant interest point detectors, Int. J. Comput. Vis. 60 (1) (2004) 63–86.
[7] K. Mikolajczyk, C. Schmid, Performance evaluation of local descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1615–1630.
[8] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L.V. Gool, A comparison of affine region detectors, Int. J. Comput. Vis. 65 (1–2) (2005) 43–72.
[9] J.N. Ouellet, H. Patrick, ASN: image keypoint detection from adaptive shape neighborhood, Image, Rochester, NY, 2008, pp. 454–467.
[10] C. Schmid, R. Mohr, C. Bauckhage, Evaluation of interest point detectors, Int. J. Comput. Vis. 37 (2) (2000) 151–172.
[11] Visual Geometry Group, Katholieke Universiteit Leuven, INRIA, and the Center for Machine Perception, Affine Covariant Region Detectors, 2010. http://www.robots.ox.ac.uk/
[12] T. Tuytelaars, C.H. Lampert, M.B. Blaschko, W. Buntine, Unsupervised object discovery: a comparison, Int. J. Comput. Vis. 88 (2) (2009) 284–302.