Digital Signal Processing 24 (2014) 124–134
Context-aware Discriminative Vocabulary Tree Learning for mobile landmark recognition

Zhen Li*, Kim-Hui Yap

School of Electrical and Electronic Engineering, Nanyang Technological University, 639798, Singapore
Article history: Available online 18 September 2013

Keywords: Mobile landmark recognition; Content and context integration; Weighted hierarchical clustering
Abstract

Recently, mobile landmark recognition has become one of the emerging applications in mobile media, offering landmark information and e-commerce opportunities to both mobile users and business owners. Existing mobile landmark recognition techniques mainly use GPS (Global Positioning System) location information to obtain a shortlist of database landmark images near the query image, followed by visual content analysis within the shortlist. This is insufficient since (i) GPS data often has large errors in dense built-up areas, and (ii) direction data that can be acquired from mobile devices is underutilized for further improving recognition. In this paper, we propose to integrate content and context in an effective and efficient vocabulary tree framework. Specifically, visual content and two types of mobile context, location and direction, are integrated by the proposed Context-aware Discriminative Vocabulary Tree Learning (CDVTL) algorithm. The experimental results show that the proposed mobile landmark recognition method outperforms the state-of-the-art methods by about 6%, 21% and 13% on the NTU Landmark-50, PKU Landmark-198 and the large-scale San Francisco landmark datasets, respectively.

© 2013 Elsevier Inc. All rights reserved.
1. Introduction

Recent years have witnessed phenomenal growth in the usage of mobile devices. Nowadays, most mobile phones in use have a built-in camera. The camera and network connectivity of current mobile phones make it increasingly appealing for users to snap pictures of landmarks and obtain relevant information about them. This is particularly useful in applications such as mobile landmark identification for tourist guides, mobile shopping, and mobile image annotation. An illustration of a mobile recognition system for tourist guidance is shown in Fig. 1. A mobile user captures an image of a landmark that he or she knows nothing about, uploads it to the server and, in a moment, the related information about the captured image is returned, e.g. the landmark name, its location on the map, a description of the landmark, recent events, etc. The returned shortlist contains the best-matched images; when multiple images correspond to the same category, only the best-matched one is shown. In this way, the user will probably find the correct landmark category by simply scrolling the screen a few times. Landmark recognition (e.g. [1-3]) is closely related to the landmark classification [4] and landmark search [5] problems, and can be broadly categorized into non-mobile and mobile landmark recognition systems.
* Corresponding author. E-mail addresses: [email protected] (Z. Li), [email protected] (K.-H. Yap).
In recent years, a number of non-mobile landmark vision systems have been developed [6,4], which focus on learning a statistical model that maps image content features to classification labels. In [6], a scalable vocabulary tree (SVT) is generated by hierarchically clustering local descriptors as follows: (1) an initial k-means clustering is performed on the local descriptors of the training data to obtain the first-level clusters; (2) the same clustering process is applied recursively to each cluster at the current level. This leads to the formation of a hierarchical k-means clustering structure, also known as a vocabulary tree (VT) in this context. Based on the VT, high-dimensional histograms of image local descriptors can be generated, which enables efficient image recognition. In [4], non-visual information such as textual tags and temporal constraints is utilized to improve the performance of content analysis. However, these landmark recognition methods underutilize the unique features of mobile devices, such as location, direction and time information. Such context information can be easily acquired from mobile devices and can further enhance the performance of content analysis without introducing much additional computational cost. In the mobile landmark recognition scenario, utilizing either content or context analysis alone is suboptimal. GPS location can help discriminate visually similar landmark categories that are far away from each other. However, location information should not be used alone due to GPS error. As such, the context information must be used in conjunction with visual content analysis in mobile landmark recognition.
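To make the recursive clustering concrete, the following is a minimal Python sketch of SVT construction, assuming scikit-learn's KMeans; the function name and its parameters are illustrative, not the implementation used in [6].

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary_tree(descriptors, branch=10, depth=5):
    """Recursively cluster local descriptors into a branch-way tree of
    the given depth (hierarchical k-means)."""
    node = {"center": descriptors.mean(axis=0), "children": []}
    if depth == 0 or len(descriptors) < branch:
        return node                      # leaf: max depth or too few points
    km = KMeans(n_clusters=branch, n_init=3).fit(descriptors)
    for k in range(branch):
        child = build_vocabulary_tree(descriptors[km.labels_ == k],
                                      branch, depth - 1)
        child["center"] = km.cluster_centers_[k]
        node["children"].append(child)
    return node

# Example: a depth-2, 4-way tree over 1000 random 128-D "SIFT" descriptors
tree = build_vocabulary_tree(np.random.rand(1000, 128), branch=4, depth=2)
```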
Fig. 1. Illustration of mobile landmark recognition prototype.
Fig. 2. Proposed framework for mobile landmark recognition.
In order to further enhance the performance, it is therefore imperative to combine context information with content analysis in mobile systems. In [7], the content of mobile images is analyzed based on local descriptors, and user interaction is considered as a source of context information. In [8-11,1], GPS location information is utilized to assist content-based mobile image recognition. In these methods, the content analysis is essentially filtered by a pre-defined area centered at the logged GPS location of the query image. With the aid of the location information, the challenge of differentiating similar images captured in different areas can be reduced substantially. Recent works [5,12] demonstrate that GPS can be utilized to improve SVT-based mobile landmark recognition [6]. In these systems, candidate images for the query are obtained from the landmark image database using GPS information, by selecting those landmark images located close to the mobile device at the time the query image was taken. Visual content analysis is then performed within the GPS-based shortlist for the final recognition. This approach makes it easier to differentiate visually similar images of different landmark categories that are far away from each other. However, performing content analysis within a GPS-based shortlist still has drawbacks that make it non-ideal in real applications: (1) it is difficult to determine the optimal search radius, since the GPS error of the captured image may vary significantly from open areas (about 30 meters) to dense built-up areas (up to 150 meters); if the GPS error is too large, the correct landmark category may be filtered out, resulting in wrong recognition; and (2) direction information, which can be easily acquired from the digital compass on mobile devices, is not utilized to further improve the recognition performance. In order to avoid performing content analysis within a GPS-based shortlist, as most existing mobile recognition systems do, and to retain the flexibility of incorporating different types of context, we propose to integrate various content and context information by Context-aware Discriminative Vocabulary Tree Learning (CDVTL). Compared with the well-known approach of using an SVT built upon SIFT descriptors [6], we improve the recognition performance by incorporating various context descriptors. The context information adopted in this work includes location (obtained through built-in GPS or WiFi localization) and direction (obtained from the digital compass).

2. Overview of the proposed mobile landmark recognition

In this paper, we present a context-aware mobile landmark recognition framework, shown in Fig. 2, which enables efficient mobile landmark recognition in both context-aware and context-agnostic environments. In Fig. 2, the application on the mobile device provides the user interface, which uploads the photo along with its context
Fig. 3. Typical images from NTU landmark dataset. (a) Administrative building, (b) Nanyang auditorium, (c) Cafe express.
information such as GPS and direction to the server, while all feature extraction and recognition tasks are processed on the server side. In this way, both feature extraction and recognition are fast, and the efficiency is independent of the computational capacities of different mobile devices. Battery energy is also saved by avoiding computation on the mobile device. Moreover, this architecture makes it easier to migrate the mobile image recognition across different mobile platforms. In this work, a prototype for context-aware mobile image recognition is developed and its efficiency is validated using the developed Android application. The recognition procedure differs from the traditional mobile image recognition pipeline, which confines the candidate images to a radius of several hundred meters centered at the location of the mobile device and then performs content-based recognition within the shortlist. That pre-filtering is less than ideal because it is difficult to determine the search range centered at the logged location, and the range should not be a constant. In this work, the pre-filtering by location is avoided; that is, image content descriptors and context information are pre-combined without the pre-filtering step. The proposed framework achieves superior recognition precision with low computational cost compared with state-of-the-art mobile vision systems in recognizing landmarks from university-scale to city-scale landmark datasets.

We also propose a Context-aware Discriminative Vocabulary Tree Learning (CDVTL) algorithm that can be embedded in the proposed framework. In landmark recognition scenarios, the location information can be more important within a region with a radius of hundreds of meters, but less reliable within a region of one hundred meters due to GPS error. In contrast, image content descriptors are usually more discriminative in small regions for landmark recognition. In addition, for regions of the same scale, the image content descriptors may not be equally discriminative, e.g. they are more discriminative in suburban areas but less discriminative in built-up areas. In view of these observations, for each codeword at each level of the hierarchical clustering when training the vocabulary tree, we introduce a weighting scheme that evaluates the relative discriminative power of the context information against the image content descriptors. Intuitively, a small region with visually distinct landmarks should yield more weight for the image descriptors and less weight for location, and vice versa. In this work, we use 128-D SIFT as the content descriptor, and 2-D GPS location (longitude and latitude) and 1-D direction as the context descriptors. The context descriptors are concatenated to each SIFT descriptor to form 130-D (only GPS available) or 131-D (both GPS and compass available) hybrid descriptors, where the SIFT, GPS and direction components have their respective weights.

In order to demonstrate the effectiveness of combining content and context features for image recognition, we randomly selected 60 images from the categories "Administrative building", "Nanyang auditorium" and "Cafe express" in the NTU landmark dataset. This is a typical subset for comparing different methods, although the classification
on this small-scale subset is relatively easier than on the whole dataset. Example images for the 3 landmarks are shown in Fig. 3(a)-(c), respectively; we refer to them as categories 1, 2 and 3. Notice that categories 1 and 2 are visually very similar but are different buildings far away from each other. In contrast, categories 2 and 3 are located nearby, sharing similar GPS coordinates, but are visually dissimilar. In Fig. 4(a)-(c), we show 2D features for the purpose of illustration, where different shapes indicate features extracted from different categories. In Fig. 4(a), the average pixel values in the R and B channels are adopted as the 2D content feature for each image and are clustered using K-means. Observing the Voronoi graph, the 2D content features are not well partitioned between categories 1 and 2 because the images are visually similar, while the content features from category 3 are well separated from categories 1 and 2. In Fig. 4(b), the 2D GPS coordinates of the images are clustered by K-means. Here, the 2D GPS features are not well partitioned between categories 2 and 3 because the images are geographically nearby, while the GPS features from category 1 are well separated. In Fig. 4(c), a 4D composite feature is formed by concatenating the 2D R-B feature and the 2D GPS coordinates, and is likewise clustered by K-means. To illustrate the clustering results, the 4D composite feature is shown only from the 2D geographical point of view in Fig. 4(c). For visualization, the feature points allocated to different clusters are marked with different colors. We can observe that the feature points are correctly grouped into their own categories, whereas in Fig. 4(a) and (b) the feature points are, overall, not well grouped. Therefore, by considering both the content and the context information, we obtain a better representation of the mobile image for the subsequent recognition.

Using K-means as the flat clustering method in each layer of hierarchical clustering, a VT is generated from the composite content-context image features. VTs are illustrated in Fig. 5, where the gray level of each node indicates the weight of the GPS component (the darker, the higher). Fig. 5(a) shows the traditional VT-based mobile landmark recognition [5,12], where the coarse clusters are partitioned by location first and the finer clusters are obtained solely from visual content descriptors. However, as shown in Fig. 4, this scheme is not optimal. In Fig. 5(b), GPS is considered together with the content feature in each cluster at each level, and the weight of each component is generally non-zero. For example, in coarse cluster I the visual content is important and location is less useful, so node I has a very low location weight; in cluster III the location weight is high since location can perfectly separate the patterns; and in cluster II the content and GPS weights are almost equal, so the node is gray. Notice that existing mobile vision systems based on VT matching are essentially a special case of content-context weighting in the proposed hybrid recognition framework.
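To make the hybrid-descriptor construction concrete, here is a minimal Python sketch of the concatenation step. The function name and the placeholder weights are ours for illustration; in CDVTL the weights are learned per tree node, as described in Section 3.

```python
import numpy as np

def make_hybrid_descriptors(sift, gps, direction=None,
                            w_sift=1.0, w_gps=1.0, w_dir=1.0):
    """Concatenate each 128-D SIFT descriptor of an image with the image's
    2-D GPS coordinate (and optional 1-D compass heading), scaling each
    component by its weight. The weights here are placeholders; CDVTL
    learns them node by node via Eq. (12)."""
    n = sift.shape[0]
    parts = [w_sift * sift, w_gps * np.tile(gps, (n, 1))]
    if direction is not None:
        parts.append(w_dir * np.full((n, 1), direction))
    return np.hstack(parts)              # n x 130 or n x 131

# Example: 200 SIFT descriptors from one image tagged with GPS and heading
hybrid = make_hybrid_descriptors(np.random.rand(200, 128),
                                 gps=np.array([103.68, 1.35]),
                                 direction=270.0)
```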
Fig. 4. Clustering of content feature, context feature and composite feature. (a) Clustering by content descriptors alone, (b) clustering by location coordinates alone, (c) clustering by hybrid RB–GPS descriptors with a proper weighting scheme.
3. Context-aware discriminative vocabulary tree learning

3.1. Related work

Towards scalable near-duplicate mobile landmark recognition, the Scalable Vocabulary Tree (SVT) model [6] is well exploited in state-of-the-art works [12,13,5,14]. In these works, the SVT is constructed within a region by filtering out the database images that are far away from the query image. An SVT $T$ is constructed by hierarchical clustering of local descriptors and consists of $C = B^L$ codewords, where $B$ is the branch factor and $L$ is the depth. Let each node $v^{l,h} \in T$ represent a visual codeword, where $l \in \{0, 1, \ldots, L\}$ indicates its level and $h \in \{1, 2, \ldots, B^{L-l}\}$ is the index of the node within its level. A query image $q$ is represented by a bag of $N_q$ local descriptors $X_q = \{x_{q,i}\}$, $i \in N_q$. Each $x_{q,i}$ is traversed in $T$ from the root to a leaf by finding the nearest node at every level, resulting in the quantization $T(x_i) = \{v_i^{l,h}\}_{l=1}^{L}$. Thus, a query image is eventually represented by a high-dimensional sparse Bag-of-Words (BoW) histogram $H_q = [h_q^1, \ldots, h_q^C]$ obtained by counting the occurrences of its descriptors on each node of $T$.

The database images are denoted by $\{d_m\}_{m=1}^{M}$. Following the same VT quantization procedure, the local descriptors $\{y_j\}$ in $d_m$
Fig. 5. VT nodes and the associated weights for location. (a) Nister’s method, (b) a better solution.
are mapped to a high-dimensional sparse BoW histogram $H_{d_m} = [h_{d_m}^1, \ldots, h_{d_m}^C]$. The images with the highest similarity score $\mathrm{sim}(q, d_m)$ between query $q$ and database image $d_m$ are returned as the retrieval set, where the similarity is defined in [6] as

$$\mathrm{sim}(H_q, H_{d_m}) = \left\| \frac{H_q \cdot w}{\| H_q \cdot w \|} - \frac{H_{d_m} \cdot w}{\| H_{d_m} \cdot w \|} \right\| \qquad (1)$$

where $w = [w_1, w_2, \ldots, w_C]$, $w_i = \ln \frac{M}{M_i}$, and $M_i$ is the number of images in the database with at least one descriptor vector path through node $i$. Since the vocabulary tree is very large, the count of a given image on a particular node is normally zero, so the histograms are sparse. Therefore, inverted index files attached to the leaf nodes allow a very efficient implementation of this voting procedure. In [6], scalability is addressed by comparing only those database images indexed by a non-zero codeword of the given query image.
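The quantization and scoring above can be sketched as follows in Python, building on the tree dictionary from the earlier sketch. The `index_nodes` helper (mapping tree paths to histogram bins) is our scaffolding, the greedy traversal follows the description above, and the L1 normalization is one of the norm choices reported in [6]; none of this is the reference implementation.

```python
import numpy as np

def index_nodes(tree, path=(), index=None):
    """Assign a histogram bin to every non-root node, keyed by its path."""
    if index is None:
        index = {}
    for k, child in enumerate(tree["children"]):
        index[path + (k,)] = len(index)
        index_nodes(child, path + (k,), index)
    return index

def quantize(tree, x, path=()):
    """Greedy root-to-leaf traversal: pick the nearest child at each level."""
    if not tree["children"]:
        return path
    k = int(np.argmin([np.linalg.norm(x - c["center"])
                       for c in tree["children"]]))
    return quantize(tree["children"][k], x, path + (k,))

def bow_histogram(tree, descriptors, index):
    """Count descriptor visits on every traversed node of the tree."""
    h = np.zeros(len(index))
    for x in descriptors:
        path = quantize(tree, x)
        for l in range(1, len(path) + 1):
            h[index[path[:l]]] += 1
    return h

def similarity(hq, hd, idf):
    """Eq. (1) with the L1 norm: distance between IDF-weighted, normalized
    histograms (smaller means more similar). idf holds w_i = ln(M / M_i)."""
    q, d = hq * idf, hd * idf
    return np.linalg.norm(q / np.abs(q).sum() - d / np.abs(d).sum(), 1)
```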
3.2. Node-wise content and context weighting

As mentioned in our contributions, for each VT node we introduce a weighting factor that evaluates the relative discriminative power of the context information against the image content descriptors. Given a set of training data $\{(x_i^l, t_i)\}$, $i = 1, \ldots, N$, $l = 1, \ldots, K$, $x_i^l \in \mathbb{R}^{D_l}$ represents the $i$-th vector of the $l$-th type of feature (SIFT descriptor, GPS location, direction, etc.), and $D_l$ is the dimensionality of the $l$-th type of feature. $t_i = [t_{i,1}, \ldots, t_{i,m}]^T$ is the multi-output label vector: if the original class label is $p$, the expected output vector is $t_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T$ with only the $p$-th element equal to one and the rest set to zero. Due to the nonlinear separability of these training data in the input space, in most cases one maps the training data $x_i^l$ from the input space to a feature space $Z$ through a nonlinear mapping $\phi: x \mapsto \phi(x)$. The distance between two different classes in the feature space $Z$ is $\frac{2}{\|w\|}$, where $w$ is the coefficient vector in the feature space. Maximizing the separating margin while minimizing the training errors $\varepsilon_i$ is then equivalent to

$$\min_{w, \varepsilon, \alpha, \eta} \; \frac{1}{2} \sum_{l=1}^{K} \|w^l\|^2 + C \cdot \frac{1}{2N} \sum_{l=1}^{K} \sum_{i=1}^{N} \alpha_l^2 \|\varepsilon_i^l\|^2 + \frac{D}{2} \eta^2 \qquad (2)$$

subject to

$$\phi(x_i^l)^T w^l = t_i^T - \varepsilon_i^l, \quad i = 1, 2, \ldots, N \qquad (3)$$

$$\mathbf{1}^T \alpha = 1 - \eta \qquad (4)$$

where $\varepsilon_i = [\varepsilon_{i,1}, \ldots, \varepsilon_{i,m}]^T$ is the training error vector with respect to the training sample $x_i^l$, $\alpha_l$ is the weight for the $l$-th type of descriptor, $\eta$ is the slack factor for the sum of weights, and $C$ is a user-specified parameter that provides a tradeoff between the between-class distance and the training error. In this work, we do not aim to tune $C$ for every dataset; in order to achieve better generalization we default to $C = 1$. The first two terms of the objective function balance error minimization against generalization. The additional third term is a regularizer requiring the weights to sum to unity, which restricts the search space and enables efficient optimization. Based on the KKT theorem [15], the above optimization is equivalent to solving the following dual optimization problem:

$$L = \frac{1}{2} \sum_{l=1}^{K} \|w^l\|^2 + C \cdot \frac{1}{2N} \sum_{l=1}^{K} \sum_{i=1}^{N} \alpha_l^2 \|\varepsilon_i^l\|^2 + \frac{D}{2} \eta^2 - \sum_{l=1}^{K} \sum_{i=1}^{N} \sum_{j=1}^{m} \beta_{i,j}^l \left( \phi(x_i^l)^T w_j^l - t_{i,j} + \varepsilon_{i,j}^l \right) - r \left( \mathbf{1}^T \alpha - 1 + \eta \right) \qquad (5)$$

The KKT optimality conditions are as follows:

$$\frac{\partial L}{\partial w_j^l} = 0 \;\rightarrow\; w_j^l = \sum_{i=1}^{N} \beta_{i,j}^l \phi(x_i^l) \;\rightarrow\; w^l = \Phi_l^T \beta^l \qquad (6)$$

$$\frac{\partial L}{\partial \varepsilon_i^l} = 0 \;\rightarrow\; \beta_i^l = C \cdot \frac{1}{N} \alpha_l^2 \varepsilon_i^l \;\rightarrow\; \beta^l = C \cdot \frac{1}{N} \alpha_l^2 \varepsilon^l \qquad (7)$$

$$\frac{\partial L}{\partial \beta_i^l} = 0 \;\rightarrow\; \Phi_l w^l - T + \varepsilon^l = 0 \qquad (8)$$

$$\frac{\partial L}{\partial \alpha_l} = 0 \;\rightarrow\; C \cdot \alpha_l \frac{1}{N} \sum_{i=1}^{N} \|\varepsilon_i^l\|^2 - r = 0 \;\rightarrow\; \alpha_l = \frac{r}{C \cdot \frac{1}{N} \|\varepsilon^l\|_2^2} \qquad (9)$$

where $\beta_i = [\beta_{i,1}, \ldots, \beta_{i,m}]^T$ and $\beta = [\beta_1, \ldots, \beta_N]^T$. By substituting (6) and (7) into (8), we have

$$w^l = \Phi_l^T \left( \frac{N I}{C \alpha_l^2} + \Phi_l \Phi_l^T \right)^{-1} T \qquad (10)$$

The output function is

$$Y^l = \Phi_l \cdot w^l = \Phi_l \Phi_l^T \left( \frac{N I}{C \alpha_l^2} + \Phi_l \Phi_l^T \right)^{-1} T \qquad (11)$$

Since we expect that $\mathbf{1}^T \alpha = 1$, from (9) and (8) we can calculate the normalized weighting factor $\alpha_l^*$ for the $l$-th type of feature:

$$\alpha_l^* = \frac{\alpha_l}{\sum_l \alpha_l} = \frac{1 / \|T - Y^l\|_2^2}{\sum_l 1 / \|T - Y^l\|_2^2} \qquad (12)$$
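The weighting in (12) can be sketched as follows, assuming the random sigmoid feature map defined in Eq. (22) below and, for a single pass, $\alpha_l = 1$ inside the regularizer (the closed form here is the $D \times D$ variant of Eq. (21) below). All names and the single-pass simplification are ours for illustration.

```python
import numpy as np

def sigmoid_features(X, d_out, rng):
    """Random sigmoid feature map phi(x), cf. Eq. (22): the parameters
    (a_i, b_i) are randomly generated, as in the paper."""
    A = rng.standard_normal((X.shape[1], d_out))
    b = rng.standard_normal(d_out)
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def feature_type_weights(X_by_type, T, C=1.0, d_out=200, seed=0):
    """Normalized weight alpha_l* per feature type, Eq. (12): a type whose
    regularized least-squares fit to the labels T leaves a smaller
    residual receives a larger weight."""
    rng = np.random.default_rng(seed)
    N = T.shape[0]
    inv_residuals = []
    for X in X_by_type:                  # e.g. [SIFT, GPS, direction]
        Phi = sigmoid_features(X, d_out, rng)
        # Y = Phi (N/C I + Phi^T Phi)^-1 Phi^T T, with alpha_l fixed to 1
        G = Phi.T @ Phi + (N / C) * np.eye(Phi.shape[1])
        Y = Phi @ np.linalg.solve(G, Phi.T @ T)
        inv_residuals.append(1.0 / np.linalg.norm(T - Y) ** 2)
    a = np.array(inv_residuals)
    return a / a.sum()                   # weights sum to one, Eq. (4)
```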
Different solutions to the above optimization can be obtained for efficiency on training sets of huge size. If the number of training data is very large, for example much larger than the dimensionality of the feature space, we have an alternative solution. From (6) and (7), we have

$$w^l = \Phi_l^T \beta^l = \frac{C \alpha_l^2}{N} \Phi_l^T \varepsilon^l \qquad (13)$$

and from (8),

$$\varepsilon^l = T - \Phi_l w^l \qquad (14)$$

Substituting (14) into (13) yields

$$\frac{N I}{C \alpha_l^2} w^l + \Phi_l^T \Phi_l w^l = \Phi_l^T T \qquad (15)$$

so that

$$w^l = \left( \frac{N I}{C \alpha_l^2} + \Phi_l^T \Phi_l \right)^{-1} \Phi_l^T T \qquad (16)$$

$$\varepsilon^l = T - \Phi_l \cdot w^l = \left( I - \Phi_l \left( \frac{N I}{C \alpha_l^2} + \Phi_l^T \Phi_l \right)^{-1} \Phi_l^T \right) T \qquad (17)$$

From (6) and (8), we also have

$$\Phi_l \Phi_l^T \beta^l + \frac{N I}{C \alpha_l^2} \beta^l - T = 0 \qquad (18)$$

$$\beta^l = \left( \Phi_l \Phi_l^T + \frac{N I}{C \alpha_l^2} \right)^{-1} T \qquad (19)$$

and then, substituting (19) into (6),

$$w^l = \Phi_l^T \left( \frac{N I}{C \alpha_l^2} + \Phi_l \Phi_l^T \right)^{-1} T \qquad (20)$$

and the predicted class label matrix is

$$Y^l = \Phi_l \cdot w^l = \Phi_l \left( \frac{N I}{C \alpha_l^2} + \Phi_l^T \Phi_l \right)^{-1} \Phi_l^T T \qquad (21)$$
We have $\phi(x) = [G(a_1, b_1, x), \ldots, G(a_D, b_D, x)]$, where $G(a, b, x)$ is a nonlinear piecewise continuous function. According to [16], almost all nonlinear piecewise continuous functions can be used for the feature mapping $\phi(x)$, so the parameters $\{(a_i, b_i)\}_{i=1}^{D}$ used in $\phi(x)$ can be very diversified; here they are randomly generated. In this work, we adopt the Sigmoid function:

$$G(a, b, x) = \frac{1}{1 + \exp(-(a \cdot x + b))} \qquad (22)$$

In order to achieve high computational efficiency when the size of the training data in the feature space $\Phi_l$ is very large, one may prefer solution (21) to reduce the computational cost; otherwise, one may use either solution (11) or (21).

3.3. Context-aware weighted hierarchical clustering

Under the proposed framework, a simple way to integrate the content and context information is to pre-combine the various types of descriptors (128-D SIFT, 2-D GPS and/or 1-D direction) into hybrid descriptors with equal weight for each descriptor type, and then apply the traditional hierarchical K-means algorithm directly to the hybrid descriptors as in [6]. This tentative strategy is expected to improve recognition relative to SVT matching based on content descriptors alone; however, it can be further improved by assigning an optimal weight to each type of descriptor. Therefore, the clustering algorithm has to consider a weighting vector associated with each hybrid content-context descriptor in the computation of the cluster centers. We develop a hierarchical clustering with a weighted K-means embedded as the subroutine clustering algorithm, composed of the following steps in Algorithm 1.

Algorithm 1 Context-aware Discriminative Vocabulary Tree Learning.
Input: Descriptors of database images and their labels: $Z = \{(x_m^p, y_m)\}$, where $x_m^p$ can be SIFT, GPS or direction descriptors of an image.
Output: The vocabulary tree $T$ consisting of $C = B^L$ visual codewords $v^{l,h} \in T$, where $l \in \{0, 1, \ldots, L\}$ indicates its level and $h \in \{1, 2, \ldots, B^{L-l}\}$ is its index within its level.
(1) Set the branch factor $B$ and the depth $L$; initialize a vocabulary tree using the traditional hierarchical K-means clustering; set the current level $l = L$.
(2) Apply the following weighted K-means procedure to the descriptors assigned to the current cluster:
• Calculate the weight $\alpha_l$ for the $l$-th type of feature using (12);
• Compute the cost function $E(V) = \sum_{i=1}^{n} \sum_{j=1}^{D} \sum_{v^{l,h} \in V} \mu_j (x_{i,j} - v_j^{l,h})^2$, where $D = \sum_{p=1}^{P} D_p$ is the sum of the dimensionalities of all types of features, and $\mu_j$ is the weight for each dimension: $\mu = [[\frac{\alpha_1}{D_1}, \ldots, \frac{\alpha_1}{D_1}]_{D_1 \times 1}, \ldots, [\frac{\alpha_P}{D_P}, \ldots, \frac{\alpha_P}{D_P}]_{D_P \times 1}]_{D \times 1}$;
• Repeat the pattern assignment and cluster-center update until convergence, using the weighted distance measure $d(x, v) = \sum_{j=1}^{D} \mu_j (x_j - v_j)^2$;
• Repeat the above clustering procedure within level $l$ until all $v^{l,h}$, $h = 1, \ldots, B^{L-l}$, have been partitioned into $B$ clusters;
• $l \leftarrow l - 1$. Repeat the clustering procedure at each level recursively until $l = 0$.
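A minimal sketch of the weighted K-means subroutine of Algorithm 1 follows, assuming plain Lloyd iterations under the weighted distance $d(x, v)$; the initialization and the fixed iteration count are simplifications for illustration, not the paper's implementation.

```python
import numpy as np

def dimension_weights(alphas, dims):
    """Build the per-dimension weight vector mu from per-type weights
    alpha_p and type dimensionalities D_p, as in Algorithm 1."""
    return np.concatenate([np.full(d, a / d) for a, d in zip(alphas, dims)])

def weighted_kmeans(X, B, mu, iters=20, seed=0):
    """Lloyd's algorithm under the weighted distance
    d(x, v) = sum_j mu_j (x_j - v_j)^2."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), B, replace=False)]
    for _ in range(iters):
        # squared weighted distances, shape (n_points, B)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2 * mu).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(B):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

# Example: 128-D SIFT + 2-D GPS + 1-D direction with learned type weights
mu = dimension_weights(alphas=[0.6, 0.3, 0.1], dims=[128, 2, 1])
X = np.random.rand(500, 131)             # placeholder hybrid descriptors
centers, labels = weighted_kmeans(X, B=10, mu=mu)
```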
4. Experimental results

4.1. Datasets
4.1.1. NTU landmark dataset

For the purpose of validating the effectiveness of the proposed schemes for mobile image recognition, we require a dataset of labeled and geo-tagged photos. We therefore constructed a landmark database called the NTU Landmark-50 dataset [1], consisting of 3622 training images and 534 testing images for 50 landmark categories from the campus of Nanyang Technological University (NTU), Singapore. We chose the NTU campus for collecting mobile images mainly because it facilitates: (i) collecting a large number of images in a confined geographic area for validation of the system in a domain-specific application, (ii) evaluating the system performance in terms of recognition accuracy and processing time, and (iii) testing the performance of incorporating direction information. The landmark images were captured using camera phones equipped with a GPS receiver and a digital compass. The volunteers were instructed to take photos around the NTU campus at their leisure, as if they were traveling in the campus with their own mobile cameras. Different image acquisition conditions were used when capturing the images, such as camera settings, capturing time, illumination, weather changes, and changes in viewpoint, scale and occlusion. Sample images of the NTU Landmark-50 dataset are given in Fig. 6(a), and the geospatial distribution of the landmarks is given in Fig. 6(b). From the figure, it can be seen that the landmarks spread across the whole campus of NTU, with certain areas having a higher concentration of landmarks. The numbers of training and testing images per landmark are mostly uniform: on average, there are 70 training images and 10 testing images per landmark.

4.1.2. PKU landmark dataset

We also test our proposed algorithms on the mobile landmark benchmark from the MPEG CDVS requirement subgroup [13], which
Fig. 6. Sample images and geospatial distribution of the images in NTU Landmark-50 dataset. (a) Sample images, (b) geospatial distribution of the image-wise locations.
Fig. 7. Sample images and geospatial distribution of the images in PKU Landmark-198 dataset. (a) Sample images, (b) geospatial distribution of the image-wise locations.
contains 13,179 geo-tagged scene photos organized into 198 landmark locations on the Peking University campus. We randomly selected 3290 images for training and the rest for testing. Sample images of the PKU Landmark-198 dataset are given in Fig. 7(a), and the geospatial distribution of the landmarks is given in Fig. 7(b).

4.1.3. San Francisco Landmark dataset

The San Francisco Landmark dataset [12] contains a database of 1.7 million images of buildings in San Francisco with ground-truth labels, geotags, and calibration data. The data was originally acquired by vehicle-mounted cameras with wide-angle lenses capturing spherical panoramic images. For all visible buildings in each panorama, a set of overlapping perspective images is generated. This dataset is rather challenging since it is a large-scale dataset consisting of images under various and sometimes poor acquisition conditions, such as severe illumination changes, occlusion and distortion, as well as various changes in viewpoint and scale. In addition, it has more than 6000 categories, and the 803 query images were taken with camera phones, which differs from the acquisition of the database images. Sample images of the San Francisco Landmark dataset are given in Fig. 8(a), and the geospatial distribution of the landmarks is given in Fig. 8(b).
4.2. Experiment settings

For effective image descriptor extraction, we compute dense-patch SIFT descriptors sampled on overlapping 16 × 16 pixel patches with 8-pixel spacing, as in [17], for all the compared algorithms on all the datasets. All images are resized to 320 × 240 resolution. For matching VT histograms between query images and database images, we adopt the intersection kernel [18] for its high efficiency. The proposed Context-aware Discriminative Vocabulary Tree Learning (CDVTL) algorithm is compared with state-of-the-art works including: the Scalable Vocabulary Tree based on content analysis (Nister's method) [6], the content-based SVT within the GPS shortlist (Girod's method) [5], and the hybrid content-context descriptor with equal weighting applied to the proposed framework. In this section, several parameters for Nister's method and Girod's method are first determined for the subsequent comparison, including the layout of the SIFT-based SVTs and the search radius R of the local area, centered at the GPS location of the captured query image, used to obtain the shortlist for Girod's method. By comparing the recognition rates of Nister's method [6] and Girod's method [5] as shown in Fig. 9, we can observe that for depths L greater than 5 the recognition rates improve slowly on both the NTU-50 and PKU-198 datasets. Thus we set the depth of the SIFT-based SVT to be
Fig. 8. Sample images and geospatial distribution of the images in San Francisco Landmark dataset. (a) Sample images, (b) geospatial distribution of the image-wise locations.
Fig. 9. Performance comparison in terms of different vocabulary tree layers on (a) NTU Landmark-50 dataset, (b) PKU Landmark-198 dataset.

Table 1
Performance of Girod's method when varying the radius on the NTU-50 dataset.

R (m)   No. of candidates   Accuracy (%)   Time (ms)
100     130                 87.5           35
200     310                 91.6           85
300     510                 90.8           140
400     700                 89.5           180
∞       3622                88.6           460
L = 5 as a tradeoff between the recognition rate and the computational cost for all datasets. For the context-based SVTs, since it is not necessary to partition the low-dimensional context descriptors into very fine bins, we set the SVT depth to 4 and 3 for the location and direction descriptors, respectively. In addition, the branch number of all types of SVTs is set to B = 10 [6] for all the datasets. The effect of varying the search radius R for landmark recognition with L = 5 using Girod's method is given in Table 1: reducing the radius reduces the average number of candidate images in the GPS-based shortlist and thus the computational cost of the content analysis. We notice that the recognition rate peaks at around 200 m and then starts to decrease, reflecting the fact that users normally capture a landmark image from within about 200 m of the landmark. However, using Girod's method, even with R set to 200 m, there is only a little recognition improvement over the baseline SVT approach on both the NTU and PKU datasets. For the San Francisco dataset, the performance peaks at about R = 300 m, but the improvement over the baseline SVT approach is still limited.

4.3. Performance evaluation

With the above parameters optimized for Nister's method and Girod's method, we further demonstrate the performance improvement of the proposed CDVTL method. Note that direction information is only available in the NTU-50 dataset, so there we construct an SVT based on the 131-D hybrid SIFT-GPS-direction descriptors. For the other datasets, since only GPS is available as context information, we construct SVTs based on the 130-D hybrid SIFT-GPS descriptors. We randomly choose about one tenth of the hybrid content-context descriptors from the database images to construct an SVT. For the San Francisco dataset, since there are over 1 million images covering the whole city, we use Girod's method to reduce the search range to a radius of 300 m, and then apply the proposed CDVTL algorithm within the shortlist. Using the proposed CDVTL algorithm, the recognition rates versus the layer of the SVT (i.e. L) are shown in Fig. 10. We can notice that, by integrating SIFT and GPS descriptors for SVT construction with the proposed weighting scheme, we significantly improve the recognition performance over Girod's method as well as the other algorithms.
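As an illustration of the GPS shortlist used in the first stage for the San Francisco dataset, the following sketch filters database images by great-circle distance; the haversine formula and all names are our assumptions for illustration, not the implementation of [5] or [12].

```python
import numpy as np

def gps_shortlist(query_ll, db_ll, radius_m=300.0):
    """Return indices of database images whose geotag lies within
    radius_m of the query's (lat, lon), using the haversine distance."""
    R = 6371000.0                         # mean Earth radius in meters
    lat1, lon1 = np.radians(query_ll)
    lat2, lon2 = np.radians(db_ll).T      # db_ll: (M, 2) array of (lat, lon)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    dist = 2 * R * np.arcsin(np.sqrt(a))
    return np.where(dist <= radius_m)[0]
```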
Fig. 10. Comparison of recognition accuracy versus the layer of the SVT. (a) NTU-50 dataset, (b) PKU-198 dataset, (c) San Francisco dataset.

Table 2
Classification rates of different methods in different context scenarios.

Algorithms                     NTU     PKU     San Francisco
Nister et al. [6]              0.871   0.551   0.122
Girod et al. [5]               0.890   0.593   0.673
CDVTL (SIFT&GPS)               0.947   0.809   0.802
CDVTL (SIFT&GPS&direction)     0.953   –       –
The recognition results of Nister's method, Girod's method and the proposed CDVTL method in different context scenarios are shown in Table 2, where the adopted context sources are indicated in brackets. It is evident that the proposed CDVTL method outperforms Nister's method and Girod's method by combining SIFT-based and GPS-based SVTs. For the NTU-50 dataset, incorporating the additional direction information improves the performance of the proposed CDVTL by about 0.6% compared with incorporating GPS information only. Since the proposed framework has the flexibility to incorporate any type of context information, including GPS and direction, we test the proposed CDVTL approach progressively: baseline SVT without context information, baseline SVT with GPS, and baseline SVT with GPS and direction. The comparison of these progressive configurations in terms of the top n recognized categories is shown in Table 3 for the NTU dataset and in Tables 4 and 5 for the PKU and San Francisco datasets. From Tables 3-5, we
can observe that incorporating the context information progressively improves the recognition performance.

Table 3
Recognition rates by CDVTL with various context scenarios on the NTU dataset.

Information available      top 1   top 3   top 5
SIFT                       0.871   0.913   0.925
SIFT + GPS                 0.947   0.954   0.958
SIFT + GPS + direction     0.953   0.959   0.961

Table 4
Recognition rates by CDVTL with various context scenarios on the PKU dataset.

Information available      top 1   top 3   top 5
SIFT                       0.551   0.634   0.652
SIFT + GPS                 0.809   0.851   0.872

Table 5
Recognition rates by CDVTL with various context scenarios on the San Francisco dataset.

Information available      top 1   top 3   top 5
SIFT                       0.122   0.136   0.143
SIFT + GPS                 0.810   0.822   0.833

Typical cases of the improved recognition are shown in Fig. 11, where the top 3 matched categories are shown for each method. The top categories are obtained by finding the first categories in
the top recognized image list. In this way, the user will probably find the correct landmark category among the top recognized categories. In Fig. 11(a) and (b), a query image of "Nanyang Lake" is tested to compare the proposed method with Nister's method. Using Nister's method, which only utilizes content information, the first and third recognized images in the returned list are actually located far away from where the query image was taken. Using the proposed CDVTL method, which incorporates location and direction information, the recognized images in the returned list are all well matched to the query image. In Fig. 11(c) and (d), a query image of "Wells Fargo" is tested to compare the proposed method with Girod's method. Since Girod's method utilizes GPS information to obtain candidate images located near the query image and then performs content-based recognition within them, the recognized images in the returned list are located near the query image. However, the recognition result can be further improved by assigning optimal weights to the SIFT descriptors and GPS locations using the proposed CDVTL algorithm: we first filter out database images located far away and then apply the CDVTL algorithm within the shortlist, resulting in much better recognized images.

4.4. Computational cost

As to the computational cost, the combination of content and context analysis needs more time than using content or context analysis alone. However, since the mobile user will normally not be sure whether using content or context information alone is better,
Fig. 11. A typical case of top matched category list. (a) NTU query using Nister’s method, (b) NTU query using the proposed method, (c) San Francisco query using Girod’s method, (d) San Francisco query using the proposed method.
Table 6
Average computational cost (s).

Algorithms                     NTU     PKU     San Francisco
Nister et al. [6]              0.460   0.538   –
Girod et al. [5]               0.085   0.085   2.852
CDVTL (SIFT&GPS)               0.472   0.540   2.882
CDVTL (SIFT&GPS&direction)     0.473   0.541   2.889
the proposed adaptive content and context combination approach is more reliable, effective and convenient. The computational cost is shown in Table 6. The proposed recognition scheme shows a dramatic improvement over Nister's baseline SVT approach, while the increase in recognition time per image is a trivial 0.013 s. Using Girod's method, the recognition time is greatly reduced by filtering out, in the first stage, the database images located far away from the query image; however, the proposed scheme improves the recognition rate over Girod's method by about 6%-21%, which is significant. For the San Francisco dataset, since we use Girod's scheme in the first stage and then apply the proposed CDVTL scheme, the recognition time is only slightly higher than that of Girod's scheme, while traditional content analysis alone does not return a result within one hour. We can therefore conclude that the proposed recognition scheme has good practical utility in landmark recognition systems.

The computational cost includes both the feature extraction time and the recognition by combining content and context information. The feature extraction step is rather efficient: we can extract the SIFT descriptors within 0.2 s and the context information within 0.002 s for each query image, so the recognition time dominates the total online computational cost. The hardware setting is as follows: Windows XP, P4 2.66 GHz CPU, and 4 GB RAM. The current implementation is carried out mainly in MATLAB 2009a, and we can expect it to be accelerated several times by a C implementation. The proposed method is therefore affordable in real applications.

5. Conclusions

We proposed a new mobile landmark recognition framework that integrates various mobile content and context information by Context-aware Discriminative Vocabulary Tree Learning (CDVTL). Compared with content analysis over a GPS shortlist, as adopted in most existing mobile landmark recognition systems, the proposed CDVTL scheme significantly improves the recognition performance while introducing little additional computational cost. In addition, the proposed content and context integration framework has the flexibility to incorporate any type of content and context information available on the mobile device. Future work may include integrating other context information (e.g. time) to further improve the recognition performance.

References
[1] K. Yap, T. Chen, Z. Li, K. Wu, A comparative study of mobile-based landmark recognition techniques, IEEE Intell. Syst. 25 (1) (2010) 48–57. [2] T. Chen, K. Wu, K. Yap, Z. Li, F. Tsai, A survey on mobile landmark recognition for information retrieval, in: International Conference on Mobile Data Management: Systems, Services and Middleware, 2009, pp. 625–630. [3] T. Chen, Z. Li, K. Yap, et al., A multi-scale learning approach for landmark recognition using mobile devices, in: International Conference on Information, Communications and Signal Processing, 2009, pp. 1–4. [4] Y. Li, D. Crandall, D. Huttenlocher, Landmark classification in large-scale image collections, in: IEEE 12th International Conference on Computer Vision, 2009, pp. 1957–1964. [5] B. Girod, V. Chandrasekhar, D. Chen, et al., Mobile visual search, IEEE Signal Process. Mag. 28 (4) (2011) 61–76. [6] D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 2161–2168. [7] Y. Li, J. Lim, Outdoor place recognition using compact local descriptors and multiple queries with user verification, in: Proceedings of the 15th International Conference on ACM Multimedia, 2007, pp. 549–552. [8] G. Fritz, C. Seifert, L. Paletta, A mobile vision system for urban detection with informative local descriptors, in: IEEE International Conference on Computer Vision Systems, 2006, pp. 30–38. [9] C. Tsai, A. Qamra, E. Chang, Y. Wang, Extent: Inferring image metadata from context and content, in: IEEE International Conference on Multimedia and Expo, 2005, pp. 1270–1273. [10] N. O’Hare, C. Gurrin, G. Jones, A. Smeaton, Combination of content analysis and context features for digital photograph retrieval, in: The 2nd European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, 2005, pp. 323–328. [11] J. Lim, Y. Li, Y. You, J. Chevallet, Scene recognition with camera phones for tourist information access, in: IEEE International Conference on Multimedia and Expo, 2007, pp. 100–103. [12] D. Chen, G. Baatz, K. Koser, et al., City-scale landmark identification on mobile devices, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 737–744. [13] R. Ji, L. Duan, J. Chen, et al., Pkubench: A context rich mobile visual search benchmark, in: 18th IEEE International Conference on Image Processing, 2011, pp. 2545–2548. [14] X. Wang, M. Yang, T. Cour, et al., Contextual weighting for vocabulary tree based image retrieval, in: IEEE International Conference on Computer Vision, 2011, pp. 209–216. [15] P. Gill, W. Murray, M. Wright, Practical Optimization, vol. 1, Academic Press, 1981. [16] G. Huang, L. Chen, Enhanced random search based incremental extreme learning machine, Neurocomput. 71 (16) (2008) 3460–3468. [17] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for classification of texture and object categories: A comprehensive study, Int. J. Comput. Vis. 73 (2) (2007) 213–238. [18] M. Swain, D. Ballard, Color indexing, Int. J. Comput. Vis. 7 (1) (1991) 11–32.
Zhen Li is working toward a PhD in electrical engineering at Nanyang Technological University, Singapore. His research interests include content- and context-based multimedia understanding and processing. He holds a master's degree in measurement technology and instrumentation from Harbin Institute of Technology, China.

Kim-Hui Yap is an associate professor in the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore. His main research interests include image and video processing, media content analysis, computer vision, and computational intelligence. He has a PhD in electrical engineering from the University of Sydney, Australia.