Neurocomputing 120 (2013) 277–286
Visual word coding based on difference maximization

Yu-Bin Yang a,*, Ling-Yan Pan a, Yang Gao a, Guang-Nan He a, Yao Zhang b

a State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
b Jin-Ling College, Nanjing University, Nanjing 210089, China
Article history: Received 25 December 2011; Received in revised form 10 June 2012; Accepted 7 August 2012; Available online 27 March 2013.
Abstract

Image classification is an important topic in computer vision, which becomes more and more challenging due to the rapid increase in the number and variety of images, as well as the geometric deformations and illumination variations present in image objects. The “Bag-of-Features” model, also known as the “Bag-of-Words” model, plays a fundamental and crucial role in generating efficient and discriminative image content representations from local descriptors such as visual words, making it widely used in solving image classification problems. However, because of the weak discriminative power and strong ambiguity of low-level visual features, the visual codebook, i.e., the set of visual words, generated in this model is usually over-complete and inconsistent in capturing image semantics. To address this issue, we propose in this paper a novel visual word coding algorithm based on a difference maximization technique to improve the generated codebook model. Instead of mapping an image feature vector to one or multiple nearest visual words, the proposed approach utilizes a group of the nearest and the farthest visual words together in the coding process. Consequently, the representative variations of different image features are well kept and strengthened, which significantly improves the discriminative power of the visual word descriptor. We examine the performance of our visual word coding model extensively on four standard real-world image datasets, demonstrating that it captures image semantic content more accurately and achieves superior classification performance.
Keywords: Bag of words; Visual words; Codebook; Difference maximization; Image classification
1. Introduction

Image classification is an important and challenging topic in computer vision, which can be directly applied to a variety of tasks including image understanding, scene classification, image retrieval, and image content filtering. However, due to the rapid increase in the number and variety of images, as well as the geometric deformations and illumination variations present in image objects, it becomes more and more difficult to accurately identify the correct semantic category of an image among diverse classes. To address this problem, many well-performing image classification techniques have been proposed, among which two models are most widely applied: (1) bag-of-features (BoF, also known as “bag-of-words”, BoW) [1], and (2) spatial pyramid matching (SPM) [2]. The bag-of-features methodology was first proposed for text document analysis and was later adopted in computer vision applications [1]. The BoF model uses a visual analogue of a textual “word”, constructed by quantizing low-level visual features such as color, shape, and texture. Recent studies have indicated that local features represented by bag-of-features are suitable for image classification tasks. In the BoF model, an image is
* Corresponding author. E-mail address: [email protected] (Y.-B. Yang).
http://dx.doi.org/10.1016/j.neucom.2012.08.050
firstly partitioned into local patches, from which low-level feature descriptors such as SIFT [3,4], HOG [5–7], SURF [8] or CENTRIST [9] are extracted. In this manner, an image is represented as a “bag” of local features. A set of visual code words, i.e., the codebook, is then constructed from numerous low-level features via clustering algorithms such as k-means or random trees. The next step, called the coding process, assigns all low-level features to the built codebook, after which an image is represented as a histogram of visual words. The final coding result is then fed to machine learning techniques, among which the SVM [10] is one of the most widely used classifiers owing to its strong generalization ability. However, the BoF model ignores the spatial information contained in an image and is thus incapable of capturing location-related features of an object. In this respect, SPM is a useful extension of the BoF model, which divides an image into increasingly finer spatial local patches and then concatenates the histograms of local features of all local patches. From the above discussion, it can be seen that the coding approach is crucial for guaranteeing a good mapping between the built codebook and the low-level features in order to achieve promising classification performance. Accordingly, there are already many extensions of the BoF model aiming to construct better codebooks [11,12] and better coding methods [13–16]. The common point of these achievements is that the
classification performance of the feature histograms mainly depends on the representative power of the visual words. However, because of the weak discriminative power and strong ambiguity of low-level visual features, image features with different high-level semantics are probably mapped onto the same visual words. This disadvantage makes the generated visual codebook, i.e., the set of visual words, usually over-complete [17] and inconsistent in capturing image semantics. To address the above issue, we propose in this paper a novel visual word coding algorithm based on a difference maximization technique to improve the generated codebook model. Instead of mapping an image feature vector strictly to one or multiple nearest visual words, as current coding methods usually do, the proposed approach utilizes a group of the nearest and the farthest visual words together in the coding process. This strategy greatly increases the spread of the features' discrepancies over the coded visual words; that is, even if image features belong to the same cluster center, their coding results are not necessarily the same. Consequently, the representative variations of different image features are well kept and strengthened, which significantly improves the discriminative power of the visual word descriptor. We examine the performance of our visual word coding model extensively on four standard real-world image datasets, demonstrating that it captures image semantic content more accurately and achieves superior classification performance. The rest of this paper is organized as follows. In Section 2, we discuss the related work on improving the codebook model. Section 3 presents the proposed visual word coding method based on difference maximization in detail. Section 4 illustrates and analyzes the extensive experimental results on four standard real-world image datasets. Afterwards, an in-depth discussion of the experimental results is presented in Section 5. Finally, concluding remarks are provided in Section 6.
2. Related work

As mentioned in Section 1, the codebook's quality and the coding approach itself are two fundamental factors that affect the final classification performance. Therefore, researchers have carried out extensive studies on both topics. Examples are as follows. Jurie et al. [11] described a scalable acceptance-radius clustering method based on the mean shift approach, associated with densely sampled local patches. The algorithm is able to generate better codebooks and significantly outperforms the traditional codebook model constructed using k-means techniques when applied to several popular image datasets. Wu et al. [12] demonstrated that the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance as a similarity measure in creating codebooks. As a consequence, the approach can be used to improve the generation of visual codebooks.

The coding result, represented as a histogram over visual codes, is fed to the subsequent machine learning process and thus directly influences the image classification accuracy. Intuitively, Csurka et al. [1] mapped each local feature onto its nearest code. This strategy may perform unfavorably because of the codebook's over-fitting nature or the large quantization error caused by hard assignment. Yang et al. [13] proposed a spatial pyramid matching approach based on SIFT sparse codes (ScSPM) for image classification. It used selective sparse coding instead of the traditional vector quantization (VQ) technique to extract salient properties of appearance descriptors of local image patches. Yu et al. [14] revealed that sparse coding is helpful for learning only if the codes are local. They presented a Local Coordinate Coding (LCC) approach which encouraged the coding to be local, and turned a difficult nonlinear learning problem into a simple global linear learning problem. Experimental results have shown that locality is more essential than sparsity. Motivated by LCC, Wang et al. [15] presented an effective coding scheme called Locality-constrained Linear Coding (LLC), which can be seen as a fast implementation of LCC. It first performs a k-nearest-neighbor (KNN) search and then solves a constrained least squares fitting problem, finally projecting each feature descriptor onto several similar visual words from a codebook. It overcomes the drawback of sparse coding, which neglects features' locality, and also speeds up the computation process. Gemert et al. [16] indicated that the traditional codebook model ignores the ambiguity of visual features, namely codeword uncertainty and codeword plausibility. They introduced an uncertainty modeling technique into the coding process using kernel density estimation and remarkably improved the categorization performance. Huang et al. [18] explored the relationships among the codes and constructed a codebook graph; the richer information contained in the graph helps achieve excellent performance stably and robustly.

Although the above studies are able to generate better visual codebooks than the basic BoF model, they still suffer from the weak discriminative power and strong ambiguity of low-level visual features. As we may notice, the existing approaches all deal with the coding process by computing the nearest code or codes to the low-level image features. Therefore, the classification performance of the feature histograms mainly depends on the representative power of the visual words. Nevertheless, by simply using low-level image features, we cannot ensure that image features with different high-level semantics are mapped onto different visual words. This disadvantage makes the generated visual codebook usually over-complete and inconsistent with the high-level semantics.

3. Visual word coding based on difference maximization
Constructing the bag-of-features representation of an image consists of the following steps: (1) detect local patches (regions/points of interest) automatically, (2) compute local descriptors over these patches, (3) quantize the local descriptors into “words” to construct the visual vocabulary, i.e., the codebook, and (4) count the occurrences of each visual word of the codebook in each image to build the bag-of-features (in fact, a histogram of words).

Let $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$ denote a set of D-dimensional local features extracted from an image, where $x_i$ stands for the i-th feature vector. Given a codebook with M clustered visual words, $B = [b_1, b_2, \ldots, b_M] \in \mathbb{R}^{D \times M}$, an image is then encoded by a set of codes $C = [c_1, c_2, \ldots, c_N]$, where the M-dimensional vector $c_i$ is the code for $x_i$.

3.1. Vector quantization

The traditional hard assignment coding method is generally implemented by solving the constrained problem in Eq. (1). The constraint ensures that exactly one component of the coding vector $c_i$ for $x_i$ equals 1 while all other components equal 0. The goal is to find the nearest code for each feature:

$$\arg\min_{C} \sum_{i=1}^{N} \|x_i - B c_i\|^2, \quad \text{s.t. } \forall i:\ c_i \ge 0,\ \|c_i\|_{\ell_0} = 1,\ \|c_i\|_{\ell_1} = 1. \qquad (1)$$
Many recent studies, e.g., Refs. [13,14], have shown that vector quantization suffers from computational complexity and large quantization error. Ref. [16] pointed out a further drawback: it may raise word uncertainty and word plausibility problems.
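For concreteness, a minimal sketch of this hard-assignment (VQ) coding is given below. It assumes the codebook and the local descriptors of one image are available as NumPy arrays (the names and shapes are our own illustrative choices), and it produces the histogram-of-words representation described above.

```python
import numpy as np

def vq_coding(X, B):
    """Hard-assignment (VQ) coding: map every feature to its single nearest visual word.

    X: (N, D) array of local descriptors of one image.
    B: (M, D) array holding the M visual words of the codebook.
    Returns an M-dimensional normalized histogram of visual word occurrences.
    """
    # Squared Euclidean distance between every feature and every word, shape (N, M).
    dists = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)                        # index of the nearest word per feature
    hist = np.bincount(nearest, minlength=B.shape[0])     # count occurrences per word
    return hist.astype(float) / max(len(X), 1)            # normalized histogram of words
```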
3.2. Locality-constrained linear coding

An alternative to hard assignment is Locality-constrained Linear Coding (LLC), which incorporates a locality constraint and takes the relationships between different visual words into consideration. It uses the following criterion:

$$\min_{C} \sum_{i=1}^{N} \|x_i - B c_i\|^2 + \lambda \|d_i \odot c_i\|^2, \quad \text{s.t. } \mathbf{1}^{\top} c_i = 1,\ \forall i \qquad (2)$$

where $\odot$ denotes element-wise multiplication, and $d_i$ gives a different degree of freedom to each visual word, proportional to the similarity between the input descriptor $x_i$ and that visual word:

$$d_i = \exp\!\left(\frac{\mathrm{dist}(x_i, B)}{s}\right). \qquad (3)$$

In Eq. (3), $\mathrm{dist}(x_i, B) = [\mathrm{dist}(x_i, b_1), \ldots, \mathrm{dist}(x_i, b_M)]^{\top}$, in which each element denotes the Euclidean distance between $x_i$ and $b_j$, and $s$ is a parameter that adjusts the decay speed of the weights. By subtracting $\max(\mathrm{dist}(x_i, B))$ from $\mathrm{dist}(x_i, B)$ so that every element becomes non-positive, the distance term is further normalized into $(0, 1]$. Meanwhile, the closer $x_i$ is to $b_j$, the smaller the corresponding entry of $d_i$ will be, so a higher weight is assigned to the visual word $b_j$. As the LLC solution has only a few significant values, it simply selects the k nearest neighboring words of each feature as the coding base $B_i$. Finally, an approximated LLC method is given as Eq. (4):

$$\min_{\tilde{C}} \sum_{i=1}^{N} \|x_i - \tilde{c}_i B_i\|^2, \quad \text{s.t. } \mathbf{1}^{\top} \tilde{c}_i = 1,\ \forall i. \qquad (4)$$
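To make the procedure concrete, the following sketch implements the approximated LLC coding of Eq. (4) for a single descriptor, using the shifted local covariance system described by Wang et al. [15]; the array shapes and the regularization constant are our own illustrative assumptions.

```python
import numpy as np

def llc_code(x, B, k=5, eps=1e-6):
    """Approximated LLC code for one descriptor x (shape (D,)) given codebook B (shape (M, D))."""
    M = B.shape[0]
    d2 = ((B - x) ** 2).sum(axis=1)        # squared Euclidean distances to all M words
    idx = np.argsort(d2)[:k]               # indices of the k nearest visual words
    Bi = B[idx]                            # local coding base, shape (k, D)
    z = Bi - x                             # shift the local base to the descriptor
    C = z @ z.T                            # local covariance matrix, shape (k, k)
    C += eps * np.trace(C) * np.eye(k)     # small regularization for numerical stability
    w = np.linalg.solve(C, np.ones(k))     # solve C w = 1
    w /= w.sum()                           # enforce the sum-to-one constraint of Eq. (4)
    code = np.zeros(M)
    code[idx] = w                          # scatter the local weights into the full M-dim code
    return code
```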
3.3. Difference maximization coding

Vector quantization makes the variations of the coding results, represented as histograms, depend mainly on the differences between visual words. In other words, features mapped onto the same visual word generally share the same code. In the LLC method, features having the same k nearest neighbors are probably assigned similar codes, which means that the discrimination relies on the codebook as well. Since both the visual words and the coding results suffer from the weak discriminative power and strong ambiguity of low-level visual features, it is difficult for them to support correct classification of different features. Fig. 1 illustrates this disadvantage clearly. As shown in Fig. 1, x_1, x_2, and x_3 are three different feature vectors, and they all belong to the same visual word b_1. As a result, in VQ or LLC they are all encoded with the same code. However, it can be seen from Fig. 1 that x_1 and x_2 lie on the same side of b_1 while x_3 lies on the other side. In particular, x_1 and x_2 are closer to b_2 and b_10, while x_3 is closer to b_7 and b_8. Therefore, they are in fact expected to be encoded as different visual codes. As for LLC, x_1 and x_2 share the same k nearest neighbors, so x_1's code is similar to x_2's in most cases. However, as we can see from Fig. 1, although x_1 and x_2 are both assigned to the same single or multiple visual words, there are still some discrepancies between them. In fact, the distance between x_1 and b_2 is smaller than that between x_1 and b_10, while the distance between x_2 and b_2 is larger than that between x_2 and b_10. Moreover, their distances to other visual words may differ greatly. Hence, in the VQ and LLC methods, two features belonging to the same cluster center cannot be guaranteed to be similar at the semantic level, as these methods assume (we will discuss this further in Section 4). For instance, in Fig. 1, x_1 may be a feature for a region containing an object like “sky”, and x_2 may be a
Fig. 1. The distributions between features and visual codebook.
feature for a region containing an object like “sea”. They are similar in terms of low-level features, but deliver different high-level semantics, and the currently available visual word coding methods fail to handle this crucial characteristic. Aiming to solve this problem effectively, we propose a coding method based on difference maximization to depict the differences between features more accurately. It further distinguishes features belonging to the same cluster center, and makes sure that two features share the same visual word if and only if they are actually close enough to each other in the feature space. To maximize the representative difference of the histograms, which reflects the distinctions between different features, the proposed difference maximization coding method utilizes a group of the nearest and the farthest visual words together in the coding process, rather than simply mapping an image feature vector onto single or multiple nearest visual words. In fact, it combines both the nearest coding idea and the farthest coding idea; therefore, the method can also be called Nearest and Farthest Coding. The nearest coding part is similar to the LLC method, ensuring that similar patches have similar codes. However, it is not robust enough to guarantee that different features are mapped onto distinctive visual words. Therefore, we incorporate the farthest coding part, which considers visual words far from the features, in order to exploit the information reflecting the distribution of features relative to the visual words. Suppose the nearest coding result is denoted as NC and the farthest coding result as FC. The final coding result is then simply defined as the sum of both, i.e., RC = NC + FC. Considering the analytical solution available for the regularization term $\|d_i \odot c_i\|^2$ with $d_i$ given by Eq. (3), we adopt it in our nearest coding process. Thus, NC is expressed as follows:
$$NC = \min_{C} \sum_{i=1}^{N} \|x_i - B c_i\|^2 + \lambda \|d_i \odot c_i\|^2, \quad \text{s.t. } \mathbf{1}^{\top} c_i = 1,\ \forall i \qquad (5)$$

where $d_i$ follows the same form as Eq. (3). As suggested in the LLC method, Eq. (5) also has an analytical solution, shown in Eq. (6):

$$\tilde{c}_i = \left(C_i + \lambda\,\mathrm{diag}(d)\right) \backslash \mathbf{1}. \qquad (6)$$
The nearest neighboring words play a significant role in our nearest coding part. The distance constraint term in Eq. (5) treats the nearest word as the most important factor and guarantees that the nearest word is assigned the highest weight. As mentioned before, features sharing the same nearest neighbors may still carry different high-level semantic information, so it is essential to handle such features specifically. Motivated by the fact that greater discrimination among the final coding results remarkably improves the classification accuracy, we also adopt the farthest neighboring words in our coding process. The farthest coding part is given by Eq. (7):

$$FC = \min_{\hat{C}} \sum_{i=1}^{N} \|x_i - B \hat{c}_i\|^2 + \hat{\lambda} \|\hat{d}_i \odot \hat{c}_i\|^2, \quad \text{s.t. } \mathbf{1}^{\top} \hat{c}_i = 1,\ \forall i \qquad (7)$$

where $\hat{d}_i = \frac{1}{\sqrt{2\pi}\,\hat{s}} \exp\!\left(-\frac{\mathrm{dist}^2(x_i, B)}{2\hat{s}^2}\right)$. The most important difference between Eqs. (5) and (7) lies in the formulation of the distance term, i.e., $d_i$ versus $\hat{d}_i$. As mentioned above, Eq. (5) directly adopts the distance form of Eq. (3), an exponential function whose value is proportional to the similarity between the input descriptor and each visual word. Eq. (7), in contrast, assumes that the Euclidean distances between features and visual words follow a Gaussian distribution, with a smoothing parameter $\hat{s}$ determining the degree of similarity between features and visual words. As a result, the visual word farthest from a feature obtains the lowest value of the Gaussian function and thus has a greater chance of being assigned a larger coefficient. Contrary to the nearest coding part, the distance constraint in farthest coding ensures that the farthest neighboring words participate in encoding the features. The coefficient $\lambda$ in nearest coding and $\hat{\lambda}$ in farthest coding are both constants controlling the relative importance of the distance penalty term. It is worth noting that the distance constraint terms play different roles in nearest coding and farthest coding: in the former, the constraint favors visual words close to the feature $x_i$, while in the latter it favors words that are as far from it as possible.

Although both the nearest and the farthest coding strategies are adopted, the number of visual words participating in the coding process is still small. Therefore, we can simply select these important visual words as the base vectors for each feature descriptor, just as the approximated LLC method does [15]. We utilize an even simpler but effective approach to implement the coding procedure: our approximated method directly chooses a set of the nearest neighbors and the farthest neighbors as the coding result. Specifically, for each feature $x_i$, let $N(x_i)$ denote the set of its nearest neighbors and $F(x_i)$ the set of its farthest neighbors. Then, the distribution of each word b in an image can be approximately computed as:
$$RC(b) = \sum_{i=1}^{N} \begin{cases} 1, & \text{if } b \in N(x_i) \text{ or } b \in F(x_i), \\ 0, & \text{otherwise.} \end{cases} \qquad (8)$$
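A minimal sketch of this approximated nearest-and-farthest coding is shown below; the function and parameter names (k_near, k_far) are our own, and plain Euclidean distances to the codebook are assumed, as throughout this section.

```python
import numpy as np

def nearest_farthest_coding(X, B, k_near=2, k_far=2):
    """Approximated difference-maximization coding of Eq. (8).

    For every descriptor, the k_near nearest and the k_far farthest visual
    words are incremented in the image-level histogram of words.
    X: (N, D) descriptors of one image; B: (M, D) codebook.
    """
    hist = np.zeros(B.shape[0])
    for x in X:
        d2 = ((B - x) ** 2).sum(axis=1)   # distances from x to all M visual words
        order = np.argsort(d2)            # ascending order, O(M log M) per descriptor
        hist[order[:k_near]] += 1         # N(x): the nearest neighboring words
        if k_far > 0:
            hist[order[-k_far:]] += 1     # F(x): the farthest neighboring words
    return hist
```

Sorting the M distances of each descriptor corresponds to the O(M log M) cost mentioned below.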
The image classification accuracy varies as the numbers of the nearest and the farthest points change, and the optimal numbers can be identified where this variation stabilizes; we discuss this further in Section 5. Combining the nearest and farthest coding strategies enhances the variations among feature descriptors, and thus improves the performance of the SVM classifier that is usually used to complete the image classification task. Moreover, this combination is straightforward and simple to implement: all we need to do is sort the distances between a feature and all visual words, and then select the nearest and the farthest points directly. Using the quicksort algorithm, the computational complexity is only O(M log M), where M is the size of the codebook. In the following section, we
will validate the proposed approach by designing and carrying out a series of experiments extensively.
4. Experimental results

In this section we demonstrate the effectiveness and efficiency of the visual word coding method based on difference maximization by conducting substantial experiments on four widely used datasets: the 15 class scene dataset, the eight class sports events dataset, Caltech-101, and Caltech-256. The former two datasets focus on particular scenes and are more complicated; the latter two contain more semantic objects, with large variations in viewing angle and scale.
4.1. Experimental setup

In order to obtain stable experimental results, the setup of our experiments is identical to that of other studies. In each experiment, we randomly split the images into a training set and a testing set. This is repeated for 10 runs. In each run, the training set is used to create a codebook, which is then adopted to encode the remaining testing set. We employ k-means++ [19] as the clustering algorithm to construct the codebook; it is an improvement of k-means that helps avoid poor local optima. The classifier used in the machine learning step is an SVM similar to the one used in Ref. [12], with the Histogram Intersection Kernel as the kernel. In particular, we adopt the 1-vs-1 strategy with LIBSVM [20] when the number of categories is small, and choose BSVM [21] when there are more classes. The final accuracy is the average over all 10 runs.

In terms of image features, different feature extraction methods are used for different image datasets, following the setup of Wu et al. [12]. For example, CENTRIST is suitable for scene datasets, and SIFT is better for object recognition. Therefore, we use the CENTRIST feature for the 15 class scene and eight class sports events datasets, and SIFT for the other two. All SIFT features are extracted from 16 × 16 pixel patches with a step size of eight pixels. We use M = 200 or M = 400 to generate 200 or 400 visual words, respectively. The larger the codebook, the more effective the representation, but also the more complex the computation. In this paper, we compare the results obtained with a 200-word codebook and with a 400-word one. We also adopt the spatial hierarchy of Ref. [12] to take advantage of spatial information, as illustrated in Fig. 2. In order to further improve the consistency of image features, we extract features and encode both the original image and its Sobel gradient image. The coding histograms of all sub-regions are finally concatenated. Thus, when the codebook size is 200, the histogram has 31 × 200 × 2 = 12,400 dimensions in total; when the size is 400, it is a 31 × 400 × 2 = 24,800 dimensional histogram. As for the constraint terms in Eqs. (5) and (7), it is currently difficult to theoretically derive the optimal numbers of the nearest and the farthest code words, but they are easy to determine through several runs of experiments.
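As an illustration of this setup, the sketch below builds a codebook with k-means++ (here via scikit-learn, our assumed implementation choice) and checks the dimensionality of the concatenated spatial histogram; the 31 sub-regions and the factor of 2 for the Sobel gradient image follow the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, M=200, seed=0):
    """Cluster pooled training descriptors (num_descriptors, D) into M visual words."""
    km = KMeans(n_clusters=M, init="k-means++", n_init=10, random_state=seed)
    km.fit(train_descriptors)
    return km.cluster_centers_              # codebook B, shape (M, D)

# Dimensionality of the final concatenated representation used in the experiments:
n_regions = 31                              # sub-regions of the spatial hierarchy (Fig. 2)
n_channels = 2                              # original image + Sobel gradient image
for M in (200, 400):
    print(M, n_regions * M * n_channels)    # prints 12400 and 24800
```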
Fig. 2. Three levels of image patches.
4.2. Experiment 1: 15 class scene dataset

The 15 class scene dataset consists of 4,485 scene images belonging to 15 categories, including “forest”, “mountain”, “office”, etc. The number of images in each class ranges from 200 to 400. The parameters of the experiments are set strictly as in Refs. [2,12]. We test our algorithm by training on 100 images per class and testing on the rest, i.e., 1,500 images are used for training and 2,985 images for testing. We extract the CENTRIST feature and make use of the image spatial information as shown in Fig. 2. The number of nearest neighbors K_near and the number of farthest neighbors K_far are both set to 2.

Before performing the classification experiments, we first carry out a test on the 15 class scene dataset to validate the claim that low-level features possess weak discriminative power. Following the parameters mentioned above, we use the k-means++ method to create a codebook with 200 visual words. For each class, we then simply adopt hard assignment and plot the statistics of the visual word distribution, as shown in Fig. 3. From the figure, we can see that the features of each category are rather evenly distributed over the codebook, with no significant differences except for the seventh class. The explanation is that the seventh class is the “forest” scene and the objects it contains are quite limited. Hence, there is no prominent visual word that can either particularly describe a certain category or represent the image semantic content properly. Fig. 3 thus illustrates the ambiguity and weak discriminative power of the visual features, and also reflects, to some extent, the “semantic gap” problem in image retrieval. On the other hand, the even distribution also indicates that hard assignment is not very reasonable.

Furthermore, we perform image classification experiments on the 15 class scene dataset using our coding algorithm. Compared with the results obtained by currently available methods, our coding method based on difference maximization achieves the best performance. The comparisons between our classification accuracy and other algorithms' results (with a 400-word codebook) are shown in the second column of Table 1. In fact, given M = 200, the average accuracy of our approach is 85.5%, which is also higher than that of the other methods. Fig. 4 illustrates the experimental results of each class when M = 400. As shown in Fig. 4, the classification accuracies of several categories, such as “office” and “suburb”, are even close to 100%.
4.3. Experiment 2: eight class sports events dataset

The eight class sports events dataset introduced by Li et al. [25] is used to test the object recognition performance. Each class is a sport scene, varying from “rowing” to “polo”, with the number of images in each category ranging from 137 to 250. Following the same procedure as Li et al. [25] and Wu et al. [12], we randomly choose 70 images per class for training and 60 images per class for testing. This test is repeated 10 times and the average of all results is reported. The numbers of nearest and farthest neighbors are again set to 2. The final classification accuracy of our algorithm is 82.50% when the codebook size is 400. Li et al. achieved a rate of 73.4% with a complicated probabilistic graphical model and classification model. Wu's result is 84.21%, which is slightly higher than ours. However, they adopted a very complex model to construct their codebook, using the k-means clustering algorithm with the histogram intersection kernel to measure similarity, rather than the simple Euclidean distance adopted by ours. Given n features and k cluster centers, iterating k-means t times takes O(nkt) steps. In Wu's experiments, a table T was stored to speed up the computation when using the histogram intersection kernel to create the codebook.

Table 1
The comparisons of the average classification accuracy on different datasets.
Algorithm               15 scene (%)    Caltech-101 (%)    Caltech-256 (%)
Lazebnik et al. [2]     81.40           56.40              –
Wu et al. [12]          84.12           67.44              –
Yang et al. [13]        80.28           67.00              34.02
Wang et al. [15]        –               65.43              –
Griffin et al. [22]     –               59.00              34.10
Boiman et al. [23]      –               65.00              –
Jain et al. [24]        –               61.00              –
Gemert et al. [16]      76.70           –                  27.17
Our approach            86.03           68.61              35.56
Fig. 3. The distribution graph of features on the codebook.
Fig. 4. Classification accuracies on the 15 class scene dataset when M = 400 and K_near = K_far = 2.
The table T requires O(d·h_max) storage space and O(n·d·h_max) steps of pre-computation, where d is the dimension of a feature and h_max is the fixed maximum value of a feature descriptor. Therefore, the total computational complexity is O(nkt + n·d·h_max), and extra O(d·h_max) memory is needed to store the table T. Furthermore, Wu et al. [12] incorporated a one-class SVM to refine the codebook. Compared with Wu's method, our algorithm simply employs the k-means++ clustering method, an improvement of k-means, to generate the codebook, which reduces the computational complexity to O(nkt), with no need for extra storage space.

The classification accuracy of each class with a 400-word codebook is shown in Fig. 5. From the figure, it can be seen that five out of eight categories achieve accuracies above 80%, and the “Polo” category even reaches as high as 98.7%. This firmly validates the effectiveness and efficiency of the algorithm proposed in this paper.

4.4. Experiments 3 and 4: Caltech-101 and Caltech-256

Experiments 3 and 4 are carried out on Caltech-101 and Caltech-256, respectively, both of which contain a large number of diverse object categories. Caltech-101 holds 8,577 images in 101 classes with high shape variability. The number of images per category varies from 31 to 800. Each class shares the same kind of object, but viewed from varying perspectives. Caltech-256 contains 29,780 images in 256 classes, with higher intra-class variability and greater object location difficulty than Caltech-101. Each class has at least 80 images. We follow the same experimental setup as other studies on Caltech-101, randomly sampling 15 images per class for the training phase and 20 images per class for the testing phase. As for the Caltech-101
dataset, the number of nearest neighbors is set to 1 and no farthest neighbor is used in this case. We also repeat our algorithm 10 times and report the average result. Because of the large number of classes, it is not practical to show the classification accuracy of each class individually; therefore, we illustrate the accuracies of the 101 categories in the form of a curve, as shown in Fig. 6. It can be seen that the classification accuracies of 22 classes reach above 90%, and eight classes even achieve 100%. This shows that the method in this paper offers better performance than others under the same experimental conditions, as indicated in the third column of Table 1.

Similarly, we randomly sample 30 images from each class of the Caltech-256 dataset as the training set and 25 images as the testing set. This is repeated 10 times and the average classification result is reported. In particular, the result of each class when K_near = K_far = 2 is shown in Fig. 7, in which 15 categories have accuracies higher than 80% and three achieve 100%. Under the same experimental setup, comparisons with other methods are displayed in the last column of Table 1, which again illustrates that our method obtains the best performance.
5. Discussions

In our approach, we associate a group of the nearest visual words with the farthest ones to serve as the visual codes for every descriptor. We explore the distribution of distances between features and visual words when the codebook has a size of 200 or 400 on the 15 class scene dataset. In particular, we randomly select 100 feature descriptors from an image, but discard those on the edge in order to
Fig. 5. Classification accuracies on the eight class sports events dataset when M = 400 and K_near = K_far = 2.
Fig. 6. Classification accuracies on Caltech-101 when M = 400 and K_near = 1, K_far = 0.
Fig. 7. Classification accuracies on Caltech-256 when M = 400 and K_near = K_far = 2.
remove noise. Given a 200-word codebook, we compute the distances between each selected feature vector and every visual word to form a 200-dimensional Euclidean distance vector, which is then sorted in ascending order. The average of all sorted distance vectors is recorded. For a 400-word codebook, the same steps are followed. Fig. 8 shows the resulting distance distributions for a 200-word codebook and a 400-word codebook, respectively. The horizontal axis indicates the dimension of the distance vector, which equals the size of the codebook; the vertical axis indicates the distance between the feature and the visual words. As shown in Fig. 8, most of the distances differ very little, except for the nearest and the farthest code words.
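This distance profile can be reproduced with a few lines of code; a sketch, assuming the sampled descriptors and the learned codebook are available as NumPy arrays:

```python
import numpy as np

def mean_sorted_distance_profile(X, B):
    """Average ascending distance profile between sampled descriptors and a codebook.

    X: (S, D) sampled descriptors (e.g., about 100 non-edge patches of an image).
    B: (M, D) codebook. Returns an M-dimensional profile as plotted in Fig. 8.
    """
    dists = np.sqrt(((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))  # shape (S, M)
    return np.sort(dists, axis=1).mean(axis=0)   # sort each row ascending, then average
```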
It may therefore be very difficult to select the most representative visual words, because the distances are all so similar. This ambiguity of the visual words also raises the uncertainty problem [16]. The phenomenon further validates our viewpoint that the codebook is over-complete: the coding process merely assigns a feature to one or more code words, ignoring the relevance of other words that may be fairly close to the selected ones. Moreover, since the distances to many words are almost the same, it is hard to decide which visual words should be selected as the codes. Aiming at this problem caused by the densely distributed words, our nearest and farthest coding method pays more attention to the nearest and the farthest words and can effectively avoid it. Moreover, it is able to greatly raise the discriminative power among different words and produce significantly higher classification rates.

Another difficulty is how to define the optimal numbers of the nearest neighbors and the farthest neighbors. To deal with this, we carry out a number of experiments on the 15 class scene dataset to study the effect of different numbers of nearest and farthest neighbors on the corresponding classification accuracy. Suppose that K_near is the number of nearest neighbors and K_far is the number of farthest neighbors. The classification accuracies with a 200-word and a 400-word codebook are shown in Figs. 9 and 10, respectively. The number of nearest neighbors is the same within each sub-figure. The vertical axis refers to the classification accuracy and the horizontal axis represents the number of farthest neighbors, ranging from 0 to 10. The distributions in both figures show that the overall performance is significantly improved when the codebook size is larger. Generally, the accuracy decreases as the number of nearest neighbors grows. The highest accuracy is obtained when K_near = 2. In addition, in most cases the accuracy tends to increase when K_far ≤ K_near and to decrease when K_far > K_near; when K_near = K_far, it reaches the highest value. In fact, in the experiment on the 15 class scene dataset, the highest accuracy is indeed achieved when K_near = K_far = 2, regardless of whether the codebook size is 200 or 400. This result is also validated on the other datasets, as displayed in Tables 2–4 (M = 400). As illustrated in the tables, the eight class sports events dataset and the Caltech-256 dataset both obtain the optimal performance when K_near = K_far = 2, while the Caltech-101 dataset achieves the best result when K_near = 1, K_far = 0.
Fig. 8. The distribution of distances between features and code words.
Fig. 9. Classification accuracies and their relationship with the number of nearest and farthest neighbors when M = 200.
Fig. 10. Classification accuracies and their relationship with the number of nearest and farthest neighbors when M = 400.
6. Conclusions

In this paper, we propose a novel visual word coding algorithm based on difference maximization to overcome the over-completeness problem of the codebook model, caused by the weak discriminative power and strong ambiguity of low-level features, and to improve the generated codebook model. Instead of mapping an image feature vector onto one or multiple nearest visual words, the proposed approach utilizes a group of the nearest and the farthest visual words together in the coding process.
Table 2
The accuracy over the eight class sports events dataset with different numbers of the nearest and the farthest neighbors (M = 400).

Number of neighbors   K_far = 0   K_far = 1   K_far = 2   K_far = 3   K_far = 4   K_far = 5
K_near = 1            81.04       82.27       80.21       80.83       81.04       79.79
K_near = 2            80.83       82.15       82.50       81.67       81.25       81.04
K_near = 3            80.63       81.67       81.87       81.46       81.46       81.25
K_near = 4            80.21       81.46       81.67       81.46       81.04       81.46
K_near = 5            80.63       81.67       81.67       81.04       81.04       81.25
Table 3
The accuracy over Caltech-101 with different numbers of the nearest and the farthest neighbors (M = 400).

Number of neighbors   K_far = 0   K_far = 1   K_far = 2   K_far = 3   K_far = 4
K_near = 1            68.61       66.47       64.72       63.58       62.88
K_near = 2            68.41       67.27       66.42       65.67       64.87
K_near = 3            68.26       67.31       66.92       65.62       65.42
K_near = 4            68.06       67.21       66.62       66.27       65.97
K_near = 5            67.86       67.31       66.61       66.17       65.87
Table 4
The accuracy over Caltech-256 with different numbers of the nearest and the farthest neighbors (M = 400).

Number of neighbors   K_far = 0   K_far = 1   K_far = 2   K_far = 3   K_far = 4   K_far = 5
K_near = 1            35.97       35.45       34.75       33.98       33.39       32.67
K_near = 2            35.84       35.58       35.56       35.01       34.70       34.41
K_near = 3            35.10       35.03       35.05       35.06       34.78       34.55
K_near = 4            34.69       34.68       34.73       34.63       34.39       34.38
K_near = 5            34.32       34.48       34.31       34.31       34.30       34.17
Consequently, the representative variations of different image features are well kept and strengthened, which significantly improves the discriminative power of the visual word descriptor. Experimental results have demonstrated that our method is able to capture image semantic content more accurately and achieve superior classification performance. Nevertheless, the proposed method still leaves much room for further investigation and improvement. For example, it remains to be studied how to determine the optimal numbers of nearest and farthest neighbors through a theoretical derivation rather than through experiments. Furthermore, it is also worth testing our method on other standard datasets to further validate its classification performance.
Acknowledgments This work is supported by the Program for New Century Excellent Talents of MOE China (Grant No. NCET-11–0213), National 973 Program of China (Grant No. 2010CB327903), the Natural Science Foundation of China (Grant Nos. 61273257, 61035003, 61021062), the International Science and Technology Cooperation Program of China (Grant No. 2010DFA11030), and the Natural Science Foundation of Jiangsu, China (Grant Nos. BK2010054, BK2011005, BE2010638). References [1] G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: Proceedings of the ECCV'04 Workshop on Statistical Learning in Computer Vision, Prague, Czech Republic, 2004, pp. 1–22.
[2] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 2006, pp. 2169–2178. [3] D. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV'99), Kerkyra, Corfu, Greece, 1999, pp.1150–1157. [4] Y. Pang, M. Shang, J. Pan, Scale invariant image matching using triplewise constraint and weighted voting, Neurocomputing 83 (5) (2012) 64–71. [5] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 2005, pp. 886–893. [6] Y. Pang, Y. Yuan, X. Li, J. Pan, Efficient HOG human detection, Signal Process. 91 (4) (2011) 773–781. [7] Y. Pang, H. Yan, Y. Yuan, K. Wang, Robust CoHOG feature extraction in human centered image/video management system, IEEE Trans. Syst. Man Cybern B: Cybern. 42 (2) (2012) 458–468. [8] Y. Pang, W. Li, Y. Yuan, J. Pan, Fully affine invariant SURF for image matching, Neurocomputing 85 (2012) 6–10. [9] J. Wu, J. Rehg, Centrist: a visual descriptor for scene categorization, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1489–1501. [10] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discovery 2 (2) (1998) 121–167. [11] F. Jurie, B. Triggs, Creating efficient codebooks for visual recognition, in: Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV'05), Beijing, China, 2005, pp. 604–610. [12] J. Wu and J. M. Rehg, Beyond the Euclidean distance: creating effective visual codebooks using the histogram intersection kernel, in: Proceedings of the Twelfth IEEE International Conference on Computer Vision (ICCV'09), Kyoto, Japan, 2009, pp. 630–637. [13] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'09), Miami, Florida, USA, 2009, pp. 1794–1801. [14] K. Yu, T. Zhang, Y. Gong, Nonlinear learning using local coordinate coding, in: Advances in Neural Information Processing System 22 (NIPS'09),Vancouver, British Columbia, Canada, 2009, pp. 2222–2231. [15] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR'10), San Francisco, CA, USA, 2010, pp. 3360–3367. [16] J. Gemert, C.J. Veenman, A.W.M. Smeulders, J.M. Geusebroek, Visual word ambiguity, IEEE Trans. Pattern Anal. Mach. Intell. 32 (7) (2010) 1271–1283. [17] M.S. Lewicki, T.J. Sejnowski, Learning over complete representations, Neural Comput. 12 (2) (2000) 337–365. [18] Y. Huang, K. Huang, C. Wang, T. Tan, Exploring relations of visual codes for image classification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11), Colorado Springs, CO, USA, 2011, pp. 1649– 1656. [19] D. Arthur, S. Vassilvitskii, k-means++: the advantages of careful seeding, in: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'07), New Orleans, Louisiana, USA, 2007, pp. 1027–1035. [20] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intelligent Syst. 
Technol. 2 (3) (2011) 27, Software available at 〈http://www.csie.ntu.edu.tw/~cjlin/libsvm〉. [21] C.-W. Hsu, C.-J. Lin, A simple decomposition method for support vector machines, Mach. Learn. 46 (1–3) (2002) 291–314, Software available at 〈http://www.csie.ntu.edu.tw/~cjlin/bsvm〉. [22] G. Griffin, A. Holub, P. Perona, Caltech-256 object category dataset, Technical Report 7694, California Institute of Technology, 2007. [23] O. Boiman, E. Shechtman, M. Irani, In defense of nearest-neighbor based image classification, in: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'08), Miami, Florida, USA, 2008, pp. 1–8. [24] P. Jain, B. Kulis, K. Grauman, Fast image search for learned metrics, in: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'08), Miami, Florida, USA, 2008, pp. 1–8. [25] L. Li, F. Li, What, where and who? Classifying events by scene and object recognition, in: Proceedings of the Eleventh IEEE International Conference on Computer Vision (ICCV'07), Rio de Janeiro, Brazil, 2007, pp. 1–8.
Yu-Bin Yang received the B.Sc. degree in computer science from Wuhan Technical University of Surveying and Mapping, Wuhan, China, in 1997, and the M.Sc. and Ph.D. degrees in computer science from Nanjing University, Nanjing, China, in 2000 and 2003, respectively. He participated in collaborative research at The Chinese University of Hong Kong (CUHK) in 2003–2005, and at University of New South Wales at Australian Defence Force Academy (UNSW@ADFA) in 2005–2006. He is currently an Associate Professor with the State Key Laboratory for Novel Software Technology, Nanjing University. His current research interests include media computing, multimedia information retrieval, large-scale data mining, and machine learning.
Ling-Yan Pan is currently a graduate student at the Department of Computer Science and Technology, Nanjing University. Her research interests include image retrieval, pattern recognition, artificial intelligence and machine learning.
Guang-Nan He received the M.S. degree in computer science from Nanjing University, Nanjing, China, in 2011. His research interests include image retrieval, pattern recognition, artificial intelligence and machine learning.
Yang Gao received the Ph.D. degree in computer science from Nanjing University, Nanjing, China, in 2000. Currently, he is a Professor at the Department of Computer Science and Technology, Nanjing University. His research interests include artificial intelligence, machine learning and big data.
Yao Zhang received the M.S. degree in information management from Nanjing University, Nanjing, China, in 2007. She is currently a lecturer with Jin-Ling College, Nanjing University.