
Applied Soft Computing journal homepage: www.elsevier.com/locate/asoc

Heterogeneous bag-of-features for object/scene recognition
Loris Nanni a,∗, Alessandra Lumini b

a DEI, University of Padua, Viale Gradenigo 6, Padua, Italy
b DEIS, Università di Bologna, Via Venezia 52, 47023 Cesena, Italy
∗ Corresponding author. E-mail addresses: [email protected] (L. Nanni), [email protected] (A. Lumini).

Article info
Article history: Received 4 April 2012; Received in revised form 14 September 2012; Accepted 5 December 2012; Available online xxx.
Keywords: Object recognition; Bag-of-features; Interest region description; Texture descriptors; Machine learning; Support vector machine

Abstract
In this work we propose a method for object recognition based on a random selection of interest regions, a heterogeneous set of texture descriptors, and a bag-of-features approach that uses several k-means clustering runs to obtain different codebooks. The proposed system is not based on a complex region detector such as SIFT but on a simple exhaustive extraction of sub-windows of a given image. In the classification step an ensemble of random subspace support vector machines (SVM) is used. The use of a random subspace ensemble, coupled with principal component analysis for reducing the dimensionality of the descriptors, helps mitigate the curse of dimensionality. In the experimental section we show that the combination of classifiers trained using different descriptors yields a consistent improvement over the stand-alone approaches. The proposed system has been tested on four datasets: the VOC2006 dataset, a widely used scene recognition dataset, the well-known Caltech-256 Object Category Dataset and a landmark dataset, obtaining remarkable results with respect to other state-of-the-art approaches. The MATLAB code of our system is publicly available.
© 2013 Elsevier B.V. All rights reserved.

1. Introduction
As an evolution of over 10 years of research in the field of content-based image retrieval (CBIR), recent years have witnessed increasing interest in image annotation and object recognition problems. Along with the rapid progress in the application of local descriptors in pattern recognition, computer vision and image retrieval, the use of local features such as keypoints or image patches has emerged as promising for several applications of object classification and image retrieval, including wide-baseline matching for stereo pairs [15,28], object retrieval in videos [25], object recognition [19], texture recognition [18], robot localization [24], visual data mining [26], and symmetry detection [29].
One of the key issues for existing image annotation methods is to find an effective feature representation for images. Some approaches, e.g. [13,14], proposed constellation models to locate distinctive object parts and determine constraints on their spatial arrangement. The main drawbacks of these approaches are that the spatial models typically could not handle significant deformations (e.g. large rotations, occlusions) and that they did not consider objects with a variable number of parts (e.g. buildings, trees).


Recently, there has been an emerging consensus around the bag-of-words (BoW) model for image representation [19]. The BoW approach is based on powerful scale-invariant feature descriptors, which are used to represent regions of a given image that are covariant to a class of transformations and to match identical regions between images. The best-performing descriptors seem to be those based on histogram distributions [30]: for example, the intensity-domain spin image [5], a histogram approach that represents regions using the distance from the center point and the intensity values; the Scale-Invariant Feature Transform (SIFT) descriptor [19], a histogram of the weighted gradient locations and orientations; and the geodesic intensity histogram [31], which provides a deformation-invariant local descriptor. Other well-known descriptors include the Principal Component Analysis-based SIFT (PCA-SIFT) [3], moment invariants [16], and complex filters [23]. Moreover, a number of powerful texture operators can be used for region description: for example, in [17] an LBP-based texture descriptor (named center-symmetric local binary pattern, CS-LBP) is proposed, which is computationally simpler than SIFT and more robust to illumination problems. Another interesting result is reported in [21], where it is shown that, when the number of regions is quite large, random sampling gives equal or better classification rates than the other, more complex operators in common use.

Starting from these and other results, in this work we improve our previously published system for object recognition [9] by exploiting the ideas reported in PCA-SIFT [3,4] and in [7]:
• We use both local and global descriptors to represent an image; we have tested and fused several texture descriptors (i.e. local binary/ternary patterns, local phase quantization, histogram of oriented edges, Gabor-like features, SIFT).
• We perform a dimensionality reduction of the texture descriptors using principal component analysis (according to the PCA-SIFT approach [3]) to deal with the problems of high correlation among the features and the curse of dimensionality.
• We use a BoW approach, computing textons1 by clustering the descriptors of the regions of each class with k-means; by concatenating the textons over all the classes, we obtain a global texton vocabulary.
• We build an ensemble not only by combining different texture descriptors (local and global) but also by considering different codebooks. Since k-means is an unstable clustering approach, it is run several times to create different codebooks (one for each run), and a classifier is trained for each codebook.
• We represent each image as a histogram of texton labels, and these histograms are used to train a random subspace (RS) ensemble of support vector machine (SVM) classifiers. It has been reported in several papers (e.g. [20]) that an RS of SVMs performs better than a stand-alone SVM classifier, successfully manages high-dimensional data and mitigates the curse of dimensionality.
The novelty of this paper does not consist in presenting a new descriptor, but in having performed extensive tests with many of the state-of-the-art descriptors and in showing that they can be combined in the same framework with excellent results. A full-featured MATLAB toolbox containing the source code for all the steps of the proposed system is available at bias.csr.unibo.it\nanni\objH.rar; we hope it will serve as a foundation for further explorations by other researchers in the field.
The rest of the paper is organized as follows: in Section 2 the texture descriptors used in our system are briefly reviewed, in Section 3 the proposed approach for object/scene recognition is explained in detail, in Section 4 the experimental results are discussed, and in Section 5 conclusions are drawn.
1 Textons refer to fundamental micro-structures in generic natural images and the basic elements in early (pre-attentive) visual perception. In practice, the study of textons reduces information redundancy and thus leads to better image representations.
2. Descriptors
In this work we test several state-of-the-art descriptors that are potentially useful for object/scene classification.
2.1. SIFT descriptor
The SIFT descriptor [19] (MATLAB code available at http://www.vlfeat.org/∼vedaldi/code/sift.html) is a 3D histogram that takes the gradient locations (quantized into a 4 × 4 location grid) and orientations (quantized into eight values) and weighs them by the gradient magnitude and a Gaussian window superimposed over the region. The SIFT descriptor, which is obtained by concatenating the orientation histograms over all bins and normalizing to unit length, describes how the local gradients around a point are aligned and distributed at different scales.
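As a concrete starting point, the following is a minimal MATLAB sketch of extracting SIFT keypoints and descriptors with the VLFeat toolbox linked above; the choice of test image and the final re-normalization are illustrative assumptions, not part of the published toolbox.
% Minimal sketch: SIFT keypoints and descriptors with the VLFeat toolbox linked above.
% Assumes VLFeat is installed and on the MATLAB path (run vl_setup first); peppers.png is a standard demo image.
I = imread('peppers.png');
if size(I, 3) == 3, I = rgb2gray(I); end
I = single(I);                                        % vl_sift expects a single-precision grayscale image
[frames, descrs] = vl_sift(I);                        % frames: 4 x N keypoints, descrs: 128 x N descriptors
descrs = double(descrs);
descrs = descrs ./ max(sqrt(sum(descrs .^ 2, 1)), eps);   % normalize each descriptor to unit length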

2.2. Local ternary patterns
Local ternary patterns (LTP) are a quite recent variant [27] of the widely used local binary patterns (LBP). The LBP operator is a rotation-invariant operator that evaluates the binary difference between the gray value of a pixel x and the gray values of P neighboring pixels on a circle of radius R around x; the LBP descriptor, a powerful texture descriptor, is the occurrence histogram of the LBP operator. Conventional LBP is sensitive to noise in near-uniform image regions: LTP solves this problem by using a three-value encoding. Our implementation of LTP is a modification of the original LBP MATLAB code (available at http://www.ee.oulu.fi/mvg/page/lbp matlab) that includes the three-value encoding and a normalized histogram; we have used both rotation-invariant and uniform bins (named LTP-R and LTP-U). Each descriptor, obtained by concatenating the features extracted with (R = 1, P = 8) and (R = 2, P = 16), is used to train a different classifier.
2.3. Local phase quantization (LPQ)
The local phase quantization (LPQ) operator is a texture descriptor [22] based on the blur-invariance property of the Fourier phase spectrum (code available at http://www.ee.oulu.fi/mvg/download/lpq/). LPQ uses the local phase information extracted from the 2-D short-term Fourier transform computed over a rectangular neighborhood of radius R at each pixel position of the image. In LPQ only four complex coefficients, corresponding to 2-D frequencies, are considered and quantized using a scalar quantizer between 0 and 255. The final descriptor is the normalized histogram of the LPQ values. In our experiments two different LPQ descriptors are extracted by varying the parameter R (R = 3 and R = 5 in our final system, named LPQ(3) and LPQ(5)); each descriptor is used to train a different classifier.
2.4. GIST
The GIST descriptor [10] computes the energy of a bank of Gabor-like filters evaluated at 8 orientations and 4 different scales. The squared output of each filter is then averaged on a 4 × 4 grid.
2.5. The histogram of oriented edges (HOG)
In this work a 2 × 2 version of the Histogram of Oriented Edges (HOG) [4] has been used. The HOG features are extracted on a regular grid at steps of 8 pixels and stacked together considering sets of 2 × 2 neighboring cells to form a longer descriptor, which provides more descriptive power.
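Before moving to the proposed approach, the following is a minimal MATLAB sketch of the three-value (ternary) encoding behind the LTP descriptor of Section 2.2, evaluated at a single pixel; the gray values, the threshold t and the variable names are illustrative assumptions and are not taken from the authors' code.
% Minimal sketch of the LTP ternary encoding at one pixel (Section 2.2).
center    = 120;                                      % gray value of the pixel x
neighbors = [118 131 140 122 119 98 117 126];         % P = 8 gray values sampled on a circle of radius R
t = 5;                                                % ternary threshold: values within +/- t of the center map to 0
code = zeros(size(neighbors));
code(neighbors >= center + t) =  1;                   % clearly brighter than the center
code(neighbors <= center - t) = -1;                   % clearly darker than the center
upper = code ==  1;                                   % the ternary code is split into two binary (LBP-like) codes
lower = code == -1;
upperCode = sum(upper .* 2 .^ (0 : numel(upper) - 1));   % decimal code indexing a bin of the "upper" histogram
lowerCode = sum(lower .* 2 .^ (0 : numel(lower) - 1));   % decimal code indexing a bin of the "lower" histogram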

3. Proposed approach
In the following we explain the steps of our proposed approach; see Fig. 1 for a schema of the proposed method.
Fig. 1. Schema of the proposed approach.
• Step 1: Pre-processing. The image is normalized by contrast-limited adaptive histogram equalization (using the function adapthisteq.m in MATLAB), and then the image is resized so that its smaller dimension is at least 50 pixels.
• Step 2: Global descriptors. A global feature extraction is performed: four equal non-overlapping regions and a central region covering 50% of the original image are extracted. A given descriptor is extracted separately from each of the five regions and used to train five SVMs (one for each region), combined by the sum rule.

• Step 3: Sub-windows. Each image is divided into overlapping sub-windows of size (l/12.5, h/12.5), where l × h is the size of the original image, taken at fixed steps of min(l/25, h/25).
• Step 4: Local descriptors. A local feature extraction is performed by evaluating the different texture descriptors on each sub-window.
• Step 5: Dimensionality reduction by PCA. Each local descriptor is projected onto a reduced space according to PCA (calculated as in TRAINING2).
• Step 6: Codebook assignment. Each descriptor is assigned to the nearest codeword of the codebook (created as in TRAINING3) according to the minimum-distance criterion (a minimal sketch of Steps 3–6 is given after this list).
• Step 7: Classification. Each global and local descriptor extracted from the image is classified by an RS of SVMs (trained as in TRAINING1).
• Step 8: Fusion. The classifier outputs are then combined using the sum rule, i.e. the final score is the sum of the scores of the pool of classifiers that belong to the ensemble. Notice that, before fusion, the scores of each classifier are normalized to a mean of 0 and a standard deviation of 1.
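As a concrete illustration of the test-time pipeline (Step 1 pre-processing plus Steps 3–6) for a single image and a single local descriptor, here is a minimal MATLAB sketch. The names describeWindow, pcaMean, pcaMatrix and codebook are illustrative stand-ins for the outputs of TRAINING2 and TRAINING3 described below, not the names used in the authors' toolbox; the Image Processing and Statistics Toolboxes are assumed.
% Minimal sketch of the per-image test pipeline (Steps 1 and 3-6) for one local descriptor.
describeWindow = @(w) double(reshape(imresize(w, [8 8]), 1, []));  % stand-in descriptor (raw 8 x 8 patch)
pcaMean   = zeros(1, 64);  pcaMatrix = eye(64, 20);                % stand-in PCA projection (TRAINING2)
codebook  = rand(200, 20);                                         % stand-in codebook of 200 textons (TRAINING3)
I = imread('peppers.png');                                         % any test image
if size(I, 3) == 3, I = rgb2gray(I); end
I = adapthisteq(I);                                                % Step 1: contrast-limited adaptive histogram equalization
[h, l] = size(I);                                                  % image height h and width l
win  = round([l, h] / 12.5);                                       % Step 3: sub-window size (l/12.5, h/12.5)
step = round(min(l, h) / 25);                                      % Step 3: fixed sampling step min(l/25, h/25)
bow = zeros(1, size(codebook, 1));                                 % histogram of texton labels
for y = 1 : step : h - win(2) + 1
    for x = 1 : step : l - win(1) + 1
        w = I(y : y + win(2) - 1, x : x + win(1) - 1);             % one overlapping sub-window
        d = describeWindow(w);                                     % Step 4: local texture descriptor
        d = (d - pcaMean) * pcaMatrix;                             % Step 5: projection onto the PCA subspace
        [~, idx] = min(pdist2(d, codebook));                       % Step 6: nearest codeword (minimum distance)
        bow(idx) = bow(idx) + 1;
    end
end
bow = bow / sum(bow);                                              % normalized bag-of-features histogram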

• TRAINING1: RS of SVM. A different RS of SVMs is trained for each local or global descriptor. SVM [34] is a widely used general-purpose classifier trained to distinguish two classes by finding the hyperplane that maximally separates the points of the two classes. SVM can deal with non-linearly separable problems by using kernel functions to project the data points onto a higher-dimensional feature space, and with multi-class problems by performing several "one-versus-all" (OVA) classifications. OVA is a strategy where a single (usually binary) classifier is trained per class to distinguish that class from all other classes; prediction is then performed by selecting the class with the highest confidence score. According to the results reported in [20], we use an RS of SVMs, which has been proven to mitigate the curse of dimensionality. The RS method [2] creates ensembles by combining classifiers (50 in this work) trained on subsets of the original training set, generated by randomly sampling feature subsets of reduced dimensionality (in this study we halve the dimensionality of the dataset). We have used a radial basis function kernel for the global descriptors and a histogram intersection kernel for the bag-of-words approach.
• TRAINING2: PCA. A set of 250,000 sub-windows is randomly extracted from the training set (considering the different classes) and used to construct the PCA matrix (one projection matrix for each descriptor).
• TRAINING3: Codebook creation. A different set of textons is created for each class of the dataset, one for each subgroup of 50 images of the training set (images are clustered in groups of 50 due to computational issues). A single set of textons is created by clustering a local descriptor with k-means (the number of clusters k is randomly selected between 10 and 40). For each descriptor, the final texton vocabulary (codebook) is obtained by concatenating the textons over all the classes. Moreover, since k-means is an unstable clustering approach, we run it several times (20 in this paper, each with a different PCA subspace projection) in order to obtain more codebooks (a minimal sketch of this multi-codebook construction is given after this list). We have not checked whether the different clusterings obtained by k-means are similar; we simply combine all the created codebooks. In our tests, the performance obtained by the fusion of n classifiers based on n different codebooks built by n different k-means runs outperforms the performance of the best classifier of the ensemble.
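The following is a minimal MATLAB sketch of the multi-codebook construction of TRAINING3 for one descriptor, assuming the PCA-projected sub-window descriptors have already been grouped per class; the random range [10, 40] and the 20 runs come from the text, while the variable names, array shapes and toy data are illustrative assumptions.
% Minimal sketch of TRAINING3: several codebooks built from repeated k-means runs.
% descrByClass is a stand-in: descrByClass{c} holds the PCA-projected sub-window descriptors
% of class c, one row per sub-window (Statistics Toolbox assumed for kmeans).
descrByClass = arrayfun(@(c) randn(500, 20), 1:10, 'UniformOutput', false);  % 10 toy classes
numCodebooks = 20;                                    % one codebook per k-means run (20 in the paper)
codebooks = cell(1, numCodebooks);
for r = 1 : numCodebooks
    vocab = [];
    for c = 1 : numel(descrByClass)
        k = randi([10 40]);                           % number of clusters, randomly chosen in [10, 40]
        [~, centers] = kmeans(descrByClass{c}, k, 'MaxIter', 200, 'Replicates', 1);
        vocab = [vocab; centers];                     % concatenate the textons over all the classes
    end
    codebooks{r} = vocab;                             % a separate classifier is trained for each codebook
end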


4. Experimental results
The recognition of object categories and scenes is one of the most challenging problems in computer vision. In our experiments we use the PASCAL Visual Object Classes Challenge 2006 protocol (VOC2006) [32], a 15-class scene dataset widely used in the literature [10], the well-known Caltech-256 Object Category Dataset [36] and a landmark dataset [35].
Fig. 2. Samples from the VOC2006 dataset.

4.1. Object recognition on VOC2006
The VOC2006 dataset includes ten categories of images: bicycle, bus, car, cat, cow, dog, horse, motorbike, person, and sheep. According to the official protocol [32] (http://www.pascal-network.org/challenges/VOC/voc2006; in this work we use the segmented images for classification, without performing the segmentation step), we use the Trainval images for training and the Test images for testing. Some images of the VOC2006 dataset are illustrated in Fig. 2. Even if the VOC2006 dataset is quite old and easier than the most recent VOC competition datasets (VOC2006 contains only 10 object classes) [33], it is widely used in the literature to assess the performance of object recognition systems.
The classification task in the VOC2006 contest was judged by the Receiver Operating Characteristic (ROC) curve, and the official performance indicator [32] for VOC2006 is the area under the ROC curve. The ROC curve is a graphical plot of the sensitivity of a binary classifier vs. the false positive rate (1 - specificity) as its discrimination threshold is varied; the area under the ROC curve (AUC) [38] is a scalar measure that can be interpreted as the probability that the classifier will assign a higher score to a randomly picked positive sample than to a randomly picked negative sample. Since we deal with a multi-class dataset, a different AUC is calculated for each category (one-versus-all).
The first test is aimed at comparing some variants of the steps of the proposed approach. In particular we evaluate the performance obtained by considering a single texture descriptor: LPQ with R = 3. The performance of the following approaches is reported in Table 1:


• LocalSa: a version of our system based only on a local descriptor (i.e. LPQ with R = 3) used to train a stand-alone SVM (only steps 1, 3, 5, 6 of the whole approach are performed, for a single descriptor); the codebook creation is performed only once, as in [9].
• LocalRs: the system above (LocalSa) where an RS of SVMs is employed as classifier (a minimal sketch of such an ensemble is given after this list). This is a simplified (single descriptor) version of our previous approach proposed in [9].
• LocalCbSa: the system LocalSa enhanced by adding the multiple codebook creation (as described in TRAINING3).
• LocalPcaCbSa: the system LocalCbSa enhanced with the dimensionality reduction by PCA.
• Local: the complete version of our system based only on a local descriptor (steps 1, 3, 4, 5, 6 are performed for a single descriptor).
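Since several of the variants above differ only in the classifier (stand-alone SVM vs. a random subspace ensemble), the following is a minimal MATLAB sketch of a random subspace SVM ensemble in the spirit of TRAINING1: 50 SVMs, each trained on a random half of the feature dimensions, fused by the sum rule. It uses fitcsvm from the Statistics and Machine Learning Toolbox with an RBF kernel as a stand-in; the authors' toolbox may rely on a different SVM implementation and uses a histogram intersection kernel for the bag-of-words descriptors, and the data below are toy stand-ins.
% Minimal sketch of a random subspace (RS) ensemble of SVMs, cf. TRAINING1.
% X: n x d training matrix, y: labels in {-1, +1}, Xtest: m x d test matrix (toy stand-in data).
rng(1);
X = randn(200, 100);  y = sign(randn(200, 1));  Xtest = randn(50, 100);
nClassifiers = 50;                                    % ensemble size used in the paper
subDim = round(size(X, 2) / 2);                       % each SVM sees half of the feature dimensions
scores = zeros(size(Xtest, 1), 1);
for i = 1 : nClassifiers
    f = randperm(size(X, 2), subDim);                 % random feature subset for this ensemble member
    mdl = fitcsvm(X(:, f), y, 'KernelFunction', 'rbf', 'KernelScale', 'auto');
    [~, s] = predict(mdl, Xtest(:, f));               % s(:, 2) is the score of the positive class
    scores = scores + s(:, 2) / nClassifiers;         % sum rule over the ensemble members
end
predictedLabels = sign(scores);                       % final decision for each test sample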

The results reported in Table 1 clearly show that the ideas applied in the new approach (Local) drastically improve our previous method (LocalRs) [9].
The second test is aimed at comparing each single descriptor fused in our system, and our final system (Final) as described in Section 3, to the state-of-the-art approaches proposed in the literature: a method using SIFT descriptors [17] (Sift), the best results reported in [17] using the CS-LBP approach (CS-LBP), our previous method based on three descriptors [9] (Old), a method [21] based on random sampling of regions (RanSam), the winner of the VOC2006 competition (WinVOC), and the method with the highest performance that we have found in the literature [12], which is based on metric learning (MetLearn). The first rows of Table 2 refer to the descriptors of Section 2, each applied either to the whole image (subscript G) or as a local feature (subscript L): LTP-U, LTP-R, LPQ(3), LPQ(5), GIST, HOG, SIFT. We want to stress the usefulness of extracting the global descriptors from the regions of the image instead of from the whole image: if we extract the features from the whole image, LPQ(3) obtains an AUC of 91.27 and GIST of 92.99, while using our approach LPQ(3) obtains an AUC of 92.36 and GIST of 94.35.
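Since the results in Tables 1 and 2 are reported as per-class AUC in a one-versus-all setting, the following is a minimal MATLAB sketch of how such a value can be computed with perfcurve from the Statistics and Machine Learning Toolbox; labels and scores are illustrative stand-ins for the output of one of the one-versus-all classifiers.
% Minimal sketch: one-versus-all AUC for a single class, as used in the VOC2006 evaluation.
labels = [1 1 0 1 0 0 1 0 0 0];                       % 1 = the image contains the class, 0 = it does not
scores = [0.9 0.8 0.7 0.6 0.55 0.4 0.35 0.3 0.2 0.1]; % stand-in classifier confidence for the class
[fpr, tpr, ~, auc] = perfcurve(labels, scores, 1);    % ROC curve and area under it
fprintf('AUC = %.3f\n', auc);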


Table 1
Performance (AUC) obtained by some variants of the proposed system.

              Bus   Car   Bicycle  Cat   Cow   Dog   Horse  Motorbike  Person  Sheep  AVG
LocalSa       90.9  95    86.9     83.5  83.7  78.6  81.1   87.6       86.9    90.1   86.43
LocalRs       91.5  95.7  89.8     86.7  87.4  81.1  84.3   89.6       89.1    91.5   88.67
LocalCbSa     97.7  96.7  94.5     91.1  89.6  82.7  87     92.8       91.7    93.6   91.7
LocalPcaCbSa  97.7  96.8  94.4     90.9  89.8  82.9  87.2   93.2       91.7    93.4   91.8
Local         98.2  96.9  94.6     91.2  89.8  82.9  87.3   93.3       91.9    93.6   92

Table 2
AUC obtained by the state-of-the-art methods on the VOC2006 dataset.

Approach       AUC (AVG)
LTP-UG         91.6
LTP-UL         90.5
LTP-RG         91.5
LTP-RL         90.9
LPQ(3)G        92.3
LPQ(3)L        92.2
LPQ(5)G        91.5
LPQ(5)L        90.1
GISTG          94.3
GISTL          93.2
HOGG           91.5
HOGL           90.7
Final          96.4
Sift [17]      89.4
CS-LBP [17]    91.4
RanSam [21]    90.8
WinVOC [32]    93.6
Old [9]        94.2
MetLearn [12]  95.6

4.2. Object recognition on Caltech-256
The Caltech-256 dataset includes a challenging set of 256 object categories containing a total of 30,607 images, with at least 80 images per category. Caltech-256 was collected by choosing a set of object categories, downloading examples from Google Images and then manually screening out all images that did not fit the category. According to a suggested protocol [36], we run a 5-split test using Ntrain = 40 images per class for training and Ntest = 25 for testing. The performance indicator is the accuracy, averaged over the 5 experiments. Some images of the Caltech-256 dataset are illustrated in Fig. 3.
The test reported in Table 3 is aimed at comparing each single descriptor fused in our system, and our final system (Final), to the state-of-the-art approaches proposed in the literature: also in this test we show that our fusion obtains performance similar to the state-of-the-art. Moreover, notice that [41–43] use 45 training images for each class, so their training sets are larger than those used in this work; in [45] several approaches are compared (see Fig. 2 of that paper, available at http://www.cs.dartmouth.edu/∼lorenzo/Papers/tsf-eccv10.pdf) and, among the approaches based on 40 training images, only LPbeta (i.e. [40]) obtains a performance higher than 40%. In this test, since a large training set is available, we slightly modify the algorithm (without changing its parameters): we do not consider LTP-RL and LTP-UL due to their low performance. Moreover, we propose another fusion named Final2, where the descriptors (again without LTP-RL and LTP-UL) are combined by a weighted sum rule: the global descriptors have a weight of one, while the local descriptors have a weight of 0.5 (a minimal sketch of this weighted fusion is given after Table 3). Notice that if in Final2 we use the whole image for the global descriptors and only one codebook for the local descriptors, it obtains an accuracy of 33.9%. Moreover, if we combine all the local descriptors (considering all the codebooks) the accuracy is 27.2%, while if only one codebook is used for each descriptor their fusion obtains an accuracy of 25.0%. These results confirm our idea of combining several simple variants of the same descriptor to improve the performance.

Table 3
Accuracy obtained by the state-of-the-art methods on the Caltech-256 dataset.

Approach              Accuracy
LTP-UG                22.5%
LTP-UL                12.0%
LTP-RG                13.4%
LTP-RL                9.79%
LPQ(3)G               21.5%
LPQ(3)L               18.0%
LPQ(5)G               23.5%
LPQ(5)L               20.0%
GISTG                 28.8%
GISTL                 20.5%
HOGG                  12.1%
HOGL                  15.6%
Final                 38.0%
Final2                40.0%
[40]                  48.9%
[41] (Ntrain = 45)    37.5%
[42] (Ntrain = 45)    45%
[43] (Ntrain = 45)    45.3%
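As referenced above, here is a minimal MATLAB sketch of the Final2 weighted sum rule: every classifier's scores are first normalized to zero mean and unit standard deviation (as in Step 8), then summed with weight 1 for the global descriptors and 0.5 for the local ones. The score matrices are illustrative stand-ins, not actual classifier outputs.
% Minimal sketch of the Final2 fusion: z-score normalization followed by a weighted sum rule.
globalScores = rand(100, 6);                          % stand-in scores of the classifiers trained on global descriptors
localScores  = rand(100, 10);                         % stand-in scores of the classifiers trained on local descriptors
z = @(S) (S - mean(S, 1)) ./ std(S, 0, 1);            % normalize each classifier's scores to mean 0, std 1 (Step 8)
final2 = 1.0 * sum(z(globalScores), 2) + 0.5 * sum(z(localScores), 2);   % global weight 1, local weight 0.5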

4.3. Scene recognition
The scene dataset [10] is composed of the following categories: coast (360 images), forest (328 images), mountain (274 images), open country (410 images), highway (260 images), inside city (308 images), tall building (356 images), street (292 images), bedroom (216 images), kitchen (210 images), living room (289 images), office (215 images), suburb (241 images), industrial (311 images), and store (315 images). Images are about 300 × 250 pixels in resolution. Some sample images are shown in Fig. 4.
The testing protocol for the scene dataset, following other papers in the literature, requires 5 experiments, each using 100 randomly selected images per category for training and the remaining images for testing. The performance indicator is the accuracy, averaged over the 5 experiments.
In Table 4 we report the results obtained by each single descriptor fused in our system and by our final system, compared to the best methods reported in the literature. Even if the proposed approach obtains lower performance than the one in [4], the reported results are very valuable, since they are obtained without any ad hoc optimization. Notice that some state-of-the-art stand-alone approaches obtain performance similar to our best stand-alone method: e.g. the SIFT-based method tested in [4] (the best stand-alone method among those used to build their ensemble) obtains an accuracy of 81.2%, while the salient coding approach [39], a better-performing variant of LLC (the winner of VOC2009), obtains 82.6%. Also in this dataset it is very useful to extract the global descriptors from the regions of the image instead of from the whole image: if we extract the features from the whole image, LPQ(3) obtains an accuracy of 57.35% and GIST of 76.08%, while using our approach LPQ(3) obtains an accuracy of 61.81% and GIST of 75.04%.




Fig. 3. Samples from the Caltech-256 dataset.

Fig. 4. Samples from the scene dataset.

Table 4
Accuracy obtained by the state-of-the-art approaches on the scene dataset.

Approach                   Year   Accuracy
LTP-UG                     –      81.2%
LTP-UL                     –      71.2%
LTP-RG                     –      73.5%
LTP-RL                     –      65.6%
LPQ(3)G                    –      76.1%
LPQ(3)L                    –      74.1%
LPQ(5)G                    –      74.7%
LPQ(5)L                    –      74.1%
GISTG                      –      75.0%
GISTL                      –      75.9%
HOGG                       –      68.2%
HOGL                       –      74.5%
Final                      2012   87.1%
Oliva and Torralba [10]    2001   73.3%
Lazebnik et al. [5]        2006   81.4%
Liu and Shah [6]           2007   83.3%
Wu and Rehg [11]           2009   83.1%
Xiao et al. [4]            2010   88.1%
Gu et al. [1]              2011   83.7%
Huang et al. [39]          2011   82.6%
Meng et al. [8]            2012   84.1%
Nanni et al. [9]           2012   82.0%
Elfiky et al. [37]         2012   85.4%

4.4. Landmark recognition
The landmark dataset [35] is composed of 1227 photos of landmarks located in Pisa and crawled from Flickr. The dataset contains photos from 12 classes, with at least 46 images per class. Some sample images are shown in Fig. 5. According to the official testing protocol [35], the landmark dataset is divided into a training set of 921 photos (approximately 80% of the dataset) and a test set of 226 photos (approximately 20% of the dataset). The performance indicator is the accuracy.
Fig. 5. Samples from the landmark dataset.
Table 5 reports the performance obtained by each single descriptor fused in our system and by our final system, compared to the best methods reported in [35]. In this dataset it is clear that the global descriptors also work very well: several of our approaches outperform SIFT, and our proposed fusion drastically outperforms the methods reported in [35]. Here LPQ(5)G and GISTG obtain an accuracy of 94.7% and 88.0%, while if we extract the descriptors from the whole image we obtain an accuracy of 88.5% and 83.6%.

Table 5
Accuracy obtained by the state-of-the-art approaches on the landmark dataset.

Approach          Accuracy
LTP-UG            91.6%
LTP-UL            80.5%
LTP-RG            81.9%
LTP-RL            77.9%
LPQ(3)G           95.6%
LPQ(3)L           94.7%
LPQ(5)G           94.7%
LPQ(5)L           93.8%
GISTG             88.0%
GISTL             94.7%
HOGG              86.3%
HOGL              91.2%
Final             95%
SIFT [35]         92%
Color-SIFT [35]   82%
SURF [35]         90%

4.5. Comments
Computer vision is one of the computer science fields with the highest performance improvement in recent years: e.g. the winner of VOC2006 obtained an AUC of 93.6% in 2006, while in 2012 the highest performance reported on this difficult dataset is 95.6% [12]; nevertheless, these results are still far from those obtained by a human being.


The same conclusion can be drawn on the other datasets:
• the accuracy on the scene dataset used in this paper was 73.3% in 2001, while the highest accuracy reported in 2012 is 88.1% [4];
• the accuracy on the Caltech-256 dataset (Ntrain = 40) was ∼38% in 2006, while the highest accuracy reported in 2012 is 48.9% [40].
Our results are very valuable, since they have been obtained without any ad hoc parameter tuning for each dataset, while other state-of-the-art approaches that reach comparable performance are designed and tuned specifically for a single problem: object recognition or scene recognition. From the analysis of our experiments it is interesting to note that each descriptor works differently in each dataset, e.g. in the scene/landmark datasets GIST works poorly while in the object classification datasets it works very well. This is a reason why the fusion works well in all the tested datasets.
We investigated, using the average Q-statistic [44], the relationship among the scores of the different classifiers (one for each texton vocabulary) obtained using the same texture descriptor. Yule's Q-statistic is a measure for evaluating the independence of classifiers: for two classifiers Di and Dk the Q-statistic is defined as Qi,k = (ad − bc)/(ad + bc), where a is the probability of both classifiers being correct, d is the probability of both classifiers being incorrect, b is the probability that the first classifier is correct and the second is incorrect, and c is the probability that the second classifier is correct and the first is incorrect. Qi,k varies between −1 and 1 and is 0 for statistically independent classifiers. The average Q-statistic (over all pairs of classifiers) on the scene dataset using LPQ(3) is 0.86; this value is low enough to confirm that the different classifiers bring different information.
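For clarity, the following is a minimal MATLAB sketch of the Q-statistic for one pair of classifiers, estimating the probabilities a, b, c, d as relative frequencies over the test set; the correctness vectors are illustrative stand-ins.
% Minimal sketch: Yule's Q-statistic for a pair of classifiers [44].
correct1 = logical([1 1 0 1 1 0 1 0 1 1]);            % stand-in outcomes of classifier Di (true = correct)
correct2 = logical([1 0 0 1 1 1 1 0 1 0]);            % stand-in outcomes of classifier Dk
a = mean( correct1 &  correct2);                      % both correct
b = mean( correct1 & ~correct2);                      % only the first correct
c = mean(~correct1 &  correct2);                      % only the second correct
d = mean(~correct1 & ~correct2);                      % both incorrect
Q = (a*d - b*c) / (a*d + b*c);                        % ranges in [-1, 1]; 0 for statistically independent classifiers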

5. Conclusions
In this paper we have presented a new method for recognizing object categories and scenes. We have used random sampling instead of a complex region detector, we have used ideas from PCA-SIFT to deal with the high-dimensionality problem, and we have combined different codebooks obtained from different clustering runs to enrich the power of the codebook representation. Our final approach combines global and local features (obtained from the different codebooks and texture descriptors). Instead of a stand-alone classifier we have used a random subspace ensemble of SVMs, which is able to mitigate the curse of dimensionality. Without any ad hoc optimization per dataset, our approach obtains a very high performance on four different datasets, always using the same parameters.
As future work we plan to test some non-trained approaches for dimensionality reduction instead of the PCA in our final method, following some ideas proposed in [37], where a histogram bin selection is performed.
Acknowledgements
The authors would like to thank all the other researchers that have shared their MATLAB code.
References

[1] G. Gu, Y. Zhao, Z. Zhu, Integrated image representation based natural scene classification, Expert Systems with Applications 38 (9) (2011) 11273–11279.
[2] T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8) (1998) 832–844.
[3] Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2004) 506–513.
[4] J. Xiao, J. Hays, K. Ehinger, A. Oliva, A. Torralba, SUN database: large-scale scene recognition from abbey to zoo, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2010).
[5] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2 (2006) 2169–2178.
[6] J. Liu, M. Shah, Scene modeling using co-clustering, Proceedings of IEEE International Conference on Computer Vision (2007).
[7] H.L. Luo, H. Wei, F. Hu, Improvements in image categorization using codebook ensembles, Image and Vision Computing 29 (11) (2011) 759–773.
[8] X. Meng, Z. Wang, L. Wu, Building global image features for scene recognition, Pattern Recognition 45 (1) (2012).
[9] L. Nanni, S. Brahnam, Random interest regions for object recognition based on texture descriptors and bag-of-features, Expert Systems with Applications 39 (1) (2012) 973–977.
[10] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, International Journal of Computer Vision 42 (3) (2001) 145–175.
[11] J. Wu, J.M. Rehg, CENTRIST: A Visual Descriptor for Scene Categorization, Technical Report GIT-GVU-09-05, Georgia Institute of Technology, 2009.


[12] L. Wu, S.C.H. Hoi, Enhancing bag-of-words models by efficient semantics-preserving metric learning, IEEE Multimedia Magazine 18 (1) (2011) 24–37.
[13] F.F. Li, R. Fergus, P. Perona, A Bayesian approach to unsupervised one-shot learning of object categories, Proceedings of 9th IEEE International Conference on Computer Vision (2003) 1134–1141.
[14] R. Fergus, P. Perona, A. Zisserman, A visual category filter for Google images, Proceedings of ECCV (2004) 242–256.
[15] A. Baumberg, Reliable feature matching across widely separated views, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2000) 774–781.
[16] L.J.V. Gool, T. Moons, D. Ungureanu, Affine/photometric invariants for planar intensity patterns, Proceedings of 4th European Conference on Computer Vision (1996) 642–651.
[17] M. Heikkilä, M. Pietikäinen, C. Schmid, Description of interest regions with local binary patterns, Pattern Recognition 42 (3) (2009) 425–436.
[18] S. Lazebnik, C. Schmid, J. Ponce, A sparse texture representation using local affine regions, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1265–1278.
[19] D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[20] L. Nanni, S. Brahnam, A. Lumini, A study for selecting the best performing rotation invariant patterns in local binary/ternary patterns, Proceedings of Image Processing, Computer Vision, and Pattern Recognition (IPCV'10) (2010).
[21] E. Nowak, F. Jurie, B. Triggs, Sampling strategies for bag-of-features image classification, Proceedings of European Conference on Computer Vision (2006).
[22] V. Ojansivu, J. Heikkila, Blur insensitive texture classification using local phase quantization, Proceedings of ICISP (2008).
[23] F. Schaffalitzky, A. Zisserman, Multi-view matching for unordered image sets, Proceedings of 7th European Conference on Computer Vision (2002) 414–431.
[24] S. Se, D. Lowe, J. Little, Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks, International Journal of Robotics Research 21 (8) (2008) 735–758.
[25] J. Sivic, F. Schaffalitzky, A. Zisserman, Object level grouping for video shots, Proceedings of 8th European Conference on Computer Vision (2004) 724–734.
[26] J. Sivic, A. Zisserman, Video data mining using configurations of viewpoint invariant regions, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2004) 488–495.
[27] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, in: Analysis and Modelling of Faces and Gestures, vol. 4778 of LNCS, 2007, pp. 168–182.
[28] T. Tuytelaars, L.V. Gool, Matching widely separated views based on affine invariant regions, International Journal of Computer Vision 59 (1) (2004) 61–85.

[29] A. Turina, T. Tuytelaars, L.V. Gool, Efficient grouping under perspective skew, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2001) 247–254.
[30] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (10) (2005) 1615–1630.
[31] H. Ling, D.W. Jacobs, Deformation invariant image matching, Proceedings of 10th IEEE International Conference on Computer Vision (2005) 1466–1473.
[32] M. Everingham, A. Zisserman, C.K.I. Williams, L. Van Gool, The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
[33] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes (VOC) Challenge, International Journal of Computer Vision 88 (2) (2010) 303–338.
[34] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2000.
[35] G. Amato, F. Falchi, P. Bolettieri, Recognizing landmarks using automated classification techniques: evaluation of various visual features, Proceedings of the 2010 Second International Conferences on Advances in Multimedia (2010) 78–83.
[36] G. Griffin, A. Holub, P. Perona, Caltech-256 Object Category Dataset, California Institute of Technology (unpublished). http://resolver.caltech.edu/CaltechAUTHORS:CNS-TR-2007-001
[37] N.M. Elfiky, F.S. Khan, J. van de Weijer, J. Gonzalez, Discriminative compact pyramids for object and scene recognition, Pattern Recognition 45 (4) (2012) 1627–1636.
[38] T. Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers, HP Laboratories, Palo Alto, 2004.
[39] Y. Huang, K. Huang, Y. Yu, T. Tan, Salient coding for image classification, Computer Vision and Pattern Recognition (CVPR) (2011).
[40] P. Gehler, S. Nowozin, On feature combination for multiclass object detection, ICCV (2009).
[41] J. Yang, K. Yu, Y. Gong, T.S. Huang, Linear spatial pyramid matching using sparse coding for image classification, Computer Vision and Pattern Recognition (CVPR) (2009) 1794–1801.
[42] F. Perronnin, J. Sánchez, T. Mensink, Improving the Fisher Kernel for large-scale image classification, European Conference on Computer Vision (2010).
[43] J. Wang, et al., Locality-constrained linear coding for image classification, Computer Vision and Pattern Recognition (CVPR) (2010).
[44] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Machine Learning 51 (2003) 181–207.
[45] L. Torresani, M. Szummer, A. Fitzgibbon, Efficient object category recognition using classemes, European Conference on Computer Vision (2010).
