Laplacian Regularized Locality-constrained Coding for Image Classification

Huaqing Min^a, Mingjie Liang^b,∗, Ronghua Luo^b, Jinhui Zhu^a

^a School of Software Engineering, South China University of Technology, Guangzhou 510006, China
^b School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
Abstract

Feature coding, which encodes local features extracted from an image with a codebook and generates a set of codes for efficient image representation, has shown very promising results in image classification. Vector quantization is the simplest and most widely used method for feature coding. However, it suffers from large quantization errors and leads to dissimilar codes for similar features. To alleviate these problems, we propose Laplacian Regularized Locality-constrained Coding (LapLLC), wherein a locality constraint is used to favor nearby bases for encoding, and Laplacian regularization is integrated to preserve the code consistency of similar features. By incorporating a set of template features, the objective function used by LapLLC can be decomposed, and each feature is encoded by solving a linear system. Additionally, the k nearest neighbor technique is employed to construct a much smaller linear system, so that fast approximated coding can be achieved. LapLLC therefore provides a novel way to encode features efficiently. Our experiments on a variety of image classification tasks demonstrate the effectiveness of the proposed approach.

Keywords: Image classification, Feature coding, Locality constraint, Laplacian regularization
∗ Corresponding author.
Email addresses: [email protected] (Huaqing Min), [email protected] (Mingjie Liang), [email protected] (Ronghua Luo), [email protected] (Jinhui Zhu)

1. Introduction

Classifying images into semantic categories, a problem also referred to as image classification, is of great interest in both research and practice. On one hand, it is a very challenging problem due to a number of factors, such as a wide range of illumination conditions, tremendous changes in viewpoint, and large intra-class variation. On the other hand, it is an essential issue in computer vision and image processing; the techniques for solving image
classification can be applied in a large number of practical fields, including video tracking and surveillance [1, 2], content-based image indexing and retrieval [3, 4], and intelligent robot localization and navigation [5, 6]. The potential and challenges of image classification have attracted much research attention in recent years.

One of the key issues in image classification is to find a suitable way to represent images. Many image representation models have been proposed, including ones based only on low-level features and ones concerned with semantic modeling [7]. The Bag-of-Words (BoWs) model [8] is one of the most popular methods in the latter category. In the BoWs model, local features are first extracted from an image and quantized into "visual words", and then a histogram is formed by counting the occurrences of the visual words. Representing an image by a set of local features has enabled the BoWs model to obtain decent performance in image classification despite changes in viewpoint, illumination variation and partial occlusion. However, researchers have also noticed several drawbacks of this model.

One evident drawback is the loss of spatial information. The BoWs model considers an image as an orderless collection of features and discards the spatial relationships between them, which can severely limit the descriptive power of the representation. To incorporate spatial information, Lazebnik et al. [9] introduce Spatial Pyramid Matching (SPM). Motivated by the work of Grauman et al. [10], they partition the image into increasingly finer spatial sub-regions and compute a histogram of local features for each sub-region. The histograms from all regions are then concatenated to form the final representation of the image. Compared to the original BoWs model, this technique has been shown to improve performance substantially. Plenty of recent studies build on the SPM framework, such as [11–14].

Another drawback is related to quantization errors [11, 15]. Commonly, local features are converted to visual words by vector quantization in the traditional BoWs model; specifically, each local feature is assigned to the closest entry in the given codebook. Doing so, however, can lead to tremendous quantization errors and code inconsistency. To alleviate this information loss, soft assignment methods have been proposed, wherein each feature is assigned to multiple entries or bases, and weights are introduced to specify the importance of each entry. By choosing a combination of entries instead of a single one, soft assignment reduces quantization errors and therefore retains more information. Note that the weights of all entries (an unused entry is simply assigned a weight of 0) together with the codebook can be considered a new form of representation. Given the codebook, the weights are often referred to as the "code" of the feature.

Converting local features into codes has several advantages. First, much more compact representations can be obtained, since codes are usually dominated by a few items. Second, codes can be easily integrated into the SPM framework to further improve performance. Third, it processes information in a more biologically plausible way, since human beings are believed to learn through multiple levels of abstraction.

In this paper, our primary concern is the coding scheme, which
determines how local features are converted into corresponding codes. Specifically, we try to identify the principles of coding by analysing previous related work, and to design a more robust and efficient coding method so as to improve the performance of image classification.

The rest of the paper is organized as follows. In Section 2, we give a brief review of related work on different coding schemes. Section 3 presents our proposed coding method under the assumption that a codebook is given. We defer our codebook learning method to Section 4. In Section 5, we present an approximate implementation of the proposed coding method so that features can be encoded more efficiently. Experimental results on several publicly available datasets are reported in Section 6. Finally, Section 7 concludes the paper.

2. Coding Schemes

In image processing, coding refers to the process of finding the bases and associated weights for input features. Proper coding of the original features not only provides a more compact representation, but also improves the performance of image classification. Several coding schemes have been proposed, each emphasizing different characteristics of the codes. In this section, we briefly review three types of coding schemes and discuss their pros and cons.

To start with, let X = [x_1, x_2, ..., x_N] ∈ R^{D×N} denote a set of local features extracted from an image, where D is the dimension and N the number of features. Given a codebook B = [b_1, b_2, ..., b_M] ∈ R^{D×M}, where M is the number of entries, different coding schemes convert each descriptor x_i (i ∈ {1, ..., N}) into a corresponding M-dimensional code c_i ∈ R^M, so that the image can be represented as a set of codes C = [c_1, c_2, ..., c_N] ∈ R^{M×N}.

2.1. Vector Quantization

Vector Quantization (VQ) enjoys great popularity in the computer vision community, partially due to its simplicity and effectiveness. The basic idea is very simple: given a feature x_i, VQ uses its closest base in the codebook to represent the feature, and the weight is simply set to 1. Formally, VQ encodes a set of features X by solving the following constrained least-squares optimization problem:
$$\arg\min_{C} \sum_{i=1}^{N} \|x_i - B c_i\|_2^2 \qquad (1)$$

$$\text{s.t.}\quad \mathrm{Card}(c_i) = 1,\ \|c_i\|_1 = 1,\ c_i \succeq 0,\ \forall i$$
where ‖·‖_2 denotes the ℓ_2-norm and ‖·‖_1 the ℓ_1-norm. The cardinality constraint Card(c_i) = 1 means that only one element of c_i can be nonzero, and the condition c_i ⪰ 0 requires all elements of c_i to be nonnegative. Together, the constraints indicate that exactly one element of c_i has value 1. In practice, the problem in Eq.(1) is easily solved by searching for the nearest neighbor of each feature x_i, setting the corresponding index to 1 and all others to 0.
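As an illustration, here is a minimal numpy sketch of VQ coding as in Eq.(1); the function name and the brute-force distance computation are our own choices for exposition:

```python
import numpy as np

def vq_encode(X, B):
    """Vector quantization (Eq.1): a one-hot code at the nearest codebook entry.

    X: (D, N) local features;  B: (D, M) codebook.  Returns C: (M, N).
    """
    # squared Euclidean distance between every base and every feature -> (M, N)
    d2 = ((B[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    C = np.zeros_like(d2)
    C[d2.argmin(axis=0), np.arange(X.shape[1])] = 1.0
    return C
```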
2.2. Sparse Coding

One problem with VQ is that it may cause large quantization errors, meaning that the entry chosen from the codebook cannot represent the feature well. To alleviate this information loss, soft assignment methods have been proposed in which a given feature is encoded with multiple entries instead of a single one. Yang et al. [11] suggest representing a feature with only a small subset of entries so that the code is sparse (i.e., only a small number of entries are nonzero). They fulfill sparse coding (SC) by solving the following ℓ_1-norm regularized least-squares optimization:

$$\arg\min_{C} \sum_{i=1}^{N} \|x_i - B c_i\|_2^2 + \lambda \|c_i\|_1 \qquad (2)$$
where λ is a parameter that balances the reconstruction error against the sparseness of the code. Note that sparse coding [Eq.(2)] has a much looser constraint than VQ [Eq.(1)], meaning that it can achieve smaller quantization errors. Meanwhile, the sparsity of the code tends to capture the saliency of the feature while filtering out noise. SC is therefore of broad interest in image classification and has achieved a high level of performance. In [11], Yang et al. combine SC with SPM, which shows very promising results despite a linear classifier being used. Several authors attempt to design a more suitable dictionary for sparse coding so as to adapt the representation to specific data [16–18]. Recently, Zheng et al. [19] observed that traditional encoding methods fail to consider the geometrical structure of the data, and proposed graph regularized sparse coding. Gao et al. [13] also exploit the dependence among local features and present a Laplacian sparse coding method, which utilizes a Laplacian matrix to characterize the similarity of local features. These extensions of SC have led to state-of-the-art results.

2.3. Locality-constrained Linear Coding

The great power of sparse coding comes with a drawback: it requires solving a computationally expensive ℓ_1-norm optimization problem. Yu et al. [20] found that sparse coding results are commonly local, i.e., nonzero coefficients are often assigned to bases near the encoded feature. Based on this observation, Wang et al. [12] proposed Locality-constrained Linear Coding (LLC), which imposes a locality constraint on the code c_i instead of the sparsity constraint in Eq.(2). Specifically, LLC encodes a set of features by solving the following optimization problem:
$$\arg\min_{C} \sum_{i=1}^{N} \|x_i - B c_i\|_2^2 + \lambda \|d_i \odot c_i\|_2^2 \qquad (3)$$

$$\text{s.t.}\quad \mathbf{1}^\top c_i = 1,\ \forall i$$
where ⊙ denotes element-wise multiplication, and d_i ∈ R^M is a locality adaptor defined as

$$d_i = \exp\!\left(\frac{\mathrm{dist}(x_i, B)}{\sigma}\right) \qquad (4)$$

where dist(x_i, B) = [dist(x_i, b_1), ..., dist(x_i, b_M)]^T, and dist(x_i, b_j) is the Euclidean distance between x_i and b_j, i.e., dist(x_i, b_j) = ‖x_i − b_j‖_2. σ is a parameter adjusting the weight decay speed. In addition, LLC requires the code to sum to 1 through the constraint 1^T c_i = 1. One of several attractive properties of LLC is that it can be solved analytically, so coding can be performed very efficiently. Experimental results show that LLC achieves comparable or even better classification performance than SC.

The importance of locality has been underscored in other work as well. For example, Liu et al. [21] suggest performing soft-assignment coding under a locality constraint, which greatly improves performance. Huang et al. [22] present a method that uses the k nearest bases to compute a salient code, and show that the best performance is achieved when k ranges from 2 to 10. Chao et al. [23] combine data locality and group sparsity into a unified optimization framework so as to generate a locality- and group-sensitive sparse representation. Shabou and LeBorgne [24] present a formalism that implicitly preserves the locality and similarity constraints in both the feature space and the spatial domain of the image.

3. Locality-constrained Coding with Laplacian Regularization

To achieve good classification performance, two criteria are of particular importance for coding: i) the code must achieve relatively low reconstruction error; ii) similar features must have similar codes. Both SC and LLC use multiple bases to represent a given feature so as to reduce reconstruction error, which has been shown to boost classification performance. Meanwhile, by using the locality constraint, LLC is more likely to generate similar codes for similar features, whereas SC is sensitive to the variance of the features [25] and may select quite different bases in favor of sparsity. This partially explains the superiority of LLC.

Note that VQ, SC and LLC all disregard the correlations between features (and therefore the correlations between codes), and encode each feature independently. This may cause inconsistency of the codes, i.e., similar features may receive quite different codes. To ensure that similar features are encoded similarly, an explicit expression of the dependencies among features can be helpful. Fig.1 illustrates the difference between independent and dependent coding. In independent coding, each feature is encoded independently, so there may be a large difference between the codes of similar features; in dependent coding, the codes of different features are related directly or indirectly so as to maintain consistency among them.
Figure 1: Independent coding vs. dependent coding. (a) Independent coding: each feature is encoded independently. (b) Dependent coding: the code of the feature x_i is related to the codes of other features from the same image. (c) Codes are related indirectly through a set of template features.
3.1. LapLLC formulation

Recently, Gao et al. [26] exploited the dependence among local features to alleviate the "dissimilarity" problem of SC. They used a Laplacian matrix to characterize the similarity of local features, and incorporated this Laplacian matrix into the objective function of sparse coding to preserve the consistency of similar local features in the sparse representation. Zheng et al. [19] also present graph regularized sparse coding, where the graph Laplacian is utilized as a smoothing operator so that the sparse representations vary smoothly along the geodesics of the data manifold. Both Zheng et al. and Gao et al. emphasize the consistency of the codes of similar features, but are mainly concerned with this property in sparse coding.

In this section, we extend LLC by exploiting a similar idea. Specifically, we explicitly impose "similarity constraints" on the LLC codes of similar features so as to ensure their similarity. A Laplacian matrix is employed to characterize the similarity of local features, and Laplacian regularization is added to smooth the corresponding codes. This way, features can be encoded more consistently. We refer to this idea as LapLLC in the following discussion. Compared to graph regularized sparse coding, LapLLC has the advantage of admitting an analytical solution, and thus features can be encoded more efficiently. We first give a detailed formalization of LapLLC below, and then discuss its solution in Section 3.2.

Let Ω = {ω_ij} be a metric matrix that characterizes the similarity of local features, where ω_ij is a similarity metric between feature i and feature j. The Laplacian matrix is then defined as L = Λ − Ω, where Λ is a diagonal matrix with Λ_ii = Σ_j ω_ij. We encode features by solving the following Laplacian regularized optimization problem:
$$\arg\min_{C} \sum_{i=1}^{N} \left( \|x_i - B c_i\|_2^2 + \lambda \|d_i \odot c_i\|_2^2 + \gamma \sum_{j=1}^{N} \omega_{ij} \|c_i - c_j\|_2^2 \right) \qquad (5)$$

$$= \arg\min_{C} \sum_{i=1}^{N} \|x_i - B c_i\|_2^2 + \lambda \sum_{i=1}^{N} \|d_i \odot c_i\|_2^2 + \gamma \sum_{i=1}^{N} \sum_{j=1}^{N} \omega_{ij} \|c_i - c_j\|_2^2 \qquad (6)$$

$$= \arg\min_{C} \|X - BC\|_F^2 + \lambda \sum_{i=1}^{N} \|d_i \odot c_i\|_2^2 + \gamma\, \mathrm{tr}(C L C^\top) \qquad (7)$$
where λ and γ are parameters balancing the reconstruction error, locality and code similarity, and ‖·‖_F denotes the matrix Frobenius norm. In the original work on LLC, the authors show that equal performance is achieved whether the problem is unconstrained or the shift-invariance constraint (i.e., 1^T c = 1) is applied; whether the code is normalized is therefore not essential. To make Eq.(7) easy to solve, we remove the shift-invariance constraint in the LapLLC formulation.

For the similarity between features, we consider the χ² kernel, which is a good distance metric for features described as histograms. Given two features x_i and x_j, the χ² distance and χ² kernel are computed respectively as

$$\pi(x_i, x_j) = \frac{1}{2} \sum_{k=1}^{D} \frac{(x_{ik} - x_{jk})^2}{x_{ik} + x_{jk}} \qquad (8)$$

$$\kappa(x_i, x_j) = \exp\left(-\pi(x_i, x_j)/\sigma\right) \qquad (9)$$

where σ is a scaling parameter. In [27], Zhang et al. empirically found that setting σ to the mean value of the χ² distances between all features gives comparable results; we follow this setting in our work.
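For concreteness, here is a minimal numpy sketch of the similarity matrix Ω and the Laplacian L = Λ − Ω built from Eqs.(8) and (9); the function names and the small eps guard against empty histogram bins are our own additions:

```python
import numpy as np

def chi2_distance(xi, xj, eps=1e-10):
    """Chi-square distance of Eq.(8) for histogram features."""
    return 0.5 * np.sum((xi - xj) ** 2 / (xi + xj + eps))

def laplacian_matrix(X):
    """Similarity matrix Omega via the kernel of Eq.(9), and L = Lambda - Omega.

    X: (D, N) features. sigma is set to the mean chi-square distance,
    following the setting of Zhang et al. [27].
    """
    N = X.shape[1]
    pi = np.array([[chi2_distance(X[:, i], X[:, j]) for j in range(N)]
                   for i in range(N)])
    sigma = pi.mean()
    Omega = np.exp(-pi / sigma)
    return np.diag(Omega.sum(axis=1)) - Omega
```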
3.2. Solving LapLLC

Let us first consider the case of encoding a single feature. Specifically, we assume that only one of the feature codes is unknown, whereas the codes of all other features are given. Under this assumption, and supposing that x_i is the feature to be encoded, LLC coding with Laplacian regularization can be expressed as follows:
$$\arg\min_{c_i} \|x_i - B c_i\|_2^2 + \lambda \|d_i \odot c_i\|_2^2 + \gamma \left( 2 c_i^\top C L_i - c_i^\top L_{ii} c_i \right) \qquad (10)$$
Denoting the objective function of Eq.(10) by J(c_i) and taking the first derivative of J with respect to c_i, we arrive at
$$\frac{\partial J}{\partial c_i} = 2\left( B^\top B c_i - B^\top x_i \right) + 2\lambda\, \mathrm{diag}^2(d_i)\, c_i + 2\gamma \left( C_{-i} L_{i,-i} + L_{i,i}\, c_i \right) \qquad (11)$$
where diag(·) is an operator that creates a diagonal matrix from a vector of diagonal entries, C_{−i} denotes the submatrix of C with the i-th column removed, and L_{i,−i} denotes the i-th row of L with the i-th entry removed. Setting ∂J/∂c_i = 0, we obtain the following closed-form solution for c_i:
$$c_i = \left( B^\top B + \lambda\, \mathrm{diag}^2(d_i) + \gamma L_{ii} I \right)^{-1} \left( B^\top x_i - \gamma\, C_{-i} L_{i,-i} \right) \qquad (12)$$

where I is the identity matrix of the appropriate size.
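As an illustration, a minimal numpy sketch of this closed-form coding step; the function name and argument layout are our own, and np.linalg.solve is used rather than an explicit matrix inverse:

```python
import numpy as np

def lapllc_encode(x, B, d, C_rest, L_rest, L_ii, lam, gamma):
    """Closed-form LapLLC code for one feature (Eq.12).

    x: (D,) feature            B: (D, M) codebook
    d: (M,) locality adaptor of Eq.(4)
    C_rest: (M, N-1) codes of the other features (C_{-i})
    L_rest: (N-1,) Laplacian entries L_{i,-i};  L_ii: scalar entry L_{i,i}
    """
    M = B.shape[1]
    A = B.T @ B + lam * np.diag(d ** 2) + gamma * L_ii * np.eye(M)
    b = B.T @ x - gamma * (C_rest @ L_rest)
    return np.linalg.solve(A, b)  # solves A c_i = b
```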
To encode a set of features, we can consider one feature at a time and solve the coding problem via Eq.(12). However, since the codes are correlated, the features must be encoded iteratively until the objective function converges. We revisit this issue in Section 5.

4. Codebook Optimization

In the previous sections, we assumed that the codebook is given. A simple way to generate a codebook is to use a clustering method such as K-means [9]. Though codebooks generated by such techniques usually give satisfactory results, they are not optimal, because the generation process entirely ignores the properties of the codes. In this section, we present a codebook learning method which better exploits these properties. Specifically, given a set of features X, the codebook B and the code set C are tuned simultaneously to minimize the objective function in Eq.(7). Formally, we solve the following optimization problem:
$$\arg\min_{B,C} \|X - BC\|_F^2 + \lambda \sum_{i=1}^{N} \|d_i \odot c_i\|_2^2 + \gamma\, \mathrm{tr}(C L C^\top) \qquad (13)$$

$$\text{s.t.}\quad \|b_m\|_2^2 \le 1,\quad \forall m = 1, 2, \dots, M$$
Note that the constraint ‖b_m‖_2^2 ≤ 1 is added to each entry of the codebook to avoid a scaling problem. The optimization problem in Eq.(13) is not convex when both B and C are unknown, but it is convex in either one when the other is fixed. Therefore, we optimize B and C alternately. Specifically, when B is fixed, we solve the coding problem with the given codebook as described in Section 3.2. When C is fixed, we update the codebook by optimizing the following objective using conjugate gradient descent [28]:
$$\arg\min_{B} \|X - BC\|_F^2 \qquad \text{s.t.}\quad \|b_m\|_2^2 \le 1,\ \forall m = 1, 2, \dots, M \qquad (14)$$
More details of the optimization process are given in Algorithm 1; a small sketch of the codebook update step follows the listing. We use a codebook trained by K-means clustering to initialize B in our work.
Algorithm 1 Codebook Learning Algorithm for LapLLC
Input: feature set X ∈ R^{D×N}, initial codebook B_init ∈ R^{D×M}, parameters λ and γ
Output: optimized codebook B, corresponding code set C
1: Let B := B_init
2: Construct the similarity matrix Ω, where ω_ij is computed as in Eq.(16)
3: Compute the Laplacian matrix L = Λ − Ω, where Λ_ii = Σ_j ω_ij
4: // Outer loop: solve B and C alternately
5: while not converged do
6:    Construct the matrix D, where d_i is computed as in Eq.(4)
7:    // Inner loop: encode features iteratively
8:    while not converged do
9:        for each i ∈ [1, N] do
10:           Encode feature x_i to get code c_i according to Eq.(12)
11:       end for
12:   end while
13:   Let C := [c_1, c_2, ..., c_N]
14:   Update codebook B by solving Eq.(14)
15: end while
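As a sketch of the codebook update step (Eq.(14)), the following substitutes plain projected gradient descent for the conjugate gradient method of [28]; the step size and iteration count are illustrative assumptions:

```python
import numpy as np

def update_codebook(X, C, B, step=1e-3, iters=100):
    """Approximately solve Eq.(14) by projected gradient descent
    (the paper itself uses conjugate gradient [28])."""
    for _ in range(iters):
        grad = -2.0 * (X - B @ C) @ C.T           # gradient of ||X - BC||_F^2
        B = B - step * grad
        norms = np.maximum(np.linalg.norm(B, axis=0), 1.0)
        B = B / norms                             # project onto ||b_m||_2 <= 1
    return B
```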
5. Approximated LapLLC Coding

As stated in Section 3, given a set of features X from a test image and a trained codebook B, we can encode the features by solving Eq.(7). Doing so, however, still has two limitations. First, similarity is ensured only among the codes of features extracted from the test image itself; inconsistency may still exist between codes from different sets (i.e., the training set and the test set). Second, the codes of the features are correlated, so the coding problem has to be solved iteratively.

To address these two issues, we introduce a new set of features, called template features, following the idea of Gao et al. [13]. Template features are a subset of features randomly selected from the training data. During codebook learning, the template features and their corresponding codes are retained. Afterward, similarity constraints are imposed between the codes of test features and template features (see also Fig.1(c)). This way, not only are the codes decoupled so that we can consider one feature at a time, but more consistency is also achieved among the codes. Specifically, we build a Laplacian matrix from the selected template features as described in Section 3.1, and use it for regularization. Letting S = [s_1, s_2, ..., s_P] ∈ R^{M×P} denote the set of codes corresponding to the P template features, we rewrite the objective in Eq.(10) as:

$$\arg\min_{c_i} \|x_i - B c_i\|_2^2 + \lambda \|d_i \odot c_i\|_2^2 + \gamma \left( 2 c_i^\top S L_i + c_i^\top L_{ii} c_i \right) \qquad (15)$$
Now code c_i is independent of the other codes c_j, j ≠ i, and the solution of Eq.(15) can be found analytically in the same way as described before. What is more, since the
template codes are all known beforehand, there is no need to encode iteratively when a set of features is to be encoded. Features can therefore be encoded very efficiently.

Note that not all template features are informative for encoding a specific feature; we choose a small subset with the greatest similarity. In other words, we use the k-NN method to construct the similarity matrix Ω, whose entries are set as follows:

$$\omega_{ij} = \begin{cases} \kappa(x_i, x_j) & \text{if } x_j \in N(x_i) \text{ or } x_i \in N(x_j) \\ 0 & \text{otherwise} \end{cases} \qquad (16)$$

where N(x_i) denotes the nearest-neighbor set of x_i, and κ(·,·) is the χ² similarity kernel discussed before. Following a similar idea, we also perform coding with only the subset of bases that are the k nearest neighbors of the feature, since under the locality constraint only a few significant values appear at bases close to the encoded feature. In the final implementation, therefore, only a much smaller linear system has to be solved, which further improves efficiency. FLANN [29] is used for fast approximate nearest-neighbor lookup in our work. A minimal sketch of this approximated coding step is given below.
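The following numpy sketch illustrates the approximated coding of a single feature under the assumptions above; plain sorting stands in for the FLANN lookup, and the parameter names (including separate scaling parameters for Eq.(4) and Eq.(9)) are our own:

```python
import numpy as np

def approx_lapllc_encode(x, B, T, S, sigma_d, sigma_w, lam, gamma,
                         k_base=10, k_tpl=5):
    """Approximated LapLLC code for one feature x (Eq.15, k-NN variant).

    B: (D, M) codebook         T: (D, P) template features
    S: (M, P) template codes.  Plain argsort stands in for FLANN [29].
    """
    # k nearest bases (locality constraint)
    dist_b = np.linalg.norm(B - x[:, None], axis=0)
    idx_b = np.argsort(dist_b)[:k_base]
    Bk = B[:, idx_b]
    d = np.exp(dist_b[idx_b] / sigma_d)           # locality adaptor, Eq.(4)

    # k most similar template features and their weights (Eqs. 8, 9 and 16)
    chi2 = 0.5 * ((T - x[:, None]) ** 2 / (T + x[:, None] + 1e-10)).sum(axis=0)
    idx_t = np.argsort(chi2)[:k_tpl]
    w = np.exp(-chi2[idx_t] / sigma_w)            # omega entries

    # small linear system from setting the derivative of Eq.(15) to zero;
    # off-diagonal Laplacian entries equal -omega, hence the + sign below
    L_ii = w.sum()                                # diagonal Laplacian entry
    A = Bk.T @ Bk + lam * np.diag(d ** 2) + gamma * L_ii * np.eye(k_base)
    b = Bk.T @ x + gamma * (S[idx_b][:, idx_t] @ w)
    c = np.zeros(B.shape[1])
    c[idx_b] = np.linalg.solve(A, b)
    return c
```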
6. Experiments

In this section, we evaluate our method on several publicly available datasets: the UIUC-Sport, Scene 15, Caltech-101 and Pascal VOC2007 datasets. We report the performance of the proposed method on these datasets and compare it with other well-known methods.

6.1. Experimental Settings

The local feature descriptor is essential to successful image classification. Many feature descriptors have been proposed, amongst which the Scale Invariant Feature Transform (SIFT) descriptor has shown excellent performance. In this work, we adopt SIFT for local feature description. Since dense SIFT has been shown to achieve superior results compared with sparse SIFT, we also extract features in dense mode. Specifically, SIFT features are extracted from patches densely located on the image. In our setup, we use patches of three different sizes, 16 × 16, 24 × 24 and 32 × 32, with step sizes of 8, 12 and 16 respectively.

For codebook learning, we randomly choose ~100,000 features, and the size of the codebook is fixed at 1024. After training, 50,000 features are randomly selected as template features, and their corresponding codes are kept. For approximate coding, 10 bases and 5 template features (with their corresponding codes) are chosen for each feature according to the nearest-neighbor criterion, and features are encoded with the selected bases while using the selected codes as constraints (see Section 5). The numbers of bases and template features are chosen empirically here, based on the results reported in [12] and [13], and also achieve comparable results in our experiments. For the coding parameters λ and γ, we evaluated their effect on the UIUC-Sport dataset and fixed them for the other datasets.

To create a more discriminative representation and achieve better classification performance, we apply the SPM technique after coding. Specifically, the whole image is partitioned into increasingly finer spatial sub-regions, and the codes in each sub-region are pooled to form a histogram. The histograms from all sub-regions are then concatenated to generate the final image representation. Two pooling methods are commonly used with SPM: SUM pooling, which performs element-wise sums over a set of codes, and MAX pooling, which applies an element-wise "max" instead of the sum. Several previous works have shown that MAX pooling achieves better performance than SUM pooling, which may be partially explained by the analogy between MAX pooling and the biophysical mechanisms of the primary visual cortex (V1). We therefore use MAX pooling to aggregate the codes, as in the sketch at the end of this subsection.

For classification, we use a one-vs-all linear SVM due to its speed and excellent performance [11, 13]. We use only the grayscale information of the images, even when color images are available. As a preprocessing step, all images are resized to be no larger than 300 × 300 pixels with preserved aspect ratio. For all experiments, we perform 10 rounds of testing with different randomly selected training and test images to obtain reliable results; the exception is the Pascal VOC2007 dataset, where the training and test sets are given.
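As a minimal sketch of SPM with MAX pooling, assuming a three-level pyramid (1×1, 2×2, 4×4) and feature coordinates recorded at extraction time:

```python
import numpy as np

def spm_max_pool(C, xy, width, height, levels=(1, 2, 4)):
    """Concatenate MAX-pooled codes over spatial pyramid cells.

    C: (M, N) codes of N local features;  xy: (N, 2) patch centers.
    """
    pooled = []
    for s in levels:
        col = np.minimum(xy[:, 0] * s // width, s - 1).astype(int)
        row = np.minimum(xy[:, 1] * s // height, s - 1).astype(int)
        for r in range(s):
            for c in range(s):
                mask = (row == r) & (col == c)
                cell = (C[:, mask].max(axis=1) if mask.any()
                        else np.zeros(C.shape[0]))
                pooled.append(cell)
    return np.concatenate(pooled)
```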
6.2. UIUC-Sport Dataset

The UIUC-Sport dataset [30] consists of 1,792 images from eight sport categories: badminton, bocce, croquet, polo, rock climbing, rowing, sailing, and snowboarding. The number of images per category ranges from 137 to 250. To be consistent with previous studies [26], we randomly select 70 images from each class for training and 60 images for testing.

We first evaluate the parameters of LapLLC. To ease the evaluation, we fix one parameter (λ or γ) while evaluating the other. The classification performance under different parameter settings is shown in Fig.2.
Figure 2: The effect of parameters λ and γ. (a) γ = 0: different choices of λ have no evident influence on the performance; (b) λ = 0.01: moderate values of γ (0.1–0.4) achieve good performance.
Fig.2(a) shows the performance when γ = 0, where different choices of λ (ranging from 0.001 to 1000) have no evident influence. The reason may be that when λ approaches zero (note that this is effectively the choice made in approximated LLC [12], with the constraint that only a few nearby bases are used for coding), it favors codes with good approximation, whereas when λ takes a relatively large value, it tends to emphasize saliency in the coding [22]. Both techniques have been shown to work well, and both achieve their best performance when the number of selected bases is about 2 to 10. It is therefore reasonable to believe that locality is the key; the work of Liu et al. [21] provides further solid support for this conclusion.

Besides locality, the importance of consistency is also evident. Fig.2(b) shows the performance when λ = 0.01, where LapLLC achieves its best performance at γ = 0.3. The influence of γ can be explained as follows (using γ = 0.3 as a baseline): with a smaller γ, the code achieves a lower reconstruction error but shares less code similarity; with a larger γ, it prefers codes with higher similarity but sacrifices low reconstruction error to some extent. Therefore, smaller or larger values of γ give inferior performance by putting too little or too much emphasis on code similarity. Note that locality is already well characterized by the selected nearby bases, so the value of λ is not essential, though it can affect the choice of γ. We use λ = 0.01 and γ = 0.3 for all experiments.

We evaluate the proposed method on the selected data and compare it with several previous works. The classification performance is listed in Table 1; the proposed method outperforms all other methods on this dataset.

Table 1: Performance comparison with previous studies on the UIUC-Sport dataset

Method                 Classification Rate
Wu et al. [31]         83.54 ± 1.13
Nakayama et al. [32]   84.40 ± 1.40
Yang et al. [11]       82.74 ± 1.46
Wang et al. [12]       83.27 ± 1.43
Huang et al. [22]      82.85 ± 1.52
Gao et al. [13]        85.18 ± 0.46
Ours                   85.73 ± 1.14

To examine the performance of each category, we also draw the confusion matrix, shown in Fig.3. Bocce and croquet are the most confused pair, which is quite reasonable because they share very similar scenes and backgrounds. Fig.4 shows some example images from these two categories.
Figure 3: Confusion matrix for the UIUC-Sport dataset. The average classification rate is 85.73 ± 1.14%. For clarity, only rates greater than 5% are shown.
Figure 4: Example images from the bocce and croquet categories, which share very similar scenes or backgrounds. Top: images from the bocce category; Bottom: images from the croquet category.
Table 2: Performance comparison with previous studies on the Scene 15 dataset

Method                 Average Classification Rate
Lazebnik et al. [9]    81.40 ± 0.50
Gemert et al. [33]     76.67 ± 0.39
Bosch et al. [34]      83.70 ± (−)
Wu et al. [31]         84.10 ± (−)
Yang et al. [11]       80.28 ± 0.93
Gao et al. [35]        83.68 ± 0.61
Huang et al. [22]      82.55 ± 0.41
Gao et al. [13]        89.75 ± 0.50
Ours                   85.0 ± 0.71
6.3. Scene 15 Dataset

Scene 15 is a benchmark dataset for scene classification compiled by several researchers [9, 36, 37]. The dataset contains a total of 4,485 grayscale images in 15 categories, with 200 to 400 images per category. The dataset has very diverse content: it contains images captured in different environments (natural outdoor, man-made outdoor and man-made indoor), and the 15 categories vary from forest and coast to office and living room. Following the setting of Lazebnik et al. [9], we randomly select 100 images per class for training and use the rest for testing.

The performance of the different methods is listed in Table 2. The proposed method outperforms all other methods except that of Gao et al. [13], which is, to our knowledge, the best result reported on this dataset when only a single kind of feature is used. Nevertheless, our method is more efficient, since we encode features analytically while they must solve computationally expensive ℓ_1-norm minimization problems.

To examine the per-category performance and how the categories are confused, we also compute the confusion matrix, shown in Fig.5. Additionally, we divided the 15 basic-level categories into 3 superordinate-level categories (Natural outdoor, Man-made outdoor and Man-made indoor), and found that the Natural outdoor category has the best performance, Man-made outdoor a weaker one, and Man-made indoor the worst; see Table 3 for details. In other words, indoor scenes are more likely to be confused than outdoor scenes. The reason may be that the SPM technique used in our method captures global spatial properties, which are essential for natural scene classification, whereas some indoor scenes (such as living room, bedroom and kitchen) may be better characterized by the objects they contain [38]. Indeed, if images from these indoor categories are scaled down so that the objects in them cannot easily be recognized, even humans are apt to confuse one with another.
Table 3: The relationship between basic-level and superordinate-level categories, and the performance on the superordinate-level categories.

Superordinate-level category   Basic-level categories                               Average Precision
Natural Outdoor                MITcoast, MITforest, MITmountain,                    92.77%
                               MITopencountry
Man-made Outdoor               MIThighway, MITinsidecity, MITstreet,                85.85%
                               MITtallbuilding
Man-made Indoor                CALsuburb, industrial, PARoffice, bedroom,           77.76%
                               livingroom, kitchen, store
Table 4: Performance comparison with previous studies on the Caltech-101 dataset

Training examples       5     10    15    20    25    30
Zhang et al. [39]      46.6  55.8  59.1  62.0   –    66.2
Lazebnik et al. [9]     –     –    56.4   –     –    64.6
Griffin et al. [40]    44.2  54.5  59.0  63.3  65.8  67.6
Yang et al. [11]        –     –    67.0   –     –    73.2
Wang et al. [12]       51.2  59.8  65.4  67.7  70.2  73.4
Ours                   53.4  62.9  67.5  69.2  71.6  74.0
Figure 5: Confusion matrix for the Scene 15 dataset. The average classification rate is 85.0 ± 0.71%. For clarity, only rates greater than 5% are shown.
6.4. Caltech-101 Dataset

The Caltech-101 dataset [41] consists of 9,144 images from 101 object categories, including animals, vehicles, flowers, etc. There is also an additional background class, bringing the total number of classes to 102. The number of images per category ranges from 31 to 800. Following the common experimental setup, different numbers of images (from 5 to 30 samples per class) are used for training, and no more than 50 images per class are used for testing.

We compare our method with several existing approaches; the average classification performances are listed in Table 4. The proposed method achieves better results than the compared methods, especially when the number of training samples is small. In addition, the confusion matrix is shown in Fig.6, and some example images with the recognition rates of the corresponding classes are shown in Fig.7. Some object categories (such as water lily, octopus and chairs) achieve very promising results (100%), while others (such as brain, dalmatian and yin yang) perform quite poorly. Note that the background category has a very low classification rate (17.6%). This result is intuitive, since images in the background category were randomly collected from Google and share no common discriminative features.
Figure 6: Confusion matrix for the Caltech-101 dataset when 30 images per category are used for training. The average classification rate is 74.0%. (a) Confusion matrix for all categories; (b) confusion matrix for a subset of categories, corresponding to the first 10 entries in (a). Only rates greater than 1% are shown.
Figure 7: Some typical results on the Caltech-101 dataset. The five classes on which our method performs best are water lily (100%), octopus (100%), accordion (100%), chairs (100%) and okapi (99.6%); the five on which it performs worst are background (17.6%), brain (23.9%), yin yang (31.8%), dalmatian (30.4%) and panda (35.9%).
Figure 8: Example images from the Pascal VOC2007 dataset, labeled (bicycle, bus, car, person), (bottle, chair, table, person), (car, dog, horse, person), (cat, dog, person, sofa), (chair, table, person, plant) and (chair, plant, sofa, tv). Note that each of these images falls into 4 categories simultaneously, and the object to be recognized may take up only a very small portion of the image.
Table 5: Performance comparison with previous studies on the Pascal VOC2007 dataset

Category     aero   bicycle  bird   boat   bottle  bus    car
Winner       77.5   63.6     56.1   71.9   33.1    60.6   78.0
Wang [12]    74.8   65.2     50.7   70.9   28.7    68.8   78.5
Huang [22]   71.3   64.2     45.5   67.4   29.8    63.9   78.2
Ours         73.7   63.4     52.1   69.2   31.5    68.4   79.1

Category     cat    chair    cow    table  dog     horse  mbike
Winner       58.8   53.5     42.6   54.9   45.8    77.5   64.0
Wang [12]    61.7   54.3     48.6   51.8   44.1    76.6   66.9
Huang [22]   59.2   53.6     43.3   48.2   43.8    76.2   66.4
Ours         60.5   54.9     47.9   52.7   44.3    76.8   65.8

Category     person plant    sheep  sofa   train   tv     mean
Winner       85.9   36.3     44.7   50.9   79.2    53.2   59.4
Wang [12]    83.5   30.8     44.6   53.4   78.2    53.5   59.3
Huang [22]   82.9   29.1     46.5   52.4   76.1    52.0   57.5
Ours         85.0   31.7     46.9   53.3   77.4    52.2   59.3
6.5. Pascal VOC2007 Dataset

The Pascal VOC2007 dataset was collected for the PASCAL visual object challenge [42]. It contains 9,963 images in 20 object categories. The images are everyday photos obtained from Flickr, with large variations in size, illumination, viewpoint, clutter, etc. Moreover, images in this dataset often contain multiple objects and hence fall into multiple categories. Although each object category is trained and tested individually, the task remains very challenging, since the object to be recognized may take up only a small portion of the image. Fig.8 shows some example images, each belonging to 4 categories simultaneously.

We test our method on this dataset and report the Average Precision (AP), the standard metric of the PASCAL challenge. Table 5 lists our scores for all 20 object categories together with several other results reported on this dataset, including the best performance achieved in the 2007 challenge [42]. Our proposed method performs as well as LLC and matches the best performance reported in the challenge.

7. Conclusion

In this paper, we propose a novel method for efficient feature coding, named Laplacian Regularized Locality-constrained Coding (LapLLC). LapLLC emphasizes two characteristics of the generated codes: low reconstruction error and consistency. In LapLLC, the reconstruction error is minimized by representing each feature as a linear combination of nearby bases, and code consistency is preserved by incorporating Laplacian regularization. By introducing a set of template features, the objective function of LapLLC is decoupled, so each feature can be encoded by solving a linear system analytically. In addition, we present an approximate implementation of LapLLC in which the k-NN technique is employed to further improve efficiency. Experiments on several publicly available datasets show the effectiveness of the proposed method.

Acknowledgements

This work is partially supported by the NNSF of China (No. 61372140, 61300135, 61005061), the Foundation for Distinguished Young Talents in Higher Education of Guangdong (No. LYM08015) and the Natural Science Foundation of Guangdong Province of China (No. 9251064101000010).

References

[1] R. T. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, et al., A system for video surveillance and monitoring, Vol. 2, Carnegie Mellon University, the Robotics Institute, Pittsburgh, 2000.

[2] O. Javed, M. Shah, Tracking and object classification for automated surveillance, in: Computer Vision - ECCV 2002, Springer, 2002, pp. 343–357.
[3] A. Vailaya, M. A. Figueiredo, A. K. Jain, H.-J. Zhang, Image classification for content-based indexing, Image Processing, IEEE Transactions on 10 (1) (2001) 117–130.

[4] G. Carneiro, A. B. Chan, P. J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, Pattern Analysis and Machine Intelligence, IEEE Transactions on 29 (3) (2007) 394–410.

[5] I. Ulrich, I. Nourbakhsh, Appearance-based place recognition for topological localization, in: Robotics and Automation, 2000 IEEE International Conference on, Vol. 2, IEEE, 2000, pp. 1023–1029.

[6] O. Booij, B. Terwijn, Z. Zivkovic, B. Krose, Navigation using an appearance based topological map, in: Robotics and Automation (ICRA), 2007 IEEE International Conference on, IEEE, 2007, pp. 3927–3932.

[7] A. Bosch, X. Muñoz, R. Martí, Which is the best way to organize/classify images by content?, Image and Vision Computing 25 (6) (2007) 778–791.

[8] J. Sivic, A. Zisserman, Video Google: A text retrieval approach to object matching in videos, in: Computer Vision (ICCV), 2003 IEEE International Conference on, IEEE, 2003, pp. 1470–1477.

[9] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: Computer Vision and Pattern Recognition (CVPR), 2006 IEEE Conference on, Vol. 2, IEEE, 2006, pp. 2169–2178.

[10] K. Grauman, T. Darrell, The pyramid match kernel: Discriminative classification with sets of image features, in: Computer Vision (ICCV), 2005 IEEE International Conference on, Vol. 2, IEEE, 2005, pp. 1458–1465.

[11] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, IEEE, 2009, pp. 1794–1801.

[12] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 3360–3367.

[13] S. Gao, I. Tsang, L. Chia, Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications, Pattern Analysis and Machine Intelligence, IEEE Transactions on 35 (1) (2013) 92–104.

[14] C. Zhang, S. Wang, Q. Huang, J. Liu, C. Liang, Q. Tian, Image classification using spatial pyramid robust sparse coding, Pattern Recognition Letters (2013) 1046–1052.
[15] Y. Huang, Z. Wu, L. Wang, T. Tan, Feature coding in image classification: A comprehensive study, Pattern Analysis and Machine Intelligence, IEEE Transactions on (2013), accepted for publication.

[16] M. Aharon, M. Elad, A. Bruckstein, K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation, Signal Processing, IEEE Transactions on 54 (11) (2006) 4311–4322.

[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization and sparse coding, The Journal of Machine Learning Research 11 (2010) 19–60.

[18] Z. Jiang, Z. Lin, L. S. Davis, Learning a discriminative dictionary for sparse coding via label consistent K-SVD, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 1697–1704.

[19] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, D. Cai, Graph regularized sparse coding for image representation, Image Processing, IEEE Transactions on 20 (5) (2011) 1327–1336.

[20] K. Yu, T. Zhang, Y. Gong, Nonlinear learning using local coordinate coding, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 2223–2231.

[21] L. Liu, L. Wang, X. Liu, In defense of soft-assignment coding, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 2486–2493.

[22] Y. Huang, K. Huang, Y. Yu, T. Tan, Salient coding for image classification, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 1753–1760.

[23] Y.-W. Chao, Y.-R. Yeh, Y.-W. Chen, Y.-J. Lee, Y.-C. Wang, Locality-constrained group sparse representation for robust face recognition, in: Image Processing (ICIP), 2011 18th IEEE International Conference on, IEEE, 2011, pp. 761–764.

[24] A. Shabou, H. LeBorgne, Locality-constrained and spatially regularized coding for scene categorization, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 3618–3625.

[25] K. Kavukcuoglu, M. Ranzato, R. Fergus, Y. LeCun, Learning invariant features through topographic filter maps, in: Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, IEEE, 2009, pp. 1605–1612.

[26] S. Gao, I. W. Tsang, L.-T. Chia, P. Zhao, Local features are not lonely - Laplacian sparse coding for image classification, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 3555–3561.
[27] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for classification of texture and object categories: A comprehensive study, International Journal of Computer Vision 73 (2) (2007) 213–238.

[28] H. Lee, A. Battle, R. Raina, A. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems (NIPS), 2006, pp. 801–808.

[29] M. Muja, D. G. Lowe, Fast approximate nearest neighbors with automatic algorithm configuration, in: International Conference on Computer Vision Theory and Applications (VISAPP'09), INSTICC Press, 2009, pp. 331–340.

[30] L.-J. Li, L. Fei-Fei, What, where and who? Classifying events by scene and object recognition, in: Computer Vision (ICCV), 2007 IEEE International Conference on, IEEE, 2007, pp. 1–8.

[31] J. Wu, J. M. Rehg, Beyond the Euclidean distance: Creating effective visual codebooks using the histogram intersection kernel, in: Computer Vision (ICCV), 2009 IEEE International Conference on, IEEE, 2009, pp. 630–637.

[32] H. Nakayama, T. Harada, Y. Kuniyoshi, Global Gaussian approach for scene categorization using information geometry, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 2336–2343.

[33] J. C. van Gemert, J.-M. Geusebroek, C. J. Veenman, A. W. Smeulders, Kernel codebooks for scene categorization, in: Computer Vision - ECCV 2008, Springer, 2008, pp. 696–709.

[34] A. Bosch, A. Zisserman, X. Muñoz, Scene classification using a hybrid generative/discriminative approach, Pattern Analysis and Machine Intelligence, IEEE Transactions on 30 (4) (2008) 712–727.

[35] S. Gao, I. W.-H. Tsang, L.-T. Chia, Kernel sparse representation for image classification and face recognition, in: Computer Vision - ECCV 2010, Springer, 2010, pp. 1–14.

[36] A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, International Journal of Computer Vision 42 (3) (2001) 145–175.

[37] L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on, Vol. 2, IEEE, 2005, pp. 524–531.

[38] N. Silberman, R. Fergus, Indoor scene segmentation using a structured light sensor, in: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, IEEE, 2011, pp. 601–608.
[39] H. Zhang, A. C. Berg, M. Maire, J. Malik, SVM-KNN: Discriminative nearest neighbor classification for visual category recognition, in: Computer Vision and Pattern Recognition (CVPR), 2006 IEEE Conference on, Vol. 2, IEEE, 2006, pp. 2126–2136.

[40] G. Griffin, A. Holub, P. Perona, Caltech-256 object category dataset, Technical Report 7694, California Institute of Technology, 2007.

[41] L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories, Computer Vision and Image Understanding 106 (1) (2007) 59–70.

[42] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.