Neurocomputing 148 (2015) 467–476
A novel topic feature for image scene classification

Mujun Zang a, Dunwei Wen b,*, Ke Wang a, Tong Liu a, Weiwei Song a

a College of Communication Engineering, Jilin University, China
b School of Computing and Information Systems, Athabasca University, Alberta, Canada
* Corresponding author. Tel.: +1 780 423 7907. E-mail addresses: [email protected] (M. Zang), [email protected] (D. Wen), [email protected] (K. Wang), [email protected] (T. Liu), [email protected] (W. Song).
Article history: Received 6 December 2013; Received in revised form 14 June 2014; Accepted 7 July 2014; Available online 19 July 2014. Communicated by D. Wang.

Abstract
We propose a novel topic feature for image scene classification. The feature is defined based on a thematic representation of images constructed by using topics, i.e., the latent variables of LDA (latent Dirichlet allocation), and their learning algorithms. Different from related works, the feature defined in this paper shares topics across classes and does not need class labels before classification, so it avoids coupling between features and labels. For representing a new image, our approach directly extracts its topic feature by a linear mapping of codewords instead of latent variable inference. We compared our method with three other topic models under similar experimental conditions, as well as with pooling methods on the 15 Scenes dataset. The results show that our approach classifies the scene classes with higher accuracy than the other topic models and than pooling methods that do not use spatial information. We also observe that the performance improvement is due to the proposed feature and our algorithm, rather than other factors such as additional low-level image features or stronger preprocessing. © 2014 Elsevier B.V. All rights reserved.
Keywords: image scene classification; topic features; LDA model; Gibbs sampler
1. Introduction

Image scene classification is an important problem in computer vision and machine learning. Automatically obtaining semantic information from images has become indispensable and has wide applications in many real-world information systems. However, scene classification faces the challenges of variability, ambiguity and a wide range of illumination [1]. At present, three typical strategies can be found in the task of image scene classification. The first strategy considers images as individual objects and directly classifies them through low-level features such as color, texture and power spectrum; this approach is normally used to classify only a small number of scene categories [2]. The second adopts a high-level representation that regards an image as a collection of image blobs, and a number of approaches based on this strategy have shown excellent performance [3–5]. The third strategy adopts topic models built from latent variables and classifies images according to their intermediate semantics. This strategy has been applied to cases where a larger number of scene classes are involved. It considers an image not only as a collection of blobs [6,7], but also as a more complicated structure [8,9], which contains abundant image information and is close to the way that human beings see
the image. So the strategy has gained growing attention in the research community. Representative topic models are probabilistic latent semantic analysis (pLSA) [10] and latent Dirichlet allocation (LDA) [11]. Although topic models were initially developed for text processing, they have been widely used in the field of computer vision in recent years. For example, the image classification method proposed in [6] trains an LDA model for each category and uses Bayesian decision to identify image labels. Also, a joint model of image classification and annotation based on sLDA (supervised LDA) has been proposed in [7]. These methods assign the label of an image through inference of the intermediate variables, and thus usually do not require extra classifiers. Unlike some earlier methods, in which topic models are combined with supervised classifiers, however, these supervised models [6,7] cannot make use of some existing achievements in image classification, such as universal supervised classification and feature processing. In particular, it is hard to extract feature vectors from these models, so one cannot easily combine topic information with other features, as presented in [12]. On the other hand, those earlier methods yield lower classification accuracy than the newer ones, because their representation of the image is not as appropriate. Therefore, the objective of this paper is not only to retain the advantages of the latter but also to achieve higher classification accuracy. In this paper, we propose a novel topic feature for image scene classification based on the BoW (bag of words) assumption. We first use LDA to establish an image topic model, and then
extract topic features from the topic model for further classification. Different from the methods in [6,1], our method does not need to establish a topic model for each scene class separately; rather, it describes all scene images in the same latent topic space (similar to [7]). Therefore our method can ignore scene category information completely before the topic features are input to the supervised classifier. Compared with other topic model methods, our method does not need inference in the feature extraction process, so it reduces the computational complexity of topic models. Moreover, it requires much less computation than pooling methods as well, because it uses fewer codewords per image, which avoids a large amount of computation for SIFT descriptors and codewords. At the same time, the proposed method increases the accuracy of image scene classification. The main work of the paper includes:

- Proposing a convenient representation of image topics that allows topic features to be extracted rapidly by a linear mapping from the codebook.
- Defining an image topic feature based on this representation, which describes the scene environment information of images by feature vectors whose dimensions can be flexibly adjusted. This makes it possible for images to be classified by off-the-shelf supervised classifiers according to their scenes. The proposed topic feature has very low computational complexity in feature extraction, so it can easily be combined with other features for scene classification.

In Section 2, we introduce our model and algorithms. In Section 3 we describe related works. The datasets, experiments and results are presented in Section 4. In Section 5, we discuss our results and summarize our findings. Finally, we conclude the paper in Section 6.
2. Model and algorithms

Our algorithm is described in Fig. 1. We treat an image as a set of local patches and generate the codewords for the image from its clustered patches. The codewords are defined as the centers of the learned clusters, and a complete collection of codewords forms a codebook. Each image is then represented as a set of codewords through its patches. On the training set, an LDA model [11] is built through the codebook and then used for generating the topic feature space. The codewords of each image are converted into feature vectors in the topic feature space by our proposed linear mapping method. The labeled feature vectors of the images are used for training a supervised classifier. On the testing set, the images are first represented by using the same codebook, and the corresponding feature vectors are then generated in the topic feature space in the same way. Finally, the testing images are classified by the trained classifier through their feature vectors. In this section, we introduce our proposed simplification and explicit linear mapping in Section 2.4, right after we discuss all the other steps in Sections 2.1 to 2.3.

Fig. 1. Flow chart of the algorithm.

2.1. Features and codebook

Most non-probabilistic models for natural scene classification have focused on global features such as frequency distribution, edge orientations and color histograms [2]. When using probabilistic topic models for the classification task, however, the way of representing local regions of images must be taken into account, as local features are regarded as more robust to occlusions and spatial variations than global features. Four different ways of extracting local regions have been tested in
[6], which showed that 128-dimensional SIFT region descriptors [13] carry more useful information and better robustness. These local features have been widely used in scene classification [14,1,7]. In order to compare with other models, we employ similar 128-dimensional gray-level SIFT region descriptors in our experiments. This SIFT variant is the one used in [3,15]; it differs from the usual SIFT in that it is computed on a dense regular grid instead of interest points and skips the usual SIFT normalization procedure when the overall gradient magnitude of the patch is too weak. To compute the region descriptor of each patch, an image is first divided into overlapping patches by a sliding grid. The K-means algorithm is then used to cluster the SIFT region descriptors (the SIFT descriptor built on a sliding-grid region is regarded as one sample), and the codewords are defined as the cluster centers, all of which form the codebook. This process is similar to those used in [6,7,1], and the adopted format of codewords is the same as theirs.
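As a concrete illustration of this codebook construction, the sketch below clusters dense-grid SIFT descriptors with K-means and assigns each patch to its nearest codeword. It is only a minimal outline under stated assumptions: extract_dense_sift is a hypothetical placeholder for the dense SIFT of [3,15], and scikit-learn's KMeans stands in for whatever clustering implementation the authors actually used.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_dense_sift(image):
    """Hypothetical placeholder: return an (n_patches, 128) array of SIFT
    descriptors computed on overlapping patches of a dense sliding grid."""
    raise NotImplementedError

def build_codebook(train_images, n_codewords=240, seed=0):
    """Cluster all training descriptors with K-means; the cluster centers
    are the codewords, and together they form the codebook."""
    descriptors = np.vstack([extract_dense_sift(img) for img in train_images])
    return KMeans(n_clusters=n_codewords, random_state=seed).fit(descriptors)

def encode_image(image, kmeans):
    """Represent an image as a bag of codeword indices (one per patch)."""
    return kmeans.predict(extract_dense_sift(image))
```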
2.2. Model structure

Fig. 2 shows the graphical model representation of LDA for image processing, where φ_k represents the probability of a codeword in topic k, and θ_m represents the probability of a topic in the mth image. θ_m and φ_k are used to generate topics and codewords, respectively, as parameters of multinomial distributions. K is the number of topics, M is the number of images in the dataset, and N is the number of codewords in each image. w_{m,n} and z_{m,n} denote the nth codeword and its topic in the mth image, respectively. The parameters α and β are Dirichlet distribution parameters. In this paper, we employ the same graphical model representation as that in Fig. 4(a) of [16] for illustrating the generation and learning process, and it is basically the same as a modified LDA in Fig. 7 of [11], where the original β_k is now represented as φ_k.

Fig. 2. Graphical model representation of LDA.

Note that the LDA based text processing technique is applied to image processing in this paper, and the analogies between the two terminologies can be defined as follows:

- A codeword w is the basic unit of an image, defined to be a codeword membership from a dictionary of codewords indexed by {1, …, V}. The vth codeword in the dictionary is represented by a V-vector w such that w^v = 1 and w^t = 0 for t ≠ v. In Fig. 2, w is shaded to indicate that it is an observed variable. A codeword is equivalent to a "word" in text processing.
- An image is a sequence of N patches denoted by w_m = (w_1, w_2, …, w_N) (m = 1, 2, …, M), where w_n is the nth patch of the image. The equivalent of an image in text processing is a "document".
- W = {w_1, …, w_M} represents the image dataset. In text processing, this is equivalent to a "corpus".

Now we can write down the process that generates an image w_m from the LDA model:
1. Sample the topics φ_k ~ Dirichlet(β), k ∈ [1, K].
2. For an image w_m in the image dataset W, m ∈ [1, M], sample the topic probability distribution θ_m ~ Dirichlet(α).
3. For the nth codeword in the image w_m, n ∈ [1, N]:
   (a) Choose a latent topic z_{m,n} ~ Multinomial(θ_m).
   (b) Generate a codeword w_{m,n} ~ Multinomial(φ_{z_{m,n}}).

θ_m and φ_k follow the Dirichlet distribution, which is the conjugate prior of the multinomial distribution. The distribution function is defined as follows:

Dir(μ | α) = [Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K))] ∏_{k=1}^{K} μ_k^{α_k − 1}    (1)

where 0 ≤ μ_k ≤ 1, ∑_k μ_k = 1, α_0 = ∑_{k=1}^{K} α_k, and Γ(·) is the standard gamma function. Given the hyperparameters α and β, the joint distribution of all known and hidden variables can be written as follows:

p(w_m, z_m, θ_m, Φ | α, β) = [∏_{n=1}^{N} p(w_{m,n} | φ_{z_{m,n}}) p(z_{m,n} | θ_m)] p(θ_m | α) p(Φ | β)    (2)

p(w_{m,n} = t | θ_m, Φ) = ∑_{k=1}^{K} p(w_{m,n} = t | φ_k) p(z_{m,n} = k | θ_m)    (3)

p(W | Θ, Φ) = ∏_{m=1}^{M} p(w_m | θ_m, Φ) = ∏_{m=1}^{M} ∏_{n=1}^{N} p(w_{m,n} | θ_m, Φ)    (4)

where Φ = {φ_k}_{k=1}^{K} is a matrix of size K × V and Θ = {θ_m}_{m=1}^{M} is a matrix of size M × K.

2.3. Parameter estimation: Gibbs sampler

There are several approaches to parameter estimation for the LDA model, such as Laplace approximation [17], variational inference [18] and Markov chain Monte Carlo (MCMC) [19]. Gibbs sampling [20,16,21] is a special case of MCMC which samples one component of the joint distribution in every step while keeping the others fixed. Gibbs sampling can produce relatively simple algorithms even when the dimension of the joint distribution is high. Each estimation method has its advantages and disadvantages, and many factors, such as efficiency, complexity, accuracy and simplicity of the concept, need to be considered when choosing an approximate inference algorithm. As the Gibbs sampler is simple to describe and easy to implement, we use a Gibbs sampler for parameter estimation. Our goal is then to evaluate the posterior distribution

p(z | w) = p(w, z) / ∑_z p(w, z)    (5)

This distribution cannot be computed directly, because the sum in the denominator does not factorize and involves a large number of terms (i.e., K^n terms, where n is the total number of codeword instances in the image dataset). However, we can use Gibbs sampling, which samples only one latent variable (topic) at a time, to address the problem. More specifically, the Gibbs sampler for this model samples the topic z of each codeword w, which avoids estimating the actual parameters θ_m and φ_k by integration. Once the topic of each codeword is determined, the values of θ_m and φ_k can be computed from frequency statistics. The final sampling formula is as follows [16]:

p(z_i = k | z_{¬i}, w) ∝ [(n_{k,¬i}^{(t)} + β_t) / (∑_{v=1}^{V} n_{k,¬i}^{(v)} + β_v)] · [(n_{m,¬i}^{(k)} + α_k) / ([∑_{z=1}^{K} n_m^{(z)} + α_z] − 1)]    (6)

where we assume w_i = t; z_i is the topic variable corresponding to the ith codeword; the counts n_{·,¬i}^{(·)} indicate that token i is excluded from the corresponding count, and the hyperparameters are omitted; n_k^{(v)} is the number of times that codeword v has been assigned to topic k; β_v is the Dirichlet prior of codeword v; n_m^{(z)} is the number of times that topic z has been assigned to image m; and α_z is the Dirichlet prior of topic z. After sampling, we can estimate φ and θ from the values of z by

φ_{k,t} = (n_k^{(t)} + β_t) / (∑_{v=1}^{V} n_k^{(v)} + β_v)    (7)

θ_{m,k} = (n_m^{(k)} + α_k) / (∑_{z=1}^{K} n_m^{(z)} + α_z)    (8)

where φ_{k,t} is the probability of codeword t given that it has been assigned to topic k, and θ_{m,k} is the probability of topic k in image m. Using Eqs. (6), (7) and (8), the Gibbs sampling procedure can be performed conveniently (more details can be found in Heinrich's work [16], especially Fig. 9 in [16]).
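For illustration, a minimal collapsed Gibbs sampler implementing Eqs. (6)–(8) on bag-of-codeword images could look as follows. This is only a sketch under simplifying assumptions (symmetric priors, fixed iteration count, no burn-in or averaging over samples); the paper itself uses the Matlab topic modeling toolbox of [20] rather than this code.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.5, beta=0.1, n_iter=200, seed=0):
    """docs: list of 1-D integer arrays of codeword indices, one per image."""
    rng = np.random.default_rng(seed)
    n_kt = np.zeros((K, V))            # times codeword t assigned to topic k
    n_mk = np.zeros((len(docs), K))    # times topic k assigned to image m
    z = [rng.integers(K, size=len(d)) for d in docs]   # random initialisation
    for m, d in enumerate(docs):                       # initialise the counts
        for n, t in enumerate(d):
            n_kt[z[m][n], t] += 1
            n_mk[m, z[m][n]] += 1
    for _ in range(n_iter):
        for m, d in enumerate(docs):
            for n, t in enumerate(d):
                k_old = z[m][n]
                n_kt[k_old, t] -= 1                    # exclude token i (Eq. (6))
                n_mk[m, k_old] -= 1
                p = ((n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta)
                     * (n_mk[m] + alpha))              # unnormalised Eq. (6)
                k_new = rng.choice(K, p=p / p.sum())
                z[m][n] = k_new
                n_kt[k_new, t] += 1
                n_mk[m, k_new] += 1
    phi = (n_kt + beta) / (n_kt.sum(axis=1, keepdims=True) + V * beta)      # Eq. (7)
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)  # Eq. (8)
    return phi, theta
```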
2.4. Image representation with topic features

As mentioned above, images are treated as collections of codewords and the image scene classification task is performed by using an analogue of a natural language topic model. Following the case of applying LDA to natural language processing, it is easy to obtain the representation of a new image in topic space. Given an image's codewords and a trained model Mod, the latent topic of each codeword can be sampled by the following formula:

p(z̃_i = k | w̃_i = t, z̃_{¬i}, w̃_{¬i}; Mod) ∝ φ_{k,t} (n_{m̃,¬i}^{(k)} + α_k)    (9)

where z̃ denotes the topics of a new image w̃. Topics and their statistics are updated by iteration in Gibbs sampling, and the posterior probability of the topics can be obtained by (9). After sampling, applying (8) yields the topic distribution of the new image,

θ_{m̃,k} = (n_{m̃}^{(k)} + α_k) / (∑_{z=1}^{K} n_{m̃}^{(z)} + α_z)    (10)

This process is applicable for complete collections of new images [16]. If we make a simplified assumption, where for a new image φ_{k,t} is not updated and θ_{m̃,k} is set to E(θ_{m,k}), the posterior probability for a new image can be simplified (see the footnote below) to

p(z̃ = k | w̃ = t, w̃; Mod) ∝ (1/M) ∑_{m=1}^{M} θ_{m,k} φ_{k,t}    (11)

where M is the number of images in the training set. By this operation, all of the images are represented in the same topic space. Our method of representing an image by using the LDA model is equivalent to mapping images to a space in which the vectors p(z | w_j) form the bases and the codewords act like coordinates. It establishes a unified feature space by using this topic representation as a feature for supervised classification. In this way, there is no need to estimate θ_{m̃} and n_{m̃,¬i}^{(k)} when representing a new image, so steps (9) and (10) can be omitted entirely. The representation also brings computational convenience in that the posterior probability p(z | w) can be described by a K × V matrix A, each element A_{ij} of which is the probability that the jth codeword has been assigned to the ith topic (i ∈ [1, K], j ∈ [1, V]). We define λ_j = A(:, j), so that an image is treated as a collection of codewords and a patch of the image is represented by its codeword vector λ_j. An image m with codewords w_m = (w_1, w_2, …, w_N) can then be described by the topic feature

λ_m = (1/N)(λ_{w_1} + λ_{w_2} + ⋯ + λ_{w_N})    (12)

Fig. 3. Topic features of an image.

Fig. 3 shows the relationship between an image m and its representation λ_m. The representation of an image contains two parts, i.e., the codewords of the image and the statistical information of the image dataset. It should be pointed out that extracting this feature for a new image is a linear mapping under codewords, a process demanding much less computation.

Footnote 1: The original representation of this formula (without simplification) can be written as p(z̃ = k | w̃ = t, w̃; Mod) ∝ θ_{m̃,k} φ_{k,t}, which is equivalent to formula 85 in [16].
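The simplification above reduces feature extraction to a matrix lookup and an average. A minimal sketch is given below, assuming the matrix A is built from the trained φ and θ via Eq. (11) and then column-normalized over topics; the exact normalization is not spelled out in the paper, so that step is an assumption of this sketch.

```python
import numpy as np

def codeword_topic_matrix(phi, theta):
    """Build A with A[i, j] ≈ p(topic i | codeword j) from a trained model:
    Eq. (11) gives p(z=k | w=t) ∝ (1/M) * sum_m theta[m, k] * phi[k, t]."""
    A = theta.mean(axis=0)[:, None] * phi        # shape (K, V)
    return A / A.sum(axis=0, keepdims=True)      # normalise over topics (assumed)

def topic_feature(codeword_indices, A):
    """Eq. (12): average the columns lambda_j of A over the image's codewords."""
    return A[:, codeword_indices].mean(axis=1)   # K-dimensional feature vector
```

Used together with the earlier sketches, phi and theta come from the Gibbs sampler on the training set and codeword_indices from encode_image, so representing a new image never touches the sampler again.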
3. Related work

A human takes into account the content of an image rather than its low-level features when recognizing a scene, so using the semantic information of an image for scene classification is more consistent with human habits. At present, there is a large body of work involving the use of topic models for image features [6,7,1,22,14,23,8,24,9]. Among the methods based on the BoW hypothesis (without spatial information), some use a topic model to describe images and classify them through a classifier; both existing supervised classifiers [1,22,14] and new methods based on inference and discrimination [6,7] are used for classification. While our idea is largely inspired by these works, it has the advantage that the features, topics and classifiers in question are independent of each other, so we can flexibly apply the best of each of the three parts to improve the overall performance of the algorithm [22]; the inference-based methods [6,7], for their part, can discriminate image scene categories without any additional classifier. If we view the scene discrimination task only from a classification perspective, these methods share a common characteristic: the image labels have been used before the features are input to the classifier, whether by generating a topic distribution for each category [6,1,14] or by using the category information for training the hidden variables [7]. Compared with the methods in [6,1], our approach differs in that image categories are not observable while the image statistics are collected. From the perspective of supervised classification, this representation does not couple features and labels, so the majority of supervised classifier and feature processing techniques can be used. The existing methods either train one LDA model for each class, build diagnostic topics, or, closest to our method, use the θ_{m̃} of a new image as the feature for classification. It should be noted that our method differs from them in essence by simplifying the representation of a new image into p(z | w) (Section 2.4), which acts as a dictionary: codewords are linearly and directly mapped to topic features, so the process of representing a new image by (9) and (10) can be omitted.
4. Experiments and results

4.1. Datasets

We evaluated our approach on three real-world datasets: (1) the LabelMe dataset provided by Oliva and Torralba [25]; (2) the UIUC-Sport dataset provided by Li-Jia Li and Li Fei-Fei [12]; and (3) the 15 Scenes dataset provided by the researchers in [25,6,3]. Fig. 4 shows some example images from each dataset, and their contents are summarized here:
- LabelMe: This dataset covers 8 natural scene classes and contains 2688 color images of the same size of 256 × 256. There are 360 coast, 328 forest, 260 highway, 308 inside city, 374 mountain, 410 open country, 292 street and 356 tall building images. We used 100 images in each class for training and the rest for testing.
- UIUC-Sport: This dataset includes 8 complex event classes and contains 1579 color images of different sizes. There are 194 rock climbing, 200 badminton, 137 bocce, 236 croquet, 182 polo, 250 rowing, 190 sailing and 190 snowboarding images. We
normalized them to the size of 256 × 256. We followed the experimental setting of [12] by using 70 randomly drawn images from each class for training and 60 for testing.
- The 15 Scenes: This dataset contains 4485 images in total, falling into 15 classes. Eight of the classes are the same as in the LabelMe dataset, and the remaining 7 classes consist of 216 bedroom, 241 suburb, 210 kitchen, 289 living room, 215 office, 315 store and 311 industrial images. The average size of each image is approximately 300 × 250 pixels. We normalized them to the size of 256 × 256. Following the same experimental setting as in [3,15], we used 100 images in each class for training and the rest for testing.

Fig. 4. Sample images from the three datasets: (a) LabelMe, (b) UIUC-Sport, and (c) the 15 Scenes dataset (only the categories not present in LabelMe are shown).
4.2. LabelMe and UIUC-Sport

We evaluated our approach on the LabelMe and UIUC-Sport datasets in order to compare with the topic-model-based methods closest to ours. We employed the common settings of the SIFT vector and codewords used in their work to facilitate the comparison. At the feature level, we used a grid sampling technique similar to [6,7]. In our experiments, the SIFT descriptors were extracted from 16 × 16 pixel patches, and a 128-dimensional SIFT vector was used to represent each patch. We created a visual dictionary and topic model on each training set, and obtained a dictionary of 240 codewords. The defined features were generated by the LDA model, which was trained unsupervisedly on the training set, and then
these features were labeled and used for training an SVM classifier. On the testing set, images were mapped into the topic feature space directly through the codebook and p(z | w), and classified by the trained classifier. The performance of the algorithm is described by the confusion table on the testing set. We compared our method with the following works under similar experimental conditions:
1. Li Fei-Fei et al.: an LDA based method proposed in [6] that learns topics for each class and classifies images by Bayesian decision.
2. Bosch et al.: the approach described in [1], which learns topics by employing pLSA on each category and classifies images with a k-nearest neighbor (KNN) classifier.
3. Chong Wang et al.: proposed in [7], it uses two supervised topic models, i.e., multi-class sLDA and multi-class sLDA with annotations, each of which builds a topic model on all classes and classifies images by posterior probability estimation of the label c.
As illustrated in Fig. 5, our approach achieves the highest average classification accuracy in the image scene classification task (LabelMe dataset). It reduces the error of Chong Wang et al. [7] by at least 4%, and even more for the others [6,1]. In particular, the error reduction relative to [6] certifies the contribution of our strategies, because the method in [6] adopted LDA without our
simplification and mapping. Meanwhile, our approach also achieves a relatively high accuracy in the event discrimination task (UIUC-Sport dataset), about 4% higher than Chong Wang et al. [7] and far better than the other two methods. Fig. 6 shows the confusion tables of the experiment results. In a confusion table, the rows represent the models for each scene category while the columns represent the ground truth categories of the scenes. Our method performs well for scene categories dominated by the scene environment (e.g. forest, rock climbing and tall building, which have distinctive scene environments of their own), with the highest accuracy of 95% for forest. In contrast, it is interesting to observe that the system tends to confuse coast with open country, whose images tend to share a similar scene environment; this is consistent with our intuition. It should be noted that these error-prone classes also showed relatively poor results in other works [6,7,1], and their accuracies there are lower than those of our approach. As an explanation of this phenomenon, in the UIUC-Sport dataset the scene environment of some scenes is vital for classification; indeed, the combination of scene and object can achieve good performance in most cases, as described in [12].
By contrast, for the event classes dominated by the scene environment, such as rock climbing, sailing and snowboarding, our algorithm is effective. Whether scene environment dominated or not, our proposed method achieved relatively better results than the methods in [6,7,1] and the related additional work [12] (scene-only accuracy 60%).
The number of topics has an important impact on classification accuracy. Generally speaking, classification accuracy increases with the number of topics within a certain range and then begins to decrease. This decline in accuracy is recognized as overfitting. The larger the value at which overfitting starts, the more latent features the model can handle [7]. Fig. 7 illustrates the relationship between accuracy and topic number on the LabelMe dataset. As the number of topics increases, our approach does not overfit in the experimental range, while the methods in [1,6] begin to overfit as early as around 40 topics and [7] around 100 topics. This suggests that our approach, which combines aspects of both generative and discriminative classification, can handle more latent features than a purely generative approach [6] or other combination approaches [7,1].
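To make the classification step concrete, the sketch below trains an off-the-shelf classifier on the labeled topic features and reports test accuracy and the confusion matrix. The paper uses an SVM in this section (and logistic regression via Liblinear in Section 4.3); scikit-learn's LinearSVC with default settings is an assumption of this sketch, not the authors' exact setup.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix

def classify_scenes(train_feats, train_labels, test_feats, test_labels):
    """Train a linear SVM on topic features and evaluate it on the test set."""
    clf = LinearSVC(C=1.0).fit(np.asarray(train_feats), train_labels)
    pred = clf.predict(np.asarray(test_feats))
    return accuracy_score(test_labels, pred), confusion_matrix(test_labels, pred)
```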
Fig. 5. Comparison of classification results between our approach and Li Fei-Fei, Bosch, and Chong Wang on both the LabelMe and UIUC-Sport datasets.

Fig. 7. Accuracy as a function of the number of topics on the LabelMe dataset.
Fig. 6. Comparison using confusion matrices, all from the 120-topic setting: (a) LabelMe dataset, (b) UIUC-Sport dataset.
4.3. The 15 Scenes

We also tested our method on the 15 Scenes dataset. The settings of the SIFT vector are the same as in Section 4.2, and 512 codewords are employed here for representing more categories. In the experiment, we compared our method with basic BoW models with sum-pooling and max-pooling, a basic LDA method, a PCA method, as well as Spatial Pyramid Matching (SPM) [3].
1. Pooling methods. Summing the count of each type of codeword in an area, f_s(v) = ∑_{i=1}^{P} v_i, is called sum-pooling; computing the maximum response of each type of codeword in an area, f_m(v) = max_i v_i, is called max-pooling (see the pooling sketch below). We followed the work in [26], which treats the whole image as one pooling area.
2. Basic LDA with Gibbs sampling. This method mainly follows the text analysis method reported in [20] and can be seen as the LDA method without our strategies. Rather than the diagnostic-topics approach used in [20], the vector θ_{m̃} of a new image is used as the feature for classification, to make a close comparison with our method. We use logistic regression (LR) to test the feature's performance.
3. Spatial Pyramid Matching (SPM) [3]. This method partitions an image into increasingly fine sub-regions and computes histograms of local features found inside each sub-region. Rather than the pyramid match kernel SVM classifier reported in [3], in this paper we again used LR to test the features' performance.
4. Principal Component Analysis (PCA). This is a technique widely used for image classification. We adopted the 99% principal components of PCA on the codeword histogram as features.
Given that our feature is generated from codewords through LDA, we need to evaluate the effectiveness of the proposed feature rather than that of the SIFT feature extraction and codeword clustering. However, the performance of the same method may change when the settings of codewords and patches differ. For example, the SPM method obtained 81.4% accuracy with the SIFT setting of [3], but only 76.73% with the setting of [15]. Also, the performance differs when sum-pooling and max-pooling are built on codebooks of different sizes according to [3]. Therefore, in our experiment on this dataset, we computed all SIFT descriptors and codewords with the same SPM Matlab code [3], and we built the LDA model through the same Matlab topic modeling toolbox [20] for both our method and basic LDA, with the same parameters. The experiment covers four settings of patches per image (i.e., 361, 441, 625 and 1681). We further adopted the same LR classifier shipped with Liblinear [27] to test these features. In order to compare with the other methods at a similar feature dimension, we adopted 512 topics as the feature dimension in our experiment.
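For reference, the two pooling baselines defined in item 1 above can be written in a few lines. This sketch assumes the per-patch codeword responses of an image are stored as a (P × V) array, with the whole image treated as a single pooling region as in [26].

```python
import numpy as np

def sum_pooling(responses):
    """responses: (P, V) array, one row per patch, one column per codeword.
    Returns the V-dimensional sum-pooled feature f_s(v) = sum_i v_i."""
    return responses.sum(axis=0)

def max_pooling(responses):
    """Returns the V-dimensional max-pooled feature f_m(v) = max_i v_i."""
    return responses.max(axis=0)
```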
Table 1
Classification rate (%) comparison on 15 Scenes, for four settings of patches per image.

Algorithm      361     441     625     1681
Max Pooling    58.96   60.03   60.67   61.24
Sum Pooling    63.45   64.86   67.10   69.38
PCA            55.44   56.08   57.99   59.60
SPM            68.31   70.21   72.29   73.67
Basic LDA      52.60   54.37   58.49   61.37
Our Approach   66.23   67.71   68.81   70.22
Fig. 8. Confusion table for the 15 Scenes dataset (average accuracy: 70.22%).
Table 1 shows the detailed results of our experiments. The performance of our method is superior to all the methods without spatial information at every setting of patches per image. Although the SPM method with the LR classifier achieves lower accuracies than with the original kernel SVM, it still performs better than the methods without spatial information. Therefore, besides the high-level features proposed in this paper, spatial information is also important for image scene classification. Fig. 8 shows the confusion table of the experiment results. Similar to the LabelMe and UIUC-Sport datasets, classification performs better for the scene categories dominated by the scene environment than for the others.
5. Discussion

Our experimental results show that the proposed method achieves an obvious improvement over other topic models in supervised classification accuracy. Meanwhile, we restricted ourselves to gray SIFT features in order to compare with other models; by doing so, it is easier to show that the improvement in classification accuracy comes from our method rather than from other factors, such as richer feature sets [1,22], other aspects of the image [8,9,23] or stronger classifiers [22,4]. Beyond BoW based methods, other important information for image scene classification, including spatial, object and blob contextual information, is used in [23,8,24,9,12,28] to improve the performance of image classification; these methods can also be combined with BoW models [12] for further improvement [9,23]. By contrast, the methods in [3–5,29] use high-level features directly without topic models and have also achieved good results. Although Lazebnik et al. [3] stated that such high-level features have a status equivalent to topic models, these new features can be used to generate histograms just as low-level features (such as SIFT) do [4], and they may therefore be usable within topic models. We hold the same viewpoint as Chong Wang et al. [7] about these factors: considering them may significantly improve classification accuracy, but it does not provide an accurate comparison of the models. When these factors are taken into account, our approach should obtain better performance, just like Bosch [1].
There are some aspects of the method that need improvement, relative to the above advantages. We use the gray SIFT descriptor as the only source of image information in our approach and normalize the probabilities of different scenes in data processing. Therefore, the method mainly improves performance in classifying natural scenes, whose scene environment can be described accurately by the representation of all their patches. Meanwhile, it is weak at classifying object-sensitive images, especially indoor scenes and events, for example some indoor scenes of the 15 Scenes dataset as well as object-sensitive event classes in the UIUC-Sport dataset. Fig. 9 shows an example of this phenomenon. The information source of BoW is the SIFT descriptors on all patches. A pair of natural scenes of different categories, as in Fig. 9(a) or (b), tends to have different scene environments, which means that the information input to the model is discriminative. For the majority of event classes and indoor scenes, however, the information input to the model does not reflect the difference between the image classes, as in the cases shown in Fig. 9(c) and (d). It is interesting to see that this phenomenon is consistent with human thinking habits: we can easily identify different categories of natural scenes according to the scene environment, but for indoor scenes and event classes we judge them mainly by the objects (for instance, a bedroom is distinguished by the bed, wardrobe, table lamps and so on, rather than by the scene environment). Therefore, we think that this disadvantage is caused by the BoW assumption, and simply improving the performance of the model cannot solve the problem. A good solution idea is proposed in [12]: while its scene-only accuracy on the UIUC-Sport dataset is 60%, lower than ours, it yields a high accuracy (73.4%) by combining information of both the scene environment and objects with a scene model for event classification.
In the experiment on the 15 Scenes dataset, the accuracy of our method is lower than SPM and ScSPM, which were built on codewords extracted by stronger pre-processing. It is worth noting, however, that the number of descriptors per image in our method is far smaller than in the other two, which means our proposed method has much lower computational complexity in generating SIFT descriptors and computing codewords. This also means that extracting the proposed feature for representing environment information needs less computation, which suggests that our approach is capable of extracting scene environment information in more complex classification tasks and of combining other information without too much computation.

Fig. 9. An image and its SIFT descriptors: (a) an example of a correctly classified image from the LabelMe dataset, (b) a correctly classified image from the UIUC-Sport dataset, (c) an incorrectly classified image from the UIUC-Sport dataset, and (d) an incorrectly classified image from the 15 Scenes dataset.
6. Conclusion

In this paper, we have proposed a novel topic feature for image scene classification. It uses p(z | w, D), obtained from an image LDA model, as the feature for topical image representation, which avoids complicated probability computation and topic inference. First, the gray SIFT features of image patches are clustered into codewords by K-means. Second, a topic model based on LDA is established for the images through the codewords. Then, a simple mapping from codewords to topic features is formed based on our defined feature and its calculation method. Finally, image scenes are classified by an LR classifier according to their topic features. This method has achieved higher accuracy than the related works [6,7,1] in our experiments. The proposed feature has the following advantages: (1) it can be extracted without class labels; (2) its extraction is a linear mapping under codewords for a trained topic feature space, which ensures much less computation. In the near future, we will explore more image features and stronger classifiers, as well as consider combining object, spatial and environment information for image classification.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments. Thanks also to the authors in [3,6,12,25] for making their experimental datasets available, and to the authors in [3,20] for making their code available. Thanks for the support of the Short-term Foreign Expert Program of Jilin University 2013, and the Academic Research Fund and Research Incentive Grant of Athabasca University.

References

[1] A. Bosch, A. Zisserman, X. Munoz, Scene classification via pLSA, in: Computer Vision – ECCV 2006, Springer, Berlin, Heidelberg, 2006, pp. 517–530.
[2] A. Vailaya, M.A. Figueiredo, A.K. Jain, H.-J. Zhang, Image classification for content-based indexing, IEEE Trans. Image Process. 10 (1) (2001) 117–130.
[3] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE, 2006, pp. 2169–2178.
[4] L.-J. Li, H. Su, L. Fei-Fei, E.P. Xing, Object bank: a high-level image representation for scene classification & semantic feature sparsification, in: Advances in Neural Information Processing Systems, 2010, pp. 1378–1386.
[5] A. Quattoni, A. Torralba, Recognizing indoor scenes, in: CVPR, 2009, pp. 413–420.
[6] L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 2, IEEE, 2005, pp. 524–531.
[7] C. Wang, D. Blei, L. Fei-Fei, Simultaneous image classification and annotation, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, IEEE, 2009, pp. 1903–1910.
[8] Z. Niu, G. Hua, X. Gao, Q. Tian, Context aware topic model for scene recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2743–2750.
[9] X. Wang, E. Grimson, Spatial latent Dirichlet allocation, in: Advances in Neural Information Processing Systems, 2007, pp. 1577–1584.
[10] T. Hofmann, Probabilistic latent semantic indexing, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 1999, pp. 50–57.
[11] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.
[12] L.-J. Li, L. Fei-Fei, What, where and who? Classifying events by scene and object recognition, in: IEEE 11th International Conference on Computer Vision, ICCV 2007, IEEE, 2007, pp. 1–8.
[13] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, IEEE, 1999, pp. 1150–1157.
[14] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, T. Tuytelaars, L. Van Gool, Modeling scenes with local descriptors and latent aspects, in: Tenth IEEE International Conference on Computer Vision, ICCV 2005, vol. 1, IEEE, 2005, pp. 883–890.
[15] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, IEEE, 2009, pp. 1794–1801.
[16] G. Heinrich, Parameter estimation for text analysis, web: <http://www.arbylon.net/publications/text-est.pdf>.
[17] D.J. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge, UK, 2003.
[18] M.J. Wainwright, M.I. Jordan, Graphical models, exponential families, and variational inference, Found. Trends Mach. Learn. 1 (1–2) (2008) 1–305.
[19] C.M. Bishop, N.M. Nasrabadi, Pattern Recognition and Machine Learning, vol. 1, Springer, New York, 2006.
[20] T.L. Griffiths, M. Steyvers, Finding scientific topics, Proc. Natl. Acad. Sci. USA 101 (Suppl. 1) (2004) 5228–5235.
[21] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, M. Welling, Fast collapsed Gibbs sampling for latent Dirichlet allocation, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 569–577.
[22] A. Bosch, A. Zisserman, X. Munoz, Scene classification using a hybrid generative/discriminative approach, IEEE Trans. Pattern Anal. Mach. Intell. 30 (4) (2008) 712–727.
[23] Z. Niu, G. Hua, X. Gao, Q. Tian, Spatial-DiscLDA for visual recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2011, pp. 1769–1776.
[24] S.N. Parizi, J.G. Oberlin, P.F. Felzenszwalb, Reconfigurable models for scene recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2775–2782.
[25] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. 42 (3) (2001) 145–175.
[26] Y.-L. Boureau, J. Ponce, Y. LeCun, A theoretical analysis of feature pooling in visual recognition, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 111–118.
[27] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, Liblinear: a library for large linear classification, J. Mach. Learn. Res. 9 (2008) 1871–1874.
[28] L.-J. Li, H. Su, Y. Lim, L. Fei-Fei, Objects as attributes for scene classification, in: Trends and Topics in Computer Vision, Springer, Berlin, Heidelberg, 2012, pp. 57–69.
[29] M. Pandey, S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, in: IEEE International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 1307–1314.
Mujun Zang is a PhD candidate in the College of Communication Engineering, Jilin University, China. Her research interests include signal and image processing, pattern recognition, as well as machine learning methods in computer vision. Her current research focus is object and natural scene classification.
Dunwei Wen is an Associate Professor in the School of Computing and Information Systems at Athabasca University, Alberta, Canada. He received the PhD in pattern recognition and intelligent systems from Central South University, and MSc from Tianjin University and BEng from Hunan University. Prior to his current position, he was a visiting scholar in the Department of Computing Science at the University of Alberta, and Professor at Central South University. His research interests include artificial intelligence, machine learning, natural language processing, computer vision, text analysis, intelligent agents, and their application in industrial and educational information systems.
Ke Wang graduated from the Department of Electrical Engineering, Jilin University of Technology, China, in 1978, and received the MSc degree in Communication and Electrical system from Jilin University of Technology in 1994 and the PhD degree in Mechanical and Power Engineering in Jilin University in 2001. He is currently a Professor and Director in the College of Communication Engineering, Jilin University. His research interests include image processing, pattern recognition, signal processing as well as intelligent system.
Tong Liu is a PhD candidate in the College of Communication Engineering, Jilin University, China. His major research interests are machine learning, pattern recognition and intelligent systems, including biomedical signal and image processing. His current research focus is ECG recognition.
Weiwei Song is an MSc student in the College of Communication Engineering, Jilin University, China. His research interests include machine learning and remote sensing image classification.