Semantic modeling of natural scenes based on contextual Bayesian networks


Pattern Recognition 43 (2010) 4042–4054


Huanhuan Cheng*, Runsheng Wang

ATR National Laboratory, Institute of Electronic Science and Engineering, National University of Defense Technology, Changsha, Hunan 410073, China

* Corresponding author. Tel.: +86 731 4575724; fax: +86 731 4518730. E-mail address: [email protected] (H. Cheng).


Abstract

Article history: Received 1 August 2009; received in revised form 1 June 2010; accepted 4 June 2010.

This paper presents a novel approach based on contextual Bayesian networks (CBN) for natural scene modeling and classification. The structure of the CBN is derived from domain knowledge, and its parameters are learned from training images. For test images, hybrid streams of semantic features of image content and spatial information are piped into the CBN-based inference engine, which is capable of incorporating domain knowledge as well as dealing with a number of input evidence streams, producing the category label of the entire image. We demonstrate the promise of this approach for natural scene classification, comparing it with several state-of-the-art approaches. © 2010 Elsevier Ltd. All rights reserved.

Keywords: Scene classification Image representation Bayesian network Spatial information Semantic features

1. Introduction

Scene classification, categorizing images into discrete categories (e.g., beach, forest or indoor), is a classical yet challenging problem in computer vision. It is an intermediate step toward closing the semantic gap between the image understanding of the user and that of the computer. From an application viewpoint, scene classification is relevant to systems for the organization of personal and professional image and video collections. As such, this problem has been widely explored in the context of content-based image retrieval [1], but most prior approaches [2–4] have focused on mapping a set of classic low-level vision features to semantically meaningful categories using a classifier engine. The semantic modeling of scenes by an intermediate representation was subsequently proposed in order to reduce the gap between low-level and high-level image processing.

The meaning of scene semantics is not unique, and two basic strategies based on semantic representation can be found in the literature. One is the object-based strategy [5–8], which identifies the semantics as a set of materials or objects that appear in the image (e.g., sky, grass and rocks). These methods are mainly based on first segmenting the image in order to deal with different regions. Subsequently, local classifiers are used to label the regions as belonging to an object. Finally, using this local information, the global scene is classified. Another popular approach is the bag-of-words strategy [9–13], which uses more general intermediate representations. In this case, they first identify a dictionary of


visual words or local semantic concepts in order to build the bag-of-words, and then use bag-of-words models (e.g., probabilistic latent semantic analysis (pLSA) [14] and latent Dirichlet allocation (LDA) [15]) to discover clusters of local semantic concepts for scenes.

Although these two strategies have both achieved promising results, images with similar visual contents are often mis-categorized, particularly for natural scenes. For example, in experiments on Vogel's dataset of natural scenes [7], coasts and river/lakes are frequently confused, and the reported performance on river/lakes is below the average rate. This is also observed in results based on bag-of-words methods [12,16]. Fig. 1 shows two images from Vogel's dataset. For the first, a river/lakes scene, the percentages of water, sky and rocks are very similar to those in coast images. The second image is clearly a forest scene; however, the large amount of grass gives the image a high probability of being classified as a plain scene. In other words, common materials of scenes usually produce similar semantic features. For this reason, even approaches based on semantic modeling fail to distinguish them correctly. This shows that scene classification remains challenging and that more advanced classification methods need to be designed.

A natural scene is generally composed of several entities, organized in often unpredictable layouts that vary with season and weather. This makes natural scenes hard to distinguish. However, even without any accurate features extracted from images, people can categorize natural scenes very well. Human perception mainly relies on domain knowledge about certain scenes, which includes various attributes such as objects' occurrence probabilities and their spatial arrangements.



Fig. 1. Examples of images which are often misclassified.

Furthermore, these various elements are considered in a unified way when people categorize a picture. The challenge with such an idea is that knowledge from diverse feature sets needs to be integrated so that specific inferences can be made. In addition, the inference engine should be capable of resolving conflicting indicators from the various features, which are likely to occur due to the imperfect nature of the feature extraction algorithms. Thus, in this paper, we aim to develop a unified framework to model and classify images by various attributes.

Bayesian networks (BN) provide a powerful framework for knowledge representation, and domain-specific knowledge can be incorporated in the network structure. Luo et al. [5] present a general-purpose knowledge integration framework for semantic image understanding that employs BN to integrate both low-level and semantic features. Indoor/outdoor classification is one of their applications. In their work, semantic features are few and spatial information is not taken into account.

We believe that context cues in images should be used for a more intuitive and accurate classification. The spatial correlation of basic elements (e.g., pixels, lines and regions) is the most common type of context in an image. In contrast, typical bag-of-words techniques ignore the spatial relationships of image patches or object parts. The object-based strategy can explicitly exploit spatial information among parts or regions; therefore, we follow the object-based strategy in this paper.

In this paper, we propose a contextual Bayesian network-based framework for natural scene modeling and classification, which models spatial relationships between local semantic regions and integrates multiple attributes to infer the high-level semantics of a global image. The contributions of our paper are the following:

(1) A unified probabilistic framework for scene classification based on Bayesian networks to represent scenes by various attributes. We propose a contextual Bayesian network to model local materials of images and spatial configurations of scenes together, which we show classifies natural scenes successfully. This is different from previous works such as [5,8], where only two scene categories (indoor/outdoor) are considered and spatial relationships are ignored.

(2) An effective approach based on the use of spatial information of the key entities for scene classification. This spatial information is not fixed but varies with the key semantic regions chosen in different images. Besides, it allows for a simpler representation of scene structure, which proves helpful for categorizing some types of natural scenes that are often mis-categorized.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 describes the general framework we explore. The contextual Bayesian network model is presented in Section 4. Classification results are provided and discussed in Section 5. Section 6 concludes the paper.

2. Related work

Early works on scene classification use low-level features computed directly from the whole image or from a fixed spatial layout, combined with supervised learning methods to classify images into several semantic classes. The work by Vailaya et al. [2] is regarded as representative of this line of work. Their approach relies on a combination of distinct low-level cues for different two-class problems (global edge features for city/landscape and local color features for indoor/outdoor).

The semantic modeling of scene classification can be primarily categorized into bag-of-words methods and object-based methods. In recent years, bag-of-words models have shown much success in text analysis and information retrieval. Inspired by this, a number of works [9–12] have demonstrated impressive results for image analysis and classification using bag-of-words models. Bosch et al. [9] provide an approach that uses bag-of-words to model visual scenes based on local invariant features and probabilistic latent semantic analysis (pLSA). The same authors extend their work to investigate various choices of vocabularies and parameters and the gain obtained by adding spatial information [10]. Fei-Fei and Perona [11] independently propose two variations of LDA. In that framework, local regions are first clustered into different intermediate themes, and then into categories. No supervision is needed apart from a single category label for each training image.

Several studies suggest that to understand the context of a complex scene, one needs first to recognize the objects and then in turn recognize the category of the scene [17]. The object-based methods follow this strategy and are closer to human perception. Luo et al. [5] proposed a hybrid approach: low-level and semantic features are integrated into a general-purpose knowledge framework that employs a Bayesian network (BN). Vogel and Schiele [7] present a novel image representation for natural scene modeling by local semantic description. They predefine a set of semantic concepts such as water, rocks and foliage to describe the content of images. They first classify local image regions into semantic concept classes; images are then represented through the frequency of occurrence of these local concepts. Spatial relationships between objects are not considered in these works. Aksoy et al. [18] applied a Bayesian framework in a visual grammar. Scene representation is achieved by decomposing the image into prototype regions and modeling the interactions between these regions in terms of their spatial relationships. Boutell et al. [19] present a graph-based approach to learn spatial configuration models for outdoor scenes. Since a fully connected scene configuration model is intractable, they later chose to model pairwise relationships between regions and estimate scene probabilities using loopy belief propagation on a factor graph [20]. This generative model offers a number of advantages at the expense of slightly lower accuracies compared with discriminative models using the same semantic features.

These object-based approaches are able to provide a visual representation of objects based on image regions. However, many of them lack sufficient use of spatial context information and an effective mapping mechanism from the diverse feature set to the high-level semantics of the whole picture. Therefore, this paper proposes a contextual Bayesian network framework to categorize natural scenes.
The structure of the CBN is derived from domain knowledge, and its parameters are learned from training images. Test images are first segmented into homogeneous regions, which are labeled with object identities by the local classifier. Then the semantic features and the spatial relationships of key semantic regions are extracted from the image. The hybrid stream of this


evidence is piped into the CBN inference engine, producing category predictions. In summary, the key advantages of the proposed approach over previous approaches are: (1) semantic features of scene content and spatial context information are integrated in a unified probabilistic framework using Bayesian networks; (2) the spatial information used in this paper is not fixed but varies with the key semantic regions extracted in the image. The spatial arrangement of the key semantic objects is modeled by sub-networks. This difference is crucial, as it allows us to extract the typical or main structural information of certain scenes. Finally, combined with the material nodes, the spatial Bayesian network model is constructed and scene classification proceeds through the network.

3. Framework overview

3.1. Bayesian networks

A Bayesian network [21,22] is a directed acyclic graph that encodes a joint probability distribution over a set of random variables. Consider a finite set U = {X_1, ..., X_n} of discrete random variables where each variable X_i may take on values from a finite set, denoted by Val(X_i). Formally, a Bayesian network for U is a pair B = (G, \Theta). The first component, G, is a directed acyclic graph whose vertices correspond to the random variables and whose edges represent direct dependencies between the variables X_1, ..., X_n. The graph G encodes the set of Markov independence assumptions: each variable X_i is conditionally independent of its non-descendants given its parents. More formally,

P(X_i \mid A(X_i), \Pi(X_i)) = P(X_i \mid \Pi(X_i))    (1)

where A(X_i) denotes the non-descendant nodes of X_i, and \Pi(X_i) denotes the parents of X_i in G. The second component of the pair, \Theta, represents the set of parameters that quantifies the network; it is also called the conditional probability matrix (CPM). It contains a parameter \theta_{x_i \mid \Pi_{x_i}} = P(x_i \mid \Pi_{x_i}) for each possible value x_i of X_i and \Pi_{x_i} of \Pi(X_i). The probability of a full assignment to all of the variables in the network is computed using the chain rule

P(X_1, ..., X_n) = \prod_{i=1}^{n} P(X_i \mid \Pi(X_i)) = \prod_{i=1}^{n} \theta_{X_i \mid \Pi(X_i)}    (2)

When used in conjunction with statistical techniques, Bayesian networks provide a powerful framework for the description of complicated probabilistic systems through simple conditional relationships and have been successfully employed for semantic scene understanding [5,8]. Bayes' theorem is given by

P(H \mid E) = \frac{P(H, E)}{P(E)} = \frac{P(E \mid H) P(H)}{P(E)}    (3)

The latter part of Eq. (3) is the well-known inversion formula. The importance of this result is that P(H \mid E), which is often difficult to assess, can be obtained from quantities that are usually available from experiential knowledge. A purely mathematical description of probabilistic reasoning can be devoid of psychological meaning and often differs from human probabilistic reasoning. Bayesian networks efficiently encode the joint probability of the variables and can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention.
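To make the chain rule of Eq. (2) and the inversion of Eq. (3) concrete, the following minimal Python sketch works through a hypothetical two-node network (Scene -> Sky); the probability values are invented for illustration and are not taken from the paper.

```python
# Toy two-node Bayesian network: Scene -> Sky (hypothetical probability tables).
# Chain rule (Eq. 2): P(Scene, Sky) = P(Scene) * P(Sky | Scene)
# Bayes' theorem (Eq. 3): P(Scene | Sky) = P(Sky | Scene) * P(Scene) / P(Sky)

p_scene = {"coast": 0.5, "forest": 0.5}                  # prior P(Scene)
p_sky_given_scene = {"coast": 0.9, "forest": 0.3}        # P(Sky = present | Scene)

def posterior_scene_given_sky():
    # Joint probabilities via the chain rule, then normalize (Bayes' theorem).
    joint = {s: p_scene[s] * p_sky_given_scene[s] for s in p_scene}
    evidence = sum(joint.values())                       # P(Sky = present)
    return {s: joint[s] / evidence for s in joint}

print(posterior_scene_given_sky())   # {'coast': 0.75, 'forest': 0.25}
```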

3.2. The framework

Although the visual content (entities and layout) of a specific scene class exhibits large variability, the specificity of a scene class greatly relies on the local objects/materials and the basic spatial arrangement of those entities. We use the term object in this paper generically to refer to both materials (grass, sky) and objects (buildings) in scenes. Fig. 2 illustrates the proposed general framework for modeling and categorizing natural scenes, which is a two-step process. The first step is the process of local semantic labeling for segmented regions; the second is the modeling and classification procedure for natural scenes using Bayesian networks.

To describe the image content of natural scenes, nine objects are determined as being discriminative for the desired tasks, similar to [7]. These objects are {sky, water, grass, trunks, foliage, field, rocks, sand}, denoted by M = {M_i}, i = 1, ..., L, where L = 9. The purpose of local semantic modeling is to classify the local image regions into object classes. We generically refer to image regions labeled by object classifiers as semantic regions. First, every image is segmented into disjoint regions using a general algorithm (the mean-shift algorithm [23]), and low-level features are extracted from each region. The feature vector extracted from an image region is the concatenation of a 54-bin linear HSV color histogram (hue: 18 bins; saturation: 16 bins; value: 16 bins), an 8-bin edge direction histogram, and 24 features of the gray-level co-occurrence matrix (32 gray levels): contrast, energy, entropy, homogeneity, inverse difference moment and correlation for displacements along the four standard directions (0°, 45°, 90° and 135°). For training images, we manually label the regions with their identities and train a multi-class SVM [24] classifier for these objects. For a test image, after segmentation and feature extraction, each region of the image is labeled by the trained SVM with one of the objects according to its color and texture feature vector.

In the second stage, the region-based information is combined into a global image representation. The image representation consists of two parts: one is the object occurrences; the other is the spatial arrangement of the objects. A contextual Bayesian network (CBN) is proposed to integrate these different types of information; its nodes represent features that may describe low- or high-level semantic information. The structure of the CBN is derived from domain knowledge of the relationships between different entities, and the parameters of the network are learned from training images. For a test image, as illustrated in the right part of Fig. 2, two sets of descriptors are extracted from the labeled image obtained at the first stage. The first corresponds to semantic object features, which can be computed as their frequency of occurrence; the second set corresponds to spatial information, i.e., the spatial configuration of the critical semantic objects. The hybrid streams of these semantic evidences are piped into the CBN, which produces semantic predictions for the entire image.
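A condensed sketch of this first stage is given below. It assumes regions have already been segmented and uses NumPy and scikit-learn as stand-ins for the feature extraction and multi-class SVM; the GLCM texture features are omitted for brevity, and all function names are illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def region_features(hsv_pixels, edge_dirs):
    """Concatenate a coarse HSV color histogram and an 8-bin edge-direction
    histogram for one region (texture features omitted in this sketch)."""
    h = np.histogram(hsv_pixels[:, 0], bins=18, range=(0, 1), density=True)[0]
    s = np.histogram(hsv_pixels[:, 1], bins=16, range=(0, 1), density=True)[0]
    v = np.histogram(hsv_pixels[:, 2], bins=16, range=(0, 1), density=True)[0]
    e = np.histogram(edge_dirs, bins=8, range=(0, np.pi), density=True)[0]
    return np.concatenate([h, s, v, e])

def train_object_classifier(train_regions, train_labels):
    """Train the local object classifier from hand-labeled training regions;
    labels are object-name strings such as 'sky' or 'rocks'."""
    X = np.stack([region_features(p, e) for p, e in train_regions])
    clf = SVC(kernel="rbf", probability=True)   # multi-class SVM (one-vs-one)
    clf.fit(X, train_labels)
    return clf

def label_regions(clf, regions):
    """Label the segmented regions of a test image with predicted objects."""
    X = np.stack([region_features(p, e) for p, e in regions])
    return clf.predict(X)
```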

4. Contextual Bayesian network model for scene categorization

Our Bayesian network model is based on the concept of scene configurations [20]. As mentioned above, the specificity of a scene class greatly relies on the local objects/materials and their basic spatial arrangement. We extend the meaning of the scene configurations in [20]. Thus, in our paper, extended scene configurations consist of two parts. The first is the attributes of the objects present in the scene (e.g., the percentage of occurrence for


each object in Fig. 3(b)). The second is the spatial configuration of the semantic regions (the graph of Fig. 3(c)). We generically refer to the labeled regions (the color-coded regions shown in Fig. 3(b)) as ''semantic regions''. We start by automatically segmenting the image using a general algorithm (the mean-shift algorithm [23]). Some regions may contain more than one object, e.g., water regions with a tiny bit of sand. Regions containing at least 75% of one type of object were selected both for training and testing the SVM object classifier; image regions that contain two objects in about equal amounts are not used for training or testing of the object classifiers. The human-labeled ground truth is used both for training the local object classifiers and for scene classification. For a test image, segmented regions are labeled by the object classifiers, and adjacent regions with the same label are merged.

Fig. 2. Overview of the contextual Bayesian network-based scene classification.

Fig. 3. Example of scene configurations. (a) A coast scene. (b) Its manually labeled materials. (c) The graph showing objects and their spatial relations. (d) Chosen key regions and their adjacent regions.

4.1. Formalizing the problem of scene classification

We formalize the scene classification problem as follows: let C = {C_i} be the set of scene classes considered and E be the input evidence. After the first stage of local semantic modeling, image regions are labeled with the given objects as in Fig. 3(b), and several semantic regions are obtained. Based on the concept of scene configurations, two kinds of evidence are extracted from the labeled image: one is the feature of each object; the other is the spatial arrangement of those objects. Let E_M = {F_1, F_2, ..., F_L} be the set of features for each object, and E_G be the set of spatial configurations of objects in the image. Thus, E = {E_M, E_G}. We choose the frequency of occurrence in the image as the feature of an object. For example, in Fig. 3, if we consider the nine predefined objects, M = {0.475, 0.235, 0, 0.2, 0.09, 0, 0, 0, 0} is obtained by computing the percentage of each object present in the image.

In our framework, we want to find the scene with maximum a posteriori (MAP) likelihood, given the input evidence from the labeled image, i.e., \arg\max_i P(C_i \mid E). By Bayes' rule,

P(C_i \mid E) = \frac{P(C_i) P(E \mid C_i)}{P(E)}    (4)

At inference time, we have the evidence E; thus P(E) is fixed and does not depend on the scene i. Thus

C^* = \arg\max_i P(C_i \mid E) = \arg\max_i P(C_i) P(E \mid C_i) = \arg\max_i P(C_i) P(E_M, E_G \mid C_i)    (5)


Fig. 4. Contextual Bayesian network model for natural scene classification.

In the next section, a Bayesian network is exploited to encode the conditional probabilities between scenes and the various features.

4.2. Contextual Bayesian network

While all graphical models may have the same representational power, not all are equally suitable for a given problem. A Markov random field (MRF) is a suitable choice for a region-based approach, due to the similarity to its use in low-level vision problems. A factor graph is used in [20] to model pairwise relationships for scenes. Another alternative is to use a Bayesian network. Bayesian belief networks provide explicit probabilistic capabilities for representing diverse feature sets in a common modality (probability space) and theoretically sound fusion rules for combining these probabilities to generate a consensus. In this paper, the contextual Bayesian network (CBN) is presented to model and classify images by local semantic features and spatial information, as shown in Fig. 4. The goal of the CBN is to explore the inherent relations between multi-level semantics (including global scene semantics and local semantic features), which can be quantified by conditional probabilities between parent nodes and child nodes in the network. The model is called the contextual Bayesian network (CBN) because of its capability of modeling the spatial context of images. The structure of the network is designed based on domain knowledge, and the parameters are learned from training samples.

4.2.1. Construction of the structure of the CBN

The structure of a Bayesian network is usually derived from domain knowledge of the relationships between different entities, although automatic algorithms exist for discovering the Bayesian network structure directly from data [25]. The CBN is constructed based on the perceived semantic relationships between elements of the scene configuration. The root node C represents the scene category of an image; the set of values for this node is the set of scene categories {C_1, C_2, ..., C_N}. ''Objects'' and ''Spatial Configuration'' with dashed circles are not actual nodes; they are used to describe the two elements of the scene configuration, namely the local objects and their spatial configurations. We assume a locality condition: the features of an object depend only on the material present in the image and not on its spatial configuration. So nodes below

the dashed line are conditionally independent of each other given the scene node.

Modeling the spatial configuration of all objects present in the image is more difficult. At this coarse level of segmentation, even distant semantic regions may be strongly correlated, e.g., sky and sand in coast scenes. Fig. 3(c) shows a graph of relationships between semantic regions. As analyzed in [20], an exact model of the spatial configuration suffers from sparseness: the number of training images is typically much smaller than the number of all possible spatial configurations. They approximate the joint distribution as the product of pairwise distributions, that is, only pairwise relations of semantic regions are considered. We instead choose to find some typical semantic regions (e.g., the largest region in size) and use their adjacent regions to represent the major and characteristic configurations of scenes. Fig. 3(d) shows two chosen key regions and their adjacent regions: one is sky above rocks, and the other is water above sand and rocks. These two layouts of objects are typical in coast scenes and may be a distinctive feature of the spatial layout of coast scenes.

4.2.2. Description of nodes and their values

We now describe the nodes in the CBN and their values when modeling an image. There are L nodes for the predefined objects, and they are independent of each other. The value of node M_i is the feature extracted from the semantic regions labeled with M_i in an image. We choose the percentage of object M_i present in the image as its feature, denoted by F_i:

F_i = \sum_{R_j \in I,\ M(R_j) = M_i} \frac{|R_j|}{|I|}    (6)

where M(R_j) represents the label of region R_j, and |\cdot| gives the number of pixels in the contained set. The right part of the CBN is created to model the spatial configurations of the chosen key regions. Below ''Spatial Configuration'' there are K nodes for key semantic regions, which are assumed to be independent of each other for simplicity. For each key-region node KR_j, its child nodes are its adjacent regions in different directions, denoted by {R_{j1}, ..., R_{j|S|}}. S is the set of spatial relations, which will be discussed in the next section. Because Bayesian networks are acyclic, we ignore possible relationships


between the adjacent regions of the same key region. The values of nodes KR_j and R_{jk} are the object labels of the corresponding regions. In our study, key semantic regions are extracted according to the rules described in the next section. We efficiently propagate evidence through the network using Pearl's message-passing algorithm [21]. Details of inference are described in Section 4.5; learning of the parameters that quantify the network is discussed in Section 4.4. After the message-passing algorithm has completed, we find the scene class by taking the value with the highest marginal probability at the scene node. We now discuss the choice of key regions and spatial relations.
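As a concrete illustration of the object-node values of Eq. (6), the sketch below computes the occurrence percentages F_i from a per-pixel label map; the label-map representation and the tiny example image are hypothetical.

```python
import numpy as np

OBJECTS = ["sky", "water", "grass", "trunks", "foliage", "field", "rocks", "sand"]

def occurrence_features(label_map):
    """Eq. (6): F_i = (number of pixels labeled M_i) / (number of pixels in I)."""
    total = label_map.size
    return np.array([np.sum(label_map == obj) / total for obj in OBJECTS])

# Example: a tiny 2x4 'image' whose regions are already labeled.
label_map = np.array([["sky", "sky", "sky", "rocks"],
                      ["water", "water", "sand", "rocks"]])
# sky 0.375, water 0.25, rocks 0.25, sand 0.125, all other objects 0.
print(occurrence_features(label_map))
```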

4.3. Key semantic regions and their spatial arrangements

4.3.1. Key semantic regions

As argued above, we aim for a key region-based spatial configuration to approximate the spatial configurations of scenes. Key semantic regions are generally chosen to be the larger regions in size. We intend the chosen key regions and their adjacent regions to cover almost the whole image, so that their arrangements approximate the spatial configurations of scenes as closely as possible. For an image I, R = {R_1, R_2, ..., R_n} is the set of semantic regions, {m_1, m_2, ..., m_n} is the set of labels for these regions, with m_i \in M. The adjacent regions of a semantic region R_i are represented by adj(R_i). We choose key semantic regions as follows:

(1) Initialize the set of candidates for key regions CR = R; the set of key regions is represented by KR, and the number of key regions is N_KR = 0.

(2) The largest region in the candidate set CR is chosen to be a key semantic region, and N_KR is updated. This region and its adjacent regions are removed from CR.

(3) Compute the total size ts of the key regions and their adjacent regions (each region is counted only once):

ts = \sum_{R_j \in adj(R_i) \text{ or } R_j \in KR} |R_j|    (7)

Then, the fraction of the image covered by those regions is \eta = ts / |I|. If \eta < thd, the process returns to step (2); otherwise the chosen key semantic regions are sufficient for the image. We choose thd = 0.9 in our paper, which means the chosen key regions and their adjacent regions must cover 90% of the whole image. The number of nodes for key semantic regions in the CBN is set as K = (1/n) \sum_{i=1}^{n} N_KR^i, where n is the number of training images.

For example, for the image shown in Fig. 3(a), R = {R_1, R_2, R_3, R_4}, with labels {sky, water, rocks, sand}. The first chosen key region is R_1, and its adjacent regions are {R_3}. R_1 and R_3 (rocks) are removed from CR. Although R_3 is removed and cannot be chosen as a key region, it can still be an adjacent region of subsequent key regions. CR therefore becomes {R_2, R_4}. Because \eta = ts / |I| = 0.675 < thd, R_2 is chosen as the second key region and its adjacent regions are {R_3, R_4}. Now we have KR = {R_1, R_2}, ts = |R_1| + |R_2| + |R_3| + |R_4|, and \eta = ts / |I| = 1 \geq thd. The result is therefore KR = {R_1, R_2}, and their spatial arrangements are shown in Fig. 3(d). A sketch of this selection procedure is given at the end of this section.

4.3.2. Spatial relationships

Singhal et al. [26] have found that seven distinct spatial relations (above, far_above, below, far_below, beside, enclosed and enclosing) are sufficient to model the relationships between objects in outdoor scenes; five of those relations apply to adjacent regions. We choose a similar set of |S| = 6 spatial relations (above, below, left, right, enclosed and enclosing) between the key regions and their adjacent regions. These relationships are illustrated in Fig. 5, where the circle with a solid line represents the chosen key region R_i and the numbered circles represent its adjacent regions R_j. The spatial relationships are described as follows:

- above (0): R_j is above R_i;
- left (1): R_j is on the left of R_i;
- below (2): R_j is below R_i;
- right (3): R_j is on the right of R_i;
- enclosing (4): R_i is enclosed by R_j along at least 50% of the smaller region's perimeter;
- enclosed (5): R_j is enclosed by R_i along at least 50% of the smaller region's perimeter.

The relationship between a pair of adjacent regions can be obtained by computing the perimeters, centroids and angles of the two regions. For simplicity, we compute spatial relations only between adjacent regions in the image. The current model also ignores the shape and size of regions and the occlusions that cause regions to be split. While any of these may be useful features in a full-scale system, we ignore them in this work.
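The sketch below, referenced at the end of Section 4.3.1, gives a minimal implementation of the greedy key-region selection and a rough direction test. Adjacency, sizes and centroids are assumed to be precomputed, and the direction rule uses centroids only, so the perimeter-based enclosed/enclosing relations of the paper are not covered.

```python
def select_key_regions(regions, adj, size, image_size, thd=0.9):
    """Greedy key-region selection (Section 4.3.1).
    regions: list of region ids; adj[r]: set of adjacent region ids;
    size[r]: pixel count of region r; image_size: pixels in the image."""
    candidates = set(regions)
    key_regions, covered = [], set()
    while candidates:
        r = max(candidates, key=lambda x: size[x])     # largest candidate
        key_regions.append(r)
        covered |= {r} | adj[r]                        # key region + its neighbors
        candidates -= {r} | adj[r]                     # remove them from candidates
        if sum(size[x] for x in covered) / image_size >= thd:
            break                                      # coverage threshold reached
    return key_regions

def spatial_relation(key_centroid, neighbor_centroid):
    """Very rough direction test between a key region and an adjacent region."""
    (xr, yr), (xn, yn) = key_centroid, neighbor_centroid
    dx, dy = xn - xr, yn - yr
    if abs(dy) >= abs(dx):
        return "above" if dy < 0 else "below"          # image y axis points down
    return "left" if dx < 0 else "right"
```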

4.4. Training the Bayesian network

Training a Bayesian network involves determining the CPMs for each parent–child relationship. Various researchers have proposed schemes for learning the parameters associated with a BN [25]. When the observed data are complete, the CPM associated with a link can be trained by frequency counting, provided ground truth is available. At the learning phase, we have a set of training images T with scene labels. Moreover, each image in the training set is segmented into a few regions, and the image regions are hand-labeled with the predefined objects. Some ambiguous regions are left unlabeled.

4.4.1. Prior distribution for the scene node

P(C) models the prior distribution of scene types across the image population. We currently do not take advantage of prior information and simply use a flat prior, but priors could be learned in the future.

Fig. 5. Spatial relationships of region pairs: above, left, below, right, enclosed, enclosing.


4.4.2. Conditional probabilities for object nodes

P(M_j \mid C_i) models the conditional probability of object M_j given the scene type C_i. Its definition is not unique; we introduce two methods for calculating P(M_j \mid C_i). According to the two different definitions of P(M_j \mid C), the resulting CBN models are denoted CBN_SP1 and CBN_SP2.

4.4.2.1. CBN_SP1. The observation for node M_j is the feature F_j extracted from the training image by (6). Let H(M_j) \in {1, 2, 3, ..., n_j} be the quantization of the feature F_j, where n_j = 10 is the number of bins, and let Pr_j^i(\cdot) represent the quantized histogram over the training images of C_i that contain M_j. Then,

P(M_j \mid C_i) = P(H(M_j) = h \mid C_i) = \frac{Pr_j^i(H(M_j) = h)}{v_i}    (8)

where v_i is the number of training images of scene type C_i.

4.4.2.2. CBN_SP2. A discriminative and very efficient approach to scene categorization is the use of SVMs. Vogel [7] trains multi-class SVMs for natural scenes using the semantic features of the concept-occurrence vector (COV). Similar to the COV in [7], the features of the nodes M_j can be combined into a new feature vector X_j = {F(M_1), F(M_2), ..., F(M_L)} for the whole image, which also reflects the frequency of semantic objects in the image. Taking X_j as the input feature, we train a multi-class SVM classifier for scenes. The output of the SVM classifier is a distance measure in feature space. Let d(C_i \mid X_j) represent the SVM classification result for a given image according to X_j for scene C_i; it can be transformed to probability space using the sigmoid function

P(C_i \mid X_j) = \frac{1}{1 + e^{-d(C_i \mid X_j)}}    (9)

Since X_j = {F(M_1), F(M_2), ..., F(M_L)}, we define the joint conditional probability of the object nodes {M_1, M_2, ..., M_L} by the posterior probability obtained by the SVM classifier, similar to [8], as follows:

P(M_1, M_2, ..., M_L \mid C_i) = P(X_j \mid C_i) = \frac{P(X_j)}{P(C_i)} P(C_i \mid X_j)    (10)

where P(X_j) and P(C_i) are the prior distributions of local objects and scenes. We currently do not take advantage of prior information and simply use flat priors to model the prior probabilities. Therefore, the joint conditional probability of the object nodes is

P(M_1, M_2, ..., M_L \mid C_i) \propto P(C_i \mid X_j) \propto \frac{1}{1 + e^{-d(C_i \mid X_j)}}    (11)

Although each individual P(M_j \mid C_i) is not defined in this method, the joint conditional probability P(M_1, M_2, ..., M_L \mid C_i) is enough for inference of the posterior probability of the scene node.

4.4.3. Conditional probabilities for nodes of key semantic regions

P(KR_j \mid C_i) represents the conditional dependency of the key semantic regions given the scene type. The observation value of a key semantic region is its object label. Let M_j be the label of KR_j; then, using a simple coverage-percentage method, P(KR_j \mid C_i) is expressed as

P(KR_j = M_j \mid C_i) = \frac{\sum_{R_l \in U} |R_l| \, Z(R_l, M_j)}{\sum_{I \in D_i} |I|}    (12)

where D_i represents the set of training images whose scene type is C_i, U is the set of key regions chosen in D_i, Z is the selection function defined as

Z(R_i, M_j) = \begin{cases} 1, & M(R_i) = M_j \\ 0, & \text{otherwise} \end{cases}

and M(R_i) denotes the object label of the semantic region. We use P(R_j^i \mid KR_i) to denote the conditional probability between a key region and its adjacent region in a certain direction j, where j \in S. We express this dependency using the common boundary of adjacent regions:

P(R_j^i = M_l \mid KR_i = M_k) = \frac{\sum_{KR_i} \sum_{R_j^i} B(KR_i, R_j^i) \, Z((KR_i, R_j^i), (M_k, M_l))}{\sum_{KR_i} B(KR_i) \, Z(KR_i, M_k)}, \quad \forall KR_i \in T \wedge R_j^i \in adj(KR_i)    (13)

where T is the set of training images and B(\cdot) represents the boundary of a region or the common boundary of two adjacent regions. The selection functions are defined as

Z((R_i, R_j^i), (M_k, M_l)) = \begin{cases} 1, & M(R_i) = M_k \wedge M(R_j^i) = M_l \\ 0, & \text{otherwise} \end{cases}
\qquad
Z(R_i, M_k) = \begin{cases} 1, & M(R_i) = M_k \\ 0, & \text{otherwise} \end{cases}    (14)
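The frequency-counting estimates of Eqs. (12) and (13) and the sigmoid mapping of Eq. (9) can be sketched as follows. The training-image data structure (dicts with 'scene', 'image_pixels' and 'key_regions' fields) is hypothetical, and the direction index of Eq. (13) is folded into the neighbor entries for brevity.

```python
import numpy as np
from collections import defaultdict

def learn_key_region_cpms(training_images):
    """Estimate P(KR = M_j | C_i) (Eq. 12) and P(R = M_l | KR = M_k) (Eq. 13)
    by frequency counting. Each training image is a dict such as
    {'scene': 'coast', 'image_pixels': 345600,
     'key_regions': [{'label': 'sky', 'size': 120000, 'boundary': 2400,
                      'neighbors': [('rocks', 800), ...]}, ...]}."""
    kr_num = defaultdict(float)    # (scene, key label) -> pixels covered by key regions
    kr_den = defaultdict(float)    # scene -> total pixels of its training images
    adj_num = defaultdict(float)   # (key label, neighbor label) -> shared boundary length
    adj_den = defaultdict(float)   # key label -> total boundary length of key regions
    for img in training_images:
        kr_den[img["scene"]] += img["image_pixels"]
        for kr in img["key_regions"]:
            kr_num[(img["scene"], kr["label"])] += kr["size"]
            adj_den[kr["label"]] += kr["boundary"]
            for nb_label, shared in kr["neighbors"]:
                adj_num[(kr["label"], nb_label)] += shared
    p_kr = {k: v / kr_den[k[0]] for k, v in kr_num.items()}     # Eq. (12)
    p_adj = {k: v / adj_den[k[0]] for k, v in adj_num.items()}  # Eq. (13)
    return p_kr, p_adj

def svm_distance_to_probability(distance):
    """Eq. (9)/(11): map an SVM distance measure to (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-distance))
```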

4.5. Inference

For an input image I, after the first stage described in Section 3.2, the segmented regions of the image are labeled by the object classifiers and adjacent regions with the same label are merged. Let R = {R_1, R_2, ..., R_n} be the set of semantic regions, {KR_1, KR_2, ..., KR_K} the set of chosen key semantic regions, and {R_1^k, ..., R_{|S|}^k} the set of adjacent regions of KR_k in each direction. We use E = {E_M, E_G} to denote the set of evidence from the input image. E_M = {F_1, F_2, ..., F_{|M|}} denotes the evidence from each object, which can be calculated by Eq. (6). E_G = {KE_1, KE_2, ..., KE_K} denotes the evidence from each key semantic region, where KE_k = {m_k, m_1^k, ..., m_{|S|}^k} consists of the label of KR_k and the labels of its adjacent regions.

In this framework, we want to find the scene with maximum a posteriori (MAP) likelihood, given the evidence from the input image. As analyzed in Section 4.1, we already have C^* = \arg\max_{C_i} P(C_i) P(E \mid C_i), and P(C_i) is defined as a flat prior in Section 4.4. By exploiting the Markov independence assumptions, the term P(E \mid C_i) can be rewritten as

P(E \mid C_i) = P(E_M \mid C_i) P(E_G \mid C_i) = P(M_1, M_2, ..., M_L \mid C_i) P(KE_1, ..., KE_K \mid C_i) = \prod_{j=1}^{L} P(M_j \mid C_i) \prod_{k=1}^{K} P(KE_k \mid C_i)    (15)

where N is the number of scene categories. In fact, P(M_j = F_j \mid C_i) can be computed by formula (8) or (11). The term P(KE_k \mid C_i) is factorized as

P(KE_k \mid C_i) = P(KR_k = m_k, R_1^k = m_1^k, ..., R_{|S|}^k = m_{|S|}^k \mid C_i) = P(R_1^k = m_1^k, ..., R_{|S|}^k = m_{|S|}^k \mid KR_k = m_k, C_i) \, P(KR_k = m_k \mid C_i)    (16)

The term P(KR_k = m_k \mid C_i) can be calculated by formula (12). By exploiting the Markov independence assumptions in Bayesian networks, the nodes {R_1^k, R_2^k, ..., R_{|S|}^k} are conditionally independent of the scene node C given their parent node KR_k, thus

P(R_1^k, R_2^k, ..., R_{|S|}^k \mid KR_k, C_i) = P(R_1^k, R_2^k, ..., R_{|S|}^k \mid KR_k) = \prod_{l=1}^{|S|} P(R_l^k \mid KR_k)

So we get

P(KE_k \mid C_i) = P(KR_k = m_k \mid C_i) \prod_{l=1}^{|S|} P(R_l^k = m_l^k \mid KR_k = m_k)    (17)

Therefore the scene type of an unknown image inferred by the CBN can be expressed as follows:

C^* = \arg\max_{C_i} P(C_i \mid E)
    = \arg\max_{C_i} P(C_i) \prod_{j=1}^{L} P(M_j \mid C_i) \prod_{k=1}^{K} \left[ P(KR_k = m_k \mid C_i) \prod_{l=1}^{|S|} P(R_l^k = m_l^k \mid KR_k = m_k) \right]
    = \arg\max_{C_i} P(C_i) \, P(M_1, M_2, ..., M_L \mid C_i) \prod_{k=1}^{K} \left[ P(KR_k = m_k \mid C_i) \prod_{l=1}^{|S|} P(R_l^k = m_l^k \mid KR_k = m_k) \right]    (18)

where the second line of (18) is used for CBN_SP1 and the third line for CBN_SP2. Fortunately, for scene classification, and particularly for landscape images, the number of semantic regions n is generally small (n < 10 in the human-labeled ground truth). For a test image, segmented regions are merged if they are adjacent and have the same object label. The number of semantic regions in a test image is slightly larger than in the ground truth because of the errors caused by the object classifiers. Thus, a brute-force approach to maximizing (18) is tractable.
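A brute-force implementation of the MAP rule (18) for the CBN_SP1 variant might look like the sketch below. The table keys and the evidence structure are hypothetical, log-probabilities are used to avoid numerical underflow, and unseen configurations fall back to a small constant rather than whatever smoothing the authors may actually use.

```python
import math

def quantize(f, bins=10):
    """Quantize an occurrence percentage into one of n_j = 10 histogram bins (Eq. 8)."""
    return min(int(f * bins), bins - 1)

def classify_scene(scenes, p_scene, p_object_given_scene, p_kr, p_adj, evidence):
    """Brute-force MAP inference of Eq. (18).
    evidence = {'objects': {M_j: F_j}, 'keys': [(key_label, [neighbor_labels])]}."""
    eps = 1e-6
    best_scene, best_score = None, -math.inf
    for c in scenes:
        score = math.log(p_scene.get(c, eps))
        for obj, f in evidence["objects"].items():                 # product over object nodes
            score += math.log(p_object_given_scene.get((c, obj, quantize(f)), eps))
        for key_label, neighbors in evidence["keys"]:              # product over key regions
            score += math.log(p_kr.get((c, key_label), eps))
            for nb_label in neighbors:
                score += math.log(p_adj.get((key_label, nb_label), eps))
        if score > best_score:
            best_scene, best_score = c, score
    return best_scene
```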

5. Experimental results

5.1. Experimental setup

5.1.1. Database

We use the database kindly provided to us by Vogel et al. [7], which contains 700 images of 720 x 480 pixel resolution. They are distributed over six natural scene classes as follows: coasts (142), river/lakes (111), forests (103), plains (131), mountains (179) and sky/clouds (34). We chose this dataset because of its good resolution and color; we refer to the dataset as VS. The dataset is challenging given the number of classes and the intrinsic ambiguities that arise from their definition. This is a property originally introduced as part of the database to evaluate classification performance, and it makes classification more difficult.

5.1.2. Protocol

The classification task is to assign each test image to one of a number of categories. To perform experiments, we adopted a 10-fold training/testing protocol. That is, the data are split into 10 folds, and for each fold, all parameters are trained on the remaining 90% of the data, and the learned system is tested on the given fold. The presented class performance corresponds to averages over the 10 runs, and the overall system performance is the macro-average of the class performances; a sketch of this protocol appears after the model list in Section 5.1.3. In all our experiments, we used a multi-class SVM for classifying local regions into objects; the feature vector extracted from each image region is the concatenation of a 54-bin linear HSV color histogram (hue: 18 bins; saturation: 16 bins; value: 16 bins), an 8-bin edge direction histogram, and the 24 gray-level co-occurrence features described in Section 3.2.

5.1.3. CBN models with different structures

We present four Bayesian network models for scene classification:

(1) CBN_SP1: this CBN model is presented in Section 4.4. The conditional probability P(M_j \mid C_i) is calculated by Eq. (8), learned by frequency counting.

(2) CBN_SP2: the structure is the same as that of CBN_SP1, while the conditional probability is calculated by Eq. (10), learned from the SVM distance measure.

The next two are baselines for comparison:

(3) IBN (independent BN model): this model considers each region to be independent of the others and does not take any spatial information into account. The IBN model only includes the object nodes, similar to the Bayesian network presented in [8] for indoor/outdoor classification. The structure of the IBN is equal to the CBN without the ''Spatial Configuration'' nodes.

(4) CBN_SP0: this model considers only the spatial information between objects; that is, the structure of CBN_SP0 is equal to the CBN without the ''Objects'' nodes.
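The 10-fold protocol of Section 5.1.2, with per-class and macro-averaged accuracy, can be sketched as follows. The use of stratified folds and the train_fn/predict_fn callables are assumptions standing in for the actual CBN training and inference.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def ten_fold_macro_average(images, labels, train_fn, predict_fn, seed=0):
    """Train on 90% of the data, test on the held-out fold, and report the
    per-class accuracies and their macro-average over the 10 runs."""
    labels = np.asarray(labels)
    per_class = {c: [] for c in np.unique(labels)}
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(images, labels):
        model = train_fn([images[i] for i in train_idx], labels[train_idx])
        pred = np.asarray(predict_fn(model, [images[i] for i in test_idx]))
        for c in per_class:
            mask = labels[test_idx] == c
            if mask.any():
                per_class[c].append(np.mean(pred[mask] == c))
    class_acc = {c: float(np.mean(v)) for c, v in per_class.items()}
    return class_acc, float(np.mean(list(class_acc.values())))
```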

5.1.4. State-of-the-art baselines

Discriminative approach [7]: we consider as the first baseline the approach introduced along with the database [7]. In that work, the image was divided into a grid of 10 x 10 blocks, and on each block a feature vector composed of an 84-bin HSI histogram, a 72-bin edge histogram and 24 gray-level co-occurrence features was computed. These features were concatenated after normalization and weighting, and used to classify (with an SVM) each block into one of nine local semantic classes (water, sand, foliage, grass). In a second stage, the 9-dimensional vector containing the image occurrence percentage of each regional object was used as input to an SVM classifier to classify images into one of the six scene classes. The reported performance of that approach was good: 74.1%. We chose the discriminative approach as a baseline because it is a mainstream approach that is also based on training.

Factor-graph model [20]: in order to compare our Bayesian network model with other methods using a graphical model with spatial relations, we carry out the classification task on the VS dataset using the generative model proposed by Boutell et al. [20]. In their work, images were segmented into regions and labeled by semantic object detectors. They then modeled pairwise relationships between regions and estimated scene probabilities using loopy belief propagation on a factor graph.

5.2. Classification results

In this section, we present the classification results of our approach using the four Bayesian network models (IBN, CBN_SP0, CBN_SP1, CBN_SP2) and compare them with the baseline methods and other previous methods. In each case, the dataset VS was automatically segmented. The test image set with hand-labeled region annotation is referred to as VSa, while the set labeled by


the object classifier is referred to as VSb. The performance of the methods under these different conditions is presented and discussed below. Table 1 shows the classification accuracies of our method and several state-of-the-art techniques.

Table 1
Classification accuracy (%) comparison between techniques.

Methods                                        VSa      VSb
Bayesian network       IBN                     80.29    67.14
                       CBN_SP0                 80.86    68.71
                       CBN_SP1                 87.28    73.57
                       CBN_SP2                 88.71    75.86
Factor-graph approach [20]                     82.29    68.43
Discriminative         Low-level features      65.0
approach [7]           Semantic features       86.4     74.1
Bag-of-words           BOV [12]                52.1
                       pLSA1 [12]              61.9
                       pLSA2 [13]              76.9
                       pLSA3 [10]              87.8

5.2.1. Classification performance of the Bayesian network models

To show the benefit of the spatial information, we compare the contextual Bayesian network models CBN_SP1 and CBN_SP2 against the baseline models mentioned earlier: the IBN model, the CBN_SP0 model and the discriminative approach. In our experiments, classification using the CBN_SP1 and CBN_SP2 models always outperformed that using IBN and the discriminative approach, showing that spatial information does help distinguish natural scenes.

Table 2 displays the confusion matrices of the scene categorization based on manual annotation, which show how the different categories are confused. Table 3 displays the confusion matrices of the scene categorization based on classified regions. It is clear that the two categories coasts and river/lakes are frequently confused, especially for the IBN and discriminative models. This finding suggests that the corresponding images are also semantically hard to categorize. Visual inspection of the mis-categorized images shows that those images are hard to categorize unambiguously even for humans. Using our contextual Bayesian networks, the confusion rates between coasts and river/lakes decrease noticeably.

Table 2
Categorization accuracies (%) based on manually annotated image regions. Rows are the true categories; columns are the predicted categories.

IBN
              Coa    R/L    For    Pla    Mou    S/C
Coasts        70.5   19.7   2.8    3.5    2.8    0.7
River/lakes   20.7   71.2   5.4    0.0    1.8    0.9
Forests       0.0    2.9    94.2   2.9    0.0    0.0
Plains        3.0    0.0    15.3   75.6   6.1    0.0
Mountains     1.1    2.8    0.6    8.4    86.6   0.6
Sky/clouds    0.0    0.0    0.0    0.0    5.9    94.1

CBN_SP1
              Coa    R/L    For    Pla    Mou    S/C
Coasts        75.4   13.4   0.7    0.7    8.5    0.0
River/lakes   6.3    82.9   1.8    0.9    8.1    0.0
Forests       0.0    1.0    99.0   0.0    0.0    0.0
Plains        4.6    0.0    2.3    84.0   10.7   0.0
Mountains     0.0    1.1    0.0    1.7    97.2   0.0
Sky/clouds    8.8    0.0    0.0    8.8    5.9    76.5

Discriminative approach [7]
              Coa    R/L    For    Pla    Mou    S/C
Coasts        80.3   14.1   0.7    3.5    0.7    0.7
River/lakes   18.0   73.0   3.6    0.9    3.6    0.9
Forests       0.0    1.9    95.1   1.9    1.0    0.0
Plains        0.8    0.0    0.8    91.6   5.3    1.5
Mountains     0.6    2.2    0.6    6.7    89.4   0.6
Sky/clouds    0.0    0.0    0.0    5.9    0.0    94.1

CBN_SP2
              Coa    R/L    For    Pla    Mou    S/C
Coasts        81.7   11.3   0.7    0.7    4.2    0.0
River/lakes   9.0    83.8   2.7    1.8    2.7    0.0
Forests       0.0    1.0    97.1   1.9    0.0    0.0
Plains        1.5    0.0    1.5    90.8   6.1    0.0
Mountains     0.0    1.7    0.0    5.6    92.7   0.0
Sky/clouds    2.9    2.9    0.0    5.9    2.9    85.3

Table 3
Categorization accuracies (%) based on classified image regions. Rows are the true categories; columns are the predicted categories.

IBN
              Coa    R/L    For    Pla    Mou    S/C
Coasts        55.6   25.4   2.1    5.6    10.6   0.7
River/lakes   12.6   48.6   6.3    1.8    27.9   2.7
Forests       0.0    3.9    89.3   1.9    3.9    0.0
Plains        6.1    5.3    13.0   51.9   22.1   1.5
Mountains     6.7    3.9    1.7    4.5    81.0   2.2
Sky/clouds    0.0    5.9    0.0    0.0    0.0    94.1

CBN_SP1
              Coa    R/L    For    Pla    Mou    S/C
Coasts        67.6   17.6   1.4    4.2    8.1    0.7
River/lakes   15.3   51.4   4.5    0.9    17.1   1.8
Forests       0.0    4.9    91.3   1.9    1.9    0.0
Plains        6.9    3.8    5.3    60.3   22.1   1.5
Mountains     3.9    2.8    0.6    2.8    89.4   0.6
Sky/clouds    5.9    0.0    0.0    5.9    2.9    85.3

Discriminative approach [7]
              Coa    R/L    For    Pla    Mou    S/C
Coasts        71.1   12.0   0.7    6.3    9.2    0.7
River/lakes   28.8   42.3   6.3    4.5    17.1   0.0
Forests       1.0    2.9    89.3   3.9    2.9    0.0
Plains        4.6    0.8    5.3    71.0   17.6   0.8
Mountains     3.9    3.4    0.0    5.0    86.6   1.1
Sky/clouds    8.8    0.0    0.0    0.0    0.0    91.2

CBN_SP2
              Coa    R/L    For    Pla    Mou    S/C
Coasts        73.2   14.8   2.1    3.5    5.6    0.7
River/lakes   20.7   49.5   5.4    0.9    22.5   1.8
Forests       0.0    4.9    89.3   2.9    2.9    0.0
Plains        3.8    3.1    8.4    65.0   18.3   1.6
Mountains     3.4    1.7    0.6    2.8    91.1   0.6
Sky/clouds    2.9    0.0    2.9    0.0    2.9    91.2


Fig. 6. Examples of images and segmentations for which the spatial Bayesian model gave correct results while the baseline methods failed.

The confusion rate of river/lakes images mis-categorized as coasts decreases remarkably from 18% to 9%, and that of coasts decreases slightly as well. Tables 2 and 3 also show that the classification rate of sky/clouds is significantly decreased in both CBN_SP1 and CBN_SP2 compared to IBN. This is mainly because images of sky/clouds consist of a large amount of sky, which has few spatial relations.

Fig. 6 contains example images (and their corresponding annotations with local objects) for which the contextual Bayesian network model yields an improvement over the baseline models. The first and second columns are coast images, which are mis-categorized by the discriminative method while the CBN model obtains correct results. The following columns are river/lakes images, and the last two columns are images of other categories, including mountains, plains and forests. The first three rows show images with their manual annotation, and the images in the last row are labeled by the object classifier. We analyze some typical images, starting with the images with manual annotations. The first example shows a coast scene with mountains and a large grass field. In this case, because foliage and grass account for a high percentage of the whole image, the IBN and discriminative models, which are based on object occurrence only, classify it incorrectly as a plain scene. However, the facts that sky is above water and that rocks are beside water allow the CBN models to classify it correctly as a coast. For the first river/lakes image in the third column, the percentages of water, sky and rocks are very similar to those in coast images, causing the baseline models to misclassify it as a coast scene. However, typical coast images have adjacent sky and water regions, which rarely occur in rivers or lakes; this spatial relationship allows the CBN to classify it correctly. The first image in the fifth column is clearly a forest scene. However, without spatial information, the large amount of grass gives the image a high probability of being a plain scene according to the IBN and discriminative models. Using the evidence of trunks surrounded by foliage, our CBN models classify it correctly.

Since the local object classifier makes some errors, the final scene categorization results are affected. For example, in the last image in Fig. 6, haystacks on the grass are labeled as rocks by the object classifier, which causes the IBN and discriminative approaches to misclassify it as a mountain scene. However, typical mountain images have rocks in the foreground and next to the sky, not at the bottom of the image; this spatial relationship allows the CBN models to classify it correctly.

5.2.2. Comparison between object-based methods

Our approach, the factor-graph approach [20] and the discriminative approach [7] are all based on an intermediate representation by local semantic objects. We choose the discriminative approach as a baseline because it is a mainstream approach that is also based on training; the SVM classifier used in the discriminative approach also has equivalent or significantly better error rates than other classification methods. We choose the factor-graph approach as a baseline because it is a typical graphical-model-based scene classification method among available works. The results for each category using these approaches are depicted in Figs. 7 and 8, where FG represents the factor-graph approach in [20] and SVM represents the discriminative approach in [7].

The factor-graph approach is based on a graphical model, as is our approach. They formulated the scene classification problem based on the concept of scene configurations. Scene configurations in their paper represent the configurations of materials in a scene, which is similar to the spatial configurations of semantic regions in our paper. The scene type is found with MAP likelihood, given the evidence from the object detector:

C^* = \arg\max_i P(C_i \mid E) = \arg\max_i P(C_i) P(E \mid C_i) = \arg\max_i P(C_i) \sum_{g \in G} P(g \mid C_i) P(E \mid g)    (19)


Fig. 7. Classification accuracy of the scene category using different methods—based on regions with manual annotation.

Fig. 8. Classification accuracy of the scene category using different methods—based on classified regions.

Our final scene type is decided by

C^* = \arg\max_i P(C_i \mid E) = \arg\max_i P(C_i) P(E \mid C_i) = \arg\max_i P(C_i) P(E_M \mid C_i) P(E_G \mid C_i)    (20)

The differences between the method of [20] and our method mainly lie in two points:

(1) Both the term \sum_{g \in G} P(g \mid C_i) P(E \mid g) and the term P(E_G \mid C_i) express the compatibility between the scene and the spatial configurations of objects, although they differ in details such as the model and the spatial relations used in the system. However, our method has an additional term P(E_M \mid C_i), which models the relation between the scene and the objects present in it. This is basic and useful evidence for scene classification, and it is used in many approaches [7].

(2) To model the spatial configuration, the method of [20] approximates the full connection of objects as pairwise relationships; that is, they consider the pairwise relations between any two regions in the scene. Our method is to choose the discriminative regions and use their spatial arrangements to express scene configurations.

Compared to the discriminative approach, our approach is a generative one. Generative models usually offer much insight into the relationships between, and the influence of, the various factors involved in the problem. This is often not the case with discriminative models such as neural networks or the SVM.

(1) Categorization based on image regions with manual annotation: Fig. 7 displays the classification results based on manual annotations, which serve as the benchmark: these experiments reveal the best performance that can be expected from the methods. Plot (a) suggests that the CBN_SP2 model outperforms the other Bayesian network models except on


the sky/clouds category. This is mainly because images of sky/clouds have few spatial relations, while the other categories have many spatial relations between semantic regions. Plot (b) suggests that our approach is comparable to the discriminative approach using the SVM classifier, with an increase for some complex scenes such as coasts and river/lakes. It also shows quite clearly that our model outperforms the factor-graph model, which only uses pairwise relationships and does not consider the object occurrences in scenes.

(2) Categorization based on classified image regions: Fig. 8 displays the results for every category based on classified image regions. The classification accuracy of the objects is 73.5%. Some materials have high accuracies (e.g., sky, 93%), while other objects have substantially lower performance (e.g., field and grass, below 40%). This is mainly because of the varied colors and textures of field under different seasons and weather conditions. Due to the inaccuracy of the object classifiers, classification results based on classified image regions are lower than those based on annotated regions. Since field is a major component of plain scenes and grass occurs frequently in river scenes, the classification rates based on classified regions for river and plain decrease significantly. As with the annotated image regions, the CBN_SP2 approach clearly outperforms the other BN approaches as well as the FG and SVM approaches. The gain in categorization accuracy relative to the IBN model is up to 8%.

5.2.3. Comparison between all methods

Table 1 shows the classification accuracies of several state-of-the-art techniques. The bag-of-words technique has been a popular approach for scene classification in recent years. The categorization results using several typical bag-of-words approaches are displayed; they differ in the statistical models and features extracted from images. The classification accuracies of the bag-of-words techniques range from 52.1% to 87.8%. The best result is obtained by the pLSA model proposed by Bosch et al. [10]. Table 1 shows that our Bayesian network model outperforms the BOV approach and the pLSA1 model in [12], and the CBN_SP2 result is comparable to the pLSA2 model in [13]. The superior performance of pLSA2 [13] and pLSA3 [10] could be due to the use of better features and how they are used. For example, in the feature detection step of pLSA2, a 5 x 5 square neighborhood around a pixel is used to compute the feature vector and the patches are spaced by 3 pixels on a regular grid, forming about 6400 descriptors per image. The pLSA3 and SPM models use many more feature descriptors: color SIFT descriptors are computed at points on a regular grid with spacing of M pixels. At each grid point, SIFT descriptors are computed over circular support patches with radii of r pixels. Consequently, each point is represented by n SIFT descriptors (where n is the number of circular supports), each 128 x 3 dimensional (the best results are obtained with M = 10 and r = 4). For a 720 x 480 image, about 3456 SIFT descriptors are formed. In our paper, the segmentation algorithm partitions an image into roughly 20–50 homogeneous regions and a 78-dimensional feature vector is extracted for each region; these features are far fewer than those in pLSA3. We believe that the difference in performance with respect to our work arises from the fact that natural scene discrimination can benefit greatly from the use of different feature descriptors. Something that we have not made use of constitutes an issue to investigate in the future. In addition, the intermediate

4053

information of the image regions obtained by our approaches can be used in other scene retrieval tasks. The bag-of-words techniques are only interested in the semantic meaning of the whole scene.
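
As a rough, illustrative check of the descriptor counts discussed above, the short sketch below reproduces the grid arithmetic for the dense-sampling scheme and contrasts it with the region-based representation used here. The 720 × 480 image size, the spacing M = 10, and the 78-dimensional region features are taken from the discussion above; the counting convention (one descriptor per grid point) and everything else in the snippet are simplifying assumptions for illustration, not the exact protocols of the compared systems.

```python
# Back-of-the-envelope comparison of per-image descriptor counts.
# Assumption: one descriptor per grid point for the dense-sampling scheme,
# and one 78-dimensional vector per segmented region for our representation.

def dense_grid_count(width, height, spacing, supports=1):
    """Approximate number of descriptors on a regular grid with the given spacing."""
    return (width // spacing) * (height // spacing) * supports

def region_feature_count(num_regions, dims=78):
    """Region-based representation: one `dims`-dimensional vector per region."""
    return num_regions, num_regions * dims

if __name__ == "__main__":
    # Dense sampling on a 720 x 480 image with grid spacing M = 10:
    # (720 // 10) * (480 // 10) = 72 * 48 = 3456 grid points.
    print("dense-grid descriptors:", dense_grid_count(720, 480, spacing=10))

    # Our segmentation yields roughly 20-50 homogeneous regions per image.
    for n in (20, 50):
        regions, total_dims = region_feature_count(n)
        print(f"{regions} regions -> {total_dims} feature dimensions in total")
```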

6. Conclusions and future work

In this paper, we presented a probabilistic scheme for natural scene classification using Bayesian networks. Based on the results presented in this paper, we believe that the proposed scene modeling methodology is effective for natural scene categorization. With extensive experiments, we have shown that it outperforms some classical scene classification methods and achieves performance comparable to the best results reported in previous work. In particular, the main novelty of the work lies in the explicit use of spatial relations in building Bayesian networks for scene modeling, distinguishing it from other work using semantic features. We studied the influence of the spatial evidence used in our Bayesian network model, and the contextual Bayesian network model with the SVM distance measure (CBN_SP2) gave the best results. Furthermore, compared with the discriminative approach, the contextual Bayesian network models CBN_SP1 and CBN_SP2 have both shown very good performance in distinguishing categories (e.g., coasts and river/lakes) that are frequently confused with each other. In contrast to another approach that also models the scene configuration using a graphical model, the factor graph, we obtain superior classification rates on the natural scene dataset.

Moreover, our system is highly modular, and the method of extracting local cues can be improved and the model retrained. Because of the modularity of our system and the flexibility of the Bayesian network structure, we believe that our approach could be used for scene understanding in different environments (e.g., indoor images), given appropriate object detectors and relations for those environments (e.g., tables and computers in an office scene). Additional scene attributes can easily be added to the Bayesian network for enhanced semantic scene interpretation.

In terms of future directions, diversifying the Bayesian network with other useful semantic attributes may improve the contribution of the computed semantic features. In addition, the conditional dependencies of the network could be approximated by more complex functions, which could improve the modeling and inference ability of the Bayesian networks.
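
To make the role of object-occurrence and spatial evidence concrete, the following sketch computes a scene posterior by direct enumeration in a deliberately simplified network in which the evidence nodes are assumed conditionally independent given the scene category. The categories, evidence variables, and probabilities are invented for illustration only; they are not the learned parameters of the CBN, but they show how a single spatial cue can separate two scenes with similar material composition.

```python
# Toy posterior computation for a simplified scene network.
# Assumption: evidence variables are conditionally independent given the scene
# category; all numbers below are invented for illustration.

prior = {"coast": 0.5, "river/lake": 0.5}

# P(evidence is true | scene) for two binary evidence variables:
# an object-occurrence cue and a spatial-relation cue.
cpt = {
    "sand_present":      {"coast": 0.7, "river/lake": 0.2},
    "grass_above_water": {"coast": 0.1, "river/lake": 0.6},
}

def posterior(evidence):
    """Exact posterior over scene categories by enumeration (Bayes' rule)."""
    scores = {}
    for scene, p in prior.items():
        for var, observed in evidence.items():
            likelihood = cpt[var][scene]
            p *= likelihood if observed else (1.0 - likelihood)
        scores[scene] = p
    z = sum(scores.values())
    return {scene: score / z for scene, score in scores.items()}

if __name__ == "__main__":
    # Material proportions alone may look coast-like, but the spatial cue
    # "grass above water" shifts the posterior towards river/lake.
    print(posterior({"sand_present": False, "grass_above_water": True}))
```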

Acknowledgment

The authors would like to thank Julia Vogel for kindly providing the image datasets.

References

[1] A.W. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 1349–1380.
[2] A. Vailaya, A. Figueiredo, A. Jain, H. Zhang, Image classification for content-based indexing, IEEE Transactions on Image Processing 10 (2001) 117–129.
[3] J. Shen, J. Shepherd, A.H.H. Ngu, Semantic-sensitive classification for large image libraries, in: Proceedings of the International Multimedia Modelling Conference, Melbourne, Australia, 2005.
[4] E. Chang, K. Goh, G. Sychay, G. Wu, CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines, IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Conceptual and Dynamical Aspects of Multimedia Content Description 13 (2003) 26–38.
[5] J. Luo, A.E. Savakis, A. Singhal, A Bayesian network-based framework for semantic image understanding, Pattern Recognition 38 (2005) 919–934.


[6] A. Mojsilovic, J. Gomes, B. Rogowitz, ISEE: perceptual features for image library navigation, in: Proceedings of SPIE Human Vision and Electronic Imaging, San Jose, California, 2002.
[7] J. Vogel, B. Schiele, Semantic modeling of natural scenes for content-based image retrieval, International Journal of Computer Vision 72 (2007) 133–157.
[8] N. Serrano, A.E. Savakis, J. Luo, Improved scene classification using efficient low-level features and semantic cues, Pattern Recognition 37 (2004) 1773–1784.
[9] A. Bosch, A. Zisserman, X. Muñoz, Scene classification via pLSA, in: Proceedings of the European Conference on Computer Vision, Graz, Austria, 2006.
[10] A. Bosch, A. Zisserman, X. Muñoz, Scene classification using a hybrid generative/discriminative approach, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2008) 712–727.
[11] L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 2005.
[12] P. Quelhas, F. Monay, J.M. Odobez, D. Gatica-Perez, T. Tuytelaars, A thousand words in a scene, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 1575–1589.
[13] A. Bosch, X. Muñoz, R. Marti, A review: which is the best way to organize/classify images by content? Image and Vision Computing 25 (2007) 778–791.
[14] T. Hofmann, Unsupervised learning by probabilistic latent semantic analysis, Machine Learning 42 (2001) 177–196.
[15] D. Blei, A. Ng, M. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.

[16] P. Quelhas, J.-M. Odobez, Natural scene image modeling using color and texture visterms, in: Proceedings of CIVR, 2006.
[17] S. Thorpe, S. Fize, C. Marlot, Speed of processing in the human visual system, Nature 381 (1996) 520–522.
[18] S. Aksoy, K. Koperski, C. Tusk, G. Marchisio, J.C. Tilton, Learning Bayesian classifiers for scene classification with a visual grammar, IEEE Transactions on Geoscience and Remote Sensing 43 (2005) 581–589.
[19] M. Boutell, C. Brown, J. Luo, Learning spatial configuration models using modified Dirichlet priors, in: Proceedings of the Workshop on Statistical Relational Learning, in conjunction with ICML, 2004.
[20] M.R. Boutell, J. Luo, C.M. Brown, Scene parsing using region-based generative models, IEEE Transactions on Multimedia 9 (2007) 136–146.
[21] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, California, 1988.
[22] N. Friedman, D. Geiger, M. Goldszmidt, Bayesian Network Classifiers, Kluwer Academic Publishers, Boston, 1996.
[23] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 603–619.
[24] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw, 2001.
[25] D. Heckerman, A tutorial on learning with Bayesian networks, Microsoft Research, Technical Report MSR-TR-95-06, March 1995.
[26] A. Singhal, J. Luo, W. Zhu, Probabilistic spatial context models for scene content understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Madison, WI, 2003.

Huanhuan Cheng received the BS degree in Computing Mathematics in 2003 and the MS degree in Applied Mathematics in 2005, both from the National University of Defense Technology, China. She is currently pursuing the PhD degree in Information and Communication Engineering at the National University of Defense Technology, China. She is a student member of the China Society of Image and Graphics. Her research interests include image understanding and analysis, image retrieval, and video mining.

Runsheng Wang graduated from the Harbin Institute of Technology, Harbin, China, in 1964. From January 1984 to June 1986 and from October 1992 to April 1993, he was a Visiting Scholar at the Department of Computer Science, University of Massachusetts, Amherst, USA. He has taught at the Harbin Institute of Technology and Changsha Institute of Technology, Changsha, China. He is currently a Professor with the ATR National Laboratory, National University of Defense Technology, Changsha. His research interests include image analysis and understanding, pattern recognition, and information fusion.