Pattern Recognition 38 (2005) 865 – 885 www.elsevier.com/locate/patcog
Statistical modeling and conceptualization of natural images
Jianping Fan a,∗, Yuli Gao a, Hangzai Luo a, Guangyou Xu b
a Department of Computer Science, University of North Carolina, 9201 Univ. City Blvd., Charlotte, NC 28223, USA
b Department of Computer Science, Tsinghua University, Beijing, PR China
Received 1 November 2003; received in revised form 30 June 2004; accepted 1 July 2004
Abstract
Multi-level annotation of images is a promising solution for enabling semantic image retrieval with keywords at different semantic levels. In this paper, we propose a multi-level approach to interpreting and annotating the semantics of natural images by using both the dominant image components and the relevant semantic image concepts. In contrast to the well-known image-based and region-based approaches, we use concept-sensitive salient objects as the dominant image components to achieve automatic image annotation at the content level. By using the concept-sensitive salient objects for image content representation and feature extraction, a novel image classification technique is developed to achieve automatic image annotation at the concept level. To detect the concept-sensitive salient objects automatically, a set of detection functions is learned from labeled image regions by using support vector machine (SVM) classifiers with an automatic scheme for searching for the optimal model parameters. To generate the semantic image concepts, finite mixture models are used to approximate the class distributions of the relevant concept-sensitive salient objects. An adaptive EM algorithm is proposed to determine the optimal model structure and model parameters simultaneously. In addition, a large number of unlabeled samples are integrated with a limited number of labeled samples to achieve more effective classifier training and knowledge discovery. We also demonstrate that our algorithms are very effective in enabling multi-level interpretation and annotation of natural images.
© 2004 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
Keywords: Semantic image classification; Salient object detection; Adaptive EM algorithm; SVM
1. Introduction

As high-resolution digital cameras become more affordable and widespread, high-quality digital natural images have exploded on the Internet. With the exponential growth
This project is supported by the National Science Foundation under 0208539-IIS and by a grant from the AO Research Foundation, Switzerland. The work of Prof. Xu is supported by the Chinese National Science Foundation under 60273005. ∗ Corresponding author. E-mail addresses:
[email protected] (J. Fan),
[email protected] (Y. Gao),
[email protected] (H. Luo),
[email protected] (G. Xu).
of high-quality digital natural images, semantic image classification is becoming increasingly important for supporting effective image database indexing and retrieval at the semantic level [1–6]. As shown in Fig. 1, semantic image similarity can be categorized into two major classes [7]: (a) similar image components (e.g., similar image objects such as sky and grass [28–35]) or similar global visual properties (e.g., similar global configurations such as openness and naturalness [27]); (b) similar semantic image concepts such as garden, beach and mountain view, or similar abstract image concepts (e.g., image events such as sailing and skiing) [8–26]. To achieve the first class of semantic image similarity, it is very important to achieve a middle-level understanding of the semantics of image contents [28–35].
Fig. 1. Human beings interpret the semantics of images based on: (a) certain types of the concept-sensitive salient objects, (b) the global configuration among the concept-sensitive salient objects.
To achieve the second class of semantic image similarity, semantic image classification has been reported as a promising approach, but its performance largely depends on two key issues [1]: (1) the effectiveness of the visual patterns used for image content representation and feature extraction; (2) the significance of the algorithms used for semantic image concept modeling and classifier training. Many techniques for semantic image classification have been proposed in the literature [8–26]. However, little existing work provides a good framework that addresses the following inter-related problems jointly:
1.1. Quality of features

The success of most existing techniques for semantic image classification is often limited, because the semantic gap makes their performance depend heavily on the discrimination power of the low-level visual features. The discrimination power of the low-level visual features, in turn, largely depends on the effectiveness of the underlying visual patterns that are selected for image content representation and feature extraction. Two approaches are widely used for image content representation and feature extraction: (1) image-based approaches, which treat whole images as the individual visual patterns for feature extraction; (2) region-based or Blob-based approaches, which take homogeneous image regions, or connected homogeneous image regions with the same color or texture (i.e., Blobs), as the underlying visual patterns for feature extraction. Both approaches have met with successes as well as failures [27–35]. One common weakness of the region-based approaches is that homogeneous image regions have little correspondence to the semantic image concepts, and thus they are not effective for multi-class image classification [28–35]. By avoiding image segmentation, the image-based approaches are very attractive as a low-cost framework for feature extraction and image classification [20,21,24,25,27]. However, since only the global visual properties are used for image content representation [27], the image-based approaches may not work well for images that contain individual objects, especially when human beings use those individual objects to interpret the semantics of the images, as shown in Fig. 1a [28–35]. To enhance the quality of features for discriminating between different semantic image concepts, we need a means to detect suitable semantic-sensitive visual patterns so that a middle-level understanding of the semantics of image contents can be achieved effectively.

1.2. Semantic image concept modeling and interpretation

In order to interpret the semantic image concepts intuitively, it is very important to explore the contextual relationships and joint effects among the relevant concept-sensitive salient objects. However, no existing work has addressed what kind of image context integration model can be used to link the concept-sensitive salient objects to the most relevant semantic image concepts. Statistical image modeling is a promising approach to formulating the interpretations of the semantic image concepts quantitatively by using generative or discriminative models [15,45]. However, no existing work has addressed the origin of the concept models and the mathematical spaces for the concept models.

1.3. Semantic image classification

Semantic image classification plays an important role in achieving the second class of semantic image similarity, and many techniques have been proposed in the past [8–26]. Space limitations do not allow us to survey all of these works; instead, we emphasize those that are most relevant to our proposed work. A Bayesian framework has been developed to enable binary semantic
classification of vacation images by using image-based global visual features and vector quantization for density estimation [20]. SIMPLIcity was reported as a binary image classification system using region-based visual features and a distance-based nearest-neighbor classifier [19]. By using image Blobs for content representation, Barnard et al. have also developed a Bayesian framework for interpreting the semantics of images [10–12]. Without detecting and recognizing individual objects, Lipson et al. have presented a configuration-based technique for image classification that explores the spatial relationships and arrangements of the various image regions in an image [21]. Schyns and Oliva have recently developed a novel approach to natural image understanding and classification that treats discriminant power spectrum templates in the low-frequency domain as the concept-sensitive visual patterns [27], but this technique may fail for natural images that contain individual objects which human beings use to interpret the semantics of the images, as shown in Fig. 1a.

As mentioned above, image semantics can be described at multiple levels (i.e., both the content level and the concept level). Thus a good image classification and annotation scheme should enable the interpretation and annotation of both the dominant image components and the relevant semantic image concepts. However, few existing works have achieved such multi-level annotation of images [2,3,8]. Based on these observations, we propose a novel framework to enable more effective interpretation of semantic image concepts and multi-level annotation of natural images.

This paper is organized as follows: Section 2 presents a novel framework for semantic-sensitive image content representation by using the concept-sensitive salient objects; Section 3 proposes a new framework for semantic image concept interpretation by using finite mixture models to approximate the class distributions of the relevant concept-sensitive salient objects; Section 4 introduces an automatic technique for salient object detection; Section 5 presents a novel algorithm for incremental classifier training by using an adaptive EM algorithm for parameter estimation and model selection; Section 6 describes the benchmark environment for evaluating our techniques for semantic image classification and interpretation; we conclude in Section 7.
2. Semantic-sensitive image content representation

As mentioned above, the quality of features largely depends on the underlying visual patterns that are selected for image content representation and feature extraction. Visual features extracted from whole images or from homogeneous image regions may not be effective for discriminating between different semantic image concepts, because the homogeneous image regions have little correspondence to the semantic image concepts and a single image may consist of multiple semantic image concepts under different vision purposes. Thus a good framework for semantic-sensitive image content representation should be able to achieve a middle-level understanding of the semantics of image contents and enhance the quality of features for discriminating between different semantic image concepts [28–35].

In order to enhance the quality of features, we propose a novel framework that uses concept-sensitive salient objects to enable a more expressive representation of image contents. The concept-sensitive salient objects are defined as the dominant image components that are semantic to human beings and visually distinguishable [28–35], or as the global visual properties of whole images that can be identified by using spectrum templates in the frequency domain [27]. For example, the concept-sensitive salient object “sky” is defined as the connected image regions of large size (i.e., dominant image components) that are related to the human semantics of “sky”. The concept-sensitive salient objects that are related to the global visual properties in the frequency domain can be obtained easily by using the wavelet transformation [27]. In the following discussion, we focus on modeling and detecting the concept-sensitive salient objects that correspond to the visually distinguishable image components. The basic vocabulary of such concept-sensitive salient objects can be obtained by using the taxonomy of the dominant image components of natural images shown in Fig. 2.

Since the concept-sensitive salient objects are semantic to human beings, they can act as a middle-level representation of image contents and break the semantic gap into two smaller and bridgable gaps, as shown in Fig. 3: (a) Gap 1, between the low-level image signals and the concept-sensitive salient objects; (b) Gap 2, between the concept-sensitive salient objects and the relevant semantic image concepts. Through this multi-level framework, the semantic gap can be bridged efficiently in two steps: (1) bridging Gap 1 by detecting the concept-sensitive salient objects automatically, using detection functions learned from labeled image regions; (2) bridging Gap 2 by using finite mixture models to approximate the class distributions of the relevant concept-sensitive salient objects.

3. Semantic image concept modeling and interpretation

In order to achieve automatic image annotation at the concept level, we have also proposed a novel technique for semantic image concept modeling and interpretation. To quantitatively interpret the contextual relationships between the semantic image concepts and the relevant concept-sensitive salient objects, we use a finite mixture model (FMM) to approximate the class distribution of the concept-sensitive salient objects that are relevant to a specific semantic image concept.
Fig. 2. Examples of the taxonomy of natural images: the root category “Nature” branches into “Sky” (clear sky, blue sky, cloudy sky), “Ground” (floor, sand, grass, water, rock) and “Foliage” (green foliage, floral foliage, branch foliage).
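For concreteness, the basic vocabulary induced by this taxonomy can be encoded as a small tree-structured dictionary whose leaves are the detectable salient objects. The sketch below is purely illustrative: the category names are taken from Fig. 2, and the helper iter_leaf_objects is our own naming, not part of the system described here.

```python
# Illustrative encoding of the Fig. 2 taxonomy: each internal category
# maps to its sub-categories; empty dicts mark the leaf labels, i.e.,
# the basic vocabulary of concept-sensitive salient objects.
TAXONOMY = {
    "Nature": {
        "Sky": {"clear sky": {}, "blue sky": {}, "cloudy sky": {}},
        "Ground": {"floor": {}, "sand": {}, "grass": {}, "water": {}, "rock": {}},
        "Foliage": {"green foliage": {}, "floral foliage": {}, "branch foliage": {}},
    }
}

def iter_leaf_objects(node):
    """Yield the leaf labels of the taxonomy tree."""
    for name, children in node.items():
        if children:
            yield from iter_leaf_objects(children)
        else:
            yield name

print(sorted(iter_leaf_objects(TAXONOMY)))
```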
Fig. 3. The proposed image content representation and semantic context integration framework: automatic image segmentation maps the images (1, …, M) to color/texture patterns (1, …, R); learned detection functions map these patterns to the salient object types (1, …, Ne), bridging Gap 1; semantic image classification maps the salient object types to the semantic image concepts (1, …, Nc), bridging Gap 2.
Specifically, for a semantic image concept $C_j$, the class distribution is modeled as

$$P(X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) = \sum_{i=1}^{\kappa} P(X|S_i, \theta_{s_i})\,\omega_{s_i}, \qquad (1)$$

where $P(X|S_i, \theta_{s_i})$ is the $i$th multivariate mixture component with $n$ independent means and a common $n \times n$ covariance matrix, $\kappa$ indicates the optimal number of multivariate mixture components, $\Theta_{c_j} = \{\theta_{s_i},\ i = 1, \ldots, \kappa\}$ is the set of model parameters for these multivariate mixture components, $\Omega_{c_j} = \{\omega_{s_i},\ i = 1, \ldots, \kappa\}$ is the set of relative weights among these multivariate mixture components, and $X$ is the $n$-dimensional visual feature vector used to represent the relevant concept-sensitive salient objects. For example, the semantic image concept “beach scene” is related to at least three types (classes) of concept-sensitive salient objects, such as “sea water”, “sky” and “beach sand”, together with other hidden visual patterns.
It is known that an image object has various appearances under different viewing conditions, and thus its principal visual properties may look different under different lighting and capture conditions [29,34,35]. For example, the concept-sensitive salient object “sky” has various appearances, such as the “blue sky pattern”, “white (clear) sky pattern”, “cloudy sky pattern”, “dark (night) sky pattern” and “sunset/sunrise sky pattern”, which have very different color and texture properties under different viewing conditions. Thus, the data distribution for a specific type of concept-sensitive salient object is approximated by multiple mixture components to accommodate the variability of the same type of concept-sensitive salient object (i.e., presence/absence of distinctive parts, variability in overall shape, changing visual properties due to lighting conditions, viewpoints, etc.). The fundamental assumptions of our finite mixture models are: (a) there is a many-to-one correspondence between the multivariate mixture components and each type (class) of
concept-sensitive salient object; (b) different types (classes) of the concept-sensitive salient objects are independent in their visual feature space. For a specific semantic image concept, the optimal number of mixture components and their relative weights are acquired automatically through a machine learning process. Using finite mixture models for the probabilistic interpretation of semantic image concepts maintains the variability (heterogeneity) among different semantic image concepts and thus offers a number of additional theoretical advantages.
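To make Eq. (1) concrete, the sketch below evaluates such a concept model over the salient-object feature space. It is a minimal illustration under our own naming, assuming Gaussian mixture components that share the common covariance matrix described above; it is not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def concept_likelihood(x, weights, means, cov):
    """Evaluate Eq. (1): P(X, Cj) = sum_i P(X | S_i, theta_si) * omega_si.

    x       : n-dimensional feature vector of a salient object
    weights : (kappa,) relative weights omega_si, summing to one
    means   : (kappa, n) component means
    cov     : (n, n) common covariance matrix shared by all components
    """
    return sum(
        w * multivariate_normal.pdf(x, mean=mu, cov=cov)
        for w, mu in zip(weights, means)
    )

# Toy usage: a 2-D concept model with three mixture components.
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.3, 0.2])
means = rng.normal(size=(3, 2))
print(concept_likelihood(np.zeros(2), weights, means, np.eye(2)))
```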
4. Automatic salient object detection

The objective of image analysis is to parse natural images into the concept-sensitive salient objects of the basic vocabulary. Based on the basic vocabulary shown in Fig. 2, we have implemented 32 functions for detecting 32 types of concept-sensitive salient objects in natural images, where each function detects one specific type of concept-sensitive salient object in the basic vocabulary. Each detection function consists of three parts: (a) automatic image segmentation using the mean shift technique [42]; (b) binary image region classification using SVM classifiers with an automatic scheme for searching for the optimal model parameters [40,41]; (c) label-based aggregation of connected, similarly labeled image regions for salient object generation.

We use our detection function for the concept-sensitive salient object “beach sand” as an example of how our detection functions are designed. As shown in Fig. 4, the image regions with homogeneous color or texture are first obtained by using the mean shift technique [42]. Since the visual properties of a certain type of concept-sensitive salient object may look different under different lighting and capture conditions [29], a single image is insufficient to represent its principal visual characteristics. Thus this automatic image segmentation procedure is performed on a set of training images that contain the concept-sensitive salient object “beach sand”. The homogeneous image regions in the training images that are related to the concept-sensitive salient object “beach sand” are selected and labeled as training samples through human interaction. It is worth noting that the homogeneous image regions related to the same type of concept-sensitive salient object may have different color or texture properties. Region-based low-level visual features, namely a 1-dimensional coverage ratio (i.e., density ratio) for a coarse shape representation, 6-dimensional region locations (i.e., 2 dimensions for the region center and 4 dimensions for the rectangular bounding box as a coarse shape representation), 7-dimensional LUV dominant colors and color variances (i.e., overall colors and color purity), 14-dimensional Tamura texture, and 28-dimensional wavelet texture/color features, are extracted to characterize the principal visual properties of
these labeled image regions that are explicitly related to the concept-sensitive salient object “beach sand”. The 6-dimensional region locations are used to determine the spatial contexts among different types of concept-sensitive salient objects and to increase expressiveness [34]. The spatial context refers to the relationship and arrangement of the concept-sensitive salient objects in a natural image.

We use the one-against-all rule to label the training samples $\Psi_j = \{X_l, L_j(X_l)\,|\,l = 1, \ldots, N\}$: positive samples for the specific concept-sensitive salient object “beach sand”, and negative samples. Each labeled training sample is a pair $(X_l, L_j(X_l))$ that consists of a set of region-based low-level visual features $X_l$ and the semantic label $L_j(X_l)$ for the corresponding homogeneous image region. The image region classifier is learned from these labeled training samples.

We use the well-known SVM classifiers for binary image region classification [40,41]. Consider a binary classification problem with a linearly separable sample set $\Psi_j = \{X_l, L_j(X_l)\,|\,l = 1, \ldots, N\}$, where the semantic label $L_j(X_l)$ for the homogeneous image region with visual features $X_l$ is either $+1$ or $-1$. For the positive samples $X_l$ with $L_j(X_l) = +1$, there exist transformation parameters $\omega$ and $b$ such that $\omega \cdot X_l + b > +1$. Similarly, for the negative samples $X_l$ with $L_j(X_l) = -1$, we have $\omega \cdot X_l + b < -1$. The margin between these two supporting planes is $2/\|\omega\|$. The SVM classifier is then designed to maximize this margin subject to the constraints $\omega \cdot X_l + b > +1$ for the positive samples and $\omega \cdot X_l + b < -1$ for the negative samples. Given the training set $\Psi_j = \{X_l, L_j(X_l)\,|\,l = 1, \ldots, N\}$, the margin maximization procedure is transformed into the following optimization problem:
$$\arg\min_{\omega,\,b,\,\xi}\ \frac{1}{2}\,\omega^{T}\omega + C\sum_{l=1}^{N}\xi_l, \qquad \text{subject to}\quad L_j(X_l)\,(\omega \cdot \Phi(X_l) + b) \geq 1 - \xi_l, \qquad (2)$$
where $\xi_l \geq 0$ represents the training error rate, $C > 0$ is the penalty parameter that balances the training error rate against the regularization term $\frac{1}{2}\omega^{T}\omega$, and $\Phi(X_l)$ is the function that maps $X_l$ into a higher-dimensional space (i.e., the feature dimensions plus the dimension of the response); the kernel function is defined as $K(X_i, X_j) = \Phi(X_i)^{T}\Phi(X_j)$. In our current implementation, we select the radial basis function (RBF), $K(X_i, X_j) = \exp(-\gamma \|X_i - X_j\|^2)$, $\gamma > 0$.

We have developed an efficient search algorithm to determine the optimal model parameters $(C, \gamma)$ for the SVM classifiers: (a) The labeled image regions for a specific type of concept-sensitive salient object are partitioned into $\nu$ subsets of equal size, where $\nu - 1$ subsets are used for classifier training and the remaining one is used for classifier validation. (b) Our feature set for image region representation is first normalized, to prevent features with larger numeric ranges from dominating those with smaller numeric ranges. Because the inner product is usually used to calculate the kernel values, this normalization procedure avoids numerical problems.
Fig. 4. The flowchart for automatic salient object extraction.
Fig. 5. The detection results of the concept-sensitive salient object “sunset/sunrise”.
(c) The numeric ranges for the parameters $C$ and $\gamma$ are exponentially partitioned into small pieces, giving $M$ candidate pairs. For each pair, $\nu - 1$ subsets are used to train the classifier model. When the $M$ classifier models are available, cross-validation is used to determine the underlying optimal parameter pair $(C, \gamma)$. (d) Given the optimal parameter pair $(C, \gamma)$, the final classifier model (i.e., the support vectors) is trained again using the whole training data set. (e) The spatial contexts among different types of concept-sensitive salient objects (i.e., the coherence among different types of concept-sensitive salient objects) are also used to cope with over-segmented images [34,35].
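Steps (a)–(d) of this search correspond closely to the standard cross-validated grid search. The sketch below expresses them with scikit-learn as one possible realization; the grid ranges and the choice ν = 5 are our assumptions, not values reported by the authors.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-ins for labeled region features and binary region labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 56))
y = rng.integers(0, 2, size=200)

# Step (b): normalize features so that no dimension dominates the kernel
# values. Steps (a), (c), (d): exponentially spaced (C, gamma) grid,
# nu-fold cross-validation, then refit the winner on the whole set.
param_grid = {
    "svc__C": 2.0 ** np.arange(-1, 14, 2),       # 0.5, 2, ..., 8192
    "svc__gamma": 2.0 ** np.arange(-8, 3, 2),    # 0.0039, ..., 4
}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid, cv=5, refit=True,
)
search.fit(X, y)
print(search.best_params_)
```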
We have currently implemented 32 detection functions for 32 types of concept-sensitive salient objects in natural images. If all these detection functions fail to detect these 32 types of concept-sensitive salient objects in a test image, the wavelet transformation is performed on the test image to obtain the 33rd type of concept-sensitive salient object, i.e., spectrum templates in the frequency domain that represent the global visual properties of the test image. Some results of our detection functions are shown in Figs. 5–8. From these experimental results, one can see that the concept-sensitive salient objects are visually distinguishable and that the principal visual properties of the dominant image components are expressively represented.
Fig. 6. The detection results of the concept-sensitive salient object “sand field”.
Fig. 7. The detection results of the concept-sensitive salient object “cat”.
As shown in Fig. 9, the mean shift technique often partitions a single object into multiple homogeneous image regions, none of which is representative of the object; thus the homogeneous image regions have little correspondence to the semantic image concepts.
Fig. 8. The detection results of the concept-sensitive salient object “water”.
Fig. 9. The concept-sensitive salient objects can enable more expressive representation of image contents: (a) original images, (b) homogeneous image regions, (c) concept-sensitive salient objects.
On the other hand, the concept-sensitive salient objects are able to characterize the principal visual properties of the corresponding image object, and thus using the concept-sensitive salient objects for image content representation and feature extraction can enhance the quality of features and result in more effective semantic image classification. In addition, the concept-sensitive salient objects are semantic to human beings, and thus the keywords for interpreting the concept-sensitive salient objects can also be used to achieve automatic image annotation at the content level.

The optimal parameters $(C, \gamma)$ for some detection functions are given in Table 1. Precision $\rho$ and recall $\varrho$ are used to measure the average performance of our detection functions:
$$\rho = \frac{\varpi}{\varpi + \varepsilon}, \qquad \varrho = \frac{\varpi}{\varpi + \vartheta}, \qquad (3)$$

where $\varpi$ is the set of true positive samples that are related to the corresponding type of concept-sensitive salient object and are detected correctly, $\varepsilon$ is the set of true negative samples that are irrelevant to the corresponding type of concept-sensitive salient object but are detected incorrectly, and $\vartheta$ is the set of false positive samples that are related to the corresponding type of concept-sensitive salient object but are mis-detected. The average performance for some detection functions is given in Table 2.

Table 1
The optimal parameters (C, γ) of some detection functions

  Salient object    C      γ           Salient object    C      γ
  Brown horse       8      0.5         Snow              512    0.03125
  Grass             10     1.0         Sunset/sunrise    8      0.5
  Purple flower     6      0.5         Waterfall         32     0.0078125
  Red flower        32     0.125       Yellow flower     8      0.5
  Rock              32     2           Forest            32     0.125
  Sand field        8      2           Sail cloth        64     0.125
  Water             2      0.5         Elephant          32     0.5
  Human skin        32     0.125       Cat               512    1.125
  Sky               8192   0.03125     Zebra             512    4

Table 2
The average performance of some detection functions

  Salient object    ρ (%)   ϱ (%)      Salient object    ρ (%)   ϱ (%)
  Brown horse       95.6    100        Snow              86.7    87.5
  Grass             92.9    94.8       Sunset/sunrise    92.5    95.2
  Purple flower     96.1    95.2       Waterfall         88.5    87.1
  Red flower        87.8    86.4       Yellow flower     87.4    89.3
  Rock              98.7    100        Forest            85.4    84.8
  Sand field        98.8    96.6       Sail cloth        96.3    94.9
  Water             86.7    89.5       Elephant          85.3    88.7
  Human skin        86.2    85.4       Cat               90.5    87.5
  Sky               87.6    94.5       Zebra             87.2    85.4

It is worth noting that the procedure for salient object detection is automatic: human interaction is involved only in labeling the training samples (i.e., homogeneous image regions) for learning the detection functions.

After the concept-sensitive salient objects are extracted automatically from the images, a set of visual features is calculated to characterize their principal visual properties. These visual features include a 1-dimensional coverage ratio (i.e., density ratio) for a coarse shape representation, 6-dimensional object locations (i.e., 2 dimensions for the object center and 4 dimensions for the rectangular bounding box as a coarse shape representation of the salient object), 7-dimensional LUV dominant colors and color variances (i.e., the overall color and color purity of a certain concept-sensitive salient object, described in terms of the presence/absence of the dominant colors and the color variances), 14-dimensional Tamura texture, and 28-dimensional wavelet texture/color features.
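The feature vector described above therefore has 1 + 6 + 7 + 14 + 28 = 56 dimensions. The sketch below shows one way such a vector might be assembled from its parts; the individual extractors are assumed to be computed elsewhere, and all names are ours.

```python
import numpy as np

def salient_object_features(coverage_ratio, center, bbox,
                            luv_stats, tamura, wavelet):
    """Concatenate the per-object features into one 56-dimensional vector:
    1 (coverage ratio) + 6 (object location: 2-D center + 4-D bounding
    box) + 7 (LUV dominant colors and variances) + 14 (Tamura texture)
    + 28 (wavelet texture/color features)."""
    x = np.concatenate([
        np.atleast_1d(coverage_ratio),  # 1-D density ratio
        np.asarray(center),             # 2-D object center
        np.asarray(bbox),               # 4-D rectangular box
        np.asarray(luv_stats),          # 7-D LUV colors + variances
        np.asarray(tamura),             # 14-D Tamura texture
        np.asarray(wavelet),            # 28-D wavelet texture/color
    ])
    assert x.shape == (56,)
    return x
```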
5. Semantic image classification and annotation

For each semantic image concept, the semantic labels for a limited number of training samples are provided through human–computer interaction. We use the one-against-all rule to organize the labeled samples $\Lambda_L = \{X_l, C_j(S_l)\,|\,l = 1, \ldots, N_L\}$ into positive samples for a specific semantic image concept $C_j$ and negative samples. Each labeled sample is a pair $(X_l, C_j(S_l))$ that consists of a set of $n$-dimensional visual features $X_l$ and the semantic label $C_j(S_l)$ for the corresponding sample $S_l$. The unlabeled samples $\Lambda_U = \{X_k, S_k\,|\,k = 1, \ldots, N_u\}$ can be used to achieve a better approximation of the class distribution and to select a more accurate model structure. For a certain semantic image concept $C_j$, we then define the mixture training sample set as $\Lambda = \Lambda_L \cup \Lambda_U$.
5.1. Adaptive EM algorithm
Since the maximum likelihood estimate prefers complex models with more free parameters [20,36], a penalty term is added to determine the underlying optimal model structure. Thus the optimal model structure and parameters $(\hat\kappa, \hat\Theta_{c_j}, \hat\Omega_{c_j})$ for a certain semantic image concept are determined by

$$(\hat\kappa, \hat\Theta_{c_j}, \hat\Omega_{c_j}) = \arg\max_{\kappa,\,\Theta_{c_j},\,\Omega_{c_j}} \{L(C_j, X, \kappa, \Theta_{c_j}, \Omega_{c_j})\}, \qquad (4)$$

where $L(C_j, X, \kappa, \Theta_{c_j}, \Omega_{c_j}) = -\sum_{S_i \in \Lambda_{c_j}} \log P(X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \log p(\kappa, \Theta_{c_j}, \Omega_{c_j})$ is the objective function, $-\sum_{S_i \in \Lambda_{c_j}} \log P(X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})$ is the likelihood function, and

$$\log p(\kappa, \Theta_{c_j}, \Omega_{c_j}) = -\frac{n + 2\kappa + 3}{2}\sum_{l=1}^{\kappa}\log\frac{N\omega_l}{12} - \frac{\kappa}{2}\log\frac{N}{12} - \frac{\kappa(N+1)}{2}$$

is the minimum description length (MDL) term that penalizes complex models [20,36], where $N$ is the total number of training samples and $n$ is the dimensionality of the visual features.

The maximum likelihood estimate can be obtained by using the EM algorithm [37,38]. Unfortunately, the EM iteration needs to know $\kappa$, and a “suitable” value of $\kappa$ is usually pre-defined based on personal experience. However, it is critical to determine the value of $\kappa$ automatically for different semantic image concepts, because they may consist of different types of concept-sensitive salient objects. A pre-defined $\kappa$ (i.e., a fixed number of mixture components) may therefore misfit the class distribution of the relevant concept-sensitive salient objects. To estimate the optimal number of mixture components, we propose an adaptive EM algorithm that integrates parameter estimation and model selection (i.e., selecting the optimal number of mixture components) seamlessly in a single algorithm. It takes the following steps:

Step 1: To avoid the initialization problem, our adaptive EM algorithm starts from a reasonably large value of $\kappa$ to explain the essential structure of the training samples, and then reduces the number of mixture components sequentially. The model parameters are initialized by using the labeled samples.

Step 2: When the mixture components overpopulate some sample areas but underpopulate others, the EM iteration encounters local extrema. To escape the local extrema, our adaptive EM algorithm performs automatic merging, splitting and death operations to re-organize the distributions of the mixture components and to modify the optimal number of mixture components according to the real distributions of the training samples. The Kullback divergence $KL(P(X|S_i, \theta_{s_i}), P(X|S_j, \theta_{s_j}))$ is used to measure the divergence between the $i$th mixture component $P(X|S_i, \theta_{s_i})$ and the $j$th mixture component $P(X|S_j, \theta_{s_j})$ in the same concept model [39]:

$$KL(P(X|S_i, \theta_{s_i}), P(X|S_j, \theta_{s_j})) = \int P(X|S_i, \theta_{s_i}) \log \frac{P(X|S_i, \theta_{s_i})}{P(X|S_j, \theta_{s_j})}\,dX. \qquad (5)$$

If $KL(P(X|S_i, \theta_{s_i}), P(X|S_j, \theta_{s_j}))$ is small, these two strongly overlapping mixture components provide similar densities and overpopulate the relevant sample areas; thus they can potentially be merged into one single mixture component. If they are merged, the local Kullback divergence $KL(P(X|S_{ij}, \theta_{s_{ij}}), P(X|\theta_{s_{ij}}))$ is used to measure the local divergence between the merged mixture component $P(X|S_{ij}, \theta_{s_{ij}})$ and the local sample density $P(X|\theta_{s_{ij}})$. The local sample density $P(X|\theta_{s_{ij}})$ is taken as the empirical distribution weighted by the posterior probability and is defined as [20]

$$P(X|\theta_{s_{ij}}) = \frac{\sum_{i=1}^{N} \delta(X - X_i)\,P(S_{ij}|X_i, C_j, \theta_{s_{ij}})}{\sum_{i=1}^{N} P(S_{ij}|X_i, C_j, \theta_{s_{ij}})}, \qquad (6)$$

where $P(S_{ij}|X_i, C_j, \theta_{s_{ij}})$ is the posterior probability. To detect the best candidate for merging, our adaptive EM algorithm tests the $\kappa(\kappa-1)/2$ pairs of mixture components, and the pair with the minimum value of the local Kullback divergence is selected as the best candidate for merging. At the same time, our adaptive EM algorithm also calculates the local Kullback divergence $KL(P(X, C_j|S_l, \theta_{s_l}), P(X, C_j|\Lambda))$ to measure the divergence between the $l$th mixture component $P(X, C_j|S_l, \theta_{s_l})$ and the local sample density $P(X, C_j|\Lambda)$. If the local Kullback divergence for a specific mixture component $P(X, C_j|S_l, \theta_{s_l})$ is big, the relevant sample area is underpopulated, and the elongated mixture component $P(X, C_j|S_l, \theta_{s_l})$ is selected as the best candidate to be split into two representative mixture components.

In order to achieve discriminative classifier training, the classifiers for multiple semantic image concepts are trained simultaneously, where the positive samples for a certain semantic image concept can be used as the negative samples for other semantic image concepts. To control the potential overlap among the class distributions of different semantic image concepts, our adaptive EM algorithm calculates the Kullback divergence $KL(P(X, C_j|S_l, \theta_{s_l}), P(X, C_i|S_m, \theta_{s_m}))$ between two mixture components from the class distributions of two different semantic image concepts $C_j$ and $C_i$. If the Kullback divergence between these two mixture components $P(X, C_j|S_l, \theta_{s_l})$ and $P(X, C_i|S_m, \theta_{s_m})$ is small, the two mixture components overlap in the feature space, and they are selected as the best candidates to be removed from the concept models so that discriminative classifier training can be achieved. By removing the overlapping mixture components (i.e., death), our classifier training technique maximizes the margin among the multiple classifiers for different semantic image concepts and thus results in higher prediction power.
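For Gaussian mixture components, the Kullback divergence of Eq. (5) has a closed form, which makes the scan over candidate pairs inexpensive. The sketch below is our own illustration of this step; the symmetrized score in best_merge_candidate is an assumption, since the text does not say whether the divergence is symmetrized before ranking the pairs.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) between two multivariate Gaussians, as
    used in Eq. (5); a small value means the two components overlap."""
    n = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (
        np.trace(cov1_inv @ cov0)
        + diff @ cov1_inv @ diff
        - n
        + np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
    )

def best_merge_candidate(means, covs):
    """Return the pair of components with the smallest symmetrized KL,
    i.e., the best candidate pair for the merging operation."""
    kappa = len(means)
    pairs = [(i, j) for i in range(kappa) for j in range(i + 1, kappa)]
    score = lambda i, j: (gaussian_kl(means[i], covs[i], means[j], covs[j])
                          + gaussian_kl(means[j], covs[j], means[i], covs[i]))
    return min(pairs, key=lambda p: score(*p))
```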
Step 3: To arbitrate among the three kinds of operations (merging, splitting and death), their probabilities are defined as

$$J_{\mathrm{merge}}(l, k, \kappa) = \frac{KL(P(X, C_j|S_{lk}, \theta_{s_{lk}}),\, P(X, C_j|\Lambda))}{\Gamma(\kappa)}, \qquad (7)$$

$$J_{\mathrm{death}}(l, m, \kappa) = \frac{KL(P(X, C_j|S_l, \theta_{s_l}),\, P(X, C_i|S_m, \theta_{s_m}))}{\Gamma(\kappa)}, \qquad (8)$$

$$J_{\mathrm{split}}(l, \kappa) = \frac{\Gamma(\kappa)}{KL(P(X, C_j|S_l, \theta_{s_l}),\, P(X, C_j|\Lambda))}, \qquad (9)$$

where $\Gamma(\kappa)$ is a normalizing factor chosen so that

$$\sum_{l=1}^{\kappa_j} J_{\mathrm{split}}(l, \kappa) + \sum_{l=1}^{\kappa_j}\sum_{k=l+1}^{\kappa_j} J_{\mathrm{merge}}(l, k, \kappa) + \sum_{l=1}^{\kappa_j}\sum_{m=1}^{\kappa_i} J_{\mathrm{death}}(l, m, \kappa) = 1. \qquad (9a)$$

The acceptance probability for performing the merging, splitting or death operation is defined by

$$P_{\mathrm{accept}} = \min\left\{\exp\left(-\frac{|L(X, \Lambda_1) - L(X, \Lambda_2)|}{\tau}\right),\ 1\right\}, \qquad (10)$$

where $L(X, \Lambda_1)$ and $L(X, \Lambda_2)$ are the objective functions for the models $\Lambda_1$ and $\Lambda_2$ (i.e., before and after performing the merging, splitting or death operation) as described in Eq. (4), and $\tau$ is a constant determined by experiments; in our current experiments, $\tau = 9.8$.

Step 4: Given the finite mixture model with a certain number of mixture components (i.e., after performing the merging, splitting or death operation), the EM iteration is performed to estimate the mixture parameters, such as the means, covariances and weights of the different mixture components.

E Step: Calculate the expected likelihood function and the posterior probability by using the mixture parameters obtained at the $t$th iteration:

$$P(S_i|X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) = \frac{P(X|S_i, \theta_{s_i})\,\omega_{s_i}}{\sum_{i=1}^{\kappa} P(X|S_i, \theta_{s_i})\,\omega_{s_i}}. \qquad (11)$$
M Step: Find the $(t+1)$th estimate of the mixture parameters:

$$\omega_{s_i}^{t+1} = \frac{1}{N_l}\sum_{l=1}^{N_l} P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}),$$

$$\mu_{s_i}^{t+1} = \frac{\sum_{l=1}^{N_l} X_l\, P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}{\sum_{l=1}^{N_l} P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})},$$

$$\sigma_{s_i}^{t+1} = \frac{\sum_{l=1}^{N_l} (X_l - \mu_{s_i}^{t+1})(X_l - \mu_{s_i}^{t+1})^{T}\, P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}{\sum_{l=1}^{N_l} P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}. \qquad (12)$$

After the EM iteration converges, a weak classifier is built. The performance of this weak classifier is obtained by testing on a small number of labeled samples that were not used for classifier training. If the average performance of this weak classifier is good enough, $P(C_j|X, \kappa, \Theta_{c_j}, \Omega_{c_j}) \geq \delta_1$, go to Step 5; otherwise, go back to Step 2. $\delta_1$ is set to 80% in our current experiments.

Step 5: Output the mixture model and parameters $\hat\kappa$, $\hat\Theta_{c_j}$, $\hat\Omega_{c_j}$.

By performing the merging, splitting and death operations automatically, our adaptive EM algorithm has the following advantages: (a) it does not require a careful initialization of the model structure, since it starts with a reasonably large number of mixture components and the model parameters are initialized directly from the labeled samples; (b) it is able to take advantage of the negative samples to achieve discriminative classifier training; (c) it is able to escape local extrema and approach a global solution by re-organizing the distributions of the mixture components and modifying the optimal number of mixture components.

We have also obtained a theoretical justification for the convergence of the proposed adaptive EM algorithm. In our adaptive EM algorithm, the parameter spaces for the two approximated models that are estimated incrementally have the following relationships:

Merging operation: Two original mixture components, $P(X, C_j|S_i, \theta_{s_i})$ and $P(X, C_j|S_j, \theta_{s_j})$, are merged into one single representative mixture component $P(X|S_{ij}, \theta_{s_{ij}})$:

$$\omega_{s_{ij}} = \omega_{s_i} + \omega_{s_j}, \qquad \kappa = \kappa - 1,$$
$$\omega_{s_{ij}}\,\mu_{s_{ij}} = \omega_{s_i}\,\mu_{s_i} + \omega_{s_j}\,\mu_{s_j},$$
$$\omega_{s_{ij}}\,P(X|S_{ij}, \theta_{s_{ij}}) = \omega_{s_i}\,P(X|S_i, \theta_{s_i}) + \omega_{s_j}\,P(X|S_j, \theta_{s_j}). \qquad (13)$$
Split operation: The original mixture component $P(X, C_j|S_l, \theta_{s_l})$ is split into two representative mixture components, $P(X, C_j|S_m, \theta_{s_m})$ and $P(X, C_j|S_k, \theta_{s_k})$:

$$\omega_{s_m} = \omega_{s_k} = \frac{\omega_{s_l}}{2}, \qquad \kappa = \kappa + 1,$$
$$\omega_{s_m}\,\mu_{s_m} = \frac{\omega_{s_l}\,\mu_{s_l}}{2} + \epsilon_1, \qquad \omega_{s_k}\,\mu_{s_k} = \frac{\omega_{s_l}\,\mu_{s_l}}{2} - \epsilon_1,$$
$$\omega_{s_l}\,P(X|S_l, \theta_{s_l}) = \omega_{s_m}\,P(X|S_m, \theta_{s_m}) + \omega_{s_k}\,P(X|S_k, \theta_{s_k}). \qquad (14)$$
Death operation: The overlapping mixture component $P(X, C_j|S_l, \theta_{s_l})$ is removed from the finite mixture model:

$$\kappa = \kappa - 1, \qquad P(X, C_j, \kappa - 1, \Theta_{c_j}, \Omega_{c_j}) = \frac{1}{1 - \omega_{s_l}}\sum_{i=1}^{\kappa-1} P(X|S_i, \theta_{s_i})\,\omega_{s_i}. \qquad (15)$$
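The merging, splitting and death operations of Eqs. (13)–(15) amount to simple bookkeeping on the mixture weights and means: moment matching for a merge, a symmetric perturbation ±ε₁ for a split, and weight renormalization for a death. A minimal sketch, with the covariance updates omitted for brevity and all function names our own:

```python
import numpy as np

def merge_components(w, mu, i, j):
    """Eq. (13): moment-matched merge of components i and j (kappa -> kappa - 1)."""
    w_ij = w[i] + w[j]
    mu_ij = (w[i] * mu[i] + w[j] * mu[j]) / w_ij
    keep = [k for k in range(len(w)) if k not in (i, j)]
    return np.append(w[keep], w_ij), np.vstack([mu[keep], mu_ij])

def split_component(w, mu, l, eps):
    """Eq. (14): split component l into two halves whose means are
    perturbed by +/- eps (kappa -> kappa + 1)."""
    keep = [k for k in range(len(w)) if k != l]
    w_new = np.append(w[keep], [w[l] / 2, w[l] / 2])
    mu_new = np.vstack([mu[keep], mu[l] + eps, mu[l] - eps])
    return w_new, mu_new

def kill_component(w, mu, l):
    """Eq. (15): remove an overlapping component l and renormalize the
    remaining weights by 1 / (1 - w_l) (kappa -> kappa - 1)."""
    keep = [k for k in range(len(w)) if k != l]
    return w[keep] / (1.0 - w[l]), mu[keep]
```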
The real class distribution $P(X, C_j, \kappa^{*}, \Theta^{*}_{c_j}, \Omega^{*}_{c_j})$ is defined as the underlying optimal model to which our adaptive EM algorithm should converge; thus we put the real class distribution $P(X, C_j, \kappa^{*}, \Theta^{*}_{c_j}, \Omega^{*}_{c_j})$ as the first argument in the following discussion. Given the approximated class distributions $P(X, C_j, \bar\kappa, \bar\Theta_{c_j}, \bar\Omega_{c_j})$ and $P(X, C_j, \hat\kappa, \hat\Theta_{c_j}, \hat\Omega_{c_j})$ that are estimated sequentially, the Kullback divergences between the real class distribution and the approximated class distributions are calculated as

$$D_1 = \int P(X, C_j, \kappa^{*}, \Theta^{*}_{c_j}, \Omega^{*}_{c_j}) \log \frac{P(X, C_j, \kappa^{*}, \Theta^{*}_{c_j}, \Omega^{*}_{c_j})}{P(X, C_j, \bar\kappa, \bar\Theta_{c_j}, \bar\Omega_{c_j})}\,dX, \qquad (16)$$

$$D_2 = \int P(X, C_j, \kappa^{*}, \Theta^{*}_{c_j}, \Omega^{*}_{c_j}) \log \frac{P(X, C_j, \kappa^{*}, \Theta^{*}_{c_j}, \Omega^{*}_{c_j})}{P(X, C_j, \hat\kappa, \hat\Theta_{c_j}, \hat\Omega_{c_j})}\,dX, \qquad (17)$$

where $D_1$ and $D_2$ are always non-negative [39]. Thus the difference $D$ between $D_1$ and $D_2$ reflects the convergence of our adaptive EM algorithm. The difference $D$ is calculated as

$$D = D_1 - D_2 = \int P(X, C_j, \kappa^{*}, \Theta^{*}_{c_j}, \Omega^{*}_{c_j}) \log \frac{P(X, C_j, \hat\kappa, \hat\Theta_{c_j}, \hat\Omega_{c_j})}{P(X, C_j, \bar\kappa, \bar\Theta_{c_j}, \bar\Omega_{c_j})}\,dX. \qquad (18)$$

By considering the implicit relationships among $\bar\kappa$, $\hat\kappa$, $\kappa^{*}$, $\bar\Theta_{c_j}$, $\hat\Theta_{c_j}$, $\Theta^{*}_{c_j}$, $\bar\Omega_{c_j}$, $\hat\Omega_{c_j}$, $\Omega^{*}_{c_j}$ and the corresponding class distributions, we can prove

$$D \geq 0 \ \ \text{if } \bar\kappa, \hat\kappa \leq \kappa^{*}; \qquad D > 0 \ \ \text{if } \bar\kappa, \hat\kappa > \kappa^{*}. \qquad (19)$$

Hence our adaptive EM algorithm reduces the divergence sequentially, and thus it converges to the underlying optimal model incrementally. Our experimental results, shown in Figs. 19 and 20, also demonstrate this convergence of our adaptive EM algorithm.

5.2. Classifier training with unlabeled samples
After the weak classifiers for the semantic image concepts become available, we use the Bayesian framework to achieve a “soft” classification of the unlabeled images (i.e., each unlabeled image may belong to different semantic image concepts with different posterior probabilities). The confidence score for an unlabeled image (i.e., unlabeled sample) $\{X_l, S_l\}$ can thus be defined as [25]

$$\varphi(X_l, S_l) = \pi(X_l, S_l)\,\Delta(X_l, S_l), \qquad (20)$$

where $\pi(X_l, S_l) = \max_j\{P(C_j|X_l, \kappa, \Theta_{c_j}, \Omega_{c_j})\}$ is the maximum posterior probability for the unlabeled sample $\{X_l, S_l\}$, and $\Delta(X_l, S_l) = \pi(X_l, S_l) - \max_j\{P(C_j|X_l, \kappa, \Theta_{c_j}, \Omega_{c_j})\,|\,P(C_j|X_l, \kappa, \Theta_{c_j}, \Omega_{c_j}) \neq \pi(X_l, S_l)\}$ is the multi-concept margin for the unlabeled sample $\{X_l, S_l\}$. For a specific unlabeled sample $\{X_l, S_l\}$, its confidence score $\varphi(X_l, S_l)$ can be used as the criterion for deciding whether it should be treated as an outlier.

Based on their confidence scores, the unlabeled samples can be categorized into two classes: (a) known context classes of the existing semantic image concepts; (b) uncertain samples. The unlabeled samples with high confidence scores originate from the known context classes (i.e., known mixture components of the concept models) of the existing semantic image concepts. On the other hand, the unlabeled samples with low confidence scores are treated as uncertain samples. The unlabeled samples with high confidence scores can be used to improve the density estimation (i.e., regular updating of the model parameters $\Theta_{c_j}$ and $\Omega_{c_j}$) incrementally. However, they cannot provide additional image contexts for more accurate modeling of the semantic image concepts (i.e., they cannot discover additional mixture components), and thus they do not have the capability to improve the model selection.

By adding the unlabeled samples with high confidence scores for incremental classifier training, the confidence scores for the uncertain samples can be updated over time. Thus the uncertain samples can be further categorized into two classes according to their updated confidence scores: (1) uncertain samples that originate from unknown context classes (i.e., unknown mixture components of the concept models) of the existing semantic image concepts; (2) uncertain samples that originate from uncertain concepts. The uncertain samples with a significant change of confidence score originate from the unknown context classes of the existing semantic image concepts, and thus they should be included for incremental classifier training, because they can provide additional image contexts that achieve more accurate modeling of the semantic image concepts.
On the other hand, the uncertain samples without a significant change of confidence score may originate from uncertain concepts, and thus they should be eliminated from the training set.

After the unlabeled samples for the uncertain concepts are eliminated, the likelihood function described in Eq. (4) is replaced by a joint likelihood function over both the labeled samples and the unlabeled samples with high confidence scores. This joint likelihood function is defined as [43,44]

$$\log P(X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) = \sum_{S_l \in \Lambda_L} \log P(X, C_j|S_l \in C_j, \theta_{s_l})\,\omega_{s_l} + \eta \sum_{S_m \in \Lambda_U} \log \sum_{m=1}^{\kappa} P(Y, C_j|S_m, \theta_{s_m})\,\omega_{s_m}, \qquad (21)$$

where the discount factor $\eta$ determines the relative contribution of the unlabeled samples to the density estimation. Using the joint likelihood function in Eq. (21) in place of the likelihood function in Eq. (4), our adaptive EM algorithm is then performed on the set of mixture training samples, both originally and probabilistically labeled, to learn a new classifier incrementally. By eliminating the outlying visual patterns, the unlabeled samples can be used to obtain more accurate classifiers through improved estimation of the mixture-based sample density and model structure.

E Step: Calculate the expected likelihood function and the posterior probability by using the mixture parameters obtained at the $t$th iteration:

$$P(S_i|X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) = \frac{P(X|S_i, C_j, \theta_{s_i})\,\omega_{s_i}}{\sum_{i=1}^{\kappa} P(X|S_i, C_j, \theta_{s_i})\,\omega_{s_i}}. \qquad (22)$$

M Step: Find the $(t+1)$th estimate of the mixture parameters by integrating the unlabeled samples with high confidence scores:

$$\omega_{s_i}^{t+1} = \frac{1}{N_l + N_u}\left(\sum_{l=1}^{N_l} P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})\right),$$

$$\mu_{s_i}^{t+1} = \frac{\sum_{l=1}^{N_l} X_l\,P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} X_l\,P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}{\sum_{l=1}^{N_l} P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})},$$

$$\sigma_{s_i}^{t+1} = \frac{\sum_{l=1}^{N_l} \Xi\,P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} \Xi\,P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}{\sum_{l=1}^{N_l} P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} P(S_i|X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}, \qquad (23)$$

where $\Xi = (X_l - \mu_{s_i}^{t+1})(X_l - \mu_{s_i}^{t+1})^{T}$.

In order to incorporate feature subset selection with the model selection within our framework, we have proposed a wrapper strategy that combines feature subset selection seamlessly with the underlying classifier training procedure. Given two feature subsets with different dimensions, $F_1$ and $F_2$, our adaptive EM algorithm performs model selection and parameter estimation on each of these feature subsets and obtains two models, $\Lambda_1$ and $\Lambda_2$. To achieve feature selection and model selection simultaneously, a novel technique has been developed to compare classifier models that live in the spaces of different feature subsets and different numbers of mixture components:

$$P(X_{F_1}, C_j, \Lambda_1) = \sum_{l=1}^{\kappa_1} P(X_{F_1}, C_j|S_l, \theta_{s_l}, \omega_{s_l})\,\omega_{s_l}, \qquad P(X_{F_2}, C_j, \Lambda_2) = \sum_{m=1}^{\kappa_2} P(X_{F_2}, C_j|S_m, \theta_{s_m}, \omega_{s_m})\,\omega_{s_m}, \qquad (24)$$

where $\Lambda_1 = \{\kappa_1, \theta_{s_l}, \omega_{s_l}\,|\,l = 1, \ldots, \kappa_1\}$ and $\Lambda_2 = \{\kappa_2, \theta_{s_m}, \omega_{s_m}\,|\,m = 1, \ldots, \kappa_2\}$. The local Kullback divergence $KL(P(X_{F_1}, C_j, \Lambda_1), P(X_{F_1}, \Lambda_1))$ is used to measure the divergence between the mixture distribution $P(X_{F_1}, C_j, \Lambda_1)$ and the local sample density $P(X_{F_1}, \Lambda_1)$:

$$KL(P(X_{F_1}, C_j, \Lambda_1), P(X_{F_1}, \Lambda_1)) = \sum_{l=1}^{\kappa_1} \omega_{s_l} \int P(X_{F_1}, C_j|S_l, \theta_{s_l}, \omega_{s_l}) \log \frac{P(X_{F_1}, C_j|S_l, \theta_{s_l}, \omega_{s_l})}{P(X_{F_1}, \theta_{s_l})}\,dX. \qquad (25)$$

If $KL(P(X_{F_1}, C_j, \Lambda_1), P(X_{F_1}, \Lambda_1)) < KL(P(X_{F_2}, C_j, \Lambda_2), P(X_{F_2}, \Lambda_2))$, the feature subset $F_1$ and the concept model $\Lambda_1$ are selected; if $KL(P(X_{F_1}, C_j, \Lambda_1), P(X_{F_1}, \Lambda_1)) = KL(P(X_{F_2}, C_j, \Lambda_2), P(X_{F_2}, \Lambda_2))$, the model with the smaller feature subset is selected. To implement this feature subset selection scheme, we have proposed a backward search scheme that eliminates the irrelevant features sequentially. In addition, feature correlations are also considered in our feature selection procedure: if a certain feature dimension is eliminated, the other feature dimensions that are highly correlated with it (i.e., that have large values in the covariance matrix) are selected as the next candidates to be eliminated.
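The confidence score of Eq. (20) and the discounted weight update of Eq. (23) can be sketched as follows; the responsibilities (posterior probabilities) are assumed to come from the E step of Eq. (22), and the final renormalization is our own guard, needed because the discount factor η breaks the exact normalization of the raw update.

```python
import numpy as np

def confidence_score(posteriors):
    """Eq. (20): confidence = (max posterior) * (multi-concept margin),
    where the margin is the gap between the two largest posteriors."""
    top, second = np.sort(posteriors)[-2:][::-1]
    return top * (top - second)

def discounted_weight_update(resp_labeled, resp_unlabeled, eta):
    """Sketch of the omega update in Eq. (23).

    resp_labeled   : (N_l, kappa) posteriors for the labeled samples
    resp_unlabeled : (N_u, kappa) posteriors for the high-confidence
                     unlabeled samples kept in the mixture training set
    eta            : discount factor for the unlabeled contribution
    """
    n_l, n_u = len(resp_labeled), len(resp_unlabeled)
    w = (resp_labeled.sum(axis=0)
         + eta * resp_unlabeled.sum(axis=0)) / (n_l + n_u)
    return w / w.sum()  # renormalize (our guard)
```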
Fig. 10. The flowchart for semantic image classification and annotation.
Fig. 11. The result for our multi-level image annotation system, where the image annotation includes the keywords for the concept-sensitive salient objects “sky”, “rock”, “snow”, “forest” and the semantic concept “mountain view”.
By integrating the unlabeled samples for parameter estimation, model selection and feature subset selection, our incremental classifier training technique achieves knowledge discovery (i.e., discovering the unknown context classes of the existing semantic image concepts) and results in more accurate modeling of the semantic image concepts.

5.3. Semantic image classification

Once the classifiers for the $N_c$ pre-defined semantic image concepts are in place, our system takes the following steps for semantic image classification, as shown in Fig. 10: (1) Given a specific test image $I_l$, the underlying concept-sensitive salient objects are detected automatically. It is important to note that one test image may contain multiple types of concept-sensitive salient objects from the basic vocabulary; thus $I_l = \{S_1, \ldots, S_i, \ldots, S_n\}$. This scheme of concept-sensitive image segmentation enables us to interpret the semantics of complex natural images collectively. (2) The class distribution of these concept-sensitive salient objects $I_l = \{S_1, \ldots, S_i, \ldots, S_n\}$ is then modeled as a finite mixture model $P(X, C_j|\kappa, \Theta_{c_j}, \Omega_{c_j})$ [15]. (3) The test image $I_l$ is finally classified into the best matching semantic image concept $C_j$ with the maximum posterior probability

$$P(C_j|X, I_l, \Lambda) = \frac{P(X, C_j|\kappa, \Theta_{c_j}, \Omega_{c_j})\,P(C_j)}{\sum_{j=1}^{N_c} P(X, C_j|\kappa, \Theta_{c_j}, \Omega_{c_j})\,P(C_j)}, \qquad (26)$$

where $\Lambda = \{\kappa, \Theta_{c_j}, \Omega_{c_j},\ j = 1, \ldots, N_c\}$ is the set of mixture parameters and relative weights for the classifiers, and $P(C_j)$ is the prior probability (i.e., relative weight) of the semantic image concept $C_j$ in the database. Thus the approach to semantic image classification taken here is to model the class distribution of the relevant concept-sensitive salient objects by using finite mixture models. Our current experiments focus on generating 15 basic semantic image concepts, such as “beach”, “garden”, “mountain view”, “sailing” and “skiing”, which are widely distributed in natural images.

It is important to note that once an unlabeled test image is classified into a specific semantic image concept, the text keywords used for interpreting the relevant semantic image concept and the underlying concept-sensitive salient objects become the text keywords for annotating the multi-level semantics of the corresponding image. The text keywords for interpreting the concept-sensitive salient objects (i.e., the dominant image components) provide annotations of the images at the content level, and the text keywords for interpreting the relevant semantic image concepts provide annotations at the concept level. Thus our multi-level image annotation framework can support more expressive interpretations of natural images, as shown in Figs. 11–14. In addition, our multi-level image annotation technique is very attractive for enabling semantic image retrieval, since naive users gain more flexibility in specifying their query concepts via keywords at different semantic levels.
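Finally, the classification rule of Eq. (26) is a straightforward MAP decision over the per-concept mixture densities. A minimal sketch under our own interface assumptions (each concept model is a callable returning the mixture density of Eq. (1)):

```python
import numpy as np

def classify_image(x, concept_densities, priors):
    """Eq. (26): assign a test image to the concept with the maximum
    posterior probability.

    x                 : feature vector of the detected salient objects
    concept_densities : list of Nc callables, one per concept model
    priors            : (Nc,) prior probabilities P(Cj) in the database
    """
    scores = np.array([density(x) * p
                       for density, p in zip(concept_densities, priors)])
    posteriors = scores / scores.sum()  # Bayes rule as in Eq. (26)
    return int(np.argmax(posteriors)), posteriors

# Toy usage with two dummy concept models.
models = [lambda x: 0.2, lambda x: 0.8]
label, post = classify_image(np.zeros(56), models, np.array([0.5, 0.5]))
print(label, post)
```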
6. Performance evaluation

Our experiments are conducted on two image databases: a photography database obtained from the Google search engine, and a Corel image database.
Fig. 12. The result for our multi-level image annotation system, where the image annotation includes the keywords for the concept-sensitive salient objects “sky”, “grass”, “flower”, “forest” and the semantic concept “garden”.
Fig. 13. The result for our multi-level image annotation system, where the image annotation includes the keywords for the concept-sensitive salient objects “sky”, “sea water”, “sail cloth” and the semantic concept “sailing”.
Fig. 14. The result for our multi-level image annotation system, where the image annotation includes the keywords for the concept-sensitive salient objects “sky”, “sand field”, “sea water” and the semantic concept “beach”.
The photography database consists of 35,000 digital pictures. The Corel image database includes more than 125,000 pictures covering various image concepts. These images (160,000 in total) are classified into 15 pre-defined classes of semantic image concepts and one additional category for outliers. Our training sets for the 15 semantic image concepts consist of 1800 labeled samples, where each semantic image concept has 120 positive labeled samples. Our algorithm and system evaluation focuses on: (1) using the same classifier, evaluating the performance differences between two image content representation frameworks: concept-sensitive salient objects versus image Blobs; (2) under the same image content representation framework (i.e., using the concept-sensitive salient objects), comparing the performance differences between our proposed classifiers and the well-known SVM classifiers; (3) using the concept-sensitive salient objects for image content representation, evaluating the performance differences of our proposed classifiers with different numbers of unlabeled samples used for classifier training.
6.1. Benchmark metric

The benchmark metric for classifier evaluation includes the classification precision $\alpha$ and the classification recall $\beta$. They are defined as

$$\alpha = \frac{\varpi}{\varpi + \varepsilon}, \qquad \beta = \frac{\varpi}{\varpi + \vartheta}, \qquad (27)$$

where $\varpi$ is the set of true positive samples that are related to the corresponding semantic image concept and are classified correctly, $\varepsilon$ is the set of true negative samples that are irrelevant to the corresponding semantic image concept and are classified incorrectly, and $\vartheta$ is the set of false positive samples that are related to the corresponding semantic image concept but are mis-classified.

As mentioned above, two key issues may affect the performance of the classifiers: (a) the performance of our detection functions for the concept-sensitive salient objects; (b) the performance of the underlying classifier training techniques. Since the real impact on semantic image classification comes from these two key issues, the average precision $\bar\rho$ and the average recall $\bar\varrho$ are then defined as

$$\bar\rho = \rho \times \alpha, \qquad \bar\varrho = \varrho \times \beta, \qquad (28)$$

where $\rho$ and $\varrho$ are the precision and recall of our detection functions for the relevant concept-sensitive salient objects, and $\alpha$ and $\beta$ are the classification precision and recall of the classifiers.
Table 3
The classification performance (i.e., average precision $\bar\rho$ versus average recall $\bar\varrho$) of our classifiers

                          Mountain view   Beach   Garden   Sailing   Skiing   Desert
  Salient objects
    $\bar\rho$ (%)            81.7        80.5    80.6     87.6      85.4     89.6
    $\bar\varrho$ (%)         84.3        84.7    90.6     85.5      83.7     82.8
  Image Blobs
    $\bar\rho$ (%)            78.5        74.6    73.3     79.5      79.3     76.6
    $\bar\varrho$ (%)         75.5        75.9    78.2     77.3      78.2     78.5

Table 4
The classification performance (i.e., average precision $\bar\rho$ versus average recall $\bar\varrho$) of the SVM classifiers

                          Mountain view   Beach   Garden   Sailing   Skiing   Desert
  Salient objects
    $\bar\rho$ (%)            81.2        81.1    79.3     85.5      84.6     85.8
    $\bar\varrho$ (%)         80.5        82.3    84.2     86.3      87.3     88.3
  Image Blobs
    $\bar\rho$ (%)            80.1        75.4    74.7     81.2      78.9     80.2
    $\bar\varrho$ (%)         76.6        76.3    79.4     75.6      79.4     81.7

6.2. Performance comparison
Table 5
The optimal parameters (C, γ) of the SVM classifiers for some semantic concepts

  Semantic concept    C      γ
  Mountain view       512    0.0078
  Beach               32     0.125
  Garden              312    0.03125
  Sailing             56     0.625
  Skiing              128    4
  Desert              8      2
To assess the real impact of using the concept-sensitive salient objects for semantic image classification, we compared the performance of the same semantic image classifier using image Blobs versus concept-sensitive salient objects. The average performance differences for some semantic image concepts are given in Tables 3 and 4. For the SVM approach, the search scheme introduced in Section 4 is used to obtain the optimal model parameters given in Table 5. The average performances are obtained by averaging precision and recall over the 125,000 Corel images and the 35,000 photographs. One can see that using the concept-sensitive salient objects for image content characterization significantly improves the accuracy of the semantic image classifiers (i.e., both the finite mixture models and the SVM classifiers). It is worth noting that the average performance results $\bar\rho$ and $\bar\varrho$ shown in Tables 3 and 4 already include the potential detection errors induced by our detection functions for the relevant concept-sensitive salient objects. In addition, the problem of over-detection of semantic image concepts can also be avoided.

Using the concept-sensitive salient objects for image content representation and feature extraction, the performance comparison between our classifiers and the SVM classifiers is given in Fig. 15. The experimental results are obtained for 15 semantic image concepts on the same test dataset. By determining the optimal model structure and re-organizing the distributions of the mixture components, our proposed classifiers are very competitive with the SVM classifiers. Another advantage of our classifiers is that the models for image concept modeling are semantically interpretable. Some results of our multi-level image annotation system are given in Figs. 16 and 17, where the keywords for automatic image annotation include the multi-level keywords for interpreting both the visually distinguishable concept-sensitive salient objects and the relevant semantic image concepts.

Given the limited sizes of the labeled training sets, we have tested the performance differences of our classifier training algorithm with different numbers of unlabeled samples (i.e., different ratios of unlabeled samples in the mixture training set). The average performance differences for some semantic image concepts are given in Fig. 18. One can see that the unlabeled samples can improve the classifier's performance significantly when only a limited number of labeled samples are available for classifier training. The reason is that a limited number of labeled samples cannot capture the image contexts necessary for semantic image concept interpretation, whereas the unlabeled samples can provide additional image contexts for learning the finite mixture models more accurately.
Fig. 15. The performance comparison (i.e., classification precision versus classification recall) between the finite mixture model (FMM) and the SVM approach.
Fig. 16. The semantic image classification and annotation results for natural scenes containing the semantic image concept "garden" and the most relevant concept-sensitive salient objects.
The reason is that a limited number of labeled samples cannot capture the necessary image contexts for semantic image concept interpretation, while the unlabeled samples provide additional image contexts that allow the finite mixture models to be learned more accurately. If the available labeled sample sets are large enough, the benefit from the unlabeled samples is limited because the labeled samples already provide the necessary image contexts for interpreting the semantic image concepts correctly.
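The sketch below illustrates the general idea of mixing labeled and unlabeled samples in mixture-model training: labeled points keep hard, fixed responsibilities while unlabeled points receive soft posteriors re-estimated at every EM iteration. This is a minimal 1-D, two-class illustration under our own simplifying assumptions, not the authors' training algorithm.

```python
import numpy as np
from scipy.stats import norm

def semi_supervised_em(x_lab, y_lab, x_unl, n_iter=50):
    """Fit one Gaussian per class (2 classes, 1-D features) from a mix of
    labeled and unlabeled samples. Labeled responsibilities stay fixed."""
    x = np.concatenate([x_lab, x_unl])
    n_lab = len(x_lab)
    r = np.zeros((len(x), 2))                 # responsibilities: samples x classes
    r[np.arange(n_lab), y_lab] = 1.0          # labeled: hard, never updated
    r[n_lab:] = 0.5                           # unlabeled: uniform initialization
    for _ in range(n_iter):
        # M-step: responsibility-weighted priors, means, and variances
        w = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / w
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / w
        pi = w / w.sum()
        # E-step: update only the unlabeled responsibilities
        lik = pi * norm.pdf(x[n_lab:, None], mu, np.sqrt(var))
        r[n_lab:] = lik / lik.sum(axis=1, keepdims=True)
    return mu, var, pi

# Usage with a few labeled points and many unlabeled ones (synthetic data):
# mu, var, pi = semi_supervised_em(np.array([0.1, 0.2, 2.9]),
#                                  np.array([0, 0, 1]),
#                                  np.random.randn(50) + 1.5)
```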
Fig. 17. The semantic image classification and annotation results for natural scenes containing the semantic image concept "sailing" and the most relevant concept-sensitive salient objects.
Fig. 18. The relationship between the classifier performance (i.e., precision) and the ratio N_U/N_L between the unlabeled and labeled samples used for classifier training (curves shown for beach, mountain view, sailing, skiing, and garden).
6.3. Convergence evaluation

We have also tested the convergence of our adaptive EM algorithm experimentally. As shown in Figs. 19 and 20, the classifier's performance increases incrementally until our adaptive EM algorithm converges to the underlying optimal model. After convergence, further merging the overlapped mixture components in the finite mixture models decreases the classifier's performance.
Fig. 19. The classifier performance versus the number of mixture components, where the optimal number of mixture components for the semantic image concept "sailing" is 12.
The standard EM algorithm is guaranteed to converge only to a local extremum and cannot guarantee the globally optimal solution. In contrast, our adaptive EM algorithm is able to escape local extrema by incorporating automatic merging, splitting, and death operations, and thus supports reasonably stable convergence to the global solution, as shown in Figs. 19 and 20.
Fig. 20. The classifier performance versus the number of mixture components, where the optimal number of mixture components for the semantic image concept "garden" is 36.
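The merge and death moves discussed above can be made concrete with a toy sketch. The code below prunes mixture components whose prior weight has collapsed and merges pairs of 1-D Gaussians whose symmetric Kullback-Leibler divergence falls below a threshold, using moment matching. It is an illustration under our own assumptions: the paper's adaptive EM also performs splitting and operates on multivariate mixtures, neither of which is shown here.

```python
import numpy as np

def sym_kl_gauss(m1, v1, m2, v2):
    """Symmetric KL divergence between two 1-D Gaussians."""
    kl = lambda ma, va, mb, vb: 0.5 * (np.log(vb / va)
                                       + (va + (ma - mb) ** 2) / vb - 1.0)
    return kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)

def adapt_components(pi, mu, var, death_eps=1e-3, merge_tau=0.05):
    """One round of death and merge moves on a 1-D Gaussian mixture."""
    # Death: drop components whose prior has effectively collapsed.
    keep = pi > death_eps
    pi, mu, var = pi[keep], mu[keep], var[keep]
    # Merge: combine the first strongly overlapped pair found
    # (moment matching preserves the mixture's mean and variance).
    for i in range(len(pi)):
        for j in range(i + 1, len(pi)):
            if sym_kl_gauss(mu[i], var[i], mu[j], var[j]) < merge_tau:
                w = pi[i] + pi[j]
                m = (pi[i] * mu[i] + pi[j] * mu[j]) / w
                v = (pi[i] * (var[i] + (mu[i] - m) ** 2)
                     + pi[j] * (var[j] + (mu[j] - m) ** 2)) / w
                pi[i], mu[i], var[i] = w, m, v
                mask = np.arange(len(pi)) != j
                return pi[mask], mu[mask], var[mask]
    return pi, mu, var
```

Interleaving such moves with ordinary EM iterations is what lets the adaptive procedure change the number of mixture components on the fly instead of committing to a fixed model structure.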
7. Conclusions and future work

This paper has proposed a novel framework to enable more effective semantic image classification and multi-level image annotation. Based on a novel framework for semantic-sensitive image content representation and classifier training, our multi-level image annotation system has achieved very good performance. Integrating unlabeled samples into classifier training not only dramatically reduces the cost of labeling the samples required for accurate classifier training, but also increases the classification accuracy significantly. Experimental results have also demonstrated the efficiency of our new framework and strategies for semantic image classification. It is worth noting that the proposed automatic salient object detection and semantic image classification techniques can also be applied to other image domains when labeled training samples are available. It is also very important to classify images into multi-level semantic image concepts via a concept hierarchy. Our future work will focus on these problems.
8. Summary

The semantic image similarity can be categorized into two major classes: (a) similar image components (e.g., sky, grass) or similar global visual properties (e.g., openness, naturalness); (b) similar semantic image concepts (e.g., garden, beach, mountain view) or similar abstract image concepts (e.g., image events such as sailing, skiing). To achieve the first class of semantic image similarity, it is very important
to enable more expressive representation and interpretation of the semantics of image contents. To achieve the second class of semantic image similarity, semantic image classification has been reported as a promising approach, but its performance largely depends on two key issues: (1) the effectiveness of the visual patterns used for image content representation and feature extraction; (2) the significance of the algorithms for semantic image concept modeling and classifier training. Based on these observations, we have proposed a multi-level approach to interpret the semantics of natural images by using both the dominant image components and the relevant semantic image concepts. The major contributions of this paper include:
(a) using the concept-sensitive salient objects as the dominant image components to achieve a middle-level understanding of the semantics of image contents and to enhance the quality of features for discriminating between different semantic image concepts;
(b) automatic detection of the concept-sensitive salient objects by using SVM classifiers with an automatic scheme for searching the optimal model parameters;
(c) semantic image concept modeling and interpretation by using finite mixture models to approximate the class distributions of the relevant concept-sensitive salient objects;
(d) an adaptive EM algorithm to achieve optimal model selection and model parameter estimation simultaneously;
(e) integrating a large number of unlabeled samples with a limited number of labeled samples to achieve knowledge discovery from large sets of natural images (i.e., to discover the hidden image contexts);
(f) semantic image classification and multi-level image annotation to enable more effective image retrieval at the semantic level, so that naive users have more flexibility to specify their query concepts by using various keywords at different semantic levels.
About the Author—JIANPING FAN received his M.S. degree in theoretical physics from Northwestern University, Xi'an, China, in 1994, and the Ph.D. degree in optical storage and computer science from the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Shanghai, China, in 1997. He was a researcher at Fudan University, Shanghai, China, during 1998. From 1998 to 1999, he was a researcher with the Japan Society for the Promotion of Science (JSPS), Department of Information System Engineering, Osaka University, Osaka, Japan. From September 1999 to 2001, he was a researcher in the Department of Computer Science, Purdue University, West Lafayette, IN. He is now an assistant professor in the Department of Computer Science, University of North Carolina at Charlotte, NC. His research interests include semantic video computing and content-based video retrieval for medical education.

About the Author—HANGZAI LUO received his B.S. degree in computer science from Fudan University, Shanghai, China, in 1998. From 1998 to 2002, he was a lecturer in the Department of Computer Science, Fudan University. He is now pursuing his Ph.D. degree in Information Technology at the University of North Carolina at Charlotte. His research interests include video analysis and content-based video retrieval.
About the Author—YULI GAO received his B.S. degree in computer science from Fudan University, Shanghai, China, in 2002. He is now pursuing his Ph.D. degree in the Department of Computer Science, University of North Carolina at Charlotte. His research areas are image segmentation and classification.

About the Author—GUANGYOU XU received his B.S. degree in computer science from Tsinghua University, Beijing, China, in 1963. He joined Tsinghua University as an assistant professor in 1963 and became a professor in 1989. From 1982 to 1984, he was a visiting professor at Purdue University, West Lafayette, IN, USA. From 1993 to 1994, he was a visiting professor at the Beckman Institute, University of Illinois at Urbana-Champaign. His research areas include computer vision, content-based image/video analysis and retrieval.