Integrating global and local visual features with semantic hierarchies for two-level image annotation


Neurocomputing 171 (2016) 1167–1174


Zhiming Qian*, Ping Zhong, Jia Chen
College of Electronic Science and Engineering, National University of Defense Technology, 410073 Changsha, China
* Corresponding author. E-mail addresses: [email protected] (Z. Qian), [email protected] (P. Zhong), [email protected] (J. Chen).
http://dx.doi.org/10.1016/j.neucom.2015.07.094

Article info

Article history: Received 23 December 2014; received in revised form 30 May 2015; accepted 20 July 2015; available online 10 August 2015. Communicated by Rongrong Ji.

Keywords: Image annotation; Semantic hierarchy; Support vector machine; Support tensor machine; Conditional random field

Abstract

Image annotation is a challenging task due to the semantic gap between low-level visual features and high-level human concepts. Most previous annotation methods treat the task as a multilabel classification problem. However, these methods suffer from poor accuracy and efficiency when plentiful visual variations and large semantic vocabularies are encountered. In this paper, we focus on two-level image annotation by integrating both global and local visual features with semantic hierarchies, in an effort to simultaneously learn annotation correspondences in a relatively small and most relevant subspace. Given an image, the two-level task comprises scene classification for the image and object labeling for its regions. For scene classification, we first define several specific scenes that describe most cases in the given image data, and then use support vector machines (SVMs) based on the global features. For region labeling, we first format a set of nouns in accordance with WordNet to define relevant objects, and then use local support tensor machines (LSTMs) based on high-order regional features. By introducing a new conditional random field (CRF) model that exploits the multiple correlations with respect to scene–object hierarchies and object–object relationships, our system achieves a more hierarchical and coherent description of image contents than do simpler image annotation tasks. Experimental results are reported over the MSRC and SAIAPR datasets to validate the superiority of using multiple visual features and prior semantic correlations for image annotation, by comparison with several state-of-the-art methods. © 2015 Elsevier B.V. All rights reserved.

1. Introduction

The great proliferation of web images has brought new opportunities to users, who now have a large amount of image resources to draw on when searching for content. An emerging issue in this setting is how to browse and retrieve this daunting volume of image data. To address it, images are usually complemented by associated semantics that describe them. To provide such semantics, it is essential to understand the image content and label it appropriately, which is the task of image annotation. Previous studies [1–3] mostly focus on matching images and words with a flat approach that uses an unstructured semantic vocabulary for image annotation. However, large-scale image annotation poses tremendous challenges if this approach is simply employed: (1) the sparseness problem: singly labeled training images are difficult to obtain in quantities sufficient to fully describe each object class; (2) the imbalance problem: the number of positive samples for a given object class is


seriously smaller than the number of negative ones; (3) the complexity problem: the computational cost grows quadratically with the number of object classes. To address these issues, it is clearly worthwhile to make reasonable prior assumptions that help in learning image semantics, given that humans begin training the complicated neural networks in their brains from birth, whereas machines may train for just a few hours to understand images. Usually, these assumptions should be structured according to the hierarchies in human visual systems. Consider, for example, an image that depicts a water lily and its leaves in a pond. It would be possible to label the image with the terms "pond", "flower" or "lily". These terms carry information at multiple levels: specific scene (e.g. "pond"), generic object (e.g. "flower") and specific object (e.g. "lily"). Moreover, Jaimes and Chang [4] propose structuring image labels into 10 levels between visual features and abstract interpretation, with the purpose of helping machines understand the complex visual world. Multilevel image annotation [5–11], which uses hierarchies to partition the whole vocabulary into a few structured subsets, can significantly reduce the complexity of classifier training and improve the discriminative power of single-label classifiers. Although the hierarchical approach provides some advantages, it may still suffer from the above sparseness problem and also lead


to the problem of inter-level error propagation, where misclassification errors at the parent nodes of the semantic hierarchy propagate to their child nodes. To mitigate the sparseness problem, strong inter-class correlations are usually employed, because data-driven methods for single-label annotation do not offer abstract reasoning mechanisms that allow inferring the meaning of labels from their semantics. On the other hand, context techniques have been applied to improve the representations of different labels. Incorporating knowledge into annotation techniques emerges as a promising approach for improving annotation efficiency, and provides coherent semantic domain models to support visual inference in the specified context. In [5], millions of tiny images are labeled with WordNet nouns by categorizing images at different levels using a hierarchical vote. Considering that conceptual similarity is not always in accordance with visual similarity, Fan et al. [6] suggest building a conceptual ontology using both conceptual and visual similarities, and propose a hierarchical boosting algorithm based on this hierarchy that allows image annotation at different semantic levels. Moreover, probabilistic graphical models [7,8] are a promising alternative for modeling the context-dependent relations among data in a more realistic fashion. In particular, hierarchical Bayesian models are employed because they can model spatial neighborhood relations in images. To tackle the issue of inter-level error propagation, some researchers [9–11] have proposed hierarchical hinge loss functions to jointly optimize the node classifiers at multiple levels. More specifically, in [9], large-margin kernel methods and Bayesian analysis are combined for supervised classification learning with a hierarchical structure of labels. In [10], a large-margin method is formulated through constraints characterizing a multipath hierarchy, which permits a treatment of various losses for hierarchical classification, while in [11], a cost-sensitive learning algorithm is developed to jointly train hierarchical tree classifiers by penalizing various types of misclassification errors.

In this paper, we address the challenge of multilevel image annotation by jointly learning a conditional random field (CRF) model using semantic hierarchies. To retain the advantages of semantic hierarchies while limiting inter-level error propagation, we impose only two levels of semantic concepts that can be defined without much confusion for the annotation task, i.e. scene-level and object-level labels. Specifically, we propose to use support vector machines (SVMs) based on global features for scene classification and local support tensor machines (LSTMs) based on high-order regional features for accurate region labeling. Motivated by perceptual psychology, global features introduce global image statistics about inter-object or contextual relations, seeking proper scene configurations, while local features operate on regions or superpixels to provide shape, continuity or symmetry information for recognizing objects. Moreover, we formulate learning as a joint optimization problem over two-level semantic concepts by introducing a new CRF model, which simultaneously minimizes the misclassification error from global features to scene-level labels and the prediction losses from local features to object-level labels. We expect that such a model can automatically capture intrinsic explanations of two-level semantic concepts, and hence improve the overall prediction performance for both scene-level and object-level labels. More specifically, we introduce the third-order statistical tensor proposed in [12] with the purpose of improving the accuracy of region labeling, which is the most critical component of our framework owing to the large object vocabulary. Finally, as the experimental evaluation of the proposed method will show, the combination of global and local features, as well as contextual and hierarchical labeling information, leads to improved performance compared with several state-of-the-art flat and two-level annotation models.

The remainder of this paper is organized as follows. Section 2 reviews related work, followed by the description of the proposed model for two-level image annotation in Section 3. Section 4 presents the experimental results. Conclusions and future work are given in Section 5.

2. Related work

Multilevel image annotation has received contributions from several fields, such as artificial intelligence, pattern recognition and other disciplines, which define a robust theoretical background. From this perspective, we give a systematic introduction to theoretical approaches for multilevel knowledge representation, with a discussion on the use of both semantic and regional features to interpret and analyze images.

2.1. Semantic hierarchies for image annotation

The choice of terms used to describe image contents is very important for the performance of an image annotation system. Usually, these terms are structured. Layne [13] first divided them into four categories: biographical attributes, subject attributes, exemplified attributes and relationship attributes. For image annotation, we are mainly concerned with subject attributes. Jaimes and Chang [4] later provided 10 levels for describing the subject attributes of images. Based on this semantic description, Hollink et al. [14] used the unified modeling language (UML) to formally describe the annotation levels. Moreover, ontologies [6,15] also provide explicit specifications of relations among semantic terms to construct a structured vocabulary. As this vocabulary is usually built from a consensus of domain experts, it represents a powerful way to structure the semantic terms. However, the explicit expressions of ontologies are not always beneficial to the task of image annotation, because visual observations are not always in accordance with expert knowledge. Besides, inter-level error propagation is a hard problem when using ontologies. In this paper, we investigate the pros and cons of using semantic hierarchies, and take only two levels of semantic concepts, i.e. scene-level and object-level labels, which are widely accepted by previous research and show coherent visual statistics within each level.

2.2. Proximity between semantic terms

To assess the proximity between semantic terms, three groups of measures are generally distinguished [16]: edge-based measures, feature-based measures and information content measures. Edge-based measures directly compute the semantic similarity from the structure of a vocabulary or ontology; feature-based measures compute the similarity according to the degree of overlap between two sets of semantic features; information content measures associate appearance probabilities with two semantic concepts, computed from their occurrences in a text corpus. The work in [17] has experimentally shown that the edge-based measures outperform the others because they take the vocabulary structure into account, as in human perception. To quantify semantic similarity, an intuitive edge-based method [18] was originally proposed that computes the minimum path length connecting the corresponding ontological nodes of semantic terms via taxonomic links. The underlying idea is that the longer the path, the more semantically distant the terms are. The work in [19] additionally considers the relative depth of the semantic terms in the ontology, and measures semantic similarity by counting the number of taxonomic links from each term to their least common subsumer (LCS) and the number of links from the LCS to the root of the ontology. Moreover, Al-Mubaid and Nguyen [20] have proposed a powerful and comprehensive measure that combines the minimum path length and the taxonomical depth of the considered branches. In this paper, we follow the work in [20] and consider both the relative path length between the semantic terms and their relative depth in the ontology using WordNet.
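For illustration, both families of edge-based measures described above are available off the shelf in NLTK's WordNet interface; the sketch below is our own and only a stand-in for the measure of [20] that the paper actually follows:

```python
# A minimal sketch using NLTK (an assumption; the paper does not name a
# toolkit). Requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn

flower = wn.synset('flower.n.01')
lily = wn.synset('lily.n.01')

# Pure path-length measure in the spirit of [18]: inverse of the
# shortest taxonomic path between the two nodes.
print(flower.path_similarity(lily))

# Depth-aware measure in the spirit of [19]: Wu-Palmer similarity,
# which weights the depth of the least common subsumer (LCS).
print(flower.wup_similarity(lily))
```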

2.3. Descriptors for region labeling

Region representations are beneficial for characterizing correspondence ambiguity in methods that attempt to learn from loosely labeled data. However, in real-world applications, efficiently and effectively describing segmented regions is a hard problem due to their arbitrary shapes. For computational efficiency, the classical approaches [21,22] are based on the vector space model, which encodes particular statistical properties of the region data. However, vectorizations suffer from serious degeneration when separating different types of regions. To address this, matrix or tensor representations [23,24] are applied to describe the interactions of multiple factors inherent to image regions and to encode the higher-order statistics of these factors. Tuzel et al. [23] first introduced region covariance descriptors (RCDs) for object detection and texture classification, arguing that this descriptor is more general than data sums or histograms. Usually, RCDs are not embedded in a common Euclidean space. Thus, Sivalingam et al. [24] proposed a sparse coding technique for positive definite matrices that preserves the positivity of their eigenvalues with respect to the data structure of the Riemannian manifold. Recently, the work in [12] argued that a third-order statistical tensor can capture both distance and angle statistics. In this paper, we introduce such a high-order region representation into our framework of two-level image annotation in an attempt to achieve more convincing region labels.


3. The proposed method

Fig. 1 illustrates our framework for two-level image annotation. Given an image, we first extract the global feature and employ SVMs for scene classification. Then, we use the global feature again to search for the nearest neighbors, which are used to construct LSTMs for region labeling. Finally, we build the joint model with a CRF for two-level image annotation by combining the tasks of scene classification and region labeling.

Fig. 1. Illustration of the framework of two-level image annotation.

3.1. Definition rules of scenes and objects

As the task of image retrieval mostly concentrates on searching images with nouns, we use WordNet nouns to format the annotation vocabulary, with the purpose of obtaining semantic relationships among these nouns. Then, we divide the vocabulary into two levels, i.e. scenes and objects, according to the semantic levels described in [4]. We discard abstract scenes and abstract objects because these interpretations involve more complex knowledge and are hard to obtain efficiently. The work in [4] also points out that the distinction between the levels is not strict; this, however, is not beneficial to constructing semantic hierarchies. In practice, we can easily decide whether an image label belongs to a scene or an object, so we argue that the hierarchies between scenes and objects always exist. To make the semantic hierarchies more explicit and consistent, we define only two-level semantic hierarchies with respect to scenes and objects. Notably, we mostly use specific scenes and generic objects rather than generic scenes and specific objects in order to keep the semantics coherent in large-scale datasets. Table 1 sorts the vocabulary of ground-truth labels in the MSRC [25] dataset by listing each scene label with its most related object labels.

Table 1. Scene and object labels in MSRC.

Scene label      | Most related object labels
Grassland        | Grass, cow, sheep, horse, water
Airport          | Airplane, sky, grass, building, tree
Downtown         | Building, car, road, sign, bicycle, grass, tree, sky, face, body
Landscape        | Sky, flower, mountain, tree, grass, water, boat, road, building
Human-activity   | Face, body, building, book, chair, sky, grass, flower
Animal-activity  | Bird, cat, dog, face, body, sky, tree, grass, water, building

3.2. SVM for scene classification

To represent the visual contents of an image, we use the global feature of the colored pattern appearance model (CPAM) [26], which is comprised of the achromatic spatial pattern histogram (ASPH) and the chromatic spatial pattern histogram (CSPH). Here, we simply represent each image with a 128-dimensional feature vector comprising 64 achromatic prototypes and 64 chromatic prototypes. We train an SVM classifier [27] on CPAM features extracted from training images to conduct scene classification, choosing all the positive and negative examples in the training dataset when learning a classifier for a specific scene. Therefore, the feature function for a scene $c \in \{1, \ldots, C\}$ is

$f_c^S(x) = \mathrm{sigm}(\langle w_c, x \rangle + b_c)$  (1)

where $\mathrm{sigm}(\cdot)$ is a sigmoid function used to shape SVM scores, $w_c$ is the weight vector trained for scene $c$, and $b_c$ is the corresponding bias.
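As a concrete illustration of eq. (1), a one-vs-rest scene classifier over the 128-dimensional CPAM vectors might be sketched as follows; scikit-learn is our assumption, since the paper only specifies the SMO-based SVM of [27]:

```python
import numpy as np
from sklearn.svm import LinearSVC

def sigm(z):
    """Sigmoid used to shape the raw SVM scores, as in eq. (1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_scene_svms(X, y, n_scenes):
    """Train one linear SVM per scene, one-vs-rest.
    X: (N, 128) CPAM features; y: (N,) scene indices in {0, ..., C-1}."""
    return [LinearSVC(C=1.0).fit(X, (y == c).astype(int))
            for c in range(n_scenes)]

def scene_scores(models, x):
    """f_c^S(x) = sigm(<w_c, x> + b_c) for every scene c."""
    return np.array([sigm(m.decision_function(x.reshape(1, -1))[0])
                     for m in models])
```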


Fig. 2. An example of image over-segmentation. (a) Input image of 480 × 360 pixels. (b) An over-segmentation into 119 regions.

3.3. LSTM for region labeling

For region labeling, we first segment the image via the normalized cuts algorithm [28] to generate image regions. Fig. 2 shows representative segmentation results. Using segmented regions rather than image pixels provides a clear computational improvement: each image contains approximately 150–500 K pixels, whereas the number of segmented regions is about 100 in our experiments. To represent image regions, we aim to give a highly informative representation by using high-order statistics. The work in [12] argues that a third-order statistical tensor can capture both distance and angle statistics. We adopt this representation and define the region descriptor $R \in \mathbb{R}^{D \times D \times D}$ as

$R = \frac{1}{H} \sum_{i=1}^{H} (r_i - \mu) \circ (r_i - \mu) \circ (r_i - \mu)$  (2)

where $r_i \in \mathbb{R}^D$ is the pixel-wise feature vector, $\mu = \frac{1}{H} \sum_{i=1}^{H} r_i$ is the mean feature vector, "$\circ$" is the outer product [12], and $H$ is the number of pixels in the region. Here, we characterize image regions by extracting PCA-SIFT features [29], which assign each pixel a 36-dimensional vector.
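Since eq. (2) is just a centered third-order moment, it can be computed in a few lines; the NumPy sketch below is our own formulation, not code from [12]:

```python
import numpy as np

def region_tensor(F):
    """Third-order statistic tensor of eq. (2).
    F: (H, D) array of pixel-wise features within one region,
    e.g. D = 36 for the PCA-SIFT features used in the paper."""
    mu = F.mean(axis=0)            # mean feature vector
    C = F - mu                     # centered features (r_i - mu)
    # (1/H) * sum_i (r_i - mu) o (r_i - mu) o (r_i - mu)
    return np.einsum('hi,hj,hk->ijk', C, C, C) / F.shape[0]

R = region_tensor(np.random.randn(500, 36))   # toy region of 500 pixels
print(R.shape)                                 # (36, 36, 36)
```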

Then, supposing that an image can only be related to a limited number of image samples in the image database, we search for the K potentially nearest neighbors of a test image in the database to form a local tagging dictionary. Defining the number of regions in the ground truths of the nearest neighbors as M, we aim to label each region with a relevant class $v \in \{1, \ldots, V\}$ using an LSTM by solving the following optimization problem:

$\min_{W_v, b_v, \xi_v} \; J(W_v, b_v, \xi_v) = \frac{1}{2} \prod_{i=1}^{3} \lVert w_v^{(i)} \rVert_F^2 + \eta_v \sum_{m=1}^{M} \xi_{vm}$

$\text{s.t.} \quad y_{vm} \Big( R_m \prod_{i=1}^{3} \times_i w_v^{(i)} + b_v \Big) \ge 1 - \xi_{vm}, \quad \xi_{vm} \ge 0, \; m = 1, \ldots, M$  (3)

where $y_{vm} \in \{+1, -1\}$ is the label that determines whether the corresponding image region belongs to class $v$, $W_v = w_v^{(1)} \circ w_v^{(2)} \circ w_v^{(3)}$ is the weight tensor, $b_v$ is the bias, $\xi_{vm}$ is the misclassification error, and $\eta_v$ is the trade-off between the classification margin and the misclassification error. Once the LSTM model in (3) has been solved, whether a test region $R$ belongs to object $v$ can be predicted by

$f_v^T(R) = \mathrm{sigm}\Big( R \prod_{i=1}^{3} \times_i w_v^{(i)} + b_v \Big)$  (4)
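Once the three weight vectors of the rank-one tensor $W_v$ are learned, the prediction in eq. (4) amounts to contracting the region tensor along its three modes; a minimal sketch (the mode products are our own formulation):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def stm_predict(R, w1, w2, w3, b):
    """f_v^T(R) of eq. (4): mode-1/2/3 products of the D x D x D region
    tensor R with the learned weight vectors, plus bias, through a sigmoid."""
    return sigm(np.einsum('ijk,i,j,k->', R, w1, w2, w3) + b)

# toy check with D = 36 (the PCA-SIFT dimension used in the paper)
D = 36
print(stm_predict(np.random.randn(D, D, D),
                  *(np.random.randn(3, D) / D), b=0.0))
```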

3.4. Joint modeling with CRF

To model the correlations between the labels, we employ a CRF model, a probabilistic model first presented by Lafferty et al. [30] for segmenting and labeling sequential data. Different from other probabilistic models, a CRF models conditional likelihoods instead of joint distributions, relaxing the assumptions on the distributions and making it easier to take more features and factors into account. Here, we employ it to jointly model the correlations among scene labels and object labels based on both the global and the local visual features. We first define some notation. Given an image training set $\{I_n, n = 1, \ldots, N\}$, the global and local features are represented by $X = \{x_n, n = 1, \ldots, N\}$ and $R = \{R_{nm}, n = 1, \ldots, N, m = 1, \ldots, M\}$, respectively. We denote the sets of scene labels and object labels for these images as $S = \{s_n, n = 1, \ldots, N\}$ and $T = \{t_{nm}, n = 1, \ldots, N, m = 1, \ldots, M\}$, respectively, where $s_n \in \{1, \ldots, C\}$ and $t_{nm} \in \{1, \ldots, V\}$ index the lists of scene and object labels. Fig. 3 gives the graphical description of the proposed CRF model. To model the correlations, we use the conditional probability

$P(S, T \mid X, R; \Theta) = \frac{1}{Z(X, R, \Theta)} \exp\Big( \sum_{n=1}^{N} \phi^S(x_n, s_n; \alpha_{s_n}) + \sum_{n=1}^{N} \sum_{m=1}^{M} \big( \phi^T(R_{nm}, t_{nm}; \beta_{t_{nm}}) + \phi^{ST}(s_n, t_{nm}; \gamma_{s_n t_{nm}}) \big) + \sum_{n=1}^{N} \sum_{i=1}^{M} \sum_{j=1}^{M} \phi^{TT}(t_{ni}, t_{nj}; \theta_{t_{ni} t_{nj}}) \Big)$  (5)

where $\Theta = \{\alpha_i, \beta_j, \gamma_{ij}, \theta_{jk};\ i = 1, \ldots, C,\ j, k = 1, \ldots, V\}$ is the set of parameters learned from the training data, and $Z(X, R, \Theta)$ is the normalizing function with respect to $\Theta$.

Fig. 3. Illustration of the proposed CRF model.
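To make the factorization in eq. (5) concrete, the unnormalized log-score of a single labeled image decomposes as in the sketch below; the four phi_* callables are hypothetical placeholders for the potentials of eqs. (6), (7), (9) and (10) defined next:

```python
def log_score(s, t_labels, x, regions, phi_S, phi_T, phi_ST, phi_TT):
    """Unnormalized log-score of one image under eq. (5).
    s: scene index; t_labels[m]: object index of region m;
    phi_*: callables standing in for the learned potential functions."""
    score = phi_S(x, s)                        # scene potential
    for R, t in zip(regions, t_labels):
        score += phi_T(R, t) + phi_ST(s, t)    # object + scene-object terms
    for ti in t_labels:
        for tj in t_labels:
            score += phi_TT(ti, tj)            # object-object terms
    return score
```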

3.4.1. Defining the potential functions

To obtain the potential function $\phi^S(x_n, s_n; \alpha_{s_n})$ for determining the scene, we employ the feature function in (1) and obtain

$\phi^S(x_n, s_n; \alpha_{s_n}) = \alpha_{s_n} f^S_{s_n}(x_n)$  (6)

Similarly, the potential function $\phi^T(R_{nm}, t_{nm}; \beta_{t_{nm}})$ is

$\phi^T(R_{nm}, t_{nm}; \beta_{t_{nm}}) = \beta_{t_{nm}} f^T_{t_{nm}}(R_{nm})$  (7)

To evaluate the correlation function $\phi^{ST}(s_n, t_{nm}; \gamma_{s_n t_{nm}})$, we first define the similarity of two concepts in WordNet as

$W^{ST}(s_n, t_{nm}) = \frac{2 \cdot IC(\mathrm{lcs}(s_n, t_{nm}))}{IC(s_n) + IC(t_{nm})}$  (8)

where $IC(\cdot)$ is the information content, and $\mathrm{lcs}(\cdot, \cdot)$ is the least common subsumer in the WordNet taxonomy. Then, this correlation function can be defined as

$\phi^{ST}(s_n, t_{nm}; \gamma_{s_n t_{nm}}) = \gamma_{s_n t_{nm}} \big( \rho_1 W^{ST}(s_n, t_{nm}) + (1 - \rho_1) f^{ST}(s_n, t_{nm}) \big)$  (9)

where $f^{ST}(s_n, t_{nm})$ is the co-occurrence function of the scene label $s_n$ and the object label $t_{nm}$. Similarly, we evaluate the correlation function $\phi^{TT}(t_{ni}, t_{nj}; \theta_{t_{ni} t_{nj}})$ by

$\phi^{TT}(t_{ni}, t_{nj}; \theta_{t_{ni} t_{nj}}) = \theta_{t_{ni} t_{nj}} \big( \rho_2 W^{TT}(t_{ni}, t_{nj}) + (1 - \rho_2) f^{TT}(t_{ni}, t_{nj}) \big)$  (10)

where $W^{TT}(t_{ni}, t_{nj})$ and $f^{TT}(t_{ni}, t_{nj})$ are the similarity and the co-occurrence function of two object labels, respectively. Taking all these functions into account, the log-likelihood function of (5) becomes

$L = -\log Z + \sum_{n=1}^{N} \alpha_{s_n} f^S_{s_n}(x_n) + \sum_{n=1}^{N} \sum_{m=1}^{M} \beta_{t_{nm}} f^T_{t_{nm}}(R_{nm}) + \sum_{n=1}^{N} \sum_{m=1}^{M} \gamma_{s_n t_{nm}} \big( \rho_1 W^{ST}(s_n, t_{nm}) + (1 - \rho_1) f^{ST}(s_n, t_{nm}) \big) + \sum_{n=1}^{N} \sum_{i,j=1}^{M} \theta_{t_{ni} t_{nj}} \big( \rho_2 W^{TT}(t_{ni}, t_{nj}) + (1 - \rho_2) f^{TT}(t_{ni}, t_{nj}) \big)$  (11)
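Eq. (8) is exactly Lin's information-content similarity, which NLTK can evaluate over WordNet directly; the Brown IC corpus below is our assumption, as the paper does not say which corpus supplies $IC(\cdot)$:

```python
# requires nltk.download('wordnet') and nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # information-content statistics

def w_st(scene_synset, object_synset):
    """W^ST of eq. (8): 2*IC(lcs(s,t)) / (IC(s) + IC(t))."""
    s = wn.synset(scene_synset)    # e.g. 'grassland.n.01'
    t = wn.synset(object_synset)   # e.g. 'cow.n.01'
    return s.lin_similarity(t, brown_ic)

print(w_st('grassland.n.01', 'cow.n.01'))
```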

3.4.2. Parameter estimation

Observing that $L$ is a convex function with respect to $\Theta$, we take the conjugate-gradient method provided in [31] to maximize the log-likelihood in (11). In detail, we first compute the gradient $\nabla L$ and expand the log-likelihood using a Taylor series:

$L(\Theta + \Delta\Theta) = L(\Theta) + \nabla L(\Theta)^T \Delta\Theta + \tfrac{1}{2} \Delta\Theta^T B \Delta\Theta$  (12)

where $B$ is the Hessian matrix, which can be efficiently updated. The log-likelihood attains its extremum at

$\nabla L(\Theta) + B \Delta\Theta = 0$  (13)

Then, we obtain the updating scheme

$\Theta \leftarrow \Theta + \Delta\Theta = \Theta - B^{-1} \nabla L(\Theta)$  (14)
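Any conjugate-gradient routine realizes this maximization in practice; a generic SciPy sketch (a stand-in for the specific implementation of [31]) could be:

```python
from scipy.optimize import minimize

def fit_crf(theta0, neg_log_likelihood, neg_gradient):
    """Maximize L(theta) by minimizing -L with nonlinear conjugate
    gradients; neg_log_likelihood and neg_gradient are assumed to
    evaluate -L and -grad(L) of eq. (11) on the training data."""
    res = minimize(neg_log_likelihood, theta0, jac=neg_gradient,
                   method='CG', options={'maxiter': 200})
    return res.x
```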

3.5. Two-level image annotation with CRF

To annotate a test image with global feature $x'$ and regional features $R' = \{R'_m, m = 1, \ldots, M'\}$, we assign a scene label $s'$ and several object labels $T' = \{t'_m, m = 1, \ldots, M'\}$ to it. Our method then infers the most probable labels under the trained parameters by maximizing:

$P(s', T' \mid x', R'; \Theta) = \frac{1}{Z} \exp\Big( \phi^S(x', s'; \alpha_{s'}) + \sum_{m=1}^{M'} \big( \phi^T(R'_m, t'_m; \beta_{t'_m}) + \phi^{ST}(s', t'_m; \gamma_{s' t'_m}) \big) + \sum_{i,j=1}^{M'} \phi^{TT}(t'_i, t'_j; \theta_{t'_i t'_j}) \Big)$  (15)

The loopy belief propagation method is employed for this task; it iteratively passes positive real-valued messages between the variables until convergence. Compared with the traditional mean-field algorithm, loopy belief propagation is more robust and faster for the inference task [32].

4. Experimental results

We set up several quantitative experiments as follows. We first describe the experimental settings. Then, we investigate the effects of the modeling parameters. Next, we evaluate the improvement of our method for scene classification and region labeling. Finally, we compare the performance of our method for two-level image annotation with several state-of-the-art methods.

4.1. Datasets and evaluation criteria

We evaluate the proposed framework on the MSRC [25] and SAIAPR [33] datasets. Both datasets have been manually segmented and labeled with region-level ground truth, and thus they are well suited for evaluating the performance of the involved annotation methods. We list the details of the experimental datasets in Table 2.

Table 2. The statistics of the two image datasets.

Dataset | Total size | No. of scenes | No. of objects
MSRC    | 591        | 6             | 23
SAIAPR  | 20,000     | 6             | 195

To make a fair comparison, we split each dataset into a training part used to construct a ground-truth dataset and a test part employed for performance evaluation. We evaluated the performance by means of 10-fold cross-validation: we repeated 10 trials, each using 90% of the image dataset for training and 10% for testing, with different data splits in each trial. The performance of two-level image annotation is evaluated by comparing the automatically generated scene and region labels for the test set with the human-produced ground truth. Our unified framework includes the tasks of scene classification and region labeling, so we use the following two measures for evaluating performance. First, we measure accuracy for scene classification and pixel-level region labeling. Then, we evaluate image annotation by reporting precision (Prec), recall (Rec) and the $F_1$ measure averaged over all labels. Precision is the fraction of predicted labels that are true positives rather than false positives, while recall is the fraction of true positives that are labeled rather than missed. The $F_1$ measure is defined as

$F_1 = \frac{2 \cdot Prec \cdot Rec}{Prec + Rec}$  (16)
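As a worked example of eq. (16) with made-up counts: a label predicted 10 times with 6 true positives, against 8 ground-truth occurrences, gives Prec = 6/10 = 0.6, Rec = 6/8 = 0.75 and F1 = 2(0.6)(0.75)/(0.6 + 0.75) ≈ 0.667. A per-label sketch:

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 of eq. (16) from raw counts."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

print(prf1(tp=6, fp=4, fn=2))   # (0.6, 0.75, 0.666...)
```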


4.2. Parameter analysis

Two groups of parameters affect our method for two-level image annotation: the number of nearest neighbors K, which determines the accuracy of region labeling, and the tradeoffs ρ1 and ρ2, which determine the impact of the correlation functions $\phi^{ST}$ and $\phi^{TT}$.

4.2.1. Effects of the number of nearest neighbors

We evaluate the number of nearest neighbors K by measuring the accuracy of pixel-level region labeling using the LSTM with sizes from 10 to 100 on both datasets. As seen in Fig. 4, the performance of pixel-level region labeling grows much more slowly once the size of the local image subset reaches 30 on both datasets. This reveals that the subset of nearest neighbors can provide the dominant labeling information to support region labeling when the neighbor size is relatively large. However, a larger neighbor size is not guaranteed to achieve better performance, and it introduces higher computational complexity and more tag noise, which may degrade labeling performance. Besides, a larger number of nearest neighbors does augment the robustness of the region classifiers, yielding smaller deviations. In the following experiments, we fix K = 50 and K = 80, which achieve the best performance for MSRC and SAIAPR, respectively.

Fig. 4. The performance of pixel-level region labeling with different numbers of nearest neighbors.

4.2.2. Effects of the tradeoffs

The tradeoffs ρ1 and ρ2 account for the relative contributions of semantic similarities and data co-occurrences in two-level image annotation. We investigate these two parameters by setting them to a series of values ranging from 0 to 1. Fig. 5 illustrates the impact of ρ1 and ρ2 by measuring the F1 scores of label predictions with the proposed method. We observe that ρ1 = 0.4, ρ2 = 0.8 and ρ1 = 0.2, ρ2 = 0.6 achieve the highest F1 scores for MSRC and SAIAPR, respectively. This result reveals that scene–object hierarchies prefer data co-occurrences over semantic similarities, while object–object relationships prefer semantic similarities over data co-occurrences. The reason is probably that scene labels are quite specific in our datasets, while objects are too generic for their relations to be described by data co-occurrences alone.

Fig. 5. The performance of image annotation with different tradeoffs on (a) MSRC and (b) SAIAPR.

4.3. Evaluation on scene classification

Fig. 6 shows the accuracy of scene classification. Comparing the results of our CRF model with those of plain SVMs on both datasets, the accuracy improves considerably. Meanwhile, the misclassification error is limited across all classes, keeping the inter-level error of hierarchical propagation small enough to support the further analysis of region labeling.

4.4. Evaluation on region labeling

For region labeling, we compare our tensorial region representation with the RCD descriptor, vectorizing the latter by aligning its upper triangular part for performance evaluation with several vector-based methods, including SVM, the naive Bayes classifier (NBC) and random forest (RF), as available in the CLOP machine learning toolbox [34]. To evaluate the impact of joint modeling with our CRF model, we also compare against the energy-based model EBM [35]. Fig. 7 shows the region-labeling accuracy of these methods on the two datasets. We observe that our CRF model always improves the region-labeling performance of the classification methods, whereas EBM is not guaranteed to give better results (e.g. the labeling performance of RF in Fig. 7(b)). Meanwhile, among all the labeling experiments, the STM with the CRF achieves the best performance on both datasets, demonstrating the superior modeling capability of the proposed method for region labeling.

4.5. Comparison of different methods for image annotation

For two-level image annotation, we compare our method with both the above-mentioned methods for region labeling and some other flat or two-level methods. We choose a state-of-the-art flat model, TagProp [3], and the most relevant two-level method, the hierarchical boosting method based on SVMs (HB_SVM) provided in [6]. TagProp is a weighted nearest-neighbor model that integrates metric learning by directly maximizing the log-likelihood of the tag predictions. We simply use the global features to run the model and take the five top-ranked tags to measure its annotation performance. HB_SVM learns the ensemble classifiers hierarchically and can tackle the problem of huge intra-concept visual diversity for the image concepts at the higher levels of the concept ontology.


Fig. 6. The accuracy of scene classification on (a) MSRC and (b) SAIAPR.

Fig. 7. Performance of region labeling with different methods on (a) MSRC and (b) SAIAPR.

Table 3. The performance (in percent) of image annotation for different methods.

Method   | MSRC: Prec / Rec / F1   | SAIAPR: Prec / Rec / F1
SVM      | 53.77 / 62.85 / 56.44   | 37.98 / 28.65 / 32.69
NBC      | 27.21 / 76.40 / 39.82   | 21.76 / 31.86 / 25.75
RF       | 62.58 / 52.73 / 56.56   | 38.42 / 29.44 / 33.24
TagProp  | 44.65 / 62.13 / 51.47   | 36.21 / 28.30 / 30.77
HB_SVM   | 62.59 / 55.36 / 59.15   | 38.85 / 30.24 / 34.33
CRF-1    | 61.43 / 55.98 / 58.46   | 39.34 / 31.03 / 35.71
CRF-2    | 55.72 / 60.04 / 56.82   | 38.47 / 29.51 / 33.30
Ours     | 63.74 / 57.22 / 60.44   | 40.32 / 31.53 / 36.75

Here, we instantiate its concept ontology with the proposed two-level hierarchies. Besides, we evaluate two ablated variants of the proposed CRF model that remove the scene–object hierarchies ($\phi^{ST}$) and the object–object relationships ($\phi^{TT}$), respectively; for convenience, we denote these two models as CRF-1 and CRF-2. Table 3 shows the annotation performance of the different methods on the MSRC and SAIAPR datasets. The experimental results show that our method achieves the best performance in terms of F1 scores. However, our method does not perform best when measured by Rec alone. The reason is that Prec and Rec strongly depend on the number of final labeled tags, so it is not fair to use only Prec or only Rec to measure tagging performance. For example, NBC generates more tags than the others, resulting in the best recall and the worst precision. Comparing the region-level methods, which use region information, with the image-level method TagProp, which uses only global information, we find that the region-level methods other than NBC mostly perform better. This result confirms that region information is very useful for image annotation. More specifically, the overall performance of the two-level methods (i.e. HB_SVM, CRF-1, CRF-2 and ours) is usually better than the others, demonstrating the effectiveness of semantic hierarchies. Besides, we find that CRF-1 performs much better than CRF-2, showing that the object–object relationships are more informative than the scene–object hierarchies when annotating an image.

5. Conclusions and future work

Multilevel image annotation is an important yet challenging research problem. In this paper, we use a CRF to jointly model global and local visual features with semantic hierarchies, which allows annotation correspondences to be learned simultaneously in a relatively small and most relevant subspace. Extensive experimental results reveal that the combination of global and local features, as well as contextual and hierarchical labeling information, leads to improved performance compared with several state-of-the-art flat and two-level annotation models. Several directions for future work arise from this research. We are working on dynamic applications of semantic hierarchies for video annotation. Besides, we will try to construct a more effective model for hierarchical semantic learning.


References

[1] D. Zhang, M.M. Islam, G. Lu, A review on automatic image annotation techniques, Pattern Recognit. 45 (2012) 346–362.
[2] M. Wang, F. Li, M. Wang, Collaborative visual modeling for automatic image annotation via sparse model coding, Neurocomputing 95 (2012) 22–28.
[3] M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid, TagProp: discriminative metric learning in nearest neighbor models for image auto-annotation, in: International Conference on Computer Vision, 2009, pp. 309–316.
[4] A. Jaimes, S.-F. Chang, Conceptual framework for indexing visual information at multiple levels, in: SPIE Internet Imaging, vol. 3964, 2000, pp. 2–15.
[5] A. Torralba, R. Fergus, W. Freeman, 80 million tiny images: a large dataset for non-parametric object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 30 (11) (2008) 1958–1970.
[6] J. Fan, Y. Gao, H. Luo, Integrating concept ontology and multitask learning to achieve more effective classifier training for multilevel image annotation, IEEE Trans. Image Process. 17 (3) (2008) 407–426.
[7] D.M. Steinberg, O. Pizarro, S.B. Williams, Hierarchical Bayesian models for unsupervised scene understanding, Comput. Vis. Image Underst. 131 (2015) 128–144.
[8] V. Nguyen, D. Phung, X.L. Nguyen, S. Venkatesh, H.H. Bui, Bayesian nonparametric multilevel clustering with group-level contexts, in: International Conference on Machine Learning, 2014, pp. 288–296.
[9] O. Dekel, J. Keshet, Y. Singer, Large margin hierarchical classification, in: International Conference on Machine Learning, 2004, pp. 27–34.
[10] J. Wang, X. Shen, W. Pan, On large margin hierarchical classification with multiple paths, J. Am. Stat. Assoc. 104 (487) (2009) 1213–1223.
[11] J. Fan, J. Zhang, K. Mei, J. Peng, L. Gao, Cost-sensitive learning of hierarchical tree classifiers for large-scale image classification and novel category detection, Pattern Recognit. 48 (5) (2015) 1673–1687.
[12] X. Geng, K. Sun, L. Ji, Y. Zhao, A high-order statistical tensor based algorithm for anomaly detection in hyperspectral imagery, Sci. Rep. 4 (6869) (2014) 1–7.
[13] S.S. Layne, Some issues in the indexing of images, J. Am. Soc. Inf. Sci. 45 (8) (1994) 583–588.
[14] L. Hollink, G. Schreiber, B. Wielinga, M. Worring, Classification of user image descriptions, Int. J. Hum. Comput. Stud. 61 (5) (2004) 601–626.
[15] M. Srikanth, J. Varner, M. Bowden, D. Moldovan, Exploiting ontologies for automatic image annotation, in: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 552–558.
[16] C. Kurtz, C.F. Beaulieu, S. Napel, D.L. Rubin, A hierarchical knowledge-based approach for retrieving medical images described with semantic annotations, J. Biomed. Inf. 49 (2014) 227–244.
[17] W.N. Lee, N. Shah, K. Sundlass, M. Musen, Comparison of ontology-based semantic-similarity measures, in: Proceedings of the American Medical Informatics Association Annual Symposium, 2008, pp. 84–90.
[18] R. Rada, H. Mill, E. Bicknell, M. Blettner, Development and application of a metric on semantic nets, IEEE Trans. Syst. Man Cybern. 19 (1) (1989) 17–30.
[19] A. Tagarelli, Exploring dictionary-based semantic relatedness in labeled tree data, Inf. Sci. 220 (2) (2013) 44–68.
[20] H. Al-Mubaid, H.A. Nguyen, Measuring semantic similarity between biomedical concepts within multiple ontologies, IEEE Trans. Syst. Man Cybern. 39 (4) (2009) 89–98.
[21] C. Saathoff, M. Grzegorzek, S. Staab, Labelling image regions using wavelet features and spatial prototypes, Semant. Multimed. 5392 (2008) 89–104.
[22] Y. Liu, Y. Liu, K.C. Chan, Tensor-based locally maximum margin classifier for image and video classification, Comput. Vis. Image Underst. 115 (3) (2011) 300–309.
[23] O. Tuzel, F. Porikli, P. Meer, Region covariance: a fast descriptor for detection and classification, in: European Conference on Computer Vision, 2006, pp. 589–600.
[24] R. Sivalingam, D. Boley, V. Morellas, N. Papanikolopoulos, Tensor sparse coding for positive definite matrices, IEEE Trans. Pattern Anal. Mach. Intell. 36 (3) (2014) 592–605.
[25] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context, Int. J. Comput. Vis. 81 (1) (2009) 2–23.
[26] N. Zhou, W.K. Cheung, G. Qiu, X. Xue, A hybrid probabilistic model for unified collaborative and content-based image tagging, IEEE Trans. Pattern Anal. Mach. Intell. 33 (7) (2011) 1281–1294.
[27] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, Neural Comput. 13 (3) (2001) 637–649.
[28] G. Mori, Guiding model search using segmentation, in: International Conference on Computer Vision, 2005, pp. 1417–1423.
[29] Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, in: Computer Vision and Pattern Recognition, 2004, pp. 511–517.
[30] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in: International Conference on Machine Learning, 2001, pp. 282–289.
[31] F. Sha, F. Pereira, Shallow parsing with conditional random fields, in: Proceedings of HLT-NAACL, 2003, pp. 134–141.
[32] C. Sutton, K. Rohanimanesh, A. McCallum, Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data, in: International Conference on Machine Learning, 2004, pp. 282–289.
[33] H.J. Escalante, C.A. Hernandez, J.A. Gonzalez, A. Lopez-Lopez, The segmented and annotated IAPR TC-12 benchmark, Comput. Vis. Image Underst. 114 (2010) 419–428.
[34] A.R.S.A. Alamdari, I. Guyon, Quick start guide for CLOP, ⟨http://ymer.org/research/files/clop⟩.
[35] H.J. Escalante, M. Montes-y-Gómez, L.E. Sucar, An energy-based model for region labeling, Comput. Vis. Image Underst. 115 (6) (2011) 787–803.

Zhiming Qian received his M.S. degree in information and communication engineering from the National University of Defense Technology, Changsha, China, in 2010. He is currently a doctoral student at the National University of Defense Technology. His research interests include image processing, computer vision, and pattern recognition.

Ping Zhong (M'09) received the M.S. degree in applied mathematics and the Ph.D. degree in information and communication engineering from the National University of Defense Technology, Changsha, China, in 2003 and 2008, respectively. He is currently an assistant professor at the National University of Defense Technology. His research interests include image processing, computer vision, and pattern recognition.

Jia Chen is currently a Master's student at the National University of Defense Technology, Changsha, China. His research interests include image and video processing, computer vision, and pattern recognition.