J. Vis. Commun. Image R. 24 (2013) 895–910
Cross-modal social image clustering and tag cleansing

Jinye Peng (a), Yi Shen (b), Jianping Fan (a,b,*)

(a) School of Information Science and Technology, Northwest University, Xi'an 710069, China
(b) Department of Computer Science, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
Article history: Received 28 December 2011; Accepted 2 June 2013; Available online 17 June 2013.

Keywords: Cross-modal image clustering; Social tag cleansing; Weakly-tagged social images; K-way min–max cut; Mixture-of-kernels; Spectral clustering; Spam tag detection; Image similarity measurement
Abstract

In this paper, a cross-modal approach is developed for social image clustering and tag cleansing. First, a semantic image clustering algorithm is developed for assigning large-scale weakly-tagged social images into a large number of image topics of interest. Spam tags are detected automatically via sentiment analysis and multiple synonymous tags are merged as one super-topic according to their inter-topic semantic similarity contexts. Second, multiple base kernels are seamlessly combined by maximizing the correlations between the visual similarity contexts and the semantic similarity context, which can achieve more precise characterization of cross-modal (semantic and visual) similarity contexts among weakly-tagged social images. Finally, a K-way min–max cut algorithm is developed for social image clustering by minimizing the cumulative inter-cluster cross-modal similarity contexts while maximizing the cumulative intra-cluster cross-modal similarity contexts. The optimal weights for base kernel combination are simultaneously determined by minimizing the cumulative within-cluster variances. The polysemous tags and their ambiguous images are further split into multiple sub-topics for reducing their within-topic visual diversity. Our experiments on large-scale weakly-tagged Flickr images have provided very positive results. © 2013 Elsevier Inc. All rights reserved.
* Corresponding author at: Department of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA. Fax: +1 704 687 3516. E-mail address: [email protected] (J. Fan).

1047-3203/$ - see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jvcir.2013.06.004

1. Introduction

Collaborative image tagging systems, such as Flickr.com, have now become very popular for tagging large-scale social images by relying on the collaborative efforts of a large population of Internet users [1,2]. In a collaborative image tagging system, people may tag social images according to their social or cultural backgrounds, personal expertise and perception. We call such collaboratively-tagged social images weakly-tagged social images, because their social tags may not have exact correspondences with the underlying image semantics. With the exponential growth of weakly-tagged social images, it has become increasingly attractive to develop new algorithms for achieving more effective organization and summarization of large-scale social images.

Image clustering, which can assign large amounts of images into different clusters with some common semantics or visual properties [2,33–35,41–45], is very attractive for achieving more effective organization, summarization and visualization of large-scale image collections. Clustering the search results into multiple semantic groups is helpful for users to assess the relevance between large amounts of returned images (which are returned by the same query terms) and their query intentions [33–35,44]. Image clustering has also been used to achieve more precise organization and summarization of large-scale Flickr images according to their visual similarities [2]. Most existing algorithms for image clustering focus on only the low-level visual features [44], so their effectiveness is doubtful because of the problem of the semantic gap [55–58]. Some researchers have recently integrated visual and textual features for Web image clustering, and higher accuracy rates have been reported [33–35,41–43].

In a collaborative image tagging space, there are two inter-related information sources that can be used to support image clustering: (1) the visual properties of weakly-tagged social images; and (2) their social tags. It is worth noting that the visual properties of weakly-tagged social images and their social tags can offer complementary strengths, thus it is very attractive to develop new frameworks that are able to integrate the visual features with the social tags for achieving more precise clustering of large-scale weakly-tagged social images. Unfortunately, this is not a trivial task because of the following issues:

(a) Social Tag Ambiguity: Different people may use different social tags (i.e., text terms) which have the same or close meaning (synonyms) to tag their social images [5,26–31]. The weakly-tagged social images which belong to a set of synonymous tags may share some common visual properties and semantics. The appearance of synonymous tags may prevent most existing clustering algorithms from deriving more representative image clusters. On the other hand, collaborative image tagging is an
ambiguous process [22–25]. Different people may apply the same social tag in different ways (i.e., a polysemous tag may have different meanings under different contexts), which may result in large amounts of ambiguous images with diverse visual properties. Because the effectiveness of most existing clustering algorithms largely depends on the accuracy of the underlying functions for data similarity characterization, the appearance of polysemous tags and their ambiguous images may bring new challenges for social image clustering, e.g., it is hard to design suitable similarity functions for characterizing the diverse image similarity contexts accurately. In a collaborative image tagging space, dishonest users may also use spam tags to tag their social images, so that they can drive traffic to their social images for fun or profit [37–39]. The appearance of spam tags may result in large amounts of junk images, which may mislead most existing clustering algorithms to derive less representative clusters from large-scale weakly-tagged social images.

(b) Visual Ambiguity and Semantic Gap: Multiple types of visual features are usually extracted to achieve more sufficient characterization of the various visual properties of the images; thus the distributions of the images can be very sparse, and the visual similarity contexts among the images can be very diverse in the high-dimensional feature space (i.e., visual ambiguity). As a result, it is very hard to use one single type of base kernel, such as an RBF kernel [7–12], to achieve precise characterization of the diverse visual similarity contexts among the images. In addition, there may be large amounts of outliers in the high-dimensional feature space, and most existing clustering algorithms may seriously suffer from the problem of skewed cuts.
Another challenging problem for image clustering is the semantic gap [55–58] between the low-level visual features and the image semantics, e.g., it is very hard to achieve semantic clustering of large-scale weakly-tagged social images by using only the low-level visual features. It is worth noting that the visual properties of weakly-tagged social images and their social tags can offer complementary strengths, thus they can be integrated to achieve more precise clustering of large-scale weakly-tagged social images. Because the visual features and the social tags belong to different spaces, it is unsatisfactory to combine them directly for social image clustering. As shown in Fig. 1, a cross-modal approach is developed in this paper to achieve social image clustering and tag cleansing: (a) a semantic image clustering algorithm is developed for extracting the image topics of interest from large amounts of social tags; (b) spam tags are identified automatically via sentiment analysis, and multiple synonymous tags are merged as one super-topic according to their inter-topic semantic similarity contexts; (c) a mixture-of-kernels algorithm is developed to achieve more accurate characterization of the cross-modal similarity contexts among the weakly-tagged social images; (d) a K-way min–max cut algorithm is extended for supporting cross-modal social image clustering and tag cleansing, where the polysemous tags and their ambiguous images are split into multiple sub-topics for reducing their intra-topic visual diversity; (e) a topic network is constructed to achieve more effective organization and summarization of large-scale weakly-tagged social images at the semantic level. The rest of this paper is organized as follows.
In Section 2, a brief review of some relevant work is presented. In Section 3, a semantic image clustering algorithm is introduced to assign large-scale social images into a large number of image topics of interest. In Section 4, a mixture-of-kernels algorithm is developed for achieving more precise characterization of the diverse cross-modal image similarity contexts among the social images. In Section 5, a K-way min–max cut algorithm is presented for achieving cross-modal social image clustering. In Section 6, a topic network is constructed to enable semantic summarization and organization of large-scale weakly-tagged social images at the semantic level. Our experimental results on algorithm evaluation are given in Section 7, and we conclude this paper in Section 8.

Fig. 1. The flowchart of our algorithm for social image clustering and tag cleansing.

2. Related work

Clustering, which is one of the fundamental problems in machine learning and data mining, has received a significant amount of attention in the last three decades [19]. Spectral clustering has recently become very popular because it is more effective in finding representative clusters [13–18], and one popular objective function (which is used in most spectral clustering approaches) is to minimize the normalized cuts [13] (i.e., minimizing the interconnections among the subgraphs). Spectral clustering algorithms can relax the problem of minimizing the normalized cuts into a tractable eigenvalue problem, thus they are simple to implement and can be solved efficiently by standard linear algebra software [13–18,40]. Based on these observations, it is very attractive to extend the spectral clustering algorithms for social image clustering, but this is not an easy task: (a) the problems of visual ambiguity and semantic gap may bring new challenges in designing suitable similarity functions to achieve precise characterization of the diverse cross-modal similarity contexts among the social images; (b) there may be large amounts of outliers in the high-dimensional feature space, which may result in large amounts of skewed cuts [15–18]; (c) when the spectral clustering approaches are directly performed over large-scale weakly-tagged social images, they may seriously suffer from the problem of huge memory cost (i.e., the similarity matrix could be too huge to be handled by a single computer). To achieve more effective social image clustering, other alternative information sources should be exploited rather than using only the low-level visual features [32–36,41–44].
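For reference, the spectral relaxation mentioned above can be sketched in its generic normalized-cut form. This is the standard textbook recipe (normalized Laplacian plus smallest eigenvectors), not the paper's specific K-way min–max cut formulation:

```python
import numpy as np

def spectral_embedding(W, k):
    """Classical spectral relaxation of a graph-cut objective: compute
    the k eigenvectors of the normalized Laplacian with the smallest
    eigenvalues and use their rows as a low-dimensional embedding,
    which is then clustered (e.g., by k-means)."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)               # ascending eigenvalues
    return eigvecs[:, :k]                              # one embedding row per node
```

For two well-separated groups of images, rows of the embedding belonging to the same group coincide, so any standard clustering of the rows recovers the partition.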
Based on this observation, some researchers from Microsoft Research Asia (MSRA) have integrated the low-level visual features of the images, the text terms of the associated text documents, and the linkage information of the relevant Web pages for image clustering [33–35]. Instead of treating the associated text terms as the single information source for Web image indexing and retrieval, they have incorporated cross-modal information sources to explore the mutual reinforcement between the Web images and their associated text terms. Each of these information sources gives a different aspect of Web images, and there are some inherent correlations between them. A tripartite graph is generated to model the inherent correlations among the low-level visual features, the images and their associated text terms, thus automatic image clustering is achieved by a tripartite graph partition process. When reliable text annotations of the images are available, Barnard et al. [41] have developed a generative hierarchical model for image clustering. Loeff et al. [42] have also integrated cross-modal features for image clustering and sense estimation. It is worth noting that our scenarios for social image clustering are significantly different: the appearances of spam tags, polysemous tags and synonymous tags may seriously
influence the effectiveness of the clustering algorithms in finding representative clusters. Thus it is very important to develop new algorithms for tackling the issue of social tag ambiguity, so that we can integrate both the visual properties of weakly-tagged social images and their social tags for social image clustering. To tackle the issue of tag ambiguity, tag clustering has been used to exploit the inter-tag semantic similarity contexts to deal with the synonymous tags, where the WordNet-based tag correlations or the co-occurrence probabilities of the social tags are used to characterize their semantic similarities [2,26–31]. Word sense disambiguation is one potential solution for addressing the issue of polysemous tags [22], but the social tags are given individually in a collaborative image tagging space, and it is not easy to use traditional ways to determine the significant contexts among the neighboring social tags. Thus performing word sense disambiguation is more difficult in a collaborative image tagging space [24]. Some researchers have recently integrated both the visual properties of the images and their associated text terms for word sense disambiguation [23–25]. To tackle the issue of junk images for Web image search, some pioneering work has been done recently [32–36]. All these existing studies have addressed different aspects of tag uncertainty, but there is no well-accepted framework for integrating the visual features of weakly-tagged social images with their social tags for social image clustering and tag cleansing.
3. Semantic image clustering

As shown in Fig. 2, each image in a collaborative tagging system is associated with the image holder's tags of the image semantics and other users' tags or comments. Because multiple social tags are given individually in a collaborative image tagging space, entity extraction can be done more effectively. In this paper, a semantic image clustering algorithm is developed for: (a) automatically extracting the social tags for image topic interpretation; and (b) assigning large-scale weakly-tagged social images into a large number of image topics of interest. In a collaborative image tagging space, there are clear and fixed patterns for social tags, thus a context-free syntax parser is designed to detect and mark the phrases of interest (entity extraction) with high accuracy. Because English has a relatively strict grammar, an open-source text analysis package, LingPipe [50], is used to detect the phrases of interest (entities) automatically, where all the parameters are set to their default values. The phrases of
interest are further partitioned into two categories: noun phrases versus verb phrases. In this paper, a lexical-lookup algorithm is performed to separate the noun phrases from the verb phrases [49], where we maintain hand-crafted lexicons which contain a list of popular noun phrases for interpreting the entities of interest (i.e., 8000 phrases for interpreting the most popular real-world object classes, image concepts (scenes) and events). The popular noun phrases are then extracted by looking up the noun phrases in the lists of social tags through matching with the text terms specified in the hand-crafted lexicons. The noun phrases are further partitioned into two categories automatically: topic-relevant tags (i.e., social tags that are used to interpret the image topics of interest) and topic-irrelevant tags, where a similar lexical-lookup algorithm is performed to separate the topic-relevant tags from the topic-irrelevant tags. The hand-crafted lexicons, which contain a list of noun phrases for interpreting the most popular real-world object classes and image concepts (scenes) in large-scale social images, are used for separating the topic-relevant tags from the topic-irrelevant tags. Following the same process, the verb phrases are further partitioned into two categories automatically: event-relevant tags (i.e., social tags that are used to interpret the image events of interest) and event-irrelevant tags. For a given topic-relevant tag or event-relevant tag, its occurrence frequency is counted automatically according to the number of the relevant weakly-tagged social images which are tagged by the corresponding topic-relevant tag or event-relevant tag. Two social tags (topic-relevant tags or event-relevant tags) which are used for tagging the same social image are considered to co-occur once, without considering their order. A co-occurrence matrix is obtained by counting the frequencies of such pairwise co-occurrences of social tags.
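The lexical-lookup partitioning and the pairwise co-occurrence counting described above can be sketched as follows. The tiny lexicon and tag lists are hypothetical stand-ins for the paper's hand-crafted 8000-phrase lexicons:

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-lexicon of topic-relevant noun phrases; the paper's
# hand-crafted lexicons contain about 8000 such phrases.
TOPIC_LEXICON = {"sunset", "beach", "golden gate bridge", "dog"}

def partition_tags(tags):
    """Split a tag list into topic-relevant and topic-irrelevant tags
    by lexical lookup, mirroring the lexicon-matching step."""
    relevant = [t for t in tags if t.lower() in TOPIC_LEXICON]
    irrelevant = [t for t in tags if t.lower() not in TOPIC_LEXICON]
    return relevant, irrelevant

def cooccurrence_counts(images):
    """Count unordered pairwise tag co-occurrences over a collection of
    images (each given as a list of tags): every unordered pair of
    distinct tags on the same image co-occurs once."""
    counts = Counter()
    for tags in images:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts
```

The co-occurrence counts correspond to the entries of the co-occurrence matrix; the occurrence frequency of a single tag is simply the number of images whose tag list contains it.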
Fig. 2. Multiple information sources for Flickr images: social tags, image and comments.

The topic-relevant tags and the event-relevant tags are further partitioned into two categories according to their interestingness scores: interesting tags and uninteresting tags. In this work, multiple information sources have been exploited for determining the interesting tags more accurately. For a given social tag C, its interestingness score ω(C) depends on two issues: (1) its occurrence frequency τ(C) (e.g., a higher occurrence frequency corresponds to a higher interestingness score); and (2) its co-occurrence frequency ϑ(C) with any other social tag in the vocabulary (e.g., a higher co-occurrence frequency corresponds to a higher interestingness score). The occurrence frequency τ(C) for a given social tag C is equal to the number of social images that are tagged by the given social tag C. The co-occurrence frequency ϑ(C) for the given social tag C is equal to the number of social images that are tagged jointly by the given social tag C and any other social tag in the vocabulary. The interestingness score ω(C) for a given social tag C is defined as:
\[
\omega(C) = \xi \log\!\left(\tau(C) + \sqrt{\tau^2(C) + 1}\right) + \zeta \log\!\left(\vartheta(C) + \sqrt{\vartheta^2(C) + 1}\right) \tag{1}
\]

where ξ + ζ = 1; the first part characterizes the interestingness score of the given social tag C gained from its occurrence frequency τ(C) in large-scale weakly-tagged social images, the second part characterizes the interestingness score gained from its co-occurrence frequency ϑ(C) with any other social tag in the vocabulary, and ξ and ζ are the relative importance factors. The interesting social tags, which have larger values of ω(·), are treated as image topics of interest.

Even though people may tag social images from their own perspectives, they usually tag the social images according to the underlying image semantics, because their motivation to tag the images comes from social incentives (i.e., making themselves known by contributing to the tagging task) [48]. Thus the images which are tagged by the same social tag may have strong correlations in their semantics and visual properties. By mapping the social images and their social tags onto a conceptual space (i.e., a large number of image topics of interest), our semantic image clustering algorithm can assign large-scale weakly-tagged social images to a large number of image topics of interest automatically. As a result, the set of weakly-tagged social images for each image topic is small enough to be handled effectively by a single computer (i.e., the size of the visual similarity matrix is small enough to be handled by a single computer); thus performing semantic image clustering (i.e., assigning large-scale weakly-tagged social images into a large number of image topics of interest) can significantly reduce the computational complexity of cross-modal image clustering.

3.1. Spam tag detection

Collaborative image tagging systems encourage people to tag the social images that have already been tagged by others and allow people to comment on others' images and social tags, as shown in Fig. 2. The comment documents may provide a good information source for detecting spam tags and filtering out junk images. For each weakly-tagged social image under a given image topic of interest, the opinion text terms are first extracted from its comment document (i.e., the text terms which are used to express users' opinions on the social image and its social tags) [46,47]. In order to analyze the sentiments of these opinion text terms, a public corpus is used to identify: (a) a set of positive words U (such as good, cool, superior, positive, etc.); (b) a set of negative words W (such as bad, boring, inferior, negative, etc.); and (c) the co-occurrence probabilities between the opinion text terms (which are extracted from the given public corpus and also appear in the comment documents for social images) and the positive and negative words in the given public corpus. For a given opinion text term O, its subjectivity and orientation are calculated from its cumulative pairwise mutual information with the identified positive words U and negative words W in the given public corpus [45–47]. The cumulative pairwise mutual information is defined as:

\[
I(O) = I(O, U) - I(O, W) \tag{2}
\]

where I(O, U) is the cumulative strength of the association between the given opinion text term O and all the identified positive words, and I(O, W) is the cumulative strength of the association between the given opinion text term O and all the identified negative words:

\[
I(O, U) = \sum_{i=1}^{N} \log \frac{P(O \cap W_i)}{P(O)\,P(W_i)} \tag{3}
\]

\[
I(O, W) = \sum_{i=1}^{N} \log \frac{P(O \cap N_i)}{P(O)\,P(N_i)} \tag{4}
\]

where P(O) is the occurrence probability of the given opinion text term O in the corpus, P(N_i) is the occurrence probability of the ith negative word N_i, P(W_i) is the occurrence probability of the ith positive word W_i, and P(O ∩ W_i) and P(O ∩ N_i) are the co-occurrence probabilities of the given opinion text term O with the ith positive word W_i and the ith negative word N_i, respectively.

For each weakly-tagged social image under the given image topic, the cumulative pairwise mutual information I(·) for all its opinion text terms is treated as the sentimental features to determine its reliability score. The weakly-tagged social images for the given image topic are further partitioned into two clusters according to their reliability scores: a positive cluster versus a negative cluster. The weakly-tagged social images which belong to the negative cluster with low reliability scores are treated as the junk images for the given image topic of interest and are filtered out automatically. By detecting the spam tags and filtering out the junk images automatically, our algorithm can effectively cleanse large-scale weakly-tagged social images and their social tags.

3.2. Social tag cleansing

Some social tags for image topic interpretation may be synonymous (i.e., multiple social tags share the same meaning). The appearance of synonymous tags may prevent most existing clustering algorithms from deriving comprehensive knowledge (i.e., finding representative image clusters) from large-scale weakly-tagged social images. To address the issue of synonymous tags, WordNet [5] is used to identify the candidates of synonymous tags from a large number of image topics of interest in the vocabulary. In addition, the semantic similarity contexts between these candidates of synonymous tags are calculated by using two information sources: (a) WordNet-based semantic correlations [5]; and (b) their co-occurrence probabilities in large-scale weakly-tagged social images [2,26–31].
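As a rough illustration, these two information sources might be fused into a single inter-topic similarity score. The cosine-style co-occurrence normalization and the equal fusion weight `alpha` below are assumptions, since the paper does not give the exact combination rule:

```python
import math

def tag_semantic_similarity(wordnet_sim, cooccur, freq1, freq2, alpha=0.5):
    """Fuse the two similarity sources used for synonym detection: a
    WordNet-based correlation (assumed in [0, 1], supplied by some
    WordNet measure) and a co-occurrence count, normalized into [0, 1]
    by the geometric mean of the two tags' occurrence frequencies."""
    co_sim = cooccur / math.sqrt(freq1 * freq2) if freq1 and freq2 else 0.0
    return alpha * wordnet_sim + (1.0 - alpha) * co_sim
```

Candidate pairs whose fused score exceeds a threshold would then be grouped together before merging.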
The candidates of synonymous tags, which have large values of the inter-topic semantic similarity contexts, are clustered into the same cluster. The synonymous tags in the same cluster are merged as one super-topic, and their social images are assigned to the super-topic automatically, so that our image clustering algorithm can derive more comprehensive knowledge from large-scale weakly-tagged social images.

To evaluate the performance of our algorithms for spam tag detection and social tag cleansing, we have designed an interactive system for searching and exploring large-scale weakly-tagged social images [2]. Obviously, other image visualization techniques can also be used here for supporting interactive algorithm evaluation [51–54]. The benchmark metrics for algorithm evaluation include the precision ρ and recall ϱ for image retrieval. They are defined as:

\[
\rho = \frac{\vartheta}{\vartheta + \eta}, \qquad \varrho = \frac{\vartheta}{\vartheta + \mu} \tag{5}
\]

where ϑ is the number of social images that are relevant to the given image topic of interest and are returned correctly, η is the number of social images that are irrelevant to the given image topic of interest but are returned incorrectly, and μ is the number of social images that are relevant to the given image topic of interest but are not returned.
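Eq. (5) amounts to the standard precision/recall computation; a minimal sketch:

```python
def precision_recall(n_correct, n_false_pos, n_missed):
    """Precision and recall as in Eq. (5): precision = correct /
    (correct + false positives); recall = correct / (correct + missed)."""
    precision = n_correct / (n_correct + n_false_pos)
    recall = n_correct / (n_correct + n_missed)
    return precision, recall
```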
Fig. 3. The comparison of the precision rates before and after performing spam tag detection.
The precision rate is used to characterize the accuracy of our system in finding particular social images of interest, thus it can be used to assess the effectiveness of our spam tag detection algorithm. As shown in Fig. 3, one can observe that our spam tag detection algorithm can filter out the junk images effectively, which can further result in higher precision rates for social image retrieval. On the other hand, the recall rate is used to characterize the efficiency of our system in finding particular social images of interest, thus it can be used to assess the effectiveness of our social tag cleansing algorithm. As shown in Fig. 4, one can observe that our social tag cleansing algorithm can effectively combine the synonymous tags and their similar social images, which can further result in higher recall rates for social image retrieval.

4. Cross-modal similarity characterization for social images

As shown in Fig. 5, four grid resolutions are used for image partition and feature extraction [3]. As shown in Fig. 6, three types of visual features are extracted for characterizing various visual properties of weakly-tagged social images: (a) grid-based color histograms; (b) Gabor texture features; and (c) SIFT features. For the color features, one color histogram is extracted for each image grid, thus there are $\sum_{r=0}^{3} 2^r \times 2^r = 85$ grid-based color histograms. Each grid-based color histogram consists of 36 RGB bins to represent the color distribution in the corresponding image grid. To extract the Gabor texture features, a Gabor filter bank, which contains twelve 21 × 21 Gabor filters in 3 scales and 4 orientations,
is used. The Gabor filters are generated by using a Gabor function class. To apply the Gabor filters to an image, we need to calculate the convolutions of the filters with the image. We transform both the filters and the image into the frequency domain, take the products, and then transform them back into the spatial domain; this process computes the Gabor-filtered images more efficiently. Finally, the mean values and standard deviations are calculated from the 12 filtered images, making up a 24-dimensional Gabor texture feature vector. The SURF algorithm is used to reduce the cost of SIFT feature extraction [4]. For each social image, a number of interest points are detected and their corresponding 64-dimensional features are extracted.

Fig. 4. The comparison of the recall rates before and after merging the synonymous tags.

Fig. 5. Image partition for feature extraction.

Fig. 6. Visual feature extraction: (a) original images; (b) RGB color histograms; (c) wavelet transformation; (d) interesting points and SIFT features.

By using high-dimensional multi-modal visual features (grid-based color histograms, Gabor wavelet textures, SIFT features) for image content representation, we are able to characterize the diverse visual properties of the weakly-tagged social images more sufficiently. Because different feature subsets are used to characterize different types of visual properties of the weakly-tagged social images, the statistical properties of the weakly-tagged social images could be heterogeneous in the high-dimensional multi-modal feature space. Thus using only one single type of base kernel may not be able to achieve accurate characterization of the diverse visual similarity contexts among the weakly-tagged social images [6–12,20].

Because each type of visual features (each feature subset) is used to characterize one certain type of visual properties of the weakly-tagged social images, the visual similarity contexts among the weakly-tagged social images within each feature subset are more homogeneous and can be approximated more precisely by using one particular type of base kernel. Thus one specific base kernel is constructed for each type of visual features (i.e., one certain feature subset). For two weakly-tagged social images u and v (which belong to the same image topic C), their color similarity relationship can be defined as [20]:
\[
\kappa_c(u, v) = \frac{1}{2^R}\, D_0(u, v) + \sum_{r=0}^{R-1} \frac{1}{2^{R-r+1}}\, D_r(u, v) \tag{6}
\]
where R = 4 is the total number of grid resolutions for image partition, D_0(u, v) is the color similarity relationship between two weakly-tagged social images u and v according to their full-resolution (image-based) color histograms, and D_r(u, v) is the color similarity relationship between u and v according to their grid-based color histograms at the rth resolution:
\[
D_r(u, v) = \sum_{i=1}^{36} D\!\left(H_i^r(u), H_i^r(v)\right) \tag{7}
\]
where H_i^r(u) and H_i^r(v) are the ith components of the grid-based color histograms for the weakly-tagged social images u and v at the rth image partition resolution. For two weakly-tagged social images u and v (which belong to the same image topic C), their local similarity relationship can be defined as [5,20]:
\[
\kappa_s(u, v) = e^{-d_s(u, v)/\sigma_s} \tag{8}
\]

\[
d_s(u, v) = \frac{\sum_i \sum_j x_i(u)\, x_j(v)\, \mathrm{ED}\!\left(s_i(u), s_j(v)\right)}{\sum_i \sum_j x_i(u)\, x_j(v)} \tag{9}
\]
where σ_s is the mean value of d_s(u, v) over our test images, x_i and x_j are the Hessian values of the ith and jth interest points for the weakly-tagged social images u and v (i.e., the importance of the ith and jth interest points), and ED(s_i(u), s_j(v)) is the Euclidean distance between two SIFT descriptors. For two weakly-tagged social images u and v (which belong to the same image topic C), their textural similarity relationship can be defined as:
\[
\kappa_t(u, v) = e^{-d_t(u, v)/\sigma_t}, \qquad d_t(u, v) = \mathrm{ED}\!\left(g_i(u), g_j(v)\right) \tag{10}
\]
where σ_t is the mean value of d_t(u, v) over our test images, and ED(g_i(u), g_j(v)) is the Euclidean distance between two Gabor textural descriptors. For a given image topic C, all its weakly-tagged social images are simultaneously associated with many other social tags rather than only the social tag for interpreting the given image topic C. For two weakly-tagged social images u and v (which belong to the same image topic C), the semantic similarity relationship between their remaining social tags (other than the image topic C) can be defined as [6]:
\[
\kappa_c(u, v) = e^{-d_c(u, v)/\sigma_c} \tag{11}
\]
where σ_c is the mean value of d_c(u, v) over our test images, and d_c(u, v) is the semantic correlation between two social tags, which is determined by using their WordNet-based similarity and co-occurrence probability [2,5,26–31]. It is worth noting that, because the visual similarity contexts and the semantic similarity context belong to different spaces, they cannot be combined directly for social image clustering; thus the kernel canonical correlation analysis (KCCA) [21] algorithm is
performed to determine the optimal projection directions by maximizing the correlations between the visual similarity contexts and the semantic similarity context. After the visual similarity contexts and the semantic similarity context are projected onto their most correlated space (i.e., the best projection directions), the diverse cross-modal similarity context between two weakly-tagged social images $u$ and $v$ for a given image topic $C$ can be characterized more precisely by using a mixture of these four base image kernels (i.e., mixture-of-kernels) [7–12] on their optimal projection directions.
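As a concrete illustration of the base-kernel constructions in Eqs. (8)–(11), the sketch below turns a precomputed pairwise distance matrix into an RBF-style kernel $\kappa = e^{-d/\sigma}$ with $\sigma$ set to the mean distance, and computes the Hessian-weighted SIFT distance of Eq. (9). The function names and array layouts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf_kernel_from_distances(D):
    """Map a pairwise distance matrix to a similarity kernel
    kappa(u, v) = exp(-d(u, v) / sigma), where sigma is the mean
    pairwise distance over the test images (cf. Eqs. (8), (10), (11))."""
    sigma = D.mean()
    return np.exp(-D / sigma)

def sift_distance(weights_u, desc_u, weights_v, desc_v):
    """Hessian-weighted average Euclidean distance between two sets of
    SIFT descriptors (cf. Eq. (9)): weights_* hold the interest-point
    importances omega, desc_* are (n_points, 128) descriptor arrays."""
    # all pairwise descriptor distances ED(s_i(u), s_j(v))
    ed = np.linalg.norm(desc_u[:, None, :] - desc_v[None, :, :], axis=2)
    w = np.outer(weights_u, weights_v)   # omega_i(u) * omega_j(v)
    return float((w * ed).sum() / w.sum())
```

The same `rbf_kernel_from_distances` helper can serve the color, textural, and semantic channels once their respective distance matrices are available.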
$$\kappa(u,v) = \sum_{i=1}^{4} \beta_i\,\kappa_i(u,v), \qquad \sum_{i=1}^{4} \beta_i = 1 \qquad (12)$$
where $\beta_i \ge 0$ is the importance factor for the $i$th base kernel $\kappa_i(u,v)$. Obviously, combining multiple base kernels allows us to achieve a more precise characterization of the diverse cross-modal similarity contexts among the weakly-tagged social images, which in turn yields higher accuracy rates for social image clustering and tag cleansing. In supervised learning scenarios, a multi-kernel learning algorithm can be used to estimate the optimal weights for kernel combination [12]. Estimating the kernel weights in unsupervised learning (clustering) scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information.

5. Cross-modal social image clustering and tag cleansing

To achieve more effective social image clustering and automatic kernel weight determination, a K-way min–max cut algorithm is developed, where the cumulative inter-cluster cross-modal similarity contexts are minimized while the cumulative intra-cluster cross-modal similarity contexts (the summation of the pairwise image similarity contexts among the social images within the same cluster) are maximized. Our K-way min–max cut algorithm takes the following steps iteratively for social image clustering and kernel weight determination:

(a) For a given image topic $C$, a graph is first constructed to organize all its weakly-tagged social images according to their cross-modal similarity contexts, where each node on the graph is one weakly-tagged social image for the given image topic $C$, and an edge between two nodes characterizes the cross-modal similarity context between two weakly-tagged social images, $\kappa(\cdot,\cdot)$.

(b) All these weakly-tagged social images for the given image topic $C$ are partitioned into $K$ clusters automatically by minimizing the following objective function:

$$\min \Psi(C, K, \hat{\beta}) = \min \left\{ \sum_{i=1}^{K} \frac{s(G_i, G/G_i)}{s(G_i, G_i)} \right\} \qquad (13)$$

where $G = \{G_i \mid i = 1, \ldots, K\}$ represents the $K$ image clusters for the given image topic $C$; $G/G_i$ represents the other $K-1$ image clusters in the set $G$ except $G_i$; $K$ is the total number of image clusters; and $\hat{\beta}$ is the set of optimal kernel weights. The cumulative inter-cluster cross-modal similarity context $s(G_i, G/G_i)$ is defined as:

$$s(G_i, G/G_i) = \sum_{u \in G_i} \sum_{v \in G/G_i} \kappa(u,v) \qquad (14)$$

The cumulative intra-cluster cross-modal similarity context $s(G_i, G_i)$ is defined as:

$$s(G_i, G_i) = \sum_{u \in G_i} \sum_{v \in G_i} \kappa(u,v) \qquad (15)$$

We further define $X = [X_1, \ldots, X_l, \ldots, X_K]$ as the cluster indicators, where the component $X_l$ is a binary indicator for the appearance of the $l$th cluster $G_l$:

$$X_l(u) = \begin{cases} 1, & u \in G_l \\ 0, & \text{otherwise} \end{cases} \qquad (16)$$

$W$ is defined as an $n \times n$ symmetric matrix ($n$ is the total number of weakly-tagged social images for the given image topic $C$), with components $W_{u,v} = \kappa(u,v)$. $D$ is defined as an $n \times n$ diagonal matrix, with diagonal components:

$$D_{u,u} = \sum_{v=1}^{n} W_{u,v} \qquad (17)$$

For a given image topic $C$, an optimal partition of its weakly-tagged social images (i.e., image clustering) is achieved by:

$$\min \Psi(C, K, \hat{\beta}) = \min \left\{ \sum_{l=1}^{K} \frac{X_l^T (D - W) X_l}{X_l^T W X_l} \right\} \qquad (18)$$

Let $\widetilde{W} = D^{-1/2} W D^{-1/2}$ and $\widetilde{X}_l = D^{1/2} X_l / \|D^{1/2} X_l\|$; the objective function for our K-way min–max cut algorithm can further be refined as:

$$\min \Psi(C, K, \hat{\beta}) = \min \left\{ \sum_{l=1}^{K} \frac{X_l^T D X_l}{X_l^T W X_l} - K \right\} = \min \left\{ \sum_{l=1}^{K} \frac{1}{\widetilde{X}_l^T \widetilde{W} \widetilde{X}_l} - K \right\} \qquad (19)$$

subject to:

$$\widetilde{X}_l^T \widetilde{X}_l = I, \qquad \widetilde{X}_l^T \widetilde{W} \widetilde{X}_l > 0, \qquad l \in [1, \ldots, K]$$

Thus the optimal solution for Eq. (13) is finally achieved by solving the eigenvalue equations:

$$\widetilde{W} \widetilde{X}_l = \lambda_l \widetilde{X}_l, \qquad l \in [1, \ldots, K] \qquad (20)$$

(c) The objective function for kernel weight determination is to minimize the cumulative intra-cluster similarity variance. For a certain cluster $G_l$, we can rewrite its cumulative intra-cluster pairwise image similarity context $s(G_l, G_l)$ as $\Psi(G_l)$:

$$\Psi(G_l) = \sum_{u \in G_l} \sum_{v \in G_l} \kappa(u,v) = \sum_{i=1}^{4} \beta_i \sum_{u \in G_l} \sum_{v \in G_l} \kappa_i(u,v) \qquad (21)$$
where $\kappa_i(\cdot,\cdot)$ is the $i$th base kernel for cross-modal image similarity characterization. Assuming that the $l$th subgraph (i.e., the $l$th image cluster) contains $n_l$ weakly-tagged social images, the mean $\mu(G_l)$ of the weakly-tagged social images in the $l$th cluster $G_l$ can be defined as:

$$\mu(G_l) = \frac{\Psi(G_l)}{n_l^2} \qquad (22)$$

For the $l$th cluster $G_l$, its mean $\mu_i(G_l)$ on the $i$th feature subset (i.e., the $i$th base kernel) can be defined as:

$$\mu_i(G_l) = \frac{1}{n_l^2} \sum_{u \in G_l} \sum_{v \in G_l} \kappa_i(u,v) \qquad (23)$$
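The cluster statistics of Eqs. (21)–(23) can be sketched as below for one cluster $G_l$; the function signature and data layout are assumptions made for illustration only.

```python
import numpy as np

def cluster_kernel_stats(base_kernels, beta, members):
    """Cumulative intra-cluster similarity Psi(G_l) (Eq. (21)) and the
    per-kernel cluster means mu_i(G_l) (Eq. (23)) for one cluster G_l.
    base_kernels: list of (n, n) base kernel matrices kappa_i;
    beta: kernel weights; members: indices of the images in G_l."""
    idx = np.ix_(members, members)
    n_l = len(members)
    mu_i = np.array([Kb[idx].sum() / n_l**2 for Kb in base_kernels])
    mu = float(np.dot(beta, mu_i))   # mu(G_l) = sum_i beta_i * mu_i(G_l)
    psi = mu * n_l**2                # Psi(G_l) = mu(G_l) * n_l^2 (Eq. (22))
    return psi, mu, mu_i
```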
Thus the covariance $\sigma(G_l)$ of the weakly-tagged social images in the $l$th cluster $G_l$ and its value $\sigma_i(G_l)$ on the $i$th feature subset (i.e., the $i$th base kernel) can be defined as:

$$\sigma(G_l) = \|\Psi(G_l) - \mu(G_l)\|, \qquad \sigma_i(G_l) = \sum_{u \in G_l} \sum_{v \in G_l} \|\kappa_i(u,v) - \mu_i(G_l)\| \qquad (24)$$

Obviously, the following relationships hold among $\mu(G_l)$, $\mu_i(G_l)$, $\sigma(G_l)$ and $\sigma_i(G_l)$:

$$\mu(G_l) = \sum_{i=1}^{4} \beta_i\,\mu_i(G_l), \qquad \sigma(G_l) = \sum_{i=1}^{4} \sigma_i(G_l) \qquad (25)$$

The optimal weights $\tilde{\beta} = [\hat{\beta}_1, \ldots, \hat{\beta}_4]$ for kernel combination are determined automatically by minimizing the cumulative intra-cluster similarity variance:

$$\min_{\beta} \left\{ \sum_{l=1}^{K} \sigma(G_l) \right\} = \min_{\beta} \left\{ \tilde{\beta}^T \left( \sum_{l=1}^{K} X(G_l) X(G_l)^T \right) \tilde{\beta} \right\} \qquad (26)$$

subject to:

$$\sum_{i=1}^{4} \beta_i = 1, \qquad \beta \ge 0$$

where $X(G_l) X(G_l)^T$ is semi-positive definite and $X(G_l)^T$ is defined as:

$$X(G_l)^T = [\Psi_1(G_l), \ldots, \Psi_4(G_l)] \qquad (27)$$

where $\Psi_i(G_l)$ is defined as:

$$\Psi_i(G_l) = \sum_{u \in G_l} \sum_{v \in G_l} \|\kappa_i(u,v) - \sigma_i(G_l)\|$$

Thus the optimal kernel weights $\tilde{\beta} = [\hat{\beta}_1, \ldots, \hat{\beta}_4]$ are determined automatically by solving the following quadratic programming problem [40]:

$$\min_{\beta} \left\{ \frac{1}{2}\,\tilde{\beta}^T \left( \sum_{l=1}^{K} X(G_l) X(G_l)^T \right) \tilde{\beta} \right\} \qquad (28)$$

subject to:

$$\sum_{i=1}^{4} \beta_i = 1, \qquad \beta \ge 0$$

In summary, our K-way min–max cut algorithm takes the following steps iteratively for social image clustering and kernel weight determination: (1) $\beta$ is set equally for all four feature subsets, e.g., $\beta_1 = \cdots = \beta_4 = 0.25$. (2) Given the initial set of kernel weights, our K-way min–max cut algorithm is performed to partition large amounts of weakly-tagged social images for the given image topic $C$ into $K$ clusters automatically. (3) Given an initial partition of large amounts of weakly-tagged social images, our kernel weight determination algorithm is performed to estimate more suitable kernel weights, so that more precise characterization of the cross-modal image similarity contexts can be achieved. (4) Go to step 2 and continue the loop iteratively until $\beta$ converges.

To evaluate the effectiveness of our cross-modal image clustering algorithm, our image visualization technique [2] is incorporated to lay out the image clustering results. Because each image topic may consist of large amounts of weakly-tagged social images with diverse visual properties, it is impossible to lay out all these weakly-tagged social images within one small screen. As shown in Fig. 7, one or more of the most representative images are selected for each representative cluster. From this visualization result, one can observe that our cross-modal social image clustering algorithm can provide a good summarization of large amounts of weakly-tagged social images and discover comprehensive knowledge (i.e., representative image clusters and their global distributions).

5.1. Inter-cluster correlation determination

In order to derive more comprehensive knowledge from large amounts of weakly-tagged social images, it is also very attractive to determine the correlations among multiple clusters for the same image topic of interest. Thus the kernel canonical correlation analysis (KCCA) [21] algorithm is extended to determine the inter-cluster cross-modal correlations, and it takes the following steps:

(1) The cross-modal similarity contexts among the weakly-tagged social images in the same cluster are cumulated.

(2) The optimal projection directions of the weakly-tagged social images for two clusters are obtained automatically by solving a generalized eigenvalue problem, so that the cross-modal correlations among their weakly-tagged social images on the optimal projection directions $\rho$ and $\varrho$ can mutually be maximized:

$$\Psi(G_i)\Psi(G_j)\,\varrho - \lambda_\rho^2\,\Psi(G_i)\Psi(G_i)\,\rho = 0 \qquad (29)$$

$$\Psi(G_j)\Psi(G_i)\,\rho - \lambda_\varrho^2\,\Psi(G_j)\Psi(G_j)\,\varrho = 0 \qquad (30)$$

where the eigenvalues $\lambda_\rho$ and $\lambda_\varrho$ follow the additional constraint $\lambda_\rho = \lambda_\varrho$.

(3) For two image clusters $G_i$ and $G_j$, their inter-cluster cross-modal similarity context $\varphi(G_i, G_j)$ is defined as [21]:

$$\varphi(G_i, G_j) = \max_{\rho,\,\varrho} \frac{\rho^T \Psi^T(G_i)\,\Psi(G_j)\,\varrho}{\sqrt{\rho^T \Psi^T(G_i)\,\Psi(G_i)\,\rho \;\cdot\; \varrho^T \Psi^T(G_j)\,\Psi(G_j)\,\varrho}} \qquad (31)$$
where $\rho$ and $\varrho$ are the parameters for determining the optimal projection directions that maximize the cross-modal correlations between the two given image clusters $G_i$ and $G_j$; $\Psi(G_i)$ and $\Psi(G_j)$ are the cumulative cross-modal similarity contexts among the weakly-tagged social images in the given clusters $G_i$ and $G_j$, as defined in Eq. (21). As shown in Fig. 8, Fig. 9 and Fig. 10, determining and visualizing the inter-cluster cross-modal correlations can provide a more concise representation of the derived knowledge (i.e., image clusters and their inter-cluster cross-modal correlations). Representing and visualizing the image clusters with their most representative images, their global distributions, and their inter-cluster cross-modal correlations in a graph form can also enable more effective navigation and exploration of large amounts of weakly-tagged social images, which is very attractive for users to assess the image search results interactively [2,33–35,44].

5.2. Polysemous tag cleansing

Some image topics may be polysemous, which may result in large amounts of ambiguous images. Our K-way min–max cut algorithm provides a reasonable way to deal with the issue of polysemous tags by partitioning large amounts of ambiguous images for the same polysemous tag (topic) into multiple clusters, where each image cluster may correspond to one particular sense of the given polysemous tag (i.e., one certain sub-topic). By splitting the polysemous topic into multiple sub-topics automatically, our K-way min–max cut algorithm can derive more precise knowledge from large-scale weakly-tagged social images. To evaluate the effectiveness of our cross-modal tag cleansing algorithm in dealing with polysemous tags, we have compared the precision rates for 1000 query terms before and after separating the polysemous tags and their
Fig. 7. Clustering result of the social images for the image topic ‘‘flower’’.
Fig. 8. Image clusters for image topic ‘‘beach’’ and their inter-cluster cross-modal correlations.
ambiguous images. Some results are illustrated in Fig. 11; one can observe that our cross-modal tag cleansing algorithm can tackle the issue of polysemous tags effectively. By separating the polysemous
tags and their ambiguous images into multiple sub-concepts, our system can achieve higher precision rates for social image retrieval.
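A minimal sketch of this sub-topic splitting step, assuming the cluster assignments are already available from the K-way min–max cut algorithm: each ambiguous image of a polysemous tag is relabeled with a per-cluster sub-topic name. The `tag#k` naming convention is a hypothetical choice for illustration, not the paper's.

```python
def split_polysemous_tag(tag, cluster_labels):
    """Relabel the ambiguous images of a polysemous tag into per-cluster
    sub-topics, e.g. 'bank' -> 'bank#0', 'bank#1', ...; cluster_labels[i]
    is the cluster index of the i-th image produced by the clustering."""
    return ["%s#%d" % (tag, c) for c in cluster_labels]
```

Each resulting sub-topic can then be indexed separately, so a query on the original tag can be refined to one sense.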
Fig. 9. Image clusters for ‘‘flower’’ and their inter-cluster cross-modal correlations.
Fig. 10. Image clusters for ‘‘wave’’ and their inter-cluster cross-modal correlations.
Fig. 11. The precision rates for some query terms before and after separating the polysemous tags and their ambiguous images.
Fig. 12. Inter-cluster correlation analysis for junk image filtering: (a) inter-cluster correlations for ‘‘red flower’’; (b) the filtered junk images for ‘‘red flower’’. (For interpretation of the references to colour in this figure caption, the reader is referred to the web version of this article.)
Fig. 13. Junk image filtering for Flickr images: (a) inter-cluster correlations for ‘‘rock’’; (b) filtered out junk images for ‘‘rock’’.
5.3. Junk image filtering

In Section 3.1, we have incorporated users' comments for spam tag detection and junk image filtering. It is worth noting that our algorithms for social image clustering and inter-cluster correlation determination can also provide an alternative solution for junk image filtering.
For a given image topic $C$, its $K$ image clusters can further be partitioned into two groups according to their inter-cluster cross-modal similarity contexts $\varphi(\cdot,\cdot)$: a positive group versus a negative group. The weakly-tagged social images in the positive group have strong correlations on their visual properties and semantics (i.e., larger values of the inter-cluster cross-modal similarity contexts); thus they are correlated strongly and can
Fig. 14. Major components for inter-concept visual similarity determination.
Fig. 15. Different views of our topic network for knowledge representation and visualization.
be treated as the relevant social images for the given image topic $C$. On the other hand, the weakly-tagged social images in the negative group, which differ significantly from the weakly-tagged social images in the positive group on their visual properties and semantics (i.e., with small values of the inter-cluster cross-modal similarity contexts), can be treated as the junk images for the given image topic $C$ and be filtered out automatically. The negative clusters for the junk images tend to be small because the junk images may not have strong correlations on their visual properties and semantics. As shown in Fig. 12 and Fig. 13, one can observe that our inter-cluster correlation analysis algorithm can filter out the junk images effectively.
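A minimal sketch of this cluster-level junk filtering, given the inter-cluster similarity matrix $\varphi(\cdot,\cdot)$: clusters whose average correlation with the other clusters is weak are placed in the negative (junk) group. Thresholding at the overall mean is an illustrative assumption; the paper does not fix a specific cut-off.

```python
import numpy as np

def filter_junk_clusters(phi):
    """Split K clusters into a positive and a negative group from the
    inter-cluster cross-modal similarity matrix phi (K x K, symmetric).
    Clusters whose mean correlation with the other clusters falls below
    the overall mean are treated as candidate junk-image clusters."""
    K = phi.shape[0]
    off = phi.copy()
    np.fill_diagonal(off, 0.0)
    avg = off.sum(axis=1) / (K - 1)     # mean correlation per cluster
    thr = avg.mean()                    # assumed cut-off
    positive = np.where(avg >= thr)[0]
    negative = np.where(avg < thr)[0]   # candidate junk clusters
    return positive, negative
```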
6. Topic Network Generation for Large-Scale Image Summarization and Navigation

To support interactive visualization and exploration of large-scale weakly-tagged social images, it is very attractive to enable a graph-based representation of a large number of image topics of interest and their inter-topic similarity contexts. As illustrated in Fig. 14, a new algorithm is developed for determining the inter-topic similarity contexts. The inter-topic similarity context $\gamma(C_i, C_j)$ between two image topics $C_i$ and $C_j$ can be determined by:
$$\gamma(C_i, C_j) = \max_{\theta,\,\vartheta} \frac{\theta^T \kappa(S_i)\,\kappa(S_j)\,\vartheta}{\sqrt{\theta^T \kappa^2(S_i)\,\theta \;\cdot\; \vartheta^T \kappa^2(S_j)\,\vartheta}} \qquad (32)$$
Fig. 16. The comparison results on the accuracy $\eta$ of our image clustering algorithm by using mixture-of-kernels and single base kernel for image similarity characterization.

Fig. 17. The comparison results on the accuracy $\eta$ between our K-way min–max algorithm, the normalized cuts approach and the kernel K-means method.

Fig. 18. The comparison results on the accuracy $\eta$ of our K-way min–max algorithm for social image clustering by combining different feature subsets, where Full means all four feature subsets are used.
where $\theta$ and $\vartheta$ are the parameters for determining the optimal projection directions that maximize the correlations between the two image sets $S_i$ and $S_j$ for the image topics $C_i$ and $C_j$; $\kappa(S_i)$ and $\kappa(S_j)$ are the cumulative kernel functions for characterizing the cross-modal correlations between the social images in the image sets $S_i$ and $S_j$:

$$\kappa(S_i) = \sum_{G_l \in C_i} \Psi(G_l), \qquad \kappa(S_j) = \sum_{G_l \in C_j} \Psi(G_l) \qquad (33)$$

As shown in Fig. 15, a topic network (i.e., image topics and their inter-topic similarity contexts) is constructed for summarizing and visualizing large-scale social images at the semantic level. The topic network can provide a good global overview (summarization) of large-scale social images. By supporting interactive topic network visualization and exploration, our topic network can further be used to assist users in query formulation [2], e.g., finding the most relevant image topics interactively as their query terms.

7. Algorithm Evaluation

Our experiments on algorithm evaluation are performed on 5 million Flickr images. To assess the effectiveness of our proposed algorithms, our algorithm evaluation work focuses on: (1) comparing the performance differences of our social image clustering algorithm when a single base kernel or the mixture-of-kernels is used for image similarity characterization; (2) comparing the performance differences between various approaches for social image clustering
Fig. 19. The comparison results on the accuracy $\eta$ of our K-way min–max algorithm and the MSRA tri-parties graph partition technique for social image clustering.
(i.e., our K-way min–max cut algorithm, the normalized cuts approach [13], the kernel k-means method [16], and the MSRA tri-parties graph partition technique [33–35]); and (3) comparing the performance of our social image clustering algorithm when different feature subsets are combined for image similarity characterization. For algorithm evaluation, the accuracy rate $\eta$ is used to measure the performance of the various image clustering algorithms. The accuracy rate $\eta$ is defined as:

$$\eta = \frac{1}{n} \sum_{i=1}^{n} \delta(L_i, R_i) \qquad (34)$$

where $n$ is the total number of social images, $L_i$ is the cluster label for the $i$th social image obtained by the various clustering algorithms, and $R_i$ is the label for the $i$th social image given by a benchmark image set. $\delta(x, y)$ is a delta function:

$$\delta(x, y) = \begin{cases} 1, & x = y \\ 0, & \text{otherwise} \end{cases} \qquad (35)$$
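Eqs. (34) and (35) amount to the fraction of matching labels, as sketched below; note that in practice the predicted cluster indices must first be aligned with the benchmark labels (e.g., by a best permutation matching), which Eq. (34) assumes has already been done.

```python
import numpy as np

def clustering_accuracy(predicted, benchmark):
    """Accuracy eta of Eqs. (34)-(35): the fraction of social images whose
    predicted cluster label L_i equals the benchmark label R_i."""
    predicted = np.asarray(predicted)
    benchmark = np.asarray(benchmark)
    return float((predicted == benchmark).mean())
```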
One serious problem in evaluating various image clustering algorithms is that the cluster labels for large-scale social images are not available; thus it could be very hard to obtain the accuracy rates $\eta$ in real applications. To address this issue, our image visualization system [2] is used to enable interactive exploration of large-scale social images. For a given social image, users can interactively assess the match between its cluster label (predicted by the image clustering algorithm) and its real cluster label from human perception, so that we can obtain the accuracy rates $\eta$ to evaluate the performance of various algorithms for social image clustering. By combining multiple base kernels for diverse image similarity characterization, our mixture-of-kernels algorithm can achieve a more accurate approximation of the diverse similarity relationships among the social images, which can further result in higher accuracy rates for social image clustering. For the same task of social image clustering, we have compared the performance of our image clustering algorithm using mixture-of-kernels versus single-kernel image similarity characterization. As shown in Fig. 16, one can observe that our image clustering algorithm with mixture-of-kernels can outperform the same clustering algorithm
when a single base kernel is used. This improvement in the accuracy rates for social image clustering comes from two factors: (a) combining multiple base kernels achieves a more accurate characterization of the diverse similarity relationships among the social images; and (b) each base kernel characterizes one certain type of similarity relationship among the social images, so our mixture-of-kernels algorithm is not sensitive to the distributions of the social images, and weaker assumptions can be made for base kernel selection. Using the same set of multi-modal feature subsets for image content representation, we have compared the performance differences between three approaches for social image clustering: (a) our K-way min–max cut algorithm; (b) the normalized cuts method [13]; and (c) the kernel k-means technique [16]. As shown in Fig. 17, one can observe that our K-way min–max cut algorithm can achieve higher accuracy rates for social image clustering. This improvement in the clustering accuracy rate comes from our better definitions of the inter-cluster similarity contexts and the intra-cluster similarity contexts. It is important to note that the objective function for the K-way normalized cut algorithm can be rewritten as:
$$Ncut(C, K) = \sum_{l=1}^{K} \frac{s(G_l, G/G_l)}{d_l} = \sum_{l=1}^{K} \frac{s(G_l, G/G_l)}{s(G_l, G_l) + s(G_l, G/G_l)} \qquad (36)$$
Because $s(G_l, G/G_l)$ may dominate and produce a smaller value of $Ncut(C, K)$ in Eq. (36), the normalized cuts algorithm may always cut out a subgraph (cluster) with a very small weight, e.g., a skewed cut [18]. On the other hand, our K-way min–max cut algorithm can deal with the problem of skewed cuts effectively; thus it can achieve higher accuracy rates for social image clustering. Different feature subsets may play different roles in image similarity characterization and decision making for social image clustering. By combining different feature subsets for image similarity characterization, we have also compared the performance differences of our K-way min–max cut algorithm for social image clustering, which provides good evidence for assessing the effectiveness of the various feature subsets. As shown in Fig. 18 and Fig. 19, one can observe that the social tags are valuable and crucial for achieving more precise image clustering.

Fig. 20. The comparison results on the accuracy $\eta$ of our K-way min–max algorithm for image clustering: (a) integrating both the visual features and social tags; and (b) only the visual features (without social tags).

Fig. 21. The precision rates for 5000 query terms: (a) our system; (b) Flickr search.

Fig. 22. The recall rates for 5000 query terms: (a) our system; (b) Flickr search.

When both the visual features and social tags are integrated for image similarity characterization, we have also compared the performance difference between our K-way min–max algorithm and the MSRA tri-parties graph partition method [33–35] on cross-modal social image clustering. As shown in Fig. 20, one can observe that our K-way min–max algorithm has achieved very competitive results on social image clustering. We have also compared the precision and recall rates between our system (which provides techniques to deal with the critical issues of spam tags, synonymous tags, and polysemous tags) and the Flickr search system (which does not provide techniques to deal with these critical issues). As shown in Fig. 21 and Fig. 22, one can observe that our system can achieve higher precision and recall rates for all 5000 query terms (i.e., 5000 social tags of interest are used as query terms in our experiments) by addressing the critical issues of spam tags, synonymous tags and polysemous tags effectively.
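For completeness, the iterative scheme evaluated above (spectral relaxation of the K-way min–max cut followed by kernel-weight re-estimation) can be sketched as follows. The k-means readout of the spectral embedding and the inverse-variance weight update are simplifying stand-ins for the paper's discrete cluster assignment and the quadratic program of Eq. (28), so this is an illustrative sketch, not the exact solver.

```python
import numpy as np

def minmax_cut_step(base_kernels, beta, K):
    """One iteration: combine the base kernels with the current weights
    beta (Eq. (12)), relax the K-way min-max cut to the top-K eigenvectors
    of D^{-1/2} W D^{-1/2} (Eqs. (19)-(20)), read clusters off the spectral
    embedding with Lloyd's k-means, and re-estimate beta heuristically."""
    W = sum(b * Kb for b, Kb in zip(beta, base_kernels))
    d = W.sum(axis=1)
    Wn = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(Wn)              # eigenvalues in ascending order
    emb = vecs[:, -K:]                        # top-K eigenvectors
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # farthest-point seeding, then Lloyd's k-means on the embedded rows
    centers = [emb[0]]
    for _ in range(K - 1):
        dist = ((emb[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(emb[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(50):
        labels = ((emb[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    # heuristic weight update: favour kernels with low within-cluster variance
    var = np.array([np.mean([Kb[np.ix_(labels == k, labels == k)].var()
                             for k in range(K) if np.any(labels == k)])
                    for Kb in base_kernels])
    beta = 1.0 / (var + 1e-12)
    return labels, beta / beta.sum()
```

Iterating this step until the weights stabilize mirrors steps (1)–(4) of the summary in Section 5.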
8. Conclusions

In this paper, a new algorithm is developed for achieving cross-modal social image clustering and tag cleansing. A semantic image clustering algorithm is developed to assign large-scale weakly-tagged social images into a large number of image topics of interest. A K-way min–max cut algorithm is developed for social image clustering by minimizing the cumulative inter-cluster cross-modal similarity contexts while maximizing the cumulative intra-cluster cross-modal similarity contexts. To tackle the issue of tag ambiguity, multiple algorithms are developed for spam detection and cross-modal tag cleansing. Our experiments on large-scale weakly-tagged Flickr images have provided very positive results.

Acknowledgment

The authors would like to thank the reviewers for their insightful comments and suggestions to make this paper more readable. This research is partly supported by the National Science Foundation of China under Grants 61272285, 61103062 and 61075014, the Doctoral Program of Higher Education of China (Grant Nos. 20126101110022, 20116102110027, 20116102120031) and the Program for New Century Excellent Talents in University under NCET-10-0071.

References

[1] Flickr.
[2] J. Fan, D. Keim, Y. Gao, H. Luo, Z. Li, JustClick: personalized image recommendation via exploratory search from large-scale Flickr images, IEEE Trans. CSVT 19 (5) (2009). [3] Y.G. Jiang, C.W. Ngo, J. Yang, Towards optimal bag-of-features for object categorization and semantic video retrieval, in: ACM CIVR, 2007. [4] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool, SURF: speeded up robust features, Comput. Vision Image Understand. (CVIU) 110 (3) (2008) 346–359. [5] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, Boston, MA, 1998. [6] N. Cristianini, J. Shawe-Taylor, H. Lodhi, Latent semantic kernels, J. Intell. Inf. Syst. 18 (2–3) (2002) 127–152. [7] S. Sonnenburg, G. Ratsch, C. Schafer, B. Scholkopf, Large scale multiple kernel learning, J. Mach. Learn. Res. 7 (2006) 1531–1565.
910
J. Peng et al. / J. Vis. Commun. Image R. 24 (2013) 895–910
[8] M. Varma, D. Ray, Learning the discriminative power-invariance trade-off, in: IEEE ICCV, 2007. [9] A. Frome, Y. Singer, F. Sha, J. Malik, Learning globally-consistent local distance functions for shape-based image retrieval and classification, in: IEEE ICCV, 2007. [10] A. Bosch, A. Zisserman, X. Munoz, Representing shape with a spatial pyramid kernel, in: ACM CIVR, 2007. [11] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for classification of texture and object categories: a comprehensive study, Int. J. Comput. Vision 73 (2) (2007) 213–238. [12] J. Fan, Y. Gao, H. Luo, Integrating concept ontology and multi-task learning to achieve more effective classifier training for multi-level image annotation, IEEE Trans. Image Process. 17 (3) (2008) 407–426. [13] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. PAMI (2000). [14] C. Ding, X. He, H. Zha, M. Gu, H. Simon, A min–max cut algorithm for graph partitioning and data clustering, in: ICDM, 2001. [15] S. Yu, J. Shi, Multiclass spectral clustering, in: ICCV, 2003. [16] I. Dhillon, Y. Guan, B. Kulis, Kernel k-means, spectral clustering and normalized cut, in: KDD, 2004. [17] D. Yuan, L. Huang, M.J. Jordan, Fast approximate spectral clustering, in: KDD, 2009. [18] M. Gu, H. Zha, C. Ding, X. He, H. Simon, J. Xia, Spectral relaxation models and structure analysis for K-way graph clustering and bi-clustering, Technical Report, 2001. [19] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006, ISBN 1-55860-901-6. [20] K. Grauman, T. Darrell, The pyramid match kernel: discriminative classification with sets of image features, in: ICCV, 2005. [21] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Technical Report, CSD-TR-03-02, University of London, 2003. [22] M. Sussna, Word sense disambiguation for free-text indexing using a massive semantic network, in: ACM CIKM, 1993, pp. 67–74. [23] K. Barnard, M. Johnson, Word sense disambiguation with pictures, Art. Intell. 167 (2005) 13–30. [24] J. Fan, H. Luo, Y. Shen, C. Yang, Integrating visual and semantic contexts for topic network generation and word sense disambiguation, in: ACM CIVR, 2009. [25] Y. Jing, S. Baluja, PageRank for product image search, in: ACM WWW, 2008, pp. 307–315. [26] C.H. Brooks, N. Montanez, Improved annotation of the blogosphere via autotagging and hierarchical clustering, in: ACM WWW, 2006. [27] S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu, Optimizing web search using social annotations, in: WWW, 2007, pp. 501–510. [28] G. Begelman, P. Keller, F. Smadja, Automated tag clustering: improving search and exploration in the tag space, in: ACM WWW, 2006. [29] J. Gemmell, A. Shepitsen, B. Mobasher, R. Burke, Personalized navigation in folksonomies using hierarchical tag clustering, in: AAAI Workshop, 2008. [30] E. Simpson, Clustering tags in enterprise and web folksonomies, HPL-2007-190, 2007. [31] M. Grineva, M. Grinev, D. Turdakov, P. Velikhov, Harnessing Wikipedia for smart tags clustering, in: AAAI, 2008. [32] R. Fergus, L. Fei-Fei, P. Perona, A. Zisserman, Learning object categories from Google's image search, in: Proc. IEEE CVPR, 2006. [33] D. Cai, X. He, Z. Li, W.-Y. Ma, J.-R. Wen, Hierarchical clustering of WWW image search results using visual, textual, and link information, in: ACM Multimedia, 2004.
[34] X.-J. Wang, W.-Y. Ma, G.-R. Xue, X. Li, Multi-modal similarity propagation and its application for web image retrieval, in: ACM Multimedia, 2004. [35] B. Gao, T.-Y. Liu, T. Qin, X. Zhang, Q.-S. Cheng, W.-Y. Ma, Web image clustering by consistent utilization of visual features and surrounding texts, in: ACM Multimedia, 2005. [36] N. Ben-Haim, B. Babenko, S. Belongie, Improving web-based image search via content based clustering, in: IEEE CVPR Workshop on SLAM, 2006. [37] G. Koutrika, F. Effendi, Z. Gyongyi, P. Heymann, H. Garcia-Molina, Combating spam in tagging systems, in: AIRWeb, 2007. [38] A. Ntoulas, M. Najork, M. Manasse, D. Fetterly, Detecting spam web pages through content analysis, in: ACM WWW, 2006, pp. 83–92. [39] B. Wu, B. Davison, Detecting semantic cloaking on the web, in: ACM WWW, 2006. [40] Y. Dai, R. Fletcher, New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds, Math. Program.: Ser. A, B 106 (3) (2006) 403–421. [41] K. Barnard, P. Duygulu, D.A. Forsyth, Clustering art, in: IEEE CVPR, 2001, pp. 434–441. [42] N. Loeff, C.O. Alm, D.A. Forsyth, Discriminating image senses by clustering with multi-modal features, in: Proc. of COLING/ACL, 2006, pp. 547–554. [43] M. Rege, M. Dong, J. Hua, Graph theoretical framework for simultaneously integrating visual and textual features for efficient web image clustering, in: WWW, 2008. [44] Y. Chen, J.Z. Wang, R. Krovetz, CLUE: cluster-based retrieval of images by unsupervised learning, IEEE Trans. Image Process. 14 (8) (2005) 1187–1201. [45] N. Jindal, B. Liu, Review spam detection, in: ACM WWW, 2007. [46] B. Liu, M. Hu, J. Cheng, Opinion observer: analyzing and comparing opinions on the web, in: ACM WWW, 2005. [47] K. Dave, S. Lawrence, D.M. Pennock, Mining the peanut gallery: opinion extraction and semantic classification of product reviews, in: ACM WWW, 2003. [48] M. Ames, M. Naaman, Why we tag: motivations for annotation in mobile and online media, in: CHI, 2007.
[49] A. Borthwick, J. Sterling, E. Agichtein, R. Grishman, NYU: description of the MENE named entity system as used in MUC-7, in: Proc. of the Seventh Message Understanding Conf. (MUC-7), 1998. [50] A.I. Inc., Lingpipe. . [51] Y. Rubner, C. Tomasi, L. Guibas, A metric for distributions with applications to image databases, in: IEEE ICCV, 1998, pp. 59–66. [52] G.P. Nguyen, M. Worring, Interactive access to large image collections using similarity-based visualization, J. Visual Lang. Comput., 2006. [53] B. Moghaddam, Q. Tian, N. Lesh, C. Shen, T.S. Huang, Visualization and user-modeling for browsing personal photo libraries, Int. J. Comput. Vision 56 (2004) 109–130. [54] S. Santini, A. Gupta, R. Jain, Emergent semantics through interaction in image databases, IEEE Trans. Knowledge Data Eng. 13 (3) (2001) 337–351. [55] Y. Lu, L. Zhang, J. Liu, Q. Tian, Constructing concept lexica with small semantic gaps, IEEE Trans. Multimedia 12 (4) (2010) 288–299. [56] H. Ma, J. Zhu, M.R. Lyu, I. King, Bridging the semantic gap between image contents and tags, IEEE Trans. Multimedia 12 (5) (2010) 462–473. [57] R. Zhao, W.I. Grosky, Narrowing the semantic gap - improved text-based web document retrieval using visual features, IEEE Trans. Multimedia 4 (2) (2002) 189–200. [58] J. Fan, X. He, N. Zhou, J. Peng, R. Jain, Quantitative characterization of semantic gaps for learning complexity estimation and inference model selection, IEEE Trans. Multimedia 14 (5) (2012).