Automatic image annotation by semi-supervised manifold kernel density estimation


INS 10297

No. of Pages 13, Model 3G

16 September 2013

Information Sciences xxx (2013) xxx–xxx

Contents lists available at ScienceDirect

Information Sciences

journal homepage: www.elsevier.com/locate/ins

Ping Ji a,⇑, Na Zhao b, Shijie Hao b, Jianguo Jiang b

a Department of Electronic Information and Electrical Engineering, Hefei University, Hefei 230601, PR China
b School of Computer and Information, Hefei University of Technology, Hefei 230009, PR China

Article info

Article history: Available online xxxx

Keywords: Image annotation; Semi-supervised learning; Kernel density estimation; Manifold

Abstract

The insufficiency of labeled training data is a major obstacle in automatic image annotation. To tackle this problem, we propose a semi-supervised manifold kernel density estimation (SSMKDE) approach based on a recently proposed manifold KDE method. Our contributions are twofold. First, SSMKDE leverages both labeled and unlabeled samples and formulates all data in a manifold structure, which enables a more accurate label prediction. Second, the relationship between KDE-based methods and graph-based semi-supervised learning (SSL) methods is analyzed, which helps to better understand graph-based SSL methods. Extensive experiments demonstrate the superiority of SSMKDE over existing KDE-based and graph-based SSL methods.

© 2013 Published by Elsevier Inc.


1. Introduction


The rapid growth of large-scale image data makes effective management [19,13,48] and access [27] highly desirable. Metadata have shown their superiority in representing images at both the syntactic and semantic levels, and can be used for image retrieval, summarization and indexing. To generate the metadata, automatic annotation is an elementary step, which can be formulated as a classification task and accomplished by a learning-based method. More specifically, statistical models are usually built on pre-labeled data to accomplish the task. However, manually labeling images is a time-consuming and labor-intensive process. In practice this leads to insufficient training data and, in turn, to inaccurate annotation results.

Extensive research efforts have been dedicated to automatic image annotation, and semi-supervised learning (SSL) methods have recently shown great potential in solving the problem of training data insufficiency. Essentially, automatic image annotation can be formulated as an SSL [11,37,57] task, in which only a small proportion of images are labeled while the rest are left for label prediction. By leveraging a large amount of unlabeled data under certain assumptions, SSL methods are expected to build more accurate models than purely supervised methods. Recently, graph-based SSL methods that benefit from the label smoothness assumption have been introduced [6,56,59]. These methods define a graph where the vertices are labeled/unlabeled samples and the edges reflect the similarities between vertex pairs. A labeling function is then estimated on the graph. The label smoothness over the graph is characterized in a regularization framework, which is composed of a regularization term and a loss function term.
Graph-based SSL algorithms have shown encouraging performance in many machine learning and multimedia applications [60,61,64,63], in particular when labeled data are extremely limited. However, several issues remain unclear for graph-based SSL methods. First, a supervised version of graph-based SSL sometimes outperforms its semi-supervised version, i.e., unlabeled samples may degrade the performance of generative SSL methods [14]. Second, graph-based SSL models are generally considered to be transductive and cannot be easily extended to out-of-sample data. Several different approaches have been proposed to tackle this problem [15,51], but it is not yet clear which one is optimal. Third, as a graph edge measures the similarity between two vertices, several novel graph construction strategies, in addition to using Euclidean distance, can be used to improve classification performance [39,45]. These issues indicate that the existing methods still need further improvement.

Kernel Density Estimation (KDE) is a non-parametric density estimation approach, which avoids the model assumption problem [62]. Recently, semi-supervised KDE (SSKDE) has been proposed to exploit both labeled and unlabeled multimedia data [46]. Furthermore, an improved SSKDE, which adaptively estimates the kernel density, is introduced in [47]. These methods are limited by the isotropic nature of the chosen Gaussian window. Despite the globally non-linear structure of the feature space in the multimedia scenario, it is plausible to assume that feature vectors can be embedded in a locally linear subspace. By considering this structure, a Manifold KDE is proposed in [41] to generate much smoother classification boundaries, whereas the previous KDE methods usually introduce boundaries with "hole" or "zig-zag" artifacts.

In this paper, we extend the Manifold KDE to a novel method called Semi-Supervised Manifold KDE (SSMKDE). The method combines the strengths inherited from SSL and KDE, and thus addresses both the training data insufficiency problem and the model assumption problem mentioned above. We also show that several different graph-based SSL algorithms can be derived from SSKDE. Therefore, KDE can be viewed as the supervised version of graph-based SSL methods.

We employ the proposed SSMKDE for image annotation, and experiments demonstrate the effectiveness of our algorithm. The contributions of our work are as follows. First, SSMKDE leverages both labeled and unlabeled samples and formulates all data in a manifold structure, which improves annotation accuracy. Second, we analyze the relationships between KDE-based methods and graph-based SSL methods, which helps in better understanding graph-based SSL methods.

The rest of this paper is organized as follows. Section 2 briefly introduces related work. In Section 3, we introduce the proposed semi-supervised manifold KDE for automatic image annotation. In Section 4, we discuss the relationships between KDE-based methods and graph-based SSL methods. Experiments are provided in Section 5, followed by concluding remarks in Section 6.

⇑ Corresponding author. Tel.: +86 13855147539. E-mail addresses: [email protected] (P. Ji), [email protected] (S. Hao), [email protected] (J. Jiang).
0020-0255/$ - see front matter © 2013 Published by Elsevier Inc. http://dx.doi.org/10.1016/j.ins.2013.09.016
Please cite this article in press as: P. Ji et al., Automatic image annotation by semi-supervised manifold kernel density estimation, Inform. Sci. (2013), http://dx.doi.org/10.1016/j.ins.2013.09.016
We employ the proposed SSMKDE for image annotation and experiments demonstrate the effectiveness of our algorithm. The contributions of our work are as follows. First, SSMKDE leverages both labeled and unlabeled samples and formulates all data in a manifold structure, which improves annotation accuracy. Second, we analyze the relationships between KDE-based methods and graph-based SSL methods, which are helpful in better understanding graph-based SSL methods. For clarity, we organize the rest of this paper as follows. Section 2 briefly introduces related work. In Section 3, we introduce the proposed semi-supervised manifold KDE for automatic image annotation. In Section 4, we provide a discussion on the relationships between KDE-based methods and graph-based SSL methods. Experiments are provided in Section 5, followed by concluding remarks in Section 6.


2. Related work


2.1. Image annotation


Generally, image annotation approaches can be divided into three paradigms, i.e., generative models, discriminative models and nearest neighbor based methods [22]. Generative models can be further categorized into topic models [4,31,49] and mixture models [8]. Discriminative models learn a separate classifier for each class [18]. As for nearest neighbor based methods, various kinds of features are combined (called Joint Equal Contribution, JEC) to describe the similarities between images, and a simple kNN-based keyword transferring method is used [30]. Instead of the JEC strategy, TagProp [22] learns the combination weights of the feature groups together with a word-specific model for each keyword. Recently, the family of sparsity methods has also been widely employed in image annotation. An input and output structural grouping sparsity is introduced into a regularized regression model for image annotation in [23]. Zhang et al. [52] introduce a regularization-based feature selection algorithm to leverage both the sparsity and clustering properties of features. All the above-mentioned methods assume that sufficient labeled training data are available. When training data are insufficient, however, their performance may severely degrade.


2.2. Semi-supervised learning


As stated above, the large collections of unlabeled images and the high cost of manual labeling have triggered research on semi-supervised learning methods [42–44]. Although there are large bodies of SSL research such as self-training [35], co-training [7], transductive SVM [53], and graph-based methods [59] (for in-depth reading, surveys such as [11,37,57] are recommended), many of them are computationally expensive [53] or ineffective when the assumed models are inaccurate [14]. We therefore argue that new models that reveal the implicit structures of multimedia data should be developed. Recently, Wang et al. proposed semi-supervised kernel density estimation (SSKDE), in which both labeled and unlabeled data are leveraged to estimate class conditional probability densities. Shao et al. [38] propose a semi-supervised topic model for image annotation, in which a harmonic regularization based on the graph Laplacian is introduced into the probabilistic semantic model. Zhao et al. [55] build a cooperative kernel sparse representation (SR) method for image annotation by co-training two SRs in the kernel space. In [50], the authors propose a semi-supervised long-term Relevance Feedback (RF) algorithm to refine the multimedia data representation. The proposed long-term RF algorithm utilizes both the multimedia data distribution in the multimedia feature space and the historical RF information provided by users. Zhang et al. [54] propose a Fast Graph-based Semi-Supervised Multiple Instance Learning (FGSSMIL) algorithm, which aims to build a generic semi-supervised framework for video annotation across various video domains. These works suggest that SSL methods can be elegantly incorporated into a multimedia annotation framework in various ways.


3. Semi-supervised manifold kernel density estimation


In this section, we present the semi-supervised manifold kernel density estimation method for image annotation. We first briefly introduce KDE and manifold KDE. Then, we present the SSMKDE algorithm where unlabeled data are incorporated. For clarity, we list all the notations and definitions throughout this paper in Table 1.


3.1. KDE and manifold KDE


Density estimation methods can be categorized into parametric and non-parametric ones. Among the non-parametric methods, the most popular one is the kernel density estimation [34]:

$$\hat{p}(x|C_k) = \frac{1}{l_k} \sum_{j \in L_k} k(x - x_j) \tag{1}$$

Here $\hat{p}(x|C_k)$ is the estimated class conditional probability density of class k. For concision, we abbreviate "class conditional probability density function" to "density" below. Here k(x) is a bounded Riemann integrable kernel function with compact support and satisfies $\int k(x)\,dx = 1$ with $k(x) > 0$.

There are many methods that try to improve the traditional KDE. A large family among them is the adaptive kernel density estimation methods, which employ adaptive kernels rather than fixed ones by exploring the local structures of the sample distribution [29,36]. Here, we provide such an example based on an algorithm named manifold KDE [41]. Manifold KDE defines a multivariate Gaussian kernel over sample $x_i$ according to its local covariance as follows:

$$k(x_i, x) = \frac{1}{2\pi |C_i|^{d/2}} \exp\left(-\frac{1}{2}\, x^{T} C_i^{-1} x\right) \tag{2}$$

where the covariance matrix $C_i$ is defined as:

$$C_i = \frac{1}{N} \sum_{j \in N(x_i)} (x_j - x_i)^{T} (x_j - x_i) \tag{3}$$

and $N(x_i)$ is defined as the neighborhood of sample $x_i$, i.e., the set of the N nearest neighbors of $x_i$ (not including $x_i$ itself).
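The construction in Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the kernel is evaluated at the offset x − x_i, a small `ridge` term is added so that each local covariance is invertible, and the textbook Gaussian normalising constant is used, which may differ from the constant printed in Eq. (2).

```python
import numpy as np


def local_covariances(X, n_neighbors, ridge=1e-6):
    """Eq. (3): covariance C_i built from the N nearest neighbours of each x_i."""
    n, d = X.shape
    # Pairwise squared Euclidean distances between all samples.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)              # x_i is excluded from its own neighbourhood
    nbrs = np.argsort(sq, axis=1)[:, :n_neighbors]
    C = np.empty((n, d, d))
    for i in range(n):
        diff = X[nbrs[i]] - X[i]              # (N, d) offsets x_j - x_i
        C[i] = diff.T @ diff / n_neighbors    # d x d matrix of Eq. (3)
        C[i] += ridge * np.eye(d)             # assumption: small ridge keeps C_i invertible
    return C


def manifold_kernel(x, x_i, C_i):
    """Gaussian kernel centred at x_i with local covariance C_i (cf. Eq. (2))."""
    d = x.shape[0]
    diff = x - x_i
    maha = diff @ np.linalg.solve(C_i, diff)  # (x - x_i)^T C_i^{-1} (x - x_i)
    _, logdet = np.linalg.slogdet(C_i)
    # Assumption: textbook normaliser (2*pi)^(d/2) * |C_i|^(1/2).
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + logdet)
    return float(np.exp(log_norm - 0.5 * maha))


def class_density(x, X_k, C_k):
    """Eq. (1): average of the kernels placed on the samples of one class."""
    return float(np.mean([manifold_kernel(x, xi, Ci) for xi, Ci in zip(X_k, C_k)]))
```

Working with the log-determinant rather than the determinant itself is a design choice that avoids overflow or underflow when the feature dimension d is large.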


3.2. Semi-supervised manifold kernel density estimation


The traditional KDE and its extensions are designed to work on labeled data. In practice, however, labeled data are often limited while unlabeled data are abundant, so it is valuable to incorporate unlabeled data into the KDE-based framework. We therefore extend the manifold KDE to an SSL-based version, i.e., SSMKDE.

Table 1
Symbols and their descriptions.

Symbol        Description
d             Dimension of feature space
l             Number of labeled samples
u             Number of unlabeled samples
n             n = l + u, number of all samples
K             Number of classes
L             L = {1, 2, ..., l}, index set of labeled samples
U             U = {l + 1, l + 2, ..., n}, index set of unlabeled samples
L_k           Index set of samples labeled class k
l_k           Size of L_k
k(x)          Kernel function
p(x)          Global probability density function
p(x|C_k)      Conditional probability density function of class k
P(C_k|x)      Posterior probability of class k given x
P(C_k)        Prior probability of class k
δ             Indicator function, δ[true] = 1, δ[false] = 0
W             An n × n matrix, where W_ij indicates the similarity between x_i and x_j
D             An n × n diagonal matrix, where D_ii = Σ_j W_ij
F             An n × K matrix, where F_ik is the estimated value of P(C_k|x_i)
P             An n × n matrix (see Eq. (9))
F_i           F_i = [F_i1, F_i2, ..., F_iK]
σ             Bandwidth of Gaussian kernel
μ             See Eq. (11)
t_i           See Eq. (8)
T'            See Eq. (17)
T''           See Eq. (18)


Following the introduction in [47], we can assume that the class posterior probabilities of both labeled and unlabeled samples are given. Here, we denote the posterior probability of class k given $x_i$ as $P(C_k|x_i)$. In this way, the kernels in KDE can be weighted by the corresponding posterior probabilities:

$$\hat{p}(x|C_k) = \frac{\sum_{j \in L \cup U} P(C_k|x_j)\, k(x - x_j)}{\sum_{j \in L \cup U} P(C_k|x_j)} \tag{4}$$

Let $P(C_k)$ denote the prior probability of class k. Posterior probabilities can be computed by the Bayes rule as follows:

$$\hat{p}(C_k|x_j) = \frac{P(C_k)\,\hat{p}(x_j|C_k)}{\sum_{k=1:K} P(C_k)\,\hat{p}(x_j|C_k)} \tag{5}$$

We can also estimate the prior probability from the posterior probabilities by using the strong law of large numbers:

$$P(C_k) = \int P(C_k|x)\, p(x)\, dx \approx \frac{1}{l+u} \sum_{i=1:(l+u)} P(C_k|x_i) \tag{6}$$

Then, we can derive that:

$$\hat{p}(C_k|x_j) = \frac{P(C_k)\,\dfrac{\sum_{i \in L \cup U} P(C_k|x_i)\, k(x_j - x_i)}{\sum_{i \in L \cup U} P(C_k|x_i)}}{\sum_{k=1:K} P(C_k)\,\dfrac{\sum_{i \in L \cup U} P(C_k|x_i)\, k(x_j - x_i)}{\sum_{i \in L \cup U} P(C_k|x_i)}} = \frac{\sum_{i \in L \cup U} P(C_k|x_i)\, k(x_j - x_i)}{\sum_{i \in L \cup U} k(x_j - x_i)} \tag{7}$$

Here, we have made an assumption that $\hat{p}(C_k|x_i)$ is close to $P(C_k|x_i)$. We denote the estimated posterior probabilities by $F_{jk}$ and denote the truths by $P(C_k|x_j)$. As $P(C_k|x_i) = (1 - t_i)\,\hat{p}(C_k|x_i) + t_i\,\delta(y_i = k)$, we have:

$$\begin{cases} \dfrac{\sum_{i \in L \cup U} F_{ik}\, k(x_j - x_i)}{\sum_{i \in L \cup U} k(x_j - x_i)} = F_{jk}, & \text{where } j \in U \\[2ex] (1 - t_j)\,\dfrac{\sum_{i \in L \cup U} F_{ik}\, k(x_j - x_i)}{\sum_{i \in L \cup U} k(x_j - x_i)} + t_j\,\delta(y_j = k) = F_{jk}, & \text{where } j \in L \end{cases} \tag{8}$$

We denote $P_{ij}$ by:

$$P_{ij} = \frac{k(x_j - x_i)}{\sum_{i \in L \cup U} k(x_j - x_i)} \tag{9}$$

Under the manifold KDE framework, $k(x_j - x_i)$ can be defined by Eq. (2). Here a parameter σ is further introduced into Eq. (2) as a global kernel bandwidth (tuning this parameter in experiments can help us to obtain better results). Thus, the equation becomes:

$$k(x_i, x) = \frac{1}{\left(2\pi |C_i|^{d/2}\right)\sigma^{d}} \exp\left(-\frac{1}{2\sigma^{2}}\, x^{T} C_i^{-1} x\right) \tag{10}$$

Originally, a regularization framework of graph-based SSL methods can be formulated as follows (we analyze it in detail in Section 4.1):

$$F = \arg\min_{F} \left\{ \sum_{i,j \in L \cup U} W_{ij}\, \|F_i - F_j\|^{2} + \mu \sum_{i \in L} \|F_i - Y_i\|^{2} \right\} \tag{11}$$

We further introduce SSL into the manifold KDE. The above Eq. (11) can be reformulated as:

$$F = \arg\min_{F} \left\{ \sum_{i,j \in L \cup U} k(x_i, x_j)\, \|F_i - F_j\|^{2} + \mu \sum_{i \in L} \|F_i - Y_i\|^{2} \right\} \tag{12}$$

We divide P into 4 blocks as:

$$P = \begin{bmatrix} P_{LL} & P_{LU} \\ P_{UL} & P_{UU} \end{bmatrix} \tag{13}$$

We further divide the matrix F into 2 blocks as:

$$F = \begin{bmatrix} F_L \\ F_U \end{bmatrix} \tag{14}$$

Therefore, Eq. (8) can be rewritten as follows, where $T = \mathrm{diag}(t_1, \ldots, t_l)$:

$$\begin{cases} (P_{UU} - I)F_U + P_{UL}F_L = 0 \\ (I - T)(P_{LL}F_L + P_{LU}F_U) + TY - F_L = 0 \end{cases} \tag{15}$$

Finally, we can solve it by:

$$F = (T' + I - P)^{-1}\, T''\, Y \tag{16}$$

In the above equation, $T' \in \mathbb{R}^{n \times n}$ is defined as:

$$T'_{ij} = \begin{cases} \dfrac{t_i}{1 - t_i} & \text{if } i = j \text{ and } j \in L \\ 0 & \text{otherwise} \end{cases} \tag{17}$$

where $1 \le i, j \le n$, and $T'' \in \mathbb{R}^{n \times l}$ is defined as:

$$T''_{ij} = \begin{cases} \dfrac{t_i}{1 - t_i} & \text{if } i = j \text{ and } j \in L \\ 0 & \text{otherwise} \end{cases} \tag{18}$$

where $1 \le i \le n$ and $1 \le j \le l$.
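The closed-form solution in Eq. (16) can be illustrated with a short NumPy sketch. It rests on several assumptions that are not in the original text: the function name is hypothetical, the labeled samples occupy the first l rows, P is stored so that each row sums to one (row j holds the normalised kernel weights of Eq. (9) for sample x_j), and the toy P is built from an isotropic Gaussian kernel as a stand-in for the manifold kernel of Eq. (10).

```python
import numpy as np


def ssmkde_closed_form(P, Y, t):
    """Solve (T' + I - P) F = T'' Y for F, i.e. Eq. (16).

    P : (n, n) row-normalised kernel matrix, labeled samples first
    Y : (l, K) one-hot label matrix
    t : (l,) weights t_i in [0, 1) of the labeled samples
    """
    n, l = P.shape[0], Y.shape[0]
    ratio = t / (1.0 - t)
    T1 = np.zeros((n, n))
    T1[np.arange(l), np.arange(l)] = ratio   # T' of Eq. (17)
    T2 = np.zeros((n, l))
    T2[np.arange(l), np.arange(l)] = ratio   # T'' of Eq. (18)
    # Solving the linear system is preferred over forming the inverse explicitly.
    return np.linalg.solve(T1 + np.eye(n) - P, T2 @ Y)


# Toy data: two labeled points followed by four unlabeled points.
X = np.array([[0.0, 0.0], [4.0, 4.0],        # labeled: class 0, class 1
              [0.2, 0.1], [0.1, -0.2],       # unlabeled, near class 0
              [3.9, 4.2], [4.1, 3.8]])       # unlabeled, near class 1
Y = np.eye(2)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Kmat = np.exp(-sq / 2.0)                     # assumption: isotropic Gaussian kernel
P = Kmat / Kmat.sum(axis=1, keepdims=True)   # rows sum to one, cf. Eq. (9)
F = ssmkde_closed_form(P, Y, t=np.array([0.9, 0.9]))
pred = F.argmax(axis=1)                      # predicted class of every sample
```

Because each row of P sums to one, every row of the resulting F also sums to one, so the rows can be read directly as estimated posterior probabilities.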


4. The relationship between KDE-based methods and graph-based SSL methods


In this section, we discuss the relationships between KDE-based methods and existing graph-based SSL methods, which are popular in machine learning, multimedia content analysis, and related fields. First, we formalize the graph-based SSL framework, including the notation, the problem statement, and the regularization framework. Second, we reveal the relationships between KDE-based methods and graph-based SSL methods. Then we point out that the semi-supervised adaptive KDE method can be viewed as a special case of our proposed SSMKDE. This discussion, which is another contribution of this paper, helps to clarify some of the ambiguous issues in graph-based SSL methods mentioned in Section 1.


4.1. Graph-based semi-supervised learning


Graph-based SSL methods, which operate on a graph, form a large family among existing SSL methods [12,33]. These methods assume label smoothness, which requires the labeling function to simultaneously satisfy two conditions: (1) it should be close to the given truths on the labeled vertices, and (2) it should be smooth on the whole graph. These two conditions are often characterized in regularization frameworks. For completeness, we formalize the graph-based semi-supervised learning framework in this subsection, which is helpful in understanding the proposed SSMKDE as well as the relationship between the KDE-based and the graph-based learning problems discussed below.

We consider a K-class classification problem. There are l labeled samples $\{(x_1, y_1), \ldots, (x_l, y_l)\}$ ($y \in \{1, 2, 3, \ldots, K\}$) and u unlabeled samples $\{x_{l+1}, \ldots, x_{l+u}\}$. Let n = l + u; typically we have $l \ll u$. Let L = {1, 2, ..., l} be the index set of the labeled samples and U = {l + 1, ..., n} be the index set of the unlabeled samples. Let $L_i$ denote the index set of the samples with label i and $l_i$ denote the size of $L_i$, so that $L = \bigcup_i L_i$ and $l = \sum_i l_i$. Under the i.i.d. assumption, samples $x_i$ ($i \in L \cup U$) are drawn from an unknown (global) probability density function p(x), and $p(x|C_k)$ denotes the class conditional probability density function of class k. Then, the task is to assign labels to $x_i$, where $i \in U$.

Define a positive similarity function $s(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$. Based on the similarity function, an affinity matrix W is defined as $W_{ij} = s(x_i, x_j)$. Then, we define an n × K matrix $F = [F_1^T, F_2^T, \ldots, F_n^T]^T$, where $F_{ij}$ is the confidence of $x_i$ having label j. So the classification rule is to assign each sample $x_i$ a label $y_i = \arg\max_{j \le K} F_{ij}$. Define an l × K matrix $Y = [Y_1^T, Y_2^T, \ldots, Y_l^T]^T$ with $Y_{ij} = \delta(y_i = j)$, where δ is the indicator function (i.e., δ[true] = 1, δ[false] = 0).

Then a regularization framework of graph-based SSL methods can be formulated as in Eq. (11). Minimizing this criterion gives rise to the following linear system:

$$(R' + D - W)F = R''Y \tag{19}$$

where $R' \in \mathbb{R}^{n \times n}$ is defined as:

$$R'_{ij} = \begin{cases} \mu & \text{if } i = j \text{ and } j \in L \\ 0 & \text{otherwise} \end{cases} \tag{20}$$

$R'' \in \mathbb{R}^{n \times l}$ is defined as:

$$R''_{ij} = \begin{cases} \mu & \text{if } i = j \text{ and } j \in L \\ 0 & \text{otherwise} \end{cases} \tag{21}$$

and $D \in \mathbb{R}^{n \times n}$ is defined as:

$$D_{ij} = \begin{cases} \sum_{k \in L \cup U} W_{ik} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{22}$$
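As an illustration of the linear system in Eqs. (19)–(22), the following NumPy sketch builds W, D, R' and R'' for a toy dataset and solves for F. The function name and the toy data are hypothetical, and the labeled samples are assumed to come first; this is a sketch of the framework as stated here, not of any specific published implementation.

```python
import numpy as np


def graph_ssl_solve(X, Y, sigma, mu):
    """Solve (R' + D - W) F = R'' Y, with labeled samples in the first l rows."""
    n, l = X.shape[0], Y.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))     # affinity W_ij = s(x_i, x_j)
    D = np.diag(W.sum(axis=1))               # degree matrix, Eq. (22)
    R1 = np.zeros((n, n))
    R1[np.arange(l), np.arange(l)] = mu      # R' of Eq. (20)
    R2 = np.zeros((n, l))
    R2[np.arange(l), np.arange(l)] = mu      # R'' of Eq. (21)
    return np.linalg.solve(R1 + D - W, R2 @ Y)   # Eq. (19)


# Toy data: one labeled point per class, then four unlabeled points.
X_demo = np.array([[0.0, 0.0], [4.0, 4.0],
                   [0.2, 0.1], [0.1, -0.2],
                   [3.9, 4.2], [4.1, 3.8]])
Y_demo = np.eye(2)
F_demo = graph_ssl_solve(X_demo, Y_demo, sigma=1.0, mu=50.0)
labels = F_demo.argmax(axis=1)               # classification rule y_i = argmax_j F_ij
```

The diagonal of W cancels in D − W, so including or excluding self-similarities does not change the solution.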

Two methods using criteria of the above form have been proposed: [5] (where K = 2) and [59] (where K = 2 and $\mu = \infty$). There is also another important algorithm, Learning with Local and Global Consistency (LLGC), that does not fit this framework [56]. It adds another regularization term $\sum_{i \in U} \|F_i\|^2$, and the regularization item $\|F_i - F_j\|^2$ is replaced by $\left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2$. Here we regard it as a variant of the regularization framework in Eq. (11).

4.2. Relationships between KDE-based and Graph-based SSL methods


In previous research on graph-based SSL [15,24] and KDE [10], the connections between the two are not detailed. Here we provide a comprehensive discussion of the relationships between KDE-based methods and graph-based SSL. From the above sections, we can view SSKDE as a novel perspective on graph-based SSL methods. In graph-based SSL methods, the label smoothness assumption is characterized in regularization frameworks; in SSKDE, the regularization is implicitly represented by the kernel function.

Now we can provide interpretations of the issues presented in Section 1. KDE can be regarded as a supervised version of graph-based SSL (later we will compare the performance of graph-based SSL methods with KDE to empirically investigate the effect of unlabeled samples in these methods). Although graph-based SSL methods are usually considered to be discriminative and transductive, SSKDE can be viewed as a generative and inductive method (since densities have been estimated and out-of-sample data can be classified based on the densities by Bayes rule). This novel perspective can also help us extend graph-based SSL by developing semi-supervised variants of other KDE-based methods. Furthermore, it is noteworthy that the graph-based SSL method proposed in [59] is a binary classification approach, so it has to be applied in a one-against-all or one-against-one style to deal with multi-way classification problems. From the KDE perspective, however, we can see that the algorithm can in fact naturally deal with multiple classes as well.

How to extend graph-based SSL methods to out-of-sample data was once a concern, since they are generally regarded as transductive. However, the induction is clear from the perspective of SSKDE. Since densities are estimated in Eq. (4), the posterior probabilities of newly given samples can be computed by Bayes rule. Actually, we only have to replace $x_j$ with the new sample x in Eq. (7), i.e.

$$\hat{P}(C_k|x) = \frac{\sum_{i \in L \cup U} P(C_k|x_i)\, k(x - x_i)}{\sum_{i \in L \cup U} k(x - x_i)} \tag{23}$$
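The induction rule of Eq. (23) amounts to a kernel-weighted average of the training posteriors. A minimal sketch follows, assuming an isotropic Gaussian kernel with a hypothetical bandwidth `sigma` and a hypothetical function name; in SSMKDE the kernel of Eq. (10) would take its place.

```python
import numpy as np


def induce_posterior(x_new, X, F, sigma=1.0):
    """Eq. (23): posterior of a new sample as a kernel-weighted mean of the rows of F.

    X : (n, d) training samples (labeled and unlabeled)
    F : (n, K) posterior estimates of the training samples
    """
    # Assumption: isotropic Gaussian kernel k(x - x_i); the normaliser cancels.
    w = np.exp(-np.sum((X - x_new) ** 2, axis=1) / (2.0 * sigma ** 2))
    # If x_new is extremely far from every sample, w.sum() can underflow to zero.
    return w @ F / w.sum()


X_train = np.array([[0.0, 0.0], [4.0, 4.0]])
F_train = np.eye(2)                          # sample 0 is class 0, sample 1 is class 1
p_new = induce_posterior(np.array([0.5, 0.0]), X_train, F_train)
```

No retraining is involved: the new sample only reads the stored posteriors, which is what makes the method inductive.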

Eq. (23) is equivalent to the induction method proposed in [58], where the authors have mentioned that the equation ‘‘happens to have the form of Parzen window regressors’’. In this work, we have clarified the underlying reason. The well-known label propagation process [58], which can be viewed as an iterative solution method of Eq. set (17), now has a novel probabilistic explanation in SSKDE:

Algorithm 1. Iterative solution process of SSKDE

Step 1. Initialize the posterior probability matrix $F = [F_L^T, F_U^T]^T$.
Step 2. Estimate densities according to Eq. (4), and then re-compute posterior probabilities according to Bayes rule as shown in Eq. (7). This can be directly expressed as $F = PF$.
Step 3. Adjust the posterior probabilities of the labeled samples: $F_L = (I - T)F_L + TY$.
Step 4. Repeat from Step 2 until F converges.
Step 5. Label each unlabeled sample $x_i$ as $y_i = \arg\max_{j \le K} F_{ij}$.
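The steps of Algorithm 1 can be sketched as a simple fixed-point iteration. The sketch assumes a row-normalised kernel matrix P with the labeled samples in the first l rows, a uniform initialisation of the unlabeled rows, and a hypothetical stopping tolerance; none of these choices is prescribed by the original text.

```python
import numpy as np


def sskde_iterate(P, Y, t, tol=1e-12, max_iter=10000):
    """Iterate Steps 2-3 of Algorithm 1 until F stops changing."""
    n, (l, K) = P.shape[0], Y.shape
    F = np.full((n, K), 1.0 / K)             # Step 1: uniform start (assumption)
    F[:l] = Y
    T = t[:, None]                           # per-sample weights t_i of labeled rows
    for _ in range(max_iter):
        F_new = P @ F                        # Step 2: F = P F
        F_new[:l] = (1.0 - T) * F_new[:l] + T * Y   # Step 3: pull labeled rows to Y
        done = np.max(np.abs(F_new - F)) < tol      # Step 4: convergence check
        F = F_new
        if done:
            break
    return F


# Toy problem: two labeled points followed by four unlabeled points.
X_it = np.array([[0.0, 0.0], [4.0, 4.0],
                 [0.2, 0.1], [0.1, -0.2],
                 [3.9, 4.2], [4.1, 3.8]])
Y_it = np.eye(2)
t_it = np.array([0.9, 0.9])
sq_it = np.sum((X_it[:, None, :] - X_it[None, :, :]) ** 2, axis=-1)
K_it = np.exp(-sq_it / 2.0)                  # assumption: isotropic Gaussian kernel
P_it = K_it / K_it.sum(axis=1, keepdims=True)
F_it = sskde_iterate(P_it, Y_it, t_it)
y_it = F_it[2:].argmax(axis=1)               # Step 5: labels of the unlabeled samples
```

At convergence F satisfies both lines of Eq. (15), so the iteration and the closed form of Eq. (16) describe the same fixed point.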

The convergence of this algorithm can be easily proved. Step 2 corresponds to an EM-style refinement of the posterior probabilities.

Besides being closely connected to graph-based methods, SSKDE can also be viewed as an alternative to another family of SSL methods: Semi-Supervised Parametric Density Estimation (SSPDE),¹ which has already been studied intensively [14,32]. SSKDE and SSPDE are developed on non-parametric and parametric models, respectively. It has been proved that unlabeled samples are detrimental in SSPDE when the prior model assumption is incorrect [14], but there is no such problem in SSKDE (kernel density estimates always converge to the truths when samples are sufficient). Thus, SSKDE can be applied in more applications, whereas it is difficult for SSPDE to assume an accurate prior model.

4.2.1. Relationship between SSAKDE and SSMKDE

Having shown the connection between SSKDE and the graph-based SSL methods, in the following we further discuss the relationship between the semi-supervised adaptive KDE method (SSAKDE, an improved version of SSKDE) proposed in [47] and our SSMKDE. In SSAKDE, the kernel function is selected as:

$$k(x, x_j) = \frac{1}{(2\pi)^{d/2}\, \sigma_j^{2}} \exp\left(-\frac{\|x - x_j\|^{2}}{\sigma_j^{2}}\right) \tag{24}$$

and

$$\sigma_i = n\sigma \left( \frac{\sum_{x_k \in N_i} \|x_k - x_i\|^{2}}{\sum_{i=1:n} \sum_{x_k \in N_i} \|x_k - x_i\|^{2}} \right) \tag{25}$$

1 More generally, these methods are named semi-supervised learning on mixture model (as in [14]) or semi-supervised learning based on EM (as in [32]). Here we denote them by SSPDE as they are based on parametric models and this name is closer to SSKDE.


In comparison with our proposed SSMKDE, SSAKDE adaptively selects only the size of the kernel. As shown in Eq. (2), the proposed SSMKDE adaptively selects not only the size but also the shape of the kernel. Hence, SSAKDE can be considered a special case of SSMKDE.
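To make the special-case claim concrete, the following short derivation is a sketch under one assumption: the local covariance in Eq. (10) is replaced by an isotropic matrix $C_i = s_i^2 I$, where $s_i$ is a hypothetical per-sample scale that does not appear in the original text. Under that assumption the exponent of the manifold kernel collapses to that of an isotropic Gaussian:

```latex
\begin{aligned}
x^{T} C_i^{-1} x &= x^{T} \left(s_i^{2} I\right)^{-1} x = \frac{\|x\|^{2}}{s_i^{2}}, \\
\exp\!\left(-\frac{1}{2\sigma^{2}}\, x^{T} C_i^{-1} x\right)
  &= \exp\!\left(-\frac{\|x\|^{2}}{2\sigma^{2} s_i^{2}}\right).
\end{aligned}
```

Up to the normalising constant, this has the form of the exponent in Eq. (24) with a per-sample squared bandwidth of $2\sigma^{2} s_i^{2}$: only the size of the kernel varies between samples, not its shape.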


5. Experiment


To evaluate the performance of the proposed SSMKDE method for automatic image annotation, we conduct experiments on the following datasets:

- Corel dataset with 5000 images. The dataset has been used in [25,40]. The images are grouped into 50 categories, each of which contains 100 images. We adopt the 144-D HSV correlogram as the feature set for this 50-class classification task.
- "Handwritten digit recognition" dataset from Cedar Buffalo [28]. The dataset has been used in [59] as well. From this dataset we generate three classification tasks, i.e., classification of the digits '0', '1', '2', and '3'; classification of even versus odd digits; and 10-way classification of all digits.
- "Letter Image Recognition" dataset. From this dataset we generate three classification tasks as well, i.e., classification of the letters 'I' and 'O' versus 'J' and 'Q'; classification of the letters 'A'–'M' versus 'N'–'Z'; and 26-way classification of all letters.
- "Ionosphere" dataset [2].
- "Segmentation" dataset [3].

The last three datasets are all from UCI MLR [1]. Table 2 summarizes the nine classification tasks. For each task, we use several labeled data sizes l and perform 10 trials for each l (the labeled data sizes are also listed in Table 2). In each trial we randomly select the labeled samples and use the rest of the samples as testing data. If any class does not contain labeled samples in a trial, we redo the sampling. We compare the results, averaged over the 10 trials, of the following methods:

Table 2
Information of classification tasks.

Task               Attribute   Size     Class   Training set size
Digit 0/1/2/3      256         4400     4       20, 20, ..., 200
Digit even/odd     256         11,000   2       20, 20, ..., 200
Digit 10-way       256         11,000   10      20, 20, ..., 200
Letter IO/JQ       16          3038     2       20, 20, ..., 200
Letter A–M/N–Z     16          20,000   2       20, 20, ..., 200
Letter 26-way      16          20,000   26      50, 100, ..., 500
Ionosphere         34          351      2       5, 10, ..., 50
Corel              144         5000     50      50, 100, ..., 500
Segmentation       19          2310     7       10, 20, ..., 100

Fig. 1. Performance comparison of SVM, KDE, SSKDE, LLGC, and SSMKDE on the Corel dataset.

- SVM with an RBF kernel.
- KDE.
- SSKDE.
- LLGC.
- SSMKDE.
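The sampling protocol described above (random labeled subsets, redrawn whenever a class is left without labeled samples, with results averaged over 10 trials) can be sketched as follows. The function names and the `evaluate` callback are hypothetical placeholders; any of the compared methods could be plugged in as `evaluate`.

```python
import numpy as np


def sample_labeled_indices(y, l, rng, max_tries=1000):
    """Draw l labeled indices at random, redrawing until every class is present."""
    n_classes = np.unique(y).size
    for _ in range(max_tries):
        idx = rng.choice(len(y), size=l, replace=False)
        if np.unique(y[idx]).size == n_classes:
            return idx
    raise RuntimeError("could not draw a labeled set covering all classes")


def run_trials(y, l, evaluate, n_trials=10, seed=0):
    """Average the error returned by evaluate(labeled_idx, test_idx) over trials."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        labeled = sample_labeled_indices(y, l, rng)
        test = np.setdiff1d(np.arange(len(y)), labeled)   # the rest is test data
        errors.append(evaluate(labeled, test))
    return float(np.mean(errors))
```

The `max_tries` guard is an added safety measure for the case where l is smaller than the number of classes, in which case no valid draw exists.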


Fig. 2. Performance comparison of SVM, KDE, SSKDE, LLGC, and SSMKDE on the Digit dataset.


Since there is no reliable model selection approach when labeled samples are extremely limited, we empirically tune the following parameters to their optimal values: the parameter σ in the last three methods, and the radius parameter γ of the RBF kernel and the trade-off c between training error and margin in the SVM model. In addition, the parameter μ is set to 50 in SSKDE and SSMKDE and to 0.01 in LLGC. The size of the neighborhood N in SSMKDE is set to 16. These settings follow the works in [2,9,26,30], which have shown promising performance. Figs. 1–5 illustrate the performance comparison of the five methods on the five datasets, respectively.

Fig. 3. Performance comparison of SVM, KDE, SSKDE, LLGC, and SSMKDE on the Letter dataset.


5.1. The comparison of KDE, SSKDE, and SSMKDE


From the figures we can observe that, although SSKDE performs better than KDE in the other six tasks, it is only comparable to or even worse than KDE in the first three classification tasks. This indicates that unlabeled data do not necessarily improve performance, which can be attributed to the inaccuracy of the estimated posterior probabilities of unlabeled samples. Revisiting the process in Algorithm 1, we observe that it is a propagation process. When the posterior probabilities of unlabeled samples have too large biases, the biases may propagate among the samples, so that the performance degrades and SSKDE may be even worse than the traditional KDE method. In Section 3.2, we assume that $P(C_k) \approx \frac{1}{l+u} \sum_{i=1:(l+u)} P(C_k|x_i)$, but in this case the approximation does not hold due to the biases in the estimated posterior probabilities. This is the reason why unlabeled samples are detrimental. In [59], Zhu et al. have analyzed this phenomenon as well, and they propose a method named class mass normalization to adjust the posterior probabilities as a post-processing step. The problem is alleviated in SSMKDE, whose superiority is evident in all experiments, since more accurate posterior probabilities are obtained.


5.2. The comparison of SSKDE, LLGC, and SSMKDE


Fig. 4. Performance comparison of SVM, KDE, SSKDE, LLGC, and SSMKDE on the Ionosphere dataset.

Fig. 5. Performance comparison of SVM, KDE, SSKDE, LLGC, and SSMKDE on the Segmentation dataset.

In this subsection, we compare SSMKDE with LLGC. LLGC is usually believed to be superior to SSKDE, since it is based on the normalized graph Laplacian while SSKDE is based on the graph Laplacian [26]. We regard LLGC and SSMKDE both as variants of the SSKDE method (the former from the spectral graph theory viewpoint [20], the latter from the KDE perspective), so it is instructive to compare these three methods. The experimental results in the figures show that LLGC outperforms SSKDE in most tasks; the exceptions are the three letter classification tasks, in which LLGC is slightly worse than SSKDE. This generally reconfirms the superiority of LLGC over SSKDE [26]. However, SSMKDE is clearly superior to LLGC: it outperforms LLGC in all tasks except Digit0/1/2/3, and even in this task SSMKDE is only slightly worse.
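To make the distinction between the graph Laplacian and the normalized graph Laplacian in Section 5.2 concrete, the following sketch computes both for a small similarity graph using the standard definitions from spectral graph theory [20]: L = D - W and L_norm = I - D^{-1/2} W D^{-1/2}, where D is the diagonal degree matrix. The toy matrix `W` and the function names are hypothetical, and the sketch assumes every vertex has positive degree.

```python
import math
from typing import List

Matrix = List[List[float]]


def degrees(W: Matrix) -> List[float]:
    """Degree of each vertex: the row sums of the similarity matrix."""
    return [sum(row) for row in W]


def graph_laplacian(W: Matrix) -> Matrix:
    """Unnormalized graph Laplacian L = D - W."""
    d = degrees(W)
    n = len(W)
    return [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def normalized_laplacian(W: Matrix) -> Matrix:
    """Normalized graph Laplacian I - D^{-1/2} W D^{-1/2}.

    Assumes all degrees are strictly positive.
    """
    d = degrees(W)
    n = len(W)
    return [[(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(d[i] * d[j])
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Toy path graph on three vertices (illustrative only).
    W = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
    for row in graph_laplacian(W):
        print(row)
    for row in normalized_laplacian(W):
        print(row)
```

The normalization divides each edge weight by the square root of the product of the two endpoint degrees, so high-degree vertices do not dominate the smoothness penalty.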

Fig. 6. Induction performance of SSKDE and SSMKDE. Unlabeled training samples increase gradually. Error rates are estimated on all unlabeled samples (that is why the curves are denoted by "Transd & Ind").


5.3. The induction of SSKDE and SSMKDE


We choose three tasks, namely Digit0/1/2/3, Letter IO/JQ, and Ionosphere, to test the induction performance of SSKDE and SSMKDE. We randomly split the unlabeled samples into two sets: unlabeled training samples and unlabeled testing samples. We first run SSKDE and SSMKDE on the labeled samples and the unlabeled training samples, and then induce the labels of the unlabeled testing samples. In the three tasks, l is set to 20, 20, and 5, respectively. We gradually increase the number of unlabeled training samples and estimate error rates on all unlabeled samples (we cannot estimate error rates on the unlabeled testing samples alone, since their varying sizes may introduce bias). Fig. 6 shows the results. For comparison, the figure also shows the transduction results (i.e., the results shown in Figs. 2–4) with dashed lines. In the first task, the performance of SSKDE degrades as the number of unlabeled training samples increases, which again demonstrates the negative impact that unlabeled samples can have. In contrast, the performance of SSMKDE improves as the number of unlabeled training samples increases, which indicates that SSMKDE uses unlabeled samples effectively. In all three tasks, the induction performance of SSMKDE is better than that of SSKDE. As the number of unlabeled training samples increases, both SSKDE and SSMKDE converge to the results obtained by their corresponding transduction.
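The induction step described above, in which labels are predicted for samples that were not part of the semi-supervised training set, can be sketched as a kernel-weighted average of the posteriors already estimated on the training samples. This is one common induction rule for kernel-based SSL methods and is shown here only as an assumption; the paper's own induction formula is defined in its earlier sections and is not reproduced in this sketch. The isotropic Gaussian kernel, the bandwidth `sigma`, and the function names are hypothetical, and the adaptive manifold kernels of SSMKDE are not modeled.

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float],
                    sigma: float) -> float:
    """Isotropic Gaussian kernel value (unnormalized) between two points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def induce_posterior(x_new: Sequence[float],
                     train_points: List[Sequence[float]],
                     train_posteriors: List[List[float]],
                     sigma: float) -> List[float]:
    """Posterior of an unseen sample as a kernel-weighted average.

    P(C_k | x) = sum_i K(x, x_i) P(C_k | x_i) / sum_i K(x, x_i),
    where x_i ranges over labeled and unlabeled training samples.
    """
    weights = [gaussian_kernel(x_new, x, sigma) for x in train_points]
    total = sum(weights)
    num_classes = len(train_posteriors[0])
    if total == 0.0:
        # The sample is too far from all training points: fall back to uniform.
        return [1.0 / num_classes] * num_classes
    return [
        sum(w * p[k] for w, p in zip(weights, train_posteriors)) / total
        for k in range(num_classes)
    ]


if __name__ == "__main__":
    # Toy one-dimensional training set with estimated posteriors.
    points = [[0.0], [1.0], [5.0], [6.0]]
    posteriors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
    print(induce_posterior([0.5], points, posteriors, sigma=1.0))
    print(induce_posterior([5.5], points, posteriors, sigma=1.0))
```

Under this rule, adding more unlabeled training samples changes the induced labels only through the posteriors estimated on them, which is consistent with the observation that biased posteriors can hurt induction as well as transduction.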


6. Conclusion


In this paper, we introduce a semi-supervised manifold kernel density estimation (SSMKDE) method for automatic multimedia annotation. The proposed SSMKDE adopts adaptive kernels according to the local structure of the data distribution, and semi-supervised learning is conducted to leverage both labeled and unlabeled image data. We have also shown that graph-based SSL methods can be regarded as semi-supervised KDE approaches, which incorporate unlabeled data into the classical KDE scheme. This perspective helps to better understand graph-based SSL: it allows us to investigate the effect of unlabeled samples, suggests an induction approach, and, more importantly, leads to improved algorithms based on existing work on KDE. We apply SSMKDE to several automatic image annotation tasks. Experimental results demonstrate that SSMKDE significantly outperforms SSKDE in all evaluations. In the future, we plan to extend the existing KDE-based methods in a similar way and apply them to the video annotation scenario.


7. Uncited references

[16,17,21].

References

[1] UCI Repository of Machine Learning Databases, 2012. (retrieved in October 2012).
[2] Ionosphere Dataset, 2012. (retrieved in October 2012).
[3] Segmentation Dataset, 2012. (retrieved in October 2012).
[4] K. Barnard, P. Duygulu, N. Freitas, D. Forsyth, D. Blei, M.I. Jordan, Matching words and pictures, Journal of Machine Learning Research 3 (2003) 1107–1135.
[5] M. Belkin, L. Matveeva, P. Niyogi, Regularization and semi-supervised learning on large graphs, in: Proceedings of The Conference on Learning Theory, 2004, pp. 624–638.
[6] A. Blum, S. Chawla, Learning from labeled and unlabeled data using graph mincuts, in: Proceedings of International Conference on Machine Learning, 2001, pp. 19–26.
[7] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of The Conference on Learning Theory, 1998, pp. 92–100.
[8] G. Carneiro, A.B. Chan, P.J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (3) (2007) 394–410.
[9] C. Chang, C. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 1–27.
[10] O. Chapelle, Active learning for parzen window classifier, in: Proceedings of International Conference on Artificial Intelligence and Statistics, 2005, pp. 49–56.
[11] O. Chapelle, A. Zien, B. Scholkopf, Semi-Supervised Learning, MIT Press, 2006.
[12] H. Chen, L. Li, J. Peng, Error bounds of multi-graph regularized semi-supervised classification, Information Sciences 179 (12) (2009) 1960–1969.
[13] J. Cheng, W. Bian, D. Tao, Locally regularized sliced inverse regression based 3D hand gesture recognition on a dance robot, Information Sciences 221 (1) (2013) 274–283.
[14] I. Cohen, F.G. Cozman, N. Sebe, M.C. Cirelo, T.S. Huang, Semi-supervised learning of classifiers: theory, algorithms and their application to human-computer interaction, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (12) (2004) 1553–1567.
[15] O. Delalleau, Y. Bengio, N.L. Roux, Efficient non-parametric function induction in semi-supervised learning, in: Proceedings of International Conference on Artificial Intelligence and Statistics, 2005, pp. 96–103.
[16] L. Devroye, The equivalence of weak, strong and complete convergence in L1 for kernel density estimates, The Annals of Statistics 11 (3) (1983) 896–904.
[17] L. Devroye, A. Krzyzak, An equivalence theorem for L1 convergence of the kernel regression estimate, Journal of Statistical Planning and Inference 23 (1) (1989) 71–82.
[18] L. Devroye, C.S. Penrod, The consistency of automatic kernel density estimates, The Annals of Statistics 12 (4) (1984) 1231–1249.
[19] Y. Gao, M. Wang, R. Ji, Z. Zha, J. Shen, k-Partite graph reinforcement and its application in multimedia information retrieval, Information Sciences 194 (1) (2012) 224–239.
[20] F.C. Chung, Spectral Graph Theory, American Mathematical Society, 1997.
[21] D. Grangier, S. Bengio, A discriminative kernel-based approach to rank images from text queries, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (8) (2008) 1371–1384.
[22] M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid, Tagprop: discriminative metric learning in nearest neighbor models for image auto-annotation, in: Proceedings of International Conference on Computer Vision, 2009, pp. 309–316.
[23] Y. Han, F. Wu, Q. Tian, Y. Zhuang, Image annotation by input-output structural grouping sparsity, IEEE Transactions on Image Processing 21 (6) (2012) 3066–3079.
[24] J.R. He, J. Carbonell, Y. Liu, Graph-based semi-supervised learning as a generative model, in: Proceedings of International Joint Conference on Artificial Intelligence, 2007, pp. 2492–2497.
[25] J.R. He, M.J. Li, H.J. Zhang, H.H. Tong, C.S. Zhang, Manifold-ranking based image retrieval, in: Proceedings of ACM Conference on Multimedia, 2004, pp. 9–16.
[26] J. Huang, A Combinatorial View of Graph Laplacians, Max Planck Institute Technical Report 144, 2005.
[27] N. Huang, Y. Tzang, H. Chang, C. Ho, Enhancing P2P overlay network architecture for live multimedia streaming, Information Sciences 180 (17) (2010) 3210–3231.
[28] J.J. Hull, A database for handwritten text recognition research, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (5) (1994) 550–554.
[29] A.J. Izenman, Recent developments in nonparametric density estimation, Journal of American Statistical Association 86 (413) (1991) 205–224.
[30] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation, in: Proceedings of the European Conference on Computer Vision, 2008, pp. 316–329.
[31] F. Monay, D. Gatica-Perez, Plsa-based image auto-annotation: Constraining the latent space, in: Proceedings of ACM Multimedia, 2004, pp. 348–351.
[32] K. Nigam, A.K. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning 39 (2-3) (2000) 103–134.
[33] Z. Pan, X. You, H. Chen, D. Tao, B. Pang, Generalization performance of magnitude-preserving semi-supervised ranking with graph-based regularization, Information Sciences 221 (1) (2013) 284–296.
[34] E. Parzen, On the estimation of a probability density function and the mode, Annals of Mathematical Statistics 33 (3) (1962) 1065–1076.
[35] C. Rosenberg, M. Hebert, H. Schneiderman, Semi-supervised self-training of object detection models, in: Proceedings of Workshop on Applications of Computer Vision, 2005, pp. 29–36.
[36] R.S. Sain, Adaptive Kernel Density Estimation, Ph.D. Thesis, Rice University, 1994.
[37] M. Seeger, Learning with Labeled and Unlabeled Data, Technical report, Edinburgh University, 2001.
[38] Y. Shao, Y. Zhou, X. He, D. Cai, H. Bao, Semi-Supervised Topic Modeling for Image Annotation, in: Proceedings of ACM Multimedia, 2009, pp. 521–524.
[39] H. Shin, N.J. Hill, G. Ratsch, Graph-based semi-supervised learning with sharper edges, in: Proceedings of European Conference on Machine Learning, 2006, pp. 401–412.
[40] H. Tong, J.R. He, M.J. Li, C.S. Zhang, W.Y. Ma, Graph-based multi-modality learning, in: Proceedings of ACM Conference on Multimedia, 2005, pp. 862–871.
[41] P. Vincent, Y. Bengio, Manifold parzen windows, in: Proceedings of Advances in Neural Information Processing System, 2003, pp. 1–8.
[42] M. Wang, X. Hua, J. Tang, R. Hong, Beyond distance measurement: constructing neighborhood similarity for video annotation, IEEE Transactions on Multimedia 11 (3) (2009) 465–476.
[43] B. Liu, M. Wang, R. Hong, Z. Zha, X. Hua, Joint learning of labels and distance metric, IEEE Transactions on System, Man and Cybernetics, Part B 50 (3) (2010) 973–978.
[44] J. Tang, X. Hua, G. Qi, M. Wang, T. Mei, X. Wu, Structure-sensitive manifold ranking for video concept detection, in: Proceedings of ACM Multimedia, 2007, pp. 852–861.
[45] F. Wang, C. Zhang, Label propagation through linear neighborhoods, in: Proceedings of International Conference on Machine Learning, 2006, pp. 985–992.
[46] M. Wang, X. Hua, Y. Song, X. Yuan, S. Li, H. Zhang, Video annotation by semi-supervised learning with kernel density estimation, in: Proceedings of ACM Multimedia, 2006, pp. 967–976.
[47] M. Wang, X. Hua, T. Mei, R. Hong, G. Qi, Y. Song, L. Dai, Semi-supervised kernel density estimation for video annotation, Computer Vision and Image Understanding 113 (3) (2009) 384–396.
[48] L. Xie, Y. Yang, Z. Liu, On the effectiveness of subwords for lexical cohesion based story segmentation of Chinese broadcast news, Information Sciences 181 (13) (2011) 274–283.
[49] O. Yakhnenko, V. Honavar, Annotating images and image objects using a hierarchical dirichlet process model, in: Proceedings of International Workshop on Multimedia Data Mining, 2008, pp. 1–7.
[50] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, Y. Pan, A multimedia retrieval framework based on semi-supervised ranking and relevance feedback, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4) (2012) 723–742.
[51] K. Yu, V. Tresp, D. Zhou, Semi-supervised induction with basis functions, Max Planck Institute Technical Report 141, 2004.
[52] S. Zhang, J. Huang, H. Li, D.N. Metaxas, Automatic image annotation and retrieval using group sparsity, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (3) (2012) 838–849.
[53] T. Zhang, F.J. Oles, A probability analysis on the value of unlabeled data for classification problems, in: Proceedings of International Conference on Machine Learning, 2000, pp. 1191–1198.
[54] T. Zhang, C. Xu, G. Zhu, S. Liu, H. Lu, A generic framework for video annotation via semi-supervised learning, IEEE Transactions on Multimedia 14 (4) (2012) 1206–1219.
[55] Z.Q. Zhao, H. Glotin, Z. Xie, J. Gao, X. Wu, Cooperative sparse representation in two opposite directions for semi-supervised image annotation, IEEE Transactions on Image Processing 21 (9) (2012) 4218–4231.
[56] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Scholkopf, Learning with local and global consistency, in: Proceedings of Advances of Neural Information Processing, 2004, pp. 321–328.
[57] X. Zhu, Semi-supervised learning literature survey, Technical Report, University of Wisconsin-Madison, 2005.
[58] X. Zhu, Z. Ghahramani, Learning from labeled and unlabeled data with label propagation, Technical Report, Carnegie Mellon University, 2002.
[59] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using gaussian fields and harmonic functions, in: Proceedings of International Conference on Machine Learning, 2003, pp. 912–919.
[60] Y. Gao, M. Wang, Z. Zha, J. Shen, X. Li, X. Wu, Visual-textual joint relevance learning for tag-based social image search, IEEE Transactions on Image Processing 22 (1) (2013) 363–376.
[61] M. Wang, Y. Gao, K. Lu, Y. Rui, View-based discriminative probabilistic modeling for 3D object retrieval and recognition, IEEE Transactions on Image Processing 22 (4) (2013) 1395–1407.
[62] A. Martínez, P. Larrañaga, I. Inza, Bayesian classifiers based on kernel density estimation: flexible classifiers, International Journal of Approximate Reasoning 50 (2) (2009) 341–362.
[63] Y. Gao, M. Wang, D. Tao, R. Ji, Q. Dai, 3-D object retrieval and recognition with hypergraph analysis, IEEE Transactions on Image Processing 21 (9) (2012) 4290–4303.
[64] M. Wang, H. Li, D. Tao, K. Lu, X. Wu, Multimodal graph-based reranking for web image search, IEEE Transactions on Image Processing 21 (11) (2012) 4649–4661.
