Automatic image annotation via compact graph based semi-supervised learning


Mingbo Zhao a, Tommy W.S. Chow a, Zhao Zhang b,*, Bing Li c

a Department of Electronics Engineering, City University of Hong Kong, Hong Kong Special Administrative Region
b School of Computer Science and Technology, Soochow University, PR China
c School of Economics, Wuhan University of Technology, PR China

Article info

Article history: Received 20 May 2014; received in revised form 31 August 2014; accepted 10 December 2014.

Keywords: Graph based semi-supervised learning; Image annotation; Compact graph construction

Abstract

The insufficiency of labeled samples is a major problem in automatic image annotation. However, unlabeled samples are readily available and abundant. Hence, semi-supervised learning methods, which utilize partly labeled samples and a large amount of unlabeled samples, have attracted increasing attention in the field of image annotation. During the past decade, graph-based semi-supervised learning has become one of the most important research areas in semi-supervised learning. In this paper, we propose a novel and effective graph based semi-supervised learning method for image annotation. The new method is derived from a compact graph that can well grasp the manifold structure of the data. In addition, we theoretically prove that the proposed semi-supervised learning method can be analyzed under a regularized framework. It can also be easily extended to deal with out-of-sample data. Simulation results show that the proposed method achieves better performance than other state-of-the-art graph based semi-supervised learning methods. © 2014 Elsevier B.V. All rights reserved.


1. Introduction


In the real world, ever-increasing visual image data are generated from Internet surfing and daily social communication. These data can be labeled or unlabeled, and can accordingly be utilized for image retrieval, summarization, and indexing. In order to handle these datasets for the above tasks, automatic annotation is an elementary step, which can be formulated as a pattern classification problem and accomplished by learning-based techniques. Traditionally, learning-based methods fall into two categories: (1) supervised learning, which aims to predict the labels of new-coming data samples from the observed labeled set, i.e., to handle the classification problem or to preserve the discriminative information embedded in the training set; typical supervised learning methods include Support Vector Machines (SVM) [36,37] and Linear Discriminant Analysis (LDA) and its variants [1–3,17,18]; and (2) unsupervised learning, the goal of which is to handle the observed data with no labels and to grasp the intrinsic structure of the dataset; typical unsupervised learning methods include clustering [20–22] and



manifold learning methods, such as ISOMAP [23], Locally Linear Embedding (LLE) [24], and Laplacian Eigenmap (LE) [25]. In this paper, we mainly focus on the classification problem, which is traditionally a supervised learning task. In order to handle pattern classification problems such as image annotation, conventional supervised learning methods, such as Linear Discriminant Analysis (LDA) [1–3] and Support Vector Machine (SVM) [36,37], cannot deliver satisfactory classification accuracy when the number of labeled samples is insufficient. However, labeling a large number of samples is time-consuming and costly. On the other hand, unlabeled samples are abundant and can be easily obtained in the real world. Hence, semi-supervised learning (SSL) methods, which incorporate partly labeled samples and a large amount of unlabeled samples into learning, have become more effective than relying on supervised learning alone. Recently, based on the clustering and manifold assumptions, i.e., that nearby samples (or samples of the same cluster or data manifold) are likely to share the same label [4–6], graph based semi-supervised learning methods have received considerable research interest in the area of semi-supervised learning. Typical methods include Manifold Regularization (MR) [15], Semi-supervised Discriminant Analysis (SDA) [16], Gaussian Fields and Harmonic Functions (GFHF) [4], Learning with Local and Global Consistency (LLGC) [5], and the General Graph based Semi-supervised


Learning method (GGSSL) [7,8]. All of these methods represent both the labeled and unlabeled sets by a graph, and then utilize its graph Laplacian matrix to characterize the manifold structure [25,26].

In general, the abovementioned graph based semi-supervised learning methods can be divided into two categories: inductive learning methods and transductive learning methods. Inductive learning methods, such as MR [15] and SDA [16], try to induce a decision function that has a low classification error rate on the whole data space, while transductive learning methods, also known as label propagation, aim to directly propagate the label information from the labeled set to the unlabeled set along the graph, which is easier to handle and less complicated than inductive learning. Two well-known transductive learning methods are GFHF [4] and LLGC [5]. GFHF has an elegant probabilistic explanation, and its output labels are probabilistic values; however, it cannot detect outliers in the data. In contrast, LLGC can detect outliers, but its output labels are not probabilistic values. Both problems have been eliminated by GGSSL [7,8], which can both detect outliers and provide a mechanism to calculate the probabilities of data samples.

It should be noted that one important step of graph based SSL is to construct a graph with weights for characterizing the data structure. The graph is usually constructed by finding the neighbors of each sample via the k-neighborhood or ε-neighborhood in the whole dataset [23–26], and then defining the weight matrix on the graph. There are commonly two ways to define the weight matrix: one is to apply the Gaussian function [4,5,7,8,15,16], and the other is to employ the local linear reconstruction strategy [13,14]. The Gaussian function is easily manipulated in many graph based SSL methods, but estimating the optimal variance of the Gaussian function is very difficult [13]. The local linear reconstruction strategy has no such problem, as it is based on the assumption that each sample can be reconstructed by a linear combination of its neighbors; the weight matrix is then automatically calculated once the neighborhood size is fixed. However, as pointed out in [10], using the neighbors of a sample to reconstruct it may not achieve the minimum reconstruction error, and hence may not well capture the manifold structure of the dataset.

In this paper, motivated by the framework of GGSSL [7,8], we present an effective semi-supervised learning method, namely Compact Graph based Semi-supervised Learning (CGSSL), for image annotation, which is based on a newly proposed compact graph. The new graph finds the neighbors of each sample and calculates the graph weights as in [13,14]. However, since the minimum reconstruction error of a sample may not be obtained from its own neighborhood, it reconstructs each sample using the neighbors of its adjacent samples and then preserves the graph weights corresponding to the minimum error. In this way, a more compact graph can be constructed, which can well capture the manifold structure of the dataset. In addition, in order to establish the connection to the normalized graph, we further symmetrize and normalize this compact graph. With these processes, the proposed CGSSL can be theoretically analyzed from the perspective of a graph, through which we show that the proposed CGSSL can be derived from a smoothness regularized framework.
The proposed CGSSL can also be easily extended to an inductive out-of-sample version for handling new-coming data by using the same smoothness criterion. Finally, extensive simulations on image annotation and content based image retrieval show the effectiveness of the proposed CGSSL. The main contributions of this paper are as follows: (1) we propose a new compact local reconstruction graph with symmetrization and normalization for semi-supervised learning; the new graph construction strategy finds a more compact way to approximate a sample with its neighbors, which can better grasp the manifold structure embedded in the dataset, and, with the symmetrization and normalization processes, the proposed CGSSL can be analyzed theoretically under a regularized framework; (2) the proposed CGSSL is a transductive learning method, and it can also be easily extended to an inductive out-of-sample version for handling new-coming data by using the same smoothness criterion; (3) we analyze the relationships between the proposed CGSSL and other state-of-the-art graph based semi-supervised learning methods in terms of objectives, parameters and out-of-sample extensions, which is helpful for better understanding graph-based semi-supervised learning methods, and extensive simulations on image annotation and content based image retrieval verify the effectiveness of the proposed method; and (4) we further analyze the proposed CGSSL and other state-of-the-art methods from the computational-time point of view and give an explicit implementation choice; simulation results regarding computational time show that the proposed CGSSL is suitable and practical for handling large-scale image annotation tasks.

The rest of this paper is organized as follows: Section 2 provides notation and presents the proposed CGSSL; Section 3 gives a detailed analysis and out-of-sample extensions of the proposed CGSSL; extensive simulations are conducted in Section 4, and conclusions are drawn in Section 5.


2. The proposed semi-supervised learning method


Let $X = [X_l, X_u] \in \mathbb{R}^{d \times (l+u)}$ be the data matrix, where $d$ is the number of data features, and the first $l$ and the remaining $u$ samples of $X$ form the labeled set $X_l$ and the unlabeled set $X_u$, respectively. Each sample in $X_l$ is associated with a class label $c_i$, $i \in \{1, 2, \ldots, c\}$, where $c$ is the number of classes. The goal of graph based semi-supervised learning methods is to propagate the label information of the labeled set to the unlabeled set according to the distribution of both the labeled and unlabeled sets [4,5], through which the predicted labels of the unlabeled set, called soft labels, can be obtained.


2.1. Review of graph construction


In label propagation, a similarity matrix must be defined to evaluate the similarity between any two samples. The similarity matrix can be approximated by a neighborhood graph with weights on its edges. Formally, let $\tilde{G} = (\tilde{V}, \tilde{E})$ denote this graph, where $\tilde{V}$ is the vertex set of $\tilde{G}$ representing the training samples, and $\tilde{E}$ is the edge set of $\tilde{G}$, associated with a weight matrix $W$ containing the local information between nearby samples. There are many ways to define the weight matrix. A typical way is to use the Gaussian function [4,5,7,8,15,16]:

$$w_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \quad x_i \in N_k(x_j) \ \text{or} \ x_j \in N_k(x_i), \tag{1}$$


where $N_k(x_j)$ is the $k$-neighborhood set of $x_j$, and $\sigma$ is the Gaussian function variance. However, $\sigma$ is hard to determine, and even a small variation of $\sigma$ can alter the results dramatically [13]. Wang et al. have proposed another strategy to construct $\tilde{G}$ by using the neighborhood information of the samples [13,14]. This strategy assumes that each sample can be reconstructed by a linear combination of its neighbors [24], i.e., $x_i \approx \sum_{j:\, x_j \in N_k(x_i)} w_{ij}\, x_j$. It then calculates the weight matrix by solving a standard quadratic programming (QP) problem:

$$\min_{w} \left\| x_i - \sum_{j:\, x_j \in N_k(x_i)} w_{ij}\, x_j \right\|_F^2 \quad \text{s.t.} \;\; w_{ij} \ge 0, \;\; \sum_{j \in N_k(x_i)} w_{ij} = 1. \tag{2}$$

The above strategy is empirically better than the Gaussian function, as the weight matrix can be automatically calculated in closed form once the neighborhood size is fixed. However, as pointed out in [10], using the neighbors of a sample to reconstruct it may not achieve a compact result. We use Fig. 1 to elaborate this fact, in which we generate eight samples in $\mathbb{R}^2$. Taking $x_1$ as an example and letting $k = 5$, we first reconstruct $x_1$ using its own neighbors, i.e., $\{x_1, x_5, x_6, x_7, x_2\}$. Following Eq. (2), we have $\tilde{x}_1 = 0.1694 x_5 + 0.2491 x_6 + 0.4109 x_7 + 0.1706 x_2$. In this case, the reconstruction error is $\|x_1 - \tilde{x}_1\|_F^2 = 0.26852$. Note that $x_1$ is also in the neighborhood of $x_6$, i.e., $\{x_6, x_1, x_4, x_8, x_5\}$. Hence, if we use the other samples in this neighborhood, i.e., $\{x_6, x_4, x_8, x_5\}$, to reconstruct $x_1$ as $\tilde{x}_1 = 0.1694 x_6 + 0.2491 x_4 + 0.4109 x_8 + 0.1706 x_5$, the error becomes $0.04631 < 0.26852$. This indicates that reconstructing $x_1$ using the neighbors of $x_6$ is better than using its own neighbors, as the former reconstruction error is much smaller. In the next subsection, we therefore propose a more effective strategy for graph construction.
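For concreteness, Eq. (2) is a small quadratic program and can be solved with any off-the-shelf constrained solver. The following is a minimal sketch (our illustration, not the authors' implementation), assuming NumPy and SciPy are available; the helper name reconstruction_weights is ours:

```python
# Sketch: the local linear reconstruction weights of Eq. (2), solved with
# SciPy's generic SLSQP solver instead of a dedicated QP routine.
import numpy as np
from scipy.optimize import minimize

def reconstruction_weights(x_i, neighbors):
    """neighbors: (k, d) array of neighbor samples. Returns w with w >= 0 and
    sum(w) = 1 minimizing ||x_i - sum_j w_j * neighbors[j]||^2."""
    k = neighbors.shape[0]
    objective = lambda w: np.sum((x_i - w @ neighbors) ** 2)
    res = minimize(objective, np.full(k, 1.0 / k),          # uniform start
                   bounds=[(0.0, None)] * k,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    return res.x
```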


2.2. The proposed graph construction

The above analysis motivates us to propose an improved local reconstruction strategy for graph construction. Since the minimum reconstruction error of a sample may not be obtained from its own neighborhood, we search the neighborhoods of its adjacent samples for the minimum error. This yields a more compact way to reconstruct each sample. In practice, we first generate a vector $e = [e_1, e_2, \ldots, e_{l+u}] \in \mathbb{R}^{1\times(l+u)}$ to preserve the minimum errors of the samples. We then search each sample $x_j$ and its neighborhood set $N_j: x_{j_1}, x_{j_2}, \ldots, x_{j_k}$ (including $x_j$ itself). For each $x_{j_i}$, $i = 1$ to $k$, we use the other samples in $N_j$ to reconstruct it. If the reconstruction error $\tilde{e}_{j_i}$ is smaller than the value $e_{j_i}$ preserved in $e$, we replace $e_{j_i}$ with $\tilde{e}_{j_i}$ and preserve the reconstruction weights of $x_{j_i}$ in $W$. The basic steps of the proposed strategy are shown in Table 1. Fig. 2 shows the improvement of the proposed graph construction on a generated two-cycle dataset.


Table 1
An effective algorithm for calculating the reconstruction weights.

Input: Data matrix $X \in \mathbb{R}^{D\times(l+u)}$, neighborhood size $k$.
Output: Weight matrix $W = [w_1, w_2, \ldots, w_{l+u}] \in \mathbb{R}^{(l+u)\times(l+u)}$.
Algorithm:
1. Generate an error vector $e = [e_1, e_2, \ldots, e_{l+u}] \in \mathbb{R}^{1\times(l+u)}$ with each element $e_j = +\infty$ and initialize $W$ as a zero matrix.
2. for each sample $x_j$, $j = 1$ to $l+u$, do
3. Identify the $k$-neighborhood set $N_j: x_{j_1}, x_{j_2}, \ldots, x_{j_k}$.
4. for each sample $x_{j_i}$, $i = 1$ to $k$, do
5. Reconstruct $x_{j_i} \approx \tilde{x}_{j_i} = \sum_{t=1,\, t\neq i}^{k} w_{j_t} x_{j_t}$ according to Eq. (2).
6. if the reconstruction error $\tilde{e}_{j_i} = \|x_{j_i} - \tilde{x}_{j_i}\|_F^2 < e_{j_i}$, then
7. replace $e_{j_i}$ with $\tilde{e}_{j_i}$, clear the $j_i$th column of $W$ and update it with the weights $w_{j_t}$, $t = 1$ to $k$, $t \neq i$, obtained in Step 5.
8. end for
9. end for
10. Output the weight matrix $W$.
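A minimal sketch of the search in Table 1 is given below (our illustration, assuming the reconstruction_weights helper sketched in Section 2.1); for each sample it keeps the reconstruction, over all neighborhoods containing it, with the smallest error:

```python
# Sketch of Table 1: compact reconstruction weights. Column j of W stores the
# weights of the most compact reconstruction found for sample j.
import numpy as np

def compact_graph_weights(X, k):
    """X: (d, n) data matrix; k: neighborhood size."""
    n = X.shape[1]
    W = np.zeros((n, n))
    errors = np.full(n, np.inf)                    # e_j initialized to +infinity
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    neighbors = np.argsort(d2, axis=1)[:, :k]      # k-neighborhoods (self included)
    for j in range(n):
        for i, ji in enumerate(neighbors[j]):
            others = np.delete(neighbors[j], i)    # reconstruct x_ji from the rest of N_j
            w = reconstruction_weights(X[:, ji], X[:, others].T)
            err = np.sum((X[:, ji] - X[:, others] @ w) ** 2)
            if err < errors[ji]:                   # keep the smaller (more compact) error
                errors[ji] = err
                W[:, ji] = 0.0
                W[others, ji] = w
    return W
```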


2.3. Symmetrization and normalization of graph weight


Note that the improved local reconstruction weight matrix $W$ obtained by Table 1 does not satisfy the symmetry property, i.e., $W \neq W^T$, which means that it does not have a close relationship to normalized graph theory. Although this drawback can be overcome by the processes $W \leftarrow (W + W^T)/2$ and $W \leftarrow D^{-1/2} W D^{-1/2}$ [7], the resulting weight matrix still has inhomogeneous degrees. In our work, to handle this problem, we define the proposed weight matrix as follows:

$$\widetilde{W} = W D^{-1} W^T, \tag{3}$$


where $D \in \mathbb{R}^{(l+u)\times(l+u)}$ is a diagonal matrix with each element $D_{jj} = \sum_{i=1}^{l+u} W_{ij}$. It can be easily verified that $\widetilde{W}$ is symmetric and that the sum of each row or column of $\widetilde{W}$ is equal to 1; hence, $\widetilde{W}$ is also normalized. As pointed out in [7], the normalization strengthens the weights in low-density regions and weakens the weights in high-density regions, which is advantageous for handling datasets whose density varies dramatically. Following Eq. (3), note that, neglecting the diagonal matrix $D$, $WW^T$ is an inner-product matrix, with each element $(WW^T)_{ij}$ measuring the similarity between the pair of local reconstruction vectors $w_i$ and $w_j$. Given that $x_i$ and $x_j$ are close to each other and lie on the same manifold, their corresponding $w_i$ and $w_j$ tend to be similar, and $(WW^T)_{ij}$ will be large; otherwise, $(WW^T)_{ij}$ will be equal to 0 if $x_i$ and $x_j$ do not share any neighbors. The weight matrix defined in Eq. (3) can also be easily extended to out-of-sample data by using the same smoothness criterion; we discuss this issue in Section 3.2.
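As a sketch (ours, under the definition of $D$ quoted above), Eq. (3) amounts to one line of linear algebra:

```python
# Sketch of Eq. (3): W_tilde = W D^{-1} W^T, with D_jj = sum_i W_ij
# as in the formula above.
import numpy as np

def normalize_graph(W):
    D_inv = np.diag(1.0 / W.sum(axis=0))  # D_jj = sum_i W_ij
    return W @ D_inv @ W.T
```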


2.4. Label propagation process


We then predict the labels of the unlabeled samples based on the General Graph based Semi-supervised Learning (GGSSL) process [7,8]. Let $Y = [y_1, y_2, \ldots, y_{l+u}] \in \mathbb{R}^{(c+1)\times(l+u)}$ be the initial labels of all samples: for a labeled sample $x_j$, $y_{ij} = 1$ if $x_j$ belongs to the $i$th class and $y_{ij} = 0$ otherwise.


Fig. 1. Two ways of reconstructing $x_1$: eight data points with coordinates $x_1(3,3)$, $x_2(4,1)$, $x_3(5,4)$, $x_4(4,5)$, $x_5(4,2)$, $x_6(2,4)$, $x_7(3,1)$, and $x_8(1,2)$. (a) Reconstructing $x_1$ using the neighbors of $x_1$; (b) reconstructing $x_1$ using the neighbors of $x_6$.


Fig. 2. Two ways to construct the graph on a two-cycle dataset in $\{(x,y)\,|\,x\in[-2,2],\,y\in[-2,2]\}$: (a) using the strategy of Wang et al. [7], the total reconstruction error is 12.29283; (b) using the strategy of the proposed method, the total reconstruction error is 0.039. From the results, we can see that the proposed strategy constructs the graph in a more compact way.

For an unlabeled sample $x_j$, $y_{ij} = 1$ if $i = c+1$ and $y_{ij} = 0$ otherwise. Note that an additional class $c+1$ is added to represent the undetermined labels of the unlabeled samples; hence, the sum of each column of $Y$ is equal to 1 [7,8]. We also let $F = [f_1, f_2, \ldots, f_{l+u}] \in \mathbb{R}^{(c+1)\times(l+u)}$ be the predicted soft label matrix, where the $f_i$ are vectors satisfying $0 \le f_{ij} \le 1$. Let us consider an iterative process for label propagation. At each iteration, the label of each sample is partly received from its neighbors, and the rest comes from its own initial label. Thus, the label information of the data at iteration $t+1$ is:

$$F(t+1) = F(t)\, \widetilde{W}\, I_a + Y I_b, \tag{4}$$

where $I_a \in \mathbb{R}^{(l+u)\times(l+u)}$ is a diagonal matrix with each element being $\alpha_j$, $I_b = I - I_a$, and $\alpha_j$ ($0 \le \alpha_j \le 1$) is a parameter for $x_j$ that balances the initial label information of $x_j$ and the label information received from its neighbors during the iteration. According to [7], the parameter $\alpha_j$ is set to $\alpha_l$ for labeled samples and to $\alpha_u$ for unlabeled samples in the simulations. In this paper, we simply set $\alpha_l = 0$ for the labeled samples, which means that we constrain $F_l = Y_l$, and set the value of $\alpha_u$ for the unlabeled samples by fivefold cross validation. By the iterative process in Eq. (4), we have:

$$F(t) = F(0)\,(\widetilde{W} I_a)^t + Y I_b \sum_{k=0}^{t-1} (\widetilde{W} I_a)^k. \tag{5}$$

Based on the properties of matrices, i.e., $\lim_{t\to\infty} (\widetilde{W} I_a)^t = 0$ and $\lim_{t\to\infty} \sum_{k=0}^{t-1} (\widetilde{W} I_a)^k = (I - \widetilde{W} I_a)^{-1}$, the iterative process of Eq. (5) converges to:

$$F = \lim_{t\to\infty} F(t) = Y I_b\, (I - \widetilde{W} I_a)^{-1}. \tag{6}$$

Table 2
Compact graph based semi-supervised learning.

Input: Data matrix $X \in \mathbb{R}^{D\times(l+u)}$, label matrix $Y \in \mathbb{R}^{(c+1)\times(l+u)}$, the number of nearest neighbors $k$, and other related parameters.
Output: The predicted label matrix $F \in \mathbb{R}^{(c+1)\times(l+u)}$.
Algorithm:
1. Construct the neighborhood graph and calculate the weight matrix $W$ as in Table 1.
2. Symmetrize and normalize $W$ as $\widetilde{W} = W D^{-1} W^T$ in Eq. (3), where $D$ is the diagonal matrix satisfying $D_{jj} = \sum_{i=1}^{l+u} W_{ij}$.
3. Calculate the predicted label matrix $F = Y I_b (I - \widetilde{W} I_a)^{-1}$ in Eq. (6) and output $F$.

It can be easily proven that the sum of each column of $F$ is equal to 1 (see Appendix). This indicates that the elements of $F$ are probability values: $f_{ij}$ can be seen as the posterior probability of $x_j$ belonging to the $i$th class, and for $i = c+1$, $f_{ij}$ represents the probability of $x_j$ being an outlier. The basic steps of the proposed label propagation process are given in Table 2. In addition, since our proposed SSL method is based on a compact local reconstruction graph with normalization, we refer to it as Compact Graph based Semi-Supervised Learning (CGSSL). A simple example of the proposed label propagation process with outlier detection can be seen in Fig. 3. However, in many real-world applications, there may exist noisy labeled samples due to the carelessness of the annotator, and our method can also automatically correct such noisy labeled samples by setting $\alpha_l > 0$. A case in point is given in Fig. 4, where we generate a two-cycle dataset with incorrectly labeled samples.

With the proposed CGSSL, we can directly perform automatic image annotation. Specifically, given an image dataset, users only need to annotate a small part of the data samples as the labeled set, while keeping the remaining data samples as the unlabeled set. Then, by taking both the labeled and unlabeled sets as inputs, the proposed CGSSL can automatically estimate the labels of the unlabeled set based on the label propagation process and output them to the users. In other words, the class labels of the unlabeled set can be automatically annotated by the proposed CGSSL, hence realizing automatic image annotation.
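As a minimal sketch of the closed-form propagation of Eq. (6) (our illustration, assuming the normalized weights of Eq. (3)):

```python
# Sketch of Eq. (6): F = Y I_b (I - W_tilde I_a)^{-1}.
import numpy as np

def propagate_labels(W_tilde, Y, alpha):
    """W_tilde: (n, n) normalized weights; Y: (c+1, n) initial labels;
    alpha: (n,) per-sample balance parameters (alpha_l for labeled samples,
    alpha_u for unlabeled ones)."""
    n = W_tilde.shape[0]
    I_a = np.diag(alpha)
    I_b = np.eye(n) - I_a
    # Column j of the result is the soft label distribution of sample j.
    return Y @ I_b @ np.linalg.inv(np.eye(n) - W_tilde @ I_a)
```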

320

3. Analysis and extensions of the compact semi-supervised learning method


3.1. Regularization interpretation


In this subsection, we analyze the proposed CGSSL from the perspective of a graph, through which we show that the proposed CGSSL can be derived from a regularized framework. Specifically, let $\mathrm{Tr}(\cdot)$ be the trace operator and $\|\cdot\|_F$ the Frobenius norm of a matrix, i.e., $\|M\|_F^2 = \mathrm{Tr}(M^T M)$ [38]; then, we have the following theorem:



Fig. 3. The label propagation process with outlier detection: two-moon dataset (two labels per class). (a) Initial data with outliers; (b) 200 iterations; (c) 400 iterations; and (d) 800 iterations. At first, all unlabeled samples are viewed as outliers (black points). After several hundred iterations (about 800), the label information is successfully propagated to the unlabeled set (blue and red points). However, the points in the lower-left corner remain outliers, as the label information cannot be propagated to them. Here, the label assigned to an unlabeled sample $x_j$ at iteration $t$ is determined by $\max_{i \le c+1} f_{ij}(t)$. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 4. The label propagation process with noisy labels: two-cycle dataset. (a) Initial data with noisy labeled samples (one per class); (b) $\alpha_l = 0$; (c) $\alpha_l = 0.9$; and (d) $\alpha_l = 0.99$. From the simulation results, we can see that when $\alpha_l = 0$, the noisy labeled sample cannot be corrected, as setting $\alpha_l = 0$ means that the resulting labels of the labeled samples do not change from their initial labels. On the other hand, if we cannot guarantee that the label information is fully accurate, the corresponding $\alpha_l$ should be set larger than zero so that the label of a noisy labeled sample can be updated by those of its neighborhood.


Theorem 1. The predicted result of CGSSL in Eq. (6), i.e., $F = Y I_b (I - \widetilde{W} I_a)^{-1}$, is equivalent to the optimal solution that minimizes the following objective function:

$$J(F) = \sum_{i,j=1}^{l+u} \widetilde{w}_{ij}\, \|f_i - f_j\|_F^2 + \sum_{j=1}^{l+u} \mu_j\, \|f_j - y_j\|_F^2, \tag{7}$$

where $\widetilde{W}$, $F$, and $Y$ are defined as in Section 2. Here, the first term in the objective function is a regularization term, which measures the smoothness of the predicted labels on the graph. The second term is a fitting term, which measures the inconsistency between the predicted labels and the initial label assignments. The tradeoff between the two terms is controlled by $\mu_j$, where $\mu_j > 0$ is a regularization parameter for sample $x_j$. In this paper, $\mu_j$ is set to $\mu_l$ if $x_j$ is a labeled sample and to $\mu_u$ if $x_j$ is an unlabeled sample in the simulations.

Proof of Theorem 1. For convenience, we first rewrite Eq. (7) in matrix form as follows:

$$J(F) = \mathrm{Tr}(F \widetilde{L} F^T) + \mathrm{Tr}\big((F - Y)\, U\, (F - Y)^T\big), \tag{8}$$

where $\widetilde{L} = I - \widetilde{W}$ is the normalized graph Laplacian matrix, and $U$ is a diagonal matrix satisfying $U_{jj} = \mu_j$. By setting the derivative of $J(F)$ w.r.t. $F$ to zero, we have:

$$\left.\frac{\partial J(F)}{\partial F}\right|_{F = F^*} = 2 F^* \widetilde{L} + 2 (F^* - Y)\, U = 0. \tag{9}$$

Let us introduce a set of variables $\alpha_j = 1/(1 + \mu_j)$, $j \in \{1, 2, \ldots, l+u\}$, or equivalently $I_a = (I + U)^{-1}$; then, the optimal solution of Eqs. (7) and (8) can be derived as:

$$F^* = Y U (\widetilde{L} + U)^{-1} = Y U (I - \widetilde{W} + U)^{-1} = Y U (I + U)^{-1}\, (I + U) (I - \widetilde{W} + U)^{-1} = Y I_b (I - \widetilde{W} I_a)^{-1}. \tag{10}$$

The second equality holds as $\widetilde{L} = I - \widetilde{W}$, and the fourth equality holds as $I_b = I - I_a = U(I + U)^{-1}$ together with the factorization $I - \widetilde{W} + U = (I - \widetilde{W} I_a)(I + U)$. Hence, the optimal solution in Eq. (10) is equivalent to the predicted labels in Eq. (6).
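The equivalence in Eq. (10) is easy to sanity-check numerically. The snippet below (ours, not from the paper, using randomly generated stand-ins for $\widetilde{W}$, $U$ and $Y$) verifies that the label propagation form of Eq. (6) matches the regularized solution $YU(\widetilde{L}+U)^{-1}$:

```python
# Numerical check that Y I_b (I - W_tilde I_a)^{-1} == Y U (L_tilde + U)^{-1}
# when I_a = (I + U)^{-1} and I_b = I - I_a.
import numpy as np

rng = np.random.default_rng(0)
n, c = 6, 3
A = rng.random((n, n))
W_tilde = A / A.sum(axis=1, keepdims=True)   # arbitrary row-stochastic stand-in
U = np.diag(rng.random(n) + 0.1)             # U_jj = mu_j > 0
I = np.eye(n)
I_a = np.linalg.inv(I + U)                   # alpha_j = 1 / (1 + mu_j)
I_b = I - I_a
Y = rng.random((c + 1, n))

F_propagation = Y @ I_b @ np.linalg.inv(I - W_tilde @ I_a)   # Eq. (6)
F_regularized = Y @ U @ np.linalg.inv((I - W_tilde) + U)     # Eq. (10)
assert np.allclose(F_propagation, F_regularized)
```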

3.2. Out-of-sample extension

In general, the proposed SSL method is transductive, i.e., an off-line model. It cannot deal with out-of-sample data, i.e., it cannot predict the label of a new-coming sample. In this subsection, we apply the local approximation strategy to extend CGSSL to out-of-sample data. Specifically, for a new-coming test sample $z$, since we know nothing about its label information, we treat it as an unlabeled sample and utilize the same smoothness constraint as in Eq. (2) as a criterion for predicting its label. In addition, the inclusion of $z$ should not affect the prediction results of the original training samples. Following the work in [22], we first give the smoothness criterion for the new-coming sample $z$ as:

$$J(f(z)) = \sum_{j:\, x_j \in X,\, x_j \in N_k(z)} \widetilde{w}(z, x_j)\, \|f(z) - f_j\|_F^2, \tag{11}$$

where $N_k(z)$ is the neighborhood set of $z$, $\widetilde{w}(z, x_j)$ is the similarity between $z$ and $x_j$, and $f(z)$ is the predicted label of $z$. Note that the nearest neighbors of $z$, i.e., $N_k(z)$, are searched in $z \cup X$, and the weights $\widetilde{w}(z, x_j)$ satisfy the sum-to-one constraint, i.e., $\sum_{j:\, x_j \in X,\, x_j \in N_k(z)} \widetilde{w}(z, x_j) = 1$. Since $J(f(z))$ is convex in $f(z)$, it is minimized by setting the derivative of $J(f(z))$ in Eq. (11) w.r.t. $f(z)$ to zero. The optimal $f(z)$ is then given as follows:

$$f(z) = \sum_{j:\, x_j \in X,\, x_j \in N_k(z)} \widetilde{w}(z, x_j)\, f_j. \tag{12}$$


Here, we define the weight vector following the same criterion as Eq. (3). Specifically, we first calculate $w(z, x_j)$ by the same strategy as Eq. (2), which is to minimize the following objective function:

$$J(f(z)) = \left\| f(z) - \sum_{j:\, x_j \in N_k(z)} w(z, x_j)\, f_j \right\|_F^2 \quad \text{s.t.} \;\; w(z, x_j) \ge 0, \;\; \sum_{j \in N_k(z)} w(z, x_j) = 1. \tag{13}$$

We then extend $w(z)$ following the same form as Eq. (3), i.e.:

$$\widetilde{w}(z) = w(z)\, D^{-1}\, W^T. \tag{14}$$

Here, it can be easily verified that $\sum_{j:\, x_j \in X,\, x_j \in N_k(z)} \widetilde{w}(z, x_j) = 1$; then, following Eq. (12), the sum of the column of $f(z)$ is equal to 1, which indicates that each element of $f(z)$ can be seen as a probabilistic value. The basic steps for calculating the predicted label of a new-coming sample $z$ are given in Table 3. Two toy examples for evaluating the effectiveness of the proposed out-of-sample extension can be seen in Figs. 5 and 6, where we generate a two-moon dataset and a two-Swiss-roll dataset. In Fig. 5, we select two samples per class as the labeled set and the remaining as the unlabeled set, whereas in Fig. 6 we choose only one sample per class as the labeled set. In the two examples, we first adopt the proposed label propagation method to predict the labels of the unlabeled data samples. We then use the strategy in Table 3 to induce the labels of all samples in the region $\{(x,y)\,|\,x\in[-2,2],\,y\in[-2,2]\}$ for the two-moon dataset and $\{(x,y)\,|\,x\in[-8,8],\,y\in[-8,8]\}$ for the two-Swiss-roll dataset. The simulation results are given in Figs. 5a and b and 6a and b, from which we can observe that the induced decision boundary

Table 3
Out-of-sample extension of CGSSL.

Input: Data matrix $X \in \mathbb{R}^{D\times(l+u)}$, corresponding predicted labels $F \in \mathbb{R}^{(c+1)\times(l+u)}$, new-coming sample $z$, the number of nearest neighbors $k$, and other related parameters.
Output: The predicted label $f(z)$.
Algorithm:
1. Search for the $k$ nearest neighbors of $z$ in $z \cup X$.
2. Construct the weight vector following Eq. (2).
3. Extend the weight vector following Eq. (3).
4. Calculate the predicted label of $z$ as in Eq. (12) and output $f(z)$.

and region are intuitively satisfying, as they are in accordance with the intrinsic structure of the training datasets (see Figs. 5 and 6).
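A minimal sketch of the prediction steps of Table 3 follows (our illustration, assuming the reconstruction_weights helper from Section 2.1 and the training quantities X, W and F from the previous sections; for simplicity the neighbor search is restricted to the training set):

```python
# Sketch of Table 3: predict the soft label f(z) of a new-coming sample z.
import numpy as np

def predict_out_of_sample(z, X, W, F, k):
    """z: (d,) new sample; X: (d, n) training data; W: (n, n) reconstruction
    weights; F: (c+1, n) predicted training labels."""
    # Step 1: k nearest neighbors of z among the training samples.
    d2 = ((X - z[:, None]) ** 2).sum(axis=0)
    idx = np.argsort(d2)[:k]
    # Step 2: reconstruction weights w(z, x_j) following Eq. (2).
    w = np.zeros(X.shape[1])
    w[idx] = reconstruction_weights(z, X[:, idx].T)
    # Step 3: extend the weights following Eq. (14): w_tilde(z) = w(z) D^{-1} W^T.
    w_tilde = w @ np.diag(1.0 / W.sum(axis=0)) @ W.T
    # Step 4: f(z) = sum_j w_tilde(z, x_j) f_j  (Eq. (12)).
    return F @ w_tilde
```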


3.3. Discussion and related work


In this subsection, we briefly discuss the relationships between the proposed CGSSL and other state-of-the-art graph based semi-supervised learning methods. In general, based on the manifold and cluster assumptions [4–6], i.e., (1) nearby points are likely to have the same label and (2) points on the same structure (such as a cluster or a sub-manifold) are likely to have the same label, graph based semi-supervised learning first employs a graph $\tilde{G}$ to model the whole dataset, and then estimates a classification function $F$ on $\tilde{G}$ under the following constraints: (1) it should be close to the given label assignments, and (2) it should be smooth on the whole graph $\tilde{G}$. These two constraints are often characterized in a regularization framework. For completeness, we formalize the graph based semi-supervised learning framework in this subsection, which helps build the relationships between the proposed CGSSL and other methods. Specifically, similar to Eqs. (7) and (8), we first give the following regularized objective function to be optimized:


$$J(F) = \alpha\, C(F_l, Y_l) + \beta\, S(F), \tag{15}$$

where $C(\cdot)$ is a loss function that measures the inconsistency between the predicted and initial labels on $X_l$, and $S(\cdot)$ is a regularization term penalizing the smoothness of $F$ on the whole graph, approximated by the following discrete form:

$$S(F) = \mathrm{Tr}(F S F^T), \tag{16}$$

where $S \in \mathbb{R}^{(l+u)\times(l+u)}$ is the smoothness matrix. In fact, many traditional graph-based semi-supervised methods are similar to each other, differing only in their choices of loss function and regularization term. Table 4 gives some examples of these methods, such as GFHF, LLGC, GGSSL, LNP, Manifold Regularization (MR), Semi-supervised Kernel Density Estimation (SSKDE) [11,12], and the proposed CGSSL. Here, $L$ represents the combinatorial graph Laplacian [4], and $\hat{L}$ represents the normalized graph Laplacian [5]; both are derived from the weight matrix $W$. For MR, $K$ represents the induced kernel of $f$ in the reproducing kernel Hilbert space, and $\gamma_A$ and $\gamma_I$ are two parameters balancing the tradeoff between the two terms. For LNP and the proposed CGSSL, LLRW represents the Local Linear Reconstruction Weight.

Fig. 5. Illustration of out-of-sample extension using Table 3: two-moon dataset in $\{(x,y)\,|\,x\in[-2,2],\,y\in[-2,2]\}$ with two data points labeled per class: (a) the contour lines represent the boundary between the two classes, formed by the data points with estimated label values equal to 0; (b) the z-axis indicates the predicted label values, associated with different colors over the region.



Fig. 6. Illustration of out-of-sample extension using Table 3: two-Swiss-roll dataset in $\{(x,y)\,|\,x\in[-8,8],\,y\in[-8,8]\}$ with only one data point labeled per class: (a) the contour lines represent the boundary between the two classes, formed by the data points with estimated label values equal to 0; (b) the z-axis indicates the predicted label values, associated with different colors over the region. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 4
Objectives and parameters of related graph based semi-supervised learning methods.

| Methods | $C(\cdot)$ | $S$ | $W$ | $\alpha$ | $\beta$ | Constraints | Out-of-sample extension |
|---|---|---|---|---|---|---|---|
| GFHF [4] | Quadratic | $L$ | Gaussian | 1 | 1/2 | None | None |
| LLGC [5] | Quadratic | $\hat{L}$ | Gaussian | 1 | $\beta > 0$ | None | None |
| GGSSL [7,8] | Quadratic | $L$ or $\hat{L}$ | Gaussian | $I_a$ | $I_b = I - I_a$ | None | None |
| MR [15] | Arbitrary | $\frac{\gamma_A}{\gamma_I} K + L$ | Gaussian | 1 | $\gamma_I$ | $\sum_{j=1}^{l+u} f_j = 0$ | Arbitrary |
| SSKDE [11,12] | Quadratic | $L$ | Gaussian | $\alpha$ | $1 - \alpha$ | None | Yes |
| LNP [13,14] | Quadratic | $L$ | LLRW | $\alpha$ | $1 - \alpha$ | None | Yes |
| CGSSL | Quadratic | $\hat{L}$ | LLRW | $I_a$ | $I_b = I - I_a$ | None | Yes |

Table 4 shows that the different graph based semi-supervised learning methods can be unified into a regularized framework, and the proposed CGSSL can be viewed as a special case with particular choices of parameters. Following Table 4, we can observe that GFHF, LLGC, GGSSL, and MR formulate the graph weights using the Gaussian function, while LNP and the proposed CGSSL formulate them using the local linear reconstruction strategy. Compared with the traditional graph based SSL methods above, the proposed CGSSL adopts an improved local reconstruction strategy to formulate the graph weights. As analyzed in Section 2, this strategy can be viewed as an extension of LLRW that finds a more compact way to approximate a sample with its neighbors. In addition, the proposed graph construction strategy is efficient, as the graph weights can be automatically calculated in closed form once the neighborhood size is fixed. On the other hand, GFHF, LLGC, GGSSL, SSKDE and MR use the Gaussian function to construct the graph weights, which is dramatically affected by the Gaussian variance.

3.4. Analysis of computational complexity


In this subsection, we analyze the computational complexities of the different methods. Note that for most graph based methods, such as MR, GFHF, LLGC, GGSSL, LNP and the proposed CGSSL, one needs to construct a $k$ nearest neighborhood graph, with computational complexity $O((l+u)^2 k)$. For GFHF, LLGC and GGSSL, following the work in [8], the computational complexity of label propagation is $O((l+u)kc)$, where $k$ and $c$ are the numbers of neighbors and classes, respectively. Hence, the total computational complexity of GFHF, LLGC, and GGSSL is $O((l+u)^2 k + (l+u)kc)$. For LNP and the proposed CGSSL, one additionally needs to calculate the reconstruction weight matrix by solving the standard QP problems in Eq. (2), with complexities $O((l+u)k^3)$ and $O((l+u)k^4)$ for LNP and CGSSL, respectively. Note that the complexity of CGSSL in this step is $k$ times that of LNP. This is reasonable, as CGSSL calculates the reconstruction weight matrix by searching all neighborhoods of the dataset for the minimum reconstruction error. Thus, the total computational complexities of LNP and CGSSL are $O((l+u)^2 k + (l+u)k^3 + (l+u)kc)$ and $O((l+u)^2 k + (l+u)k^4 + (l+u)kc)$, respectively. For MR, one needs to calculate the inverse of the graph Laplacian matrix, with computational complexity $O(d^3)$, where $d$ is the dimensionality of the dataset. Hence, the total computational complexity of MR is $O((l+u)^2 k + d^3)$. For SSKDE, since one does not need to construct a $k$ nearest neighborhood graph, its total computational complexity comes only from the kernel density estimation process, which is $O((l+u)^2 c)$. The comparisons of the different methods are shown in Table 5. From Table 5, we have the following observations: (1) the computational complexity of MR is closely related to the dimensionality, while that of SSKDE is related to the number of samples; and (2) among the other graph based methods, the computational complexities of LNP and the proposed CGSSL are larger than those of GFHF, LLGC, and GGSSL. This is reasonable, as LNP and CGSSL additionally calculate the reconstruction weight matrix by solving standard QP problems. However, since GFHF, LLGC and GGSSL use the Gaussian function to calculate the weight matrix, some computational time must be spent on parameter selection in order to find the best Gaussian covariance.

4. Simulation

In this section, we evaluate the proposed methods on two synthetic datasets and several UCI and real-world datasets. For the synthetic datasets, we evaluate the proposed method using the one-Swiss-roll and two-cycle datasets. For the other datasets, we focus on solving image annotation problems based on seven UCI datasets and five real-world datasets, all of which are benchmark datasets. Furthermore, we compare the proposed CGSSL with state-of-the-art SSL methods. In the comparative study, we randomly split each dataset into a labeled and an unlabeled set.

Table 5
Computational complexities of different methods.

| Methods | MR | SSKDE | GFHF/LLGC/GGSSL |
|---|---|---|---|
| Computational complexity | $O((l+u)^2 k + d^3)$ | $O((l+u)^2 c)$ | $O((l+u)^2 k + (l+u)kc)$ |

| Methods | LNP | CGSSL |
|---|---|---|
| Computational complexity | $O((l+u)^2 k + (l+u)k^3 + (l+u)kc)$ | $O((l+u)^2 k + (l+u)k^4 + (l+u)kc)$ |

4.1. Toy examples for sensitivity analysis

We first use two toy examples to compare the proposed CGSSL with a Euclidean distance-based method and GGSSL, where we set $k = 8$. In the first toy example, we compare the annotation performance of the proposed CGSSL and the Euclidean distance on the one-Swiss-roll dataset, whose density is dense in the core and sparse in the surrounding area (see Fig. 7a). In this dataset, we annotate only one sample in the core of the one-Swiss-roll as the labeled set and treat the remaining samples as unlabeled samples. The simulation results of the annotation can be seen in Fig. 7b and c. For CGSSL, the marker size of each unlabeled data point is proportional to its predicted label, whereas for the Euclidean distance, the reciprocal of the distance between a given data point and the annotated point is used to evaluate the annotation results. From Fig. 7b and c, we can observe that the distance-based method fails to preserve the one-Swiss-roll structure. The reason is that the distance-based method only considers pair-wise distances and ignores the whole data distribution. In contrast, the proposed CGSSL can well annotate the labels of data points even when they are far from the annotated point, which indicates that the proposed CGSSL can better handle datasets with a complicated manifold structure.

In the second toy example, we compare the annotation performance of the proposed CGSSL and GGSSL in terms of robustness to their parameters on the two-cycle dataset, which has two classes, each following a cycle distribution. In each class, two samples are annotated as the labeled set and the remaining as the unlabeled set. The simulation results of annotation for GGSSL and CGSSL can be seen in Figs. 8 and 9. From Fig. 8, we can observe that the performance of GGSSL is quite sensitive to the Gaussian variance $\sigma$ in the Gaussian function. For example, when we set $\sigma = \tilde{\sigma}/10$, where $\tilde{\sigma}$ is half of the average distance of all pair-wise neighborhood samples, the annotation results of GGSSL are fairly good. However, when we increase $\sigma$ to $\sigma = \tilde{\sigma}$ or $\sigma = \tilde{\sigma}/0.1$, the annotation performance becomes quite poor, which indicates that

and the annotated point is used to evaluate the annotated results. From Fig. 7b and c, we can observe that the distance-based method fails to preserve the one-Swiss-roll structure. The reason for this is that the distance-based method only considers pair-wise distance, but ignores the whole data distribution. In contrast, the proposed CGSSL can well annotate the label of data points even if they are far from the annotated point, which indicates that the proposed CGSSL can better handle datasets with a complicated manifold structure. In the second toy example, we compare the annotated performance of the proposed CGSSL and GGSSL in terms of robustness to the parameters on the two-cycle dataset with two classes, and each follows a cycle distribution. In each class, two samples are annotated as the labeled set and the remaining as unlabeled set. The simulation results of annotation for GGSSL and CGSSL can be seen in Figs. 8 and 9. From Fig. 8, we can observe that the performance of GGSSL is quite sensitive to the Gaussian covariance r ~ =10, in the Gaussian function. For example, when we set r ¼ r ~ is the half of the average distance of all pair-wise neighwhere r borhood samples, the annotated results for GGSSL is fairly good. ~ or r ¼ r ~ =0:1, the annoHowever, when we increase r to r ¼ r tated performance will become quite poor, which indicates that

Fig. 7. The label propagation process for annotation: one-Swiss-roll dataset (a) initial dataset with varied-density distribution (one labeled sample); (b) Euclidean Distance; and (c) CGSSL.

Q7

Fig. 8. The label propagation process with sensitivity analysis: two-cycle dataset (two labels per class) (a)–(c) the performances of GGSSL with different Gaussian variance, (b) r ¼ r~ =0:1, (c) r ¼ r~ and (d) r ¼ r~ =10, where r~ is half of the average distance of all pair-wise neighborhood samples.

Q2 Please cite this article in press as: M. Zhao et al., Automatic image annotation via compact graph based semi-supervised learning, Knowl. Based Syst. Q1 (2014), http://dx.doi.org/10.1016/j.knosys.2014.12.014

560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581

KNOSYS 3023

No. of Pages 18, Model 5G

29 December 2014 Q2

9

M. Zhao et al. / Knowledge-Based Systems xxx (2014) xxx–xxx

Fig. 9. The label propagation process with sensitivity analysis: two-cycle dataset (two labels per class) (a)–(c) the performances of the proposed CGSSL with different lu (a) lu ¼ 1, (b) lu ¼ 10, and (c) lu ¼ 109 .

582 583 584 585 586 587 588

even a small variation of r can make the results dramatically different. However, the proposed CGSSL does not have such a parameter. In addition, from Fig. 9, we can observe that the proposed CGSSL works well in a wide range of lu from 10 to 109 , which indicates that the proposed CGSSL is more robust to parameters. Clearly, a method that is less sensitive to the parameters is more suitable for real-world annotation tasks.

589

4.2. Image annotation

590

For solving the image annotation problem, we use six UCI and six real-world datasets to evaluate the performance of the methods. The UCI datasets are available at http://archieve.ics.uci.edu/ ml, and the real-world datasets include COIL100 [27], ETH80 [28], USPS [29], MNIST [30], and Extended Yale-B dataset [31]. The COIL100 dataset consists of images of 100 objects viewed from varying angles at an interval of five degrees, resulting in 72 images per object. The size of each cropped image is 128  128, with 24bit color levels per pixel. In our simulation, we down-sample the size of images to 32  32 and transfer each image to 256 gray levels. The ETH80 dataset contains 80 objects, with each object represented by 41 views of images. The original size of each image is 128  128, with 24 bit color levels per pixel. Similarly, we downsample the size of images to 32  32 and transfer them to 256 gray levels. The USPS dataset contains 9298 gray-scale handwritten digit images scanned from envelopes by the U.S. Postal Service, with an image size of 16  16 in 256 gray levels. The MNIST dataset is a handwritten digit image dataset. It has a training set of 60,000 samples (Set A) and a testing set of 10,000 samples (Set B), with an image size of 28  28 in 256 gray levels. In our work, we randomly choose 1000 samples per class from Set A as our training set. The Extended Yale-B dataset contains 16,123 images of 38 human

591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611

subjects under nine poses and 64 illumination conditions. In our simulation, the images are cropped and resized to 32  32 pixels. This dataset now has around 64 near frontal images under different illuminations per individual. Detailed information about the datasets and some sampled images of real-world datasets are shown in Table 6 and Fig. 10. For each dataset, we randomly annotate l samples from each class as a labeled set and let the remaining samples as an unlabeled set. We then evaluate the proposed CGSSL for image annotation and compare its performance with other state-of-the-art graph based transductive semi-supervised learning methods, such as GFHF, LLGC, GGSSL, LNP, SSKDE, and one inductive learning method, such as MR. We also compare a Euclidean distance-based method in our simulation, in which we choose near neighbor classifier (1NN) for evaluation. For GFHF, LLGC, GGSSL, LNP and the proposed CGSSL, the parameter k for constructing the k neighborhood graph is set by using fivefold cross validation, and the candidate set is from 6 to 20. For the Gaussian variance r used in GFHF, LLGC, GGSSL and MR, we use the same strategy in [9] to determine it. Specially, ~ be half of the let s 2 ð0; 1Þ be a parameter to be determined and r average distance of all pair-wise neighborhood samples; then, the Gaussian variance r can be determined as:

sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi  2 ~ ~2 r s 1 r ¼ )r¼ exp ; 2 lnðs=kÞ 2r2 k

ð17Þ

Note that it is much easier to choose a suitable s between 0 and 1. In our simulation, we use fivefold cross validation to determine s, and the candidate set is from ½0:0001; 0:001; 0:01; 0:1; 1. For SSKDE, the Gaussian variance is also involved, and we determine it by using the strategy in Section 5 in [34]. For LLGC, LNP and SSKDE, the regularized parameter is involved, and we determine it from f106 ; 103 ; 101 ; 1; 10; 103 ; 106 g by using fivefold cross validation.

Table 6 Dataset information for each dataset (‘‘Balance’’ is defined as the ratio between the number of samples in the smallest class and the number of samples in the largest class). Dataset

Type

Bupa Ionosphere Heart-Statlog Iris Wine Segmentation Digit

UCI UCI UCI UCI UCI UCI UCI

COIL100 [27] ETH80 [28] USPS [29] MNIST [30] Extended Yale-B [31]

Object Object Hand-written digit Hand-written digit Face

#Samples

#Features

#Class

Balance

345 351 270 150 178 2100 5620

6 34 13 4 13 19 64

2 2 2 3 3 7 10

0.725 0.56 0.8 1 0.6761 1 0.9508

7200 3280 9298 10,000 16,123

1024 1024 256 784 1024

100 80 10 10 38

1 1 0.4559 1 1

Q2 Please cite this article in press as: M. Zhao et al., Automatic image annotation via compact graph based semi-supervised learning, Knowl. Based Syst. Q1 (2014), http://dx.doi.org/10.1016/j.knosys.2014.12.014

612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633

634

636 637 638 639 640 641 642 643

KNOSYS 3023

No. of Pages 18, Model 5G

29 December 2014 Q2

10

M. Zhao et al. / Knowledge-Based Systems xxx (2014) xxx–xxx

(a)

(b)

(c)

(d)

(e)

(f)

Fig. 10. Sample images of real-world datasets: (a) COIL100 Dataset; (b) ETH80 Dataset; (c) USPS Hand-written Dataset; (d) MNIST Hand-written Dataset; (e) Optical Digit Dataset; and (f) Extended Yale-B Dataset. 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686

For MR, Laplacian regularized Least Squares (LapRLS), a special case of MR, is chosen in our simulations. Its implementation code is downloaded from http://manifold.cs.uchicago.edu/manifold_regularization/software.html. Here, the Gaussian variance in MR is set using the same strategy as Eq. (17), and the extrinsic and intrinsic regularization parameters are searched from $\{10^{-6}, 10^{-3}, 10^{-1}, 1, 10, 10^3, 10^6\}$. In addition, for simplicity, we adopt the linear kernel in MR, i.e., $f(x_j) = W^T x_j$, where $W$ is the projection matrix, which provides a simple model to map the samples.

The average results over 20 random splits of the different methods with varied numbers of labeled samples are shown in Fig. 11. From the simulation results, we have the following observations: (1) for all of the methods, the annotation accuracies become higher as the number of labeled samples increases. For example, the accuracy of the proposed CGSSL increases by about 15% when the percentage of labeled samples varies from 5% to 50% for most UCI datasets. This increase even reaches 20% on the Extended Yale-B dataset when the number of labeled samples increases from 2 to 20 per class, which indicates that the labeled samples are actually helpful for classification. However, we also observe that, in most cases, the accuracy of the different methods no longer increases once a sufficient number of labeled samples is given; (2) all semi-supervised learning methods outperform supervised learning methods, such as 1NN, by about 4–20%. For example, given few labeled samples, the proposed CGSSL can even achieve 20% higher accuracy than 1NN on most real-world datasets. This indicates that by incorporating unlabeled samples into learning, classification accuracy can be greatly improved, and semi-supervised learning methods generally outperform conventional supervised methods, such as 1NN, when there are insufficient labeled samples; (3) among all semi-supervised learning methods, the proposed CGSSL achieves almost the best performance on the real-world datasets and some of the UCI datasets, i.e., CGSSL provides higher accuracy than the other semi-supervised methods due to the compact graph construction analyzed in Section 2. For example, CGSSL achieves a 1–3% improvement over LNP and a 5–6% improvement over GFHF, LLGC and SSKDE on most UCI and real-world datasets. This superiority even reaches 8% compared to GFHF and LLGC on the Bupa dataset. Another observation is that GGSSL achieves results competitive with the proposed CGSSL on most real-world datasets. However, it should be noted that the results of GGSSL are obtained by best tuning its parameters, while the proposed CGSSL can easily construct the compact graph and obtain its results once the neighborhood size is fixed; (4) compared with inductive methods, such as MR, we observe that MR achieves results competitive with CGSSL on most UCI datasets and the Extended Yale-B dataset, but is not superior to the other semi-supervised methods in real-world applications. The main reason is that the real-world datasets (except for Extended Yale-B) have more complex manifold structures than the UCI datasets; hence, the graph constructed by transductive learning methods, such as the proposed CGSSL, can well capture the manifold structure of the datasets, and the labels can be correctly propagated in these cases. However, for the Extended Yale-B dataset, CGSSL and the other transductive learning methods are not superior to MR, possibly because of the strong lighting variations of the images. In this case, the labels may not be properly propagated, which degrades the performance of CGSSL and the other transductive learning methods; and (5) we also compare the performance between the proposed CGSSL and SSKDE, which is a semi-supervised extension of Kernel Density Estimation based on statistical models. From the simulations, we can see that the proposed CGSSL, GGSSL, and LLGC generally outperform SSKDE on most real-world datasets, as they are based on the normalized graph Laplacian, while SSKDE is based on the graph Laplacian. For example, on the COIL100 and ETH80 datasets, the proposed CGSSL achieves approximately 5–6% improvement over SSKDE, which generally confirms its superiority over SSKDE.

It should also be pointed out that the proposed CGSSL does not achieve better performance on some UCI datasets and the Extended Yale-B dataset. The reasons can be as follows: (1) some UCI datasets, such as the Iris, Wine and Segmentation datasets, usually have low dimensionality and a very simple data structure, which means that their manifold structure may not be very clear and their patterns are quite easy to classify. Therefore, the superiority of CGSSL is not demonstrated, and the proposed CGSSL only achieves competitive, but not much better, performance compared with the other state-of-the-art graph based semi-supervised learning methods; and (2) the Extended Yale-B face dataset contains the images of 38 human subjects, each with 64 different lighting variations. Because of the strong lighting variations of the face images, two face images with the same lighting variation but from different human subjects can be closer than two face images with different lighting variations that actually belong to the same human subject. This indicates that using the


Fig. 11. The average accuracy over 20 random splits for different datasets: (a) Bupa; (b) Ionosphere; (c) Heart-Statlog; (d) Wine; (e) Segmentation; (f) Iris; (g) COIL100; (h) ETH80; (i) USPS; (j) MNIST; (k) Digit; and (l) Extended Yale-B.


This indicates that the conventional neighborhood graph cannot well capture the correct manifold structure of the dataset, as we expect the images of the same human subject to be closer. As a result, the labels from the labeled set cannot be properly propagated to the unlabeled set, which degrades the performance of CGSSL and the other graph based learning methods.

4.3. Comparison between transductive learning and inductive out-of-sample extension

In this subsection, we compare the performance of the transductive method in Table 2 and the inductive out-of-sample

extension in Table 3 for the proposed CGSSL. In this study, we conduct the simulations on different real-world datasets with varied numbers of labeled samples. Specifically, we randomly select 20% of the samples of each dataset as a testing set and keep the remaining samples as a training set; we then split the training set into a labeled set and an unlabeled set. Next, we compare the transductive version of CGSSL with the inductive extension of CGSSL. Since the transductive CGSSL cannot deal with testing samples, for comparison we regard the testing samples as unlabeled and integrate them into the training stage. The average results over 20 random splits for different datasets with varied numbers of labeled samples are shown in Fig. 12. From


the simulation results, we have the following observations: (1) similar to the transductive CGSSL, the accuracies obtained by the inductive extension of CGSSL increase as the number of labeled samples increases. For example, the accuracy of the inductive extension of CGSSL increases by about 20–25% when the number of labeled samples varies from 1 to 20 on the ETH80 and Extended Yale-B datasets; and (2) the accuracies obtained by the transductive CGSSL are slightly better than those obtained by the inductive extension of CGSSL. This is natural, as we have treated the testing samples as unlabeled samples when performing the transductive CGSSL. However, the results of the inductive extension of CGSSL are satisfactory and credible, which demonstrates the effectiveness of the inductive extension in handling new-coming samples.
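To make this contrast concrete, the following Python sketch (our own illustration rather than the released implementation; the Gaussian kNN graph, the GGSSL-style update F <- I_a W F + I_b Y, and all function names are assumptions standing in for CGSSL's compact reconstruction graph) runs one transductive propagation pass and then predicts a new sample inductively by smoothing the soft labels of its neighbors with the same affinity weights:

import numpy as np

def gaussian_knn_graph(X, k=5, sigma=1.0):
    # Row-stochastic affinity over a symmetrized kNN support.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    nn = np.argsort(-A, axis=1)[:, :k]
    M = np.zeros_like(A, dtype=bool)
    np.put_along_axis(M, nn, True, axis=1)
    A = np.where(M | M.T, A, 0.0)
    return A / A.sum(axis=1, keepdims=True)

def propagate(W, Y, alpha, iters=200):
    # Transductive pass: F converges to (I - I_a W)^(-1) I_b Y.
    Ia, Ib = alpha[:, None], (1.0 - alpha)[:, None]
    F = Y.copy()
    for _ in range(iters):
        F = Ia * (W @ F) + Ib * Y
    return F

def induct(x_new, X_train, F_train, k=5, sigma=1.0):
    # Inductive out-of-sample step: smooth the estimated soft labels of
    # the new sample's neighbors with the same affinity weights.
    d2 = ((X_train - x_new) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    nn = np.argsort(-w)[:k]
    w = w[nn] / w[nn].sum()
    return w @ F_train[nn]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(2, 0.3, (30, 2))])
Y = np.zeros((60, 2)); Y[0, 0] = 1.0; Y[30, 1] = 1.0   # one label per class
alpha = np.where(Y.sum(axis=1) > 0, 0.0, 0.8)          # alpha_l=0, alpha_u=0.8
F = propagate(gaussian_knn_graph(X), Y, alpha)
print(F.argmax(axis=1))                                # transductive labels
print(induct(np.array([2.1, 1.9]), X, F).argmax())     # new sample -> class 1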

4.4. Confidence analysis of the parameters


In this subsection, we analyze the confidence intervals of the results in order to further show the robustness of the proposed method. Specifically, the proposed CGSSL does not have the Gaussian covariance parameter (which is required in GFHF, LLGC, and GGSSL); thus, we only need to provide the confidence intervals of $\alpha_l$ and $\alpha_u$, which are used in Eq. (4). The average annotated accuracies over 20 random splits of six different datasets with varied $\alpha_l$ and $\alpha_u$ are shown in Fig. 13, where the candidate set of $\alpha_l$ is $[0, 0.05, \ldots, 0.9, 0.95]$ and that of $\alpha_u$ is $[0.05, 0.1, \ldots, 0.95, 1]$. Here, we do not choose $\alpha_l = 1$ or $\alpha_u = 0$ as candidates.


Fig. 12. The average accuracies over 20 random splits for the transductive CGSSL and the inductive CGSSL on different datasets: (a) COIL100; (b) ETH80; (c) USPS; (d) MNIST; (e) Digit; and (f) Extended Yale-B.


This is because $\alpha_l = 1$ means that the label information of the labeled set will not be considered in Eq. (4), whereas $\alpha_u = 0$ means that the label information of the unlabeled set will not be received from its neighborhoods; both cases should therefore be excluded when evaluating the confidence intervals of $\alpha_l$ and $\alpha_u$. From the simulation results in Fig. 13, we can conclude that the proposed CGSSL works well over a wide interval of $\alpha_l$ and $\alpha_u$. In detail, for the COIL20, USPS, and Digit datasets, the proposed CGSSL achieves robust annotated accuracies when $\alpha_l$ and $\alpha_u$ lie in the intervals $[0, 0.9]$ and $[0.05, 0.95]$, respectively, whereas for the ETH80, MNIST, and Extended Yale-B datasets, the corresponding confidence intervals of $\alpha_l$ and $\alpha_u$ are $[0, 0.8]$ and $[0, 0.85]$, respectively. In practice, since most of the datasets have wide confidence intervals for the parameters $\alpha_l$ and $\alpha_u$, we can simply choose $\alpha_l = 0$ and $\alpha_u \le 0.8$

for conducting the simulations. Clearly, a method that works over wide parameter intervals is more robust to its parameters, and is thus more suitable and practical for real-world image annotation tasks.
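A robustness study of this kind can be organized with a small harness such as the sketch below (our own code; run_cgssl is a hypothetical stand-in, here a trivial nearest-labeled-neighbor rule, for the actual propagation routine, and the candidate grids follow the ones used above):

import numpy as np

def run_cgssl(X, y, labeled_idx, alpha_l, alpha_u):
    # Placeholder classifier: label every sample with the class of its
    # nearest labeled neighbor; a real CGSSL call would go here and would
    # actually use (alpha_l, alpha_u).
    d2 = ((X[:, None, :] - X[labeled_idx][None, :, :]) ** 2).sum(-1)
    return y[labeled_idx][d2.argmin(axis=1)]

def robustness_grid(X, y, label_rate=0.05, splits=20, seed=0):
    rng = np.random.default_rng(seed)
    alphas_l = np.arange(0.0, 1.0, 0.05)        # [0, 0.05, ..., 0.95]
    alphas_u = np.arange(0.05, 1.0001, 0.05)    # [0.05, 0.10, ..., 1.0]
    acc = np.zeros((len(alphas_l), len(alphas_u)))
    n_lab = max(1, int(label_rate * len(y)))
    for _ in range(splits):
        lab = rng.choice(len(y), n_lab, replace=False)
        unl = np.setdiff1d(np.arange(len(y)), lab)
        for i, al in enumerate(alphas_l):
            for j, au in enumerate(alphas_u):
                pred = run_cgssl(X, y, lab, al, au)
                acc[i, j] += (pred[unl] == y[unl]).mean()
    return acc / splits

X = np.random.default_rng(1).normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
print(robustness_grid(X, y).round(2))   # one cell per (alpha_l, alpha_u) pair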


4.5. Comparisons of computational time


In this subsection, we compare the computational time of the different methods when carrying out the simulations, as shown in Table 7. Here, the configuration of the PC we used is an Intel(R) Core(TM) i7-4600U CPU (2.10 GHz) with 8 GB RAM. From the simulation results in Table 7, we can see that the measured computational time is consistent with the analysis of the computational complexities of the methods in Table 5.


Fig. 13. The average annotated accuracies over 20 random splits of the COIL20, ETH80, MNIST, USPS, Extended Yale-B, and Digit datasets with varied $\alpha_l \in [0, 0.05, \ldots, 0.9, 0.95]$ and $\alpha_u \in [0.05, 0.1, \ldots, 0.95, 1]$.


We have the following observations: (1) the proposed CGSSL needs more computational time to carry out the simulations, except on some large-scale datasets such as the Digit, COIL100, ETH80, USPS, MNIST, and Extended Yale-B datasets; on these datasets, SSKDE has the largest computational time, as its computational complexity is closely related to the number of samples; and (2) LNP and CGSSL indeed need more computational time than GFHF/LLGC/GGSSL, but not by a large factor. We can also see that on some large-scale datasets, such as the COIL100, USPS, MNIST, and Extended Yale-B datasets, the computational times of LNP and CGSSL are almost equal to those of GFHF/LLGC/GGSSL. In addition, based on the analysis of Table 7, we also give an explicit implementation choice among the different methods from the time-to-process point of view: (1) for most UCI datasets, we can

observe that it is better to utilize GFHF/LLGC/GGSSL, as their computational times are much smaller. Another reason is that the UCI datasets are usually small in scale and simple in structure; hence, GFHF/LLGC/GGSSL can easily tune the Gaussian covariance parameter and achieve competitive performance compared with the proposed CGSSL; and (2) for some real-world datasets, such as the COIL100, USPS, MNIST, and Extended Yale-B datasets, we observe that the computational times of CGSSL are almost equal to those of GFHF/LLGC/GGSSL. Considering that CGSSL achieves better performance than GFHF/LLGC/GGSSL, it is better to choose CGSSL in these cases. Another reason is that real-world datasets are usually large in scale and complex in structure; thus, a method whose parameters are easy to adjust, such as the proposed CGSSL, is more practical for handling complex real-world datasets.


Fig. 13 (continued): (j)-(l) USPS with 5%, 10%, and 15% randomly labeled samples; (m)-(o) Extended Yale-B with 20%, 30%, and 40%; (p)-(r) Digit with 5%, 10%, and 15%.

Table 7
Computational time (s) of different methods.

Dataset            MR         SSKDE      GFHF       LLGC       GGSSL      LNP        CGSSL
Bupa               9.5242e-4  2.3805e-4  9.5772e-4  9.5911e-4  9.6757e-4  1.1344e-3  2.3708e-3
Ionosphere         1.0249e-3  2.4640e-4  9.9122e-4  9.9325e-4  9.9850e-4  1.1709e-3  2.4289e-3
Heart-Statlog      5.8540e-4  1.4580e-4  5.8752e-4  5.8951e-4  5.9153e-4  7.2576e-4  1.6934e-3
Iris               1.8006e-4  6.7500e-5  1.8360e-4  1.8964e-4  1.9636e-4  2.6040e-4  7.9800e-4
Wine               2.5567e-4  9.5052e-5  2.5774e-4  2.6046e-4  2.6884e-4  3.4888e-4  9.8683e-4
Segmentation       3.5287e-2  3.0870e-2  3.5398e-2  3.5597e-2  3.5616e-2  3.6473e-2  4.3999e-2
Digit              2.5294e-1  3.1584e-1  2.5312e-1  2.5327e-1  2.6009e-1  2.5600e-1  2.7614e-1
COIL100            1.4885e0   5.1840e0   4.2048e-1  4.2795e-1  4.3174e-1  4.3417e-1  4.4997e-1
ETH80              1.1598e0   8.6067e-1  8.8166e-2  8.8611e-2  8.9443e-2  8.9846e-2  1.0160e-1
USPS               7.0840e-1  8.6453e-1  6.9237e-1  7.0169e-1  7.0672e-1  7.1713e-1  7.3045e-1
MNIST              1.2819e0   1.0000e0   8.0080e-1  8.0546e-1  8.1255e-1  8.1592e-1  8.4176e-1
Extended Yale-B    3.1534e0   9.8781e0   2.0845e0   2.1064e0   2.1193e0   2.1228e0   2.1306e0


Fig. 14. Sample images of the Corel dataset.


Such datasets are usually encountered in image annotation tasks, CBIR, and social network applications.
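For reference, wall-clock entries such as those in Table 7 can be collected with a harness of the following form (a sketch of our own; the method registry shown is hypothetical, and a real study would register GFHF, LLGC, GGSSL, LNP, and CGSSL on the same data split):

import time
import numpy as np

def time_method(fn, *args, repeats=5):
    # Average wall-clock seconds of one call over several repeats.
    elapsed = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / repeats

X = np.random.default_rng(0).normal(size=(500, 20))
# Hypothetical registry; each entry is one method run end-to-end.
methods = {"pairwise-distances": lambda A: ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)}
for name, fn in methods.items():
    print(f"{name}: {time_method(fn, X):.4f} s")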


4.6. Application to content-based image retrieval


The goal of content-based image retrieval (CBIR) is to retrieve relevant images from a database based on the semantic similarity of image content. In this subsection, we evaluate the CBIR performance of the proposed CGSSL and compare it with that of the above-mentioned graph based semi-supervised methods on a large-scale image dataset, the Corel dataset [32]. We choose a subset of this dataset for simulation, which contains 50 concepts, each having 100 images [33]. Some samples of the Corel dataset are shown in Fig. 14. In CBIR, low-level image representation is of great importance for characterizing the feature information of each image. Common visual features include color, texture, shape, and combinations thereof. In this work, we combine a 48-dimension color histogram in HSV space [34] and a 48-dimension color texture moment (CTM) [35] to represent each image. For the color histogram, in order to avoid an excessive computational burden, we only consider 16 bins when calculating the histogram for each of the three HSV channels. Specifically, the histogram of the hue channel is calculated as follows:


$h_{H_i} = n_{H_i} / n_T, \quad i = 1, 2, \ldots, 16 \qquad (18)$

where $H_i$ is the $i$th quantized level of the hue channel, $n_{H_i}$ is the total number of pixels whose hue values fall in that level, and $n_T$ is the total number of pixels in the image. The histograms of the Saturation ($h_{S_i}$) and Value ($h_{V_i}$) channels are calculated in the same way as in Eq. (18).
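Eq. (18) translates directly into code; the sketch below (our own, assuming an HSV image whose three channels are scaled to [0, 1]) produces the 48-dimension color histogram by concatenating the 16-bin histograms of the three channels:

import numpy as np

def hsv_histogram(img_hsv, bins=16):
    # Eq. (18) for all three channels: h_i = n_i / n_T per 16-bin level.
    n_total = img_hsv.shape[0] * img_hsv.shape[1]
    feats = []
    for c in range(3):                      # hue, saturation, value
        counts, _ = np.histogram(img_hsv[..., c], bins=bins, range=(0.0, 1.0))
        feats.append(counts / n_total)
    return np.concatenate(feats)            # 48-dimension descriptor

img = np.random.default_rng(0).random((64, 64, 3))   # stand-in HSV image
f = hsv_histogram(img)
print(f.shape, f[:16].sum())                # (48,) and the hue bins sum to 1.0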

For the color texture moment (CTM), a local Fourier transform (LFT) is adopted as the texture representation strategy, and eight characteristic maps are derived to describe different aspects of the co-occurrence relations of image pixels in the three channels of the color space ($SV\cos H$, $SV\sin H$, and $V$). CTM then takes the first and second moments of these maps to represent the pixel distribution of a natural color image. We next design an automatic feedback strategy to model the retrieval process. For each query image submitted by the user, the system retrieves and ranks the images in the database; the rank of each database image is based on the label information estimated by the proposed CGSSL or the other state-of-the-art methods. The top images with the highest ranking scores are then selected as the feedback images, and their feedback information is used for re-ranking. It is worth noting that, for the sake of feedback, users should mark these images as either relevant or irrelevant. The above automatic feedback strategy is performed for four iterations for each query image. We next discuss how to utilize the proposed CGSSL in a CBIR system. Given a query image submitted by the user, the CBIR system first treats it as the labeled set, while treating the other images in the system database as the unlabeled set. Then, taking both the labeled and the unlabeled set as inputs, the proposed CGSSL automatically annotates the labels of the unlabeled set, and the top images (perhaps 10, 20, or more) with the highest estimated label values are selected as the feedback images. Next, the user annotates these feedback images as relevant or irrelevant to the query image; if a feedback image is judged to be relevant, it is added to the labeled set, so that the labeled set grows. The newly formed labeled set, combined with the remaining unlabeled set, is then used as the input for a new round of annotation. This process is performed iteratively until the user's requirements are satisfied.
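The feedback protocol described above can be summarized by the following sketch (our own naming throughout; propagate_labels stands in for one CGSSL annotation pass, and relevant_mask plays the role of the user's relevance judgments):

import numpy as np

def feedback_retrieval(x_query, X_db, relevant_mask, propagate_labels,
                       top_k=10, iterations=4):
    # The query joins the database as the only labeled ("relevant") sample.
    X = np.vstack([x_query, X_db])
    y = np.zeros((len(X), 2)); y[0, 0] = 1.0
    judged = set()
    for _ in range(iterations):
        scores = propagate_labels(X, y)[1:, 0]   # relevance per db image
        top = [i for i in np.argsort(-scores) if i not in judged][:top_k]
        for i in top:                            # simulated user feedback
            y[i + 1] = [1.0, 0.0] if relevant_mask[i] else [0.0, 1.0]
            judged.add(i)
    return np.argsort(-scores)                   # final ranking, best first

def nn_propagate(X, y):
    # Trivial stand-in for a CGSSL pass: score by closeness to the mean
    # of the currently "relevant" samples.
    center = X[y[:, 0] > 0].mean(axis=0)
    s = -((X - center) ** 2).sum(axis=1)
    return np.stack([s, -s], axis=1)

rng = np.random.default_rng(0)
db = rng.normal(size=(200, 8))
mask = rng.random(200) < 0.1                     # ground-truth relevance
print(feedback_retrieval(db[0] + 0.01, db, mask, nn_propagate)[:10])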

Fig. 15. Average accuracy-scope curves on the test set of the different methods for the second and fourth feedback iterations on the Corel dataset: (a) the second iteration; and (b) the fourth iteration.


Fig. 16. Average accuracies at scopes 10, 20, and 50 of the different methods on the Corel dataset: (a) scope 10; (b) scope 20; and (c) scope 50.


Hence, with the increase of the labeled set, the feedback annotated images are expected to become more and more relevant to the query image. In a real image retrieval system, however, the query image is usually not in the image database. To simulate this environment, we divide the dataset into two non-overlapping subsets: the first one includes 4500 images used for training the system, and the second one includes 500 images used as query images. For the image retrieval results, we use an accuracy-scope curve [35] to evaluate the performance of the different methods under a fixed iteration. The scope is the number of top-ranked images presented to the user, and the accuracy is the ratio of the number of relevant images to the given scope. The accuracy-scope curve describes the accuracy at varied scopes and thus gives an overall performance evaluation of the algorithms. We then fix the scope at 10, 20, and 50, and evaluate the performance of the different methods over varied iterations. Fig. 15 shows the average accuracy-scope curves of the different methods at iteration 2 and iteration 4, and Fig. 16 shows the average accuracies over varied iterations under the fixed scopes of 10, 20, and 50. From Figs. 15 and 16, we have the following observations: (1) all of the supervised and semi-supervised methods outperform Euclidean distance-based methods, such as 1NN, due to the utilization of user feedback; (2) by iteratively adding user feedback, we can observe from Fig. 16 that the retrieval accuracies of all the methods increase, which indicates that user feedback indeed provides useful discriminative information for CBIR. For example, from Fig. 16a we can observe that the proposed CGSSL achieves a 30% accuracy improvement at the fourth iteration compared to the first iteration at scope 10; at scopes 20 and 50, this improvement reaches 25% and 20%, respectively; and (3) the proposed CGSSL achieves competitive retrieval results compared to the other methods over the entire range of scopes and iterations. The improvement is largest, approximately 3% or more compared to GGSSL and the other state-of-the-art semi-supervised methods, when the scope is fixed to 10 in the last round of feedback, as shown in Fig. 15a. This improvement is credible for the reasons analyzed above.
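For completeness, the accuracy-scope numbers behind Figs. 15 and 16 can be computed as follows (our own sketch; the accuracy at a scope is the fraction of relevant images among the top-ranked returns, averaged over queries):

import numpy as np

def accuracy_at_scope(rankings, relevant, scopes=(10, 20, 50)):
    # rankings: (n_queries, n_db) database indices, best first;
    # relevant: (n_queries, n_db) boolean ground-truth relevance.
    out = {}
    for s in scopes:
        per_query = [relevant[q, rankings[q, :s]].mean()
                     for q in range(len(rankings))]
        out[s] = float(np.mean(per_query))
    return out

rng = np.random.default_rng(0)
rank = np.argsort(rng.random((5, 100)), axis=1)  # dummy rankings
rel = rng.random((5, 100)) < 0.1                 # ~10% of images relevant
print(accuracy_at_scope(rank, rel))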


5. Conclusion


In this paper, we have proposed an effective label propagation procedure based on a new local reconstruction graph with symmetrization and normalization for image annotation. Following the theoretical analysis and the extensive studies in our work, we can draw the following conclusions for the proposed CGSSL:

894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929

933 934 935 936 937 938 939 940

(1) The proposed CGSSL can achieve better performance in most cases of dataset due to the reason that the new graph construction strategy in CGSSL can find a compact way to approximate a sample with its neighborhoods.

(2) Since the proposed CGSSL is based on the framework of GGSSL, it can either detect the outliers or develop a mechanism to calculate the probabilities of data samples. In addition, with symmetrization and normalization processes, the proposed CGSSL can be analyzed theoretically under a regularized framework. (3) The proposed CGSSL can automatically erase noisy labeled samples that are generalized due to carelessness of the annotator, and are robust to the parameters. Clearly, a method that can handle noisy annotated samples and is less sensitive to the parameters is more suitable for real-world annotation tasks. (4) The proposed CGSSL can be easily extended to its inductive out-of-sample version for handling new-coming data by using the same smoothness criterion.

942 943 944 945 946 947 948 949 950 951 952 953 954 955 956

While our method can achieve promising results for automatic image annotation and retrieval, our future work might focus on the following issues in order to improve the proposed method:

957

(1) The proposed CGSSL is based on a single graph. However, in some real-world datasets, such as the Extended Yale-B face dataset, using a single graph cannot be sufficient to approximate the manifold structure of datasets. This is because the Extended Yale-B dataset has strong lighting variations, i.e., each human subject has 64 lighting variations. In order to better characterize such multi-lighting-variation structure of datasets, it is better to construct the graph for each lighting variation separately and then to form a multi-graph. Our future work can lie in extending the current single-graph based method to a multi-graph based method. (2) The simulation results illustrate that the proposed CGSSL is capable to annotate the dataset with a single class label. However, in many real-world applications, a data sample may include multiple labels, i.e., there exists a multi-label dataset. In our future work, annotating multi-label datasets can be analyzed based on the proposed method. (3) Instead of solving the automatic image annotation problem, how to utilize the proposed CGSSL for some other applications, such as video annotation and social network invoking, are other challenges and of great importance. Our proposed method can be extended or improved to be applicable for these tasks.

960

958 959

961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984

6. Uncited reference [19].

Q2 Please cite this article in press as: M. Zhao et al., Automatic image annotation via compact graph based semi-supervised learning, Knowl. Based Syst. Q1 (2014), http://dx.doi.org/10.1016/j.knosys.2014.12.014

941

985

Q5

986


Appendix A


Denote by $e_n = [1, 1, \ldots, 1] \in \mathbb{R}^{1 \times n}$ the all-one (unit) row vector. To prove that the sum of each column of $F$ equals 1, i.e., $e_{c+1}F = e_{l+u}$, we need to show that $e_{l+u}\widetilde{W} = e_{l+u}$ and $e_{c+1}Y = e_{l+u}$ together imply $e_{c+1}F = e_{l+u}$. Actually, we first have $e_{l+u}\widetilde{W} = e_{l+u}$ and $e_{c+1}Y = e_{l+u}$; then:

$$\left\{\begin{array}{l} e_{c+1}Y = e_{l+u} \\ e_{l+u}\widetilde{W} = e_{l+u} \end{array}\right. \;\Rightarrow\; e_{c+1}YI_b + e_{l+u}\widetilde{W}I_a = e_{l+u} \;\Rightarrow\; e_{c+1}YI_b = e_{l+u}\bigl(I - \widetilde{W}I_a\bigr) \;\Rightarrow\; e_{c+1}YI_b\bigl(I - \widetilde{W}I_a\bigr)^{-1} = e_{l+u} \;\Rightarrow\; e_{c+1}F = e_{l+u}. \qquad (19)$$
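The identity in Eq. (19) can be spot-checked numerically. The snippet below (our own sketch, using the closed form F = Y I_b (I - W~ I_a)^{-1} as reconstructed above, with column-normalized W~ and Y) confirms that each column of F sums to one:

import numpy as np

rng = np.random.default_rng(0)
n, c = 6, 3
W = rng.random((n, n))
W /= W.sum(axis=0, keepdims=True)      # columns sum to 1: e_{l+u} W~ = e_{l+u}
Y = rng.random((c + 1, n))
Y /= Y.sum(axis=0, keepdims=True)      # columns sum to 1: e_{c+1} Y = e_{l+u}
alpha = rng.uniform(0.1, 0.9, n)
Ia, Ib = np.diag(alpha), np.diag(1.0 - alpha)
F = Y @ Ib @ np.linalg.inv(np.eye(n) - W @ Ia)
print(np.allclose(F.sum(axis=0), 1.0))  # True: each column of F sums to one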

References [1] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720. [2] M. Zhao, Z. Zhang, T.W.S. Chow, Trace ratio criterion based generalized discriminative learning for semi-supervised dimension reduction, Pattern Recogn. 45 (4) (2012) 1482–1499. [3] M. Zhao, R.H.M. Chan, P. Tang, T.W.S. Chow, S.W.H. Wong, Trace ratio linear discriminant analysis for medical diagnosis: a case study of dementia, Signal Process. Lett. 5 (20) (2013). [4] X. Zhu, Z. Ghahramani, J.D. Lafferty, Semi-supervised learning using Gaussian fields and harmonic function, in: Proc. of ICML, 2003. [5] D. Zhou, O. Bousquet, T.N. Lai, J. Weston, B. Scholkopf, Learning with local and global consistency, in: Proc. of NIPS, 2004. [6] P. Thomas, Semi-supervised learning by Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien (review), IEEE Trans. Neural Netw. 20 (3) (2009) 542. [7] F. Nie, S. Xiang, Y. Liu, C. Zhang, A general graph based semi-supervised learning with novel class discovery, Neural Comput. Appl. 19 (4) (2010) 549– 555. [8] F. Nie, D. Xu, X. Li, S. Xiang, Semi-supervised dimensionality reduction and classification through virtual label regression, IEEE Trans. Syst. Man Cybernet. Part B 41 (3) (2011) 675–685. [9] F. Nie, S. Xiang, Y. Jia, C. Zhang, Semi-supervised orthogonal discriminant analysis via label propagation, Pattern Recogn. 42 (11) (2009) 2615–2627. [10] S. Xiang, F. Nie, C. Pan, C. Zhang, Regression reformulations of LLE and LTSA with locally linear transformation, IEEE Trans. Syst. Man Cybernet. Part B 41 (5) (2011) 1250–1262. [11] M. Wang, X. Hua, T. Mei, R. Hong, G. Qi, Y. Song, L. Dai, Semi-supervised kernel density estimation for video annotation, Comput. Vis. Image Underst. 13 (3) (2009) 384–396. [12] P. Ji, N. Zhao, S. Hao, J.G. Jiang, Automatic image annotation by semisupervised manifold kernel density estimation, Inform. Sci. (2013), http:// dx.doi.org/10.1016/j.ins.2013.09.016. [13] F. Wang, C. Zhang, Label propagation through linear neighborhoods, IEEE Trans. Knowl. Data Eng. 20 (1) (2008) 55–67. [14] J. Wang, F. Wang, C. Zhang, H.C. Shen, L. Quan, Linear neighborhood propagation and its applications, IEEE Trans. Pattern Anal. Mach. Intell. 31 (9) (2009) 1600–1615.

[15] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399–2434. [16] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: Proc. of ICCV, 2007. [17] M. Zhao, Z. Zhang, T.W.S. Chow, B. Li, Soft label based linear discriminant analysis for image recognition and retrieval, Comput. Vis. Image Underst. 121 (2014) 86–99. [18] M. Zhao, Z. Zhang, T.W.S. Chow, B. Li, A general soft label based linear discriminant analysis for semi-supervised dimension reduction, Neural Netw. 55 (2014) 83–97. [19] Z. Zhang, M. Zhao, T.W.S. Chow, Marginal semi-supervised sub-manifold projections with informative constraints for dimension reduction and recognition, Neural Netw. 36 (2012) 97–111. [20] Z. Zhang, T.W.S. Chow, M. Zhao, Trace ratio optimization-based semisupervised nonlinear dimension reduction for marginal manifold visualization, IEEE Trans. Knowl. Data Eng. 25 (5) (2013) 1148–1161. [21] Z. Zhang, M. Zhao, T.W.S. Chow, Graph based constrained semi-supervised learning framework via label propagation over adaptive neighborhood, IEEE Trans. Knowl. Data Eng. (2014), http://dx.doi.org/10.1109/TKDE. 2013.182. [22] O. Delalleau, Y. Bengio, N.L. Roux, Efficient non-parametric function induction in semi-supervised learning, in: Proc. of AISTATS, 2005. [23] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimension reduction, Science 290 (2000) 2319–2323. [24] S. Roweis, L. Saul, Nonlinear dimension reduction by locally linear embedding, Science 290 (2000) 2323–2326. [25] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396. [26] X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 1–13. [27] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-100), Technical Report CUCS-005-96, Columbia University, 1996. [28] B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: Proc. of CVPR, 2003. [29] J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (5) (1994) 550–554. [30] Y.L. Cun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in: Proc. of IEEE, vol. 86(11), 1998, pp. 2278–2324. [31] K.C. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 1– 15. [32] J.Z. Wang, J. Li, G. Wiederhold, Simplicity: semantics sensitive integrated matching for picture libraries, IEEE Trans. Pattern Recogn. Mach. Intell. 23 (9) (2001) 947–963. [33] H. Yu, M. Li, H.J. Zhang, J. Feng, Color texture moments for content-based image retrieval, in: Proc. ICIP, 2002. [34] T.W.S. Chow, M.K.M. Rahman, Content-based image retrieval by using treestructured features and multi-layer self-organizing map, Pattern Anal. Appl. 9 (2006) 1–20. [35] X. He, C. Cai, J. Han, Learning a maximum margin subspace for image retrieval, IEEE Trans. Knowl. Data Eng. 20 (2) (2008) 189–201. [36] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27:1–27:27. [37] K. Fukuaga, Introduction to Statistical Pattern Classification, Academic Press, USA, 1990. [38] R.A. Horn, C.R. 
Johnson, Matrix Analysis, Cambridge University Press, Cambridge, 1990.
