Semi-supervised classification with Laplacian multiple kernel learning☆

Tao Yang*, Dongmei Fu
School of Automation and Electrical Engineering, University of Science & Technology Beijing, Beijing, China

Article history: Received 24 October 2013; received in revised form 1 March 2014; accepted 3 March 2014. Communicated by M. Wang.

Abstract

Laplacian Support Vector Machine (lapSVM) is an SVM with an additional graph-based regularization term for semi-supervised learning (SSL). Since its base classifier is a single-kernel SVM, it may be inefficient on multi-source or multi-attribute complex datasets. Instead of one single kernel, multiple kernels can correspond to different notions of similarity or to information from multiple sources, and can represent differences between features. We therefore extend lapSVM to the multiple kernel setting, namely Laplacian Multiple Kernel Learning (lapMKL), improving the ability to process more complex data in semi-supervised classification tasks. The proposed lapMKL is solved by the level method, which has been used in multiple kernel learning (MKL) and has shown relatively high efficiency. Experiments on several datasets and comparisons with state-of-the-art methods show that the proposed lapMKL is competitive and sometimes better. © 2014 Elsevier B.V. All rights reserved.

Keywords: Semi-supervised classification; Graph-based regularizer; Multiple kernel learning

1. Introduction

In classification problems, unlabeled data are obtained easily while labeled data are costly, so semi-supervised learning (SSL), which makes use of both labeled and unlabeled data, is useful in this situation and has attracted much attention in the machine learning field. The aim of SSL is to improve learning performance by incorporating the information carried by unlabeled data and to build better functions for classification or decision making. Commonly used SSL methods include self-training, co-training, Gaussian processes, etc.; an important review is [1]. The most widely used methods, however, are graph-based, and many graph-based works achieve promising performance [2-8]. The graph is constructed on the labeled and unlabeled samples: vertices are samples, and edges, linking two adjacent vertices, carry weights indicating the similarity between the vertices. On this basis, a typical regularization framework has been proposed, with a regularizer term forcing the classification function to be smooth over the graph and a loss term ensuring consistency of the function on the labeled samples. However, traditional graph-based classification methods in SSL face two problems.

☆ This work is supported by the National Natural Science Foundation of China under Grant No. 61272358 and the Key Construction Disciplines Project of Beijing under Grant No. 00012007.
* Corresponding author. E-mail address: [email protected] (T. Yang).

(1) The datasets we obtain are often multi-attribute or come from multiple sources, while a single similarity on a graph is insufficient to represent the different importances of different features of the samples.
(2) Graph-based methods are mostly transductive: training and testing are performed on the graph, new samples may change the graph, and rebuilding the whole graph is relatively heavy work.

Multi-graph learning is one solution to the first problem. After investigating a diverse set of features, we can build multiple graphs on different features according to different distance functions and then integrate them; for example, [5] formed various factors in video annotation to build multiple graphs and integrated them into a regularization framework. Multi-view learning in SSL is closely related to multi-graph methods and has recently become an emerging research topic: it trains multiple hypotheses and provides an integration process to give a unified or more confident result. Each view can be one feature representation, so following the multi-graph idea it can handle multi-attribute or multi-source datasets. Related works include [2,9-12], in which the objects are mostly images or videos, since they are well suited to multi-view analysis. However, traditional graph-based methods consider only the relationship between two samples, ignoring higher-order relations. The hypergraph was therefore proposed [13]. In a hypergraph, a set of vertices is connected by a hyperedge, which is assigned a weight according to the relationship of those vertices. In this way one vertex can belong to different hyperedges based


on measurements of different features or views, so it can deal with multi-source data. Related works are [3,14,15], and when constructing graphs some novel similarity metric methods are [16-18]. The second problem, however, is one that graph-based methods cannot fully handle, while kernel-based methods are naturally inductive. The classic kernel method in SSL is TSVM [19,20]: it forces the classification hyperplane to cross the low-density region of the data and triggered many mathematical methods for solving the resulting non-convex problem; important results are collected in [21]. An improvement is meanS3VM [22], which additionally uses the label means of the unlabeled data. These methods involve many iterations of SVM training and are susceptible to label imbalance. Belkin's work [7] proposed the lapSVM model, which incorporates a graph-based regularizer into the standard SVM model, constraining the classification function of the SVM to be smooth along the graph so as to implement semi-supervised learning; in this way, a small-scale SVM and a graph construction are all that is needed to accomplish the classification task. Following that work, some related researches are [4,23-25]. However, most of these works are single-kernel based, which is insufficient for complex datasets and lacks the flexibility of kernel mappings found in multiple kernel learning (MKL) research. Recently, MKL has attracted attention because it can put different features into one reproducing kernel Hilbert space (RKHS) and is able to deal with multi-source data. MKL uses several kernel mappings and combines them in a convex way to improve the ability to process data.

In this paper, we propose Laplacian MKL (lapMKL), an extension of lapSVM to the multiple kernel setting. There are two main parts in the proposed model; an illustration is given in Fig. 1. First, based on the labeled data we establish a multiple kernel classifier, which allows different features to be handled by corresponding kernel mappings; in this way we can deal with complex data coming from multiple sources or different sampling procedures. Second, we construct a graph on the whole dataset to reflect the label variation; the similarities in the graph are computed from Euclidean distances. The two parts are simultaneously in effect during the solution process: we assume there exists a function in the multiple kernel space that labels the vertices corresponding to unlabeled samples on the data-dependent graph while being smooth over the graph and as consistent as possible with the labeled vertices. LapMKL easily performs inductive classification once the optimal kernel mappings are found. In practice, we develop a level method algorithm to solve for the parameters

in lapMKL; the level method has been shown to be efficient for multiple kernel combination tasks [26]. Experimental results show that the proposed lapMKL has competitive performance on a benchmark dataset collection, several UCI datasets, and Caltech256. In summary, the contributions of this paper are the following:

(1) We propose lapMKL for semi-supervised classification, dealing with complex multi-attribute or multi-source data in a kernel-based way; it addresses the two problems above in some sense.
(2) We develop the level method to solve the model; this method was used in MKL and showed relatively high efficiency.
(3) We apply lapMKL to a benchmark dataset collection, several UCI datasets and Caltech256 for semi-supervised classification tests; the results validate the effectiveness of lapMKL.

The rest of this paper is organized as follows. In Section 2 we provide a short review of related work. In Section 3 we propose lapMKL for semi-supervised classification and develop the level method to solve the problem. Experiments and comparisons are in Section 4. Finally, we give conclusions and expectations for future work.

Fig. 1. Illustration of the proposed lapMKL.

2. Related work

For notational consistency, we follow the convention below. Assume that we are given $l$ labeled data points $L = \{(x_i, y_i)\}_{i=1}^{l}$, where $y_i \in \{+1, -1\}$ is the class label, and $u$ unlabeled data points $U = \{x_j\}_{j=1}^{u}$, with $l + u = n$; the input space is $\mathcal{X}$, i.e. $x_i \in \mathcal{X}$. We want to find a function $f: \mathcal{X} \to \{+1, -1\}$ to label the unlabeled data and to infer over the whole input space.

2.1. Graph-based regularizer

In SSL, the graph-based regularizer is essential in graph-based methods; this regularizer requires a data-dependent graph. Define a graph $G = (V, E)$: $V$ is the vertex set with each vertex denoting one sample, labeled or unlabeled, i.e. $v_i = x_i$; $E$ is the set of edges linking adjacent vertices. We write each edge as $e_{ij}: v_i \sim v_j$, where '$\sim$' means adjacent, and each edge $e_{ij}$ has a weight $w_{ij}$; collecting all edges gives the weight matrix $W$. To judge whether two vertices are adjacent, the most used rule is kNN, i.e. one vertex lies in the other vertex's $k$-nearest neighborhood, or $\|x_i - x_j\|^2 < \epsilon$; then $x_i$ and $x_j$ are considered adjacent. The weight matrix is computed as

$$w_{ij} = \begin{cases} 1 & v_i \sim v_j \\ 0 & \text{otherwise} \end{cases}
\qquad \text{or} \qquad
w_{ij} = \begin{cases} \exp(-s^2 \|x_i - x_j\|^2) & v_i \sim v_j \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

Assume a function $f$ is to be estimated on the graph under the label smoothness assumption. Implementing this assumption leads to the minimization problem

$$\min_f \; \frac{1}{2}\sum_{i,j} w_{ij} (f_i - f_j)^2 = \min_f \; f^T L f \tag{2}$$

where $f_i = f(x_i)$ is the label prediction for $x_i$, $f = (f(x_1), \ldots, f(x_n))^T$, and $L = D - W$ with $D$ a diagonal matrix whose entries are $d_{ii} = \sum_j w_{ij}$; $L$ is called the graph Laplacian. The minimization above is exactly the graph-based regularization on the function $f$.
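For concreteness, a minimal NumPy sketch of this construction is given below, assuming a Euclidean kNN adjacency rule and the binary weighting from (1); the function name is ours, not from the paper.

```python
import numpy as np

def graph_laplacian(X, k=6):
    """Build a kNN graph with binary weights as in (1) and return L = D - W."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbours of x_i, excluding x_i itself
        nn = np.argsort(d2[i])[1:k + 1]
        W[i, nn] = 1.0
    W = np.maximum(W, W.T)          # symmetrize: adjacent if either side says so
    L = np.diag(W.sum(axis=1)) - W  # graph Laplacian L = D - W
    return L

# the regularizer in (2) for a candidate label vector f over all n samples:
# smoothness = f @ graph_laplacian(X) @ f
```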


In the multi-graph circumstance, suppose there are $N$ graphs with weight matrices $W_1, W_2, \ldots, W_N$ built from distances on different features; we then have the following regularizer:

$$\min_f \; \sum_{n=1}^{N} \alpha_n \Big( \sum_{i,j} w_{n,ij} (f_i - f_j)^2 \Big) \tag{3}$$

where $\alpha$ is a weight vector satisfying $\alpha_n \ge 0$, $\sum_n \alpha_n = 1$, and $w_{n,ij}$ is the edge weight in the $n$-th graph.

A hypergraph $G = (V, E, W)$ is composed of the vertex set $V$, the hyperedge set $E$ and the hyperedge weight vector $W$; every hyperedge $e_i$ has a weight $w(e_i)$. Define an incidence matrix $H$ with elements

$$H(v, e) = \begin{cases} 1 & \text{if } v \in e \\ 0 & \text{if } v \notin e \end{cases} \tag{4}$$

where $v \in e$ means vertex $v$ is linked by hyperedge $e$. With $H$ constructed, the degree of each vertex is $d(v) = \sum_{e \in E} w(e) H(v, e)$ and the degree of hyperedge $e \in E$ is $\delta(e) = \sum_{v \in V} H(v, e)$. Let $D_v$ and $D_e$ denote the diagonal matrices of vertex degrees and hyperedge degrees; the hypergraph Laplacian is

$$L = I - \Theta, \qquad \Theta = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} \tag{5}$$

After computing the hypergraph Laplacian, the constraint $f^T L f$ is applied in many hypergraph methods.

In SSL, the following regularization framework is often used to search for the optimal label function:

$$f^* = \arg\min_f \; \big\{ f^T L f + \lambda R_{\mathrm{emp}}(f) \big\} \tag{6}$$

where $\lambda > 0$ is a tradeoff parameter between the regularizer $f^T L f$ and the empirical loss term $R_{\mathrm{emp}}(f)$; the regularizer forces $f$ to be smooth along the graph $G$ when labeling the vertices, while the empirical loss term keeps the labeled vertices consistent. In this paper, we follow (2) as our regularizer, but the function $f$ lives in a multi-kernel space that accounts for different feature mappings.
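The normalized hypergraph Laplacian in (5) can be assembled directly from the incidence matrix; the following NumPy sketch, with a hypothetical incidence matrix H and hyperedge weight vector w, is one way to do it (it assumes no isolated vertices or empty hyperedges).

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """H: (n_vertices, n_edges) 0/1 incidence matrix; w: hyperedge weights."""
    W = np.diag(w)                          # hyperedge weight matrix
    d_v = H @ w                             # vertex degrees d(v) = sum_e w(e) H(v, e)
    d_e = H.sum(axis=0)                     # hyperedge degrees delta(e) = sum_v H(v, e)
    Dv_isqrt = np.diag(1.0 / np.sqrt(d_v))
    De_inv = np.diag(1.0 / d_e)
    Theta = Dv_isqrt @ H @ W @ De_inv @ H.T @ Dv_isqrt
    return np.eye(H.shape[0]) - Theta       # L = I - Theta, as in (5)
```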


2.1.1. LapSVM

In lapSVM, semi-supervised classification is implemented through the graph-based regularization as in (2), and lapSVM specifies the function $f$ as an SVM function. It is known that in the SVM, for a Mercer kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there is an RKHS $\mathcal{H}_K$ of functions $f: \mathcal{X} \to \mathbb{R}$ with the corresponding norm $\|\cdot\|_K$ [27], and it admits the reproducing property [28]:

$$f(x) = (f(\cdot), K(\cdot, x))_K \tag{7}$$

where $(\cdot, \cdot)_K$ is the inner product in $\mathcal{H}_K$. With the additional regularization and its foundation in $\mathcal{H}_K$, lapSVM solves the following problem to achieve semi-supervised classification:

$$\min_{f \in \mathcal{H}_K} \; C \sum_{i=1}^{l} \max(0, 1 - y_i f(x_i)) + \frac{\gamma_A}{2} \|f\|_K^2 + \frac{\gamma_I}{2} f^T L f, \qquad n = l + u \tag{8}$$

where $f = (f(x_1), \ldots, f(x_n))^T$ and $\gamma_A, \gamma_I$ control the regularization terms. The rationale of model (8) is that when we give labels to the unlabeled samples, the similarities between samples should be maintained; since the weight matrix $W$ of the graph represents these similarities, a natural choice is $\min \sum_{i,j}^{n} w_{ij}(f(x_i) - f(x_j))^2$, which is exactly the graph-based regularizer in (8). The solution admits the Representer Theorem [7]:

$$f(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x), \qquad \alpha \in \mathbb{R}^n \tag{9}$$

where $K$ is the kernel function in the SVM.

2.1.2. Multiple kernel learning

As kernel methods have long been of interest in machine learning, multiple kernel learning has recently become a popular topic, for it is flexible in parameter tuning and capable of handling a wide range of datasets. Following the convention in [29], assume there are $M$ positive definite kernels $K_1, \ldots, K_M$, each associated with an RKHS $\mathcal{H}_m$ endowed with an inner product $(\cdot, \cdot)_m$. In fact, there exist $M$ different feature mappings $\varphi_m: \mathcal{X} \to \mathcal{H}_m$; if we take $\mathcal{X}$ as $\mathbb{R}^{D_1} \times \mathbb{R}^{D_2} \times \cdots \times \mathbb{R}^{D_M}$, then each $\mathcal{H}_m$ is a projection space of $\mathbb{R}^{D_m}$ and a representation of the input space, so we can characterize multi-source data by using the different mappings $\varphi_m$, $m = 1, \ldots, M$; this shows the adaptability of the multiple kernel space. Important works on MKL are the following. Lanckriet et al. transformed kernel learning into a problem of combining several kernel matrices and showed that it is an SDP problem [30]; after that, many researchers focused on efficient algorithms for this problem: Bach et al. used a second-order conic optimization technique [31]; Sonnenburg et al. reformulated MKL as a semi-infinite linear program (SILP) to deal with medium or large datasets [32]; SimpleMKL was then proposed as a further efficient algorithm [29]. Other efficient algorithms include the level method of Xu et al. [26], the UFO-MKL model of Orabona and Jie [33], and SpicyMKL, based on a dual augmented Lagrangian, by Suzuki and Tomioka [34]. In terms of formulation and complexity analysis of MKL, the main works are: Kloft et al. proposed p-norm MKL and provided closed-form solutions and related analysis [35-37]; Tomioka and Suzuki applied elastic-net regularization to compromise between sparse and non-sparse properties [38]; Cortes et al. provided complexity analyses of MKL [39,40]. An extensive review is [41].
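As a small illustration of the convex kernel combination used throughout the rest of the paper (not code from the paper itself), the following sketch combines precomputed Gram matrices with simplex weights and evaluates the resulting multi-kernel decision function.

```python
import numpy as np

def combine_kernels(kernels, d):
    """K = sum_m d_m K_m for precomputed Gram matrices and weights d on the simplex."""
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0) and abs(d.sum() - 1.0) < 1e-9
    return sum(dm * Km for dm, Km in zip(d, kernels))

def decision_function(alpha, cross_kernels, d, b=0.0):
    """f(x) = sum_i alpha_i sum_m d_m K_m(x_i, x) + b for test points;
    cross_kernels[m] holds K_m(x_i, x_test) with shape (n_train, n_test)."""
    return alpha @ combine_kernels(cross_kernels, d) + b
```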

3. LapMKL

In this section, we present the formulation of the proposed lapMKL; we then apply the Representer Theorem and the Lagrange multiplier optimization technique to the model to obtain its dual form. Finally, we adapt the level method used in MKL to lapMKL in order to obtain the solution efficiently.

3.1. LapMKL formulation

Now we define a Hilbert space $\mathcal{H}'_m$ for each $m$ with the inner product $(f, g)_{\mathcal{H}'_m} = (f, g)_m / d_m$, $d_m \ge 0$, with the convention that $x/0 = 0$ if $x = 0$ and $x/0 = \infty$ otherwise, where $f$ and $g$ are in the RKHS $\mathcal{H}_m$ associated with the inner product $(f, g)_m$. The multiple kernel space $\mathcal{H}$ we finally use is the direct sum of the $\mathcal{H}'_m$, and according to [28], $\mathcal{H}$ is an RKHS whose kernel has the form

$$K(\cdot, x) = \sum_{m=1}^{M} d_m K_m(\cdot, x) \tag{10}$$

Therefore, any $f \in \mathcal{H}$ has the form $f = \sum_m f_m$, $f_m \in \mathcal{H}_m$. Thus we can first establish the multiple kernel model for the classification task as follows:

$$\min_{f_m \in \mathcal{H}_m,\, d_m} \; C \sum_{i=1}^{l} \max(0, 1 - y_i f(x_i)) + \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{\mathcal{H}_m}^2
\quad \text{s.t.} \quad \sum_{m=1}^{M} d_m = 1, \; d_m \ge 0 \;\, \forall m \tag{11}$$

Then we need to extend (11) to semi-supervised classification using the regularization in (2). Assume that we have constructed the graph on the labeled and unlabeled data and computed its Laplacian $L$. Adding the regularizer gives the following formulation:

$$\min_{f_m \in \mathcal{H}_m,\, d_m} \; C \sum_{i=1}^{l} \max(0, 1 - y_i f(x_i)) + \frac{\gamma_A}{2} \sum_m \frac{1}{d_m} \|f_m\|_{\mathcal{H}_m}^2 + \frac{\gamma_I}{2} f^T L f
\quad \text{s.t.} \quad \sum_{m=1}^{M} d_m = 1, \; d_m \ge 0 \;\, \forall m \tag{12}$$


where $f(x_i) = \sum_m f_m(x_i)$, $f = (f(x_1), \ldots, f(x_n))^T$, $n = l + u$, $\gamma_A$ controls the complexity of the function in the multi-kernel space, and $\gamma_I$ controls the complexity of the function over the graph. The graph Laplacian $L$ exploits the intrinsic geometric properties of the data points to obtain more information and give a proper classifier, which is essential in SSL. In the RKHS $\mathcal{H}$, the solution to problem (12) admits the Representer Theorem and has the form [42,43]:

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) = \sum_{i=1}^{n} \alpha_i \sum_{m=1}^{M} d_m K_m(x_i, x), \qquad \alpha_i \in \mathbb{R} \tag{13}$$

To see this, note that the optimal $f$ lies in the spanned linear space $S = \mathrm{span}\{K(\cdot, x) \mid x \in \mathcal{X}\}$: any $f \in \mathcal{H}$ can be written as $f = f_S + f_S^{\perp}$, where $f_S$ is the projection of $f$ onto $S$ and $f_S^{\perp}$ is its orthogonal complement. By Lemma 4 in [7] we have $f(x_i) = f_S(x_i)$, so the data-dependent terms do not involve the orthogonal part; on the other hand $\|f\|_{\mathcal{H}}^2 = \|f_S\|_{\mathcal{H}}^2 + \|f_S^{\perp}\|_{\mathcal{H}}^2 \ge \|f_S\|_{\mathcal{H}}^2$, therefore the minimizer $f^*$ lies in $S$, as the Representer Theorem states.

It is useful to reduce the problem to an optimization over a finite dimensional space, whose dimension relates to the number of all data points; via the Representer Theorem we transform (12) into

$$\min_{\alpha, d} \; \frac{\gamma_A}{2} \alpha^T K \alpha + \frac{\gamma_I}{2} \alpha^T K L K \alpha + C \sum_{i=1}^{l} \max\Big(0, 1 - y_i \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j)\Big)
\quad \text{s.t.} \quad \sum_{m=1}^{M} d_m = 1, \; d_m \ge 0 \;\, \forall m, \qquad K = \sum_{m=1}^{M} d_m K_m \tag{14}$$

where $K_m$ is the $(l+u) \times (l+u)$ Gram matrix over the labeled and unlabeled points, i.e. $K_m(i, j) = K_m(x_i, x_j)$.
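For illustration, a small NumPy sketch of evaluating the objective in (14) for given α and d is shown below; it assumes the base Gram matrices and the graph Laplacian are precomputed (for instance with the helpers sketched earlier) and is not the authors' implementation.

```python
import numpy as np

def lapmkl_objective(alpha, d, kernels, L, y, l, C, gamma_A, gamma_I):
    """Objective of (14). kernels: list of (n, n) Gram matrices K_m;
    y: labels of the first l (labeled) points; alpha: length-n coefficient vector."""
    K = sum(dm * Km for dm, Km in zip(d, kernels))          # K = sum_m d_m K_m
    f = K @ alpha                                           # f(x_i) = sum_j alpha_j K(x_i, x_j)
    hinge = np.maximum(0.0, 1.0 - y * f[:l]).sum()          # hinge loss on labeled points only
    reg_rkhs = 0.5 * gamma_A * alpha @ K @ alpha            # (gamma_A / 2) alpha^T K alpha
    reg_graph = 0.5 * gamma_I * alpha @ K @ L @ K @ alpha   # (gamma_I / 2) alpha^T K L K alpha
    return C * hinge + reg_rkhs + reg_graph
```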

LapMKL analysis: We add a bias term to the final function, i.e. $f(\cdot) = \sum_{i=1}^{l+u} \alpha_i K(x_i, \cdot) + b$. We first consider (14) with fixed $d$ and introduce slack variables via $1 - y_i\big(\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\big) \le \xi_i$, $\xi_i \ge 0$, so the Lagrangian is

$$\mathcal{L} = C \sum_{i=1}^{l} \xi_i + \frac{1}{2}\big(\gamma_A \alpha^T K \alpha + \gamma_I \alpha^T K L K \alpha\big) - \sum_{i=1}^{l} \zeta_i \xi_i - \sum_{i=1}^{l} \beta_i \Big( y_i \Big( \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b \Big) - 1 + \xi_i \Big) \tag{15}$$

where $\zeta$ and $\beta$ are Lagrange multipliers and $K = \sum_m d_m K_m$ is the Gram matrix over the labeled and unlabeled data points. Setting the derivatives with respect to the primal variables $(\alpha, \xi, b)$ in (15) to zero, we obtain the following min-max problem:

$$\min_{d \in D} \; \max_{\beta \in \Theta} \; g(d, \beta) = \sum_{i=1}^{l} \beta_i - \frac{1}{2}\, \beta^T Y J K Q J^T Y \beta \tag{16}$$

where

$$Q = (\gamma_A I + \gamma_I L K)^{-1}, \qquad \Theta = \Big\{\beta : \sum_{i=1}^{l} \beta_i y_i = 0, \; 0 \le \beta_i \le C \;\, \forall i\Big\}, \qquad D = \Big\{d : \sum_{m=1}^{M} d_m = 1, \; 0 \le d_m \le 1 \;\, \forall m\Big\} \tag{17}$$

$Y$ is an $l \times l$ diagonal matrix with entries $Y_{ii} = y_i$, $i = 1, \ldots, l$, and $J$ is the $l \times (l+u)$ matrix $J = [I \;\; 0]$ with $I$ an identity matrix. In (16) we can solve the inner maximization with a traditional SVM solver for fixed $d$ and then return to the outer minimization to update $d$. We obtain the final optimum $(\beta^*, d^*)$ through an iterative process; let $t = 1, \ldots, \tau$ index the time steps. The inner solution $\beta^t$ is relatively easy to obtain via an SVM solver; next we give two methods for updating $d$, which are the components of the level method.
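To make (16)-(17) concrete, the following sketch evaluates Q and g(d, β) for given d and β; Y, J and the Gram matrices follow the definitions above, and the helper is purely illustrative, not the authors' code.

```python
import numpy as np

def g_value(d, beta, kernels, L, y, gamma_A, gamma_I):
    """Evaluate g(d, beta) from (16); beta has length l, kernels are (n, n) Gram matrices."""
    n, l = kernels[0].shape[0], len(beta)
    K = sum(dm * Km for dm, Km in zip(d, kernels))               # K = sum_m d_m K_m
    Q = np.linalg.inv(gamma_A * np.eye(n) + gamma_I * L @ K)     # Q = (gamma_A I + gamma_I L K)^{-1}
    Y = np.diag(y[:l])                                           # l x l diagonal label matrix
    J = np.hstack([np.eye(l), np.zeros((l, n - l))])             # J = [I 0], l x n
    return beta.sum() - 0.5 * beta @ (Y @ J @ K @ Q @ J.T @ Y) @ beta
```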

After the intermediate $\beta^t$ is provided, one method is Lagrangian analysis. Following (15), we take $d$ as primal variables and add their constraints to $\mathcal{L}$ via Lagrange multipliers, i.e. $\mathcal{L}' = \mathcal{L} + \lambda(\sum_m d_m - 1) - \sum_m \eta_m d_m$; setting the derivative with respect to the primal variables $d$ to zero and substituting the result back into $\mathcal{L}'$ gives

$$\min_{\lambda} \; \lambda \quad \text{s.t.} \quad \lambda \ge S_m(\beta^t) \;\, \forall m \tag{18}$$

where $S_m(\beta^t) = \frac{1}{2}(\beta^t)^T Y J \Gamma_m J^T Y \beta^t - \sum_{i=1}^{l} \beta_i^t$ and $\Gamma_m = K_m Q + \gamma_I K Q L (K - K_m) Q$. Taking $d_1, \ldots, d_M$ as the multipliers of the constraints in (18), we establish the Lagrangian $\mathcal{L}_{\lambda} = \lambda + \sum_{m=1}^{M} d_m (S_m(\beta^t) - \lambda)$; setting its derivative with respect to $\lambda$ to zero, substituting the result back into (18) and collecting the constraints from all previous steps, we arrive at

$$d^{\tau} = \arg\max_{d \in D} \; \theta \quad \text{s.t.} \quad \sum_{m=1}^{M} d_m S_m(\beta^t) \ge \theta, \quad t = 1, \ldots, \tau \tag{19}$$

In this way $d$ is updated iteratively. Another method to update $d$ is gradient based [29]. When formula (16) has the intermediate solution $\beta^t$, we take $g(d, \beta^t)$ as the objective and differentiate with respect to each $d_m$; since the negative gradient is a proper updating direction, we have

$$d^{t} = d^{t-1} - \lambda_t \nabla_d\, g(d, \beta^t), \qquad \nabla_{d_m} g(d, \beta^t) = -\frac{1}{2}(\beta^t)^T Y J \big(K_m Q - \gamma_I K Q L K_m Q\big) J^T Y \beta^t \;\, \forall m \tag{20}$$

where $\lambda_t$ is a step size computed dynamically by a one-dimensional line search; note that $d \in D$ must remain satisfied while updating with the gradient method. Next we discuss the level method.

3.1.1. Level method for lapMKL

The level method follows [26]; it is a kind of bundle method, and bundle methods are often employed to solve regularized risk minimization problems efficiently. In our setting, since $g(d, \beta)$ in (16) is convex in $d$ and concave in $\beta$, by the von Neumann lemma we have, for the optimal pair $(d^*, \beta^*)$,

$$g(d, \beta^*) = \max_{\beta \in \Theta} g(d, \beta) \ge g(d^*, \beta^*) \ge g(d^*, \beta) = \min_{d \in D} g(d, \beta) \tag{21}$$

The aim is to find this saddle point by iteratively updating $d \in D$ and $\beta \in \Theta$. The level method is implemented with a cutting plane model $h^{\tau}(d)$, defined as

$$h^{\tau}(d) = \max_{1 \le t \le \tau} g(d, \beta^t) \tag{22}$$

Next, we define a lower bound $\underline{g}_{\tau}$ and an upper bound $\bar{g}_{\tau}$ as

$$\underline{g}_{\tau} = \min_{d \in D} h^{\tau}(d) \qquad \text{and} \qquad \bar{g}_{\tau} = \min_{1 \le t \le \tau} g(d^t, \beta^t) \tag{23}$$

The bounds are just values of (16) with different ways of updating $d$: for the lower bound $d$ is solved by formula (19), and for the upper bound $d$ is updated by formula (20).

Theorem 1. Properties of $\underline{g}_{\tau}$ and $\bar{g}_{\tau}$ [26]: (a) $\underline{g}_{\tau} \le g(d^*, \beta^*) \le \bar{g}_{\tau}$; (b) $\bar{g}_1 \ge \bar{g}_2 \ge \cdots \ge \bar{g}_{\tau}$; (c) $\underline{g}_1 \le \underline{g}_2 \le \cdots \le \underline{g}_{\tau}$.

We define the gap $\delta_{\tau}$ as

$$\delta_{\tau} = \bar{g}_{\tau} - \underline{g}_{\tau} \tag{24}$$

Then we construct a level set $\mathcal{L}^{\tau}$ using the two bounds:

$$\mathcal{L}^{\tau} = \big\{ d \in D : h^{\tau}(d) \le \ell_{\tau} = \lambda \bar{g}_{\tau} + (1 - \lambda) \underline{g}_{\tau} \big\} \tag{25}$$

where $\lambda \in (0, 1)$; in practice $\lambda$ should be close to 1, which keeps the new solution from moving far from the previous one. The core idea is that $d$ for the next time step is the projection of $d^{\tau}$ onto the level set $\mathcal{L}^{\tau}$, which translates into the optimization problem

$$d^{\tau+1} = \arg\min_{d} \big\{ \|d - d^{\tau}\|_2^2 : d \in D, \; g(d, \beta^t) \le \ell_{\tau}, \; t = 1, \ldots, \tau \big\} \tag{26}$$

Now we give an algorithm summarizing the above discussion; see Algorithm 1.

Algorithm 1. Level method for lapMKL.
1: Initialize $d^0 = 1/M$ and $t = 1$
2: repeat
3:   Solve problem (16) with an SVM algorithm using $d^{t-1}$ and obtain the intermediate optimal solution $\beta^t$
4:   Compute the cutting plane model $h^t(d)$
5:   Compute the lower and upper bounds in (23) and the gap in (24)
6:   Calculate the projection of $d^t$ onto the level set $\mathcal{L}^t$ by solving the optimization in (26)
7:   Update $t = t + 1$
8: until $\delta_t \le \varepsilon$
9: Output: $\beta^*$ and $d^*$
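The outer loop of Algorithm 1 can be sketched as follows. The four callables (svm_dual_solver returning β for fixed d, g_value from (16), min_cutting_plane solving the lower-bound problem, and project_onto_level_set solving the QP in (26)) are hypothetical helpers standing in for a standard SVM solver and small LP/QP solvers; this is only a sketch of the control flow, not the authors' implementation.

```python
import numpy as np

def level_method(M, svm_dual_solver, g_value, min_cutting_plane,
                 project_onto_level_set, lam=0.9, eps=1e-3, max_iter=100):
    """Outer loop of Algorithm 1 (sketch); the four callables are assumed helpers."""
    d = np.full(M, 1.0 / M)                 # step 1: d^0 = 1/M, uniform kernel weights
    planes = []                             # (d^t, beta^t) pairs defining the model h(d)
    beta = None
    for _ in range(max_iter):
        beta = svm_dual_solver(d)           # step 3: inner maximization over beta (an SVM)
        planes.append((d.copy(), beta))
        g_upper = min(g_value(dt, bt) for dt, bt in planes)   # upper bound, cf. (23)
        g_lower = min_cutting_plane(planes)                   # lower bound min_d h(d), cf. (23)
        if g_upper - g_lower <= eps:        # stop once the gap (24) is small enough
            break
        level = lam * g_upper + (1 - lam) * g_lower           # level value, cf. (25)
        d = project_onto_level_set(d, planes, level)          # step 6: projection QP (26)
    return beta, d                          # beta*, d*
```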

In practice, convergence is guaranteed because the gap decreases monotonically through the iterations, and there is a theorem for this conclusion.

Theorem 2 (Xu et al. [26]). The maximum number of iterations required to obtain a solution $d^t$ satisfying $\delta_t \le \varepsilon$ is bounded.

3.1.2. Complexity analysis

There are $n$ data points including the labeled and unlabeled ones, the dimension of one sample is $d$, and there are $M$ kernel mappings. First, for the graph Laplacian, computing distances between samples takes time $O(nd)$ and building the Laplacian $L$ takes $O(n^2)$; calculating each Gram matrix $K_m$ is $O(nd)$, so the combination $K = \sum_{m=1}^{M} d_m K_m$ takes $O(M \cdot nd)$. In formula (16) there is a dense inverse matrix $Q$, whose time complexity is $O(2n^3)$, and calculating the intermediate $\beta^t$ is an SVM whose time complexity is $O(n^3)$. Updating $d$ has two parts: in formula (19) computing $d_m S_m(\beta^t)$ has time complexity $O(M \cdot n^3)$ and the linear program is $O(2^M)$ in the worst case; in formula (20) the gradient computation has time complexity $O(M \cdot n^3)$. The projection calculation (26) has time complexity $O(M^3)$ in the worst case. Therefore, the entire time complexity of the proposed model is $O(nd + n^2 + (M \cdot nd + 3n^3 + M \cdot n^3 + 2^M + M \cdot n^3 + M^3) \cdot T) = O(nd + n^2 + (M \cdot nd + (3 + 2M)n^3 + 2^M + M^3) \cdot T)$, where $T$ is the number of iterations of the optimization process.

4. Experiments

Our experiments are conducted on the benchmark datasets of [21], UCI datasets [44] and an image dataset, Caltech256-2000 [45].

4.1. Experiment setting and comparisons

In practice, the value of the hyper-parameter is $C = 100$, and the values of $\gamma_A$ and $\gamma_I$ are taken from $\{10^{-5}, \ldots, 10^{-2}\}$ and $\{10^{-6}, 10^{-5}, \ldots, 10^{-2}\}$, respectively, by 5-fold cross validation. For the multiple kernels, in the benchmark and UCI dataset tests the candidate kernels have two parts: one part is 10 Gaussian kernels $k(x, y) = \exp(-\|x - y\|^2 / s^2)$ with 10 different values of $s$, i.e. $\{2^{-4}, 2^{-3}, \ldots, 2^{5}\}$; the other part is three polynomial kernels $k(x, y) = (1 + x \cdot y)^p$ with degrees $p \in \{1, 2, 3\}$. In the Caltech256-2000 test, the candidate kernels are Gaussian kernels with $s \in \{0.01, 0.05, 0.1, 0.5, 1, 2, 5\}$. All kernel matrices are normalized to unit trace (a construction sketch is given after the method list below). The algorithm implementation follows [26]. The neighborhood of each vertex in the graph is decided by the kNN method, where the parameter $k$ is chosen from $\{6, 7, 8\}$ by cross validation along with $\gamma_A$ and $\gamma_I$; the weights on the edges are given by the binary function. The SVM solver is that of [46]. For classification, we compare the following methods, including the proposed lapMKL.

(1) Laplacian multiple kernel learning (lapMKL): parameters are selected by 5-fold cross validation as discussed above.
(2) Laplacian support vector machine (lapSVM) [7]: an SVM with a geometrically motivated penalty implementing semi-supervised classification; it uses a Gaussian kernel with $s \in \{0.1, 0.3, 0.5\}$ selected by 5-fold cross validation.
(3) Mean semi-supervised SVM (meanS3VM) [22]: it improves the traditional TSVM by incorporating the label means of the unlabeled data to maximize the margin; it uses a Gaussian kernel whose parameter is set to the mean distance of the data.
(4) Transductive SVM (TSVM) [19]: it forces the classification hyperplane to cross the low-density region of the data; the kernel is the radial basis function $k(a, b) = \exp(-\gamma \|a - b\|^2)$ with $\gamma = 0.1$.
(5) Low density separation (LDS) [47]: under the assumption that the hyperplane should cross a low-density region of the data, it forms a kernel via graph metrics and then uses the formed kernel in TSVM.
(6) Harmonic function [48]: the labeling process is carried out on a graph, preserving the labeled data as well as the similarities between all samples; the resulting classification function has the harmonic property.
(7) Support vector machine (SVM) [49]: the classical method in machine learning; we implement it with libSVM [46] and adopt the radial basis function kernel, whose parameter is set to the mean distance of the data.
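Below is a minimal sketch of the candidate kernel bank described above (10 Gaussian widths plus 3 polynomial degrees, each Gram matrix normalized to unit trace); the function name is ours and the code is only illustrative of the setting, not the authors' implementation.

```python
import numpy as np

def candidate_kernels(X):
    """Return the 13 unit-trace Gram matrices used for the UCI/benchmark tests."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    grams = [np.exp(-d2 / s ** 2) for s in 2.0 ** np.arange(-4, 6)]   # 10 Gaussian kernels
    grams += [(1.0 + X @ X.T) ** p for p in (1, 2, 3)]                # 3 polynomial kernels
    return [K / np.trace(K) for K in grams]                           # normalize to unit trace
```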

4.2. UCI datasets

In this part we take eight datasets from UCI. We split each of the eight datasets into 10 labeled and 100 labeled samples, randomly chosen 10 times each, with the rest treated as unlabeled; in this way we construct 20 groups of data, the first 10 groups having 10 labeled samples and the last 10 groups having 100 labeled samples (a split sketch is given below). We let all thirteen kernels act both on all variables and on each single variable. Error rates are computed on the unlabeled samples. As Table 1 shows, lapMKL achieves the lowest total rank, which demonstrates that lapMKL is relatively stable and effective.
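A sketch of the labeled/unlabeled split protocol just described (10 or 100 labeled points, 10 random repetitions) follows; class-balance handling is not specified in the paper, so none is assumed here.

```python
import numpy as np

def make_splits(n_samples, n_labeled, n_repeats=10, seed=0):
    """Return a list of (labeled_idx, unlabeled_idx) pairs."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        splits.append((perm[:n_labeled], perm[n_labeled:]))
    return splits

# e.g. 10 groups with 10 labeled samples and 10 groups with 100 labeled samples:
# splits = make_splits(n_samples, 10) + make_splits(n_samples, 100, seed=1)
```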


4.3. Benchmark datasets

The benchmark datasets refer to [21, Chapter 5]; we use digit1, g241c, g241n, COIL, BCI and USPS. As stated in [21], the purpose of the benchmark datasets is to evaluate the power of the algorithms themselves in as neutral a way as possible. There are two settings, 10 labeled samples and 100 labeled samples, and each setting has 12 subsets; we run tests on each subset and report the average error rates on the unlabeled samples. We let all thirteen kernels act on all variables; the number of samples is much larger in the benchmark datasets than in the eight UCI datasets, which tests the applicability of the proposed model to medium or large datasets. Table 2 gives the results, which require some further analysis. First, since we let the kernels act only on all variables, there are fewer kernels than in the UCI experiments, so the ability may degenerate a little in some sense. Second, some of the selected benchmark datasets are preprocessed, as described in [21, p. 377]: 'digit1' was designed to have a manifold structure but Gaussian noise was added, and 'COIL' and 'USPS' are image data with added noise; from those three classification results we may assume that noise breaks the manifold structure to some extent and hinders the proposed model. 'g241c' and 'g241n' were artificially generated under the cluster assumption, so the proposed model, which relies on the manifold assumption, may not be appropriate. The 'BCI' data consists of parameters of regressive models fitted to the time series of electrode signals observed from a brain-computer interface; the regressive model itself may not be very exact, so the parameters were not truly

representative of the objects, and the data may therefore carry less manifold structure. The results also show that 'meanS3VM' is suitable here, since its total rank is relatively low: it incorporates the label means of the unlabeled data and maximizes the margin between the means. Averaging weakens noise, which may be the reason 'meanS3VM' does well in this experiment. Our future work can address how to deal with data corrupted by several kinds of noise.

4.4. Caltech256-2000 dataset

The Caltech256-2000 is a subset of the Caltech256 dataset [45]. It is composed of 2000 images from 20 categories: chimp, bear, bowling-ball, light-house, cactus, iguana, toad, cannon, chessboard, lightning, goat, pci-card, dog, raccoon, cormorant, hammock, elk, necktie and self-propelled-lawn-mower. We select two kinds of features to describe the images: one is the RGB histogram, where each component of RGB is quantized uniformly into 32 bins, giving a 96-dimensional feature vector per image; the other is the edge direction histogram, where we use the Canny detector to extract edge pixels and then compute the probabilities that they fall into 8 directions.

Table 1. Error rates on the UCI data sets, mean (rank) %. Total rank is the sum of the ranks of an algorithm over the data sets; the smaller the total rank, the better the performance.

10 labeled samples:

Dataset | lapMKL | lapSVM | meanS3VM | TSVM | SVM | Harmonic | LDS
heart | 24.5 (1) | 33.9 (6) | 25.6 (5) | 24.7 (2) | 25.0 (3) | 36.8 (7) | 25.5 (4)
german | 34.6 (4) | 29.6 (1) | 38.3 (6) | 38.2 (5) | 30.0 (2) | 31.9 (3) | 43.9 (7)
iono | 21.9 (2) | 34.8 (7) | 18.6 (1) | 26.5 (6) | 25.6 (5) | 23.0 (4) | 21.9 (2)
vote | 7.9 (1) | 15.1 (7) | 9.7 (2) | 10.9 (3) | 12.9 (6) | 11.2 (4) | 11.2 (4)
pima | 31.7 (1) | 34.6 (6) | 33.3 (3) | 32.9 (2) | 34.1 (5) | 34.6 (6) | 33.3 (3)
liver | 41.3 (1) | 41.5 (2) | 43.2 (6) | 45.2 (7) | 42.2 (3) | 42.9 (4) | 43.1 (5)
wdbc | 10.9 (3) | 36.1 (7) | 14.1 (4) | 9.5 (2) | 31.3 (6) | 23.5 (5) | 6.7 (1)
vehicle | 28.1 (2) | 26.5 (1) | 28.1 (2) | 37.9 (7) | 28.4 (5) | 28.1 (2) | 33.7 (6)
Total rank | 15 | 37 | 29 | 34 | 35 | 35 | 32

100 labeled samples:

Dataset | lapMKL | lapSVM | meanS3VM | TSVM | SVM | Harmonic | LDS
heart | 17.9 (1) | 18.1 (2) | 21.6 (7) | 18.7 (5) | 18.6 (3) | 21.2 (6) | 18.6 (3)
german | 28.0 (1) | 28.2 (2) | 31.2 (4) | 31.3 (5) | 28.8 (3) | 31.9 (6) | 33.7 (7)
iono | 9.5 (3) | 15.4 (6) | 6.7 (1) | 17.1 (7) | 7.0 (2) | 14.6 (5) | 10.9 (4)
vote | 5.1 (1) | 8.7 (6) | 6.3 (2) | 6.9 (3) | 7.4 (4) | 7.9 (5) | 9.9 (7)
pima | 25.2 (2) | 20.0 (1) | 28.4 (5) | 26.2 (3) | 26.9 (4) | 28.5 (6) | 30.8 (7)
liver | 34.7 (3) | 28.1 (1) | 33.4 (2) | 34.7 (3) | 35.3 (5) | 36.1 (7) | 35.3 (5)
wdbc | 6.0 (3) | 14.8 (7) | 5.3 (2) | 7.6 (5) | 8.3 (6) | 6.1 (4) | 4.6 (1)
vehicle | 14.6 (3) | 17.8 (6) | 10.1 (1) | 16.4 (4) | 12.1 (2) | 16.8 (5) | 18.3 (7)
Total rank | 17 | 31 | 24 | 35 | 29 | 44 | 41

Table 2. Error rates on the benchmark data sets, mean (rank) %. Total rank is the sum of the ranks of an algorithm over the data sets; a smaller total rank means better performance.

10 labeled samples:

Dataset | lapMKL | lapSVM | meanS3VM | TSVM | SVM | Harmonic | LDS
digit1 | 22.4 (5) | 36.5 (6) | 21.3 (4) | 20.3 (2) | 44.3 (7) | 20.8 (3) | 16.4 (1)
g241c | 45.4 (4) | 50.2 (7) | 35.8 (3) | 21.4 (1) | 47.1 (5) | 49.0 (6) | 32.4 (2)
g241n | 47.5 (3) | 50.3 (7) | 42.5 (1) | 46.3 (2) | 47.6 (4) | 49.2 (6) | 48.7 (5)
COIL | 40.5 (3) | 37.1 (1) | 42.6 (4) | 49.9 (7) | 46.8 (6) | 38.0 (2) | 45.1 (5)
BCI | 48.9 (3) | 51.7 (7) | 46.8 (1) | 48.0 (2) | 50.1 (6) | 49.5 (5) | 49.3 (4)
USPS | 19.8 (2) | 20.4 (4) | 24.1 (5) | 28.3 (6) | 20.0 (3) | 19.3 (1) | 28.4 (7)
Total rank | 20 | 32 | 18 | 20 | 31 | 23 | 24

100 labeled samples:

Dataset | lapMKL | lapSVM | meanS3VM | TSVM | SVM | Harmonic | LDS
digit1 | 5.1 (3) | 2.2 (2) | 6.9 (5) | 7.7 (7) | 7.1 (6) | 2.1 (1) | 6.7 (4)
g241c | 22.2 (3) | 44.6 (7) | 22.7 (4) | 19.5 (2) | 25.8 (5) | 43.2 (6) | 18.9 (1)
g241n | 25.9 (4) | 42.4 (7) | 24.7 (3) | 23.2 (1) | 27.6 (5) | 41.1 (6) | 24.2 (2)
COIL | 11.9 (3) | 3.7 (1) | 17.7 (4) | 19.5 (5) | 20.6 (6) | 6.9 (2) | 31.3 (7)
BCI | 41.6 (4) | 37.4 (3) | 34.2 (2) | 29.5 (1) | 43.6 (6) | 46.9 (7) | 41.8 (5)
USPS | 9.2 (3) | 17.9 (6) | 8.2 (1) | 13.5 (4) | 14.5 (5) | 8.6 (2) | 25.0 (7)
Total rank | 20 | 26 | 19 | 20 | 32 | 24 | 26


Fig. 2. Classification error rates of Caltech256-2000.

The features are integrated into a vector of 104 dimensions. We therefore have 2000 samples of dimension 104 from 20 categories, and we let 7 kernels act on all variables. In the experiment, the sampling percentages for training are 10%, 20%, 30%, 40% and 50%, and we take the one-against-all strategy for the multi-class problem. The classification error rates over the whole set of samples are given in Fig. 2. Additionally, we include the result of [11] in the figure, i.e. the pairwise constraint based multiview subspace learning (PCMSL) method, which extracted five different features from Caltech256 and built a multi-graph; more details are in [11]. In Fig. 2, lapMKL is better than the other six methods. Compared with PCMSL, it shows a good trend of giving more accurate results as the number of training samples increases, although at the beginning it is less effective.
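A rough sketch of the two image descriptors (96-bin RGB histogram and 8-bin edge direction histogram) is given below; it relies on scikit-image and SciPy for edge detection and gradients, which is our choice for illustration rather than the authors' pipeline.

```python
import numpy as np
from scipy import ndimage
from skimage.color import rgb2gray
from skimage.feature import canny

def rgb_histogram(img):
    """img: (H, W, 3) uint8 array -> 96-dim vector (32 uniform bins per channel)."""
    hists = [np.histogram(img[..., c], bins=32, range=(0, 256))[0] for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / max(h.sum(), 1)

def edge_direction_histogram(img):
    """Probabilities of Canny edge pixels falling into 8 gradient-direction bins."""
    gray = rgb2gray(img)
    edges = canny(gray)
    angles = np.arctan2(ndimage.sobel(gray, axis=0), ndimage.sobel(gray, axis=1))[edges]
    h, _ = np.histogram(angles, bins=8, range=(-np.pi, np.pi))
    return h.astype(float) / max(h.sum(), 1)

def image_feature(img):
    return np.concatenate([rgb_histogram(img), edge_direction_histogram(img)])  # 104-dim
```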

5. Conclusion

In this paper, we extend lapSVM to the multiple kernel setting, i.e. lapMKL, specialized for the semi-supervised classification task. LapMKL is a classifier based on multiple kernel learning with an additional graph-based regularization that extends it to semi-supervised learning, where the regularizer is a geometrically motivated penalty constraining the classifier to be smooth along the data-dependent graph. Using multiple kernels in lapMKL instead of a single kernel gives more flexibility and interpretability and the potential to address complex or multi-source datasets, enlarging the range of data the classifier can adapt to. To obtain efficient solutions, we adopt the level method to solve the optimization problem. The experimental results indicate that lapMKL is capable of dealing with several kinds of semi-supervised datasets. During the tests, we found that lapMKL may be better suited to noiseless data or data with an underlying geometrical structure in space. Several directions remain for further study of the proposed model, such as the analysis of the parameters in lapMKL, the selection of kernel parameters, and more efficient algorithms.

References

[1] X. Zhu, Semi-supervised Learning Literature Survey, Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI.
[2] J. Yu, Y. Rui, B. Chen, Exploiting click constraints and multi-view features for image re-ranking, IEEE Trans. Multimed. 16 (1) (2014) 159-167.
[3] Y. Gao, M. Wang, Z. Zha, J. Shen, X. Li, X. Wu, Visual-textual joint relevance learning for tag-based social image search, IEEE Trans. Image Process. 22 (1) (2013) 363-376.
[4] L. Chen, I. Tsang, D. Xu, Laplacian embedded regression for scalable manifold regularization, IEEE Trans. Neural Netw. Learn. Syst. 23 (6) (2012) 902-915.


[5] M. Wang, X. Hua, R. Hong, J. Tang, G.J. Qi, Y. Song, Unified video annotation via multigraph learning, IEEE Trans. Circuits Syst. Video Technol. 19 (5) (2009) 733-746.
[6] V. Sindhwani, D. Rosenberg, An RKHS for multi-view learning and manifold co-regularization, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 976-983.
[7] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399-2434.
[8] D. Zhou, J. Huang, B. Schölkopf, Learning from labeled and unlabeled data on a directed graph, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 1041-1048.
[9] J. Yu, D. Liu, D. Tao, H. Seah, Complex object correspondence construction in two-dimensional animation, IEEE Trans. Image Process. 20 (11) (2011) 3257-3269.
[10] J. Yu, D. Liu, D. Tao, H. Seah, On combining multiple features for cartoon character retrieval and clip synthesis, IEEE Trans. Syst. Man Cybern. B 42 (5) (2012) 1413-1427.
[11] J. Yu, D. Tao, Y. Rui, J. Cheng, Pairwise constraints based multiview features fusion for scene classification, Pattern Recognit. 46 (2013) 483-496.
[12] J. Yu, M. Wang, D. Tao, Semi-supervised multiview distance metric learning for cartoon synthesis, IEEE Trans. Image Process. 21 (11) (2012) 4636-4648.
[13] D. Zhou, J. Huang, B. Schölkopf, Learning with hypergraphs: clustering, classification, and embedding, in: Advances in Neural Information Processing Systems, 2006, pp. 1601-1608.
[14] J. Yu, D. Tao, M. Wang, Adaptive hypergraph learning and its application in image classification, IEEE Trans. Image Process. 21 (7) (2012) 3262-3272.
[15] Y. Gao, M. Wang, D. Tao, R. Ji, Q. Dai, 3D object retrieval and recognition with hypergraph analysis, IEEE Trans. Image Process. 21 (9) (2012) 4290-4303.
[16] M. Wang, X. Hua, J. Tang, R. Hong, Beyond distance measurement: constructing neighborhood similarity for video annotation, IEEE Trans. Multimed. 11 (3) (2009) 465-476.
[17] M. Wang, Y. Gao, K. Lu, Y. Rui, View-based discriminative probabilistic modeling for 3D object retrieval and recognition, IEEE Trans. Image Process. 22 (4) (2013) 1395-1407.
[18] Y. Gao, M. Wang, R. Ji, X. Wu, Q. Dai, 3D object retrieval with Hausdorff distance learning, IEEE Trans. Ind. Electron. 61 (4) (2014) 2088-2098.
[19] T. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of the 16th International Conference on Machine Learning, 1999, pp. 200-209.
[20] O. Chapelle, V. Sindhwani, S.S. Keerthi, Optimization techniques for semi-supervised support vector machines, J. Mach. Learn. Res. 9 (2008) 203-233.
[21] O. Chapelle, B. Schölkopf, A. Zien (Eds.), Semi-supervised Learning, MIT Press, Cambridge, MA, 2006.
[22] Y. Li, J. Kwok, Z. Zhou, Semi-supervised learning using label mean, in: Proceedings of the 26th International Conference on Machine Learning, 2009, pp. 633-640.
[23] Z. Xu, I. King, M. Lyu, R. Jin, Discriminative semi-supervised feature selection via manifold regularization, IEEE Trans. Neural Networks 21 (7) (2010) 1033-1047.
[24] D. Tao, W. Liu, Multiview hessian regularization for image annotation, IEEE Trans. Image Process. 22 (7) (2013) 2676-2687.
[25] Y. Luo, D. Tao, C. Xu, H. Liu, Y. Wen, Multiview vector-valued manifold regularization for multilabel image classification, IEEE Trans. Neural Networks Learn. Syst. 24 (5) (2013) 709-722.
[26] Z. Xu, R. Jin, I. King, An extended level method for efficient multiple kernel learning, in: Advances in Neural Information Processing Systems, 2009, pp. 1825-1832.
[27] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[28] N. Aronszajn, Theory of reproducing kernels, Trans. Am. Math. Soc. 68 (3) (1950) 337-404.
[29] A. Rakotomamonjy, F. Bach, S. Canu, Y. Grandvalet, SimpleMKL, J. Mach. Learn. Res. 9 (2008) 2491-2521.
[30] G.R. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, M.I. Jordan, Learning the kernel matrix with semidefinite programming, J. Mach. Learn. Res. 5 (2004) 27-72.
[31] F. Bach, G.R. Lanckriet, M.I. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, in: Proceedings of the 21st International Conference on Machine Learning, ACM, 2004, p. 6.
[32] S. Sonnenburg, G. Rätsch, C. Schäfer, B. Schölkopf, Large scale multiple kernel learning, J. Mach. Learn. Res. 7 (2006) 1531-1565.
[33] F. Orabona, L. Jie, Ultra-fast optimization algorithm for sparse multi kernel learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 249-256.
[34] T. Suzuki, R. Tomioka, SpicyMKL: a fast algorithm for multiple kernel learning with thousands of kernels, Mach. Learn. 85 (1-2) (2011) 77-108.
[35] M. Kloft, U. Brefeld, S. Sonnenburg, A. Zien, lp-norm multiple kernel learning, J. Mach. Learn. Res. 12 (3) (2011) 953-997.
[36] M. Kloft, U. Brefeld, S. Sonnenburg, A. Zien, Non-sparse regularization and efficient training with multiple kernels, Technical Report UCB/EECS-2010-21, EECS Department, University of California, Berkeley, CA.
[37] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K. Müller, A. Zien, Efficient and accurate lp-norm multiple kernel learning, Adv. Neural Inf. Process. Syst. 22 (2009) 997-1005.


[38] R. Tomioka, T. Suzuki, Sparsity-accuracy trade-off in MKL, arXiv preprint arXiv:1001.2615, 2010.
[39] C. Cortes, M. Mohri, A. Rostamizadeh, Generalization bounds for learning kernels, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 247-254.
[40] C. Cortes, M. Mohri, A. Rostamizadeh, Ensembles of kernel predictors, arXiv preprint arXiv:1202.3712, 2012.
[41] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211-2268.
[42] B. Schölkopf, R. Herbrich, A. Smola, A generalized representer theorem, in: Computational Learning Theory, Springer, Berlin/Heidelberg, 2001, pp. 416-426.
[43] F. Dinuzzo, M. Neve, G. De Nicolao, On the representer theorem and equivalent degrees of freedom of SVR, J. Mach. Learn. Res. 8 (2007) 2467-2496.
[44] K. Bache, M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2013.
[45] G. Griffin, A. Holub, P. Perona, Caltech-256 Object Category Dataset, California Institute of Technology, Pasadena, CA.
[46] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27.
[47] O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005, pp. 57-64.
[48] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the 20th International Conference on Machine Learning, Washington DC, 2003, pp. 912-919.
[49] C.J. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (2) (1998) 121-167.

Tao Yang is a Ph.D. student in the School of Automation and Electrical Engineering, University of Science & Technology Beijing (USTB), China. He received his Bachelor's degree in Automation Science and Engineering from USTB in 2010. His research focuses on pattern recognition and machine learning.

Dongmei Fu received her M.S. degree from Northwest Polytechnical University in 1984 and her Ph.D. degree in Automation Science from the University of Science & Technology Beijing (USTB) in 2006. She is currently a Professor and Doctoral Supervisor in the School of Automation and Electrical Engineering, USTB, China. From 2002 to 2012, she took charge of several national projects on corrosion data mining and infrared image processing. Her main research interests include automatic control theory, image processing and data mining.
