ARTICLE IN PRESS
Neurocomputing 70 (2007) 1254–1264 www.elsevier.com/locate/neucom
Synthesis of maximum margin and multiview learning using unlabeled data☆

Sandor Szedmak^{a,b,*}, John Shawe-Taylor^{c}

^a ISIS Group, Electronics and Computer Science, University of Southampton, SO17 1BJ, UK
^b Department of Computer Science, University of Helsinki, Finland
^c Department of Computer Science, University College London, WC1E 6BT, UK

Available online 13 December 2006
Abstract

In this paper we show that semi-supervised learning with two input sources can be transformed into a maximum margin problem similar to a binary support vector machine. Our formulation exploits the unlabeled data to reduce the complexity of the class of learning functions. To measure how the complexity is decreased we use Rademacher complexity theory. The corresponding optimization problem is convex and can be solved efficiently for large-scale applications as well.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Semi-supervised learning; Maximum margin; Support vector machine; Rademacher complexity; Multiview learning
1. Introduction

Semi-supervised learning belongs to the mainstream of recent machine learning research. The exploitation of unlabeled data is an attractive approach either to extend the capability of known methods or to derive novel learning devices. Surveys of the recent directions in semi-supervised learning can be found in [9,23]. In this paper we give a synthesis of some earlier approaches and show an optimization framework with good statistical generalization capability. Before outlining our learning strategy we need to raise the essential question which initiated this research. The support vector machine has proved to be an efficient learning tool; its performance is demonstrated in several studies, and plenty of work has been devoted to giving it a strong theoretical foundation. The question is: can we go further? Is there any way to set up a learning device which can improve on the performance of the SVM for some families of distributions? This paper tries to

☆ This work is supported by the PASCAL Network of Excellence (IST-2002-506778), European Community IST Programmes.
* Corresponding author. ISIS Group, Electronics and Computer Science, University of Southampton, SO17 1BJ, United Kingdom.
E-mail addresses:
[email protected] (S. Szedmak),
[email protected] (J. Shawe-Taylor).
0925-2312/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2006.11.012
answer this question by presenting a model that might help to make the next step. Our idea is built upon:

- Multiview learning, where two or more sources of the inputs are given with the same output; see the papers of Blum et al. [7] and Dasgupta et al. [11]. Multiview learning is also referred to as co-training in several papers.
- Maximum margin learning, where the unlabeled cases within the margin are considered as errors; see Bennett et al. [3].
- Reduction of the learning class complexity, where "closeness" of the predictions for the unlabeled cases computed on the distinct sources of the input is assumed; see the conference paper about skeleton learning of Lugosi et al. [17], Balcan et al. [1], Kääriäinen [14] and Blum et al. [6].
- Combination of multiview and maximum margin learning, which introduces new constraints into the optimization problem of supervised maximum margin learning; see Meng et al. [18] and Farquhar et al. [12].

Blending these ideas we arrive at an optimization framework which can be solved efficiently, at the level of computational complexity of a binary SVM problem. In the first part of the paper the learning framework and the corresponding optimization problem are presented, and in the second part the generalization performance is
analyzed. We apply the Rademacher complexity to give an upper bound on the expected error of the label prediction. At the end, experimental results illustrate the usefulness of the approach. The underlying theory is summarized in Section 3. Further details can be found in [2].

2. Learning framework

The main idea is that if we can restrict the function class occurring in the SVM in a plausible way, then by decreasing the complexity the generalization performance could be improved. This restriction is realized by exploiting unlabeled sources of the data via multiview learning. The general solution framework is depicted in Fig. 1.

Informally, in order to exploit the unlabeled data, we are looking for particular solutions relating to the different views where the predictors give similar outputs on the unlabeled data. The similarity is measured by the L1 norm of the differences between the real-valued predictions,

  |f_1(u_i^1) − f_2(u_i^2)|,   i = 1, ..., m_U,        (2)

and it is minimized simultaneously with the error occurring in the estimation of the labeled cases. Two-view learning seems to impose a restriction on the possible applications. To obtain distinct views one can apply a random partition of the available feature components. This kind of random selection was applied in the experiments; the details are in Section 5.

2.1. Learning task

In semi-supervised, multiview learning with two views, we are given a compound sample S_L of pairs of outputs and inputs

  {(y_i, (x_i^1, x_i^2)) : y_i ∈ {−1, +1}, x_i^k ∈ X_k, i = 1, ..., m_L, k = 1, 2}

independently and identically generated by an unknown multivariate distribution P(Y, X^1, X^2), and a compound sample

  S_U = {(u_i^1, u_i^2) : u_i^k ∈ X_k, i = 1, ..., m_U, k = 1, 2}

of unlabeled cases independently and identically chosen from the marginal distribution P(X^1, X^2) of P(Y, X^1, X^2). Furthermore, there are given embeddings of the input vectors into Hilbert spaces, called feature spaces, by the functions φ_k : X_k → H_{φ_k}, k = 1, 2. The image vectors of these mappings are called feature vectors in the sequel. The objective is to find linear functions f_k(x^k) = w_k^T φ_k(x^k) + b_k which can predict the potential label value for any labeled and unlabeled case. The decision function is then defined for an arbitrary input x = (x^1, x^2) by

  f(x) = (1/2) Σ_k sign(f_k(x^k)).        (1)

2.2. Optimization problem

Before formulating the optimization framework the following notations are introduced: the matrix Y is a diagonal matrix of the labels {y_i}, and the matrices X_k and U_k comprise the labeled and unlabeled inputs in their rows for k = 1, 2. The embedding functions φ_k applied to X_k and U_k, denoted by φ_k(X_k) and φ_k(U_k), give matrices with the feature vectors in their rows. All vectors are column vectors and they are denoted by lower-case, bold face Latin or Greek letters, except 0 and 1, which denote vectors with all components 0 and 1, respectively. The notation used for matrices is upper-case bold face letters. Scalars are lower-case and constants are upper-case letters. R_+ denotes the set of nonnegative real numbers. The expressions are presented in vectorized form to emphasize the linear algebraic relations and to decrease the number of indexes. Thus, for example, 1^T x means Σ_{i=1}^{m_L} x_i.

Fig. 1. Learning with two views via forcing the similarity between the predictions.
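To make Eqs. (1) and (2) concrete, the following numpy sketch evaluates the combined decision function and the per-case disagreements on the unlabeled sample. All data, dimensions and weights below are illustrative placeholders, and the feature maps are taken to be the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
m_L, m_U, d1, d2 = 20, 30, 5, 7

# Two views of the labeled and unlabeled inputs (rows are cases).
X1, X2 = rng.normal(size=(m_L, d1)), rng.normal(size=(m_L, d2))
U1, U2 = rng.normal(size=(m_U, d1)), rng.normal(size=(m_U, d2))

# Per-view linear predictors f_k(x) = w_k^T phi_k(x) + b_k (identity feature maps here).
w1, b1 = rng.normal(size=d1), 0.1
w2, b2 = rng.normal(size=d2), -0.2

def f1(X):
    return X @ w1 + b1

def f2(X):
    return X @ w2 + b2

# Decision function (1): f(x) = (1/2) * sum_k sign(f_k(x^k)).
def decide(X1_, X2_):
    return 0.5 * (np.sign(f1(X1_)) + np.sign(f2(X2_)))

# Disagreement (2) on the unlabeled cases: |f_1(u_i^1) - f_2(u_i^2)|, i = 1..m_U;
# summing these values gives the L1 norm that the synthesis constraint bounds.
disagreement = np.abs(f1(U1) - f2(U2))
print(decide(X1, X2)[:5], disagreement.sum())
```

Note that the combined decision takes the value 0 exactly when the two views disagree on the sign, which is the motivation for forcing their outputs to be close on the unlabeled sample.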
The optimization problem formulating our learning framework is given by

  min    (1/2) Σ_k w_k^T w_k                                              ← regularization
       + Σ_k C_k 1^T ξ_k                                                  ← labeled loss
       + C_η 1^T η                                                        ← unlabeled loss
  w.r.t. w_k, b_k, ξ_k, k = 1, 2, η,
  s.t.   Y(φ_k(X_k) w_k + b_k 1) ≥ 1 − ξ_k, k = 1, 2,                     ← SVM subproblems
         |(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1)| ≤ η,           ← synthesis
         ξ_k ∈ R_+^{m_L}, k = 1, 2,  η ∈ R^{m_U}.        (3)

The constants C_k, k = 1, 2, and C_η control the balance between the labeled and unlabeled loss terms in the objective function. They are positive parameters whose optimal values can be found by a validation procedure for a given problem. In what follows we use the acronym SVM_2K for problem (3). Since we have a linearly constrained quadratic optimization task whose quadratic term is convex, (3) is a convex problem. The absolute values in the synthesis constraint and the corresponding term in the objective function can be unfolded by the substitutions

  η = η^+ − η^-,   ||η||_1 = 1^T (η^+ + η^-),   η^+, η^- ∈ R_+^{m_U};        (4)

then we have

  min    (1/2) Σ_k w_k^T w_k                                              ← regularization
       + Σ_k C_k 1^T ξ_k                                                  ← labeled loss
       + C_η 1^T (η^+ + η^-)                                              ← unlabeled loss
  w.r.t. w_k, b_k, ξ_k, k = 1, 2, η^+, η^-,
  s.t.   Y(φ_k(X_k) w_k + b_k 1) ≥ 1 − ξ_k, k = 1, 2,                     ← SVM subproblems
         (φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1) = η^+ − η^-,     ← synthesis
         ξ_k ∈ R_+^{m_L}, k = 1, 2,  η^+, η^- ∈ R_+^{m_U}.        (5)

We introduce vectors of Lagrangian multipliers: α_k, for any k, in the SVM subproblems, where the components of these vectors correspond to the labeled cases; the vector β in the synthesis constraint, with one component for each unlabeled case; and c_k, d^+, d^- for the non-negativity constraints. The assignment between the multipliers and the constraints is summarized in the next table:

  α_1 : Y(φ_1(X_1) w_1 + b_1 1) ≥ 1 − ξ_1,
  α_2 : Y(φ_2(X_2) w_2 + b_2 1) ≥ 1 − ξ_2,
  β   : (φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1) = η^+ − η^-,
  c_1 : ξ_1 ≥ 0,
  c_2 : ξ_2 ≥ 0,
  d^+ : η^+ ≥ 0,
  d^- : η^- ≥ 0.        (6)

We can then write up the corresponding Lagrangian functional

  L((w_k, b_k, ξ_k, η^+, η^-); (α_k, β, c_k, d^±))
    = (1/2) Σ_k w_k^T w_k + Σ_k C_k 1^T ξ_k + C_η 1^T (η^+ + η^-)
      + β^T [(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1) − η^+ + η^-]
      − Σ_k α_k^T (Y[φ_k(X_k) w_k + b_k 1] − 1 + ξ_k)
      − Σ_k c_k^T ξ_k − d^{+T} η^+ − d^{-T} η^-,        (7)

where we have the restrictions

  α_k, c_k ∈ R_+^{m_L}, k = 1, 2,   d^+, d^- ∈ R_+^{m_U},        (8)

imposed on the dual variables belonging to the inequality constraints. Unfolding the Karush–Kuhn–Tucker conditions we can express the normal vector of the separating hyperplane for any k by

  w_k = φ_k(X_k)^T Y α_k + (−1)^k φ_k(U_k)^T β
      = [φ_k(X_k)^T, φ_k(U_k)^T] γ_k,   where γ_k^T = (α_k^T Y, (−1)^k β^T).        (9)

After substituting the expressions of the primal variables back into the Lagrangian functional we arrive at the dual problem of (3). The kernel matrices in the dual have the structure

  K_k = [ φ_k(X_k) φ_k(X_k)^T   φ_k(X_k) φ_k(U_k)^T ]   =   [ K_k^{LL}  K_k^{LU} ]   =   [ K_k^L ]
        [ φ_k(U_k) φ_k(X_k)^T   φ_k(U_k) φ_k(U_k)^T ]       [ K_k^{UL}  K_k^{UU} ]       [ K_k^U ].

The superscripts of the sub-matrices refer to the labeled and unlabeled sources of the sub-kernels. We are going to exploit these notations in the following sections of the paper.
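The block structure of the compound kernel matrix and the KKT expression (9) can be exercised numerically. The sketch below uses a linear kernel, identity feature maps, and arbitrary (not optimized) dual values, all of which are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m_L, m_U, d = 8, 5, 4
X = rng.normal(size=(m_L, d))            # labeled inputs of one view (rows)
U = rng.normal(size=(m_U, d))            # unlabeled inputs of the same view
y = rng.choice([-1.0, 1.0], size=m_L)
Y = np.diag(y)

# Compound kernel matrix with labeled/unlabeled blocks (linear kernel):
Z = np.vstack([X, U])
K = Z @ Z.T
K_LL, K_LU = K[:m_L, :m_L], K[:m_L, m_L:]
K_UL, K_UU = K[m_L:, :m_L], K[m_L:, m_L:]

# Arbitrary feasible dual values, only to exercise the algebra.
alpha = rng.uniform(0.0, 1.0, size=m_L)  # SVM-subproblem multipliers
beta = rng.uniform(-1.0, 1.0, size=m_U)  # synthesis-constraint multipliers

# Eq. (9) for k = 1: w_1 = phi(X)^T Y alpha - phi(U)^T beta
#                        = [phi(X); phi(U)]^T gamma_1,
# with gamma_1^T = (alpha^T Y, -beta^T).
gamma1 = np.concatenate([Y @ alpha, -beta])
w1 = Z.T @ gamma1
assert np.allclose(w1, X.T @ Y @ alpha - U.T @ beta)

# Per-view prediction score for a new point x: K_1(x) gamma_1 + b_1,
# where K_1(x) = phi(x)^T [phi(X)^T, phi(U)^T].
x_new, b1 = rng.normal(size=d), 0.3
score = (Z @ x_new) @ gamma1 + b1
assert np.isclose(score, w1 @ x_new + b1)
```

The two assertions confirm that the compound variable γ_1 reproduces both the primal weight vector and the kernelized prediction, which is the identity the dual formulation rests on.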
The dual problem reads as

  min    (1/2) Σ_k γ_k^T K_k γ_k − Σ_k 1^T α_k        (2.2)
  w.r.t. α_k ∈ R^{m_L}, k = 1, 2,  β ∈ R^{m_U},
  where  γ_k^T = (α_k^T Y, (−1)^k β^T),
  s.t.   1^T γ_k = 0,
         0 ≤ α_k ≤ C_k, k = 1, 2,
         −C_η ≤ β ≤ C_η.

The reader familiar with the dual formulation of the support vector machine can recognize the similarity between the structures of the SVM_2K and the SVM. This similarity allows us to use the algorithms which have been developed to solve the SVM dual problem with only moderate modifications. Our implementation follows the one published in [18]. A short introduction to this solution is given in Section 2.6. After w_k, k = 1, 2, have been computed, the prediction for a new input item x = (x^1, x^2) can be obtained by

  y(x) = sign( (1/2) [φ_1(x^1)^T w_1 + b_1 + φ_2(x^2)^T w_2 + b_2] )
       = sign( (1/2) Σ_k (φ_k(x^k)^T [φ_k(X_k)^T, φ_k(U_k)^T] γ_k + b_k) )
       = sign( (1/2) Σ_k (K_k(x^k) γ_k + b_k) ),

with the shorthand notation

  K_k(x^k) = φ_k(x^k)^T [φ_k(X_k)^T, φ_k(U_k)^T],   k = 1, 2.

In the next subsections some variants of the base formulation are presented.

2.3. Supervised multiview learning as a subcase

The optimization schema presented above turns out to be applicable when only labeled cases are given. In that case we have supervised learning with two views, and we force the similarity between the predictions built upon those views. To this end the unlabeled inputs U_k are replaced with X_k for k = 1, 2. It is easy to check that all the steps above can be carried out without any modification. We can make a simplification in the expressions of w_k, see (9), by recognizing that the dimensions of the dual variables α_k and β are now the same; hence we can reduce the dimension of the corresponding compound variables γ_k, k = 1, 2:

  w_k = φ_k(X_k)^T Y α_k + (−1)^k φ_k(X_k)^T β
      = φ_k(X_k)^T [Y α_k + (−1)^k β]
      = φ_k(X_k)^T γ̃_k,   where γ̃_k^T = (α_k^T Y + (−1)^k β^T).        (10)

This kind of supervised learning has been presented by Meng et al. [18] and Farquhar et al. [12].

2.4. ε-tolerance

The synthesis constraint forcing the similarity between the predictions can be transformed into a more robust formulation which allows us to ignore small departures between the views. To realize this, the constraint

  |(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1)| ≤ η

is turned into

  |(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1)| ≤ ε 1 + η,        (11)

which absorbs the differences smaller than a fixed given positive threshold ε. A similar application of ε-tolerance appears in the formulation of support vector regression; the reader can refer to the book of Vapnik [22] and to Cristianini et al. [10] as well.

2.5. Concatenated case

In the primal formulation of the optimization problem (3) the views have been handled in distinct constraints. One can unify these constraints by concatenating the two feature vectors into one and then solving the simplified problem

  min    (1/2) Σ_k w_k^T w_k                                              ← regularization
       + C 1^T ξ                                                          ← labeled loss
       + C_η 1^T (η^+ + η^-)                                              ← unlabeled loss
  w.r.t. w_k, b_k, k = 1, 2, ξ, η^+, η^-,
  s.t.   (φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1) = η^+ − η^-,     ← synthesis
         Y ( Σ_k [φ_k(X_k) w_k + b_k 1] ) ≥ 1 − ξ,                        ← "concatenated" SVM
         ξ ∈ R_+^{m_L},  η^+, η^- ∈ R_+^{m_U}.        (12)

This formulation reduces the number of Lagrangian dual variables, since the corresponding dual vectors α_1 and α_2 will be the same. This reformulation decreases the complexity of the corresponding dual problem and consequently diminishes the computational cost.

2.6. Skeleton of the algorithm

The detailed description of the algorithm solving the SVM_2K problem is given in [18]. Here we give an outline of the method and offer a simple alternative well suited to large-scale problems. The algorithm solves the dual problem (2.2). The basic idea is a particular elimination of the constraints occurring in the dual. The elimination employs the
augmented Lagrangian formulation, which turns the constraints that we are going to eliminate into a penalty term in the Lagrangian functional and then solves this modified problem. Details of the formulation, theoretical background and proofs of convergence can be found in the book of Bertsekas [4] and in the references therein. The partial augmentation of (2.2) reads as

  L(α_1, α_2, β; λ, c) = min  (1/2) Σ_k γ_k^T K_k γ_k − Σ_k 1^T α_k + λ^T A z + (c/2) z^T A^T A z        (2.6)
  s.t.  0 ≤ α_k ≤ C_k, k = 1, 2,
        −C_η ≤ β ≤ C_η,

where c > 0 is a penalty parameter, the matrix A is built upon the linear constraints and is equal to

  A = [ y^T  0^T  −1^T ]
      [ 0^T  y^T  +1^T ],

and the vector λ comprises the corresponding Lagrangian multipliers; y is a vector containing the output labels (y_i). Let z^T = (α_1^T, α_2^T, β^T) and let Z be the set of z's satisfying the box constraints in (2.6). The algorithm then follows the schema:

Step 1: Let λ^0 be an initial solution for the Lagrangian multipliers, c^0 > 1 a given initial value of the penalty parameter c, γ_c > 1 a multiplier for c, ε a required accuracy, and l = 0.

Step 2: Solve the problem for z at fixed λ^l:

  min_z  L(z; λ^l, c^l)   s.t.  z ∈ Z.        (13)

The optimum solution is denoted by z^l.

Step 3: Update the Lagrangian multipliers by λ^{l+1} = λ^l + c^l A z^l and the running constant by c^{l+1} = γ_c c^l.

Step 4: If ||A z^l|| < ε, a given error tolerance, then stop; else set l = l + 1 and go to Step 2.

It is proved in [4] that if the running constant goes to infinity, then the solution of the augmented Lagrangian problem converges to the solution of the original problem. The critical point is to find the solution of the subproblem in Step 2. In [18] a conditional gradient based method was proposed; however, it may be better to apply a simple coordinate descent schema exploiting the simplicity of the box constraints. It involves a component-wise optimization of the objective function on the corresponding slice of the box; see details in [4]. This approach is a simpler version of the sequential minimal optimization (SMO) introduced by Platt [20] for the support vector machine. The experiments show that this simple, task-specific method becomes superior to other methods developed for linearly constrained quadratic programming problems when the sample size goes beyond 1000 items.
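Steps 1–4 can be sketched on a generic box-constrained quadratic program with an eliminated equality constraint A z = 0. The matrices and constants below are random placeholders standing in for the SVM_2K dual quantities, and the inner solver is the simple coordinate descent variant mentioned above, not the implementation of [18]:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 2
Q = rng.normal(size=(n, n))
Q = Q @ Q.T + n * np.eye(n)         # convex quadratic term (stands in for the block-diagonal kernel part)
q = rng.normal(size=n)              # linear term (stands in for the 1-vector on the alphas)
A = rng.normal(size=(p, n))         # eliminated equality constraints A z = 0
lo, hi = np.zeros(n), np.ones(n)    # box constraints on z = (alpha_1, alpha_2, beta)

def solve_box_qp_cd(lam, c_pen, z, sweeps=200):
    """Step 2: minimize 0.5 z'Qz - q'z + lam'(Az) + (c/2)||Az||^2 over the box, one coordinate at a time."""
    H = Q + c_pen * A.T @ A         # Hessian of the augmented objective
    lin = q - A.T @ lam             # collected linear coefficients
    for _ in range(sweeps):
        for i in range(n):
            grad_i = H[i] @ z - lin[i]
            # exact minimizer along coordinate i, clipped back to the box slice
            z[i] = np.clip(z[i] - grad_i / H[i, i], lo[i], hi[i])
    return z

lam, c_pen, gamma_c = np.zeros(p), 2.0, 2.0   # Step 1: initial multipliers and penalty
z = rng.uniform(size=n)
for _ in range(25):
    z = solve_box_qp_cd(lam, c_pen, z)        # Step 2
    lam = lam + c_pen * (A @ z)               # Step 3: multiplier update
    c_pen *= gamma_c                          #         growing penalty parameter
    if np.linalg.norm(A @ z) < 1e-8:          # Step 4: stopping test
        break
print(np.linalg.norm(A @ z))                  # equality violation shrinks toward 0
```

The coordinate update is exact for a quadratic objective, so each inner sweep never increases the augmented Lagrangian, and the outer loop drives the equality violation ||A z|| toward zero as the penalty grows.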
3. Theoretical foundation

To illuminate the theoretical background of the presented learning method we apply the theory of Rademacher complexity, see [2]. We use the notation of Section 2 when referring to the quantities used in the learning algorithms. The key idea is that by restricting the two linear functions f_1 and f_2 to give similar outputs we implicitly restrict the flexibility of the functions that can be implemented, thus reducing the chances of overfitting. We would therefore expect that if we can achieve good alignment between f_1 and f_2 while having good classification accuracy, then the performance of the classifier will be improved. The difficulty with making this intuition rigorous is that statistical learning theory requires the function class to be specified independently of the training data, while our measure of alignment between f_1 and f_2 (with norms bounded by C) is computed on the unlabeled data. Hence, if the function class is defined in terms of alignments on the unlabeled data, the sampling process would appear to affect the choice of hypothesis space. This restriction should be distinguished from that arising from bounding the norm of the weight vector classifying the labeled data, since this hierarchy of classes can be specified without reference to the data. The problem can, however, be avoided if we consider the unlabeled data as being chosen before the labeled data. In this way we may view the hypothesis space as fixed at the point when the labeled data are generated. The next problem is to estimate the complexity of the space of functions that are restricted in this data-driven manner. The approach we adopt is to derive an optimization problem whose objective value is, with high probability, an upper bound on the required complexity. The accuracy of the estimate depends on the number of random Rademacher instantiations σ̂ that are used, and hence on the number of optimization problems that are solved.
Finally, we need to perform a union bound over the required range of values of C and D. Our analysis has a similar flavor to that of Balcan et al. [1] and Blum et al. [6]. The main differences are that we are able to infer advantages from all sizes of unlabeled samples without any prespecified required size. Furthermore the results are presented to indicate the level of performance obtained from the data available and the results of the learning, rather than indicating sample sizes required to give a particular error rate. Finally, we make use of Rademacher bounds in order to enable computation on the given samples.
3.1. Short introduction to Rademacher complexity theory

We first quote McDiarmid's inequality, as this will prove useful later.
Theorem 1 (McDiarmid). Let X_1, ..., X_n be independent random variables taking values in a set A, and assume that f : A^n → R satisfies

  sup_{x_1,...,x_n, x̂_i ∈ A} |f(x_1, ..., x_n) − f(x_1, ..., x̂_i, x_{i+1}, ..., x_n)| ≤ c_i,   1 ≤ i ≤ n.

Then for all ε > 0,

  P{ |f(X_1, ..., X_n) − E f(X_1, ..., X_n)| ≥ ε } ≤ 2 exp( −2ε² / Σ_{i=1}^n c_i² ).

We begin with the definition required for Rademacher complexity; see for example Bartlett and Mendelson [2], and see also [21] for an introductory exposition.

Definition 2. For a sample S = {x_1, ..., x_ℓ} generated by a distribution D on a set X and a real-valued function class F with domain X, the empirical Rademacher complexity of F is the random variable

  R̂_ℓ(F) = E_σ [ sup_{f ∈ F} | (2/ℓ) Σ_{i=1}^ℓ σ_i f(x_i) |  given  x_1, ..., x_ℓ ],

where σ = {σ_1, ..., σ_ℓ} are independent uniform {±1}-valued Rademacher random variables. The Rademacher complexity of F is

  R_ℓ(F) = E_D [ R̂_ℓ(F) ] = E_{D,σ} [ sup_{f ∈ F} | (2/ℓ) Σ_{i=1}^ℓ σ_i f(x_i) | ].

We use E_D to denote expectation with respect to a distribution D and Ê_S its empirical value on the sample S, assuming uniform sampling. The Rademacher complexity allows us to uniformly bound the expectations of all functions in a function class based on the class complexity.

Definition 3. A loss function A : R → [0, 1] is Lipschitz with constant L if it satisfies |A(a) − A(a')| ≤ L |a − a'| for all a, a' ∈ R.

We quote a lemma giving some properties of Rademacher complexity, including that of a function class composed with a fixed Lipschitz function A.

Theorem 4. Let F, F_1, ..., F_n and G be classes of real functions. Then:
(1) if F ⊆ G, then R̂_ℓ(F) ≤ R̂_ℓ(G);
(2) for every c ∈ R, R̂_ℓ(cF) = |c| R̂_ℓ(F);
(3) if A : R → R is Lipschitz with constant L and satisfies A(0) = 0, then R̂_ℓ(A ∘ F) ≤ 2L R̂_ℓ(F).

Theorem 5. Fix δ ∈ (0, 1) and let F be a class of functions mapping from S to [A, B], where at least one of A or B is zero. Let (x_i)_{i=1}^ℓ be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ F satisfies

  E_D[f(x)] ≤ Ê_S[f(x)] + R̂_ℓ(F) + 3(B − A) √( ln(2/δ) / (2ℓ) ).

For the proof the reader can consult [2,21], with the slight adaptation that the range is scaled from the standard [0, 1] to [A, B]. Application of the standard result to an appropriate scaling of the functions, together with (2) of Theorem 4, gives the result, with the proviso that the inequality must be reversed if A < 0. This is a straightforward adaptation of the main proof, since the proof relies on bounding the supremum of the difference between empirical and true expectations. Given a training set S, the class of functions that we will primarily be considering are linear functions with bounded norm:

  { x → Σ_{i=1}^ℓ a_i k(x_i, x) : a^T K a ≤ B² } ⊆ { x → ⟨w, φ(x)⟩ : ||w|| ≤ B } = F_B,

where φ is the feature mapping corresponding to the kernel function k and K is the corresponding kernel matrix for the sample S. The following result bounds the Rademacher complexity of this kind of linear function class.

Theorem 6 (Bartlett and Mendelson [2]). If k : X × X → R is a kernel and S = {x_1, ..., x_ℓ} is an i.i.d. sample from X, then the empirical Rademacher complexity of the class F_B satisfies

  R̂_ℓ(F_B) ≤ (2B/ℓ) √tr(K).
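For the norm-bounded linear class F_B, the supremum inside Definition 2 has a closed form, sup_{||w|| ≤ B} (2/ℓ) Σ_i σ_i ⟨w, x_i⟩ = (2B/ℓ) ||Σ_i σ_i x_i||, so the empirical Rademacher complexity can be estimated by Monte Carlo and compared with the Theorem 6 bound. The sample below is random, illustrative data:

```python
import numpy as np

rng = np.random.default_rng(4)
ell, d, B, N = 40, 6, 1.0, 500
X = rng.normal(size=(ell, d))       # sample S, identity feature map
K = X @ X.T                         # linear kernel matrix

# Monte Carlo estimate: average the closed-form supremum over N Rademacher draws.
sup_vals = []
for _ in range(N):
    s = rng.choice([-1.0, 1.0], size=ell)            # Rademacher variables
    sup_vals.append((2 * B / ell) * np.linalg.norm(s @ X))
estimate = np.mean(sup_vals)

bound = (2 * B / ell) * np.sqrt(np.trace(K))         # Theorem 6
print(estimate, bound)
```

By Jensen's inequality E||Σ σ_i x_i|| ≤ √(E||Σ σ_i x_i||²) = √tr(K), so the Monte Carlo estimate sits below the bound in expectation; the gap shows how much slack Theorem 6 leaves on a concrete sample.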
3.2. Analysis of the semi-supervised multiview learning

Initially we will fix the positive numbers C and D and assume that the weight vectors (w_1, b_1) and (w_2, b_2) produced by the learning algorithm both have 2-norms bounded by C. Furthermore, we assume that there is an a priori bound R on the norms of the feature space projections of the input vectors, that is,

  k(x, x) + 1 ≤ R²   for all x,

where the 1 comes from the component handling the offset b. We define an auxiliary function of the weight vectors w̄_k = (w_k, b_k), k = 1, 2:

  D̂(w̄_1, w̄_2) := Ê_S[ ||(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1)||_1 ].

Again, for the time being, we will assume that D̂(w̄_1, w̄_2) ≤ D and leave the question of choosing C and D to depend on
the data to the end of the analysis. Hence, the class of functions we are considering when applying SVM_2K to this problem can be restricted to

  F_{C,D} = { f : ([x^1, x^2]) → (1/2) Σ_{k=1}^{2} (K_k(x^k) γ_k + b_k) :
              γ_k^T K_k γ_k + b_k² ≤ C², k = 1, 2,   D̂(w̄_1, w̄_2) ≤ D },

where

  K_k(x^k) = [φ(x^k)^T φ(X_k)^T, φ(x^k)^T φ(U_k)^T].

Hence, the function class F_{C,D} has the additional restriction D̂(w̄_1, w̄_2) ≤ D, which could decrease the chances of overfitting. Applying the usual Rademacher techniques for margin bounds on generalization we obtain the following result.

Theorem 7. Fix δ ∈ (0, 1) and constants C and D, and let F_{C,D} be the class of functions described above. Let labeled samples (X_1, X_2) and unlabeled samples (U_1, U_2) be drawn independently according to a probability distribution P(X^1, X^2). Then with probability at least 1 − δ/2 over random draws of labeled samples of size m_L, every f ∈ F_{C,D} satisfies

  P_{(x,y)~D}(sign(f(x)) ≠ y) ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2) + 2 R̂_{m_L}(F_{C,D}) + 3 √( ln(4/δ) / (2 m_L) ).

Proof. First observe that

  P_{(x,y)~D}(sign(f(x)) ≠ y) ≤ E_D[min((1 − y f(x))_+, 1)],

while

  Ê_S[min((1 − y f(x))_+, 1)] ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2).

Applying Theorem 5 to the function min((1 − y f(x))_+, 1) − 1 with range [−1, 0], and observing that the function A(z) = min((1 − z)_+, 1) − 1 satisfies A(0) = 0 and has Lipschitz constant 1, we obtain

  E_D[min((1 − y f(x))_+, 1)] − 1 ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2) − 1 + 2 R̂_{m_L}(F_{C,D}) + 3 √( ln(4/δ) / (2 m_L) ).

The result follows. □

Apart from the unfixing of C and D, it therefore only remains to compute the empirical Rademacher complexity of F_{C,D}, which is the critical discriminator between the bounds computed for the individual or concatenated SVMs and that of the SVM_2K.

3.3. Empirical Rademacher complexity of F_{C,D}

The empirical Rademacher complexity of the class F_{C,D} is given by

  R̂_{m_L}(F_{C,D}) = E_σ [ sup_{f ∈ F_{C,D}} (2/m_L) Σ_{i=1}^{m_L} σ_i f(x_i) ]
    = E_σ [ sup_{||w̄_k|| ≤ C, k=1,2; D̂(w̄_1,w̄_2) ≤ D} (1/m_L) Σ_{k=1}^{2} σ^T (K_k(x^k) γ_k + b_k 1) ]
    ≤ E_σ [ sup_{||w̄_k|| ≤ C, k=1,2; D̂(w̄_1,w̄_2) ≤ D} (1/m_L) Σ_k σ^T [φ_k(X_k) w_k + b_k 1] ],

where σ ∈ {−1, +1}^{m_L}. Note that the expression in square brackets is concentrated under the uniform distribution of the Rademacher variables. Hence, we can estimate the complexity from N randomly chosen instantiations σ̂^1, ..., σ̂^N of the Rademacher variables σ. For each j = 1, ..., N, we must find the values of the learning parameters w̄_k, k = 1, 2, that maximize

  max    (1/m_L) Σ_k [ (σ̂^j)^T φ_k(X_k) w_k + (σ̂^j)^T 1 b_k ] = (1/m_L) Σ_k (σ̂^j)^T (K_k^L γ_k + b_k 1)
  w.r.t. γ_k ∈ R^{m_L + m_U}, b_k ∈ R, k = 1, 2,
  s.t.   γ_k^T K_k γ_k + b_k² ≤ C², k = 1, 2,
         (1/m_U) 1^T |(K_1^U γ_1 + b_1 1) − (K_2^U γ_2 + b_2 1)| ≤ D.        (14)

By an application of McDiarmid's inequality, with probability at least 1 − δ/2, the expected value of the objective function computed on the randomly chosen fixed σ̂ is not 2RC√(ln(2/δ)/(2 N m_L)) less than the empirical Rademacher complexity. This follows from observing that flipping a single σ_i^j can change the complexity formula by at most c_i = 2RC/(N m_L). Hence, if we define R(C, D) to be the average of the values of the objective of the optimization (14) over j = 1, ..., N, we have the following bound on the generalization of the SVM_2K.

Theorem 8. Fix δ ∈ (0, 1), let F_{C,D} be the class of functions described above, and let R(C, D) be the average of the values of the objective of the optimization (14) for j = 1, ..., N. Let labeled samples (X_1, X_2) and unlabeled samples (U_1, U_2) be drawn independently according to a probability distribution P(X^1, X^2). Then with probability at least 1 − δ
over random draws of labeled samples of size m_L, every f ∈ F_{C,D} satisfies

  P_{(x,y)~D}(sign(f(x)) ≠ y) ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2) + 2 R(C, D) + ( 4RC/√N + 3 ) √( ln(4/δ) / (2 m_L) ).

Theorem 8 shows that we can bound the generalization in terms of the quantity R(C, D) computed by optimization (14). The next section will investigate the computation of this quantity, but first we need to address the issue that in the analysis so far C and D have been fixed a priori, while in a real application we wish to treat them as outputs of the learning algorithm. This is a similar situation to that encountered in bounds involving the margin, and we follow a similar strategy to that adopted in that case. First observe that the value of the optimal solution of the optimization (3) must be less than that obtained by taking the feasible solution w_k = 0 and ξ_k = 1 for k = 1, 2. The objective value for this solution is (C_1 + C_2) m_L, and so for the optimal solution we must have ||w_k|| ≤ 2(C_1 + C_2) m_L for k = 1, 2. Similarly, to ensure that at least one data point has no margin error, the weight vector must have norm at least 1/R, so we can assume that C lies in the range [1/R, 2(C_1 + C_2) m_L]. By similar arguments we can place bounds [D_1, R(C_1 + C_2)] on the range of the expression

  (1/m_U) 1^T (η^+ + η^-),

indicating that we need only consider values of D in this range. Since the expression could potentially be zero, D_1 must be chosen a priori as a suitably small non-zero constant. We will see below that it plays only a very minor role in the bound. We now consider a grid of values for C:

  C^1 = 2(C_1 + C_2) m_L,   C^2 = C^1/2,  ...,  C^{i+1} = C^i/2,  ...,  C^{t_C},

where t_C = ⌊log_2(2 R (C_1 + C_2) m_L)⌋. We create a similar grid for a range of t_D = ⌊log_2(R(C_1 + C_2)/D_1)⌋ values of D, starting with D^1 = R(C_1 + C_2) and halving the value each time. We now apply Theorem 8 for each pair of values (C^i, D^j), i = 1, ..., t_C, j = 1, ..., t_D, with δ replaced by δ/(t_C t_D). This ensures that with probability at least 1 − δ all of the t_C t_D conclusions hold. For any values of C and D in their given ranges there is a grid point (C^i, D^j) in the rectangle [C, 2C] × [D, 2D]. Furthermore, the monotonicity of the bound implies that the bound obtained for (C^i, D^j) will be tighter than that for (2C, 2D). Putting these observations together, we obtain the following theorem.

Theorem 9. Fix δ ∈ (0, 1) and for f ∈ F let C(f) = 2 max(||w̄_1||, ||w̄_2||) and

  D(f) = 2 max( D_1, (1/m_U) 1^T (η^+ + η^-) ).

Let F_{C,D} be the class of functions described above and R(C, D) be the average of the values of the objective of optimization (14) over its N instances. Let labeled samples (X_1, X_2) and unlabeled samples (U_1, U_2) be drawn independently according to a probability distribution P(X^1, X^2). Then with probability at least 1 − δ over random draws of labeled samples of size m_L, every f ∈ F_{C(f),D(f)} satisfies

  P_{(x,y)~D}(sign(f(x)) ≠ y) ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2) + 2 R(C(f), D(f)) + ( 4 R C(f)/√N + 3 ) √( ln(4 t_C t_D/δ) / (2 m_L) ).

It remains to note that the influence of the repeated applications of Theorem 8 is benign, in that t_C and t_D are logarithms of the ratios of the two ranges, and furthermore they enter the bound under a further logarithm.
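The grid construction used in the union bound can be written out directly. The constants below are placeholders chosen for illustration, not values from the paper:

```python
import numpy as np

# Hypothetical problem constants (placeholders).
C1, C2, R, m_L, D_min, delta = 1.0, 1.0, 2.0, 100, 1e-3, 0.05

# Geometric grid for C: start at the upper end of the range and halve t_C times.
C_max = 2 * (C1 + C2) * m_L
t_C = int(np.floor(np.log2(2 * R * (C1 + C2) * m_L)))
C_grid = [C_max / 2**i for i in range(t_C)]

# Analogous grid for D, down to the a priori floor D_min (the D_1 of the text).
D_max = R * (C1 + C2)
t_D = int(np.floor(np.log2(D_max / D_min)))
D_grid = [D_max / 2**i for i in range(t_D)]

# Union bound: each of the t_C * t_D applications of Theorem 8 uses delta/(t_C * t_D),
# so all conclusions hold simultaneously with probability at least 1 - delta.
delta_each = delta / (t_C * t_D)
print(t_C, t_D, delta_each)
```

Because the grids are geometric, any admissible (C, D) is within a factor of two of a grid point, while t_C and t_D only enter the final bound through a logarithm.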
4. Estimation of the empirical rademacher complexity The estimation of the empirical Rademacher complexity can be carried out by solving the optimization problem (14). Unfortunately, (14) contains a convex maximization even with a non-differentiable objective function. Here we present a simple decomposition procedure which can overcome on these difficulties. To solve this problem the feasibility domain of (14) can be cut into two parts; one is where the objective is non-negative and the other where it is non-positive, and then comparing the optimum values received in the subproblems the optimum solution of the entire problem can be obtained. We also have a nondifferentiable constraint 1 T U 1 jðK1 g1 þ b1 1Þ ðKU 2 g2 þ b2 1ÞjpD mU that can be resolved by replacing it with an equivalent formulation U þ ðKU 1 g1 þ b1 1Þ ðK2 g2 þ b2 1Þ ¼ g g , 1 T þ 1 ðg þ g ÞpD, mU gþ X0; g X0.
Thus we arrive at two derived problems in which the maximization has been transformed into a minimization. First the non-negative part of the domain is considered:

$$\begin{array}{ll}
\min & -\dfrac{1}{m_L}\displaystyle\sum_k \hat{r}^\top(K_k^L\gamma_k + b_k\mathbf{1})\\
\text{w.r.t.} & \gamma_k \in \mathbb{R}^{m_L+m_U},\ b_k \in \mathbb{R},\ k = 1,2,\quad \gamma^+, \gamma^- \in \mathbb{R}^{m_U}\\
\text{s.t.} & \gamma_k^\top K_k\gamma_k + b_k^2 \le C^2,\quad k = 1,2,\\
& (K_1^U\gamma_1 + b_1\mathbf{1}) - (K_2^U\gamma_2 + b_2\mathbf{1}) = \gamma^+ - \gamma^-,\\
& \dfrac{1}{m_U}\mathbf{1}^\top(\gamma^+ + \gamma^-) \le D,\\
& \dfrac{1}{m_L}\displaystyle\sum_k \hat{r}^\top(K_k^L\gamma_k + b_k\mathbf{1}) \ge 0,\\
& \gamma^+ \ge 0,\quad \gamma^- \ge 0,
\end{array} \tag{15}$$

and similarly we have the problem for the non-positive part of the feasibility domain:

$$\begin{array}{ll}
\min & \dfrac{1}{m_L}\displaystyle\sum_k \hat{r}^\top(K_k^L\gamma_k + b_k\mathbf{1})\\
\text{w.r.t.} & \gamma_k \in \mathbb{R}^{m_L+m_U},\ b_k \in \mathbb{R},\ k = 1,2,\quad \gamma^+, \gamma^- \in \mathbb{R}^{m_U}\\
\text{s.t.} & \gamma_k^\top K_k\gamma_k + b_k^2 \le C^2,\quad k = 1,2,\\
& (K_1^U\gamma_1 + b_1\mathbf{1}) - (K_2^U\gamma_2 + b_2\mathbf{1}) = \gamma^+ - \gamma^-,\\
& \dfrac{1}{m_U}\mathbf{1}^\top(\gamma^+ + \gamma^-) \le D,\\
& \dfrac{1}{m_L}\displaystyle\sum_k \hat{r}^\top(K_k^L\gamma_k + b_k\mathbf{1}) \le 0,\\
& \gamma^+ \ge 0,\quad \gamma^- \ge 0.
\end{array} \tag{16}$$

Let the optimum values and the optimum solutions of (15) and (16) be denoted by $r^+$, $(\gamma_k^*, b_k^*)^+$, $k = 1,2$, and by $r^-$, $(\gamma_k^*, b_k^*)^-$, $k = 1,2$, respectively. The optimum value of the original problem (14) is then given by

$$r^* = \min(r^+, r^-),\qquad (\gamma_k^*, b_k^*) = \arg\min(r^+, r^-),\quad k = 1,2. \tag{17}$$

Both problems (15) and (16) are convex, with a linear objective function and with linear and quadratic constraints. One can transform them into second-order cone programming (SOCP) problems and solve them efficiently using appropriate interior point methods; see [15] and [8] for details. The second-order cone formulation of the quadratic constraints is derived as

$$\gamma_k^\top K_k\gamma_k + b_k^2 \le C^2 \;\Rightarrow\; \left\|\left[\hat{A}_k\gamma_k;\ b_k\right]\right\| \le C,\quad k = 1,2, \tag{18}$$

where $K_k = \hat{A}_k^\top\hat{A}_k$.

5. Experiments

Here we demonstrate the effect of the included similarity-forcing constraint on the accuracy. In the training procedure the difference between the learners touched only this constraint and the corresponding penalty constant. The test method applied on all sets obeyed the following procedure:

- The data set was cut into training and test sets by randomly sorting it into five folds, and fivefold cross-validation was applied. In each cross-validation step a randomly chosen 90%, or 95%, of the labels of the training items were deleted and these label-free items were used as unlabeled cases in the learning.
- The selection of the unlabeled cases was repeated 10 times in each of the five cross-validations.

To set the penalty parameters $C_1$, $C_2$, $C_\eta$, a validation procedure based on the training set was applied. After creating the folds and choosing the unlabeled cases, the training set was partitioned into validation-training and validation-test sets. Those values of the parameters were chosen which gave the highest accuracy on the validation-test set. The values of $C_1$ and $C_2$ were fixed to 1 and $C_\eta$ was sought among the values $0.01 \cdot 2^i$, $i = 1, \ldots, 10$.

In the experiments the kernel type was linear and a two-phase normalization was applied. In the first phase the variables were normalized by subtracting the corresponding expected values and scaling them to have variance 1. In the second phase the input vectors were projected onto a ball with unit radius. The normalization is a critical step in filtering the influence of the unlabeled outliers out of the model. The linear kernel is applied in our experiments in order to avoid additional parameters not related to the discussed framework. The estimation of the empirical Rademacher complexity was computed in parallel with each estimation of the test accuracy. One estimation was computed on 10 randomly chosen Rademacher labels.

5.1. Case of two input sources

The data set (available at http://www.robots.ox.ac.uk/vgg/data/) is commonly used for generic object recognition, e.g. by Fergus et al. [13] and other papers. The three object classes in this data set are motorbikes, aeroplanes and faces. It also contains an additional background class. For each image two sets of low-level features were computed. One (available at http://lear.inrialpes.fr/people/Mikolajczyk/) used the affine invariant Harris detector [19] developed by Mikolajczyk and Schmid to detect interest points within an image, with invariant moments as patch descriptors. The other used Lowe's keypoint detector (available at http://www.cs.ubc.ca/lowe/keypoints/) to detect interesting patches, with SIFT [16] patch descriptors. These sets of image patch descriptors then form the basis of the feature generation.

Because different images have different numbers of interest points, vector quantization was used to map these sets of points into a fixed-length feature vector. Specifically, k-means was used to learn K cluster centers based upon the features from all images. For each image a fixed-length K-dimensional feature vector was then created by recording the minimum distance between an image feature and each of these K centers. In all included experiments the clustering parameter was chosen as K = 400. The results are presented in Table 1.

Table 1
Expected values (standard deviations) of the accuracies (%) computed on the test sets and the estimated empirical Rademacher complexities $\hat{R}$ (standard deviations) of the semi-supervised learning

Data set               Labeled (%)   SVM (concatenated)           SVM_2K
(Image classification)               Accuracy     $\hat{R}$       Accuracy     $\hat{R}$
Airplanes              5             93.3 (3.1)   104.4 (49.1)    94.4 (1.9)   105.2 (60.8)
Faces                  5             96.7 (1.7)   94.7 (46.6)     97.1 (1.4)   64.4 (27.2)
Motorbikes             5             94.3 (0.9)   81.2 (32.2)     94.6 (0.6)   59.5 (25.5)

5.2. Views on random split

In this group of experiments we used examples from the UCI Repository of machine learning data sets; details about this repository are given in [5]. The data sets included in the experiments are shown in Table 2. The categorical variables in the Credit data set were transformed into indicator variables. The test method applied on these sets was extended by a random partitioning of the variables:

- To create two views, the set of variables was split randomly into two subsets. The chance of falling into one of these subsets was the same for each variable. If one of the two subsets was empty, the random selection was repeated until both subsets contained at least one variable. The splitting of the set of variables was repeated 10 times in each of the five fold cross-validations.

Table 2
The data sets used in the experiments

Data set                            Number of items   Number of variables
Wisconsin breast cancer             699               9
Credit card application approval    690               15
Haberman's survival data            306               4
Pima Indians diabetes               768               8

Table 3
Expected values (standard deviations) of the accuracies (%) computed on the test sets and the estimated empirical Rademacher complexities $\hat{R}$ (standard deviations) of the semi-supervised learning

Data sets         Labeled (%)   SVM (concatenated)            SVM_2K
(UCI data sets)                 Accuracy        $\hat{R}$     Accuracy        $\hat{R}$
Breast cancer     10            96.04 (1.88)    1.06 (0.41)   96.72 (1.37)    1.24 (0.42)
                  5             95.28 (2.82)    1.66 (0.61)   96.21 (1.42)    1.48 (0.57)
Credit            10            81.95 (4.45)    4.71 (1.10)   83.54 (4.40)    4.34 (2.22)
                  5             79.12 (6.20)    8.04 (3.08)   80.73 (9.19)    5.97 (2.77)
Haberman          10            69.42 (10.41)   0.66 (0.43)   74.00 (3.96)    0.20 (0.31)
                  5             63.58 (12.63)   1.02 (0.43)   68.24 (14.43)   0.35 (0.41)
Pima              10            72.67 (3.43)    2.60 (1.22)   70.49 (4.27)    1.97 (1.56)
                  5             70.07 (4.25)    4.05 (1.56)   67.40 (5.05)    1.95 (1.74)

Analyzing the results in Table 3, one can recognize that smaller complexity tends to go together with higher accuracy. Some obvious consequences of the structure of the Rademacher complexity can be noted as well: an increased number of labeled cases reduces the complexity, but this effect is significantly smaller in the case of the SVM_2K, since the unlabeled cases are incorporated into the model.

Fig. 2. Comparison of the SVM working on 10%+5% of labeled data and the SVM_2K working on 10% labeled data and a gradually increasing proportion (5-90%) of unlabeled data; the figure plots accuracy (%) against the percentage of unlabeled data considered, with the SVM mean accuracy and its upper and lower confidence bands shown for reference. The 10% labeled data were the same in each of the experiments.
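The Monte Carlo averaging over a small number of Rademacher label draws used above (10 in the experiments) can be sketched for a simple linear function class, for which the supremum in the definition has a closed form. This is an illustration of the estimator's structure only, not the SVM_2K function class of the paper:

```python
import numpy as np

def empirical_rademacher(X, B, n_draws=10, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    linear class {x -> <w, x> : ||w|| <= B} on the sample X (m x d).
    For this class the supremum over w has the closed form
    (2*B/m) * ||sum_i s_i x_i||, with s_i the Rademacher labels."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    vals = []
    for _ in range(n_draws):
        s = rng.choice([-1.0, 1.0], size=m)   # random Rademacher labels
        vals.append(2.0 * B / m * np.linalg.norm(s @ X))
    return float(np.mean(vals))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# The estimate scales linearly with the norm bound B of the class ...
r1, r2 = empirical_rademacher(X, 1.0), empirical_rademacher(X, 2.0)
assert np.isclose(r2, 2.0 * r1)
# ... and shrinks roughly like 1/sqrt(m) as the sample grows.
```

The two properties checked at the end mirror the qualitative behaviour discussed for Table 3: a richer class (larger norm bound) raises the complexity, while more data lowers it.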
5.3. The value of the unlabeled data

Finally, we demonstrate the classification value of the unlabeled data in Fig. 2, using the Wisconsin breast cancer data. In this experiment 10% of the training data was chosen as labeled input and kept fixed. The SVM received this subsample plus an additional 5% as labeled data. The proposed method, SVM_2K, received the fixed 10% labeled part of the training set and a gradually increasing proportion of the remaining 90% as the unlabeled input source. In the first experiment the unlabeled portion started at 5%, and it was increased by 5% in each of the following experiments. Fig. 2 shows that, under the outlined circumstances, approximately 25% of unlabeled cases are worth as much as 5% of labeled ones.

6. Discussion

We presented a generalization-theory-based extension of the well-known support vector machine. Our objective was to find a modification of the optimization formulation of the SVM which can improve its performance when the unknown distribution of the data allows us to gain higher accuracy. We must emphasize that extending the capability of any kind of learner uniformly, for all possible data sets, is theoretically impossible; thus our extension can work only on a certain class of problems. Further research should be carried out to discover the boundary of this class.

References

[1] M.F. Balcan, A. Blum, A PAC-style model for learning from labeled and unlabeled data, in: COLT, 2005, pp. 111-126.
[2] P.L. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural results, J. Mach. Learning Res. 3 (2002) 463-482.
[3] K. Bennett, A. Demirez, Semi-supervised support vector machines, in: M.S. Kearns, S.A. Solla, D.A. Cohn (Eds.), Advances in Neural Information Processing Systems, vol. 12, MIT Press, Cambridge, MA, 1998, pp. 368-374.
[4] D.P. Bertsekas, Nonlinear Programming, second ed., Athena Scientific, 1999.
[5] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, Department of Information and Computer Sciences, University of California, Irvine, 1998 <http://www.ics.uci.edu/~mlearn/MLRepository.html>.
[6] A. Blum, M.F. Balcan, An augmented PAC model for semi-supervised learning, in: O. Chapelle, A. Zien, B. Schölkopf (Eds.), Semi-Supervised Learning, MIT Press, Boston, 2006.
[7] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the 11th Annual Conference on Computational Learning Theory, 1998, pp. 92-100.
[8] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[9] O. Chapelle, B. Schölkopf, A. Zien (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, 2006.
[10] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[11] S. Dasgupta, M.L. Littman, D. McAllester, PAC generalization bounds for co-training, in: Advances in Neural Information Processing Systems (NIPS), 2001.
[12] J.D.R. Farquhar, D.R. Hardoon, J. Shawe-Taylor, H. Meng, S. Szedmak, Two view learning: SVM-2K, theory and practice, in: NIPS, 2005.
[13] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[14] M. Kääriäinen, Generalization error bounds using unlabeled data, in: Learning Theory: 18th Annual Conference on Learning Theory, COLT '05, 2005, pp. 127-142.
[15] M. Lobo, L. Vandenberghe, S. Boyd, H. Lebret, Applications of second-order cone programming, Linear Algebra Appl. (Special Issue on Linear Algebra in Control, Signals and Image Processing) 284 (1998) 193-228.
[16] D. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 1999, pp. 1150-1157.
[17] G. Lugosi, M. Pinter, A data-dependent skeleton estimate for learning, in: Proceedings of the Ninth Annual Conference on Computational Learning Theory, Association for Computing Machinery, New York, 1996, pp. 51-58.
[18] H. Meng, J. Shawe-Taylor, S. Szedmak, J.R.D. Farquhar, Support vector machine to synthesise kernels, in: Sheffield Machine Learning Workshop Proceedings, Lecture Notes in Computer Science, Springer, Berlin, 2005.
[19] K. Mikolajczyk, C. Schmid, An affine invariant interest point detector, in: Proceedings of the 2002 European Conference on Computer Vision, Copenhagen, Denmark, 2002, pp. 128-142.
[20] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1998.
[21] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[22] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[23] X. Zhu, Semi-supervised learning literature survey, Computer Science Department, University of Wisconsin, Madison, 2006 <http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf>.

Sandor Szedmak was born in Hungary. He obtained an MS degree in Mathematics from Lajos Kossuth University, Debrecen, Hungary, and received his PhD in Operations Research from Rutgers, The State University of New Jersey, USA. He has held research positions at the Computer Science Departments of Royal Holloway, University of London, and the University of Helsinki. He is currently a senior research fellow at the School of Electronics and Computer Science of the University of Southampton, UK. His main research interest focuses on optimization problems arising in machine learning and pattern recognition, especially in large-scale applications.

John Shawe-Taylor obtained a PhD in Mathematics at Royal Holloway, University of London in 1986. He subsequently completed an MSc in the Foundations of Advanced Information Technology at Imperial College. He was promoted to Professor of Computing Science in 1996. He has published over 150 research papers. He was appointed director of the Centre for Computational Statistics and Machine Learning at University College London. He has pioneered the development of well-founded approaches to machine learning inspired by statistical learning theory (including support vector machines, boosting and kernel principal component analysis) and has shown the viability of applying these techniques to document analysis and computer vision. He is co-author of An Introduction to Support Vector Machines, the first comprehensive account of this new generation of machine learning algorithms. A second book, Kernel Methods for Pattern Analysis, was published in 2004.