ARTICLE IN PRESS
Neurocomputing 70 (2007) 1254–1264 www.elsevier.com/locate/neucom
Synthesis of maximum margin and multiview learning using unlabeled data☆

Sandor Szedmak^{a,b,*}, John Shawe-Taylor^{c}

^a ISIS Group, Electronics and Computer Science, University of Southampton, SO17 1BJ, UK
^b Department of Computer Science, University of Helsinki, Finland
^c Department of Computer Science, University College London, WC1E 6BT, UK

Available online 13 December 2006
Abstract

In this paper we show that semi-supervised learning with two input sources can be transformed into a maximum margin problem similar to a binary support vector machine. Our formulation exploits the unlabeled data to reduce the complexity of the class of learning functions. To measure how the complexity is decreased we use Rademacher complexity theory. The corresponding optimization problem is convex and can be solved efficiently for large-scale applications as well.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Semi-supervised learning; Maximum margin; Support vector machine; Rademacher complexity; Multiview learning
1. Introduction

Semi-supervised learning belongs to the mainstream of recent machine learning research. The exploitation of unlabeled data is an attractive approach either to extend the capability of known methods or to derive novel learning devices. Surveys of the recent directions in semi-supervised learning can be found in [9,23]. In this paper we give a synthesis of some earlier approaches and show an optimization framework with good statistical generalization capability. Before outlining our learning strategy we need to raise the essential question which initiated this research. The support vector machine has proved to be an efficient learning tool; its performance is demonstrated in several studies, and plenty of work has been devoted to giving it a strong theoretical foundation. The question is: can we go further? Is there any way to set up a learning device which can improve on the performance of the SVM for some families of distributions? This paper tries to

☆ This work is supported by the PASCAL Network of Excellence (IST-2002-506778), European Community IST Programmes.
* Corresponding author. ISIS Group, Electronics and Computer Science, University of Southampton, SO17 1BJ, United Kingdom.
E-mail addresses:
[email protected] (S. Szedmak),
[email protected] (J. Shawe-Taylor).
0925-2312/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2006.11.012
answer this question by presenting a model that might help to make the next step. Our idea is built upon:

- Multiview learning, where two or more sources of the inputs are given with the same output; see the papers of Blum et al. [7] and Dasgupta et al. [11]. Multiview learning is also referred to as co-training in several papers.
- Maximum margin learning, where the unlabeled cases within the margin are considered as errors; see Bennett et al. [3].
- Reduction of the learning class complexity, where "closeness" of the predictions for the unlabeled cases computed on the distinct sources of the input is assumed; see the conference paper about skeleton learning of Lugosi et al. [17], Balcan et al. [1], Kääriäinen [14] and Blum et al. [6].
- Combination of multiview and maximum margin learning, which introduces new constraints into the optimization problem of supervised maximum margin learning; see Meng et al. [18] and Farquhar et al. [12].

Blending these ideas we arrive at an optimization framework which can be solved efficiently, at the level of computational complexity of a binary SVM problem. In the first part of the paper the learning framework and the corresponding optimization problem are presented, and in the second part the generalization performance is
analyzed. We apply the Rademacher complexity to give an upper bound on the expected error of the label prediction. At the end, experimental results illustrate the usefulness of the approach. The underlying theory is summarized in Section 3. Further details can be found in [2].

2. Learning framework

The main idea is that if we can restrict the function class occurring in the SVM in a plausible way, then by decreasing the complexity the generalization performance could be improved. This restriction is realized by exploiting unlabeled sources of the data via multiview learning. The general solution framework is depicted in Fig. 1.

Informally, in order to exploit the unlabeled data, we are looking for particular solutions relating to the different views where the predictors give similar outputs on the unlabeled data. The similarity is measured by the L1 norm of the differences between the real-valued predictions,

  |f_1(u_i^1) − f_2(u_i^2)|,   i = 1, ..., m_U,        (2)

and it is minimized simultaneously with the error occurring in the estimation of the labeled cases. Two-view learning seems to impose a restriction on the possible applications. To obtain distinct views one can apply a random partition of the available feature components. This kind of random selection was applied in the experiments; the details are in Section 5.

2.1. Learning task

In semi-supervised, multiview learning with two views, we are given a compound sample S_L of pairs of outputs and inputs

  {(y_i, (x_i^1, x_i^2)) : y_i ∈ {−1, +1}, x_i^k ∈ X_k, i = 1, ..., m_L, k = 1, 2}

independently and identically generated by an unknown multivariate distribution P(Y, X^1, X^2), and a compound sample

  S_U = {(u_i^1, u_i^2) : u_i^k ∈ X_k, i = 1, ..., m_U, k = 1, 2}

of unlabeled cases independently and identically chosen from the marginal distribution P(X^1, X^2) of P(Y, X^1, X^2). Furthermore, there are given embeddings of the input vectors into Hilbert spaces, called feature spaces, by the functions φ_k : X_k → H_{φ_k}, k = 1, 2. The image vectors of these mappings are called feature vectors in the sequel. The objective is to find linear functions f_k(x^k) = w_k^T φ_k(x^k) + b_k which can predict the potential label value for any labeled and unlabeled case. The decision function is then defined for an arbitrary input x = (x^1, x^2) by

  f(x) = (1/2) Σ_k sign(f_k(x^k)).        (1)

2.2. Optimization problem

Before formulating the optimization framework the following notations are introduced: the matrix Y is a diagonal matrix of the labels {y_i}, and the matrices X_k and U_k comprise the labeled and unlabeled inputs in their rows for k = 1, 2. The embedding functions φ_k applied to X_k and U_k, denoted by φ_k(X_k) and φ_k(U_k), give matrices with the feature vectors in their rows. All vectors are column vectors and they are denoted by lower-case, bold face Latin or Greek letters, except 0 and 1, which denote vectors with all components 0 and 1, respectively. The notation used for matrices is upper-case bold face letters. Scalars are lower-case and constants are upper-case letters. R_+ denotes the set of nonnegative real numbers. The expressions are presented in vectorized form to emphasize the linear algebraic relations and to decrease the number of indexes. Thus, for example, 1^T x means Σ_{i=1}^{m_L} x_i.

Fig. 1. Learning with two views via forcing the similarity between the predictions.
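To make Eqs. (1) and (2) concrete, the following numpy sketch evaluates the combined decision function and the per-case disagreements on the unlabeled sample. All data, dimensions and weights below are illustrative placeholders, and the feature maps are taken to be the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
m_L, m_U, d1, d2 = 20, 30, 5, 7

# Two views of the labeled and unlabeled inputs (rows are cases).
X1, X2 = rng.normal(size=(m_L, d1)), rng.normal(size=(m_L, d2))
U1, U2 = rng.normal(size=(m_U, d1)), rng.normal(size=(m_U, d2))

# Per-view linear predictors f_k(x) = w_k^T phi_k(x) + b_k (identity feature maps here).
w1, b1 = rng.normal(size=d1), 0.1
w2, b2 = rng.normal(size=d2), -0.2

def f1(X):
    return X @ w1 + b1

def f2(X):
    return X @ w2 + b2

# Decision function (1): f(x) = (1/2) * sum_k sign(f_k(x^k)).
def decide(X1_, X2_):
    return 0.5 * (np.sign(f1(X1_)) + np.sign(f2(X2_)))

# Disagreement (2) on the unlabeled cases: |f_1(u_i^1) - f_2(u_i^2)|, i = 1..m_U;
# summing these values gives the L1 norm that the synthesis constraint bounds.
disagreement = np.abs(f1(U1) - f2(U2))
print(decide(X1, X2)[:5], disagreement.sum())
```

Note that the combined decision takes the value 0 exactly when the two views disagree on the sign, which is the motivation for forcing their outputs to be close on the unlabeled sample.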
The optimization problem formulating our learning framework is given by

  min    (1/2) Σ_k w_k^T w_k                                              ← regularization
       + Σ_k C_k 1^T ξ_k                                                  ← labeled loss
       + C_η 1^T η                                                        ← unlabeled loss
  w.r.t. w_k, b_k, ξ_k, k = 1, 2, η,
  s.t.   Y(φ_k(X_k) w_k + b_k 1) ≥ 1 − ξ_k, k = 1, 2,                     ← SVM subproblems
         |(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1)| ≤ η,           ← synthesis
         ξ_k ∈ R_+^{m_L}, k = 1, 2,  η ∈ R^{m_U}.        (3)

The constants C_k, k = 1, 2, and C_η control the balance between the labeled and unlabeled loss terms in the objective function. They are positive parameters whose optimal values can be found by a validation procedure for a given problem. In what follows we use the acronym SVM_2K for problem (3). Since we have a linearly constrained quadratic optimization task whose quadratic term is convex, (3) is a convex problem. The absolute values in the synthesis constraint and the corresponding term in the objective function can be unfolded by the substitutions

  η = η^+ − η^-,   ||η||_1 = 1^T (η^+ + η^-),   η^+, η^- ∈ R_+^{m_U};        (4)

then we have

  min    (1/2) Σ_k w_k^T w_k                                              ← regularization
       + Σ_k C_k 1^T ξ_k                                                  ← labeled loss
       + C_η 1^T (η^+ + η^-)                                              ← unlabeled loss
  w.r.t. w_k, b_k, ξ_k, k = 1, 2, η^+, η^-,
  s.t.   Y(φ_k(X_k) w_k + b_k 1) ≥ 1 − ξ_k, k = 1, 2,                     ← SVM subproblems
         (φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1) = η^+ − η^-,     ← synthesis
         ξ_k ∈ R_+^{m_L}, k = 1, 2,  η^+, η^- ∈ R_+^{m_U}.        (5)

We introduce vectors of Lagrangian multipliers: α_k, for any k, in the SVM subproblems, where the components of these vectors correspond to the labeled cases; the vector β in the synthesis constraint, with one component for each unlabeled case; and c_k, d^+, d^- for the non-negativity constraints. The assignment between the multipliers and the constraints is summarized in the next table:

  α_1 : Y(φ_1(X_1) w_1 + b_1 1) ≥ 1 − ξ_1,
  α_2 : Y(φ_2(X_2) w_2 + b_2 1) ≥ 1 − ξ_2,
  β   : (φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1) = η^+ − η^-,
  c_1 : ξ_1 ≥ 0,
  c_2 : ξ_2 ≥ 0,
  d^+ : η^+ ≥ 0,
  d^- : η^- ≥ 0.        (6)

We can then write up the corresponding Lagrangian functional

  L((w_k, b_k, ξ_k, η^+, η^-); (α_k, β, c_k, d^±))
    = (1/2) Σ_k w_k^T w_k + Σ_k C_k 1^T ξ_k + C_η 1^T (η^+ + η^-)
      + β^T [(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1) − η^+ + η^-]
      − Σ_k α_k^T (Y[φ_k(X_k) w_k + b_k 1] − 1 + ξ_k)
      − Σ_k c_k^T ξ_k − d^{+T} η^+ − d^{-T} η^-,        (7)

where we have the restrictions

  α_k, c_k ∈ R_+^{m_L}, k = 1, 2,   d^+, d^- ∈ R_+^{m_U},        (8)

imposed on the dual variables belonging to the inequality constraints. Unfolding the Karush–Kuhn–Tucker conditions we can express the normal vector of the separating hyperplane for any k by

  w_k = φ_k(X_k)^T Y α_k + (−1)^k φ_k(U_k)^T β
      = [φ_k(X_k)^T, φ_k(U_k)^T] γ_k,   where γ_k^T = (α_k^T Y, (−1)^k β^T).        (9)

After substituting the expressions of the primal variables back into the Lagrangian functional we arrive at the dual problem of (3). The kernel matrices in the dual have the structure

  K_k = [ φ_k(X_k) φ_k(X_k)^T   φ_k(X_k) φ_k(U_k)^T ]   =   [ K_k^{LL}  K_k^{LU} ]   =   [ K_k^L ]
        [ φ_k(U_k) φ_k(X_k)^T   φ_k(U_k) φ_k(U_k)^T ]       [ K_k^{UL}  K_k^{UU} ]       [ K_k^U ].

The superscripts of the sub-matrices refer to the labeled and unlabeled sources of the sub-kernels. We are going to exploit these notations in the following sections of the paper.
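The block structure of the compound kernel matrix and the KKT expression (9) can be exercised numerically. The sketch below uses a linear kernel, identity feature maps, and arbitrary (not optimized) dual values, all of which are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m_L, m_U, d = 8, 5, 4
X = rng.normal(size=(m_L, d))            # labeled inputs of one view (rows)
U = rng.normal(size=(m_U, d))            # unlabeled inputs of the same view
y = rng.choice([-1.0, 1.0], size=m_L)
Y = np.diag(y)

# Compound kernel matrix with labeled/unlabeled blocks (linear kernel):
Z = np.vstack([X, U])
K = Z @ Z.T
K_LL, K_LU = K[:m_L, :m_L], K[:m_L, m_L:]
K_UL, K_UU = K[m_L:, :m_L], K[m_L:, m_L:]

# Arbitrary feasible dual values, only to exercise the algebra.
alpha = rng.uniform(0.0, 1.0, size=m_L)  # SVM-subproblem multipliers
beta = rng.uniform(-1.0, 1.0, size=m_U)  # synthesis-constraint multipliers

# Eq. (9) for k = 1: w_1 = phi(X)^T Y alpha - phi(U)^T beta
#                        = [phi(X); phi(U)]^T gamma_1,
# with gamma_1^T = (alpha^T Y, -beta^T).
gamma1 = np.concatenate([Y @ alpha, -beta])
w1 = Z.T @ gamma1
assert np.allclose(w1, X.T @ Y @ alpha - U.T @ beta)

# Per-view prediction score for a new point x: K_1(x) gamma_1 + b_1,
# where K_1(x) = phi(x)^T [phi(X)^T, phi(U)^T].
x_new, b1 = rng.normal(size=d), 0.3
score = (Z @ x_new) @ gamma1 + b1
assert np.isclose(score, w1 @ x_new + b1)
```

The two assertions confirm that the compound variable γ_1 reproduces both the primal weight vector and the kernelized prediction, which is the identity the dual formulation rests on.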
The dual problem reads as

  min    (1/2) Σ_k γ_k^T K_k γ_k − Σ_k 1^T α_k        (2.2)
  w.r.t. α_k ∈ R^{m_L}, k = 1, 2,  β ∈ R^{m_U},
  where  γ_k^T = (α_k^T Y, (−1)^k β^T),
  s.t.   1^T γ_k = 0,
         0 ≤ α_k ≤ C_k, k = 1, 2,
         −C_η ≤ β ≤ C_η.

The reader familiar with the dual formulation of the support vector machine can recognize the similarity between the structures of the SVM_2K and the SVM. This similarity allows us to use the algorithms which have been developed to solve the SVM dual problem with only moderate modifications. Our implementation follows the one published in [18]. A short introduction to this solution is given in Section 2.6. After w_k, k = 1, 2, have been computed, the prediction for a new input item x = (x^1, x^2) can be obtained by

  y(x) = sign( (1/2) [φ_1(x^1)^T w_1 + b_1 + φ_2(x^2)^T w_2 + b_2] )
       = sign( (1/2) Σ_k (φ_k(x^k)^T [φ_k(X_k)^T, φ_k(U_k)^T] γ_k + b_k) )
       = sign( (1/2) Σ_k (K_k(x^k) γ_k + b_k) ),

with the shorthand notation

  K_k(x^k) = φ_k(x^k)^T [φ_k(X_k)^T, φ_k(U_k)^T],   k = 1, 2.

In the next subsections some variants of the base formulation are presented.

2.3. Supervised multiview learning as a subcase

The optimization schema presented above turns out to be applicable when only labeled cases are given. In that case we have supervised learning with two views, and we force the similarity between the predictions built upon those views. To this end the unlabeled inputs U_k are replaced with X_k for k = 1, 2. It is easy to check that all the steps above can be carried out without any modification. We can make a simplification in the expressions of w_k, see (9), by recognizing that the dimensions of the dual variables α_k and β are now the same; hence we can reduce the dimension of the corresponding compound variables γ_k, k = 1, 2:

  w_k = φ_k(X_k)^T Y α_k + (−1)^k φ_k(X_k)^T β
      = φ_k(X_k)^T [Y α_k + (−1)^k β]
      = φ_k(X_k)^T γ̃_k,   where γ̃_k^T = (α_k^T Y + (−1)^k β^T).        (10)

This kind of supervised learning has been presented by Meng et al. [18] and Farquhar et al. [12].

2.4. ε-tolerance

The synthesis constraint forcing the similarity between the predictions can be transformed into a more robust formulation which allows us to ignore small departures between the views. To realize this, the constraint

  |(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1)| ≤ η

is turned into

  |(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1)| ≤ ε 1 + η,        (11)

which absorbs the differences smaller than a fixed given positive threshold ε. A similar application of ε-tolerance appears in the formulation of support vector regression; the reader can refer to the book of Vapnik [22] and to Cristianini et al. [10] as well.

2.5. Concatenated case

In the primal formulation of the optimization problem (3) the views have been handled in distinct constraints. One can unify these constraints by concatenating the two feature vectors into one and then solving the simplified problem

  min    (1/2) Σ_k w_k^T w_k                                              ← regularization
       + C 1^T ξ                                                          ← labeled loss
       + C_η 1^T (η^+ + η^-)                                              ← unlabeled loss
  w.r.t. w_k, b_k, k = 1, 2, ξ, η^+, η^-,
  s.t.   (φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1) = η^+ − η^-,     ← synthesis
         Y ( Σ_k [φ_k(X_k) w_k + b_k 1] ) ≥ 1 − ξ,                        ← "concatenated" SVM
         ξ ∈ R_+^{m_L},  η^+, η^- ∈ R_+^{m_U}.        (12)

This formulation reduces the number of Lagrangian dual variables, since the corresponding dual vectors α_1 and α_2 will be the same. This reformulation decreases the complexity of the corresponding dual problem and consequently diminishes the computational cost.

2.6. Skeleton of the algorithm

The detailed description of the algorithm solving the SVM_2K problem is given in [18]. Here we give an outline of the method and offer a simple alternative well suited to large-scale problems. The algorithm solves the dual problem (2.2). The basic idea is a particular elimination of the constraints occurring in the dual. The elimination employs the
augmented Lagrangian formulation, which turns the constraints that we are going to eliminate into a penalty term in the Lagrangian functional and then solves this modified problem. Details of the formulation, theoretical background and proofs of convergence can be found in the book of Bertsekas [4] and in the references therein. The partial augmentation of (2.2) reads as

  L(α_1, α_2, β; λ, c) = min  (1/2) Σ_k γ_k^T K_k γ_k − Σ_k 1^T α_k + λ^T A z + (c/2) z^T A^T A z        (2.6)
  s.t.  0 ≤ α_k ≤ C_k, k = 1, 2,
        −C_η ≤ β ≤ C_η,

where c > 0 is a penalty parameter, the matrix A is built upon the linear constraints and is equal to

  A = [ y^T  0^T  −1^T ]
      [ 0^T  y^T  +1^T ],

and the vector λ comprises the corresponding Lagrangian multipliers; y is a vector containing the output labels (y_i). Let z^T = (α_1^T, α_2^T, β^T) and let Z be the set of z's satisfying the box constraints in (2.6). The algorithm then follows the schema:

Step 1: Let λ^0 be an initial solution for the Lagrangian multipliers, c^0 > 1 a given initial value of the penalty parameter c, γ_c > 1 a multiplier for c, ε a required accuracy, and l = 0.

Step 2: Solve the problem for z at fixed λ^l:

  min_z  L(z; λ^l, c^l)   s.t.  z ∈ Z.        (13)

The optimum solution is denoted by z^l.

Step 3: Update the Lagrangian multipliers by λ^{l+1} = λ^l + c^l A z^l and the running constant by c^{l+1} = γ_c c^l.

Step 4: If ||A z^l|| < ε, a given error tolerance, then stop; else set l = l + 1 and go to Step 2.

It is proved in [4] that if the running constant goes to infinity, then the solution of the augmented Lagrangian problem converges to the solution of the original problem. The critical point is to find the solution of the subproblem in Step 2. In [18] a conditional gradient based method was proposed; however, it may be better to apply a simple coordinate descent schema exploiting the simplicity of the box constraints. It involves a component-wise optimization of the objective function on the corresponding slice of the box; see details in [4]. This approach is a simpler version of the sequential minimal optimization (SMO) introduced by Platt [20] for the support vector machine. The experiments show that this simple, task-specific method becomes superior to other methods developed for linearly constrained quadratic programming problems when the sample size goes beyond 1000 items.
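Steps 1–4 can be sketched on a generic box-constrained quadratic program with an eliminated equality constraint A z = 0. The matrices and constants below are random placeholders standing in for the SVM_2K dual quantities, and the inner solver is the simple coordinate descent variant mentioned above, not the implementation of [18]:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 2
Q = rng.normal(size=(n, n))
Q = Q @ Q.T + n * np.eye(n)         # convex quadratic term (stands in for the block-diagonal kernel part)
q = rng.normal(size=n)              # linear term (stands in for the 1-vector on the alphas)
A = rng.normal(size=(p, n))         # eliminated equality constraints A z = 0
lo, hi = np.zeros(n), np.ones(n)    # box constraints on z = (alpha_1, alpha_2, beta)

def solve_box_qp_cd(lam, c_pen, z, sweeps=200):
    """Step 2: minimize 0.5 z'Qz - q'z + lam'(Az) + (c/2)||Az||^2 over the box, one coordinate at a time."""
    H = Q + c_pen * A.T @ A         # Hessian of the augmented objective
    lin = q - A.T @ lam             # collected linear coefficients
    for _ in range(sweeps):
        for i in range(n):
            grad_i = H[i] @ z - lin[i]
            # exact minimizer along coordinate i, clipped back to the box slice
            z[i] = np.clip(z[i] - grad_i / H[i, i], lo[i], hi[i])
    return z

lam, c_pen, gamma_c = np.zeros(p), 2.0, 2.0   # Step 1: initial multipliers and penalty
z = rng.uniform(size=n)
for _ in range(25):
    z = solve_box_qp_cd(lam, c_pen, z)        # Step 2
    lam = lam + c_pen * (A @ z)               # Step 3: multiplier update
    c_pen *= gamma_c                          #         growing penalty parameter
    if np.linalg.norm(A @ z) < 1e-8:          # Step 4: stopping test
        break
print(np.linalg.norm(A @ z))                  # equality violation shrinks toward 0
```

The coordinate update is exact for a quadratic objective, so each inner sweep never increases the augmented Lagrangian, and the outer loop drives the equality violation ||A z|| toward zero as the penalty grows.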
3. Theoretical foundation

To illuminate the theoretical background of the presented learning method we apply the theory of Rademacher complexity, see [2]. We use the notation of Section 2 when referring to the quantities used in the learning algorithms. The key idea is that by restricting the two linear functions f_1 and f_2 to give similar outputs we implicitly restrict the flexibility of the functions that can be implemented, thus reducing the chances of overfitting. We would therefore expect that if we can achieve good alignment between f_1 and f_2 while having good classification accuracy, then the performance of the classifier will be improved. The difficulty with making this intuition rigorous is that statistical learning theory requires the function class to be specified independently of the training data, while our measure of alignment between f_1 and f_2 (with norms bounded by C) is computed on the unlabeled data. Hence, if the function class is defined in terms of alignments on the unlabeled data, the sampling process would appear to affect the choice of hypothesis space. This restriction should be distinguished from that arising from bounding the norm of the weight vector classifying the labeled data, since this hierarchy of classes can be specified without reference to the data. The problem can, however, be avoided if we consider the unlabeled data as being chosen before the labeled data. In this way we may view the hypothesis space as fixed at the point when the labeled data are generated. The next problem is to estimate the complexity of the space of functions that are restricted in this data-driven manner. The approach we adopt is to derive an optimization problem whose objective value is, with high probability, an upper bound on the required complexity. The accuracy of the estimate depends on the number of random Rademacher instantiations σ̂ that are used, and hence on the number of optimization problems that are solved.
Finally, we need to perform a union bound over the required range of values of C and D. Our analysis has a similar flavor to that of Balcan et al. [1] and Blum et al. [6]. The main differences are that we are able to infer advantages from all sizes of unlabeled samples without any prespecified required size. Furthermore the results are presented to indicate the level of performance obtained from the data available and the results of the learning, rather than indicating sample sizes required to give a particular error rate. Finally, we make use of Rademacher bounds in order to enable computation on the given samples.
3.1. Short introduction to Rademacher complexity theory

We first quote McDiarmid's inequality, as this will prove useful later.
Theorem 1 (McDiarmid). Let X_1, ..., X_n be independent random variables taking values in a set A, and assume that f : A^n → R satisfies

  sup_{x_1,...,x_n, x̂_i ∈ A} |f(x_1, ..., x_n) − f(x_1, ..., x̂_i, x_{i+1}, ..., x_n)| ≤ c_i,   1 ≤ i ≤ n.

Then for all ε > 0,

  P{ |f(X_1, ..., X_n) − E f(X_1, ..., X_n)| ≥ ε } ≤ 2 exp( −2ε² / Σ_{i=1}^n c_i² ).

We begin with the definition required for Rademacher complexity; see for example Bartlett and Mendelson [2], and see also [21] for an introductory exposition.

Definition 2. For a sample S = {x_1, ..., x_ℓ} generated by a distribution D on a set X and a real-valued function class F with domain X, the empirical Rademacher complexity of F is the random variable

  R̂_ℓ(F) = E_σ [ sup_{f ∈ F} | (2/ℓ) Σ_{i=1}^ℓ σ_i f(x_i) |  given  x_1, ..., x_ℓ ],

where σ = {σ_1, ..., σ_ℓ} are independent uniform {±1}-valued Rademacher random variables. The Rademacher complexity of F is

  R_ℓ(F) = E_D [ R̂_ℓ(F) ] = E_{D,σ} [ sup_{f ∈ F} | (2/ℓ) Σ_{i=1}^ℓ σ_i f(x_i) | ].

We use E_D to denote expectation with respect to a distribution D and Ê_S its empirical value on the sample S, assuming uniform sampling. The Rademacher complexity allows us to uniformly bound the expectations of all functions in a function class based on the class complexity.

Definition 3. A loss function A : R → [0, 1] is Lipschitz with constant L if it satisfies |A(a) − A(a')| ≤ L |a − a'| for all a, a' ∈ R.

We quote a lemma giving some properties of Rademacher complexity, including that of a function class composed with a fixed Lipschitz function A.

Theorem 4. Let F, F_1, ..., F_n and G be classes of real functions. Then:
(1) if F ⊆ G, then R̂_ℓ(F) ≤ R̂_ℓ(G);
(2) for every c ∈ R, R̂_ℓ(cF) = |c| R̂_ℓ(F);
(3) if A : R → R is Lipschitz with constant L and satisfies A(0) = 0, then R̂_ℓ(A ∘ F) ≤ 2L R̂_ℓ(F).

Theorem 5. Fix δ ∈ (0, 1) and let F be a class of functions mapping from S to [A, B], where at least one of A or B is zero. Let (x_i)_{i=1}^ℓ be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ F satisfies

  E_D[f(x)] ≤ Ê_S[f(x)] + R̂_ℓ(F) + 3(B − A) √( ln(2/δ) / (2ℓ) ).

For the proof the reader can consult [2,21], with the slight adaptation that the range is scaled from the standard [0, 1] to [A, B]. Application of the standard result to an appropriate scaling of the functions, together with (2) of Theorem 4, gives the result, with the proviso that the inequality must be reversed if A < 0. This is a straightforward adaptation of the main proof, since the proof relies on bounding the supremum of the difference between empirical and true expectations. Given a training set S, the class of functions that we will primarily be considering are linear functions with bounded norm:

  { x → Σ_{i=1}^ℓ a_i k(x_i, x) : a^T K a ≤ B² } ⊆ { x → ⟨w, φ(x)⟩ : ||w|| ≤ B } = F_B,

where φ is the feature mapping corresponding to the kernel function k and K is the corresponding kernel matrix for the sample S. The following result bounds the Rademacher complexity of this kind of linear function class.

Theorem 6 (Bartlett and Mendelson [2]). If k : X × X → R is a kernel and S = {x_1, ..., x_ℓ} is an i.i.d. sample from X, then the empirical Rademacher complexity of the class F_B satisfies

  R̂_ℓ(F_B) ≤ (2B/ℓ) √tr(K).
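For the norm-bounded linear class F_B, the supremum inside Definition 2 has a closed form, sup_{||w|| ≤ B} (2/ℓ) Σ_i σ_i ⟨w, x_i⟩ = (2B/ℓ) ||Σ_i σ_i x_i||, so the empirical Rademacher complexity can be estimated by Monte Carlo and compared with the Theorem 6 bound. The sample below is random, illustrative data:

```python
import numpy as np

rng = np.random.default_rng(4)
ell, d, B, N = 40, 6, 1.0, 500
X = rng.normal(size=(ell, d))       # sample S, identity feature map
K = X @ X.T                         # linear kernel matrix

# Monte Carlo estimate: average the closed-form supremum over N Rademacher draws.
sup_vals = []
for _ in range(N):
    s = rng.choice([-1.0, 1.0], size=ell)            # Rademacher variables
    sup_vals.append((2 * B / ell) * np.linalg.norm(s @ X))
estimate = np.mean(sup_vals)

bound = (2 * B / ell) * np.sqrt(np.trace(K))         # Theorem 6
print(estimate, bound)
```

By Jensen's inequality E||Σ σ_i x_i|| ≤ √(E||Σ σ_i x_i||²) = √tr(K), so the Monte Carlo estimate sits below the bound in expectation; the gap shows how much slack Theorem 6 leaves on a concrete sample.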
3.2. Analysis of the semi-supervised multiview learning

Initially we will fix the positive numbers C and D and assume that the weight vectors (w_1, b_1) and (w_2, b_2) produced by the learning algorithm both have 2-norms bounded by C. Furthermore, we assume that there is an a priori bound R on the norms of the feature space projections of the input vectors, that is,

  k(x, x) + 1 ≤ R²   for all x,

where the 1 comes from the component handling the offset b. We define an auxiliary function of the weight vectors w̄_k = (w_k, b_k), k = 1, 2:

  D̂(w̄_1, w̄_2) := Ê_S[ ||(φ_1(U_1) w_1 + b_1 1) − (φ_2(U_2) w_2 + b_2 1)||_1 ].

Again, for the time being, we will assume that D̂(w̄_1, w̄_2) ≤ D and leave the question of choosing C and D to depend on
the data to the end of the analysis. Hence, the class of functions we are considering when applying SVM_2K to this problem can be restricted to

  F_{C,D} = { f : ([x^1, x^2]) → (1/2) Σ_{k=1}^{2} (K_k(x^k) γ_k + b_k) :
              γ_k^T K_k γ_k + b_k² ≤ C², k = 1, 2,   D̂(w̄_1, w̄_2) ≤ D },

where

  K_k(x^k) = [φ(x^k)^T φ(X_k)^T, φ(x^k)^T φ(U_k)^T].

Hence, the function class F_{C,D} has the additional restriction D̂(w̄_1, w̄_2) ≤ D, which could decrease the chances of overfitting. Applying the usual Rademacher techniques for margin bounds on generalization we obtain the following result.

Theorem 7. Fix δ ∈ (0, 1) and constants C and D, and let F_{C,D} be the class of functions described above. Let labeled samples (X_1, X_2) and unlabeled samples (U_1, U_2) be drawn independently according to a probability distribution P(X^1, X^2). Then with probability at least 1 − δ/2 over random draws of labeled samples of size m_L, every f ∈ F_{C,D} satisfies

  P_{(x,y)~D}(sign(f(x)) ≠ y) ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2) + 2 R̂_{m_L}(F_{C,D}) + 3 √( ln(4/δ) / (2 m_L) ).

Proof. First observe that

  P_{(x,y)~D}(sign(f(x)) ≠ y) ≤ E_D[min((1 − y f(x))_+, 1)],

while

  Ê_S[min((1 − y f(x))_+, 1)] ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2).

Applying Theorem 5 to the function min((1 − y f(x))_+, 1) − 1 with range [−1, 0], and observing that the function A(z) = min((1 − z)_+, 1) − 1 satisfies A(0) = 0 and has Lipschitz constant 1, we obtain

  E_D[min((1 − y f(x))_+, 1)] − 1 ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2) − 1 + 2 R̂_{m_L}(F_{C,D}) + 3 √( ln(4/δ) / (2 m_L) ).

The result follows. □

Apart from the unfixing of C and D, it therefore only remains to compute the empirical Rademacher complexity of F_{C,D}, which is the critical discriminator between the bounds computed for the individual or concatenated SVMs and that of the SVM_2K.

3.3. Empirical Rademacher complexity of F_{C,D}

The empirical Rademacher complexity of the class F_{C,D} is given by

  R̂_{m_L}(F_{C,D}) = E_σ [ sup_{f ∈ F_{C,D}} (2/m_L) Σ_{i=1}^{m_L} σ_i f(x_i) ]
    = E_σ [ sup_{||w̄_k|| ≤ C, k=1,2; D̂(w̄_1,w̄_2) ≤ D} (1/m_L) Σ_{k=1}^{2} σ^T (K_k(x^k) γ_k + b_k 1) ]
    ≤ E_σ [ sup_{||w̄_k|| ≤ C, k=1,2; D̂(w̄_1,w̄_2) ≤ D} (1/m_L) Σ_k σ^T [φ_k(X_k) w_k + b_k 1] ],

where σ ∈ {−1, +1}^{m_L}. Note that the expression in square brackets is concentrated under the uniform distribution of the Rademacher variables. Hence, we can estimate the complexity from N randomly chosen instantiations σ̂^1, ..., σ̂^N of the Rademacher variables σ. For each j = 1, ..., N, we must find the values of the learning parameters w̄_k, k = 1, 2, that maximize

  max    (1/m_L) Σ_k [ (σ̂^j)^T φ_k(X_k) w_k + (σ̂^j)^T 1 b_k ] = (1/m_L) Σ_k (σ̂^j)^T (K_k^L γ_k + b_k 1)
  w.r.t. γ_k ∈ R^{m_L + m_U}, b_k ∈ R, k = 1, 2,
  s.t.   γ_k^T K_k γ_k + b_k² ≤ C², k = 1, 2,
         (1/m_U) 1^T |(K_1^U γ_1 + b_1 1) − (K_2^U γ_2 + b_2 1)| ≤ D.        (14)

By an application of McDiarmid's inequality, with probability at least 1 − δ/2, the expected value of the objective function computed on the randomly chosen fixed σ̂ is not 2RC√(ln(2/δ)/(2 N m_L)) less than the empirical Rademacher complexity. This follows from observing that flipping a single σ_i^j can change the complexity formula by at most c_i = 2RC/(N m_L). Hence, if we define R(C, D) to be the average of the values of the objective of the optimization (14) over j = 1, ..., N, we have the following bound on the generalization of the SVM_2K.

Theorem 8. Fix δ ∈ (0, 1), let F_{C,D} be the class of functions described above, and let R(C, D) be the average of the values of the objective of the optimization (14) for j = 1, ..., N. Let labeled samples (X_1, X_2) and unlabeled samples (U_1, U_2) be drawn independently according to a probability distribution P(X^1, X^2). Then with probability at least 1 − δ
over random draws of labeled samples of size m_L, every f ∈ F_{C,D} satisfies

  P_{(x,y)~D}(sign(f(x)) ≠ y) ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2) + 2 R(C, D) + ( 4RC/√N + 3 ) √( ln(4/δ) / (2 m_L) ).

Theorem 8 shows that we can bound the generalization in terms of the quantity R(C, D) computed by optimization (14). The next section will investigate the computation of this quantity, but first we need to address the issue that in the analysis so far C and D have been fixed a priori, while in a real application we wish to treat them as outputs of the learning algorithm. This is a similar situation to that encountered in bounds involving the margin, and we follow a similar strategy to that adopted in that case. First observe that the value of the optimal solution of the optimization (3) must be less than that obtained by taking the feasible solution w_k = 0 and ξ_k = 1 for k = 1, 2. The objective value for this solution is (C_1 + C_2) m_L, and so for the optimal solution we must have ||w_k|| ≤ 2(C_1 + C_2) m_L for k = 1, 2. Similarly, to ensure that at least one data point has no margin error, the weight vector must have norm at least 1/R, so we can assume that C lies in the range [1/R, 2(C_1 + C_2) m_L]. By similar arguments we can place bounds [D_1, R(C_1 + C_2)] on the range of the expression

  (1/m_U) 1^T (η^+ + η^-),

indicating that we need only consider values of D in this range. Since the expression could potentially be zero, D_1 must be chosen a priori as a suitably small non-zero constant. We will see below that it plays only a very minor role in the bound. We now consider a grid of values for C:

  C^1 = 2(C_1 + C_2) m_L,   C^2 = C^1/2,  ...,  C^{i+1} = C^i/2,  ...,  C^{t_C},

where t_C = ⌊log_2(2 R (C_1 + C_2) m_L)⌋. We create a similar grid for a range of t_D = ⌊log_2(R(C_1 + C_2)/D_1)⌋ values of D, starting with D^1 = R(C_1 + C_2) and halving the value each time. We now apply Theorem 8 for each pair of values (C^i, D^j), i = 1, ..., t_C, j = 1, ..., t_D, with δ replaced by δ/(t_C t_D). This ensures that with probability at least 1 − δ all of the t_C t_D conclusions hold. For any values of C and D in their given ranges there is a grid point (C^i, D^j) in the rectangle [C, 2C] × [D, 2D]. Furthermore, the monotonicity of the bound implies that the bound obtained for (C^i, D^j) will be tighter than that for (2C, 2D). Putting these observations together, we obtain the following theorem.

Theorem 9. Fix δ ∈ (0, 1) and for f ∈ F let C(f) = 2 max(||w̄_1||, ||w̄_2||) and

  D(f) = 2 max( D_1, (1/m_U) 1^T (η^+ + η^-) ).

Let F_{C,D} be the class of functions described above and R(C, D) be the average of the values of the objective of optimization (14) over its N instances. Let labeled samples (X_1, X_2) and unlabeled samples (U_1, U_2) be drawn independently according to a probability distribution P(X^1, X^2). Then with probability at least 1 − δ over random draws of labeled samples of size m_L, every f ∈ F_{C(f),D(f)} satisfies

  P_{(x,y)~D}(sign(f(x)) ≠ y) ≤ (1/(2 m_L)) 1^T (ξ_1 + ξ_2) + 2 R(C(f), D(f)) + ( 4 R C(f)/√N + 3 ) √( ln(4 t_C t_D/δ) / (2 m_L) ).

It remains to note that the influence of the repeated applications of Theorem 8 is benign, in that t_C and t_D are logarithms of the ratios of the two ranges, and furthermore they enter the bound under a further logarithm.
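The grid construction used in the union bound can be written out directly. The constants below are placeholders chosen for illustration, not values from the paper:

```python
import numpy as np

# Hypothetical problem constants (placeholders).
C1, C2, R, m_L, D_min, delta = 1.0, 1.0, 2.0, 100, 1e-3, 0.05

# Geometric grid for C: start at the upper end of the range and halve t_C times.
C_max = 2 * (C1 + C2) * m_L
t_C = int(np.floor(np.log2(2 * R * (C1 + C2) * m_L)))
C_grid = [C_max / 2**i for i in range(t_C)]

# Analogous grid for D, down to the a priori floor D_min (the D_1 of the text).
D_max = R * (C1 + C2)
t_D = int(np.floor(np.log2(D_max / D_min)))
D_grid = [D_max / 2**i for i in range(t_D)]

# Union bound: each of the t_C * t_D applications of Theorem 8 uses delta/(t_C * t_D),
# so all conclusions hold simultaneously with probability at least 1 - delta.
delta_each = delta / (t_C * t_D)
print(t_C, t_D, delta_each)
```

Because the grids are geometric, any admissible (C, D) is within a factor of two of a grid point, while t_C and t_D only enter the final bound through a logarithm.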
4. Estimation of the empirical rademacher complexity The estimation of the empirical Rademacher complexity can be carried out by solving the optimization problem (14). Unfortunately, (14) contains a convex maximization even with a non-differentiable objective function. Here we present a simple decomposition procedure which can overcome on these difficulties. To solve this problem the feasibility domain of (14) can be cut into two parts; one is where the objective is non-negative and the other where it is non-positive, and then comparing the optimum values received in the subproblems the optimum solution of the entire problem can be obtained. We also have a nondifferentiable constraint 1 T U 1 jðK1 g1 þ b1 1Þ ðKU 2 g2 þ b2 1ÞjpD mU that can be resolved by replacing it with an equivalent formulation U þ ðKU 1 g1 þ b1 1Þ ðK2 g2 þ b2 1Þ ¼ g g , 1 T þ 1 ðg þ g ÞpD, mU gþ X0; g X0.
Thus we arrive at two derived problems in which the maximization has been transformed into a minimization. First the non-negative part of the domain is considered:

$$\begin{array}{ll}
\min & -\dfrac{1}{m_L}\displaystyle\sum_k \hat{r}^\top(K_k^L\gamma_k + b_k\mathbf{1})\\
\text{w.r.t.} & \gamma_k \in \mathbb{R}^{m_L+m_U},\ b_k \in \mathbb{R},\ k = 1,2,\quad \gamma^+, \gamma^- \in \mathbb{R}^{m_U}\\
\text{s.t.} & \gamma_k^\top K_k\gamma_k + b_k^2 \le C^2,\quad k = 1,2,\\
& (K_1^U\gamma_1 + b_1\mathbf{1}) - (K_2^U\gamma_2 + b_2\mathbf{1}) = \gamma^+ - \gamma^-,\\
& \dfrac{1}{m_U}\mathbf{1}^\top(\gamma^+ + \gamma^-) \le D,\\
& \dfrac{1}{m_L}\displaystyle\sum_k \hat{r}^\top(K_k^L\gamma_k + b_k\mathbf{1}) \ge 0,\\
& \gamma^+ \ge 0,\quad \gamma^- \ge 0,
\end{array} \tag{15}$$

and similarly we have the problem for the non-positive part of the feasibility domain:

$$\begin{array}{ll}
\min & \dfrac{1}{m_L}\displaystyle\sum_k \hat{r}^\top(K_k^L\gamma_k + b_k\mathbf{1})\\
\text{w.r.t.} & \gamma_k \in \mathbb{R}^{m_L+m_U},\ b_k \in \mathbb{R},\ k = 1,2,\quad \gamma^+, \gamma^- \in \mathbb{R}^{m_U}\\
\text{s.t.} & \gamma_k^\top K_k\gamma_k + b_k^2 \le C^2,\quad k = 1,2,\\
& (K_1^U\gamma_1 + b_1\mathbf{1}) - (K_2^U\gamma_2 + b_2\mathbf{1}) = \gamma^+ - \gamma^-,\\
& \dfrac{1}{m_U}\mathbf{1}^\top(\gamma^+ + \gamma^-) \le D,\\
& \dfrac{1}{m_L}\displaystyle\sum_k \hat{r}^\top(K_k^L\gamma_k + b_k\mathbf{1}) \le 0,\\
& \gamma^+ \ge 0,\quad \gamma^- \ge 0.
\end{array} \tag{16}$$

Let the optimum values and the optimum solutions of (15) and (16) be denoted by $r^+$, $(\gamma_k^*, b_k^*)^+$, $k = 1,2$, and by $r^-$, $(\gamma_k^*, b_k^*)^-$, $k = 1,2$, respectively. The optimum value of the original problem (14) is then given by

$$r^* = \min(r^+, r^-),\qquad (\gamma_k^*, b_k^*) = \arg\min(r^+, r^-),\quad k = 1,2. \tag{17}$$

Both problems (15) and (16) are convex, with a linear objective function and with linear and quadratic constraints. One can transform them into second-order cone programming (SOCP) problems and solve them efficiently using appropriate interior point methods; see [15] and [8] for details. The second-order cone formulation of the quadratic constraints is derived as

$$\gamma_k^\top K_k\gamma_k + b_k^2 \le C^2 \;\Rightarrow\; \left\|\left[\hat{A}_k\gamma_k;\ b_k\right]\right\| \le C,\quad k = 1,2, \tag{18}$$

where $K_k = \hat{A}_k^\top\hat{A}_k$.

5. Experiments

Here we demonstrate the effect of the included similarity-forcing constraint on the accuracy. In the training procedure the difference between the learners touched only this constraint and the corresponding penalty constant. The test method applied on all sets obeyed the following procedure:

- The data set was cut into training and test sets by randomly sorting it into five folds, and fivefold cross-validation was applied. In each cross-validation step a randomly chosen 90%, or 95%, of the labels of the training items were deleted and these label-free items were used as unlabeled cases in the learning.
- The selection of the unlabeled cases was repeated 10 times in each of the five cross-validations.

To set the penalty parameters $C_1$, $C_2$, $C_\eta$, a validation procedure based on the training set was applied. After creating the folds and choosing the unlabeled cases, the training set was partitioned into validation-training and validation-test sets. Those values of the parameters were chosen which gave the highest accuracy on the validation-test set. The values of $C_1$ and $C_2$ were fixed to 1 and $C_\eta$ was sought among the values $0.01 \cdot 2^i$, $i = 1, \ldots, 10$.

In the experiments the kernel type was linear and a two-phase normalization was applied. In the first phase the variables were normalized by subtracting the corresponding expected values and scaling them to have variance 1. In the second phase the input vectors were projected onto a ball with unit radius. The normalization is a critical step in filtering the influence of the unlabeled outliers out of the model. The linear kernel is applied in our experiments in order to avoid additional parameters not related to the discussed framework. The estimation of the empirical Rademacher complexity was computed in parallel with each estimation of the test accuracy. One estimation was computed on 10 randomly chosen Rademacher labels.

5.1. Case of two input sources

The data set (available at http://www.robots.ox.ac.uk/vgg/data/) is commonly used for generic object recognition, e.g. by Fergus et al. [13] and other papers. The three object classes in this data set are motorbikes, aeroplanes and faces. It also contains an additional background class. For each image two sets of low-level features were computed. One (available at http://lear.inrialpes.fr/people/Mikolajczyk/) used the affine invariant Harris detector [19] developed by Mikolajczyk and Schmid to detect interest points within an image, with invariant moments as patch descriptors. The other used Lowe's keypoint detector (available at http://www.cs.ubc.ca/lowe/keypoints/) to detect interesting patches, with SIFT [16] patch descriptors. These sets of image patch descriptors then form the basis of the feature generation.

Because different images have different numbers of interest points, vector quantization was used to map these sets of points into a fixed-length feature vector. Specifically, k-means was used to learn K cluster centers based upon the features from all images. For each image a fixed-length K-dimensional feature vector was then created by recording the minimum distance between an image feature and each of these K centers. In all included experiments the clustering parameter was chosen as K = 400. The results are presented in Table 1.

Table 1
Expected values (standard deviations) of the accuracies (%) computed on the test sets and the estimated empirical Rademacher complexities $\hat{R}$ (standard deviations) of the semi-supervised learning

Data set               Labeled (%)   SVM (concatenated)           SVM_2K
(Image classification)               Accuracy     $\hat{R}$       Accuracy     $\hat{R}$
Airplanes              5             93.3 (3.1)   104.4 (49.1)    94.4 (1.9)   105.2 (60.8)
Faces                  5             96.7 (1.7)   94.7 (46.6)     97.1 (1.4)   64.4 (27.2)
Motorbikes             5             94.3 (0.9)   81.2 (32.2)     94.6 (0.6)   59.5 (25.5)

5.2. Views on random split

In this group of experiments we used examples from the UCI Repository of machine learning data sets; details about this repository are given in [5]. The data sets included in the experiments are shown in Table 2. The categorical variables in the Credit data set were transformed into indicator variables. The test method applied on these sets was extended by a random partitioning of the variables:

- To create two views, the set of variables was split randomly into two subsets. The chance of falling into one of these subsets was the same for each variable. If one of the two subsets was empty, the random selection was repeated until both subsets contained at least one variable. The splitting of the set of variables was repeated 10 times in each of the five fold cross-validations.

Table 2
The data sets used in the experiments

Data set                            Number of items   Number of variables
Wisconsin breast cancer             699               9
Credit card application approval    690               15
Haberman's survival data            306               4
Pima Indians diabetes               768               8

Table 3
Expected values (standard deviations) of the accuracies (%) computed on the test sets and the estimated empirical Rademacher complexities $\hat{R}$ (standard deviations) of the semi-supervised learning

Data sets         Labeled (%)   SVM (concatenated)            SVM_2K
(UCI data sets)                 Accuracy        $\hat{R}$     Accuracy        $\hat{R}$
Breast cancer     10            96.04 (1.88)    1.06 (0.41)   96.72 (1.37)    1.24 (0.42)
                  5             95.28 (2.82)    1.66 (0.61)   96.21 (1.42)    1.48 (0.57)
Credit            10            81.95 (4.45)    4.71 (1.10)   83.54 (4.40)    4.34 (2.22)
                  5             79.12 (6.20)    8.04 (3.08)   80.73 (9.19)    5.97 (2.77)
Haberman          10            69.42 (10.41)   0.66 (0.43)   74.00 (3.96)    0.20 (0.31)
                  5             63.58 (12.63)   1.02 (0.43)   68.24 (14.43)   0.35 (0.41)
Pima              10            72.67 (3.43)    2.60 (1.22)   70.49 (4.27)    1.97 (1.56)
                  5             70.07 (4.25)    4.05 (1.56)   67.40 (5.05)    1.95 (1.74)

Analyzing the results in Table 3, one can recognize that smaller complexity tends to go together with higher accuracy. Some obvious consequences of the structure of the Rademacher complexity can be noted as well: an increased number of labeled cases reduces the complexity, but this effect is significantly smaller in the case of the SVM_2K, since the unlabeled cases are incorporated into the model.

Fig. 2. Comparison of the SVM working on 10%+5% of labeled data and the SVM_2K working on 10% labeled data and a gradually increasing proportion (5-90%) of unlabeled data; the figure plots accuracy (%) against the percentage of unlabeled data considered, with the SVM mean accuracy and its upper and lower confidence bands shown for reference. The 10% labeled data were the same in each of the experiments.
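The Monte Carlo averaging over a small number of Rademacher label draws used above (10 in the experiments) can be sketched for a simple linear function class, for which the supremum in the definition has a closed form. This is an illustration of the estimator's structure only, not the SVM_2K function class of the paper:

```python
import numpy as np

def empirical_rademacher(X, B, n_draws=10, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    linear class {x -> <w, x> : ||w|| <= B} on the sample X (m x d).
    For this class the supremum over w has the closed form
    (2*B/m) * ||sum_i s_i x_i||, with s_i the Rademacher labels."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    vals = []
    for _ in range(n_draws):
        s = rng.choice([-1.0, 1.0], size=m)   # random Rademacher labels
        vals.append(2.0 * B / m * np.linalg.norm(s @ X))
    return float(np.mean(vals))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# The estimate scales linearly with the norm bound B of the class ...
r1, r2 = empirical_rademacher(X, 1.0), empirical_rademacher(X, 2.0)
assert np.isclose(r2, 2.0 * r1)
# ... and shrinks roughly like 1/sqrt(m) as the sample grows.
```

The two properties checked at the end mirror the qualitative behaviour discussed for Table 3: a richer class (larger norm bound) raises the complexity, while more data lowers it.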
5.3. The value of the unlabeled data

Finally, we demonstrate the classification value of the unlabeled data in Fig. 2, using the Wisconsin breast cancer data. In this experiment 10% of the training data was chosen as labeled input and kept fixed. The SVM received this subsample plus an additional 5% as labeled data. The proposed method, SVM_2K, received the fixed 10% labeled part of the training set and a gradually increasing proportion of the remaining 90% as the unlabeled input source. In the first experiment the unlabeled portion started at 5%, and it was increased by 5% in each of the following experiments. Fig. 2 shows that, under the outlined circumstances, approximately 25% of unlabeled cases are worth as much as 5% of labeled ones.

6. Discussion

We presented a generalization-theory-based extension of the well-known support vector machine. Our objective was to find a modification of the optimization formulation of the SVM which can improve its performance when the unknown distribution of the data allows us to gain higher accuracy. We must emphasize that extending the capability of any kind of learner uniformly, for all possible data sets, is theoretically impossible; thus our extension can work only on a certain class of problems. Further research should be carried out to discover the boundary of this class.

References

[1] M.F. Balcan, A. Blum, A PAC-style model for learning from labeled and unlabeled data, in: COLT, 2005, pp. 111-126.
[2] P.L. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural results, J. Mach. Learning Res. 3 (2002) 463-482.
[3] K. Bennett, A. Demirez, Semi-supervised support vector machines, in: M.S. Kearns, S.A. Solla, D.A. Cohn (Eds.), Advances in Neural Information Processing Systems, vol. 12, MIT Press, Cambridge, MA, 1998, pp. 368-374.
[4] D.P. Bertsekas, Nonlinear Programming, second ed., Athena Scientific, 1999.
[5] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, Department of Information and Computer Sciences, University of California, Irvine, 1998 <http://www.ics.uci.edu/~mlearn/MLRepository.html>.
[6] A. Blum, M.F. Balcan, An augmented PAC model for semi-supervised learning, in: O. Chapelle, A. Zien, B. Schölkopf (Eds.), Semi-Supervised Learning, MIT Press, Boston, 2006.
[7] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the 11th Annual Conference on Computational Learning Theory, 1998, pp. 92-100.
[8] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[9] O. Chapelle, B. Schölkopf, A. Zien (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, 2006.
[10] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[11] S. Dasgupta, M.L. Littman, D. McAllester, PAC generalization bounds for co-training, in: Advances in Neural Information Processing Systems (NIPS), 2001.
[12] J.D.R. Farquhar, D.R. Hardoon, J. Shawe-Taylor, H. Meng, S. Szedmak, Two view learning: SVM-2K, theory and practice, in: NIPS, 2005.
[13] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[14] M. Kääriäinen, Generalization error bounds using unlabeled data, in: Learning Theory: 18th Annual Conference on Learning Theory, COLT '05, 2005, pp. 127-142.
[15] M. Lobo, L. Vandenberghe, S. Boyd, H. Lebret, Applications of second-order cone programming, Linear Algebra Appl. (Special Issue on Linear Algebra in Control, Signals and Image Processing) 284 (1998) 193-228.
[16] D. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 1999, pp. 1150-1157.
[17] G. Lugosi, M. Pinter, A data-dependent skeleton estimate for learning, in: Proceedings of the Ninth Annual Conference on Computational Learning Theory, Association for Computing Machinery, New York, 1996, pp. 51-58.
[18] H. Meng, J. Shawe-Taylor, S. Szedmak, J.R.D. Farquhar, Support vector machine to synthesise kernels, in: Sheffield Machine Learning Workshop Proceedings, Lecture Notes in Computer Science, Springer, Berlin, 2005.
[19] K. Mikolajczyk, C. Schmid, An affine invariant interest point detector, in: Proceedings of the 2002 European Conference on Computer Vision, Copenhagen, Denmark, 2002, pp. 128-142.
[20] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1998.
[21] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[22] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[23] X. Zhu, Semi-supervised learning literature survey, Computer Science Department, University of Wisconsin, Madison, 2006 <http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf>.

Sandor Szedmak was born in Hungary. He obtained an MS degree in Mathematics from Lajos Kossuth University, Debrecen, Hungary, and received his PhD in Operations Research from Rutgers, The State University of New Jersey, USA. He has held research positions at the Computer Science Departments of Royal Holloway, University of London, and the University of Helsinki. He is currently a senior research fellow at the School of Electronics and Computer Science of the University of Southampton, UK. His main research interest focuses on optimization problems arising in machine learning and pattern recognition, especially in large-scale applications.

John Shawe-Taylor obtained a PhD in Mathematics at Royal Holloway, University of London in 1986. He subsequently completed an MSc in the Foundations of Advanced Information Technology at Imperial College. He was promoted to Professor of Computing Science in 1996. He has published over 150 research papers. He was appointed director of the Centre for Computational Statistics and Machine Learning at University College London. He has pioneered the development of well-founded approaches to machine learning inspired by statistical learning theory (including support vector machines, boosting and kernel principal component analysis) and has shown the viability of applying these techniques to document analysis and computer vision. He is co-author of An Introduction to Support Vector Machines, the first comprehensive account of this new generation of machine learning algorithms. A second book, Kernel Methods for Pattern Analysis, was published in 2004.