Pattern Recognition Letters 28 (2007) 2054–2062 www.elsevier.com/locate/patrec
Semismooth Newton support vector machine

Zhou Shui-sheng a,*, Liu Hong-wei a, Zhou Li-hua b, Ye Feng a

a Department of Mathematics, School of Science, Xidian University, Xi'an, Shaanxi 710071, China
b School of Computer, Xidian University, Xi'an, Shaanxi 710071, China

Received 22 January 2006; received in revised form 23 May 2007; available online 4 July 2007
Communicated by R.P.W. Duin
Abstract

Support vector machines can be posed as quadratic programming problems in a variety of ways. This paper investigates the 2-norm soft margin SVM with an additional quadratic penalty for the bias term, which leads to a positive definite quadratic program in feature space with only nonnegativity constraints. An unconstrained programming problem is proposed as the Lagrangian dual of this quadratic program for the linear classification problem. The resulting problem minimizes a differentiable convex piecewise quadratic function of lower dimension in input space, and a semismooth Newton algorithm is introduced to solve it quickly; the resulting method is called the Semismooth Newton Support Vector Machine (SNSVM). After the kernel matrix is factorized by the Cholesky factorization or the incomplete Cholesky factorization, the nonlinear kernel classification problem can also be solved by SNSVM with no apparent increase in complexity. Numerical experiments demonstrate that our algorithm is comparable with similar algorithms such as the Lagrangian Support Vector Machine (LSVM) and the Semismooth Support Vector Machine (SSVM).
© 2007 Elsevier B.V. All rights reserved.

Keywords: Support vector machines; Semismooth; Lagrangian dual; Cholesky factorization
This work was supported in part by the National Natural Science Foundation of China under Grants No. 60572150 and No. 60603098.
* Corresponding author. Tel.: +86 29 88203822; fax: +86 29 88202861.
E-mail addresses: [email protected] (S.-s. Zhou), [email protected] (H.-w. Liu), [email protected] (L.-h. Zhou), [email protected] (F. Ye).
doi:10.1016/j.patrec.2007.06.010

1. Introduction

Based on Vapnik and Chervonenkis' structural risk minimization principle, support vector machines (SVMs) have been proposed as computationally powerful tools for supervised learning. SVMs are widely used in classification and regression problems, such as character recognition, disease diagnosis, face recognition, time series prediction, etc. (Vapnik, 1999, 2000). For a classification problem, a data set $\{(x_i,y_i)\}_{i=1}^m$ is given for training, with inputs $x_i\in\mathbb{R}^n$ and corresponding target values or labels $y_i=1$ or $-1$.
Denoting the matrices $A=[x_1\ x_2\ \cdots\ x_m]^T$, $y=[y_1\ y_2\ \cdots\ y_m]^T$ and $D=\mathrm{diag}(y)$, the linear support vector machine for classification obtains the optimal classification function by solving the following (primal) quadratic program with parameter $\nu>0$ (Vapnik, 2000; Platt, 1999; Lee and Mangasarian, 2001; Joachims et al., 1999; Mangasarian and Musicant, 2001, etc.):

$$\min_{(w,b,\xi)\in\mathbb{R}^{n+1+m}} \ \nu e^T\xi+\frac{1}{2}w^Tw \quad \text{subject to} \quad D(Aw-eb)+\xi\ge e,\ \ \xi\ge 0, \tag{1}$$

where $e=(1,1,\ldots,1)^T$, $w$ is the normal to the bounding planes $x^Tw=b\pm 1$, and $b$ determines their location relative to the origin. The plane $x^Tw=b+1$ bounds the class points with $y_i=1$, possibly with some errors, and the plane $x^Tw=b-1$ bounds the class points with $y_i=-1$, also possibly with some errors. The errors are measured by the corresponding nonnegative error variables $\xi$.
The linear separating surface is the plane $x^Tw=b$, midway between the bounding planes $x^Tw=b\pm 1$. Usually problem (1) is solved through the following dual quadratic program:

$$\min_{u} \ \frac{1}{2}u^TDAA^TDu-e^Tu \quad \text{subject to} \quad y^Tu=0,\ \ 0\le u\le\nu e. \tag{2}$$

For the nonlinear problem, the optimal separating curve is found by solving the following dual program in feature space:

$$\min_{u} \ \frac{1}{2}u^TDK(A,A)Du-e^Tu \quad \text{subject to} \quad y^Tu=0,\ \ 0\le u\le\nu e, \tag{3}$$
where $K(A,A)$ is an $m\times m$ positive semidefinite matrix whose $(i,j)$th element is $K(x_i,x_j)$ for a kernel function $K(x,y)$.

Because only one equality constraint and bound constraints are present, decomposition methods are currently among the major approaches for solving SVMs on large-scale problems (Joachims, 1998; Platt, 1999; Joachims et al., 1999; Keerthi et al., 2001). A representative training algorithm is SMO (sequential minimal optimization), first proposed by Platt (1999) and extended by Keerthi et al. (2001). The basic idea of SMO is to select two elements as the working set per iteration; different working-set selections yield different SMO methods. The linear convergence of a popular decomposition method was proved under some assumptions (Lin, 2001, 2002). Recently Chen et al. (2005) summarized the common working-set selection methods and extended them to solve ν-SVM (Schölkopf et al., 2000; Chang and Lin, 2001). Among these decomposition methods, the software LIBSVM (Chang and Lin, 2005) is very popular.

The matrix $DAA^TD$ or $DK(A,A)D$ appearing in the dual objective function (2) or (3) is not positive definite in general, which causes some difficulties in solving the program. To overcome these difficulties, Mangasarian and co-workers replaced the 1-norm of $\xi$ by the squared 2-norm and appended the term $b^2$ to the objective function. Many improved SVMs were then proposed, such as the Smooth Support Vector Machine (Lee and Mangasarian, 2001), the Active Set Support Vector Machine (Mangasarian and Musicant, 2000), the Lagrangian Support Vector Machine (LSVM) (Mangasarian and Musicant, 2001) and the Successive Overrelaxation (SOR) SVM (Mangasarian and Musicant, 1999). In Ferris and Munson (2004), a semismooth SVM (SSVM) with a Q-quadratic rate of convergence was proposed by solving a formulation with one equality constraint and nonnegative variable constraints in feature space.

All the algorithms mentioned above either solve a dual program in feature space with a higher dimension (Platt, 1999; Mangasarian and Musicant, 1999, 2000, 2001; Keerthi et al., 2001; Ferris and Munson, 2004) or
solve a low-dimension program approximately with a smoothing penalization technique (Lee and Mangasarian, 2001; Zhou et al., 2003; Zhou and Zhou, 2004), so the efficiency or the precision of the algorithms is limited. In this paper, a Semismooth Newton Support Vector Machine (SNSVM) is proposed for the linear and nonlinear cases by solving Mangasarian's perturbed reformulation; only a low-dimension program is solved, and both efficiency and precision are improved.

Some definitions concerning semismooth functions are given below. Suppose that $G(x):\mathbb{R}^k\to\mathbb{R}^k$ is locally Lipschitz but not necessarily continuously differentiable. Let $D_G$ denote the set of points where $G$ is differentiable. For every $x\in D_G$, $G'(x)$ is the Jacobian of $G$ at $x$. The following definitions and lemma can be found in Qi and Sun (1999) and Ferris and Munson (2004).

Definition 1 (Clarke, 1983). The Clarke generalized Jacobian of $G$ at $x$ is $\partial G(x)=\mathrm{co}\,\partial_BG(x)$, where $\partial_BG(x)=\{\lim_{x_j\to x}G'(x_j)\mid x_j\in D_G\}$ and $\mathrm{co}$ denotes the convex hull.

Definition 2. Let $G(x):\mathbb{R}^k\to\mathbb{R}^k$ be locally Lipschitzian at $x\in\mathbb{R}^k$. $G$ is semismooth at $x$ if, for $V\in\partial G(x+d)$, $Vd-G'(x;d)=o(\|d\|)$ as $d\to 0$; $G$ is strongly semismooth at $x$ if $Vd-G'(x;d)=O(\|d\|^2)$ as $d\to 0$. Here $G'(x;d)$ denotes the directional derivative of $G$ at $x$ in the direction $d$, $o(\|d\|)$ stands for a vector function $e:\mathbb{R}^k\to\mathbb{R}^k$ satisfying $\lim_{d\to 0}e(d)/\|d\|=0$, and $O(\|d\|^2)$ stands for a vector function $e:\mathbb{R}^k\to\mathbb{R}^k$ such that for some constants $M>0$ and $\delta>0$, $\|e(d)\|\le M\|d\|^2$ holds for all $d$ with $\|d\|\le\delta$.

Lemma 1. The function $u(a)=a_+=\max\{a,0\}$ is strongly semismooth.

Some elementary notation is also fixed here. All vectors are column vectors unless transposed to a row vector by a superscript "T". For a vector $x$ in the $n$-dimensional real space $\mathbb{R}^n$, $x_+$ denotes the vector in $\mathbb{R}^n$ obtained by setting all negative components of $x$ to zero. A vector of ones or of zeros in a real space of arbitrary dimension is denoted by $e$ or $0$, respectively. The identity matrix of arbitrary dimension is denoted by $I$. For a matrix $A$, $A_i$ denotes the $i$th row of $A$, and $A_{i,j}$ denotes the corresponding element of $A$.

This paper is organized as follows. In Section 2 we discuss LSVM and SSVM briefly, and in Section 3 an unconstrained Lagrangian dual of SVM is obtained. The SNSVM algorithm is proposed in Section 4. Experimental results are shown in Section 5. Concluding remarks are presented in Section 6.

2. LSVM and SSVM

In this section LSVM (Mangasarian and Musicant, 2001) and SSVM (Ferris and Munson, 2004) are discussed briefly.
2.1. LSVM
Mangasarian and Musicant (2001) proposed a reformulation of program (1) that leads to the following L2-SVM (correspondingly, (1) is called L1-SVM):

$$\min_{(w,b,\xi)\in\mathbb{R}^{n+1+m}} \ \frac{\nu}{2}\xi^T\xi+\frac{1}{2}(w^Tw+b^2) \quad \text{subject to} \quad D(Aw-eb)+\xi\ge e. \tag{4}$$
First, the term $b^2$ is appended to the objective function, as in the SOR method of Mangasarian and Musicant (1999); second, the squared 2-norm of the misclassification error $\xi$ is used rather than the 1-norm of $\xi$. This reformulation is also used in Lee and Mangasarian (2001) and Mangasarian and Musicant (2000). In general, this reformulation is not robust against outliers, but many experimental results show no noticeable negative influence. A particular study compares the performance of L2-SVM and L1-SVM (Zhu et al., 2003); the results show that the classification performance is similar, but L2-SVM has more support vectors than L1-SVM. The apparent advantage of the reformulation is that the Wolfe dual of program (4) contains only the nonnegativity constraint and its objective function is strictly convex:

$$\min_{0\le u\in\mathbb{R}^m} \ q(u)=\frac{1}{2}u^TQu-e^Tu, \tag{5}$$

where $Q=I/\nu+HH^T$ and $H=D[A\ \ e]$. The KKT conditions (Bazaraa and Shetty, 1979; Mangasarian, 1994) of program (5) are

$$u\ge 0, \quad Qu-e\ge 0, \quad u^T(Qu-e)=0. \tag{6}$$

Mangasarian and Musicant (2001) wrote (6) in the equivalent form $Qu-e=((Qu-e)-cu)_+$, and a very simple iterative scheme, which constitutes the LSVM (Lagrangian support vector machine) algorithm, was proposed:

$$u^{i+1}=Q^{-1}\bigl[e+((Qu^i-e)-cu^i)_+\bigr]. \tag{7}$$

The linear convergence of LSVM is proved under the condition $0<c<2/\nu$. For the linear classification problem, the Sherman–Morrison–Woodbury (SMW) identity

$$Q^{-1}=(I/\nu+HH^T)^{-1}=\nu\bigl(I-H(I/\nu+H^TH)^{-1}H^T\bigr) \tag{8}$$

is used to improve the efficiency of LSVM. But for the nonlinear classification problem, the SMW identity cannot be used directly (there $Q=I/\nu+D(K(A,A)+ee^T)D$), so LSVM needs to calculate $Q^{-1}$ directly and the algorithm is only effective for moderately sized problems (Mangasarian and Musicant, 2001).
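For illustration, a minimal NumPy sketch of the LSVM iteration (7) for the linear case is given below, applying $Q^{-1}$ through the SMW identity (8). This is not the authors' MATLAB implementation ("lsvm.m"); the function name, the starting point, the choice $c=1.9/\nu$ and the stopping rule are illustrative defaults.

```python
import numpy as np

def lsvm_linear(A, y, nu, c=None, tol=1e-4, max_iter=1000):
    """Sketch of the LSVM iteration (7): u <- Q^{-1}[e + ((Qu - e) - c*u)_+],
    with Q = I/nu + H H^T, H = D[A e], and Q^{-1} applied via the SMW identity (8)."""
    m, n = A.shape
    H = y[:, None] * np.hstack([A, np.ones((m, 1))])   # H = D[A e]: row i scaled by y_i
    e = np.ones(m)
    if c is None:
        c = 1.9 / nu                                    # linear convergence requires 0 < c < 2/nu
    S = np.linalg.inv(np.eye(n + 1) / nu + H.T @ H)     # small (n+1) x (n+1) matrix
    Qinv = lambda v: nu * (v - H @ (S @ (H.T @ v)))     # SMW: Q^{-1} v
    Q = lambda v: v / nu + H @ (H.T @ v)                # apply Q without forming it
    u = Qinv(e)                                         # an illustrative starting point
    for _ in range(max_iter):
        u_new = Qinv(e + np.maximum(Q(u) - e - c * u, 0.0))
        if np.linalg.norm(u_new - u) < tol:
            return u_new
        u = u_new
    return u
```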
2.2. SSVM

Ferris and Munson (2004) proposed SSVM (semismooth SVM) starting from the following formulation:

$$\min_{(w,b,\xi)\in\mathbb{R}^{n+1+m}} \ \frac{\nu}{2}\xi^T\xi+\frac{1}{2}w^Tw \quad \text{subject to} \quad D(Aw-eb)+\xi\ge e.$$

Its dual program is

$$\min_{0\le u\in\mathbb{R}^m} \ \frac{1}{2}u^T(I/\nu+DAA^TD)u-e^Tu \quad \text{subject to} \quad e^TDu=0. \tag{9}$$

The KKT conditions of program (9) form the following mixed linear complementarity problem:

$$u\ge 0, \quad Qu-Dec-e\ge 0, \quad u^T(Qu-Dec-e)=0, \quad e^TDu=0, \tag{10}$$

where $Q=I/\nu+DAA^TD$ and $c$ is the optimal multiplier of $e^TDu=0$; for the nonlinear classification problem, $Q=I/\nu+DK(A,A)D$.

The semismooth function $\phi(a,b)=a+b-\sqrt{a^2+b^2}$ has the NCP (nonlinear complementarity problem) property that $\phi(a,b)=0 \iff 0\le a\perp b\ge 0$. In Ferris and Munson (2004), this property is used to convert the mixed linear complementarity problem (10) into a system of nonlinear semismooth equations, and the damped Newton method (Qi and Sun, 1999; Ferris and Munson, 2004), which has a Q-quadratic rate of convergence, is adopted to solve it. The following system of Newton equations is solved per iteration:

$$\begin{pmatrix} D_a+D_bQ & -D_bDe \\ e^TD & 0 \end{pmatrix}\begin{pmatrix} x \\ c \end{pmatrix}=\begin{pmatrix} r_1 \\ r_2 \end{pmatrix},$$

where $D_a$ and $D_b$ are diagonal matrices (see Ferris and Munson (2004) for details). In order to solve these Newton equations simply, the SMW identity is also used to calculate $(D_a+D_bQ)^{-1}$.

In order to compare SSVM with the algorithm proposed below, we use Ferris and Munson's SSVM to solve program (5) rather than (9); the corresponding Newton equation then simplifies to

$$(D_a+D_bQ)x=r, \tag{11}$$

where $Q=I/\nu+D(AA^T+ee^T)D$ or $Q=I/\nu+D(K(A,A)+ee^T)D$.
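As a small illustration (separate from the original implementation), the NCP property of the Fischer–Burmeister-type function $\phi$ used above can be checked numerically:

```python
import numpy as np

def phi(a, b):
    """phi(a, b) = a + b - sqrt(a^2 + b^2); it vanishes exactly when a >= 0, b >= 0 and a*b = 0."""
    return a + b - np.sqrt(a ** 2 + b ** 2)

# Quick numerical check on a few pairs (a, b):
for a, b in [(0.0, 3.0), (2.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]:
    complementary = (a >= 0) and (b >= 0) and (a * b == 0)
    print((a, b), phi(a, b), complementary)   # phi == 0 exactly for the complementary pairs
```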
3. Unconstrained Lagrangian dual of SVM

In this section, we consider the Lagrangian dual problem of program (5), which is an unconstrained optimization problem with a differentiable piecewise quadratic objective function; some encouraging results are obtained. Madsen et al. (1999) also used a similar approach to convert a bound-constrained quadratic program into a piecewise quadratic function.
3.1. Linear classification case

Let $z=H^Tu$; then program (5) is equivalent to

$$\min_{u\in\mathbb{R}^m,\ z\in\mathbb{R}^{n+1}} \ \frac{1}{2\nu}u^Tu+\frac{1}{2}z^Tz-e^Tu \quad \text{subject to} \quad z=H^Tu,\ \ u\ge 0. \tag{12}$$

Its Lagrangian dual problem is

$$\max_{x\in\mathbb{R}^{n+1}}\ \min_{0\le u\in\mathbb{R}^m,\ z\in\mathbb{R}^{n+1}}\ \left\{\frac{1}{2\nu}u^Tu+\frac{1}{2}z^Tz-e^Tu+x^T(H^Tu-z)\right\},$$

which is equivalent to

$$\max_{x\in\mathbb{R}^{n+1}}\ \left\{\min_{z\in\mathbb{R}^{n+1}}\left[\frac{1}{2}z^Tz-x^Tz\right]+\min_{0\le u\in\mathbb{R}^m}\left[\frac{1}{2\nu}u^Tu-(e-Hx)^Tu\right]\right\}. \tag{13}$$

In the objective function of (13), the first minimization problem, over $z$, yields the identity $z=x$, which, when substituted back, gives the term $-\frac{1}{2}x^Tx$. The second minimization problem, over $u$, can be solved analytically at $u=\nu(e-Hx)_+$ (see Rockafellar (1974); we also proved it in Zhou and Zhou (2006) with a different method). Substituting back, program (13) simplifies to

$$\max_{x\in\mathbb{R}^{n+1}} \ -\frac{1}{2}x^Tx-\frac{\nu}{2}(e-Hx)^T(e-Hx)_+,$$

which is equivalent to the following minimization problem:

$$\min_{x\in\mathbb{R}^{n+1}} \ f(x)=\frac{1}{2}x^Tx+\frac{\nu}{2}(e-Hx)^T(e-Hx)_+. \tag{14}$$
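Indeed, the minimization over $u$ in (13) separates across components: for each $i$ one solves $\min_{u_i\ge 0}\ \frac{1}{2\nu}u_i^2-(1-H_ix)u_i$. If $1-H_ix>0$, the unconstrained minimizer $u_i=\nu(1-H_ix)$ is feasible; otherwise the quadratic is nondecreasing on $u_i\ge 0$ and the minimum is attained at $u_i=0$. Hence $u=\nu(e-Hx)_+$, with attained value

$$\sum_{i=1}^m\Bigl[\tfrac{\nu}{2}(1-H_ix)_+^2-\nu(1-H_ix)(1-H_ix)_+\Bigr]=-\tfrac{\nu}{2}(e-Hx)^T(e-Hx)_+,$$

which, together with the value $-\tfrac{1}{2}x^Tx$ of the minimization over $z$, gives (14).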
Because problems (5) and (14) are both strictly convex programs, strong duality holds. We then have the following result.

Theorem 1. Let $x$ be the minimizer of program (14) and $u$ the minimizer of (5); then $u=\nu(e-Hx)_+$.

3.2. Nonlinear classification case

For the nonlinear classification problem, the Cholesky factorization or the incomplete Cholesky factorization is used to decompose the kernel matrix $K(A,A)$ (Lin and Saigal, 2000), so that

$$K(A,A)\approx GG^T, \quad G\in\mathbb{R}^{m\times l}\ (l\le m). \tag{15}$$

This approach is also used in Fine and Scheinberg (2001) and Ferris and Munson (2004).
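As an illustration, a simple pivoted Cholesky with early stopping produces such a factor $G$; the sketch below is a generic variant and not necessarily the exact routine of Lin and Saigal (2000) or Fine and Scheinberg (2001), and its function name and tolerance are illustrative.

```python
import numpy as np

def pivoted_cholesky(K, tol=1e-6, max_rank=None):
    """Greedy pivoted (incomplete) Cholesky of a PSD kernel matrix: K ~= G @ G.T,
    with G of size m x l and l chosen by a tolerance on the residual diagonal."""
    m = K.shape[0]
    max_rank = m if max_rank is None else max_rank
    d = np.array(np.diag(K), dtype=float)      # diagonal of the current residual
    G = np.zeros((m, max_rank))
    for j in range(max_rank):
        i = int(np.argmax(d))                  # pivot on the largest residual diagonal
        if d[i] <= tol:
            return G[:, :j]                    # early stop: numerical rank l = j
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2                      # update residual diagonal
        np.maximum(d, 0.0, out=d)              # guard against rounding
    return G
```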
Let $H=D[G\ \ e]$; a quadratic program similar to the linear case is obtained:

$$\min_{0\le u\in\mathbb{R}^m} \ q(u)=\frac{1}{2}u^T(I/\nu+HH^T)u-e^Tu. \tag{16}$$

Furthermore, its unconstrained Lagrangian dual is

$$\min_{x\in\mathbb{R}^{l+1}} \ f(x)=\frac{1}{2}x^Tx+\frac{\nu}{2}(e-Hx)^T(e-Hx)_+. \tag{17}$$

The only difference between programs (17) and (14) is that $H\in\mathbb{R}^{m\times(l+1)}$ rather than $H\in\mathbb{R}^{m\times(n+1)}$; commonly $l\ge n$.
4. Semismooth Newton SVM

The objective function $f(x)$ of program (14) or (17) is a continuously differentiable piecewise quadratic function with gradient $\nabla f(x)=x-\nu H^T(e-Hx)_+$. Its unique minimizer corresponds to the solution of the system of equations $x-\nu H^T(e-Hx)_+=0$, which is a system of non-differentiable, semismooth piecewise linear equations. As in Qi and Sun (1999) and Ferris and Munson (2004), a generalized Newton method (damped Newton method) is adopted in this section to find the solution of this system of semismooth equations.

4.1. Algorithm

Let

$$G(x)=x-\nu H^T(e-Hx)_+. \tag{18}$$

Then $G(x)$ is strongly semismooth by Definition 2 and Lemma 1, and the solution of the semismooth equations $G(x)=0$ is exactly the solution of program (14) or (17). The proposed method is called the Semismooth Newton Support Vector Machine (SNSVM).

Algorithm 1. SNSVM
Step 0. (Initialization) Input data $H$ and $\nu$; choose $x_0\in\mathbb{R}^{n+1}$ (or $x_0\in\mathbb{R}^{l+1}$), $\varepsilon>0$, $\beta>0$, $p>2$ and $\sigma\in(0,1/2)$. Calculate $g_0=G(x_0)$. Set $k=0$.
Step 1. (Termination) If $\|g_k\|<\varepsilon$, stop.
Step 2. (Direction generation) Otherwise, let $B_k\in\partial G(x_k)$ and calculate $d_k$ by solving the Newton system $B_kd=-g_k$. If either $B_kd=-g_k$ is unsolvable or the descent condition $g_k^Td_k<-\beta\|d_k\|^p$ is not satisfied, set $d_k=-g_k$.
Step 3. (Line search) Choose $t_k=2^{-i_k}$, where $i_k$ is the smallest nonnegative integer such that $f(x_k+2^{-i_k}d_k)\le f(x_k)+\sigma 2^{-i_k}g_k^Td_k$.
Step 4. (Update) Let $x_{k+1}=x_k+t_kd_k$, $g_{k+1}=G(x_{k+1})$ and $k:=k+1$. Go to Step 1.

Because the objective function is strictly convex and the solution is unique, the starting point is not important; in this work, we always set $x_0=e$. The most important issues are the convergence rate of the method and the technique for selecting and updating the matrix $B_k$, both of which are discussed in detail below.
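Before turning to these issues, a minimal NumPy sketch of Algorithm 1 is given for concreteness. It uses the specific generalized Jacobian choice (20) introduced in Section 4.3 but rebuilds it from scratch at every iteration (the incremental update (21) is omitted); the function names and parameter defaults are illustrative, not the MATLAB implementation used in the experiments.

```python
import numpy as np

def f_dual(x, H, nu, e):
    """Objective (14)/(17): f(x) = 0.5*x'x + (nu/2)*||(e - Hx)_+||^2."""
    r = np.maximum(e - H @ x, 0.0)
    return 0.5 * x @ x + 0.5 * nu * r @ r

def snsvm(H, nu, eps=1e-8, beta=1e-8, p=2.1, sigma=0.25, eps0=1e-10, max_iter=100):
    """Sketch of Algorithm 1: damped semismooth Newton on G(x) = x - nu*H'(e - Hx)_+ = 0."""
    m, k = H.shape
    e = np.ones(m)
    x = np.ones(k)                                   # x0 = e, as in the paper
    for _ in range(max_iter):
        r = e - H @ x
        g = x - nu * H.T @ np.maximum(r, 0.0)        # g_k = G(x_k) = grad f(x_k)
        if np.linalg.norm(g) < eps:
            break
        Jp = r >= eps0                               # rows with 1 - H_i x > 0 (within eps0)
        J0 = np.abs(r) < eps0                        # rows at the kink (a = 1/2 in (19))
        B = np.eye(k) + nu * H[Jp].T @ H[Jp] + 0.5 * nu * H[J0].T @ H[J0]   # choice (20)
        d = np.linalg.solve(B, -g)                   # Newton direction
        if g @ d >= -beta * np.linalg.norm(d) ** p:
            d = -g                                   # fall back to steepest descent
        t, fx = 1.0, f_dual(x, H, nu, e)
        while f_dual(x + t * d, H, nu, e) > fx + sigma * t * (g @ d) and t > 1e-14:
            t *= 0.5                                 # t_k = 2^{-i_k}
        x = x + t * d
    u = nu * np.maximum(e - H @ x, 0.0)              # dual variables via Theorem 1
    return x, u
```

For the linear case, $H=D[A\ \ e]$; for the nonlinear case, $H=D[G\ \ e]$ with $G$ from the factorization (15).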
4.2. Convergence results

By Lemma 1, $u(t)=t_+=\max\{t,0\}$ ($t\in\mathbb{R}$) is a strongly semismooth function, and its generalized gradient satisfies

$$\partial u(t)=\begin{cases}1, & t>0,\\ a\in[0,1], & t=0,\\ 0, & t<0.\end{cases} \tag{19}$$

For a given $x$, let the diagonal matrix $S(x)=\mathrm{diag}(s_1,\ldots,s_m)$ satisfy $s_i\in\partial u(1-H_ix)$, $i=1,2,\ldots,m$. Then the following results hold.

Theorem 2. For $G(x)$ in (18), $\partial G(x)\subseteq\{I+\nu H^TS(x)H\}$ holds, and every $B\in\partial G(x)$ is a positive definite matrix.

Proof. Let the vector-valued function $V(x)=(e-Hx)_+$, so that $V_i(x)=(1-H_ix)_+$, $i=1,2,\ldots,m$. Let $\partial_jV_i(x)$ be the $j$th element of the generalized gradient of $V_i(x)$. By the chain rule (Theorem 2.3.9 in Clarke (1983)), $\partial_jV_i(x)\subseteq -H_{i,j}\,\partial u(1-H_ix)=-H_{i,j}s_i$ ($j=1,2,\ldots,l+1$, $i=1,2,\ldots,m$). Hence $\partial V_i(x)\subseteq -s_iH_i^T$, and the Clarke generalized Jacobian of $V(x)$ satisfies

$$\partial V(x)\subseteq -\begin{bmatrix} s_1H_1\\ s_2H_2\\ \vdots\\ s_mH_m\end{bmatrix}=-S(x)H.$$

So $\partial G(x)\subseteq\{I+\nu H^TS(x)H\}$. Moreover, for any $v\ne 0$, $v^T(I+\nu H^TS(x)H)v=\|v\|^2+\nu(Hv)^TS(x)(Hv)\ge\|v\|^2>0$ since $S(x)$ is diagonal with entries in $[0,1]$, so every $B\in\partial G(x)$ is positive definite. This completes the proof.

Because $G(x)$ is strongly semismooth and every $B\in\partial G(x)$ is positive definite, we have the following convergence result, which follows directly from Theorem 2.4 of Qi and Sun (1999) and Theorem 3 of Ferris and Munson (2004).

Theorem 3. In the SNSVM algorithm, any accumulation point of $\{x_k\}$ is a stationary point of $f(x)$. Furthermore, SNSVM has a Q-quadratic rate of convergence.
4.3. Selecting and updating $B_k$

For simplicity, we always set $a=\frac{1}{2}$ in (19) when $t=0$; then, from Theorem 2, we select $B_k$ as

$$B_k=I+\nu H_{J_+^k}^TH_{J_+^k}+\frac{\nu}{2}H_{J_0^k}^TH_{J_0^k}, \tag{20}$$

where $J_+^k=\{i\mid 1-H_ix_k>0\}$, $J_0^k=\{i\mid 1-H_ix_k=0\}$, $H_{J_+^k}$ is the matrix consisting of the rows of $H$ with $i\in J_+^k$, and $H_{J_0^k}$ is the matrix consisting of the rows of $H$ with $i\in J_0^k$. Because of rounding errors, it is necessary in practice to set $J_+^k=\{i\mid 1-H_ix_k\ge\varepsilon_0\}$ and $J_0^k=\{i\mid |1-H_ix_k|<\varepsilon_0\}$, where $\varepsilon_0$ is a tiny positive number such as $\varepsilon_0=10^{-10}$ or less.

Because the sets $J_+^k$ (or $J_0^k$) and $J_+^{k+1}$ (or $J_0^{k+1}$) always have many elements in common, we only need to record the rows in $J_+^{k+1}$ (or $J_0^{k+1}$) but not in $J_+^k$ (or $J_0^k$), and the rows in $J_+^k$ (or $J_0^k$) but not in $J_+^{k+1}$ (or $J_0^{k+1}$), and then update the matrix $B_k$ to $B_{k+1}$. Denoting

$$J_+^{\mathrm{in}}=\{i\mid i\in J_+^{k+1}\wedge i\notin J_+^k\}, \quad J_+^{\mathrm{out}}=\{i\mid i\in J_+^k\wedge i\notin J_+^{k+1}\},$$
$$J_0^{\mathrm{in}}=\{i\mid i\in J_0^{k+1}\wedge i\notin J_0^k\}, \quad J_0^{\mathrm{out}}=\{i\mid i\in J_0^k\wedge i\notin J_0^{k+1}\},$$

we can update $B_{k+1}$ from $B_k$ as follows:

$$B_{k+1}=B_k+\nu H_{J_+^{\mathrm{in}}}^TH_{J_+^{\mathrm{in}}}-\nu H_{J_+^{\mathrm{out}}}^TH_{J_+^{\mathrm{out}}}+\frac{\nu}{2}H_{J_0^{\mathrm{in}}}^TH_{J_0^{\mathrm{in}}}-\frac{\nu}{2}H_{J_0^{\mathrm{out}}}^TH_{J_0^{\mathrm{out}}}. \tag{21}$$

Because the sets $J_+^{\mathrm{in}}$, $J_+^{\mathrm{out}}$, $J_0^{\mathrm{in}}$ and $J_0^{\mathrm{out}}$ usually contain only a few elements, the update (21) only needs products of matrices of much smaller size. This improves the efficiency of the method, as illustrated by the experiments.
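A small NumPy sketch of the incremental update (21) follows, with the index sets represented as boolean masks; the function name and mask representation are illustrative. In the SNSVM sketch of Section 4.1, the line that rebuilds $B$ from scratch can be replaced by this update once the masks from the previous iteration are kept.

```python
import numpy as np

def update_B(B, H, nu, Jp_old, J0_old, Jp_new, J0_new):
    """Update B_k to B_{k+1} via (21), touching only rows whose membership in
    J_+ or J_0 changed between iterations; the J's are boolean masks of length m."""
    in_p  = Jp_new & ~Jp_old          # rows entering J_+
    out_p = Jp_old & ~Jp_new          # rows leaving  J_+
    in_0  = J0_new & ~J0_old          # rows entering J_0
    out_0 = J0_old & ~J0_new          # rows leaving  J_0
    return (B + nu * (H[in_p].T @ H[in_p]) - nu * (H[out_p].T @ H[out_p])
              + 0.5 * nu * (H[in_0].T @ H[in_0]) - 0.5 * nu * (H[out_0].T @ H[out_0]))
```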
5. Numerical experiments

In this section we compare the performance of SNSVM with LSVM and SSVM, in the linear case and in the nonlinear case with a Gaussian kernel function, respectively. All experiments are run on a personal computer with a 3.0 GHz Pentium IV processor and a maximum of 896 Mbytes of memory available for all processes, running Windows XP with MATLAB 7.1. The source code of LSVM, "lsvm.m", is obtained from the authors' web site (Musicant and Mangasarian, 2000) for the linear problem, and "lsvmk.m" for the nonlinear problem. The programs for SSVM and SNSVM are written in MATLAB. In SSVM, the SMW identity is used to calculate $(D_a+D_bQ)^{-1}$ when solving (11), and the sparsity of the diagonal of $D_b$ is also exploited to reduce the complexity, as Ferris and Munson (2004) do. In our experiments, all of the input data and the variables needed by the programs are kept in memory. For LSVM and SSVM we use an optimality tolerance of $10^{-4}$ to determine when to terminate, and for SNSVM the stopping criterion is $\varepsilon=10^{-8}$. All results in the tables are means over ten runs.

The first results, in Table 1, compare the training time, number of support vectors, iterations and training correctness of LSVM, SSVM and SNSVM for linear classification problems on six datasets available from the UCI Machine Learning Repository (Murphy and Aha, 1992). The results show that the training accuracies are almost the same, but the training time of the new method is smaller than that of LSVM and SSVM. The numbers of iterations of SNSVM and SSVM are very small because of their Q-quadratic rate of convergence.

Table 2 shows results from running LSVM, SSVM and SNSVM on massively sized datasets. The datasets are created using Musicant's NDC Data Generator (Musicant, 1998) with different sizes, and the test samples are 5% of the training samples. The training time, number of support vectors, iterations, training correctness and testing correctness are compared.
Table 1
SNSVM compared with LSVM and SSVM on six UCI datasets in linear classification problems

Dataset                             Algorithm   CPU time (s)   No. of SVs   Iter.   Train correc. (%)
Liver disorder 345 × 6 (ν = 1)      LSVM        0.0039         338          68      71.01
                                    SSVM        0.0031         331          6       71.01
                                    SNSVM       0.0008         331          3       71.01
Pima diabetes 768 × 8 (ν = 1)       LSVM        0.0086         749          74      78.39
                                    SSVM        0.0106         708          6       78.39
                                    SNSVM       0.0024         708          4       78.39
Tic tac toe 958 × 9 (ν = 1)         LSVM        0.0070         957          46      98.33
                                    SSVM        0.0063         941          4       98.33
                                    SNSVM       0.0017         941          2       98.33
Ionosphere 351 × 34 (ν = 4)         LSVM        0.0102         164          115     93.73
                                    SSVM        0.0126         232          9       93.73
                                    SNSVM       0.0047         154          7       93.73
Adult^a 32562 × 14 (ν = 1)          LSVM        0.9984         21513        94      84.16
                                    SSVM        1.1656         21513        10      84.16
                                    SNSVM       0.1672         21513        6       84.16
Connect-4^b 61108 × 42 (ν = 1)      LSVM        3.1969         60920        63      73.00
                                    SSVM        2.7719         60920        4       73.00
                                    SNSVM       0.3969         60920        3       73.00

a The values are divided by the maximum absolute value of the corresponding dimension over the whole sample set.
b Discarding the third class (draw).
Table 2
SNSVM compared with LSVM and SSVM on NDC-generated datasets of different sizes (ν = 1)

Trains, dimension   Algorithm   Time (s)   No. of SVs   Iter.    Train correc. (%)   Test correc. (%)
10,000, 100         LSVM        1.6406     2946         133      94.26               93.60
                    SSVM        1.1719     2946         13       94.26               93.60
                    SNSVM       0.3438     2946         7        94.26               93.60
10,000, 500         LSVM        8.1563     4959         118      89.68               90.20
                    SSVM        14.906     4961         11       89.68               90.20
                    SNSVM       2.8750     4961         6        89.68               90.20
10,000, 1000        LSVM        28.969     6035         201      91.68               86.00
                    SSVM        56.219     4172         12       91.67               86.20
                    SNSVM       10.484     4172         7        91.67               86.20
20,000, 1000        LSVM        464.88     11,906       2000^a   90.16               87.00
                    SSVM        111.31     9572         12       90.15               87.00
                    SNSVM       18.264     9572         6        90.15               87.00
2,000,000, 10       LSVM        110.54     986,750      157      90.86               91.22
                    SSVM        48.973     986,752      11       90.86               91.22
                    SNSVM       11.964     986,752      7        90.86               91.22
2,000,000, 20       LSVM        1861.7     1,205,764    2000^a   87.64               87.08
                    SSVM        267.64     1,205,767    11       87.64               87.08
                    SNSVM       17.398     1,205,767    8        87.64               87.08

a The maximum of 2000 iterations was reached.
The results show that the training and testing correctness are almost the same, but SNSVM solves massive problems more quickly than SSVM and LSVM, using only a few iterations, and the training time of LSVM increases rapidly as the size of the training data increases.

The third experiment is designed to demonstrate the effectiveness of SNSVM in solving nonlinear classification problems through the use of kernel functions. One highly nonlinearly separable but simple example is the "tried and true" checkerboard dataset (Ho and Kleinberg, 1996), which has often been used to show the effectiveness of nonlinear kernel methods (Lee and Mangasarian, 2001; Mangasarian and Musicant, 2001). The checkerboard dataset is generated by uniformly discretizing the region [0, 199] × [0, 199] into 200² = 40,000 points, labeled into two classes, "White" and "Black", arranged on a 4 × 4 grid as Fig. 1 shows.
Fig. 1. The checkerboard dataset.
In the first trial of this experiment, the training set contains 1000 points randomly sampled from the checkerboard (for comparison, these data are obtained from Ho and Kleinberg (1996) and are the same as in Mangasarian and Musicant (2001), containing 514 "white" samples and 486 "black" samples), and the remaining 39,000 points form the testing set. The Gaussian kernel function $K(x,y)=\exp(-0.0011\|x-y\|^2)$ is used and $\nu=10^5$. The total time for the 1000-point training set using SNSVM with a Gaussian kernel is 3.49 s (including 0.49 s for factorizing the kernel matrix), and the test set accuracy is 98.28% on the 39,000-point test set, with only 93 iterations needed. SSVM also solves this problem effectively and obtains the same test accuracy within 4.89 s and 34 iterations. But LSVM (lsvmk.m) exits after $10^6$ iterations without meeting the optimality tolerance of 0.0001, taking 1.40 h, and its test set accuracy is 97.37%. Similar results for LSVM were also reported in Mangasarian and Musicant (2001).

The remaining results are presented in Table 3 for training sets of different sizes randomly sampled from the checkerboard, using LSVM, SSVM and SNSVM with the same Gaussian kernel function. The results demonstrate that SSVM and SNSVM solve these problems quickly but LSVM is less effective. At the same time, the results also show that SNSVM has higher precision: for SNSVM the number of support vectors with $u>0$ equals the number of support vectors with tolerance $u>10^{-4}$, but this does not hold for SSVM.

From the results in Table 3, it is clear that SSVM and SNSVM have a large advantage over LSVM. This is mainly because they have a Q-quadratic rate of convergence, whereas LSVM has only a linear rate of convergence. SNSVM, in turn, has a more modest advantage over SSVM, mainly in training time; we compare their complexity in the following. SSVM solves a nonlinear problem of dimension $m$, while our algorithm minimizes an unconstrained differentiable convex piecewise quadratic problem of dimension only $l+1$, where $l$ is determined by the incomplete Cholesky factorization (15) and always $l<m$. In SSVM, the SMW identity is used to reduce the complexity from $O(m^3)$ to $O(\max(l^3,ml^2))\approx O(ml^2)$; the sparsity of the diagonal of $D_b$ is used to reduce the complexity further to $O(m_1l^2)$, where $m_1$ is the number of nonzero elements (with tolerance $10^{-10}$) on the diagonal of $D_b$. In SNSVM, the complexity of solving the Newton equation is $O(l^3)$, and updating the generalized Jacobian with (21) needs $O(l^2m_2)$, where $m_2$ is the maximum cardinality of the sets $J_+^{\mathrm{in}}$, $J_+^{\mathrm{out}}$, $J_0^{\mathrm{in}}$ and $J_0^{\mathrm{out}}$. So the total complexity is $O(\max(l^3,l^2m_2))\approx O(l^2m_2)$. As the iterations proceed, $m_2\to 0$ while $m_1$ tends to the number of support vectors, so the major difference between SNSVM and SSVM is the difference between $m_1$ and $m_2$; $m_2$ is typically much smaller than $m$, as shown by the following experiments comparing $m_1$ and $m_2$.

Fig. 2 shows the percentage of diagonal elements of $D_b$ that are nonzero per iteration for SSVM on the checkerboard experiments of different sizes, corresponding to the results in Table 3. There are some differences between our curves in Fig. 2 and Ferris and Munson's results (Fig. 1 in Ferris and Munson (2004)), but for the linear problem the curves we plot are very similar to the results in Ferris and Munson (2004) and are not presented here.
Table 3 SNSVM compared with LSVM and SSVM on checkerboard dataset with difference size (v = 105) No. of SVs (u > 0)
No. of SVs (u > 104)
Training size
Algorithms
Training time (s)
Iter.
Test correc. (%)
1000
LSVM SSVM SNSVM
5032 4.89 3.49
106 36 93
97.37 98.28 98.28
309 543 71
301 71 71
2000
LSVM SSVM SNSVM
20308 11.83 4.98
106 46 95
97.41 98.54 98.54
512 1049 76
508 76 76
3000
LSVM SSVM SNSVM
44107 25.16 6.72
106 66 90
98.34 98.88 98.88
272 1519 74
272 74 74
5000
LSVM SSVM SNSVM
Too long 47.16 8.41
– 82 124
– 99.51 99.51
– 2517 79
– 79 79
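For reference, the sketch below shows one way the nonlinear experiment can be set up (checkerboard labels, Gaussian kernel, and the matrix $H=D[G\ \ e]$ fed to the SNSVM sketch of Section 4). The 4 × 4 labeling rule and the helper names are our illustrative assumptions, chosen only to be consistent with Fig. 1 and the kernel parameters quoted above; this is not the data-generation code of Ho and Kleinberg (1996).

```python
import numpy as np

rng = np.random.default_rng(0)

# Checkerboard data on [0, 199] x [0, 199]: one plausible 4 x 4 labeling (cells of width 50).
grid = np.array([(i, j) for i in range(200) for j in range(200)], dtype=float)
labels = np.where(((grid[:, 0] // 50 + grid[:, 1] // 50) % 2) == 0, 1.0, -1.0)

idx = rng.choice(len(grid), size=1000, replace=False)   # 1000 random training points
A, y = grid[idx], labels[idx]

def gaussian_kernel(X, Z, gamma=0.0011):
    """K(x, z) = exp(-gamma * ||x - z||^2), as used in the checkerboard experiment."""
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * sq)

K = gaussian_kernel(A, A)
# G = pivoted_cholesky(K)                                  # K ~= G G^T, as in (15) (sketch in Section 3.2)
# H = y[:, None] * np.hstack([G, np.ones((len(y), 1))])    # H = D[G e]
# x, u = snsvm(H, nu=1e5)                                  # SNSVM sketch from Section 4.1
```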
Fig. 2. The plot of m1/m versus iterations for SSVM on the checkerboard experiments of different sizes, corresponding to Table 3.

Fig. 3 shows the percentage given by the maximum cardinality of $J_+^{\mathrm{in}}$, $J_+^{\mathrm{out}}$, $J_0^{\mathrm{in}}$ and $J_0^{\mathrm{out}}$, namely $m_2$, per iteration when updating the generalized Jacobian with (21) in training the checkerboard experiments of different sizes. It clearly shows that $m_2$ is typically much smaller than $m$ for most iterations. Comparing the curves in Figs. 2 and 3, we can conclude that SNSVM has some advantage over SSVM.

Fig. 3. The plot of m2/m versus iterations for SNSVM on the checkerboard experiments of different sizes, corresponding to Table 3.

6. Conclusion

This paper proposes a semismooth Newton support vector machine (SNSVM), which only needs to find the unique minimizer of a low-dimension, unconstrained, differentiable, convex piecewise quadratic function. Compared with existing methods it has many advantages, such as solving a low-dimension problem, a fast convergence rate, less training time, and the ability to solve very massive problems; these are illustrated by the experiments in Section 5. Furthermore, the piecewise quadratic character of the objective function is not fully exploited in this algorithm; Sun (1997) gives some stronger convergence results for Newton's method when the objective function is piecewise quadratic. Further results will be reported in future research.

Acknowledgements

We would like to thank O.L. Mangasarian and D.R. Musicant for supplying their source code for LSVM on the web. In addition, we would also like to express our gratitude to D.R. Musicant for supplying his source code for the NDC Data Generator.

References
Bazaraa, M.S., Shetty, C.M., 1979. Nonlinear Programming: Theory and Algorithms. John Wiley and Sons, New York.
Chang, C.-C., Lin, C.-J., 2001. Training ν-support vector classifiers: Theory and algorithms. Neural Comput. 13, 2119–2147.
Chang, C.-C., Lin, C.-J., 2005. LIBSVM: A Library for Support Vector Machines.
Chen, P.-H., Fan, R.-E., Lin, C.-J., 2005. A Study on SMO-type Decomposition Methods for Support Vector Machines.
Clarke, F.H., 1983. Optimization and Nonsmooth Analysis. John Wiley & Sons, New York.
Ferris, M.C., Munson, T.S., 2004. Semismooth support vector machines. Math. Programm. Ser. B 101, 185–204.
Fine, S., Scheinberg, K., 2001. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res. 2, 243–264.
Ho, T.K., Kleinberg, E.M., 1996. Checkerboard Dataset.
Joachims, T., 1998. SVMlight.
Joachims, T., 1999. Making large-scale SVM learning practical. In: Schölkopf, B. et al. (Eds.), Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA.
Keerthi, S., Shevade, S., Bhattacharyya, C., Murthy, K., 2001. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput. 13, 637–649.
Lee, Y.-J., Mangasarian, O.L., 2001. SSVM: A smooth support vector machine. Comput. Optimiz. Appl. 20, 5–22.
Lin, C.-J., 2001. On the convergence of the decomposition method for support vector machines. IEEE Trans. Neural Networks 12, 1288–1298.
Lin, C.-J., 2002. Asymptotic convergence of an SMO algorithm without any assumptions. IEEE Trans. Neural Networks 13, 248–250.
Lin, C.-J., Saigal, R., 2000. An incomplete Cholesky factorization for dense matrices. BIT 40, 536–558.
Madsen, K., Nielsen, H.B., Pinar, M.C., 1999. Bound constrained quadratic programming via piecewise quadratic functions. Math. Programm. 85 (1), 135–156.
Mangasarian, O.L., 1994. Nonlinear Programming. SIAM, Philadelphia, PA.
Mangasarian, O.L., Musicant, D.R., 1999. Successive overrelaxation for support vector machines. IEEE Trans. Neural Networks 10, 1032–1037.
Mangasarian, O.L., Musicant, D.R., 2000. Active set support vector machine classification. In: Neural Information Processing Systems 2000 (NIPS 2000). MIT Press, pp. 577–583.
Mangasarian, O.L., Musicant, D.R., 2001. Lagrangian support vector machines. J. Mach. Learn. Res. 1, 161–177.
Murphy, P.M., Aha, D.W., 1992. UCI Repository of Machine Learning Databases.
Musicant, D.R., 1998. NDC: Normally Distributed Clustered Datasets.
Musicant, D.R., Mangasarian, O.L., 2000. LSVM: Lagrangian Support Vector Machine.
Platt, J.C., 1999. Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B. (Ed.), Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, pp. 185–208.
Qi, L., Sun, D., 1999. A survey of some nonsmooth equations and smoothing Newton methods. In: Eberhard, A., Glover, B., Hill, R., Ralph, D. (Eds.), Progress in Optimization, Applied Optimization, vol. 30. Kluwer Academic Publishers, Dordrecht, pp. 121–146.
Rockafellar, R.T., 1974. Conjugate Duality and Optimization. SIAM, Philadelphia.
Schölkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L., 2000. New support vector algorithms. Neural Comput. 12, 1207–1245.
Sun, J., 1997. On piecewise quadratic Newton and trust region problems. Math. Programm. 76, 451–467.
Vapnik, V.N., 1999. An overview of statistical learning theory. IEEE Trans. Neural Networks 10, 988–999.
Vapnik, V.N., 2000. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Zhou, S.-S., Zhou, L.-H., 2004. Lower dimension Newton-algorithm for training the support vector machines. Syst. Eng. Electron. 26, 1315–1318 (in Chinese).
Zhou, S.-S., Zhou, L.-H., 2006. Conjugate gradient support vector machine. Pattern Recognition Artificial Intell. 19 (2), 129–136 (in Chinese).
Zhou, S.-S., Rong, X.-F., Zhou, L.-H., 2003. A maximum entropy method for training the support vector machines. Signal Process. 19, 595–599 (in Chinese).
Zhu, Y.-S., Wang, C.-D., Zhang, Y.-Y., 2003. Experimental study on the performance of support vector machine with squared cost function. Chinese J. Comput. 26, 982–989 (in Chinese).