Application of smoothing technique on twin support vector machines


Pattern Recognition Letters 29 (2008) 1842–1848


M. Arun Kumar (corresponding author), M. Gopal
Control Group, Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110 016, India

Article history: Received 6 September 2007; received in revised form 19 May 2008; available online 3 June 2008. Communicated by Y. Ma.

Keywords: Support vector machines; Pattern recognition; Twin support vector machines

Abstract: This paper enhances the recently proposed twin SVM of Jayadeva et al. [Jayadeva, Khemchandani, R., Chandra, S., 2007. Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Machine Intell. 29 (5), 905–910] using smoothing techniques, yielding smooth twin SVM for binary classification. We attempt to solve the primal quadratic programming problems of twin SVM by converting them into smooth unconstrained minimization problems. The smooth reformulations are solved using the well-known Newton–Armijo algorithm. The effectiveness of the enhanced method is demonstrated by experimental results on available benchmark datasets. © 2008 Elsevier B.V. All rights reserved.

1. Introduction

Support vector machines (SVMs), being computationally powerful tools for supervised learning, are widely used in classification and regression problems. SVMs have been successfully applied to a variety of real-world problems such as particle identification, face recognition, text categorization and bioinformatics (Burges, 1998). The approach is systematic and motivated by statistical learning theory (SLT) and Bayesian arguments. The central idea of SVM is to find the optimal separating hyperplane between the positive and negative examples, defined as the hyperplane giving maximum margin between the training examples that are closest to it. The recently proposed generalized eigenvalue proximal support vector machine (GEPSVM) (Mangasarian and Wild, 2006) performs binary classification by obtaining two non-parallel hyperplanes, one for each class. In this approach, the data points of each class are clustered around the corresponding hyperplane, and a new data point is assigned to a class based on its proximity to one of the two hyperplanes. This formulation leads to two generalized eigenvalue problems, whose solutions are the eigenvectors corresponding to the smallest eigenvalues. Jayadeva et al. (2007) proposed the Twin Support Vector Machine (TWSVM), which is similar in spirit to GEPSVM: it obtains non-parallel hyperplanes by solving two novel formulations of quadratic programming problems (QPPs). The idea is to solve two dual QPPs of smaller size rather than a single dual QPP with a large number of parameters. Experimental results of Jayadeva et al. (2007) show the effectiveness of TWSVM over GEPSVM and standard SVM on UCI datasets.

The generalization of TWSVM has been shown to be significantly better than that of standard SVM for both linear and nonlinear kernels. Computing time comparisons for the linear kernel were made by solving the QPPs arising in dual TWSVM and SVM using standard QP solvers; the results show that linear TWSVM is significantly faster than SVM. While QP solvers can be used to solve SVM, the usual practice is to solve SVM using much faster SMO-type decomposition methods. Hence the authors left speeding up TWSVM as a subject of future work. In this paper, we enhance TWSVM using the smoothing techniques proposed in (Lee and Mangasarian, 2001a), to obtain Smooth TWSVM (STWSVM). The smoothing technique of interest here is the smoothing of the plus function, (x)_+ = max{x, 0}, where x is a real number, by smoothing functions q(x, γ) with γ > 0 (Chen and Mangasarian, 1995, 1996). While there are many applications in mathematical programming where the smooth plus function q(x, γ) can be used, our concern is its use in the SVM optimization problem. The linear inequalities of the SVM primal QPP can be substituted into its objective function by making use of the plus function. This converts the SVM primal QPP into an unconstrained minimization problem (UMP). However, the objective function of the UMP is not twice differentiable, because of which fast Newton methods cannot be used to solve it. When the smooth plus function q(x, γ) is used instead of the plus function, the objective function of the UMP becomes twice differentiable and hence Newton methods can be applied. This way of solving the SVM optimization problem is termed Smooth SVM (SSVM) and has been solved with a Newton–Armijo algorithm in (Lee and Mangasarian, 2001a). It is to be noted that SSVM solves the primal QPP of SVM, whereas the conventional way of solving SVM is to solve its dual problem.
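To make the smoothing idea concrete, the following numpy sketch (an illustration added here, not part of the original paper) compares the plus function with the smooth approximation q(x, γ) = x + (1/γ) log(1 + e^{-γx}); the particular values of x and γ are arbitrary.

```python
import numpy as np

def plus(x):
    # (x)_+ = max{x, 0}, applied componentwise
    return np.maximum(x, 0.0)

def smooth_plus(x, gamma):
    # q(x, gamma) = x + (1/gamma) * log(1 + exp(-gamma * x));
    # np.logaddexp(0, -gamma * x) evaluates log(1 + exp(-gamma * x)) stably
    return x + np.logaddexp(0.0, -gamma * x) / gamma

x = np.linspace(-2.0, 2.0, 9)
for gamma in (1.0, 5.0, 25.0):
    gap = np.max(np.abs(smooth_plus(x, gamma) - plus(x)))
    print(f"gamma = {gamma:5.1f}   max |q(x, gamma) - (x)_+| = {gap:.4f}")
# The largest gap, log(2)/gamma, occurs at x = 0 and vanishes as gamma grows,
# so the smooth function is twice differentiable yet arbitrarily close to (x)_+.
```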


Following Lee and Mangasarian (2001a), in this paper we attempt to solve the two primal QPPs of TWSVM, rather than the dual QPPs solved in (Jayadeva et al., 2007), by converting them into two UMPs. Smoothing techniques are used to make the objective functions of the UMPs twice differentiable, and the UMPs are solved using the fast Newton method with Armijo stepsize proposed in (Lee and Mangasarian, 2001a). We also extend the smooth approach to nonlinear TWSVM. Computational comparisons of STWSVM against TWSVM, GEPSVM and SSVM, in terms of classification accuracy and computing time, are made on standard datasets for both linear and nonlinear kernels. The primary objective of this paper is to improve the computing time of TWSVM so that it can be applied to large-scale problems without compromising its generalization ability. While there are many decomposition/fast methods available for SVM in the literature, we have chosen the smoothing technique as an appropriate tool to improve TWSVM for the following reasons:

(i) While the computational complexities of the primal and dual problems are the same, solving primal QPPs is believed to be more advantageous than solving dual QPPs for very large-scale problems (Chapelle, 2007). This is because, when the number of data points is very large, it becomes intractable to compute an exact solution of SVM and one has to look for approximate solutions. An approximate dual solution need not be a good approximate primal solution, which is what we are ultimately interested in. Many decomposition methods, such as SMO and SVMlight, have been proposed for solving dual QPPs, but the smoothing technique enables us to solve primal QPPs and hence is more suitable for large-scale problems.

(ii) The dual QPPs of TWSVM are bound-constrained and involve inverse matrix calculations. This prevents a straightforward extension of the popular SMO algorithm to TWSVM. By solving the primal problems of TWSVM using the smoothing technique, the inverse matrix calculations can be avoided.

(iii) Smoothing techniques have already been successfully applied to SVM, and SSVM has been shown to be faster than decomposition methods such as SOR, SMO, and SVMlight (Lee and Mangasarian, 2001a).

(iv) Reduced SVM (RSVM) has been proposed for solving SSVM with a nonlinear kernel; it uses a rectangular kernel K(A, Ā'), where Ā is a randomly selected subset of the data that is typically 10% or less of the original dataset. With this rectangular kernel, RSVM has achieved better performance in terms of both computing time and generalization than using the complete square kernel (Lee and Mangasarian, 2001b). RSVM has been shown to be more suitable for solving very large-scale problems with a nonlinear kernel because of its reduced memory usage and fast computing time. In later sections, we show that reduced kernel techniques for STWSVM follow as a natural extension and enjoy a greater computational advantage than reduced TWSVM.


The paper is organized as follows: in Section 2, we briefly discuss the SVM problem formulation and its dual problem. Sections 3 and 4 give a short summary of GEPSVM and TWSVM, respectively. In Section 5, we extend TWSVM to STWSVM for linear and nonlinear kernels. Computational comparisons are reported in Section 6, and Section 7 gives concluding remarks.

In this paper, all vectors are column vectors unless transposed to a row vector by a prime ('). A column vector of ones in a real space of arbitrary dimension is denoted by e, and a vector of zeros in a real space of arbitrary dimension by 0. For a matrix A \in R^{l \times n}, A_i denotes the ith row of A, which is a row vector in R^n. For a vector x \in R^n, x_* denotes the vector in R^n with components (x_*)_i = 1 if x_i > 0 and 0 otherwise, i = 1, ..., n; in other words, x_* is the result of applying the step function componentwise to x. For a vector x \in R^n, the plus function x_+ is defined by (x_+)_i = max{0, x_i}, i = 1, ..., n.

2. Support vector machines (SVMs)

SVMs represent novel learning techniques that have been introduced in the framework of structural risk minimization (SRM) and the theory of VC bounds. Compared to state-of-the-art methods, SVMs have shown excellent performance in pattern recognition tasks. In the simplest binary pattern recognition task, SVM uses a linear separating hyperplane to create a classifier with maximal margin. Consider the problem of binary classification wherein a linearly inseparable dataset of l points in the real n-dimensional feature space is represented by a matrix X \in R^{l \times n}. The corresponding target or class of each point X_i, i = 1, 2, ..., l, is represented by a diagonal matrix D \in R^{l \times l} with entries D_{ii} equal to +1 or -1. Given the above problem, SVM's linear soft-margin formulation is the following primal QPP (Burges, 1998):

\min_{w,b,y} \ \tfrac{1}{2} w'w + C e'y \quad \text{such that} \quad D(Xw + eb) + y \ge e, \quad y \ge 0 \qquad (1)

where C is the penalty parameter and y is the vector of non-negative slack variables. The optimal separating hyperplane can be expressed as

d(x) = w'x + b \qquad (2)

where x \in R^n. Since the number of constraints in (1) is large, the dual of (1) is usually solved. The Wolfe dual of (1) is (Mangasarian, 1998):

\max_{\alpha} \ e'\alpha - \tfrac{1}{2} \alpha' D X X' D \alpha \quad \text{such that} \quad e'D\alpha = 0, \quad 0 \le \alpha \le Ce \qquad (3)

where \alpha \in R^l are the Lagrangian multipliers. The optimal separating hyperplane is the same as in (2), with parameters given by

w = X'D\alpha, \qquad b = \frac{1}{N_{sv}} \sum_{i=1}^{N_{sv}} \left( \frac{1}{D_{ii}} - X_i w \right) \qquad (4)

where N_{sv} is the number of support vectors, i.e. the points with 0 < \alpha_i < C. The hyperplane described by (2) lies midway between the bounding planes

w'x + b = 1 \quad \text{and} \quad w'x + b = -1 \qquad (5)

and separates the two classes from each other with a margin of 2/\|w\|. A point is classified as +1 or -1 according to whether the decision function

(w'x + b)_* \qquad (6)

yields 1 or 0, respectively. Fig. 1 shows the geometric interpretation of this formulation for a toy example. An important characteristic of SVM is that it can be extended in a relatively straightforward manner to create nonlinear decision boundaries (Vojislav, 2004).

Fig. 1. Geometric interpretation of standard SVM: the separating plane, the two bounding planes, and the margin between class +1 and class -1.
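As a small illustration of (2)-(4) (added here for clarity; it is not part of the original paper), the numpy sketch below recovers w and b from given dual multipliers and evaluates the decision function. The toy data and the multiplier values are hypothetical; in practice the vector alpha would come from a QP solver applied to (3).

```python
import numpy as np

def svm_hyperplane_from_dual(X, d, alpha, C):
    """Recover (w, b) of eq. (2) from dual multipliers per eq. (4).

    X: (l, n) data matrix, d: (l,) labels in {+1, -1},
    alpha: (l,) dual multipliers of eq. (3), C: penalty parameter.
    """
    D = np.diag(d)
    w = X.T @ D @ alpha                       # w = X' D alpha
    sv = (alpha > 1e-8) & (alpha < C - 1e-8)  # unbounded support vectors
    b = np.mean(1.0 / d[sv] - X[sv] @ w)      # average of 1/D_ii - X_i w
    return w, b

def classify(x, w, b):
    # decision function (6): step of w'x + b mapped to class +1 or -1
    return 1 if (w @ x + b) > 0 else -1

# hypothetical toy values; alpha would normally come from a QP solver
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 0.5], [3.0, 1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.4, 0.0, 0.4, 0.0])
w, b = svm_hyperplane_from_dual(X, d, alpha, C=1.0)
print(w, b, classify(np.array([0.0, 3.0]), w, b))
```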



Let matrix A \in R^{m_1 \times n} represent the data points of class +1 and matrix B \in R^{m_2 \times n} represent the data points of class -1. Given the above binary classification problem, GEPSVM seeks two non-parallel hyperplanes in R^n,

x'w^{(1)} + b^{(1)} = 0 \quad \text{and} \quad x'w^{(2)} + b^{(2)} = 0 \qquad (7)

such that each hyperplane is closest to the data points of one class and farthest from the data points of the other class. A new data point is assigned to class +1 or -1 depending upon its proximity to the two non-parallel hyperplanes. Geometrically, the concept of GEPSVM is depicted in Fig. 2 for a toy example. These hyperplanes are obtained by solving the optimization problems (8) and (9):

\min_{z_1 \ne 0} \ \frac{z_1' G z_1}{z_1' H z_1} \qquad (8)

\min_{z_2 \ne 0} \ \frac{z_2' L z_2}{z_2' M z_2} \qquad (9)

where G, H, L and M are symmetric matrices in R^{(n+1) \times (n+1)}, defined as G = [A \ e]'[A \ e] + \delta I for some \delta > 0, H = [B \ e]'[B \ e], L = [B \ e]'[B \ e] + \delta I for some \delta > 0, and M = [A \ e]'[A \ e], and z_1 and z_2 are vectors in R^{(n+1)} given by z_1 = [w^{(1)} \ b^{(1)}]' and z_2 = [w^{(2)} \ b^{(2)}]'. Using the Rayleigh quotient, the global minima of the optimization problems (8) and (9) are achieved at the eigenvectors of the generalized eigenvalue problems (10) and (11) corresponding to the minimum eigenvalues \mu_{min} and \lambda_{min}, respectively:

G z_1 = \mu H z_1, \quad z_1 \ne 0 \qquad (10)

L z_2 = \lambda M z_2, \quad z_2 \ne 0 \qquad (11)

It is to be noted that the above results are applicable under the assumption that the columns of the matrices [A \ e] and [B \ e] are linearly independent, which is not a restriction for many classification problems. In the same spirit, GEPSVM can be extended to nonlinear cases by considering the following kernel-generated surfaces:

K(x', C')u^{(1)} + b^{(1)} = 0 \quad \text{and} \quad K(x', C')u^{(2)} + b^{(2)} = 0 \qquad (12)

where C = [A; B], i.e. the data matrices A and B stacked vertically, and K is any arbitrary kernel.

Fig. 2. Geometric interpretation of GEPSVM: the two proximal planes, one clustered around each class.

4. Twin support vector machine (TWSVM)

Twin support vector machines, proposed in (Jayadeva et al., 2007), do binary classification by obtaining two non-parallel hyperplanes similar to GEPSVM, but by solving the two novel QPP formulations (13) and (14) as opposed to the two eigenvalue problems solved by GEPSVM. The idea is to solve two QPPs, each with an objective function corresponding to one class and constraints corresponding to the other class:

\min \ \tfrac{1}{2}(Aw^{(1)} + eb^{(1)})'(Aw^{(1)} + eb^{(1)}) + C_1 e'y \quad \text{s.t.} \quad -(Bw^{(1)} + eb^{(1)}) + y \ge e, \ y \ge 0 \qquad (13)

\min \ \tfrac{1}{2}(Bw^{(2)} + eb^{(2)})'(Bw^{(2)} + eb^{(2)}) + C_2 e'y \quad \text{s.t.} \quad (Aw^{(2)} + eb^{(2)}) + y \ge e, \ y \ge 0 \qquad (14)

The Wolfe duals of QPPs (13) and (14) can be shown to be QPPs (15) and (16) in terms of the Lagrangian multipliers \alpha \in R^{m_2} and \beta \in R^{m_1}, respectively:

\max_{\alpha} \ e'\alpha - \tfrac{1}{2}\alpha' G (H'H)^{-1} G' \alpha \quad \text{s.t.} \quad 0 \le \alpha \le C_1 e, \quad \text{where } G = [B \ e] \text{ and } H = [A \ e] \qquad (15)

\max_{\beta} \ e'\beta - \tfrac{1}{2}\beta' P (Q'Q)^{-1} P' \beta \quad \text{s.t.} \quad 0 \le \beta \le C_2 e, \quad \text{where } P = [A \ e] \text{ and } Q = [B \ e] \qquad (16)

The non-parallel hyperplanes (7) can be obtained from the solutions of QPPs (15) and (16), as given in (17) and (18), respectively:

v_1 = -(H'H)^{-1} G'\alpha, \quad \text{where } v_1 = [w^{(1)} \ b^{(1)}]' \qquad (17)

v_2 = (Q'Q)^{-1} P'\beta, \quad \text{where } v_2 = [w^{(2)} \ b^{(2)}]' \qquad (18)

Solving two dual QPPs has the advantage of bound constraints and a reduced number of parameters: QPP (15) has only m_2 parameters and QPP (16) has only m_1 parameters, compared with QPP (3), which has l = m_1 + m_2 parameters. However, it is to be noted that solving the dual QPPs (15) and (16) also requires inverting a matrix of size (n + 1) \times (n + 1) twice, where n << l. TWSVM can also be extended to nonlinear cases similar to GEPSVM, by considering the two kernel-generated surfaces shown in (12). The primal nonlinear TWSVMs corresponding to the hypersurfaces K(x', C')u^{(1)} + b^{(1)} = 0 and K(x', C')u^{(2)} + b^{(2)} = 0 are given in (19) and (20):

\min \ \tfrac{1}{2}\|K(A, C')u^{(1)} + eb^{(1)}\|^2 + C_1 e'y \quad \text{s.t.} \quad -(K(B, C')u^{(1)} + eb^{(1)}) + y \ge e, \ y \ge 0 \qquad (19)

\min \ \tfrac{1}{2}\|K(B, C')u^{(2)} + eb^{(2)}\|^2 + C_2 e'y \quad \text{s.t.} \quad (K(A, C')u^{(2)} + eb^{(2)}) + y \ge e, \ y \ge 0 \qquad (20)


The dual problems of (19) and (20) are derived and solved in (Jayadeva et al., 2007) to obtain the kernel-generated surfaces (12). The experimental results of Jayadeva et al. (2007) show that the performance of TWSVM is better than that of GEPSVM and conventional SVM on UCI machine learning datasets.
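To illustrate how the linear TWSVM hyperplanes can be recovered in practice, the following sketch (our own illustration under stated assumptions, not the authors' implementation) solves the bound-constrained dual (15) with SciPy's L-BFGS-B and applies (17). The ridge term reg and the toy data are assumptions added for the illustration; the second hyperplane follows symmetrically from (16) and (18).

```python
import numpy as np
from scipy.optimize import minimize

def twsvm_plane_for_class1(A, B, C1, reg=1e-8):
    """Hyperplane (w1, b1) of class +1 via the TWSVM dual (15) and eq. (17).

    A: (m1, n) points of class +1, B: (m2, n) points of class -1.
    A small ridge term `reg` keeps H'H invertible (an implementation choice,
    not part of the original formulation).
    """
    m1, n = A.shape
    m2 = B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])          # H = [A e]
    G = np.hstack([B, np.ones((m2, 1))])          # G = [B e]
    Minv = np.linalg.inv(H.T @ H + reg * np.eye(n + 1))
    S = G @ Minv @ G.T                            # G (H'H)^{-1} G'

    def neg_dual(alpha):                          # negate (15) for a minimizer
        return -alpha.sum() + 0.5 * alpha @ S @ alpha

    def grad(alpha):
        return -np.ones(m2) + S @ alpha

    res = minimize(neg_dual, x0=np.zeros(m2), jac=grad,
                   bounds=[(0.0, C1)] * m2, method="L-BFGS-B")
    v1 = -Minv @ G.T @ res.x                      # eq. (17): v1 = -(H'H)^{-1} G' alpha
    return v1[:n], v1[n]                          # w1, b1

# toy data (hypothetical): class +1 clustered near one line
A = np.array([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9]])
B = np.array([[1.0, -1.0], [2.0, -2.2], [3.0, -2.9]])
w1, b1 = twsvm_plane_for_class1(A, B, C1=1.0)
print("distance of a class +1 point to its plane:",
      abs(A[0] @ w1 + b1) / np.linalg.norm(w1))
```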

5. Smooth twin support vector machine (STWSVM)

In this section, we solve the primal QPPs of TWSVM rather than the dual QPPs, using the smoothing technique proposed in (Lee and Mangasarian, 2001a). We first modify the primal problem (13) of TWSVM as (21), which uses the square of the 2-norm of the slack variables y with weight C_1/2 instead of the 1-norm of y with weight C_1:

\min \ \tfrac{1}{2}(Aw^{(1)} + eb^{(1)})'(Aw^{(1)} + eb^{(1)}) + \tfrac{C_1}{2} y'y \quad \text{s.t.} \quad -(Bw^{(1)} + eb^{(1)}) + y \ge e, \ y \ge 0 \qquad (21)

Following Lee and Mangasarian (2001a), we replace y in (21) by (e + Bw^{(1)} + eb^{(1)})_+ and convert QPP (21) into the equivalent unconstrained minimization problem (UMP)

\min \ \tfrac{C_1}{2}\|(e + Bw^{(1)} + eb^{(1)})_+\|^2 + \tfrac{1}{2}\|Aw^{(1)} + eb^{(1)}\|^2 \qquad (22)

Since the objective function of this UMP is not twice differentiable, a smoothing technique is used to make it so. The smooth reformulation of (22) is

\min \ \tfrac{C_1}{2}\|q(e + Bw^{(1)} + eb^{(1)}, \gamma)\|^2 + \tfrac{1}{2}\|Aw^{(1)} + eb^{(1)}\|^2, \quad \text{where } q(x, \gamma) = x + \tfrac{1}{\gamma}\log(1 + e^{-\gamma x}), \ \gamma > 0 \qquad (23)

The function q(x, \gamma) is a smooth approximation of the plus function (x)_+, and \gamma is the smoothing parameter. This particular form of smooth plus function is known as the neural networks smooth plus function, as it is obtained by integrating the well-known sigmoid function 1/(1 + e^{-\gamma x}) used in neural networks. In general, smoothing functions are obtained by twice integrating probability density functions: the plus function can be obtained by twice integrating the Dirac delta function, and probability density functions are smooth approximations of the Dirac delta function. While there are many other smooth plus functions in the literature (Kanzow, 1994; Zang, 1980; Pinar and Zenios, 1994), we use the neural networks smooth plus function because of its successful results in (Lee and Mangasarian, 2001a,b; Chen and Mangasarian, 1995, 1996). By following the same steps as discussed above, the modified primal QPP (24) of TWSVM can also be reformulated as the UMP (25):

\min \ \tfrac{1}{2}(Bw^{(2)} + eb^{(2)})'(Bw^{(2)} + eb^{(2)}) + \tfrac{C_2}{2} y'y \quad \text{s.t.} \quad (Aw^{(2)} + eb^{(2)}) + y \ge e, \ y \ge 0 \qquad (24)

\min \ \tfrac{C_2}{2}\|q(e - Aw^{(2)} - eb^{(2)}, \gamma)\|^2 + \tfrac{1}{2}\|Bw^{(2)} + eb^{(2)}\|^2 \qquad (25)

It can be shown that the solutions of QPPs (21) and (24) can be obtained by solving their smooth reformulations (23) and (25) as \gamma approaches infinity.

Theorem 5.1. Let P \in R^{m_1 \times n}, Q \in R^{m_2 \times n} and b \in R^{m_1 \times 1} define the real-valued functions f(x) and g(x, \gamma) in the n-dimensional real space R^n:

f(x) = \tfrac{1}{2}\|(Px + b)_+\|_2^2 + \tfrac{1}{2}\|Qx\|_2^2 \qquad (26)

g(x, \gamma) = \tfrac{1}{2}\|q(Px + b, \gamma)\|_2^2 + \tfrac{1}{2}\|Qx\|_2^2 \qquad (27)

with \gamma > 0. Then:

(i) There exists a unique solution \bar{x} of \min_{x \in R^n} f(x) and a unique solution \bar{x}_\gamma of \min_{x \in R^n} g(x, \gamma).

(ii) For all \gamma > 0, we have the inequality

\|\bar{x}_\gamma - \bar{x}\|_2^2 \le \frac{m_1}{2}\left(\left(\frac{\log 2}{\gamma}\right)^2 + 2\xi\,\frac{\log 2}{\gamma}\right) \qquad (28)

where \xi is defined as follows:

\xi = \max_{1 \le i \le m_1} |(P\bar{x} + b)_i| \qquad (29)

The above theorem is an extension of Theorem 2.2 given in (Lee and Mangasarian, 2001a); the proof follows on similar lines. In the theorem, the function f(x) represents the objective function (22) and the function g(x, \gamma) represents the smooth objective function (23). Thus the theorem shows that the solutions of the smooth reformulations (23) and (25) approach the solutions of QPPs (21) and (24) as \gamma approaches infinity, with an upper bound given by (28). The smooth objective functions (23) and (25) can be solved using the fast Newton–Armijo algorithm (Lee and Mangasarian, 2001a) by taking advantage of their twice differentiability; the algorithm has been shown to be quadratically convergent with an Armijo stepsize in (Lee and Mangasarian, 2001a). The solutions of UMPs (23) and (25) define the two non-parallel hyperplanes given in (7), and the classification of a new data point is done based on its proximity to these two hyperplanes.

We also extend this smoothing technique to solve nonlinear TWSVM. The smooth reformulations corresponding to the two primal nonlinear TWSVMs (19) and (20) are given by

\min \ \tfrac{C_1}{2}\|q(e + K(B, C')u^{(1)} + eb^{(1)}, \gamma)\|^2 + \tfrac{1}{2}\|K(A, C')u^{(1)} + eb^{(1)}\|^2 \qquad (30)

\min \ \tfrac{C_2}{2}\|q(e - K(A, C')u^{(2)} - eb^{(2)}, \gamma)\|^2 + \tfrac{1}{2}\|K(B, C')u^{(2)} + eb^{(2)}\|^2 \qquad (31)
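As an illustration of the Newton–Armijo approach described above, the sketch below minimizes the linear smooth objective (23) with an explicit gradient and Hessian. It is a minimal numpy sketch under our own assumptions: the fixed value of γ, the tolerance, the Hessian damping term and the Armijo constants are illustrative choices, and the derivative formulas follow from q'(x, γ) = 1/(1 + e^{-γx}).

```python
import numpy as np
from scipy.special import expit   # stable sigmoid 1/(1 + exp(-t))

def q(x, gamma):
    # neural-network smooth plus function used in eq. (23)
    return x + np.logaddexp(0.0, -gamma * x) / gamma

def stwsvm_plane_for_class1(A, B, C1=1.0, gamma=10.0, tol=1e-6, max_iter=50):
    """Minimize the smooth objective (23) by Newton steps with Armijo backtracking.

    Returns (w1, b1) for the class +1 hyperplane. gamma, tol, the damping term
    and the Armijo constants are illustrative choices.
    """
    m1, n = A.shape
    H = np.hstack([A, np.ones((m1, 1))])               # [A e]
    G = np.hstack([B, np.ones((B.shape[0], 1))])       # [B e]
    HtH = H.T @ H
    z = np.zeros(n + 1)                                # z = [w1; b1]

    def objective(z):
        r = 1.0 + G @ z
        return 0.5 * C1 * np.sum(q(r, gamma) ** 2) + 0.5 * np.sum((H @ z) ** 2)

    for _ in range(max_iter):
        r = 1.0 + G @ z
        sig = expit(gamma * r)                         # q'(r, gamma)
        qr = q(r, gamma)
        grad = C1 * (G.T @ (qr * sig)) + HtH @ z
        if np.linalg.norm(grad) < tol:
            break
        d = sig * sig + qr * gamma * sig * (1.0 - sig) # q'^2 + q * q''
        hess = C1 * (G.T * d) @ G + HtH + 1e-10 * np.eye(n + 1)
        step = np.linalg.solve(hess, -grad)            # Newton direction
        t, f0 = 1.0, objective(z)
        while objective(z + t * step) > f0 + 1e-4 * t * (grad @ step):
            t *= 0.5                                   # Armijo backtracking
        z = z + t * step
    return z[:n], z[n]

# toy data (hypothetical)
A = np.array([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9]])
B = np.array([[1.0, -1.0], [2.0, -2.2], [3.0, -2.9]])
w1, b1 = stwsvm_plane_for_class1(A, B)
print(w1, b1)
```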

It can immediately be noted that Theorem 5.1 still holds for the smooth reformulations (30) and (31). Also, the Newton–Armijo algorithm can be applied to solve these UMPs, as the objective functions are still convex.

6. Experimental results and discussion

Before presenting experimental results on standard datasets, we first consider a simple two-dimensional "Cross Planes" example, which shows the effectiveness of multi-plane/surface classifiers such as GEPSVM, TWSVM and STWSVM over SVM and SSVM. The "Cross Planes" dataset used in (Mangasarian and Wild, 2006) is generated by perturbing points lying on two intersecting lines. Fig. 3 shows the dataset and the linear classification results obtained by GEPSVM, TWSVM, STWSVM, SVM and SSVM. While all the multi-plane classifiers obtained 100% accuracy, SVM/SSVM obtained only 84% accuracy. To demonstrate the performance of STWSVM, we conducted experiments on 11 datasets from the UCI Repository (Blake and Merz, 1998) and a synthetic dataset; the synthetic dataset is an extension of the two-dimensional "Cross Planes" data to R^7. We also ran SSVM, GEPSVM and TWSVM on the same datasets for comparison with STWSVM. All algorithms were implemented in the MATLAB 7.3.0 (R2006b) environment (MATLAB, 2007) on a PC with an Intel Core2 Duo processor (2.13 GHz) and 1 GB of RAM.



Fig. 3. Classification results of GEPSVM/TWSVM/STWSVM (left) and SVM/SSVM (right) for the "Cross Planes" dataset.

The dual QPPs arising in TWSVM were solved using the mosek optimization toolbox for MATLAB (Mosek, 2007), which implements fast interior-point algorithms. The classification accuracy of each algorithm was measured by the standard tenfold cross-validation methodology. Table 1 compares the classification accuracy of STWSVM with SSVM, GEPSVM and TWSVM for the linear kernel on 9 UCI datasets and the "Cross Planes" dataset. For each dataset, we estimated the generalized accuracy using the best penalty parameter (C for SVM, δ for GEPSVM), obtained through a search in the range 2^{-7} to 2^{12}. For TWSVM and STWSVM, a two-dimensional grid search on the penalty parameters C_1 and C_2 was carried out over the same range. Following (Hsu and Lin, 2001), the results corresponding to the maximum testing accuracy are reported here. Table 2 compares the classification accuracy of the nonlinear extensions of STWSVM, TWSVM, GEPSVM and SSVM on 4 UCI datasets and the "Cross Planes" dataset. We used the Gaussian kernel K(x_i, x_j) = e^{-\mu \|x_i - x_j\|^2} for all experiments. The kernel parameter μ, along with the penalty parameters, was tuned for best classification accuracy; μ was obtained through a search over the range 2^{-20} to 2^{4}. While the effectiveness of TWSVM over GEPSVM and SVM was already shown in (Jayadeva et al., 2007), Tables 1 and 2 reveal that the accuracies of STWSVM and TWSVM are significantly better than that of SSVM as well. It can also be observed that the accuracy of STWSVM is almost the same as that of TWSVM in spite of the slightly modified objective function; in fact, the accuracy of STWSVM is slightly better than that of TWSVM on some datasets. We also conducted experiments on large datasets generated by David Musicant's NDC Data Generator (Musicant, 1998) to get a clear picture of how the computing times of all these algorithms scale with the number of data points. NDC datasets have been used earlier in (Mangasarian and Wild, 2006; Fung and Mangasarian, 2001; Mangasarian and Musicant, 2001). Table 3 gives a description of the NDC datasets.

Table 2. Classification accuracy for Gaussian kernel (italic type indicates the best result)

Dataset (l × n)           STWSVM          TWSVM           GEPSVM          SSVM
Hepatitis (155 × 19)      84.13 ± 6.02    83.73 ± 6.25    79.28 ± 5.2     81.64 ± 7.10
WPBC (198 × 34)           82.22 ± 4.38    82.22 ± 6.82    80 ± 5.97       81.11 ± 2.86
Cross Planes (300 × 7)    98.91 ± 3.49    98.6 ± 2.6      99.02 ± 4.16    84.6 ± 3.1
Bupa liver (345 × 7)      74.94 ± 6.85    75.15 ± 6.57    68.18 ± 6.2     74.64 ± 6.2
Votes (435 × 16)          95.71 ± 2.7     95.23 ± 1.5     94.5 ± 3.37     95.95 ± 3.62

For the experiments with NDC datasets, we fixed the penalty parameters of all algorithms to one (i.e., C = 1, C_1 = 1 and C_2 = 1), and we used a Gaussian kernel with μ = 2^{-17} for all experiments with the nonlinear kernel. Table 4 compares computing time and accuracy for all four algorithms with the linear kernel. It is evident that GEPSVM is the fastest of the four algorithms with the linear kernel. STWSVM is the next fastest, and it performs several orders of magnitude faster than TWSVM. Even though the difference is not as large as in the previous case, STWSVM is still faster than SSVM. Table 5 compares computing time and accuracy for all four algorithms with the Gaussian kernel. While the computing times of STWSVM, TWSVM and SSVM are comparable, all three algorithms are faster than GEPSVM; this is because of the need to solve an eigenvalue problem of dimension (l + 1) × (l + 1) twice in GEPSVM. While STWSVM still takes the least computing time, the computational advantage gained is not as high as in the case of the linear kernel. It is also to be noted that STWSVM and SSVM do not use any external optimizer, whereas TWSVM used mosek (Mosek, 2007). We further investigated the use of the reduced kernel technique for SSVM, proposed in (Lee and Mangasarian, 2001b), to decrease the computing time of nonlinear STWSVM.

Table 1. Classification accuracy for linear kernel (italic type indicates the best result)

Dataset (l × n)             STWSVM          TWSVM           GEPSVM          SSVM
Heart-statlog (270 × 14)    86.55 ± 6.06    86.66 ± 6.8     85.55 ± 6.1     84.61 ± 4.43
Cross Planes (300 × 7)      98.2 ± 3.67     98.02 ± 3.92    98.2 ± 5.1      62.14 ± 4.31
Heart-c (303 × 14)          85.86 ± 5.25    85.86 ± 6.9     85.51 ± 5.08    84.80 ± 6.71
Hepatitis (155 × 19)        86.42 ± 5.27    85.71 ± 6.73    85 ± 9.19       84.2 ± 7.86
Ionosphere (351 × 34)       88.23 ± 5.18    88.23 ± 3.10    84.11 ± 3.2     88.23 ± 5.0
Sonar (208 × 60)            81.05 ± 9.01    80.52 ± 4.9     79.47 ± 7.6     78.94 ± 5.54
Votes (435 × 16)            96.19 ± 2.30    95.9 ± 2.2      95.0 ± 2.36     95.0 ± 3.96
Pima-Indian (768 × 8)       78.0 ± 6.23     78.0 ± 6.29     76.66 ± 4.62    77.73 ± 5.69
Australian (690 × 14)       87.05 ± 5.13    86.91 ± 3.5     80.00 ± 3.99    86.35 ± 4.17
CMC (1473 × 9)              68.84 ± 4.38    68.84 ± 2.39    68.76 ± 2.98    68.84 ± 3.72

Table 3. Description of NDC datasets

Dataset    # Training data    # Testing data    # Features
NDC-500    500                50                32
NDC-700    700                70                32
NDC-900    900                90                32
NDC-1k     1000               100               32
NDC-2k     2000               200               32
NDC-3k     3000               300               32
NDC-4k     4000               400               32
NDC-5k     5000               500               32
NDC-10k    10,000             1000              32
NDC-50k    50,000             5000              32
NDC-1l     100,000            10,000            32
NDC-3l     300,000            30,000            32
NDC-5l     500,000            50,000            32

Table 4. Comparison for linear kernel (italic type indicates the best result); each cell lists Train % / Test % / Time (s)

Dataset    STWSVM                    TWSVM                     GEPSVM                    SSVM
NDC-3k     79.13 / 77.0 / 0.0548     78.73 / 74.66 / 27.12     77.6 / 77.33 / 0.017      79.06 / 76.33 / 0.0607
NDC-4k     79.7 / 73.75 / 0.0663     79.67 / 73.75 / 60.86     76.9 / 71.0 / 0.021       79.65 / 74.0 / 0.0718
NDC-5k     78.9 / 80.2 / 0.0761      78.8 / 79.8 / 116.50      75.0 / 74.4 / 0.0224      79.12 / 80.0 / 0.0848
NDC-10k    86.48 / 87.9 / 0.1267     86.36 / 87.3 / 1094.19    84.76 / 83.9 / 0.0353     86.38 / 87.8 / 0.1399
NDC-1l     86.2 / 86.18 / 1.014      (a)                       83.90 / 84.32 / 0.2636    86.22 / 86.11 / 1.1014
NDC-3l     78.73 / 78.59 / 3.103     (a)                       75.66 / 75.34 / 0.7792    78.73 / 78.55 / 3.3715
NDC-5l     78.70 / 78.59 / 5.3509    (a)                       75.64 / 75.62 / 1.3242    78.68 / 78.61 / 6.1456

(a) We stopped these experiments as the computing time was very high.

Table 5. Comparison for Gaussian kernel (italic type indicates the best result); each cell lists Train % / Test % / Time (s)

Dataset    STWSVM                   TWSVM                    GEPSVM                    SSVM
NDC-500    100.0 / 82.0 / 0.5695    99.0 / 80.0 / 0.804      82.0 / 76.0 / 8.415       95.2 / 82.0 / 0.6140
NDC-700    100.0 / 82.85 / 1.3124   99.42 / 84.28 / 1.7326   80.71 / 71.42 / 25.3417   96.42 / 85.71 / 1.4366
NDC-900    100.0 / 87.77 / 2.5334   98.88 / 80.0 / 3.4689    83.33 / 66.66 / 52.5041   95.22 / 80 / 3.0246
NDC-1k     100.0 / 87.0 / 3.5229    98.1 / 83.0 / 4.1185     79.1 / 69.0 / 71.9482     95.5 / 82.0 / 3.8906
NDC-2k     100 / 89.5 / 21.58       99.35 / 87.5 / 25.8972   (a)                       95.8 / 82.5 / 23.0
NDC-3k     100 / 92.0 / 66.7124     99.53 / 90.41 / 85.474   (a)                       96.7 / 89.33 / 68.8953

(a) We stopped these experiments as the computing time was very high.

Table 6. Comparison for reduced Gaussian kernel (italic type indicates the best result); each cell lists Train % / Test % / Time (s)

Dataset (reduction rate)    STWSVM                    TWSVM
NDC-4k (10%)                92.42 / 85.75 / 5.5424    92.75 / 84.25 / 33.939
NDC-5k (5%)                 88.24 / 87.6 / 2.46       88.72 / 87.8 / 53.8495
NDC-10k (2%)                95.25 / 95.1 / 4.38       (a)
NDC-50k (1%)                92.1 / 91.48 / 56.8       (a)

(a) The experiments ran out of memory.

Table 6 compares STWSVM and TWSVM with reduced kernels on 4 NDC datasets. STWSVM performs at least six times faster than TWSVM at a reduction rate of 10%, and is more than 21 times faster at a 5% reduction rate. For the NDC-10k and NDC-50k datasets we do not report experimental results for TWSVM, as its optimizer ran out of memory. Thus the STWSVM formulation has a greater computational advantage than TWSVM, especially when reduced kernels are used: even when a reduced kernel of dimension l \times \bar{l} is used, TWSVM still requires solving two dual QPPs of the same sizes m_1 and m_2 as before. The reduced kernel technique is applicable to the other algorithms as well; however, we have not made such comparisons here.
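For concreteness, the following numpy sketch (our own illustration, not code from the paper) forms the rectangular reduced kernel K(C, C̄') from a randomly selected subset C̄ of the data; the dataset size, the 10% reduction rate and μ = 2^{-17} mirror the setting above but are otherwise arbitrary.

```python
import numpy as np

def gaussian_kernel(X, Y, mu):
    # K(x_i, y_j) = exp(-mu * ||x_i - y_j||^2), the kernel used in Section 6
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-mu * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
C = rng.standard_normal((5000, 32))      # stand-in for the stacked data C = [A; B]
rate = 0.10                              # 10% reduction rate (illustrative)
idx = rng.choice(len(C), size=int(rate * len(C)), replace=False)
C_bar = C[idx]                           # randomly selected reduced set

K_reduced = gaussian_kernel(C, C_bar, mu=2.0**-17)   # rectangular l x l_bar kernel
print(K_reduced.shape)                   # (5000, 500) instead of a full 5000 x 5000 matrix
```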

7. Concluding remarks

In this paper, we have enhanced TWSVM to STWSVM using smoothing techniques. In STWSVM, we solve two primal problems using the fast Newton–Armijo algorithm, instead of the two dual problems solved in TWSVM. STWSVM has been extended to nonlinear separating surfaces by using nonlinear kernel techniques. The experimental results show that the generalization of STWSVM is comparable to that of TWSVM and better than that of the other methods for both the linear and nonlinear cases. STWSVM is significantly faster than TWSVM, especially with the linear kernel and with reduced kernel methods, and it obviates the need for any external optimizer with both linear and nonlinear kernels. Analysis of the statistical properties of STWSVM, extensions to multi-category classification, and kernel optimization are promising areas of future research. Given that linear STWSVM achieves improved performance over SSVM, and that linear SVM gives better performance than nonlinear SVM in text categorization, the application of linear STWSVM to text categorization will be the subject of our future work.

References

Blake, C.I., Merz, C.J., 1998. UCI Repository for Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Sciences.
Burges, C.J., 1998. A tutorial on support vector machines for pattern recognition. Data Mining Knowledge Discov. 2 (2), 955–974.
Chen, C., Mangasarian, O.L., 1995. Smoothing methods for convex inequalities and linear complementarity problems. Math. Program. 71 (1), 51–69.
Chen, C., Mangasarian, O.L., 1996. A class of smoothing functions for nonlinear and mixed complementarity problems. Comput. Optim. Appl. 5 (2), 97–138.
Chapelle, O., 2007. Training a support vector machine in the primal. Neural Comput. 19 (5), 1155–1178.
Fung, G., Mangasarian, O.L., 2001. Proximal support vector machine classifiers. In: Provost, F., Srikant, R. (Eds.), Proc. KDD-2001, August 26–29, 2001, San Francisco, CA. New York, pp. 77–86.
Hsu, C.W., Lin, C.J., 2001. A comparison of methods for multi-class support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
Jayadeva, Khemchandani, R., Chandra, S., 2007. Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Machine Intell. 29 (5), 905–910.
Kanzow, C., 1994. Some tools allowing interior-point methods to become noninterior. Technical report, Institute of Applied Mathematics, University of Hamburg, Germany.
Lee, Y.J., Mangasarian, O.L., 2001a. SSVM: A smooth support vector machine for classification. Comput. Optim. Appl. 20 (1), 5–22.
Lee, Y.J., Mangasarian, O.L., 2001b. RSVM: Reduced support vector machines. In: Internat. Conf. on Data Mining, Chicago.
Mangasarian, O.L., 1998. Nonlinear Programming. SIAM.
Mangasarian, O.L., Musicant, D.R., 2001. Lagrangian support vector machines. J. Machine Learn. Res. 1, 161–177.


Mangasarian, O.L., Wild, E.W., 2006. Multisurface proximal support vector classification via generalized eigenvalues. IEEE Trans. Pattern Anal. Machine Intell. 28 (1), 69–74.
MATLAB, 2007.
Mosek, 2007.
Musicant, D.R., 1998. NDC: Normally distributed clustered datasets.
Pinar, M.C., Zenios, S.A., 1994. On smoothing exact penalty functions for convex constrained optimization. SIAM J. Optim. 4, 486–511.
Vojislav, Kecman, 2004. Learning and Soft Computing: Support Vector Machines, Neural Networks and Fuzzy Logic Models. Pearson Education, Singapore.
Zang, I., 1980. A smoothing-out technique for min–max optimization. Math. Program. 19, 61–77.