Expert Systems with Applications 38 (2011) 9425–9433
Distance difference and linear programming nonparallel plane classifier

Qiaolin Ye a,*, Chunxia Zhao a, Haofeng Zhang a, Ning Ye b

a School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China
b School of Information Technology, Nanjing Forestry University, Nanjing, China

* Corresponding author. E-mail address: [email protected] (Q. Ye).

Keywords: GEPSVM; LSTSVM; TWSVM; Standard eigenvalues; Feature selection; Input features; Kernel functions

Abstract

We first propose Distance Difference GEPSVM (DGEPSVM), a binary classifier that obtains two nonparallel planes by solving two standard eigenvalue problems. Unlike GEPSVM, this algorithm does not need to deal with the singularity that can occur in GEPSVM, yet it achieves better classification correctness. The formulation is capable of handling XOR problems with different distributions because it keeps the genuine geometrical interpretation of the primal GEPSVM. Moreover, the proposed algorithm gives classification correctness comparable to that of LSTSVM and TWSVM, but with fewer unknown parameters. We then incorporate regularization techniques into TWSVM. With the help of the regularized formulation, a linear programming formulation of TWSVM, called FETSVM, is proposed to improve the sparsity of TWSVM and thereby suppress input features. In the linear case this means FETSVM is capable of reducing the number of input features; in the nonlinear case it means that few kernel functions determine the classifier. Finally, the algorithms are compared on artificial and public datasets. To further illustrate the effectiveness of the proposed algorithms, we also apply them to USPS handwritten digits.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Eigenvalue-based techniques are attractive for the classification of very large sparse datasets (Guarracino, Cifarelli, Seref, & Pardalos, 2007), a prominent example being the generalized proximal SVM (GEPSVM for short) (Mangasarian & Wild, 2006). GEPSVM obtains each of its nonparallel planes from the eigenvector corresponding to the smallest eigenvalue of a generalized eigenvalue problem, such that each plane is as close as possible to the samples of its own class and, at the same time, as far as possible from the samples of the other class (Mangasarian & Wild, 2006). The advantages of two-class GEPSVM lie in its lower computational complexity and its better classification performance on XOR-type problems compared with the standard SVM, which finds a single plane separating the two classes. In Mangasarian and Wild (2006), Mangasarian et al. presented a simple ''cross planes'' example, a generalization of the XOR example, which demonstrated the effectiveness of GEPSVM over PSVM and SVM; Fig. 1 of that paper shows GEPSVM attaining a classification correctness of 100% in the XOR case. Recently, a number of GEPSVM-based algorithms have been proposed. To improve the generalization of GEPSVM, Jayadeva et al. proposed Fuzzy GEPSVM (FGEPSVM) together with a multi-category formulation. In 2007, Guarracino et al. (2007) introduced a new regularization technique for GEPSVM that reduces its time complexity, but at the price of two unknown parameters in the linear case. These

algorithms obtain two planes by solving generalized eigenvalue problems, as GEPSVM does. However, if the symmetric matrices occurring in these algorithms, such as H and M in formulations (5) and (6), are both only positive semidefinite, the generalized eigenvalue problem becomes ill-defined. Moreover, these algorithms weaken the genuine geometrical interpretation of the nonparallel plane classifier because of the regularization term they adopt to improve generalization. Recently, a twin SVM algorithm (TWSVM for short), proposed by Jayadeva et al., was published in TPAMI (Jayadeva & Chandra, 2007). This algorithm, which is in the spirit of GEPSVM, obtains two planes by solving two quadratic programming problems (QPPs), each smaller than that of the standard SVM. Experimental results show the effectiveness of TWSVM over SVM and GEPSVM (Arun Kumar & Gopal, 2009; Jayadeva & Chandra, 2007). TWSVM takes O(m^3/4) operations, one quarter of the standard SVM, whereas GEPSVM takes O(n^3/4), where m is the number of training samples, n is the dimensionality and m >> n (Arun Kumar & Gopal, 2009; Jayadeva & Chandra, 2007). Obviously, GEPSVM is by far faster than TWSVM. To reduce the time complexity while keeping the effectiveness of the twin SVM classifier, a least squares version (LSTSVM for short) was proposed in 2009 (Arun Kumar & Gopal, 2009; Ghorai, Mukherjee, & Dutta, 2009). In fact, LSTSVM determines two nonparallel planes by solving two PSVM-type (Fung & Mangasarian, 2001) problems. Compared with TWSVM, LSTSVM has a lower computational cost because it only solves two systems of linear equations instead of two QPPs. TWSVM and LSTSVM, however, also lose the genuine geometrical interpretation of the nonparallel plane classifier. GEPSVM is


proposed to solve complex examples, such as the XOR example, that are difficult for typical linear classifiers (Mangasarian & Wild, 2006). Each of the planes obtained by GEPSVM is as close as possible to the samples of its own class and, at the same time, as far as possible from the samples of the other class (Mangasarian & Wild, 2006). In contrast, TWSVM requires each plane to be as close as possible to the samples of its own class and at a distance of at least 1 from the samples of the other class (Jayadeva & Chandra, 2007), while LSTSVM requires each plane to be as close as possible to the samples of its own class and at a distance of exactly 1 from the samples of the other class. Intuitively, when handling XOR examples with different distributions, TWSVM and LSTSVM may yield poor classification performance because their optimization criteria differ from that of GEPSVM, even though they achieve good performance on UCI datasets thanks to their loss functions. Another drawback of TWSVM and LSTSVM is that two penalty parameters are introduced into their objective functions instead of the single regularization parameter of GEPSVM, which undoubtedly makes parameter selection more difficult. In addition, when there are many noise variables, the 1-norm SVM (Zou, 2007; Zhou, Zhang, & Jiao, 2002) has advantages over the 2-norm SVM, because the former can generate sparse solutions that make the classifier easier to store and faster to compute. However, GEPSVM and the GEPSVM-based algorithms cannot generate very sparse solutions, even if we write 1-norm formulations for them as in the 1-norm SVM (Zou, 2007), because the direction w_i and the threshold b_i that determine the ith separating plane are coupled with the input samples.

In this paper, we first propose a new and fast algorithm, termed Distance Difference GEPSVM (DGEPSVM). DGEPSVM need not consider the singularity occurring in GEPSVM because it uses a formulation similar to the maximum margin criterion (MMC) (Jiang & Zhang, 2004). We show that the solution of DGEPSVM reduces to solving two simple (standard) eigenvalue problems, which makes DGEPSVM fast and at least comparable to GEPSVM. Moreover, DGEPSVM can deal with XOR examples with different distributions because it keeps the genuine geometrical interpretation of GEPSVM. We then propose a feature selection algorithm for TWSVM, called FETSVM, which overcomes the inability of GEPSVM and the GEPSVM-based algorithms to generate very sparse solutions. Finally, the two algorithms are compared on artificial and UCI datasets, and we illustrate their effectiveness on a USPS handwritten digits application. Our contributions rest on four facts: (1) DGEPSVM need not care about the singularity occurring in GEPSVM and achieves better classification correctness than GEPSVM; (2) DGEPSVM surpasses TWSVM and LSTSVM in solving XOR examples with different distributions and gives comparable classification correctness on standard datasets; (3) DGEPSVM has fewer unknown parameters than TWSVM and LSTSVM; and (4) FETSVM is faster than TWSVM and suppresses input features while giving comparable classification correctness.

2. Related work

2.1. Generalized Proximal Support Vector Machines (GEPSVM) (Mangasarian & Wild, 2006)

Given m training points in the n-dimensional input space R^n, denote by the m_1 x n matrix A the points of class +1 and by the m_2 x n matrix B the points of class -1, where m_1 + m_2 = m. The main purpose of GEPSVM is to find two nonparallel hyperplanes in n-dimensional space, i.e.,

    x^T w_1 - b_1 = 0,   x^T w_2 - b_2 = 0,    (1)

where (w_i, b_i) \in (R^n \times R), i = 1, 2. The algorithm requires each plane to be as close as possible to the samples of its own class and, at the same time, as far as possible from the samples of the other class. Assuming (w_i, b_i) \neq 0, the binary GEPSVM classifier can be written as the following pair of optimization problems:

    (GEPSVM1)    \min_{(w_1, b_1) \neq 0} \frac{(A w_1 - e_1 b_1)^T (A w_1 - e_1 b_1) + \delta \|(w_1; b_1)\|^2}{(B w_1 - e_2 b_1)^T (B w_1 - e_2 b_1)}    (2)

    (GEPSVM2)    \min_{(w_2, b_2) \neq 0} \frac{(B w_2 - e_2 b_2)^T (B w_2 - e_2 b_2) + \delta \|(w_2; b_2)\|^2}{(A w_2 - e_1 b_2)^T (A w_2 - e_1 b_2)}    (3)

where \delta is a regularization constant and e_1, e_2 are vectors of ones of dimensions m_1 and m_2, respectively. Formulation (2) yields the plane that is closest to the points of class +1 and furthest from the points of class -1, and (3) yields the plane that is closest to the points of class -1 and furthest from the points of class +1. Let

    G = E^T E + \delta I,  H = F^T F,  L = F^T F + \delta I,  M = E^T E,  z_1 = (w_1; b_1),  z_2 = (w_2; b_2),    (4)

where E = [A  -e_1] \in R^{m_1 \times (n+1)} and F = [B  -e_2] \in R^{m_2 \times (n+1)}. The optimization problems (2) and (3) become:

    (GEPSVM1)    \min_{z_1 \neq 0} \frac{z_1^T G z_1}{z_1^T H z_1}    (5)

    (GEPSVM2)    \min_{z_2 \neq 0} \frac{z_2^T L z_2}{z_2^T M z_2}    (6)

Using the well-known properties of the Rayleigh quotient, the solutions of (5) and (6) are obtained by solving the following two generalized eigenvalue problems:

    G z_1 = \lambda_1 H z_1,    L z_2 = \lambda_2 M z_2,    z_i \neq 0,  i = 1, 2.    (7)

It can be seen from (2) and (3) that Tikhonov regularization (Tikhonov & Arsen, 1977) is applied to each problem of the GEPSVM pair; the regularization term acts as a penalty and improves the generalization of GEPSVM.
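For concreteness, the following is a minimal numerical sketch of the GEPSVM pair (5)-(7), assuming NumPy/SciPy; the function name and the default value of delta are illustrative, not part of the paper. When H or M is only positive semidefinite, the generalized eigensolver can fail or return non-finite eigenvalues, which is precisely the singularity issue revisited in Section 3.

```python
import numpy as np
from scipy.linalg import eig

def gepsvm_planes(A, B, delta=1e-2):
    """Sketch of GEPSVM: solve the two generalized eigenvalue problems (7).
    A (m1 x n) holds class +1 samples, B (m2 x n) holds class -1 samples.
    Returns (w1, b1) and (w2, b2) for the planes x'w - b = 0."""
    m1, m2 = A.shape[0], B.shape[0]
    E = np.hstack([A, -np.ones((m1, 1))])        # E = [A  -e1]
    F = np.hstack([B, -np.ones((m2, 1))])        # F = [B  -e2]
    I = np.eye(E.shape[1])
    G, H = E.T @ E + delta * I, F.T @ F          # problem (5)
    L, M = F.T @ F + delta * I, E.T @ E          # problem (6)

    def smallest_eigvec(P, Q):
        vals, vecs = eig(P, Q)                   # generalized problem P z = lambda Q z
        vals = np.real(vals)
        vals[~np.isfinite(vals)] = np.inf        # guard against a singular Q
        z = np.real(vecs[:, np.argmin(vals)])
        return z[:-1], z[-1]                     # split z = (w; b)

    return smallest_eigvec(G, H), smallest_eigvec(L, M)
```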

2.2. Twin Support Vector Machines (TWSVM) (Jayadeva & Chandra, 2007)

In 2007, Jayadeva et al. proposed a stand-alone algorithm for binary classification, termed twin SVM (TWSVM) (Jayadeva & Chandra, 2007). It is in the spirit of GEPSVM but has a different formulation: the two nonparallel planes are obtained by solving two SVM-type problems. Experimental results show that TWSVM outperforms GEPSVM and the standard SVM in terms of classification correctness. TWSVM can be written as follows:

    (TWSVM1)    \min \ \frac{1}{2} (A w_1 - e_1 b_1)^T (A w_1 - e_1 b_1) + C_1 e_2^T \xi
                s.t. \ -(B w_1 - e_2 b_1) + \xi \geq e_2, \ \xi \geq 0    (8)

    (TWSVM2)    \min \ \frac{1}{2} (B w_2 - e_2 b_2)^T (B w_2 - e_2 b_2) + C_2 e_1^T \xi
                s.t. \ (A w_2 - e_1 b_2) + \xi \geq e_1, \ \xi \geq 0    (9)

where C_1 and C_2 are two penalty coefficients and \xi is a vector of slack variables. From (8) and (9) we find that only the constraints of the other class appear in each problem, and the objective function does not sum the error over the patterns of both classes. These features make TWSVM effective on skewed or unbalanced datasets, and they may be one reason for its better classification performance with respect to the standard SVM and GEPSVM.
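As a rough illustration only (the authors solve the duals of (8) and (9) with an interior point method in MOSEK), the primal of (TWSVM1) can be written almost verbatim with CVXPY; the function name, the default penalty and the solver choice are assumptions made for this sketch.

```python
import numpy as np
import cvxpy as cp

def twsvm_plane1(A, B, C1=1.0):
    """Sketch of (TWSVM1): plane x'w1 - b1 = 0 close to class +1 (rows of A)
    and at distance >= 1 from class -1 (rows of B), with slack xi."""
    m1, n = A.shape
    m2 = B.shape[0]
    w1, b1, xi = cp.Variable(n), cp.Variable(), cp.Variable(m2)
    objective = cp.Minimize(0.5 * cp.sum_squares(A @ w1 - b1 * np.ones(m1))
                            + C1 * cp.sum(xi))
    constraints = [-(B @ w1 - b1 * np.ones(m2)) + xi >= np.ones(m2), xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w1.value, b1.value
```

The second plane follows by exchanging the roles of A and B (and of e_1 and e_2), as in (9).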

Obviously, the idea in TWSVM is to solve the two QPPs (8) and (9). Each QPP in the TWSVM pair is a typical SVM formulation, except that not all patterns appear in the constraints of both problems at the same time (Jayadeva & Chandra, 2007). The time complexity of TWSVM is O(m^3/4), which is lower than that of the standard SVM, O(m^3); nevertheless, TWSVM is still not feasible for large-scale classification. To improve the computational time of TWSVM while keeping its effectiveness, Kumar et al. proposed a least squares version of TWSVM (LSTSVM) (Arun Kumar & Gopal, 2009; Ghorai et al., 2009). The idea in LSTSVM is to replace the inequality constraints of the TWSVM pair with equality constraints, as follows:

    (LSTSVM1)    \min \ \frac{1}{2} (A w_1 - e_1 b_1)^T (A w_1 - e_1 b_1) + \frac{1}{2} C_1 \xi^T \xi
                 s.t. \ -(B w_1 - e_2 b_1) + \xi = e_2    (10)

    (LSTSVM2)    \min \ \frac{1}{2} (B w_2 - e_2 b_2)^T (B w_2 - e_2 b_2) + \frac{1}{2} C_2 \xi^T \xi
                 s.t. \ (A w_2 - e_1 b_2) + \xi = e_1    (11)

Each of the two problems in the LSTSVM pair is the formulation of a typical PSVM (Fung & Mangasarian, 2001). The advantage of LSTSVM lies in its lower computational time, since it only needs to solve two systems of linear equations. Substituting the equality constraints into the objectives, two explicit expressions for the LSTSVM pair are obtained:

    (LSTSVM1)    \min \ \frac{1}{2} (A w_1 - e_1 b_1)^T (A w_1 - e_1 b_1) + \frac{1}{2} C_1 (e_2 + B w_1 - e_2 b_1)^T (e_2 + B w_1 - e_2 b_1)    (12)

    (LSTSVM2)    \min \ \frac{1}{2} (B w_2 - e_2 b_2)^T (B w_2 - e_2 b_2) + \frac{1}{2} C_2 (e_1 - (A w_2 - e_1 b_2))^T (e_1 - (A w_2 - e_1 b_2))    (13)

Formulations (12) and (13) are two unconstrained QPPs. After some algebra, the two nonparallel planes are obtained by solving the following two systems of linear equations:

    (w_1; b_1) = -\left( F^T F + \frac{1}{C_1} E^T E \right)^{-1} F^T e_2    (14)

    (w_2; b_2) = \left( E^T E + \frac{1}{C_2} F^T F \right)^{-1} E^T e_1    (15)

where, as before, E = [A  -e_1] and F = [B  -e_2].
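A minimal numerical sketch of (14) and (15), assuming NumPy (the function name and defaults are illustrative): each LSTSVM plane reduces to a single linear solve.

```python
import numpy as np

def lstsvm_planes(A, B, C1=1.0, C2=1.0):
    """Sketch of LSTSVM (14)-(15): each plane comes from one linear system."""
    m1, m2 = A.shape[0], B.shape[0]
    E = np.hstack([A, -np.ones((m1, 1))])    # E = [A  -e1]
    F = np.hstack([B, -np.ones((m2, 1))])    # F = [B  -e2]
    e1, e2 = np.ones(m1), np.ones(m2)

    z1 = -np.linalg.solve(F.T @ F + (1.0 / C1) * E.T @ E, F.T @ e2)   # (14)
    z2 = np.linalg.solve(E.T @ E + (1.0 / C2) * F.T @ F, E.T @ e1)    # (15)
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])                        # (w1, b1), (w2, b2)
```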

In the next section, we first point out the flaws of GEPSVM, TWSVM and LSTSVM, and then present our algorithms.

3. Linear DGEPSVM and FETSVM

3.1. Linear DGEPSVM

In GEPSVM, the matrices H and M are always singular. In Mangasarian and Wild (2006), Mangasarian et al. claimed that GEPSVM can still obtain a perfect two-plane classifier on XOR examples even if H and M are singular; in other words, GEPSVM can deal with XOR examples well without additional constraints. Observing the constraints in TWSVM and LSTSVM, we easily find that they lose the genuine geometrical interpretation of the primal GEPSVM: in TWSVM, the constraints require the plane to be at a distance of at least 1 from the points of the other class, so the distance may be exactly 1, while in LSTSVM the constraints require the plane to be at a distance of exactly 1 from the points of the other class. In Jayadeva and Chandra (2007) and Ghorai et al. (2009), the authors reported that both algorithms achieve better generalization than GEPSVM on UCI datasets, which can be attributed to the penalty terms introduced into their objective functions. However, we find that on some XOR examples the two algorithms cannot obtain satisfactory classification results simply because of a different data distribution; this may be caused by their departure from the genuine geometrical interpretation of the primal GEPSVM. We further find that neither TWSVM nor LSTSVM can generate very sparse solutions, although they are SVM-type problems, because the direction w_i and the threshold b_i that determine the ith separating plane are coupled with the input samples.

Combining GEPSVM with the maximum margin criterion (MMC) (Jiang & Zhang, 2004), we first propose DGEPSVM, which preserves the genuine geometrical interpretation of the primal GEPSVM. On UCI datasets, DGEPSVM achieves classification performance comparable to TWSVM and LSTSVM, whereas on XOR examples with different distributions it can obtain a perfect classifier. An additional advantage of DGEPSVM is that it avoids the regularization term used in GEPSVM. Based on the MMC, formulation (2) is replaced by the following difference formulation:

    \min \ f(w_1, b_1) = (A w_1 - e_1 b_1)^T (A w_1 - e_1 b_1) - \beta (B w_1 - e_2 b_1)^T (B w_1 - e_2 b_1)
    s.t. \ w_1^T w_1 = 1    (16)

    \min \ f(w_2, b_2) = (B w_2 - e_2 b_2)^T (B w_2 - e_2 b_2) - \beta (A w_2 - e_1 b_2)^T (A w_2 - e_1 b_2)
    s.t. \ w_2^T w_2 = 1    (17)

where \beta > 0 balances the two terms of each objective. Compared with GEPSVM, each of our formulations is a constrained optimization problem, whereas the primal GEPSVM solves an unconstrained one; the constraints w_i^T w_i = 1 (i = 1, 2) in (16) and (17) are what allow us to avoid the singularity of GEPSVM. DGEPSVM keeps the genuine geometrical interpretation of the primal GEPSVM without incorporating the regularization term (Tikhonov & Arsen, 1977) into the classifier. We now derive the solution of (16). Forming the Lagrangian of (16) with multiplier \lambda_1 gives

    L(w_1, b_1, \lambda_1) = (A w_1 - e_1 b_1)^T (A w_1 - e_1 b_1) - \beta (B w_1 - e_2 b_1)^T (B w_1 - e_2 b_1) - \lambda_1 (w_1^T w_1 - 1)    (18)

Setting the partial derivatives of (18) with respect to the primal variables (w_1, b_1) equal to zero, we get:

    (A^T A - \beta B^T B) w_1 + (\beta B^T e_2 - A^T e_1) b_1 = \lambda_1 w_1    (19)

    (e_1^T A - \beta e_2^T B) w_1 + (\beta e_2^T e_2 - e_1^T e_1) b_1 = 0    (20)

When (\beta e_2^T e_2 - e_1^T e_1) = 0 the threshold b_1 drops out of (20), so we discuss the solutions of the proposed optimization problems in two cases.

Case (\beta e_2^T e_2 - e_1^T e_1) \neq 0: After some algebra, Eqs. (19) and (20) become

    \left[ (A^T A - \beta B^T B) + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} (\beta e_2^T B - e_1^T A)^T (\beta e_2^T B - e_1^T A) \right] w_1 = \lambda_1 w_1    (21)

    b_1 = \frac{1}{\beta e_2^T e_2 - e_1^T e_1} (\beta e_2^T B - e_1^T A) w_1    (22)

The matrix (A^T A - \beta B^T B) + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} (\beta e_2^T B - e_1^T A)^T (\beta e_2^T B - e_1^T A) is in R^{n \times n}, where n << m.

Case (\beta e_2^T e_2 - e_1^T e_1) = 0: The threshold b_1 disappears from Eq. (20), which becomes

    (e_1^T A - \beta e_2^T B) w_1 = 0    (23)

Substituting Eq. (23) into (19) gives an explicit expression for (19):

    (A^T A - \beta B^T B) w_1 = \lambda_1 w_1    (24)

In this case, b_1 cannot be obtained through (19) and (20), so we set it directly as

    b_1 = \bar{m}_1 w_1    (25)

where \bar{m}_1 is the mean (as a row vector) of the samples of the corresponding class. This definition derives from the fact that a fitting plane through the center of the given points has a smaller regression loss (Yang, Chen, Chen, & Pan, 2009). In both cases, \lambda_1 is an eigenvalue of the symmetric matrix (A^T A - \beta B^T B) + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} (\beta e_2^T B - e_1^T A)^T (\beta e_2^T B - e_1^T A), or of A^T A - \beta B^T B.

The key question is now how to obtain the direction w_1 that minimizes f(w_1, b_1). In fact, it is easy to conclude that w_1 is the eigenvector corresponding to the smallest eigenvalue of the matrix (A^T A - \beta B^T B) + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} (\beta e_2^T B - e_1^T A)^T (\beta e_2^T B - e_1^T A) (equal to A^T A - \beta B^T B when \beta e_2^T e_2 - e_1^T e_1 = 0). We state the following theorem based on the above analysis and give its proof.

Theorem 1. The eigenvector corresponding to the smallest eigenvalue of the matrix (A^T A - \beta B^T B) + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} (\beta e_2^T B - e_1^T A)^T (\beta e_2^T B - e_1^T A) (equal to A^T A - \beta B^T B when \beta e_2^T e_2 - e_1^T e_1 = 0) is the optimal direction w_1 of the first plane.

Proof. Rewriting the objective function of (16), one gets

    f(w_1, b_1) = (A w_1 - e_1 b_1)^T (A w_1 - e_1 b_1) - \beta (B w_1 - e_2 b_1)^T (B w_1 - e_2 b_1)
                = w_1^T A^T A w_1 + b_1 e_1^T e_1 b_1 - 2 b_1 e_1^T A w_1 - \beta \left( w_1^T B^T B w_1 + b_1 e_2^T e_2 b_1 - 2 b_1 e_2^T B w_1 \right)    (26)

After some algebra, (26) can be written explicitly as

    f(w_1, b_1) = w_1^T (A^T A - \beta B^T B) w_1 + 2 (\beta b_1 e_2^T B - b_1 e_1^T A) w_1 + (e_1^T e_1 - \beta e_2^T e_2) b_1^2    (27)

Case (\beta e_2^T e_2 - e_1^T e_1) \neq 0: Substituting Eq. (22) into (27) gives a more explicit expression for (27):

    f(w_1, b_1) = w_1^T \left[ A^T A - \beta B^T B + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} (\beta e_2^T B - e_1^T A)^T (\beta e_2^T B - e_1^T A) \right] w_1    (28)

Case (\beta e_2^T e_2 - e_1^T e_1) = 0: Eq. (20) becomes

    (e_1^T A - \beta e_2^T B) w_1 = 0    (29)

Substituting (29) into (27) gives

    f(w_1, b_1) = w_1^T (A^T A - \beta B^T B) w_1    (30)

Note that the matrix in (28) is exactly the matrix on the left-hand side of (21), and the matrix in (30) is the one on the left-hand side of (24). Since w_1^T w_1 = 1, we can therefore write

    f(w_1, b_1) = \lambda_1    (31)

Thus, we conclude that the eigenvector corresponding to the smallest eigenvalue of the matrix (A^T A - \beta B^T B) + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} (\beta e_2^T B - e_1^T A)^T (\beta e_2^T B - e_1^T A) (equal to A^T A - \beta B^T B when \beta e_2^T e_2 - e_1^T e_1 = 0) is the optimal direction w_1 of the first plane. □

By an entirely similar argument, we obtain the following analogue of Theorem 1, whose proof is omitted.

Theorem 2. The eigenvector corresponding to the smallest eigenvalue of the matrix (B^T B - \beta A^T A) + \frac{1}{\beta e_1^T e_1 - e_2^T e_2} (\beta e_1^T A - e_2^T B)^T (\beta e_1^T A - e_2^T B) (equal to B^T B - \beta A^T A when \beta e_1^T e_1 - e_2^T e_2 = 0) is the optimal direction w_2 of the second plane.

DGEPSVM therefore only needs to solve two standard eigenvalue problems (Golub & Van Loan, 1996). Thanks to this property, we need not consider the singularity occurring in GEPSVM. An unseen sample x \in R^n is assigned to the class of the closest plane, i.e.,

    class(x) = \arg\min_{i = 1, 2} |x^T w_i - b_i|,

where |x^T w_i - b_i| is the distance of x from the plane x^T w_i - b_i = 0 (recall that w_i^T w_i = 1).
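The following sketch, assuming NumPy, shows how the DGEPSVM planes of Theorems 1 and 2 and the decision rule above could be computed; the function names, the default \beta and the class encoding are illustrative assumptions, not the authors' code.

```python
import numpy as np

def dgepsvm_plane(A, B, beta=1.0):
    """Sketch of one DGEPSVM plane (Theorem 1): w is the eigenvector of the
    smallest eigenvalue of the matrix in (21)/(24); b follows from (22)/(25)."""
    m1, m2 = A.shape[0], B.shape[0]
    e1A = np.ones(m1) @ A                       # e1^T A  (length-n row)
    e2B = np.ones(m2) @ B                       # e2^T B
    S = A.T @ A - beta * B.T @ B                # A^T A - beta B^T B
    d = beta * m2 - m1                          # beta e2^T e2 - e1^T e1
    v = beta * e2B - e1A                        # beta e2^T B - e1^T A
    if d != 0:
        S = S + np.outer(v, v) / d              # matrix of (21)
    vals, vecs = np.linalg.eigh(S)              # standard symmetric eigenproblem
    w = vecs[:, 0]                              # eigenvector of the smallest eigenvalue
    b = (v @ w) / d if d != 0 else A.mean(axis=0) @ w   # (22) or (25)
    return w, b

def dgepsvm_predict(x, planes):
    """Assign x to the plane with the smallest distance |x^T w_i - b_i|."""
    return int(np.argmin([abs(x @ w - b) for (w, b) in planes])) + 1
```

The second plane is obtained by swapping the roles of the two classes, e.g. planes = [dgepsvm_plane(A, B, beta), dgepsvm_plane(B, A, beta)].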

3.2. Linear FETSVM

As is well known, the 1-norm SVM (Zou, 2007) can obtain sparse solutions. However, 1-norm formulations based on GEPSVM, TWSVM and LSTSVM cannot generate very sparse solutions because of their constructions: GEPSVM is a Rayleigh quotient, whereas in TWSVM and LSTSVM the direction w_i and the threshold b_i are coupled with the training samples. We now aim to obtain very sparse solutions of TWSVM. To this end, we first incorporate Tikhonov regularization into the objective; this is what ultimately allows FETSVM to select input features automatically, something TWSVM and LSTSVM cannot do. We define the regularized TWSVM pair as follows:

    (RTSVM1)    \min \ \frac{1}{2} (A w_1 - e_1 b_1)^T (A w_1 - e_1 b_1) + \frac{1}{2} \varepsilon \|(w_1; b_1)\|^2 + C e_2^T \xi
                s.t. \ -(B w_1 - e_2 b_1) + \xi \geq e_2, \ \xi \geq 0    (32)

    (RTSVM2)    \min \ \frac{1}{2} (B w_2 - e_2 b_2)^T (B w_2 - e_2 b_2) + \frac{1}{2} \varepsilon \|(w_2; b_2)\|^2 + C e_1^T \xi
                s.t. \ (A w_2 - e_1 b_2) + \xi \geq e_1, \ \xi \geq 0    (33)

where \varepsilon > 0 is a regularization parameter. It can be seen from (32) and (33) that the direction w_i and the threshold b_i now appear in the objective functions in isolation, so that FETSVM can be proposed easily: we replace the squared terms by 1-norms and minimize the 1-norm distance of each plane from the samples of class +1 and class -1, respectively,

    (FETSVM1)    \min \ \|A w_1 - e_1 b_1\|_1 + \varepsilon \|(w_1; b_1)\|_1 + C e_2^T \xi
                 s.t. \ -(B w_1 - e_2 b_1) + \xi \geq e_2, \ \xi \geq 0    (34)

    (FETSVM2)    \min \ \|B w_2 - e_2 b_2\|_1 + \varepsilon \|(w_2; b_2)\|_1 + C e_1^T \xi
                 s.t. \ (A w_2 - e_1 b_2) + \xi \geq e_1, \ \xi \geq 0    (35)

Let

    (w_1; b_1) = p_1 - q_1,   G(p_1 - q_1) = r_1 - s_1,   G = [A  -e_1],   H = [B  -e_2]    (36)

where p_1 \geq 0 and q_1 \geq 0 are column vectors in R^{n+1}, while r_1 \geq 0 and s_1 \geq 0 are column vectors in R^{m_1}. Substituting (36) into (34) gives:

    \min \ e_1^T (r_1 + s_1) + \varepsilon e^T (p_1 + q_1) + C e_2^T \xi
    s.t. \ -H(p_1 - q_1) + \xi \geq e_2
          G(p_1 - q_1) = r_1 - s_1
          r_1, s_1, p_1, q_1, \xi \geq 0    (37)

where e is a column vector of ones of dimension n + 1. Formulation (37) is a standard linear programming problem, which can be solved with a fast interior point optimization algorithm. By an entirely analogous argument, the linear programming problem corresponding to (35) is obtained:

    \min \ e_2^T (r_2 + s_2) + \varepsilon e^T (p_2 + q_2) + C e_1^T \xi
    s.t. \ G(p_2 - q_2) + \xi \geq e_1
          H(p_2 - q_2) = r_2 - s_2
          r_2, s_2, p_2, q_2, \xi \geq 0    (38)

where p_2 \geq 0 and q_2 \geq 0 are column vectors in R^{n+1}, while r_2 \geq 0 and s_2 \geq 0 are column vectors in R^{m_2}.
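For concreteness, a sketch of the linear program (37) with SciPy follows; the stacking order of the variables, the defaults and the solver choice (HiGHS rather than the MOSEK interior point code used in the experiments of Section 5) are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def fetsvm_plane1(A, B, C=1.0, eps=0.1):
    """Sketch of the FETSVM LP (37): the variable vector is stacked as
    x = [p1, q1, r1, s1, xi] >= 0, with (w1; b1) = p1 - q1."""
    m1, n = A.shape
    m2 = B.shape[0]
    G = np.hstack([A, -np.ones((m1, 1))])            # G = [A  -e1]
    H = np.hstack([B, -np.ones((m2, 1))])            # H = [B  -e2]
    k = n + 1

    # objective: e1'(r1 + s1) + eps * e'(p1 + q1) + C * e2' xi
    c = np.concatenate([eps * np.ones(2 * k), np.ones(2 * m1), C * np.ones(m2)])

    # -H(p1 - q1) + xi >= e2   <=>   H p1 - H q1 - xi <= -e2
    A_ub = np.hstack([H, -H, np.zeros((m2, 2 * m1)), -np.eye(m2)])
    b_ub = -np.ones(m2)

    # G(p1 - q1) = r1 - s1     <=>   G p1 - G q1 - r1 + s1 = 0
    A_eq = np.hstack([G, -G, -np.eye(m1), np.eye(m1), np.zeros((m1, m2))])
    b_eq = np.zeros(m1)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    z1 = res.x[:k] - res.x[k:2 * k]                  # (w1; b1) = p1 - q1
    return z1[:-1], z1[-1]
```

The LP (38) for the second plane has the same structure with the roles of G and H (and of e_1 and e_2) exchanged.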

In the real world, the classification problems we encounter cannot always be handled by linear classifiers, so we next extend our results to the nonlinear case using kernel techniques.

4. Nonlinear DGEPSVM and FETSVM

4.1. Nonlinear DGEPSVM

We first discuss the optimization problem (16). Following Jiang and Zhang (2004) and Mika, Ratsch, Weston, Scholkopf, and Mullers (1999), every solution w in the kernel feature space H can be written as an expansion in terms of the mapped training data, thereby obtaining

    w_1 = \sum_{i=1}^{m} (\alpha_1)_i \, \phi(x_i) = \phi(X) \alpha_1    (39)

where

    \phi(X) = (\phi(x_1^1), \phi(x_2^1), \ldots, \phi(x_{m_1}^1), \phi(x_1^2), \phi(x_2^2), \ldots, \phi(x_{m_2}^2)),
    \alpha_1 = (\alpha_1^1, \alpha_2^1, \ldots, \alpha_{m_1}^1, \alpha_1^2, \alpha_2^2, \ldots, \alpha_{m_2}^2)^T    (40)

Substituting (39) into (16) gives an explicit expression:

    \min \ f(\alpha_1, b_1) = (K(A, C^T) \alpha_1 - e_1 b_1)^T (K(A, C^T) \alpha_1 - e_1 b_1) - \beta (K(B, C^T) \alpha_1 - e_2 b_1)^T (K(B, C^T) \alpha_1 - e_2 b_1)
    s.t. \ \alpha_1^T \alpha_1 = 1    (41)

where C = [A; B]. With an entirely analogous argument to the linear case, we obtain the following eigen-equation systems:

    \left( M + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} N^T N \right) \alpha_1 = \lambda_1^{\phi} \alpha_1,   when \beta e_2^T e_2 - e_1^T e_1 \neq 0    (42)

    M \alpha_1 = \lambda_1^{\phi} \alpha_1,   when \beta e_2^T e_2 - e_1^T e_1 = 0    (43)

where M = K(A, C^T)^T K(A, C^T) - \beta K(B, C^T)^T K(B, C^T) and N = \beta e_2^T K(B, C^T) - e_1^T K(A, C^T). The matrix M + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} N^T N is in R^{m \times m}, and the details of the solution of nonlinear DGEPSVM are entirely similar to the linear case. Linear DGEPSVM has a time complexity of O(n^2), where n is the dimensionality of the samples, whereas the computational complexity of nonlinear DGEPSVM is O(m^2) because the expansion (39) has length m. In addition, the matrices M + \frac{1}{\beta e_2^T e_2 - e_1^T e_1} N^T N for the two planes together need a storage space of 2(m + 1)(m + 1), which prevents nonlinear DGEPSVM from handling massive-scale classification problems. Following the idea of GEPSVM (Mangasarian & Wild, 2006), a reduced kernel technique (Lee & Mangasarian, 2001) is therefore used in nonlinear DGEPSVM, so that the kernel matrices K(A, C^T) and K(B, C^T) appearing in nonlinear DGEPSVM are replaced by K(A, C'^T) and K(B, C'^T), respectively, where C' is a small subset selected randomly from C.
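A minimal sketch of this reduced-kernel construction for (42)-(43), assuming NumPy; the Gaussian kernel, the 20% subset size, the function names and the seed handling are illustrative assumptions, and the threshold b_1 would follow from the corresponding case of the linear derivation.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-||x - y||^2 / (2 gamma^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * gamma ** 2))

def nonlinear_dgepsvm_alpha1(A, B, beta=1.0, gamma=1.0, frac=0.2, seed=0):
    """Sketch of (42)/(43) with a reduced kernel: C' is a random subset of C = [A; B]."""
    rng = np.random.default_rng(seed)
    C = np.vstack([A, B])
    Cr = C[rng.choice(len(C), size=max(1, int(frac * len(C))), replace=False)]
    KA, KB = rbf_kernel(A, Cr, gamma), rbf_kernel(B, Cr, gamma)    # K(A, C'^T), K(B, C'^T)
    m1, m2 = len(A), len(B)
    M = KA.T @ KA - beta * KB.T @ KB
    N = beta * np.ones(m2) @ KB - np.ones(m1) @ KA                 # beta e2'K(B,.) - e1'K(A,.)
    d = beta * m2 - m1
    if d != 0:
        M = M + np.outer(N, N) / d                                 # matrix of (42)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 0]                                              # alpha_1 (smallest eigenvalue)
```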

4.2. Nonlinear FETSVM (KFETSVM)

Using the expressions (39) and (40), kernel-generated surfaces are used instead of planes:

    K(x^T, C^T) \alpha_1 - b_1 = 0   and   K(x^T, C^T) \alpha_2 - b_2 = 0    (44)

In line with the arguments of Section 3.2, we construct the optimization formulation of KFETSVM using the reduced kernel technique (Lee & Mangasarian, 2001) as follows:

    (KFETSVM1)    \min \ \|K(A, C'^T) \alpha_1 - e_1 b_1\|_1 + \varepsilon \|(\alpha_1; b_1)\|_1 + C e_2^T \xi
                  s.t. \ -(K(B, C'^T) \alpha_1 - e_2 b_1) + \xi \geq e_2, \ \xi \geq 0    (45)

    (KFETSVM2)    \min \ \|K(B, C'^T) \alpha_2 - e_2 b_2\|_1 + \varepsilon \|(\alpha_2; b_2)\|_1 + C e_1^T \xi
                  s.t. \ (K(A, C'^T) \alpha_2 - e_1 b_2) + \xi \geq e_1, \ \xi \geq 0    (46)

Let

    (\alpha_1; b_1) = p_1 - q_1,   G(p_1 - q_1) = r_1 - s_1,   G = [K(A, C'^T)  -e_1],   H = [K(B, C'^T)  -e_2]    (47)

Substituting (47) into (45) gives:

    \min \ e_1^T (r_1 + s_1) + \varepsilon e^T (p_1 + q_1) + C e_2^T \xi
    s.t. \ -H(p_1 - q_1) + \xi \geq e_2
          G(p_1 - q_1) = r_1 - s_1
          r_1, s_1, p_1, q_1, \xi \geq 0    (48)

where e is a column vector of ones of dimension m_3 + 1 and m_3 is the number of points in the subset C'. By an entirely analogous argument, the linear programming problem corresponding to (46) is obtained:

    \min \ e_2^T (r_2 + s_2) + \varepsilon e^T (p_2 + q_2) + C e_1^T \xi
    s.t. \ G(p_2 - q_2) + \xi \geq e_1
          H(p_2 - q_2) = r_2 - s_2
          r_2, s_2, p_2, q_2, \xi \geq 0    (49)

5. Experimental results and analysis

5.1. Experimental results on XOR examples and standard datasets

We experimented on publicly available datasets from the UCI repository (Murphy & Aha, 1992) as well as on two synthetic datasets, in order to demonstrate the effectiveness of DGEPSVM and FETSVM.


Fig. 1. Classification results of GEPSVM/DGEPSVM/TWSVM (left) and LSTSVM (right) for the ''crossplanes 1'' dataset.

Fig. 2. Classification results of GEPSVM/DGEPSVM (left), LSTSVM (middle) and TWSVM (right) for the ''crossplanes 2'' dataset.

Table 1
Linear GEPSVM, TWSVM, LSTSVM, DGEPSVM and FETSVM: fivefold testing correctness (%, mean ± std) and number of input features used (entries read "correctness / features").

Dataset (m × n)          | GEPSVM               | TWSVM                | LSTSVM               | DGEPSVM              | FETSVM
Sonar (208 × 60)         | 72.64 ± 10.081 / 60  | 72.11 ± 2.897 / 60   | 71.61 ± 3.954 / 60   | 74.05 ± 2.649 / 60   | 73.59 ± 7.782 / 27.5
Votes (435 × 16)         | 95.62 ± 1.126 / 16   | 96.55 ± 1.923 / 16   | 96.32 ± 0.8602 / 16  | 95.40 ± 1.868 / 16   | 95.86 ± 2.126 / 8.1
Wbcd (699 × 10)          | 100 ± 0.000 / 10     | 100 ± 0.000 / 5      | 100 ± 0.000 / 10     | 100 ± 0.000 / 10     | 100 ± 0.000 / 4.7
Ionosphere (351 × 33)    | 82.04 ± 3.483 / 32   | 89.75 ± 1.645 / 32   | 89.62 ± 2.641 / 32   | 89.18 ± 4.196 / 32   | 89.17 ± 3.964 / 26
Hepatitis (155 × 19)     | 78.71 ± 7.334 / 19   | 82.58 ± 11.648 / 19  | 80.65 ± 9.784 / 19   | 80.00 ± 5.161 / 19   | 80.00 ± 6.255 / 16.5
Monk3 (432 × 6)          | 79.61 ± 3.578 / 6    | 86.13 ± 3.178 / 3.8  | 84.67 ± 4.645 / 6    | 85.93 ± 3.632 / 6    | 86.10 ± 1.949 / 3.0
Spect (267 × 44)         | 74.56 ± 4.844 / 44   | 77.9 ± 7.762 / 44    | 78.68 ± 5.671 / 44   | 79.04 ± 4.142 / 44   | 76.76 ± 4.943 / 30.8
Heart-statlog (270 × 13) | 72.96 ± 4.772 / 13   | 83.44 ± 6.264 / 13   | 82.59 ± 6.789 / 13   | 80.01 ± 2.457 / 13   | 82.59 ± 6.584 / 11.5
BupaLiver (345 × 6)      | 54.78 ± 3.095 / 6    | 65.22 ± 6.416 / 2.4  | 67.25 ± 1.478 / 6    | 64.47 ± 6.603 / 6    | 65.51 ± 3.924 / 4.8
Wpbc (198 × 34)          | 75.74 ± 3.525 / 33   | 80.53 ± 6.184 / 33   | 79.38 ± 4.92 / 33    | 76.77 ± 4.47 / 33    | 78.33 ± 3.736 / 24.3

The two synthetic ''Crossplanes'' datasets are designed to visually illustrate the effectiveness of the proposed DGEPSVM. In the first ''Crossplanes'' dataset, which differs from the one used in Arun Kumar and Gopal (2009), some points with different labels are distributed at the cross-location of the two classes of samples. Fig. 1 shows this dataset and the computational results obtained by GEPSVM, TWSVM, LSTSVM and DGEPSVM.

GEPSVM, TWSVM and DGEPSVM obtain a classification correctness of 100%, while LSTSVM only obtains 97.9%. As discussed above, TWSVM performs better than LSTSVM on this dataset because the constraints of LSTSVM force the samples of the other class to lie at a distance of exactly 1 from the corresponding plane, whereas in TWSVM the distance is at least 1; the genuine geometrical interpretation of the nonparallel plane classifier is therefore weakened more in LSTSVM than in TWSVM. The second ''Crossplanes'' dataset, shown in Fig. 2, is the one used in Guarracino et al. (2007).


Fig. 3. Multiple Myeloma dataset results for linear FETSVM, LSTSVM, TWSVM and GEPSVM: (a) average fivefold cross-validation error (%) versus number of input features; (b) average number of selected input features. The numbers of input features selected by FETSVM are written in the figure.

Table 2
GEPSVM, TWSVM, LSTSVM, DGEPSVM and FETSVM with nonlinear classifiers: fivefold testing correctness (%, mean ± std) and number of kernel functions used (entries read "correctness / kernel functions").

Dataset (m × n)       | GEPSVM                 | TWSVM                 | LSTSVM                | DGEPSVM               | FETSVM
WDBC (569 × 30)       | 73.23 ± 15.502 / 81.2  | 92.51 ± 3.867 / 81.2  | 93.03 ± 1.828 / 81.2  | 89.30 ± 1.608 / 81.2  | 87.65 ± 4.898 / 32.8
Wbcd (699 × 10)       | 100 ± 0.000 / 98       | 100 ± 0.000 / 49      | 100 ± 0.000 / 98      | 100 ± 0.000 / 98      | 100 ± 0.000 / 4.2
Monk3 (432 × 6)       | 90.39 ± 3.704 / 79     | 96.07 ± 1.269 / 79    | 96.67 ± 1.047 / 79    | 87.12 ± 4.729 / 79    | 93.18 ± 2.633 / 42.6
Ionosphere (351 × 33) | 88.78 ± 4.634 / 49.8   | 88.94 ± 1.629 / 49.8  | 88.41 ± 2.071 / 49.8  | 89.18 ± 3.165 / 49.8  | 88.89 ± 3.165 / 9.5
Glass (214 × 9)       | 84.92 ± 17.635 / 30    | 99.07 ± 1.861 / 30    | 97.63 ± 2.942 / 30    | 98.79 ± 1.861 / 30    | 98.69 ± 1.393 / 26.8
Tic (958 × 10)        | 65.82 ± 11.248 / 137.4 | 98.33 ± 1.335 / 137.4 | 98.33 ± 1.335 / 137.4 | 96.68 ± 4.216 / 137.4 | 98.33 ± 1.335 / 88.4
Spect (267 × 44)      | 74.61 ± 10.778 / 38    | 79.49 ± 2.447 / 38    | 77.69 ± 5.525 / 38    | 79.27 ± 4.735 / 38    | 73.93 ± 5.258 / 17.18

The classification results show that DGEPSVM keeps the effectiveness of GEPSVM and obtains 100% correctness, whereas TWSVM and LSTSVM only obtain 62.6% and 63.91% correctness, respectively. This may be because DGEPSVM keeps the genuine geometrical interpretation of the primal GEPSVM, whereas TWSVM and LSTSVM lose it.

To further illustrate the performance of DGEPSVM and FETSVM, we also experimented on 11 UCI datasets. In the nonlinear case the Gaussian kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / 2c^2) was used for every SVM algorithm. All experiments ran on Windows XP with MATLAB 7.1 installed. The MOSEK optimization toolbox, which implements fast interior point based algorithms (Arun Kumar & Gopal, 2009), was used for solving the dual QPPs occurring in TWSVM and the linear programming problems arising in FETSVM. Classification correctness was evaluated using fivefold cross validation.

Table 1 shows the linear kernel comparison of DGEPSVM versus GEPSVM, TWSVM, LSTSVM and FETSVM. The optimal parameters were obtained by tuning on each training fold. The single parameter of GEPSVM and of DGEPSVM was selected over the range {2^i | i = -6, -5, ..., 14}; for TWSVM, LSTSVM and FETSVM, the penalty parameters C, C1 and C2 were searched over the same range, and the regularization constant epsilon was searched over the range {10^i | i = -7, -6, ..., 3}. For all algorithms, components of w whose absolute value is less than 1e-8 are discarded. We report the means and standard errors of the results.

We also recorded the number of input features used by each algorithm. Table 1 shows that the classification results of both DGEPSVM and FETSVM are comparable to those of TWSVM and LSTSVM and better than those of GEPSVM, while DGEPSVM has only a single parameter. Table 1 also shows that FETSVM achieves more effective feature suppression than the other multi-plane classifiers in the table. This advantage of FETSVM is also reflected in Fig. 3: on the Multiple Myeloma dataset (Page et al., 2002), FETSVM attains classification correctness comparable to TWSVM and LSTSVM while using the smallest number of input features of all the algorithms; for example, when the number of features reaches 350, FETSVM only requires 66.2 on average.

In the nonlinear case, all algorithms additionally tune the kernel parameter c, which means TWSVM and LSTSVM have three unknown parameters. To save time, we chose 10% of each training fold as a tuning set, and to avoid bias we repeated fivefold cross validation five times with a randomly selected tuning set each time. The parameter c was selected within {2^i | i = -7, -6, ..., 7} and the parameter epsilon within {2^i | i = -7, -6, ..., 3}. A 20% reduced kernel (Lee & Mangasarian, 2001) was used for all algorithms, as in Jayadeva and Chandra (2007). As indicated in Table 2, the generalization performance of DGEPSVM is comparable to that of both TWSVM and LSTSVM, and FETSVM uses the smallest number of kernel functions.
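As a rough, simplified sketch of the fivefold grid search described above (assuming a generic fit_predict interface; note that the paper tunes on each training fold, or on a 10% tuning set in the nonlinear case, which this sketch does not reproduce exactly):

```python
import numpy as np

def fivefold_accuracy(fit_predict, X, y, params, folds=5, seed=0):
    """Average fivefold accuracy of fit_predict(X_train, y_train, X_test, **params)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    splits = np.array_split(idx, folds)
    accs = []
    for k in range(folds):
        test = splits[k]
        train = np.concatenate([splits[j] for j in range(folds) if j != k])
        y_pred = fit_predict(X[train], y[train], X[test], **params)
        accs.append(np.mean(y_pred == y[test]))
    return float(np.mean(accs))

def tune_beta(fit_predict, X, y):
    """Grid over beta in {2^i | i = -6, ..., 14}, as in the linear experiments."""
    grid = [2.0 ** i for i in range(-6, 15)]
    scores = {b: fivefold_accuracy(fit_predict, X, y, {"beta": b}) for b in grid}
    return max(scores, key=scores.get)
```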


Fig. 4. NDCC dataset results for linear FETSVM, LSTSVM, TWSVM, DGEPSVM and GEPSVM: (a) fivefold training times versus number of input samples for TWSVM and FETSVM; (b) the same for LSTSVM, GEPSVM, DGEPSVM and FETSVM.

Table 3
Fivefold testing correctness (%, mean ± std) and number of input features ("correctness / features") of GEPSVM, LSTSVM, DGEPSVM and FETSVM on USPS digit pairs.

Selected classes | GEPSVM                | LSTSVM               | DGEPSVM               | FETSVM
0 versus 1       | 96.4 ± 0.936 / 256    | 98.86 ± 0.477 / 256  | 96.60 ± 0.5183 / 256  | 99.00 ± 0.372 / 172.3
1 versus 2       | 99.55 ± 0.4767 / 256  | 99.64 ± 0.273 / 256  | 99.81 ± 0.265 / 256   | 99.82 ± 0.223 / 147.9
3 versus 4       | 99.45 ± 0.308 / 256   | 99.64 ± 0.181 / 256  | 100 ± 0.000 / 256     | 100 ± 0.000 / 139.6
5 versus 6       | 98.5 ± 0.468 / 256    | 98.68 ± 0.364 / 256  | 98.91 ± 0.526 / 256   | 98.75 ± 0.216 / 168.1

We also conducted experiments on large NDCC datasets (Thompson, 2006) to compare the fivefold computational time of linear GEPSVM, TWSVM, LSTSVM, DGEPSVM and FETSVM. The computational complexity of LSTSVM is 2·O(n^3) because of its two PSVM-type (Fung & Mangasarian, 2001) formulations, which is similar to that of GEPSVM (Mangasarian & Wild, 2006). For DGEPSVM, the complexity is of the order of 2n^2 because it solves two standard eigenvalue problems. These facts explain the computational times shown in Fig. 4. All parameters of the above algorithms were set to 2^5. Fig. 4 shows that GEPSVM, LSTSVM and DGEPSVM are comparable in speed and considerably faster than TWSVM; moreover, FETSVM is also faster than TWSVM.

5.2. Application to USPS handwritten digits

The USPS dataset consists of the digits ''0'' to ''9'' and contains 11,000 samples of dimension 256 (16 × 16). Here, we consider binary classification with multi-surface classifiers; the selected class pairs are shown in Table 3. The parameters were selected using the same tuning procedure as in the linear case. Our computational results on USPS indicate that DGEPSVM and FETSVM are comparable to LSTSVM in terms of classification correctness; however, FETSVM requires the smallest number of input features of all the algorithms.

6. Conclusion

We have proposed a new but simple classifier with a single unknown parameter for data classification, termed DGEPSVM. Retaining the genuine geometrical interpretation of the nonparallel plane classifier, DGEPSVM is capable of dealing with XOR examples from different distributions, in contrast to TWSVM and LSTSVM, which weaken that geometrical interpretation.

In addition, on publicly available datasets from the UCI repository, DGEPSVM obtains classification correctness comparable to that of both TWSVM and LSTSVM, and better than that of GEPSVM. The algorithm requires fewer operations than TWSVM because only two standard eigenvalue problems need to be solved. DGEPSVM also takes advantage of reduced kernel techniques, which yield a far faster training of the nonlinear classifier; this allows DGEPSVM to run on large-scale datasets, whereas the training time of TWSVM is very high. Direct linear programming formulations of nonparallel plane classifiers such as TWSVM, LSTSVM and GEPSVM cannot obtain very sparse solutions because of their complex formulations. Considering this weak point, we used a regularization method to derive a feature selection algorithm for TWSVM (FETSVM) with the ability to suppress input features, which obtains sparse solutions by solving two linear programs. To save training time, a fast interior point algorithm from the MOSEK toolbox is used for solving FETSVM. Our computational results on UCI and NDCC datasets indicate that FETSVM obtains classification correctness comparable to that of TWSVM, but with much faster training; more importantly, FETSVM is capable of suppressing input features. These properties, together with their simplicity, make the proposed algorithms a useful classification tool.

Acknowledgments

The authors are extremely thankful to the Research Foundation for the Doctoral Program of Higher Education of China (20093219120025) and the National Science Foundation of China (90820306) for support.

References

Arun Kumar, M., & Gopal, M. (2009). Least squares twin support vector machines for pattern classification. Expert Systems with Applications, 36, 7535–7543.
Fung, G., & Mangasarian, O. L. (2001). Proximal support vector machine classifiers. In F. Provost & R. Srikant (Eds.), Proceedings KDD-2001: Knowledge discovery and data mining (pp. 77–86).
Ghorai, S., Mukherjee, A., & Dutta, P. K. (2009). Nonparallel plane proximal classifier. Signal Processing, 510–522.
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore: The Johns Hopkins University Press.
Guarracino, M. R., Cifarelli, C., Seref, O., & Pardalos, P. M. (2007). A classification algorithm based on generalized eigenvalue problems. Optimization Methods and Software, 22(1), 73–81.
Jayadeva, Khemchandani, R., & Chandra, S. (2007). Twin support vector machines for pattern classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(5), 905–910.
Jiang, L. H., & Zhang, K. (2004). Efficient and robust feature extraction by maximum margin criterion. In Proceedings of the conference on advances in neural information processing systems (NIPS'04) (pp. 97–104). Cambridge, MA: MIT Press.
Lee, Y. J., & Mangasarian, O. L. (2001). RSVM: Reduced support vector machines. In Proceedings of the first SIAM international conference on data mining.
Mangasarian, O., & Wild, E. (2006). Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1), 69–74.

Mika, S., Ratsch, G., Weston, J., Scholkopf, B., & Mullers, K. R. (1999). Fisher discriminant analysis with kernels. In Y. H. Hu, J. Larsen, E. Wilson, & S. Douglas (Eds.), Neural networks for signal processing IX (pp. 41–48).
Murphy, P. M., & Aha, D. W. (1992). UCI repository of machine learning databases.
Page, D., Zhan, F., Cussens, J. M., Waddell, J., Hardin, B., Barlogie, & Shaughnessy, J., Jr. (2002). Comparative data mining for microarrays: A case study based on multiple myeloma. Technical Report 1453, Computer Sciences Department, University of Wisconsin.
Thompson, M. E. (2006). Normally distributed clustered datasets.
Tikhonov, A. N., & Arsen, V. Y. (1977). Solutions of ill-posed problems. New York: John Wiley and Sons.


Yang, X. B., Chen, S. C., Chen, B., & Pan, Z. S. (2009). Proximal support vector machine using local information. Neurocomputing, 73, 357–365.
Zhou, W. D., Zhang, L., & Jiao, L. C. (2002). Linear programming support vector machines. Pattern Recognition, 35(12), 2927–2936.
Zou, H. (2007). An improved 1-norm SVM for simultaneous classification and variable selection. In AISTATS.