Weighted Twin Support Vector Machines with Local Information and its application


Neural Networks 35 (2012) 31–39


Qiaolin Ye a,∗, Chunxia Zhao a, Shangbing Gao a, Hao Zheng b

a School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, People's Republic of China
b School of Information Technology, Nanjing Xiaozhuang University, People's Republic of China

Article history: Received 2 March 2010; Received in revised form 15 March 2012; Accepted 29 June 2012

Keywords: GEPSVM; WLTSVM; Similarity information; Support vector

Abstract: A Twin Support Vector Machine (TWSVM), as a variant of the Multisurface Proximal Support Vector Machine via Generalized Eigenvalues (GEPSVM), attempts to improve the generalization of GEPSVM. Its solution follows from solving two quadratic programming problems (QPPs), each of which is smaller than the single QPP in a standard SVM. Unfortunately, the two QPPs still lead to rather high computational costs. Moreover, although TWSVM has better classification performance than GEPSVM, a major disadvantage is that it fails to exploit, as much as possible, the underlying correlation or similarity information between pairs of data points with the same labels, which may be important for classification performance. To mitigate these deficiencies, in this paper we propose a novel nonparallel plane classifier, called Weighted Twin Support Vector Machines with Local Information (WLTSVM), which mines as much underlying similarity information within the samples as possible. This method not only retains the superior characteristics of TWSVM, but also has additional advantages: (1) comparable or better classification accuracy compared with SVM, GEPSVM and TWSVM; (2) taking motivation from standard SVM, the concept of support vectors is retained; (3) lower computational cost than TWSVM; and (4) only one penalty parameter is considered, as opposed to two in TWSVM. Finally, experiments on both simulated and real problems confirm the effectiveness of our method.

1. Introduction

Mangasarian and Wild proposed a fast classifier for binary classification, termed the multisurface proximal SVM via generalized eigenvalues (GEPSVM) (Mangasarian & Wild, 2006), which is an extension of the proximal SVM (PSVM) (Fung & Mangasarian, 2001). The geometric interpretation of GEPSVM is that each plane is closest to the samples of its own class and at the same time furthest from the samples of the other class (Mangasarian & Wild, 2006). Such a method has lower computational complexity and works better on XOR examples in comparison to standard SVM. Mangasarian and Wild (2006) presented a simple ''cross planes'' example, a generalization of the XOR example, which confirmed the effectiveness of GEPSVM over PSVM and SVM. The authors also claimed that the classification ability of GEPSVM is similar to that of SVM for real data sets, under the assumption that the symmetric matrices H and M in (7) and (12) of their paper are non-singular. In reality, the two matrices may be singular, which would yield an ill-conditioned problem.



Corresponding author. E-mail address: [email protected] (Q. Ye).

doi:10.1016/j.neunet.2012.06.010

Although the computational efficiency of GEPSVM is high, a major disadvantage is the loss of sparseness (i.e., the loss of the concept of support vectors) in its solution: every input pattern contributes to the model, so the memory overhead increases sharply. During the last five years, GEPSVM has been extended to a family of novel nonparallel plane classifiers for data-mining classification problems, which can be roughly divided into two categories: learning algorithms based on generalized eigenvalues that improve the generalization capability or computational cost of GEPSVM (Guarracino, Cifarelli, Seref, & Pardalos, 2007; Jayadeva, Khemchandai, & Chandra, 2005, 2007a, 2007b; Yang, Chen, Chen, & Pan, 2009; Ye, Zhao, Ye, & Chen, 2010a); and SVM-like methods that obtain two nonparallel planes by solving quadratic programming problems (QPPs) (Arun Kumar & Gopal, 2009; Ghorai, Mukherjee, & Dutta, 2009; Jayadeva et al., 2007a, 2007b). Among these, the stand-alone nonparallel plane classifier TWSVM (Jayadeva et al., 2007a, 2007b) in the second category has attracted much attention owing to its good generalization ability, good sparseness and shorter computing time compared with standard SVM. Such a method, which is in the spirit of GEPSVM, obtains two nonparallel planes by solving two related SVM-type problems, each of which is smaller than the problem in a standard SVM (Jayadeva et al., 2007a, 2007b). The original motivation for GEPSVM is to effectively deal with XOR problems


that are not linearly separable. TWSVM inherits this advantage. Through analysis and experiments, it has shown better classification performance than both GEPSVM and SVM (Arun Kumar & Gopal, 2009). Concentrating on the two QPPs arising in TWSVM, it is clear that the concept of support vectors is retained; quite differently from SVM, the support vectors consist of samples of the other class (Jayadeva et al., 2007a, 2007b). Despite the above advantages, TWSVM not only has a time complexity of no more than m^3/4, where m is the number of training samples (Jayadeva et al., 2007a, 2007b), but also confronts the difficult problem of selecting two penalty parameters (each of which is an arbitrary real number). Inspired by PSVM, scholars have proposed least-squares versions of TWSVM, such as LSTSVM (Arun Kumar & Gopal, 2009; Ghorai et al., 2009); similar to GEPSVM, the sparseness of the solution is lost. A common disadvantage of the existing nonparallel plane classifiers, including TWSVM, is that they fail to exploit the similarity information between pairs of data points. Cover and Hart (1967) were the first to confirm that almost half of the information in the data resides in the nearest neighbors of points, which can be found by k-nearest neighbor relations. For the purpose of classification, the basic assumption here is that samples sharing the same label have higher similarity than those with different labels (Wu, Ianakiev, & Govindaraju, 2002; Yang et al., 2009). Previous efforts have shown that most of the samples of a data set are highly correlated, at least locally, or that the data set has an inherent geometrical property (Cai, He, Zhou, Han & Bao, 2007a; Lafon, Keller, & Coifman, 2006; Yang et al., 2009; Yan, Xu, Zhang, & Zhang, 2005). Intuitively, the underlying similarity information is therefore crucial for data classification. Motivated by the above conclusions and theories, we develop a novel classification method, referred to as WLTSVM, which stands for Weighted Twin Support Vector Machines with Local Information. By making full use of similarity information in terms of the data affinity, WLTSVM reformulates the primal problem of TWSVM. The proposed method uses two graphs (an intra-class graph and an inter-class graph) to characterize the intra-class compactness and the inter-class separability, respectively. The proposed method can reduce the time complexity of TWSVM by reducing the number of support vectors for each class. Intuitively, these support vectors reside in the nearest-neighbor relations between points with different labels, and thus it is expected that the time complexity of TWSVM can be greatly reduced by taking into consideration only the possible support vectors. Although two parameters (i.e., the neighbor size k and a penalty parameter C) need to be considered, WLTSVM significantly narrows the range for parameter selection in comparison to TWSVM, because the neighbor size k is an arbitrary positive integer instead of an arbitrary real number. Experiments carried out on artificial and real data sets indicate that WLTSVM provides comparable performance to SVM, GEPSVM and TWSVM.

2. Twin SVM (TWSVM)

Given a training set of two classes X^(i) = [x_1^(i), x_2^(i), ..., x_{N_i}^(i)], i = 1, 2, with N_i n-dimensional samples in class i, let the N_1 × n matrix A collect the samples of class +1 (A_i denotes the ith sample of class +1) and let the N_2 × n matrix B (B_i has the same meaning as A_i) collect the samples of class −1, where N_1 + N_2 = N. The central idea in TWSVM is to seek two nonparallel hyperplanes in the n-dimensional input space,

x^T w_1 + b_1 = 0,    x^T w_2 + b_2 = 0,    (1)

where the superscript ''T'' denotes transposition and (w_i, b_i) ∈ (R^n × R), i = 1, 2. Twin SVM (TWSVM) (Jayadeva et al., 2007a, 2007b) is in the spirit of GEPSVM, although the two are based on entirely different formulations: in TWSVM each nonparallel plane is generated by solving an SVM-type problem. Experiments have demonstrated that TWSVM not only runs faster than standard SVM but also outperforms GEPSVM in terms of classification accuracy. The formulation of TWSVM can be written as follows:

(TWSVM1)   min_{w_1, b_1, ξ}   (1/2) (A w_1 + e_1 b_1)^T (A w_1 + e_1 b_1) + C_1 e_2^T ξ
           s.t.   −(B w_1 + e_2 b_1) + ξ ≥ e_2,   ξ ≥ 0,    (2)

(TWSVM2)   min_{w_2, b_2, ξ}   (1/2) (B w_2 + e_2 b_2)^T (B w_2 + e_2 b_2) + C_2 e_1^T ξ
           s.t.   (A w_2 + e_1 b_2) + ξ ≥ e_1,   ξ ≥ 0,    (3)

where e_i, i = 1, 2, is a column vector of ones of appropriate dimension, and C_1 and C_2 are two penalty coefficients. The Wolfe duals of QPPs (2) and (3) are given by (4) and (5) in terms of the Lagrange multiplier vectors α ∈ R^{N_2} and γ ∈ R^{N_1}, respectively (Jayadeva et al., 2007a, 2007b):

(DTWSVM1)   max_α   e_2^T α − (1/2) α^T G (H^T H)^{−1} G^T α
            s.t.   0 ≤ α ≤ C_1,    (4)

where H = [A  e_1] and G = [B  e_2], and

(DTWSVM2)   max_γ   e_1^T γ − (1/2) γ^T H (G^T G)^{−1} H^T γ
            s.t.   0 ≤ γ ≤ C_2.    (5)

After solving the QPPs (4) and (5), the two nonparallel planes of (1) are respectively produced by

[w_1; b_1] = −(H^T H)^{−1} G^T α,    (6)

[w_2; b_2] = (G^T G)^{−1} H^T γ.    (7)
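To make the duals (4)-(5) and the plane recovery (6)-(7) concrete, here is a minimal NumPy/SciPy sketch. It is illustrative only and not the authors' implementation (Section 4 notes that the experiments rely on the QP routine of the Gunn SVM toolbox in MATLAB); it swaps in a general-purpose bounded optimizer (L-BFGS-B) for the box-constrained QPs, and the small ridge term eps added before inversion is our own assumption to guard against a (near-)singular H^T H or G^T G.

```python
import numpy as np
from scipy.optimize import minimize

def solve_box_qp(K, e, C):
    """Maximize e^T u - 0.5 u^T K u subject to 0 <= u <= C (the form of duals (4)/(5))."""
    n = len(e)
    obj = lambda u: 0.5 * u @ K @ u - e @ u      # negated objective, since we minimize
    grad = lambda u: K @ u - e
    res = minimize(obj, np.zeros(n), jac=grad,
                   bounds=[(0.0, C)] * n, method='L-BFGS-B')
    return res.x

def twsvm_train(A, B, C1, C2, eps=1e-6):
    """Linear TWSVM, Eqs. (2)-(7): returns (w1, b1) and (w2, b2)."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    H = np.hstack([A, e1])                        # H = [A e1]
    G = np.hstack([B, e2])                        # G = [B e2]
    HtH_inv = np.linalg.inv(H.T @ H + eps * np.eye(H.shape[1]))
    GtG_inv = np.linalg.inv(G.T @ G + eps * np.eye(G.shape[1]))
    # Dual (4): max e2^T a - 0.5 a^T G (H^T H)^-1 G^T a,  0 <= a <= C1
    alpha = solve_box_qp(G @ HtH_inv @ G.T, np.ones(B.shape[0]), C1)
    # Dual (5): max e1^T g - 0.5 g^T H (G^T G)^-1 H^T g,  0 <= g <= C2
    gamma = solve_box_qp(H @ GtG_inv @ H.T, np.ones(A.shape[0]), C2)
    u1 = -HtH_inv @ G.T @ alpha                   # Eq. (6): [w1; b1]
    u2 = GtG_inv @ H.T @ gamma                    # Eq. (7): [w2; b2]
    return (u1[:-1], u1[-1]), (u2[:-1], u2[-1])
```

A test point x is then assigned to the class whose plane gives the smaller perpendicular distance |x^T w_l + b_l| / ||w_l||, the same rule that WLTSVM uses in (29) below.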

From (4) and (5), we notice that only the constraints of the other class appear, implying that QPP (4) has N_2 parameters and QPP (5) has N_1 parameters, as opposed to N = N_1 + N_2 parameters in standard SVM. We further note that the objective function of each problem in the TWSVM pair does not sum the errors over the patterns of both classes. These aspects suggest that TWSVM may be effective for skewed or unbalanced data sets. It is evident that the idea in TWSVM is to solve the two QPPs (4) and (5). Each QPP in the TWSVM pair is a typical SVM formulation, except that not all patterns appear in the constraints of either problem at the same time. In the TWSVM pair, two penalty parameters, i.e., C_1 and C_2, need to be selected. The complexity of TWSVM is no more than O(m^3/4) (Jayadeva et al., 2007a, 2007b); for large-scale data classification, TWSVM is therefore not feasible. It is easily observed that the time complexity of the dual QPP (4) is determined by the number of points in class −1, and that the number of points in class +1 determines the computational cost of QPP (5). The points for which 0 < α_i < C_1 (i = 1, 2, ..., N_2) or 0 < γ_j < C_2 (j = 1, 2, ..., N_1) are defined as support vectors, as they play an important role in determining the planes of (1), and it has been shown that the support vectors lie on the corresponding supporting/bounding plane (Jayadeva et al., 2007a, 2007b). Therefore, it is expected that by selecting only the possible support vectors in advance, the computational cost of TWSVM can be decreased. Suppose that we have N training samples {x_i, y_i}, i = 1, 2, ..., N, where y_i is the class label, and that we have produced the plane of class +1. It is evident that the points of class −1 should be furthest from this plane. For convenience of description, the points of class −1 are divided into two subset matrices M_1 and M_2, consisting of the margin and non-margin points, respectively, where the margin points are closest to the plane of class +1 and the non-margin points are furthest from it. The non-margin points in M_2 tend to be considerably distant from the plane of class +1, whereas the margin points in M_1 are always close to this plane, which means that the margin points are easily misclassified. In other words, only the margin points, instead of all the points of the other class, intuitively play an important role in the optimal construction of the plane of the corresponding class. By utilizing only the points of a particular class together with a few margin points from the other class, instead of all the points, TWSVM may still obtain the desired results. The margin points are sparse, and their number influences the time complexity. Intuitively, these sparse support vectors (the support vector set is much smaller than the entire training sample set) comprise the margin points that reside in the nearest-neighbor relations between points with different labels, and thus they can be obtained by determining k-nearest neighbors. Furthermore, a remaining major disadvantage of the existing nonparallel plane classifiers, including TWSVM, is that they fail to exploit the similarity information between pairs of samples. In this paper, we attempt to define a new optimization criterion that explores the similarity information residing in the points and reduces the time complexity of TWSVM.

3. The WLTSVM classifier

In this section, we present our classifier WLTSVM and describe how to extend it to a Reproducing Kernel Hilbert Space (RKHS), which results in kernel-generated WLTSVM.

3.1. Linear WLTSVM

The task for nonparallel plane classifiers such as GEPSVM and TWSVM (Jayadeva et al., 2007a, 2007b; Mangasarian & Wild, 2006) is to separate samples with different labels as far as possible by learning on the training samples.

3.1.1. Construction of WLTSVM

Inspired by spectral graph theory, we construct a k-nearest neighbor graph G to model the local geometric structure of the data. The weight matrix of G can be defined as (Xue, Chen, & Yang, 2009; Chung, 1997; Cai et al., 2007a):

W_{ij} = 1 if x_i is among the k nearest neighbors of x_j or x_j is among the k nearest neighbors of x_i, and W_{ij} = 0 otherwise.    (8)

However, this graph cannot reveal the discriminative structure of the data, which has a direct connection to classification. The aim of TWSVM is to generate two planes, each of which fits the points of the corresponding class. To obtain better performance, it is expected that the planes yielded by TWSVM should best fit the high-density, highly correlated points and should be furthest from the points of the other class. Therefore, inspired by the graph-based dimensionality reduction method (Cai et al., 2007a), we construct two graphs for each problem of the TWSVM pair, a within-class graph G_s and a between-class graph G_d, to model the intra-class compactness and the inter-class separability, respectively. G_s and G_d are two sub-graphs of the above graph G. We first discuss the optimization (2) in TWSVM1. Given any pair of points (x_i, x_j) in class +1 and an arbitrary point x_l in class −1, the weight matrices for G_s and G_d of the plane of class +1 are respectively defined as

W_{s,ij} = 1 if x_i is among the k nearest neighbors of x_j or x_j is among the k nearest neighbors of x_i, and W_{s,ij} = 0 otherwise;    (9)

W_{d,il} = 1 if x_l is among the k nearest neighbors of x_i, and W_{d,il} = 0 otherwise.    (10)

Here, the distance between any pair of points is measured using the standard Euclidean metric. The idea of WLTSVM is to discover the intrinsic similarity information within the samples of the same class and, as far as possible, to extract the possible support vectors residing in the samples of the other class (hereafter we call these margin points). As in Ye, Zhao, and Ye (2010b), we redefine the weights W_{d,il} of G_d as

f_j = 1 if there exists i such that W_{d,ij} ≠ 0, and f_j = 0 otherwise.    (11)

As mentioned above, WLTSVM is similar to TWSVM in spirit, as it also tries to obtain the two nonparallel planes of (1), each of which fits the points of the corresponding class. Following the analogous geometric interpretation of TWSVM, we construct an optimization problem to estimate the plane of class +1 as follows:

(WLTSVM1)   min   (1/2) Σ_{i=1}^{N_1} Σ_{j=1}^{N_1} W_{s,ij} (w_1^T x_j^(1) + b_1)^2 + C Σ_{j=1}^{N_2} ξ_j
            s.t.   −f_j (w_1^T x_j^(2) + b_1) + ξ_j ≥ f_j · 1,   ξ_j ≥ 0.    (12)

By simplifying (12), we can obtain the following expression:

(WLTSVM1)   min   (1/2) Σ_{j=1}^{N_1} d_j (w_1^T x_j^(1) + b_1)^2 + C Σ_{j=1}^{N_2} ξ_j
            s.t.   −f_j (w_1^T x_j^(2) + b_1) + ξ_j ≥ f_j · 1,   ξ_j ≥ 0,    (13)

where d_j = Σ_{i=1}^{N_1} W_{s,ij}, j = 1, 2, ..., N_1. The bigger the value of d_j, the more ''important'' x_j is. In fact, d_j is introduced to exploit the similarity information between pairs of samples. It is evident from (13) that WLTSVM1 only takes into account the weighted samples of class +1, whose weights d_j are at least one, and the margin points of class −1, whose weights f_j are equal to one. In other words, optimization problem (13) is minimized to keep the plane of class +1 as close as possible to the neighboring samples of G_s and as far as possible from the connected points of G_d, which generate the margin points of class −1. The QPP converges to a globally optimal solution. Different from TWSVM, WLTSVM employs a single penalty parameter. The formulation in (13) is similar to that in our previous work (Ye et al., 2010b), in which a least-squares version of the reduced algorithm suggested by Yang et al. (2009) is proposed. However, WLTSVM is not a reduced algorithm. The weight matrix (9) defined here and the weights d_j are different from those proposed by Ye et al. (2010b), where the ''and'' relation between any pair of neighboring points is considered and d_j is either 1 or 0; furthermore, as discussed, the sparseness of the solution is lost there. In the following, we concentrate on the solution of this optimization problem.
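The graph weights of (8)-(11) and the diagonal entries d_j of the matrix D used in (13) reduce to simple k-nearest-neighbor bookkeeping. The following NumPy sketch (our own illustration; the paper's experiments were run in MATLAB) builds W_s, the weights d_j for class +1 and the margin-point flags f_j for class −1; k1 and k2 denote the within-class and between-class neighbor sizes tuned in Section 4.

```python
import numpy as np

def knn_mask(dist, k):
    """Boolean matrix whose (i, j) entry is True iff column j is among the k nearest of row i."""
    idx = np.argsort(dist, axis=1)[:, :k]
    mask = np.zeros(dist.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def wltsvm_weights(A, B, k1, k2):
    """Weights for the plane of class +1: d_j as in (13) and margin flags f_j as in (11)."""
    # Within-class graph Gs over class +1 with the symmetric "or" relation of Eq. (9).
    dAA = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    np.fill_diagonal(dAA, np.inf)                 # a point is not its own neighbor
    nn = knn_mask(dAA, k1)
    Ws = (nn | nn.T).astype(float)                # Eq. (9)
    d = Ws.sum(axis=0)                            # d_j = sum_i W_{s,ij}, Eq. (13)
    # Between-class graph Gd, Eq. (10): x_l of class -1 is a k2-NN of x_i of class +1.
    dAB = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    Wd = knn_mask(dAB, k2)                        # shape (N1, N2)
    f = Wd.any(axis=0).astype(float)              # f_j = 1 iff some W_{d,ij} != 0, Eq. (11)
    return d, f
```

Only the class −1 points with f_j = 1 (the margin points) enter the constraints of (13), which is where the reduction in dual problem size, and hence in training time, comes from.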

3.1.2. Solution to WLTSVM

The Lagrangian of the optimization problem (13) is given by

L(w_1, b_1, ξ_j, α_j, β_j) = (1/2) Σ_{j=1}^{N_1} d_j (w_1^T x_j^(1) + b_1)^2 + C Σ_{j=1}^{N_2} ξ_j − Σ_{j=1}^{N_2} α_j (−f_j (w_1^T x_j^(2) + b_1) + ξ_j − f_j · 1) − Σ_{j=1}^{N_2} β_j ξ_j,    (14)

where α_j and β_j (here, α_j, β_j ≥ 0) are the components of the vectors of Lagrange multipliers. By setting the gradient of (14) with respect to w_1, b_1 and ξ_j equal to zero, we obtain

∂L/∂w_1 = Σ_{j=1}^{N_1} d_j (x_j^(1) (x_j^(1))^T w_1 + x_j^(1) b_1) + Σ_{j=1}^{N_2} α_j f_j x_j^(2) = 0,    (15)

∂L/∂b_1 = Σ_{j=1}^{N_1} d_j ((x_j^(1))^T w_1 + b_1) + Σ_{j=1}^{N_2} α_j f_j = 0,    (16)

∂L/∂ξ_j = 0  →  0 ≤ α_j ≤ C,   j = 1, 2, ..., N_2.    (17)

Arranging Eqs. (15) and (16) and representing them in matrix form gives:

A^T D (A w_1 + e_1 b_1) + B^T F α = 0,    (18)

e_1^T D (A w_1 + e_1 b_1) + e_2^T F α = 0,    (19)

0 ≤ α ≤ e_2 C,    (20)

where D = diag(d_1, d_2, ..., d_{N_1}) (here, d_i ≥ 1, i = 1, 2, ..., N_1) and F = diag(f_1, f_2, ..., f_{N_2}). Clearly, f_j, j = 1, 2, ..., N_2, is either 0 or 1. Next, combining (18) and (19) leads to

[A  e_1]^T D [A  e_1] [w_1; b_1] + [B  e_2]^T F α = 0.    (21)

Defining H = [A  e_1] and G = [B  e_2], Eq. (21) becomes:

H^T D H [w_1; b_1] + G^T F α = 0,   i.e.,   [w_1; b_1] = −(H^T D H)^{−1} G^T F α.    (22)

Clearly, it is easy to check that the matrix H^T D H may be positive semi-definite. Using a similar method to avoid the possibility of ill-conditioning of this matrix, i.e., introducing a regularization term εI, ε > 0, it can be shown that the solution of (22) is

[w_1; b_1] = −(H^T D H + εI)^{−1} G^T F α.    (23)

From the above equation, we notice that the first plane (for class +1) can be produced only when the vector of Lagrange multipliers α is known. One problem is how to obtain such a vector. Fortunately, one way is to solve a convex quadratic programming problem (QPP). We now rewrite the Lagrangian function (14) as follows:

L = (1/2) Σ_{j=1}^{N_1} d_j (w_1^T x_j^(1) + b_1)^2 + C Σ_{j=1}^{N_2} ξ_j − Σ_{j=1}^{N_2} α_j (−f_j (w_1^T x_j^(2) + b_1) + ξ_j − f_j · 1) − Σ_{j=1}^{N_2} β_j ξ_j.    (24)

Simplifying the above expression and rewriting it in matrix form yields:

L = (1/2) (A w_1 + e_1 b_1)^T D (A w_1 + e_1 b_1) + C e_2^T ξ − α^T (−F (B w_1 + e_2 b_1) + ξ − F e_2) − β^T ξ.    (25)

We can use (23) and (25) to obtain the following convex QPP, which is the Wolfe dual (Mangasarian, 1994) of WLTSVM1:

(DWLTSVM1)   max_α   e_2^T F α − (1/2) α^T (F G)(H^T D H)^{−1} (F G)^T α,
             subject to 0 ≤ α ≤ C.    (26)

Note that WLTSVM1 multiplies the samples of class +1 by their corresponding weights, collected in the matrix D. Using a similar process, we consider the second optimization problem to estimate the plane of class −1 and obtain its dual as

(DWLTSVM2)   max_γ   e_1^T P γ − (1/2) γ^T (P H)(G^T Q G)^{−1} (P H)^T γ,
             subject to 0 ≤ γ ≤ C,    (27)

where Q = diag(q_1, q_2, ..., q_{N_2}) (i.e., the weight matrix of class −1) and P = diag(p_1, p_2, ..., p_{N_1}) (i.e., the weight matrix of class +1). Similarly, we can conclude that the plane of class −1 is determined by both the weighted samples of its own class and the margin points of class +1. In addition, it can be observed from QPPs (26) and (27) that the computational complexity of the training phase of WLTSVM is governed by the number of margin points. Furthermore, the augmented vector [w_2; b_2] is given by

[w_2; b_2] = (G^T Q G)^{−1} H^T P γ.    (28)

Once the augmented vectors of (23) and (28) are known, the two separating planes of (1) are obtained. The class of an unknown data point x ∈ R^n is determined as

class(x) = argmin_{l=1,2} |x^T w_l + b_l|,    (29)

where | · | denotes the perpendicular distance of the point x from the plane. Following the same idea, we extend WLTSVM to its nonlinear version.

3.2. Nonlinear WLTSVM (NWLTSVM)

In the real world, classification problems cannot always be handled using linear methods. Thus, we extend WLTSVM to a nonlinear version by considering the following kernel-generated surfaces instead of planes:

K(x) a_1 + b_1 = 0   and   K(x) a_2 + b_2 = 0,    (30)

where

K(x) = [K(x_1, x), K(x_2, x), ..., K(x_N, x)]^T,    (31)

and K stands for an arbitrary kernel. The optimization problem for NWLTSVM1, as opposed to the primal one (13) in the input space, can be reformulated using a similar argument as

(NWLTSVM1)   min   (1/2) Σ_{j=1}^{N_1} d_j (a_1^T K(x_j^(1)) + b_1)^2 + C Σ_{j=1}^{N_2} ξ_j
             s.t.   −f_j (a_1^T K(x_j^(2)) + b_1) + ξ_j ≥ f_j · 1,   ξ_j ≥ 0,    (32)

where d_j and f_j are defined as in the linear case.
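Combining (23) and (26)-(29), a minimal sketch of linear WLTSVM training and prediction could look as follows. It reuses the hypothetical helpers solve_box_qp and wltsvm_weights from the sketches above, applies the regularization term εI of (23) before each inversion, and is only an illustration of the derivation, not the authors' MATLAB code.

```python
import numpy as np

def wltsvm_train(A, B, C, k1, k2, eps=1e-6):
    """Linear WLTSVM: plane (w1, b1) for class +1 and plane (w2, b2) for class -1."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    H = np.hstack([A, e1])                         # H = [A e1]
    G = np.hstack([B, e2])                         # G = [B e2]
    # Plane of class +1: similarity weights D for class +1, margin flags F for class -1.
    d, f = wltsvm_weights(A, B, k1, k2)
    F = np.diag(f)
    HDH_inv = np.linalg.inv(H.T @ np.diag(d) @ H + eps * np.eye(H.shape[1]))
    FG = F @ G
    # Dual (26): max (F e2)^T a - 0.5 a^T (FG)(H^T D H)^-1 (FG)^T a,  0 <= a <= C
    alpha = solve_box_qp(FG @ HDH_inv @ FG.T, f, C)
    u1 = -HDH_inv @ G.T @ F @ alpha                # Eq. (23): [w1; b1]
    # Plane of class -1: swap the roles of the two classes (weights Q, flags P).
    q, p = wltsvm_weights(B, A, k1, k2)
    P = np.diag(p)
    GQG_inv = np.linalg.inv(G.T @ np.diag(q) @ G + eps * np.eye(G.shape[1]))
    PH = P @ H
    # Dual (27): max (P e1)^T g - 0.5 g^T (PH)(G^T Q G)^-1 (PH)^T g,  0 <= g <= C
    gamma = solve_box_qp(PH @ GQG_inv @ PH.T, p, C)
    u2 = GQG_inv @ H.T @ P @ gamma                 # Eq. (28): [w2; b2]
    return (u1[:-1], u1[-1]), (u2[:-1], u2[-1])

def wltsvm_predict(x, plane1, plane2):
    """Decision rule (29): assign x to the class of the nearer plane."""
    (w1, b1), (w2, b2) = plane1, plane2
    d1 = abs(x @ w1 + b1) / np.linalg.norm(w1)     # perpendicular distance to plane 1
    d2 = abs(x @ w2 + b2) / np.linalg.norm(w2)
    return +1 if d1 <= d2 else -1
```

Note that wltsvm_weights is called twice with the class roles swapped, mirroring how (27)-(28) are obtained from (26) and (23).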


Table 1
Classification comparison for SVM, GEPSVM, TWSVM and WLTSVM with linear kernels. Bold values denote a comparable or better classification performance compared to TWSVM, an underlined value indicates the best classification performance among all algorithms, and an asterisk (*) denotes a significant difference from WLTSVM. The execution time includes 10-fold training. For WLTSVM, the execution time includes KNN identification. Each cell lists Test±Std (%), Time (s), t-value; for WLTSVM, Test±Std (%) and Time (s).

Cross planes (169 × 2): SVM 68.05±13.333*, 13.5936, 7.997 | GEPSVM 100±0.000, 0.0564, 0.517 | TWSVM 100±0.000, 2.8086, 0.517 | WLTSVM 99.41±2.926, 0.4444
Wpbc (194 × 33): SVM 76.15±10.947, 21.008, 0.437 | GEPSVM 77.71±10.596, 0.0817, 0.237 | TWSVM 76.66±11.253, 5.8199, 0.246 | WLTSVM 77.21±10.026, 1.9486
Cleveland (297 × 13): SVM 85.54±4.872, 59.1384, 1.000 | GEPSVM 84.87±5.225, 0.0622, 0.544 | TWSVM 81.43±6.523*, 15.6995, 2.366 | WLTSVM 85.54±5.488, 2.2031
Germ (1000 × 24): SVM 74.42±3.146*, 1879.54, 3.819 | GEPSVM 70.4±3.718*, 0.1031, 5.623 | TWSVM 78.30±3.267, 448.8515, 0.957 | WLTSVM 77.20±4.184, 112.2999
CMC (1473 × 9): SVM 67.62±2.788, 3982.395, 0.182 | GEPSVM 67.88±4.531, 0.0631, 0.724 | TWSVM 67.28±3.246, 1028.0844, 0.281 | WLTSVM 67.21±3.093, 379.1159
Hepatitis (155 × 19): SVM 79.29±12.64, 12.3903, 1.038 | GEPSVM 59.25±10.337*, 0.0690, 5.383 | TWSVM 76.75±12.00*, 3.8189, 2.166 | WLTSVM 82.50±8.162, 1.2537
Bupa Liver (345 × 6): SVM 66.43±11.734, 84.41, 0.812 | GEPSVM 53.55±12.645*, 2.810, 0.729 | TWSVM 67.76±8.951, 15.9778, 1.334 | WLTSVM 68.77±8.951, 8.7856
Heart-Statlog (270 × 14): SVM 68.89±12.614, 40.001, 3.951 | GEPSVM 71.85±8.936*, 0.0580, 4.825 | TWSVM 82.96±6.807*, 8.4651, 1.791 | WLTSVM 84.07±6.061, 4.9817
House (506 × 13): SVM 81.64±7.536*, 201.298, 3.9815 | GEPSVM 72.34±6.034*, 0.0612, 6.144 | TWSVM 83.19±5.378, 44.27 | WLTSVM 84.60±5.252, 9.4579
Check (1000 × 2): SVM 51.40±1.985, 1228.4283, 1.2779 | GEPSVM 49.29±3.124*, 0.0219, 4.824 | TWSVM 50.1±4.067*, 288.18, 1.964 | WLTSVM 52.80±6.083, 23.7234
Pima (768 × 8): SVM 76.68±5.445, 658.8209, 0.000 | GEPSVM 73.04±4.488*, 0.0642, 3.063 | TWSVM 76.68±6.051, 178.2113, 0.000 | WLTSVM 76.68±3.509, 42.5632
Australian (690 × 14): SVM 62.89±5.885*, 516.2083, 9.970 | GEPSVM 62.32±5.379*, 0.0755, 12.377 | TWSVM 84.20±6.639*, 114.0914, 1.576 | WLTSVM 85.94±5.161, 43.1087
Wdbc (569 × 30): SVM 91.39±1.041*, 347.8059, 3.736 | GEPSVM 93.148±3.643*, 0.0998, 2.074 | TWSVM 94.56±3.736*, 78.8869, 1.812 | WLTSVM 96.67±4.562, 2.4251
Votes (435 × 16): SVM 95.88±3.184, 159.155, 1.000 | GEPSVM 95.65±3.122, 0.0659, – | TWSVM 94.27±2.876, 34.235, 1.505 | WLTSVM 95.65±3.122, 2.7159
Haberman (306 × 3): SVM 72.85±7.609, 58.39, 0.808 | GEPSVM 70.91±8.1355, 0.0571, 0.9000 | TWSVM 72.52±6.657*, 16.4192, 1.795 | WLTSVM 73.83±6.929, 3.6131

NWLTSVM also measures the distance between pairs of samples using the standard Euclidean metric, but the distance is computed in the feature space rather than in the input space used in the linear case. Using a similar process to the linear case, we obtain the following Wolfe dual QPP of NWLTSVM1:

(DNWLTSVM1)   max_α   e_2^T F α − (1/2) α^T (F S)(R^T D R)^{−1} (F S)^T α,
              subject to 0 ≤ α ≤ C,    (33)

where R = [K(A)  e_1] and S = [K(B)  e_2]. The augmented vector is

[a_1; b_1] = −(R^T D R)^{−1} S^T F α.    (34)

In a similar way, we can obtain the Wolfe dual of the optimization problem (NWLTSVM2), which yields the second surface of (30), by reversing the roles of K(A) and K(B) in (33):

max_γ   e_1^T P γ − (1/2) γ^T (P R)(S^T Q S)^{−1} (P R)^T γ,
subject to 0 ≤ γ ≤ C.    (35)

Here, the augmented vector [a_2; b_2] is given by

[a_2; b_2] = (S^T Q S)^{−1} R^T P γ.

Specification of the matrices D, S, P and Q is analogous to the linear WLTSVM.
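For the kernel case the data matrices are simply replaced by kernel matrices evaluated against the whole training set, R = [K(A) e1] and S = [K(B) e2], and, as noted above, the neighbor distances are taken in feature space. The sketch below is our own illustration for a Gaussian kernel (the kernel used in Section 4); knn_mask and solve_box_qp are the hypothetical helpers introduced earlier, and only the first surface (33)-(34) is shown.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """K(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def nwltsvm_train_plane1(A, B, C, k1, k2, sigma, eps=1e-6):
    """Kernel surface K(x) a1 + b1 = 0 for class +1, following Eqs. (32)-(34)."""
    T = np.vstack([A, B])                            # all N training points
    R = np.hstack([gaussian_kernel(A, T, sigma), np.ones((A.shape[0], 1))])  # R = [K(A) e1]
    S = np.hstack([gaussian_kernel(B, T, sigma), np.ones((B.shape[0], 1))])  # S = [K(B) e2]
    # Feature-space squared distance: ||phi(x)-phi(y)||^2 = 2 - 2 K(x, y) for a Gaussian kernel.
    dAA = 2.0 - 2.0 * gaussian_kernel(A, A, sigma)
    np.fill_diagonal(dAA, np.inf)
    nn = knn_mask(dAA, k1)
    d = (nn | nn.T).sum(axis=0).astype(float)        # d_j, as in the linear case
    dAB = 2.0 - 2.0 * gaussian_kernel(A, B, sigma)
    f = knn_mask(dAB, k2).any(axis=0).astype(float)  # margin flags f_j
    D, F = np.diag(d), np.diag(f)
    RDR_inv = np.linalg.inv(R.T @ D @ R + eps * np.eye(R.shape[1]))
    FS = F @ S
    # Dual (33): max (F e2)^T a - 0.5 a^T (FS)(R^T D R)^-1 (FS)^T a,  0 <= a <= C
    alpha = solve_box_qp(FS @ RDR_inv @ FS.T, f, C)
    u = -RDR_inv @ S.T @ F @ alpha                   # Eq. (34): [a1; b1]
    return u[:-1], u[-1]
```

A test point x is then classified, in analogy to (29), by evaluating |K(x)^T a_l + b_l| for l = 1, 2, where K(x) is built against the same training set T.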


Table 2
Comparison of accuracy and computing time for nonlinear SVM, GEPSVM, TWSVM and WLTSVM. Bold values denote a better classification performance compared to TWSVM. Underlined values indicate the best classification performance among all algorithms, and an asterisk (*) denotes a significant difference from WLTSVM. Execution time includes 10-fold training. For WLTSVM, the execution time still includes KNN finding. Each cell lists Test±Std (%), Time (s), t-value; for WLTSVM, Test±Std (%) and Time (s).

Wpbc (194 × 33): SVM 76.13±12.302, 16.8766, 0.245 | GEPSVM 57.03±23.482*, 3.1991, 2.261 | TWSVM 75.63±13.611, 6.6434, 0.455 | WLTSVM 76.68±12.642, 3.1492
Votes (435 × 16): SVM 93.57±2.994, 86.6706, 0.389 | GEPSVM 83.95±13.001*, 42.4585, 2.715 | TWSVM 94.96±2.796, 55.9721, 1.718 | WLTSVM 94.08±3.068, 6.723
Spectf (267 × 22): SVM 79.76±5.988*, 43.9105, 1.869 | GEPSVM 46.55±26.977*, 9.5762, 4.526 | TWSVM 81.65±5.459*, 19.8161, 1.877 | WLTSVM 83.13±6.484, 3.855
House (506 × 13): SVM 81.06±6.637*, 219.6367, 1.805 | GEPSVM 51.2±16.192, 70.0649, 6.101 | TWSVM 82.60±5.667, 62.1517, 0.972 | WLTSVM 83.61±5.784, 15.9837
Monk3 (432 × 6): SVM 96.57±0.6760*, 493.78, 2.581 | GEPSVM 96.93±0.9169*, 67.57, 2.482 | TWSVM 98.19±0.7263, 71.9384, 0.128 | WLTSVM 98.01±0.9201, 34.0438
Hepatitis (155 × 19): SVM 78.63±4.8279*, 11.4580, 2.542 | GEPSVM 70.79±24.099*, 0.7256, 1.991 | TWSVM 77.25±12.867*, 4.579, 2.725 | WLTSVM 83.21±8.301, 2.1592
Tic-Tac-Toe (958 × 10): SVM 79.22±13.742*, 1496.136, 4.780 | GEPSVM 81.74±11.883*, 543.4867, 4.859 | TWSVM 100±0.000, 530.6697, – | WLTSVM 100±0.000, 201.921
Wdbc (569 × 30): SVM 91.22±2.974*, 302.1462, 6.016 | GEPSVM 82.86±17.8369*, 101.9985, 2.564 | TWSVM 94.55±3.829*, 83.7327, 2.081 | WLTSVM 96.13±3.181, 9.3191
Check (1000 × 2): SVM 95.80±0.6782, 1379.6028, 1.6528 | GEPSVM 69.00±14.561, 467.3108, 4.704 | TWSVM 92.20±2.5734, 358.534, 0.281 | WLTSVM 92.80±1.7205, 55.1811
Heart-stat (270 × 14): SVM 65.56±9.728*, 37.356, 4.174 | GEPSVM 46.29±12.977*, 7.1431, 7.646 | TWSVM 82.22±6.939, 9.810, 0.480 | WLTSVM 82.96±9.271, 5.2121

Now we analyze the time complexity of WLTSVM. We consider the case in which N is considerably greater than the dimensionality n of the patterns of class +1 and class −1, and suppose that N_1 = N_2, as described by Jayadeva et al. (2007a, 2007b). Let n_1 and n_2 be the numbers of class +1 and class −1 margin points, respectively. WLTSVM essentially performs regression and finds the solution directly by solving two dual QPPs that are smaller than those of TWSVM. The major computation in WLTSVM involves two steps: (1) construction of the k-nearest neighbor graph, which costs about N² log(N) for all N training points (Cai, He, Zhang, & Han, 2007b) and is used to compute the weight matrices D and Q and to select the margin points of the respective classes; and (2) optimization of WLTSVM, which costs around O(2 n_1³) under the assumption that n_1 = n_2, where n_1, n_2 ≪ N. Thus, the overall computational complexity of WLTSVM is about N² log(N) + 2 n_1³, i.e., N² log(N). Clearly, NWLTSVM has the same time complexity as the linear case.

4. Experimental results and analysis

In the experiments, for each SVM algorithm with a nonlinear kernel, a Gaussian kernel (i.e., K(x_i, x_j) = exp(−||x_i − x_j||²/2σ²)) was selected. We implemented all algorithms in MATLAB 7.1 and carried out the experiments on a PC with an Intel (R) Core 2 Duo processor (2.79 GHz) and 3 GB of RAM.

4.1. Experiments with the UCI database

To investigate the classification effectiveness of WLTSVM, we first conducted experiments on UCI data sets (Muphy & Aha, 1992), including CMC, Cleveland, Hepatitis, Pima, Australian and Votes, and the synthetic Cross Planes data set (Table 1). For the CPU time and accuracy of SVM and TWSVM, we used the MATLAB code from the Gunn SVM toolbox (Gunn, 1998) and the optimizer routine ''qp'' from the same toolbox, respectively, as suggested by Jayadeva et al. (2007a, 2007b). We implemented our WLTSVM method in a similar way to TWSVM. GEPSVM was implemented using the standard MATLAB function ''eig''. The classification accuracy of each algorithm was estimated using the 10-fold cross-validation methodology. Table 1 shows the linear kernel comparison of WLTSVM versus SVM, GEPSVM and TWSVM. We compared them through 10-fold cross-validation, and the optimal values of the regularization and penalty parameters were selected from the set {2^i | i = −7, −6, ..., +7} using 10% of each training fold, as in Jayadeva et al. (2007a, 2007b) and Mangasarian and Wild (2006). The neighborhood size k_1 was searched in the range from two to eight, and k_2 was set to 3. We also performed paired t-tests on the 10-fold classification results to assess the statistical significance of the differences with respect to WLTSVM. Here, the significance level was set to 0.05, which implies a significant difference between two classification accuracies when the t-value exceeds 1.7341. We computed the means and standard errors of the results. The effectiveness of TWSVM over GEPSVM and SVM has already been reported in the literature (Jayadeva et al., 2007a, 2007b), but Table 1 reveals that the accuracy of WLTSVM is similar to or better than that of TWSVM, GEPSVM and SVM for the standard data sets. Moreover, for the Cross Planes data set, WLTSVM works as well as TWSVM and GEPSVM and significantly outperforms single-plane classifiers such as SVM.
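As a concrete illustration of this protocol, the following sketch wires the earlier hypothetical wltsvm_train/wltsvm_predict helpers into a plain 10-fold cross-validation loop and the {2^i} parameter grid. It is a simplified stand-in for the MATLAB experiment scripts, and details such as stratification and the 10% tuning subset within each training fold are omitted.

```python
import numpy as np

def ten_fold_accuracy(X, y, C, k1, k2, seed=0):
    """Mean and std of 10-fold test accuracy for the linear WLTSVM sketch (labels y in {+1, -1})."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 10)
    accs = []
    for i in range(10):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(10) if j != i])
        A = X[train][y[train] == +1]               # class +1 training samples
        B = X[train][y[train] == -1]               # class -1 training samples
        p1, p2 = wltsvm_train(A, B, C, k1, k2)
        pred = np.array([wltsvm_predict(x, p1, p2) for x in X[test]])
        accs.append(np.mean(pred == y[test]))
    return np.mean(accs), np.std(accs)

# Penalty parameter searched over {2^i | i = -7, ..., +7}; k1 searched in 2..8, k2 fixed to 3.
C_grid = [2.0 ** i for i in range(-7, 8)]
k1_grid = range(2, 9)
```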


Fig. 1. Classification accuracy versus the neighbor size k2 .

Regarding the 10-fold computing times shown in Table 1, it should be noted that WLTSVM is considerably faster than TWSVM and SVM. As discussed above, WLTSVM, which exploits similarity information, requires a shorter training time than TWSVM and SVM. For example, the computing time for CMC is 379.1159 s for WLTSVM, 1028.0844 s for TWSVM and 3982.395 s for SVM.

Table 2 compares the classification accuracy and computing time of the nonlinear SVM, GEPSVM, TWSVM and WLTSVM on ten UCI data sets. The parameter σ was obtained by searching from 2^−4 to 2^4. The effectiveness of nonlinear TWSVM over GEPSVM and SVM has already been reported in the literature (Arun Kumar & Gopal, 2009; Jayadeva et al., 2007a, 2007b), but Table 2 shows that the accuracy of nonlinear WLTSVM is comparable to or better than that of TWSVM. In fact, the accuracy of nonlinear WLTSVM is slightly better than that of the other classifiers, including TWSVM, on many data sets. It is also evident that nonlinear WLTSVM is the fastest among all the algorithms on most data sets, and it outperforms GEPSVM in terms of accuracy. In contrast to the linear case, the computing time of GEPSVM is comparable to that of TWSVM and longer than that of WLTSVM. This is because GEPSVM needs to solve two eigenvalue problems of dimension (N + 1) × (N + 1) in the nonlinear case.

Next, we show the classification performance of linear WLTSVM on five data sets (Australian, Wdbc, Pima, Bupa Liver, and Votes) as a function of the value of k_2 in Fig. 1. The classification results for SVM, GEPSVM and TWSVM are also shown (note that each of them should have the same result for different k_2 values, since they do not use graph-based techniques). The same procedures were used for all parameters. The value of k_2 directly influences the computing time of WLTSVM. Fig. 2 shows that the WLTSVM computing time for the Pima and Australian data sets increases with k_2. To reduce the computational cost of WLTSVM, k_2 should be set to a small value. The results show that the classification accuracy of WLTSVM is always comparable to or better than that of the other methods for each value of k_2. This feature means that WLTSVM is very effective even if k_2 is selected within a small range. In the above experiments, we set it to 3, and the results confirm that WLTSVM is effective.

Fig. 2. Computing time versus the neighbor size k2.

Table 3
Computing time comparison (Time in s) as a function of the percentage of training samples.
Training samples (%): 20, 30, 40, 50, 60, 70
GEPSVM: 0.0909, 0.0952, 0.1021, 0.0903, 0.095, 0.1031
TWSVM: 30.7285, 91.438, 200.1461, 385.3144, 687.0183, 1159.379
WLTSVM: 3.952, 9.2665, 18.6796, 29.6259, 41.5984, 60.2981

4.2. Traversability recognition

The analysis of outdoor terrain images for mobile robot navigation is very challenging. The terrain images used in this experiment were collected in the Unmanned Ground Vehicle Navigating Village Road Environment Project (No. 90820306). In these images, four terrain classes can be defined: sky, land, grass/bush and tree. Using the patch-based method (Michael, Jane, & Greg, 2009), we perform a traversability classification task. The task here is to recognize traversable regions, i.e., to separate land from bush/grass, sky and tree, which robots cannot traverse. Obviously, a standard binary classification data set can thus be created. We constructed a terrain data set (TERR) that contains 1750 patches, where the size of each terrain patch is 16 × 16. Some sample patches of the four terrain types are shown in Fig. 3. Direct classification of outdoor terrain in an unstructured environment is difficult, so we used the effective LBP8,1 texture features (Matti, Tomi, Topi, & Markus, 2005; Ojala, Pietikäinen, & Mäenpää, 2002) to represent each patch; each patch is therefore described by a feature vector of size 59 (a minimal sketch of this feature extraction is given after Fig. 3). In our experiments, a random subset with l (= 20%, 30%, 40%, 50%, 60%, 70%) of the samples of the TERR data set was selected for training and the rest for testing. Fig. 4 shows the recognition rates versus the percentage of training samples. For each given l, we averaged the results over 10 random splits. Table 3 reports the running time. The parameter setting follows the same procedure as in the linear case. As discussed in Section 4.1, SVM has a much higher computational cost than TWSVM and WLTSVM, so we did not include it. Linear kernels were used since they have found widespread application in terrain classification, for example in Michael et al. (2009). From Fig. 4 and Table 3, two main points are evident. First, WLTSVM has comparable or better classification performance. Second, the computing time of TWSVM increases sharply with the number of training samples. Although this phenomenon also occurs for WLTSVM, it is obviously more efficient than TWSVM. As a result, WLTSVM is a useful classification tool.

Fig. 3. Patches of each of the classes in the data set.
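A minimal sketch of the patch feature extraction described above, using the local_binary_pattern routine from scikit-image: the non-rotation-invariant uniform LBP with P = 8, R = 1 yields exactly 59 distinct codes, matching the feature size of 59 quoted in the text. The exact LBP variant and histogram normalization used by the authors may differ, so treat this as an assumption.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature(patch, P=8, R=1):
    """59-bin LBP_{8,1} histogram for one 16x16 grayscale terrain patch (assumed variant)."""
    codes = local_binary_pattern(patch, P, R, method='nri_uniform')
    n_bins = P * (P - 1) + 3                     # 59 distinct codes for P = 8
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / max(hist.sum(), 1)             # normalized histogram as the patch descriptor
```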


Fig. 4. Classification accuracy as a function of the percentage of training samples.

5. Conclusions and future work

The original motivation for GEPSVM was to solve XOR problems, which are difficult classification cases for typical linear classifiers. Furthermore, GEPSVM is considerably faster than SVM. A major disadvantage of GEPSVM is that its generalization is not guaranteed, owing to the possible singularity of the matrices involved. Recently, many multi-plane learning algorithms have been developed to improve the generalization of GEPSVM. Among these, the stand-alone TWSVM has attracted much attention. Although it retains the superior characteristics of standard SVM and GEPSVM and performs better than standard SVM in terms of both classification accuracy and computing time, it fails to exploit the local sample geometry that may be important for classification effectiveness. Moreover, its solution is obtained by solving two QPPs; the resulting high time complexity is a limitation for large-scale data classification problems. To address the above issues, by reformulating the primal problem of TWSVM we proposed a novel WLTSVM classifier that explores the similarity information between pairs of samples. Experimental results revealed that WLTSVM is better than TWSVM in terms of both classification effectiveness and computational cost. However, although WLTSVM runs faster than TWSVM, a limitation is that it cannot handle large-scale problems, since it has to find the k-nearest neighbors of all the samples. Moreover, the selection of the parameters k_1 and k_2 is an open problem. Thus, further work will include how to couple a chunking algorithm with WLTSVM to solve real large-scale classification problems that may not fit in memory. The selection of k_1 and k_2 is also a main topic for future work.

Acknowledgments

The authors are very grateful to the Scientific Foundation of Jiangsu Province (BK2009393), the Research Foundation for the Doctoral Program of Higher Education of China (20093219120025), and the National Science Foundation of China (61101197) for support.

References

Arun Kumar, M., & Gopal, M. (2009). Least squares twin support vector machines for pattern classification. Expert Systems with Applications, 36, 7535–7543.
Cai, D., He, X. F., Zhang, W. V., & Han, J. W. (2007b). Regularized locality preserving indexing via spectral regression. In Proceedings of the 2007 ACM international conference on information and knowledge management, CIKM'07.
Cai, D., He, X., Zhou, K., Han, J., & Bao, H. (2007a). Locality sensitive discriminant analysis. In IJCAI (pp. 708–713).
Chung, F. R. K. (1997). Spectral graph theory. In Regional Conference Series in Mathematics, number 92.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.
Fung, G., & Mangasarian, O. L. (2001). Proximal support vector machine classifiers. In F. Provost & R. Srikant (Eds.), Proceedings KDD-2001: knowledge discovery and data mining (pp. 77–86).
Ghorai, S., Mukherjee, A., & Dutta, P. K. (2009). Nonparallel plane proximal classifier. Signal Processing, 510–522.
Guarracino, M. R., Cifarelli, C., Seref, O., & Pardalos, P. M. (2007). A classification method based on generalized eigenvalue problems. Optimization Methods and Software, 22(1), 73–81.
Gunn, S. R. (1998). Support vector machine Matlab toolbox. http://www.isis.ecs.soton.ac.uk/resources/svminfo/.
Jayadeva, Khemchandai, R., & Chandra, S. (2005). Fuzzy proximal support vector classification via generalized eigenvalues. Lecture Notes in Computer Science, 3776, 360–363.
Jayadeva, Khemchandai, R., & Chandra, S. (2007a). Twin support vector machines for pattern classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(5), 905–910.
Jayadeva, Khemchandai, R., & Chandra, S. (2007b). Fuzzy multi-category proximal support vector classification via generalized eigenvalues. Soft Computing, 11, 685–769.
Lafon, S., Keller, Y., & Coifman, R. R. (2006). Data fusion and multicue data matching by diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28(11), 1784–1797.
Mangasarian, O. L. (1994). Nonlinear programming. SIAM.
Mangasarian, O., & Wild, E. (2006). Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1), 69–74.
Matti, P., Tomi, N., Topi, M., & Markus, T. (2005). View-based recognition of real-world textures. Pattern Recognition, 37(2), 313–323.
Michael, J. P., Jane, M., & Greg, G. (2009). Learning terrain segmentation with classifier ensembles for autonomous robot navigation in unstructured environments. Journal of Field Robotics.
Muphy, P. M., & Aha, D. W. (1992). UCI repository of machine learning databases.
Ojala, T., Pietikäinen, M., & Mäenpää, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Wu, Y., Ianakiev, K., & Govindaraju, V. (2002). Improved k-nearest neighbor classification. Pattern Recognition, 35, 2311–2318.
Xue, H., Chen, S. C., & Yang, Q. (2009). Discriminatively regularized least-squares classification. Pattern Recognition, 42(1), 93–104.
Yang, X. B., Chen, S. C., Chen, B., & Pan, Z. S. (2009). Proximal support vector machine using local information. Neurocomputing, 73(1–3), 357–365.
Yan, S., Xu, D., Zhang, B., & Zhang, H.-J. (2005). Graph embedding: a general framework for dimensionality reduction. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition, CVPR 2005.
Ye, Q. L., Zhao, C. X., & Ye, N. (2010b). Localized twin SVM via convex minimization. Neurocomputing, 74(4), 580–587.
Ye, Q., Zhao, C., Ye, N., & Chen, Y. (2010a). Multi-weight vector projection support vector machines. Pattern Recognition Letters, 31(13), 2006–2011.