A regression-type support vector machine for k-class problem

Zeqian Xu (a), Tongling Lv (a,1), Liming Liu (b), Zhiqiang Zhang (c,*), Junyan Tan (a,*)

(a) College of Science, China Agricultural University, Beijing 100083, China
(b) School of Statistics, Capital University of Economics and Business, Beijing 100081, China
(c) School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China

* Corresponding authors. E-mail addresses: [email protected] (Z. Zhang), [email protected] (J. Tan).
1 Co-first author.

Article info

Article history: Received 3 March 2018; Revised 26 October 2018; Accepted 15 February 2019; Available online 22 February 2019. Communicated by Dr. Yingjie Tian.

Keywords: Multi-class problems; Regression; Support vector machines; K-SVCR

Abstract

In this paper, a new method for multiclass classification is proposed, named regression-based support vector machine (RBSVM). RBSVM adopts the rest-versus-1-versus-1 strategy. When constructing each classification hyperplane, a mixed regression and classification formulation is used: the training set is divided into a positive class, a negative class and a rest class, and the regression hyperplane is required to pass through the rest class while simultaneously classifying the positive and negative classes as correctly as possible. Compared with some existing multiclass classification methods, the proposed method not only improves the classification accuracy but also reduces the computational complexity. The numerical results show the effectiveness of our method.

1. Introduction

Vapnik [1,2] proposed the standard support vector machine (SVM) for the binary classification problem, with the aim of finding a hyperplane that separates the examples correctly. SVM is indeed distinguished for binary classification. However, in the real world there are many multi-class classification problems, e.g. fault diagnosis, disease classification, etc. How to use SVM well for the multi-class classification problem has attracted researchers' interest, and in recent years more and more SVM-based methods have been developed [3–7]. There are two frameworks for multi-class support vector machines. One is the "all-together" strategy, the other is the "decomposition-reconstruction" strategy. For the former, all the discriminant functions are obtained by solving a single optimization problem, as in M-SVM [8] and the Crammer–Singer method [9]. For the latter, the discriminant functions are obtained by solving a series of binary classification problems. There are several "decomposition-reconstruction" strategies. The first one is "1-versus-rest" [10], which constructs K classifiers. When constructing the kth classifier, the examples belonging to the kth class are viewed as positive class examples and all the other examples are viewed as the negative class; then binary SVM is used to build the classifier. A new example is assigned a label according to


the "winner-takes-all" scheme. The shortcoming of "1-versus-rest" is that almost all of the binary problems are unbalanced. The second one is the method proposed by Yang [13] in 2013, called the multiple-birth SVM (MBSVM). MBSVM belongs to the "1-versus-rest" family, but it is different from the "1-versus-rest" method mentioned above. When constructing the kth hyperplane, the examples belonging to the kth class are viewed as the positive class and all the other examples form the rest class; the hyperplane is required to pass through the examples of the rest class and to be as far as possible from the positive class. A new example is assigned the label of the hyperplane from which it lies farthest. The third one is "1-versus-1" [11], which constructs K(K−1)/2 binary classifiers, each involving only two classes of examples, one class with positive labels and the other with negative labels. For a new example, its label is decided by the "voting" scheme. The shortcoming of "1-versus-1" is that the information of the remaining examples is omitted in each binary classification. In order to take advantage of the information of all examples in each classifier, Angulo et al. [12] proposed a new method based on support vector classification and regression for the K-class classification problem, called "K-SVCR". K-SVCR takes the "1-versus-1-versus-rest" strategy and constructs K(K−1)/2 sub-classifiers. When constructing each sub-classifier, K-SVCR uses the information of all examples in the training set. The original training set is divided into three parts: the positive class (the examples belonging to the ith class of the original training set), the negative class (the examples belonging to the jth class) and the rest class (all the other examples of the original training set except the ith and jth classes). K-SVCR aims at finding a sub-classifier that can classify the


positive class and negative class correctly while the examples belonging to the rest class are restricted within an ε-insensitive band. For a new example, its label is also decided by the "voting" scheme. Although K-SVCR does not lose the information of the examples, the number of constraints of its optimization problems increases considerably, which leads to long running times for solving the problems. In order to avoid increasing the scale of the optimization problem while still using the examples efficiently, we propose a new method, closely related to regression, for the K-class classification problem. In our method, K(K−1)/2 sub-classifiers are constructed. For each sub-classifier, a hyperplane is obtained by solving an optimization problem. It should be pointed out that, compared with standard SVM, the optimization problem here only adds an extra square term. Therefore, our method takes full advantage of all examples compared with 1-versus-1 SVM, and makes more efficient use of the examples compared with MBSVM and 1-versus-rest SVM. In addition, the scale of our optimization problem is much smaller than that of K-SVCR.

This paper is organized as follows. In Section 2, we give a brief introduction to SVM, MBSVM and K-SVCR. Our new method, RBSVM, is proposed in Section 3. Numerical experiments on several real and synthetic data sets are conducted in Section 4. We summarize this paper in Section 5.

2. Related work

In this section, we briefly introduce some work related to this paper, namely SVM, K-SVCR and MBSVM.

2.1. Support vector machine

Consider the binary classification problem with the training set

T = {(x1, y1), ..., (xl, yl)},                                        (1)

where xi ∈ Rⁿ is an example and yi ∈ {+1, −1} is the corresponding label of xi. SVM constructs a hyperplane w·x + b = 0 by maximizing the margin between the two support hyperplanes w·x + b = −1 and w·x + b = 1. The hyperplane w·x + b = 0 should separate the training examples as correctly as possible. SVM obtains this hyperplane by solving the following quadratic program:

min_{w,b,ξ}  (1/2)||w||² + C Σ_{i=1}^{l} ξi
s.t.  yi(w·xi + b) ≥ 1 − ξi,  i = 1, ..., l,
      ξi ≥ 0,  i = 1, ..., l,                                         (2)

where C > 0 is a parameter that balances the margin and the classification error, and ξi ≥ 0 is a slack variable measuring the classification loss of example xi. Usually, SVM obtains the optimal hyperplane by solving the following dual problem:

max_{α}  Σ_{i=1}^{l} αi − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} αi αj yi yj xiᵀxj
s.t.  Σ_{i=1}^{l} αi yi = 0,
      0 ≤ αi ≤ C,  i = 1, ..., l.                                     (3)

More details of SVM can be found in the literature [14].
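As a brief illustration (not part of the original paper), the dual (3) is a standard quadratic program and can be handed to any off-the-shelf QP solver. The sketch below assumes the cvxopt package is available; all function and variable names are ours.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def svm_dual(X, y, C):
    """Solve the SVM dual (3) with a linear kernel and recover (w, b)."""
    l = X.shape[0]
    K = X @ X.T                                          # Gram matrix x_i^T x_j
    P = matrix((np.outer(y, y) * K).astype(float))       # quadratic term of (3)
    q = matrix(-np.ones(l))                              # maximizing sum(alpha) = minimizing -sum(alpha)
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))       # encodes 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A = matrix(y.reshape(1, -1).astype(float))           # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                                  # w = sum_i alpha_i y_i x_i
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)             # margin support vectors (assumed non-empty)
    b0 = float(np.mean(y[sv] - X[sv] @ w))               # b from the KKT conditions
    return w, b0, alpha
```

For a nonlinear kernel, the Gram matrix K above would simply be replaced by the chosen kernel matrix.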

2.2. Multi-birth support vector machine

Consider a multi-classification problem with the training set

T = {(x1, y1), ..., (xl, yl)},                                        (4)

where xi ∈ Rⁿ is a pattern, yi ∈ {1, ..., K} is its class label and l is the number of examples. The multi-birth support vector machine (MBSVM) [13] adopts the idea of "rest-versus-1", and in total K hyperplanes are constructed. Each hyperplane is required to pass through the examples of the rest class and to be as far as possible from the positive class. The kth optimization problem is as follows:

min_{wk, bk, ξk}  (1/2)||Ck wk + ek1 bk||² + ak ek2ᵀ ξk
s.t.  (Ak wk + ek2 bk) + ξk ≥ ek2,  ξk ≥ 0,                            (5)

where the matrix Ak ∈ R^{lk×n} is composed of the examples belonging to the kth class, lk represents the number of examples in the kth class, Ck = [A1, A2, ..., Ak−1, Ak+1, ..., AK] ∈ R^{(l−lk)×n} is comprised of all the examples belonging to the rest class, ak is the penalty parameter, ek1 and ek2 are vectors of ones with appropriate dimensions, ξk is the slack variable measuring the classification loss of the examples, and k = 1, ..., K.

Now let us give the geometric interpretation of the optimization problem (5). Minimizing the first term in the objective function makes the hyperplane pass through the rest class uniformly; minimizing the second term minimizes the loss of misclassification; the constraints mean that the distance between the examples belonging to the kth class and the kth hyperplane should be at least 1. For a new example, its label is decided by the following discriminant function:

f(x) = arg max_{k=1,...,K}  (xᵀwk + bk) / ||wk||.                      (6)

Table 1. The architecture complexity and the computational complexity of several methods for a K-class classification problem with l data points.

Methods         QPs          Variables   Constraints   Complexity             Ratio
1-v-r           K            l           2l+1          o(K l³)                K³/(4(K−1))
1-v-1           K(K−1)/2     2l/K        4l/K          o(4(K−1) l³/K²)        1
KSVCR           K(K−1)/2     l           2l+1          o(K(K−1) l³/2)         K³/8
Linear MBSVM    K            l/K         l/K           o(l³/K²)               1/(4(K−1))
Linear RBSVM    K(K−1)/2     2l/K        4l/K          o(4(K−1) l³/K²)        1

Compared with "1-versus-rest" support vector machines, there are fewer constraints in MBSVM, so a smaller quadratic program needs to be solved for each class. Table 1 shows that the complexity of MBSVM is lower than that of "1-v-r" SVM.
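As a concrete illustration of the decision rule (6) (our own sketch, not code from the paper), suppose the K hyperplane parameters have already been obtained by solving (5); the prediction step is then a few lines of numpy:

```python
import numpy as np

def mbsvm_predict(x, W, b):
    """Decision rule (6): W is (K, n) with row k equal to w_k, b is (K,)."""
    scores = (W @ x + b) / np.linalg.norm(W, axis=1)   # (x^T w_k + b_k) / ||w_k||
    return int(np.argmax(scores)) + 1                  # classes are labeled 1, ..., K
```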

2.3. Support vector classification-regression machine for K-class (K-SVCR)

K-SVCR [12] adopts the "1-versus-1-versus-rest" strategy. In total K(K−1)/2 sub-classifiers are constructed. When constructing each sub-classifier, the training set is divided into three parts: positive class, negative class and rest class, and a mixed classification and regression machine is formulated. Firstly, the classification hyperplane w·x + b = 0 should separate the training examples belonging to the positive and negative classes as correctly as possible. Secondly, the examples of the rest class should lie in the ε-insensitive band of the hyperplane w·x + b = 0. It is important to note that ε must be less than 1 to avoid overlapping with the support hyperplanes w·x + b = −1 and w·x + b = 1. According to the idea mentioned above, the optimization problem of K-SVCR is formulated as follows:

min_{w,b,ξ,η,η*}  (1/2)||w||² + C Σ_{i=1}^{l1+l2} ξi + D Σ_{i=l1+l2+1}^{l} (ηi + ηi*)
s.t.  yi(w·xi + b) ≥ 1 − ξi,  i = 1, ..., l1+l2,
      −ε − ηi* ≤ w·xi + b ≤ ε + ηi,  i = l1+l2+1, ..., l,
      ξi ≥ 0,  i = 1, ..., l1+l2,
      ηi, ηi* ≥ 0,  i = l1+l2+1, ..., l,                              (7)

where ξi, ηi, ηi* are slack variables measuring the classification and regression losses, C > 0 and D > 0 are parameters that balance the margin, the classification error and the regression error, l1 represents the number of examples in the positive class and l2 represents the number of examples in the negative class. Now let us explain the optimization problem in detail. Minimizing the first term maximizes the margin between the two support hyperplanes w·x + b = −1 and w·x + b = 1. The second term minimizes the sum of the classification errors of the examples belonging to the positive and negative classes, and the third term minimizes the sum of the regression errors of the examples belonging to the rest class. The first group of constraints means that the examples belonging to the positive class should lie above the positive support hyperplane w·x + b = 1 and the examples belonging to the negative class should lie below the negative support hyperplane w·x + b = −1. The second group of constraints means that the examples belonging to the rest class should lie in the ε-insensitive band of w·x + b = 0. K-SVCR gets the optimal solution by solving the following dual problem instead of the primal problem:

min_{α,β,β*}  (1/2)|| Σ_{i=1}^{l1+l2} αi yi xi − Σ_{i=l1+l2+1}^{l} (βi − βi*) xi ||² − Σ_{i=1}^{l1+l2} αi + ε Σ_{i=l1+l2+1}^{l} (βi + βi*)
s.t.  Σ_{i=1}^{l1+l2} αi yi − Σ_{i=l1+l2+1}^{l} (βi − βi*) = 0,
      0 ≤ αi ≤ C,  i = 1, ..., l1+l2,
      0 ≤ βi, βi* ≤ D,  i = l1+l2+1, ..., l,                          (8)

where αi, βi, βi* are the Lagrange multipliers, ε > 0 is a parameter given in advance which, as mentioned earlier, must be less than 1, C > 0 and D > 0 are the parameters that balance the margin, the classification error and the regression error, and l1 (l2) represents the number of examples in the positive (negative) class.
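For clarity, the 1-versus-1-versus-rest partition that K-SVCR relies on (and that the method proposed in the next section reuses) can be written down explicitly. The following sketch is ours and purely illustrative; the function names do not come from the paper.

```python
import numpy as np

def split_one_vs_one_vs_rest(X, y, i, j):
    """Build the three groups for the (i, j) sub-classifier.

    Class i becomes the positive class (+1), class j the negative class (-1),
    and all remaining examples form the rest (zero) class.
    """
    pos = X[y == i]
    neg = X[y == j]
    rest = X[(y != i) & (y != j)]
    return pos, neg, rest

def all_pairs(K):
    """The K(K-1)/2 unordered class pairs, one sub-classifier per pair."""
    return [(i, j) for i in range(1, K + 1) for j in range(i + 1, K + 1)]
```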


3. Regression-based support vector machine (RBSVM)

Now we are in a position to present our RBSVM, including the linear RBSVM and the nonlinear RBSVM.

3.1. The linear RBSVM

Our new method takes the "rest-versus-1-versus-1" strategy. In total K(K−1)/2 sub-classifiers are constructed for the K-class problem and the final decision is made by the voting scheme. For constructing a sub-classifier, the training set (4) is first divided into three parts: the examples belonging to the ith class in (4) are viewed as the positive class with label 1, the examples belonging to the jth class in (4) are viewed as the negative class with label −1, and the examples belonging to the remaining classes are viewed as the zero class with label 0. The geometric explanation can be seen clearly in Fig. 1. In Fig. 1, the red dots on top are the examples with label 1, the green dots below are the examples with label −1 and the blue dots in the middle are the examples with label 0. The case in Fig. 1 corresponds to one sub-classifier. For this case, our linear RBSVM aims at finding a hyperplane w·x + b = 0 with the following properties: (i) the blue dots should be located on the hyperplane w·x + b = 0 as closely as possible; (ii) the red dots should be located to the upper right of the induced hyperplane w·x + b = 1; (iii) the green dots should be located to the lower left of the induced hyperplane w·x + b = −1.

Fig. 1. A toy example learned by the linear RBSVM.

Starting from the above idea, for the given data set (4) we need to find K(K−1)/2 hyperplanes, and the kth hyperplane wk·x + bk = 0 is obtained by solving the following optimization problem:

min_{wk, bk, ξk1, ξk2}  (1/2)||Ck wk + ek3 bk||² + Dk ek1ᵀ ξk1 + Ek ek2ᵀ ξk2 + (1/2)(||wk||² + bk²)
s.t.  (Ak wk + ek1 bk) + ξk1 ≥ ek1,  ξk1 ≥ 0,
      −(Bk wk + ek2 bk) + ξk2 ≥ ek2,  ξk2 ≥ 0,                         (9)

where the matrix Ak ∈ R^{l+×n} is a collection of the positive class examples, the matrix Bk ∈ R^{l−×n} is comprised of the negative class examples, the matrix Ck ∈ R^{l0×n} is comprised of the zero class examples, and l+ + l− + l0 = l. Here, ek1, ek2, ek3 are vectors of ones with appropriate dimensions, ξk1, ξk2 are slack variables measuring the misclassification, and Dk > 0, Ek > 0 are penalty parameters that balance the fitted error, the margin and the misclassification. It is easy to see that minimizing the first term in the objective function of (9) amounts to solving a regression problem; in fact, it is a least squares regression term for the examples in the zero class. In this sense, our approach can be considered a regression-type method. Minimizing the second and third terms minimizes the sum of the classification errors. The last term is the regularization term, which ensures the minimization of the structural risk. The constraints together with the objective function guarantee that the positive and negative examples are as far away from the corresponding hyperplane as possible. Just as in SVM, we get the optimal solution of (9) by solving its dual problem, so we construct the following Lagrange function of problem (9):

L(wk, bk, ξk1, ξk2, αk, μk, βk, ηk)
  = (1/2)||Ck wk + ek3 bk||² + Dk ek1ᵀ ξk1 + Ek ek2ᵀ ξk2 + (1/2)(||wk||² + bk²)
    − αkᵀ[(Ak wk + ek1 bk) + ξk1 − ek1] − μkᵀ ξk1
    + βkᵀ[(Bk wk + ek2 bk) − ξk2 + ek2] − ηkᵀ ξk2.                    (10)

By the Karush–Kuhn–Tucker (KKT) conditions, we get

∂L/∂wk = Ckᵀ(Ck wk + ek3 bk) + wk − Akᵀ αk + Bkᵀ βk = 0,              (11)
∂L/∂bk = ek3ᵀ(Ck wk + ek3 bk) + bk − ek1ᵀ αk + ek2ᵀ βk = 0,            (12)
∂L/∂ξk1 = Dk ek1ᵀ − αkᵀ − μkᵀ = 0,                                    (13)
∂L/∂ξk2 = Ek ek2ᵀ − βkᵀ − ηkᵀ = 0.                                    (14)

Combining (11) and (12) leads to

[Ck, ek3]ᵀ[Ck, ek3][wkᵀ, bk]ᵀ + [wkᵀ, bk]ᵀ − [Ak, ek1]ᵀ αk + [Bk, ek2]ᵀ βk = 0.      (15)

Let Hk = [Ck, ek3], Fk = [Ak, ek1], Gk = [Bk, ek2] and uk = [wkᵀ, bk]ᵀ; then the above formula can be abbreviated as

(Hkᵀ Hk + I) uk − Fkᵀ αk + Gkᵀ βk = 0.                                 (16)

Combining (13) and (14) leads to

0 ≤ αk ≤ Dk,  0 ≤ βk ≤ Ek.                                             (17)

Then the dual problem of (9) can be formulated as

max_{αk, βk}  [ek1ᵀ, ek2ᵀ][αkᵀ, βkᵀ]ᵀ − (1/2)[αkᵀ, βkᵀ][Fkᵀ, −Gkᵀ]ᵀ (Hkᵀ Hk + I)⁻¹ [Fkᵀ, −Gkᵀ][αkᵀ, βkᵀ]ᵀ
s.t.  0 ≤ αk ≤ Dk,  0 ≤ βk ≤ Ek,                                       (18)

where αk, βk are the vectors of Lagrange multipliers, ek1, ek2 are vectors of ones with appropriate dimensions, and Dk, Ek are the penalty parameters. Solving the dual problem (18), we get the optimal solution [αk*, βk*]; then the optimal solution of the primal problem (9) is

uk* = [wk*ᵀ, bk*]ᵀ = (Hkᵀ Hk + I)⁻¹ [Fkᵀ αk* − Gkᵀ βk*].               (19)

So the function of the kth hyperplane can be expressed as

fk(x) = [x, 1][wk*ᵀ, bk*]ᵀ.                                            (20)
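Since the dual (18) has only box constraints (no equality constraint), it can be solved with a very simple projected-gradient loop. The following numpy sketch of (18)–(20) is our own illustration, not the authors' implementation; the function name, the iteration count and the step-size rule are assumptions.

```python
import numpy as np

def train_rbsvm_pair(A, B, C0, Dk, Ek, n_iter=2000):
    """Train one linear RBSVM sub-classifier: A positive, B negative, C0 zero class."""
    F = np.hstack([A, np.ones((A.shape[0], 1))])        # F_k = [A_k, e_k1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])        # G_k = [B_k, e_k2]
    H = np.hstack([C0, np.ones((C0.shape[0], 1))])      # H_k = [C_k, e_k3]
    Q_inv = np.linalg.inv(H.T @ H + np.eye(H.shape[1])) # (H_k^T H_k + I)^(-1)
    S = np.vstack([F, -G])                              # stacks F_k and -G_k
    P = S @ Q_inv @ S.T                                 # quadratic term of the dual (18)
    upper = np.concatenate([np.full(A.shape[0], Dk),    # 0 <= alpha_k <= D_k
                            np.full(B.shape[0], Ek)])   # 0 <= beta_k  <= E_k
    z = np.zeros(len(upper))                            # z = [alpha_k; beta_k]
    step = 1.0 / (np.linalg.norm(P, 2) + 1e-12)         # 1 / largest eigenvalue of P
    for _ in range(n_iter):                             # projected gradient ascent on (18)
        z = np.clip(z + step * (1.0 - P @ z), 0.0, upper)
    u = Q_inv @ (S.T @ z)                               # recovery step (19): u_k = [w_k; b_k]
    return u                                            # f_k(x) = [x, 1] @ u as in (20)
```

A dedicated QP solver would converge in fewer iterations; the explicit loop is kept only to make the box-constrained structure of (18) visible.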

In total, K(K−1)/2 hyperplanes like (20) are constructed in our RBSVM, and thus the K(K−1)/2 decision functions are defined as follows:

fij(x) = +1, if [x, 1][wijᵀ, bij]ᵀ > 1 − ε;  −1, if [x, 1][wijᵀ, bij]ᵀ < −1 + ε;  0, otherwise,      (21)

for i, j = 1, ..., K with i < j. The labels "1", "−1" and "0" are assigned to the examples of class i, class j and all the rest classes, respectively, according to (21); ε is a constant given in advance. For a new example x, a vote is given to class i or class j depending on which condition is satisfied. Finally, the example x is assigned the class label that gets the most votes.
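The voting scheme just described combines the K(K−1)/2 decision functions (21). A possible implementation (ours, for illustration only) that consumes the vectors u_ij = [w_ij; b_ij] produced by the training sketch above is:

```python
import numpy as np
from itertools import combinations

def rbsvm_predict(x, models, K, eps):
    """Vote over the K(K-1)/2 sub-classifiers; models[(i, j)] holds u_ij = [w_ij; b_ij]."""
    votes = np.zeros(K + 1)                   # index 0 unused, classes are 1, ..., K
    z = np.append(x, 1.0)                     # the augmented example [x, 1]
    for i, j in combinations(range(1, K + 1), 2):
        score = z @ models[(i, j)]            # decision value in (21)
        if score > 1 - eps:
            votes[i] += 1
        elif score < -1 + eps:
            votes[j] += 1                     # scores inside the band give no vote
    return int(np.argmax(votes[1:])) + 1      # label with the most votes
```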

3.2. The nonlinear RBSVM

In this section, the linear RBSVM is extended to the nonlinear case using the kernel trick. In the nonlinear case, RBSVM aims at finding K(K−1)/2 kernel-generated hyperplanes

K(xᵀ, E) vk + bk = 0,  k = 1, ..., K(K−1)/2.                           (22)

So, the kth optimization problem is written as

min_{vk, bk, ξk1, ξk2}  (1/2)||K(Ckᵀ, Eᵀ) vk + ek3 bk||² + Dk ek1ᵀ ξk1 + Ek ek2ᵀ ξk2 + (1/2)(||vk||² + bk²)
s.t.  (K(Akᵀ, Eᵀ) vk + ek1 bk) + ξk1 ≥ ek1,  ξk1 ≥ 0,
      −(K(Bkᵀ, Eᵀ) vk + ek2 bk) + ξk2 ≥ ek2,  ξk2 ≥ 0,                 (23)

where Eᵀ = [Akᵀ, Bkᵀ, Ckᵀ]ᵀ, K is a kernel function, ek1, ek2, ek3 are vectors of ones, ξk1, ξk2 are slack variables and Dk, Ek are the penalty parameters. The Lagrange function of (23) is

L(vk, bk, ξk1, ξk2, αk, μk, βk, ηk)
  = (1/2)||K(Ckᵀ, Eᵀ) vk + ek3 bk||² + Dk ek1ᵀ ξk1 + Ek ek2ᵀ ξk2 + (1/2)(||vk||² + bk²)
    − αkᵀ[(K(Akᵀ, Eᵀ) vk + ek1 bk) + ξk1 − ek1] − μkᵀ ξk1
    + βkᵀ[(K(Bkᵀ, Eᵀ) vk + ek2 bk) − ξk2 + ek2] − ηkᵀ ξk2.             (24)

According to the KKT conditions, we have

K(Ckᵀ, Eᵀ)ᵀ(K(Ckᵀ, Eᵀ) vk + ek3 bk) + vk − K(Akᵀ, Eᵀ)ᵀ αk + K(Bkᵀ, Eᵀ)ᵀ βk = 0,      (25)
ek3ᵀ(K(Ckᵀ, Eᵀ) vk + ek3 bk) + bk − ek1ᵀ αk + ek2ᵀ βk = 0,             (26)
Dk ek1ᵀ − αkᵀ − μkᵀ = 0,                                               (27)
Ek ek2ᵀ − βkᵀ − ηkᵀ = 0.                                               (28)

Set Lk = [K(Ckᵀ, Eᵀ), ek3], Mk = [K(Akᵀ, Eᵀ), ek1], Nk = [K(Bkᵀ, Eᵀ), ek2] and sk = [vkᵀ, bk]ᵀ; then the dual problem of (23) can be formulated as

max_{αk, βk}  [ek1ᵀ, ek2ᵀ][αkᵀ, βkᵀ]ᵀ − (1/2)[αkᵀ, βkᵀ][Mkᵀ, −Nkᵀ]ᵀ (Lkᵀ Lk + I)⁻¹ [Mkᵀ, −Nkᵀ][αkᵀ, βkᵀ]ᵀ
s.t.  0 ≤ αk ≤ Dk,  0 ≤ βk ≤ Ek,                                       (29)

where the penalty parameters Dk, Ek are non-negative, ek1, ek2 are vectors of ones with appropriate dimensions, and αk, βk are the vectors of Lagrange multipliers. Then the function of the kth hyperplane can be expressed as

gk(x) = [K(xᵀ, E), 1][vkᵀ, bk]ᵀ.                                       (30)

Just as in the linear case, the decision function that classifies class i and class j is designed as

fij(x) = +1, if [K(xᵀ, E), 1][vijᵀ, bij]ᵀ > 1 − ε;  −1, if [K(xᵀ, E), 1][vijᵀ, bij]ᵀ < −1 + ε;  0, otherwise,      (31)

for i, j = 1, ..., K with i < j; ε is a constant given in advance. For a new example x, a vote is given to class i or class j depending on which condition is satisfied. Finally, the example x is assigned the class label that gets the most votes.
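The nonlinear sub-classifier is trained exactly like the linear one once the kernel blocks Lk, Mk and Nk are assembled. The sketch below (ours; the RBF helper, names and iteration count are assumptions) mirrors the projected-gradient solution of the linear case applied to (29):

```python
import numpy as np

def rbf_gram(X1, X2, gamma):
    """Kernel block K(X1^T, X2^T): entry (i, j) = exp(-gamma * ||x1_i - x2_j||^2)."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def train_rbsvm_pair_kernel(A, B, C0, Dk, Ek, gamma, n_iter=2000):
    """Nonlinear RBSVM sub-classifier; E stacks A, B and C0 as in (23)."""
    E = np.vstack([A, B, C0])
    one = lambda M: np.ones((M.shape[0], 1))
    M = np.hstack([rbf_gram(A, E, gamma), one(A)])      # M_k = [K(A_k^T, E^T), e_k1]
    N = np.hstack([rbf_gram(B, E, gamma), one(B)])      # N_k = [K(B_k^T, E^T), e_k2]
    L = np.hstack([rbf_gram(C0, E, gamma), one(C0)])    # L_k = [K(C_k^T, E^T), e_k3]
    Q_inv = np.linalg.inv(L.T @ L + np.eye(L.shape[1])) # (L_k^T L_k + I)^(-1)
    S = np.vstack([M, -N])
    P = S @ Q_inv @ S.T                                 # quadratic term of (29)
    upper = np.concatenate([np.full(A.shape[0], Dk), np.full(B.shape[0], Ek)])
    z = np.zeros(len(upper))
    step = 1.0 / (np.linalg.norm(P, 2) + 1e-12)
    for _ in range(n_iter):                             # projected gradient ascent on (29)
        z = np.clip(z + step * (1.0 - P @ z), 0.0, upper)
    s = Q_inv @ (S.T @ z)                               # s_k = [v_k; b_k]
    return s, E                                         # predict with [K(x^T, E), 1] @ s_k as in (30)
```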

3.3. Comparing with K-SVCR and MBSVM

Comparing with K-SVCR: the motivation of our method is consistent with that of K-SVCR, but the processing is quite different. Firstly, the requirement on the rest class is realized in RBSVM by minimizing a square term in the objective function of the optimization problem, while in K-SVCR it is realized by adding a large number of constraints. So the computational complexity of our method is much lower than that of K-SVCR, as shown in Table 1, and the gap becomes more obvious when the number of categories is large. Secondly, different penalty parameters Dk and Ek are adopted in our method, which determine the trade-off between the losses of the positive and negative examples; correspondingly, only one penalty parameter is used for all examples in K-SVCR. This makes a great difference when the numbers of examples in the classes are imbalanced. Therefore, selecting proper Dk and Ek enables us to obtain a higher accuracy than K-SVCR, which can be seen clearly in the numerical experiments.

Comparing with MBSVM: the regression idea is used in both our RBSVM and MBSVM. In our method, the regression hyperplane is required to pass through the examples belonging to the rest class and to classify the negative and positive classes as correctly as possible, while in MBSVM it is required to pass through the rest class and to be as far as possible from the positive class. From the classification point of view, RBSVM adopts the "1-versus-1-versus-rest" strategy and constructs K(K−1)/2 classifiers in total, whereas MBSVM generates K classifiers by the "1-versus-rest" strategy. Thus, RBSVM makes more careful use of the examples.

4. Numerical experiments

To evaluate the performance of the new algorithm, artificial data sets and several UCI [15] data sets are used in the numerical experiments. All experiments have been implemented in Matlab 2015b on a PC with an Intel Core7 Duo CPU at 2.53 GHz, 4 GB of RAM and the Windows 10 operating system. For the nonlinear case, the RBF kernel K(xi, xj) = e^{−γ||xi−xj||²} is used. The parameters C and D are selected from the set {2⁻⁸, 2⁻⁷, ..., 2⁷, 2⁸} and γ is selected from {0.1, 0.2, ..., 2}. Our RBSVM is compared with four popular multiclass classification methods: 1-v-1 SVM, 1-v-r SVM, KSVCR and MBSVM.
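The hyperparameter grid described above is small enough to be enumerated exhaustively; the following snippet (ours, with the grid values taken from the text) simply lists the candidate configurations that such a search would visit:

```python
penalty_grid = [2.0 ** p for p in range(-8, 9)]            # 2^-8, 2^-7, ..., 2^8  (17 values)
gamma_grid = [round(0.1 * t, 1) for t in range(1, 21)]     # 0.1, 0.2, ..., 2.0    (20 values)

# every (C, D, gamma) combination examined: 17 * 17 * 20 = 5780 candidates
candidates = [(c, d, g) for c in penalty_grid for d in penalty_grid for g in gamma_grid]
```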


Fig. 4. Comparison of the time consumed by RBSVM and KSVCR on the nine experimental data sets.

Table 2. Class label defined by the mean vector μ.

Class   1      2      3      4      5      6      7      8       9
μ       (1,1)  (1,4)  (4,1)  (4,4)  (1,8)  (8,1)  (8,8)  (1,12)  (12,1)

Table 3. Six artificial data sets used in this paper.

Dataset   Number of examples   Number of features   Class labels
1         800                  2                    1,2,3,4
2         1000                 2                    1,2,3,4,5
3         1200                 2                    1,2,3,4,5,6
4         1400                 2                    1,2,3,4,5,6,7
5         1600                 2                    1,2,3,4,5,6,7,8
6         1800                 2                    1,2,3,4,5,6,7,8,9

4.1. Experiments on artificial data sets

Firstly, we compare the computation time of our RBSVM and KSVCR on six artificial data sets. The inputs of the six data sets are generated from the multivariate Gaussian distribution N(μ, Σ), where Σ = diag(2, 2) is the covariance matrix and μ is the mean vector. In total, 9 different mean vectors are used in this paper. The label of an input is decided by its mean vector, i.e. inputs drawn with different mean vectors receive different class labels; this can be seen clearly in Table 2. The six artificial data sets used in this paper are then defined in Table 3.
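To make the construction of the artificial data concrete, the sketch below (ours) draws one such data set; the means and the covariance diag(2, 2) follow Table 2, while the per-class sample size of 200 is inferred from the totals in Table 3.

```python
import numpy as np

MEANS = [(1, 1), (1, 4), (4, 1), (4, 4), (1, 8), (8, 1), (8, 8), (1, 12), (12, 1)]
COV = np.diag([2.0, 2.0])

def make_artificial_dataset(n_classes, n_per_class=200, seed=0):
    """Draw a synthetic data set in the spirit of Table 3 (class k uses the kth mean)."""
    rng = np.random.default_rng(seed)
    X = np.vstack([rng.multivariate_normal(MEANS[k], COV, n_per_class)
                   for k in range(n_classes)])
    y = np.repeat(np.arange(1, n_classes + 1), n_per_class)
    return X, y

# e.g. data set 1 in Table 3: 4 classes, 800 examples in total
X, y = make_artificial_dataset(n_classes=4)
```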

On these 6 data sets, we record the time of a single training run; the results are summarized in Fig. 2. It is obvious that KSVCR consumes more time than our method, especially when the number of classes increases; the time consumed by KSVCR is three or more times that consumed by RBSVM when the number of classes is greater than 7.

Fig. 2. Comparison of the training time of RBSVM and KSVCR on the data sets shown in Table 3.

Next, we compare the accuracy of KSVCR and our method when the data sets are imbalanced. Five imbalanced three-class data sets are generated, which differ only in the ratio of the three classes: 1:1:3, 1:1:5, 1:1:7, 1:1:9 and 1:1:11, respectively. Fig. 3 shows the results. It is clear that our method performs better than KSVCR on these five data sets, by about 10%. The reason may be that different penalty parameters are used for the positive and negative classes in our method when constructing the classifiers.

Fig. 3. Comparison of the accuracy of RBSVM and KSVCR on the unbalanced data.

4.2. Experiments on UCI data sets

In this section, we compare our RBSVM with the other four multiclass classification methods on UCI data sets. Five-fold cross validation is conducted on 9 UCI data sets and the results are shown in Table 4. We can see that our RBSVM performs better than the other four methods on four data sets: 'tea', 'glass', 'balance' and 'der'; on these four data sets, the accuracy is improved by up to 4% using RBSVM. On the data sets 'iris', 'wine' and 'ecoli', the accuracy of RBSVM is similar to that of the other methods. Only on the data sets 'seeds' and 'hayes' does RBSVM perform worse than the other methods: the accuracy of RBSVM is 0.95% lower than MBSVM on 'seeds' and 0.74% lower than MBSVM on 'hayes'. On the whole, our RBSVM performs best in classification accuracy, since it has the highest average accuracy over all data sets; the average accuracy of RBSVM is more than 1.16% higher than that of each of the other four methods. Table 5 lists the ranking of our method and the other methods based on the accuracies on the nine experimental data sets. We can see that RBSVM has the smallest average ranking, which confirms that our RBSVM performs better than the other four methods. We also compare the time consumed by KSVCR and RBSVM on the 9 UCI data sets; the results are shown in Fig. 4. We can see that our RBSVM is faster than KSVCR.

Table 4. Five-fold cross validation comparison on UCI data sets (accuracy ± standard deviation, %).

Dataset               1-v-1           1-v-r          KSVCR          MBSVM           RBSVM
Iris (150×4×3)        96.67 ± 3.65♦   95.33 ± 4.52   96.67 ± 2.11   96.00 ± 3.27    96.67 ± 3.65
Wine (178×13×3)       97.75 ± 2.09    98.86 ± 1.40   97.74 ± 2.37   98.89 ± 2.22♦   98.89 ± 2.22
Tea (151×5×3)         63.58 ± 5.47    64.9 ± 4.31♦   63.44 ± 3.76   64.91 ± 4.31    66.96 ± 4.74
Seeds (210×7×3)       91.9 ± 3.87     91.9 ± 3.56    92.12 ± 5.47   93.33 ± 5.51♦   92.38 ± 4.10
Ecoli (327×7×5)       89.3 ± 2.55♦    88.58 ± 2.82   88.68 ± 1.35   88.38 ± 1.84    88.99 ± 4.12
Glass (214×9×6)       71.55 ± 4.72    69.67 ± 4.09   72.64 ± 3.28   71.64 ± 6.73    75.33 ± 5.70
Hayes (132×5×3)       75.01 ± 3.74    77.29 ± 4.05   78.24 ± 2.31   81.05 ± 0.35♦   80.31 ± 4.31
Balance (625×4×3)     95.82 ± 2.23    88.96 ± 1.18   95.04 ± 1.87   91.68 ± 0.96    96.48 ± 2.29
Der (358×34×6)        97.76 ± 1.41    98.04 ± 0.99   97.76 ± 1.43   98.04 ± 1.67    98.32 ± 1.06
w/t/l                 8/1/0           9/0/0          9/0/0          7/2/0           –
Mean acc              86.6            85.95          86.93          87.10           88.26
Mean std              3.64            2.34           2.67           2.98            3.57

♦ indicates that our method's performance is tied with the corresponding result.

Table 5. The ranking of several methods based on the accuracies on the nine experimental data sets.

Dataset        1-v-1   1-v-r   KSVCR   MBSVM   RBSVM
Iris           2.5     4       2.5     5       2.5
Wine           4       3       5       2.5     2.5
Tea            4       2.5     5       2.5     1
Seeds          4       3       5       1       2
Ecoli          1       2.5     2.5     5       2
Glass          4       5       2       3       1
Hayes          5       4       3       1       2
Balance        2       5       3       4       1
Der            4.5     2.5     4.5     2.5     1
Average rank   3.44    3.5     3.61    2.94    1.67


5. Conclusion

In this paper, a novel multi-class classification method named RBSVM is proposed. The strategy of "rest-versus-1-versus-1" is adopted in our new method. Although the motivation of RBSVM is consistent with that of KSVCR, their processing methods are different. The analysis of the computational complexity and the numerical results show that our method is more reasonable, and the numerical results show that RBSVM is effective for multi-class classification. Since there are in total 3 parameters in the nonlinear RBSVM, parameter selection is a practical problem and should be addressed.

Acknowledgment

This paper is supported by the National Natural Science Foundation of China (Nos. 11301535, 11371365) and the Chinese Universities Scientific Fund (No. 2017LX003).

References

[1] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.
[2] V. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., Springer, New York, 1998.
[3] K.P. Bennett, Combining support vector and mathematical programming methods for classification, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 307–326.
[4] Y. Lee, Y. Lin, G. Wahba, Multicategory support vector machines, Comput. Sci. Stat. 33 (2001) 498–512.
[5] J. Weston, C. Watkins, Multi-class support vector machines, Technical Report CSD-TR-98-04, Royal Holloway, University of London, Egham, UK, 1998.
[6] T. Graepel, et al., Classification on proximity data with LP-machines, in: Proceedings of the Ninth International Conference on Artificial Neural Networks, London, Conference Publications, 1999, pp. 304–309.
[7] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905–910.
[8] J. Weston, C. Watkins, Multi-class support vector machines, CSD-TR-98-04, Royal Holloway, University of London, Egham, UK, 1998.
[9] K. Crammer, Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, J. Mach. Learn. Res. 2 (2002) 265–292.
[10] L. Bottou, C. Cortes, J.S. Denker, H. Drucker, I. Guyon, L.D. Jackel, Y. LeCun, U.A. Müller, E. Säckinger, P. Simard, V. Vapnik, Comparison of classifier methods: a case study in handwritten digit recognition, in: Proceedings of the International Conference on Pattern Recognition, IAPR, IEEE Computer Society Press, 1994, pp. 77–82.
[11] U. Kreßel, Pairwise classification and support vector machines, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 255–268.
[12] C. Angulo, X. Parra, A. Catala, K-SVCR. A support vector machine for multi-class classification, Neurocomputing 55 (1) (2003) 57–77.
[13] Z.X. Yang, Y.H. Shao, Multiple birth support vector machine for multi-class classification, Neural Comput. Appl. 22 (Suppl. 1) (2013) S153–S161.
[14] N.Y. Deng, Y.J. Tian, Support Vector Machines: Optimization Based Theory, Algorithms, and Extensions, CRC Press, 2012.
[15] D. Dua, E. Karra Taniskidou, UCI Machine Learning Repository, 2017. http://archive.ics.uci.edu/ml.

Zeqian Xu received his Bachelor's degree from Beijing Institute of Technology in 2016 and his Master's degree from China Agricultural University in 2018. His research interests mainly include classification problems and dimensionality reduction.

Tongling Lv received her Bachelor's degree from Yantai University in 1999 and her Master's degree from Nankai University in 2003. She is now a teacher in the College of Science, China Agricultural University. Her research interests are applied statistics and data mining.

Liming Liu received her Bachelor's degree from Harbin Institute of Technology in 1982 and her Ph.D. from China Agricultural University in 1999. She is now a professor at the Capital University of Economics and Business. Her research interests mainly include applied statistics, machine learning and data mining.

Zhiqiang Zhang received his Master's degree in 1998 and his Ph.D. in 2002, both from Harbin Institute of Technology. He is now a teacher at Beijing Institute of Technology; his research interests mainly include data mining, computer languages and program design.

Junyan Tan received her Master's degree from Harbin Normal University in 2002 and her Ph.D. from China Agricultural University in 2010. She is now an associate professor in the College of Science, China Agricultural University. Her research interests mainly include classification problems, dimension reduction and their applications.