Alternating Direction Method of Multipliers for L1- and L2-norm Best Fitting Hyperplane Classifier

Procedia Computer Science 108C (2017) 1292–1301

International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland

Chen Wang¹, Chun-Na Li², Hua-Xin Pei¹, Yan-Ru Guo², and Yuan-Hai Shao²

¹ College of Science, Zhejiang University of Technology, Hangzhou 310023, PR China
² Zhijiang College, Zhejiang University of Technology, Hangzhou 310024, PR China
[email protected]

Abstract

Recently, the two-sided best fitting hyperplane classifier (2S-BFHC) was proposed, which has several significant advantages over previous proximal hyperplane classifiers. Moreover, the Concave-Convex Procedure (CCCP) has already been provided to solve the dual problem of 2S-BFHC. In this paper, we solve the primal problem of 2S-BFHC by the alternating direction method of multipliers (ADMM), which is well suited to distributed optimization problems, and we also propose a robust L1-norm two-sided best fitting hyperplane classifier (L1-2S-BFHC) with ADMM, which aims at giving robust performance on problems with outliers. Preliminary numerical results demonstrate the effectiveness of the proposed methods.

© 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the scientific committee of the International Conference on Computational Science.

Keywords: Pattern recognition, best fitting hyperplane classifier, alternating direction method of multipliers, L1-norm optimization

1 Introduction

Classification [1] is an important part of data mining and pattern recognition. In this paper, we consider the binary classification problem with a dataset T = {(x1, y1), ..., (xm, ym)}, where xi ∈ Rⁿ is the input vector, yi ∈ {+1, −1} is the corresponding output, i = 1, 2, ..., m, and m is the number of samples.


The purpose of classification is that, according to the training set, we can deduce the output y for any given input x. The support vector machine (SVM) is a well-known method for classification, and it has been used in a wide variety of applications [2, 3, 4, 5]. The standard support vector classification (SVC) searches for two parallel hyperplanes with maximal width between them. Recently, some classifiers based on nonparallel hyperplanes, such as the generalized eigenvalue proximal support vector machine (GEPSVM) and the twin support vector machine (TWSVM), have been proposed in [6] and [7], respectively. Both GEPSVM and TWSVM seek two nonparallel hyperplanes such that each hyperplane fits the samples of the corresponding class and is as far as possible from the samples of the other class. However, GEPSVM and TWSVM have two limitations. One is that they are not suitable for large-scale classification problems, because they need an eigen-decomposition or the inverse of an (m + 1) × (m + 1) matrix, where m is the number of training samples. The other is that the solutions of both GEPSVM and TWSVM are not sparse.

In 2013, Tian et al. proposed a novel nonparallel support vector machine (NPSVM) [8] with a corresponding solver [9] that overcomes the matrix-inversion drawback of TWSVM. NPSVM follows a similar idea to TWSVM; however, it can be efficiently solved by sequential minimal optimization (SMO), and the solution returned by its dual form is sparse. In 2016, Hakan et al. proposed the two-sided best fitting hyperplane classifier (2S-BFHC) [10] with two types of ramp loss functions. Different from NPSVM, 2S-BFHC allows samples of one class that are far from the hyperplane to lie on both sides of the hyperplane, and it requires solving a more complicated non-convex optimization problem. Although 2S-BFHC has better generalization performance, it needs to solve a non-convex optimization problem, so effective solving algorithms for 2S-BFHC are worth studying. At present, most SVM algorithms solve the optimization problem in the dual space to capture the support vectors. In this paper, we solve the primal problem of the L2-norm regularized 2S-BFHC, as well as that of the L1-norm two-sided best fitting hyperplane classifier (L1-2S-BFHC) proposed in this paper, by the alternating direction method of multipliers (ADMM) [11]. Since both primal problems are non-convex, we split the variables of each optimization problem into two groups and optimize the two groups alternately. More importantly, many explicit solutions can be obtained for the separate minimization subproblems of both L2-2S-BFHC and L1-2S-BFHC. In addition, we propose a simple but effective initialization for BFHC. The effectiveness and robustness of our L1-2S-BFHC are demonstrated by numerical tests on a toy example as well as on some benchmark datasets.

2 2S-BFHC

The two-sided best fitting hyperplane classifier (2S-BFHC) constructs two nonparallel hyperplanes

$$w_1^T x + b_1 = 0 \quad \text{and} \quad w_2^T x + b_2 = 0 \qquad (1)$$


where w1, w2 ∈ Rⁿ and b1, b2 ∈ R. The basic idea of 2S-BFHC is that the first hyperplane is closest to the samples in the positive class and farther from the samples in the negative class, while the second hyperplane is closest to the samples in the negative class and farther from the samples in the positive class. To obtain the two hyperplanes, 2S-BFHC uses two symmetric ramp loss functions [10] for the fitting class and the separating class, Lpos(a) = Rpos(a) + Rpos(−a) and Lneg(a) = Rneg(a) + Rneg(−a), respectively. Therefore, the optimization problems of 2S-BFHC can be written as

$$\min_{(w_1,b_1)} \ \frac{1}{2}\|w_1\|_2^2 + C_1 \sum_{i=1}^{m_1} L_{pos}\left(w_1^T x_i + b_1\right) + C_2 \sum_{j=1}^{m_2} L_{neg}\left(w_1^T x_j + b_1\right), \quad i \in I_+,\ j \in I_- \qquad (2)$$

and

$$\min_{(w_2,b_2)} \ \frac{1}{2}\|w_2\|_2^2 + C_3 \sum_{i=1}^{m_2} L_{pos}\left(w_2^T x_i + b_2\right) + C_4 \sum_{j=1}^{m_1} L_{neg}\left(w_2^T x_j + b_2\right), \quad i \in I_-,\ j \in I_+ \qquad (3)$$

where the first term of (2) is the regularization term, m1 (m2) is the number of positive (negative) samples, C1 (C2) is a user-defined parameter that controls the weight of the errors associated with the positive (negative) samples, I+ (I−) is the index set of the positive (negative) samples, and C3 (C4) plays the same role as C1 (C2). Obviously, (2) and (3) are non-convex problems, so Hakan et al. [10] used the concave-convex procedure (CCCP) to obtain their solutions in the dual space. Once the best fitting hyperplanes are found, a new sample is classified based on the minimum distance to the returned hyperplanes, using the decision function

$$f(x) = \arg\min_{i=1,2} \ \frac{|w_i^T x + b_i|}{\|w_i\|} \qquad (4)$$

where | · | is the absolute value and ∥ · ∥ is the L2-norm.
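For illustration, the decision rule (4) is straightforward to apply once the two hyperplanes are known. Below is a minimal NumPy sketch; the function and variable names are ours, not fixed by the paper.

```python
import numpy as np

def predict(x, w1, b1, w2, b2):
    """Assign x to the class whose fitting hyperplane is closer, cf. (4)."""
    d1 = abs(w1 @ x + b1) / np.linalg.norm(w1)  # distance to the first hyperplane
    d2 = abs(w2 @ x + b2) / np.linalg.norm(w2)  # distance to the second hyperplane
    return +1 if d1 <= d2 else -1
```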

3 ADMM for BFHC

3.1 ADMM for 2S-BFHC (L2-ADMM-BFHC)

In this subsection, we apply ADMM to the primal form of linear 2S-BFHC (L2-ADMM-BFHC). Here we only consider the first fitting hyperplane; the second fitting hyperplane can be obtained by the same procedure.

Firstly, we reformulate (2) by additionally maximizing ∆, since a bigger value of ∆ indicates better separability. In other words, the minimum distance 2∆/∥w1∥ between the positive class and the negative class becomes bigger as ∆ grows. Then, (2) can be reformulated as

$$\begin{aligned}
\min_{(w_1,b_1,\Delta)} \ & \frac{1}{2}\|w_1\|_2^2 + C_1 \sum_{i=1}^{m_1} L_{pos}\left(w_1^T x_i + b_1\right) + C_2 \sum_{j=1}^{m_2} L_{neg}\left(w_1^T x_j + b_1\right) - H\Delta \\
\text{s.t.} \ & i \in I_+,\ j \in I_-,\ \Delta \in (0,1)
\end{aligned} \qquad (5)$$

where H is a positive weight parameter.




To solve problem (5), we split its variables into two groups. The first group concerns the fitting hyperplane, i.e., u1 = (w1ᵀ, b1)ᵀ, and the second concerns separability, i.e., ∆; we optimize these two groups alternately. We divide (5) into two parts (a convex part and a concave part) and add slack variables, which gives the following ADMM forms:

$$\begin{aligned}
\min_{(u_1,\xi_i,\eta_i,t_j,q_j)} \ & \frac{1}{2}u_1^T D u_1 + C_1 \sum_{i=1}^{m_1} (\xi_i)_+ + C_1 \sum_{i=1}^{m_1} (\eta_i)_+ + C_2 \sum_{j=1}^{m_2} (t_j)_+ + C_2 \sum_{j=1}^{m_2} (q_j)_+ \\
\text{s.t.} \ & -1 + \Delta - u_1^T h_i = \xi_i,\ \ -1 + \Delta + u_1^T h_i = \eta_i,\quad i = 1,\ldots,m_1 \\
& \ \ \, 1 + \Delta - u_1^T h_j = t_j,\ \ \ \ \, 1 + \Delta + u_1^T h_j = q_j,\quad j = 1,\ldots,m_2
\end{aligned} \qquad (6)$$

and

$$\begin{aligned}
\min_{(\Delta,v,\delta,\xi_i^*,\eta_i^*)} \ & -H\Delta - C_1 \sum_{i=1}^{m_1} (\xi_i^*)_+ - C_1 \sum_{i=1}^{m_1} (\eta_i^*)_+ + g(\delta) + g(v) \\
\text{s.t.} \ & s - 2 + \Delta - u_1^T h_i = \xi_i^*,\ \ s - 2 + \Delta + u_1^T h_i = \eta_i^*,\quad i = 1,\ldots,m_1 \\
& \ \Delta + \delta = 1,\ \ \Delta - v = 0
\end{aligned} \qquad (7)$$

where $h_i = (x_i^T, 1)^T$, $h_j = (x_j^T, 1)^T$, $D = \begin{pmatrix} I_n & 0 \\ 0 & 0 \end{pmatrix}$, and g(δ) and g(v) are indicator functions of R₊.
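For reference, a small sketch of how the augmented samples hᵢ and the matrix D above can be formed (the function name is ours):

```python
import numpy as np

def build_h_and_D(X):
    """Augment samples as h_i = (x_i^T, 1)^T and form D = diag(I_n, 0),
    so that u^T D u = ||w||^2 for u = (w^T, b)^T."""
    m, n = X.shape
    H = np.hstack([X, np.ones((m, 1))])  # each row is h_i^T
    D = np.eye(n + 1)
    D[-1, -1] = 0.0                      # no penalty on the bias b
    return H, D
```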

We solve (6) and (7) alternately to optimize u1 and ∆, respectively. In the ADMM iterations, we solve for u1 and ∆ by setting $\partial \bar{L}_\rho / \partial u_1 = 0$ and $\partial \tilde{L}_\rho / \partial \Delta = 0$, where $\bar{L}_\rho$ and $\tilde{L}_\rho$ are the augmented Lagrangian functions of (6) and (7), respectively. For solving ξi, ηi, tj and qj, we use the operator

$$S_\lambda(w) = \begin{cases} w - \lambda, & w > \lambda \\ 0, & 0 \le w \le \lambda \\ w, & w < 0 \end{cases}$$

introduced in [12]. According to [11], we can obtain the primal and dual residuals at iteration k + 1 as follows:

$$\begin{aligned}
r_1^{k+1} &= \big(-1 + \Delta - (u_1^{k+1})^T h_i - \xi_i^{k+1};\ -1 + \Delta + (u_1^{k+1})^T h_i - \eta_i^{k+1};\ 1 + \Delta - (u_1^{k+1})^T h_j - t_j^{k+1};\ 1 + \Delta + (u_1^{k+1})^T h_j - q_j^{k+1}\big) \\
r_2^{k+1} &= \big(s - 2 + \Delta^{k+1} - u_1^T h_i - \xi_i^{*k+1};\ s - 2 + \Delta^{k+1} + u_1^T h_i - \eta_i^{*k+1};\ \Delta^{k+1} + \delta^{k+1} - 1;\ \Delta^{k+1} - v^{k+1}\big) \\
s_1^{k+1} &= \sum_{i=1}^{m_1} \rho h_i(\xi_i^k - \xi_i^{k+1}) + \sum_{i=1}^{m_1} \rho h_i(\eta_i^k - \eta_i^{k+1}) + \sum_{j=1}^{m_2} \rho h_j(t_j^k - t_j^{k+1}) + \sum_{j=1}^{m_2} \rho h_j(q_j^k - q_j^{k+1}) \\
s_2^{k+1} &= \sum_{i=1}^{m_1} \rho(\xi_i^{*k+1} - \xi_i^{*k}) + \sum_{i=1}^{m_1} \rho(\eta_i^{*k+1} - \eta_i^{*k}) + \rho(\delta^k - \delta^{k+1}) + \rho(v^{k+1} - v^k)
\end{aligned}$$

where r1ᵏ⁺¹, s1ᵏ⁺¹ and r2ᵏ⁺¹, s2ᵏ⁺¹ are the primal and dual residuals of (6) and (7), respectively, i = 1, ..., m1, j = 1, ..., m2. A reasonable stopping criterion for the ADMM process is that both ∥r1ᵏ∥₂ ≤ ϵpri, ∥s1ᵏ∥₂ ≤ ϵdual and ∥r2ᵏ∥₂ ≤ ϵpri, ∥s2ᵏ∥₂ ≤ ϵdual are satisfied, where ϵpri and ϵdual are the feasibility tolerances for these residuals; we set ϵpri = ϵdual = 10⁻⁵ in this paper. The algorithm for solving (5) is listed in Table 1.
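For concreteness, a minimal NumPy sketch of the operator S_λ from [12] and of the residual-based stopping test is given below; it assumes the residual vectors have already been stacked as in the formulas above, and the names are ours.

```python
import numpy as np

def S(w, lam):
    """Operator from [12]: w - lam if w > lam, 0 if 0 <= w <= lam, w if w < 0 (element-wise)."""
    w = np.asarray(w, dtype=float)
    return np.where(w > lam, w - lam, np.where(w < 0.0, w, 0.0))

def converged(r, s, eps_pri=1e-5, eps_dual=1e-5):
    """ADMM stopping test: both the primal and dual residual norms must be small."""
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```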

In the same way, u2 = (w2ᵀ, b2)ᵀ in (3) can be obtained by the ADMM algorithm. Once the solutions (w1ᵀ, b1)ᵀ and (w2ᵀ, b2)ᵀ are obtained, a new sample x ∈ Rⁿ is assigned to class i (i = 1, 2) depending on which of the two hyperplanes is closer, e.g., using (4).


Table 1: The algorithm for solving (5)

Input: Data set T = {(x1, y1), (x2, y2), ..., (xm, ym)}; parameters C1, C2, H, ρ, s.
Output: u1, ∆

1. Initialize ∆⁽ᵏ⁾, u1⁽ᵏ⁾, k = 0.
2. Alternately update u1⁽ᵏ⁾ and ∆⁽ᵏ⁾:
   a. Put ∆⁽ᵏ⁾ into (6), and update u1⁽ᵏ⁾ until ∥r1ᵏ∥₂ ≤ 10⁻⁵ and ∥s1ᵏ∥₂ ≤ 10⁻⁵:
      $u_1^{k+1} = -\rho \Big(D + 2\rho \sum_{i=1}^{m_1} h_i h_i^T + 2\rho \sum_{j=1}^{m_2} h_j h_j^T\Big)^{-1} \Big( \sum_{i=1}^{m_1} (\xi_i^k - \eta_i^k + \alpha_i^k - \beta_i^k) h_i + \sum_{j=1}^{m_2} (t_j^k - q_j^k + \gamma_j^k - \tau_j^k) h_j \Big)$
      $\xi_i^{k+1} = S_{C_1/\rho}\big(-1 + \Delta - (u_1^{k+1})^T h_i - \alpha_i^k\big)$,  $\eta_i^{k+1} = S_{C_1/\rho}\big(-1 + \Delta + (u_1^{k+1})^T h_i - \beta_i^k\big)$
      $t_j^{k+1} = S_{C_2/\rho}\big(1 + \Delta - (u_1^{k+1})^T h_j - \gamma_j^k\big)$,  $q_j^{k+1} = S_{C_2/\rho}\big(1 + \Delta + (u_1^{k+1})^T h_j - \tau_j^k\big)$
      $\alpha_i^{k+1} = \alpha_i^k + \xi_i^{k+1} + 1 - \Delta + (u_1^{k+1})^T h_i$,  $\beta_i^{k+1} = \beta_i^k + \eta_i^{k+1} + 1 - \Delta - (u_1^{k+1})^T h_i$
      $\gamma_j^{k+1} = \gamma_j^k + t_j^{k+1} - 1 - \Delta + (u_1^{k+1})^T h_j$,  $\tau_j^{k+1} = \tau_j^k + q_j^{k+1} - 1 - \Delta - (u_1^{k+1})^T h_j$
   b. Put u1⁽ᵏ⁾ into (7), and update ∆⁽ᵏ⁾ until ∥r2ᵏ∥₂ ≤ 10⁻⁵ and ∥s2ᵏ∥₂ ≤ 10⁻⁵:
      $\Delta^{k+1} = -\dfrac{-H + \rho \sum_{i=1}^{m_1} (2s - 4 - \xi_i^{*k} - \eta_i^{*k} + \alpha_i^{*k} + \beta_i^{*k}) + \rho(\delta^k - 1 + \gamma^{*k} - v^k + \tau^{*k})}{2\rho(m_1 + 1)}$
      $\delta^{k+1} = \big(1 - \Delta^{k+1} - \gamma^{*k}\big)_+$,  $v^{k+1} = \big(\Delta^{k+1} + \tau^{*k}\big)_+$
      $\xi_i^{*k+1} = T_{-C_1/\rho}\big(s - 2 + \Delta^{k+1} - u_1^T h_i + \alpha_i^{*k}\big)$
      $\eta_i^{*k+1} = T_{-C_1/\rho}\big(s - 2 + \Delta^{k+1} + u_1^T h_i + \beta_i^{*k}\big)$
      where (a)₊ is the Euclidean projection of a onto R₊ and
      $T_\lambda(w) = \begin{cases} w + \lambda, & w > 0 \\ 0, & -\lambda \le w \le 0 \\ w, & w < -\lambda \end{cases}$
      $\alpha_i^{*k+1} = \alpha_i^{*k} + s - 2 + \Delta^{k+1} - u_1^T h_i - \xi_i^{*k+1}$
      $\beta_i^{*k+1} = \beta_i^{*k} + s - 2 + \Delta^{k+1} + u_1^T h_i - \eta_i^{*k+1}$
      $\gamma^{*k+1} = \gamma^{*k} + \Delta^{k+1} + \delta^{k+1} - 1$,  $\tau^{*k+1} = \tau^{*k} + \Delta^{k+1} - v^{k+1}$
   Stop when ∥u1ᵏ⁺¹ − u1ᵏ∥ ≤ 10⁻⁵; otherwise set k = k + 1 and go to step 2.
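As a small illustration, the u1-update in step 2.a of Table 1 is a single linear solve. The sketch below assumes H_pos and H_neg stack the augmented samples hᵢ and hⱼ row-wise; the slack and multiplier arrays follow the notation of Table 1, and the function and argument names are ours.

```python
import numpy as np

def update_u1(D, H_pos, H_neg, xi, eta, t, q, alpha, beta, gamma, tau, rho):
    """u1-update of Table 1, step 2.a:
    u1 = -rho (D + 2 rho sum_i h_i h_i^T + 2 rho sum_j h_j h_j^T)^{-1}
              ( sum_i (xi_i - eta_i + alpha_i - beta_i) h_i
              + sum_j (t_j - q_j + gamma_j - tau_j) h_j )."""
    A = D + 2.0 * rho * (H_pos.T @ H_pos) + 2.0 * rho * (H_neg.T @ H_neg)
    rhs = H_pos.T @ (xi - eta + alpha - beta) + H_neg.T @ (t - q + gamma - tau)
    return -rho * np.linalg.solve(A, rhs)
```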

3.2 ADMM for L1-2S-BFHC (L1-ADMM-BFHC)

In this subsection, we formulate our L1-norm two-sided best fitting hyperplane classifier (L1-2S-BFHC). It is known that the L1-norm is more robust to outliers and noise. To reduce the sensitivity, we replace the L2-norm regularization in (2) and (3) with the L1-norm. Here we only show the optimization problem of the first fitting hyperplane; the other problem has a similar form:

$$\begin{aligned}
\min_{(w_1,b_1,\Delta)} \ & \frac{1}{2}\|w_1\|_1^2 + C_1 \sum_{i=1}^{m_1} L_{pos}\left(w_1^T x_i + b_1\right) + C_2 \sum_{j=1}^{m_2} L_{neg}\left(w_1^T x_j + b_1\right) - H\Delta \\
\text{s.t.} \ & i \in I_+,\ j \in I_-,\ \Delta \in (0,1)
\end{aligned} \qquad (8)$$

Similarly, the above problem can be solved by ADMM, and the resulting algorithm for solving (8) is similar to the algorithm for solving (5); see Table 2. In the same way, u2 = (w2ᵀ, b2)ᵀ can also be obtained by the ADMM algorithm, and a new sample x ∈ Rⁿ is assigned in the same way as for L2-2S-BFHC.




Table 2: The algorithm for solving (8)

Input: Data set T = {(x1, y1), (x2, y2), ..., (xm, ym)}; parameters C1, C2, H, ρ, s.
Output: u1, ∆

1. Initialize ∆⁽ᵏ⁾, u1⁽ᵏ⁾, k = 0.
2. Alternately update u1⁽ᵏ⁾ and ∆⁽ᵏ⁾:
   a. Put ∆⁽ᵏ⁾ into the convex part, and update u1⁽ᵏ⁾ until the primal and dual residuals converge to zero:
      $(z_1^{k+1})_i = F_{1/\rho}\big((D u_1^k - \mu^k)_i\big)$, where (·)ᵢ denotes the i-th element of a vector and
      $F_\lambda(w) = \begin{cases} \frac{1}{1+\lambda}w, & w > 0 \\ 0, & w = 0 \\ \frac{1}{1-\lambda}w, & w < 0 \end{cases}$
      $u_1^{k+1} = -\Big(2 \sum_{i=1}^{m_1} h_i h_i^T + 2 \sum_{j=1}^{m_2} h_j h_j^T + D^T D\Big)^{-1} \Big( \sum_{i=1}^{m_1} (\xi_i^k - \eta_i^k + \alpha_i^k - \beta_i^k) h_i + \sum_{j=1}^{m_2} (t_j^k - q_j^k + \gamma_j^k - \tau_j^k) h_j - D^T(z_1^{k+1} + \mu^k) \Big)$
      $\xi_i^{k+1} = S_{C_1/\rho}\big(-1 + \Delta - (u_1^{k+1})^T h_i - \alpha_i^k\big)$,  $\eta_i^{k+1} = S_{C_1/\rho}\big(-1 + \Delta + (u_1^{k+1})^T h_i - \beta_i^k\big)$
      $t_j^{k+1} = S_{C_2/\rho}\big(1 + \Delta - (u_1^{k+1})^T h_j - \gamma_j^k\big)$,  $q_j^{k+1} = S_{C_2/\rho}\big(1 + \Delta + (u_1^{k+1})^T h_j - \tau_j^k\big)$
      $\alpha_i^{k+1} = \alpha_i^k + \xi_i^{k+1} + 1 - \Delta + (u_1^{k+1})^T h_i$,  $\beta_i^{k+1} = \beta_i^k + \eta_i^{k+1} + 1 - \Delta - (u_1^{k+1})^T h_i$
      $\gamma_j^{k+1} = \gamma_j^k + t_j^{k+1} - 1 - \Delta + (u_1^{k+1})^T h_j$,  $\tau_j^{k+1} = \tau_j^k + q_j^{k+1} - 1 - \Delta - (u_1^{k+1})^T h_j$
      $\mu^{k+1} = \mu^k + z_1^{k+1} - D u_1^{k+1}$
   b. Put u1⁽ᵏ⁾ into the concave part, and update ∆⁽ᵏ⁾ as in Table 1 (step 2.b).
   Stop when ∥u1ᵏ⁺¹ − u1ᵏ∥ ≤ 10⁻⁵; otherwise set k = k + 1 and go to step 2.
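A minimal sketch of the element-wise operator F_λ used in Table 2, together with comments indicating where the z- and μ-updates fit in step 2.a (variable names are ours):

```python
import numpy as np

def F(w, lam):
    """Element-wise operator from Table 2: w/(1+lam) if w > 0, 0 if w = 0, w/(1-lam) if w < 0."""
    w = np.asarray(w, dtype=float)
    return np.where(w > 0.0, w / (1.0 + lam), np.where(w < 0.0, w / (1.0 - lam), 0.0))

# Sketch of one pass of step 2.a (our variable names):
#   z1 = F(D @ u1 - mu, 1.0 / rho)   # z1^{k+1} = F_{1/rho}(D u1^k - mu^k)
#   ... update u1, the slacks xi, eta, t, q and their multipliers as in Table 2 ...
#   mu = mu + z1 - D @ u1            # mu^{k+1} = mu^k + z1^{k+1} - D u1^{k+1}
```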

4 Implementation of ADMM-BFHC

The above results can be extended to nonlinear classifiers in a way similar to BFHC, by considering kernel-generated surfaces instead of the hyperplanes defined in (1):

$$K(x^T, X^T)w_1 + b_1 = 0 \quad \text{and} \quad K(x^T, X^T)w_2 + b_2 = 0,$$

where X is the whole training input and K is an appropriately chosen kernel. Then, corresponding to (5) and (8), the primal problems for obtaining (w1, b1) can be expressed as

$$\begin{aligned}
\min_{(w_1,b_1,\Delta)} \ & \frac{1}{2}\|w_1\|_2^2 + C_1 \sum_{i=1}^{m_1} L_{pos}\left(f_i^1\right) + C_2 \sum_{j=1}^{m_2} L_{neg}\left(f_j^1\right) - H\Delta \\
\text{s.t.} \ & i \in I_+,\ j \in I_-,\ \Delta \in (0,1)
\end{aligned} \qquad (9)$$

and

$$\begin{aligned}
\min_{(w_1,b_1,\Delta)} \ & \frac{1}{2}\|w_1\|_1^2 + C_1 \sum_{i=1}^{m_1} L_{pos}\left(f_i^1\right) + C_2 \sum_{j=1}^{m_2} L_{neg}\left(f_j^1\right) - H\Delta \\
\text{s.t.} \ & i \in I_+,\ j \in I_-,\ \Delta \in (0,1)
\end{aligned} \qquad (10)$$

where fᵢ¹ = K(xiᵀ, Xᵀ)w1 + b1 and fⱼ¹ = K(xjᵀ, Xᵀ)w1 + b1. The primal problems for obtaining (w2, b2) have the same forms with the roles of the positive and negative samples interchanged. The solutions (wi, bi), i = 1, 2, can be obtained in the same way as in the linear case, and a new sample x ∈ Rⁿ is assigned to class k (k = +1, −1) by

$$\text{class } k = \arg\min_{i=1,2} \ |K(x^T, X^T)w_i + b_i| \big/ \sqrt{w_i^T K(X, X^T) w_i}.$$
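A minimal sketch of this nonlinear decision rule follows; kernel(A, B) is assumed to return the kernel matrix between the rows of A and B (e.g., the RBF kernel given in Section 5), and all names are ours.

```python
import numpy as np

def predict_kernel(x, X, w1, b1, w2, b2, kernel):
    """Assign x to the class whose kernel surface K(x^T, X^T) w_i + b_i = 0 is closer."""
    Kx = kernel(x[None, :], X)[0]  # kernel row K(x^T, X^T), shape (m,)
    KX = kernel(X, X)              # kernel matrix K(X, X^T), shape (m, m)
    d1 = abs(Kx @ w1 + b1) / np.sqrt(w1 @ KX @ w1)
    d2 = abs(Kx @ w2 + b2) / np.sqrt(w2 @ KX @ w2)
    return +1 if d1 <= d2 else -1
```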


In BFHC, the initialization is very important (the returned solution will be a local minimum), and [10] claimed that the initialization could be obtained by SVM or nonparallel SVMs. However, SVM may be time consuming, and it finds a separating hyperplane rather than a fitting hyperplane. In this section, we propose a new initialization, which is similar to that of k-plane clustering [13, 14]; the main difference is that the class memberships in our BFHC are fixed. Specifically, for each class we construct a fitting hyperplane wiᵀx + bi = 0, where wi ∈ Rⁿ is the weight vector, bi ∈ R is the bias, and i = 1, 2 indexes the classes. Then we solve the following problem:

$$\begin{aligned}
\min_{w_i, b_i} \ & \|X_i w_i + b_i e_i\|_2^2 \\
\text{s.t.} \ & \|w_i\|_2^2 = 1
\end{aligned} \qquad (11)$$

where Xi contains all samples of the i-th class and ei is a vector of ones of appropriate dimension, i = 1, 2. The geometric interpretation is clear: for the i-th class, the objective of (11) minimizes the sum of the (squared) distances between the samples of the i-th class and the corresponding fitting hyperplane, while the constraint normalizes the normal vector of the fitting hyperplane. The solution of problem (11) can be obtained by solving an eigenvalue problem [13].
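Following [13], this eigenvalue problem has a closed-form solution: for fixed wi the optimal bias is the negative mean of Xi wi, and substituting it back leaves a Rayleigh-quotient problem whose minimizer is the eigenvector of the centered scatter matrix with the smallest eigenvalue. A minimal sketch (the function name is ours):

```python
import numpy as np

def init_fitting_hyperplane(Xi):
    """Solve (11) for one class: w is the eigenvector of (Xi - mean)^T (Xi - mean)
    with the smallest eigenvalue, and b = -mean(Xi) @ w."""
    mean = Xi.mean(axis=0)
    Xc = Xi - mean                             # center the class samples
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    w = eigvecs[:, 0]                          # eigh returns eigenvalues in ascending order
    b = -mean @ w
    return w, b
```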

5 Experiments

In this section, we test the proposed methods, L2-ADMM-BFHC and L1-ADMM-BFHC, on both synthetic and UCI datasets to assess their performance, and we compare our results with NPSVM and 2S-BFHC. Accuracy is used to evaluate the methods: Accuracy = (TP + TN)/(TP + FP + TN + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.

For the nonlinear (kernelized) classifiers, we only use the RBF kernel, K(x, x') = exp(−∥x − x'∥²/σ). For the choice of parameters, Ci, i = 1, 2, 3, 4, and H are selected from the set {2⁻⁴, 2⁻³, 2⁻², 2⁻¹, 2⁰, 5¹, 5², 5³}, and ρ is selected from the set {0.1, 1, 10, 100, 1000}; all of them are obtained by the standard 10-fold cross-validation technique [1]. We also fix some parameters in this paper: s is set to -0.2 and ∆ is set to 0.3.
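For reference, a minimal sketch of the RBF kernel matrix used here, which can also be plugged into the nonlinear decision-rule sketch of Section 4 (the parameterization with σ follows the formula above; the function name is ours):

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel matrix with entries K[i, j] = exp(-||A_i - B_j||^2 / sigma)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma)
```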

5.1 Experiments on Synthetic Data

In order to see the efficiency of our L1-2S-BFHC, we first consider a simple two-dimensional 'Cross-Plane' dataset, which was tested in [15, 16] to show that nonparallel hyperplane classifiers handle cross-planes-type data much better than parallel ones. We now show that our L1-2S-BFHC can also handle this type of data well and, moreover, is insensitive to outliers. Specifically, Class +1 and Class -1 are represented by 35 red plus signs and 35 blue stars, respectively, as shown in Figure 1(a). Figure 1(d) shows the original dataset with two outliers, and the two outliers are marked. The proximal lines and corresponding predictions




obtained by 2S-BFHC and L1-2S-BFHC on the original dataset and on the dataset with outliers are shown in Figure 1(b, c, e, f), respectively, where the circled points mark the misclassified samples. It is easy to see that the classification result of our L1-2S-BFHC is obviously better than that of 2S-BFHC. Specifically, there are eight misclassified points (accuracy 88.89%) when 2S-BFHC is employed, while L1-2S-BFHC has no misclassified points and its accuracy is 100%. In addition, some of the points misclassified by 2S-BFHC are not even outliers; they are misclassified because of the influence of the two outliers. On the other hand, these two outliers have little influence on L1-2S-BFHC. Therefore, from Figure 1 it is evident that L1-2S-BFHC is more robust to outliers than 2S-BFHC.

Figure 1: An example learned by 2S-BFHC and L1-2S-BFHC, respectively; the misclassified samples are circled. (a) Original dataset. (b) Proximal lines and corresponding prediction obtained by 2S-BFHC. (c) Proximal lines and corresponding prediction obtained by L1-2S-BFHC. (d) Original dataset with outliers. (e) Proximal lines and corresponding prediction obtained by 2S-BFHC. (f) Proximal lines and corresponding prediction obtained by L1-2S-BFHC.

5.2 Experiments on UCI Datasets

To further evaluate the performance of the proposed L2-ADMM-BFHC and L1-ADMM-BFHC, we choose six datasets from the UCI machine learning repository. For every dataset, we repeat the experiment 20 times with appropriate parameters selected by 10-fold cross validation; the final accuracy is the average of the accuracies obtained in these runs.


The results are summarized in Table 3, where the best accuracy for each dataset is marked with an asterisk (*). Table 3 shows that L1-ADMM-BFHC achieves better accuracies, while L2-ADMM-BFHC achieves competitive accuracies in most cases, which demonstrates the effectiveness of our methods.

Table 3: The results of linear (a) or nonlinear (b) classifiers on UCI datasets in terms of accuracy (mean ± std)

Datasets        NPSVM Acc (%)   2S-BFHC Acc (%)   L2-ADMM-BFHC Acc (%)   L1-ADMM-BFHC Acc (%)
Haberman(a)     73.75±0.32      73.15±0.67        73.41±1.40             73.82±1.17*
Spect(a)        78.90±1.46      78.41±1.42        74.15±2.02             79.64±0.26*
Housevotes(a)   95.76±0.49*     93.00±1.42        90.49±2.69             87.11±1.06
German(a)       69.08±1.00      71.20±1.00*       69.76±2.25             70.01±0.00
Haberman(b)     72.63±1.26      70.38±2.87        72.43±2.37             73.32±1.58*
hepatitis(b)    79.21±0.60*     78.63±2.22        79.11±1.00             69.38±3.65
Hourse(b)       57.98±0.63      75.00±1.23        76.70±1.79*            60.34±0.01
Spect(b)        79.59±0.24      79.56±1.29        76.38±1.70             79.68±0.22*
Housevotes(b)   69.28±0.88      95.48±0.41*       72.79±3.94             87.66±1.23

6 Conclusions

In this paper, we propose a robust L1-norm two-sided best fitting hyperplane classifier and use the alternating direction method of multipliers for both L2-2S-BFHC and L1-2S-BFHC. Our work shows the strong potential of the ADMM method for BFHC, and the results on the synthetic experiment show that L1-2S-BFHC is more robust to outliers. Since ADMM can easily be applied in a distributed fashion, we can subsequently design a parallelized ADMM for 2S-BFHC according to the characteristics of the datasets. In addition, ADMM can also be applied to other types of SVM models with similar techniques and procedures.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (Nos. 61603338, 11501310 and 11371365), the Zhejiang Provincial Natural Science Foundation of China (Nos. LY15F030013, LQ17F030003 and LY16A010020), the Natural Science Foundation Project of CQ CSTC (cstc2014jcyjA40011) and the Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ1400513).




References

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, 2001.
[2] S. Abe, Support Vector Machines for Pattern Classification. Springer London, 2005.
[3] Y. X. Li, Y. H. Shao, L. Jing, and N. Y. Deng, "An efficient support vector machine approach for identifying protein S-nitrosylation sites," Protein and Peptide Letters, vol. 18, no. 6, p. 573, 2011.
[4] Y. H. Shao, Z. Wang, Z. M. Yang, and N. Y. Deng, "Weighted linear loss support vector machine for large scale problems," Procedia Computer Science, vol. 31, pp. 639–647, 2014.
[5] B. Schölkopf, K. Tsuda, and J. P. Vert, "Support vector machine applications in computational biology," pp. 71–92, 2004.
[6] O. L. Mangasarian and E. W. Wild, "Multisurface proximal support vector machine classification via generalized eigenvalues," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 69–74, 2006.
[7] Jayadeva, R. Khemchandani, and S. Chandra, "Twin support vector machines for pattern classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 5, pp. 905–910, 2007.
[8] Y. Tian, Z. Qi, X. Ju, Y. Shi, and X. Liu, "Nonparallel support vector machines for pattern classification," IEEE Transactions on Cybernetics, vol. 44, no. 7, p. 1067, 2013.
[9] Y. Tian and Y. Ping, "Large-scale linear nonparallel support vector machine solver," Neural Networks, vol. 138, no. 2, pp. 114–119, 2014.
[10] H. Cevikalp, "Best fitting hyperplanes for classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2016.
[11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[12] G. B. Ye and X. Xie, "Split Bregman method for large scale fused Lasso," Computational Statistics and Data Analysis, vol. 55, no. 4, pp. 1552–1569, 2010.
[13] P. S. Bradley and O. L. Mangasarian, "k-plane clustering," Journal of Global Optimization, vol. 16, no. 1, pp. 23–32, 2000.
[14] Z. Wang, Y. H. Shao, L. Bai, and N. Y. Deng, "Twin support vector machine for clustering," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, p. 2583, 2015.
[15] Y. H. Shao, C. H. Zhang, X. B. Wang, and N. Y. Deng, "Improvements on twin support vector machines," IEEE Transactions on Neural Networks, vol. 22, no. 6, pp. 962–968, 2011.
[16] Y. H. Shao, W. J. Chen, and N. Y. Deng, "Nonparallel hyperplane support vector machine for binary classification problems," Information Sciences, vol. 263, no. 3, pp. 22–35, 2014.
