ν-projection twin support vector machine for pattern classification

ν-projection twin support vector machine for pattern classification

ν-projection twin support vector machine for pattern classification Communicated by Dr Yingjie Tian Journal Pre-proof ν-projection twin support vec...

3MB Sizes 0 Downloads 120 Views

ν-projection twin support vector machine for pattern classification

Communicated by Dr Yingjie Tian

Journal Pre-proof

ν-projection twin support vector machine for pattern classification Wei-Jie Chen, Yuan-Hai Shao, Chun-Na Li, Ming-Zeng Liu, Zhen Wang, Nai-Yang Deng PII: DOI: Reference:

S0925-2312(19)31337-2 https://doi.org/10.1016/j.neucom.2019.09.069 NEUCOM 21324

To appear in:

Neurocomputing

Received date: Revised date: Accepted date:

9 June 2018 1 March 2019 17 September 2019

Please cite this article as: Wei-Jie Chen, Yuan-Hai Shao, Chun-Na Li, Ming-Zeng Liu, Zhen Wang, Nai-Yang Deng, ν-projection twin support vector machine for pattern classification, Neurocomputing (2019), doi: https://doi.org/10.1016/j.neucom.2019.09.069

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.

Highlights • A nonparallel projection classifier (ν-PTSVM) with theoretically sound parameter ν is proposed. • ν controls the bounds of fraction of both support vectors and margin-error instances. • ν-PTSVM skillfully avoids the matrix inverse operation. • ν-PTSVM behaves consistent in the linear and nonlinear cases. • Experimental results confirm the superiority of the proposed ν-PTSVM.

1

ν-projection twin support vector machine for pattern classification Wei-Jie Chena,b , Yuan-Hai Shaoc,∗, Chun-Na Lia , Ming-Zeng Liud , Zhen Wange , Nai-Yang Dengf a

Zhijiang College, Zhejiang University of Technology, Hangzhou, 310024, P.R.China Centre for Artificial Intelligence, University of Technology Sydney, NSW, 2007, Australia c School of Economics and Management, Hainan University, Haikou, 570228, P.R.China d School of Mathematics and Physics Science, Dalian University of Technology, Dalian, 124221, P.R.China e School of Mathematical Sciences, Inner Monggolia University, Hohhot 010021, PR China f College of Science, China Agricultural University, Beijing, 100083, P.R.China b

Abstract In this paper, we improve the projection twin support vector machine (PTSVM) to a novel nonparallel classifier, termed as ν-PTSVM. Specifically, our ν-PTSVM aims to seek an optimal projection for each class such that, in each projection direction, instances of their own class are clustered around their class center while keep the other class instances at least one distance away from such center. Different from PTSVM, our ν-PTSVM enjoys the following characteristics: (i) ν-PTSVM is equipped by a more theoretically sound parameter ν, which can be used to control the bounds of fraction of both support vectors and margin-error instances. (ii) By reformulating the least-square loss of within-class instances in the primal problems of ν-PTSVM, its dual problems no longer involve the timecostly matrix inversion operation. (iii) ν-PTSVM behaves consistent between its linear and nonlinear cases. Namely, the kernel trick can be applied directly to νPTSVM for its nonlinear extension. Experimental evaluations on both synthetic and real-world datasets demonstrate the feasibility and efficacy of the proposed approach. Keywords: Twin support vector machine, Projection twin support vector machine, Nonparallel classifier, Kernel trick, Pattern classification.



Corresponding author. Email addresses: [email protected] (Wei-Jie Chen), [email protected]

Preprint submitted to Elsevier

September 28, 2019

1

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

1. Introduction Support vector machine (SVM) [1–5] is a preeminent maximum-margin learning paradigm for data classification. The implementation of structural risk minimization principle, which controls the upper bound of generalization error, makes SVM enjoy an excellent generalization ability. Furthermore, the kernel trick allows SVM to deal with nonlinear problems effectively. As a powerful tool, SVM has been successfully applied in a wide variety of fields ranging from scene classification [6], energy system [7], fault diagnosis [8], bioinformatics [9] to power applications [10]. The main idea of SVM is to maximize the margin between two parallel hyperplanes by solving a quadratic programming problem (QPP). However, such parallel requirement in SVM may restrict its performance on some heterogeneous distribution learning tasks [2, 11, 12], such as “XOR” problem. To alleviate this challenge, during the past years, many nonparallel SVM classifiers have been brought forward in literatures [13, 14]. The pioneering work is the generalized eigenvalue proximal SVM (GEPSVM) proposed by Mangasarian and Wild [11]. GEPSVM relaxes the parallel requirement of hyperplanes generated by SVM, and attempts to seek a pair of nonparallel hyperplanes by solving eigenvalue problems. Subsequently, inspired by GEPSVM, Jayadeva [12] proposed a novel nonparallel method for binary classification, named as Twin SVM (TWSVM). It aims to construct a pair of nonparallel hyperplanes, one for each class, such that each hyperplane is closer to its own class and at least one distance far away from the other class. Due to the idea of solving two smaller QPPs rather than a larger one in SVM, it entails TWSVM enjoy around four times faster learning efficiency than SVM [12, 14]. The advantage of GEPSVM and TWSVM bring much efforts to their various improvements, including least squares version of TWSVM (LSTSVM) [15], structural risk minimization version of TWSVM (TBSVM) [16], robust version of TWSVM (RTWSVM) [17], wavelet TWSVM [18], ν-TWSVM [19], multi-label TWSVM (MLTSVM) [20], multi-class TWSVM with information granulation (GWLMBSVM) [21], ITBSVM [22], nonparallel SVM (NPSVM) [23, 24], and so on [25–31]. Recently, Mehrkanoon et al. [32] introduced a general framework for nonparallel SVMs, which involves a regularization term, a scatter (within-class) loss and a misclassification (between-class) loss. By implementing the different within-class or between-class loss for empirical risks, many stateof-the-art of nonparallel models, such as GEPSVM [11], TWSVM [12], etc, will (Yuan-Hai Shao ), [email protected] (Chun-Na Li), [email protected] (Ming-Zeng Liu), [email protected] (Zhen Wang), [email protected] (Nai-Yang Deng)

3

36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69

70 71 72

fall into this framework. Then, under this elegant nonparallel framework [32], authors proposed three extensions of TWSVM: LS-LS loss, LS-Hinge loss and LS-Pinball loss. Moreover, in contrast with original TWSVM, the kernel trick can be applied directly in these extensions, which lead to the consistency between linear and nonlinear models. Different from GEPSVM and TWSVM, whose idea is to construct nonparallel hyperplanes, Ye [33, 34] proposed a novel multi-weight vector projection SVM (MVSVM) in the light of linear discriminant analysis (LDA) [35] . It seeks for a pair of optimal weight-vector projections by solving eigenvalue problems. Soon afterwards, Chen [36] formulated the MVSVM model into two related SVMtype QPPs, and further proposed a projection twin SVM (PTSVM). The idea of PTSVM is to learn an optimal discriminant subspace for each class, such that, in the new subspace, it minimizes the within-class scatter while keeps the between-class distance no less than one. Thus, after PTSVM projects instances into such a discriminant subspace, the projected instances can preserve more discriminant information than the original ones. Experimental results in [36] show the effectiveness of PTSVM over TWSVM and SVM on many datasets. For related works on extensions of PTSVM, see [37–42]. However, the formulations of the above PTSVM exist several demerits: (1) The penalty parameter in PTSVM has less theoretical interpretation but just the trade-off weight of different loss terms. (2) Although PTSVM decomposes the problem into two smaller QPPs, their dual problems still require the timecostly matrix inversion for constructing the Hessian matrix during the learning procedure. (3) PTSVM utilizes the extra kernel-generated technique [12, 38] to extend their model to nonlinear case. However, such a kernel trick will cause the inconsistency between linear and nonlinear models. That is, when applying the linear kernel for nonlinear PTSVM, it is not equivalent to the linear ones. This motivates the rush towards a new classifier. Thus, in the light of ν-SVM [5] and the nonparallel framework [32], we propose an improvement of PTSVM, termed as ν-PTSVM. The main contributions are summarized as: • A meaningful parameter ν is introduced to make model possess a better theoretical interpretation than PTSVM. Especially, ν controls the lower bound of fraction of support vectors and upper of fraction of margin-error instances. (Section 3.1 and Section 3.3) • To remedy the time-costly matrix inverse operation, we further reformulate the within-class least-square loss in ν-PTSVM by a new slack vector η. Consequently, its dual problems no longer involve the matrix inversion, 4

73 74

75 76 77 78

79 80 81

which makes ν-PTSVM enjoy more efficient training procedure. (Section 3.1 and Section 3.2) • The dual formulations of ν-PTSVM behave in consistency between linear and nonlinear cases. That is to say, there are only inner products appear in its dual problems. Thus, the kernel trick can be applied directly as standard SVM. (Section 4) • Last but not the least, the feasibility and effectiveness of ν-PTSVM are validated by extensive experimental results on several synthetic and realworld datasets. (Section 5)

86

The remainder of this paper is organized as follows. Section 2 briefly introduces the notations and related works. Section 3 proposes the formulation of our ν-PTSVM. The properties of ν-PTSVM are also discussed in Section 3. The nonlinear version of ν-PTSVM is extended in Section 4. Experimental results are described in Section 5, and Section 6 gives concluding remarks and future works.

87

2. Backgrounds

82 83 84 85

88 89 90

91 92 93 94 95 96 97 98

In this section, we first describe the notations used throughout the paper. Then, briefly introduce SVM [1, 2], ν-SVM [5], TWSVM [12, 16] and PTSVM [36, 37]. 2.1. Notations Upper (lower) bold face letters are used for matrices (column vectors). All vectors will be column vectors unless transformed to row vectors by a prime superscript (·)0 . A vector of zeros of arbitrary dimensions is represented by 0. In addition, we denote e as a vector of ones and I as an identity matrix of arbitrary dimensions. Moreover, k · k stands for the L2 -norm. Consider a binary classification learning task in the n-dimensional space Rn . Denote the set of training data as T = {(xi , yi )|1 ≤ i ≤ l} ∈ (X × Y)l ,

99 100 101 102 103 104

(1)

where xi ∈ X ⊂ Rn is an input instance with its associated label yi ∈ Y = {1, −1}. Furthermore, organize l1 instances of the positive class by matrix A ∈ Rl1 ×n and l2 instances of the negative class by matrix B ∈ Rl2 ×n . For convenience, use Ik to denote the set of indexes such that if an instance xi belongs to the k-th class, i.e., i ∈ Ik , where k = 1 or 2 corresponds to the positive or negative class. 5

105 106 107 108 109

2.2. Support vector machine and its variant ν-SVM As an excellent maximum-margin learning paradigm for data classification, SVM [1, 2] is principled the structural risk minimization to minimize the upper bound of generalization error. The basic idea is to seek an optimal separating hyperplane f (x) : w0 x + b = 0,

110 111 112

such that it maximizes the decision margin between different classes, where w ∈ Rn is a normal vector and b ∈ R is a bias term. To measure its empirical risk, SVM considers the following Hinge loss function Remp =

113 114

(2)

l X i

max(0, 1 − yi (w0 xi + b)).

(3)

By introducing a regularization term kwk2 , the primal problem of SVM can be expressed as l X 1 min kwk2 + c ξi (4) w,b,ξ 2 i=1

s.t.

115 116 117 118 119 120 121 122

yi (w0 xi + b) ≥ 1 − ξi , ξi ≥ 0,

where ξ is a non-negative slack vector used to measure the misclassification error, and c > 0 is a penalty parameter that balances the trade-off between the maximization of margin and the minimization of training error. Apparently, the meaning of c in problem (4) is qualitatively clear. That is, the larger c implies the more attention paid to the training error. However, it lacks in the theoretical interpretation. For that reason, a modified version, named as ν-SVM [5], is proposed by introducing a meaningful margin regularization term νρ, which leads to the following optimization problem l

min

w,b,ρ,ξ

s.t. 123 124 125

1 1X kwk2 − νρ + ξi 2 l i=1

(5)

yi (w0 xi + b) ≥ ρ − ξi , ξi ≥ 0, ρ ≥ 0,

where ν ∈ (0, 1] is a predefined parameter. Note that a nice property of ν is connected to the fraction of support vectors and that of margin-error instances. More details about ν-SVM can be refer to [2, 5].

6

126 127 128

2.3. Twin support vector machine TWSVM [12, 16] relaxes the parallel requirement in SVM [1, 2], and attempts to seek a pair of nonparallel hyperplanes f1 (x) : w10 x + b1 = 0 and f2 (x) : w20 x + b2 = 0,

129 130 131 132 133

134

135 136

such that each hyperplane is closer to instances of its own class and as far as possible from the other to some extent. For this purpose, TWSVM solves a pair of relatively smaller QPPs, instead of a larger one in SVM. Therefore, the learning efficiency of TWSVM is much faster than that of SVM [12]. To measure its empirical risks, TWSVM considers the following two loss functions = Remp 1

X 1X 2 (w10 xi + b1 ) + c1 max(0, 1 + (w10 xj + b1 )), 2

(7)

Remp = 2

X 1 X 2 (w20 xj + b2 ) + c2 max(0, 1 − (w20 xi + b2 )), 2

(8)

and

i∈I1

s.t.

w2 ,b2 ,η

s.t.

140 141 142 143

X 1X 2 (w10 xi + b1 ) + c1 ξj , 2

i∈I1 −(w10 xj

j∈I2

(9)

+ b1 ) ≥ 1 − ξj , ξj ≥ 0,

and min

139

i∈I1

where c1 , c2 > 0 are penalty parameters. Then, the primal problems of TWSVM can be expressed as w1 ,b1 ,ξ

138

j∈I2

j∈I2

min

137

(6)

X 1 X 2 (w20 xj + b2 ) + c2 ηi , 2 j∈I2

i∈I1

(10)

w20 xi + b2 ≥ 1 − ηi , ηi ≥ 0,

where ξ and η are nonnegative slack vectors. We now consider the geometric interpretation of the above problems. Take problem (9) for example, its objective function makes positive instances xi∈I1 proximal to the hyperplane w10 x+b1 = 0, while constraints make negative instances xi∈I2 bound in the hyperplane w10 x + b1 = −1. To obtain solutions to problems (9) and (10), we derive their dual problems as min α1

s.t.

1 0 α G(H 0 H)−1 G0 α1 − e02 α1 2 1 0 ≤ α1 ≤ c2 e2 ,

7

(11)

144

and min α2

s.t. 145 146 147 148 149

1 0 α H(G0 G)−1 H 0 α2 − e01 α2 2 2 0 ≤ α2 ≤ c1 e1 ,

where H = [A e1 ] and G = [B e2 ]. After getting solutions (w1 , b1 ) and (w2 , b2 ) from solving problems (11) and (12), an unseen instance x is assigned to the positive or negative class, depending upon which nonparallel hyperplane (6) it is closer to. Namely, the decision function of TWSVM can be constructed as y = arg min

k={1,2}

150 151 152 153 154 155

156

157 158 159 160 161 162 163

(12)

|fk (x)| . |wk |

(13)

2.4. Projection twin support vector machine and its variant RPTSVM PTSVM [36] aims to seek a pair of nonparallel projections w1 and w2 such that, in each projection direction, the within-class variance of its own class instances is minimized while the other class instances scatter away as much as possible. To measure its empirical risks, PTSVM considers the following two loss functions Remp = 1

X 1X 2 (w10 xi − w10 m1 ) + c3 max(0, 1 − w10 (xj − m1 )), 2

(14)

Remp = 2

X 1 X 2 (w20 xj − w20 m2 ) + c4 max(0, 1 + w20 (xi − m2 )), 2

(15)

i∈I1

j∈I2

and

j∈I2

i∈I1

P P where c3 , c4 > 0 are penalty parameters, m1 = l11 i∈I1 xi and m2 = l12 j∈I2 xj are the center of the positive and negative class, respectively. However, PTSVM only concerns on the minimization of empirical risks Remp 1 and Remp 2 , which may lead to over-fitting in practice. To alleviate the above issue, Shao [37] further proposed an improvement of PTSVM, whereas the structural risk minimization is implemented by minimizing the regularization term kwk2 . Then, the primal problems of PTSVM can be formulated as min

w1 ,ξ1

s.t.

X 1X c1 2 kw1 k2 + (w10 xi − w10 m1 ) + c3 ξ1j 2 2

w10 (xj

i∈I1

j∈I2

− m1 ) ≥ 1 − ξ1j , ξ1j ≥ 0,

8

(16)

164

and min

w2 ,ξ2

s.t. 165 166

X c2 1 X 2 (w20 xj − w20 m2 ) + c4 kw1 k2 + ξ2i 2 2 j∈I2

−w20 (xi − m2 ) ≥ 1 − ξ2i , ξ2i ≥ 0,

X

i∈Ik

min

w1 ,ξ1

s.t.

min s.t.

α1

s.t.

α2

s.t.

175 176

177 178

c2 1 kw2 k2 + w20 S2 w2 + c4 e01 η, 2 2 −(A − e1 m02 )w2 ≥ e1 − ξ2 , ξ2 ≥ 0,

(20)

1 0 α (B − e2 m01 )(S1 + c1 I)−1 (B − e2 m01 )0 α1 − e02 α1 2 1 0 ≤ α1 ≤ c3 e2 ,

(21)

1 0 α (A − e1 m01 )(S2 + c2 I)−1 (A − e1 m02 )0 α2 − e01 α2 2 2 0 ≤ α2 ≤ c4 e1 ,

(22)

and min

174

(19)

as min

173

c1 1 kw1 k2 + w10 S1 w1 + c3 e02 ξ1 , 2 2 (B − e2 m01 )w1 ≥ e2 − ξ1 , ξ1 ≥ 0,

To obtain solutions to problems (19) and (20), we derive their dual problems

169

172

(18)

and w2 ,ξ2

171

(xi − mk )(xi − mk )0 , k = 1, 2.

Then, problems (16) and (17) can be rewritten as

167

170

(17)

where ci > 0, i = 1, . . . , 4 are penalty parameters, ξ1 and ξ2 are nonnegative slack vector. Define the within-scatter matrix as Sk =

168

i∈I1

Remark 1. It is worth noting that optimizing the dual problems (21) and (22) requires time-costly matrix inversion operations (S1 + c1 I)−1 and (S2 + c2 I)−1 . In fact, their complexity is O(n3 ) for linear case or O(l3 ) for nonlinear case, where n is the number of features and l is the number of instances. Thus, computing such a matrix inversion is a very heavy burden for the whole training procedure. After getting solutions w1 and w2 , an unseen instance x is assigned to class y, depending on which projection centers it is nearest to, i.e., the decision function 9

179

is y = arg min

k={1,2}

|wk0 (x − mk )|.

180

3. ν-projection twin support vector machine

181

3.1. Model Formulation

182 183 184 185 186 187 188 189

From problems (16) and (17) in PTSVM, it can be found that penalty parameters c3 and c4 do lack in the theoretical interpretation but just balance empirical risks terms. To overcome this issue, inspired by ν-SVM [5], we introduce a more meaningful parameter ν for each projection, and further propose a ν-projection twin support vector machine (ν-PTSVM) for classification. More specifically, the parameter ν can effectively control the number of support vectors and marginerror instances. Formally, our ν-PTSVM can be formulated as the following optimization problems min

w1 ,ξ1 ,η1 ,ρ1

s.t.

190

w2 ,ξ2 ,η2 ,ρ2

s.t.

192 193 194

195 196 197 198 199 200 201

1X 2 1 X c1 η1i + ξ1j + kw1 k2 − ν1 ρ1 2 l2 2

(24)

c2 1 X 2 1 X η2j + ξ2i + kw2 k2 − ν2 ρ2 2 l1 2

(25)

i∈I1

j∈I2

w10 (xi − m1 ) = η1i , w10 (xj − m1 ) ≥ ρ1 − ξ1j , ξ1j ≥ 0, ρ1 > 0,

and min

191

(23)

j∈I2

i∈I1

w20 (xj − m2 ) = η2j , −w20 (xi − m2 ) ≥ ρ2 − ξ2i , ξ2i ≥ 0, ρ1 > 0,

where v1 , v2 ∈ (0, 1] are penalty parameters, ξ1 , ξ2 and η1 , η2 are nonnegative slack vectors. To deliver the mechanism of ν-PTSVM, we now give the following analysis and geometrical explanation for problem (24), which is illustrated in Fig.1. • For the first term in the objective function and the first constraint, the leastsquare loss function (w10 xi − w10 m1 )2 is used to cluster the positive class instance xi around the positive center m1 in the projection space, where i ∈ I1 . As shown in Fig.1, each positive instance xi (red “plus”) has its corresponding projection w10 xi (red “dot”) on the direction w1 . Minimizing this term makes the projected positive instance w10 xi (red “dot”) locate as near as possible to the projected positive center w10 m1 (red “solid circle”). 10

4 3.5 3 2.5 2 1.5 1 0.5 0

0

0.5

1

1.5

2

2.5

3

3.5

4

Figure 1: Geometrical interpretation of problem (24) for ν-PTSVM in R2 .

202 203

204 205 206 207 208 209 210 211 212

213 214 215 216

217 218 219

Otherwise, a slack variable ηi in the first constraint is introduced to measure its error. • The second term in the objective function and the second constraint employ the Hinge loss function max{0, w10 xj − w10 m1 − ρ1 } to keep the negative class instance xj away from the positive class in the projection space, where j ∈ I2 . As shown in Fig.1, each negative instance xj (blue “cross”) has its corresponding projection w10 xj (blue “dot”) on the direction w1 . Optimizing this term leads to the projected negative instance w10 xj (blue “dot”) being at least ρ1 (blue “line”) distance from the projected positive center w10 m1 (red “solid circle”). Otherwise, a slack variable ξj is utilized to measure the error when the second constraint is violated. • The third term is the L2 -norm regularization of projection w1 . Minimizing it aims to regulate the model complexity of ν-PTSVM and avoid overfitting. The parameter c1 is used to balance the model regularization and empirical risks. • The last term ν1 ρ1 is the margin regularization term, which is introduced to make our model be more flexible and interpretable. More specifically, all negative instances xj∈I2 are separated from the positive center m1 by 11

projection w1 with the margin ρ1 /kw1 k2 (margin between the black and blue “dotted lines” in Fig.1). It relaxes the over-restricting requirements in the constraint of PTSVM (margin should be 1/kw1 k2 ) via the optimal ρ1 , resulting in the better generalization ability. Moreover, compared with PTSVM, the parameter ν1 in problem (24) has a better theoretical interpretation than the penalty parameter c1 in problem (16), which is utilized to control the bounds of fraction of support vectors and margin-error instances (details can be referred to section 3.3).

220 221 222 223 224 225 226 227

228 229 230 231

The geometrical explanation for problem (25) is similar. For the sake of simplicity, let Ak = {xi − mk }i∈I1 ∈ Rl1 ×n and Bk = {xj − mk }j∈I2 ∈ Rl2 ×n , where k = 1 or 2. Then, the matrix formulations of problems (24) and (25) can be expressed as

min

w1 ,η1 ,ξ1 ,ρ1

s.t.

232

min s.t.

234

(26)

1 0 1 c2 η η2 + e01 ξ2 + kw2 k2 − ν2 ρ2 , 2 2 l1 2 B1 w2 = η2 −A2 w2 ≥ ρ2 e1 − ξ2 , ξ2 ≥ 0, ρ2 > 0.

(27)

and w2 ,η2 ,ξ2 ,ρ2

233

1 0 1 c1 η η1 + e02 ξ1 + kw1 k2 − ν1 ρ1 , 2 1 l2 2 A1 w1 = η1 B1 w1 ≥ ρ1 e2 − ξ1 , ξ1 ≥ 0, ρ1 > 0,

In what follows, we will discuss solutions of problems (26) and (27). 3.2. Model optimization

236

To obtain solutions to problems (26) and (27), we first derive their dual problems by Theorem 1.

237

Theorem 1. Optimization problems

235

min

u1 ,α1

s.t.

   1 0 A1 A01 + c1 I B1 A01 u1 , (u1 α01 ) A1 B10 B1 B10 α1 2 e2 0 ≤ α1 ≤ , e02 α1 = ν1 , l2

12

(28)

238

and min

u2 ,α2

s.t. 239

240 241 242

   1 0 u2 B2 B20 + c2 I −B2 A02 , (u2 α02 ) α2 −A2 B20 A2 A02 2 e1 0 ≤ α2 ≤ , e01 α2 = ν2 , l1

(29)

are the dual problems of (26) and (27), respectively. Proof. Here, we only give the proof of problem (28). Introduce non-negative Lagrange multipliers α1 , β1 and γ1 to constrains of problem (26), then its Lagrangian function is built as L(Ξ) =

c1 1 0 1 η1 η1 + e02 ξ1 + kw1 k2 − ν1 ρ1 2 l2 2 −u01 (A1 w1 − η1 ) − α01 (B1 w1 − ρ1 e2 + ξ) −β10 ξ − γ1 ρ1 ,

(30)

where Ξ = {w1 , η1 , ξ1 , ρ1 , u1 , α1 , β1 , γ1 }. According to Karush-Kuhn-Tucker (KKT) conditions [2, 43], the Lagrangian function (30) has to be maximized with its dual variables u1 , α1 , β1 , γ1 , meanwhile minimized with its primal variables w1 , η1 , ξ1 , ρ1 . Differentiate L(Ξ) with respect to w1 , η1 , ξ1 , ρ1 , then optimality conditions of problem (26) are obtained by ∇Lw1 = c1 w1 − A01 u1 − B10 α1 = 0, ∇Lη1 = η1 + u1 = 0, e2 ∇Lξ1 = − α1 − β1 = 0, l2 ∇Lρ1 = −ν1 + e02 α1 − γ = 0, α01 (B1 w1

(33) (34) (35)

= 0,

(36)

γ1 ρ1 = 0,

(37)

From (31) and (32), we have w1 =

244

(32)

− ρ1 e2 + ξ) = 0. β10 ξ

243

(31)

1 (A0 u1 + B10 α1 ) and η1 = −u1 . c1 1

(38)

Since β ≥ 0, from (33), we derive 0≤α≤ 13

e2 . l2

(39)

245 246 247

According to the margin variable ρ1 > 0 in practice and the complementary slackness of the KKT condition (37), we have γ1 = 0. Then, putting it into (34), obtain e02 α1 = ν1 . (40) Finally, substituting (38) into the Lagrangian function (30) and using KKT conditions (31)-(37), the dual problem of (26) can be formulated as min

u1 ,α1

s.t. 248

249 250 251 252 253

254 255 256

   1 0 A1 A01 + c1 I B1 A01 u1 0 (u α ) , A1 B10 B1 B10 α1 2 1 1 e2 , e02 α1 = ν1 . 0 ≤ α1 ≤ l2

In a similar way, we can derive the dual problem of (27) as problem (29).



Remark 2. Obvious, compared with problems (21) and (22) in PTSVM, the dual problems (28) and (29) of our ν-PTSVM no longer contain the computation of matrix inversion. Hence, it leads to a more efficient learning procedure than PTSVM. More importantly, such dual formulations can be easily extended to the nonlinear case, which will be further discussed in Section 4. Once solving the dual problems (28) and (29), we can achieve solutions to the primal problems (26) and (27) via the following Proposition 1 according to KKT conditions. Proposition 1. Suppose that (u1 , α1 ) and (u2 , α2 ) are solutions to the dual problems (28) and (29), respectively. If there exists one component of α1 such that α1i ∈ (0, 1/l2 ) for problem (28), and that of α2 such that α2j ∈ (0, 1/l1 ) for problem (29), then solutions (w1 , ρ1 ) and (w2 , ρ2 ) to the primal problems (26) and (27) can be formulated by 1 (A0 u1 + B10 α1 ) and ρ1 = w10 (xi − m1 ), c1 1

(41)

1 (B 0 u2 − A02 α2 ) and ρ2 = −w20 (xj − m2 ), c2 2

(42)

w1 = w2 = 257

258 259 260 261

where xi and xj are instances corresponding to α1i and α2j , respectively. Proof. We now focus on the derivation of solution (w1 , ρ1 ) to problem (28). According to the KKT condition (31), we can obtain w1 = c11 (A01 u1 +B10 α1 ) directly. Suppose that the value of the i-th component of α1 is α1i ∈ (0, 1/l2 ). Then, as for the margin variable ρ1 , substituting α1i into (33), we have its corresponding 14

262 263 264

β1i > 0. Furthermore, from the condition (36), we have ξ1i = 0. In light of fact that α1i (w10 (xi − m1 ) − ρ1 + ξ1i ) = 0, we obtain ρ1 = w10 (xi − m1 ). In similar way, we can derive the solution (w2 , ρ2 ) to problem (29) as in (42). 

266

In summary, the whole procedure of linear ν-PTSVM is established in Algorithm 1.

267

Algorithm 1 The procedure of linear ν-PTSVM

265

268 269 270 271 272 273

274

275 276

Input the training set T = {(xi , yi )|1 ≤ i ≤ m}, where xi ∈ Rn and yi = {1, −1}. 2: Choose parameters c1 , c2 > 0 and ν1 , ν2 ∈ (0, 1]. 3: Construct Ak and Bk and solve the dual problems (28) and (29) by QPP solver, respectively. Then get their solutions (u1 , α1 ) and (u2 , α2 ). 4: Construct the following decision functions according to Proposition 1 as

1:

f1 (x) = w10 d1 (x) =

1 0 (u A1 d1 (x) + α1 B10 d1 (x)) c1 1

(43)

f2 (x) = w20 d2 (x) =

1 0 (u B2 d2 (x) − α2 B20 d2 (x)) c2 2

(44)

and

where dk (x) = x − mk . 5: For an unseen instance x, assign it to class y by y = arg min

k={1,2}

277 278 279

280 281 282 283

284 285 286

287 288 289

|fk (x)|.

(45)

3.3. Analysis We now turn to discuss some properties of ν-PTSVM. Because there are similar conclusions for problems (26) and (27), we only focus on problem (26). 3.3.1. The support vector and its property Suppose that α1 = (α11 , · · · , α1l2 )0 is a solution to the dual problem (28), whose value is restricted in the interval [0, 1/l2 ]. Let us first give the definition of support vector. Definition 1. (Support vector) The negative instance xj∈I2 is said to a support vector if its corresponding component α1j of α1 is nonzero, i.e., α1j 6= 0, otherwise it is a non-support vector. Apparently, the value of α1j determines whether the negative instance xj is a support vector. In what follows, we give a geometric analysis for α1 by Proposition 2 according to KKT conditions. 15

4 3.5 3 2.5 2 1.5 1 0.5 0

0

0.5

1

1.5

2

2.5

3

3.5

4

Figure 2: Geometrical interpretation of relationship between Lagrange multiplier α and support vectors for problem (26) in R2 .

290 291 292

293 294 295 296 297 298

299 300 301 302 303 304 305 306

Proposition 2. Define the projected distance function f1 (x) = w10 (x − m1 ). Suppose that α1j is the j-th component of solution α1 to problem (26). Then, it has the following geometric properties: 1. If α1j = 0, the projected distance value of xj satisfies f1 (xj ) ≥ ρ1 (lies above f1 (x) = ρ1 and ξj = 0). It is a non-support vector. 2. If 0 < α1j < 1/l2 , that of xj satisfies f1 (xj ) = ρ1 (just lies on f1 (x) = ρ1 and ξj = 0). It is a support vector. 3. If α1j = 1/l2 , that of xj satisfies f1 (xj ) ≤ ρ1 (lies below f1 (x) = ρ1 and ξj ≥ 0). It is also a support vector. Proof. The following proof is according to KKT conditions (31)-(37) with the geometrical interpretation of relationship between α and support vectors shown in Fig.2. Case 1: We have α1j = 0 → β1j = 1/l2 → ξj = 0 → w10 (xj − m1 ) − ρ1 + ξi ≥ 0 → w10 (xj − m1 ) ≥ ρ1 , then the corresponding instance xj satisfies f1 (xj ) ≥ ρ1 . Furthermore, α1j = 0 means no contribution to the primal solution w1 according to (38). Thus, it is a non-support vector (blue “triangle” in Fig.2) which sits above f1 (x) = ρ1 . 16

307 308 309 310 311 312 313 314

315 316 317 318 319

320 321 322

Case 2: We have 0 < α1j < 1/l2 → 0 ≤ β1j ≤ 1/l2 → ξi = 0 → w10 (xj − m1 ) − ρ1 + ξi = 0 → w10 (xj − m1 ) = ρ1 , then xj satisfies f1 (xj ) = ρ1 . Due to α1j 6= 0, it is a support vector (blue “square” in Fig.2) just lying on f1 (x) = ρ1 . Case 3: We have α1j = 1/l2 → β1j = 0 → ξi ≥ 0 → w10 (xj − m1 ) − ρ1 + ξi = 0 → w10 (xj − m1 ) ≤ ρ1 , then xj satisfies f1 (xj ) ≤ ρ1 . Owing to α1j 6= 0, it is also a support vector (blue “diamond” in Fig.2) which locates below f1 (x) = ρ1 . In short, we can conclude that the different value of α1j corresponds different location side of xj with respect to f1 (x) = ρ1 .  3.3.2. The property of parameter ν As mentioned above, the parameter ν1 in ν-PTSVM is more meaningful than the penalty c1 in PTSVM. In what follows, we will detail the analysis about its theoretical interpretation. Before discussing the property of ν1 , we first introduce the definition of margin-error instance [2]. Definition 2. (Margin-error instance) If the value of projection distance function f1 (xj ) for negative instance xj satisfies f (xj ) < ρ1 or slack variable ξj > 0, we call such instance xj as the margin-error instance.

327

Roughly specking, the negative instance xj with the margin error ξj > 0 is not separated “sufficient correctly” by the projection w1 . Furthermore, the margin-error instance is a kind of support vector, whose corresponding α = l12 . Fig.2 gives the intuition of the margin-error instance (blue “diamond”). Then, we have a significance property of ν1 by Proposition 3.

328

Proposition 3. Suppose that (w1 , ρ1 ) is a solution to problem (26). Then,

323 324 325 326

329 330 331 332 333

334 335 336 337 338 339 340

1. Denoting the number of support vectors as q, we have ν1 ≤ q/l2 . That is, parameter ν1 is the lower bound of fraction of support vectors. 2. Denoting the number of margin-error instances as p, we have ν1 ≥ p/l2 . That is, parameter ν1 is the upper bound of fraction of margin-error instances. Proof. The proof can be adapted from similar results in [2]. Denote the set of margin-error instances as Ime with size p, and the set of support vectors as Isv with size q. According to Proposition 2 and the constraint ν1 = e02 α1 in the dual problem (28), we have Case 1: From Definition 1, we get that α1j corresponding to the support vector satisfies 0 < α1j ≤ 1/l2 . Thus, as for p number of support vectors, we can P obtain ν1 = e02 α1 = j∈Isv α1j ≤ q/l2 . 17

341 342 343

344

345 346 347 348 349 350 351 352

Case 2: From Definition 2, we have that α1j corresponding to the marginerror instance satisfies α1j = 1/l2 . Thus, as for p number of margin-error inP stances, we can obtain ν1 = e02 α1 ≥ j∈Ime α1j = p/l2 .  4. ν-PTSVM for nonlinear case In practice, the linear classification is not suitable for many real-world learning tasks at hand. One of effective solutions is to map linearly non-separable instances into the feature space. Thus, in this section, we focus on the nonlinear extension of ν-PTSVM. To construct our nonlinear ν-PTSVM, consider the mapping xφ = φ(x) : n R → H (RKHS, Reproducing Kernel Hilbert Space) with the kernel trick. Define Aφk = {φ(xi − mk )}i∈I1 and Bkφ = {φ(xj − mk )}j∈I2 , where k = 1 or 2. Then, the primal problems of nonlinear ν-PTSVM can be expressed as min

w1 ,η1 ,ξ1 ,ρ1

s.t.

353

min s.t.

355 356 357

358 359

360

1 c2 1 0 η η2 + e01 ξ2 + kw2 k2 − ν2 ρ2 2 2 l1 2 B1φ w2 = η2 , −Aφ2 w2 ≥ ρ2 e1 − ξ2 , ξ2 ≥ 0, ρ2 > 0,

(47)

where c1 , c2 > 0 and ν1 , ν2 ∈ (0, 1] are penalty parameters, η1 , η2 and ξ1 , ξ2 are slack vectors. Because the formulations of nonlinear problems (46) and (47) are similar to linear case (26) and (27), we can obtain their solutions in the similar manner. Definition 3. Suppose that K(·, ·) is an appropriate kernel function, then define the kernel matrix D E K(Ak , Bk ) = Aφk , Bkφ , (48) whose ij-th element can be computed by K(Ak , Bk )ij

361

(46)

and w2 ,η2 ,ξ2 ,ρ2

354

1 0 1 c1 η1 η1 + e02 ξ1 + kw1 k2 − ν1 ρ1 2 l2 2 Aφ1 w1 = η1 , B1φ w1 ≥ ρ1 e2 − ξ1 , ξ1 ≥ 0, ρ1 > 0,

= φ (xi − mk )0 φ (xj − mk ) = K(xi − mk , xj − mk ).

Then, we can derive the dual problems of (46) and (47) by Theorem 2. 18

(49)

362

Theorem 2. Optimization problems min

u1 ,α1

s.t. 363

 1 0 K(A1 , A1 ) + c1 I 0 (u α1 ) K(B1 , A1 ) 2 1 e2 0 ≤ α1 ≤ , e02 α1 = ν1 l2

  K(A1 , B1 ) u1 K(B1 , B1 ) α1

(50)

  −K(B2 , A2 ) u2 K(A2 , A2 ) α2

(51)

and min

u2 ,α2

s.t.

 1 0 K(B2 , B2 ) + c2 I 0 (u α2 ) −K(A2 , B2 ) 2 2 e1 , e01 α2 = ν2 , 0 ≤ α2 ≤ l1

364

are the dual problems of (46) and (47), respectively.

365

Proof. The proof is similar to the linear case, so we omit it here.

366 367 368 369 370 371 372 373 374 375 376 377

378 379



Remark 3. Different from some existing nonparallel SVMs (such as GEPSVMs [11, 44], TWSVMs [12, 16, 17], PTSVMs [36, 37] and so on), we do not need to consider extra kernel-generated surfaces for its nonlinear extension, since only inner products appear in the dual problems (28) and (29). Thus, to obtain the dual formulations of (46) and (47), we can directly replace inner product hx D i , xj i E= x0i xj with an appropriately chosen kernel function, i.e., K(xi , xj ) = xφi , xφj .

Specifically, when the linear kernel K(xi , xj ) = x0i xj is used, the element (49) of kernel matrix in the dual problems (46) and (47) will be K(Ak , Bk )ij = (xi − mk )0 (xj − mk ), which is equivalent to the linear case. Based on the above analysis, in contrast with PTSVMs, our ν-PTSVM is consistency between linear and nonlinear cases. Therefore, nonlinear ν-PTSVM is more theoretically sound than nonlinear PTSVM. Once solving the dual problems (50) and (51), we can achieve solutions to primal problems by Proposition 4. Proposition 4. Suppose that (u1 , α1 ) and (u2 , α2 ) are solutions to the dual problems (50) and (51), respectively. If there exists one component of α1 such that α1i ∈ (0, 1/l2 ) for problem (50), and that of α2 such that α2j ∈ (0, 1/l1 ) for problem (51), then solutions (w1 , ρ1 ) and (w2 , ρ2 ) to the primal problems (46)

19

and (47) can be formulated by 1 ((Aφ1 )0 u1 + (B1φ )0 α1 ) and ρ1 = w10 φ(xi − m1 ), c1

(52)

1 ((B2φ )0 u2 − (Aφ2 )0 α2 ) and ρ2 = −w20 φ(xj − m2 ), c2

(53)

w1 = w2 = 380

where xi and xj are the instances corresponding to α1i and α2j , respectively. The whole procedure of nonlinear ν-PTSVM is summarized in Algorithm 2.

381

382 383 384 385 386 387 388 389

Algorithm 2 The procedure of nonlinear ν-PTSVM Input the training set T = {(xi , yi )|1 ≤ i ≤ m}, where xi ∈ Rn and yi = {1, −1}. 2: Choose parameters c1 , c2 > 0, ν1 , ν2 ∈ (0, 1], and an appropriate kernel function K. φ φ 3: Construct Ak and Bk and solve the dual problems (50) and (51) by QPP solver, respectively. Then get their solutions (u1 , α1 ) and (u2 , α2 ). 4: Construct the following decision functions according to Proposition 4

1:

1 0 (u K(A1 , d1 (x)) + α01 K(B1 , d1 (x))) c1 1

(54)

f2φ (x) = w20 φ(x − m2 ) =

1 0 (u K(B2 , d2 (x)) − α02 K(A2 , d2 (x))) c2 2

(55)

and

390

391 392

f1φ (x) = w10 φ(x − m1 ) =

5:

where dk (x) = x − mk . For an unseen instance x, assign it to class y by y = arg min

k={1,2}

393

5. Numerical experiments

394

5.1. Experimental setting

395 396

|fkφ (x)|.

(56)

To demonstrate the validity of our ν-PTSVM, we investigate its performance on both synthetic and real-world datasets in terms of classification accuracy1 1 +T N Classification accuracy is defined as: Acc = T P +FT PP +T , where TP, TN, FP and FN are N +F N the number of true positive, true negative, false positive and false negative, respectively.

20

397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430

and computational efficiency2 . In our implementation, we focus on comparisons between ν-PTSVM and five state-of-the-art nonparallel methods, including GEPSVM, TBSVM, NPSVM, RPTSVM and LSPTSVM, detailed as: • GEPSVM [11]: It is a generalized eigenvalue based nonparallel hyperplane classifier. GEPSVM relaxes the universal requirement that hyperplanes generated by SVM should be parallel, and attempts to seek a pair of optimization nonparallel hyperplanes by solving generalized eigenvalue problems. The parameters for GEPSVM are (δ1 , δ2 ). • TBSVM [16]: It is a SRM improvement of TWSVM [12]. TBSVM not only inherits the characteristic of TWSVM, i.e., the empirical risk, but also implements the structural risk minimization to control its model complexity. The parameters for TBSVM are (c1 , c2 , c3 , c4 ). • NPSVM [23]: It is a novel nonparallel hyperplane classifier. NPSVM introduces the ε-insensitive loss function instead of the least-square loss in TWSVM [12], such that hyperplanes no longer fit the corresponding instances but bound them as much as possible within an ε-tube. Such improvement brings the spareness property for NPSVM. The parameters for NPSVM are (c1 , c2 , c3 , c4 , ε1 , ε2 ). • RPTSVM [37]: It is a SRM and kernel improvement of PTSVM [36]. The implementation of structural risk minimization in RPTSVM not only controls the model complexity but also remedies the singularity problem in PTSVM. RPTSVM also gives the nonlinear extension of PTSVM. The parameters for RPTSVM are (c1 , c2 , c3 , c4 ). • LSPTSVM [38]: It is a least-square version of PTSVM [36]. LSPTSVM optimizes a pair of modified primal problems by solving systems of linear equations instead of QPPs in PTSVM, leading to efficient training speed. The parameters for LSPTSVM are (c1 , c2 , c3 , c4 ).

All the experiments were implemented by Matlab (2017b) on a personal computer (PC) with an Intel Core-i5 processor (2.3 GHz) and 8 GB random-access memory (RAM). The general eigenvalue problems in GEPSVM were solved by Matlab “eig” function. As for TBSVM, NPSVM, RPTSVM and ν-PTSVM, we resorted to Matlab “quadprog” function to solve their QPPs. The systems of linear equations involved in LSPTSVM were solved by Matlab “\” operation. With regard to the parameter selection, we employed the standard five-fold crossvalidation technique3 . For the sake of brevity, similar to [12, 16, 37], we set the 2 Learning time (not include the parameters tuning time) is used to represent the learning efficiency for each algorithm. 3 In detail [2], each dataset is partitioned into five subsets with similar sizes and distributions. Then, the union of four subsets is used as the training set while the remaining one is used as the testing set. The experiment is repeated 5 times such that every subset is used once as a testing set.

21

1

1

Acc = 84.00%

0.5

0.5

0

0

-0.5

-0.5

-1 -1

-0.5

0

0.5

1

-1 -1

Acc = 85.60%

-0.5

(a) GEPSVM 1

1

Acc = 86.80%

0.5

0

0

-0.5

-0.5

-0.5

0

0.5

1

-1 -1

-0.5

1

Acc = 84.00%

0.5

0

0

-0.5

-0.5

-0.5

0

0

0.5

1

(d) RPTSVM

0.5

-1 -1

1

Acc = 86.00%

(c) NPSVM 1

0.5

(b) TBSVM

0.5

-1 -1

0

0.5

1

-1 -1

(e) LSPTSVM

Acc = 87.20%

-0.5

0

0.5

1

(f) ν-PTSVM

Figure 3: Results of six classifiers with linear kernel on Ripley dataset. The positive, negative, and misclassified instances are denoted as red “+”, blue “×” and black “◦” respectively. The decision function is represented by black boundary.

22

1.12

1.1

1.1

1.08

1.08

1.06

1.06

Ratio

Ratio

1.12

1.04 1.02

1.04 1.02

1

1

0.98

0.98

0.96 0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

(a)

0.96 0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

(b)

Figure 4: The influence of parameter ν on the behavior of ν-PTSVM on Ripley dataset with linear kernel. (a) The fraction of negative SVs (nsv /l2 ) and negative margin-error instances (nme /l2 ) versus ν1 for problem (26). (b) The fraction of positive SVs (nsv /l1 ) and positive margin-error instances (nme /l2 ) versus ν2 for problem (27).

439

parameters of two problems in each classifier the same to reduce the computational complexity. More specifically, set δ1 = δ2 in GEPSVM; c1 = c3 , c2 = c4 in TBSVM, RPTSVM and LSPTSVM; c1 = c3 , c2 = c4 , ε1 = ε2 in NPSVM, and c1 = c2 , ν1 = ν2 in our ν-PTSVM. For the nonlinear case, the RBF kernel kx −x k2 K(xi , xj ) = exp(− i γ j ) is considered, where γ > 0 is the kernel parameter. Furthermore, we used grid-based approach [2] to obtain these optimal parameters. That is, the parameters δ1 , c1 , c2 , γ were selected from {2i |i = −5, −4, ..., 5}, while the parameter ε1 , ν1 were chosen from {i|i = 0.1, 0.2, ..., 1}. Once selected, we returned them to learn the final decision function.

440

5.2. Experiments on synthetic datasets

431 432 433 434 435 436 437 438

441 442 443 444 445 446 447 448 449 450 451

The first example is a two-dimensional synthetic Ripley dataset [45]. It consists of 125 instances for each class, which were draw from two heteroscedastic normal distributions with a high degree of overlap. Fig.3 illustrates the learning results of each classifier with linear kernel in terms of decision boundary and accuracy. It can be seen that ν-PTSVM obtains more suitable separating hyperplane than GEPSVM, TBSVM, RPTSVM and LSPTSVM, while is comparable with NPSVM. As for the accuracy, ν-PTSVM owns the best (87.20%) among the all classifiers. Note that, from Fig.3 (d) and (f), there are more misclassified instances near the classification boundary in RPTSVM than ν-PTSVM. The reason behind is that our ν-PTSVM relaxes the fixed margin 1/kwk in RPTSVM, and utilizes variable ρ to adapt the margin ρ/kwk to the data distribution. 23

1

1

Acc = 85.60%

0.5

0.5

0

0

-0.5

-0.5

-1 -1

-0.5

0

0.5

1

-1 -1

Acc = 86.00%

-0.5

(a) ν = 0.1 1

1

Acc = 87.20%

0.5

0

0

-0.5

-0.5

-0.5

0

0.5

1

(b) ν = 0.4

0.5

-1 -1

0

0.5

1

-1 -1

(c) ν = 0.7

Acc = 86.00%

-0.5

0

0.5

1

(d) ν = 1.0

Figure 5: The influence of parameter ν on support vectors of ν-PTSVM on Ripley dataset with linear kernel. The positive and negative instances are denoted as red “+” and blue “×”respectively. The positive and negative SVs are marked by red “circle” and blue “square”. The decision function is represented by black boundary.

24

1.3

1.5

1.2 1.25

Ratio

Ratio

1.1 1

1

0.9 0.75 0.8 0.7 0.1

0.2

0.3

0.4

0.5

(a)

0.6

0.7

0.8

0.9

1

0.5 0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

(b)

Figure 6: The influence of parameter ν on the behavior of ν-PTSVM on Ripley dataset with RBF kernel. (a) The fraction of negative SVs (nsv /l2 ) and negative margin-error instances (nme /l2 ) versus ν1 for problem (46). (b) The fraction of positive SVs (nsv /l1 ) and positive margin-error instances (nme /l2 ) versus ν2 for problem (47).

452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473

In Proposition 3, we have proved that the parameter ν has theoretical interpretation on the number of support vectors nsv and margin-error instances nme . Thus, in what follows, we examine the empirical influence of parameter ν on the behavior of both linear and nonlinear ν-PTSVM. Here, we perform ν-PTSVM with different ν ranging from 0.1 to 1 on the above Ripley dataset. Fig.4 and Fig.6 record the ratio of fraction of SVs and margin-error instances on ν for each problem. The results show that ν can effectively control the lower bound of fraction of SVs (nsv /l2 ) and upper bound of fraction of margin-error instances (nme /l2 ) for problem (26), which is consistent with our theoretical analysis. Furthermore, we plot the support vectors of ν-PTSVM with different ν in Fig.5 and Fig.7. The results illustrate that the number of SVs is increasing gradually with ν varying from 0.1 to 1. When ν = 1, the SVs cover the whole dataset. On the other hand, too small number of SVs (ν is small) may sometimes influence its performance. This is consistent with the observation from our experimental results that the tendency of performance (classification accuracy) rises up at the beginning and then descends afterwards. Actually, there is no effective way to obtain the optimal parameter ν. In practice, we usually adopt the grid search method for parameter selection. To sum it up, the above results confirm our previous theoretical analysis of parameter ν in Proposition 3. To verify the nonlinear performance of ν-PTSVM, in what follows, we consider a two-dimensional synthetic Two-spiral dataset. It contains two spiral distributions corrupted by the heteroscedastic Gaussian noise, and each spiral (206 25

1

1

Acc = 76.80%

0.5

0.5

0

0

-0.5

-0.5

-1 -1

-0.5

0

0.5

1

-1 -1

Acc = 87.60%

-0.5

(a) ν = 0.1 1

1

Acc = 88.40%

0.5

0

0

-0.5

-0.5

-0.5

0

0.5

1

(b) ν = 0.4

0.5

-1 -1

0

0.5

1

-1 -1

(c) ν = 0.7

Acc = 86.40%

-0.5

0

0.5

1

(d) ν = 1.0

Figure 7: The influence of parameter ν on support vectors of ν-PTSVM on Ripley dataset with RBF kernel. The positive and negative instances are denoted as red “+” and blue “×”respectively. The positive and negative SVs are marked by red “circle” and blue “square”. The decision function is represented by black boundary.

474 475 476 477 478 479 480 481 482

instances) corresponds to one class. Fig.8 visualizes the learning results of each classifiers with RBF kernel in terms of decision boundary and accuracy. It can be found that GEPSVM, TBSVM and LSPTSVM are sensitive to the noise, and can not find an appropriate decision boundary. On the other hand, NPSVM, RPTSVM and our ν-PTSVM can obtain comparable decision boundary, but with the best accuracy (98.06%) for ν-PTSVM. The results also reveal that, by inheriting the advantage of PTSVM [36, 37] and ν-SVM [5], our nonlinear ν-PTSVM is able to exploit more discriminate information, resulting in a better generalization ability. In a word, the above results confirm the effectiveness of ν-PTSVM.

26

1

1

Acc = 90.78%

0.5

0.5

0

0

-0.5

-0.5

-1 -1

-0.5

0

0.5

1

-1 -1

Acc = 93.69%

-0.5

(a) GEPSVM 1

1

Acc = 97.57%

0.5

0

0

-0.5

-0.5

-0.5

0

0.5

1

-1 -1

-0.5

1

Acc = 95.14%

0.5

0

0

-0.5

-0.5

-0.5

0

0

0.5

1

(d) RPTSVM

0.5

-1 -1

1

Acc = 96.60%

(c) NPSVM 1

0.5

(b) TBSVM

0.5

-1 -1

0

0.5

1

-1 -1

(e) LSPTSVM

Acc = 98.06%

-0.5

0

0.5

1

(f) ν-PTSVM

Figure 8: Results of six classifiers with RBF kernel on Spiral dataset. The positive, negative, and misclassified instances are denoted as red “+”, blue “×” and black “◦” respectively. The decision function is represented by black boundary.

27

Table 1: Statistics for UCI datasets used in our experiments.

483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501

Datasets

#Instances

#Features

#Training #Testing

Australian

690

14

482

208

Ionosphere

351

34

245

106

Hepatitis

155

19

108

47

Diabetes

768

8

537

231

German

1000

24

700

300

TicTacToe

958

27

670

288

CMC

1473

9

1031

442

Votes

435

16

304

131

Heart

920

13

644

276

Monks3

432

6

302

130

5.3. Experiments on real-world UCI datasets In this subsection, we apply our ν-PTSVM to several real-world datasets from the UCI machine learning repository4 , and investigate its performance and computational efficiency. For comparison, we consider ten real-world datasets, whose statistics are listed in Table 1. These datasets represent a wide range of domains (include pathology, bioinformatics, finance and so on), sizes (from 155 to 1437) and features (from 9 to 44). Additionally, all datasets are normalized before training such that the features scale in the interval [−1, 1]. In our experiments, we set up in the following way. Firstly, each dataset is divided into two subsets: 70% for training and 30% for testing. Then, we train classifiers with five-fold cross validation executions. Finally, the testing set is predicted with the well-tune classifiers. Each experimental setting is repeated 5 times. Summaries of learning results obtained by six classifiers with linear and nonlinear kernel on UCI datasets are listed in Tables 2 and 3. The best performance is highlighted in bold. The comparison results indicate that our proposed method yields better performance than RPTSVM and LSPTSVM on most datasets. For instance, for Australian dataset in linear case, ν-PTSVM achieves higher accuracy (84.21%) than RPTSVM (83.27%) and LSPTSVM (81.46%). Similar results can be obtained from the other datasets. It confirms that our ν-PTSVM can improve 4

The UCI datasets are available at http://archive.ics.uci.edu/ml

28

Table 2: The average learning results of each classifier on UCI datasets with linear kernel, in terms of testing accuracy (Acc) and learning time (Time).

Datasets

GEPSVM

TBSVM

NPSVM

RPTSVM

LSPTSVM

ν-PTSVM

Acc (%)

Acc (%)

Acc (%)

Acc (%)

Acc (%)

Acc (%)

Time (s)

Time (s)

Time (s)

Time (s)

Time (s)

Time (s)

79.54±5.40

82.35±4.24

84.08±3.01

83.27±4.03

81.46±4.97

84.31±2.57

0.0776

0.2301

0.5192

0.2564

0.0163

0.3109

84.35±4.71

87.83±4.23

89.23±3.45

86.46±4.58

85.15±4.42

88.96±3.73

0.0531

0.0624

0.1524

0.1015

0.0061

0.0979

82.28±4.67

81.75±3.19

82.97±3.18

82.52±2.34

82.95±3.01

83.57±2.82

0.0392

0.0358

0.0896

0.0434

0.0042

0.0502

73.05±3.56

75.74±4.80

75.93±2.89

75.19±3.46

74.56±4.17

76.41±3.19

0.0837

0.2793

0.5930

0.3372

0.0119

0.3635

70.95±2.81

72.93±2.77

72.57±2.92

72.43±2.46

71.55±3.24

74.42±2.36

0.3453

1.2099

2.8252

1.7971

0.0415

1.6375

71.22±4.62

69.78±4.80

70.60±4.01

70.87±3.62

68.83±5.96

71.53±3.22

0.1960

2.1795

3.8526

2.2899

0.0623

1.9572

69.14±5.40

70.71±3.98

72.65±3.25

70.96±4.49

69.88±4.97

72.29±3.96

0.2944

2.9907

5.3773

4.6317

0.1833

3.9396

92.97±2.95

93.26±3.10

92.18±2.28

92.14±2.63

91.75±3.02

93.69±2.46

0.0943

0.1826

0.4226

0.2048

0.0134

0.1772

80.11±5.51

81.63±3.84

82.92±2.89

84.89±3.59

84.75±3.37

84.30±3.04

0.0411

1.3056

3.8344

2.1294

0.0157

1.9547

85.70±4.90

87.29±3.87

88.26±3.48

87.74±4.61

86.18±5.10

88.75±4.13

0.0156

0.0995

0.2183

0.1124

0.0295

0.1063

Ave. Acc

78.93

80.32

81.13

80.64

79.70

81.82

Ave. Time

0.1240

0.8575

1.7885

1.1904

0.0384

1.0595

W/T/L

7/3/0

4/6/1

2/8/0

3/7/0

6/4/0

Ave. rank

5.2

3.8

2.5

3.4

4.7

Australian

Ionosphere

Hepatitis

Diabetes

German

TicTacToe

CMC

Votes

Heart

Monks3

29

/ 1.4

Table 3: The average learning results of each classifier on UCI datasets with nonlinear (RBF) kernel, in terms of testing accuracy (Acc) and learning time (Time).

Datasets

GEPSVM

TBSVM

NPSVM

RPTSVM

LSPTSVM

ν-PTSVM

Acc (%)

Acc (%)

Acc (%)

Acc (%)

Acc (%)

Acc (%)

Time (s)

Time (s)

Time (s)

Time (s)

Time (s)

Time (s)

72.35±4.67

74.51±4.69

77.04±3.18

76.38±3.71

75.25±4.47

77.29±2.74

0.8380

0.6493

0.5515

0.7013

0.0392

0.4946

90.26±4.33

89.80±2.99

91.27±3.42

92.19±3.65

91.53±5.24

92.64±3.56

0.1256

0.2113

0.1865

0.2316

0.0235

0.1428

80.73±4.26

82.51±4.94

83.47±3.30

82.86±4.25

81.02±4.64

83.93±2.38

0.3828

0.6586

0.1894

0.3035

0.0192

0.1597

76.96±3.22

78.13±3.89

76.86±2.60

77.14±4.31

75.36±4.70

77.57±3.45

0.2160

0.8468

1.0937

0.9803

0.0287

0.5680

69.58±5.96

71.73±3.67

73.25±3.64

69.53±2.24

70.74±3.68

73.54±2.33

0.5984

2.8011

3.7984

3.9254

0.0557

2.4867

71.32±4.49

73.92±4.32

77.36±3.05

73.69±2.46

74.32±4.16

76.03±2.53

1.3511

3.6531

3.8596

4.2535

0.1001

2.3850

70.42±3.45

73.13±3.28

72.98±2.13

74.65±5.84

72.33±4.02

74.15±2.97

2.5240

6.4552

5.8098

8.2951

0.3184

5.0318

91.52±4.16

96.18±4.19

95.07±3.20

93.26±3.16

92.48±3.22

95.83±2.90

0.1440

0.3053

0.3254

0.4705

0.0303

0.2441

82.07±3.09

83.19±4.33

84.58±2.72

83.89±4.09

82.46±5.48

85.04±3.84

0.2599

4.2132

3.2434

4.8331

0.0795

2.8869

87.06±3.92

88.95±3.84

87.43±4.39

89.77±3.93

88.93±4.34

90.97±3.56

0.1293

0.5863

0.3027

0.3964

0.0366

0.2748

Ave. Acc

79.22

81.20

81.93

81.33

80.44

82.69

Ave. Time

0.6569

2.0380

1.9360

2.4391

0.0731

1.4674

W/T/L

9/1/0

5/5/1

2/8/0

3/7/0

7/2/0

Ave. rank

5.6

3.4

3.0

3.2

4.4

Australian

Ionosphere

Hepatitis

Diabetes

German

TicTacToe

CMC

Votes

Heart

Monks3

30

/ 1.4

Table 4: Results of Friedman’s test based on the testing accuracy for UCI datasets.

502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531

Statistical value

p-value

Hypothesis

Linear case

28.114285

0.000035

Rejected

Nonlinear case

28.514281

0.000028

Rejected

the performance of PTSVMs by the additional meaningful margin regularization term. Furthermore, the results in Tables 2 and 3 also reveal that ν-PTSVM perform as well as NPSVM, and is better than GEPSVM, TBSVM, RPTSVM and LSPTSVM at the most cases. We also record the average learning time of each classifier for the above UCI datasets experiments, as shown in Table 2 and 3. The results show that our ν-PTSVM is faster than NPSVM and slower than LSPTSVM for both linear and nonlinear case. Moreover, it gets comparable learning efficiency as RPTSVM and TBSVM for the linear case, while is faster than RPTSVM and TBSVM for the nonlinear case. Because NPSVM needs to optimize much larger dual problems than other classifiers, it needs more computational consumption. Although RPTSVM and TBSVM have smaller-scale problems than ν-PTSVM, both of them involve the expensive matrix inversion during the training procedure, i.e., O(n3 ) for linear case or O(l3 ) for nonlinear. For our experiments, the feature size n is much smaller than the instance size l, i.e., n  l. That is why the nonlinear RPTSVM and TBSVM consume more learning time than ν-PTSVM and NPSVM. On the other hand, LSPTSVM only needs to solve the systems of linear equations and hence owns the fastest learning speed among all the classifiers. However, its generalization ability is poorer than the others except for GEPSVM. Furthermore, we apply a paired t-test [46] on each dataset to inspect whether our ν-PTSVM is significance Superior/Equal/Inferior (W/T/L) to the compared classifier in terms of the testing accuracy. The significance level α = 0.05. Then, we count the number of W/T/L on all datasets for both linear and nonlinear cases, also listed in Tables 2 and 3. The results reveal that our ν-PTSVM achieves the best results against others in terms of both W/T/L as well as average accuracy. To provide more statistical evidence [47, 48], we employ the Friedman’s test to check whether there are significant differences between ν-PTSVM and other classifiers on the whole datasets, according to the testing accuracies in Tables 2 and 3. The bottom lines of Tables 2 and 3 list the average rank of classifiers 31

Table 5: Results of the Holm’s test based on the testing accuracy for UCI datasets (ν-PTSVM is the control algorithm).

Linear case
i   Algorithm   z         p-value    Hypothesis
5   GEPSVM      4.54186   0.00005    Rejected
4   LSPTSVM     3.94425   0.00008    Rejected
3   TBSVM       2.86854   0.00412    Rejected
2   RPTSVM      2.39045   0.01682    Rejected
1   NPSVM       1.31475   0.18859    Accepted

Nonlinear case
i   Algorithm   z         p-value    Hypothesis
5   GEPSVM      5.01996   0.00001    Rejected
4   LSPTSVM     3.58568   0.00361    Rejected
3   TBSVM       2.39045   0.01682    Rejected
2   RPTSVM      2.15141   0.03144    Rejected
1   NPSVM       1.91236   0.05582    Accepted


It can be seen that the proposed ν-PTSVM is ranked first in both the linear and nonlinear settings, followed by NPSVM and RPTSVM, respectively. The p-value of Friedman's test is also calculated and shown in Table 4. The results reject the null hypothesis that there is no significant difference among the classifiers, and thus indicate significant differences in the performance of the compared classifiers. Therefore, we further carry out Holm's test [47] as a post-hoc test to inspect the statistical differences between our ν-PTSVM, taken as the control algorithm (the one with the lowest Friedman rank), and the remaining classifiers, at a significance level of 0.05. The results in Table 5 show that the control algorithm (ν-PTSVM) is statistically superior to the other classifiers in terms of testing accuracy in both the linear and nonlinear cases, except for NPSVM. Overall, the above results demonstrate the feasibility of ν-PTSVM.
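The Friedman and Holm statistics of Tables 4 and 5 follow the standard methodology for comparing multiple classifiers over multiple datasets. The sketch below shows how such values can be computed with SciPy; the accuracy matrix acc (datasets × classifiers) and the index of the control algorithm are assumed inputs, and the code is an illustration rather than the authors' implementation.

# Friedman test plus Holm's step-down post-hoc test against a control column.
import numpy as np
from scipy.stats import friedmanchisquare, norm, rankdata

def friedman_holm(acc, control, alpha=0.05):
    N, k = acc.shape
    stat, p = friedmanchisquare(*[acc[:, j] for j in range(k)])
    ranks = np.mean([rankdata(-row) for row in acc], axis=0)  # average ranks
    se = np.sqrt(k * (k + 1) / (6.0 * N))                     # rank std. error
    others = [j for j in range(k) if j != control]
    z = {j: (ranks[j] - ranks[control]) / se for j in others}
    ordered = sorted((2 * norm.sf(abs(z[j])), j) for j in others)
    results, stop = [], False
    for i, (p_j, j) in enumerate(ordered):      # Holm: compare to alpha/(m-i)
        rejected = (not stop) and p_j < alpha / (k - 1 - i)
        stop = stop or not rejected
        results.append((j, z[j], p_j, rejected))
    return stat, p, ranks, results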


5.4. Large-scale NDC Datasets



To further show the learning efficiency of our ν-PTSVM, in this section we conduct a series of comparative experiments on large-scale NDC datasets. Our experiments are set up as follows.



Table 6: The average learning results of classifiers on NDC datasets with RBF kernel, in terms of testing accuracy and learning time.

NDC datasets   GEPSVM              TBSVM               NPSVM               RPTSVM              LSPTSVM             ν-PTSVM
               Acc (%)   Time (s)  Acc (%)   Time (s)  Acc (%)   Time (s)  Acc (%)   Time (s)  Acc (%)   Time (s)  Acc (%)   Time (s)
(100 × 32)     77.53     0.0226    78.89     0.0594    80.37     0.0485    80.02     0.0862    78.42     0.0018    80.59     0.0402
(500 × 32)     79.76     0.1492    81.29     0.2475    82.05     0.1485    81.80     0.3228    81.27     0.0124    82.84     0.1046
(1k × 32)      83.29     10.472    84.92     11.6938   85.24     7.9247    85.60     18.973    84.93     0.0826    85.49     5.6840
(3k × 32)      83.53     345.47    85.69     84.061    86.40     58.615    86.21     129.27    85.32     8.3428    86.57     39.252
(5k × 32)      a         a         85.18     789.47    86.38     635.92    86.07     1063.2    85.02     18.430    86.24     317.58
(10k × 32)     a         a         b         b         a         a         b         b         84.38     628.09    85.97     3628.4

a Experiment was stopped because the computing time was very high.
b Terminated because of out of memory.

First, NDC datasets are generated using David Musicant's NDC generator (http://research.cs.wisc.edu/dmi/svm/ndc/), with the scale increased from 100 to 10k and the feature dimension fixed at 32. Then, each dataset is divided into two subsets: 90% for training and 10% for testing. Here, we use the RBF kernel with γ = 1 and fix the penalty parameters of all classifiers to 1 in advance. Finally, we train and test each classifier with the above setting. Each experiment is repeated 5 times.

Table 6 lists the five-run average testing accuracy and learning time of each classifier with the scale of the NDC datasets ranging from 100 to 10k. We have highlighted the best performance. The results show that our ν-PTSVM is more efficient than the others, except for LSPTSVM, on most NDC datasets. It is also worth noting that the training procedure of both TBSVM and RPTSVM contains the expensive matrix inversion, and hence their learning time increases dramatically on larger datasets. Moreover, matrix inversion also consumes more memory; as a result, an out-of-memory failure occurred for them when the scale reached 10k. On the other hand, although NPSVM does not need matrix inversion during training, it requires more variables to be optimized than our ν-PTSVM. In short, the above results confirm the efficiency of ν-PTSVM for large-scale problems.
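The evaluation protocol above is straightforward to script. The following sketch mirrors its structure (90%/10% split, RBF kernel with γ = 1, penalty parameter fixed to 1, five repetitions), but uses scikit-learn's SVC and synthetic data from make_classification as stand-ins for the compared classifiers and for the NDC generator output, so any numbers it prints are purely illustrative.

# NDC-style timing protocol: 90/10 split, RBF kernel (gamma=1), C=1, 5 runs.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def run_protocol(n_samples, n_features=32, repeats=5, seed=0):
    accs, times = [], []
    for r in range(repeats):
        X, y = make_classification(n_samples=n_samples, n_features=n_features,
                                   random_state=seed + r)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                                  random_state=seed + r)
        clf = SVC(kernel="rbf", gamma=1.0, C=1.0)
        t0 = time.time()
        clf.fit(X_tr, y_tr)                      # record training time only
        times.append(time.time() - t0)
        accs.append(clf.score(X_te, y_te))
    return np.mean(accs), np.mean(times)

for scale in (100, 500, 1000):
    acc, t = run_protocol(scale)
    print(f"n={scale}: acc={acc:.4f}, time={t:.4f}s")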


6. Conclusions


In this paper, we have proposed an improved version of PTSVM, called ν-PTSVM. Our ν-PTSVM not only inherits the advantages of PTSVM but also enjoys many attractive properties. Compared with PTSVM, the penalty parameter ν in ν-PTSVM has a clear theoretical meaning: ν is a lower bound on the fraction of support vectors and an upper bound on the fraction of margin-error instances. Thus, we can easily control the percentage of SVs by choosing an appropriate value of ν. Furthermore, the dual problems (28) and (29) of ν-PTSVM avoid matrix inversion by reformulating the least-squares loss in its primal problems (24) and (25). This means that our ν-PTSVM no longer needs the time-costly inversion operation during training. In addition, the formulations of the dual problems (28) and (29) allow ν-PTSVM to apply the kernel trick directly for its nonlinear extension, as in SVM. That is, the nonlinear dual problems (50) and (51) reduce to the linear ones (28) and (29) simply by using the linear kernel K(x_i, x_j) = x_i^T x_j. The experimental results reveal that our ν-PTSVM improves the generalization ability of PTSVMs remarkably. However, it is still a challenge for ν-PTSVM to handle large-scale problems with a traditional QPP solver. Thus, one line of future work concerns the implementation of more efficient solvers, including the fixed-size technique [49] and the SMO algorithm [2]. Moreover, extending our ν-PTSVM to multiple projections [42, 50], noisy classification [32], multi-category classification [21, 51], ordinal regression learning [52], clustering [53] and semi-supervised learning [31] would also be interesting.
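As an aside, the ν-property stated above is easy to check empirically. The snippet below illustrates it with scikit-learn's NuSVC, which shares the same interpretation of ν as ν-SVM [5] (a lower bound on the fraction of support vectors and an upper bound on the fraction of margin errors); it is only an illustration of the property, not an implementation of ν-PTSVM.

# The fraction of support vectors should not fall below the chosen nu.
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma="scale").fit(X, y)
    sv_fraction = clf.n_support_.sum() / len(X)
    print(f"nu={nu:.1f}: fraction of support vectors = {sv_fraction:.2f}")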


Acknowledgment



This work is supported by the National Natural Science Foundation of China (Nos. 61603338, 11871183, 61866010, 61703370, 11426200 and 11426202), the Natural Science Foundation of Zhejiang Province (Nos. LQ17F030003, LY15F030013 and LQ13F030010), the Natural Science Foundation of Hainan Province (No. 118QN181) and the China Scholarship Council (No. 201708330179).


References



[1] V. Vapnik, Statistical learning theory, Wiley, New York, USA, 1998.
[2] N. Deng, Y. Tian, C. Zhang, Support Vector Machines: Theory, Algorithms and Extensions, CRC Press, Philadelphia, USA, 2013.
[3] I. W. Tsang, J. T. Kwok, P. M. Cheung, Core vector machines: Fast SVM training on very large data sets, J. Mach. Learn. Res. 6 (2005) 363–392.



[4] J. A. Suykens, M. Signoretto, A. Argyriou, Regularization, Optimization, Kernels, and Support Vector Machines, Chapman & Hall/CRC Press, Boca Raton, USA, 2013.
[5] B. Schölkopf, A. J. Smola, R. C. Williamson, P. L. Bartlett, New support vector algorithms, Neural Comput. 12 (5) (2000) 1207–1245.
[6] H. Yin, X. Jiao, Y. Chai, B. Fang, Scene classification based on single-layer SAE and SVM, Expert Syst. Appl. 42 (7) (2015) 3368–3380.
[7] T. Pinto, T. M. Sousa, I. Praça, Z. Vale, H. Morais, Support vector machines for decision support in electricity markets' strategic bidding, Neurocomputing 172 (2016) 438–445.
[8] S. Ma, B. Cheng, Z. Shang, G. Liu, Scattering transform and LSPTSVM based fault diagnosis of rotating machinery, Mech. Syst. Signal Process. 104 (2018) 155–170.
[9] A. Subasi, Classification of EMG signals using PSO optimized SVM for diagnosis of neuromuscular disorders, Comput. Biol. Med. 43 (5) (2013) 576–586.
[10] L. Hao, P. L. Lewin, Partial discharge source discrimination using a support vector machine, IEEE Trans. Dielectr. Electr. Insul. 17 (1) (2010) 189–197.
[11] O. L. Mangasarian, E. W. Wild, Multisurface proximal support vector machine classification via generalized eigenvalues, IEEE Trans. Pattern Anal. Mach. Intell. 28 (1) (2006) 69–74.
[12] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905–910.
[13] Jayadeva, R. Khemchandani, S. Chandra, Twin Support Vector Machines: Models, Extensions and Applications, Springer International Publishing, Switzerland, 2017.
[14] S. Ding, X. Hua, An overview on nonparallel hyperplane support vector machine algorithms, Neural Comput. Appl. 25 (5) (2014) 975–982.
[15] M. Arun Kumar, M. Gopal, Least squares twin support vector machines for pattern classification, Expert Syst. Appl. 36 (4) (2009) 7535–7543.
[16] Y. Shao, C. Zhang, X. Wang, N. Deng, Improvements on twin support vector machines, IEEE Trans. Neural Netw. 22 (6) (2011) 962–968.
[17] Z. Qi, Y. Tian, Y. Shi, Robust twin support vector machine for pattern classification, Pattern Recogn. 46 (1) (2013) 305–316.
[18] S. Ding, Y. An, X. Zhang, F. Wu, Y. Xue, Wavelet twin support vector machines based on glowworm swarm optimization, Neurocomputing 225 (2017) 157–163.
[19] X. Peng, A ν-twin support vector machine (ν-TSVM) classifier and its geometric algorithms, Inform. Sciences 180 (20) (2010) 3863–3875.
[20] W. Chen, Y. Shao, C. Li, N. Deng, MLTSVM: a novel twin support vector machine to multi-label learning, Pattern Recogn. 52 (2016) 61–74.
[21] S. Ding, X. Zhang, Y. An, Y. Xue, Weighted linear loss multiple birth support vector machine based on information granulation for multi-class classification, Pattern Recogn. 67 (2017) 32–46.
[22] Y. Tian, X. Ju, Improved twin support vector machine, Science China Mathematics 57 (2) (2014) 417–432.
[23] Y. Tian, Z. Qi, X. Ju, Y. Shi, X. Liu, Nonparallel support vector machines for pattern classification, IEEE Trans. Cybern. 44 (7) (2014) 1067–1079.
[24] Y. Tian, Y. Ping, Large-scale linear nonparallel support vector machine solver, Neural Netw. 50 (2014) 166–174.
[25] H. Wang, Z. Zhou, An improved rough margin-based ν-twin bounded support vector machine, Knowl.-Based Syst. 128 (2017) 125–138.



[26] Y. Shao, W. Chen, N. Deng, Nonparallel hyperplane support vector machine for binary classification problems, Inform. Sciences 263 (2014) 22–35.
[27] Z. Qi, Y. Tian, Y. Shi, Structural twin support vector machine for classification, Knowl.-Based Syst. 43 (2013) 74–81.
[28] W. Chen, Y. Shao, D. Xu, Y. Fu, Manifold proximal support vector machine for semi-supervised classification, Appl. Intell. 40 (4) (2014) 623–638.
[29] W. Chen, Y. Shao, N. Deng, Z. Feng, Laplacian least squares twin support vector machine for semi-supervised classification, Neurocomputing 145 (2014) 465–476.
[30] Y. Shao, W. Chen, J. Zhang, Z. Wang, N. Deng, An efficient weighted Lagrangian twin support vector machine for imbalanced data classification, Pattern Recogn. 47 (9) (2014) 3158–3167.
[31] W. Chen, Y. Shao, H. Ning, Laplacian smooth twin support vector machine for semi-supervised classification, Int. J. Mach. Learn. Cyber. 5 (3) (2014) 459–468.
[32] S. Mehrkanoon, X. Huang, J. A. K. Suykens, Non-parallel support vector classifiers with different loss functions, Neurocomputing 143 (2014) 294–301.
[33] Q. Ye, C. Zhao, N. Ye, Y. Chen, Multi-weight vector projection support vector machines, Pattern Recogn. Lett. 31 (13) (2010) 2006–2011.
[34] Q. Ye, N. Ye, T. Yin, Enhanced multi-weight vector projection support vector machine, Pattern Recogn. Lett. 42 (2014) 91–100.
[35] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, USA, 1990.
[36] X. Chen, J. Yang, Q. Ye, J. Liang, Recursive projection twin support vector machine via within-class variance minimization, Pattern Recogn. 44 (10-11) (2011) 2643–2655.
[37] Y. Shao, Z. Wang, W. Chen, N. Deng, A regularization for the projection twin support vector machine, Knowl.-Based Syst. 37 (2013) 203–210.
[38] Y. Shao, N. Deng, Z. Yang, Least squares recursive projection twin support vector machine for classification, Pattern Recogn. 45 (6) (2012) 2299–2307.
[39] X. Hua, S. Ding, Weighted least squares projection twin support vector machines with local information, Neurocomputing 160 (2015) 228–237.
[40] S. Ding, X. Hua, Recursive least squares projection twin support vector machines for nonlinear classification, Neurocomputing 130 (2014) 3–9.
[41] X. Peng, D. Chen, PTSVRs: Regression models via projection twin support vector machine, Inform. Sciences 435 (2018) 1–14.
[42] W. Chen, C. Li, Y. Shao, J. Zhang, N. Deng, Robust l1-norm multi-weight vector projection support vector machine with efficient algorithm, Neurocomputing 315 (2018) 345–361.
[43] O. L. Mangasarian, Nonlinear Programming, SIAM Press, Philadelphia, USA, 1993.
[44] Y. Shao, N. Deng, W. Chen, Z. Wang, Improved generalized eigenvalue proximal support vector machine, IEEE Signal Proc. Let. 20 (3) (2013) 213–216.
[45] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, UK, 1996.
[46] Z. Yang, K. Fang, S. Kotz, On the student's t-distribution and the t-statistic, J. Multivariate Anal. 98 (6) (2007) 1293–1304.
[47] A. Hatamlou, Black hole: A new heuristic optimization approach for data clustering, Inform. Sciences 222 (2013) 175–184.
[48] Z. Yu, Z. Wang, J. You, J. Zhang, J. Liu, H. S. Wong, G. Han, A new kind of nonparametric test for statistical comparison of multiple classifiers over multiple datasets, IEEE Trans. Cybern. 47 (12) (2017) 4418–4431.



[49] K. De Brabanter, J. De Brabanter, J. A. K. Suykens, B. De Moor, Optimized fixed-size kernel models for large data sets, Comput. Stat. Data An. 54 (6) (2010) 1484–1504.
[50] W. Chen, C. Li, Y. Shao, J. Zhang, N. Deng, 2DRLPP: Robust two-dimensional locality preserving projection with regularization, Knowl.-Based Syst. 169 (2019) 53–66.
[51] Z. Yang, Y. Shao, X. Zhang, Multiple birth support vector machine for multi-class classification, Neural Comput. Appl. 22 (1) (2013) 153–161.
[52] H. Wang, Y. Shi, L. Niu, Y. Tian, Nonparallel support vector ordinal regression, IEEE Trans. Cybern. 47 (10) (2017) 3306–3317.
[53] Q. Ye, H. Zhao, Z. Li, X. Yang, S. Gao, T. Yin, N. Ye, L1-norm distance minimization-based fast robust twin support vector k-plane clustering, IEEE T. Neur. Net. Lear. 29 (9) (2018) 4494–4503.


Wei-Jie Chen received his B.S. degree in Electrical Engineering and Automation in 2006 and his Ph.D. degree in Control Science and Engineering in 2011, both from Zhejiang University of Technology, China. From 2017 to 2018, he was a visiting scholar at the Centre for Artificial Intelligence, University of Technology Sydney, Australia (with supervisor Prof. Ivor Tsang). Currently, he is an associate professor at the Zhijiang College, Zhejiang University of Technology. His research interests include pattern recognition, intelligent computation, and manifold learning. He has published over 40 refereed papers.


Yuan-Hai Shao received his B.S. degree from the College of Mathematics, Jilin University, and his Ph.D. degree from the College of Science, China


Agricultural University, China, in 2006 and 2011, respectively. Currently, he is a full professor at the School of Economics and Management, Hainan University. His research interests include optimization methods, machine learning and data mining. He has published over 50 refereed papers.


Chun-Na Li received her Master's degree and Ph.D. degree from the Department of Mathematics, Harbin Institute of Technology, China, in 2009 and 2012, respectively. Currently, she is an associate professor at the Zhijiang College, Zhejiang University of Technology. Her research interests include optimization methods, machine learning and data mining.


Ming-Zeng Liu received his B.S. degree from the School of Mathematical Sciences and his Ph.D. degree from the School of Automotive Engineering, both at Dalian University of Technology, China, in 2008 and 2013, respectively. He is a lecturer at the School of Mathematics and Physics Science, Dalian University of Technology at Panjin. His research interests include machine learning, intelligent computation and computational geometry. He has published over 20 refereed papers.


Zhen Wang received his doctoral degree from the College of Mathematics, Jilin University, China, in 2014. Currently, he is a lecturer in the School of Mathematical


Sciences, Inner Mongolia University. His research interests include pattern recognition, text categorization, and data mining.


Nai-Yang Deng received his M.Sc. degree from the Department of Mathematics, Peking University, China, in 1967. He is a full professor in the College of Science, China Agricultural University, an honorary director of the China Operations Research Society, a managing editor of the Journal of Operational Research, and an editor of International Operations Research Abstracts. His research interests mainly include operational research, optimization, machine learning and data mining. He has published over 100 refereed papers.
