A safe accelerative approach for pinball support vector machine classifier


Zhiji Yang, Yitian Xu*
College of Science, China Agricultural University, Beijing 100083, China
*Corresponding author. E-mail addresses: [email protected] (Z. Yang), [email protected] (Y. Xu).

Article info
Article history: Received 10 October 2017; Revised 1 February 2018; Accepted 4 February 2018; Available online xxx.
Keywords: Support vector machine; Pinball loss; Safe screening; Variational inequality

Abstract: Support vector machine (SVM) and its extensions have seen many successes in recent years. As an extension that enhances the noise insensitivity of SVM, the SVM with pinball loss (PinSVM) has attracted much attention. However, existing solvers for PinSVM still struggle with large data. In this paper, we propose a safe screening rule for accelerating PinSVM (SSR-PinSVM) to reduce the computational cost. The proposed rule identifies most inactive instances and removes them before the optimization problem is solved. It is safe in the sense that it is guaranteed to yield exactly the same solution as solving the original problem. The SSR-PinSVM covers changes of multiple parameters; the existing DVI-SVM can be regarded as a special case of SSR-PinSVM in which the parameter τ is held constant. Moreover, our screening rule is independent of the solver, so it can be combined with other fast algorithms. We further provide a dual coordinate descent method for PinSVM (DCDM-PinSVM) as an efficient solver. Numerical experiments on six artificial data sets, twenty-three benchmark data sets, and a real biological data set demonstrate the feasibility and validity of the proposed method. © 2018 Elsevier B.V. All rights reserved.

1. Introduction

Support vector machine (SVM), proposed by Vapnik, is one of the most successful classification algorithms [1,2]. Many improvements have been developed to enhance its performance, and they have gained widespread use in recent years [3–6]. The basic idea of SVM is to construct two parallel hyperplanes that separate two classes of instances while maximizing the distance between the hyperplanes. The penalty on the training instances is controlled by a hinge loss. However, the hinge loss is related to the shortest distance between the two classes and is therefore sensitive to noise. To address this issue, Huang et al. proposed an SVM classifier with pinball loss (PinSVM) to improve the prediction performance [7]. The pinball loss is related to the quantile distance and is less sensitive to noisy data [8–10]. The major advantage of PinSVM is its noise insensitivity, especially for feature noise around the decision boundary. The computational complexity of PinSVM is similar to that of the hinge-loss SVM, but PinSVM loses sparsity. Although the improved method in [10] enjoys sparsity to some extent, it does not perform well in terms of computational speed, since its optimization problem is non-convex: a complex iterative process is needed to transform the non-convex problem into a series of convex ones.

Some efficient algorithms have been presented to improve the computational speed of PinSVM. In [11], sequential minimal optimization for PinSVM (SMO-PinSVM) was established to solve its optimization problem; it breaks the large quadratic programming problem (QPP) into a series of QPPs that are as small as possible and can thus handle large training sets [12]. In [13], an algorithm was established to find the entire solution path of PinSVM over different τ values. Additionally, in [14,15], the pinball loss was introduced into twin support vector machines (TSVMs) [16–18] to enhance robustness, and researchers have proposed efficient solvers to further improve the computational speed of TSVMs [19,20].

Although these algorithms are faster, they do not take full advantage of a key characteristic of PinSVM: most elements of its dual solution λ are equal to the constants −τc or c, and only a few elements lie in the open interval (−τc, c). For convenience, we define the instances corresponding to λi ∈ (−τc, c) as active instances, and those corresponding to λj = −τc or c as inactive instances. If, during training, the great mass of inactive instances could be identified quickly before the optimization problem is solved, only a smaller-sized problem would need to be solved rather than the whole one, and training would be greatly accelerated. For the algorithms mentioned above, however, inactive instances cannot


be identified before the QPP is solved, so a large optimization problem involving all training instances must still be solved, which is a hard challenge especially for large-scale data sets. In view of the above, we propose an instance reduction strategy for accelerating PinSVM.

There are many existing approaches for instance reduction. In general, they can be grouped into two main categories [21–23] according to their relationship with the training process: 1) preprocessing steps applied before training [24,25]; 2) methods embedded in the training algorithms [26,27]. Most existing approaches belong to the first category. These methods are independent of the training models, and the difficulty is to ensure that the preprocessor actually fits the training model; the common way to verify the performance of a preprocessor is simply to use the k-nearest neighbors classifier as the learning algorithm. In the second category, instance reduction is embedded in a specific model, which to some extent better guarantees safety. However, the approaches mentioned above all rest on prior assumptions, such as a clustering assumption; in real applications, data that do not conform to the assumption lead to negative consequences. Furthermore, extra time and memory are required by the instance reduction process.

Recently, a promising technique called "safe screening" was presented to handle large-scale data sets for sparse models [28–37]. The screening rule is embedded in the training process. Compared with the reduction methods mentioned above, this technique has two advantages: 1) the instances discarded by screening are mathematically guaranteed to have no influence on the solution, without any prior assumption; 2) the screening process runs in negligible time. The screening approach for the traditional SVM, "the dual problem of SVM via variational inequalities" (DVI-SVM) [38], has shown outstanding performance: it significantly reduces computational cost and memory, and it is safe in the sense that the instances discarded by the rule are guaranteed to be non-support vectors.

Although DVI-SVM is an admirable method for reducing the computational cost of the traditional SVM, it cannot be directly applied to PinSVM; different models require different screening rules. In DVI-SVM there is only one parameter c, and the feasible region of the dual solution is invariant. In PinSVM there are two parameters, τ and c, and the feasible region of the solution changes with τ. In this paper, by introducing the screening idea into PinSVM, we propose a safe screening rule for accelerating PinSVM (SSR-PinSVM) to reduce the computational cost. The SSR-PinSVM is derived by analyzing the dual problem via the Karush-Kuhn-Tucker (KKT) conditions. With our screening rule, the inactive instances can be substantially reduced. Our method is safe in the sense that it does not sacrifice the optimal solution. Unlike the existing DVI-SVM, our method allows multiple parameters to change simultaneously, and when the parameter τ is fixed, SSR-PinSVM reduces to DVI-SVM. Moreover, the SSR is independent of the solver, as it is applied before the optimization problem is solved; hence other efficient solvers can be combined with it. We propose a dual coordinate descent method (DCDM) as the efficient solver. The main contributions of our approach are summarized as follows:

(i) We propose an SSR-PinSVM via variational inequalities to efficiently deal with large data; it speeds up the original solver several times and is guaranteed to achieve the same solution.
(ii) Multiple parameters τ and c are included in our SSR-PinSVM, in contrast to the previous DVI-SVM in which only one parameter c is considered.
(iii) Our rule is independent of the solver. In this paper, we propose the DCDM as the solving algorithm.

The paper is outlined as follows. Section 2 briefly introduces PinSVM and gives the motivation for our method. Our SSR-PinSVM is proposed in Section 3. Some discussions are given in Section 4. Section 5 reports numerical experiments on six artificial data sets, twenty-three benchmark data sets, and a real biological data set to investigate the validity of the proposed algorithm. The last section contains the conclusions.

Notations: Throughout this paper, scalars are denoted by italic letters and vectors by boldface letters. We use $\langle x, y\rangle = \sum_i x_i y_i$ to denote the inner product of vectors x and y, and $\|x\|^2 = \langle x, x\rangle$. For a vector x, $x_i$ is its ith component. If X is a matrix, $X_i$ is the ith row of X and $X_{i,j}$ is its (i, j)th entry. For a vector x or a matrix X and an index set $J = \{j_1, \dots, j_k\}$, let $x_J = (x_{j_1}, x_{j_2}, \dots, x_{j_k})^T$ and $X_J = (X_{j_1}; X_{j_2}; \dots; X_{j_k})$. In addition, for a feasible solution θ of an optimization problem, θ* denotes its corresponding optimal value.

2. Preliminaries

We review PinSVM and then give the motivation for our method.

2.1. Review of PinSVM

Suppose we have a set of observations $\{x_i, y_i\}_{i=1}^{l}$, where the row vector $x_i \in \mathbb{R}^n$ represents the ith data instance and $y_i \in \{-1, +1\}$ is the corresponding label. The traditional SVM with hinge loss is formulated as:

$$\min_{w,b}\;\; \frac{1}{2}\|w\|^2 + c\sum_{i=1}^{l} L_{\mathrm{hinge}}\bigl(1 - y_i f(x_i)\bigr),$$

where $f(x_i) = \langle w, x_i\rangle + b$, c is a trade-off parameter, and $L_{\mathrm{hinge}}(u) = \max\{u, 0\}$. Under the hinge loss, sufficiently well classified points are not penalized, which is why the traditional SVM enjoys good sparsity. However, this feature also makes it sensitive to noise, since most instances are correctly classified and only the small number of misclassified points determines the classifier. To address this issue, Huang et al. [7] introduced the pinball loss into the SVM to improve the prediction performance. The pinball loss is given as follows:



$$L_\tau(u) = \begin{cases} u, & \text{if } u \ge 0,\\ -\tau u, & \text{if } u < 0, \end{cases}$$

where τ is a parameter. The pinball loss not only penalizes misclassified points but also assigns weight to sufficiently well classified instances. This enhances the noise insensitivity of the classifier, especially for feature noise around the decision boundary. The SVM with pinball loss is formulated as the following QPP:

$$\begin{aligned} \min_{w,b,\xi}\;\; & \frac{1}{2}\|w\|^2 + c\sum_{i=1}^{l}\xi_i\\ \text{s.t.}\;\; & y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\\ & y_i(\langle w, x_i\rangle + b) \le 1 + \frac{1}{\tau}\xi_i, \quad i = 1, 2, \dots, l. \end{aligned} \tag{1}$$

Once the optimal solutions w∗ and b∗ are achieved, we can determine the label of a new testing point x with the following decision function:

$$D(x) = \mathrm{sign}(f(x)) = \mathrm{sign}(\langle w^*, x\rangle + b^*).$$
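Both loss functions and the decision rule above are simple to implement. The following is a minimal illustrative sketch in Python/NumPy (the paper's experiments use MATLAB, so this is an assumption of convenience, not the authors' code):

```python
import numpy as np

def hinge_loss(u):
    """Hinge loss max(u, 0); well-classified points (u <= 0) get no penalty."""
    return np.maximum(u, 0.0)

def pinball_loss(u, tau):
    """Pinball loss L_tau(u): u for u >= 0, -tau*u for u < 0."""
    return np.where(u >= 0, u, -tau * u)

def decide(x, w_star, b_star):
    """Decision function D(x) = sign(<w*, x> + b*)."""
    return np.sign(x @ w_star + b_star)
```

Both losses are applied to u = 1 − y_i f(x_i); unlike the hinge loss, the pinball loss is nonzero on both sides of zero, which is exactly why PinSVM trades sparsity for noise insensitivity.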


The PinSVM has many admirable properties, including noise insensitivity and the minimization of misclassification error. To improve its computational speed, several efficient algorithms have been proposed [11,13,14]. However, they concentrate only on the solver itself, and dealing with large data remains challenging.

2.2. Motivations

For simplicity, let $x_i = [x_i, 1]$ and $w = [w; b]$, so that $\langle w, x_i\rangle + b = \langle w, x_i\rangle$. The QPP (1) can then be transformed into the following problem:

$$\begin{aligned} \min_{w,\xi}\;\; & \frac{1}{2}\|w\|^2 + c\sum_{i=1}^{l}\xi_i\\ \text{s.t.}\;\; & y_i\langle w, x_i\rangle \ge 1 - \xi_i,\\ & y_i\langle w, x_i\rangle \le 1 + \frac{1}{\tau}\xi_i, \quad i = 1, 2, \dots, l. \end{aligned} \tag{2}$$

The common way to solve this type of problem is to convert it into its dual via the KKT conditions (details are given in Appendix A). From the KKT conditions, the following conclusions can be drawn:

1. $\alpha_i^* = 0$ and $\beta_i^* = \tau c$, if $y_i x_i w^* > 1$;
2. $\alpha_i^* = c$ and $\beta_i^* = 0$, if $y_i x_i w^* < 1$,

where $\alpha^*$, $\beta^*$ and $w^*$ are the optimal solutions. Letting $\lambda = \alpha - \beta$, we obtain the dual QPP of (2):

$$\begin{aligned} \max_{\lambda}\;\; & -\frac{1}{2}\lambda^T Y X X^T Y \lambda + e^T\lambda\\ \text{s.t.}\;\; & -\tau c \le \lambda \le c, \end{aligned} \tag{3}$$

where e is an all-ones vector of appropriate dimension. To simplify the representation, let $\theta = \frac{1}{c}\lambda$ and $Z = YX$. Then Eq. (3) is equivalent to the following QPP:

$$\begin{aligned} \min_{\theta}\;\; & \frac{c}{2}\theta^T Z Z^T \theta - e^T\theta\\ \text{s.t.}\;\; & -\tau \le \theta \le 1. \end{aligned} \tag{4}$$

Let $\theta^*$ be the optimal solution of (4); the primal $w^*$ can then be represented as:

$$w^* = cZ^T\theta^*. \tag{5}$$

Note that the values of the parameters c and τ in QPP (4) need to be selected before training. A common choice is grid search, which requires a repetitive training process while tuning parameters and is usually time-consuming, especially for large-scale data. Based on this fact, we aim to utilize the information obtained at the current parameters to reduce the scale of the optimization problem at the next parameters. For a better understanding of our work, the whole scheme is given in Fig. 1. As the figure shows, if the model does not meet the requirement on prediction accuracy, the parameter values must be tuned repeatedly and the data retrained. Our accelerative approach is reflected in the dashed rectangle: this step reduces the scale of the next optimization problem by utilizing the results obtained at the current parameters, which greatly improves the computational speed without sacrificing the optimal accuracy.

The natural question is: what is the theoretical basis of our work? In fact, conclusions 1 and 2 above imply that



$$\theta_i^* = \begin{cases} -\tau, & \text{if } y_i x_i w^* > 1,\\ 1, & \text{if } y_i x_i w^* < 1, \end{cases} \tag{6}$$

where $\theta_i^* \in [-\tau, 1]$, $i = 1, \dots, l$. This indicates that some $\theta_i^*$ can be obtained directly if we can determine the value of $y_i x_i w^*$; specifically, some $\theta_i^*$ can be set to $-\tau$ or 1 without solving the optimization problem. Our proposed algorithm is based on this observation. For convenience, we define the instances with $y_i x_i w^* \ne 1$ (so that $\theta_i^* = -\tau$ or 1) as inactive instances, and those with $y_i x_i w^* = 1$ as active instances. Our basic idea is as follows: if we can utilize the information obtained at the current parameters to identify the inactive instances at the next parameters, the corresponding variables $\theta_i^*$ can be fixed before the QPP is solved, and only a reduced problem composed of the active instances needs to be solved. The most important advantage is that this is guaranteed to achieve the same solution as the original method.
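To make Eqs. (5)–(6) concrete, here is a hedged NumPy sketch (function and tolerance names are illustrative, not from the paper) that recovers w* from a dual solution θ* and splits the instances into active and inactive sets:

```python
import numpy as np

def split_instances(Z, theta_star, c, tol=1e-8):
    """Z = YX with rows z_i = y_i * [x_i, 1]; theta_star solves QPP (4)."""
    w_star = c * Z.T @ theta_star      # Eq. (5): w* = c Z^T theta*
    margins = Z @ w_star               # y_i <w*, x_i> for every instance
    R = np.where(margins > 1 + tol)[0] # inactive: theta_i* = -tau
    L = np.where(margins < 1 - tol)[0] # inactive: theta_i* = 1
    E = np.where(np.abs(margins - 1) <= tol)[0]  # active instances
    return w_star, R, E, L
```

In exact arithmetic the comparison is against 1; the tolerance is only a numerical safeguard assumed here for floating-point margins.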

3. A safe accelerative approach for PinSVM

Our proposed method aims to identify the inactive instances and reduce QPP (4) to a small-sized problem. In Section 3.1 we give the form of the reduced problem. In Section 3.2 the relationship between the solutions at the current and next parameters is analysed, and based on this relationship our SSR-PinSVM is proposed in Section 3.3. Finally, in Section 3.4, we develop an efficient DCDM solver that is embedded in our accelerative approach.

3.1. The reduced problem of SSR-PinSVM

For notational convenience, let

$$\begin{aligned} R &= \{i : y_i x_i w^* > 1,\; \theta_i^* = -\tau\},\\ E &= \{i : y_i x_i w^* = 1,\; \theta_i^* \in [-\tau, 1]\},\\ L &= \{i : y_i x_i w^* < 1,\; \theta_i^* = 1\}. \end{aligned} \tag{7}$$

By our screening rule, we can determine the values of some elements of $\theta^*$ before solving QPP (4); the remaining elements are then calculated by solving a reduced problem. Suppose D is the index set of identified instances in R and L, S is the index set of the remaining instances, and $l'$ is the cardinality of S. The reduced problem can be written as

$$\begin{aligned} \min_{\theta \in \mathbb{R}^{l'}}\;\; & \frac{1}{2}\theta^T H_1 \theta + f^T\theta\\ \text{s.t.}\;\; & -\tau \le \theta \le 1, \end{aligned} \tag{8}$$

where $H_1 = H_{S,S}$ and $f = H_{S,D}\theta_D^* - e$, with $H = cZZ^T$.
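As a minimal sketch (assuming $H = cZZ^T$ as implied by $H_{ii} = cZ_iZ_i^T$ in Appendix C; names are illustrative), the reduced problem (8) could be assembled like this:

```python
import numpy as np

def reduced_problem(Z, c, D, theta_D, S):
    """Build H1 and f of Eq. (8) from screened indices D (values theta_D)
    and remaining indices S."""
    H = c * (Z @ Z.T)
    H1 = H[np.ix_(S, S)]                              # H_{S,S}
    f = H[np.ix_(S, D)] @ theta_D - np.ones(len(S))   # f = H_{S,D} theta_D* - e
    # solve: min 0.5 theta^T H1 theta + f^T theta, s.t. -tau <= theta <= 1
    return H1, f
```

The key point is that only the |S| × |S| block of H enters the optimization, which is where the memory and time savings come from.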

3.2. Feasible set construction

To establish the reduced problem (8), we need to determine the membership of D. Condition (6) looks useful for this, but it is generally not applicable, since $w^* = w(\theta^*)$ is unknown. To overcome this difficulty, we estimate a region Θ such that $\theta^* \in \Theta$. As a result, we have the following relaxed version of (6):

$$\min_{\theta\in\Theta} y_i x_i w(\theta) > 1 \;\Rightarrow\; y_i x_i w(\theta^*) > 1 \;\Rightarrow\; \theta_i^* = -\tau,$$
$$\max_{\theta\in\Theta} y_i x_i w(\theta) < 1 \;\Rightarrow\; y_i x_i w(\theta^*) < 1 \;\Rightarrow\; \theta_i^* = 1, \tag{9}$$

where w(θ) is obtained from (5) and changes with θ. Suppose we are given the parameter values $(c_0, \tau_0)$ and $(c_1, \tau_1)$, and that $\theta^*(c_0, \tau_0)$ is known. Let $\Theta_{c_1,\tau_1}$ be a feasible region that includes $\theta^*(c_1, \tau_1)$; we estimate $\Theta_{c_1,\tau_1}$ by utilizing $\theta^*(c_0, \tau_0)$. The main technique for estimating this region is variational inequalities. For self-completeness, we cite the definition as follows:

ARTICLE IN PRESS

JID: KNOSYS 4

[m5G;February 9, 2018;13:23]

Z. Yang, Y. Xu / Knowledge-Based Systems 000 (2018) 1–13

Tune values of parameters

Start

Training data

Testing data

Solve optimization problem and obtain the classifier

Prediction

Satisfy the requirement on accuracy

Safely reduce the scale of the next problem with the information at the current parameters

Yes

Stop

No

Fig. 1. The whole scheme of our SSR-PinSVM.

Lemma 1. [39] For the constrained convex optimization problem

$$\min_{\theta\in G} g(\theta),$$

with G convex and closed and g(·) convex and differentiable, $\theta^* \in G$ is an optimal solution if and only if

$$\langle \nabla g(\theta^*), \theta - \theta^*\rangle \ge 0, \quad \forall \theta \in G.$$

For notational convenience, $\theta^*(c_0, \tau_0)$ and $\theta^*(c_1, \tau_1)$ are abbreviated as $\theta_0^*$ and $\theta_1^*$, respectively, in the following discussion. The following theorem (proved in Appendix B) shows that $\theta_1^*$ can be effectively bounded in terms of $\theta_0^*$.

Theorem 2. For problem (4), given $c_1 \ge c_0 > 0$ and $\tau_1 \ge \tau_0 > 0$, we have

$$\|Z^T\theta_1^* - \nu Z^T\theta_0^*\|^2 \le \rho,$$

where $r = \frac{\tau_0}{\tau_1}$, $\nu = \frac{rc_0 + c_1}{2c_1}$ and $\rho = \left(\nu^2 - \frac{c_0}{c_1}\right)\theta_0^{*T} Z Z^T \theta_0^* + \frac{(1-r)l}{c_1}$.

By Theorem 2, the vector $Z^T\theta_1^*$ lies in a ball centered at $\nu Z^T\theta_0^*$ with radius $|\rho|^{1/2}$. Since the center and radius are known, the ball can be taken as the region $\Theta_{c_1,\tau_1}$: given $(c_0, \tau_0)$ and $\theta_0^*$, we can estimate a feasible region $\Theta_{c_1,\tau_1}$ such that $\theta_1^* \in \Theta_{c_1,\tau_1}$.

3.3. A safe screening rule for PinSVM

Based on the above results, once $\theta_0^*$ is given, the region $\Theta_{c_1,\tau_1}$ can be estimated. Then, for a given instance $x_i$, we can obtain upper and lower bounds on $y_i x_i w(\theta)$ over $\theta \in \Theta_{c_1,\tau_1}$, and by combining these bounds with condition (9) we can decide whether $x_i$ is an inactive instance. This is the central idea of our screening rule, which is summarized in the following theorem.

Theorem 3. (SSR for PinSVM) For problem (4), suppose we are given $\theta_0^*$. Then, for any $\tau_1 > 0$ and $c_1 > 0$, we have $\theta_i^*(c_1, \tau_1) = 1$ if

$$\nu c_1 Z_i Z^T\theta_0^* + c_1 |\rho|^{\frac{1}{2}} \|Z_i\| < 1. \tag{10}$$

Similarly, we have $\theta_j^*(c_1, \tau_1) = -\tau_1$ if

$$\nu c_1 Z_j Z^T\theta_0^* - c_1 |\rho|^{\frac{1}{2}} \|Z_j\| > 1. \tag{11}$$

Proof. We prove the first half of the statement, (10); the second half, (11), can be proved analogously. To show $\theta_i^*(c_1, \tau_1) = 1$, condition (6) implies that we only need to show $y_i x_i w(\theta_1^*) < 1$. According to condition (9), once the region $\Theta_{c_1,\tau_1}$ is known, the next step is to estimate the upper bound of $y_i x_i w(\theta_1)$ within this region. We have

$$\begin{aligned} y_i x_i w(\theta_1^*) &= c_1 Z_i Z^T\theta_1^*\\ &= c_1 Z_i (Z^T\theta_1^* - \nu Z^T\theta_0^*) + \nu c_1 Z_i Z^T\theta_0^*\\ &\le c_1 \|Z_i\| \cdot \|Z^T\theta_1^* - \nu Z^T\theta_0^*\| + \nu c_1 Z_i Z^T\theta_0^*\\ &\le c_1 \|Z_i\|\, |\rho|^{\frac{1}{2}} + \nu c_1 Z_i Z^T\theta_0^*\\ &< 1, \end{aligned}$$

where the second inequality is due to Theorem 2, and the last line is due to the assumption of the statement. □

Finally, we provide a sequential version of the safe screening rule for the parameter-tuning procedure over $\{(c_k, \tau_k)\,|\,k = 0, 1, \dots, K\}$.

Corollary 4. (Sequential SSR for PinSVM) For problem (4), suppose we are given a sequence of parameters $\{(c_k, \tau_k)\,|\,k = 0, 1, \dots, K\}$, and assume $\theta_k^*$ is known for an integer $0 \le k \le K$. Let $r_k = \frac{\tau_k}{\tau_{k+1}}$, $\nu_k = \frac{r_k c_k + c_{k+1}}{2c_{k+1}}$ and $\rho_k = \left(\nu_k^2 - \frac{c_k}{c_{k+1}}\right)\theta_k^{*T} Z Z^T \theta_k^* + \frac{(1-r_k)l}{c_{k+1}}$. Then we have $\theta_i^*(c_{k+1}, \tau_{k+1}) = 1$ if

$$\nu_k Z_i Z^T\theta_k^* + |\rho_k|^{\frac{1}{2}} \|Z_i\| < \frac{1}{c_{k+1}}. \tag{12}$$

Similarly, we have $\theta_j^*(c_{k+1}, \tau_{k+1}) = -\tau_{k+1}$ if

$$\nu_k Z_j Z^T\theta_k^* - |\rho_k|^{\frac{1}{2}} \|Z_j\| > \frac{1}{c_{k+1}}. \tag{13}$$

In general, a problem with all training instances must be solved once at $(c_0, \tau_0)$; then, for each combination $(c_k, \tau_k)$ with $k \ge 1$, only a reduced problem needs to be solved thanks to our SSR-PinSVM. This saves computational cost and memory compared with the traditional grid search, in which all instances enter the optimization problem at every $(c_k, \tau_k)$.

3.4. A dual coordinate descent method for PinSVM

Our SSR-PinSVM is clearly independent of the solver of the optimization problem. To further accelerate the computation, we develop a DCDM algorithm [40] to solve the QPP; its derivation is given in Appendix C. The pseudo-code of DCDM-PinSVM is summarized in Algorithm 1, and the pseudo-code of our SSR-PinSVM with DCDM (SSR-DCDM-PinSVM) is given in Algorithm 2.
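The sequential rule of Corollary 4 is cheap to evaluate: it needs only $Z^T\theta_k^*$, the per-instance products $Z_i Z^T\theta_k^*$, and the instance norms. A hedged NumPy sketch (names are illustrative):

```python
import numpy as np

def sequential_screen(Z, theta_k, c_k, tau_k, c_next, tau_next):
    """Flag instances fixed at the next parameter pair by Eqs. (12)-(13)."""
    r = tau_k / tau_next
    nu = (r * c_k + c_next) / (2.0 * c_next)
    u = Z.T @ theta_k                               # Z^T theta_k
    rho = (nu**2 - c_k / c_next) * (u @ u) \
          + (1.0 - r) * Z.shape[0] / c_next
    rad = np.sqrt(abs(rho))                         # |rho|^(1/2)
    proj = nu * (Z @ u)                             # nu * Z_i Z^T theta_k
    norms = np.linalg.norm(Z, axis=1)               # ||Z_i||
    set_one = proj + rad * norms < 1.0 / c_next     # Eq. (12): theta_i* = 1
    set_tau = proj - rad * norms > 1.0 / c_next     # Eq. (13): theta_j* = -tau
    return set_one, set_tau
```

The cost is one matrix-vector product plus O(l) arithmetic, which matches the claim that the screening time is negligible relative to solving the QPP.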


Algorithm 1 DCDM-PinSVM.
Input: matrix Z, index set S, parameters τ and c, vector f
Output: θ
Initialize θ_i = −τ for all i = 1, …, l′; w = cZ_S^T θ
while θ = [θ_1, θ_2, …, θ_{l′}]^T is not optimal do
  for i = 1 to l′ do
    A = Z_i w − f_i
    PA = A if −τ < θ_i < 1; PA = min(0, A) if θ_i = −τ; PA = max(0, A) if θ_i = 1
    if |PA| ≠ 0 then
      θ_i^old = θ_i; H_ii = cZ_i Z_i^T
      θ_i ← min(max(θ_i^old − A / H_ii, −τ), 1)
      w ← w + c(θ_i − θ_i^old)Z_i
    end if
  end for
end while

Algorithm 2 SSR-DCDM-PinSVM.
Input: training set T, two vectors of parameters c and τ
Output: θ*
k = 0; (c_k, τ_k) ← (c_1, τ_1)
θ_k* ← solve QPP (4) with (c_k, τ_k) using DCDM
for i = 1 to length(τ) do
  for j = 1 to length(c) do
    (c_{k+1}, τ_{k+1}) ← (c_j, τ_i)
    θ_D* ← Corollary 4
    θ_S* ← solve (8) with (c_{k+1}, τ_{k+1}) using DCDM
    θ_{k+1}* ← combine θ_D* and θ_S*
    k = k + 1
  end for
end for
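For concreteness, here is a compact Python rendering of Algorithm 1 for the box-constrained QP $\min \frac{1}{2}\theta^T H_1\theta + f^T\theta$, $-\tau \le \theta \le 1$. It follows the update rules in the text but is only a sketch, not the authors' MATLAB implementation; the stopping rule (maximum coordinate change) is a simplification assumed here in place of the objective-based criterion used in the experiments:

```python
import numpy as np

def dcdm(Z_S, f, tau, c, max_iter=5000, eps=1e-5):
    l = Z_S.shape[0]
    theta = np.full(l, -tau)
    w = c * Z_S.T @ theta
    Hii = c * np.sum(Z_S * Z_S, axis=1)    # diagonal entries c * Z_i Z_i^T
    for _ in range(max_iter):
        max_step = 0.0
        for i in range(l):
            A = Z_S[i] @ w - f[i]          # coordinate gradient term
            if theta[i] <= -tau:           # projected gradient at the box
                PA = min(0.0, A)
            elif theta[i] >= 1.0:
                PA = max(0.0, A)
            else:
                PA = A
            if PA != 0.0 and Hii[i] > 0:
                old = theta[i]
                theta[i] = min(max(old - A / Hii[i], -tau), 1.0)
                w += c * (theta[i] - old) * Z_S[i]
                max_step = max(max_step, abs(theta[i] - old))
        if max_step < eps:
            break
    return theta
```

Maintaining w incrementally makes each coordinate update O(n), which is the main reason DCDM scales to the larger data sets in Section 5.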

4. Discussions

The accelerative approach for PinSVM was given in the previous section. In this section we study the factors that affect the efficiency of the proposed SSR-PinSVM and compare it with other existing screening methods.

4.1. Parameter setting

Our screening rule is independent of the solver, but it depends on the parameters c and τ; specifically, the parameter interval affects how many instances are screened. To simplify the discussion, we analyse only the interval between c_0 and c_1, i.e., d = c_1 − c_0 > 0 with τ_1 = τ_0. Inequality (10) is then equivalent to the following formulation:

$$d < \frac{2 - 2c_0 Z_i Z^T\theta_0^*}{Z_i Z^T\theta_0^* + \|Z_i\| \cdot \|Z^T\theta_0^*\|}.$$

Here $Z_i = y_i x_i$. When $c_0$ and $\theta_0^*$ are given, the right-hand side of this inequality depends on the individual instance. If d is set too large, the rule may fail to find any inactive instance with $\theta_i^*(c_1) = 1$; certainly, d cannot exceed

$$\max_{i=1,2,\dots,l} \left\{ \frac{2 - 2c_0 Z_i Z^T\theta_0^*}{Z_i Z^T\theta_0^* + \|Z_i\| \cdot \|Z^T\theta_0^*\|} \right\}.$$

Analogously, from inequality (11), d cannot exceed

$$\max_{i=1,2,\dots,l} \left\{ \frac{2 - 2c_0 Z_i Z^T\theta_0^*}{Z_i Z^T\theta_0^* - \|Z_i\| \cdot \|Z^T\theta_0^*\|} \right\}.$$
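A hedged sketch of the first bound (the maximal grid spacing d = c_1 − c_0, τ fixed, beyond which no θ_i* = 1 can be certified); the function name is illustrative and no guard against non-positive denominators is included:

```python
import numpy as np

def max_safe_step(Z, theta0, c0):
    u = Z.T @ theta0
    zi_u = Z @ u                      # Z_i Z^T theta_0, per instance
    denom = zi_u + np.linalg.norm(Z, axis=1) * np.linalg.norm(u)
    bounds = (2.0 - 2.0 * c0 * zi_u) / denom
    return bounds.max()
```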

In the experiments we verified that the common grid-search parameter setting (c from $\{2^i\,|\,i = -5.0, -4.9, \dots, 5.0\}$ and τ from $\{0.1, 0.2, \dots, 0.9\}$) fits our screening rule well.

4.2. Comparisons with other related methods

Our approach is inspired by the screening method DVI-SVM [38]. The comparison between the two methods is as follows. Our SSR-PinSVM includes two parameters, c and τ; the existing DVI-SVM [38] concentrates on only one parameter, the trade-off parameter c of the traditional SVM. When the pinball loss is used instead of the hinge loss, the constraint of the dual problem (4) changes: it involves the parameter τ, i.e., the feasible region of θ changes with τ, which makes it difficult to apply DVI directly to PinSVM. The trick we use is to regard $\frac{\tau_0}{\tau_1}\theta_1^*$ as a feasible solution of problem (4) at $(c_0, \tau_0)$ (see Eq. (B.1)); with this trick, the variational inequalities can be applied despite the feasible region changing with τ. By a simple derivation, it can be shown that the existing DVI-SVM is a special case of our SSR-PinSVM when the value of τ is fixed.

Besides, instance reduction has been introduced into SVMs in some recent literature. In [23], two sample selection methods were proposed; although they perform decently, they cannot guarantee the safety of the solution. In [30,35,41], the screening idea was applied to semi-supervised learning, the supervised twin SVM, and Universum data, respectively. Although these methods are all based on variational inequalities, their derivations are quite different. In this paper we focus on the model with pinball loss: as discussed above, the parameter τ of the pinball loss is included in our work, and the difficulty of a changing feasible region is overcome.

5. Numerical experiments

To verify the feasibility and effectiveness of the proposed approach, we conduct numerical experiments on six artificial data sets and twenty-three benchmark data sets, and further investigate the performance of the method on a real biological data set. On the artificial data sets we compare our SSR-PinSVM with the original PinSVM to evaluate the screening approach; on the benchmark data sets four other algorithms are also compared. All methods are implemented in MATLAB R2014a on Windows 7, on a PC with an Intel Core i5-4590 CPU at 3.30 GHz and 8 GB of RAM. The optimal value of c is chosen from $\{2^i\,|\,i = -5.0, -4.9, \dots, 5.0\}$ and τ from $\{0.1, 0.2, \dots, 0.9\}$.

5.1. Experiments on artificial data sets

We evaluate the original PinSVM and SSR-PinSVM on six 2-D artificial data sets, Toy1-Toy6, all generated for binary classification. Both algorithms are solved with the MATLAB toolbox "quadprog". On the Toy1, Toy2 and Toy3 data sets, we show the effectiveness of SSR-PinSVM in discarding inactive instances even for overlapping classes. The linear kernel $K(x_i, x_j) = (x_i \cdot x_j)$ is used. For these data sets, each class has 250 data points generated from $N((\mu, \mu)^T, 0.75^2 I)$, where $I \in \mathbb{R}^{2\times 2}$ is the identity matrix; for the positive class μ = 1.50, 1.00, 0.75, respectively, and for the negative class μ = −1.50, −1.00, −0.75. The classification maps (with parameters $c_1 = 2^2$ and $\tau_1 = 0.2$) are plotted in Fig. 2; the preceding parameters are $c_0 = 2^{1.9}$ and $\tau_0 = 0.1$.



Fig. 2. Classification maps in linear case. The first row is obtained by original PinSVM, and the second row is achieved by SSR-PinSVM. From left to right, each column represents Toy1, Toy2 and Toy3, respectively.

The first row is produced by the original PinSVM and the second row by our proposed SSR-PinSVM. The blue and red points represent the positive and negative instances, respectively; the yellow and green points denote the identified instances corresponding to $\theta_j^* = -\tau$ and $\theta_i^* = 1$, respectively. The red, blue and black lines in each graph represent the positive, negative and decision hyperplanes, respectively.

On the Toy4 (circle), Toy5 (exclusive-or) and Toy6 (spiral) data sets, we show the performance of SSR-PinSVM on data with complex structure. The nonlinear Gaussian kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ is used; the TensorFlow Playground (http://playground.tensorflow.org/) provides a useful resource for visualizing these kinds of data. In our experiments, each class of these data sets contains 500 data points. For Toy4 and Toy5 the kernel parameter is σ = 0.8, and for Toy6 it is σ = 0.125. Classification maps (with parameters $c_1 = 2^{-4.9}$ and $\tau_1 = 0.9$) are plotted in Fig. 3; the preceding parameters are $c_0 = 2^{-5}$ and $\tau_0 = 0.8$. The first row is produced by the original PinSVM and the second row by our proposed SSR-PinSVM.
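For reference, the Gaussian kernel above is a one-liner; a small NumPy sketch (assumed helper, not from the paper):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)) for all pairs."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))
```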

From Figs. 2 and 3, we find that the original PinSVM uses all of the training instances to construct the classifier, whereas our SSR-PinSVM screens most instances, leaving only a few blue and red points for the classifier, which shows the validity of SSR-PinSVM. For the same data set (the same column), the decision hyperplanes of PinSVM and SSR-PinSVM are almost identical, which demonstrates the safety of SSR-PinSVM.

Table 1 gives detailed performance comparisons between SSR-PinSVM and the original PinSVM on the six artificial data sets. In the table, "Accuracy (%)" is the optimal prediction accuracy of each algorithm; "Total time" is the overall time (seconds) of the corresponding algorithm; "Percentage" is the average screening percentage (%) of inactive instances (those with $\theta_i^* = 1$ or $\theta_j^* = -\tau$) over the parameter combinations; and "Speedup" is computed as Speedup = time of PinSVM / time of SSR-PinSVM. From Table 1, SSR-PinSVM identifies most of the inactive instances and accelerates the original PinSVM without sacrificing the optimal accuracy, which again implies its safety.

5.2. Experiments on benchmark data sets

In this section we demonstrate the performance of the proposed method on twenty-three real-world benchmark data sets, all downloaded from the University of California Irvine (UCI) Machine Learning Repository [42] or the LIBSVM data page (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Their statistics are shown in Table 2. Each data set is divided into two parts: 4/5 of the positive instances and 4/5 of the negative instances are randomly selected as the training set, and the remaining instances are reserved as the testing set. Take the last data set, "Skin", as an example. The Skin Segmentation data set is constructed over the B, G, R color space; it is generated using skin textures from face images of people of diverse age, gender, and race, and contains 245,057 instances (50,859 positive and 194,198 negative). In our experiment, 40,687 positive instances (4/5 of 50,859) and 155,359 negative instances (4/5 of 194,198) are randomly collected as the training set (196,046 instances in total) and the rest are reserved as the testing set.

To demonstrate the effectiveness of our DCDM-PinSVM and SSR-DCDM-PinSVM, four other algorithms are compared in the experiments. 1) CSVM [1] is the classic SVM with a trade-off parameter c; the package is downloaded from http://www.esat.kuleuven.be/sista/ADB/huang/softwarePINSVM.php. Note that CSVM can be regarded as a special case of PinSVM with τ = 0.




[Fig. 3. Classification maps in the nonlinear case. The first row is obtained by the original PinSVM, and the second row by SSR-PinSVM. From left to right, the columns represent Toy4, Toy5 and Toy6, respectively.]

Table 1
Performance comparisons between SSR-PinSVM and PinSVM on six artificial data sets.

Data set | PinSVM Accuracy (%) | PinSVM Total time (s) | SSR Accuracy (%) | Percentage (θ* = 1) | Percentage (θ* = −τ) | SSR Total time (s) | Speedup
Toy1 | 99.20 | 727.59 | 99.20 | 89.80 | 87.84 | 14.49 | 50.21
Toy2 | 95.20 | 815.04 | 95.20 | 88.08 | 87.84 | 14.66 | 55.60
Toy3 | 88.40 | 978.22 | 88.40 | 82.88 | 84.67 | 19.01 | 51.46
Toy4 | 99.00 | 1965.30 | 99.00 | 36.22 | 21.80 | 1231.94 | 1.60
Toy5 | 97.00 | 2006.63 | 97.00 | 48.61 | 30.79 | 840.82 | 2.39
Toy6 | 100.00 | 765.91 | 100.00 | 30.80 | 0.01 | 587.34 | 1.30

Table 2
The statistics of the twenty-three benchmark data sets.

Data set | # Instances | # Positive | # Negative | # Features
Hepatitis | 80 | 13 | 67 | 19
Fertility | 100 | 88 | 12 | 9
PlanningRelax | 146 | 130 | 52 | 12
Sonar | 208 | 97 | 111 | 60
SpectHeart | 267 | 212 | 55 | 44
Haberman | 306 | 225 | 81 | 3
LiverDisorder | 345 | 145 | 200 | 6
Ionosphere | 351 | 225 | 126 | 34
Monks | 432 | 216 | 216 | 6
BreastCancer569 | 569 | 357 | 212 | 30
BreastCancer683 | 683 | 444 | 239 | 9
Australian | 690 | 307 | 383 | 14
Pima | 768 | 500 | 268 | 8
Banknote | 1372 | 762 | 610 | 4
Adult | 3185 | 773 | 2412 | 122
Spambase | 4601 | 1813 | 2788 | 57
Musk | 6598 | 5581 | 1017 | 166
Epileptic | 11,500 | 2300 | 9200 | 178
HTRU | 17,898 | 1639 | 16,259 | 8
A9A | 32,561 | 7841 | 24,720 | 123
Ijcnn1 | 49,990 | 4853 | 45,137 | 22
CodRna | 59,535 | 19,845 | 39,690 | 8
Skin | 245,057 | 50,859 | 194,198 | 3

2) LS-SVM [43] is a reformulation of the classic SVM whose solution is obtained faster by solving a set of linear equations instead of a QPP; we implement it with the MATLAB "svmtrain" toolbox using the "LS" method. 3) TSVM [16] is another extension of the classic SVM; it solves a pair of smaller-sized problems rather than a single large one, which speeds up learning [19,20,44,45], and is solved here with the MATLAB toolbox "quadprog". 4) SMO-PinSVM [11] is an efficient solver for PinSVM, whose package we also downloaded from the web. For our DCDM-PinSVM, the maximum number of iterations is set to 5000 and the termination criterion is $|g(\theta) - g(\theta^{old})| < 1 \times 10^{-5}$.

The performance comparisons are shown in Table 3, where "Time" denotes the average time (seconds) per parameter combination. We also report the number of wins, draws and losses of our SSR-DCDM-PinSVM against each of the other methods. For some large-scale data sets, some of the compared algorithms run out of memory and fail; the corresponding results are marked "–".


Table 3
Accuracy comparisons of six algorithms on the twenty-three benchmark data sets; each cell shows Accuracy / Time (s). "–" marks abnormal output due to the algorithm running out of memory on large data.

Data set | CSVM | LS-SVM | TSVM | SMO-PinSVM | DCDM-PinSVM | SSR-DCDM-PinSVM
Hepatitis | 93.33 / 0.0149 | 93.33 / 0.0038 | 93.33 / 0.0197 | 93.33 / 0.1413 | 93.33 / 0.0370 | 93.33 / 0.0086
Fertility | 90.00 / 0.0166 | 70.00 / 0.0038 | 90.00 / 0.0398 | 90.00 / 0.0158 | 90.00 / 0.0080 | 90.00 / 0.0001
PlanningRelax | 72.22 / 0.5671 | 58.33 / 0.0048 | 72.22 / 0.0282 | 72.22 / 0.2803 | 72.22 / 0.0608 | 72.22 / 0.0461
Sonar | 76.19 / 0.2398 | 78.57 / 0.0052 | 83.33 / 0.0417 | 80.95 / 0.5084 | 80.95 / 0.2347 | 80.95 / 0.2082
SpectHeart | 83.33 / 0.3346 | 70.37 / 0.0061 | 75.93 / 0.1861 | 83.33 / 0.3340 | 83.33 / 0.0732 | 83.33 / 0.0287
Haberman | 75.41 / 0.2983 | 78.69 / 0.0073 | 80.33 / 0.0477 | 73.77 / 0.1215 | 75.41 / 0.0431 | 75.41 / 0.0064
LiverDisorder | 66.67 / 0.1850 | 57.97 / 0.0071 | 68.12 / 0.0591 | 68.12 / 0.2905 | 66.67 / 0.0891 | 66.67 / 0.0384
Ionosphere | 90.00 / 0.3116 | 87.14 / 0.0089 | 84.29 / 0.0815 | 92.86 / 0.4839 | 92.86 / 0.2931 | 92.86 / 0.2402
Monks | 71.26 / 0.0686 | 71.26 / 0.0086 | 71.26 / 0.0644 | 71.26 / 0.0745 | 71.26 / 0.0168 | 71.26 / 0.0001
BreastCancer569 | 98.23 / 0.1440 | 98.23 / 0.0154 | 98.23 / 0.1909 | 98.23 / 0.6762 | 98.23 / 0.0168 | 98.23 / 0.0001
BreastCancer683 | 96.32 / 0.0846 | 95.59 / 0.0202 | 95.59 / 0.2793 | 96.32 / 0.1557 | 96.32 / 0.2820 | 96.32 / 0.0676
Australian | 84.89 / 0.6872 | 86.33 / 0.0235 | 87.77 / 0.2411 | 84.89 / 0.3930 | 84.89 / 0.2472 | 84.89 / 0.0646
Pima | 75.16 / 0.4849 | 73.20 / 0.0230 | 75.16 / 0.2442 | 76.47 / 0.6159 | 76.47 / 0.2151 | 76.47 / 0.0487
Banknote | 98.18 / 0.1405 | 97.82 / 0.0746 | 98.18 / 1.5954 | 98.55 / 0.1806 | 98.91 / 0.1895 | 98.91 / 0.0902
Adult | 83.99 / 3.2503 | 81.16 / 0.4950 | 84.46 / 18.3411 | 83.83 / 3.3642 | 84.14 / 7.9712 | 84.14 / 1.8785
Spambase | 89.89 / 1.1515 | 92.50 / 1.0925 | 90.43 / 54.1972 | 91.96 / 0.4042 | 91.52 / 6.8555 | 91.52 / 3.3458
Musk | 93.33 / 6.9853 | 91.96 / 2.5200 | 94.69 / 283.3841 | 93.33 / 11.3076 | 93.33 / 47.3095 | 93.33 / 43.2768
Epileptic | 83.42 / 25.4581 | 68.00 / 9.9180 | 85.57 / 276.6687 | 83.42 / 22.3723 | 84.48 / 53.4405 | 84.48 / 41.9558
HTRU | 97.65 / 8.3249 | 97.49 / 34.2666 | – | 97.77 / 16.3021 | 97.79 / 10.7536 | 97.79 / 7.3004
A9A | 82.58 / 695.5475 | – | – | 82.58 / 713.6985 | 83.78 / 153.1218 | 83.78 / 10.1593
Ijcnn1 | 92.49 / 307.0085 | – | – | 92.49 / 338.3732 | 93.79 / 12.7755 | 93.79 / 0.1221
CodRna | 93.81 / 185.9172 | – | – | 93.81 / 215.6295 | 93.88 / 44.2683 | 93.88 / 8.6460
Skin | – | – | – | – | 94.34 / 330.9325 | 94.34 / 66.0357
Win/Draw/Loss (Accuracy) | 12/11/0 | 17/3/3 | 11/5/7 | 9/12/2 | 0/23/0 |
Win/Draw/Loss (Time) | 20/3/0 | 9/0/14 | 20/0/3 | 20/0/3 | 23/0/0 |

From Table 3, our SSR-DCDM-PinSVM achieves outstanding prediction performance and excellent computational speed.

Notice that the accuracies of SMO-PinSVM, DCDM-PinSVM and our SSR-DCDM-PinSVM are very close, because these algorithms are based on the same PinSVM model. In terms of computational time, our SSR-DCDM-PinSVM performs best except against LS-SVM, which only solves linear equations rather than a QPP; and although LS-SVM appears to run faster on some data sets, its prediction accuracy is much lower than that of our approach in most cases. Compared with DCDM-PinSVM, SSR-DCDM-PinSVM achieves identical accuracies with an overwhelming time advantage, which demonstrates the safety and effectiveness of the proposed SSR; the advantage becomes even more obvious once the data size exceeds tens of thousands, i.e., the screening significantly reduces the computational cost especially on large-scale data. LS-SVM loses its efficiency on larger data such as HTRU (17,898 × 8): when the scale grows beyond tens of thousands, LS-SVM and TSVM run inefficiently owing to large matrix operations, whereas DCDM-PinSVM and SSR-DCDM-PinSVM overcome this difficulty; for instance, they work well on Skin (245,057 × 3), where the other algorithms run out of memory. Finally, although TSVM sometimes attains high prediction accuracy, it is more sensitive to noise than pinball SVMs [14]; the robustness of our method is further verified in Section 5.3.

To show the effectiveness of the SSR in detail, comparisons between DCDM-PinSVM and SSR-DCDM-PinSVM are given in Table 4, where "Solver" is the average solver time (seconds) per parameter combination, "Total" is the total time over all parameter combinations, and "Screening" is the average time (seconds) of the screening process in SSR-DCDM-PinSVM. From Table 4, SSR-DCDM-PinSVM greatly reduces the solver time, the cost of screening is negligible, and a large proportion of inactive instances is identified, which again demonstrates its effectiveness.

To further illustrate how instances are removed by the screening rule, we randomly select four data sets and observe how their instance counts change with the parameters. In Fig. 4, the vertical axis is the percentage (%) of instances remaining after screening, and the horizontal axis is the screening step k in Algorithm 2. From Fig. 4, a large proportion of the inactive instances is indeed deleted by the screening rule, while all of the active instances are retained at every step; that is, the screening rule never removes active instances, which demonstrates the safety of SSR-PinSVM. In general, the screening rule performs excellently in most cases.

Some of the benchmark data sets in our experiments suffer from class imbalance. To illustrate the effectiveness of the screening rule in this situation, Table 5 gives the confusion matrices (corresponding to the optimal accuracy) of DCDM-PinSVM and SSR-DCDM-PinSVM on the data sets whose imbalance ratio (IR) is larger than 2.00. The results of the two algorithms are exactly the same, which demonstrates the safety of the screening rule on imbalanced data. For a few small data sets the predictions are biased toward the majority class, because the model in this paper is originally intended for balanced problems; models for imbalanced problems have been well studied in the literature [5,15], and applying the screening technique to the imbalanced setting is worth further study.

5.3. Experiments on a mice protein expression data set

In this part, we verify the feasibility of the proposed approach on a mice protein expression data set, where the task is to predict whether a mouse has Down syndrome. Down syndrome is a chromosomal abnormality (trisomy of human chromosome 21) associated with intellectual disability, affecting approximately one in 1000 live births worldwide. The over-expression of genes in Down syndrome is believed to be sufficient to perturb normal pathways and normal responses to stimulation, causing learning and memory deficits.


Table 4
Detailed comparisons between DCDM-PinSVM and SSR-DCDM-PinSVM on the twenty-three benchmark data sets.

Data set | DCDM Solver (s) | DCDM Total (s) | SSR Screening (s) | SSR Solver (s) | SSR Total (s) | Percentage (%) | Speedup
Hepatitis | 0.0370 | 33.64 | 0.0001 | 0.0085 | 7.82 | 45.37 | 4.30
Fertility | 0.0080 | 7.26 | 0.0001 | 0.0001 | 0.70 | 26.65 | 10.42
PlanningRelax | 0.0608 | 55.28 | 0.0001 | 0.0460 | 41.89 | 53.37 | 1.32
Sonar | 0.2347 | 213.30 | 0.0001 | 0.2081 | 189.23 | 54.77 | 1.13
SpectHeart | 0.0732 | 532.76 | 0.0001 | 0.0286 | 181.68 | 40.88 | 2.93
Haberman | 0.0431 | 39.17 | 0.0001 | 0.0064 | 5.83 | 46.77 | 6.72
LiverDisorder | 0.0891 | 81.02 | 0.0001 | 0.0384 | 34.95 | 92.46 | 2.32
Ionosphere | 0.2931 | 266.43 | 0.0001 | 0.2401 | 218.35 | 64.54 | 1.22
Monks | 0.0168 | 15.28 | 0.0001 | 0.0001 | 0.55 | 81.18 | 27.78
BreastCancer569 | 0.3709 | 337.18 | 0.0001 | 0.3023 | 274.88 | 54.48 | 1.23
BreastCancer683 | 0.2820 | 256.31 | 0.0001 | 0.0675 | 61.43 | 82.29 | 4.17
Australian | 0.2472 | 224.70 | 0.0001 | 0.0645 | 58.69 | 24.98 | 3.83
Pima | 0.2151 | 195.55 | 0.0001 | 0.0487 | 44.30 | 92.63 | 4.41
Banknote | 4.8462 | 172.25 | 0.0001 | 1.9324 | 82.02 | 74.19 | 2.10
Adult | 7.9712 | 7245.80 | 0.0035 | 1.8750 | 1707.56 | 36.21 | 4.24
Spambase | 6.8555 | 6231.62 | 0.0015 | 3.3443 | 3041.35 | 58.49 | 2.05
Musk | 47.3095 | 4.30 × 10^4 | 0.0036 | 43.2732 | 3.93 × 10^4 | 10.63 | 1.09
Epileptic | 53.4405 | 4.86 × 10^4 | 0.0062 | 41.9496 | 3.81 × 10^4 | 30.80 | 1.27
HTRU | 10.7536 | 9775.06 | 0.0005 | 7.2999 | 6636.08 | 44.83 | 1.47
A9A | 153.1218 | 1.39 × 10^5 | 0.0295 | 10.1298 | 9234.84 | 31.71 | 15.07
Ijcnn1 | 12.7755 | 1.16 × 10^4 | 0.0101 | 0.1120 | 110.95 | 16.73 | 104.67
CodRna | 44.2683 | 4.02 × 10^4 | 0.0050 | 8.6410 | 7859.23 | 75.51 | 5.12
Skin | 330.9325 | 3.00 × 10^5 | 0.0055 | 66.0302 | 6.00 × 10^4 | 73.70 | 5.01


[Fig. 4. The changing curves of remaining instances under the screening rule with different parameters (panels: Pima, Musk, CodRna, Skin).]

Table 5
The confusion matrices of the two algorithms on the imbalanced data sets (rows: predicted class; columns: actual class). The results of DCDM-PinSVM and SSR-DCDM-PinSVM are identical.

Data set (IR) | Predicted Negative / Positive given Actual Negative | Predicted Negative / Positive given Actual Positive
Hepatitis (IR=5.15) | 13 / 0 | 1 / 1
Fertility (IR=7.33) | 0 / 2 | 0 / 18
PlanningRelax (IR=2.50) | 0 / 10 | 0 / 26
SpectHeart (IR=3.85) | 5 / 6 | 3 / 40
Haberman (IR=2.78) | 1 / 15 | 0 / 45
Adult (IR=3.12) | 457 / 26 | 75 / 79
Musk (IR=5.49) | 122 / 8 | 7 / 1109
Epileptic (IR=4.00) | 1840 / 0 | 357 / 103
HTRU (IR=9.92) | 3243 / 9 | 106 / 222
A9A (IR=3.15) | 4632 / 824 | 744 / 312
Ijcnn1 (IR=9.30) | 7466 / 472 | 267 / 3702
CodRna (IR=2.00) | 7484 / 454 | 275 / 3694
Skin (IR=3.82) | 36,065 / 2774 | 11 / 10,161


Table 6
Performance comparisons of six algorithms on the mice protein expression data set with noise. Speedup is the ratio of each method's solver time to that of SSR-DCDM-PinSVM.

Noise r | Criterion | CSVM | LS-SVM | TSVM | SMO-PinSVM | DCDM-PinSVM | SSR-DCDM-PinSVM
r=0.00 | Accuracy | 80.70 | 78.95 | 81.58 | 81.58 | 81.58 | 81.58
r=0.00 | Precision | 80.77 | 76.79 | 77.05 | 81.13 | 81.13 | 81.13
r=0.00 | Sensitivity | 77.78 | 79.63 | 87.04 | 79.63 | 79.63 | 79.63
r=0.00 | Specificity | 83.33 | 78.33 | 76.67 | 83.33 | 83.33 | 83.33
r=0.00 | Solver time | 0.4479 | 0.0206 | 0.1585 | 0.3557 | 0.1607 | 0.0320
r=0.00 | Speedup | 13.99 | 0.64 | 4.95 | 11.12 | 5.02 | –
r=0.10 | Accuracy | 78.95 | 78.07 | 78.95 | 78.95 | 78.95 | 78.95
r=0.10 | Precision | 77.78 | 74.58 | 75.86 | 77.78 | 76.79 | 76.79
r=0.10 | Sensitivity | 77.78 | 81.48 | 81.48 | 77.78 | 79.63 | 79.63
r=0.10 | Specificity | 80.00 | 75.00 | 76.67 | 80.00 | 78.33 | 78.33
r=0.10 | Solver time | 0.4617 | 0.0154 | 0.1431 | 0.5799 | 0.1636 | 0.0261
r=0.10 | Speedup | 17.69 | 0.59 | 5.48 | 22.22 | 6.27 | –
r=0.15 | Accuracy | 75.44 | 71.05 | 74.56 | 76.32 | 74.56 | 74.56
r=0.15 | Precision | 74.07 | 67.80 | 77.78 | 76.47 | 72.73 | 72.73
r=0.15 | Sensitivity | 74.07 | 74.07 | 64.81 | 72.22 | 74.07 | 74.07
r=0.15 | Specificity | 76.67 | 68.33 | 83.33 | 80.00 | 75.00 | 75.00
r=0.15 | Solver time | 0.4495 | 0.0156 | 0.1525 | 0.5477 | 0.1601 | 0.0268
r=0.15 | Speedup | 16.77 | 0.58 | 5.69 | 20.44 | 5.97 | –
r=0.20 | Accuracy | 75.44 | 73.68 | 75.44 | 76.32 | 77.19 | 77.19
r=0.20 | Precision | 77.08 | 75.00 | 76.00 | 78.72 | 80.43 | 80.43
r=0.20 | Sensitivity | 68.52 | 66.67 | 70.37 | 68.52 | 68.52 | 68.52
r=0.20 | Specificity | 81.67 | 80.00 | 80.00 | 83.33 | 85.00 | 85.00
r=0.20 | Solver time | 0.5764 | 0.0158 | 0.1606 | 0.5510 | 0.1801 | 0.0261
r=0.20 | Speedup | 22.08 | 0.61 | 6.15 | 21.11 | 6.90 | –
r=0.25 | Accuracy | 71.93 | 67.54 | 71.05 | 73.68 | 74.56 | 74.56
r=0.25 | Precision | 73.91 | 64.91 | 65.22 | 75.00 | 73.58 | 73.58
r=0.25 | Sensitivity | 62.96 | 68.52 | 83.33 | 66.67 | 72.22 | 72.22
r=0.25 | Specificity | 80.00 | 66.67 | 60.00 | 80.00 | 76.67 | 76.67
r=0.25 | Solver time | 0.5205 | 0.0165 | 0.1383 | 0.5605 | 0.1726 | 0.0275
r=0.25 | Speedup | 18.93 | 0.60 | 5.03 | 20.38 | 6.28 | –
r=0.30 | Accuracy | 74.56 | 75.44 | 75.44 | 76.32 | 77.19 | 77.19
r=0.30 | Precision | 74.51 | 74.07 | 73.21 | 78.72 | 78.00 | 78.00
r=0.30 | Sensitivity | 70.37 | 74.07 | 75.93 | 68.52 | 72.22 | 72.22
r=0.30 | Specificity | 78.33 | 76.67 | 75.00 | 83.33 | 81.67 | 81.67
r=0.30 | Solver time | 0.5714 | 0.0176 | 0.1395 | 0.5755 | 0.1729 | 0.0235
r=0.30 | Speedup | 24.31 | 0.75 | 5.94 | 24.49 | 7.36 | –

Recently, some researchers have successfully employed SVMs to improve prediction on protein data sets [46,47]. The data set (downloaded from http://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) consists of the expression levels of 77 protein modifications. There are 38 control mice and 34 trisomic mice (Down syndrome), 72 mice in total. Fifteen measurements of each protein were registered per mouse, so there are 38 × 15 measurements for the control mice and 34 × 15 for the trisomic mice, i.e., 1080 measurements per protein in total; each measurement can be considered an independent instance. Before training, the data were pre-processed [48] and normalized. To verify the robustness of the proposed method, the features were corrupted by zero-mean Gaussian noise: for each feature, the noise variance r was set to 0.00 (noise-free), 0.10, 0.15, 0.20, 0.25 and 0.30, with the training and testing data perturbed by the same noise [14].

In the numerical experiments, our SSR-DCDM-PinSVM is compared with CSVM, LS-SVM, TSVM, SMO-PinSVM and DCDM-PinSVM. Their results corresponding to the optimal accuracy are shown in Table 6, where "Precision", "Sensitivity" and "Specificity" are the usual evaluation criteria for classification. In terms of prediction accuracy, our DCDM-PinSVM and SSR-DCDM-PinSVM show a clear advantage, especially as the noise variance r increases, which demonstrates the noise insensitivity of the proposed method. In addition, DCDM-PinSVM

and SSR-DCDM-PinSVM always produce identical predictions, which validates the safety of the proposed SSR. As the noise variance r grows, the pinball-loss methods (SMO-PinSVM, DCDM-PinSVM and SSR-DCDM-PinSVM) generally achieve better accuracy than the other algorithms. In terms of computational time, SSR-DCDM-PinSVM again performs best except against LS-SVM, which solves linear equations instead of a QPP. In general, the proposed method inherits the strong robustness of the pinball SVM, and SSR-DCDM-PinSVM greatly accelerates PinSVM without sacrificing the optimal results.

6. Conclusion

In this paper we propose the SSR-PinSVM to reduce computational cost and enhance efficiency. The main idea is based on variational inequalities: the approach identifies most of the inactive instances and removes them before the optimization problem is solved, so that only smaller problems need to be solved as the parameters c and τ change, while the solution is guaranteed to be safe. In contrast to the existing DVI-SVM, multiple parameters τ and c are considered. Moreover, our approach is independent of the solver, so other efficient solvers can be combined with it; we further provide DCDM-PinSVM as an efficient solver. In the numerical experiments, we compare SSR-PinSVM with other algorithms on six artificial data sets, twenty-three benchmark data sets, and a real biological data set with different noise levels to demonstrate its validity and robustness. Applying the safe screening rule to other sparse models is our future work.


Acknowledgments

This completes the proof.

The authors would like to thank the reviewers for the helpful comments and suggestions, which have improved the presentation. This work was supported in part by Beijing Natural Science Foundation (No. 4172035) and National Natural Science Foundation of China (No. 11671010). Appendix A. The KKT conditions for Eq. (2) By introducing Lagrangian function for (2), we have l

l

i=1

i=1

l 

min g(θt,i + dei )

∂L = w ∗ − X T Y α∗ + X T Y β ∗ = 0 , ∂w 1 ∂L = c − αi∗ − βi∗ = 0, τ ∂ξ αi∗ (yi xi w∗ − 1 + ξi∗ ) = 0,

where

α∗ ,

β∗ ,

w∗

w∗ = X T Y (α∗ − β∗ ).

and

(A.1)

Appendix B. Proof of Theorem 2 Proof. Since θ0∗ ∈ [−τ0 , 1]l ⊆ [−τ1 , 1]l , we have θ0∗ ∈ [−τ1 , 1]l , i.e. θ0∗ is a feasible solution for problem (4) with (c1 , τ 1 ). τ τ Meanwhile, since θ1∗ ∈ [−τ1 , 1]l , we have −τ0 ≤ τ0 θ1∗ ≤ τ0 < 1. τ

τ

1

1

Thus, we have τ0 θ1∗ ∈ [−τ0 , 1]l , i.e. τ0 θ1∗ is a feasible solution for 1 1 problem (4) with (c0 , τ 0 ). Let g(θ ) be the objective function of Eq. (4). Then we have the following inequations from Lemma 1:

τ ∇ gc0 ,τ0 (θ ), 0 θ1∗ − θ0∗  ≥ 0, τ1 ∇ gc1 ,τ1 (θ1∗ ), θ0∗ − θ1∗  ≥ 0. ∗ 0

(B.1) c0 Z Z T θ0∗

∇ gc1 ,τ1 (θ1∗ ) =

τ  θ + θ θ − c1 + c0 0 θ1∗T Z Z T θ0∗ τ1  τ0  T ∗ − 1− e θ1 ≤ 0 . τ1



(B.2)

Since θ1∗ ∈ [−τ1 , 1]l , we have eT θ1∗ ∈ [−τ1 l , l ]. Thus,

c1 θ1∗T Z Z T θ1∗ + c0 θ0∗T Z Z T θ0∗ − c1 + c0

τ

τ0  . τ1

τ0  ∗T T ∗ θ Z Z θ0 τ1 1

(B.3)

Let r = τ0 , ν = 20 c 1 and ρ = (ν 2 − c0 )θ0∗T Z Z T θ0∗ + (1c−r )l . By trans1 1 1 1 forming the inequality (B.3) with method of completing the square, we can have rc +c

Appendix C. Derivation of the DCDM for PinSVM

For problem (8), the optimization process starts from an initial point $\theta^0 \in \mathbb{R}^l$ and generates a sequence of vectors $\{\theta^t\}_{t=0}^{\infty}$. We refer to the process from $\theta^t$ to $\theta^{t+1}$ as an outer iteration. In each outer iteration we have $l$ inner iterations, so that $\theta_1, \theta_2, \cdots, \theta_l$ are updated sequentially. Each outer iteration thus generates vectors $\theta^{t,i} \in \mathbb{R}^l$, $i = 1, \cdots, l+1$, such that $\theta^{t,1} = \theta^t$, $\theta^{t,l+1} = \theta^{t+1}$, and

$$\theta^{t,i} = [\theta_1^{t+1}, \cdots, \theta_{i-1}^{t+1}, \theta_i^t, \cdots, \theta_l^t]^T, \quad \forall i = 2, \cdots, l.$$

Let $g(\theta)$ be the objective function of (8). For updating $\theta^{t,i}$ to $\theta^{t,i+1}$, we solve the following one-variable subproblem:

$$\min_d\ g(\theta^{t,i} + d e_i) \quad \mathrm{s.t.}\ -\tau \le \theta_i^t + d \le 1, \tag{C.1}$$

where $e_i = [0, \cdots, 0, 1, 0, \cdots, 0]^T$. The objective function of (C.1) is a simple quadratic function of $d$,

$$g(\theta^{t,i} + d e_i) = \frac{1}{2} H_{ii} d^2 + \nabla_i g(\theta^{t,i}) d + \mathrm{constant}, \tag{C.2}$$

where $\nabla_i g$ is the $i$th component of the gradient $\nabla g$ and $H_{ii} = c Z_i Z_i^T$. One can easily see that (C.1) has an optimum at $d = 0$ if and only if

$$\nabla_i^P g(\theta^{t,i}) = 0, \tag{C.3}$$

where $\nabla_i^P g(\theta)$ denotes the projected gradient

$$\nabla_i^P g(\theta) = \begin{cases} \nabla_i g(\theta), & \text{if } -\tau < \theta_i < 1,\\ \min(0, \nabla_i g(\theta)), & \text{if } \theta_i = -\tau,\\ \max(0, \nabla_i g(\theta)), & \text{if } \theta_i = 1. \end{cases}$$

If (C.3) holds, we move to the index $i+1$ without updating $\theta^{t,i}$. Otherwise, we must find the optimal solution of (C.1). If $H_{ii} > 0$, the solution is simply

$$\theta_i^{t,i+1} = \min\Big(\max\Big(\theta_i^{t,i} - \frac{\nabla_i g(\theta^{t,i})}{H_{ii}}, -\tau\Big), 1\Big). \tag{C.4}$$

Let $\bar{w} = c Z^T \theta$. Then we have

$$\nabla_i g(\theta) = \sum_{j=1}^{l} H_{ij} \theta_j - f_i = Z_i \bar{w} - f_i,$$

so each $\nabla_i g(\theta)$ can be evaluated cheaply provided $\bar{w}$ is maintained throughout the coordinate descent via

$$\bar{w} \leftarrow \bar{w} + c(\theta_i - \theta_i^{\mathrm{old}}) Z_i^T,$$

where $\theta_i$ is the current value and $\theta_i^{\mathrm{old}}$ is the value before the update.
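The derivation above maps directly onto a short coordinate descent loop. The following is a minimal Python sketch, ours rather than the authors' released solver, under the assumption (consistent with the gradient expression above) that $g(\theta) = \frac{c}{2}\theta^T Z Z^T \theta - f^T \theta$:

```python
import numpy as np

# Minimal DCDM sketch for min g(theta) = (c/2) theta^T Z Z^T theta - f^T theta
# subject to -tau <= theta_i <= 1. Illustrative code, not the authors' solver.
def dcdm_pinsvm(Z, f, c, tau, max_outer=100, tol=1e-6):
    l = Z.shape[0]
    theta = np.zeros(l)                    # feasible start in [-tau, 1]^l
    w_bar = c * (Z.T @ theta)              # maintain w_bar = c Z^T theta
    H_diag = c * np.sum(Z * Z, axis=1)     # H_ii = c Z_i Z_i^T
    for _ in range(max_outer):
        max_pg = 0.0
        for i in range(l):
            grad = Z[i] @ w_bar - f[i]     # nabla_i g(theta) = Z_i w_bar - f_i
            # Projected gradient for the box constraint, as in (C.3).
            if theta[i] == -tau:
                pg = min(0.0, grad)
            elif theta[i] == 1.0:
                pg = max(0.0, grad)
            else:
                pg = grad
            max_pg = max(max_pg, abs(pg))
            if pg != 0.0 and H_diag[i] > 0:
                old = theta[i]
                # Closed-form clipped update, Eq. (C.4).
                theta[i] = min(max(old - grad / H_diag[i], -tau), 1.0)
                w_bar += c * (theta[i] - old) * Z[i]   # cheap w_bar update
        if max_pg < tol:                   # all projected gradients ~ 0
            break
    return theta
```

Because $\bar{w}$ is updated incrementally, each inner iteration costs $O(n)$ rather than $O(ln)$, which is what makes the method attractive for large data.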


Zhiji Yang was born in China in 1989. She received the B.S. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2012, and the M.S. degree from the College of Science, China Agricultural University, Beijing, China, in 2014, where she is currently pursuing the Ph.D. degree. Her current research interests include support vector machine, operation research, and data mining. Ms. Yang’s research has appeared in the IEEE Transactions on Neural Networks and Learning Systems, Knowledge-Based Systems, Neurocomputing, and Engineering Applications of Artificial Intelligence.

Yitian Xu received the Ph.D. degree from the College of Science, China Agricultural University, Beijing, China, in 2007. He was a Visiting Scholar with the Department of Computer Science and Engineering, Arizona State University, Tempe, AZ, USA, from 2013 to 2014. He is currently a Professor at the College of Science, China Agricultural University. He has authored about 50 papers. His current research interests include machine learning and data mining. Prof. Xu’s research has appeared in the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Cybernetics, Knowledge-Based Systems, Neurocomputing, Neural Computing with Applications, Cognitive Computation, and Engineering Applications of Artificial Intelligence.
