A clipping dual coordinate descent algorithm for solving support vector machines

Xinjun Peng a,b,*, Dongjing Chen a, Lingyan Kong a

a Department of Mathematics, Shanghai Normal University, Shanghai 200234, PR China
b Scientific Computing Key Laboratory of Shanghai Universities, Shanghai 200234, PR China

* Corresponding author at: Department of Mathematics, Shanghai Normal University, Shanghai 200234, PR China. Tel.: +86 21 64324866. E-mail address: [email protected] (X. Peng).

Knowledge-Based Systems (2014), http://dx.doi.org/10.1016/j.knosys.2014.08.005

Article history: Received 19 February 2014; Received in revised form 5 August 2014; Accepted 5 August 2014; Available online xxxx

Keywords: Support vector machine; Dual coordinate descent algorithm; Single variable problem; Maximal possibility-decrease strategy; Learning speed

Abstract: The dual coordinate descent (DCD) algorithm solves the dual problem of the support vector machine (SVM) by minimizing a series of single-variable sub-problems, visited in a random order in its inner iterations. This update rule is blind: every variable is updated in each outer iteration regardless of how much it can reduce the objective, which slows down convergence. In this paper, we present a clipping dual coordinate descent (clipDCD) algorithm for solving the dual problem of SVM. At each iteration, the clipDCD algorithm solves only one single-variable sub-problem, chosen according to a maximal possibility-decrease strategy on the objective value. The clipDCD algorithm is easy to implement, since its formulation is much simpler than that of the DCD algorithm. Our experimental results indicate that SVM, twin SVM (TWSVM) and its extensions trained with the clipDCD algorithm obtain the same classification accuracies as with the DCD algorithm, but with much faster learning speeds. © 2014 Published by Elsevier B.V.


1. Introduction


In the past decade, support vector machines (SVMs) [1–3] have become useful tools for data classification and have been successfully applied to a variety of real-world problems such as particle identification, text categorization, bioinformatics and financial applications [4–7]. Based on the structural risk minimization principle [2,3], classical SVM finds the maximal margin between two classes of samples by solving a quadratic programming problem (QPP) in the dual space. However, one of the main challenges of classical SVM is the large computational cost of this QPP. The long training time not only makes classical SVM slow to train on a large database, but also prevents it from locating the optimal parameter set on a very fine grid of parameters over a large span.

So far, many fast SVM training algorithms have been presented to reduce the difficulties associated with training. One family of methods builds low-rank approximations of the kernel matrix, using greedy approximation [8], sampling [9] or matrix decompositions [10]. A second family consists of decomposition methods based on the Karush–Kuhn–Tucker (KKT) conditions, including the Chunking algorithm [11], the decomposition method [12], sequential minimal optimization (SMO) [13], and SVMlight [14]. A third family directly solves the dual QPP of SVM, including the geometric algorithms [15–18], the successive overrelaxation (SOR) algorithm [19], and the dual coordinate descent (DCD) algorithm [20]. In this paper, we focus only on the DCD algorithm.

The DCD algorithm directly solves the dual QPP of SVM by optimizing a series of single-variable sub-problems in turn. Specifically, it consists of outer iterations and inner iterations. Each outer iteration performs n inner iterations, where n is the number of points, and in these inner iterations the DCD algorithm updates all variables one after another. This algorithm can effectively deal with large-scale linear SVM with a small cache. To improve the learning speed, it solves the sub-problems in a random order; the experimental results in [20] show that the DCD algorithm with random permutation converges faster. However, we should point out that, in the learning process of the DCD algorithm, the variables are simply updated one by one; random permutation only shuffles their order. This simple update strategy is often blind, since an update may yield only a small, or even zero, decrease of the objective value, which often makes the DCD algorithm converge slowly.

To overcome this deficiency of the DCD algorithm, in this paper we present an improved DCD algorithm, called the clipping DCD (clipDCD) algorithm, for solving the dual QPP of SVM.


Table 1. Computational cost of the DCD and clipDCD algorithms within one (outer) iteration.

DCD:
  Step 1:   –                                          Cost: –
  Step 2.1: Compute all $\nabla_i f(\alpha^{(k),i})$   Cost: $O(\bar{n}_i)$ for each $i$
  Step 2.2: Update $\alpha_i^{(k),i+1}$                Cost: –

clipDCD:
  Step 1:   Compute all $Q_{ii}$                       Cost: $O(n)$
  Step 2.1: Choose $L$                                 Cost: $O(|A|)$
  Step 2.2: Update $e - Q\alpha$                       Cost: $O(n)$

Table 2. Number of dimensions, training and test samples, together with the accuracies (in %) reported in [29] for SVM with Gaussian kernel (except for the Checkerboard dataset).

Set | Dim | n_tr | n_te | SVM acc.
B   |  2  |  400 | 4900 | 88.47 ± 0.66
BC  |  9  |  200 |   77 | 73.96 ± 4.74
D   |  8  |  468 |  300 | 76.47 ± 1.73
F   |  9  |  666 |  400 | 67.57 ± 1.72
G   | 20  |  700 |  300 | 76.39 ± 2.07
H   | 13  |  170 |  100 | 84.05 ± 3.26
I   | 18  | 1300 | 1010 | 97.04 ± 0.60
R   | 20  |  400 | 7000 | 98.34 ± 1.12
S   | 60  | 1000 | 2175 | 89.12 ± 0.66
Th  |  5  |  140 |   75 | 95.20 ± 2.19
T   |  3  |  150 | 2051 | 77.58 ± 1.02
Tw  | 20  |  400 | 7000 | 97.04 ± 0.23
W   | 21  |  400 | 4600 | 90.12 ± 0.43
C   |  2  |  800 | 8000 | –

In the learning process of the clipDCD algorithm, one variable is updated per iteration via a single-variable sub-problem; this variable is selected according to a maximal possibility-decrease strategy, i.e., we select the variable that would yield the largest possible decrease of the objective value if it were updated. This strategy not only avoids the possibly blind updates performed in the iterations of the DCD algorithm, but also leads to a much simpler formulation than the DCD algorithm. Further, we discuss some implementation issues of the clipDCD algorithm. To validate the performance of this algorithm, we train several SVM classifiers, including the classical SVM, the twin support vector machine (TWSVM) [21], the projection twin support vector machine (PTSVM) [22], and the twin parametric-margin support vector machine (TPMSVM) [23], using the clipDCD algorithm. Experimental results on benchmark datasets show that the proposed clipDCD algorithm obtains the same classification accuracy as the DCD algorithm with much faster numerical convergence.

The rest of this paper is organized as follows: Section 2 briefly introduces classical SVM, the DCD algorithm, and some other classifiers that can be solved by the DCD algorithm, including the TWSVM, PTSVM, and TPMSVM classifiers. Section 3 presents the proposed clipDCD algorithm and discusses some implementation issues. Experimental results on benchmark datasets are given in Section 4. Conclusions and remarks are drawn in Section 5.


2. Background


In this section, we briefly introduce classical SVM [1–3], the DCD algorithm [20], and some other classifiers that can be solved by the DCD algorithm, including the TWSVM [21], PTSVM [22] and TPMSVM [23] classifiers.


Table 3. Averages, standard deviations, and p-values of the test accuracies (in %) derived by the DCD and clipDCD algorithms on benchmark datasets. p-values are shown in parentheses after the DCD results.

Set | TWSVM DCD (p-value) | TWSVM clipDCD | PTSVM DCD (p-value) | PTSVM clipDCD | TPMSVM DCD (p-value) | TPMSVM clipDCD | SVM DCD (p-value) | SVM clipDCD
B   | 88.20 ± 1.04 (0.9737) | 88.52 ± 0.92 | 87.60 ± 1.00 (1.0000) | 87.60 ± 1.01 | 88.89 ± 0.73 (1.0000) | 88.87 ± 0.71 | 88.41 ± 0.72 (0.9999) | 88.45 ± 0.73
BC  | 72.83 ± 4.25 (0.9602) | 72.71 ± 4.26 | 71.47 ± 5.26 (1.0000) | 71.49 ± 5.25 | 74.45 ± 4.98 (0.9987) | 74.51 ± 4.85 | 72.27 ± 4.50 (0.9998) | 72.30 ± 4.50
D   | 75.39 ± 2.38 (0.9021) | 75.78 ± 2.60 | 74.65 ± 2.33 (1.0000) | 74.65 ± 2.33 | 76.81 ± 1.63 (0.9999) | 76.79 ± 1.64 | 75.49 ± 1.92 (0.9800) | 75.42 ± 1.80
F   | 65.33 ± 1.67 (0.9928) | 65.28 ± 1.96 | 62.20 ± 0.55 (0.8245) | 62.88 ± 5.27 | 67.59 ± 2.63 (0.9024) | 67.78 ± 2.30 | 67.48 ± 1.87 (1.0000) | 67.48 ± 1.87
G   | 76.04 ± 2.07 (0.9563) | 75.98 ± 1.93 | 75.00 ± 2.27 (1.0000) | 75.00 ± 2.27 | 76.76 ± 2.64 (0.9989) | 76.78 ± 2.09 | 76.39 ± 2.04 (1.0000) | 76.38 ± 2.05
H   | 83.57 ± 3.30 (0.9432) | 83.40 ± 3.31 | 80.77 ± 3.53 (1.0000) | 80.77 ± 3.53 | 84.58 ± 3.25 (0.9810) | 84.62 ± 3.29 | 83.78 ± 3.57 (1.0000) | 83.78 ± 3.57
I   | 95.32 ± 0.86 (0.9984) | 95.29 ± 0.77 | 95.76 ± 0.60 (1.0000) | 95.76 ± 0.60 | 97.53 ± 0.63 (0.8308) | 97.69 ± 0.68 | 97.03 ± 0.67 (0.9985) | 97.06 ± 0.64
R   | 98.44 ± 0.10 (0.9990) | 98.42 ± 0.16 | 98.45 ± 0.01 (1.0000) | 98.45 ± 0.01 | 98.44 ± 0.14 (1.0000) | 98.44 ± 0.14 | 98.26 ± 1.04 (1.0000) | 98.26 ± 1.04
S   | 87.88 ± 0.67 (0.9016) | 87.80 ± 0.66 | 88.60 ± 0.73 (1.0000) | 88.60 ± 0.72 | 89.74 ± 0.80 (0.8915) | 89.85 ± 0.82 | 89.15 ± 0.79 (1.0000) | 89.15 ± 0.79
Th  | 95.48 ± 2.15 (0.9610) | 95.52 ± 2.10 | 94.74 ± 2.61 (1.0000) | 94.74 ± 2.61 | 95.67 ± 1.86 (0.9436) | 95.62 ± 1.88 | 95.29 ± 2.40 (1.0000) | 95.29 ± 2.40
T   | 77.18 ± 0.54 (0.9978) | 77.16 ± 0.45 | 75.57 ± 2.84 (1.0000) | 75.57 ± 2.84 | 77.18 ± 0.54 (1.0000) | 77.18 ± 0.54 | 77.68 ± 0.51 (1.0000) | 77.68 ± 0.47
Tw  | 97.46 ± 0.15 (0.9604) | 97.42 ± 0.18 | 97.01 ± 0.22 (1.0000) | 97.01 ± 0.22 | 97.58 ± 0.10 (1.0000) | 97.58 ± 0.10 | 97.10 ± 0.22 (1.0000) | 97.10 ± 0.21
W   | 89.87 ± 0.55 (0.8116) | 89.42 ± 0.97 | 89.67 ± 0.54 (1.0000) | 89.67 ± 0.54 | 89.86 ± 0.30 (0.9970) | 89.88 ± 0.34 | 88.94 ± 0.61 (1.0000) | 88.94 ± 0.61
C   | 96.88 ± 1.91 (1.0000) | 96.88 ± 1.91 | 98.53 ± 0.44 (1.0000) | 98.53 ± 0.44 | 97.50 ± 0.41 (1.0000) | 97.50 ± 0.41 | 95.13 ± 1.21 (1.0000) | 95.13 ± 1.21


Table 4. Averages and standard deviations of the number of iterations (in thousands) derived by the DCD and clipDCD algorithms on benchmark datasets.

Set | TWSVM(a) DCD | TWSVM(a) clipDCD | PTSVM(a) DCD | PTSVM(a) clipDCD | TPMSVM(a) DCD | TPMSVM(a) clipDCD | SVM DCD | SVM clipDCD
B   | 33.53 ± 43.57 | 1.42 ± 0.80 | 1.62 ± 1.28 | 0.94 ± 0.44 | 22.41 ± 40.80 | 0.57 ± 0.16 | 153.64 ± 36.34 | 4.25 ± 1.12
BC  | 29.56 ± 20.81 | 1.38 ± 0.27 | 21.60 ± 23.91 | 0.83 ± 0.10 | 25.94 ± 8.10 | 0.88 ± 0.07 | 57.03 ± 17.72 | 2.75 ± 0.26
D   | 46.79 ± 13.46 | 2.98 ± 0.35 | 45.97 ± 15.51 | 2.29 ± 0.28 | 33.42 ± 7.98 | 1.31 ± 0.06 | 51.53 ± 16.31 | 17.13 ± 2.71
F   | 1.82 ± 0.86 | 0.61 ± 0.02 | 16.05 ± 3.31 | 0.76 ± 0.06 | 18.10 ± 5.11 | 0.69 ± 0.06 | 228.73 ± 83.42 | 13.39 ± 1.71
G   | 10.57 ± 4.70 | 0.91 ± 0.07 | 85.38 ± 66.84 | 2.85 ± 0.20 | 115.21 ± 79.35 | 1.54 ± 0.10 | 263.38 ± 96.68 | 7.53 ± 0.46
H   | 0.47 ± 0.23 | 0.14 ± 0.02 | 3.78 ± 1.98 | 0.56 ± 0.17 | 3.42 ± 1.46 | 0.29 ± 0.06 | 64.45 ± 74.64 | 3.14 ± 0.53
I   | 25.22 ± 12.04 | 0.99 ± 0.13 | 135.74 ± 54.96 | 2.20 ± 0.20 | 135.18 ± 52.05 | 1.49 ± 0.05 | 431.53 ± 171.32 | 9.58 ± 0.60
R   | 1.40 ± 0.22 | 0.23 ± 0.02 | 1.59 ± 0.15 | 0.28 ± 0.02 | 4.61 ± 0.32 | 0.48 ± 0.03 | 21.45 ± 5.44 | 1.16 ± 0.04
S   | 10.74 ± 11.49 | 0.95 ± 0.08 | 57.33 ± 79.54 | 1.93 ± 0.18 | 98.38 ± 52.53 | 1.25 ± 0.04 | 374.43 ± 65.38 | 6.58 ± 0.22
Th  | 1.07 ± 0.54 | 0.12 ± 0.04 | 24.96 ± 47.56 | 0.22 ± 0.05 | 9.38 ± 3.81 | 0.32 ± 0.03 | 19.93 ± 7.62 | 0.80 ± 0.10
T   | 0.19 ± 0.05 | 0.15 ± 0.02 | 34.23 ± 100.08 | 0.47 ± 1.64 | 31.48 ± 33.18 | 0.31 ± 0.09 | 525.21 ± 146.42 | 10.43 ± 7.57
Tw  | 1.23 ± 0.20 | 0.20 ± 0.02 | 1.58 ± 0.40 | 0.28 ± 0.06 | 3.47 ± 0.44 | 0.42 ± 0.02 | 13.46 ± 2.09 | 1.48 ± 0.05
W   | 0.66 ± 0.15 | 0.20 ± 0.02 | 4.24 ± 1.11 | 0.67 ± 0.13 | 8.50 ± 1.36 | 0.31 ± 0.03 | 36.21 ± 7.26 | 2.78 ± 0.33
C   | 73.06 ± 25.26 | 2.34 ± 0.21 | 71.68 ± 24.36 | 2.40 ± 0.29 | 73.75 ± 20.02 | 0.76 ± 0.14 | 933.42 ± 148.33 | 38.28 ± 5.21

(a) For the TWSVM, PTSVM, and TPMSVM classifiers, the iteration numbers in this table are the sums of the iteration numbers for optimizing their two dual problems with the DCD and clipDCD algorithms.


2.1. Support vector machine


As a state-of-the-art machine learning algorithm, classical SVM [1–3] is based on the guaranteed risk bounds of statistical learning theory, known as the structural risk minimization principle. Given a set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, n$, $x_i \in \mathbb{R}^m$, $y_i \in \{-1, +1\}$, linear SVM finds the best separating (maximum-margin) hyperplane $H(w, b): w^T x + b = 0$, $w \in \mathbb{R}^m$, between the two classes of samples by solving the following QPP:

$$\min_{w,b} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} l(w, b; x_i, y_i), \qquad (1)$$

where $C > 0$ is a penalty factor and $l(w, b; x_i, y_i)$ is a loss function. Generally, we employ the hinge loss $l(w, b; x, y) = \max\left(1 - y(w^T x + b),\ 0\right)$. Here, we deal with this SVM by appending an additional dimension to each point:

$$x_i \leftarrow [x_i;\ 1], \qquad w \leftarrow [w;\ b]. \qquad (2)$$

Then we obtain the dual QPP of the primal problem (1), which is:

$$\min_{\alpha} \ f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha, \quad \text{s.t.} \ 0 \le \alpha_i \le C, \ i = 1, \ldots, n, \qquad (3)$$

where $Q$ is the matrix with $Q_{ij} = y_i y_j x_i^T x_j$ and $e = [1, \ldots, 1]^T$ is the $n$-dimensional vector of ones. Usually, many real-world problems are linearly inseparable. In this case, SVM maps the training vectors into a high-dimensional space, i.e., a reproducing kernel Hilbert space, via a nonlinear (implicit) map $\phi(x)$. Due to the high dimensionality of the vector variable $w$, one solves the dual problem (3) by the kernel trick [24], i.e., using a closed form $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$.
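To make the dual formulation concrete, the following Python sketch (our own illustration, not code from the paper) builds the matrix Q of problem (3) for a Gaussian kernel and evaluates the dual objective f(α); the names X, y, gamma and the Gaussian form of the kernel are assumptions for this example.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma):
    """Gram matrix K with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def build_Q(X, y, gamma):
    """Q[i, j] = y_i * y_j * k(x_i, x_j), as used in the dual problem (3)."""
    K = gaussian_kernel_matrix(X, gamma)
    return (y[:, None] * y[None, :]) * K

def dual_objective(alpha, Q):
    """f(alpha) = 0.5 * alpha^T Q alpha - e^T alpha."""
    return 0.5 * alpha @ Q @ alpha - np.sum(alpha)
```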


2.2. Dual coordinate descent algorithm


The DCD algorithm directly optimizes the problem (3). It starts from an initial point $\alpha^{(0)} \in \mathbb{R}^n$ and generates a sequence $\{\alpha^{(k)}, k \ge 0\}$. The DCD algorithm consists of outer iterations and inner iterations. In the outer iteration, it updates $\alpha^{(k)}$ to $\alpha^{(k+1)}$. Each outer iteration contains $n$ inner iterations, so that $\alpha_1, \ldots, \alpha_n$ are updated sequentially. Thus, each outer iteration generates vectors $\alpha^{(k),i} \in \mathbb{R}^n$, $i = 1, \ldots, n+1$, such that $\alpha^{(k),1} = \alpha^{(k)}$, $\alpha^{(k),n+1} = \alpha^{(k+1)}$, and $\alpha^{(k),i} = [\alpha_1^{(k+1)}, \ldots, \alpha_{i-1}^{(k+1)}, \alpha_i^{(k)}, \ldots, \alpha_n^{(k)}]$, $i = 1, \ldots, n$.


Table 5. Averages and standard deviations of the number of kernel evaluations (in millions) derived by the DCD and clipDCD algorithms on benchmark datasets.

Set | TWSVM(b) DCD | TWSVM(b) clipDCD | PTSVM(b) DCD | PTSVM(b) clipDCD | TPMSVM(b) DCD | TPMSVM(b) clipDCD | SVM DCD | SVM clipDCD
B   | 7.15 ± 9.65 | 0.30 ± 0.17 | 3.35 ± 2.73 | 0.19 ± 0.09 | 4.73 ± 9.19 | 0.12 ± 0.03 | 61.44 ± 14.53 | 1.70 ± 0.43
BC  | 3.87 ± 2.97 | 0.17 ± 0.04 | 2.93 ± 3.40 | 0.10 ± 0.01 | 3.51 ± 1.17 | 0.11 ± 0.00(c) | 11.41 ± 3.54 | 0.55 ± 0.05
D   | 12.40 ± 3.95 | 0.76 ± 0.10 | 12.40 ± 4.60 | 0.59 ± 0.07 | 8.97 ± 2.16 | 0.33 ± 0.02 | 24.11 ± 7.63 | 8.02 ± 1.27
F   | 0.65 ± 0.31 | 0.21 ± 0.01 | 5.38 ± 1.16 | 0.26 ± 0.02 | 6.95 ± 1.95 | 0.24 ± 0.02 | 152.35 ± 55.56 | 8.92 ± 1.14
G   | 4.99 ± 2.33 | 0.38 ± 0.03 | 36.23 ± 31.32 | 1.12 ± 0.10 | 55.48 ± 39.22 | 0.68 ± 0.05 | 184.39 ± 67.69 | 5.27 ± 0.32
H   | 0.04 ± 0.02 | 0.01 ± 0.00 | 0.33 ± 0.18 | 0.05 ± 0.02 | 0.31 ± 0.14 | 0.02 ± 0.00 | 10.96 ± 12.69 | 0.53 ± 0.09
I   | 17.69 ± 8.21 | 0.66 ± 0.09 | 89.30 ± 33.16 | 1.45 ± 1.27 | 89.67 ± 34.38 | 0.97 ± 0.03 | 560.81 ± 222.64 | 12.45 ± 0.78
R   | 0.28 ± 0.04 | 0.05 ± 0.00 | 0.32 ± 0.03 | 0.06 ± 0.00 | 0.92 ± 0.06 | 0.10 ± 0.01 | 9.01 ± 2.43 | 0.46 ± 0.02
S   | 5.27 ± 5.63 | 0.48 ± 0.04 | 28.02 ± 38.94 | 0.96 ± 0.09 | 47.93 ± 25.30 | 0.62 ± 0.02 | 374.43 ± 65.38 | 6.58 ± 0.22
Th  | 0.07 ± 0.05 | 0.01 ± 0.00 | 1.11 ± 2.05 | 0.01 ± 0.00 | 0.06 ± 0.02 | 0.01 ± 0.00 | 2.79 ± 1.07 | 0.11 ± 0.01
T   | 0.02 ± 0.00 | 0.01 ± 0.00 | 3.39 ± 10.23 | 0.05 ± 0.17 | 3.15 ± 3.28 | 0.03 ± 0.01 | 79.06 ± 22.04 | 1.57 ± 1.14
Tw  | 0.25 ± 0.04 | 0.04 ± 0.00 | 0.32 ± 0.08 | 0.06 ± 0.01 | 0.70 ± 0.09 | 0.08 ± 0.00 | 5.39 ± 0.84 | 0.59 ± 0.02
W   | 0.13 ± 0.03 | 0.04 ± 0.01 | 0.84 ± 0.23 | 0.13 ± 0.03 | 1.92 ± 0.33 | 0.07 ± 0.01 | 14.49 ± 2.91 | 1.11 ± 0.13
C   | 29.22 ± 9.53 | 0.94 ± 0.09 | 28.67 ± 9.11 | 0.96 ± 0.11 | 29.50 ± 7.82 | 0.30 ± 0.00 | 746.73 ± 112.24 | 30.63 ± 7.49

(b) For the TWSVM, PTSVM, and TPMSVM classifiers, the kernel evaluations in this table are the sums of the kernel evaluations for optimizing their two dual problems with the DCD and clipDCD algorithms.
(c) Here, a 0.00 value only indicates that the standard deviation is small, not that it is zero.

Fig. 1. Relationship between the objective values and iteration numbers of the TWSVM (a), PTSVM (b), TPMSVM (c), and SVM (d) on the Thyroid problem.

For updating $\alpha^{(k),i}$ to $\alpha^{(k),i+1}$, the DCD algorithm solves the following one-variable sub-problem:

$$\min_{\lambda} \ f(\alpha^{(k),i} + \lambda e_i) \quad \text{s.t.} \ 0 \le \alpha_i^{(k)} + \lambda \le C, \qquad (4)$$

where $e_i = [0, \ldots, 1, \ldots, 0]^T$. The objective function of (4) is a simple quadratic function of $\lambda$:

$$f(\alpha^{(k),i} + \lambda e_i) = \frac{1}{2} Q_{ii} \lambda^2 + \nabla_i f(\alpha^{(k),i}) \lambda + \text{constant}, \qquad (5)$$

where $\nabla_i f$ is the $i$th component of the gradient $\nabla f$, defined as $\nabla_i f(\alpha) = (Q\alpha)_i - e_i = \alpha^T Q_{:,i} - e_i = \sum_{j=1}^{n} Q_{ij} \alpha_j - e_i$, and $Q_{:,i}$ is the $i$th column of $Q$. One can easily see that (4) has an optimum at $\lambda = 0$, i.e., there is no need to update $\alpha_i$, if and only if $\nabla_i^p f(\alpha^{(k),i}) = 0$, where $\nabla_i^p f(\alpha)$ is the projected gradient:

$$\nabla_i^p f(\alpha) = \begin{cases} \min(\nabla_i f(\alpha), 0), & \text{if } \alpha_i = 0, \\ \max(\nabla_i f(\alpha), 0), & \text{if } \alpha_i = C, \\ \nabla_i f(\alpha), & \text{otherwise}. \end{cases} \qquad (6)$$

If $\nabla_i^p f(\alpha^{(k),i}) = 0$ holds, we move to the $(i+1)$th index without updating $\alpha_i^{(k),i}$. Otherwise, we must find the solution of (4), which is:

$$\alpha_i^{(k),i+1} = \min\left( \max\left( \alpha_i^{(k),i} - \frac{\nabla_i f(\alpha^{(k),i})}{Q_{ii}},\ 0 \right),\ C \right). \qquad (7)$$

Therefore, the DCD algorithm can be depicted as follows:

Algorithm 1. Dual coordinate descent (DCD) algorithm
1. Set the initial vector $\alpha^{(0)}$ and $k = 0$;
2. While $\max_j (\nabla_j^p f(\alpha)) - \min_j (\nabla_j^p f(\alpha)) \ge \epsilon$:
   2.1 Compute all $\nabla_i f(\alpha^{(k),i}) = \alpha^T Q_{:,i} - e_i$ and $\nabla_i^p f(\alpha^{(k),i})$ by (6);
   2.2 Update $\alpha_i^{(k),i+1}$ according to (7) if $|\nabla_i^p f(\alpha^{(k),i})| \ne 0$;
3. Set $k \leftarrow k + 1$.

Remark 1. The kernel matrix $Q$ may be too large to be stored, so one calculates its $i$th row when computing $\nabla_i f(\alpha^{(k),i})$. If $\bar{n}$ is the number of nonzero components of $\alpha$ per inner iteration, $O(\bar{n})$ operations are needed to calculate the $i$th row of $Q$.

Remark 2. For linear SVM, we have $\nabla_i f(\alpha) = y_i w^T x_i - 1$ because $w = \sum_{i=1}^{n} y_i \alpha_i x_i$ and $Q_{ij} = y_i y_j x_i^T x_j$. The cost is much smaller than $O(\bar{n})$ if $w$ is updated throughout the coordinate descent procedure. Then, we can update $w$ by $w \leftarrow w + (\alpha_i^{new} - \alpha_i) y_i x_i$; the number of operations is only $O(1)$.

Remark 3. We can use the shrinking technique to reduce the size of the optimization problem by not considering some bounded variables. Here, we omit the details of this technique. In addition, the DCD algorithm obtains faster convergence if one randomly permutes the sub-problems in each iteration [20].
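As a concrete reading of Algorithm 1 and the update (7), the sketch below (our own illustration, not the authors' code) performs one outer DCD iteration over a precomputed kernel matrix Q with a random permutation of the indices. The assumption that Q fits in memory is ours; for large problems one would instead compute rows of Q on demand or maintain w as described in Remarks 1 and 2.

```python
import numpy as np

def dcd_outer_iteration(alpha, Q, C, rng):
    """One outer iteration of the DCD algorithm (Algorithm 1): sweep the
    variables once in a random permutation and, for each index i, apply the
    clipped single-variable update (7) if the projected gradient (6) is
    nonzero."""
    n = alpha.shape[0]
    for i in rng.permutation(n):
        grad = alpha @ Q[:, i] - 1.0          # nabla_i f(alpha) = alpha^T Q_{:,i} - e_i
        if alpha[i] == 0.0:
            pg = min(grad, 0.0)               # projected gradient at the lower bound
        elif alpha[i] == C:
            pg = max(grad, 0.0)               # projected gradient at the upper bound
        else:
            pg = grad
        if pg != 0.0:
            alpha[i] = min(max(alpha[i] - grad / Q[i, i], 0.0), C)   # update (7)
    return alpha
```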

Fig. 2. Relationship between the objective values and iteration numbers of the TWSVM (a), PTSVM (b), TPMSVM (c), and SVM (d) on the Twonorm problem.


2.3. Other classifiers


In this section, we briefly list some other classifiers whose dual QPPs (9), (12), and (14) can be solved by the DCD algorithm.


2.3.1. Twin support vector machine
Linear TWSVM [21] performs binary classification using two nonparallel hyperplanes, $H_\pm(w_\pm, b_\pm): w_\pm^T x + b_\pm = 0$, instead of the single hyperplane used in classical SVM. The two hyperplanes of TWSVM are obtained by solving the following two smaller-sized QPPs:

$$\min_{w_\pm, b_\pm} \ \frac{1}{2} \sum_{i: y_i = \pm 1} \left( w_\pm^T x_i + b_\pm \right)^2 + C_\pm \sum_{j: y_j = \mp 1} l(w_\pm, b_\pm; x_j, y_j), \qquad (8)$$

where $C_\pm > 0$ are penalty factors and $l(w, b; x_i, y_i)$ is the hinge loss. The dual QPPs of (8) are:

$$\min_{\alpha_\pm} \ \frac{1}{2} \alpha_\pm^T H_\mp \left( H_\pm^T H_\pm \right)^{-1} H_\mp^T \alpha_\pm - e_\mp^T \alpha_\pm \quad \text{s.t.} \ 0 \le \alpha_\pm \le C_\pm e_\mp, \qquad (9)$$

where $e_\pm$ are vectors of ones of appropriate dimensions, $H_\pm = [A_\pm^T, e_\pm]$, and $A_\pm$ are the matrices consisting of the two classes of data. Then, we obtain the augmented vectors $u_\pm = [w_\pm; b_\pm]$:

$$u_\pm = -\left( H_\pm^T H_\pm \right)^{-1} H_\mp^T \alpha_\pm. \qquad (10)$$

2.3.2. Projection twin support vector machine
PTSVM [22] finds a projection axis for each class, such that the within-class variance of the projected data points of its own class is minimized, while the projected data points of the other class are kept away to a certain extent. To this end, PTSVM solves the following pair of QPPs:

$$\min_{w_\pm} \ \frac{1}{2 n_\pm} \sum_{i: y_i = \pm 1} \left( w_\pm^T x_i - w_\pm^T \mu_\pm \right)^2 + C_\pm \sum_{j: y_j = \mp 1} l(w_\pm; x_j, y_j), \qquad (11)$$

where $C_\pm$ are trade-off constants, $n_\pm$ are the sizes of the two classes, $\mu_\pm$ are the means of the two classes of points, and the loss is $l(w; x, y) = \max\left( 1 - (w^T x - w^T \mu_k),\ 0 \right)$. The dual QPPs for (11) are:

$$\min_{\alpha_\pm} \ \frac{1}{2} \alpha_\pm^T \left( A_\mp^T - e_\mp \mu_\pm^T \right) \Sigma_\pm^{-1} \left( A_\mp - \mu_\pm e_\mp^T \right) \alpha_\pm - e_\mp^T \alpha_\pm \quad \text{s.t.} \ 0 \le \alpha_\pm \le C_\pm e_\mp, \qquad (12)$$

where $\Sigma_\pm$ are the covariance matrices of the two classes.

2.3.3. Twin parametric-margin support vector machine
TPMSVM [23] derives a pair of parametric-margin hyperplanes through two smaller-sized SVM-type QPPs. Formally, it optimizes the following pair of constrained optimization problems(1):

$$\min_{w_\pm, b_\pm} \ \frac{1}{2} \left( w_\pm^T w_\pm + b_\pm^2 \right) - \nu_\pm \sum_{j: y_j = \mp 1} \left( w_\pm^T x_j + b_\pm \right) + C_\pm \sum_{i: y_i = \pm 1} l(w_\pm, b_\pm; x_i, y_i), \qquad (13)$$

(1) We add the terms $b_k^2$, $k = 1, 2$, to TPMSVM to give uniform dual problems.

Fig. 3. Relationship between the objective values and iteration numbers of the TWSVM (a), PTSVM (b), TPMSVM (c), and SVM (d) on the Waveform problem.

253

where m and C  are positive penalty factors, and the loss function  lðw; b; xi ; yi Þ is defined as lðw; b; xi ; yi Þ ¼ max yi ðwT xi þ bÞ; 0 . The dual problems of (13) are:

   1 T T a A A þ e eT a  m eT AT A þ e eT a  a 2 s:t: 0 6 a 6 C  e :

min 255 256 257

ð14Þ

After optimizing the dual problems (14), we obtain the normal vectors w ¼ A a  m A e for classification.

258

3. Clipping dual coordinate descent algorithm

259

263

In this section, we propose the clipping dual coordinate descent (clipDCD) algorithm for solving the problem (3) (or (9), (12), and (14)). In the first part, we depict the detail of our algorithm. In the second part, we list some implementation issues of clipDCD algorithm.

264

3.1. Framework of clipDCD algorithm

265

269

This clipDCD algorithm is based on the gradient descent method. Without loss of generality, we denote f ð0Þ ¼ 12 aT Q a  eT a. In this clipDCD algorithm, we do not consider any outer iteration and inner iteration, and assume only one component of a is updated at each iteration, denoted aL ! aL þ k; L 2 f1; . . . ; ng is the index. Then

272

1 1 f ðkÞ ¼ ðaL þ kÞ2 Q LL þ aTN Q NN aN þ ðaL þ kÞaTN Q N L  eL ðaL þ kÞ  eTN aN 2 2     1 1 ¼ f ð0Þ þ k2 Q LL  k eL  aTN Q N L  Q LL aL ¼ f ð0Þ þ k2 Q LL  k eL  aT Q :;L ; 2 2

260 261 262

266 267 268

270

ð15Þ

where N is the index set f1; . . . ; ng n fLg and Q :;L is the Lth column of Q . Setting the derivation of k:

eL  aT Q :;L df ðkÞ ¼0 ) k¼ ; dk Q LL

ð16Þ

we have

274 275

277 278 279

2

ðeL  aT Q :;L Þ f ðkÞ ¼ f ð0Þ  : 2Q LL

ð17Þ

The objective decrease will now be approximately largest if we 2 maximize ðeL  aT Q :;L Þ =Q LL , which causes we to achieve in principle by choosing the L index as

( ) 2 ðei  aT Q :;i Þ L ¼ arg max : 16i6n Q ii

283 284 285

288 289 290 291 292 293 294

ð19Þ 296

where the index set A is

 ei aT Q :;i ei  aT Q :;i A¼ i : ai > 0 if < 0 or ai < C if >0 : Q ii Q ii

282

287

Here, we call this strategy as the maximal possibility-decrease strategy. Therefore, we have a simple update anew ¼ aL þ k. Moreover, L the k value in (16) must be adequately clipped so that 0 6 anew 6 C, which implies that we must take a k such that L 0 6 anew 6 C. To this end, we adjust the maximal possibility-decrease L strategy for choosing the L index as

( ) 2 ðei  aT Q :;i Þ ; L ¼ arg max i2A Q ii

281

ð18Þ

297 298

ð20Þ

Q1 Please cite this article in press as: X. Peng et al., A clipping dual coordinate descent algorithm for solving support vector machines, Knowl. Based Syst. (2014), http://dx.doi.org/10.1016/j.knosys.2014.08.005

273

300

KNOSYS 2922

No. of Pages 13, Model 5G

19 August 2014 Q1

7

X. Peng et al. / Knowledge-Based Systems xxx (2014) xxx–xxx 0

0 TWSVM_1+DCD TWSVM_1+clipDCD TWSVM_2+DCD TWSVM_2+clipDCD

−10

PTSVM_1+DCD PTSVM_1+clipDCD PTSVM_2+DCD PTSVM_2+clipDCD

−1

−2

−3

Objective

Objective

−20

−30

−4

−5 −40

−6 −50

−7

−60

0

1

2

10

3

10

−8 0 10

4

10

10

1

2

10

3

10

4

10

10

Iterations

Iterations

(a)

(b)

0

0

TPMSVM_1+DCD TPMSVM_1+clipDCD TPMSVM_2+DCD TPMSVM_2+clipDCD

−0.2

SVM+DCD SVM+clipDCD

−500

−0.4

Objective

Objective

−1000

−0.6

−1500

−0.8

−1

−2000

−1.2 −2500

−1.4 0 10

1

10

2

3

10

10

4

10

Iterations

0

10

1

10

2

10

3

10

4

10

5

10

6

10

Iterations

(c)

(d)

Fig. 4. Relationship between the objective values and iteration numbers of the TWSVM (a), PTSVM (b), TPMSVM (c), and SVM (d) on Checkerboard problem. 301 302 303 304

In summary, the framework of the clipDCD algorithm for solving (3) can be listed as follows:

Algorithm 2. Clipping dual coordinate descent (clipDCD) algorithm
1. Set the initial vector $\alpha$, e.g. $\alpha = 0$;
2. While $\alpha$ is not optimal:
   2.1 Choose the $L$ index by (19) and (20), and compute $\lambda$ by (16);
   2.2 Update $\alpha_L$ as $\alpha_L^{new} = [\alpha_L + \lambda]_{\#}$, where $[u]_{\#} = \max(0, \min(u, C))$.
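The following Python sketch (our own illustration, not the authors' code) implements Algorithm 2 for a precomputed matrix Q. It maintains a cached vector e − Qα so that the index selection (19)–(20) and the stopping rule (21) can be read directly from the cache; the function and variable names are assumptions.

```python
import numpy as np

def clipdcd(Q, C, eps=1e-5, max_iter=100_000):
    """clipDCD (Algorithm 2) for  min 0.5*a^T Q a - e^T a,  0 <= a_i <= C."""
    n = Q.shape[0]
    alpha = np.zeros(n)
    diag = np.diag(Q).copy()
    cache = np.ones(n)                        # e - Q @ alpha; equals e at alpha = 0
    for _ in range(max_iter):
        # Index set A of (20): coordinates that can still move in their descent direction.
        movable = ((cache > 0) & (alpha < C)) | ((cache < 0) & (alpha > 0))
        gains = np.full(n, -np.inf)
        gains[movable] = cache[movable] ** 2 / diag[movable]
        L = int(np.argmax(gains))             # maximal possibility-decrease index (19)
        if gains[L] < eps:                    # stopping condition (21)
            break
        lam = cache[L] / diag[L]              # unconstrained step (16)
        new_val = min(max(alpha[L] + lam, 0.0), C)   # clipped update of Algorithm 2
        delta = new_val - alpha[L]
        alpha[L] = new_val
        cache -= delta * Q[:, L]              # O(n) refresh of e - Q @ alpha
    return alpha
```

For the nonlinear SVM of Section 2.1 this sketch would be invoked, for example, as alpha = clipdcd(build_Q(X, y, gamma), C), reusing the earlier kernel sketch.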

The proposed clipDCD algorithm is much simpler than the DCD algorithm, and each of its steps is easy to implement.

3.2. Implementation issues

3.2.1. Stopping condition
In the clipDCD algorithm, we should choose a suitable stopping condition. Unlike the stopping condition of the DCD algorithm, we terminate the clipDCD algorithm when the maximal possibility-decrease value satisfies

$$\frac{\left( e_L - \alpha^T Q_{:,L} \right)^2}{Q_{LL}} < \epsilon, \qquad (21)$$

where the tolerance parameter $\epsilon > 0$ is given by the user. In the experiments, we set $\epsilon = 10^{-5}$ for all datasets.

Obviously, this stopping condition is a rather strict one, since the real decrease of the objective value at each iteration is possibly smaller than $\left( e_L - \alpha^T Q_{:,L} \right)^2 / Q_{LL}$. Also, this stopping condition is much simpler than that of the DCD algorithm, since the latter needs to cache all $\nabla_j^p f(\alpha)$ in each outer iteration in order to detect its stopping condition. In addition, the stopping condition of the DCD algorithm causes much extra computation, especially at the end stage of the learning process. The experimental results in Section 4 confirm this conclusion.

3.2.2. Convergence
In this part, we discuss the convergence of the clipDCD algorithm.

Proposition. The clipDCD algorithm in Section 3.1 finds the optimal solution of the problem (3) (or (9), (12), and (14)).

Proof. We first point out that the problem (3) (or (9), (12), and (14)) has an optimal solution by the extreme value theorem(2), since its objective function is continuous and its constraint set is closed and bounded. Further, the problem (3) has a global optimal solution since it is convex.

(2) The extreme value theorem: if a real-valued function is continuous on a closed and bounded region, then it attains a maximum and a minimum, each at least once.

Fig. 5. Relationship between the size of A and the iteration number of the TWSVM, PTSVM, TPMSVM, and SVM on the Thyroid problem.


Second, (15) leads to $f'(\lambda) = \lambda Q_{LL} - (e_L - \alpha^T Q_{:,L})$ and $f'(0) = -(e_L - \alpha^T Q_{:,L})$. Then $f'(0) < 0$ if $(e_L - \alpha^T Q_{:,L}) > 0$ holds, and $f'(0) > 0$ otherwise. We now consider these two cases. 1) $(e_L - \alpha^T Q_{:,L}) > 0$, i.e., $(e_L - \alpha^T Q_{:,L})/Q_{LL} > 0$. In this case $\lambda > 0$ and $\lambda^* = \min(\lambda, C - \alpha_L) > 0$ according to (16) and (20). Then $f(\lambda) \le f(\lambda^*) < f(0)$ holds since $f'(0) < 0$, i.e., the objective descends in this iteration. 2) $(e_L - \alpha^T Q_{:,L}) < 0$, i.e., $(e_L - \alpha^T Q_{:,L})/Q_{LL} < 0$. Similarly, $\lambda < 0$ and $\lambda^* = \max(\lambda, -\alpha_L) < 0$ hold. Then $f(\lambda) \le f(\lambda^*) < f(0)$ holds since $f'(0) > 0$. Hence, the objective value descends no matter which case appears. That is, the clipDCD algorithm will find the solution. □

3.2.3. Computational cost
We initialize the clipDCD algorithm with $\alpha = 0$. Then we only need to compute each $e_i^2 / Q_{ii}$ in order to start the first iteration, so the computational cost of this initial step is $O(n)$ for computing all $Q_{ii}$'s. During the learning process we can cache all $Q_{ii}$'s, with space cost $O(n)$. In general, we also store the vector $e - Q\alpha$ during the learning process, again with space cost $O(n)$. We can then easily determine the $L$ index for the next update by comparing the values $(e_i - \alpha^T Q_{:,i})^2 / Q_{ii}$. Further, once the $L$ index is determined, we update $\alpha_L$ by the clipDCD rule and refresh the stored vector $e - Q\alpha$, whose $i$th component can be updated as $e_i - \alpha^T Q_{:,i} = e_i - \bar{\alpha}^T Q_{:,i} - \lambda_L Q_{Li}$, where $\alpha$ is the current vector and $\bar{\alpha}$ is the vector before the update. Hence, only $O(n)$ kernel computations are needed to update this vector. Compared with one outer iteration of the DCD algorithm, our method has the same cost; Table 1 gives the detailed computational cost of each step of the two algorithms. For the linear SVM, the same interpretation as for the DCD algorithm applies. It should be pointed out that the above analysis does not carry over to the dual problems of TWSVM and PTSVM, since the inverses of two matrices must be computed in their dual problems.
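The incremental refresh of e − Qα described above can be checked numerically; the tiny sketch below (our own illustration with made-up data) verifies that updating the cached vector with one column of Q agrees with recomputing it from scratch.

```python
import numpy as np

# Check of the O(n) cache update: after alpha_L <- alpha_L + lam, the stored
# vector e - Q alpha can be refreshed with one column of Q instead of a full
# matrix-vector product.
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)          # a symmetric positive definite "kernel" matrix
alpha = rng.uniform(0.0, 1.0, n)
cache = 1.0 - Q @ alpha          # e - Q alpha for the current alpha

L, lam = 2, 0.3                  # hypothetical index and step
alpha[L] += lam
incremental = cache - lam * Q[:, L]
recomputed = 1.0 - Q @ alpha
assert np.allclose(incremental, recomputed)
```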


3.2.4. Online setting
In the clipDCD algorithm, another key step is finding the $L$ index according to (19). For large-scale problems, some computational cost has to be spent on finding the maximum on the right-hand side of (19). However, we can take a fast strategy to determine the index set $A$ in (20): after updating $\alpha_i$ and $e_i - \alpha^T Q_{:,i}$ at the current iteration, we rapidly filter out the indices that do not satisfy (20), so that only a fraction of the indices needs to be checked. In fact, this is easy to implement according to the sign of $\lambda_L Q_{Li}$ and the value of $\alpha_i$. In the experiments, the results also show that only a small fraction of the variables possibly needs to be updated in each iteration. However, when the number of samples is huge in some applications, going over all of $\alpha_1, \ldots, \alpha_n$ to find the index $L$ becomes expensive; in this case one can randomly choose an index set $I^{(k)}$ at the $k$th iteration, select the optimal index $L$ within $I^{(k)}$, and update $\alpha_L$, as sketched below.
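A minimal sketch of this randomized selection is given below (our own illustration): it draws a random candidate set I^(k) and applies the rule (19)–(20) only inside it. The argument names (cache for e − Qα, diag for the Q_ii values, subset_size) are assumptions.

```python
import numpy as np

def choose_index_from_subset(cache, diag, alpha, C, subset_size, rng):
    """Online variant: pick the best index inside a random candidate set I^(k)
    according to (19)-(20), instead of scanning all n coordinates."""
    n = cache.shape[0]
    I_k = rng.choice(n, size=min(subset_size, n), replace=False)
    movable = ((cache[I_k] > 0) & (alpha[I_k] < C)) | ((cache[I_k] < 0) & (alpha[I_k] > 0))
    if not np.any(movable):
        return None                      # no admissible candidate in this draw
    gains = np.where(movable, cache[I_k] ** 2 / diag[I_k], -np.inf)
    return int(I_k[np.argmax(gains)])
```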


4. Experiments


To validate the performance of the proposed algorithm, in this part we give some simulation results for the SVM, TWSVM, PTSVM, and TPMSVM classifiers learned respectively by the DCD and clipDCD algorithms on several datasets. Note that the DCD algorithm obtains faster convergence if one randomly permutes the sub-problems in each iteration; we therefore employ this technique when implementing the DCD algorithm in the experiments.

Fig. 6. Relationship between the size of A and the iteration number of the TWSVM, PTSVM, TPMSVM, and SVM on the Twonorm problem.

403

DCD algorithms on some datasets. Note that the DCD algorithm obtains faster convergence if one randomly permutes the subproblems in each iteration. Then we employ this technique to realize the DCD algorithm in the experiments.

404

4.1. Experiment setting

405

We simulate the SVM, TWSVM, PTSVM, and TPMSVM classifiers training by the DCD and clipDCD algorithms on the 13 benchmark datasets [29] in that order: Banana (B), Breast Cancer (BC), Diabetes (D), Flare (F), German (G), Heart (H), Image (I), Ringnorm (R), Splice (S), Thyroid (Th), Titanic (T), Twonorm (Tw), and Waveform (W). In particular, we use in each problem the train-test splits given in that reference (100 for each dataset except for Image and Splice, where only 20 splits are given). Table 2 contains for each dataset the data dimensions, the training and test data sizes and the test accuracies reported in [29]. In addition, we test these classifiers with two algorithms on the artificial Checkerboard (C) dataset. The checkerboard dataset consists of a series of uniform points in R2 with red and blue points taken from the 16 red and blue squares of a checkerboard. The last row in Table 2 gives the description of this dataset. This is a tricky test case in the data mining field for testing the performance of nonlinear classifiers. In the simulations we only consider the Gaussian kernel for all classifiers. For these classifiers, one of the most important problems is to choose the parameter values. However, there is no explicit way to solve the problem of choosing multiple parameters for SVMs. Although there exits many parameter-selection methods for SVMs, the most popular method to determine the parameters

400 401 402

406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426

of SVMs is still the exhaustive search [30]. For brevity’s sake, we set C 1 ¼ C 2 ¼ C for the TWSVM, PTSVM, and TPMSVM classifiers and m1 ¼ m2 ¼ m for the TPMSVM classifier. For the values of penalty parameters and kernel parameters in these algorithms, we select them from the set of values f2i ji ¼ 9; 8; . . . ; 10g by cross-validation. In addition, we set the tolerance value  ¼ 105 in the DCD and clipDCD algorithms for all datasets. It should be pointed out that we add a pair of regularization terms with suitable parameters into the TWSVM and PTSVM classifiers to improve their generalization performance. Note that the total computational costs of the algorithms are directly affected by the numbers of kernel evaluations, in the experiments. Hence, in the comparisons, we consider the test accuracies, numbers of iterations, and numbers of kernel evaluations of these methods. It should point out that, in fact, we cannot count the kernel evaluations of the TWSVM and PTSVM classifiers since they need inverse two kernel matrices in their dual QPPs. Thus, in simulation we cache the corresponding inversion matrices for the two classifiers and count the kernel evaluation if one element in matrices is used. Specifically, for the DCD algorithm, we will count one iteration and one kernel evaluation if one variable is updated in the inner iterations. While for the clipDCD algorithm, we will count one iteration and n kernel evaluations if one variable is updated, where n is the size of optimization problem.

427

4.2. Results and analysis

451

Table 3 reports the averages and standard deviations of the test accuracies (in %) for the TWSVM, PTSVM, TPMSVM, and SVM clas-

452

Q1 Please cite this article in press as: X. Peng et al., A clipping dual coordinate descent algorithm for solving support vector machines, Knowl. Based Syst. (2014), http://dx.doi.org/10.1016/j.knosys.2014.08.005

428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450

453

KNOSYS 2922

No. of Pages 13, Model 5G

19 August 2014 Q1

10

X. Peng et al. / Knowledge-Based Systems xxx (2014) xxx–xxx

TWSVM

PTSVM

300

300 TWSVM_1 TWSVM_2 200

Size of A

Size of A

200

100

0

PTSVM_1 PTSVM_2

100

0

50 Iterations

0

100

0

TPMSVM

100 200 Iterations

300

SVM

300

400 TPMSVM_2 TPMSVM_1 Size of A

200

SVM

Size of A

300

100

200 0

0

50

100 Iterations

150

0

1000 2000 Iterations

Fig. 7. Relationship between the size of A and iteration numbers of the TWSVM, PTSVM, TPMSVM, and SVM on Waveform problem.

454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482

sifiers using the DCD and clipDCD algorithms on these datasets. As it can be seen, for each classifier, both the DCD and clipDCD algorithms obtain the almost same prediction accuracies on these datasets. To give a suitable description for the performance of two algorithms, we performs a paired t-test of the null hypothesis that the accuracies of two algorithms for each classifier are not different against the alternative that the accuracies of two algorithms are obviously different on all datasets. The result of the test is returned in h. h ¼ 1 indicates a rejection of the null hypothesis at the 5% significance level. h ¼ 0 indicates a failure to reject the null hypothesis at the 5% significance level. The p-values in Table 2 show that the average testing accuracies of two algorithms for each classifier are almost the same. That is, this clipDCD algorithm does not lose any performance, i.e., it can obtain the same generalization performance as the DCD one. Of course, it can be seen that these classifiers obtain some different accuracies on these datasets. For example, the TPMSVM obtains the best results than the other classifiers on most datasets. In fact, this is because different classifiers have different generalization performance. Here, we only focus on the difference of the DCD and clipDCD algorithms but not the differences between these classifiers. Regarding the computational burden, Tables 4 and 5 give the numbers of iterations and kernel operations required to meet the stopping criterions, in which the initial vectors are set as zero vectors for all problems. It can be seen that, for each classifier, the clipDCD algorithm requires very much less iterations than the DCD algorithm for all datasets. Specifically, the clipDCD algorithm only needs about 1–10% iterations compared with the DCD algorithm for each classifier. Further, when the kernel operations are

considered, the clipDCD algorithm is clearly superior, something to be expected, as the DCD algorithm randomly updates each component at one outer iteration and the clipDCD algorithm updates the possibly most important component at one iteration. In addition, it can be found that the DCD algorithm has the larger standard deviations of the numbers of iterations and kernel operations for most datasets compared with the clipDCD algorithm. Factually, this is because the DCD algorithm randomly updates the variables in each outer loop. Then, the DCD algorithm will obtain a fast convergence speed if a good order is given. On the other hand, from Tables 4 and 5, we can see that the TWSVM, PTSVM, and TPMSVM classifiers need much less iterations and kernel evaluations than the SVM classifier if the same learning algorithms are used. This indicates again that TWSVM and its extensions have the much faster learning speeds than classical SVM. To further explain the significance of clipDCD algorithm, in Figs. Q4 1–4, we list some results of these classifiers learning by the DCD and clipDCD algorithms, i.e., the relationship between the numbers of iterations and the values of objective functions. From these figures, we can find that, to solve the dual problem, each iteration of the clipDCD algorithm obtains a larger decrease than that of the DCD algorithm. In particular, the DCD algorithm has not almost any decrease during a long learning process in some datasets. This result indicates the index selection strategy (19) in the clipDCD algorithm is a more effective strategy. In addition, it can be seen that, for the TPMSVM classifier, the clipDCD and DCD algorithms obtain the similar learning curves on the relationship between the numbers of iterations and the values of objective functions. In fact, this is because we begin the learning from zero variables

Q1 Please cite this article in press as: X. Peng et al., A clipping dual coordinate descent algorithm for solving support vector machines, Knowl. Based Syst. (2014), http://dx.doi.org/10.1016/j.knosys.2014.08.005

483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511

KNOSYS 2922

No. of Pages 13, Model 5G

19 August 2014 Q1

11

X. Peng et al. / Knowledge-Based Systems xxx (2014) xxx–xxx

TWSVM

PTSVM

400

400 TWSVM_1 TWSVM_2

300 Size of A

Size of A

300 200 100 0

PTSVM_1 PTSVM_2

200 100

0

400 800 Iterations

0

1200

1000 Iterations

TPMSVM

SVM

400

800 SVM

TPMSVM_1 TPMSVM_2

600

Size of A

Size of A

300 200 100 0

2000

400 200

0

100 Iterations

200

0

0

10000 20000 Iterations

Fig. 8. Relationship between the size of A and iteration numbers of the TWSVM, PTSVM, TPMSVM, and SVM on Checkerboard problem.

512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539

and most variables in the dual problems of the TPMSVM are shrunken at each iteration of the DCD algorithm. Compared with the DCD algorithm, one possible deficiency of our clipDCD algorithm is that it needs some extra cost to find the L index in each iteration according to (19). Fortunately, in the learning process of the clipDCD algorithm, we can automatically determine the index set A. Also, only a small fraction of variables are possibly needed to update, i.e., the size of A is small compared with the number of variables. For example, in Figs. 5–8, we show the relationship between the size of A and the iteration of these algorithms on some datasets. It can be found that the size of A becomes gradually smaller until stable during the learning process, Also, it can be found that the fraction of A is smaller compared with the number of variables, which indicates we can efficiently determine the L index with a small cost. To further explain the efficiency of this proposed clipDCD algorithm, we use the DCD and clipDCD algorithms to learn the TWSVM, PTSVM, TPMSVM, and SVM classifiers on the checkerboard problem with different sizes of training sets. Note that we do not consider a very large size of training set since the TWSVM and PTSVM need inverse two kernel matrices, which means that the memory is too large. Also, the CPU time of the DCD algorithm is too large for these classifiers if the set size is large. Here we consider the relationship between the set size and the number of kernel evaluations of the two algorithms. Fig. 9 shows the results. It can be clearly found that, given different sizes of training sets, this proposed clipDCD algorithm needs much less kernel evaluations

compared with the DCD algorithm for any classifier, which indicates the clipDCD algorithm has a much faster learning speed than the DCD algorithm. To further confirm this result, we also compare with the relationship between the set size and the learning CPU time (in seconds) of the two algorithms. Fig. 10 lists the results of the four classifiers on the checkerboard problem. Similarly, the results show the clipDCD algorithm needs much less time than DCD algorithm for the four classifiers. This indicate the proposed clipDCD algorithm is very efficient, although it needs some extra cost to find the L index in each iteration. In summary, these simulation results show that our clipDCD algorithm has a much faster learning speed than the DCD algorithm without loss of generalization performance. However, we should point out again that the TWSVM and PTSVM classifiers need inverse two kernel matrices into their dual problems. If the inversions are not cached, they will have the much slow speeds. This also causes the two classifiers to be not suitable for large-scale problems.

540

5. Conclusions

558

The recently proposed dual coordinate descent (DCD) algorithm directly solves the dual problem of SVM by optimizing a series of single-variable sub-problems with a random order at its outer iteration. This DCD algorithm needs small cache to solve large-scale problems. However, the DCD algorithm often gives a sightless update for optimizing the problem of SVM, which causes it to have

559

Q1 Please cite this article in press as: X. Peng et al., A clipping dual coordinate descent algorithm for solving support vector machines, Knowl. Based Syst. (2014), http://dx.doi.org/10.1016/j.knosys.2014.08.005

541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557

560 561 562 563 564

KNOSYS 2922

No. of Pages 13, Model 5G

19 August 2014 Q1

12

X. Peng et al. / Knowledge-Based Systems xxx (2014) xxx–xxx

TWSVM

9

10

8

Kernel evaluations

Kernel evaluations

8

10

7

10

6

10

5

10

clipDCD DCD

4

10

PTSVM

9

10

0

1000 2000 Size of train set

10

7

10

6

10

5

10

4

10

3000

0

1000 2000 Size of train set

TPMSVM

9

3000

SVM

10

9

10 Kernel evaluations

Kernel evaluations

8

10

7

10

6

10

5

10

4

10

8

10

7

10

6

10

5

0

1000

2000 3000 4000 Size of train set

10

5000

0

1000

2000 3000 4000 Size of train set

5000

Fig. 9. Relationship between the size of training set and the number of kernel evaluations of the TWSVM, PTSVM, TPMSVM, and SVM on Checkerboard problem.

Fig. 10. Relationship between the training set size and the learning CPU time (s) of the TWSVM, PTSVM, TPMSVM, and SVM on the Checkerboard problem.


In this paper, we have proposed an improved version of the DCD algorithm, called the clipping DCD (clipDCD) algorithm. At each iteration, the clipDCD algorithm chooses the single-variable sub-problem that is possibly the most effective one to optimize, according to the possible decrease of the objective value. As a result, the clipDCD algorithm has a much faster learning speed than the DCD algorithm, and its computational cost per iteration remains small if a suitable update strategy is used. The clipDCD algorithm not only has a much simpler formulation than the DCD algorithm, but also obtains a clearly faster learning speed, which makes it suitable for large-scale problems as well. Our experiments have shown that the SVM, TWSVM, PTSVM, and TPMSVM classifiers obtain faster learning speeds when the clipDCD algorithm is embedded in the learning process. In fact, the clipDCD algorithm can also be extended to the learning of twin support vector regression (TSVR) [31] and other regression models.


6. Uncited references


[25–28]. Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments and suggestions. This work is supQ6 ported by the National Natural Science Foundation of China Q7 (61202156), the National Natural Science Foundation of Shanghai (12ZR1447100), and the program of Shanghai Normal University (DZL121). References [1] B. Boser, L. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the 5th Annual Workshop on Computational Learning Theory, ACM Press, Pittsburgh, 1992, pp. 144–152. [2] V.N. Vapnik, The Natural of Statistical Learning Theory, Springer, New York, 1995. [3] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998. [4] E. Osuna, R. Freund, F. Girosi, Training support vector machines: an application to face detection, in: Proceedings of IEEE Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 1997, pp. 130–136. [5] T. Joachims, C. Ndellec, C. Rouveriol, Text categorization with support vector machines: learning with many relevant features, in: European Conference on Machine Learning No. 10, Chemnitz, Germany, 1998, pp. 137–142. [6] I. El-Naqa, Y. Yang, M. Wernik, N.P. Galatsanos, R.M. Nishikawa, A support vector machine approach for detection of microclassification, IEEE Trans. Med. Imag. 21 (12) (2002) 1552–1563. [7] B. Schölkopf, K. Tsuda, J.-P. Vert, Kernel Methods in Computational Biology, MIT Press, Cambridge, 2004.

13

[8] A. Smola, B. Schölkopf, Sparse greedy matrix approximation for machine learning, in: Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, USA, 2000, pp. 911–918. [9] D. Achlioptas, F. McSherry, B. Schölkopf, Sampling techniques for kernel methods, Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2002. [10] S. Fine, K. Scheinberg, Efficient SVM training using low-rank kernel representations, J. Mach. Learn. Res. (2001) 243–264. [11] C. Cortes, V.N. Vapnik, Support vector networks, Mach. Learn. 20 (1995) 273– 297. [12] E. Osuna, R. Freund, F. Girosi, An improved training algorithm for support vector machines, in: Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, Amelia Island, FL, USA, 1997, pp. 276–285. [13] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods–Support Vector Machine, MIT Press, Cambridge, MA, 1999, pp. 185–208. [14] T. Joachims, Making large-scale SVM learning practical, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Machine, MIT Press, Cambridge, MA, 1999, pp. 169–184. [15] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy, A fast iterative nearest point algorithm for support vector machine classifier design, IEEE Trans. Neural Netw. 11 (1) (2000) 124–136. [16] V. Franc, V. Hlavácˇ, An iterative algorithm learning the maximal margin classifier, Pattern Recogn. 36 (9) (2003) 1985–1996. [17] M. Mavroforakis, S. Theodoridis, A geometric approach to support vector machine (SVM) classification, IEEE Trans. Neural Netw. 17 (3) (2006) 671–682. [18] J. López, Á. Barbero, J.R. Dorronsoro, Clipping algorithms for solving the nearest point problem over reduced convex hulls, Pattern Recogn. 44 (3) (2011) 607– 614. [19] O.L. Mangasarian, D.R. Musicant, Successive overrelaxation for support vector machines, IEEE Trans. Neural Netw. 10 (5) (1999) 1032–1037. [20] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, A dual coordinate descent method for largescale linear SVM, in: Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. [21] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905– 910. [22] X. Chen, J. Yang, Q. Ye, J. Liang, Recursive projection twin support vector machine via within-class variance minimization, Pattern Recogn. 44 (10–11) (2011) 2643–2655. [23] X. Peng, TPMSVM: a novel twin parametric-margin support vector machine for pattern recognition, Pattern Recogn. 44 (10–11) (2011) 2678–2692. [24] J. Mercer, Functions of positive and negative type and the connection with the theory of integral equations, Phil. Trans. Roy. Soc. Lond., Ser. A 209 (1909) 415– 446. [25] X. Peng, Building sparse twin support vector machine classifiers in primal space, Inform. Sci. 181 (18) (2011) 3967–3980. [26] Y. Shao, C. Zhang, X. Wang, N. Deng, Improvements on twin support vector machines, IEEE Trans. Neural Netw. 22 (6) (2011) 962–968. [27] Y. Shao, N. Deng, A coordinate descent margin based-twin support vector machine for classification, Neural Netw. 25 (2012) 114–121. [28] Y. Shao, Z. Wang, W. Chen, N. Deng, A regularization for the projection twin support vector machine, Knowl.-Based Syst. 37 (2013) 203–210. [29] G. Rätsch, Benchmark Repository, Datasets, 2000. . [30] C.W. Hsu, C.J. 
Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2002) 415–425. [31] X. Peng, TSVR: an efficient twin support vector machine for regression, Neural Netw. 23 (3) (2010) 365–372.

