A clipping dual coordinate descent algorithm for solving support vector machines
Xinjun Peng a,b,*, Dongjing Chen a, Lingyan Kong a
a Department of Mathematics, Shanghai Normal University, Shanghai 200234, PR China
b Scientific Computing Key Laboratory of Shanghai Universities, Shanghai 200234, PR China
Article info
Article history: Received 19 February 2014. Received in revised form 5 August 2014. Accepted 5 August 2014. Available online xxxx.
Keywords: Support vector machine; Dual coordinate descent algorithm; Single-variable problem; Maximal possibility-decrease strategy; Learning speed
Abstract: The dual coordinate descent (DCD) algorithm solves the dual problem of the support vector machine (SVM) by minimizing a series of single-variable sub-problems in a random order within its inner iterations. This DCD algorithm therefore updates all variables blindly at each outer iteration, which leads to slow learning. In this paper, we present a clipping dual coordinate descent (clipDCD) algorithm for solving the dual problem of SVM. At each iteration, the clipDCD algorithm solves only one single-variable sub-problem, chosen according to the maximal possibility-decrease strategy on the objective value. The clipDCD algorithm is easy to implement, since it has a much simpler formulation than the DCD algorithm. Our experimental results indicate that, when the clipDCD algorithm is employed, SVM, the twin SVM (TWSVM) and its extensions not only obtain the same classification accuracies, but also achieve much faster learning speeds than the same classifiers trained with the DCD algorithm. © 2014 Published by Elsevier B.V.
1. Introduction
In the past decade, support vector machines (SVMs) [1–3] have become useful tools for data classification and have been successfully applied to a variety of real-world problems, such as particle identification, text categorization, bioinformatics, and financial applications [4–7]. Based on the structural risk minimization principle [2,3], classical SVM finds the maximal margin between two classes of samples by solving a quadratic programming problem (QPP) in the dual space. However, one of the main challenges of classical SVM is the large computational cost of this QPP. The long training time not only makes classical SVM slow to train on a large database, but also prevents it from locating the optimal parameter set on a very fine grid over a large span. So far, many fast SVM training algorithms have been proposed to reduce the difficulties associated with training. One class of methods builds low-rank approximations of the kernel matrix by greedy approximation [8], sampling [9], or matrix decompositions [10]. A second class of methods improves the speed with decomposition techniques based on the Karush–Kuhn–Tucker (KKT) conditions.
* Corresponding author at: Department of Mathematics, Shanghai Normal University, Shanghai 200234, PR China. Tel.: +86 21 64324866. E-mail address: [email protected] (X. Peng).
These include the Chunking algorithm [11], the decomposition method [12], sequential minimal optimization (SMO) [13], and SVMlight [14]. A third class of methods directly solves the dual QPP of SVM; it includes the geometric algorithms [15–18], the successive overrelaxation (SOR) algorithm [19], and the dual coordinate descent (DCD) algorithm [20]. In this paper, we focus on the DCD algorithm. The DCD algorithm directly solves the dual QPP of SVM by optimizing a series of single-variable sub-problems in order. Specifically, it consists of outer iterations and inner iterations: each outer iteration contains n inner iterations, where n is the number of points, and in these inner iterations the DCD algorithm updates all the variables one by one. This dual coordinate descent algorithm can effectively deal with large-scale linear SVM with a small cache. To improve the learning speed, it solves the sub-problems in a random order; the experimental results in [20] show that the DCD algorithm with random permutation converges faster. However, in the learning process of the DCD algorithm the variables are simply updated one by one, and the same holds under random permutation, even though the order of the variables is randomly adjusted. This simple update strategy is often blind, since it does not guarantee an effective decrease of the objective value; in the extreme case, an update yields a zero decrease. This often makes the DCD algorithm converge slowly. To overcome this deficiency of the DCD algorithm, in this paper we present an improved DCD algorithm, called the clipping DCD (clipDCD) algorithm.
Table 1. Computational cost of the DCD and clipDCD algorithms within one (outer) iteration.

DCD:
  Step 1    –                                     –
  Step 2.1  Compute all ∇_i^p f(α^(k),i)          O(n̄_i) for each i
  Step 2.2  Update α_i^(k),i+1                    –

clipDCD:
  Step 1    Compute all Q_ii                      O(n)
  Step 2.1  Choose L                              O(|A|)
  Step 2.2  Update α_L and e − Qα                 O(n)
Table 2. Number of dimensions, training and test samples, together with the accuracies (in %) reported in [29] for SVM with the Gaussian kernel (except for the Checkerboard dataset).

  Set  Dim   n_tr   n_te   SVM acc.
  B      2    400   4900   88.47 ± 0.66
  BC     9    200     77   73.96 ± 4.74
  D      8    468    300   76.47 ± 1.73
  F      9    666    400   67.57 ± 1.72
  G     20    700    300   76.39 ± 2.07
  H     13    170    100   84.05 ± 3.26
  I     18   1300   1010   97.04 ± 0.60
  R     20    400   7000   98.34 ± 1.12
  S     60   1000   2175   89.12 ± 0.66
  Th     5    140     75   95.20 ± 2.19
  T      3    150   2051   77.58 ± 1.02
  Tw    20    400   7000   97.04 ± 0.23
  W     21    400   4600   90.12 ± 0.43
  C      2    800   8000   –
In the learning process of the clipDCD algorithm, one variable is updated at each iteration by solving a single-variable sub-problem. This variable is selected according to the maximal possibility-decrease strategy: we select the variable that would yield the largest possible decrease of the objective value if it were updated. This strategy not only avoids the possibly blind updates in the iterations of the DCD algorithm, but also leads to a much simpler formulation than the DCD algorithm. Further, we discuss some implementation issues of the clipDCD algorithm. To validate its performance, we train several SVM classifiers, including the classical SVM, the twin support vector machine (TWSVM) [21], the projection twin support vector machine (PTSVM) [22], and the twin parametric-margin support vector machine (TPMSVM) [23], with the clipDCD algorithm. Experimental results on benchmark datasets show that the proposed clipDCD algorithm obtains the same classification accuracy with much faster numerical convergence than the DCD algorithm.
The rest of this paper is organized as follows. Section 2 briefly introduces classical SVM, the DCD algorithm, and some other classifiers that can be solved by the DCD algorithm, including the TWSVM, PTSVM, and TPMSVM classifiers. Section 3 presents the proposed clipDCD algorithm and discusses some possible implementation issues. Experimental results on benchmark datasets are given in Section 4, and some conclusions and remarks are drawn in Section 5.
2. Background
In this section, we briefly introduce classical SVM [1–3], the DCD algorithm [20], and some other classifiers that can be solved by the DCD algorithm.
Table 3. Averages, standard deviations, and p-values of the test accuracies (in %) derived by the DCD and clipDCD algorithms on benchmark datasets.

TWSVM:
  Set  DCD (p-value)          clipDCD
  B    88.20 ± 1.04 (0.9737)  88.52 ± 0.92
  BC   72.83 ± 4.25 (0.9602)  72.71 ± 4.26
  D    75.39 ± 2.38 (0.9021)  75.78 ± 2.60
  F    65.33 ± 1.67 (0.9928)  65.28 ± 1.96
  G    76.04 ± 2.07 (0.9563)  75.98 ± 1.93
  H    83.57 ± 3.30 (0.9432)  83.40 ± 3.31
  I    95.32 ± 0.86 (0.9984)  95.29 ± 0.77
  R    98.44 ± 0.10 (0.9990)  98.42 ± 0.16
  S    87.88 ± 0.67 (0.9016)  87.80 ± 0.66
  Th   95.48 ± 2.15 (0.9610)  95.52 ± 2.10
  T    77.18 ± 0.54 (0.9978)  77.16 ± 0.45
  Tw   97.46 ± 0.15 (0.9604)  97.42 ± 0.18
  W    89.87 ± 0.55 (0.8116)  89.42 ± 0.97
  C    96.88 ± 1.91 (1.0000)  96.88 ± 1.91

PTSVM:
  Set  DCD (p-value)          clipDCD
  B    87.60 ± 1.00 (1.0000)  87.60 ± 1.01
  BC   71.47 ± 5.26 (1.0000)  71.49 ± 5.25
  D    74.65 ± 2.33 (1.0000)  74.65 ± 2.33
  F    62.20 ± 0.55 (0.8245)  62.88 ± 5.27
  G    75.00 ± 2.27 (1.0000)  75.00 ± 2.27
  H    80.77 ± 3.53 (1.0000)  80.77 ± 3.53
  I    95.76 ± 0.60 (1.0000)  95.76 ± 0.60
  R    98.45 ± 0.01 (1.0000)  98.45 ± 0.01
  S    88.60 ± 0.73 (1.0000)  88.60 ± 0.72
  Th   94.74 ± 2.61 (1.0000)  94.74 ± 2.61
  T    75.57 ± 2.84 (1.0000)  75.57 ± 2.84
  Tw   97.01 ± 0.22 (1.0000)  97.01 ± 0.22
  W    89.67 ± 0.54 (1.0000)  89.67 ± 0.54
  C    98.53 ± 0.44 (1.0000)  98.53 ± 0.44

TPMSVM:
  Set  DCD (p-value)          clipDCD
  B    88.89 ± 0.73 (1.0000)  88.87 ± 0.71
  BC   74.45 ± 4.98 (0.9987)  74.51 ± 4.85
  D    76.81 ± 1.63 (0.9999)  76.79 ± 1.64
  F    67.59 ± 2.63 (0.9024)  67.78 ± 2.30
  G    76.76 ± 2.64 (0.9989)  76.78 ± 2.09
  H    84.58 ± 3.25 (0.9810)  84.62 ± 3.29
  I    97.53 ± 0.63 (0.8308)  97.69 ± 0.68
  R    98.44 ± 0.14 (1.0000)  98.44 ± 0.14
  S    89.74 ± 0.80 (0.8915)  89.85 ± 0.82
  Th   95.67 ± 1.86 (0.9436)  95.62 ± 1.88
  T    77.18 ± 0.54 (1.0000)  77.18 ± 0.54
  Tw   97.58 ± 0.10 (1.0000)  97.58 ± 0.10
  W    89.86 ± 0.30 (0.9970)  89.88 ± 0.34
  C    97.50 ± 0.41 (1.0000)  97.50 ± 0.41

SVM:
  Set  DCD (p-value)          clipDCD
  B    88.41 ± 0.72 (0.9999)  88.45 ± 0.73
  BC   72.27 ± 4.50 (0.9998)  72.30 ± 4.50
  D    75.49 ± 1.92 (0.9800)  75.42 ± 1.80
  F    67.48 ± 1.87 (1.0000)  67.48 ± 1.87
  G    76.39 ± 2.04 (1.0000)  76.38 ± 2.05
  H    83.78 ± 3.57 (1.0000)  83.78 ± 3.57
  I    97.03 ± 0.67 (0.9985)  97.06 ± 0.64
  R    98.26 ± 1.04 (1.0000)  98.26 ± 1.04
  S    89.15 ± 0.79 (1.0000)  89.15 ± 0.79
  Th   95.29 ± 2.40 (1.0000)  95.29 ± 2.40
  T    77.68 ± 0.51 (1.0000)  77.68 ± 0.47
  Tw   97.10 ± 0.22 (1.0000)  97.10 ± 0.21
  W    88.94 ± 0.61 (1.0000)  88.94 ± 0.61
  C    95.13 ± 1.21 (1.0000)  95.13 ± 1.21
Table 4. Averages and standard deviations of the number of iterations (in thousands) derived by the DCD and clipDCD algorithms on benchmark datasets. For the TWSVM, PTSVM, and TPMSVM classifiers, the iteration numbers are the sums over their two dual problems.

  Set  TWSVM DCD        TWSVM clipDCD  PTSVM DCD        PTSVM clipDCD  TPMSVM DCD       TPMSVM clipDCD  SVM DCD           SVM clipDCD
  B    33.53 ± 43.57    1.42 ± 0.80    1.62 ± 1.28      0.94 ± 0.44    22.41 ± 40.80    0.57 ± 0.16     153.64 ± 36.34    4.25 ± 1.12
  BC   29.56 ± 20.81    1.38 ± 0.27    21.60 ± 23.91    0.83 ± 0.10    25.94 ± 8.10     0.88 ± 0.07     57.03 ± 17.72     2.75 ± 0.26
  D    46.79 ± 13.46    2.98 ± 0.35    45.97 ± 15.51    2.29 ± 0.28    33.42 ± 7.98     1.31 ± 0.06     51.53 ± 16.31     17.13 ± 2.71
  F    1.82 ± 0.86      0.61 ± 0.02    16.05 ± 3.31     0.76 ± 0.06    18.10 ± 5.11     0.69 ± 0.06     228.73 ± 83.42    13.39 ± 1.71
  G    10.57 ± 4.70     0.91 ± 0.07    85.38 ± 66.84    2.85 ± 0.20    115.21 ± 79.35   1.54 ± 0.10     263.38 ± 96.68    7.53 ± 0.46
  H    0.47 ± 0.23      0.14 ± 0.02    3.78 ± 1.98      0.56 ± 0.17    3.42 ± 1.46      0.29 ± 0.06     64.45 ± 74.64     3.14 ± 0.53
  I    25.22 ± 12.04    0.99 ± 0.13    135.74 ± 54.96   2.20 ± 0.20    135.18 ± 52.05   1.49 ± 0.05     431.53 ± 171.32   9.58 ± 0.60
  R    1.40 ± 0.22      0.23 ± 0.02    1.59 ± 0.15      0.28 ± 0.02    4.61 ± 0.32      0.48 ± 0.03     21.45 ± 5.44      1.16 ± 0.04
  S    10.74 ± 11.49    0.95 ± 0.08    57.33 ± 79.54    1.93 ± 0.18    98.38 ± 52.53    1.25 ± 0.04     374.43 ± 65.38    6.58 ± 0.22
  Th   1.07 ± 0.54      0.12 ± 0.04    24.96 ± 47.56    0.22 ± 0.05    9.38 ± 3.81      0.32 ± 0.03     19.93 ± 7.62      0.80 ± 0.10
  T    0.19 ± 0.05      0.15 ± 0.02    34.23 ± 100.08   0.47 ± 1.64    31.48 ± 33.18    0.31 ± 0.09     525.21 ± 146.42   10.43 ± 7.57
  Tw   1.23 ± 0.20      0.20 ± 0.02    1.58 ± 0.40      0.28 ± 0.06    3.47 ± 0.44      0.42 ± 0.02     13.46 ± 2.09      1.48 ± 0.05
  W    0.66 ± 0.15      0.20 ± 0.02    4.24 ± 1.11      0.67 ± 0.13    8.50 ± 1.36      0.31 ± 0.03     36.21 ± 7.26      2.78 ± 0.33
  C    73.06 ± 25.26    2.34 ± 0.21    71.68 ± 24.36    2.40 ± 0.29    73.75 ± 20.02    0.76 ± 0.14     933.42 ± 148.33   38.28 ± 5.21
These classifiers include the TWSVM [21], the PTSVM [22], and the TPMSVM [23].
2.1. Support vector machine
As a state-of-the-art machine learning algorithm, classical SVM [1–3] is based on guaranteed risk bounds of statistical learning theory, known as the structural risk minimization principle. Given a set of instance-label pairs (x_i, y_i), i = 1, ..., n, with x_i ∈ R^m and y_i ∈ {−1, 1}, linear SVM finds the best separating (maximum-margin) hyperplane H(w, b): w^T x + b = 0, w ∈ R^m, between the two classes of samples by solving the following QPP:

  \min_{w,b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \ell(w, b; x_i, y_i),        (1)

where C > 0 is a penalty factor and ℓ(w, b; x_i, y_i) is a loss function. Generally, we employ the hinge loss ℓ(w, b; x, y) = max{1 − y(w^T x + b), 0}. Here, we deal with this SVM by appending an additional dimension to each point:

  x_i \leftarrow [x_i; 1], \qquad w \leftarrow [w; b].                             (2)

Then we obtain the dual QPP of the primal problem (1), which is:

  \min_{\alpha} \; f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha,
  \;\; \text{s.t.} \;\; 0 \le \alpha_i \le C, \; i = 1, \dots, n,                  (3)

where Q is the matrix with Q_ij = y_i y_j x_i^T x_j, and e = [1, ..., 1]^T is the n-dimensional vector of ones. Usually, many real-world problems are linearly inseparable. In this case, SVM maps the training vectors into a high-dimensional space, i.e., a reproducing kernel Hilbert space, via a nonlinear (implicit) function φ(x). Due to the high dimensionality of the vector variable w, one solves the dual problem (3) by the kernel trick [24], i.e., using a closed form of k(x_i, x_j) = φ(x_i)^T φ(x_j).
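As a concrete illustration of how the dual problem (3) is set up in the kernel case, the following short Python/NumPy sketch builds the matrix Q from a Gaussian kernel. The function names and the parameter gamma are our own choices, not part of the paper.

import numpy as np

def gaussian_kernel_matrix(X, gamma):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq_dists)

def build_dual_Q(X, y, gamma):
    # Q_ij = y_i y_j k(x_i, x_j) for the dual problem (3)
    K = gaussian_kernel_matrix(X, gamma)
    return (y[:, None] * y[None, :]) * K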
2.2. Dual coordinate descent algorithm
The DCD algorithm directly optimizes problem (3). It starts from an initial point α^(0) ∈ R^n and generates a sequence {α^(k), k ≥ 0}. The DCD algorithm consists of outer iterations and inner iterations. In each outer iteration, it updates α^(k) to α^(k+1). Each outer iteration has n inner iterations, so that α_1, ..., α_n are updated sequentially. Thus, each outer iteration generates vectors α^(k),i ∈ R^n, i = 1, ..., n+1, such that α^(k),1 = α^(k), α^(k),n+1 = α^(k+1), and α^(k),i = [α_1^(k+1), ..., α_{i−1}^(k+1), α_i^(k), ..., α_n^(k)]^T, i = 1, ..., n.
Table 5. Averages and standard deviations of the number of kernel evaluations (in millions) derived by the DCD and clipDCD algorithms on benchmark datasets. For the TWSVM, PTSVM, and TPMSVM classifiers, the kernel evaluations are the sums over their two dual problems. (c) A value of 0.00 only indicates that the standard deviation is small, not exactly zero.

  Set  TWSVM DCD       TWSVM clipDCD  PTSVM DCD        PTSVM clipDCD  TPMSVM DCD      TPMSVM clipDCD  SVM DCD           SVM clipDCD
  B    7.15 ± 9.65     0.30 ± 0.17    3.35 ± 2.73      0.19 ± 0.09    4.73 ± 9.19     0.12 ± 0.03     61.44 ± 14.53     1.70 ± 0.43
  BC   3.87 ± 2.97     0.17 ± 0.04    2.93 ± 3.40      0.10 ± 0.01    3.51 ± 1.17     0.11 ± 0.00(c)  11.41 ± 3.54      0.55 ± 0.05
  D    12.40 ± 3.95    0.76 ± 0.10    12.40 ± 4.60     0.59 ± 0.07    8.97 ± 2.16     0.33 ± 0.02     24.11 ± 7.63      8.02 ± 1.27
  F    0.65 ± 0.31     0.21 ± 0.01    5.38 ± 1.16      0.26 ± 0.02    6.95 ± 1.95     0.24 ± 0.02     152.35 ± 55.56    8.92 ± 1.14
  G    4.99 ± 2.33     0.38 ± 0.03    36.23 ± 31.32    1.12 ± 0.10    55.48 ± 39.22   0.68 ± 0.05     184.39 ± 67.69    5.27 ± 0.32
  H    0.04 ± 0.02     0.01 ± 0.00    0.33 ± 0.18      0.05 ± 0.02    0.31 ± 0.14     0.02 ± 0.00     10.96 ± 12.69     0.53 ± 0.09
  I    17.69 ± 8.21    0.66 ± 0.09    89.30 ± 33.16    1.45 ± 1.27    89.67 ± 34.38   0.97 ± 0.03     560.81 ± 222.64   12.45 ± 0.78
  R    0.28 ± 0.04     0.05 ± 0.00    0.32 ± 0.03      0.06 ± 0.00    0.92 ± 0.06     0.10 ± 0.01     9.01 ± 2.43       0.46 ± 0.02
  S    5.27 ± 5.63     0.48 ± 0.04    28.02 ± 38.94    0.96 ± 0.09    47.93 ± 25.30   0.62 ± 0.02     374.43 ± 65.38    6.58 ± 0.22
  Th   0.07 ± 0.05     0.01 ± 0.00    1.11 ± 2.05      0.01 ± 0.00    0.06 ± 0.02     0.01 ± 0.00     2.79 ± 1.07       0.11 ± 0.01
  T    0.02 ± 0.00     0.01 ± 0.00    3.39 ± 10.23     0.05 ± 0.17    3.15 ± 3.28     0.03 ± 0.01     79.06 ± 22.04     1.57 ± 1.14
  Tw   0.25 ± 0.04     0.04 ± 0.00    0.32 ± 0.08      0.06 ± 0.01    0.70 ± 0.09     0.08 ± 0.00     5.39 ± 0.84       0.59 ± 0.02
  W    0.13 ± 0.03     0.04 ± 0.01    0.84 ± 0.23      0.13 ± 0.03    1.92 ± 0.33     0.07 ± 0.01     14.49 ± 2.91      1.11 ± 0.13
  C    29.22 ± 9.53    0.94 ± 0.09    28.67 ± 9.11     0.96 ± 0.11    29.50 ± 7.82    0.30 ± 0.00     746.73 ± 112.24   30.63 ± 7.49
Fig. 1. Relationship between the objective values and iteration numbers of the TWSVM (a), PTSVM (b), TPMSVM (c), and SVM (d) on the Thyroid problem. [Figure: objective value vs. number of iterations (log scale), comparing DCD and clipDCD.]
For updating α^(k),i to α^(k),i+1, the DCD algorithm solves the following one-variable sub-problem:

  \min_{\lambda} \; f(\alpha^{(k),i} + \lambda e_i) \quad \text{s.t.} \quad 0 \le \alpha_i^{(k)} + \lambda \le C,        (4)

where e_i = [0, ..., 1, ..., 0]^T is the ith unit vector. The objective function of (4) is a simple quadratic function of λ:

  f(\alpha^{(k),i} + \lambda e_i) = \frac{1}{2} Q_{ii} \lambda^2 + \nabla_i f(\alpha^{(k),i}) \lambda + \text{constant},  (5)

where ∇_i f is the ith component of the gradient ∇f, defined as ∇_i f(α) = (Qα)_i − e_i = α^T Q_{:,i} − e_i = Σ_{j=1}^{n} Q_{ij} α_j − e_i, and Q_{:,i} is the ith column of Q. One can easily see that (4) has an optimum at λ = 0, i.e., there is no need to update α_i, if and only if ∇_i^p f(α^(k),i) = 0, where ∇_i^p f(α) is the projected gradient:

  \nabla_i^p f(\alpha) =
  \begin{cases}
    \min(\nabla_i f(\alpha), 0), & \text{if } \alpha_i = 0, \\
    \max(\nabla_i f(\alpha), 0), & \text{if } \alpha_i = C, \\
    \nabla_i f(\alpha), & \text{otherwise.}
  \end{cases}                                                                       (6)

If ∇_i^p f(α^(k),i) = 0 holds, we move to the (i+1)th index without updating α_i^(k),i. Otherwise, we must find the solution of (4), which is:

  \alpha_i^{(k),i+1} = \min\!\left( \max\!\left( \alpha_i^{(k),i} - \frac{\nabla_i f(\alpha^{(k),i})}{Q_{ii}}, \, 0 \right), \, C \right).   (7)

Therefore, the DCD algorithm can be depicted as follows:

Algorithm 1. Dual coordinate descent (DCD) algorithm
1. Set the initial vector α^(0) and k = 0;
2. While max_j(∇_j^p f(α)) − min_j(∇_j^p f(α)) ≥ ε:
   2.1 Compute all ∇_i f(α^(k),i) = α^T Q_{:,i} − e_i and ∇_i^p f(α^(k),i) by (6);
   2.2 Update α_i^(k),i+1 according to (7) if |∇_i^p f(α^(k),i)| ≠ 0;
3. Set k ← k + 1.

Remark 1. The kernel matrix Q may be too large to be stored, so one calculates its ith row when computing ∇_i f(α^(k),i). If n̄ is the number of nonzero components of α per inner iteration, O(n̄) kernel evaluations are needed to calculate the ith row of Q.

Remark 2. For linear SVM, we have ∇_i f(α) = y_i w^T x_i − 1, because w = Σ_{i=1}^{n} y_i α_i x_i and Q_ij = y_i y_j x_i^T x_j. The cost is much smaller than O(n̄) if w is maintained throughout the coordinate descent procedure: we can update w by w ← w + (α_i^new − α_i) y_i x_i, so the number of operations is only O(1).

Remark 3. We can use the shrinking technique to reduce the size of the optimization problem by leaving out some bounded variables. Here, we omit the details of this technique.
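For concreteness, a minimal NumPy sketch of Algorithm 1 follows. This is our own illustration, not the authors' code: it stores the full matrix Q for clarity, so it does not reflect the low-cache row-wise evaluation of Remark 1 or the O(1) linear-SVM update of Remark 2, and the function name and random seed are our choices.

import numpy as np

def dcd(Q, C, eps=1e-5, max_outer=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    alpha = np.zeros(n)
    e = np.ones(n)
    for _ in range(max_outer):
        pg_max, pg_min = -np.inf, np.inf
        for i in rng.permutation(n):                 # random permutation of the inner iterations
            g = alpha @ Q[:, i] - e[i]               # gradient component, cf. (5)
            pg = g                                   # projected gradient (6)
            if alpha[i] == 0.0:
                pg = min(g, 0.0)
            elif alpha[i] == C:
                pg = max(g, 0.0)
            pg_max, pg_min = max(pg_max, pg), min(pg_min, pg)
            if pg != 0.0:                            # single-variable update (7)
                alpha[i] = min(max(alpha[i] - g / Q[i, i], 0.0), C)
        if pg_max - pg_min < eps:                    # stopping rule of Algorithm 1
            break
    return alpha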
Fig. 2. Relationship between the objective values and iteration numbers of the TWSVM (a), PTSVM (b), TPMSVM (c), and SVM (d) on the Twonorm problem. [Figure: objective value vs. number of iterations (log scale), comparing DCD and clipDCD.]
In addition, the DCD algorithm obtains faster convergence if one randomly permutes the sub-problems in each iteration [20].
2.3. Other classifiers
In this section, we briefly list some other classifiers whose dual QPPs (9), (12), and (14) can be solved by the DCD algorithm.
2.3.1. Twin support vector machine
Linear TWSVM [21] performs binary classification using two nonparallel hyperplanes, H_±(w_±, b_±): w_±^T x + b_± = 0, instead of the single hyperplane used in classical SVM. The two hyperplanes of TWSVM are obtained by solving the following two smaller-sized QPPs:

  \min_{w_\pm, b_\pm} \; \frac{1}{2} \sum_{i: y_i = \pm 1} \left( w_\pm^T x_i + b_\pm \right)^2 + C_\pm \sum_{j: y_j = \mp 1} \ell(w_\pm, b_\pm; x_j, y_j),        (8)

where C_± > 0 are penalty factors and ℓ(w, b; x_i, y_i) is the hinge loss. The dual QPPs of (8) are:

  \min_{\alpha_\pm} \; \frac{1}{2} \alpha_\pm^T H_\mp (H_\pm^T H_\pm)^{-1} H_\mp^T \alpha_\pm - e_\mp^T \alpha_\pm
  \quad \text{s.t.} \quad 0 \le \alpha_\pm \le C_\pm e_\mp,                                                             (9)

where e_± are vectors of ones of appropriate dimensions, H_± = [A_±, e_±], and A_± are the matrices consisting of the two classes of data. Then, we obtain the augmented vectors u_± = [w_±; b_±]:

  u_\pm = -(H_\pm^T H_\pm)^{-1} H_\mp^T \alpha_\pm.                                                                     (10)
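To make explicit why the dual (9) can be handed to the same box-constrained QP solver as (3), the sketch below assembles the matrix that plays the role of Q for one of the two TWSVM problems. This is our own illustration; the small regularization term lam is an assumption to keep the inverse well posed (the paper likewise adds regularization terms to TWSVM in the experiments).

import numpy as np

def twsvm_dual_Q(A_own, A_other, lam=1e-6):
    # H_own = [A_own, e], H_other = [A_other, e]; the dual (9) is a box-constrained QP
    # in alpha with "Q" = H_other (H_own^T H_own)^{-1} H_other^T, so DCD/clipDCD applies unchanged.
    H_own = np.hstack([A_own, np.ones((A_own.shape[0], 1))])
    H_other = np.hstack([A_other, np.ones((A_other.shape[0], 1))])
    M = H_own.T @ H_own + lam * np.eye(H_own.shape[1])   # regularized H_own^T H_own
    return H_other @ np.linalg.solve(M, H_other.T)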
2.3.2. Projection twin support vector machine
PTSVM [22] finds a projection axis for each class such that the within-class variance of the projected data points of its own class is minimized, while the projected data points of the other class scatter away to a certain extent. To this end, PTSVM solves the following pair of QPPs:

  \min_{w_\pm} \; \frac{1}{2 n_\pm} \sum_{i: y_i = \pm 1} \left( w_\pm^T x_i - w_\pm^T \mu_\pm \right)^2 + C_\pm \sum_{j: y_j = \mp 1} \ell(w_\pm; x_j, y_j),      (11)

where C_± are trade-off constants, n_± are the sizes of the two classes, μ_± are the means of the two classes of points, and the loss is ℓ(w; x, y) = max{1 − (w^T x − w^T μ_k), 0}, with μ_k the mean of the class to which w belongs. The dual QPPs for (11) are:

  \min_{\alpha_\pm} \; \frac{1}{2} \alpha_\pm^T (A_\mp - e_\mp \mu_\pm^T) \Sigma_\pm^{-1} (A_\mp - e_\mp \mu_\pm^T)^T \alpha_\pm - e_\mp^T \alpha_\pm
  \quad \text{s.t.} \quad 0 \le \alpha_\pm \le C_\pm e_\mp,                                                               (12)

where Σ_± are the covariance matrices of the two classes.

2.3.3. Twin parametric-margin support vector machine
TPMSVM [23] derives a pair of parametric-margin hyperplanes through two smaller-sized SVM-type QPPs. Formally, it optimizes the following pair of constrained optimization problems (see footnote 1):

  \min_{w_\pm, b_\pm} \; \frac{1}{2} \left( w_\pm^T w_\pm + b_\pm^2 \right) + \nu_\pm \sum_{j: y_j = \mp 1} \left( w_\pm^T x_j + b_\pm \right) + C_\pm \sum_{i: y_i = \pm 1} \ell(w_\pm, b_\pm; x_i, y_i),      (13)

Footnote 1: We add the terms b_k^2, k = 1, 2, for TPMSVM to obtain uniform dual problems.
Fig. 3. Relationship between the objective values and iteration numbers of the TWSVM (a), PTSVM (b), TPMSVM (c), and SVM (d) on the Waveform problem. [Figure: objective value vs. number of iterations (log scale), comparing DCD and clipDCD.]
253
where m and C are positive penalty factors, and the loss function lðw; b; xi ; yi Þ is defined as lðw; b; xi ; yi Þ ¼ max yi ðwT xi þ bÞ; 0 . The dual problems of (13) are:
1 T T a A A þ e eT a m eT AT A þ e eT a a 2 s:t: 0 6 a 6 C e :
min 255 256 257
ð14Þ
After optimizing the dual problems (14), we obtain the normal vectors w ¼ A a m A e for classification.
258
3. Clipping dual coordinate descent algorithm
259
263
In this section, we propose the clipping dual coordinate descent (clipDCD) algorithm for solving the problem (3) (or (9), (12), and (14)). In the first part, we depict the detail of our algorithm. In the second part, we list some implementation issues of clipDCD algorithm.
264
3.1. Framework of clipDCD algorithm
265
269
This clipDCD algorithm is based on the gradient descent method. Without loss of generality, we denote f ð0Þ ¼ 12 aT Q a eT a. In this clipDCD algorithm, we do not consider any outer iteration and inner iteration, and assume only one component of a is updated at each iteration, denoted aL ! aL þ k; L 2 f1; . . . ; ng is the index. Then
272
1 1 f ðkÞ ¼ ðaL þ kÞ2 Q LL þ aTN Q NN aN þ ðaL þ kÞaTN Q N L eL ðaL þ kÞ eTN aN 2 2 1 1 ¼ f ð0Þ þ k2 Q LL k eL aTN Q N L Q LL aL ¼ f ð0Þ þ k2 Q LL k eL aT Q :;L ; 2 2
260 261 262
266 267 268
270
ð15Þ
where N is the index set f1; . . . ; ng n fLg and Q :;L is the Lth column of Q . Setting the derivation of k:
eL aT Q :;L df ðkÞ ¼0 ) k¼ ; dk Q LL
ð16Þ
we have
274 275
277 278 279
2
ðeL aT Q :;L Þ f ðkÞ ¼ f ð0Þ : 2Q LL
ð17Þ
The objective decrease will now be approximately largest if we 2 maximize ðeL aT Q :;L Þ =Q LL , which causes we to achieve in principle by choosing the L index as
( ) 2 ðei aT Q :;i Þ L ¼ arg max : 16i6n Q ii
283 284 285
288 289 290 291 292 293 294
ð19Þ 296
where the index set A is
ei aT Q :;i ei aT Q :;i A¼ i : ai > 0 if < 0 or ai < C if >0 : Q ii Q ii
282
287
Here, we call this strategy as the maximal possibility-decrease strategy. Therefore, we have a simple update anew ¼ aL þ k. Moreover, L the k value in (16) must be adequately clipped so that 0 6 anew 6 C, which implies that we must take a k such that L 0 6 anew 6 C. To this end, we adjust the maximal possibility-decrease L strategy for choosing the L index as
( ) 2 ðei aT Q :;i Þ ; L ¼ arg max i2A Q ii
281
ð18Þ
297 298
ð20Þ
Q1 Please cite this article in press as: X. Peng et al., A clipping dual coordinate descent algorithm for solving support vector machines, Knowl. Based Syst. (2014), http://dx.doi.org/10.1016/j.knosys.2014.08.005
273
300
KNOSYS 2922
No. of Pages 13, Model 5G
19 August 2014 Q1
7
X. Peng et al. / Knowledge-Based Systems xxx (2014) xxx–xxx 0
0 TWSVM_1+DCD TWSVM_1+clipDCD TWSVM_2+DCD TWSVM_2+clipDCD
−10
PTSVM_1+DCD PTSVM_1+clipDCD PTSVM_2+DCD PTSVM_2+clipDCD
−1
−2
−3
Objective
Objective
−20
−30
−4
−5 −40
−6 −50
−7
−60
0
1
2
10
3
10
−8 0 10
4
10
10
1
2
10
3
10
4
10
10
Iterations
Iterations
(a)
(b)
0
0
TPMSVM_1+DCD TPMSVM_1+clipDCD TPMSVM_2+DCD TPMSVM_2+clipDCD
−0.2
SVM+DCD SVM+clipDCD
−500
−0.4
Objective
Objective
−1000
−0.6
−1500
−0.8
−1
−2000
−1.2 −2500
−1.4 0 10
1
10
2
3
10
10
4
10
Iterations
0
10
1
10
2
10
3
10
4
10
5
10
6
10
Iterations
(c)
(d)
Fig. 4. Relationship between the objective values and iteration numbers of the TWSVM (a), PTSVM (b), TPMSVM (c), and SVM (d) on Checkerboard problem. 301 302 303 304
In summary, the framework of the clipDCD algorithm for solving (3) can be listed as follows:

Algorithm 2. Clipping dual coordinate descent (clipDCD) algorithm
1. Set the initial vector α, such as α = 0;
2. While α is not optimal:
   2.1 Choose the L index by (19) and (20), and compute λ by (16);
   2.2 Update α_L as α_L^new = [α_L + λ]_#, where [u]_# = max(0, min(u, C)).

This proposed clipDCD algorithm is much simpler than the DCD algorithm, and each of its steps is easy to implement.

3.2. Implementation issues

3.2.1. Stopping condition
In the clipDCD algorithm, we should choose a suitable stopping condition. Unlike the stopping condition of the DCD algorithm, we terminate the clipDCD algorithm when the maximal possibility-decrease value satisfies

  \frac{\left( e_L - \alpha^T Q_{:,L} \right)^2}{Q_{LL}} < \epsilon, \qquad \epsilon > 0,                                   (21)

where the tolerance parameter ε is given by the user. In the experiments, we set ε = 10^{-5} for all datasets. This is a rather strict stopping condition, since the real decrease of the objective value at each iteration is possibly less than (e_L − α^T Q_{:,L})^2 / Q_{LL}. It is also much simpler than that of the DCD algorithm, since the latter needs to cache all ∇_j^p f(α) in each outer iteration in order to evaluate its stopping condition. In addition, the stopping condition of the DCD algorithm causes much extra computation, especially at the end stage of the learning process. The experimental results in Section 4 confirm this conclusion.

3.2.2. Convergence
In this part, we discuss the convergence of the clipDCD algorithm.

Proposition. The clipDCD algorithm in Section 3.1 finds the optimal solution of problem (3) (or (9), (12), and (14)).

Proof. We first point out that problem (3) (or (9), (12), and (14)) has an optimal solution by the extreme value theorem (a real-valued function that is continuous on a closed and bounded region attains a maximum and a minimum, each at least once), since it has a continuous objective function and closed, bounded constraints. Further, problem (3) has a global optimal solution since it is convex.
337 338 341 339 340 342 343 344 345 346
KNOSYS 2922
No. of Pages 13, Model 5G
19 August 2014 Q1
8
X. Peng et al. / Knowledge-Based Systems xxx (2014) xxx–xxx
TWSVM 100
PTSVM TWSVM_1 TWSVM_2
80
Size of A
60
60
40
40
20 0
PTSVM_1 PTSVM_2
Size of A
80
100
20 0
20
40 Iterations
0
60
0
50 Iterations
TPMSVM
SVM
TPMSVM_1 TPMSVM_2
100
100
Size of A
Size of A
80
SVM
130
110
60
90 40 0
50
100 Iterations
150
70
0
200
400 600 Iterations
800
Fig. 5. Relationship between the size of A and iteration numbers of the TWSVM, PTSVM, TPMSVM, and SVM on Thyroid problem.
Second, (15) leads to f'(λ) = λ Q_LL − (e_L − α^T Q_{:,L}) and f'(0) = −(e_L − α^T Q_{:,L}). Then f'(0) < 0 if (e_L − α^T Q_{:,L}) > 0 holds, and f'(0) > 0 otherwise. We now consider these two cases. (1) (e_L − α^T Q_{:,L}) > 0, i.e., (e_L − α^T Q_{:,L}) / Q_LL > 0. In this case, λ > 0 and the clipped step λ* = min(λ, C − α_L) > 0 according to (16) and (20). Then f(λ) ≤ f(λ*) < f(0) holds since f'(0) < 0, i.e., the objective descends in this iteration. (2) (e_L − α^T Q_{:,L}) < 0, i.e., (e_L − α^T Q_{:,L}) / Q_LL < 0. Similarly, λ < 0 and λ* = max(λ, −α_L) < 0, and f(λ) ≤ f(λ*) < f(0) holds since f'(0) > 0. Hence, the objective value descends no matter which case occurs. That is, the clipDCD algorithm finds the solution. □
3.2.3. Computational cost
We initially set α = 0 in the clipDCD algorithm. Then, we only need to compute each e_i^2 / Q_ii in order to start the first iteration; that is, the computational cost of the initial iteration is O(n) for computing all Q_ii. During the learning process, we cache all Q_ii to reduce the computational cost, with a space cost of O(n). In addition, we store the vector e − Qα during the learning process, also with space cost O(n). Then, we can easily determine the L index for the next update by comparing the values (e_i − α^T Q_{:,i})^2 / Q_ii. Further, once the L index is determined, we update α_L and refresh the stored vector e − Qα, whose ith component can be updated as e_i − α̃^T Q_{:,i} = e_i − α^T Q_{:,i} − λ_L Q_{Li}, where α̃ is the updated vector and α is the vector before the update. Hence, only O(n) kernel computations are needed to refresh this vector. Compared with one outer iteration of the DCD algorithm, our method therefore has the same cost; Table 1 gives the detailed computational cost of each step of the two algorithms. For linear SVM, the same interpretation as in the DCD algorithm applies. It should be pointed out that the above analysis does not apply to the dual problems of TWSVM and PTSVM, since we have to compute the inverses of two matrices in those dual problems.
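Combining Algorithm 2, the stopping rule (21), and the cached quantities of this subsection, a minimal NumPy sketch of the clipDCD algorithm might look as follows. This is a hedged illustration with our own names, not the authors' reference implementation.

import numpy as np

def clipdcd(Q, C, eps=1e-5, max_iter=100000):
    n = Q.shape[0]
    alpha = np.zeros(n)
    resid = np.ones(n)                      # cached e - Q @ alpha (alpha = 0 at start)
    diag = np.diag(Q).copy()
    for _ in range(max_iter):
        lam_all = resid / diag              # unclipped step (16) for every index
        # admissible set A of (20): the step must point into the box [0, C]
        feasible = ((alpha > 0.0) & (lam_all < 0.0)) | ((alpha < C) & (lam_all > 0.0))
        if not feasible.any():
            break
        gain = np.where(feasible, resid ** 2 / diag, -np.inf)
        L = int(np.argmax(gain))            # maximal possibility-decrease index (19)
        if gain[L] < eps:                   # stopping condition (21)
            break
        new_aL = min(max(alpha[L] + lam_all[L], 0.0), C)   # clipped update of Algorithm 2
        delta = new_aL - alpha[L]
        alpha[L] = new_aL
        resid -= delta * Q[:, L]            # O(n) refresh of the cached e - Q @ alpha
    return alpha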
3.2.4. Online setting
Another key task in the clipDCD algorithm is to find the L index according to (19). Usually, some computational cost must be spent to find the maximum on the right-hand side of (19) for large-scale problems. However, we can take a fast strategy to determine the index set A in (20): after updating α_i and e_i − α^T Q_{:,i} at the current iteration, we rapidly filter out the indices that do not satisfy (20), so only part of the indices need to be checked. This is easy to implement according to the sign of λ_L Q_{Li} and the value of α_i. In the experiments, the results also show that only a small fraction of the variables need to be updated in each iteration. However, when the number of samples is huge in some applications, going over all α_1, ..., α_n to find the index L is expensive; in that case, one can randomly choose an index set I^(k) at the kth iteration, and then select the optimal index L within I^(k) and update α_L, as sketched below.
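A hedged sketch of this randomized variant: sample a candidate set I^(k) and apply the selection rule (19)-(20) only inside it. The subset size k and the helper name are our assumptions, not prescribed by the paper.

import numpy as np

def choose_L_from_subset(resid, diag, alpha, C, k=64, rng=None):
    # resid is the cached e - Q @ alpha, diag holds the Q_ii values.
    if rng is None:
        rng = np.random.default_rng()
    n = resid.shape[0]
    cand = rng.choice(n, size=min(k, n), replace=False)   # candidate set I^(k)
    lam = resid[cand] / diag[cand]
    ok = ((alpha[cand] > 0.0) & (lam < 0.0)) | ((alpha[cand] < C) & (lam > 0.0))
    if not ok.any():
        return None                                        # no admissible index in the sample
    gain = np.where(ok, resid[cand] ** 2 / diag[cand], -np.inf)
    return int(cand[np.argmax(gain)])                      # best index within I^(k) by (19)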
4. Experiments
To validate the performance of the proposed algorithm, in this section we give some simulation results of the SVM, TWSVM, PTSVM, and TPMSVM classifiers, each trained by the DCD and clipDCD algorithms, on several datasets.
Fig. 6. Relationship between the size of A and iteration numbers of the TWSVM, PTSVM, TPMSVM, and SVM on the Twonorm problem. [Figure: size of the index set A vs. iterations.]
Note that the DCD algorithm obtains faster convergence if one randomly permutes the sub-problems in each iteration; we employ this technique when implementing the DCD algorithm in the experiments.
4.1. Experiment setting
We simulate the SVM, TWSVM, PTSVM, and TPMSVM classifiers, trained by the DCD and clipDCD algorithms, on the 13 benchmark datasets [29], in this order: Banana (B), Breast Cancer (BC), Diabetes (D), Flare (F), German (G), Heart (H), Image (I), Ringnorm (R), Splice (S), Thyroid (Th), Titanic (T), Twonorm (Tw), and Waveform (W). In particular, for each problem we use the train–test splits given in that reference (100 splits for each dataset, except for Image and Splice, where only 20 splits are given). Table 2 lists, for each dataset, the data dimension, the training and test set sizes, and the test accuracies reported in [29]. In addition, we test these classifiers with the two algorithms on the artificial Checkerboard (C) dataset, which consists of uniformly distributed points in R^2 with red and blue labels taken from the 16 red and blue squares of a checkerboard; the last row of Table 2 describes this dataset. This is a tricky test case in data mining for assessing the performance of nonlinear classifiers. In the simulations we only consider the Gaussian kernel for all classifiers. For these classifiers, one of the most important problems is choosing the parameter values. However, there is no explicit way to choose multiple parameters for SVMs; although many parameter-selection methods exist, the most popular method to determine the parameters
of SVMs is still exhaustive search [30]. For brevity, we set C_1 = C_2 = C for the TWSVM, PTSVM, and TPMSVM classifiers and ν_1 = ν_2 = ν for the TPMSVM classifier. The penalty and kernel parameters of all algorithms are selected from the set {2^i | i = −9, −8, ..., 10} by cross-validation. In addition, we set the tolerance value ε = 10^{-5} in the DCD and clipDCD algorithms for all datasets. It should be pointed out that we add a pair of regularization terms, with suitable parameters, to the TWSVM and PTSVM classifiers to improve their generalization performance. Note that, in the experiments, the total computational costs of the algorithms are directly determined by the numbers of kernel evaluations. Hence, in the comparisons we consider the test accuracies, the numbers of iterations, and the numbers of kernel evaluations of these methods. It should also be pointed out that we cannot directly count the kernel evaluations of the TWSVM and PTSVM classifiers, since they need to invert two kernel matrices in their dual QPPs; thus, in the simulations we cache the corresponding inverse matrices for these two classifiers and count one kernel evaluation whenever an element of these matrices is used. Specifically, for the DCD algorithm we count one iteration and one kernel evaluation when one variable is updated in the inner iterations, while for the clipDCD algorithm we count one iteration and n kernel evaluations when one variable is updated, where n is the size of the optimization problem.
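The parameter selection described here is a plain exhaustive grid search with cross-validation; a minimal sketch follows, under the assumption of a hypothetical scoring callback cv_accuracy(C, gamma) which is not from the paper.

import itertools

def grid_search(cv_accuracy, grid=None):
    # Exhaustive search over C and the Gaussian kernel parameter on the grid {2^i | i = -9, ..., 10}.
    # cv_accuracy(C, gamma) is a hypothetical callback returning cross-validated accuracy.
    if grid is None:
        grid = [2.0 ** i for i in range(-9, 11)]
    best = max(itertools.product(grid, grid), key=lambda p: cv_accuracy(*p))
    return best  # (C, gamma) with the highest cross-validation accuracy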
4.2. Results and analysis
Table 3 reports the averages and standard deviations of the test accuracies (in %) for the TWSVM, PTSVM, TPMSVM, and SVM classifiers trained with the DCD and clipDCD algorithms on these datasets.
Fig. 7. Relationship between the size of A and iteration numbers of the TWSVM, PTSVM, TPMSVM, and SVM on the Waveform problem. [Figure: size of the index set A vs. iterations.]
As can be seen, for each classifier, the DCD and clipDCD algorithms obtain almost the same prediction accuracies on these datasets. To describe the performance of the two algorithms more precisely, we perform a paired t-test of the null hypothesis that, for each classifier, the accuracies of the two algorithms do not differ, against the alternative that they differ, on all datasets; the result of the test is returned in h, where h = 1 indicates rejection of the null hypothesis at the 5% significance level and h = 0 indicates a failure to reject it. The p-values in Table 3 show that the average test accuracies of the two algorithms are almost the same for each classifier. That is, the clipDCD algorithm does not lose any performance: it obtains the same generalization performance as the DCD algorithm. Of course, the classifiers themselves obtain somewhat different accuracies on these datasets; for example, the TPMSVM obtains better results than the other classifiers on most datasets. This is because different classifiers have different generalization performance; here we only focus on the difference between the DCD and clipDCD algorithms, not on the differences between the classifiers.
Regarding the computational burden, Tables 4 and 5 give the numbers of iterations and kernel operations required to meet the stopping criteria, where the initial vectors are set to zero for all problems. It can be seen that, for each classifier, the clipDCD algorithm requires far fewer iterations than the DCD algorithm on all datasets; specifically, it needs only about 1–10% of the iterations of the DCD algorithm. Further, when the kernel operations are considered, the clipDCD algorithm is clearly superior, which is to be expected, since the DCD algorithm randomly updates every component in one outer iteration while the clipDCD algorithm updates the possibly most important component at each iteration. In addition, the DCD algorithm shows larger standard deviations of the numbers of iterations and kernel operations on most datasets than the clipDCD algorithm; this is because the DCD algorithm randomly updates the variables in each outer loop, so it converges quickly only if a good order happens to be drawn. On the other hand, Tables 4 and 5 show that the TWSVM, PTSVM, and TPMSVM classifiers need far fewer iterations and kernel evaluations than the SVM classifier when the same learning algorithm is used, which again indicates that TWSVM and its extensions learn much faster than classical SVM.
To further illustrate the significance of the clipDCD algorithm, Figs. 1–4 show, for these classifiers trained by the DCD and clipDCD algorithms, the relationship between the number of iterations and the objective value. From these figures we can see that each iteration of the clipDCD algorithm obtains a larger decrease of the dual objective than an iteration of the DCD algorithm; in particular, on some datasets the DCD algorithm achieves almost no decrease during a long stretch of the learning process. This indicates that the index selection strategy (19) of the clipDCD algorithm is more effective. In addition, for the TPMSVM classifier the clipDCD and DCD algorithms show similar learning curves; this is because we begin the learning from zero variables, and most variables in the dual problems of the TPMSVM are shrunk at each iteration of the DCD algorithm.
considered, the clipDCD algorithm is clearly superior, something to be expected, as the DCD algorithm randomly updates each component at one outer iteration and the clipDCD algorithm updates the possibly most important component at one iteration. In addition, it can be found that the DCD algorithm has the larger standard deviations of the numbers of iterations and kernel operations for most datasets compared with the clipDCD algorithm. Factually, this is because the DCD algorithm randomly updates the variables in each outer loop. Then, the DCD algorithm will obtain a fast convergence speed if a good order is given. On the other hand, from Tables 4 and 5, we can see that the TWSVM, PTSVM, and TPMSVM classifiers need much less iterations and kernel evaluations than the SVM classifier if the same learning algorithms are used. This indicates again that TWSVM and its extensions have the much faster learning speeds than classical SVM. To further explain the significance of clipDCD algorithm, in Figs. Q4 1–4, we list some results of these classifiers learning by the DCD and clipDCD algorithms, i.e., the relationship between the numbers of iterations and the values of objective functions. From these figures, we can find that, to solve the dual problem, each iteration of the clipDCD algorithm obtains a larger decrease than that of the DCD algorithm. In particular, the DCD algorithm has not almost any decrease during a long learning process in some datasets. This result indicates the index selection strategy (19) in the clipDCD algorithm is a more effective strategy. In addition, it can be seen that, for the TPMSVM classifier, the clipDCD and DCD algorithms obtain the similar learning curves on the relationship between the numbers of iterations and the values of objective functions. In fact, this is because we begin the learning from zero variables
Q1 Please cite this article in press as: X. Peng et al., A clipping dual coordinate descent algorithm for solving support vector machines, Knowl. Based Syst. (2014), http://dx.doi.org/10.1016/j.knosys.2014.08.005
483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511
KNOSYS 2922
No. of Pages 13, Model 5G
19 August 2014 Q1
11
X. Peng et al. / Knowledge-Based Systems xxx (2014) xxx–xxx
TWSVM
PTSVM
400
400 TWSVM_1 TWSVM_2
300 Size of A
Size of A
300 200 100 0
PTSVM_1 PTSVM_2
200 100
0
400 800 Iterations
0
1200
1000 Iterations
TPMSVM
SVM
400
800 SVM
TPMSVM_1 TPMSVM_2
600
Size of A
Size of A
300 200 100 0
2000
400 200
0
100 Iterations
200
0
0
10000 20000 Iterations
Fig. 8. Relationship between the size of A and iteration numbers of the TWSVM, PTSVM, TPMSVM, and SVM on Checkerboard problem.
Compared with the DCD algorithm, one possible deficiency of our clipDCD algorithm is that it needs some extra cost to find the L index at each iteration according to (19). Fortunately, in the learning process of the clipDCD algorithm we can automatically determine the index set A, and only a small fraction of the variables need to be considered for updating, i.e., the size of A is small compared with the number of variables. For example, Figs. 5–8 show the relationship between the size of A and the iteration number for these algorithms on some datasets. It can be seen that the size of A becomes gradually smaller until it stabilizes during the learning process, and that it remains a small fraction of the number of variables, which indicates that we can determine the L index with a small cost.
To further examine the efficiency of the proposed clipDCD algorithm, we use the DCD and clipDCD algorithms to train the TWSVM, PTSVM, TPMSVM, and SVM classifiers on the Checkerboard problem with different training set sizes. Note that we do not consider very large training sets, since the TWSVM and PTSVM need to invert two kernel matrices, which requires too much memory, and since the CPU time of the DCD algorithm becomes too large for these classifiers when the set size grows. Here we consider the relationship between the training set size and the number of kernel evaluations of the two algorithms; Fig. 9 shows the results. It can clearly be seen that, for any classifier and any training set size, the proposed clipDCD algorithm needs far fewer kernel evaluations than the DCD algorithm, which indicates that the clipDCD algorithm has a much faster learning speed. To further confirm this result, we also compare the training CPU time (in seconds) of the two algorithms as a function of the training set size; Fig. 10 lists the results of the four classifiers on the Checkerboard problem. The results similarly show that the clipDCD algorithm needs much less time than the DCD algorithm for all four classifiers, which indicates that the proposed clipDCD algorithm is very efficient, even though it needs some extra cost to find the L index at each iteration.
In summary, these simulation results show that our clipDCD algorithm has a much faster learning speed than the DCD algorithm without loss of generalization performance. However, we should point out again that the TWSVM and PTSVM classifiers need to invert two kernel matrices in their dual problems; if the inverses are not cached, they become much slower, which also makes these two classifiers unsuitable for large-scale problems.
5. Conclusions
The recently proposed dual coordinate descent (DCD) algorithm directly solves the dual problem of SVM by optimizing a series of single-variable sub-problems in a random order within its outer iterations. The DCD algorithm needs only a small cache to solve large-scale problems. However, it often performs blind updates when optimizing the SVM problem, which causes it to have a slow convergence speed.
Fig. 9. Relationship between the size of the training set and the number of kernel evaluations of the TWSVM, PTSVM, TPMSVM, and SVM on the Checkerboard problem. [Figure: kernel evaluations (log scale) vs. training set size, comparing DCD and clipDCD.]
Fig. 10. Relationship between the size of the training set and the learning CPU time (s) of the TWSVM, PTSVM, TPMSVM, and SVM on the Checkerboard problem. [Figure: CPU time in seconds (log scale) vs. training set size, comparing DCD and clipDCD.]
In this paper, we have proposed an improved version of the DCD algorithm, called the clipping DCD (clipDCD) algorithm. In the clipDCD algorithm, we choose the single-variable sub-problem that is expected to be most effective at each iteration, according to the possible decrease of the objective value. As a result, the clipDCD algorithm has a much faster learning speed than the DCD algorithm, and its computational cost per iteration is small if a suitable update strategy is used. The clipDCD algorithm not only has a much simpler formulation than the DCD algorithm, but also achieves an obviously faster learning speed, which makes it suitable for large-scale problems. Our experiments have shown that the SVM, TWSVM, PTSVM, and TPMSVM classifiers obtain faster learning speeds when the clipDCD algorithm is embedded in their learning process. In future work, we can extend the clipDCD algorithm to the learning of twin support vector regression (TSVR) [31] and other regression models.
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments and suggestions. This work is supported by the National Natural Science Foundation of China (61202156), the Natural Science Foundation of Shanghai (12ZR1447100), and a program of Shanghai Normal University (DZL121).

References
13
[8] A. Smola, B. Schölkopf, Sparse greedy matrix approximation for machine learning, in: Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, USA, 2000, pp. 911–918. [9] D. Achlioptas, F. McSherry, B. Schölkopf, Sampling techniques for kernel methods, Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2002. [10] S. Fine, K. Scheinberg, Efficient SVM training using low-rank kernel representations, J. Mach. Learn. Res. (2001) 243–264. [11] C. Cortes, V.N. Vapnik, Support vector networks, Mach. Learn. 20 (1995) 273– 297. [12] E. Osuna, R. Freund, F. Girosi, An improved training algorithm for support vector machines, in: Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, Amelia Island, FL, USA, 1997, pp. 276–285. [13] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods–Support Vector Machine, MIT Press, Cambridge, MA, 1999, pp. 185–208. [14] T. Joachims, Making large-scale SVM learning practical, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Machine, MIT Press, Cambridge, MA, 1999, pp. 169–184. [15] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy, A fast iterative nearest point algorithm for support vector machine classifier design, IEEE Trans. Neural Netw. 11 (1) (2000) 124–136. [16] V. Franc, V. Hlavácˇ, An iterative algorithm learning the maximal margin classifier, Pattern Recogn. 36 (9) (2003) 1985–1996. [17] M. Mavroforakis, S. Theodoridis, A geometric approach to support vector machine (SVM) classification, IEEE Trans. Neural Netw. 17 (3) (2006) 671–682. [18] J. López, Á. Barbero, J.R. Dorronsoro, Clipping algorithms for solving the nearest point problem over reduced convex hulls, Pattern Recogn. 44 (3) (2011) 607– 614. [19] O.L. Mangasarian, D.R. Musicant, Successive overrelaxation for support vector machines, IEEE Trans. Neural Netw. 10 (5) (1999) 1032–1037. [20] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, A dual coordinate descent method for largescale linear SVM, in: Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. [21] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905– 910. [22] X. Chen, J. Yang, Q. Ye, J. Liang, Recursive projection twin support vector machine via within-class variance minimization, Pattern Recogn. 44 (10–11) (2011) 2643–2655. [23] X. Peng, TPMSVM: a novel twin parametric-margin support vector machine for pattern recognition, Pattern Recogn. 44 (10–11) (2011) 2678–2692. [24] J. Mercer, Functions of positive and negative type and the connection with the theory of integral equations, Phil. Trans. Roy. Soc. Lond., Ser. A 209 (1909) 415– 446. [25] X. Peng, Building sparse twin support vector machine classifiers in primal space, Inform. Sci. 181 (18) (2011) 3967–3980. [26] Y. Shao, C. Zhang, X. Wang, N. Deng, Improvements on twin support vector machines, IEEE Trans. Neural Netw. 22 (6) (2011) 962–968. [27] Y. Shao, N. Deng, A coordinate descent margin based-twin support vector machine for classification, Neural Netw. 25 (2012) 114–121. [28] Y. Shao, Z. Wang, W. Chen, N. Deng, A regularization for the projection twin support vector machine, Knowl.-Based Syst. 37 (2013) 203–210. [29] G. Rätsch, Benchmark Repository, Datasets, 2000.
. [30] C.W. Hsu, C.J. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2002) 415–425. [31] X. Peng, TSVR: an efficient twin support vector machine for regression, Neural Netw. 23 (3) (2010) 365–372.
608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668