Applied Mathematics and Computation 217 (2011) 5328–5337
Parallel algorithm for training multiclass proximal Support Vector Machines

Lingfeng Niu
Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
E-mail address: [email protected]

This work was initiated in 2007 when the author was a Ph.D. candidate at the State Key Laboratory of Scientific and Engineering Computing, Institute of Computational Mathematics and Scientific/Engineering Computing, AMSS, CAS.
Keywords: Parallel algorithm; Multiclass classification; Proximal Support Vector Machine; Post-processing; Low rank matrix approximation
Abstract. In this paper we describe a proximal Support Vector Machine algorithm for the multiclassification problem based on the one-vs-all scheme. The computational requirement of the new algorithm is almost the same as that of training a single one of its underlying binary proximal Support Vector Machines. Low rank approximation is used to reduce the computational cost when the kernel matrix is too large, and an error bound for the approximated solution is derived, which serves as a stopping criterion for the low rank approximation. A post-processing strategy is developed to overcome the difficulty arising from unbalanced data and to improve the classification accuracy. A parallel implementation of the algorithm using standard MPI communication routines is provided to handle large-scale problems and to accelerate the training process. Experimental results on several public datasets validate the effectiveness of the proposed algorithm. © 2010 Elsevier Inc. All rights reserved.
1. Introduction

The Support Vector Machine (SVM) [1-4] has become one of the most prominent machine learning techniques in recent years. The basic idea of SVM is to find a hyperplane in feature space that separates the samples into two classes. Following this approach, several other classifiers have been invented [5-8]. One of them is the proximal SVM, which assigns the points of each class to the closer of two parallel planes that are pushed apart as far as possible. This classifier has been proposed independently several times. The name proximal SVM is due to Fung and Mangasarian [6,9]. Suykens and Vandewalle also derived it as a modification of the standard SVM and named it the least squares SVM [5,10]. The formulation can also be interpreted as an implementation of Tikhonov regularization with the classical square loss [11,12]; from this point of view, Rifkin [13] renamed it regularized least squares classification. It has been shown that the same generalization bounds that apply to SVMs apply to proximal SVMs as well [13]. Extensive experiments on both toy and real-world examples also demonstrate empirically that the performance of proximal SVMs is essentially equivalent to that of SVMs across a wide range of problems [5,6,10,13,9]. Proximal SVMs have therefore become a highly viable alternative to SVMs, and the choice between a conventional SVM and a proximal SVM is mainly based on computational tractability considerations [13]. For binary classification problems that can essentially be separated in the input space, the linear proximal SVM can be trained very fast and has been considered one of the most efficient classifiers [6]. The mathematical formulation of the proximal SVM model is quite simple: it is an equality constrained quadratic program whose solution can be obtained by solving a single system of linear equations. This is in direct contrast to the standard SVM, which requires the solution of an inequality constrained optimization problem [13].
However, this does not mean that proximal SVMs are always easier to solve than SVMs. The main difficulty lies in the dense kernel matrix: for massive datasets, the dense, large-scale kernel matrix cannot fit into memory. Based on the sparsity structure of the SVM solution, decomposition approaches [14-18] were derived to handle this memory problem. For proximal SVMs, however, in general all components of the solution are nonzero, which rules out decomposition methods. Fortunately, kernel matrices have (approximately) low rank for many problems [19,20]. Therefore, we utilize the low rank approximation method in [21] to avoid explicit storage of the entire kernel matrix when its dimension is very large.

The proximal SVM is inherently a binary classifier: it classifies a sample as either positive or negative. In contrast, many problems of interest are multiclass classification problems: there are more than two classes, and the job is to pick the single class to which a data point belongs. Considerable effort has been devoted to developing efficient training algorithms for multiclassification [4,22-26]. Usually the computational requirement for training multiclassifiers is much higher than for binary classification problems of the same scale. A popular approach to multiclassification is to decompose the multiclass problem into multiple independent binary classification tasks and to apply a scheme such as one-vs-all, one-vs-one, directed acyclic graphs or error-correcting codes to build a multiclassifier from the results of these binary predictors. Among all these approaches, one-vs-all is a competitive candidate in practice [26]. In this paper, based on the special structure of proximal SVMs, we propose an algorithm that reduces the computational requirement for training a multiclass proximal SVM with the one-vs-all scheme to almost that of a single underlying binary problem.

The rest of this paper is organized as follows. In the next section we present the algorithm for solving proximal SVMs with a linear kernel; a theorem estimating the error bound of approximated solutions and the post-processing technique used to improve performance are also given. Section 3 proposes a new proximal SVM model for nonlinear kernels and extends the new algorithm to that case. The parallel implementation and experiments, which demonstrate the utility of the suggested method, are presented in Section 4. We wrap up in Section 5 with some concluding remarks and directions for further study.

2. The new training algorithm for linear multiclass proximal SVMs

2.1. Preliminaries

Suppose we have m samples {x_i, i = 1, ..., m} ⊂ R^n. For each data point x_i, the label y_i ∈ {+1, −1} indicates which class it belongs to. The proximal SVM for binary classification with a linear kernel [6] can be formulated as the following quadratic program (QP):
\min_{w,\gamma,\xi}\ \frac{\nu}{2}\|\xi\|^2 + \frac{1}{2}\left(w^T w + \gamma^2\right),   (1a)
\text{s.t. } D(Aw - e\gamma) + \xi = e,   (1b)

where ‖·‖ is the 2-norm, A = (x_1, ..., x_m)^T ∈ R^{m×n}, D = diag(y_1, ..., y_m) and e denotes the vector with all entries equal to 1. For the equality constraints in (1), we denote by λ_i the dual variable of the constraint for sample x_i. Then the KKT conditions for QP (1) are

w - A^T D\lambda = 0,   (2a)
\gamma + e^T D\lambda = 0,   (2b)
\nu\xi - \lambda = 0,   (2c)
D(Aw - e\gamma) + \xi - e = 0.   (2d)

Eliminating the primal variables (w, γ, ξ) by Eqs. (2a)-(2c), we obtain from (2d) the following system for λ:

\left(\frac{I}{\nu} + D(AA^T + ee^T)D\right)\lambda = e.   (3)
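As an illustration, the following minimal NumPy sketch solves system (3) for a small binary problem and recovers w and γ from (2a) and (2b). It is not the paper's C++ implementation, and the names A, y and nu are placeholders.

```python
import numpy as np

def train_binary_proximal_svm(A, y, nu):
    """Binary linear proximal SVM via the dual system (3).

    A  : (m, n) data matrix, one sample per row
    y  : (m,) labels in {+1, -1}
    nu : regularization parameter
    Returns (w, gamma) of the plane x^T w - gamma = 0.
    """
    m = A.shape[0]
    e = np.ones(m)
    D = np.diag(y.astype(float))
    # Coefficient matrix of (3): I/nu + D (A A^T + e e^T) D
    M = np.eye(m) / nu + D @ (A @ A.T + np.outer(e, e)) @ D
    lam = np.linalg.solve(M, e)      # dual variables lambda
    w = A.T @ (D @ lam)              # primal w from (2a)
    gamma = -e @ (D @ lam)           # primal gamma from (2b)
    return w, gamma
```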
Once λ is computed, w and γ can be obtained from (2a) and (2b), respectively.

Now consider multiclassification problems. There are still m data points {x_i, i = 1, ..., m} ⊂ R^n, but this time the label y_i takes one of k possible values {1, 2, ..., k}. Using the one-vs-all scheme, we need k different binary classifiers, each trained to distinguish the samples of one class from the samples of all other classes. When a new sample is to be classified, the k classifiers are run, and the classifier with the largest (most positive) output determines the class. For each s ∈ {1, 2, ..., k}, to separate class s from the rest, we define

D_{i,i} = +1 if y_i = s,  and  D_{i,i} = −1 if y_i ≠ s.
Once the k minimization problems of the form (1) are solved (each with a different D defined as above), k separating planes are generated:

x^T w_i - \gamma_i = 0,\quad i = 1, \dots, k.   (4)
A given new point x ∈ R^n is assigned to class s by

x^T w_s - \gamma_s = \max_{i=1,\dots,k}\ \left[\,x^T w_i - \gamma_i\,\right].   (5)
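A small sketch of decision rule (5), assuming the k planes have already been trained and are stacked row-wise; the names W and gammas are illustrative.

```python
import numpy as np

def predict_one_vs_all(x, W, gammas):
    """Assign x to the class with the largest value of x^T w_i - gamma_i.

    W      : (k, n) matrix whose i-th row is w_i
    gammas : (k,) vector of offsets gamma_i
    Returns the class index s in {1, ..., k} attaining the maximum in (5).
    """
    return int(np.argmax(W @ x - gammas)) + 1
```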
However, the binary problems resulting from the one-vs-all approach usually have an unequal proportion of data samples between the negative and positive classes, which is referred to as unbalanced data. Proximal SVMs tend to fit the class with more data points better and to underestimate the overall error of the class with fewer data points, which often results in poor performance. [9] addresses this problem by a commonly used technique: adding different weights to the different classes to tilt the balance. With a weight matrix P, the modified proximal SVM model becomes
\min_{w,\gamma,\xi}\ \frac{\nu}{2}\,\xi^T P \xi + \frac{1}{2}\left(w^T w + \gamma^2\right),
\text{s.t. } D(Aw - e\gamma) + \xi = e.
This technique is called simple balance. Another defect of the proximal SVM is that it penalizes every sample whose slack ξ is nonzero, that is, every point that does not lie exactly on its proximal plane. What we really want is to penalize only the misclassified points, i.e. the samples satisfying y(x^T w − γ) < 0. Fung and Mangasarian correct this model defect in [9] by moving the separating plane parallel to itself, which is achieved by solving the following optimization problem:
\min_{s,\gamma}\ \frac{\nu}{2}\,\big\|\big(e - D(sA\bar{w} - e\gamma)\big)_+\big\|^2 + \frac{1}{2}\left(s^2\,\bar{w}^T\bar{w} + \gamma^2\right),
where (·)_+ = max{·, 0} and ‖·‖ is the 2-norm. This process is called Newton refinement. The above method, which includes the simple balance and Newton refinement techniques, is the training algorithm for multiclass proximal SVMs proposed by Fung and Mangasarian in [9]; it requires the solution of k different systems of equations. Therefore, the more classes there are, the more computation is needed, which is always the case for multiclassification problems.

2.2. The post-processing strategy

We notice that D is a diagonal matrix whose diagonal elements are ±1, so that D² = I. Thus, system (3) can be rewritten so that the coefficient matrix no longer depends on the labels:
\left(\frac{I}{\nu} + (AA^T + ee^T)\right)(D\lambda) = \left(\frac{I}{\nu} + Q\right)(D\lambda) = y,   (6)
where y = De and Q = AA^T + ee^T. This reformulation allows us to solve only one system of equations no matter how many classes there are, since only the right-hand side y depends on the labels. However, the simple balance cannot be performed in that case. In order to keep the merit of solving only one system of equations while still improving the classification performance, we solve the following two-dimensional optimization problem as post-processing:
\min_{s,\gamma}\ \frac{\nu}{2}\,\big(e - D(sA\bar{w} - e\gamma)\big)_+^T\, P\, \big(e - D(sA\bar{w} - e\gamma)\big)_+ + \frac{1}{2}\left(s^2\,\bar{w}^T\bar{w} + \gamma^2\right),   (7)

where \bar{w} is the optimal w of problem (1). Model (7) improves the performance of the proximal SVM by adjusting the weight between the classes and penalizing the misclassified samples at the same time. Suppose m_1 and m_2 are the numbers of samples in the positive and negative class, respectively. We choose the following special diagonal matrix P so as not to affect the weight between the misclassification penalty and margin maximization:
P_{i,i} = \sqrt{\frac{m_1+m_2}{2m_1}}\ \text{if } y_i = 1, \qquad P_{i,i} = \sqrt{\frac{m_1+m_2}{2m_2}}\ \text{if } y_i \ne 1.

The Newton method described in [9] can be used to solve (7).
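As a rough illustration (not the Newton method of [9]), the following sketch solves the two-variable problem (7) with SciPy's general-purpose Nelder-Mead optimizer; A, y, w_bar and nu are assumed to come from the basic model above.

```python
import numpy as np
from scipy.optimize import minimize

def post_process(A, y, w_bar, nu):
    """Refine the basic solution w_bar by solving (7) over (s, gamma)."""
    m1, m2 = np.sum(y == 1), np.sum(y == -1)
    # Diagonal of the weight matrix P defined above.
    p = np.where(y == 1, np.sqrt((m1 + m2) / (2.0 * m1)),
                         np.sqrt((m1 + m2) / (2.0 * m2)))
    Aw, ww = A @ w_bar, w_bar @ w_bar

    def objective(v):
        s, gamma = v
        d_plus = np.maximum(1.0 - y * (s * Aw - gamma), 0.0)  # (e - D(s A w_bar - e gamma))_+
        return 0.5 * nu * np.sum(p * d_plus**2) + 0.5 * (s**2 * ww + gamma**2)

    s, gamma = minimize(objective, x0=np.array([1.0, 0.0]), method="Nelder-Mead").x
    return s * w_bar, gamma          # refined (w, gamma)
```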
[Fig. 1. Effectiveness of post-processing. Four panels plot x2 against x1 and show the proximal planes w^T x = γ + 1 and w^T x = γ − 1 for, respectively, the plain proximal SVM (top left), simple balance (top right), simple balance with Newton refinement (bottom left) and the new post-processing (bottom right).]

In order to compare the effectiveness of the different techniques mentioned above, 100 points in two-dimensional space with x ∈ (0.8, 2), y ∈ (1, 2) are generated randomly. A point satisfying x ≤ 1 gets a positive label; all the other points belong to the negative class, so an unbalanced dataset is formed. For the problem shown in Fig. 1, there are 14 positive samples (represented by +) and 86 negative samples (represented by −). We train the linear proximal SVM with parameter ν = 10 on this dataset. The top-left diagram gives the result for the plain proximal SVM: the total training set correctness is 92%, with 50% correctness for the positive class and 100% for the negative class. If the simple balance strategy is included (top-right diagram), all the positive samples are recognized correctly; however, the negative class correctness drops to 88.1% and the total training set correctness becomes 90%. If simple balance and Newton refinement are used together (bottom-left diagram), the performance improves significantly: 75% correctness on the positive class, 100% on the negative class and 96% overall. Finally, the diagram on the bottom right of Fig. 1 illustrates the performance of the
proximal SVM with the new post-processing method: 93.7% correctness for the positive class and 96.4% correctness for the negative class, with a total training set correctness of 96%. These results show that the new post-processing strategy improves the classifier almost as much as using simple balance and Newton refinement jointly, but at a much lower computational cost.

2.3. Low rank approximation

The linear system (6) is generally solved by the conjugate gradient method [5], because the coefficient matrix is positive definite and A is sparse. But when Q has low rank, or a close low rank approximation, solving the system directly by the Sherman-Morrison-Woodbury (SMW) formula is also a good choice, especially when several linear systems with different right-hand sides need to be solved. Moreover, for the nonlinear kernel problems discussed below, the low rank property can still be exploited, whereas a general kernel matrix cannot be written as AA^T and hence offers no sparsity structure for the conjugate gradient method. Suppose we can obtain a low rank approximation HH^T of Q. We first give a theorem estimating the error of the resulting solution in terms of the trace error of the approximation.

Theorem 1. Suppose Q has a positive semi-definite approximation Q̂ such that the matrix ΔQ = Q − Q̂ is positive semi-definite. Let x and x̂ be the solutions of the linear systems (I/ν + Q)x = y and (I/ν + Q̂)x̂ = y, respectively. Then, if tr(ΔQ) ≤ ε, we have
\frac{\|x - \hat{x}\|}{\|x\|} \le \nu\varepsilon,   (8)

where tr(·) is the trace of a matrix and ‖·‖ is the 2-norm.

Proof. Subtracting (I/ν + Q̂)x̂ = y from (I/ν + Q)x = y, we get (I/ν + Q̂)(x̂ − x) = (Q − Q̂)x. Hence, the difference between the approximate and the exact solution can be represented as

\hat{x} - x = \nu\,(I + \nu\hat{Q})^{-1}\,\Delta Q\, x.

Using the Schwarz inequality, we obtain

\|\hat{x} - x\| \le \nu\,\|(I + \nu\hat{Q})^{-1}\|\,\|\Delta Q\|\,\|x\| \le \nu\,\|\Delta Q\|\,\|x\|.
So we have

\frac{\|x - \hat{x}\|}{\|x\|} \le \nu\,\|\Delta Q\| \le \nu\varepsilon.

The last inequality is based on the fact that, for a positive semi-definite matrix ΔQ, ‖ΔQ‖ ≤ tr(ΔQ). □

The above error bound does not rely on how the low rank matrix is obtained. We choose the incomplete Cholesky factorization (ICF) with a low rank factor matrix to approximate Q under a certain criterion [27,21,28]. In detail, suppose i_k is the index of the largest diagonal entry of Q − H_{:,1:k−1}H_{:,1:k−1}^T at the kth iteration of ICF; then the kth column of H is calculated by [28]:
H_{j,k} = \begin{cases} \sqrt{v_{i_k}}, & j = i_k, \\ \Big(Q_{j,i_k} - \sum_{l=1}^{k-1} H_{j,l} H_{i_k,l}\Big)\big/ H_{i_k,k}, & j \notin \{i_1, \dots, i_k\}, \end{cases}   (9)

where the vector v is initialized with v_j = Q_{j,j} and updated in each step by v_j := v_j − H_{j,k}^2 for all j ∉ {i_1, ..., i_k}. Because

v_{i_k} = \|v\|_\infty \ge \frac{1}{m}\|v\|_1 = \frac{1}{m}\sum_{i=1}^{m} v_i = \frac{1}{m}\,\mathrm{tr}\big(Q - H_{:,1:k-1}H_{:,1:k-1}^T\big),

the iteration can, based on Theorem 1, be terminated when v_{i_k} is less than a pre-defined threshold. Suppose there are p columns in H. From the process of calculating H we know that only the columns {i_1, ..., i_p} and the diagonal elements of Q are needed, and these can be computed directly from the samples; thus we never need to form the whole matrix Q. The ICF algorithm is described in Algorithm 1 for completeness.

Algorithm 1: Incomplete Cholesky Factorization
Require: Kernel matrix Q, rank of approximation p, stopping threshold ε
Ensure: Return the factor H ∈ R^{m×p} of the low rank approximation of Q
1: for j = 1 to m do
2:   v_j = Q_{j,j} = x_j^T x_j + 1
3: end for
4: for k = 1 to p do
5:   Select i_k = argmax_{j ∉ {i_1,...,i_{k−1}}} v_j
6:   if v_{i_k} < ε then
7:     Stop iteration
8:   end if
9:   H_{i_k,k} = sqrt(v_{i_k})
10:  for j = 1 to m do
11:    if j ∉ {i_1, ..., i_k} then
12:      H_{j,k} = (Q_{j,i_k} − Σ_{l=1}^{k−1} H_{j,l} H_{i_k,l}) / H_{i_k,k}
13:    else if j ≠ i_k then
14:      H_{j,k} = 0
15:    end if
16:  end for
17:  for j = 1 to m do
18:    if j ∉ {i_1, ..., i_k} then
19:      v_j = v_j − H_{j,k}^2
20:    end if
21:  end for
22: end for
After the ICF approximation HH^T of the matrix Q is obtained, the inverse of I/ν + Q̂ can be written, by the SMW formula, as

\left(\frac{I}{\nu} + HH^T\right)^{-1} = \nu I - \nu^2 H\,(I + \nu H^T H)^{-1} H^T.   (10)

Hence, from Eq. (6), λ can be computed by

\lambda = D^{-1}\big(\nu I - \nu^2 H (I + \nu H^T H)^{-1} H^T\big)\, y.   (11)
In practice, the computation is carried out in the following steps:

z := H^T y,\quad t := (I + \nu H^T H)^{-1} z,\quad r := \nu y - \nu^2 H t,\quad \lambda := D^{-1} r.

2.4. Description of the new algorithm

For completeness, we give the overall new algorithm for training proximal SVMs with a linear kernel in Algorithm 2.

Algorithm 2: New training algorithm for multiclass proximal SVMs
Require: Kernel matrix Q, rank of approximation p, regularization parameter ν.
Ensure: Return (w_c, γ_c) for each category c = 1, 2, ..., k.
1: Perform ICF on Q by Algorithm 1 and get Q̂ = HH^T, H ∈ R^{m×p}.
2: Compute the symmetric matrix R = I + νH^TH.
3: for s = 1 to k do
4:   for i = 1 to m do
5:     if y_i = s then
6:       ŷ_i := 1
7:     else
8:       ŷ_i := −1
9:     end if
10:  end for
11:  Set D = diag(ŷ_1, ..., ŷ_m).
12:  Compute z := H^T ŷ.
13:  Solve t from Rt = z.
14:  Compute r := νŷ − ν²Ht.
15:  Compute λ := D^{-1}r.
16:  Compute w̄_s = A^T Dλ and γ̄_s = −e^T Dλ.
17:  Solve post-processing problem (7) to get (w_s, γ_s).
18: end for
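A sketch of the multiclass loop of Algorithm 2: the matrix R = I + νH^TH is factorized once and reused for all k right-hand sides. It builds on the icf and post_process sketches above, uses a Cholesky factorization in place of line 13, and all names are placeholders.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def train_multiclass_psvm(A, labels, H, nu, k):
    """One-vs-all proximal SVM training sharing a single factorization."""
    R = np.eye(H.shape[1]) + nu * H.T @ H            # step 2
    R_fac = cho_factor(R)                            # factorize once, reuse k times
    models = {}
    for s in range(1, k + 1):                        # steps 3-18
        y_hat = np.where(labels == s, 1.0, -1.0)     # steps 4-11 (diagonal of D)
        z = H.T @ y_hat                              # step 12
        t = cho_solve(R_fac, z)                      # step 13
        r = nu * y_hat - nu**2 * (H @ t)             # step 14
        lam = y_hat * r                              # step 15: D^{-1} r with D^{-1} = D
        w_bar = A.T @ (y_hat * lam)                  # step 16: A^T D lambda
        # gamma_bar = -np.sum(y_hat * lam) is superseded by the post-processing below
        models[s] = post_process(A, y_hat, w_bar, nu)   # step 17 (sketch above)
    return models
```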
Computing p columns of the kernel matrix and the ICF of Q needs pmn + (1/2)mp² arithmetic operations. Calculating the matrix I + νH^TH and solving the system (I + νH^TH)t = z requires (1/2)mp² + (1/3)p³ operations. Therefore, the total time complexity of our low rank approximation algorithm is pmn + mp² + (1/3)p³. When the dimension n of the training data is much less than the number of training samples m, [A, e] can be taken directly as H without computing the ICF; in this case the corresponding complexity reduces to (1/2)m(n+1)² + (1/3)(n+1)³. So the ICF is performed only when

pmn + mp^2 + \frac{1}{3}p^3 < \frac{1}{2}m(n+1)^2 + \frac{1}{3}(n+1)^3.

A simple computation shows that the above inequality holds if p < ((√3 − 1)/2)(n+1) ≈ 0.366(n+1). Hence, we do not allow the rank of the ICF to exceed 0.366(n+1) in practice.
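For completeness, one way to obtain this constant (not spelled out in the text): write p = α(n+1) with 0 < α < 1; then

pmn + mp^2 \le m(n+1)^2(\alpha + \alpha^2) \quad\text{and}\quad \tfrac{1}{3}p^3 < \tfrac{1}{3}(n+1)^3,

so the inequality is guaranteed whenever α² + α ≤ 1/2, i.e. α ≤ (√3 − 1)/2 ≈ 0.366.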
3. Proximal SVMs with nonlinear kernels

We now derive an extension to nonlinear proximal SVMs which differs from [9]. Using w = A^T Du in model (1) to eliminate w, we get

\min_{u,\gamma,\xi}\ \frac{\nu}{2}\|\xi\|^2 + \frac{1}{2}\left(u^T D A A^T D u + \gamma^2\right),
\text{s.t. } D(AA^T Du - e\gamma) + \xi = e.

Replacing AA^T with a general kernel matrix K, we obtain

\min_{u,\gamma,\xi}\ \frac{\nu}{2}\|\xi\|^2 + \frac{1}{2}\left(u^T D K D u + \gamma^2\right),   (12a)
\text{s.t. } D(KDu - e\gamma) + \xi = e.   (12b)

The KKT conditions are
DKD(u - \lambda) = 0,
\gamma + e^T D\lambda = 0,
\nu\xi - \lambda = 0,
D(KDu - e\gamma) + \xi - e = 0.

The first condition is satisfied by taking u = λ; eliminating the remaining primal variables as in the linear case, the main system to be solved becomes

\left(\frac{I}{\nu} + (K + ee^T)\right) D\lambda = y.
The ICF mentioned above can be used to obtain a low rank approximation HH^T of K + ee^T. After solving for λ, and consequently u and γ, the resulting binary nonlinear proximal SVM classifier is

\sum_{i=1}^{m} u_i y_i K(x_i, x) - \gamma.
Thus, the multiclassifier is

\max_{r=1,\dots,k}\ \Big\{ \sum_{i=1}^{m} u_{r,i}\,(2\delta_{y_i,r} - 1)\,K(x_i, x) - \gamma_r \Big\},

where

\delta_{i,j} = 1\ \text{if } i = j, \qquad \delta_{i,j} = 0\ \text{if } i \ne j.
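A sketch of this nonlinear decision rule, assuming the coefficients u_{r,i} and offsets γ_r have already been computed; kernel_fn is an assumed kernel function K(·,·).

```python
import numpy as np

def predict_kernel_psvm(x, X_train, y_train, U, gammas, kernel_fn):
    """Nonlinear one-vs-all prediction.

    X_train : (m, n) training samples; y_train : (m,) labels in {1, ..., k}
    U       : (k, m) coefficients u_{r,i}; gammas : (k,) offsets gamma_r
    kernel_fn(a, b) : kernel value K(a, b)
    """
    k_vec = np.array([kernel_fn(xi, x) for xi in X_train])   # K(x_i, x)
    scores = [np.sum(U[r] * np.where(y_train == r + 1, 1.0, -1.0) * k_vec) - gammas[r]
              for r in range(U.shape[0])]                     # 2*delta - 1 gives the sign
    return int(np.argmax(scores)) + 1                         # classes numbered 1..k
```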
The post-processing strategy described above for the linear kernel can also be extended to the nonlinear kernel case. Define

d(s, \gamma) = e - s\,DHH^T Du + \gamma De, \qquad f(s, \gamma) = \nu\, d(s,\gamma)_+^T\, P\, d(s,\gamma)_+ + s^2\, u^T DHH^T Du + \gamma^2.

The post-processing then solves the following two-variable optimization problem:

\min_{s,\gamma}\ f(s, \gamma).
4. Parallel implementation and experiments

Our implementation of Algorithm 2 for training multiclass proximal SVMs is an object-oriented C++ code. To make the algorithm applicable to large scale datasets and to accelerate training, we store the low rank matrix in distributed memory and implement the low rank approximation in parallel. The parallel version uses standard MPI communication routines (Message Passing Interface Forum) [29], and is therefore easily portable to many multiprocessor systems. All the experiments are carried out on LSSC-II at the State Key Laboratory of Scientific and Engineering Computing, Chinese Academy of Sciences. The computing environment has 256 computation nodes; each node is equipped with two 2 GHz Xeon processors and 1 GB of memory. The serial code is run on one computation node, and the parallel version is tested with up to 20 nodes. We start using ICF when min{m, n} > 2000, and the default upper bound for the rank of the approximation matrix is 0.366 min{m, n}. So
P = min{m, n} if min{m, n} ≤ 2000, and P = 0.366 min{m, n} otherwise,

is the maximal rank that we take for the approximation matrix Q̂.

4.1. Datasets and parameter tuning

We test our implementation on all 6 small datasets mentioned in [9]. To validate the scalability of our algorithm, 4 larger datasets and 2 text classification problems obtained from UCI [30] are also used. Statistics of all the datasets are listed in Table 1: the columns "#training", "#test", "#feature" and "#class" give the number of training samples, test samples, input features and classes, respectively, and N/A means that no test data are available for that problem. A linear kernel is used in the following experiments, so no kernel parameter needs to be chosen. The model parameter ν is tuned with log2 ν in the range 0 to 25 with step size 1. If a dataset contains both training and test samples, we first conduct 5-fold cross validation on the training set to obtain the optimal ν, then train the proximal SVM on the whole training set with this ν, predict the test set, and report the accuracy and training time. If a dataset is not split into training and test parts, we conduct 5-fold cross validation on the whole dataset and report the 5-fold average accuracy and training time at the optimal ν.
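A sketch of this tuning protocol; train_fn and accuracy_fn are placeholders for any training routine (e.g. the sketches above) and its evaluation.

```python
import numpy as np

def tune_nu(A, y, train_fn, accuracy_fn, n_folds=5, seed=0):
    """Pick nu by 5-fold cross validation over nu = 2^0, ..., 2^25."""
    folds = np.array_split(np.random.default_rng(seed).permutation(A.shape[0]), n_folds)
    best_nu, best_acc = None, -1.0
    for log2_nu in range(0, 26):
        nu = 2.0 ** log2_nu
        accs = []
        for f in range(n_folds):
            val = folds[f]
            trn = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            model = train_fn(A[trn], y[trn], nu)
            accs.append(accuracy_fn(model, A[val], y[val]))
        if np.mean(accs) > best_acc:
            best_nu, best_acc = nu, float(np.mean(accs))
    return best_nu
```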
Table 1
Data statistics.

Problem                              #training   #test    #feature (n)   #class (k)
Small scale test problems
  Wine                               178         N/A      13             3
  Glass                              214         N/A      9              6
  Iris                               150         N/A      4              3
  Vowel                              528         462      10             11
  Vehicle                            846         N/A      18             4
  Segment                            2310        N/A      19             7
Large scale test problems
  Vehicle (comb)                     78,823      19,705   100            3
  usps                               7291        2007     256            10
  mnist                              60,000      10,000   780            10
  Letter                             15,000      5000     16             26
Text classification test problems
  Sector                             6412        3207     55,197         105
  News20                             15,935      3993     62,061         20
4.2. Comparison with Fung and Mangasarian's algorithm

We first compare our algorithm with Fung and Mangasarian's method, which uses both simple balance and Newton refinement. Results on the 6 test problems used in [9] are listed in Table 2. Column "F&M" refers to Fung and Mangasarian's algorithm; columns "Basic" and "Improved" refer to our algorithm without and with post-processing, respectively. These results show that the new algorithm is faster than Fung and Mangasarian's method. We also conduct the comparison on the 4 larger scale problems in Table 1; the results are given in Table 2 as well. They show that the larger the number of categories k, the more efficient our algorithm is, which illustrates the scalability of our algorithm to classification problems with many categories. As for classification accuracy, we find that both simple balance combined with Newton refinement and our new post-processing technique improve the accuracy of the basic proximal SVM model. Among the 10 test problems, the new algorithm attains the best accuracy on 4 datasets after post-processing; for the other 6 problems, the best results are achieved by Fung and Mangasarian's method.

4.3. Effectiveness of the low rank approximation

A large group of problems (such as text classification) have very high dimensional features, i.e. large n. Usually a linear kernel is a suitable choice for this kind of problem. However, the SMW formula cannot be applied directly because n is too large, and Fung and Mangasarian's method becomes prohibitively expensive. In this situation, low rank approximation of the kernel matrix is necessary, so we test our implementation with low rank kernel matrix approximation in this subsection. The text classification datasets sector and news20 are used for this experiment. The upper bound of the approximation rank is set to h·min{m, n}, where h = 0.183 and h = 0.366 are used to test the classification accuracy. Table 3 lists the classification accuracy in the different cases. The computational results show that, with the help of low rank approximation, promising classification accuracy is obtained within reasonable time. Moreover, our post-processing strategy also works when low rank approximation is utilized.
Table 2
Training results for linear classifier.

Problem              F&M                    Basic                  Improved
                     Accur.(%)  Time(s)     Accur.(%)  Time(s)     Accur.(%)  Time(s)
Wine                 98.87      0.006       98.87      <0.001      98.39      0.001
Glass                60.19      0.006       57.61      0.006       61.70      0.006
Iris                 89.95      0.004       84.33      <0.001      88.70      0.004
Vowel                41.13      0.020       33.12      0.010       36.36      0.010
Vehicle              77.64      0.040       76.30      0.004       77.66      0.008
Segment              90.08      0.042       84.66      0.012       89.38      0.033
Vehicle (combined)   80.53      14.630      80.61      4.600       80.74      5.170
usps                 88.99      15.360      87.00      2.070       87.44      2.520
mnist                87.66      376.92      85.97      157.93      86.63      161.94
Letter               64.60      1.120       55.34      0.300       65.28      1.250
Table 3
Class-prediction accuracy (%) with different h settings.

Dataset    h = 0.183              h = 0.366
           Basic     Improved     Basic     Improved
Sector     90.74     91.05        92.67     92.83
News20     83.17     83.20        85.37     85.55
[Fig. 2. Relative speedup: connect-4 (left), mnist (middle) and protein (right). Each panel plots the minimum, average and maximum speedup against the number of processes (1 to 20).]
On the other hand, we also find that the classification accuracy can improve as the rank of the approximation matrix increases, which stems from the fact that the error between the approximated solution and the optimal solution is bounded via tr(Q − Q̂), as stated in Theorem 1. However, the classification accuracy is affected not only by the solution accuracy but also by the kernel type, the choice of classifier and many other factors, so there is no simple relationship between the rank of the approximation matrix and the classification accuracy.

4.4. Speedup of the parallel implementation

We evaluate the relative speedup of our parallel implementation on the three largest problems connect-4, mnist and protein in Table 1. The relative speedup is defined as sp_r = T_s/T_p, where T_s is the training time on a single processor and T_p is the training time with p processes. We vary the number of processes over {1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20} and run the code three times on each dataset for each number of processes to reduce the variance of the computing time in the parallel environment. The maximum, minimum and average speedups are shown in Fig. 2. These diagrams show that our parallel implementation enjoys nearly linear speedup at the beginning. This mainly benefits from the fact that the kernel computation in the ICF algorithm, which consumes the majority of the computational time, is evenly distributed over the nodes. When the number of computation nodes is increased further, the speedup becomes sublinear because the communication cost grows. In our implementation, the data transferred in each iteration consist of one training sample and a vector whose length equals the rank of the ICF matrix, so the communication volume of the ICF algorithm is mainly determined by the rank of the approximation matrix. Since the rank of the ICF matrix used is much smaller than the size of the training data (see the discussion in Section 4.3), the communication overhead for large scale datasets is always relatively low compared with the time saved by the parallel kernel computation. Therefore, the speedup of our new algorithm does not degrade too much in practice and the parallel implementation is always worthwhile.

5. Conclusion

In this paper, we gave an efficient implementation for training linear multiclass proximal SVMs. Low rank approximation is used to further reduce the computational cost, and post-processing is provided to improve the performance of the proximal SVM classifier. Experiments show that the new algorithm reaches high classification accuracy with lower computational complexity. A parallel implementation of the algorithm was provided to handle large scale datasets and to accelerate the training process. We also extended the algorithm and the post-processing technique to the nonlinear kernel case, where the training cost can likewise be reduced considerably by low rank approximation. However, because the solution has no sparsity structure, testing remains expensive; further research on simplifying the testing process for the nonlinear kernel case will be a difficult and interesting topic for the future. Other improvements will concern the optimization and distribution of the tasks which are not currently parallelized, and more careful post-processing techniques.
Acknowledgements

The author thanks the anonymous referee for the careful reading and valuable comments that improved the original manuscript. The author would also like to thank Professor Ya-xiang Yuan for his help and encouragement, and for studying early drafts of this paper. This work was partially supported by NSFC grants 10831006, 70621001, 70921061, 70531040 and CAS grant kjcx-yw-s7.

References

[1] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press, 1992, pp. 144-152.
[2] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273-297.
[3] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., New York, NY, USA, 1995.
[4] V.N. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.
[5] J. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Processing Letters 9 (1999) 293-300.
[6] G. Fung, O.L. Mangasarian, Proximal support vector machine classifiers, in: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2001, pp. 77-86.
[7] O.L. Mangasarian, Generalized support vector machines, in: Advances in Large Margin Classifiers, MIT Press, 2000, pp. 135-146.
[8] Y.-J. Lee, O.L. Mangasarian, SSVM: a smooth support vector machine, Computational Optimization and Applications 20 (2001) 5-22.
[9] G.M. Fung, O.L. Mangasarian, Multicategory proximal support vector machine classifiers, Machine Learning 59 (2005) 77-97.
[10] J. Suykens, T.V. Gestel, J.D. Brabanter, B.D. Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[11] A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55-67.
[12] T. Evgeniou, M. Pontil, T. Poggio, A unified framework for regularization networks and support vector machines, Technical Report, Massachusetts Institute of Technology, 1999.
[13] R.M. Rifkin, Everything old is new again: a fresh look at historical approaches in machine learning, Ph.D. Thesis, 2002 (Supervisor: Tomaso Poggio).
[14] E. Osuna, R. Freund, F. Girosi, Training support vector machines: an application to face detection, in: Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA, 1997, pp. 130-136.
[15] T. Joachims, Making large-scale SVM learning practical, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 169-184.
[16] J. Platt, Sequential minimal optimization: a fast algorithm for training support vector machines, Technical Report 14, Microsoft Research, Redmond, Washington, 1998.
[17] R. Collobert, S. Bengio, C. Williamson, SVMTorch: support vector machines for large-scale regression problems, Journal of Machine Learning Research 1 (2001) 143-160.
[18] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Software available online.
[19] A.J. Smola, B. Schölkopf, Sparse greedy matrix approximation for machine learning, in: Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 2000, pp. 911-918.
[20] R.C. Williamson, A.J. Smola, B. Schölkopf, Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators, IEEE Transactions on Information Theory 51 (2005) 128-142.
[21] S. Fine, K. Scheinberg, Efficient SVM training using low-rank kernel representations, Journal of Machine Learning Research 2 (2001) 243-264.
[22] E.J. Bredensteiner, K.P. Bennett, Multicategory classification by support vector machines, Computational Optimization and Applications 12 (1999) 53-79.
[23] J.C. Platt, N. Cristianini, J. Shawe-Taylor, Large margin DAGs for multiclass classification, in: Advances in Neural Information Processing Systems, vol. 12, MIT Press, Cambridge, MA, 2000, pp. 547-553.
[24] K. Crammer, Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, Journal of Machine Learning Research 2 (2002) 265-292.
[25] C.-W. Hsu, C.-J. Lin, A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks 13 (2002) 415-425.
[26] R. Rifkin, A. Klautau, In defense of one-vs-all classification, Journal of Machine Learning Research 5 (2004) 101-141.
[27] G.H. Golub, C.F. Van Loan, Matrix Computations, second ed., The Johns Hopkins University Press, Baltimore, Maryland, 1989.
[28] E.Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, H. Cui, PSVM: parallelizing support vector machines on distributed computers, in: Advances in Neural Information Processing Systems, vol. 20. Software available online.
[29] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra, MPI: The Complete Reference, MIT Press, 1996.
[30] A. Asuncion, D. Newman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2007.