Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression

Senjian An ∗, Wanquan Liu, Svetha Venkatesh
Department of Computing, Curtin University of Technology, G.P.O. Box U1987, Perth, WA 6845, Australia
∗ Corresponding author. E-mail addresses: [email protected] (S. An), [email protected] (W. Liu), [email protected] (S. Venkatesh).
Abstract

Given n training examples, the training of a least squares support vector machine (LS-SVM) or kernel ridge regression (KRR) corresponds to solving a linear system of dimension n. In cross-validating LS-SVM or KRR, the training examples are split into two distinct subsets a number of times (l): a subset of m examples is used for validation and the remaining (n − m) examples are used for training the classifier. In this case, l linear systems of dimension (n − m) need to be solved. We propose a novel method for cross-validation (CV) of LS-SVM and KRR in which, instead of solving l linear systems of dimension (n − m), we compute the inverse of an n-dimensional square matrix and solve l linear systems of dimension m, thereby reducing the complexity when l is large and/or m is small. Typical multi-fold CV, leave-one-out CV (LOO-CV) and leave-many-out CV are considered. For the five-fold CV used in practice with five repetitions over randomly drawn slices, the proposed algorithm is approximately four times as efficient as the naive implementation. For large data sets, we propose to evaluate the CV approximately by applying the well-known incomplete Cholesky decomposition technique; the complexity of these approximate algorithms scales linearly with the data size if the rank of the associated kernel matrix is much smaller than n. Simulations are provided to demonstrate the performance of LS-SVM and the efficiency of the proposed algorithm, with comparisons to the naive and some existing implementations of multi-fold and LOO-CV.

© 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Model selection; Cross-validation; Kernel methods
1. Introduction

As an interesting variant of the standard support vector machine (SVM) [1], the least squares support vector machine (LS-SVM) was proposed in Ref. [2] and has been successfully applied in many applications [3]. Although LS-SVM is not as popular as the standard SVM, the performances of LS-SVM and the standard SVM are comparable on a wide range of benchmark data sets [4]. In the formulation of LS-SVM, the model includes hyper-parameters, such as the kernel parameter and the regularization parameter, that govern the generalization performance of LS-SVM classifiers. Finding hyper-parameters that yield good generalization performance is crucial for the successful application of LS-SVM. A popular way to estimate the generalization performance of a model is cross-validation (CV). In l-fold CV, one divides the data into l subsets of (approximately) equal size and trains the classifier l times, each time leaving out one of the subsets from training but using the omitted subset to compute the classification errors. If l equals the sample size, this is called leave-one-out cross-validation (LOO-CV). Leave-many-out cross-validation (LMO-CV) is a more elaborate and expensive version of CV that involves leaving out all possible subsets of m training examples. Exhaustive LMO-CV can be prohibitively expensive even for medium amounts of data. Monte-Carlo LMO-CV was suggested in Ref. [5]; it selects each validation subset by drawing m examples at random a sufficient number of times, say L. Under certain assumptions about the data's statistical distribution, L = n is enough for the multiple linear regression problem [5], where n is the number of training examples. The naive implementation of l-fold CV trains a classifier for each split of the data and is thus computationally expensive if l is large, especially for LOO-CV where l = n. Much work has
been done to reduce the computational complexity of LOO-CV; see Refs. [6–10] for SVM, Ref. [11] for LS-SVM, Ref. [12] for sparse LS-SVM and Ref. [13] for the kernel Fisher discriminant (KFD). It should be noted that all of these methods, except for Ref. [11], compute approximate LOO (A-LOO) errors.

In cross-validating LS-SVM, the classifiers trained on each split are not really of interest; one is only concerned with the predicted responses of the left-out examples. Using the fact that the LS-SVM training problem has a closed-form solution (the solution of a linear system, see Eq. (4)), we apply the well-known matrix inverse formula to develop a new formula for calculating the predicted labels of the left-out examples in the CV without training the LS-SVM classifiers for each split of the training examples. Two novel algorithms are proposed for evaluating l-fold CV and LMO-CV of LS-SVM. For LMO-CV, the proposed algorithm is more efficient than the naive implementation if the number of left-out examples used for validation is less than half of the available training examples. In particular, when the left-out data size for validation is small and independent of the total number of available data, the complexity of the proposed algorithm is approximately the same as that of LOO-CV evaluation. For l-fold CV with l > 5, the proposed algorithm runs faster than the naive implementation, and the reduction in computation increases with l. An interesting property of the proposed l-fold CV is that its computational complexity decreases with l, while the naive implementation involves more computation for larger l. For the typical 10-fold CV with 10 repetitions, the new algorithm is approximately 23 times as efficient as the naive implementation.

The layout of this paper is as follows. In Section 2, we briefly review the formulation of the LS-SVM. In Section 3, we develop the CV formula for calculating the predicted responses of the left-out data and develop the algorithms for l-fold, robust l-fold and LMO-CV evaluations. Section 4 provides approximate algorithms for large data sets using incomplete Cholesky decomposition, and Section 5 considers the extensions to regression problems. In Section 6, we provide experimental examples to illustrate the performance of LS-SVM and show the efficiency of the proposed algorithms by comparing run times with the naive and some existing implementations.
2. Least squares support vector machines

Given a training set {(x_i, y_i)}_{i=1}^n with input data x_i ∈ R^m and class labels y_i ∈ {−1, 1}, according to Refs. [2,3], for a classifier in the primal space that takes the form

y(x) = \mathrm{sign}[w^T \varphi(x) + b],   (1)

the LS-SVM is formulated as

\min_{w,b,e} J(w, e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{n} e_i^2   (2)

with constraints

w^T \varphi(x_i) + b = y_i - e_i, \quad i = 1, 2, \dots, n,   (3)

where w ∈ R^{n_h} and the nonlinear function φ(·): R^m → R^{n_h} is usually introduced by a kernel function that maps the input space to a high-dimensional feature space. However, one need not evaluate w and φ(·) explicitly in the LS-SVM framework. By using the Lagrangian multiplier optimization method, the solution of the minimization problem in the primal space can be obtained by solving the following linear system (see Ref. [3] for details):

\begin{bmatrix} 0 & 1_n^T \\ 1_n & K + \frac{1}{\gamma} I_n \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}   (4)

with y = [y_1, y_2, …, y_n]^T, 1_n = [1, 1, …, 1]^T, α = [α_1, α_2, …, α_n]^T and

K_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j).   (5)

The function K(·, ·) can typically be a linear, polynomial or Gaussian kernel. The LS-SVM classifier is formulated as

y(x) = \mathrm{sign}\!\left[\sum_{i=1}^{n} \alpha_i K(x_i, x) + b\right],   (6)

where α, b are the solutions of Eq. (4).

3. The CV algorithms

In l-fold CV, one splits the data into l subsets {x_{k,i}}_{i=1}^{n_k} of (approximately) equal size (n_v), i.e., n_k ≈ n_v, where k = 1, 2, …, l and Σ_{k=1}^{l} n_k = n. Correspondingly, we split the label vector y and the solution α of Eq. (4) into l subvectors as follows: y = [y_(1)^T, y_(2)^T, …, y_(l)^T]^T, α = [α_(1)^T, α_(2)^T, …, α_(l)^T]^T, where y_(k) = [y_{k,1}, y_{k,2}, …, y_{k,n_k}]^T and α_(k) = [α_{k,1}, α_{k,2}, …, α_{k,n_k}]^T. In cross-validating LS-SVM, the classifier for each training set is not really of interest. One is only concerned with the predicted responses of the left-out examples. Next, we derive the formula for l-fold CV that directly computes the predicted responses of the left-out examples. This formula is based on the inverse of the system matrix of Eq. (4).

3.1. A new formula

Denote

A \triangleq \begin{bmatrix} 0 & 1_n^T \\ 1_n & K + \frac{1}{\gamma} I_n \end{bmatrix}

and partition its inverse as

A^{-1} \triangleq \begin{bmatrix} C & C_1^T & C_2^T & \cdots & C_l^T \\ C_1 & C_{11} & C_{12} & \cdots & C_{1l} \\ C_2 & C_{12}^T & C_{22} & \cdots & C_{2l} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ C_l & C_{1l}^T & C_{2l}^T & \cdots & C_{ll} \end{bmatrix},   (7)
where C is a number, C_i ∈ R^{n_i} and C_ij ∈ R^{n_i × n_j} for i, j = 1, 2, …, l. Then we have:

Theorem 1. Let y^(k)(x) = sign[g_k(x)] denote the classifier formulated by leaving the kth group out and let ε_{k,i} = y_{k,i} − g_k(x_{k,i}). Then

\varepsilon_{(k)} = C_{kk}^{-1} \alpha_{(k)}, \quad k = 1, 2, \dots, l,   (8)

where ε_(k) = [ε_{k,1}, ε_{k,2}, …, ε_{k,n_k}]^T. (See proof in the Appendix.)

Remark 1. This theorem provides an exact formula for computing the predicted responses of the left-out examples in the CV procedure. The outputs are exactly the same as those obtained by directly training LS-SVM classifiers for each split of the training examples. In Ref. [12], a similar formula is provided for the leave-one-out (LOO) predicted residuals of sparse LS-SVM classifiers. One may apply that formula to LS-SVM as a special case, but the resulting formula is not exact. Note that kernel methods find the classifier based on the bilateral relations (i.e., the elements of the kernel matrix) of the training patterns. A training pattern is also a source of features. In the LOO procedure, if we leave the ith pattern out and train the classifier on the other patterns, both the ith row and the ith column of the kernel matrix should be removed. This is different from the LOO procedure of the classical regression problem, wherein features are not related to patterns. In Ref. [12], the features are assumed to be invariant in the LOO procedure. This is approximately true for sparse LS-SVM, wherein only a small part of the training patterns is used in formulating the classifier (i.e., providing features).

3.2. l-fold CV

Since K_γ ≜ K + (1/γ)I_n is positive definite, d ≜ −1_n^T K_γ^{-1} 1_n ≠ 0 and thus

A^{-1} = \begin{bmatrix} 0 & 1_n^T \\ 1_n & K_\gamma \end{bmatrix}^{-1}
       = \begin{bmatrix} \frac{1}{d} & -\frac{1}{d} 1_n^T K_\gamma^{-1} \\ -\frac{1}{d} K_\gamma^{-1} 1_n & K_\gamma^{-1} + \frac{1}{d} K_\gamma^{-1} 1_n 1_n^T K_\gamma^{-1} \end{bmatrix}.   (9)

Therefore,

\begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1l} \\ C_{12}^T & C_{22} & \cdots & C_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ C_{1l}^T & C_{2l}^T & \cdots & C_{ll} \end{bmatrix} = K_\gamma^{-1} + \frac{1}{d} K_\gamma^{-1} 1_n 1_n^T K_\gamma^{-1}   (10)

and

\alpha = K_\gamma^{-1} y + \frac{1}{d} K_\gamma^{-1} 1_n 1_n^T K_\gamma^{-1} y.   (11)
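To make the use of Eqs. (9)–(11) concrete, the following is a minimal NumPy sketch that computes α and the block matrix C of Eq. (10) from a precomputed kernel matrix. The function and variable names (lssvm_cv_core, Kg_inv, u) are ours and not from the paper; the explicit inverse is kept only to mirror the complexity analysis, and in practice a Cholesky factorization of K_γ would be preferable numerically.

```python
import numpy as np

def lssvm_cv_core(K, y, gamma):
    """Compute alpha (Eq. (11)) and the full block matrix C of Eq. (10)
    from the kernel matrix K, the labels y and the regularization gamma."""
    n = K.shape[0]
    Kg_inv = np.linalg.inv(K + np.eye(n) / gamma)   # K_gamma^{-1}
    u = Kg_inv @ np.ones(n)                         # K_gamma^{-1} 1_n
    d = -np.ones(n) @ u                             # d = -1_n^T K_gamma^{-1} 1_n
    C = Kg_inv + np.outer(u, u) / d                 # Eq. (10)
    alpha = Kg_inv @ y + u * (u @ y) / d            # Eq. (11)
    return alpha, C
```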
Once K_γ^{-1} is available, one can compute α and C_kk from Eqs. (11) and (10), respectively, and then ε_(k) by solving the linear system C_kk ε_(k) = α_(k). Note that the dimension of C_kk is approximately n_v, which is in general much smaller than (n − n_v). Thus, solving C_kk ε_(k) = α_(k) is generally much simpler than training the LS-SVM classifier on (n − n_k) examples. In summary, the new l-fold CV algorithm consists of the following five steps (a code sketch is given after the list):

(1) Evaluate the kernel matrix K and compute K_γ^{-1};
(2) Compute α and C_kk from Eqs. (11) and (10), respectively;
(3) Solve the linear systems C_kk ε_(k) = α_(k), k = 1, 2, …, l;
(4) Compute the predicted labels ŷ_(k) = sign[y_(k) − ε_(k)];
(5) Sum up all incorrect labels: N_err = (1/2) Σ_{k=1}^{l} Σ_{i=1}^{n_k} |y_{k,i} − ŷ_(k,i)|, where ŷ_(k,i) denotes the ith element of ŷ_(k).
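Assuming lssvm_cv_core from the previous sketch is available, the five steps above can be written as follows; the fold index sets are assumed to be supplied as arrays of 0-based indices.

```python
def lfold_cv_errors(K, y, gamma, folds):
    """Exact l-fold CV error count for LS-SVM via Theorem 1:
    solve C_kk eps_(k) = alpha_(k) for each fold (steps 3-5)."""
    alpha, C = lssvm_cv_core(K, y, gamma)                        # steps 1-2
    n_err = 0
    for idx in folds:                                            # idx: left-out group
        eps = np.linalg.solve(C[np.ix_(idx, idx)], alpha[idx])   # step 3
        y_pred = np.sign(y[idx] - eps)                           # step 4
        n_err += int(np.sum(y_pred != y[idx]))                   # step 5 (labels are +/-1)
    return n_err
```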
In the naive implementation of l-fold CV, one trains the LS-SVM classifiers l times, each time leaving out one of the subsets from training and using the omitted subset to compute the classification errors. This implementation involves the solution of l linear systems of dimension (n − n_k). Since the complexity of solving a linear system of dimension m is in general (1/3)m³ [14] and n_v ≈ n/l, the complexity of the naive l-fold CV is (l/3)(n − n_v)³ ≈ ((l − 1)³/(3l²))n³. In the special case n_v = 1, l = n, this reduces to LOO-CV and the computational complexity is (1/3)n(n − 1)³ ≈ (1/3)n⁴. On the other hand, the proposed algorithm involves one inverse of an n × n matrix and the solution of l linear systems of dimension n_v, and thus its complexity is n³ + (1/3)l n_v³ ≈ [1 + 1/(3l²)]n³. Hence, the proposed algorithm is (l − 1)³/(1 + 3l²) ≈ l/3 − 1 times as efficient as the naive implementation. For any l ≥ 6, (l − 1)³ > 3l² + 1 and thus the proposed algorithm is more efficient. For the typical 10-fold CV, the proposed algorithm is approximately 2.4 times as efficient as the naive implementation. It is interesting to note that the computations of this algorithm decrease with increasing l, while the naive implementation involves more computations for larger l. Therefore, the computational reductions of the proposed algorithm increase with l. It should be noted that the computations of the proposed algorithm for various folds are approximately equal, since l is typically greater than 5 and thus 1/(3l²) is a small number.

In the case l = n, the algorithm reduces to LOO-CV. The complexity is n³, which is the same as that of the fast LOO-CV developed in Ref. [11]. The proposed algorithm is slightly more efficient since it does not compute the classifiers. More precisely, the fast LOO-CV computes A^{-1} and α, b, while the proposed algorithm does not need b or the first row and column of A^{-1}. Up to this stage, the two algorithms involve the same number of operations. Then, based on A^{-1}, α, b, the fast LOO-CV computes the classifier for each training set with one example deleted and then computes the predicted label of this left-out example. This procedure involves several O(n²) operations. The proposed algorithm computes the predicted labels directly using α and the diagonals of A^{-1}; this procedure involves only 2n operations. Though the computational complexities are similar, the new algorithm has two advantages. First, in step 3, one can compute all elements of ε using matrix computations and the for loop can be avoided. This is very useful when the algorithm is evaluated in Matlab; our simulations show that the time reduction is significant. Second, if one applies the incomplete
Cholesky decomposition technique [15–17] for large data sets, the complexity can be reduced to scale linearly with the data size (for details, see Section 4). This is due to the fact that only the diagonals of C are needed. On the other hand, the algorithm of Ref. [11] requires all elements of A^{-1}, and its computational complexity will be O(n²) even if one applies the incomplete Cholesky decomposition technique.

3.3. Robust l-fold CV

The procedure of l-fold CV can be repeated a number of times (r) by permuting and splitting the data repeatedly into l groups. Each time one gets an estimate of the generalization error. To reduce the variance of the l-fold CV estimate of the generalization error, Ref. [3] suggests using the average of the r estimates. The naive implementation of this procedure needs to solve lr linear systems of dimension approximately ((l − 1)/l)n, and therefore its complexity is (r(l − 1)³/(3l²))n³. We now show that the complexity can be significantly reduced by using the proposed new formula. The novel implementation involves one inverse of an n × n matrix and the solutions of rl linear systems of dimension n/l, and thus has complexity [1 + r/(3l²)]n³, which is r(l − 1)³/(3l² + r) times as efficient as the naive implementation. The larger r is, the more the computation is reduced. For the typical five-fold (or 10-fold) CV with five (or 10, respectively) random permutations, the proposed algorithm is approximately four (or 23, respectively) times as efficient as the naive implementation.

Let P be a random permutation matrix of dimension n × n. The kernel matrix of the permuted data, denoted by K̂, is then K̂ = P K P^T. Therefore K̂_γ ≜ K̂ + (1/γ)I_n = P K_γ P^T and d̂ ≜ −1_n^T K̂_γ^{-1} 1_n = d. So

\hat\alpha = \hat K_\gamma^{-1}\hat y + \frac{1}{d}\hat K_\gamma^{-1} 1_n 1_n^T \hat K_\gamma^{-1}\hat y = P\alpha   (12)

and

\begin{bmatrix} \hat C_{11} & \hat C_{12} & \cdots & \hat C_{1l} \\ \hat C_{12}^T & \hat C_{22} & \cdots & \hat C_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ \hat C_{1l}^T & \hat C_{2l}^T & \cdots & \hat C_{ll} \end{bmatrix} \triangleq \hat K_\gamma^{-1} + \frac{1}{d}\hat K_\gamma^{-1} 1_n 1_n^T \hat K_\gamma^{-1} = P C P^T,   (13)

where ŷ = P y is the label vector of the permuted data and

C \triangleq \begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1l} \\ C_{12}^T & C_{22} & \cdots & C_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ C_{1l}^T & C_{2l}^T & \cdots & C_{ll} \end{bmatrix}.   (14)

For convenience, we use p to denote the permutation vector associated with the permutation matrix P and split it into l subvectors as we did for the label vector y, i.e., p = [p_(1)^T, p_(2)^T, …, p_(l)^T]^T, where p_(k) = [p_{k,1}, p_{k,2}, …, p_{k,n_k}]^T. Correspondingly, we split the permuted label vector ŷ and the permuted solution α̂ into l subvectors as follows: ŷ = [ŷ_(1)^T, ŷ_(2)^T, …, ŷ_(l)^T]^T, α̂ = [α̂_(1)^T, α̂_(2)^T, …, α̂_(l)^T]^T, where ŷ_(k) = [ŷ_{k,1}, ŷ_{k,2}, …, ŷ_{k,n_k}]^T and α̂_(k) = [α̂_{k,1}, α̂_{k,2}, …, α̂_{k,n_k}]^T.
If we use C{p_(k), p_(k)} to denote the submatrix of C composed of the rows and columns indicated by the index vector p_(k), and use α{p_(k)}, y{p_(k)} to denote the subvectors of α and y, respectively, with the elements indicated by the index vector p_(k), then we have α̂_(k) = α{p_(k)} and Ĉ_kk = C{p_(k), p_(k)}. Hence, the robust l-fold CV can be described as follows (a code sketch is given below):

(1) Evaluate the kernel matrix K and compute K_γ^{-1};
(2) Compute α and C;
(3) Choose the permutation vector p (= [p_(1)^T, p_(2)^T, …, p_(l)^T]^T) at random repeatedly for r times. Each time, solve the linear systems C{p_(k), p_(k)} ε_(k) = α{p_(k)}, compute the predicted labels ŷ_(k) = sign[y{p_(k)} − ε_(k)] and sum up the incorrect labels N_err = (1/2) Σ_{k=1}^{l} Σ_{i=1}^{n_k} |y(p_(k)(i)) − ŷ_(k,i)|, where ŷ_(k,i) denotes the ith element of ŷ_(k) and p_(k)(i) denotes the ith element of p_(k);
(4) Average the r estimates of the incorrect labels.

Note that the evaluation of α and C is of order O(n²) once K_γ^{-1} is available.
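A sketch of this repetition loop, reusing lssvm_cv_core from the sketch in Section 3.1; the random-number handling and the choice to return an error rate are implementation details of ours.

```python
def robust_lfold_cv(K, y, gamma, l, r, seed=0):
    """Repeated (robust) l-fold CV: K_gamma^{-1}, alpha and C are computed once
    and reused across the r random permutations, as in steps (1)-(4) above."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha, C = lssvm_cv_core(K, y, gamma)        # steps (1)-(2)
    estimates = []
    for _ in range(r):                           # step (3), repeated r times
        p = rng.permutation(n)                   # permutation vector p
        n_err = 0
        for idx in np.array_split(p, l):         # p_(1), ..., p_(l)
            eps = np.linalg.solve(C[np.ix_(idx, idx)], alpha[idx])
            n_err += int(np.sum(np.sign(y[idx] - eps) != y[idx]))
        estimates.append(n_err / n)
    return float(np.mean(estimates))             # step (4): average the r estimates
```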
3.4. LMO-CV

In LMO-CV, the training examples are split into two distinct subsets, a subset of m examples being used for validation and the other subset, containing (n − m) examples, being used for training the LS-SVM classifiers. The labels of the validation set are predicted from the LS-SVM classifier. There are in total C_n^m such splits of the training examples. The exhaustive LMO-CV conducts this procedure on all the possible C_n^m cases, while the Monte-Carlo LMO-CV does so only on a certain number, say L, of cases selected at random. L is typically chosen as L = n or 2n.

For convenience, we use ind to denote an index vector with m distinct integers selected at random from {k}_{k=1}^n. We use C(ind, ind) to denote the submatrix of C including the rows and columns indicated by the index vector ind. Similarly, we use α(ind), y(ind) to denote the subvectors of α and y, respectively, with the elements indicated by the index vector ind. For example, for ind = [1, 2], C(ind, ind) denotes the submatrix of C consisting of its first and second rows and columns. The new LMO-CV evaluation algorithm is proposed as follows (a code sketch is given below):

(1) Evaluate the kernel matrix K, compute K_γ^{-1}, C and α;
(2) Select the index vector ind for L times. For each time, solve the linear system C(ind, ind) ε = α(ind), compute the predicted labels ŷ = sign[y(ind) − ε] and count the incorrect labels N_err = (1/2) Σ_{k=1}^{m} |y(ind)_k − ŷ(k)|, where y(ind)_k denotes the kth element of y(ind);
(3) Sum up all incorrect labels.

This algorithm involves one inverse of an n × n matrix and the solution of L linear systems of dimension m, and thus its complexity is n³ + (1/3)Lm³.
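A corresponding sketch of Monte-Carlo LMO-CV, again reusing lssvm_cv_core from Section 3.1; drawing the index vectors with NumPy's default generator is our choice.

```python
def monte_carlo_lmo_cv(K, y, gamma, m, L, seed=0):
    """Monte-Carlo leave-many-out CV: L random validation sets of size m,
    each requiring only an m x m linear solve."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha, C = lssvm_cv_core(K, y, gamma)                        # step (1)
    n_err = 0
    for _ in range(L):                                           # step (2)
        ind = rng.choice(n, size=m, replace=False)               # random index vector
        eps = np.linalg.solve(C[np.ix_(ind, ind)], alpha[ind])
        n_err += int(np.sum(np.sign(y[ind] - eps) != y[ind]))
    return n_err                                                 # step (3): total incorrect labels
```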
The naive implementation solves, for each split, a linear system of dimension (n − m), and thus its complexity is (1/3)L(n − m)³. If L = 2n, then (1/3)L(n − m)³ > n³ + (1/3)Lm³ for any m < n/2 − 1. That is, if the validation set size for LMO-CV is less than half of the training examples, the proposed algorithm is more efficient. The smaller m is, the more the computations can be reduced. In particular, if m is far smaller than n, the complexity of the proposed algorithm is n³, the same as that for LOO-CV, while the complexity of the naive implementation is (1/3)Ln³.

4. Approximate algorithms for large data sets

Though the proposed algorithms for LS-SVM and kernel ridge regression (KRR) CV are much more efficient than the naive implementations, both their computational complexity O(n³) and memory storage O(n²) are expensive for large data sets. Incomplete Cholesky decomposition [15,16] is a powerful technique for reducing the computational complexity of kernel-based algorithms on large data sets. In this section, we apply this technique to further reduce the complexity and memory storage requirements for cross-validating LS-SVM and KRR.

According to Ref. [17], the incomplete Cholesky decomposition algorithm can be summarized as follows (a code sketch is given below):

(1) Evaluate the diagonals K(x_i, x_i) of the kernel matrix K and initialize k = 0 and D = [d_1, d_2, …, d_n] with d_i = K(x_i, x_i) (for Gaussian kernels, d_i = 1);
(2) while ‖D‖_1 > η (a prescribed tolerance),
  (a) k ← k + 1;
  (b) find the best new element i = arg max(D);
  (c) evaluate K_i, the ith column of K;
  (d) calculate the kth column of G: G_k = (1/√d_i)(K_i − G_{[1:k−1]} G_{i,[1:k−1]}^T);
  (e) update D with d_j ← d_j − G_{jk}² for j = 1, 2, …, n.

Here, G_{jk} denotes the jth element of G_k, G_{[1:k−1]} denotes the (k − 1) columns of G computed in the previous (k − 1) steps, and G_{i,[1:k−1]} denotes the ith row of G_{[1:k−1]}. ‖D‖_1 = Σ_{i=1}^n d_i denotes the 1-norm of the vector D. Note that d_i ≥ 0. One can also use the infinity norm ‖D‖_∞ = max_{i=1}^n d_i as an early stopping criterion.

We remark that this algorithm is slightly different from the original form in Ref. [17], but both result in the same G. In the original form, the indices (say i_1, i_2, …, i_{k−1}) of the already selected columns of the kernel matrix are kept in memory and the associated elements of D are not updated. Correspondingly, the best element i is selected by excluding the previously selected elements (i ∉ {i_1, i_2, …, i_{k−1}}). In the present algorithm, we update all the elements of D; it is easy to verify that d_i = 0 if i ∈ {i_1, i_2, …, i_{k−1}}, so we select the same element in each step as the original algorithm does. The calculation of G_k is also slightly different from the original form. In the original form, the ith element is computed first, G_{ik} = √d_i, and then the other elements are computed as we compute them in step (d). By direct calculation, one can verify that step (d) also yields G_{ik} = √d_i. Therefore, the present algorithm yields the same G as the original algorithm.
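The following is a sketch of the incomplete Cholesky step above. The helpers kernel_column(i) (returning the ith column of K) and diag_K (its diagonal) are our assumptions; supplying them on demand means the full kernel matrix never has to be stored.

```python
import numpy as np

def incomplete_cholesky(kernel_column, diag_K, tol):
    """Pivoted incomplete Cholesky factorization K ~ G G^T."""
    n = len(diag_K)
    D = np.asarray(diag_K, dtype=float).copy()   # residual diagonal, d_i >= 0
    cols = []
    while D.sum() > tol and len(cols) < n:       # stop when ||D||_1 <= tol
        i = int(np.argmax(D))                    # step (b): best new element
        g = kernel_column(i).astype(float)       # step (c): i-th column of K
        for prev in cols:                        # step (d): remove previous columns
            g = g - prev * prev[i]
        g = g / np.sqrt(D[i])
        cols.append(g)
        D = D - g ** 2                           # step (e): update D
    return np.column_stack(cols) if cols else np.zeros((n, 0))
```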
Let K = GG^T be the incomplete Cholesky decomposition of the kernel matrix K, where r denotes the number of columns of G. Denote M ≜ G^T G and let (I + γM)^{-1} = RR^T be the Cholesky decomposition of (I + γM)^{-1}, where R^T is an upper triangular matrix. Let Ĝ = GR. Then we have

K_\gamma^{-1} = \gamma I_n - \gamma^2 G (I + \gamma M)^{-1} G^T = \gamma I_n - \gamma^2 \hat G \hat G^T,
\beta \triangleq K_\gamma^{-1} 1_n = \gamma [\, 1_n - \gamma \hat G \delta \,],
K_\gamma^{-1} y = \gamma [\, y - \gamma \hat G \tau \,], \qquad d = \gamma [\, \gamma \delta^T \delta - n \,],   (15)

where δ = Ĝ^T 1_n and τ = Ĝ^T y. Let us split Ĝ and β into l blocks as we did for α and y, i.e., Ĝ^T = [Ĝ_(1)^T, Ĝ_(2)^T, …, Ĝ_(l)^T], β^T = [β_(1)^T, β_(2)^T, …, β_(l)^T]. Then

C_{kk} = \gamma I_{n_k} - \gamma^2 \hat G_{(k)} \hat G_{(k)}^T + \beta_{(k)} \beta_{(k)}^T / d.

If the dimension of C_kk is less than or approximately equal to r, one can solve the linear systems directly. Otherwise, to reduce the complexity, one can apply the following inverse formula for C_kk:

C_{kk}^{-1} = \left\{ \gamma I + [\hat G_{(k)} \;\; \beta_{(k)}]\, \mathrm{diag}\!\left(-\gamma^2 I, \tfrac{1}{d}\right) [\hat G_{(k)} \;\; \beta_{(k)}]^T \right\}^{-1}
            = \frac{1}{\gamma} I - [\hat G_{(k)} \;\; \beta_{(k)}] \left( \mathrm{diag}(-I, \gamma^2 d) + \gamma [\hat G_{(k)} \;\; \beta_{(k)}]^T [\hat G_{(k)} \;\; \beta_{(k)}] \right)^{-1} [\hat G_{(k)} \;\; \beta_{(k)}]^T.   (16)

For robust l-fold CV, the storage of C is expensive and we do not recommend computing C. Instead, we compute C{p_(k), p_(k)} directly from the following formula:

C\{p_{(k)}, p_{(k)}\} = \gamma I - \gamma^2 \hat G\{p_{(k)}\} \hat G\{p_{(k)}\}^T + \beta\{p_{(k)}\} \beta\{p_{(k)}\}^T / d,

where p = [p_(1)^T, p_(2)^T, …, p_(l)^T]^T is a permutation vector randomly selected in step (3), and Ĝ{p_(k)} and β{p_(k)} denote the submatrix of Ĝ and the subvector of β, respectively, composed of the rows with index p_(k). Similarly, for LMO-CV, C(ind, ind) can be computed via the above formula by replacing p_(k) with ind. As for solving the linear systems, if the dimension of C_kk is less than or approximately equal to r, we solve the linear systems directly. Otherwise, we have inverse formulas similar to Eq. (16) for C{p_(k), p_(k)} and C(ind, ind) and we can use them to solve the associated linear systems efficiently. Hence, by applying the incomplete Cholesky decomposition technique, the computational complexity of the CV algorithms can be reduced to O(nr²). This is very useful for large data sets, whose kernel matrix usually has a much lower rank than the sample size. A sketch of this computation in code follows.
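A sketch of the approximate computation. The symbols delta, tau and beta follow our reconstruction of Eq. (15) above, and the helper names are ours; only O(nr²) work and O(nr) storage are needed, with any required block of C assembled on demand.

```python
def cv_core_from_ichol(G, y, gamma):
    """From K ~ G G^T, compute alpha, beta = K_gamma^{-1} 1_n, d and G_hat
    via the Woodbury identity, without ever forming K_gamma^{-1}."""
    n, r = G.shape
    ones = np.ones(n)
    M = G.T @ G                                               # r x r
    R = np.linalg.cholesky(np.linalg.inv(np.eye(r) + gamma * M))  # R R^T = (I + gamma M)^{-1}
    G_hat = G @ R                                             # K_gamma^{-1} = gamma I - gamma^2 G_hat G_hat^T
    delta, tau = G_hat.T @ ones, G_hat.T @ y
    beta = gamma * (ones - gamma * (G_hat @ delta))           # K_gamma^{-1} 1_n
    d = gamma * (gamma * delta @ delta - n)                   # d = -1_n^T K_gamma^{-1} 1_n
    alpha = gamma * (y - gamma * (G_hat @ tau)) + (beta @ y / d) * beta
    return alpha, beta, d, G_hat

def C_submatrix(G_hat, beta, d, gamma, idx):
    """C restricted to the rows/columns in idx: C_kk, C{p_(k),p_(k)} or C(ind,ind)."""
    Gk, bk = G_hat[idx, :], beta[idx]
    return gamma * np.eye(len(idx)) - gamma ** 2 * (Gk @ Gk.T) + np.outer(bk, bk) / d
```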
Finally, we remark that the computational reduction depends on the rank of the kernel matrix and thus also on the kernel parameters. If the rank of the kernel matrix is much smaller than the sample size, the computational complexity of the proposed approximate algorithm scales linearly with the sample size. For a small or medium size data set, the kernel matrix may have full or high rank for some kernel parameters but a small rank for others. In order to take advantage of the approximate algorithms, we recommend setting a bound (say, half of the sample size, which is what we use in our simulations) for the incomplete Cholesky decomposition. If the number of columns of G is less than this bound, we apply the approximate algorithm; otherwise, we use the algorithms proposed in Section 3.

5. Extensions to regression problems

Since the least squares support vector machine for regression (LS-SVR) [3] differs from LS-SVM only in that the labels are continuous, the extension of the proposed algorithms to LS-SVR is straightforward: one simply replaces the criterion (i.e., the number of misclassified labels) with the mean square of the predicted response residuals.

Next, we consider the KRR [18] problem. Linear ridge regression is a classical statistical problem which aims to find a linear function that models the dependencies between covariates {x_i}_{i=1}^n and response variables {y_i}_{i=1}^n, both being continuous. The classical way to do so is to minimize the regularized quadratic cost

J(w) = \sum_i (y_i - w^T x_i)^2 + \lambda \|w\|^2,   (17)

where λ is a fixed positive number. KRR is a nonlinear version of ridge regression where the data cases are replaced by their feature vectors, x_i → φ(x_i), with the feature map induced by a kernel. Similar to the case of LS-SVM, we do not actually need access to the feature vectors. The predicted value at a new test point x can be written as [18]

y = \sum_{i=1}^{n} \alpha_i K(x, x_i),

where α can be obtained by solving the following linear system:

(K + \lambda I)\alpha = y.   (18)

Comparing Eq. (18) with Eq. (4), one can see that KRR is a special case of LS-SVR without the bias term b. Theorem 1 still holds for KRR if we let α be the solution of Eq. (18) and C_kk be defined by

\begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1l} \\ C_{12}^T & C_{22} & \cdots & C_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ C_{1l}^T & C_{2l}^T & \cdots & C_{ll} \end{bmatrix} \triangleq (K + \lambda I)^{-1}.   (19)
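For KRR the same machinery simplifies further, since there is no bias term. A minimal sketch, with the fold index sets again given as arrays of indices; the CV criterion for regression is then, e.g., the mean of the squared residuals.

```python
def krr_cv_residuals(K, y, lam, folds):
    """Exact l-fold CV residuals for kernel ridge regression:
    Theorem 1 with C = (K + lambda I)^{-1} (Eq. (19)) and alpha from Eq. (18)."""
    n = K.shape[0]
    C = np.linalg.inv(K + lam * np.eye(n))        # Eq. (19)
    alpha = C @ y                                 # solves (K + lambda I) alpha = y, Eq. (18)
    res = np.empty(n)
    for idx in folds:
        res[idx] = np.linalg.solve(C[np.ix_(idx, idx)], alpha[idx])
    return res   # res[i] = y_i minus the prediction of the fold that left i out
```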
Hence, all the algorithms proposed for LS-SVM can be extended with minor changes to KRR CV, and the complexity analysis is also similar.

6. Experimental results

In this section, we present experimental results on a suite of 13 data sets from the UCI benchmark repository [19]. All of our experiments are carried out on a PC with a Xeon™
3.2 GHz CPU and 2 GB of RAM under the Matlab 7 platform. For model selection and performance evaluation, we use the Gaussian kernel

k(x_i, x_j) = e^{-\|x_i - x_j\|^2/\sigma^2}

and adopt the experimental procedure used in Refs. [20,21], where 100 different random training and test splits are defined (20 in the case of the image and splice data sets). (The training and test splits are available from http://ida.first.gmd.de/raetsch/data/benchmarks.htm.) To compare the efficiency of the proposed algorithms, we use the 20 training sets of image, where each set has 1300 patterns, and the run time is averaged over the 20 cases.

Tables 1 and 2 verify that the complexity of the proposed algorithm is not very sensitive to the number of folds and the number of repetitions for robust multi-fold CV, while the complexity of the naive implementation scales linearly with the number of folds and the number of repetitions. The average run time of the proposed algorithm increases very slowly with the number of repetitions. This makes robust l-fold CV with many repetitions more applicable in practice, since it usually yields a more accurate and stable estimate of the generalization error than l-fold CV. One may note that the average run time of the proposed algorithm does not decrease strictly with the number of folds. Recall that the computational complexity is approximately (1 + 1/(3l²))n³; l is typically greater than 5 and thus 1/(3l²) is a small number. Hence, if the sample size n is not very large, the computations for various folds will be approximately equal.

Tables 3 and 4 demonstrate the efficiency of the approximate algorithms based on incomplete Cholesky decomposition. For clarity, we refer to the efficient algorithm proposed for LOO evaluation in Section 3 as the efficient LOO (E-LOO) algorithm and to the approximate algorithm for LOO evaluation using incomplete Cholesky decomposition as the A-LOO algorithm. From Table 3, one can see that the estimated LOO error rates using A-LOO are very close to the exact LOO error rates evaluated by E-LOO, but A-LOO is much more efficient than E-LOO. Table 3 shows that the computation of A-LOO scales linearly with the sample size n, while that of E-LOO scales cubically. We remark that A-LOO scales linearly with the sample size n only when the rank of the kernel matrix is much smaller than n.

Table 4 shows the performance of LS-SVM with model selection by A-LOO and E-LOO. We use the first five training splits for model selection, and the model is selected by minimizing the mean of the LOO errors on the five training splits using grid search. From Table 4, one can see that A-LOO selects models as well as E-LOO, since the performances are identical or comparable. While A-LOO is much more efficient than E-LOO for most of the data sets, A-LOO is less efficient for the data sets splice, ringnorm, twonorm and waveform. This is because the ranks of the kernel matrices are larger than half of the sample size for most of the kernel parameters in the trial. The efficiency of A-LOO depends closely on the rank of the kernel matrices. In this experiment, we set a bound (half of the sample size) for the incomplete Cholesky decomposition.
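As an illustration of how the fast CV routines can be used for model selection, the following sketch performs a grid search of the kind described above; the grid values, function names and the reuse of lfold_cv_errors from the earlier sketch are our assumptions, not the authors' code. Building the kernel matrix once per σ and reusing it across the γ grid reflects the fact that only the regularization changes K_γ, not the kernel matrix itself.

```python
def gaussian_kernel(X, sigma):
    """k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / sigma ** 2)

def select_model(X, y, folds, sigmas=(0.5, 1, 2, 5, 10), gammas=(0.1, 1, 10, 100)):
    """Pick (sigma, gamma) minimizing the exact l-fold CV error rate
    computed with the fast algorithm."""
    best = (None, None, np.inf)
    for sigma in sigmas:
        K = gaussian_kernel(X, sigma)              # kernel matrix depends only on sigma
        for gamma in gammas:
            err = lfold_cv_errors(K, y, gamma, folds) / len(y)
            if err < best[2]:
                best = (sigma, gamma, err)
    return best
```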
Table 1
The average run time (seconds) of the proposed algorithm and the naive implementation for robust five-fold cross-validation with various numbers of repetitions on the image data set, for one fixed pair of hyper-parameters (σ, γ)

Number of repetitions    5      10     15     20     25     30      35      40      45      50
Naive                    17.6   34.6   51.6   68.5   85.6   102.7   119.8   136.9   153.7   170.8
Proposed                 3.83   4.36   4.81   5.22   5.78   6.2     6.66    7.11    7.64    8.08
Table 2
The average run time (seconds) of the proposed algorithm and the naive implementation for cross-validation with various numbers of folds on the image data set, for one fixed pair of hyper-parameters (σ, γ)

Number of folds    5      10     15     20     25     30     35     40     45     50
Naive              4.0    10.2   16.0   22.0   28.4   34.5   40.3   46.6   52.9   59.2
Proposed           3.49   3.44   3.41   3.42   3.40   3.41   3.39   3.44   3.44   3.41
Table 3
The average run time (seconds) and LOO error rates obtained by applying E-LOO and A-LOO to the banana data set with various numbers of samples and a fixed pair of hyper-parameters (σ, γ)

Samples              500    1000   1500   2000   2500    3000    3500    4000    4500    5000
Time for A-LOO       0.04   0.11   0.19   0.28   0.40    0.50    0.54    0.64    0.76    0.82
Time for E-LOO       0.17   0.94   2.63   5.63   10.34   16.81   25.03   39.54   53.07   71.12
A-LOO error rates    9.66   9.68   10.02  9.70   9.40    9.37    9.34    9.47    9.46    9.46
E-LOO error rates    9.60   9.74   9.87   9.63   9.36    9.32    9.35    9.48    9.48    9.45
Table 4
Performance of LS-SVM with model selection by A-LOO and E-LOO, where Ntr is the number of training samples, ErA and ErE denote the test error rates with the model selected by A-LOO and E-LOO, respectively, and TmA and TmE denote the running times (minutes) of A-LOO and E-LOO, respectively

Data set     Ntr    TmA       ErA             TmE       ErE
Banana       400    0.0151    10.36 ± 0.43    0.8534    10.36 ± 0.43
B. cancer    200    0.01115   25.87 ± 4.51    0.1612    25.87 ± 4.51
Diabetes     468    0.65      23.16 ± 1.79    1.2357    23.16 ± 1.79
F. solar     666    0.0901    33.51 ± 1.58    2.8899    34.12 ± 1.60
German       700    2.8307    23.39 ± 2.16    3.0471    23.39 ± 2.16
Heart        170    0.10      16.08 ± 3.00    0.1070    15.92 ± 3.11
Image        1300   1.7735    2.74 ± 0.51     13.6555   2.74 ± 0.51
Ringnorm     400    0.8026    1.43 ± 0.09     0.7516    1.43 ± 0.09
Splice       1000   10.01     10.67 ± 0.66    7.1167    10.67 ± 0.66
Thyroid      140    0.0235    3.99 ± 1.96     0.0651    4.00 ± 2.02
Titanic      150    0.006     22.82 ± 0.77    0.0716    22.82 ± 0.77
Twonorm      400    0.7878    2.38 ± 0.12     0.750     2.38 ± 0.12
Waveform     400    0.8060    9.67 ± 0.37     0.7516    9.67 ± 0.37
For each kernel parameter in the trial, if the number of columns of G exceeds this bound, we stop the incomplete Cholesky decomposition and apply E-LOO. Finally, we remark that the performance of LS-SVM is comparable with the performances of SVM, KFD and the other methods reported in Ref. [21].

7. Conclusions

Based on the inverse of the system matrix, this paper presents a new formula for computing the predicted responses of the left-out examples in l-fold and LMO CV of LS-SVM. Some novel CV algorithms are developed using this formula. The proposed algorithms are generally more efficient than the naive implementation. In particular, for robust l-fold CV with many repetitions, the computational reduction increases with the number of folds and the number of repetitions. For large data sets, by applying the incomplete Cholesky decomposition technique, we propose approximate algorithms that scale linearly with the data size. Extensions to the KRR problem are also discussed.

Appendix A: Proof of Theorem 1

First, we prove that Eq. (8) is true when k = l. Denote

K = \begin{bmatrix} K_{11} & K_{12} \\ K_{12}^T & K_{22} \end{bmatrix}, \qquad A = \begin{bmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{bmatrix},   (A1)

where K_{11} ∈ R^{(n−n_l)×(n−n_l)}, K_{12} ∈ R^{(n−n_l)×n_l}, K_{22} ∈ R^{n_l×n_l} and

A_{11} = \begin{bmatrix} 0 & 1_{n-n_l}^T \\ 1_{n-n_l} & K_{11} + \frac{1}{\gamma} I_{n-n_l} \end{bmatrix}, \quad
A_{12} = \begin{bmatrix} 1_{n_l}^T \\ K_{12} \end{bmatrix}, \quad
A_{22} = K_{22} + \frac{1}{\gamma} I_{n_l}.   (A2)

To train the classifier after leaving the lth group out, one needs to solve the following linear system:

\begin{bmatrix} 0 & 1_{n-n_l}^T \\ 1_{n-n_l} & K_{11} + \frac{1}{\gamma} I_{n-n_l} \end{bmatrix} \begin{bmatrix} \hat b \\ \hat\alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \hat y \end{bmatrix},   (A3)

i.e.,

A_{11} \begin{bmatrix} \hat b \\ \hat\alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \hat y \end{bmatrix},   (A4)

where ŷ = [y_(1)^T, y_(2)^T, …, y_(l−1)^T]^T. Then the classifier is y^(l)(x) = sign[g_l(x)], where

g_l(x) = \sum_{i=1}^{n-n_l} \hat\alpha_i K(x, x_i) + \hat b
       = [1, K(x, x_1), \dots, K(x, x_{n-n_l})] \begin{bmatrix} \hat b \\ \hat\alpha \end{bmatrix}
       = [1, K(x, x_1), \dots, K(x, x_{n-n_l})] A_{11}^{-1} \begin{bmatrix} 0 \\ \hat y \end{bmatrix}.   (A5)

Therefore

\begin{bmatrix} g_l(x_{l,1}) \\ g_l(x_{l,2}) \\ \vdots \\ g_l(x_{l,n_l}) \end{bmatrix} = A_{12}^T A_{11}^{-1} \begin{bmatrix} 0 \\ \hat y \end{bmatrix}.   (A6)

The inverse of a block matrix is given by

\begin{bmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} F_{11}^{-1} & -A_{11}^{-1} A_{12} F_{22}^{-1} \\ -F_{22}^{-1} A_{12}^T A_{11}^{-1} & F_{22}^{-1} \end{bmatrix},   (A7)

where

F_{11} = A_{11} - A_{12} A_{22}^{-1} A_{12}^T, \qquad F_{22} = A_{22} - A_{12}^T A_{11}^{-1} A_{12}.   (A8)

Hence

\alpha_{(l)} = F_{22}^{-1} \left( y_{(l)} - A_{12}^T A_{11}^{-1} \begin{bmatrix} 0 \\ \hat y \end{bmatrix} \right)
             = F_{22}^{-1} \left( y_{(l)} - \begin{bmatrix} g_l(x_{l,1}) \\ g_l(x_{l,2}) \\ \vdots \\ g_l(x_{l,n_l}) \end{bmatrix} \right)
             = F_{22}^{-1} \varepsilon_{(l)}.   (A9)

From Eqs. (A7) and (7), F_{22}^{-1} = C_{ll} and thus Eq. (8) is true for k = l.

Next, we prove Eq. (8) for any k < l. We permute the kth group to the last position while retaining the order of the other groups. The new kernel matrix K̄ and label vector ȳ are then K̄ = P_k K P_k^T and ȳ = P_k y, where P_k is the permutation matrix

P_k = \begin{bmatrix}
I_{n_1} & & & & & & \\
 & \ddots & & & & & \\
 & & I_{n_{k-1}} & & & & \\
 & & & & I_{n_{k+1}} & & \\
 & & & & & \ddots & \\
 & & & & & & I_{n_l} \\
 & & & I_{n_k} & & &
\end{bmatrix}.   (A10)

Define

\bar A \triangleq \begin{bmatrix} 0 & 1_n^T \\ 1_n & \bar K + \frac{1}{\gamma} I_n \end{bmatrix} \quad \text{and denote} \quad \bar A = \begin{bmatrix} \bar A_{11} & \bar A_{12} \\ \bar A_{12}^T & \bar A_{22} \end{bmatrix},   (A11)

where Ā_{11} ∈ R^{(n−n_k+1)×(n−n_k+1)}, Ā_{12} ∈ R^{(n−n_k+1)×n_k}, Ā_{22} ∈ R^{n_k×n_k}. Then

\bar A = \begin{bmatrix} 1 & 0 \\ 0 & P_k \end{bmatrix} A \begin{bmatrix} 1 & 0 \\ 0 & P_k^T \end{bmatrix}   (A12)

and

\begin{bmatrix} \bar A_{11} & \bar A_{12} \\ \bar A_{12}^T & \bar A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} \bar F_{11}^{-1} & -\bar A_{11}^{-1} \bar A_{12} \bar F_{22}^{-1} \\ -\bar F_{22}^{-1} \bar A_{12}^T \bar A_{11}^{-1} & \bar F_{22}^{-1} \end{bmatrix},   (A13)

where

\bar F_{11} = \bar A_{11} - \bar A_{12} \bar A_{22}^{-1} \bar A_{12}^T, \qquad \bar F_{22} = \bar A_{22} - \bar A_{12}^T \bar A_{11}^{-1} \bar A_{12}.   (A14)

By similar manipulations, one can conclude that

\alpha_{(k)} = \bar F_{22}^{-1} \varepsilon_{(k)}.   (A15)

Since

\bar A^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & P_k \end{bmatrix} A^{-1} \begin{bmatrix} 1 & 0 \\ 0 & P_k^T \end{bmatrix},   (A16)

from Eqs. (A13) and (7) one can conclude that F̄_{22}^{-1} = C_{kk}, and thus Eq. (8) is true for any k < l.

References
[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[2] J. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (1999) 293–300.
[3] J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[4] T. Van Gestel, J.A. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, J. Vandewalle, Benchmarking least squares support vector machine classifiers, Mach. Learn. 54 (2004) 5–32.
[5] J. Shao, Linear model selection by cross-validation, J. Am. Stat. Assoc. 88 (1993) 486–494.
[6] V. Vapnik, O. Chapelle, Bounds on error expectation for support vector machines, Neural Comput. 12 (9) (2000) 2013–2036.
[7] G. Wahba, Y. Lin, H. Zhang, Generalized approximate cross-validation for support vector machines, in: A. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 2000, pp. 297–309.
[8] M. Opper, O. Winther, Gaussian processes and SVM: mean field and leave-one-out, in: A. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 2000, pp. 311–326.
[9] T. Jaakkola, D. Haussler, Probabilistic kernel regression models, in: Proceedings of the 1999 Conference on AI and Statistics, 1999.
[10] O. Chapelle, V. Vapnik, Model selection for support vector machines, in: Advances in Neural Information Processing Systems, 1999.
[11] Y. Zhao, C.K. Kwoh, Fast leave-one-out evaluation and improvement on inference for LS-SVMs, in: Proceedings of the 17th International Conference on Pattern Recognition, 2004.
[12] G. Cawley, N. Talbot, Fast exact leave-one-out cross-validation of sparse least-squares support vector machines, Neural Networks 17 (10) (2004) 1467–1475.
[13] G.C. Cawley, N. Talbot, Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers, Pattern Recognition 36 (2003) 2585–2592.
[14] W. Press, S. Teukolsky, W. Vetterling, B. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge, 1993.
[15] F.R. Bach, M.I. Jordan, Kernel independent component analysis, J. Mach. Learn. Res. 3 (2002) 1–48.
[16] S. Fine, K. Scheinberg, Efficient SVM training using low-rank kernel representations, J. Mach. Learn. Res. 2 (2001) 243–264.
[17] F.R. Bach, M.I. Jordan, Predictive low-rank decomposition for kernel methods, in: Proceedings of the 22nd International Conference on Machine Learning (ICML05), 2005.
[18] C. Saunders, A. Gammerman, V. Vovk, Ridge regression learning algorithm in dual variables, in: Proceedings of the 15th International Conference on Machine Learning (ICML98), Madison, WI, 1998, pp. 515–521.
[19] C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/mlearn/MLRepository.html, Department of Information and Computer Science, University of California, Irvine, CA, 1998.
[20] G. Rätsch, T. Onoda, K.-R. Müller, Soft margin for AdaBoost, Mach. Learn. 42 (2001) 287–320.
[21] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller, Fisher discriminant analysis with kernels, in: Y.-H. Hu, J. Larsen, E. Wilson, S. Douglas (Eds.), Neural Networks for Signal Processing IX, IEEE, 1999, pp. 41–48.