Applied Mathematics and Computation 214 (2009) 329–335
Optimal reduction of solutions for support vector machines

Hwei-Jen Lin, Jih Pin Yeh*

Department of Computer Science and Information Engineering, Tamkang University, 151 Ying-Chuan Road, Tamsui, Taipei County 25137, Taiwan, ROC

*Corresponding author. E-mail addresses: [email protected] (H.-J. Lin), [email protected] (J.P. Yeh).
Keywords: Support vector machine; Vector correlation; Genetic algorithms; Optimal solution; Discriminant function; Pattern recognition
Abstract: Although the support vector machine (SVM) is a universal learning machine, it suffers from expensive computational cost in the test phase due to the large number of support vectors, which greatly impacts its practical use. To address this problem, we propose an adaptive genetic algorithm that optimally reduces the solution of an SVM by selecting vectors from the trained support vector solution, such that the selected vectors best approximate the original discriminant function. Our method can be applied to SVMs using any general kernel. The size of the reduced set can be chosen adaptively based on the requirements of the task, so the generalization/complexity trade-off can be controlled directly. The lower bound on the number of selected vectors required to recover the original discriminant function can also be determined. © 2009 Elsevier Inc. All rights reserved.
1. Introduction

Support vector machines (SVMs) were developed by Vapnik [19,20] and are gaining popularity due to their many attractive features and promising empirical performance. Since the decision surface of an SVM is parameterized by a large set of support vectors and corresponding weights, the machine is considerably slower in the test phase than other learning machines such as neural networks and decision trees [1,2,11,13]. Many methods have been proposed to address this disadvantage by approximating the discriminant function using a reduced set of vectors. Burges [2] provided an explicit way to find such a reduced set of vectors. Thies [18] followed the idea of Burges and constructed another reduced set of vectors by means of PCA (Principal Component Analysis). However, the vectors produced by both methods are generally not support vectors, and their methods can only be applied effectively to SVMs using quadratic kernels. Downs [6] simplified the solution of support vectors according to their linear dependency without any loss of generalization, but was not able to further reduce the solution when an approximation error is allowed. Guo [7] evaluated the contribution of each support vector to the shape of the decision surface by retraining the SVM on the training sample with that support vector removed and then evaluating the distance between the newly trained discriminant function and the removed vector. Support vectors with smaller contributions were removed from the training sample and the SVM was retrained on the remaining sample. The result was a new set of support vectors, which is, as they claimed without any theoretical proof, of smaller size. Li et al. [12] constructed a reduced set of so-called feature vectors, selected from the support vector solution based on the vector correlation principle and a greedy algorithm. However, the solutions produced by their greedy strategy were not guaranteed to be optimal.

In this paper, we propose an adaptive approach for optimally reducing the support vector solution using a more efficient and more effective search algorithm based on a more reasonable fitness. Searching for an optimal subset of vectors selected from the support vector solution is, generally speaking, an expensive combinatorial optimization problem for large sets of support vectors. In this work, we use a genetic algorithm to search for the optimal subset of a given number of support vectors. The subset produced by the genetic algorithm can be used to best approximate the original discriminant function.
[email protected] (H.-J. Lin),
[email protected] (J.P. Yeh). 0096-3003/$ - see front matter Ó 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.amc.2009.04.010
330
H.-J. Lin, J.P. Yeh / Applied Mathematics and Computation 214 (2009) 329–335
The number of selected vectors can be chosen adaptively according to the requirements of the task. Therefore, with our proposed method the generalization/complexity trade-off can be controlled directly. In addition, the lower bound on the number of selected vectors required to recover the original discriminant function can be evaluated to help determine the number of selected support vectors.

The paper is organized as follows: In the next section, we briefly introduce the SVM model. The criterion for an optimal reduced set of support vectors and some related details are described in Section 3. In Section 4 we present an efficient optimal reduction of support vectors using a genetic algorithm and its implementation strategy. In Section 5 we describe the experimental results. Finally, we give our concluding remarks in Section 6.

2. Support vector machines

Throughout this paper, we concern ourselves only with the two-class pattern recognition problem. However, the proposed method can also be applied to other problems where the support vector algorithm is used, e.g., nonlinear regression. An SVM uses a kernel to map the data from the input space to a high-dimensional feature space, in which the problem can be solved in linear form [3,9,16,17,18]. Given the training sample $\{(x_i, y_i) : x_i \in \mathbb{R}^d,\ y_i \in \{-1, +1\}\}_{i=1}^{N}$, the two-class pattern recognition problem can be cast as the primal problem of finding a hyperplane:
$$ w^T x + b = 0, \qquad (1) $$
where $w$ is a $d$-dimensional normal vector, such that the two classes can be separated by two margins parallel to the hyperplane; that is, for each $x_i$, $i = 1, 2, \ldots, N$:

$$ w^T x_i + b \ge +1 - \xi_i \quad \text{for } y_i = +1, \qquad (2) $$
$$ w^T x_i + b \le -1 + \xi_i \quad \text{for } y_i = -1, \qquad (3) $$
where $\xi_i \ge 0$, $i = 1, 2, \ldots, N$, are slack variables and $b$ is the bias. This primal problem can easily be cast as the following quadratic optimization problem [5]:
$$ \text{minimize} \quad \Psi(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i, \qquad (4) $$
subject to $y_i[(w \cdot x_i) + b] \ge 1 - \xi_i$, where $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$. The objective of a support vector machine is to determine the optimal $w$ and optimal bias $b$ such that the corresponding hyperplane separates the positive and negative training data with maximum margin and produces the best generalization performance. This hyperplane is called an optimal separating hyperplane. Using Lagrange multiplier techniques leads to the following dual optimization problem [15].

Dual optimization problem: Given the training sample $\{(x_i, y_i) : x_i \in \mathbb{R}^d,\ y_i \in \{-1, +1\}\}_{i=1}^{N}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{N}$ that minimize the objective function
$$ Q(\alpha) = -\sum_{i=1}^{N} \alpha_i + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \qquad (5) $$

subject to

$$ \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C \quad \text{for } i = 1, 2, 3, \ldots, N, $$
where $C$ is a user-specified positive constant and $K(\cdot, \cdot)$ is a positive semidefinite kernel function. For any positive semidefinite kernel $K$ on a sample space $X$, there is a Hilbert space $H$ with inner product $\langle \cdot, \cdot \rangle$ and a feature mapping $\phi: X \to H$ such that
$$ K(x, x') = \langle \phi(x), \phi(x') \rangle, \qquad (x, x' \in X). \qquad (6) $$
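As a concrete instance of (6), the radial basis function (RBF) kernel used later in the experiments can be written as a short Python function. This is only an illustrative sketch; the function name and the use of NumPy are our choices, not part of the paper.

```python
import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    """RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)).

    This kernel is positive semidefinite, so it corresponds to an inner
    product <phi(x), phi(x')> in some feature space H, as in Eq. (6)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# Example: kernel value between two 2-D points.
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=2.0))
```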
Let $\{\alpha_i\}_{i=1}^{N}$ be an optimal solution of the above problem; then the discriminant function takes the form:

$$ f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b. \qquad (7) $$
It is apparent from the constraint in (5) that $0 \le \alpha_i \le C$ holds for $i = 1, 2, \ldots, N$. All training samples $x_i$ such that $\alpha_i > 0$ are called support vectors. To distinguish between support vectors with $0 < \alpha_i < C$ and those with $\alpha_i = C$, the former are called unbounded support vectors while the latter are called bounded ones [9,10]. In the following, we assume without loss of generality that $0 < \alpha_i \le C$ for $i = 1, 2, \ldots, n$, and $\alpha_i = 0$ for $i = n+1, n+2, \ldots, N$. Thus, the discriminant function in (7) can be written as follows:
$$ f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b. \qquad (8) $$
For simplicity, in what follows we let $a_i = \alpha_i y_i$, so that $a_i \in [-C, C] \setminus \{0\}$ and the discriminant function given in (8) can be written in an even simpler form:

$$ f(x) = b + \sum_{i=1}^{n} a_i K(x, x_i). \qquad (9) $$
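To make the test-phase cost of (9) concrete, the following Python sketch evaluates $f(x)$ from a trained scikit-learn SVC, whose dual_coef_ attribute stores the products $a_i = \alpha_i y_i$. The toy dataset and parameter values are placeholders, not those used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (placeholder, not the paper's datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

sigma = 1.0
clf = SVC(kernel="rbf", C=10.0, gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)

def decision_function(x, clf, sigma):
    """Evaluate f(x) = b + sum_i a_i K(x, x_i) over the n support vectors,
    as in Eq. (9); a_i = alpha_i * y_i is stored in clf.dual_coef_."""
    sv = clf.support_vectors_              # x_1, ..., x_n
    a = clf.dual_coef_.ravel()             # a_i = alpha_i * y_i
    b = clf.intercept_[0]
    k = np.exp(-np.sum((sv - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return b + np.dot(a, k)

x_test = np.array([0.5, -0.2])
print(decision_function(x_test, clf, sigma),
      clf.decision_function(x_test.reshape(1, -1))[0])  # should agree
```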
3. Reduction of solutions for SVM

The time taken for computing (9) is proportional to the number $n$ of support vectors, which may be quite large. In this work, we focus on finding a reduced set of support vectors that best approximates or even recovers the original discriminant function given in (9), where $K$ is a positive semidefinite kernel on the input space $X = \mathbb{R}^d$, and the support vectors $x_i \in X$ and the multipliers $a_i \in \mathbb{R} \setminus \{0\}$ are obtained from the SVM solution.

3.1. Approximation of support vectors

Let $S = \{x_1, x_2, \ldots, x_n\}$ be the support vectors for a sample obtained by an SVM and let $\{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_r\} \subseteq S$ be a subset of vectors selected from $S$. If the mapped selected support vectors $\{\phi(\tilde{x}_1), \phi(\tilde{x}_2), \ldots, \phi(\tilde{x}_r)\}$ span the vector space spanned by the mapped support vectors $\{\phi(x_1), \phi(x_2), \ldots, \phi(x_n)\}$, then any mapped support vector $\phi(x_i)$ can be exactly expressed as a linear combination of the mapped selected vectors: $\phi(x_i) = \sum_{j=1}^{r} \beta_{ij} \phi(\tilde{x}_j)$. The discriminant function shown in (9) can be written as:
$$ f(x) = b + \langle \phi(x), w \rangle, \qquad (10) $$

where the normal vector $w = \sum_{i=1}^{n} a_i \phi(x_i)$. For a reduced set of selected support vectors $X_F = \{x_{F_1}, x_{F_2}, \ldots, x_{F_m}\}$, we are looking for coefficients $\beta_{ij} \in \mathbb{R}$, $1 \le i \le n$, $1 \le j \le m$, such that each mapped support vector $\phi_i = \phi(x_i)$ is best approximated by a linear combination of the mapped selected vectors $\phi(X_F)$:

$$ \phi_i \approx \tilde{\phi}_i = \sum_{j=1}^{m} \beta_{ij} \phi(x_{F_j}). \qquad (11) $$
Note that any $d$-dimensional vector can be regarded as a $d \times 1$ column. If we let $\Phi_F = [\phi(x_{F_1})\ \phi(x_{F_2})\ \cdots\ \phi(x_{F_m})]$ and $\vec{\beta}_i = [\beta_{i1}\ \beta_{i2}\ \cdots\ \beta_{im}]^T$, then the approximating mapped vectors can be expressed as $\tilde{\phi}_i = \Phi_F \vec{\beta}_i$, with approximation error:

$$ \delta_{F_i} = \|\phi_i - \tilde{\phi}_i\|^2 = \|\phi_i - \Phi_F \vec{\beta}_i\|^2 = (\phi_i - \Phi_F \vec{\beta}_i)^T (\phi_i - \Phi_F \vec{\beta}_i) = \phi_i^T \phi_i - 2 \phi_i^T \Phi_F \vec{\beta}_i + \vec{\beta}_i^T \Phi_F^T \Phi_F \vec{\beta}_i. \qquad (12) $$

The coefficient vector $\vec{\beta}_i$ such that $\tilde{\phi}_i = \Phi_F \vec{\beta}_i$ best approximates $\phi_i$ (i.e., has the minimum approximation error $\delta_{F_i}$) is simply a least-squares solution of the linear system $\phi_i = \Phi_F \vec{\beta}$, or an exact solution of its normal system $\Phi_F^T \phi_i = \Phi_F^T \Phi_F \vec{\beta}$, which has the unique solution given in (13) if $(\Phi_F^T \Phi_F)^{-1}$ exists. We guarantee the existence of $(\Phi_F^T \Phi_F)^{-1}$ by forcing each reduced set taken into account to be linearly independent, as described in Section 4:

$$ \vec{\beta}_i = (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i. \qquad (13) $$
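Although $\phi_i$ and $\Phi_F$ live in the (possibly infinite-dimensional) feature space $H$, every quantity in (13) reduces to kernel values, since $\Phi_F^T \Phi_F = K(x_F, x_F)$ and $\Phi_F^T \phi_i = K(x_F, x_i)$. The Python sketch below illustrates this computation; the helper names and the plain linear solve are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kernel_matrix(A, B, kernel):
    """Matrix of kernel values K(a, b) for a in A, b in B."""
    return np.array([[kernel(a, b) for b in B] for a in A])

def reduced_set_coefficients(S, F_idx, kernel):
    """Coefficients beta_i of Eq. (13) for every support vector x_i in S,
    computed purely from kernel evaluations:
        Phi_F^T Phi_F  -> K(x_F, x_F)   (m x m)
        Phi_F^T phi_i  -> K(x_F, x_i)   (m x 1)
    Returns an (n x m) array whose i-th row is beta_i."""
    X_F = [S[j] for j in F_idx]
    K_FF = kernel_matrix(X_F, X_F, kernel)   # Phi_F^T Phi_F
    K_FS = kernel_matrix(X_F, S, kernel)     # columns are Phi_F^T phi_i
    # Solve K_FF beta_i = K_FS[:, i] for all i at once (K_FF assumed invertible).
    return np.linalg.solve(K_FF, K_FS).T
```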
In such a way, the discriminant function can be approximated by
$$ \tilde{f}(x) = b + \langle \phi(x), w_F \rangle, \qquad (14) $$

where

$$ w_F = \sum_{i=1}^{n} a_i \tilde{\phi}_i = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i \beta_{ij} \phi(x_{F_j}). \qquad (15) $$
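Regrouping (15) by the selected vectors gives $\tilde{f}(x) = b + \sum_{j=1}^{m} \gamma_j K(x, x_{F_j})$ with $\gamma_j = \sum_{i} a_i \beta_{ij}$, so the test-time cost drops from $n$ to $m$ kernel evaluations. The sketch below, which reuses reduced_set_coefficients from the previous sketch under the same assumptions, makes this explicit.

```python
import numpy as np

def reduced_discriminant(S, a, b, F_idx, kernel):
    """Build the approximate discriminant f~(x) of Eqs. (14)-(15).

    Regrouping (15) by the selected vectors gives
        f~(x) = b + sum_j gamma_j K(x, x_Fj),  gamma_j = sum_i a_i beta_ij,
    so only m kernel evaluations are needed per test point."""
    beta = reduced_set_coefficients(S, F_idx, kernel)   # (n x m), Eq. (13)
    gamma = np.asarray(a, dtype=float) @ beta           # (m,)
    X_F = [S[j] for j in F_idx]

    def f_tilde(x):
        k = np.array([kernel(x, x_f) for x_f in X_F])
        return b + float(gamma @ k)

    return f_tilde
```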
As described in (12) and (15), the normal vector $w$ is approximated by $w_F$ when the reduced set $X_F$ is used, in which the $i$-th term $a_i \phi_i$ is approximated by $a_i \tilde{\phi}_i$ with approximation error $\|a_i \phi_i - a_i \tilde{\phi}_i\|^2 = a_i^2 \|\phi_i - \tilde{\phi}_i\|^2 = a_i^2 \delta_{F_i}$.

3.2. Optimal approximation of support vectors

As shown in (16), we define the approximation error $\delta_F$ of the approximating discriminant function $\tilde{f}(x)$ as the sum of the approximation errors $a_i^2 \delta_{F_i}$, and define the fitness of the corresponding reduced set $X_F$ as the negative of this error, as shown in (17):
$$ \delta_F = \sum_{x_i \in S} a_i^2 \delta_{F_i}, \qquad (16) $$
$$ \mathrm{Fit}(X_F) = -\delta_F. \qquad (17) $$
Substituting (12) and (13) into (16) yields:

$$ \delta_F = \sum_{x_i \in S} a_i^2 \left( \phi_i^T \phi_i - \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i \right) = \sum_{x_i \in S} a_i^2 \phi_i^T \phi_i - \sum_{x_i \in S} a_i^2 \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i. \qquad (18) $$

Note that minimizing $\delta_F$ is equivalent to maximizing the summation $\sum_{x_i \in S} a_i^2 \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i$, since $\sum_{x_i \in S} a_i^2 \phi_i^T \phi_i$ is a constant. Thus, we may modify the fitness to:

$$ \mathrm{Fit}'(X_F) = \sum_{x_i \in S} a_i^2 \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i. \qquad (19) $$
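Like (13), the fitness (19) involves only inner products and can therefore be evaluated from kernel values, using $\phi_i^T \Phi_F = K(x_i, x_F)$ and $\Phi_F^T \Phi_F = K(x_F, x_F)$. A minimal sketch follows, reusing the kernel_matrix helper introduced earlier; the function name is our own.

```python
import numpy as np

def fitness_prime(S, a, F_idx, kernel):
    """Fit'(X_F) of Eq. (19), computed from kernel values only:
        sum_i a_i^2 * K(x_i, x_F) (K(x_F, x_F))^{-1} K(x_F, x_i)."""
    X_F = [S[j] for j in F_idx]
    K_FF = kernel_matrix(X_F, X_F, kernel)   # Phi_F^T Phi_F        (m x m)
    K_FS = kernel_matrix(X_F, S, kernel)     # Phi_F^T phi_i, one column per i
    # Projection term phi_i^T Phi_F (Phi_F^T Phi_F)^{-1} Phi_F^T phi_i for each i.
    proj = np.sum(K_FS * np.linalg.solve(K_FF, K_FS), axis=0)
    return float(np.dot(np.asarray(a, dtype=float) ** 2, proj))
```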
Our objective is, for a given number $m$, to find a reduced set $X_F$ of $m$ support vectors that minimizes the error $\delta_F$, or equivalently maximizes the fitness $\mathrm{Fit}'(X_F)$, which, unlike the fitness defined by Li et al., directly reflects the distance (or error) between the approximating discriminant function and the original discriminant function. The following theorems aid the later discussion. The proof of the former can be found in any elementary linear algebra textbook.

Theorem 1. Let $A$ be an $r \times c$ matrix. Then the column vectors of $A$ are linearly dependent if and only if $\mathrm{rank}(A) < c$.

Theorem 2. Let $A$ be an $r \times c$ matrix. Then $\mathrm{rank}(A) = \mathrm{rank}(A^T A)$.

Proof. Let $AX = 0$ be a homogeneous system of linear equations, where $A$ is its coefficient matrix, $X$ is a $c \times 1$ matrix of unknowns, and $0$ is an $r \times 1$ zero matrix. Since $AX = 0$ is consistent, its solution space is the same as that of its normal system $A^T A X = 0$; that is, $A^T A$ and $A$ have the same nullity. But since $A^T A$ and $A$ have the same number of columns, they must also have the same rank. □

It is helpful to evaluate the dimension of the space spanned by the mapped support vectors, which is equal to the rank of the matrix $\Phi$ formed by them, where $\Phi = [\phi(x_1)\ \phi(x_2)\ \cdots\ \phi(x_n)]$; that is, $\dim(\mathrm{span}\{\phi(x_1), \phi(x_2), \ldots, \phi(x_n)\}) = \mathrm{rank}(\Phi)$. Although we cannot access the entries of $\Phi$ directly, we may obtain its rank, according to Theorem 2, by evaluating the rank of the matrix $\Phi^T \Phi$ (of size $n \times n$), whose entries can be acquired through the kernel function $K$: $(\Phi^T \Phi)_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j)$. The value of $\mathrm{rank}(\Phi)$ is the minimum size of the reduced set of support vectors required to recover the original discriminant function. Thus, if we desire an approximation with no error, we may set the size $m$ of the reduced set to any number not less than $\mathrm{rank}(\Phi)$.

4. Encoding scheme of chromosomes

We adopt a genetic algorithm to search for an optimal reduced set of support vectors, that is, a reduced set with the minimum approximation error. Genetic algorithms (GAs) are stochastic search methods based on the principles of natural genetic systems [8]. The basic idea is to maintain a population of candidate solutions that evolves and improves over time through a process of competition and controlled variation. However, such algorithms are not guaranteed to converge to the global optimum as the number of iterations increases, and they may become trapped in a local optimum. We adopt the variant proposed by Chakraborty et al. [4], called the genetic algorithm with elitism (EGA), as a way to mitigate this general problem of probabilistic search methods. An EGA is a genetic algorithm with the strategy of replacing the worst string of the new population with the best string of the current population. It has been shown that, with a suitable choice of population size and mutation probability, an EGA achieves good objective function values quite early and does not seem to get trapped in a local optimum.

We now define the set of individuals in the population generated during $T$ generation cycles as $P(t) = \{C_i^t \mid i = 1, 2, \ldots, p\}$, $1 \le t \le T$, where $p$ is the number of individuals, or the population size. Each individual, representing a reduced set of $m$ vectors, is encoded as a chromosome $C_i^t = c_{i1}^t c_{i2}^t \cdots c_{im}^t$, where $c_{ij}^t$ is a binary string of length $k = \lceil \log_2 n \rceil$ corresponding to the index of a support vector.
For example, assume that $m = 3$, $k = 4$, and $C_i^t = 0110\,0111\,1010$; then $c_{i1}^t = 0110_2 = 6$, $c_{i2}^t = 0111_2 = 7$, and $c_{i3}^t = 1010_2 = 10$, which means that the chromosome $C_i^t$ corresponds to the reduced set $\{x_6, x_7, x_{10}\}$. Let the chromosome $C_F$ represent the reduced set $X_F$. According to Theorems 1 and 2, $\mathrm{rank}(\Phi_F^T \Phi_F) < m$ if and only if $X_F$ is linearly dependent. To force the GA to search only over linearly independent sets $X_F$, so that $(\Phi_F^T \Phi_F)^{-1}$ exists, the fitness of the corresponding chromosome $C_F$, $\mathrm{fitness}(C_F)$, is simply set to 0 if $X_F$ is linearly dependent; otherwise, $\mathrm{fitness}(C_F) = \mathrm{Fit}'(X_F)$, as shown in (20):
$$ \mathrm{fitness}(C_F) = \begin{cases} 0 & \text{if } \mathrm{rank}(\Phi_F^T \Phi_F) < m, \\ \sum_{x_i \in S} a_i^2 \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i & \text{otherwise.} \end{cases} \qquad (20) $$
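One possible way to decode a chromosome and evaluate (20) is sketched below, using a numerical rank test on $K(x_F, x_F) = \Phi_F^T \Phi_F$ to detect linear dependence and reusing kernel_matrix and fitness_prime from the earlier sketches. Details not stated in the text, such as taking indices modulo $n$, are assumptions for illustration.

```python
import numpy as np

def decode_chromosome(C, m, k, n):
    """Split a binary string C into m fields of k bits; each field is the
    index of a support vector (taken modulo n to stay in range)."""
    return [int(C[j * k:(j + 1) * k], 2) % n for j in range(m)]

def chromosome_fitness(C, S, a, m, k, kernel):
    """fitness(C_F) of Eq. (20): 0 if X_F is linearly dependent
    (rank(Phi_F^T Phi_F) < m), otherwise Fit'(X_F) of Eq. (19)."""
    F_idx = decode_chromosome(C, m, k, len(S))
    X_F = [S[j] for j in F_idx]
    K_FF = kernel_matrix(X_F, X_F, kernel)   # Phi_F^T Phi_F
    if np.linalg.matrix_rank(K_FF) < m:      # X_F linearly dependent
        return 0.0
    return fitness_prime(S, a, F_idx, kernel)
```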
The proposed GA-based search algorithm, given at the end of this section, adopts proportional selection, one-point crossover with crossover rate $p_c$, and point mutation with mutation rate $p_m$. During the evolution, it keeps track of the best-so-far
chromosome, i.e., the chromosome with the best fitness found so far. The change in fitness or a fixed number of generation cycles can be used as the stopping criterion, depending on the user's requirements. The algorithm returns the chromosome best with the maximum fitness value, which corresponds to a reduced set with the minimum approximation error.

Reduction of Solutions for SVM
Step 1. // Initialization
  t := 1;
  Randomly generate an initial population P(t) := {C_i^t | i = 1, 2, ..., p};
  best := C_1^t;
Step 2. // Storing the best solution (EGA)
  Evaluate fitness for each C_i^t in P(t);
  max := arg max_{1<=i<=p} fitness(C_i^t);
  min := arg min_{1<=i<=p} fitness(C_i^t);
  if fitness(C_max^t) > fitness(best) then best := C_max^t else C_min^t := best;
Step 3. // Genetic operations
  Perform proportional selection, one-point crossover, and point mutation on P(t) to produce P(t + 1);
Step 4.
  t := t + 1;
  if the stopping criteria are met then return best and stop else goto Step 2;
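A compact Python rendering of Steps 1–4 is given below as a sketch; the operators follow textbook definitions of proportional selection, one-point crossover, and point mutation, and the default population size, rates, and fixed-cycle stopping rule are placeholders rather than the authors' exact settings.

```python
import random

def ega_search(fitness, m, k, pop_size=50, pc=0.5, pm=0.05, generations=200):
    """Elitist GA over chromosomes of m fields of k bits each (Steps 1-4).
    `fitness` maps a bit string of length m*k to a real value; pop_size is
    assumed even for pairwise crossover."""
    length = m * k
    rand_chrom = lambda: "".join(random.choice("01") for _ in range(length))
    P = [rand_chrom() for _ in range(pop_size)]                    # Step 1
    best = P[0]
    best_score = fitness(best)
    for _ in range(generations):                                   # fixed-cycle stopping rule
        scores = [fitness(C) for C in P]                           # Step 2
        i_max = max(range(pop_size), key=lambda i: scores[i])
        i_min = min(range(pop_size), key=lambda i: scores[i])
        if scores[i_max] > best_score:
            best, best_score = P[i_max], scores[i_max]
        else:
            P[i_min] = best                                        # elitism: keep best-so-far
            scores[i_min] = best_score
        total = sum(scores)                                        # Step 3: proportional selection
        parents = (random.choices(P, weights=scores, k=pop_size) if total > 0
                   else random.choices(P, k=pop_size))
        children = []
        for A, B in zip(parents[0::2], parents[1::2]):             # one-point crossover
            if random.random() < pc:
                cut = random.randrange(1, length)
                A, B = A[:cut] + B[cut:], B[:cut] + A[cut:]
            children += [A, B]
        P = ["".join(bit if random.random() >= pm else "10"[int(bit)]   # point mutation (bit flip)
                     for bit in C) for C in children]              # Step 4: next generation
    return best
```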
5. Experimental results

We tested the proposed algorithm on several benchmark recognition problems, including a 2D two-spirals dataset [14], a waveform dataset, and an ellipse dataset. The results of contrastive experiments using the method proposed by Li et al. are also presented for comparison. All experiments were implemented on a PC with a 2.8 GHz Pentium Dual-Core processor and 4 GB of RAM, using the Borland C++ Builder 6.0 compiler. The population size was set to 50, the crossover rate $p_c$ to 0.5, and the mutation rate $p_m$ to 0.05, and the RBF kernel $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ was used. The recognition rates achieved by the two methods at various reduction rates are compared in Tables 1–3.
Table 1
Recognition rates for two spirals (n = 98; rank(Φ) = 86).

  m        Reduction rate (%)    Li's method    Our method
  10       89.8                  0.5000         0.7375
  20       79.6                  0.5405         0.9035
  30       69.4                  0.5475         0.9995
  50       48.9                  0.6640         0.9995
  60       38.8                  0.9900         0.9995
  86       12.2                  0.9995         0.9995
  n = 98    0.0                  Recognition rate for original SVM = 0.9995
Table 2
Recognition rates for two waveform graphs (n = 103; rank(Φ) = 98).

  m        Reduction rate (%)    Li's method    Our method
  20       80.6                  0.5185         0.6535
  50       51.5                  0.8235         0.9520
  70       32.1                  0.9000         0.9700
  98        4.8                  0.9700         0.9700
  n = 103   0.0                  Recognition rate for original SVM = 0.9700
Table 3
Recognition rates for two ellipses (n = 234; rank(Φ) = 206).

  m        Reduction rate (%)    Li's method    Our method
  5        98.3                  0.5710         0.8165
  20       91.5                  0.6355         0.9190
  30       87.2                  0.6855         1.0000
  40       82.9                  0.8415         1.0000
  50       78.6                  0.9315         1.0000
  206      11.9                  1.0000         1.0000
  n = 234   0.0                  Recognition rate for original SVM = 1.0000
5.1. Experiment on the two-spirals problem

The two-spirals problem is well known in pattern recognition research; it is important both for purely academic reasons and for industrial applications [14]. The parametric equations of the two spirals are:

Spiral-1: $(x, y) = ((4\theta + 10)\cos(\theta),\ (4\theta + 10)\sin(\theta))$,
Spiral-2: $(x, y) = ((4\theta + 1)\cos(\theta),\ (4\theta + 1)\sin(\theta))$.

Three thousand samples were randomly generated; 300 of them were randomly chosen as training data and the remainder as test data. For this test we set $\sigma = 8$ and $C = 10$. The size of the set of support vectors found by the SVM and the rank of its corresponding matrix $\Phi$ are 98 and 86, respectively. The comparison results for different sizes of the reduced set are shown in Table 1.

5.2. Experiment on waveform graphs

The parametric equations of the two waveform graphs are:

Waveform-1: $(x, y) = (t,\ 200 + 2t - 0.031t^2 + 0.001t^3)$,
Waveform-2: $(x, y) = (t,\ 150 + 1.8t - 0.0312t^2 + 0.001t^3)$.

We randomly chose 3000 values of $t$ over the interval [0, 300] to generate 3000 samples, chose 300 of them as training data, and left the remainder as test data. For this test we set $\sigma = 40$ and $C = 1$. The size of the set of support vectors found by the SVM and the rank of its corresponding matrix $\Phi$ are 103 and 98, respectively. The comparison results for different sizes of the reduced set are shown in Table 2.

5.3. Experiment on two ellipses

The parametric equations of the two ellipses are:

Ellipse-1: $(x, y) = (150 + 75\cos(t),\ 150 + 60\sin(t))$,
Ellipse-2: $(x, y) = (150 + 125\cos(t),\ 150 + 105\sin(t))$.

We randomly chose 2700 values of $t$ over the interval $[0, 2\pi]$ to generate 2700 samples, chose 700 of them as training data, and left the remainder as test data. For this test we set $\sigma = 20$ and $C = 16$. The size of the set of support vectors found by the SVM and the rank of its corresponding matrix $\Phi$ are 234 and 206, respectively. The comparison results for different sizes of the reduced set are shown in Table 3.
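For illustration, the following sketch generates and splits the two-spirals data of Section 5.1; the sampling range of $\theta$ and the equal class proportions are assumptions, since they are not specified in the text.

```python
import numpy as np

def two_spirals(n_samples=3000, n_train=300, theta_max=6 * np.pi, seed=0):
    """Sample points from Spiral-1 (label +1) and Spiral-2 (label -1) of
    Section 5.1 and split them into training and test sets.
    The range of theta and the 50/50 class split are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, theta_max, size=n_samples)
    labels = rng.choice([1, -1], size=n_samples)
    r = np.where(labels == 1, 4 * theta + 10, 4 * theta + 1)
    X = np.column_stack((r * np.cos(theta), r * np.sin(theta)))
    idx = rng.permutation(n_samples)
    train, test = idx[:n_train], idx[n_train:]
    return X[train], labels[train], X[test], labels[test]
```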
6. Conclusions and future work

This paper proposed a solution reduction approach for SVMs, based on a GA search mechanism and the use of an appropriate fitness. The experimental results show that the performance of the proposed approach is superior to that of the method proposed by Li et al. Although the solution given by a GA may not be globally optimal, it is, generally speaking, difficult to determine whether a solution is globally optimal, especially when the problem size is large. We adopted an improved version of the GA, called the genetic algorithm with elitism (EGA), which has been shown to achieve good objective function values quite early and does not seem to get trapped in a local optimum. Moreover, when searching for a reduced set of size $m$ with $m = \mathrm{rank}(\Phi)$, any reduced set $X_F$ such that $(\Phi_F^T \Phi_F)^{-1}$ exists must be a globally optimal solution. We performed many such tests (with $m = \mathrm{rank}(\Phi)$) and found that our algorithm always yielded a solution $X_F$ for which $(\Phi_F^T \Phi_F)^{-1}$ exists, indicating a globally optimal solution. This demonstrates that the proposed GA-based search method can avoid, at least to a certain extent, being trapped in a local optimum.

Although $\mathrm{rank}(\Phi)$ is the minimum size of the reduced set of support vectors required to recover the original discriminant function, a reduced set whose size is much smaller than $\mathrm{rank}(\Phi)$ and which provides the same generalization ability as the standard SVM often exists. As illustrated by the experimental results in Tables 1–3, even when the original sets of 98, 103, and 234 support vectors are reduced to subsets of 30, 70, and 30 support vectors (reduction rates of 69.4%, 32.1%, and 87.2%), whose sizes are much smaller than $\mathrm{rank}(\Phi) = 86$, 98, and 206, respectively, the recognition rates of 0.9995, 0.9700, and 1.0000 of the standard SVMs are retained.

One of our future works will be to develop an efficient reduction strategy for the solutions of SVMs based on a more appropriate definition of the approximation error, such as $\delta_F'$ as defined in (21). It is more appropriate than the error defined in (16), and the discriminant function can be better approximated; however, the cost of evaluating it is higher. We are therefore searching for a more efficient computation technique for an appropriate fitness and a more effective search algorithm, in the hope of further improving the proposed method.
$$ \delta_F' = \|w - w_F\|^2 = \Big\| \sum_{x_i \in S} a_i (\phi_i - \tilde{\phi}_i) \Big\|^2. \qquad (21) $$
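Expanding (21) gives $\delta_F' = \sum_i \sum_j a_i a_j \langle \phi_i - \tilde{\phi}_i, \phi_j - \tilde{\phi}_j \rangle$, where each inner product can again be written in terms of kernel values and the $\beta$ coefficients of (13). The sketch below is a naive evaluation under those assumptions, reusing the earlier helpers; it is only meant to make the formula concrete, not to be the efficient technique called for above.

```python
import numpy as np

def delta_prime(S, a, F_idx, kernel):
    """Naive evaluation of Eq. (21):
        delta'_F = || sum_i a_i (phi_i - phi~_i) ||^2
                 = sum_{i,j} a_i a_j <phi_i - phi~_i, phi_j - phi~_j>,
    with the inner products expanded via K(x_i, x_j), K(x_i, x_F), K(x_F, x_F)."""
    a = np.asarray(a, dtype=float)
    X_F = [S[j] for j in F_idx]
    K_SS = kernel_matrix(S, S, kernel)                  # phi_i^T phi_j
    K_SF = kernel_matrix(S, X_F, kernel)                # phi_i^T Phi_F
    K_FF = kernel_matrix(X_F, X_F, kernel)              # Phi_F^T Phi_F
    beta = reduced_set_coefficients(S, F_idx, kernel)   # (n x m), Eq. (13)
    # Gram matrix of the residuals phi_i - phi~_i.
    G = K_SS - K_SF @ beta.T - beta @ K_SF.T + beta @ K_FF @ beta.T
    return float(a @ G @ a)
```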
References

[1] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, pp. 144–152.
[2] C.J.C. Burges, Simplified support vector decision rules, in: Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 71–77.
[3] C.J.C. Burges, Geometry and invariance in kernel based methods, in: Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 86–116.
[4] B. Chakraborty, P. Chaudhuri, On the use of genetic algorithm with elitism in robust and nonparametric multivariate analysis, Austrian Journal of Statistics 32 (1&2) (2003) 13–27.
[5] C. Cortes, V. Vapnik, Support vector networks, Machine Learning 20 (1995) 273–297.
[6] T. Downs, K. Gates, A. Masters, Exact simplification of support vector solutions, Journal of Machine Learning Research 2 (2002) 293–297.
[7] J. Guo, N. Takahashi, T. Nishi, A learning algorithm for improving the classification speed of support vector machines, in: Proceedings of the 2005 European Conference on Circuit Theory and Design (ECCTD 2005), August–September 2005.
[8] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
[9] T. Graepel, R. Herbrich, J. Shawe-Taylor, Generalisation error bounds for sparse linear classifiers, in: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000, pp. 298–303.
[10] T. Joachims, Estimating the generalization performance of an SVM efficiently, in: Proceedings of ICML-00, 17th International Conference on Machine Learning, 1999, pp. 431–438.
[11] Y. LeCun, L. Bottou, L. Jackel, H. Drucker, C. Cortes, J. Denker, I. Guyon, U. Muller, E. Sackinger, P. Simard, V. Vapnik, Learning algorithms for classification: a comparison on handwritten digit recognition, Neural Networks (1995) 261–276.
[12] Q. Li, L. Jiao, Y. Hao, Adaptive simplification of solution for support vector machine, Pattern Recognition 40 (3) (2007) 972–980.
[13] C. Liu, K. Nakashima, H. Sako, H. Fujisawa, Handwritten digit recognition: benchmarking of state-of-the-art techniques, Pattern Recognition 36 (2003) 2271–2285.
[14] K.J. Lang, M.J. Witbrock, Learning to tell two spirals apart, in: Proceedings of the 1989 Connectionist Models Summer School, 1989, pp. 52–61.
[15] A.P. Ruszczynski, Nonlinear Optimization, Princeton University Press, 2006, p. 160.
[16] J. Platt, Sequential minimal optimization: a fast algorithm for training support vector machines, Technical Report, Microsoft Research, Redmond, 1998.
[17] B. Scholkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, 1999.
[18] T. Thies, F. Weber, Optimal reduced-set vectors for support vector machines with a quadratic kernel, Neural Computation 16 (2004) 1769–1777.
[19] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[20] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.