Applied Mathematics and Computation 214 (2009) 329–335
Optimal reduction of solutions for support vector machines

Hwei-Jen Lin, Jih Pin Yeh*

Department of Computer Science and Information Engineering, Tamkang University, 151 Ying-Chuan Road, Tamsui, Taipei County 25137, Taiwan, ROC

*Corresponding author. E-mail addresses: [email protected] (H.-J. Lin), [email protected] (J.P. Yeh).
Keywords: Support vector machine; Vector correlation; Genetic algorithms; Optimal solution; Discriminant function; Pattern recognition
Abstract: Although the support vector machine (SVM) is a universal learning machine, it suffers from expensive computational cost in the test phase due to the large number of support vectors, which greatly impacts its practical use. To address this problem, we propose an adaptive genetic algorithm that optimally reduces the solution of an SVM by selecting vectors from the trained support vector solution, such that the selected vectors best approximate the original discriminant function. Our method can be applied to SVMs using any general kernel. The size of the reduced set can be chosen adaptively based on the requirements of the task, so the generalization/complexity trade-off can be controlled directly. The lower bound on the number of selected vectors required to recover the original discriminant function can also be determined. © 2009 Elsevier Inc. All rights reserved.
1. Introduction

Support vector machines (SVMs) were developed by Vapnik [19,20] and are gaining popularity due to their many attractive features and promising empirical performance. Since the decision surface of an SVM is parameterized by a large set of support vectors and corresponding weights, the machine is considerably slower in the test phase than other learning machines such as neural networks and decision trees [1,2,11,13]. Many methods have been proposed to address this disadvantage by approximating the discriminant function using a reduced set of vectors. Burges [2] provided an explicit way to find such a reduced set of vectors. Thies [18] followed the idea of Burges and constructed another reduced set of vectors by means of PCA (Principal Component Analysis). However, the vectors produced by both methods are generally not support vectors, and their methods can only be applied effectively to SVMs using quadratic kernels. Downs [6] simplified the solution of support vectors according to their linear dependency without any loss of generalization, but was not able to further reduce the solution when an approximation error is allowed. Guo [7] evaluated the contribution of each support vector to the shape of the decision surface by retraining the SVM on the training sample with that support vector removed and then evaluating the distance between the newly trained discriminant function and the removed vector. Support vectors with smaller contributions were removed from the training sample and the SVM was retrained on the remaining sample. The result was a new set of support vectors, which is, as they claimed without any theoretical proof, of smaller size. Li et al. [12] constructed a reduced set of so-called feature vectors, selected from the support vector solution based on the vector correlation principle and a greedy algorithm. However, the solutions produced by their greedy strategy were not guaranteed to be optimal.

In this paper, we propose an adaptive approach for optimally reducing the support vector solution using a more efficient and more effective search algorithm based on a more reasonable fitness. Searching for an optimal subset of vectors selected from the support vector solution is, generally speaking, an expensive combinatorial optimization problem for large sets of support vectors. In this work, we use a genetic algorithm to search for the optimal subset of a given number of support vectors. The subset produced by the genetic algorithm can be used to best approximate the original discriminant function.
[email protected] (H.-J. Lin),
[email protected] (J.P. Yeh). 0096-3003/$ - see front matter Ó 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.amc.2009.04.010
330
H.-J. Lin, J.P. Yeh / Applied Mathematics and Computation 214 (2009) 329–335
The number of selected vectors can be chosen adaptively according to the requirements of the task. Therefore, with our proposed method the generalization/complexity trade-off can be controlled directly. In addition, the lower bound on the number of selected vectors required to recover the original discriminant function can be evaluated to help determine the number of selected support vectors.

The paper is organized as follows: In the next section, we briefly introduce the SVM model. The criterion for an optimal reduced set of support vectors and some related details are described in Section 3. In Section 4 we present an efficient optimal reduction of support vectors using a genetic algorithm and its implementation strategy. In Section 5 we describe the experimental results. Finally, we give our concluding remarks in Section 6.

2. Support vector machines

Throughout this paper, we concern ourselves only with the two-class pattern recognition problem. However, the proposed method can also be applied to other problems where the support vector algorithm is used, e.g., nonlinear regression. An SVM uses a kernel to map the data from the input space to a high-dimensional feature space, in which the problem can be solved in linear form [3,9,16,17,18]. Given the training sample $\{(x_i, y_i) : x_i \in \mathbb{R}^d,\ y_i \in \{-1, +1\}\}_{i=1}^{N}$, the two-class pattern recognition problem can be cast as the primal problem of finding a hyperplane:
$$ w^T x + b = 0, \qquad (1) $$
where $w$ is a $d$-dimensional normal vector, such that the two classes can be separated by two margins parallel to the hyperplane; that is, for each $x_i$, $i = 1, 2, \ldots, N$:

$$ w^T x_i + b \ge +1 - \xi_i \quad \text{for } y_i = +1, \qquad (2) $$
$$ w^T x_i + b \le -1 + \xi_i \quad \text{for } y_i = -1, \qquad (3) $$
where $\xi_i \ge 0$, $i = 1, 2, \ldots, N$, are slack variables and $b$ is the bias. This primal problem can easily be cast as the following quadratic optimization problem [5]:
$$ \text{minimize} \quad \Psi(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i, \qquad (4) $$
subject to $y_i[(w \cdot x_i) + b] \ge 1 - \xi_i$, where $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$. The objective of a support vector machine is to determine the optimal $w$ and optimal bias $b$ such that the corresponding hyperplane separates the positive and negative training data with maximum margin and produces the best generalization performance. This hyperplane is called an optimal separating hyperplane. Using Lagrange multiplier techniques leads to the following dual optimization problem [15].

Dual optimization problem: Given the training sample $\{(x_i, y_i) : x_i \in \mathbb{R}^d,\ y_i \in \{-1, +1\}\}_{i=1}^{N}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{N}$ that minimize the objective function
$$ Q(\alpha) = -\sum_{i=1}^{N} \alpha_i + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \qquad (5) $$

subject to

$$ \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C \quad \text{for } i = 1, 2, 3, \ldots, N, $$
where $C$ is a user-specified positive constant and $K(\cdot, \cdot)$ is a positive semidefinite kernel function. For any positive semidefinite kernel $K$ on a sample space $X$, there is a Hilbert space $H$ with inner product $\langle \cdot, \cdot \rangle$ and a feature mapping $\phi: X \to H$ such that
$$ K(x, x') = \langle \phi(x), \phi(x') \rangle, \qquad (x, x' \in X). \qquad (6) $$
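As a concrete instance of (6), the radial basis function (RBF) kernel used later in the experiments can be written as a short Python function. This is only an illustrative sketch; the function name and the use of NumPy are our choices, not part of the paper.

```python
import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    """RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)).

    This kernel is positive semidefinite, so it corresponds to an inner
    product <phi(x), phi(x')> in some feature space H, as in Eq. (6)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# Example: kernel value between two 2-D points.
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=2.0))
```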
Let $\{\alpha_i\}_{i=1}^{N}$ be an optimal solution of the above problem; then the discriminant function takes the form:

$$ f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b. \qquad (7) $$
It is apparent from the constraint in (5) that $0 \le \alpha_i \le C$ holds for $i = 1, 2, \ldots, N$. All training samples $x_i$ such that $\alpha_i > 0$ are called support vectors. To distinguish between support vectors with $0 < \alpha_i < C$ and those with $\alpha_i = C$, the former are called unbounded support vectors while the latter are called bounded ones [9,10]. In the following, we assume without loss of generality that $0 < \alpha_i \le C$ for $i = 1, 2, \ldots, n$, and $\alpha_i = 0$ for $i = n+1, n+2, \ldots, N$. Thus, the discriminant function in (7) can be written as follows:
$$ f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b. \qquad (8) $$
For simplicity, in what follows we let $a_i = \alpha_i y_i$, so that $a_i \in [-C, C] \setminus \{0\}$ and the discriminant function given in (8) can be written in an even simpler form:

$$ f(x) = b + \sum_{i=1}^{n} a_i K(x, x_i). \qquad (9) $$
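To make the test-phase cost of (9) concrete, the following Python sketch evaluates $f(x)$ from a trained scikit-learn SVC, whose dual_coef_ attribute stores the products $a_i = \alpha_i y_i$. The toy dataset and parameter values are placeholders, not those used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (placeholder, not the paper's datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

sigma = 1.0
clf = SVC(kernel="rbf", C=10.0, gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)

def decision_function(x, clf, sigma):
    """Evaluate f(x) = b + sum_i a_i K(x, x_i) over the n support vectors,
    as in Eq. (9); a_i = alpha_i * y_i is stored in clf.dual_coef_."""
    sv = clf.support_vectors_              # x_1, ..., x_n
    a = clf.dual_coef_.ravel()             # a_i = alpha_i * y_i
    b = clf.intercept_[0]
    k = np.exp(-np.sum((sv - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return b + np.dot(a, k)

x_test = np.array([0.5, -0.2])
print(decision_function(x_test, clf, sigma),
      clf.decision_function(x_test.reshape(1, -1))[0])  # should agree
```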
3. Reduction of solutions for SVM

The time taken for computing (9) is proportional to the number $n$ of support vectors, which may be quite large. In this work, we focus on finding a reduced set of support vectors that best approximates or even recovers the original discriminant function given in (9), where $K$ is a positive semidefinite kernel on the input space $X = \mathbb{R}^d$, and the support vectors $x_i \in X$ and the multipliers $a_i \in \mathbb{R} \setminus \{0\}$ are obtained from the SVM solution.

3.1. Approximation of support vectors

Let $S = \{x_1, x_2, \ldots, x_n\}$ be the support vectors for a sample obtained by an SVM and let $\{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_r\} \subseteq S$ be a subset of vectors selected from $S$. If the mapped selected support vectors $\{\phi(\tilde{x}_1), \phi(\tilde{x}_2), \ldots, \phi(\tilde{x}_r)\}$ span the vector space spanned by the mapped support vectors $\{\phi(x_1), \phi(x_2), \ldots, \phi(x_n)\}$, then any mapped support vector $\phi(x_i)$ can be exactly expressed as a linear combination of the mapped selected vectors: $\phi(x_i) = \sum_{j=1}^{r} \beta_{ij} \phi(\tilde{x}_j)$. The discriminant function shown in (9) can be written as:
$$ f(x) = b + \langle \phi(x), w \rangle, \qquad (10) $$

where the normal vector $w = \sum_{i=1}^{n} a_i \phi(x_i)$. For a reduced set of selected support vectors $X_F = \{x_{F_1}, x_{F_2}, \ldots, x_{F_m}\}$, we are looking for coefficients $\beta_{ij} \in \mathbb{R}$, $1 \le i \le n$, $1 \le j \le m$, such that each mapped support vector $\phi_i = \phi(x_i)$ is best approximated by a linear combination of the mapped selected vectors $\phi(X_F)$:

$$ \phi_i \approx \tilde{\phi}_i = \sum_{j=1}^{m} \beta_{ij} \phi(x_{F_j}). \qquad (11) $$
Note that any $d$-dimensional vector can be regarded as a $d \times 1$ column. If we let $\Phi_F = [\phi(x_{F_1})\ \phi(x_{F_2})\ \cdots\ \phi(x_{F_m})]$ and $\vec{\beta}_i = [\beta_{i1}\ \beta_{i2}\ \cdots\ \beta_{im}]^T$, then the approximating mapped vectors can be expressed as $\tilde{\phi}_i = \Phi_F \vec{\beta}_i$, with approximation error:

$$ \delta_{F_i} = \|\phi_i - \tilde{\phi}_i\|^2 = \|\phi_i - \Phi_F \vec{\beta}_i\|^2 = (\phi_i - \Phi_F \vec{\beta}_i)^T (\phi_i - \Phi_F \vec{\beta}_i) = \phi_i^T \phi_i - 2 \phi_i^T \Phi_F \vec{\beta}_i + \vec{\beta}_i^T \Phi_F^T \Phi_F \vec{\beta}_i. \qquad (12) $$

The coefficient vector $\vec{\beta}_i$ such that $\tilde{\phi}_i = \Phi_F \vec{\beta}_i$ best approximates $\phi_i$ (i.e., has the minimum approximation error $\delta_{F_i}$) is simply a least-squares solution of the linear system $\phi_i = \Phi_F \vec{\beta}$, or an exact solution of its normal system $\Phi_F^T \phi_i = \Phi_F^T \Phi_F \vec{\beta}$, which has the unique solution given in (13) if $(\Phi_F^T \Phi_F)^{-1}$ exists. We guarantee the existence of $(\Phi_F^T \Phi_F)^{-1}$ by forcing each reduced set taken into account to be linearly independent, as described in Section 4:

$$ \vec{\beta}_i = (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i. \qquad (13) $$
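Although $\phi_i$ and $\Phi_F$ live in the (possibly infinite-dimensional) feature space $H$, every quantity in (13) reduces to kernel values, since $\Phi_F^T \Phi_F = K(x_F, x_F)$ and $\Phi_F^T \phi_i = K(x_F, x_i)$. The Python sketch below illustrates this computation; the helper names and the plain linear solve are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kernel_matrix(A, B, kernel):
    """Matrix of kernel values K(a, b) for a in A, b in B."""
    return np.array([[kernel(a, b) for b in B] for a in A])

def reduced_set_coefficients(S, F_idx, kernel):
    """Coefficients beta_i of Eq. (13) for every support vector x_i in S,
    computed purely from kernel evaluations:
        Phi_F^T Phi_F  -> K(x_F, x_F)   (m x m)
        Phi_F^T phi_i  -> K(x_F, x_i)   (m x 1)
    Returns an (n x m) array whose i-th row is beta_i."""
    X_F = [S[j] for j in F_idx]
    K_FF = kernel_matrix(X_F, X_F, kernel)   # Phi_F^T Phi_F
    K_FS = kernel_matrix(X_F, S, kernel)     # columns are Phi_F^T phi_i
    # Solve K_FF beta_i = K_FS[:, i] for all i at once (K_FF assumed invertible).
    return np.linalg.solve(K_FF, K_FS).T
```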
In such a way, the discriminant function can be approximated by
$$ \tilde{f}(x) = b + \langle \phi(x), w_F \rangle, \qquad (14) $$

where

$$ w_F = \sum_{i=1}^{n} a_i \tilde{\phi}_i = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i \beta_{ij} \phi(x_{F_j}). \qquad (15) $$
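Regrouping (15) by the selected vectors gives $\tilde{f}(x) = b + \sum_{j=1}^{m} \gamma_j K(x, x_{F_j})$ with $\gamma_j = \sum_{i} a_i \beta_{ij}$, so the test-time cost drops from $n$ to $m$ kernel evaluations. The sketch below, which reuses reduced_set_coefficients from the previous sketch under the same assumptions, makes this explicit.

```python
import numpy as np

def reduced_discriminant(S, a, b, F_idx, kernel):
    """Build the approximate discriminant f~(x) of Eqs. (14)-(15).

    Regrouping (15) by the selected vectors gives
        f~(x) = b + sum_j gamma_j K(x, x_Fj),  gamma_j = sum_i a_i beta_ij,
    so only m kernel evaluations are needed per test point."""
    beta = reduced_set_coefficients(S, F_idx, kernel)   # (n x m), Eq. (13)
    gamma = np.asarray(a, dtype=float) @ beta           # (m,)
    X_F = [S[j] for j in F_idx]

    def f_tilde(x):
        k = np.array([kernel(x, x_f) for x_f in X_F])
        return b + float(gamma @ k)

    return f_tilde
```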
As described in (12) and (15), the normal vector $w$ is approximated by $w_F$ when the reduced set $X_F$ is used, in which the $i$-th term $a_i \phi_i$ is approximated by $a_i \tilde{\phi}_i$ with approximation error $\|a_i \phi_i - a_i \tilde{\phi}_i\|^2 = a_i^2 \|\phi_i - \tilde{\phi}_i\|^2 = a_i^2 \delta_{F_i}$.

3.2. Optimal approximation of support vectors

As shown in (16), we define the approximation error $\delta_F$ of the approximating discriminant function $\tilde{f}(x)$ as the sum of the approximation errors $a_i^2 \delta_{F_i}$, and define the fitness of the corresponding reduced set $X_F$ as the negative of this error, as shown in (17):
$$ \delta_F = \sum_{x_i \in S} a_i^2 \delta_{F_i}, \qquad (16) $$
$$ \mathrm{Fit}(X_F) = -\delta_F. \qquad (17) $$
Substituting (12) and (13) into (16) yields:

$$ \delta_F = \sum_{x_i \in S} a_i^2 \left( \phi_i^T \phi_i - \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i \right) = \sum_{x_i \in S} a_i^2 \phi_i^T \phi_i - \sum_{x_i \in S} a_i^2 \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i. \qquad (18) $$

Note that minimizing $\delta_F$ is equivalent to maximizing the summation $\sum_{x_i \in S} a_i^2 \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i$, since $\sum_{x_i \in S} a_i^2 \phi_i^T \phi_i$ is a constant. Thus, we may modify the fitness to:

$$ \mathrm{Fit}'(X_F) = \sum_{x_i \in S} a_i^2 \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i. \qquad (19) $$
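Like (13), the fitness (19) involves only inner products and can therefore be evaluated from kernel values, using $\phi_i^T \Phi_F = K(x_i, x_F)$ and $\Phi_F^T \Phi_F = K(x_F, x_F)$. A minimal sketch follows, reusing the kernel_matrix helper introduced earlier; the function name is our own.

```python
import numpy as np

def fitness_prime(S, a, F_idx, kernel):
    """Fit'(X_F) of Eq. (19), computed from kernel values only:
        sum_i a_i^2 * K(x_i, x_F) (K(x_F, x_F))^{-1} K(x_F, x_i)."""
    X_F = [S[j] for j in F_idx]
    K_FF = kernel_matrix(X_F, X_F, kernel)   # Phi_F^T Phi_F        (m x m)
    K_FS = kernel_matrix(X_F, S, kernel)     # Phi_F^T phi_i, one column per i
    # Projection term phi_i^T Phi_F (Phi_F^T Phi_F)^{-1} Phi_F^T phi_i for each i.
    proj = np.sum(K_FS * np.linalg.solve(K_FF, K_FS), axis=0)
    return float(np.dot(np.asarray(a, dtype=float) ** 2, proj))
```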
Our objective is, for a given number $m$, to find a reduced set $X_F$ of $m$ support vectors that minimizes the error $\delta_F$, or equivalently maximizes the fitness $\mathrm{Fit}'(X_F)$, which, unlike the fitness defined by Li et al., directly reflects the distance (or error) between the approximating discriminant function and the original discriminant function. The following theorems aid the later discussion. The proof of the former can be found in any elementary linear algebra textbook.

Theorem 1. Let $A$ be an $r \times c$ matrix. Then the column vectors of $A$ are linearly dependent if and only if $\mathrm{rank}(A) < c$.

Theorem 2. Let $A$ be an $r \times c$ matrix. Then $\mathrm{rank}(A) = \mathrm{rank}(A^T A)$.

Proof. Let $AX = 0$ be a homogeneous system of linear equations, where $A$ is its coefficient matrix, $X$ is a $c \times 1$ matrix of unknowns, and $0$ is an $r \times 1$ zero matrix. Since $AX = 0$ is consistent, its solution space is the same as that of its normal system $A^T A X = 0$; that is, $A^T A$ and $A$ have the same nullity. But since $A^T A$ and $A$ have the same number of columns, they must also have the same rank. □

It is helpful to evaluate the dimension of the space spanned by the mapped support vectors, which is equal to the rank of the matrix $\Phi$ formed by them, where $\Phi = [\phi(x_1)\ \phi(x_2)\ \cdots\ \phi(x_n)]$; that is, $\dim(\mathrm{span}\{\phi(x_1), \phi(x_2), \ldots, \phi(x_n)\}) = \mathrm{rank}(\Phi)$. Although we cannot access the entries of $\Phi$ directly, we may obtain its rank, according to Theorem 2, by evaluating the rank of the matrix $\Phi^T \Phi$ (of size $n \times n$), whose entries can be acquired through the kernel function $K$: $(\Phi^T \Phi)_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j)$. The value of $\mathrm{rank}(\Phi)$ is the minimum size of the reduced set of support vectors required to recover the original discriminant function. Thus, if we desire an approximation with no error, we may set the size $m$ of the reduced set to any number not less than $\mathrm{rank}(\Phi)$.

4. Encoding scheme of chromosomes

We adopt a genetic algorithm to search for an optimal reduced set of support vectors, that is, a reduced set with the minimum approximation error. Genetic algorithms (GAs) are stochastic search methods based on the principles of natural genetic systems [8]. The basic idea is to maintain a population of candidate solutions that evolves and improves over time through a process of competition and controlled variation. However, such algorithms are not guaranteed to converge to the global optimum as the number of iterations increases, and they may become trapped in a local optimum. We adopt the variant proposed by Chakraborty et al. [4], called the genetic algorithm with elitism (EGA), as a way to mitigate this general problem of probabilistic search methods. An EGA is a genetic algorithm with the strategy of replacing the worst string of the new population with the best string of the current population. It has been shown that, with a suitable choice of population size and mutation probability, an EGA achieves good objective function values quite early and does not seem to get trapped in a local optimum.

We now define the set of individuals in the population generated during $T$ generation cycles as $P(t) = \{C_i^t \mid i = 1, 2, \ldots, p\}$, $1 \le t \le T$, where $p$ is the number of individuals, or the population size. Each individual, representing a reduced set of $m$ vectors, is encoded as a chromosome $C_i^t = c_{i1}^t c_{i2}^t \cdots c_{im}^t$, where $c_{ij}^t$ is a binary string of length $k = \lceil \log_2 n \rceil$ corresponding to the index of a support vector.
For example, assume that $m = 3$, $k = 4$, and $C_i^t = 0110\,0111\,1010$; then $c_{i1}^t = 0110_2 = 6$, $c_{i2}^t = 0111_2 = 7$, and $c_{i3}^t = 1010_2 = 10$, which means that the chromosome $C_i^t$ corresponds to the reduced set $\{x_6, x_7, x_{10}\}$. Let the chromosome $C_F$ represent the reduced set $X_F$. According to Theorems 1 and 2, $\mathrm{rank}(\Phi_F^T \Phi_F) < m$ if and only if $X_F$ is linearly dependent. To force the GA to search only over linearly independent sets $X_F$, so that $(\Phi_F^T \Phi_F)^{-1}$ exists, the fitness of the corresponding chromosome $C_F$, $\mathrm{fitness}(C_F)$, is simply set to 0 if $X_F$ is linearly dependent; otherwise, $\mathrm{fitness}(C_F) = \mathrm{Fit}'(X_F)$, as shown in (20):
$$ \mathrm{fitness}(C_F) = \begin{cases} 0 & \text{if } \mathrm{rank}(\Phi_F^T \Phi_F) < m, \\ \sum_{x_i \in S} a_i^2 \phi_i^T \Phi_F (\Phi_F^T \Phi_F)^{-1} \Phi_F^T \phi_i & \text{otherwise.} \end{cases} \qquad (20) $$
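One possible way to decode a chromosome and evaluate (20) is sketched below, using a numerical rank test on $K(x_F, x_F) = \Phi_F^T \Phi_F$ to detect linear dependence and reusing kernel_matrix and fitness_prime from the earlier sketches. Details not stated in the text, such as taking indices modulo $n$, are assumptions for illustration.

```python
import numpy as np

def decode_chromosome(C, m, k, n):
    """Split a binary string C into m fields of k bits; each field is the
    index of a support vector (taken modulo n to stay in range)."""
    return [int(C[j * k:(j + 1) * k], 2) % n for j in range(m)]

def chromosome_fitness(C, S, a, m, k, kernel):
    """fitness(C_F) of Eq. (20): 0 if X_F is linearly dependent
    (rank(Phi_F^T Phi_F) < m), otherwise Fit'(X_F) of Eq. (19)."""
    F_idx = decode_chromosome(C, m, k, len(S))
    X_F = [S[j] for j in F_idx]
    K_FF = kernel_matrix(X_F, X_F, kernel)   # Phi_F^T Phi_F
    if np.linalg.matrix_rank(K_FF) < m:      # X_F linearly dependent
        return 0.0
    return fitness_prime(S, a, F_idx, kernel)
```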
The proposed GA-based search algorithm, given at the end of this section, adopts proportional selection, one-point crossover with crossover rate $p_c$, and point mutation with mutation rate $p_m$. During the evolution, it keeps track of the best-so-far
chromosome, i.e., the chromosome with the best fitness found so far. The change in fitness or a fixed number of generation cycles can be used as the stopping criterion, depending on the user's requirements. The algorithm returns the chromosome best with the maximum fitness value, which corresponds to a reduced set with the minimum approximation error.

Reduction of Solutions for SVM
Step 1. // Initialization
  t := 1;
  Randomly generate an initial population P(t) := {C_i^t | i = 1, 2, ..., p};
  best := C_1^t;
Step 2. // Storing the best solution (EGA)
  Evaluate fitness for each C_i^t in P(t);
  max := arg max_{1<=i<=p} fitness(C_i^t);
  min := arg min_{1<=i<=p} fitness(C_i^t);
  if fitness(C_max^t) > fitness(best) then best := C_max^t else C_min^t := best;
Step 3. // Genetic operations
  Perform proportional selection, one-point crossover, and point mutation on P(t) to produce P(t + 1);
Step 4.
  t := t + 1;
  if the stopping criteria are met then return best and stop else goto Step 2;
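A compact Python rendering of Steps 1–4 is given below as a sketch; the operators follow textbook definitions of proportional selection, one-point crossover, and point mutation, and the default population size, rates, and fixed-cycle stopping rule are placeholders rather than the authors' exact settings.

```python
import random

def ega_search(fitness, m, k, pop_size=50, pc=0.5, pm=0.05, generations=200):
    """Elitist GA over chromosomes of m fields of k bits each (Steps 1-4).
    `fitness` maps a bit string of length m*k to a real value; pop_size is
    assumed even for pairwise crossover."""
    length = m * k
    rand_chrom = lambda: "".join(random.choice("01") for _ in range(length))
    P = [rand_chrom() for _ in range(pop_size)]                    # Step 1
    best = P[0]
    best_score = fitness(best)
    for _ in range(generations):                                   # fixed-cycle stopping rule
        scores = [fitness(C) for C in P]                           # Step 2
        i_max = max(range(pop_size), key=lambda i: scores[i])
        i_min = min(range(pop_size), key=lambda i: scores[i])
        if scores[i_max] > best_score:
            best, best_score = P[i_max], scores[i_max]
        else:
            P[i_min] = best                                        # elitism: keep best-so-far
            scores[i_min] = best_score
        total = sum(scores)                                        # Step 3: proportional selection
        parents = (random.choices(P, weights=scores, k=pop_size) if total > 0
                   else random.choices(P, k=pop_size))
        children = []
        for A, B in zip(parents[0::2], parents[1::2]):             # one-point crossover
            if random.random() < pc:
                cut = random.randrange(1, length)
                A, B = A[:cut] + B[cut:], B[:cut] + A[cut:]
            children += [A, B]
        P = ["".join(bit if random.random() >= pm else "10"[int(bit)]   # point mutation (bit flip)
                     for bit in C) for C in children]              # Step 4: next generation
    return best
```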
5. Experimental results

We tested the proposed algorithm on several benchmark recognition problems, including a 2D two-spirals dataset [14], a waveform dataset, and an ellipse dataset. The results of contrastive experiments using the method proposed by Li et al. are also presented for comparison. All experiments were implemented on a PC with a 2.8 GHz Pentium Dual-Core processor and 4 GB of RAM, using the Borland C++ Builder 6.0 compiler. The population size was set to 50, the crossover rate $p_c$ to 0.5, and the mutation rate $p_m$ to 0.05, and the RBF kernel $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ was used. The recognition rates achieved by the two methods at various reduction rates are compared in Tables 1–3.
Table 1
Recognition rates for two spirals (n = 98; rank(Φ) = 86).

  m        Reduction rate (%)    Li's method    Our method
  10       89.8                  0.5000         0.7375
  20       79.6                  0.5405         0.9035
  30       69.4                  0.5475         0.9995
  50       48.9                  0.6640         0.9995
  60       38.8                  0.9900         0.9995
  86       12.2                  0.9995         0.9995
  n = 98    0.0                  Recognition rate for original SVM = 0.9995
Table 2
Recognition rates for two waveform graphs (n = 103; rank(Φ) = 98).

  m        Reduction rate (%)    Li's method    Our method
  20       80.6                  0.5185         0.6535
  50       51.5                  0.8235         0.9520
  70       32.1                  0.9000         0.9700
  98        4.8                  0.9700         0.9700
  n = 103   0.0                  Recognition rate for original SVM = 0.9700
Table 3
Recognition rates for two ellipses (n = 234; rank(Φ) = 206).

  m        Reduction rate (%)    Li's method    Our method
  5        98.3                  0.5710         0.8165
  20       91.5                  0.6355         0.9190
  30       87.2                  0.6855         1.0000
  40       82.9                  0.8415         1.0000
  50       78.6                  0.9315         1.0000
  206      11.9                  1.0000         1.0000
  n = 234   0.0                  Recognition rate for original SVM = 1.0000
5.1. Experiment on the two-spirals problem

The two-spirals problem is well known in pattern recognition research; it is important both for purely academic reasons and for industrial applications [14]. The parametric equations of the two spirals are:

Spiral-1: $(x, y) = ((4\theta + 10)\cos(\theta),\ (4\theta + 10)\sin(\theta))$,
Spiral-2: $(x, y) = ((4\theta + 1)\cos(\theta),\ (4\theta + 1)\sin(\theta))$.

Three thousand samples were randomly generated; 300 of them were randomly chosen as training data and the remainder as test data. For this test we set $\sigma = 8$ and $C = 10$. The size of the set of support vectors found by the SVM and the rank of its corresponding matrix $\Phi$ are 98 and 86, respectively. The comparison results for different sizes of the reduced set are shown in Table 1.

5.2. Experiment on waveform graphs

The parametric equations of the two waveform graphs are:

Waveform-1: $(x, y) = (t,\ 200 + 2t - 0.031t^2 + 0.001t^3)$,
Waveform-2: $(x, y) = (t,\ 150 + 1.8t - 0.0312t^2 + 0.001t^3)$.

We randomly chose 3000 values of $t$ over the interval [0, 300] to generate 3000 samples, chose 300 of them as training data, and left the remainder as test data. For this test we set $\sigma = 40$ and $C = 1$. The size of the set of support vectors found by the SVM and the rank of its corresponding matrix $\Phi$ are 103 and 98, respectively. The comparison results for different sizes of the reduced set are shown in Table 2.

5.3. Experiment on two ellipses

The parametric equations of the two ellipses are:

Ellipse-1: $(x, y) = (150 + 75\cos(t),\ 150 + 60\sin(t))$,
Ellipse-2: $(x, y) = (150 + 125\cos(t),\ 150 + 105\sin(t))$.

We randomly chose 2700 values of $t$ over the interval $[0, 2\pi]$ to generate 2700 samples, chose 700 of them as training data, and left the remainder as test data. For this test we set $\sigma = 20$ and $C = 16$. The size of the set of support vectors found by the SVM and the rank of its corresponding matrix $\Phi$ are 234 and 206, respectively. The comparison results for different sizes of the reduced set are shown in Table 3.
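For illustration, the following sketch generates and splits the two-spirals data of Section 5.1; the sampling range of $\theta$ and the equal class proportions are assumptions, since they are not specified in the text.

```python
import numpy as np

def two_spirals(n_samples=3000, n_train=300, theta_max=6 * np.pi, seed=0):
    """Sample points from Spiral-1 (label +1) and Spiral-2 (label -1) of
    Section 5.1 and split them into training and test sets.
    The range of theta and the 50/50 class split are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, theta_max, size=n_samples)
    labels = rng.choice([1, -1], size=n_samples)
    r = np.where(labels == 1, 4 * theta + 10, 4 * theta + 1)
    X = np.column_stack((r * np.cos(theta), r * np.sin(theta)))
    idx = rng.permutation(n_samples)
    train, test = idx[:n_train], idx[n_train:]
    return X[train], labels[train], X[test], labels[test]
```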
6. Conclusions and future work

This paper proposed a solution reduction approach for SVMs, based on a GA search mechanism and the use of an appropriate fitness. The experimental results show that the performance of the proposed approach is superior to that of the method proposed by Li et al. Although the solution given by a GA may not be globally optimal, it is, generally speaking, difficult to determine whether a solution is globally optimal, especially when the problem size is large. We adopted an improved version of the GA, called the genetic algorithm with elitism (EGA), which has been shown to achieve good objective function values quite early and does not seem to get trapped in a local optimum. Moreover, when searching for a reduced set of size $m$ with $m = \mathrm{rank}(\Phi)$, any reduced set $X_F$ such that $(\Phi_F^T \Phi_F)^{-1}$ exists must be a globally optimal solution. We performed many such tests (with $m = \mathrm{rank}(\Phi)$) and found that our algorithm always yielded a solution $X_F$ for which $(\Phi_F^T \Phi_F)^{-1}$ exists, indicating a globally optimal solution. This demonstrates that the proposed GA-based search method can avoid, at least to a certain extent, being trapped in a local optimum.

Although $\mathrm{rank}(\Phi)$ is the minimum size of the reduced set of support vectors required to recover the original discriminant function, a reduced set whose size is much smaller than $\mathrm{rank}(\Phi)$ and which provides the same generalization ability as the standard SVM often exists. As illustrated by the experimental results in Tables 1–3, even when the original sets of 98, 103, and 234 support vectors are reduced to subsets of 30, 70, and 30 support vectors (reduction rates of 69.4%, 32.1%, and 87.2%), whose sizes are much smaller than $\mathrm{rank}(\Phi) = 86$, 98, and 206, respectively, the recognition rates of 0.9995, 0.9700, and 1.0000 of the standard SVMs are retained.

One of our future works will be to develop an efficient reduction strategy for the solutions of SVMs based on a more appropriate definition of the approximation error, such as $\delta_F'$ as defined in (21). It is more appropriate than the error defined in (16), and the discriminant function can be better approximated; however, the cost of evaluating it is higher. We are therefore searching for a more efficient computation technique for an appropriate fitness and a more effective search algorithm, in the hope of further improving the proposed method.
$$ \delta_F' = \|w - w_F\|^2 = \Big\| \sum_{x_i \in S} a_i (\phi_i - \tilde{\phi}_i) \Big\|^2. \qquad (21) $$
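Expanding (21) gives $\delta_F' = \sum_i \sum_j a_i a_j \langle \phi_i - \tilde{\phi}_i, \phi_j - \tilde{\phi}_j \rangle$, where each inner product can again be written in terms of kernel values and the $\beta$ coefficients of (13). The sketch below is a naive evaluation under those assumptions, reusing the earlier helpers; it is only meant to make the formula concrete, not to be the efficient technique called for above.

```python
import numpy as np

def delta_prime(S, a, F_idx, kernel):
    """Naive evaluation of Eq. (21):
        delta'_F = || sum_i a_i (phi_i - phi~_i) ||^2
                 = sum_{i,j} a_i a_j <phi_i - phi~_i, phi_j - phi~_j>,
    with the inner products expanded via K(x_i, x_j), K(x_i, x_F), K(x_F, x_F)."""
    a = np.asarray(a, dtype=float)
    X_F = [S[j] for j in F_idx]
    K_SS = kernel_matrix(S, S, kernel)                  # phi_i^T phi_j
    K_SF = kernel_matrix(S, X_F, kernel)                # phi_i^T Phi_F
    K_FF = kernel_matrix(X_F, X_F, kernel)              # Phi_F^T Phi_F
    beta = reduced_set_coefficients(S, F_idx, kernel)   # (n x m), Eq. (13)
    # Gram matrix of the residuals phi_i - phi~_i.
    G = K_SS - K_SF @ beta.T - beta @ K_SF.T + beta @ K_FF @ beta.T
    return float(a @ G @ a)
```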
References

[1] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, pp. 144–152.
[2] C.J.C. Burges, Simplified support vector decision rules, in: Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 71–77.
[3] C.J.C. Burges, Geometry and invariance in kernel based methods, in: Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 86–116.
[4] B. Chakraborty, P. Chaudhuri, On the use of genetic algorithm with elitism in robust and nonparametric multivariate analysis, Austrian Journal of Statistics 32 (1&2) (2003) 13–27.
[5] C. Cortes, V. Vapnik, Support vector networks, Machine Learning 20 (1995) 273–297.
[6] T. Downs, K. Gates, A. Masters, Exact simplification of support vector solutions, Journal of Machine Learning Research 2 (2002) 293–297.
[7] J. Guo, N. Takahashi, T. Nishi, A learning algorithm for improving the classification speed of support vector machines, in: Proceedings of the 2005 European Conference on Circuit Theory and Design (ECCTD 2005), August–September 2005.
[8] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
[9] T. Graepel, R. Herbrich, J. Shawe-Taylor, Generalisation error bounds for sparse linear classifiers, in: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000, pp. 298–303.
[10] T. Joachims, Estimating the generalization performance of an SVM efficiently, in: Proceedings of ICML-00, 17th International Conference on Machine Learning, 1999, pp. 431–438.
[11] Y. LeCun, L. Bottou, L. Jackel, H. Drucker, C. Cortes, J. Denker, I. Guyon, U. Muller, E. Sackinger, P. Simard, V. Vapnik, Learning algorithms for classification: a comparison on handwritten digit recognition, Neural Networks (1995) 261–276.
[12] Q. Li, L. Jiao, Y. Hao, Adaptive simplification of solution for support vector machine, Pattern Recognition 40 (3) (2007) 972–980.
[13] C. Liu, K. Nakashima, H. Sako, H. Fujisawa, Handwritten digit recognition: benchmarking of state-of-the-art techniques, Pattern Recognition 36 (2003) 2271–2285.
[14] K.J. Lang, M.J. Witbrock, Learning to tell two spirals apart, in: Proceedings of the 1989 Connectionist Models Summer School, 1989, pp. 52–61.
[15] A.P. Ruszczynski, Nonlinear Optimization, Princeton University Press, 2006, p. 160.
[16] J. Platt, Sequential minimal optimization: a fast algorithm for training support vector machines, Technical Report, Microsoft Research, Redmond, 1998.
[17] B. Scholkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, 1999.
[18] T. Thies, F. Weber, Optimal reduced-set vectors for support vector machines with a quadratic kernel, Neural Computation 16 (2004) 1769–1777.
[19] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[20] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.