Optimizing resources in model selection for support vector machine


Mathias M. Adankon∗, Mohamed Cheriet
Laboratory for Imagery, Vision, and Artificial Intelligence, ÉTS, 1100 Notre Dame-Ouest, Montréal, Canada H3C 1K3

Pattern Recognition 40 (2007) 953–963, www.elsevier.com/locate/pr, doi:10.1016/j.patcog.2006.06.012
Received 16 November 2005; received in revised form 26 April 2006; accepted 6 June 2006

Abstract

Tuning support vector machine (SVM) hyperparameters is an important step in achieving a high-performance learning machine. It is usually done by minimizing an estimate of the generalization error based on bounds of the leave-one-out (LOO) error, such as the radius-margin bound, or on performance measures such as the generalized approximate cross-validation (GACV), the empirical error, etc. The usual automatic methods for tuning the hyperparameters require inverting the Gram–Schmidt matrix or solving an additional quadratic programming problem. In the case of a large data set, these methods add a huge amount of memory and a long CPU time to the already significant resources used in SVM training. In this paper, we propose a fast method based on an approximation of the gradient of the empirical error, along with incremental learning, which reduces the resources required both in terms of processing time and of storage space. We tested our method on several benchmarks, which produced promising results confirming our approach. Furthermore, it is worth noting that the time gain increases as the data set grows.

© 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Model selection; SVM; Kernel; Hyperparameters; Optimizing time

1. Introduction

Support vector machines (SVMs) are classifiers based on the margin-maximization principle [1]. They perform structural risk minimization, which was introduced to machine learning by Vapnik [2,3] and which yields excellent generalization performance. However, the generalization capacity of an SVM depends on its hyperparameters, such as C, which controls the amount of overlap, and the kernel parameters. As an illustration, Fig. 1 shows the variation of the error rate on a validation set versus the variance of the Gaussian kernel; the task is to classify the digits 1 and 8 taken from the MNIST benchmark [4]. Clearly, the best performance is obtained with an optimal choice of kernel parameters. Several methods [1,5–11] have been developed for choosing the best hyperparameter values.

∗ Corresponding author.

E-mail addresses: [email protected] (M.M. Adankon), [email protected] (M. Cheriet).

In 2001, Chapelle et al. [12] proposed for the first time an automatic method for selecting SVM hyperparameters, using criteria that approximate the error of the leave-one-out (LOO) procedure: the span bound and the radius-margin bound. However, the techniques proposed in Ref. [12] based on these criteria are quite costly in computing time, requiring among other things the inversion of the Gram–Schmidt matrix of the support vectors during the gradient computation and the resolution of an additional quadratic problem. In 2003, Chung et al. [13] proposed a modified radius-margin bound that does not require inverting the Gram–Schmidt matrix, but still needs the resolution of an additional quadratic programming (QP) problem. Recently, Ayat et al. [14] proposed a new criterion based on the empirical error, where an empirical estimate of the generalization error is minimized over a validation set. This criterion is a linear function that does not require solving another QP problem apart from the SVM training. Furthermore, the cost function expressing the empirical error is differentiable. Using the empirical error criterion for SVM model selection is therefore simple and convenient. However, when a large data set is used, the cost of this technique, added to that of training the SVM, becomes quite significant. Hence, although this method is less complex and faster than the others, it still requires a large CPU time and memory size for large databases.

In this paper, we propose two strategies that make it possible to optimize the resources used in model selection for SVMs. This optimization consists in reducing the CPU time and the memory size in order to facilitate the integration of model selection into real applications at the least cost. We develop an improved method of model selection based on the empirical error by using two techniques: (i) an approximation of the gradient of the error, which enables us to determine the gradient without inverting the Gram–Schmidt matrix, thus reducing the complexity of the gradient computation; and (ii) an incremental learning strategy, which makes it possible to optimize both the parameters and the hyperparameters of the SVM using a combined method.

This paper is structured as follows. In Section 2, we review the optimization of the SVM kernel parameters using the empirical error. In Sections 3 and 4, we develop, respectively, the approximation of the gradient and the incremental learning technique for the SVM. In Section 5, we analyze the space and the computation time of our algorithm. In Section 6, we present the experimental results confirming the efficiency of our algorithm. In the last section, we conclude the paper.


2. Model selection using empirical error criterion

In this section, we describe the optimization of the SVM kernel parameters using the empirical error criterion developed in Refs. [14–16]. The idea behind this technique is to minimize the generalization error through a validation data set.

We first consider a binary classification problem. Let $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ be a data set with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. In the feature space, the optimal separating hyperplane of the SVM is defined by

$$f(x_i) = \sum_{j=1}^{N_{VS}} \alpha_j y_j k(x_j, x_i) + b, \qquad (1)$$

where the $\alpha_j$ and $b$ are found by solving the quadratic optimization problem that maximizes the margin:

$$\text{maximize: } W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$$
$$\text{subject to: } \sum_{i=1}^{\ell} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, \ell. \qquad (2)$$

Fig. 1. Validation error rate for different values of the variance of the RBF kernel for a binary recognition problem: digit 1 versus digit 8.
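As a concrete illustration of Eq. (1) (not part of the original paper), the decision function can be evaluated explicitly from the support vectors, the products $\alpha_j y_j$ and the bias $b$ of a trained SVM. The sketch below assumes scikit-learn and NumPy are available and uses a toy data set of our own; in scikit-learn, `dual_coef_` already stores the products $\alpha_j y_j$.

```python
import numpy as np
from sklearn.svm import SVC

# Toy binary problem (ours, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", C=100.0, gamma=gamma).fit(X, y)

# Evaluate Eq. (1): f(x) = sum_j alpha_j y_j k(x_j, x) + b, with an RBF kernel.
d2 = ((X[:, None, :] - clf.support_vectors_[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * d2)                                  # k(x_j, x_i)
f = K @ clf.dual_coef_.ravel() + clf.intercept_[0]       # Eq. (1)

print(np.allclose(f, clf.decision_function(X)))          # expected: True
```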

In Eq. (1), $j = 1, \ldots, N_{VS}$ indexes the support vectors, i.e. the samples with non-zero $\alpha_j$, and $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is the kernel function [17–19]. When we define $t_i = (y_i + 1)/2$, the bipolar label becomes unipolar, with $t_i = 0$ for observations of the negative class and $t_i = 1$ for those of the positive class. The empirical error is given by

$$E_i = |t_i - \hat{p}_i|, \qquad (3)$$

where $\hat{p}_i$ is the estimated posterior probability corresponding to the observation $x_i$. This probability is estimated with the logistic function proposed by Platt [20]. This function has two parameters, $A$ and $B$, and its form is

$$\hat{p}_i = \frac{1}{1 + \exp(A f_i + B)}, \qquad (4)$$

where $f_i = f(x_i)$. The parameters $A$ and $B$ are fitted by minimizing the cross-entropy error [21], as proposed by Lin et al. [22]. Using Platt's model to estimate the probability makes it possible to quantify the distance from an observation to the hyperplane determined by the SVM with a continuous and differentiable function. Indeed, the probability estimate calibrates the distance $f(x_i)$ between 0 and 1, with the following properties:

• the observations of the positive class that are well classified and located outside the margin have probabilities very close to 1;
• the observations of the negative class that are well classified and located outside the margin have probabilities very close to 0;
• the observations located inside the margin have probabilities that vary approximately proportionally with $f(x_i)$.
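A minimal sketch of Eqs. (3) and (4), assuming NumPy; the function and variable names are ours, and $A$ and $B$ are assumed to have been fitted beforehand by minimizing the cross-entropy error as in Refs. [20,22].

```python
import numpy as np

def platt_probability(f, A, B):
    """Estimated posterior probability, Eq. (4): 1 / (1 + exp(A*f + B))."""
    return 1.0 / (1.0 + np.exp(A * f + B))

def empirical_error(f, y, A, B):
    """Mean empirical error (1/N) sum_i |t_i - p_hat_i| over a validation set, Eq. (3).

    f : array of SVM outputs f(x_i) on the validation set
    y : array of bipolar labels in {-1, +1}
    """
    t = (y + 1) / 2.0                    # unipolar targets t_i in {0, 1}
    p_hat = platt_probability(f, A, B)
    return np.mean(np.abs(t - p_hat))
```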


Fig. 2. Variation of the estimated probability w.r.t. the output of the SVM.

Thus, with the empirical error criterion, only the misclassified observations and those located inside the margin determined by the SVM matter, since the other observations contribute almost zero error. Consequently, minimizing the empirical error tends to reduce the number of support vectors (the observations lying in the margin). In other words, minimizing the empirical error selects hyperparameters that define a margin containing fewer observations. We thus construct a machine with fewer support vectors, which reduces the complexity of the classifier. The tests reported in Ref. [16] confirm this property of the SVM constructed using the empirical error. In fact, we have

$$|t_i - \hat{p}_i| = \begin{cases} \hat{p}_i & \text{if } y_i = -1,\\ 1 - \hat{p}_i & \text{if } y_i = 1. \end{cases} \qquad (5)$$

Then $E_i \rightarrow 0$ when $\hat{p}_i \rightarrow 0$ for $y_i = -1$ and when $\hat{p}_i \rightarrow 1$ for $y_i = 1$. Consequently, $E_i \rightarrow 0$ if $f(x_i) < -1$ for $y_i = -1$ and $f(x_i) > 1$ for $y_i = 1$ (see Fig. 2). We notice that minimizing the empirical error pushes as many observations as possible to be classified outside the margin. This criterion is thus useful for regularizing the maximization of the margin for SVMs.

We assume that the kernel function depends on one or several parameters, encoded in the vector $\theta = (\theta_1, \ldots, \theta_n)$. The optimization of these parameters is performed by a gradient descent minimization algorithm [23], where the objective function is $E = \sum_i E_i$ (see Fig. 3).

Fig. 3. Model selection using the empirical error. The flowchart loops over the following steps until convergence: initialize the parameters; train the SVM with the current parameters; estimate A and B for the sigmoid; estimate the probability of error; compute the gradient of the error; correct the parameters.
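The loop of Fig. 3 can be summarized as follows; this is only a sketch of our reading of the flowchart, where train_svm, fit_sigmoid and empirical_error_grad are placeholder callables standing in for the routines described in the text.

```python
import numpy as np

def select_kernel_parameters(theta0, train_svm, fit_sigmoid, empirical_error_grad,
                             lr=0.1, max_iter=50, tol=1e-4):
    """Gradient descent on the kernel parameters theta, following Fig. 3."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        model = train_svm(theta)                     # train SVM with current parameters
        A, B = fit_sigmoid(model)                    # estimate A and B for the sigmoid
        E, grad = empirical_error_grad(model, A, B)  # empirical error and its gradient
        theta = theta - lr * grad                    # correct the parameters
        if np.linalg.norm(grad) < tol:               # converged?
            break
    return theta
```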

3. Gradient approximation of the empirical error

The derivative of the empirical error with respect to $\theta$ is evaluated on the validation data set. Let $N$ be the size of the validation data set; then

$$\frac{\partial E}{\partial \theta} = \frac{\partial}{\partial \theta}\left(\frac{1}{N}\sum_{i=1}^{N} E_i\right) = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial E_i}{\partial \theta}. \qquad (6)$$

When we consider Eqs. (3) and (4), the gradient of the empirical error for each validation sample may be written as

$$\frac{\partial E_i}{\partial \theta} = \frac{\partial E_i}{\partial \hat{p}_i} \cdot \frac{\partial \hat{p}_i}{\partial f_i} \cdot \frac{\partial f_i}{\partial \theta}. \qquad (7)$$

According to Eq. (5), we can write

$$\frac{\partial E_i}{\partial \hat{p}_i} = \begin{cases} +1 & \text{if } y_i = -1,\\ -1 & \text{if } y_i = 1. \end{cases}$$


In condensed form, we have
$$\frac{\partial E_i}{\partial \hat{p}_i} = -y_i. \qquad (8)$$
According to Eq. (7), the second part of the gradient is expressed by
$$\frac{\partial \hat{p}_i}{\partial f_i} = -A\,\hat{p}_i(1 - \hat{p}_i). \qquad (9)$$
The derivative $\partial f_i/\partial \theta$ must then be estimated. Considering Eq. (1), we have
$$\frac{\partial f_i}{\partial \theta} = \frac{\partial}{\partial \theta}\left(\sum_{j=1}^{N_{VS}} \alpha_j y_j k(x_j, x_i) + b\right) = \sum_{j=1}^{N_{VS}} y_j\left[\alpha_j\frac{\partial k(x_j, x_i)}{\partial \theta} + k(x_j, x_i)\frac{\partial \alpha_j}{\partial \theta}\right] + \frac{\partial b}{\partial \theta}. \qquad (10)$$
This derivative is composed of two parts. We may include the bias $b$ in the parameter vector, writing $\tilde{\alpha} = (\alpha_1, \ldots, \alpha_{N_{VS}}, b)$. Then we use the following approximation proposed by Chapelle et al. [12]:
$$\frac{\partial \tilde{\alpha}}{\partial \theta} = -H^{-1}\frac{\partial H}{\partial \theta}\tilde{\alpha}, \qquad (11)$$
where
$$H = \begin{pmatrix} K^Y & Y \\ Y^T & 0 \end{pmatrix}. \qquad (12)$$
In Eq. (12), $K^Y$ represents the Hessian matrix of the SVM objective function; $H$ is called the modified Gram–Schmidt matrix, and its size is $(N_{VS}+1) \times (N_{VS}+1)$. The components $K^Y_{ij}$ are equal to $y_i y_j k(x_i, x_j)$, and $Y$ is a vector of size $N_{VS} \times 1$ containing the support vector labels $y_i$. During the experiments, we noted that the term $(\partial \alpha_j/\partial \theta)\,k(x_j, x_i)$ is negligible in comparison with $(\partial k(x_j, x_i)/\partial \theta)\,\alpha_j$. We therefore approximate Eq. (10) by
$$\frac{\partial f_i}{\partial \theta} \approx \sum_{j=1}^{N_{VS}} y_j \alpha_j \frac{\partial k(x_j, x_i)}{\partial \theta}. \qquad (13)$$

Fig. 4. Data of the XOR problem.

With this approximation, we do not need to invert the matrix H, an operation whose time complexity is at least O((N_VS + 1)²). Therefore, the CPU time and the memory required by the gradient descent algorithm are lower than with traditional approaches. To assess our methodology, we tested it on numerous synthetic and real problems.
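For the RBF kernel $k(x_j, x_i) = \exp(-\gamma\|x_j - x_i\|^2)$, the parameterization used in Fig. 1, we have $\partial k/\partial \gamma = -\|x_j - x_i\|^2\, k(x_j, x_i)$, so the approximate gradient of Eqs. (6)–(9) and (13) can be sketched as below. This is our own NumPy illustration; the function and argument names are assumptions, not code from the paper.

```python
import numpy as np

def approx_grad_gamma(X_val, y_val, X_sv, alpha_y_sv, b, gamma, A, B):
    """Approximate dE/dgamma on a validation set, using Eqs. (6)-(9) and (13).

    X_sv       : support vectors of the trained SVM
    alpha_y_sv : products alpha_j * y_j of the support vectors
    """
    d2 = ((X_val[:, None, :] - X_sv[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2
    K = np.exp(-gamma * d2)                                      # k(x_j, x_i)
    f = K @ alpha_y_sv + b                                       # Eq. (1)
    p_hat = 1.0 / (1.0 + np.exp(A * f + B))                      # Eq. (4)
    dE_dp = -y_val                                               # Eq. (8)
    dp_df = -A * p_hat * (1.0 - p_hat)                           # Eq. (9)
    df_dgamma = (-d2 * K) @ alpha_y_sv                           # Eq. (13), dk/dgamma = -d2 * k
    return np.mean(dE_dp * dp_df * df_dgamma)                    # Eq. (6)
```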

In Figs. 5 and 6, we plot the empirical error and the validation error rate versus the number of iterations of the gradient descent algorithm for the exclusive OR (XOR) problem with overlap (see Fig. 4). The classes have equal priors and their conditional distributions are, respectively, $p(x\mid y=1) = \frac{1}{2}N(\mu_{11}, \Sigma_1) + \frac{1}{2}N(\mu_{12}, \Sigma_1)$ and $p(x\mid y=-1) = \frac{1}{2}N(\mu_{21}, \Sigma_2) + \frac{1}{2}N(\mu_{22}, \Sigma_2)$, where $\Sigma_1 = \mathrm{diag}(0.5, 0.5)$, $\Sigma_2 = \mathrm{diag}(1, 0.1)$, $\mu_{11} = (-2, 0)^T$, $\mu_{12} = (2, -2)^T$, $\mu_{21} = (2, 1)^T$ and $\mu_{22} = (-2, -3)^T$. We first noted that minimizing the empirical error also reduced the validation error rate, whether we used the complete gradient or its approximation. We also noticed that the curves are essentially identical in the two cases. Hence, we can compute the gradient of the empirical error with the approximation of Eq. (13), without inverting the modified Gram–Schmidt matrix. We also tested this approximation on the "thyroid" database of the UCI benchmark,¹ and obtained the same result (Figs. 5 and 6).

4. Optimization of kernel parameters with incremental learning

SVMs, unlike other classifiers, offer good generalization even when the training set is small. We exploit this property to develop our method, which consists of beginning the optimization process with a subset S of the training set, called the working set, and adding ΔS, a part of the remaining samples, at each step until convergence. The idea of incremental learning for SVMs was introduced in Ref. [24], where the training preserves only the support vectors at each incremental step. In this work, however, we preserve the samples that are in the margin or close to it, because during the process the samples that are outside the margin can become support vectors when the kernel parameters are updated in later steps.

¹ The database is available at http://ida.first.fhg.de/projects/bench/benchmarks.htm.


Fig. 5. Empirical error and validation error rate versus the number of iterations during the optimization process with the complete gradient.

Fig. 7. Behavior of the ΔS size w.r.t. the gradient norm value.

Fig. 6. Empirical error and validation error rate versus the number of iterations during the optimization process with the approximate gradient.

In our case, then, we remove from the working set S only the samples that are far from the temporary margin, and we add ΔS, whose size is chosen dynamically with respect to the behavior of the gradient norm. The idea is to keep the working set small as long as the current value of the kernel parameters is far from the optimal value: if the gradient norm is large, the size of the ΔS added to S is small, because a large gradient norm implies that the current parameter values are not close to the optimum (see Fig. 7). At each incremental step, we update the working set S as well as the kernel parameters; then we retrain the SVM and select the optimal parameters by minimizing the empirical error. All in all, this combination makes it possible to optimize both the parameters α_i of the SVM and the kernel parameters, and to drastically save on computing time.

Fig. 8. Algorithm for the optimization of the kernel parameters with incremental learning.

Fig. 8 presents the details of the incremental learning algorithm.
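The following sketch is our reading of Figs. 7 and 8, not the authors' code: the working set starts at a fraction of the training data, only samples in or near the margin are kept at each step, and the chunk ΔS added at each step shrinks when the gradient norm is large. Here train_svm and grad_empirical_error are placeholder callables, train_svm is assumed to return an object with a scikit-learn-style decision_function method, and the margin tolerance and chunk-size rule are illustrative choices of ours.

```python
import numpy as np

def incremental_model_selection(X, y, theta0, train_svm, grad_empirical_error,
                                init_frac=0.4, n_chunks=5, lr=0.1,
                                max_iter=50, tol=1e-4, margin_tol=0.2):
    n = len(X)
    perm = np.random.permutation(n)
    n0 = int(init_frac * n)
    S, rest = list(perm[:n0]), list(perm[n0:])      # initial working set and remainder
    theta = np.asarray(theta0, dtype=float)

    for _ in range(max_iter):
        model = train_svm(X[S], y[S], theta)        # retrain the SVM on the working set
        grad = grad_empirical_error(model, theta)   # gradient of the empirical error
        theta = theta - lr * grad                   # update the kernel parameters
        if np.linalg.norm(grad) < tol and not rest:
            break
        # keep only the samples in or close to the margin, |f(x)| <= 1 + margin_tol
        f = model.decision_function(X[S])
        S = [i for i, fi in zip(S, f) if abs(fi) <= 1.0 + margin_tol]
        # add a chunk whose size is small when the gradient norm is large (Fig. 7)
        chunk = max(1, int(len(rest) / (n_chunks * (1.0 + np.linalg.norm(grad)))))
        S += rest[:chunk]
        rest = rest[chunk:]
    return theta
```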

5. Analysis of space and runtime complexity

Before analyzing the memory space and the computation time required by our algorithm, we introduce the following notation, as in Ref. [25]:


ℓ: size of the original training set;
N: size of the validation set;
η: factor that shows how many support vectors are obtained;
β: factor that shows how many non-support vectors are retained for the next learning step;
λ: factor that shows the initial size of S;
S_t: size of S at the t-th iteration;
N_t: size of the support vector set at the t-th iteration;
N_VS: final size of the support vector set;
r: the number of times the set ΔS is added;
T: the total number of iterations for ending the optimization process.

For simplicity, the following assumptions are made:

• the size of ΔS is (1 − λ)ℓ/r;
• after each iteration, we retain a fraction (η + β) of the data for the next step;
• c_1 is the cost of each multiplication operation.

5.1. Memory space

The memory size needed for the model selection is reduced in two ways:

1. Since we use the approximation of the gradient, the matrices H, H⁻¹ and ∂H/∂θ, each of size (N_VS + 1) × (N_VS + 1), are not loaded into memory. This saves 12(N_VS + 1)² bytes at each iteration, following an analysis² similar to the one presented in Ref. [25].
2. Only the working set S, which is small compared with the original training set, is loaded into memory, further reducing the memory requirement.

² The size of a float data type is 4 bytes.

5.2. Computation cost

It is difficult to perform an exact analysis of the computation cost, but we estimate the cost of the operations by taking into account only the multiplication operations.

5.2.1. Estimation of the working set size at each iteration

The size of the working set changes during the optimization process as samples are added and removed. For the first iteration, $S_1 = \lambda\ell$. For the second iteration, we preserve the samples that are in the margin or close to it, whose number is $(\eta+\beta)\lambda\ell$, and we add $(1-\lambda)\ell/r$ samples:
$$S_2 = (\eta+\beta)\lambda\ell + \frac{(1-\lambda)\ell}{r}.$$
For the third iteration, we preserve $(\eta+\beta)\lambda\ell + (\eta+\beta)(1-\lambda)\ell/r$ samples and we add $(1-\lambda)\ell/r$ samples:
$$S_3 = (\eta+\beta)\lambda\ell + (\eta+\beta)\frac{(1-\lambda)\ell}{r} + \frac{(1-\lambda)\ell}{r} = (\eta+\beta)\left[\lambda\ell + \frac{(1-\lambda)\ell}{r}\right] + \frac{(1-\lambda)\ell}{r}.$$
Then, for the $t$th iteration,
$$S_t = (\eta+\beta)\left[\lambda\ell + \frac{(1-\lambda)\ell(t-2)}{r}\right] + \frac{(1-\lambda)\ell}{r} \quad \text{if } 2 \le t \le r+1,$$
and $S_t = (\eta+\beta)\ell$ if $t > r+1$. To summarize, the size of the working set $S$ is
$$S_t = \begin{cases} \lambda\ell & \text{if } t = 1,\\ (\eta+\beta)\left[\lambda\ell + \dfrac{(1-\lambda)\ell(t-2)}{r}\right] + \dfrac{(1-\lambda)\ell}{r} & \text{if } 2 \le t \le r+1,\\ (\eta+\beta)\ell & \text{if } r+2 \le t \le T. \end{cases}$$

5.2.2. Estimation of the support vector set size at each iteration

At each iteration, the size of the support vector set is proportional to the number of samples used in training. Thus, at the beginning, we have $N_1 = \eta\lambda\ell$. For the next iterations, the number of support vectors increases linearly until $t = r+1$:
$$N_t = \eta\lambda\ell + \eta\frac{(1-\lambda)\ell}{r}(t-1) = \eta\ell\left[\lambda + \frac{(1-\lambda)(t-1)}{r}\right] \quad \text{if } t \le r+1. \qquad (14)$$
For the other iterations, $t > r+1$, we have $N_t = \eta\ell$.

5.2.3. Learning cost

According to Dong et al. [25], the learning cost of the SVM at each iteration is
$$g_{app}^{t} = \frac{8c_1(S_t)^2}{\nu}, \qquad (15)$$
with $\nu$ a constant. Then, for all iterations,
$$g_{app} = \sum_{t=1}^{T} g_{app}^{t} = \frac{8c_1}{\nu}\sum_{t=1}^{T}(S_t)^2 \le \frac{8c_1}{\nu}\left(\sum_{t=1}^{T} S_t\right)^2, \qquad (16)$$
where $\sum_{t=1}^{T} S_t = \lambda\ell + \sum_{t=2}^{r+1}[e_1(t-2)+e_2] + \sum_{t=r+2}^{T}(\eta+\beta)\ell$, with $e_1 = (\eta+\beta)(1-\lambda)\ell/r$ and $e_2 = (\eta+\beta)\lambda\ell + (1-\lambda)\ell/r$. Following a development,³ we find
$$g_{app} \le \frac{8c_1\ell^2}{\nu}\left[1 + (\eta+\beta)\left(T - \frac{(1-\lambda)(r+1)}{2} - 1\right)\right]^2. \qquad (17)$$
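The simplification leading to Eq. (17) can be checked symbolically; the sketch below is ours, uses SymPy with the symbols introduced above, and verifies that the sum of the working set sizes equals the bracketed closed form.

```python
import sympy as sp

l, lam, eta, beta, r, T, t = sp.symbols('ell lambda eta beta r T t', positive=True)

e1 = (eta + beta) * (1 - lam) * l / r
e2 = (eta + beta) * lam * l + (1 - lam) * l / r

# Sum of the working set sizes S_t over all T iterations (Section 5.2.1).
total_S = lam * l \
    + sp.summation(e1 * (t - 2) + e2, (t, 2, r + 1)) \
    + (T - r - 1) * (eta + beta) * l

# Closed form inside the square of Eq. (17).
closed_S = l * (1 + (eta + beta) * (T - (1 - lam) * (r + 1) / 2 - 1))

print(sp.simplify(total_S - closed_S))   # expected: 0
```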

When we analyze this cost, we can draw two conclusions:

1. The training cost with our two strategies is lower than that of the initial technique proposed by Ayat et al. [14].
2. The cost can be further reduced if we choose good values for λ and r.

5.2.4. Cost of gradient computation

We evaluate the cost of the gradient computation without counting the estimation of the probability, the kernel function and its derivative. When we consider Eqs. (7)–(9) and (13), the cost of the gradient computation for the $t$th iteration is
$$g_{grad}^{t} = \sum_{i=1}^{N}(3 + 2N_t)c_1 = N c_1(3 + 2N_t). \qquad (18)$$
By replacing the number of support vectors at each iteration with the expression in Eq. (14), the cost of the gradient computation over all iterations until convergence is:⁴
$$g_{grad} = N c_1\left[\sum_{t=1}^{r+1} 2\eta\ell\left(\lambda + \frac{(1-\lambda)(t-1)}{r}\right) + 2\eta\ell(T-r-1) + 3T\right]$$
$$= N c_1\left[3T + 2\eta\ell\left(T - \frac{(1-\lambda)(r+1)}{2}\right)\right] = N c_1\left[3T + 2N_{VS}\left(T - \frac{(1-\lambda)(r+1)}{2}\right)\right]. \qquad (19)$$
We can conclude from the preceding expression that the time complexity of the algorithm (i.e., the computational effort it requires) is $O(N \cdot T \cdot N_{VS})$, while it is $O(N \cdot T \cdot N_{VS}^2)$ when we compute the gradient without the approximation.
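As a small numeric illustration of the cost expressions (the numbers below are ours, chosen only to show orders of magnitude), the working set size, the support vector count and the gradient cost of Eq. (19) can be evaluated directly:

```python
def working_set_size(t, ell, lam, eta, beta, r):
    """S_t, Section 5.2.1."""
    if t == 1:
        return lam * ell
    if t <= r + 1:
        return (eta + beta) * (lam * ell + (1 - lam) * ell * (t - 2) / r) + (1 - lam) * ell / r
    return (eta + beta) * ell

def support_vector_count(t, ell, lam, eta, r):
    """N_t, Eq. (14)."""
    if t <= r + 1:
        return eta * ell * (lam + (1 - lam) * (t - 1) / r)
    return eta * ell

def gradient_cost(N, ell, lam, eta, r, T, c1=1.0):
    """Total gradient computation cost, Eq. (19)."""
    return N * c1 * (3 * T + 2 * eta * ell * (T - (1 - lam) * (r + 1) / 2))

# Example: 10,000 training samples, 2,000 validation samples, 40% initial working set,
# 20% support vectors, r = 5 added chunks, T = 50 iterations.
print(working_set_size(3, ell=10_000, lam=0.4, eta=0.2, beta=0.1, r=5))
print(support_vector_count(3, ell=10_000, lam=0.4, eta=0.2, r=5))
print(gradient_cost(N=2_000, ell=10_000, lam=0.4, eta=0.2, r=5, T=50))
```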

³ Details concerning this development are presented in Appendix A.
⁴ Details concerning this expression are presented in Appendix B.

6. Experiments

6.1. Experimental results

We used two types of benchmark databases: the first is the UCI benchmark tested by Chapelle et al. [12] and by Rätsch et al. [26], and the second is the MNIST database [4].

6.1.1. UCI benchmark

We used five data sets from the UCI benchmark repository: breast cancer, diabetes, heart, thyroid and titanic. Each data set is composed of 100 different training and test splits of a binary classification problem. We followed the same experimental setup as in Refs. [12,26]: on each of the first five training and test sets, the kernel parameters are optimized using our algorithm, and the final model parameters are computed as the median of the five estimates. As in Refs. [12,26], RBF kernels are employed. The results obtained with the different model selection techniques are reported in Table 1. Our results are similar to those obtained with cross-validation [26] or with Chapelle's methods [12]. However, the gain in complexity with our algorithm is significant because, on the one hand, we do not invert the matrix H of size (N_VS + 1) × (N_VS + 1) and, on the other hand, we use an incremental learning strategy.


Table 1
Test error found by different algorithms for selecting the SVM parameters. The first column reports the results from [26] and the next two columns are from [12].

                Cross-validation [26]   Radius-margin bound [12]   Span bound [12]   Our approach
Breast cancer   26.04 ± 4.7             26.84 ± 4.71               25.59 ± 4.18      25.48 ± 4.38
Diabetes        23.53 ± 1.73            23.25 ± 1.70               23.19 ± 1.67      23.41 ± 1.68
Heart           15.95 ± 3.26            15.92 ± 3.18               16.13 ± 3.11      15.96 ± 3.13
Thyroid          4.80 ± 2.19             4.62 ± 2.03                4.56 ± 1.97       4.70 ± 2.07
Titanic         22.42 ± 1.02            22.88 ± 1.23               22.5 ± 0.88       22.90 ± 1.16

6.1.2. MNIST database

In this section, we test our algorithms on an isolated handwritten digit recognition problem using the MNIST database. The MNIST (modified NIST) database [4] is a subset extracted from the NIST database. The digits have been size-normalized and centered in a fixed-size image. The learning set contains 60,000 samples (50,000 for training and 10,000 for validation), while the test set consists of 10,000 other samples. For the test, we use pairwise coupling: we train 45 classifiers, and each SVM is optimized using the empirical error criterion. We use the RBF kernel with C = 100, and the size of the set S is initialized at 2500. For the test phase, we applied different types of couplings after mapping the 45 SVM outputs into probabilities. The probability that a given observation belongs to class $\omega_i$ ($i = 1, \ldots, 10$) is

$$p_i = \frac{1}{45}\sum_{j \ne i} \Phi(p_{ij}), \qquad (20)$$

where $p_{ij} = P(x \in \omega_i \mid x \in \omega_i \cup \omega_j)$ and $\Phi$ is a coupling function. For a complete description, see Ref. [27], where the following functions are reported:

$$PWC1(x) = \begin{cases} 1 & \text{if } x > 0.5,\\ 0 & \text{otherwise,} \end{cases} \qquad PWC2(x) = x, \qquad PWC3(x) = \frac{1}{1 + e^{-12(x - 0.5)}},$$
$$PWC4(x) = \begin{cases} 1 & \text{if } x > 0.5,\\ x & \text{otherwise,} \end{cases} \qquad PWC5(x) = \begin{cases} x & \text{if } x > 0.5,\\ 0 & \text{otherwise.} \end{cases}$$
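A minimal sketch of Eq. (20), assuming NumPy; the names are ours. Here pairwise_p is assumed to hold the pairwise probabilities $p_{ij}$ for every ordered pair $i \ne j$, and the 1/45 normalization corresponds to the 45 pairwise classifiers of the 10-class MNIST problem.

```python
import numpy as np

def pwc1(x):
    return 1.0 if x > 0.5 else 0.0

def pwc3(x):
    return 1.0 / (1.0 + np.exp(-12.0 * (x - 0.5)))

def class_probabilities(pairwise_p, coupling=pwc3, n_classes=10):
    """Combine pairwise probabilities into class probabilities, Eq. (20)."""
    p = np.zeros(n_classes)
    for i in range(n_classes):
        p[i] = sum(coupling(pairwise_p[(i, j)])
                   for j in range(n_classes) if j != i) / 45.0
    return p
```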

We tested three algorithms for optimizing the kernel parameters: the initial algorithm reported in Ref. [16], the modified algorithm using the approximation of the gradient, and the algorithm based on incremental learning combined with the gradient approximation. The results are identical in the three cases, and are shown in Table 2. However, there is a difference in CPU time, which becomes very significant when the size of the training set is large. In the next section, we show the relationship between the size of the training set and the CPU time during the optimization process for the three algorithms.

Table 2
Results obtained with MNIST using several types of couplings

Coupling model    Error test (%)
PWC1              1.6
PWC2              1.5
PWC3              1.7
PWC4              1.6
PWC5              1.5

Fig. 9. Variation of the CPU time reduction rate with the training set size (complete gradient versus approximation of the gradient).

6.2. Properties of our strategies in optimizing resources

In this section, we analyze the impact of each strategy on the reduction of CPU time with respect to the training set size. We use the MNIST database in this experiment, and the size of the training set is varied from 1000 to 5000 samples per class. We first test each strategy separately, and then combine the two strategies. Figs. 9 and 10 show the variation of the CPU time when we use, respectively, only the approximation of the gradient and only the incremental learning. For each curve obtained, we notice that the CPU time reduction increases with the size of the training set.

Fig. 10. Variation of the CPU time reduction rate with the training set size (traditional learning versus incremental learning).

The CPU time reduction ranges from 10% to 26% when using the approximation of the gradient, and from 10% to 41% when using the incremental learning strategy. Fig. 11 shows the global impact of the two strategies combined. As with each individual technique, the reduction grows with the training set size. We also notice that the combined reduction is more significant than that obtained with each technique alone, although the two reductions do not simply add up. In this case, the reduction varies between 18% and 46%.

Fig. 11. Variation of the CPU time reduction rate with the training set size (initial algorithm versus the algorithm using the gradient approximation combined with incremental learning).

Fig. 12. Variation of the CPU time with the initial size of S (as a percentage of the total training set size), for total sizes of 6000, 8000 and 10,000.

6.3. Impact of the initial size of S

In Section 5.2.3, we showed through the time complexity analysis that the initial size of S influences the training time over the entire optimization process and, by extension, the final computing time. We test on the data representing the digits 0 and 1 drawn from the MNIST database, which form a binary problem. We vary the initial size of the working set S and, after convergence of the process, we record the computing time. We consider total training set sizes of 6000, 8000 and 10,000, and in each case we set the initial size of S successively to 20%, 30%, 40%, 50%, 60% and 70% of the total size. Fig. 12 shows the results in the form of curves, which confirm conclusion 2 of Section 5.2.3: a small value of λ reduces the computing time. However, we note that a very small value of λ disturbs the convergence of the algorithm. We find empirically that the ideal value of λ is about 0.4.

6.4. Discussion

We developed a fast method based on empirical error minimization for SVM model selection. We use two strategies, gradient approximation and incremental learning, to optimize the CPU time and the memory storage. We tested our method on several benchmarks and obtained encouraging results. The performance of the classifier constructed with our method is similar to that obtained with other methods; Tables 1 and 2 show the test error rates confirming the good performance of the models built. In addition, with our method, the CPU time and memory storage are reduced in comparison with the other methods.

Figs. 9–12 show this reduction as obtained in the experiments and its behavior with respect to the size of the training data; the reduction becomes substantial when the training set is large. In this paper, we use a gradient descent method to minimize the empirical error for choosing the hyperparameters. It is also possible to use a genetic algorithm to optimize the hyperparameters [28]. This approach provides good solutions that enhance the generalization capacity of the SVM model. However, the genetic algorithm, like other evolutionary methods, consumes considerable resources, particularly CPU time. Thus, although a genetic algorithm can perform automatic model selection for SVMs, we do not find it appropriate for the goal we want to achieve, namely optimizing the resources required both in terms of CPU time and of memory. Moreover, as noted in the literature, the gradient-based methods used in model selection are efficient and give good solutions [12–14].

7. Conclusion

In this paper, we have described a model selection method for SVMs based on the empirical error criterion, improved by incremental learning and an approximation of the gradient. The resources required during the optimization process, i.e., the CPU time and the memory size, are drastically reduced. We have tested our method on several benchmarks, which produced promising results confirming our approach. We also studied certain properties of our algorithm and noticed that the reduction of resources grows with the size of the training set. Thus, for problems with large databases, our strategies are very effective for performing good model selection for SVMs with limited resources.


Appendix A. Learning cost details

$$g_{app} = \sum_{t=1}^{T} g_{app}^{t} = \frac{8c_1}{\nu}\sum_{t=1}^{T}(S_t)^2 \le \frac{8c_1}{\nu}\left(\sum_{t=1}^{T} S_t\right)^2,$$

where $\sum_{t=1}^{T} S_t = \lambda\ell + \sum_{t=2}^{r+1}[e_1(t-2)+e_2] + \sum_{t=r+2}^{T}(\eta+\beta)\ell$, with $e_1 = (\eta+\beta)(1-\lambda)\ell/r$ and $e_2 = (\eta+\beta)\lambda\ell + (1-\lambda)\ell/r$. Hence

$$g_{app} \le \frac{8c_1}{\nu}\left[\lambda\ell + \sum_{k=0}^{r-1}(e_1 k + e_2) + (\eta+\beta)\ell(T-r-1)\right]^2$$
$$= \frac{8c_1}{\nu}\left[\lambda\ell + e_1\frac{r(r-1)}{2} + r e_2 + (\eta+\beta)\ell(T-r-1)\right]^2$$
$$= \frac{8c_1\ell^2}{\nu}\left[1 + r\lambda(\eta+\beta) + \frac{(\eta+\beta)(1-\lambda)(r-1)}{2} + (\eta+\beta)(T-r-1)\right]^2$$
$$= \frac{8c_1\ell^2}{\nu}\left[1 + (\eta+\beta)\left(T - \frac{(1-\lambda)(r+1)}{2} - 1\right)\right]^2.$$

Appendix B. Gradient computation cost details

$$g_{grad} = N c_1\left[\sum_{t=1}^{r+1} 2\eta\ell\left(\lambda + \frac{(1-\lambda)(t-1)}{r}\right) + 2\eta\ell(T-r-1) + 3T\right]$$
$$= N c_1\left[2\eta\lambda\ell(r+1) + \frac{2\eta\ell(1-\lambda)}{r}\cdot\frac{r(r+1)}{2} + 2\eta\ell(T-r-1) + 3T\right]$$
$$= N c_1\left[2\eta\ell\left(\lambda(r+1) + \frac{(1-\lambda)(r+1)}{2} + T-r-1\right) + 3T\right]$$
$$= N c_1\left[3T + 2\eta\ell\left(T - \frac{(1-\lambda)(r+1)}{2}\right)\right] = N c_1\left[3T + 2N_{VS}\left(T - \frac{(1-\lambda)(r+1)}{2}\right)\right].$$

References

[1] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[2] V.N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer, Berlin, 1982.
[3] V.N. Vapnik, Principles of risk minimization for learning theory, in: Advances in Neural Information Processing Systems, vol. 4, Morgan Kaufmann, San Mateo, CA, 1992, pp. 831–838.
[4] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (1998) 2278–2324.

[5] G. Wahba, Y. Lin, H. Zhang, Generalized approximate cross validation for support vector machines, or, another way to look at margin-like quantities, Technical Report, Department of Statistics, University of Wisconsin, February 25, 1999.
[6] T.S. Jaakkola, D. Haussler, Probabilistic kernel regression models, in: Workshop on Artificial Intelligence and Statistics, 1999.
[7] T. Joachims, Estimating the generalization performance of an SVM efficiently, in: International Conference on Machine Learning, 2000, pp. 431–438.
[8] M. Opper, O. Winther, Mean field methods for classification with Gaussian processes, in: The 1998 Conference on Advances in Neural Information Processing Systems II, MIT Press, Cambridge, MA, 1999, pp. 309–315.
[9] M. Opper, O. Winther, Gaussian processes and SVM: mean field and leave-one-out, in: A.J. Smola, P.L. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 2000, pp. 311–326.
[10] O. Chapelle, V. Vapnik, Model selection for support vector machines, Adv. Neural Inf. Process. Syst., 1999, pp. 230–236.
[11] V. Vapnik, O. Chapelle, Bounds on error expectation for support vector machines, Neural Comput. 12 (9) (2000).
[12] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learn. 46 (1) (2002) 131–159.
[13] K.-M. Chung, W.-C. Kao, L.-L. Wang, C.-L. Sun, C.-J. Lin, Radius margin bounds for support vector machines with the RBF kernel, Neural Comput. 15 (11) (2003) 2643–2681.
[14] N.E. Ayat, M. Cheriet, C.Y. Suen, Automatic model selection for the optimization of SVM kernels, Pattern Recognition 38 (10) (2005) 1733–1745.
[15] N.E. Ayat, Sélection automatique de modèle des machines à vecteurs de support: application à la reconnaissance d'images de chiffres manuscrits, Ph.D. Thesis, École de Technologie Supérieure, University of Quebec, 2003.


[16] N.E. Ayat, M. Cheriet, C.Y. Suen, Empirical error based optimization of SVM kernels: application to digit image recognition, in: International Workshop on Handwriting Recognition, 2002, pp. 292–297.
[17] B. Schölkopf, A.J. Smola, K.-R. Müller, Kernel principal component analysis, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Machines, MIT Press, Cambridge, MA, 1998, pp. 327–352.
[18] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, 2000.
[19] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[20] J. Platt, Probabilistic outputs for support vector machines and comparison to regularized likelihood methods, in: A.J. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, 2000, pp. 61–74.
[21] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1995.
[22] H.-T. Lin, C.-J. Lin, R.C. Weng, A note on Platt's probabilistic outputs for support vector machines, Technical Report, May 2003.
[23] Y. Bengio, Gradient-based optimization of hyper-parameters, Neural Comput. 12 (8) (2000) 1889–1900.
[24] N.A. Syed, H. Liu, K.K. Sung, Incremental learning with support vector machines, in: International Joint Conference on Artificial Intelligence, 1999.
[25] J.-X. Dong, A. Krzyzak, C.Y. Suen, Fast SVM training algorithm with decomposition on very large data sets, IEEE Trans. Pattern Anal. Mach. Intell. 27 (4) (2005) 603–618.
[26] G. Rätsch, T. Onoda, K.-R. Müller, Soft margins for AdaBoost, Mach. Learn. 42 (3) (2001) 287–320.
[27] M. Moreira, E. Mayoraz, Improved pairwise coupling classification with correcting classifiers, in: ECML, 1998, pp. 160–171.
[28] G.-H. Tzeng, Y.-J. Goo, C.-H. Wu, W.-C. Fang, A real-valued genetic algorithm to optimize the parameters of support vector machine for predicting bankruptcy, Expert Systems with Applications, in press, corrected proof, available online 11 January 2006.

About the Author—MATHIAS M. ADANKON received the B.Ing. degree in Electrical Engineering from the University of Abomey-Calavi, Benin, and the Master's degree in Automated Manufacturing Engineering from the École de Technologie Supérieure, University of Quebec (Montreal, Canada). He is currently a Ph.D. student at the Laboratory for Imagery, Vision and Artificial Intelligence (LIVIA), University of Quebec. His research interests are centered on classification using kernel methods, pattern recognition and machine learning.

About the Author—PROFESSOR MOHAMED CHERIET received the B.Eng. from the University of Algiers in 1984 and the M.A.Sc. and Ph.D. in Computer Science from the University of Paris 6 in 1985 and 1988, respectively (on handwriting recognition). Since 1992 he has been a Professor of Automation Engineering at the University of Quebec (École de Technologie Supérieure), where he has been a full professor since 1998. He was the Director of the Laboratory of Imagery, Vision and Artificial Intelligence (LIVIA: www.livia.etsmtl.ca) from 2000 to 2006, and has been the Director of the SYNCHROMEDIA Consortium (pan-Canadian distributed laboratories to sustain collaboration in telepresence: www.synchromedia.etsmtl.ca) since 1998. He has held visiting appointments at IBM Almaden, Concordia University, and the University of Paris 5. In addition to document image analysis, OCR, mathematical models for image processing, pattern classification models and learning algorithms, his interests include perception in computer vision. He has published more than 150 technical papers in the field. Prof. Cheriet has served several times as Chair of international conferences: Vision Interface and the International Workshop on Frontiers of Handwriting Recognition. He has served on the Editorial Boards of the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI) and the International Journal on Document Analysis and Recognition (IJDAR). Prof. Cheriet is a Senior Member of IEEE and the Chair of the IEEE Montreal Computational Intelligent Systems (CIS) Chapter.