Journal of Systems Engineering and Electronics Vol. 19, No. 1, 2008, pp.191–197
Parameter selection of support vector machine for function approximation based on chaos optimization*

Yuan Xiaofang & Wang Yaonan

Coll. of Electrical & Information Engineering, Hunan Univ., Changsha 410082, P. R. China

(Received November 13, 2006)
Abstract: The support vector machine (SVM) is a novel machine learning method with the ability to approximate nonlinear functions with arbitrary accuracy. A good setting of its parameters is crucial for the learning results and generalization ability, and at present there is no systematic, general method for parameter selection. In this article, SVM parameter selection for function approximation is regarded as a compound optimization problem, and a mutative scale chaos optimization algorithm is employed to search for the optimal parameter values. The chaos optimization algorithm is an effective way to find a global optimum, and the mutative scale version improves the search efficiency and accuracy. Several simulation examples show the sensitivity of the SVM parameters and demonstrate the superiority of the proposed method for nonlinear function approximation.
Keywords: learning systems, support vector machines (SVM), approximation theory, parameter selection, optimization.
1. Introduction

Artificial neural networks (ANN) have the ability to approximate nonlinear functions with arbitrary accuracy, which has been validated since the late 1980s [1-2]. Nevertheless, the structure and type of an ANN are selected empirically, and the training of ANN is based on the empirical risk minimization (ERM) [3] principle, which aims only at minimizing training errors. Therefore, ANN suffers from disadvantages such as overfitting, local optima, and poor generalization ability. Support vector machines (SVM) [3,4] are new machine learning methods derived from statistical learning theory. Since the late 1990s, SVM has shown growing popularity and has been successfully applied to many areas, ranging from handwritten digit recognition and speaker identification to function approximation and time series forecasting [5-7]. Built on the structural risk minimization (SRM) [3] principle, SVM has some distinct advantages over ANN, such as global optimality, suitability for small samples, good generalization ability, and resistance to overfitting [3,6,7].
It is well known that the SVM generalization performance depends on a good setting of the hyperparameters C and ε and of the kernel parameters [8], and at present there is no systematic, general method for parameter selection. Reference [9] summarizes the existing practical approaches to parameter setting and gives practical recommendations for choosing C and ε directly from the training data and the estimated noise level. However, all these approaches (including Ref. [9]) rely on a priori knowledge, user expertise, or experimental trials, and hence cannot guarantee that the parameter values are globally optimal. In this article, the authors propose a novel approach to SVM parameter selection based on chaos optimization. In this implementation, parameter selection is regarded as a compound optimization problem, and a mutative scale chaos optimization algorithm is proposed to select suitable parameter values. The chaos optimization algorithm is an efficient and convenient way to perform global optimization [10], and to improve the search efficiency and accuracy a mutative scale version is employed, which reduces the search ranges during the search process. The practical validity of the proposed SVM parameter selection approach is illustrated using several nonlinear functions. Simulations demonstrate that the SVM whose parameters are selected by chaos optimization performs better than ANN and than SVM with other parameter selection techniques.

* This project was supported by the National Natural Science Foundation of China (60775047, 60402024).
2. SVM approximation and parameters

2.1 SVM approximation
To introduce the subject, the authors begin by outlining SVM for function approximation. Let the given training data set be represented by D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i ∈ R^d is an input vector, y_i ∈ R is its corresponding desired output, and n is the number of training data. In SVM, the original input space is mapped into a high dimensional space, called the feature space, by a nonlinear mapping x → g(x). Let f(x) be the output of the SVM corresponding to the input vector x; in the feature space a linear function is constructed

$$f(x) = w^{\mathrm{T}} g(x) + b \qquad (1)$$

where w is a coefficient vector and b is a threshold. SVM learning is obtained by minimizing the empirical risk on the training data, and the ε-insensitive loss function is used for this minimization. The loss function is defined as

$$L^{\varepsilon}(x, y, f) = |y - f(x)|_{\varepsilon} = \max(0,\; |y - f(x)| - \varepsilon) \qquad (2)$$

where ε is a positive parameter that tolerates approximation errors smaller than ε. The empirical risk is

$$R_{\mathrm{emp}}(w) = \frac{1}{n}\sum_{i=1}^{n} L^{\varepsilon}(y_i - f(x_i)) \qquad (3)$$
In addition to the ε-insensitive loss, SVM reduces the model complexity by minimizing ‖w‖². Deviations of training data x_i larger than the constant ε are measured by the slack variables ξ_i and ξ̂_i. The SVM approximation is then obtained from the following optimization problem

$$\min \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i + \hat{\xi}_i) \qquad (4)$$

$$\mathrm{s.t.}\ \ y_i - f(x_i) \leq \varepsilon + \xi_i, \quad f(x_i) - y_i \leq \varepsilon + \hat{\xi}_i, \quad \xi_i, \hat{\xi}_i \geq 0 \qquad (5)$$
where C is a positive constant to be regulated. By using the Lagrange multiplier method [3], the minimization of formula (4) leads to the following dual optimization problem

$$\max \ \sum_{i=1}^{n} y_i(\hat{\alpha}_i - \alpha_i) - \varepsilon\sum_{i=1}^{n}(\hat{\alpha}_i + \alpha_i) - \frac{1}{2}\sum_{i,j=1}^{n}(\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)K(x_i, x_j) \qquad (6)$$

$$\mathrm{s.t.}\ \ \sum_{i=1}^{n}(\hat{\alpha}_i - \alpha_i) = 0, \quad 0 \leq \hat{\alpha}_i, \alpha_i \leq C \qquad (7)$$
where α̂_i and α_i are the Lagrange multipliers, and the kernel K(x_i, x_j) is a symmetric function equivalent to the dot product in the feature space, namely

$$K(x_i, x_j) = g(x_i)^{\mathrm{T}} g(x_j) \qquad (8)$$
Here the Gaussian function is used as the kernel

$$K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2) \qquad (9)$$
Some other kernels are given below: the polynomial kernel K(x, y) = (x·y + 1)^d and the hyperbolic tangent kernel K(x, y) = tanh(c_1(x·y) + c_2). By introducing β_i = α̂_i − α_i and using the relation α̂_i α_i = 0, the optimization of formulae (6) and (7) is rewritten as

$$\max \ \sum_{i=1}^{n} y_i\beta_i - \varepsilon\sum_{i=1}^{n}|\beta_i| - \frac{1}{2}\sum_{i,j=1}^{n}\beta_i\beta_j K(x_i, x_j) \qquad (10)$$

$$\mathrm{s.t.}\ \ \sum_{i=1}^{n}\beta_i = 0, \quad -C \leq \beta_i \leq C \qquad (11)$$
The learning results for the training data set D can be derived from Eqs. (10) and (11). Note that only some of the coefficients β_i are nonzero, and the corresponding vectors x are called support vectors (SV); that is, the vectors x_i whose coefficients α̂_i − α_i are nonzero are the SV. The approximation function is then represented by the Lagrange multipliers, namely

$$f(x) = \sum_{i=1}^{P}(\hat{\alpha}_i - \alpha_i)K(x_i, x) + b \qquad (12)$$

where P is the number of SV, so f(x) is computed only from the SV. Furthermore, the constant b can be determined as well.

2.2 SVM parameters
The quality of SVM models strongly depends on the proper setting of parameters, and the SVM approximation performance is sensitive to them. For the Gaussian kernel, the parameters to be regulated are the hyperparameters C and ε and the kernel parameter σ. The problem of parameter selection is further complicated by the fact that the SVM model complexity (and hence its generalization performance) depends on all three parameters. Indeed, the values of C, σ, and ε relate to the actual function model and are not fixed for different data sets. Each parameter affects the model complexity in a different way. Parameter C determines the trade-off between the model complexity and the degree to which deviations larger than ε are tolerated. Parameter ε controls the width of the ε-insensitive zone and affects the number of SV in the optimization problem. Kernel parameter σ determines the kernel width and relates to the input range of the training data set.
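As an illustration of how these three parameters enter a practical implementation, the following minimal sketch (assuming scikit-learn's SVR, which the article does not use) fits an ε-SVR with the Gaussian kernel of Eq. (9); note that the kernel width σ maps to gamma = 1/(2σ²).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error


def fit_svr(X, y, C, sigma, epsilon):
    """Fit an epsilon-SVR with the Gaussian kernel of Eq. (9).

    X has shape (n_samples, n_features), y has shape (n_samples,).
    sigma is converted to scikit-learn's gamma = 1 / (2 * sigma^2),
    so that K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    """
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=1.0 / (2.0 * sigma ** 2))
    model.fit(X, y)
    return model


def svr_mse(model, X, y):
    """Mean square error of Eq. (14) for a fitted model on (X, y)."""
    return mean_squared_error(y, model.predict(X))


# Example: the roles of C, sigma and epsilon on toy 1-D data
X = np.linspace(0.0, 6.0, 100).reshape(-1, 1)
y = np.sin(X).ravel()
model = fit_svr(X, y, C=5.0, sigma=3.6, epsilon=0.05)
print("training MSE:", svr_mse(model, X, y))
```

In this sketch the training MSE of Eq. (14) is exactly the quantity that the parameter search of the next section seeks to minimize.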
3. Parameter optimization based on mutative scale chaos algorithm

Chaos has the properties of ergodicity, stochasticity, and "regularity", and chaotic trajectories can probe wide ranges of the search domain without being trapped in local optima. The chaos optimization algorithm is an efficient and convenient method for global optimization, and here the mutative scale chaos algorithm is applied to find optimal C, σ, and ε. During the search process the search ranges are mutatively scaled, that is, as the iterations increase the search ranges decrease, which improves the search efficiency and accuracy. Parameter selection is regarded here as a compound optimization of the parameters, and the complexity of the SVM need not be considered separately. The iterative chaotic map with infinite collapses (ICMIC) [11] is chosen as the chaos model

$$x_{n+1} = \sin(a/x_n), \qquad n = 1, 2, \cdots, \quad x_0 \neq 0 \qquad (13)$$
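For concreteness, a minimal Python sketch (illustrative only; the article provides no code) that iterates the ICMIC map of Eq. (13):

```python
import numpy as np


def icmic_sequence(x0, length, a=2.0):
    """Iterate the ICMIC map x_{n+1} = sin(a / x_n) of Eq. (13).

    x0 must be a nonzero value in [-1, 1]; the iterates remain in
    [-1, 1] and wander chaotically, which gives the ergodic search
    behaviour exploited by the parameter selection algorithm.
    """
    xs = [float(x0)]
    for _ in range(length - 1):
        xs.append(np.sin(a / xs[-1]))
    return np.array(xs)


# A short chaotic trajectory started from an arbitrary nonzero point
print(icmic_sequence(x0=0.3, length=10))
```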
Taking a = 2, the map becomes x_{n+1} = sin(2/x_n). For SVM approximation, the objective of parameter selection is to minimize the deviations between the training data outputs and the SVM outputs, and here the performance criterion is the mean square error (MSE)

$$MSE = \frac{1}{p}\sum_{k=1}^{p}[y_k - f(x_k, w)]^2 \qquad (14)$$
where p is the number of training data, y_k are the training data outputs, and f(x_k, w) are the SVM outputs. The objective of the chaos algorithm is then to search for the optimal parameters C, σ, and ε that minimize the MSE

$$\min f(x_1, \cdots, x_i) = \min MSE, \qquad a_i \leq x_i \leq b_i, \quad i = 1, 2, 3 \qquad (15)$$

Here i = 3, corresponding to the parameters C, σ, and ε, and [a_i, b_i] is the search range. As there is no systematic method for different data sets, general search ranges are determined largely from expertise, typically C ∈ [0.5, 100], σ ∈ [0.1, 100], and ε ∈ [0.01, 0.2]. For special training data samples, the search ranges can be changed. The basic steps of the mutative scale chaos optimization algorithm are as follows.

Step 1  Set initial parameters: K is the iteration counter, starting from 1. Choose random initial values x_{i,0} and set x_i^* = x_i(0), f_i^* = f_i(0). In addition, define a termination criterion: maximal iterations N or maximal acceptable mean square error MSE_d.

Step 2  The chaos iteration operates as in Eq. (13): x_{i,n+1} = sin(2/x_{i,n}), i = 1, 2, 3.

Step 3  Conversion of search ranges

$$x'_{i,n+1} = a_i + (b_i - a_i)|x_{i,n+1}| \qquad (16)$$

Hence, the chaos variables x_{i,n+1} are mapped from [−1, 1] to the search ranges [a_i, b_i].

Step 4  Replacement decision. Set x_i(K) = x'_{i,n+1} and compute f_i(K). If f_i(K) ≤ f_i^*, then f_i^* = f_i(K) and x_i^* = x_i(K); otherwise (f_i(K) > f_i^*), f_i^* and x_i^* are maintained.

Step 5  If K > N or f_i^* ≤ MSE_d, the iteration is stopped. If K < N and f_i^* > MSE_d, set K = K + 1, continue the iteration, and modify the search ranges

$$a'_i = x_i^* - \varphi(b_i - a_i), \qquad b'_i = x_i^* + \varphi(b_i - a_i) \qquad (17)$$
where φ is the mutative scale factor, a decreasing function of the iteration counter K given by

$$\varphi = \begin{cases} 1 - \left(\dfrac{K - m}{K}\right)^{2}, & K > m \\[4pt] 1, & K \leq m \end{cases} \qquad (18)$$

in which m is a given iteration threshold.
If a_i' < a_i, then set a_i' = a_i; if b_i' > b_i, then set b_i' = b_i. Go back to Step 2 for the next iteration. The final x_i^* (i = 1, 2, 3) are the optimal values of the parameters C, σ, and ε derived from the mutative scale chaos optimization. The chaos optimization algorithm searches globally, and the derived parameters minimize the MSE on the data set. As SVM has good generalization performance on training data as well as test data, the derived parameters are also adaptive and can minimize the MSE.
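The following sketch puts Steps 1-5 together in Python. It is a minimal reading of the algorithm, not the authors' implementation: the objective is an arbitrary callable (for SVM parameter selection, the MSE of Eq. (14) as a function of (C, σ, ε)), and the threshold m of Eq. (18) is an assumed tuning constant.

```python
import numpy as np


def mutative_scale_chaos_optimize(objective, lower, upper,
                                  max_iter=500, mse_target=1e-3, m=50, seed=None):
    """Mutative scale chaos optimization (Steps 1-5, Eqs. (13), (16)-(18)).

    objective : callable mapping a parameter vector to a scalar cost
                (e.g. the MSE of Eq. (14) as a function of (C, sigma, epsilon))
    lower, upper : the initial search ranges [a_i], [b_i]
    m : iteration threshold of Eq. (18) (an assumed tuning constant)
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(lower, dtype=float).copy()
    b = np.asarray(upper, dtype=float).copy()

    # Step 1: random nonzero chaos variables in [-1, 1] and an initial best point
    x_chaos = rng.uniform(0.1, 1.0, size=a.shape) * rng.choice([-1.0, 1.0], size=a.shape)
    best_x = a + (b - a) * np.abs(x_chaos)                  # Eq. (16)
    best_f = objective(best_x)

    for K in range(1, max_iter + 1):
        # Step 2: chaos iteration x_{n+1} = sin(2 / x_n), Eq. (13)
        x_chaos = np.sin(2.0 / x_chaos)

        # Step 3: map the chaos variables from [-1, 1] onto the search ranges, Eq. (16)
        x = a + (b - a) * np.abs(x_chaos)

        # Step 4: keep the best parameter vector found so far
        f = objective(x)
        if f <= best_f:
            best_f, best_x = f, x

        # Step 5: stop, or shrink the search ranges around the best point
        if best_f <= mse_target:
            break
        phi = 1.0 if K <= m else 1.0 - ((K - m) / K) ** 2   # Eq. (18)
        a_new = best_x - phi * (b - a)                      # Eq. (17)
        b_new = best_x + phi * (b - a)
        a = np.maximum(a_new, a)     # clip so the new range stays inside the old one
        b = np.minimum(b_new, b)

    return best_x, best_f
```

The key design choice is that the chaos iterates drive the exploration, while the range contraction of Eq. (17) concentrates later iterations around the current best point, which is the mutative scale idea.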
4. Simulation study

4.1 SVM sensitivity to parameters C, σ, and ε

A nonlinear function is chosen as an illustration: the Hermite function y = 1.1(1 − x + 2x²)e^{−x²/2}, where x ∈ [0, 6]. Over this range of x, 100 pairs of training data (x_i, y_i) are randomly selected. In this section, three experiments are used to demonstrate the influence of the parameter values on the SVM performance. In each experiment, two of the parameters C, σ, and ε are fixed and the remaining one is varied. For different values of C, σ, and ε, the SVM approximation results differ. The MSE of formula (14) measures the deviations between f(x_k, w) and y_k. Table 1 and Fig. 1(a)-(c) show the MSE of the SVM approximation. It can be concluded that SVM performance is sensitive to the parameter values and differs noticeably for different values. The MSE values in Table 1 and Fig. 1 are rather large because the parameters C, σ, and ε are here selected empirically, without optimization, only to show the sensitivity. Therefore, in the search procedure of this article, parameter selection is regarded as a compound optimization problem and the three parameters are searched simultaneously, as in Section 3, using the chaos optimization algorithm, which can reach optimal values.
Table 1  Contrastive MSE at various parameter values

No. | Expt. 1 (σ=3.6, ε=0.05), Fig. 1(a) | Expt. 2 (C=5, ε=0.05), Fig. 1(b) | Expt. 3 (C=5, σ=3.6), Fig. 1(c)
    | C     MSE                          | σ     MSE                        | ε      MSE
 1  | 0.1   0.2013                       | 0.2   0.4035                     | 0.002  0.0092
 2  | 0.5   0.1767                       | 0.5   0.2507                     | 0.005  0.0094
 3  | 1     0.1525                       | 1     0.1824                     | 0.01   0.0092
 4  | 2     0.1494                       | 2.5   0.0091                     | 0.02   0.0093
 5  | 5     0.1348                       | 4     0.1057                     | 0.05   0.0090
 6  | 10    0.1310                       | 6     0.1743                     | 0.075  0.0093
 7  | 20    0.1293                       | 8     0.2790                     | 0.1    0.0091
 8  | 50    0.1292                       | 10    0.3116                     | 0.2    0.0092
Fig. 1  Contrastive MSE at various parameter values
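For reference, one of these sensitivity experiments could be reproduced along the following lines (a sketch only: it assumes scikit-learn's SVR as a stand-in for the article's SVM implementation and a freshly sampled data set, so the resulting MSE values will not exactly match Table 1):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# 100 training pairs of the Hermite function on the stated range x in [0, 6]
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=100)
y = 1.1 * (1.0 - x + 2.0 * x ** 2) * np.exp(-x ** 2 / 2.0)
X = x.reshape(-1, 1)

# Expt. 1: sigma and epsilon fixed, C varied (cf. the first column of Table 1)
sigma, epsilon = 3.6, 0.05
for C in [0.1, 0.5, 1, 2, 5, 10, 20, 50]:
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=1.0 / (2.0 * sigma ** 2))
    model.fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    print(f"C = {C:>5}: training MSE = {mse:.4f}")
```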
4.2 SVM approximation of nonlinear function

Here the authors again consider the Hermite function with x ∈ [0, 6]; 100 pairs of training data (x_i, y_i) and 40 pairs of test data are randomly selected. Mutative scale chaos optimization is applied to optimize the parameters C, σ, and ε, with the minimization of the MSE as the objective. The maximum number of iterations is set to 500, and the search ranges for these training data are C ∈ [0.5, 15], σ ∈ [0.5, 5], and ε ∈ [0.01, 0.2]. The initial values of C, σ, and ε are randomly selected. As the iterations increase, the MSE decreases greatly, as shown in Fig. 2, and the parameter values become more appropriate. Fig. 2 compares the MSE of two different chaos optimization algorithms: Curve 1 denotes the mutative scale chaos optimization algorithm and Curve 2 the conventional chaos optimization algorithm. When the number of iterations N reaches 500, the MSE of the two algorithms has decreased to 0.013 and 0.020, respectively. It can be observed that the mutative scale chaos optimization algorithm converges faster and gives preferable results. The optimal values derived from the optimization algorithm are C = 6.4, σ = 3.9, and ε = 0.026.
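As a usage sketch, the parameter search of this subsection could be set up as follows, reusing the mutative_scale_chaos_optimize routine outlined after Section 3 and scikit-learn's SVR as an assumed SVM implementation; the data sample, seeds, and therefore the resulting values are illustrative only.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Hermite training data as in Section 4.1 (the article's exact sample is not given)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 6.0, size=100)
X, y = x.reshape(-1, 1), 1.1 * (1.0 - x + 2.0 * x ** 2) * np.exp(-x ** 2 / 2.0)


def objective(params):
    """Training MSE of Eq. (14) as a function of (C, sigma, epsilon)."""
    C, sigma, epsilon = params
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=1.0 / (2.0 * sigma ** 2))
    model.fit(X, y)
    return mean_squared_error(y, model.predict(X))


# Search ranges of this subsection: C in [0.5, 15], sigma in [0.5, 5], epsilon in [0.01, 0.2]
best_params, best_mse = mutative_scale_chaos_optimize(
    objective, lower=[0.5, 0.5, 0.01], upper=[15.0, 5.0, 0.2], max_iter=500)
print("optimal (C, sigma, epsilon):", best_params, " MSE:", best_mse)
```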
Fig. 2  MSE during chaos optimization procedure

Table 2  Approximation results

Method  | MSE   | Max. positive error | Max. negative error | Test curve | Test error curve
ANN     | 0.064 | 0.197               | −0.084              | Fig. 3(a)  | Fig. 3(b)
SVM(1)  | 0.052 | 0.171               | −0.061              | Fig. 4(a)  | Fig. 4(b)
SVM(2)  | 0.037 | 0.078               | −0.043              | Fig. 5(a)  | Fig. 5(b)

Fig. 3  Test results of ANN approximation

Now the approximation performance of three different approaches is tested: (1) ANN approximation (an RBF neural network); (2) SVM(1) approximation (parameters selected as in Ref. [9]); (3) SVM(2) approximation (parameters selected by the proposed approach). Table 2 and Figs. 3-5 show the test results of these three approaches, and Figs. 3-5 illustrate the test performance of ANN, SVM(1), and SVM(2), respectively. In Figs. 3(a), 4(a)
and 5(a), the solid line shows the 40 pairs of test data, while the dotted line shows the approximate outputs of the three approaches, respectively. Figs. 3(b), 4(b), and 5(b) illustrate the actual errors of the three approaches. From these simulations, it can be observed that the SVM whose parameters are selected based on chaos optimization has a better approximation performance than the other two approaches.

4.3 Two-dimensional function approximation
Now a two-dimensional nonlinear function is considered, defined as

$$z = \frac{1 + \sin xy}{4 + \sin 2\pi x + \sin \pi y}, \qquad x \in [0, 2],\ y \in [0, 2]$$
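A short sketch of how training samples for this two-dimensional target could be generated (the article does not state the sampling scheme or sample size, so both are assumptions); the resulting (x, y) → z pairs can be fed to the same SVR objective and chaos optimizer as in the one-dimensional case, with the search ranges given below.

```python
import numpy as np

# Uniformly sampled training data for the two-dimensional target on [0, 2] x [0, 2];
# the sampling scheme and the sample size of 200 are assumptions, not stated in the article.
rng = np.random.default_rng(0)
XY = rng.uniform(0.0, 2.0, size=(200, 2))
z = (1.0 + np.sin(XY[:, 0] * XY[:, 1])) / (
        4.0 + np.sin(2.0 * np.pi * XY[:, 0]) + np.sin(np.pi * XY[:, 1]))
```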
Parameter values are selected based on chaos optimization, and the search ranges for this training data are C ∈[0.5, 25], σ ∈[0.25, 10], and ε ∈[0.01, 0.2]. After 600 iterations, MSE is 0.0051 and Fig. 6 illustrates
the actual model and the SVM approximation model. Simulations demonstrate that, for a two-dimensional nonlinear function, the SVM whose parameters are selected based on chaos optimization also has a good approximation performance.

Fig. 4  Test results of SVM(1) approximation

Fig. 5  Test results of SVM(2) approximation

Fig. 6  Actual model and SVM approximation model
5. Conclusions

A good setting of parameters is crucial for the SVM learning results and generalization ability, and the authors consider parameter selection as a compound optimization problem. This article proposes a mutative scale chaos optimization algorithm to search for the optimal parameter values. Various examples are simulated to demonstrate the superiority of the proposed approach. The simulation results show that the SVM whose parameters are selected with the proposed approach has a better approximation performance than ANN and than SVM whose parameters are chosen with other techniques. For other applications or types of SVM, parameters can also be selected by chaos optimization algorithms.

References

[1] Funahashi K. On the approximate realization of continuous mappings by neural networks. Neural Networks, 1989, 2(3): 183–192.
[2] Hornik K, Stinchombe M, White H. Multilayer feedforward networks are universal approximators. Neural Networks, 1989, 2(5): 359–366.
[3] Vapnik V. The nature of statistical learning theory. New York: Springer-Verlag, 1995.
[4] Vapnik V. An overview of statistical learning theory. IEEE Trans. on Neural Networks, 1999, 10(5): 988–999.
[5] Byun H, Lee S W. A survey on pattern recognition applications of support vector machines. International Journal of Pattern Recognition and Artificial Intelligence, 2003, 17(3): 459–486.
[6] Chan W C, Chan C W, Cheung K C, et al. On the modelling of nonlinear dynamic system using support vector neural networks. Engineering Applications of Artificial Intelligence, 2001, 14(2): 105–113.
[7] Chuang C C, Su S F, Jeng J T. Robust support vector regression networks for function approximation with outliers. IEEE Trans. on Neural Networks, 2002, 13(6): 1322–1330.
[8] Wang W J, Xu Z B, Lu W Z. Determination of the spread parameter in the Gaussian kernel for classification and regression. Neurocomputing, 2003, 55(3-4): 643–663.
[9] Cherkassky V, Ma Y Q. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks, 2004, 17: 113–126.
[10] Li B, Jiang W S. Chaos optimization method and its application. Control Theory and Applications, 1997, 14(4): 613–615.
[11] He D, He C, Jiang L G. Chaotic characteristics of a one-dimensional iterative map with infinite collapses. IEEE Trans. on Circuits and Systems-I: Fundamental Theory and Applications, 2001, 48(7): 900–906.

Yuan Xiaofang was born in 1979. He received the B.S. and M.S. degrees from Hunan University in 2001 and 2006, respectively. He is now a Ph.D. candidate at Hunan University. His research interests include intelligent control, neural networks, and machine learning. E-mail: [email protected]

Wang Yaonan was born in 1957. He received his Ph.D. degree from Hunan University in 1994. He is a professor and doctoral supervisor at Hunan University. His research interests include control theory and applications, neural networks, pattern recognition, and intelligent image processing.