A modified support vector regression: Integrated selection of training subset and model

A modified support vector regression: Integrated selection of training subset and model

Applied Soft Computing 53 (2017) 308–322 Contents lists available at ScienceDirect Applied Soft Computing journal homepage: www.elsevier.com/locate/...

852KB Sizes 0 Downloads 13 Views

Applied Soft Computing 53 (2017) 308–322

Contents lists available at ScienceDirect

Applied Soft Computing journal homepage: www.elsevier.com/locate/asoc

A modified support vector regression: Integrated selection of training subset and model JinXing Che a,b , YouLong Yang a,∗ , Li Li b , YanYing Li c , SuLing Zhu d a

School of Mathematics and Statistics, Xidian University, 266 Xinglong Section of Xifeng Road, Xi’an, Shaanxi 710126, China School of Science, NanChang Institute of Technology, NanChang 330099, JiangXi, China College of Mathematics and Information Science, Baoji University of Arts and Sciences, Baoji 721013, Shaanxi, China d School of Public Health, Lanzhou University, Lanzhou 730000, China b c

a r t i c l e

i n f o

Article history: Received 22 October 2015 Received in revised form 20 June 2016 Accepted 29 December 2016 Available online 4 January 2017 Keywords: Model selection Optimal training subset Nested particle swarm optimization Support vector regression Forecasting

a b s t r a c t In recent years, support vector regression (SVR) has become an emerging and popular forecasting technique in the field of machine learning. However, it is subjected to the model selection and learning complexity O(K * N3 ), especially for a massive data set (N is the size of training dataset, and K is the number of search). How to simultaneously reduce K and N can give us insight and inspiration on designing an effective and accurate selection algorithm. To this end, this paper tries to integrate the selection of training subset and model for SVR, and proposes a nested particle swarm optimization (NPSO) by inheriting the model selection of the existing training subset based SVR (TS-SVR). This nested algorithm is achieved by adaptively and periodically estimating the search region of the optimal parameter setting for TS-SVR. Complex SVR, involving large-scale training data, can be seen as extensions of TS-SVRs, yielding a nested sequence of TS-SVRs with increasing sample size. The uniform design idea is transplanted to the above modeling process, and the convergence for the proposed model is proofed. By using two artificial regression problems, Boston housing and electric load in New South Wales as empirical data, the proposed approach is compared with the standard ones, the APSO-OTS-SVR, and other existing approaches. Empirical results show that the proposed approach not only can select proper training subset and parameter, but also has better generalization performance and fewer processing time. © 2017 Elsevier B.V. All rights reserved.

1. Introduction Support vector regression (SVR) is a very promising and popular forecasting model rooted in statistical learning theory [1], whose achievements are due to the linear or nonlinear kernel technique [2–4]. It is well-known that the generalization performance of SVR depends on the good choice of parameter setting [5,6]. For this purpose, the problem of SVR’s parameter selection becomes a fundamental yet crucial task. It will be desirable to design an effective and accurate selection algorithm to make SVRs practical for the wide variety of practitioners in engineering applications of regression and modelling [7,4]. The parameter selection of SVR is known as the model selection problem, and can be formulated as an optimization problem of a function which is multimodal and only vaguely specified. A simple method to handle the model selection is to perform an exhaustive

∗ Corresponding author. E-mail addresses: [email protected], [email protected] (Y. Yang). http://dx.doi.org/10.1016/j.asoc.2016.12.053 1568-4946/© 2017 Elsevier B.V. All rights reserved.

grid search over the parameter domain. Generally, the exhaustive grid search has high computational cost especially in large-scale training data set (the size of training dataset N is a large number): Firstly, the candidate size of the SVR parameter domain is large, so fine grid search is quite inefficient due to a large number of parameter combinations, becoming impracticable if the dimension of the parameter domain is large, even if the grid is not too fine [15]. Secondly, the computation complexity of the above problem is O(K * N3 ) (N is the size of training dataset, and K is the number of search). To this end, the population based search algorithm, such as particle swarm optimization (PSO), is a good alternative for finding better tuning parameters [8,9]. Nevertheless, if the number of search K and the size of training dataset N are all considered, the complexity of the problem will still be significant for large-scale training data set. Based on the above rationality, how to simultaneously reduce K and N can give us insight and inspiration on designing an effective and accurate selection algorithm. On the one hand, many researchers have studied how to reduce the number of trials in parameter combinations, i.e. K. By estimating some parameter-dependent error, four types of the gradient of parameter selection criterion are calculated to optimize the model

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

Nomenclature K ı or  f ω, b ε i , i∗

the kernel function of SVR the width parameter of kernel function the regression function of SVR the weight vector and the bias of regression function the maximum value of tolerable error the distance between actual values and the corresponding boundary values of ε-tube C the trade-off parameter between generalization ability and training error d the dimension of data set TS(m) a selected training subset with size m Fm (x), F(x) the cumulative distribution function and empirical cumulative distribution function xi (h) the position of particle i at the moment h vi (h) the velocity of particle i at the moment h the itself and the entire swarm best position pi , pg c1 , c2 the weight factors Pbest [TS(m)] the best particle of SVR for training subset TS(m) nk the number of particles at the kth iteration ek the minimum error at the kth iteration MaxNk the maximum iteration number at the kth iteration

selection problem [10–13]. Although these algorithms can effectively reduce the K and show impressive gain in time complexity, it is likely that they get trapped in bad local minima. To avoid computing the derivatives of model selection criterion, Momma and Bennett introduce a pattern search algorithm which is suitable for optimizing SVR problems for which it is impossible or hard to obtain information about the derivatives [14]. Li et al. propose a multiobjective uniform design (MOUD) search algorithm to make the search scheme better uniformity and space-filling. This selection algorithm can dramatically reduce the K, avoid wasteful function evaluations of close-by patterns, then provide the flexibility to adjust the candidate set size under computational time constraint [15]. By combining the uniform design and stochastic optimization method, Jiménez et al. propose a random version of focused grid search [16], where more concentrated set in the parameter search space is repeatedly randomly screened and examined by using heuristic search. By updating the local and global best known positions, PSO can be expected to iteratively move the swarm toward the optimal solution, and perform a more effective search by combining the uniform design idea [17–20]. On the other hand, support vector regression (SVR) has high computation complexity O(N3 ) for large sample. Training data reduction is an effective method to reduce the N due to the sparseness of SVR. Based on it, researchers have proposed many effective solutions. One type of solution is to decompose the large dataset problem into several small sub-problems. By using the low-rank approximation method of dealing with the full kernel matrix, a reduced SVR (RSVR) is presented [21]. To further improve the computational efficiency, the other type of solution is to select a small-scale training subset for the original dataset. Lee et al. obtain a small-scale training subset by using the random sampling method, then train the SVR based on the training subset [22]. Brabanter et al. propose a modified active subset selection method by constructing a maximum objective function of quadratic Renyi entropy, and optimize the random training subsets using an iterative method, then determine an optimized fixed-size kernel models for large data sets [23]. The above SVRs for large scale of data are limited to the fixedsize training subset, Che constructs an approximation convexity optimization framework to determine the optimal size of training subset based SVR (TS-SVR), and solves it by a 0.618 method [24].

309

The work of literature [24] that hybridizes APSO and TS-SVR uses APSO for each model selection process of TS-SVRs. In particular, it is a two-step approach where the first stage is based on optimal training subset (OTS) algorithm to select a training subset considered in the second stage. For the TS-SVR with this training subset, an APSO algorithm is then employed to perform the model selection of the TS-SVR in the second stage. However, its model selection process is repeatedly performed for each iteration of training subset. This method does not consider the interconnection between training subset selection and model selection, which gives us the following motivation to achieve better performance: An integrated strategy between training subset and model selection is expected to make the selection process far more space-filling by cutting the wasteful parameter space of close-by patterns. As far as we know, no study truly integrates the training subset and model selection within a SVR modeling process. To progressively exploit a wide candidate search region stage by stage, this paper tries to integrate the selection of training subset and model for SVR, and proposes a nested PSO by inheriting the model selection of the existing training subset based SVR. This modified SVR can simultaneously reduce both K and N by using the above integrated strategy, which is a fast SVR model suitable for large-scale training data. We firstly analyze and apply the training subset based SVR (TS-SVR) iteration process in literature [24] for exploring the parameter domain. Then, the movement region of the optimal parameter setting between two TSs with different sizes is estimated, which forms a nested mechanism and contracted search to dramatically reduce the candidate set space of parameter combinations. Finally, we connect the TS-SVRs from small to optimal size by a nested particle swarm optimization (NPSO) in order to search the tuning parameters periodically. Experimental results show that the proposed integrated SVR can select proper training subset and parameter with which the testing accuracy of trained SVRs is competitive to the standard ones, and the training time can be significantly shortened. The updated search regions can be adjusted adaptively and dynamically with the iteration process of TSs. The rest of the study is organized as follows. Section 2 gives a brief introduction of the basic model. Section 3 presents the integrated training subset and model selection for SVR (I-TSMS-SVR). The convergence for the I-TSMS-SVR model is proofed in Section 4. Section 5 presents the numerical results. The final conclusion is drawn in Section 6.

2. The basic model 2.1. Support vector regression In this subsection, a very brief description of ε-insensitive support vector regression (ε-SVR) is given, and a more thorough coverage is introduced in [1,25–27]. Specifically, the training data (xi , yi ) of n records are given, where (xi , yi ) ∈ Rd × R. The SVR predictor will forecast record with input pattern x ∈ Rd by using a regression function of the form

f (x) =

n  



˛∗i − ˛i K(xi , x) + b

(1)

i=1

where K : Rd × Rd → R is the SVR kernel function that maps the input space into a higher-dimensional feature space, and this “kernel trick” makes the nonlinear regression on the input space be equivalent to the linear regression on the higher-dimensional feature space. Conventionally, the coefficients ˛∗i , ˛i are found by

310

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

solving the following dual quadratic problem with box constraints plus one linear constraint: min

1 (˛ − ˛∗ )T K(˛ − ˛∗ ) − (˛ − ˛∗ )T y + ε(˛ + ˛∗ )T 1 2

(2)

i

(3)

⎪ 0 ≤ ˛∗i ≤ C ⎪ ⎪ ⎪ ⎩

i = 1, 2, . . ., n

where K is the full kernel matrix with Kij = K(xi , xj ), 1 is the vector of ones. The authors choose the following Gaussian kernel function with standard deviation



K(xi , x) = exp

−(x − xi )2 2 × ı2



1/2

| Fm (x) − F(x)|2 dx

D2 (TS(m)) =

(5)

Cd

subject to

⎧ (˛ − ˛∗ )T 1 = 0 ⎪ ⎪ ⎪ ⎪ ⎨0 ≤ ˛ ≤ C

subset selection. The most popular measure of non-uniformity in quasi-Monte-Carlo methods is L2 -discrepancy:



(4)

Once the three model parameters (C, ε, and ı2 ) are determined, the above ε-SVR model is solved. To minimize the parameter ε, Schölkopf et al. propose the -SVR by introducing an extra term  [28]. In this study, we consider the training subset and model selection problem for ε-SVR model. 2.1.1. Parameter domain of SVR It is well-known that the generalization performance of a SVR model depends on the good choice of three parameters (C, ε, and ı2 ), which is the so called model selection. However, we cannot obtain a direct valuation of each individual SVR model for a parameter setting (C, ε, and ı2 ) at hand, it is thus clear that no explicitly valuation expression for the parameter selection exists other than some kind of search on the parameter domain. One standard strategy to deal with the model selection is to employ a more or less exhaustive grid search over the parameter domain. It is obvious that the parameter domain for grid search should include a large region to cover the global optimum, which makes the grid search have high computational cost. For this purpose, we use a stochastic metamodel search alternative, i.e. particle swarm optimization (PSO), to gradually concentrate the parameter search domain. In this work, we set an empirical parameter search domain as follows: C ∈ [4−3 , 48 ], ε ∈ [0.001, 1], ı2 ∈ [4−3 , 48 ]. This initial parameter search domain of SVR is a relatively large scope, and we can adaptively adjust it in the iterative process of NPSO as descried in the following Section 3.3.2 “The nested algorithm”. 2.2. Training subset selection In this subsection, we describe the training subset (TS) algorithm for training SVR. It should ensure that the selected training subset extracts maximum information from large data set. The trained SVR may be over-fitting when the training dataset is non-uniform (or imbalanced, among many other names). To solve the above the above problem, a good solution would be to turn the imbalanced training dataset into a balanced data in the training phase. Thus, some sampling techniques are developed for balancing the datasets [29]. 2.2.1. Selection rule: uniform design idea The goal here is to choose a small, balanceable but informative subset TS from the full training dataset T. Considering that the uniform design experimentation seeks representative points to be uniformly scattered on the domain and has been popular since 1980 [30–32], we transplant the uniform design idea to the training

where Cd is a domain over d parameters, TS(m) ⊂ Cd is a selected training subset with size m, F(x) is the cumulative uniform distribution on Cd and Fm (x) is the empirical cumulative distribution function of TS(m) [33]. Intuitively, we infer that the TS(m) with more differences should have smaller value of D2 (TS(m)). Inspired by that the farther points contain more information than the nearer ones, a training subset (TS) selection technique is presented to choose some data points with maximum differences from the full training set T according to the following criterion: X ∗ = argmaxXi ∈ T

⎧ ⎨ ⎩

⎫ ⎬

|Xj − Xi |



Xj ∈ TS

(6)

Remark 1. For two sizes m1 , m2 (m1 < m2 ), we can firstly generate TS(m1 ) according to the above criterion, then generate TS(m2 ) based on TS(m1 ). Remark 2. For the model selection problem of each TS(m), we propose a stochastic metamodel search alternative, i.e. nested particle swarm optimization (NPSO), to perform the model selection problem as described in the following Section 3.3. Remark 3. The training subset based support vector regression (TS-SVR) has a distinct advantage over SVR based on the full training set, since it solves the problem of the computation and memory complexity O(N3 ) for large sample and prevents over-fitting during unbalanced data regression. 2.2.2. An illustration In the above, we present a new technique to ensure that the obtained training subset extracts maximum information from the full training set. It can choose some training pairs with maximum differences from the entire training pairs, and has been applied to find the optimal training subset (OTS). Denoted by m the size of training subset, a constant to be determined, so OTS has maximum information among all the training subsets. Our m-OTS method is a type of instance-based algorithm, with the most uncommon object being assigned to the subset. Usually Euclidean distance is used as the distance metric, another metric such as the overlap metric (or Hamming distance) can also be used. So the farther points contain more information than the nearer ones. The training phase of the algorithm consists of obtaining the training subset and training SVR algorithm with it. In the forecast phase, test points are forecasted by using the OTS based SVR. The above simple pseudo-code is implemented to obtain the m elements representing maximum information of the training set, which is called as m-OTS algorithm. Below, we employ the following 6-dimension example Friedman #1 to explain our theory in detail:



y = 10 sin(x1 x2 ) + 20 x2 −

1 2

2

+ 10x4 + 5x5 + ε

(7)

where ε ∼ N(0, 1), xj ∼ U[0, 1], j = 1, 2, 3, 4, 5. We perform 50 simulations, and generate 300 data points at each simulation (100 data points are generated as a test set). In this 6-dimension example, choose 15 gradually increased sizes of TS, that is [10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80], and the MSE performance on test set is computed for each TS-SVR with the chosen size. To make fair comparisons, we use the default setting of εSVR in R Package “e1071” [35]. Fig. 1 displays the size of TS as a function of the sum square error (SSE) performance on test set for this 6-dimension example. The last box is the SSE performance of

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

be adjusted approximatively by multiplying a multiplicative factor (m2 /m1 )2/(4+d) . The movement distance of the best parameter setting can be estimated by the following equation: MD(m1 , m2 , ) =

 2/(4+d) m 2

m1

− 1 ∗ (m1 )

(8)

1000

Usually, this multiplicative factor is close to 1 as the difference m2 − m1 is relatively small or the size d is a moderate to large value, which indicates the above movement distance is a small value. As the above movement distance is an approximate estimation, the new best parameter setting can be found by finer searching in a small contraction range. For simplicity, we set a lower and upper bound of (m2 ), which can be expressed as the following domain:

500

SSE

1500

The relationship between sum square error (SSE) and sample size

311

(m2 ) ⊂ [(m1 ) − 1 ∗ ( 1 − 1) ∗ (m1 ), (m1 ) + 1 ∗ ( 1 − 1) ∗ (m1 )]

(9) 10

15

20

25

30

35

40

45

50

55

60

65

70

75

80 200

Sample size

Fig. 1. The relationship between sum square error (SSE) and sample size.

where 1

= (m2 /m1 )2/(4+d) ; (m1 ) and (m2 ) are the optimal param-

eter  of TS(m1 ) and TS(m2 ) based SVR, respectively; 1 > 1 is a positive constant. When 1 * ( 1 − 1) > 1, the lower bound is set as 0.

the full training set and regards as a baseline comparison. We can observe that the TS-SVRs with size 75 and 80 have better test performance than the SVR with the full training data, which indicates: The useful information of TS(m) increases as the size m increases, also the redundant information increases. In other words, the useful information dominates at the beginning, reversed subsequently. In a massive data set problem, the training data generally contains much redundant information. The redundant information not only could be useless for SVR training, but also could lead to accuracy constraint and computational burden. The studies of SVR show that only a percentage of the full training data (namely support vectors) have impact on the final SVR model. As described by Cherkassky and Ma, the optimal generalization performance is often obtained with the number of support vectors (SVs) less than 50% [34]. Discarding these redundant information can accelerate the training process of SVR.

3.1.2. ε, Admissible error Noise, variation in the dependent variable which cannot be inferred by using the independent variables, limits the forecasting performance of SVR [38]. In order to improve the robustness of SVR, a parameter ε is introduced: The learning errors smaller than ε are accepted, and training data points lying outside this “ε-tube” are named as “support vectors”. Smola et al., Kwok and Tsang proposed an asymptotically optimal ε setting rule from the theoretical and experimental point of view: A linear scaling between the optimal epsilon and the noise in the data [39,40,1]. It is well-known that the value of ε should be set as a proportional constant of the standard deviation, Cherkassky and Ma improved this setting rule based on the noise and the size of the training data [34]:

3. Integrated training subset and model selection for SVR

where 2 is the variance of additive noise, m is the size of training dataset. As the value of ε determines the number of support vectors, a sufficiently small ε will make all the points of TS become support vectors. After reaching this sufficiently small ε, there is little effect on the final prediction model. When the size of TS changes from a value m1 to a slightly bigger value m2 , the parameter setting can be adjusted approximatively by multiplying a multiplicative factor (m2 /m1 )1/2 * (lnm1 /lnm2 )1/2 . The movement distance of the best parameter setting can be estimated by the following equation:

In this section, the overall formulation process of the integrated training subset and model selection for SVR (I-TSMS-SVR) is presented. First, when the size of training subset changes from a value m1 to a slightly bigger value m2 , the movement region of the optimal parameter setting can be estimated by multiplying a multiplicative factor. Then, the proposed nested particle swarm optimization (NPSO) algorithm is formulated to integrate the training subset(TS) and model selection for SVR and the procedure is presented in detail.



ε = 3

lnm m

MD(m1 , m2 , ε) = 3.1. Search region estimation The members of training subset (TS) have the characteristics of uniform dispersion, so the representative of TS is very strong. Now, we analyse the relationship of parameter distribution among TSs with different size. 3.1.1. , Width parameter For the parameter , Stone [36] and Silverman [37] pointed out that an asymptotically optimal window is of order ı = O(m−1/(4+d) ) or  = 1 2 = O(m2/(4+d) ), where m is the size of training data 2×ı set and d is the dimension of data set. Based on the above theory, we can easily conclude: When the size of TS changes from a value m1 to a slightly bigger value m2 , the parameter setting can

(10)

 1/2   m2 ln m1 1/2 m1



ln m2

− 1 ∗ ε(m1 )

(11)

Usually, this multiplicative factor is close to 1 as the difference m2 − m1 is relatively small, which indicates the above movement distance is a small value. As the above movement distance is an approximate estimation, the new best parameter setting can be found by finer searching in a small contraction range. For simplicity, we set a lower and upper bound for ε(m2 ), which can be expressed as the following domain: ε(m2 ) ⊂ [ε(m1 ) − 2 ∗ ( 2 − 1) ∗ ε(m1 ), ε(m1 ) + 2 ∗ ( 2 − 1) ∗ ε(m1 )]

(12) where 2 = (m2 /m1 )1/2 * (lnm1 /lnm2 )1/2 ; ε(m1 ) and ε(m2 ) are the optimal parameter ε of TS(m1 ) and TS(m2 ) based SVR, respectively;

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

TS−SVR under training subset size 20

15

−10 log 2

log 2(c ost )

10 5

−5 (ga mm a) 0

0

0.50 0.45 0.40 0.35 0.30 0.25 −15

MAE

MAE

0.50 0.45 0.40 0.35 0.30 −15

TS−SVR under training subset size 45

15 10 −10 log −5 2(g am m

5 −5

TS−SVR under training subset size 80

0.5

log 2(c ost )

312

5

0 a) 0

5 −5

TS−SVR under training subset size 200

0.5

−10 log 2(g −5 am ma )

15 10

0.2 −15

5

0

0.3

−10 log 2(g −5 am ma )

0 5 −5

log 2(c ost )

10

0.2 −15

log 2(c ost )

MA E

15

0.3

MAE

0.4

0.4

5

0 0 5 −5

Fig. 2. The model selection performance of TS-SVR under different training subset size.

2 > 1 is a positive constant. When 2 * ( 2 − 1) > 1, the lower bound is set as 0. 3.1.3. C, Tradeoff parameter For the parameter C, its value decides the trade-off between the model complexity of the final regression equation and the training error larger than ε. Intuitively, a small size TS will reduce the model complexity, which indicates a smaller C value for TS with a smaller size. Under the assumptions that the ε has been chosen, Mattera and Haykin proposed that an approximate optimal value for C is equal to max(y) − min(y) (y is the response values of training data) [41]. According to the selection criterion of TS, the parameter setting of C will have a very small change when the size of TS changes from a value m1 to a slightly bigger value m2 , and the value of max(y) − min(y) correlates negatively with the dimension d. As it is difficult to compute the exact movement distance of the best parameter setting, the new best parameter setting can be found by finer searching in a small contraction range. For simplicity, we set a lower and upper bound for C(m2 ), which can be expressed as the following domain: C(m2 ) ⊂ [C(m1 ) − 3 ∗ ( 3 − 1) ∗ C(m1 ), C(m1 ) + 3 ∗ ( 3 − 1) ∗ C(m1 )]

(13) where 3 = (m2 /m1 )0.5*(1−1/d) ; C(m1 ) and C(m2 ) are the optimal parameter C of SVR based on TS(m1 ) and TS(m2 ), respectively; 3 > 0 is a positive constant. When 3 * ( 3 − 1) > 1, the lower bound is set as 0. Therefore, we can get the following conclusion: The performance shape of search space of model selection (The shape of parameter distribution) is similar for TSs with small difference in size, but the position of this parameter distribution will totally move according to a multiplicative factor. Considering that the multiplicative factor is an approximate estimation, the movement region of the optimal parameter setting Pbest [TS(m2 )] = [ best (m2 ),

εbest (m2 ), Cbest (m2 )] can be bound by the following search domain: Pbest [TS(m1 )] − MR(m1 , m2 ) ≤ Pbest [TS(m2 )] ≤ Pbest [TS(m1 )] + MR(m1 , m2 )

(14)

where MR(m1 , m2 ) = [1 ∗ ( 1 − 1) ∗ (m1 ), 2 ∗ ( 2 − 1) ∗ ε(m1 ), T

3 ∗ ( 3 − 1) ∗ C(m1 )] denotes the movement radius. Inspired by that the multiplicative factor is closely related to the size and dimension of training data set and is an approximate estimation, we can estimate this moving region by the order Pbest ± O(g(m, d)) for the performance shape with different TSs. Note that: When m1 = m2 , we compute the movement range MR(m1 , m2 ) by using m1 = m2 + 1 to keep finer search for model selection. 3.2. An illustration Based on the above theory in Section 3.1, we can infer that the structures among TSs with small difference in size have similar characteristics. When the size of TS has been iteratively updated from a small size to the optimal size, the structure of the full training set can be progressively reflected. The size of TS begins to change slowly so that the search region could be adaptively contracted by the proposed NPSO. This can be demonstrated by the following simulation of Friedman #1. The SVR modeling is very dependent on the parameter setting (, C, ε). In order to draw our search region in a two-dimensional box, the ε is set as the user pre-specified parameter for ε-SVR. As the exhaustive grid search has high computational complexity, we perform 5 simulations for a grid search with 21 × 21 mesh parameter combinations (441 trials), and compute the average MAE for this grid search. To vividly demonstrate the performance of training subset based SVR (TS-SVR) in the parameter domain, Fig. 2 describes a TS-SVR comparison of two parameters ( = 1 2 , C) 2×ı under four training subset sizes [20, 45, 80, 200], where the x-axis,

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

y-axis and z-axis are log2 C, log2  and the model performance measure MAE respectively. It is easy to observe that there are many local minima for small size TS, so the gradient-based search algorithm may be trapped into (bad) local minima. As shown in the above Figure, we can easily observe that the SVR with small size TS has similar global shape to the SVR with full training data, then more and more local shape is shown as close to the optimal size TS, which indicates the value of small size TS. And the size theory of TS is also analyzed in [24]. Generally, the global optimum parameter of SVR has a large search area, as a result of which a parameter search using the conventional SVR with the full training dataset will take a long time. To this end, a parameter search using the SVR with small size TS is an appealing alternative to the above search. Inspired by the above feature of TS, we wish to integrate the training subset and model selection process by the following stepwise strategy: Firstly, we search the highly likely global optimum region in the SVR with small size TS, in the next iterations, more and more local region as close to the SVR with the optimal size TS. 3.3. Nested particle swarm optimization In this subsection, a nested particle swarm optimization (NPSO) algorithm is proposed to integrate the training subset(TS) and model selection for SVR, which can significantly reduce K and N simultaneously. Based on the above search region estimation, this nested algorithm is achieved by adaptively and periodically estimating the search region of the optimal parameter setting for training subset based SVR (TS-SVR). In order to cut the wasteful parameter space of close-by patterns, we iteratively connect the TS-SVRs from small to optimal size by a nested particle swarm optimization (NPSO): In the first iteration, it determines the global optimum parameter by means of a small size TS-SVR, and a movement multiplicative factor of this global optimum parameter is estimated for a new size TS-SVR, then the next iterations perform a finer and finer tuning derivative-free search using the updated search region. This periodical selection strategy is an efficient fine-tuning procedure resulting in lower computational complexity and better generalization ability. In the following sections it will be explained how to do this. 3.3.1. Particle swarm optimization Particle swarm optimization (PSO) is a swarm intelligence method and motivated by the swarm behaviors such as bird flocking and fishes schooling. It is firstly presented by Kennedy and Eberhart in 1995 [42]. For the tth iteration of PSO, the movement of particle i (denoted as xi (t + 1)) is determined by the following velocity vector equation vi (t + 1):

vi (t + 1) = ω ∗ vi (t) + c1 ∗ rand()1 ∗ (pi − xi ) + c2 ∗ rand()2 ∗ (pg − xi ) (15)

xi (t + 1) = xi (t) + vi (t + 1)

(16)

where pi is the itself previous best position, pg is the entire swarm previous best position, c1 , c2 are the weight factors, and rand()1 , rand()2 are random numbers distributed U(0,1). Particle swarm optimization (PSO) can effectively perform search for optimizing SVR problems for which it is impossible or hard to obtain information about the derivatives, in other words, the PSO can perform a derivative-free search in the candidate region. What is more, PSO is a stochastic meta-model search that can gradually concentrate the parameter search domain, and therefore it has lower computational cost than the grid search.

313

However, PSO does not establish the relationship between the optimal parameter settings of TS-SVRs with different subset sizes. To this end, we propose a nested PSO algorithm as shown in the next subsection. An improved initial parameter setting: To perform an effective search under a wide candidate region, PSO generates a part of initial particles by using the uniform design method proposed in [43]. 3.3.2. The nested algorithm In this subsection, we propose the nested algorithm in two levels. At the first level, we perform a large number of PSO particles in the log-scale in a wide search region by using a small size TS-SVR, which has low computational complexity due to the small size of TS. At the second level, we estimate the movement range of the optimal parameter selection of a new size TS-SVR, and let the best parameter from the early stage be the center point of the new search region. The update method of the new search region is given in Eqs. (9), (12) and (13), where 1 = 3.5, 2 = 3.5, 3 = 3.5. We do allow all the search particles of the second level to fall outside the prescribed search region. Then we perform a finer search in the new region. This search region is adjusted adaptively and dynamically with the iteration process of TSs. The specific steps of the nested algorithm are as follows: Algorithm 1. The nested PSO algorithm. Step 1 : For three initial sizes m10 , m20 and m30 (generally, m10 < m20 < m30 n), generate three initial SVR models as follows: (a) Perform model selection for TS(m10 ) based SVR using a large number of PSO particles, and obtain the accordingly optimal parameter Pbest [TS(m10 )]. (b) Determine the following search region for TS(m20 ) based SVR: Pbest [TS(m20 )] ⊂ [(1 − 0 ) ∗ Pbest [TS(m10 )], (1 + 0 ) ∗ Pbest [TS(m10 )]] (17) T

where 0 = [1 ∗ ( 10 − 1), 2 ∗ ( 20 − 1), 3 ∗ ( 30 − 1)] for the size m10 and m20 , the subscript denotes the number of iteration and the superscript denotes the number of TS size. Perform model selection for TS(m20 ) based SVR using PSO algorithm, obtain the accordingly optimal parameter Pbest [TS(m20 )]. (c) Determine the following search region for TS(m30 ) based SVR: Pbest [TS(m30 )] ⊂ [(1 − 0 ) ∗ Pbest [TS(m20 )], (1 + 0 ) ∗ Pbest [TS(m20 )]] (18) T

where 0 = [1 ∗ ( 10 − 1), 2 ∗ ( 20 − 1), 3 ∗ ( 30 − 1)] for the size m20 and m30 , the subscript denotes the number of iteration and the superscript denotes the order number of TS size. Perform model selection for TS(m30 ) based SVR using PSO algorithm, and obtain the accordingly optimal parameter Pbest [TS(m30 )]. (d) Compute the accordingly goal optimization function F(m10 ), F(m20 ), F(m30 ), and generate m11 , m21 and m31 as shown in [24]. Step 2 : REPEAT (a) For a new size mik (i = 1, 2, 3), choose the closest size from all

the performed sizes (suppose mik−1 without loss of generality), and determine the following search region for TS(mik ) based SVR:

 

Pbest TS mik





 



 



⊂ (1 − k−1 ) ∗ Pbest TS mik−1 (1 + k−1 ) ∗ Pbest TS mik−1

, (19)

where k−1 = [1 ∗ ( 1k−1 − 1), 2 ∗ ( 2k−1 − 1), 3 ∗ ( 3k−1 − 1)]

T

for the size mik−1 and mik , the subscript denotes the number of iteration and the superscript denotes the number of TS size.

314

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

(b) Perform model selection for TS(mik ) based SVR using PSO algorithm, and obtain the accordingly optimal parameter Pbest [TS(mik )]. (c) Compute the accordingly goal optimization function F(m1k ), F(m2k ), F(m3k ), and generate m1k+1 , m2k+1 and m3k+1 as shown in [24]. Step 3 : UNTIL max(m1 , m2 , m3 ) − min(m1 , m2 , m3 ) < 1, or the number of iterations is larger than the maximum iteration, or the generalization error does not change in 7 consecutive epochs. In the NPSO, the above parameter domain of SVR can be adaptively adjusted according to the optimal parameters setting. In next section, the experimental results will demonstrate the merits of the integrated training subset and model selection for SVR. However, the NPSO has some new parameters that need to be set by the user. For this problem, we have given a relationship analysis for these parameters, then provided a setting method as described in Section 3.3.3. Remark 4. The nested algorithm is an improved PSO. For the nested PSO, a more narrow search region can be estimated by using the optimal parameter setting in the last iteration of TS. Thus, the nested PSO constantly adds local information in the contraction process of global optimum region, avoids unnecessary searching wasteful local minimal values of close-by patterns, and always performs a much finer search process than the last iteration. 3.3.3. The structure parameter for NPSO By integrating training subset and model selection process, the proposed SVR model can effectively cut down the computational complexity of the traditional SVR, and dramatically reduce the candidate set space of parameter combinations. The integrated strategy makes the exploration process far more space-filling by cutting the wasteful parameter space of close-by patterns. Considering that small size TS can roughly reflect the data structure of the full training data, the movement region of the optimal parameter setting between two TSs with different sizes is estimated, which forms a nested mechanism and contracted search strategies to dramatically reduce the candidate set space of parameter combinations. Based on the theory of Section 3.1, we can obtain that the structures among TSs with small difference in size have similar characteristics. When the size of TS has been iteratively updated from a small size to the optimal size, the structure of the full training set can be progressively reflected. In the initial search phase with a small size TS, NPSO generates a large number of particles to perform crude search for a wide candidate region, this process has relatively low computational complexity due to the small size of TS. In the next iterations, the candidate region shrinks to a more and more narrow region and the TS tends to close the optimal size, NPSO only need generate relatively few particles to perform fine search for a narrow candidate region. This feature can be defined by the following tuning functions. NPSO for the SVR with the kth TS is a traditional PSO, and the above properties of NPSO can be reflected by the following coefficient of PSO: the number of particles nk , the minimum error ek , the maximum iteration number MaxNk . They are defined as:



1

nk = round (n0 ) k˛1 ek = (e0 )(k



(20)

˛2 )



MaxN k = round (N0 )

1 k˛3



(21) (22)

where n0 , e0 , N0 are initial parameters for PSO, ˛i ∈ (0, 1) is a adjustment coefficient, i = 1, 2, 3. As the SVR with small size TS has low computational complexity, n0 can be set to a larger number than traditional PSO. The parameter ˛ controls the tradeoff between the

complexity and accuracy: the smaller the ˛, the more the number of particles, and the better the performance of the obtained SVR. In this paper, ˛1 = 0.2, ˛3 = 0.1. Next, we explain how the above tuning functions are determined. According to the above description, we obtain that these three parameters monotonically decreases with k, but they should not decrease too fast. This is why the adjustment coefficient ˛ with 0 < ˛ < 1 is considered. The initial setting for m1 -TS, m2 -TS and m3 -TS is as follows: n0 = 300, N0 = 20 for m1 -TS, n0 = 40, N0 = 10 for m2 -TS and m3 -TS. Thus, the NPSO constantly adds local information in the contraction process of global optimum region, and avoids unnecessary searching wasteful local minimal values of close-by patterns, then always performs a much finer search process than the last iteration. The above process can be vividly described by Fig. 3. We choose 4 updated search regions for the iteration TSs TS(m11 ), TS(m13 ), TS(m15 ) and TS(m17 ). In Fig. 3, the updated search regions under different TS sizes for Friedman #1 dataset are shown: First, we can easily observe a finer and finer search process is performed. Second, from the 2th to 3th subfigure, we do allow all the search particles of the 3th subfigure to fall outside the prescribed search region of the 2th subfigure. Thus, this search region can be adjusted adaptively and dynamically with the iteration process of TSs.

4. The convergence for the proposed model In this paper, integrated training subset and model selection for support vector regression, named as I-TSMS-SVR, is proposed. The convergence for the I-TSMS-SVR model can be described by Theorem 1. Theorem 1. Let T be the full training dataset with size n, m10 , m20 and m30 be three initial sizes (generally, m10 < m20 < m30 n). Let mik be the

size of the ith training subset (TS) at the kth iteration, Pbest [TS(mik )] be

the optimal parameter setting for TS(mik ) based SVR at the kth iteration, i = 1, 2, 3. Then 1. When k→ ∞, we have ik − 1 → 0 for all i = 1, 2, 3. It means that the movement radius is convergent. 2. When k→ ∞, we have Pbest [TS(mik+1 )] − Pbest [TS(mik )] → 0 for all i = 1, 2, 3. It means that the optimal parameter sequence Pbest [TS(mik )] for TS(mik ) is convergent.

Proof. 1. According to the convergence of 0.618 method as described in Algorithm 2 of literature [24], when k→ ∞, we have mik → m for all i = 1, 2, 3. Thus, (mik /mik−1 ) → 1 for all i = 1, 2, 3. It is easy to show: If k→ ∞, then 1k = (m1k /m1k−1 )

2/(4+d)

→ 1,

1/2 1/2 = (m2k /m2k−1 ) ∗ (ln m2k−1 / ln m2k ) 1/2 = (m3k /m3k−1 ) → 1. That is ik − 1 → 0 for all i = 1, 2, 3.

2k 3k

→ 1 and

According to Eq. (14) MR(mik+1 , mik ) → 0 for all i = 1, 2, 3. Thus, the movement radius is convergent. 2. In a prescribed search region, the optimal parameter setting Pbest [TS(mik )] is bounded. That is: ∀k, i, ∃M > 0, s.t. Pbest [TS(mik )] 2 < M. Therefore, Pbest [TS(mik+1 )] − Pbest [TS(mik )] ≤ ik − 1 ∗ Pbest [TS(mik )] ≤

ik − 1 ∗ M From the above proof, it is easy to show When k→ ∞, we have Pbest [TS(mik+1 )] − Pbest [TS(mik )] → 0 for all i = 1, 2, 3.

J. Che et al. / Applied Soft Computing 53 (2017) 308–322 The search region under TS size 20

315 The search region under TS size 29

0 15

10

Width parameter

Width parameter

−0.5

5

0

−1 −1.5 −2 −2.5

−5 −5

0

5 Tradeoff parameter

10

−3

15

1

2

3

0

−0.5

−0.5

−1

−1

−1.5 −2 −2.5 −3

5 6 Tradeoff parameter

7

8

9

10

8

9

10

The search region under TS size 47

0

Width parameter

Width parameter

The search region under TS size 40

4

−1.5 −2 −2.5

1

2

3

4

5 6 Tradeoff parameter

7

8

9

10

−3

1

2

3

4

5 6 Tradeoff parameter

7

Fig. 3. The updated search region under different TS sizes for Friedman #1 dataset.

Thus, the optimal parameter sequence Pbest [TS(mik )] for TS(mik ) is convergent. The proof ends.䊐 According to the updated search region for Remark 5. Pbest [TS(mik )], Pbest [TS(mik )] is a sequence of random variables with the Markov property, namely that, given the present state, the future and past states are independent. 5. Numerical results In the following section, we will apply the proposed I-TSMS-SVR to four different datasets, including two artificial problems, one UCI dataset and one real case of electric load forecasting in New South Wales. Our I-TSMS-SVR is coded in MATLAB 7.10.0 [44], and all the experiments are run on a notebook PC with 8 GB of RAM and a 2.9-GHz-based processor. 5.1. Description of the data sets Two artificial regression problems, among which one is Plane proposed by Ridgeway et al. [45] and the other one is Friedman #1 proposed by Friedman et al. [46], can be described by the following functions: 5.1.1. A plane y = 0.6x1 + 0.3x2 + ε

5.1.3. Boston housing This data set is proposed by D. Harrison and D.L. Rubinfeld in 1978 [47], and can be downloaded from UCI machine learning repository [48]. There are 14 attributes and 506 cases in this dataset. The forecasting of the housing values in Boston is a very complex regression problem, so this study introduces a tolerance coefficient r = 0.9 in the fitness function to eliminate the influence of outliers. It is worth noting that there is no time index for Boston Housing. Simple random sampling in splitting the entire dataset may make the training and test set have different patterns. As a result of the difference, the test set employed by the forecasting model does not represent well the distribution of the regression function to be approximated [49]. Thus, stratified sampling in splitting the entire dataset is employed to retain the ’similarity” pattern between training and test set.

(23)

where ε ∼ N(0, 0.05), xj ∼ U[0, 1], j = 1, 2. 5.1.2. Friedman #1

Then, we generate two artificial data sets comprising 1000 instances according to the above two Equations. To further validate the effectiveness of the proposed model, two real-world problems, Boston housing data set and electric load data set in New South Wales, are employed. The data description and preprocessing of these real-world problems are described as follows.



y = 10 sin(x1 x2 ) + 20 x2 −

1 2

2

+ 10x4 + 5x5 + ε

where ε ∼ N(0, 1), xj ∼ U[0, 1], j = 1, 2, 3, 4, 5.

(24)

5.1.4. Electric load in New South Wales In this real-world case, as it is necessary for the electricity to complete the process of production, distribution and consumption at the same time, the manager must develop an effective decision-making system to accurately estimate the future electric load demand. Then, he can effectively control price and income elasticities, energy transfer scheduling, unit commitment and load

316

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

Table 1 Summary of regression datasets. Problem

Table 2 The structure parameter of I-TSMS-SVR used in the experiment.

Size

A plane Friedman #1 Boston housing Electric load in New South Wales

No. variables

Training set

Testing set

500 1000 456 382

300 300 50 300

3 6 14 1 (the dimension of phase space is 3)

Initial size of TS

n0

˛1

N0

˛3

m1 m2 m3

300 40 40

0.3 0.1 0.1

20 10 10

0.3 0.1 0.1

MAPE =

20 Max (28,round(length(T(:,1))./20)) Max (36,round(length(T(:,1))./12))

n  (Pi −Ai )   i=1  Ai n

,

(27)

dispatch. Thus, we choose an electric load forecasting problem in New South Wales. To measure the generalization ability of the forecasting model, a test set, which is not learned in the training and model selection phase, should be employed to evaluate the forecasting ability. Each dataset of the above four cases is split into training set and test set: the training set is used to learn the forecasting equation, and the test set is used to estimate the generalization error. The summary of these four regression datasets is shown in Table 1. The term no. variables is the number of variables in regression dataset.

where Pi and Ai are the ith forecasted and actual values respectively, and n is the total number of predictions. To approximately measure the generalization error of a forecasting model, the difference in MAPE (DMAPE) between testing and training dataset is defined as follows:

5.2. SVR models for comparison

5.4. Selected parameters

In order to compare the I-TSMS-SVR model, simple description of four comparison models is as follows:

In Table 2, the structure parameter of I-TSMS-SVR used in the experiment is shown. As the SVR with small size TS has low computational complexity, n0 of small size TS can be set to a larger number than traditional PSO. The parameter ˛1 controls the tradeoff between the complexity and accuracy: the smaller the ˛1 , the more the number of particles, and the better the performance of the obtained SVR. To search the global optimum region under the m1 -TS, the initial number of particles and PSO iterations is set as a relatively large number 300 and 20, respectively. In this search process, NPSO generates a large number of particles to perform crude search for a wide candidate region, which has relatively low computational complexity due to the small size of m1 -TS. Then, the movement region of the optimal parameter setting between two TSs with different sizes is estimated. In the next iterations, the candidate region shrinks to a more and more narrow region according to the obtained best particle, and the TSs tend to close to the optimal size, NPSO only need generate relatively few particles to perform fine search for a narrow candidate region. Thus, we set an exponential decline coefficient ˛1 = ˛3 = 0.3 for m1 -TS. As the difference among m1 , m2 , m3 is relatively small, the search process of m2 -TS can be performed based on the best result of m1 -TS, and the search process of m3 -TS can be performed based on the best result of m2 -TS. Thus, we set a relatively small number for m2 -TS and m3 -TS: n0 = 40, N0 = 10 and ˛1 = ˛3 = 0.1. For the model selection of SVR, i.e. the setting of three parameters C, ε and ı2 , the trial-and-error and the user’s prior knowledge based method are the widely used selection method. However, it is a very time consuming process for large sample dataset. To reduce the computational complexity and improve the performance of model selection, the model selection technique based on training subset (TS) and particle swarm optimization (PSO) algorithm is employed. Since the TS is a part of the full training dataset T, the minimum testing error of the set T − TS is used as the valuation criterion to avoid the high computational cost of k-cross validation. In Table 3, the final parameters settings for the four datasets are given by running the I-TSMS-SVR model. As shown in the table, all the data points in the optimal training subset (OTS) are the support vectors (SVs) of the final model, totally. This is because of the strong representative of OTS. Therefore, we can obtain the following conclusion according to the sparseness of SVR: If the size of OTS is relatively small, then all the data points of OTS can represent the

5.2.1. Standard SVR model This is a traditional and classical SVR model called S − SVR. One can simply determine the parameters’ value according to some experimental rules, and the standard SVR model does the learning process by using all the training data. We set the parameters’ value as follows: C = 8, ε = 0.1, r = 2. 5.2.2. SVR model based on training data subset To reduce the complexity of the standard SVR model, a training data subset is extracted from the full training data set T uniformly. In this work, 20% sample is extracted from T, and the above model is named as 20%-SVR. 5.2.3. SVR model based on training data subset and APSO To determine the parameters’ setting, the APSO algorithm is employed to select the optimal SVR model based on a randomly chosen training data subset. In this work, 10% sample is extracted from T, and the above model is named as 10%-APSO-SVR. 5.2.4. SVR based on optimal training subset and APSO To accelerate its training process while keeping high accurate forecasting in each parameters selection step of APSO iteration, an optimal training subset (OTS) method is carried out to choose the representation data points of the full training data set, and this model is named as APSO-OTS-SVR. 5.3. Model evaluation methods Three different performance metrics, the root mean square error (RMSE), the mean absolute error (MAE) and the mean absolute percentage error (MAPE) can be expressed as follows:



n (P i=1 i

RMSE =

MAE =

− Ai )2

n

 n  (Pi − Ai ) i=1

n

,

,

(25)

(26)

DMAPE = MAPE(P) − MAPE(T ),

(28)

where MAPE(T), MAPE(P) are the MAPE of a forecasting model for training dataset T and testing dataset P, respectively.

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

317

Table 3 The final parameters settings for the four datasets. Given by the I-TSMS-SVR model Parameter

A plane

Friedman #1

Boston housing

Electric load in New South Wales

C ε ı2

300.2457 0.0336 39.7863

4.7684 0.98 0.2857

636.934 0.0892 81.2446

5.0475e+008 0.0019 3.9892e+003

Training observations OTS size (number of SVs)

500 22

1000 56

456 19

382 40

training dataset, which reduces the computational complexity, and improves the generalization capability. 5.5. The process of training subset and model selection To vividly show the integrated process of training subset and model selection, Tables 4–11 demonstrate the main indexes that can reflect the above process. By integrating training subset and model selection process, the proposed I-TSMS-SVR model can effectively cut down the computational complexity of the traditional SVR, and dramatically reduce the candidate set space of parameter combinations. The integrated strategy makes the exploration process far more space-filling by cutting the wasteful parameter space of close-by patterns. In order to avoid falling into (bad) local minima, we share the three optimal parameter combinations for training the three TS-SVRs. To show the iteration process of the search region, we choose 4 updated search regions for the iteration TSs TS(m11 ), TS(m13 ), TS(m15 ) and TS(m17 ). In Fig. 3, the updated search

regions under different TS sizes for Friedman #1 dataset are shown: First, we can easily observe a finer and finer search process is performed. Second, from the 2th to 3th subfigure, we do allow all the search particles of the 3th subfigure to fall outside the prescribed search region of the 2th subfigure. Thus, this search region can be adjusted adaptively and dynamically with the iteration process of TSs. Considering that small size TS can progressively reflect the data structure of the full training data, we perform a large number of PSO particles in the log-scale in a wide search region by using a small size TS-SVR, which has low computational complexity due to the small size of TS. In the search phase of the global optimum region, NPSO generates a large number of particles to perform crude search for a wide candidate region. This process takes a relatively small amount of time as shown in Tables 4–11. Then, the movement region of the optimal parameter setting between two TSs with different sizes is estimated as shown in the 6th column. We can easily observe that the movement radius MR(mik , mik−1 ) is convergent. In

Table 4 The process of training subset and model selection for A plane. Iteration

Size of TS

k

m1k m2k m3k

Structure parameter

Model selection

Movement radius

Running time

MaxNk

log2 (C, ε, ı2 )

2∗ 2∗ 2∗

t(m1k ) t(m2k ) t(m3k )

nk

1

20 28 36

300 40 40

20 10 10

(8.6756 −4.5976 4.7925) (9.4993 −5.5172 6.3964) (10.2161 −7.0390 5.7532)

(22.0000 9.9658 22.0000) (8.2375 4.4831 3.8688) (6.6404 4.1229 3.8094)

34.8869 9.7283 13.9978

0.0024 0.0024 0.0023

2

22 36 40

300 40 40

20 10 10

(8.6756 −4.7470 4.9248) (10.2161 −7.0390 5.7532) (10.2161 −7.3130 5.9290)

(1.9603 1.0804 0.9517) (6.6404 4.1229 3.8094) (2.5562 1.9929 1.2684)

50.7027 13.481 20.51

0.0022 0.0023 0.0022

3

17 22 39

103 32 32

12 9 9

(8.1081 −4.4369 4.8775) (8.1081 −4.8323 5.2504) (10.4717 −7.4434 6.2641)

(4.6741 2.5413 2.4247) (1.9603 1.0804 0.9517) (0.6160 0.4780 0.3160)

9.5017 4.9059 16.0983

0.0024 0.0022 0.0022

4

18 22 35

61 28 28

9 8 8

(5.7710 −5.8147 4.1164) (8.1081 −4.8323 5.2504) (10.7386 −7.0506 6.0836)

(0.7771 0.7638 0.4744) (1.9603 1.0804 0.9517) (2.6632 1.8932 1.2965)

6.1757 4.2077 11.1068

0.0024 0.0022 0.0023

5

19 22 32

43 25 25

8 8 8

(6.0943 −5.5812 3.9947) (8.1081 −4.8323 5.2504) (9.4070 −7.7450 5.2979)

(0.7758 0.7003 0.4353) (1.9603 1.0804 0.9517) (1.9379 1.7096 0.9375)

5.829 3.9314 9.6816

0.0023 0.0022 0.0022

6

20 22 30

34 24 24

7 8 8

(6.0389 −5.5243 3.8959) (8.1081 −4.8323 5.2504) (8.4381 −8.4054 5.1535)

(0.7290 0.6635 0.4026) (1.9603 1.0804 0.9517) (1.2571 1.3304 0.6591)

4.6435 3.6219 8.5132

0.0023 0.0022 0.0023

7

21 22 28

28 22 22

6 7 7

(6.2211 −5.4466 4.0526) (8.2300 −4.8944 5.3142) (9.0667 −8.8533 4.8256)

(0.7140 0.6272 0.3982) (1.9603 1.0804 0.9517) (1.4429 1.4847 0.6593)

4.3363 3.3546 7.7755

0.0023 0.0022 0.0023

8

22 22 26

25 21 21

6 7 7

(6.3782 −5.7423 4.0523) (8.2300 −4.8944 5.3142) (8.3452 −9.3512 4.4017)

(0.6977 0.6351 0.3795) (1.9603 1.0804 0.9517) (1.4254 1.6678 0.6455)

4.4061 3.1778 6.7498

0.0023 0.0022 0.0023

9

22 22 25

22 21 21

5 7 7

(8.2300 −4.8944 5.3142) (8.2300 −4.8944 5.3142) (8.3452 −9.2253 4.3527)

(0.8473 0.5130 0.4695) (1.9603 1.0804 0.9517) (0.7587 0.8694 0.3395)

2.7601 3.2523 6.281

0.0022 0.0022 0.0026

10

22 22 24

20 20 20

5 7 7

(8.2300 −4.8944 5.3142) (8.2300 −4.8944 5.3142) (8.2300 −4.8262 5.2525)

(0.8473 0.5130 0.4695) (1.9603 1.0804 0.9517) (0.7786 0.4707 0.4263)

2.5283 3.0893 3.1909

0.0022 0.0022 0.0023

MR(m1k , m1k−1 ) MR(m2k , m2k−1 ) MR(m3k , m3k−1 )

Fitness MSE

318

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

Table 5 The process of training subset and model selection for A plane (continued). Iteration

Size of TS

k

m1k m2k m3k

Structure parameter

Model selection

Movement radius

Running time

MaxNk

log2 (C, ε, ı2 )

2 ∗ MR(m1k , m1k−1 ) 2 ∗ MR(m2k , m2k−1 ) 2 ∗ MR(m3k , m3k−1 )

t(m1k ) t(m2k ) t(m3k )

nk

11

22 22 23

12

22 22 22

18 19 19

5 7 7

(8.2300 −4.8944 5.3142) (8.2300 −4.8944 5.3142) (8.0137 −4.5761 5.2124)

(0.8473 0.5130 0.4695) (1.9603 1.0804 0.9517) (0.7902 0.4623 0.4410)

2.2526 2.8969 2.3562

0.0022 0.0022 0.0023

17 19 19

5 7 7

(8.2300 −4.8944 5.3142) (8.2300 −4.8944 5.3142) (8.2300 −4.8211 5.2471)

(0.8473 0.5130 0.4695) (1.9603 1.0804 0.9517) (0.8473 0.5053 0.4635)

2.259 2.8963 2.4829

0.0022 0.0022 0.0023

the next iterations, the candidate region shrinks to a more and more narrow region and the TS tends to close to the optimal size, so NPSO only need generate relatively few particles to perform fine search for a narrow candidate region. The structure parameter of NPSO defines the above feature by using tuning functions. At this level, we estimate the movement range of the optimal parameter selection of a new size TS-SVR and let the best parameter from the early stage be the center point of the new search region. Note that: When mik = mik−1 , we compute the movement range MR(mik , mik−1 ) by using mik = mik−1 + 1 to keep finer search for model selection. In this state, the convergence of the I-TSMS-SVR model is reached. In the “A plane” data set, the convergence of the I-TSMS-SVR model is reached at the first level of crude search. This is because this dataset is generated by a linear model, which indicates the strong modeling capability of the I-TSMS-SVR model.

Fitness MSE

Taking the Boston housing dataset as an example, Fig. 4 shows the optimisation process curve for the Boston housing dataset. 5.6. Comparison analysis for the experiments For these five SVR models, three performance measurements are employed to investigate their forecasting ability, the total running time is employed to compare complexity, and DMAPE is reported to evaluate their generalization error. The detailed performance comparison is listed in Table 12. When the value of the dependent variable is relatively small, the fluctuation of MAPE value will be bigger. Thus, we mainly compare the RMSE and MAE indexes if the value of the dependent variable is relatively small. The experimental results from the above four data sets are summarized in Table 12. The 1th column represents the forecasting

Table 6 The process of training subset and model selection for Friedman #1. Iteration

Size of TS

k

m1k m2k m3k

Structure parameter

Model selection

Movement radius

Running time

MaxNk

log2 (C, ε, ı2 )

2 ∗ MR(m1k , m1k−1 ) 2 ∗ MR(m2k , m2k−1 ) 2 ∗ MR(m3k , m3k−1 )

t(m1k ) t(m2k ) t(m3k )

nk

1

20 31 50

2

Fitness MSE

300 40 40

20 10 10

(7.0426 −0.0548 −1.6317) (2.5276 −0.0312 −2.0586) (2.1100 −0.0253 −1.7972)

(22.0000 9.9658 22.0000) (11.2874 0.0714 1.1958) (4.4568 0.0474 1.6523)

437.8744 43.926 120.5846

6.8358 6.0862 5.0937

23 50 56

300 40 40

20 10 10

(7.0426 −0.0574 −1.6780) (2.1100 −0.0253 −1.7972) (2.1100 −0.0264 −1.8384)

(2.9561 0.0194 0.3330) (4.4568 0.0474 1.6523) (0.7142 0.0080 0.2950)

281.1585 118.8083 159.3397

6.8737 5.0937 4.4499

3

29 56 60

103 32 32

12 9 9

(7.0260 −0.0519 −1.7701) (2.1814 −0.0272 −1.7499) (2.1814 −0.0279 −1.7742)

(4.9872 0.0304 0.5880) (0.7142 0.0080 0.2950) (0.4453 0.0052 0.1726)

108.612 98.4068 117.675

6.0791 4.4425 4.4464

4

35 56 59

61 28 28

9 8 8

(2.2290 −0.0268 −1.8578) (2.2290 −0.0251 −1.7892) (2.1832 −0.0263 −1.7860)

(1.2717 0.0130 0.4984) (0.7142 0.0080 0.2950) (0.1066 0.0012 0.0420)

74.9726 76.1479 87.0816

5.2865 4.428 4.4456

5

40 56 58

43 25 25

8 8 8

(2.4186 −0.0291 −1.8024) (2.2377 −0.0235 −1.8041) (2.1985 −0.0258 −1.7888)

(0.9686 0.0101 0.3415) (0.7142 0.0080 0.2950) (0.1092 0.0012 0.0427)

58.29 67.9224 73.1464

5.425 4.428 4.4326

6

44 56 57

34 24 24

7 8 8

(2.3542 −0.0315 −1.7388) (2.2377 −0.0235 −1.8041) (2.1811 −0.0256 −1.7755)

(0.6676 0.0078 0.2342) (0.7142 0.0080 0.2950) (0.1102 0.0012 0.0432)

48.0231 64.41 66.0029

4.9595 4.428 4.4428

7

47 56 56

28 22 22

6 7 7

(2.4793 −0.0328 −1.8361) (2.2377 −0.0235 −1.8041) (2.2377 −0.0233 −1.7977)

(0.4836 0.0057 0.1707) (0.7142 0.0080 0.2950) (0.1151 0.0011 0.0445)

35.1474 50.2196 52.5961

4.9875 4.428 4.4271

8

49 56 58

25 21 21

6 7 7

(2.6697 −0.0309 −1.8463) (2.2646 −0.0238 −1.8200) (2.2646 −0.0242 −1.8328)

(0.3273 0.0034 0.1082) (0.1151 0.0011 0.0445) (0.2335 0.0022 0.0904)

38.4433 50.2051 50.5723

5.0071 4.4264 4.468

9

51 56 57

22 21 21

5 7 7

(2.2412 −0.0252 −1.8202) (2.2445 −0.0231 −1.8061) (2.2412 −0.0247 −1.7994)

(0.2637 0.0026 0.1024) (0.1151 0.0011 0.0445) (0.1133 0.0011 0.0437)

27.9622 47.7988 50.3784

5.1547 4.4264 4.4296

10

52 56 56

20 20 20

5 7 7

(2.3589 −0.0267 −1.8786) (2.2445 −0.0231 −1.8061) (2.2445 −0.0229 −1.7997)

(0.1341 0.0014 0.0512) (0.1151 0.0011 0.0445) (0.1154 0.0011 0.0445)

30.5748 44.6538 44.5151

4.9324 4.4237 4.4314

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

319

Table 7 The process of training subset and model selection for Friedman #1 (continued). Iteration

Size of TS

k

m1k m2k m3k

Structure parameter

Model selection

Movement radius

Running time

MaxNk

log2 (C, ε, ı2 )

2 ∗ MR(m1k , m1k−1 ) 2 ∗ MR(m2k , m2k−1 ) 2 ∗ MR(m3k , m3k−1 )

t(m1k ) t(m2k ) t(m3k )

nk

11

53 56 56

12

18 19 19

5 7 7

(2.2535 −0.0235 −1.8145) (2.2535 −0.0233 −1.8076) (2.2535 −0.0233 −1.8076)

(0.1257 0.0012 0.0485) (0.1151 0.0011 0.0445) (0.1159 0.0011 0.0447)

23.2946 40.5677 41.1498

4.9163 4.4237 4.4237

54 56 56

17 19 19

5 7 7

(2.3071 −0.0232 −1.8115) (2.2535 −0.0233 −1.8076) (2.2535 −0.0233 −1.8076)

(0.1263 0.0011 0.0475) (0.1151 0.0011 0.0445) (0.1159 0.0011 0.0447)

22.6938 40.8993 40.8754

4.5756 4.4237 4.4237

13

55 56 56

15 18 18

5 7 7

(2.2860 −0.0236 −1.8261) (2.2535 −0.0233 −1.8076) (2.2535 −0.0233 −1.8076)

(0.1228 0.0011 0.0470) (0.1151 0.0011 0.0445) (0.1159 0.0011 0.0447)

20.9703 39.7097 38.6054

4.5986 4.4237 4.4237

14

56 56 56

15 18 18

5 6 6

(2.2860 −0.0237 −1.8327) (2.2535 −0.0233 −1.8076) (2.2535 −0.0233 −1.8076)

(0.1206 0.0011 0.0463) (0.1151 0.0011 0.0445) (0.1159 0.0011 0.0447)

21.4076 33.459 33.5349

4.4257 4.4237 4.4237

model, the 2nd–5th columns represent the learning performance indexes including accuracy and CPU running time, and the 6th–9th columns represent the generalization performance indexes including accuracy and DMAPE for the validation set P. It can be observed from Table 12 that the I-TSMS-SVR model has better performance, and spends less time than the APSO-OTS-SVR model. This is because that the I-TSMS-SVR model can integrate the training subset and model selection, improve the search ability, and reduce the computational complexity.

Fitness MSE

As shown in Table 12, the differences of the running time between S-SVR and 20%-SVR are just as striking, which indicates the value of TS-SVR. The 10%-APSO-SVR model only extracts 10% of the full training set T, but it obtains the better performance than the S-SVR and 20%-SVR. Note that an empirical parameter setting for the Friedman #1 dataset makes the S-SVR and 20%-SVR models unavailable as shown in Table 12. These results indicate that the parameters selection is the key to the performance of SVR model. For the two artificial data sets, 10% of the full training set T can

Table 8 The process of training subset and model selection for Boston housing. Iteration

Size of TS

k

m1k m2k m3k

Structure parameter

Model selection

Movement radius

Running time

MaxNk

log2 (C, ε, ı2 )

2 ∗ MR(m1k , m1k−1 ) 2 ∗ MR(m2k , m2k−1 ) 2 ∗ MR(m3k , m3k−1 )

t(m1k ) t(m2k ) t(m3k )

nk

1

20 28 36

2

Fitness MSE

300 40 40

20 10 10

(10.8732 −3.1063 6.9520) (7.1042 −2.2860 6.4688) (10.6211 −3.1402 5.9080)

(22.0000 9.9658 22.0000) (14.7078 3.0290 2.1186) (7.0338 1.7083 1.4654)

37.7147 55.3817 91.3456

0.121 0.1538 0.1779

16 20 34

300 40 40

20 10 10

(10.8732 −2.8880 6.7818) (10.8732 −3.1063 6.9520) (10.6211 −3.0763 5.8706)

(7.4907 1.4207 1.1625) (6.8372 1.5193 0.8718) (1.9471 0.4378 0.2602)

272.9324 4.6784 98.4454

0.2477 0.121 0.1507

3

17 20 31

103 32 32

12 9 9

(9.0623 −3.6693 7.1553) (10.8732 −3.1063 6.9520) (9.8513 −3.1885 5.6819)

(1.8109 0.5057 0.3385) (3.7450 1.5545 0.5267) (2.8950 0.7227 0.4061)

65.2685 4.0872 49.419

0.1358 0.121 0.2449

4

18 20 28

61 28 28

9 8 8

(9.9678 −3.9957 7.0305) (10.8732 −3.1063 6.9520) (9.9678 −3.7840 6.9075)

(1.8764 0.5249 0.3135) (2.8225 0.8889 0.4236) (3.2206 0.9327 0.5437)

8.5775 3.178 30.9875

0.1252 0.121 0.1534

5

19 20 26

43 25 25

8 8 8

(10.7183 −4.1742 6.9467) (10.8732 −3.1063 6.9520) (8.6052 −3.8624 6.8301)

(1.9072 0.5238 0.2930) (2.0965 0.6279 0.3073) (2.0373 0.6888 0.3921)

7.7525 2.6609 21.0589

0.1169 0.121 0.1997

6

16 19 24

34 24 24

7 8 8

(10.2188 −3.7234 6.7264) (10.2188 −3.9372 6.8560) (7.5865 −4.0924 6.6934)

(5.4856 1.4159 0.8905) (1.9072 0.5238 0.2930) (1.9373 0.7794 0.4149)

9.3975 3.7004 16.7079

0.2493 0.1169 0.1445

7

17 19 23

28 22 22

6 7 7

(9.1850 −4.5186 7.2201) (10.2188 −3.9372 6.8560) (10.2188 −3.8804 6.8237)

(1.8354 0.6227 0.3416) (1.9072 0.5238 0.2930) (1.3996 0.3921 0.2253)

7.916 3.795 4.8207

0.1376 0.1169 0.1268

8

18 19 22

25 21 21

6 7 7

(10.0793 −3.7809 6.7657) (10.0793 −3.7113 6.7229) (10.0793 −3.6557 6.6897)

(1.8974 0.4967 0.3017) (1.9072 0.5238 0.2930) (1.4412 0.3832 0.2307)

3.6153 3.2502 4.2585

0.1241 0.1159 0.1245

9

19 19 21

22 21 21

5 7 7

(10.3074 −3.7270 6.7854) (9.6322 −3.4800 6.5767) (9.6322 −3.4258 6.5428)

(1.8341 0.4677 0.2862) (1.9072 0.5238 0.2930) (1.4407 0.3730 0.2361)

3.7572 3.2985 10.5404

0.115 0.1159 0.1354

10

17 19 20

20 20 20

5 7 7

(9.9406 −3.4135 6.5890) (9.9406 −3.5399 6.6709) (9.6322 −3.3704 6.5075)

(3.5022 0.8532 0.5665) (1.8341 0.4677 0.2862) (1.5102 0.3818 0.2463)

2.0044 3.0241 2.9439

0.1241 0.115 0.1213

320

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

Table 9 The process of training subset and model selection for Boston housing (continued). Iteration

Size of TS

k

m1k m2k m3k

Structure parameter

Model selection

Movement radius

Running time

MaxNk

log2 (C, ε, ı2 )

2 ∗ MR(m1k , m1k−1 ) 2 ∗ MR(m2k , m2k−1 ) 2 ∗ MR(m3k , m3k−1 )

t(m1k ) t(m2k ) t(m3k )

nk

11

18 19 19

12

18 19 19

5 7 7

(10.0435 −3.7554 6.6496) (9.9406 −3.5399 6.6709) (10.0866 −3.2920 6.4730)

(1.8907 0.4933 0.2966) (1.8341 0.4677 0.2862) (1.6616 0.3887 0.2575)

3.0109 3.2757 2.7199

0.1239 0.115 0.115

19 19 21

17 19 19

5 7 7

(9.9385 −3.6832 6.5737) (9.3150 −3.4864 6.3442) (9.3150 −3.6045 6.4152)

(1.7685 0.4622 0.2773) (1.6616 0.3887 0.2575) (3.1014 0.8551 0.5022)

2.3286 2.9908 9.661

0.1157 0.115 0.1368

13

19 19 20

15 18 18

5 7 7

(9.3150 −3.4864 6.3442) (9.3150 −3.4864 6.3442) (9.3150 −3.5462 6.3805)

(1.5345 0.4116 0.2524) (1.6616 0.3887 0.2575) (1.4605 0.4018 0.2415)

2.2518 2.8616 2.7266

0.115 0.115 0.1222

14

19 19 19

15 18 18

5 6 6

(9.3150 −3.4864 6.3442) (9.3150 −3.4864 6.3442) (9.6723 −3.4467 6.2745)

(1.5345 0.4116 0.2524) (1.6616 0.3887 0.2575) (1.5934 0.4070 0.2496)

3.4367 2.7109 2.88

0.115 0.115 0.1153

contain sufficient information to train the 10%-APSO-SVR model, so we can observe that the 10%-APSO-SVR model also has good performance results. What is more, the random error term plays a big part in the difference of training data points, which makes the OTS contains much interference information. Therefore, we have omitted the comparison of the APSO-OTS-SVR model. Several observations can be concluded from the above results. Firstly, the empirical results obtained demonstrate that the ITSMS-SVR model has better forecasting performance than the

Fitness MSE

four comparison SVR models. Secondly, the model selection is analyzed in this work, and theoretically, we can get the conclusion that the movement region of the optimal combination (ε, C, and ı2 ) between two TS-SVRs with different sizes can be estimated by multiplying a multiplicative factor. Thirdly, the proposed periodical selection strategy is an efficient fine-tuning procedure resulting in lower computational complexity and better generalization ability. Finally, the I-TSMS-SVR model is robust and effective for large-scale training data set. The I-TSMS-SVR model

Table 10 The process of training subset and model selection for electric load in New South Wales. Iteration

Size of TS

k

m1k m2k m3k

Structure parameter

Model selection

Movement radius

Running time

MaxNk

log2 (C, ε, ı2 )

2 ∗ MR(m1k , m1k−1 ) 2 ∗ MR(m2k , m2k−1 ) 2 ∗ MR(m3k , m3k−1 )

t(m1k ) t(m2k ) t(m3k )

nk

1

20 28 36

2

Fitness MSE

300 40 40

20 10 10

(15.7684 −8.1824 6.2737) (22.4362 −7.2481 8.9003) (30.2782 −8.7421 11.0371)

(22.0000 9.9658 22.0000) (14.9723 7.9788 5.0645) (15.6839 5.4164 5.3006)

37.0600 12.8999 17.2345

0.0158 0.0103 0.0084

22 36 40

300 40 40

20 10 10

(15.7684 −8.4485 6.4469) (30.2782 −8.7421 11.0371) (30.2782 −9.0824 11.3744)

(3.5630 1.9228 1.2458) (15.6839 5.4164 5.3006) (7.5759 2.4751 2.4333)

74.2915 16.7669 19.1874

0.0131 0.0084 0.0072

3

25 20 43

103 32 32

12 9 9

(17.5499 −9.8297 6.0406) (28.1050 −10.3199 12.5910) (28.1050 −10.5966 12.8539)

(5.3479 3.0703 1.5729) (7.5759 2.4751 2.4333) (4.8003 1.9883 1.8785)

34.6167 14.6301 15.7677

0.01 0.0072 0.0086

4

28 40 42

61 28 28

9 8 8

(15.7204 −10.8127 6.4532) (28.1050 −10.3199 12.5910) (28.8733 −11.4911 13.7008)

(4.2365 3.0388 1.4866) (2.8225 0.8889 0.4236) (1.5791 0.6910 0.6426)

21.0349 14.78 13.5151

0.011 0.0072 0.0077

5

31 40 41

43 25 25

8 8 8

(17.1661 −10.5311 6.3772) (28.1050 −10.3199 12.5910) (28.9340 −11.6274 13.4349)

(4.1468 2.6905 1.3172) (1.61 0.7013 0.6112) (1.6204 0.7143 0.6453)

16.4062 11.5813 11.5185

0.0151 0.0072 0.0089

6

33 40 40

34 24 24

7 8 8

(15.4876 −11.0954 6.4603) (28.1050 −10.3199 12.5910) (28.1050 −10.2273 12.5025)

(2.2831 1.7466 0.8151) (1.6127 0.6422 0.6153) (1.6127 0.6422 0.6153)

13.569 10.9302 10.7044

0.0159 0.0071 0.0074

7

35 40 40

28 22 22

6 7 7

(27.5918 −10.1769 12.5723) (27.6218 −8.4355 12.3395) (27.6218 −8.4355 12.3395)

(3.8256 1.5174 1.4920) (1.5849 0.5297 0.6072) (1.5849 0.5297 0.6072)

11.1141 10.0506 9.8396

0.0094 0.0071 0.0071

8

36 40 40

25 21 21

6 7 7

(27.7390 −10.8112 12.2689) (27.6218 −8.4355 12.3395) (27.6218 −8.4355 12.3395)

(1.8319 0.7712 0.6940) (1.5849 0.5297 0.6072) (1.5849 0.5297 0.6072)

10.2832 9.5239 9.4061

0.0089 0.0071 0.0071

9

37 40 40

22 21 21

5 7 7

(28.2900 −10.9082 12.1787) (27.6218 −8.4355 12.3395) (27.6218 −8.4355 12.3395)

(1.8169 0.7590 0.6700) (1.5849 0.5297 0.6072) (1.5849 0.5297 0.6072)

9.2187 9.3747 9.3514

0.0073 0.007 0.0071

10

38 40 40

20 20 20

5 7 7

(28.3609 −10.9164 12.2077) (28.9110 −9.0353 11.9619) (28.9110 −9.0353 11.9619)

(1.7727 0.7414 0.6536) (1.8341 0.4677 0.2862) (1.6589 0.5673 0.5887)

8.3909 9.187 9.0882

0.0078 0.007 0.007

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

321

Table 11 The process of training subset and model selection for electric load in New South Wales (continued). Iteration

Size of TS

k

m1k m2k m3k

Structure parameter

Model selection

Movement radius

Running time

MaxNk

log2 (C, ε, ı2 )

2 ∗ MR(m1k , m1k−1 ) 2 ∗ MR(m2k , m2k−1 ) 2 ∗ MR(m3k , m3k−1 )

t(m1k ) t(m2k ) t(m3k )

nk

11

39 40 40

12

40 40 40

Fitness MSE

18 19 19

5 7 7

(28.6960 −11.3940 11.9694) (28.9110 −9.0353 11.9619) (28.9110 −9.0353 11.9619)

(1.7468 0.7557 0.6241) (1.6589 0.5673 0.5887) (1.6589 0.5673 0.5887)

8.0295 8.6284 8.576

0.0079 0.007 0.007

17 19 19

5 7 7

(28.5989 −11.2876 11.8118) (28.9110 −9.0353 11.9619) (28.9110 −9.0353 11.9619)

(1.6966 0.7315 0.6003) (1.6589 0.5673 0.5887) (1.6589 0.5673 0.5887)

7.7282 8.4855 8.6293

0.0072 0.007 0.007

Table 12 Summary of results of the forecasting models. Learning process evaluation on set T

Generalization evaluation on set P

MAPE

RMSE

MAE

Running time(s)

MAPE

RMSE

MAE

DMAPE

For A plane S-SVR 20%-SVR 10%-APSO-SVR I-TSMS-SVR

0.793847 2.70398 0.133457 0.176482

0.152072 0.438379 0.0524207 0.0473258

0.125 0.377 0.042 0.0375

120.0879 1.2097 61.9448 297.5692

1.78984 3.74694 0.145545 0.136453

0.155045 0.443876 0.0526664 0.047605

0.124 0.385 0.0419 0.038

0.995993 1.04296 0.0121 −0.040029

For Friedman #1 S-SVR 20%-SVR 10%-APSO-SVR I-TSMS-SVR

133.006 133.006 0.121326 0.140157

14.6359 14.6359 2.254 2.09239

13.8059 13.8059 1.66055 1.68

5.9568e+003 15.2194 348.2337 3.0226e+003

164.461 164.461 0.134259 0.140381

14.8718 14.8718 2.33692 2.16415

13.9943 13.9943 1.75438 1.663

31.455 31.455 0.012933 0.000224

For housing dataset S-SVR 20%-SVR 10%-APSO-SVR APSO-OTS-SVR I-TSMS-SVR

0.4949 0.2237 0.1517 0.15806 0.150809

25.6105 7.148 5.003 4.039 3.715

23.92 5.965 3.2247 3.42576 3.07861

1587.2 5.609 108.06 1596.5 877.3726

0.62 0.4083 0.2716 0.1359 0.114493

25.3051 12.862 6.54016 2.40422 3.00735

21.82 11.28 5.337 2.041 2.433

0.1251 0.1846 0.1199 −0.02216 −0.036316

For load dataset S-SVR 20%-SVR 10%-APSO-SVR APSO-OTS-SVR I-TSMS-SVR

0.04 0.032852 0.0163167 0.0129045 0.0127575

433.245 378.95 178.949 152.65 147.655

325.6 268 141.4 111.3 108.7

283.42 0.047 0.406 991.13 531.3997

0.0441 0.03963 0.019737 0.013307 0.0124297

483.428 479.842 235.173 171.7 150.088

369.6 334.5 177.1 117.5 108.8

0.0041 0.006778 0.0034203 0.0004025 −0.0003278

6. Conclusions

0.25

The best MSE for TS−SVR

optim value m1 m2 m3

0.2

0.15

0.1

0

2

4

6 8 The iteration number

10

12

14

Fig. 4. The optimisation process curve for the Boston housing dataset.

gives an analysis framework for integrating the training subset and model selection process. Overall, the I-TSMS-SVR model provides a very powerful tool of easy implementation for modeling SVR.

Support vector regression (SVR) is a very promising and popular forecasting model rooted in statistical learning theory, and its generalization performance depends on the good choice of parameter setting. However, the computational complexity is O(K * N3 ) in conventional search algorithms (N is the size of training dataset, and K is the number of search), this will lead to slow learning for a massive data set. The training subset (TS) selection, along with the model selection in the SVR learning process determines the model complexity and performance. Motivated by the relation between the training subset and model selection for SVR, the movement region of the optimal parameter setting between two TSs with different sizes is estimated, then a nested adaptive particle swarm optimization (NPSO) is proposed to integrate the training subset and model selection for SVR. The uniform design idea is transplanted to the above modeling process, and the convergence for the proposed model is proofed. The most salient aspect of the proposed integrated SVR is the integration strategy of encapsulating the PSO based model selection within the OTS based SVR model. Basically, a two-level selection strategy is performed in this paper: In the first level, a small size training subset TS with good representative is uniformly selected from the full training dataset T, and the rest dataset T − TS is used to estimate a generalization error for each parameter combination. Then, the global optimum

322

J. Che et al. / Applied Soft Computing 53 (2017) 308–322

parameter Pbest for this small size TS is obtained by an improved PSO search algorithm, which can reduce the high computational cost by avoiding the use of the full training dataset T and the k-fold CV score function. In the second level, the movement range of Pbest for TSs with different sizes is estimated. Usually, this movement range would be only a small region, and will become increasingly smaller with the iteration process of TSs. Thus we can easily find the new global optimum parameter Pbest for a new TS by finer search of the movement range. Four different datasets, including two artificial problems, one UCI dataset and one real case of electric load forecasting in New South Wales, are performed in the experiment section. Compared with the standard ones, the APSO-OTS-SVR, and other approaches, the proposed I-TSMS-SVR model not only can select proper training subset and parameter, but also has better generalization performance and fewer processing time. Acknowledgements The research was supported by the National Natural Science Foundation of China (Grant Nos. 71301067 and 61573266), and the Natural Science Foundation of JiangXi Province, China (Grant No. 20142BAB217015). The research was also supported by the National Social Science Foundation of China (Grant No. 12CTJ012). References [1] A. Smola, B. Schölkopf, A tutorial on support vector regression, Stat. Comput. 14 (2004) 199–222. [2] S.M. Hosseini, N. Mahjouri, Integrating support vector regression and a geomorphologic artificial neural network for daily rainfall-runoff modeling, Appl. Soft Comput. (2015), http://dx.doi.org/10.1016/j.asoc.2015.09.049, Available Online 17 October. [3] V. Vapnik, The Nature of Statistical Learning Theory, Springer, NY, 1995. ´ S. Shamshirband, et al., Support vector regression methodology [4] D. Petkovic, for prediction of input displacement of adaptive compliant robotic gripper, Appl. Intell. 41 (3) (2014) 887–896. [5] G.E. Güraksın, H. Haklı, H. Uˇguz, Support vector machines classification based on particle swarm optimization for bone age determination, Appl. Soft Comput. 24 (2014) 597–602. [6] N. Harish, S. Mandal, S. Rao, S.G. Patil, Particle swarm optimization based support vector machine for damage level prediction of non-reshaped berm breakwater, Appl. Soft Comput. 27 (2015) 313–321. [7] S. Shamshirband, et al., Support vector machine firefly algorithm based optimization of lens system, Appl. Opt. 54 (1) (2015) 37–45. [8] A.H.H.B. Zaji, S. Shamshirband, S.N. Qasem, Potential of particle swarm optimization based radial basis function network to predict the discharge coefficient of a modified triangular side weir, Flow Meas. Instrum. 45 (2015) 404–407. [9] Q. Wu, R. Law, E. Wu, J. Lin, A hybrid-forecasting model reducing Gaussian noise based on the Gaussian support vector regression machine and chaotic particle swarm optimization, Inf. Sci. 238 (2013) 96–110. [10] J. Larsen, C. Svarer, L.N. Andersen, L.K. Hansen, Adaptive regularization in neural network modeling, in: G.B. Orr, K.R. Müller (Eds.), Neural Networks: Trick of the Trade, Springer, Berlin, 1998. [11] Y. Bengio, Gradient-based optimization of hyper-parameters, Neural Comput. 12 (8) (2000) 1889–1900. [12] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learn. 46 (1–3) (2002) 131–159. [13] S.S. Keerthi, Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms, IEEE Trans. Neural Netw. 13 (5) (2002) 1225–1229. [14] M. Momma, K. Bennett, A pattern search method for model selection of support vector regression, in: Proceedings of the SIAM International Conference on Data Mining, 2002, pp. 261–274. [15] W. Li, L. Liu, W. Gong, Multi-objective uniform design as a SVM model selection tool for face recognition, Expert Syst. Appl. 38 (6) (2011) 6689–6695. [16] Á.B. Jiménez, et al., Finding optimal model parameters by deterministic and annealed focused grid search, Neurocomputing 72 (13–15) (2009) 2824–2832. [17] L. Zhang, J. Wang, Optimizing parameters of support vector machines using team-search-based particle swarm optimization, Eng. Comput. 32 (5) (2015) 1194–1213.

[18] H.-H. Tsai, B.-M. Chang, S.-H. Liou, Rotation-invariant texture image retrieval using particle swarm optimization and support vector regression, Appl. Soft Comput. 17 (2014) 127–139. [19] I.D. Lins, M.C. Moura, E. Zio, E.L. Droguett, A particle swarm-optimized support vector machine for reliability prediction, Qual. Reliab. Eng. Int. 28 (2) (2012) 141–158. [20] Z. Dong, D. Yang, T. Reindl, W.M. Walsh, A novel hybrid approach based on self-organizing maps, support vector regression and particle swarm optimization to forecast solar irradiance, Energy 82 (2015) 570–577. [21] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Scholkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 185–208. [22] Y.J. Lee, O.L. Mangasarian, RSVM: Reduced Support Vector Machines, Tech. Rep. 00-07, Data Mining Inst., Comp. Sci. Dept., Univ. Wisconsin, Madison, WI, July 2000, Available: ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps. [23] K.D. Brabanter, J.D. Brabanter, J.A.K. Suykens, B. De Moor, Optimized fixed-size kernel models for large data sets, Comput. Stat. Data Anal. 54 (2010) 1484–1504. [24] J. Che, Support vector regression based on optimal training subset and adaptive particle swarm optimization algorithm, Appl. Soft Comput. 13 (8) (2013) 3473–3481. [25] Z. Lu, J. Sun, Non-Mercer hybrid kernel for linear programming support vector regression in nonlinear systems identification, Appl. Soft Comput. 9 (1) (2009) 94–99. [26] D.N. Zheng, J.X. Wang, Y.N. Zhao, Non-flat function estimation with a multi-scale support vector regression, Neurocomputing 70 (1–3) (2006) 420–429. [27] Y.P. Zhao, J.G. Sun, Robust support vector regression in the primal, Neural Netw. 21 (10) (2008) 1548–1555. [28] B. Schölkopf, A.J. Smola, R.C. Williamson, P.L. Bartlett, New support vector algorithms, Neural Comput. 12 (5) (2000) 1207–1245. [29] F. Provost, Machine learning from imbalanced data sets 101[C], in: Proceedings of the AAAI2000 Workshop on Imbalanced Data Sets, 2000, pp. 1–3. [30] K.T. Fang, D.K.J. Lin, P. Winker, Y. Zhang, Uniform design: theory and applications, Technometrics 42 (2000) 237–248. [31] K.T. Fang, Y. Wang, Number-Theoretic Methods in Statistics, Chapmman and Hall, London, 1994. [32] H. Niederreiter, P. Peart, Localization of search in quasi-Monte Carlo methods for global optimization, SIAM J. Sci. Stat. Comput. 7 (2) (1986) 660–664. [33] K.T. Fang, D.K.J. Lin, Uniform experimental designs and their applications in industry, in: R. Khattree, C.R. Rao (Eds.), in: Handbook of Statistics, 22, North-Holland, Amsterdam, 2003, pp. 131–170. [34] V. Cherkassky, Y.Q. Ma, Practical selection of SVM parameters and noise estimation for SVM regression, Neural Netw. 17 (1) (2004) 113–126. [35] R: The Comprehensive R Archive Network, 2015, Available: http://cran.rproject.org/. [36] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapmman and Hall, London, 1986. [37] C.J. Stone, An asymptotically optimal window selection rule for kernel density estimates, Ann. Stat. 12 (1984) 1285–1297. [38] O.G. Ali, K. Yaman, Selecting rows and columns for training support vector regression models with large retail datasets, Eur. J. Oper. Res. 226 (2013) 471–480. [39] A.J. Smola, B. Scholkopf, et al., The connection between regularization operators and support vector kernels, Neural Netw. 11 (4) (1998) 637–649. [40] J.T. Kwok, I.W. Tsang, Linear dependency between epsilon and the input noise in epsilon-support vector regression, IEEE Trans. Neural Netw. 14 (3) (2003) 544–553. [41] D. Mattera, S. Haykin, Support vector machines for dynamic reconstruction of a chaotic system, in: Advances in Kernel Methods, MIT Press, 1999, pp. 211–241. [42] J. Kennedy, R.C. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, 1995, pp. 1942–1948. [43] C. Huang, Y. Lee, D.K.J. Lin, S. Huang, Model selection for support vector machines via uniform design, Comput. Stat. Data Anal. 52 (1) (2007) 335–346. [44] MATLAB, 2015, Available: http://www.mathworks.com/. [45] G. Ridgeway, D. Madigan, T. Richardson, Boosting methodology for regression problems, in: Proc. 7th Int. Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, FL, 1999, pp. 152–161. [46] J.H. Friedman, E. Grosse, W. Stuetzle, Multidimensional additive spline approximation, SIAM J. Sci. Stat. Comput. 4 (2) (1983) 291–301. [47] D. Harrison, D.L. Rubinfeld, Hedonic prices and the demand for clean air, J. Environ. Econ. Manag. 5 (1978) 81–102. [48] C. Blake, E. Keogh, C.J. Merz, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, 1998 http://www.ics.uci.edu/mlearn/MLRepository.html. [49] Z.H. Zhou, J.X. Wu, W. Tang, Z.Q. Chen, Combining regression estimators: GA-based selective neural network ensemble, Int. J. Comput. Intell. Appl. 1 (4) (2001) 341–356.