Neurocomputing 151 (2015) 364–375
Optimal sub-models selection algorithm for combination forecasting model

JinXing Che
College of Science, NanChang Institute of Technology, NanChang 330099, JiangXi, China
E-mail address: [email protected]
http://dx.doi.org/10.1016/j.neucom.2014.09.028
Article history: Received 15 March 2014; Received in revised form 8 August 2014; Accepted 16 September 2014; Available online 28 September 2014. Communicated by P. Zhang.

Abstract

It has been widely demonstrated in forecasting that combining forecasts can improve forecast performance compared to individual forecasts. However, how to select the optimal sub-models from all the available models is a difficult problem in building a combination forecasting model. Considering that redundant information among the selected sub-models reduces the performance of the combination forecasting model, it is advocated to select a sub-model only if it contributes to the redundancy removing mutual information between the outputs of the selected sub-models and the actual outputs. As linear combination methods are promising and popular in the field of combination forecasting, a novel Max-Linear-Relevance and Min-Linear-Redundancy based selection algorithm is proposed in this paper. The proposed selection algorithm provides a theoretical approach for optimal sub-models selection, and it computes the redundancy removing linear mutual information between the outputs of the selected sub-models and the actual outputs. Three monthly time series from DataMarket are used as illustrative examples to evaluate the forecasts. The implementation shows that the proposed combination forecasting model produces better forecasts than other models.

Keywords: Combination forecasting model; Sub-models selection; Monthly time series; Max-Linear-Relevance and Min-Linear-Redundancy; Support vector regression
1. Introduction

Forecasting has always been an essential instrument in scientific research, and it has received much attention during the past three decades [1–4]. The main types of forecasting models include ARIMA models [5,6], exponential smoothing models [7,8], neural network models [9,10], and support vector regression models [11–13]. Researchers in this area often study the nature of an individual model and try to identify the best one to generate a forecast. Although these models have merits of their own, they also have limitations. To combine information and disperse errors from different models, the combination forecasting model was first proposed in the 1960s by Bates and Granger [14]. Since then, many researchers have demonstrated that a combination forecasting model can improve forecast performance compared to individual forecasts [15,16]. Wang et al. [17] analyze the seasonality and nonlinearity of the electric load, select the seasonality sub-models, namely the seasonal ARIMA forecasting model and the seasonal exponential smoothing model, and the nonlinearity sub-model, namely the weighted support vector machines, and then present a new combined model for
electric load forecasting. From this study, we can conclude that combining the best individual models may not yield the best combination forecasting model; one should consider the complementarity of the selected sub-models. In order to obtain accurate forecasts, the key point in combination forecasting is the determination of the optimal subset of the available individual models. Thus, how to select the optimal sub-models from the available individual models has become an important research topic in combination forecasting. Sub-models selection for combination forecasting is rarely studied in the literature. Costantini and Pappalardo [18], Kisinbay [19], and Franses [20] use the encompassing test as the sub-models selection strategy. However, these studies only provide experimental assessments of combination selection forecasting. Recently, by finding the maximum mutual information between the optimal subset of all the individual models and the actual outputs, Cang and Yu [21] first proposed a mutual information (MI) based selection algorithm to select the optimal subset of the available individual models for a combination forecasting model, and demonstrated that the combination of the selected optimal sub-models significantly outperforms the combination of all the available individual models in tourism demand forecasting. This algorithm provides a theoretical approach for combination selection forecasting without having to try all possible combinations of the
individual models. Yet the selected optimal subset may not be the subset that contains the maximum information. The reason is that the selection criterion, the accumulated MI value for the combined model, has two limitations for a linear combination forecasting model: firstly, the calculated MI consists of linear-relationship information and nonlinear-relationship information, but only the linear-relationship information determines the linear combination selection; secondly, the calculated MI contains redundancy information among the selected optimal sub-models, which reduces the performance of the combination forecasting model. Thus, improving the above information measurement criterion is of critical importance.

As the redundancy information among the selected optimal sub-models reduces the performance of the combination forecasting model, it is advocated to select a sub-model only if it contributes to the redundancy removing mutual information. Considering that linear combination methods are promising and popular in the field of combination forecasting [17,21], a novel Max-Linear-Relevance and Min-Linear-Redundancy based selection criterion, named redundancy removing linear mutual information model selection (RRLS), is proposed in this paper. RRLS provides a theoretical approach for the linear combination selection problem, and it computes the redundancy removing linear mutual information between the outputs of the selected sub-models and the actual outputs. The proposed selection algorithm first forecasts the series independently with all the individual models, and then uses the proposed RRLS method to optimally select and combine these forecasts. That is, an individual sub-model is selected in the final combination if the combination with this sub-model yields more redundancy removing mutual information between the selected sub-models and the actual outputs than a combination without it. Three monthly time series from DataMarket are used as illustrative examples to evaluate the forecasts. The experimental results demonstrate the forecast superiority of the proposed algorithm over the usual methods.
2. Materials and methods

In this section, we present the fundamental background of forecasting error measurement and combination forecasting models, describe the individual forecasting sub-models for the combined model, then introduce the concept of mutual information, review some recently proposed methods in that area, and point out their improvements as well as their limitations.

2.1. Forecasting error measurement

To test the forecasting capability, the following three statistical metrics are defined [26]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(P_i - A_i)^2}, \qquad (1)$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{P_i - A_i}{A_i}\right| \times 100\%, \qquad (2)$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - A_i\right|, \qquad (3)$$

where $P_i$ and $A_i$ are the $i$th predicted and actual values, respectively, and $n$ is the total number of predictions.

2.2. Combination forecasting model

Many researchers have found empirical evidence that a single model usually cannot provide the best forecasting accuracy, and that combining forecasts from conceptually different models generally leads to improved out-of-sample forecasting performance relative to a strategy of selecting the single best forecasting model [22]. The nature and effectiveness of the combined forecast depend on the type of the selected individual models. Mathematically, the combined forecast takes the form [23]:

$$\hat{Y} = F(f_1(Y), f_2(Y), \ldots, f_k(Y)) \qquad (4)$$

where each $f_i : \mathbb{R}^N \rightarrow \mathbb{R}^N$ $(i = 1, 2, \ldots, k)$ represents an individual forecasting model transforming the actual data set $Y$ into its forecast $\hat{Y}$, and $F : \mathbb{R}^N \rightarrow \mathbb{R}^N$ is the combination function. For a linear combination of forecasts, $F$ is given by

$$F(f_1(Y), f_2(Y), \ldots, f_k(Y)) = \omega_1 f_1(Y) + \omega_2 f_2(Y) + \cdots + \omega_k f_k(Y) \qquad (5)$$

To introduce the combination forecasting model vividly, a schematic diagram is drawn in this subsection. As shown in Fig. 1, the combination forecasting model can improve out-of-sample forecasting performance because it can reduce the error term and increase the R-square ($R^2$). Recently, many researchers have demonstrated that combination forecasting models improve forecast performance compared to individual forecasts [15–20].

In this paper, we consider the error-based linear combination methods [17,21]. In such a method, the combining weight of each individual forecast is assigned as the inverse of the past forecast error of the corresponding model, so that a model with less error receives more weight and vice versa [24]. That is,

$$\omega_i = \frac{e_i^{-1}}{\sum_{i=1}^{k} e_i^{-1}}, \qquad \forall i = 1, 2, \ldots, k \qquad (6)$$

where $e_i$ denotes the validation forecast error obtained by the $i$th forecasting model. The choice of a particular forecast error measure determines the effectiveness of an error-based combination scheme.

Fig. 1. The schematic diagram of the combination forecasting model.
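To make Eqs. (1)–(3), (5) and (6) concrete, the following minimal Python sketch computes the three error measures and the inverse-error weights; the toy arrays and the choice of MAE as the validation error are illustrative assumptions, not part of the original study.

```python
import numpy as np

def rmse(pred, actual):
    # Eq. (1): root mean squared error
    return np.sqrt(np.mean((pred - actual) ** 2))

def mape(pred, actual):
    # Eq. (2): mean absolute percentage error
    return np.mean(np.abs((pred - actual) / actual)) * 100.0

def mae(pred, actual):
    # Eq. (3): mean absolute error
    return np.mean(np.abs(pred - actual))

def error_based_weights(errors):
    # Eq. (6): weight each model by the inverse of its validation error
    inv = 1.0 / np.asarray(errors, dtype=float)
    return inv / inv.sum()

def linear_combination(forecasts, weights):
    # Eq. (5): weighted sum of the k individual forecasts (k x N array)
    return np.dot(weights, forecasts)

# Hypothetical toy example with two sub-models on a 3-point validation set
actual = np.array([100.0, 110.0, 120.0])
forecasts = np.array([[102.0, 108.0, 123.0],
                      [ 95.0, 112.0, 118.0]])
errors = [mae(f, actual) for f in forecasts]   # e_i for each sub-model
w = error_based_weights(errors)                # omega_i
combined = linear_combination(forecasts, w)
print(w, rmse(combined, actual), mape(combined, actual))
```

Any of the error forms defined next (VACO, DMSFE) can be substituted for the MAE when computing $e_i$.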
The variance covariance (VACO) combination form is defined as follows:

$$e_j = \sum_{i=1}^{N} \left(y_i - \hat{y}_i^{(j)}\right)^2, \qquad \forall j = 1, 2, \ldots, k \qquad (7)$$

The discounted mean square forecast error (DMSFE) combination form is defined as follows:

$$e_j = \sum_{i=1}^{N} \beta^{\,N-i+1} \left(y_i - \hat{y}_i^{(j)}\right)^2, \qquad \forall j = 1, 2, \ldots, k \qquad (8)$$

where $N$ is the total number of training observations and $\hat{y}_i^{(j)}$ is the $i$th forecast of the $j$th model. Following [21], $\beta$ is chosen as 0.95, 0.9, 0.85 and 0.8 in this study.
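A sketch of the two error forms in Eqs. (7) and (8) follows, under the assumption that `actual` and `model_forecast` are aligned NumPy arrays over the N training points:

```python
import numpy as np

def vaco_error(actual, model_forecast):
    # Eq. (7): sum of squared validation errors of model j
    return np.sum((actual - model_forecast) ** 2)

def dmsfe_error(actual, model_forecast, beta=0.95):
    # Eq. (8): squared errors discounted so that recent observations
    # (large i) receive more weight, since beta < 1
    n = len(actual)
    i = np.arange(1, n + 1)
    discount = beta ** (n - i + 1)
    return np.sum(discount * (actual - model_forecast) ** 2)
```

The resulting $e_j$ values feed directly into the weighting of Eq. (6), giving the VACO and DMSFE combination schemes used in the experiments.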
2.3. Individual forecasting sub-models for the combined model

Over the past several decades, researchers have developed many excellent linear and nonlinear individual models. To show the complementarity principle of the selected sub-models, popular nonlinear models (the support vector regression models with different input dimensions) and linear models (the exponential smoothing and ARIMA models) are selected. Brief descriptions are given in the following subsections.

2.3.1. Support vector regression model

The foundations of the support vector regression model (SVR), a class of robust statistical methods based on the Structural Risk Minimization (SRM) principle, were first developed by Vapnik [25]. The main idea of nonlinear regression with SVR is the "kernel trick", which maps the input patterns into a higher-dimensional space $\digamma$ by a function $\phi : W \rightarrow \digamma$. Any function that satisfies Mercer's condition can be used as a kernel function. In this study, the author adopts the widely used Gaussian kernel function:

$$K(x, x') = \exp\left(-\frac{(x - x')^2}{2\delta^2}\right) \qquad (9)$$

The objective of SVR is then to find a linear regression with good generalization ability in the higher-dimensional feature space. Given the training data $(x_1, y_1), \ldots, (x_n, y_n) \subset W \times \mathbb{R}$, where $W$ denotes the input space and $y_i$ is the corresponding output value in the output space $\mathbb{R}$, and using Vapnik's $\varepsilon$-insensitive loss function and the Lagrange multiplier method, the SVR is converted to a Quadratic Programming Problem (QPP). The nonlinear SVR function is therefore
$$f(x) = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i)\, K(x_i, x) + b \qquad (10)$$
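As a hedged illustration of Eqs. (9) and (10), the sketch below trains an RBF-kernel SVR on lag-vector inputs with scikit-learn; the library choice, the synthetic series and the hold-out length are assumptions for illustration (the paper tunes $(C, \varepsilon, \delta^2)$ by the method of [26] and works in MATLAB). Note that scikit-learn's `gamma` equals $1/(2\delta^2)$ under Eq. (9).

```python
import numpy as np
from sklearn.svm import SVR

def lagged_matrix(series, m):
    # Phase-space reconstruction: each row holds the previous m
    # observations, and the target is the next observation.
    X = np.array([series[t - m:t] for t in range(m, len(series))])
    y = series[m:]
    return X, y

# Hypothetical monthly series; m = 12 mirrors the paper's one-year window
rng = np.random.default_rng(0)
series = 100 + 10 * np.sin(np.arange(120) * 2 * np.pi / 12) + rng.normal(0, 1, 120)
X, y = lagged_matrix(series, m=12)

delta2 = 2.263                      # example value, matching SVR-1 in Table 1
model = SVR(kernel="rbf", C=10.0, epsilon=0.03, gamma=1.0 / (2 * delta2))
model.fit(X[:-12], y[:-12])         # hold out the last year for testing
print(model.predict(X[-12:]))       # one-step-ahead forecasts
```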
A phase space reconstruction technique translates a single-variable time series into a multi-dimensional phase space, which indicates the system state at different times. The data used in this paper are monthly, so the previous year (12-dimensional phase space, $m = 12$), a year and a quarter ($m = 15$), up to two years ($m = 18$, $m = 21$ and $m = 24$) are translated into inputs to construct five different input sets, and the associated SVR models are trained by the method in [26].

2.3.2. Exponential smoothing method

Exponential smoothing, proposed by Charles C. Holt in 1957 [27], is a simple and effective forecasting technique. The exponential smoothing method used here is single smoothing, where only one parameter needs to be estimated; it can be expressed as follows:

$$F_t = \alpha R_{t-1} + (1 - \alpha) F_{t-1} \qquad (11)$$
where $F_t$ is the forecast for period $t$, $R_{t-1}$ is the observed value of the series in period $t-1$, $F_{t-1}$ is the old forecast for period $t-1$, and $\alpha$ is the smoothing constant between zero and one. By recursively replacing $F_{t-1}$ with its components, the implication of exponential smoothing can be better understood:

$$F_t = \alpha R_{t-1} + \alpha (1 - \alpha) R_{t-2} + \cdots + \alpha (1 - \alpha)^{t-2} R_1 \qquad (12)$$
Therefore, $F_t$ is a weighted moving average of all past observations. To interpret the role of the weighting factor $\alpha$, the exponential smoothing equation can be rewritten in the following form:

$$F_t = F_{t-1} + \alpha (R_{t-1} - F_{t-1}) \qquad (13)$$

The forecast is thus the old forecast plus the error of the previous forecast weighted by $\alpha$. In general, a small value of $\alpha$ is desired for stable predictions with smoothed random variation, and a large value of $\alpha$ is desired for predictions that respond rapidly.
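A minimal sketch of Eqs. (11)–(13); the seed value $F_1 = R_1$ is a common convention assumed here, since the paper does not state its initialization.

```python
import numpy as np

def single_exponential_smoothing(series, alpha):
    # Eq. (11): F_t = alpha * R_{t-1} + (1 - alpha) * F_{t-1}
    forecasts = np.empty(len(series))
    forecasts[0] = series[0]          # seed F_1 = R_1 (assumption)
    for t in range(1, len(series)):
        forecasts[t] = alpha * series[t - 1] + (1 - alpha) * forecasts[t - 1]
    return forecasts

series = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
print(single_exponential_smoothing(series, alpha=0.3))
```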
2.3.3. ARIMA model

For a stationary time series $y_t$, Box and Jenkins [28] proposed the ARIMA$(p, d, q)$ model, which considers the last $p$ known values of the series as well as $q$ of the past modeling errors:

$$\phi_p(B)\left[(1 - B)^d (y_t - \mu)\right] = \theta_q(B)\, \varepsilon_t \qquad (14)$$

$$\phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p \qquad (15)$$

$$\theta_q(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q \qquad (16)$$

where $y_t$ is the historical value at period $t$, $\varepsilon_t$ is the estimated residual, $\mu$ is the overall mean of the series (a constant), $d$ is the order of regular differences, and $B$ is a backward shift operator with $B y_t = y_{t-1}$ and $B \varepsilon_t = \varepsilon_{t-1}$. In this paper, the minimum Akaike Information Criterion (AIC) is used to identify the ARIMA orders $(p, d, q)$ objectively by means of the encompassing method.

Remark. For ARIMA$(p, d, q)$, the MA model is a special case of the ARIMA model if $p = d = 0$, and the AR model is a special case if $d = q = 0$. That is, MA$(q) \equiv$ ARIMA$(0, 0, q)$ and AR$(p) \equiv$ ARIMA$(p, 0, 0)$; the parameters $(p, d, q)$ are selected optimally by the encompassing method.
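The minimum-AIC identification described above can be sketched as a plain grid search over $(p, d, q)$; the grid bounds and the use of statsmodels are illustrative assumptions standing in for the paper's encompassing method.

```python
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def fit_best_arima(series, max_p=3, max_d=2, max_q=3):
    # Grid-search (p, d, q) and keep the fit with the minimum AIC
    best_aic, best_fit = np.inf, None
    for p, d, q in itertools.product(range(max_p + 1),
                                     range(max_d + 1),
                                     range(max_q + 1)):
        try:
            fit = ARIMA(series, order=(p, d, q)).fit()
        except Exception:
            continue  # some orders fail to converge; skip them
        if fit.aic < best_aic:
            best_aic, best_fit = fit.aic, fit
    return best_fit

# best = fit_best_arima(training_series)
# forecast = best.forecast(steps=12)
```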
2.4. Entropy and mutual information

The entropy function $H(X)$ of a random variable (r.v.) $X$ measures its prior uncertainty in terms of its probability. For a continuous r.v. $X$, $H(X)$ is defined as follows:

$$H(X) = \int_{-\infty}^{+\infty} p(x) \log \frac{1}{p(x)}\, dx \qquad (17)$$

Finding the probability density function $p(x)$ exactly and performing the integration is difficult and often practically impossible. Thus, continuous variables are usually converted to discrete ones by dividing the continuous input feature space into several discrete partitions. For a discrete r.v. $X$, $H(X)$ is defined as follows:

$$H(X) = \sum_x p(x) \log \frac{1}{p(x)} \qquad (18)$$

The conditional entropy function $H(X|Y)$ measures the posterior uncertainty of $X$ after $Y$ is known. It is defined as follows:

$$H(X|Y) = \sum_x \sum_y p(x, y) \log \frac{1}{p(x|y)} \qquad (19)$$

The mutual information (MI) $I(X; Y)$ between r.v.s $X$ and $Y$ is defined as follows [29]:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y) \qquad (20)$$

Based on the above definitions, we can conclude that MI measures the amount of uncertainty in $X$ that is reduced once $Y$ has been observed.
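For the discrete definitions in Eqs. (18)–(20), a small histogram-based estimator can be sketched as follows; the equal-width binning is an assumption, since the text only states that continuous variables are discretized into partitions.

```python
import numpy as np

def discretize(x, bins=10):
    # Partition a continuous variable into equal-width bins (assumption)
    edges = np.histogram_bin_edges(x, bins=bins)
    return np.digitize(x, edges[1:-1])

def entropy(labels):
    # Eq. (18): H(X) = sum_x p(x) log(1/p(x))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x, y, bins=10):
    # Eq. (20): I(X;Y) = H(X) + H(Y) - H(X,Y)
    dx, dy = discretize(x, bins), discretize(y, bins)
    joint = dx * (dy.max() + 1) + dy      # encode the pair (x-bin, y-bin)
    return entropy(dx) + entropy(dy) - entropy(joint)
```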
The MI is often used to build feature subset selection algorithms [30]. In a classification problem, the target is to predict the class label by using the selected feature values. In the first step, the feature $X_i$ which has the largest MI with the class attribute $Y$ is selected for the target feature subset $S$. In each following step, the selection criterion is to determine how much information can be added with respect to the already existing $S$. Inspired by the fact that the class-feature mutual information (CFMI) denotes the discrimination ability of a feature (relevance), while the feature-feature mutual information (FFMI) measures the redundancy among features, several feature subset selection algorithms have been proposed over the past 10 years [31–34]. Although significant improvements have been made, the existing studies have two limitations: firstly, only a part of the FFMI is contained in the CFMI, so the subtraction of CFMI and FFMI cannot reflect max-relevance and min-redundancy; secondly, the mutual information based selection algorithms mainly focus on the classification problem, and this study extends them to the combination selection problem. Thus, the author proposes a new method to overcome the above limitations in the following section.

Remark. Vinh et al. [34] recently proposed the following normalized subtraction (NMIFS) of CFMI and FFMI to reflect max-relevance and min-redundancy:

$$f(X^{(i)}) = \frac{I(X_i; Y)}{\min\{H(X_i), H(Y)\}} - \frac{1}{|S_k|} \sum_{X_j \in S_k} \frac{I(X_i; X_j)}{\min\{H(X_i), H(X_j)\}} \qquad (21)$$

$$X^{(k)} = \arg\max_{X_i \in T_k} \left[f(X^{(i)})\right] \qquad (22)$$

Although both terms of the subtraction in Eq. (21) (NMIFS) are within the range [0,1], the value of the whole Eq. (21) may fall outside the range [0,1]. For example, if $I(X_i; X_j) \ge I(X_i; Y)$ for any $X_j \in S_k$, then $f(X^{(i)}) \le 0$. Generally, only a part of the right-side mutual information is contained in the left-side mutual information.

3. The proposed combination selection algorithm

3.1. Normalized mutual information selection criteria based on Max-Linear-Relevance and Min-Linear-Redundancy

3.1.1. Normalized mutual information for regression problem

In a regression problem, we aim to infer the dependent variable $Y$ by selecting some suitable independent variables $X$. Basically, the stronger the relationship between two variables, the larger the MI they will have. This conclusion is formally stated in the following Theorem 1, whose proof can be found in [35].

Theorem 1. For any discrete random variables $X$ and $Y$, $I(X; Y) \ge 0$. Moreover, $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent.

From Theorem 1, we can draw the following conclusion: $I(X; Y) = 0$ means that observing $X$ does not reduce the uncertainty of $Y$. If, however, $X = Y$, then $I(X; Y) = I(Y; Y) = H(Y)$, meaning that observing $X$ removes all the uncertainty of $Y$. Thus,

$$0 \le I(X; Y) \le I(Y; Y) = H(Y), \qquad (23)$$

and the entropy is also called self-information. The author proposes to use the entropy of the dependent attribute $Y$ as a criterion that measures the amount of uncertainty in $Y$. Since, by Eq. (23), the MI between the selected variable $X$ and the dependent variable $Y$ is bounded by the entropy $H(Y)$, the author proposes to normalize the term $I(X; Y)$ by the entropy of the dependent attribute $Y$ as follows:

$$NI(X; Y) = \frac{I(X; Y)}{I(Y; Y)} \qquad (24)$$

The normalization measures the ratio of uncertainty in the dependent variable $Y$ that is reduced once $X$ has been selected, and it restricts the value to the range [0,1]. In the combination forecasting model, this uncertainty ratio needs to be reduced by selecting optimal sub-models.

3.1.2. Common linear redundancy

As the linear combination model generally combines the selected sub-models by three linear combinations (Simple Average, Variance Covariance, and Discounted Mean Square Forecast Error), only the linear part of the redundancy (the right-side term of the NMIFS Eq. (21)) should be considered. In this work, we define the linear-relationship information ratio between two r.v.s $X$, $Y$ as the following correlation-based measure:

$$r_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \qquad (25)$$

where $\mu_X$, $\mu_Y$ and $\sigma_X$, $\sigma_Y$ are the means and the standard deviations of $X$ and $Y$, respectively. If the linear-relationship information between $X$ and $Y$ is uniformly distributed throughout the region of the mutual information $I(X; Y)$, the amount of linear information between the r.v. $X$ and the dependent variable $Y$, called the linear mutual information $I_L(X; Y)$, can be computed by multiplying the MI $I(X; Y)$:

$$I_L(X; Y) = I(X; Y)\, r_{X,Y} \qquad (26)$$

Based on the above theory, the linear-relationship information ratio between two r.v.s $X_s$ and $X_j$ can be written as

$$r_{X_s,X_j} = \frac{E[(X_s - \mu_{X_s})(X_j - \mu_{X_j})]}{\sigma_{X_s} \sigma_{X_j}} \qquad (27)$$

If the information of the linear relationship between $X_s$ and $X_j$ is distributed uniformly throughout the amount of linear information $I_L(X_j; Y)$, the amount of common linear information between the two r.v.s $X_s$, $X_j$ and the dependent variable $Y$, denoted $I_L(X_s; X_j; Y)$, can be defined by multiplying $I_L(X_j; Y)$:

$$I_L(X_s; X_j; Y) = I_L(X_j; Y)\, r_{X_s,X_j} \qquad (28)$$
Fig. 2. Raw data in the experiment. Top panel: Electricity Production, in million kilowatt hours, monthly from Aug 1972 to Aug 1995. Middle panel: Milk Production, in pounds per cow, monthly from Jan 1962 to Dec 1975. Bottom panel: Economic indicators, in thousands of units, monthly from May 1982 to Apr 2003.
For the feature-feature redundancy, we pay close attention to the part of the common linear redundancy with regard to the goal variable (the dependent variable). The author defines the maximum common linear redundancy, denoted $I_{ML}(X_s; X_j; Y)$, to extract this common linear redundancy:

$$I_{ML}(X_s; X_j; Y) = \max_{X_s \in S}\{ I_L(X_s; X_j; Y) \} = I(X_j; Y)\, r_{X_j,Y} \max_{X_s \in S}\{ r_{X_s,X_j} \} \qquad (29)$$
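The linear quantities of Eqs. (25)–(29) reduce to correlation-weighted mutual information; a self-contained sketch (with the same histogram MI estimator idea as in Section 2.4, an assumption) might look as follows.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x, y, bins=10):
    # Histogram estimate of Eq. (20)
    dx = np.digitize(x, np.histogram_bin_edges(x, bins)[1:-1])
    dy = np.digitize(y, np.histogram_bin_edges(y, bins)[1:-1])
    joint = dx * (dy.max() + 1) + dy
    return entropy(dx) + entropy(dy) - entropy(joint)

def linear_mi(x, y):
    # Eqs. (25)-(26): I_L(X;Y) = I(X;Y) * r_{X,Y}
    r = np.corrcoef(x, y)[0, 1]
    return mutual_information(x, y) * r

def max_common_linear_redundancy(selected, xj, y):
    # Eq. (29): I(X_j;Y) * r_{X_j,Y} * max_{X_s in S} r_{X_s,X_j}
    max_r = max(np.corrcoef(xs, xj)[0, 1] for xs in selected)
    return linear_mi(xj, y) * max_r
```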
3.1.3. Normalized mutual information selection criteria based on Max-Linear-Relevance and Min-Linear-Redundancy

In this paper, the linear combination forecasting model is employed, and the linear information between $X_j$ and $Y$ can be measured according to Eq. (26):

$$LI = I(X_j; Y)\, r_{X_j,Y} \qquad (30)$$

Remark. According to the definition Eq. (20), the mutual information $I(X_j; Y)$ measures the amount of uncertainty in the dependent variable $Y$ that is reduced once the independent variable $X_j$ has been observed, and the correlation $r_{X_j,Y}$ measures the linear-relationship information ratio between the variables $X_j$ and $Y$. Suppose that the linear-relationship information between $X_j$ and $Y$ is uniformly distributed throughout the region of the mutual information $I(X_j; Y)$; then $LI = I(X_j; Y)\, r_{X_j,Y}$ measures the amount of linear information.

In a regression problem, we want to forecast the dependent variable $Y$ by selecting some suitable independent variables $X$. In the first step, the independent variable $X_i$ which has the largest $LI$ with the dependent variable $Y$ is selected for the target subset $S$ of independent variables. In the next step, the selection criterion is to determine how much linear information can be added with respect to the already existing $S$, so we should remove the common linear information between $S$, $X_j$ and the dependent variable $Y$ as defined in Eq. (29). According to the criterion of Max-Linear-Relevance, Eq. (30), and Min-Linear-Redundancy, Eq. (29), the goal function can be expressed as

$$f(X_j) = I(X_j; Y)\, r_{X_j,Y} - \alpha\, I_{ML}(X_s; X_j; Y) \qquad (31)$$

$$X^{(k)} = \arg\max_{X_j \in T_k} \left[f(X_j)\right] \qquad (32)$$

By the above normalization process, the final goal function can be expressed as

$$f(X^{(j)}) = \frac{I(X_j; Y)}{I(Y; Y)}\, r_{X_j,Y} - \alpha\, \frac{I_{ML}(X_s; X_j; Y)}{I(Y; Y)} = \frac{I(X_j; Y)}{I(Y; Y)}\, r_{X_j,Y} - \alpha\, \frac{I(X_j; Y)\, r_{X_j,Y} \max_{X_s \in S}\{ r_{X_s,X_j} \}}{I(Y; Y)} \qquad (33)$$

$$X^{(k)} = \arg\max_{X_j \in T_k} \left[f(X_j)\right] \qquad (34)$$

where $\alpha$ is a tolerance coefficient of Min-Linear-Redundancy ($0 \le \alpha \le 1$); it can be determined by using the encompassing method or approximated from the average correlation $(1/|S|)\sum_{X_s \in S} r_{X_s,X_j}$ relative to $\max_{X_s \in S}\{ r_{X_s,X_j} \}$. The model selection algorithm based on the above goal function is called redundancy removing linear mutual information model selection (RRLS).

Table 1
The parameter settings used in the SVR models for monthly electricity production (Time series A).

Model   Input dimension   C          ε         δ²
SVR-1   12                10         0.03048   2.263
SVR-2   15                487.28     0.0763    1.786
SVR-3   18                3.84e+06   0.0647    2.292
SVR-4   21                3511.68    0.0671    3.172
SVR-5   24                10         0.005     8.775

Table 2
MAPE, RMSE and MAE values from individual sub-models for monthly electricity production (Time series A).

Model   MAPE     RMSE       MAE
ARIMA   0.1883   550.6328   21.9357
ES      0.2103   731.7849   24.5535
SVR-1   0.1777   516.6630   20.9053
SVR-2   0.1637   479.5840   19.2882
SVR-3   0.1610   452.8304   18.5946
SVR-4   0.1994   616.3233   23.0724
SVR-5   0.1636   444.5839   18.9466
3.2. Framework of the proposed combination selection algorithm

In this study, forward selection is employed. The framework of the proposed combination selection algorithm, named RRLS, can be summarized as follows:

Step 1: Set the initial selected sub-models $M = \{\}$, the initial selected data set $S = \{\}$, and the training data $T = [F_1, F_2, \ldots, F_M]$, an $N \times M$ matrix, where $F_j = (f_{j1}, f_{j2}, \ldots, f_{jN})$ contains the outputs (forecasting values) of the individual model $f_j$ $(1 \le j \le M)$ on the training data set, $N$ is the size of the training data, and $M$ is the total number of individual models.

Step 2: Select the initial sub-model according to the selection criterion based on Max-Linear-Relevance and Max-Linear-Complementarity, Eq. (35) below:
Table 3
The monthly electricity production forecasting results (Time series A) using three combination forecasting models CM-A, CM-C-MI and CM-RRLS for different linear combination methods.
Individual model IDs: 1: ARIMA; 2: ES; 3: SVR-1 (input dimension 12); 4: SVR-2 (15); 5: SVR-3 (18); 6: SVR-4 (21); 7: SVR-5 (24).
Optimal sub-models: CM-A [1, 2, 3, 4, 5, 6, 7]; CM-C-MI [1, 3, 4, 5, 6, 7]; CM-RRLS [6, 5, 3, 7, 4].

Combination method   CM-A: MAPE / RMSE / MAE         CM-C-MI: MAPE / RMSE / MAE      CM-RRLS: MAPE / RMSE / MAE
VACO                 0.1568 / 398.1608 / 18.2944     0.1524 / 382.9312 / 17.7799     0.1501 / 375.8260 / 17.5250
DMSFE(β=0.95)        0.1571 / 400.0124 / 18.3372     0.1525 / 383.8237 / 17.7944     0.1501 / 376.3259 / 17.5256
DMSFE(β=0.9)         0.1567 / 399.7325 / 18.2809     0.1521 / 384.0551 / 17.7378     0.1498 / 376.4697 / 17.4851
DMSFE(β=0.85)        0.1559 / 398.6981 / 18.1797     0.1514 / 384.1910 / 17.6491     0.1495 / 376.9052 / 17.4389
DMSFE(β=0.8)         0.1553 / 397.9672 / 18.1003     0.1509 / 384.5509 / 17.5830     0.1492 / 377.7943 / 17.3927
$$f_0(F_j) = \frac{I(F_j; Y)}{I(Y; Y)}\, r_{F_j,Y} \left[ 1 - \gamma\, \frac{1}{|S|} \sum_{F_i \in T - \{F_j\}} r_{F_j,F_i} \right] \qquad (35)$$

where $\gamma$ is a Max-Linear-Complementarity adjustment coefficient, set to 0.1 in this study. The initial sub-model is then selected according to the maximum function value among all $F_j$:

$$F^{(1)} = \arg\max_{F_j \in T} \left[f_0(F_j)\right] \qquad (36)$$

Step 3: Update the data sets: $M = M \cup \{f^{(1)}\}$, $S = S \cup \{F^{(1)}\}$, $T = T - \{F^{(1)}\}$.

Step 4: Calculate the goal function of RRLS between $S \cup F_j$ and the actual outputs $Y$ for each $F_j$ $(F_j \in T)$:

$$f(F_j) = \frac{I(F_j; Y)}{I(Y; Y)}\, r_{F_j,Y} - \alpha\, \frac{I(F_j; Y)}{I(Y; Y)}\, r_{F_j,Y} \max_{F_s \in S}\{ r_{F_s,F_j} \} \qquad (37)$$

Find the maximum function value among all $F_j$:

$$F^{(k)} = \arg\max_{F_j \in T} \left[f(F_j)\right] \qquad (38)$$

Step 5: Update the data sets: $M = M \cup \{f^{(k)}\}$, $S = S \cup \{F^{(k)}\}$, $T = T - \{F^{(k)}\}$.

Step 6: Calculate the mean absolute percentage error (MAPE) of the combination forecasting model for the data set $S$.

Step 7: Repeat Steps 4–6 until the training data $T$ becomes an empty set, that is, $T = \emptyset$.

Step 8: Output the optimal subset of individual models from all the available models by using the forecasting error measurement value MAPE, together with the corresponding combination forecasting model.

The combination model selected by the above combination selection algorithm is called CM-RRLS.
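A compact Python sketch of Steps 1–8 follows; the MI estimator, the fixed $\alpha$, and the simple-average `combo_mape` stand-in for the Eq. (6)–(8) weighting schemes are illustrative assumptions (the paper sets $\alpha$ by the encompassing method and evaluates the VACO and DMSFE schemes).

```python
import numpy as np

def rrls_select(forecasts, y, mi, alpha=0.5, gamma=0.1):
    """forecasts: list of length-N arrays F_j; y: actual outputs;
    mi: a mutual-information estimator such as mutual_information above."""
    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    hy = mi(y, y)                                 # I(Y;Y) = H(Y)
    T = list(range(len(forecasts)))

    def f0(j):                                    # Step 2 / Eq. (35)
        others = [i for i in T if i != j]
        avg_r = np.mean([corr(forecasts[j], forecasts[i]) for i in others])
        return mi(forecasts[j], y) / hy * corr(forecasts[j], y) * (1 - gamma * avg_r)

    first = max(T, key=f0)                        # Eq. (36)
    S, T = [first], [i for i in T if i != first]  # Step 3
    history = []
    while T:                                      # Steps 4-7
        def f(j):                                 # Eq. (37)
            rel = mi(forecasts[j], y) / hy * corr(forecasts[j], y)
            red = max(corr(forecasts[s], forecasts[j]) for s in S)
            return rel - alpha * rel * red
        best = max(T, key=f)                      # Eq. (38)
        S.append(best)                            # Step 5
        T.remove(best)
        history.append((list(S), combo_mape(S, forecasts, y)))  # Step 6
    return min(history, key=lambda h: h[1])[0]    # Step 8: lowest-MAPE subset

def combo_mape(S, forecasts, y):
    # Simple-average stand-in for the Eq. (6)-(8) weightings (assumption)
    combined = np.mean([forecasts[s] for s in S], axis=0)
    return np.mean(np.abs((combined - y) / y)) * 100
```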
4. Experiments

To illustrate the proposed algorithm for sub-models selection, and to demonstrate the performance of the proposed combination forecasting model, we consider time series forecasts from DataMarket [36].
Table 4
The parameter settings used in the SVR models for monthly milk production (Time series B).

Model   Input dimension   C            ε         δ²
SVR-1   12                6.0463e+05   0.005     8.582
SVR-2   15                4076.06      0.01198   13.382
SVR-3   18                1618.34      0.005     4.159
SVR-4   21                3390.87      0.005     16.792
SVR-5   24                544.43       0.005     9.1195
Table 5
MAPE, RMSE and MAE values from individual sub-models for monthly milk production (Time series B).

Model   MAPE     RMSE      MAE
ARIMA   0.2136   44.8508   6.2551
ES      0.2191   49.0633   6.4388
SVR-1   0.1812   33.1054   5.2438
SVR-2   0.1106   12.1566   3.2275
SVR-3   0.1314   17.4518   3.8160
SVR-4   0.1022   10.8408   2.9957
SVR-5   0.1762   29.6738   5.1181
4.1. Data and forecasting models description

In this experiment, three time series with different structures are used to explore the effectiveness and superiority of the proposed combination forecasting model.

Time series A (276 observations): monthly electricity production in Australia from January 1972 to August 1995, in million kilowatt hours, taken from DataMarket. The graph of this time series is given in Fig. 2. The 72 observations between August 1989 and August 1995 are chosen as the test set.

Time series B (168 observations): monthly milk production from January 1962 to December 1975, in pounds per cow, taken from DataMarket. The graph of this time series is given in Fig. 2. The 72 observations between December 1969 and December 1975 are chosen as the test set.

Time series C (252 observations): monthly economic indicators (New Residential Construction) in the Midwest from May 1982 to April 2003, in thousands of units, taken from DataMarket. The graph of this time series is given in Fig. 2. The 72 observations between May 1997 and April 2003 are chosen as the test set.

There are seven forecasting models in total in this study: five SVRs with 12, 15, 18, 21 and 24 input dimensions (SVR1–SVR5), the Exponential Smoothing model (ES), and the ARIMA model. For comparison with the proposed combination model, the combination model built from all the individual models (CM-A) and the mutual information based combination model proposed by Cang and Yu [21] (CM-C-MI) are given. The seven forecasting models are constructed on the data from January 1972 to July 1989 (training data for time series A),
Table 6
The monthly milk production forecasting results (Time series B) using three combination forecasting models CM-A, CM-C-MI and CM-RRLS for different linear combination methods.
Individual model IDs: 1: ARIMA; 2: ES; 3: SVR-1 (input dimension 12); 4: SVR-2 (15); 5: SVR-3 (18); 6: SVR-4 (21); 7: SVR-5 (24).
Optimal sub-models: CM-A [1, 2, 3, 4, 5, 6, 7]; CM-C-MI [2, 3, 6, 7, 1, 4, 5]; CM-RRLS [6, 2, 5].

Combination method   CM-A: MAPE / RMSE / MAE      CM-C-MI: MAPE / RMSE / MAE   CM-RRLS: MAPE / RMSE / MAE
VACO                 0.1138 / 13.6538 / 3.3279    0.1138 / 13.6538 / 3.3279    0.1005 / 11.2167 / 2.9546
DMSFE(β=0.95)        0.1126 / 13.3656 / 3.2951    0.1126 / 13.3656 / 3.2951    0.1005 / 11.2065 / 2.9547
DMSFE(β=0.9)         0.1124 / 13.3135 / 3.2887    0.1124 / 13.3135 / 3.2887    0.1005 / 11.1989 / 2.9548
DMSFE(β=0.85)        0.1126 / 13.3784 / 3.2944    0.1126 / 13.3784 / 3.2944    0.1005 / 11.2330 / 2.9548
DMSFE(β=0.8)         0.1127 / 13.4165 / 3.2974    0.1127 / 13.4165 / 3.2974    0.1005 / 11.2849 / 2.9573
January 1962 to November 1969 (training data for time series B) and May 1982 to April 1997 (training data for time series C). For the test set, the first 48 forecasting values (training data for combination selection algorithm) from all the individual forecasting models are employed to select the optimal sub-models using the CM-C-MI and CM-RRLS algorithms. Then, the periods from August 1993 to August 1995 (test data for time series A), from December 1973 to December 1975 (test data for time series B) and from April 2001 to April 2003 (test data for time series C) are employed to test the obtained optimal sub-models.
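The evaluation protocol just described (72 held-out points, the first 48 of which train the combination selection) can be sketched as index bookkeeping; how the individual forecasts are produced is assumed from Section 2.3.

```python
import numpy as np

def split_protocol(series, n_test=72, n_select=48):
    # Hold out the last n_test points; within them, the first n_select
    # forecasts train the combination selection, the rest test it.
    train = series[:-n_test]
    select_idx = np.arange(len(series) - n_test, len(series) - n_test + n_select)
    test_idx = np.arange(len(series) - n_test + n_select, len(series))
    return train, select_idx, test_idx

# Each individual model is fit on `train`; its forecasts over select_idx
# feed the RRLS selection of Section 3.2, and the forecasts over test_idx
# evaluate the chosen optimal subset.
```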
4.2. Empirical results and analysis

In this subsection, the optimal subsets of the seven models are selected by applying the RRLS algorithm. The software MATLAB 6.5 is used for all the models [37]. All the cases have been run on a PC with 1 GB of RAM and a 2.01-GHz processor. For the three data sets, three performance measures are employed to investigate the forecasting ability of the proposed model.

Table 7
The parameter settings used in the SVR models for monthly economic indicators (Time series C).

Model   Input dimension   C        ε          δ²
SVR-1   12                475.35   0.12256    11.2587
SVR-2   15                3523     0.15945    10.9558
SVR-3   18                108.9    0.088389   10.0691
SVR-4   21                2056.2   0.028498   18.5701
SVR-5   24                1969.3   0.19984    24
Table 8
MAPE, RMSE and MAE values from individual sub-models for monthly economic indicators (Time series C).

Model   MAPE     RMSE      MAE
ARIMA   0.1378   4.9124    1.9812
ES      0.1429   5.5461    2.0555
SVR-1   0.2134   10.3      3.0637
SVR-2   0.335    23.6489   4.8103
SVR-3   0.2047   9.7371    2.9489
SVR-4   0.1426   4.9635    2.0444
SVR-5   0.1584   6.5739    2.2826
4.2.1. Monthly electricity production

The parameters of ES and ARIMA are set by using the encompassing test. As discussed in Section 2.3.1, the widely used Gaussian kernel function is adopted in this paper because it performs well in most forecasting cases. However, setting the three parameters $(C, \varepsilon, \delta)$ of the SVR models is a difficult problem. The usual setting method is based on cross-validation or expert knowledge, but this is a time-consuming process for large sample data sets. To reduce the computational complexity, the method in [26] is employed to select the parameters of the five SVR models; the values for the monthly electricity production case are shown in Table 1. For the combination model given in Eq. (5), the best weight values are computed by using Eq. (6).

By implementing the MATLAB code of each sub-model, the forecasting results of the five SVRs with 12, 15, 18, 21 and 24 input dimensions (SVR1–SVR5), the Exponential Smoothing model (ES) and the ARIMA model are obtained. As shown in Table 2, the SVR with 18 input dimensions (SVR-3) has the smallest MAPE, RMSE and MAE values, which means that its forecasts are closest to the actual series values.

The MAPE, RMSE and MAE values of the optimal subsets for the test data set are presented in Table 3. For comparison, the forecast results obtained from the combination model built from all the individual sub-models (CM-A) and the combination model proposed by Cang and Yu [21] (CM-C-MI) are also given in Table 3. In Table 3, the first column presents the different error-based combination schemes, and the remaining columns present the three evaluation measures (MAPE, RMSE and MAE) of the three combination selection methods on the monthly electricity production time series. $\mathrm{MAPE}_{\max} = 0.1571$ shows that all the models chosen for this forecasting task are reasonable. It is clearly seen from Table 3 that the proposed RRLS algorithm produces better combined forecasts than the other methods: when the proposed method is used for the combination, the lowest MAPE, RMSE and MAE values are obtained. All three combination models perform better than the best sub-model, as shown in Table 3, which indicates that combination is an effective way to improve on the individual sub-models.
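For readers without the optimal-training-subset method of [26], the cross-validation route mentioned above is the usual fallback for setting $(C, \varepsilon, \delta)$; this scikit-learn sketch is an assumption-laden illustration, with grid values chosen arbitrarily.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

def tune_svr(X, y):
    # Cross-validated grid search over (C, epsilon, gamma), gamma = 1/(2*delta^2)
    grid = {"C": [1, 10, 100, 1000],
            "epsilon": [0.005, 0.01, 0.05, 0.1],
            "gamma": [0.01, 0.05, 0.1, 0.5]}
    search = GridSearchCV(SVR(kernel="rbf"), grid,
                          cv=TimeSeriesSplit(n_splits=5),
                          scoring="neg_mean_absolute_error")
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```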
4.2.2. Monthly milk production

In this subsection, we examine the monthly milk production time series. As described in Section 4.2.1, the same modeling process is employed. The parameter settings of the SVR models are determined by using the method in [26], and the results are shown in Table 4.
Table 9
The monthly economic indicators forecasting results (Time series C) using three combination forecasting models CM-A, CM-C-MI and CM-RRLS for different linear combination methods.
Individual model IDs: 1: ARIMA; 2: ES; 3: SVR-1 (input dimension 12); 4: SVR-2 (15); 5: SVR-3 (18); 6: SVR-4 (21); 7: SVR-5 (24).
Optimal sub-models: CM-A [1, 2, 3, 4, 5, 6, 7]; CM-C-MI [6, 5]; CM-RRLS [6, 7].

Combination method   CM-A: MAPE / RMSE / MAE     CM-C-MI: MAPE / RMSE / MAE   CM-RRLS: MAPE / RMSE / MAE
VACO                 0.133  / 5.074  / 1.922     0.1314 / 4.7804 / 1.8959     0.13   / 4.5676 / 1.8716
DMSFE(β=0.95)        0.1321 / 5.0234 / 1.9103    0.1318 / 4.7514 / 1.9008     0.1288 / 4.6119 / 1.8559
DMSFE(β=0.9)         0.1326 / 5.0589 / 1.9176    0.1315 / 4.7769 / 1.8965     0.1286 / 4.682  / 1.8542
DMSFE(β=0.85)        0.1336 / 5.1182 / 1.931     0.131  / 4.8244 / 1.8907     0.1292 / 4.7572 / 1.8635
DMSFE(β=0.8)         0.1345 / 5.1675 / 1.9441    0.1312 / 4.8695 / 1.8932     0.1302 / 4.8186 / 1.8785
Table 10
Best forecasting error measurement values for all possible combinations of k individual models under different linear combination methods (Time series A). For each k, the number of possible combinations is given in parentheses, together with the best subset.

k = 1 (7 combinations), best subset [5]: MAPE 0.1610, RMSE 452.8304, MAE 18.5946.

k = 2 (21 combinations), best subset [4, 7]:
Combined model   MAPE     RMSE       MAE
VACO             0.1488   393.4518   17.3606
DMSFE(β=0.95)    0.1489   394.2066   17.3859
DMSFE(β=0.9)     0.1489   394.1377   17.3838
DMSFE(β=0.85)    0.1487   393.3787   17.3579
DMSFE(β=0.8)     0.1487   392.4755   17.3366

k = 3 (35 combinations), best subset [4, 5, 7] ([3, 5, 7] for DMSFE(β=0.8)):
Combined model   MAPE     RMSE       MAE
VACO             0.1498   396.7430   17.4241
DMSFE(β=0.95)    0.1496   396.6794   17.4037
DMSFE(β=0.9)     0.1496   396.6730   17.4041
DMSFE(β=0.85)    0.1499   396.7670   17.4291
DMSFE(β=0.8)     0.1494   372.1263   17.4076

k = 4 (35 combinations), best subset [3, 5, 6, 7]:
Combined model   MAPE     RMSE       MAE
VACO             0.1504   372.4764   17.5210
DMSFE(β=0.95)    0.1504   372.5244   17.5216
DMSFE(β=0.9)     0.1501   374.1907   17.4785
DMSFE(β=0.85)    0.1502   377.7813   17.4698
DMSFE(β=0.8)     0.1492   376.4011   17.4245

k = 5 (21 combinations), best subset [3, 4, 5, 6, 7]:
Combined model   MAPE     RMSE       MAE
VACO             0.1501   375.8260   17.5250
DMSFE(β=0.95)    0.1501   376.3259   17.5256
DMSFE(β=0.9)     0.1498   376.4697   17.4851
DMSFE(β=0.85)    0.1495   376.9052   17.4389
DMSFE(β=0.8)     0.1492   377.7943   17.3927

k = 6 (7 combinations), best subset [1, 3, 4, 5, 6, 7]:
Combined model   MAPE     RMSE       MAE
VACO             0.1524   382.9312   17.7799
DMSFE(β=0.95)    0.1525   383.8237   17.7944
DMSFE(β=0.9)     0.1521   384.0551   17.7378
DMSFE(β=0.85)    0.1514   384.1910   17.6491
DMSFE(β=0.8)     0.1509   384.5509   17.5830
Table 11
Best forecasting error measurement values for all possible combinations of k individual models under different linear combination methods (Time series B). For each k, the number of possible combinations is given in parentheses, together with the best subset.

k = 1 (7 combinations), best subset [6]: MAPE 0.1022, RMSE 10.8408, MAE 2.9957.

k = 2 (21 combinations), best subset [4, 6]:
Combined model   MAPE     RMSE      MAE
VACO             0.1005   11.2167   2.9546
DMSFE(β=0.95)    0.1005   11.2065   2.9547
DMSFE(β=0.9)     0.1008   10.728    2.9507
DMSFE(β=0.85)    0.1008   10.7295   2.9512
DMSFE(β=0.8)     0.1009   10.7406   2.9545

k = 3 (35 combinations), best subset [2, 5, 6]:
Combined model   MAPE     RMSE      MAE
VACO             0.1005   11.2167   2.9546
DMSFE(β=0.95)    0.1005   11.2065   2.9547
DMSFE(β=0.9)     0.1005   11.1989   2.9548
DMSFE(β=0.85)    0.1005   11.2330   2.9548
DMSFE(β=0.8)     0.1005   11.2849   2.9573

k = 4 (35 combinations), best subset [2, 4, 5, 6]:
Combined model   MAPE     RMSE      MAE
VACO             0.1031   11.1275   3.0184
DMSFE(β=0.95)    0.103    11.1226   3.0179
DMSFE(β=0.9)     0.103    11.1164   3.0168
DMSFE(β=0.85)    0.103    11.1226   3.017
DMSFE(β=0.8)     0.103    11.1352   3.0179

k = 5 (21 combinations), best subset [1, 2, 4, 5, 6]:
Combined model   MAPE     RMSE      MAE
VACO             0.1041   11.4431   3.0534
DMSFE(β=0.95)    0.1041   11.4253   3.0513
DMSFE(β=0.9)     0.104    11.4162   3.05
DMSFE(β=0.85)    0.104    11.4452   3.0516
DMSFE(β=0.8)     0.1042   11.4877   3.0559

k = 6 (7 combinations), best subset [1, 2, 4, 5, 6, 7]:
Combined model   MAPE     RMSE      MAE
VACO             0.1090   12.562    3.1916
DMSFE(β=0.95)    0.1083   12.4115   3.1722
DMSFE(β=0.9)     0.1083   12.4054   3.1705
DMSFE(β=0.85)    0.1085   12.3833   3.1783
DMSFE(β=0.8)     0.1085   12.3802   3.1782
Table 12
Best forecasting error measurement values for all possible combinations of k individual models under different linear combination methods (Time series C). For each k, the number of possible combinations is given in parentheses, together with the best subset.

k = 1 (7 combinations), best subset [1]: MAPE 0.1378, RMSE 4.9124, MAE 1.9812.

k = 2 (21 combinations), best subset [6, 7]:
Combined method   MAPE     RMSE     MAE
VACO              0.13     4.5676   1.8716
DMSFE(β=0.95)     0.1288   4.6119   1.8559
DMSFE(β=0.9)      0.1286   4.682    1.8542
DMSFE(β=0.85)     0.1292   4.7572   1.8635
DMSFE(β=0.8)      0.1302   4.8186   1.8785

k = 3 (35 combinations), best subset [1, 6, 7]:
Combined method   MAPE     RMSE     MAE
VACO              0.1295   4.3899   1.8654
DMSFE(β=0.95)     0.1281   4.4064   1.8465
DMSFE(β=0.9)      0.1267   4.4376   1.8266
DMSFE(β=0.85)     0.1259   4.4745   1.8158
DMSFE(β=0.8)      0.1254   4.5058   1.8099

k = 4 (35 combinations), best subset [1, 4, 6, 7]:
Combined method   MAPE     RMSE     MAE
VACO              0.1277   4.5      1.843
DMSFE(β=0.95)     0.1271   4.5104   1.8333
DMSFE(β=0.9)      0.1266   4.5507   1.8265
DMSFE(β=0.85)     0.1265   4.5983   1.8267
DMSFE(β=0.8)      0.1268   4.5673   1.8301

k = 5 (21 combinations), best subset [1, 2, 4, 6, 7]:
Combined method   MAPE     RMSE     MAE
VACO              0.1288   4.5956   1.8577
DMSFE(β=0.95)     0.1281   4.5916   1.8487
DMSFE(β=0.9)      0.1276   4.6084   1.842
DMSFE(β=0.85)     0.1273   4.6327   1.8378
DMSFE(β=0.8)      0.1271   4.6538   1.8349

k = 6 (7 combinations), best subset [1, 2, 4, 5, 6, 7]:
Combined method   MAPE     RMSE     MAE
VACO              0.1295   4.7493   1.8709
DMSFE(β=0.95)     0.1289   4.7266   1.862
DMSFE(β=0.9)      0.1288   4.754    1.861
DMSFE(β=0.85)     0.1289   4.7953   1.8626
DMSFE(β=0.8)      0.1294   4.8305   1.8697
To compare the forecasting ability vividly, the seven sub-models for this monthly milk production series are implemented by running the corresponding MATLAB procedures. Table 5 provides the forecasting results of these seven sub-models. To demonstrate the effectiveness of the proposed model, the performance of the proposed CM-RRLS model is compared to the best sub-model, the CM-A model and the CM-C-MI model. The MAPE, RMSE and MAE values of the above models are shown in Table 6. It can be observed from Table 6 that the proposed CM-RRLS model obtains lower forecasting errors and outperforms the comparison models in this monthly milk production case.

4.2.3. Monthly economic indicators

To explore the effectiveness and superiority of the proposed combination forecasting model, we deliberately chose the monthly economic indicators series, which exhibits a changing trend and seasonality, for this subsection. As described in Section 4.2.1, the same modeling process is employed. The parameter settings of the SVR models are determined by using the method in [26], and the results are shown in Table 7. To compare the forecasting ability vividly, the seven sub-models for the monthly economic indicators are implemented by running the corresponding MATLAB procedures. Table 8 provides the forecasting results of these seven sub-models. To demonstrate the effectiveness of the proposed model, the performance of the proposed CM-RRLS model is compared to the best sub-model, the CM-A model and the CM-C-MI model. The MAPE, RMSE and MAE values of the above models are shown in Table 9. It can be observed from Table 9 that the proposed CM-RRLS model obtains lower forecasting errors and outperforms the comparison models in this monthly economic indicators case.

4.2.4. Comparison of the three combination models with all the possible combination models

In order to validate the above optimal sub-models selection approaches, the best subset selection results (MAPE, RMSE and MAE values) for all possible combinations of k individual sub-models (1 ≤ k ≤ 6) are compared. The MAPE, RMSE and MAE values of the combinations with the best MAPE among all possible combinations of k individual sub-models are shown in Tables 10, 11 and 12, respectively. Although some best MAPEs of k individual sub-models are smaller than the MAPE of CM-RRLS, these best subsets are always inconsistent across the two parts of the test set: in Table 12, the best subset of 3 (or 4) individual sub-models is [2, 6, 7] (or [1, 2, 6, 7]) for the first part of the test set, while the MAPE of the combination forecasting model [2, 6, 7] (or [1, 2, 6, 7]) is 0.1304 (or 0.1301) for the second part of the test set. A similar phenomenon can be checked for many best subsets, which indicates that the best subset has a characteristic of randomness. From the above observation, we can conclude that such an inconsistent best subset is unstable and may be generated by random factors.

As shown in Tables 10–12, the five linear combination methods have similar performance in this study, so we use the mean of the MAPE values over all five linear combination methods as the criterion in the case study. For time series A and C, the three combination models (CM-A, CM-C-MI and CM-RRLS) and the best subset selection results (MAPE, RMSE and MAE values) for all possible combinations of k individual models (2 ≤ k ≤ 6) outperform the best individual models. For time series B, only the CM-RRLS combination model and the best subset selection results for all possible combinations of k individual models (k = 2, 3) perform better than the best individual models.
Therefore, the percentage of optimal subsets outperforming the best individual models is 79.2% (19 out of 24 cases) for MAPE, RMSE and MAE error measurements.
The proposed method is applied to DataMarket time series and the results are compared to other forecast combination methods available in the literature. The implementation shows that the proposed forecast combination approach produces better forecasts than those produced by other methods. Specifically, the RRLS and Cang's mutual information based selection methods provide theoretical approaches that avoid iteratively running all possible combinations of individual models. In the above empirical results, the proposed RRLS algorithm outperformed Cang's mutual information based selection algorithm. The possible reason is that Cang's selection criterion, the accumulated MI value for the combined model, contains nonlinear-relationship or redundancy information among the selected optimal subset, which reduces the performance of the combination forecasting model; the RRLS algorithm enhances Cang's mutual information based selection algorithm by using the novel Max-Linear-Relevance and Min-Linear-Redundancy based selection criterion proposed in this paper.

Several observations can be made from the above results. Firstly, an individual forecasting model is not always best in all cases: the SVR models obtained the best performance for time series A and B, while the ARIMA model obtained the best performance for time series C. Secondly, we can clearly observe that the MAPE values for the proposed combination model CM-RRLS, presented in Table 3 for time series A, Table 6 for time series B and Table 9 for time series C, are evidently smaller than those of the corresponding best sub-model, which indicates that forecast combination is an effective way to establish better forecast performance in practical forecasting. Thirdly, for time series A, B and C, the forecasting performance of the CM-C-MI and CM-RRLS models is better than that of the CM-A model, as shown in this experiment, because a suitable sub-models combination selected by information theory generates better forecasting results. Furthermore, as shown in Table 3, the average MAPE of CM-RRLS for time series A is 0.14974, while it is 0.15636 and 0.15186 for CM-A and CM-C-MI, respectively. Similar results for time series B and C demonstrate again that the proposed CM-RRLS model can identify more combination selection information in combination forecasting.
5. Conclusion

In this paper, a novel Max-Linear-Relevance and Min-Linear-Redundancy based selection algorithm, named redundancy removing linear mutual information model selection (RRLS), is proposed for the linear combination selection problem. Cang's selection criterion, the accumulated MI value for the combined model, may contain nonlinear-relationship or redundancy information among the selected optimal subset, which reduces the performance of the combination forecasting model. The RRLS algorithm therefore enhances Cang's mutual information based selection algorithm by using the novel Max-Linear-Relevance and Min-Linear-Redundancy based selection criterion, which is confirmed by forecasting experiments on DataMarket data sets. In comparison with the best sub-model and the CM-A model, the forecasting accuracy of CM-RRLS is also favorable. However, it is fair to say that much remains to be done on nonlinear combination model construction and the nonlinear combination selection problem.
Acknowledgments

The author thanks the editors and the anonymous reviewers for helpful comments and suggestions. The research was supported
by the National Natural Science Foundation of China (Grant no. 71301067) and the Natural Science Foundation of JiangXi Province (Grant no. 20142BAB217015).

References

[1] J.P. Donate, P. Cortez, G.G. Sanchez, A.S. Miguel, Time series forecasting using a weighted cross-validation evolutionary artificial neural network ensemble, Neurocomputing 109 (3) (2013) 27–32.
[2] C. Ren, N. An, J. Wang, L. Li, B. Hu, D. Shang, Optimal parameters selection for BP neural network based on particle swarm optimization: a case study of wind speed forecasting, Knowl. Based Syst. 56 (2014) 226–239.
[3] R. Palivonaite, M. Ragulskis, Short-term time series algebraic forecasting with internal smoothing, Neurocomputing 127 (15) (2014) 161–171.
[4] J. Shao, Application of an artificial neural network to improve short-term road ice forecasts, Expert Syst. Appl. 14 (4) (1998) 471–482.
[5] G.P. Zhang, Time series forecasting using a hybrid ARIMA and neural network model, Neurocomputing 50 (2003) 159–175.
[6] Y.S. Lee, L.I. Tong, Forecasting time series using a methodology based on autoregressive integrated moving average and genetic programming, Knowl. Based Syst. 24 (1) (2011) 66–72.
[7] R.C. Tsaur, A new piecewise fuzzy exponential smoothing model based on some change-points, Expert Syst. Appl. 38 (6) (2011) 7616–7621.
[8] J.W. Taylor, R.D. Snyder, Forecasting intraday time series with multiple seasonal cycles using parsimonious seasonal exponential smoothing, Omega, Int. J. Manag. Sci. 40 (6) (2012) 748–757.
[9] D. Srinivasan, Evolving artificial neural networks for short term load forecasting, Neurocomputing 23 (1–3) (1998) 265–276.
[10] W. Shen, X. Guo, C. Wu, D. Wu, Forecasting stock indices using radial basis function neural networks optimized by artificial fish swarm algorithm, Knowl. Based Syst. 24 (3) (2011) 378–385.
[11] W.C. Hong, Traffic flow forecasting by seasonal SVR with chaotic simulated annealing algorithm, Neurocomputing 74 (12–13) (2011) 2096–2107.
[12] C.J. Lu, T.S. Lee, C.C. Chiu, Financial time series forecasting using independent component analysis and support vector regression, Dec. Support Syst. 47 (2) (2009) 115–125.
[13] L.J. Kao, C.C. Chiu, C.J. Lu, J.L. Yang, Integration of nonlinear independent component analysis and support vector regression for stock price forecasting, Neurocomputing 99 (2013) 534–542.
[14] J.M. Bates, C.W.J. Granger, The combination of forecasts, Oper. Res. Q. 20 (1969) 451–468.
[15] Z. Guo, J. Wu, H. Lu, J. Wang, A case study on a hybrid wind speed forecasting method using BP neural network, Knowl. Based Syst. 24 (7) (2011) 1048–1056.
[16] M. Theodosiou, Disaggregation and aggregation of time series components: a hybrid forecasting approach using generalized regression neural networks and the theta method, Neurocomputing 74 (6) (2011) 896–905.
[17] J. Wang, S. Zhu, W. Zhang, H. Lu, Combined modeling for electric load forecasting with adaptive particle swarm optimization, Energy 35 (2010) 1671–1678.
[18] M. Costantini, C. Pappalardo, Hierarchical procedure for the combination of forecasts, Int. J. Forecast. 26 (2010) 725–743.
[19] T. Kisinbay, The use of encompassing tests for forecast combinations, J. Forecast. 29 (2010) 715–727.
[20] P.H. Franses, Model selection for forecast combination, Appl. Econ. 43 (2011) 1721–1727.
[21] S. Cang, H. Yu, A combination selection algorithm on forecasting, Eur. J. Oper. Res. 234 (1) (2014) 127–139.
[22] F. Diebold, J. Lopez, Forecast evaluation and combination, in: G.S. Maddala, C.R. Rao (Eds.), Handbook of Statistics, Elsevier, Amsterdam, 1996.
[23] C.H. Aladag, E. Egrioglu, U. Yolcu, Forecast combination by using artificial neural networks, Neural Process. Lett. 32 (3) (2010) 269–276.
[24] R. Adhikari, R.K. Agrawal, Performance evaluation of weights selection schemes for linear combination of multiple forecasts, Artif. Intell. Rev. (2012), http://dx.doi.org/10.1007/s10462-012-9361-z.
[25] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[26] J.X. Che, Support vector regression based on optimal training subset and adaptive particle swarm optimization algorithm, Appl. Soft Comput. 13 (8) (2013) 3473–3481.
[27] C.C. Holt, Forecasting seasonals and trends by exponentially weighted moving averages, Int. J. Forecast. 20 (1) (2004) 5–10 (reprint of the 1957 memorandum).
[28] G.E.P. Box, G.M. Jenkins, Time Series Analysis: Forecasting and Control, John Wiley and Sons, Inc., New York, NY, USA, 1976.
[29] C. Shannon, W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, IL, USA, 1963.
[30] Y. Zheng, C.K. Kwoh, A feature subset selection method based on high-dimensional mutual information, Entropy 13 (4) (2011) 860–901.
[31] N. Kwak, C.H. Choi, Input feature selection for classification problems, IEEE Trans. Neural Netw. 13 (1) (2002) 143–159.
[32] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 1226–1238.
[33] P.A. Estevez, M. Tesmer, C.A. Perez, J.M. Zurada, Normalized mutual information feature selection, IEEE Trans. Neural Netw. 20 (2009) 189–201.
[34] L.T. Vinh, N.D. Thang, Y.K. Lee, An improved maximum relevance and minimum redundancy feature selection algorithm based on normalized mutual information, in: 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet (SAINT), 2010, pp. 395–398.
[35] T.M. Cover, J.A. Thomas, Elements of Information Theory, John Wiley and Sons, Inc., New York, NY, USA, 1991.
[36] DataMarket. 〈http://datamarket.com/〉.
[37] About MATLAB [Online]. Available: 〈http://www.mathworks.com/〉.
Jinxing Che received the bachelor's degree in mathematics and applied mathematics from the Jiujiang University, China, in 2007, and the master's degree in applied mathematics from the Lanzhou University, China, in 2010. Currently, he is a lecturer at the College of Science, NanChang Institute of Technology. He is the corresponding author of 6 scientific papers in international journals. His current research interests are in forecasting theory and its application, mathematics statistics, data mining and support vector machines.