Available online at www.sciencedirect.com
International Journal of Forecasting 24 (2008) 679 – 693 www.elsevier.com/locate/ijforecast
Adaptive combination of forecasts with application to wind energy Ismael Sánchez Universidad Carlos III de Madrid, Avd. de la Universidad 30, 28911, Leganés, Madrid Spain
Abstract This article proposes an adaptive forecast combination procedure, denoted as AEC, that tends to be similar to the use of the best available predictor in a time varying environment. In addition, a two-step procedure is proposed to allow the use of alternative combination procedures. In the first step, different combination procedures are used, the AEC among others. In the second step, the AEC is used to combine the combinations from the first step. The proposed procedures are applied to two wind farms where alternative forecasts were available, showing the advantage of the proposed method. © 2008 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved. Keywords: Adaptive forgetting factors; Dynamic models; Combination of forecasts; Recursive least squares; Wind energy forecasting
1. Introduction 1.1. General considerations Many practitioners need efficient predictions for the appropriate management of their resources. The wind energy industry is an example of an industrial sector where accurate predictions are a necessity. As opposed to traditional sources of energy, wind energy can not be dispatched, and hence its management has to rely on predictions. For utilities, the main use of wind energy predictions is for market participation, both in the dayahead electricity market and in the available intraday markets. Also, the System Operator (SO) of the electrical network needs accurate predictions to optimize the network, especially in those countries with high
E-mail address:
[email protected].
wind power penetration, such as Denmark, Germany, and Spain. In many practical situations, a set of alternative predictions are available, and a final single forecast that takes advantage of the available predictions has to be produced. Ideally, this final single forecast should be better than the individual ones, or at least similar to always using the best available prediction. In the wind energy industry, the availability of alternative predictions is rather common. Some utilities use predictions from alternative prediction agencies when they feel that there is not a superior agency. Also, the prediction agencies themselves can use alternative procedures to obtain predictions for their clients. For instance, they can have alternative statistical procedures or use alternative meteorological information. Then, in order to supply a single forecast to their clients, they have to combine all that information. Similarly, the SO can have access to different forecasts for a given wind farm
0169-2070/$ - see front matter © 2008 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved. doi:10.1016/j.ijforecast.2008.08.008
680
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
or set of wind farms. For instance, the SO could have its own predictions as well as the predictions made by the utilities, reported to the SO for market participation. The necessity for an efficient forecast combination procedure for wind power prediction is thus apparent. In order to be useful for wind power forecasting, a forecast combination procedure has to be very flexible. This flexibility is necessary for two main reasons. First, the combination procedure will be used without user intervention. Therefore, learning and adaptation should happen on-line without explicit training. It should be noted that the most common application of wind power forecasting is in the form of prediction tools that operate on-line using some kind of automatic modelling (Sánchez, 2006a,b). The motivation for such automatic modelling is that the number of wind farms that a forecaster has to manage can be very high, and each one of very small size. For instance, there are hundreds of wind farms in Spain, with an average size of only 40 MW. Also, the frequency of operation is very high. Typical operation is at an hourly, or even 15minutes basis. Hence, it is not practical to build a prediction procedure, or a forecast combination procedure, especially designed for a specific farm and period. A second reason for the required flexibility of a forecast combination procedure is that the relative accuracy of alternative wind power predictions may vary with time and also with the prediction horizon. The variation of the relative accuracy depends on the information set used by the alternative predictors, as well as the methodology used in their models. As is well known, the relationship between meteorological predictions (wind speed and direction) and power production is highly nonlinear (Landberg, Giebel, Nielsen, Nielsen, & Madsen, 2003; Sánchez, 2006b; Jursa & Rohrig, 2008-this issue). Hence, alternative predictors would use alternative non- linear approaches like neural networks, nonparametric time series, etc. These alternative approaches can have different performances depending on their capacity to capture the nonlinearity. The article is organized as follows. The rest of this section describes the two data sets used in the article and introduces the statistical problem of forecast combination. A two-step procedure for forecast combination is also introduced. Section 2 summarizes the main regression-based combination procedures using timevarying coefficients. Section 3 introduces a new
proposal to combine forecasts, with one of the forecasts receiving almost all the weight. This proposal is based on a modification of the AFTER method of Yang (2004). Section 4 shows the results of applying the twostep combination procedure to the two data sets. Section 5 concludes and some technical details are provided in the Appendices. 1.2. Wind power data This section introduces the data used in the article. The data corresponds to two wind farms, labeled as F1 and F2. The series are the average power generated in each hour. Fig. 1 shows the time series of the hourly average power generation of each wind farm. The data set of F1 is of 2885 hours, and the data set of F2 is of 1332 hours. The production of a wind farm is always bounded between zero and its installed capacity. Consequently, for clarity of exposition, the production has been divided by the installed capacity of each wind farm, and hence the values are in the interval [0,1]. The rescaling of wind power data by dividing by the installed capacity has some advantages. First, it allows the comparison of data between alternative farms. Also, it avoids discontinuities due to changes in the installed capacity. The installed capacity can change many times during the life of a wind farm: the number of wind turbines can change, or some old wind turbines can be replaced by wind turbines of larger capacity. Note that the dynamic properties of the time series are not changed by dividing by the installed capacity. For each wind farm, there is a set of four alternative predictions, labelled as Pi, i = 1,…, 4, for several hours ahead. In each location, each alternative prediction was supplied by an independent professional forecaster. The forecasters are not the same in the two wind farms. The characteristics of the wind farms, as well as the characteristics of the terrain and the weather, are also different in each wind farm. For simplicity, we will restrict the exposition to a maximum horizon of h = 18 hours ahead. The relative behaviour of the forecasters is very different at these horizons, helping to illustrate the proposed combination procedure. Fig. 2 shows the mean squared prediction error (MSPE) of each predictor. It can be seen in this figure that the MSPE is higher in wind farm F1. The reason for such a difference is not the ability of the forecasters, but the different orographic and
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
681
Fig. 1. Time series of the mean hourly generated power at two wind farms. Data are in % of installed capacity.
meteorological conditions. Wind farm F1 is located in highly complex terrain, where geographical effects make wind speed predictions difficult. Besides this, meteorological conditions in wind farm F1 are also more unstable than in wind farm F2. In F1 there seems to be a competition between P2 and P3 at h b 6 as to which is the best available predictor. For
h N 6, P2 has the best average performance. Hence, a good strategy might be to combine P2 and P3 to predict at h b 6, and to select P2 for longer horizons. However, in F2 it is not clear which is the best available predictor, especially for h b 12. A good strategy would be to combine P2 and P4 for h b 6, and add P3 to the combination for h ≥ 6. Note that this strategy is being
Fig. 2. MSPEs of alternative predictors.
682
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
Fig. 3. MSPE of each predictor divided by the MSPE of predictor P1. The MSPE is the average of 72 hours.
proposed after we have observed the prediction errors. In a real situation, these figures are not available, and we would like to have a combination procedure that makes the appropriate decisions in real time. Fig. 2 shows average values. However, the relative performance of the predictors may vary over time. Fig. 3 shows the MSPE of each predictor at horizon 12, in groups of 72 consecutive hours (3 days). This figure reveals that there is not a single best predictor along the time. For instance, in Fig. 2, for wind farm F1, the best average predictor at h = 12 is clearly P2. However, in Fig. 3, wind farm F1, the best available predictor is not clear. Fig. 3 suggests that, instead of looking for the best average predictor, it could be more profitable to search for the best available predictor at each time. 1.3. The combination model The combination problem can be described as follows. Let pt be the variable of interest, such as the (1) (2) (K) , ft|t–h ,…, ft|t–h be wind power at time t = 1,2,…. Let ft|t–h a set of K competing lead h predictors of pt made with information available at time t − h, and let ðk Þ
ðk Þ
etjt−h ¼ pt −ftjt−h ; where k = 1,…, K, be the h step-ahead prediction error incurred by the k-th predictor. Our interest is in building a linear combination of the competing forecasts as pˆ Ctjt−h ¼
K X k¼1
ðk Þ
/k;t ftjt−h ;
ð1Þ
where ϕk,t = ϕk,t(h) is the combination coefficient for model k at time t to predict at lead time h. For economy in notation, the dependency of ϕk,t on the horizon is omitted. Let eCtjt−h ¼ pt −pˆ Ctjt−h
ð2Þ
be the prediction error of this combined predictor. The solution of the combination problem is a vector ϕt = (ϕ1,t,…,ϕK,t)′, such that it minimizes some loss C ) in such a way that function L(et|t–h n o ðk Þ L eCtjt−h V min L etjt−h : k
ð3Þ
That is, the goal is for the combination to perform better than, or similarly to the best individual predictors. A prediction system for wind energy forecasting usually operates on-line, in an environment with little or no user intervention, and for long periods of time. Given this context, a realistic proposal should allow some time evolution in the dynamic properties of the predictors, and hence in the combination. Therefore, ϕk,t should be time-varying. The alternative to a combination procedure is the selection of a single predictor. This decision, selecting or combining, is a common dilemma for many practitioners. Although the statistical methods used for selection and combination are different, a prediction selection can be seen as a special case of model (1), where the selected predictor has coefficient ϕk,t = 1 and the remaining coefficients are zero.
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
1.4. Two different combination approaches in a two-step procedure There are many approaches in the literature to combining forecasts (see, for instance, Timmermann, 2006, for a recent survey on the topic). Following Yang (2004), the combination methods can be classified into two main classes, depending on the goal of the combination. The first class of combination methods will be denoted as combination for improvement. In this class, the target is to find the best (constrained) linear combination of a set of forecasts. Methods to perform this combination for improvement can be based on the regression methodology, aimed at minimizing the residual variance of the linear combination. The original article on forecast combination by Bates and Granger (1969) falls into this class, as has been shown by Granger and Ramanathan (1984). There have been many proposals in the literature to regress the competing forecasts. Some of them are especially designed to produce time-varying coefficients, such as Diebold and Pauly (1987) and Terui and van Dijk (2002). Ideally, the optimal linear combination would outperform the individual forecasts. However, practically, it is not clear how close we can get to this ideal situation. This uncertainty about the final performance of the combination is increased when nonstationary data or small data sets are used. It is then possible that such a combination will be worse than some of the individual forecasts. In such a situation, the selection of a single forecast would be a better option. Unfortunately, in a real time operation, the decision between combining or selecting forecasts must be made before their performances are compared. The second class of combination methods is called combination for adaptation. This class of combinations aims to perform as well as the best individual procedure. This second class can then be interpreted as a dynamic model selection, where the combination tends to put all the weight on the best available predictor, whenever it is clear that one of the predictors is the best one. The problem of selection in a nonstationary environment is that the best available predictor may change over time. The combination for adaptation methods can then be seen as a problem of efficiently tracking the best available predictor. The literature on on-line selection of predictors is usually based on the exponential weighting of prediction errors. This exponential weighting has
683
proven to have good statistical properties in combination for adaptation (Yang, 2004; Sancetta, 2007). The exponential weighting of models can also be found in some other areas like Bayesian Model Averaging (Hoeting, Madigan, Raftery, & Volinsky, 1999) and computational learning theory (Vovk, 1990; Herbster & Warmuth, 1998). The performance of combination for adaptation can fail if there is not a clear winner, and the best individual predictor is constantly changing. As a result, the variability in the ranking of predictors is translated into variability of the combined forecasts. In these situations, the weighted averaging induced by the combination for improvement methods could be the best option. In a practical situation, it is not known in advance whether it would be better to use a combination for improvement method (and if so, which one) or a combination for adaptation method. In order to benefit from both approaches, a two-step procedure will be applied for efficient wind energy forecasting. In the first step, alternative combination procedures are applied. In the second step, a combination for adaptation method is used to combine the forecast combinations from the first step. The idea of the recombination of different combinations of forecasts was originally proposed by Gunter and Aksu (1989). Under some conditions (see Yang, 2004), this combination of forecast combinations has the potential to obtain large gains in accuracy, yet without the risk of incurring large losses. The combination procedures of the first step will be based on the two above-mentioned approaches: one or more methods of adaptive combination for improvement, and also a method of combining for adaptation. In a second step, the alternative combinations of the first step will be treated as a new combination problem, and will be combined to obtain the final combination. In this second step, the aim is not to improve over the single combined predictions of the first step, but only to assure that the final combination is at least as good as the best of the competing combined predictions. It is then a combination for adaptation problem. Fig. 4 summarizes this two-step approach. The advantage of the proposed two-step combination procedure is that you can use alternative combination methods in the first step without worrying much about whether they are unbiased, or whether the combination should be convex. If those aspects are clear enough, the best combination
684
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
Fig. 4. Two-step combination method.
procedure will also be clear. If they are not clear, it could be better to use (reasonable) alternative combinations that have some sensible properties. If one of the combination procedures is poor, its forecasts will be inefficient. Then the second step of the combination procedure will discard it. The proposed two-step procedure diminishes the risk of choosing an inadequate combination procedure.
2. Combination for improvement: Covariance-based methods 2.1. General description
ð1Þ ðK Þ for t = 1, 2,…, where ϕt ¼ /1;t ; N ; /K;t V; f tjt−h ¼ ftjt−h ; N ; ftjt−h V, ð1Þ ðK Þ V and etjt−h ¼ etjt−h ; N ; etjt−h . It is assumed that the competing forecasts are unbiased; if this is not the case, a constant term could be added into the forecasting vector ft|t−h. In the specific context of combining wind power forecasts, the introduction of a constant term might not be a good idea. Since the evolution of wind power is nonstationary, the inclusion of a constant might bias the combined prediction toward an undefined mean value. For instance, if all the competing forecasts were equal to zero, the constant will mean that the combined forecast will not be zero. Such a result is not realistic. Besides this, wind power predictions are usually supplied by professional forecasters. It is very unlikely that they are always biased. is the It is assumed that the loss function L eCtjt−h
The most popular procedures of combination for improvement are based on a linear regression of the actual values and the alternative forecasts. The differences between the alternative procedures lie in the constraints of the linear relationships and the estimation method. These methods can also be called covariancebased methods. In this article, we focus on adaptive estimation methods, which allow for time varying combination coefficients. The model (1) can be written as
E( pt + h ft|t–h). The solution of the combination problem (4)
pt ¼ pˆ Ctjt−h þ eCtjt−h ¼ /tVf tjt−h þ eCtjt−h ;
where the superscript ‘re’ stands for ‘regression method’. If the competing predictors are unbiased, the
ð4Þ
2 L eCtjt−h ¼ E eCtjt−h . V ¼ E etjt−h etjt−h and γt|t–h =
traditional least squares function Then
P tjt−h
¼ E f tjt−h f tVjt−h , Xtjt−h
has traditionally been called the regression method. This solution will be denoted by ϕtre and is −1 ϕre t ¼ Rtjt−h gtjt−h ;
ð5Þ
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
PK restriction k¼1 /k;t ¼ 1 can be imposed to gain efficiency. This restriction was used in the classical definition of optimal forecast combination (see, for instance, Newbold & Granger, 1974). The solution of the combination problem (4) imposing this restriction has traditionally been called the variance-covariance method (Reid, 1969; Newbold & Granger, 1974). This solution will be denoted by ϕtvc and is
−1 −1 ϕvc t ¼ Xtþt−h c = c VXtþt−h c ;
ð6Þ
where c is the K × 1 unit vector. The superscript ‘vc’ stands for ‘variance-covariance method’. It has been observed by many authors (Newbold & Granger, 1974; Bunn, 1985) that when applying the variance-covariance method, more efficient combined forecasts can be obtained by treating the forecast errors as if they were independent. This implies that the matrix Ωt|t−h should be specified only in its diagonal terms. Under this independence assumption, the optimal combination coefficients are ϕvc−in t
−1 −1 ¼ Vtþt−h c = cVVtþt−h c ;
ð7Þ
where Vt|t−h = diag (Ωt|t−h) . The superscript ‘in’ stands for ‘independent assumption’. It is easy to see that ϕtvc−in has an intuitive interpretation: the coefficients in ϕtvc−in only depend on the relative precision of the alternative predictors. If we denote the k-th element of the diagonal of Vt|t−h by vk,t|t−h, we get 1=vk;tjt−h /vc−n : k;t ¼ PK i¼1 1=vi;tjt−h
ð8Þ
The parameters ϕtre in (5), ϕtvc in (6) and ϕtvc−in in (7) need to be estimated from the data, and our intention here is to use adaptive estimation. The next subsection discusses some existing proposals for adaptive estimation. 2.2. Adaptive estimation 2.2.1. Estimation using ewma For on-line operation with the availability of large data sets, estimation based on exponential weighting of old data is especially appealing. At time t, the available prediction errors for each alternative fore-
ðk Þ
685
ðk Þ
caster are etjt−h ¼ pt −ftjt−h ; k ¼ 1; N ; K. The estimators will then have the form −1 re ϕˆ t ¼Rˆ tjt−hgˆ tjt−h ;
ð9Þ
−1 −1 vc ϕˆ t ¼ Xˆ tjt−h c c VXˆ tjt−h c ;
ð10Þ
−1 −1 ¼ Vˆ tjt−h c cVVˆ tjt−h c : ϕvc−in t
ð11Þ
=
=
ˆ ˆ tjt−h ; Xˆ tjt−h and Vˆ tjt−h can be The estimators ∑ tjt−h ; g obtained using an exponentially weighted moving average (ewma) combination of past information. For instance, the recursive estimation of Σt|t−h is Rˆ tjt−h ¼ Nt−1 Ftjt−h ;
ð12Þ
V þ kt Ft−1jt−h−1 ; Ftjt−h ¼ f tjt−h f tjt−h
ð13Þ
Nt ¼
t X s¼hþ1
s
j ki
i¼1
¼ 1 þ kt Nt−1 ; N0 ¼ 0;
ð14Þ
where 0 b kt ≤ 1, t N 0, is usually called the forgetting factor and Nt is the equivalent sample size (Ljung & Söderstrom, 1983). The starting value in Eq. (13) is a matrix of zeros of appropriate dimension. The forgetting factor λt can be either fixed (λi = λ, i = 2,…,t) or variable (see, i.e., Sánchez, 2006a,b). It is important to note that the sequence of forgetting factors λt is the key feature of the adaptation. From Eq. (12) we can see that the smaller the value of λt is, the lower the influence of past data on the estimation. Typically, the choice of the forgetting factor is a compromise between the ability to track changes in the parameters and the need to reduce the variance of the prediction error. The choice of the forgetting factor is very important, since it has a substantial effect on the efficiency of the predictions. Most applications use a constant forgetting factor, typically in the range 0.950 ≤λ ≤0.999. In a similar fashion, the following adaptive estimators of Ωt|t−h, Vt|t−h, and γt|t−h can be obtained as Xˆ tjt−h ¼ Nt−1 Wtjt−h ;
ð15Þ
V þ kt Wt−1jt−h−1 ; Wtjt−h ¼ etjt−h etjt−h Vˆ tjt−h ¼ diag Xˆ tjt−h ;
ð16Þ ð17Þ
686
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
gˆ tjt−h ¼ Nt−1 gtjt−h ;
ð18Þ
gtjt−h ¼ pt f tjt−h þ kt gt−1jt−h−1 ;
ð19Þ
where the starting values in Eqs. (16) and (19) are matrices of zeros of appropriate dimensions. 2.2.2. Estimation using recursive least squares The regression model (4) can also be estimated using recursive least squares (RLS). The RLS estimation of (4) minimizes the weighted least squares criterion ϕt̂ = arg min St2(ϕ), where the cost function is St2 ðϕÞ
¼
t X
2 jðt; jÞ pjþh −ϕ Vf jþhjj ;
ð20Þ
j¼1
where κ(t, t)=1≥κ(t, t − 1)≥ ∙∙∙≥κ(t, 1)≥0 is a sequence of weighting constants, usually called the forgetting profile. The quantity κ(t, j) is the weight given to the j-th residual in the cost function at time t. This forgetting mechanism helps to progressively reduce the importance of old data. As a result, the estimation is adaptive. There are several different approaches to building forgetting profiles. For an on-line implementation, the most common forgetting profile is jðt; jÞ ¼ jti¼jþ1 ki ; jbt where λi is the above mentioned forgetting factor, and holds for 0 b ki ≤1. The smaller the value of λi is, the lower the influence of past data. When ki =1 for all values of i=2, 3,…, t, the cost function (20) leads to ordinary least squares (OLS) estimation. The RLS estimator for the parameter vector ϕt is (Ljung & Söderstrom, 1983) ϕˆ t ¼ ϕˆ t−1 þ Gˆ tV f tVjt−h eˆ Ctjt−h ;
ð21Þ
ˆ t−1 ftjt−h being the one-step ahead with eˆ Ctjt−h ¼ pt −ϕV prediction error. The gain matrix Γt̂ is a measure of the dispersion of the estimate ϕt̂ that holds −1 −1 Gˆ t ¼ ktGˆ t−1 þ f tjt−h f tjt−h V ;
ð22Þ
which leads to
! V Gˆ t−1 Gˆ t−1 f tjt−h f tjt−h 1 ˆ ˆ : Gt ¼ G t−1 − kt kt þ f tVjt−h Gˆ t−1 f tjt−h
ð23Þ
It is important to see that in order to obtain the recursive estimation ϕt̂ , the analyst has to supply the value of λt before any computations can be performed.
Hence, as is also mentioned above, the selection of an appropriate sequence of forgetting factors is crucial to the performance of this adaptive procedure. Since RLS estimation does not assume any structure for the parameter variation, it can be interpreted, unlike Kalman filter methods, as a nonparametric regression technique. Lindoff (1997) showed the relationship between RLS and local regressions fitting with an exponential decaying kernel. 2.3. Selected covariance-based procedures The idea in the proposed two-step combination procedure is that, in the first step, we use alternative combination procedures. These alternative combination procedures should be based on the analyst's judgement. However, since the second step is going to evaluate them, this initial selection does not need to be too restrictive. Two alternative covariance-based methods have been selected to participate in the first step of the combination procedure for wind power forecasting. These two selected methods can be considered the most popular ones. They will be implemented using time varying coefficients. The first one is a regression with no constant term and coefficients adding to one; that is, pt ¼
K X
ðk Þ
/k;t ftjt−h þ eCtjt−h ;
k¼1
K X
/k;t ¼ 1:
ð24Þ
k¼1
The estimation of this model, which will be denoted by CRLS, will be based on the RLS estimation of K X ð1Þ ðk Þ ð1Þ pt −ftjt−h ¼ /k;t ftjt−h −ftjt−h þ eCtjt−h ; ð25Þ k¼2
P where /1;t ¼ 1− Kk¼2 /k;t . In the initialPstages of this research, model (24) without the restriction Kk¼1 /k;t ¼ 1 was also considered. However, its performance with wind energy data was worse than that of model (24). It should be mentioned, from our experience with wind power data, that the case ϕ ̂k,t b 0 is rather unusual. Model (25) is similar to the ewma model (10). However, the estimation of model (25) is more robust. The reason for this is that the ewma estimation of the covariance Ω̂ t|t−h in Eq. (15) has to be performed with care, since it could be noninvertible in some periods. The RLS estimation will be made with an adaptive forgetting factor, denoted as λtCook, based on the
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
Cook's distance of the new observation, as developed by Sánchez (2006a). For the sake of conciseness, the implementation of this forgetting factor in the RLS estimation of model (25) is described in Appendix A. The advantage of the proposed λtCook is that its value is proportional to the information content of the incoming data. This property is obtained by linking its value to the Cook's distance of the new observation. If the Cook's distance of the incoming observation is large, it is a signal that the parameters could be changing, and as a consequence, the value of λtCook will decrease to allow for a faster adaptation. A minimum value λmin such that λtCook N λmin is used to avoid the effect of outliers (see Appendix A for details). Conversely, if the Cook's distance of the new observations is small, it is a signal that the estimated model is appropriate. Consequently, the value of λtCook will be maintained close to unity to increase efficiency. It was shown by Sánchez (2006a) that λtCook encompasses some adaptive forgetting factors that are available in the literature. The second covariance-based method is the ewma combination (11), where only the diagonal elements of Ω̂ t|t−h are used. Hence, only the variances of the prediction errors are used, and therefore it is always invertible. This model will be denoted as CV. In this case, the vc−in ≤ 1. It is then very estimated coefficients hold 0 ≤ ϕ̂k,t vc−in , so that easy to impose a threshold value ϕmin on ϕ̂k,t vc −in ̂ b ϕmin. The use of a the final coefficient is 0 if ϕ k,t threshold is customary in the practice of forecast combination, especially if a large number of competing predictors are involved. In this application we have used ϕmin = 0.10. This threshold ϕmin could also be optimized with some off-line experiments. The forgetting factor used in Eq. (16) is also an adaptive forgetting factor. In this case, and as opposed to the forgetting factor used in CRLS, different adaptive forgetting factors will be used for each predictor. The computation of these adaptive forgetting factors is also based on the Cook's distance of the new observation, as proposed by Sánchez (2006a). Following expression (8), the estimation of the variance ̂vc–in is made using an ewma as follows vk,t|t−h used in ϕk,t ðk Þ vˆ k;tjt−h ¼ Nkt−1 Stjt−h ;
ð26Þ
where ðk Þ
ðk Þ
ðk Þ V
ðk Þ
Stjt−h ¼ etjt−h etjt−h þ kCook k;t St−1jt−h−1 ; Nkt ¼ 1 þ kCook k;t Nkt−1 ;
ð27Þ
687
(k) = 0. The computation of λkCook ,t is with Nk 0 = 0 and S0|–h described in Appendix B.
3. Combination for adaptation 3.1. General description This section describes combination procedures designed to have a performance similar to the best available predictor at each time. This approach is different to the one discussed earlier, where the intention was to improve on the best available predictor. Some researchers have claimed (see, e.g., Clements & Hendry, 1998) that when the alternative predictors are based on the same information set, it is difficult for a combined forecast to outperform each of the individual predictors. This could be the case for wind energy, since most models use the same meteorological information, such as wind speed predictions. In these cases, a better strategy would be to select the best available predictor. Performing an efficient selection in our context is not easy, for two main reasons. First, the best predictor can change with time, and hence some sequential selection could be necessary. Second, a raw sequential selection might not be efficient, since a small change in the data could produce a change in the selected predictor. As a consequence, the forecasts based on sequential selection will have an added variability that could surpass the potential gain of the selection approach. To avoid this added variability due to randomness, we search for combination procedures, instead of pure selection ones, but ones with a tendency to assign a weight close to one to the best available predictor. The aim is to add some smoothness to the sequential selection (note that the selection of a predictor can be seen as a combination method in which we assign all the weight to a single predictor). Yang (2004) proposes a combination procedure called AFTER that pursues this goal. The next subsection describes the AFTER method, showing its convenience in the present context. It will be seen that the AFTER method is not adaptive in a nonstationary situation like wind power forecasting. Then, in the following subsection, a time varying version of the AFTER procedure, with better adaptation properties, is proposed.
688
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
3.2. The AFTER method
3.3. A new proposal: The AEC method
Yang (2004) proposes a combination for adaptation method called Aggregated Forecast Through Exponential Re-weighting (AFTER) (see also Zou & Yang, ˆC 2004). the PKcombined forecast is ptjt−h ¼ PK In AFTER, ðk Þ k¼1 uk;t ftjt−h , with k¼1 uk;t ¼ 1 and 0 ≤φk,t ≤ 1. As before, the dependence on h is omitted from the notation for simplicity. The weights are obtained recursively as
This section presents a modification of the AFTER method to allow for time evolution. It is known as Adaptive Exponential Combination (AEC). The combination weights of the AFTER method can be rewritten as
2 9 8 ðk Þ < pt−1 −ft−1jt−h−1 =
uk;t
−1=2 vˆ k;t−1jt−h−1 exp − :
u ; k;t−1 ¼ ; ð28Þ 2 9 8 ði Þ < = p −f t−1 t−1jt−h−1 PK −1=2 ˆ u i¼1 vi;t−1 exp − 2 vˆi;t−1jt−h−1 : ; i;t−1 2 vˆk;t−1jt−h−1
where as starting values we can use, for instance, φk,1 = 1/K. As in Eq. (8), the lower the value of v̂k,t−1|t−h−1, the used here larger the weight φk,t. The estimator v̂k,t−1|t−h−1 ðk Þ is Eq. (26). Note that the prediction error pt−1 −ft−1jt−h−1 is standardized by the lead h MSPE. Then, the prediction error is evaluated relative to its expected value. Hence, a large prediction error with respect to the expected error is interpreted as a sign of potential deterioration of the predictor. As a result, the contribution of that prediction to the combination is exponentially reduced. The contribution of a predictor is doubly weighted (Re-weighted): by its average performance through v̂k,t−1|t−h−1, and also exponentially by its latest performance through the exponential term. This exponential weighting is intended to adapt the combination coefficients to the evolution of the observations. However, Eq. (28) shows that φk,t is affected not only by the last prediction error, but also by the previous prediction error through φk,t−1, by a similar amount. Then, the older prediction errors are as important as the last one in the determination of φk,t. For instance, a very large prediction error can lead to φk,t ≈ 0. Then, from that time on, we can have φk,s ≈ 0, s N t, even if later on that predictor becomes the best one. The influence of prediction errors in Eq. (28) is not dynamically adapted. This influence would be reasonable in a stationary environment, but it is less appropriate in a time varying context. Under nonstationarity, the influence of φk,t on future values should decay as new information arrives.
Ak;t ; uk;t ¼ PK i¼1 Ai;t
ð29Þ
where the term Ak,t can be written as 8 9 ðk Þ 2 > t−1 > t−1 pi −f = < X iji−h −1=2 Ak;t ¼ j vˆ k;iji−h exp − > i¼1 ; : i¼1 2 vˆ k;iji−h >
ð30Þ
¼ Ik;t−1 Ak;t−1 ; with
Ik;t−1
8 2 9 ðk Þ > > = < pt−1 −ft−1jt−h−1 −1=2 : ¼ vˆ k;t−1jt−h−1 exp − > 2 vˆ k;t−1jt−h−1 > ; :
ð31Þ
The term Ik,t−1 can be interpreted as the new piece of information that the k-th predictor supplies about its performance. The term Ak,t−1 summarizes the remaining older information. Expression (30) shows that the old information is as important as the new in the computation of φk,t. It could be said that Ak,t has permanent memory. This permanent memory can be seen better by rewriting Eq. (30) as 8 2 9 ðk Þ > = t−1 < t−1 pi −fiji−h > −1=2 ¼ j Ik;i ; ð32Þ Ak;t ¼ j vˆ k;iji−h exp − > 2 vˆ k;iji−h > i¼1 ; i¼1 : where it is shown that all of the previous information, Ik,i, is equally weighted. In order to introduce some forgetting into the recursion of φk,t, a forgetting factor λt, as in RLS, is applied. Let us define the term Bk,t as t−1 jðt;iÞ
kt t t kt−1 Bk;t ¼ Ik;t−1 Bkk;t−1 ¼ Ik;t−1 Ik;t−2 Bkk;t−2 ¼ : : : ¼ j Ik;i ; i¼1
ð33Þ
where jðt; iÞ ¼ jts¼iþ1 ks ; sbt; jðt; t Þ ¼ 1, is like the above mentioned forgetting profile. Note that if λs b 1, s = 1,…, t, it can be checked that limt→∞ κ(t, i) = 0.
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
689
Hence lims→∞ Bkt−s = 1, and the influence of Bkt−s on Bkt decays exponentially. The AEC weights are
combination obtained using this method will be denoted as CAEC.
Bk;t wk;t ¼ PK ; i¼1 Bi;t
4. Application of the two-step combination procedure
ð34Þ
P ðk Þ and the AEC prediction is pˆ Ctjt−h ¼ Kk¼1 wk;t ftjt−h . The Cook . forgetting factor used in this combination is again λkt That is, we use a different forgetting factor for each predictor, with a value related to the amount of information in the new observations. The computation of this forgetting factor is as shown in Appendix B, where the alternative combinations now play the role of the alternative predictions to be combined. The forecast
As mentioned above (see Fig. 4), a two-step combination of forecasts is performed. In the first step, the four competing wind power forecasts are combined using three alternative combination procedures: CRLS, CV, and CAEC. Fig. 5 shows the performance of the three combination procedures in the two wind farms F1 and F2. To avoid the effect of the initial observations, the first 200 observations have not been used in the
Fig. 5. MSPEs of the three combination procedures used in the first step of the combination algorithm.
690
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
computation of the MSPEs. This figure shows that the relative performances of the three alternative combinations are not the same in the two wind farms. In F1, the three combination procedures have a fairly similar performance. The combination with the best average performance is CRLS, with an MSPE similar to CV. The worst combination is CAEC. In F2, the three combinations are similar up to h = 10, then CRLS starts to lose accuracy. In F2, CV has the best average performance. We see therefore that the relative performance of the combinations also depends on the horizon. These results reinforce the idea that it is very difficult to foresee which combination procedure is going to be the most efficient,
and that a better strategy would be to combine several combination procedures. In the second step, the AEC method with an adaptive forgetting factor will be used to combine the three combinations obtained in the first step. The predictions obtained in this second step will be called the final combination. This second step is also performed on-line. Fig. 6 shows the MSPEs of the final combination (solid line), along with the MSPEs of the combinations from the first step (dotted line). It can be seen from this figure that the second step supplies a prediction that is very close to the best combination of the first step.
Fig. 6. MSPEs of the three combination procedures of the first step of the combination algorithm (dotted lines), with the MSPEs of the combination from the second step (solid lines).
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
691
Fig. 7. MSPEs of the four original predictions P1 through P4 (dotted lines) and the final combination (solid line).
Fig. 7 shows the MSPE of the final combination and the MSPEs of the original predictions P1 through P4 (also shown in Fig. 2). For ease of comparison, this figure also shows the difference in MSPE between each prediction and the predictor P2, which is the one with the best overall performance. The advantage of the proposed combination is now apparent. The final combination has a MSPE that is lower than or similar to that of the best individual predictor. This figure shows a very clear conclusion: the two-step procedure is very efficient in obtaining a prediction similar to or better than the best alternative prediction.
5. Concluding remarks It is common in the wind energy industry to have access to more than one forecaster. It is well known that the relative performances of the alternative wind
power forecasts can vary with the wind farm, and also with time. In these cases, an adaptive combination of forecasts can be useful to generate an efficient single forecast. This article proposes a two-step procedure that tends to obtain a combined prediction similar to or better than the best alternative predictor. The two-step procedure aims to take the advantage of the two different types of forecast combinations, namely the combinations for improvement and combinations for adaptation methods. In the first step, several combination methods of both types can be used. The alternative combinations are implemented with time varying coefficients, allowing for nonstationarities in the series and in the relative performance of the competing predictors. The time-varying estimation is made using an adaptive forgetting factor based on the information in the incoming data. A newly proposed procedure, denoted as AEC, is used among the combinations of the first step. The AEC method can be seen as a timevarying implementation of the AFTER method. The
692
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
AEC is designed to give all of the weight to the best available forecast. Depending on the wind farm, the prediction horizon, and also the time, some combinations will perform better than others. Then, in a second step, the AEC method is used to combine the alternative combinations of the first step. Since the AEC is similar to a dynamic selection method, the outcome of the second step will be close to the best alternative combination. The two data sets used in this article illustrate the usefulness of the procedure for efficient wind power forecasting. One version of this two-step combination procedure is running operationally in the control room of REE (the Spanish SO) to combine alternative wind power predictions on-line. Acknowledgments The author wishes to express his appreciation to James W. Taylor and two anonymous referees for their valuable comments and suggestions. This research was supported in part by CICYT grant SEJ2007-64500, REE (Spanish SO), and the ANEMOS plus project (TREN/07/FP6EN/S07.71594/038692).
Appendix A. The adaptive forgetting factor in model (25) In this appendix, the recursive estimation of model (25), using an adaptive forgetting factor, is described. V ð1Þ ð2Þ ð1 Þ ðK Þ ð1Þ and ztjt−h ¼ ftjt−h −ftjt−h ; N ; ftjt−h −ftjt−h . Then, Let yt ¼ pt −ftjt−h model (25) can be written as yt ¼ z Vtjt−h bt þ eCtjt−h ;
ð35Þ
with βt = (ϕ2,t,…,ϕK,t)′. In order to apply the RLS algorithm, the adaptive forgetting factor has to be previously computed. This adaptive forgetting factor is based on Sánchez (2006a), where the value of λt is related to the amount of new information in the new observation. The evaluation of the information content of the new observation is made by comparing it with the old ones. Following Sánchez (2006a), the Cook's distance of the new observation to the data set is 2 ztVGˆ t−1 zt eˆ Ctjt−h ; ð36Þ Dt ¼ 2 K rˆ t−1 1 þ ztVGˆ t−1 zt
2 C is a consistent estimate of the variance of et|t–h , where σ̂t–1 2 2 −1 Pt−1 ˆ ˆ like, for instance, rt−1 ¼ ðt−1Þ j¼1 yj −z jVbj . To add further adaptability, especially in the case of very long 2 data P sets, an adaptive estimate2of σ such as rˆ 2t−1 ¼ t−1 t−1 −1 ˆ Nt−1 j¼1 ji¼jþ1 ki yj −z jVbj could alternatively be used, with Nt being the so-called P effectivesample size, which is defined as Nt ¼ tj¼1 jti¼jþ1 ki ¼ 1þ kt Nt−1 with N0 = 0. In order to translate Dt into a forgetting factor, we need to map Dt into the interval [0,1]. Following Sánchez (2006a), the forgetting factor ktCook is built as
¼ kmin þ ðkmax −kmin ÞHt ; kCook t
ð37Þ
with Ht ¼ P v2K NKDt where χK2 is the chi-square distribution with K degrees of freedom, and λmin and λmax are the maximum and minimum values allowed for λtCook. In this application they are set to λmin = 0.99 and kmax = 0.9999. Once we have λtCook, the recursive algorithm of RLS is ! ˆ t−1 zt z Vt Gˆ t−1 1 G Gˆ t ¼ Cook Gˆ t−1 − Cook ; ð38Þ kt kt þ z Vt Gˆ t−1 zt Þ bˆ t ¼ bˆ t−1 þGˆ t z tVeˆ Ctjt−h :
ð39Þ
Appendix B. The adaptive forgetting factor in model (27) In this appendix, the computation of the adaptive Cook recursive forgetting factor λk,t is described. The estimation is based on the RLS estimation of the model (4) using the updating equation −1 −1 Gˆ t ¼ AGˆ t−1 A þ f tjt−h f tVjt−h ;
ð40Þ
instead of Eq. (23), where At ¼ diag
qffiffiffiffiffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffi kCook ; N ; kCook ; 1t Kt
ð41Þ
Cook is computed as in Sánchez (2006a), where where λkt the Cook's distance of the new observation is decomposed into the contribution of each predictor to that
I. Sánchez / International Journal of Forecasting 24 (2008) 679–693
distance. From Sánchez (2006a), the Cook's distance of the new observation due to the kth predictor is
Dkt ¼
2 git eˆ Ctjt−h rˆ 2t−1 xit
;
ð42Þ
where xit ¼ uVi Gˆ tjI ui ; ui is a K×1 vector −1with 1 in theith −1 position and zeros elsewhere; Gˆ tjI ¼ Gˆ t−1 þ f tjt−h f tVjt−h ; and git ¼ uiVGˆ tjI f tjt−h f tVjt−h Gˆ tjI ui . The value of Dkt is translated into a forgetting factor by the following transformation ¼ kmin þ ðkmax −kmin ÞHt ; kCook kt
ð43Þ
where Ht = P(χ12 N Dkt) and kmin = 0.99, kmax = 0.9999. References Bates, J. M., & Granger, C. W. J. (1969). Combination of forecasts. Operational Research Quarterly, 20, 451−468. Bunn, D. W. (1985). Statistical efficiency in the linear combination of forecasts. International Journal of Forecasting, 1, 151−163. Clements, M. P., & Hendry, D. F. (1998). Forecasting economic time series. Cambridge University Press. Diebold, F. X., & Pauly, P. (1987). Structural change and the combination of forecasts. Journal of Forecasting, 6, 21−40. Granger, C. W. J., & Ramanathan, R. (1984). Improved methods of combining forecasts. Journal of Forecasting, 3, 197−204. Gunter, S. I., & Aksu, C. (1989). N-step combinations of forecasts. Journal of Forecasting, 8, 253−267. Herbster, M., & Warmuth, M. K. (1998). Tracking the best expert. Machine Learning, 32, 151−178. Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14, 382−417.
693
Jursa, R., & Rohrig, K. (2008). Short-term Wind Power Forecasting using Evolutionary Algorithms for the Optimization of Artificial Intelligence Models.International Journal of Forecasting, 24, 694−709 (this issue). Landberg, L., Giebel, G., Nielsen, H. A., Nielsen, T., & Madsen, H. (2003). Short-term prediction — An overview. Wind Energy, 6, 273−280. Lindoff, B. (1997). On the Optimal Choice of the Forgetting Factor in the Recursive Least Squares Estimator. Lund Institute of Technology. Ljung, L., & Söderstrom, T. (1983). Theory and Practice of Recursive Identification. Cambridge, Massachusetts: MIT Press. Newbold, P., & Granger, C. W. J. (1974). Experience with forecasting univariate time series and the combination of forecasts. Journal of the Royal Statistical Society, Series A, 137, 131−165. Reid, D. J. (1969). A comparative study of time series prediction techniques on economic data. Ph.D. thesis. University of Notttingham, Nottingham. Sancetta, A. (2007). Online forecast combinations of distributions: Worst case bounds. Journal of Econometrics, 141(2), 621−651. Sánchez, I. (2006a). Recursive estimation of dynamic models using Cook's distance, with application to wind energy forecast. Technometrics, 48, 61−73. Sánchez, I. (2006b). Short-term prediction of wind energy production. International Journal of Forecasting, 22, 43−56. Terui, N., & van Dijk, H. K. (2002). Combined forecasts from linear and nonlinear time series models. International Journal of Forecasting, 18, 421−438. Timmermann, A. (2006). Forecast combination. In G. Elliot, C. W. J. Granger & A. Timmermann (Eds.), Handbook of Economic Forecasting (pp. 135−196). North-Holland: Amsterdam. Vovk, V. (1990). Aggregating strategies. In Proceedings of the 3rd Annual Workshop on Computational Learning Theory (pp. 371–383). San Mateo, CA. Yang, Y. (2004). Combining forecasting procedures: some theoretical results. Econometric Theory, 20, 176−222. Zou, H., & Yang, Y. (2004). Combining time series models for forecasting. International Journal of Forecasting, 20, 69−84.