International Journal of Forecasting 8 (1992) 81-98
North-Holland

The evaluation of extrapolative forecasting methods

Robert Fildes
Lancaster University, Lancaster, UK
Abstract: Extrapolative forecasting methods are widely used in production and inventory decisions. Typically many hundreds of series are forecast, and the cost-effectiveness of the decisions depends on the accuracy of the forecasting method(s) used. This paper examines how a forecasting method should be chosen, based on analyzing alternative loss functions. It is argued that a population of time series must be evaluated by time period and by series. Of the alternative loss functions considered, only the geometric root mean squared error is well-behaved and has a straightforward interpretation. The paper concludes that exponential smoothing and 'naive' models, previously thought to be 'robust' performers, forecast poorly for the particular set of time series under analysis, whatever error measure is used. As a consequence, forecasters should carry out a detailed evaluation of the data series, as described in the paper, rather than relying on a priori analysis developed from earlier forecasting competitions.

Keywords: Evaluation - time series methods, Evaluation - methodology, Comparative methods - time series, Robust estimation, Outliers - effect of, Loss functions - evaluation, Ex ante.
The statistical evaluation of the forecasting performance of extrapolative time series methods is important for any applied forecaster trying to decide which method to use. In inventory/production control applications the researcher will typically need to consider a method's performance on a (potentially) large number of series. This paper identifies guidelines for carrying out such evaluations. The first section of the paper describes the various loss functions that are used in the context of inventory control. They are evaluated theoretically by reference to certain simple error models. The second section details the data used and the forecasting methods that have been applied in an attempt to understand the behaviour of these loss functions. These performance statistics are analysed in section three in the context of a forecasting competition based on the analysis of 263 series. The final section concludes by comparing the results of this forecasting competition with those of earlier studies, in particular Makridakis et al. (1982). The dangers of mis-interpreting forecast error statistics through the use of inadequate test statistics are demonstrated, and procedures for avoiding their worst effects are suggested.

Correspondence to: R. Fildes, Department of Operational Research and Operations Management, Management School, Lancaster University, Lancaster LA1 4YX, UK.
1. Loss functions in forecast error analysis
An extrapolative (time series) model of a particular time series uses only the history of the series in forecasting its future values. Formally:

Ŷ_T(k) = f(Y_T, Y_{T-1}, ..., Y_{T-p}, ...),

where Ŷ_T(k) is the k-step ahead forecast of Y_{T+k} made at time T based on data available to T. Statisticians have developed criteria for selecting the lag length p and functional form so that Ŷ_T(k) is an 'optimal' predictor, evaluated against some preselected loss function. The principal statistical rules that are
applied rely on some modification of the estimated standard deviation of the forecast error by factors which depend on the number of parameters and the number of observations, e.g. Akaike's Information Criterion (AIC). [For a survey see De Gooijer et al. (1985).] All these measures are based on parameters (typically the error variance) estimated within sample. However, both simulation studies and empirical studies such as Newbold and Granger's (1974) showed that these identification procedures were inadequate. Knowledge that both formal and informal methods of identification were unsatisfactory led inexorably to the conclusion that statistics of fit alone could not be used as a basis for model choice. Instead, researchers resorted to using a combination of within-sample error statistics to limit the range of models to be considered, and outside-sample error statistics to test the models.

1.1. Outside sample error measures

A wide range of error measures based on the k-step ahead forecast error e_T(k) = Y_{T+k} - Ŷ_T(k) has been used in the attempt to provide evidence on forecast performance. When attempting to compare forecasting performance over a number of series, such as is required in the various forecasting competitions, e.g. Makridakis et al. (1982), a large number of forecast errors must be summarized, as is illustrated diagrammatically in Exhibit 1. The error, e_ij, is the error made in forecasting the (T_0 + j)th data point of series i, the lead time L being assumed fixed and where T_0 is the time origin of the data used for the evaluation. For any given time period, T_0 + j (j > L - 1), there are N L-step ahead forecast errors which can be aggregated to form two (or more) error measures: the 'Time Squared Error' (TSE), based on squared error calculated by time period across series, and the 'Time Absolute Percentage Error' (TAPE), based on absolute percentage error. [Other alternatives could also be included, but measures based on either squared error or absolute percentage error are the most commonly used; Carbone and Armstrong (1982).] For any given series, there are T - L + 1 L-step ahead errors which can be aggregated across time to form the summary measures 'Series Squared Error' (SSE) and 'Series Absolute Percentage Error' (SAPE). A similar data matrix can be generated for each set of lead time forecasts. The various competitions, including the Makridakis competition (M-Competition), only included error statistics aggregated across series for one particular time origin. Such error statistics have been analysed by Makridakis and Winkler (1989), who concluded that the outside-sample errors in the M-Competition bear little relationship to the within-sample errors. Jenkins (1982) makes the criticism that with only one time origin used as a basis for all forecasts, the possibility exists that the arbitrary choice of time origin might unduly have affected the results. Of course, if the data series were stationary, this would not be a matter of serious concern.

Exhibit 1. The forecast error matrix (forecast lead time L) and corresponding summary error measures.
Series                        1         2         ...   N         Summary measures by time period
Time T_0 + L                  e_1L      e_2L      ...   e_NL      TSE_L, TAPE_L
...                           ...       ...       ...   ...       ...
Time T_0 + T                  e_1T      e_2T      ...   e_NT      TSE_T, TAPE_T
Summary measures by series    SSE_1,    SSE_2,    ...   SSE_N,    Overall summary measures:
                              SAPE_1    SAPE_2          SAPE_N    SSE.., TSE.., SAPE.., TAPE..
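To make Exhibit 1 concrete, the following sketch computes the four summary measures from an N x T matrix of L-step ahead errors. (Python is used for all code examples in this rendering; the paper itself contains none, and all function and variable names are illustrative. The paper leaves the exact normalisation of SSE and TSE implicit; means are assumed here.)

```python
import numpy as np

def summary_measures(errors, actuals):
    """Summarise an (N series x T periods) matrix of L-step ahead
    forecast errors, as in Exhibit 1.

    errors[i, t]  = e_it, the L-step ahead error for series i at period t
    actuals[i, t] = the realised value, needed for the percentage errors
    """
    ape = 100.0 * np.abs(errors / actuals)       # absolute percentage errors
    return {
        "SSE_i":  np.mean(errors ** 2, axis=1),  # by series: one value per series
        "SAPE_i": np.mean(ape, axis=1),
        "TSE_t":  np.mean(errors ** 2, axis=0),  # by time period: one value per period
        "TAPE_t": np.mean(ape, axis=0),
        "SSE..":  np.mean(errors ** 2),          # overall summary measures
        "TAPE..": np.mean(ape),
    }
```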
Without the stationarity assumption, however, many of the conclusions that have been drawn from the various forecasting competitions might be undermined by nonstationary sampling variability across time.¹

¹ Jenkins also criticizes the practice of aggregating error statistics across lead time, a criticism with which I agree. In this paper all analyses are carried out for a number of pre-specified lead times.

1.2. Models of forecast error in inventory control

Forecasting comparisons of the type we describe are necessary in production and inventory applications. A single general loss function can seldom capture the complexities of how forecasting models are used. For example, Price and Sharp (1986) show how standard statistical accuracy measures inadequately capture the specific loss function appropriate to investment in electricity supply. Nevertheless, in inventory control applications, where in practice simple inventory control schemes are used, one important measure of loss is the safety stock investment (although it neglects stock-out costs), K Σ_i C_i σ_i, where K is the safety factor, C_i is the unit value of the ith product and σ_i is the forecast error standard deviation over the replenishment lead time. However, forecasting researchers have, perforce, adopted a wide range of general loss functions, typically based on squared or percentage error measures, hoping the practitioner trying to use the results would find a loss function that approximately matched the problem in hand.

The (fixed lead time) errors (and associated loss functions such as SSE) for a particular series are, in theory, well-behaved, with each forming their own empirical distribution which can be compared with an appropriate theoretical distribution and summarised by σ_i, Mean Square Error, MAPE, etc. However, the error distributions and the error statistics calculated across series for a given time period are less easily interpreted unless the data series can be thought of as a sample from a well-defined population of series. Without that necessary homogeneity, the MSE is uninterpretable, as discussed by Gardner (1983), Newbold (1983) and Fildes (1983); the MSE statistics and the mean standard deviation are scale dependent, and a method can be ranked first on MSE just because it scored particularly well on perhaps the one series with typical errors significantly larger than those of the remaining series [Chatfield (1988)]. Scale dependence is a severe weakness in an error measure applied to business problems, in that the unit of measurement in which a series is recorded is often arbitrary. Similar, though less significant, criticisms can be levelled at the MAPE, as Gardner (1983) points out, and these again arise because the sampling distribution for MAPE measured across series is often badly positively skewed. MAPE also suffers by being sensitive to location, in that a change of origin in the data affects the MAPE. However, the origin in business series is typically well-defined. Because of its widespread use we consider MAPE in our subsequent analysis.

The problem posed by scale dependence can be understood by considering the following model. Suppose the squared errors are of the form:

e²_itM(L) = u_{i,t+L} * ε²_itM(L),

where e_itM(L) is the L-step ahead error made in forecasting period t + L for series i using method M, and where '*' could represent either an additive or multiplicative relationship. The u_it are assumed positive; further, they are assumed to subsume any serial relationship in the errors, and the {ε_itM(L)} are assumed to be stationary. The u_it can be thought of as errors due to the particular time period affecting all methods equally, while the {ε_itM(L)} are the method-specific errors. Dropping the lead time for convenience, such a model effectively represents the case where the data (and errors) are contaminated by occasional outliers. When such a model holds, a 'Series Squared Error' measure, SSE_iM, such as Σ_t e²_itM, could potentially be dominated by the {u_it}, and accordingly, comparison of two forecasting methods, M_1 and M_2, would also be unduly affected.

The inventory control loss function should not depend on {u_it}. As argued earlier, they are best seen as occasional outliers, and therefore any automatic forecasting system should not be geared to respond to such extremes. Rather, they should be dealt with by an exception monitoring scheme. Particularly important for the purposes of this paper, the choice of forecasting method to employ should not be dependent on such extremes, except in so far as one method extracts whatever
information there may be in an outlier better than another. The contamination of the SMSE by series can be overcome by adopting the Geometric Root MSE.² The Geometric Root Mean Square Error by series (SGRMSE_i) for a particular series is calculated as:

SGRMSE_i = (Π_{t=1..n} e²_it)^(1/2n),

where n = T - L + 1 is the number of effective data points. Using the model e²_1t(L) = u_{1,t+L} ε²_1t(L), the relative GRMSE (across time) of method 1 compared to method 2 is

Relative GRMSE = SGRMSE_1 / SGRMSE_2 = (Π_t ε²_1t(L) / Π_t ε²_2t(L))^(1/2n),

which is independent of {u_it} if they are assumed to have a multiplicative effect. Both absolute and relative error measures have the advantage over the conventional RMSE that they are scale independent. Interpretation is straightforward: a method with GRMSE 10% greater than an alternative has on average a 10% larger absolute error per series. In addition, if a forecasting method has an absolute error on one series 10% higher than the alternative and on another 10% lower, the GRMSEs are equal when averaged across these two series. E.g. suppose for series 1 method 1 has an error (SE) of 110 compared to method 2 at 100, while on series 2 method 1 has an SE of 1000 compared to method 2's 1100. Then:

Relative GRMSE = ((110/100) * (1000/1100))^(1/2) = 1,

while the

Relative Arithmetic RMSE = {(110² + 1000²)/(100² + 1100²)}^(1/2) = 0.911.

The Relative Arithmetic RMSE suggests that method 2 is better than method 1, but this result arises solely because of its better performance on series 2, the series with the larger observations.

Now if the Geometric Root Mean Squared Error by series (SGRMSE) can be modelled as log normal, then the Relative SGRMSE is also log normal.³ Similarly, the GRMSE by time period may be assumed log normal, and the summary measures of the distributions by series and by time period have the important property that they are equal, i.e.

Relative SGRMSE.. = Relative TGRMSE.. = Relative GRMSE..,

where the double dot notation implies the statistics have been calculated over both the available time periods and the N data series. In summary, the following assumptions have been made which, if supported by the data, argue strongly in favour of including Relative GRMSE as one of the range of descriptive statistics useful in interpreting competitions. (1) Relative Geometric Root Mean Squared Error by Series (SGRMSE) and Relative GRMSE by Time Period (TGRMSE) can be represented by the log normal distribution for all lead times. (2) The one-step ahead relative squared errors are independent, while the relationship in the absolute squared errors may well reflect serial correlation in the {u_it}. (3) The raw squared error distributions, both by series and by time period, contain multiplicative outliers. As a consequence, the relative error distributions will be better behaved and more readily interpretable than the absolute error distributions. Thompson (1990) has proposed a variant of the relative GRMSE for similar reasons to those just proposed, the log relative MSE. Other error measures such as the MAPE, Median APE, etc. and their relative variants will depend to a greater or lesser extent on the {u_it}.⁴
² The measure has long been known for averaging percentages and was first used in forecasting comparisons by Newbold and Granger (1974).
³ Strictly, for a linear combination of two normal variables to be necessarily normal, the two must be multivariate normal.
⁴ A referee notes that the mean squared percentage error is often used in econometric work. While being scale independent, it is likely to suffer from being unduly affected by outliers; it is also location dependent and is therefore likely to be skewed.
Equally important, all measures based on APE suffer from the lack of equivalence across series and across time.
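As a minimal sketch of this argument, the code below computes the GRMSE of a series of errors and the relative GRMSE of two methods; simulating the multiplicative contamination model e²_t = u_t ε²_t shows the period effects u_t cancelling exactly in the ratio. The simulated numbers are illustrative only.

```python
import numpy as np

def grmse(errors):
    """GRMSE of one series' forecast errors: (prod_t e_t^2)^(1/2n),
    computed via logs to avoid overflow."""
    e2 = np.square(np.asarray(errors, dtype=float))
    return float(np.exp(np.log(e2).mean() / 2.0))

def relative_grmse(errors_1, errors_2):
    """Relative GRMSE of method 1 vs method 2 on the same series."""
    return grmse(errors_1) / grmse(errors_2)

# Contamination model e_t = sqrt(u_t) * eps_t: the same period effects
# u_t hit both methods, so they cancel in the relative measure.
rng = np.random.default_rng(0)
u = np.exp(rng.normal(0.0, 2.0, size=50))   # occasional large outliers
eps1 = rng.normal(0.0, 1.0, size=50)        # method-specific errors
eps2 = rng.normal(0.0, 1.1, size=50)
print(relative_grmse(np.sqrt(u) * eps1, np.sqrt(u) * eps2))
print(relative_grmse(eps1, eps2))           # identical: the u_t cancel
```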
1.3. An ideal error measure?

Where there is no appropriate problem-specific loss function, any error measure should ideally offer as full a summary of the corresponding error distribution as possible. The summary error measures such as SSE_i, SAPE_i, TSE_t and TAPE_t (defined in Exhibit 1) should therefore be efficient estimators of some parameter of interest from the corresponding underlying error measure distribution. The distribution itself should be well behaved, in that there should be few outliers, and it should be stationary over time. The error measure should not be scale dependent. Without these modest properties the analyst could find him/herself, for example, examining the mean of a highly skewed distribution with outliers, forgetting that this offers little if any useful information about the distribution. Thus, the choice of error measures to summarize the error distribution should not merely be a question of personal preference, as one commentator on an earlier draft of this paper noted; rather, the forecaster must establish appropriate scaling and distributional assumptions for the data under analysis. As a consequence of these constraints there are only a limited number of valid error measures that can reasonably be adopted.

The aim of this paper is: (1) to examine the empirical statistical properties of certain error measures in order to illustrate various patterns in their behaviour and to suggest reasons why some measures are likely to be better behaved than others; (2) to show how the sampling variability of error measures can lead to mis-interpreting the results from forecasting competitions. The results of the case study that lies at the heart of this analysis also underline the data-specific nature of the accuracy rankings of alternative forecasting methods in any forecasting competition. Methods that performed well in the M-Competition turn out, on these data series, to be substantially less accurate than methods designed to capture the particular characteristics of the data being analysed.

2. Forecasting methods and data

2.1. The data

The data used for this experiment on the evaluation of forecasting methods were collected from a telephone operating company and record the number of circuits in service by locality, available for special telecommunications services such as digital transmission. Forecasts for these series are required monthly in the company for at least 12 months ahead and form the basis for its investment in circuits and exchanges. The data are from a single US state, collected over the same time scale, and therefore form a homogeneous population of time series. Grambsch and Stahel (1990) discuss their characteristics in more detail. They are non-seasonal with trend.⁵ An illustrative graph of two of the series is shown in Exhibit 2.

⁵ Since the data under analysis here are non-seasonal, consideration is limited to non-seasonal forecasting models.

2.2. Forecasting methods

Comparisons of forecasting performance have become increasingly popular, as the review by Armstrong (1984) makes clear. The major study was carried out by Makridakis and his colleagues (1982), which attempted to synthesise the earlier research and overcome the criticisms that had been levelled at these so-called forecasting competitions. The M-Competition and the various authors who analysed its results have, over the years since publication, moved towards a consensus on which methods performed well overall; see, for example, Fildes and Lusk (1984) and Makridakis (1983). The methods which performed particularly well in the analysis of the M-Competition data were: (0) exponential smoothing, (1) Holt's method of exponential smoothing, which includes trend, (2) Gardner-McKenzie's damped trend (1985). Preliminary analysis of the data used in this study [confirmed by the analysis performed by Grambsch and Stahel (1990)] showed most of the series to contain strong negative trends, with only six exceptions.
This suggested that simple exponential smoothing would perform badly, and preliminary analysis confirmed this. It has therefore been excluded from subsequent analysis. Formulae for the above models are given in Gardner (1985). The optimal smoothing parameters have been calculated based on the first 23 data points, leaving 48 data points for the subsequent analysis of ex ante errors. Initialisation has been carried out on the first 8 data points using Gardner's approach, and a sequential grid search adopted to establish optimal parameters. The Gardner-McKenzie formulae were used, although the grid search included all three parameters for the Damped Trend model, in contrast to Gardner's AUTOCAST II. The results obtained on a sample of approximately 20 series were broadly comparable to those produced by AUTOCAST.

The data described above had already been analysed by the company from which they derive, and two methods found to perform well, as described by Grambsch and Stahel (1990). They are: (3) Robust Trend, (4) Filter. These methods are briefly described below.

Exhibit 2. Time series plot of two series.

Robust Trend estimation is based on ideas of robust regression, with the formulae shown below derived from the assumption that Y_t is best modelled as a random walk with constant trend:

ΔY_t = γ + e_t,

where the error is assumed to follow a stable distribution with shape parameter α < 2. A stable distribution has 'fat tails' and is therefore suitable for modelling data where extreme values may be regularly observed. A robust estimator of the constant trend parameter γ, based on observations up to t, is constructed from m_t, the median of the first differences {Δ_j = Y_j - Y_{j-1}} for j = 1 up to time t; s_t, the median of the absolute deviations {|Δ_j - m_t|}, j = 1 ... t; and ψ(x), a weighting function which damps out extreme deviations from the median. The ψ function used is as follows:

ψ(x) = 2x/3 for |x| < 1.5;
ψ(x) = sgn(x) for 1.5 ≤ |x| < 3;
ψ(x) = sgn(x)(6 - |x|)/3 for 3 ≤ |x| < 6;
ψ(x) = 0 otherwise,

where sgn(x) = x/|x|. The ψ function differentially weights the observed values of the trend, ΔY_t, depending on their absolute size, giving less weight to the extremes and the central observations and more weight to the typical observations. It therefore damps the effect of outliers.
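The ψ function is fully specified above and is sketched below. The trend estimator itself cannot be fully recovered from the source, so robust_trend is only an illustrative one-step M-estimate built from the quantities the text names (m_t, s_t and ψ), not the paper's exact formula.

```python
import numpy as np

def psi(x):
    """The three-part redescending weight function from the text:
    linear below 1.5, flat to 3, descending to zero at 6."""
    ax, s = np.abs(x), np.sign(x)
    return np.where(ax < 1.5, 2.0 * x / 3.0,
           np.where(ax < 3.0, s,
           np.where(ax < 6.0, s * (6.0 - ax) / 3.0, 0.0)))

def robust_trend(y):
    """Illustrative robust estimate of the constant trend gamma of a
    random walk with drift (an assumption of this sketch, not the
    paper's formula): median of first differences, corrected by
    psi-weighted deviations scaled by the MAD."""
    d = np.diff(np.asarray(y, dtype=float))   # observed trends, delta Y_t
    m = np.median(d)                          # m_t
    s = np.median(np.abs(d - m))              # s_t, median absolute deviation
    if s == 0.0:
        return m
    return m + s * np.mean(psi((d - m) / s))  # one-step M-estimate
```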
The fourth method is a simple filtering algorithm [Ionescu-Graff (1982), Pack and Whitaker (1982)] that proves to be equivalent to Holt's model [Gardner (1985)]. The method also included a smoothing routine which attempts to alleviate the effects of large jumps. The filter equations are as follows:

μ_t = μ_{t-1} + β_{t-1} + k_1 e_t,
β_t = β_{t-1} + k_2 e_t,
Ŷ_t(L) = μ_t + L β_t,

where Ŷ_t(L) is the L-step ahead forecast made at time t for Y_{t+L}. This Filter was implemented with k_1 and k_2 assumed fixed at 1.0 and 0.07 respectively across the whole population of series. The values had been established after substantial prior experimentation by earlier researchers.
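A minimal sketch of the filter recursions follows. The initialisation (first observation as level, zero trend) is an assumption of this sketch, and the paper's jump-smoothing routine is omitted.

```python
def filter_forecast(y, k1=1.0, k2=0.07, horizon=12):
    """Run the filter recursions over a series y and return the
    1..horizon step ahead forecasts from the final time point."""
    mu, beta = float(y[0]), 0.0        # assumed initialisation
    for obs in y[1:]:
        e = obs - (mu + beta)          # one-step ahead forecast error
        mu = mu + beta + k1 * e        # mu_t = mu_{t-1} + beta_{t-1} + k1*e_t
        beta = beta + k2 * e           # beta_t = beta_{t-1} + k2*e_t
    return [mu + L * beta for L in range(1, horizon + 1)]
```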
The random walk model (with no drift term) was used as a basis for comparison on the grounds that, despite its simplicity, it had proved a strong performer in the M-Competition. The data series were also non-stationary, which again suggested the random walk as a benchmark.
3. Comparative error statistics

3.1. Comparisons by series
Summary statistics and exploratory data analysis

Exhibit 3a. Multiple Box-and-Whisker plot - Lead 6. RMSE by series (n = 263).

Exhibit 3b. Multiple Box-and-Whisker plot - Lead 6. MAPE by series (n = 263).

Summary statistics based on squared error (SE) or absolute percentage error (APE) are typically used as the basis for selecting between alternative forecasting methods. However, they are only useful in so far as they summarize an underlying error measure distribution. In this section we examine a variety of statistics in the context of their empirical distribution in an attempt to evaluate their appropriateness. A fuller empirical analysis is described in Fildes (1989b). To illustrate the shape of the error measure distributions, Exhibits 3a-b show multiple Box-Whisker plots for Root Mean Squared Error (RMSE) and the Mean Absolute Percentage Error (MAPE) for lead 6, which is illustrative of the results for all lead times. The Geometric RMSE and the Median APE are similar, though the latter has more outliers. The series are positively skewed with many outliers. The results for the two exponential smoothing methods are similar
to the Filter and Robust Trend, except that the medians of the distributions are larger and there are more extreme values. Similar characteristics can be seen whatever error measure is graphed, although the APE statistics are more distorted by extremes, particularly the MAPE, with two of the series declining to zero in the forecast evaluation period. Since forecast evaluations are usually comparative, the SRMSEs and SGRMSE by series relative to the Random Walk are illustrated for lead 6 in Exhibits 4a-b using a Box-Plot. The distribution of Relative RMSE for all leads is approximately symmetrical, with positive tails that are slightly too long to be strictly normal for Robust Trend and the Filter, while the two exponential smoothing methods give rise to many more outliers. The Box-Plots for the relative SGRMSE are similar, with thicker tails. Detailed statistics for these distributions can be found in Fildes (1989b), while a summary is given in Exhibit 5. The means shown are both geometric and arithmetic means but, because the relative error data are percentages, they are better summarised by the geometric mean. The exhibits underline the poor performance of the two exponential smoothing models for all leads. Their relative performance is uniformly poor whatever error measure is used.

The error distribution

The sample statistics of the four relative performance measures were modelled as normal or
log normal. [Williams and Goodman (1971) had previously found the Gamma distribution satisfactory in modelling the absolute error.] From the argument proposed in section 1.2 on the need for equivalence between an analysis based across time periods and one based across series, the latter is a priori the more suitable model. Exhibit 6 shows the preferred distribution (L for lognormal, N for normal) identified through a chi-squared goodness-of-fit test. Occasional outliers have been removed where necessary.

The similarity in aggregate error performance between methods does not imply that choosing the method appropriate for each series is worthless. Examination of the relative error performance of Robust Trend versus the Filter for leads 1, 6 and 12 respectively shows that relative performance varies substantially series-by-series and, in principle, the appropriate choice of method could lead to major forecasting gains. For example, the retrospective identification of the most accurate method (Robust Trend or Filter) for each series leads to a Relative GRMSE by series of 0.73, 0.40 and 0.34 for leads 1, 6 and 12 respectively, which, when compared to the strategy of using Robust Trend alone, gives gains of 4%, 9% and 13% for the three leads. However, Fildes (1989a) analysed a number of method selection rules that could be used ex ante to identify the 'best' method, and they typically delivered only a small percentage of this potential.

The error measures

Does the use of a variety of error measures offer us additional evidence on performance above that obtainable from a single measure?
Exhibit 4a. Multiple Box-and-Whisker plot - Lead 6. Relative RMSE by series (n = 263).

Exhibit 4b. Multiple Box-and-Whisker plot - Lead 6. Relative GRMSE by series (n = 263).
Spearman's rank correlation coefficients for the measures of relative performance of Robust Trend estimation versus the Filter were calculated for the different lead times. The measures are significantly positively correlated, with coefficients varying from 0.25 to 0.85.

Exhibit 5
Means by series: relative SRMSE, relative SGRMSE, relative SMAPE and relative SMdAPE, all series (n = 263). (The arithmetic mean is in parentheses.)

Lead 1:
Method         RMSE         GRMSE        MAPE         MdAPE
Holt           0.97 (1.25)  0.89 (1.22)  0.91 (1.21)  0.86 (1.27)
Damped Trend   1.02 (1.47)  1.02 (1.58)  0.95 (1.50)  0.92 (1.62)
Robust         0.83 (0.84)  0.76 (0.79)  0.78 (0.79)  0.70 (0.74)
Filter         0.83 (0.84)  0.78 (0.81)  0.79 (0.80)  0.73 (0.77)

Lead 6:
Method         RMSE         GRMSE        MAPE         MdAPE
Holt           0.76 (0.86)  0.55 (0.66)  0.67 (0.76)  0.60 (0.71)
Damped Trend   0.86 (0.97)  0.70 (0.86)  0.79 (0.92)  0.74 (0.88)
Robust         0.61 (0.64)  0.44 (0.50)  0.54 (0.58)  0.48 (0.53)
Filter         0.63 (0.66)  0.45 (0.52)  0.55 (0.59)  0.49 (0.55)

Lead 12:
Method         RMSE         GRMSE        MAPE         MdAPE
Holt           0.73 (0.87)  0.49 (0.65)  0.63 (0.76)  0.55 (0.69)
Damped Trend   0.82 (0.93)  0.66 (0.82)  0.76 (0.88)  0.70 (0.84)
Robust         0.54 (0.59)  0.39 (0.49)  0.47 (0.54)  0.43 (0.52)
Filter         0.59 (0.64)  0.41 (0.51)  0.50 (0.57)  0.46 (0.55)
Exhibit 6
Goodness-of-fit test statistics, normal v. lognormal: acceptable distributions at the 10% level (L = lognormal, N = normal), for relative RMSE, GRMSE, MAPE and MdAPE at leads 1, 6 and 12 (Holt, Damped Trend, Robust, Filter). ᵃ Accepted at the 1% significance level (rejected at the 5% level). ᵇ Accepted at the 5% significance level (rejected at the 10% level).
However, the discrepancies are sufficient to require the adoption of a variety of measures when presenting the results, so long as they make theoretical sense.

Outlying observations affect the two exponential smoothing techniques substantially. The results are catastrophic for one or two series, but examination of the distribution of the error statistics suggests the two methods perform substantially worse than either Robust Trend estimation or the Filter on the majority of series, for all error measures. An outlier replacement routine was incorporated into the exponential smoothing methods and the results were worse.⁶ The cause of this was that the replacement of an observation that had been preliminarily identified as an outlier prevented the smoothing routines from updating their 'level' estimates successfully. In Holt's version of exponential smoothing the smoothing equations depend on the latest observation Y_t and the latest trend, Y_t - Y_{t-1}. An extreme observation affects both the level estimate and the trend estimate. When the data are close to a random walk with drift, an appropriate model would accept the latest (extreme) observation as informative about the level, but would smooth the trend estimate. In effect, the outliers seen in these series were in the 'trend observation', not the 'level'.

⁶ The outlier elimination routine identified observations as outliers based on the parameters from the initial rough grid search; these were replaced by the estimates and the parameter grid search restarted.

Rank scores have also been calculated, the results confirming the poor performance of the smoothing methods on most of the series [Fildes (1989b)]. Exhibit 7 illustrates the same point, showing the pairwise relative performance of the four methods measured by GRMSE and RMSE. The entries show the percentage of series where the method named in row i outperforms the method named in column j. For example, in Exhibit 7, for lead 1, Damped Trend outperforms Holt's Exponential Smoothing 46% of the time, performance measured by GRMSE. Relative GRMSE and MdAPE by Series produce similar performance statistics, while relative RMSE and relative MAPE are similar, again underlining the effects of extreme errors on the latter two measures.

Analysis by series - Conclusions

The following propositions have found support through the exploratory data analysis described above: (1) The distribution of both SRMSE and SMAPE by series is non-normal. (2) The error distributions for all five forecasting methods are contaminated by outliers. (3) The relative error measures for Robust Trend Estimation and the Filter can be described by either the normal or the log-normal distribution. Using relative measures leads to substantially fewer outliers for all leads. (4) Substantial differences in accuracy obtain for all lead times, all error measures and all forecasting methods. (5) In aggregate, the Robust Trend and the Filter outperform the two smoothing methods substantially, while Damped Trend is outperformed by Holt's Exponential Smoothing. The random walk is consistently substantially worse than the other methods. (6) Relative Geometric RMSE by Series (relative SGRMSE) is a well behaved distribution for Robust Trend and the Filter, and is easily interpreted. It is to be preferred to Relative MSE. (7) Relative Median APE by Series (relative SMdAPE) is a well behaved distribution for Robust Trend and the Filter. It is to be preferred to Relative SMAPE, but it is not readily interpretable. It has similar performance characteristics to the relative geometric root mean square error. (8) The lognormal distribution characterises the Relative SGRMSE and SMdAPE by Series.
Exhibit 7
Pairwise comparisons of forecasting methods by series: % of series where method (in row) outperforms method (in column). Accuracy measured by GRMSE, with RMSE in parentheses.

                    Lead 1                        Lead 6                        Lead 12
Method              Holt     DT       Rb         Holt     DT       Rb         Holt     DT       Rb
Damped Trend (DT)   46 (47)                      31 (34)                      36 (44)
Robust (Rb)         71 (79)  73 (83)             73 (81)  83 (86)             69 (76)  81 (84)
Filter              63 (79)  70 (85)  45 (42)    69 (78)  80 (85)  44 (38)    56 (68)  79 (80)  41 (36)
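The entries of Exhibit 7 reduce to a simple computation: for each ordered pair of methods, the proportion of series on which the row method has the smaller error measure. A sketch, with illustrative names:

```python
import numpy as np

def pct_outperforms(measure_row, measure_col):
    """% of series where the row method beats (has strictly lower
    GRMSE or RMSE than) the column method, as in Exhibit 7."""
    row = np.asarray(measure_row, dtype=float)
    col = np.asarray(measure_col, dtype=float)
    return 100.0 * np.mean(row < col)

# e.g. pct_outperforms(grmse_damped_trend, grmse_holt) would give the
# 46% lead-1 entry, given per-series GRMSEs for the 263 series.
```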
Exhibit 8. Multiple Box-and-Whisker plot - Lead 6. GRMSE by time period.

Exhibit 9
Geometric mean of GRMSE and median of MdAPE by time (all leads). (Geometric mean of the raw data = 2525.)

                Lead 1 (n = 48) ᵃ           Lead 6 (n = 43)             Lead 12 (n = 37)
Method          GMn(GRMSE)  Md(MdAPE)       GMn(GRMSE)  Md(MdAPE)       GMn(GRMSE)  Md(MdAPE)
Holt            16.8        0.83            66.6        3.5             125         6.8
Damped Trend    18.0        0.89            84.1        4.6             168         8.0
Filter          14.6        0.71            54.6        2.8             103         5.5
Robust          14.3        0.69            52.8        2.8             98.1        5.4
Random Walk     19.7        0.90            120         6.4             252         12.8

ᵃ For lead 1, two outlying observations have been omitted for the reasons explained in the text.
Exhibit 10
Relative geometric root MSE (Relative TGRMSE) by time period (all leads). NB. The Geometric Means of these entries approximate the Geometric Means by series shown in Exhibit 5. They differ only due to rounding error and the omission of the outlying observations for lead 1.

Lead 1 (n = 48):
Method         Min    25     Md     GMn    75     Max
Holt           0.65   0.73   0.81   0.84   0.92   1.1
Damped Trend   0.75   0.83   0.88   0.90   0.96   1.2
Robust         0.55   0.61   0.70   0.71   0.76   1.0
Filter         0.54   0.62   0.71   0.72   0.82   1.1

Lead 6 (n = 43):
Method         Min    25     Md     GMn    75     Max
Holt           0.45   0.51   0.53   0.56   0.62   0.87
Damped Trend   0.60   0.66   0.70   0.70   0.73   0.92
Robust         0.36   0.40   0.42   0.44   0.46   0.79
Filter         0.33   0.38   0.42   0.46   0.50   0.83

Lead 12 (n = 37):
Method         Min    25     Md     GMn    75     Max
Holt           0.41   0.44   0.48   0.49   0.52   0.75
Damped Trend   0.61   0.63   0.66   0.66   0.69   0.81
Robust         0.32   0.35   0.38   0.39   0.43   0.53
Filter         0.32   0.35   0.37   0.41   0.45   0.66
Exhibit 11
Pairwise comparisons of forecasting methods by time period, TGRMSE: % of time periods where method (in column) outperforms method (in row), for Holt, Damped Trend (DT), Robust (Rb) and Filter, at leads 1, 6 and 12.
3.2. Comparisons by time period

Summary statistics and exploratory data analysis
Selection of a forecasting method is usually based on summary error statistics measured across series for a particular time period or, more generally, for a number of periods. However, these statistics can only be illuminating in so far as they provide good summary descriptions of the underlying distribution of the error measures. Of the measures we have considered so far, only the Geometric Root Mean Squared Error by Time Period (TGRMSE) and the Median Absolute Percentage Error (TMdAPE) pass this test, for the reasons explained in section 1.2. Full summary statistics and graphs for the 48 cross-sections (43 for lead 6, 37 for lead 12) are available in Fildes (1989b). Their distributions are again skewed with a small number of outliers, as Exhibit 8 illustrates for the particular case of lead 6, GRMSE. A numerical summary is shown in Exhibit 9. (The 1-step ahead forecast errors of the Random Walk show two extreme outliers where the MdAPE is close to zero, suggesting data errors.) The Filter shows greater variability in performance compared to Robust Trend, while the two exponential smoothing methods perform noticeably worse than either the Filter or Robust Trend. The Random Walk is consistently poorest.
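A sketch of the by-time-period statistic: the GRMSE is taken across the N series within each cross-section and then expressed relative to a benchmark, giving one value per period as summarised in Exhibit 10. Names are illustrative.

```python
import numpy as np

def tgrmse(errors):
    """GRMSE by time period: errors is an (N series x T periods) matrix
    of L-step ahead errors; returns one GRMSE per cross-section."""
    e2 = np.square(np.asarray(errors, dtype=float))
    return np.exp(np.log(e2).mean(axis=0) / 2.0)

def relative_tgrmse(method_errors, benchmark_errors):
    """One relative TGRMSE per time period, e.g. against the random walk."""
    return tgrmse(method_errors) / tgrmse(benchmark_errors)
```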
Exhibit 12. Geometric RMSE by time, leads 1, 6 and 12.

Following the arguments of the previous section, the relative summary statistics based on GRMSE were computed as shown in Exhibit 10. The relative median statistics are similar [Fildes (1989b)]. Analysis of the corresponding Box-Whisker plot of the relative performance data for lead 6 shows somewhat skewed distributions, with the Filter being less reliable than Robust Trend in that it has higher dispersion, particularly at leads 6 and 12. Again Holt outperforms Damped Trend substantially and is somewhat more skewed. Using a Chi-squared test, the resulting distributions of the relative error statistics were tested for log-normality and normality. In general either distribution was supportable, with log-normality slightly preferred, adding further weight to the model of the errors proposed in section 1. The five methods were ranked on both error measures. As expected from the analysis across time by series, Robust Trend Estimation and the Filter outperform the other methods over almost all time periods. The pairwise comparison in Exhibit 11 shows the percentage of time periods that method A (shown in column) outperforms
method B (shown in row), as measured by TGRMSE (the TMdAPE statistics are similar), for the four methods measured across time for each lead time. The exhibit shows that where the relative performance by series is similar (as is shown in Exhibit 5, for example) there is almost a 50-50 chance that a particular method outperforms its competitor in a given cross-section, but as relative performance diverges this increases towards 100%. As an example, a difference between the two exponential smoothing methods of 6% (0.89 compared to 0.95) at lead 1 leads to a 75% to 25% advantage in favour of Holt, whereas a differential of 21% (0.55 to 0.70) at lead 6 leads to only a single cross-section favouring Damped Trend.

Cross-sectional and serial dependence and their effects on testing differences between methods
The summary statistics shown above raise the question of whether there is any significant difference between methods. Standard tests assume independence while, in contrast, the error statistics by method are cross-correlated for all lead times [Fildes (1989b)]. The cross-correlations of the relative performance measures are no better behaved. As a consequence, an eye-ball interpretation of significance based on the raw differences between TMdAPEs, say, is likely to be biased towards a finding of no significant difference between methods. The autocorrelation structure of the various error statistics was examined to check for independence: the first order autocorrelation coefficients for TGRMSE were generally significant, varying between 0.08 and 0.49 for lead 1, 0.64 to 0.82 for lead 6, and 0.63 to 0.89 for lead 12. The figures for TMdAPE show a similar pattern. Under standard assumptions about the residuals from a well-specified model the results for lead 1 should be independent, while dependence at longer lead times is a natural consequence of independent one-step ahead errors. For lead 1, the Relative TGRMSEs proved to be independent. As Exhibit 7 shows, at issue is the relative performance of (1) Robust Trend versus the Filter, and (2) Damped Trend versus Holt's exponential smoothing. From the earlier analysis the distribution of the relative performance statistics can be assumed lognormal. Testing (1) above by taking logs to account for the
lognormality, the null hypothesis that the mean of the distribution is 1 is rejected at the 3.7% (2.3%) level for the TGRMSE (TMdAPE), with a confidence interval of 0.961 to 0.999 (0.951 to 0.996). Testing (2) above, the differences are significant at less than the 1% level. The dependence in the error term for leads 6 and 12 needs to be eliminated before a test of a difference in means can be carried out. For lead 6 the relative performance statistics by time period for TGRMSE (TMdAPE) can be modelled as a first order (second order) AR model, but the mean is insignificantly different from 1. For lead 12, both statistics can be modelled as AR(2) with means insignificantly different from 1 and p-values of 0.13 (0.14). However, the significance or otherwise of the observed differences is heavily dependent on the data base used, as Exhibit 12, showing time series graphs of the GRMSE by time period, illustrates.⁷ The lognormality assumption is also important. Testing Damped Trend Smoothing versus Holt's Exponential Smoothing, the much more substantial observed differences are significant based on an AR(1) model for the TMdAPE. The Relative TGRMSE by time period is not well characterised by a simple ARIMA model because of complex non-stationarity.

⁷ The decision on whether to use backcasting in the ARIMA model building also has implications for significance, as a glance at the graphs shows.
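The lead-1 test just described can be sketched as follows: take logs of the per-period relative statistics (lognormality makes the log-ratios approximately normal) and test for a zero mean. This is valid as written only where the cross-sections are independent; the AR-model adjustment described above would be needed for leads 6 and 12. scipy is assumed available.

```python
import numpy as np
from scipy import stats

def test_relative_tgrmse(rel):
    """Test whether the mean relative TGRMSE differs from 1 by testing
    the log-ratios against zero; returns the p-value and a 95% CI
    mapped back to the ratio scale."""
    x = np.log(np.asarray(rel, dtype=float))   # H0: mean log-ratio = 0
    t, p = stats.ttest_1samp(x, 0.0)
    lo, hi = stats.t.interval(0.95, len(x) - 1,
                              loc=x.mean(), scale=stats.sem(x))
    return p, (np.exp(lo), np.exp(hi))
```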
Mis-identification and comparative forecasting performance

In selecting between methods, Robust Trend estimation and the Filter compete in terms of accuracy, while the two exponential smoothing methods are also close alternatives. The relative performance of these two pairs is shown in Exhibit 13, measured by the geometric mean across time of the TGRMSEs and the median of the TMdAPEs. The summary statistics by series have classified relative performance ex post, but the forecaster, wishing to choose between methods, must make an ex ante analysis. Jenkins (1982) has criticized the approach adopted in forecasting competitions when a single cross-sectional performance statistic is used for ex ante selection.
Exhibit 13
Aggregate performance across time - All leads. (Geometric mean of the Relative TGRMSE; medians of the TMdAPE.)

                          Lead 1 (n = 48)           Lead 6 (n = 43)           Lead 12 (n = 37)
Competitive methods       GMn(GRMSE)  Md(MdAPE)     GMn(GRMSE)  Md(MdAPE)     GMn(GRMSE)  Md(MdAPE)
Robust Trend vs. Filter   0.98        0.99          0.97        1.00          0.95        0.95
Damped Trend vs. Holt     1.09        1.05          1.26        1.30          1.35        1.39
This criticism by Jenkins can be evaluated by considering how often a single cross-sectional analysis leads to inappropriate conclusions on the relative performance of the various methods. The likely errors of mis-identification that may be made are twofold: to decide that there is no difference between the methods when the differences are substantial, or to decide on a substantive difference when the differences are minor (although in an operational context the latter error is of little consequence). In the discussion that follows we will assume 'mis-identification' occurs when the relative cross-sectional errors favour the poorer performer in aggregate.

The Robust Trend and the Filter have similar aggregate performance, although by lead 12 the gains from selecting Robust Trend would be worth making. This is illustrated in Exhibit 14, which shows the size of the relative errors in the cross-sections, categorized by lead time, and the relative means of the population GRMSE. Examination of the exhibit makes it clear, however, that the variability in performance is sufficiently substantial to confuse the analyst relying only on a single cross-section. For example, for lead 1, when the two methods perform equally in aggregate, the variability in performance is such that in some 6/48 periods the better method was outside the range 0.9-1.1.⁸ Furthermore, in 20 of the 48 periods the Filter would have been identified as better than Robust Trend (Exhibit 12 shows this graphically). This suggests that apparently extreme performance over a cross-section, even if based on a large sample of series, is often due to nothing more than sampling variability. The variability for lead 6 is higher, with 15/43 periods outside the 10-point range and 21 of the 43 periods leading to mis-identification. Here, of course, the mis-identification is close to costless in that aggregate performance is similar, although with MSE adopted as the appropriate loss function the incremental losses have been shown to be substantial (Fildes, 1989a). Once again the variability is such that any forecaster, content with analysing a single cross-section, could easily draw the false conclusion that there is a substantial difference in relative performance. For lead 12, where the two methods differ most markedly, 13 out of the 37 periods lead to a misclassification.

⁸ If the M-Competition had similar variability to the experiment described here, there would be a strong possibility of major misclassification errors.

Exhibit 14
Sampling errors across time when estimating the relative GRMSE. (Entries are numbers of periods by size of the relative % error in the cross-section.)

Robust Trend vs. Filter
Lead   Approximate Rel. Mean   0-5   5-10   >10
1      1.0                     28    14     6
6      1.0                     16    12     15
12     0.95                    15    8      14

Damped Trend vs. Holt
Lead   Approximate Rel. Mean   0-5   5-10   >10
1      1.05                    13    17     18
6      1.25                    11    14     18
12     1.35                    17    7      13

Holt's exponential smoothing is significantly
the better performer relative to Damped Trend smoothing, the discrepancy growing with lead time from approximately 5% at lead 1 to 35% at lead 12. At lead 1, 12 out of 48 periods lead to a misclassification, with 18 periods outside the interval (0.95, 1.15). The increasing disparity in performance for the longer leads is sufficient to limit misclassification errors.

Analysis by time - conclusions

(1) The relative performance statistics are adequately described by the log normal distribution. (2) The (relative) performance statistics are highly interdependent. (3) The level of variability, time period by time period, is such as to lead to substantively incorrect conclusions on relative performance when based on a single cross-section. Studies of comparative accuracy should therefore provide an estimate of the sampling variability of their chosen accuracy measure - hardly a controversial conclusion, surely!
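One way of providing such an estimate, sketched below under the assumption of independent cross-sections (a block bootstrap would be needed for the longer leads, given the serial dependence reported above), is to bootstrap the per-period relative TGRMSEs, reporting an interval for their geometric mean and the fraction of single cross-sections that would favour the aggregately poorer method.

```python
import numpy as np

def cross_section_variability(rel_tgrmse, n_boot=2000, seed=1):
    """Bootstrap the per-period relative TGRMSEs: returns a 95%
    percentile interval for their geometric mean and the fraction of
    single cross-sections that 'mis-identify' the better method."""
    rng = np.random.default_rng(seed)
    x = np.log(np.asarray(rel_tgrmse, dtype=float))
    boots = [np.exp(rng.choice(x, size=x.size, replace=True).mean())
             for _ in range(n_boot)]
    better_is_a = x.mean() < 0.0            # method A better in aggregate
    misid = np.mean((x > 0.0) if better_is_a else (x < 0.0))
    return np.percentile(boots, [2.5, 97.5]), misid
```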
4. Conclusions

In this paper I have attempted to show that the comparative evaluation of a number of forecasting methods requires the consideration of forecast error statistics across time. It was shown that a single cross-section, even when based on a large number of data series, could not be relied on to provide strong evidence on relative performance, except in those cases where methods differ by a substantial margin. Despite the specific nature of the data, this conclusion undermines all selection based on limited (or, in the extreme, single cross-sectional) evidence; the forecaster concerned with the method selection problem must take this variability into account, otherwise he/she runs a substantial risk of selecting the less accurate method, even when there are important differences in accuracy between the two alternatives under consideration.

In the majority of cases where the loss function is not known, the forecaster must, in the first instance, consider a wide range of error statistics. There is typically a positive correlation between the various measures for the data used here, pointing to the same ranking of relative performance;
in particular, MdAPE and GRMSE have similar performance characteristics, while MAPE and RMSE suffer from similar disadvantages, the former being particularly affected by observations close to zero while the latter is sensitive to outliers in large-valued series. The standard loss function in inventory control applications for a single series is usually taken to be proportional to σ_i, its error standard deviation, and this has loosely been used to justify MSE. Without knowing the appropriate cost consequences of forecast error, however, MSE aggregated across series cannot be directly interpreted. The relative GRMSE has both desirable statistical properties (such as lognormality and additivity) and a simple interpretation as measuring the average error of one method compared to an alternative.

The results also underline the importance of the data characteristics to relative method performance. In stark contrast to the M-Competition, the Naive (random walk) model performed poorly, the reason being that the data displayed a strong and regular trend. Unlike the M-Competition series, the trends persisted, perhaps for structural reasons in the market place. Crucial to any method's performance was the successful removal of outliers. The simple Filter, in itself a special case of Damped Trend Smoothing except for its outlier routine, had substantially the better performance characteristics. In the M-Competition and subsequent publications exponential smoothing with trend (Holt) has performed strongly, and Damped Trend smoothing outperformed Holt (Gardner and McKenzie, 1985). The results here again conflict with the earlier competitions, with Holt performing poorly but beating Damped Trend Smoothing on most series and most cross-sections.

Newbold (1983) argued for an end to all competitions, believing such 'naive empiricism' tells us nothing. In contrast to his pessimistic point of view, this paper has demonstrated certain general propositions that are not generally accepted by all researchers in the forecasting field, in particular that generalisations as to relative performance from one forecasting competition to another are not valid. The results also show that, in principle, it is possible to identify better performing methods by examining how aggregate summary statistics vary by time period. Relative Geometric Root Mean Squared Error or the Relative Median
APE were recommended where the problem-specific loss function was unavailable.
References

Armstrong, J. Scott, 1984, "Forecasting by extrapolation: Conclusions from twenty-five years of research", Interfaces, 14:6, 52-66.
Carbone, R. and J. Scott Armstrong, 1982, "Evaluation of extrapolative forecasting methods: Results of a survey of academicians and practitioners", Journal of Forecasting, 1, 215-217.
Chatfield, C., 1988, "Editorial: Apples, oranges and mean square error", International Journal of Forecasting, 4, 515-518.
De Gooijer, J.G., B. Abraham, A. Gould and L. Robinson, 1985, "Methods for determining the order of an autoregressive-moving average process: A survey", International Statistical Review, 53, 379-391.
Fildes, R., 1983, "A posteriori opinions of a Bayesian forecaster", Journal of Forecasting, 2, 289-291.
Fildes, R., 1989a, "Evaluation of aggregate and individual forecast method selection rules", Management Science, 35, 1056-1065.
Fildes, R., 1989b, "The evaluation of extrapolative forecasting methods", Manchester Business School Working Paper 184, Manchester, UK.
Fildes, R. and E.J. Lusk, 1984, "The choice of a forecasting model", Omega, 12, 427-435.
Gardner, E.S. Jr., 1983, "The trade-offs in choosing a time series method", Journal of Forecasting, 2, 263-267.
Gardner, E.S. Jr., 1985, "Exponential smoothing: The state of the art", with discussion, Journal of Forecasting, 4, 1-28.
Gardner, E.S. Jr. and E. McKenzie, 1985, "Forecasting trends in time series", Management Science, 31, 1237-1246.
Grambsch, P. and W.A. Stahel, 1990, "Forecasting demand for special services", International Journal of Forecasting, 6, 53-64.
Ionescu-Graff, A., 1982, "A sequential projection algorithm for special services demand", Bell System Technical Journal, 61, 34-66.
Jenkins, G.M., 1982, "Some practical aspects of forecasting in organizations", Journal of Forecasting, 1, 3-21.
Makridakis, S., et al., 1982, "The accuracy of extrapolation (time series) methods: Results of a forecasting competition", Journal of Forecasting, 1, 111-153.
Makridakis, S., 1983, "Empirical evidence versus personal experience", in: J. Scott Armstrong and E.J. Lusk, "Commentary on the Makridakis time series competition (M-Competition)", Journal of Forecasting, 2, 295-309.
Makridakis, S. and R.L. Winkler, 1989, "Sampling distribution of post-sample forecasting errors", Applied Statistics, 38, 331-342.
Newbold, P., 1983, "The competition to end all competitions", Journal of Forecasting, 2, 276-279.
Newbold, P. and C.W.J. Granger, 1974, "Experience with forecasting univariate time series and the combination of forecasts", with discussion, Journal of the Royal Statistical Society (A), 137, 131-165.
Pack, C.D. and B.A. Whitaker, 1982, "Kalman filter models for network forecasting", Bell System Technical Journal, 61, 1-14.
Price, D.H.R. and J.A. Sharp, 1986, "A comparison of the performance of different univariate forecasting methods in a model of capacity acquisition in UK electricity supply", International Journal of Forecasting, 2, 313-348.
Thompson, P.A., 1990, "An MSE statistic for comparing forecast accuracy across series", International Journal of Forecasting, 6, 219-227.
Williams, W.H. and M.L. Goodman, 1971, "A simple method for the construction of empirical confidence limits for economic forecasts", Journal of the American Statistical Association, 66, 752-754.
Biography: Robert FILDES is Professor of Operational Research and Operations Management in the Management School, Lancaster University, United Kingdom. He has published four books in forecasting and planning, including The Forecasting Accuracy of Major Time Series Methods (with Spyros Makridakis) and, most recently, the World Index of Economic Forecasts. He was co-founder in 1981 of the Journal of Forecasting and in 1985 of the International Journal of Forecasting. In 1988 he took over as Editor-in-Chief of the IJF. He has published numerous articles in academic journals, including Management Science and the Journal of the Operational Research Society. In 1985 he was a Visiting Fellow at Bell Communications Research, USA.