The Lead and Accuracy of Macroeconomic Forecasts*

R.A. KOLB
U.S. Military Academy
West Point, New York

H.O. STEKLER
Industrial College of the Armed Forces
National Defense University
Fort Lesley J. McNair
Washington, D.C.

Journal of Macroeconomics, Winter 1990, Vol. 12, No. 1, pp. 111-123



This paper examines whether the quality or accuracy of macroeconomic forecasts improves significantly as the forecast lead or horizon is reduced. The analysis utilizes descriptive statistics as well as a number of different procedures to determine whether there is a statistically significant difference in the accuracy of the predictions at the various horizons.

1. Introduction

Economic forecasters often make a number of predictions referring to the same time period. These forecasts are issued sequentially over time, and, consequently, the horizon of each new forecast decreases. Therefore, it is relevant to ask whether the quality or accuracy of such forecasts improves significantly as each additional forecast is issued and the forecast lead is reduced. Most of the previous studies which have examined multi-period forecasts have described how the size of the prediction error increases with the length of the forecast horizon. However, there has been no attempt to determine whether there is a statistically significant improvement in forecast quality as the horizon is reduced. If this significant improvement does not occur until "current" information becomes available, it would suggest that a forecaster is merely an efficient processor of information. To illustrate the questions involved in this trade-off between lead and accuracy, this paper will examine sets of forecasts obtained from an organization that usually issues three predictions each quarter. These forecasts are made for the current quarter and for a number of subsequent quarters.

*The opinions expressed in this paper are those of the authors and are not the views of the National Defense University or the Department of Defense. We would like to thank the referees for their comments on an earlier version of this paper. All remaining errors are, of course, our collective responsibility.


We shall present the customary descriptive statistics which have been used for evaluating forecasts. In addition, we shall also apply a number of different procedures which permit us to determine whether there is a statistically significant difference in the accuracy of the predictions at the various horizons. The next section describes the data and presents the descriptive statistics. We then discuss the procedures which are used to test whether the quality of the forecasts improves significantly with the decrease in the forecast horizon.

2. The Data and Hypotheses

The forecasts which will be examined were obtained from an organization¹ that usually makes three predictions each quarter. These forecasts are made for the current quarter and for a number of subsequent quarters. However, the analysis will only examine the forecasts for the current quarter (t) and for the next period (t + 1) and will concentrate on the forecast changes of nominal and real GNP. The data were available for the period 1972:1-1983:4. In each case, the data which were used were those available at the time the forecast was issued. The "actual" values are the first published figures available after the quarters to which they refer.² To obtain a comparable real GNP series, all the data were converted into 1972 dollars using the GNP deflator. It is, therefore, possible to examine forty-eight sets of sequences of six predictions of both real and nominal GNP, all of which are made for the same quarter. Three of those predictions were made a quarter in advance (with leads of 5, 4, and 3 months to the end of the quarter) and three were made during the quarter in question (with leads of 2, 1, and 0 months, respectively).³ By treating the sequence of six predictions as if each were a different type of forecast of a quarter's nominal and real GNP, specific hypotheses can be tested.

¹We have not identified the organization because we feel that it would not be fair to single out the particular results attributable to this forecasting service. Moreover, the intent of this paper is to identify procedures for examining the relationship between the lead and accuracy of forecasts in general. The organization was identified to the editor, and the data are available from the authors.
²The actual data were obtained from the relevant issues of the Survey of Current Business.
³The forecasts were always issued near the end of the month.


First, are the forecasts made at all leads comparable, or are there significant differences among them? Second, do the forecasts with the longest lead have significantly smaller errors than those obtained from a simple naive standard? And third, if the errors generated by the predictions at various leads are different, which are significantly better? Answers to these questions would permit us to determine the value of these forecasts in the decision-making process. Unexplained results would demonstrate where further research is required.
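To make the deflation step described above concrete, the following sketch (in Python with numpy; the variable names and the numbers are illustrative assumptions, not the series analyzed in this paper) converts a nominal series into constant 1972 dollars.

    import numpy as np

    # hypothetical quarterly values, not the series analyzed in this paper
    nominal_gnp = np.array([1200.0, 1235.5, 1268.0])   # billions of current dollars
    gnp_deflator = np.array([100.0, 102.1, 104.3])     # price index, 1972 = 100

    # express the nominal figures in constant 1972 dollars so that forecasts of
    # real and nominal GNP can be evaluated on a comparable basis
    real_gnp_1972_dollars = nominal_gnp / gnp_deflator * 100.0
    print(real_gnp_1972_dollars)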

3. Some Preliminary Results

The most frequently used statistics for evaluating economic forecasts are based on either the Mean Square Error (MSE),

    MSE = \sum_{t=1}^{n} E_t^2 / n ,    (1)

or the Mean Absolute Error (MAE),

    MAE = \sum_{t=1}^{n} |E_t| / n ,    (2)

where E_t is the forecasting error. Variants of these measures include the Root Mean Square Error (RMSE), the mean absolute percentage error, etc. Neither the MSE, nor the MAE, nor any of the variants provides meaningful information by itself. These descriptive statistics need to be compared with similar data from alternative forecasters or methods, including naive and time-series models. Frequently, Theil's (1966) U-statistic is used for determining whether a particular set of forecasts is at least as good as the predictions of a naive model,

    U = \sqrt{\sum_t (\Delta P_t - \Delta A_t)^2} / \sqrt{\sum_t (\Delta A_t)^2} ,    (3)

where \Delta P_t and \Delta A_t are the predicted and actual change of a particular variable, respectively.

The numerator of Equation (3) is the RMSE of the technique being evaluated, while the denominator is the RMSE that would result if no change (that is, the first naive model) were predicted every period. Consequently, a forecasting technique would be considered superior to the simple naive model if U < 1. The "no-change" naive model is the standard with which the technique is compared, but this is not a very stringent test when the series to be predicted have strong trends. Here a same-change-as-last-period naive model, an autoregressive model, or another form of ARIMA model might be a more appropriate basis of comparison.

Using MSEs and MAEs, the results presented in Table 1 provide some answers to the questions posed above. First, the errors of the forecasts made in the current quarter (with forecast horizons of 0 to 2 months) are substantially less than those which were made a quarter ahead (with forecast horizons of 3 to 5 months). Second, all of the errors are smaller than the errors generated by the naive no-change model, and third, the U-coefficients for the most distant forecasts are 0.48 and 0.83 for nominal and real GNP, respectively.

TABLE 1. Mean Square and Mean Absolute Error of Real and Nominal GNP, Made Zero to Five Months in Advance; Errors of Naive Models

    Predictions made with           Nominal GNP Errors                            Real GNP Errors
    lead in months (horizon)   Mean Square  Mean Absolute  U-Coefficient**   Mean Square  Mean Absolute  U-Coefficient**

    0                              120.0         5.73           0.21             21.2          3.68           0.27
    1                              174.3         9.56           0.27             66.8          6.16           0.48
    2                              348.3        13.8            0.38            119.7          8.63           0.64
    3                              691.6        20.0            0.49            191.1         11.35           0.79
    4                              573.7        18.5            0.48            193.9         11.07           0.79
    5                              652.9        17.5            0.48            203.7         11.29           0.83
    Naive
    1. No change                  2596          44.5                            300.1         14.63
    2. Same change as
       previous period*            910.8        22.3                            295.4         13.61

*Refers to the error that would have resulted if this naive forecast had used the revised published estimate for the previous quarter available just before the end of the current period. It is thus roughly comparable to the 0-lead predictions. The naive forecast uses all 48 periods; the 0-lead predictions have only 45 observations. Excluding the naive predictions for those quarters when there were no 0-lead forecasts would not have substantially changed the results.
**The U-coefficient is based on the same number of observations as are available at the relevant lead.
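As a rough illustration of how the statistics in Equations (1)-(3) and Table 1 could be computed, the following Python sketch uses numpy; the function names and the sample figures are illustrative assumptions rather than the authors' code or data.

    import numpy as np

    def mse(errors):
        # Equation (1): mean of the squared forecast errors
        e = np.asarray(errors, dtype=float)
        return np.mean(e ** 2)

    def mae(errors):
        # Equation (2): mean of the absolute forecast errors
        e = np.asarray(errors, dtype=float)
        return np.mean(np.abs(e))

    def theil_u(predicted_change, actual_change):
        # Equation (3): RMSE of the predicted changes relative to the RMSE of a
        # "no change" naive forecast, which always predicts a change of zero
        dp = np.asarray(predicted_change, dtype=float)
        da = np.asarray(actual_change, dtype=float)
        return np.sqrt(np.sum((dp - da) ** 2)) / np.sqrt(np.sum(da ** 2))

    # illustrative quarterly changes (made up, not the data behind Table 1)
    predicted = np.array([12.0, 15.5, 9.8, 20.1])
    actual = np.array([10.0, 18.2, 11.3, 17.9])
    errors = predicted - actual
    print(mse(errors), mae(errors), theil_u(predicted, actual))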



Despite these positive findings, merely reporting the value of these statistics does not permit one to determine whether there is a statistically significant decline in the forecast errors when the forecast horizon is reduced. In the next section, we present some procedures which may be used for determining whether the results are statistically significant.

4. Are Forecasts at All Horizons Equally Accurate?

It is first important to determine whether there are significant differences in the accuracy of the forecasts issued at the six different horizons or whether the errors are equal on average. We shall examine two non-parametric procedures (which do not require any assumptions about the underlying distributions) for testing this hypothesis: average rankings and the Kruskal-Wallis test.

Average Rankings
Assume that, at each of n time horizons, a specific variable is predicted m_i times. For each set of forecasts, the errors at each horizon are ranked according to their accuracy in predicting the variable. The horizon where the lowest error occurs receives a rank of 1; the horizon with the largest error is assigned a rank of n. Formally, R_{it} denotes the rank assigned to the t-th prediction at horizon i. The process of ranking the accuracy at each horizon is repeated for each of the m_i predictions. The m_i ranks for each horizon are then summed,

    S_i = \sum_{t=1}^{m_i} R_{it} ,    i = 1, 2, ..., n .

If the accuracy at each horizon were equal, their ranks would have the same expected values. It is possible to test the hypothesis that all horizons have equal ranks. Since the rankings are from 1 to n, the average rank for any prediction is (n + 1)/2. Summed over m_i predictions, it would be m_i(n + 1)/2. To test the hypothesis that S_i = m_i(n + 1)/2 for i = 1, ..., n, the chi-square goodness-of-fit test statistic, X², may be used, where

    X^2 = \sum_{i=1}^{n} [S_i - m_i(n + 1)/2]^2 / [m_i(n + 1)/2] .    (4)

The X² statistic has a chi-square distribution with (n - 1) degrees of freedom.⁴ A rejection of the null hypothesis would indicate that the average rankings differed significantly and that the horizon accuracies were not equal; that is, some were better while others were worse. Stekler (1987) used this approach to show that forecasters' abilities differed.

⁴This test is similar to Friedman's (1937, 1940) two-way analysis of variance by ranks test. (Also see Gibbons 1971; Conover 1980.)
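A sketch of how the statistic in Equation (4) might be computed follows; it assumes a complete matrix of absolute errors (one row per target quarter, one column per horizon) and uses numpy and scipy, with the function name being an assumption of this illustration rather than the authors' code.

    import numpy as np
    from scipy.stats import chi2, rankdata

    def average_rank_test(error_matrix):
        # error_matrix: (m, n) array of absolute forecast errors, one row per
        # target quarter and one column per horizon; no missing values assumed.
        errs = np.asarray(error_matrix, dtype=float)
        m, n = errs.shape
        # rank the horizons within each quarter (1 = smallest error); the
        # "average" method assigns midranks to tied errors, as described above
        ranks = np.apply_along_axis(rankdata, 1, errs)
        s = ranks.sum(axis=0)                        # S_i, summed ranks per horizon
        expected = m * (n + 1) / 2.0                 # m_i (n + 1)/2 with m_i = m
        x2 = np.sum((s - expected) ** 2 / expected)  # Equation (4)
        p_value = chi2.sf(x2, df=n - 1)              # chi-square with n - 1 d.f.
        return x2, p_value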


Kruskal-Wallis Test
We also propose another nonparametric test to determine whether there are significant differences among the forecasts at the different horizons. The test developed by Kruskal and Wallis (1952) (also see Gibbons 1971; Conover 1980) is an extension of the Wilcoxon Rank Sum Test, which involves combining the two sets of the absolute values of the forecast errors and then rank ordering this combined set. For each forecasting method, the sum of the ranks, W_i, is determined, and the statistic is then calculated from

    H = [12 / (N(N + 1))] \sum_{i=1}^{k} (1/n_i) [W_i - n_i(N + 1)/2]^2 ,    (5)

where N = \sum n_i, i = 1, ..., k, and the n_i are the number of predictions at each of the k horizons. The statistic is asymptotically distributed as chi-square with k - 1 degrees of freedom. For most applications, the chi-square distribution can be used for significance testing.⁵

Tied Observations and Missing Observations
For both the average rankings and the Kruskal-Wallis tests, several methods exist for handling tied observations (for example, see Gibbons 1971, 96-98). A common procedure is to average the ranks (midrank). For example, if two observations are tied at rank 3, then average ranks 3 and 4, and assign rank 3.5 to each. Similarly, if three observations are tied for rank 5, then average ranks 5, 6, and 7, and assign rank 6 to each. Missing observations are accommodated in both Equations (4) and (5) by allowing unequal m_i predictions for each horizon in Equation (4) and unequal n_i in Equation (5).

⁵The statistic (H) is a weighted sum of the squares of the deviations between the actual (W_i) and expected sums of ranks. The exact probabilities for H are tabulated in Kruskal and Wallis (1952) and Iman, Quade, and Alexander (1975).
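The following sketch illustrates Equation (5) with midranks for ties and with possibly unequal group sizes n_i; the function name and the use of numpy/scipy are assumptions of this illustration, not the authors' implementation.

    import numpy as np
    from scipy.stats import chi2, rankdata

    def kruskal_wallis(error_groups):
        # error_groups: one 1-D array of absolute forecast errors per horizon;
        # the group sizes n_i may differ when observations are missing.
        groups = [np.asarray(g, dtype=float) for g in error_groups]
        sizes = np.array([len(g) for g in groups], dtype=float)
        pooled = np.concatenate(groups)
        n_total = pooled.size
        ranks = rankdata(pooled)                 # joint ranking; midranks for ties
        w = []
        start = 0
        for size in sizes.astype(int):
            w.append(ranks[start:start + size].sum())   # W_i for each horizon
            start += size
        w = np.array(w)
        # Equation (5), with no correction factor for ties
        h = 12.0 / (n_total * (n_total + 1)) * np.sum(
            (w - sizes * (n_total + 1) / 2.0) ** 2 / sizes)
        p_value = chi2.sf(h, df=len(groups) - 1)   # asymptotic chi-square, k - 1 d.f.
        return h, p_value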



Results
Since there was a wide range in the MSEs for the various horizons, our intuition would suggest that the forecasts are not equally good. Both the average rankings and Kruskal-Wallis tests indicate that there are significant differences among the errors at the different leads. Table 2 shows the values of the X²- and H-statistics. Each of these values is significant beyond the 1% level. We must now determine when the forecasts begin to improve. We shall do this using statistical tests which make pair-wise comparisons of forecasts.

TABLE 2. Values of X² (Average Ranks) and H (Kruskal-Wallis Test) for Comparisons of Sets of Nominal and Real GNP Errors

                    X² (Average Ranks)    H (Kruskal-Wallis Test)
    Nominal GNP            59.9                    45.7
    Real GNP               56.7                    46.5

    Note: The critical value of X² at the 1% level with 5 degrees of freedom is 15.09.

5. When do the Forecasts Improve Significantly?

To determine when there is a significant reduction in the errors at the various horizons (leads), we shall use three statistics for making pair-wise comparisons. These are the MSE test, the percentage of times better, and the Wilcoxon Rank Sum Test.

MSE Regression Test
While comparisons of MSEs are merely descriptive, indicating that one set of forecasts has made relatively smaller errors than another, Ashley, Granger, and Schmalensee (1980) developed a procedure for testing whether the difference between any two MSEs is statistically significant. If MSE_1 and MSE_2 are the mean square errors made by two different forecasting methods, then it can be shown that

    MSE_1 - MSE_2 = (S_{\epsilon_1}^2 - S_{\epsilon_2}^2) + (\bar{\epsilon}_1^2 - \bar{\epsilon}_2^2) ,    (6)

where S_{\epsilon_i}^2 and \bar{\epsilon}_i represent the sample variance and mean of the errors of method i over the entire forecast period. Given the individual forecast errors, \epsilon_{it}, define

    \Delta_t = \epsilon_{1t} - \epsilon_{2t}    and    \Sigma_t = \epsilon_{1t} + \epsilon_{2t} ;

then

    MSE_1 - MSE_2 = COV(\Delta, \Sigma) + (\bar{\epsilon}_1^2 - \bar{\epsilon}_2^2) .    (7)

The variable COV denotes the sample covariance of the difference, \Delta, and sum, \Sigma, of the errors over the forecast period. Then, the first method would outperform the second if one could reject the joint null hypothesis that COV(\Delta, \Sigma) = 0 and \mu(\Delta) = 0, where the alternative is that both quantities are non-negative and at least one is strictly positive. This test is equivalent to a test of the coefficients of the regression equation,

    \Delta_t = \beta_1 + \beta_2 (\Sigma_t - \bar{\Sigma}) + u_t ,    (8)

where u_t is an error term, and the null hypothesis is \beta_1 = \beta_2 = 0, against the alternative \beta_1 ≥ 0 and \beta_2 ≥ 0, with at least one \beta_i > 0. If either of the estimates of \beta_1 or \beta_2 is negative, then the null hypothesis cannot be rejected. If both estimates are nonnegative, then the joint F-test is appropriate, where significance levels are equal to half of those from an F distribution. (This is a joint F-test, which determines whether both coefficients are zero; see Ashley, Granger, and Schmalensee 1980.)

The MSE test focuses on the difference in the magnitude of the errors of any two forecasting methods. It is possible that a small number of particularly large errors may distort these comparisons. In particular, it has been suggested that it may be inappropriate to use MSE for evaluations because this involves averaging the squared errors over observations that have different degrees of variability (Fair 1980; Jenkins 1982; Pack 1982). Consequently, we shall also use two non-parametric tests.
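A minimal sketch of the regression form of this test, under the simplified decision rule described above, might look as follows; the Python function name, the return convention, and the use of numpy/scipy are assumptions of this illustration rather than the original authors' procedure.

    import numpy as np
    from scipy import stats

    def ags_mse_test(errors_1, errors_2):
        # Regression form of the MSE comparison, Equations (6)-(8): regress the
        # error difference on the demeaned error sum and test beta1 = beta2 = 0.
        e1 = np.asarray(errors_1, dtype=float)
        e2 = np.asarray(errors_2, dtype=float)
        delta = e1 - e2                                  # Delta_t
        sigma = e1 + e2                                  # Sigma_t
        x = np.column_stack([np.ones_like(sigma), sigma - sigma.mean()])
        beta, _, _, _ = np.linalg.lstsq(x, delta, rcond=None)
        n, k = x.shape
        resid = delta - x @ beta
        rss_unrestricted = np.sum(resid ** 2)
        rss_restricted = np.sum(delta ** 2)              # model with beta1 = beta2 = 0
        f_stat = ((rss_restricted - rss_unrestricted) / k) / (rss_unrestricted / (n - k))
        # one-sided significance level: half the upper-tail F probability, used
        # only when both coefficient estimates are nonnegative; otherwise the
        # null hypothesis is not rejected
        if beta[0] < 0 or beta[1] < 0:
            return beta, f_stat, None
        return beta, f_stat, stats.f.sf(f_stat, k, n - k) / 2.0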


Percentage of Times Better
Instead of using quantitative measures of error, it is possible simply to count the number of times that the forecasts at horizon A had smaller errors than those at horizon B. Then, the percentage of times that A is superior can be calculated. The statistical test would be to compare this percentage against the null hypothesis that there was no significant difference between the two series, that is, that the ratio was 0.50-0.50. The binomial distribution would be used for calculating the appropriate probabilities. The procedure was applied in some of the analyses and comments on the Makridakis M-Competition (see Makridakis 1983, 298-99).⁶ The test statistic for n > 40 is

    Z_1 = (n_1 - n/2) / (n/4)^{1/2} ,

where n_1 is the number of times the first method is better, and n is the total number of observations.

⁶In the statistical literature, this is referred to as the Ordinary Sign Test (Gibbons 1971).
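A sketch of this sign-test calculation, dropping tied pairs, might look as follows; the function name is an assumption of this illustration.

    import numpy as np
    from scipy.stats import norm

    def percentage_better_test(errors_a, errors_b):
        # Count the periods in which horizon A's absolute error is smaller than
        # horizon B's, drop ties, and use the normal approximation to the
        # binomial(n, 0.5) null: Z_1 = (n_1 - n/2) / sqrt(n/4).
        a = np.abs(np.asarray(errors_a, dtype=float))
        b = np.abs(np.asarray(errors_b, dtype=float))
        keep = a != b                              # each tie reduces the sample by one
        n = int(keep.sum())
        n1 = int((a[keep] < b[keep]).sum())        # times the first method is better
        z = (n1 - n / 2.0) / np.sqrt(n / 4.0)
        return n1 / n, z, norm.sf(z)               # share better, Z_1, one-sided p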

Wilcoxon Rank Sum Test
Another appropriate nonparametric test for comparing a pair of forecast sets is the Wilcoxon Rank Sum Test (1947) mentioned above. After the errors of each set of forecasts have been determined, they are pooled and ranked. Then, the sum of the ranks, W_i, of each of the forecast sets is calculated.⁷ To test the hypothesis that the forecasts at one horizon are superior to those at another horizon, the sum of the ranks, W_i, can be compared with critical values (Wilcoxon 1947; Conover 1980, Table A7). Alternatively, if the sum of the sample sizes is at least 12, a normally distributed statistic, Z, can be formulated as

    Z = ( W_1 - [n_1(n_1 + n_2 + 1)/2] ) / [ n_1 n_2 (n_1 + n_2 + 1) / 12 ]^{1/2} ,    (9)

where W_1 is the sum of the ranks of the errors at one horizon, and the n_i are the numbers of forecasts for each horizon (Gibbons 1971).

⁷It should be noted that the Wilcoxon test, which mixes the absolute errors of the two sets of forecasts, is not based on the relationship between the paired errors of each time period. It is thus different from the percentage-better procedure, which examines differences between the paired errors in each time period. The Wilcoxon test, however, does yield results that are statistically equivalent to those that would be obtained from the Mann-Whitney test (see Mann and Whitney 1947; Conover 1980), which is another nonparametric test but is not discussed here.
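Equation (9) can be illustrated as follows (Python with numpy/scipy; midranks are used for ties, and the function name is an assumption of this illustration).

    import numpy as np
    from scipy.stats import norm, rankdata

    def rank_sum_z(errors_1, errors_2):
        # Equation (9): pool the two sets of absolute errors, rank them jointly
        # (midranks for ties), and standardize the rank sum of the first set.
        e1 = np.abs(np.asarray(errors_1, dtype=float))
        e2 = np.abs(np.asarray(errors_2, dtype=float))
        n1, n2 = e1.size, e2.size
        ranks = rankdata(np.concatenate([e1, e2]))
        w1 = ranks[:n1].sum()                              # W_1
        mean_w1 = n1 * (n1 + n2 + 1) / 2.0
        var_w1 = n1 * n2 * (n1 + n2 + 1) / 12.0
        z = (w1 - mean_w1) / np.sqrt(var_w1)
        return z, 2.0 * norm.sf(abs(z))                    # Z and two-sided p-value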

Tied Observations and Missing Observations
Tied observations present no special problems for these tests. For the MSE regression test, a pair of tied forecast errors will result in a corresponding \Delta_t = 0.

For the percentage-of-times-better test, each tie will reduce the binomial sample size by one. For the Wilcoxon Rank Sum Test, the previously described midrank method can be used. Missing observations can also be accommodated. For the MSE regression test and the percentage-of-times-better test, the sample size must be reduced whenever comparisons cannot be made. For the Wilcoxon Rank Sum Test, Equation (9) allows for unequal sample sizes.

Results
Comparison with Naive Standard. We had previously noted that Theil's U-coefficients for the most distant forecasts, that is, those made 5 months before the end of the quarter, are 0.48 and 0.83 for nominal and real GNP, respectively. Since these coefficients are less than 1, it is obvious that the forecasts' errors are smaller than those of the naive standard. But are these differences significant?

Nominal GNP. The MSE regression test was applied to the errors of the naive and 5-month lead predictions of nominal GNP. Despite the low U, the differences in the errors are not significant. On the other hand, the percentage-better and Wilcoxon tests clearly indicate that the 5-month lead forecasts are significantly better. (See Table 3.)⁸

Real GNP. A comparison of the 5-month real GNP forecast with the naive model (Table 3) provides mixed results. The regression and Wilcoxon tests are both significant at the 5% level but not at the 1% level. However, in the percentage-better comparison, the results are not significantly different at the 10% level. These results suggest that the real GNP forecasts made 5 months prior to the end of the quarter to which they refer are not clearly superior to the naive forecasts. This finding lends support to a similar result obtained in another study (Stekler 1987) which uses an entirely different methodology. The question then remains: when do these forecasts improve?

⁸The MSE regression test can be adversely affected by a relatively few large prediction errors, which will result in a relatively large error term. This error term will, in effect, make it difficult to detect significant differences between forecasters. In these instances, both the naive model and the 5-month lead predictions made large errors. In other words, the forecaster's predictions, while improving on the naive standard, contributed substantially to the error of the regression and to the observed result that the improvement was not significant.



TABLE 3. Significance Levels of Test Statistics* for Paired Comparisons of Nominal and Real GNP Forecast Errors; Various Leads and with Naive Model

(Entries are the significance levels of the MSE Regression Test, Percentage Better, and Wilcoxon Rank Sum statistics, reported separately for nominal GNP and real GNP. The paired comparisons are: 0-month with 1-month lead; 1-month with 2-month lead; 2-month with 3-month lead; 3-month with 4-month lead; 4-month with 5-month lead; and 5-month lead with the "no change" naive model.)

*The actual values of the various test statistics are not presented here, but may be obtained from the authors.

Accuracy of Predictions and Forecast Horizons
Do the forecasts become more accurate as the forecast horizon is shortened? It is possible to answer the question by comparing the errors of the sets of forecasts made one month apart, that is, a 4-month lead versus a 5-month lead, a 3-month lead versus a 4-month lead, etc. The results are presented in Table 3 and show that there is no significant improvement in the forecasts made a quarter in advance. For all of the tests, there is no significant difference between the errors made 4 and 5 months in advance, or between the errors of the 3- and 4-month lead forecasts. These results apply to both real and nominal GNP. The tests show that the forecasts of the current quarter improve significantly as the horizon is reduced. The errors of the forecasts issued near the end of the quarter in question are significantly smaller than those made 1 month in advance. Similarly, the 1-month lead predictions are superior to those made two months in advance. This result is not surprising since these forecasts used a substantial amount of early data referring to the quarter for which the forecasts were issued.

The tests (with the exception of the Wilcoxon), however, also indicate that the 2-month lead forecasts made at the end of the first month of the quarter were superior to the last forecast issued in the prior quarter. Thus, we may conclude that current quarter projections, which are a combination of forecasts and estimates based on current data, improve as the forecast horizon shrinks. This forecast organization was able to interpret the current data properly and thus improve the accuracy of its estimates.

6. Summary and Conclusions

This paper has applied a number of statistical tests to a set of forecasts to show that it is possible to determine whether the quality of the predictions improves significantly as the forecast horizon declines. From this forecast set, we concluded that the accuracy of both the real and nominal GNP predictions did differ significantly with the length of the forecast lead. Some of the tests indicated that the longest lead forecasts were not significantly better than the naive standard. The forecasts made one quarter in advance of the period to which they referred did not improve significantly with a reduction of the forecast horizon. However, current quarter predictions did improve with the reduction of the forecast lead. These findings are in accord with previous results which indicated that forecasters can use new data to improve the quality of their current quarter predictions. However, the failure to decrease forecast errors a quarter in advance is not explained, and other research is required.

Received: August 1988
Final Version: April 1989

References
Ashley, R., C.W.J. Granger, and R. Schmalensee. "Advertising and Aggregate Consumption: An Analysis of Causality." Econometrica 48 (July 1980): 1149-67.
Conover, W.J. Practical Nonparametric Statistics. 2d ed. New York: Wiley, 1980.
Fair, Ray C. "Estimating the Expected Predictive Accuracy of Econometric Models." International Economic Review 21 (June 1980): 355-78.
Friedman, M. "The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance." Journal of the American Statistical Association 32 (1937): 675-701.


Friedman, M. "A Comparison of Alternative Tests of Significance for the Problem of m Rankings." Annals of Mathematical Statistics 11 (1940): 86-92.
Gibbons, Jean Dickinson. Nonparametric Statistical Inference. New York: McGraw-Hill, 1971.
Iman, R.L., D. Quade, and D.A. Alexander. "Exact Probability Levels for the Kruskal-Wallis Test." Selected Tables in Mathematical Statistics 3 (1975): 329-84.
Jenkins, Gwilym M. "Some Practical Aspects of Forecasting in Organizations." Journal of Forecasting 1 (1982): 3-21.
Kruskal, W.H., and W.A. Wallis. "Use of Ranks in One-Criterion Variance Analysis." Journal of the American Statistical Association 47 (1952): 583-621.
Makridakis, Spyros. "Empirical Evidence versus Personal Experience." Journal of Forecasting 2 (1983): 295-396.
Mann, H.B., and D.R. Whitney. "On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other." Annals of Mathematical Statistics 18 (1947): 50-60.
Pack, David J. "Measures of Forecast Accuracy." Paper presented at the joint national meeting of the Operations Research Society of America and The Institute of Management Science, 1982.
Stekler, H.O. "Revisions of Economic Forecasts." Paper presented to the Seventh International Symposium on Forecasting, Boston, 1987. Mimeo.
Theil, Henri. Applied Economic Forecasting. Amsterdam: North-Holland Publishing Co., 1966.
Wilcoxon, F. "Probability Tables for Individual Comparisons by Ranking Methods." Biometrics 3 (1947): 119-22.
