International Journal of Forecasting 19 (2003) 735–742
www.elsevier.com/locate/ijforecast
Diagnostics for evaluating the value and rationality of economic forecasts

H.O. Stekler*, G. Petrei

Department of Economics, The George Washington University, Washington, DC 20052, USA
Abstract

A number of studies have sought to determine whether economic forecasts had predictive value. These analyses used a single statistical methodology based on the independence of the actual and predicted changes. This paper questions whether the observed results are robust if alternative statistical methodologies are used to analyze this question. Procedures suggested by Cumby and Modest as well as rationality tests were applied to two data sets. Sometimes the conclusions differ depending on the procedures that are used. The results yield a guideline for the diagnostics that should be employed in testing for the value of economic forecasts.
© 2002 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.

Keywords: Value of forecasts; Evaluation diagnostics; Rationality
1. Introduction

There have been a number of studies that have sought to determine whether economic forecasts are valuable to the users of those predictions (Ash et al., 1998; Pesaran & Timmerman, 1992; Schnader & Stekler, 1990; Stekler, 1994). In the context of these studies, value had a specific meaning: do these forecasts provide information beyond that available from naive predictive methods? The statistical tests that are used in these evaluations are similar to those used in the financial literature. In essence, these studies have tested whether the forecasts were superior to the naive no-change model in predicting the direction of the observed changes in an economic variable. They are tests of directional accuracy.

*Corresponding author. Tel.: +1-202-994-6150; fax: +1-202-994-6147. E-mail address: [email protected] (H.O. Stekler).
These evaluations of economic forecasts were influenced by the financial literature, in particular by Merton’s (1981) study. That paper established a statistical methodology for determining whether mutual fund managers had market-timing ability. A crucial assumption of this statistical methodology is that the size of the financial returns is independent of correctly predicting the direction of change.1 Cumby and Modest (1987) (hereafter CM) show that the power of Merton’s test is low even if this assumption holds. They then suggested a different procedure for evaluating the performance of investment advisors. This test has not yet been applied to economic forecasts to determine whether they would be valuable to their users. Not only is the CM procedure useful in evaluating the value of forecasts, but it is
also closely related to the statistical procedure that is used to test whether economic forecasts are rational.

This paper examines the statistical procedures for evaluating the value and rationality of forecasts and then applies them to two different data sets. The focus of this paper is not on the accuracy of the forecasts contained in these data sets. Rather, the paper questions whether the results are robust when different statistical procedures are used to determine whether the forecasts had value. The first set of forecasts was issued by a prominent forecasting service and had previously been analyzed by Schnader and Stekler (1990) and Stekler (1994). The purpose of reexamining these forecasts is merely to determine whether the conclusions obtained from those studies would still hold when alternative procedures are used. The second data set has never been completely analyzed. It consists of the forecasts of real GDP and interest rates published semi-annually in The Wall Street Journal.2

The first sections explain the procedures that had previously been used. This is followed by the descriptions and evaluations of the two sets of forecasts. The implication of these results for evaluation diagnostics is discussed in the last section.

1 The same assumption was made in the analyses of the economic forecasts.
2 The interest rate forecasts had previously been examined for a shorter time period by Kolb and Stekler (1996).
2. The statistical tests
2.1. Merton's methodology

Merton (1981) derived the conditions under which a market-timing forecast3 will have value to an investor. If an investor has a probability density function of the expected return on the market based on prior information, a forecast will only have value if it changes that prior distribution. For this to occur, the sum of the conditional probabilities of a correct forecast given the outcomes must exceed one. Formally, if Z(t) is the market rate, R(t) is the riskless return and u(t) = 1 if the forecast is correct, the conditional probabilities of correct forecasts, P_i, are

Prob{u(t) = 1 | Z(t)} = P_1  for 0 ≤ Z(t) ≤ R(t)
                      = P_2  for R(t) < Z(t) < ∞

Consequently, Henriksson and Merton (1981) developed a non-parametric test to determine whether P_1 + P_2 > 1 against the null hypothesis that P_1 + P_2 = 1. They showed that under the null hypothesis, the number of correct forecasts, given the outcomes, follows the hypergeometric distribution. The null hypothesis would be rejected (and the forecasts would be considered to have value) if the number of correct forecasts exceeded a critical value.4

3 In Merton's analysis a market-timing forecast is a prediction that the rate of return on market investments would in a given period exceed (fall short of) the return on a riskless investment.
4 An alternative procedure which yields identical results is to use Fisher's Exact Test and reject the null if the resultant probability is less than the specified critical value.
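To make the mechanics concrete, here is a minimal sketch (ours, not the authors'; the data and variable names are hypothetical) of the Henriksson–Merton test just described: under the null the count of correctly predicted increases, given the margins, is hypergeometric, and the equivalent one-sided Fisher exact test of footnote 4 is shown for comparison. It assumes Python with numpy and scipy available.

import numpy as np
from scipy.stats import hypergeom, fisher_exact

# 1 = increase predicted/observed, 0 = no increase; purely illustrative data
actual    = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1])
predicted = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1])

N = len(actual)                                    # total number of forecasts
m = int(actual.sum())                              # periods with an actual increase
n = int(predicted.sum())                           # periods with a predicted increase
x = int(((actual == 1) & (predicted == 1)).sum())  # correctly predicted increases

# Under H0: P_1 + P_2 = 1, x is hypergeometric given the margins (N, m, n).
p_hm = hypergeom.sf(x - 1, N, m, n)                # P(X >= x), one-sided
print(f"Henriksson-Merton p-value: {p_hm:.3f}")

# Footnote 4: the one-sided Fisher exact test on the 2 x 2 table gives the same answer.
table = [[x, n - x], [m - x, N - n - m + x]]
print(f"Fisher exact (one-sided) p-value: {fisher_exact(table, alternative='greater')[1]:.3f}")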
2.2. Contingency tables

When the Henriksson–Merton probabilities are tabulated as in Fig. 1, their relationship to a contingency table becomes obvious. Schnader and Stekler (1990) and Stekler (1994) used the contingency table approach to test whether the predicted change in real GNP was probabilistically independent of the actual change. Either the Chi-square test or Fisher's Exact Test may be used to test this hypothesis.5 Both methods determine whether a given set of forecasts differs significantly from a naive model in predicting the direction of change (Stekler, 1994, p. 497). The assumption that underlies all of these procedures is that only the direction of change is important and that the magnitude of that change does not matter. This assumption implies that the directional error made when a growth rate of 0.25% is predicted but the actual rate of growth declines by 0.10% is equal in importance to the error made when positive growth of 3% is predicted but there is a decline of 2%. For this reason, it is important to test whether the results are robust. The empirical papers, therefore, used two other dichotomous classifications
(periods of growth in excess of 1% (2%), on the one hand, and periods of less than 1% (2%) growth, including zero and negative growth, on the other) to test for directional accuracy.

Fig. 1. Observation of predicted and actual changes in an economic variable.

5 For a 2 × 2 contingency table, the Pesaran–Timmerman and the Chi-square statistics are asymptotically equivalent. They are not equivalent for more than two classifications (Pesaran & Timmerman, 1992, p. 463). In this paper we only analyze 2 × 2 classifications and thus do not present the results of that test.
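The same comparison, cast as a 2 × 2 table as in Fig. 1, can be tested directly. The sketch below is ours, with made-up cell counts; it applies both the Chi-square test and Fisher's Exact Test mentioned above, again assuming Python with numpy and scipy.

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Rows: predicted increase / predicted decline; columns: actual increase / actual decline.
table = np.array([[18, 4],
                  [ 3, 9]])

chi2, p_chi2, dof, _ = chi2_contingency(table, correction=False)
_, p_fisher = fisher_exact(table)   # two-sided by default

print(f"Chi-square p-value: {p_chi2:.4f}; Fisher exact p-value: {p_fisher:.4f}")
# A small p-value rejects independence, i.e. the forecasts do better than the
# naive no-change model at predicting the direction of change.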
2.3. Cumby–Modest procedure

CM show that the power of the Merton test is low even if the size of the financial returns is independent of correctly predicting the direction of change. They then propose an alternative test for directional accuracy that does not depend upon this independence assumption.6 This procedure regresses the observed change, A_t, upon a constant and the binary variable X_t, that takes the value 1 if a positive change is predicted and 0 otherwise. If the coefficient β in the regression (1):

A_t = α + β X_t + ε_t    (1)

is significantly different from 0, the predicted directions of change explain the returns and thus would have value. Moreover, CM (p. 174, fn. 8) note that the quantitative value of the predicted change, F_t, rather than the binary variable X_t, may be used, as in Eq. (1a). This equation also tests whether β is significantly different from 0 and determines whether the quantitative forecasts had predictive value in explaining financial returns.

A_t = α + β F_t + ε_t    (1a)

Although this methodology was developed for evaluating financial predictions, in Sections 3 and 4 it is applied to economic forecasts.7

6 In the financial framework, a forecaster may make a small number of successful predictions which yield large profits and a relatively large number of errors where the losses are small. Even though the ratio of correct to incorrect forecasts is low, the forecasts would have to be considered to have value if the gains from the successful predictions exceeded the losses resulting from the incorrect ones.
7 There are also other procedures that may be used to test whether forecasts have value. For example, the rank of F_t may be used in Eq. (1a); or a Spearman rank correlation can be calculated and tested for significance.
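As an illustration of the two Cumby–Modest regressions, Eqs. (1) and (1a), the following sketch (ours; the series are simulated, not the paper's data) estimates both by OLS with statsmodels and reports a one-sided test that β > 0.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
F = rng.normal(2.5, 2.0, 48)                     # hypothetical predicted changes F_t
A = 0.3 + 0.9 * F + rng.normal(0.0, 1.5, 48)     # hypothetical actual changes A_t
X = (F > 0).astype(float)                        # X_t = 1 if a positive change is predicted

eq1  = sm.OLS(A, sm.add_constant(X)).fit()       # Eq. (1):  A_t = alpha + beta*X_t + e_t
eq1a = sm.OLS(A, sm.add_constant(F)).fit()       # Eq. (1a): A_t = alpha + beta*F_t + e_t

for name, res in (("Eq. (1)", eq1), ("Eq. (1a)", eq1a)):
    beta, t_beta, p_two = res.params[1], res.tvalues[1], res.pvalues[1]
    p_one = p_two / 2 if beta > 0 else 1 - p_two / 2   # one-sided H1: beta > 0
    print(f"{name}: beta = {beta:.3f}, t = {t_beta:.2f}, one-sided p = {p_one:.4f}")

# When forecast horizons overlap, a HAC covariance could be substituted,
# e.g. .fit(cov_type="HAC", cov_kwds={"maxlags": 2}).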
2.4. Rationality tests

Most evaluations of economic forecasts examine the rationality of the predictions. For the forecasts to be rational, they must be unbiased and efficient, with efficiency meaning that the errors must be uncorrelated with information known by the forecaster at the time that the predictions were prepared. Over the years there have been many different operational definitions of the terms rationality and efficiency. Here we only consider weak form informational efficiency, which means that the forecasts are unbiased in the sense that individuals do not make systematic errors. The outcome–prediction relationship (2) is used to test whether the forecasts are unbiased:

A_t = α + β F_t + ε_t    (2)

where A_t is the actual value (outcome) and F_t is the prediction. The forecasts would be considered unbiased if the joint hypothesis that α = 0 and β = 1 is not rejected. This joint null on α and β is a sufficient but not a necessary condition for unbiasedness and is also a necessary condition for efficiency. Weak form informational efficiency also requires that one period ahead forecast errors not be serially correlated and that these errors be uncorrelated with past forecast values or errors.
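A minimal sketch of these weak-form rationality diagnostics, again with simulated series and statsmodels assumed: the joint restriction α = 0, β = 1 on Eq. (2) is examined with an F (Wald) test, and serial correlation of the one-period-ahead errors is checked here with a Ljung–Box test (our choice of check; the paper does not prescribe a particular one).

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(1)
F = rng.normal(2.5, 2.0, 48)                  # hypothetical forecasts F_t
A = 0.1 + 1.0 * F + rng.normal(0.0, 1.0, 48)  # hypothetical outcomes A_t

res = sm.OLS(A, sm.add_constant(F)).fit()     # Eq. (2): A_t = alpha + beta*F_t + e_t
R, q = np.eye(2), np.array([0.0, 1.0])        # restrictions: alpha = 0, beta = 1
wald = res.f_test((R, q))
print(f"p-value of the joint test alpha = 0, beta = 1: {float(wald.pvalue):.3f}")

# Weak-form efficiency also requires serially uncorrelated one-period-ahead errors.
errors = A - F
lb = acorr_ljungbox(errors, lags=[1], return_df=True)
print(f"Ljung-Box p-value at lag 1: {float(lb['lb_pvalue'].iloc[0]):.3f}")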
The tests associated with Eq. (1a), for determining whether the forecasts have value, and Eq. (2), for examining rationality, are related. In the first instance, the β coefficient must merely be positive and significantly different from zero. In the second case, this coefficient must not be significantly different from 1, while at the same time, α must not differ from zero. The rationality test is more restrictive, and any set of forecasts for which the rationality hypothesis is not rejected must, therefore, be valuable.8 The opposite result is not true: a set of forecasts might be valuable using the CM approach but not be rational.

8 Pesaran and Timmerman (1992, p. 464), however, note that it is possible that rationality is not rejected but the forecasting method has little predictive power. Their example is stock market returns under the efficient market hypothesis.
2.5. Summary

A number of statistical tests have been used to determine whether a set of forecasts had predictive value. The first measures directional accuracy by comparing the signs of the predicted and actual changes. It is a test of the null that the forecasts are no better in predicting the direction of change than the naive model that always predicts no change. However, CM develop procedures that have more power against the null that the forecasts have no predictive value. We show that the weak informational efficiency tests are more restrictive versions of the CM procedures and can also be used to determine whether forecasts have predictive value.
3. Comparison with previous empirical results

Schnader and Stekler (1990) and Stekler (1994) had concluded that, for the period 1972–1983, all of the current quarter forecasts of real GNP of Set A had predictive value because the independence hypothesis was rejected. However, all predictions made with leads of three or more months, i.e. for the next quarter, were not significantly associated with the outcomes and thus did not have predictive value. This result was based on two different classifications. The first dichotomized by the signs of the predicted and actual changes in real GNP; the second classification distinguished between rates of growth in real GNP that were ≤ 1% or > 1%. The results were different when the data were dichotomized by real growth changes that were ≤ 2% or > 2%: then forecasts at all leads had predictive value.

We question whether these findings are valid when the two versions of the CM procedure (Eqs. (1) and (1a)) are applied to the same data set. The equations for leads of 3 to 5 months were estimated using OLS because there were missing observations. It was thus impossible to use a procedure that corrected for the moving average process that was inherent in making forecasts for both the current and one-quarter-ahead periods. The results, presented in Table 1, show that in all cases the β coefficients were significantly different from zero. Thus, using these tests, all the forecasts had predictive value.

The conclusions with respect to the value of current quarter forecasts confirm the results of the previous studies. Some of the results regarding one-quarter-ahead predictions, however, reverse the earlier findings. The reason is that the regressions place greater weight on accurately predicting large changes. Finally, this set of forecasts was tested for weak form informational efficiency. The results, presented in the last column of Table 1, indicate that the null of rationality was not rejected in any case. Since the rationality of the one-quarter-ahead forecasts was never rejected, they clearly had predictive value. The difference between these results and the earlier findings demonstrates that conclusions about the value of forecasts depend critically on the statistical methodology that is used in the evaluation.

Table 1
Cumby–Modest tests of the predictive value of the GNP forecasts of a forecasting organization, 0 to 5 months lead

Lead                Constant, Eq. (1)   X_t            Constant, Eq. (1a)   F_t             Probability α = 0, β = 1
Zero lead           3.94 (6.66)         8.57 (7.47)    −0.05 (−0.23)        1.038 (22.03)   0.82
One month lead      3.66 (5.58)         7.36 (5.67)    0.12 (0.03)          0.984 (12.05)   0.98
Two month lead      3.46 (4.65)         6.15 (4.40)    −0.78 (−1.37)        1.054 (8.08)    0.37
Three month lead    3.13 (3.96)         5.76 (3.65)    −0.51 (−0.69)        0.905 (5.25)    0.42
Four month lead     2.69 (3.36)         4.37 (2.41)    −0.99 (−1.06)        0.977 (4.85)    0.31
Five month lead     2.62 (3.26)         4.70 (2.62)    −1.08 (−1.24)        0.945 (4.59)    0.16

Note: Numbers in parentheses are t-ratios.
4. Wall Street Journal forecasts

These procedures may also be applied to another set of forecasts that have not been completely analyzed previously. Since January 1982, The Wall Street Journal has published the interest rate forecasts obtained from prominent financial analysts. These surveys appear semi-annually (early in January and July) and present estimates of the level of two interest rates which are expected to prevail 6
(12) months in the future.9 These interest rates are the 90-day T-bill rate and the yield on 30-year government bonds. Since January 1986, The Wall Street Journal has also published the analysts' forecasts of the rate of growth of real GNP (GDP) that is expected to occur over spans of 6 months and 1 year. All of these forecasts are available through July 1999. Thus there are six different sets of forecasts (three variables at two different horizons) that can be evaluated, but the number of observations differs depending on when the particular prediction was first published. Although individual forecasts are available, only
the mean forecast of each variable made in each period was examined, because many individuals only participated in a small number of surveys. The four tests (independence, the two Cumby–Modest regressions, and rationality) that were described above were applied to each data set.

9 The predicted interest rates that were expected to prevail 12 months in the future were first published in January 1984.
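The battery just described can be packaged in a small routine. The sketch below is ours (the function name, threshold argument and simulated usage are illustrative, not from the paper); it returns the p-values of the independence test, the two Cumby–Modest regressions and the unbiasedness test for one series of predicted and actual changes, assuming numpy, scipy and statsmodels.

import numpy as np
import statsmodels.api as sm
from scipy.stats import fisher_exact

def evaluate_forecasts(actual, predicted, threshold=0.0):
    """P-values of the four diagnostics for one forecast series (illustrative)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    up_a, up_p = actual > threshold, predicted > threshold

    # 1. Independence of predicted and actual directions (Fisher's Exact Test).
    table = [[int(( up_p &  up_a).sum()), int(( up_p & ~up_a).sum())],
             [int((~up_p &  up_a).sum()), int((~up_p & ~up_a).sum())]]
    p_independence = fisher_exact(table)[1]

    # 2./3. Cumby–Modest regressions (1) and (1a); two-sided p-values on the slope.
    eq1  = sm.OLS(actual, sm.add_constant(up_p.astype(float))).fit()
    eq1a = sm.OLS(actual, sm.add_constant(predicted)).fit()

    # 4. Unbiasedness: joint test of alpha = 0, beta = 1 on Eq. (2) (= Eq. (1a)).
    p_unbiased = float(eq1a.f_test((np.eye(2), np.array([0.0, 1.0]))).pvalue)

    return {"independence": p_independence,
            "cm_binary": eq1.pvalues[1],
            "cm_quantitative": eq1a.pvalues[1],
            "unbiasedness": p_unbiased}

# Example with simulated data; a 2% threshold mimics the GNP/GDP dichotomization.
rng = np.random.default_rng(2)
f = rng.normal(2.5, 1.5, 28)
a = 0.4 + 0.9 * f + rng.normal(0.0, 1.0, 28)
print(evaluate_forecasts(a, f, threshold=2.0))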
4.1. Interest rate forecasts

4.1.1. Forecasts made 6 months ahead

We first examine the interest rate forecasts made 6 months ahead. Tables 2 and 3A present the results of the tests used to evaluate these interest rate predictions. They show that it was not possible to reject the hypothesis that the signs of the actual changes and predicted changes were independent. In fact, the direction of change of the interest rate on the 30-year Treasury bond was predicted incorrectly 75% of the time. The results from the alternative tests are just as negative. The β coefficients in the regressions are not significantly different from zero and, in the case of the predictions for the 30-year bonds, even have the wrong sign. The forecasts definitely are not rational. Based on all of these tests we can conclude that the mean 6 month interest rate forecasts do not have predictive value and would not be valuable to the users.
Table 2
Calculated probabilities of independence tests between predicted and actual changes in interest rates and GNP/GDP; 6 month and 12 month horizons

Variable          Horizon       P-value
90 day T-bill     6 months      0.631
90 day T-bill     12 months     0.56
30 year T-bond    6 months      1.00
30 year T-bond    12 months     0.64
GNP/GDP           6 months      0.244
GNP/GDP           12 months     0.051
Table 3
(A) Cumby–Modest tests of the predictive value of the mean 6 month Wall Street Journal forecasts: 90 day T-bill and 30 year Treasury interest rates

R_t,T-bill = −0.318 (−1.21) + 0.317 (0.85) X_t
R_t,T-bill = −0.163 (−0.88) + 0.513 (1.07) F_t,T-bill
R_t,30-year = 0.176 (0.91) − 0.964 (−3.09) X_t
R_t,30-year = 0.299 (1.63) − 0.840 (−1.34) F_t,30-year

(B) Cumby–Modest tests of the predictive value of the mean 12 month Wall Street Journal forecasts: 90 day T-bill and 30 year Treasury interest rates

R_t,T-bill = −0.74 (−1.81) + 0.632 (1.26) X_t
R_t,T-bill = −0.387 (−1.45) + 0.279 (0.59) F_t,T-bill
R_t,30-year = −0.36 (−1.18) − 0.243 (−0.59) X_t
R_t,30-year = −0.486 (−2.27) − 0.66 (−0.81) F_t,30-year

X_t is a binary variable which equals 1 if F_t > 0 and is 0 otherwise; F_t is the predicted change in the interest rate. Numbers in parentheses are t-ratios.

4.1.2. Twelve month forecasts

There are two distinct ways to analyze the 12 month forecasts. First, one can compare the actual and predicted changes over the span of those 12 months. Alternatively, given that there also are predicted changes for the first half of the period, it would be possible to calculate only the changes that occurred in the second half of this 12 month period. Both procedures have been used in previous forecast evaluations. We chose to use the first method and evaluated the accuracy of the forecasts over the entire span of the forecast period. Table 2 (lines 2 and 4) presents the results of the independence tests applied to the 12 month interest rate forecasts, while the findings of the CM tests are in Table 3B. In all cases, the results for the 12 month interest rate forecasts are identical to the findings that we obtained for the 6 month interest rate predictions: independence was not rejected; the CM tests did not yield significant regression coefficients; the forecasts were not unbiased. All of the tests yield identical results, i.e. these interest rate forecasts do not have predictive value.

4.2. GNP/GDP forecasts
In analyzing the GNP/GDP predictions, it was first necessary to select a set of actual outcomes because the numbers in the National Income Accounts referring to a particular quarter are revised frequently. We used numbers that, in the literature, had previously been called the 45 day figures, i.e. the numbers released 45 days after the end of the quarter to which they refer.10 These are the numbers that previous analysts had assumed individuals were trying to predict. One further adjustment was required to evaluate whether the actual and predicted changes were related. During the period 1986–1999, when The Wall Street Journal published the GNP/GDP forecasts, there was only one half year when the preliminary data showed that the economy actually had a negative growth rate, and there were only two predictions of negative growth over 6 months. Given so few negative observations, we did not determine whether the signs of the actual and predicted real growth rates were related. Rather, we dichotomized between changes in excess of 2% and all others.

10 Currently, these preliminary data are released about 2 months after the end of the quarter to which they refer.

The results are conflicting. Using the aforementioned classification scheme, it was not possible to reject the hypothesis that the predicted and actual changes were independent (Table 2). This implies that forecasters could not always distinguish between growth that exceeded 2% and growth of 2% or less. On the other hand, the regression tests show (Table 4) that, if a one-tailed test is used, the β coefficient was significantly different from zero at the 5% significance level, and the hypothesis that the forecasts were rational could not be rejected. Given the greater power of the CM procedure and the rationality results, we would conclude that the forecasts had value. However, the failure (in 10 of 27 cases) to distinguish growth in excess of 2% from all other types of growth is a disturbing finding. Even the one-quarter-ahead forecasts of the Schnader–Stekler papers were able to reject the independence assumption using the 2% dichotomization.11

11 However, the Schnader–Stekler papers used the forecasts of only one organization, while this study uses the mean forecasts of a group of individuals.

Table 4
Cumby–Modest tests of the predictive value of the mean GNP/GDP forecasts of the Wall Street Journal forecasters

Six month forecasts:
GNP_t = 2.101 (4.89) + 1.03 (1.84) X_t
GNP_t = 0.601 (0.91) + 0.95 (3.45) F_t
p: α = 0, β = 1: 0.25

Twelve month forecasts:
GNP_t = 1.641 (3.60) + 1.48 (2.70) X_t
GNP_t = 0.101 (0.091) + 1.054 (2.19) F_t
p: α = 0, β = 1: 0.66

X_t is a binary variable which equals 1 if F_t > 2% and is 0 otherwise; F_t is the predicted change in GNP/GDP. Numbers in parentheses are t-ratios.
4.2.1. Twelve month forecasts

The results for the 12 month ahead forecasts of GNP/GDP are presented in Tables 2 and 4. We find that the independence assumption is rejected only at the 0.051 level, which, technically, does not permit us to conclude that the mean 12 month forecasts can distinguish between periods when the growth rate is more than 2% and times when it is less than that rate.12 On the other hand, the other tests clearly indicate that the forecasts had value and do not reject the unbiasedness hypothesis.

12 We did not calculate the Pesaran–Timmerman statistic in our analysis. Because it is only asymptotically equivalent to the Chi-square statistic, it might not have yielded the same result and/or conclusion.
4.3. Summary of results

The analysis has shown that none of the interest rate forecasts had predictive value. The results obtained from all of the tests were consistent. On the other hand, the GNP/GDP forecasts were all considered to be valuable. The results from the independence tests yielded the only contradictory evidence. However, it is known that this test has lower power than the other procedures.
5. Conclusion: significance for forecast diagnostics

This paper has shown that there are at least three approaches for testing whether forecasts have predictive value and would be valuable to users. Our empirical analysis shows that sometimes these tests produce conflicting results. The question then becomes: which test should be used first and in which order should the others be utilized? The test for bias, which requires that α = 0 and β = 1, is a more stringent test than the second of the CM tests, which merely requires that β be significantly greater than zero. If the unbiasedness hypothesis is not rejected, the forecasts are obviously useful, and it is not necessary to undertake the other procedure. Clearly, if the forecasts are rational, they are valuable to the users. If, however, unbiasedness is rejected, then, to test whether the forecasts have predictive value, the CM approach is preferred over the independence test because the former has more power. It gives greater weight to accurately predicting large changes that actually occur. However, the CM procedure does not indicate whether the forecasts were directionally accurate or at what level of dichotomization they became accurate. The independence test is still required to undertake this analysis.
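Read as a procedure, this recommendation amounts to a simple ordering of the diagnostics. The sketch below is ours (the function name and the 5% cut-off are illustrative): test unbiasedness first; if it is rejected, fall back on the Cumby–Modest slope test; and use the independence test only to characterise directional accuracy.

def value_diagnostics(p_unbiased, p_cm_slope, p_independence, cutoff=0.05):
    """Combine the three p-values in the order recommended above (illustrative).

    p_cm_slope is the one-sided p-value that beta > 0 in Eq. (1a).
    """
    if p_unbiased >= cutoff:
        value = "unbiased, hence valuable"           # rationality not rejected
    elif p_cm_slope < cutoff:
        value = "biased, but has predictive value (CM test)"
    else:
        value = "no evidence of predictive value"
    direction = ("directionally accurate"
                 if p_independence < cutoff
                 else "no better than the no-change model at directions")
    return value, direction

# Example: these hypothetical p-values would label the forecasts valuable but
# not demonstrably better than a no-change rule at calling directions.
print(value_diagnostics(p_unbiased=0.42, p_cm_slope=0.001, p_independence=0.24))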
References

Ash, J. C. K., Smyth, D. J., & Heravi, S. M. (1998). Are OECD forecasts rational and useful? A directional analysis. International Journal of Forecasting, 14, 381–391.
Cumby, R. E., & Modest, D. M. (1987). Testing for market timing ability: a framework for forecast evaluation. Journal of Financial Economics, 19, 169–189.
Henriksson, R. D., & Merton, R. C. (1981). On market timing and investment performance 2: statistical procedures for evaluating forecasting skills. Journal of Business, 54, 513–533.
Kolb, R. A., & Stekler, H. O. (1996). The accuracy of interest rate forecasts. Journal of Forecasting, 15.
Merton, R. C. (1981). On market timing and investment performance 1: an equilibrium theory of value for market forecasts. Journal of Business, 54, 363–406.
Pesaran, M. H., & Timmerman, A. (1992). A simple nonparametric test of predictive performance. Journal of Business and Economic Statistics, 10, 461–465.
Schnader, M. H., & Stekler, H. O. (1990). Evaluating predictions of change. Journal of Business, 63, 99–107.
Stekler, H. O. (1994). Are economic forecasts valuable? Journal of Forecasting, 13, 495–505.

Biographies: H.O. STEKLER is currently a Research Professor of Economics at The George Washington University and is an Associate Editor of the International Journal of Forecasting. He has been actively engaged in many aspects of forecasting research for a number of years. His primary interest is in the evaluation of forecasts.

G. PETREI is currently associated with Charles River Associates and previously was a student at The George Washington University.