Computational Statistics & Data Analysis 5 (1987) 443-450
North-Holland
Estimation criterion, residuals and prediction evaluation

Niels KÆRGÅRD
Institute of Economics, University of Copenhagen, Copenhagen, Denmark

Received May 1987
Abstract: A relation which determines the development of Danish investments is estimated with yearly data from the period 1878-1970. The estimation is carried out by minimizing the sum of the absolute residuals, the sum of the squared residuals, and the maximum absolute residual. The parameters and especially the distribution of the residuals for the three types of estimators are examined. The relation is then estimated using only the even years, and the resulting relations are used to predict the investments in the uneven years. These predictions are used to evaluate the different estimation methods.
Keywords: Estimation criterion, Distribution of residuals, Prediction evaluation.
1. Introduction

For a long time the least squares method has totally dominated the theory of estimation of, for instance, economic relations. It can be motivated either by maximum likelihood estimation with normally distributed errors or by the Gauss-Markov theorem. This theorem has been given such a dominating position that it is fair to write: “The elementary point that there may exist non linear, or for that matter biased, estimators superior to least squares for the non-Gaussian linear model is a well kept secret in most of the econometric literature” (Koenker & Bassett, 1978, p. 35).

In the past twenty years an alternative literature has grown up, however. Among these alternatives there are many basically different methods; see, among others, Mood & Brown (1950), Theil (1950), Sen (1968), Hogg & Randles (1975), Cifarelli (1978) and Koenker & Bassett (1978).¹

¹ Of course one could instead try to fix up the least squares method by removing some of the most abnormal observations, a method dating back to Laplace (see Stigler, 1973), or by weighting up the most normal observations (see Andrews, 1974).

0167-9473/87/$3.50 © 1987, Elsevier Science Publishers B.V. (North-Holland)

The most obvious method is to minimize the distance between the observed y and the estimated ŷ (= βx if the model is linear). This distance could be measured in many ways, but the most common would be to use the p-norm

\[ L_p = \Bigl( \sum_i |y_i - \beta x_i|^p \Bigr)^{1/p}. \tag{1} \]

Three values of p are especially relevant (1, 2 and ∞):

\[ L_1 = \sum_i |y_i - \beta x_i|, \tag{2} \]

\[ L_2 = \Bigl( \sum_i (y_i - \beta x_i)^2 \Bigr)^{1/2}, \tag{3} \]

\[ L_\infty = \max_i |y_i - \beta x_i|. \tag{4} \]
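The three criteria can be illustrated in the simplest possible setting. The sketch below is not taken from the paper: it assumes a model with a single constant b (so the residuals are y_i - b) and uses invented data. In that special case the minimizers are known to be the median (L_1), the mean (L_2) and the midrange (L_∞). The function names `lp_criteria` and `location_estimates` are hypothetical.

```python
import statistics


def lp_criteria(y, b):
    """Return (L1, L2, L_inf) of the residuals y_i - b."""
    residuals = [yi - b for yi in y]
    l1 = sum(abs(r) for r in residuals)
    l2 = sum(r * r for r in residuals) ** 0.5
    l_inf = max(abs(r) for r in residuals)
    return l1, l2, l_inf


def location_estimates(y):
    """Minimizers of the three criteria when the model is a single constant."""
    return {
        "L1": statistics.median(y),       # minimizes the sum of absolute residuals
        "L2": statistics.mean(y),         # minimizes the sum of squared residuals
        "Linf": (min(y) + max(y)) / 2.0,  # midrange minimizes the largest residual
    }


y = [1.0, 2.0, 3.0, 4.0, 10.0]  # invented data with one outlying value
estimates = location_estimates(y)
for name, b in estimates.items():
    print(name, b, lp_criteria(y, b))
```

With one outlying observation the three estimates differ noticeably, and each one gives the smallest value of its own criterion among the three.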
All three of these criteria are theoretically well analyzed; least squares is examined in every textbook, and absolute residuals are outlined in, among others, Taylor (1974), Blattberg & Sargent (1971), Narula & Wellington (1982) and Bloomfield & Steiger (1983). Minimax estimation is substantially less analyzed in the statistical literature, but the methods are discussed in Rice & White (1964) and Harter (1981). In some technical problems where the determination of the parameters is a matter of decision more than of estimation (one has to choose the values of the parameters in an antenna construction so that the relevant radio stations come through with minimal noise) this method could be relevant; see Madsen, Schjaer-Jacobsen & Voldby (1975) and Madsen, Nielsen, Schjaer-Jacobsen & Thrane (1975).

The general conclusion is that the choice of method depends crucially on the distribution of the errors. With normally distributed errors least squares is the maximum likelihood method; with a double exponential distribution (a Laplace distribution),

\[ f(u) = \tfrac{1}{2} a e^{-a|u|}, \tag{5} \]

the absolute residuals method is the maximum likelihood estimate; and with a rectangular error distribution the minimax method is optimal.²

² Besides these pure L_p-methods, combinations have been used in which the well-behaved residuals are squared, whereas the absolute value is used for the outliers. These methods were developed by Huber (1964, 1972, 1973) and analyzed by, among others, Ekblom (1974); he concludes that the Huber estimation is superior to all the pure L_p-methods.

Computationally none of these estimation methods gives any problems. There are efficient algorithms for calculation of all the estimates; see Ekblom (1974), Kennedy & Gentle (1977), Bloomfield & Steiger (1983) and Madsen (1985). Several Monte Carlo studies have compared the different methods; Glahe & Hunt (1970), Blattberg & Sargent (1971), Smith & Hall (1972) and Bloomfield & Steiger (1983) could be mentioned. The results of the experiments are of course not unambiguous, but it seems fair to say that they are in harmony with the theoretical results in the sense that the error distribution is essential and that the least squares method is vulnerable if the distributions are thick tailed.
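The dependence on the error distribution can be made concrete with a small simulation. The following is only a sketch and does not reproduce any of the cited Monte Carlo studies: it assumes a pure location model with true parameter zero, and the sample size (25), the number of replications (2000) and the seed are arbitrary choices. The median, the mean and the midrange stand in for the L_1, L_2 and L_∞ estimators.

```python
import random
import statistics


def draw(dist, rng):
    """Draw one error term with mean zero from the named distribution."""
    if dist == "normal":
        return rng.gauss(0.0, 1.0)
    if dist == "laplace":
        # The difference of two unit exponentials is Laplace distributed.
        return rng.expovariate(1.0) - rng.expovariate(1.0)
    if dist == "uniform":
        return rng.uniform(-1.0, 1.0)
    raise ValueError(dist)


def mean_squared_errors(dist, n=25, replications=2000, seed=1987):
    """Mean squared error of three estimators of a location parameter equal to zero."""
    rng = random.Random(seed)
    totals = {"L1 (median)": 0.0, "L2 (mean)": 0.0, "Linf (midrange)": 0.0}
    for _ in range(replications):
        sample = [draw(dist, rng) for _ in range(n)]
        totals["L1 (median)"] += statistics.median(sample) ** 2
        totals["L2 (mean)"] += statistics.mean(sample) ** 2
        totals["Linf (midrange)"] += ((min(sample) + max(sample)) / 2.0) ** 2
    return {name: total / replications for name, total in totals.items()}


results = {dist: mean_squared_errors(dist) for dist in ("normal", "laplace", "uniform")}
for dist, mse in results.items():
    best = min(mse, key=mse.get)
    print(f"{dist:8s} best: {best:16s}", {k: round(v, 4) for k, v in mse.items()})
```

Under these assumptions one would expect each estimator to do best under the error distribution for which it is the maximum likelihood method, which is the pattern the theoretical results describe.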
In econometrics the different methods have been used for some actual investigations; see Meyer & Glauber (1964), Sharpe (1971) and Fama (1965). Sharpe discussed dividends and found only little difference between absolute error regression and least squares estimation. Meyer & Glauber's investigation of investment is probably the most profound comparison of the methods; they found the method of least absolute residuals optimal for out-of-sample prediction even if mean squared prediction error was used as the criterion.³

Because the result is so dependent on the distribution of the error terms, it would be natural to investigate this distribution, and many have tried; see Granger & Orr (1972), Carlson (1975) and perhaps in particular Zeckhauser & Thompson (1970). Zeckhauser & Thompson start from a very general distribution

\[ f(u) = K(\sigma, \theta)\, e^{-|(u - \xi)/\sigma|^{\theta}}, \tag{6} \]

where σ, ξ and θ are parameters. The parameter θ is estimated along with the other parameters, and the value of θ characterizes the type of distribution; for θ = 2 it is normal. For four different empirical examples they found very small θ-values (less than one), which would indicate that the distributions are very thick tailed. The result is disputed by Sims (1971) and Mandelbrot (1971); they argued that the reason for this is heteroscedasticity.

Thus a generally accepted conclusion has not yet been reached and further investigation seems necessary. But in order to investigate these problems one needs very big samples (i.e. for time series a long observation period), partly because one should have many observations before it is possible to distinguish between the different types of distributions with any certainty, and partly because an evaluation of the model's prediction capability has to be made in another sample than the estimation sample (it is of course trivial that the least squares estimation minimizes the mean squared prediction error in the estimation sample, and so on). This is one of the reasons why most of the econometric studies of these problems treat special time series (such as dividends) for which there are frequent observations. In the next sections, however, an attempt is made with ordinary national accounts series.
³ Jorgenson (1966) has claimed in a review of Meyer & Glauber's book that the reason was a specification error.

2. Empirical tests: The parameters and residuals distribution

The tests will be made with one of the relations from the Danish growth model CLEO. This is an econometric model estimated on yearly data from 1878-1970 except the war periods 1915-21 and 1940-1949 (which gives 75 observations). The model is described in Rasmussen & Kærgård (1981). It is a two-sector model with a rural and an urban sector, where the model for the urban sector is a rather traditional Keynesian model and the rural model a more supply-side model. The total model consists of 21 estimated relations and 27 identities and equilibrium conditions. The model was originally estimated with ordinary least squares (OLS).

Of the relations, the investment function for the urban sector is chosen for this investigation. It is a rather well-behaved relation with no distinct sign of specification error or heteroscedasticity. The relation is a logarithmic version of a stock adjustment relation, i.e.

\[ \frac{K_t}{K_{t-1}} = \left[ \frac{K_t^*}{K_{t-1}} \right]^{\alpha}, \tag{7} \]

where K_t is the capital stock at time t, and K_t^* is the desired stock. α is an adjustment parameter between 0 and 1. The desired stock K_t^* is assumed to be a function of the aggregated production Y_t,

\[ K_t^* = \gamma Y_t^{\delta}, \tag{8} \]

where γ and δ are positive parameters. (7) and (8) give a relation between the observable variables,

\[ \log K_t = a + b \log Y_t + c \log K_{t-1}, \tag{9} \]

where a = α log γ, b = αδ and c = (1 − α). In the model the relation is estimated with OLS using first differences:

\[ \Delta \log K_t = 0.00013 + 0.127\, \Delta \log Y_t + 0.841\, \Delta \log K_{t-1}, \tag{10} \]

N = 75, R² = 0.83, D.W. = 2.41. The accompanying figures in parentheses are (17.59), (0.72) and (4.80).
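The step from (7) and (8) to (9) can be written out by taking logarithms. The lines below are a worked version of that step, writing Y_t for aggregated production and α, γ, δ for the parameters of (7) and (8); nothing is assumed beyond those two relations.

```latex
\begin{align*}
\log K_t - \log K_{t-1} &= \alpha\,(\log K_t^* - \log K_{t-1}) && \text{(logarithm of (7))} \\
\log K_t &= \alpha \log K_t^* + (1-\alpha)\log K_{t-1} \\
         &= \alpha\,(\log\gamma + \delta\log Y_t) + (1-\alpha)\log K_{t-1} && \text{(inserting (8))} \\
         &= \underbrace{\alpha\log\gamma}_{a} + \underbrace{\alpha\delta}_{b}\,\log Y_t
            + \underbrace{(1-\alpha)}_{c}\,\log K_{t-1}.
\end{align*}
```

Read in the other direction, the structural parameters follow from the estimated ones as α = 1 − c and δ = b/α. As an illustration only, c = 0.841 and b = 0.127 would correspond to α = 0.159 and δ ≈ 0.80.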
A Bartlett test for different variances in the three parts of the estimation period gives a χ²-value equal to 1.60, which should be compared with a χ²-distribution with 2 degrees of freedom; there is no indication of heteroscedasticity.

The relation is now estimated by minimizing the sum of absolute residuals and by minimizing the maximum residual.⁴ The parameter estimates from all three estimations are summarized in Table 1. The differences are not very big, and all of the results are reasonable from an economic point of view. The changes in the estimates are systematic: for all three parameters the OLS estimate lies between the two others.

⁴ The algorithm used is taken from Barrodale & Roberts (1974) for the absolute residuals regression and from Madsen (1975) for the minimax estimation.

Table 1
Estimated parameters

Estimation method                      â          b̂        ĉ
Minimum sum of absolute residuals     -0.0007     0.131     0.874
The least squares method               0.0013     0.127     0.841
Minimum of the maximum residual        0.0090     0.079     0.732

More interesting is the distribution of the residuals from the three estimated relations, shown in Fig. 1. It is evident that there is a tendency for the distribution of the residuals to be biased towards the optimal error distribution for the estimation method: if one uses absolute errors regression the distribution of the residuals is rather thick tailed, if OLS is used the distribution looks very normal, and when minimax estimation is used the distribution has no distinct peak.

Fig. 1. The distribution of the residuals, estimated with (a) the minimax method, (b) the least squares method, (c) the absolute residuals method.

The clear conclusion of the theoretical literature about which method should be used is that it depends on the error distribution. But this investigation seems to indicate that one should be careful not to draw too strong conclusions from inspection of the empirical residuals. They have a bias towards acceptance of the basic assumption behind the estimation method used.
3. Prediction capability

To evaluate the estimates it is of course impossible to use the estimation sample (the results there are trivial), and the experiment of estimating relation (9) with the three examined methods is therefore repeated using only the even years 1878, 1880, 1882, .... The estimated relations are then used to predict log K_t for the uneven years 1879, 1881, 1883, .... The result could of course be evaluated with different criteria, and here the three parallels to the estimation methods are used. The result is shown in Table 2.

Table 2
Prediction residuals for the log-linear relation

                                       Criterion for evaluation of the prediction errors
Estimation method                      Mean absolute error   Root mean squared error   Maximum error
Minimum sum of absolute residuals      0.00667               0.00902                   0.0247
The least squares method               0.00655               0.00871                   0.0208
Minimum of the maximum residual        0.00888               0.01164                   0.0306

It is seen that OLS is best independent of the evaluation criterion, but one should hardly draw too strong conclusions from this. The relation is the result of different experiments when the model was built, and these were done with OLS. It is then clear that the relation chosen for the model (used here) is one which is well-behaved in OLS estimation.

The same experiment is therefore repeated with an alternative version, a linear rather than a log-linear relation (there are no reasons from economic theory why one should be preferred to the other). The results from estimation of

\[ \Delta K_t = a + b\, \Delta Y_t + c\, \Delta K_{t-1} \]

instead of (9) are shown in Table 3.

Table 3
Prediction errors for the linear relation

                                       Criterion for evaluation of the prediction errors
Estimation method                      Mean absolute error   Root mean squared error   Maximum error
Minimum sum of absolute residuals      89.76                 138.8                     364
The least squares method               93.7                  140.9                     365
Minimum of the maximum residual        160.6                 186.6                     428

With the linear specification it is the minimum sum of absolute errors regression which is best independent of the criterion. There is, however, also a good explanation for this: the linear relation suffers from heteroscedasticity, and it is known that the minimum sum of absolute errors regression is less sensitive to this than ordinary least squares.
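The evaluation procedure of this section can be sketched in a few lines. The code below is an illustration under stated assumptions, not the program used for the paper: the function names are hypothetical, the split ignores the excluded war periods, and the dictionaries simply hold the values as read from Table 2 and Table 3 in the order (mean absolute error, root mean squared error, maximum error).

```python
def prediction_criteria(actual, predicted):
    """Return (mean absolute error, root mean squared error, maximum error)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = (sum(e * e for e in errors) / n) ** 0.5
    max_error = max(abs(e) for e in errors)
    return mae, rmse, max_error


def split_even_uneven(years):
    """Even years form the estimation sample, uneven years the prediction sample."""
    even = [y for y in years if y % 2 == 0]
    uneven = [y for y in years if y % 2 == 1]
    return even, uneven


# Values as read from Table 2 and Table 3: (MAE, RMSE, maximum error) per method.
TABLE_2 = {
    "absolute residuals": (0.00667, 0.00902, 0.0247),
    "least squares": (0.00655, 0.00871, 0.0208),
    "minimax": (0.00888, 0.01164, 0.0306),
}
TABLE_3 = {
    "absolute residuals": (89.76, 138.8, 364),
    "least squares": (93.7, 140.9, 365),
    "minimax": (160.6, 186.6, 428),
}


def best_methods(table):
    """For each of the three criteria, name the method with the smallest value."""
    return [min(table, key=lambda m: table[m][k]) for k in range(3)]


even, uneven = split_even_uneven(range(1878, 1971))
print(even[:3], uneven[:3])
print(best_methods(TABLE_2))
print(best_methods(TABLE_3))
print(prediction_criteria([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0]))
```

Ranking the tabulated values this way gives the same method for all three criteria within each table, which is the observation the text draws on.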
4. Conclusion

In this article only a few experiments are carried out, but nevertheless it seems possible to draw some conclusions. First, one should be very careful when using the directly observed residuals to draw conclusions about the distribution of the errors. The empirical residuals seem to be very much influenced by the estimation method used, and an incorrect acceptance of the distributional assumption behind the estimation method used often seems very likely.

Secondly, the criterion used for evaluation of the prediction errors seems less important. In the experiments of this paper the same estimates are best independent of which evaluation criterion is used. An argument saying that if one evaluates prediction errors with a quadratic loss function one should use least squares (and so on) seems not to be in harmony with the empirical results; on the contrary, it seems possible in many situations to find estimates which are optimal independent of which loss function is used for evaluation of the prediction errors.

There is, however, no foundation in the results of this paper for saying which estimation method should be used. It clearly depends on the data generating process, and a considerable amount of experience should be collected before it is possible in a practical situation, where the distribution of the errors is unknown, to say anything about which of the different estimation methods one should use.
References

D.F. Andrews (1974), A robust method for multiple linear regression, Technometrics 16, 523-531.
I. Barrodale and F.D.K. Roberts (1974), Solution of an overdetermined system of equations in the L_1 norm, Comm. ACM 17, 319-320.
R. Blattberg and T. Sargent (1971), Regression with non-Gaussian stable disturbances: Some sampling results, Econometrica 39, 501-510.
P. Bloomfield and W.L. Steiger (1983), Least Absolute Deviations: Theory, Applications, and Algorithms (Birkhäuser, Basel-Boston).
J.A. Carlson (1975), Are price expectations normally distributed? J. Amer. Statist. Assoc. 70, 749-754.
D.M. Cifarelli (1978), La stima del coefficiente di regressione mediante l'indice di Gini, Riv. Mat. Sci. Econom. Social. 1, 7-38.
H. Ekblom (1974), L_p-methods for robust regression, BIT 14, 22-32.
E. Fama (1965), The behavior of stock market prices, J. Business, 34-105.
F. Glahe and J.G. Hunt (1970), The small sample properties of simultaneous equation least absolute estimators vis-a-vis least squares estimators, Econometrica 38, 742-753.
C.W.J. Granger and D. Orr (1972), 'Infinite variance' and research strategy in time series analysis, J. Amer. Statist. Assoc. 67, 275-285.
H.L. Harter (1981), Method of least p-th powers, in: Encyclopedia of Statistical Science 5, 464-467.
H.L. Harter (1981), Minimax method, in: Encyclopedia of Statistical Science 5, 514-516.
R.V. Hogg and R.H. Randles (1975), Adaptive distribution-free regression methods and their applications, Technometrics 17, 399-407.
P.J. Huber (1964), Robust estimation of a location parameter, Ann. Math. Statist. 35, 73-101.
P.J. Huber (1972), Robust statistics: A review, Ann. Math. Statist. 43, 1041-1067.
P.J. Huber (1973), Robust regression: Asymptotics, conjectures, and Monte Carlo, Ann. Statist. 1, 799-821.
D.W. Jorgenson (1966), Review of Meyer and Glauber, J. Political Econom. 74, 99-100.
W.J. Kennedy and J.E. Gentle (1977), Comparisons of algorithms for minimum L_p norm linear regression, Tenth Annual Symposium on the Interface, Gaithersburg, MD.
N. Kærgård (1979), Alternativer til mindste kvadratets methode, in: Jensen and Hersskuldsson and others (Eds.), Symposium i Anvendt Statistik, Copenhagen.
R. Koenker and G. Bassett, Jr. (1978), Regression quantiles, Econometrica 46, 33-50.
K. Madsen (1975), An algorithm for minimax solution of overdetermined systems of non-linear equations, J. Inst. Math. and Appl. 16, 321-328.
K. Madsen, H. Schjaer-Jacobsen and J. Voldby (1975), Automated minimax design of networks, IEEE Trans. Circuits and Systems 22, 791-796.
K. Madsen, O. Nielsen, H. Schjaer-Jacobsen and L. Thrane (1975), Efficient minimax design of networks without using derivatives, IEEE Trans. Microwave Theory and Techniques 23, 803-809.
K. Madsen (1985), Minimization of Non-linear Approximation Functions (Copenhagen).
B. Mandelbrot (1971), Linear regression with non-normal error terms: A comment, Rev. Econom. Statist. 53, 205-206.
J.R. Meyer and R.R. Glauber (1964), Investment Decisions, Economic Forecasting, and Public Policy (Cambridge, MA).
A.M. Mood and Brown (1950), Introduction to the Theory of Statistics (McGraw-Hill, New York) (especially pp. 406-410).
S.C. Narula and J.F. Wellington (1982), The minimum sum of absolute errors regression: A state of the art survey, Internat. Statist. Rev. 317-326.
T.V. Rasmussen and N. Kærgård (1981), A growth model of the Danish economy, in: Janssen, Pau and Straszak (Eds.), Dynamic Modelling and Control of National Economies (Oxford).
J.R. Rice and J.S. White (1964), Norms for smoothing and estimation, SIAM Rev. 6, 243-56.
P.K. Sen (1968), Estimates of the regression coefficient based on Kendall's tau, J. Amer. Statist. Assoc. 63, 1379-89.
W.F. Sharpe (1971), Mean-absolute-deviation characteristic lines for securities and portfolios, Management Sci. 18B, 1-13.
C.A. Sims (1971), Linear regression with non-normal error terms: A comment, Rev. Econom. Statist. 53, 204-205.
V.K. Smith and T.W. Hall (1972), A comparison of maximum likelihood versus BLUE estimators, Rev. Econom. Statist. 54, 186-190.
S.M. Stigler (1973), Simon Newcomb, Percy Daniell and the history of robust estimation 1885-1920, J. Amer. Statist. Assoc. 68, 872-79.
L.D. Taylor (1974), Estimation by minimizing the sum of absolute errors, in: Zarembka (Ed.), Frontiers in Econometrics (Academic Press, New York-London).
H. Theil (1950), A rank-invariant method of linear and polynomial regression analysis, I, II and III, Nederl. Akad. Wetensch. Proc. 53, 386-392, 521-525 and 1397-1412.
R. Zeckhauser and M. Thompson (1970), Linear regression with non-normal error terms, Rev. Econom. Statist. 52, 280-286.