Economics Letters 7 (1981) 319-322 North-Holland Publishing Company
LOSS OF EFFICIENCY IN REGRESSION ANALYSIS DUE TO IRRELEVANT VARIABLES
A Generalization *
Thomas B. FOMBY
Southern Methodist University
Received 18 May 1981
This note extends the specification error analysis of irrelevant variables in a regression model. It is shown that the loss of efficiency in estimating relevant regression coefficients is directly related to the magnitudes of the canonical correlations between the sets of relevant and irrelevant variables.
1. Introduction
Specification error analysis [Theil (1957)] is a very important topic in econometric theory. It is well known that the omission of relevant variables from a regression model yields biased and inconsistent estimates of the regression coefficients unless the omitted variables are orthogonal to the included variables. On the other hand, the inclusion of irrelevant variables allows unbiased and consistent estimation. For this reason some practitioners prefer to ‘overfit’ their models. For example, Johnston (1972, p. 169) suggests, ‘Data and degrees of freedom permitting, one should err on the side of including variables in the regression analysis rather than excluding them’. This prescription, however, heavily discounts the loss of efficiency in the estimation of the relevant coefficients. The loss of efficiency due to the inclusion of irrelevant variables has been analyzed previously but only to a limited extent. Kmenta (1971, pp. 396-398), for example, examines the estimation of the regression model,
$$ y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i, \qquad (1) $$
* Written while on leave at Department of Economics, University of Georgia, Athens, GA.
0165-1765/81/0000-0000/$02.75 © 1981 North-Holland
where the fact that $\beta_3 = 0$ is ignored. Let $\hat{\beta}_2$ denote the least squares estimator of $\beta_2$ incorrectly assuming $\beta_3 \neq 0$ and $\tilde{\beta}_2$ denote the ordinary least squares (OLS) estimator correctly assuming $\beta_3 = 0$. Then Kmenta shows that the ratio of the variances is
$$ \frac{\operatorname{var}(\hat{\beta}_2)}{\operatorname{var}(\tilde{\beta}_2)} = \frac{1}{1 - r_{23}^2}, \qquad (2) $$
where $0 \le r_{23}^2 < 1$ and $r_{23}$ is the correlation coefficient of the variables $x_{i2}$ and $x_{i3}$. Since $\operatorname{var}(\hat{\beta}_2)/\operatorname{var}(\tilde{\beta}_2) \ge 1$, there is a loss of efficiency and this loss of efficiency is directly related to the correlation between the relevant variable $x_2$ and the irrelevant variable $x_3$. Our purpose here is to generalize the analysis of efficiency loss to the case of more than one irrelevant variable by means of canonical correlations [Anderson (1958)].
2. The generalization
The model under consideration is
$$ y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon, \qquad (3) $$
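where $y$ is an $(n \times 1)$ vector of observations on the dependent variable, $X_1$ is an $(n \times k_1)$ non-stochastic matrix of observations on the relevant variables, $X_2$ is an $(n \times k_2)$ non-stochastic matrix of observations on the irrelevant variables, $\beta_1$ and $\beta_2$ are conformable coefficient vectors, $\beta_2 = 0$ in truth, and $\varepsilon$ is an $(n \times 1)$ vector of random errors with properties $E\varepsilon = 0$ and $E\varepsilon\varepsilon' = \sigma^2 I_n$. Let $\tilde{\beta}_1 = (X_1'X_1)^{-1}X_1'y$ denote the OLS estimator of $\beta_1$ when correctly assuming $\beta_2 = 0$ and $\hat{\beta}_1 = B_{11}X_1'y + B_{12}X_2'y$ denote the least squares estimator of $\beta_1$ when incorrectly assuming $\beta_2 \neq 0$, with
$$ \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1}, \qquad A_{ij} = X_i'X_j, \qquad (4) $$
and
$$ B_{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}, \qquad B_{12} = B_{21}' = -A_{11}^{-1}A_{12}B_{22}, \qquad B_{22} = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}. \qquad (5) $$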
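The block formulas in (5) can be checked numerically. The following is a minimal sketch on simulated data; the function name `partitioned_blocks`, the sample sizes, and the coefficient values are illustrative assumptions and do not come from the note. It checks that the estimator built from the blocks reproduces the first k1 coefficients of least squares applied to the overfitted model.

```python
import numpy as np

def partitioned_blocks(X1, X2):
    """Return B11, B12, B22 of (X'X)^{-1} for X = [X1 X2], following eq. (5)."""
    A11, A12 = X1.T @ X1, X1.T @ X2
    A21, A22 = A12.T, X2.T @ X2
    B11 = np.linalg.inv(A11 - A12 @ np.linalg.solve(A22, A21))
    B22 = np.linalg.inv(A22 - A21 @ np.linalg.solve(A11, A12))
    B12 = -np.linalg.solve(A11, A12) @ B22
    return B11, B12, B22

# Illustrative simulated data (assumed, not from the paper).
rng = np.random.default_rng(0)
n, k1, k2 = 50, 3, 2
X1 = rng.normal(size=(n, k1))
X2 = 0.5 * X1[:, :k2] + rng.normal(size=(n, k2))          # correlated with X1
y = X1 @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)  # beta_2 = 0 in truth

B11, B12, B22 = partitioned_blocks(X1, X2)
beta1_hat = B11 @ X1.T @ y + B12 @ X2.T @ y

# Reference: least squares on the full (overfitted) design matrix.
X = np.hstack([X1, X2])
full_fit = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta1_hat, full_fit[:k1]))
```

Using `np.linalg.solve` for the inner products avoids forming the inverses of the diagonal blocks explicitly; only the two Schur complements are inverted.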
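The variance-covariance matrix of the OLS estimator $\tilde{\beta}_1$ is $\operatorname{var}(\tilde{\beta}_1) = \sigma^2 (X_1'X_1)^{-1}$, whereas the variance-covariance matrix of the least squares estimator $\hat{\beta}_1$ from the misspecified model is $\operatorname{var}(\hat{\beta}_1) = \sigma^2 B_{11}$. However, since $A_{12}A_{22}^{-1}A_{21}$ is a positive semi-definite matrix, $B_{11} - (X_1'X_1)^{-1}$ is a positive semi-definite matrix and there is a loss of efficiency in using $\hat{\beta}_1$ to estimate $\beta_1$. In the extreme case that $X_1$ and $X_2$ are orthogonal, $X_1'X_2 = 0$, $\hat{\beta}_1 = \tilde{\beta}_1$ and $\operatorname{var}(\hat{\beta}_1) = \operatorname{var}(\tilde{\beta}_1)$. To characterize the more usual case of correlated variables, $X_1'X_2 \neq 0$, we can measure the relative efficiency of the two estimators by the ratio of the generalized variances
$$ \frac{|\operatorname{var}(\hat{\beta}_1)|}{|\operatorname{var}(\tilde{\beta}_1)|}. \qquad (6) $$
Now
$$ |\operatorname{var}(\hat{\beta}_1)| = \sigma^{2k_1}\, |(X_1'X_1)^{-1}| \cdot |I - (X_1'X_1)^{-1}(X_1'X_2)(X_2'X_2)^{-1}X_2'X_1|^{-1}. \qquad (7) $$
Let
$$ W = (X_1'X_1)^{-1}(X_1'X_2)(X_2'X_2)^{-1}X_2'X_1 $$
and
$$ \prod_{i=1}^{k_1} \lambda_i = |I - W|, \qquad (8) $$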
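where the $\lambda_i$ are the eigenvalues of $I - W$ and, hence, the $(1 - \lambda_i)$ are the eigenvalues of $W$. But Hooper (1959, pp. 246-247) has shown that the squares of the canonical correlations, $r_1^2, r_2^2, \ldots, r_p^2$, of the sets of variables $X_1$ and $X_2$ are the eigenvalues of $W$, where $p = \min(k_1, k_2)$; any remaining $k_1 - p$ eigenvalues of $W$ are zero. Therefore,
$$ \frac{|\operatorname{var}(\hat{\beta}_1)|}{|\operatorname{var}(\tilde{\beta}_1)|} = \prod_{i=1}^{p} \frac{1}{1 - r_i^2}. \qquad (9) $$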
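Result (9) can be verified numerically. The sketch below is an illustration under assumed simulated data; the function names and dimensions are hypothetical and not part of the note. It computes the ratio of generalized variances directly from the two covariance matrices and compares it with the product over the squared canonical correlations, obtained as the eigenvalues of W built from raw cross-products, as W is written in the text.

```python
import numpy as np

def generalized_variance_ratio(X1, X2):
    """Ratio of generalized variances, eq. (6); sigma^2 cancels in the ratio."""
    k1 = X1.shape[1]
    X = np.hstack([X1, X2])
    B11 = np.linalg.inv(X.T @ X)[:k1, :k1]
    return np.linalg.det(B11) / np.linalg.det(np.linalg.inv(X1.T @ X1))

def squared_canonical_correlations(X1, X2):
    """The p = min(k1, k2) largest eigenvalues of W."""
    A12 = X1.T @ X2
    W = np.linalg.solve(X1.T @ X1, A12) @ np.linalg.solve(X2.T @ X2, A12.T)
    p = min(X1.shape[1], X2.shape[1])
    eig = np.sort(np.linalg.eigvals(W).real)[::-1]
    return np.clip(eig[:p], 0.0, 1.0)

# Illustrative simulated data (assumed, not from the paper).
rng = np.random.default_rng(1)
n, k1, k2 = 200, 3, 2
X1 = rng.normal(size=(n, k1))
X2 = X1[:, :k2] + rng.normal(size=(n, k2))  # irrelevant but correlated with X1

r2 = squared_canonical_correlations(X1, X2)
lhs = generalized_variance_ratio(X1, X2)
rhs = np.prod(1.0 / (1.0 - r2))
print(np.isclose(lhs, rhs))
```

With k1 = k2 = 1 and both variables measured as deviations from their means, the single eigenvalue of W is the squared simple correlation, so (9) reduces to the two-variable result (2).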
The canonical correlations $r_1, r_2, \ldots, r_p$ represent the degree of collinearity between the set of relevant variables $X_1$ and the set of irrelevant variables $X_2$. When $X_1$ and $X_2$ are orthogonal, all $p$ canonical correlations are zero and the inclusion of the irrelevant variables does not cause a loss of efficiency. When there exists an exact linear relationship between the columns of $X_1$ and $X_2$, $r_1 = 1$ and $\beta_1$ is not estimable via $\hat{\beta}_1$. The relative
inefficiency of $\hat{\beta}_1$ is infinite. In the intermediate case, the greater the number and degree of near exact linear relationships between the columns of $X_1$ and $X_2$, the greater the inefficiency of the least squares estimator $\hat{\beta}_1$ due to the inclusion of the irrelevant variables $X_2$.
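The three cases described above (orthogonal, intermediate, near exact linear dependence) can be illustrated with a small sweep. This is a sketch under an assumed construction, not an example from the note: the columns of `X1` and an auxiliary matrix `Z` are made exactly orthonormal, so that every canonical correlation between `X1` and `X2 = rho * X1 + sqrt(1 - rho^2) * Z` equals `rho`. All names and values are hypothetical.

```python
import numpy as np

def efficiency_loss(X1, X2):
    """Ratio of generalized variances, computed as 1 / |I - W| (eqs. 7 and 8)."""
    A12 = X1.T @ X2
    W = np.linalg.solve(X1.T @ X1, A12) @ np.linalg.solve(X2.T @ X2, A12.T)
    return 1.0 / np.linalg.det(np.eye(X1.shape[1]) - W)

rng = np.random.default_rng(2)
n, k = 500, 2
# Orthonormal columns: the first k and the last k are exactly orthogonal.
Q, _ = np.linalg.qr(rng.normal(size=(n, 2 * k)))
X1, Z = Q[:, :k], Q[:, k:]

losses = {}
for rho in (0.0, 0.5, 0.9, 0.99):
    # By construction every canonical correlation between X1 and X2 is rho.
    X2 = rho * X1 + np.sqrt(1.0 - rho**2) * Z
    losses[rho] = efficiency_loss(X1, X2)
    print(f"rho = {rho:4.2f}  loss = {losses[rho]:.2f}")
```

In this design W equals rho^2 times the identity, so the ratio is (1 - rho^2)^(-k): it is one when the two sets are orthogonal and grows without bound as rho approaches one.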
3. Conclusion
This note has extended the analysis of the loss of efficiency when irrelevant variables are included. The loss of efficiency in estimating relevant coefficients is directly related to the magnitudes of the canonical correlations between the set of relevant variables and the set of irrelevant variables. The greater the number and degree of near linear dependencies between these sets of variables, the greater the loss of efficiency due to the inclusion of irrelevant variables.
References
Anderson, T.W., 1958, An introduction to multivariate statistical analysis (John Wiley, New York).
Hooper, J.W., 1959, Simultaneous equations and canonical correlation theory, Econometrica 27, 245-256.
Johnston, J., 1972, Econometric methods, 2nd ed. (McGraw-Hill, New York).
Kmenta, J., 1971, Elements of econometrics (Macmillan, New York).
Theil, H., 1957, Specification errors and the estimation of economic relationships, Review of the International Statistical Institute 25, 41-51.