Economics Letters 40 (1992) 127-133. North-Holland

A Stein rule estimator which shrinks towards the ridge regression estimator

Stephen I. Harewood
University of the West Indies, Bridgetown, Barbados

Received 24 June 1991
Accepted 15 June 1992
In this paper the Stein rule estimator which shrinks towards the stochastic linear hypothesis estimator is introduced and its risk function is used to evaluate the risk performance of the Stein rule estimator which shrinks towards the ridge regression estimator.
1. Introduction

Faced with the problem of multicollinearity, economists have adopted a number of ad hoc estimators which do not always guarantee risk improvement over OLS. One such estimator is the ridge regression estimator [see Hoerl and Kennard (1970a, b)], for which the region of the parameter space over which there is improvement on the OLS estimator depends on the unknown parameters. The Stein rule [see Judge and Bock (1978, ch. 10)], on the other hand, leads to estimators which, under certain conditions, dominate OLS.

This paper addresses the question of combining the ridge regression and Stein rule estimators to produce an estimator which has the shrinkage properties of the ridge regression estimator but which is superior to the OLS estimator. The issue is analysed within the context of the Stein rule estimator which shrinks towards the stochastic linear hypothesis estimator.
2. The risk function of the estimator

Consider the general linear regression model y = Xβ + e, where y is a (T × 1) vector of observations on the dependent variable; X is a (T × K) non-stochastic design matrix of rank K; β is a (K × 1) vector of unknown parameters; and e is a (T × 1) vector of random disturbances, distributed N(0, σ²I). Assume that, in addition to the sample information contained in this model, stochastic prior information exists about β in the form

r = Rβ + v,  (1)

Correspondence to: Stephen Harewood, Department of Economics, University of the West Indies, P.O. Box 64, Bridgetown, Barbados.
0165-1765/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved
where r is an observable random vector, R is a known (J × K) matrix of rank J, and v is a (J × 1) unobservable, normally distributed random vector with mean η and covariance matrix σ²Ω, with Ω known. Define the risk of estimating the parameter vector β by an estimator β* as

ρ(β*, β) = E[(β* − β)′Q(β* − β)]/σ²,  (2)
where Q is a positive definite symmetric weighting matrix. Let

h_i(n) = (χ²_(T−K))^i / (χ²_(n,γ))^i,  i = 1, 2,

where χ²_(T−K) represents a central chi-square random variable with (T − K) degrees of freedom, χ²_(n,γ) is a noncentral chi-square random variable with n degrees of freedom and noncentrality parameter γ, and χ²_(T−K) and χ²_(n,γ) are independently distributed. Following Judge and Bock (1978), this stochastic linear hypothesis model can be written as the following model with exact linear restrictions:
(y; r) = [X 0; 0 I_J](β; θ) + (e; v − η),  (3)

where θ = Rβ + η, together with the restriction

(−R I_J)(β; θ) − η = 0.  (4)

With appropriate substitutions, eqs. (3) and (4) can be written as

w = Zφ + u,  (5)

Pφ = 0.  (6)
The restricted least squares estimator of φ in this model is

φ* = φ̂ − (Z′Z)⁻¹P′[P(Z′Z)⁻¹P′]⁻¹Pφ̂,  (7)

where

φ̂ = (Z′Z)⁻¹Z′w  (8)

is the ordinary least squares estimator. The Stein rule estimator which shrinks towards the restricted least squares estimator given in eq. (7) can be defined as

δ = φ* + [1 − as/(φ̂′P′[P(Z′Z)⁻¹P′]⁻¹Pφ̂)](φ̂ − φ*),  (9)
where

s = (w − Zφ̂)′(w − Zφ̂),

0 ≤ a ≤ 2[tr(V)/v_L − 2]/(T − K + 2),  (10)

and

V = [P(Z′Z)⁻¹P′]⁻¹P(Z′Z)⁻¹Q(Z′Z)⁻¹P′,  (11)
and v_L denotes the largest eigenvalue of V. Following Hill and Ziemer (1984), the risk function for the estimator δ can be written as

ρ(δ, φ) = tr[(Z′Z)⁻¹Q] − tr(V)(2aE[h₁(m + 2)] − a²E[h₂(m + 2)])
          + (φ′P′VPφ/σ²)(2a(E[h₁(m + 2)] − E[h₁(m + 4)]) + a²E[h₂(m + 4)]),  (12)

where m denotes the number of restrictions in eq. (6). For a given value of γ = c, the maximum and minimum values of ρ(δ, φ) are obtained by setting φ′P′VPφ/σ² equal to 2cα_L and 2cα_s, where α_L and α_s are, respectively, the largest and smallest eigenvalues of

A = [P(Z′Z)⁻¹P′]^(−1/2)P(Z′Z)⁻¹Q(Z′Z)⁻¹P′[P(Z′Z)⁻¹P′]^(−1/2)  (13)

[see Hill and Ziemer (1984)].
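The quantities in eqs. (7)-(11) are straightforward to compute. The sketch below uses an orthonormalized Z (so that V reduces to the identity and the bound in eq. (10) is easy to evaluate) together with illustrative dimensions; none of the numerical choices come from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, J = 40, 8, 5                             # illustrative dimensions
Z, _ = np.linalg.qr(rng.normal(size=(T, K)))   # orthonormal columns: Z'Z = I
P = rng.normal(size=(J, K))                    # restriction matrix in P.phi = 0
phi = np.zeros(K)                              # truth satisfying the restriction
w = Z @ phi + rng.normal(size=T)
Q = np.eye(K)

ZtZ_inv = np.linalg.inv(Z.T @ Z)
phi_hat = ZtZ_inv @ Z.T @ w                           # OLS estimator, eq. (8)
M = np.linalg.inv(P @ ZtZ_inv @ P.T)                  # [P(Z'Z)^{-1}P']^{-1}
phi_star = phi_hat - ZtZ_inv @ P.T @ M @ P @ phi_hat  # restricted LS, eq. (7)

s = (w - Z @ phi_hat) @ (w - Z @ phi_hat)             # residual sum of squares
u = phi_hat @ P.T @ M @ P @ phi_hat                   # denominator in eq. (9)
V = M @ P @ ZtZ_inv @ Q @ ZtZ_inv @ P.T               # eq. (11); here V = I_J
v_L = np.max(np.linalg.eigvalsh((V + V.T) / 2))       # largest eigenvalue of V
a_max = 2 * (np.trace(V) / v_L - 2) / (T - K + 2)     # upper bound in eq. (10)
a = 0.5 * a_max                                       # half the upper bound

delta = phi_star + (1 - a * s / u) * (phi_hat - phi_star)   # eq. (9)
print(np.max(np.abs(P @ phi_star)))   # ~0: restricted LS satisfies P.phi = 0
```

With Z′Z = I and Q = I, V is the identity and the bound (10) collapses to 2(J − 2)/(T − K + 2); for a general design, V must be computed from eq. (11) as above.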
Let

δ = (δ₁′, δ₂′)′ and φ = (φ₁′, φ₂′)′,

where δ₁ is the Stein rule estimator of φ₁ = β. The risk of the estimator δ₁ is

E{[(δ₁; δ₂) − (φ₁; φ₂)]′Q*[(δ₁; δ₂) − (φ₁; φ₂)]}/σ².

This can be found by replacing Q in (12) and (13) by

Q* = [Q₁₁ 0; 0 0].

For the case R = I_K considered below, substituting Q* for Q yields

A = [(X′X)⁻¹ + Ω]^(−1/2)(X′X)⁻¹Q₁₁(X′X)⁻¹[(X′X)⁻¹ + Ω]^(−1/2)  (14)

and

V = [(X′X)⁻¹ + Ω]⁻¹(X′X)⁻¹Q₁₁(X′X)⁻¹.  (15)
Consider now the ridge regression estimator of β, which is defined as

b(d) = (X′X + dI)⁻¹X′y,  (16)

where d is a shrinkage constant and

b = (X′X)⁻¹X′y  (17)

is the OLS estimator of β. It has been argued that this estimator is numerically equivalent to the estimator obtained by constraining the linear model with certain stochastic linear restrictions [see Fomby and Johnson (1977), Bacon and Hausman (1974)]. This is accomplished by assuming that certain prior information of the form

r = β + v  (18)

is available. Here r is a vector of prior estimates of β, v is a random vector with E(v) = 0, and E(vv′) = (σ²/d)I_K. Combining the stochastic information with the sample information gives the stochastic restricted least squares estimator

b*(d) = (X′X + dI)⁻¹(X′y + dr).  (19)
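Eq. (19) is simple to check numerically against eq. (16); the data and the value of d below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
T, K, d = 25, 4, 3.0
X = rng.normal(size=(T, K))
y = X @ rng.normal(size=K) + rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)                      # OLS, eq. (17)
ridge = np.linalg.solve(X.T @ X + d * np.eye(K), X.T @ y)  # eq. (16)
r = np.zeros(K)                                            # prior estimates r = 0
b_star = np.linalg.solve(X.T @ X + d * np.eye(K), X.T @ y + d * r)  # eq. (19)

print(np.allclose(ridge, b_star))                  # True: eq. (19) with r = 0 is eq. (16)
print(np.linalg.norm(ridge) < np.linalg.norm(b))   # True: ridge shrinks b
```

The second check illustrates the shrinkage property: in the basis that diagonalizes X′X, each ridge coefficient is the corresponding OLS coefficient scaled by λᵢ/(λᵢ + d) < 1.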
The ridge regression estimator results when r = 0. Let Q₁₁ = I_K and Ω = (1/d)I_K. The expressions for the matrices A and V become

A = P[Λ + (1/d)Λ²]⁻¹P′,  (20)

V = P[Λ + (1/d)Λ²]⁻¹P′.  (21)

Here P is an orthonormal matrix chosen so that

P′X′XP = Λ = diag(λ₁, ..., λ_K).

The largest eigenvalue of A is d/(dλ_S + λ_S²), where λ_S is the smallest eigenvalue of X′X. Likewise, the smallest eigenvalue of A is d/(dλ_L + λ_L²), where λ_L is the largest eigenvalue of X′X.
3. Risk performance of the estimators
Interest centers on the risk performance of the estimator δ₃, which shrinks towards the ridge regression estimator. Comparisons are made between the risk performance of this estimator, on the one hand, and that of the OLS estimator b and the Stein rule estimator δ₁ = (1 − as/b′X′Xb)b [see Judge and Bock (1978, ch. 10)], on the other. For this purpose, several experiments were conducted using X matrices which reflected different patterns and degrees of collinearity. The construction of the design matrices was based on the singular value decomposition: the singular value decomposition of a known X matrix was performed, and its singular values were varied to produce new X matrices with different degrees and patterns of collinearity.

It should be noted that the risks at given values of γ are not strictly comparable, since the noncentrality parameter is different for each estimator and for each value of the shrinkage constant d. It is still possible, however, to gain some insight into the relative risk performances of the estimators by examining their risk functions over their individual parameter spaces. In all of the results presented in this section, Q* = I, σ² = 1, K = 10, and T = 20. The constant a equals one half of its upper bound, unless stated otherwise.

For the results reported in tables 1 and 2, d was held constant while the noncentrality parameter was varied. The degree of collinearity of the X matrix used for the results in table 1 is weak, whereas it is strong for that used in table 2. The results indicate that large risk gains can be achieved by δ₃ over the least squares estimator over much of the parameter space. The risk gains at high degrees of collinearity, in particular, were quite large. These risk gains decreased as the noncentrality parameter was increased.
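The experimental design can be reproduced in outline: take the SVD of a base matrix, replace its singular values to control collinearity, and estimate risks by Monte Carlo. The sketch below departs from the paper's settings in one respect: it uses the weighting Q = X′X (so that the OLS risk is exactly K and the dominance condition for δ₁ takes its standard form) rather than Q* = I. The singular values match table 1, and a = (K − 2)/(T − K + 2), an illustrative choice within the Stein bound:

```python
import numpy as np

rng = np.random.default_rng(3)
T, K = 20, 10
a = (K - 2) / (T - K + 2)          # within the dominance bound for Q = X'X

# Design matrix with prescribed singular values via the SVD of a base matrix.
U, _, Vt = np.linalg.svd(rng.normal(size=(T, K)), full_matrices=False)
sv = np.array([10.0] + [1.0] * (K - 1))    # weak collinearity, as in table 1
X = U @ np.diag(sv) @ Vt
XtX = X.T @ X

beta = np.zeros(K)                 # the origin, where the gains are largest

def risks(n_rep=3000):
    """Monte Carlo risks of OLS b and the Stein rule delta_1 under Q = X'X."""
    ols_loss = stein_loss = 0.0
    for _ in range(n_rep):
        y = X @ beta + rng.normal(size=T)
        b = np.linalg.solve(XtX, X.T @ y)
        s = (y - X @ b) @ (y - X @ b)
        d1 = (1 - a * s / (b @ XtX @ b)) * b   # delta_1 = (1 - as/b'X'Xb)b
        ols_loss += (b - beta) @ XtX @ (b - beta)
        stein_loss += (d1 - beta) @ XtX @ (d1 - beta)
    return ols_loss / n_rep, stein_loss / n_rep

r_ols, r_stein = risks()
print(r_ols)     # ~ K = 10
print(r_stein)   # well below the OLS risk at the origin
```

Replacing the vector of singular values with, say, (10, 0.1, ..., 0.1) reproduces the strongly collinear designs of tables 2-4.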
The difference between the maximum risk bound of δ₃ and the risk of b was very large for small values of the noncentrality parameter, but this risk gain decreased rapidly as the noncentrality parameter was increased. Increasing d resulted in an increase in the gains across the parameter space; this is illustrated in tables 1 and 2 for the cases d = 1 and d = 10. The risk functions for δ₁ and δ₃ achieved their smallest values at and near the origin (γ = 0). At these points, the risk of δ₁ was less than that of δ₃. It would appear that the risk performance of
Table 1
Bounding risk functions for δ₁ and δ₃. Singular values = 10, 1, 1, 1, 1, 1, 1, 1, 1, 1. Risk for b = 9.01.

γ      δ₁              δ₃, d = 1       δ₃, d = 10
       min     max     min     max     min     max
0      3.09    3.10    6.06    6.06    3.64    3.64
1      3.23    4.15    6.12    6.58    3.76    4.60
2      3.39    5.00    6.20    7.01    3.89    5.37
3      3.56    5.69    6.28    7.35    4.04    5.99
4      3.73    6.25    6.36    7.63    4.20    6.50
5      3.91    6.70    6.45    7.86    4.36    6.92
6      4.08    7.08    6.54    8.04    4.51    7.26
7      4.25    7.39    6.62    8.20    4.67    7.54
10     4.73    8.06    6.86    8.53    5.10    8.14
15     5.38    8.63    7.18    8.82    5.69    8.67
20     5.88    8.91    7.43    8.96    6.14    8.92
Table 2
Bounding risk functions for δ₁ and δ₃. Singular values = 10, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1. Risk for b = 900.01.

γ      δ₁                  δ₃, d = 1           δ₃, d = 10
       min      max        min      max        min      max
0      309.38   309.38     315.23   315.23     309.97   309.97
1      322.01   414.00     327.73   419.81     322.58   415.49
2      337.10   499.82     342.66   503.78     337.65   500.22
3      353.60   568.38     359.00   571.67     354.13   568.71
4      370.79   624.16     376.01   626.90     371.30   624.44
5      388.16   669.83     393.21   672.12     388.65   670.07
6      405.37   707.45     410.25   709.36     405.85   707.65
7      422.20   738.62     426.91   740.22     422.65   738.79
10     469.20   804.65     473.44   805.60     469.60   804.75
15     534.21   862.42     537.80   862.79     534.55   862.45
20     584.46   889.87     587.56   889.97     584.75   889.88
δ₃ compares quite favourably with that of δ₁, especially at high degrees of collinearity. In making these comparisons, one should always be mindful of the reservations expressed above about comparing the risk functions of the different estimators.

The results reported in table 3 show how the risk function behaved as d was varied while the noncentrality parameter was held constant. As d was increased, the risk for δ₃ fell and approached a limit. The largest reductions in risk were achieved for large values of d and small values of the noncentrality parameter. For all values of d, the risk gain from use of δ₃ was largest when the noncentrality parameter was equal to zero.

The results shown in table 4 illustrate the sensitivity of the bounding risk functions to changes in the shrinkage constant a. It is impossible to give a rule for choosing the shrinkage constant optimally, since the optimal choice would depend on the values of d and the noncentrality parameter, among other factors. The results, however, indicate how sensitive the risk of δ₃ is to the choice of a. They show that, although the combination of a and d can exert some influence on the risk function, a itself has a considerable amount of independent influence. A better
Table 3
Bounding risk function for δ₃ as d is varied. Singular values = 10, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1. Risk for b = 900.01.

d      γ = 0      γ = 1               γ = 20
                  min      max        min      max
0.1    361.32     374.55   459.10     613.12   890.79
0.5    320.60     333.34   424.52     590.62   890.06
1.0    315.05     327.73   419.80     587.56   889.97
2.0    312.23     324.88   417.42     586.00   889.92
3.0    311.29     323.93   416.62     585.48   889.90
4.0    310.81     323.45   416.22     585.22   889.89
5.0    310.53     323.16   415.97     585.06   889.89
10.0   309.96     322.58   415.49     584.75   889.88
15.0   309.77     322.39   415.33     584.65   889.87
20.0   309.67     322.30   415.25     584.59   889.87
Table 4
Risks for δ₁ and δ₃ for various values of a. The noncentrality parameter = 1. Singular values = 10, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1. a_max = maximum value of a.

a          δ₁                  δ₃, d = 1           δ₃, d = 10
           min      max        min      max        min      max
0.1a_max   721.95   731.08     723.71   732.75     722.13   731.25
0.2a_max   575.11   598.11     578.34   601.10     575.44   598.42
0.3a_max   459.51   501.11     463.88   505.06     459.96   501.51
0.4a_max   375.14   440.07     380.34   444.63     375.67   440.54
0.5a_max   322.01   415.00     327.73   419.81     322.58   415.49
0.6a_max   300.10   425.90     306.04   430.59     300.70   426.37
0.7a_max   309.43   472.76     315.26   476.99     310.01   473.18
0.8a_max   349.99   555.59     355.41   558.99     350.52   555.92
0.9a_max   421.78   674.38     426.48   676.60     422.23   674.59
a_max      524.80   829.14     528.47   829.82     525.14   829.19
appreciation of this can be obtained by observing the similarity in behaviour of the risk functions when d = 1 and d = 10. Varying the shrinkage constant also revealed the similarity in behaviour of the risk functions of δ₃ and δ₁. Although the risk function of δ₃ again compared quite favourably with that of δ₁, no value of d was found for which δ₃ had a smaller risk than δ₁ for all values of a and the noncentrality parameter.
4. Conclusion

The aim of this paper was to show how ridge regression and the Stein rule can be combined to give an estimator which dominates OLS. It was shown that large risk gains can be obtained over the OLS estimator by employing the Stein rule estimator which shrinks towards the ridge regression estimator, δ₃; these risk gains were largest for large values of the shrinkage constant d and small values of the noncentrality parameter. The risk performance of the estimator δ₃ also compared quite favourably with that of the Stein rule estimator which shrinks the OLS estimator towards the origin, δ₁, but the latter estimator seems to offer slightly greater risk reduction.
References

Bacon, R.W. and J.A. Hausman, 1974, The relationship between ridge regression and the minimum mean squared error estimator of Chipman, Oxford Bulletin of Economics and Statistics 36, 115-124.
Fomby, T.B. and S.R. Johnson, 1977, MSE evaluation of ridge estimators based on stochastic prior information, Communications in Statistics, Series A 6, 1245-1258.
Hill, R.C. and R.F. Ziemer, 1984, The risk of the general Stein-like estimator in the presence of multicollinearity, Journal of Econometrics 25, 205-216.
Hoerl, A.E. and R.W. Kennard, 1970a, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12, 55-67.
Hoerl, A.E. and R.W. Kennard, 1970b, Ridge regression: Applications to nonorthogonal problems, Technometrics 12, 69-82.
Judge, G.G. and M.E. Bock, 1978, The statistical implications of pre-test and Stein-rule estimators in econometrics (North-Holland, Amsterdam).