Journal of Statistical Planning and Inference 74 (1998) 343–351
On contractions in linear regression

Jürgen Groß

Department of Statistics, University of Dortmund, Vogelpothsweg 87, D-44221 Dortmund, Germany

Received 14 May 1996; received in revised form 4 December 1997
Abstract

In this paper we demonstrate how the concept of a contractive matrix plays its role in linear regression. We review some well-known facts on the outperformance of the ordinary least-squares estimator and combine these with some new results on admissibility of estimators. Moreover, results on linear sufficiency and linear completeness are given. © 1998 Elsevier Science B.V. All rights reserved.

AMS classification: 62J05; 62C15

Keywords: Linear regression model; Contraction; Ordinary least-squares estimator; Admissible estimator; Linear sufficiency
1. Introduction

Let R^{m×n} denote the set of m × n real matrices. The symbols A′, A⁻, A⁺ and R(A) will stand for the transpose, any generalized inverse, the Moore–Penrose inverse and the range, respectively, of A ∈ R^{m×n}. The notation P_A stands for the orthogonal projector on R(A), i.e. P_A = AA⁺. For A, B ∈ R^{n×n} we write A ≤_L B if the difference B − A is a non-negative definite matrix, i.e., B − A = GG′ for some matrix G. The relation ≤_L is known to be a partial ordering in R^{n×n}, the so-called Löwner partial ordering (cf. Löwner, 1934; Baksalary et al., 1989).

A matrix K ∈ R^{m×n} is called a contraction if I_m − KK′ is a non-negative definite matrix. The class of contractions is a very interesting class of matrices. It is easy to see that it contains partial isometries (K′ = K⁺) as well as orthogonal projectors (K = K′ = K²). For any contraction K ∈ R^{n×n} we have R(I_n − KK′) ⊆ R(I_n − K) = R(I_n − K′), cf. Hartwig and Spindelböck (1983).
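As a brief numerical aside (ours, not part of the original paper), the contraction property is easy to check in practice: K is a contraction exactly when its largest singular value is at most 1, i.e. when I_m − KK′ has no negative eigenvalue. A minimal sketch in Python/NumPy, with an arbitrarily chosen projector as example:

```python
import numpy as np

def is_contraction(K, tol=1e-10):
    """K is a contraction iff all singular values of K are <= 1,
    i.e. iff I - K K' is non-negative definite."""
    return np.linalg.svd(K, compute_uv=False).max() <= 1.0 + tol

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
P = A @ np.linalg.pinv(A)          # orthogonal projector onto R(A)
print(is_contraction(P))           # True: projectors are contractions
print(is_contraction(2.0 * P))     # False: largest singular value is 2
```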
In the following we give an application of contractions in statistics. For this, consider the linear regression model

y = Xβ + e,    (1.1)
where X is a given n × p regressor matrix of full column rank, β is a p × 1 vector of unknown parameters and e is an n × 1 vector of unobservable disturbances with E(e) = 0 and Cov(e) = σ²I_n, where σ² > 0 is unknown. To estimate the parameter vector β we introduce the class L of linear estimators

L = {Fy + f : F ∈ R^{p×n}, f ∈ R^p},    (1.2)

where both F and f are non-stochastic. It is well known that under model (1.1) the ordinary least-squares (OLS) estimator β̂ = (X′X)⁻¹X′y = X⁺y is the best linear unbiased (BLU) estimator for β. This means Cov(X⁺y) ≤_L Cov(Fy + f) for any unbiased estimator Fy + f ∈ L, where Fy + f is unbiased for β if and only if FX = I_p and f = 0. However, it is possible to find biased estimators Fy + f ∈ L such that Cov(Fy + f) ≤_L Cov(X⁺y).

In Section 2 we demonstrate that for any admissible estimator Fy we get Cov(Fy) ≤_L Cov(X⁺y). This result is obtained by giving a new characterization of admissible estimators in terms of contractions and by using a simple property of contractive matrices.

In Section 3 we give a necessary and sufficient condition for MMSE(Fy; β) ≤_L MMSE(X⁺y; β) to hold, where MMSE(Fy; β) stands for the mean-square-error matrix of Fy and Fy is assumed to be admissible for β. Here again, we use a property of contractions. In Section 4 we give a new characterization of the class of linearly sufficient and admissible estimators, which coincides with the class of general ridge estimators, as demonstrated by Markiewicz (1996). Moreover, it is shown that any admissible estimator Fy is linearly complete.
2. Covariance matrix outperformance of the OLS estimator and admissibility

The following lemma gives a result due to Baksalary et al. (1992b, p. 121).

Lemma 1. For an estimator Fy ∈ L we have Cov(Fy) ≤_L Cov(X⁺y) if and only if

F = X⁺L₀    (2.1)

for some contraction L₀ ∈ R^{n×n}.
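As an illustration of Lemma 1 (ours, with arbitrarily chosen X and L₀), the difference Cov(X⁺y) − Cov(Fy) = σ²X⁺(I_n − L₀L₀′)(X⁺)′ can be checked numerically to be non-negative definite whenever F = X⁺L₀ for a contraction L₀:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.standard_normal((n, p))                 # full-column-rank regressor matrix
Xp = np.linalg.pinv(X)                          # X^+ = (X'X)^{-1} X'

# Build a contraction L0 by rescaling a random matrix to largest singular value 0.9.
M = rng.standard_normal((n, n))
L0 = 0.9 * M / np.linalg.svd(M, compute_uv=False).max()

F = Xp @ L0                                     # estimator Fy with F = X^+ L0
D = Xp @ Xp.T - F @ F.T                         # (Cov(X^+ y) - Cov(Fy)) / sigma^2
print(np.linalg.eigvalsh(D).min() >= -1e-10)    # True: D is non-negative definite
```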
The class of linear estimators satisfying Eq. (2.1) is a very broad one. However, it seems reasonable to consider only estimators which are admissible for β in the sense of the following definition. Note for this that MSE(Fy + f; β) denotes the mean-square error of Fy + f ∈ L, i.e.

MSE(Fy + f; β) = E[(Fy + f − β)′(Fy + f − β)].    (2.2)
Definition 1. An estimator F₁y + f₁ ∈ L is called admissible for β in the class L if there does not exist F₂y + f₂ ∈ L such that the inequality MSE(F₂y + f₂; β) ≤ MSE(F₁y + f₁; β) holds in all points of the parameter space, being strict in at least one of them.

In the following, admissibility for β is always meant to hold in the sense of the above definition. The class of admissible estimators under model (1.1) is well known, see e.g. Rao (1976). The following characterization is easily deduced from Corollary 4 in Baksalary and Markiewicz (1988).

Lemma 2. An estimator Fy ∈ L is admissible for β under model (1.1) if and only if

XFF′X′ ≤_L XF.    (2.3)
Clearly, condition (2.3) entails the symmetry of XF. Next, we give an alternative characterization of admissible estimators, demonstrating again the importance of the notion of contraction. For this we need the following lemma, which is due to Baksalary et al. (1989), statement (1.20).

Lemma 3. For A, B ∈ R^{n×n} we have 0 ≤_L A ≤_L B if and only if

A = A′,   0 ≤_L B,   R(A) ⊆ R(B)   and   AB⁻A ≤_L A    (2.4)

for some (and hence every) generalized inverse B⁻ of B.

Now we can give a new characterization of the class of admissible estimators under model (1.1).

Theorem 1. For F ∈ R^{p×n} the following four statements are equivalent:
1. Fy is admissible for β under model (1.1),
2. 0 ≤_L XF ≤_L I_n,
3. XF = K₀K₀′ for some contraction K₀ ∈ R^{n×k},
4. F = X⁺K₀K₀′ for some contraction K₀ ∈ R^{n×k} satisfying R(K₀) ⊆ R(X).

Proof. Since XFF′X′ ≤_L XF implies XF = F′X′, the equivalence of (1) and (2) follows from Lemma 2 and the choices A = XF and B = I_n in Lemma 3. Equivalence of (2) and (3) follows from the definition of a contraction. If (3) is satisfied, then the equality XF = K₀K₀′ has a unique solution with respect to F due to X⁺X = I_p, namely F = X⁺K₀K₀′, and therefore (4). Conversely, (4) immediately leads to (3).

Note from Baksalary and Markiewicz (1988) (Corollary 4) that we may characterize admissibility of affine linear estimators Fy + f by supplementing the conditions (2)–(4) by the condition f ∈ R(FX − I_p).
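A numerical illustration of Theorem 1 (ours; the matrices below are arbitrary): pick a contraction K₀ with R(K₀) ⊆ R(X), set F = X⁺K₀K₀′, and check statements 2 and 3 directly:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3
X = rng.standard_normal((n, p))
Xp = np.linalg.pinv(X)

# A contraction K0 with R(K0) contained in R(X): take K0 = X B, rescaled.
K0 = X @ rng.standard_normal((p, p))
K0 = 0.8 * K0 / np.linalg.svd(K0, compute_uv=False).max()

F = Xp @ K0 @ K0.T                              # statement 4 of Theorem 1
XF = X @ F

def nnd(A, tol=1e-10):
    """Non-negative definiteness of a (numerically) symmetric matrix."""
    return np.linalg.eigvalsh((A + A.T) / 2).min() >= -tol

print(np.allclose(XF, K0 @ K0.T))   # statement 3: XF = K0 K0'
print(nnd(XF))                      # 0 <=_L XF
print(nnd(np.eye(n) - XF))          # XF <=_L I_n
```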
Using Theorem 1 we can show that any admissible estimator has no greater variance–covariance matrix than the OLS estimator in the sense of the Löwner partial ordering. The proof is based on the fact that a matrix K is a contraction if and only if all eigenvalues of KK′ lie in the closed interval [0, 1].

Corollary 1. If Fy ∈ L is admissible for β under model (1.1), then Cov(Fy) ≤_L Cov(X⁺y).

Proof. The assertion follows from Lemma 1 and Theorem 1 if K₀K₀′ is a contraction. But since K₀ is a contraction, all eigenvalues of K₀K₀′ lie in the closed interval [0, 1]. Hence, the eigenvalues of (K₀K₀′)² also lie in [0, 1], showing that K₀K₀′ is itself a contraction.

As an example of an admissible estimator for β consider the general ridge estimator

Fy = (R′R + X′X)⁻¹X′y,    (2.5)

where R is an arbitrary matrix with p columns, cf. Rao (1976, p. 1036). Then we can write XF = K₀K₀′, where K₀ = X(R′R + X′X)^{−1/2}, and we have to establish

0 ≤_L I_n − X(R′R + X′X)⁻¹X′.    (2.6)

Using

(R′R + X′X)⁻¹ = S⁻¹ − S⁻¹R′(I + RS⁻¹R′)⁻¹RS⁻¹,  where S = X′X,

we get

I_n − X(R′R + X′X)⁻¹X′ = (I_n − XX⁺) + (X⁺)′R′(I + RS⁻¹R′)⁻¹RX⁺,

which is the sum of two non-negative definite matrices.
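A numerical sanity check of this example (ours; the matrix R below is arbitrary) confirms both the factorization XF = K₀K₀′ with K₀ = X(R′R + X′X)^{−1/2} and inequality (2.6):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 8, 3, 2
X = rng.standard_normal((n, p))
R = rng.standard_normal((q, p))                 # arbitrary matrix with p columns

S = X.T @ X
A = R.T @ R + S                                 # R'R + X'X, positive definite

# Symmetric inverse square root of A via its eigendecomposition.
w, U = np.linalg.eigh(A)
A_inv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T

F = np.linalg.solve(A, X.T)                     # general ridge estimator (2.5)
K0 = X @ A_inv_sqrt                             # K0 = X (R'R + X'X)^{-1/2}

print(np.allclose(X @ F, K0 @ K0.T))            # factorization XF = K0 K0'
eigs = np.linalg.eigvalsh(np.eye(n) - X @ np.linalg.inv(A) @ X.T)
print(eigs.min() >= -1e-10)                     # inequality (2.6) holds
```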
Note that the converse of Corollary 1 is not true in general, i.e. an estimator Fy which fulfills Cov(Fy) ≤_L Cov(X⁺y) is not necessarily admissible for β. By Theorem 1, the situation may be stated as follows.

Corollary 2. Let L₀ ∈ R^{n×n} be a contraction. Then X⁺L₀y ∈ L is admissible for β if and only if

P_X L₀ = K₀K₀′    (2.7)

for some contraction K₀ ∈ R^{n×k}.

If L₀ is a non-symmetric contraction such that R(L₀) ⊆ R(X), then condition (2.7) can never be satisfied. Suppose now that L₀ is an orthogonal projector and consider the estimator Fy = X⁺L₀y, as investigated by Stahlecker and Schmidt (1987).
It is easy to see that two orthogonal projectors P₁ and P₂ commute if and only if P₁P₂ is non-negative definite, which in turn is satisfied if and only if P₁P₂ is an orthogonal projector. Hence, we obtain from Corollary 2 that X⁺L₀y is admissible for β if and only if

P_X L₀ = L₀P_X.    (2.8)
Condition (2.8) has already been established by Baksalary and Trenkler (1989, p. 7). Moreover, the authors state that if L₀X is assumed to be of full column rank, then Eq. (2.8) holds if and only if L₀X = X, i.e. X⁺L₀y = X⁺y.
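As a small numerical aside (ours, not part of the original text), condition (2.8) is easy to probe: an orthogonal projector L₀ onto a subspace of R(X) commutes with P_X, whereas a generic orthogonal projector typically does not:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 8, 3
X = rng.standard_normal((n, p))
PX = X @ np.linalg.pinv(X)                          # orthogonal projector onto R(X)

# L0: orthogonal projector onto a subspace of R(X) -> commutes with PX.
L0 = X[:, :1] @ np.linalg.pinv(X[:, :1])
print(np.allclose(PX @ L0, L0 @ PX))                # condition (2.8) holds
print(np.allclose(PX @ L0, PX @ L0 @ PX @ L0))      # PX L0 is again a projector

# A generic orthogonal projector does not commute with PX.
Z = rng.standard_normal((n, 2))
L1 = Z @ np.linalg.pinv(Z)
print(np.allclose(PX @ L1, L1 @ PX))                # False (almost surely)
```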
3. Mean-square-error matrix outperformance and admissibility

Suppose now that Fy ∈ L is admissible for β, i.e. F = X⁺K₀K₀′ for some contraction K₀ such that R(K₀) ⊆ R(X). In the following we give a necessary and sufficient condition for

MMSE(Fy; β) ≤_L MMSE(X⁺y; β),

where MMSE(Fy; β) is the mean-square-error matrix of Fy, i.e.

MMSE(Fy; β) = E[(Fy − β)(Fy − β)′].    (3.1)

Before stating the main result of this section we give two lemmas, the first one being due to Trenkler (1985).

Lemma 4. Let Fy ∈ L and let (FX − I_p)β ∈ R(D), where

D = Cov(X⁺y) − Cov(Fy).

Then MMSE(Fy; β) ≤_L MMSE(X⁺y; β) if and only if the following two conditions hold:
1. D is non-negative definite,
2. a′Da ≤ 1 for some vector a satisfying (FX − I_p)β = Da.

Lemma 5. Let K₀ ∈ R^{n×k} be a contraction. Then R(I_n − K₀K₀′) = R[I_n − (K₀K₀′)²].

Proof. Since K₀K₀′ is non-negative definite we may write K₀K₀′ = UΛU′ with Λ = diag{λᵢ}, where the λᵢ are the eigenvalues of K₀K₀′, and U is an n × n orthogonal matrix such that the ith column of U is an eigenvector corresponding to λᵢ. Hence, (K₀K₀′)² = UΛ²U′, where Λ² = diag{λᵢ²}. Now, since λᵢ ∈ [0, 1] we get R(I_n − Λ) = R(I_n − Λ²), which is equivalent to R[U(I_n − Λ)U′] = R[U(I_n − Λ²)U′], i.e. R(I_n − K₀K₀′) = R[I_n − (K₀K₀′)²].
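Lemma 5 can be illustrated numerically (our sketch; the contraction is chosen to have a singular value equal to 1 so that the ranges are proper subspaces):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 3
# A contraction with singular values 1, 0.6, 0.2, so that I - K0 K0' is singular.
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
V, _ = np.linalg.qr(rng.standard_normal((k, k)))
K0 = U @ np.diag([1.0, 0.6, 0.2]) @ V.T

C = K0 @ K0.T
A = np.eye(n) - C                   # I_n - K0 K0'
B = np.eye(n) - C @ C               # I_n - (K0 K0')^2

r = np.linalg.matrix_rank
print(r(A), r(B))                              # both 5
print(r(A) == r(B) == r(np.hstack([A, B])))    # True: the two ranges coincide
```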
Recall from the proof of Corollary 1 that K₀K₀′ is itself a contraction when K₀ is a contraction. Hence, the matrices I_n − K₀K₀′ and I_n − (K₀K₀′)² are non-negative definite, and Lemma 5 implies

R[X′(I_n − K₀K₀′)X] = R[X′[I_n − (K₀K₀′)²]X].    (3.2)

This equality is relevant to the following theorem, which shows that for any admissible estimator Fy we have MMSE(Fy; β) ≤_L MMSE(X⁺y; β) if and only if β′Mβ ≤ σ² for a special non-negative-definite matrix M.

Theorem 2. Let K₀ ∈ R^{n×k} be a contraction such that R(K₀) ⊆ R(X). Then

MMSE(X⁺K₀K₀′y; β) ≤_L MMSE(X⁺y; β)

if and only if

β′Mβ ≤ σ²,    (3.3)

where

M = X′(I_n − K₀K₀′)X[X′[I_n − (K₀K₀′)²]X]⁺X′(I_n − K₀K₀′)X.

Proof. Let F = X⁺K₀K₀′ and let

D = Cov(X⁺y) − Cov(Fy) = σ²X⁺[I_n − (K₀K₀′)²](X⁺)′.

In view of Corollary 1 the matrix D is non-negative definite. Furthermore, we get (FX − I_p)β ∈ R(D) if and only if

X′(I_n − K₀K₀′)Xβ ∈ R[X′[I_n − (K₀K₀′)²]X].

But by Eq. (3.2), the latter is satisfied for any β ∈ R^p. Hence, from Lemma 4, we see that MMSE(Fy; β) ≤_L MMSE(X⁺y; β) if and only if a′Da ≤ 1 for some vector a satisfying (FX − I_p)β = Da. By using the identity

X′(I_n − K₀K₀′)X = X′[I_n − (K₀K₀′)²]X[X′[I_n − (K₀K₀′)²]X]⁺X′(I_n − K₀K₀′)X,

we conclude that

a = −(1/σ²)X′X[X′[I_n − (K₀K₀′)²]X]⁺X′(I_n − K₀K₀′)Xβ

is an appropriate choice. This gives Eq. (3.3).

Theorem 2 has already been established by Liski et al. (1997), where K₀K₀′ is replaced by an arbitrary contraction L ∈ R^{n×n} satisfying R(I_n − L) = R(I_n − LL′). Clearly, the considered estimator X⁺Ly is not necessarily admissible for β. However, tests for the MMSE dominance criterion (3.3) may be constructed analogously to Liski et al. (1997, Section 3).

The reader might ask why we use two different risk functions, namely MSE(Fy; β) and MMSE(Fy; β), in our analysis. For a justification one may observe that any estimator which is admissible for β with respect to the MSE risk is also admissible for β with respect to the MMSE risk. Hence, we get a greater reduction of the class of available estimators if we consider admissibility in the sense of Definition 1 (see also Baksalary et al., 1992a).
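Condition (3.3) is straightforward to evaluate for given X, K₀, β and σ². The following sketch (ours, with arbitrarily chosen quantities) computes M, evaluates (3.3), and compares it with a direct check of the MMSE difference; by Theorem 2 the two answers agree:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 8, 3
sigma2 = 1.0
X = rng.standard_normal((n, p))
Xp = np.linalg.pinv(X)

# Admissible estimator F = X^+ K0 K0' with a contraction K0, R(K0) in R(X).
K0 = X @ rng.standard_normal((p, p))
K0 = 0.9 * K0 / np.linalg.svd(K0, compute_uv=False).max()
C = K0 @ K0.T
F = Xp @ C

# The matrix M of Theorem 2.
G = X.T @ (np.eye(n) - C) @ X
H = X.T @ (np.eye(n) - C @ C) @ X
M = G @ np.linalg.pinv(H) @ G

beta = 0.05 * rng.standard_normal(p)              # a "small" parameter vector
cond = beta @ M @ beta <= sigma2                  # condition (3.3)

# Direct check of the difference MMSE(X^+ y) - MMSE(Fy).
delta = (F @ X - np.eye(p)) @ beta                # bias of Fy
diff = sigma2 * (Xp @ Xp.T - F @ F.T) - np.outer(delta, delta)
dominates = np.linalg.eigvalsh(diff).min() >= -1e-10
print(cond, dominates)                            # the two checks agree
```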
On the other hand, condition (3.3) from Theorem 2 is also sufficient for the inequality

MSE(X⁺K₀K₀′y; β) ≤ MSE(X⁺y; β)

to hold, since any matrix inequality MMSE(Fy; β) ≤_L MMSE(Gy; β) implies MSE(Fy; β) ≤ MSE(Gy; β). This can easily be derived by applying the trace operator.

4. Linear sufficiency, completeness and admissibility

Consider now the class of linearly sufficient statistics according to the following definition.

Definition 2. An estimator Fy ∈ L is called linearly sufficient if there exists a matrix A ∈ R^{n×p} such that AFy is the BLU estimator of Xβ.

Baksalary and Kala (1981) give a characterization of the class of linearly sufficient statistics in a more general setting. Adapting their result to the situation under model (1.1) we easily observe that Fy ∈ L is linearly sufficient if and only if

R(X) ⊆ R(F′).    (4.1)

Markiewicz (1996) shows that an estimator Fy is linearly sufficient and in addition admissible for β if and only if Fy is a general ridge estimator, i.e. Fy = (H + X′X)⁻¹X′y for some non-negative-definite matrix H. We may also write H = R′R for some matrix R, as in Section 2. The following theorem gives an alternative characterization of this class.

Theorem 3. An estimator Fy ∈ L is admissible for β and in addition linearly sufficient if and only if

F = X⁺K₀K₀′    (4.2)

for some contraction K₀ such that R(K₀) = R(X).

Proof. Let F = X⁺K₀K₀′, where K₀ is a contraction such that R(K₀) ⊆ R(X). Then Fy is linearly sufficient if and only if R(X) ⊆ R(F′) = R[K₀K₀′(X⁺)′] = R[XX⁺K₀K₀′(X⁺)′] = R(XX⁺K₀) = R(K₀), and the assertion follows from Theorem 1.

Observe that an estimator Fy = X⁺K₀K₀′y, where K₀ is a contraction such that R(K₀) = R(X), can be written as

Fy = (H + X′X)⁻¹X′y,

where H := [X⁺K₀K₀′(X⁺)′]⁻¹ − X′X is a non-negative-definite matrix, since

(X′X)⁻¹ − X⁺K₀K₀′(X⁺)′ = X⁺(I_n − K₀K₀′)(X⁺)′
is non-negative definite. The non-singularity of X⁺K₀K₀′(X⁺)′ can be deduced by observing that R(K₀) = R(X) implies R(X′X) = R(X′K₀), which in turn gives rank(X) = rank(X′X) = rank(X′K₀) = rank(X⁺K₀). Since any generalized ridge estimator is easily shown to be admissible for β and in addition linearly sufficient, the above-mentioned result from Markiewicz (1996) is confirmed.
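A numerical illustration of this correspondence (ours; B₀ below is an arbitrary nonsingular matrix): starting from a contraction K₀ with R(K₀) = R(X), the matrix H = [X⁺K₀K₀′(X⁺)′]⁻¹ − X′X is non-negative definite, the ridge estimator built from H reproduces F = X⁺K₀K₀′, and F satisfies the sufficiency condition (4.1):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 8, 3
X = rng.standard_normal((n, p))
Xp = np.linalg.pinv(X)

# A contraction K0 with R(K0) = R(X): K0 = X B0 with B0 nonsingular, rescaled.
B0 = rng.standard_normal((p, p)) + 3 * np.eye(p)    # almost surely nonsingular
K0 = X @ B0
K0 = 0.9 * K0 / np.linalg.svd(K0, compute_uv=False).max()

F = Xp @ K0 @ K0.T

# Linear sufficiency (4.1): R(X) contained in R(F'), checked via ranks.
r = np.linalg.matrix_rank
print(r(F.T) == r(np.hstack([F.T, X])) == p)        # True

# Recover the ridge matrix H and rebuild the estimator from it.
W = Xp @ K0 @ K0.T @ Xp.T                           # X^+ K0 K0' (X^+)'
H = np.linalg.inv(W) - X.T @ X
print(np.linalg.eigvalsh((H + H.T) / 2).min() >= -1e-8)     # H non-negative definite
print(np.allclose(np.linalg.solve(H + X.T @ X, X.T), F))    # (H + X'X)^{-1} X' = F
```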
Consider now the class of linearly complete statistics as defined below.

Definition 3. An estimator Fy ∈ L is called linearly complete if for all a ∈ R^p such that

E(a′Fy) = 0  for all β ∈ R^p,

it follows that a′Fy = 0 almost surely.

From Drygas (1983) we conclude that Fy ∈ L is linearly complete if and only if

R(F) ⊆ R(FX),    (4.3)

which is equivalent to R(F) = R(FX). Markiewicz (1996) points out that any estimator which is linearly sufficient and admissible for β is also linearly complete. The following theorem shows that under model (1.1) admissibility alone implies linear completeness.

Theorem 4. Let K₀ ∈ R^{n×k} be a contraction such that R(K₀) ⊆ R(X). Then X⁺K₀K₀′y is linearly complete.

Proof. Let F = X⁺K₀K₀′. Since R(K₀) ⊆ R(X) we can write K₀ = XX⁺K₀, i.e. K₀′ = K₀′XX⁺. Then, obviously, R(F) = R(X⁺K₀K₀′XX⁺) ⊆ R(FX).
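Again as a numerical aside (ours), criterion (4.3) can be checked directly for an admissible estimator whose contraction satisfies only R(K₀) ⊆ R(X):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, k = 8, 3, 2
X = rng.standard_normal((n, p))
Xp = np.linalg.pinv(X)

# Contraction K0 with R(K0) strictly contained in R(X) (only k < p columns).
K0 = X @ rng.standard_normal((p, k))
K0 = 0.8 * K0 / np.linalg.svd(K0, compute_uv=False).max()

F = Xp @ K0 @ K0.T
FX = F @ X

# Criterion (4.3): R(F) contained in R(FX), checked via ranks.
r = np.linalg.matrix_rank
print(r(FX) == r(np.hstack([FX, F])))      # True: Fy is linearly complete
```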
5. Concluding remarks

In the preceding sections we have derived, by using properties of contraction matrices, results on estimators that are valid for all admissible estimators. These derivations, e.g. Theorem 2, may be applied to an actual estimator by identifying the corresponding contraction matrix. Also, it might be possible to derive additional results with the help of contractions, which again would be valid for all admissible estimators.

It is clear that our results may also be extended to a linear regression model with positive definite Cov(e) = σ²V. This can be done by left-multiplication of the model Eq. (1.1) by V^{−1/2}, where V^{−1/2} denotes the unique positive-definite square root of V⁻¹. Then, from Theorem 1, the class of admissible estimators for β can be written in the form

(V^{−1/2}X)⁺K₀K₀′V^{−1/2}y,    (5.1)
where K₀ is a contraction such that R(K₀) ⊆ R(V^{−1/2}X). By writing K₀ = V^{−1/2}XB₀ for some matrix B₀, Eq. (5.1) reads

B₀B₀′X′V⁻¹y,    (5.2)

where B₀ satisfies V^{−1/2}XB₀B₀′X′V^{−1/2} ≤_L I_n or, equivalently, B₀′X′V⁻¹XB₀ ≤_L I_p. The latter is satisfied if and only if

BX′V⁻¹XB ≤_L B    (5.3)

for B = B₀B₀′. Note that Eq. (5.3) corresponds to the characterization of admissible estimators given by Rao (1976, p. 1036).

Acknowledgements

Support by Deutsche Forschungsgemeinschaft under grant Tr 253/2-1/2-2 is gratefully acknowledged.

References

Baksalary, J.K., Kala, R., 1981. Linear transformations preserving best linear unbiased estimators in a general Gauss–Markoff model. Ann. Statist. 9, 913–916.
Baksalary, J.K., Markiewicz, A., 1988. Admissible linear estimators in the general Gauss–Markov model. J. Statist. Plann. Inference 19, 349–359.
Baksalary, J.K., Pukelsheim, F., Styan, G.P.H., 1989. Some properties of matrix partial orderings. Linear Algebra Appl. 119, 57–85.
Baksalary, J.K., Rao, C.R., Markiewicz, A., 1992a. A study of the influence of the natural restrictions on estimation problems in the singular Gauss–Markov model. J. Statist. Plann. Inference 31, 335–351.
Baksalary, J.K., Schipp, B., Trenkler, G., 1992b. Some further results on Hermitian-matrix inequalities. Linear Algebra Appl. 160, 119–129.
Baksalary, J.K., Trenkler, G., 1989. Another view on least-squares estimation with a particular linear function of the dependent variable. Department of Mathematical Sciences, University of Tampere, Report A 217.
Drygas, H., 1983. Sufficiency and completeness in the general Gauss–Markov model. Sankhyā Ser. A 45, 88–98.
Hartwig, R.E., Spindelböck, K., 1983. Partial isometries, contractions and EP matrices. Linear Multilinear Algebra 13, 295–310.
Liski, E., Trenkler, G., Groß, J., 1997. On least squares estimation from linearly transformed data. Statistics 29, 205–219.
Löwner, K., 1934. Über monotone Matrixfunktionen. Math. Z. 38, 177–216.
Markiewicz, A., 1996. Characterization of general ridge estimators. Statist. Probab. Lett. 27, 145–148.
Rao, C.R., 1976. Estimation of parameters in a linear model. The 1975 Wald memorial lectures. Ann. Statist. 4, 1023–1037 [correction, ibid. 7, 696].
Stahlecker, P., Schmidt, K., 1987. On least squares estimation with a particular linear function of the dependent variable. Econom. Lett. 23, 59–64.
Trenkler, G., 1985. Mean square error matrix comparisons of estimators in linear regression. Commun. Statist. A 14, 2496–2509.