Journal of Statistical Planning and Inference 103 (2002) 245–261
www.elsevier.com/locate/jspi
Nonconjugate Bayesian regression on many variables

B.Q. Fang^a, A.P. Dawid^{b,*}

^a Institute of Applied Mathematics, Academia Sinica, Beijing, China
^b Department of Statistical Science, University College London, UK

* Corresponding author. E-mail address: [email protected] (A.P. Dawid).
Abstract

In this paper the problem of normal linear regression on arbitrarily many variables is considered. It is shown that a nonconjugate prior implies indeterminism for the Bayes predictor, in contrast to the determinism induced by the use of the usual conjugate priors. A comparison is made between the Bayes predictor and a sample-based least squares predictor which fits the data perfectly. (c) 2002 Published by Elsevier Science B.V.

MSC: 62A15; 62J05

Keywords: Regression; Bayesian inference; Nonconjugate prior; Indeterminism
1. Introduction

This paper considers the problem of Bayesian inference in normal linear regression when there is an unlimited number of regressor variables available, and, in particular, illustrates the asymptotically nonnegligible effect of the choice of prior distribution. In traditional data analysis it is usually necessary to assume that the number of observations is greater than the number of observable variables. In practice, however, it is possible that far more variables are observed on each individual than we have individuals. For example, we may have information on a large number of symptoms in a problem of medical diagnosis, or of attributes in one of pattern recognition. With modern instrumentation techniques such as near infra-red spectroscopy (Osborne et al., 1993), each multivariate observation can be of extremely high dimension. Bayesian inference does not, in principle, put any constraints on the number of variables that can be used, but it cannot completely escape from related difficulties: sensitivity to the prior specification diminishes only as the number of observations grows, and the sample size needed for this will itself grow with the number of variables.
When the number of available variables is unlimited, sensitivity to the prior will remain even in the limit as the number of individuals goes to infinity. This is the context of the present investigation.

Bayesian normal linear regression with an unlimited number of explanatory variables was first studied by Dawid (1988). That paper investigated the implications and consequences of using the conjugate prior distribution, and showed that this could lead to generally undesirable behaviour. In particular, the conjugate prior implies a strong belief in the possibility of deterministic prediction, meaning that the response variable can be predicted with arbitrarily high accuracy just by using a sufficiently large number of the explanatory variables.
2. Notation

The model considered in this paper involves certain spherical and rotatable matrix distributions, including the matrix-variate normal, t, F and beta, and the Wishart and inverse Wishart distributions. We shall use the notation and properties for these distributions developed in Dawid (1981). For the convenience of the reader, we give a brief review of their definitions and some properties below.

Matrix normal. The n x p random matrix Z with independent standard normal elements is denoted by Z ~ N(I_n, I_p). For nonrandom A, B, M such that AA' = Γ and B'B = Σ, the distribution of M + AZB is denoted by M + N(Γ, Σ).
Wishart. The p x p random matrix S having a Wishart distribution with ν degrees of freedom and p x p scale matrix Σ > 0 is denoted by S ~ W(ν, Σ). For p = 1, W(ν, 1) is the χ²_ν distribution. If Z ~ N(I_n, Σ), then Z'Z ~ W(n, Σ).

Inverse Wishart. The p x p random matrix Σ having a standard inverse Wishart distribution with parameter δ > 0 is denoted by Σ ~ IW(δ, I_p), for which Σ^{-1} ~ W(ν, I_p) with ν = δ + p − 1. Then AΣA' ~ IW(δ, Ψ) with Ψ = AA'. The parameter δ does not change for any leading submatrix of Σ.

Matrix-t. The n x p random matrix T having a standard matrix-t distribution with parameter δ is denoted by T ~ T(δ, I_n, I_p). Location and scale parameters may be introduced as for the matrix normal. For p = 1, n ≥ 1, T(δ, I_n, 1) = δ^{-1/2} t_δ, where t_δ is the multivariate t distribution with δ degrees of freedom (Cornish, 1954). The matrix-t distribution has a stochastic representation as T | Σ ~ N(I_n, Σ) with Σ ~ IW(δ, I_p), or T | Γ ~ N(Γ, I_p) with Γ ~ IW(δ, I_n).

Matrix F. The p x p random matrix U having a standard matrix-variate F distribution with parameters ν, δ is denoted by U ~ F(ν, δ, I_p) (δ > 0; ν > p − 1 or integral). It is denoted by B_II{p, ν/2, (δ + p − 1)/2} in Tan (1969) and G(ν, δ + p − 1, I_p) in Dempster (1969) (see also Olkin and Rubin, 1964). For p = 1, F(ν, δ, 1) = (ν/δ)F_{ν,δ}. It has a stochastic representation as U | Σ ~ W(ν, Σ) with Σ ~ IW(δ, I_p), or U | S ~ IW(δ, S) with S ~ W(ν, I_p). If T ~ T(δ, I_n, I_p), then T'T ~ F(n, δ, I_p). If U ~ F(ν, δ, I_p), then U^{-1} ~ F(δ + p − 1, ν − p + 1, I_p) (ν > p − 1).

Matrix beta. The p x p random matrix V having a matrix-variate beta distribution with ν₁, ν₂ degrees of freedom and scale matrix Ψ is denoted by V ~ B(ν₁, ν₂, Ψ). The standard B(ν₁, ν₂, I_p) (ν₁ + ν₂ > p − 1) is denoted by B_I(p, ν₁/2, ν₂/2) in Tan (1969) and B_p(ν₁/2, ν₂/2) in Mitra (1970); see also Olkin and Rubin (1964) and Khatri (1970). For p = 1, B(ν₁, ν₂, 1) = β(ν₁/2, ν₂/2), the beta distribution. If S_i ~ W(ν_i, Σ), i = 1, 2, independently, with Σ positive definite, and S = S₁ + S₂, then the conditional distribution of S₁ given S = Ψ is B(ν₁, ν₂, Ψ). It also has a stochastic representation as D^{-1}S₁(D^{-1})' ~ B(ν₁, ν₂, I_p), where S_i ~ W(ν_i, Σ), i = 1, 2, independently with ν₁ + ν₂ > p − 1, D is any matrix such that S = S₁ + S₂ = DD', and D is independent of S₁ given S. If U ~ F(ν, δ, I_p), then (I_p + U)^{-1} ~ B(δ + p − 1, ν, I_p). If V ~ B(ν₁, ν₂, I_p), then I_p − V ~ B(ν₂, ν₁, I_p). Scale parameters for the matrix F and matrix beta may be introduced as for the inverse Wishart.
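These stochastic representations translate directly into samplers. The following Python sketch is our own illustration (numpy and scipy are not part of the paper); it uses the fact that Dawid's IW(δ, Q) on d x d matrices is the common inverse-Wishart parameterization with δ + d − 1 degrees of freedom, so that E(Σ) = Q/(δ − 2). Function names and parameter values are ours.

```python
import numpy as np
from scipy.stats import wishart, invwishart

rng = np.random.default_rng(0)

def rinvwishart_dawid(delta, Q, rng):
    """Dawid's IW(delta, Q) on d x d matrices: standard inverse Wishart
    with df = delta + d - 1, so that the mean is Q/(delta - 2)."""
    d = Q.shape[0]
    return np.atleast_2d(invwishart.rvs(df=delta + d - 1, scale=Q, random_state=rng))

def rmatrix_t(delta, n, p, rng):
    """Standard matrix-t T(delta, I_n, I_p) via T | Sigma ~ N(I_n, Sigma),
    Sigma ~ IW(delta, I_p)."""
    Sigma = rinvwishart_dawid(delta, np.eye(p), rng)
    Z = rng.standard_normal((n, p))
    return Z @ np.linalg.cholesky(Sigma).T   # independent rows, column scale Sigma

def rmatrix_F(nu, delta, p, rng):
    """Standard matrix F(nu, delta, I_p) via U | Sigma ~ W(nu, Sigma),
    Sigma ~ IW(delta, I_p)."""
    Sigma = rinvwishart_dawid(delta, np.eye(p), rng)
    return np.atleast_2d(wishart.rvs(df=nu, scale=Sigma, random_state=rng))

# Moment check: if T ~ T(delta, I_n, 1) then T'T ~ F(n, delta, 1), with mean n/(delta - 2).
delta, n = 6.0, 4
tt = [rmatrix_t(delta, n, 1, rng) for _ in range(5000)]
print(np.mean([(t ** 2).sum() for t in tt]), n / (delta - 2))
```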
3. Assumptions

To construct a nonconjugate prior, we suppose that the response Y is the sum of an unobservable variable η and an error ε, independent of each other; the joint distribution of η and the explanatory variables X_i is normal, and ε is also normally distributed. We thus suppose:

Y = η + ε,   (η, X) ~ N(1, Σ),   ε ~ N(1, Φ),   ε ⊥⊥ (η, X),     (1)
where X = (X₁, X₂, ...), Σ > 0 is ∞ x ∞, Φ > 0, and ⊥⊥ denotes probabilistic independence (Dawid, 1979). Let X_p = (X₁, ..., X_p)', and let Σ_{1+p} be the (1 + p) x (1 + p) leading submatrix of Σ, further partitioned as

Σ_{1+p} = ( σ_00   σ_0p )
          ( σ_p0   Σ_pp ),     (2)

where σ_00 is 1 x 1 and Σ_pp is p x p. Assumption (1) is equivalent to the following series of regression models: for each p = 0, 1, ...,

Y | X_p ~ (X_p)'R_p + N(1, τ_p²),     (3)

where

R_p := Σ_pp^{-1} σ_p0,
φ_p := σ_{00.p} = σ_00 − σ_0p Σ_pp^{-1} σ_p0,
τ_p² = φ_p + Φ.

Suppose the prior distribution for the parameters (Σ, Φ) is

Σ ~ IW(δ, Q),   Φ ~ IW(γ, K),   Σ ⊥⊥ Φ,     (4)

where δ > 0, Q > 0 is ∞ x ∞, γ > 0, K > 0. Let Q_{1+p} be the (1 + p) x (1 + p) leading submatrix of Q, partitioned in the same fashion as (2). Then

R_p | φ_p ~ Q_pp^{-1} Q_p0 + N(Q_pp^{-1}, φ_p),
φ_p ~ IW(δ + p, Q_{00.p}),
τ_p² = φ_p + Φ,
Φ ~ IW(γ, K),
Φ ⊥⊥ φ_p.
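As an illustration of the construction (2)-(4) for finite p, the following sketch (all settings, such as p, n, δ, γ, Q and K, are ours and purely for demonstration) draws (Σ_{1+p}, Φ) from the prior, forms R_p, φ_p and τ_p², and simulates training data from (1); recall that in Dawid's notation (η, X) ~ N(1, Σ) means independent zero-mean rows with covariance Σ.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)

# Illustrative finite-p settings (ours, not from the paper).
p, n = 5, 20
delta, gamma = 6.0, 6.0          # prior parameters of Sigma_{1+p} and Phi
Q1p = np.eye(1 + p)              # Q_{1+p}, leading block of Q
K = 1.0

# Prior draws; Dawid's IW(delta, Q) on d x d corresponds to df = delta + d - 1.
Sigma = np.atleast_2d(invwishart.rvs(df=delta + p, scale=Q1p, random_state=rng))
Phi = float(np.squeeze(invwishart.rvs(df=gamma, scale=np.array([[K]]), random_state=rng)))

# Regression quantities of (3): R_p, phi_p = sigma_{00.p}, tau_p^2 = phi_p + Phi.
R_p = np.linalg.solve(Sigma[1:, 1:], Sigma[1:, 0])
phi_p = Sigma[0, 0] - Sigma[0, 1:] @ R_p
tau2 = phi_p + Phi
print(phi_p, tau2, Phi)          # tau_p^2 >= Phi > 0: residual uncertainty persists

# Training data from (1): rows of (eta, X_p) are N(0, Sigma_{1+p}); eps ~ N(0, Phi).
etaX = rng.multivariate_normal(np.zeros(1 + p), Sigma, size=n)
eta, X = etaX[:, 0], etaX[:, 1:]
Y = eta + rng.normal(0.0, np.sqrt(Phi), size=n)
```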
Since σ_{00.p} decreases with p to a limit σ_{00.∞} ≥ 0, var{Y − E(Y | X_p)} = τ_p² → σ_{00.∞} + Φ ≥ Φ > 0, a.s., as p → ∞. Thus under the prior given by (4), the sampling model (1) or (3) is nondeterministic: even after observing an infinite number of regressor variables, there remains essential residual uncertainty about Y.

Now suppose we have obtained the n x ∞ training data (Y^n, X^n) on the response variable Y and all the X's, where Y^n is represented by Y^n = W + e, with W the n-vector of latent values of η and e the n-vector of errors ε. We consider a further case (Y^0, (X^0)'), with Y^0 = η^0 + ε^0. We wish to predict Y^0 on the basis of the training data and X^0_p, the observations on the first p explanatory variables, and shall investigate the asymptotic behaviour of the predictor as p → ∞.

From the assumptions, the overall marginal distribution of the full data matrix is a sum of two independent matrix-t distributions:

Z = ( Y^0   (X^0)' )
    ( Y^n   X^n    )  = Z₁ + (Z₂, 0),

where

Z₁ := ( η^0   (X^0)' )              Z₂ := ( ε^0 )
      ( W     X^n    ) ~ T(δ, I_{n+1}, Q),       ( e  ) ~ T(γ, I_{n+1}, K)     (5)

and Z₁ ⊥⊥ Z₂. Note that the assumptions that ε ⊥⊥ (η, X) given the parameters, and Σ ⊥⊥ Φ, imply that ε ⊥⊥ (η, X) in the marginal distribution of (ε, η, X).

To make predictive inference for Y^0 on the basis of X^0 and the training data (Y^n, X^n), we shall first condition on the training data (Y^n, X^n_q) and X^0_p (p ≤ q), where X^0_p, X^n_q denote the corresponding submatrices of X^0, X^n restricted to the first p and q variables, respectively, and then let q → ∞, p → ∞. A detailed justification is given in Section 4. We here give a lemma on the posterior variance of (Y^0, (X^0_p)') given (Y^n, X^n_q), which is the basis of the calculations of this paper.
Lemma 1. Let Q*_{1+p}(q) := var{(Y^0, (X^0_p)')' | Y^n, X^n_q}, a (1 + p) x (1 + p) matrix (p ≤ q ≤ ∞). Then under Assumptions (1) and (4), the following hold:

E{(Y^0, (X^0_p)')' | Y^n, X^n_q} = 0,

Q*_00(q) = E{(Y^0)² | Y^n, X^n_q} = E(σ_00 + Φ | Y^n, X^n_q)     (6)
         = {Q_00 + E(W'W | Y^n, X^n_q)}/(δ + n − 2) + {K + E(e'e | Y^n, X^n_q)}/(γ + n − 2),     (7)

Q*_p0(q) = E(X^0_p Y^0 | Y^n, X^n_q) = E(σ_p0 | Y^n, X^n_q)     (8)
         = {Q_p0 + (X^n_p)'E(W | Y^n, X^n_q)}/(δ + n − 2)     (9)
         = {Q_p0 + (X^n_p)'Y^n − (X^n_p)'E(e | Y^n, X^n_q)}/(δ + n − 2),     (10)

Q*_pp(q) = E{X^0_p(X^0_p)' | Y^n, X^n_q} = E(Σ_pp | Y^n, X^n_p)     (11)
         = {Q_pp + (X^n_p)'X^n_p}/(δ + n − 2) =: Q*_pp,     (12)

where Q*_{1+p}(q) is partitioned in the same fashion as (2). Moreover, if δ > 2, γ > 2, then Q*_{1+p}(∞) := lim_{q→∞} Q*_{1+p}(q) exists a.s.
Proof. The expectation of (Y^0, (X^0_p)')' given (Y^n, X^n_q) can be obtained by first conditioning on the parameters and (Y^n, X^n_q), and then on (Y^n, X^n_q) only, using the independence of (Y^0, (X^0_p)') and (Y^n, X^n_q) given the parameters. Similarly, we have E(ε^0 η^0 | Y^n, X^n_q) = 0 and E{ε^0 (X^0_p)' | Y^n, X^n_q} = 0. Hence

var{(Y^0, (X^0_p)')' | Y^n, X^n_q}
    = E{(Y^0, (X^0_p)')'(Y^0, (X^0_p)') | Y^n, X^n_q}
    = E{(η^0, (X^0_p)')'(η^0, (X^0_p)') | Y^n, X^n_q} + E{(ε^0, 0')'(ε^0, 0') | Y^n, X^n_q},     (13)

which, by first conditioning on the parameters (Σ, Φ) and (Y^n, X^n_q), and then on (Y^n, X^n_q) only, leads to the first expressions, (6), (8) and (11), for the Q*_{ij}(q). Now the independence of the matrix with rows (η^0, (X^0_q)') and (W, X^n_q) from e implies that (η^0, (X^0_q)') ⊥⊥ e | (W, X^n_q) and ε^0 ⊥⊥ (η^0, X^0_q, W, X^n_q) | e. Hence

E{(η^0, (X^0_q)')'(η^0, (X^0_q)') | W, e, X^n_q} = E{(η^0, (X^0_q)')'(η^0, (X^0_q)') | W, X^n_q}
    = {Q_{1+q} + (W, X^n_q)'(W, X^n_q)}/(δ + n − 2)

and

E{(ε^0)² | W, e, X^n_q} = E{(ε^0)² | e} = (K + e'e)/(γ + n − 2),

where the final equalities follow from properties of the matrix-t distribution (Dawid, 1988, Lemma 4). Thus, by first conditioning on (W, e, X^n_q) in (13), we obtain the second expressions, (7), (9) and (12), for the Q*_{ij}(q).

Since Σ_{1+p} ~ IW(δ, Q_{1+p}) and Φ ~ IW(γ, K), for δ > 2, γ > 2,

E(Σ_{1+p}) = Q_{1+p}/(δ − 2)   and   E(Φ) = K/(γ − 2)

are finite for fixed p. By the martingale convergence theorem, E(Σ_{1+p} | Y^n, X^n_q) and E(Φ | Y^n, X^n_q) tend almost surely to E(Σ_{1+p} | Y^n, X^n) and E(Φ | Y^n, X^n), respectively, as q → ∞. This completes the proof.
4. Bayes and Bayes-least squares estimators

For predicting the new response using the training data (Y^n, X^n_q) and the observation X^0_p on the explanatory variables (q ≥ p), the Bayes rule f(X^0_p, Y^n, X^n_q) will minimise the Bayes risk

E{Y^0 − f(X^0_p, Y^n, X^n_q)}².     (14)

This is equivalent to minimising the posterior expected loss

E[{Y^0 − f(X^0_p, Y^n, X^n_q)}² | X^0_p, Y^n, X^n_q]     (15)
and the solution is

f̂(X^0_p, Y^n, X^n_q) := E(Y^0 | X^0_p, Y^n, X^n_q).

We shall show that, for fixed (Y^n, X^n_q), this expression is linear in X^0_p. We first give two lemmas on conditional independence under assumptions (1) and (4).

Lemma 2. Under assumptions (1) and (4), X^0_q and W are conditionally independent given (Y^n, X^n_q).

Proof. Assumptions (1) and (4) imply that (W, X^0_q, X^n_q) ⊥⊥ e, so that (W, X^0_q) ⊥⊥ e | X^n_q. By properties of the matrix-t distribution (Dawid, 1988, Lemma 4), W ⊥⊥ X^0_q | X^n_q. Hence W, X^0_q and e are mutually independent given X^n_q. Since Y^n = W + e, (Y^n, W) ⊥⊥ X^0_q | X^n_q, which leads to W ⊥⊥ X^0_q | (Y^n, X^n_q) (Dawid, 1979, Lemma 4.2).

Lemma 3. Under assumptions (1) and (4), (X^0_q, Σ_qq) and (R_q, σ_{00.q}, Φ) are conditionally independent given (Y^n, X^n_q).

Proof. Let f, π denote the relevant density functions. Consider the sampling distribution. The assumption e ⊥⊥ (W, X^n_q) leads to W ⊥⊥ e | X^n_q, so that

Y^n | X^n_q =^d (W + e) | X^n_q =^d (W | X^n_q) + e,   with W ⊥⊥ e | X^n_q.

Thus, by properties of the normal distribution,

f(Y^n, X^n_q | Σ_{1+q}, Φ) = f(X^n_q | Σ_{1+q}, Φ) f(Y^n | X^n_q, Σ_{1+q}, Φ) = f(X^n_q | Σ_qq) f(Y^n | X^n_q, R_q, σ_{00.q}, Φ).

Also, for the inverse Wishart distribution,

π(Σ_{1+q}) = π(Σ_qq) π(R_q, σ_{00.q}).
Hence

f(Σ_{1+q}, Φ, X^0_q | Y^n, X^n_q) = f(X^0_q, Y^n, X^n_q | Σ_{1+q}, Φ) π(Σ_{1+q}, Φ)/f(Y^n, X^n_q)
    = f(X^0_q | Σ_qq) f(X^n_q | Σ_qq) π(Σ_qq) x f(Y^n | X^n_q, R_q, σ_{00.q}, Φ) π(R_q, σ_{00.q}, Φ)/f(Y^n, X^n_q),

which is the product of two factors depending on (X^0_q, Σ_qq; X^n_q) and (R_q, σ_{00.q}, Φ; Y^n, X^n_q), respectively. Hence, by (1b) of Dawid (1979), (X^0_q, Σ_qq) ⊥⊥ (R_q, σ_{00.q}, Φ) | (Y^n, X^n_q), completing the proof.

By using the same argument as in the proof of Lemma 1, we have

E(Y^0 | X^0_p, Y^n, X^n_q) = E{E(η^0 | X^0_p, W, X^n_q) + E(ε^0 | e) | X^0_p, Y^n, X^n_q}
    = E{E(η^0 | X^0_p, W, X^n_q) | X^0_p, Y^n, X^n_q}.     (16)

By (5) and Dawid (1988, Lemma 4),

E(η^0 | X^0_p, W, X^n_q) = E(η^0 | X^0_p, W, X^n_p) = (X^0_p)'{Q_pp + (X^n_p)'X^n_p}^{-1}{Q_p0 + (X^n_p)'W}.

Now Lemma 2 implies that E(W | X^0_p, Y^n, X^n_q) = E(W | Y^n, X^n_q). Hence, by Lemma 1, (16) is equal to

(X^0_p)'{Q_pp + (X^n_p)'X^n_p}^{-1}{Q_p0 + (X^n_p)'E(W | Y^n, X^n_q)} = (X^0_p)' Q*_pp^{-1} Q*_p0(q),     (17)

which is thus seen to be linear in X^0_p. The corresponding minimum of (15) is

var(Y^0 | X^0_p, Y^n, X^n_q) = E{(Y^0)² | X^0_p, Y^n, X^n_q} − {(X^0_p)'E(R_p | X^0_p, Y^n, X^n_q)}².     (18)
Clearly, because of this linearity, the same answer would result if, in searching for f to minimise the Bayes risk (14), we restricted the search to those functions of the form f(X^0_p, Y^n, X^n_q) = (X^0_p)'b_p, where b_p ≡ b_p(Y^n, X^n_q) ∈ R^p is a coefficient vector depending only on the training data. For such predictors the relevant risk is

R(b_p) = E[{Y^0 − (X^0_p)'b_p}² | Y^n, X^n_q]     (19)
       = {b_p − Q*_pp^{-1}Q*_p0(q)}' Q*_pp {b_p − Q*_pp^{-1}Q*_p0(q)} + Q*_{00.p}(q),     (20)

where Q*_{00.p}(q) := Q*_00(q) − Q*_0p(q)Q*_pp^{-1}Q*_p0(q), yielding the minimising value of b_p:

R^B_p(q) := Q*_pp^{-1} Q*_p0(q)     (21)

with corresponding minimum

R{R^B_p(q)} = Q*_{00.p}(q).     (22)

As noted, (21) agrees with the coefficient vector of the full Bayes predictor (17).
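Given the Q* moments, (17), (21) and (22) are immediate to evaluate; a minimal sketch, with hypothetical numerical inputs (e.g. as approximated in the Gibbs-sampling sketch after Lemma 1), is:

```python
import numpy as np

# Hypothetical inputs: posterior moments Q*_pp, Q*_p0(q), Q*_00(q) for p = 2.
Qpp_star = np.array([[1.2, 0.1], [0.1, 0.9]])
Qp0_star = np.array([0.4, -0.2])
Q00_star = 1.5

R_B = np.linalg.solve(Qpp_star, Qp0_star)     # (21): R_p^B(q) = Q*_pp^{-1} Q*_p0(q)
risk_B = Q00_star - Qp0_star @ R_B            # (22): Q*_{00.p}(q), the minimum risk

x0 = np.array([0.3, 1.1])                     # a new case X_p^0 (hypothetical)
y0_hat = x0 @ R_B                             # Bayes point prediction (17)
```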
It follows from Lemma 1 that, for δ > 2, γ > 2,

R^B_p(∞) := lim_{q→∞} R^B_p(q)   and   R{R^B_p(∞)} := lim_{q→∞} R{R^B_p(q)}     (23)

exist a.s.

In what follows, we shall be mainly concerned with predictors linear in X^0_p for given (Y^n, X^n_q). We wish to investigate the properties of the full Bayes estimator and to compare it with another linear estimator, the "Bayes-least squares" estimator. The following proposition gives lower bounds for the posterior expected squared error loss, conditioned on the observed explanatory variables X^0_p and the training data (Y^n, X^n_q) (p ≤ q ≤ ∞).

Proposition 1. Under assumptions (1) and (4), the following hold for p ≤ q < ∞, or for p ≤ q = ∞ with δ > 2, γ > 2:

E[{Y^0 − f̂(X^0_p, Y^n, X^n_q)}² | X^0_p, Y^n, X^n_q]
    = (X^0_p)'var(R_p | X^0_p, Y^n, X^n_q)X^0_p + E(σ_{00.p} + Φ | X^0_p, Y^n, X^n_q)     (24)
    ≥ E(Φ | Y^n, X^n_q),     (25)

R{R^B_p(q)} = {E(Σ_{1+p} | Y^n, X^n_q)}_{00.p} + E(Φ | Y^n, X^n_q) ≥ E(Φ | Y^n, X^n_q),     (26)

E(Φ | Y^n, X^n_q) ≥ K/(γ + n − 2)   (γ + n − 2 > 0).     (27)
Proof. Suppose first that q < ∞. Eq. (24) can be obtained from (3) and (18). Applying Lemma 3, E(Φ | X^0_p, Y^n, X^n_q) = E(Φ | Y^n, X^n_q), which leads to (25). Substituting (6), (8) and (11) into (22), we establish (26). To prove (27), we note that assumptions (1) and (4) imply that (W, X^n_q, Σ_{1+q}) ⊥⊥ (e, Φ), so that Φ ⊥⊥ (W, X^n_q, Σ_{1+q}) | e. Hence

E(Φ | Y^n, X^n_q) = E{E(Φ | W, e, X^n_q) | Y^n, X^n_q} = E{E(Φ | e) | Y^n, X^n_q}
    = E{(e'e + K)/(γ + n − 2) | Y^n, X^n_q}
    ≥ K/(γ + n − 2)   if γ + n − 2 > 0,

by the conjugate property of Φ ~ IW(γ, K).
The above results hold for any q such that p ≤ q < ∞. Since ε^0 ~ T(γ, 1, K), η^0 ~ T(δ, 1, Q_00), R_p ~ Q_pp^{-1}Q_p0 + T(δ + p, Q_pp^{-1}, Q_{00.p}), σ_{00.p} ~ IW(δ + p, Q_{00.p}) and Φ ~ IW(γ, K), the following hold for fixed p:

E{(Y^0)²} ≤ 2E{(η^0)² + (ε^0)²} = 2{Q_00/(δ − 2) + K/(γ − 2)} < ∞,
E(R_p R_p') = Q_{00.p}Q_pp^{-1}/(δ + p − 2) + Q_pp^{-1}Q_p0 Q_0p Q_pp^{-1} is finite,
E(σ_{00.p}) = Q_{00.p}/(δ + p − 2) < ∞,
E(Φ) = K/(γ − 2) < ∞,

for δ > 2, γ > 2, and the bounds do not depend on q. Hence, by the martingale convergence theorem, f̂(X^0_p, Y^n, X^n_q) ≡ E(Y^0 | X^0_p, Y^n, X^n_q), E{(Y^0)² | X^0_p, Y^n, X^n_q}, var(R_p | X^0_p, Y^n, X^n_q) ≡ E(R_p R_p' | X^0_p, Y^n, X^n_q) − E(R_p | X^0_p, Y^n, X^n_q)E(R_p' | X^0_p, Y^n, X^n_q), E(σ_{00.p} | X^0_p, Y^n, X^n_q) and E(Φ | Y^n, X^n_q) tend almost surely to limits as q → ∞. Also the existence of the a.s. limit of R{R^B_p(q)} ensures, by the equality in (26), that the a.s. limit of {E(Σ_{1+p} | Y^n, X^n_q)}_{00.p} exists. Letting q → ∞ in (24)-(27), these relations also hold for q = ∞. Furthermore, these relations continue to hold on taking lim inf_{p→∞} lim_{q→∞} on both sides.
From (27) in Proposition 1 we have lim_q E(Φ | Y^n, X^n_q) > 0, a.s. (K > 0). Hence, by (26),

lim inf_{p→∞} R{R^B_p(∞)} > 0,

which shows that the Bayes estimator R^B_p(q) in the nonconjugate case, unlike in the conjugate case, will not be expected to incorporate deterministic predictability of the response.

To investigate the properties of the Bayes estimator R^B_p(q) further, we shall compare it with a "least-squares estimator" which fits the data exactly. The least squares estimator based on the training data (Y^n, X^n_p) is obtained by minimising

||Y^n − X^n_p b_p||²,   b_p ∈ R^p.

If n ≥ p, (X^n_p)'X^n_p > 0 with probability one, so that the normal equation (X^n_p)'X^n_p b_p = (X^n_p)'Y^n has a unique solution b̂_p = {(X^n_p)'X^n_p}^{-1}(X^n_p)'Y^n. If n < p, the equation Y^n = X^n_p b_p is typically consistent but has many solutions. For definiteness we shall further require that b_p also minimise the posterior expected loss (19) among all such solutions, thus obtaining a hybrid "Bayes-least squares" solution. Using (1f.11.5) of Rao (1973, p. 60) to minimise the first term in (20) subject to Y^n = X^n_p b_p, we find the hybrid solution:

R^L_p(q) := Q*_pp^{-1}Q*_p0(q) + Q*_pp^{-1}(X^n_p)'{X^n_p Q*_pp^{-1}(X^n_p)'}^{-1}{Y^n − X^n_p Q*_pp^{-1}Q*_p0(q)},
    (p ≤ q ≤ ∞, p < ∞, p ≥ n).
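For finite p ≥ n the hybrid solution is plain linear algebra once Q*_pp and Q*_p0(q) are available. The sketch below (inputs hypothetical, e.g. taken from the Gibbs-sampling sketch after Lemma 1) computes R^L_p(q) from R^B_p(q) and checks that it reproduces the training responses exactly.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical inputs: p > n, posterior moments Q*_pp, Q*_p0(q), and training data.
n, p = 4, 10
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
Qpp_star = (np.eye(p) + X.T @ X) / (6.0 + n - 2)   # here: Q_pp = I, delta = 6
Qp0_star = rng.standard_normal(p) * 0.1

R_B = np.linalg.solve(Qpp_star, Qp0_star)          # Bayes coefficient (21)

# Bayes-least squares coefficient: R_B plus a correction forcing Y = X R_L.
M = np.linalg.solve(Qpp_star, X.T)                 # Q*_pp^{-1} X'
R_L = R_B + M @ np.linalg.solve(X @ M, Y - X @ R_B)

print(np.allclose(X @ R_L, Y))                     # exact fit of the training data
```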
The corresponding minimum of R(b_p) is

R{R^L_p(q)} = Q*_{00.p}(q) + {Y^n − X^n_p Q*_pp^{-1}Q*_p0(q)}'{X^n_p Q*_pp^{-1}(X^n_p)'}^{-1}{Y^n − X^n_p Q*_pp^{-1}Q*_p0(q)},
    (p ≤ q ≤ ∞, p < ∞, p ≥ n).     (28)

Moreover, by Lemma 1,

R^L_p(∞) := lim_{q→∞} R^L_p(q)   and   R{R^L_p(∞)} := lim_{q→∞} R{R^L_p(q)}   exist a.s.     (29)

(δ > 2, γ > 2). From (23) and (29) we know that, if we obtain a training data-set (Y^n, X^n_q) for sufficiently large q, the predictors and their risks will be close to their limiting values based on the full training data (Y^n, X^n).

5. Comparison of the Bayes and Bayes-least squares estimators

The comparison rests on the limiting behaviour of the posterior mean of the error vector e.

Lemma 4. Under assumptions (1) and (4), let ξ_i := lim_{q→∞} E(ε_i | Y^n, X^n_q) (i = 1, ..., n), ξ := (ξ_1, ..., ξ_n)' = E(e | Y^n, X^n) and S_∞ := ||ξ||². Then the following hold:

(a) if γ > 1, the limits ξ_i exist a.s.;

(b) if γ > 2,

||E(e | Y^n, X^n_q)||² →^{L1} S_∞   (q → ∞),     (30)

lim_{q→∞} E{||E(e | Y^n, X^n_q)||²} = E(S_∞) < ∞.     (31)
Proof. (a) Since ε_i ~ T(γ, 1, K), i = 1, 2, ..., n,

E|ε_i|^r ∝ ∫ |ε_i|^r (K + ε_i²)^{−(γ+1)/2} dε_i.

Thus E|ε_i|^r exists if and only if −1 < r < γ. For γ > 1, {E(ε_i | Y^n, X^n_q), q = 1, 2, ...} is an L1-martingale, so that ξ_i = lim_{q→∞} E(ε_i | Y^n, X^n_q) exists a.s.

(b) Since e'e ~ F(n, γ, K), E(e'e) = (n/γ){γ/(γ − 2)}K = nK/(γ − 2) (γ > 2). Also, by Jensen's inequality,

||E(e | Y^n, X^n_q)||² ≤ E(||e||² | Y^n, X^n_q).

Thus {||E(e | Y^n, X^n_q)||², q = 1, 2, ...} is a submartingale closed by ||e||² and is uniformly integrable (u.i.). It follows that {E(ε_i | Y^n, X^n_q)², q = 1, 2, ...} is u.i. Hence, by (a) and the Mean Convergence Theorem and its Corollary (Chow and Teicher, 1978, Theorem 4.2.3, Corollary 4.2.5), E(ε_i | Y^n, X^n_q) →^{L2} ξ_i, with E{E(ε_i | Y^n, X^n_q)}² → E(ξ_i²), as q → ∞ (i = 1, ..., n). Since

E{||E(e | Y^n, X^n_q)||²} = Σ_{i=1}^n E{E(ε_i | Y^n, X^n_q)}²,

(31) holds. Since ||E(e | Y^n, X^n_q)||² and ||ξ||² are nonnegative, by Corollary 4.2.4 of Chow and Teicher (1978), we establish (30).

Next, we shall investigate the behaviour of the error sum of squares of the Bayes estimator R^B_p(q).

Theorem 1. Under assumptions (1) and (4) with p ≥ n, the following hold:

(a) If γ > 1, then
Y^n − X^n_p R^B_p(∞) →^P ξ = E(e | Y^n, X^n)   as p → ∞.

(b) If γ > 2, then

||Y^n − X^n_p R^B_p(∞)||² →^{L1} S_∞ = ||E(e | Y^n, X^n)||²,
E||Y^n − X^n_p R^B_p(∞)||² → E(S_∞)   as p → ∞.

(c) If δ > 2, γ > 2, then

E(S_∞) = n{K/(γ − 2) − Q_00/(δ − 2)} + E||E(W | Y^n, X^n)||².

Proof. (a) Suppose q ≥ p. Let

B = {I_n + X^n_p Q_pp^{-1}(X^n_p)'}^{-1},
J₁(p) = B(W − X^n_p Q_pp^{-1}Q_p0),
J₂(p) = Be,
J₃(p, q) = (I_n − B)E(e | Y^n, X^n_q).     (32)

Using the matrix identities

B = I_n − X^n_p{Q_pp + (X^n_p)'X^n_p}^{-1}(X^n_p)',   BX^n_p Q_pp^{-1} = X^n_p{Q_pp + (X^n_p)'X^n_p}^{-1}     (33)

(Muirhead, 1982, p. 580), and substituting Y^n = W + e, we obtain, by Lemma 1,

Y^n − X^n_p R^B_p(q) = J₁(p) + J₂(p) + J₃(p, q).     (34)

Let q → ∞. By (23) and Lemma 4, the left-hand side of (34) tends a.s. to Y^n − X^n_p R^B_p(∞)
and the right-hand side of (34) tends a.s. to J₁(p) + J₂(p) + J₃(p, ∞), where J₃(p, ∞) is obtained as J₃(p, q) with E(e | Y^n, X^n_q) replaced by E(e | Y^n, X^n) = ξ.

By (5), X^n_p ~ T(δ, I_n, Q_pp), so that X^n_p Q_pp^{-1}(X^n_p)' ~ F(p, δ, I_n), and thus B ~ B(δ + n − 1, p, I_n) (Dawid, 1981). Also, by (5),

W − X^n_p Q_pp^{-1}Q_p0 | X^n_p ~ T(δ + p, B^{-1}, Q_{00.p})

(Dawid, 1988, Lemma 4). Thus

||J₁||² =^d T'BT,

where T ~ T(δ + p, I_n, Q_{00.p}) and is independent of B. Now TT'/Q_{00.p} ~ F(1, δ + p, I_n). Its (1,1) element is F(1, δ + p, 1), with expectation 1/(δ + p − 2). Hence (Dawid, 1981),

E(TT'/Q_{00.p}) = I_n/(δ + p − 2).

Similarly, we have

E(B) = {(δ + n − 1)/(δ + n − 1 + p)} I_n.     (35)

It follows that

E||J₁||² = tr{E(TT')E(B)} = n(δ + n − 1)Q_{00.p}/{(δ + p − 2)(δ + n + p − 1)}.     (36)

Now let p → ∞. By (36), J₁(p) →^{L2} 0. By (35) and Lemma 4(a), J₂(p) →^P 0 and J₃(p, ∞) →^P E(e | Y^n, X^n) = ξ as p → ∞. This completes the proof of (a).

(b) We have, by (36), E||J₁(p)||² → 0 as p → ∞. Since sup_p ||J₂(p)||² ≤ ||e||² and E||e||² = nK/(γ − 2) < ∞ (γ > 2), by Lebesgue's Dominated Convergence Theorem and the proof of (a),

lim_p E||J₂(p)||² = E{lim_p ||J₂(p)||²} = 0.

Also ||J₃(p, ∞)||² ≤ ||E(e | Y^n, X^n)||² ≤ E(||e||² | Y^n, X^n). Hence {||J₃(p, ∞)||², p = 1, 2, ...} is u.i. By the Mean Convergence Criterion and its Corollary (Chow and Teicher, 1978, Theorem 4.2.3 and Corollary 4.2.5),

||J₃(p, ∞)||² →^{L1} S_∞,   lim_p E||J₃(p, ∞)||² = E(S_∞).

Applying Hölder's Inequality, we obtain

lim_p E||Y^n − X^n_p R^B_p(∞)||² = E(S_∞),

and thus by (a) and Corollary 4.2.4 of Chow and Teicher (1978),

||Y^n − X^n_p R^B_p(∞)||² →^{L1} S_∞.
(c) By assumption (1), E(e'W) = E(e')E(W) = 0 and E(e | Y^n, X^n) = Y^n − E(W | Y^n, X^n). Hence

E(S_∞) = E||Y^n||² − 2E{(Y^n)'E(W | Y^n, X^n)} + E||E(W | Y^n, X^n)||²
       = E||e||² + 2E(e')E(W) + E||W||² − 2E{(Y^n)'W} + E||E(W | Y^n, X^n)||²
       = E||e||² − E||W||² + E||E(W | Y^n, X^n)||².

Since e'e ~ F(n, γ, K) and W'W ~ F(n, δ, Q_00), we have E||e||² = nK/(γ − 2) and E||W||² = nQ_00/(δ − 2), which, combined with the above equation, establish (c).

Theorem 1 shows that the Bayes estimator R^B_p(q) is quite different from the Bayes-least squares estimator R^L_p(q) in the behaviour of the error sum of squares. In particular, if K/(γ − 2) > Q_00/(δ − 2), we have lim_p E||Y^n − X^n_p R^B_p(∞)||² > 0, in contrast to Y^n − X^n_p R^L_p(q) = 0, p ≤ q ≤ ∞. If we let ε ≡ 0 in (1), or equivalently consider (η, X) only, the model will be identical to the conjugate case discussed in Dawid (1988). The condition K/(γ − 2) > Q_00/(δ − 2) may be interpreted as there being a sufficiently large error component in the response relative to the latent component.
Theorem 2. Let R(p, q) = R{R^L_p(q)} − R{R^B_p(q)}, p ≤ q ≤ ∞, with R{R^B_p(q)}, R{R^L_p(q)} defined in (22) and (28). Then under assumptions (1) and (4) with p ≥ n,

R(p, q) = {R^L_p(q) − R^B_p(q)}' Q*_pp {R^L_p(q) − R^B_p(q)}     (37)
        = {Y^n − X^n_p R^B_p(q)}'{X^n_p Q*_pp^{-1}(X^n_p)'}^{-1}{Y^n − X^n_p R^B_p(q)}.     (38)

Moreover, as p → ∞,

R(p, ∞) →^{L1} S_∞/(δ + n − 2),     (39)
E{R(p, ∞)} → E(S_∞)/(δ + n − 2),   γ > 2,     (40)

where S_∞ is defined in Lemma 4.

Proof. Letting b_p = R^L_p(q) in (20), by (21) and (22) we establish (37). Eq. (38) then follows from (21), (22) and (28). Now consider q = ∞. By (32), (33) and (35), we have

{X^n_p Q*_pp^{-1}(X^n_p)'}^{-1} →^P I_n/(δ + n − 2),   p → ∞,     (41)

and {X^n_p Q*_pp^{-1}(X^n_p)'}^{-1} is non-increasing in p, so that

sup_{p≥n} {X^n_p Q*_pp^{-1}(X^n_p)'}^{-1} is finite a.s.     (42)

By (38), (41) and (a) of Theorem 1, we have

R(p, ∞) →^P S_∞/(δ + n − 2),   γ > 1.     (43)

Note that, in the proof of (b) of Theorem 1, we can further deduce that {||Y^n − X^n_p R^B_p(∞)||², p = 1, 2, ...} is u.i., and thus, by (42), {R(p, ∞), p = 1, 2, ...} is u.i. By Corollary 4.2.4 of Chow and Teicher (1978), (39) and (40) hold.

Theorem 3. Under assumptions (1) and (4) with p ≥ n, the conditional distribution of the difference of the two predictors (X^0_p)'R^L_p(q) − (X^0_p)'R^B_p(q), p ≤ q ≤ ∞, given the training data, is

(X^0_p)'R^L_p(q) − (X^0_p)'R^B_p(q) | Y^n, X^n_q ~ T{δ + n, 1, (δ + n − 2)R(p, q)},     (44)

where R(p, q) is given by Theorem 2. Moreover, as p → ∞, (X^0_p)'R^L_p(∞) − (X^0_p)'R^B_p(∞) converges in probability to a random variable Z with distribution given by

Z | S_∞ ~ T(δ + n, 1, S_∞),   S_∞ =^d ||E(e | Y^n, X^n)||²   (γ > 1).

Proof. Assumptions (1) and (4) imply (η^0, X^0_p, W, X^n_q) ⊥⊥ e, so that X^0_p ⊥⊥ e | (W, X^n_q) and thus X^0_p | (W, X^n_q, e) =^d X^0_p | (W, X^n_q). By (5) and Lemma 4 of Dawid (1988), in the notation of Lemma 1,

X^0_p | (W, X^n_q) =^d X^0_p | (W, X^n_p) ~ T{δ + n, (δ + n − 2)Q*_pp, 1},

whence

(X^0_p)'{R^L_p(q) − R^B_p(q)} | (W, X^n_q, e) ~ T{δ + n, 1, (δ + n − 2)R(p, q)}

by Theorem 2, (37). Since R(p, q) depends on (W, e) through Y^n = W + e only, (44) is obtained.

Represent (44) with q = ∞ as R(p, ∞)^{1/2} · R(p, ∞)^{-1/2}(X^0_p)'{R^L_p(∞) − R^B_p(∞)}, the second factor of which is distributed as T(δ + n, 1, δ + n − 2) and is independent of R(p, ∞). The asymptotic distribution is then obtained by applying (43) of Theorem 2 for γ > 1. We have

var(Z) = E{E(Z² | S_∞)} = E(S_∞)/(δ + n − 2).

If K/(γ − 2) − Q_00/(δ − 2) > 0, γ > 2, δ > 2, then var(Z) > 0, so that Z is a nondegenerate random variable. Thus, for large p, the two predictors (X^0_p)'R^L_p(∞) and (X^0_p)'R^B_p(∞) cannot be identical.

We have investigated the case p → ∞, q = ∞. Similarly, we can show that these limits are identical to the corresponding double limits as (p, q) → (∞, ∞) with p ≤ q. That is to say, the asymptotic properties discussed do not depend on the way in which we increase the numbers of explanatory variables, p and q, in the observation to be predicted and in the training data, so long as p ≤ q.
6. Conjugate analysis

The above calculations also hold formally for ε ≡ 0, on taking Y = η, Φ = 0, K = 0, or, starting from model (1) and (4) with ε ≡ 0, i.e.

(Y, X) ~ N(1, Σ),   Σ ~ IW(δ, Q),

which is the conjugate case considered in Dawid (1988). Then the overall distribution of Z is T(δ, I_{n+1}, Q). In this case, R^B_p(q) does not depend on q (cf. (10), (12)). That is, if we have observed only the values of X₁, ..., X_p for our new case, then we need use only Y^n and the first p columns of X^n in the training data. The following holds:

Y^0 − (X^0_p)'R^B_p(q) | Q*_{00.p} ~ T(δ + n, 1, (δ + n − 2)Q*_{00.p}),

(δ + n − 2)Q*_{00.p} = Q_{00.p} + (Y^n − X^n_p Q_pp^{-1}Q_p0)'{I_n + X^n_p Q_pp^{-1}(X^n_p)'}^{-1}(Y^n − X^n_p Q_pp^{-1}Q_p0)
                    ~ Q_{00.p} + F(n, δ + p, Q_{00.p})

(Dawid, 1988, (5.14), (5.15), (6.13), (6.15)).
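In this conjugate case everything is available in closed form; the following sketch (settings ours, purely illustrative) evaluates the Bayes coefficient vector and the predictive scale appearing in the display above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical conjugate-case setup: Q_{1+p} = I, given delta; no error component.
p, n, delta = 5, 15, 6.0
Qpp, Qp0, Q00 = np.eye(p), np.zeros(p), 1.0
Q00_p = Q00 - Qp0 @ np.linalg.solve(Qpp, Qp0)        # Q_{00.p}

# Toy training data; any (Y, X_p) serves to illustrate the formulae.
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Conjugate Bayes coefficient, cf. (17)/(21): (Q_pp + X'X)^{-1}(Q_p0 + X'Y).
R_B = np.linalg.solve(Qpp + X.T @ X, Qp0 + X.T @ Y)

# Predictive scale (delta + n - 2) Q*_{00.p}
#   = Q_{00.p} + r'{I_n + X Q_pp^{-1} X'}^{-1} r,  with r = Y - X Q_pp^{-1} Q_p0.
r = Y - X @ np.linalg.solve(Qpp, Qp0)
scale = Q00_p + r @ np.linalg.solve(np.eye(n) + X @ np.linalg.solve(Qpp, X.T), r)

# Y^0 - x0'R_B | training data ~ T(delta + n, 1, scale): a scaled Student-t with
# delta + n degrees of freedom and variance scale/(delta + n - 2) = Q*_{00.p}.
x0 = rng.standard_normal(p)
pred_mean = x0 @ R_B
pred_var = scale / (delta + n - 2)
```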
Now Q_{00.p} > 0, decreasing in p, implies that Q_{00.∞} := lim_p Q_{00.p} exists, so that (δ + n − 2)Q*_{00.p} →^P Q_{00.∞}. Hence, as p → ∞, Y^0 − (X^0_p)'R^B_p(q) →^L T(δ + n, 1, Q_{00.∞}). The condition Q_{00.∞} = 0 is thus necessary and sufficient for asymptotically deterministic prediction of Y^0, and in that case (X^0_p)'R^L_p(p) − (X^0_p)'R^B_p(p) →^{L2} 0; i.e., the two predictors (X^0_p)'R^L_p(p) and (X^0_p)'R^B_p(p) for predicting a future response Y^0 are asymptotically equivalent. Moreover, by Theorem 1, the error sum of squares of the Bayes estimator, ||Y^n − X^n_p R^B_p(p)||², tends to 0, compatible with the condition ||Y^n − X^n_p R^L_p(p)||² = 0 satisfied by the Bayes-least squares estimator. Thus, the use of the usual conjugate priors leads to asymptotic equivalence of the Bayesian inference and the unadjusted classical one, a (typically) undesirable consequence of the (typically) unrealistic determinism embodied in the conjugate prior.

The behaviour under our nonconjugate prior is quite different. As the optimal estimator achieving the minimum of the posterior squared error loss, the Bayes estimator R^B_p(q) does not imply deterministic predictability, since, as shown in Proposition 1, this minimum has a positive lower bound. Moreover, the asymptotic behaviours of the Bayes estimator R^B_p(q) and the Bayes-least squares estimator R^L_p(q) are quite different. From the Bayes point of view, Theorem 2 shows that the difference of their posterior expected error losses is nondegenerate as p → ∞. From the classical point of view, Theorem 1 shows that the error sum of squares of the Bayes estimator R^B_p(q) is nondegenerate as p → ∞, in contrast to the condition that Y^n = X^n_p R^L_p(q).
Also, on a future case, the difference of the two predictors (X^0_p)'R^B_p(q) and (X^0_p)'R^L_p(q) is nondegenerate as p → ∞ (Theorem 3). The above conclusions have been shown to hold on condition that K/(γ − 2) > Q_00/(δ − 2) (cf. Theorem 1(c)), but we do not believe this condition to be necessary. Thus, by using a nonconjugate prior as studied in this paper, one can avoid at least some of the problems of overfitting when using a large number of explanatory variables in regression.

7. Conclusion

Conjugate priors are often adopted in Bayesian analysis because of their perceived richness and ease in calculation. However, in most applied contexts it would be unreasonable to believe that prediction could be done perfectly if only sufficiently many explanatory variables were observed, which is what the conjugate prior implies in the regression problem considered here. The nonconjugate prior studied in this paper avoids that implication, retaining essential residual uncertainty about the response however many explanatory variables are used.