Journal of Econometrics 43 (1990) 255-274. North-Holland
EFFECTS OF COLLINEARITY ON INFORMATION ABOUT REGRESSION COEFFICIENTS

Ehsan S. SOOFI*

University of Wisconsin, Milwaukee, WI 53201, USA

Received August 1987, final version received December 1988
This paper uses information theory to examine the effects of nonorthogonality of a regression matrix on the data and posterior information about the regression coefficients. In the absence of an informative prior, the entropy analysis results in collinearity measures that are akin to some well-known diagnostics available for least squares regression. Entropy-difference and cross-entropy analyses are used to quantify collinearity effects on the sample information and the posterior distribution. Information-theoretic diagnostics are developed for examining the incremental contribution of additional prior precision to the reduction of collinearity effects on the sample and posterior information. The proposed collinearity measures are applied to an illustrative example.
1. Introduction

The problems of nonorthogonal regressors have plagued applied researchers for a long time. The symptoms of (near) collinearity are both computational and conceptual. The computational problems are those related to the difficulties of computing numerical solutions to a system of linear equations with an ill-conditioned coefficient matrix. The conceptual problems are those associated with the dependency of a distribution function on a collinear structure. In the classical approach, examining the effects of an inadequate regression matrix on the sampling behavior of estimators and test statistics is of primary interest. In Bayesian analysis the interest lies in understanding the consequences of a collinear data set for the properties of a given posterior distribution, posterior odds, etc. Thus far the conceptual aspects of collinearity have been discussed almost entirely in the classical paradigm. [E.g., see Farrar and Glauber (1967), Silvey (1969), Hoerl and Kennard (1970), Marquardt (1970), Willan and Watts (1978), Belsley, Kuh, and Welsch (1980), Belsley (1982), and Stewart (1987).] A Bayesian interpretation of collinearity is given in Leamer (1973). Thisted (1982) discusses collinearity problems in a decision-theoretic context.

*I acknowledge with thanks the comments received from Arnold Zellner, an Associate Editor, and anonymous referees on drafts of this paper. I believe that, in response to these comments, my understanding of the problem and the presentation improved substantially. I am, however, solely responsible for any shortcomings.
This paper presents an information-theoretic approach to the study of collinearity effects in Bayesian regression.

A widely accepted consequence of collinear regressors is the inadequacy of the information content of the data about the regression coefficients. It is often recommended that a set of ill-conditioned data be supplemented with some additional information. However, the information which becomes inadequate in the presence of collinearity and needs to be supplemented is not usually defined and quantified explicitly. Sometimes information is implicitly taken to be synonymous with the precision of some estimates. Unless the notion of information is precisely defined and quantified in a problem, it is difficult to know how much more information is required to compensate for a given amount of collinearity effects. Particularly in a Bayesian analysis, diagnostics that help a researcher supply additional prior information to compensate for some collinearity effects are of interest.

The main objective of this paper is to develop measures that quantify the effects of collinearity on the amount of information provided by the observed sample and by the posterior distribution about the regression parameters. This paper uses the technical definitions of Shannon (1948), Lindley (1956), and Kullback and Leibler (1951) to quantify the effects of the structure of a regression matrix on data and posterior information. Information functions are defined that explicate the interrelationship between prior knowledge about the parameters and some collinearity effects. The proposed measures provide diagnostics for examining the available tradeoffs between acquisition of prior information and reduction of some collinearity effects.

This paper is organized as follows. In section 2, the case of absence of prior information in the analysis is discussed. It is shown that Shannon's entropy of the posterior distribution leads to quantities that are related to some well-known collinearity measures. In section 3, entropy differences are used to quantify the loss of data information due to collinearity for a two-stage normal regression model. In section 4, it is shown that for suitably chosen arguments the Kullback-Leibler information function can be used to measure the loss of information due to collinearity in least squares estimation and to assess the effects of collinearity on the posterior analysis. In section 5, information quantities are computed for a well-known collinear data set. A summary is given in section 6.
2. Without prior information

Consider the linear model

    y = Xβ + u,                                                    (1)

where Y is an n × 1 observable random vector; X is a nonstochastic and
nonsingular n × p regressor matrix; β = (β₁, ..., βₚ)' is the vector of regression parameters; and u is the n × 1 vector of uncorrelated error terms generated from N(0, σ²), σ² > 0. Then, given β and σ², Y follows the n-variate normal distribution f(y|β, σ²) = N(Xβ, σ²Iₙ), where Iₙ denotes the n × n identity matrix. We are interested in the effects of the structure of X on information quantities about β.

Throughout this paper we condition all the distribution functions on σ². However, this conditioning will not be displayed, for notational convenience. We condition on the error variance for two reasons. First, as will be seen, σ² appears in the subsequent information quantities either as a scaling factor or in the ratio of prior to model precision. This ratio plays a pivotal role in analyzing the incremental contribution of the prior precision to the reduction of collinearity effects. Second, taking inference on the dispersion parameter into consideration would not provide the simple information quantities desired for a collinearity analysis.

In the absence of any prior information about the parameters, we use a diffuse prior for β. In information theory, a diffuse prior can be motivated by the minimum information (maximum entropy) principle [Jaynes (1968), Press (1982)] and the maximal data information principle [Zellner (1971, 1984)], provided that the parameter space is bounded. Then the posterior distribution of β is f(β|y) = N[b, σ²(X'X)⁻¹], where b = (X'X)⁻¹X'y is the least squares (LS) estimate of β. To summarize the overall amount of information about β, we use the entropy [Shannon (1948)] of the posterior distribution given by
    H(β|y) = (p/2) log(2πe) − ½ log|X'X/σ²|.                       (2)
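For concreteness, the entropy in (2) can be computed directly from the spectrum of X'X. The following sketch is illustrative only (the function name is ours, and a numpy array X with a known value of σ² is assumed):

    import numpy as np

    def posterior_entropy(X, sigma2):
        # Entropy (2) of the diffuse-prior posterior N[b, sigma2*(X'X)^(-1)], natural logarithm.
        p = X.shape[1]
        lam = np.linalg.eigvalsh(X.T @ X)          # eigenvalues of X'X
        return 0.5 * p * np.log(2 * np.pi * np.e) - 0.5 * np.sum(np.log(lam / sigma2))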
In general, Shannon's entropy is not invariant under nonorthogonal transformations. The origin and unit in which entropy measures uncertainty are completely arbitrary. The origin is the entropy of a uniform distribution over the entire space. It has been pointed out, for instance by Jaynes (1968), that the entropy of a continuous distribution should be defined by the cross-entropy H' = ∫ f(t) log[f(t)/m(t)] dt, where m(t) is a measure function. As in most applications of entropy, we are interested in differences of entropies (section 3) and cross-entropy measures (section 4), which are invariant under all invertible transformations. The unit is the base of the logarithm. In statistics the base is usually taken to be e; one may also use base 10 for a reason to be explained below. [In communication engineering the base is often chosen to be 2.]

The entropy H(β|y) quantifies the amount of uncertainty in f(β|y) about the parameter. Since the prior entropy of β is just a constant, a large value of H(β|y) indicates that the data contain a small amount of information about β.
In the statistics literature Shannon's information measure is defined by I(β|y) = −H(β|y). To examine the effects of the structure of X on Shannon's information we write

    I(β|y) = ½ Σᵢ₌₁ᵖ log(λᵢ/σ²) − (p/2)C,                          (3)
where the λᵢ's are the eigenvalues of X'X in descending order and C is a constant. The information in the direction of a principal component is given by I(vᵢ'β|y) = ½ log(λᵢ/σ²) − ½C, where vᵢ is the eigenvector corresponding to λᵢ. Thus the sum of the independent information quantities in the directions of the principal components is I(β|y), which is an overall measure of information about β. The information in the directions of the principal components may be compared using the information indexes defined by the entropy differences

    Δᵢ = I(v₁'β|y) − I(vᵢ'β|y) = ½ log[λ₁/λᵢ] = log κᵢ,   i = 1, ..., p.      (4)
In (4), the κᵢ are the condition indexes utilized extensively in Belsley, Kuh, and Welsch (1980) as collinearity diagnostics for least squares estimation. In particular the information number, defined by Δ(X) = I(v₁'β|y) − I(vₚ'β|y), provides a diagnostic for the information spectrum of the regression matrix in the absence of an informative prior. Indeed, Δ(X) = log κ(X), where κ(X) is the condition number of X. We also note that the entropy induces a logarithmic transformation on the determinant of X'X. This transformation indicates that as X approaches singularity, the amount of information in the posterior, as measured by the entropy, decreases at a slower rate than that of |X'X|⁻¹.

The information optimal design, which will be used as a reference in the subsequent analysis, is identified as follows:

Property 2.1. The posterior N[b, σ²(X'X)⁻¹] is most informative (minimum entropy) if and only if the regressors are orthogonal.

This is verified by defining ξᵢ = λᵢ/λ̄, where λ̄ is the arithmetic mean of the eigenvalues, and taking σ² outside of the summation in (3). Then
    I(β|y) = ½ Σᵢ₌₁ᵖ log ξᵢ + (p/2) log(λ̄/σ²) − (p/2)C.
Since the arithmetic mean of the ξᵢ is one, Σᵢ₌₁ᵖ log ξᵢ ≤ 0; hence I(β|y) ≤ (p/2) log(λ̄/σ²) − (p/2)C, with equality if and only if ξᵢ = 1 for all i = 1, ..., p. That is, λᵢ = λ̄; so the
eigenvalues have a uniform spectrum; hence X is orthogonal. Furthermore, normalizing the mean of the eigenvalues makes comparison of various entropies meaningful. In particular, normalizing the columns of X satisfies this requirement. Henceforth, without loss of generality, we assume the regressors are scaled to unit lengths, so λ̄ = 1.

At this point some remarks are in order. We have seen that in the absence of prior information about β, the basic entropy analysis leads to a simple logarithmic transformation of diagnostics developed previously. The following remarks further explicate the relationships between our information-theoretic approach and the traditional diagnostics.
Remark 2.1. Sometimes in the sampling theory approach, based on the Fisherian notion of information, the precision of an efficient estimator is used as a summary measure of information about the parameter being estimated. Thisted (1987) notes: 'The condition number of X has an important statistical interpretation in the regression problem which is generally overlooked.' Using the singular value decomposition of X, it can be shown that the variances of the sampling distributions of the LS estimates of the linear functions vᵢ'β are given by σ²/λᵢ, i = 1, ..., p. Thisted (1987) then interprets the linear combinations (emphases are added): '... v₁'β about which the regression data are most informative. ... vₚ'β about which the data at hand are least informative. Thus κ² is the variance ratio between the most precisely estimated linear combination of β and the least precisely estimated.' The above information-theoretic approach explicitly confirms this interpretation of the condition number.
Remark 2.2. It is well known in numerical analysis that κ⁻¹(X) gives a relative distance from X to perfect collinearity. A numerical analyst [Stewart (1987)] interprets the condition number as follows (emphasis is added): '... if the columns of X are roughly equal in size and κ(X) = 10⁵, then we can attain [perfect] collinearity by perturbing the elements of X in their fifth [significant] digit.' Now if the entropy is evaluated using the base 10 logarithm, then the information number of X gives Δ(X) = log₁₀ κ(X) = 5. Thus Δ(X) provides a more natural scale for measuring the condition of X than κ(X). That is, with an appropriate base, the logarithmic information number directly identifies the significant digit in which perturbations could result in perfect collinearity of a regression matrix.
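A sketch of these diagnostics follows; it is an illustration only (the scaling to unit-length columns, the function name, and the use of numpy are our assumptions, not prescriptions of this paper). It returns the condition indexes κᵢ and the information indexes Δᵢ = log κᵢ of (4) in any base; with base 10 the last index is the information number Δ(X) of this remark.

    import numpy as np

    def information_indexes(X, base=np.e):
        # Condition indexes kappa_i and information indexes Delta_i = log kappa_i, eq. (4).
        Xs = X / np.linalg.norm(X, axis=0)                   # scale regressors to unit length
        lam = np.sort(np.linalg.eigvalsh(Xs.T @ Xs))[::-1]   # eigenvalues of X'X, descending
        kappa = np.sqrt(lam[0] / lam)                        # condition indexes
        delta = np.log(kappa) / np.log(base)                 # Delta_i = log kappa_i
        return kappa, delta                                  # delta[-1] is the information number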
Remark 2.3. Stewart (1987) presents an example to show that the determinant factorially exaggerates a gentle rate of descent of a matrix into perfect collinearity. He considers a regression matrix triangularized as follows:

    T_p = | 1    1/√2   ...   1/√p |
          | 0    1/√2   ...   1/√p |
          | 0     0     ...   1/√p |
          | :     :            :   |
          | 0     0     ...   1/√p |

Then |Xₚ'Xₚ| = |Tₚ'Tₚ| = 1/p!. Therefore the determinant of Xₚ'Xₚ decreases factorially with p. Stewart, using the Eckart-Young-Mirsky Theorem, observes: '... with increasing p the regression matrix suffers a gentle descent into [perfect] collinearity, but not at the exaggerated rate suggested by the determinant.' However, we note that the logarithmic transformation of the determinant induced by the entropy gives log|Xₚ'Xₚ| = −Σᵢ₌₁ᵖ log i. Thus with increasing p the entropic information, I(β|y), decreases at the gentle rate of log p. Whether or not a logarithmic transformation precisely reflects the rate of descent of the regression matrix into perfect collinearity remains to be investigated by numerical analysts; but at least it points in the right direction.
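The two rates contrasted in this remark are easy to check numerically. The sketch below is our illustration only (it assumes p is small enough for exact computation); it builds Tₚ and compares log|Tₚ'Tₚ| with −log p!:

    import numpy as np
    from math import lgamma

    def stewart_T(p):
        # Stewart's triangularized regression matrix: column j has entries 1/sqrt(j) in rows 1..j.
        T = np.zeros((p, p))
        for j in range(1, p + 1):
            T[:j, j - 1] = 1.0 / np.sqrt(j)
        return T

    for p in (5, 10, 20):
        T = stewart_T(p)
        logdet = np.linalg.slogdet(T.T @ T)[1]
        print(p, logdet, -lgamma(p + 1))    # log|T_p'T_p| equals -log(p!) = -sum of log i, i <= p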
3. With some prior information

Based on knowledge of the data-generating process, a researcher may assess the expected values of the parameters with the same precision. This can be formulated as the prior expectation E(β) = (μ₁, ..., μₚ)' = μ and var(β) = τ²Iₚ. Then the least informative (maximum entropy) prior distribution is f(β) = N(μ, τ²Iₚ). If, as above, we assume f(y|β) = N(Xβ, σ²Iₙ), then the posterior distribution of β is f(β|y) = N(β̃, Σ̃) with β̃ = (X'X + ρIₚ)⁻¹(X'y + ρμ) and Σ̃ = (X'X + ρIₚ)⁻¹σ², where ρ = σ²/τ². The posterior distribution contains all the sample and nonsample information formulated in the likelihood function and the prior density, respectively.

The prior entropy H(β) summarizes the uncertainty in the prior distribution f(β). After observing the data, the posterior entropy H(β|y) measures the uncertainty in the posterior distribution f(β|y). The difference between the two entropies quantifies the additional amount of information provided by the data about the parameters [Lindley (1956)]. The amount of information in the data about β is given by

    J_X(β) = H(β) − H(β|y) = ½ log|Iₚ + X'X/ρ|.                    (5)
This simple information function quantifies the contribution of the data information to the posterior analysis. In more familiar terms, expression (5) is one half of the log-ratio of the posterior generalized precision |Σ̃⁻¹| to
the prior generalized precision. [Zellner (1971, p. 310), in a discussion of posterior odds ratios, interprets the generalized precision as a measure of information.] Here we may add that J_X(β) is well-defined for rank-deficient regression matrices, but it is not defined when ρ = 0. Noting that J_X(β) = ½ Σᵢ₌₁ᵖ log(1 + λᵢ/ρ), the information indexes for the two-stage normal regression model are given by

    Δᵢ^ρ = ½ log[(ρ + λ₁)/(ρ + λᵢ)],   i = 1, ..., p.
For ρ = 0, Δᵢ⁰ = Δᵢ, and for i = p, Δₚ^ρ = Δ^ρ(X) = Δ(ρIₚ + X'X) defines the information number of X for the normal regression with the conjugate prior. The generalization of Δᵢ^ρ to the case of unequal prior variances is straightforward.

Often problems of collinearity are studied by comparing the results based on the actual data with the results that could be obtained from an optimal design matrix. For a given precision ratio ρ, the amount of information in the data about β is maximum if and only if X is orthogonal [see Soofi and Gokhale (1986) for details]. Consequently, the data information loss due to collinearity of the regression matrix for the two-stage model is measured by
    JL_X(β) ≡ J_X⁰(β) − J_X(β) = H_X(β|y) − H_X⁰(β|y)              (6)

            = ½ Σᵢ₌₁ᵖ log[(ρ + 1)/(ρ + λᵢ)].                       (7)
In (6), X⁰ denotes an orthogonal reference regressor matrix, usually defined for comparison purposes, and the posterior information functions use the conjugate prior. JL_X(β) quantifies the effect of collinearity on the sample information in the presence of nonsample information formulated as the prior density. τ² represents the effect of prior knowledge on the sample information loss due to nonorthogonality of the regressors. Next we formulate the effect of the prior knowledge on the data information loss.

Property 3.1. JL_X(β) is monotone decreasing in ρ; equivalently, it is monotone increasing in τ².

This can be shown by noting that dJL_X(β)/dρ = ½[p/(ρ + 1) − Σᵢ₌₁ᵖ 1/(ρ + λᵢ)] ≤ 0, since the arithmetic mean of the ρ + λᵢ, i = 1, ..., p, is ρ + 1. The inequality follows from the fact that the arithmetic mean is greater than or equal to the harmonic mean.
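The quantities in (5) and (7) depend on X only through the eigenvalues of the unit-length-scaled X'X, so they are simple to trace over a grid of precision ratios. The sketch below is an illustration under that scaling assumption (the function names are ours):

    import numpy as np

    def data_information(lam, rho):
        # J_X(beta) = (1/2) * sum log(1 + lambda_i / rho), eq. (5)
        return 0.5 * np.sum(np.log1p(np.asarray(lam) / rho))

    def information_loss(lam, rho):
        # JL_X(beta) = (1/2) * sum log[(rho + 1)/(rho + lambda_i)], eq. (7); assumes mean(lam) = 1
        lam = np.asarray(lam)
        return 0.5 * np.sum(np.log((rho + 1.0) / (rho + lam)))

    # Property 3.1 numerically: the loss decreases as rho grows (tau^2 shrinks), e.g.
    # [information_loss(lam, r) for r in (0.005, 0.02, 0.1, 0.2, 0.5, 1.0)]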
Property 3.2. As τ² → ∞, JL_X(β) → −½ Σᵢ₌₁ᵖ log λᵢ = I_X⁰(β|y) − I_X(β|y), where the posterior information functions use the diffuse prior.

This is seen by noting that ρ → 0 as τ² → ∞ and taking the limit in (7). Thus, as the prior becomes less precise, the data information loss approaches the difference between the posterior entropy of β based on the actual data and that based on an orthogonal reference design with the noninformative prior.
Property 3.3. For all τ² > 0, JL_X(β) < −½ Σᵢ₌₁ᵖ log λᵢ.
This follows from Properties 3.1 and 3.2. Note that as τ² → 0, JL_X(β) → 0.

Properties 3.1-3.3 show that JL_X(β) provides a sensible metric for quantifying the gain from prior assumptions when analyzing extremely collinear data. Property 3.1 makes it clear that the information loss relative to an orthogonal design monotonically decreases with gains in the prior precision. Property 3.2 quantifies the limit for the amount of information loss due to nonorthogonality. Thus JL_X(β) provides analysts with diagnostics with which they can make their own judgments about the amount of prior information they are willing to supply in order to compensate for the data information loss due to nonorthogonality.

Remark 3.1. Except for the trivial case of τ²/σ² = 0, J_X(β) > 0; that is, the sample is always informative about β. Moreover, J_X(β) is an increasing function of τ² and a decreasing function of σ². This confirms the intuitive notion of information that with less precise priors and/or better specification of the model one can improve the informational value of the data. We note in passing that when τ² and σ² are estimated from the data, the resulting estimate of J_X(β) can serve as a goodness-of-fit criterion for the two-stage normal regression model; it is then a Bayesian counterpart of the conventional goodness-of-fit statistics.

Therefore the effect of the precision of nonsample information on the amount of information provided by the sample about the parameters is twofold. As the prior precision increases: (a) the contribution of the sample information to the posterior analysis about β, as measured by J_X(β), decreases, and (b) the sample information converges to the information provided by an orthogonal design.
4. Relative entropies

Another way to measure the discrepancy between the information provided by
alternative designs about the regression coefficients is to use the cross-entropies between two distributions of interest. The information discrepancy between two distributions f₁ and f₂ can be measured by the cross-entropy
    I(f₁ : f₂) = ∫ f₁(t) log[f₁(t)/f₂(t)] dt,                      (8)

provided that f₂(t) = 0 only if f₁(t) = 0, for almost all t ∈ Rᵖ. The integral in (8) is known as the Kullback-Leibler discrimination information function between f₁ and f₂. It has been studied extensively in terms of the probability densities f₁ and f₂ by Kullback (1959). I(f₁ : f₂) is not symmetric in its arguments; (8) measures the entropy of f₁ relative to f₂. Thus f₂ is the reference distribution. A symmetric version of (8) was first introduced by Jeffreys (1946). The information discrepancy is scale-invariant, but it is not 'centering'-invariant; under centering the information about the intercept is lost. In the remainder of this section I(f₁ : f₂) is applied to two cases of interest in collinearity studies.

4.1. Orthogonal reference

Denote the LS estimate of β based on the hypothetical design X⁰ by b⁰. Many collinearity studies focus on comparing sampling characteristics of b with those of b⁰ [see, e.g., Marquardt (1970) and Willan and Watts (1978)]. In the classical paradigm, the sampling distribution of b⁰ defines an idealized reference with which the information content of the data can be compared. The LS estimate is a sufficient statistic for β, so all the sample information pertaining to the parameters is contained in b, i.e., in the sampling distribution f(b|β) = N[β, σ²(X'X)⁻¹]. Noting that f₀(b⁰|β) = N(β, σ²Iₚ) and evaluating the integral in (8) gives

    I(B : B⁰) = I[f(b|β) : f₀(b⁰|β)]
              = ½ tr(X'X)⁻¹ − ½ log|X'X|⁻¹ − p/2,                  (9)
where B and B⁰ denote the corresponding LS estimators viewed as random variables. [We should emphasize that X⁰ is fictitious, since model (1) implies that the expectation of Y cannot at the same time be X⁰β and Xβ for X⁰ ≠ X.] I(B : B⁰) is the entropy of the sampling distribution of b relative to that of b⁰. It was established (Property 2.1) that, in the absence of prior knowledge, X⁰ is the most informative (minimum entropy) design. Thus the information discrepancy I(B : B⁰) can be used as a measure of loss of information in LS estimation due to nonorthogonality of X. We view I(B : B⁰) as a scalar function of X, and summarize the effect of the structure of the regression matrix on the information discrepancy as follows.
Property 4.1. I(B : B⁰) ≥ 0 for all X; the equality holds if and only if X is orthogonal.
This result follows from the nonnegativeness of the integral in (8) [Kullback (1959)]. The equality holds if and only if the two distributions are equal almost everywhere in Rᵖ. Since they differ only in the structure of the regressor matrix, and since we are taking f₀ as the reference and allowing X to vary, the equality is satisfied if and only if X'X = Iₚ.

Property 4.2. As X approaches perfect collinearity, I(B : B⁰) → ∞.

This is shown by evaluating the determinant and the trace in (9), which gives

    I(B : B⁰) = ½ [Σᵢ₌₁ᵖ 1/λᵢ + Σᵢ₌₁ᵖ log λᵢ − p].                  (10)

Let νₚ = 1/λₚ; then

    lim_{λₚ→0} (log λₚ + 1/λₚ) = lim_{νₚ→∞} (log(1/νₚ) + νₚ) = lim_{νₚ→∞} log(e^{νₚ}/νₚ) = ∞.
The last equality follows by applying l'Hôpital's rule. Therefore, I(B : B⁰) ranges from 0 to ∞ as X departs from orthogonality toward singularity. According to Property 4.1, there is always a positive amount of information lost due to the nonorthogonality of X. By Property 4.2, the loss of information in the least squares regression is tremendously high in a near-collinearity condition.

The information measure given in (9) is a comprehensive summary of (X'X)⁻¹ that is composed of the trace, the determinant, and the rank of (X'X)⁻¹. The composition of the determinant, trace, and rank of (X'X)⁻¹ in the information disparity is rather interesting. From (10) we observe that

    I(B : B⁰) = (p/2) [Har⁻¹(λ) + log Geo(λ) − λ̄],

where Har(λ), Geo(λ), and λ̄ are, respectively, the harmonic, geometric, and arithmetic means of the eigenvalues. We see that the mean based on the reciprocals of the λᵢ's appears in reciprocal form; the mean based on the products of the λᵢ's appears on the logarithmic scale; and the arithmetic mean appears in linear form. Thus the collinearity effect on the sampling distribution of the LS estimate is measured by a combination of the three means of the characteristic roots of X'X, with each mean entering through a different transformation. Although I(B : B⁰) is a complicated function of X, it is analytically well-behaved and easily computable.
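As a computational note (our sketch, not part of the formal development), (9)-(10) require only the eigenvalues of the scaled X'X; the second function below uses the three-means form and agrees with the first whenever the columns of X have unit length, so that the arithmetic mean of the eigenvalues is one:

    import numpy as np

    def ls_information_loss(lam):
        # I(B : B0) = 0.5 * [sum 1/lambda_i + sum log lambda_i - p], eqs. (9)-(10)
        lam = np.asarray(lam, dtype=float)
        return 0.5 * (np.sum(1.0 / lam) + np.sum(np.log(lam)) - lam.size)

    def ls_information_loss_means(lam):
        # (p/2) * [1/Har(lambda) + log Geo(lambda) - mean(lambda)]
        lam = np.asarray(lam, dtype=float)
        p = lam.size
        har = p / np.sum(1.0 / lam)            # harmonic mean
        geo = np.exp(np.mean(np.log(lam)))     # geometric mean
        return 0.5 * p * (1.0 / har + np.log(geo) - np.mean(lam))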
The discrepancy measure (9) can easily be extended to the Bayesian analysis with and without an informative prior. For the case of the normal conjugate prior we have f_X⁰(β|y) = N[β̃⁰, σ²(1 + ρ)⁻¹Iₚ], with β̃⁰ = (1 + ρ)⁻¹(b⁰ + ρμ). Then the posterior entropy relative to the orthogonal reference will be

    I(f_X : f_X⁰|y) = ½ (tr[(1 + ρ)(X'X + ρIₚ)⁻¹] − log|(1 + ρ)(X'X + ρIₚ)⁻¹| − p)
                      + [(1 + ρ)/2σ²] ‖β̃ − β̃⁰‖²,                  (11)
where ‖·‖ denotes the Euclidean norm. I(f_X : f_X⁰|y) quantifies the posterior information loss due to collinearity. The first term in (11) measures the effect of nonorthogonality of X on the posterior dispersion matrix, and the last term measures the effect of collinearity on the posterior mean. Noting that X = X⁰ implies X'X + ρIₚ = (1 + ρ)Iₚ and β̃ = β̃⁰, we see that Property 4.1 is satisfied by I(f_X : f_X⁰|y). Other properties of the posterior entropy relative to an orthogonal design are summarized as follows.
Property 4.3. For ρ > 0, I(f_X : f_X⁰|y) is finite for all X.
Property 4.4. For a given X, I(f_X : f_X⁰|y) is monotone decreasing in ρ.
Property 4.5. For all τ² > 0, I(f_X : f_X⁰|y) ≤ I(B : B⁰) + (1/2σ²) ‖b − b⁰‖².
The proofs are given in the appendix. These properties indicate that I(f_X : f_X⁰|y) is well-defined when X is singular, and that for a given X the maximum value is obtained when no prior information is used in the analysis. The amount of uncertainty in the posterior distribution relative to an orthogonal design then decreases monotonically with increases in the prior precision. Often the rate of decrease is faster than that of JL_X(β), i.e., the prior is more effective in reducing the collinearity effects on the posterior information than on the sample information (see fig. 1). For the noninformative prior, (11) reduces to I(B : B⁰) plus an additional term that measures the collinearity effect on the posterior mean. Therefore, the relative entropy (11) provides another set of diagnostics for evaluating the incremental contribution of prior precision to the reduction of collinearity effects on the posterior information.

A graphical display of the effects of the prior precision on the information quantities J_X(β) and J_X⁰(β), the percentage reduction in the data information loss as measured by (7), and/or the percentage reduction in the posterior relative entropy as measured by (11) will be referred to as the Collinearity Information Graph (see fig. 1). For a given data set, the Collinearity Information Graph provides a comprehensive summary of the available tradeoffs between the incremental contribution of additional prior information, the amount of sample information and its relation to the orthogonal reference, and the amount of reduction in the sample and posterior information losses.
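A sketch of how (11) can be evaluated is given below; it is illustrative only (the names are ours; b0 must be supplied by the analyst as the LS estimate under the chosen orthogonal reference, and mu is the prior mean). The two returned components are the dispersion and mean effects whose decomposition was discussed above.

    import numpy as np

    def posterior_relative_entropy(X, y, b0, mu, rho, sigma2):
        # I(f_X : f_X0 | y) of eq. (11), returned as (dispersion effect, mean effect)
        p = X.shape[1]
        A = X.T @ X + rho * np.eye(p)
        beta_tilde = np.linalg.solve(A, X.T @ y + rho * mu)   # posterior mean, actual design
        beta_tilde0 = (b0 + rho * mu) / (1.0 + rho)           # posterior mean, orthogonal reference
        M = (1.0 + rho) * np.linalg.inv(A)
        dispersion = 0.5 * (np.trace(M) - np.linalg.slogdet(M)[1] - p)
        mean_term = (1.0 + rho) / (2.0 * sigma2) * np.sum((beta_tilde - beta_tilde0) ** 2)
        return dispersion, mean_term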
[Fig. 1. Collinearity Information Graph of the Air Pollution Data. Horizontal axis: prior to model precision ratio (ρ = σ²/τ², bottom scale) and prior standard deviation τ (top scale). Curves: percent reduction in loss of data information contributed to posterior analysis; percent reduction in posterior entropy relative to an orthogonal reference; nominal amount of data information contributed to posterior analysis (orthogonal); nominal amount of data information contributed to posterior analysis (actual).]
4.2. The data perturbation effects

In studying the computational aspects of collinearity, numerical analysts are concerned with how sensitive the magnitudes of solutions of a set of linear equations are to small changes in ill-conditioned data. Analogously, we can address the problem of the sensitivity of a posterior distribution to small changes in the regression matrix. Suppose X is perturbed by dX and X* = X + dX is obtained.
Then the sensitivity of the posterior distribution with respect to dX is measured by the extent to which the distribution is perturbed [Rao (1973)]. For simplicity of notation, we discuss this problem for the case of the diffuse prior. The perturbed posterior distribution then is N[b*, (X*'X*)⁻¹σ²], where b* = (X*'X*)⁻¹X*'y. The information discrepancy between f_X* and f_X measures the amount of alteration in the posterior distribution induced by dX. Using the posterior f(β|y) based on the actual data as the reference and evaluating (8) yields

    I(f_X* : f_X|y) = ½ [tr X'X(X*'X*)⁻¹ − log|X'X(X*'X*)⁻¹| − p] + δ'X'Xδ/2σ²,      (12)
where δ = b* − b. The quadratic term in (12) captures the effect of the perturbation dX on the mean of the posterior distribution; δ'X'Xδ/σ² is the generalized signal-to-noise-ratio effect of the perturbation dX. The corresponding effect on the posterior covariance structure is captured by the first term. The extension of (12) to the conjugate prior case is straightforward: for the precision ratio ρ, X'X and X*'X* in (12) are replaced by (X'X + ρIₚ) and (X*'X* + ρIₚ), respectively, and δ is adjusted accordingly. An analyst can use this relative entropy to determine the sensitivity of a posterior distribution containing various amounts of prior information to small changes in the regression matrix.

5. Application to the Air Pollution Data

We use the well-known Air Pollution Data to illustrate the use of the various information measures discussed above. The Air Pollution Data, originally analyzed by McDonald and Schwing (1973), consist of 60 sets of observations on 16 variables. The variables are fully described in that paper, and the correlation matrix, the means, and the standard deviations are reported there. Briefly, the data consist of cross-sectional observations on a mortality rate (dependent variable) and three groups of explanatory variables: weather, socioeconomic, and pollution. The Air Pollution Data have been used as an example in various regression analyses. All the previous analyses have focused on the correlation matrix of the regressors and have used the dependent variable in standardized form. We follow suit; so we are a priori assuming that information about the intercept is not of interest here. The purpose of our analysis is solely illustrative.

Table 1 shows the information indexes of the regression matrix for selected values of ρ. This table displays the incremental contributions of prior precision to the conditioning of X.
Table 1
Information indexes of the regression matrix for the Air Pollution Data.ᵃ

                       Prior to model precision ratio (ρ = σ²/τ²)
  i      λᵢ       0      0.005    0.020    0.100    0.200    0.500    1.000
  1    4.527    0.000    0.000    0.000    0.000    0.000    0.000    0.000
  2    2.755    0.248    0.248    0.247    0.241    0.235    0.217    0.193
  3    2.054    0.395    0.394    0.392    0.382    0.370    0.339    0.297
  4    1.349    0.605    0.604    0.600    0.581    0.558    0.500    0.428
  5    1.223    0.655    0.653    0.649    0.626    0.600    0.535    0.455
  6    0.960    0.775    0.773    0.767    0.737    0.702    0.618    0.518
  7    0.612    1.000    0.997    0.986    0.936    0.881    0.754    0.616
  8    0.473    1.129    1.125    1.111    1.044    0.975    0.821    0.661
  9    0.371    1.251    1.245    1.227    1.143    1.057    0.877    0.697
 10    0.216    1.521    1.510    1.479    1.342    1.215    0.974    0.757
 11    0.166    1.651    1.637    1.597    1.427    1.279    1.010    0.778
 12    0.127    1.785    1.766    1.714    1.506    1.335    1.040    0.795
 13    0.114    1.840    1.819    1.761    1.536    1.356    1.051    0.801
 14    0.046    2.295    2.244    2.116    1.728    1.478    1.110    0.832
 15    0.005    3.414    3.063    2.604    1.893    1.569    1.149    0.852

ᵃThe last row shows the information index of the regression matrix for each ρ: 0 ≡ orthogonal, ∞ ≡ perfectly collinear.
We see that the information spectrum slowly improves to uniformity as the prior precision improves. The indexes for the last two components improve much faster than the others. For ρ = 0.10 (τ² = 10σ²), the information number of X reduces to about 55% of its value for ρ = 0. [An information-theoretic method for component reduction is given in Soofi (1988).]

Table 2 shows the computed values of various information functions for selected values of the precision ratio ρ. Panel A of the table gives the amount of contribution of the sample information to the posterior analysis, as measured by J_X(β), and the sample information loss due to collinearity, as measured by JL_X(β). The percentage reduction in the data information loss is also shown in the table. Note that since J_X(β) is undefined for ρ = 0, the percentage reduction for the use of data information in the posterior analysis cannot be computed.

Panel B summarizes the results of a posterior entropy analysis relative to an orthogonal design. The first term of (11) is used for computing the effect of collinearity on the posterior dispersion matrix. The collinearity effect on the posterior mean is computed using the last term in (11). Note that for ρ = 0, the effect on the dispersion matrix is I(B : B⁰) = 117.12, which is the collinearity loss of information in the least squares regression. For computing the collinearity effect on the posterior mean, the prior mean is set to zero. For other values of ρ, the computed figures for the collinearity effect on the posterior mean and the corresponding percent reduction should be adjusted.
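As a check on these computations (our sketch; it takes the eigenvalues listed in table 1 as given), the entropy-difference quantities of section 3 can be reproduced directly. For ρ = 0.1 the sketch returns a sample information of about 13.6 and a collinearity loss of about 4.3, matching panel A of table 2, and a last information index of about 1.89, matching table 1.

    import numpy as np

    # Eigenvalues of the scaled X'X for the Air Pollution Data (table 1)
    lam = np.array([4.527, 2.755, 2.054, 1.349, 1.223, 0.960, 0.612, 0.473,
                    0.371, 0.216, 0.166, 0.127, 0.114, 0.046, 0.005])

    rho = 0.1
    delta = 0.5 * np.log((rho + lam[0]) / (rho + lam))            # information indexes
    sample_info = 0.5 * np.sum(np.log1p(lam / rho))               # J_X(beta), about 13.6
    loss = 0.5 * np.sum(np.log((rho + 1.0) / (rho + lam)))        # JL_X(beta), about 4.3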
Table 2
Contributions of prior precision to reductions in various collinearity effects on the Air Pollution Data.

                                  Prior to model precision ratio (ρ = σ²/τ²)
                              0      0.005    0.020    0.100    0.200    0.500    1.000

A. Entropy differences
Sample information            ∞      33.0     23.4     13.6     10.1      6.3      4.1
Collinearity loss            7.3      6.8      6.0      4.3      3.3      1.9      1.1
Percent reduction              0     6.3%    16.6%    39.9%    54.1%    73.4%    85.4%

B. Posterior entropy relative to orthogonal reference
Due to dispersion          117.12    64.83    31.96    10.76     5.97     2.36     1.05
Percent reduction              0    44.6%    72.1%    90.8%    94.9%    98.0%    99.1%
Due to mean                298.81   160.23   115.69    86.16    70.84    46.91    28.09
Percent reduction              0    46.4%    61.3%    71.2%    76.3%    84.3%    90.6%
Total                      415.93   225.01   147.65    96.92    76.81    49.27    29.13
Percent reduction              0    45.9%    64.5%    76.7%    81.5%    88.2%    93.0%
The error variance is a scaling factor in the effect of collinearity on the posterior mean. Whereas the nominal values of this effect shown in table 2 depend on σ², the corresponding percent reduction in the collinearity loss does not depend on a particular value of σ². In computing the nominal values, we used the mean squared error (s² = 0.0052) for σ². The overall effect of collinearity on the posterior distribution is computed as the sum of the effects due to the dispersion and due to the mean.

The percentages of reduction in the collinearity effects shown in table 2 indicate that the incremental contribution of the prior precision to the reduction of the data information loss, as measured by JL_X(β), is generally less than the prior's contribution to the reduction in the posterior entropy relative to an orthogonal reference design. With a prior variance of τ² = 200σ², the posterior relative entropy decreases about 46% from that found for the noninformative prior case, whereas the corresponding figure for the reduction in the data information loss is about 6%. However, as the prior precision increases, the difference between the contribution of the prior to the reduction of the collinearity effects on the data information and on the posterior relative entropy narrows [also see fig. 1]. The last column of table 2 shows that when τ² = σ², the posterior distribution is almost orthogonalized. Therefore, if an analyst can formulate a prior distribution as precisely as the postulated regression model for the Air Pollution Data, the collinearity effects on the posterior distribution almost disappear.

Fig. 1 shows the Collinearity Information Graph for the Air Pollution Data. On the horizontal axis are shown the prior standard deviation (scale at the top) and the precision ratio (scale at the bottom).
On the vertical axis are shown the contribution of data information to the posterior analysis (scale at the left) and the percent reduction in collinearity effects (scale at the right). The graph includes four curves displaying: (a) the percent reduction in the collinearity effect on the sample information, (b) the percent reduction in the posterior entropy relative to the orthogonal reference, (c) the amount of sample information that would have been available for the posterior analysis if the regressors were orthogonal, and (d) the amount of information contributed to the posterior analysis by the actual data. The values for the first two curves are to be read on the scale shown on the right side and the values for the last two on the (nominal) scale shown on the left side. The Collinearity Information Graph indicates a sharp decrease in the nominal values of J_X(β) and J_X⁰(β), and in the percent reduction of the posterior relative entropy, for values of ρ < 0.02. In this range the percent reduction in JL_X(β) is high, but not as high as the other three quantities. We also observe that for ρ > 0.4 all four information curves almost settle.

Remark 5.1. McDonald and Schwing (1973) apply ridge regression to the Air Pollution Data. If it is assumed that μ = 0, then the posterior mean corresponds to the ridge estimate defined by b(ρ) = (X'X + ρIₚ)⁻¹X'y. In ridge regression, ρ is referred to as the ridge parameter and is to be chosen using one of several available methods. In a ridge regression analysis, in addition to the ridge trace, graphs of the squared lengths of the coefficients, ‖b(ρ)‖² and ‖b⁰(ρ)‖², and the sum of squares of the regression errors, ‖y − Xb(ρ)‖², as functions of ρ are presented and discussed. The reduction in the squared length of the ridge coefficients based on the actual data and on the orthogonal reference and the increase in the sum of squares of residuals as compared with the least squares regression are usually reported. McDonald and Schwing use the ridge trace proposed by Hoerl and Kennard (1970) and select ρ = 0.20. They report a 17% increase in the sum of squares of error for ρ = 0.20 as compared with ρ = 0, a 77% decrease in ‖b(ρ)‖², and only a 9% decrease in ‖b⁰(ρ)‖² for ρ = 0.05 as compared with ρ = 0, etc. The information measures proposed above provide a ridge analyst with additional diagnostics. For example, as compared with the least squares estimation, for ρ = 0.20 the loss of sample information due to collinearity decreases about 54%; the effect of nonorthogonality on the coefficient estimates decreases about 76%; the collinearity effect on the covariance structure of the ridge estimator decreases about 95%; etc. Furthermore, it is seen that ρ = 0.20 translates into a prior standard deviation of about 0.16.
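A minimal sketch of the ridge computations referred to in this remark, assuming standardized data, a zero prior mean, and numpy arrays X and y (the names and layout are ours):

    import numpy as np

    def ridge_estimate(X, y, rho):
        # b(rho) = (X'X + rho*I)^(-1) X'y, the posterior mean when the prior mean is zero
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + rho * np.eye(p), X.T @ y)

    def ridge_summaries(X, y, rho):
        b_rho = ridge_estimate(X, y, rho)
        return np.sum(b_rho ** 2), np.sum((y - X @ b_rho) ** 2)   # ||b(rho)||^2, residual sum of squares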
6. Summary

The importance of formalizing the notion of information in collinearity analysis is recognized. Shannon's entropy is adapted for quantifying information about the regression coefficients. When no prior knowledge about the parameters is utilized, the entropy analysis directly leads to logarithmic transformations of the well-known diagnostics available for least squares regression, namely the determinant of X'X, the condition indexes, and the condition number of X. It is noted that the results of the entropy analysis confirm, in a formal information-theoretic framework, an interpretation given to the condition number in the recent literature. It is also noted that the logarithmic information measure points in the right direction in responding to some queries expressed about the determinant in the recent literature.

When some prior knowledge about the parameters is utilized, the entropy differences lead to a measure for quantifying the loss of data information due to collinearity. This measure of information loss is a function of the precision ratio ρ = σ²/τ², and hence a function of the prior variance τ². It is shown that this information function provides a diagnostic for quantifying the prior information required to compensate for a given amount of collinearity effect on the data information.

Cross-entropy analysis is suggested to quantify the effects of collinearity on the least squares estimation and on the posterior distributions. The entropy of the sampling distribution of the LS estimate relative to an orthogonal reference design results in a synthesis of diagnostics developed in the classical approach, namely the variance inflation factor and the predictability power of the LS estimate. The entropy of the posterior distribution relative to an orthogonal reference provides a diagnostic for measuring the effect of collinearity on the posterior distribution. This information function decomposes into two additive components: one measures the effect of collinearity on the posterior dispersion matrix and the other quantifies the collinearity effect on the posterior mean. This diagnostic provides a metric for measuring the incremental contribution of prior precision to the reduction of the collinearity effects on the posterior distributions. It is also pointed out that the sensitivity of a posterior distribution to small perturbations in the data can be measured in terms of a cross-entropy.

A graphical display of the information quantities developed in this paper is suggested. The Collinearity Information Graph provides a comprehensive summary of the tradeoffs between acquisition of prior information and reduction of collinearity effects. The Air Pollution Data are used as an illustrative example. Various information quantities are computed and the Collinearity Information Graph is displayed. It is noted that there is a substantial gain in the reduction of the collinearity effects for precision ratios less than 0.2 and very little improvement for ρ > 0.4. For a prior distribution formulated as precisely as the regression model (ρ = 1), the posterior distribution is almost orthogonalized.
Appendix: Proof of Properties 4.3-4.5
Without loss of generality we simplify the proofs by setting μ = 0. (Otherwise let θ = β − μ and replace the corresponding posterior mean of β by that of θ.) Let Λ = diag(λ₁, ..., λₚ) and V = [v₁, ..., vₚ] be the matrices of eigenvalues and eigenvectors of X'X, respectively, and let W = XV. Then

    β̃ − β̃⁰ = [(X'X + ρIₚ)⁻¹ − (1 + ρ)⁻¹Iₚ] X'y
            = V[(Λ + ρIₚ)⁻¹ − (1 + ρ)⁻¹Iₚ] W'y.

This gives

    ‖β̃ − β̃⁰‖² = Σᵢ₌₁ᵖ [1/(λᵢ + ρ) − 1/(1 + ρ)]² (wᵢ'y)²,

where the wᵢ's are the columns of W. Evaluating (11) in terms of the eigenvalues, we have

    I(f_X : f_X⁰|y) = ½ Σᵢ₌₁ᵖ [(1 + ρ)/(λᵢ + ρ) − log((1 + ρ)/(λᵢ + ρ)) − 1]
                      + [(1 + ρ)/2σ²] Σᵢ₌₁ᵖ [1/(λᵢ + ρ) − 1/(1 + ρ)]² (wᵢ'y)².      (A.1)

Property 4.3 is seen by noting that (A.1) is defined when some of the eigenvalues are zero.
Taking the derivative with respect to ρ in (A.1), after some simplification we find that
with equality if and only if X, = 1 for all i = 1,. . . , p, i.e., X is orthogonal. This verifies Property 4.4. Property 4.5 is seen by noting that for p = 0, p = h and p” = b”, and using Property 4.4. References Belsley, D.A., 1982, Assessing the presence of harmful collinearity and other forms of weak data through a test for signal-to-noise, Journal of Econometrics 20, 211-253. Belsley, D.A., E. Kuh, and R.E. Welsch, 1980, Regression diagnostics: Identifying influential observations and sources of collinearity (Wiley, New York, NY). Farrar, D.E. and R.R. Glauber. 1967, Multicollinearity in regression analysis: The problem revisited, Review of Economics and Statistics 49, 92-107. Hoerl, A.E. and R.W. Kennard, 1970, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12, 55-67. Jaynes, E.T., 1968, Prior probabilities, IEEE Transactions on Systems Science and Cybernetics SSC-4, 227-241. Jeffreys, H.. 1946, An invariant form for the prior probability in estimation problems, Proceedings of the Royal Statistical Society (London) A 186, 453-461. Kullback. S., 1959. Information theory and statistics (Wiley, New York, NY). Kullback. S. and R.A. Leibler, On information sufficiency, Annals of Mathematical Statistics 22, 79-86. Learner, E.E., 1973, Multicollinearity: A Bayesian interpretation, Review of Economics and Statistics 55. 371-380. Lindley, D.V., 1955, On a measure of information provided by an experiment, Annals of Mathematical Statistics 27, 986-1005. McDonald, G.C. and R.C. Schwing, 1973, Instabilities of regression estimates relating air pollution to mortality, Technometrics 15, 463-481. Marquardt, D.W.. 1970, Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation, Technometrics 12, 591-612. Press, S.J., 1982, Applied multivariate analysis: Using Bayesian and frequentist methods of inference (Robert E. Krieger, FL). Rao, C.R., 1973, Linear statistical inference and its applications (Wiley. New York. NY). Shannon, C.E., 1948, A mathematical theory of communication, Bell System Technical Journal 27, 379-423. Soofi, E.S., 1988, Principal component regression under exchangeability, Communications in Statistics - Theory and Method A 17, 1717-1733. Soofi, E.S. and D.V. Gokhale, 1986, An extension of Bayesian measure of information to regression, Communications in Statistics - Theory and Method A 15, 3607-3616. Stewart, G.W.. 1987, Collinearity and least squares regression, Statistical Science 2, 68-83.
Thisted, R.A., 1982, Decision-theoretic regression diagnostics, in: Statistical Decision Theory and Related Topics III, Vol. 2, 363-382.
Thisted, R.A., 1987, Comment: Collinearity and least squares regression, Statistical Science 2, 91-92.
Willan, A.R. and D.G. Watts, 1978, Meaningful multicollinearity measures, Technometrics 20, 407-412.
Zellner, A., 1971, An introduction to Bayesian inference in econometrics (Wiley, New York, NY).
Zellner, A., 1984, Basic issues in econometrics (University of Chicago Press, Chicago, IL).