
Statistics & Probability Letters 1 (1982) 17-22 North-Holland Publishing Company

July 1982

Some Recent and New Results on the Maximum Entropy Distribution *

Henri Theil
Graduate School of Business, University of Florida, Gainesville, FL 32611, U.S.A.

Received February 1982; revised version received April 1982

Abstract. This letter concerns the maximum entropy (ME) distribution proposed by Theil and Laitinen (1980); it summarizes some of the results obtained with this distribution and it explores the conditioning of the ME covariance matrix as the number of variables increases.

Keywords. Covariance matrix estimation, density estimation, entropy, ridge regression.

1. Introduction

The maximum entropy (ME) distribution was originally proposed by Theil and Laitinen (1980) for the estimation of the covariance matrix of a large number of variables. When this number exceeds the sample size, the sample covariance matrix is singular, but the ME covariance matrix is always positive definite. It later appeared that the ME distribution is also useful in other respects, because it provides more accurate estimates (for small samples) of certain parameters of continuous distributions by exploiting this continuity. The objective of this letter is to summarize some of the results obtained and to explore the effect of an increasing number of variables on the ME covariance matrix.

2. The univariate ME distribution

Let (x_1, …, x_n) be a sample from a continuous distribution with range (−∞, ∞). Order statistics are indicated by superscripts: x^1 < x^2 < ⋯ < x^n. Intermediate points between successive order statistics are defined as ξ_i = ξ(x^i, x^{i+1}), where ξ(·) is a symmetric differentiable function of its arguments whose value is not outside the range defined by these arguments. Each pair of successive ξ's yields an interval, I_i = (ξ_{i−1}, ξ_i) for i = 2, …, n − 1, while there are two open-ended intervals, I_1 = (−∞, ξ_1) and I_n = (ξ_{n−1}, ∞). So there are n intervals, I_1, …, I_n, each containing a fraction 1/n of the mass of the sample distribution. An estimated density function f(·) is constructed which preserves these fractions, ∫_{I_i} f(x) dx = 1/n for i = 1, …, n (a mass-preserving constraint). An analogous mean-preserving constraint is also imposed (for details, see Theil and Laitinen, 1980). Subject to these constraints, f(·) is derived by maximizing its entropy,

  −∫_{−∞}^{∞} f(x) log f(x) dx.

This criterion yields a unique solution with the following characteristics. The intermediate points become midpoints between successive order statistics, ξ_i = ½(x^i + x^{i+1}). The ME distribution consists of n component distributions, two of which are exponential over the open-ended intervals I_1 and I_n, while the others are uniform over the bounded intervals I_2, …, I_{n−1}. The interval means

* This research was supported in part by NSF Grant SES8023555.

0167-7152/82/0000-0000/$02.75 © 1982 North-Holland


of the ME distribution, E(X | X ∈ I_i) or more simply ζ_i, are equal to ½(ξ_{i−1} + ξ_i), where ξ_0 = x^1 and ξ_n = x^n; that is, these means are secondary midpoints (midpoints between midpoints). The ME variance is smaller than the sample variance; the amount of shrinkage is

  (1/(4n)) Σ_{i=1}^{n−1} (x^{i+1} − x^i)² + (1/(24n)) Σ_{i=2}^{n−1} (x^{i+1} − x^{i−1})².   (1)

Theil and O'Brien (1980) and O'Brien (1980) considered the median and quartiles of the ME distribution as estimators of the corresponding population values for random samples from a normal distribution. The ME median and quartiles have smaller expected squared error than their sample counterparts for small n, thus illustrating the small-sample efficiency gain obtained by exploiting the knowledge that the parent distribution is continuous. Theil and Kidwai (1981a) obtained analogous results, based on a simulation experiment, for moments up to the fourth order and fourth cumulants of a normal distribution.

Theil and Lightburn (1981) extended the ME distribution to the case in which the parent distribution has range (0, ∞). The main difference is that the ME distribution over (0, ξ_1) is now truncated exponential. Another extension, from Theil (1980), refers to the case in which the parent density function is symmetric. A symmetric ME (SYME) distribution can then be constructed by symmetrizing the order statistics. For random samples from a normal distribution, the asymptotic variance of the SYME quartiles is about 13 percent below that of the ME quartiles, thus illustrating that the gain obtained from exploiting the symmetry of the parent distribution is not confined to small samples. Symmetrized order statistics are defined as x̃^i = x̄ + ½(x^i − x^{n+1−i}) for i = 1, …, n, where x̄ is the sample mean. This definition satisfies the least-squares criterion of minimizing Σ_i (x̃^i − x^i)² subject to the constraint that x̃^1, …, x̃^n be located symmetrically around x̄. An alternative method of estimating a symmetric distribution, consisting of 'doubling the sample', was proposed by Schuster (1975); work is under way to assess this method in the ME context.
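The univariate construction can be checked numerically. The sketch below is my own NumPy illustration, not code from the paper: it evaluates the shrinkage (1) and the ME variance (3) as printed here, and the two are consistent with the identity sample variance = ME variance + shrinkage.

```python
import numpy as np

def me_shrinkage(x):
    """Shrinkage of the ME variance below the sample variance, eq. (1):
    (1/4n) sum_{i=1}^{n-1} (x^(i+1) - x^(i))^2
    + (1/24n) sum_{i=2}^{n-1} (x^(i+1) - x^(i-1))^2."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    d1 = np.diff(x)              # successive spacings x^(i+1) - x^(i)
    d2 = x[2:] - x[:-2]          # double spacings x^(i+1) - x^(i-1)
    return (d1**2).sum() / (4 * n) + (d2**2).sum() / (24 * n)

def me_variance(x):
    """ME variance via eq. (3): variance of the secondary midpoints
    plus the within-interval (ridge) terms."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # midpoints xi_0..xi_n with xi_0 = x^1 and xi_n = x^n
    xi = np.concatenate(([x[0]], 0.5 * (x[:-1] + x[1:]), [x[-1]]))
    zeta = 0.5 * (xi[:-1] + xi[1:])    # secondary midpoints zeta_1..zeta_n
    dxi = np.diff(xi)                  # xi_i - xi_{i-1}, i = 1..n
    return ((zeta - x.mean())**2).sum() / n \
        + (dxi**2).sum() / (12 * n) \
        + (dxi[0]**2 + dxi[-1]**2) / (6 * n)
```

For the sample (0, 1, 2), for instance, (1) gives 2/9 and (3) gives 4/9; their sum is the sample variance 2/3, and the ME variance is indeed the smaller quantity.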


3. The multivariate ME distribution

Let (x_k, y_k) for k = 1, …, n be a sample from a bivariate continuous distribution with range (−∞, ∞). We write (x_k, y_k) = (x^i, y^j) to indicate that the kth sample element contains the ith order statistic of the first variable and the jth of the second. The n intervals of the univariate ME distribution now become a rectangular grid consisting of n² cells. Only n cells contain one observation each; the n² − n other cells contain no observation. Hence the mass-preserving constraint requires that the former cells be assigned mass 1/n and the latter zero mass. Within each cell, the bivariate entropy is maximized under stochastic independence. The result is that, within each cell with mass 1/n, the bivariate ME distribution is the product of two uniform or two exponential distributions or of one uniform and one exponential distribution. The extension to p-variate distributions is straightforward.

The ME covariance equals the covariance of the secondary midpoints,

  (1/n) Σ_{k=1}^{n} (x̃_k − x̄)(ỹ_k − ȳ),   (2)

where (x̄, ȳ) are the sample means and (x̃_k, ỹ_k) equals the secondary midpoint pair (x̃^i, ỹ^j) rearranged according to the original sample values, with ỹ^j = ½(η_{j−1} + η_j) and η_j = ½(y^j + y^{j+1}) (analogous to x̃^i = ½(ξ_{i−1} + ξ_i) and ξ_i = ½(x^i + x^{i+1})). The ME variance can be obtained by subtracting the shrinkage shown in (1) from the sample variance, but it can also be calculated as

  (1/n) Σ_{k=1}^{n} (x̃_k − x̄)² + (1/(12n)) Σ_{i=1}^{n} (ξ_i − ξ_{i−1})² + (1/(6n)) [(ξ_1 − ξ_0)² + (ξ_n − ξ_{n−1})²].   (3)

A comparison of (2) and (3) shows that the 2 × 2 ME covariance matrix equals C + D, where C is the covariance matrix of the secondary midpoints and D is a diagonal ridge matrix whose typical diagonal element equals the sum of the last two terms in (3). The C + D formulation also applies to the covariance matrix of a p-variate ME distribution. The ridge D is responsible for the positive definiteness of the ME covariance matrix.
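The C + D decomposition can be exercised directly when p exceeds n. The following sketch is my own illustration (me_cov is a hypothetical helper name): it forms the secondary midpoints of each variable, their covariance matrix C as in (2), and the diagonal ridge D from the last two terms of (3); a Cholesky factorization then confirms positive definiteness even when n < p makes the sample covariance matrix singular.

```python
import numpy as np

def me_cov(X):
    """ME covariance matrix C + D of an n x p data matrix X.
    C is the covariance matrix of the secondary midpoints (eq. (2));
    D is diagonal with the last two terms of eq. (3) per variable."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    M = np.empty_like(X)          # secondary midpoints, in sample order
    ridge = np.empty(p)
    for j in range(p):
        order = np.argsort(X[:, j])
        xs = X[order, j]
        xi = np.concatenate(([xs[0]], 0.5 * (xs[:-1] + xs[1:]), [xs[-1]]))
        M[order, j] = 0.5 * (xi[:-1] + xi[1:])
        dxi = np.diff(xi)
        ridge[j] = (dxi**2).sum() / (12 * n) \
            + (dxi[0]**2 + dxi[-1]**2) / (6 * n)
    C = np.cov(M, rowvar=False, bias=True)   # divide by n, as in (2)
    return C + np.diag(ridge)
```

With n = 10 observations on p = 25 variables the sample covariance matrix has rank at most 9, yet the Cholesky factorization of me_cov(X) succeeds.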


This ridge suggests a relationship to ridge regression (see Hoerl and Kennard, 1970). However, there is an important difference between the ridge of ridge regression and the ridge of the ME covariance matrix in that the former involves an arbitrary choice as to the size of the ridge, whereas the latter ridge is implied by the ME criterion subject to mass- and mean-preserving constraints.

Haque and Meisner (1980) and Kidwai and Theil (1981) considered the ridge and the shrinkage of the ME variance for random samples from a normal distribution. When n increases, the means and the standard deviations of the ridge and the shrinkage are approximately proportional to n^{−1.3}; hence the ridge and the shrinkage converge in probability to zero as n → ∞. Meisner (1980) and Theil and Kidwai (1981b) compared the ME and sample correlation coefficients for random samples from a bivariate normal population. The root-mean-square error (RMSE) of the ME correlation is below that of the sample correlation when the parent correlation ρ takes moderate values, but the opposite holds when |ρ| is large. The less satisfactory performance of the ME correlation for large |ρ| results from the ridge, which prevents the ME correlation from being very close to 1. However, when the RMSE criterion is applied to the Fisher transform of the ME and sample correlations, ME is superior as long as |ρ| ≤ 0.95.

For the ME covariance matrix of order p × p, Fiebig (1981) used the loss function

  (1/p) tr(Σ̂Σ^{−1}) − (1/p) log det(Σ̂Σ^{−1}) − 1,   (4)

where Σ̂ is an estimator of Σ, which is specified as

  Σ = (1 − ρ)I + ριι′, where ι′ = [1 … 1].   (5)

Fiebig generated an n × p matrix X whose rows are independent pseudo-normal vectors with zero mean and covariance matrix of the equicorrelated form (5). Then (1/n)X′[I − (1/n)ιι′]X is the ML estimate of Σ, which has an infinite loss (4) for p ≥ n because the determinant in (4) then vanishes, but the ME estimate C + D always has a finite loss (4). Fiebig computed average losses over 100 trials for n = 21 and various values of p and ρ. These averages do not show much difference between ML and ME as long as p and ρ take modest values, but when p increases the ME estimator tends to dominate the ML estimator, whereas the opposite holds at ρ = 0.99 and modest values of p. The ME covariance matrix is particularly attractive when the number of variables is large and when singularity is to be avoided, as is the case under the loss function (4). Fiebig obtained similar favorable results when substituting the ME for the ML estimator in the Empirical Bayes formula which Haff (1980) proposed for the loss function (4). He also obtained analogous results for the ME covariance matrix with missing values derived by Conway and Theil (1980), who proved that when there are missing values for some or all variables, the ME criterion subject to mass- and mean-preserving constraints yields a second ridge in addition to that shown in (3) (see Theil, 1981, for further details).

4. The ME covariance matrix of an increasing number of variables

Let our problem be the estimation of the parameter γ in

  y_{α2} = γ y_{α1} + ε_α,   α = 1, …, n,   (6)

where ε_α is a random disturbance and the y's are two endogenous variables of an equation system of which (6) is one equation. There are K exogenous variables in the system, none of which occurs in (6). All consistent estimators of γ require the inverse of the sample covariance matrix of the exogenous variables. Thus, they break down when K > n, which is a serious problem since the number of exogenous variables of most economy-wide econometric models exceeds the sample size. This problem is aggravated by the dynamic character of most of these models, since it is customary to treat lagged endogenous and exogenous variables in the same way as current exogenous variables.

Theil and Meisner (1980) applied an ME version of the method of moments to the LIML estimator (LIML = limited-information maximum likelihood) of γ by systematically replacing all sample moments by ME moments. They conducted a simulation experiment, based on 100 trials, in which the true γ equals 1 and the ε's and the n values of the K exogenous variables are generated as independent pseudo-normal variates, with K equal to 40.
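The breakdown of sample moments for H ≥ n is easy to demonstrate: with n observations, the H × H moment matrix of the instrumental variables has rank at most n, so its inverse does not exist once H exceeds n. A small sketch with hypothetical pseudo-normal data (my own illustration, not the Theil-Meisner code):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 20, 40
Z = rng.standard_normal((n, K))     # n observations on K exogenous variables

ranks = {}
for H in (8, 16, 24, 40):           # first H variables used as instruments
    M = Z[:, :H].T @ Z[:, :H] / n   # H x H sample moment matrix
    ranks[H] = np.linalg.matrix_rank(M)
# ranks[H] equals min(H, n): the moment matrix is singular whenever H > n
```

This is the nonexistence problem visible in Table 1; replacing sample moments by ME moments (whose matrix is C + D, hence nonsingular) avoids it.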


Some of the results are summarized in Table 1 for n = 20 and n = 32 observations, based on the first H exogenous variables used as instrumental variables with H increasing in multiples of 4. The first two columns contain the arithmetic mean over 100 trials of the LIML estimates based on ME moments and their RMSE around γ = 1. The next two provide the corresponding results for the conventional LIML estimates (based on sample moments) and the last four contain the results for n = 32 rather than for n = 20. Bias is no problem because all means are close to 1.

The simplest solution to the problem of nonexistence of the LIML estimates based on sample moments is to use ME moments instead, and to use all K = 40 exogenous variables, thus avoiding the problem of how to select a subset of H ≤ K instrumental variables. In Table 1, for both n = 20 and n = 32, the RMSEs of the estimates based on sample moments explode when H approaches n. (The LIML estimates based on sample moments do not exist for H = n because the associated latent root is then undefined; for H > n they do not exist because of the singularity of the moment matrix of the H instrumental variables.) There is a similar (although weaker) tendency of the RMSEs of the estimates based on ME moments, but these RMSEs


decline when H increases beyond n. Since this 'hump' around the point of equality of the numbers of variables and observations has also been observed in other applications of the ME covariance matrix, it is worthwhile to pursue this matter here.

For fixed n we can standardize the variables so that the ridge matrix D equals I and the ME moment matrix equals C + I. Let λ_i be a positive latent root of C; hence λ_i + 1 is a root of C + I. Below we analyze the behavior of the λ_i's when the number of variables increases for fixed n. The result will be that their ME covariance matrix may indeed become more ill-conditioned as this number approaches n and then becomes better-conditioned as it increases further. Note that, unless n is very small, the units of the variables which yield D = I tend to be small so that the elements of C will be large. This implies that most of the λ_i's will also be large, so that the corresponding roots λ_i + 1 of the ME covariance matrix will be dominated by λ_i rather than 1.

We write the covariance matrix of the secondary midpoints of p + 1 variables as

  [ C   c ]
  [ c′  v ],   (7)

where C is p × p and [c′ v] contains the covariances and the variance of the secondary midpoints of an

Table 1
LIML estimates based on ME and sample moments

                20 observations                   32 observations
        ME moments       Sample moments    ME moments       Sample moments
H       Mean    RMSE     Mean    RMSE      Mean    RMSE     Mean    RMSE
4       1.025   0.180    1.028   0.179     1.015   0.153    1.015   0.152
8       1.037   0.164    1.044   0.173     1.012   0.135    1.012   0.134
12      1.042   0.173    1.050   0.180     1.010   0.135    1.011   0.136
16      1.048   0.181    1.051   0.315     1.013   0.140    1.015   0.143
20      1.044   0.190    -- a    -- a      1.014   0.139    1.013   0.141
24      1.050   0.185    -- a    -- a      1.016   0.157    1.018   0.171
28      1.041   0.163    -- a    -- a      1.020   0.185    1.022   0.228
32      1.041   0.161    -- a    -- a      1.021   0.192    -- a    -- a
36      1.041   0.161    -- a    -- a      1.021   0.161    -- a    -- a
40      1.041   0.161    -- a    -- a      1.007   0.131    -- a    -- a

a Not applicable.


added variable. There exists a vector w such that c = Cw and v = w′Cw + k, where k is positive (zero) when the matrix (7) is positive definite (singular), with k = 0 when p ≥ n − 1 and, usually, k > 0 for p < n − 1.

For scalar θ, 0 ≤ θ ≤ 1, consider

  A(θ) = [ C        C(θw)           ]
         [ (θw)′C   (θw)′C(θw) + k ],   (8)

so that A(1) is the covariance matrix of the secondary midpoints of all p + 1 variables, while A(0) equals the covariance matrix of the first p variables bordered by zeros except for the lower-right element. The latent roots of A(0) are k and those of C. The root λ_i(θ) of A(θ) is obtained from

  [ C − λ_i(θ)I   θCw                  ] [ z_i(θ) ]
  [ θw′C          θ²w′Cw + k − λ_i(θ) ] [ ζ_i(θ) ] = 0.   (9)

The vector to the left of the equality sign is a characteristic vector associated with λ_i(θ); z_i(θ) is a p-element vector and ζ_i(θ) a scalar. Clearly, λ_i(0) = λ_i, z_i(0) = z_i, ζ_i(0) = 0, where λ_i is a positive root of C and z_i is an associated characteristic vector. It is shown in Section 5 that

  dλ_i/dθ = 2ζ_i²(λ_i − k)/θ.   (10)

Thus, as θ increases from 0 to 1, all λ_i's exceeding k increase (at least, do not decrease) and all λ_i's below k decrease (at least, do not increase). For k > 0 (implying p < n − 1) the smallest positive root of C may be reduced when a variable is added. But k = 0 for p ≥ n − 1, so that (10) implies that none of the positive roots decreases. Since λ_i + 1 is a root of the ME covariance matrix, this should clarify the statement on conditioning made three paragraphs ago.

5. Derivations

Let μ and x be a latent root and an associated characteristic vector of a symmetric matrix A. Let A be subject to a symmetric displacement dA; then dμ = x′(dA)x is the implied displacement of μ. We interpret A as A(θ) of (8) and interpret μ as λ_i(θ) and x as the characteristic vector shown in (9), and use

  dA/dθ = [ 0     Cw      ]
          [ w′C   2θw′Cw ],

so that

  dλ_i/dθ = [z_i(θ)′  ζ_i(θ)] [ 0     Cw      ] [ z_i(θ) ]
                              [ w′C   2θw′Cw ] [ ζ_i(θ) ].   (11)

Using

  θw′Cz_i(θ) + [θ²w′Cw + k − λ_i(θ)]ζ_i(θ) = 0,

which follows from (9), we can simplify (11) to (10).
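The derivative (10) can also be verified numerically. Below is a sketch with hypothetical data (my own code, not from the paper): A(θ) of (8) is built for a random positive definite C, ζ_i(θ) is read off as the last component of the unit-length eigenvector returned by numpy.linalg.eigh, and (10) is compared with a central finite difference.

```python
import numpy as np

def A_of_theta(C, w, k, theta):
    """Bordered matrix A(theta) of eq. (8)."""
    c = theta * (C @ w)
    v = theta**2 * (w @ C @ w) + k
    return np.block([[C, c[:, None]],
                     [c[None, :], np.array([[v]])]])

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
C = B @ B.T                        # positive definite 4 x 4 block
w = rng.standard_normal(4)
k, theta, h = 0.3, 0.7, 1e-6

lam, vecs = np.linalg.eigh(A_of_theta(C, w, k, theta))
i = 2                              # pick any root lambda_i(theta)
zeta = vecs[-1, i]                 # scalar zeta_i(theta) of eq. (9)
analytic = 2 * zeta**2 * (lam[i] - k) / theta          # eq. (10)
numeric = (np.linalg.eigh(A_of_theta(C, w, k, theta + h))[0][i]
           - np.linalg.eigh(A_of_theta(C, w, k, theta - h))[0][i]) / (2 * h)
```

The two derivatives agree closely, and the sign of dλ_i/dθ is that of λ_i(θ) − k, as the text asserts.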

References

Conway, D. and H. Theil (1980), The maximum entropy moment matrix with missing values, Econom. Lett. 5, 319-322.
Fiebig, D.G. (1981), Evidence on the maximum entropy covariance matrix, Discussion Paper, Modelling Research Group, Department of Economics, University of Southern California.
Haff, L.R. (1980), Empirical Bayes estimation of the multivariate normal covariance matrix, Ann. Statist. 8, 586-597.
Haque, N. ul and J.F. Meisner (1980), The expected ridge and shrinkage of the maximum entropy variance under normality, Econom. Lett. 5, 241-244.
Hoerl, A.E. and R.W. Kennard (1970), Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12, 57-67.
Kidwai, S.A. and H. Theil (1981), Simulation evidence on the ridge and the shrinkage of the maximum entropy variance, Econom. Lett., to appear.
Meisner, J.F. (1980), A risk evaluation of the maximum entropy correlation coefficients, Econom. Lett. 5, 335-338.
O'Brien, P.C. (1980), The quartiles of the maximum entropy distribution, Econom. Lett. 6, 49-52.
Schuster, E.F. (1975), Estimating the distribution function of a symmetric distribution, Biometrika 62, 631-635.
Theil, H. (1980), The symmetric maximum entropy distribution, Econom. Lett. 6, 53-57.
Theil, H. (1981), The maximum entropy distribution, CEDS Discussion Paper 59, College of Business Administration, University of Florida.
Theil, H. and S.A. Kidwai (1981a), Moments of the maximum entropy and the symmetric maximum entropy distribution, Econom. Lett. 7, 349-353.
Theil, H. and S.A. Kidwai (1981b), Another look at the maximum entropy correlation coefficient, Econom. Lett., to appear.
Theil, H. and K. Laitinen (1980), Singular moment matrices in applied econometrics, in: P.R. Krishnaiah, ed., Multivariate Analysis Vol. V (North-Holland, Amsterdam), pp. 629-649.
Theil, H. and R.O. Lightburn (1981), The positive maximum entropy distribution, Econom. Lett., to appear.
Theil, H. and J.F. Meisner (1980), Simultaneous equation estimation based on maximum entropy moments, Econom. Lett. 5, 339-344.
Theil, H. and P.C. O'Brien (1980), The median of the maximum entropy distribution, Econom. Lett. 5, 345-347.