17 Small samples and large equation systems


E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 451-480


Small Samples and Large Equation Systems*

Henri Theil and Denzil G. Fiebig

1. Introduction

In econometrics (and in several other areas of applied statistics) it happens frequently that we face a system of equations rather than a single equation. For example, let a consumer select the quantities of N goods which maximize his utility function subject to his budget constraint. Then under appropriate conditions a system of demand equations emerges, each describing the consumption of one good in terms of income and all N prices. The number of coefficients in such a system is on the order of N², which is a large number (unless N is small) and which raises problems when we want to test hypotheses about these coefficients (see Section 2). Another example is the estimation of the coefficients of one equation which is part of a system of simultaneous equations. Here a problem arises when the system contains a large number of exogenous variables (see Section 3). One way of solving such problems is by (1) recognizing that the sample is drawn from a continuous distribution and (2) using this sample to fit a continuous approximation to the parent distribution. When this is done under the maximum entropy (ME) criterion subject to mass- and mean-preserving constraints, a continuous ME distribution emerges which is superior to the discrete sample distribution in a number of respects, particularly for small samples (see Sections 4 and 5). Subsequent sections show how the ME distribution can be used for the problems mentioned above.

2. How asymptotic tests can be misleading

Let t = 1, ..., n refer to successive observations and i, j = 1, ..., N to consumer goods. We consider a linear demand system,

$$y_{it} = \theta_i x_{0t} + \sum_{j=1}^{N} \pi_{ij} x_{jt} + e_{it}, \qquad (2.1)$$

*Research supported in part by NSF Grant SES-8023555. The authors are indebted to Sartaj A. Kidwai of the University of Florida for his research assistance.


where y_it is the consumption of good i, x_0t is total consumption, x_jt is the price of good j, e_it is a random error, and θ_i and π_ij are parameters, the π_ij's being known as Slutsky coefficients. Here we shall be interested in two hypotheses, viz., demand homogeneity,

$$\sum_{j=1}^{N} \pi_{ij} = 0, \qquad i = 1, \ldots, N, \qquad (2.2)$$

and Slutsky symmetry,

$$\pi_{ij} = \pi_{ji}, \qquad i, j = 1, \ldots, N. \qquad (2.3)$$

Details on these properties are provided in the Appendix. Summation of (2.1) over i = 1, ..., N yields x_0t = x_0t + Σ_i e_it (because Σ_i θ_i = 1, Σ_i π_ij = 0), which implies that the e_it's are linearly dependent. This problem can be solved by deleting one of the equations, say the Nth. We assume that (e_1t, ..., e_{N−1,t}) for t = 1, ..., n are independently and normally distributed with zero means and nonsingular covariance matrix Σ. Since (2.2) and (2.3) are linear in the π_ij's, the standard procedure for testing these hypotheses is an F test if Σ is known. However, Σ is typically unknown, in which case it is usual to replace Σ by S, the matrix of mean squares and products of LS residuals. Many such tests have yielded unexpected negative results; see, e.g., Barten (1969), Byron (1970), Christensen et al. (1975), Deaton (1974) and Lluch (1971). Laitinen (1978) conducted a simulation experiment in order to explore this problem. He constructed a model of the form (2.1) satisfying both (2.2) and (2.3), with the e_it's obtained as pseudo-normal variates with zero means and a known covariance matrix Σ. He used n = 31 observations and considered systems of N = 5, 8, 11 and 14 equations. Using the true Σ, he applied the F test of the homogeneity hypothesis (2.2). The upper left part of Table 1 shows the numbers of rejections out of 100 trials at the 5 and 1 percent significance levels; these numbers are satisfactorily close to 5 and 1, respectively. Laitinen also used the same samples to compute S and the associated test statistic; this amounts to a χ² test which is asymptotically (n → ∞) valid. The numbers of rejections shown in the upper middle part of Table 1 are much larger, particularly for large N. The results for the corresponding exact χ² test based on the true Σ (upper right part of Table 1) are far more satisfactory, thus strongly suggesting that the use of S rather than Σ is mainly responsible for the numerous rejections of homogeneity in the literature. Meisner (1979) conducted a similar simulation experiment for testing the symmetry hypothesis (2.3). His results, shown in the lower part of Table 1, indicate an analogous increasing bias toward rejecting the null hypothesis as N increases when S rather than Σ is used. The tests based on S fall under what is frequently referred to as Wald tests. See Bera et al. (1981) for similar results obtained with the asymptotically equivalent Lagrange multiplier and likelihood ratio tests.


Table 1
Rejections (out of 100 samples) of homogeneity and symmetry

                 Exact F tests      Asymptotic χ² tests    Exact χ² tests
                 based on Σ         based on S             based on Σ
                 5%      1%         5%      1%             5%      1%

Rejections of homogeneity
 5 goods          7       1         14       6              8       1
 8 goods          8       2         30      16              5       2
11 goods          5       2         53      35              5       1
14 goods          4       2         87      81              6       1

Rejections of symmetry
 5 goods          6       0          9       3              5       1
 8 goods          5       2         26       8              5       1
11 goods          6       2         50      37              4       3
14 goods          4       2         96      91              6       0

Laitinen (1978) proved that the exact distribution of the homogeneity test statistic based on S is Hotelling's T², which in this case is an F ratio whose denominator has n − 2N + 1 degrees of freedom. This illustrates the problem of homogeneity testing when N is not far below one-half the sample size n. The exact distribution of the symmetry test statistic based on S is a much more difficult issue because (2.3) is a cross-equation constraint. This is also the reason why symmetry-constrained estimation of (2.1) presents a problem when Σ is not known (see Section 7).
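The following sketch (in Python, with names of our own choosing) illustrates the point numerically. It assumes that the homogeneity statistic based on S follows Hotelling's T² with dimension N − 1 and n − N − 1 degrees of freedom, and uses the usual T²-to-F rescaling; that scaling constant is our assumption rather than a formula quoted from the text. Under that assumption it gives the actual size of the nominal 5 percent asymptotic χ² test for Laitinen's design.

```python
# A minimal sketch: actual rejection probability of the nominal 5% asymptotic
# chi-square test when the statistic is really Hotelling's T^2 (assumption:
# dimension N-1, degrees of freedom n-N-1, standard T^2 -> F rescaling).
from scipy import stats

n = 31                            # sample size used by Laitinen (1978)
for N in (5, 8, 11, 14):          # number of goods
    d, m = N - 1, n - N - 1       # assumed dimension and df of the T^2 statistic
    df2 = m - d + 1               # = n - 2N + 1, the F denominator df
    crit = stats.chi2.ppf(0.95, d)        # nominal 5% chi-square critical value
    scale = df2 / (d * m)                 # assumed T^2 -> F scaling constant
    true_size = stats.f.sf(scale * crit, d, df2)
    print(f"N = {N:2d}: actual size of the nominal 5% test = {true_size:.2f}")
```

Under these assumptions the actual size grows rapidly with N, in line with the pattern of the middle columns of Table 1.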

3. Simultaneous equation estimation from undersized samples

Our objective is to estimate the parameter γ in

$$y_{1t} = \gamma y_{2t} + e_t, \qquad t = 1, \ldots, n, \qquad (3.1)$$

which is one equation of a system that consists of several linear equations. The y's are observations on two endogenous variables of the system. The other equations contain certain endogenous variables in addition to these two, and also p exogenous variables; the observations on the latter variables are written x_1t, ..., x_pt. The sample moment matrix of these variables and those in (3.1) is thus of order (p + 2) × (p + 2),

$$\begin{bmatrix} m_{11} & m_{12} & \mathbf{m}_{1p}' \\ m_{12} & m_{22} & \mathbf{m}_{2p}' \\ \mathbf{m}_{1p} & \mathbf{m}_{2p} & M_p \end{bmatrix}, \qquad (3.2)$$

the three blocks of rows and columns corresponding to y_1t, y_2t and x_1t, ..., x_pt, respectively,

where m_1p, m_2p are p-element vectors and M_p is square (p × p).


The LS estimator of γ in (3.1) is then m_12/m_22; this estimator is biased and inconsistent because y_2t and e_t are correlated. However, we can obtain consistent estimators from the property that each exogenous variable is statistically orthogonal to the errors in (3.1) in the sense that, for h = 1, ..., p, (1/n)Σ_t x_ht e_t has zero probability limit as n → ∞. The conditional sample moment matrix of the two endogenous variables given the p exogenous variables is

$$\begin{bmatrix} m_{11.p} & m_{12.p} \\ m_{12.p} & m_{22.p} \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} \\ m_{12} & m_{22} \end{bmatrix} - \begin{bmatrix} \mathbf{m}_{1p}' \\ \mathbf{m}_{2p}' \end{bmatrix} M_p^{-1} \begin{bmatrix} \mathbf{m}_{1p} & \mathbf{m}_{2p} \end{bmatrix}. \qquad (3.3)$$

The k-class estimator of γ is defined as [see, e.g., Theil (1971, Chap. 10)]

$$\hat{\gamma}(k) = \frac{m_{12} - k\, m_{12.p}}{m_{22} - k\, m_{22.p}}, \qquad (3.4)$$

which includes the LS estimator m_12/m_22 as a special case (k = 0). It can be shown that, under standard conditions, γ̂(k) is consistent if k − 1 has zero probability limit and that n^{1/2}[γ̂(k) − γ] converges to a normal distribution with zero mean if n^{1/2}(k − 1) has zero probability limit. These conditions are obviously satisfied by k = 1, which is the case of two-stage least squares (2SLS). In Section 6 we shall meet a k-class estimator with a random k.

Equation (3.1) is quite special because it contains only two endogenous and no exogenous variables. The extension to more variables is straightforward [see, e.g., Theil (1971, Chaps. 9 and 10)], but it is not our main concern here. Our problem is that the matrix (3.3) does not exist when there are more exogenous variables than observations (p > n) because M_p is then singular. In fact, all standard methods of consistently estimating γ fail for p > n because they all require the inverse of M_p. Almost all present-day economy-wide econometric models have more exogenous variables than observations. The problem is even more pervasive due to the occurrence of lagged variables in dynamic equation systems. It is standard practice to treat each lagged variable in the same way as the exogenous variables are treated, which means that the 'dynamic' version of M_p is even more likely to be singular.

The irony of this problem is that we can reasonably argue that it should not be a problem. We can estimate γ by 2SLS from n = 20 observations when there are p = 10 exogenous variables in the system, but not when p = 30 (because M_p is then singular for n = 20). In the former case there are 10 variables known to be statistically orthogonal to the error vector (e_1, ..., e_n) of (3.1), in the latter there are 30 such variables; a priori one would expect that the estimation of (3.1) is improved (at least, not hurt) when there are more orthogonality conditions on its error vector. In Section 6 we shall consider this matter further.
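As an illustration, the following Python sketch computes the k-class estimator (3.4) directly from the moments of the matrix (3.2); the function and variable names are ours, and the moments are taken about zero because (3.1) contains no constant term.

```python
# A minimal sketch of the k-class estimator (3.4); k = 0 gives LS, k = 1 gives 2SLS.
import numpy as np

def k_class(y1, y2, X, k):
    """y1, y2: n-vectors of the two endogenous variables; X: n x p matrix of the
    exogenous variables of the system; returns gamma_hat(k) of equation (3.4)."""
    n = len(y1)
    m12, m22 = y1 @ y2 / n, y2 @ y2 / n
    m1p, m2p = X.T @ y1 / n, X.T @ y2 / n
    Mp_inv = np.linalg.inv(X.T @ X / n)          # fails when M_p is singular (p > n)
    # conditional moments of equation (3.3)
    m12_p = m12 - m1p @ Mp_inv @ m2p
    m22_p = m22 - m2p @ Mp_inv @ m2p
    return (m12 - k * m12_p) / (m22 - k * m22_p)

# usage: gamma_ls = k_class(y1, y2, X, 0.0); gamma_2sls = k_class(y1, y2, X, 1.0)
```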


4. The ME distribution of a univariate sample

The previous discussion illustrates difficulties with sample moment matrices: S in Section 2, M_p in Section 3. Here and in Section 5 we shall seek a solution under the condition that the relevant random variables are continuously distributed. Our strategy will be to use the sample to fit a continuous distribution as an estimate of the parent distribution, and to compute moments (and other characteristics) from this fitted distribution. Such moments are population-moment estimators which are alternatives to the ordinary sample moments.

4.1. The ME principle and the univariate ME distribution

The principle of maximum entropy (ME) states that, given some information on the parent distribution of a random variable, the fitted distribution should be most uninformative subject to the constraints imposed by the prior information. To do otherwise would imply the use of information that is not available. The criterion of uninformativeness used in information theory is the entropy, which is minus the expectation of the logarithm of the density function. Specifically, the ME criterion maximizes

$$H = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx \qquad (4.1)$$

by varying the density function f(·) subject to certain constraints. If all that is known is that the variable is continuous with a finite range (a, b), the ME distribution is the uniform over this interval. If we know the mean of a positive continuous random variable (but nothing else), the ME distribution is the exponential with this mean.¹ These results were used by Theil and Laitinen (1980) to construct an estimated distribution function from a sample (x_1, ..., x_n).² They used order statistics, written here with superscripts, x¹ < x² < ... < xⁿ, and defined intermediate points between successive order statistics,

$$\xi_i = \xi(x^i, x^{i+1}), \qquad i = 1, \ldots, n-1, \qquad (4.2)$$

where ξ(·) is a symmetric differentiable function of its two arguments whose value is between these arguments. These ξ_i's define two open-ended intervals, I_1 = (−∞, ξ_1) and I_n = (ξ_{n−1}, ∞), and n − 2 bounded intervals, I_2 = (ξ_1, ξ_2), ..., I_{n−1} = (ξ_{n−2}, ξ_{n−1}).

¹For other ME properties of the exponential family, see Kagan et al. (1973).
²See Theil and Fiebig (1984) for a survey containing many other results as well as proofs of the statements which follow.


Each I_i contains one order statistic x^i and, hence, a fraction 1/n of the mass of the sample distribution. We impose on the density function f(·) which will be fitted that it preserves these fractions,

$$\int_{I_i} f(x)\, dx = \frac{1}{n}, \qquad i = 1, \ldots, n, \qquad (4.3)$$

which is a mass-preserving constraint. We also impose an analogous mean-preserving constraint, referring both to the overall mean (the sample mean x̄) and to the means in each interval I_i.³ Thus, our constraints refer to moments of order zero and one. Subject to these constraints we seek the density f(·) which maximizes the entropy (4.1). The solution is unique for n > 2; it implies that the intermediate points (4.2) become midpoints between successive order statistics, ξ_i = ½(x^i + x^{i+1}), and that f(·) is constant in each bounded I_i and exponential in I_1 and I_n. Thus, the associated cdf is continuous and monotone increasing, and it is piecewise linear around each x^i except around x¹ and xⁿ where it is exponential. We shall refer to this fitted distribution as the ME distribution of a univariate sample or, more briefly, as the univariate ME distribution.

It will be convenient to extend ξ_i = ½(x^i + x^{i+1}) to i = 0, 1, ..., n, where x⁰ = x¹ and x^{n+1} = xⁿ, so that ξ_0 = x¹ and ξ_n = xⁿ. These ξ's are referred to as the primary midpoints. The interval means of the ME distribution, written x̄¹, ..., x̄ⁿ, are given by

$$\bar{x}^i = \tfrac{1}{2}(\xi_{i-1} + \xi_i), \qquad i = 1, \ldots, n, \qquad (4.4)$$

which will be called the secondary midpoints.

4.2. Applications

Given that the density picture of the ME distribution is so simple (piecewise constant or exponential), it is straightforward to evaluate its variance and higher moments. For example, the variance of the ME distribution (the ME variance) equals

$$\frac{1}{n}\sum_{k=1}^{n}(x_k - \bar{x})^2 - \frac{1}{4n}\sum_{i=1}^{n-1}(x^{i+1} - x^i)^2 - \frac{1}{24n}\sum_{i=2}^{n-1}(x^{i+1} - x^{i-1})^2. \qquad (4.5)$$

Since the first term is the sample variance and since the two others are negative, the ME variance is thus subject to shrinkage relative to the sample variance.

³Define the order statistics associated with each interval I_i as those which determine its end points: x¹ and x² for I_1 = (−∞, ξ_1) [see (4.2)], xⁿ and x^{n−1} for I_n, and x^{i−1}, x^i and x^{i+1} for I_i with 1 < i < n. The mean-preserving constraint on I_i requires that f(x) for x ∈ I_i be constructed so that the mean is a homogeneous linear function of the order statistics associated with I_i.


Kidwai and Theil (1981) showed that, under normality, this shrinkage is a random variable whose mean and standard deviation are both about proportional to n^{−1.3}. Simulation experiments with pseudo-normal variates indicate that the ME variance and third- and fourth-order moments about the mean are all more accurate (in the mean-squared-error sense) than the corresponding estimators derived from the discrete sample distribution. This difference reflects the efficiency gain obtained by exploiting the knowledge that the parent distribution is continuous. However, the difference converges to zero as n → ∞, implying that the efficiency gain is a small-sample gain. Fiebig (1982, Chap. 4) extended the simulation experiment to the estimation of the variances of fat-tailed mixtures of normal distributions. The fatter the tails for given n, the larger is the efficiency gain of the ME variance over the sample variance.

Since the ME distribution is formulated in terms of order statistics, it is natural to consider the quantiles of the ME distribution as estimators of the parent quantiles. Let n be odd and write m = ½(n + 1). Then the sample median is x^m, but the ME median is x̄^m, i.e. the median of the secondary midpoints. For random samples from a normal population, the ME median has a smaller expected squared sampling error than the sample median, but the relative difference tends to zero as n → ∞. Let n + 1 be a multiple of 4 and write q = ¼(n + 1). Then the sample quartiles are x^q and x^{3q}, whereas the ME quartiles are Q_L = ⅛x^{q−1} + ½x^q + ⅜x^{q+1} and Q_U = ⅜x^{3q−1} + ½x^{3q} + ⅛x^{3q+1} if q > 1.⁴ For random samples from a normal population, the ME quartiles have smaller expected squared errors. Again, the relative difference tends to zero as n → ∞, but this difference is still in excess of 10 percent for the interquartile distances Q_U − Q_L and x^{3q} − x^q at n = 39. Also, the ME median and quartiles dominate their sample distribution counterparts (under squared-error loss) in the presence of an outlier with a different mean or a different variance; see Theil and Fiebig (1984) for details.
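The following Python sketch, with function names of our own choosing, constructs the univariate ME distribution from a sample: the primary and secondary midpoints of (4.2) and (4.4), the ME variance computed as the between- plus within-interval variance [cf. (4.5) and expression (5.3) below], and ME quantiles read off the piecewise-linear part of the cdf. It is a sketch of the construction described above, not a definitive implementation.

```python
import numpy as np

def me_midpoints(x):
    """Return (xi, xbar): primary midpoints xi_0..xi_n and secondary midpoints (4.4)."""
    xs = np.sort(np.asarray(x, dtype=float))
    xi = np.concatenate(([xs[0]], (xs[:-1] + xs[1:]) / 2.0, [xs[-1]]))   # xi_0, ..., xi_n
    xbar = (xi[:-1] + xi[1:]) / 2.0                                      # interval means
    return xi, xbar

def me_variance(x):
    """ME variance = variance of the secondary midpoints + within-interval terms."""
    xi, xbar = me_midpoints(x)
    n = len(xbar)
    between = np.var(xbar)                       # mean of xbar equals the sample mean
    within = np.sum((xi[2:-1] - xi[1:-2]) ** 2) / (12 * n)   # uniform parts, I_2..I_{n-1}
    within += ((xi[1] - xi[0]) ** 2 + (xi[-1] - xi[-2]) ** 2) / (4 * n)  # exponential tails
    return between + within

def me_quantile(x, prob):
    """Quantile of the ME distribution; valid for prob in [1/n, 1 - 1/n], where the
    cdf is piecewise linear with F(xi_i) = i/n (the exponential tails are ignored)."""
    xi, _ = me_midpoints(x)
    n = len(xi) - 1
    return float(np.interp(prob, np.arange(n + 1) / n, xi))

x = np.random.default_rng(0).normal(size=31)
print(me_variance(x), np.var(x))                 # the ME variance shrinks the sample variance
print(me_quantile(x, 0.5), np.median(x))         # ME median versus sample median
```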

4.3. Extensions

The ME distribution is easily extended to bounded random variables. If the variable is positive, the only modification is that I_1 = (−∞, ξ_1) becomes (0, ξ_1) and that the distribution over this interval becomes truncated exponential. A different extension is in order when the parent distribution is known to be symmetric. The difference x^i − x̄ then has a sampling distribution identical to that of x̄ − x^{n+1−i} for each i. We define x̃^i − x̄ as the average of these differences, i.e.

$$\tilde{x}^i = \bar{x} + \tfrac{1}{2}(x^i - x^{n+1-i}), \qquad i = 1, \ldots, n. \qquad (4.6)$$

⁴Since the ME distribution has a continuous cdf, its median and quartiles are uniquely defined for each n. This is in contrast to the sample quantiles whose definitions for certain values of n can be made unique only by interpolation between order statistics.


Clearly, x̃¹, ..., x̃ⁿ are 'symmetrized' order statistics located symmetrically around the sample mean x̄. (Since the ME procedure is mean-preserving, x̄ is a natural point of symmetry.) The symmetric ME (SYME) distribution is then constructed from the x̃^i's in the same way that the ME distribution is obtained from the x^i's. An alternative justification of the definition (4.6) is that it satisfies the LS criterion of minimizing Σ_i (x̃^i − x^i)² for variations in the x̃^i's subject to the symmetry constraint x̃^i + x̃^{n+1−i} = 2x̄.⁵ SYME moments and quantiles can be used as estimators of the corresponding population values if the population is symmetric. Doing so amounts to exploiting the knowledge of symmetry in addition to continuity. For random samples from a normal distribution, the SYME quartiles are asymptotically more efficient than the ME and sample quartiles: as n → ∞, the sampling variance of the former is about 13 percent below that of the latter. This shows that there are situations in which the exploitation of symmetry yields a large-sample gain. (Recall that the ME efficiency gain, based on the exploitation of continuity, is a small-sample gain only.) Under normality, the SYME variance provides no reduction in mean squared error beyond that of ME (mainly because the SYME variance is subject to additional shrinkage), but Fiebig (1982) did obtain such reductions for fat-tailed symmetric mixtures of normal distributions.
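A minimal sketch of the symmetrizing step (4.6); the function name is ours. The SYME distribution is then obtained by applying the ME construction above to the symmetrized values.

```python
import numpy as np

def symmetrize(x):
    """Return the symmetrized order statistics of (4.6), located symmetrically
    around the sample mean."""
    xs = np.sort(np.asarray(x, dtype=float))
    return xs.mean() + 0.5 * (xs - xs[::-1])     # x-tilde^i = xbar + (x^i - x^{n+1-i})/2
```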

5. The ME distribution of a multivariate sample

5.1. The bivariate and multivariate ME distributions

Let (x_k, y_k) for k = 1, ..., n be a sample from a continuous bivariate population. Our objective is to use this sample in the construction of the joint density function which maximizes the bivariate entropy

$$H = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y) \log f(x, y)\, dx\, dy, \qquad (5.1)$$

subject to mass- and mean-preserving constraints. As in the univariate case, we start with order statistics and the intermediate points (4.2), but we do this now for both variables, yielding n intervals I_1, ..., I_n for x and n intervals J_1, ..., J_n for y. In the plane of both variables, we thus have n² rectangular cells, but since there are only n observations, n cells contain one observation each and n² − n cells contain no observations. The mass-preserving constraint states that the former cells are assigned mass 1/n and the latter zero mass.

⁵A different procedure for estimating a symmetric distribution, proposed by Schuster (1973, 1975), consists of 'doubling the sample'; i.e. associated with each sample element x_k is a value 2x̄ − x_k at equal distance from x̄ but on the opposite side, which yields an augmented sample of size 2n (symmetric around x̄) when these associated values are merged with a sample of size n. In a bivariate context, the value associated with (x_k, y_k) is (2x̄ − x_k, 2ȳ − y_k), yielding spherical symmetry. However, the simulation experiments by Theil, Kidwai, Yalnizoğlu and Yellé (1982) based on pseudo-normal variates indicate that this alternate form of symmetrizing is not very promising.


Maximizing (5.1) requires stochastic independence within each cell with mass 1/n. Each such cell falls under one of three groups: those which are bounded on all four sides, those which are open-ended on one side, and those which are open-ended on two sides. For the first group, the ME distribution within the cell is the bivariate uniform distribution; for the second, it is the product of the exponential (for the open-ended variable) and the uniform (for the other variable); for the third, it is the product of two exponentials.

The extension to the p-variate ME distribution is straightforward. There are then n^p cells, n of which contain one observation each and are assigned mass 1/n, while the n^p − n others are assigned zero mass. The ME distribution within each cell with mass 1/n is the product of p univariate distributions, each being either uniform or exponential. The cdf of this distribution is a continuous and nondecreasing function of its p arguments, and it is piecewise linear except for exponential tails.

5.2. The ME covariance matrix

The covariance of the bivariate ME distribution equals the covariance of the secondary midpoints,

$$\frac{1}{n}\sum_{k=1}^{n}(\bar{x}_k - \bar{x})(\bar{y}_k - \bar{y}), \qquad (5.2)$$

where (x̄_k, ȳ_k) for k = 1, ..., n are the secondary midpoint pairs rearranged in the order of the original sample elements (x_k, y_k). This rearrangement is indicated by the use of subscripts rather than superscripts [cf. (4.4)]. The ME variance was given in (4.5), but this variance can also be written in the form

$$\frac{1}{n}\sum_{k=1}^{n}(\bar{x}_k - \bar{x})^2 + \frac{1}{12n}\sum_{i=2}^{n-1}(\xi_i - \xi_{i-1})^2 + \frac{(\xi_1 - \xi_0)^2 + (\xi_n - \xi_{n-1})^2}{4n}, \qquad (5.3)$$

where the first term is the variance of the secondary midpoints.⁶ The two other terms are a weighted sum of squared differences between successive primary midpoints which is always positive. On combining (5.2) and (5.3) we find that the 2 × 2 ME covariance matrix takes the form C + D, where C is the covariance matrix of the secondary midpoints and D is a diagonal matrix with positive diagonal elements. This C + D formulation applies to the covariance matrix of any p-variate ME distribution. The diagonal matrix D serves as the ridge of the ME covariance matrix;⁷ this ridge ensures that the ME covariance matrix is always positive definite even when p ≥ n.

⁶Expression (5.3) is nothing but the variance decomposition of the univariate ME distribution between and within groups, the 'groups' being the intervals I_1, ..., I_n.
⁷This ridge formulation has a superficial similarity to ridge regression. The major difference is that the ridge of the ME covariance matrix is not subject to arbitrary choice but is uniquely determined by the ME criterion subject to mass- and mean-preserving constraints.
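The following sketch assembles the p-variate ME covariance matrix in its C + D form: C is the covariance matrix of the secondary midpoints (rearranged into the original sample order), and D is the diagonal ridge whose elements are the within-interval terms of (5.3). Function names are ours; ties and missing values (Section 5.3) are ignored in this sketch.

```python
import numpy as np

def secondary_midpoints_and_ridge(x):
    """Secondary midpoints in sample order (cf. (5.2)) and the within-interval
    ridge term of (5.3) for one variable."""
    idx = np.argsort(x)
    xs = x[idx]
    xi = np.concatenate(([xs[0]], (xs[:-1] + xs[1:]) / 2.0, [xs[-1]]))
    mids = np.empty(len(xs))
    mids[idx] = (xi[:-1] + xi[1:]) / 2.0         # undo the sort: subscript order
    ridge = (np.sum((xi[2:-1] - xi[1:-2]) ** 2) / 12.0
             + ((xi[1] - xi[0]) ** 2 + (xi[-1] - xi[-2]) ** 2) / 4.0) / len(xs)
    return mids, ridge

def me_covariance_matrix(X):
    """X: n x p data matrix; returns the p x p ME covariance matrix C + D."""
    n, p = X.shape
    mids = np.empty((n, p))
    ridge = np.empty(p)
    for j in range(p):
        mids[:, j], ridge[j] = secondary_midpoints_and_ridge(X[:, j].astype(float))
    centred = mids - mids.mean(axis=0)           # midpoint means equal the sample means
    C = centred.T @ centred / n                  # covariances of secondary midpoints (5.2)
    return C + np.diag(ridge)                    # positive definite even when p >= n
```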


The ME correlation ρ̂ is obtained by dividing the ME covariance by the square root of the product of the two corresponding ME variances. A simulation experiment based on 10,000 pseudo-binormal variates with correlation ρ indicates that ρ̂ has a smaller expected squared error than the sample correlation r for |ρ| ≤ 0.4, but that the opposite holds for |ρ| ≥ 0.6. The less satisfactory performance of ρ̂ for large |ρ| results from the ridge of the ME covariance matrix which prevents |ρ̂| from being close to 1. However, the picture is different when we evaluate the correlation estimators in terms of the squared errors of their Fisher transforms; then ρ̂ is superior to r for |ρ| ≤ 0.95.⁸

Fiebig (1982) generated pseudo-normal vectors consisting of p equicorrelated variates with zero mean and unit variance. He computed their ME and sample covariance matrices and applied different loss functions to both. The ME estimator has smaller expected loss than the sample estimator when p is not small and ρ not close to 1, whereas the opposite holds for ρ = 0.99 and small p. The latter result is again due to the ridge of the ME covariance matrix. Fiebig also amended Haff's (1980) empirical Bayes estimator of the covariance matrix by substituting the ME covariance matrix for the sample covariance matrix in Haff's formula. Simulations indicate that this is an improvement except when the population covariance matrix is close to singular or when the number of variables is small.

5.3. Ties and missing values

Ties have zero probability when the sample is drawn from a continuous distribution, but they can occur when the data are rounded. Let the ath and bth observations on x after rounding share the tth and (t + 1)st positions in ascending order:

$$x_a = x_b = x^t = x^{t+1}. \qquad (5.4)$$

Here we consider the bivariate ME distribution of x and y under the assumption that the y_k's are not tied and that x_a < x_b and x_a > x_b both have probability ½ before rounding. The appropriate procedure is to assign mass 1/2n to each of the four cells associated with the tie. The ME covariance formula (5.2) remains applicable if x̄_a and x̄_b are defined as

$$\bar{x}_a = \bar{x}_b = \tfrac{1}{2}(\bar{x}^t + \bar{x}^{t+1}), \qquad (5.5)$$

which means that the tie x_a = x_b is preserved in the form x̄_a = x̄_b.

⁸Since the ridge of the ME covariance matrix tends to push ρ̂ toward zero, this difference mainly results from the downward bias of r and the upward bias of the Fisher transform of r (for ρ > 0). In Theil, Kidwai, Yalnizoğlu and Yellé (1982) the simulation experiment is extended to the SYME correlation and also to the correlation of the spherically symmetric version mentioned in footnote 5. Only the last correlation estimator has some merits for particular values of ρ (around 0.95) under squared-error loss of the Fisher transform.


The univariate ME distribution is not affected by the tie (5.4), so that we can use (4.5) for the ME variance.⁹ However, it is of interest to also consider the effect of the tie on the variance formula (5.3), which contains x̄_k for k = a and k = b. It can be shown that, under the definition (5.5), a term must be added to (5.3) of the form (x^{t+2} − x^{t−1})²/32n, which amounts to an extra ridge (the 'tie ridge') of the ME covariance matrix in the presence of a tie. See Theil and Fiebig (1984) for further details.

Similar results hold for the multivariate ME distribution with missing values as analyzed by Conway and Theil (1980). Consider n observations on two variables; let n_1 values be known for one variable (n − n_1 are missing at random) and n_2 values for the other (n − n_2 are missing at random). The number of cells is then reduced from n² to n_1 n_2. The result for the ME covariance is that (5.2) is still applicable provided that x̄_k is interpreted as the sample mean x̄ when x_k is missing (similarly for ȳ_k). This does not mean that we act as if the missing x_k takes a particular value. No such value is assumed; the only thing needed for the ME covariance is a specification of x̄_k for missing x_k, and this specification is x̄_k = x̄, which follows directly from the ME principle subject to mass- and mean-preserving constraints under the assumption that the values which are missing are missing at random. When we apply x̄_k = x̄ for missing x_k to the variance formula (5.3), we must add an extra ridge (the missing-value ridge). This result is similar to that of the tie ridge and it is not surprising. Both ties and missing values make the sample less informative than it would be if there were no ties or missing values. Since the ME distribution is obtained by maximizing the entropy subject to constraints implied by the sample, we should expect that both missing values and ties yield an ME distribution closer to the independence case, and that is indeed what is shown by its covariance matrix.

6. Experiments in simultaneous equation estimation

Here we return to (3.1) and we consider the question of whether the ME approach can be useful when the sample is undersized.

6.1. The LIML estimator

Suppose that e_t in (3.1) and the error terms in the other equations of the system have a multinormal distribution. It is then possible to apply the maximum likelihood method, which yields a k-class estimator known as LIML.¹⁰

⁹For t = 1 and t = n − 1, (5.4) is an extremal tie which implies that the exponential distribution over I_1 or I_n collapses, all mass being concentrated at the tied point. This also holds for a multiple tie, x_a = x_b = x_c = x^t = x^{t+1} = x^{t+2}. In both cases the ME distribution becomes mixed discrete/continuous, but the validity of the variance formula (4.5) is not affected.
¹⁰LIML = limited-information maximum likelihood. 'Limited information' refers to the fact that no restrictions are incorporated on equations other than (3.1). 'Full information' and FIML use all restrictions in the system; see, e.g., Theil (1971, Chap. 10).


The LIML value of k is k = μ, where μ is the smallest root of a polynomial which is quadratic in the case of (3.1). The solution is

$$\mu = \frac{B}{2A} - \frac{1}{2A}\sqrt{B^2 - 4A(m_{11}m_{22} - m_{12}^2)}, \qquad (6.1)$$

where A = m_{11.p} m_{22.p} − m²_{12.p} and B = m_{11} m_{22.p} + m_{22} m_{11.p} − 2 m_{12} m_{12.p}, the m_{ij.p}'s being obtained from (3.3). Note that μ is random. As n → ∞, n(μ − 1) converges in distribution to a χ² variate, so that n^{1/2}(μ − 1) converges in probability to zero. Therefore, the propositions stated in the discussion following (3.4) imply that n^{1/2}[γ̂(μ) − γ] has the same asymptotic normal distribution as its 2SLS counterpart, n^{1/2}[γ̂(1) − γ].

A closer approximation to the sampling distributions of the 2SLS and LIML estimators may be described as follows.¹¹ We standardize these two estimators by subtracting the true value of γ and then dividing the difference by their common asymptotic standard deviation. The asymptotic distribution of these two standardized estimators is standard normal. This is a first-order approximation which can be improved upon by appropriate expansions. The second-order approximation yields cdfs of the form

$$\text{2SLS:}\quad \Phi(u) - n^{-1/2}\theta(u^2 - p + 1)\Phi'(u), \qquad (6.2)$$

$$\text{LIML:}\quad \Phi(u) - n^{-1/2}\theta u^2\Phi'(u), \qquad (6.3)$$

where Φ(u) and Φ'(u) are the standard normal cdf and density function, respectively, while θ is a constant determined by the parameters of the system which contains (3.1) as one of its equations. Since substitution of u = 0 into (6.3) yields Φ(0) − 0 = ½, we conclude that the approximate distribution of the standardized LIML estimator has zero median, whereas (6.2) shows that the standardized 2SLS estimator has this property only for p = 1. As p increases, the median of the latter approximate distribution moves away from zero. It appears that to a large extent these properties also apply when the estimators are formulated in terms of ME rather than sample moments. Theil and Meisner (1980) performed a simulation experiment in which the 2SLS estimator is systematically formulated in terms of ME moments. This has the advantage that the estimator exists even when p > n [because M_p in (3.3) is then positive definite], but the estimator is badly biased for large p. We shall therefore pay no further attention to 2SLS-type estimators. On the other hand, the approximate median-unbiasedness of LIML which is implied by (6.3) appears to also apply when this estimator is formulated in terms of ME moments.

¹¹The results which follow are from Anderson and Sawa; a convenient summary is given by Malinvaud (1980, pp. 716-721).
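The following Python sketch computes the LIML estimate of γ from the moments of (3.2)-(3.3) using the quadratic solution (6.1); names are ours, and the moments are taken about zero because (3.1) has no constant term. Replacing the sample moments below by ME or hybrid moments gives the LIML/ME and LIML/HY variants discussed in the next subsections.

```python
import numpy as np

def liml(y1, y2, X):
    """LIML estimate of gamma in (3.1): mu is the smaller root of the quadratic
    around (6.1), and gamma_hat(mu) is the k-class estimator (3.4) at k = mu."""
    n = len(y1)
    m11, m12, m22 = y1 @ y1 / n, y1 @ y2 / n, y2 @ y2 / n
    m1p, m2p = X.T @ y1 / n, X.T @ y2 / n
    Mp_inv = np.linalg.inv(X.T @ X / n)
    m11_p = m11 - m1p @ Mp_inv @ m1p             # conditional moments, equation (3.3)
    m12_p = m12 - m1p @ Mp_inv @ m2p
    m22_p = m22 - m2p @ Mp_inv @ m2p
    A = m11_p * m22_p - m12_p ** 2
    B = m11 * m22_p + m22 * m11_p - 2.0 * m12 * m12_p
    mu = (B - np.sqrt(B ** 2 - 4.0 * A * (m11 * m22 - m12 ** 2))) / (2.0 * A)  # (6.1)
    return (m12 - mu * m12_p) / (m22 - mu * m22_p)                             # (3.4) at k = mu
```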

6.2. LIML estimators based on sample and on ME moments

We return again to (3.1) and specify that the two associated reduced-form equations are¹²


$$y_{1t} = \sum_{h=1}^{p} x_{ht} + \zeta_{1t}, \qquad y_{2t} = \sum_{h=1}^{p} x_{ht} + \zeta_{2t}, \qquad (6.4)$$

which agree with (3.1) if and only if γ = 1 and e_t = ζ_{1t} − ζ_{2t}. In the simulation experiment to be discussed, the x_ht's and ζ_jt's are all generated as independent pseudo-normal variates,¹³ the distribution of each x_ht being N(0, V/p) and that of each ζ_jt being N(0, σ₀²). Therefore,

$$\sum_{h=1}^{p} x_{ht} \sim N(0, V), \qquad \zeta_{jt} \sim N(0, \sigma_0^2), \qquad e_t \sim N(0, \sigma^2), \qquad (6.5)$$

where σ² = 2σ₀². Note that the distribution of the exogenous component in the reduced form (6.4) is independent of p.

The objective of the experiment is to analyze the behavior of LIML estimators as p increases beyond n. Table 2 is based on 1000 trials for the specification V = 1, σ₀² = ½. Columns (2) and (3) contain, for each selected pair (p, n), the median of the LIML estimates over the 1000 trials. In column (2) we use the conventional LIML estimator based on sample moments (LIML/SA); in column (3) we have LIML/ME, obtained by interpreting the matrix (3.2) as consisting of ME moments. Since (3.1) contains no constant term, both the sample and the ME moments are interpreted as second-order moments measured from zero (rather than from the mean). Columns (4), (7), (10) and (13) will be discussed in the next subsection.

The medians in column (3) are all close to 1 and thus suggest that the LIML/ME estimator is approximately median-unbiased. Note that the medians in column (2) decline as p approaches n. This means that the conventional LIML/SA estimator loses its median-unbiasedness for large p. Also, the interquartile distance of LIML/SA in column (11) increases substantially as p approaches n. A comparison with the corresponding quartiles in columns (5) and (8) indicates that this increased dispersion results primarily from a declining lower quartile but also from an increasing upper quartile.¹⁴

¹²The reduced form is obtained by solving the system for the endogenous variables. This requires the number of equations to be equal to the number of these variables.
¹³The x's are not constant in repeated trials. Making them constant would have implied that all entries in any given row of Table 2 are determined by the same set of n observations on the p exogenous variables.
¹⁴Since Mariano and Sawa (1972) have shown that the sampling distribution of the LIML/SA estimator does not possess finite moments of any order, we use medians and quartiles to measure location and dispersion.


Table 2
Quartiles of LIML estimators based on sample, ME and hybrid moments^a

        Median                  Lower quartile          Upper quartile          Interquartile distance
 p      SA      ME      HY      SA      ME      HY      SA      ME      HY      SA      ME      HY
(1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)    (11)    (12)    (13)

n = 21 observations
10     1.00    1.00    1.00    0.84    0.84    0.85    1.18    1.19    1.19    0.34    0.35    0.34
15     0.99    1.00    1.00    0.80    0.81    0.83    1.23    1.23    1.20    0.43    0.42    0.37
20     0.87    1.00    0.99    0.48    0.78    0.83    1.33    1.25    1.19    0.84    0.47    0.36
25      b      0.98    0.98     b      0.81    0.84     b      1.18    1.17     b      0.37    0.33
30      b      1.01    1.00     b      0.84    0.85     b      1.19    1.18     b      0.35    0.33
35      b      1.01    1.01     b      0.85    0.86     b      1.19    1.16     b      0.34    0.31
40      b      0.99    1.00     b      0.85    0.86     b      1.15    1.13     b      0.30    0.27

n = 31 observations
10     1.00    1.00    1.00    0.88    0.88    0.88    1.13    1.12    1.12    0.25    0.24    0.24
15     1.00    1.00    1.00    0.87    0.87    0.87    1.15    1.14    1.14    0.28    0.27    0.27
20     0.99    0.99    0.99    0.85    0.85    0.86    1.15    1.15    1.14    0.31    0.30    0.28
25     1.00    1.00    1.00    0.83    0.84    0.87    1.20    1.19    1.17    0.37    0.35    0.29
30     0.92    0.99    1.00    0.57    0.83    0.87    1.28    1.20    1.16    0.72    0.37    0.30
35      b      0.99    1.00     b      0.83    0.85     b      1.19    1.16     b      0.35    0.30
40      b      0.99    1.00     b      0.86    0.87     b      1.15    1.13     b      0.29    0.27
45      b      0.98    0.98     b      0.86    0.87     b      1.14    1.13     b      0.28    0.27
50      b      1.01    1.01     b      0.88    0.88     b      1.16    1.15     b      0.29    0.27

n = 41 observations
10     1.01    1.01    1.01    0.89    0.90    0.90    1.14    1.13    1.13    0.24    0.24    0.24
15     1.01    1.01    1.01    0.91    0.91    0.91    1.12    1.12    1.12    0.21    0.21    0.21
20     1.00    1.00    1.00    0.89    0.90    0.90    1.13    1.13    1.12    0.24    0.23    0.22
25     1.00    1.00    0.99    0.86    0.87    0.87    1.14    1.13    1.13    0.27    0.27    0.26
30     1.00    1.01    1.01    0.87    0.87    0.88    1.16    1.16    1.15    0.28    0.29    0.27
35     0.99    1.00    1.00    0.82    0.83    0.86    1.22    1.21    1.17    0.40    0.38    0.31
40     0.91    1.00    1.00    0.47    0.84    0.88    1.35    1.21    1.16    0.88    0.38    0.28
45      b      1.01    1.01     b      0.87    0.89     b      1.18    1.16     b      0.31    0.27
50      b      1.00    1.00     b      0.88    0.88     b      1.14    1.13     b      0.26    0.25
55      b      1.00    1.00     b      0.88    0.89     b      1.15    1.14     b      0.27    0.25
60      b      1.01    1.01     b      0.90    0.90     b      1.15    1.15     b      0.25    0.25

^a Based on 1000 trials; see text.
^b The LIML/SA estimator does not exist.

The experiment underlying Table 2 uses uncorrelated exogenous variables. In Table 3 we extend this to equicorrelated variables. Let the components of the vector (x_{1t}, ..., x_{pt}) be pseudo-normal with zero mean, correlation ρ and variance V/[p + p(p − 1)ρ], so that (6.5) is still applicable for any (p, ρ). Let these vectors be independent for different values of t. Table 3 uses V = 1 and the same σ₀² as before, and it is based on 1000 trials of size n = 21 for selected values of ρ. The results for LIML/SA and LIML/ME in this table are similar to the corresponding results in Table 2.


Table 3
Quartiles of LIML estimators based on correlated exogenous variables^a

        Median                  Lower quartile          Upper quartile          Interquartile distance
 p      SA      ME      HY      SA      ME      HY      SA      ME      HY      SA      ME      HY

ρ = 0
10     1.02    1.02    1.02    0.85    0.85    0.85    1.19    1.20    1.19    0.35    0.35    0.34
15     1.00    1.00    1.01    0.81    0.82    0.85    1.21    1.20    1.18    0.40    0.38    0.33
20     0.90    0.98    0.99    0.56    0.79    0.85    1.31    1.22    1.18    0.75    0.43    0.33
25      b      0.97    0.98     b      0.81    0.83     b      1.17    1.15     b      0.36    0.32
30      b      1.00    1.01     b      0.85    0.86     b      1.21    1.20     b      0.36    0.34
35      b      0.99    0.99     b      0.83    0.84     b      1.16    1.14     b      0.32    0.30
40      b      1.01    1.01     b      0.87    0.87     b      1.18    1.17     b      0.31    0.29

ρ = 0.3
10     1.02    1.01    1.01    0.84    0.84    0.85    1.18    1.18    1.18    0.34    0.34    0.33
15     1.01    1.01    1.01    0.82    0.83    0.86    1.23    1.22    1.19    0.41    0.39    0.33
20     0.91    0.99    0.99    0.49    0.81    0.85    1.37    1.24    1.17    0.88    0.42    0.32
25      b      0.98    0.99     b      0.83    0.85     b      1.16    1.14     b      0.33    0.29
30      b      1.00    1.00     b      0.85    0.86     b      1.16    1.15     b      0.31    0.29
35      b      1.00    1.01     b      0.87    0.88     b      1.16    1.16     b      0.30    0.29
40      b      1.00    1.00     b      0.87    0.88     b      1.16    1.15     b      0.29    0.27

ρ = 0.6
10     1.02    1.01    1.02    0.84    0.85    0.86    1.18    1.18    1.18    0.34    0.33    0.32
15     1.01    1.01    1.00    0.82    0.83    0.85    1.23    1.22    1.19    0.42    0.39    0.34
20     0.93    0.99    1.00    0.52    0.82    0.85    1.38    1.22    1.17    0.87    0.40    0.33
25      b      0.98    0.99     b      0.83    0.85     b      1.17    1.15     b      0.34    0.31
30      b      0.99    0.99     b      0.85    0.86     b      1.16    1.16     b      0.30    0.30
35      b      1.01    1.01     b      0.87    0.87     b      1.17    1.16     b      0.30    0.29
40      b      1.00    1.00     b      0.87    0.88     b      1.16    1.16     b      0.29    0.28

ρ = 0.9
10     1.02    1.02    1.02    0.85    0.85    0.85    1.17    1.17    1.17    0.32    0.32    0.31
15     1.01    1.01    1.01    0.81    0.84    0.86    1.24    1.20    1.19    0.43    0.36    0.32
20     0.94    1.00    0.99    0.53    0.83    0.85    1.38    1.20    1.18    0.85    0.37    0.33
25      b      0.99    0.99     b      0.83    0.85     b      1.17    1.17     b      0.33    0.32
30      b      0.99    0.99     b      0.84    0.86     b      1.17    1.16     b      0.33    0.30
35      b      1.01    1.01     b      0.87    0.88     b      1.18    1.17     b      0.31    0.30
40      b      0.99    0.99     b      0.86    0.86     b      1.17    1.17     b      0.32    0.31

ρ = 0.99
10     1.02    1.02    1.02    0.85    0.86    0.86    1.17    1.17    1.17    0.32    0.31    0.30
15     1.01    1.01    1.01    0.81    0.86    0.86    1.24    1.20    1.18    0.42    0.34    0.32
20     0.93    0.99    0.99    0.54    0.84    0.85    1.38    1.18    1.16    0.84    0.33    0.31
25      b      0.99    1.00     b      0.85    0.85     b      1.17    1.17     b      0.33    0.31
30      b      0.99    0.99     b      0.85    0.86     b      1.16    1.16     b      0.31    0.30
35      b      1.01    1.01     b      0.87    0.88     b      1.19    1.18     b      0.31    0.29
40      b      1.00    1.00     b      0.85    0.85     b      1.18    1.17     b      0.33    0.31

^a Based on 1000 trials; see text.
^b The LIML/SA estimator does not exist.


6.3. LIML estimators based on hybrid moments

Although Tables 2 and 3 indicate that the performance of LIML/ME is far better than that of LIML/SA, it is the case that the interquartile distance of the former estimator shows a bulge around p = n.¹⁵ This bulge indicates that for fixed n and increasing p, the precision of the estimator deteriorates when p approaches n and then improves when p increases beyond n. Is it possible to eliminate this bulge? One way of doing this is by adding a ridge to the ME moment matrix in the same way that Haff's (1980) empirical Bayes estimator of the covariance matrix amounts to adding a ridge to the sample moment matrix. Specifically, let us interpret the p + 2 diagonal elements of the matrix (3.2) as sample moments and all off-diagonal elements as ME moments. We shall refer to (3.2) thus interpreted as the hybrid moment matrix of the p + 2 variables. Simulation experiments based on alternative risk functions have indicated that the hybrid moment matrix is an attractive alternative to the ME moment matrix, particularly when the objective is to estimate the inverse of a parent moment matrix; see Theil and Fiebig (1984).

The 1000 trials underlying each line of Table 2 have also been used to compute LIML/HY estimates, all obtained from the hybrid interpretation of the moment matrix (3.2). The medians of these estimates in column (4) are about as close to 1 as those of LIML/ME in column (3), but the interquartile distances of the former estimates in column (13) are systematically below those of the latter in column (12). Also, the interquartile distances in column (13) do not show the same large bulge around p = n which we find in column (12). The picture of the correlated case in Table 3 is about the same.

The evidence of Tables 2 and 3 suggests that the LIML approach can be rescued in the case of undersized samples by the simple device of replacing sample moments by hybrid moments. This simplicity is in agreement with the view (see Section 3, last paragraph) that the problem of undersized samples should not be a problem. See Theil and Fiebig (1984) for additional evidence concerning equations with more than two variables. In Section 7 we shall apply hybrid moments to a problem of constrained estimation.

¹⁵There is no clear evidence of such a bulge for large ρ. This exception reflects the fact that the p exogenous variables effectively behave as one variable when ρ is sufficiently close to 1.
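A sketch of the construction of the hybrid moment matrix follows; function names are ours, second-order moments are taken about zero as in (3.2), and ties are ignored. The off-diagonal ME moment of a pair of variables equals the ME covariance (5.2) plus the product of the (preserved) means, which reduces to the mean product of their secondary midpoints in sample order.

```python
import numpy as np

def secondary_midpoints(x):
    """Secondary midpoints of one variable, returned in the original sample order."""
    idx = np.argsort(x)
    xs = x[idx]
    xi = np.concatenate(([xs[0]], (xs[:-1] + xs[1:]) / 2.0, [xs[-1]]))
    out = np.empty(len(xs))
    out[idx] = (xi[:-1] + xi[1:]) / 2.0
    return out

def hybrid_moment_matrix(W):
    """W: n x m data matrix (here the columns y1, y2, x1, ..., xp of (3.2));
    sample moments on the diagonal, ME moments off the diagonal."""
    n, m = W.shape
    mids = np.column_stack([secondary_midpoints(W[:, j].astype(float)) for j in range(m)])
    H = mids.T @ mids / n                        # ME second-order moments about zero
    np.fill_diagonal(H, np.sum(W * W, axis=0) / n)   # sample moments on the diagonal
    return H
```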

7. Canonical correlations and symmetry-constrained estimation

7.1. Error covariance matrices and canonical correlations

We return to the linear system (2.1), which we generalize to a system of q linear equations with q dependent variables on the left and, in each equation, the same set of p independent variables on the right. The errors in the equations form a vector (e_{1t}, ..., e_{qt}) with zero mean and covariance matrix Σ.


In Section 2 we described some problems that arise when we replace Σ by the estimate S consisting of mean squares and products of LS residuals; here we shall consider whether an ME approach yields more attractive results. The account which follows is a modified version of Meisner (1981). We write the (p + q) × (p + q) covariance matrix of the dependent and the independent variables in partitioned form:

$$\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix} \qquad \begin{matrix} q \text{ dependent variables} \\ p \text{ independent variables} \end{matrix} \qquad (7.1)$$

If this matrix is interpreted as consisting of population variances and covariances, it is related to the error covariance matrix Σ by

$$\Sigma = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'. \qquad (7.2)$$

Let ρ_1, ..., ρ_m be the canonical correlation coefficients of the dependent and the independent variables, where m = min(p, q). These ρ_i's can be obtained from the determinantal equation

$$|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' - \rho_i^2\Sigma_{11}| = 0, \qquad (7.3)$$

so that (7.2) implies

$$|\Sigma - (1 - \rho_i^2)\Sigma_{11}| = 0, \qquad (7.4)$$

which provides a link between the error covariance matrix Σ and the canonical correlations of the q dependent and the p independent variables of the system: for i = 1, ..., m, one minus each squared canonical correlation coefficient is a latent root of the diagonalization of Σ in the metric of the covariance matrix Σ_11 of the dependent variables.

7.2. Estimation of canonical correlations

Given our interest in Σ, the result (7.4) suggests that it is worthwhile to consider the estimation of canonical correlations. Fiebig (1980) conducted a simulation experiment based on

$$Y_i = \rho_i X_i + (1 - \rho_i^2)^{1/2} V_i, \qquad i = 1, \ldots, 9, \qquad (7.5)$$

where the X_i's and V_i's are 18 independent standard pseudo-normal variates. Then the Y_i's are also independent standard pseudo-normal, while X_i and Y_j are uncorrelated for i ≠ j and (X_i, Y_i) has correlation ρ_i. Therefore, ρ_1, ..., ρ_9 are the canonical correlations of (X_1, ..., X_9) and (Y_1, ..., Y_9). The joint covariance matrix of the 18 variables (X's and Y's) takes the form (7.1) with Σ_11 = Σ_22 = I and Σ_12 diagonal with ρ_1, ..., ρ_9 on the diagonal.


Their true values are specified as

$$\rho_1 = 0.9, \quad \rho_2 = 0.8, \quad \ldots, \quad \rho_8 = 0.2, \quad \rho_9 = 0.1. \qquad (7.6)$$

By interpreting (7.1) as consisting of either ME or sample moments computed for a sample of size n, and then solving the associated determinantal equation (7.3), we obtain nine ME or sample canonical correlations. This experiment was replicated 100 times and the results are summarized in Table 4 in terms of means and RMSEs around the true value. The upper part of the table concerns the largest canonical correlation (with true value ρ_1 = 0.9). Both the ME and the sample estimator are subject to a substantial upward bias which slowly declines as n increases,¹⁶ but the bias of the former estimator is smaller and this also holds for its RMSE. The middle part of Table 4 concerns the arithmetic average canonical correlation (true value 0.5) and the lower part deals with the sum of the squared canonical correlations (true value 2.85); this sum plays a role in Hooper's (1959) trace correlation coefficient. The results are similar to those in the upper part: there is an upward bias which slowly decreases as n increases, and both the bias and the RMSE are smaller when ME rather than the sample moments are used.

Although these results are encouraging for the ME approach, it should be admitted that the upward bias is quite substantial. A comparison of the last four columns of Table 4 shows that this bias is typically close to the corresponding RMSE, suggesting that a bias correction is in order. Let r_1 ≥ r_2 ≥ ... ≥ r_m be the ME canonical correlations. The corrected coefficients are r̃_1, ..., r̃_m, obtained from

$$1 - \tilde{r}_i^2 = (1 - r_i^2)^{n/(n+p+q-1)}, \qquad (7.7)$$

which is a correction in exponential form. To explain the exponent we note that each canonical variate involves p − 1 or q − 1 multiplicative coefficients (only the ratios of these coefficients matter). This yields p + q − 2 coefficients for a pair of canonical variates, to which we add 1 for the use of a constant term, yielding a total of p + q − 1 coefficients. (Both canonical variates have constant terms, but the covariance in the numerator of the canonical correlation is not affected when only one constant is used.) Table 5 provides evidence of the correction (7.7) based on the experimental design (7.5) and (7.6) for both the ME and the hybrid canonical correlations.

¹⁶The upward bias of the sample estimator is not surprising, since canonical correlations are generalizations of the multiple correlation. Let R be such a correlation, associated with a linear regression on p independent variables (including a constant term). A frequently used correction amounts to multiplying 1 − R² by the ratio of n − 1 to n − p − 1. Both this correction and that which is shown in (7.7) for canonical correlations are corrections to the order 1/n, but (7.7) has the advantage of never yielding a negative r̃_i². See also Lawley (1956, 1959) for an asymptotic expansion of the expected sample canonical correlations; the implied correction is much more complicated than (7.7).
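A minimal sketch of the correction (7.7) applied to a single canonical correlation coefficient; the function name is ours.

```python
def corrected_canonical_correlation(r, n, p, q):
    """Apply the exponential correction (7.7) to a canonical correlation r computed
    from n observations on q dependent and p independent variables."""
    one_minus_r2 = (1.0 - r * r) ** (n / (n + p + q - 1))
    return (1.0 - one_minus_r2) ** 0.5

# e.g. corrected_canonical_correlation(0.99, 21, 9, 9) pulls an uncorrected value
# of 0.99 noticeably toward zero.
```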


Table 4
ME and sample canonical correlation coefficients

          Mean                Estimated bias^a        RMSE
  n       ME      Sample      ME      Sample          ME      Sample

Largest canonical correlation coefficient
 10      0.991      b         0.091      b            0.091      b
 15      0.992      b         0.092      b            0.092      b
 20      0.986    0.994       0.086    0.094          0.087    0.094
 25      0.973    0.978       0.073    0.078          0.073    0.079
 30      0.963    0.967       0.063    0.067          0.064    0.068
 40      0.945    0.949       0.045    0.049          0.050    0.052
 50      0.936    0.939       0.036    0.039          0.040    0.042
100      0.919    0.920       0.019    0.020          0.026    0.028

Average canonical correlation coefficient
 10      0.826      b         0.326      b            0.328      b
 15      0.733      b         0.233      b            0.235      b
 20      0.678    0.685       0.178    0.185          0.181    0.188
 25      0.635    0.640       0.135    0.140          0.139    0.144
 30      0.613    0.618       0.113    0.118          0.118    0.122
 40      0.583    0.586       0.083    0.086          0.088    0.091
 50      0.562    0.563       0.062    0.063          0.068    0.069
100      0.530    0.531       0.030    0.031          0.037    0.039

Sum of squared canonical correlation coefficients
 10      6.74       b         3.89       b            3.91       b
 15      5.67       b         2.82       b            2.84       b
 20      4.95     5.06        2.10     2.21           2.14     2.23
 25      4.46     4.52        1.61     1.67           1.63     1.69
 30      4.19     4.25        1.34     1.40           1.38     1.43
 40      3.83     3.86        0.98     1.01           1.00     1.04
 50      3.60     3.62        0.75     0.77           0.79     0.81
100      3.23     3.23        0.38     0.38           0.41     0.42

^a Mean minus true value.
^b Not computed. For n = 10 and 15, the largest sample canonical correlation coefficient is identically equal to 1.

The top row of the table shows the true value of each squared canonical correlation. The first eight rows contain means over 100 trials and, in parentheses, the RMSEs around the true value of the squared ME canonical correlation. The next eight lines provide analogous results for the hybrid estimates obtained by interpreting (7.1) as the hybrid covariance matrix (with sample variances on the diagonal and ME covariances elsewhere). In the lower half of the table the correction (7.7) is applied to either the ME or the hybrid estimator. A comparison of means and RMSEs shows that for n ≥ 15 the corrected hybrid estimator is superior except with respect to the largest canonical correlation.


[Table 5, which reports the means and RMSEs over 100 trials of the nine squared canonical correlations for the ME and hybrid estimators, uncorrected and corrected by (7.7), is not legible in this copy.]


7.3. A cross-country demand system

We return again to the demand system (2.1), which we now amend by adding a constant term to each equation:

$$y_{it} = \alpha_i + \beta_i x_{0t} + \sum_{j=1}^{N} \pi_{ij} x_{jt} + e_{it}. \qquad (7.8)$$

Our application of this system will not be to time series data but to per capita data for 15 countries (t = 1, ..., n = 15); see the Appendix for further details. The analysis of homogeneity and symmetry testing is beyond the scope of this chapter, because it would involve not only the frequency of rejections of the null hypothesis when this hypothesis is true but also the power of the test. Instead, we shall impose the homogeneity condition (2.2) by writing (7.8) in the form

$$y_{it} = \alpha_i + \beta_i x_{0t} + \sum_{j=1}^{N-1} \pi_{ij}(x_{jt} - x_{Nt}) + e_{it}, \qquad (7.9)$$

and we shall want to estimate this system subject to the symmetry constraint (2.3). Since e_{1t} + ... + e_{Nt} = 0, we can confine the estimation of (7.9) to i = 1, ..., N − 1. We write (7.9) for t = 1, ..., 15 as y_i = Xδ_i + e_i, where δ_i = (α_i, β_i, π_{i1}, ..., π_{i,N−1})' and X is a 15 × (N + 1) matrix whose tth row equals (1, x_{0t}, x_{1t} − x_{Nt}, ..., x_{N−1,t} − x_{Nt}). Let (e_{1t}, ..., e_{N−1,t}) for t = 1, ..., 15 be independently and identically distributed with zero means and nonsingular covariance matrix Σ. Then (X'X)^{-1}X'y_i is the LS estimator of δ_i, which is unbiased if X is fixed, while S defined as

$$S = \frac{1}{15 - (N+1)}\, Y'[I - X(X'X)^{-1}X']Y, \qquad Y = [y_1, \ldots, y_{N-1}], \qquad (7.10)$$

is an unbiased estimator of Σ.

The LS estimator of δ_i does not satisfy the symmetry constraint (2.3). We can write (2.3) in the form Rδ = 0, where δ is a vector with δ_i as the ith subvector (i = 1, ..., N − 1) and R is a matrix whose elements are all 0 or ±1, each row of R corresponding to π_{ij} = π_{ji} for some (i, j). The BLU estimator of δ constrained by (2.3) is

$$\hat{\delta}(\Sigma) = d - C(\Sigma)R'[RC(\Sigma)R']^{-1}Rd, \qquad (7.11)$$

and its covariance matrix is

$$C(\Sigma) - C(\Sigma)R'[RC(\Sigma)R']^{-1}RC(\Sigma), \qquad (7.12)$$


where C(Σ) = Σ ⊗ (X'X)^{-1} and d is a vector with (X'X)^{-1}X'y_i as the ith subvector (i = 1, ..., N − 1). For details on constrained linear estimation, see, e.g., Theil (1971, Sec. 6.8).

If Σ is known, we can compute (7.11) from the data. If Σ is not known, the standard procedure is to replace Σ in (7.11) by the estimator S of (7.10). Alternatively, we can use an estimator based on corrected canonical correlations of the type (7.7), but an adjustment must be made for the fact that Σ refers only to N − 1 equations.¹⁷ Here we retain all equations (7.9) for i = 1, ..., N by specifying p = q = N in (7.1).¹⁸ Indicating by hats (circumflexes) that (7.1) has sample variances on the diagonal and ME covariances elsewhere, we obtain the (uncorrected) hybrid canonical correlations r_1 > r_2 > ... > r_N from

$$(\hat{\Sigma}_{12}\hat{\Sigma}_{22}^{-1}\hat{\Sigma}_{12}' - r_i^2\hat{\Sigma}_{11})z_i = 0, \qquad (7.13)$$

where z_i is a characteristic vector associated with r_i, normalized so that

$$z_i'\hat{\Sigma}_{11}z_j = \delta_{ij} \quad \text{or, equivalently,} \quad Z'\hat{\Sigma}_{11}Z = I, \qquad Z = [z_1, \ldots, z_N]. \qquad (7.14)$$

Let Σ_N be the covariance matrix of (e_{1t}, ..., e_{Nt}), to be estimated from the N × N version of (7.4) with characteristic vectors (the z_i's) added. Since Σ_N has rank N − 1, we correct r_1 to 1 and use (7.7) with p = q = N for i = 2, ..., N. Let Λ be the diagonal matrix with 0, 1 − r̃_2², ..., 1 − r̃_N² on the diagonal. Then, from (7.4), (7.13) and (7.14), the corrected estimator of Σ_N is Σ̂_N = (Z')^{-1}ΛZ^{-1} = Σ̂_11 ZΛZ'Σ̂_11, so that

$$\hat{\Sigma}_N = \sum_{i=2}^{N}(1 - \tilde{r}_i^2)\,\hat{\Sigma}_{11}z_i(\hat{\Sigma}_{11}z_i)', \qquad (7.15)$$

after which Σ̂ is obtained by deleting the last row and column of Σ̂_N. This Σ̂ is an estimator of Σ in (7.11) that will be used below as an alternative to S of (7.10). Note that Σ̂ does not involve the largest canonical correlation (see the end of the previous subsection).
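The following sketch assembles the corrected estimator Σ̂ from the blocks of the hybrid covariance matrix (7.1); names are ours. It uses scipy's generalized symmetric eigensolver for (7.13), which returns vectors normalized as in (7.14).

```python
import numpy as np
from scipy.linalg import eigh

def corrected_error_covariance(Sigma11, Sigma12, Sigma22, n):
    """Blocks of the hybrid covariance matrix (7.1) with p = q = N; returns the
    (N-1) x (N-1) estimator Sigma-hat built from (7.13)-(7.15)."""
    N = Sigma11.shape[0]
    M = Sigma12 @ np.linalg.solve(Sigma22, Sigma12.T)
    r2, Z = eigh(M, Sigma11)                     # (7.13); Z' Sigma11 Z = I as in (7.14)
    order = np.argsort(r2)[::-1]                 # sort so that r_1 >= r_2 >= ... >= r_N
    r2, Z = np.clip(r2[order], 0.0, 1.0), Z[:, order]
    r2_tilde = 1.0 - (1.0 - r2) ** (n / (n + 2 * N - 1))   # correction (7.7), p = q = N
    r2_tilde[0] = 1.0                            # largest correlation corrected to one
    lam = 1.0 - r2_tilde                         # diagonal of Lambda: 0, 1-r2_tilde_2, ...
    S11Z = Sigma11 @ Z
    Sigma_N = S11Z @ np.diag(lam) @ S11Z.T       # (7.15)
    return Sigma_N[:-1, :-1]                     # delete the last row and column
```

This Σ̂ can then be substituted for Σ in the constrained estimator (7.11) and its covariance matrix (7.12).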

7.4. Discussion of numerical results

A simulation experiment was performed in order to compare the three symmetry-constrained estimators, with N = 8 goods: food; clothing; rent; furniture; medical care; transport and communication; recreation and education; other consumption expenditures.


[Table 6: for each of the 35 coefficients, the true parameter value and, for the three symmetry-constrained estimators (based on S, on Σ̄, and on the true Σ), the estimated bias, RMSE and RMSSE over 500 trials. The body of the table is not recoverable from this scan.]


The first column of Table 6 contains the true values of the parameters. For each of the three estimators, the columns labeled Bias and RMSE contain the estimated bias (mean minus true value) and the RMSE around the true value over 500 trials. Bias presents no problem; the estimated bias values are all small in magnitude relative to the corresponding RMSEs. Differences between the three estimators appear when we consider their RMSEs. The estimates based on S are markedly inferior to those which use the true Σ. When we use Σ̄ rather than S, we obtain estimates which compare much more favorably with those based on the true Σ. In order to facilitate these comparisons, we computed the ratio of the RMSE based on the true Σ to that based on S and on Σ̄ for each of the 35 coefficients. These ratios are shown in the first two columns of Table 7, and the quartiles of these ratios (lower, median, upper) are shown below.

                 Lower    Median    Upper
Ratios for S      0.67     0.74      0.81
Ratios for Σ̄      0.92     0.95      0.99

It is evident from these figures that there is a substantial efficiency gain from using Σ̄ rather than S in the symmetry-constrained estimation procedure, and that the efficiency loss from not knowing the true error covariance matrix is quite modest when Σ̄ is used as its estimator.

Another matter of importance is whether the standard errors of the symmetry-constrained estimates provide an adequate picture of the variability of these estimates around the true parameter values. This question is pursued by means of the RMSSEs of Table 6. These are obtained from the matrix (7.12), with Σ interpreted as either S, Σ̄ or the true Σ, by averaging the diagonal elements of (7.12) over the 500 trials and then taking square roots of these averages. On comparing the RMSSEs based on S with the corresponding RMSEs, we must conclude that the standard errors based on S tend to underestimate the variability of their coefficient estimates. Table 7 illustrates this more clearly by providing the ratio of the RMSSE to the corresponding RMSE for each estimator; the third column of that table shows the substantial understatement of the variability of the estimates based on S. The quartiles of the 35 ratios in each of the last three columns are as follows (a computational sketch of the RMSE and RMSSE calculations is given below, after the quartile summary):

                 Lower    Median    Upper
Ratios for S      0.48     0.59      0.69
Ratios for Σ̄      0.89     0.94      1.06
Ratios for Σ      0.99     1.01      1.02

When the true Σ is used, the ratios are tightly distributed around unity. Use of Σ̄ yields ratios which are more widely dispersed around 1, but which represent a marked improvement over the use of S.
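For concreteness, here is a minimal sketch (ours, with hypothetical names) of the RMSE and RMSSE calculations just described: the RMSSE averages the diagonal of the constrained covariance matrix (7.12) over trials before taking square roots, while the RMSE averages squared estimation errors around the true values.

```python
import numpy as np

def rmsse(cov_matrices):
    """Average the diagonal of (7.12) over the trials, then take square roots."""
    avg_diag = np.mean([np.diag(V) for V in cov_matrices], axis=0)
    return np.sqrt(avg_diag)

def rmse(estimates, true_delta):
    """Root-mean-squared errors of the estimates around the true values."""
    errors = np.asarray(estimates) - true_delta
    return np.sqrt(np.mean(errors ** 2, axis=0))

# Per coefficient, Table 7 reports rmse(true-Sigma estimates)/rmse(S or Sigma_bar
# estimates) and rmsse(...)/rmse(...) for each of the three estimators.
```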


Table 7
Ratios of RMSEs and RMSSEs of symmetry-constrained estimates

                         Ratio of RMSE based on      Ratio of RMSSE to RMSE
                         true Σ to RMSE based on
                            S        Σ̄                  S       Σ̄       Σ

Coefficients β_i
i = 1                     0.70     0.92               0.56    1.39    1.01
i = 2                     0.69     0.90               0.60    0.94    1.06
i = 3                     0.89     0.99               0.86    1.08    1.02
i = 4                     0.63     0.99               0.44    0.92    0.98
i = 5                     0.79     0.90               0.67    1.25    0.98
i = 6                     0.83     0.94               0.76    0.95    1.02
i = 7                     0.67     0.94               0.48    0.94    0.97

Diagonal Slutsky coefficients π_ii
i = 1                     0.72     0.94               0.57    1.52    1.03
i = 2                     0.78     1.00               0.60    0.88    1.00
i = 3                     0.92     0.97               0.84    0.95    1.00
i = 4                     0.61     1.00               0.38    0.83    0.97
i = 5                     0.79     0.93               0.67    1.35    1.04
i = 6                     0.88     0.97               0.79    0.85    1.03
i = 7                     0.65     0.92               0.44    0.87    0.96

Off-diagonal Slutsky coefficients π_ij
i = 1, j = 2              0.70     0.87               0.52    0.93    0.99
i = 1, j = 3              0.83     0.86               0.72    1.27    1.03
i = 1, j = 4              0.62     0.97               0.42    0.98    0.99
i = 1, j = 5              0.77     0.93               0.59    1.43    0.99
i = 1, j = 6              0.86     0.88               0.66    1.00    0.96
i = 1, j = 7              0.63     0.96               0.43    1.06    1.01
i = 2, j = 3              0.81     0.99               0.71    0.90    1.02
i = 2, j = 4              0.63     0.99               0.43    0.84    0.99
i = 2, j = 5              0.72     0.94               0.55    0.98    1.01
i = 2, j = 6              0.81     1.00               0.67    0.86    1.02
i = 2, j = 7              0.67     0.95               0.45    0.87    0.98
i = 3, j = 4              0.78     1.01               0.66    0.92    0.98
i = 3, j = 5              0.81     0.87               0.71    1.11    1.01
i = 3, j = 6              0.90     1.01               0.79    0.90    1.01
i = 3, j = 7              0.79     0.96               0.69    0.91    0.99
i = 4, j = 5              0.68     0.95               0.49    0.97    1.02
i = 4, j = 6              0.69     0.96               0.54    0.86    1.03
i = 4, j = 7              0.60     1.01               0.36    0.89    1.00
i = 5, j = 6              0.78     0.86               0.65    0.97    1.05
i = 5, j = 7              0.72     0.92               0.55    0.98    1.01
i = 6, j = 7              0.74     0.99               0.58    0.93    1.01


8. Conclusion

We have attempted to demonstrate an approach, based on the ME distribution, to problems that arise in large equation systems. Estimators of various population parameters are generated from this distribution according to the method of moments: whenever a standard procedure uses sample moments, we use ME moments. For example, previous analyses have found that the ME moment matrix yields small-sample gains relative to the usual sample moment matrix. In our experiments, impressive results were also achieved with a hybrid moment matrix whose diagonal elements are sample moments and whose off-diagonal elements are ME moments.

We hope that the experiments presented here have illustrated the effectiveness of the ME approach. Simulation experiments cannot be conclusive, though, and further work is needed to reinforce these initial impressions. The simultaneous equation experiment could be extended in a number of directions: the form of the equation singled out for attention is extremely simple, and the experiment could be extended to equations that include more endogenous and/or exogenous variables. Also, no attempt was made to test the validity of the asymptotic standard errors.

It is appropriate to note that there exist other problems associated with large equation systems that have not been discussed here. In the context of simultaneous equation estimation, full-information methods of estimation (such as three-stage least squares) require the number of endogenous variables in the system to be less than the number of observations. Without such a condition, the usual sample estimator of the error covariance matrix is singular. Essentially the same problem can arise in systems of demand equations or, more generally, in any system of seemingly unrelated regression equations. For example, in order to estimate a system of demand equations with 37 goods on the basis of annual U.K. data for 17 years, Deaton (1975) used an a priori specified covariance matrix. The ME approach provides a simple and elegant solution in such situations.

Appendix

The demand systems (2.1) and (7.8) are obtained by maximizing an appropriately differentiable utility function subject to the budget constraint Σ_i p_i q_i = M, where p_i and q_i are the price and quantity of good i and M is total expenditure (or 'income'). The technique used amounts to deriving the first-order constrained maximum condition and then differentiating it with respect to M and the p_j's. The result can be conveniently written in the differential form

$$w_i\,d(\log q_i) = \theta_i\,d(\log Q) + \sum_{j=1}^{N} \pi_{ij}\,d(\log p_j), \qquad (A1)$$

where w_i is the budget share of good i and d(log Q) is the Divisia volume index,

$$w_i = \frac{p_i q_i}{M}, \qquad d(\log Q) = \sum_{i=1}^{N} w_i\,d(\log q_i), \qquad (A2)$$

while θ_i = ∂(p_i q_i)/∂M is the marginal budget share of good i and the Slutsky coefficient π_ij equals (p_i p_j/M) ∂q_i/∂p_j, the derivative ∂q_i/∂p_j measuring the effect of p_j on q_i when real income remains constant. The homogeneity property (2.2) reflects that proportionate changes in all prices do not affect any q_i when M also changes proportionately. The symmetry property (2.3) results from the assumed symmetry of the Hessian matrix of the utility function. To apply (A1) to time series we write Dx_t = log(x_t/x_{t-1}) for any positive variable x with value x_t at time t. A finite-change approximation to (A1) is then

$$\bar{w}_{it}\,Dq_{it} = \theta_i\,DQ_t + \sum_{j=1}^{N} \pi_{ij}\,Dp_{jt}, \qquad (A3)$$

where DQ_t = Σ_i w̄_it Dq_it and w̄_it is the arithmetic average of the budget shares of good i at t - 1 and t. Equation (A3) is equivalent to (2.1) for y_it = w̄_it Dq_it, x_0t = DQ_t, x_jt = Dp_jt. For further details, see Theil (1980).

The numerical results reported in Section 7 are based on the analysis by Theil and Suhm (1981) of data on 15 countries collected by Kravis et al. (1978). These countries are the U.S., Belgium, France, West Germany, U.K., The Netherlands, Japan, Italy, Hungary, Iran, Colombia, Malaysia, Philippines, South Korea, and India. Let w_it be the per capita budget share of good i in country t. Working's (1943) model describes such a share as a linear function of the logarithm of income. To take into account that different countries have different relative prices, Working's model is postulated to hold at the geometric mean prices across countries, p̄_1, ..., p̄_N, where

$$\log \bar{p}_i = \frac{1}{15}\sum_{t=1}^{15} \log p_{it}, \qquad (A4)$$

which requires that a substitution term be added to the model. The result is that the demand system takes the form (7.8), with x_0t the per capita real income of country t, x_jt = log(p_jt/p̄_j), and y_it equal to w_it(1 - x_it + Σ_j w_jt x_jt). The sums over i = 1, ..., N of y_it, α_i, β_i and π_ij are then equal to 1, 1, 0 and 0, respectively, implying that e_1t, ..., e_Nt are linearly dependent.
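As an illustration only, the sketch below shows how the cross-country variables just described might be constructed from raw data. The array names and shapes are assumptions, as is the use of logarithmic real income for x_0t (Working's model is linear in log income, but the text above does not say whether the log is taken here or upstream); this is not a description of the Theil and Suhm (1981) computations.

```python
import numpy as np

def cross_country_variables(prices, budget_shares, real_income):
    """Construct the regression variables of (7.8)/(7.9) for 15 countries.

    prices, budget_shares: arrays of shape (15, N); real_income: array of shape (15,).
    """
    log_pbar = np.log(prices).mean(axis=0)      # (A4): log of geometric mean prices
    x = np.log(prices) - log_pbar               # x_jt = log(p_jt / pbar_j)
    x0 = np.log(real_income)                    # assumed: log per capita real income
    # y_it = w_it (1 - x_it + sum_j w_jt x_jt), so that sum_i y_it = 1 in each country
    y = budget_shares * (1.0 - x + (budget_shares * x).sum(axis=1, keepdims=True))
    return y, x0, x
```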

References

Barten, A. P. (1969). Maximum likelihood estimation of a complete system of demand equations. European Economic Review 1, 7-73.
Bera, A. K., Byron, R. P. and Jarque, C. M. (1981). Further evidence on asymptotic tests for homogeneity and symmetry in large demand systems. Econom. Lett. 8, 101-105.
Byron, R. P. (1970). The restricted Aitken estimation of sets of demand equations. Econometrica 39, 816-830.
Christensen, L. R., Jorgenson, D. W. and Lau, L. J. (1975). Transcendental logarithmic utility functions. American Economic Review 65, 367-383.
Conway, D. and Theil, H. (1980). The maximum entropy moment matrix with missing values. Econom. Lett. 5, 319-322.
Deaton, A. S. (1974). The analysis of consumer demand in the United Kingdom. Econometrica 42, 341-367.
Deaton, A. S. (1975). Models and Projections of Demand in Post-War Britain. Chapman and Hall, London.
Fiebig, D. G. (1980). Maximum entropy canonical correlations. Econom. Lett. 6, 345-348.
Fiebig, D. G. (1982). The maximum entropy distribution and its covariance matrix. Doctoral dissertation, Department of Economics, University of Southern California.
Haff, L. R. (1980). Empirical Bayes estimation of the multivariate normal covariance matrix. Ann. Statist. 8, 586-597.
Hooper, J. W. (1959). Simultaneous equations and canonical correlation theory. Econometrica 27, 245-256.
Kagan, A. M., Linnik, Y. V. and Rao, C. R. (1973). Characterization Problems in Mathematical Statistics. Wiley, New York.
Kidwai, S. A. and Theil, H. (1981). Simulation evidence on the ridge and the shrinkage of the maximum entropy variance. Econom. Lett. 8, 59-61.
Kravis, I. B., Heston, A. W. and Summers, R. (1978). International Comparisons of Real Product and Purchasing Power. The Johns Hopkins University Press, Baltimore, MD.
Laitinen, K. (1978). Why is demand homogeneity so often rejected? Econom. Lett. 1, 187-191.
Lawley, D. N. (1956). Tests of significance for the latent roots of covariance and correlation matrices. Biometrika 43, 128-136.
Lawley, D. N. (1959). Tests of significance in canonical analysis. Biometrika 46, 59-66.
Lluch, C. (1971). Consumer demand functions, Spain, 1958-1964. European Economic Review 2, 277-302.
Malinvaud, E. (1980). Statistical Methods of Econometrics, 3rd ed. North-Holland, Amsterdam.
Mariano, R. S. and Sawa, T. (1972). The exact finite-sample distribution of the limited-information maximum likelihood estimator in the case of two included exogenous variables. J. Amer. Statist. Assoc. 67, 159-165.
Meisner, J. F. (1979). The sad fate of the asymptotic Slutsky symmetry test for large systems. Econom. Lett. 2, 231-233.
Meisner, J. F. (1981). Appendix to Theil and Suhm (1981).
Schuster, E. F. (1973). On the goodness-of-fit problem for continuous symmetric distributions. J. Amer. Statist. Assoc. 68, 713-715.
Schuster, E. F. (1975). Estimating the distribution function of a symmetric distribution. Biometrika 62, 631-635.
Theil, H. (1971). Principles of Econometrics. Wiley, New York.
Theil, H. (1980). The System-Wide Approach to Microeconomics. The University of Chicago Press, Chicago, IL.
Theil, H. and Fiebig, D. G. (1984). Exploiting Continuity: Maximum Entropy Estimation of Continuous Distributions. Ballinger, Cambridge, MA.
Theil, H. and Laitinen, K. (1980). Singular moment matrices in applied econometrics. In: P. R. Krishnaiah, ed., Multivariate Analysis V, 629-649. North-Holland, Amsterdam.
Theil, H. and Meisner, J. F. (1980). Simultaneous equation estimation based on maximum entropy moments. Econom. Lett. 5, 339-344.
Theil, H. and Suhm, F. E. (1981). International Consumption Comparisons: A System-Wide Approach. North-Holland, Amsterdam.
Theil, H., Kidwai, S. A., Yalnizoğlu, M. A. and Yellé, K. A. (1982). Estimating characteristics of a symmetric continuous distribution. CEDS Discussion Paper 74, College of Business Administration, University of Florida.
Working, H. (1943). Statistical laws of family expenditure. J. Amer. Statist. Assoc. 38, 43-56.