Journal of Statistical Planning and Inference 115 (2003) 69 – 84
Empirical method of moments and its applications

Jiming Jiang

Department of Statistics, University of California at Davis, 380 Kerr Hall, One Shields Avenue, Davis, CA 95616, USA

Received 27 September 2000; received in revised form 12 November 2001; accepted 11 December 2001
Abstract

In many cases, methods of estimating a vector of parameters are well-developed, but for some reason, such as evaluating the precision of the estimators, knowledge about an additional vector of parameters is needed. In this paper, we propose a method, which we call the empirical method of moments, for estimating such additional parameters. The method proposed here has the following good features: First, it is a simple method, and often results in closed-form expressions of the estimators. Second, it relies on weak distributional assumptions, such as moment conditions. Third, it leads to consistent estimators. The method is applied to various fields, including generalized linear models, time series, analysis of longitudinal data, and mixed linear models. In the latter case a simulation is considered. © 2002 Elsevier Science B.V. All rights reserved.

MSC: 62F10

Keywords: Consistency; Dispersion parameter; EMM; kurtoses
1. Introduction

The purpose of this paper is to introduce a method, which we call the empirical method of moments. It is motivated by problems that arise naturally in practice. In many cases, methods of estimating a vector of parameters, say, $\theta$, are well-developed, but for some reason, such as evaluating the precision of the estimators, knowledge about an additional vector of parameters, say, $\psi$, is needed. The method proposed here has the following good features: First, it is a simple method, and often results in closed-form expressions
of the estimators in terms of the data and the estimator of $\theta$, which is already available. Second, it relies on weak distributional assumptions, such as some moment conditions. Third, it leads to consistent estimators. To illustrate the method, let us begin with some simple examples.

Example 1. Suppose that $X_1, \ldots, X_n$ are i.i.d. observations from a population with mean $\mu$ and variance $\sigma^2$, and the problem of interest is to estimate $\mu$. A well-known estimator is the sample mean, $\hat{\mu} = \bar{X}$. However, since $\mathrm{var}(\bar{X}) = \sigma^2/n$, in order to evaluate the precision of $\hat{\mu}$ one needs knowledge about $\sigma^2$. If $\mu$ were known, then a method of moments estimator of $\sigma^2$ would be

$$\tilde{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2. \qquad (1)$$

Since $\mu$ is unknown, in (1) it is naturally replaced by $\hat{\mu}$. The resulting estimator is called an empirical method of moments estimator, given by

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2. \qquad (2)$$
Note that (2) is the same as the maximum likelihood estimator of $\sigma^2$ when normality is assumed.

Example 2. Consider a linear regression model, which may be viewed as an extension of Example 1:
$$y_i = x_i'\beta + \epsilon_i, \quad i = 1, \ldots, n,$$
where $x_i = (x_{i1}, \ldots, x_{ip})'$ is a vector of known covariates, $\beta$ is a vector of unknown regression coefficients which are of main interest, and $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. errors with mean 0 and variance $\sigma^2$. The model can be expressed as $y = X\beta + \epsilon$, where the $i$th row of $X$ is $x_i'$. Assume that $\mathrm{rank}(X) = p$. Then, the least-squares (LS) estimator of $\beta$ is given by
$$\hat{\beta} = (X'X)^{-1} X'y.$$
Although $\beta$ is of main interest, because $\mathrm{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$, to find the standard errors of the estimators one needs knowledge about $\sigma^2$. If $\beta$ were known, then a method of moments estimator of $\sigma^2$ would be
$$\tilde{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - x_i'\beta)^2. \qquad (3)$$
Now, since $\beta$ is unknown, it is replaced by the LS estimator. The result is an empirical method of moments estimator:
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - x_i'\hat{\beta})^2. \qquad (4)$$
Note that, again, $\hat{\sigma}^2$ is the maximum likelihood estimator of $\sigma^2$ when normality is assumed.
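To make the two-step structure concrete, the following short numpy sketch (added here for illustration; the simulated data are hypothetical) computes the LS estimator and then the EMM estimator (4) for the linear regression of Example 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data for the linear regression model of Example 2
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.uniform(size=(n, p - 1))])
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.standard_t(df=8, size=n)   # non-normal errors

# Step 1: least-squares estimator of beta (does not require sigma^2)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Step 2: EMM estimator of sigma^2, eq. (4): plug beta_hat into the
# method of moments estimator (3)
resid = y - X @ beta_hat
sigma2_hat = np.mean(resid ** 2)

# Standard errors of beta_hat based on sigma2_hat
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
print(beta_hat, sigma2_hat, se)
```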
Of course, in these two examples, methods of estimating $\sigma^2$ are also well-known. For example, unbiased estimators of $\sigma^2$ are easily obtained. However, things may not be so easy in problems more complicated than the above. We will see some examples later on. The empirical method of moments, or EMM, is introduced in the next section under a general framework. We note that, as the preceding examples indicate, there is a connection in general between EMM and maximum likelihood (ML). We also demonstrate that, under suitable conditions, the EMM estimators are consistent. Sections 3–6 are devoted to applications of EMM. These application areas include: generalized linear models; ARMA time series models; analysis of longitudinal data; and mixed linear models. In the last application, the EMM is used to estimate the kurtoses of the random effects and errors. In the mixed model context, methods of estimating the kurtoses are mostly unknown (see Jiang, 1998). Therefore, in Section 7 we use simulation to investigate the finite sample performance of the EMM estimators, and demonstrate the effects of kurtoses on estimation of the standard errors of the variance-component estimators in non-normal mixed linear models. Section 8 contains some concluding remarks. The technical proofs are left to the appendix.

2. Empirical method of moments

Let $\theta$ be a vector of parameters. Suppose that a consistent estimator of $\theta$, $\hat{\theta}$, is available. Let $\psi$ be a vector of additional parameters about which knowledge is needed. Let $\omega = (\theta', \psi')'$, and let $M(\omega, y) = M(\theta, \psi, y)$ be a vector-valued function of the same dimension as $\psi$ that depends on $\omega$ and $y$, a vector of observations. Suppose that $\mathrm{E}M(\omega, y) = 0$ when $\omega$ is the true vector of parameters. Then, if $\theta$ were known, a method of moments estimator of $\psi$ would be obtained by solving
$$M(\theta, \psi, y) = 0 \qquad (5)$$
for $\psi$. Note that this is more general than the classical method of moments, in which the function $M$ is a vector of sample moments minus their expected values. In the econometric literature, this is referred to as the generalized method of moments (e.g., Hansen, 1982; Newey, 1985). Since $\theta$ is unknown, we replace it in (5) by $\hat{\theta}$. The result is called an empirical method of moments (EMM) estimator of $\psi$, denoted by $\hat{\psi}$, which is obtained by solving
$$M(\hat{\theta}, \psi, y) = 0. \qquad (6)$$
Note that here we use the words "an EMM estimator" instead of "the EMM estimator", because sometimes there may be more than one consistent estimator of $\theta$, and each may result in a different EMM estimator of $\psi$. For maximum likelihood estimation, which is a special case of EMM, see Gan and Jiang (1999) for a method dealing with related problems based on hypothesis testing.
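In practice, (6) is simply (5) with $\hat{\theta}$ plugged in, and it can be solved numerically when no closed form is available. The sketch below (an illustration added here, with made-up data; it is not part of the original development) does this for a scalar $\psi$ using a one-dimensional root finder:

```python
import numpy as np
from scipy.optimize import brentq

# Toy version of the two-step EMM recipe, assuming a scalar psi (here
# sigma^2 as in Example 1) so that eq. (6) can be solved by brentq.
rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.5, size=500)   # unknown non-normal population

theta_hat = x.mean()                            # step 1: consistent estimator of theta (= mu)

def M(psi, theta, y):
    """Moment function with E M(theta, psi, y) = 0 at the true parameters."""
    return np.sum((y - theta) ** 2 - psi)

# step 2: solve M(theta_hat, psi, y) = 0 for psi, i.e. eq. (6)
psi_hat = brentq(M, 1e-8, 1e3, args=(theta_hat, x))
print(theta_hat, psi_hat)                       # psi_hat equals the sample variance (2)
```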
In Examples 1 and 2 we see that, under normality, the EMM estimators agree with the ML estimators. In general, ML estimators may be viewed as a special kind of EMM estimators. To see this, let $l(\omega, y) = l(\theta, \psi, y)$ be the log-likelihood function. Then, the ML estimator, $\hat{\omega} = (\hat{\theta}', \hat{\psi}')'$, satisfies $\partial l/\partial\omega = 0$, and hence $\hat{\psi}$, the ML estimator of $\psi$, satisfies
$$\frac{\partial}{\partial\psi}\, l(\hat{\theta}, \psi, y) = 0. \qquad (7)$$
On the other hand, (7) is a special case of (6), in which $M(\theta, \psi, y) = \partial l/\partial\psi$. Note that $\mathrm{E}(\partial l/\partial\psi) = 0$ when $\omega$ is the true vector of parameters.

One important property of the EMM estimators is consistency. Note that the asymptotics here is with respect to $n \to \infty$, where $n$ is the sample size. So, strictly speaking, the vectors $y$, $M(\omega, y)$, etc. all depend on $n$. But for notational simplicity, $n$ will be suppressed as an index. Similarly, a sequence of numbers, say, $\{a_n\}$, will be simply denoted by $\{a\}$. Let $q$ and $r$ be the dimensions of $\theta$ and $\psi$, respectively, and write $\theta = (\theta_i)_{1 \le i \le q}$, $\psi = (\psi_j)_{1 \le j \le r}$, $M(\omega, y) = (M_j(\omega, y))_{1 \le j \le r}$. Let $\{a_j\}$, $1 \le j \le r$, be sequences of positive numbers. Define $M_j^*(\omega, y) = M_j(\omega, y)/a_j$, $1 \le j \le r$, and $M^*(\omega, y) = (M_j^*(\omega, y))_{1 \le j \le r}$. Let $\theta_0$ and $\psi_0$ be the true vectors of parameters. We use $O_P(1)$ to represent a term that is bounded in probability, and $o_P(1)$ a term that $\to 0$ in probability. Also, $|v|$ is the Euclidean norm of a vector $v$.

Theorem 1. Suppose that the following hold: (i) for fixed $\theta$ and $y$, the minimizer of $|M^*(\theta, \psi, y)|$ over $\psi$ exists; (ii) $\hat{\theta}$ is consistent; (iii) $M^*(\theta_0, \psi_0, y) = o_P(1)$; and (iv) for any $\epsilon > 0$, there is $\delta > 0$ such that the following are $O_P(1)$:
$$\sup_{|\theta - \theta_0| \le \delta,\ \psi = \psi_0} \left| \frac{\partial M_j^*}{\partial\theta_i} \right|, \quad 1 \le i \le q,\ 1 \le j \le r,$$
and
$$\left( \inf_{|\theta - \theta_0| \le \delta,\ |\psi - \psi_0| \ge \epsilon} |M^*(\theta, \psi, y)| \right)^{-1}.$$
Let $\hat{\psi}$ be the minimizer of $|M^*(\hat{\theta}, \psi, y)|$ over $\psi$. Then, $\hat{\psi}$ is consistent.

Remark 1. Here it is not required that $\hat{\psi}$ be a solution to $M^*(\hat{\theta}, \psi, y) = 0$, because the solution may not exist. A similar definition has been used, for example, in the context of generalized method of moments estimation in econometric models (e.g., McFadden, 1989). Of course, if the solution does exist, it will be the minimizer.

Remark 2. There is some similarity between Theorem 1 and consistency properties of estimators derived from estimating functions. See, for example, a recent IMS monograph edited by Basawa et al. (1997). However, to the best of our knowledge, no available results would directly apply to the problem considered here. Most of the known results seem to deal with independent observations, in which the
estimating function often has the form of a sum. On the other hand, the left-hand side of (6) does not have a particular form, and Theorem 1 applies to both independent and dependent observations.

Remark 3. The role of the $a_j$'s is normalizing. However, the choice of $a_j$ is quite flexible as long as it has the right order. In fact, in many cases, the $a_j$'s are not needed in computing the EMM estimator. In such cases, the existence of the $a_j$'s is only of theoretical interest. To see this, suppose that there exists a solution $\hat{\psi}$ to $M^*(\hat{\theta}, \psi, y) = 0$. Then $\hat{\psi}$ is obviously the minimizer of $|M^*(\hat{\theta}, \psi, y)|$. On the other hand, it is easy to see that $M^*(\hat{\theta}, \psi, y) = 0$ is equivalent to $M(\hat{\theta}, \psi, y) = 0$, i.e., (6), which does not involve the $a_j$'s.

The proof of Theorem 1 is given in the appendix.

Corollary 1. Under the conditions of Theorem 1, if, in addition, the solution to (6) exists with probability tending to one, then the EMM estimator $\hat{\psi}$ is consistent.

It is trivial to show that the conditions of Theorem 1 are satisfied for both Examples 1 and 2. Of course, it is also straightforward to directly verify the consistency of the EMM estimators in these examples. In fact, because EMM is such a simple method that in many cases it leads to closed-form expressions of the estimators (in terms of the data and $\hat{\theta}$), it is often easier to show consistency directly.

Although EMM is a simple method, in subsequent sections we show that it is a useful method and has many applications.

3. Generalized linear models

Generalized linear models (GLM; McCullagh and Nelder, 1989) are proposed as a natural generalization of linear models. In a GLM, it is assumed that the distribution of the response is a member of a known exponential family (e.g., binomial, Poisson) and the mean of the response is associated with a linear predictor through a link function. According to this definition, for a linear model to fit in GLM as a special case, one has to assume that the distribution of the response is normal. However, as is well known, the definition of a linear model does not have to be associated with normality. In fact, many of the techniques developed for linear models do not require the normality assumption. Therefore, the class of GLM, by the classic definition, is not broad enough to include the linear models. In fact, the two defining characteristics of a linear model without normality are: (1) the observations are independent; and (2) the variance of each observation is a constant. We may define a GLM in a similar way. Let $y_1, \ldots, y_n$ be responses. We assume that the responses are independent, and that $\mu_i = \mathrm{E}(y_i)$ is associated with a linear predictor $\eta_i = x_i'\beta$, where $x_i = (x_{i1}, \ldots, x_{ip})'$ is a vector of known covariates and $\beta$ is a vector of unknown regression coefficients, through a link function $g(\cdot)$, i.e.,
$$g(\mu_i) = \eta_i. \qquad (8)$$
Finally, we assume that the variance of the response is associated with the mean and an additional dispersion parameter $\phi$ in the following way:
$$\mathrm{var}(y_i) = a_i(\phi)\, v(\mu_i), \qquad (9)$$
where $v(\cdot)$ and $a_i(\cdot)$ are known functions. Note that, in many cases, $a_i(\phi) = \phi/w_i$, where $w_i$ is a known weight.

It is known that the algorithm for fitting a GLM (e.g., McCullagh and Nelder, 1989, Section 2.5) depends only on the mean-variance relationship, and therefore can be generalized to fit the GLM defined here (see McCullagh and Nelder, 1989, Section 9.2.3). This leads to an estimator of $\beta$, say, $\hat{\beta}$. In fact, to find $\hat{\beta}$ one does not need to know $\phi$; and, under regularity conditions, $\hat{\beta}$ is consistent. However, in order to evaluate $\mathrm{Var}(\hat{\beta})$ one must know $\phi$. In the following, we show that $\phi$ can be easily estimated by EMM. Consider
$$M(\beta, \phi, y) = \sum_{i=1}^n [(y_i - \mu_i)^2 - a_i(\phi)\, v(\mu_i)].$$
Then, $\mathrm{E}M(\beta, \phi, y) = 0$ when $\beta$, $\phi$ correspond to the true parameters. If we replace $\beta$ by $\hat{\beta}$, we obtain the EMM equation $M(\hat{\beta}, \phi, y) = 0$, or
$$\sum_{i=1}^n a_i(\phi)\, v(\hat{\mu}_i) = \sum_{i=1}^n (y_i - \hat{\mu}_i)^2, \qquad (10)$$
where $\hat{\mu}_i$ is $\mu_i$ with $\beta$ replaced by $\hat{\beta}$. The EMM estimator of $\phi$, $\hat{\phi}$, is the solution to (10). In particular, if $a_i(\phi) = \phi/w_i$, then
$$\hat{\phi} = \frac{\sum_{i=1}^n (y_i - \hat{\mu}_i)^2}{\sum_{i=1}^n v(\hat{\mu}_i)/w_i}. \qquad (11)$$
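As an illustration of (11) (added here; the quasi-Poisson setup and the bare-bones IRLS fit are only one possible choice and are not taken from the paper), the dispersion estimate requires nothing beyond the fitted means:

```python
import numpy as np

def emm_dispersion(y, mu_hat, v, w=None):
    """EMM estimator (11) of the dispersion phi when a_i(phi) = phi / w_i."""
    w = np.ones_like(y, dtype=float) if w is None else w
    return np.sum((y - mu_hat) ** 2) / np.sum(v(mu_hat) / w)

# Illustration: overdispersed counts fitted by a log-linear quasi-Poisson
# model, i.e. v(mu) = mu.  The IRLS loop is a bare-bones fit of beta that,
# as noted in the text, does not require knowledge of phi.
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(size=n)
X = np.column_stack([np.ones(n), x])
mu = np.exp(0.3 + 0.8 * x)
y = rng.negative_binomial(5, 5 / (5 + mu))              # E(y) = mu, var(y) > mu

beta = np.zeros(2)
for _ in range(50):                                     # IRLS for the log link
    eta = X @ beta
    m = np.exp(eta)
    z = eta + (y - m) / m                               # working response
    W = m                                               # working weights
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

mu_hat = np.exp(X @ beta)
phi_hat = emm_dispersion(y, mu_hat, v=lambda m: m)
print(beta, phi_hat)                                    # phi_hat > 1 indicates overdispersion
```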
4. ARMA models

In time series analysis, autoregressive moving average models, or ARMA models, play an important role (e.g., Brockwell and Davis, 1991). Suppose that $\{y_t,\ t = 0, \pm 1, \pm 2, \ldots\}$ is a stationary time series which satisfies
$$y_t - a_1 y_{t-1} - \cdots - a_p y_{t-p} = e_t - b_1 e_{t-1} - \cdots - b_q e_{t-q}, \qquad (12)$$
where $\{e_t\}$ is a sequence of i.i.d. random variables with mean 0 and variance $\sigma^2$; the $a$'s and $b$'s are unknown coefficients such that $a_p$ and $b_q$ are nonzero; and the polynomial $a(z) = 1 - a_1 z - \cdots - a_p z^p$ has no root in or on the unit circle or in common with those of $b(z) = 1 - b_1 z - \cdots - b_q z^q$. Such a model is known as an ARMA($p, q$) model.
When normality is not assumed, one method of estimating the coefficients $a_1, \ldots, a_p$ and $b_1, \ldots, b_q$ is least squares (LS; e.g., Brockwell and Davis, 1991, Section 8.7). To compute the LS estimators one does not need to know $\sigma^2$; and, under suitable conditions (but without normality), the LS estimators are consistent (e.g., Brockwell and Davis, 1991, Section 8.8). On the other hand, the (asymptotic) covariance matrix of the LS estimators depends on $\sigma^2$, so $\sigma^2$ is important not for point estimation but for, say, interval estimation of the coefficients. We now show that, again, an EMM estimator of $\sigma^2$ can be found easily. Suppose that $y_1, \ldots, y_n$ have been observed, where $n > p$. Consider
$$M(a, b, \sigma^2, y) = \sum_{t=p+1}^n [(y_t - a_1 y_{t-1} - \cdots - a_p y_{t-p})^2 - \sigma^2 (1 + b_1^2 + \cdots + b_q^2)],$$
where $a = (a_i)_{1 \le i \le p}$ and $b = (b_j)_{1 \le j \le q}$. Then, $\mathrm{E}M(a, b, \sigma^2, y) = 0$ when $a$, $b$, $\sigma^2$ correspond to the true parameters. Let $\hat{a}$ and $\hat{b}$ be the LS estimators of $a$ and $b$, respectively. Then, the EMM estimator of $\sigma^2$ is obtained by solving $M(\hat{a}, \hat{b}, \sigma^2, y) = 0$, i.e.,
$$\hat{\sigma}^2 = \frac{\sum_{t=p+1}^n (y_t - \hat{a}_1 y_{t-1} - \cdots - \hat{a}_p y_{t-p})^2}{(n - p)(1 + \hat{b}_1^2 + \cdots + \hat{b}_q^2)}. \qquad (13)$$
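A sketch of (13) is given below (added for illustration; for brevity the true ARMA coefficients stand in for the LS estimates $\hat{a}$, $\hat{b}$, which in practice would come from a least squares fit):

```python
import numpy as np

def emm_sigma2(y, a_hat, b_hat):
    """EMM estimator (13) of the innovation variance of an ARMA(p, q) model."""
    p, n = len(a_hat), len(y)
    # 'AR residuals' y_t - a_1 y_{t-1} - ... - a_p y_{t-p}, t = p+1, ..., n
    resid = y[p:] - sum(a_hat[i] * y[p - 1 - i:n - 1 - i] for i in range(p))
    return np.sum(resid ** 2) / ((n - p) * (1.0 + np.sum(np.asarray(b_hat) ** 2)))

# Illustration on a simulated ARMA(1, 1) series
rng = np.random.default_rng(3)
n, a1, b1, sigma = 2000, 0.6, 0.4, 1.3
e = sigma * rng.standard_normal(n + 1)
y = np.zeros(n)
for t in range(n):
    y[t] = (a1 * y[t - 1] if t > 0 else 0.0) + e[t + 1] - b1 * e[t]

print(emm_sigma2(y, a_hat=[a1], b_hat=[b1]))   # should be close to sigma**2 = 1.69
```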
5. Analysis of longitudinal data

The defining feature of a longitudinal study is that individuals are measured repeatedly over time. For simplicity, let us first assume that the repeated measurements are collected over a common set of times and there are no missing observations. Let $y_{ij}$, $j = 1, \ldots, d$, be the measurements collected from the $i$th individual over times $t_1, \ldots, t_d$, and $y_i = (y_{ij})_{1 \le j \le d}$, $i = 1, \ldots, m$. It is assumed that the vectors $y_1, \ldots, y_m$ are independent with
$$\mathrm{E}(y_i) = X_i\beta \quad\text{and}\quad \mathrm{Var}(y_i) = V_0, \qquad (14)$$
where $X_i$ is a matrix of (known) covariates, $\beta$ is an unknown vector of regression coefficients, and $V_0$ is an unknown covariance matrix. Note that, unlike in a linear regression model (i.e., Example 2), the matrix $V_0$ is not equal to $\sigma^2 I$, where $\sigma^2$ is an unknown variance and $I$ is an identity matrix. There may be covariances in $V_0$.

In the analysis of longitudinal data, the vector $\beta$ is often of main interest. A method of estimating $\beta$ (without assuming normality) is called weighted least squares (WLS; e.g., Diggle et al., 1996, Section 4.3). The WLS estimator of $\beta$ is defined as the minimizer of the quadratic form
$$(y - X\beta)' W (y - X\beta), \qquad (15)$$
where $y = (y_1', \ldots, y_m')'$, $X = (X_1' \cdots X_m')'$, and $W$ is a symmetric weight matrix. Suppose that the matrix $X'WX$ is nonsingular. Then, the WLS estimator is given by
$$\hat{\beta}_W = (X'WX)^{-1} X'Wy \qquad (16)$$
with covariance matrix
$$\mathrm{Var}(\hat{\beta}_W) = (X'WX)^{-1} X'WVWX (X'WX)^{-1}, \qquad (17)$$
where $V = \mathrm{diag}(V_0, \ldots, V_0)$. One special case of the WLS is ordinary least squares, in which $W = I$. This gives $\hat{\beta}_I = (X'X)^{-1} X'y$, which is the same as the least squares estimator in linear regression (i.e., Example 2), but with a different covariance matrix: $\mathrm{Var}(\hat{\beta}_I) = (X'X)^{-1} X'VX (X'X)^{-1}$. To compute the WLS estimator one does not need to know $V_0$. In addition, under suitable conditions, the WLS estimator is consistent. Note that these results do not depend on the assumptions made so far, namely that the repeated measures are taken over a common set of times and there are no missing observations; this is important because later we shall consider a more general case. On the other hand, it is clear that an adequate assessment of the variation of $\hat{\beta}_W$ cannot be obtained without knowledge about $V_0$. However, it is straightforward to obtain an EMM estimator of $V_0$. This is given by
$$\hat{V}_0 = \frac{1}{m} \sum_{i=1}^m (y_i - X_i\hat{\beta}_W)(y_i - X_i\hat{\beta}_W)'. \qquad (18)$$
Note that this is an EMM estimator of the entire covariance matrix, which may involve at most $d(d+1)/2$ parameters.

We now consider a more general case: the repeated measurements are supposed to be collected over a common set of times, but, for some reason, in the end there are missing observations here and there. Without loss of generality, let $S = \{1, \ldots, d\}$ be the set of times over which measurements are to be collected. Suppose that the measurements from the $i$th individual are actually collected over a subset $S_i$ of $S$, i.e., $y_{ij}$, $j \in S_i$, $1 \le i \le m$. Let $d_i$ denote the cardinality of $S_i$. Instead of (14), we now assume that $y_1, \ldots, y_m$ are independent, where $y_i = (y_{ij})_{j \in S_i}$, with
$$\mathrm{E}(y_i) = X_i\beta \quad\text{and}\quad \mathrm{Var}(y_i) = V_i, \qquad (19)$$
where $X_i$ has $d_i$ rows with the $j$th row being $x_{ij}'$, $j \in S_i$; $V_i = (v_{jk})_{j,k \in S_i}$ has dimension $d_i \times d_i$; and $v_{jk} = \mathrm{cov}(y_{ij}, y_{ik})$. Note that here we have made the assumption that the covariances do not depend on $i$ (otherwise they may not be consistently estimable). Then, the covariance matrix of the WLS estimator is given by (17), where $V = \mathrm{diag}(V_1, \ldots, V_m)$, and $X$ is defined the same way as before given the $X_i$'s.

We now consider the EMM estimation of the $V_i$'s. It suffices to find the EMM estimators of $v_{jk}$, $1 \le j \le k \le d$. Let $\psi = (v_{jk})_{1 \le j \le k \le d}$. Let $M(\beta, \psi, y)$ be the array with $d(d+1)/2$ elements whose $(j, k)$ element is
$$\sum_{i:\, S_i \ni j,k} [(y_{ij} - x_{ij}'\beta)(y_{ik} - x_{ik}'\beta) - v_{jk}], \quad 1 \le j \le k \le d. \qquad (20)$$
Note that the summation in (20) is over all $i$'s such that $S_i$ contains both $j$ and $k$. Then, $\mathrm{E}M(\beta, \psi, y) = 0$ when $\beta$ and $\psi$ correspond to the true parameters. The EMM estimator of $\psi$ is obtained by solving $M(\hat{\beta}_W, \psi, y) = 0$, i.e., by setting (20) to zero with $\beta$ replaced by $\hat{\beta}_W$, $1 \le j \le k \le d$. This gives
$$\hat{v}_{jk} = \frac{1}{m_{jk}} \sum_{i:\, S_i \ni j,k} (y_{ij} - x_{ij}'\hat{\beta}_W)(y_{ik} - x_{ik}'\hat{\beta}_W), \quad 1 \le j \le k \le d, \qquad (21)$$
where $m_{jk}$ is the cardinality of the set $\{1 \le i \le m : S_i \ni j, k\}$. It is easy to see that, when there are no missing observations, i.e., when $S_i = S$, $1 \le i \le m$, (21) reduces to (18).
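The estimator (21) is easy to assemble once $\hat{\beta}_W$ is available. The following sketch (an added illustration; the data structures are hypothetical) loops over individuals and accumulates the cross-products for each observed pair of time points:

```python
import numpy as np

def emm_covariances(y_list, X_list, S_list, beta_hat, d):
    """EMM estimators (21) of the common covariances v_{jk} from incomplete
    longitudinal data.  y_list[i], X_list[i] hold the observed responses and
    covariate rows of individual i, S_list[i] lists the corresponding time
    indices (subsets of {0, ..., d-1}), and beta_hat is a previously
    computed WLS estimator."""
    V_hat = np.zeros((d, d))
    counts = np.zeros((d, d))
    for y_i, X_i, S_i in zip(y_list, X_list, S_list):
        r = y_i - X_i @ beta_hat                 # residuals for individual i
        for a, j in enumerate(S_i):
            for b, k in enumerate(S_i):
                V_hat[j, k] += r[a] * r[b]
                counts[j, k] += 1.0
    return V_hat / counts                        # pairs never observed together give nan

# With no missing data (S_i = {0, ..., d-1} for all i) this reduces to the
# estimator (18), V_hat = (1/m) * sum_i r_i r_i'.
```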
6. Mixed linear models

In the previous sections, the EMM has been used to estimate variances, covariances, or other dispersion parameters (e.g., in the special case of a GLM where the responses are from a gamma distribution, $\phi^{-1}$ is the shape parameter). In this section, we consider a case that is quite different. Consider a mixed linear model:
$$y = X\beta + Z_1\alpha_1 + \cdots + Z_s\alpha_s + \epsilon, \qquad (22)$$
where $y$ is an $N \times 1$ vector of observations; $X$ is an $N \times p$ known matrix of covariates, and, without loss of generality, is assumed to have full rank $p$; $\beta$ is a $p \times 1$ vector of unknown constants (the fixed effects); $Z_u$ is an $N \times m_u$ known matrix; $\alpha_u$ is an $m_u \times 1$ vector of unobservable i.i.d. random variables (the random effects); and $\epsilon$ is an $N \times 1$ vector of i.i.d. errors. Furthermore, it is assumed that the random effects and errors have mean 0, and variances $\sigma_0^2, \sigma_1^2, \ldots, \sigma_s^2$ corresponding to the components of $\epsilon, \alpha_1, \ldots, \alpha_s$, respectively; and $\alpha_1, \ldots, \alpha_s, \epsilon$ are independent.

One important feature here is that we do not assume that the random effects and errors are normally distributed. Nevertheless, it has been shown (Richardson and Welsh, 1994; Jiang, 1996, 1997) that the Gaussian ML estimators and restricted maximum likelihood (REML) estimators of the variance components $\sigma_u^2$, $0 \le u \le s$, and of the fixed effects $\beta$ are consistent and asymptotically normal. Note that here we assume that the dimension of $\beta$, $p$, is fixed. Otherwise, there may be a difference in the asymptotic behaviors of the ML and REML estimators (Jiang, 1996). Clearly, such results are useful, because in practice one is almost never sure about normality. However, to apply such methods there is one problem standing in the way. Under the normality assumption, the asymptotic covariance matrix of the ML or REML estimators of the variance components depends only on the variance components themselves (e.g., Searle et al., 1992, Section 6). However, if normality does not hold, such an asymptotic covariance matrix may depend on additional parameters, of which consistent estimators are not available. To make it simpler, in the following we assume that
$$\mathrm{E}\epsilon_1^3 = 0 \quad\text{and}\quad \mathrm{E}\alpha_{u1}^3 = 0, \quad 1 \le u \le s. \qquad (23)$$
Such conditions hold if, in particular, the distributions of the random effects and errors are symmetric. Then, it can be shown (e.g., Jiang, 1998) that the asymptotic covariance matrix, which is the same for both the ML and REML estimators, depends not only on the variance components but also on the kurtoses: $\kappa_0 = \mathrm{E}\epsilon_1^4 - 3\sigma_0^4$, $\kappa_u = \mathrm{E}\alpha_{u1}^4 - 3\sigma_u^4$, $1 \le u \le s$. Unlike variance component estimation, estimation of the kurtoses is not well-developed in the mixed model context. Jiang (1998)
proposed a method of estimating the kurtoses, which is based on the empirical best linear unbiased predictors, or EBLUP, of the random effects. One problem with the EBLUP approach is that the estimators of the kurtoses are not consistent unless the number of observations corresponding to each individual random effect goes to infinity. To make this more clear, let us consider a simple example.

Example 3. Consider the following simple random effects model: $y_{ij} = \mu + \alpha_i + \epsilon_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, where $\mu$ is an unknown mean, the $\alpha_i$'s are i.i.d. random effects with mean 0 and variance $\sigma_1^2$, and the $\epsilon_{ij}$'s are i.i.d. errors with mean 0 and variance $\sigma_0^2$.

According to Jiang (1998), for consistency of the EBLUP estimators of $\kappa_0 = \mathrm{E}\epsilon_{11}^4 - 3\sigma_0^4$ and $\kappa_1 = \mathrm{E}\alpha_1^4 - 3\sigma_1^4$ in Example 3 one needs $m \to \infty$ and $n \to \infty$, which is not always realistic. In fact, in many cases, $n$ is small, even if $m$ is large. In the following, we consider the EMM estimation of the kurtoses, which is consistent under more realistic conditions.

For any matrix $A = (a_{ij})$, we define $\|A\|_4 = (\sum_{i,j} a_{ij}^4)^{1/4}$. Similarly, if $a = (a_i)$ is a vector, then $\|a\|_4 = (\sum_i a_i^4)^{1/4}$. If $L$ is a linear space, then $L^{\perp}$ represents the linear space $\{a : a'b = 0,\ \forall b \in L\}$. If $L_1$, $L_2$ are linear spaces such that $L_1 \subset L_2$, then $L_2 \ominus L_1$ represents the linear space $\{a : a \in L_2,\ a'b = 0,\ \forall b \in L_1\}$. If $M_1, \ldots, M_k$ are matrices with the same number of rows, then $\mathcal{L}(M_1, \ldots, M_k)$ represents the linear space spanned by the columns of $M_1, \ldots, M_k$.

Suppose that the matrices $Z_1, \ldots, Z_s$ have been suitably ordered such that
$$L_u \ne \{0\}, \quad 0 \le u \le s, \qquad (24)$$
where $L_0 = \mathcal{L}(Z_1, \ldots, Z_s)^{\perp}$, $L_u = \mathcal{L}(Z_u, \ldots, Z_s) \ominus \mathcal{L}(Z_{u+1}, \ldots, Z_s)$, $1 \le u \le s - 1$, and $L_s = \mathcal{L}(Z_s)$. Let $C_u$ be a matrix whose columns constitute a basis of $L_u$, $0 \le u \le s$. With the convention $Z_0 = I$, we define $a_{uv} = \|Z_v'C_u\|_4^4$, $0 \le v \le u \le s$. It is easy to see that, under (24), $a_{uu} > 0$, $0 \le u \le s$. Let $n_u$ be the number of columns of $C_u$, and $c_{uk}$ the $k$th column of $C_u$, $1 \le k \le n_u$, $0 \le u \le s$. Define
$$b_u(\sigma^2) = 3 \sum_{k=1}^{n_u} \left( \sum_{v=0}^u |Z_v'c_{uk}|^2 \sigma_v^2 \right)^2, \quad 0 \le u \le s,$$
where $\sigma^2 = (\sigma_v^2)_{0 \le v \le s}$. Let $\kappa = (\kappa_v)_{0 \le v \le s}$. Consider $M(\beta, \sigma^2, \kappa, y) = (M_u(\beta, \sigma^2, \kappa, y))_{0 \le u \le s}$, where
$$M_u(\beta, \sigma^2, \kappa, y) = \|C_u'(y - X\beta)\|_4^4 - \sum_{v=0}^u a_{uv}\kappa_v - b_u(\sigma^2), \quad 0 \le u \le s.$$
Then, by the following lemma and the definition of the $C_u$'s, it is easy to show that $\mathrm{E}M(\beta, \sigma^2, \kappa, y) = 0$ when $\beta$, $\sigma^2$, $\kappa$ correspond to the true parameters (for details see the appendix). Thus, a set of EMM estimators can be easily obtained by solving $M(\hat{\beta}, \hat{\sigma}^2, \kappa, y) = 0$, where $\hat{\beta}$ and $\hat{\sigma}^2$ are the ML or REML estimators. The EMM
estimators can be computed recursively:
$$\hat{\kappa}_0 = a_{00}^{-1}\hat{d}_0, \qquad \hat{\kappa}_u = a_{uu}^{-1}\hat{d}_u - \sum_{v=0}^{u-1} \frac{a_{uv}}{a_{uu}}\hat{\kappa}_v, \quad 1 \le u \le s, \qquad (25)$$
where $\hat{d}_u = \|C_u'(y - X\hat{\beta})\|_4^4 - b_u(\hat{\sigma}^2)$, $0 \le u \le s$.
Lemma 1. Let $\xi_1, \ldots, \xi_n$ be independent random variables such that $\mathrm{E}\xi_i = 0$ and $\mathrm{E}\xi_i^4 < \infty$, and let $\lambda_1, \ldots, \lambda_n$ be constants. Then,
$$\mathrm{E}\left(\sum_{i=1}^n \lambda_i \xi_i\right)^4 = 3\left(\sum_{i=1}^n \lambda_i^2\, \mathrm{var}(\xi_i)\right)^2 + \sum_{i=1}^n \lambda_i^4\, \{\mathrm{E}\xi_i^4 - 3[\mathrm{var}(\xi_i)]^2\}.$$
Example (continued). The model can be written as $y = X\mu + Z\alpha + \epsilon$, where $X = 1_m \otimes 1_n$ and $Z = I_m \otimes 1_n$. Here $1_m$ represents the $m$-dimensional vector of 1's, $I_m$ the $m$-dimensional identity matrix, and $\otimes$ the Kronecker product. Let
$$K_n = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ -1 & 0 & \cdots & 0 \\ 0 & -1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & -1 \end{pmatrix}_{n \times (n-1)}.$$
Then, it is easy to show that $C_0 = I_m \otimes K_n$, $C_1 = Z = I_m \otimes 1_n$. It follows from (25) that, in closed form,
$$\hat{\kappa}_0 = \frac{1}{2m(n-1)} \sum_{i=1}^m \sum_{j=2}^n (y_{i1} - y_{ij})^4 - 6\hat{\sigma}_0^4,$$
$$\hat{\kappa}_1 = \frac{1}{mn^4} \sum_{i=1}^m (y_{i\cdot} - n\hat{\mu})^4 - \frac{1}{2mn^3(n-1)} \sum_{i=1}^m \sum_{j=2}^n (y_{i1} - y_{ij})^4 - \frac{3}{n^2}\left(1 - \frac{2}{n}\right)\hat{\sigma}_0^4 - \frac{6}{n}\,\hat{\sigma}_0^2\hat{\sigma}_1^2 - 3\hat{\sigma}_1^4,$$
where $y_{i\cdot} = \sum_{j=1}^n y_{ij}$, and $\hat{\mu}$, $\hat{\sigma}_0^2$, and $\hat{\sigma}_1^2$ are the ML or REML estimators. It is easy to show, either by verifying the conditions of Theorem 1 or by arguing directly, that the EMM estimators are consistent provided only that $m \to \infty$. The only requirement for $n$ is that $n \ge 2$, but the latter is reasonable, because otherwise one cannot separate $\alpha$ and $\epsilon$.
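A short numpy sketch of these closed forms is given below (added for illustration; for simplicity the true $\mu$ and variance components are plugged in where the ML or REML estimates would normally be used):

```python
import numpy as np

def emm_kurtoses(y, mu_hat, s0sq_hat, s1sq_hat):
    """EMM estimators of kappa_0, kappa_1 for the balanced one-way random
    effects model of Example 3, using the closed forms following (25).
    y is an m x n array; mu_hat, s0sq_hat, s1sq_hat play the role of the
    (RE)ML estimates of mu, sigma_0^2, sigma_1^2."""
    m, n = y.shape
    diff4 = np.sum((y[:, :1] - y[:, 1:]) ** 4)           # sum_i sum_{j>=2} (y_i1 - y_ij)^4
    k0_hat = diff4 / (2 * m * (n - 1)) - 6 * s0sq_hat ** 2
    tot4 = np.sum((y.sum(axis=1) - n * mu_hat) ** 4)     # sum_i (y_i. - n*mu_hat)^4
    k1_hat = (tot4 / (m * n ** 4)
              - diff4 / (2 * m * n ** 3 * (n - 1))
              - (3 / n ** 2) * (1 - 2 / n) * s0sq_hat ** 2
              - (6 / n) * s0sq_hat * s1sq_hat
              - 3 * s1sq_hat ** 2)
    return k0_hat, k1_hat

# Small check on simulated data with double-exponential effects and errors:
# the DE excess kurtosis is 3, so with unit variances kappa_0 = kappa_1 = 3.
rng = np.random.default_rng(4)
m, n = 2000, 4
alpha = rng.laplace(scale=1 / np.sqrt(2), size=(m, 1))   # variance 1
eps = rng.laplace(scale=1 / np.sqrt(2), size=(m, n))     # variance 1
y = 1.0 + alpha + eps
print(emm_kurtoses(y, mu_hat=y.mean(), s0sq_hat=1.0, s1sq_hat=1.0))  # roughly (3, 3)
```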
7. Some simulations

7.1. A simulation regarding Section 3

First consider an example of estimating the dispersion parameter in a GLM. Let $y_1, \ldots, y_n$ be independent observations such that $y_i \sim G(\mu_i, \nu)$, where $G(\mu, \nu)$ means the gamma distribution with density
$$\frac{1}{\Gamma(\nu)} \left(\frac{\nu}{\mu}\right)^{\nu} y^{\nu - 1} \exp\left(-\frac{\nu y}{\mu}\right)$$
(e.g., McCullagh and Nelder, 1989, Section 8.2). The mean $\mu_i$ is associated with a linear predictor $\eta_i = \beta_0 + \beta_1 x_i$ by the canonical link: $\eta_i = \mu_i^{-1}$. In this simulation, we take $\beta_0 = 0.2$, $\beta_1 = 0.5$, and $\nu = 1.0$. The sample size $n$ is fixed at 100. The $x_i$'s are generated from a Uniform(0.6, 2.6) distribution, or, equivalently, the $\eta_i$'s are generated from Uniform(0.5, 1.5). Once the $x_i$'s are generated, they are fixed throughout the simulation. An existing method of estimating $\phi = \nu^{-1}$ is a moment estimator:
$$\tilde{\phi} = \frac{1}{n - 2} \sum_{i=1}^n \left(\frac{y_i - \hat{\mu}_i}{\hat{\mu}_i}\right)^2, \qquad (26)$$
where $\hat{\mu}_i = \hat{\eta}_i^{-1}$, $\hat{\eta}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, and the $\hat{\beta}$'s are the MLE. In fact, this estimator is preferred over the ML estimator of $\phi$ for reasons given in McCullagh and Nelder (1989, Section 8.3.6). We compare the performance of the EMM estimator $\hat{\phi}$ given by (11) and that of $\tilde{\phi}$. Our results, based on 1000 simulations, show that, in this case, the mean and s.d. for $\hat{\phi}$ are 0.95 and 0.20, respectively, while those for $\tilde{\phi}$ are 0.98 and 0.19, respectively. So $\tilde{\phi}$ performs slightly better than $\hat{\phi}$, especially in terms of bias. This suggests that some adjustment of degrees of freedom (d.f.) may be necessary to reduce the bias of $\hat{\phi}$. Note that the d.f. issue has been taken into account in (26), which is the generalized Pearson's $X^2$ statistic, but it is less straightforward to do so with the EMM. This would be a topic for future research.
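For readers who wish to see the two estimators side by side, the following sketch (our own illustration, not the author's simulation code; the statsmodels fit is just one possible way to obtain the MLE of $\beta$) computes $\hat{\phi}$ and $\tilde{\phi}$ on a single simulated data set:

```python
import numpy as np
import statsmodels.api as sm

# EMM estimator (11) versus the Pearson-type moment estimator (26) of
# phi = 1/nu in a gamma GLM with the canonical (inverse) link.
rng = np.random.default_rng(5)
n, beta0, beta1, nu = 100, 0.2, 0.5, 1.0
x = rng.uniform(0.6, 2.6, size=n)
mu = 1.0 / (beta0 + beta1 * x)                        # inverse link
y = rng.gamma(shape=nu, scale=mu / nu, size=n)        # G(mu_i, nu): mean mu_i, shape nu

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Gamma()).fit()  # default link is the inverse link
mu_hat = fit.fittedvalues

# EMM estimator (11) with v(mu) = mu^2 and w_i = 1
phi_emm = np.sum((y - mu_hat) ** 2) / np.sum(mu_hat ** 2)
# moment estimator (26), with p = 2 regression coefficients
phi_mom = np.sum(((y - mu_hat) / mu_hat) ** 2) / (n - 2)
print(phi_emm, phi_mom)                               # both should be near phi = 1/nu = 1
```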
7.2. A simulation regarding Example 3

Consider, once again, the random effects model of Example 3. Following Hartley and Rao (1967), we consider the estimation of the following parameterization of the variance components: $\lambda = \sigma_0^2$, $\gamma = \sigma_1^2/\sigma_0^2$. The true parameters are $\mu = \lambda = \gamma = 1$. Two sample size configurations are considered: $m = 50$, $n = 4$; and $m = 100$, $n = 2$. Note that $n$ is small in both cases, so that the EBLUP method of Jiang (1998) does not apply for estimating the kurtoses.

One important feature of this simulation is that normality is not assumed. In fact, four combinations of nonnormal distributions for the random effects and errors are considered. Let $F$ and $G$ be the distributions of the random effects and errors, respectively. Then, in each case $F$ ($G$) is either a $t$-distribution with eight degrees of freedom ($t_8$) or a double exponential distribution (DE). All distributions are suitably scaled so that they have the variances specified above.
Table 1
Asym./simul. variances of $\hat{\lambda}$

m     n    F    G    True asym.   Mean asym.   Mean asym. (Gaussian)   Simulated
50    4    t8   t8   0.017        0.017        0.010                   0.019
100   2    t8   t8   0.017        0.018        0.010                   0.028
50    4    DE   t8   0.017        0.018        0.010                   0.021
100   2    DE   t8   0.017        0.016        0.010                   0.024
50    4    t8   DE   0.025        0.028        0.011                   0.030
100   2    t8   DE   0.025        0.026        0.010                   0.037
50    4    DE   DE   0.025        0.024        0.010                   0.027
100   2    DE   DE   0.025        0.025        0.010                   0.035
Table 2
Asym./simul. variances of $\hat{\gamma}$

m     n    F    G    True asym.   Mean asym.   Mean asym. (Gaussian)   Simulated
50    4    t8   t8   0.070        0.070        0.047                   0.119
100   2    t8   t8   0.035        0.040        0.024                   0.120
50    4    DE   t8   0.100        0.126        0.051                   0.167
100   2    DE   t8   0.050        0.058        0.025                   0.137
50    4    t8   DE   0.070        0.076        0.046                   0.125
100   2    t8   DE   0.035        0.040        0.024                   0.132
50    4    DE   DE   0.100        0.116        0.048                   0.166
100   2    DE   DE   0.050        0.064        0.025                   0.158
According to Jiang (1998, p. 871), the asymptotic variance of $\sqrt{mn}(\hat{\lambda} - \lambda)$ is $\lambda^2(2 + k_0)$, and that of $\sqrt{m}(\hat{\gamma} - \gamma)$ is $\gamma^2(2 + k_1)$, where $k_0$ and $k_1$ are the normalized kurtoses: $k_0 = \mathrm{E}\epsilon_{11}^4/\sigma_0^4 - 3$ and $k_1 = \mathrm{E}\alpha_1^4/\sigma_1^4 - 3$. The normalized kurtoses for $t_8$ and DE are 1.5 and 3, respectively.

In each of the four combinations of distributions and sample size configurations, one thousand data sets are generated. For each data set, the REML estimators of $\lambda$ and $\gamma$ are obtained. Also obtained are the estimated asymptotic variances of the REML estimators based on the above results, where the kurtoses are estimated by EMM (see the previous section). The means of the estimated asymptotic variances over the 1000 data sets are computed. These results are compared with three things: (1) the true asymptotic variances; (2) the simulated variances of the REML estimators; and (3) the means of the estimated asymptotic variances without the kurtosis corrections (i.e., under normality). The latter comparison indicates how seriously one could be wrong when mistakenly using the Gaussian standard error estimators. The results are shown in Tables 1 and 2.

It is seen that in all cases the mean asymptotic variance with kurtosis correction is very close to the true asymptotic variance, while that without kurtosis correction (i.e., Gaussian) is not so close. Our simulation results are consistent with the fact that the Gaussian asymptotic variance corresponds to zero kurtosis, and therefore is
smaller than the true asymptotic variance in all cases considered here. On the other hand, especially for estimating $\gamma$ when $m = 100$ and $n = 2$, the simulated variances are significantly larger than the asymptotic variances. This is probably due to the fact that the asymptotic variance is supposed to be accurate when the sample size is sufficiently large, while the sample size here is not considered very large. Nevertheless, one can see the improvement as the sample size increases. Note that, although the total sample size is 200 in both cases, there is a difference in estimating $\lambda$ and $\gamma$. The sample size for estimating $\lambda$ is 200, while the "effective" sample size for estimating $\gamma$ is smaller, although it might be a little oversimplified to say that it is 50 or 100. As a result, the simulated variances are closer to the asymptotic variances in the case of estimating $\lambda$ than in the case of estimating $\gamma$.
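The "True asym." columns of Tables 1 and 2 follow directly from these formulas; a few lines of arithmetic (added here as a check, assuming the variance expressions quoted above) reproduce them up to rounding:

```python
# var(lambda_hat) is approximately lambda^2 (2 + k0) / (m n) and
# var(gamma_hat) is approximately gamma^2 (2 + k1) / m, with normalized
# kurtosis k = 1.5 for t_8 and k = 3 for DE.
lam = gam = 1.0
for m, n in [(50, 4), (100, 2)]:
    for dist, k in [("t8", 1.5), ("DE", 3.0)]:
        v_lambda = lam ** 2 * (2 + k) / (m * n)   # k0 = kurtosis of the errors (G)
        v_gamma = gam ** 2 * (2 + k) / m          # k1 = kurtosis of the random effects (F)
        print(f"m={m}, n={n}, {dist}: var(lambda_hat)={v_lambda:.4f}, var(gamma_hat)={v_gamma:.4f}")
```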
8. Concluding remarks

In many cases, there is a vector of parameters, say, $\theta$, which is of primary interest. Quite often, an estimator of $\theta$, say, $\hat{\theta}$, is available, but $\mathrm{Var}(\hat{\theta})$ depends on, in addition to $\theta$, another vector of parameters, say, $\psi$. So, one question might be: why not estimate $\theta$ and $\psi$ together? Consider, for example, the mixed linear model (22). To simultaneously estimate the variance components and kurtoses, one has to deal with twice as many parameters at the same time (without counting the fixed effects). This means that one has to simultaneously solve twice as many nonlinear equations. More importantly, in this case there is no method available to estimate the variance components and kurtoses simultaneously. Also, note that some methods of simultaneous estimation, such as ML, require full specification of the distribution of the data given the parameters, while this is not required by EMM. Furthermore, when a well-known estimator of $\theta$ is available, one may not want to reestimate $\theta$. In such a case, it is natural to use EMM.

Unlike the method of moments, the EMM does not take into account the loss of degrees of freedom in estimating $\theta$ (see Examples 1 and 2). In this way, it is similar to ML. Thus, one would expect the EMM estimators to have a larger bias than the method of moments estimators. However, because of consistency, the bias of the EMM estimators goes away as the sample size increases.

Finally, because EMM is such a natural approach, it has been used on many occasions to produce natural estimators. Therefore, we do not pretend to be the "inventor" of EMM. But we claim to be the first to systematically study its properties and applications, and the first to bring it to the estimation of kurtoses in nonnormal mixed linear models.
Acknowledgements

The research was supported, in part, by National Science Foundation Grant SES-9978101 and National Security Agency Grant MDA904-98-1-0038. The author wishes to thank Prof. Bing Li for helpful discussion. The author is also grateful to two referees for their constructive comments.
Appendix. Proofs and details

Proof of Theorem 1. For any $\epsilon > 0$, let $\delta$ be chosen according to (iv). By Taylor expansion, we have, for any $1 \le j \le r$,
$$M_j^*(\hat{\theta}, \psi_0, y) = M_j^*(\theta_0, \psi_0, y) + \sum_{i=1}^q \left. \frac{\partial M_j^*}{\partial\theta_i} \right|_{\theta^*,\, \psi_0} (\hat{\theta}_i - \theta_{0i}), \qquad \mathrm{(A.1)}$$
where $\theta^*$ lies between $\theta_0$ and $\hat{\theta}$ and may depend on $j$. Thus, if $|\hat{\theta} - \theta_0| \le \delta$, we have, by (A.1) and the definition of $\hat{\psi}$,
$$|M^*(\hat{\theta}, \hat{\psi}, y)| \le |M^*(\hat{\theta}, \psi_0, y)| \le \sum_{j=1}^r |M_j^*(\hat{\theta}, \psi_0, y)| \le \sum_{j=1}^r |M_j^*(\theta_0, \psi_0, y)| + \sum_{j=1}^r \sum_{i=1}^q \left( \sup_{|\theta - \theta_0| \le \delta,\ \psi = \psi_0} \left| \frac{\partial M_j^*}{\partial\theta_i} \right| \right) |\hat{\theta}_i - \theta_{0i}|.$$
It follows, by (ii)–(iv), that
$$|M^*(\hat{\theta}, \hat{\psi}, y)| = o_P(1). \qquad \mathrm{(A.2)}$$
Also, if $|\hat{\theta} - \theta_0| \le \delta$ and $|\hat{\psi} - \psi_0| \ge \epsilon$, we have $|M^*(\hat{\theta}, \hat{\psi}, y)| \ge \Delta$, where
$$\Delta = \inf_{|\theta - \theta_0| \le \delta,\ |\psi - \psi_0| \ge \epsilon} |M^*(\theta, \psi, y)|.$$
Therefore, we have, by (ii), (iv), and (A.2),
$$P(|\hat{\psi} - \psi_0| \ge \epsilon) \le P(|\hat{\theta} - \theta_0| \le \delta,\ |\hat{\psi} - \psi_0| \ge \epsilon) + P(|\hat{\theta} - \theta_0| > \delta) \le P(\Delta^{-1}|M^*(\hat{\theta}, \hat{\psi}, y)| \ge 1) + o(1) = o(1).$$

Proof of Lemma 1. We have
$$\mathrm{E}\left(\sum_{i=1}^n \lambda_i \xi_i\right)^4 = \sum_{i_1, i_2, i_3, i_4} \lambda_{i_1}\lambda_{i_2}\lambda_{i_3}\lambda_{i_4}\, \mathrm{E}\,\xi_{i_1}\xi_{i_2}\xi_{i_3}\xi_{i_4}. \qquad \mathrm{(A.3)}$$
i=1
i =j
=3
n i=1
2 7i2 var(5i )
+
n i=1
7i4 {E54i − 3[var(5i )]2 }:
84
J. Jiang / Journal of Statistical Planning and Inference 115 (2003) 69 – 84
Some regarding Section 6. For any N ×n matrix C; let t = C (y −X ) 44 = n derivations 4 k=1 [ck (y − X )] ; where ck is the kth column of C. By Lemma 1; it is easy to show that 2 s s 4 4 2 2 E[ck (y − X )] = Zv ck 4 0v + 3 |Zv ck | v : v=0
v=0
It follows that E(t) =
n
E[ck (y − X )]4
k=1
=
s
Zv C 44 0v + b(2 ; C);
(A.4)
v=0
n s where b(2 ; C) = 3 k=1 ( v=0 |Zv ck |2 v2 )2 . By the choice of Cu , 0 6 u 6 s, it is easy to see that Zv C0 =0, 1 6 v 6 s but C0 = 0; and Zv Cu = 0, u ¡ v 6 s but Zu Cu = 0, 1 6 u 6 s − 1; and Zs Cs = 0. Thus, we have u E Cu (y − X ) 44 = v=0 Zv Cu 44 0v + b(2 ; Cu ), and bu (2 ) = b(2 ; Cu ), 0 6 u 6 s. It then follows that EM( ; 2 ; 0; y) = 0. References Basawa, I.V., Godambe, V.P., Taylor, R.L. (Eds.), 1997. Selected proceedings of the symposium on estimating functions, IMS Lecture Notes—Monograph Ser. 32. Brockwell, P.J., Davis, R.A., 1991. Time Series: Theory and Methods, 2nd Edition. Springer, New York. Diggle, P.J., Liang, K.Y., Zeger, S.L., 1996. Analysis of Longitudinal Data. Oxford University Press, New York. Gan, L., Jiang, J., 1999. A test for global maximum. J. Amer. Statist. Assoc. 94, 847–854. Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054. Hartley, H.O., Rao, J.N.K., 1967. Maximum likelihood estimation for the mixed analysis of variance model. Biometrika 54, 93–108. Jiang, J., 1996. REML estimation: Asymptotic behavior and related topics. Ann. Statist. 24, 255–286. Jiang, J., 1997. Wald consistency and the method of sieves in REML estimation. Ann. Statist. 25, 1781–1803. Jiang, J., 1998. Asymptotic properties of the empirical BLUP and BLUE in mixed linear models. Statistica Sinica 8, 861–885. McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, 2nd Edition. Chapman & Hall, London. McFadden, D., 1989. A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica 57, 995–1026. Newey, W.K., 1985. Generalized method of moments speci6cation testing. J. Econometrics 29, 229–256. Richardson, A.M., Welsh, A.H., 1994. Asymptotic properties of restricted maximum likelihood (REML) estimates for hierarchical mixed linear models. Austral. J. Statist. 36, 31–43. Searle, S.R., Casella, G., McCulloch, C.E., 1992. Variance Components. Wiley, New York.