Convergence rate of MLE in generalized linear and nonlinear mixed-effects models: Theory and applications

Journal of Statistical Planning and Inference 137 (2007) 1787–1804, www.elsevier.com/locate/jspi

Convergence rate of MLE in generalized linear and nonlinear mixed-effects models: Theory and applications Lei Nie∗ Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University, Washington, DC, 20057, USA Received 9 September 2004; accepted 15 June 2005 Available online 1 September 2006

Abstract: Generalized linear and nonlinear mixed-effects models are used extensively in the study of repeated measurements and longitudinal data analysis. Convergence rates of maximum likelihood estimates (MLEs) differ from parameter to parameter, a fact not well explained in the literature. We consider the convergence rates of the MLEs in three different cases: (1) the number of subjects (clusters) $n$ tends to infinity while the number of measurements per subject $p$ remains finite; (2) both $n$ and $p$ tend to infinity; (3) $n$ remains finite while $p$ tends to infinity. In these three cases, the MLE may have different convergence rates. In case (1), as one would expect, the MLEs of all parameters are $\sqrt{n}$-consistent under some regularity conditions; in case (2), some parameters could be $\sqrt{np}$-consistent. These rates of convergence have a crucial impact on both experimental design and data analysis. Limited simulations were performed to examine the theoretical results. We illustrate applications of our results through one example and by presenting the exact convergence rates of some "approximate" MLEs such as PQL and CGEE2. It is shown that the MLE of some parameters, as well as some "approximate" MLEs, may be $\sqrt{N}$-consistent, where $N=np$ is the total sample size. Our results also show that the MLEs of some parameters may be asymptotically independent of some other parameters. © 2006 Elsevier B.V. All rights reserved.

Keywords: MLE; Convergence rate; Generalized linear and nonlinear mixed-effects models; Laplace's approximation; PQL; CGEE2

1. Introduction

Consider data partitioned into $n$ clusters, where the $i$th cluster consists of $p$ observations, $y_i=(y_{i1},\dots,y_{ip})^T$. The model for the data is described in two stages. In the first stage, given a random vector $b_i=(b_{i1},\dots,b_{iv})^T$, the conditional probability density function of $y_i$ is $\lambda_i(y_i,\beta,b_i)$, where $\beta$ is a vector of $u$ unknown fixed-effect parameters. We also assume that $y_1,\dots,y_n$ are conditionally independent. In the second stage, the unobservable random-effect vectors $b_1,\dots,b_n$ are assumed to be a random sample from a normal distribution $N(0,\Sigma)$, whose density function will be denoted $\phi(b_i,\theta)$. Here $\theta=(\theta_1,\dots,\theta_w)^T$ is a vector of $w$ unknown structural parameters, and $\Sigma=\Sigma(\theta)$ depends on $\theta$ through some covariance structure. Let $\Omega$ denote the parameter space for $\tau^T=(\beta^T,\theta^T)$. We are interested in estimating $\tau\in\Omega$ based on the observations $y_1,\dots,y_n$. Note that we assume the same $p$ observations are obtained for each subject for ease of illustration; extensions to the case where $p_i$ observations are collected per subject are straightforward.

* Corresponding author at: Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University, 4000 Reservoir Road, Washington, DC 20057, USA. Tel.: +1 202 687 7532; fax: +1 202 687 2581. E-mail address: [email protected]. doi:10.1016/j.jspi.2005.06.010


Models of this type arise in longitudinal studies, repeated measurement experiments over time and space, and various other experiments that yield correlated multivariate data. The areas of application include the biomedical, social, and agricultural sciences. Lindstrom and Bates (1990) considered a two-stage Gaussian-based nonlinear mixed-effects model to examine the growth curve of orange trees. A similar but more general model was considered in Davidian and Giltinan (1993) to model the growth curve of soybean seeds. Anderson and Aitkin (1985) proposed a subject-specific mixed-effects logistic regression model to examine interviewer variability. Burton et al. (1999) applied a generalized linear mixed-effects model to phenotypic observations in genetics. Other examples are available in Sheiner and Beal (1980), Breslow and Clayton (1993), Hedeker and Gibbons (1994), Mentré and Gomeli (1995), Pinheiro and Bates (1995), Roe (1997), McCulloch (1997, 2003), Booth and Hobert (1999), Wu and Ding (1999), Tenhave and Localio (1999), among others.

The maximum likelihood estimator (MLE) is used to make inference about the unknown parameters. Obtaining MLEs involves tremendous computational difficulty because of the integrated likelihood. MLEs are obtained via several different approaches, including numerical integration techniques (e.g., Hedeker and Gibbons, 1994; Pinheiro and Bates, 1995) and Markov chain Monte Carlo techniques (e.g., Zeger and Karim, 1991; McCulloch, 1997; Booth and Hobert, 1999). When $p$ is large, most "approximate" MLEs work well too. These "approximate" MLEs include the Lindstrom and Bates (LB) type estimate (Lindstrom and Bates, 1990), the penalized quasi-likelihood (PQL) estimate (Breslow and Clayton, 1993), the Laplace-approximation-based estimate (Vonesh, 1996), the maximum H-likelihood estimate (MHLE) (Lee and Nelder, 1996), and the conditional second-order generalized estimating equation (CGEE2) estimate (Vonesh et al., 2002), among others.

These "approximate" MLEs rely on the asymptotic behavior of the MLE, so careful investigation of the MLE is needed. Beyond direct "approximate" MLEs, some robust and consistent estimators were proposed in Jiang (1998, 1999) and Sinha (2004). Routine asymptotic properties, such as consistency and asymptotic normality, of some of these estimators have been established; see, e.g., Jiang (1998, 1999), Vonesh (1996), Vonesh et al. (2002), and Sinha (2004). However, because of its particular data structure, the asymptotic behavior of the MLE here has some special features that deserve more attention, particularly the convergence rate with respect to $p$. We start our investigation with two simple and well-known examples.

Example 1. Consider a simple balanced one-way ANOVA model,
$$y_{ij}=\mu+b_i+\varepsilon_{ij},$$

(1.1)

where $i=1,\dots,n$ and $j=1,\dots,p$, with the $b_i$'s independent $N(0,\sigma_b^2)$ and the $\varepsilon_{ij}$'s independent $N(0,\sigma^2)$. It can be shown that
$$\hat\mu_{\mathrm{MLE}}=\bar y_{..},\qquad \hat\sigma^2_{\mathrm{MLE}}=\mathrm{MSE},\qquad \hat\sigma^2_{b,\mathrm{MLE}}=\frac{(1-1/n)\,\mathrm{MST}-\mathrm{MSE}}{p},$$
$$\mathrm{Var}(\hat\mu_{\mathrm{MLE}})=\frac{\sigma^2}{pn}+\frac{\sigma_b^2}{n}=O(n^{-1}),$$
$$\mathrm{Var}(\hat\sigma^2_{\mathrm{MLE}})=\frac{2\sigma^4}{n(p-1)}=O((np)^{-1}),$$
$$\mathrm{Var}(\hat\sigma^2_{b,\mathrm{MLE}})=\frac{2(\sigma^2+p\sigma_b^2)^2}{(n-1)p^2}+\frac{2\sigma^4}{n(p-1)p^2}=O(n^{-1}),$$
where MSE and MST are the mean square error and the mean square of treatments, respectively, and $\hat\mu_{\mathrm{MLE}}$, $\hat\sigma^2_{\mathrm{MLE}}$, and $\hat\sigma^2_{b,\mathrm{MLE}}$ are the MLEs of the parameters $\mu$, $\sigma^2$, and $\sigma_b^2$. This notation will be used throughout this paper. It can be seen that
$$\hat\mu_{\mathrm{MLE}}-\mu_0=O_p(n^{-1/2}),\qquad \hat\sigma^2_{\mathrm{MLE}}-\sigma^2_0=O_p((np)^{-1/2}),\qquad \hat\sigma^2_{b,\mathrm{MLE}}-\sigma^2_{b0}=O_p(n^{-1/2}),$$
where $\mu_0$, $\sigma^2_0$, and $\sigma^2_{b0}$ are the true values of $\mu$, $\sigma^2$, and $\sigma_b^2$, respectively. In this example, the convergence rates for different parameters are different. In particular, if $p\le p_0$ and $n\to\infty$, then the MLEs of all parameters are $\sqrt{n}$-consistent; if $p\to\infty$ and $n\le n_0$, then the MLEs of $\mu$ and $\sigma_b^2$ are not consistent, while the MLE of $\sigma^2$ is $\sqrt{p}$-consistent; if both $n\to\infty$ and $p\to\infty$, then the MLEs of $\mu$ and $\sigma_b^2$ are $\sqrt{n}$-consistent, while the MLE of $\sigma^2$ is $\sqrt{np}$-consistent. Not only can the MLE of $\sigma^2$ be $\sqrt{np}$-consistent, but the MLE of some location parameters can also achieve this high convergence rate.
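As a quick numerical check of these rates (a simulation sketch, not part of the original analysis; the parameter values below are illustrative), one can simulate model (1.1), compute the closed-form MLEs, and watch how the Monte Carlo variances scale when $n$ is held fixed and $p$ is increased: the variance of $\hat\sigma^2_{\mathrm{MLE}}$ should shrink roughly like $1/(p-1)$, while the variances of $\hat\mu_{\mathrm{MLE}}$ and $\hat\sigma^2_{b,\mathrm{MLE}}$ stay of order $1/n$.

```python
import numpy as np

def anova_mles(y):
    """Closed-form MLEs for the balanced one-way ANOVA model
    y_ij = mu + b_i + eps_ij, where y has shape (n, p)."""
    n, p = y.shape
    mu_hat = y.mean()
    row_means = y.mean(axis=1)
    mse = ((y - row_means[:, None]) ** 2).sum() / (n * (p - 1))
    mst = p * ((row_means - mu_hat) ** 2).sum() / (n - 1)
    sb2_hat = ((1 - 1 / n) * mst - mse) / p
    return mu_hat, mse, sb2_hat

def mc_variances(n, p, reps=2000, mu=1.0, s2=1.0, sb2=0.5, seed=0):
    """Monte Carlo variances of (mu_hat, sigma2_hat, sigma2_b_hat)."""
    rng = np.random.default_rng(seed)
    est = np.empty((reps, 3))
    for r in range(reps):
        b = rng.normal(0.0, np.sqrt(sb2), size=(n, 1))
        y = mu + b + rng.normal(0.0, np.sqrt(s2), size=(n, p))
        est[r] = anova_mles(y)
    return est.var(axis=0)

v_small = mc_variances(n=50, p=5)
v_large = mc_variances(n=50, p=20)
# Var(sigma2_hat) = 2*sigma^4/(n(p-1)) falls roughly by (20-1)/(5-1) = 4.75x,
# while Var(mu_hat) = sigma^2/(np) + sigma_b^2/n stays bounded below by sigma_b^2/n.
print(v_small[1] / v_large[1])  # roughly 4.75
print(v_small[0] / v_large[0])  # close to 1
```

The contrast between the two printed ratios is the point: quadrupling $p$ buys almost nothing for $\hat\mu$ but a near-proportional gain for $\hat\sigma^2$.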


Example 2. Consider the following model:
$$y_{ij}=\mu+x_j\beta+b_i+\varepsilon_{ij},\qquad(1.2)$$
where $b_i\sim N(0,\sigma_b^2)$, the $\varepsilon_{ij}$'s are independent $N(0,\sigma^2)$, and $x_j\sim N(0,1)$. Similar to the computations in Example 1, it can be shown that
$$\hat\mu_{\mathrm{MLE}}-\mu_0=O_p(n^{-1/2}),\qquad \hat\beta_{\mathrm{MLE}}-\beta_0=O_p((np)^{-1/2}).$$
Thus $\hat\beta_{\mathrm{MLE}}$ is $\sqrt{np}$-consistent; i.e., even if $n$ does not tend to infinity, $\hat\beta_{\mathrm{MLE}}$, the MLE of the slope, is still a consistent estimator as long as $p$ tends to infinity. However, $\hat\mu_{\mathrm{MLE}}$ is not consistent in this case when $n$ does not tend to $\infty$.

Having all these well-known results from linear models in mind, we shall consider the MLE in generalized linear and nonlinear mixed-effects models. The main purpose of this paper is to clarify the exact convergence rate of the MLE and of some "approximate" MLEs, including PQL and CGEE2. In order to clearly explain the asymptotic behavior of the MLE, we classify the models in the literature into three different cases. Most examples fall into the first case, where the number of subjects (or clusters) $n$ tends to infinity while the number of measurements per subject $p$ remains finite. For example, in a typical biological growth model (e.g., Lindstrom and Bates, 1990), $p\le 10$; in a typical pharmacokinetic (PK) model (e.g., Roe, 1997; Wolfinger and Lin, 1997), $p\le 12$; and in a mixed-effects logistic regression model from genetics research (e.g., Burton et al., 1999), $p$ is the number of members in a nuclear family. In all these examples, the number of observations per subject $p$ is moderate or small for all subjects, while the number of subjects $n$ can be large. In mathematical form, we write $n\to\infty$ while $p\le p_0$. On the other hand, some other examples fall into the second case, i.e., $n\to\infty$ and $p\to\infty$. In the mixed-effects logistic regression model used to examine interviewer variability by Anderson and Aitkin (1985), both the number of interviewers ($n=64$) and the number of people interviewed ($20\le p\le 35$) are large. Tenhave and Localio (1999) considered a logistic regression model to investigate an assessment of health care providers according to the risk of death from pneumonia, adjusted for patient-level covariates, where $n=22$ and $p$ can be as large as 203.
Similar examples can easily be found in biological studies that use monitoring equipment or image-processing techniques. Finally, although currently not well studied, models can be classified into the third case: $p\to\infty$ while $n\le n_0$. Consider the pharmacokinetic study by Kwan et al. (1976): six volunteers ($n=6$) received a bolus intravenous injection and their plasma concentrations were measured 11 times ($p=11$). We may argue that this study falls into the third case rather than the first or second. Another example is illustrated in Crowder (1978), which concerns the proportion of seeds that germinate on each of 21 plates arranged according to a $2\times2$ factorial layout by seed and type of root extract. In this example, the cluster is the plate and $n=21$; the total number of seeds on a plate can be as high as 81, i.e., $p=81$. Classifying models into these three cases is solely for the purpose of illustrating the different behavior of the MLE. There is no obvious rule (and no need) to set threshold values for $p$ and $n$ so that models can be classified easily; the classification depends on the model itself as well as the researcher's view. For instance, the example in Crowder (1978) can be classified into the second case as well. Since $p$ can be as high as 12 in a typical PK model (e.g., Roe, 1997; Wolfinger and Lin, 1997), some results for the MLE in the second case can actually apply to the PK model as well. The $\sqrt{n}$-consistency of the MLE is a well-known result. In order to clearly understand the role played by $p$ in generalized linear and nonlinear mixed-effects models, we carefully investigate the convergence rate of the MLE. The marginal likelihood $M(\tau,y)$ of $\tau=(\beta,\theta)$ based on the observations $y_1,\dots,y_n$ is a product of the individual marginal likelihoods,
$$M(\tau,y)=\prod_{i=1}^n M_i(\tau,y_i),\qquad M_i(\tau,y_i)=\int \lambda_i(y_i,\beta,b_i)\,\phi(b_i,\theta)\,db_i,$$
where $M_i(\tau,y_i)$ is the marginal likelihood based on $y_i$. Therefore, $y_1,\dots,y_n$ are independent random samples from the associated populations, and the probability density function of $y_i$ is $M_i(\tau,y_i)$. We will consider the convergence rate of the MLE for models that fall into all three cases. In Section 2, we consider the asymptotic behavior when $n\to\infty$. In Section 3, the case when $n$ does not tend to $\infty$ is considered as a comparison to the case when $n$ tends to $\infty$. In


Section 4, we perform some simulations to study the finite-population performance of the MLE. We discuss the potential impact on data analysis. We also illustrate some potential applications of these results in experimental design via one example, showing that the asymptotic results can be crucial in the design of experiments. Finally, we briefly discuss the convergence rate of some "approximate" MLEs, such as the PQL estimator of Breslow and Clayton (1993), the conditional second-order estimating equation of Vonesh et al. (2002), and the Laplace approximation estimator of Vonesh (1996).

2. Consistency of MLE when the number of clusters tends to ∞

The asymptotic properties of the MLE have been widely investigated. Cramer (1946) and Wald (1949) proved the consistency of the MLE when the observations are independent random samples from the same population, i.e., the random samples are independently and identically distributed (i.i.d.). In our case, this means $y_1,\dots,y_n$ are i.i.d. with the same pdf $M_i(\tau,y_i)$. The following result can be found in Lehmann (1983).

Lemma 1. Let $y_1,\dots,y_n$ be i.i.d. with the same pdf. For any given $p$, as $n\to\infty$, the MLE is $\sqrt{n}$-consistent if the conditions listed in Theorem 6.4.1 of Lehmann (1983) hold for $M_i(\tau,y_i)$. Furthermore,
$$\sqrt{n}(\hat\tau_{\mathrm{MLE}}-\tau_0)\sim N(0, I^{-1}(\tau_0)),$$
where $I(\tau)$ is the information matrix
$$E_{y_i|\tau}\!\left[-\frac{\partial^2 \ln M_i(\tau,y_i)}{\partial\tau\,\partial\tau^T}\right].$$
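The conclusion of Lemma 1 can be illustrated numerically (a sketch, not from the paper) with an i.i.d. Poisson sample, where the MLE is the sample mean and the information is $I(\lambda)=1/\lambda$, so $\sqrt{n}(\hat\lambda_{\mathrm{MLE}}-\lambda_0)$ should have variance close to $I^{-1}(\lambda_0)=\lambda_0$:

```python
import numpy as np

# Monte Carlo check that sqrt(n)*(MLE - truth) has variance I^{-1}(lam0) = lam0
# for i.i.d. Poisson data (illustrative values, not from the paper).
rng = np.random.default_rng(0)
lam0, n, reps = 4.0, 400, 5000
mle = rng.poisson(lam0, size=(reps, n)).mean(axis=1)  # the MLE is the sample mean
z = np.sqrt(n) * (mle - lam0)                         # approximately N(0, lam0)
print(z.var())        # close to I^{-1}(lam0) = 4
print(abs(z.mean()))  # close to 0
```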

Note that this is Cramer-type consistency, i.e., there is a root of the likelihood equation that is consistent. In this paper, we consider only this type of consistency. When the $M_i(\tau,y_i)$'s are different, i.e., $y_1,\dots,y_n$ are independent but not identically distributed (i.n.i.d.), we can apply results from Bradley and Gart (1962), who proved weak consistency when the observations are sampled from independent associated populations.

Lemma 2. For any given $p$, as $n\to\infty$, the MLE is $\sqrt{n}$-consistent under Conditions I(i), I(ii), and III in Bradley and Gart (1962). Furthermore,
$$\sqrt{n}(\hat\tau_{\mathrm{MLE}}-\tau_0)\sim N(0, I^{-1}(\tau_0)),$$
where $I(\tau)$ is the information matrix
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n E_{y_i|\tau}\!\left[-\frac{\partial^2 \ln M_i(\tau,y_i)}{\partial\tau\,\partial\tau^T}\right].$$

The conditions in both lemmas are all regular and have broad applications in generalized linear and nonlinear mixed-effects models. When $n\to\infty$, under the conditions in Lemma 1 or 2, the MLE is consistent. However, the convergence rate of the MLE may differ, depending on whether $p\to\infty$. If $p\le p_0$, then the MLEs of all parameters are $\sqrt{n}$-consistent under the conditions in Lemma 1 or 2. If $p\to\infty$, things may be different, as was shown for model (1.2).

Convergence rate of the MLE when $n\to\infty$ and $p\to\infty$: Let $l_{i1}(y_i,\beta,b_i)=-\ln(\lambda_i(y_i,\beta,b_i))$, $l_{i2}(b_i,\theta)=-\ln(\phi(b_i,\theta))$, and $l_i(y_i,\tau,b_i)=l_{i1}(y_i,\beta,b_i)+l_{i2}(b_i,\theta)$. Given $\tau$, let $\hat b_i(\tau)$ be the minimizer of $l_i(y_i,\tau,b_i)$. Then
$$-\frac{\partial\ln M_i(\tau,y_i)}{\partial\tau}=\frac{\int l_i'(y_i,\tau,b_i)\exp(-l_i(y_i,\tau,b_i))\,db_i}{\int\exp(-l_i(y_i,\tau,b_i))\,db_i},$$


where $l_i'(y_i,\tau,b_i)=\partial l_i(y_i,\tau,b_i)/\partial\tau$. As shown in Appendix A.2, using Laplace's approximation (e.g., Barndorff-Nielsen and Cox, 1989; Tierney et al., 1989) and Lemma 5,
$$-\frac{\partial^2\ln M_i(\tau,y_i)}{\partial\tau\,\partial\tau^T}\approx\left[\frac{\partial^2 l_i(y_i,\tau,b_i)}{\partial\tau\,\partial\tau^T}-\frac{\partial^2 l_i(y_i,\tau,b_i)}{\partial\tau\,\partial b_i^T}\left(\frac{\partial^2 l_i(y_i,\tau,b_i)}{\partial b_i\,\partial b_i^T}\right)^{-1}\frac{\partial^2 l_i(y_i,\tau,b_i)}{\partial b_i\,\partial\tau^T}\right]_{b_i=\hat b_i}.\qquad(2.1)$$
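The quality of this kind of Laplace approximation for a single cluster is easy to check in one dimension. The sketch below (hypothetical parameter values, not the paper's implementation) approximates a cluster's marginal likelihood $M_i=\int\exp(-l_i(b_i))\,db_i$ by $\exp(-l_i(\hat b_i))\sqrt{2\pi/l_i''(\hat b_i)}$ and compares it with direct quadrature for a random-intercept logistic cluster:

```python
import numpy as np
from scipy import optimize, integrate

# Laplace approximation of one cluster's marginal likelihood
#   M_i = ∫ exp(-l_i(b)) db,  l_i(b) = l_i1(y_i | b) + l_i2(b),
# for a random-intercept logistic cluster (hypothetical parameter values).
rng = np.random.default_rng(1)
p, beta0, sb = 40, -0.5, 1.0
b_true = rng.normal(0.0, sb)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(beta0 + b_true))), size=p)

def neg_log_joint(b):
    eta = beta0 + b
    l1 = np.sum(np.log1p(np.exp(eta)) - y * eta)                # -log cond. likelihood
    l2 = 0.5 * (b / sb) ** 2 + np.log(sb * np.sqrt(2 * np.pi))  # -log N(0, sb^2) density
    return l1 + l2

b_hat = optimize.minimize_scalar(neg_log_joint).x  # the minimizer b_i_hat
h = 1e-4                                           # finite-difference curvature l_i''
curv = (neg_log_joint(b_hat + h) - 2 * neg_log_joint(b_hat)
        + neg_log_joint(b_hat - h)) / h ** 2

laplace = np.exp(-neg_log_joint(b_hat)) * np.sqrt(2 * np.pi / curv)
exact, _ = integrate.quad(lambda b: np.exp(-neg_log_joint(b)), -10, 10)
print(laplace / exact)  # close to 1; the relative error shrinks as p grows
```

The ratio approaching 1 as $p$ grows is exactly why Laplace-based expansions such as (2.1) become accurate in the $p\to\infty$ regime studied here.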

The conditions in Lemmas 1 and 2 focus on the marginal likelihood of $y_i$ and consider the behavior of the MLE when $n\to\infty$. Since we now consider the behavior of the MLE when $p\to\infty$, we also assume some conditions on the conditional distribution $\lambda_i(y_i,\beta,b_i)$.

Condition 1. For all given $\beta$ and $b_i$, $\hat b_i$ is a $\sqrt{p}$-consistent estimate of $b_i$. Furthermore, as $n,p\to\infty$,
$$\frac{1}{n}\sum_{i=1}^n E_{y_i|\tau}\left\{\left[b_i^T\left(\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial b_i\,\partial b_i^T}+\Sigma^{-1}\right)^{-1}b_i\right]_{b_i=\hat b_i}\right\}=o_p(1),$$
$$\frac{1}{n}\sum_{i=1}^n E_{y_i|\tau}\left\{\left[\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial\beta\,\partial b_i^T}\left(\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial b_i\,\partial b_i^T}+\Sigma^{-1}\right)^{-1}\right]_{b_i=\hat b_i}\right\}=O_p(1).$$

Condition 1 is made to compute the variance of $\sqrt{n}(\hat\beta_{\mathrm{MLE}}-\beta)$. The next condition, Condition 2, will be made for the computation of $\sqrt{n}(\hat\theta_{\mathrm{MLE}}-\theta)$. Before presenting Condition 2, we need some notation. $E_{y_i|b_i}(\partial^2 l_{i1}(\beta,b_i)/\partial\beta\,\partial\beta^T)$ is a nonnegative definite matrix; it can be decomposed as $\tilde X_i^T(\beta,b_i)\tilde X_i(\beta,b_i)$ for some matrix $\tilde X_i(\beta,b_i)$. Similarly, $E_{y_i|b_i}(\partial^2 l_{i1}(\beta,b_i)/\partial b_i\,\partial b_i^T)=\tilde Z_i^T(\beta,b_i)\tilde Z_i(\beta,b_i)$. These decompositions are not unique; however, in most cases they have natural choices, and readers are referred to Appendix A.3 for these natural choices. For ease of illustration, we simply write $\tilde X_i$ for $\tilde X_i(\beta,\hat b_i)$ and $\tilde Z_i$ for $\tilde Z_i(\beta,\hat b_i)$. We shall consider the parameters $\beta$ and $\theta$ separately. According to (2.1), we have
$$E_{y_i}\!\left[-\frac{\partial^2\ln M_i(\tau,y_i)}{\partial\beta\,\partial\beta^T}\right]\approx E_{y_i}\!\left\{\left[\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial\beta\,\partial\beta^T}-\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial\beta\,\partial b_i^T}\left(\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial b_i\,\partial b_i^T}+\Sigma^{-1}\right)^{-1}\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial b_i\,\partial\beta^T}\right]_{b_i=\hat b_i}\right\}.$$
Note that $\Sigma$ here is the variance of the random variable $b_i$, defined in the Introduction. By replacing the Hessian matrix with the Fisher information matrix, we obtain
$$\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left[-\frac{\partial^2\ln M_i(\tau,y_i)}{\partial\beta\,\partial\beta^T}\right]\approx\frac{1}{n}\sum_{i=1}^n E_{y_i}\bigl(\tilde X_i^T\tilde X_i-\tilde X_i^T\tilde Z_i(\tilde Z_i^T\tilde Z_i+\Sigma^{-1})^{-1}\tilde Z_i^T\tilde X_i\bigr)=\frac{1}{n}\sum_{i=1}^n E_{y_i}\bigl(\tilde X_i^T(I+\tilde Z_i\Sigma\tilde Z_i^T)^{-1}\tilde X_i\bigr).$$
A necessary condition for $\sqrt{np}$-consistency of the MLE of $\beta$ is

Condition 2.
$$\lim_{n,p\to\infty}\frac{1}{np}\sum_{i=1}^n E_{y_i}\bigl(\tilde X_i^T(I+\tilde Z_i\Sigma\tilde Z_i^T)^{-1}\tilde X_i\bigr)\qquad(2.2)$$
is positive definite.
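The matrix identity that turns the corrected Hessian into the form appearing in (2.2) is the Woodbury identity, and both it and the positive-definiteness requirement can be checked numerically (a sketch with arbitrary illustrative dimensions, not from the paper):

```python
import numpy as np

# Check numerically that
#   X'X - X'Z (Z'Z + S^{-1})^{-1} Z'X  ==  X' (I + Z S Z')^{-1} X,
# the identity used above to pass to the form in Condition 2.
rng = np.random.default_rng(0)
p, u, v = 30, 3, 2          # rows, dim(beta), dim(b_i): illustrative sizes
X = rng.normal(size=(p, u))
Z = rng.normal(size=(p, v))
A = rng.normal(size=(v, v))
S = A @ A.T + np.eye(v)     # a positive definite "Sigma"

lhs = X.T @ X - X.T @ Z @ np.linalg.solve(Z.T @ Z + np.linalg.inv(S), Z.T @ X)
rhs = X.T @ np.linalg.solve(np.eye(p) + Z @ S @ Z.T, X)
print(np.allclose(lhs, rhs))  # True

# Condition 2 asks that this matrix, scaled by 1/(np), stay positive definite:
eigs = np.linalg.eigvalsh(rhs / p)
print((eigs > 0).all())  # True here, since X has full column rank
```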


Now let us explain Condition 2 in an intuitive way. When the $y_i$'s are i.i.d., the $\tilde X_i$ and $\tilde Z_i$ are all i.i.d. Therefore (2.2) reduces to
$$\lim_{p\to\infty}\frac{1}{p}E_{y_i}\bigl(\tilde X_i^T(I+\tilde Z_i\Sigma\tilde Z_i^T)^{-1}\tilde X_i\bigr)$$
for any $i$. Using the fact that $(A+\epsilon I)^{-1}\approx A^{-1}$ for any positive definite matrix $A$ as $\epsilon\to0$,
$$\left(\frac{\tilde Z_i^T\tilde Z_i}{p}+\frac{\Sigma^{-1}}{p}\right)^{-1}\approx\left(\frac{\tilde Z_i^T\tilde Z_i}{p}\right)^{-1},$$
since $\tilde Z_i^T\tilde Z_i/p$ tends to a positive definite matrix as $p\to\infty$. Thus
$$E_{y_i}\bigl(\tilde X_i^T(I+\tilde Z_i\Sigma\tilde Z_i^T)^{-1}\tilde X_i\bigr)=E_{y_i}\bigl(\tilde X_i^T\tilde X_i-\tilde X_i^T\tilde Z_i(\tilde Z_i^T\tilde Z_i+\Sigma^{-1})^{-1}\tilde Z_i^T\tilde X_i\bigr)\approx E_{y_i}\bigl(\tilde X_i^T(I-P_{\tilde Z_i})\tilde X_i\bigr),$$
where $P_{\tilde Z_i}$ is the projection matrix onto the column space of $\tilde Z_i$. Condition 2 requires the term $E_{y_i}\bigl((1/np)\sum_{i=1}^n\tilde X_i^T(I-P_{\tilde Z_i})\tilde X_i\bigr)$ to be positive definite. This condition usually holds when
$$\sup_{\alpha\in R^u\setminus\{0\},\ \tau\in\Omega,\ b_i\in R^k}\frac{\sum_{i=1}^n\alpha^T\tilde X_i^T\tilde Z_i\,\alpha/n}{\sum_{i=1}^n\alpha^T\tilde Z_i^T\tilde Z_i\,\alpha/n}<1,\qquad(2.3)$$
which roughly means that the columns of $\tilde Z_i^T$ are not parallel to any column (or linear combination of columns) of $\tilde X_i^T$. On the other hand,
$$-\frac{\partial^2\ln M_i(\tau,y_i)}{\partial\theta\,\partial\theta^T}\approx\left[\frac{\partial^2 l_{i2}(b_i,\theta)}{\partial\theta\,\partial\theta^T}-\frac{\partial^2 l_{i2}(b_i,\theta)}{\partial\theta\,\partial b_i^T}\left(\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial b_i\,\partial b_i^T}+\Sigma^{-1}\right)^{-1}\frac{\partial^2 l_{i2}(b_i,\theta)}{\partial b_i\,\partial\theta^T}\right]_{b_i=\hat b_i}.$$
This is proved in Appendix A.2, (A.4). In Appendix A.1, we further show that, as $p\to\infty$,
$$\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left[-\frac{\partial^2\ln M_i(\tau,y_i)}{\partial\theta\,\partial\theta^T}\right]\approx\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left[\frac{\partial^2 l_{i2}(b_i,\theta)}{\partial\theta\,\partial\theta^T}\right]=E_{b_i}\!\left[\frac{\partial^2 l_{i2}(b_i,\theta)}{\partial\theta\,\partial\theta^T}\right],\qquad(2.4)$$
where the last term in the above equation will be denoted as $\Lambda$. Note that $\Lambda$ does not depend on $i$ since the $b_i$'s are i.i.d. In order to clearly specify the convergence rate of each component of $\hat\tau_{\mathrm{MLE}}$, we shall focus on models used in the literature. In practice, $\lambda_i$ usually has the following particular form:
$$\lambda_i(y_i,\beta,b_i)=\lambda_i(y_i,\beta^{(1)}+b_i,\beta^{(2)}).\qquad(2.5)$$

For example, for the growth model in Lindstrom and Bates (1990),
$$\lambda_i(y_i,\beta,b_i)=\prod_{j=1}^p\phi_{\sigma^2}\bigl(y_{ij}-f_{ij}(\beta_1+b_i,\beta_2,\beta_3)\bigr),$$
where $\phi_{\sigma^2}(\cdot)$ is the pdf of a normal distribution with mean 0 and variance $\sigma^2$ and $f_{ij}(\beta_1,\beta_2,\beta_3)=\beta_1/(1+\beta_2\exp(\beta_3 x_{ij}))$. In this example, $\beta^{(1)}=\beta_1$ and $\beta^{(2)}=(\beta_2,\beta_3)$. For a similar model in Davidian and Giltinan (1993),
$$\lambda_i(y_i,\beta,b_i)=\prod_{j=1}^p\phi_{\sigma^2}\bigl(y_{ij}-f_{ij}(\beta_1+b_{i1},\beta_2+b_{i2},\beta_3+b_{i3})\bigr),$$


where $f_{ij}(\beta)=\beta_1/(1+\exp(\beta_3(x_{ij}-\beta_2)))$. In this model, $\beta^{(1)}=(\beta_1,\beta_2,\beta_3)$. Similarly, in all mixed-effects logistic regression models used in practice, $\lambda_i$ has the form (2.5); see examples in Breslow and Clayton (1993), Hedeker and Gibbons (1994), Mentré and Gomeli (1995), Pinheiro and Bates (1995), Roe (1997), Vonesh and Chinchilli (1997), Wolfinger and Lin (1997), McCulloch (1997, 2003), Booth and Hobert (1998, 1999), Wu and Ding (1999), Tenhave and Localio (1999), Ten Have et al. (2000), Cole et al. (2003), and Lu et al. (2003).

Condition 3. Assume that $\lambda_i(y_i,\beta,b_i)$ has the form, or can be transformed (reparameterized) into the form, (2.5).

Proposition 1. Under Conditions 1–3 with $\dim(\beta^{(1)})=0$, and the conditions in Lemma 1 (if the $y_i$'s are i.i.d.) or Lemma 2 (if the $y_i$'s are i.n.i.d.), as $n\to\infty$ and $p\to\infty$,
$$\sqrt{np}(\hat\beta_{\mathrm{MLE}}-\beta_0)\sim N(0,i^{-1}(\tau_0)),\qquad\sqrt{n}(\hat\theta-\theta_0)\sim N(0,\Lambda^{-1}),$$
where
$$i(\tau)=\lim_{n,p\to\infty}\frac{1}{pn}\sum_{i=1}^n E_{y_i}\bigl(\tilde X_i^T(I+\tilde Z_i\Sigma\tilde Z_i^T)^{-1}\tilde X_i\bigr).$$

The proof is in Appendix A.1. From the proposition, the estimators of $\beta$ and $\theta$ are asymptotically independent when $p\to\infty$.

Consider the general case where $\dim(\beta^{(1)})$, the dimension of $\beta^{(1)}$, is not 0. Let $d_i=\dim(\beta^{(i)})$, and rewrite $\lambda_i(y_i,\beta,b_i)$ as $\lambda_i(y_i,\beta,b_i)=\lambda_i(y_i,\beta^{(1)}+b_i,\beta^{(2)})=\lambda_i(y_i,c_i,\beta^{(2)})$, where $c_i\sim N(\beta^{(1)},\Sigma(\theta))$.

Condition 4.
$$\lim_{n,p\to\infty}\frac{1}{np}\sum_{i=1}^n E_{y_i}\bigl(\{\tilde X_i^T(I+\tilde Z_i\Sigma\tilde Z_i^T)^{-1}\tilde X_i\}_{\beta^{(2)}}\bigr)$$
is positive definite, where $\{\tilde X_i^T\tilde X_i\}_{\beta^{(2)}}$ is the lower diagonal $d_2\times d_2$ block of $\tilde X_i^T\tilde X_i$, i.e.,
$$\lim_{p\to\infty}\frac{1}{p}\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial\beta^{(2)}\,\partial\beta^{(2)T}}\approx\frac{1}{p}\{\tilde X_i^T\tilde X_i\}_{\beta^{(2)}}.$$
Note that we cannot expect a similar assumption for $\beta^{(1)}$, because
$$\sup_{\beta^{(1)}\in R^{d_1}\setminus\{0\},\ \tau\in\Omega,\ b_i\in R^k}\frac{\sum_{i=1}^n(\beta^{(1)},0)\tilde X_i^T\tilde Z_i(\beta^{(1)},0)^T/n}{\sum_{i=1}^n(\beta^{(1)},0)\tilde Z_i^T\tilde Z_i(\beta^{(1)},0)^T/n}=1,$$
where $0=(0,\dots,0)_{1\times d_2}$.

It can be shown that Condition 4 holds for Example 2 (model (1.2)), where $\beta^{(1)}=\mu$ and $\beta^{(2)}=\beta$. Let $\theta_*^T=(\theta^T,\beta_1^{(1)},\dots,\beta_{d_1}^{(1)})$; similar to Proposition 1, we have the following theorem.

Theorem 3. Under Conditions 1, 3, and 4, and the conditions in Lemma 1 or 2, as $n\to\infty$ and $p\to\infty$,
$$\sqrt{np}(\hat\beta^{(2)}_{\mathrm{MLE}}-\beta^{(2)}_0)\sim N(0,i_*^{-1}(\tau_0)),\qquad\sqrt{n}(\hat\theta_*-\theta_{*0})\sim N(0,\Lambda_*^{-1}),$$
where
$$i_*(\tau)=\lim_{n,p\to\infty}\frac{1}{np}\sum_{i=1}^n E_{y_i}\bigl(\{\tilde X_i^T(I+\tilde Z_i\Sigma\tilde Z_i^T)^{-1}\tilde X_i\}_{\beta^{(2)}}\bigr)$$

and
$$\Lambda_*=E_{y_i}\!\left[-\frac{\partial^2\ln\phi(b_i,\theta_*)}{\partial\theta_*\,\partial\theta_*^T}\right].$$

The proof of Theorem 3 follows along the lines of the proof of Proposition 1 and is therefore omitted.

3. Consistency of MLE when the number of clusters is fixed

On the other hand, when $n\le n_0$, the MLE is not necessarily consistent, whether or not $p$ tends to infinity. First of all, when $p\le p_0$, i.e., $p$ is also bounded, the MLE is naturally not consistent. However, even when only $p\to\infty$, the consistency of the MLE is not guaranteed. Model (1.1) is such a counterexample: $\hat\mu_{\mathrm{MLE}}$ is not a consistent estimate, while $\hat\sigma^2_{\mathrm{MLE}}$ is $\sqrt{p}$-consistent. In a model like (1.1), if interest is in inference for $\sigma^2$, the MLE is still a good estimator of $\sigma^2$. This remains true for generalized linear and nonlinear mixed-effects models under some suitable conditions. However, the conditions and the proof are both quite involved and out of the scope of this paper. Nie and Yang (2005) established some sufficient conditions under which the MLEs of some parameters are $\sqrt{p}$-consistent when $p\to\infty$ while $n$ is fixed.

4. Simulation study and applications

The purpose of this simulation study is twofold. First, we intend to illustrate Theorem 3 through a simulation setup that mimics real examples and to show that Theorem 3 is applicable to many practical models. Second, we intend to examine the finite-population performance of the MLE, to verify whether Theorem 3 holds when both $n$ and $p$ are moderate, a condition often met in real examples.

4.1. A logistic regression mixed-effects model

The first simulation model used in this subsection is a mixed-effects logistic regression model similar to Zeger and Karim (1991) and Breslow and Clayton (1993). We consider four cases to examine the impact of $p$ on the MLEs. For this purpose, we fix $n$, the number of subjects, while varying $p$, the number of observations per subject. For each case, 200 data sets were created. Each data set was generated according to the following model:
$$y_{ij}|b_i\sim\mathrm{Binomial}(1,p_{ij}),\qquad\mathrm{logit}(p_{ij})=(a_0+b_i)+a_1 t_{ij}+a_2 x_{ij}+a_3 x_{ij}t_{ij},$$
where $i=1,\dots,n=100$, $j=1,\dots,p$, and $x_{ij}=1$ for half of the 100 subjects and 0 for the other half, i.e., $x_{ij}$ is the group indicator; the $b_i$'s were generated as i.i.d. normal variables with mean 0 and covariance structure $\Sigma=1$. In the simulation study, we chose a setup similar to Zeger and Karim (1991) and Breslow and Clayton (1993), namely $a_0=-2.5$, $a_1=1$, $a_2=-1$, $a_3=-0.5$, $\Sigma=1$. The covariates and parameters of the four cases are

1. $n=100$, $p=7$, $t_i=c$;
2. $n=100$, $p=14$, $t_i=(c,c)$;
3. $n=100$, $p=28$, $t_i=(c,c,c,c)$;
4. $n=100$, $p=56$, $t_i=(c,c,c,c,c,c,c,c)$;

where $c=(-3,-2,-1,0,1,2,3)$ and $t_i=(t_{i1},\dots,t_{ip})$. Models were fitted via the NLMIXED procedure in SAS. Results are shown in Table 1. Note that in our simulation study the sample size $p$ is not too small and $\dim(b_i)$ is small, so the numerical method via NLMIXED should be fine. However, the MLE can also be obtained via Markov chain Monte Carlo methods as in Zeger and Karim (1991), McCulloch (1997), and Booth and Hobert (1999), or via Bayesian procedures with flat priors to mimic the MLE in WinBUGS. In this example, $n=100$ was fixed. From the table, we find that the variance of $\hat a_{1,\mathrm{MLE}}$ decreases by about 50% when the number of measurements per subject doubles. This is approximately true for $\hat a_{3,\mathrm{MLE}}$ too. The results


Table 1
Simulation results for the logistic regression model

Parameter  True value          p = 7     p = 14    p = 28     p = 56
a0         -2.5       Mean     -2.495    -2.49     -2.495     -2.494
                      Std.      0.316     0.2347    0.196      0.179
a1          1         Mean      1.01      1.00      1.0003     1.000
                      Std.      0.137     0.098     0.070      0.049
a2         -1         Mean     -1.084    -1.05     -1.027     -1.01
                      Std.      0.54      0.345     0.29       0.263
a3         -0.5       Mean     -0.494    -0.488    -0.4961    -0.497
                      Std.      0.213     0.156     0.0957     0.073
Σ           1         Mean      1.001     0.944     0.974      0.972
                      Std.      0.585     0.33      0.25       0.192

support Theorem 3: $\hat a_{1,\mathrm{MLE}}$ and $\hat a_{3,\mathrm{MLE}}$ are $\sqrt{np}$-consistent, and the MLEs of all other parameters are $\sqrt{n}$-consistent. We focus on the $\sqrt{np}$-consistent estimators $\hat a_{1,\mathrm{MLE}}$ and $\hat a_{3,\mathrm{MLE}}$. In the simulation, the number of clusters $n=100$ is fixed; thus for a $\sqrt{np}$-consistent estimate, with variance of order $O_p(1/np)$, one may expect (although not necessarily in finite samples) the variance to fall by about 50% when the cluster size doubles. For $\hat a_{1,\mathrm{MLE}}$, when the sample size $p$ increases from 7 to 14, the variance decreases by about 51%; from 14 to 28, by about 51%; and from 28 to 56, by about 49%. For $\hat a_{3,\mathrm{MLE}}$, when $p$ increases from 7 to 14, the variance decreases by about 54%; from 14 to 28, by about 38%; and from 28 to 56, by about 58%. The simulation results support the conclusion that both $\hat a_{1,\mathrm{MLE}}$ and $\hat a_{3,\mathrm{MLE}}$ are $\sqrt{np}$-consistent. For all other parameters, the variance reduction is generally smaller as $p$ increases. We expect the results of this simulation study to hold similarly for the models used in Tenhave and Localio (1999), where $p$ can be very large. We also expect similar results to hold in Cole et al. (2003), McCulloch (1997, 2003), etc., where $p$ is a moderate number. However, when $p$ is small, as in Burton et al. (1999) and Lu et al. (2003), it is meaningless to consider $\sqrt{p}$-consistency; in such a case, the MLEs of all parameters are $\sqrt{n}$-consistent. In a logistic regression model, $\mathrm{logit}(p_{ij})=\mu+x_{ij}\beta+b_i$, where $\mu$ usually is the population average and $\beta$ is the regression coefficient. In practice, $\beta$ is the parameter of interest. Due to restrictions on the number of clusters (e.g., the number of hospitals, or lots of vaccines), $n$ may not be very large; however, as long as $p$ is large, we can expect a very precise estimate of $\beta$.

4.2. A nonlinear regression mixed-effects model: growth model

The second simulation model is similar to the soybean growth model of Lindstrom and Bates (1990) and Davidian and Giltinan (1993). Let $y_{ij}$ denote the weight of the soybean plant at time $x_{ij}$, measured in weeks, $i=1,\dots,n$, $j=1,\dots,p$. The model is given by
$$y_{ij}|b_i=f_{ij}(x_{ij},\beta,b_i)+\varepsilon_{ij},\qquad b_i\sim N(0,\Sigma),\qquad(4.1)$$
with mean
$$f_{ij}(x_{ij},\beta,b_i)=\frac{\beta_1}{1+\exp\bigl((\beta_3+b_{i3})(x_{ij}-\beta_2-b_{i2})\bigr)}$$
and $\varepsilon_{ij}\sim N(0,\sigma^2)$. Here $b_i^T=(b_{i2},b_{i3})$, $i=1,\dots,20$, are i.i.d. normal with mean 0 and variance–covariance matrix $\Sigma$, which is a $2\times2$ matrix. In this model, $\beta^T=(\beta_1,\beta_2,\beta_3)$, $\sigma^2$, and $\Sigma$ are the unknown parameters. We simulated


Table 2
Simulation results for the growth model

Parameter  True value          n = 20, p = 10   n = 20, p = 20   n = 20, p = 40
β1          15        Mean      14.97            14.99            15.01
                      Std.       0.52             0.36             0.28
β2          50        Mean      49.90            49.95            49.97
                      Std.       0.9322           0.6728           0.53
β3         -0.1       Mean      -0.100           -0.100           -0.1
                      Std.       0.00263          0.0022           0.0018
σ²          0.04      Mean       0.0396           0.0397           0.0398
                      Std.       0.0048           0.0032           0.002
Σ[1, 1]     2         Mean       1.97             1.80             1.84
                      Std.       1.3582           0.99             0.89
Σ[2, 2]     0.00003   Mean       0.000027         0.000028         0.000028
                      Std.       0.000020         0.0000157        0.000015

three cases. For each case, 200 data sets were created, with parameters $\beta_1=15$, $\beta_2=50$, $\beta_3=-0.1$, $\sigma^2=0.04$, $\Sigma=\mathrm{Diag}(2,0.00003)$. These parameters were chosen to be the same as in Davidian and Giltinan (1993). The covariates are as follows:

1. $n=20$, $p=10$, $x_i=(14\ 21\ 28\ 35\ 42\ 49\ 56\ 63\ 70\ 77)$;
2. $n=20$, $p=20$, $x_i=(14\ 17\ 21\ 24\ 28\ 31\ 35\ 38\ 42\ 45\ 49\ 52\ 56\ 59\ 63\ 66\ 70\ 73\ 77\ 80)$;
3. $n=20$, $p=40$, $x_i=(14\ 17\ 21\ 24\ 28\ 31\ 35\ 38\ 42\ 45\ 49\ 52\ 56\ 59\ 63\ 66\ 70\ 73\ 77\ 80\ 14\ 17\ 21\ 24\ 28\ 31\ 35\ 38\ 42\ 45\ 49\ 52\ 56\ 59\ 63\ 66\ 70\ 73\ 77\ 80)$.

According to Theorem 3, $\hat\beta_{1,\mathrm{MLE}}$ is $\sqrt{np}$-consistent. From the case $n=20$, $p=10$ to the case $n=20$, $p=20$, the variance of $\hat\beta_{1,\mathrm{MLE}}$ was reduced by 53%; from the case $n=20$, $p=20$ to the case $n=20$, $p=40$, it was reduced by 42%. This roughly supports the convergence rate (Table 2). In summary, when $p$ is large, the simulation results are consistent with Theorem 3 regarding the convergence rate.

4.3. Design for Crowder's example

Consider the example taken from Crowder (1978), which concerns the proportion of seeds that germinated on each of 21 plates ($i=1,\dots,21$) arranged according to a $2\times2$ factorial layout by seed and type of root extract. We have four treatment groups in total. Five plates (six for one treatment group) are randomly selected. On the $i$th plate, $n_i$ seeds are planted and $y_i$ is the number of germinated seeds on the $i$th plate. These data are also analyzed by, for example, Breslow and Clayton (1993). Let
$$y_i\sim\mathrm{Binomial}(n_i,r_i),\qquad\mathrm{logit}(r_i)=a_0+a_1x_{1i}+a_2x_{2i}+a_{12}x_{1i}x_{2i}+b_i,\qquad b_i\sim\mathrm{Normal}(0,\Sigma),$$
where $x_{1i}$, $x_{2i}$ are the seed type and root extract of the $i$th plate, and an interaction term $a_{12}x_{1i}x_{2i}$ is included. In this design of experiment (a nested factorial design), the MLE of $\beta=(a_0,a_1,a_2,a_{12})$ is $\sqrt{n}$-consistent. If it is allowed to divide each plate into four parts (in many situations this can be done), one may prefer the following design of experiment.
On each of the 21 plates, we plant equal amounts of seeds of all four treatments, each in one randomly selected division of the plate. Let us call this the blocked complete factorial design. Compared to the original design, the design matrix $x_i$ is different. For ease of comparison, let us agree that 20 seeds per plate will be planted for both

L. Nie / Journal of Statistical Planning and Inference 137 (2007) 1787 – 1804

1797

Table 3
Simulation results for factorial design

Parameter  True value         n = 20, p = 20          n = 40, p = 40
                              Original   Blocked      Original   Blocked
a0          -2.5      Mean    -2.59      -2.54        -2.484     -2.480
                      Std.     1.31       0.455        0.355      0.231
a1           1        Mean     1.10       1.05        -0.997      0.996
                      Std.     1.48       0.44         0.513      0.229
a2          -1        Mean    -1.567     -1.11        -1.06      -1.02
                      Std.     3.47       1.264        0.551      0.327
a3          -0.5      Mean    -0.16      -0.53        -0.44      -0.50
                      Std.     3.84       1.396        0.807      0.392
τ            1        Mean     0.6489     0.984        0.907      0.955
                      Std.     0.4894     0.593        0.342      0.314
designs. For the original design, $\ell_i(y_i, \theta)$ can be written as $\ell_i(y_i, a_0 + a_1 + a_2 + a_3 + b_i)$ if the $i$th plate is planted with seeds from the treatment group $x_{1i} = 1$, $x_{2i} = 1$. These are $\theta^{(1)}$-type parameters, so we cannot expect $\sqrt{np}$-consistency for them. On the other hand, in the blocked factorial design, $\ell_i(y_i, \theta)$ can be written as $\ell_i(y_i, a_0 + b_i, a_1, a_2, a_3)$. So the MLE of $a_0$ cannot be $\sqrt{np}$-consistent; however, the MLEs of $(a_1, a_2, a_3)$ are actually $\sqrt{np}$-consistent. Since the main purpose of this study is to compare treatment groups, the overall intercept $a_0$ may not be as important as the other parameters.

Let us use simulation results to confirm this finding. We created 400 data sets: 200 generated from the original design and 200 from the blocked factorial design. In each data set, 20 seeds were planted in each of 20 plates according to the factorial or the blocked factorial design. The results are given in Table 3, where we also present results when both $n$ and $p$ double. Except for the estimation of $\tau$, estimation of all other parameters under the blocked factorial design outperforms that under the original design.

4.4. Convergence rate for some "approximate" MLEs

The MLE or "approximate" MLEs are used to make inference about the unknown parameters. Obtaining MLEs in generalized linear and nonlinear mixed-effects models involves tremendous computational difficulty because the integrated likelihood does not have a closed form. As a consequence, "approximate" MLEs are used more often than the true MLEs. Many "approximate" MLEs have been introduced in the literature, including, but not limited to, the first-order estimator (FOE) of Sheiner and Beal (1980), LB, PQL, MHLE, and CGEE2. As a well-known result, FOE has a potential bias when $\Sigma$ is not small enough, i.e., its asymptotic behavior depends on $\Sigma$, which is not of interest in this paper. The MHLE considers the case where $p \to \infty$ while $n$ is fixed, which is out of the scope of this paper.
Thus, we focus on the asymptotics of LB, PQL, and CGEE2. In the following, we illustrate the convergence rates of some "approximate" MLEs without fully introducing the conditions needed. These conditions consist of two parts: the first part contains the conditions that guarantee the consistency of the MLE, i.e., the conditions in Theorem 3; the second part contains the conditions that guarantee that the "approximate" MLEs are very close to the MLEs. Consider a generalized linear mixed-effects model; the PQL estimate of $\beta$ in Breslow and Clayton (1993) coincides with the CGEE2 estimate in Vonesh et al. (2002). As a consequence, the following asymptotic results apply to both CGEE2


and PQL:
$$\hat\beta_{\mathrm{CGEE2}} - \hat\beta_{\mathrm{MLE}} = O_p(p^{-1/2}).$$
Therefore, under the conditions of Theorem 3,
$$\hat\beta^{(1)}_{\mathrm{CGEE2}} - \beta^{(1)}_0 = O_p(\max(n^{-1/2},\, p^{-1/2}))$$
and
$$\hat\beta^{(2)}_{\mathrm{CGEE2}} - \beta^{(2)}_0 = O_p(p^{-1/2}).$$
Similarly, let us denote the estimator in Vonesh (1996) as the Laplace estimator $\hat\beta_{\mathrm{Lap}}$. We can show, under suitable conditions,
$$\hat\beta^{(1)}_{\mathrm{Lap}} - \beta^{(1)}_0 = O_p(\max(n^{-1/2},\, p^{-1})), \qquad \hat\beta^{(2)}_{\mathrm{Lap}} - \beta^{(2)}_0 = O_p(\max((np)^{-1/2},\, p^{-1})).$$
That means the Laplace estimator of $\beta^{(2)}$ is $\sqrt{N}$-consistent if
$$\lim_{p,n\to\infty} \frac{p}{n} = c > 0,$$
i.e., $n$ and $p$ increase at about the same rate. For the case where $p \to \infty$ while $n$ is fixed, we refer to Lee and Nelder (1996) for the convergence rate of the MHLE; this is beyond the scope of this paper.

5. Discussion

In this paper, we considered mixed-effects models where the asymptotic results rely on two sample sizes, $n$ and $p$, where $n$ is the number of clusters and $p$ is the cluster size. We considered three cases: (1) $n \to \infty$ while $p$ remains bounded; (2) $n, p \to \infty$; (3) $p \to \infty$ while $n$ remains bounded. Case (1) has been considered the standard assumption, namely, that there is not enough information about the individual random effects. Case (2), which is a special case of case 1 in Jiang (1999), assumes there is enough information about the individual random effects. We now discuss the practical usefulness of case (2). In the logistic regression model used to investigate interviewer variability by Anderson and Aitkin (1985), both the number of interviewers ($n = 64$) and the number of people interviewed ($20 \le p \le 35$) are large. Furthermore, it is quite reasonable that the number of people interviewed by one interviewer could increase to hundreds or even more. Similarly, in Tenhave and Localio (1999), $n$ is the number of health care providers while $p$ is the number of patients per provider; it can be seen that both numbers can be large. The author conducted many simulations (some not shown in this paper) for the case where the cluster size $p$ is moderate; the $\sqrt{np}$-consistency for some parameters still seems useful, i.e., when the cluster size increases from one moderate size to another, the variance of the $\sqrt{np}$-consistent estimator is significantly reduced. However, as a referee pointed out, the asymptotic results may not necessarily be the reason for the observed variance reduction.
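The variance-reduction pattern just described can be sketched in the simplest linear mixed model $y_{ij} = \beta + b_i + \epsilon_{ij}$ (a toy illustration under our own assumptions, not a model from this paper; the helper name `simulate_variances` is ours): $\beta$ is confounded with the random effect, so $\mathrm{Var}(\hat\beta) \approx (\tau + \sigma^2/p)/n$ plateaus as $p$ grows, while the within-cluster variance estimator of $\sigma^2$ keeps improving with $p$.

```python
import numpy as np

def simulate_variances(n=50, p_values=(10, 40), reps=2000, tau=1.0, sigma2=1.0, seed=0):
    """Monte Carlo variances of beta-hat (rate stalls in p) and sigma2-hat
    (rate improves in p) in the model y_ij = beta + b_i + eps_ij."""
    rng = np.random.default_rng(seed)
    out = {}
    for p in p_values:
        betas, sig2s = [], []
        for _ in range(reps):
            b = rng.normal(0.0, np.sqrt(tau), size=(n, 1))
            y = 2.0 + b + rng.normal(0.0, np.sqrt(sigma2), size=(n, p))
            betas.append(y.mean())                      # grand mean estimates beta
            # within-cluster variance: unaffected by b_i, pools n*(p-1) df
            sig2s.append(((y - y.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (p - 1)))
        out[p] = (np.var(betas), np.var(sig2s))
    return out

res = simulate_variances()
# Var(beta-hat) ~ (tau + sigma2/p)/n : barely changes as p grows.
# Var(sigma2-hat) ~ 2*sigma2^2/(n*(p-1)) : shrinks roughly like 1/p.
```

Quadrupling $p$ from 10 to 40 here shrinks the variance of the $\sigma^2$ estimator by roughly a factor of four while leaving the variance of $\hat\beta$ nearly unchanged, mirroring the two rate types of Theorem 3.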
The major difference between this article and Jiang (1999) is that we focus on the traditional MLE, while Jiang (1999) proposes some consistent alternative estimators based on conditional inference. Thus, this article takes advantage of many well-established asymptotic properties of the MLE: consistency and asymptotic normality. Jiang (1999) established the consistency of conditional estimators of both random and fixed effects for a very general model; the generalized linear mixed-effects models considered here can be viewed as special cases of Jiang (1999). Section 4.3 presented one example to illustrate the different convergence rates of the MLE under different designs. However, this is just one illustrative example. In practice, one may not be able to divide the raw material (cluster) into several subclusters and perform the complete factorial design on each subject. In that case, much more research needs to be done in order to improve the quality of data through design of experiment. In many applications, all parameters are of $\theta^{(1)}$ type, and thus the MLEs of such parameters are not $\sqrt{np}$-consistent; increasing $p$ then does not necessarily result in any gain in efficiency. Our results do not directly apply to design of experiment in such situations, and further research is needed. In this paper, we focused on models of the form assumed in Condition 3, which does not apply directly to other types of models. Although Theorem 3 is valid only as $p \to \infty$, its results are expected to work well when $p$ is of moderate size. Theorem 3 is expected to provide more understanding of generalized linear and nonlinear mixed-effects models in


terms of data analysis and experiment design. In our future work, we will consider optimal design in generalized linear and nonlinear mixed-effects models based on Theorem 3. Verifications of the conditions for the examples considered in Section 4 are tedious; readers are referred to a technical report, Nie (2005).

Acknowledgment

The author sincerely thanks two referees, the Editorial board member, and the Editor for their constructive suggestions and comments on the earlier version of this article. These comments and suggestions greatly helped to improve the quality of the paper.

Appendix A

A.1. Proof for Proposition 1

Here we directly consider the case where $y_1, \ldots, y_n$ are i.n.i.d. By Lemma 2, we obtain
$$\sqrt{n}\begin{pmatrix} \hat\beta_{\mathrm{MLE}} - \beta_0 \\ \hat\theta_{\mathrm{MLE}} - \theta_0 \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{bmatrix} \frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i(\beta,\theta; y_i)}{\partial\beta\,\partial\beta^{\mathrm T}}\right) & \frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i(\beta,\theta; y_i)}{\partial\beta\,\partial\theta^{\mathrm T}}\right) \\ \frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i(\beta,\theta; y_i)}{\partial\theta\,\partial\beta^{\mathrm T}}\right) & \frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i(\beta,\theta; y_i)}{\partial\theta\,\partial\theta^{\mathrm T}}\right) \end{bmatrix}^{-1} \right).$$
Thus,
$$\sqrt{n}(\hat\beta_{\mathrm{MLE}} - \beta_0) \sim N(0, A_{11}), \qquad \sqrt{n}(\hat\theta_{\mathrm{MLE}} - \theta_0) \sim N(0, A_{22}),$$
where, writing $M_i = M_i(\beta,\theta; y_i)$,
$$A_{11}^{-1} = \frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i}{\partial\beta\,\partial\beta^{\mathrm T}}\right) - \left[\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i}{\partial\beta\,\partial\theta^{\mathrm T}}\right)\right] \left[\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i}{\partial\theta\,\partial\theta^{\mathrm T}}\right)\right]^{-1} \left[\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i}{\partial\theta\,\partial\beta^{\mathrm T}}\right)\right],$$
$$A_{22}^{-1} = \frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i}{\partial\theta\,\partial\theta^{\mathrm T}}\right) - \left[\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i}{\partial\theta\,\partial\beta^{\mathrm T}}\right)\right] \left[\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i}{\partial\beta\,\partial\beta^{\mathrm T}}\right)\right]^{-1} \left[\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(-\frac{\partial^2 \ln M_i}{\partial\beta\,\partial\theta^{\mathrm T}}\right)\right].$$
We will prove the following lemma, which leads to Proposition 1.

Lemma 4.
$$A_{11} = \left[\frac{1}{n}\sum_{i=1}^n E_{y_i}\!\left(\tilde X_i^{\mathrm T}(I + \tilde Z_i \tilde Z_i^{\mathrm T})^{-1}\tilde X_i\right) + O_p(1)\right]^{-1}, \qquad A_{22} = [\,\cdot\, + o_p(1)]^{-1} + o_p(1).$$

Please refer to the technical report Nie (2005) for a proof of this lemma.
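The quantity $\tilde X_i^{\mathrm T}(I + \tilde Z_i \tilde Z_i^{\mathrm T})^{-1}\tilde X_i$ in Lemma 4 is a standardized form of the usual generalized-least-squares information. As a quick numerical sanity check (not part of the paper's proof; the dimensions and covariances below are arbitrary, and folding the random-effects covariance into $\tilde Z_i$ through a Cholesky factor is our assumption about the standardization), the identity $\tilde X^{\mathrm T}(I + \tilde Z \tilde Z^{\mathrm T})^{-1}\tilde X = X^{\mathrm T}(V + Z\Sigma Z^{\mathrm T})^{-1}X$ can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r = 8, 2, 3                         # cluster size, fixed effects, random effects
X = rng.normal(size=(p, q))
Z = rng.normal(size=(p, r))
Sigma = np.diag([2.0, 1.0, 0.3])          # random-effects covariance
V = 0.5 * np.eye(p)                       # conditional covariance of y_i given b_i

# standardize: X-tilde = L^{-1} X and Z-tilde = L^{-1} Z Sigma^{1/2}, with V = L L^T
L_inv = np.linalg.inv(np.linalg.cholesky(V))
Xt = L_inv @ X
Zt = L_inv @ Z @ np.linalg.cholesky(Sigma)

lhs = Xt.T @ np.linalg.inv(np.eye(p) + Zt @ Zt.T) @ Xt   # standardized information
rhs = X.T @ np.linalg.inv(V + Z @ Sigma @ Z.T) @ X       # GLS information
assert np.allclose(lhs, rhs)
```

The agreement follows from $L(I + L^{-1}Z\Sigma Z^{\mathrm T}L^{-\mathrm T})L^{\mathrm T} = V + Z\Sigma Z^{\mathrm T}$, so either form may be used when evaluating the leading term of $A_{11}^{-1}$.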



A.2. Proof for (2.1)

$$-\frac{\partial^2 \ln M_i(\beta,\theta; y_i)}{\partial\beta\,\partial\beta^{\mathrm T}} = \frac{\partial}{\partial\beta^{\mathrm T}}\,\frac{\int \frac{\partial l_i(y_i,\beta,b_i)}{\partial\beta}\exp(-l_i(y_i,\beta,b_i))\,db_i}{\int \exp(-l_i(y_i,\beta,b_i))\,db_i}$$
$$= \frac{\int \frac{\partial^2 l_i}{\partial\beta\,\partial\beta^{\mathrm T}}\exp(-l_i)\,db_i}{\int \exp(-l_i)\,db_i} \tag{A.1}$$
$$\quad - \frac{\int \frac{\partial l_i}{\partial\beta}\frac{\partial l_i}{\partial\beta^{\mathrm T}}\exp(-l_i)\,db_i}{\int \exp(-l_i)\,db_i} \tag{A.2}$$
$$\quad + \frac{\int \frac{\partial l_i}{\partial\beta}\exp(-l_i)\,db_i\ \int \frac{\partial l_i}{\partial\beta^{\mathrm T}}\exp(-l_i)\,db_i}{\left[\int \exp(-l_i)\,db_i\right]^2}. \tag{A.3}$$

Applying Laplace's approximation to (A.1), we obtain
$$\frac{\int \frac{\partial^2 l_i}{\partial\beta\,\partial\beta^{\mathrm T}}\exp(-l_i)\,db_i}{\int \exp(-l_i)\,db_i} = \left.\frac{\partial^2 l_i(y_i,\beta,b_i)}{\partial\beta\,\partial\beta^{\mathrm T}}\right|_{b_i = \hat b_i} + O_p(1).$$
It is worth noting that Laplace's approximation with only the leading term was used here. However, this approximation is not powerful enough to deal with (A.2),
$$\frac{\int \frac{\partial l_i}{\partial\beta}\frac{\partial l_i}{\partial\beta^{\mathrm T}}\exp(-l_i)\,db_i}{\int \exp(-l_i)\,db_i},$$
since $(\partial l_i/\partial\beta)(\partial l_i/\partial\beta^{\mathrm T})$ is a summation of $p^2$ terms, so the leading term can have order $O_p(p^2)$. Therefore, we need to find the term of order $O_p(p)$, if there is one, in the remainder. We can see this clearly in the following simple example; the example is not a particular case of our model, but a standard application of Laplace's approximation. Let
$$l_i(y_i,\beta,b_i) = \frac{(y_i-\beta-b_i)^2}{2p^{-1}} + \frac{\ln(2\pi p^{-1})}{2}, \qquad \frac{\partial l_i(y_i,\beta,b_i)}{\partial\beta} = -p(y_i-\beta-b_i).$$
Then
$$\frac{\int \frac{\partial l_i}{\partial\beta}\frac{\partial l_i}{\partial\beta^{\mathrm T}}\exp(-l_i)\,db_i}{\int \exp(-l_i)\,db_i} = p^2\, E_{b_i|y_i,\beta}\!\left[(y_i-\beta-b_i)^2\right] = p^2\,\mathrm{Var}(b_i|y_i,\beta) = p.$$
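In this example the normalized weight $e^{-l_i}/\int e^{-l_i}\,db_i$ is the Normal$(y_i-\beta,\ 1/p)$ density in $b_i$, so the ratio equals $p^2 \cdot (1/p) = p$ exactly. A quick quadrature check (illustrative only; the values of $y_i$ and $\beta$ are arbitrary, and the `ratio` helper is ours):

```python
import numpy as np

def ratio(y, beta, p, halfwidth=10.0, m=200001):
    """Evaluate  int p^2 (y-beta-b)^2 e^{-l_i} db / int e^{-l_i} db  numerically,
    where e^{-l_i} = sqrt(p/(2*pi)) * exp(-p*(y-beta-b)^2/2)."""
    mu = y - beta                              # center of the weight in b
    b = np.linspace(mu - halfwidth, mu + halfwidth, m)
    w = np.sqrt(p / (2 * np.pi)) * np.exp(-p * (b - mu) ** 2 / 2)
    dx = b[1] - b[0]
    num = np.sum(p**2 * (b - mu) ** 2 * w) * dx
    den = np.sum(w) * dx
    return num / den

for p in (5, 50, 500):
    print(p, ratio(1.3, 0.2, p))   # each ratio is p, up to tiny quadrature error
```

The $O_p(p)$ value of this ratio is exactly what the leading-term Laplace approximation misses in the text's next display.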


On the other hand, if we take Laplace's approximation with the leading term only, then we end up with
$$\frac{\int \frac{\partial l_i}{\partial\beta}\frac{\partial l_i}{\partial\beta^{\mathrm T}}\exp(-l_i)\,db_i}{\int \exp(-l_i)\,db_i} = p^2\,(y_i-\beta-\hat b_i)^2 = 0,$$
which bears an error of order $O_p(p)$. In order to deal with (A.2), we need the following general form of Laplace's approximation with a higher-order expansion.

Lemma 5. Let $x \in R^v$ and let $g(x)$ be a smooth function. Let $\hat x$ denote the minimum point of $l(x)$. The general Laplace approximation is as follows:
$$\int g(x)\exp[p(-l(x)+l(\hat x))]\,dx = \left|\frac{p\,l''(\hat x)}{2\pi}\right|^{-1/2} \sum_{0 \le j} p^{-j} K_j(g).$$
Here,
$$K_j(g) = \sum_{\substack{v_1-u_1=j \\ 2v_1 \ge 3u_1}} \frac{2^{-v_1}}{u_1!\,v_1!}\, \left(D^{\mathrm T} l''(\hat x)^{-1} D\right)^{v_1} \left[l_{\hat x}^{u_1}(x)\,g(x)\right]_{x=\hat x},$$
and
$$l_{\hat x}(x) = l(x) - l(\hat x) - \tfrac{1}{2}(x-\hat x)^{\mathrm T} l''(\hat x)(x-\hat x),$$
where $D^{\mathrm T} = (\partial/\partial x_1, \ldots, \partial/\partial x_v)$ and $l''(x)$ is the Hessian matrix of $l(x)$.

Lemma 5 can be proved from Theorem 7.7.5 of Hormander (1980); readers are referred to Chapter 3 of Nie (2002) for details. By Lemma 5, taking the leading term and the second-order term, after tedious computation using the definition of $K_1(g)$, we obtain
$$\frac{\int \frac{\partial l_i}{\partial\beta}\frac{\partial l_i}{\partial\beta^{\mathrm T}} e^{-l_i}\,db_i}{\int e^{-l_i}\,db_i} - \frac{\int \frac{\partial l_i}{\partial\beta} e^{-l_i}\,db_i}{\int e^{-l_i}\,db_i}\cdot \frac{\int \frac{\partial l_i}{\partial\beta^{\mathrm T}} e^{-l_i}\,db_i}{\int e^{-l_i}\,db_i} = \left.\frac{\partial^2 l_i}{\partial\beta\,\partial b_i^{\mathrm T}} \left(\frac{\partial^2 l_i}{\partial b_i\,\partial b_i^{\mathrm T}}\right)^{-1} \frac{\partial^2 l_i}{\partial b_i\,\partial\beta^{\mathrm T}}\right|_{b_i=\hat b_i} + O_p(1),$$
which leads to
$$-\frac{\partial^2 \ln M_i(\beta,\theta; y_i)}{\partial\beta\,\partial\beta^{\mathrm T}} = \left.\left[\frac{\partial^2 l_i}{\partial\beta\,\partial\beta^{\mathrm T}} - \frac{\partial^2 l_i}{\partial\beta\,\partial b_i^{\mathrm T}} \left(\frac{\partial^2 l_i}{\partial b_i\,\partial b_i^{\mathrm T}}\right)^{-1} \frac{\partial^2 l_i}{\partial b_i\,\partial\beta^{\mathrm T}}\right]\right|_{b_i=\hat b_i} + O_p(1).$$
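As a numerical illustration of the $p^{-j}$ structure in Lemma 5, the following sketch (with an arbitrary smooth $l$ and $g$ of our own choosing, not taken from the paper) checks that the relative error of the leading-term approximation $g(\hat x)\,|p\,l''(\hat x)/(2\pi)|^{-1/2}$ to $\int g(x)\exp(-p\,l(x))\,dx$ decays roughly like $1/p$:

```python
import numpy as np

l = lambda x: x**2 / 2 + 0.1 * x**4     # minimized at x-hat = 0, with l''(0) = 1
g = lambda x: 1.0 + x**2

def exact(p, halfwidth=8.0, m=400001):
    """Dense-grid evaluation of the integral of g(x) * exp(-p*l(x))."""
    x = np.linspace(-halfwidth, halfwidth, m)
    return np.sum(g(x) * np.exp(-p * l(x))) * (x[1] - x[0])

errors = {}
for p in (40, 160, 640):
    leading = g(0.0) * np.sqrt(2 * np.pi / p)   # |p l''(0)/(2 pi)|^{-1/2} * g(x-hat)
    errors[p] = abs(exact(p) - leading) / exact(p)
# the relative error shrinks by roughly 4x each time p is quadrupled,
# consistent with a K_1(g)/p correction term
```

This is precisely why dropping everything beyond $K_0(g)$ is harmless for integrands whose leading term already dominates, but fatal for (A.2), where the leading term vanishes.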


However, for the parameter $\theta$, we can prove that the error term has order $o_p(1)$, i.e.,
$$-\frac{\partial^2 \ln M_i(\beta,\theta; y_i)}{\partial\theta\,\partial\theta^{\mathrm T}} = \left.\left[\frac{\partial^2 l_i}{\partial\theta\,\partial\theta^{\mathrm T}} - \frac{\partial^2 l_i}{\partial\theta\,\partial b_i^{\mathrm T}} \left(\frac{\partial^2 l_i}{\partial b_i\,\partial b_i^{\mathrm T}}\right)^{-1} \frac{\partial^2 l_i}{\partial b_i\,\partial\theta^{\mathrm T}}\right]\right|_{b_i=\hat b_i} + o_p(1). \tag{A.4}$$
That is because
$$\frac{\partial^2 l_i}{\partial\theta\,\partial\theta^{\mathrm T}} = \frac{\partial^2 l_{2i}(y_i,\theta,b_i)}{\partial\theta\,\partial\theta^{\mathrm T}},$$
where $l_{2i}(y_i,\theta,b_i)$ is the loglikelihood of the random effect $b_i$. Therefore, the leading term of the Laplace approximation of $[-\partial^2 \ln M_i(\beta,\theta;y_i)/\partial\theta\,\partial\theta^{\mathrm T}]$ is at most $O_p(1)$, and the remainder has order $O_p(p^{-1})$. The complete details of the proof of (A.4) are similar to the proof of (2.1) and are therefore omitted.

A.3. Formulas for $\tilde X_i$ and $\tilde Z_i$ in some examples

In most (if not all) cases in the generalized linear and nonlinear mixed-effects model literature, since

$$E_{y_i|b_i}\!\left(\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial\beta\,\partial\beta^{\mathrm T}}\right) = \frac{\partial E(y_i|b_i)^{\mathrm T}}{\partial\beta}\,[\mathrm{Var}(y_i|b_i)]^{-1}\,\frac{\partial E(y_i|b_i)}{\partial\beta^{\mathrm T}},$$
$$E_{y_i|b_i}\!\left(\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial b_i\,\partial\beta^{\mathrm T}}\right) = \frac{\partial E(y_i|b_i)^{\mathrm T}}{\partial b_i}\,[\mathrm{Var}(y_i|b_i)]^{-1}\,\frac{\partial E(y_i|b_i)}{\partial\beta^{\mathrm T}},$$
$$E_{y_i|b_i}\!\left(\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial b_i\,\partial b_i^{\mathrm T}}\right) = \frac{\partial E(y_i|b_i)^{\mathrm T}}{\partial b_i}\,[\mathrm{Var}(y_i|b_i)]^{-1}\,\frac{\partial E(y_i|b_i)}{\partial b_i^{\mathrm T}}$$
hold for the regression parameters, we can take
$$\tilde X_i(\beta,b_i) = [\mathrm{Var}(y_i|b_i)]^{-1/2}\,\frac{\partial E(y_i|b_i)}{\partial\beta^{\mathrm T}}, \qquad \tilde Z_i(\beta,b_i) = [\mathrm{Var}(y_i|b_i)]^{-1/2}\,\frac{\partial E(y_i|b_i)}{\partial b_i^{\mathrm T}},$$
where $\tilde Z_i^{\mathrm T}(\beta,b_i)\tilde Z_i(\beta,b_i) = E_{y_i|b_i}(\partial^2 l_{i1}(\beta,b_i)/\partial b_i\,\partial b_i^{\mathrm T})$. In cases where the variance parameter $\sigma^2$ is involved, the formulas for $\tilde X_i(\beta,b_i)$ and $\tilde Z_i(\beta,b_i)$ become more complicated; readers are referred to the following Example 3, a linear model, for details. We also illustrate the choice of $\tilde X_i^{\mathrm T}(\beta,b_i)$ for a logistic regression model.

Example 3. Consider the linear model $y_{ij} = \beta_0 + x_{ij}\beta_1 + z_{ij} b_i + \epsilon_{ij}$, where $\beta_0$ and $\beta_1$ are the intercept and slope of the linear model, and $b_i$ and $\epsilon_{ij}$ have independent normal distributions with mean 0 and variances $\tau$ and $\sigma^2$, respectively. Let $\beta = (\beta_0, \beta_1)$ and
$$X_i^{\mathrm T} = \begin{pmatrix} 1 & \ldots & 1 \\ x_{i1} & \ldots & x_{ip} \end{pmatrix},$$
which is a $2 \times p$ matrix, and $V = \mathrm{Var}(y_i|b_i) = \sigma^2 I_p$, where $I_p$ is the $p \times p$ identity matrix. In the case where we assume that $\sigma^2$ is known,
$$\tilde X_i^{\mathrm T} = X_i^{\mathrm T} V^{-1/2}$$


and $\tilde Z_i^{\mathrm T} = z_i^{\mathrm T} V^{-1/2}$, a $1 \times p$ matrix, where $z_i^{\mathrm T} = (z_{i1}, \ldots, z_{ip})$. Normally, when $\sigma^2$ is unknown, then $\theta = (\beta_0, \beta_1, \sigma^2)$ and we define
$$X_i^{\mathrm T} = \begin{pmatrix} 1 & \ldots & 1 & 0 & \ldots & 0 \\ x_{i1} & \ldots & x_{ip} & 0 & \ldots & 0 \\ 0 & \ldots & 0 & 1 & \ldots & 1 \end{pmatrix}, \qquad V = \begin{pmatrix} \sigma^2 I_p & 0_{p\times p} \\ 0_{p\times p} & \sigma^4 I_p \end{pmatrix}.$$
Finally,
$$\tilde X_i^{\mathrm T} = X_i^{\mathrm T} V^{-1/2}, \qquad \tilde Z_i^{\mathrm T} = Z_i^{\mathrm T} V^{-1/2},$$
where $Z_i^{\mathrm T} = (z_{i1}, \ldots, z_{ip}, 0, \ldots, 0)$. The complete details of these derivations can be found in a technical report, Nie (2005).

Example 4. Consider the logistic regression model $y_{ij} \sim \mathrm{Binomial}(1, p_{ij})$, $\mathrm{logit}(p_{ij}) = \beta_0 + x_{ij}\beta_1 + z_{ij} b_i$, where $\beta_0$ and $\beta_1$ are the intercept and slope of the linear predictor, and $b_i$ has a normal distribution with mean 0 and variance $\tau$. Let $\beta = (\beta_0, \beta_1)$ and
$$v_{ij} = \mathrm{var}(y_{ij}|b_i) = \frac{\exp(\beta_0 + x_{ij}\beta_1 + z_{ij}b_i)}{[1 + \exp(\beta_0 + x_{ij}\beta_1 + z_{ij}b_i)]^2}.$$
Then
$$E_{y_i}\!\left(\frac{\partial^2 l_{i1}(y_i,\beta,b_i)}{\partial\beta\,\partial\beta^{\mathrm T}}\right) = \frac{\partial E(y_i|b_i)^{\mathrm T}}{\partial\beta}\,[\mathrm{Var}(y_i|b_i)]^{-1}\,\frac{\partial E(y_i|b_i)}{\partial\beta^{\mathrm T}},$$
where $[\mathrm{Var}(y_i|b_i)] = \mathrm{Diag}(v_{i1}, \ldots, v_{ip})$ and
$$\left(\frac{\partial E(y_i|b_i)}{\partial\beta}\right)^{\mathrm T} = \begin{pmatrix} v_{i1} & \ldots & v_{ip} \\ x_{i1}v_{i1} & \ldots & x_{ip}v_{ip} \end{pmatrix}.$$
Therefore, we take
$$\tilde X_i^{\mathrm T} = \begin{pmatrix} \sqrt{\hat v_{i1}} & \ldots & \sqrt{\hat v_{ip}} \\ x_{i1}\sqrt{\hat v_{i1}} & \ldots & x_{ip}\sqrt{\hat v_{ip}} \end{pmatrix} = \frac{\partial E(y_i|\hat b_i)^{\mathrm T}}{\partial\beta}\,[\mathrm{Var}(y_i|\hat b_i)]^{-1/2},$$
which is a $2 \times p$ matrix with $\hat v_{ij} = \exp(\beta_0 + x_{ij}\beta_1 + z_{ij}\hat b_i)/[1 + \exp(\beta_0 + x_{ij}\beta_1 + z_{ij}\hat b_i)]^2$, and
$$\tilde Z_i^{\mathrm T} = \left(z_{i1}\sqrt{\hat v_{i1}}, \ldots, z_{ip}\sqrt{\hat v_{ip}}\right) = \frac{\partial E(y_i|\hat b_i)^{\mathrm T}}{\partial b_i}\,[\mathrm{Var}(y_i|\hat b_i)]^{-1/2}.$$

References

Anderson, D.A., Aitkin, M., 1985. Variance component models with binary response: interviewer variability. J. Roy. Statist. Soc. Ser. B 47, 203–210.
Barndorff-Nielsen, O.E., Cox, D.R., 1989. Asymptotic Techniques for Use in Statistics. Chapman & Hall, New York.
Booth, J.G., Hobert, J.P., 1998. Standard errors of prediction in generalized linear mixed models. J. Amer. Statist. Assoc. 93, 262–272.
Booth, J.G., Hobert, J.P., 1999. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. J. Roy. Statist. Soc. Ser. B 61, 265–285.
Bradley, R.A., Gart, J.J., 1962. The asymptotic properties of ML estimators when sampling from associated populations. Biometrika 49 (1,2), 205–214.
Breslow, N.E., Clayton, D.G., 1993. Approximate inference in generalized linear mixed models. J. Amer. Statist. Assoc. 88 (421), 9–25.
Burton, P.R., Tiller, K.J., Gurrin, L.C., Cookson, W.O.C.M., Musk, A.W., Palmer, L.J., 1999. Genetic variance components analysis for binary phenotypes using generalized linear mixed models (GLMMs) and Gibbs sampling. Genetic Epidemiology 17, 118–140.
Cole, D., Morgan, B., Ridout, M.S., 2003. Generalized linear mixed models for strawberry inflorescence data. Statist. Modelling 3, 273–290.
Cramer, H., 1946. Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ.
Crowder, M., 1978. Beta-binomial anova for proportions. Appl. Statist. 27, 34–37.


Davidian, M., Giltinan, D.M., 1993. Some general estimation methods for non-linear mixed-effects models. J. Biopharm. Statist. 3 (1), 23–55.
Hedeker, D., Gibbons, R.D., 1994. A random-effects ordinal regression model for multilevel analysis. Biometrics 50, 933–944.
Hormander, L., 1980. The Analysis of Linear Partial Differential Operators I. A Series of Comprehensive Studies in Mathematics, vol. 256. Springer, Berlin.
Jiang, J., 1998. Consistent estimators in generalized linear mixed models. J. Amer. Statist. Assoc. 93, 720–729.
Jiang, J., 1999. Conditional inference about generalized linear mixed models. Ann. Statist. 27 (6), 1974–2007.
Kwan, K.C., Breault, G.O., Umbenhauer, E.R., McMahon, F.G., 1976. Kinetics of indomethacin absorption, elimination, and enterohepatic circulation in man. J. Pharmacokinetics Biopharmaceutics 4, 255–280.
Lee, Y., Nelder, J.A., 1996. Hierarchical generalized linear models. J. Roy. Statist. Soc. B 58, 619–678.
Lehmann, E.L., 1983. Theory of Point Estimation. Wiley, New York.
Lindstrom, M.J., Bates, D.M., 1990. Nonlinear mixed effects models for repeated measures data. Biometrics 46, 673–687.
Lu, W., Ramsay, J.G., Bailey, J.M., 2003. Reliability of pharmacodynamic analysis by logistic regression, mixed-effects modeling. Anesthesiology 99, 1255–1262.
McCulloch, C.E., 1997. Maximum likelihood algorithms for generalized linear mixed models. J. Amer. Statist. Assoc. 92, 162–170.
McCulloch, C.E., 2003. Generalized Linear Mixed Models. CBMS/IMS Monograph Series, vol. 7.
Mentré, F., Gomeni, R., 1995. A two-step iterative algorithm for estimation in nonlinear mixed-effect models with an evaluation in population pharmacokinetics. J. Biopharm. Statist. 5 (2), 141–158.
Nie, L., 2002. Laplace approximation in nonlinear mixed-effects models. Ph.D. Thesis, University of Illinois at Chicago.
Nie, L., 2005. Technical proofs for some results in the paper "Convergence rate of MLE in generalized linear and nonlinear mixed-effects models: theory and applications". Technical Report.
Nie, L., Yang, M., 2005. Strong consistency of the maximum likelihood estimator in generalized linear and nonlinear mixed-effects models. Sankhya 67 (4), 736–763.
Pinheiro, J.C., Bates, D.M., 1995. Approximation to the log-likelihood function in the nonlinear mixed-effects models. J. Comput. Graphical Statist. 4, 12–35.
Roe, D.J., 1997. Comparison of population pharmacokinetic modeling methods using simulated data: results from the population modeling workgroup. Statist. Med. 16, 1241–1262.
Sheiner, L.B., Beal, S.L., 1980. Evaluation of methods for estimating population pharmacokinetic parameters, I. Michaelis–Menten model: routine clinical pharmacokinetic data. J. Pharmacokinetics Biopharm. 8, 553–571.
Sinha, S., 2004. Robust analysis in generalized linear mixed models. J. Amer. Statist. Assoc. 99 (466), 451–460.
Ten Have, T., Miller, M., Reboussin, B., James, M., 2000. Mixed effects logistic regression models for longitudinal ordinal functional response data with multiple-cause drop-out from the longitudinal study of aging. Biometrics 56 (1), 279–287.
Tenhave, T., Localio, A.R., 1999. Empirical Bayes estimation of random effects parameters in mixed effects logistic regression models. Biometrics 55, 1022–1029.
Tierney, L., Kass, R.E., Kadane, J.B., 1989. Fully exponential Laplace approximations to expectations and variances of non-positive functions. J. Amer. Statist. Assoc. 84 (407), 710–716.
Vonesh, E.F., 1996. A note on the use of Laplace's approximation for nonlinear mixed-effects models. Biometrika 83 (2), 447–452.
Vonesh, E.F., Chinchilli, V.M., 1997. Linear and Nonlinear Models for the Analysis of Repeated Measurements. Marcel Dekker, Inc., New York.
Vonesh, E.F., Wang, H., Nie, L., Majumdar, D., 2002. Conditional second-order generalized estimating equations for generalized linear and nonlinear mixed-effects models. J. Amer. Statist. Assoc. 96, 282–291.
Wald, A., 1949. Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist. 20, 595–601.
Wolfinger, R.D., Lin, X., 1997. Two Taylor-series approximation methods for nonlinear mixed models. Comput. Statist. Data Anal. 25, 465–490.
Wu, H., Ding, A.A., 1999. Population HIV-1 dynamics in vivo: applicable models and inferential tools for virological data from AIDS clinical trials. Biometrics 55, 410–418.
Zeger, S.L., Karim, M.R., 1991. Generalized linear models with random effects; a Gibbs sampling approach. J. Amer. Statist. Assoc. 86, 79–86.