Estimation with unbalanced panel data having covariate measurement error


Journal of Statistical Planning and Inference 141 (2011) 800–808


Journal of Statistical Planning and Inference journal homepage: www.elsevier.com/locate/jspi

Jun Shao a,*, Zhiguo Xiao b, Ruifeng Xu c

a School of Finance and Statistics, East China Normal University, China
b Department of Statistics, School of Management, Fudan University, China
c Department of Health Economic Statistics, Merck Research Laboratories, USA

Article info

Article history: Received 27 October 2009; received in revised form 30 July 2010; accepted 4 August 2010; available online 10 August 2010.

Keywords: Unbalanced panel data; Measurement error; Generalized method of moments; Weighted method of moments; Nonignorable missingness

Abstract

Panel data with covariate measurement error appear frequently in various studies. Due to the sampling design and/or missing data, panel data are often unbalanced in the sense that panels have different sizes. For balanced panel data (i.e., panels having the same size), there exists a generalized method of moments (GMM) approach for adjusting covariate measurement error, which does not require additional validation data. This paper extends the GMM approach of adjusting covariate measurement error to unbalanced panel data. Two health related longitudinal surveys are used to illustrate the implementation of the proposed method. © 2010 Elsevier B.V. All rights reserved.

☆ The research was partially supported by the NSF Grant SES-0705033.
* Corresponding author. E-mail address: [email protected] (J. Shao).
0378-3758/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2010.08.003

1. Introduction

Panel or longitudinal data are often collected from every sampled subject in research areas such as medicine, population health, economics, social sciences, and sample surveys. When there is a linear relationship between an outcome variable and some covariates, it is well known that covariate measurement error, if ignored, leads to biased and inconsistent estimators of regression parameters. Covariate measurement error often occurs in practice. For example, in the Beaver Dam Health Outcomes Study (BDHOS) discussed in Section 5 in detail, the covariate SF-1, a self-rated value of a person's overall health, is measured with error because the measurement can be easily affected by the mood and temporal events when the participant was interviewed. In the Wisconsin Sleep Cohort Study (WSCS) discussed in Section 5, covariates such as the sleep latency (minutes it takes for a person to fall asleep), sleep time (hours of sleep during weekday nights), and body mass index are all difficult to measure accurately.

For cross-sectional data, most remedies for covariate measurement error require extra validation or replication data (e.g., Fuller, 1987; Carroll et al., 2006), which are costly to obtain under many circumstances. For panel data, however, as pointed out by Griliches and Hausman (1986), valid estimators can be constructed without requiring extra validation or replication data. In this paper we focus on the approach of generalized method of moments (GMM, Hansen, 1982) proposed by Wansbeek (2001) for panel data, which has been shown to be equivalent to Griliches and Hausman's (1986) approach by Xiao et al. (in press). A closely related paper is Biørn (2000), which discussed instrumental variable and GMM estimators based on difference transformations under quite general assumptions. The key to Wansbeek's GMM is the construction of some moment equations based on the correlation structure between the measurement errors and the covariates.

Wansbeek's GMM and all other existing methods were proposed for balanced panel data in the sense that subjects have the same number of panel outcomes. In practice, however, it is common to have unbalanced panel data. Two main reasons for having unbalanced data are (1) the sampling design is not balanced; for example, the follow-up is not the same for every subject, or "subjects" are clusters (such as households) containing unequal numbers of units, or subjects are selected according to a rotating scheme (see, e.g., Hsiao, 2003); and (2) some data are missing, although the original design is balanced. In the BDHOS and WSCS discussed in Section 5, data are unbalanced due to missing values.

The main purpose of this paper is to derive and study GMM estimators for unbalanced panel data with measurement error. In Section 2, we develop the GMM estimation for the case where unbalanced data are obtained by design. Our main idea is to divide data into categories so that each category has balanced panels and to efficiently combine GMM estimators in different categories, which is referred to as a multiple groups analysis; see, e.g., Jöreskog (1971), Muthén (1989), and Wansbeek and Meijer (2000). We show that the combined GMM estimator is asymptotically optimal in a wider class of estimators obtained using the weighted method of moments (Xiao, 2010). When unbalanced data are caused by missing values, an assumption on the missingness mechanism has to be added in order to derive valid estimators. In Section 3, we obtain asymptotically valid GMM estimators in the situation where the missingness mechanism depends on the observed covariates, the unobserved covariates (the true values of covariates measured with error), and an unobserved random subject effect. An asymptotically chi-square distributed test statistic based on the GMM approach is provided in Section 4 for testing the existence of covariate measurement error. Our proposed method is illustrated in Section 5 in two real data examples, the BDHOS and WSCS. Section 6 concludes the paper with some remarks.

2. Results for unbalanced sampling design

Consider an independent sample of $n$ subjects with $y_i$ being the vector of $T_i \ge 2$ outcomes for the $i$th subject, $i = 1,\dots,n$. For subject $i$, let $X_i^*$ be the $T_i \times p$ matrix whose columns are $p$ covariate values measured with error and $Z_i$ be the $T_i \times q$ matrix whose columns are $q$ covariate values measured without error. Thus, $Z_i$ is observed but $X_i^*$ is not observed, and $X_i$, a surrogate for $X_i^*$, is observed. We assume the following linear model:

$$ y_i = X_i^*\beta + Z_i\gamma + \alpha_i 1_{T_i} + \varepsilon_i, \qquad X_i = X_i^* + U_i, \tag{1} $$

where the $T_i$'s are integers between 2 and $T$ (a fixed integer $T$ not depending on $n$), $\beta$ and $\gamma$ are unknown parameter vectors of order $p$ and $q$, respectively, $1_k$ denotes a $k$-vector of ones, $\alpha_i$ is an unobserved individual effect which can either be fixed or random, $\varepsilon_i$ is a $T_i$-vector of unobserved random errors with finite second-order moments, $E(\varepsilon_i) = 0$, $U_i$ is a $T_i \times p$ matrix of unobserved measurement errors with finite second-order moments, $E(U_i) = 0$, $(X_i^*, Z_i)$, $U_i$, and $\varepsilon_i$ are mutually independent, $E(Z_i'Z_i)$ and $E(X_i^{*\prime}X_i^*)$ are positive definite, and $A'$ denotes the transpose of $A$. We also assume that there are disjoint subsets $\mathcal{G}_1,\dots,\mathcal{G}_L$ of $\{1,\dots,n\}$ such that $(X_i^*, Z_i, U_i, \varepsilon_i)$, $i \in \mathcal{G}_l$, are identically distributed for each $l = 1,\dots,L$. The subjects with indices in $\mathcal{G}_l$ form a balanced set of data with the same panel size and sample size $n_l$. We denote the panel size of group $\mathcal{G}_l$ by $T_l$.

To apply Wansbeek's approach and its extension to multivariate $X_i$ (Xiao et al., 2010), which will be called the GMM in the sequel, we need some assumption on the covariance matrix of the measurement error $U_i$. Let $u_{ij}$ be the $j$th column of $U_i$. We assume that, for each $l$,

$$ \mathrm{vec}[E(u_{ik}u_{ij}')] = R_{0kjl}\lambda_{kjl}, \qquad i \in \mathcal{G}_l, \quad k, j = 1,\dots,p. \tag{2} $$

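In (2), $R_{0kjl}$ is a known matrix, $\lambda_{kjl}$ is an unknown vector of parameters, and $\mathrm{vec}(A)$ is the stack operator applied to matrix $A$ (applied to the diagonal and upper triangle part of $A$ if $A$ is symmetric). When the panel size is 2 and $p = 1$, for example, $u_{ik} = u_{ij} = u_i$ and

$$ \mathrm{vec}[E(u_iu_i')] = \mathrm{vec}\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix} = R_{011l}\lambda_{11l} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \sigma_{11} \\ \sigma_{12} \\ \sigma_{22} \end{bmatrix} \tag{3} $$

for $i \in \mathcal{G}_l$. If we assume that components of $u_{ij}$ are independent, then $\sigma_{12} = 0$ and

$$ \mathrm{vec}[E(u_iu_i')] = R_{011l}\lambda_{11l} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \sigma_{11} \\ \sigma_{22} \end{bmatrix} \tag{4} $$

for $i \in \mathcal{G}_l$. On the other hand, if components of $u_{ij}$ are identically distributed, then

$$ \mathrm{vec}[E(u_iu_i')] = R_{011l}\lambda_{11l} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \sigma_{11} \\ \sigma_{12} \end{bmatrix} \tag{5} $$

for $i \in \mathcal{G}_l$. If $\lambda_{kjl}$ in (2) has the largest possible dimension such as in case (3), then (2) is not an assumption. However, some restriction on $E(u_{ik}u_{ij}')$ such as that in case (4) or (5) may be needed in order to identify the parameter $\beta$ (see the discussion in Xiao et al., in press).

Let $\otimes$ be the Kronecker product, $I_t$ be the identity matrix of order $t$, $A_t = I_t - t^{-1}1_t1_t'$,

$$ R_{0l} = \begin{bmatrix} R_{011l} & R_{021l} & \cdots & R_{0p1l} \\ \vdots & \vdots & & \vdots \\ R_{01pl} & R_{02pl} & \cdots & R_{0ppl} \end{bmatrix}, $$

and $R_l = (I_{pT_l} \otimes A_{T_l})R_{0l}$, $l = 1,\dots,L$. For balanced data in $\mathcal{G}_l$, we consider the following moment conditions:

$$ E\left( M_l \left\{ \begin{bmatrix} \mathrm{vec}(X_i) \\ \mathrm{vec}(Z_i) \end{bmatrix} \otimes (y_i - X_i\beta - Z_i\gamma) \right\} \right) = 0, \tag{6} $$

where

$$ M_l = \begin{bmatrix} [I_{pT_l^2} - R_l(R_l'R_l)^{-}R_l'](I_{pT_l} \otimes A_{T_l}) & 0 \\ 0 & I_{qT_l} \otimes A_{T_l} \end{bmatrix}. $$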
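To make the construction of $M_l$ concrete, the following is a minimal NumPy sketch for a single balanced group with panel size 2 and $p = q = 1$, using the structure matrices from cases (3), (4) and (5). The helper names `centering` and `build_M`, the use of the Moore–Penrose inverse for $(R_l'R_l)^{-}$, and the numerical tolerance are illustrative assumptions, not part of the paper.

```python
import numpy as np


def centering(t):
    """Within-subject centering matrix A_t = I_t - (1/t) 1_t 1_t'."""
    return np.eye(t) - np.ones((t, t)) / t


def build_M(R0, T, p=1, q=1):
    """Assemble R_l and M_l for one balanced group (hypothetical helper).

    R0 is the known structure matrix R_0l with p*T*T rows.
    The generalized inverse of R_l'R_l is taken to be the Moore-Penrose
    inverse here, which is an assumption of this sketch.
    """
    Wx = np.kron(np.eye(p * T), centering(T))   # I_{pT} (x) A_T
    Wz = np.kron(np.eye(q * T), centering(T))   # I_{qT} (x) A_T
    R = Wx @ R0                                  # R_l = (I_{pT} (x) A_T) R_0l
    proj = R @ np.linalg.pinv(R.T @ R) @ R.T     # projection onto col(R_l)
    top = (np.eye(p * T * T) - proj) @ Wx        # upper-left block of M_l
    M = np.block([
        [top, np.zeros((p * T * T, q * T * T))],
        [np.zeros((q * T * T, p * T * T)), Wz],
    ])
    return R, top, M


# Structure matrices R_011l of cases (3), (4) and (5): panel size 2, p = 1.
R0_cases = {
    "(3) unrestricted": np.array(
        [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float),
    "(4) independent": np.array(
        [[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float),
    "(5) identically distributed": np.array(
        [[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float),
}

TOL = 1e-10  # explicit tolerance so numerically-zero blocks get rank 0
ranks = {}
for name, R0 in R0_cases.items():
    R, top, M = build_M(R0, T=2)
    # The X-block of M_l removes every direction spanned by R_0l, i.e. the
    # part of the moments driven by the measurement-error covariance.
    assert np.allclose(top @ R0, 0.0)
    ranks[name] = (
        int(np.linalg.matrix_rank(top, tol=TOL)),  # rank of the X-block
        int(np.linalg.matrix_rank(M, tol=TOL)),    # rank of M_l
        M.shape[0],                                # order of M_l
    )
    print(name, ranks[name])
```

In this small setting $M_l$ is rank deficient in all three cases, and the block attached to $\mathrm{vec}(X_i)$ keeps a nonzero rank only under case (5). This appears consistent with the remark that a restriction on $E(u_{ik}u_{ij}')$ may be needed to identify $\beta$, although the sketch covers only panel size 2.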

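Since $M_l$ is a singular matrix, there are redundant moment conditions in (6). Let the singular value decomposition of $M_l$ be $M_l = P_l\Lambda_lQ_l$, where $\Lambda_l$ is a non-singular diagonal matrix and $Q_l$ is of full rank. Then (6) is equivalent to

$$ E\left( Q_l \left\{ \begin{bmatrix} \mathrm{vec}(X_i) \\ \mathrm{vec}(Z_i) \end{bmatrix} \otimes (y_i - X_i\beta - Z_i\gamma) \right\} \right) = 0. \tag{7} $$

Let $\theta = (\beta', \gamma')'$,

$$ d_{il} = Q_l \left\{ \begin{bmatrix} \mathrm{vec}(X_i) \\ \mathrm{vec}(Z_i) \end{bmatrix} \otimes y_i \right\}, \qquad C_{il} = Q_l \left\{ \begin{bmatrix} \mathrm{vec}(X_i) \\ \mathrm{vec}(Z_i) \end{bmatrix} \otimes (X_i, Z_i) \right\}, $$

$\bar d_l = \sum_{i \in \mathcal{G}_l} d_{il}/n_l$, and $\bar C_l = \sum_{i \in \mathcal{G}_l} C_{il}/n_l$. The GMM estimator based on (7) and balanced data in $\mathcal{G}_l$ is

$$ \hat\theta_l = (\bar C_l'\hat\Gamma_l^{-1}\bar C_l)^{-1}\bar C_l'\hat\Gamma_l^{-1}\bar d_l, \tag{8} $$

which is a minimizer of

$$ J_l(\theta) = (\bar d_l - \bar C_l\theta)'\hat\Gamma_l^{-1}(\bar d_l - \bar C_l\theta), $$

with $\hat\Gamma_l$ being an estimator of $\Gamma_l = E[(d_{il} - C_{il}\theta)(d_{il} - C_{il}\theta)']$, $i \in \mathcal{G}_l$.

Since we can use $L$ independent groups of data to estimate the same $\theta$, the multiple groups analysis (Jöreskog, 1971; Muthén, 1989; Wansbeek and Meijer, 2000, pp. 216–218) can be applied, i.e., we minimize the following weighted sum of $J_l(\theta)$, $l = 1,\dots,L$:

$$ J(\theta) = \sum_{l=1}^{L} n_lJ_l(\theta) = \sum_{l=1}^{L} n_l(\bar d_l - \bar C_l\theta)'\hat\Gamma_l^{-1}(\bar d_l - \bar C_l\theta). \tag{9} $$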

The resulting combined GMM estimator is

$$ \hat\theta_{GMM} = \left( \sum_{l=1}^{L} n_l\bar C_l'\hat\Gamma_l^{-1}\bar C_l \right)^{-1} \sum_{l=1}^{L} n_l\bar C_l'\hat\Gamma_l^{-1}\bar d_l, \tag{10} $$

which is a function of the estimators $\hat\theta_l$, $l = 1,\dots,L$, given by (8). The following result summarizes the asymptotic behavior of $\hat\theta_{GMM}$. The proof can be found in the Appendix.

Theorem 1. Assume model (1), assumption (2), and that $\hat\Gamma_l$ converges in probability to a positive definite matrix $\Gamma_{0l}$ (not necessarily $\Gamma_l$), $l = 1,\dots,L$. Then, as $n \to \infty$,

$$ V_n^{-1/2}(\hat\theta_{GMM} - \theta) \to_d N(0, I_{p+q}), $$

where $\to_d$ is convergence in distribution, $V_n^{-1/2}$ is the inverse of the square root of the matrix

$$ V_n = \left( \sum_{l=1}^{L} n_lC_l'\Gamma_{0l}^{-1}C_l \right)^{-1} \left( \sum_{l=1}^{L} n_lC_l'\Gamma_{0l}^{-1}\Gamma_l\Gamma_{0l}^{-1}C_l \right) \left( \sum_{l=1}^{L} n_lC_l'\Gamma_{0l}^{-1}C_l \right)^{-1}, $$

and $C_l = E(\bar C_l)$.

For the asymptotic normality of $\hat\theta_{GMM}$, we do not require $\hat\Gamma_l$ to be a consistent estimator of $\Gamma_l$, although $\hat\theta_{GMM}$ is asymptotically efficient if each $\hat\Gamma_l$ is consistent ($\Gamma_{0l} = \Gamma_l$). Theorem 1 applies to the simple unweighted GMM estimator

$$ \hat\theta = \left( \sum_{l=1}^{L} n_l\bar C_l'\bar C_l \right)^{-1} \sum_{l=1}^{L} n_l\bar C_l'\bar d_l, $$

which may not be asymptotically efficient, but using $\hat\theta$ does not require the estimation of the $\Gamma_l$'s. Furthermore, $\hat\theta$ can be used to construct the following estimator of $\Gamma_l$:

$$ \tilde\Gamma_l = \frac{1}{n_l} \sum_{i \in \mathcal{G}_l} (d_{il} - C_{il}\hat\theta)(d_{il} - C_{il}\hat\theta)'. $$
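The two-step procedure just described (unweighted estimator, then $\tilde\Gamma_l$, then the combined estimator (10)) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the per-subject arrays `d_groups` and `C_groups` stand in for $d_{il}$ and $C_{il}$ and are filled with synthetic data generated as $d_{il} = C_{il}\theta + e_{il}$, not with moments built from model (1); the function names `combine` and `two_step` are hypothetical.

```python
import numpy as np


def combine(d_bars, C_bars, Gammas, sizes):
    """Combined estimator (10) from group averages and weight matrices."""
    k = C_bars[0].shape[1]
    lhs, rhs = np.zeros((k, k)), np.zeros(k)
    for d_bar, C_bar, Gamma, n_l in zip(d_bars, C_bars, Gammas, sizes):
        GinvC = np.linalg.solve(Gamma, C_bar)   # Gamma^{-1} C_bar
        lhs += n_l * C_bar.T @ GinvC            # n_l C' Gamma^{-1} C
        rhs += n_l * GinvC.T @ d_bar            # n_l C' Gamma^{-1} d (Gamma symmetric)
    return np.linalg.solve(lhs, rhs)


def two_step(d_groups, C_groups):
    """Unweighted first step, Gamma-tilde, then the combined estimator."""
    sizes = [d.shape[0] for d in d_groups]
    d_bars = [d.mean(axis=0) for d in d_groups]
    C_bars = [C.mean(axis=0) for C in C_groups]
    eyes = [np.eye(d.shape[1]) for d in d_groups]
    theta_unw = combine(d_bars, C_bars, eyes, sizes)   # unweighted estimator
    Gammas = []
    for d, C in zip(d_groups, C_groups):
        resid = d - C @ theta_unw                       # rows: d_il - C_il theta
        Gammas.append(resid.T @ resid / d.shape[0])     # Gamma-tilde for group l
    theta_gmm = combine(d_bars, C_bars, Gammas, sizes)
    return theta_unw, theta_gmm, (d_bars, C_bars, Gammas, sizes)


# Synthetic illustration (assumed data-generating process, two groups).
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -0.5])
C_means = [
    np.array([[1, 0], [0, 1], [1, 1]], dtype=float),
    np.array([[1, 0], [0, 1], [1, -1], [2, 1], [0.5, 0.5]], dtype=float),
]
d_groups, C_groups = [], []
for C_mean, n_l in zip(C_means, [2000, 3000]):
    C = C_mean + rng.normal(size=(n_l,) + C_mean.shape)   # per-subject C_il
    e = rng.normal(size=(n_l, C_mean.shape[0]))           # mean-zero e_il
    C_groups.append(C)
    d_groups.append(C @ theta_true + e)

theta_unw, theta_gmm, (d_bars, C_bars, Gammas, sizes) = two_step(d_groups, C_groups)

# (10) is a matrix-weighted average of the group estimators (8).
weights = [n * C.T @ np.linalg.solve(G, C) for C, G, n in zip(C_bars, Gammas, sizes)]
theta_l = [combine([d], [C], [G], [n])
           for d, C, G, n in zip(d_bars, C_bars, Gammas, sizes)]
theta_from_groups = np.linalg.solve(
    sum(weights), sum(w @ t for w, t in zip(weights, theta_l)))
print(theta_unw, theta_gmm, theta_from_groups)
```

The last lines check numerically that (10) can be written as a function of the group-level estimators $\hat\theta_l$, with weights $n_l\bar C_l'\hat\Gamma_l^{-1}\bar C_l$.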

The estimator $\tilde\Gamma_l$ is consistent if $n_l \to \infty$ as $n \to \infty$. Hence, $\hat\theta_{GMM}$ with $\hat\Gamma_l = \tilde\Gamma_l$ is asymptotically efficient. In applications, the condition of large $n_l$ for all $l$ needs to be carefully checked in order to apply the asymptotic theory. A consistent estimator of the asymptotic variance of $\hat\theta_{GMM}$ can be obtained by replacing unknown parameters in $V_n$ by their consistent estimators.

Although the use of the weighted sum $J(\theta)$ in (9) is intuitively appealing, it is natural to ask whether this is the most efficient way of combining the GMM estimators $\hat\theta_1,\dots,\hat\theta_L$. Since $\hat\theta_{GMM}$ is the unique solution to

$$ \frac{\partial J(\theta)}{\partial\theta} = -2\sum_{l=1}^{L} n_l\bar C_l'\hat\Gamma_l^{-1}(\bar d_l - \bar C_l\theta) = 0, $$

it is a special case of a general class of estimators that are solutions to

$$ \sum_{l=1}^{L} P_l \sum_{i \in \mathcal{G}_l} (d_{il} - C_{il}\theta) = 0, \tag{11} $$

where $P_l$ is a matrix whose number of rows equals the dimension of $\theta$ and whose number of columns equals the number of rows of $Q_l$, $l = 1,\dots,L$. This follows the idea of the weighted method of moments in Xiao (2010) for combining estimators from homogeneous populations. For a given set of $P_1,\dots,P_L$, the solution to (11) is

$$ \tilde\theta = \left( \sum_{l=1}^{L} n_lP_l\bar C_l \right)^{-1} \sum_{l=1}^{L} n_lP_l\bar d_l. $$

Under model (1) and assumption (2), using an argument similar to that in the proof of Theorem 1, we can show that

$$ \tilde V_n^{-1/2}(\tilde\theta - \theta) \to_d N(0, I_{p+q}) $$

as $n \to \infty$, where

$$ \tilde V_n = \left( \sum_{l=1}^{L} n_lP_lC_l \right)^{-1} \left( \sum_{l=1}^{L} n_lP_l\Gamma_lP_l' \right) \left[ \left( \sum_{l=1}^{L} n_lP_lC_l \right)' \right]^{-1}. $$

For any $P_1,\dots,P_L$, an application of a generalized version of the matrix Cauchy–Schwarz inequality (Wansbeek and Meijer, 2000, pp. 359–360; Xiao, 2010) shows that

$$ \tilde V_n - \left( \sum_{l=1}^{L} n_lC_l'\Gamma_l^{-1}C_l \right)^{-1} \ge 0 $$

(i.e., the difference is nonnegative definite), with equality achieved by letting $P_l = C_l'\Gamma_l^{-1}$. This means that among all estimators that are solutions to (11), the one with $P_l = C_l'\Gamma_l^{-1}$ is asymptotically optimal. If we replace $C_l$ and $\Gamma_l$ by their consistent estimators $\bar C_l$ and $\hat\Gamma_l$ in (9), we obtain the GMM estimator $\hat\theta_{GMM}$ in (10) as an asymptotically optimal estimator.

3. Results for incomplete data

Consider the situation where the original data are balanced (i.e., model (1) holds with $T_i = T$ for all $i$), but missing outcomes result in unbalanced data. We assume that each subject has at least two observations, which is necessary for the application of the GMM in Section 2.

Let $\Delta_i$ be the $T$-vector whose $j$th component is 0 if the $j$th component of $y_i$ is missing and 1 if the $j$th component of $y_i$ is observed, $j = 1,\dots,T$. Let $\delta_1,\dots,\delta_L$ be all possible values that $\Delta_i$ can take. Each $\delta_l$ is called a missingness pattern. The largest possible $L$ is $2^T - T - 1$. For any pattern $\delta_l$, we can construct the matrix $M_l$ given by (6) according to the particular data structure in pattern $\delta_l$.

We now show that Theorem 1 still holds for $\hat\theta_{GMM}$ defined in (10) with $\mathcal{G}_l = \{i : \Delta_i = \delta_l\}$, when the missingness mechanism depends only on $(X_i^*, Z_i)$, i.e.,

$$ p(\Delta_i \mid y_i, X_i^*, X_i, Z_i, \alpha_i) = p(\Delta_i \mid X_i^*, Z_i), \tag{12} $$

where $p(\cdot \mid \cdot)$ denotes the conditional probability function. Since $X_i^*$ is not observed, the missingness mechanism (12) is nonignorable. It suffices to show that (7) holds for any $l$. The result follows from

$$ \begin{aligned} E\left( Q_l \left\{ \begin{bmatrix} \mathrm{vec}(X_i) \\ \mathrm{vec}(Z_i) \end{bmatrix} \otimes (\varepsilon_i - U_i\beta) \right\} \,\Big|\, X_i^*, Z_i, \Delta_i \right) &= E\left( Q_l \left\{ \begin{bmatrix} \mathrm{vec}(X_i^* + U_i) \\ \mathrm{vec}(Z_i) \end{bmatrix} \otimes (-U_i\beta) \right\} \,\Big|\, X_i^*, Z_i, \Delta_i \right) \\ &= E\left( Q_l \left\{ \begin{bmatrix} \mathrm{vec}(U_i) \\ \mathrm{vec}(Z_i) \end{bmatrix} \otimes (-U_i\beta) \right\} \,\Big|\, X_i^*, Z_i, \Delta_i \right) \\ &= E\left( Q_l \left\{ \begin{bmatrix} \mathrm{vec}(U_{il}) \\ \mathrm{vec}(Z_{il}) \end{bmatrix} \otimes (-U_{il}\beta) \right\} \right) = 0, \end{aligned} $$

where $(U_{il}, Z_{il})$ is the sub-matrix of $(U_i, Z_i)$ with components corresponding to observed components of $y_i$ under pattern $\delta_l$, the first equality follows from condition (12), the independence of $\varepsilon_i$ and $(X_i^*, Z_i, U_i, \Delta_i)$, and $E(\varepsilon_i) = 0$, and the last two equalities follow from condition (12), the independence of $U_i$ and $(X_i^*, Z_i, \Delta_i)$, and the definition of $Q_l$.

From the previous argument, Theorem 1 still holds if condition (12) is relaxed to

$$ p(\Delta_i \mid y_i, X_i^*, X_i, Z_i, \alpha_i) = p(\Delta_i \mid X_i^*, Z_i, \alpha_i), $$

provided that the unobserved random effect $\alpha_i$ is independent of $(X_i^*, Z_i, U_i, \varepsilon_i)$. The missingness mechanism depending on an unobserved individual random effect was first studied in Wu and Carroll (1988) in the analysis of incomplete data without measurement error and was named informative missingness, a special case of nonignorable missingness.

When the $Q_l$'s in (7) are the same for several different patterns, we can combine the corresponding $\mathcal{G}_l$'s to obtain a more efficient estimator of $\theta$. An example is when the components of $U_i$ over $t = 1,\dots,T$ are identically distributed and we can combine the $\mathcal{G}_l$'s with the same $m_l$, the number of 1's in $\delta_l$.

4. Testing for measurement error

In this section we derive a test of whether there is measurement error. When there is no measurement error in $X_i^*$ ($R_l = 0$, $l = 1,\dots,L$), we can obtain an estimator of $\theta$ based on the moment conditions

$$ E\left( M_l^O \left\{ \begin{bmatrix} \mathrm{vec}(X_i) \\ \mathrm{vec}(Z_i) \end{bmatrix} \otimes (y_i - X_i\beta - Z_i\gamma) \right\} \right) = 0, \tag{13} $$

$l = 1,\dots,L$, where

$$ M_l^O = \begin{bmatrix} I_{pT_l} \otimes A_{T_l} & 0 \\ 0 & I_{qT_l} \otimes A_{T_l} \end{bmatrix} = I_{(p+q)T_l} \otimes A_{T_l}. $$

A GMM estimator $\hat\theta_{GMM}^O$ of $\theta$ can then be derived similarly as above. Since

$$ M_l = \begin{bmatrix} I_{pT_l^2} - R_l(R_l'R_l)^{-}R_l' & 0 \\ 0 & I_{qT_l^2} \end{bmatrix} M_l^O, $$

Eq. (13) implies (6), but not vice versa. Under the null hypothesis of no measurement error in $X_i^*$, $\hat\theta_{GMM}^O$ is an asymptotically normal and asymptotically efficient estimator of $\theta$, while $\hat\theta_{GMM}$ is an asymptotically normal yet asymptotically inefficient estimator of $\theta$. Under the alternative hypothesis of presence of measurement error in $X_i^*$ (with a measurement error correlation structure specified by (2)), $\hat\theta_{GMM}^O$ is inconsistent. According to Hausman (1978) and Hausman and Taylor (1981), under the null hypothesis of no measurement error, the test statistic

$$ H_T = (\hat\theta_{GMM} - \hat\theta_{GMM}^O)'[\hat V(\hat\theta_{GMM}) - \hat V(\hat\theta_{GMM}^O)]^{-}(\hat\theta_{GMM} - \hat\theta_{GMM}^O) \tag{14} $$

is asymptotically distributed as a $\chi_K^2$ distribution, where $\hat V(\hat\theta_{GMM})$ and $\hat V(\hat\theta_{GMM}^O)$ are consistent estimators of $V(\hat\theta_{GMM})$ and $V(\hat\theta_{GMM}^O)$, the asymptotic covariance matrices of $\hat\theta_{GMM}$ and $\hat\theta_{GMM}^O$, respectively, $[\hat V(\hat\theta_{GMM}) - \hat V(\hat\theta_{GMM}^O)]^{-}$ is the Moore–Penrose generalized inverse of $\hat V(\hat\theta_{GMM}) - \hat V(\hat\theta_{GMM}^O)$, and $K$ is the rank of $V(\hat\theta_{GMM}) - V(\hat\theta_{GMM}^O)$.
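As an illustration of how the statistic (14) might be computed, here is a small sketch using NumPy for the Moore–Penrose inverse and the rank, with the chi-square tail probability evaluated by a standard incomplete-gamma recurrence. The function names `chi2_sf` and `hausman`, the tolerance, and the numerical inputs in the example call are hypothetical. The final two lines evaluate the tail probability at the two test statistics reported in Section 5 under the assumption that $K = p + q$ (2 for the BDHOS and 4 for the WSCS); the text does not state $K$ explicitly.

```python
import math
import numpy as np


def chi2_sf(x, df):
    """Upper tail P(chi2_df > x) for integer df >= 1 (incomplete-gamma recurrence)."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                  # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))       # Q(1/2, y) = erfc(sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)   # Q(a+1, y) - Q(a, y)
        a += 1.0
    return q


def hausman(theta_gmm, theta_o, V_gmm, V_o, tol=1e-10):
    """Statistic (14), its degrees of freedom K, and the asymptotic p-value."""
    diff = np.asarray(theta_gmm) - np.asarray(theta_o)
    dV = np.asarray(V_gmm) - np.asarray(V_o)
    stat = float(diff @ np.linalg.pinv(dV) @ diff)     # Moore-Penrose inverse
    K = int(np.linalg.matrix_rank(dV, tol=tol))        # rank of the variance difference
    return stat, K, chi2_sf(stat, K)


# Hypothetical inputs, chosen only so the arithmetic is easy to follow.
stat, K, p_value = hausman(
    theta_gmm=[0.5, -0.1], theta_o=[0.2, 0.1],
    V_gmm=np.diag([0.04, 0.09]), V_o=np.diag([0.01, 0.05]))
print(stat, K, p_value)

# Tail probabilities at the statistics reported in Section 5, with K assumed.
p_bdhos = chi2_sf(5.07, 2)
p_wscs = chi2_sf(31.6, 4)
print(p_bdhos, p_wscs)
```

With these assumed degrees of freedom, the computed tail probabilities agree with the p-values quoted in Section 5 to the precision reported there.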


5. Examples

As illustrations, we apply the method developed in Sections 2–4 to two real data sets.

5.1. Application to the Beaver Dam Health Outcomes Study

Health related quality of life has received increasing attention in the medical and public health literature. The Beaver Dam Health Outcomes Study (BDHOS) is a community-based longitudinal cohort study of health status and health-related quality of life. In the BDHOS, data were collected from a random sample of 1,430 adults over 45 years of age in Beaver Dam, Wisconsin, consisting of 590 men and 860 women with mean age 64.1 years. The study was initiated in 1990, followed by three additional interviews. The time gaps between interviews are about 27 months, 62 months, and 29 months.

We consider the scores on a standardized health related quality of life instrument, SF-36, to be outcomes. The SF-36 is an index to quantify health status, which contains eight scales profiling health-related quality of life experienced by a person over the past month, including physical function, role function as limited by physical problems, body pain, general health perceptions, vitality, social function, role function as limited by emotional problems, and mental health. A higher score on SF-36 represents better health. It is of interest to study the relationship between SF-36 and the covariates age and SF-1, a self-rated value of a person's overall health. A smaller value of SF-1 indicates better health. It is suspected that SF-1 is measured with error, since it can be easily affected by the mood and temporal events when the participant is interviewed.

Although the original study design is balanced with T = 4 observations from each participant, incomplete data were observed for more than half of the participants. Table 1 lists the number of participants in each missingness pattern. The missingness of the BDHOS is intermittent, i.e., participants may come back after missing the previous interview. There are 219 participants with 0 or 1 observation; most of them completed the first interview but missed all 3 follow-up interviews. To adjust for measurement error using the method in Section 3, these participants have to be excluded and our conclusion applies to participants who completed at least two interviews.

Table 1
Number of participants (m) in each missingness pattern in the BDHOS.

Time 1    0    1    0    1    1    1    0    1    1    1    0    1    Total
Time 2    0    0    1    1    0    0    0    1    1    0    1    1
Time 3    0    0    0    0    1    0    1    1    0    1    1    1
Time 4    0    0    0    0    0    1    1    0    1    1    1    1
m         1  217    2  265   25    7    1  111   30   62    6  703    1430

We consider linear model (1) with $X_i$ being the observed SF-1 and $Z_i$ being log(age). Two methods were applied to estimate the unknown parameters: one is the naive method ignoring measurement error and treating $X_i$ as $X_i^*$; the other is the GMM method developed in Section 3 with $\mathcal{G}_l$ = the set of participants with $l$ observed data, $l = 2, 3, 4$. Because the interviews were over two years apart, we assume that the measurement error in SF-1 is independent and identically distributed across time points. The results are given in Table 2.

Table 2
Estimates (standard errors) for the BDHOS study.

Covariate    Naive               GMM
SF-1         −0.018 (0.0051)     −0.056 (0.0180)
log(age)     −0.172 (0.0307)     −0.095 (0.0485)

The difference between the estimates of the regression coefficient of SF-1 obtained by the two methods is significant relative to their standard errors, indicating the existence of measurement error in SF-1. The Hausman test statistic $H_T$ given in Section 4 is equal to 5.07 with a p-value of 0.079, which provides some, but not strong, evidence of measurement error in SF-1. The difference in the estimates of the regression coefficient of log(age) is due to the correlation between SF-1 and age, since age is measured without error. After adjusting for the measurement error, the effect of SF-1 on SF-36 is stronger whereas the effect of age is weaker than that estimated by the naive method. The validity of the GMM estimates relies on the assumption that the missingness mechanism does not depend on SF-36, conditional on age, gender, the true SF-1 value, and a possible individual subject random effect.

5.2. Application to the Wisconsin Sleep Cohort Study

Another example of unbalanced panel data containing covariate measurement errors is the Wisconsin Sleep Cohort Study (WSCS). This survey involves 3,745 individuals. After the baseline survey, each individual was followed twice: four years and eight years after the baseline survey. Nonresponse resulted in an unbalanced panel data set with different patterns. The number of individuals in each missingness pattern is given in Table 3.

Table 3
Number of participants (m) in each missingness pattern in the WSCS.

Baseline            0    1    0    0    1    1    0     1    Total
First follow-up     0    0    1    0    1    0    1     1
Second follow-up    0    0    0    1    0    1    1     1
m                  90  888   58   55  694  431   93  1436    3745

We analyze how a person's level of self-reported sleepiness (a summary standardized outcome based on several questions) is related to the covariates sleep latency (log-transformation of the minutes it takes the person to fall asleep), sleep time (hours of sleep during weekday nights), log-transformation of the body mass index (BMI), and age, under a linear regression model. The covariate sleep latency is believed to have measurement error (Shen et al., 2008). Other covariates such as sleep time and BMI may also be inaccurately reported. Therefore, we consider model (1) with $X_i^*$ being a three-dimensional covariate (sleep latency, sleep time, BMI) measured with error and $Z_i$ being a single covariate, age, measured without error. Similar to the first example, to apply our method in Section 3 we have to exclude individuals who have no or a single observation.

We computed the naive estimators by treating $X_i$ as $X_i^*$. For the GMM approach, we assume that measurement errors for different covariates (sleep latency, sleep time, and BMI) are mutually independent but may have different distributions. For measurement errors within each covariate, we assume they are independent over time. When there are two observations (for 1218 individuals), it follows from the result in Xiao et al. (in press) that we have to assume that measurement errors within each covariate are identically distributed; otherwise the parameter $\beta$ is not estimable. When there are three observations (for 1436 individuals), we can assume that measurement errors within each covariate are either identically distributed or not identically distributed. This motivates us to consider two different GMM estimators: (1) GMM1, the GMM estimators based on unbalanced data from subjects with 2 or 3 observations, under the assumption that measurement errors within each covariate are identically distributed; (2) GMM2, the GMM estimators based on unbalanced data from subjects with 2 or 3 observations, under the assumption that measurement errors within each covariate are identically distributed for individuals with 2 observations and are not identically distributed for individuals with 3 observations. Under the missingness mechanism assumption in Section 3, all GMM estimators are asymptotically valid if the assumption on measurement errors is appropriate.

The results are given in Table 4. The Hausman test statistic $H_T$ based on GMM2 is 31.6 with a p-value of $2.3 \times 10^{-6}$, indicating strong evidence of measurement error.

Table 4
Estimates (standard errors) for the WSCS data.

Covariate        Naive               GMM1                GMM2
Sleep latency     0.049 (0.0249)     −0.224 (0.1530)      0.225 (0.0389)
Sleep time       −0.208 (0.0189)     −0.228 (0.0101)     −0.259 (0.0316)
BMI               0.611 (0.1852)     −0.224 (0.6340)      0.460 (0.0800)
Age              −0.003 (0.0024)     −0.003 (0.0052)     −0.005 (0.0017)

From Table 4 we observe that the GMM estimates under the assumption that measurement errors within each covariate are identically distributed (GMM1) are quite different from the naive estimates and the GMM2 estimates, except for the coefficient of sleep time. The GMM1 estimate of the coefficient of sleep latency is negative (and actually insignificant at the 5% significance level). The GMM1 estimates of the coefficients of BMI and age are statistically insignificant. These are quite counter-intuitive results. Hence, the identically distributed measurement error assumption may not be appropriate. The naive estimate of the coefficient of sleep latency, although it is significant at the 5% level, is much smaller than the GMM2 estimate. Meanwhile, the naive estimates of the coefficients of sleep time, BMI, and age are relatively close to the GMM2 estimates. This suggests that the magnitude of measurement error in sleep time or BMI is comparatively smaller than that of the measurement error in sleep latency, i.e., the measurement error in sleep latency is the main source of bias.
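The group sizes used in this example can be recovered directly from the WSCS missingness-pattern counts. The short sketch below, in plain Python, enumerates the patterns with at least two observations for T = 3, confirms that their number equals $2^T - T - 1$ from Section 3, and tallies participants by number of observed waves. The variable names are illustrative; the counts are those of the WSCS missingness-pattern table.

```python
from itertools import product

T = 3  # baseline plus two follow-ups

# Counts per pattern (baseline, first follow-up, second follow-up); 1 = observed.
wscs_counts = {
    (0, 0, 0): 90,
    (1, 0, 0): 888,
    (0, 1, 0): 58,
    (0, 0, 1): 55,
    (1, 1, 0): 694,
    (1, 0, 1): 431,
    (0, 1, 1): 93,
    (1, 1, 1): 1436,
}

# Patterns usable by the GMM approach need at least two observed outcomes.
usable_patterns = [pat for pat in product((0, 1), repeat=T) if sum(pat) >= 2]
max_L = 2 ** T - T - 1

# Tally participants by the number of observed waves.
by_observed = {}
for pattern, m in wscs_counts.items():
    by_observed[sum(pattern)] = by_observed.get(sum(pattern), 0) + m

total = sum(wscs_counts.values())
excluded = by_observed[0] + by_observed[1]   # no or a single observation

print(len(usable_patterns), max_L)
print(by_observed, total, excluded)
```

The tallies for two and three observed waves match the 1218 and 1436 individuals quoted in the text, and the grand total matches the 3,745 individuals in the survey.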

6. Concluding remarks

In this paper, we derive GMM estimators to adjust for covariate measurement error in linear regression models with unbalanced panel data. Our model allows for correlated measurement errors (with a known correlation structure) in multiple covariates. Although the results are derived assuming independence between measurement errors and true covariates, they can be generalized to situations where measurement errors and true covariates are correlated with a known pattern (Xiao, 2008). The idea of combining GMM estimators into an efficient GMM estimator for unbalanced panel data in Section 2 is useful not only for dealing with measurement error, but also for any GMM estimation with unbalanced panel data when the unbalanced data structure is by design and the number of subjects in each sub-sample tends to infinity.

Acknowledgments

The authors thank two referees and an associate editor for their helpful comments and suggestions.

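Appendix A. Proof of Theorem 1

Since $\bar d_l = \bar C_l\theta + \bar e_l$, where $\bar e_l = \sum_{i \in \mathcal{G}_l} e_{il}/n_l$ and $e_{il} = d_{il} - C_{il}\theta$,

$$ \begin{aligned} \hat\theta_{GMM} &= \left( \sum_{l=1}^{L} n_l\bar C_l'\hat\Gamma_l^{-1}\bar C_l \right)^{-1} \sum_{l=1}^{L} n_l\bar C_l'\hat\Gamma_l^{-1}\bar C_l\theta + \left( \sum_{l=1}^{L} n_l\bar C_l'\hat\Gamma_l^{-1}\bar C_l \right)^{-1} \sum_{l=1}^{L} n_l\bar C_l'\hat\Gamma_l^{-1}\bar e_l \\ &= \theta + \left( \sum_{l=1}^{L} \frac{n_l}{n}\bar C_l'\hat\Gamma_l^{-1}\bar C_l \right)^{-1} \sum_{l=1}^{L} \frac{n_l}{n}\bar C_l'\hat\Gamma_l^{-1}\bar e_l. \end{aligned} $$

Then

$$ \sqrt{n}\,[\hat\theta_{GMM} - \theta] = \sqrt{n} \left( \sum_{l=1}^{L} \frac{n_l}{n}\bar C_l'\hat\Gamma_l^{-1}\bar C_l \right)^{-1} \sum_{l=1}^{L} \frac{n_l}{n}\bar C_l'\hat\Gamma_l^{-1}\bar e_l = \left( \sum_{l=1}^{L} \frac{n_l}{n}\bar C_l'\hat\Gamma_l^{-1}\bar C_l \right)^{-1} \sum_{l=1}^{L} \sqrt{\frac{n_l}{n}}\,\bar C_l'\hat\Gamma_l^{-1}\sqrt{n_l}\,\bar e_l. $$

Without loss of generality, we assume that $n_l/n \to a_l$ as $n \to \infty$. By the law of large numbers, as $n \to \infty$,

$$ \sum_{l=1}^{L} \frac{n_l}{n}\bar C_l'\hat\Gamma_l^{-1}\bar C_l \to \sum_{l=1}^{L} a_lC_l'\Gamma_{0l}^{-1}C_l, \qquad \sum_{l=1}^{L} \frac{n_l}{n}\bar C_l'\hat\Gamma_l^{-1}\bar e_l \to 0 $$

in probability. By Slutsky's Theorem and the Central Limit Theorem, for each $l$,

$$ \left( \sum_{l=1}^{L} \frac{n_l}{n}\bar C_l'\hat\Gamma_l^{-1}\bar C_l \right)^{-1} \sqrt{n}\,\frac{n_l}{n}\bar C_l'\hat\Gamma_l^{-1}\bar e_l \to_d \left( \sum_{l=1}^{L} a_lC_l'\Gamma_{0l}^{-1}C_l \right)^{-1} C_l'\Gamma_{0l}^{-1}N(0, a_l\Gamma_l). $$

Since $(n_l/n)\bar C_l'\hat\Gamma_l^{-1}\bar e_l$, $l = 1,\dots,L$, are independent,

$$ \sqrt{n}\,[\hat\theta_{GMM} - \theta] \to_d N(0, V), $$

where

$$ V = \left( \sum_{l=1}^{L} a_lC_l'\Gamma_{0l}^{-1}C_l \right)^{-1} \left( \sum_{l=1}^{L} a_lC_l'\Gamma_{0l}^{-1}\Gamma_l\Gamma_{0l}^{-1}C_l \right) \left( \sum_{l=1}^{L} a_lC_l'\Gamma_{0l}^{-1}C_l \right)^{-1} = \lim_{n \to \infty} nV_n. $$

To prove that $\hat\theta_{GMM}$ is most efficient if $\Gamma_{0l} = \Gamma_l$, we need to show that

$$ V \ge \left( \sum_{l=1}^{L} a_lC_l'\Gamma_l^{-1}C_l \right)^{-1}, $$

which follows from the matrix version of the Cauchy–Schwarz inequality (see, e.g., Wansbeek and Meijer, 2000, pp. 359–360). To see this, let

$$ X = \begin{bmatrix} \sqrt{a_1}\,C_1 \\ \vdots \\ \sqrt{a_L}\,C_L \end{bmatrix}, \qquad A = \begin{bmatrix} \Gamma_{01}^{-1} & & \\ & \ddots & \\ & & \Gamma_{0L}^{-1} \end{bmatrix}, \qquad B = \begin{bmatrix} \Gamma_1 & & \\ & \ddots & \\ & & \Gamma_L \end{bmatrix}. $$

Then

$$ V = (X'AX)^{-1}X'ABAX(X'AX)^{-1} \ge (X'B^{-1}X)^{-1} = \left( \sum_{l=1}^{L} a_lC_l'\Gamma_l^{-1}C_l \right)^{-1}. $$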

References

Biørn, E., 2000. Panel data with measurement errors: instrumental variables and GMM procedures combining levels and differences. Econometric Reviews 19, 391–424.
Carroll, R.J., Ruppert, D., Stefanski, L.A., Crainiceanu, C.M., 2006. Measurement Error in Nonlinear Models: A Modern Perspective. Chapman & Hall/CRC, Boca Raton, FL.
Fuller, W., 1987. Measurement Error Models. John Wiley, New York.
Griliches, Z., Hausman, J.A., 1986. Errors-in-variables in panel data. Journal of Econometrics 32, 93–118.
Hansen, L., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Hausman, J.A., 1978. Specification tests in econometrics. Econometrica 46, 1251–1271.
Hausman, J.A., Taylor, W., 1981. A generalized specification test. Economics Letters 8, 239–245.
Hsiao, C., 2003. Analysis of Panel Data. Cambridge University Press, Cambridge.
Jöreskog, K.G., 1971. Simultaneous factor analysis in several populations. Psychometrika 36, 409–426.
Muthén, B.O., 1989. Latent variable modeling in heterogenous populations. Psychometrika 54, 557–585.
Shen, L., Shao, J., Park, S., Palta, M., 2008. Between- and within-cluster covariate effects and model misspecification in the analysis of clustered data. Statistica Sinica 18, 731–748.
Wansbeek, T.J., 2001. GMM estimation in panel data models with measurement error. Journal of Econometrics 104, 259–268.
Wansbeek, T.J., Meijer, E., 2000. Measurement Error and Latent Variables in Econometrics. North-Holland, Amsterdam.
Wu, M.C., Carroll, R.J., 1988. Estimation and comparison of changes in the presence of informative right censoring by modelling the censoring process. Biometrics 44, 175–188.
Xiao, Z., 2008. Topics in GMM estimation with application to panel data models with measurement error. Ph.D. Thesis, Department of Statistics, University of Wisconsin, Madison.
Xiao, Z., 2010. The weighted method of moments for moment condition models. Economics Letters 107, 183–186.
Xiao, Z., Shao, J., Palta, M., 2010. GMM in linear regression for longitudinal data with multiple covariates measured with error. Journal of Applied Statistics 37, 791–805.
Xiao, Z., Shao, J., Palta, M., in press. Instrumental variable and GMM estimation of panel data with measurement error. Statistica Sinica 20.