Simulation-based consistent inference for biased working model of non-sparse high-dimensional linear regression

Journal of Statistical Planning and Inference 141 (2011) 3780–3792. doi:10.1016/j.jspi.2011.06.014
Lu Lin a,*, Feng Li b, Lixing Zhu c

a School of Mathematics, Shandong University, Jinan, China
b Zhengzhou Institute of Aeronautical Industry Management, China
c Hong Kong Baptist University, Hong Kong

Article history: Received 26 July 2010; received in revised form 9 June 2011; accepted 15 June 2011; available online 25 June 2011.

Keywords: High-dimensional regression; Non-sparsity; Biased working model; Consistent inference; Empirical likelihood

Abstract

Variable selection in regression analysis is important because it can simplify the model and enhance predictability. After variable selection, however, the resulting working model may be biased if it does not contain all of the significant variables. As a result, the commonly used parameter estimators are either inconsistent or require estimating a high-dimensional nuisance parameter under very strong assumptions for consistency, and the corresponding confidence regions are invalid when the bias is relatively large. In this paper we introduce a simulation-based procedure that reformulates a new model so as to reduce the bias of the working model, with no need to estimate a high-dimensional nuisance parameter. The resulting estimators of the parameters in the working model are asymptotically normally distributed whether the bias is small or large. Furthermore, together with the empirical likelihood, we build simulation-based confidence regions for the parameters in the working model. The newly proposed estimators and confidence regions outperform existing ones in the sense of consistency.

1. Introduction

In this paper we consider the high-dimensional linear regression model

$$Y = \beta' X + \gamma' Z + \varepsilon, \qquad (1.1)$$

where $Y$ is a scalar response variable, $X$ and $Z$ are, respectively, $p$-dimensional and $q$-dimensional covariates, $\beta \in \mathcal{B}$ and $\gamma \in \Gamma$ are, respectively, $p$-dimensional and $q$-dimensional parameter vectors, and $\varepsilon$ is an error satisfying $E(\varepsilon|X,Z) = 0$ and $\mathrm{Var}(\varepsilon|X,Z) = \sigma^2$. The model (1.1) is regarded as the full model, which is built in the initial stage of modeling and contains all possibly relevant variables. To simplify the model and enhance predictability, one usually applies a variable selection technique to remove less significant variables. Without loss of generality, we suppose that $Z$ is relatively insignificant and is removed from the model; the dimension $q$ of $\gamma$ may be large. Consequently, we obtain the sub-model

$$Y = \beta' X + \eta. \qquad (1.2)$$

Lu Lin was supported by NNSF project (10771123) of China, NBRP (973 Program 2007CB814901) of China, RFDP (20070422034) of China, NSF project (ZR2010AZ001) of Shandong Province of China and the K C Wong-HKBU Fellowship Programme. Lixing Zhu was supported by a grant from the University Grants Council of Hong Kong, Hong Kong, China.
* Corresponding author. E-mail address: [email protected] (L. Lin).


This model is assumed to be low-dimensional and is regarded as a fixed working model, on which any further analysis is based. However, if this model does not contain all the significant variables, it is actually biased; the classical estimation is then inconsistent and yields invalid confidence intervals/regions, or, even when it is consistent, it requires estimating a high-dimensional nuisance parameter under very strong assumptions. Our goal in this paper is to construct consistent inference for $\beta$ whether the bias $E(\eta|X)$ is small or large. As such, $\gamma$ is regarded as a high-dimensional nuisance parameter vector.

A natural method is to directly estimate the full parameter vector $(\beta, \gamma)$ from the full model (1.1). Under some regularity conditions, the resulting estimator of $\beta$ can be consistent. This method is unavailable when the dimension $q$ of $\gamma$ is high unless very strong conditions are imposed. For example, when the dimension diverges to infinity with the sample size, the covariance matrix of $(X,Z)$ must be assumed positive definite for the ordinary least squares method to apply; as is known, this in effect assumes very weak correlation between the components of $(X,Z)$. Some traditional estimation procedures are based on hypothesis tests. For example, the preliminary test (PT) estimation uses a test to decide whether the estimator of $\beta$ is based on the full model (1.1) or the working model (1.2); see, for example, Sen and Ehsanes Saleh (1987) and Sen (1979), and the comprehensive book of Ehsanes Saleh (2006). Because this estimation is tied to the estimation determined by the full model (1.1), it is also unavailable when the dimension of $\gamma$ is high; furthermore, it is in fact biased when the full model (1.1) is non-sparse. A recently popular method is to simultaneously select variables and estimate parameters via penalty-based procedures. For example, Fan and Li (2001) and Fan and Peng (2004) proposed the smoothly clipped absolute deviation (SCAD) penalty to achieve this goal. This technique has the oracle and sparsity properties, i.e., it estimates the zero components of the true parameter vector as exactly zero while the estimators of the non-zero components remain consistent. In other words, when the zero components are deleted from the model, the remaining components can still be consistently estimated. However, when the model under study is non-sparse, it is difficult to construct consistent estimators of the coefficients in the selected working model by existing methods. Nonparametric adjustment is another way to construct a consistent estimator of $\beta$. Unlike variable selection, nonparametric adjustment proceeds in an inverse way: the simple model (1.2) is first taken as a prior or approximate choice and is then adjusted nonparametrically, and the adjusted estimator of $\beta$ is consistent. However, this method is suitable only for low-dimensional $X$ and $Z$; otherwise it suffers from the curse of dimensionality of nonparametric estimation. Relevant references include Hjort and Glad (1995), Hjort and Jones (1996), Naito (2004), Glad (1998), Lin et al. (2009) and Ma et al. (2006), among others. In this paper we recommend a simulation-based method to overcome these difficulties.
It is known that the simulation-extrapolation (SIMEX) method has been employed for inference in models with errors-in-variables; relevant references include Novick and Stefanski (2002), Cook and Stefanski (1994) and Stefanski and Cook (1995), among others, and Carroll et al. (2006) is a comprehensive reference book. A similar method was introduced by Delaigle et al. (2007) to solve a different problem. Different from the methods mentioned above, however, the new inference procedure redefines a simulation-based working model, which is approximately unbiased whether the bias of the original working model is small or large. The estimation procedure depends only on commonly used parameter estimation methods. A bias-corrected estimation is proposed for the working model, and consequently the resulting estimators of the parameters in the original working model are consistent under suitable conditions. It is worth mentioning that although the estimation procedure contains a nonparametric estimation step, it achieves the parametric convergence rate because of a projection, or orthogonality, that reduces or eliminates the nuisance remainder arising from variance estimation. Furthermore, together with the empirical likelihood, we build simulation-based confidence regions for the parameters in the original working model. Finally, it is proved by asymptotic theory and demonstrated by simulation studies that the newly proposed estimators and confidence regions outperform existing ones in the sense of consistency.

The rest of this paper is organized as follows. In Section 2 a simulation-based model is defined; under the newly defined model, the bias-corrected LS estimator of $\beta$ is obtained, an asymptotically unbiased prediction is proposed, and a data-driven method for choosing the artificial variance parameter is introduced. In Section 3 the asymptotic normality of the new estimators is derived and a comparison with commonly used estimators is given. In Section 4 a simulation-based empirical likelihood is proposed and confidence regions for the parameters in the working model are constructed. Simulation studies are presented in Section 5, and the proofs of the theorems are given in the Appendix.

2. Simulation-based estimation

From (1.1) and (1.2) it follows that $E(\eta|X) = \gamma' E(Z|X)$. Non-sparsity means $\gamma \neq 0$. When both $\gamma \neq 0$ and $E(Z|X) \neq 0$, the fixed working model (1.2) is always biased. To introduce a bias-reduced working model, we first define an extrapolation variable $W = Z + U$, where $U$ is an artificial $q$-dimensional variable independent of $(Y,X,Z,\varepsilon)$, having a given density function $f_U(u, \sigma_U^2)$ and satisfying $E(U) = 0$ and $\mathrm{Cov}(U) = \sigma_U^2 \Sigma_{U,U}$; $f_U(u, \sigma_U^2)$ can be chosen to be a symmetric function. To reduce the bias of the working model (1.2), we define the simulation-based model

$$Y = \beta' X + g_\beta(Z) + \delta, \qquad (2.1)$$

where $g_\beta(Z) = E\{E(Y - \beta'X \mid W) \mid Z\}$. To verify that the new model (2.1) is approximately unbiased, we for the moment assume for simplicity that $Z \sim N(0, \sigma_Z^2 I)$, and choose $U$ such that $U \sim N(0, \sigma_U^2 I)$. By the properties of conditional expectation for the normal distribution, we have

$$E(\delta|X,Z) = E(Y - \beta'X \mid X,Z) - E(g_\beta(Z) \mid X,Z) = \gamma'Z - g_\beta(Z) = \gamma'Z - \gamma'E\{E(Z|W) \mid Z\} = \gamma'Z - \gamma'E\left\{\frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_U^2}\,W \,\Big|\, Z\right\} = \gamma'Z - \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_U^2}\,\gamma'Z = \frac{\sigma_U^2}{\sigma_Z^2 + \sigma_U^2}\,\gamma'Z.$$

Then, when $\sigma_U^2$ is small enough, $E(\delta|X,Z) \approx 0$ (a.s.), which means the new model is approximately unbiased.

In the model (2.1), however, the function $g_\beta(z)$ is unknown and must be estimated. Denote by $f_{X,Y,Z}(x,y,z)$ and $f_W(w)$ the density functions of $(X,Y,Z)$ and $W$, respectively, and by $f_{W|X,Y,Z}(w|x,y,z)$ the conditional density function of $W$ given $(X,Y,Z)$. Let $m(W) = E(Y - \beta'X \mid W)$. Note that $W - Z$ is independent of $(X,Y,Z)$. Then

$$m(w) = \frac{\iiint (y - \beta'x) f_{W|X,Y,Z}(w|x,y,z) f_{X,Y,Z}(x,y,z)\,dx\,dy\,dz}{f_W(w)} = \frac{\iiint (y - \beta'x) f_U(w-z, \sigma_U^2) f_{X,Y,Z}(x,y,z)\,dx\,dy\,dz}{\iiint f_U(w-z, \sigma_U^2) f_{X,Y,Z}(x,y,z)\,dx\,dy\,dz} = \frac{E\{(Y - \beta'X) f_U(w - Z, \sigma_U^2)\}}{E\{f_U(w - Z, \sigma_U^2)\}}.$$
Let $(X_j, Y_j, Z_j)$, $j = 1, \ldots, n$, be i.i.d. observations of $(X,Y,Z)$. Then, given $\beta$, we get an estimator of $m(w)$ as

$$\hat m(w) = \frac{\sum_{j=1}^n (Y_j - \beta'X_j) f_U(w - Z_j, \sigma_U^2)}{\sum_{k=1}^n f_U(w - Z_k, \sigma_U^2)}.$$

Furthermore,

$$g_\beta(z) = E(m(W) \mid Z = z) = \int m(w) f_{W|Z}(w|z)\,dw = \int m(z+u) f_U(u, \sigma_U^2)\,du = E(m(z+U)). \qquad (2.2)$$

Given $\beta$, the above representation motivates an estimator of $g_\beta(z)$ as

$$\hat g_\beta(z) = \frac{1}{m}\sum_{i=1}^m \hat m(z + U_i) = \frac{1}{m}\sum_{i=1}^m \frac{\sum_{j=1}^n (Y_j - \beta'X_j) f_U(z + U_i - Z_j, \sigma_U^2)}{\sum_{k=1}^n f_U(z + U_i - Z_k, \sigma_U^2)} = \sum_{j=1}^n (Y_j - \beta'X_j) M_j(z), \qquad (2.3)$$
where $U_1, \ldots, U_m$ are i.i.d. observations of $U$ and the weight functions are

$$M_j(z) = \frac{1}{m}\sum_{i=1}^m \frac{f_U(z + U_i - Z_j, \sigma_U^2)}{\sum_{k=1}^n f_U(z + U_i - Z_k, \sigma_U^2)}.$$
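The weights $M_j(z)$ and the estimator $\hat g_\beta(z)$ in (2.3) are straightforward to compute once the artificial sample $U_1, \ldots, U_m$ has been drawn. Below is a minimal NumPy sketch under the assumption that $f_U$ is the $N(0, \sigma_U^2 I)$ density; the function and variable names are ours, not the authors'. Normalizing constants of $f_U$ cancel in the ratio defining $M_j(z)$, so the kernel is coded up to a constant.

```python
import numpy as np

def gaussian_fU(u, sigma_U):
    """N(0, sigma_U^2 I) density up to its normalizing constant; the constant
    cancels in the ratios defining M_j(z), so it is omitted."""
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1) / sigma_U ** 2)

def weights_M(z, Z, U, sigma_U):
    """M_j(z) = (1/m) sum_i f_U(z + U_i - Z_j) / sum_k f_U(z + U_i - Z_k)."""
    f = gaussian_fU(z + U[:, None, :] - Z[None, :, :], sigma_U)  # shape (m, n)
    return np.mean(f / f.sum(axis=1, keepdims=True), axis=0)     # shape (n,)

def g_hat(z, beta, X, Y, Z, U, sigma_U):
    """Estimator (2.3): g_beta(z) approximated by sum_j (Y_j - beta'X_j) M_j(z)."""
    return np.dot(Y - X @ beta, weights_M(z, Z, U, sigma_U))
```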

Remark 1. It is worth mentioning that $g_\beta(z)$ (including $m(w)$) is represented by unconditional expectations, as given in (2.2), and thus it is essentially parametric and its estimator (2.3) can achieve the parametric convergence rate. The estimator $\hat g_\beta(z)$ resembles a nonparametric estimator in structure, and $\sigma_U$ plays a role similar to a bandwidth as it tends to zero. We will explain in the remarks below and in the proofs of the theorems that this seemingly nonparametric structure does not reduce the convergence rate of the estimator of $\beta$ defined below, because of the orthogonality of the projection in the definition of $g_\beta(z)$.

Plugging the estimator $\hat g_\beta(z)$ into the working model (2.1), we obtain the adjustment model

$$\tilde Y_i \approx \beta' \tilde X_i + \delta_i, \quad i = 1, \ldots, n, \qquad (2.4)$$

where

$$\tilde Y_i = Y_i - \sum_{j=1}^n Y_j M_j(Z_i), \qquad \tilde X_i = X_i - \sum_{j=1}^n X_j M_j(Z_i).$$

Note that the adjustment model can be regarded as a centered version of the biased working model (1.2). From the adjustment model (2.4) we obtain a simulation-based estimator of $\beta$ as

$$\hat\beta = S_n^{-1}\,\frac{1}{n}\sum_{i=1}^n \tilde X_i \tilde Y_i, \qquad (2.5)$$

where $S_n = \frac{1}{n}\sum_{i=1}^n \tilde X_i \tilde X_i'$. Finally, plugging $\hat\beta$ into (2.3), we get an estimator of $g_\beta(z)$ as

$$\hat g_{\hat\beta}(z) = \sum_{j=1}^n (Y_j - \hat\beta' X_j) M_j(z). \qquad (2.6)$$
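Given the weights, the centered variables of (2.4) and the estimator (2.5) reduce to a few lines of linear algebra. A sketch continuing the code above (again only an illustration, not the authors' implementation):

```python
def beta_hat(X, Y, Z, U, sigma_U):
    """Simulation-based estimator (2.5): center X and Y with the M-weights,
    then run least squares of Y-tilde on X-tilde."""
    n = len(Y)
    M = np.vstack([weights_M(Z[i], Z, U, sigma_U) for i in range(n)])  # M[i, j] = M_j(Z_i)
    X_t, Y_t = X - M @ X, Y - M @ Y    # centered versions entering (2.4)
    S_n = X_t.T @ X_t / n
    return np.linalg.solve(S_n, X_t.T @ Y_t / n)
```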

We will investigate the asymptotic behavior of these estimators in the next section. However, an optimal choice of $\sigma_U$ cannot be found directly from the asymptotic properties given below, so a data-driven method for choosing $\sigma_U$ is desired. Similar to the choice of bandwidth in local smoothing estimation, we can use cross-validation (CV) to choose $\sigma_U$. For example, it can be determined by minimizing

$$CV(\sigma_U) = \frac{1}{n}\sum_{i=1}^n \left(Y_i - X_i'\hat\beta_{(i)} - \hat g_{\hat\beta_{(i)}}(Z_i)\right)^2, \qquad (2.7)$$

where $\hat\beta_{(i)}$ and $\hat g_{\hat\beta_{(i)}}$ denote the leave-one-out versions of $\hat\beta$ and $\hat g_{\hat\beta}$.
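A direct, computationally naive way to minimize (2.7) is a leave-one-out loop over a user-chosen grid of candidate values. A sketch (the grid and the refitting scheme are our own assumptions):

```python
def cv_score(sigma_U, X, Y, Z, U):
    """Leave-one-out criterion (2.7) for one candidate sigma_U."""
    n = len(Y)
    err = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b_i = beta_hat(X[keep], Y[keep], Z[keep], U, sigma_U)   # leave-one-out fit
        g_i = g_hat(Z[i], b_i, X[keep], Y[keep], Z[keep], U, sigma_U)
        err += (Y[i] - X[i] @ b_i - g_i) ** 2
    return err / n

# sigma_U = min(np.linspace(0.05, 1.0, 20), key=lambda s: cv_score(s, X, Y, Z, U))
```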

L. Lin et al. / Journal of Statistical Planning and Inference 141 (2011) 3780–3792

3783

The reason for such a choice of $\sigma_U$ is to make the bias of the adjusted model as small as possible. By the consistency of the new estimator (see the corollaries in the next section) and the asymptotic unbiasedness of the adjusted model, the model tends to the standard one, and thus the classical CV method can be used to obtain an asymptotically optimal choice of $\sigma_U$ (see, e.g., Härdle and Marron, 1985). We can also use the generalized cross-validation (GCV) criterion proposed by Wahba (1977) and Craven and Wahba (1979). Denote $V = M + \tilde X(\tilde X'\tilde X)^{-1}\tilde X'(I_n - M)$, where $M = (M_j(Z_i))_{n\times n}$ is the smoothing matrix and $\tilde X = (\tilde X_1, \ldots, \tilde X_n)'$; the GCV criterion is defined by

$$GCV(\sigma_U) = \frac{n^{-1}\,RSS(\sigma_U)}{[1 - \mathrm{tr}(V)/n]^2}, \qquad (2.8)$$

where $\mathrm{tr}(V)$ denotes the trace of the matrix $V$ and $RSS(\sigma_U) = \|\hat Y - Y\|^2$ is the sum of squared residuals. The value of $\sigma_U$ that minimizes $GCV(\sigma_U)$ is the selected standard deviation.
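The GCV score (2.8) needs only the smoothing matrix $M$ and the hat-type matrix $V$. A sketch following our reconstruction of $V$ above (so the exact form of $V$ here is an assumption):

```python
def gcv_score(sigma_U, X, Y, Z, U):
    """GCV criterion (2.8): n^{-1} RSS / [1 - tr(V)/n]^2, with
    V = M + X_t (X_t'X_t)^{-1} X_t' (I - M) the combined linear smoother."""
    n = len(Y)
    M = np.vstack([weights_M(Z[i], Z, U, sigma_U) for i in range(n)])
    X_t = X - M @ X
    H = X_t @ np.linalg.solve(X_t.T @ X_t, X_t.T)   # hat matrix of the centered fit
    V = M + H @ (np.eye(n) - M)                     # fitted values satisfy Y_hat = V Y
    rss = np.sum((Y - V @ Y) ** 2)
    return (rss / n) / (1.0 - np.trace(V) / n) ** 2
```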

For any given $X_0$, we naturally predict the corresponding response $Y_0$ by

$$\hat Y_0 = X_0'\hat\beta. \qquad (2.9)$$

This prediction depends only on the predictor $X_0$ through the working model (1.2). However, it is asymptotically biased when both $\gamma \neq 0$ and $E(Z|X) \neq 0$, even though $\hat\beta$ is consistent. Using the bias-reduced model (2.1) and the estimator $\hat g_{\hat\beta}(z)$, we can instead predict $Y_0$ by

$$\tilde Y_0 = X_0'\hat\beta + \hat g_{\hat\beta}, \qquad (2.10)$$

for any given $X_0$, where $\hat g_{\hat\beta} = n^{-1}\sum_{i=1}^n \hat g_{\hat\beta}(Z_i)$. This prediction also depends only on the low-dimensional predictor $X$.

3. Asymptotic property

Denote

$$\Sigma = E\left\{\left(X - \frac{1}{\sigma_Z^2}\Sigma_{X,Z} Z\right)\left(X - \frac{1}{\sigma_Z^2}\Sigma_{X,Z} Z\right)'\right\},$$

$$\tilde\Sigma = E\{[X - E(X) - \Sigma_{X,Z}(\sigma_Z^2\Sigma_{Z,Z})^{-1}(Z - E(Z))]\,[X - E(X) - \Sigma_{X,Z}(\sigma_Z^2\Sigma_{Z,Z})^{-1}(Z - E(Z))]'\},$$

where $\Sigma_{X,Z} = \mathrm{Cov}(X,Z)$ and $\sigma_Z^2\Sigma_{Z,Z} = \mathrm{Cov}(Z,Z)$. We have the following theorem.

Theorem 3.1. For the working model (1.2), assume the following conditions hold: (C1) $q$ is fixed, and $\Sigma$ and $\tilde\Sigma$ are positive definite; (C2) as $n \to \infty$, $m \geq n$, $\sigma_U \to 0$ and $n\sigma_U^{2q} \to \infty$. If $(X,Z,U)$ is elliptically symmetrically distributed and $f_U(u, \sigma_U^2)$ is bounded, then, as $n \to \infty$,

$$\sqrt{n}\left(\hat\beta - \beta - \sigma_U^4\,\tilde\Sigma^{-1}\Sigma_{X,Z}(\sigma_Z^2\Sigma_{Z,Z} + \sigma_U^2\Sigma_{U,U})^{-2}\gamma\right) \stackrel{D}{\to} N(0, \sigma^2\tilde\Sigma^{-1}).$$

In particular, when $E(X) = 0$, $E(Z) = 0$, $\sigma_Z^2\Sigma_{Z,Z} = \sigma_Z^2 I$ and $\sigma_U^2\Sigma_{U,U} = \sigma_U^2 I$, then, as $n \to \infty$,

$$\sqrt{n}\left(\hat\beta - \beta - \frac{\sigma_U^4}{(\sigma_Z^2 + \sigma_U^2)^2}\,\Sigma^{-1}\Sigma_{X,Z}\gamma\right) \stackrel{D}{\to} N(0, \sigma^2\Sigma^{-1}).$$

Remark 2. From this theorem we can see that the result is not affected by the dimension of $Z$. The proof is based mainly on the following key fact: the population versions of $\tilde X_i$ and $\tilde Z_i$ are $X - E(E(X|W)|Z)$ and $Z - E(E(Z|W)|Z)$, respectively, and their expectations are zero vectors. By this property we avoid the curse of dimensionality; for details see the proof of the theorem in the Appendix. This is the reason why we use the simulation-based method, rather than a nonparametric method, to estimate $\gamma'Z$. Furthermore, the asymptotic result is independent of the choice of the functional form of $f_U(\cdot,\cdot)$. In the simulations we will illustrate that the estimation is also insensitive to the choice of the functional form.

Remark 3. In this theorem, condition (C2) determines the rate at which $\sigma_U$ converges to zero. With such a rate, we can ignore remainder terms of order $O_p((n\sigma_U^q)^{-1/2})$, which appear when we estimate the variances. On the other hand, in the estimation procedure, ellipticity of the distribution of $Z$ is not necessary. From the proof of this theorem we can easily see that this condition can be replaced by the following linearity condition:

(C3) $E(X|B'Z) = \Sigma_{X,Z}B(B'\sigma_Z^2\Sigma_{Z,Z}B)^{-1}B'Z$ for a nonrandom matrix $B$.

The linearity condition (C3) is widely assumed in the context of high-dimensional models, for example in sliced inverse regression (SIR) for dimension reduction; see Li (1991) and Cook (1998). Hall and Li (1993) showed that this linearity condition often holds approximately when the dimension $q$ of $Z$ is large.


Additional discussion of this condition was given in Diaconis and Freedman (1984), Cook and Ni (2005), Zhu et al. (2006) and Zhu and Zhu (2009). We have the following theorem.

Theorem 3.2. For the working model (1.2), assume that conditions (C1), (C2) and (C3) hold, $U$ is elliptically symmetrically distributed, $f_U(u, \sigma_U^2)$ is bounded, and $B'\Sigma_{Z,Z}B$ is positive definite. Then, as $n \to \infty$,

$$\sqrt{n}(\hat\beta - \beta - \sigma_U^4 V^{-1}D\gamma) \stackrel{D}{\to} N(0, \sigma^2 V^{-1}),$$

where

$$V = E\{(X - E(X|Z))(X - E(X|Z))'\},$$
$$D = \Sigma_{X,Z}B(B'\sigma_Z^2\Sigma_{Z,Z}B)^{-2}B'B\,(I + \sigma_U^2\Sigma_{U,U}(B'\sigma_Z^2\Sigma_{Z,Z}B)^{-1}B'B)\,B'\sigma_Z^2\Sigma_{Z,Z}.$$

In particular, when $\sigma_Z^2\Sigma_{Z,Z} = \sigma_Z^2 I$ and $\sigma_U^2\Sigma_{U,U} = \sigma_U^2 I$, then, as $n \to \infty$,

$$\sqrt{n}\left(\hat\beta - \beta - \frac{\sigma_U^4}{(\sigma_Z^2 + \sigma_U^2)^2}\,V^{-1}\Sigma_{X,Z}\gamma\right) \stackrel{D}{\to} N(0, \sigma^2 V^{-1}).$$

Because the results are similar in the different cases, we present the results in the following corollaries, without proofs, only for the case of $E(X) = 0$, $E(Z) = 0$, $\sigma_Z^2\Sigma_{Z,Z} = \sigma_Z^2 I$ and $\sigma_U^2\Sigma_{U,U} = \sigma_U^2 I$. To eliminate the bias $(\sigma_U^4/(\sigma_Z^2 + \sigma_U^2)^2)\Sigma^{-1}\Sigma_{X,Z}\gamma$, we need $\sigma_U^2$ to tend to zero at a suitable rate. The following two corollaries present the results for $q < 4$ and $q \geq 4$, respectively.

Corollary 3.1. For the working model (1.2), suppose that either $(X,Z,U)$ is elliptically symmetrically distributed or (C3) holds, correspondingly either $\Sigma$ or $B'\Sigma_{Z,Z}B$ is positive definite, $f_U(u, \sigma_U^2)$ is bounded, $E(X) = 0$, $E(Z) = 0$, $\sigma_Z^2\Sigma_{Z,Z} = \sigma_Z^2 I$ and $\sigma_U^2\Sigma_{U,U} = \sigma_U^2 I$. In addition, the following condition holds:

(C4) $q < 4$ and, as $n \to \infty$, $m \geq n$, $\sigma_U \to 0$, $n\sigma_U^8 \to 0$ and $n\sigma_U^{2q} \to \infty$.

Then, as $n \to \infty$,

$$\sqrt{n}(\hat\beta - \beta) \stackrel{D}{\to} N(0, \sigma^2\Sigma^{-1}).$$

This corollary indicates that, for the case $q < 4$, the new estimator is consistent when $\sigma_U$ is chosen to be of order $O(n^{-B})$ with $1/8 < B < 1/(2q)$. Furthermore, we can use the asymptotic normality to construct confidence regions, or to construct the empirical likelihood given below. When $q \geq 4$, however, we need more conditions to guarantee asymptotic normality.

Corollary 3.2. For the working model (1.2), suppose that either $(X,Z,U)$ is elliptically symmetrically distributed or (C3) holds, correspondingly either $\Sigma$ or $B'\Sigma_{Z,Z}B$ is positive definite, $f_U(u, \sigma_U^2)$ is bounded, $E(X) = 0$, $E(Z) = 0$, $\sigma_Z^2\Sigma_{Z,Z} = \sigma_Z^2 I$ and $\sigma_U^2\Sigma_{U,U} = \sigma_U^2 I$. Furthermore, the following condition holds:

(C5) $q \geq 4$ and $\|\gamma\| = O(n^{-\zeta})$ for $\zeta > 1 - 4/q$ and, as $n \to \infty$, $m \geq n$, $\sigma_U \to 0$, $\sigma_U^8 n^{1-2\zeta} \to 0$ and $n\sigma_U^{2q} \to \infty$.

Then, as $n \to \infty$,

$$\sqrt{n}(\hat\beta - \beta) \stackrel{D}{\to} N(0, \sigma^2\Sigma^{-1}).$$

This corollary shows that, for $q \geq 4$ and $\|\gamma\|$ small enough (see (C5)), the new estimator is consistent if $\sigma_U$ is chosen to be of order $O(n^{-B})$ with $(1-2\zeta)/8 < B < 1/(2q)$.

3.1. Comparisons

We briefly compare the new estimator with the ordinary least squares (OLS) estimator and with an estimator based on a semiparametric procedure. The OLS estimator derived from the sub-model (1.2), denoted by $\hat\beta_S$, has asymptotic bias $\Sigma^{-1}\Sigma_{X,Z}\gamma$. Such a bias is non-negligible when $\gamma \neq 0$ and $\Sigma_{X,Z} \neq 0$. We can also estimate $\gamma'Z$ in the full model (1.1) by a nonparametric method. For example, given $\beta$, a common kernel estimator is

$$\tilde g_\beta(z) = \frac{\sum_{j=1}^n (Y_j - \beta'X_j) f_U(z - Z_j, \sigma_U^2)}{\sum_{j=1}^n f_U(z - Z_j, \sigma_U^2)}.$$

Plugging $\tilde g_\beta(z)$ into (2.1), the resulting estimator of $\beta$, denoted by $\hat\beta_N$, has the asymptotic representation

$$\sqrt{n}(\hat\beta_N - \beta - O(\sigma_U^4)) = O_p(\sigma_U^{(4-q)/2}) + O_p(1). \qquad (3.1)$$

The proof of (3.1) is given in the Appendix. On the right-hand side of (3.1), however, $\sigma_U^{(4-q)/2}$ is unbounded when $q > 4$ and $\sigma_U$ goes to zero. Comparing this with Theorem 3.1, we can see that the new estimator (2.5) outperforms the semiparametric estimator $\hat\beta_N$.
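For contrast, the naive kernel competitor $\tilde g_\beta$ differs from $\hat g_\beta$ only in smoothing directly over the observed $Z_j$ rather than averaging over the simulated $U_i$. A sketch highlighting that single difference (an illustration only):

```python
def g_tilde(z, beta, X, Y, Z, sigma_U):
    """Naive kernel smoother: weights f_U(z - Z_j) over the observed Z_j only,
    with no averaging over simulated U_i."""
    f = gaussian_fU(z - Z, sigma_U)
    return np.dot(Y - X @ beta, f / f.sum())
```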


4. Confidence region construction

Corollaries 3.1 and 3.2 give the asymptotic normality of the estimator $\hat\beta$ and can be employed to construct a confidence region for $\beta$. However, as the empirical likelihood proposed by Owen (1988, 1990) has been shown to have several advantages over the normal approximation in small-sample situations, we use it to deal with this issue. From the asymptotically unbiased model (2.4), we can construct estimating functions $Z_i(\beta) = \tilde X_i(\tilde Y_i - \beta'\tilde X_i)$ for $\beta$. Define the log-empirical likelihood ratio as

$$L(\beta) = -2\sup\left\{\sum_{i=1}^n \log(np_i) : p_i \geq 0,\ \sum_{i=1}^n p_i = 1,\ \sum_{i=1}^n p_i Z_i(\beta) = 0\right\}.$$

By Lagrange multipliers, the empirical likelihood ratio can be rewritten as

$$L(\beta) = 2\sum_{i=1}^n \log\{1 + \lambda'(\beta)Z_i(\beta)\},$$

where the Lagrange multiplier $\lambda(\beta)$ is determined by

$$\frac{1}{n}\sum_{i=1}^n \frac{Z_i(\beta)}{1 + \lambda'(\beta)Z_i(\beta)} = 0.$$

Theorem 4.1. Under the working model (1.2), if the conditions of Corollary 3.1 or 3.2 hold, then $L(\beta) \stackrel{d}{\to} \chi^2(p)$.

By Theorem 4.1, an approximate confidence region for $\beta$ can be expressed as $\{\beta : L(\beta) \leq \chi^2_\alpha(p)\}$.

5. Simulations

In this section we investigate the behavior of the simulation-based inference method by simulations. We use the mean, mean squared error (MSE), confidence region (including size and coverage probability) and prediction mean squared error (PMSE) to compare the estimators $\hat\beta$, $\hat\beta_S$ and $\hat\beta_F$ and the relevant models, where $\hat\beta$ is the newly proposed estimator (2.5), and $\hat\beta_S$ and $\hat\beta_F$ are the ordinary LS estimators based on the sub-model (1.2) and the full model (1.1), respectively. Moreover, different distribution types of $U$ are used to illustrate how much the proposed estimator depends on $U$. Mean, MSE, PMSE and empirical coverage probability are calculated by replication. The sample size and the number of generated $U$'s are denoted by $n$ and $m$, respectively, and the confidence level is chosen as $\alpha = 0.05$. In all the following examples we replicate the data 200 times and select $\sigma_U$ by GCV as defined in (2.8).

Example 1. In the full model (1.1), set $\beta = (6, -4, 5)'$, $\gamma = (0.3, 0.6, -0.4)'$, $X \sim N_3(0, I)$, $Z \sim N_3(0, I)$, $\varepsilon \sim N(0,1)$, $U \sim N_3(0, \sigma_U^2 I)$ and

$$\Sigma_{X,Z} = \begin{pmatrix} 0.95 & 0 & 0 \\ 0 & -0.2 & 0 \\ 0 & 0 & 0.9 \end{pmatrix}.$$

This model has high correlation between the corresponding components of $X$ and $Z$. Tables 1 and 2 report the simulation results for point estimation and confidence regions. These simulations show that, on the whole, the new estimator $\hat\beta$ is the best of the three. More precisely, we have two findings: (1) $\hat\beta_S$ is the worst of the three estimators, with relatively large biases and MSEs, and very low coverage probabilities; (2) the differences between the means and MSEs of $\hat\beta$ and $\hat\beta_F$ are not significant, especially when $n$ is large. In terms of the sizes of the confidence regions and the coverage probabilities, the components of $\hat\beta$ have well-balanced confidence intervals with reasonable lengths and coverage probabilities. The coverage probabilities of the confidence intervals of $\hat\beta_F$ are very small and not robust, although their lengths are shorter; thus such confidence regions are not credible.

Example 2. Now we choose $\beta = (6, -4, 5)'$, $\gamma = (0.3, 0.6, -0.4)'$, $X \sim N_3(0, I)$, $Z \sim N_3(0, I)$, $\varepsilon \sim N(0,1)$, $U \sim N_3(0, \sigma_U^2 I)$ and

$$\Sigma_{X,Z} = \begin{pmatrix} 0.6 & 0 & 0 \\ 0 & 0.2 & 0 \\ 0 & 0 & 0.4 \end{pmatrix}.$$

The main difference from Example 1 is that the correlation between $X$ and $Z$ is weaker. Tables 3 and 4 list the numerical results. In this case, because of the weak correlation, the ordinary least squares estimator $\hat\beta_F$ works well.
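For concreteness, the Example 1 design can be generated as follows (a sketch; the seed is arbitrary and $\sigma_U = 0.5$ stands in for the GCV-selected value):

```python
rng = np.random.default_rng(0)
n, m, p, q = 100, 120, 3, 3
beta_true = np.array([6.0, -4.0, 5.0])
gamma = np.array([0.3, 0.6, -0.4])
Sigma_XZ = np.diag([0.95, -0.2, 0.9])

# (X, Z) jointly normal: unit marginal covariances, Cov(X, Z) = Sigma_XZ
cov = np.block([[np.eye(p), Sigma_XZ], [Sigma_XZ.T, np.eye(q)]])
XZ = rng.multivariate_normal(np.zeros(p + q), cov, size=n)
X, Z = XZ[:, :p], XZ[:, p:]
Y = X @ beta_true + Z @ gamma + rng.standard_normal(n)
U = 0.5 * rng.standard_normal((m, q))   # stand-in sigma_U = 0.5; selected by GCV in the paper
```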


Table 1. Mean and MSE of point estimation for Example 1.

   n    m              Mean                              MSE
              β̂        β̂_S       β̂_F        β̂        β̂_S       β̂_F
  100  120   6.0912    6.2780    5.9890     0.0859    0.0915    0.1252
            -4.0097   -4.1222   -4.0030     0.0134    0.0291    0.0104
             5.0679    5.3562    5.0021     0.0585    0.1422    0.0583
  200  220   6.0747    6.2870    6.0009     0.0427    0.0890    0.0519
            -4.0021   -4.1254   -3.9976     0.0063    0.0220    0.0051
             5.0668    5.3771    5.0118     0.0293    0.1501    0.0296
  400  420   6.0479    6.2820    5.9886     0.0213    0.0833    0.0243
            -4.0036   -4.1217   -4.0007     0.0029    0.0190    0.0029
             5.0357    5.3591    4.9905     0.0145    0.1322    0.0128

Table 2. Length and coverage probability of confidence intervals for Example 1.

   n    m             Length                    Coverage probability
              β̂        β̂_S       β̂_F        β̂        β̂_S       β̂_F
  100  120   1.0733    0.4490    0.3659     0.9050    0.2950    0.4600
             0.4052    0.4433    0.3562     0.9300    0.7550    0.9250
             0.8420    0.4525    0.3652     0.9100    0.1550    0.5600
  200  220   0.7811    0.3229    0.2678     0.9450    0.0550    0.4300
             0.2911    0.3224    0.2658     0.9150    0.6950    0.9100
             0.6181    0.3233    0.2657     0.9300    0.0050    0.5700
  400  420   0.5759    0.2305    0.1954     0.9350    0.0050    0.4400
             0.2073    0.2303    0.1936     0.9350    0.4900    0.9300
             0.4354    0.2294    0.1920     0.9500    0         0.6050

Table 3. Mean and MSE of point estimation for Example 2.

   n    m              Mean                              MSE
              β̂        β̂_S       β̂_F        β̂        β̂_S       β̂_F
  100  120   6.0435    6.1870    6.0116     0.0203    0.0518    0.0189
            -4.0194   -4.1112   -4.0040     0.0136    0.0298    0.0128
             5.0291    5.1605    5.0078     0.0144    0.0405    0.0140
  200  220   6.0286    6.1787    6.0029     0.0084    0.0410    0.0079
            -4.0146   -4.1255   -4.0018     0.0065    0.0243    0.0057
             5.0287    5.1657    5.0088     0.0071    0.0341    0.0059
  400  420   6.0232    6.1809    6.0019     0.0042    0.0366    0.0039
            -4.0080   -4.1163   -3.9995     0.0030    0.0177    0.0028
             5.0097    5.1490    4.9958     0.0036    0.0266    0.0034

Meanwhile, the new estimator $\hat\beta$ is comparable with $\hat\beta_F$: the differences in means and MSEs between $\hat\beta$ and $\hat\beta_F$ are not significant, and $\hat\beta$ has slightly longer intervals with coverage probabilities closer to the nominal level. As in Example 1, $\hat\beta_S$ is the worst of the three estimators, having relatively large biases and MSEs, and clearly low coverage probabilities.

Example 3. Note that in Example 2 the components of $Z$ are independent. In this example we investigate the small-sample performance of the new estimator when the components of $Z$ are correlated. We take the same model as in Example 2 except that $Z \sim N_3(0, \Sigma_Z)$, where $\Sigma_Z = (\sigma_{z,ij})_{3\times 3}$ and $\sigma_{z,ij} = 0.4^{|i-j|}$. Simulation results are listed in Tables 5 and 6. There is little difference between the results of Examples 2 and 3. Due to the weak correlation between $X$ and $Z$, $\hat\beta_F$ works well. In this case the new estimator $\hat\beta$ is again comparable with $\hat\beta_F$, while $\hat\beta_S$ is the worst of the three estimators with respect to bias, MSE and coverage probability.


Table 4. Length and coverage probability of confidence intervals for Example 2.

   n    m             Length                    Coverage probability
              β̂        β̂_S       β̂_F        β̂        β̂_S       β̂_F
  100  120   0.4827    0.4699    0.3647     0.9050    0.6550    0.8400
             0.4050    0.4659    0.3594     0.9100    0.8200    0.8750
             0.4370    0.4691    0.3663     0.9150    0.7350    0.8850
  200  220   0.3457    0.3369    0.2665     0.9400    0.4600    0.8450
             0.2890    0.3356    0.2651     0.9050    0.6400    0.9150
             0.3075    0.3374    0.2676     0.9100    0.5250    0.9050
  400  420   0.2481    0.2419    0.1933     0.9250    0.1800    0.9000
             0.2047    0.2412    0.1930     0.9300    0.4850    0.9350
             0.2172    0.2423    0.1916     0.9100    0.2850    0.9050

Table 5. Mean and MSE of point estimation for Example 3.

   n    m              Mean                              MSE
              β̂        β̂_S       β̂_F        β̂        β̂_S       β̂_F
  100  120   6.0016    6.1684    6.0092     0.0199    0.0506    0.0218
            -3.9923   -3.8817   -4.0004     0.0141    0.0361    0.0119
             5.0173    5.1715    5.0163     0.0184    0.0465    0.0162
  200  220   5.9961    6.1821    5.9976     0.0093    0.0445    0.0088
            -3.9810   -3.8738   -3.9884     0.0068    0.0273    0.0066
             5.0136    5.1838    5.0110     0.0067    0.0446    0.0059
  400  420   6.0018    6.1901    6.0047     0.0046    0.0409    0.0050
            -3.9962   -3.8827   -4.0020     0.0030    0.0182    0.0029
             4.9971    5.1511    4.9939     0.0029    0.0275    0.0027

Table 6. Length and coverage probability of confidence intervals for Example 3.

   n    m             Length                    Coverage probability
              β̂        β̂_S       β̂_F        β̂        β̂_S       β̂_F
  100  120   0.5058    0.5196    0.3610     0.9250    0.7200    0.7700
             0.4108    0.5262    0.3657     0.9150    0.7650    0.8750
             0.4366    0.5231    0.3607     0.8750    0.7250    0.8300
  200  220   0.3597    0.3758    0.2654     0.9400    0.5150    0.8350
             0.2916    0.3783    0.2677     0.9250    0.7150    0.8950
             0.3100    0.3754    0.2645     0.9150    0.5100    0.9000
  400  420   0.2569    0.2706    0.1938     0.9300    0.1950    0.8050
             0.2035    0.2681    0.1918     0.9250    0.6050    0.9100
             0.2203    0.2705    0.1939     0.9600    0.4200    0.9400

Example 4. To investigate the behavior of the new estimator in a high-dimensional model, we here consider a five-dimensional $\beta = (4, -6, 5, 7, 8)'$, a 20-dimensional $\gamma = (0.3, -0.3, 0.6, -0.75, 0.6, 0.6, -0.3, 0.6, 0.6, 0.6, 0.75, -0.45, 0.9, 0.6, -0.9, 0.75, 0, 0, 0, 0)'$, $X \sim N_5(0, I)$, $Z \sim N_{20}(0, I)$, $\varepsilon \sim N(0,1)$, $U \sim N_{20}(0, \sigma_U^2 I)$ and

$$\Sigma_{X,Z} = \begin{pmatrix}
0.5 & 0.2 & 0.2 & 0.2 & 0.3 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0.6 & 0.2 & 0.4 & 0.3 & 0.2 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & 0.3 & 0.2 & 0.2 & 0.1 & 0.3 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & 0 & 0.5 & 0.4 & 0.3 & 0.2 & 0.4 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.6 & 0.4 & 0.2 & 0 & \cdots & 0
\end{pmatrix}.$$


Tables 7 and 8 suggest that, on the whole, the new estimator $\hat\beta$ is again the best of the three. The three estimators, however, do not behave quite as in Examples 1-3. Here $\hat\beta$ is slightly better than $\hat\beta_S$ because it has smaller MSEs and well-balanced confidence intervals with shorter lengths, although the coverage probabilities of both are higher than the nominal level. The estimator $\hat\beta_F$ simply does not work: it has large MSEs, and no confidence intervals could be computed because the covariance matrix of $(X,Z)$ is nearly singular. In Table 8, '—' means that the corresponding computational procedure collapsed during the computations.

Example 5. In this example we examine prediction accuracy when our method is applied. We compare the newly proposed methods (2.9) (denoted by M1) and (2.10) (denoted by M2) with the OLS method (denoted by M3), which is constructed from the sub-model (1.2) together with the original least squares estimation. In each replication, a training dataset and a validation dataset are generated, and the mean and standard deviation of PMSE are computed over 200 replications. We generate the data with $n = 100$ and $m = 120$ under the same setting as in Example 4 except that $E(Z) = 2$; thus the sub-model (1.2) is biased, i.e., $E(\eta|X) \neq 0$. The simulation results are reported in Table 9. Comparing the means and standard deviations of PMSE, we see that method M2 performs much better than M1 and M3. However, when the sub-model (1.2) is unbiased or approximately unbiased with $E(Z) = 0$, i.e., the setting is exactly that of Example 4, the results reported in Table 9 indicate that the differences between the three predictions are marginal. This suggests the importance of bias correction when a model is biased.
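The PMSE comparison of Example 5 amounts to a train/validate split per replication. A sketch of one replication using the functions defined earlier (the names are ours):

```python
def pmse_one_rep(Xtr, Ytr, Ztr, Xva, Yva, U, sigma_U):
    """PMSE of M1 (prediction (2.9)) and M2 (prediction (2.10)) on a validation set."""
    b = beta_hat(Xtr, Ytr, Ztr, U, sigma_U)
    g_bar = np.mean([g_hat(z, b, Xtr, Ytr, Ztr, U, sigma_U) for z in Ztr])
    pmse_m1 = np.mean((Yva - Xva @ b) ** 2)           # Y0_hat = X0' beta_hat
    pmse_m2 = np.mean((Yva - Xva @ b - g_bar) ** 2)   # Y0_tilde = X0' beta_hat + g_bar
    return pmse_m1, pmse_m2
```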

Table 7. Mean and MSE of point estimation for Example 4.

   n    m              Mean                              MSE
              β̂        β̂_S       β̂_F        β̂        β̂_S       β̂_F
  100  120   4.1072    4.1848    3.9862     0.0574    0.1223    0.1136
            -6.0116   -6.0619   -6.0084     0.0563    0.0900    0.3207
             4.9795    4.9957    5.0295     0.0470    0.0855    0.1811
             6.9403    6.8913    7.0636     0.0522    0.0774    0.7500
             8.0157    7.9969    8.0407     0.0469    0.0658    0.3062
  200  220   4.1037    4.1861    3.9798     0.0355    0.0749    0.0455
            -5.9955   -6.0877   -6.0223     0.0180    0.0394    0.1147
             4.9935    5.0252    5.0271     0.0212    0.0338    0.0718
             6.9238    6.9122    7.0486     0.0263    0.0477    0.2851
             7.9889    7.9944    8.0176     0.0232    0.0409    0.1184
  400  420   4.0966    4.1737    3.9994     0.0192    0.0488    0.0205
            -5.9912   -6.0604   -5.9823     0.0090    0.0200    0.0541
             4.9868    4.9877    4.9782     0.0086    0.0184    0.0335
             6.9079    6.8655    6.9531     0.0166    0.0370    0.1334
             7.9742    8.0068    7.9596     0.0099    0.0191    0.0547

Table 8. Length and coverage probability of confidence intervals for Example 4 ('—' indicates the computation collapsed).

   n    m             Length                    Coverage probability
              β̂        β̂_S       β̂_F        β̂        β̂_S       β̂_F
  100  120   0.8990    0.9888    —          0.9200    0.8250    —
             0.9829    0.9960    —          0.9500    0.9000    —
             0.8373    0.9868    —          0.9500    0.9300    —
             0.9717    0.9873    —          0.9300    0.9150    —
             0.9232    0.9751    —          0.9350    0.9400    —
  200  220   0.6531    0.7219    —          0.9200    0.7950    —
             0.7127    0.7271    —          0.9950    0.9500    —
             0.6110    0.7254    —          0.9700    0.9450    —
             0.7181    0.7291    —          0.9800    0.8950    —
             0.6770    0.7198    —          0.9800    0.9050    —
  400  420   0.4685    0.5220    —          0.9200    0.7400    —
             0.5124    0.5168    —          0.9900    0.9550    —
             0.4336    0.5158    —          0.9950    0.9350    —
             0.5172    0.5194    —          0.9700    0.8100    —
             0.4848    0.5188    —          0.9900    0.9600    —


Table 9. Mean and standard deviation (SD) of PMSE for Example 5.

                       M1         M2         M3
E(Z) = 2    Mean    50.6227     7.5492    52.8717
            SD       3.8302     1.0849     4.1682
E(Z) = 0    Mean     7.5776     7.6943     7.6721
            SD       1.1365     1.1586     1.1277

Table 10. Mean and standard deviation (SD) of parameter estimation for Example 6.

Statistic   Distribution            β̂_1       β̂_2       β̂_3      β̂_4      β̂_5
Mean        Normal distribution    4.1016   -6.0026    5.0136   6.9230   7.9736
            t distribution         4.0999   -6.0005    5.0129   6.9165   7.9792
SD          Normal distribution    0.1939    0.2276    0.2179   0.2245   0.2047
            t distribution         0.1945    0.2291    0.2193   0.2111   0.2038

Table 11. Mean and standard deviation (SD) of PMSE for Example 6.

Statistic   Normal distribution   t distribution
Mean              7.6547              7.6505
SD                1.1593              1.1546

Example 6. Finally we examine the influence of the distribution $f_U$ on the performance of the method. In the simulation we consider two types of distribution: the multivariate normal distribution $N(0, \sigma_U^2 I)$ and the multivariate $t$ distribution $t(g, \sigma_U^2 I)$, where $g$ is the degrees of freedom and $\sigma_U^2 I$ is the covariance matrix. We consider the case $n = 100$, $m = 120$, $g = 5$, with data generated from the same model as in Example 4. The mean and standard deviation (SD) of the newly proposed estimator, and of the PMSE from Eq. (2.10), over 200 replications are reported in Tables 10 and 11, respectively. The results suggest that the influence of the different distributions on the new method is not significant, in terms of both parameter estimation and model prediction.

Appendix A. Proofs

We prove the results only for the case $E(X) = 0$, $E(Z) = 0$, $\sigma_Z^2\Sigma_{Z,Z} = \sigma_Z^2 I$ and $\sigma_U^2\Sigma_{U,U} = \sigma_U^2 I$. For the other cases the proofs are similar and thus omitted.

Proof of Theorem 3.1. Note that $\tilde Y_i = \beta'\tilde X_i + \gamma'\tilde Z_i + \tilde\varepsilon_i$, where $\tilde Z_i$ and $\tilde\varepsilon_i$ are the centered versions of $Z_i$ and $\varepsilon_i$, respectively, analogous to the representations of $\tilde X_i$ and $\tilde Y_i$. Then

$$\hat\beta = \beta + S_n^{-1}\frac{1}{n}\sum_{i=1}^n \tilde X_i\tilde Z_i'\gamma + S_n^{-1}\frac{1}{n}\sum_{i=1}^n \tilde X_i\tilde\varepsilon_i.$$

A further decomposition leads to

$$\frac{1}{n}\sum_{i=1}^n \tilde X_i\tilde Z_i'\gamma = \frac{1}{n}\sum_{i=1}^n \left\{(X_i - E(X_i|Z_i)) + \left(E(X_i|Z_i) - \sum_{j=1}^n X_j M_j(Z_i)\right)\right\}\left(Z_i - \sum_{j=1}^n Z_j M_j(Z_i)\right)'\gamma$$
$$= \frac{1}{n}\sum_{i=1}^n (X_i - E(X_i|Z_i))\left(Z_i - \sum_{j=1}^n Z_j M_j(Z_i)\right)'\gamma + \frac{1}{n}\sum_{i=1}^n \left(E(X_i|Z_i) - \sum_{j=1}^n X_j M_j(Z_i)\right)\left(Z_i - \sum_{j=1}^n Z_j M_j(Z_i)\right)'\gamma =: I_1 + I_2.$$

By the properties of elliptically symmetric distributions, we can verify

$$I_1 = \frac{1}{n}\sum_{i=1}^n \left(X_i - \frac{1}{\sigma_Z^2}\Sigma_{X,Z}Z_i\right)\left(Z_i - \sum_{j=1}^n Z_j M_j(Z_i)\right)'\gamma.$$


Since $U$ is elliptically symmetrically distributed with $E(U) = 0$ and $\sigma_U^2\Sigma_{U,U} = \sigma_U^2 I$, $f_U(u, \sigma_U^2)$ can be expressed as $f_U(u, \sigma_U^2) = \sigma_U^{-q}\tilde f(\sigma_U^{-2}u'u)$ for some function $\tilde f$. As a result,

$$E(f_U^2(w - Z, \sigma_U^2)) = \sigma_U^{-2q}\int \tilde f^2(\sigma_U^{-2}(z-w)'(z-w)) f_Z(z)\,dz = O(1/\sigma_U^q).$$

Then

$$\frac{1}{n}\sum_{k=1}^n f_U(w - Z_k, \sigma_U^2) = f_W(w) + O_p((n\sigma_U^q)^{-1/2}).$$

These results lead to

$$z - \sum_{j=1}^n Z_j M_j(z) = z - \sum_{j=1}^n Z_j\,\frac{1}{m}\sum_{i=1}^m \frac{f_U(z + U_i - Z_j, \sigma_U^2)}{\sum_{k=1}^n f_U(z + U_i - Z_k, \sigma_U^2)} = z - \frac{1}{n}\sum_{j=1}^n Z_j\,\frac{1}{m}\sum_{i=1}^m \frac{f_U(z + U_i - Z_j, \sigma_U^2)}{f_W(z + U_i)}\,(1 + O_p((n\sigma_U^q)^{-1/2}))$$
$$= z - \frac{1}{m}\sum_{i=1}^m E(Z_i \mid z + U_i)\,(1 + O_p((n\sigma_U^q)^{-1/2})) = z - E\{E(Z_i|W_i) \mid Z_i = z\}\,(1 + O_p((n\sigma_U^q)^{-1/2})) = z - \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_U^2}\,z\,(1 + O_p((n\sigma_U^q)^{-1/2})).$$

Consequently,

$$I_1 = \frac{1}{n}\sum_{i=1}^n \left(X_i - \frac{1}{\sigma_Z^2}\Sigma_{X,Z}Z_i\right)\left(Z_i - \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_U^2}Z_i\right)'\gamma + \frac{1}{n}\sum_{i=1}^n \left(X_i - \frac{1}{\sigma_Z^2}\Sigma_{X,Z}Z_i\right)Z_i'\gamma\,O_p((n\sigma_U^q)^{-1/2})$$
$$= \frac{1}{n}\sum_{i=1}^n \left(X_i - \frac{1}{\sigma_Z^2}\Sigma_{X,Z}Z_i\right)\frac{\sigma_U^2}{\sigma_Z^2 + \sigma_U^2}Z_i'\gamma + \frac{1}{n}\sum_{i=1}^n \left(X_i - \frac{1}{\sigma_Z^2}\Sigma_{X,Z}Z_i\right)Z_i'\gamma\,O_p((n\sigma_U^q)^{-1/2}).$$

Note that $E\{(X_i - (1/\sigma_Z^2)\Sigma_{X,Z}Z_i)Z_i'\} = 0$ and $\mathrm{Cov}\{(X_i - (1/\sigma_Z^2)\Sigma_{X,Z}Z_i)(\sigma_U^2/(\sigma_Z^2 + \sigma_U^2))Z_i'\gamma\} \to 0$. These imply $I_1 = o_p(n^{-1/2})$. By the same arguments used above, we have

$$I_2 = \frac{1}{n}\sum_{i=1}^n \{E(X_i|Z_i) - E(E(X_i|W_i)|Z_i) + O_p((n\sigma_U^q)^{-1/2})\}\{Z_i - E(E(Z_i|W_i)|Z_i) + O_p((n\sigma_U^q)^{-1/2})\}'\gamma$$
$$= \frac{1}{n}\sum_{i=1}^n \left\{\frac{\Sigma_{X,Z}}{\sigma_Z^2}Z_i - \frac{\Sigma_{X,Z}}{\sigma_Z^2 + \sigma_U^2}Z_i + O_p((n\sigma_U^q)^{-1/2})\right\}\left\{Z_i - \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_U^2}Z_i + O_p((n\sigma_U^q)^{-1/2})\right\}'\gamma$$
$$= \frac{\sigma_U^4}{n(\sigma_Z^2 + \sigma_U^2)^2\sigma_Z^2}\,\Sigma_{X,Z}\sum_{i=1}^n Z_iZ_i'\gamma + o_p(n^{-1/2}) + O_p((n\sigma_U^q)^{-1}).$$

Moreover,

$$E\left(\frac{\sigma_U^4}{(\sigma_Z^2 + \sigma_U^2)^2\sigma_Z^2}\Sigma_{X,Z}Z_iZ_i'\gamma\right) = \frac{\sigma_U^4}{(\sigma_Z^2 + \sigma_U^2)^2}\Sigma_{X,Z}\gamma \quad\text{and}\quad \mathrm{Cov}\left(\frac{\sigma_U^4}{(\sigma_Z^2 + \sigma_U^2)^2\sigma_Z^2}\Sigma_{X,Z}Z_iZ_i'\gamma\right) = o(1).$$

Then

$$\frac{1}{n}\sum_{i=1}^n \tilde X_i\tilde Z_i'\gamma = \frac{\sigma_U^4}{(\sigma_Z^2 + \sigma_U^2)^2}\Sigma_{X,Z}\gamma + o_p(n^{-1/2}) + O_p((n\sigma_U^q)^{-1}).$$

Similarly,

$$\frac{1}{n}\sum_{i=1}^n \tilde X_i\tilde\varepsilon_i = \frac{1}{n}\sum_{i=1}^n (X_i - E(E(X_i|W_i)|Z_i))(\varepsilon_i - E(E(\varepsilon_i|W_i)|Z_i)) + \xi_n = \frac{1}{n}\sum_{i=1}^n \left(X_i - \frac{1}{\sigma_Z^2}\Sigma_{X,Z}Z_i\right)\varepsilon_i + \xi_n,$$

where $\xi_n$ is of smaller order than $\frac{1}{n}\sum_{i=1}^n (X_i - (1/\sigma_Z^2)\Sigma_{X,Z}Z_i)\varepsilon_i$. Note that $E((X_i - (1/\sigma_Z^2)\Sigma_{X,Z}Z_i)\varepsilon_i) = 0$ and $\mathrm{Cov}((X_i - (1/\sigma_Z^2)\Sigma_{X,Z}Z_i)\varepsilon_i) = \sigma^2\Sigma$. Then $\frac{1}{\sqrt{n}}\sum_{i=1}^n \tilde X_i\tilde\varepsilon_i \stackrel{D}{\to} N(0, \sigma^2\Sigma)$.


On the other hand,

$$S_n = E\{(X - E(X|Z))(X - E(X|Z))'\} + o(1) = E\left\{\left(X - \frac{1}{\sigma_Z^2}\Sigma_{X,Z}Z\right)\left(X - \frac{1}{\sigma_Z^2}\Sigma_{X,Z}Z\right)'\right\} + o(1) \to \Sigma.$$

Therefore, combining the results above, we conclude the proof. □

Proof of Theorem 3.2. Note that

$$\frac{1}{n}\sum_{i=1}^n \tilde X_i\tilde Z_i'\gamma = \frac{1}{n}\sum_{i=1}^n \left\{(X_i - E(X_i|B'Z_i)) + \left(E(X_i|B'Z_i) - \sum_{j=1}^n X_j M_j(Z_i)\right)\right\}\left(Z_i - \sum_{j=1}^n Z_j M_j(Z_i)\right)'\gamma$$
$$= \frac{1}{n}\sum_{i=1}^n (X_i - E(X_i|B'Z_i))\left(Z_i - \sum_{j=1}^n Z_j M_j(Z_i)\right)'\gamma + \frac{1}{n}\sum_{i=1}^n \left(E(X_i|B'Z_i) - \sum_{j=1}^n X_j M_j(Z_i)\right)\left(Z_i - \sum_{j=1}^n Z_j M_j(Z_i)\right)'\gamma.$$

Then, by the same arguments used in the proof of Theorem 3.1, we can prove the theorem. □

Proofs of Corollaries 3.1 and 3.2. The proofs follow directly from the proof of Theorem 3.1. □

Proof of Theorem 4.1. By the same arguments used in the proof of Theorem 3.1, we have

$$\frac{1}{n}\sum_{i=1}^n Z_i(\beta) = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n \{X_i - E(E(X_i|W_i)|Z_i) + O_p((\sigma_U^q n)^{-1/2})\}\,[Y_i - Y_j - \beta'(X_i - X_j)]\,M_j(Z_i)$$
$$= \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n \left\{X_i - \frac{\Sigma_{X,Z}}{\sigma_Z^2}Z_i + o_p(1)\right\}[\gamma'(Z_i - Z_j) + \varepsilon_i - \varepsilon_j]\,M_j(Z_i)$$
$$= \frac{1}{n}\sum_{i=1}^n \left\{X_i - \frac{\Sigma_{X,Z}}{\sigma_Z^2}Z_i + o_p(1)\right\}\gamma'\left(Z_i - \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_U^2}Z_i\right)(1 + o_p(1)) + \frac{1}{n}\sum_{i=1}^n \left\{X_i - \frac{\Sigma_{X,Z}}{\sigma_Z^2}Z_i + o_p(1)\right\}\varepsilon_i(1 + o_p(1))$$
$$= \frac{1}{n}\sum_{i=1}^n \left\{X_i - \frac{\Sigma_{X,Z}}{\sigma_Z^2}Z_i + o_p(1)\right\}\varepsilon_i(1 + o_p(1)) + o_p(n^{-1/2}).$$

This leads to

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i(\beta) \stackrel{D}{\to} N(0, \sigma^2\Sigma).$$

The result of the theorem then follows from this, Theorem 2.1 of Hjort et al. (2009), and Theorem 1 of Qin and Lawless (1994). □

Proof of (3.1). We can easily verify that

$$\sqrt{n}(\hat\beta_N - \beta) = M_n^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n X_{Ni}Z_{Ni}'\gamma + M_n^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n X_{Ni}\varepsilon_{Ni},$$

where

$$X_{Ni} = X_i - \frac{\sum_{j=1}^n X_j f_U(Z_i - Z_j, \sigma_U^2)}{\sum_{j=1}^n f_U(Z_i - Z_j, \sigma_U^2)}, \qquad Y_{Ni} = Y_i - \frac{\sum_{j=1}^n Y_j f_U(Z_i - Z_j, \sigma_U^2)}{\sum_{j=1}^n f_U(Z_i - Z_j, \sigma_U^2)}$$

and

$$\varepsilon_{Ni} = \varepsilon_i - \frac{\sum_{j=1}^n \varepsilon_j f_U(Z_i - Z_j, \sigma_U^2)}{\sum_{j=1}^n f_U(Z_i - Z_j, \sigma_U^2)}.$$

By the same argument used in the proof of Theorem 3.1, we get

$$\frac{\sum_{j=1}^n X_j f_U(z - Z_j, \sigma_U^2)}{\sum_{j=1}^n f_U(z - Z_j, \sigma_U^2)} = \frac{E(X f_U(z - Z, \sigma_U^2))}{E(f_U(z - Z, \sigma_U^2))} + O_p((n\sigma_U^q)^{-1/2}).$$

Further,

$$E(X f_U(z_0 - Z, \sigma_U^2)) = \iint x f_U(z_0 - z, \sigma_U^2) f_{X,Z}(x,z)\,dx\,dz = \iint x f_U(u,1) f_{X,Z}(x, z_0 - \sigma_U u)\,dx\,du$$
$$= \iint x f_U(u,1)\{f_{X,Z}(x,z_0) - f_{X,Z}'(x,z_0)\sigma_U u + f_{X,Z}''(x,z_0)\sigma_U^2 u^2 + o(\sigma_U^2)\}\,dx\,du = \int x f_{X,Z}(x,z_0)\,dx + O(\sigma_U^2),$$

and similarly $E(f_U(z - Z, \sigma_U^2)) = f_Z(z) + O(\sigma_U^2)$. Then

$$\frac{\sum_{j=1}^n X_j f_U(Z_i - Z_j, \sigma_U^2)}{\sum_{j=1}^n f_U(Z_i - Z_j, \sigma_U^2)} = E(X_i|Z_i) + O(\sigma_U^2) + O_p((n\sigma_U^q)^{-1/2}) = \frac{1}{\sigma_Z^2}\Sigma_{X,Z}Z_i + O(\sigma_U^2) + O_p((n\sigma_U^q)^{-1/2}).$$

Similarly, we can also show that

$$\frac{\sum_{j=1}^n Z_j f_U(Z_i - Z_j, \sigma_U^2)}{\sum_{j=1}^n f_U(Z_i - Z_j, \sigma_U^2)} = Z_i + O(\sigma_U^2) + O_p((n\sigma_U^q)^{-1/2}).$$

It then follows that

$$M_n^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n X_{Ni}Z_{Ni}'\gamma = O(\sqrt{n}\sigma_U^4) + O_p(\sigma_U^{(4-q)/2}).$$

Moreover, $M_n^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n X_{Ni}\varepsilon_{Ni} = O_p(1)$. Combining the results above, we complete the proof. □

References

Carroll, R.J., Ruppert, D., Stefanski, L.A., Crainiceanu, C.M., 2006. Measurement Error in Nonlinear Models, second ed. Chapman & Hall, New York.
Cook, J.R., Stefanski, L.A., 1994. Simulation-extrapolation estimation in parametric measurement error models. J. Amer. Statist. Assoc. 89, 1314–1327.
Cook, R.D., 1998. Regression Graphics: Ideas for Studying Regressions through Graphics. Wiley & Sons, New York.
Cook, R.D., Ni, L., 2005. Sufficient dimension reduction via inverse regression: a minimum discrepancy approach. J. Amer. Statist. Assoc. 100, 410–428.
Craven, P., Wahba, G., 1979. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 31, 377–403.
Delaigle, A., Hall, P., Müller, H.-G., 2007. Accelerated convergence for nonparametric regression with coarsened predictors. Ann. Statist. 35, 2639–2653.
Diaconis, P., Freedman, D., 1984. Asymptotics of graphical projection pursuit. Ann. Statist. 12, 793–815.
Ehsanes Saleh, A.K.M., 2006. Theory of Preliminary Test and Stein-type Estimation with Applications. John Wiley & Sons Inc.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Fan, J., Peng, H., 2004. Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32, 928–961.
Glad, I.K., 1998. Parametrically guided non-parametric regression. Scand. J. Statist. 25, 649–668.
Hall, P., Li, K.C., 1993. On almost linearity of low dimensional projections from high dimensional data. Ann. Statist. 21, 867–889.
Härdle, W., Marron, J.S., 1985. Optimal bandwidth selection in nonparametric regression function estimation. Ann. Statist. 13, 1465–1481.
Hjort, N.L., Glad, I.K., 1995. Nonparametric density estimation with a parametric start. Ann. Statist. 23, 882–904.
Hjort, N.L., Jones, M.C., 1996. Locally parametric nonparametric density estimation. Ann. Statist. 24, 1619–1647.
Hjort, N.L., McKeague, I.W., Van Keilegom, I., 2009. Extending the scope of empirical likelihood. Ann. Statist. 37, 1079–1111.
Li, K.C., 1991. Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 86, 316–327.
Lin, L., Cui, X., Zhu, L.X., 2009. An adaptive two-stage estimation method for additive models. Scand. J. Statist. 36, 248–269.
Ma, Y., Chiou, J.M., Wang, N., 2006. Efficient semiparametric estimator for heteroscedastic partially linear models. Biometrika 93, 75–84.
Naito, K., 2004. Semiparametric density estimation by local L2 fitting. Ann. Statist. 32, 1162–1191.
Novick, S.J., Stefanski, L.A., 2002. Corrected score estimation via complex variable simulation extrapolation. J. Amer. Statist. Assoc. 97, 472–481.
Owen, A., 1988. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237–249.
Owen, A., 1990. Empirical likelihood ratio confidence regions. Ann. Statist. 18, 90–120.
Qin, J., Lawless, J., 1994. Empirical likelihood and general estimating equations. Ann. Statist. 22, 300–325.
Sen, P.K., 1979. Asymptotic properties of maximum likelihood estimators based on conditional specification. Ann. Statist. 7, 1019–1033.
Sen, P.K., Ehsanes Saleh, A.K.M., 1987. On preliminary test and shrinkage M-estimation in linear models. Ann. Statist. 15, 1580–1592.
Stefanski, L.A., Cook, J.R., 1995. Simulation-extrapolation: the measurement error jackknife. J. Amer. Statist. Assoc. 90, 1247–1256.
Wahba, G., 1977. A survey of some smoothing problems and the method of generalized cross-validation for solving them. In: Krishnaiah, P.R. (Ed.), Applications of Statistics. North-Holland, Amsterdam, pp. 507–523.
Zhu, L.X., Miao, B.Q., Peng, H., 2006. On sliced inverse regression with high dimensional covariates. J. Amer. Statist. Assoc. 101, 630–643.
Zhu, L.P., Zhu, L.X., 2009. On distribution-weighted partial least squares with diverging number of highly correlated predictors. J. R. Statist. Soc. B 71, 524–548.