Consistent inference for biased sub-model of high-dimensional partially linear model

Consistent inference for biased sub-model of high-dimensional partially linear model

Journal of Statistical Planning and Inference 141 (2011) 1888–1898 Contents lists available at ScienceDirect Journal of Statistical Planning and Inf...

331KB Sizes 5 Downloads 54 Views

Journal of Statistical Planning and Inference 141 (2011) 1888–1898

Contents lists available at ScienceDirect

Journal of Statistical Planning and Inference journal homepage: www.elsevier.com/locate/jspi

Consistent inference for biased sub-model of high-dimensional partially linear model$ Yujie Gai a, Lu Lin a,, Xiuli Wang a,b a b

School of Mathematics, Shandong University, Jinan, China School of Mathematics Science, Shandong Normal University, Jinan, China

a r t i c l e in fo

abstract

Article history: Received 21 June 2009 Received in revised form 27 July 2010 Accepted 26 November 2010 Available online 2 December 2010

In this paper, we study a working sub-model of partially linear model determined by variable selection. Such a sub-model is more feasible and practical in application, but usually biased. As a result, the common parameter estimators are inconsistent and the corresponding confidence regions are invalid. To deal with the problems relating to the model bias, a nonparametric adjustment procedure is provided to construct a partially unbiased sub-model. It is proved that both the adjusted restricted-model estimator and the adjusted preliminary test estimator are partially consistent, which means when the samples drop into some given subspaces, the estimators are consistent. Luckily, such subspaces are large enough in a certain sense and thus such a partial consistency is close to global consistency. Furthermore, we build a valid confidence region for parameters in the sub-model by the corresponding empirical likelihood. & 2010 Elsevier B.V. All rights reserved.

Keywords: Biased sub-model Partially linear model Consistent estimator Nonparametric adjustment Valid confidence region Variable selection

1. Introduction Partially linear model has been fully investigated in the literature and widely used in practice; see for example Engle et al. ¨ (1986), Chen (1988), Heckman (1986), Speckmen (1988) and Hardle et al. (2000). In this paper, we consider its highdimensional form as Y ¼ buX þ guZ þgðTÞ þ e,

ð1:1Þ

where Yi are i.i.d. observations of the response variable Y and ðTi ,Xi u,Zi uÞ are observations of the associated covariates ðT,Xu,ZuÞ, b ¼ ðb1 , . . . , bp Þu is a p-dimensional vector of unknown parameters, g ¼ ðg1 , . . . , gq Þu is a q-dimensional vector of unknown parameters, gðÞ is an unknown function. Due to the curse of dimensionality, we assume, for simplicity, that T is a univariate, ei ’s are i.i.d. errors satisfying EðejX,Z,TÞ ¼ 0,

DðejX,Z,TÞ ¼ s2 :

Here the dimension q of g may be very high and even may tend to infinity as the sample size increases. Let ðXu,ZuÞ ¼ Wu, then B ¼ E½ðWEðWjTÞÞðWEðWjTÞÞu 4 0 is also required for identification of (1.1). $ This paper is supported by NNSF project (10771123) of China, NBRP (973 Program 2007CB814901) of China, RFDP (20070422034) of China, NSF project (ZR2010AZ001) of Shandong Province of China.  Corresponding author. E-mail address: [email protected] (L. Lin).

0378-3758/$ - see front matter & 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2010.11.041

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

1889

The model (1.1) is regarded as a full model, which is built at the initial stage of modeling and contains all of possibly relevant variables. Because of the high-dimension of the linear part, one usually uses variable selection technique to remove less significant variables (Li and Liang, 2008). Without loss of generality, we suppose that Z is relatively insignificant and thus is removed from the model (1.1). Then we obtain a sub-model as Y ¼ buX þ gðTÞ þ Z:

ð1:2Þ

This model is assumed to be a low-dimensional working model. After variable selection, under certain conditions, sub-model (1.2) could be considered to be non-random, see Lin et al. (2010). So here we just treat X in (1.2) as some pre-specified components of W. Our goal in this paper is to construct efficient estimator for b whether the bias EðZjXÞ is small or large. Then g is regarded as a high-dimensional nuisance parameter vector. To estimate b, a natural method is to estimate the full parameter vector in (1.1) directly. Under regularity conditions, the resultant estimator for b is consistent (Chen, 1988). However, this method is unavailable when the dimension q of g is high. Nowadays a popularly used method is to select variable and estimate parameter simultaneously; see for example Fan and Li (2001) and Li and Liang (2008). By this way, the resultant estimator is always consistent. On the other hand, some traditional estimation procedures, for example, the restricted-model (RM) estimator which only depending on the sub-model (1.2) and the preliminary test (PT) estimator which uses a test to decide that the estimator for b is based on either the full model (1.1) or the working model (1.2), are also unavailable when the dimension is high or the sub-model is biased; see for example Saleh (2006), Sen and Saleh (1987) and Sen (1979). The above discussions indicate that the construction of consistent estimator is still challenging in modeling with variable selection. Lin et al. (2008b) studied the linear regression model, and provided with consistent estimators for low-dimensional working model. In this paper for the partially linear model, we suggest a nonparametric procedure to adjust the biased sub-model to be partially unbiased, where the partial unbiasedness means the estimators are consistent only when the samples drop into some given subspaces. Fortunately, such subspaces are large enough in a certain sense. Then such a partial consistency is close to global consistency. Unlike the variable selection, the procedure of nonparametric adjustment works via an inverse way, that is, the sub-model (1.2) is first thought of as a prior or approximate choice and then is adjusted by nonparametric method. Consequently, the newly proposed estimators are always consistent, without the use of high-dimensional parameter estimator. A further investigation shows that the new model works much better than the other classical models when the covariates are correlated. This implies that the new model could deal with the high-dimensional cases successfully in which the correlation of the covariates could not be negligible. The simulation results have confirmed our theoretical conclusions. Furthermore the larger the ratio of dimension of the parameter p with the sample size n, the better the new model performs. On the other, the nonparametric part improves much better than the parameter part does. It is not surprising for such a result since the nonparametric part usually contains much complex information and even contains the information contained in the deleted variables. Thus, when we make bias-correction procedure, the inference for nonparametric part will be first modified. The rest of paper is organized as follows. In Section 2, a bias-correction model is proposed for the sub-model of the partially linear models, and, consequently, an adjusted RM estimator and adjusted PT estimator are defined. The consistency and efficiency of parameter estimator are obtained. In Section 3, the adjusted estimating equation, together with empirical likelihood is employed to construct confidence region. Simulations are given in Section 4 to evaluate the adjustment method and compare it with the classical ones. The proofs of the theorems are presented in the Appendix. 2. Estimation for partially linear sub-model In this section we first analyze the bias of the sub-model and then adjust it to be unbiased. Based on the modified submodel, a corresponding adjusted RM estimator and an adjusted PT estimator are introduced and their consistency and efficiency are proved later whether the dimension of the part of parameter vector in the full model is high or not. 2.1. The existing parameter estimators For the model (1.1), the profile LS estimator for the full parameter vector ðbu, guÞu is defined as ! ! !1 b^ F X~ u ~ X~ uX~ X~ uZ~ Y, ¼ g^ F Z~ u Z~ uX~ Z~ uZ~

ð2:1Þ

P P P where X~ ¼ ðX~ 1 , . . . , X~ n Þu, X~ i ¼ Xi  nj¼ 1 onj ðtÞXj , Z~ ¼ ðZ~ 1 , . . . , Z~ n Þu, Z~ i ¼ Zi  nj¼ 1 onj ðtÞZj , Y~ ¼ ðY~ 1 , . . . , Y~ n Þu and Y~ i ¼ Yi  nj¼ 1 onj ðtÞYj with oni ðtÞ’s being positive weight functions. The LS estimator is consistent and efficient under some special conditions (Chen, 1988). When the dimension of the parameter vector is very high, the correlation among the covariates are usually very high. Thus the inversion in (2.1) is usually nonexistent, which means the full model (1.1) becomes invalid in high-dimensional case. As is shown in the Introduction, we in this paper suppose that, after a variable selection procedure, the high-dimensional covariate vector Z is eliminated and then the working sub-model (1.2) is obtained. For sub-model (1.2), the profile LS estimator of b is

b^S ¼ ðX~ uX~ Þ1 X~ uY~ :

ð2:2Þ

1890

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

In fact the above estimator in (2.2) is a so-called RM estimator since it is derived only from the restricted sub-model (1.2). It can be easily verified that b^ S is an inconsistent estimator because of the bias of sub-model (1.2). More precisely, the model bias is EðZjXÞ ¼ EðYbuXgðTÞjXÞ ¼ EðguZ þ ejXÞ ¼ guEðZjXÞ:

ð2:3Þ

Here EðZjX ¼ xÞ is supposed to be a nonzero function. Obviously, this nonzero assumption is reasonable particular for highdimensional model and thus the model (1.2) is biased unless g ¼ 0. We now examine the consistency of the PT estimator, which is associated with both models (1.1) and (1.2). Besides the conditions of the model (1.1), the normal assumption e  Nð0, s2 Þ is also required for the definition of PT estimator given below. Following from the profile empirical likelihood ratio proposed by Fan and Huang (2005), we now define a PT estimator of b as 8 < b^ S ifLn o w2q ðaÞ, ^ b PT ¼ ^ ð2:4Þ : b F ifLn Z w2q ðaÞ, where w2q ðaÞ ð0 o a o 1Þ is the upper 100a% point of w2 distribution with q degrees of freedom, and Ln is the profile loglikelihood ratio statistic for the null hypothesis H0 : g ¼ 0. More precisely, Ln is defined by Ln ¼ 2ðln ðH1 Þln ðH0 ÞÞ ¼ nlog

RSS0 RSS0 RSS1 n , RSS1 RSS1

ð2:5Þ

where ln ðHi Þ ¼ n=2logð2p=nÞðn=2ÞlogðRSSi Þn=2, RSS1 ¼

n X

i ¼ 0,1,

ðYi Xi ub^ F Zi ug^ F g^ F ðTi ÞÞ2 ,

i¼1

g^ F ðÞ ¼

n X

onj ðÞðYj Xj ub^ uF Zj ug^ F Þ,

j¼1

RSS0 ¼

n X

ðYi Xi ub^ S g^ S ðTi ÞÞ2 ,

i¼1

g^ S ðÞ ¼

n X

onj ðÞðYj Xj ub^ S Þ:

j¼1

Such an estimator (2.4) is still biased (see Remark 3 given below) and also depends on the estimator of high-dimensional parameter vector g. Therefore it is inconsistent and will be unavailable when the dimension of g is high. 2.2. An adjusted model and adjusted estimators It is difficult to adjust b^ S and b^ PT to be consistent by the existing nonparametric approaches because the models (1.1) and (1.2) are related to multi-dimensional variables and parameters; see Hjort and Glad (1995), Hjort and Jones (1996), Naito (2004), Glad (1998) and Lin et al. (2008a), among others. In this subsection we will introduce a new nonparametric method to adjust the sub-model (1.2) be partially unbiased and thus to adjust estimators b^ S and b^ PT to be partially consistent. When X and Z are uncorrelated and Z is designed to be centered, i.e., E(Z)= 0, the sub-model (1.2) is already unbiased. In this situation, the common LS estimator b^ S has favorable properties. Consequently, the bias-correction is not required since the sub-model works so well. In practice, however, the correlation among the covariates often exists particular for highdimensional case. In this situation, it becomes necessary to deal with the bias of the sub-model. So we will assume that X and Z are correlated in the following discussion and the assumption E(Z)= 0 is also required for the model identification discussed in the condition (a) below. Now we define a bias-corrected form for sub-model (1.2) as Yi ¼ buXi þgðTi Þ þf ðtuZi Þ þ xðtuZi Þ,

i ¼ 1, . . . ,n

ð2:6Þ

for any t 2 F ¼ ft 2 Rq : JtJ ¼ 1,QðtÞ 4 0g, where QðtÞ ¼ EfðXEðXjT, tuZÞÞðXEðXjT, tuZÞÞug, f ðtuZÞ ¼ EðZjtuZ), xðtuZÞ ¼ Zf ðtuZÞ, Z ¼ YbuXgðTÞ and J  J denotes the Euclidean norm. Here we use the constraint JtJ ¼ 1 for identifiability because f is unknown and only the orientation of t is identifiable. The error term xðtuZÞ is partially conditional unbiased for zero, i.e., it satisfies EðxðtuZÞjtuZ,X,TÞ ¼ 0

for all tuZ, t 2 F ,Z 2 Rq and ðXu,TÞ 2 A  Rp ,

ð2:7Þ

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

1891

where ( A¼

 u ) g1 g1 ðXu,TÞuðoÞ : kuE½ZjðXu,TÞu ¼ 0, k ¼ 0, g2  t2 , . . . , gq  tq ,

t1

t1

and t1 is supposed to be nonzero. The proof of details about (2.7) are put in the Appendix. Let Vu ¼ ððXu,TÞ,ZuÞÞ. Then if V is elliptically symmetric distributed with mean m and covariance matrix s2 S, the sample subspace A can be expressed as (  ) A¼

W : kuS21 S1 11 ðWm1 Þ ¼ 0, k ¼ 0, g2 

g1 g t , . . . , gq  1 tq t1 2 t1

u

,

where Wu ¼ ðXu,TÞ and m1 ¼ EðWÞ. Note that A guarantees that the components of W have an exactly linear relation. On the other hand, when W is elliptically symmetric distributed, EðWðiÞ jWðiÞ Þ ¼ kuWðiÞ , where k is a (p+ 1)-dimensional constant vector and W(  i) is a p-dimension random vector by eliminating the ith component of W, i.e., the components of W have a linear relation in the sense of average. The above discussion means that subspace A is large enough such that it is very close to full space and, consequently, the partial unbiasedness defined above is close to global unbiasedness. The new model (2.6) which contains two nonparametric terms f ðtuZÞ and g(T) is actually a so-called additive partially linear model (APLM) if we regard tuZ as one variable U and t is fixed. This point of view is reasonable because it is unnecessary to estimate t but only to find suitable empirical values of t to implement the estimation procedure. Further, we will prove that the consistency of new estimators defined below is free of the choice of t and their efficiency is insensitive to the choice of t. Also some elements of t can be designed to be zero for reducing the dimension of t. Therefore we can use the existing method to investigate the additive partially linear model (2.6). Additive partially linear models have been fully studied in the existing literature; for more details see for example Chen et al. (1996), Fan et al. (1998), Fan and Li (1996) and Li (2000). The modified model (2.6) has two main advantages. Firstly, it avoids the curse of dimensionality by replacing highdimensional variable Z and parameter g with a one-dimensional nonparametric function f ðtuZÞ, where t is unnecessary to be estimated, but could be replaced by its empirical value; secondly, the model works much better even than the full model (1.1) when the covariates X and Z are correlated for which the ordinary least square method does not work any more. As mentioned above, (2.6) is indeed a special APLM. We denote tuZi ¼ Ui and xi ¼ xðUi Þ, i ¼ 1, . . . ,n. Then (2.6) becomes Yi ¼ buXi þ gðTi Þ þ f ðUi Þ þ xi ,

ðð2:6ÞuÞ

which looks like a normalized APLM. Thus we can estimate b following the existing methods (see Fan and Li, 2003 for more details). From ((2.6)0 ) it follows that Yi h1 ðTi Þh2 ðUi Þ ¼ ðXi Z1 ðTi ÞZ2 ðUi ÞÞub þ xi ,

ð2:8Þ

where hðt,uÞ ¼ E½Yi jTi ¼ t,Ui ¼ u,

h1 ðtÞ ¼ E½hðt,Ui Þ,

h2 ðuÞ ¼ E½hðTi ,uÞ,

Zðt,uÞ ¼ E½Xi jTi ¼ t,Ui ¼ u, Z1 ðtÞ ¼ E½Zðt,Ui Þ, Z2 ðuÞ ¼ E½ZðTi ,uÞÞ: Let Y i ¼ Yi h1 ðTi Þh2 ðUi Þ and X i ¼ Xi Z1 ðTi ÞZ2 ðUi Þ: Then (2.8) can be rewritten as Y i ¼ X i ub þ xi :

ð2:9Þ

In the above equation, hi ðÞ and Zi ðÞ are unknown. To implement the estimation procedure, they should be replaced by their consistent estimators. For example, they can be estimated by local average as h^ 1 ðTi Þ ¼

n X l¼1

W1i,l Yl ,

h^ 2 ðUi Þ ¼

n X

W2i,l Yl ,

l¼1

P where Wji,l ,j ¼ 1,2,i ¼ 1, . . . ,n are weight functions satisfying nl¼ 1 Wji,l ¼ 1, j= 1,2. Similarly, the consistent estimators of Z1 ðTi Þ and Z2 ðUi Þ can be constructed by the same way. Also the new sub-model (2.9) can be achieved by replacing the variables (Xi ,Yi ) and errors Zi in the sub-model (1.2) with centered versions ðY i ,X i Þ and xi , respectively. The new sub-model becomes a simple linear model and we could regard (2.9) as an adjustment of (1.2). Applying the ordinary least squares (OLS) to (2.9) leads to an adjusted estimator of b as n 1X XY, ð2:10Þ ni¼1 i i P where Sn ¼ ð1=nÞ ni¼ 1 X i X i u. Note that this estimation procedure depends mainly on the data in the sub-model (1.2) and is independent of high-dimensional vector g, so it can be regarded as an RM estimator or an adjustment RM estimator. As a special additive partially linear model, the model (2.6) is supposed to satisfy the following conditions:

b~ A ¼ S1 n

(a) ðXu,T,UÞ has finite support with the support of (T,U) being a rectangle ½c1 ,d1   ½c2 ,d2 . The density function of (T,U), 4 denoted by p(t,u), is bounded by a positive constant on its support. Eðx Þ is finite, Ef ðtuZÞ ¼ 0 and Eg(T) =0.

1892

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

(b) hj ðÞ, Zj ðÞ,gðÞ,f ðÞ and pð,Þ all belong to G4v , which is a class of functions defined in Fan and Li (2003), where v Z2 is an 2 integer. Further, s2 ðtÞ ¼ Eðx jX,T,UÞ 2 G41 . (c) The kernel functions used in weights WðÞ are all of order v. 2v (d) As n-1,n3=2 h2 -1 and nh -0. For the convenience of the representation, we also need to make some notations:

e1 ¼ ZðT1 , tuZ1 ÞZ1 ðT1 ÞZ2 ðtuZ1 Þ, V1 ¼ X1 EðX1 jT1 , tuZ1 Þ, D1 ¼ V1 þ ðe1 Eðe1 ÞÞð1c1i c2i Þ, where c1i ¼ c2i ¼ p1 ðT1 Þp2 ðtuZ1 Þ=pðT1 , tuZ1 Þ with p1 ðÞ,p2 ðÞ and pð,Þ being the density functions of T, tuZ and ðT, tuZÞ, respectively. Moreover, denote

OðtÞ ¼ E½x21 D1 D1 u, FðtÞ9Var½X1 Z1 ðT1 ÞZ2 ðtuZ1 Þ, BðtÞ ¼ F1 ðtÞOðtÞF1 ðtÞ: Then we have the following theorem. Theorem 2.1. For the model (1.1), under conditions (a)–(d), for ðXu,TÞ 2 A, we have provided that FðtÞ is positive definite.

pffiffiffi ^ nðb A bÞ-d Nð0,BðtÞÞ in distribution,

As aforementioned, the subspace A is large enough. Then the partial consistency obtained by the theorem is close to the global consistency. Let tuZ ¼ U, then (2.6) becomes the form of ((2.6)0 ), which is an additive partially linear model. Then the result of the theorem follows directly from the existing results, for example Theorem 2.1 of Fan and Li (2003). Thus we omit the details of the proof. 2

Remark 1. This theorem promises that x’s are heteroscedastic, that is the condition s2 ðtÞ ¼ Eðx jX,T,UÞ in (b) can be replaced 2 by Eðx jX,T,UÞ ¼ s2i ðt,X,T,UÞ. But, for the convenience of representation, we only use the homogeneity condition given in (b). Remark 2. The condition that FðtÞ is positive definite is an identification condition for b. In fact this condition is weaker than the condition QðtÞ 40. This is because when X is a deterministic function of (T,U) but not an additive form, QðtÞ ¼ 0 but F may be still positive definite. pffiffiffi This theorem shows that the adjusted estimator b~ A is partially n- consistent and the consistency of the adjusted estimator is free of g and t, i.e., the consistency does not depend on variable selection and the choice of direction t even if a large number of covariates or some significant variables are deleted via variable selection. Now we define an adjusted PT estimator via the adjusted estimator b~ A . In addition to the conditions of the model (1.1), we also assume ei are normally distributed. The adjusted PT estimator for b in the sub-model (1.2) is defined as 8 < b~ A if Ln o w2q ðaÞ, ~ b APT ¼ ^ ð2:11Þ : b F if Ln Z w2q ðaÞ, where Ln is defined as in (2.5). Since our main concern is the performance of the estimator b~ APT when g is close to the null vector, then we now consider a sequence of local alternatives fKn g given by fKn g : g ¼ gn ¼ n1=2 o,

o 2 Rq :

Let Gm ðx; m, LÞ denotes an m-variate multi-normal d.f. with mean vector m and dispersion matrix L, Hr ðt; dÞðt 4 0Þ denotes a noncentral chi-square d.f. with r degrees of freedom and noncentral parameter d Z0. We denote EðoÞ ¼ fz : s2 ðz þ oÞuB22:1 ðz þ oÞ Z w2q ðaÞg, B¼



!

B11

B12

B21

B22

D11

D12

D21

D22

¼

EðXXuÞ

EðXZuÞ

EðZXuÞ

EðZZuÞ

! ,

! ¼ B1 ,

1 2 Dii:j ¼ Dii Dij D1 : We have the following theorem. jj Dji ,i,j ¼ 1,2, B22:1 ¼ B22 B21 B11 B12 , and D ¼ ðouB22:1 oÞs

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

1893

Theorem 2.2. Under the model (1.1), if the conditions of Theorem 2.1 hold and e is normally distributed, then for ðXu,TÞ 2 A, Z pffiffiffi 2 1 2 lim Pf nðb~ APT bÞ r xjkn g ¼ Gp ðx; 0,BðtÞÞHq ðw2q ðaÞ; DÞ þ Gp ðxD1 12 D22 z; 0, s D11:2 ÞdGq ðz; 0, s D22 Þ: n-1

EðoÞ

Remark 3. According to the above theorem, we can compare the adjusted PT estimator b~ APT with the original PT estimator b^ PT defined in (2.4). As is shown by Ahmed et al. (2007), b^ PT satisfies Z pffiffiffi 2 1 2 2 1 2 lim Pf nðb^ PT bÞ rxjKn g ¼ Gp ðxB1 B o ; 0, s B ÞH ð w ð a Þ; D Þ þ Gp ðxD1 q 12 11 11 q 12 D22 z; 0, s D11:2 ÞdGq ðz; 0, s D22 Þ: n-1

EðoÞ

Comparing the above formula with Theorem 2.2, we can see that the main difference of the two estimators is that they have different asymptotic biases. More precisely, the asymptotic bias of b^ PT depends on o while the adjusted estimator b~ APT does not. For b^ PT , when the value JoJ is large, the corresponding asymptotic bias will become large too, although the probability is small. So the new estimator b~ APT that is free of o can be regarded as an improvement of b^ PT in the sense that it is consistency pffiffiffi n- consistent and robustness to variable selection. Furthermore, we notice that if D12 D1 22 ¼ 0, then the new estimator is regardless of the size of JoJ. 3. Empirical likelihood confidence regions Now we construct confidence region for the parameter vector b in the sub-model (1.2) via empirical likelihood proposed by Owen (1988, 1990). Although we can obtain it by Theorem 2.1, the constructed confidence is not applicable because it is related to estimating the asymptotic covariance which is not estimable only by the data from the sub-model (1.2). Moreover, empirical likelihood has been proved to be of several advantages in small sample situations over normal approximation. From the adjusted additive partially linear model (2.6), we obtain the estimating function as Zni ¼ X i ðY i buX i Þ. Note that EðZni ðbÞjXj ,Zj ,Tj ,j Z 1Þ ¼ op ð1Þ, which could be considered as an asymptotic unbiased estimating equation. Then we define logempirical likelihood ratio as ( ) n n n X X X LðbÞ ¼ 2 sup logðnpi Þ : pi Z 0, pi ¼ 1, pi Zni ðbÞ ¼ 0 : i¼1

i¼1

i¼1

By using Lagrange multipliers and simple calculations, the empirical likelihood ratio can be written as LðbÞ ¼ 2

n X

logf1 þ luðbÞZni ðbÞg,

i¼1

12

12

12

10

10

10

8

8

8

6

6

6

4

4

4

2

2

2

0

0

0

−2

−2

−2

−4

−4

−4

−6

−6

−6

0

5

0

5

0

5

Fig. 1. The estimators of b with repetition time m =100. From left to right they are b^ S , b^ F and b~ A , respectively, and the unfilled dots are the true values of b.

1894

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

where lðbÞ satisfies n 1X Zni ðbÞ ¼ 0: n i ¼ 1 1þ luðbÞZni ðbÞ

The following theorem provides the asymptotic distribution of empirical likelihood ratio. Theorem 3.1. If the conditions of Theorem 2.1 hold, then under the sub-model (1.2) and for ðXu,TÞ 2 A, LðbÞ-d w2 ðpÞ: By Theorem 3.1, an approximate confidence for b can be expressed as fb : LðbÞ r w2a ðpÞg:

ð3:1Þ

This confidence region is always available regardless of the variable selection. Although the score function defined above depends on t, the construction of confidence region is free of the choice of t when we remember that the estimating equation constructed for the empirical likelihood confidence region is unnecessarily chosen to be optimal, which is shown by Kitamura et al. (2004) in a relevant work. Further, the accuracy of the confidence region is insensitive to the choice of t, though it is dependent on t. It is worth mentioning that due to the bias-correction of the new model (2.6), the empirical likelihood confidence region could work much better than the sub-model (1.2) does. Table 1 MSE of parameter estimators and MASE of estimator for g(t) when the covariates are uncorrelated with n =100 and t ¼ ðp1ffiffiffiffi , . . . , p1ffiffiffiffi Þu. 20 20 Parameters

b1 b2 b3 b4 b5

True values

1 2 3.5 0.5 4

MSE

MASE

b^ S

b^ F

b~ A

g^ S

g^ F

g~ A

0.0050 0.0036 0.0050 0.0047 0.0041

0.6448ne 3 0.7431ne 3 0.7015ne  3 0.9916 ne 3 0.5768 ne 3

0.0040 0.0030 0.0042 0.0034 0.0031

0.1047

0.0233

0.0237

12

12

12

10

10

10

8

8

8

6

6

6

4

4

4

2

2

2

0

0

0

−2

−2

−2

−4

−4

−4

−6

−6

−6

0

5

0

5

0

5

Fig. 2. The estimators of b with repetition time m= 100. From left to right they are b^ S , b^ F and b~ A , respectively, and the unfilled dots are the true values of b.

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

1895

4. Simulation results In this section we investigate the behavior of the new inference method by simulations. We use g~ A , g^ S and g^ F to denote the estimators of g(t) under adjusted model, sub-model and full model, respectively. In the simulations, the sample size is chosen to be between 50 and 400 and the dimension q of g is chosen to be 20 such that p/n is relatively large. We compare the mean square errors (MSE) of the parametric estimators and the mean average square errors (MASE) of the nonparametric estimators. We will see that all simulation results are expected in our theoretical conclusions. In the multi-dimensional partially linear model (1.1), b and g are, respectively, chosen as five dimensional and 20 dimensional vectors:

b ¼ ð1,2,3:5,0:5,4Þu, g ¼ ðga ; gb Þ, where

ga ¼ ð0:05,0:2,0:06,0:09,0:06,0:04,0:1,0:02,0:09,0:07Þu, gb ¼ ð0:25,0:08,0:25,0:04,0:15,0:26,0:12,0:03,0:05,0:2Þu 2 and nonparametric function gðtÞ ¼ t þ sinðtÞ. The errorpterm ffiffiffiffiffiffi e is assumed to be normally distributed as e  Nð0,0:2 Þ. The kernel function is chosen as Gaussian kernel KðuÞ ¼ ð1= 2pÞexpfðu2 =2Þg. The covariates X, Z and T will be considered to be independent or correlated, respectively. On the other hand, we will use different empirical choices for t to show that the adjusted estimator is robust to t: The choice of bandwidth is decided by CV methodology. Since the difference of simulation results obtained by different choices of repetition times is not very significant, we here only report the results based on the repetition time 50.

Table 2 MSE of parameter estimators and MASE of estimator for g(t) when the covariates are correlated, sample size n= 100 and t ¼ ðp1ffiffiffiffi , . . . , p1ffiffiffiffi Þu. 20 20 Parameters

b1 b2 b3 b4 b5

True values

1 2 3.5 0.5 4

MSE

MASE

b^ S

b^ F

b~ A

g^ S

g^ F

g~ A

0.0060 0.0391 0.0058 0.0122 0.0055

0.0302 0.0230 0.0316 0.0241 0.0331

0.0028 0.0276 0.0084 0.0180 0.0026

0.2115

0.6210

0.0251

12

12

12

10

10

10

8

8

8

6

6

6

4

4

4

2

2

2

0

0

0

−2

−2

−2

−4

−4

−4

−6

−6

−6

0

5

0

5

0

5

Fig. 3. The estimators of b with repetition time m =50, from left to right they are b^ S , b^ F , b~ A , respectively, and the unfilled dots are the true values of b.

1896

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

1. Consider the case when X, T and Z are independent and the components of t are equal. We assume X  N5 ð2,IÞ, T  Nð0,1Þ, and Z  N20 ð0,IÞ. The sample size n = 100. Fig. 1 and Table 1 report the simulation results of estimators of b. They show b^ F works better than b^ and b^ , and b^ is the worst one among the three estimators. It is not surprising for such a result since the A

S

S

covariates are independent and n is much larger than p, the full model will work well. On the other hand, although the adjusted model does not work better than the full model under independent situation, it works better than the sub-model, since it reduce the bias of the nonparametric part. However, when the dimension of covariates is high, the independency among covariates could hardly be justified. So it is necessary to consider the case when covariates are correlated. 2. Consider the case when X, T and Z are correlated. For construction simplicity, we just assume the covariance matrix of W= (X;T) and Z is 0 1 0:99 0   0 B 0 0:99       0 C C SW,Z ¼ B B C: @ ^ ^ 0A 0



0:99



0

Table 3 MSE of parameter estimators and MASE of estimator for g(t) when the covariates are correlated with (a) n= 50 and t ¼ ðp1ffiffiffiffi , . . . , p1ffiffiffiffi Þu; (b) n= 200 and 20 20

t ¼ ðp1ffiffiffiffi , . . . , p1ffiffiffiffi Þu. 20 20 Parameters

True values

MSE

MASE

b^ S

b^ F

b~ A

g^ S

g^ F

g~ A

1 2 3.5 0.5 4

0.0106 0.0511 0.0144 0.0089 0.0177

0.0977 0.1070 0.1090 0.1192 0.1212

0.0061 0.0314 0.0202 0.0188 0.0089

0.4452

2.6759

0.0376

1 2 3.5 0.5 4

0.0052 0.0407 0.0042 0.0102 0.0048

0.0141 0.0106 0.0097 0.0127 0.0155

0.0026 0.0295 0.0084 0.0160 0.0019

0.1339

0.2687

0.0172

(a)

b1 b2 b3 b4 b5 (b)

b1 b2 b3 b4 b5

12

12

12

10

10

10

8

8

8

6

6

6

4

4

4

2

2

2

0

0

0

−2

−2

−2

−4

−4

−4

−6

−6

−6

0

5

0

5

0

5

Fig. 4. The estimators of b with repetition time m= 200, from left to right they are b^ S , b^ F , b~ A , respectively, and the unfilled dots are the true values of b.

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

1897

Table 4 MSE of parameter estimators and MASE of estimator for g(t) when the covariates are correlated with (a) n= 50 and t ¼ ð0,    ,0, p1ffiffiffiffi ,    , p1ffiffiffiffi Þu; (b) n= 50 and 10 10

t11 ¼ t12 ¼ t14 ¼ t17 ¼ p1ffiffi4 and other elements are 0. Parameters

True values

MSE

MASE

b^ S

b^ F

b~ A

g^ S

g^ F

g~ A

1 2 3.5 0.5 4

0.0146 0.0469 0.0083 0.0182 0.0117

0.1148 0.1414 0.1005 0.1074 0.0975

0.0090 0.0347 0.0155 0.0220 0.0086

0.3872

1.9419

0.0432

1 2 3.5 0.5 4

0.0116 0.0408 0.0116 0.0133 0.0114

0.1154 0.1476 0.1392 0.1092 0.1213

0.0072 0.0282 0.0135 0.0168 0.0049

0.3191

2.5966

0.0325

(a)

b1 b2 b3 b4 b5 (b)

b1 b2 b3 b4 b5

The sample size n and t are the same as in case 1. Fig. 2 presents the simulation results for the estimators of b and Table 2 shows their MSEs. From both Fig. 2 and Table 2, we could find that the full model works worst, which is opposite to the independent case. The adjusted model works better, especially for the nonparametric estimator. So the advantage of our new model appears when the dimension of the covariates is high and the correlation of covariates could not be negligible. To see it deeply, we will change sample size to obtain different ratio p/n below. 3. Following case 2, we now take n=50 and 200 for further study. The simulation results are presented in Fig. 3, Table 3a and Fig. 4 and Table 3b, respectively. Combing these with the results of case 2, we have the finding that when correlation exists, our new model works better than the other ones, and the smaller the n is, the better the new model works; while the full model works oppositely. In short, when the ratio p/n is larger and the correlation exists, our new model is the best one among the three models. 4. To show the robustness of the adjusted estimator for t. We now consider different choices of t with fixed sample size n =50. From Tables 4a, b and 3a, we can see that although the direction t changes significantly, the estimator for the adjusted model is stable and thus is not sensitive to the choice of t: Appendix A Proof of Conditional Unbias of nðtuZÞ. Let Vu ¼ ððXuTÞZuÞ and LðVÞ denote the distribution of V. It is assumed that LðVÞ 2 S 2 E SÞ, where En ðms2 SÞ denotes the class of elliptically symmetric distributions with mean m ˜ n ðm SÞ and ˜E n ðm SÞ ¼ s2 4 0 En ðms S12 pþ1 11 and covariance matrix s2 S, m ¼ ðmm1 Þ S ¼ ðS m2 2 Rq , S11 is a (p +1)  (p +1) matrix and S22 is a q  q matrix. In S21 S22 Þ, m1 2 R 2 fact here m2 ¼ 0 because of the centered Z. We want to find the conditions to make sure E½xðtuZÞjtuZ,X,T ¼ 0, equivalently, to prove the below equation holds: E½guZjtuZ ¼ t ¼ E½guZjtuZ ¼ t,X ¼ x,T ¼ t:

ðA:1Þ

Note that E½xðtuZÞjtuZ,X,T ¼ E½ZE½ZjtuZjtuZ,X,T ¼ E½guZ þ ejtuZ,X,TE½guZ þ ejtuZ ¼ E½guZjtuZ,X,TE½guZjtuZ: The left-hand side of Eq. (A.1) is  2 0 13  # " P  q q q  X X  1 t qi¼ 2 ti Zi X gj Zj ÞZ1 @t tj Zj A5 ¼ E g1 þ gi Zi t E½guZjtuZ ¼ t ¼ E4  t1  t1 2 j¼1 j¼2  " #   q  X g g g ¼ 1 tþE gi  1 ti Zi t ¼ 1 t,  t1 t1 t1

ðA:2Þ

i¼2

and the right-hand side of (A.1) is " E½guZjtuZ ¼ t,ðXu,TÞ ¼ ðxu,tÞ ¼ E g1 Since

t

E½ZjðXu,TÞu ¼ ðxu,tÞu ¼ S21 S1 11 ðw 1 Þ,

m

E½kuZjðXu,TÞu ¼ ðxu,tÞu ¼ k

Pq

i¼2

ti Zi

t1 Pq

þ

q X 2

g

 # " #   q   X g1 g1   gi Zi ðXu,TÞ ¼ ðxu,tÞ ¼ t þ E gi  ti Zi ðXu,TÞ ¼ ðxu,tÞ :   t1 t1

g t t

i ¼ 2 ð i ð 1 = 1 Þ i Þ ¼

uS21 S1 11 ðw 1 Þ,

m

i¼2

kuZ and

1898

Y. Gai et al. / Journal of Statistical Planning and Inference 141 (2011) 1888–1898

we have E½guZjtuZ ¼ t,ðXu,TÞ ¼ ðxu,tÞ ¼

g1 t þ kuS21 S1 11 ðwm1 Þ: t1

ðA:3Þ

1 Put (A.2) and (A.3) into (A.1), it follows ðg1 =t1 Þt ¼ ðg1 =t1 Þt þ kuðS21 S1 11 ðwm1 ÞÞ, namely, kuS21 S11 ðwm1 Þ ¼ 0. So when ðXu,TÞ is pþ1 , Eq. (2.7) holds, which implies the unbiasedness of error term. & constrained in the set A  R

Proof of Theorem 2.2. Following directly from Ahmed et al. (2007), we have pffiffiffi pffiffiffi pffiffiffi lim Pf nðb^ PT bÞ r xjKn g ¼ lim Pf nðb^ S bÞ r xjKn ,Ln o w2q ðaÞg þ limn-1Pf nðb^ F bÞ r xjKn ,Ln Z w2q ðaÞg n-1 n-1 Z 2 1 2 2 1 2 Gp ðxD1 ¼ Gp ðxB1 11 B12 o; 0, s ðtÞB11 ÞHq ðwq ðaÞ; DÞ þ 12 D22 z; 0, s ðtÞD11:2 ÞdGq ðz; 0, s ðtÞD22 Þ: EðoÞ

Then we only need to consider the case when Ln o w2q ðaÞ. Furthermore, Theorem 2.1 leads to pffiffiffi Un ¼ GðtÞ nðb~ A bÞ-d Nð0, s2 B1 11 Þ, 1=2

where GðtÞ ¼ sB11 B1=2 ðtÞ. On the other hand, we can easily prove that pffiffiffi 2 1 Vn ¼ nðb^ S bÞB1 11 B12 o-d Nð0, s B11 Þ: The above results show that Un and Vn are identically distributed. By combining all results above, we have pffiffiffi lim Pf nðb~ A bÞ rxjKn ,Ln o w2q ðaÞg ¼ lim PfUn r GðtÞxjKn ,Ln o w2q ðaÞg ¼ lim PfVn r GðtÞxjKn ,Ln o w2q ðaÞg n-1 n-1 n-1 pffiffiffi 2 ¼ lim Pf nðb^ S bÞ r B1 11 B12 o þ GðtÞxjKn ,Ln o wq ðaÞg n-1

2 2 ¼ Gp ðGðtÞx; 0, s2 B1 11 ÞHp ðwq ða; DÞÞ ¼ Gp ðx; 0,BðtÞÞHp ðwq ða; DÞÞ

as required.

&

References Ahmed, S.E., Doksum, K.A., Hossain, S., You, J., 2007. Shrinkage, pretest and absolute penalty estimators in partially linear models. Aust. N. Z. J. Statist. 49, 435–454. Chen, H., 1988. Convergence rates for parametric components in a partly linear model. Ann. Statist. 16, 136–141. Chen, R., Hardle, W., Linton, O., Severance-lossin, E., 1996. Estimation and variable selection in additive nonparametric regression models. In: Harlde, W., Schimek, M. (Eds.), Statistical Theory and Computational Aspects of Smoothing. Physika, Heidelberg. Engle, R.F., Granger, C.W.J., Rice, J., Weiss, A., 1986. Semiparametric estimates of the relation between weather and electricity sales. J. Amer. Statist. Assoc. 81, 310–320. Fan, J., Huang, T., 2005. Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli 11, 1031–1057. Fan, Y., Li, Q., 2003. A kernel-based method for estimating additive partially linear models. Statist. Sinica 13, 739–762. Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360. Fan, Y., Li, Q., 1996. On estimating additive partially linear models. Mimeo, University of Guelph. Fan, J., Hardle, W., Mammen, E., 1998. Direct estimation of low dimensional components in additive models. Ann. Statist. 26, 943–971. Glad, I.K., 1998. Parametrically guided no-parametric regression. Scand. J. Statist. 25, 649–668. ¨ Hardle, W., Liang, H., Gao, T., 2000. Partially Linear Models. Physica Verlag. Heckman, N.E., 1986. Spline smoothing in a partly linear model. J. Roy. Statist. Soc. B 48, 244–248. Hjort, N.L., Glad, I.K., 1995. Nonparametric density estimation with a parametric start. Ann. Statist. 23, 882–904. Hjort, N.L., Jones, M.C., 1996. Locally parametric nonparametric density estimation. Ann. Statist. 24, 1619–1647. Kitamura, Y., Tripathi, G., Ahn, H., 2004. Empirical likelihood-based inference in conditional moment restriction models. Econometrica 72, 1667–1714. Li, R., Liang, H., 2008. Variable selection in semiparametric regression modeling. Ann. Statist. 36, 261–286. Li, Q., 2000. Efficient estimation of additive partially linear models. Internat. Econom. Rev. 41, 1073–1092. Lin, L., Cui, X., Zhu, L.X., 2008a. An adaptive two-stage estimation method for additive models. Scand. J. Statist. 36, 248–269. Lin, L., Zeng, Y.H., Zhu, L.X., 2008. Consistent estimation and valid confidence region for biased sub-model of high-dimensional linear regression model. Manuscript. Lin, L., Zhu, L.X., Gai, Y.J., 2010. A semiparametric re-modeling and inference for non-sparse ‘‘large p, small n’’ models. (manuscript). Naito, K., 2004. Semiparametric density estimation by local L2 fitting. Ann. Statist. 32, 1162–1191. Owen, A., 1988. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237–249. Owen, A., 1990. Empirical likelihood ratio confidence intervals regions. Ann. Statist. 18, 90–120. Saleh, A.K.Md.E., 2006. Theory of Preliminary Test and Stein-type Estimation with Applications. John Wiley & Sons Inc.. Sen, P.K., 1979. Asymptotic properties of maximum likelihood estimators based on conditional specification. Ann. Statist. 5, 1019–1033. Sen, P.K., Saleh, A.K.M.E., 1987. On preliminary test and shrinkage M-estimation in linear models. Ann. Statist. 15, 1580–1592. Speckmen, P., 1988. Kernel smoothing in partial linear models. J. Roy. Statist. Soc. B, 413–436.