Partially linear models and polynomial spline approximations for the analysis of unbalanced panel data

Journal of Statistical Planning and Inference 139 (2009) 679–695. doi:10.1016/j.jspi.2007.04.037


Jinhong You (a), Xian Zhou (b,*)

(a) Department of Statistics, Shanghai University of Finance and Economics, Shanghai, China
(b) Department of Actuarial Studies, Macquarie University, Sydney, NSW, Australia

ARTICLE INFO

Article history: Received 8 May 2006; received in revised form 15 April 2007; accepted 21 April 2007; available online 25 June 2008.

MSC: primary 62H12; secondary 62G08, 62J05.

Keywords: Partially linear; Panel data; Error components; Polynomial spline; Weighted estimation; Asymptotic normality; Model selection.

ABSTRACT

In this paper, we study the estimation of unbalanced panel data partially linear models with a one-way error components structure. A weighted semiparametric least squares estimator (WSLSE) is developed using polynomial spline approximation and least squares. We show that the WSLSE is asymptotically more efficient than the corresponding unweighted estimator for both the parametric and nonparametric components of the model. This is a significant improvement over previous results in the literature, which showed that the simple weighting technique can only improve the estimation of the parametric component. The asymptotic normality of the proposed WSLSE is also established. Another focus of this paper is to provide a variable selection procedure for the significant covariates in the parametric part, based on a combination of nonconcave penalization and weighted semiparametric least squares. The proposed procedure simultaneously selects significant covariates and estimates the unknown parameters. With a proper choice of regularization parameters and penalty function, the resulting estimator is shown to possess an oracle property. Simulation studies and an application to a set of hormone data are used to demonstrate the proposed procedure.

1. Introduction

Panel data arise in many important areas, including medical studies, sociology and economics. The term "panel data" refers to pooled observations on a cross-section of households, countries, firms, patients, etc., over several time periods. Compared with time series and cross-section data, panel data have a number of advantages, such as control of individual heterogeneity and lower risk of biased results. They are more informative and give greater flexibility, less collinearity among the variables, more degrees of freedom and more efficiency. More details about the advantages of panel data can be found in Hsiao (1985, 1986) and Baltagi (1995).

Until recently, much of panel data modeling had been confined to linear regressions. In practice, however, there often exist nonlinear features that are beyond the capacity of linear regression models. Early development of nonlinear panel data regression analysis focused on parametric nonlinear forms; these results are collected in Ahn and Schmidt (2000). On the other hand, recent development in nonparametric regression techniques has provided an alternative way to model nonlinear panel data (cf. Ruckstuhl et al., 2000; Lin et al., 2004). Its immediate advantage is to avoid the need to assume prior information on the model structure. Moreover, it may provide useful insight for further parametric fitting. But nonparametric methods have their own shortcomings as

This work was partially supported by Research Grant No. G-YE73 of the Hong Kong Polytechnic University. Corresponding author. E-mail address: [email protected] (X. Zhou).


well, such as the curse of dimensionality, difficulty of interpretation, and lack of extrapolation capability. As a result, semiparametric regression, which is a compromise between a general nonparametric specification and a fully parametric model, has become an attractive alternative. For an overview of nonparametric and semiparametric techniques for longitudinal data, see Fan and Li (2006) and the references therein.

One semiparametric regression model of importance is the semiparametric partially linear regression model of the form

$$Y_{ij} = X_{ij}^{\top}\beta + g(U_{ij}) + \varepsilon_{ij}, \quad i = 1, \ldots, k, \; j = 1, \ldots, m_i, \tag{1.1}$$

where $Y_{i1}, \ldots, Y_{im_i}$ are repeated response measurements from individual $i$, $X_{ij} = (X_{ij1}, \ldots, X_{ijp})^{\top}$ and $U_{ij}$ are referred to as the design points, $\beta$ is an unknown $p$-dimensional vector of regression coefficients for the parametric component, $g(\cdot)$ is an unknown function modeling the nonlinear and nonparametric component, and the $\varepsilon_{ij}$'s are random errors. The random errors $\varepsilon_{ij}$ are assumed to have a one-way error components structure:

$$\varepsilon_{ij} = \mu_i + \nu_{ij}, \quad i = 1, \ldots, k, \; j = 1, \ldots, m_i, \tag{1.2}$$

where the $\mu_i$ and $\nu_{ij}$ are i.i.d. random variables with mean zero and variances $\sigma_\mu^2$ and $\sigma_\nu^2$, respectively. With the one-way error structure (1.2), the observations $Y_{i1}, \ldots, Y_{im_i}$ from the same individual $i$ share a common variable $\mu_i$ and thus are allowed to be dependent.

Model (1.1) with error structure (1.2) has attracted considerable interest from researchers and become increasingly popular due to its wide range of applications. For example, Zeger and Diggle (1994) applied this model to study CD4 cell numbers in HIV seroconverters, Roy (1997) used it to study the calorie–income relationship for two years of panel data from rural south India, and Mundra (1997) advocated this model to study the cross-country effects of migration to the US on US exports. Other applications can be found in Honore (1992) and Kniesner and Li (1994).

For model (1.1) with error structure (1.2) and $m_1 = \cdots = m_k$, Li and Ullah (1998) constructed a feasible semiparametric generalized least squares estimator (SGLSE) of $\beta$ by applying kernel smoothing and least squares. They showed that the SGLSE is asymptotically more efficient than the unweighted one, but did not provide any results on the nonparametric component. Lin and Ying (2001) proposed a new estimation method for model (1.1) under a point-process formulation, which has the advantage of simplicity as it does not involve any smoothing parameter. Their approach, however, neglects the error structure and thus results in inefficient estimators. Carroll and Lin (2005) and Lin and Carroll (2006) proposed an efficient kernel estimation for model (1.1) based on the seemingly unrelated kernel weighting approach. They showed that the resulting estimator of the parametric component achieves the semiparametric efficiency bound, and that the estimator of the nonparametric component has smaller asymptotic variance than the traditional unweighted kernel estimator. The calculation of their estimator, however, is rather complicated because it involves iteration. Chen and Jin (2006) proposed a weighted series estimator for the parametric component of model (1.1), but for the nonparametric component they still applied the kernel weighting approach.

In this paper, by using polynomial spline approximation and least squares, a weighted semiparametric least squares estimator (WSLSE) is developed for both the parametric and nonparametric components of model (1.1). We show that the resulting estimators of both components are asymptotically more efficient than the corresponding unweighted estimators. The consistency and asymptotic normality of the proposed estimators are also established.

Another focus of this paper is to establish a class of procedures to select significant covariates for the parametric part of model (1.1) with error structure (1.2). It is essential in practice to include only important variables in the model, to enhance predictability and to give a parsimonious description of the relationship between the response and the covariates. Recently, Fan and Li (2004) proposed a variable selection method via nonconcave penalized likelihood for model (1.1). Their method deletes insignificant variables by estimating their coefficients as 0, and simultaneously selects significant variables and estimates regression coefficients, which differs from traditional methods. In addition, they demonstrated that with a proper choice of regularization parameters and penalty functions (such as SCAD), the penalized likelihood estimator possesses an oracle property.
Under this oracle property, the truly zero regression coefficients are automatically estimated as zero, and the remaining coefficients are estimated as efficiently as if the correct submodel were known in advance. Their method, however, neglected the within-group correlation of panel data, which results in a loss of efficiency when the error structure can be estimated. We will extend their method by taking the error structure into account.

The remainder of this paper is organized as follows. Some preliminary estimators are presented in Section 2, including the unweighted semiparametric least squares estimators and the estimators of the one-way error components structure. Section 2 also discusses the selection of the smoothing parameters. In Section 3 we propose the WSLSE for the parametric and nonparametric components by taking the one-way error components structure into account. Section 4 studies the variable selection procedure. Simulation studies are reported in Section 5 and an application to a real data set is presented in Section 6. Final remarks are given in Section 7. Proofs of the main results are relegated to Appendix A.

2. Preliminary estimations

Throughout this paper we consider $k$ as increasing with the sample size $n = \sum_{i=1}^{k} m_i$, where the $m_i$ are bounded, as is common in panel data situations. We also assume that the design points $X_{ij}$ are random, while the $U_{ij}$ are fixed, as in Speckman (1988) and Hong (2002). Extending our results to the case of both $X_{ij}$ and $U_{ij}$ being fixed or random is conceptually straightforward. Moreover,


as is common in partially linear regression models, suppose that $\{X_{ij}\}$ and $\{U_{ij}\}$ are related via

$$X_{ijs} = h_s(U_{ij}) + \eta_{ijs}, \quad i = 1, \ldots, k, \; j = 1, \ldots, m_i, \; s = 1, \ldots, p,$$

where the $h_s(\cdot)$'s are twice continuously differentiable functions and the $\eta_{ijs}$'s are random variables with mean zero. The sequence of design points $\{U_{ij}, i = 1, \ldots, k, j = 1, \ldots, m_i\}$ has a bounded support $\mathcal{U}$ and is generated by a "design density" $f(u)$ with $0 < \inf_{\mathcal{U}} f(\cdot) \le \sup_{\mathcal{U}} f(\cdot) < \infty$.

2.1. Semiparametric least squares estimators

In this subsection we ignore the error structure in constructing the estimators, by combining polynomial spline series approximation and least squares. Let $\mathcal{U}$ be an interval with end points $\tau_0 < \tau_{M+1}$. A polynomial spline of degree $d \ge 0$ on $\mathcal{U}$ with knot sequence $\tau_0 < \tau_1 < \cdots < \tau_{M+1}$ is a function that is a polynomial of degree $d$ on each of the intervals $[\tau_0, \tau_1), \ldots, [\tau_{M-1}, \tau_M), [\tau_M, \tau_{M+1}]$ and globally has $d - 1$ continuous derivatives for $d \ge 1$. A piecewise constant function, linear spline, quadratic spline and cubic spline corresponds to $d = 0, 1, 2, 3$, respectively. The collection of spline functions of a particular degree and knot sequence forms a linear space. The books by de Boor (1978) and Schumaker (1980) are good references for spline functions.

Suppose that $g(\cdot)$ can be approximated by a spline function, that is,

$$g(\cdot) \approx \sum_{l=1}^{L_n} B_l(\cdot)\,\gamma_l, \tag{2.1}$$

where $\{B_l(\cdot), l = 1, \ldots, L_n\}$ is a basis for a linear space $G$ of spline functions on $\mathcal{U}$ with a fixed degree and knot sequence, and $L_n$ denotes the dimension of $G$. Following (1.1) and (2.1), we have

$$Y_{ij} \approx X_{ij}^{\top}\beta + B^{\top}(U_{ij})\gamma + \varepsilon_{ij}, \quad i = 1, \ldots, k, \; j = 1, \ldots, m_i, \tag{2.2}$$

where $\gamma = (\gamma_1, \ldots, \gamma_{L_n})^{\top}$ and $B(\cdot) = (B_1(\cdot), \ldots, B_{L_n}(\cdot))^{\top}$. We can estimate $\beta$ and $\gamma$ based on (2.2) with any given $L_n$ by minimizing

$$S(\beta, \gamma) = \sum_{i=1}^{k}\sum_{j=1}^{m_i}\big(Y_{ij} - X_{ij}^{\top}\beta - B^{\top}(U_{ij})\gamma\big)^2.$$

$S(\beta, \gamma)$ has the unique minimizer $\hat\beta = (X^{\top}M_B X)^{-1}X^{\top}M_B Y$ and $\hat\gamma = (B^{\top}B)^{-1}B^{\top}(Y - X\hat\beta)$, where $M_B = I_n - P_B = I_n - B(B^{\top}B)^{-1}B^{\top}$, $X = (X_{11}, \ldots, X_{1m_1}, \ldots, X_{km_k})^{\top}$, $B = (B(U_{11}), \ldots, B(U_{1m_1}), \ldots, B(U_{km_k}))^{\top}$ and $Y = (Y_{11}, \ldots, Y_{1m_1}, \ldots, Y_{km_k})^{\top}$. The corresponding polynomial spline estimator of $g(u)$ is $\hat g(u) = B^{\top}(u)\hat\gamma$.

In order to present the asymptotic properties of $\hat\beta$ and $\hat g(u)$, we first introduce some technical assumptions.

Assumption 1. The functions $g(\cdot)$ and $h_s(\cdot)$, $s = 1, \ldots, p$, are twice continuously differentiable.

Assumption 2. $P_i = (P_{i1}, \ldots, P_{im_i})^{\top}$, $i = 1, \ldots, k$, are independent random matrices with mean zero and $E\|P_i\|^4 \le c_0 < \infty$, where $P_{ij} = (\eta_{ij1}, \ldots, \eta_{ijp})^{\top}$ for $i = 1, \ldots, k$, $j = 1, \ldots, m_i$, and $\|\cdot\|$ denotes the Euclidean norm. Further, the $P_i$ are independent of $e_i = (\varepsilon_{i1}, \ldots, \varepsilon_{im_i})^{\top}$.
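To make the construction in (2.2) concrete, here is a minimal Python sketch (ours, not from the paper) of the unweighted semiparametric least squares fit. It assumes $U$ has been rescaled to $[0, 1]$ and uses equally spaced interior knots as in Section 2.3; the function and argument names (`bspline_basis`, `slse`, `n_interior`) are hypothetical.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, n_interior, degree=3):
    """B-spline design matrix on [0, 1] with equally spaced interior knots.
    Dimension of the spline space is L_n = n_interior + degree + 1."""
    interior = np.linspace(0, 1, n_interior + 2)[1:-1]
    knots = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]
    L_n = n_interior + degree + 1
    # Evaluating a BSpline whose coefficient array is the identity matrix
    # returns all L_n basis functions at once, shape (len(u), L_n).
    return BSpline(knots, np.eye(L_n), degree)(u)

def slse(Y, X, U, n_interior=5, degree=3):
    """Unweighted semiparametric least squares estimator (Section 2.1).
    Y: (n,) responses, X: (n, p) parametric covariates, U: (n,) in [0, 1]."""
    B = bspline_basis(U, n_interior, degree)                 # n x L_n
    P_B = B @ np.linalg.solve(B.T @ B, B.T)                  # projection onto the spline space
    M_B = np.eye(len(Y)) - P_B
    beta = np.linalg.solve(X.T @ M_B @ X, X.T @ M_B @ Y)     # beta_hat
    gamma = np.linalg.solve(B.T @ B, B.T @ (Y - X @ beta))   # gamma_hat
    g_hat = lambda u: bspline_basis(u, n_interior, degree) @ gamma
    return beta, gamma, g_hat
```

The profile structure of the estimator is visible in the code: the parametric part is fit after projecting out the spline space, and the spline coefficients are then obtained from the residuals.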

The asymptotic normality of $\hat\beta$ is established in the following theorem.

Theorem 1. Suppose that Assumptions 1 and 2 hold, $L_n = o(k^{1/2})$ and $k^{1/2}L_n^{-4} = o(1)$. Then

$$\sqrt{n}\,(\hat\beta - \beta) \to_D N(0, R_1^{-1}R_2R_1^{-1}) \quad \text{as } k \to \infty,$$

where "$\to_D$" denotes convergence in distribution,

$$R_1 = \lim_{k\to\infty}\frac{1}{n}\sum_{i=1}^{k} E(P_i^{\top}P_i), \qquad R_2 = \lim_{k\to\infty}\frac{1}{n}\sum_{i=1}^{k} E\{P_i^{\top}(\sigma_\mu^2 1_{m_i}1_{m_i}^{\top} + \sigma_\nu^2 I_{m_i})P_i\},$$

and $1_{m_i}$ is an $m_i \times 1$ vector of 1's.

Let $\|a\|_{L_2}$ denote the $L_2$ norm of a square integrable function $a(u)$ on $\mathcal{U}$, and let

$$\rho_k = \mathrm{dist}(g, G) = \inf_{a \in G}\sup_{u \in \mathcal{U}}|g(u) - a(u)|$$

be the $L_\infty$ distance between $g(\cdot)$ and $G$.


The next theorem gives the convergence rate and asymptotic normality of $\hat g(\cdot)$.

Theorem 2. Under the assumptions of Theorem 1,

$$\|\hat g - g\|_{L_2}^2 = O_p(L_n k^{-1} + \rho_k^2) = O_p(L_n k^{-1} + L_n^{-4})$$

and

$$(\psi(u))^{-1/2}(\hat g(u) - g(u)) \to_D N(0, 1) \quad \text{as } k \to \infty,$$

where $\psi(u) = B^{\top}(u)(B^{\top}B)^{-1}B^{\top}\Omega B(B^{\top}B)^{-1}B(u)$ and $\Omega = \mathrm{blockdiag}(\sigma_\mu^2 1_{m_1}1_{m_1}^{\top} + \sigma_\nu^2 I_{m_1}, \ldots, \sigma_\mu^2 1_{m_k}1_{m_k}^{\top} + \sigma_\nu^2 I_{m_k})$.

2.2. Estimators of the error variances

The estimators $\hat\beta$ and $\hat g(\cdot)$ do not take the one-way error components structure into account, so they may not be asymptotically efficient. The correlations within a subject should not be ignored, and to exploit them we first fit the one-way error structure. Based on $\hat\beta$ and $\hat g(\cdot)$, the estimated residuals are

$$\hat\varepsilon_{ij} = Y_{ij} - X_{ij}^{\top}\hat\beta - \hat g(U_{ij}), \quad i = 1, \ldots, k, \; j = 1, \ldots, m_i.$$

Since $E(\varepsilon_{11}\varepsilon_{12}) = \sigma_\mu^2$ and $E(\varepsilon_{11}^2) = \sigma_\mu^2 + \sigma_\nu^2$, from these estimated residuals we can construct estimators of the error variances $\sigma_\mu^2$ and $\sigma_\nu^2$, respectively, by

$$\hat\sigma_\mu^2 = \frac{1}{\sum_{i=1}^{k} m_i(m_i-1)}\sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}\hat\varepsilon_{ij_1}\hat\varepsilon_{ij_2} \quad\text{and}\quad \hat\sigma_\nu^2 = \frac{1}{\sum_{i=1}^{k} m_i}\sum_{i=1}^{k}\sum_{j=1}^{m_i}\hat\varepsilon_{ij}^2 - \hat\sigma_\mu^2. \tag{2.3}$$
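As a quick illustration of (2.3), the following Python sketch (ours, not from the paper) turns the residuals of the unweighted fit into the two variance component estimates; `groups` is an assumed array of subject labels.

```python
import numpy as np

def error_variance_estimates(resid, groups):
    """Moment estimators (2.3) of (sigma_mu^2, sigma_nu^2) from residuals of
    the unweighted fit; groups[i] labels the subject of observation i."""
    num_mu, den_mu, num_nu, den_nu = 0.0, 0.0, 0.0, 0.0
    for g in np.unique(groups):
        r = resid[groups == g]
        m = len(r)
        # sum over pairs j1 != j2 of r[j1] * r[j2] equals (sum r)^2 - sum r^2
        num_mu += r.sum() ** 2 - (r ** 2).sum()
        den_mu += m * (m - 1)
        num_nu += (r ** 2).sum()
        den_nu += m
    sigma_mu2 = num_mu / den_mu
    sigma_nu2 = num_nu / den_nu - sigma_mu2
    return sigma_mu2, sigma_nu2
```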

The following theorem collects the asymptotic properties of $\hat\sigma_\mu^2$ and $\hat\sigma_\nu^2$.

Theorem 3. Suppose that Assumptions 1 and 2 hold. If $L_n = o(k^{1/2})$, $k^{1/2}L_n^{-4} = o(1)$, $E\mu_1^4 < \infty$ and $E\nu_{11}^4 < \infty$, then as $k \to \infty$,

$$\sqrt{k}\,(\hat\sigma_\mu^2 - \sigma_\mu^2) \to_D N\!\left(0,\ \mathrm{Var}(\mu_1^2) + \frac{4\sigma_\mu^2\sigma_\nu^2}{\lim_{k\to\infty} k^{-1}\sum_{i=1}^{k} m_i} + \frac{2\sigma_\nu^4}{\lim_{k\to\infty} k^{-1}\sum_{i=1}^{k} m_i(m_i-1)}\right) \tag{2.4}$$

and

$$\Big(\sum_{i=1}^{k} m_i\Big)^{1/2}(\hat\sigma_\nu^2 - \sigma_\nu^2) \to_D N\!\left(0,\ \mathrm{Var}(\nu_{11}^2) + \frac{2\sigma_\nu^4}{\lim_{k\to\infty} k^{-1}\sum_{i=1}^{k}(m_i-1)}\right).$$

Next, we provide consistent estimators of the asymptotic variances of $\hat\sigma_\mu^2$ and $\hat\sigma_\nu^2$. Define

$$\hat\sigma_{\mu,4} = \frac{1}{\sum_{i=1}^{k}(m_i-1)}\sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}\hat\varepsilon_{ij_1}^2\hat\varepsilon_{ij_2}^2 \quad\text{and}\quad \hat\sigma_{\nu,4} = \frac{1}{\sum_{i=1}^{k} m_i}\sum_{i=1}^{k}\sum_{j=1}^{m_i}\hat\varepsilon_{ij}^4 - \hat\sigma_{\mu,4}.$$

Consistent estimators of the asymptotic variances of $\hat\sigma_\mu^2$ and $\hat\sigma_\nu^2$ are provided in the following theorem.

Theorem 4. Under the assumptions of Theorem 3, as $k \to \infty$,

$$\hat\sigma_{\mu,4} - \hat\sigma_\mu^4 - 2\hat\sigma_\mu^2\hat\sigma_\nu^2 - \hat\sigma_\nu^4 + \frac{4\hat\sigma_\mu^2\hat\sigma_\nu^2}{k^{-1}\sum_{i=1}^{k} m_i} + \frac{2\hat\sigma_\nu^4}{k^{-1}\sum_{i=1}^{k} m_i(m_i-1)} \to_p \mathrm{Var}(\mu_1^2) + \frac{4\sigma_\mu^2\sigma_\nu^2}{\lim_{k\to\infty} k^{-1}\sum_{i=1}^{k} m_i} + \frac{2\sigma_\nu^4}{\lim_{k\to\infty} k^{-1}\sum_{i=1}^{k} m_i(m_i-1)} \tag{2.5}$$

and

$$\hat\sigma_{\nu,4} - 4\hat\sigma_\mu^2\hat\sigma_\nu^2 + \frac{2\hat\sigma_\nu^4}{k^{-1}\sum_{i=1}^{k}(m_i-1)} \to_p \mathrm{Var}(\nu_{11}^2) + \frac{2\sigma_\nu^4}{\lim_{k\to\infty} k^{-1}\sum_{i=1}^{k}(m_i-1)}.$$

Combining Theorems 3 and 4, we can make statistical inference for $\sigma_\mu^2$ and $\sigma_\nu^2$.

2.3. Automatic selection of smoothing parameters

Because of the computational complexity involved, it is often impractical to select all three smoothing components automatically: the degree of the splines, and the number and locations of the knots. Similar to Rice and Wu (2001), we use splines with equally spaced knots and a fixed degree, and select $L_n$, the number of basis functions, using the data. Here $L_n$ is the dimension of $G$ and is related to the number $M$ of interior knots through $L_n = M + 1 + d$, where $d$ is the degree of the spline. Selecting the smoothing parameters for semiparametric models, particularly for estimating the parametric component, was posed by Bickel and Kwon (2001) as an important and unsolved problem. We first note that the performance of the estimators of the parameters $\beta$, $\sigma_\mu^2$ and $\sigma_\nu^2$ does not depend sensitively on the smoothing parameter $L_n$, as long as $L_n$ is not so small as to create excessive bias. Hence, we choose the smoothing parameter that is suitable for estimating the function $g(\cdot)$.

Let $\{X_{ij}, U_{ij}, Y_{ij};\ i = 1, \ldots, k, j = 1, \ldots, m_i\}$ be a random sample from model (1.1), ordered according to $U_{ij}$ so that $U^*_{s} \le U^*_{t}$ if $s < t$, where $U^*_{(i-1)m_i+j} = U_{ij}$. Let $X^*_{(i-1)m_i+j} = X_{ij}$, $Y^*_{(i-1)m_i+j} = Y_{ij}$ and $\varepsilon^*_{(i-1)m_i+j} = \varepsilon_{ij}$. Under some mild conditions, the spacings $U^*_{t+1} - U^*_{t}$ are of order $O_p(\log k/k)$, so that $g(U^*_{t+1}) - g(U^*_{t}) = O_p(\log k/k)$. Then by model (1.1),

$$Y^*_{t+1} - Y^*_{t} = (X^*_{t+1,1} - X^*_{t,1})\beta_1 + \cdots + (X^*_{t+1,p} - X^*_{t,p})\beta_p + g(U^*_{t+1}) - g(U^*_{t}) + \varepsilon^*_{t+1} - \varepsilon^*_{t} = \sum_{s=1}^{p}(X^*_{t+1,s} - X^*_{t,s})\beta_s + \varepsilon^*_{t+1} - \varepsilon^*_{t} + O_p(\log k/k).$$

Thus the nonparametric function $g(\cdot)$ in the panel data partially linear model (1.1) is eliminated for the responses $Y^*_{t+1} - Y^*_{t}$ with regressors $\{X^*_{t+1,s} - X^*_{t,s},\ s = 1, \ldots, p\}$. As a result, the coefficients $\beta = (\beta_1, \ldots, \beta_p)^{\top}$ can be estimated by ordinary least squares from the above approximate model:

$$\check\beta = \Bigg[\sum_{t=1}^{n-1}(X^*_{t+1} - X^*_{t})(X^*_{t+1} - X^*_{t})^{\top}\Bigg]^{-1}\sum_{t=1}^{n-1}(X^*_{t+1} - X^*_{t})(Y^*_{t+1} - Y^*_{t}).$$

Based on $\check\beta$, we propose the following deleting-block cross-validation (CV) method to select an appropriate value of $L_n$. Define the delete-block CV function

$$\mathrm{CV}(L_n) = n^{-1}\sum_{i=1}^{k}\sum_{j=1}^{m_i}\big(Y_{ij} - X_{ij}^{\top}\check\beta - \hat g_{L_n,-i}(U_{ij})\big)^2, \tag{2.6}$$

where $\hat g_{L_n,-i}(\cdot)$ is the polynomial spline estimate computed from the data $\{U_{ij}, Y_{ij} - X_{ij}^{\top}\check\beta;\ i = 1, \ldots, k, j = 1, \ldots, m_i\}$ with the $i$th subject omitted.

The function in (2.6), which depends on the smoothing parameter $L_n$, is used as an overall measure of the effectiveness of the estimation scheme. The deleting-block CV selector is the value that minimizes (2.6), namely $\hat L_{\mathrm{CV}} = \arg\min_{L_n}\mathrm{CV}(L_n)$. When there are a large number of subjects, calculation of the deleting-block CV score can be computationally intensive. In such a case, we can use deleting-block $K$-fold CV by splitting the subjects into $K$ roughly equal-sized parts.

In this paper we restrict our attention to splines with equally spaced knots. It might be worthwhile to investigate using the data to decide the knot positions (free-knot splines). There has been considerable work on free-knot splines for i.i.d. data; see Hansen and Kooperberg (2002). Extension of free-knot splines to panel data is beyond the scope of this paper. In addition, for simplicity, we use the same $L_n$ in the three situations: the unweighted estimator, the weighted estimator and the penalized weighted estimator.
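The following Python sketch (ours, not from the paper) illustrates the two ingredients of this subsection: the difference-based estimator $\check\beta$ computed after sorting by $U$, and the deleting-block CV score (2.6) evaluated over a grid of candidate dimensions. It reuses the hypothetical helper `bspline_basis` from the sketch in Section 2.1.

```python
import numpy as np

def difference_estimator(Y, X, U):
    """Difference-based OLS estimator of beta after ordering the data by U."""
    order = np.argsort(U)
    dX = np.diff(X[order], axis=0)
    dY = np.diff(Y[order])
    return np.linalg.solve(dX.T @ dX, dX.T @ dY)

def cv_score(Y, X, U, groups, n_interior, degree=3):
    """Deleting-block CV score (2.6) for a given number of interior knots."""
    beta_check = difference_estimator(Y, X, U)
    partial = Y - X @ beta_check            # responses with the parametric part removed
    score = 0.0
    for g in np.unique(groups):
        train, test = groups != g, groups == g
        B_tr = bspline_basis(U[train], n_interior, degree)
        gamma = np.linalg.lstsq(B_tr, partial[train], rcond=None)[0]
        pred = bspline_basis(U[test], n_interior, degree) @ gamma
        score += np.sum((partial[test] - pred) ** 2)
    return score / len(Y)

def select_n_interior(Y, X, U, groups, candidates=range(1, 11)):
    """Pick the number of interior knots minimizing the deleting-block CV score."""
    scores = [cv_score(Y, X, U, groups, m) for m in candidates]
    return list(candidates)[int(np.argmin(scores))]
```

For many subjects, the loop over blocks can be replaced by a deleting-block $K$-fold variant by grouping several subjects into each fold.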

3. Weighted semiparametric least squares estimation

The semiparametric least squares estimators $\hat\beta$ and $\hat g(\cdot)$ are not asymptotically efficient because they do not take the one-way error components structure into account. To construct more efficient estimators, we first derive the inverse of the covariance matrix of $e = (\varepsilon_{11}, \ldots, \varepsilon_{1m_1}, \ldots, \varepsilon_{km_k})^{\top}$. According to (1.2) we have

$$\Omega = E(ee^{\top}) = \mathrm{blockdiag}(\sigma_\mu^2 1_{m_1}1_{m_1}^{\top} + \sigma_\nu^2 I_{m_1}, \ldots, \sigma_\mu^2 1_{m_k}1_{m_k}^{\top} + \sigma_\nu^2 I_{m_k}), \tag{3.1}$$

where $I_{m_i}$ is the identity matrix of dimension $m_i$ and $1_{m_i}$ is the vector of ones of dimension $m_i$. By the method of Fuller and Battese (1974) and Wansbeek (1989), the inverse of $\Omega$ is

$$\Omega^{-1} = \mathrm{blockdiag}(\sigma_1^{-2}\bar J_{m_1} + \sigma_\nu^{-2}E_{m_1}, \ldots, \sigma_k^{-2}\bar J_{m_k} + \sigma_\nu^{-2}E_{m_k}),$$

where $\sigma_i^2 = m_i\sigma_\mu^2 + \sigma_\nu^2$, $\bar J_{m_i} = J_{m_i}/m_i$, $E_{m_i} = I_{m_i} - \bar J_{m_i}$ and $J_{m_i} = 1_{m_i}1_{m_i}^{\top}$. Thus an estimator of $\Omega^{-1}$ is given by $\hat\Omega^{-1} = \mathrm{blockdiag}(\hat\sigma_1^{-2}\bar J_{m_1} + \hat\sigma_\nu^{-2}E_{m_1}, \ldots, \hat\sigma_k^{-2}\bar J_{m_k} + \hat\sigma_\nu^{-2}E_{m_k})$, where $\hat\sigma_i^2 = m_i\hat\sigma_\mu^2 + \hat\sigma_\nu^2$ and $(\hat\sigma_\mu^2, \hat\sigma_\nu^2)$ are defined in Section 2. Therefore, we propose the following weighted estimation function:

$$S_w(\beta, \gamma) = (Y - X\beta - B\gamma)^{\top}\hat\Omega^{-1}(Y - X\beta - B\gamma),$$

which has the unique minimizer

$$\hat\beta^w = (X^{\top}M_B^{\hat\Omega^{-1}}X)^{-1}X^{\top}M_B^{\hat\Omega^{-1}}Y \quad\text{and}\quad \hat\gamma^w = (B^{\top}\hat\Omega^{-1}B)^{-1}B^{\top}\hat\Omega^{-1}(Y - X\hat\beta^w),$$

where $M_B^{\hat\Omega^{-1}} = \hat\Omega^{-1} - P_B^{\hat\Omega^{-1}} = \hat\Omega^{-1} - \hat\Omega^{-1}B(B^{\top}\hat\Omega^{-1}B)^{-1}B^{\top}\hat\Omega^{-1}$. We call $\hat\beta^w$ and $\hat\gamma^w$ the WSLSE of $\beta$ and $\gamma$, respectively. The corresponding weighted polynomial spline estimator of $g(u)$ is $\hat g^w(u) = B^{\top}(u)\hat\gamma^w$.
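To make the construction concrete, here is a short Python sketch (ours, not the authors' code) of the feasible weighted estimator: it builds $\hat\Omega^{-1}$ block by block from $(\hat\sigma_\mu^2, \hat\sigma_\nu^2)$ and then solves the weighted normal equations. It assumes the hypothetical helpers `bspline_basis` and `error_variance_estimates` from the earlier sketches.

```python
import numpy as np

def omega_inverse(groups, sigma_mu2, sigma_nu2):
    """Estimated Omega^{-1}: blockdiag(sigma_i^{-2} Jbar_mi + sigma_nu^{-2} E_mi)."""
    n = len(groups)
    W = np.zeros((n, n))
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        m = len(idx)
        Jbar = np.ones((m, m)) / m
        E = np.eye(m) - Jbar
        sigma_i2 = m * sigma_mu2 + sigma_nu2
        W[np.ix_(idx, idx)] = Jbar / sigma_i2 + E / sigma_nu2
    return W

def wslse(Y, X, U, groups, sigma_mu2, sigma_nu2, n_interior=5, degree=3):
    """Weighted semiparametric least squares estimator of Section 3."""
    B = bspline_basis(U, n_interior, degree)
    W = omega_inverse(groups, sigma_mu2, sigma_nu2)
    WB = W @ B
    # M_B^{W} = W - W B (B' W B)^{-1} B' W
    M = W - WB @ np.linalg.solve(B.T @ WB, WB.T)
    beta_w = np.linalg.solve(X.T @ M @ X, X.T @ M @ Y)
    gamma_w = np.linalg.solve(B.T @ WB, WB.T @ (Y - X @ beta_w))
    g_hat_w = lambda u: bspline_basis(u, n_interior, degree) @ gamma_w
    return beta_w, gamma_w, g_hat_w
```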

The asymptotic normality of $\hat\beta^w$ is established in the following theorem.

Theorem 5. Under the assumptions of Theorem 1,

$$\sqrt{n}\,(\hat\beta^w - \beta) \to_D N(0, R_3^{-1}) \quad \text{as } k \to \infty,$$

where

$$R_3 = \lim_{k\to\infty}\frac{1}{n}\sum_{i=1}^{k} E(P_i^{\top}\Omega_i^{-1}P_i) \quad\text{with}\quad \Omega_i^{-1} = \sigma_i^{-2}1_{m_i}1_{m_i}^{\top}/m_i + \sigma_\nu^{-2}E_{m_i}.$$

The next theorem gives the convergence rate and asymptotic normality of $\hat g^w(u)$.

Theorem 6. Under the assumptions of Theorem 1,

$$\|\hat g^w - g\|_{L_2}^2 = O_p(L_n k^{-1} + \rho_k^2) = O_p(L_n k^{-1} + L_n^{-4})$$

and

$$(\psi^w(u))^{-1/2}(\hat g^w(u) - g(u)) \to_D N(0, 1) \quad \text{as } k \to \infty,$$

where $\psi^w(u) = B^{\top}(u)[B^{\top}\mathrm{blockdiag}(\Omega_1^{-1}, \ldots, \Omega_k^{-1})B]^{-1}B(u)$.

Remark 1. It is easy to see that $R_1^{-1}R_2R_1^{-1} \ge R_3^{-1}$, so that $\hat\beta^w$ is asymptotically more efficient than $\hat\beta$ in terms of the asymptotic covariance matrix. In fact, as shown in Carroll et al. (1998), $R_3^{-1}$ is the semiparametric information bound (see Bickel et al., 1993). Hence, the WSLSE is semiparametrically efficient. Moreover, $\psi^w(u) \le \psi(u)$ is also obvious. Therefore, Theorem 6 states that the weighted polynomial spline estimator of the nonparametric component also has a smaller asymptotic variance than the unweighted one.

In order to apply Theorems 5 and 6 to make statistical inference, we need consistent estimators of $R_3$ and $\psi^w(u)$. Let $\hat\Omega_i^{-1} = \hat\sigma_i^{-2}1_{m_i}1_{m_i}^{\top}/m_i + \hat\sigma_\nu^{-2}E_{m_i}$. Define

$$\hat R_3 = \frac{1}{n}X^{\top}M_B\,\mathrm{blockdiag}(\hat\Omega_1^{-1}, \ldots, \hat\Omega_k^{-1})\,M_B X$$

and

$$\hat\psi^w(u) = B^{\top}(u)[B^{\top}\mathrm{blockdiag}(\hat\Omega_1^{-1}, \ldots, \hat\Omega_k^{-1})B]^{-1}B(u),$$

where $M_B = I_n - B(B^{\top}B)^{-1}B^{\top}$.

The next theorem shows the consistency of the estimators $\hat R_3$ and $\hat\psi^w(u)$.

Theorem 7. Under the assumptions of Theorem 1, $\hat R_3 \to_p R_3$ and $\hat\psi^w(u) \to_p \psi^w(u)$ as $k \to \infty$.

Applying Theorems 5–7 we can construct asymptotic confidence intervals (CIs) for $\beta$ and $g(u)$. For example, an approximate $100(1-c)\%$ asymptotic CI for $g(u)$ is given by

$$\hat g^w(u) \pm z_{c/2}\,(\hat\psi^w(u))^{1/2},$$

where $z_{c/2}$ is the $(1-c/2)$th quantile of the standard normal distribution. Moreover, similar to Huang et al. (2004), we can construct simultaneous bands for $g(u)$ over a given subinterval $[c_1, c_2]$ of $\mathcal{U}$ by extending the pointwise CIs. Partitioning $[c_1, c_2]$ by $M^* + 1$ equally spaced grid points $c_1 = \xi_1 < \cdots < \xi_{M^*+1} = c_2$ for some integer $M^* \ge 1$, one gets a set of approximate $100(1-c)\%$ simultaneous CIs $(l_c(\xi_r), u_c(\xi_r))$ for $g(\xi_r)$ such that

$$\lim_{k\to\infty}\Pr\{l_c(\xi_r) \le g(\xi_r) \le u_c(\xi_r) \text{ for all } r = 1, \ldots, M^*+1\} \ge 1 - c.$$

Based on the Bonferroni adjustment, we can set $(l_c(\xi_r), u_c(\xi_r))$ as $\hat g^w(\xi_r) \pm z_{c/(2(M^*+1))}(\hat\psi^w(\xi_r))^{1/2}$. Let $g^{(I)}(u)$ be the linear interpolation of $g(\xi_r)$ and $g(\xi_{r+1})$ for $\xi_r \le u \le \xi_{r+1}$:

$$g^{(I)}(u) = M^*\Big(\frac{\xi_{r+1} - u}{c_2 - c_1}\Big)g(\xi_r) + M^*\Big(\frac{u - \xi_r}{c_2 - c_1}\Big)g(\xi_{r+1}).$$

Similarly, let $l_c^{(I)}(u)$ and $u_c^{(I)}(u)$ be the linear interpolations of $l_c(\xi_r)$ and $u_c(\xi_r)$, respectively. Then $(l_c^{(I)}(u), u_c^{(I)}(u))$ is an approximate $(1-c)$ confidence band for $g^{(I)}(u)$ in the sense that

$$\lim_{k\to\infty}\Pr\{l_c^{(I)}(u) \le g^{(I)}(u) \le u_c^{(I)}(u) \text{ for all } u \in [c_1, c_2]\} \ge 1 - c.$$

To construct the bands for $g(u)$, we assume that $\sup_{u\in[c_1,c_2]}|\partial g(u)/\partial u| \le v_0$ for a known constant $v_0 > 0$. By Taylor's expansion we have

$$|g(u) - g^{(I)}(u)| \le 2v_0 M^*\frac{(\xi_{r+1}-u)(u-\xi_r)}{c_2 - c_1}.$$

Adjusting the bands for $g^{(I)}(u)$, an approximate $(1-c)$ confidence band for $g(u)$ is given by

$$\left(l_c^{(I)}(u) - 2v_0 M^*\frac{(\xi_{r+1}-u)(u-\xi_r)}{c_2 - c_1},\ \ u_c^{(I)}(u) + 2v_0 M^*\frac{(\xi_{r+1}-u)(u-\xi_r)}{c_2 - c_1}\right).$$
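As an illustration of the pointwise interval above, the following Python sketch (ours; it reuses the hypothetical `bspline_basis` and `omega_inverse` helpers and the WSLSE quantities from the earlier sketches) evaluates $\hat\psi^w(u)$ and the approximate $100(1-c)\%$ CI for $g(u)$ on a grid.

```python
import numpy as np
from scipy.stats import norm

def pointwise_band(u_grid, U, groups, gamma_w, sigma_mu2, sigma_nu2,
                   n_interior=5, degree=3, c=0.05):
    """Approximate 100(1-c)% pointwise confidence intervals for g(u)."""
    B = bspline_basis(U, n_interior, degree)
    W = omega_inverse(groups, sigma_mu2, sigma_nu2)
    BWB_inv = np.linalg.inv(B.T @ W @ B)
    Bu = bspline_basis(u_grid, n_interior, degree)
    # psi_w(u) = B(u)' [B' W B]^{-1} B(u), evaluated row by row
    psi_w = np.einsum('ij,jk,ik->i', Bu, BWB_inv, Bu)
    g_hat = Bu @ gamma_w
    half = norm.ppf(1 - c / 2) * np.sqrt(psi_w)
    return g_hat - half, g_hat + half
```

The simultaneous band follows the same recipe with the Bonferroni-adjusted quantile $z_{c/(2(M^*+1))}$ evaluated on the grid points only.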

To construct the bands for g(u), we assume that supu∈[c ,c ] |jg(u)/ ju|  v0 for a known constant v0 > 0. By Taylor's expansions 1 2 we have   (r+1 − u)(u − r ) . |g(u) − g(I) (u)|  2v0 M c2 − c1 Adjusting the bands for g(I) (u), an approximate (1 − c) confidence band for g(u) is given by      (r+1 − u)(u − r ) (r+1 − u)(u − r ) (I) (I) lc (u) − 2v0 M , uc (u) + 2v0 M . c2 − c1 c2 − c1 4. Variable selection for the parametric components Model selection is an indispensable step in statistical data analysis. However, the problem has rarely been studied in the context of semiparametric regression with one-way error component structure. Fan and Li (2001) proposed a variable selection via nonconcave penalized likelihood and found some oracle properties. In this section, we will extend their method to our settings and propose a nonconcave penalization weighted semiparametric least squares estimate. Closer to our work is the paper by Fan and Li (2004), but it did not consider the estimated weights. Suppose that Xij consist of p variables, and some of these are insignificant. A penalization weighted semiparametric least squares takes the form p

4. Variable selection for the parametric components

Model selection is an indispensable step in statistical data analysis, but the problem has rarely been studied in the context of semiparametric regression with a one-way error components structure. Fan and Li (2001) proposed variable selection via nonconcave penalized likelihood and established its oracle properties. In this section, we extend their method to our setting and propose a nonconcave penalized weighted semiparametric least squares estimate. Closest to our work is the paper by Fan and Li (2004), but it did not consider the estimated weights.

Suppose that $X_{ij}$ consists of $p$ variables, some of which are insignificant. The penalized weighted semiparametric least squares objective takes the form

$$L(\beta) \equiv \frac{1}{2}(Y - X\beta)^{\top}M_B^{\hat\Omega^{-1}}(Y - X\beta) + k\sum_{j=1}^{p}p_{\lambda_j}(|\beta_j|), \tag{4.1}$$

where the $p_{\lambda_j}(\cdot)$'s are penalty functions, the $\lambda_j$'s are tuning parameters that control the model complexity, and $M_B^{\hat\Omega^{-1}}$ is defined in Section 3. Here the penalty functions $p_{\lambda_j}(\cdot)$ and the regularization parameters $\lambda_j$ are not necessarily the same for all $j$. This allows us to incorporate prior information for the unknown coefficients by using different penalty functions or taking different values of $\lambda_j$.

Antoniadis and Fan (2001) and Fan and Li (2001) advocated that a good penalty function should yield an estimator with the following three properties: unbiasedness for a large true coefficient to avoid unnecessary estimation bias, sparsity (estimating a small coefficient as zero) to reduce model complexity, and continuity to avoid unnecessary variation in model prediction. They further proposed a simple penalty function, the smoothly clipped absolute deviation (SCAD) penalty, which results in an estimator with the three desired properties. Its first derivative is defined by

$$p'_{\lambda}(\theta) = \lambda\Big\{I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}I(\theta > \lambda)\Big\} \quad \text{for some } a > 2 \text{ and } \theta > 0,$$

and $p_{\lambda}(0) = 0$. The SCAD penalty involves two unknown parameters, $\lambda$ and $a$; Fan and Li (2001) suggested using $a = 3.7$ from a Bayesian point of view.

We now study the asymptotic properties of the resulting nonconcave penalized weighted semiparametric least squares estimate (4.1). First we establish the convergence rate of the estimator. Assume that all penalty functions $p_{\lambda_j}(\cdot)$ are nonnegative and nondecreasing with $p_{\lambda_j}(0) = 0$. Denote $a_k = \max_j\{|p'_{\lambda_j}(|\beta_j|)| : \beta_j \ne 0\}$ and $b_k = \max_j\{|p''_{\lambda_j}(|\beta_j|)| : \beta_j \ne 0\}$. Then we have the following theorem.


Theorem 8. Under the assumptions of Theorem 1, if $a_k$ and $b_k$ tend to zero as $k \to \infty$, then with probability approaching one there exists a local minimizer $\hat\beta^w_\lambda$ of $L(\beta)$ such that $\|\hat\beta^w_\lambda - \beta\| = O_p(k^{-1/2} + a_k)$.

Theorem 8 demonstrates how the rate of convergence of the penalized WSLSE $\hat\beta^w_\lambda$ depends on the $\lambda_j$. To achieve the root-$k$ convergence rate, we have to take the $\lambda_j$ small enough so that $a_k = O(k^{-1/2})$.

Next we establish the oracle property for the nonconcave penalized WSLSE. For ease of presentation, we assume, without loss of generality, that the first $p_0$ components of $\beta$ are nonzero and all other components equal 0. Write

$$D = \mathrm{diag}\{p''_{\lambda_1}(|\beta_1|), \ldots, p''_{\lambda_{p_0}}(|\beta_{p_0}|)\}$$

and

$$b = \big(p'_{\lambda_1}(|\beta_1|)\,\mathrm{sgn}(\beta_1), \ldots, p'_{\lambda_{p_0}}(|\beta_{p_0}|)\,\mathrm{sgn}(\beta_{p_0})\big)^{\top}.$$

Further, let $\hat\beta^w_{\lambda,1}$ and $\hat\beta^w_{\lambda,2}$ consist of the first $p_0$ and the last $p - p_0$ components of $\hat\beta^w_\lambda$, respectively.

Theorem 9 (Oracle property). Assume that for $j = 1, \ldots, p$, $\lambda_j \to 0$, $\sqrt{k}\lambda_j \to \infty$ and the penalty function $p_{\lambda_j}(|\beta_j|)$ satisfies

$$\liminf_{k\to\infty}\ \liminf_{\theta\to 0+}\ p'_{\lambda_j}(\theta)/\lambda_j > 0. \tag{4.2}$$

If $a_k = O(k^{-1/2})$, then under the conditions of Theorem 8, with probability approaching 1, the root-$k$ consistent local minimizer $\hat\beta^w_\lambda = (\hat\beta^{w\top}_{\lambda,1}, \hat\beta^{w\top}_{\lambda,2})^{\top}$ in Theorem 8 satisfies:

(i) (Sparsity) $\hat\beta^w_{\lambda,2} = 0$;

(ii) (Asymptotic normality) $\sqrt{k}\,\{R_{311} + D\}\big[\hat\beta^w_{\lambda,1} - \beta_1 + \{R_{311} + D\}^{-1}b\big] \to_D N(0, R_{311})$, where $R_{311}$ consists of the first $p_0$ rows and columns of $R_3$ defined in Theorem 5.

From Theorem 9, if $\lambda_j \to 0$ and $\sqrt{k}\lambda_j \to \infty$ for $j = 1, \ldots, p$, $a_k = O(k^{-1/2})$, and condition (4.2) holds, then the resulting estimate possesses an oracle property. This implies that the resulting procedure correctly specifies the true model and estimates the unknown regression coefficients as efficiently as if the true submodel were known. If all the penalty functions are SCAD, then $a_k = 0$ for sufficiently large $k$, hence the resulting estimate possesses the oracle property.

It is a challenging task to find the solution of the nonconcave penalized weighted semiparametric least squares problem because the penalty function $p_{\lambda_j}(|\beta_j|)$ is irregular at the origin and does not have a second derivative at some points. Following Fan and Li (2001), we locally approximate the penalty functions by quadratic functions as follows. Given an initial value $\beta^{(0)}$ that is close to the minimizer of (4.1), when $|\beta_j^{(0)}|$ exceeds a prescribed small value, the penalty $p_{\lambda_j}(|\beta_j|)$ can be locally approximated by a quadratic function via

$$[p_{\lambda_j}(|\beta_j|)]' = p'_{\lambda_j}(|\beta_j|)\,\mathrm{sgn}(\beta_j) \approx \Big\{p'_{\lambda_j}(|\beta_j^{(0)}|)/|\beta_j^{(0)}|\Big\}\beta_j.$$

With this local quadratic approximation, the Newton–Raphson algorithm can be applied directly to minimize $L(\beta)$.

To implement the method described above, it is desirable to have an automatic data-driven method for estimating the tuning parameters $\lambda_1, \ldots, \lambda_p$. Similar to Fan and Li (2001), we can estimate $\lambda = (\lambda_1, \ldots, \lambda_p)$ by minimizing an approximate GCV score. It is expected that the magnitude of $\lambda_j$ should be proportional to the standard error of the WSLSE of $\beta_j$. To reduce the computational burden, we set $\lambda_j = \lambda \times \mathrm{se}(\hat\beta_j^w)$ as in Fan and Li (2004), where $\mathrm{se}(\hat\beta_j^w)$ denotes the standard error of the unpenalized WSLSE. This reduces the minimization of the GCV score to a one-dimensional search over $\lambda$, which can be done efficiently, as in the simulations reported in the next section.
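The following Python sketch (ours, not the authors' implementation) shows the SCAD derivative and the local-quadratic-approximation Newton–Raphson iteration for criterion (4.1). Here `M` denotes the weighted projection matrix $M_B^{\hat\Omega^{-1}}$ computed as in the Section 3 sketch, `lam` is the vector of tuning parameters $\lambda_j$, and the thresholds and iteration counts are illustrative choices.

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """First derivative of the SCAD penalty (Fan and Li, 2001)."""
    theta = np.abs(theta)
    return lam * ((theta <= lam) +
                  np.clip(a * lam - theta, 0, None) / ((a - 1) * lam) * (theta > lam))

def penalized_wslse(Y, X, M, lam, beta0, k, a=3.7, n_iter=50, eps=1e-4):
    """Local quadratic approximation / Newton-Raphson for criterion (4.1).
    M: weighted projection matrix, beta0: initial value (e.g. the unpenalized WSLSE)."""
    beta = beta0.copy()
    XtMX, XtMY = X.T @ M @ X, X.T @ M @ Y
    for _ in range(n_iter):
        active = np.abs(beta) > eps              # coefficients not yet shrunk to zero
        # diagonal of the LQA penalty: p'_lam(|beta_j|) / |beta_j| on the active set
        d = scad_deriv(beta[active], lam[active], a) / np.abs(beta[active])
        A = XtMX[np.ix_(active, active)] + k * np.diag(d)
        beta_new = np.zeros_like(beta)
        beta_new[active] = np.linalg.solve(A, XtMY[active])
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

Each iteration solves a ridge-type system, and coefficients whose magnitude falls below the threshold are set to zero and removed from the active set, which is how the sparsity in Theorem 9(i) is realized numerically.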

5. Simulation studies

In this section we carry out simulation studies to demonstrate the finite-sample performance of the procedures proposed in the previous sections. The data are generated from the following panel data partially linear regression model:

$$y_{ij} = x_{ij1}\beta_1 + x_{ij2}\beta_2 + g(u_{ij}) + \mu_i + \nu_{ij}, \quad i = 1, \ldots, k, \; j = 1, \ldots, m_i,$$

where $x_{ij1} = \xi_{i1} + \zeta_{ij1}$ and $x_{ij2} = \xi_{i2} + \zeta_{ij2}$, with $\xi_{i1}, \zeta_{ij1}, \xi_{i2}, \zeta_{ij2}$ i.i.d. $N(0, 0.25)$, $u_{ij}$ i.i.d. $U(0, 1)$, $\mu_i \sim N(0, \sigma_\mu^2)$ and $\nu_{ij} \sim N(0, \sigma_\nu^2)$. We take $\beta_1 = 1.5$, $\beta_2 = 2$, $g(t) = 2\sin(2\pi t)$, and $(\sigma_\mu^2, \sigma_\nu^2) = (2, 0.5)$ and $(2, 1)$. Moreover, the following two cases of $k$ and $\{m_i\}$ are simulated:

Case 1: $k = 80$, $m_1 = \cdots = m_{50} = 2$ and $m_{51} = \cdots = m_{80} = 3$.

Case 2: $k = 40$, $m_1 = \cdots = m_{25} = 4$ and $m_{26} = \cdots = m_{40} = 6$.

In each case the number of simulated realizations is 1000.
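A minimal data-generating sketch for this design (ours; the component symbols $\xi$ and $\zeta$ follow the notation used above) is:

```python
import numpy as np

def simulate_panel(case=1, sigma_mu2=2.0, sigma_nu2=0.5, seed=0):
    """Generate one data set from the simulation design of Section 5."""
    rng = np.random.default_rng(seed)
    if case == 1:
        m = np.r_[np.full(50, 2), np.full(30, 3)]      # k = 80
    else:
        m = np.r_[np.full(25, 4), np.full(15, 6)]      # k = 40
    groups = np.repeat(np.arange(len(m)), m)
    n = m.sum()
    beta = np.array([1.5, 2.0])
    xi = rng.normal(0, 0.5, size=(len(m), 2))[groups]  # subject-level components
    zeta = rng.normal(0, 0.5, size=(n, 2))             # observation-level components
    X = xi + zeta
    U = rng.uniform(0, 1, n)
    mu = rng.normal(0, np.sqrt(sigma_mu2), len(m))[groups]
    nu = rng.normal(0, np.sqrt(sigma_nu2), n)
    Y = X @ beta + 2 * np.sin(2 * np.pi * U) + mu + nu
    return Y, X, U, groups
```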


Table 1
Simulation results for the estimators of the parametric components β1 and β2.

(σμ, σν) = (√2, √0.5)

Case  Method  | β1: Mean   SE      SE-hat  CI     EF     | β2: Mean   SE      SE-hat  CI     EF
1     β̂       |     1.502  0.197   0.192   0.947  1.545  |     1.997  0.193   0.192   0.946  1.574
1     β̂^w     |     1.498  0.129   0.124   0.944  1.011  |     2.001  0.124   0.124   0.950  1.011
1     β̄^w     |     1.498  0.127   0.125   0.948  1.000  |     2.001  0.123   0.125   0.954  1.000
2     β̂       |     1.483  0.241   0.240   0.955  2.141  |     1.991  0.232   0.240   0.961  2.089
2     β̂^w     |     1.502  0.114   0.114   0.949  1.010  |     2.004  0.111   0.113   0.962  1.004
2     β̄^w     |     1.503  0.113   0.112   0.946  1.000  |     2.005  0.111   0.112   0.966  1.000

(σμ, σν) = (√2, 1)

Case  Method  | β1: Mean   SE      SE-hat  CI     EF     | β2: Mean   SE      SE-hat  CI     EF
1     β̂       |     1.506  0.206   0.206   0.939  1.282  |     1.989  0.203   0.206   0.946  1.231
1     β̂^w     |     1.506  0.163   0.159   0.943  1.009  |     1.997  0.1656  0.159   0.937  1.003
1     β̄^w     |     1.507  0.161   0.162   0.944  1.000  |     1.998  0.165   0.162   0.935  1.000
2     β̂       |     1.487  0.259   0.251   0.940  1.710  |     2.005  0.253   0.251   0.958  1.689
2     β̂^w     |     1.493  0.155   0.151   0.940  1.022  |     2.007  0.153   0.152   0.952  1.018
2     β̄^w     |     1.493  0.158   0.153   0.947  1.000  |     2.007  0.150   0.153   0.958  1.000

Table 2
Simulation results for the estimators of the error variances σμ² and σν².

(σμ, σν)     Case  | σ̂μ²: Mean   SE     SE-hat  CI     | σ̂ν²: Mean   SE     SE-hat  CI
(√2, √0.5)   1     |      1.819   0.313  0.316   0.881  |      0.514   0.075  0.087   0.974
(√2, √0.5)   2     |      1.635   0.381  0.417   0.809  |      0.527   0.062  0.064   0.943
(√2, 1)      1     |      1.603   0.413  0.441   0.792  |      1.024   0.123  0.115   0.933
(√2, 1)      2     |      1.786   0.351  0.351   0.875  |      1.018   0.145  0.141   0.948

Further, in this study we used uniform knots, which are usually sufficient when the function $g(\cdot)$ does not exhibit dramatic changes in its derivatives. Thus, we only need to determine the number of knots, and the method proposed in Section 2 is used for this purpose.

For the parametric components $(\beta_1, \beta_2)$, we compare three estimators: the (unweighted) semiparametric least squares estimator $\hat\beta$, the WSLSE $\hat\beta^w$ with estimated $\sigma_\mu^2$ and $\sigma_\nu^2$, and the WSLSE $\bar\beta^w$ with known $\sigma_\mu^2$ and $\sigma_\nu^2$.

Table 1 summarizes the simulation results on the estimates of $(\beta_1, \beta_2)$: the simulated sample means (Mean), the sample standard deviations (SE), the means of the estimated standard deviations (SE-hat), the coverage percentages of the nominal 95% CIs (CI), and the efficiency (EF) in terms of the mean square error (MSE) ratio with respect to the WSLSE $\bar\beta^w$ with known error variances. From Table 1, we make the following observations:

(1) All three methods yield unbiased estimates; there is no observable bias in the means of the estimators.
(2) The proposed variance estimators are consistent, as the SE-hat values are close to the SE values.
(3) The nominal 95% CIs based on the proposed method provide satisfactory coverage for the cases studied, with coverage percentages close to the nominal 95% level.
(4) The performances of the two WSLSEs $\hat\beta^w$ and $\bar\beta^w$ are very close. Both have substantially smaller standard deviations than the unweighted estimator $\hat\beta$, especially when $\sigma_\mu^2$ is large compared with $\sigma_\nu^2$, or when the $m_i$'s are large while $\sum_{i=1}^{k} m_i$ remains unchanged. For example, in Case 1, the reduction in standard deviation is around 30% with $(\sigma_\mu^2, \sigma_\nu^2) = (2, 0.5)$ and 20% with $(\sigma_\mu^2, \sigma_\nu^2) = (2, 1)$. On the other hand, comparing Cases 1 and 2, both have the same $\sum_i m_i = 190$, but Case 2 has larger $m_i \in \{4, 6\}$ than Case 1 (with $m_i \in \{2, 3\}$). Given $(\sigma_\mu^2, \sigma_\nu^2) = (2, 0.5)$, the reduction in standard deviation is around 30% in Case 1 and 50% in Case 2.

For the error variances $\sigma_\mu^2$ and $\sigma_\nu^2$, the simulated values of Mean, SE, SE-hat and CI are presented in Table 2, which shows that the estimates of $\sigma_\mu^2$ and $\sigma_\nu^2$ are satisfactory.



Fig. 1. The estimators of the nonparametric component and its confidence intervals: (a) with the 2.5% and 97.5% quantiles from the Monte Carlo simulation; (b) with the 95% asymptotic confidence bands obtained by the methods in Section 3. Dotted curves: unweighted polynomial spline estimator; dash-dotted curves: weighted polynomial spline estimator.

Table 3
Simulation results for the relative approximate model error.

(σμ, σν)     Case  | SM     STD    C      I
(√2, √0.5)   1     | 0.549  0.255  4.537  0.005
(√2, √0.5)   2     | 0.550  0.256  4.531  0.000
(√2, 1)      1     | 0.481  0.273  4.621  0.109
(√2, 1)      2     | 0.539  0.264  4.542  0.034

In addition, we studied the estimation of the nonparametric component $g(u)$. Fig. 1 shows the average polynomial spline estimates of the nonparametric function $g(\cdot)$ over 1000 simulations in Case 2. In Fig. 1a, the 2.5% and 97.5% quantiles from the Monte Carlo (MC) simulation are plotted for both the unweighted (dotted curves) and weighted (dash-dotted curves) versions of the polynomial spline estimator. Fig. 1b, on the other hand, presents the corresponding 95% confidence bands using the results in Section 3. From Fig. 1 we can see that the true nonparametric function and the average polynomial spline fits virtually overlay each other, indicating little bias. Further, the estimated asymptotic 95% confidence band is very close to the quantile band. Moreover, Fig. 1 shows that the weighted estimator produces shorter CIs than the unweighted one, meaning that taking the error structure into account increases the efficiency of the estimation of the nonparametric component.

Furthermore, we evaluated the performance of our variable selection procedure in terms of the reduction of model complexity and the relative GMSE (RGMSE, whose definition can be found in Fan and Li, 2004), which is the ratio of the GMSE of a given procedure to that of the WSLSE without penalization. The data are generated from the following panel data partially linear regression model:

$$y_{ij} = x_{ij1}\beta_1 + x_{ij2}\beta_2 + \cdots + x_{ij8}\beta_8 + g(u_{ij}) + \mu_i + \nu_{ij}, \quad i = 1, \ldots, k, \; j = 1, \ldots, m_i,$$

where $x_{ijs} \sim N(0, 4)$ for $i = 1, \ldots, k$, $j = 1, \ldots, m_i$ and $s = 1, \ldots, 8$, $\beta_1 = 0$, $\beta_2 = 0$, $\beta_3 = 0$, $\beta_4 = 0.2$, $\beta_5 = 1.5$, $\beta_6 = 2$, $\beta_7 = 0$, $\beta_8 = 0$, and the other symbols are the same as before. Table 3 reports the simulation results with the means (SM) and standard deviations (STD) of the RGMSEs over 1000 simulated data sets. The average number of coefficients correctly set to zero is also reported in Table 3, in the column labeled "C", while the column labeled "I" shows the average number of coefficients incorrectly set to 0. According to Table 3, the proposed procedure performs quite well.



Fig. 2. The estimated nonparametric component g(·) and the 95% Sandwich confidence bands. The solid curve is the weighted polynomial spline estimator and the dotted curve is the unweighted polynomial spline estimator.

6. An application to hormone data

We now apply our method to analyze data from a panel hormone study (Zhang et al., 1998); partial data are used. The study involved 33 women whose urine samples were collected in one menstrual cycle and whose urinary progesterone was assayed on alternate days. A total of 493 observations were obtained, with each woman contributing from 11 to 29 observations. Each woman's cycle was standardized to a reference 28-day cycle, since the change in the progesterone level for each woman depends on the time within the menstrual cycle. Zhang et al. (1998) and He et al. (2002) used the panel data partially linear regression model

$$y_{ij} = \beta_1\,\mathrm{AGE}_i + \beta_2\,\mathrm{BMI}_i + g(u_{ij}) + \varepsilon_{ij}$$

to fit the data, where $u_{ij}$ is the time in the standardized cycle. Both Zhang et al. (1998) and He et al. (2002) reported insignificant effects of the parametric components $\beta_1$ and $\beta_2$. The variable selection method proposed in Section 4 also confirms that $\beta_1$ and $\beta_2$ are insignificant. As a result, we fit the model $y_{ij} = g(u_{ij}) + \varepsilon_{ij}$ with $\varepsilon_{ij} = \mu_i + \nu_{ij}$. By the methods developed in Section 2, the estimates of the error variances are found to be $\hat\sigma_\mu^2 = 0.202$ and $\hat\sigma_\nu^2 = 0.369$. The unweighted and weighted estimators of $g(\cdot)$ are shown in Fig. 2. It is noted that in Fig. 2 the weighted estimator of $g(\cdot)$ has confidence intervals of similar length to those of the unweighted estimator, in contrast to the simulation results in Fig. 1. This is probably due to the relatively weak within-group correlation compared with the error variances.

7. Concluding remarks

In this paper, using polynomial spline approximation and least squares, we have developed weighted semiparametric least squares estimators (WSLSE) of the parametric and nonparametric components of an unbalanced panel data partially linear regression model with a one-way error components structure. It has been shown that both the WSLSE of the parametric component and the WSLSE of the nonparametric component are asymptotically more efficient than the corresponding unweighted ones. We also established the asymptotic normality of the proposed WSLSEs. In addition, we provided a variable selection procedure for the significant covariates in the parametric part based on a combination of nonconcave penalization and weighted semiparametric least squares, and showed that the resulting estimator has an oracle property. Interesting topics for further study include the extension of our results to structures with two or more error components and to more general semiparametric regression models, such as the varying-coefficient partially linear regression model (cf. Fan and Huang, 2005).


Appendix A. Proofs of main results

Before proving the main results we first introduce some notation and lemmas. Let $|a|$ denote the Euclidean norm of a real-valued vector $a$. For a matrix $A = (a_{ij})$, $\|A\|_\infty = \max_i\sum_j|a_{ij}|$. For a real-valued function $g(u)$ on $\mathcal{U}$, $\|g\|_\infty = \sup_{u\in\mathcal{U}}|g(u)|$ denotes its supremum norm. We can choose a convenient basis system in our technical arguments, and the results for the function estimates hold for other choices of basis of the same function space. Denote by $B_l(\cdot)$, $l = 1, \ldots, L_n$, the B-spline basis functions that span $G$, which have the following properties (cf. de Boor, 1978):

(i) $B_l(u) \ge 0$, $u \in \mathcal{U}$.
(ii) $\sum_{l=1}^{L_n}B_l(u) = 1$, $u \in \mathcal{U}$.
(iii) There exist positive constants $c_3$ and $c_4$ such that

$$\frac{c_3}{L_n}\sum_{l=1}^{L_n}\gamma_l^2 \le \int_{\mathcal{U}}\Bigg[\sum_{l=1}^{L_n}\gamma_l B_l(u)\Bigg]^2 du \le \frac{c_4}{L_n}\sum_{l=1}^{L_n}\gamma_l^2, \quad \gamma_l \in \mathbb{R},\ l = 1, \ldots, L_n.$$

(iv) There is a constant $c_5$ such that $\|g\|_\infty \le c_5\sqrt{L_n}\,\|g\|_{L_2}$ for $g \in G$.

We first present two lemmas needed in the subsequent proofs, whose proofs can be found in Huang et al. (2004) and Stout (1974), respectively.

Lemma 1. Let $g(u) = \sum_{l=1}^{L_n}\gamma_l B_l(u)$ and $\gamma = (\gamma_1, \ldots, \gamma_{L_n})^{\top}$. Then $\|g\|^2 \asymp \|g\|_{L_2}^2 \asymp |\gamma|^2/L_n$.

Lemma 2. Let $\{\zeta_i\}_{i=1}^{k}$ be independent random variables with zero means and finite absolute moments of order $s \ge 2$. Then there exists a constant $c_s$ such that

$$E\Bigg|\sum_{i=1}^{k}\zeta_i\Bigg|^{s} \le c_s k^{(s-2)/2}\sum_{i=1}^{k}E|\zeta_i|^{s}.$$

Next, let $\bar\beta^w = (X^{\top}M_B^{\Omega^{-1}}X)^{-1}X^{\top}M_B^{\Omega^{-1}}Y$, $\bar\gamma^w = (B^{\top}\Omega^{-1}B)^{-1}B^{\top}\Omega^{-1}(Y - X\bar\beta^w)$ and $\bar g^w(u) = B^{\top}(u)\bar\gamma^w$, where $M_B^{\Omega^{-1}} = \Omega^{-1} - P_B^{\Omega^{-1}} = \Omega^{-1} - \Omega^{-1}B(B^{\top}\Omega^{-1}B)^{-1}B^{\top}\Omega^{-1}$. These are the infeasible weighted estimators based on the true $\Omega^{-1}$.

Lemma 3. Under the assumptions of Theorem 1, $\sqrt{n}\,(\bar\beta^w - \beta) \to_D N(0, R_3^{-1})$ as $k \to \infty$, where $R_3$ is defined in Theorem 5.

Proof. Let $X = (X_1^*, \ldots, X_p^*)$. The $s$th element of $X^{\top}M_B^{\Omega^{-1}}e$ can be decomposed as

$$X_s^{*\top}M_B^{\Omega^{-1}}e = P_s^{*\top}\Omega^{-1}e - P_s^{*\top}\Omega^{-1}B(B^{\top}\Omega^{-1}B)^{-1}B^{\top}\Omega^{-1}e + H_s^{\top}M_B^{\Omega^{-1}}e,$$

where $P_s^* = (\eta_{11s}, \eta_{12s}, \ldots, \eta_{km_ks})^{\top}$, $e = (\varepsilon_{11}, \varepsilon_{12}, \ldots, \varepsilon_{km_k})^{\top}$ and $H_s = (h_s(U_{11}), \ldots, h_s(U_{km_k}))^{\top}$. Combining these with Assumption 1 we have

$$\mathrm{Var}\big(P_s^{*\top}\Omega^{-1}B(B^{\top}\Omega^{-1}B)^{-1}B^{\top}\Omega^{-1}e\big) = E\big\{P_s^{*\top}\Omega^{-1}B(B^{\top}\Omega^{-1}B)^{-1}B^{\top}\Omega^{-1}P_s^*\big\} = O(L_n) = o(k),$$

so that $P_s^{*\top}\Omega^{-1}B(B^{\top}\Omega^{-1}B)^{-1}B^{\top}\Omega^{-1}e = o_p(k^{1/2})$. Furthermore, there exists $f_s$ such that $\max_{1\le i\le k, 1\le j\le m_i}|h_s(U_{ij}) - B^{\top}(U_{ij})f_s| = O(\rho_k)$. Hence

$$\mathrm{Var}\big(H_s^{\top}M_B^{\Omega^{-1}}e\big) \le \mathrm{maxeig}(\Omega)\,(H_s - Bf_s)^{\top}M_B^{\Omega^{-1}}(H_s - Bf_s) \le \mathrm{maxeig}(\Omega)\sum_{i=1}^{k}\sum_{j=1}^{m_i}\big(h_s(U_{ij}) - B^{\top}(U_{ij})f_s\big)^2 = O(k\rho_k^2).$$

This implies $H_s^{\top}M_B^{\Omega^{-1}}e = o_p(k^{1/2})$, so that $X^{\top}M_B^{\Omega^{-1}}e = P^{\top}\Omega^{-1}e + o_p(k^{1/2})$ with $P = (P_1^*, \ldots, P_p^*)$. Moreover, there exists a real vector $\phi$ such that $\max_{1\le i\le k, 1\le j\le m_i}|g(U_{ij}) - B^{\top}(U_{ij})\phi| = O(\rho_k)$. Let $G = (g(U_{11}), \ldots, g(U_{km_k}))^{\top}$. Then we have

$$X_s^{*\top}M_B^{\Omega^{-1}}G = P_s^{*\top}M_B^{\Omega^{-1}}G + H_s^{\top}M_B^{\Omega^{-1}}G \le \big[P_s^{*\top}M_B^{\Omega^{-1}}P_s^*\big]^{1/2}\big[(G - B\phi)^{\top}M_B^{\Omega^{-1}}(G - B\phi)\big]^{1/2} + \tfrac{1}{2}\big[(H_s - Bf_s)^{\top}M_B^{\Omega^{-1}}(H_s - Bf_s) + (G - B\phi)^{\top}M_B^{\Omega^{-1}}(G - B\phi)\big] = O(\rho_k k^{1/2}) + O(k\rho_k^2) = o(k^{1/2}).$$

Following the same lines we can show that $n^{-1}X^{\top}M_B^{\Omega^{-1}}X = n^{-1}P^{\top}\Omega^{-1}P + o_p(1)$. Hence

$$\sqrt{n}\,(\bar\beta^w - \beta) = \sqrt{n}\,(X^{\top}M_B^{\Omega^{-1}}X)^{-1}X^{\top}M_B^{\Omega^{-1}}e + \sqrt{n}\,(X^{\top}M_B^{\Omega^{-1}}X)^{-1}X^{\top}M_B^{\Omega^{-1}}G = \sqrt{n}\,(P^{\top}\Omega^{-1}P)^{-1}P^{\top}\Omega^{-1}e + o_p(1).$$

Lemma 3 then follows from the central limit theorem. □

Lemma 4. Under the assumptions of Theorem 1,

$$\|\bar g^w - g\|_{L_2}^2 = O_p(L_n k^{-1} + \rho_k^2) = O_p(L_n k^{-1} + L_n^{-4})$$

and $(\psi^w(u))^{-1/2}(\bar g^w(u) - g(u)) \to_D N(0, 1)$ as $k \to \infty$, where $\psi^w(u)$ is defined in Theorem 6.

Proof. Applying Lemma 3, Lemma 4 follows along the same lines as the proofs of Theorem 2 and Corollary 1 in Huang et al. (2004). □

Lemma 5. Under the conditions of Theorem 8, with probability approaching 1, for any given $\beta_1^*$ such that $\|\beta_1^* - \beta_1\| = O_p(k^{-1/2})$ and any constant $c$,

$$L\{(\beta_1^{*\top}, 0^{\top})^{\top}\} = \min_{\|\beta_2^*\| \le ck^{-1/2}} L\{(\beta_1^{*\top}, \beta_2^{*\top})^{\top}\}.$$

Proof. Similar to the proof of Theorem 1 in Fan and Li (2001). □

Proof of Theorem 1. Similar to the proof of Lemma 3. □

Proof of Theorem 2. The proof is the same as that of Lemma 4. □

Proof of Theorem 3. According to its definition, $\hat\sigma_\mu^2$ can be decomposed as

$$\hat\sigma_\mu^2 = \frac{1}{\sum_{i=1}^{k} m_i(m_i-1)}\Bigg\{\sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}\varepsilon_{ij_1}\varepsilon_{ij_2} + \sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}X_{ij_1}^{\top}(\beta - \hat\beta)\,X_{ij_2}^{\top}(\beta - \hat\beta) + \sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}\big(g(U_{ij_1}) - \hat g(U_{ij_1})\big)\big(g(U_{ij_2}) - \hat g(U_{ij_2})\big)$$
$$\quad + 2\sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}X_{ij_1}^{\top}(\beta - \hat\beta)\,\varepsilon_{ij_2} + 2\sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}\big(g(U_{ij_1}) - \hat g(U_{ij_1})\big)\varepsilon_{ij_2} + 2\sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}X_{ij_1}^{\top}(\beta - \hat\beta)\big(g(U_{ij_2}) - \hat g(U_{ij_2})\big)\Bigg\}$$
$$= J_1 + \cdots + J_6, \quad\text{say}.$$

Applying Theorems 1 and 2, we can show that $J_i = o_p(k^{-1/2})$ for $i = 2, \ldots, 6$. Moreover, by the definition of $\varepsilon_{ij}$ we have

$$J_1 - \sigma_\mu^2 = \frac{1}{k}\sum_{i=1}^{k}(\mu_i^2 - \sigma_\mu^2) + \frac{2}{\sum_{i=1}^{k} m_i}\sum_{i=1}^{k}\sum_{j=1}^{m_i}\mu_i\nu_{ij} + \frac{1}{\sum_{i=1}^{k} m_i(m_i-1)}\sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}\nu_{ij_1}\nu_{ij_2}.$$

Therefore,

$$\sqrt{k}\,(\hat\sigma_\mu^2 - \sigma_\mu^2) = \frac{\sqrt{k}}{k}\sum_{i=1}^{k}(\mu_i^2 - \sigma_\mu^2) + \frac{2\sqrt{k}}{\sum_{i=1}^{k} m_i}\sum_{i=1}^{k}\sum_{j=1}^{m_i}\mu_i\nu_{ij} + \frac{\sqrt{k}}{\sum_{i=1}^{k} m_i(m_i-1)}\sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}\nu_{ij_1}\nu_{ij_2} + o_p(1)$$
$$= o_p(1) + \frac{1}{\sqrt{k}}\sum_{i=1}^{k}\Bigg[(\mu_i^2 - \sigma_\mu^2) + \frac{2k}{\sum_{i=1}^{k} m_i}\mu_i\sum_{j=1}^{m_i}\nu_{ij} + \frac{k}{\sum_{i=1}^{k} m_i(m_i-1)}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}\nu_{ij_1}\nu_{ij_2}\Bigg] = \frac{1}{\sqrt{k}}\sum_{i=1}^{k}a_i + o_p(1).$$

Obviously the $a_i$'s are independent random variables with mean zero and variance

$$\mathrm{Var}(a_i) = \mathrm{Var}(\mu_1^2) + \frac{4m_i\,\sigma_\mu^2\sigma_\nu^2}{\big(k^{-1}\sum_{i=1}^{k} m_i\big)^2} + \frac{2m_i(m_i-1)\,\sigma_\nu^4}{\big[k^{-1}\sum_{i=1}^{k} m_i(m_i-1)\big]^2}.$$

This implies that $c_1 k \le B_k^2 = \sum_{i=1}^{k}Ea_i^2 \le c_2 k$ for some positive constants $c_1$ and $c_2$. From Lemma 2 and the moment assumptions, simple calculations lead to $E|a_i|^{2+\delta} \le c$ for some $\delta > 0$ and a constant $c$, so that

$$\frac{1}{B_k^{2+\delta}}\sum_{i=1}^{k}E|a_i|^{2+\delta} = O(k^{-\delta/2}) \to 0 \quad\text{as } k \to \infty.$$

Thus (2.4) follows from the central limit theorem. The same argument proves the second assertion of Theorem 3 as well. □

Proof of Theorem 4. Let $\nabla_{ij} = X_{ij}^{\top}(\beta - \hat\beta) + (g(U_{ij}) - \hat g(U_{ij}))$, so that $\hat\varepsilon_{ij} = \varepsilon_{ij} + \nabla_{ij}$. According to the definition of $\hat\sigma_{\mu,4}$, and writing $\sum_{i,j_1\ne j_2}$ for $\sum_{i=1}^{k}\sum_{j_1=1}^{m_i}\sum_{j_2\ne j_1}$, we have

$$\sum_{i=1}^{k}(m_i-1)\,\hat\sigma_{\mu,4} = \sum_{i,j_1\ne j_2}\hat\varepsilon_{ij_1}^2\hat\varepsilon_{ij_2}^2 = K_1 + \cdots + K_6, \quad\text{say},$$

where

$$K_1 = \sum_{i,j_1\ne j_2}\varepsilon_{ij_1}^2\varepsilon_{ij_2}^2,\quad K_2 = \sum_{i,j_1\ne j_2}\nabla_{ij_1}^2\nabla_{ij_2}^2,\quad K_3 = 2\sum_{i,j_1\ne j_2}\nabla_{ij_1}^2\varepsilon_{ij_2}^2,$$
$$K_4 = 4\sum_{i,j_1\ne j_2}\varepsilon_{ij_1}^2\nabla_{ij_2}\varepsilon_{ij_2},\quad K_5 = 4\sum_{i,j_1\ne j_2}\nabla_{ij_1}^2\nabla_{ij_2}\varepsilon_{ij_2},\quad K_6 = 4\sum_{i,j_1\ne j_2}\nabla_{ij_1}\varepsilon_{ij_1}\nabla_{ij_2}\varepsilon_{ij_2}.$$

Further, $K_1$ can be decomposed as

$$K_1 = \Bigg[\sum_{i=1}^{k} m_i(m_i-1)\Bigg](I_{11} + \cdots + I_{16}), \quad\text{say},$$

where, with $c_k = \sum_{i=1}^{k} m_i(m_i-1)$,

$$I_{11} = \frac{1}{c_k}\sum_{i=1}^{k} m_i(m_i-1)\mu_i^4,\quad I_{12} = \frac{1}{c_k}\sum_{i,j_1\ne j_2}\nu_{ij_1}^2\nu_{ij_2}^2,\quad I_{13} = \frac{4}{c_k}\sum_{i=1}^{k}(m_i-1)\mu_i^3\sum_{j=1}^{m_i}\nu_{ij},$$
$$I_{14} = \frac{2}{c_k}\sum_{i=1}^{k}(m_i-1)\mu_i^2\sum_{j=1}^{m_i}\nu_{ij}^2,\quad I_{15} = \frac{4}{c_k}\sum_{i,j_1\ne j_2}\mu_i^2\nu_{ij_1}\nu_{ij_2},\quad I_{16} = \frac{4}{c_k}\sum_{i,j_1\ne j_2}\mu_i\nu_{ij_1}^2\nu_{ij_2}.$$

By the law of large numbers, as $k \to \infty$, $I_{11} \to_p E\mu_1^4$, $I_{12} \to_p \sigma_\nu^4$, $I_{14} \to_p 2\sigma_\mu^2\sigma_\nu^2$ and $I_{1i} \to_p 0$ for $i = 3, 5, 6$. Therefore,

$$\frac{K_1}{\sum_{i=1}^{k} m_i(m_i-1)} - \hat\sigma_\mu^4 - 2\hat\sigma_\mu^2\hat\sigma_\nu^2 - \hat\sigma_\nu^4 + \frac{4\hat\sigma_\mu^2\hat\sigma_\nu^2}{k^{-1}\sum_{i=1}^{k} m_i} + \frac{2\hat\sigma_\nu^4}{k^{-1}\sum_{i=1}^{k} m_i(m_i-1)} \to_p \mathrm{Var}(\mu_1^2) + \frac{4\sigma_\mu^2\sigma_\nu^2}{\lim_{k\to\infty}k^{-1}\sum_{i=1}^{k} m_i} + \frac{2\sigma_\nu^4}{\lim_{k\to\infty}k^{-1}\sum_{i=1}^{k} m_i(m_i-1)}$$

as $k \to \infty$. In addition, by the root-$k$ consistency of $\hat\beta$ and the convergence rate of $\hat g(\cdot)$, together with Hölder's inequality, we can show that $K_i/\sum_{i=1}^{k} m_i(m_i-1) = o_p(1)$ for $i = 2, \ldots, 6$. This proves the first result of Theorem 4; the second follows along the same lines. □

Proof of Theorem 5. According to the definitions of $\hat\beta^w$ and $\bar\beta^w$, together with the identity $a_1a_2 - b_1b_2 = (a_1 - b_1)(a_2 - b_2) + (a_1 - b_1)b_2 + b_1(a_2 - b_2)$, we obtain

¨ In addition, according to the root-k consistency of  b and the convergence rate of  g(·) together with Holder inequality we can show Ki = op (1) for i = 2, . . . , 6. This proves the first result. The second result follows from the same line. w w Proof of Theorem 5. According to the definitions of  b and b¯ , together with the equality a1 b1 − a2 b2 = (a1 − b1 )(a2 − b2 ) + (a1 − b1 )b2 + b1 (a2 − b2 ), we obtain −1

 w w  b − b = b¯ − b + [(X MX B

−1

 + [(X MX B

−1  X)−1 − (X MX X)−1 ][X MX B B

−1

−1 G − X MX G] B

−1

−1  X)−1 − (X MX X)−1 ][X MX B B

−1

−1 −1 X)−1 − (X MX X)−1 ]X MX e B B

 + [(X MX B

−1 G − X MX G] B

−1 −1 X)−1 − (X MX X)−1 ]X MX G B B

 −1  X + (X MX B X) [X MB  + [(X MX B

−1

−1

−1

 −1  X + (X MX B X) [X MB

−1

e − X MX e] B

−1

e − X MX e], B

where e = (11 , . . . , 1m1 , . . . , km ) . Therefore, to prove Theorem 5, by Lemma 3 and the fact that (A + aB)−1 = A−1 − aA−1 BA−1 + k

O(a2 ) as a → 0, it suffices to show that

−1  −1 1  X (X MB X − X MX X) = Op (k−1/2 ), B k

(A.1)

   −1 1 X MX G − X MX G = op (k−1/2 ), B B k

(A.2)

−1  −1 1  X e) = op (k−1/2 ), (X MB e − X MX B k

(A.3)

−1 X) = Op (1) k(X MX B

(A.4)

−1 1  X−1 1 X MB G = op (k−1/2 ), X MX e = Op (k−1/2 ). B k k

(A.5)

and

According to the proof of Lemma 4 we have −1  −1 1  X 1  −1 P − P X−1 P) + op (1). X) = (P X (X MB X − X MX B k k

 −1 P − P X−1 P) is By the root-k consistency of  2 and  2 , the (s1 , s2 )th entry of k−1 (P X ⎤ ⎡ m mi mi  k m k   1 ⎣ i i −2 −2 −2 −2 ( i +   )ij1 s1 ij2 s2 − (i +  )ij s ij s ⎦ 1 1 2 2 k i=1 j1 =1 j2 =1

i=1 j1 =1 j2 =1

⎞ mi k m k   1 ⎝ −2  i −2  +  ijs1 ijs2 −  ijs1 ijs2 ⎠ = Op (k−1/2 ), k ⎛

i=1 j=1

i=1 j=1

694

J. You, X. Zhou / Journal of Statistical Planning and Inference 139 (2009) 679 -- 695

where ijs is the s th element of Pij . This proves (A.1). Next, the proof of (A.1) and the root-k consistency of  2 and  2 lead to  −1 1  X X MB G = op (k−1/2 ), k

1  X−1 X MB G = op (k−1/2 ), k  −1 1  X X MB G = op (k−1/2 ) k

and

1  X−1 X MB G = op (k−1/2 ), k

1   −1 X X e = Op (k−1/2 ). k

These imply (A.2) and (A.5). It remains to show (A.3), which will follow from 1   −1 1 N X e − N X−1 e = op (k−1/2 ). k k

(A.6)

Note that the s th entry of (A.6) is m k m 1  i i −2 −2 [( −2 −2  )ij1 s − (i +  )ij1 s ](i + ij2 ) i + k i=1 j1 =1 j2 =1

+

k m 1   −2 (  ijs − −2  ijs )(i + ij ) = J1 + J2 k

say.

i=1 j=1

By the root-k consistency of  2 and  2 , it is easy to show that J1 = Op (k−1 ) and J2 = Op (k−1 ). Thus (A.6) holds and so the proof is complete.  Proof of Theorem 6. Applying Theorem 4, the proof of Theorem 6 is the same as that of Theorem 2.



Proof of Theorem 7. The proof of Theorem 7 is trivial. □

Proof of Theorem 8. Denote $\delta_k = k^{-1/2} + a_k$. It suffices to show that for any given $\epsilon > 0$ there exists a large constant $c$ such that

$$P\Big\{\inf_{\|u\| = c} L(\beta + \delta_k u) > L(\beta)\Big\} \ge 1 - \epsilon.$$

This implies the existence of a local minimizer in the ball $\{\beta + \delta_k u : \|u\| \le c\}$ with probability at least $1 - \epsilon$. Define $D_k(u) = L(\beta + \delta_k u) - L(\beta)$. Note that $p_{\lambda_j}(0) = 0$ and $p_{\lambda_j}(|\beta_j|)$ is nonnegative. Therefore,

$$k^{-1}D_k(u) \ge \frac{1}{2k}\big[(Y - X(\beta + \delta_k u))^{\top}M_B^{\hat\Omega^{-1}}(Y - X(\beta + \delta_k u)) - (Y - X\beta)^{\top}M_B^{\hat\Omega^{-1}}(Y - X\beta)\big] + \sum_{j=1}^{p_0}\big\{p_{\lambda_j}(|\beta_j + \delta_k u_j|) - p_{\lambda_j}(|\beta_j|)\big\}.$$

It is easy to see that

$$\frac{1}{2k}\big[(Y - X(\beta + \delta_k u))^{\top}M_B^{\hat\Omega^{-1}}(Y - X(\beta + \delta_k u)) - (Y - X\beta)^{\top}M_B^{\hat\Omega^{-1}}(Y - X\beta)\big] = \frac{\delta_k^2}{2k}u^{\top}X^{\top}M_B^{\hat\Omega^{-1}}Xu - \frac{\delta_k}{k}u^{\top}X^{\top}M_B^{\hat\Omega^{-1}}(Y - X\beta) = \frac{\delta_k^2}{2k}u^{\top}X^{\top}M_B^{\hat\Omega^{-1}}Xu - \frac{\delta_k}{k}u^{\top}\big[P^{\top}\hat\Omega^{-1}e + o_p(k^{1/2})\big]. \tag{A.7}$$

Furthermore,

$$\sum_{j=1}^{p_0}\big\{p_{\lambda_j}(|\beta_j + \delta_k u_j|) - p_{\lambda_j}(|\beta_j|)\big\} \le \sqrt{p_0}\,\delta_k a_k\|u\| + \delta_k^2 b_k\|u\|^2 \le c\,\delta_k^2\big(\sqrt{p_0} + c\,b_k\big) \tag{A.8}$$

by a Taylor expansion and the Cauchy–Schwarz inequality. As $b_k \to 0$, by taking $c$ sufficiently large the first (quadratic) term on the right-hand side of (A.7) dominates its second term as well as the bound in (A.8). This completes the proof. □


Proof of Theorem 9. Part (i) follows directly from Lemma 5. We now prove part (ii). Using arguments similar to the proof of Theorem 5, we can show the existence of a $\hat\beta^w_{\lambda,1}$ in Theorem 8 that is a root-$k$ consistent minimizer of $L\{(\beta_1^{\top}, 0^{\top})^{\top}\}$ and satisfies the penalized weighted semiparametric least squares equations $\partial L\{(\hat\beta^{w\top}_{\lambda,1}, 0^{\top})^{\top}\}/\partial\beta_1 = 0$. Following the proof of Theorem 8, we have

$$\frac{\partial L\{(\hat\beta^{w\top}_{\lambda,1}, 0^{\top})^{\top}\}}{\partial\beta_1} = k\Big[-f_{(1)} + O_p(k^{-1/2}) + \{\hat R_{(1)} + o_p(1)\}(\hat\beta^w_{\lambda,1} - \beta_1)\Big] + k\Big[b + D\{1 + o_p(1)\}(\hat\beta^w_{\lambda,1} - \beta_1)\Big],$$

where $f_{(1)}$ consists of the first $p_0$ components of $k^{-1}P^{\top}\Omega^{-1}e$ and $\hat R_{(1)}$ consists of the first $p_0$ rows and columns of $k^{-1}P^{\top}\Omega^{-1}P$. Therefore, similar to the proof of Theorem 5 and by Slutsky's theorem,

$$\sqrt{k}\,\{R_{311} + D\}\big[\hat\beta^w_{\lambda,1} - \beta_1 + (R_{311} + D)^{-1}b\big] \to_D N(0, R_{311}).$$

This completes the proof of Theorem 9. □

References

Ahn, S.C., Schmidt, P., 2000. Estimation of linear panel data models using GMM. In: Mátyás, L. (Ed.), Generalized Method of Moments Estimation. Cambridge University Press, Cambridge (Chapter 8).
Antoniadis, A., Fan, J., 2001. Regularization of wavelets approximations. J. Amer. Statist. Assoc. 96, 939–967.
Baltagi, B.H., 1995. Econometric Analysis of Panel Data. Wiley, New York.
Bickel, P.J., Kwon, J., 2001. Inference for semiparametric models: some questions and an answer. With comments and a rejoinder by the authors. Statist. Sinica 11, 863–960.
Bickel, P.J., Klaassen, C.A., Ritov, Y., Wellner, J.A., 1993. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD.
Carroll, R.J., Lin, X., 2005. Efficient semiparametric marginal estimation for longitudinal/clustered data. J. Amer. Statist. Assoc. 100, 147–157.
Carroll, R.J., Ruppert, D., Welsh, A.H., 1998. Local estimating equations. J. Amer. Statist. Assoc. 93, 214–227.
Chen, K., Jin, Z., 2006. Partial linear regression models for clustered data. J. Amer. Statist. Assoc. 101, 195–204.
de Boor, C., 1978. A Practical Guide to Splines. Springer, New York.
Fan, J., Huang, T., 2005. Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli 11, 1031–1057.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Fan, J., Li, R., 2004. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J. Amer. Statist. Assoc. 99, 710–723.
Fan, J., Li, R., 2006. An overview on nonparametric and semiparametric techniques for longitudinal data. In: Fan, J., Koul, H.L. (Eds.), Frontiers in Statistics. Imperial College Press, pp. 277–303.
Fuller, W.A., Battese, G.E., 1974. Estimation of linear models with cross-error structure. J. Econometrics 2, 67–78.
Hansen, M.H., Kooperberg, C., 2002. Spline adaptation in extended linear models (with discussion). Statist. Sci. 17, 2–51.
He, X., Zhu, Z., Fung, W.K., 2002. Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika 89, 579–590.
Hong, S.Y., 2002. Normal approximation rate and bias reduction for data-driven kernel smoothing estimator in a semiparametric regression model. J. Multivariate Anal. 80, 1–20.
Honore, B.E., 1992. Trimmed LAD and least squares estimation of truncated and censored regression models with fixed effects. Econometrica 60, 533–565.
Hsiao, C., 1985. Benefits and limitations of panel data. Econometric Rev. 4, 121–174.
Hsiao, C., 1986. Analysis of Panel Data. Cambridge University Press, Cambridge.
Huang, J.Z., Wu, C.O., Zhou, L., 2004. Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statist. Sinica 14, 763–788.
Kniesner, T., Li, Q., 1994. Semiparametric panel data models with dynamic adjustment: theoretical considerations and an application to labor supply. Manuscript, University of Guelph.
Li, Q., Ullah, A., 1998. Estimating partially linear panel data models with one-way error components. Econometric Rev. 17 (2), 145–166.
Lin, D.Y., Ying, Z., 2001. Semiparametric and nonparametric regression analysis of longitudinal data (with discussion). J. Amer. Statist. Assoc. 95, 1045–1056.
Lin, X., Carroll, R.J., 2006. Semiparametric estimation in general repeated measures problems. J. Roy. Statist. Soc. Ser. B 68, 69–88.
Lin, X., Wang, N., Welsh, A.H., Carroll, R.J., 2004. Equivalent kernels of smoothing splines in nonparametric regression for longitudinal/clustered data. Biometrika 91, 177–194.
Mundra, K., 1997. Immigration and U.S. bilateral trade flows: a semiparametric empirical analysis. Manuscript, University of California, Riverside.
Rice, J.A., Wu, C.O., 2001. Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics 57, 253–259.
Roy, N., 1997. Nonparametric and semiparametric analysis of panel data models: an application to calorie–income relation for rural South India. Ph.D. Thesis, University of California, Riverside.
Ruckstuhl, A.F., Welsh, A.H., Carroll, R.J., 2000. Nonparametric function estimation of the relationship between two repeatedly measured variables. Statist. Sinica 10, 51–71.
Schumaker, L.L., 1980. Spline Functions: Basic Theory. Pure and Applied Mathematics, A Wiley-Interscience Publication. Wiley, New York.
Speckman, P., 1988. Kernel smoothing in partial linear models. J. Roy. Statist. Soc. Ser. B 50, 413–436.
Stout, W.F., 1974. Almost Sure Convergence. Academic Press, New York.
Wansbeek, T.J., 1989. An alternative heteroscedastic error components model. Solution 88.1.1. Econometric Theory 5, 326.
Zeger, S.L., Diggle, P.J., 1994. Semiparametric model for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics 50, 689–699.
Zhang, D., Lin, X., Raz, J., Sowers, M., 1998. Semiparametric stochastic mixed models for longitudinal data. J. Amer. Statist. Assoc. 93, 710–719.