Penalized quadratic inference functions for semiparametric varying coefficient partially linear models with longitudinal data


Journal of Multivariate Analysis 132 (2014) 94–110


Ruiqin Tian a, Liugen Xue a,∗, Chunling Liu b

a College of Applied Sciences, Beijing University of Technology, Beijing, 100124, China
b Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong, China

Article history: Received 29 March 2012; Available online 8 August 2014
AMS subject classifications: 62G08; 62G20
Keywords: Semiparametric varying coefficient partially linear models; Variable selection; Longitudinal data; Quadratic inference functions

Abstract

In this paper, we focus on the variable selection for semiparametric varying coefficient partially linear models with longitudinal data. A new variable selection procedure is proposed based on the combination of the basis function approximations and quadratic inference functions. The proposed procedure simultaneously selects significant variables in the parametric components and the nonparametric components. With appropriate selection of the tuning parameters, we establish the consistency and asymptotic normality of the resulting estimators. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed variable selection procedure. We further illustrate the proposed procedure by an application. © 2014 Elsevier Inc. All rights reserved.

1. Introduction

As introduced in [6,13], the varying coefficient model provides a natural and useful extension of the classical linear regression model by allowing the regression coefficients to depend on certain covariates. Due to its flexibility to explore the dynamic features which may exist in the data and its easy interpretation, the varying coefficient model has been widely applied in many scientific areas. It has also experienced rapid developments in both theory and methodology; see [11] for a comprehensive survey. Zhang et al. [30] noticed that in practice some of the coefficients are constant rather than varying and proposed the so-called semiparametric varying coefficient partially linear models. Statistically, treating constant coefficients as varying will degrade estimation efficiency.

On the other hand, longitudinal data occur very frequently in biomedical studies. Qu et al. [21] proposed the method of quadratic inference functions (QIF). It avoids estimating the nuisance correlation structure parameters by assuming that the inverse of the working correlation matrix can be approximated by a linear combination of several known basis matrices. The QIF can efficiently take the within-cluster correlation into account and is more efficient than the GEE approach when the working correlation is misspecified. Qu and Li [20] applied the QIF method to varying coefficient models for longitudinal data. Bai et al. [3] extended the QIF method to the semiparametric partial linear model. Dziak et al. [7] gave an overview of QIF approaches for longitudinal data.

It is well known that variable selection is an important topic in all regression analyses; therefore, many procedures and criteria have been developed for it, especially for linear regression analysis. For example, some simple and commonly



Corresponding author. E-mail addresses: [email protected] (R. Tian), [email protected] (L. Xue).

http://dx.doi.org/10.1016/j.jmva.2014.07.015 0047-259X/© 2014 Elsevier Inc. All rights reserved.

R. Tian et al. / Journal of Multivariate Analysis 132 (2014) 94–110

95

used approaches include stepwise selection and subset selection. Other selection criteria include the Akaike [2] information criterion (AIC), Mallows' [19] Cp and the Bayesian information criterion (BIC; Schwarz [23]). Nevertheless, these selection methods suffer from expensive computational costs. As computational efficiency is more desirable in many situations, various shrinkage methods have been developed, which include but are not limited to: the nonnegative garrotte [5], the LASSO [24], the bridge regression [12], the SCAD [9], and the one-step sparse estimator [34]. Furthermore, a number of works have extended the shrinkage estimation methods to nonparametric models. For example, Fan and Li [10] proposed to use the SCAD penalty for variable selection in longitudinal data analysis. Li and Liang [17] carefully studied variable selection for varying coefficient partially linear models, where the parametric components are identified via the SCAD [9], but the nonparametric components are selected via a generalized likelihood ratio test instead of a shrinkage method. Recently, Zhao and Xue [31] proposed a variable selection method to select significant variables in the parametric components and the nonparametric components simultaneously. Zhao and Xue [32] developed a variable selection procedure by combining basis function approximations for semiparametric varying coefficient partially linear models when the covariates in the parametric and nonparametric components are all measured with errors.

Various methods are available for fitting the semiparametric varying coefficient partially linear models with longitudinal data, such as the kernel smoothing method, the empirical likelihood method and the penalized spline method. Ahmad et al. [1] proposed a general series method to estimate semiparametric varying coefficient partially linear models. Fan and Huang [8] developed a profile least-squares technique for the estimation of the parametric component. Xue and Zhu [29] considered the empirical likelihood inference for a varying coefficient model with longitudinal data. In addition, Wang et al. [25] proposed a group SCAD procedure for variable selection in pure varying coefficient models with longitudinal data.

In this paper, we extend the group SCAD variable selection procedure to the semiparametric varying coefficient partially linear models with longitudinal data, and propose a variable selection procedure based on basis function approximations with quadratic inference functions. Furthermore, with a proper choice of tuning parameters, we show that this variable selection procedure is consistent, and the estimators of the regression coefficients have the oracle property. Here, the oracle property means that the estimators of the nonparametric components achieve the optimal convergence rate, and the estimators of the parametric components have the same asymptotic distribution as that based on the correct submodel. We adopt the penalized QIF method for semiparametric varying coefficient partially linear models. Compared with Zhao and Xue [31,32], we consider longitudinal data and incorporate the correlation structure. Although Xue et al. [28] considered the penalized QIF for the generalized additive model, the model they considered is just a special case of the semiparametric varying coefficient model; in addition, they only selected the nonparametric functions.
However, our method can select significant variables in the parametric components and nonparametric components simultaneously.

The rest of this paper is organized as follows. In Section 2 we first propose a variable selection procedure for the semiparametric varying coefficient partially linear models with longitudinal data. Asymptotic properties of the resulting estimators are considered in Section 3. In Section 4 we give the computation of the penalized QIF estimator. In Section 5 we carry out simulation studies to assess the finite sample performance of the method. We further illustrate the proposed methodology via a real data analysis in Section 6. The article is concluded with a brief discussion in Section 7. Some assumptions and the technical proofs of all asymptotic results are provided in the Appendix.

2. Variable selection based on quadratic inference functions

Consider a longitudinal study with n subjects and mi observations over time for the ith subject (i = 1, ..., n), for a total of N = Σ_{i=1}^n mi observations. Each observation consists of a response variable Yij and the covariate vectors Xij ∈ R^p, Zij ∈ R^q, Uij ∈ R taken from the ith subject. We assume that the observations from different subjects are independent, but that those within the same subject are dependent. The semiparametric varying coefficient partially linear models with longitudinal data have the form

Yij = Xij^T β + Zij^T α(Uij) + εij,   i = 1, ..., n, j = 1, ..., mi,   (2.1)

where β = (β1, ..., βp)^T is a p × 1 vector of unknown regression coefficients and α(u) = (α1(u), ..., αq(u))^T is a q × 1 vector of unknown functions. Here, we assume that u ranges over a nondegenerate compact interval which, without loss of generality, is taken to be the unit interval [0, 1]. We further give assumptions on the first two moments of the observations {Yij}. Let E(Yij) = μij and Var(Yij) = υ(μij), where υ(·) is a known variance function. Following He et al. [15], we replace α(·) by its basis function approximation. More specifically, let B(u) = (B1(u), ..., BL(u))^T be B-spline basis functions of order M, where L = K + M and K is the number of interior knots. We use the B-spline basis functions because they have bounded support and are numerically stable [22]. The spline approach also treats a nonparametric function as a linear function with the basis functions as pseudo-design variables, so any computational algorithm developed for generalized linear models can be used for the semiparametric varying coefficient partially linear models. Of course, the B-spline is not the only choice of nonparametric approximation method here; for example, Qu and Li [20] applied the QIF approach and the penalized spline approximation method together for varying coefficient models with longitudinal data. Then, similar to He et al. [14], αk(u) can be approximated by

αk(u) ≈ B(u)^T γk,   k = 1, ..., q,   (2.2)

where γk is an L × 1 vector of unknown regression coefficients.
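To make the construction of B(u) concrete, the following is a minimal numerical sketch (not part of the original paper) that evaluates a B-spline basis of order M on [0, 1] via the Cox–de Boor recursion; the function name bspline_design and the choice of interior knots in the example are illustrative assumptions.

    import numpy as np

    def bspline_design(u, interior_knots, order=4):
        """Evaluate the L = K + M B-spline basis functions of a given order (degree = order - 1)
        at the points u in [0, 1] via the Cox-de Boor recursion; returns a len(u) x L matrix."""
        u = np.asarray(u, dtype=float)
        # full knot sequence: boundary knots 0 and 1 repeated `order` times
        t = np.concatenate([np.zeros(order), np.sort(interior_knots), np.ones(order)])
        # degree-0 (piecewise constant) basis functions
        B = np.zeros((len(u), len(t) - 1))
        for j in range(len(t) - 1):
            B[:, j] = (u >= t[j]) & (u < t[j + 1])
        B[u == 1.0, len(t) - order - 1] = 1.0        # assign the right endpoint to the last interval
        # raise the degree one step at a time
        for d in range(1, order):
            B_new = np.zeros((len(u), len(t) - 1 - d))
            for j in range(len(t) - 1 - d):
                left = (u - t[j]) / (t[j + d] - t[j]) if t[j + d] > t[j] else 0.0
                right = (t[j + d + 1] - u) / (t[j + d + 1] - t[j + 1]) if t[j + d + 1] > t[j + 1] else 0.0
                B_new[:, j] = left * B[:, j] + right * B[:, j + 1]
            B = B_new
        return B                                      # columns are B_1(u), ..., B_L(u)

    # example: cubic splines (M = 4) with K = 3 interior knots, so L = 7 basis functions
    Bu = bspline_design(np.linspace(0, 1, 5), interior_knots=[0.25, 0.5, 0.75])
    print(Bu.shape)                                   # (5, 7); each row sums to 1

With this design matrix, approximating αk(u) amounts to the linear expression in (2.2) with pseudo-design columns B1(u), ..., BL(u).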


We denote Wij = Iq ⊗ B(Uij) · Zij, where Iq is the q × q identity matrix and ''⊗'' is the Kronecker product, Pij = (Xij^T, Wij^T)^T, θ = (θ1, ..., θ_{p+qL})^T = (β1, ..., βp; γ1, ..., γq)^T, Yi = (Yi1, ..., Yimi)^T, and we write Xi, Zi, Ui and Pi in a similar fashion. Then, the generalized estimating equations (GEE) can be defined as follows:

Σ_{i=1}^n Pi^T Vi^{-1} (Yi − Pi θ) = 0,   (2.3)

where Vi is the covariance matrix of Yi. Following Liang and Zeger [18], we simplify the covariance of the ith subject by taking Vi = Ai^{1/2} Ri(ρ) Ai^{1/2}, where Ai = diag{Var(Yi1), ..., Var(Yimi)} and Ri(ρ) is a common working correlation matrix with nuisance parameters ρ. Based on the estimation theory associated with the working correlation structure, the GEE estimator of the regression coefficient proposed by Liang and Zeger [18] is consistent if consistent estimators of the nuisance parameters ρ can be obtained; for suggested methods for estimating ρ, see [18]. However, even in some simple cases, consistent estimators of ρ do not always exist. To avoid this drawback, Qu et al. [21] introduced the quadratic inference functions by assuming that the inverse of the working correlation matrix R^{-1}(ρ) can be approximated by a linear combination of a class of basis matrices as

Σ_{k=1}^s ak Mk,   (2.4)

where M1, ..., Ms are known matrices and a1, ..., as are unknown constants. This is a sufficiently rich class that accommodates, or at least approximates, the correlation structures most commonly used. Details of utilizing a linear combination of some basis matrices to model the inverse of the working correlation can be found in [21]. Here, we illustrate with a simple example which can be found in [21]. Suppose R(ρ) is the first-order autoregressive correlation matrix, with Rij = ρ^{|i−j|}. The exact inverse R^{-1}(ρ) can be written as a linear combination of three basis matrices M1, M2 and M3, where M1 is the identity matrix, M2 has 1 on the two main off-diagonals and 0 elsewhere, and M3 has 1 at the corners (1, 1) and (m, m) and 0 elsewhere. Here a1 = (1 + ρ^2)/k, a2 = −ρ/k and a3 = −ρ^2/k, where k = 1 − ρ^2 and m is the dimension of R(ρ). The third term of the inverse in this example captures the edge effect of the AR(1) process, which the two-term approximation does not. Substituting (2.4) into (2.3), we get the estimating equations

Σ_{i=1}^n Pi^T Ai^{-1/2} (a1 M1 + ··· + as Ms) Ai^{-1/2} (Yi − Pi θ) = 0.   (2.5)
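As a quick numerical check of this basis-matrix decomposition (our own illustration, not from the paper; the helper name ar1_basis is ours), the following sketch builds M1, M2, M3 for a given dimension m and verifies that a1 M1 + a2 M2 + a3 M3 reproduces the exact inverse of the AR(1) correlation matrix.

    import numpy as np

    def ar1_basis(m):
        """Basis matrices for the inverse of an m x m AR(1) working correlation."""
        M1 = np.eye(m)
        M2 = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)  # 1 on the two main off-diagonals
        M3 = np.zeros((m, m))
        M3[0, 0] = M3[-1, -1] = 1.0                                    # 1 at the (1,1) and (m,m) corners
        return M1, M2, M3

    m, rho = 5, 0.6
    R = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))   # AR(1) correlation, R_ij = rho^|i-j|
    k = 1 - rho ** 2
    a1, a2, a3 = (1 + rho ** 2) / k, -rho / k, -rho ** 2 / k
    M1, M2, M3 = ar1_basis(m)
    print(np.allclose(np.linalg.inv(R), a1 * M1 + a2 * M2 + a3 * M3))  # True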

Unlike the GEE method, we need not find estimates of the parameters a = (a1, ..., as) by optimizing some function of the information matrix associated with (2.3). Instead, based on the form of the quasi-score, we define the extended score gn(θ) to be

gn(θ) = (1/n) Σ_{i=1}^n gi(θ),   gi(θ) = ( Pi^T Ai^{-1/2} M1 Ai^{-1/2} (Yi − Pi θ); ⋯ ; Pi^T Ai^{-1/2} Ms Ai^{-1/2} (Yi − Pi θ) ),   (2.6)

where the s blocks of gi(θ) are stacked vertically.

The estimating equations (2.5) are linear combinations of elements of the extended score vector (2.6). We define the QIF to be

Qn(θ) = n gn^T(θ) Ωn^{-1}(θ) gn(θ),   (2.7)

where

Ωn(θ) = (1/n) Σ_{i=1}^n gi(θ) gi^T(θ).

Note that Ωn depends on θ. The QIF estimate θ̂n is then given by

θ̂n = arg min_θ Qn(θ).
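The extended score, the weighting matrix Ωn and the objective Qn translate directly into code. Below is a minimal sketch (our own illustration, not the authors' implementation) that assumes each subject's data are available as a triple (Pi, Yi, Ai) with Ai diagonal and that Ωn(θ) is invertible; the names qif, data and basis_mats are hypothetical.

    import numpy as np

    def qif(theta, data, basis_mats):
        """Quadratic inference function Q_n(theta) of (2.7), built from the extended score (2.6).
        data: list of (P_i, Y_i, A_i) per subject; basis_mats: the basis matrices M_1, ..., M_s."""
        n = len(data)
        scores = []
        for P, Y, A in data:
            A_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(A)))
            resid = Y - P @ theta
            # stack the s blocks P_i^T A_i^{-1/2} M_h A_i^{-1/2} (Y_i - P_i theta)
            scores.append(np.concatenate([P.T @ A_inv_sqrt @ M @ A_inv_sqrt @ resid for M in basis_mats]))
        G = np.vstack(scores)                                   # n x (s * dim(theta))
        g_n = G.mean(axis=0)                                    # extended score (2.6)
        Omega_n = (G[:, :, None] * G[:, None, :]).mean(axis=0)  # (1/n) sum_i g_i g_i^T
        return n * g_n @ np.linalg.solve(Omega_n, g_n)          # Q_n(theta) in (2.7)

An unpenalized QIF estimate can then be obtained by passing this function to a generic optimizer, e.g. scipy.optimize.minimize(qif, theta0, args=(data, basis_mats)).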

In practice, the true model is often unknown in advance. An overfitted model can degrade the efficiency of the parameter estimates and predictions, while an underfitted model can yield biased estimates and predicted values. This motivates us to apply the penalized QIF approach to simultaneously estimate parameters and select important variables. To this end, similar to Fan and Li [9], we define the penalized QIF as

Qn^p(θ) = Qn(θ) + n Σ_{k=1}^q p_{λ1k}(∥γk∥_H) + n Σ_{l=1}^p p_{λ2l}(|βl|),   (2.8)

where ∥γk∥_H = (γk^T H γk)^{1/2}, H = (hij)_{L×L} is a matrix with hij = ∫_0^1 Bi(u) Bj(u) du, and pλ(·) is the SCAD penalty function, with λ as a tuning parameter (see [9]), defined through

p'_λ(ω) = λ { I(ω ≤ λ) + ((aλ − ω)_+ / ((a − 1)λ)) I(ω > λ) }


with a > 2, ω > 0 and pλ(0) = 0. Here, the tuning parameter λ is not necessarily the same for all αk(·) and βl, so we denote it by λ1k and λ2l, respectively, in (2.8).

Let θ̂ be the solution obtained by minimizing (2.8). Then β̂ = (θ̂1, ..., θ̂p)^T is the penalized QIF estimator of β, γ̂ = (θ̂_{p+1}, ..., θ̂_{p+qL})^T, and the estimator of αk(u) can be obtained by α̂k(u) = B(u)^T γ̂k.

3. Asymptotic properties

3.1. Basic theoretical properties

We next study the asymptotic properties of the resulting penalized QIF estimators. We first introduce some notation. Let α0(·) and β0 denote the true values of α(·) and β, respectively. In addition, γ0 is the spline coefficient vector from the spline approximation to α0(·). For ease of presentation and without loss of generality, it is assumed that β0l, l = 1, ..., p1, are all the p1 nonzero components of β0 and that β0l = 0, l = p1 + 1, ..., p. Furthermore, we assume that α0k(·) = 0, k = q1 + 1, ..., q, and that α0k(·), k = 1, ..., q1, are all the nonzero components of α0(·).

First, the following theorem gives the consistency of the penalized QIF estimators of the nonparametric components, which means that the estimators of the nonparametric components achieve the optimal convergence rate.

Theorem 1. Under conditions C1–C10 in the Appendix and with the number of knots K = O(N^{1/(2r+1)}), we have

∥α̂k(·) − α0k(·)∥ = Op(n^{−r/(2r+1)}),   k = 1, ..., q,

where r is defined in condition C7 in the Appendix.

Furthermore, we show that the penalized QIF estimators possess the sparsity property, as stated in the following theorem.

Theorem 2. Under the same mild conditions as in Theorem 1, let λmax = max_{k,l}{λ1k, λ2l} and λmin = min_{k,l}{λ1k, λ2l}. If λmax → 0 and √n λmin → ∞, then, with probability tending to 1, β̂ and α̂(·) must satisfy
(i) β̂l = 0, l = p1 + 1, ..., p;
(ii) α̂k(·) ≡ 0, k = q1 + 1, ..., q.

Next, we show that the estimators of the nonzero coefficients in the parametric components have the same asymptotic distribution as that based on the correct submodel. To demonstrate this, we need more notation to present the asymptotic properties of the resulting estimators. Let β* = (β1, ..., β_{p1})^T, α*(·) = (α1(·), ..., α_{q1}(·))^T, and let β0* and α0*(·) be the true values of β* and α*(·), respectively. In addition, let γ* = (γ1^T, ..., γ_{q1}^T)^T and γ0* = (γ01^T, ..., γ0q1^T)^T denote the spline coefficients of α*(·) and α0*(·), respectively. The corresponding covariates are denoted by Xi*, Zi* and Wi*, i = 1, ..., n. In addition, let

 ⊗2

A = E X ∗T τ − E (Z ∗T diag(τ )X ∗ |u)E (Z ∗T diag(τ )Z ∗ |u)−1 E (Z ∗T τ |u)





B=E X

∗T

τ − E (Z diag(τ )X |u)E (Z diag(τ )Z |u) E (Z τ |u)ε ∗T



∗T



−1

∗T

, ⊗2

,

where A⊗2 = AAT , X ∗ = (X1∗T , . . . , Xn∗T )T , Z ∗ = (Z1∗T , . . . , Zn∗T )T , τ = (τ1 , . . . , τn )T and τi can be found in the Appendix (A.8). The following result states the asymptotic normality of βˆ ∗ .

Theorem 3. Suppose that the regularity conditions C1–C10 in the Appendix hold and that the number of knots is K = O(N^{1/(2r+1)}). Then

√n (β̂* − β0*) →_L N(0, Σ),

where Σ = A^{-1} B A^{-1} and ''→_L'' denotes convergence in distribution.

3.2. Selection of tuning parameters

By the results of Theorems 1–3, we know that the proposed variable selection procedure possesses the oracle property. However, this attractive feature relies on the tuning parameters: the penalty functions p_{λ1k}(·) and p_{λ2l}(·) involve the tuning parameters λ1k and λ2l, which control the amount of penalty. Many selection criteria, such as cross validation (CV), generalized cross validation (GCV), AIC and BIC, can be used to select the tuning parameters. Wang et al. [26] suggested using BIC for the SCAD estimator in linear models and partially linear models, and proved its model selection consistency, i.e., the optimal parameter chosen by BIC can identify the true model with probability tending to one. Hence, we follow their suggestion throughout this paper, and BIC will be used to choose the optimal {λi, i = 1, ..., p + q},


which is equal to either {λ1k, k = 1, ..., q} or {λ2l, l = 1, ..., p}. Nevertheless, in real applications, simultaneously selecting a total of p + q shrinkage parameters is challenging. It is also expected that the choice of λ1k and λ2l should be such that the tuning parameter for a zero coefficient is larger than that for a nonzero coefficient; we can then estimate the large coefficients without bias while shrinking the small coefficients toward zero. Hence, in practice, we simplify the tuning parameters as

λ1k = λ0 / ∥γ̃k^{(0)}∥_H,   λ2l = λ0 / |β̃l^{(0)}|,

where β̃l^{(0)} and γ̃k^{(0)} are the unpenalized QIF estimates. Consequently, the original (p + q)-dimensional problem about the λi becomes a univariate problem about λ0, and λ0 can be selected according to the following BIC-type criterion:

BIC_λ = Qn(θ̂_λ) + df_λ × log(n),   (3.1)

where θ̂_λ = (β̂_λ^T, γ̂_{1λ}^T, ..., γ̂_{qλ}^T)^T is the regression coefficient estimated by minimizing the penalized QIF in (2.8) for a given λ, and df_λ is the number of nonzero components among β̂_λ and ∥γ̂_{1λ}∥_H, ..., ∥γ̂_{qλ}∥_H. We choose the optimal tuning parameter λ by minimizing the BIC-type criterion (3.1). In fact, such a choice of tuning parameters is, in some sense, the same as the idea of Zou [33], Wang et al. [26] and Wang and Xia [27]. To investigate the theoretical properties of the BIC selector, we denote by S = {j1, ..., jd} (0 ≤ d ≤ p + q) the set of indices of the covariates in a given candidate model, which contains indices of both Xij and Zij. We let ST be the true model, SF the full model, and Sλ the set of indices of the covariates selected by the penalized QIF procedure with tuning parameter λ.
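In code, the BIC-type selection of λ0 is a one-dimensional grid search. The sketch below is only illustrative: fit_penalized_qif is a hypothetical routine assumed to return the penalized QIF estimate and its degrees of freedom df_λ for a given λ0, and qif refers to the Qn sketch given in Section 2.

    import numpy as np

    def select_lambda0(lambda0_grid, fit_penalized_qif, data, basis_mats):
        """Choose lambda_0 by minimizing the BIC-type criterion (3.1)."""
        n = len(data)
        best = None
        for lam0 in lambda0_grid:
            # fit_penalized_qif is assumed to return (theta_hat, df), where df counts the
            # nonzero components of beta_hat and of the norms ||gamma_k||_H
            theta_hat, df = fit_penalized_qif(lam0, data, basis_mats)
            bic = qif(theta_hat, data, basis_mats) + df * np.log(n)
            if best is None or bic < best[0]:
                best = (bic, lam0, theta_hat)
        return best   # (BIC value, selected lambda_0, corresponding estimate)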

We next present the asymptotic property of the BIC-type tuning parameter selector.

Theorem 4. Suppose that the regularity conditions C1–C10 in the Appendix hold. Then the tuning parameter λ̂ selected by the BIC criterion (3.1) identifies the true model consistently, i.e.,

P(S_{λ̂} = ST) → 1   as n → ∞.   (3.2)

Theorem 4 demonstrates that the BIC tuning parameter selector enables us to select the true model consistently.

4. Issues in practical implementation

First, note that the first two derivatives of the unpenalized QIF Qn(θ) are continuous. Around a given point θ^{(0)}, the QIF can be approximated by

Qn(θ) ≈ Qn(θ^{(0)}) + Q̇n(θ^{(0)})^T (θ − θ^{(0)}) + (1/2)(θ − θ^{(0)})^T Q̈n(θ^{(0)}) (θ − θ^{(0)}),

where Q̇n(·) and Q̈n(·) stand for the first and second partial derivatives of the function Qn(·). Also, given an initial value t0, we can approximate the penalty function pλ(t) by a quadratic function [9]:

pλ(|t|) ≈ pλ(|t0|) + (1/2) {p'_λ(|t0|) / |t0|} (t^2 − t0^2),   for t ≈ t0.

Therefore, the penalized QIF (2.8) can be locally approximated, up to a constant term, by

Qn^p(θ) ≈ Qn(θ^{(0)}) + Q̇n(θ^{(0)})^T (θ − θ^{(0)}) + (1/2)(θ − θ^{(0)})^T Q̈n(θ^{(0)}) (θ − θ^{(0)}) + (n/2) θ^T Σλ(θ^{(0)}) θ,

where

Σλ(θ^{(0)}) = diag{ p'_{λ21}(|β1^{(0)}|)/|β1^{(0)}|, ..., p'_{λ2p}(|βp^{(0)}|)/|βp^{(0)}|, (p'_{λ11}(∥γ1^{(0)}∥_H)/∥γ1^{(0)}∥_H) H, ..., (p'_{λ1q}(∥γq^{(0)}∥_H)/∥γq^{(0)}∥_H) H },

θ = (θ1, ..., θ_{p+qL})^T = (β1, ..., βp; γ1^T, ..., γq^T)^T and θ^{(0)} = (θ1^{(0)}, ..., θ_{p+qL}^{(0)})^T = (β1^{(0)}, ..., βp^{(0)}; γ1^{(0)T}, ..., γq^{(0)T})^T. Accordingly, the quadratic minimization problem for Qn^p(θ) leads to a solution iterated by

θ^{(1)} ≈ θ^{(0)} − { Q̈n(θ^{(0)}) + n Σλ(θ^{(0)}) }^{-1} { n Σλ(θ^{(0)}) θ^{(0)} + Q̇n(θ^{(0)}) }.

Finally, the following algorithm summarizes the computation of the penalized QIF estimators of the parameters in semiparametric varying coefficient partially linear models with longitudinal data.

Step 1. Take the ordinary QIF estimates (without penalty) β^{(0)}, γ^{(0)} of β, γ as the initial values.
Step 2. Given the current values {β^{(m)}, γ^{(m)}} with θ^{(m)} = (β^{(m)T}, γ^{(m)T})^T, update them by

θ^{(m+1)} = θ^{(m)} − { Q̈n(θ^{(m)}) + n Σλ(θ^{(m)}) }^{-1} { n Σλ(θ^{(m)}) θ^{(m)} + Q̇n(θ^{(m)}) }.

Step 3. Repeat Step 2 until a convergence criterion is satisfied.
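The following sketch (ours, not the authors' code) spells out one iteration of this algorithm: the SCAD derivative, the matrix Σλ(θ^{(0)}), and the update of Step 2. The gradient Qdot and Hessian Qddot of Qn are assumed to be supplied, e.g. by analytic or numerical differentiation, and the small constant eps guards against division by a (numerically) zero coefficient.

    import numpy as np
    from scipy.linalg import block_diag

    def scad_deriv(omega, lam, a=3.7):
        """SCAD first derivative p'_lambda(omega) for omega >= 0 (Fan and Li [9])."""
        return lam * ((omega <= lam) + np.maximum(a * lam - omega, 0.0) / ((a - 1.0) * lam) * (omega > lam))

    def sigma_lambda(beta, gammas, H, lam2, lam1, eps=1e-8):
        """Block-diagonal Sigma_lambda(theta^(0)): scalar blocks for beta, H-scaled blocks for each gamma_k."""
        beta_block = np.diag([scad_deriv(abs(b), l) / max(abs(b), eps) for b, l in zip(beta, lam2)])
        gamma_blocks = []
        for g, l in zip(gammas, lam1):                 # gammas is a list of the L-vectors gamma_k
            norm_H = max(np.sqrt(g @ H @ g), eps)
            gamma_blocks.append(scad_deriv(norm_H, l) / norm_H * H)
        return block_diag(beta_block, *gamma_blocks)

    def penalized_qif_step(theta, Qdot, Qddot, Sigma, n):
        """One Step-2 update: theta - [Qddot + n*Sigma]^{-1} [n*Sigma*theta + Qdot]."""
        lhs = Qddot(theta) + n * Sigma
        rhs = n * Sigma @ theta + Qdot(theta)
        return theta - np.linalg.solve(lhs, rhs)

Iterating penalized_qif_step until the change in θ falls below a tolerance, and recomputing Σλ at each iterate, mirrors Steps 1–3; coefficients whose entries of Σλ blow up are shrunk to (numerically) zero and can then be set exactly to zero.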


Table 1
Variable selections for the parametric components under different methods.

            n = 150                  n = 200                  n = 300
Method      GMSE     C       IC      GMSE     C       IC      GMSE     C       IC
ρ = 0.7
SCAD        0.0048   5.0000  0       0.0019   5.0000  0       0.0009   5.0000  0
LASSO       0.0039   5.0000  0       0.0020   5.0000  0       0.0010   5.0000  0
ρ = 0.3
SCAD        0.0112   4.9980  0       0.0043   5.0000  0       0.0022   5.0000  0
LASSO       0.0096   4.9980  0       0.0048   5.0000  0       0.0026   5.0000  0

Table 2
Variable selections for the nonparametric component under different methods.

            n = 150                  n = 200                  n = 300
Method      RASE     C       IC      RASE     C       IC      RASE     C       IC
ρ = 0.7
SCAD        0.1358   4.6560  0       0.0986   4.7320  0       0.0697   4.9940  0
LASSO       0.1352   4.5260  0       0.0998   4.6840  0       0.0707   4.9720  0
ρ = 0.3
SCAD        0.1959   4.5380  0       0.1411   4.7060  0       0.1045   4.9920  0
LASSO       0.1976   4.4120  0       0.1447   4.6400  0       0.1067   4.9540  0

5. Simulation studies

In this section we conduct a simulation study to assess the finite sample performance of the proposed procedures. As in [17], the performance of the estimator β̂ is assessed using the generalized mean square error (GMSE), defined as GMSE = (β̂ − β)^T E(XX^T)(β̂ − β). The performance of the estimator α̂(·) is assessed using the square root of average squared errors (RASE),

RASE = { (1/M) Σ_{s=1}^M Σ_{k=1}^q (α̂k(us) − αk(us))^2 }^{1/2},

where us, s = 1, ..., M, are the grid points at which the functions α̂(u) are evaluated. In our simulation, M = 200 is used.

Example 1 (Fixed-Dimensional Setup). In this example, we simulate data from model (2.1), where β = (β1, ..., β8)^T with β1 = 1.5, β2 = 0.8 and β3 = 2, and α(u) = (α1(u), ..., α7(u))^T with α1(u) = 5.5 + 0.1 exp(2u − 1) and α2(u) = sin(2πu); the remaining coefficients, corresponding to the irrelevant variables, are set to zero. To perform this simulation, we take the covariates Xij and Zij (j = 1, ..., 5) from a multivariate normal distribution with mean zero, marginal variance 1 and correlation 0.5, and u ∼ U(0, 1). The response variable Yij is generated according to the model, and the error vector εi = (εi1, ..., εi5)^T ∼ N(0, σ^2 Corr(εi, ρ)), where σ^2 = 1 and Corr(εi, ρ) is a known correlation matrix with parameter ρ used to determine the strength of the within-subject dependence. Here we consider εi with a first-order autoregressive (AR-1) or exchangeable correlation structure with different correlation coefficients. In our simulation study, we consider ρ = 0.7 and ρ = 0.3, representing a strong and a weak within-subject correlation structure, respectively.

In the following simulations, we use cubic B-splines. Similar to He et al. [14], the number of internal knots is taken to be K = ⌊c × N^{1/5}⌋ for simplicity of calculation, and we generate n = 150, 200 and 300 subjects, respectively, where ⌊s⌋ denotes the largest integer not greater than s. To examine the sensitivity to the choice of the number of interior knots, we use different values of c and obtain similar results, which imply that the performance of the variable selection does not depend sensitively on the choice of the number of interior knots [31,32]. To save space, we only report the results with c = 0.6; the other results are similar. In addition, we make 500 simulation runs in each simulation study.

In each simulation study, for each simulated dataset, the proposed estimation procedures for finding the penalized QIF estimators with the SCAD and LASSO penalty functions are considered. The unknown tuning parameters λi, i = 1, ..., p + q, for the penalty functions are chosen by the BIC criterion given in Section 3.2. For each of these methods, the average number of zero coefficients over the 500 simulated datasets is reported in Tables 1 and 2. Note that ''C'' in the tables means the average number of zero regression coefficients that are correctly estimated as zero, and ''IC'' denotes the average number of nonzero regression coefficients that are erroneously set to zero.
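For concreteness, the following sketch generates one data set under the Example 1 design. It is our own illustration: the covariate correlation is taken to be exchangeable at 0.5 (the text only states ''correlation 0.5''), the error correlation is AR(1), and names such as simulate_example1 are hypothetical.

    import numpy as np

    def simulate_example1(n, m=5, rho=0.7, p=8, q=7, seed=0):
        """Generate (X_i, Z_i, U_i, Y_i), i = 1, ..., n, from model (2.1) under the Example 1 design."""
        rng = np.random.default_rng(seed)
        beta = np.zeros(p); beta[:3] = [1.5, 0.8, 2.0]                 # only the first three are nonzero
        def alpha(u):                                                  # varying coefficients; only two nonzero
            a = np.zeros((len(u), q))
            a[:, 0] = 5.5 + 0.1 * np.exp(2 * u - 1)
            a[:, 1] = np.sin(2 * np.pi * u)
            return a
        C = np.full((p + q, p + q), 0.5); np.fill_diagonal(C, 1.0)     # covariate correlation 0.5
        R_err = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))  # AR(1) error correlation
        data = []
        for _ in range(n):
            XZ = rng.multivariate_normal(np.zeros(p + q), C, size=m)
            X, Z = XZ[:, :p], XZ[:, p:]
            U = rng.uniform(0.0, 1.0, size=m)
            eps = rng.multivariate_normal(np.zeros(m), R_err)          # sigma^2 = 1
            Y = X @ beta + np.sum(Z * alpha(U), axis=1) + eps
            data.append((X, Z, U, Y))
        return data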


Fig. 1. The estimate of α1 (·) with n = 150 and ρ = 0.7.

Fig. 2. The estimate of α2 (·) with n = 150 and ρ = 0.7.

From Tables 1 and 2, we can make the following observations. First, for the parametric components, the performance of the variable selection procedures improves as n increases; for example, the values in the column labeled ''C'' become closer and closer to the true number of zero regression coefficients in the models. Second, for the nonparametric components, the performance of the variable selection method is similar to that for the parametric components. As n increases, the RASE of the estimated nonparametric functions also becomes smaller and smaller, which reflects that the estimated curves fit the true curves better as the sample size increases, in agreement with what is seen in Figs. 1–6. Third, the SCAD penalty method outperforms the LASSO penalty approach in terms of the correct variable selection rate, which significantly reduces model uncertainty and complexity.

Next, to examine the effect of using a misspecified correlation structure in the model, we make 500 simulation runs using the above data generation procedure when the true correlation is an exchangeable matrix. We consider two kinds of working correlation matrices, the first-order autoregressive matrix (AR-1) and the exchangeable matrix, and list the results under the different working assumptions in Table 3. From Table 3, we can clearly see that the performance of the variable selection procedure with a misspecified correlation matrix is similar to that with a correctly specified working correlation matrix. This indicates that our method is robust.

Lastly, we compare the performance of variable selection based on the QIF and GEE methods. We again make 500 simulation runs using the above data generation procedure when the true correlation is an exchangeable matrix, and we only consider n = 300 and ρ = 0.7. The simulation results are summarized in Table 4. From Table 4, we can clearly see that the performance of the variable selection procedure based on the QIF method is almost equivalent to that based on the GEE method in terms of model error and model complexity.


Fig. 3. The estimate of α1 (·) with n = 200 and ρ = 0.7.

Fig. 4. The estimate of α2 (·) with n = 200 and ρ = 0.7.

Fig. 5. The estimate of α1 (·) with n = 300 and ρ = 0.7.


Fig. 6. The estimate of α2(·) with n = 300 and ρ = 0.7.

Table 3
Variable selections when the true R is exchangeable with n = 300.

            β                        α(·)
Method      GMSE     C       IC      RASE     C       IC
The working R is AR(1)
ρ = 0.7
SCAD        0.0010   5.0000  0       0.0758   4.9820  0
LASSO       0.0011   4.9980  0       0.0770   4.9660  0
ρ = 0.3
SCAD        0.0020   4.9980  0       0.1066   4.9400  0
LASSO       0.0023   4.9960  0       0.1090   4.8620  0
The working R is exchangeable
ρ = 0.7
SCAD        0.0008   4.9980  0       0.0637   4.9920  0
LASSO       0.0008   5.0000  0       0.0644   4.9860  0
ρ = 0.3
SCAD        0.0017   4.9980  0       0.0947   4.9860  0
LASSO       0.0020   4.9960  0       0.0961   4.9700  0

In particular, for the parametric components, the performance based on the QIF method is more effective than that of the GEE method. These results imply that the variable selection procedure based on the QIF method is workable.

Example 2 (High-Dimensional Setup). In this example, we discuss how the proposed variable selection procedure can be applied to the ''large n, diverging p/q'' setup for longitudinal models. We consider the high-dimensional setup of the models (2.1), where β is a p-dimensional vector of parameters with p = ⌊6n^{1/4}⌋ − 4 for n = 200 and 300. The true coefficient vector is β0 = (1.5, 0.8, 2, 0_{p−3})^T, where 0_m denotes an m-vector of zeros. In addition, α(u) = (α1(u), α2(u), 0_{q−3})^T, where q = ⌊3.9n^{1/4}⌋ − 4, and α1(u) and α2(u) are defined in Example 1. The other settings are the same as in Example 1. The summary of the simulation results is reported in Tables 5 and 6. It is easy to see from Tables 5 and 6 that the proposed variable selection method is able to correctly identify the true submodel and works remarkably well, even in the ''large n, diverging p/q'' setup for the longitudinal models (2.1).

6. A real example

We now illustrate the variable selection procedure proposed in this paper through the analysis of a dataset from the Multicenter AIDS Cohort Study. The dataset contains the human immunodeficiency virus (HIV) status of 283 homosexual men who were infected with HIV during the follow-up period between 1984 and 1991. This dataset has been used by many authors to illustrate varying coefficient models [16,32] and semiparametric models [10].


Table 4
Variable selections based on the QIF and GEE methods.

            β                        α(·)
Method      GMSE     C       IC      RASE     C       IC
QIF
SCAD        0.0012   5.0000  0       0.0640   4.9980  0
LASSO       0.0012   5.0000  0       0.0647   4.9940  0
GEE
SCAD        0.0012   4.9820  0       0.0630   4.9960  0
LASSO       0.0014   4.9500  0       0.0664   4.9960  0

Table 5
Variable selections for the parametric components under the high-dimensional setup.

            n = 200, p = 18           n = 300, p = 20
Method      GMSE     C        IC      GMSE     C        IC
ρ = 0.7
SCAD        0.0014   14.9980  0       0.0009   17.0000  0
LASSO       0.0015   14.9940  0       0.0009   16.9980  0
ρ = 0.3
SCAD        0.0034   14.9960  0       0.0021   16.9980  0
LASSO       0.0036   14.9860  0       0.0022   16.9920  0

Table 6
Variable selections for the nonparametric components under the high-dimensional setup.

            n = 200, q = 10          n = 300, q = 12
Method      RASE     C       IC      RASE     C       IC
ρ = 0.7
SCAD        0.0854   7.7740  0       0.0663   9.9960  0
LASSO       0.0871   7.5780  0       0.0663   9.9700  0
ρ = 0.3
SCAD        0.1288   7.7260  0       0.0990   9.9960  0
LASSO       0.1321   7.4700  0       0.0995   9.9680  0

The objective of the analysis is to describe the trend of the mean CD4 percentage depletion over time and to evaluate the effects of smoking, the pre-HIV infection CD4 percentage, and age at HIV infection on the mean CD4 percentage after infection. Huang et al. [16] indicated that, at significance level 0.05, neither smoking nor age has a significant impact on the mean CD4 percentage; in addition, the preCD4 has a constant effect over time. This motivates us to use the varying coefficient partially linear models for this dataset and to use variable selection techniques to select a parsimonious model. Let Y be the individual's CD4 percentage, X1 the centered preCD4 percentage, X2 the smoking status (1 for a smoker and 0 for a nonsmoker), and X3 the centered age at HIV infection. In addition, it is also of interest to examine whether there exist interaction effects and quadratic effects of these covariates, so we introduce the interactions of the three covariates and the quadratic terms of X1 and X3 into the initial full model. For the purpose of demonstration and simplicity, we consider the following model:

Y = α0(t) + Z1 α1(t) + Z2 α2(t) + Z3 α3(t) + Z4 α4(t) + Z5 α5(t) + X1 β1 + X2 β2 + X3 β3 + ε,

where Z1 = X1^2, Z2 = X3^2, Z3 = X1X2, Z4 = X1X3, Z5 = X2X3, and α0(t), the baseline CD4 percentage, represents the mean CD4 percentage t years after infection. We apply the proposed penalized QIF with the SCAD penalty and identify two nonzero regression coefficients, α0(t) and β1, in the model, where β̂1 = 3.1928. This indicates that age and smoking status have no significant impact on the mean CD4 percentage, which basically agrees with what was discovered by Fan and Li [10] and Zhao and Xue [32]. In addition, the curve of the estimated baseline function is shown in Fig. 7. From Fig. 7, we find that the mean baseline CD4 percentage decreases very quickly at the beginning of HIV infection, and the rate of decrease slows down somewhat four years after infection. This result is similar to what was discovered based on the semiparametric models by Fan and Li [10] and Zhao and Xue [32].

Fig. 7. Application to AIDS data. The penalized QIF estimator of the mean CD4 percentage α0(t). The solid line represents the estimated function; the dots are the residuals of the parametric part, r(t) = Y(t) − X(t)^T β̂.

7. Conclusion and discussion

Within the framework of semiparametric varying coefficient partially linear models with longitudinal data, we have proposed a variable selection procedure based on basis function approximations with quadratic inference functions. This procedure selects significant variables in the parametric components and the nonparametric components simultaneously, while at the same time estimating the unknown coefficients. We have shown that, under mild conditions and with a proper choice of tuning parameters, the proposed variable selection procedure is consistent and the estimators of the regression coefficients have the oracle property. Simulation studies indicated that the proposed procedure is very effective in selecting significant variables and estimating the regression coefficients.

In this paper, we assume that the dimensions of the covariates X and Z are fixed. We also present some simulations in Section 5 with the desired results when the dimensions p and q grow with n. However, in an ultra-high-dimensional case, the variable selection procedure proposed in this paper will no longer work. As a future research topic, it is interesting to consider variable selection for semiparametric varying coefficient partially linear models with ultra-high-dimensional covariates.

Acknowledgments

We sincerely thank the editor and two reviewers for their helpful comments that have improved the manuscript significantly. This work is supported by the National Natural Science Foundation of China (11171012, Key grant: 11331011), the BCMIIS, the Ph.D. Program Foundation of Ministry of Education of China (20121103110004), the Beijing Natural Science Foundation (1142003, L140003), and the Funding Project for Innovation of Doctoral Students of Beijing University of Technology (ykj-2011-4827).

Appendix. Proof of Theorems

For convenience and simplicity, let C denote a positive constant that may take different values at each appearance throughout this paper, and let ∥A∥ denote the modulus of the largest singular value of a matrix or vector A. Before we prove our main theorems, we list some regularity conditions that are used in this paper.

C1. The spline regression parameter γ is identified, that is, γ0 is the spline coefficient vector from the spline approximation to α0(·). In addition, there is a unique θ0 = (γ0, β0) ∈ S satisfying E{gn(θ0)} = o(1), where S is the parameter space.
C2. The weighting matrix Ωn(θ) = n^{-1} Σ_{i=1}^n gi(θ) gi^T(θ) converges almost surely to a constant matrix Ω0, where Ω0 is invertible.
C3. The covariate matrices Xi, Zi, i = 1, ..., n, satisfy sup_i E∥Xi∥^4 < ∞ and sup_i E∥Zi∥^4 < ∞.
C4. The error εi satisfies E(εi εi^T) = Vi, sup_i ∥Vi∥ < ∞, and there exists a positive constant δ such that sup_i E{∥εi∥^{2+δ}} < ∞.
C5. All the variances Ai ≥ 0 and sup_i ∥Ai∥ < ∞.
C6. {mi} is a bounded sequence of positive integers.


C7. α(u) is rth continuously differentiable on (0, 1), where r ≥ 2.
C8. The inner knots {ci, i = 1, ..., K} satisfy

max_{1≤i≤K} |h_{i+1} − hi| = o(K^{-1})   and   h / min_{1≤i≤K} hi ≤ C0,

where hi = ci − c_{i−1} and h = max_{1≤i≤K} hi.
C9. Let an = max_{k,l} { |p''_{λ2l}(|β0l|)|, |p''_{λ1k}(∥γ0k∥_H)| : β0l ≠ 0, γ0k ≠ 0 }; then an → 0 as n → ∞.
C10. The penalty functions satisfy

lim inf_{n→∞} lim inf_{βl→0+} λ_{2l}^{-1} p'_{λ2l}(|βl|) > 0,   l = p1 + 1, ..., p,
lim inf_{n→∞} lim inf_{∥γk∥_H→0} λ_{1k}^{-1} p'_{λ1k}(∥γk∥_H) > 0,   k = q1 + 1, ..., q.

These conditions are commonly adopted in the nonparametric literature and variable selection methodology. Condition C2 holds by the weak law of large numbers when n goes to infinity and the maximum cluster size is fixed. Conditions C1 and C3–C5 are regularity conditions that are usually easy to check and are similar to those used in [20,3]. Condition C6 means that N = O(n), and we have only local dependence in the sample. The smoothness condition on α(u) given by C7 determines the rate of convergence of the spline estimate α̂(u) = B(u)^T γ̂. Condition C8 implies that c0, ..., c_{K+1} is a C0-quasi-uniform sequence of partitions of [0, 1] (see [22, p. 216]). Conditions C9 and C10 are assumptions on the penalty function similar to those used in [25,9,32].

The proofs of Theorems 1–4 depend on the following lemmas. We first introduce some notation. For h, h' = 1, ..., s, define

gn,h(θ) = (1/n) Σ_{i=1}^n gih(θ),   Ω_{hh'}(θ) = (1/n) Σ_{i=1}^n gih(θ) gih'^T(θ),

where gih(θ) = Pi^T Ai^{-1/2} Mh Ai^{-1/2} (Yi − Pi θ). Then gn(θ) and Ωn(θ) can be written as gn(θ) = (gn1(θ), ..., gns(θ))^T and Ωn(θ) = (Ω_{hh'}(θ))_{h,h'=1}^s.

Lemma 1. Suppose that the regularity conditions C1–C10 hold. Then

√n gn(θ0) →_L N(0, Ω0),   gn(θ0) = Op(1/√n).

Proof. Consider the kth component of gn(θ0):

gn,k(θ0) = (1/n) Σ_{i=1}^n Pi^T Ai^{-1/2} Mk Ai^{-1/2} (Yi − Pi θ0)
         = (1/n) Σ_{i=1}^n Pi^T Ai^{-1/2} Mk Ai^{-1/2} εi − (1/n) Σ_{i=1}^n Pi^T Ai^{-1/2} Mk Ai^{-1/2} Zi R(Ui),

where R(u) = (R1(u), ..., Rq(u))^T and Rk(u) = αk(u) − B(u)^T γk, k = 1, ..., q. From conditions C7, C8 and Corollary 6.21 in [22], we get that ∥R(u)∥ = O(K^{-r}) and ∥B(·)∥ = O(1). Then, invoking conditions C3–C5, a simple calculation gives (1/n) Σ_{i=1}^n Pi^T Ai^{-1/2} Mk Ai^{-1/2} Zi R(Ui) = Op(n^{-1} n^{1/2} K^{-r}) = op(n^{-1/2}). Denote ζik = Pi^T Ai^{-1/2} Mk Ai^{-1/2} εi; then

gn(θ0) = (1/n) Σ_{i=1}^n ζi + op(n^{-1/2}),

where ζi = (ζi1, ..., ζis)^T. Obviously, E ζik = 0, E ζi = 0, cov(ζik) = Pi^T Ai^{-1/2} Mk Ai^{-1/2} Vi Ai^{-1/2} Mk Ai^{-1/2} Pi, and

cov(ζi) = Pi^T [ N11 ··· N1s ; ⋮ ⋱ ⋮ ; Ns1 ··· Nss ] Pi,

where Nlk = diag{A1^{-1/2} Ml A1^{-1/2} V1 A1^{-1/2} Mk A1^{-1/2}, ..., An^{-1/2} Ml An^{-1/2} Vn An^{-1/2} Mk An^{-1/2}}. Under condition C4, for any a ∈ R^{s(p+qL)} which satisfies a^T a = 1, E a^T ζi = 0 and sup_i E∥a^T ζi∥^{2+δ} ≤ ∥a^T∥^{2+δ} sup_i E∥ζi∥^{2+δ} ≤ sup_i E∥εi∥^{2+δ} < ∞. So,


a^T ζi satisfies the Lyapunov condition for the central limit theorem. Thus

Σ_{i=1}^n a^T ζi / { a^T Σ_{i=1}^n cov(ζi) a }^{1/2} →_L N(0, 1).

According to condition C2, we have (1/n) Σ_{i=1}^n cov(ζi) →_P Ω0, so

√n gn(θ0) = √n (1/n) Σ_{i=1}^n ζi + √n op(n^{-1/2}) →_L N(0, Ω0).

Hence, we have gn(θ0) = Op(1/√n). The proof of Lemma 1 is completed.

Lemma 2. Suppose that the regularity conditions C1–C10 hold. Then

∥n^{-1} Q̇n(θ0) − 2 ġn^T(θ0) Ωn^{-1}(θ0) gn(θ0)∥ = op(n^{-1}),

and

∥n^{-1} Q̈n(θ0) − 2 ġn^T(θ0) Ωn^{-1}(θ0) ġn(θ0)∥ = op(1).

Proof. First, by Lemma 1 and C2, we have

n^{-1} Q̇n(θ0) = 2 ġn^T(θ0) Ωn^{-1}(θ0) gn(θ0) + gn^T(θ0) Ẇn(θ0) gn(θ0) = 2 ġn^T(θ0) Ωn^{-1}(θ0) gn(θ0) + op(n^{-1}),

where ġn is the sl × l matrix ∂gn/∂θ and Ẇn is the three-dimensional array (∂Ωn^{-1}/∂θ1, ..., ∂Ωn^{-1}/∂θ_{p+qL}). Similarly, we get

n^{-1} Q̈n(θ0) = 2 ġn^T(θ0) Ωn^{-1}(θ0) ġn(θ0) + Rn,

where Rn = 2 ġn(θ0) Ωn^{-1}(θ0) gn(θ0) + 4 ġn^T(θ0) Ẇn(θ0) ġn(θ0) + gn^T(θ0) Ẅn(θ0) gn(θ0). By Lemma 1, C2 and C3, we have

∥ġn(θ0) Ωn^{-1}(θ0) gn(θ0)∥ = Op(n^{-1/2}),   ġn^T(θ0) Ẇn(θ0) ġn(θ0) = op(n^{-1/2}),

and

∥gn^T(θ0) Ẅn(θ0) gn(θ0)∥ = op(n^{-1}).

So Rn = op(1). This completes the proof.

Proof of Theorem 1. Let δ = n^{-1/(2r+1)}, β = β0 + δT1, γ = γ0 + δT2 and T = (T1^T, T2^T)^T. Our aim is to show that, for any given ε > 0, there exists a large constant C such that

P{ inf_{∥T∥=C} Qn^p(γ, β) > Qn^p(γ0, β0) } ≥ 1 − ε.   (A.1)

Let ∆(γ, β) = K^{-1}{Qn^p(γ, β) − Qn^p(γ0, β0)}; then, invoking β0l = 0, l = p1 + 1, ..., p, γ0k = 0, k = q1 + 1, ..., q, condition C1 and pλ(0) = 0, a simple calculation gives

∆(γ, β) ≥ (1/K)[Qn(θ) − Qn(θ0)] + (n/K) Σ_{l=1}^{p1} [p_{λ2l}(|βl|) − p_{λ2l}(|β0l|)] + (n/K) Σ_{k=1}^{q1} [p_{λ1k}(∥γk∥_H) − p_{λ1k}(∥γ0k∥_H)]
       ≡ I1 + I2 + I3.

First, we can expand Qn(θ) as

Qn(θ) = Qn(θ0 + δT) = Qn(θ0) + δ T^T Q̇n(θ0) + (1/2) δ^2 T^T Q̈n(θ0) T + (1/6) δ^3 T^T { ∂[T^T Q̈n(θ̃) T]/∂θ } T,


where θ̃ lies between θ0 and θ0 + δT. By Lemmas 1 and 2 we have

δ T^T Q̇n(θ0) = δ T^T { 2n ġn^T(θ0) Ωn^{-1}(θ0) gn(θ0) + n op(n^{-1}) } = ∥T∥ Op(√n δ) + ∥T∥ op(δ),

and

(1/2) δ^2 T^T Q̈n(θ0) T = (1/2) δ^2 T^T { 2n ġn^T(θ0) Ωn^{-1}(θ0) ġn(θ0) + n op(1) } T = n δ^2 T^T ġn^T(θ0) Ωn^{-1}(θ0) ġn(θ0) T + ∥T∥^2 op(n δ^2).

Hence,

I1 = (1/K){ n δ^2 T^T ġn^T(θ0) Ωn^{-1}(θ0) ġn(θ0) T + ∥T∥ Op(√n δ) + ∥T∥^2 op(n δ^2) }.

Next, by the standard argument of the Taylor expansion, we get that

I2 = n K^{-1} Σ_{l=1}^{p1} [p_{λ2l}(|βl|) − p_{λ2l}(|β0l|)]
   = Σ_{l=1}^{p1} [ K^{-1} n δ p'_{λ2l}(|β0l|) sgn(β0l) |T1l| + n K^{-1} δ^2 p''_{λ2l}(|β0l|) |T1l|^2 {1 + o(1)} ]
   ≤ √p1 K^{-1} n δ an ∥T∥ + n K^{-1} δ^2 bn ∥T∥^2.

Then, it is easy to show that I2 is dominated by I1 uniformly in ∥T∥ = C. With the same argument, we can prove that I3 is also dominated by I1 uniformly in ∥T∥ = C. Hence, by choosing a sufficiently large C, the term (1/K) n δ^2 T^T ġn^T(θ0) Ωn^{-1}(θ0) ġn(θ0) T dominates the other terms and is positive. Thus, (A.1) holds. This implies, with probability at least 1 − ε, that there exists a local minimizer β̂ such that ∥β̂ − β0∥ = Op(δ). Note that

∥α̂k(u) − α0k(u)∥^2 = ∫_0^1 {α̂k(u) − α0k(u)}^2 du
 = ∫_0^1 {B(u)^T γ̂k − B(u)^T γk + Rk(u)}^2 du
 ≤ 2 ∫_0^1 {B(u)^T γ̂k − B(u)^T γk}^2 du + 2 ∫_0^1 Rk(u)^2 du
 = 2 (γ̂k − γk)^T H (γ̂k − γk) + 2 ∫_0^1 Rk(u)^2 du.

With the same arguments as before, we can get that ∥γ̂ − γ∥ = Op(n^{-1/(2r+1)}). Then, invoking ∥H∥ = O(1), a simple calculation yields

(γ̂k − γk)^T H (γ̂k − γk) = Op(n^{-2r/(2r+1)}).   (A.2)

In addition, it is easy to show that

∫_0^1 Rk(u)^2 du = Op(n^{-2r/(2r+1)}).   (A.3)

Invoking (A.2) and (A.3), we complete the proof of Theorem 1.

Proof of Theorem 2. We first prove part (i). According to Theorem 1, it is sufficient to show that, for any γ satisfying ∥γ − γ0∥ = Op(n^{-1/(2r+1)}), any βl satisfying |βl − β0l| = Op(n^{-1/(2r+1)}), l = 1, ..., p1, and some given small ϵ = C n^{-1/(2r+1)}, when n → ∞, with probability tending to 1 we have

∂Qn^p(γ, β)/∂βl > 0   for 0 < βl < ϵ, l = p1 + 1, ..., p,   (A.4)

and

∂Qn^p(γ, β)/∂βl < 0   for −ϵ < βl < 0, l = p1 + 1, ..., p.   (A.5)

Thus, (A.4) and (A.5) imply that the minimizer of Qn^p(γ, β) is attained at βl = 0, l = p1 + 1, ..., p.


According to Lemma 2, we have that

∂Qn^p(γ, β)/∂βl = 2n { ∂gn^T(γ, β)/∂βl } Ωn^{-1}(γ, β) gn(γ, β) + op(n^{-1}) + n p'_{λ2l}(|βl|) sgn(βl)
 = −2 Σ_{i=1}^n ( Pi^T Ai^{-1/2} M1 Ai^{-1/2} Xil; ⋯ ; Pi^T Ai^{-1/2} Ms Ai^{-1/2} Xil )^T Ωn^{-1}(γ, β) gn(γ, β) + n p'_{λ2l}(|βl|) sgn(βl)
 = n λ2l { λ2l^{-1} p'_{λ2l}(|βl|) sgn(βl) + Op(λ2l^{-1} n^{-1/2}) }.

In addition, condition C10 implies that lim inf_{n→∞} lim inf_{βl→0+} λ2l^{-1} p'_{λ2l}(|βl|) > 0, and note that λ2l n^{1/2} > λmin n^{1/2} → ∞; it is then clear that the sign of the derivative is completely determined by that of βl, so (A.4) and (A.5) hold. This completes the proof of part (i). Applying similar techniques as in the analysis of part (i), we have, with probability tending to 1, that γ̂k = 0, k = q1 + 1, ..., q. The result of part (ii) then follows immediately from α̂k(u) = B(u)^T γ̂k.

Proof of Theorem 3. Let θ* = (γ*^T, β*^T)^T. The corresponding covariates are denoted by Pi* = (Xi*^T, Wi*^T)^T, i = 1, ..., n. Then

Q1n ((γˆ ∗T , 0)T , (βˆ ∗T , 0)T ) = − Q2n ((γˆ ∗T , 0)T , (βˆ ∗T , 0)T ) = −

p1 n  2  ∗T Xi τi (Yi − Wi∗ γˆ ∗ − Xi∗ βˆ ∗ ) + op (n−1 ) + p′λ2l (|βˆ l |)sgn(βˆ l ) = 0, n i=1 l =1 n 2

n i=1

Wi∗T τi (Yi − Wi∗ γˆ ∗ − Xi∗ βˆ ∗ ) + op (n−1 ) +

q1 

p′λ1k (∥γˆk ∥H )

k=1

H γˆk

∥γˆk ∥H

= 0,

(A.6)

(A.7)

where

τi =

s  s 

−1/2

−1/2 ∗

Ai

−1/2

Pi Ωh−′ h1 Pi∗T Ai

Mh Ai

−1/2

Mh Ai

,

(A.8)

h=1 h′ =1

and Ωh−′ h1 is the (h′ , h) block of Ω0−1 . Applying the Taylor expansion to p′λ2l (|βˆ l |), we get that p′λ2l (|βˆ l |) = p′λ2l (|β0l |) + {p′′λ2l (|β0l |) + op (1)}(βˆ l − β0l ).

p1 ′ Furthermore, condition C9 implies that p′′λ2l (|β0l |) = op (1), and note that p′λ2l (|β0l |) = 0 as λmax → 0, then, l=1 pλ2l (|βˆ l |)sgn(βˆ l ) = 0 = op (βˆ ∗ − β0∗ ). Hence, by (A.6), a simple calculation yields −

n  2  ∗T  ∗ ∗ Xi τi Xi (β0 − βˆ ∗ ) + Wi∗ (γ0∗ − γˆ ∗ ) + Zi∗ R∗ (Ui ) + εi + op (βˆ ∗ − β0∗ ) = 0, n i=1

(A.9)

where Ui = (ui1 , . . . , uimi )T , and R∗ (u) = (R1 (u), . . . , Rq1 (u))T . Invoking (A.7), and using the similar arguments to (A.9), we can prove that



n 2

n i=1





Wi∗T τi Xi∗ (β0∗ − βˆ ∗ ) + Wi∗ (γ0∗ − γˆ ∗ ) + Zi∗ R∗ (Ui ) + εi + op (γˆ ∗ − γ0∗ ) = 0.

(A.10)

n n Let Φn = n−1 i=1 Wi∗T τi Wi∗ , and Ψn = n−1 i=1 Wi∗T τi Xi∗ , then, by (A.10), we have that  γˆ − γ0 = Φn ∗



−1

Ψn (β0 − βˆ ∗ ) + ∗

n 1

n i=1

 Wi τi [Zi R (Ui ) + εi ] . ∗T

∗T ∗

Substituting this into (A.9), and a simple calculation yields n 1  ∗T Xi τi {Xi∗ − Wi∗ Φn−1 Ψn }(βˆ ∗ − β0∗ ) + op (βˆ ∗ − β0∗ ) n i =1

=

n  1  ∗T  Xi τi εi + Zi∗ R∗ (Ui ) − Wi∗ [Φn−1 + op (1)]Λn , n i=1

(A.11)


where Λn = n^{-1} Σ_{j=1}^n Wj*^T τj [Zj* R*(Uj) + εj]. Note that

(1/n) Σ_{i=1}^n Ψn Φn^{-1} Wi*^T τi {Xi* − Wi* Φn^{-1} Ψn} = 0,

and

(1/n) Σ_{i=1}^n Ψn Φn^{-1} Wi*^T τi { εi + Zi* R*(Ui) − Wi* Φn^{-1} Λn } = 0.

Then, by (A.11), it is easy to show that

{ (1/n) Σ_{i=1}^n X̆i* τi X̆i*^T + op(1) } √n (β̂* − β0*)
 = (1/√n) Σ_{i=1}^n X̆i* τi εi − (1/√n) Σ_{i=1}^n X̆i* τi Wi* [Φn^{-1} + op(1)] Λn + (1/√n) Σ_{i=1}^n X̆i* τi Zi* R*(Ui)
 ≡ B1 + B2 + B3,   (A.12)

where X̆i* = Xi*^T − Ψn Φn^{-1} Wi*^T. Using the central limit theorem, we can obtain

B1 →_L N(0, B).   (A.13)

In addition, note that Σ_{i=1}^n X̆i* τi Wi* = 0, so we have B2 = 0. Furthermore, a simple calculation yields

B3 = (1/√n) Σ_{i=1}^n {Xi*^T − E(Ψn) E(Φn)^{-1} Wi*^T} τi Zi* R*(Ui) + (1/√n) Σ_{i=1}^n {E(Ψn) E(Φn)^{-1} − Ψn Φn^{-1}} Wi*^T τi Zi* R*(Ui) ≡ B31 + B32.

Invoking E{ [Xi*^T − E(Ψn) E(Φn)^{-1} Wi*^T] τi Wi* } = 0, we can prove that

(1/n) Σ_{i=1}^n {Xi*^T − E(Ψn) E(Φn)^{-1} Wi*^T} τi Wi* = Op(1).

Together with Wi = Ip ⊗ B(Ui) · Xi, ∥B(u)∥ = O(1) and ∥R(u)∥ = o(1), it is clear that B31 = op(1). Similarly, we can prove that B32 = op(1). Hence B3 = op(1). In addition, by the law of large numbers, we have that

(1/n) Σ_{i=1}^n X̆i* τi τi X̆i*^T →_P A,   (A.14)

where ''→_P'' denotes convergence in probability. Then, invoking (A.12)–(A.14) and using the Slutsky theorem, we complete the proof of Theorem 3.

Proof of Theorem 4. Similar to Wang and Xia [27], according to whether the model Sλ is underfitted, correctly fitted or overfitted, we can create three mutually exclusive sets:

R− = {λ : Sλ ⊉ ST},   R0 = {λ : Sλ = ST},   R+ = {λ : Sλ ⊇ ST, Sλ ≠ ST}.

Then the theorem can be proved by comparing BIC(Sλ) and BIC(ST); we consider two cases separately.

Case 1 (underfitted model): Let λT ∈ R0, and for arbitrary λ ∈ R− (i.e. Sλ ⊉ ST, an underfitted model),

(1/n){BIC(Sλ) − BIC(ST)} = gn^T(θλ) Ωn^{-1}(θλ) gn(θλ) + dfλ log(n)/n − (1/n) BIC(ST)
 ≥ gn^T(θλ) Ωn^{-1}(θλ) gn(θλ) − (1/n) BIC(ST)
 ≥ inf_{λ∈R−} gn^T(θλ) Ωn^{-1}(θλ) gn(θλ) − gn^T(θλT) Ωn^{-1}(θλT) gn(θλT) − dfλT log(n)/n
 → min_{Sλ ⊉ ST} { E[gn(θλ)]^T Ωn^{-1}(θλ) E[gn(θλ)] − E[gn(θλT)]^T Ωn^{-1}(θλT) E[gn(θλT)] }
 > 0

in probability, by the law of large numbers and the continuous mapping theorem [4], and the fact that E[gn(θλ)] ≠ 0, Ωn^{-1}(θλ) is positive definite, and E[gn(θλT)] = o(1).

Case 2 (overfitted model): For arbitrary λ ∈ R+ (i.e. Sλ ⊇ ST but Sλ ≠ ST), we have

inf_{λ∈R+} (1/n){BIC(Sλ) − BIC(ST)} = inf_{λ∈R+} { gn^T(θλ) Ωn^{-1}(θλ) gn(θλ) − gn^T(θλT) Ωn^{-1}(θλT) gn(θλT) + (dfλ − dfλT) log(n)/n }
 ≥ inf_{λ∈R+} gn^T(θλ) Ωn^{-1}(θλ) gn(θλ) − gn^T(θλT) Ωn^{-1}(θλT) gn(θλT)
 → min_{Sλ ⊇ ST} { E[gn(θλ)]^T Ωn^{-1}(θλ) E[gn(θλ)] − E[gn(θλT)]^T Ωn^{-1}(θλT) E[gn(θλT)] }

> 0 in probability, by the law of large numbers and the continuous mapping theorem [4], and by reasoning similar to that in Case 1. The results of Cases 1 and 2 complete the proof.

References

[1] I. Ahmad, S. Leelahanon, Q. Li, Efficient estimation of a semiparametric partially linear varying coefficient model, Ann. Statist. 33 (2005) 258–283.
[2] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: B.N. Petrov, F. Csaki (Eds.), Proceedings of the Second International Symposium on Information Theory, 1973, pp. 267–281.
[3] Y. Bai, Z.Y. Zhu, W.K. Fung, Partial linear models for longitudinal data based on quadratic inference functions, Scand. J. Stat. 35 (1) (2008) 104–118.
[4] P. Billingsley, Probability and Measure, third ed., Wiley, New York, 1995.
[5] L. Breiman, Better subset selection using nonnegative garrote, Technometrics 37 (1995) 373–384.
[6] W.S. Cleveland, E. Grosse, W.M. Shyu, Local regression models, in: J.M. Chambers, T.J. Hastie (Eds.), Statistical Models in S, Wadsworth and Brooks, Pacific Grove, CA, 1992, pp. 309–376.
[7] J.J. Dziak, R. Li, A. Qu, An overview on quadratic inference function approaches for longitudinal data, in: J. Fan, X. Lin, J.S. Liu (Eds.), New Developments in Biostatistics and Bioinformatics, 2009, pp. 49–72.
[8] J. Fan, T. Huang, Profile likelihood inferences on semiparametric varying-coefficient partially linear models, Bernoulli 11 (2005) 1031–1057.
[9] J.Q. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
[10] J. Fan, R. Li, New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis, J. Amer. Statist. Assoc. 99 (2004) 710–723.
[11] J. Fan, W. Zhang, Statistical methods with varying coefficient models, Stat. Interface 1 (2008) 179–195.
[12] W.J. Fu, Penalized regression: the bridge versus the LASSO, J. Comput. Graph. Statist. 7 (1998) 397–416.
[13] T.J. Hastie, R.J. Tibshirani, Varying-coefficient models (with discussion), J. R. Stat. Soc. Ser. B 55 (1993) 757–796.
[14] X. He, W.K. Fung, Z.Y. Zhu, Robust estimation in a generalized partial linear model for clustered data, J. Amer. Statist. Assoc. 100 (2005) 1176–1184.
[15] X. He, Z.Y. Zhu, W.K. Fung, Estimation in a semiparametric model for longitudinal data with unspecified dependence structure, Biometrika 90 (2002) 579–590.
[16] J.Z. Huang, C.O. Wu, L. Zhou, Varying-coefficient models and basis function approximations for the analysis of repeated measurements, Biometrika 89 (2002) 111–128.
[17] R. Li, H. Liang, Variable selection in semiparametric regression modeling, Ann. Statist. 36 (2008) 261–286.
[18] K.L. Liang, S.L. Zeger, Longitudinal data analysis using generalised estimating equations, Biometrika 73 (1986) 13–22.
[19] C.L. Mallows, Some comments on Cp, Technometrics 15 (1973) 661–675.
[20] A. Qu, R. Li, Quadratic inference functions for varying coefficient models with longitudinal data, Biometrics 62 (2006) 379–391.
[21] A. Qu, B.G. Lindsay, B. Li, Improving generalized estimating equations using quadratic inference functions, Biometrika 87 (2000) 823–836.
[22] L.L. Schumaker, Spline Functions, Wiley, New York, 1981.
[23] G. Schwarz, Estimating the dimension of a model, Ann. Statist. 6 (1978) 461–464.
[24] R. Tibshirani, Regression shrinkage and selection via the LASSO, J. R. Stat. Soc. Ser. B 58 (1996) 267–288.
[25] L. Wang, H. Li, J.Z. Huang, Variable selection in nonparametric varying coefficient models for analysis of repeated measurements, J. Amer. Statist. Assoc. 103 (2008) 1556–1569.
[26] H. Wang, R. Li, C. Tsai, Tuning parameter selectors for the smoothly clipped absolute deviation method, Biometrika 94 (2007) 553–568.
[27] H. Wang, Y. Xia, Shrinkage estimation of the varying coefficient model, J. Amer. Statist. Assoc. 104 (2009) 747–757.
[28] L. Xue, A. Qu, J. Zhou, Consistent model selection for marginal generalized additive model for correlated data, J. Amer. Statist. Assoc. 105 (2010) 1518–1530.
[29] L.G. Xue, L.X. Zhu, Empirical likelihood for a varying coefficient model with longitudinal data, J. Amer. Statist. Assoc. 102 (2007) 642–654.
[30] W. Zhang, S. Lee, X. Song, Local polynomial fitting in semivarying coefficient model, J. Multivariate Anal. 82 (2002) 166–188.
[31] P.X. Zhao, L.G. Xue, Variable selection for semiparametric varying coefficient partially linear models, Statist. Probab. Lett. 79 (2009) 2148–2157.
[32] P.X. Zhao, L.G. Xue, Variable selection for semiparametric varying coefficient partially linear errors-in-variables models, J. Multivariate Anal. 101 (2010) 1872–1883.
[33] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc. 101 (2006) 1418–1429.
[34] H. Zou, R. Li, One-step sparse estimates in nonconcave penalized likelihood models, Ann. Statist. 36 (2007) 1509–1533.