Robust variable selection in semiparametric mean-covariance regression for longitudinal data analysis

Robust variable selection in semiparametric mean-covariance regression for longitudinal data analysis

Applied Mathematics and Computation 245 (2014) 343–356 Contents lists available at ScienceDirect Applied Mathematics and Computation journal homepag...

476KB Sizes 0 Downloads 45 Views

Applied Mathematics and Computation 245 (2014) 343–356

Contents lists available at ScienceDirect

Applied Mathematics and Computation journal homepage: www.elsevier.com/locate/amc

Robust variable selection in semiparametric mean-covariance regression for longitudinal data analysis Chaohui Guo ⇑, Hu Yang, Jing Lv a

College of Mathematics and Statistics, Chongqing University, Chongqing 401331, China

a r t i c l e

i n f o

Keywords: B-spline Generalized estimating equations Longitudinal data Modified Cholesky decomposition Partial linear models Robustness

a b s t r a c t This paper considers robust semiparametric smooth-threshold generalized estimating equations for the analysis of longitudinal data based on the modified Cholesky decomposition and B-spline approximations. The proposed method can automatically eliminate inactive predictors by setting the corresponding parameters to be zero, and simultaneously estimate the mean regression coefficients, generalized autoregressive coefficients and innovation variances. In order to overcome the outliers in either the response or/and the covariate domain, we use a bounded score function and leverage-based weights to achieve better robustness. Moreover, the proposed estimators have desired large sample properties including consistency and oracle property. Finally, Monte Carlo simulation studies are conducted to investigate the robustness and efficiency of the proposed method under different contaminations. Ó 2014 Elsevier Inc. All rights reserved.

1. Introduction Longitudinal data sets are common in medical and public health studies, the significant feature of it is repeated measures over a certain period of time. Liang and Zeger [12] proposed generalized estimating equations (GEEs) which are taken as a milestone in the development of methodology for longitudinal data analysis. Partial linear models (PLMs) include a nonparametric function and linear part, which are more flexible and more interpretable than linear models and nonparametric models respectively. Wang et al. [24] considered marginal generalized semiparametric partially linear models for clustered data and proposed profile-type estimating equation, which differs from a standard marginal generalized estimating equation model [12] mainly through introducing the nonparametric component. However, these methods mentioned above are not robust, since its estimates are sensitive to potential outliers and influential observations. Hence, to seek a more robust method against outliers is a very important issue in longitudinal studies. In recent years, many authors studied the influence of outliers to the estimate and had developed many robust methods. An incomplete list of recent works on the robust GEE method include [4,6,9,18,19] and so on. As far as we know, a better estimate for the covariance matrix will result in a better estimate for the mean parameter. But all methods above mainly payed attention to the estimate of mean parameters while regarded the covariance parameters as nuisance parameters. In fact, the covariance parameters may be not nuisance parameters and have substantive significance for its own interest, see Carroll [1] and Zhang and Li [29] for reference. Recently, motivated by the modified Cholesky decomposition, Ye and Pan [28] proposed joint mean and covariance regression models by using generalized estimating equations. The advantages of this decomposition are that it makes covariance matrices to be positive definite and the parameters in it have well-founded statistical concepts. Due to these advantages, Cholesky decomposition has received considerable ⇑ Corresponding author. E-mail address: [email protected] (C. Guo). http://dx.doi.org/10.1016/j.amc.2014.07.086 0096-3003/Ó 2014 Elsevier Inc. All rights reserved.

344

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356

attention in the literature. Here we only list a few. See Leng et al. [10] and Mao et al. [13] constructed PLMs for the mean and the covariance structure for longitudinal data, which are more flexible than that of Ye and Pan [28]. However, the above approaches based on GEEs are also highly sensitive to outliers in the sample. Recently, Zheng et al. [31] established three robust estimating equations to hinder the effect of outliers in both mean and covariance estimation by borrowing the idea of [6]. However, as far as we know, there is little discussion on robust estimation on the semiparametric mean-covariance model (SMCM). In this paper, we consider the SMCM and decompose the inverse of covariance matrix by the modified Cholesky decomposition. The entries in this decomposition are autoregressive parameters and log innovation variances. See [10,28,31,32] for references. Zhang and Leng [30] proposed a new regression model to decompose covariance structures by using a novel Cholesky factor with the entries being a moving average and log innovation variances. These decompositions are based on linear time series analysis. Thus, the application of semiparametric mean-covariance regression with longitudinal data may be extended to the realm of nonlinear time series analysis, since the derivation of methods for nonlinear time series analysis acts as one of the crowning achievements that emerged from the theory of deterministic dynamical systems. Kodba et al. [8] and Perc [15] applied nonlinear time series analysis methods to analyse the chaotic behaviour of a very simple periodically driven resistor–inductor diode circuit and dynamics of human gait respectively and pointed out that the nonlinear time series analysis methods are superior to mathematical modelling, because of they can introduce basic concepts directly from the experimental data. In recent years, many penalization or shrinkage based variable selection methods have been developed to select significant variables among all the candidate variables, e.g., [3,33–35], etc. All the methods mentioned above only consider the independent data. But variable selection is also a fundamentally important issue in longitudinal studies, which could greatly enhance the prediction performance of the fitted model and select significant variables. In recent years, penalty function had been adopted to select active variables in the longitudinal data analysis. For example, Fan et al. [4] developed penalized robust estimating equations for longitudinal linear regression models with the SCAD penalty. Wang et al. [23] considered the SCAD-penalized GEE for analyzing longitudinal data with high-dimensional covariates. Zheng et al. [32] considered robust variable selection method in joint mean and covariance models through using three penalized robust generalized estimating equations with three SCAD penalties. All these procedures are based on penalty functions to select variables, which are singular at zero. Consequently, these variable selection procedures need to solve the convex optimization which lead to a computational burden. Borrowing the idea of Ueki [22], Li et al. [11] developed the smooth-threshold generalized estimating equations (SGEEs) for longitudinal generalized linear models which can efficiently avoid the convex optimization problem. These facts motivate us to develop robust smooth-threshold generalized estimating equations for the SMCM with longitudinal data. This paper has made the following contributions: (i) we establish consistency and asymptotic normality of the mean regression coefficients, generalized autoregressive coefficients and innovation variances, and obtain the optimal convergent rate for estimating the nonparametric functions. (ii) The proposed method can alleviate the effect of outliers in either the response or/and the covariate domain by using the bounded Huber’s score function and Mallows weights. (iii) The proposed method can automatically eliminate inactive predictors by setting the corresponding parameters to be zero and estimate nonzero coefficients through semiparametric smooth-threshold generalized estimating equations. The rest of the article is organized as follows. Section 2 introduces the model and estimation method. Theoretical properties of the proposed estimators are also given in this Section. Section 3 describes the semiparametric smooth-threshold generalized estimating equations and oracle property. In Section 4, an efficient algorithm is proposed to implement the procedures. Moreover, we discuss how to select the tuning parameters so that the corresponding estimators are robust and sparse. Simulation studies are carried out in Section 5 to investigate the performance of the proposed estimators under two types of contaminations. Some concluding remarks are given in Section 6. All the proofs are relegated to the Appendix. 2. Robust generalized estimating equations in joint mean and covariance semiparametric models 2.1. The joint mean and covariance semiparametric models For a vector or a matrix A, we define A as random vector or matrix of population, Ai as the ith corresponding sample and A0i as the true value of Ai throughout our paper. In this paper, we consider an experiment with m subjects and ni observations  T P over time for the ith subject, where n ¼ m is the response for i¼1 ni is the total of observations. Suppose that y i ¼ yi1 ; . . . ; yini the ith subject ði ¼ 1; . . . ; mÞ at time points t i ¼ ðti1 ; . . . ; t ini Þ. We assume that t ij is the time or any time-dependent covariate and ftij g 2 ½0; 1. We further specify a marginal model through defining the first two moments of the response yij , i.e., Eðyij xij ; t ij Þ ¼ l0ij ; Vðyi jxi ; t i Þ ¼ R0i , where xij is a p-dimension covariate vector and xi ¼ ðxi1 ; . . . ; xini ÞT is a corresponding covariate matrix for the ith subject. On account of the modified Cholesky decomposition, there exists a lower triangle matrix U with 1’s on the main diagonal satisfying Ui R0i UTi ¼ D0i , where D0i is a diagonal matrix with positive entries and the lower-diagonal entries of Ui are defined as the negatives of the autoregressive coefficients /ijk of

yij  lij ¼

j1 X /ijk ðyik  lik Þ þ eij :

ð1Þ

k¼1

^ij . The modified The diagonal entries of D0i are taken as the innovation variances with entries r20ij ¼ varðeij Þ, where eij ¼ yij  y Cholesky decomposition makes R0i to be positive definite, and the parameters / and logðr2ij Þ to be unconstrained. Based on

345

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356

this decomposition, Ye and Pan [28] proposed joint mean covariance models by using generalized estimating equations. Leng et al. [10] generalized it to semiparametric models

gðl0ij Þ ¼ xTij b0 þ f0 ðtij Þ;

/ijk ¼ wTijk c0 ;

logðr20ij Þ ¼ zTij t0 þ f1 ðt ij Þ;

ð2Þ

by adding two nonparametric functions f0 and f1 into the mean and covariance models respectively, where gðÞ is assumed to be a known link function, which is monotone and differentiable, f0 ðÞ and f1 ðÞ are unknown smooth functions, xij ; wijk and zij are p  1; q  1 and d  1 dimension covariates vectors, and b0 ; c0 and t0 are associated regression coefficients. The covariates T wijk may be a polynomial of time difference t ij  tik i.e., wijk ¼ ð1; ðt ij  t ik Þ; ðtij  tik Þ2 ; . . . ; ðt ij  t ik Þq1 Þ or that of time dependent covariates. Welsh et al. [26] showed that regression splines perform better than kernel method with clustered data in nonparametric models. Then we can use a small number of knots to approximate the nonparametric functions f0 and f1 by B-spline basis functions for its good numerical properties, and to obtain the optimal nonparametric convergence rate. For simplicity, we assume that the smoothness of f0 and f1 are the same. Let pðtÞ ¼ ðB1 ðtÞ; . . . ; BK ðtÞÞT be the B-spline basis functions with the order of m þ 1, where K ¼ kn þ m þ 1 is the number of basis functions and kn is the number of interior knots. Then, f0 ~ , where a; a ~ 2 RK . This linearizes our regression method, based and f1 can be approximated by f0 ðtÞ  pðtÞT a and f1 ðtÞ  pðtÞT a on the above approximations, the nonlinear regression models in (2) can be rewritten as T

gðlij ð#ÞÞ ¼ xTij b þ pT ðtij Þa ¼ bij #; where bij ¼ the

ðxTij ;

T T ij Þ ;

p

combined

hij ¼

ðzTij ;

regression

p

T T ij Þ

~ ¼ hTij q; logðr2ij Þ ¼ zTij t þ pT ðtij Þa

ð3Þ T

T T

T

~T T

act as the combined design vectors with pij ¼ pðt ij Þ; # ¼ ðb ; a Þ and q ¼ ðt ; a Þ are

parameter

vectors.

Furthermore,

in

T

form

of

matrix,

Bi ¼ ðbi1 ; . . . ; bini Þ and xi ; pi ; zi and H i can be defined similarly. Then we have lij ð#Þ ¼ l

T ðbij #Þ

we

have

li ¼ ðli1 ; . . . ; lini ÞT ;

with lðÞ ¼ g 1 ðÞ. In this paper,

T

for a chosen scalar function g, we have gðli Þ ¼ ðgðli1 Þ; . . . ; gðlini ÞÞ . 2.2. Robust semiparametric generalized estimating equations Ye and Pan [28] proposed joint mean covariance models for generalized linear models, which have received considerable attention in the literature. In recent years, researchers have done some promotions from the following two aspects. On the one hand, Zheng et al. [31,32] extended it to robust generalized estimating equations (RGEEs) for longitudinal generalized linear models by adopting a bounded Huber’s score function to alleviate the effect of outliers. On the other hand, Leng et al. [10] promoted it to semiparametric GEEs which are more flexible and can reduce the deviation if the parametric assumption is fail to meet. Motivated by the works of [10,31,32], we choose a bounded score function w and define three T robust semiparametric generalized estimating equations (RSGEEs) for f ¼ ð#T ; cT ; qT Þ T

UðfÞ ¼ ðU 1 ð#ÞT ; U 2 ðcÞT ; U 3 ðqÞT Þ ;

ð4Þ

the particular estimating equations for the mean, generalized autoregressive parameters and innovation variances are defined as follows:

U 1 ð#Þ ¼

m X 1 # BTi DTi ðli ð#ÞÞðV #i Þ hi ðli ð#ÞÞ ¼ 0;

ð5Þ

i¼1

U 2 ðcÞ ¼

m X 1 c T Ti ðV ci Þ hi ð^r i ðcÞÞ ¼ 0;

ð6Þ

i¼1

U 3 ðqÞ ¼

m X 1 q H Ti Di ðV qi Þ hi ðr2i ðqÞÞ ¼ 0;

ð7Þ

i¼1

where T # bij #; hi ð

  Di ¼ Di ðli ð#ÞÞ ¼ diag l_ i1 ð#Þ; . . . ; l_ ini ð#Þ l_ ij ðÞ is the derivative of lðÞ evaluated at and c c c ^ # # # ^ li ð#ÞÞ ¼ W i ½w ðli ð#ÞÞ  C i ðli ð#ÞÞ; hi ðri ðcÞÞ ¼ W i ½w ðri ðcÞÞ  C ci ð^ri ðcÞÞ and hqi ðr2i ðqÞÞ ¼ W qi ½wq ðr2i ðqÞÞ  C qi ðr2i ðqÞÞ

r i are ni  1 vectors are the core of estimating equations with w#i ; wci ; wqi ; C #i ; C ci ; C qi ; W #i ; W ci and W qi to be defined later; r i and ^  P P with jth components rij ¼ yij  l and ^r ij ¼ Eðr ij ri1 ; . . . ; r iðj1Þ Þ ¼ j1 /ijk rik . In this paper, we denote 0 as zero when j ¼ 1. ij

k¼1

k¼1

e2i ðqÞ and r2i ðqÞ are ni  1 vectors with jth components e2ij ðqÞ and r2ij ðqÞ, respectively, where eij ¼ yij  y^ij . It is easy to show

n o r i ¼ ei . T Ti ¼ @ ^ r Ti =@ c is the q  ni matrix with that Eðe2i Þ ¼ r2i and Di ¼ diag r2i1 ; . . . ; r2ini acts as the covariance matrix of r i  ^ P jth column @^rTij =@ c ¼ j1 k¼1 r ik wijk . ~ 1=2 R ~i ~ i with A Furthermore, we define V #i ¼ A1=2 Ri with Ai denoting the diagonal elements of Ri ; V ci ¼ D1=2 and V qi ¼ A i i i ~ denoting the diagonal elements of Ri . With the same argument as Zheng et al. [31], we use the sandwich working covariance n o ~ i ¼ F 1=2 Ri ðdÞF 1=2 to model the true R ~ i ¼ Varðe2 Þ together with F i ¼ 2diag r4 ; . . . ; r4 and that Ri ðdÞ denotes the structure R i

i

i

i1

ini

correlation between r2ij and r2ik which is indexed by a new parameter d. Typical structures for Ri ðdÞ include the exchangeable

346

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356

and first order autoregressive working correlation structures. From the simulation study in Section 5, we can see that the working correlation structures have little effect on the estimates. Moreover, Leng et al. [10] demonstrated that the parameter d also has little effect on the estimates. These imply that the resulting estimators of regression coefficients in the mean, generalized T

autoregressive parameters and innovation variances are robust against misspecification of Ri ðdÞ. The estimates of ð#T ; cT ; qT Þ T ^T; c ^T Þ . ^T ; q obtained from RSGEEs (5)–(7) are denoted by ð# 1=2 1=2 # c ^ ~ 1=2 ðe2  r2 ÞÞ as the core of the estimating We take w ðl Þ ¼ wðA ðy  l ÞÞ; w ðr i Þ ¼ wðD ðr i  r^i ÞÞ; wq ðr2 Þ ¼ wðA i

i

i

i

i

i

i

i

i

equations. The common choice for wðÞ is Huber’s score function, defined by wc ðxÞ ¼ min fc; maxfc; xgg for some constant c which is between 1 and 2, to downweight the influence of outliers in the response variable. For simplicity, we suppose that all the weighting matrices W #i ; W ci ; W qi are the same and denote them as W i ¼ diagðwi1 ; . . . ; wini Þ, to bound the effect of leverage points in the covariates space. A common choice is Mallows weights which are defined as:

8 !j=2 9 < =   b0 wij ¼ w xij ¼ min 1;  ; T 1   : ; xij  mx S x xij  mx

ð8Þ

with j P 1; mx and S x are some robust estimates of the location and scale of xij . The general approach is to use high breakdown point location and scatter estimators, i.e., Minimum Covariance Determinant (MCD) and Minimum Volume Ellipsoid (MVE). b0 is the 0.95 quantile of the chi-square distribution with the degree of freedom being the dimension of xij . We further h i h i h i     ~ 1 ðe2  r2 ÞÞ as the bias-corr i ðcÞÞ ¼ E wðD1=2 ðr i  ^ r i ÞÞ and C q r2 ðqÞ ¼ E wðA adopt C # l ð#Þ ¼ E wðA1=2 ðy  l ÞÞ ; C c ð^ i

i

i

i

i

i

i

i

i

i

i

i

rection terms to ensure the Fisher consistency of the estimators. 2.3. Asymptotic properties To study the large sample properties of our proposed estimators, we need to make some assumptions and list some regularity conditions. Throughout the paper, we use kk to denote the Euclidean norm for a vector and the spectral norm for a T T T ^T; c ^ T Þ satisfies ^T ; q square matrix. Let h0 ¼ ðbT0 ; cT0 ; tT0 Þ be the true value of h ¼ ðbT ; cT ; tT Þ . If a sequence ð#     T T   ^ ^T ; c ^T Þ ! ðbT0 ; cT0 ; tT0 Þ ; supt f0 ðtÞ  pT ðtÞa ~  ! 0 in probability as m ! 1, we call it as a consis^T ; t ^  ! 0 and supt f1 ðtÞ  pT ðtÞa ðb tent sequence. When estimating Eqs. (5)–(7) have multiple solutions, then we only consider a consistent sequence of estiT

^T; c ^ T Þ in this section. The following conditions are needed to establish the asymptotic properties. ^T ; q mator ð# (C1) We assume that the dimensions of covariates xij ; wijk , and zij are fixed and maxi fni g is bounded, and the distinct values of tij form a quasi-uniform sequence that grows dense on ½0; 1. The first four moments of yij exist. (C2) The rth derivatives of f0 and f1 are bounded for some r P 2. ~ 1 are all bounded, that is, all the elements of the vectors and matrices are (C3) The covariates wijk and the matrices R i bounded. The function g 1 ðÞ has bounded second derivatives. T T (C4) The parametric space H of ðbT ; cT ; tT Þ is a compact subset of Rpþqþd , and the true parameter value ðbT0 ; cT0 ; tT0 Þ is in the interior of the parameter space H. 2þd 2þd 2þd # c q # # T < 1; supiP1 E h0;i < 1 and supiP1 E h0;i < 1 for some d > 0, and Eh0;i ðh0;i Þ ¼ Gi > 0; (C5) supiP1 E h0;i c

c

T

q

q T

c

#

q

Eh0;i ðh0;i Þ ¼ J i > 0 and Eh0;i ðh0;i Þ ¼ K i > 0 with sup kGi k < 1; sup kJ i k < 1 and sup kK i k < 1 where h0;i ; h0;i and h0;i i

T T 0Þ .

i

i

are evaluated at the true value ð#T0 ; cT0 ; q r i ðcÞÞ and C qi ðr2i ðqÞÞ have bounded second derivatives. The functions w# ; wc and (C6) The function g 1 ðÞ and C #i ðli ð#ÞÞ; C ci ð^ q w are piecewise twice differentiable with bounded second derivatives. ^ c ^, we further make some assumptions on the covariates ^ and t To study the asymptotic distribution for the estimates b; xij ; zij and tij as follows:

xijk ¼ g k ðt ij Þ þ dijk ; zijl ¼ g~l ðt ij Þ þ ~dijl ;

k ¼ 1; . . . ; p; l ¼ 1; . . . ; d;

ð9Þ i ¼ 1; . . . ; m; j ¼ 1; . . . ; ni ;

ð10Þ

dijk ’s are mean zero random variables which are indewhere the g k ðtÞ’s and g~k ðtÞ’s have bounded rth derivatives and dijk ’s and ~ ~ n be the n  p and pendent of the corresponding random errors and of one another. For clear presentation, we let Kn and K T n  d matrices whose kth column are dk ¼ ðd11k ; . . . ; d1n1 k ; . . . ; dmnm k ÞT and ~ d11k ; . . . ; ~ d1n1 k ; . . . ; ~ dmnm k Þ respectively. The dk ¼ ð~ following assumption are also needed: 2 ~ n ¼ 0; sup 1 E K ~ n < 1. (C7) EKn ¼ 0; supn 1n EkKn k2 < 1; EK nn ~ 0 MÞ are nonsingular, and the eigenvalues of M T R0 Mðkn =nÞ and (C8) For sufficiently large n; kn ðM T R0 MÞ and kn ðM T R   T T ~0 M R Mðkn =nÞ are bounded away from 0 and infinity, where M ¼ ðpT1 ; . . . ; pTm Þ ; R0 ¼ diag R01 ; . . . ; R0m with  0  1 # ~ 0 ¼ diag R ~ ;...;R ~0 R and C# ¼ Eh_ # ðl ð#ÞÞ ¼ E@h ðl Þ=@ l j and with R0 ¼ DT ðV # Þ C# D , i

0i

0i

i0

0i

i

i

i

i

i li ¼li ð#Þ

i

1

m

~ 0 ¼ DT ðV q Þ1 Cq D0i , and Cq ¼ Eh_ q ðr2 ðqÞÞ ¼ E@hq ðr2 Þ=@ r2 j 2 2 , where D0i ; C# and V # are evaluated at l ¼ l , R 0i 0i 0i i i i i i r ¼r ðqÞ i 0i 0i i0 i i i

and D0i ; Cq0i and V q0i are evaluated at r2i ðqÞ ¼ r20i ðqÞ.

i

347

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356

Remark 1. Under Condition (C1), the total sample size n is of the same order as the number of subjects m, the existence of the first four moments of the response is needed for consistently estimating the parameters in the variance. Condition (C2) determine the rate of convergence for estimating f0 and f1 . Condition (C3) is satisfied as t is bounded. Condition (C4) is common made in linear models. The Conditions (C5)–(C6) imposed on the score function w# ; wc and wq which are easy to check when w# ; wc and wq are bounded. Condition (C7) holds if ðxij ; zij ; t ij Þ is a random sample from a multivariate distribution with finite second moments. Condition (C8) is one of properties of the B-spline basis functions and holds under rather general design conditions (see He and Shi [7]). T ~ T ~ T ~ T T pffiffiffiffiffi ^T ; c ^T Þ need to compute the covariance matrix Wm ¼ ðdkl ^T ; t The asymptotic properties of ðb m Þk;l¼1;2;3 of ðU 1 ; U 2 ; U 3 Þ = m, ~ ~ ~ where U 1 ; U 2 and U 3 are defined by

~1 ¼ U

m X # 1 # T X T i D0i ðV 0i Þ hi ðl0i ð#ÞÞ;

~2 ¼ U

m X 1 c T T0i ðV c0i Þ hi ð^r0i ðcÞÞ;

i¼1

i¼1 m X q 1 q 2 ~3 ¼ Z T U i D0i ðV 0i Þ hi ðr0i ðqÞÞ; i¼1 1 ~ ~ ¼ MðM T R ~ 0 MÞ1 M T R ~ 0, where X ¼ ðI  PÞX with P ¼ MðM T R0 MÞ M T R0 , and Z  ¼ ðI  PÞZ with P Pj1 P T T j1 ^ r 0i ¼ ð^r0i1 ; . . . ; ^r0ij ; . . . ; ^r0ini Þ with ^r0ij ¼ k¼1 r0ik xTijk c0 and r 0i ¼ yi  l0i , and T 0i ¼ ð0; r 0i1 xTi21 ; . . . ; k¼1 r 0ik xTijk Þ . 

(C9) The covariance matrix Wm is positive definite, and

0

d11 m

B Wm ¼ @ d21 m d31 m

d12 m d22 m d32 m

d13 m

1

0

d11

d12

C B ! W ¼ @ d21 d23 m A d33 m

d

d22

31

d

32

d13

1

C d23 Aas m ! 1; d33

T

where W is also a positive definite matrix which is evaluated at ð#T0 ; cT0 ; qT0 Þ . T ^T ; c ^T Þ of the RSGEE (5)–(7) and the regularity conditions (C1)–(C8) hold, ^T ; t Theorem 1. Suppose there is only one root ^ h ¼ ðb 1=ð2rþ1Þ together with kn ¼ Oðn Þ, we have T

^ ¼ ðb ^T ; c ^T Þ ^T ; t (a) the RSGEE estimator h is strongly consistent T T ^ ^T ; c ^T Þ ! h0 ¼ ðbT0 ; cT0 ; tT0 Þ almost surely as m ! 1. ^T ; t h ¼ ðb

for

the

true

value

T

h0 ¼ ðbT0 ; cT0 ; tT0 Þ ;

that

is

ni n m X o2 1X ^f 0 ðt Þ  f0 ðt Þ ¼ Op ðn2r=ð2rþ1Þ Þ; ij ij

(b) n

i¼1 j¼1

m X o2 i n 1X ^f 1 ðt Þ  f1 ðt Þ ¼ Op ðn2r=ð2rþ1Þ Þ; ij ij n i¼1 j¼1 n

^ ~. ^ ; ^f 1 ðtÞ ¼ pT ðtÞa where ^f 0 ðtÞ ¼ pT ðtÞa Remark 2. Under rather general conditions (see, e.g., Lemmas 8 and 9 in Stone [21]), (b) implies that 2 2 R ^ R ^ f 1 ðtÞ  f1 ðtÞ dt ¼ Op ðn2r=ð2rþ1Þ Þ. Meaning that we obtain the optimal rate of conf 0 ðtÞ  f0 ðtÞ dt ¼ Op ðn2r=ð2rþ1Þ Þ and vergence for estimating f0 and f1 under the smoothness condition (C2). T ^T ; c ^T Þ is asymptotically normally distrib^T ; t Theorem 2. Under the regularity conditions (C1)–(C9) hold, the RSGEE estimator ðb uted, that is,

8 0 11 0 1 ^  b0 > d b < pffiffiffiffiffiB C d B ^  c0 A ! N 0; @ 0 m@ c > : 0 t^  t0

0 d22 0

0

11 0

d11

C B 0 A @ d21 33

d

31

d

d12 d22 32

d

d13

10

d11

CB d23 A@ 0 33

d

0

0 d22 0

11 9 > = C : 0 A > ; d33 0

Remark 3. If the responses yi are drawn from normal distribution, we have dkl ¼ 0 ðk – lÞ and then the asymptotic covari 1 11 22 33 ance matrix becomes diagðd ; d ; d Þ . ^ From Theorem 2, we obtain the following sandwich formula to estimate the asymptotic covariance matrix of b:

^ 1 M ^ 1M ^ 1 ; ^ ¼M Cov ðbÞ 0 0 ^ 0 ¼ Pm X T DT ðV # Þ1 C# DT X  , and where M i i i¼1 i i i i

ð11Þ

348

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356

^1 ¼ M

m  T X 1 # # 1 # T X T ðV #i Þ Di X i ; i Di ðV i Þ hi ðli ð#ÞÞ hi ðli ð#ÞÞ i¼1

^ here all of the quantities involved are evaluated at #.

3. Smooth-threshold generalized estimating equations and the oracle property So far, we assume that all the covariates are important. However, in many practical studies, some covariate variables are not significant with the response variable. We can see that many shrinkage approaches have been proposed in the literature to select crucial variables from the discussion of introduction part, while most penalty functions are singular at zero. As an alternative method, Ueki [22] proposed smooth-threshold estimating equations (SEEs) which are easily implemented with Newton–Raphson type algorithms. We partition h0 into active (nonzero) and inactive (zero) coefficients as follows: let     A0 ¼ j : h0j – 0 and Ac0 ¼ j : h0j ¼ 0 be the complement of A0 . Denote by s ¼ jA0 j the number of true nonzero parameters. Following Ueki [22], we propose the following robust semiparametric smooth-threshold generalized estimating equations (RSSGEEs) to select significant variables and estimate nonzero parameters simultaneously, for the mean, generalized autoregressive parameters and innovation variances,

^ 1 ÞU 1 ð#Þ  N ^ 1 # ¼ 0; ðI pþK  N ^ 2 ÞU 2 ðcÞ  N ^ 2 c ¼ 0; ðI q  N

ð12Þ

^ 3 ÞU 3 ðqÞ  N ^ 3 q ¼ 0; N

ð14Þ

ðI dþK

ð13Þ

where

9 8 = < n o ^ 1 ¼ diag ^d1 ; . . . ; ^dp ; 0; . . . ; 0 ; N ^ 2 ¼ diag ^dpþKþ1 ; . . . ; ^dpþKþq ; N |fflfflfflffl{zfflfflfflffl}; : K

and

8 9 < = ^ 3 ¼ diag ^dpþKþqþ1 ; . . . ; ^dpþKþqþd ; 0; . . . ; 0 ; N |fflfflfflffl{zfflfflfflffl}; :

K  1þs     ^  ^ ^i j1þs ; i ¼ 1; . . . ; q; ^ ^i j1þs ; i ¼ 1; . . . ; d di ¼ min 1; k1 =b dpþKþi ¼ min 1; k2 =jc dpþKþqþi ¼ min 1; k3 =jt with ^ ; i ¼ 1; . . . ; p;  i pffiffiffiffiffi T ^T ; c ^T Þ is the initial m-consistent estimator which can be obtained by solving the RSGEE (5)–(7) in Section 2.2 ^T ; t and ðb T

^T ; c ^T Þ and called the RSS^T ; t for the full model. The estimate of h obtained from the RSSGEE (12)–(14) is denoted by ^ h ¼ ðb  GEE estimator. Let k be equal to either k1 ; k2 or k3 , depending on whether hj is a component of b; c or tð1 6 j 6 sÞ. Note that dj ¼ 1 reduces to ^ hj ¼ 0. Therefore, RSSGEE can yield a sparse solution. We further define the set of estithe jth RSSGEE with ^  1þs n o   ^ ¼ j:^ dj – 1 with ^ dj ¼ min 1; k=^ hj  ; j ¼ 1; . . . ; s. mated nonzero indices as A Theorem 3. Suppose that the conditions of Theorem 2 hold, and for any positive kj and mð1þsÞ=2 kj ! 1 for j ¼ 1; 2; 3 as m ! 1, we have

s such that m1=2 kj ! 0 and

^ ¼ A0 Þ ! 1 (a) variable selection consistency, i.e., PðA pffiffiffi (b) asymptotic normality, i.e., nð^ h;A0  h;A0 Þ is asymptotically normal with mean zero and covariance matrix same as when A0 is known.

Remark 4. Theorem 3 implies that our proposed RSSGEE estimators are consistent in variable selection. Moreover, the RSSGEE estimators have the oracle property; that is, the asymptotic variance for the RSSGEE estimate is the same as that we know the correct submodel a priori. 4. Issues in practical implementation 4.1. Algorithm T ^T; c ^ T Þ of Eqs. (12)–(14), we adopt the iteratively algorithm through fixing the other ^T ; q In order to find the solution ð#  parameters. We outline our iterative algorithm as follows:

349

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356 T

Step 1. Select an initial value ð#ð0ÞT ; cð0ÞT ; qð0ÞT Þ which can be obtained by solving (5)–(7) for the full model, and use models ð0Þ ð0Þ ð0Þ (2) and (3) to form Ui and Di , then Ri is obtained. Set k ¼ 0. ðkÞ ðkÞ ðkÞ Step 2. Compute Ri using c and q , update # by 8" #1 " #9 m m < X = X 1 # T # T # ðkþ1Þ ðkÞ ðkÞ ðkÞ 1 # ðkÞ T ðkÞ ðkÞ ðkÞ ðkÞ ^1 ^1  # # ¼#  Bi Di ð# ÞðV i ð# ÞÞ Ci ð# ÞDi ð# ÞBi  G  Bi Di ð# ÞðV i ð# ÞÞ hi ðli ð#ÞÞ  G ; : i¼1 ; i¼1

ð15Þ where

C#i

# ^ 1 ¼ ðI pþK  N ^ 1 Þ1 N ^ 1. ¼ Eh_ #i ðli ð#ÞÞ ¼ E@hi ðli Þ=@ li jli ¼li ð#Þ , for i ¼ 1; . . . ; m and G

Step 3. Given # ¼ #ðkþ1Þ and q ¼ qðkÞ ; c can be updated approximately through ðkþ1Þ

c

¼c

ðkÞ

8" #1 " #9 m m < X = X 1 c c ðkÞ 1 c ðkÞ c T T ðkÞ ðkÞ ðkÞ ^2 ^2  c  T i ðV i ðc ÞÞ Ci ðc ÞT i  G  T i ðV i ðc ÞÞ hi ð^ri ðc ÞÞ  G ; : i¼1 ; i¼1

ð16Þ

c ^ 2 ¼ ðI q  N ^ 2 Þ1 N ^ 2. r i ðcÞÞ ¼ E@hi ð^ r i Þ=@ ^ r i j^ri ¼^ri ðcÞ , for i ¼ 1; . . . ; m and G where Cci ¼ Eh_ ci ð^ ðkþ1Þ ðkþ1Þ Step 4. Given # ¼ # and c ¼ c the innovation variance parameters q can be updated using 8" #1 " #9 < X = m m X q ðkÞ 1 q ðkÞ T ðkÞ q ðkÞ 1 q 2 ðkÞ T T ðkþ1Þ ðkÞ ðkÞ ðkÞ ðkÞ ^ ^ q ¼q  H D ðq ÞðV i ðq ÞÞ Ci ðq ÞDi ðq ÞH i  G3  H i Di ðq ÞðV i ðq ÞÞ hi ðri ðq ÞÞ  G3  q ; : i¼1 i i ; i¼1

ð17Þ q

1

q

_q

^ 3 ¼ ðI dþK  N ^ 3Þ N ^ 3. where Ci ¼ Ehi ðr2i ðqÞÞ ¼ E@hi ðr2i Þ=@ r2i jr2 ¼r2 ðqÞ , for i ¼ 1; . . . ; m and G i

Step 5. Set k

i

k þ 1 and repeat Step 2-Step 5 until a prespecified convergence criterion is met.

^ : From the algorithm above, we obtain the following sandwich formula to estimate the asymptotic covariance matrix of b

^ 1 M ^ 1M ^ 1 ; ^ Þ ¼ M Cov ðb 0 0

ð18Þ

    1þs  ^ 1þs ^  ^ 0 ¼ ^ 1 is the same as where M  G , with G ¼ ðI p  D Þ D ; D ¼ diag k1 =b ; . . . ; k = ;M  1 1 bp  ^ that in (11) and all of the quantities involved are evaluated at # . Pm

T T # 1 # T  i¼1 X i Di ðV i Þ Ci Di X i



 1







4.2. Tuning parameters selection To implement the procedures described in Section 4.1, we need to determine the tuning parameters ðk1 ; k2 ; k3 ; sÞ . One can select ðk1 ; k2 ; k3 ; sÞ through optimizing some data-driven criteria through keeping the balance of both goodness of fit and model complexity. In order to simplify the calculation, we adopt s ¼ 1 in our simulation studies. Following Li et al. [11], we use robust PWD-type criteria to choose these parameters. That is, we choose k1 as the minimizer of

PWDk1 ¼ WDev 1 þ dfk1 logðnÞ; where WDev 1 ¼

Pm

ð19Þ

# T 1=2 1 1=2 # Ri Ai hi , i¼1 ðhi Þ Ai

and dfk1 ¼

Pp

^ j¼1 1ðdj – 1Þ denotes the number of nonzero parameters with 1ðÞ the

indicator function. The selected k1 minimizes the PWDk1 . Similar to the choice of k1 we select tuning parameters k2 ; k3 by minimizing the robustified PWD-type criteria

PWDk2 ¼ WDev 2 þ dfk2 logðnÞ; Pm

c T c

PWDk3 ¼ WDev 3 þ dfk3 logðnÞ; Pm

q T ~ 1=2 ~ 1 ~ 1=2 q

ð20Þ PpþKþq

dj – 1Þ and dfk3 ¼ where WDev 2 ¼ i¼1 ðhi Þ hi , WDev 3 ¼ i¼1 ðhi Þ Ai Ri Ai hi , dfk2 ¼ j¼pþKþ1 1ð^ To avoid the computational burden, we choose these tuning parameters as follows: (1) (2) (3) (4)

PpþKþqþd

^ – 1Þ.

j¼pþKþqþ1 1ðdj

Fix k2 ¼ k3 ¼ 0, choose k1opt ¼ argmink1 PWDk1 , Fix k1 ¼ k1opt and k3 ¼ 0, choose k2opt ¼ argmink2 PWDk2 , Fix k1 ¼ k1opt and k2 ¼ k2opt , choose k3opt ¼ argmink3 PWDk3 . The final choice is kopt ¼ ðk1opt ; k2opt ; k3opt Þ.

5. Numerical studies We conduct numerical studies to assess the finite sample performance of the proposed method from two aspects: efficiency of the proposed robust method compared with the corresponding non-robust version and robustness against outliers. We compare the simulation results of our proposed RSGEE and RSSGEE estimators with the SGEE and SSGEE estimators (Leng et al. [10]) which are defined through the same equations except that wðxÞ ¼ x and W i ¼ I i , where I i are ni  ni identity matrices, the following criteria are adopted:

350

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356

^ c ^ will be assessed by using the average mean square error (AMSE), which are ^ and t (1) The of estimators b; performance b ^  b 2 ; kc ^  tk2 averaged over 500 simulated data sets. ^  ck2 and kt (2) The performance of estimators ^f 0 ð:Þ and ^f 1 ð:Þ will be assessed by using the square root of average square errors (RASE)

( RASE ¼

n1

ni  m X 2 X ^f ðt Þ  f ðt Þ l ij l ij

)12 for l ¼ 0; 1:

i¼1 j¼1

(3) In addition to the AMSE and RASE criteria that we considered, we introduce the entropy loss function to assess how the two methods work in estimating the covariance matrix, which is defined by

^ ¼ m1 ELðR; RÞ

m X     ^ 1 Þ  log Ri R ^ 1   ni : traceðRi R i i i¼1

^ i is its estimator. where Ri is the true covariance matrix and R (4) The average number of zero coefficients that are correctly estimated to be zero: C. (5) The average number of nonzero coefficients that are incorrectly estimated to be zero: IC. (6) The proportion of correctly fitted models: CF. To study robustness of the proposed method, we consider the following two types of contaminations: Case 1 : randomly choose 3% of yij to be yij þ 6; ð1Þ ð1Þ Case 2 : randomly choose 2% of xij to be xij  3 and 2% of yij to be yij þ 6. Example 1. In this simulation study, the model is the same as that in Leng et al. [10] ð1Þ

ð2Þ

yij ¼ xij b1 þ xij b2 þ f0 ðt ij Þ þ eij ; where xij ¼

ð1Þ ð2Þ ðxij ; xij Þ; b1

i ¼ 1; . . . ; m;

j ¼ 1; . . . ; ni ;

¼ 1; b2 ¼ 0:5; f0 ðtÞ ¼ cosðptÞ and eij  Nð0; Ri Þ with Ri to be specified later. We use similar sampling

scheme in Fan et al. [2] such that the observation times are regularly scheduled but may be randomly missing in practice. In practical terms, this means that each subject is supposed to be measured at scheduled time points f0; 1; 2 . . . ; 12g, where the number of repeated measurements, ni , is randomly chosen such that each time point (excluding time 0) has a 20% probability of being missing, resulting in different ni for different subjects. The actual measurement times are generated by adding a uniform ½0; 1 random variable to a non-missing scheduled times, we further transform tij onto interval ½0; 1. ð1Þ ð2Þ We take xij ¼ t ij þ dij , where dij is generated from standard normal distribution and xij comes from a Bernoulli distribution with success probability 0.5. The error ðei1 ; . . . ; eini Þ follows a multivariate normal distribution with mean 0 and covariance Ri satisfying Ui Ri UTi ¼ Di , where Ui and Di are described in Section 2.1 with xijk ¼ ð1; ðt ij  t ik ÞÞT and zij ¼ xij together with c ¼ ð0:2; 0:3ÞT , k ¼ ð0:5; 0:2ÞT and f1 ðtÞ ¼ sinðptÞ. We use the first-order autoregressive (ar1) and exchangeable struc~ i ¼ F 1=2 Ri ðdÞF 1=2 . For different d, the results are similar pointed out by Leng et al. [10], so we only list ture (exch) for Ri ðdÞ in R i

i

the results of ar1 and exch with d ¼ 0:5 and omit other results for brevity. The number of the knots is taken to be ½n1=5  and ½s denotes the largest integer not greater than s. In our simulations, the j in the weight function wij is chosen to be j ¼ 1 and the constant c in Huber’s function is chosen to be 2. Table 1 shows the performances of SGEE and RSGEE estimators for both the mean, autoregressive parameters and the covariance under two types of contaminations. First we notice that the robust method outperforms the non-robust method in terms of smaller AMSEs for the estimate of b. Secondly, two methods show little difference in estimating c (our method is even slightly better), because the corresponding covariates xjk for c only contain t, which has no contamination at all. To some extent, it supports that the proposed robust method performs equally well when there is no contamination in c. Thirdly, as for t, the robust method performs favorably in comparison with the non-robust method in estimation of the covariance matrix. This point supports a better estimate for the covariance matrix in the robust method resulting a better estimate for the mean parameter b. Fourthly, as for f0 and f1 , RSGEE has smaller RASEs than that of SGEE. Finally, we focus on the evaluation on the overall performance of the covariance matrix estimation, the robust method reduces the entropy loss substantially especially when the contamination is heavier. In summary, the proposed robust method generally outperforms the non-robust method under different contaminations. Table 2 is used to illustrate the performance of the sandwich formula (11), ‘‘SD’’ represents the sample standard deviation of 500 estimated coefficients in the 500 simulations which can be regarded as the true standard error. ‘‘SE’’ represents the sample average of 500 estimated standard errors by using sandwich formula (11), and ‘‘Std’’ represents the estimated standard deviation of these 500 estimated standard errors. Table 2 demonstrates that SD and SE are quite close, especially for large m. This indicates that the standard error formula (11) works well for different correlation structures. In addition, from Tables 1 and 2, we conclude that the choice of structure has no significant influence on the estimators. In the rest of the article, we only consider ar1 structure with d ¼ 0:5.

351

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356 Table 1 Simulation results for Example 1. n Case1

ar1

100

Method SGEE RSGEE

exch

100

SGEE RSGEE

ar1

400

SGEE RSGEE

exch

400

SGEE RSGEE

Case2

ar1

100

SGEE RSGEE

exch

100

SGEE RSGEE

ar1

400

SGEE RSGEE

exch

400

SGEE RSGEE

b1 AMSE

b2 AMSE

c1

c2

t1

t2

AMSE

AMSE

AMSE

AMSE

f0 RASE

f1 RASE

R EL

0.276 (0.388) 0.232 (0.284) 0.261 (0.368) 0.204 (0.307) 0.059 (0.074) 0.058 (0.071) 0.058 (0.091) 0.048 (0.063)

0.908 (1.103) 0.592 (0.770) 0.890 (1.257) 0.647 (0.922) 0.265 (0.443) 0.201 (0.324) 0.261 (0.437) 0.196 (0.334)

0.377 (0.351) 0.372 (0.342) 0.345 (0.364) 0.347 (0.355) 0.306 (0.189) 0.292 (0.171) 0.327 (0.181) 0.321 (0.179)

5.592 (4.299) 5.406 (4.136) 5.158 (4.357) 5.113 (4.101) 4.898 (2.404) 4.643 (2.174) 5.113 (2.234) 5.012 (2.203)

1.723 (1.745) 1.214 (1.226) 1.828 (1.760) 1.188 (1.143) 1.090 (0.857) 0.918 (0.627) 1.175 (0.695) 1.024 (0.563)

2.233 (3.037) 1.723 (2.421) 2.416 (3.147) 1.444 (1.904) 0.488 (0.669) 0.413 (0.618) 0.651 (0.819) 0.448 (0.556)

28.91 (15.01) 24.93 (14.11) 28.55 (14.61) 25.42 (12.93) 21.51 (9.627) 17.48 (9.597) 19.86 (7.267) 15.77 (7.254)

96.60 (9.905) 41.86 (7.095) 97.72 (9.771) 41.68 (7.043) 96.11 (4.906) 41.34 (3.558) 95.23 (5.101) 41.18 (3.377)

8.261 (1.474) 1.173 (0.230) 8.207 (1.526) 1.198 (0.219) 7.659 (0.656) 1.104 (0.109) 7.611 (0.701) 1.111 (0.101)

0.789 (0.769) 0.618 (0.638) 0.884 (0.834) 0.645 (0.656 0.741 (0.398) 0.524 (0.289) 0.687 (0.362) 0.488 (0.314)

1.013 (1.470) 0.826 (1.205) 0.806 (1.074) 0.647 (0.870) 0.170 (0.231) 0.141 (0.184) 0.248 (0.395) 0.198 (0.295)

0.291 (0.300) 0.288 (0.293) 0.311 (0.311) 0.315 (0.314) 0.278 (0.143) 0.264 (0.131) 0.278 (0.152) 0.257 (0.138)

4.198 (3.518) 4.113 (3.397) 4.409 (3.736) 4.349 (3.660) 4.147 (1.697) 3.924 (1.553) 4.172 (1.824) 3.884 (1.667)

1.018 (1.109) 0.613 (0.704) 1.124 (1.323) 0.690 (0.760) 0.586 (0.502) 0.477 (0.347) 0.620 (0.547) 0.563 (0.361)

2.623 (4.109) 1.532 (1.912) 2.285 (2.995) 1.167 (1.769) 0.634 (0.886) 0.490 (0.613) 0.591 (0.871) 0.352 (0.519)

29.22 (14.82) 26.86 (14.76) 30.00 (14.92) 25.89 (14.09) 23.26 (8.948) 18.43 (8.373) 23.26 (8.675) 18.23 (8.306)

98.30 (9.758) 37.19 (6.723) 99.66 (10.12) 35.58 (6.297) 99.06 (4.951) 35.51 (3.587) 99.07 (5.262) 34.83 (3.344)

8.804 (1.482) 0.902 (0.205) 8.860 (1.156) 0.883 (0.211) 8.324 (0.694) 0.815 (0.094) 8.335 (0.716) 0.817 (0.098)

^ c ^ estimators with corresponding sample standard errors (102 , in parentheses), ^ and t Simulation results of average mean square error (AMSE 102 ) for b; and average square root of average square errors (RASE 102 ) for nonparametric estimators ^f 0 ðÞ and ^f 1 ðÞ with corresponding sample standard errors (102 , in parentheses). In addition, we use average entropy loss function (EL) to assess how the two methods work in estimating the covariance matrix together with sample standard errors (in parentheses) under two types of contaminations with 500 replications for different sample size. In order to show that the choice of working correlation structure has no significant influence on the estimators, ar1 and exch working correlation matrices with d ¼ 0:5 are considered.

Example 2. This example is used to illustrate the performance of the robust semiparametric smooth-threshold generalized estimating equations (RSSGEEs) described in Section 3. We take the true values of the mean parameters, generalized autoregressive parameters and log-innovation variances as b ¼ ð1; 0:5; 0; 0; 0; 0; 0ÞT ; c ¼ ð0:2; 0:3; 0; 0; 0; 0; 0ÞT and k ¼ ð0:5; 0:5; 0; 0; 0; 0; 0ÞT , respectively. The mean covariates xij ¼ ðxijt Þ7t¼1 are generated from a standard normal distribution with the correlation between the kth and lth component of xij equal to 0:5jlkj . Similar to Zheng et al. [32], we choose xijk ¼ ðxijt  xikt Þ7t¼1 and zij ¼ ðxijt Þ7t¼1 as covariates for the generalized autoregressive parameters and log-innovation variances. Using these values, the mean li and covariance matrix Ri are constructed through the modified Cholesky decomposition (2). Then, the responses yi ’s are drawn from the multivariate normal distribution Nðli ; Ri Þ. Table 3 summarize model selection results of SSGEE, RSSGEE and their corresponding oracle’s methods for two types of contaminations. Firstly, with the number of subjects m increasing, AMSEs of all methods decrease, and for RSSGEE, their CF approach to 1 and the values in the column labeled ‘‘C’’ become more and more closer to the true number of zero regression coefficients in the models. These results are consistent with the oracle property. Secondly, our RSSGEE method apparently outperforms SSGEE method in both estimation efficiency and variable selection, especially in covariance model identification. Non-robust method can not correctly identify innovation variance model under contaminations in most replications, which indicates that the non-robust joint model is sensitive to perturbations. To investigate the influence of outliers on nonparametric functions and covariance estimations. We list RASEs of f0 and f1 and entropy loss (EL) in Table 4, we can find that the RSSGEE method outperforms SSGEE method under different contaminations. Moreover, Table 5 shows that sandwich formula (18) works well for the RSSGEE estimator.

352

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356

Table 2 Assessment of the standard errors using formula (11) for the RSGEE estimator in Example 1. n

Case1

100

ar1

b1 b2 b1 b2 b1 b2 b1 b2

exch 400

ar1 exch

Case2

SD

SE(Std)

SD

SE(Std)

0.046 0.077 0.045 0.080 0.021 0.045 0.021 0.044

0.038(0.016) 0.075(0.018) 0.039(0.009) 0.076(0.014) 0.015(0.010) 0.042(0.018) 0.021(0.005) 0.040(0.005)

0.046 0.091 0.043 0.081 0.021 0.037 0.015 0.045

0.038(0.016) 0.075(0.023) 0.037(0.014) 0.072(0.021) 0.020(0.004) 0.037(0.005) 0.019(0.005) 0.039(0.005)

Simulation results of average sample standard deviation (SD), average estimated standard errors by using sandwich formula (11) (SE) and estimated standard deviation of these estimated standard errors (Std) under two types of contaminations with 500 replications for different sample size. We consider ar1 and exch working correlation structures with d ¼ 0:5 to prove that the choice of working correlation structure has no significant influence on the estimators of standard errors using formula (11).

Table 3 Variable selections and parameter estimations for Example 2, where the correlation structure is ar1 with d ¼ 0:5. n

100

Method

SSGEE

Case1

b

c t RSSGEE

b

c t 300

SSGEE

b

c t RSSGEE

b

c t

Case2

C

IC

CF

Oracle

Total

C

IC

CF

Oracle

Total

4.46 4.99 4.06 4.81 4.88 4.83 4.83 5 4.75 4.99 5 4.99

0 0 0.61 0 0 0.05 0 0 0.77 0 0 0

0.59 0.99 0.14 0.82 0.88 0.81 0.84 1 0.25 0.99 1 0.99

0.0103 0.0012 0.0992 0.0102 0.0010 0.0428 0.0028 0.0010 0.1501 0.0026 0.0007 0.0259

0.0201 0.0013 0.2689 0.0131 0.0011 0.0562 0.0046 0.0011 0.3057 0.0037 0.0007 0.0337

4.42 3.88 2.07 4.79 4.55 4.81 4.94 4.62 0.33 4.99 4.98 5

0 0 0.08 0 0 0.05 0 0 0 0 0 0

0.61 0.36 0.27 0.84 0.67 0.81 0.94 0.76 0.05 0.99 0.99 1

0.0474 0.0025 0.0682 0.0306 0.0023 0.0384 0.0444 0.0024 0.0404 0.0262 0.0018 0.0323

0.0597 0.0039 0.3919 0.0376 0.0025 0.0576 0.0480 0.0026 0.1157 0.0280 0.0018 0.0324

Simulation results of average number of the true zero coefficients correctly estimated to zero (C); the average number of non-zero coefficients incorrectly estimated to be zero (IC) and average correctly fit percentage (CF) for both robust (RSSGEE) and non-robust (SSGEE) methods, under two types of Pp1 P ^k Þ; total ¼ Pp AMSEðb ^k Þ, for c; oracle ¼ Pq1 AMSEðc ^k Þ; total ¼ qk¼1 AMSEðc ^k Þ and for contaminations with 500 replications. For b; oracle ¼ k¼1 AMSEðb k¼1 k¼1 P1 P ^k Þ; total ¼ dk¼1 AMSEðt ^k Þ, where p1 ; q1 and d1 stand for the number of nonzero variables of b; c and t, respectively. t; oracle ¼ dk¼1 AMSEðt

Table 4 Simulation results of RASE for f0 ; f1 and standard errors (inside brackets). In addition, we introduce the entropy loss function to assess how the two methods work in estimating the covariance matrix for Example 2. n

Method

Case1 RASE0

Case2 RASE1

EL

RASE0

RASE1

EL

100

SSGEE 0.2343(0.0533) 0.9445(0.0857) 4.3037 0.2353(0.0500) 2.0341(1.4687) 9.2215 RSSGEE 0.2186(0.0528) 0.5987(0.0697) 2.2935 0.2073(0.0664) 0.6173(0.0832) 3.1293 300 SSGEE 0.2087(0.0285) 0.9814(0.0601) 4.4116 0.1886(0.0524) 0.9000(0.7569) 8.7036 RSSGEE 0.1805(0.0276) 0.5910(0.0473) 1.9693 0.1565(0.0513) 0.5733(0.1512) 2.4365  2 12 P Pni ^ P  ^ ¼ m1 m traceðRi R ^ 1 Þ Notation: RASEl ¼ n1 m for l ¼ 0; 1 and the entropy loss function is defined by ELðR; RÞ i¼1 i¼1 i j¼1 f l ðt ij Þ  fl ðt ij Þ   ^ i is its estimator. ^ 1   ni g, where Ri is the true covariance matrix and R log Ri R i

6. Conclusion In this paper, we develop the robust estimation based on the Huber’s score function and Mallows weights function for the SMCM in longitudinal analysis by using the modified Cholesky decomposition and approximating the nonparametric functions by B-splines. The proposed model is more flexible than that of Ye and Pan [28] and has less deviation when the parametric assumption is fail to meet. The proposed method can control the influence of outliers in both the response and

353

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356 Table 5 Assessment of the standard errors using formula (18) for the RSSGEE estimator in Example 2. n

Case1

100

b1 b2 b1 b2

300

Case2

SD

SE(Std)

SD

SE(Std)

0.0659 0.0658 0.0385 0.0381

0.0568(0.0163) 0.0617(0.0216) 0.0366(0.0118) 0.0349(0.0119)

0.0768 0.0679 0.0408 0.0343

0.0625(0.0344) 0.0514(0.0235) 0.0394(0.0244) 0.0304(0.0140)

Simulation results of average sample standard deviation (SD), average estimated standard errors by using sandwich formula (18) (SE) and estimated standard deviation of these estimated standard errors (Std) under two types of contaminations with 500 replications. We consider ar1 working correlation structure with d ¼ 0:5.

covariates directions based on the bounded Huber’s score function w and weight function W. Furthermore, we propose an automatic variable selection procedure to eliminate irrelevant parameters by setting them as zero for the SMCM based on robust smooth-threshold generalized estimating equations. The advantages of this method are that the estimator can be obtained without solving a convex optimization problem and the resulting estimator enjoys the oracle property. However, the potential pitfall of the newly proposed methodology is that the above estimating functions (5)–(7) and smooth-threshold generalized estimating Eqs. (12)–(14) require evaluation of C #i ; C ci and C qi , which are used to ensure the Fisher consistency and deemed to be difficult without distributional assumptions. Qin and Zhu [18] and Zheng et al. [31] have discussed the difficulty of calculating C #i ; C ci and C qi . They pointed out that C #i ; C ci and C qi can be calculated easily for binary and possion data while hard to obtain for other distributions. As an alternative, some numerical integration methods or approximations are required to achieve the numerical values of C #i ; C ci and C qi . A bias correction method proposed by Wang et al. [25] is another tool to overcome this problem. Finally, it will be of interest to investigate the statistical properties with diverging numbers of parameters or high-dimensional data with the number of variables being larger than the sample size in both the mean and variance. When referring to big data and trends analysis with mean-covariance regression. See Gao et al. [5] and Perc [16,17] for references. Xu et al. [27] proposed a novel GEE-based screening procedure to deal with the time course genomic data with ultrahigh dimensions. Based on the screening method [27], we may expect that our proposed method can still work well. We can take a two-step method. The idea is that we first apply the screening method to reduce the dimension to a number smaller than to sample size with probability tending to one, and then we use our proposed procedure to remaining variables. Research in this aspect is ongoing. Acknowledgements The authors are very grateful to the editor, associate editor, and three anonymous referees for their detailed comments on the earlier version of the manuscript, which led to a much improved paper. This work is supported by the National Natural Science Foundation of China (Grant No. 11171361) and Ph.D. Programs Foundation of Ministry of Education of China (Grant No. 20110191110033). Appendix A ~ 0 2 RK such that: Lemma 1. Under conditions (C1)–(C2), for two constants C 0 and C 1 , there exists two vectors a0 2 RK and a

  r sup f0 ðtÞ  pT ðtÞa0  6 C 0 kn ;

t2½0;1

  ~ 0  6 C 1 kr sup f1 ðtÞ  pT ðtÞa n :

t2½0;1

Proof of Lemma 1. Lemma 1 follows directly from Corollary 6.21 of Schumaker [20]. Hence, we omit it here.

h

^ ! b0 almost surely and the proofs for c ^ and ^ Proof of Theorem 1. We only give the proof of b k are similar, so we omit for brevity. According to McCullagh [14], we have

( )1 m 1X T T # 1 #  ^ b  b0 ¼  X D ðV Þ Ci Di X i m i¼1 i i i

(

) m 1X T T # 1 # X D ðV Þ hi ðli ð#ÞÞ m i¼1 i i i



þ op ðm1=2 Þ:

ðA:1Þ

#¼#0

#¼#0

1 #

~ 1i ¼ X T DT ðV # Þ h ðl ð#ÞÞ at # ¼ #0 are given by E0 ðU ~ 1i Þ ¼ 0 and On the other hand, the expectation and variance matrix of U i i i i i

n

~ 1i Þ ¼ X T DT ðV # Þ1 C# Di X  V 0 ðU i i i i i

o

#¼#0

T

1

¼ ðD0i X i Þ ðV #i Þ C#i ðD0i X i Þ:

ðA:2Þ

354

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356

 T 1 1=2 ~ Since V #i ¼ A1=2 Ri and R1 ¼ UTi D1 ÞC#i ðUi D0i X i Þ. By Coni i Ui , the variance can be rewritten as V 0 ðU 1i Þ ¼ ðUi D0i X i Þ ðDi Ai i ~ 1i Þ 6 j0 1pp for all i and all #, where 1pp is the p  p matrix with all dition (C3), there exists a constant j0 such that V 0 ðU P 2 ~ elements being 1’s, that is, all elements of V 0 are bounded by j0 . Thus 1 i¼1 V 0 ðU 1i Þ=i 6 1. By Kolmogorovs strong law of large numbers we have

(

) m 1X T T # 1 # X D ðV Þ hi ðli ð#ÞÞ m i¼1 i i i

! 0;

ðA:3Þ

#¼#0

almost surely as m ! 1. With the same argument, we can show that the matrix

(

m 1X 1 X T DT ðV # Þ C#i Di X i m i¼1 i i i

) ; #¼#0

^ ! b0 almost surely as m ! 1. Thus, we complete the proof of (a). in (A.1) is bounded, together with (A.3), we have b Next, we can show the proof of (b) directly from He et al. [6] and Leng et al. [10]. The proof is complete. h In order to demonstrate the asymptotic normalities of estimators, we give some notations first. Let

Am ¼

m X # 1 # T  T X T i D0i ðV 0i Þ C0i D0i X i ; i¼1

m X 1 T T0i ðV c0i Þ Cc0i T 0i ; Bm ¼ i¼1 m X q 1 q  Z T Cm ¼ i D0i ðV 0i Þ C0i D0i Z i : i¼1

~ ^ ¼ A1=2 ðb ^  b0 Þ; 1 ~ ¼ A1=2 1 ¼ Am1=2 ðb  b0 Þ; 1^ ¼ 1^ðbÞ m m U1; 1=2 1=2 1=2 ~ g ¼ Bm ðc  c0 Þ; g^ ¼ g^ ðc^Þ ¼ Bm ðc^  c0 Þ; g~ ¼ Bm U 2 ; 1=2 ~ ^ ~ m ¼ C m1=2 ðt  t0 Þ; m^ ¼ 1^ðt^Þ ¼ C 1=2 m ðt  t0 Þ; m ¼ C m U 3 : Lemma 2. Suppose the conditions (C1)–(C3) hold, we have

^1 ~k ¼ op ð1Þ; k1 g ^g ~ ¼ op ð1Þ;

ðA:5Þ

km^  m~k ¼ op ð1Þ:

ðA:6Þ

ðA:4Þ

Proof of Lemma 2. Define

uð1Þ ¼ Am1=2

m X # 1 # T X T i Di ðli ð#ÞÞðV i Þ hi ðli ð#ÞÞ; i¼1

~ /ð1Þ ¼ A1=2 m U 01 ð1Þ  1: P T T # 1 ~ Conditions (C1)–(C3) imply that uð1Þ and m i¼1 X i Di ðli ð#ÞÞðV i Þ hi ðli ð#ÞÞ have same root for 1. Moreover, 1 is the solution of /. Following the proof of Theorem 1 in [6,10], we immediately obtain that

sup kuð1Þ  /ð1Þk ¼ op ð1Þ; k1k ¼ Op ð1Þ; k1k6L

where L is a sufficiently large number. Using the Brouwer’s fixed-point theorem, we can prove (A.4). We can prove (A.5) and (A.6) similarly. h  T T T T pffiffiffiffiffi ~ ;1 ~ ;g ~ = m. Meaning that we Proof of Theorem 2. By Lemma 2, we only need to show the asymptotic normality of 1  T pffiffiffiffiffi ~ T; U ~ T; U ~ T = m. By Conditions (C1), (C3), (C4) and (C6), we have prove the asymptotic normality of U 1 2 3

h n o n o n o i3 1 c T q 1 q 2 # 1 # T E aT X T T T0i ðV c0i Þ hi ð^r 0i Þ þ c T Z T < j; i D0i ðV 0i Þ hi ðli ð#0 ÞÞ þ b i D0i ðV 0i Þ hi ðr0i Þ for any a 2 Rp ; b 2 Rq and c 2 Rd , where Furthermore, we have

j is a constant independent of i.

355

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356 m h n o n o n oi 1X 1 c T q 1 q 2 # 1 # T V aT X T T T0i ðV c0i Þ hi ð^r 0i Þ þ c T Z T i D0i ðV 0i Þ hi ðli ð#0 ÞÞ þ b i D0i ðV 0i Þ hi ðr0i Þ m i¼1 T

¼ ðaT ; b ; c T Þ

T T 1 T T T Wm ðaT ; b ; c T Þ ! ðaT ; b ; c T ÞWðaT ; b ; c T Þ > 0; m

 T ffi ~ T; U ~ T; U ~ T =pffiffiffiffi m is because of W is the positive definite matrix by condition (C9). Therefore, the asymptotic normality of U 1 2 3 easily proved by multivariate Lyapunov central limit theorem. Then,

0 1 0 10 ~ pffiffiffiffiffi 1 ^  b0 0 0 ðAm =mÞ1 U1= m b pffiffiffiffiffiB B CB ~ pffiffiffiffiffi C ^  c0 C m@ c A¼@ A@ U 2 = m A 0 0 ðBm =mÞ1 ffi ~ 3 =pffiffiffiffi k^  k0 m U 0 0 ðC m =mÞ1 8 0 11 11 0 11 1 0 1 1 9 > > d 0 0 d d12 d13 d11 0 0 < = d B C B CB C ! N 0; @ 0 d22 0 A @ d21 d22 d23 A@ 0 d22 0 A : > > : ; 0 0 d33 0 0 d33 d31 d32 d33 The proof of Theorem 2 is completed.

h

T ^T ; c ^T Þ obtained from (5)–(7) ^T ; t Proof of Theorem 3. By the proof of Theorem 2, we can show that the initial estimator ^ h ¼ ðb pffiffiffiffiffi is m-consistent. Let k be equal to either k1 ; k2 or k3 , depending on whether hj is a component of b; c or tð1 6 j 6 sÞ. By the c c c hj satisfies definition  of  A  in Section 3, we have h0j ¼ 0 for j 2 A0 . For any given j 2 A0 , we have that the initial estimator ^ ^  ^  0 1=2 ð1þsÞ=2 Þ, together with m k ! 1, we can derive that hj  h0j  ¼ hj  ¼ Op ðm

 

 

   1þs  1þs  1þs ¼ k1 Oðmð1þsÞ=2 Þ ! 0; P k=^hj  < 1 ¼ P ^hj  > k 6 k1 E ^hj 

ðA:7Þ

    hj  ¼ Op ðm1=2 Þ for where the first inequality applies Markov’s inequality and the second equality follows from the fact ^ c j 2 A0 . This implies that

n o P ^dj ¼ 1 for all j 2 Ac0 ! 1:

On the other hand, by the condition m1=2 k ! 0, for e > 0 and j 2 A0 , we have  



       1=ð1þsÞ 1=ð1þsÞ  1þs   P ^dj > m1=2 e ¼ P k=^hj  > m1=2 e ¼ P ðm1=2 k=eÞ > ^hj  6 P ðm1=2 k=eÞ > minh0j   Op ðm1=2 Þ ! 0; j2A0

n

dj ¼ op ðm1=2 Þ for each j 2 A0 . Therefore, we prove that P ^ dj < 1 for all j 2 A0 which implies that ^ the proof of (a). Next we will prove (b). From (a), the RSSGEE (12)–(14) coincide with

o

! 1. Thus, we complete

^j Þ  ^dj b ^j ¼ 0; ð1  ^dj Þu1j ðb

for j 2 A0 ;

ðA:8Þ

^j Þ  ^dj c ^j ¼ 0; ð1  ^dj Þu2j ðc

for j 2 A0 ;

ðA:9Þ

^j Þ  ^dj t ^j ¼ 0; ð1  ^dj Þu3j ðt

for j 2 A0 ;

ðA:10Þ

^j ¼ 0; c ^j ¼ 0 and t ^j ¼ 0 for j 2 Ac0 with probability tending to one, where u1j ; u2j and u3j are the jth component of U 1 ; U 2 and b dj ¼ op ðm1=2 Þ for j 2 A0 , so the term ^ dj is asymptotically negligible. So we can show that (A.8)– and U 3 for j 2 A0 . Using that ^ (A.10) are asymptotically equivalent to u1j ¼ 0; u2j ¼ 0 and u3j ¼ 0 and the asymptotic normality follows the same way as in the proof of Theorem 2. h References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]

R.J. Carroll, Variances are not always nuisance parameters, Biometrics 59 (2003) 211–220. J. Fan, T. Huang, R. Li, Analysis of longitudinal data with semiparametric estimation of covariance function, J. Am. Stat. Assoc. 35 (2007) 632–641. J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001) 1348–1360. Y. Fan, G. Qin, Z. Zhu, Variable selection in robust regression models for longitudinal data, J. Mult. Anal. 109 (2012) 156–167. J. Gao, J. Hu, X. Mao, M. Perc, Culturomics meets random fractal theory: insights into long-range correlations of social and natural phenomena over the past two centuries, J. R. Soc. Interface 9 (2012) 1956–1964. X. He, W.K. Fung, Z. Zhu, Robust estimation in generalized partial linear models for clustered data, J. Am. Stat. Assoc. 472 (2005) 1176–1184. X. He, P. Shi, Bivariate tensor-product B-splines in a partly linear model, J. Mult. Anal. 58 (1996) 162–181. S. Kodba, M. Perc, M. Marhl, Detecting chaos from a time series, Eur. J. Phys. 26 (2005) 205–215. C. Leng, W. Zhang, Smoothing combined estimating equations in quantile regression for longitudinal data, Stat. Comput. 24 (2014) 123–136. C. Leng, W. Zhang, J. Pan, Semiparametric mean-covariance regression analysis for longitudinal data, J. Am. Stat. Assoc. 105 (2010) 181–193. G. Li, H. Lian, S. Feng, L. Zhu, Automatic variable selection for longitudinal generalized linear models, Comput. Stat. Data. Anal. 61 (2013) 174–186. K.Y. Liang, S.L. Zeger, Longitudinal data analysis using generalized linear models, Biometrika 73 (1986) 13–22.

356

C. Guo et al. / Applied Mathematics and Computation 245 (2014) 343–356

[13] J. Mao, Z. Zhu, W.K. Fung, Joint estimation of mean-covariance model for longitudinal data with basis function approximations, Comput. Stat. Data. Anal. 55 (2011) 983–992. [14] P. McCullagh, Quasi-likelihood functions, Ann. Stat. 11 (1983) 59–67. [15] M. Perc, The dynamics of human gait, Eur. J. Phys. 26 (2005) 525–534. [16] M. Perc, Evolution of the most common English words and phrases over the centuries, J. R. Soc. Interface 9 (2012) 3323–3328. [17] M. Perc, Self-organization of progress across the century of physics, Sci. Rep. 3 (2013) 1720. [18] G. Qin, Z. Zhu, Robust estimation in generalized semiparametric mixed models for longitudinal data, J. Mult. Anal. 98 (2007) 1658–1683. [19] G. Qin, Z. Zhu, W.K. Fung, Robust estimation of covariance parameters in partial linear model for longitudinal data, J. Stat. Plan. Infer. 139 (2009) 558– 570. [20] L. Schumaker, Spline Functions: Basic Theory, Wiley, New York, 1981. [21] C. Stone, Additive regression and other nonparametric models, Ann. Stat. 13 (1985) 689–705. [22] M. Ueki, A note on automatic variable selection using smooth-threshold estimating equations, Biometrika 96 (2009) 1005–1011. [23] L. Wang, J. Zhou, A. Qu, Penalized generalized estimating equations for high-dimensional longitudinal data analysis, Biometrics 68 (2012) 353–360. [24] N. Wang, R.J. Carroll, X. Lin, Efficient semiparametric marginal estimation for longitudinal/clustered data, J. Am. Stat. Assoc. 100 (2005) 147–157. [25] Y. Wang, X. Lin, M. Zhu, Robust estimating functions and bias correction for longitudinal data analysis, Biometrics 61 (2005) 684–691. [26] A.H. Welsh, X. Lin, R.J. Carroll, Marginal longitudinal nonparametric regression: locality and efficiency of spline and kernel methods, J. Am. Stat. Assoc. 97 (2002) 482–493. [27] P. Xu, L. Zhu, Y. Li, Ultrahigh dimensional time course feature selection, Biometrics 70 (2014) 356–365. [28] H. Ye, J. Pan, Modelling covariance structures in generalized estimating equations for longitudinal data, Biometrika 93 (2006) 927–941. [29] J. Zhang, G. Li, Breakdown properties of location M-estimators, Ann. Stat. 26 (1996) 1170–1189. [30] W. Zhang, C. Leng, A moving average cholesky factor model in covariance modeling for longitudinal data, Biometrika 99 (2012) 141–150. [31] X. Zheng, W.K. Fung, Z. Zhu, Robust estimation in joint meancovariance regression model for longitudinal data, Ann. Inst. Stat. Math. 65 (2013) 617– 638. [32] X. Zheng, W.K. Fung, Z. Zhu, Variable selection in robust joint mean and covariance model for longitudinal data analysis, Stat. Sin. 24 (2014) 515–531. [33] H. Zou, The adaptive LASSO and its oracle properties, J. Am. Stat. Assoc. 101 (2006) 1418–1429. [34] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J .R. Stat. Soc. Ser. B. 67 (2005) 301–320. [35] H. Zou, H. Zhang, On the adaptive elastic-net with a diverging number of parameters, Ann. Stat. 37 (2009) 1733–1751.