A global partial likelihood estimation in the additive Cox proportional hazards model

Journal of Statistical Planning and Inference
journal homepage: www.elsevier.com/locate/jspi
Huazhen Lin (a,*), Ye He (a), Jian Huang (b,c)

a Center of Statistical Research, School of Statistics, Southwestern University of Finance and Economics, Chengdu, China
b Department of Statistics and Actuarial Science, University of Iowa, USA
c Department of Biostatistics, University of Iowa, USA

Article history: Received 3 November 2013; received in revised form 20 April 2015; accepted 4 August 2015; available online xxxx.

Abstract: The additive Cox model has been considered by many authors. However, the existing methods are either inefficient or their asymptotic properties are not well developed. In this article, we propose a global partial likelihood method to estimate the additive Cox model. We show that the proposed estimator is consistent and asymptotically normal, and that linear functionals of the estimated nonparametric components achieve the semiparametric efficiency bound. Simulation studies show that the proposed estimator has much smaller mean squared error than the existing methods. Finally, we apply the proposed approach to the ''nursing home'' data set (Morris et al., 1994). © 2015 Published by Elsevier B.V.

Keywords: Additive Cox model; Asymptotic properties; Global partial likelihood; Semiparametric efficiency

1. Introduction

One of the most celebrated models for analyzing lifetime data is the Cox proportional hazards model (Cox, 1972), which postulates that the exposure variables affect the hazard of adverse events via

\lambda(t) = \lambda_0(t)\exp\Big\{\sum_{j=1}^{d}\beta_j Z_j\Big\}, \qquad (1.1)

where \lambda_0(\cdot) is an unknown baseline hazard function and Z = (Z_1, \ldots, Z_d)' is the vector of covariates. The log-linear model (1.1) is a simple and mathematically convenient way to characterize the effects of covariates. However, in practice there is often little prior information indicating that the effects of the covariates take a log-linear form or belong to any other finite-dimensional parametric family. Substantial improvements are possible by allowing a data-analytic transform of the covariates (Huang, 1999), which leads to the additive Cox model of the form

\lambda(t \mid Z) = \lambda_0(t)\exp\Big\{\sum_{j=1}^{d}\eta_j(Z_j)\Big\}, \qquad (1.2)

where \{\eta_j(\cdot), j = 1, \ldots, d\} are univariate unknown smooth functions. This model circumvents the fitting of high-dimensional surfaces and therefore avoids the so-called curse of dimensionality. In addition, the low-dimensional curves fitted in the



* Corresponding author. Tel.: +86 28 87092207; fax: +86 28 87092372. E-mail address: [email protected] (H. Lin).
http://dx.doi.org/10.1016/j.jspi.2015.08.002 0378-3758/© 2015 Published by Elsevier B.V.

additive model can easily be visualized and allow an easy data-analytic interpretation of the qualitative influence of the covariates. Because of its flexibility and interpretability, the additive Cox model has been considered by many authors. For example, Therneau et al. (1990) and Fleming and Harrington (1991) proposed using smoothed martingale residuals to explore the functional form of the covariate effect. The martingale residual approach was further discussed by Grambsch et al. (1995). Hastie and Tibshirani (1990a,b) considered maximizing a penalized partial likelihood to explore the nature of covariate effects in the Cox model. Smoothing splines (O'Sullivan, 1993) and regression splines (Sleeper and Harrington, 1990) were also used to estimate nonlinear covariate effects in the Cox model.

However, in all the aforementioned works, the asymptotic properties are not well developed. Huang (1999) studied asymptotic properties of the maximum partial likelihood estimator of a partially linear additive Cox model using polynomial splines. The focus of Huang (1999) is the estimator of the finite-dimensional regression parameter: the asymptotic distribution is obtained only for the regression parameter, not for the component functions. In addition, the spline estimators involve maximization of functions over a parameter space whose dimension increases with the sample size, and this optimization problem can be rather complex. Recently, Linton et al. (2003) and Honda (2005) used a marginal integration method to estimate the component functions and to derive the asymptotic distribution of the marginal integration estimator. Marginal integration estimators are known to be inefficient in general. Linton et al. (2003) and Honda (2005) therefore improved the marginal integration estimators using finite-step backfitting, and Linton et al. (2003) proved that their finite-step backfitting estimator is asymptotically normal and has smaller variance than the marginal integration estimator.
However, there has been no work on the efficiency of the nonparametric component estimators in the additive Cox model. In this article, we propose a method based on global partial likelihood and backfitting for estimating the component functions in the additive Cox model. We show that the proposed estimator is consistent and asymptotically normal, and that its linear functionals are semiparametrically efficient. The computation of the estimates can be carried out efficiently using a built-in iteration algorithm.

The rest of the paper is organized as follows. Section 2 introduces the global partial likelihood estimation and backfitting algorithm. Section 3 establishes the consistency, asymptotic normality and semiparametric efficiency of the proposed estimator. Section 4 presents the results from the simulation studies and the analysis of a data example. Technical proofs are given in the Appendix.

2. Global partial likelihood estimation

Suppose a random sample of n individuals is available. For the ith individual, let T_i^0 and C_i be the potential failure time and censoring time, respectively. The observed time is T_i = min(T_i^0, C_i). Let \Delta_i be a censoring indicator which equals 1 if T_i is a failure time and 0 otherwise. Assume that T_i^0 and C_i are independent given the covariates Z_i. The covariates Z_i = (Z_{i1}, \ldots, Z_{id}) are allowed to be time dependent. For simplicity of notation, the dependence of covariates on time is not made explicit, but the methods and proofs in this paper are applicable to time-dependent covariates. The observed data are

\{T_i, \Delta_i, Z_i\} for i = 1, \ldots, n. In model (1.2) the nonparametric components \eta_j, j = 1, \ldots, d, are identifiable only up to a shift, which means that only differences of the functions \eta_j are estimable from the data. Therefore, for identifiability we assume that \eta_j(a_j) = 0 for some finite a_j. Without loss of generality, we take a_j = Z_{nj} for j = 1, \ldots, d. The partial likelihood for the observations based on (1.2) is

L(\eta) = \prod_{i=1}^{n}\left[\frac{\exp\big\{\sum_{j=1}^{d}\eta_j(Z_{ij})\big\}}{\sum_{k\in R(T_i)}\exp\big\{\sum_{j=1}^{d}\eta_j(Z_{kj})\big\}}\right]^{\Delta_i}, \qquad (2.1)

where R(t) = \{i : T_i \ge t\} is the set of individuals at risk just prior to time t, \eta(z) = (\eta_1(z_1), \ldots, \eta_d(z_d))' and z = (z_1, \ldots, z_d)'. If the unknown functions \eta(\cdot) are parameterized, the parameters can be estimated by maximizing (2.1). When the forms of the functions \eta(\cdot) are not available, we can only rely on their qualitative traits. Suppose that every component of \eta is smooth and admits a Taylor expansion: for each given z and x around z,

\eta_j(x) \approx \eta_j(z) + \dot\eta_j(z)(x - z) \equiv \zeta_j + \gamma_j(x - z), \qquad (2.2)

for j = 1, \ldots, d, where \dot g(w) = dg(w)/dw for any function g(\cdot). After substituting (2.2) into (2.1), simultaneous estimation of \zeta_j and \gamma_j for j = 1, \ldots, d is difficult because of the curse of dimensionality. Instead, a backfitting iterative algorithm is used to derive the estimating function. In particular, local scoring backfitting (Hastie and Tibshirani, 1990a) is one of the most popular methods for generalized additive models, although its theoretical properties are not well understood since it is defined only implicitly as the limit of a complicated iterative algorithm (Yu et al., 2008). Using the idea of local scoring backfitting, we estimate \eta_j by replacing \eta_k, k \ne j, with their values from the previous steps. As a result, in the step of estimating \eta_j(\cdot), we can use the method for model (1.2) with Z one-dimensional. When Z is one-dimensional, Fan et al. (1997)


proposed a local partial likelihood approach to estimate \eta_1(\cdot). They concentrated on a single neighborhood, which results in cancellation of the risk function in the local likelihood, so that only the derivative of \eta_1(z) can be estimated. To recover the original risk function, they suggested integrating the estimated derivative, \int_0^z \hat{\dot\eta}_1(u)\,du. However, the large sample property of this integrated estimator has not been formally established. Recently, Chen and Zhou (2007) proposed a direct estimation of the relative risk function \eta_1(z_2) - \eta_1(z_1) through a new version of the local partial likelihood, and the asymptotic properties of their estimators are formally established. Chen et al. (2010) proposed an efficient estimator of \eta_1(z) for a one-dimensional covariate based on a global version of the local partial likelihood approach. In this paper, we apply the global estimation idea to each \eta_j(\cdot). Suppose \eta_k, k \ne j, are known, and, for every fixed z in the range of Z_{ij}, suppose \eta_j is known outside a neighborhood of z, denoted by B_n(z). Substituting (2.2) into (2.1), we can consider maximizing

\prod_{i=1}^{n}\left[\frac{I_{ij}\exp(\bar\psi_{ij}) + (1 - I_{ij})\exp\big\{\sum_{s=1}^{d}\eta_s(Z_{is})\big\}}{\sum_{r\in R(T_i)}\big[I_{rj}\exp(\bar\psi_{rj}) + (1 - I_{rj})\exp\big\{\sum_{s=1}^{d}\eta_s(Z_{rs})\big\}\big]}\right]^{\Delta_i}, \qquad (2.3)

where \bar\psi_{ij} = \zeta_j + \gamma_j(Z_{ij} - z) + \sum_{k\ne j}\eta_k(Z_{ik}), and I_{ij} equals 1 if Z_{ij} \in B_n(z) and 0 otherwise. Since the true \eta_j(\cdot), j = 1, \ldots, d, are unknown, (2.3) is not directly useful. In the traditional backfitting algorithm, only \{\eta_k, k \ne j\} are replaced by values from the previous steps. However, since the true \eta_j is unknown outside a neighborhood of z, (2.3) is useless even when \{\eta_k, k \ne j\} are known. A key point of our method is that, in addition to replacing the unknown functions \{\eta_k, k \ne j\}, we also replace \eta_j outside a neighborhood of z with its values from the previous step. That is, denoting



ξ j = (ζj , hγj )′ ,

′

Xij (z ) = 1, (Zij − z )/h ,



and

where h represents the size of the local neighborhood, we estimate ξ j in step m by maximizing

\prod_{i=1}^{n}\left[\frac{I_{ij}\exp(\psi_{ij}) + (1 - I_{ij})\exp\big\{\sum_{s=1}^{d}\eta_s^{(m-1)}(Z_{is})\big\}}{\sum_{r\in R(T_i)}\big[I_{rj}\exp(\psi_{rj}) + (1 - I_{rj})\exp\big\{\sum_{s=1}^{d}\eta_s^{(m-1)}(Z_{rs})\big\}\big]}\right]^{\Delta_i}, \qquad (2.4)

where \psi_{ij} = \xi_j' X_{ij}(z) + \sum_{k\ne j}\eta_k^{(m-1)}(Z_{ik}). Differentiating the logarithm of (2.4) with respect to \xi_j, we solve the following equations for \xi_j:

\sum_{i=1}^{n}\Delta_i\left[I_{ij}X_{ij}(z) - \frac{\sum_{r\in R(T_i)} I_{rj}\exp(\psi_{rj})X_{rj}(z)}{\sum_{r\in R(T_i)}\big[I_{rj}\exp(\psi_{rj}) + (1 - I_{rj})\exp\big\{\sum_{s=1}^{d}\eta_s^{(m-1)}(Z_{rs})\big\}\big]}\right] = 0. \qquad (2.5)

Since I_{rj}\exp(\psi_{rj}) + (1 - I_{rj})\exp\big\{\sum_{s=1}^{d}\eta_s^{(m-1)}(Z_{rs})\big\} \approx \exp\big\{\sum_{s=1}^{d}\eta_s^{(m-1)}(Z_{rs})\big\}, for computational convenience we consider the following equations for \xi_j:

\sum_{i=1}^{n}\Delta_i\left[I_{ij}X_{ij}(z) - \frac{\sum_{r\in R(T_i)} I_{rj}\exp(\psi_{rj})X_{rj}(z)}{\sum_{r\in R(T_i)}\exp\big\{\sum_{s=1}^{d}\eta_s^{(m-1)}(Z_{rs})\big\}}\right] = 0. \qquad (2.6)

Then, replacing I_{ij} and I_{rj} with kernel weights that decrease smoothly to zero with distance from the target point, we derive the following iteration algorithm.

• Step 0. Choose initial values \eta_j^{(0)}(z) for z = Z_{1j}, \ldots, Z_{n-1,j} and j = 1, \ldots, d. For computational simplicity, the estimator proposed by Huang (1999) provides reasonable initial values.

• Step m. For j = 1, \ldots, d, one by one, and z = Z_{1j}, \ldots, Z_{n-1,j}, solve the following equations for \xi_j:

\sum_{i=1}^{n}\Delta_i\left[K_{ij}(z)X_{ij}(z) - \frac{\sum_{r\in R(T_i)} K_{rj}(z)X_{rj}(z)\exp\big\{\sum_{s=1}^{j-1}\eta_s^{(m)}(Z_{rs}) + \xi_j' X_{rj}(z) + \sum_{s=j+1}^{d}\eta_s^{(m-1)}(Z_{rs})\big\}}{\sum_{r\in R(T_i)}\exp\big\{\sum_{s=1}^{j-1}\eta_s^{(m)}(Z_{rs}) + \sum_{s=j}^{d}\eta_s^{(m-1)}(Z_{rs})\big\}}\right] = 0, \qquad (2.7)


where K_{ij}(x) = K_h(Z_{ij} - x), K_h(\cdot) = K(\cdot/h)/h, and K is a probability density called a kernel function. Let \hat\zeta_j(z) and \hat\gamma_j(z) be the solutions for \zeta_j and \gamma_j; then \eta_j^{(m)}(z) = \hat\zeta_j(z) for j = 1, \ldots, d. The iteration is repeated until convergence.

• For every z in the range of Z_{ij}, the estimates of \eta_j(z) and \dot\eta_j(z), denoted by \hat\eta_j(z) and \hat{\dot\eta}_j(z), respectively, are obtained by solving Eqs. (2.7) for \xi_j, with \eta_j^{(m-1)}(Z_{ij}) and \eta_j^{(m)}(Z_{ij}), i = 1, \ldots, n, j = 1, \ldots, d, replaced by their estimators from the iteration above.
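The update in Step m can be sketched in code. The following is a minimal illustration, not the authors' implementation: it solves the local estimating equation (2.7) at a single point z0 by a (clipped) Newton-Raphson iteration, using the Epanechnikov kernel from Section 4. All function and array names (`backfit_step`, `eta`, and so on) are ours.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on |u| <= 1."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1)

def backfit_step(T, Delta, Z, eta, j, z0, h, n_newton=5):
    """One backfitting update of eta_j at the point z0 via (2.7),
    holding the other components fixed.

    T, Delta : observed times and censoring indicators, shape (n,)
    Z        : covariate matrix, shape (n, d)
    eta      : current fitted values eta[i, s] = eta_s(Z[i, s]), shape (n, d)
    Returns (zeta_j, gamma_j), the local estimates of eta_j(z0) and its slope.
    """
    n = len(T)
    K = epanechnikov((Z[:, j] - z0) / h) / h                # K_ij(z0)
    X = np.column_stack([np.ones(n), (Z[:, j] - z0) / h])   # X_ij(z0)
    offset = eta.sum(axis=1) - eta[:, j]                    # sum_{s != j} eta_s(Z_is)
    denom_exp = np.exp(eta.sum(axis=1))                     # previous-step denominator terms
    xi = np.zeros(2)                                        # xi_j = (zeta_j, h * gamma_j)
    for _ in range(n_newton):
        w = K * np.exp(X @ xi + offset)                     # kernel-weighted risk terms
        score = np.zeros(2)
        hess = np.zeros((2, 2))
        for i in np.where(Delta == 1)[0]:
            at_risk = T >= T[i]                             # risk set R(T_i)
            denom = denom_exp[at_risk].sum()
            score += K[i] * X[i] - (w[at_risk, None] * X[at_risk]).sum(axis=0) / denom
            hess -= (w[at_risk, None, None] * X[at_risk, :, None]
                     * X[at_risk, None, :]).sum(axis=0) / denom
        # clipped Newton step, purely for robustness of the sketch
        xi = xi - np.clip(np.linalg.solve(hess, score), -1.0, 1.0)
    return xi[0], xi[1] / h
```

In the full algorithm this update is applied over a grid of points z for each j in turn, cycling until the fitted functions stabilize.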

It is interesting to note that the proposed approach reduces to the standard Cox partial likelihood approach when the covariate Z_{ij} is discrete. To see this, assume Z_{ij} takes finitely many values b_1, \ldots, b_K, with \eta_j(b_K) = 0 for identifiability. As long as the bandwidth h is smaller than \min_{l\ne k}|b_k - b_l|, we have Z_{ij} - Z_{kj} = 0 whenever |Z_{ij} - Z_{kj}| \le h. As a result, (2.7) reduces in the limit to

\sum_{i=1}^{n}\Delta_i\left[I(Z_{ij} = b_\ell) - \frac{\sum_{k\in R(T_i)} I(Z_{kj} = b_\ell)\exp\big\{\sum_{r=1}^{j-1}\eta_r(Z_{kr}) + \eta_j(b_\ell) + \sum_{r=j+1}^{d}\eta_r(Z_{kr})\big\}}{\sum_{k\in R(T_i)}\sum_{l=1}^{K} I(Z_{kj} = b_l)\exp\big\{\sum_{r=1}^{j-1}\eta_r(Z_{kr}) + \eta_j(b_l) + \sum_{r=j+1}^{d}\eta_r(Z_{kr})\big\}}\right] = 0.

This is exactly the Cox partial likelihood estimating equation for \eta_j. Its solution for \eta_j(b_\ell) is the Cox partial likelihood estimator of \eta_j(b_\ell), and hence is efficient. In this respect, the proposed method of estimation is adequate. Recall the local partial likelihood score, which has the form

\sum_{i=1}^{n}\Delta_i K_{ij}(z)\big\{X_{ij}(z) - \bar X_{ij}(z)\big\} = 0, \qquad (2.8)

where \bar X_{ij}(z) is an average (see Fan et al., 1997). The local partial likelihood method (2.8) is based on the estimating equation for X_{ij}(z) with weight K_{ij}(z), hence it uses only local data. In contrast, our partial likelihood method (2.7) is based on the estimating equation for K_{ij}(z)X_{ij}(z) and uses all of the data; our method is global and, as a result, more efficient. Its semiparametric efficiency is presented in Theorem 3 in Section 3. Although local scoring backfitting is one of the most popular methods for generalized additive models and has been shown to work well in practice, the idea had not yet been applied to the additive Cox model. Our method provides not only local scoring backfitting but also an improvement of it for the additive Cox model. In Appendix B, we show that the proposed estimates are consistent, asymptotically normal and semiparametrically efficient. The techniques developed in this article for the additive Cox model are applicable to generalized additive models.

To use (2.7), the bandwidth h must be selected. We use the K-fold cross-validation procedure for bandwidth selection, which is commonly used in the literature (Efron and Tibshirani, 1993; Tian et al., 2005; Fan et al., 2006; Chen et al., 2012). Tian et al. (2005) and Fan et al. (2006) have demonstrated empirically that the choice of smoothing parameter can be quite flexible. Our simulations and example also show that the cross-validation approach works well; see Section 4 for a detailed description.

We now describe the global partial likelihood using counting process notation. Let N_i(t) = I(T_i \le t, \Delta_i = 1) and Y_i(t) = I(T_i \ge t). The global partial likelihood score (2.7) can be expressed as the solution of the following equations for \xi_j:

\sum_{i=1}^{n}\int_0^{\tau}\left[K_{ij}(z)X_{ij}(z) - \frac{\sum_{r=1}^{n} Y_r(t)K_{rj}(z)X_{rj}(z)\exp\big\{\sum_{s=1}^{j-1}\eta_s^{(m)}(Z_{rs}) + \xi_j' X_{rj}(z) + \sum_{s=j+1}^{d}\eta_s^{(m-1)}(Z_{rs})\big\}}{\sum_{r=1}^{n} Y_r(t)\exp\big\{\sum_{s=1}^{j-1}\eta_s^{(m)}(Z_{rs}) + \sum_{s=j}^{d}\eta_s^{(m-1)}(Z_{rs})\big\}}\right] dN_i(t) = 0, \qquad (2.9)

with \tau = \infty. To avoid technical tail problems, only the data up to a finite time point \tau are frequently used. Without ambiguity, we let \hat\xi_j(z) be the solution of (2.9).

3. Large sample properties

This section establishes the uniform consistency, asymptotic normality and semiparametric efficiency of the proposed estimator. Without loss of generality, the covariate Z_i is assumed to be bounded with support [0, 1]^d. Detailed regularity conditions are stated in Appendix A. In the theory below we fix \hat\eta_j(0) = \eta_j(0) = 0, j = 1, \ldots, d. First, we state the consistency of the proposed estimator.

Theorem 1. Under Conditions 1–7 given in Appendix A, \sup_{z\in[0,1]} |\hat\eta_j(z) - \eta_j(z)| \to 0 in probability for j = 1, \ldots, d.


To give an explicit asymptotic expression for \hat\eta_j(z) - \eta_j(z), j = 1, \ldots, d, we introduce some notation. Let e be the d-dimensional vector of ones, e_j the d-dimensional vector with jth element 0 and the remaining d - 1 elements 1, and z_{-j} the vector z excluding the element z_j. Let f_j be the density function of Z_{ij}, f_j(\cdot\,|z) the conditional density function of Z_i^{-j} given Z_{ij} = z, and f the joint density function of Z_i. Define

\mu_i = \int x^i K(x)\,dx, \qquad \nu_i = \int x^i K^2(x)\,dx,

\Gamma(z) = \int_0^{\tau} P(t|z)\lambda_0(t)\,dt, \qquad P(t|z) = P(T_i \ge t \mid Z_i = z),

s(t; \zeta) = E\big[P(t|Z_i)\exp\{e'\zeta(Z_i)\}\big].

For j = 1, \ldots, d, denote

s_j(t; \zeta, z) = E\big[P(t|Z_i)\exp\{e'\zeta(Z_i)\} \mid Z_{ij} = z\big] f_j(z),

\delta_j(z) = E\big[\Gamma(Z_i)\exp\{e'\eta(Z_i)\} \mid Z_{ij} = z\big] f_j(z),

\Phi_j(z|u) = e f(u)\int_0^{\tau}\frac{s_j(t;\eta,z)P(t|u)}{s(t;\eta)}\lambda_0(t)\,dt\,\exp\{e'\eta(u)\},

\Omega_j(z|z_{-j}) = e_j f_j(z_{-j}|z) f_j(z)\Gamma(z_{-j}, z)\exp\{e'\eta(z_{-j}, z)\},

s(t;\zeta,z) = (s_1(t;\zeta,z_1), \ldots, s_d(t;\zeta,z_d))', \qquad \delta(z) = (\delta_1(z_1), \ldots, \delta_d(z_d))',

\Phi(z|u) = (\Phi_1(z_1|u), \ldots, \Phi_d(z_d|u))',

\Omega(\phi)(z) = \Big(\int_{u_{-1}}\Omega_1(z_1|u_{-1})'\phi(u_{-1}, z_1)\,du_{-1}, \;\ldots,\; \int_{u_{-d}}\Omega_d(z_d|u_{-d})'\phi(u_{-d}, z_d)\,du_{-d}\Big)', \qquad z = (z_1, \ldots, z_d)'.

Theorem 2. Under Conditions 1–7 stated in Appendix A, for any z \in (0, 1)^d, \hat\eta(z) - \eta(z) = (\hat\eta_1(z_1) - \eta_1(z_1), \ldots, \hat\eta_d(z_d) - \eta_d(z_d))' satisfies the following integral equation:

\hat\eta(z) - \eta(z) = \Sigma^{-1}(z)\int_u \Phi(z|u)\{\hat\eta(u) - \eta(u)\}\,du - \Sigma^{-1}(z)\Omega(\hat\eta - \eta)(z) + (nh)^{-1/2}\nu_0^{1/2}\Sigma^{-1/2}(z)\varphi + \frac{1}{2}h^2\mu_2\ddot\eta(z)e + o_p(h^2 + (nh)^{-1/2}), \qquad (3.1)

where \Sigma(z) = \mathrm{diag}(\delta(z)), \varphi is a standard normal random vector and \ddot g(z) = d^2 g(z)/dz\,dz'. Let A be the linear operator satisfying

A(\phi)(z) = \Sigma^{-1}(z)\int_u \Phi(z|u)\phi(u)\,du - \Sigma^{-1}(z)\Omega(\phi)(z)

for any function \phi, and let I be the identity operator. Then (3.1) can be written as

(I - A)(\hat\eta - \eta)(z) = (nh)^{-1/2}\nu_0^{1/2}\Sigma^{-1/2}(z)\varphi + \frac{1}{2}h^2\mu_2\ddot\eta(z)e + o_p(h^2 + (nh)^{-1/2}). \qquad (3.2)

In the proof of Theorem 2 in Appendix B, we show that (I - A)^{-1} exists and is bounded. Since I - A is a linear operator, so is (I - A)^{-1}. Then (3.2) can be expressed as

\hat\eta(z) - \eta(z) = (nh)^{-1/2}\nu_0^{1/2}(I - A)^{-1}(\Sigma^{-1/2})(z)\varphi + \frac{1}{2}h^2\mu_2 (I - A)^{-1}(\ddot\eta)(z)e + o_p(h^2 + (nh)^{-1/2}), \qquad (3.3)

which implies that \hat\eta(z) - \eta(z) is asymptotically normal, with asymptotic bias of order h^2 and asymptotic covariance of order (nh)^{-1}. As a consequence, the theoretically optimal nonparametric bandwidth of order O(n^{-1/5}) can be taken. It follows from (3.3) in a straightforward fashion that:

Corollary 1. Under Conditions 1–7 stated in Appendix A, if nh^5 = O(1), then for z \in (0, 1)^d,

(nh)^{1/2}\Big\{\hat\eta(z) - \eta(z) - \frac{1}{2}h^2\mu_2 (I - A)^{-1}(\ddot\eta)(z)e\Big\} \xrightarrow{D} N(0, V(z)), \qquad (3.4)

where V(z) = \nu_0\big[(I - A)^{-1}(\Sigma^{-1/2})(z)\big]\big[(I - A)^{-1}(\Sigma^{-1/2})(z)\big]'.
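As an illustration of how Corollary 1 would be used in practice, the sketch below forms pointwise 95% confidence limits from plug-in quantities. It is not part of the paper: `eta_hat` and `V_hat` are assumed to be available from the estimation step (they are not computed here), and the O(h^2) bias term is dropped, which is justified when the bandwidth is undersmoothed.

```python
import numpy as np

def pointwise_ci(eta_hat, V_hat, n, h, z_crit=1.96):
    """95% pointwise interval for eta(z) from the normality in (3.4),
    ignoring the bias term (undersmoothing assumed)."""
    se = np.sqrt(np.diag(V_hat) / (n * h))   # (nh)^{-1/2} scaling
    return eta_hat - z_crit * se, eta_hat + z_crit * se
```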


When Z is one-dimensional, our model reduces to the nonparametric Cox model considered by Chen et al. (2010), and the asymptotic expression (3.1) reduces to that in Theorem 2 of Chen et al. (2010). The difference between the additive Cox model and the nonparametric Cox model is the same as that between the additive regression model and the simple nonparametric regression model. As is well known, there are many methods with well-established properties for estimating a single-component nonparametric regression model. In contrast, the methods for the additive regression model are limited and the corresponding asymptotic properties are not well developed because of the complicated iteration. In addition, the additive Cox model is more useful than the single-component nonparametric Cox model, since Z is usually multidimensional in practice. However, because of the multiple dimensions, the closed-form expressions used in Chen et al. (2010) are no longer available. Based on our current study, the global method can be extended to most cases where local linear smoothing is valid, and the global method generally performs better. In this sense, the global method can be regarded as a general methodology, although theoretical justification should still be provided on a case-by-case basis.

We now evaluate the efficiency of the proposed estimators. To do so, we consider the parameter \int_z \phi(z)'\eta(z)\,dz for any function \phi(z) that has a continuous second derivative on [0, 1]^d and satisfies \phi(0) = 0. Let \int_z \phi(z)'\hat\eta(z)\,dz be the corresponding estimator. We have the following result.

Theorem 3. Under regularity Conditions 1–7 stated in Appendix A, if nh^4 = o(1), then

\int_z \phi(z)'\hat\eta(z)\,dz \text{ is an efficient estimator of } \int_z \phi(z)'\eta(z)\,dz

in the sense that it achieves the semiparametric efficiency lower bound.

To estimate a parameter at the rate n^{-1/2}, one must undersmooth the nonparametric part. Undersmoothing to obtain the usual parametric rate of convergence is standard in the kernel literature and has analogs in the spline literature (Carroll et al., 1997; Hastie and Tibshirani, 1990a). It is achieved by setting nh^4 = o(1), which is required by Theorem 3 to estimate the parameters \int_z \phi(z)'\eta(z)\,dz at the rate n^{-1/2}.

4. Numerical studies

4.1. Simulations

The performance of the proposed method is investigated by comparing it with the estimators of Huang (1999), Linton et al. (2003), Honda (2005) and Fan et al. (1997) combined with a backfitting algorithm. The data-generating models of the Honda study were adopted; they are described below. Here X ~ Ex(\lambda) means that X has an exponential distribution with mean \lambda^{-1}.

Model 1. T_i ~ Ex(exp(\eta_1(Z_{i1}) + \eta_2(Z_{i2}))) and C_i ~ Ex(exp(\eta_1(Z_{i1}) + \eta_2(Z_{i2}))/1.75), where \eta_1(x) = \eta_2(x) = 5 log(x + 2.5).
Model 2. T_i ~ Ex(exp(\eta_1(Z_{i1}) + \eta_2(Z_{i2}))) and C_i ~ Ex(exp(\eta_1(Z_{i1}) + \eta_2(Z_{i2}))/1.75), where \eta_1(x) = \eta_2(x) = 2 sin(1.57x).
Model 3. T_i ~ Ex(exp(\eta_1(Z_{i1}) + \eta_2(Z_{i2}))) and C_i ~ Ex(exp(\eta_1(Z_{i1}) + \eta_2(Z_{i2}))/1.75), where \eta_1(x) = \eta_2(x) = 2x.

In Models 1–3, Z_{i1} and Z_{i2} are independently generated from the uniform distribution on (-1, 1). The sample size is n = 600, and a total of 200 replications were carried out. The Epanechnikov kernel K(u) = 0.75(1 - u^2)I(|u| <= 1) was used. The results of the proposed estimator and the estimators of Huang (1999), Linton et al. (2003) and Honda (2005) for \eta_1 at z = 0.5 and z = -0.5 are presented in Table 1, in which ''Bias'' denotes the estimation bias, ''SD'' the empirical standard deviation, and ''RMSE'' the root mean squared error. The results for Linton et al.'s and Honda's estimators were taken from Tables 1 and 3 in Honda (2005).
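The data-generating mechanism of Models 1–3 can be reproduced in a few lines. The sketch below follows the Ex(lambda) convention above (mean lambda^{-1}); the function name `simulate` is ours, not the paper's.

```python
import numpy as np

def simulate(model, n=600, seed=0):
    """Draw (T, Delta, Z) under Models 1-3; Ex(lam) has mean 1/lam."""
    rng = np.random.default_rng(seed)
    eta = {1: lambda x: 5 * np.log(x + 2.5),
           2: lambda x: 2 * np.sin(1.57 * x),
           3: lambda x: 2 * x}[model]
    Z = rng.uniform(-1, 1, size=(n, 2))
    lam = np.exp(eta(Z[:, 0]) + eta(Z[:, 1]))
    T_fail = rng.exponential(1 / lam)        # T_i ~ Ex(lam), mean 1/lam
    C = rng.exponential(1.75 / lam)          # C_i ~ Ex(lam / 1.75)
    T = np.minimum(T_fail, C)                # observed time
    Delta = (T_fail <= C).astype(int)        # censoring indicator
    return T, Delta, Z
```

Under this setup the theoretical censoring proportion is 1/2.75, roughly 36%.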
The proposed method uses the fixed bandwidths h = 0.5, h = 0.3 and h = 4 for Models 1, 2 and 3, respectively, and Fan et al.'s method uses the fixed optimal bandwidths h = 0.7, h = 0.6 and h = 4 for Models 1, 2 and 3, respectively. The following conclusions can be drawn from Table 1: (1) Linton et al.'s estimator has the largest standard deviation and, as a result, the largest mean squared error; this suggests that Linton et al.'s estimator is inefficient and the worst among the five methods in terms of RMSE. (2) Honda's and Fan et al.'s estimators have larger bias; hence their efficiency is obtained at the cost of bias compared to Linton et al.'s estimator. (3) The proposed method has the smallest standard deviation. Compared with the proposed estimator, the average empirical efficiencies of the Huang, Fan et al., Linton et al. and Honda estimators are 23.96%, 36.26%, 11.58% and 16.60%, respectively. This implies that, to achieve the same efficiency, the proposed method requires just 23.96% of the samples required by Huang's estimator, 36.26% of those required by Fan et al.'s estimator, 11.58% of those required by Linton et al.'s estimator and 16.60% of those required by Honda's estimator. (4) Furthermore, the bias of the proposed estimator is much smaller than that of the Honda and Fan et al. estimators, a little smaller than that of Linton et al.'s estimator and comparable with that of the Huang estimator. This suggests that the efficiency of the proposed estimator is achieved without any loss in bias.


Table 1
Bias, SD and RMSE of the proposed, Huang, Fan et al., Linton et al. and Honda estimators of eta_1 at z = 0.5 and z = -0.5 for Models 1, 2 and 3.

                     z = 0.5                          z = -0.5
Method           Bias      SD       RMSE         Bias      SD       RMSE
Model 1
Proposed         0.0621    0.0606   0.0868      -0.0372    0.0901   0.0975
Huang            0.0409    0.2006   0.2048       0.0467    0.2181   0.2230
Fan et al.      -0.1006    0.2502   0.2697      -0.1267    0.1879   0.2265
Linton et al.   -0.0350    0.2949   0.2970       0.0400    0.2702   0.2731
Honda            0.1040    0.2302   0.2526      -0.1060    0.2408   0.2631
Model 2
Proposed        -0.0381    0.1380   0.1432       0.0881    0.1558   0.1790
Huang            0.0457    0.2066   0.2116      -0.0014    0.2116   0.2116
Fan et al.       0.0168    0.2995   0.2999       0.1662    0.2506   0.3007
Linton et al.   -0.0560    0.3209   0.3257       0.0780    0.3049   0.3147
Honda            0.1130    0.2588   0.2824      -0.1090    0.2569   0.2791
Model 3
Proposed         0.0032    0.0562   0.0562      -0.0020    0.0671   0.0671
Huang            0.0423    0.1997   0.2041       0.0064    0.2042   0.2043
Fan et al.       0.0072    0.1837   0.1838       0.0022    0.0614   0.0615
Linton et al.   -0.0440    0.2983   0.3015       0.0470    0.2739   0.2779
Honda            0.1000    0.2449   0.2645      -0.0940    0.2345   0.2526

Fig. 1(a)–(f) displays the averaged estimated component functions and their 95% empirical pointwise confidence limits, based on the 200 simulated data sets. The figure shows that the proposed estimates are very close to the true functions. Both covariates in the above simulations are generated from uniform distributions. In the following Model 4, we change one of the variables to mimic the age distribution in the real data example.

Model 4. T_i ~ Ex(exp(\eta_1(Z_{i1}) + \eta_2(Z_{i2}))) and C_i ~ Ex(exp(\eta_1(Z_{i1}) + \eta_2(Z_{i2}))/2.5), where \eta_1(x) = 2 sin(1.57(x - 85)/20), \eta_2(x) = 5 log(x/20 - 1.75), Z_{i1} is generated from the age distribution in the ''nursing home'' data below, and Z_{i2} is generated from the uniform distribution on (65, 105).

The sample size, the number of replications and the kernel are the same as those in Models 1–3. Here, we evaluate the performance of the estimators \hat\eta_j(\cdot), j = 1, 2, via the weighted mean squared error

WMSE = n_g^{-1}\sum_{j=1}^{2}\sum_{k=1}^{n_g} a_j\{\hat\eta_j(x_k) - \eta_j(x_k)\}^2,

where a_j is the reciprocal of the sample variance of \hat\eta_j(x_k), and x_k (k = 1, \ldots, n_g) are the grid points at which the functions \eta_j(\cdot) are estimated, with n_g = 200. Fig. 2 depicts the distribution of the weighted mean squared errors over the 200 replications, using the proposed estimator with the optimal bandwidth h = 15, Fan et al.'s estimator with its optimal bandwidth h = 10 and the Huang estimator with the optimal degree of freedom 4. The weighted mean squared error of the proposed estimator is smaller than that of the Huang estimator, and Fan et al.'s estimator has the largest WMSE, suggesting that the proposed method is better than the Huang and Fan et al. methods.
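The WMSE criterion can be sketched as follows, assuming the estimated and true component functions have been evaluated on a common grid; array names are illustrative.

```python
import numpy as np

def wmse(eta_hat, eta_true):
    """Weighted MSE on a grid: eta_hat, eta_true have shape (ng, 2);
    a_j is the reciprocal of the sample variance of the fitted values."""
    ng = eta_hat.shape[0]
    a = 1.0 / eta_hat.var(axis=0, ddof=1)       # weights a_j
    return (a * (eta_hat - eta_true) ** 2).sum() / ng
```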

4.2. Data example

The proposed approach is now applied to the ''nursing home'' data set analyzed by Morris et al. (1994), where a full description of the data is given. The data come from an experiment with a sample of size 1601, sponsored by the National Center for Health Services Research in 1980–1982 and involving 36 for-profit nursing homes in San Diego, California. The study was designed to evaluate the effects of different financial incentives on, among other aspects, the durations of stay. Morris et al. (1994) took the number of days T in the nursing home as the response variable and used the following model:

\lambda(t, Z) = \lambda_0(t)\exp\Big\{\sum_{j=1}^{7}\beta_j Z_j\Big\},

where Z_1 is a treatment indicator (1 if treated at a nursing home and 0 otherwise), Z_2 a gender variable (1 for males and 0 for females), Z_3 a marital status indicator (1 if married and 0 otherwise), Z_4, Z_5, Z_6 three binary health status indicators (corresponding to health status from best to worst), and Z_7 age, which ranges from 65 to 104. Morris et al. (1994) fitted the Cox model to this data set with three parametric baseline hazard models and one nonparametric baseline hazard model. Their model did not include any possible interactions between age and the other variables. To explore possible interactions among variables, Fan and Li (2001) added interaction terms such as Z_7 Z_1, Z_7 Z_2, \ldots to the initial model. To examine how different age groups interact with covariates such as treatment, gender and marital status, Fan et al. (2006) fitted the following more general model:

\lambda(t, Z) = \lambda_0(t)\exp\Big\{\sum_{j=1}^{6}\beta_j(Z_7)Z_j + g(Z_7)\Big\}, \qquad (4.1)


Fig. 1. Results of simulation. (a), (b) and (c): the averaged estimates of \eta_1(x_1) for Models 1, 2 and 3, respectively; (d), (e) and (f): the averaged estimates of \eta_2(x_2) for Models 1, 2 and 3, respectively (solid: true functions; dashed: estimated; dotted: confidence limits).

where \beta_j(\cdot), j = 1, \ldots, 6, and g(\cdot) are unknown functions. Fan et al. (2006) proposed local versions of the partial likelihood approach for estimating the coefficient functions \beta_1(\cdot), \ldots, \beta_6(\cdot) and g(\cdot). Here, model (4.1) is estimated using the proposed global partial likelihood approach. Noting that Z_j, j = 1, \ldots, 6, are binary variables, with \beta_j(0) = 0, j = 1, \ldots, 6, for identification, model (4.1) can be written as

\lambda(t, Z) = \lambda_0(t)\exp\Big\{\sum_{j=1}^{6}\beta_j(Z_7 Z_j) + g(Z_7)\Big\}. \qquad (4.2)

This is an additive Cox model that can be estimated by the method proposed in this paper. The proposed partial likelihood method was applied to the data set with bandwidth h = 20, chosen by K-fold cross-validation (Cai et al., 2000; Tian et al., 2005; Fan et al., 2006) to minimize the prediction error. Denote the full data set by \Omega, and the cross-validation training and test sets by \Omega - \Omega^v and \Omega^v, respectively, for v = 1, \ldots, K. For each bandwidth h and each v, using the training set


Fig. 2. Boxplots of the weighted mean squared errors over 200 replications using Fan et al.’s estimator with the optimal bandwidth h = 10, the Huang estimator with the optimal degree of freedom 4, and the proposed estimator with the optimal bandwidth h = 15 for Model 4.

\Omega - \Omega^v, we find the estimators \hat\beta_j^{-v}(\cdot) and \hat g^{-v}(\cdot) of \beta_j(\cdot) and g(\cdot), respectively, and then the estimator \hat\Lambda_0^{-v}(t) of \Lambda_0(t) as

\hat\Lambda_0^{-v}(t) = \int_0^t \frac{\sum_{i\in\Omega-\Omega^v} dN_i(u)}{\sum_{k\in\Omega-\Omega^v} Y_k(u)\exp\big\{\sum_{j=1}^{6}\hat\beta_j^{-v}(Z_{k7}Z_{kj}) + \hat g^{-v}(Z_{k7})\big\}}.

Then we form the cross-validation criterion

PE(h) = Σ_{v=1}^{K} Σ_{i∈Ω^v} ∫_0^τ { Ni(t) − Ê^{−v}Ni(t) }² d{ Σ_{k=1}^{n} Nk(t) },

−v where  E −v Ni (t ) = 0 Yi (u) exp{ j=1  βj−v (Zi7 Zij ) +  g −v (Zi7 )}dΛ 0 (u) is the estimate of the expected failure number up to time t, and we choose K = 10. The plot of PE vs. h is shown in Fig. 3(h). Fig. 3(a)–(g) displays the estimated coefficient functions and their 95% pointwise confidence bands. The 95% pointwise confidence limits are calculated by  βj (x) ± 1.96 × SD( βj (x)), where the calculation of the standard errors of  βj (x) for each given x is carried out using 200 bootstrap resampled data sets, in which each subject is treated as a resampling unit in order to preserve the inherent features of the data. The choice of 200 was determined by monitoring the stability of the standard errors. The estimated functions are similar to those obtained by Fan et al.’s method. Both estimates show that the treatment is not significant. The intercept function, in Fig. 3(a), decreases with increasing age, suggesting that older patients tend to stay longer in nursing home. In Fig. 3(e), the coefficient function is quite flat and zero is included in the 95% confidential interval, suggesting health1 (the best health) is not a significant risk to stay in nursing home no matter age. However, the effect of health2 and health3 (the worst health), as well as the effect of gender and marital, vary significantly with age. Particularly, the male, unhealthy and married patients are less likely to stay at nursing homes, and the effect of gender, marital and health on the time to stay at nursing homes increase linearly with age. Comparing with the results from Fan et al. (2006), our estimates have narrower confidence bands. Specifically, the widths of confidence bands of the proposed method relative to those of Fan et al.’s method are about 50% for age, 57% for the gender effect, 20% for the health1 effect, 25% for the health2 effect, and 33% for the health3 effect in average. 
As a result, the proposed method shows that the gender effect is significant over all observation ages, while Fan et al.’s estimator shows that the gender effect is not significant for the patients with age less than 73. In addition, the proposed method shows that the married patients older than 80 years old are less likely to stay at nursing homes, while Fan et al.’s method suggests that the marital status is not significant. 6
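The cross-validation bandwidth selection described above can be sketched in code. The following Python fragment is an illustrative sketch only, not the authors' implementation: it uses a Breslow-type estimator of the baseline cumulative hazard on each training fold, and `risk_fn` is a hypothetical stand-in for the kernel-based fit at a given bandwidth, here returning zero log relative risks for toy data. The criterion is also simplified: it evaluates the squared difference between the observed event indicator and the expected event count at each subject's end of follow-up, rather than integrating against dN̄(t).

```python
import numpy as np

rng = np.random.default_rng(0)

def breslow_cum_hazard(times, events, log_risk, eval_times):
    """Breslow estimator of the baseline cumulative hazard Lambda_0(t),
    given fitted log relative risks for the training subjects."""
    order = np.argsort(times)
    t, d, r = times[order], events[order], np.exp(log_risk[order])
    denom = np.cumsum(r[::-1])[::-1]      # sum of risks over the risk set at t_i
    cum = np.cumsum(d / denom)            # cumulative sum of the hazard jumps
    idx = np.searchsorted(t, eval_times, side="right") - 1
    return np.where(idx >= 0, cum[np.clip(idx, 0, None)], 0.0)

def cv_prediction_error(times, events, risk_fn, K=10, seed=1):
    """Simplified K-fold prediction error: squared difference between the
    observed event indicator and exp(risk_i) * Lambda_0(follow-up time),
    with Lambda_0 estimated on the training fold only."""
    folds = np.random.default_rng(seed).integers(0, K, size=len(times))
    pe = 0.0
    for v in range(K):
        tr, te = folds != v, folds == v
        lam = breslow_cum_hazard(times[tr], events[tr], risk_fn(tr), times[te])
        pe += np.sum((events[te] - np.exp(risk_fn(te)) * lam) ** 2)
    return pe

# toy data: exponential event times with uniform censoring
n = 400
T, C = rng.exponential(1.0, n), rng.uniform(0.0, 2.0, n)
times, events = np.minimum(T, C), (T <= C).astype(float)
# null fit: every subject gets log relative risk 0
pe = cv_prediction_error(times, events, lambda mask: np.zeros(mask.sum()))
print(pe > 0.0)
```

In practice `risk_fn` would refit β̂j^{−v} and ĝ^{−v} on each training fold for a candidate bandwidth h, and PE(h) would be minimized over a grid of bandwidths.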

5. Concluding remarks

In this paper, we propose a global backfitting method for estimating the additive Cox proportional hazards model. The proposed estimator is shown to be uniformly consistent and asymptotically normal, and its linear functionals are shown to be semiparametrically efficient. Simulation studies and the real data example indicate that the proposed method is more efficient than the existing methods. In addition, the proposed global backfitting algorithm is easy to implement, which facilitates the application of the proposed method. Recent trends show the combination of local polynomial and backfitting methods to be very popular for fitting generalized additive models for uncensored data (Yu et al., 2008). However, the related asymptotic properties are not well developed. The techniques developed in this study for the Cox model are applicable to the generalized additive models for uncensored data, although some modifications are still needed.

[Fig. 3 panels: (a) age; (b) treat × age; (c) gender × age; (d) marital × age; (e) health1 × age; (f) health2 × age; (g) health3 × age; (h) PE vs. bandwidth h.]

Fig. 3. (a) to (g) The estimated coefficient functions (solid curves) via the global partial likelihood approach with bandwidth h = 20 and their 95% confidence limits (dashed lines) for the nursing home data; (h) The plot of PE vs. bandwidth h.
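The pointwise bootstrap standard errors behind the confidence limits in Fig. 3 can be sketched generically. This Python fragment is a hedged illustration, not the authors' code: the resampling unit is the subject, matching the description in the text, while `toy_estimator` (a simple moving-average smoother on synthetic data) stands in for the global partial likelihood fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pointwise_se(data, estimator, x_grid, B=200, seed=0):
    """Pointwise bootstrap standard errors for a curve estimate.
    `estimator(sample, x_grid)` returns the fitted curve on x_grid;
    each subject (row of `data`) is the resampling unit."""
    r = np.random.default_rng(seed)
    n = len(data)
    curves = np.empty((B, len(x_grid)))
    for b in range(B):
        idx = r.integers(0, n, size=n)   # resample subjects with replacement
        curves[b] = estimator(data[idx], x_grid)
    return curves.std(axis=0, ddof=1)    # SD of the curve across resamples

def toy_estimator(sample, x_grid, h=0.3):
    """Stand-in curve estimator: local average of y among x-neighbours."""
    x, y = sample[:, 0], sample[:, 1]
    return np.array([y[np.abs(x - x0) <= h].mean()
                     if np.any(np.abs(x - x0) <= h) else 0.0
                     for x0 in x_grid])

x = rng.uniform(0.0, 1.0, 300)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 300)
grid = np.linspace(0.1, 0.9, 5)
se = bootstrap_pointwise_se(np.column_stack([x, y]), toy_estimator, grid, B=100)
print(np.all(se > 0))
```

The 95% pointwise limits are then formed as estimate ± 1.96 × se on the grid, as in the text.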

Selection of the bandwidth is a key issue for kernel-based methods. In some cases the covariate distributions are skewed and it is difficult to find a single fixed bandwidth for estimating the whole curve. In this situation, a data-driven or adaptive bandwidth (Brockmann et al., 1993) may be useful.

Acknowledgments

Lin's research is supported by the Chinese Natural Science Foundation (11125104) and the Fundamental Research Funds for the Central Universities (Nos. JBK141111, JBK141121, JBK120509 and 14TD0046).

Appendix A. Conditions

Let ξ = (ζ′, hγ′)′, ζ = (ζ1, . . . , ζd) and γ = (γ1, . . . , γd), where ′ denotes the transpose of a vector. Define

C0 = { ζ(z) : z ∈ [0, 1]^d, ζ(0) = 0, ζ(z) is continuous on [0, 1]^d }.

We require the following assumptions.


1. The kernel function K(·) is a symmetric density function with compact support [−1, 1] and a bounded derivative.
2. Z is bounded with compact support [0, 1]^d, P(C = 0 | Z = z) < 1 and P(C ≥ τ | Z = z) > 0.
3. The density function fj(z) of Zj is positive and has a continuous second derivative on [0, 1].
4. The functions ηj, j = 1, . . . , d, have continuous second derivatives on the corresponding support.
5. The conditional probability P(T ≤ t | Z = z) is positive and has a continuous second derivative on [0, 1] for each t ∈ [0, τ] and z ∈ [0, 1]^d.
6. For any φ ∈ C0, if e′φ(Zi) is constant almost surely, then φ ≡ 0.
7. nh/(log n)² → ∞ and h²(log n) → 0.
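As a concrete check of Condition 1, the Epanechnikov kernel K(u) = 0.75(1 − u²) on [−1, 1] (a standard choice, not one the paper mandates) is a symmetric density with compact support and a bounded derivative. This small Python sketch verifies the three properties numerically.

```python
import numpy as np

# Epanechnikov kernel: K(u) = 0.75 * (1 - u**2) on [-1, 1], 0 elsewhere
def K(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u ** 2), 0.0)

u = np.linspace(-1.0, 1.0, 200001)
du = u[1] - u[0]
integral = K(u).sum() * du                  # integrates to 1, so K is a density
symmetric = bool(np.allclose(K(u), K(-u)))  # K(u) = K(-u)
deriv = np.gradient(K(u), u)                # K'(u) = -1.5 u on the support
bounded = bool(np.max(np.abs(deriv)) <= 1.5 + 1e-6)

print(round(float(integral), 4), symmetric, bounded)
```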

Conditions 1–5 and 7 are similar to those in Chen et al. (2010) for establishing asymptotic results. Condition 6 is used to ensure that there exists a unique estimator for η(·).

Appendix B. Proofs of Theorems 1–3

Proof of Theorem 1. For any vector functions ζ(·) and γ(·), and j = 1, . . . , d, set

Uj(ζ, γ; z) = (1/n) Σ_{i=1}^{n} ∫_0^τ { Xij(z)Kij(z) − Sn1,j(t; ζ, γ, z)/Sn0(t; ζ) } dNi(t),

where Sn0(t; ζ) = (1/n) Σ_{i=1}^{n} Yi(t) exp{e′ζ(Zi)}, and

Sn1,j(t; ζ, γ, z) = (1/n) Σ_{i=1}^{n} Xij(z)Kij(z)Yi(t) exp{ Σ_{s≠j} ζs(Zis) + ζj(z) + γj(z)(Zij − z) }.

Under model (1.1), we have Uj(ζ, γ; z) = uj(ζ; z)(1, 0)′ + op(1), where

uj(ζ; z) = E[ Γ(Zi) exp{e′η(Zi)} | Zij = z ] fj(z) − ∫_0^τ { sj(t; ζ, z)/s(t; ζ) } s(t; η) λ0(t) dt.

It follows that Uj(η̂, η̂̇; z) = 0 and uj(η; z) = 0 for j = 1, . . . , d. Denote u(η; z) = (u1(η; z1), . . . , ud(η; zd))′. Now we show that η is the unique root of the equation u(η; z) = 0 in C0. Let u̇ζ(h)(z) be the Gateaux derivative of u(ζ; z) at the point ζ(z). A straightforward calculation gives the rth component of u̇ζ(h)(z),

(u̇ζ(h)(z))r = − ∫_0^τ ∫_{u^{−r}} q(u^{−r}, zr | t, ζ) e′h(u^{−r}, zr) du^{−r} s(t; η) λ0(t) dt
  + ∫_0^τ ∫_{u^{−r}} q(u^{−r}, zr | t, ζ) du^{−r} ∫_u q(u | t, ζ) e′h(u) du s(t; η) λ0(t) dt,   (B.1)

where q(z | t, ζ) = P(t | z) exp{e′ζ(z)} f(z)/s(t; ζ). If there exist two functions η(·), η(·) + h(·) ∈ C0 such that u(η; z) = 0 and u(η + h; z) = 0 for any z ∈ [0, 1]^d, then



0 = ∫ h(z)′ [u(η + h; z) − u(η; z)] dz = ∫ ∫_0^1 h(z)′ u̇_{η+ϱh}(h)(z) dϱ dz.   (B.2)

Substituting (B.1) into (B.2), we get

0 = ∫_0^τ ∫_0^1 [ ∫ {e′h(z)}² q(z | t, η + ϱh) dz ] dϱ s(t; η) λ0(t) dt − ∫_0^τ ∫_0^1 [ ∫ e′h(z) q(z | t, η + ϱh) dz ]² dϱ s(t; η) λ0(t) dt.   (B.3)

Note that q(z | t, η + ϱh) is a density function for any given t, hence

κ(t, η + ϱh) = ∫ {e′h(z)}² q(z | t, η + ϱh) dz − [ ∫ e′h(z) q(z | t, η + ϱh) dz ]²

is the variance of e′h(Zi), which implies that κ(t, η + ϱh) ≥ 0, and κ(t, η + ϱh) = 0 if and only if e′h(Zi) is constant almost surely. (B.3) and Condition 6 ensure h ≡ 0. Therefore, there exists a unique root η of u(η; z) = 0 in C0.

Define U(ζ, γ; z) = (U1(ζ, γ; z1), . . . , Ud(ζ, γ; zd))′ and

Bn = { ζ : ∥ζ∥ ≤ C, ∥ζ(z1) − ζ(z2)∥ ≤ c[∥z1 − z2∥ + bn], z1, z2 ∈ [0, 1]^d },

for some constants C > 0 and c > 0, where bn = h + (nh)^{−1/2}(log n)^{1/2}. To show the uniform consistency of η̂, it suffices to prove the following (i)–(iii):


(i) For any continuous function vector ζ and any bounded function vector γ, sup_{z∈[0,1]^d} ∥U(ζ, γ; z) − u(ζ; z)∥ = op(1).
(ii) sup_{z∈[0,1]^d, ζ∈Bn, γ∈R} ∥U(ζ, γ; z) − u(ζ; z)∥ = op(1), where R is the set of uniformly bounded functions on [0, 1]^d.
(iii) P{η̂ ∈ Bn} → 1.

Once (i)–(iii) are established, using an idea similar to the Arzelà–Ascoli theorem together with (iii), we can show that for any subsequence of {η̂} there exists a further convergent subsequence η̂n such that, uniformly in z ∈ [0, 1]^d, η̂n(z) → η∗(z) in probability. It is easily seen that η∗ ∈ C0. Note that

u(η∗; z) = U(η̂, η̂̇; z) − { U(η̂, η̂̇; z) − u(η̂; z) } − { u(η̂; z) − u(η∗; z) },

and U(η̂, η̂̇; z) = 0. It follows from (ii) that u(η∗; z) = 0. Since u(ζ; z) = 0 has a unique root at η, we have η∗ = η, which yields the uniform consistency of η̂.

Proof of (i). Let

S̆n1,j(t; ζ, γ, z) = (1/n) Σ_{i=1}^{n} Xij(z)Kij(z)Yi(t) exp{ Σ_{s≠j} ζs(Zis) + ζj(z) } [ exp{γj(z)(Zij − z)} − 1 ].

Since γ is a bounded function and Zi is bounded, it can be shown (e.g., Horowitz, 1996) that S̆n1,j(t; ζ, γ, z) = Op(bn), uniformly in t ∈ [0, τ] and z ∈ [0, 1]. Similarly,

sup_{t∈[0,τ], z∈[0,1]} ∥Sn1,j(t; ζ, 0, z) − sj(t; ζ, z)(1, 0)′∥ = Op(bn).

Note that Sn1,j(t; ζ, γ, z) = Sn1,j(t; ζ, 0, z) + S̆n1,j(t; ζ, γ, z). It follows that

sup_{t∈[0,τ], z∈[0,1], γ∈R} ∥Sn1,j(t; ζ, γ, z) − sj(t; ζ, z)(1, 0)′∥ = Op(bn).   (B.4)

Define N̄(t) = (1/n) Σ_{i=1}^{n} Ni(t) and Sn2,j(z) = (1/n) Σ_{i=1}^{n} ∫_0^τ Xij(z)Kij(z) dNi(t). It follows from the discussions in Chen et al. (2010) that

sup_{t∈[0,τ]} ∥Sn0(t; ζ) − s(t; ζ)∥ = op(1),   (B.5)

sup_{t∈[0,τ]} | N̄(t) − ∫_0^t s(y; η) λ0(y) dy | = op(1),   (B.6)

sup_{z∈[0,1]} ∥Sn2,j(z) − s2j(z)(1, 0)′∥ = op(1),   (B.7)

where s2j(z) = E[ Γ(Zi) exp( Σ_{s=1}^{d} ηs(Zis) ) | Zij = z ] fj(z). Note that

Uj(ζ, γ; z) = Sn2,j(z) − ∫_0^τ { Sn1,j(t; ζ, γ, z)/Sn0(t; ζ) } dN̄(t);

then (i) follows from (B.4)–(B.7).

Proof of (ii). Noting that Z is bounded, the arguments used to prove (ii) are essentially the same as those in Chen et al. (2010).

Proof of (iii). Denote α1 = (α11, . . . , αd1)′, α2 = (α12, . . . , αd2)′, G(α1, α2, z; ζ) = (Gj(αj1, αj2, z; ζ), j = 1, . . . , d)′,

Gj(αj1, αj2, z; ζ) = (1/n) Σ_{i=1}^{n} ∫_0^τ { Kij(z) − S*n1,j(αj1, αj2, z; t, ζ)/Sn0(t; ζ) } dNi(t),

where Sn0(t; ζ) = (1/n) Σ_{i=1}^{n} Yi(t) exp{e′ζ(Zi)}, and

S*n1,j(αj1, αj2, z; t, ζ) = (1/n) Σ_{i=1}^{n} Kij(z)Yi(t) exp{ Σ_{s≠j} ζs(Zis) + αj1 + αj2(Zij − z) }.

Let G^(100)(α1, α2, z; ζ) = dG(α1, α2, z; ζ)/dα1, G^(010)(α1, α2, z; ζ) = dG(α1, α2, z; ζ)/dα2, and G^(001)(α1, α2, z; ζ) = dG(α1, α2, z; ζ)/dz. Using the equations G(η̂(z1), η̂̇(z1), z1; η̂) = 0 and G(η̂(z2), η̂̇(z2), z2; η̂) = 0 for any z1 and z2 with ∥z1 − z2∥ ≤ h, since η̂ is a


continuous function and hη̂̇ → 0, by Taylor expansion we have

0 = G(η̂(z1), η̂̇(z1), z1; η̂)
  = G^(100)(η̂(z2), η̂̇(z2), z2; η̂)(η̂(z1) − η̂(z2)) + G^(010)(η̂(z2), η̂̇(z2), z2; η̂) h · (1/h){η̂̇(z1) − η̂̇(z2)}
  + G^(001)(η̂(z2), η̂̇(z2), z2; η̂)(z1 − z2) + Op{(η̂(z1) − η̂(z2))² + h²}
  + Op{(η̂(z1) − η̂(z2))(z1 − z2) + h(η̂(z1) − η̂(z2)) + h(z1 − z2)}.   (B.8)

By kernel theory and some computation, when α1 = η̂(z2), α2 = η̂̇(z2), z = z2 and ζ = η̂, we have

dGj(αj1, αj2, z; ζ)/dαj1 = − ∫_0^τ { sj(t; η, z2)/s(t; η) } s(t; η) λ0(t) dt + op(1),

dGj(αj1, αj2, z; ζ)/(h dαj2) = op(1),

dGj(αj1, αj2, z; ζ)/dz = − ∫_0^τ { [dsj(t; η, z2)/dz2]/s(t; η) } s(t; η) λ0(t) dt + αj2 ∫_0^τ { sj(t; η, z2)/s(t; η) } s(t; η) λ0(t) dt + op(1).

Substituting these into (B.8), we prove (iii).

Proof of Theorem 2. Denote cn = sup_{z∈[0,1]^d} |η̂(z) − η(z)|, dn = sup_{z∈[0,1]^d} |hη̂̇(z) − hη̇(z)|, en = h² + (nh)^{−1/2}(log n)^{1/2}. Let ej be the d-dimensional vector with jth element 0 and the remaining d − 1 elements 1, and let Q = diag(1, µ2). Suppose z^{−j} is z excluding the element j, and (z^{−j}, z) is z with the element j replaced by z. Denote

Φj(z | u) = e f(u) exp{e′η(u)} ∫_0^τ P(t | u) { sj(t; η, z)/s(t; η) } λ0(t) dt,

Ωj(z | z^{−j}) = ej fj(z^{−j} | z) fj(z) Γ(z^{−j}, z) exp{e′η(z^{−j}, z)},

ϱnj(z) = ( η̂j(z) − ηj(z), h(η̂̇j(z) − η̇j(z)) )′.

Step (i). Show

Uj(η̂, η̂̇; z) − Uj(η, η̇; z) = ∫_u Φj(z | u)′ [η̂(u) − η(u)] du (1, 0)′ − ∫_{z^{−j}} Ωj(z | z^{−j})′ [η̂(z^{−j}, z) − η(z^{−j}, z)] dz^{−j} (1, 0)′
  − Q δj(z) ϱnj(z) + Op(cn² + dn² + en cn + en dn + n^{−1/2}).   (B.9)

Suppose Sn1,j(t; z) = Sn1,j(t; η, η̇, z) and Ŝn1,j(t; z) = Sn1,j(t; η̂, η̂̇, z). Write

Uj(η̂, η̂̇; z) − Uj(η, η̇; z) = − (1/n) Σ_{i=1}^{n} ∫_0^τ { Ŝn1,j(t; z)/Sn0(t; η̂) − Sn1,j(t; z)/Sn0(t; η) } dNi(t)
  = ∫_0^τ { [Sn1,j(t; z) − Ŝn1,j(t; z)]/Sn0(t; η̂) } dN̄(t) + ∫_0^τ [Sn0(t; η̂) − Sn0(t; η)] { Sn1,j(t; z)/[Sn0(t; η̂)Sn0(t; η)] } dN̄(t).   (B.10)

Using Taylor expansion, the consistency of η̂ and the boundedness of η̂̇, we have

Ŝn1,j(t; z) − Sn1,j(t; z) = (1/n) Σ_{i=1}^{n} Xij(z)Kij(z)Yi(t) exp{e′j η(Zi) + ηj(z) + η̇j(z)(Zij − z)} [η̂(Zi) − η(Zi)]′ ej
  + (1/n) Σ_{i=1}^{n} Xij(z)Kij(z)Yi(t) exp{e′j η(Zi) + ηj(z) + η̇j(z)(Zij − z)} X′ij(z) ϱnj(z) + Op(cn² + dn²)
  = ∫_{z^{−j}} P(t | z^{−j}, z) exp{e′η(z^{−j}, z)} [η̂(z^{−j}, z) − η(z^{−j}, z)]′ ej fj(z^{−j} | z) dz^{−j} fj(z)(1, 0)′
  + Q sj(t; η, z) ϱnj(z) + Op(cn² + dn² + en cn + en dn),   (B.11)

uniformly in (t, z) ∈ [0, τ] × [0, 1], where (z^{−j}, z) combines z^{−j} and z, with the element j equal to z. Similarly, we get

Sn0(t; η̂) − Sn0(t; η) = ∫_z P(t | z) exp{e′η(z)} [η̂(z) − η(z)]′ e f(z) dz + Op(cn² + n^{−1/2}),

Sn0(t; η) = s(t; η) + Op(n^{−1/2}),  Sn1,j(t; z) = sj(t; η, z)(1, 0)′ + Op(en),

N̄(t) = ∫_0^t s(y; η) λ0(y) dy + Op(n^{−1/2}),   (B.12)

uniformly in t ∈ [0, τ] and z ∈ [0, 1], where e is the d-dimensional vector of ones. Substituting (B.11)–(B.12) into (B.10), we obtain (B.9).

Step (ii). Show (B.16) defined below. We have

Uj(η, η̇; z) = Anj(z) + Vnj(z),   (B.13)

where

Anj(z) = (1/n) Σ_{i=1}^{n} ∫_0^τ { Xij(z)Kij(z) − Sn1,j(t; z)/Sn0(t; η) } dMi(t),

Vnj(z) = (1/n) Σ_{i=1}^{n} ∫_0^τ { Xij(z)Kij(z) − Sn1,j(t; z)/Sn0(t; η) } Yi(t) exp{e′η(Zi)} λ0(t) dt,

and Mi(t) = Ni(t) − ∫_0^t Yi(s) exp{e′η(Zi)} λ0(s) ds. Note that

Vnj(z) = (1/n) Σ_{i=1}^{n} ∫_0^τ Xij(z)Kij(z)Yi(t) [ exp{e′η(Zi)} − exp{ Σ_{s≠j} ηs(Zis) + ηj(z) + η̇j(z)(Zij − z) } ] λ0(t) dt.

It can be shown (e.g., Fan et al., 2006) that

Vnj(z) = (1/2) h² µ2 δj(z) η̈j(z)(1, 0)′ + op(h²),   (B.14)

where δj(z) = E[ Γ(Zi) exp{e′η(Zi)} | Zij = z ] fj(z). Denote z = (z1, . . . , zd)′, Xi = (Xi1, . . . , Xid)′, An(z) = (An1(z1), . . . , And(zd))′, Ki(z) = (Ki1(z1), . . . , Kid(zd))′, s(t; η, z) = (s1(t; η, z1), . . . , sd(t; η, zd))′, δ(z) = (δ1(z1), . . . , δd(zd))′ and η̈(z) = diag(η̈1(z1), . . . , η̈d(zd)). Let An^(1)(z) = (An1^(1)(z1), . . . , And^(1)(zd))′ be the first column of An(z). The martingale central limit theorem implies that

(nh)^{1/2} An^(1)(z) = n^{−1/2} h^{1/2} Σ_{i=1}^{n} ∫_0^τ { Ki(z) − s(t; η, z)/s(t; η) } dMi(t) + op(1)   (B.15)

is asymptotically normal with mean zero and finite covariance matrix ν0 ∫_0^τ

diag(s(t; η, z)) λ0(t) dt. Denote Φ(z | u) = (Φ1(z1 | u), . . . , Φd(zd | u))′. Noting that Uj(η̂, η̂̇; z) = 0, by (B.9), (B.13) and (B.14) we obtain

η̂(z) − η(z) = Σ^{−1}(z) ∫_u Φ(z | u)(η̂(u) − η(u)) du − Σ^{−1}(z) Ω(η̂ − η)(z) + Σ^{−1}(z) An^(1)(z)
  + (1/2) h² µ2 η̈(z) e + op(h²) + Op(cn² + dn² + en cn + en dn + n^{−1/2}),   (B.16)

where

Ω(η̂ − η)(z) = ( ∫_{u^{−1}} Ω1(z1 | u^{−1})′ [η̂(u^{−1}, z1) − η(u^{−1}, z1)] du^{−1}, . . . , ∫_{u^{−d}} Ωd(zd | u^{−d})′ [η̂(u^{−d}, zd) − η(u^{−d}, zd)] du^{−d} )′.

Step (iii). Show that

cn = sup_{z∈[0,1]^d} |η̂(z) − η(z)| = Op((nh)^{−1/2} + h²).   (B.17)

Denote A(φ)(z) = Σ^{−1}(z) ∫_u Φ(z | u) φ(u) du − Σ^{−1}(z) Ω(φ)(z) for any function φ, and let I be the identity operator. Hence (B.16) can be rewritten as

(I − A)(η̂ − η)(z) = Σ^{−1}(z) An^(1)(z) + (1/2) h² µ2 η̈(z) e + op(h²) + Op(cn² + dn² + en cn + en dn + n^{−1/2}).   (B.18)

If (I − A)^{−1} exists and is bounded, taking the operator (I − A)^{−1} and the supremum norm on both sides of (B.18), we obtain (B.17). Now we show that (I − A)^{−1} exists. It is sufficient to show that φ ≡ 0 if (I − A)(φ) ≡ 0. By (I − A)(φ) ≡ 0, we have φ(z) = A(φ)(z) for any z ∈ [0, 1]^d, that is,

∫ φ(z)′ Σ(z) φ(z) dz = ∫∫ φ(z)′ Φ(z | u) φ(u) du dz − ∫ φ(z)′ Ω(φ)(z) dz.   (B.19)

A straightforward calculation gives

∫ φ(z)′ Σ(z) φ(z) dz = ∫_0^τ E[ ( Σ_{j=1}^{d} φj²(Zij) ) P(t | Zi) exp{e′η(Zi)} ] λ0(t) dt,

∫∫ φ(z)′ Φ(z | u) φ(u) du dz = ∫_0^τ ( E[ ( Σ_{j=1}^{d} φj(Zij) ) P(t | Zi) exp{e′η(Zi)} ] )² / s(t; η) λ0(t) dt,

∫ φ(z)′ Ω(φ)(z) dz = ∫_0^τ E[ ( Σ_{j=1}^{d} φj(Zij) Σ_{k≠j} φk(Zik) ) P(t | Zi) exp{e′η(Zi)} ] λ0(t) dt.

Substituting these identities into (B.19), we have

∫_0^τ { ∫_z ( Σ_{j=1}^{d} φj(zj) )² a(z | t) dz − [ ∫_z ( Σ_{j=1}^{d} φj(zj) ) a(z | t) dz ]² } s(t; η) λ0(t) dt = 0,

where a(z | t) = P(t | z) exp{e′η(z)} f(z)/s(t; η). On the other hand, since a(z | t) is a density function for any given t, we have

∫_z ( Σ_{j=1}^{d} φj(zj) )² a(z | t) dz − [ ∫_z ( Σ_{j=1}^{d} φj(zj) ) a(z | t) dz ]² ≥ 0,

and the equality holds if and only if Σ_{j=1}^{d} φj(Zij) is constant almost surely. Therefore Condition 6 ensures φ ≡ 0, and we have proved that (I − A)^{−1} exists. Finally, noting that A is compact, it is easy to show that (I − A)^{−1} is bounded on C0.

Step (iv). Similarly to Step (iii) and using (B.17), it can be shown that

dn = sup_{z∈[0,1]^d} h|η̂̇(z) − η̇(z)| = Op((nh)^{−1/2}) + op(h²).   (B.20)

Substituting (B.17) and (B.20) into (B.16) completes the proof of Theorem 2.

Proof of Theorem 3. Let g(z) = (g1(z1), . . . , gd(zd))′ satisfy the following integral equation in C0:

φ(z) = ∫_u Φ(u | z)′ g(u) du − Σ(z) g(z) − Σ_{j=1}^{d} gj(zj) ej f(z) Γ(z) exp{e′η(z)} ≡ P(g)(z).   (B.21)

Let U^(1)(ζ, γ; z) be the first column of U(ζ, γ; z) and Uj1(ζ, γ; zj) be the first element of Uj(ζ, γ; zj). We have

∫_z gj(z) Uj1(η, η̇; z) dz = (1/n) Σ_{i=1}^{n} ∫_0^τ { gj(Zij) − Sng,j(t)/Sn0(t; η) } dNi(t) + Op(h²)
  = (1/n) Σ_{i=1}^{n} ∫_0^τ { gj(Zij) − Sng,j(t)/Sn0(t; η) } dMi(t) + Op(h²),

where Sng,j(t) = (1/n) Σ_{i=1}^{n} gj(Zij) Yi(t) exp{e′η(Zi)}. Hence,

∫_z g(z)′ U^(1)(η, η̇; z) dz = (1/n) Σ_{i=1}^{n} ∫_0^τ Σ_{j=1}^{d} { gj(Zij) − Sng,j(t)/Sn0(t; η) } dMi(t) + Op(h²).

Using the martingale central limit theorem, we have that

n^{1/2} ∫_z g(z)′ U^(1)(η, η̇; z) dz → N(0, σ²),

where σ² = e′ δg,2 e − ∫_0^τ e′ sg(t) sg(t)′ e / s(t; η) λ0(t) dt, δg,r = E[ g(Zi)^{⊗r} Γ(Zi) exp{e′η(Zi)} ], r = 0, 1, 2, and sg(t) = E[ g(Zi) P(t | Zi) exp{e′η(Zi)} ].


By (B.9) and some straightforward computations, we have

∫_z g(z)′ [ U^(1)(η̂, η̂̇; z) − U^(1)(η, η̇; z) ] dz
  = ∫_z { ∫_u g(u)′ Φ(u | z) du [η̂(z) − η(z)] − g(z)′ Ω(η̂ − η)(z) − g(z)′ Σ(z) [η̂(z) − η(z)] } dz (1 + op(1))
  = ∫_z φ(z)′ (η̂(z) − η(z)) dz (1 + op(1)).

Note that, since U^(1)(η̂, η̂̇; z) = 0,

∫_z g(z)′ U^(1)(η, η̇; z) dz = − ∫_z g(z)′ [ U^(1)(η̂, η̂̇; z) − U^(1)(η, η̇; z) ] dz,

so we have

n^{1/2} ∫_z φ(z)′ (η̂(z) − η(z)) dz → N(0, σ²).   (B.22)

We now consider the parametric sub-model

η(z; θ) = η(z) + θ g(z),

where θ is an unknown parameter, and η(z) and g(z) are fixed functions. The parameter θ can be consistently estimated by the solution θ̂ of the following Cox partial likelihood score equation:

n^{−1} Σ_{i=1}^{n} ∫_0^τ [ e′g(Zi) − Σ_{j=1}^{n} Yj(t) e′g(Zj) exp{e′(η(Zj) + θ g(Zj))} / Σ_{j=1}^{n} Yj(t) exp{e′(η(Zj) + θ g(Zj))} ] dNi(t) = 0.

Obviously, θ0 = 0 is the true value of θ. It then follows from Andersen and Gill (1982) that

θ̂ − θ0 = σ^{−2} n^{−1} Σ_{i=1}^{n} ∫_0^τ { e′g(Zi) − e′sg(t)/s(t; η) } dMi(t) + op(n^{−1/2}).

Note that

∫_z φ(z)′ [ η(z; θ̂) − η(z; θ0) ] dz = (θ̂ − θ0) ∫_z φ(z)′ g(z) dz,

and

∫_z φ(z)′ g(z) dz = ∫_z ∫_u g(u)′ Φ(u | z) g(z) du dz − ∫_z ( e′g(z) )² f(z) Γ(z) exp{e′η(z)} dz
  = ∫_0^τ ( e′ sg(t) )² / s(t; η) λ0(t) dt − e′ δg,2 e = −σ².

Thus

∫_z φ(z)′ [ η(z; θ̂) − η(z; θ0) ] dz = − n^{−1} Σ_{i=1}^{n} ∫_0^τ { e′g(Zi) − e′sg(t)/s(t; η) } dMi(t) + op(n^{−1/2}).

This gives

n^{1/2} ∫_z φ(z)′ [ η(z; θ̂) − η(z; θ0) ] dz → N(0, σ²),

which is the same limiting distribution as that of n^{1/2} ∫_z φ(z)′ (η̂(z) − η(z)) dz by (B.22). As explained by Bickel et al. (1993), ∫_z φ(z)′ η̂(z) dz is therefore an efficient estimator of ∫_z φ(z)′ η(z) dz. The proof of Theorem 3 is completed.

References

Andersen, P.K., Gill, R.D., 1982. Cox's regression model for counting processes: A large sample study. Ann. Statist. 10, 1100–1120.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y., Wellner, J.A., 1993. Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag, New York.
Brockmann, M., Gasser, T., Herrmann, E., 1993. Locally adaptive bandwidth choice for kernel regression estimators. J. Amer. Statist. Assoc. 88, 1302–1309.
Cai, Z., Fan, J., Li, R., 2000. Efficient estimation and inferences for varying-coefficient models. J. Amer. Statist. Assoc. 95, 888–902.
Carroll, R.J., Fan, J., Gijbels, I., Wand, M.P., 1997. Generalized partially linear single-index models. J. Amer. Statist. Assoc. 92, 477–489.
Chen, K., Guo, S., Sun, L., Wang, J., 2010. Global partial likelihood for nonparametric proportional hazards models. J. Amer. Statist. Assoc. 105, 760.
Chen, K., Lin, H.Z., Zhou, Y., 2012. Efficient estimation for the Cox model with varying coefficients. Biometrika 2, 379–392.
Chen, S., Zhou, L., 2007. Local partial likelihood estimation in proportional hazards regression. Ann. Statist. 35, 888–916.
Cox, D.R., 1972. Regression models and life tables (with discussion). J. R. Stat. Soc. Ser. B 34, 187–220.
Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman & Hall, New York.
Fan, J., Gijbels, I., King, M., 1997. Local likelihood and local partial likelihood in hazard regression. Ann. Statist. 25, 1661–1690.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Fan, J., Lin, H.Z., Zhou, Y., 2006. Local partial-likelihood estimation for lifetime data. Ann. Statist. 34, 290–325.
Fleming, T.R., Harrington, D.P., 1991. Counting Processes and Survival Analysis. Wiley, New York.
Grambsch, P.M., Therneau, T.M., Fleming, T.R., 1995. Diagnostic plots to reveal functional form for covariates in multiplicative intensity models. Biometrics 51, 1469–1482.
Hastie, T., Tibshirani, R., 1990a. Generalized Additive Models. Chapman and Hall, London.
Hastie, T., Tibshirani, R., 1990b. Exploring the nature of covariate effects in the proportional hazards model. Biometrics 46, 1005–1016.
Honda, T., 2005. Estimation in additive Cox models by marginal integration. Ann. Inst. Statist. Math. 57, 403–423.
Horowitz, J.L., 1996. Semiparametric estimation of a regression model with an unknown transformation of the dependent variable. Econometrica 64, 103–137.
Huang, J., 1999. Efficient estimation of the partly linear additive Cox model. Ann. Statist. 27, 1536–1563.
Linton, O.B., Nielsen, J.P., van de Geer, S., 2003. Estimating multiplicative and additive hazard functions by kernel regression. Ann. Statist. 31, 464–492.
Morris, C.N., Norton, E.C., Zhou, X.H., 1994. Parametric duration analysis of nursing home usage. In: Lange, N., Ryan, L., Billard, L., Brillinger, D., Conquest, L., Greenhouse, J. (Eds.), Case Studies in Biometry. Wiley, New York, pp. 231–248.
O'Sullivan, F., 1993. Nonparametric estimation in the Cox model. Ann. Statist. 21, 124–145.
Sleeper, L.A., Harrington, D.P., 1990. Regression splines in the Cox model with application to covariate effects in liver disease. J. Amer. Statist. Assoc. 85, 941–949.
Therneau, T.M., Grambsch, P.M., Fleming, T.R., 1990. Martingale-based residuals for survival models. Biometrika 77, 147–160.
Tian, L., Zucker, D., Wei, L.J., 2005. On the Cox model with time-varying regression coefficients. J. Amer. Statist. Assoc. 100, 172–183.
Yu, K., Park, B.U., Mammen, E., 2008. Smooth backfitting in generalized additive models. Ann. Statist. 36, 228–260.
Semiparametric estimation of a regression model with an unknown transformation of the dependent variable. Econometrica 64, 103–137. Huang, J., 1999. Efficient estimation of the partly linear additive Cox model. Ann. Statist. 27, 1536–1563. Linton, O.B., Nielsen, J.P., van de Geer, S., 2003. Estimating multiplicative and additive hazard functions by kernel regression. Ann. Statist. 31, 464–492. Morris, C.N., Norton, E.C., Zhou, X.H., 1994. Parametric duration analysis of nursing home usage. In: Lange, N., Ryan, L., Billard, L., Brillinger, D., Conquests, L., Greenhouse, J. (Eds.), Case Studies in Biometry. Wiley, New York, pp. 231–248. O’Sullivan, F., 1993. Nonparametric estimation in the Cox model. Ann. Statist. 21, 124–145. Sleeper, L.A., Harrington, D.P., 1990. Regression splines in the Cox model with application to covariate effects in liver disease. J. Amer. Statist. Assoc. 85, 941–949. Therneau, T.M., Grambsch, P.M., Fleming, T.R., 1990. Martingale-based residuals for survival models. Biometrika 77, 147–160. Tian, L., Zucker, D., Wei, L.J., 2005. On the Cox model with time-varying regression coefficients. J. Amer. Statist. Assoc. 100, 172–183. Yu, K., Park, B.U., Mammen, E., 2008. Smooth backfitting in generalized additive models. Ann. Statist. 36, 228–260.