Efficient estimation of partially linear additive Cox model under monotonicity constraint





PII: S0378-3758(17)30134-9
DOI: http://dx.doi.org/10.1016/j.jspi.2017.07.003
Reference: JSPI 5575
To appear in: Journal of Statistical Planning and Inference

Received date: 26 April 2016
Revised date: 1 May 2017
Accepted date: 26 July 2017

Please cite this article as: Lu, M., Lu, T., Li, C., Efficient estimation of partially linear additive Cox model under monotonicity constraint. J. Statist. Plann. Inference (2017), http://dx.doi.org/10.1016/j.jspi.2017.07.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

1. A hybrid numerical approach based on the Newton-Raphson algorithm and the isotonic regression was proposed to compute the spline estimates.
2. The spline estimators of the nonparametric components were shown to achieve the optimal rate of convergence under the smoothness condition.
3. The estimators of the regression parameters were shown to be asymptotically normal and efficient.
4. A direct and simple variance estimation method based on the least-squares estimation was proposed.



Efficient estimation of partially linear additive Cox model under monotonicity constraint

Minggen Lu, School of Community Health Sciences, University of Nevada, Reno, NV, USA
Tao Lu, Department of Mathematics and Statistics, University of Nevada, Reno, NV, USA
Chin-Shang Li, Division of Biostatistics, Department of Public Health Sciences, University of California, Davis, CA, USA

May 1, 2017

Abstract

We provide a simple and practical, yet flexible, spline estimation method for the partially linear additive Cox model with monotonicity constraint. We propose to approximate the unknown monotone function by monotone B-splines, and employ a hybrid numerical approach based on the Newton-Raphson algorithm and isotonic regression to compute the spline estimates. We show that the spline estimators of the nonparametric components achieve the optimal rate of convergence under the smoothness condition, and that the estimators of the regression parameters are asymptotically normal and efficient. Moreover, a direct variance estimation method based on least-squares estimation is proposed. The finite-sample performance of the spline estimates is evaluated by a Monte Carlo study. The methodology is illustrated with a synovial sarcoma dataset.

Keywords: additive Cox model; efficient estimation; isotonic regression; monotone B-spline; partial likelihood


1 Introduction

The Cox proportional hazards model assumes that the conditional hazard rate of the failure time has the form

λ(t|X) = λ_0(t) exp(X^T β),    (1)

where β ∈ R^d is a vector of unknown regression parameters and λ_0(t) is an arbitrary baseline hazard function. An important assumption of the Cox model (1) is that the effect of the covariates on the log-hazard function is linear. This assumption, however, is not always valid in practice: in some situations, the functional forms of some covariate effects can be highly nonlinear, and the linearity assumption may lead to substantial modeling error and incorrect inference. To relax this restrictive assumption, the nonparametric Cox model (O'Sullivan, 1993; Fan et al., 1997), the partially linear Cox model (Sleeper and Harrington, 1990; Gray, 1992; Heller, 2001; Cai et al., 2008; Tang et al., 2013), and the single-index Cox model (Huang and Liu, 2006; Sun et al., 2008) have been extensively discussed in the literature. The partially linear additive Cox model, which accommodates some covariates, such as treatment, in linear form while the other covariates are modeled by arbitrary functions, extends the partially linear Cox model and avoids the curse of dimensionality (Stone, 1985) that can occur in unstructured nonparametric regression. Huang (1999) studied efficient estimation of the partially linear additive Cox model using polynomial splines; the distribution theory of the spline estimators was developed and the estimation procedure can be implemented with existing software. However, the variance estimation of the estimated coefficients and the impact of spline knot selection on the estimation were not discussed. Du et al. (2010) considered a variable selection procedure for a Cox model with semiparametric relative risk, which includes the partially linear additive model as a special case, using a penalized estimation method.
A smoothing spline ANOVA model was applied to estimate the nonparametric component, and a profile partial likelihood approach was proposed to calculate the estimates of the regression parameters. In many studies, there are monotonic relationships between some covariates and the survival outcome. This leads us to consider the partially linear additive Cox model with monotonicity constraint. We assume the conditional hazard function of the failure time has the form

λ(t|X, W_1, W_2) = λ_0(t) exp(X^T β + ψ_1(W_1) + ψ_2(W_2)),    (2)

where W_1, W_2 ∈ R, X ∈ R^d, ψ_1 is an unknown smooth function, and ψ_2 is an unknown monotone function. In this work the baseline hazard function λ_0(t) is regarded as a nuisance parameter, and the estimation of β, ψ_1, and ψ_2 is of interest. The proposed method can be generalized easily to a model with multiple monotone functions. To the best of our knowledge, the proposed model (2) has not been studied in the literature. The goals of this work are to study the theoretical properties of the spline estimators of β, ψ_1, and ψ_2 and to provide a flexible and efficient numerical algorithm. We consider spline-based sieve M-estimation for the model (2). The non-monotone nonparametric component ψ_1 is approximated by an unconstrained B-spline, while the monotone function ψ_2 is approximated by a monotone B-spline. The advantage of the proposed spline estimation is that, once the spline basis functions are determined, the unknown functions are completely characterized by the spline coefficients and, hence, we can estimate the regression parameters and the spline coefficients simultaneously. Moreover, the asymptotic variance of the spline estimator of β can be derived by taking advantage of the spline approximation. To accommodate the monotonicity of ψ_2, a hybrid algorithm is proposed to calculate the spline estimates. The proposed monotone B-spline method has several attractive properties. First, the spline estimator of β based on the partial likelihood achieves semiparametric efficiency. Second, by allowing the number of knots to increase with the sample size at an appropriate rate, the spline estimators of the nonparametric components achieve the optimal rate of convergence under the smoothness condition. Third, we propose a direct and simple way to consistently estimate the asymptotic variance of the spline estimator of β. Finally, the proposed hybrid algorithm is easy to implement using existing software. The rest of this paper is organized as follows: the spline likelihood estimators and the hybrid algorithm are discussed in Section 2; the asymptotic properties of the spline estimators are established in Section 3; a simple and consistent variance estimation method is provided in Section 4.
The spline method is evaluated via a Monte Carlo study and illustrated with a synovial sarcoma dataset from the Surveillance, Epidemiology, and End Results (SEER) database in Section 5. A summary of the study and future work are discussed in Section 6. Finally, the proofs of the asymptotic results are sketched in the Appendix.


2 Model and algorithm

2.1 Model

Let T^u and T^c be the failure time and censoring time, respectively. The observable random variable is (T, Δ, Z), where T = min(T^u, T^c), Δ = I(T^u ≤ T^c), Z = (X^T, W^T)^T, and W = (W_1, W_2)^T, with I(·) representing the indicator function. Throughout, we assume that T^u and T^c are conditionally independent given the covariate Z. Let (T_1, Δ_1, Z_1), ..., (T_n, Δ_n, Z_n) be a random sample. The logarithm of the partial likelihood function under the partially linear additive Cox model (2) can be written as

l_n(β, ψ) = Σ_{i=1}^n Δ_i [ X_i^T β + ψ(W_i) − log Σ_{k: T_k ≥ T_i} exp(X_k^T β + ψ(W_k)) ],    (3)

where ψ(W_i) = ψ_1(W_{i1}) + ψ_2(W_{i2}). Because any constants in the regression functions ψ_r, r = 1, 2, can be absorbed into the baseline hazard function λ_0, the ψ_r can only be identified up to constants. For identifiability, we impose ψ_1(w_1) = ψ_2(w_2) = 0 for some w_r in the domains of ψ_r. In applications, the ψ_r can be centered at the lower bounds, upper bounds, means, or medians of their domains.

We construct two spline-based sieve spaces: one for the non-monotone function ψ_1 is a space of unconstrained B-splines, and the other for the monotone function ψ_2 is a space of monotone B-splines. Assume W takes values in [c_1, d_1] × [c_2, d_2], where c_1, c_2, d_1, and d_2 are finite numbers. Let T_n = {ξ_i}_{i=1}^{j_n+2l}, with c_1 = ξ_1 = ... = ξ_l < ξ_{l+1} < ... < ξ_{j_n+l} < ξ_{j_n+l+1} = ... = ξ_{j_n+2l} = d_1, be a sequence of knots that divides the interval [c_1, d_1] into j_n + 1 subintervals I_i = [ξ_{l+i}, ξ_{l+1+i}], i = 0, ..., j_n. A spline of order l ≥ 1 with knots T_n is a polynomial of degree l − 1 on each subinterval of the partition. For instance, a spline of order l = 3 is a piecewise quadratic polynomial with continuous first derivative. Let S_n(T_n, l) be the class of splines of order l ≥ 1 with knots T_n. By Corollary 4.10 of Schumaker (1981), any such spline can be uniquely represented as a linear combination of B-spline basis functions with the same smoothness and under the same partition T_n; that is, for any s ∈ S_n(T_n, l), there exists a set of B-spline basis functions {b_j : 1 ≤ j ≤ q_n} such that s = Σ_{j=1}^{q_n} γ_j b_j, where q_n = j_n + l is the number of basis functions. Let {ζ_i}_{i=1}^{k_n+2l}, with c_2 = ζ_1 = ... = ζ_l < ζ_{l+1} < ... < ζ_{k_n+l+1} = ... = ζ_{k_n+2l} = d_2, be another sequence of knots with corresponding B-spline basis functions m_k, k = 1, ..., p_n = k_n + l. To accommodate the monotonicity constraint on ψ_2, construct the monotone spline space

M_n = { Σ_{k=1}^{p_n} α_k m_k : α_1 ≤ ... ≤ α_{p_n} }.

The nondecreasing coefficients α_1 ≤ ... ≤ α_{p_n} guarantee that the resulting spline function Σ_{k=1}^{p_n} α_k m_k is nondecreasing, by Theorem 5.9 of Schumaker (1981). If the ψ_r are smooth enough, we approximate ψ_1 and ψ_2 by an unconstrained B-spline function ψ_{1n} and a monotone B-spline function ψ_{2n}, namely,

ψ_1(W_1) ≈ ψ_{1n}(W_1) = Σ_{j=1}^{q_n} γ_j b_j(W_1)

and

ψ_2(W_2) ≈ ψ_{2n}(W_2) = Σ_{k=1}^{p_n} α_k m_k(W_2)

under the constraint α_1 ≤ ... ≤ α_{p_n}, respectively.
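As a concrete illustration of the monotone sieve M_n, the following Python sketch (using scipy; the knot layout and coefficient values are illustrative and not from the paper) builds a quadratic B-spline (order l = 3) with nondecreasing coefficients and checks numerically that the resulting function is nondecreasing:

```python
import numpy as np
from scipy.interpolate import BSpline

# Quadratic B-spline (order l = 3, degree l - 1 = 2) on [c2, d2] = [0, 1]
# with k_n = 3 interior knots; boundary knots are repeated l times,
# as in the knot sequence {zeta_i} above.
degree = 2
interior = [0.25, 0.5, 0.75]
knots = np.r_[[0.0] * (degree + 1), interior, [1.0] * (degree + 1)]

# Nondecreasing coefficients alpha_1 <= ... <= alpha_{p_n}
# (p_n = k_n + l = 6 basis functions).
alpha = np.array([-1.0, -0.5, -0.5, 0.2, 0.8, 1.5])
spline = BSpline(knots, alpha, degree)

# By Theorem 5.9 of Schumaker (1981), the spline is nondecreasing.
grid = np.linspace(0.0, 1.0, 501)
values = spline(grid)
assert np.all(np.diff(values) >= -1e-12)
```

The check reflects the sufficient condition above: nondecreasing B-spline coefficients give a nonnegative spline derivative, hence a nondecreasing function.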

By replacing ψ_r with ψ_{rn}, we obtain the spline model

λ(t|X, W) = λ_0(t) exp( X^T β + Σ_{j=1}^{q_n} γ_j b_j(W_1) + Σ_{k=1}^{p_n} α_k m_k(W_2) ).    (4)

Note that the spline model (4) is over-parameterized because of the linear constraints on the B-spline basis functions: Σ_{j=1}^{q_n} b_j = Σ_{k=1}^{p_n} m_k = 1. The over-parameterization can be addressed by removing any one of the b_j's and any one of the m_k's from the fitted model. For convenience, we assume that b_{q_n} and m_{p_n} are omitted, giving a full-rank model in which the effects of b_{q_n} and m_{p_n} are absorbed into the baseline hazard λ_0(t). Let φ_{1n} = Σ_{j=1}^{q_n−1} γ_j b_j and φ_{2n} = Σ_{k=1}^{p_n−1} α_k m_k. Define γ = (γ_1, ..., γ_{q_n−1})^T and α = (α_1, ..., α_{p_n−1})^T. The resulting spline partial log-likelihood function for (β^T, γ^T, α^T)^T is

l_n(β, γ, α) = Σ_{i=1}^n Δ_i [ X_i^T β + φ_n(W_i) − log Σ_{k: T_k ≥ T_i} exp(X_k^T β + φ_n(W_k)) ],    (5)

where φ_n(W_i) = φ_{1n}(W_{1i}) + φ_{2n}(W_{2i}). Let (β̂^T, γ̂^T, α̂^T)^T be the values that maximize the spline partial log-likelihood function (5). Accordingly, the spline estimators of ψ_1 and ψ_2 are defined as ψ̂_1(W_1) = Σ_{j=1}^{q_n−1} γ̂_j b_j(W_1) − Σ_{j=1}^{q_n−1} γ̂_j b_j(w_1) and ψ̂_2(W_2) = Σ_{k=1}^{p_n−1} α̂_k m_k(W_2) − Σ_{k=1}^{p_n−1} α̂_k m_k(w_2), respectively, for some w_r in the domains of ψ_r. With the spline partial log-likelihood function (5), we can estimate the regression parameter vector β and the spline coefficients (γ^T, α^T)^T simultaneously, which makes the computation less demanding.

Remark 1. Under the linear inequality constraint α_1 ≤ ... ≤ α_{p_n}, the spline function Σ_{k=1}^{p_n} α_k m_k is monotone nondecreasing. This condition is not necessary, however: some monotone nondecreasing splines, such as cubic B-splines, may violate it. In that situation, cubic B-splines under the linear inequality condition are over-constrained. Lu (2015) showed that the linear inequality condition is both sufficient and necessary for monotone quadratic B-splines. Therefore, we apply quadratic B-splines to approximate the regression functions in the simulation study and the real application.
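For concreteness, the partial log-likelihood in (5) depends on the data only through the linear predictors X_i^T β + φ_n(W_i); a minimal Python sketch of its evaluation (toy inputs, no ties, function names are ours) is:

```python
import numpy as np

def cox_partial_loglik(times, status, eta):
    """Log partial likelihood: sum over events of
    eta_i - log(sum of exp(eta_k) over the risk set {k: T_k >= T_i})."""
    times, status, eta = map(np.asarray, (times, status, eta))
    loglik = 0.0
    for i in np.flatnonzero(status == 1):
        risk_set = times >= times[i]      # indicator of {k: T_k >= T_i}
        loglik += eta[i] - np.log(np.sum(np.exp(eta[risk_set])))
    return loglik

# Toy check with two uncensored observations.
val = cox_partial_loglik([1.0, 2.0], [1, 1], [0.5, -0.5])
```

In the spline model, eta would be X @ beta plus the (reduced) B-spline design matrices times (γ, α); the function itself is agnostic to how the linear predictor was built.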

2.2 Algorithm

Let b_i = (b_1(W_{i1}), ..., b_{q_n−1}(W_{i1}))^T and m_i = (m_1(W_{i2}), ..., m_{p_n−1}(W_{i2}))^T, i = 1, ..., n. Define ω_{ik} = I(T_k ≥ T_i) exp(X_k^T β + φ_n(W_k)) / Σ_{j: T_j ≥ T_i} exp(X_j^T β + φ_n(W_j)), k = 1, ..., n. Clearly, Σ_{k=1}^n ω_{ik} = 1 for any i. Let η_i = (X_i^T, b_i^T, m_i^T)^T. The score statistic ∇l_n(β, γ, α) and the observed information I(β, γ, α) can be written as

∇l_n(β, γ, α) = Σ_{i=1}^n Δ_i ( η_i − Σ_{k: T_k ≥ T_i} ω_{ik} η_k ),

I(β, γ, α) = Σ_{i=1}^n Δ_i [ Σ_{k: T_k ≥ T_i} ω_{ik} η_k η_k^T − ( Σ_{k: T_k ≥ T_i} ω_{ik} η_k )( Σ_{k: T_k ≥ T_i} ω_{ik} η_k^T ) ].

Define the multivariate counting processes N_i(t) = Δ_i I(T_i ≤ t), i = 1, ..., n. Let

M_i(t) = N_i(t) − ∫_0^t Y_i(u) exp(X_i^T β + ψ(W_i)) dΛ(u)

be the martingales associated with the partially linear additive Cox model, where Λ(t) = ∫_0^t λ_0(u) du is the baseline cumulative hazard function and Y_i(u) = I(T_i ≥ u). For any bounded function a_i ≡ a(β, γ, α; T_i, Z_i),

Σ_{i=1}^n Δ_i ( a_i − Σ_{k: T_k ≥ T_i} ω_{ik} a_k ) = Σ_{i=1}^n ∫_0^∞ [ a_i − Σ_{k: T_k ≥ T_i} exp(X_k^T β + ψ(W_k)) a_k / Σ_{k: T_k ≥ T_i} exp(X_k^T β + ψ(W_k)) ] dM_i(t).

It is easy to show that the preceding expression is a martingale; hence

E[ Σ_{i=1}^n Δ_i ( a_i − Σ_{k: T_k ≥ T_i} ω_{ik} a_k ) ] = 0.

By the Cauchy-Schwarz inequality and the linear constraint Σ_{k=1}^n ω_{ik} = 1, it can be shown that the observed information I(β, γ, α) is positive semidefinite, which implies that the spline partial log-likelihood function (5) is concave in (β, γ, α). This leads us to propose a modified Newton-Raphson (NR) algorithm to calculate the spline estimate of (β, γ, α) simultaneously. Because the NR update of α is not necessarily monotone, to accommodate the monotonicity constraint on α we project the NR update α̃ onto the feasible region X_α = {α : α_1 ≤ ... ≤ α_{p_n−1}} during the iteration by

α^{(k+1)} = argmin_{α ∈ X_α} (α − α̃^{(k+1)})^T I(τ̃^{(k+1)}) (α − α̃^{(k+1)}),    (6)

where τ̃^{(k+1)} = (β̃^{(k+1)T}, γ̃^{(k+1)T}, α̃^{(k+1)T})^T and I(τ̃^{(k+1)}) is a positive definite matrix. If we choose I(τ̃^{(k+1)}) to be a diagonal matrix with diagonal elements of negative I^{−1}(τ̃^{(k+1)}) with respect to α, the optimization problem (6) reduces to the standard isotonic regression (IR) (Robertson et al., 1988). The hybrid iterative optimization procedure based on NR/IR is as follows:

Step 0. Let τ^{(0)} = (β^{(0)T}, γ^{(0)T}, α^{(0)T})^T be the estimate from the Cox model λ(t|X, W) = λ_0(t) exp(X^T β + b^T γ + m^T α). Sort α^{(0)} in increasing order.

Step 1 (Newton-Raphson update). Update the current τ^{(k)} = (β^{(k)T}, γ^{(k)T}, α^{(k)T})^T by the Newton-Raphson algorithm: τ̃^{(k+1)} = τ^{(k)} − I^{−1}(τ^{(k)}) ∇l_n(τ^{(k)}).

Step 2 (Isotonic regression). Construct the cumulative sum diagram {P_i : i = 0, ..., p_n − 1} with P_0 = (0, 0) and P_i = ( Σ_{l=1}^i w_l, Σ_{l=1}^i w_l α̃_l^{(k+1)} ), where w_l, l = 1, ..., p_n − 1, are the diagonal elements of negative I^{−1}(τ̃^{(k+1)}) with respect to α. The monotone update of α̃^{(k+1)} is then calculated by

α_i^{(k+1)} = max_{j ≤ i} min_{l ≥ i} ( Σ_{m=j}^l w_m α̃_m^{(k+1)} ) / ( Σ_{m=j}^l w_m ).

Set β^{(k+1)} = β̃^{(k+1)} and γ^{(k+1)} = γ̃^{(k+1)}.

Step 3. Stop the iteration if the convergence criterion ‖τ^{(k+1)} − τ^{(k)}‖ < ε = 10^{−6} is met. Otherwise, go back to Step 1.

Remark 2. Step 0 can use standard Cox proportional hazards procedures, for example coxph in R. The isotonic regression in Step 2 can be performed by applying the function pava in the R package Iso.

Remark 3. Our experiments show that the algorithm performs very well and usually converges within a few iterations. Convergence problems are very rare when the initial value of τ is chosen as the estimate from the standard Cox proportional hazards model.
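The isotonic projection in Step 2 is a weighted isotonic regression, which pava in the R package Iso computes; for readers without R, a minimal self-contained pool-adjacent-violators routine (our own implementation, equivalent to the min-max formula above) might look like:

```python
import numpy as np

def pava(y, w=None):
    """Weighted isotonic regression: minimize sum_i w_i (x_i - y_i)^2
    subject to x_1 <= ... <= x_n (pool-adjacent-violators)."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # Each block stores [weighted mean, total weight, block length].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Pool adjacent blocks while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.concatenate([np.full(n, m) for m, _, n in blocks])

# An NR update (3, 1, 2) is projected to the monotone solution (2, 2, 2).
print(pava([3.0, 1.0, 2.0]))  # -> [2. 2. 2.]
```

In the hybrid algorithm, y would be the NR update α̃^{(k+1)} and w the weights taken from the inverse information, as described in Step 2.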

2.3 Choosing the number and location of knots

It is well known that the performance of B-spline estimation depends on the choice of the order and the knots of the spline. In this study, quadratic splines are used to approximate the unknown functions ψ_1 and ψ_2. For simplicity, we assume q_n = p_n; that is, the numbers of knots for the unconstrained and monotone splines are the same. We propose to select the optimal number of interior knots, k_n^*, by minimizing the Akaike information criterion (AIC)

AIC(k_n) = −2 l_n(β̂, γ̂, α̂; k_n) + 2(2k_n + 2l − 2 + d),

where l = 3 for quadratic splines and k_n + l − 1 is the number of B-spline basis functions for each nonparametric component. According to the optimal rate of convergence for ψ̂_r given in Section 3, we search for the minimizer of AIC in a neighborhood of n^{1/(1+2p)}, such as [max(2, 0.5 N_p), min(2 N_p, n^{1/2})], where N_p = ceiling(n^{1/(1+2p)}) and p ≥ 1 is the smoothness of the unknown functions. In the simulation study and the synovial sarcoma data analysis, p is chosen to be 1. For a given number of interior knots, k_n, we adopt the data-driven approach first introduced by Rosenberg (1995), in which the k/(k_n + 1) quantiles, k = 1, ..., k_n, are selected as the interior knots. Our numerical studies reveal that the average optimal number of interior knots is close to n^{1/3}. Therefore, we recommend an empirical rule: select the optimal number of interior knots in the neighborhood n^{1/3} ± 2 via AIC, and determine the locations of the interior knots by the quantile method. Similar methods can be found in Xue and Liang (2010) and Lu and Loomis (2013) for unconstrained B-splines and Lu (2015) for monotone B-splines.
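The knot-selection rule described above (quantile placement plus an AIC search near n^{1/3}) can be sketched as follows; the spline model fit itself is abstracted into a user-supplied log-likelihood callback, since it requires the full estimation machinery of Section 2.2, and all function names here are ours:

```python
import numpy as np

def quantile_knots(w, k_n):
    """Interior knots at the k/(k_n + 1) quantiles, k = 1, ..., k_n
    (Rosenberg, 1995)."""
    probs = np.arange(1, k_n + 1) / (k_n + 1)
    return np.quantile(w, probs)

def candidate_knot_numbers(n, p=1):
    """Search window around n^(1/(1+2p)); with p = 1 this is roughly the
    n^(1/3) +/- 2 empirical rule recommended in the text."""
    n_p = int(np.ceil(n ** (1.0 / (1 + 2 * p))))
    lo = max(2, n_p - 2)
    hi = min(n_p + 2, int(np.sqrt(n)))
    return list(range(lo, hi + 1))

def select_knots_by_aic(w, n, loglik_fn, d, l=3, p=1):
    """Pick k_n minimizing AIC(k_n) = -2 l_n + 2(2 k_n + 2l - 2 + d);
    loglik_fn(knots) must return the maximized spline partial log-likelihood."""
    best = None
    for k_n in candidate_knot_numbers(n, p):
        knots = quantile_knots(w, k_n)
        aic = -2.0 * loglik_fn(knots) + 2.0 * (2 * k_n + 2 * l - 2 + d)
        if best is None or aic < best[0]:
            best = (aic, k_n, knots)
    return best
```

Because the penalty term grows with k_n while the fitted log-likelihood can only improve, the AIC trades off the two over the small candidate window rather than over all possible knot counts.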

3 Asymptotic Results

In this section we study the asymptotic properties of the spline estimators. Let τ = (β^T, ψ_1, ψ_2)^T and τ_0 = (β_0^T, ψ_{0,1}, ψ_{0,2})^T. Assume the regression parameter space Θ ⊂ R^d is compact. Let the nonparametric spaces be Ψ_1 = {ψ_1 : ψ_1 is smooth on [c_1, d_1], −∞ < c_1 < d_1 < ∞} and Ψ_2 = {ψ_2 : ψ_2 is smooth and monotone on [c_2, d_2], −∞ < c_2 < d_2 < ∞}, and let Ψ = Ψ_1 × Ψ_2. Throughout, let ‖·‖ be the Euclidean norm, let ‖·‖_2 be the L_2-norm, and denote by ‖·‖_∞ the supremum norm. Define the L_2-norm ‖·‖_2 on Θ × Ψ by

‖τ_2 − τ_1‖_2^2 = ‖β_2 − β_1‖^2 + ‖ψ_{12} − ψ_{11}‖_2^2 + ‖ψ_{22} − ψ_{21}‖_2^2,

for τ_r = (β_r^T, ψ_{1r}, ψ_{2r})^T ∈ Θ × Ψ, r = 1, 2. Let P be the probability measure of (T, Δ, Z) and P_n the empirical measure of (T_i, Δ_i, Z_i), i = 1, ..., n. Given a measurable function f, Pf and P_n f denote the expectations of f under P and P_n, respectively. Let P_Δ be the subprobability measure of (T, Δ = 1, Z) with P_Δ f = ∫ Δ f dP, and let P_{Δn} be the corresponding empirical subprobability measure with P_{Δn} f = ∫ Δ f dP_n = (1/n) Σ_{i=1}^n Δ_i f(T_i, Δ_i, Z_i).

Define the processes N(t) = ΔI(T ≤ t) and Y(t) = I(T ≥ t). Let

M(t) = N(t) − ∫_0^t Y(u) exp(X^T β + ψ(W)) dΛ(u)

be the martingale associated with the partially linear additive Cox model. For a single observation (Z^T, Δ, T)^T, the log-likelihood function for τ is, up to an additive term not containing τ,

ℓ(τ) = Δ log λ_0(T) + Δ[X^T β + ψ(W)] − Λ(T) exp(X^T β + ψ(W)).

Let g(Z) = X^T β + ψ(W) and g_0(Z) = X^T β_0 + ψ_0(W). The score function for β, ℓ̇_β(τ), is the first partial derivative of ℓ(τ) with respect to β, namely

ℓ̇_β(τ) = ∂ℓ(τ)/∂β = [Δ − Λ(T) exp(g(Z))] X.

Consider a parametric smooth submodel (λ_η(t), ψ_{η_r}(W_r)) such that λ_η(t)|_{η=0} = λ_0(t), ψ_{η_r}(W_r)|_{η_r=0} = ψ_r(W_r),

∂ log λ_η(t)/∂η |_{η=0} = a(t)  and  ∂ψ_{η_r}(W_r)/∂η_r |_{η_r=0} = h_r(W_r).

Let A and H_r be the classes of such a and h_r, respectively. The score operators for Λ and ψ_r are defined as

ℓ̇_Λ(τ)[a] = ∂ℓ(β, Λ_η, ψ_{η_r})/∂η |_{η=η_r=0} = Δ a(T) − exp(g(Z)) ∫_0^∞ Y(t) a(t) dΛ(t)

and

ℓ̇_{ψ_r}(τ)[h_r] = ∂ℓ(β, Λ_η, ψ_{η_r})/∂η_r |_{η=η_r=0} = [Δ − exp(g(Z)) Λ(T)] h_r(W_r).

For h_r = (h_{r1}, ..., h_{rd})^T and h = h_1 + h_2, define ℓ̇_{ψ_r}(τ)[h_r] = (ℓ̇_{ψ_r}(τ)[h_{r1}], ..., ℓ̇_{ψ_r}(τ)[h_{rd}])^T and ℓ̇_ψ(τ)[h] = ℓ̇_{ψ_1}(τ)[h_1] + ℓ̇_{ψ_2}(τ)[h_2]. Moreover, for a = (a_1, ..., a_d)^T, denote ℓ̇_Λ(τ)[a] = (ℓ̇_Λ(τ)[a_1], ..., ℓ̇_Λ(τ)[a_d])^T. The score function for the regression vector β, and the score operators for the cumulative hazard function Λ and the regression functions ψ_r, have the martingale representations

ℓ̇_β(τ) = [Δ − Λ(T) exp(g(Z))] X = ∫_0^∞ X dM(t),
ℓ̇_Λ(τ)[a] = Δ a(T) − exp(g(Z)) ∫_0^∞ Y(t) a(t) dΛ(t) = ∫_0^∞ a(t) dM(t),
ℓ̇_ψ(τ)[h] = [Δ − exp(g(Z)) Λ(T)] h = ∫_0^∞ h dM(t).

The efficient score for β is defined as ℓ*_β(τ) = ℓ̇_β(τ) − ℓ̇_Λ(τ)[a*] − ℓ̇_ψ(τ)[h*], where (a*, h*) is called the least favorable direction, satisfying

E{ [ℓ̇_β(τ) − ℓ̇_Λ(τ)[a*] − ℓ̇_ψ(τ)[h*]] ℓ̇_Λ(τ)[a] } = 0,
E{ [ℓ̇_β(τ) − ℓ̇_Λ(τ)[a*] − ℓ̇_ψ(τ)[h*]] ℓ̇_{ψ_r}(τ)[h_r] } = 0,

for any a ∈ A and h_r ∈ H_r. Using the same arguments as in Section 4 of Huang (1999), we can derive the efficient score of β for the partially linear additive Cox model, namely

ℓ*_β(τ) = ∫_0^∞ (X − a* − h*) dM(t),

where a*(t) = E[X − h*_1 − h*_2 | T = t, Δ = 1] and h*_r(w_r) = E[X − a* | W_r = w_r, Δ = 1]. The information bound for β takes the form

I(β) = E[ ℓ*_β(τ)^{⊗2} ] = E[ Δ (X − a* − h*)^{⊗2} ].

Theorem 1 (Uniform consistency and rate of convergence). Suppose conditions B1-B9 in the Appendix hold and 1/(2p + 2) < ν < 1/(2p). Then

‖τ̂ − τ_0‖_2 = O_p(n^{−min(pν, (1−ν)/2)}).

Consequently, ‖ψ̂_r − ψ_{0,r}‖_∞ = o_p(1). Furthermore, if ν = 1/(1 + 2p), then O_p(n^{−min(pν, (1−ν)/2)}) = O_p(n^{−p/(1+2p)}), which is the optimal rate of convergence in semiparametric regression.

Theorem 2 (Asymptotic normality). Suppose conditions B1-B9 hold. Then

n^{1/2}(β̂ − β_0) = n^{1/2} I^{−1}(β_0) P_n ℓ*_β(τ_0) + o_p(1) → N(0, I^{−1}(β_0))

in distribution, as n → ∞.

Remark 4. Theorem 1 shows that the ψ̂_r are uniformly consistent estimators of ψ_{0,r} and that the optimal rate of convergence for ψ̂_r is p/(1 + 2p), p ≥ 1. Theorem 2 shows that the spline estimator of β based on the partial likelihood function achieves the information bound and is therefore asymptotically efficient among all regular estimators.

Remark 5. Theorem 2 shows that the number of knots ensuring n^{1/2}-consistency of β̂ is between n^{1/(2p+2)} and n^{1/(2p)}, including n^{1/(1+2p)}, which corresponds to the optimal rate of convergence of β̂. This result implies that undersmoothing (i.e., q_n and p_n ≫ n^{1/(1+2p)}) is not necessary for semiparametric efficient estimation using splines in the partially linear additive Cox model. Here b_n ≫ a_n means lim_{n→∞} a_n/b_n = 0.

4 Variance Estimation

Although the efficient information matrix I(β_0) in Theorem 2 has an explicit expression, it is not straightforward to estimate it directly. However, by Lemma 1 of Sasieni (1992), I(β_0) can be written as

P_Δ [ X − h* − E[Y(t) exp(g_0(Z))(X − h*)] / E[Y(t) exp(g_0(Z))] ]^{⊗2},

where h* is a minimizer of

ρ(h) = P_Δ ‖ X − h − E[Y(t) exp(g_0(Z))(X − h)] / E[Y(t) exp(g_0(Z))] ‖_2^2,

for any h_1 ∈ H_1^d, h_2 ∈ H_2^d, and h = h_1 + h_2. Let

S_{0,n}(τ; t) = (1/n) Σ_{k=1}^n Y_k(t) exp(g(Z_k)),  S_0(τ; t) = E[Y(t) exp(g(Z))],
S_{1,n}(τ; t) = (1/n) Σ_{k=1}^n Y_k(t) exp(g(Z_k)) X_k,  S_1(τ; t) = E[Y(t) exp(g(Z)) X],

and, for h_1 ∈ H_1^d, h_2 ∈ H_2^d and h = h_1 + h_2, let

S_{2,n}(τ; t)[h] = (1/n) Σ_{k=1}^n Y_k(t) exp(g(Z_k)) h(W_k),  S_2(τ; t)[h] = E[Y(t) exp(g(Z)) h(W)].

Let v = (Z^T, t)^T. Define

m_{1,n}(τ; v) = X − S_{1,n}(τ; t)/S_{0,n}(τ; t),  m_1(τ; v) = X − S_1(τ; t)/S_0(τ; t),

and

m_{2,n}(τ; v)[h] = h − S_{2,n}(τ; t)[h]/S_{0,n}(τ; t),  m_2(τ; v)[h] = h − S_2(τ; t)[h]/S_0(τ; t).

Observe that I(β_0) = P_Δ [m_1(τ_0; v) − m_2(τ_0; v)[h*]]^{⊗2}. Hence, a natural estimator of I(β_0) is

Î_n = P_{Δn} [m_{1,n}(τ̂; v) − m_{2,n}(τ̂; v)[ĥ*]]^{⊗2},

where ĥ* is the minimizer of

ρ_n(h) = P_{Δn} ‖m_{1,n}(τ̂; v) − m_{2,n}(τ̂; v)[h]‖_2^2.    (7)

Let f_n = (f_1, ..., f_{q_n})^T and g_n = (g_1, ..., g_{p_n})^T be the sets of spline basis functions such that any h_{l,1} ∈ H_1 and h_{l,2} ∈ H_2, l = 1, ..., d, can be written as h_{l,1} = Σ_{s=1}^{q_n} δ_{l,s} f_s = f_n^T δ_l and h_{l,2} = Σ_{t=1}^{p_n} η_{l,t} g_t = g_n^T η_l, where δ_l = (δ_{l,1}, ..., δ_{l,q_n})^T and η_l = (η_{l,1}, ..., η_{l,p_n})^T. Let e_n = (f_n^T, g_n^T)^T and ϑ_l = (δ_l^T, η_l^T)^T. The optimization problem in (7) is then equivalent to finding ϑ̂_l = (δ̂_l^T, η̂_l^T)^T, where δ̂_l = (δ̂_{l,1}, ..., δ̂_{l,q_n})^T and η̂_l = (η̂_{l,1}, ..., η̂_{l,p_n})^T, to minimize

Σ_{l=1}^d P_{Δn} [ m_{1,n}^{(l)}(τ̂; v) − ϑ_l^T m_{2,n}(τ̂; v)[e_n] ]^2,

where m_{2,n}(τ̂; v)[e_n] = (m_{2,n}(τ̂; v)[e_1], ..., m_{2,n}(τ̂; v)[e_{q_n+p_n}])^T and m_{1,n}^{(l)}(τ̂; v) is the lth element of m_{1,n}(τ̂; v). Denote

Î_ββ = P_{Δn} [m_{1,n}(τ̂; v)]^{⊗2},
Î_βψ = P_{Δn} [m_{1,n}(τ̂; v) m_{2,n}^T(τ̂; v)[e_n]],  Î_ψβ = Î_βψ^T,
Î_ψψ = P_{Δn} [m_{2,n}(τ̂; v)[e_n]]^{⊗2}.

A standard least-squares calculation gives

ϑ̂_l = Î_ψψ^{−1} P_{Δn} [ m_{1,n}^{(l)}(τ̂; v) m_{2,n}(τ̂; v)[e_n] ].

It follows that

Î_n = P_{Δn} [ m_{1,n}(τ̂; v) − Î_βψ Î_ψψ^{−1} m_{2,n}(τ̂; v)[e_n] ]^{⊗2} = Î_ββ − Î_βψ Î_ψψ^{−1} Î_ψβ,

which is the observed information of the partial log-likelihood function (5) with respect to β. The following theorem shows that the proposed observed information matrix Î_n is a consistent estimator of I(β_0).

Theorem 3. Under the same conditions assumed in Theorem 2, Î_n is asymptotically consistent for I(β_0).

Our Monte Carlo experiments show that the proposed variance estimation method works very well, and the approach is applied for inference on β in the real application.
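The closed form Î_n = Î_ββ − Î_βψ Î_ψψ^{−1} Î_ψβ is the Schur complement of the spline-coefficient block of the observed information; the following numpy sketch (with a randomly generated positive definite matrix standing in for the information matrix, not output from the actual fit) illustrates the computation and the usual profile-information identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 2, 5                                 # dim(beta), number of basis terms
A = rng.standard_normal((d + q, d + q))
info = A @ A.T + (d + q) * np.eye(d + q)    # stand-in full information (SPD)

# Partition into the beta block and the spline-coefficient block.
I_bb = info[:d, :d]
I_bp = info[:d, d:]
I_pp = info[d:, d:]

# Efficient information for beta: Schur complement of the spline block.
I_hat = I_bb - I_bp @ np.linalg.solve(I_pp, I_bp.T)

# Sanity check: the inverse of the Schur complement equals the beta block
# of the inverse of the full information (the profile-variance identity).
assert np.allclose(np.linalg.inv(I_hat), np.linalg.inv(info)[:d, :d])
```

In practice the standard errors of β̂ are the square roots of the diagonal of I_hat^{-1}/n (up to the normalization used for Î_n).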

5 Numerical Results

5.1 Simulation study

In this section we conduct a Monte Carlo simulation to evaluate the finite-sample performance of the proposed method. Assume a constant baseline hazard function, namely λ_0(t) = 1. The true event times are generated from an exponential distribution with hazard rate conditional on (X^T, W^T)^T given by

λ(t|X, W) = exp(X^T β + ψ_1(W_1) + ψ_2(W_2)),

where β = (β_1, β_2)^T = (0.5, −0.5)^T and the covariates X_1, X_2 ~ Bernoulli(0.5). The two functions are ψ_1(W_1) = 3 sin(W_1) and ψ_2(W_2) = exp(W_2/2) − 1, with W_1 ~ Uniform[−3, 3] and W_2 ~ Uniform[−2.5, 2.5]. The conditional distribution of the censoring variable is exponential with hazard function λ(t|X, W) = λ_C exp(X^T δ_1 + W^T δ_2), where δ_1 = (0.5, 0.5)^T and δ_2 = (0.5, 1)^T. The constant λ_C takes the value 0.1 or 1, so that the censoring rates are about 20.5% or 53.5%, respectively. The simulations are replicated 1,000 times with sample sizes n = 200 or 400. The proposed approach displays very similar finite-sample performance for the two censoring rates; in the sequel, we present only the results for the setting with a censoring rate of 20.5%. Quadratic B-splines are used in estimating the baseline cumulative hazard function. We perform a sensitivity analysis to assess the impact of the selection of knots on the spline estimation for n = 400. The number of interior knots k_n is chosen as 4, 6, 10, or 12, and the locations of the knots are then determined by the quantile method. The bias, standard deviation, average and standard deviation of the estimated standard errors based on the proposed variance estimation method, and coverage probability from 1,000 replications are presented in Table 1. No essential differences are observed across these settings, indicating that the proposed spline method is quite robust to the choice of the number of knots. As recommended in Section 2.3, the optimal number of interior knots that minimizes the AIC value is selected in a small neighborhood of n^{1/3}, for instance n^{1/3} ± 2. The summary statistics for β̂ and the results of the variance study based on the proposed variance estimation method and the standard bootstrap approach from 1,000 replications with n = 200 or 400 are given in Table 2.
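The simulation design above is straightforward to reproduce; a Python sketch (inverse-transform sampling of the exponential event and censoring times; the function and variable names are ours) is:

```python
import numpy as np

def simulate(n, lam_c=0.1, seed=1):
    """Generate (T, Delta, X, W) under the simulation design:
    event hazard exp(X'beta + psi1(W1) + psi2(W2)) and censoring hazard
    lam_c * exp(X'delta1 + W'delta2), both exponential given covariates."""
    rng = np.random.default_rng(seed)
    beta = np.array([0.5, -0.5])
    delta1, delta2 = np.array([0.5, 0.5]), np.array([0.5, 1.0])
    X = rng.binomial(1, 0.5, size=(n, 2)).astype(float)
    W = np.column_stack([rng.uniform(-3, 3, n), rng.uniform(-2.5, 2.5, n)])
    psi = 3 * np.sin(W[:, 0]) + (np.exp(W[:, 1] / 2) - 1)
    rate_event = np.exp(X @ beta + psi)
    rate_cens = lam_c * np.exp(X @ delta1 + W @ delta2)
    # Exponential with rate r has mean 1/r.
    t_event = rng.exponential(1.0 / rate_event)
    t_cens = rng.exponential(1.0 / rate_cens)
    T = np.minimum(t_event, t_cens)
    Delta = (t_event <= t_cens).astype(int)
    return T, Delta, X, W

T, Delta, X, W = simulate(400)
```

With lam_c = 0.1 the realized censoring fraction should be in the neighborhood of the 20.5% reported in the text, although the exact value varies by replication.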
The Monte Carlo results indicate that the sample biases are small and the standard deviations decrease as n increases, implying that the spline estimators are consistent. The histograms and quantile-quantile (Q-Q) plots presented in Figure 1 show that the spline estimators of β are approximately normally distributed. Moreover, the averages of the estimated standard errors of β̂ based on the proposed variance estimation method are close to the standard deviations of β̂, providing a numerical justification for the asymptotic normality (Theorem 2) and the variance estimation (Theorem 3). The empirical coverage probabilities of 95% confidence intervals with sample sizes of 200 or 400 based on the proposed variance estimation method are close to the 95% nominal level, and become closer when the sample size doubles, indicating that the spline approach works reasonably well. The bootstrap approach tends to slightly overestimate the standard errors of β̂, yielding empirical coverage probabilities larger than the nominal 95%. As the sample size increases, however, the difference becomes less conspicuous. Moreover, the proposed variance estimation method exhibits smaller standard deviations of the estimated standard errors, indicating that it yields more reliable estimates of the standard errors of β̂ than the bootstrap procedure. To evaluate the advantage of the monotone spline approach, we compared the mean-squared errors (MSEs) of the constrained and unconstrained spline estimates of the monotone function ψ_2 at the points −2.3, −2.2, ..., 2.3. As shown in Figure 2, the spline estimates with the monotonicity constraint exhibit smaller MSEs than the unconstrained estimates, providing numerical support for the proposed monotone spline approach. The pointwise MSEs of the monotone spline estimates decrease as the sample size increases, indicating the consistency of the monotone spline estimators. Figure 3 presents the spline estimates and the corresponding pointwise 95% confidence intervals based on the quantiles of the estimates of ψ_1 and ψ_2 from 1,000 replications with n = 200 or 400. The fitted curves follow the true curves very closely, and the upper and lower limits of the confidence intervals are close to the true curves, indicating little bias and small variation of the spline estimators. As the sample size increases, the variability decreases accordingly. In general, the proposed spline method demonstrates desirable numerical properties for the moderate sample sizes in our simulation setting.

5.2 Data analysis

We apply the proposed method to the synovial sarcoma dataset from SEER database. Synovial sarcoma is a rare form of soft tissue cancer, which usually occurs near to the joints of the arm, neck or leg. In general, synovial sarcoma is considered as high-graded sarcoma with poor prognosis. In this application, the primary end point for therapy comparison is time to death by synovial sarcoma. After exclusion of patients who lack pathologic confirmation or are less than 18 years old because of the differences in management of adult and pediatric patients, especially with respect to the administration of radiotherapy and chemotherapy, the dataset consists of 492 complete observations. The censoring rate is 57.3%, or 282 survival times were censored. The objective of our analysis is to investigate the relationship of radiotherapy to the synovial sarcoma survival after controlling for other covariates. To characterize the potential nonlinear effect of age at diagnosis and tumor size on survival, we employ the following partially linear additive Cox model λ(t|X, W1 , W2 ) = X| β + ψ1 (W1 ) + ψ2 (W2 ),

where X = (X1, . . . , X5)^⊤ is a vector of covariates including radiotherapy, gender, ethnic group, SEER tumor stage, and anatomic site; W1 and W2 are age at diagnosis and tumor size, respectively; ψ1 is an unspecified smooth function; and ψ2 is a smooth monotone function. Clinical studies support that tumor size is an important prognostic factor for mortality; it is therefore reasonable to assume that ψ2 is monotone increasing. The unknown functions ψ1 and ψ2 are fitted as an unconstrained quadratic B-spline and a monotone quadratic B-spline, respectively, with 4 knots each, as determined by the AIC; the locations of the knots are chosen by the quantile method. For identifiability, we set ψ1 and ψ2 to be 0 at the minimums of age at diagnosis and tumor size, respectively. The 95% pointwise confidence intervals of ψ1 and ψ2 are constructed from the quantiles of the estimates of ψ1 and ψ2 over 1,000 bootstrap samples. The fitted functions and the corresponding 95% pointwise confidence intervals are presented in Figure 4. The estimated curves clearly demonstrate a nonlinear pattern, and the estimate of ψ2 captures the nonlinear monotone effect of tumor size. These results support the findings of previous studies that increased mortality is associated with larger tumor size. The estimated regression coefficients and the corresponding standard errors based on the proposed variance estimation method and the standard bootstrap method are summarized in Table 3. Both methods reveal that radiotherapy significantly improves survival, with p-values of 0.0001 and 0.0002, respectively, after adjusting for demographics and cancer stage and site.
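As a rough illustration of the spline setup described above — a quadratic B-spline basis with interior knots at quantiles and a zero-at-minimum normalization — the following sketch uses simulated stand-in data; it is not the authors' implementation, and the Cox partial-likelihood fitting step is omitted:

```python
import numpy as np
from scipy.interpolate import BSpline

def quadratic_bspline_basis(x, n_interior=4):
    """Quadratic (degree k = 2) B-spline basis with interior knots placed at
    quantiles of x, mirroring the quantile knot rule described in the text."""
    k = 2
    probs = np.linspace(0, 1, n_interior + 2)[1:-1]
    interior = np.quantile(x, probs)
    # Clamped knot vector: boundary knots repeated k + 1 times.
    t = np.r_[[x.min()] * (k + 1), interior, [x.max()] * (k + 1)]
    n_basis = len(t) - k - 1
    # Column j is the j-th basis function evaluated at x (unit-coefficient trick).
    return np.column_stack(
        [BSpline(t, np.eye(n_basis)[j], k)(x) for j in range(n_basis)]
    )

rng = np.random.default_rng(1)
w1 = rng.uniform(18, 80, size=492)   # hypothetical stand-in for age at diagnosis
B = quadratic_bspline_basis(w1)
# Zero-at-minimum normalization for identifiability, as in the paper:
B0 = B - B[np.argmin(w1)]
print(B.shape)                       # (492, 7)
```

The columns of `B0` would then enter the Cox partial likelihood as additional covariates alongside X.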

6 Summary and future work

In this study we proposed a practical and easy-to-implement spline method for the partially linear additive Cox model under a monotonicity constraint. The spline estimators of the nonparametric components achieve the optimal rate of convergence, and the estimators of the regression coefficients are asymptotically normal and efficient. A simple and direct variance estimation method yields more reliable estimates of the standard errors of the estimated regression coefficients than the standard bootstrap procedure, and the monotone spline estimation method yields smaller mean squared errors for the estimated monotone curve than the unconstrained spline approach. In summary, the monotone spline approach exhibits desirable theoretical properties and finite-sample performance.

It should be straightforward to adapt the hybrid algorithm presented here to other semiparametric regression models with monotonicity constraints, for instance, generalized partially linear additive models and partially linear additive Cox models with current status data or case 2 interval-censored data. In the sarcoma study, we determined the monotonicity of the estimated curve for tumor size based on clinical significance; that is, clinical studies support that tumor size is an important prognostic factor for mortality. One may also apply existing spline plotting techniques to determine whether a covariate has a monotone relationship with risk. Although it is beyond the focus of the current study, it would be desirable to develop goodness-of-fit tests and diagnostic statistics for semiparametric additive models with monotonicity constraints. Finally, other smoothing techniques, such as a penalized estimation approach, can be considered to address the potential issue of overfitting for polynomial splines.

Acknowledgements

The authors are grateful to the Editor, the Associate Editor, and two referees for their constructive comments, which led to significant improvement of this manuscript. The project described was supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), through grant #UL1 TR001860 (C.S. Li), and the MIND Institute Intellectual and Developmental Disabilities Research Center (U54 HD079125) (C.S. Li).
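The isotonic-regression step of the hybrid algorithm can be sketched generically with the pool-adjacent-violators algorithm (PAVA) of Robertson et al. (1988). This is a minimal, self-contained illustration of the projection onto monotone sequences, not the authors' implementation:

```python
import numpy as np

def pava(y, w=None):
    """Pool-adjacent-violators step: weighted least-squares projection of y
    onto the set of nondecreasing sequences (Robertson et al., 1988)."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # Each block stores (weighted mean, total weight, number of points pooled).
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.concatenate([np.full(n, m) for m, _, n in blocks])

print(pava([3.0, 1.0, 2.0]))   # [2. 2. 2.]
```

In a hybrid scheme of the kind described here, an unconstrained Newton–Raphson update of the monotone-spline coefficients would be followed by a weighted PAVA projection of this type to restore the ordering constraint.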

A Appendix

A.1 Notations and assumptions

For a measurable function f, write Gn f = √n(Pn − P)f for the empirical process Gn evaluated at f and ‖Gn‖_F = sup_{f∈F} |Gn f| for any measurable class of functions F. Define L2(P) = {f : ∫ f^2 dP < ∞}. An ε-bracket [l, u] in L2(P) is defined as the set of all functions f with l ≤ f ≤ u and P(u − l)^2 < ε^2. The bracketing number N_[](ε, F, L2(P)) is the minimum number of ε-brackets needed to cover F. The bracketing integral is defined as J_[](η, F, L2(P)) = ∫_0^η √(log N_[](ε, F, L2(P))) dε. In the following, C represents a positive constant that may vary from place to place.

By Corollary 6.21 of Schumaker (1981) and Lemma A1 of Lu et al. (2007), there exist ψ0,1n ∈ Sn and ψ0,2n ∈ Mn of order l ≥ p + 2 such that ‖ψ0,n − ψ0‖_∞ = O(n^{−rν}), for 1/(2p + 2) < ν < 1/(2p), where ψ0,n(W) = ψ0,1n(W1) + ψ0,2n(W2) and ψ0(W) = ψ0,1(W1) + ψ0,2(W2). Denote τ0,n = (β0^⊤, ψ0,1n, ψ0,2n)^⊤ and gn(Z) = X^⊤β0 + ψ0,n(W). Let mn(τ; v) = g(Z) − log S0,n(τ; t) and m0(τ; v) = g(Z) − log S0(τ; t). Define Mn(τ) = PΔn mn(τ; v) and

M(τ) = PΔ m0(τ; v). The (partial) score function based on the partial log-likelihood for β is l̇_{nβ}(τ) = PΔn m1,n(τ; v), and the (partial) score function based on the partial log-likelihood for ψ in a direction h = h1 + h2, for h1 ∈ H1 and h2 ∈ H2, is l̇_{nψ}(τ)[h] = PΔn m2,n(τ; v)[h].

The following conditions on the locations of the knots, the smoothness of ψ0,1 and ψ0,2, and the distributions of the covariates are needed to derive the asymptotic properties of the spline estimators.

(B1) The maximum spacing of the knots is assumed to be O(n^{−ν}), 0 < ν < 1/2. Moreover, the ratio of the maximum and minimum knot spacings is uniformly bounded.

(B2) (i) The true functions ψ0,1 and ψ0,2 have bounded pth derivatives on [c1, d1] and [c2, d2], respectively, with p ≥ 1, and ψ0,2 is strictly increasing on [c2, d2]; (ii) the true parameter β0 is in the interior of Θ.

(B3) The covariate vector X has bounded support.

(B4) (i) The covariate vector X and the regression functions ψr, r = 1, 2, are suitably centered; that is, E[ΔX] = 0 and E[Δψr(Wr)] = 0; (ii) for any β ≠ β0, Pr(X^⊤β ≠ X^⊤β0) > 0.

(B5) The failure time T^u and the censoring time T^c are conditionally independent given the covariate vector Z.

(B6) Only observations for which the event time T = min(T^u, T^c) lies in the finite interval [0, κ] are used in the partial log-likelihood. The baseline cumulative hazard function satisfies Λ(t) = ∫_0^t λ(u) du < ∞ on [0, κ].

(B7) There exists a positive ε such that P (∆ = 1|Z) > ε and P (T c > κ|Z) > ε.

(B8) There exist 0 < C1 < C2 such that the joint density f (t, ∆ = 1, Z) of (T, ∆ = 1, Z) is bounded between C1 and C2 . (B9) The qth derivatives of joint density f (t, ∆ = 1, Z) of (T, ∆ = 1, Z) with respect to t or Z are bounded for q ≥ 1.

Remark 6. Condition B1 is needed to derive the consistency and the rate of convergence of τ̂. The smoothness assumption on ψ0,1 and ψ0,2 in condition B2 suffices for the spline approximation. The requirement that β0 not lie on the boundary of the parameter space is a standard assumption in semiparametric estimation. Condition B3 is needed for the entropy calculations in the proofs of Theorems 1–3; this assumption is often used in the asymptotic analysis of semiparametric regression. Condition B4 is required to establish the identifiability of the model. Condition B5 ensures a non-informative censoring mechanism for right-censored data. Condition B6 implies that the partial log-likelihood function and the score function are bounded away from infinity on the support of the observed time. Condition B7 indicates that the probability of being uncensored is bounded away from 0, so the censoring rate is not too high. Moreover, conditions B6 and B7 ensure that the information bound is well-defined. Condition B8 is used in the calculation of the information bound and in the derivation of the rate of convergence of τ̂. Finally, condition B9 ensures that the partial score function in the least favorable direction is asymptotically negligible.

A.2 Technical lemmas

Lemma 1. For any η > 0, let L = {g(Z) − log S0(τ; t) : ψ1 ∈ Sn, ψ2 ∈ Mn, ‖τ − τ0‖_2 ≤ η}. Then, for any 0 < ε ≤ η,

log N_[](ε, L, L2(P)) ≤ C rn log(η/ε) + C log(κ/ε),

where rn = max(qn, pn).

Lemma 2. If conditions B1–B9 hold, then there exist 0 < C1 < C2 such that

C1 ‖τ − τ0‖_2^2 ≤ M(τ0) − M(τ) ≤ C2 ‖τ − τ0‖_2^2,

for τ in a neighborhood of τ0.

Lemma 3. Let ℑn = {τ : ψ1 ∈ Sn, ψ2 ∈ Mn, β ∈ Θ, η/2 ≤ ‖τ − τ0‖_2 ≤ η}. Then

(a) sup_{t∈[0,κ], τ∈ℑn} |S0,n(τ; t)/S0,n(τ0; t) − S0(τ; t)/S0(τ0; t)| = n^{−1/2} η (rn^{1/2} + log^{1/2}(η^{−1})).

(b) Assume hn = hn1 + hn2 is a sequence of uniformly bounded functions with ‖hn1‖_∞ = O(qn^{−1}) and ‖hn2‖_∞ = O(pn^{−1}). Then

sup_{t∈[0,κ], τ∈ℑn} |S1,n(τ; t)/S0,n(τ; t) − S1(τ; t)/S0(τ; t)| = o_p(n^{−1/2})

and

sup_{t∈[0,κ], τ∈ℑn} |S2,n(τ; t)[hn]/S0,n(τ; t) − S2(τ; t)[hn]/S0(τ; t)| = o_p(n^{−1/2}).

Lemma 4. If conditions B1–B9 hold, then

PΔn[m1,n(τ0; v) − m2,n(τ0; v)[h*]] = Pn ℓ*_β(τ0) + o_p(n^{−1/2}).

Lemma 5. Assume that

(A1) PΔn m1,n(τ̂; v) = o_p(n^{−1/2}) and PΔn m2,n(τ̂; v)[h*] = o_p(n^{−1/2}),

(A2) PΔn[m1,n(τ̂; v) − m1,n(τ0; v)] − PΔ[m1(τ̂; v) − m1(τ0; v)] = o_p(n^{−1/2}) and PΔn[m2,n(τ̂; v)[h*] − m2,n(τ0; v)[h*]] − PΔ[m2(τ̂; v)[h*] − m2(τ0; v)[h*]] = o_p(n^{−1/2}),

(A3) PΔ{[m1(τ̂; v) − m2(τ̂; v)[h*]] − [m1(τ0; v) − m2(τ0; v)[h*]]} = −I(β0)(β̂ − β0) + o_p(‖β̂ − β0‖) + o_p(n^{−1/2})

hold and that I(β0) is nonsingular. Then

n^{1/2}(β̂ − β0) = n^{1/2} I^{−1}(β0) Pn ℓ*_β(τ0) + o_p(1) → N(0, I^{−1}(β0)),

in distribution, as n → ∞.

Lemma 6. Recall that h* is the least favorable direction and ĥ* is the minimizer of ρn(h) = PΔn ‖m1,n(τ̂; v) − m2,n(τ̂; v)[h]‖_2^2. If conditions B1–B9 hold, then ĥ* is consistent for h*; that is,

‖ĥ* − h*‖_2 → 0,

in probability, as n → ∞.

Lemma 7. If conditions B1–B9 hold, then ‖τ̂ − τ0‖_2 = o_p(1).

Remark 7. Lemma 1 is used to derive the consistency of τ̂; similar entropy calculations are used in the proofs of Theorems 1–3 as well. Lemma 2 is a key result for deriving the consistency and the rate of convergence of τ̂. Lemma 3 shows that m1(τ; v) and m1,n(τ; v), and m2(τ; v)[h] and m2,n(τ; v)[h], are asymptotically equivalent at rate n^{−1/2} for τ in a neighborhood of τ0 and h in a neighborhood of h*. Lemma 4 shows that the efficient score functions based on the partial log-likelihood and the log-likelihood are asymptotically equivalent at rate n^{−1/2} under the empirical measure; it plays a key role in deriving the asymptotic normality of β̂ and the consistency of Î_n(β̂). Lemma 5, which accommodates the partial likelihood estimation, is parallel to the theorem for the asymptotic normality of semiparametric M-estimators developed by Zhang et al. (2010). Lemma 6 is a key result for deriving the consistency of the least-squares-based variance estimator. Finally, Lemma 7 shows that τ̂ is consistent for τ0.

A.3 Proof of Lemma 1

According to the bracketing calculations in Shen and Wong (1994) and Lemma A.1 of Lu et al. (2009), for any η > 0 and 0 < ε ≤ η, the logarithms of the bracketing numbers of Sn and Mn, with respect to the L2(P)-norm, are bounded by C qn log(η/ε) and C pn log(η/ε), respectively. It is known that the neighborhood B(η) = {β : ‖β − β0‖ ≤ η} can be covered by O((η/ε)^d) balls of radius ε. In view of Theorem 9.23 of Kosorok (2008), the bracketing number of {X^⊤β : ‖β − β0‖ ≤ η} is bounded by O((η/ε)^d). It follows that, for sufficiently large n,

log N_[](ε, {g(Z) : ψ1 ∈ Sn, ψ2 ∈ Mn, ‖τ − τ0‖_2 ≤ η}, L2(P)) ≤ C rn log(η/ε)

and, hence,

log N_[](ε, {exp(g(Z)) : ψ1 ∈ Sn, ψ2 ∈ Mn, ‖τ − τ0‖_2 ≤ η}, L2(P)) ≤ C rn log(η/ε),

because the function x ↦ exp(x) is Lipschitz and monotone on a bounded set. Moreover, the bracketing entropy of the class of indicator functions 1[s ≤ t ≤ κ], s ∈ [0, κ], is bounded by C log(κ/ε). Hence, the logarithm of the bracketing number of the class 1[s ≤ t ≤ κ] exp(g(Z)) with ψ1 ∈ Sn, ψ2 ∈ Mn, and ‖τ − τ0‖_2 ≤ η is bounded by C rn log(η/ε) + C log(κ/ε). By a further entropy calculation and the fact that x ↦ log(x) is Lipschitz and monotone on a set bounded away from 0, it follows that

log N_[](ε, {log S0(τ; t) : ψ1 ∈ Sn, ψ2 ∈ Mn, ‖τ − τ0‖_2 ≤ η}, L2(P)) ≤ C rn log(η/ε) + C log(κ/ε),

and Lemma 1 follows.

A.4 Proof of Lemma 2

Applying the same arguments as those in the proof of Lemma A.5 of Huang (1999), we can show that there exist 0 < C1 < C2 such that C1 kg − g0 k22 ≤ M(τ0 ) − M(τ ) ≤ C2 kg − g0 k22 .

It follows that M(τ0) − M(τ) ≤ C‖τ − τ0‖_2^2. Denote g1(X) = X^⊤(β − β0) and g2(W) = ψ(W) − ψ0(W). According to the law of total expectation and the Cauchy–Schwarz inequality,

{E[g1(X)g2(W)]}^2 ≤ E_W[g2^2(W)] E_W[{E_{X|W}[g1(X)|W]}^2].

By the orthogonality of the conditional expectation, there exists 0 < ξ < 1 such that E_W[{E_{X|W}[g1(X)|W]}^2] = ξ E_X[g1^2(X)]. Hence, {E[g1(X)g2(W)]}^2 ≤ ξ E[g1^2(X)] E[g2^2(W)]. In view of Lemma 25.86 of van der Vaart (2000),

M(τ0) − M(τ) ≥ ‖ψ(W) − ψ0(W)‖_2^2 + ‖X^⊤β − X^⊤β0‖^2,

up to a constant. By Lemma 1 of Stone (1985), it follows that M(τ0) − M(τ) ≥ C‖τ − τ0‖_2^2.

A.5 Proof of Lemma 3

The proof of Lemma 3 is similar to that of Lemma A.3 in Huang (1999).

A.6 Proof of Lemma 4

Lemma 4 follows from the same arguments as those in the proof of Theorem 3.3 in Huang (1999).

A.7 Proof of Lemma 5

According to conditions A2 and A3,

PΔn{[m1,n(τ̂; v) − m2,n(τ̂; v)[h*]] − [m1,n(τ0; v) − m2,n(τ0; v)[h*]]} = −I(β0)(β̂ − β0) + o_p(n^{−1/2}).

Then condition A1 yields

PΔn[m1,n(τ0; v) − m2,n(τ0; v)[h*]] = I(β0)(β̂ − β0) + o_p(n^{−1/2}).

By Lemma 4, PΔn[m1,n(τ0; v) − m2,n(τ0; v)[h*]] can be written as Pn ℓ*_β(τ0) + o_p(n^{−1/2}). It follows that

n^{1/2}(β̂ − β0) = n^{1/2} I^{−1}(β0) Pn ℓ*_β(τ0) + o_p(1) → N(0, I^{−1}(β0)),

in distribution, as n → ∞.

A.8 Proof of Lemma 6

Denote ρ_l(τ, h) = [m1^(l)(τ; v) − m2(τ; v)[h]]^2 and ρ_{n,l}(τ, h) = [m1,n^(l)(τ; v) − m2,n(τ; v)[h]]^2, l = 1, . . . , d, where m1^(l)(τ; v) and m1,n^(l)(τ; v) are the lth elements of m1(τ; v) and m1,n(τ; v), respectively. According to Jackson's theorem for polynomials (de Boor, 2001), there exists h*_{n,l} = h*_{n1,l} + h*_{n2,l}, h*_{n1,l}, h*_{n2,l} ∈ Sn, of order l ≥ 2 such that ‖h*_l − h*_{n,l}‖_∞ = O(n^{−rν}) for 1/(2p + 2) < ν < 1/(2p). Because ĥ*_l minimizes ρ_{n,l}(τ̂, h) over all h = h1 + h2, h1, h2 ∈ Sn, we have PΔn ρ_{n,l}(τ̂, ĥ*_l) ≤ PΔn ρ_{n,l}(τ̂, h*_{n,l}). By Lemma 3, PΔn[ρ_{n,l}(τ̂, h*_{n,l}) − ρ_{n,l}(τ̂, h*_l)] = PΔn[ρ_l(τ̂, h*_{n,l}) − ρ_l(τ̂, h*_l)] + o_p(1). It follows that

PΔn[ρ_{n,l}(τ̂, ĥ*_l) − ρ_{n,l}(τ̂, h*_l)] ≤ (PΔn − PΔ)[ρ_l(τ̂, h*_{n,l}) − ρ_l(τ̂, h*_l)] + PΔ[ρ_l(τ̂, h*_{n,l}) − ρ_l(τ̂, h*_l)] + o_p(1).

By a standard entropy calculation, it is easy to show that, for any η > 0, the class of functions ρ_l(τ, h) − ρ_l(τ, h*_l) with h = h1 + h2, h1, h2 ∈ Sn, ‖τ − τ0‖_2 ≤ η, and ‖h − h*_l‖_2 ≤ η is P-Glivenko–Cantelli. This implies that (PΔn − PΔ)[ρ_l(τ̂, h*_{n,l}) − ρ_l(τ̂, h*_l)] = o_p(1). Moreover, the continuous-mapping theorem and the dominated convergence theorem, together with the consistency of τ̂, yield PΔ[ρ_l(τ̂, h*_{n,l}) − ρ_l(τ̂, h*_l)] = o_p(1). It follows that PΔn[ρ_{n,l}(τ̂, ĥ*_l) − ρ_{n,l}(τ̂, h*_l)] ≤ o_p(1). In view of the continuous-mapping theorem and the dominated convergence theorem as well as Lemma 3, (PΔn − PΔ)[ρ_{n,l}(τ̂, h*_l) − ρ_l(τ̂, h*_l)] = o_p(1). Furthermore, the Glivenko–Cantelli theorem yields (PΔn − PΔ)ρ_l(τ̂, h*_l) = o_p(1). Hence, (PΔn − PΔ)ρ_{n,l}(τ̂, h*_l) = o_p(1). We can similarly show that (PΔn − PΔ)ρ_{n,l}(τ̂, ĥ*_l) = o_p(1). It follows that

PΔn ρ_{n,l}(τ̂, ĥ*_l) ≤ (PΔn − PΔ)ρ_{n,l}(τ̂, h*_l) + PΔ ρ_{n,l}(τ̂, h*_l) + o_p(1) = PΔ ρ_{n,l}(τ̂, h*_l) + o_p(1).   (8)

The continuous-mapping theorem and the dominated convergence theorem, together with the consistency of τ̂, yield PΔ[ρ_{n,l}(τ̂, ĥ*_l) − ρ_l(τ0, ĥ*_l)] → 0 and PΔ[ρ_{n,l}(τ̂, h*_l) − ρ_l(τ0, h*_l)] → 0. It follows that

0 ≤ PΔ[ρ_l(τ0, ĥ*_l) − ρ_l(τ0, h*_l)] = o_p(1) + PΔ[ρ_{n,l}(τ̂, ĥ*_l) − ρ_{n,l}(τ̂, h*_l)] ≤ o_p(1) − (PΔn − PΔ)ρ_{n,l}(τ̂, ĥ*_l).

The last inequality holds due to (8). Therefore, PΔ[ρ_l(τ0, ĥ*_l) − ρ_l(τ0, h*_l)] = o_p(1). By the uniqueness of h*_l, the event ‖ĥ*_l − h*_l‖_2 ≥ ε is a subset of the event PΔ[ρ_l(τ0, ĥ*_l) − ρ_l(τ0, h*_l)] > 0, and the probability of the latter approaches 0 as n → ∞. This implies that ‖ĥ*_l − h*_l‖_2 = o_p(1).

A.9 Proof of Lemma 7 (Consistency)

We apply Theorem 5.7 of van der Vaart (2000) to prove the consistency of τ̂. The term Mn(τ) − M(τ) can be written as

(PΔn − PΔ)[g(Z) − log S0(τ; t)] − PΔn[log S0,n(τ; t) − log S0(τ; t)].

For any τ, Lemma 1 implies that L is P-Glivenko–Cantelli and, hence,

sup_{τ∈Θ×Ψ} |(PΔn − PΔ)(g(Z) − log S0(τ; t))| = o(1).

Furthermore, for any τ, in view of the mean-value theorem,

|PΔn[log S0,n(τ; t) − log S0(τ; t)]| ≤ C sup_t |S0,n(τ; t) − S0(τ; t)| = o(1),

where the last equality holds by the uniform law of large numbers. Hence,

sup_{τ∈Θ×Ψ} |PΔn[log S0,n(τ; t) − log S0(τ; t)]| = o(1).

The triangle inequality then yields

sup_{τ∈Θ×Ψ} |Mn(τ) − M(τ)| = o(1).

The first condition of Theorem 5.7 of van der Vaart (2000) holds. Lemma 2 implies

sup_{‖τ−τ0‖_2 ≥ ε} M(τ) ≤ M(τ0) − Cε^2 < M(τ0).

The second condition of the theorem follows. Recall that τ0,n = (β0^⊤, ψ0,1n, ψ0,2n)^⊤ and denote gn = X^⊤β0 + ψ0,n(W). Observe that Mn(τ̂) − Mn(τ0) ≥ Mn(τ0,n) − Mn(τ0) = In1 + In2 + In3, where

In1 = PΔn[−log(S0,n(τ0,n; t)/S0,n(τ0; t)) + log(S0(τ0,n; t)/S0(τ0; t))],
In2 = (PΔn − PΔ)[gn − g0 − log(S0(τ0,n; t)/S0(τ0; t))],
In3 = PΔ[gn − g0 − log(S0(τ0,n; t)/S0(τ0; t))].

In view of the mean-value theorem and Lemma 3, In1 = o_p(n^{−2rν}). Let g*_0 = X^⊤β0 + ψ(W). Lemma 1 implies that, for any η > 0, the class of functions g*_0 − g0 − log[S0(β0, ψ; t)/S0(β0, ψ0; t)] with ψ1 ∈ Sn, ψ2 ∈ Mn, and ‖ψ − ψ0‖_2 ≤ η is P-Donsker. Moreover, by a Taylor expansion, for 0 < ε < 1/2 − ν,

P[gn − g0 − log(S0(τ0,n; t)/S0(τ0; t))]^2 / n^{−2rν+2ε} ≤ C‖ψ0,n − ψ0‖_∞^2 / n^{−2rν+2ε} → 0,

as n → ∞. According to Lemma 19.24 of van der Vaart (2000), In2 = o(n^{−1/2} n^{−rν+ε}) = o(n^{−2rν}). Finally, in view of Lemma 2, In3 = M(τ0,n) − M(τ0) ≥ −C‖ψ0,n − ψ0‖_∞^2 = −O(n^{−2rν}). We conclude that

Mn(τ̂) − Mn(τ0) ≥ o_p(n^{−2rν}) − O_p(n^{−2rν}) = −o_p(1).

Theorem 5.7 of van der Vaart (2000) then applies and yields ‖τ̂ − τ0‖_2 = o_p(1).

A.10 Proof of Theorem 1 (rate of convergence)

We apply Theorem 3.4.1 of van der Vaart and Wellner (1996) to derive the rate of convergence of τ̂. Let ℑn = {τ : ψ1 ∈ Sn, ψ2 ∈ Mn, β ∈ Θ, ‖τ − τ0,n‖_2 ≤ η}. Observe that Mn(τ) − Mn(τ0,n) − [M(τ) − M(τ0,n)] can be written as

(PΔn − PΔ)[g − gn − log(S0(τ; t)/S0(τ0,n; t))] − PΔn[log(S0,n(τ; t)/S0,n(τ0,n; t)) − log(S0(τ; t)/S0(τ0,n; t))] = In4 − In5.

For the first term In4, it can be shown that the bracketing integral satisfies

J_[](η, {g − gn − log(S0(τ; t)/S0(τ0,n; t)) : ψ1 ∈ Sn, ψ2 ∈ Mn, ‖τ − τ0,n‖_2 ≤ η}, L2(P)) ≤ Cη rn^{1/2}.

Furthermore, a Taylor expansion yields

E[g − gn − log(S0(τ; t)/S0(τ0,n; t))]^2 ≤ C‖τ − τ0,n‖_2^2 ≤ Cη^2.

Therefore, in view of the maximal inequality in Lemma 3.4.2 of van der Vaart and Wellner (1996),

E sup_{τ∈ℑn} |In4| ≤ η n^{−1/2} O(rn^{1/2}).

For the second term In5, a Taylor expansion and Lemma 3 yield

E sup_{τ∈ℑn} |In5| ≤ C sup_{t∈[0,κ], τ∈ℑn} |S0,n(τ; t)/S0,n(τ0,n; t) − S0(τ; t)/S0(τ0,n; t)| = Cη n^{−1/2}(rn^{1/2} + log^{1/2}(η^{−1})),

with probability arbitrarily close to 1. Choosing the distance dn defined in Theorem 3.4.1 of van der Vaart and Wellner (1996) to be dn^2(τ̂, τ0,n) = M(τ0,n) − M(τ̂), we have

un^2 (un^{−1} rn^{1/2} + un^{−1} log^{1/2} un) = O(n^{1/2}),

which implies un = rn^{−1/2} n^{1/2} = n^{(1−ν)/2}. Therefore, by Theorem 3.4.1 of van der Vaart and Wellner (1996), dn^2(τ̂, τ0,n) = M(τ0,n) − M(τ̂) = O_p(un^{−2}). Lemma 2 yields M(τ0) − M(τ0,n) ≤ C‖ψ0,n − ψ0‖_∞^2 = O(n^{−2rν}) and M(τ0) − M(τ̂) ≥ C‖τ̂ − τ0‖_2^2. Therefore,

C‖τ̂ − τ0‖_2^2 ≤ M(τ0) − M(τ0,n) + M(τ0,n) − M(τ̂) = O(n^{−2rν}) + O_p(un^{−2}).

Hence, ‖τ̂ − τ0‖_2^2 = O_p(n^{−2 min(rν, (1−ν)/2)}). The uniform consistency of τ̂ follows immediately from Lemma 7 of Stone (1986). This completes the proof of the rate of convergence.
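For reference, if the approximation exponent r in n^{−rν} equals the smoothness p of condition B2 (an assumption consistent with the spline approximation result quoted in A.1), the two exponents in this rate balance at

```latex
p\nu \;=\; \frac{1-\nu}{2}
\quad\Longrightarrow\quad
\nu \;=\; \frac{1}{2p+1},
\qquad
\|\hat{\tau}-\tau_0\|_2 \;=\; O_p\!\left(n^{-p/(2p+1)}\right),
```

which is the familiar optimal rate of convergence for a p-smooth nonparametric component; note that 1/(2p + 2) < 1/(2p + 1) < 1/(2p), so this choice of ν is admissible.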

A.11 Proof of Theorem 2 (asymptotic normality)

The conditions A1–A3 of Lemma 5 can be verified by following virtually identical steps to those in Lemmas 5.2–5.4 of Huang (1999). Hence, Lemma 5 applies and yields the asymptotic ˆ normality of β.

A.12 Proof of Theorem 3

By Lemma 6 and the rate of convergence of τ̂, a Taylor expansion and the Glivenko–Cantelli theorem yield

PΔn[m1,n(τ̂; v) − m2,n(τ̂; v)[ĥ*]] = PΔn[m1,n(τ0; v) − m2,n(τ0; v)[h*]] + o_p(1).

In view of Lemma 4 and the continuous-mapping theorem, Î_n = PΔn[ℓ*_β(τ0)^{⊗2}] + o_p(1). An application of the law of large numbers concludes that Î_n = PΔ[ℓ*_β(τ0)^{⊗2}] + o_p(1).
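The least-squares construction behind Î_n can be sketched numerically as follows: project each component of the β-score onto the span of the spline-direction scores by least squares, then average the outer products of the residuals (the estimated efficient scores). The per-subject score arrays below are hypothetical stand-ins for m1,n and m2,n[h]:

```python
import numpy as np

def ls_information(score_beta, score_nuis):
    """Least-squares variance estimation sketch. score_beta is n x d
    (per-subject beta-scores), score_nuis is n x q (scores in the spline
    directions). Regress each beta-score component on the nuisance-direction
    scores; the averaged outer product of the residuals estimates I(beta_0)."""
    gamma, *_ = np.linalg.lstsq(score_nuis, score_beta, rcond=None)
    resid = score_beta - score_nuis @ gamma        # estimated efficient scores
    return resid.T @ resid / score_beta.shape[0]   # P_n of residual outer products

rng = np.random.default_rng(2)
n, d, q = 400, 2, 6
U = rng.normal(size=(n, q))                        # stand-in nuisance-direction scores
A = rng.normal(size=(n, d)) + U @ rng.normal(size=(q, d))  # correlated beta-scores
I_hat = ls_information(A, U)
se = np.sqrt(np.diag(np.linalg.inv(I_hat)) / n)    # standard errors for beta_hat
print(I_hat.shape, se.shape)
```

This mirrors the role of ĥ* in Lemma 6: the least-squares coefficients play the part of the estimated least favorable direction, and the inverse of Î_n divided by n gives the variance estimates reported in Tables 2 and 3.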

References

[1] Cai, J., Fan, J., Jiang, J., & Zhou, H. (2008). Partially linear hazard regression with varying coefficients for multivariate survival data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 141–158.
[2] de Boor, C. (2001). A Practical Guide to Splines, Springer-Verlag, New York.
[3] Du, P., Ma, S., & Liang, H. (2010). Penalized variable selection procedure for Cox models with semiparametric relative risk. Annals of Statistics, 38, 2092–2117.
[4] Fan, J., Gijbels, I., & King, M. (1997). Local likelihood and local partial likelihood in hazard regression. Annals of Statistics, 25, 1661–1690.
[5] Gray, R.J. (1992). Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. Journal of the American Statistical Association, 87, 942–951.
[6] Heller, G. (2001). The Cox proportional hazards model with a partly linear relative risk function. Lifetime Data Analysis, 7, 255–277.
[7] Huang, J. (1999). Efficient estimation of the partly linear additive Cox model. Annals of Statistics, 27, 1536–1563.
[8] Huang, J.Z. & Liu, L. (2006). Polynomial spline estimation and inference of proportional hazards regression models with flexible relative risk form. Biometrics, 62, 793–802.
[9] Kosorok, M.R. (2008). Introduction to Empirical Processes and Semiparametric Inference, Springer, Dordrecht.
[10] Lu, M. (2015). Spline estimation of generalised monotonic regression. Journal of Nonparametric Statistics, 27, 19–39.
[11] Lu, M. & Loomis, D. (2013). Spline-based semiparametric estimation of partially linear Poisson regression with single-index models. Journal of Nonparametric Statistics, 25, 905–922.
[12] Lu, M., Zhang, Y., & Huang, J. (2007). Estimation of the mean function with panel count data using monotone polynomial splines. Biometrika, 94, 705–718.
[13] Lu, M., Zhang, Y., & Huang, J. (2009). Semiparametric estimation methods for panel count data using monotone B-splines. Journal of the American Statistical Association, 104, 1060–1070.
[14] O'Sullivan, F. (1993). Nonparametric estimation in the Cox model. Annals of Statistics, 21, 124–145.
[15] Robertson, T., Wright, F.T., & Dykstra, R.L. (1988). Order Restricted Statistical Inference, Wiley, New York.
[16] Rosenberg, P.S. (1995). Hazard function estimation using B-splines. Biometrics, 51, 874–887.
[17] Sasieni, P. (1992). Non-orthogonal projections and their application to calculating the information in a partly linear Cox model. Scandinavian Journal of Statistics, 19, 215–233.
[18] Schumaker, L. (1981). Spline Functions: Basic Theory, Wiley, New York.
[19] Shen, X. & Wong, W.H. (1994). Convergence rate of sieve estimates. Annals of Statistics, 22, 580–615.
[20] Sleeper, L.A. & Harrington, D.P. (1990). Regression splines in the Cox model with application to covariate effects in liver disease. Journal of the American Statistical Association, 85, 941–949.
[21] Stone, C.J. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13, 689–705.
[22] Stone, C.J. (1986). The dimensionality reduction principle for generalized additive models. Annals of Statistics, 14, 590–606.
[23] Sun, J., Kopciuk, K.A., & Lu, X. (2008). Polynomial spline estimation of partially linear single-index proportional hazards regression models. Computational Statistics & Data Analysis, 53, 176–188.
[24] Tang, X., Li, J., & Lian, H. (2013). Empirical likelihood for partially linear proportional hazards models with growing dimensions. Journal of Multivariate Analysis, 121, 22–32.
[25] van der Vaart, A.W. (2000). Asymptotic Statistics, Cambridge University Press, Cambridge.
[26] van der Vaart, A.W. & Wellner, J.A. (1996). Weak Convergence and Empirical Processes, Springer-Verlag, New York.
[27] Xue, L. & Liang, H. (2010). Polynomial spline estimation for a generalized additive coefficient model. Scandinavian Journal of Statistics, 37, 26–46.
[28] Zhang, Y., Hua, L., & Huang, J. (2010). A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scandinavian Journal of Statistics, 37, 338–354.

Table 1. Results of the knot selection study: number of interior knots (kn), bias, standard deviation (sd), average of the estimated standard errors (ase), standard deviation of the estimated standard errors (sdse), and coverage probability (cp) of the 95% confidence interval, from 1,000 replications with a censoring rate of 20.5% and n = 400.

                      β1                                    β2
kn    bias     sd     ase    sdse   cp (%)    bias     sd     ase    sdse   cp (%)
 4   -0.003   0.119  0.119  0.002   95.6      0.003   0.116  0.118  0.002   95.3
 6    0.012   0.123  0.120  0.003   94.4     -0.013   0.121  0.120  0.003   94.9
10    0.022   0.125  0.123  0.003   94.0     -0.027   0.128  0.122  0.003   93.9
12    0.022   0.125  0.124  0.003   94.1     -0.023   0.130  0.123  0.003   93.7

Table 2. Summary of parameter estimation and results of the variance study for the proposed variance estimation method and the standard bootstrap method: average of the estimated standard errors (ase), standard deviation of the estimated standard errors (sdse), and estimated coverage probability (cp) of the 95% confidence interval for β, based on 1,000 Monte Carlo replications with a censoring rate of 20.5% and n = 200 or 400. The number of bootstrap samples is 100.

                                            proposed                  bootstrap
sample size  parameter   bias     sd     ase    sdse   cp (%)    ase    sdse   cp (%)
n = 200      β1          0.020   0.178  0.176  0.006   94.1     0.196  0.018   97.0
             β2         -0.021   0.177  0.175  0.006   94.2     0.196  0.018   96.2
n = 400      β1          0.010   0.113  0.121  0.003   95.8     0.128  0.010   96.6
             β2         -0.018   0.117  0.120  0.003   95.3     0.127  0.010   96.4

Table 3. Inference on the synovial sarcoma data.

                                      proposed                                bootstrap
predictor        estimate    se     95% C.I.           p-value     se     95% C.I.           p-value
radiotherapy^a    -0.565    0.149  (-0.858, -0.273)    0.0001     0.155  (-0.870, -0.262)    0.0002
gender^b           0.482    0.143  (0.200, 0.764)      0.0008     0.153  (0.183, 0.781)      0.002
race
  white            0.00
  black            0.711    0.204  (0.309, 1.112)      0.0004     0.218  (0.283, 1.138)      0.001
  other           -0.343    0.318  (-0.968, 0.281)     0.27       0.409  (-1.145, 0.458)     0.40
SEER stage
  localized        0.00
  regional         0.556    0.151  (0.261, 0.851)      0.0002     0.160  (0.242, 0.870)      0.0005
  unstaged         1.665    0.381  (0.918, 2.413)     <0.0001     0.449  (0.786, 2.545)      0.0002
anatomic site
  extremity        0.00
  head and neck    0.624    0.277  (0.081, 1.167)      0.028      0.292  (0.052, 1.196)      0.032
  trunk            0.545    0.183  (0.186, 0.904)      0.002      0.217  (0.120, 0.971)      0.012
  other            0.842    0.241  (0.371, 1.312)      0.0004     0.249  (0.352, 1.331)      0.0007

a: yes vs. no; b: males vs. females.

Figure 1. Histograms and Q-Q plots of β̂1 and β̂2 for n = 400 from 1,000 replications with the censoring rate of 20.5%.

Figure 2. Comparison of the mean squared errors of the constrained and unconstrained spline estimators of the true function ψ2(W2) = exp(W2/2) − 1 for n = 200 and 400 from 1,000 replications with the censoring rate of 20.5%.

Figure 3. Curve estimates and the corresponding pointwise 95% confidence intervals for ψ1(W1) = 3 sin(W1) and ψ2(W2) = exp(W2/2) − 1. The solid curves are the true functions, the dashed curves are the mean B-spline fits, and the long-dashed curves are the corresponding 2.5% and 97.5% quantiles of the estimates from 1,000 Monte Carlo samples with the censoring rate of 20.5% for n = 200 or 400, respectively.

Figure 4. The estimated curves with pointwise 95% confidence intervals for the synovial sarcoma data with 4 knots. The solid curves are the estimated functions and the dashed curves are the corresponding 95% confidence limits based on the quantiles of the estimates from 1,000 bootstrap samples.