Computational Statistics & Data Analysis 40 (2002) 603–620
www.elsevier.com/locate/csda
The shared frailty model and the power for heterogeneity tests in multicenter trials

Luc Duchateau^{a,b,*}, Paul Janssen^c, Patrick Lindsey^c, Catherine Legrand^b, Rosemary Nguti^{c,d}, Richard Sylvester^b

a Faculty of Veterinary Medicine, Department of Physiology, Biochemistry and Biometrics, Ghent University, Salisburylaan 133, 9820 Merelbeke, Belgium
b European Organization for Research and Treatment of Cancer, Av. E. Mounier 83, Box 11, 1200 Brussels, Belgium
c Center for Statistics, Limburgs Universitair Centrum, Universitaire Campus, Diepenbeek, Belgium
d Department of Mathematics, University of Nairobi and Biometrics Unit, International Livestock Research Institute, Nairobi, Kenya
Abstract

Heterogeneity between centers in multicenter trials with time to event outcome can be modeled by the frailty proportional hazards model. The majority of the different approaches that have been used to fit frailty models assume either the gamma or the lognormal frailty density and are based on similar log likelihood expressions. These approaches are briefly reviewed and their specific features described; simulations further demonstrate that the different techniques lead to virtually the same estimates for the heterogeneity parameter. An important issue is the relationship between the size of a multicenter trial, in terms of number of centers and number of patients per center, and the bias and the spread of estimates of the heterogeneity parameter around its true value. Based on simulation results (restricted to constant hazard rate and the gamma frailty density), it becomes clear how the number of centers and the number of patients per center influence the quality of the estimates in the particular setting of breast cancer clinical trials. This insight is important in treatment outcome research, where one tries to relate differences with respect to clinical practice to differences in outcome in the various centers. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Frailty model; Treatment outcome; Multicenter trial; EM algorithm; Penalized likelihood; REML
* Corresponding author. Faculty of Veterinary Medicine, Department of Physiology, Biochemistry and Biometrics, Ghent University, Salisburylaan 133, 9820 Merelbeke, Belgium.
E-mail address: [email protected] (L. Duchateau).
1. Introduction

For large international multicenter trials, such as cancer clinical trials, AIDS trials and cardiovascular disease trials, outcomes are typically more heterogeneous than outcomes from other clinical trials. Heterogeneity decreases the power to detect clinically important treatment differences; on the other hand, more heterogeneous trials lead to more general conclusions, as they are based on a wider patient population. Additionally, the differences between centers can be studied to determine whether differences in clinical practice at the center level have an influence on the outcome. This research field is termed 'treatment outcome' and can lead to changes in standard patient care (Harding et al., 1993; Hillner et al., 2000). Treatment outcome research, however, has to be done on multicenter trials of adequate size in order to draw valid conclusions.

The main aim of this paper is to study the relationship between the size of a multicenter trial, in terms of number of centers and number of patients per center, and the bias and spread of the estimates of the heterogeneity parameter around its true value. As discussed in Chapter 9 of Therneau and Grambsch (2000), this important issue of the amount of information and the patients/center distribution needed to produce stable estimates of the heterogeneity parameter requires further consideration. Based on our results, a decision can be taken with regard to the minimum size of a multicenter trial needed to do treatment outcome research in the particular field of breast cancer. We believe that such an analysis is required before undertaking treatment outcome research in a particular field.

We have chosen to base this analysis on simulation because the number of asymptotic results for the heterogeneity parameter available in the context of the proportional hazards model with covariates (a treatment effect in our simulations) is rather limited. Relevant references for the asymptotic theory are Murphy (1994, 1995) and Parner (1998). Furthermore, it is not easy to obtain the standard error of heterogeneity parameter estimates. Although Klein (1992), see also Klein and Moeschberger (1997), proposes such an estimate based on the information matrix, this matrix has rank equal to the number of distinct event times plus the number of covariates plus one for the heterogeneity parameter. In the case of a multicenter trial, with typically a large number of distinct event times, this approach is not feasible because of the high dimensionality. Simulation, however, allows us to study the spread of the heterogeneity parameter. Additionally, the simulations enable us to study the likelihood ratio test and validate it in the context of the frailty model.

In order to model heterogeneity between centers for time to event outcomes, the frailty model is used (Klein, 1992; Therneau and Grambsch, 2000; Yamaguchi and Ohashi, 1999). The frailty model with proportional hazards assumption and unspecified baseline hazard is described in Section 2. A number of recent papers describe a variety of likelihood approaches to fit the frailty model. Taking the observable likelihood (Section 3) as cornerstone, we explain (in Sections 3 and 4) the usefulness of partial likelihood and penalized partial likelihood to fit frailty models. The main purpose of this review is to explore the common ground between the available methods. For reasons of implementation we also add the corresponding algorithms in the appendix (the S-Plus code is available from the authors). The effect of the size of the multicenter trial on the spread of estimates of the heterogeneity parameter around its true value is studied by simulation in Section 5. In Section 6, a cancer clinical trial example is discussed in some detail.
2. The shared frailty model

Assume we have a total of n individuals that come from G different centers, center i having n_i individuals (n = Σ_{i=1}^G n_i). For each patient a time to event/censoring and a censoring indicator are observed, given by y_ij = (t_ij, δ_ij) for the jth patient from center i, with i = 1, …, G and j = 1, …, n_i. Associated risk variables for the jth patient from center i are collected in the (p × 1) vector x_ij. The number of events in center i is given by D_i = Σ_{j=1}^{n_i} δ_ij. The matrix y contains all the y_ij's. The frailty model is given by
$$
h_{ij}(t) = h_0(t)\exp(\eta_{ij}) = h_0(t)\exp(x_{ij}^t\beta + w_i),
$$
where h_ij(t) is the hazard rate at time t for individual j from center i, h_0(t) is the unspecified baseline hazard at time t, β is the fixed effects vector, x_ij is the incidence vector for the fixed effects and finally w_i is the random effect for center i. The w_i's, i = 1, …, G, are a sample (independent and identically distributed) from a density f_W(·). With η = (η_11, …, η_{1n_1}, …, η_{G1}, …, η_{Gn_G})^t, an (n × 1) vector, we have
$$
\eta = X\beta + Zw,
$$
where X and Z are the (n × p) and (n × G) incidence matrices for the fixed effects β and the random effects w, β is a (p × 1) and w a (G × 1) vector. The frailty model can be rewritten as follows:
$$
h_{ij}(t) = h_0(t)\exp(w_i)\exp(x_{ij}^t\beta) = h_0(t)\,u_i\exp(x_{ij}^t\beta).
$$
Now u_i = exp(w_i) is termed the frailty for the ith center. The independent u_i's, i = 1, …, G, have common density f_U(·). Furthermore, distributional assumptions have to be made for the frailties. Two classical choices for the density of the frailties are:

(a) The zero-mean normal density for W; then the density of U is lognormal, i.e.,
$$
f_U(u) = \frac{1}{u\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(\log u)^2}{2\sigma^2}\right). \qquad (1)
$$

(b) The one-parameter gamma density for U, i.e.,
$$
f_U(u) = \frac{u^{(1/\theta)-1}\exp(-u/\theta)}{\theta^{1/\theta}\,\Gamma(1/\theta)}. \qquad (2)
$$
Then the corresponding density for W is
$$
f_W(w) = \frac{(\exp(w))^{1/\theta}\exp(-\exp(w)/\theta)}{\theta^{1/\theta}\,\Gamma(1/\theta)}.
$$

Note that for the lognormal density E(W_i) = 0 but E(U_i) ≠ 1, whereas for the gamma density E(U_i) = 1 but E(W_i) ≠ 0. Typically Var(W) = σ² is used to describe heterogeneity in the lognormal density case, whereas the parameter Var(U) = θ is most often used in the gamma density case. Below we use θ as generic notation for heterogeneity (meaning σ² for the lognormal density and θ for the gamma density).
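Both frailty specifications are easy to explore numerically. The following R/S-Plus style sketch (ours, not part of the paper's code) draws a large sample under each choice and checks the moment statements above; the sample size G here is inflated purely for the moment check:

```r
# Minimal sketch of the two classical frailty choices.
# Gamma: U ~ Gamma(shape = 1/theta, scale = theta), so E(U) = 1, Var(U) = theta.
# Lognormal: W ~ N(0, sigma^2), U = exp(W), so E(W) = 0 but E(U) = exp(sigma^2/2) > 1.
theta <- 0.1; sigma2 <- 0.1; G <- 1e6
u.gamma <- rgamma(G, shape = 1/theta, scale = theta)
w.lnorm <- rnorm(G, mean = 0, sd = sqrt(sigma2))
c(mean(u.gamma), var(u.gamma))        # approximately 1 and theta
c(mean(w.lnorm), mean(exp(w.lnorm)))  # approximately 0 and exp(sigma2/2)
```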
3. The EM algorithm for the gamma frailty

Typical for the frailty model is that we have observed information y and unobserved (latent) information w. To estimate ξ = (θ, β) we would like to base the likelihood maximization and statistical inference on the observed data log likelihood (Klein, 1992):
$$
\ell_{\text{obs}}(\xi) = \log f(y;\xi) = \log \int_0^\infty \cdots \int_0^\infty f(y\,|\,w;\beta)\, f(w\,|\,\theta)\, \prod_{i=1}^G \mathrm{d}w_i
$$
$$
= \sum_{i=1}^G \Bigg( D_i\log\theta - \log\Gamma(1/\theta) + \log\Gamma(1/\theta + D_i) - (1/\theta + D_i)\log\Bigg(1 + \theta\sum_{j=1}^{n_i} H_0(t_{ij})\exp(x_{ij}^t\beta)\Bigg) + \sum_{j=1}^{n_i} \delta_{ij}\big[x_{ij}^t\beta + \log h_0(t_{ij})\big] \Bigg). \qquad (3)
$$
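To make (3) concrete, here is a minimal R/S-Plus style sketch of our own (not the authors' code) that evaluates the observable log likelihood for a single covariate; the vectors H0 and h0, holding H_0(t_ij) and h_0(t_ij) for every subject, are assumed to be supplied externally:

```r
# Hedged sketch of the observable log likelihood (3) for the gamma
# frailty model with one covariate x; theta must be strictly positive.
l.obs <- function(theta, beta, center, x, status, H0, h0) {
  total <- 0
  for (i in unique(center)) {
    s  <- center == i
    Di <- sum(status[s])                       # events in center i
    Hi <- sum(H0[s] * exp(x[s] * beta))        # cumulative hazard total
    ev <- s & status == 1                      # event rows of center i
    total <- total + Di * log(theta) - lgamma(1/theta) +
      lgamma(1/theta + Di) - (1/theta + Di) * log(1 + theta * Hi) +
      sum(x[ev] * beta + log(h0[ev]))
  }
  total
}
```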
However, this likelihood is difficult to maximize as it contains, apart from ξ, also the unspecified baseline hazard. We therefore rely on the EM algorithm to estimate ξ. With ℓ_full(ξ) = log f(y, w; ξ) and ℓ_pred(ξ) = log f(w | y; ξ) we write, in the EM algorithm framework,
$$
\ell_{\text{obs}}(\xi) = \ell_{\text{full}}(\xi) - \ell_{\text{pred}}(\xi),
$$
where ℓ_full(ξ) is the complete data log likelihood and ℓ_pred(ξ) the predictive log likelihood. Taking the conditional expectation given y, with ξ^(k−1) a provisional value of ξ at stage k − 1 in the EM algorithm, we obtain
$$
\ell_{\text{obs}}(\xi) = Q(\xi\,|\,\xi^{(k-1)}) - H(\xi\,|\,\xi^{(k-1)}),
$$
where
$$
Q(\xi\,|\,\xi^{(k-1)}) = E_{\xi^{(k-1)}}[\log f(y,w;\xi)\,|\,y], \qquad H(\xi\,|\,\xi^{(k-1)}) = E_{\xi^{(k-1)}}[\log f(w\,|\,y;\xi)\,|\,y].
$$
Instead of maximizing ℓ_obs(ξ) for ξ, rather Q(ξ | ξ^(k−1)) is maximized. It is a general result from EM methodology that if ξ^(k) maximizes Q(ξ | ξ^(k−1)) then ℓ_obs(ξ^(k)) ≥ ℓ_obs(ξ^(k−1)), i.e., ξ^(k) is 'better' than ξ^(k−1). This property is central in Dempster et al. (1977).
In order to maximize Q(ξ | ξ^(k−1)), we alternate between an expectation and a maximization step. The details are as follows.

3.1. Expectation step

The expected value Q(ξ | ξ^(k−1)) can be obtained by plugging the conditional expectations of w_i and exp(w_i) into log f(y, w; ξ). With ψ(·) the digamma function, these are given by
$$
E_k(w_i) = E_{\xi^{(k-1)}}[w_i\,|\,y] = \psi(D_i + 1/\theta^{(k-1)}) - \log(1/\theta^{(k-1)} + H_i^{(k-1)})
$$
and
$$
E_k(\exp(w_i)) = E_{\xi^{(k-1)}}[\exp(w_i)\,|\,y] = \frac{1/\theta^{(k-1)} + D_i}{1/\theta^{(k-1)} + H_i^{(k-1)}}, \qquad (4)
$$
where
$$
H_i^{(k-1)} = \sum_{j=1}^{n_i} H_0^{(k-1)}(t_{ij})\exp(x_{ij}^t\beta^{(k-1)})
$$
with
$$
H_0^{(k-1)}(t) = \sum_{t_{(\ell)}\le t} h_{\ell 0}^{(k-1)} \quad\text{and}\quad h_{\ell 0}^{(k-1)} = \frac{N_{(\ell)}}{\sum_{t_{qs}\ge t_{(\ell)}} \exp(x_{qs}^t\beta^{(k-1)})\,E_{(k-1)}(\exp(w_q))}.
$$
Here t_(1) < ⋯ < t_(r) are the ordered event times (r denotes the number of different event times) and N_(ℓ) is the number of events at time t_(ℓ), ℓ = 1, …, r (see Sections 8.2 (theoretical notes) and 8.3 in Klein and Moeschberger (1997) to get good insight). As starting values we use an initial guess ξ^(0) = (θ^(0), β^(0)). These expected values need to be inserted in
$$
\log f(y,w;\xi) = \sum_{i=1}^G \log f_W(w_i) + \sum_{i=1}^G\sum_{j=1}^{n_i} \Big[\delta_{ij}\big[\log h_0(t_{ij}) + x_{ij}^t\beta + w_i\big] - H_0(t_{ij})\exp(x_{ij}^t\beta + w_i)\Big]. \qquad (5)
$$
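As a concrete illustration, a minimal sketch (ours, in R/S-Plus style) of the conditional expectations in (4), assuming the per-center event counts D and cumulative hazard totals H_i have already been computed:

```r
# Hedged sketch of the E-step quantities in (4); D and H are vectors of
# length G holding D_i and H_i^(k-1) for each center.
e.step <- function(D, H, theta) {
  inv <- 1 / theta
  list(Ew    = digamma(D + inv) - log(inv + H),  # E_k(w_i)
       Eexpw = (inv + D) / (inv + H))            # E_k(exp(w_i))
}
e.step(D = c(12, 7, 20), H = c(11.4, 8.1, 18.9), theta = 0.1)  # example values
```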
Plugging in the density f_W(·) corresponding to the gamma frailty, and replacing in (5) w_i and exp(w_i) by their conditional expectations, we obtain after rearranging some of the terms Q(ξ | ξ^(k−1)) = Q_1(ξ | ξ^(k−1)) + Q_2(ξ | ξ^(k−1)) with
$$
Q_1(\xi\,|\,\xi^{(k-1)}) = \sum_{i=1}^G \Big[(1/\theta + D_i)E_k(w_i) - E_k(\exp(w_i))/\theta - D_i\log E_k(\exp(w_i)) - \log(\theta)/\theta - \log\Gamma(1/\theta)\Big]
$$
and
$$
Q_2(\xi\,|\,\xi^{(k-1)}) = \sum_{i=1}^G\sum_{j=1}^{n_i} \Big[\delta_{ij}\big[\log h_0(t_{ij}) + x_{ij}^t\beta + \log E_k(\exp(w_i))\big] - H_0(t_{ij})\exp(x_{ij}^t\beta)E_k(\exp(w_i))\Big].
$$

3.2. Maximization step

The reason why we artificially include the term Σ_{j=1}^{n_i} δ_ij log E_k(exp(w_i)) in the definitions of Q_1(ξ | ξ^(k−1)) and Q_2(ξ | ξ^(k−1)), with, respectively, a minus sign and a plus sign, is that we can now interpret Q_2(ξ | ξ^(k−1)) as a full censored data log likelihood with (β^t, 1)^t as regression coefficients and (x_ij^t, log E_k(exp(w_i))) as risk variables. For the case of an unspecified baseline (cumulative) hazard function, we use the profile likelihood idea (fix β and maximize the full censored data log likelihood as a function of the baseline hazard only), in combination with Breslow's method to handle ties, to replace this term by
$$
\ell_{\text{part}}(\beta\,|\,\xi^{(k-1)}) = \sum_{\ell=1}^{r}\Bigg(\sum_{t_{ij}=t_{(\ell)}} \eta_{ij}^{(k-1)} - N_{(\ell)}\log\sum_{t_{qs}\ge t_{(\ell)}} \exp(\eta_{qs}^{(k-1)})\Bigg),
$$
where
$$
\eta_{ij}^{(k-1)} = x_{ij}^t\beta + \log E_k(\exp(w_i)).
$$
So we maximize, with respect to ξ = (θ, β),
$$
Q_1(\xi\,|\,\xi^{(k-1)}) + \ell_{\text{part}}(\beta\,|\,\xi^{(k-1)}). \qquad (6)
$$
The new estimate ξ^(k) for the kth iteration is then used to obtain Q(ξ | ξ^(k)), the update of the conditional expectation, and so on. This process continues until the difference between the two consecutive values E_{ξ^(k−1)}[log f(y, w; ξ) | y] and E_{ξ^(k)}[log f(y, w; ξ) | y] becomes smaller than some prespecified value ε (see Flow Chart 1 in Appendix A).

Remark 1. Nielsen et al. (1992) propose a modified profile likelihood EM algorithm that leads to the same results but the convergence is faster.
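Note that the maximization in (6) separates: ℓ_part depends only on β (the conditional expectations enter as a fixed per-subject offset) while Q_1 depends only on θ. A hedged sketch of one M-step, ours rather than the authors' implementation, assuming centers are coded 1, …, G and Ew, Eexpw are ordered accordingly:

```r
library(survival)

# Hedged sketch of one M-step (6): beta is updated by a Cox fit with
# log E_k(exp(w_i)) as offset (Breslow ties, matching the text), and
# theta by one-dimensional maximization of Q1.
m.step <- function(data, Ew, Eexpw) {            # Ew, Eexpw: E-step output
  D   <- tapply(data$status, data$center, sum)   # events per center
  off <- log(Eexpw)[data$center]                 # offset log E_k(exp(w_i))
  fit <- coxph(Surv(time, status) ~ treat + offset(off),
               data = data, ties = "breslow")
  q1 <- function(theta)                          # Q1 as displayed above
    sum((1/theta + D) * Ew - Eexpw / theta - D * log(Eexpw) -
          log(theta) / theta - lgamma(1/theta))
  list(beta  = coef(fit),
       theta = optimize(q1, c(1e-4, 5), maximum = TRUE)$maximum)
}
```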
4. The penalized partial likelihood for shared frailty models

From (5) and (6) it is clear that a logical proposal for the likelihood to use for the estimation of ξ = (θ, β) is the penalized partial likelihood
$$
\ell_{\text{ppl}}(\xi, w) = \ell_{\text{part}}(\beta, w) - \ell_{\text{pen}}(\theta, w), \qquad (7)
$$
where, with η_ij = x_ij^t β + w_i,
$$
\ell_{\text{part}}(\beta, w) = \sum_{\ell=1}^{r}\Bigg(\sum_{t_{ij}=t_{(\ell)}} \eta_{ij} - N_{(\ell)}\log\sum_{t_{qs}\ge t_{(\ell)}} \exp(\eta_{qs})\Bigg)
$$
and
$$
\ell_{\text{pen}}(\theta, w) = -\sum_{i=1}^G \log f_W(w_i).
$$

(a) For random effects w_i, i = 1, …, G, having a zero-mean normal density, we have (now think and write θ for σ²)
$$
\ell_{\text{pen}}(\theta, w) = \frac{1}{2}\sum_{i=1}^G\left(\frac{w_i^2}{\theta} + \log(2\pi\theta)\right)
$$
and (7) becomes the penalized partial likelihood studied in McGilchrist (1993). The frailties thus appear in both parts of the penalized partial likelihood. The second term penalizes random effects that are far away from the mean value zero by reducing the penalized partial likelihood. This corresponds to shrinking the random effects towards the zero mean.

The maximization of the penalized partial (log)likelihood consists of an inner and an outer loop. In the inner loop the Newton–Raphson procedure is used to maximize, for a provisional value of θ, ℓ_ppl(ξ, w) for β and w (best linear unbiased predictors, BLUPs). In the outer loop, the restricted maximum likelihood estimator for θ is obtained using the BLUPs. The process is iterated until convergence. The details are as follows. Let ℓ denote the outer loop index and k the inner loop index. Let θ^(ℓ) be the estimate for θ at the ℓth iteration in the outer loop. Given θ^(ℓ), β^(ℓ,k) and w^(ℓ,k) are the estimates and predictions for β and w at the kth iterative step in the inner loop. Starting from initial values β^(1,0), w^(1,0) and θ^(1), the kth iterative step of Newton–Raphson, given θ^(ℓ), is
$$
\begin{pmatrix}\beta^{(\ell,k)}\\ w^{(\ell,k)}\end{pmatrix} = \begin{pmatrix}\beta^{(\ell,k-1)}\\ w^{(\ell,k-1)}\end{pmatrix} + V^{-1}\begin{pmatrix}X^t\\ Z^t\end{pmatrix}\frac{\mathrm{d}\ell_{\text{part}}(\beta,w)}{\mathrm{d}\eta} - V^{-1}\begin{pmatrix}0\\ [\theta^{(\ell)}]^{-1} w^{(\ell,k-1)}\end{pmatrix},
$$
where
$$
V = \begin{pmatrix}V_{11} & V_{12}\\ V_{21} & V_{22}\end{pmatrix} = \begin{pmatrix}X^t\\ Z^t\end{pmatrix}\left(\frac{-\mathrm{d}^2\ell_{\text{part}}(\beta,w)}{\mathrm{d}\eta\,\mathrm{d}\eta^t}\right)(X\;\; Z) + \begin{pmatrix}0 & 0\\ 0 & [\theta^{(\ell)}]^{-1} I_G\end{pmatrix}.
$$
Once the Newton–Raphson procedure has converged for the current value of θ^(ℓ), a REML estimate for θ is given by
$$
\theta^{(\ell+1)} = \frac{\sum_{i=1}^G (w_i^{(\ell,k)})^2}{G - r},
$$
where r = trace[(V^{-1})_{22}]/θ^(ℓ). This outer loop is iterated until convergence, based on the difference between sequential values of θ.

(b) For random effects w_i, i = 1, …, G, with corresponding one-parameter gamma density for the frailties, we have (now θ = Var(U))
$$
\ell_{\text{pen}}(\theta, w) = -\sum_{i=1}^G \frac{w_i - \exp(w_i)}{\theta} - G\left(\frac{\log(1/\theta)}{\theta} - \log\Gamma\!\left(\frac{1}{\theta}\right)\right).
$$
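The penalty terms and the REML update above translate directly into code. A minimal sketch of our own; Vinv22, the (2,2) (G × G) block of V^{-1}, is assumed to come from the converged inner loop:

```r
# Hedged sketch of the two penalty terms of Section 4 and of the REML
# outer-loop update for theta in the normal case; w holds the current BLUPs.
pen.normal <- function(w, theta)        # theta plays the role of sigma^2
  0.5 * sum(w^2 / theta + log(2 * pi * theta))
pen.gamma <- function(w, theta)         # theta = Var(U) for the gamma frailty
  -sum(w - exp(w)) / theta -
    length(w) * (log(1/theta) / theta - lgamma(1/theta))
reml.update <- function(w, Vinv22, theta) {
  r <- sum(diag(Vinv22)) / theta        # r = trace[(V^-1)_22]/theta
  sum(w^2) / (length(w) - r)            # theta^(l+1)
}
```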
To maximize the penalized partial (log)likelihood we can still use an inner and an outer loop. The inner loop is identical to the one described in (a). No REML estimate, however, is available for θ as in the normal density case. Therefore, in the outer loop, a likelihood similar to ℓ_obs(·) is maximized for θ, as in the case of the EM algorithm. This likelihood, for a fixed value θ^(ℓ), also corresponds to Q(ξ | ξ^(k−1)) − H(ξ | ξ^(k−1)) evaluated at (θ^(ℓ), β̂^(ℓ), ŵ^(ℓ)), where β̂^(ℓ) and ŵ^(ℓ) are the estimates obtained by maximizing the penalized partial likelihood for the fixed value θ^(ℓ). This expression is in terms of the partial likelihood, which contains the expectations of the frailties as fixed offset terms. We therefore apply, in the outer loop, the golden section method to the following modified version of the observable likelihood:
$$
\ell_{\text{obs}}^{\text{part}}(\xi, w) = \ell_{\text{part}}(\beta, w) + \sum_{i=1}^G\left[\frac{1}{\theta}\log\left(\frac{1/\theta}{H_i + 1/\theta}\right) + \log\left(\frac{\Gamma(D_i + 1/\theta)}{\Gamma(1/\theta)}\right) - D_i\log(D_i + 1/\theta) + D_i\right]. \qquad (8)
$$
That ℓ_obs^part(ξ, w) is a modified version of ℓ_obs(ξ) can be seen as follows. With H_i = Σ_{j=1}^{n_i} H_0(t_ij) exp(x_ij^t β), the observable likelihood ℓ_obs(ξ) given by (3) can be written as
$$
\ell_{\text{obs}}(\xi) = \sum_{i=1}^G\sum_{j=1}^{n_i} \Big[\delta_{ij}\big[\log h_0(t_{ij}) + x_{ij}^t\beta + \log E(\exp(w_i))\big] - H_0(t_{ij})\exp(x_{ij}^t\beta)E(\exp(w_i))\Big]
$$
$$
+ \sum_{i=1}^G\left[-(D_i + 1/\theta)\log[\theta(H_i + 1/\theta)] + D_i\log\theta + \log\left(\frac{\Gamma(D_i + 1/\theta)}{\Gamma(1/\theta)}\right) - D_i\log E(\exp(w_i)) + H_i E(\exp(w_i))\right]. \qquad (9)
$$
Now modify ℓ_obs(ξ) by replacing the first (double) sum on the r.h.s. of (9) by ℓ_part(β, w). Using the relations E(exp(w_i)) = (D_i + 1/θ)/(H_i + 1/θ) and Σ_{i=1}^G D_i = Σ_{i=1}^G H_i E(exp(w_i)) (martingale residuals sum to zero), it easily follows that the second sum on the r.h.s. of (9) equals the sum on the r.h.s. of (8). Flow Chart 2 in the appendix gives the details.

Remark 2. For the gamma case an important property, stated in Therneau and Grambsch (2000, p. 254), is that, for any fixed value of θ, the solution to the penalized partial likelihood maximization coincides with the solution based on the EM algorithm (discussed in Section 3). Since S-Plus contains a fast algorithm for the penalized partial likelihood procedure, this property is very important from a practical point of view. Also note that in this case there is no need to estimate the cumulative baseline hazard, since we use ℓ_obs^part(·,·) in the outer loop.

Remark 3. The penalized partial likelihood approach has been implemented in the coxph function of S-Plus; a frailty term can be added to the Cox model. Using the default parameters for the convergence criterion ε and the maximum number of outer loops, however, the iteration procedure will terminate before reaching the maximum likelihood parameters. Therefore, ε (parameter 'eps' in the frailty function) needs to be reset to a lower value, and the maximum number of outer loops (parameter 'outer.max') needs to be increased. We obtained the maximum likelihood estimates in the simulations by setting ε = 10^{-10} and outer.max = 50.
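In the R survival package (the port of the S-Plus routines), Remark 3 translates into a call of roughly the following form. The data frame trial and its columns are hypothetical, and in this port the outer-loop maximum is set through coxph.control rather than through frailty itself:

```r
library(survival)

# Hedged sketch of the coxph frailty fit of Remark 3, with the tightened
# convergence settings recommended in the text.
fit <- coxph(Surv(time, status) ~ treat +
               frailty(center, distribution = "gamma", eps = 1e-10),
             data = trial, control = coxph.control(outer.max = 50))
fit$history[[1]]$theta   # estimate of the heterogeneity parameter
```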
Remark 4. A very direct approach is to maximize ℓ_obs^part(·,·) simultaneously for (θ, β, w). This leads to exactly the same solution as the penalized partial likelihood approach. For fixed θ, the difference between ℓ_obs^part(·,·) and ℓ_ppl(·,·) only contains terms not depending on β or w, so that maximization based on either one for fixed θ necessarily gives the same answer. On the other hand, the penalized likelihood approach also uses ℓ_obs^part(·,·) in its outer loop, so that again this leads to the same solution. The main advantage of this approach is that a general nonlinear optimizer can be used to find estimates for θ and β and, if needed, w. Flow Chart 3 in the appendix gives the details.

5. Simulations

5.1. Description of simulations

In this section, the relationship between the size of a multicenter trial and the bias and spread of estimates of θ around its true value is studied by simulation. The size of a multicenter trial is a function of the number of centers, G, and the number of patients per center, n_i, i = 1, …, G. These two parameters are studied separately in the simulation. Parameters that might influence this relationship are the event rate h_0(t) (assumed constant over time: h_0(t) = h_0), the size of the true heterogeneity parameter θ, and the treatment effect β, expressed in terms of the hazard ratio HR = exp(β). Therefore, the effect of these parameters is also studied in this simulation.

The choice of the parameter values has been determined by the clinical setting in which we want to study treatment outcome, namely breast cancer clinical trials. The number of centers typically varies from 15 to 30 centers, and for the number of patients per center we take 20, 40 and 60. We study two different types of breast cancer trials: early breast cancer clinical trials, with a typically low event rate, set here at a constant yearly event rate of h_0 = 0.07 (median time to event equals 9.9 years), and metastatic breast cancer clinical trials, with a typically high event rate, set here at a constant yearly event rate of h_0 = 0.22 (median time to event equals 3.15 years). We assume an accrual period of 5 years (with constant accrual rate) and a further follow up period of 3 years, which corresponds to a typical early breast cancer setting. The time at risk for a particular patient, rt_ij, thus consists of the time at risk before the end of the accrual period (ranging from 0 to 5 years) plus the follow up time. This results in approximately 30% and 70% of the patients having the event by the end of the study in the early breast cancer and the metastatic breast cancer clinical trial, respectively, with the remaining patients censored. As true values for the heterogeneity parameter, 0, 0.1 and 0.2 are used, as this is the most likely range of values to be observed in breast cancer clinical trials.
Fig. 1. The spread of the median time to event from center to center for θ = 0.1 and 0.2 and h0 = 0.07 and 0.22.
In the example presented in Section 6, for instance, the heterogeneity parameter is estimated to be 0.092, and for all the breast cancer trials that we studied, values in this range are found. The effect of the heterogeneity parameter on the spread of the median time to event from center to center is demonstrated by the density function of the median time to event over the centers (Fig. 1). It can easily be shown that this density function f_{T_M}(t) is given by
$$
f_{T_M}(t) = \left(\frac{\log 2}{\theta h_0 HR}\right)^{1/\theta} \frac{1}{\Gamma(1/\theta)} \left(\frac{1}{t}\right)^{1+1/\theta} \exp\left(-\frac{\log 2}{\theta t h_0 HR}\right).
$$
For instance, for θ = 0.1 and HR = 1, 90% of the centers would have a median time to event between 2 and 5.6 years for h_0 = 0.22, and between 6.4 and 17.6 years for h_0 = 0.07. Finally, for the treatment effect, we use a hazard ratio of 1 and 1.3, as a value of 1.3 is often taken to be the expected effect of the standard as compared to the new treatment in cancer clinical trials.

For each parameter setting, 6500 data sets are generated; for each of them the heterogeneity parameter from the frailty model with the treatment effect as covariate is estimated by the relevant technique, and the P-value of the log likelihood ratio test for H0: θ = 0 is assessed as well, based on a chi-square distribution with one degree of freedom (Andersen et al., 1993). We have chosen 6500 iterations in order to have a probability of 90% that the simulated mean does not differ more than 0.001 from the value that would be obtained if the number of simulations went to infinity. The simulations are quite computer intensive and we believe that this precision is adequate for our purposes. From this list of 6500 values of θ̂, the 5%, 50% and 95% quantiles and the mean are determined to describe the spread and the bias of the estimate around its true value. Additionally, the percentage of data sets with a significant likelihood ratio test at the 5% significance level is determined. For θ = 0 this percentage corresponds to the size of the test, whereas for θ > 0 it corresponds to the power of the test at a nominal significance level of 5%.

Given a particular parameter setting (G, n, h_0, θ, HR), the observations of a data set are generated in the following way (see also the sketch below). First, G random center effects w_1, …, w_G are generated from the appropriate frailty density. The time to event outcome for each patient, et_ij, is randomly generated from an exponential distribution (rexp function in S-Plus) with parameter h_ij for the jth patient from center i given by
$$
h_{ij} = h_0\exp(\beta x_{ij} + w_i).
$$
A patient for whom the time to event is longer than the time at risk is censored, with time to censoring equal to the time at risk, so that t_ij = min(rt_ij, et_ij), and δ_ij = I(rt_ij ≥ et_ij) is the censoring indicator.
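A minimal sketch of this data-generating scheme in R/S-Plus style (our own code, for θ > 0; the 1:1 treatment allocation via a fair coin flip is an assumption, as the paper does not state the allocation mechanism):

```r
# Hedged sketch of one simulated data set as described in Section 5.1:
# gamma frailties, exponential event times, 5 years of uniform accrual
# plus 3 years of follow up.
gen.data <- function(G = 15, n = 20, h0 = 0.07, theta = 0.1, HR = 1.3) {
  beta <- log(HR)
  w  <- log(rgamma(G, shape = 1/theta, scale = theta))   # center effects
  id <- rep(1:G, each = n)
  x  <- rbinom(G * n, 1, 0.5)                  # assumed 1:1 allocation
  et <- rexp(G * n, rate = h0 * exp(beta * x + w[id]))   # event times
  rt <- runif(G * n, 0, 5) + 3                 # time at risk
  data.frame(center = id, treat = x,
             time = pmin(rt, et), status = as.numeric(et <= rt))
}
```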
5.2. Results of simulations

We only simulated data from the gamma frailty density (2) and fitted the corresponding frailty model. All techniques described in the previous sections lead to virtually the same results for the parameter settings studied here, which follows as well from theoretical considerations: Therneau and Grambsch (2000, p. 254) have shown that the EM algorithm leads to the same results as the penalized likelihood approach, and in Remark 4 we show that the same results are obtained for the simultaneous approach. Therefore, the simulations are based on the penalized likelihood approach, as this is available as a standard S-Plus function (see Remark 3). As an extension to this, the likelihood ratio test is based on ℓ_obs^part(·,·).

5.2.1. Bias

In all cases studied for θ = 0.1 and 0.2, the estimate of θ is biased downward, as the mean of the heterogeneity parameter estimates from the simulated data sets is smaller than the true parameter value. For a fixed number of centers, either 15 or 30, there is little effect on the bias when increasing the patient numbers (Table 1). Increasing the number of centers from 15 to 30 reduces the bias, especially for θ = 0.2 and less so for θ = 0.1. Part of this downward bias can be explained by the fact that heterogeneity is estimated in a model with the treatment effect included as a fixed effect and no corrections are made for this effect, similar to the REML estimation used for the normal frailty density. Additional simulations (data not shown) seem to confirm this, but it needs to be studied in more detail.
5.2.2. 90% interquantile range

The 90% interquantile range decreases substantially with an increasing number of patients per center and with an increasing number of centers, whatever the parameter setting. A pronounced effect is observed when increasing the number of patients per center up to 40. When increasing the number of patients per center further, the gain in terms of a smaller interquantile range is only marginal in most cases (Table 1). The interquantile range shrinks considerably when increasing the number of centers from 15 to 30.

For a fixed total sample size of 600 patients, distributed either over 15 centers or over 30 centers, the results depend on the event rate and the true heterogeneity parameter θ. With a low event rate and θ = 0.1, it is preferable to have only 15 centers with 40 patients each, whereas with a high event rate, a trial with 30 centers and only 20 patients per center leads to a smaller interquantile range. For θ = 0.2 it is better, for both event rates, to have 30 centers and only 20 patients per center (Table 1). The hazard ratio has little effect on the 90% interquantile range. A higher event rate, however, reduces the interquantile range substantially, especially for θ = 0.1 and for the smaller trials with fewer patients (Table 1).

5.2.3. Size and power

In all cases, the size of the likelihood ratio test stays well below 5% and comes somewhat closer to 5% for larger trials (Table 1). With respect to the power of the likelihood ratio test, it is obvious that a trial with 15 centers and only 20 patients per center is inadequate when the event rate is low, or when θ = 0.1 with a high event rate, as in these cases the power stays well below 0.8. When increasing the sample size to 40 patients per center, the trial still has a power below 0.8 when θ = 0.1 and h_0 = 0.07, regardless of the value of HR; further increasing the number of patients to 60 with θ = 0.1, h_0 = 0.07 and HR = 1 will only result in a power of 0.74. The power also stays well below 0.8 with 20 patients per center and 30 centers if the event rate is low and θ = 0.1. In all other cases the power is above 0.8 (Table 1).

The power can also be derived from the simulated results alone, by determining the percentage of data sets for which the estimate of the heterogeneity parameter is above the 95% quantile for θ = 0. For instance, 42.5% of the data sets with 15 centers and 20 patients per center generated with parameters h_0 = 0.07, θ = 0.1 and HR = 1.3 have an estimated heterogeneity parameter above 0.086, the 95% quantile for the same setting but with θ = 0. The simulated power, however, is in most instances much higher than the likelihood ratio power obtained from an asymptotic χ²₁ distribution. Thus, the likelihood ratio test is often too conservative. This is due to the fact that the value of θ in the null hypothesis is at the boundary of the parameter space, a fact well known in mixed models (Stram and Lee, 1994; Self and Liang, 1987). Therefore, the approximation of the log likelihood ratio statistic by a chi-square distribution with one degree of freedom is inadequate (see Fig. 2). Rather, a mixture of a chi-square distribution with one and zero degrees of freedom should be used. This will result in a decrease of the critical value of the log likelihood ratio statistic.
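The boundary correction amounts to a 50:50 mixture of χ²₀ and χ²₁ (Self and Liang, 1987), i.e., halving the naive one-degree-of-freedom p-value. A short worked sketch of our own:

```r
# P-value for H0: theta = 0 from the likelihood ratio statistic LR:
# naive chi-square(1) reference vs. the 50:50 chi-square(0)/chi-square(1)
# mixture implied by the boundary argument.
p.naive   <- function(LR) pchisq(LR, df = 1, lower.tail = FALSE)
p.mixture <- function(LR) 0.5 * pchisq(LR, df = 1, lower.tail = FALSE)
c(p.naive(2.71), p.mixture(2.71))   # approximately 0.10 and 0.05
```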
Table 1
Simulation results for HR = 1 and 1.3 and event rate h0 = 0.07 and 0.22. Column headings give the number of patients per center/the number of centers.

θ     Quant.     20/15   40/15   60/15   20/30   40/30   60/30

HR = 1, event rate h0 = 0.07
0     5%         0.000   0.000   0.000   0.000   0.000   0.000
      50%        0.000   0.000   0.000   0.000   0.000   0.000
      95%        0.098   0.047   0.032   0.068   0.031   0.021
      Mean       0.018   0.009   0.006   0.014   0.006   0.004
      Size       0.009   0.013   0.017   0.017   0.019   0.019
0.1   5%         0.000   0.000   0.012   0.000   0.025   0.035
      50%        0.066   0.078   0.082   0.085   0.089   0.092
      95%        0.265   0.208   0.191   0.217   0.177   0.165
      Mean       0.088   0.088   0.089   0.094   0.094   0.095
      PowerLR    0.197   0.527   0.740   0.424   0.836   0.963
      PowerSim   0.371   0.692   0.866   0.607   0.929   0.985
0.2   5%         0.000   0.040   0.058   0.047   0.084   0.096
      50%        0.163   0.167   0.176   0.178   0.181   0.185
      95%        0.425   0.368   0.356   0.357   0.317   0.312
      Mean       0.181   0.180   0.187   0.187   0.189   0.192
      PowerLR    0.502   0.870   0.949   0.800   0.988   1.000
      PowerSim   0.708   0.938   0.985   0.911   0.999   1.000

HR = 1, event rate h0 = 0.22
0     5%         0.000   0.000   0.000   0.000   0.000   0.000
      50%        0.000   0.000   0.000   0.000   0.000   0.000
      95%        0.044   0.021   0.015   0.030   0.015   0.010
      Mean       0.008   0.004   0.003   0.006   0.003   0.002
      Size       0.012   0.015   0.015   0.014   0.019   0.019
0.1   5%         0.000   0.020   0.030   0.028   0.043   0.048
      50%        0.081   0.084   0.086   0.089   0.092   0.094
      95%        0.210   0.182   0.174   0.171   0.163   0.153
      Mean       0.090   0.090   0.092   0.094   0.096   0.096
      PowerLR    0.568   0.890   0.967   0.867   0.998   1.000
      PowerSim   0.743   0.947   0.987   0.942   1.000   1.000
0.2   5%         0.045   0.066   0.074   0.088   0.105   0.108
      50%        0.169   0.175   0.175   0.185   0.186   0.188
      95%        0.353   0.337   0.326   0.318   0.293   0.290
      Mean       0.180   0.185   0.185   0.192   0.192   0.193
      PowerLR    0.889   0.996   0.999   0.995   1.000   1.000
      PowerSim   0.953   0.998   0.999   0.998   1.000   1.000

HR = 1.3, event rate h0 = 0.07
0     5%         0.000   0.000   0.000   0.000   0.000   0.000
      50%        0.000   0.000   0.000   0.000   0.000   0.000
      95%        0.086   0.042   0.027   0.061   0.030   0.020
      Mean       0.016   0.008   0.005   0.012   0.006   0.004
      Size       0.012   0.016   0.017   0.014   0.022   0.021
0.1   5%         0.000   0.000   0.015   0.000   0.027   0.038
      50%        0.070   0.078   0.082   0.086   0.089   0.091
      95%        0.245   0.200   0.188   0.207   0.175   0.162
      Mean       0.088   0.087   0.090   0.093   0.093   0.095
      PowerLR    0.254   0.568   0.802   0.478   0.881   0.983
      PowerSim   0.425   0.751   0.904   0.654   0.941   0.991
0.2   5%         0.000   0.041   0.056   0.054   0.087   0.098
      50%        0.156   0.169   0.172   0.180   0.186   0.186
      95%        0.424   0.360   0.344   0.352   0.317   0.303
      Mean       0.179   0.181   0.183   0.188   0.192   0.191
      PowerLR    0.556   0.870   0.969   0.879   0.995   1.000
      PowerSim   0.737   0.945   0.986   0.937   0.999   1.000

HR = 1.3, event rate h0 = 0.22
0     5%         0.000   0.000   0.000   0.000   0.000   0.000
      50%        0.000   0.000   0.000   0.000   0.000   0.000
      95%        0.041   0.020   0.014   0.029   0.015   0.009
      Mean       0.007   0.004   0.003   0.006   0.003   0.002
      Size       0.011   0.013   0.014   0.018   0.022   0.022
0.1   5%         0.002   0.023   0.030   0.029   0.045   0.050
      50%        0.080   0.084   0.086   0.090   0.093   0.093
      95%        0.203   0.180   0.171   0.176   0.157   0.151
      Mean       0.089   0.091   0.092   0.095   0.095   0.096
      PowerLR    0.614   0.893   0.975   0.892   0.999   1.000
      PowerSim   0.770   0.961   0.990   0.953   0.999   1.000
0.2   5%         0.048   0.072   0.075   0.089   0.107   0.110
      50%        0.171   0.176   0.176   0.186   0.188   0.190
      95%        0.361   0.333   0.333   0.317   0.298   0.295
      Mean       0.184   0.186   0.186   0.192   0.193   0.194
      PowerLR    0.899   0.992   0.998   0.995   1.000   1.000
      PowerSim   0.959   0.998   1.000   0.998   1.000   1.000
Fig. 2. The estimated density function of the log likelihood ratio from the simulated data of the two different settings (G = 30, n = 60, h0 = 0.22 and G = 15, n = 20, h0 = 0.07). The other two lines correspond to the density functions of χ²₀ and χ²₁.
6. A case study: the peri-operative trial

This research was undertaken because we wanted to study heterogeneity in an early breast cancer clinical trial comparing peri-operative chemotherapy with post-operative chemotherapy (Clahsen et al., 1996).
In the previous section, the effect of the number of centers and the number of patients per center was studied using an equal number of patients in each center. In practice, however, a multicenter trial typically contains a few large centers and many rather small centers. For example, the peri-operative trial consists of 15 centers, with the following numbers of patients per center, sorted according to size: 6, 19, 25, 39, 48, 53, 54, 60, 78, 184, 185, 206, 311, 622, 902. In this section, we therefore use the actual distribution of the patients over the centers observed in the peri-operative trial. The heterogeneity parameter θ, the treatment effect HR and the baseline hazard h_0 were estimated to be 0.092, 1.16 and 0.07, respectively. Furthermore, the accrual time and follow-up time equal 5 and 3 years (these values are used in the simulation).

Assuming that θ = 0, we simulated again 6500 data sets and studied the spread of the estimates of θ. The 95% quantile is given by 0.007. Therefore, the observed value θ̂ = 0.092 contradicts the hypothesis that there is no heterogeneity between the centers. The power based on the likelihood ratio test, resp. the simulated power, equals 0.94, resp. 0.98, for θ = 0.1, and 1.00, resp. 1.00, for θ = 0.2. Therefore, this trial is adequate for treatment outcome research and for further investigating sources of heterogeneity.
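The null reference distribution only requires adapting the generation scheme of Section 5.1 to the observed center sizes. A minimal sketch under θ = 0 (our own code; as before, the 1:1 treatment allocation is an assumption, and the fitting routine applied to each data set is left open, e.g. the coxph call of Remark 3):

```r
# Center sizes of the peri-operative trial, sorted by size.
ni <- c(6, 19, 25, 39, 48, 53, 54, 60, 78, 184, 185, 206, 311, 622, 902)

gen.null <- function(ni, h0 = 0.07, HR = 1.16) {
  n  <- sum(ni)
  id <- rep(seq_along(ni), ni)
  x  <- rbinom(n, 1, 0.5)                      # assumed 1:1 allocation
  et <- rexp(n, rate = h0 * exp(log(HR) * x))  # theta = 0: no frailty term
  rt <- runif(n, 0, 5) + 3                     # 5 years accrual + 3 follow up
  data.frame(center = id, treat = x,
             time = pmin(rt, et), status = as.numeric(et <= rt))
}
# Fitting 6500 such data sets and taking the 95% quantile of the theta
# estimates gives the null threshold (0.007 in the text).
```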
7. Conclusions

It is important, before embarking on studying the heterogeneity between centers in multicenter trials and drawing clinical conclusions, to ensure that the multicenter trial is, in terms of number of centers and number of patients per center, sufficiently large so that the estimated heterogeneity parameter actually describes true heterogeneity between centers and not just random variability. The type of simulations described here can be used to that end. It is found that the number of centers and the event rate play a crucial role in determining whether a multicenter trial is adequately sized. On the other hand, once a certain number of patients per center has been reached, adding more patients to a center does not substantially increase the power of the trial anymore. For the cases simulated here, this value is close to 40 patients per center.

For each particular situation, a simulation of the type described in Section 5 can be done to determine whether the estimate of the heterogeneity parameter has a probability lower than 5% of coming from a multicenter trial with θ = 0. Furthermore, the interquantile range and the bias can be studied for different values of θ, and an investigator can then take a decision whether or not to use the multicenter trial for treatment outcome research.

Note that we take constant hazard rates in our simulations. This is an acceptable assumption for the example of Section 6, where an exponential hazard rate is not rejected when compared to a Weibull hazard rate (based on a likelihood ratio test). This assumption might not hold in other situations. Furthermore, only the gamma frailty density has been used. It might be of interest to consider other situations and to use frailty densities different from the gamma density.
Although the findings from the simulations are necessarily limited to the parameter settings studied, the conclusions are highly relevant for breast cancer clinical trials. For other tumor types or diseases, different parameter settings might be relevant, which can lead to different design requirements.
Acknowledgements

Paul Janssen received support from grant BIL00/28 (International Scientific and Technological Cooperation) of the Flemish community. Rosemary Nguti received partial support from the Belgian Technological Cooperation (BTC).
Appendix A

Flow Chart 1. The EM algorithm approach. β̂ are the estimates for the regression coefficients in the classical Cox regression model (without frailties) and Ĥ = (Ĥ₁, …, Ĥ_G), where Ĥᵢ is obtained from (4) with β^(0) and E₀(exp(wᵢ)) = 1. Eₖ(w) = (Eₖ(w₁), …, Eₖ(w_G)), etc. See Section 3 for further details.
Flow Chart 2. The penalized likelihood approach. For details, see Section 4.
Flow Chart 3. The simultaneous likelihood maximization approach. For details, see Remark 4.
References

Andersen, P.K., Borgan, O., Gill, R.D., Keiding, N., 1993. Statistical Models Based on Counting Processes. Springer, New York, p. 663.
Clahsen, P.C., van de Velde, C.J., Julien, J.P., Floiras, J.L., Delozier, T., Mignolet, F.Y., Sahmoud, T.M., 1996. Improved local control and disease-free survival after preoperative chemotherapy for early-stage breast cancer. A European Organization for Research and Treatment of Cancer Breast Cancer Cooperative Group study. J. Clin. Oncol. 14, 745–753.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B 39, 1–38.
Harding, M.J., Paul, J., Gilles, C.R., Kaye, S.B., 1993. Management of malignant teratoma: does referral to a specialist unit matter? Lancet 341, 999–1002.
Hillner, B.E., Smith, T.J., Desch, C.E., 2000. Hospital and physician volume or specialization and outcomes in cancer treatment: importance in quality of cancer care. J. Clin. Oncol. 18, 2327–2340.
Klein, J.P., 1992. Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics 48, 795–806.
Klein, J.P., Moeschberger, M.L., 1997. Survival Analysis. Techniques for Censored and Truncated Data. Springer, New York.
McGilchrist, C.A., 1993. REML estimation for survival models with frailty. Biometrics 49, 221–225.
Murphy, S.A., 1994. Consistency in a proportional hazard model incorporating a random effect. Ann. Statist. 22, 712–731.
Murphy, S.A., 1995. Asymptotic theory for the frailty model. Ann. Statist. 23, 182–198.
Nielsen, G.G., Gill, R.D., Andersen, P.K., Sorensen, T.I.A., 1992. A counting process approach to maximum likelihood estimation in frailty models. Scand. J. Statist. 19, 25–43.
Parner, E., 1998. Asymptotic theory for the correlated gamma-frailty model. Ann. Statist. 26, 183–214.
Self, S.G., Liang, K.-Y., 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Amer. Statist. Assoc. 82, 605–610.
Stram, D.O., Lee, J.W., 1994. Variance components testing in longitudinal mixed effects models. Biometrics 50, 1171–1177.
Therneau, T.M., Grambsch, P.M., 2000. Modeling Survival Data. Extending the Cox Model. Springer, New York.
Yamaguchi, T., Ohashi, Y., 1999. Investigating centre effects in a multi-centre clinical trial of superficial bladder cancer. Statist. Med. 18, 1961–1971.