
Controlled Clinical Trials 23 (2002) 240–256

The bivariate continual reassessment method: extending the CRM to phase I trials of two competing outcomes

Thomas M. Braun, Ph.D.*

Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA

* Corresponding author: Thomas M. Braun, Department of Biostatistics, University of Michigan, C-345 Biostatistics Core, 1500 E. Medical Center Drive, Ann Arbor, MI 48108. Tel.: 1-734-647-3271; fax: 1-734-763-6236. E-mail address: [email protected]

Manuscript received June 21, 2001; manuscript accepted November 28, 2001

Abstract

Traditional phase I studies find the therapeutic dose of an investigational drug based solely on toxicity and without regard to the drug's efficacy. We propose extending the continual reassessment method (CRM) to a bivariate trial design in which the maximum tolerated dose is based jointly on both toxicity and disease progression. We call our study design bCRM, for bivariate CRM, which we apply to a study of bone marrow patients seeking to find the optimal time after bone marrow transplant at which to taper immune suppression and eventually begin donor leukocyte infusions. We demonstrate through simulation that bCRM has excellent operating characteristics and compare the performance of bCRM in a bone marrow transplantation study to designs proposed by previous investigators. We also attempt to provide some direction to future investigators with regard to plausible models and distributions to use with a bivariate study design. © 2002 Elsevier Science Inc. All rights reserved.

Keywords: Bayesian inference; Correlated binary data; Clinical trial; Bone marrow transplant

Introduction

Motivation

This work is motivated by current research in allogeneic stem cell transplantation for leukemia patients. Standard therapy uses an allogeneic bone marrow or peripheral blood stem cell transplant (BMT), which is preceded by a high-intensity myeloablative regimen (chemotherapy and/or radiation) and is followed with immunosuppressive (IS) agents designed to prevent acute graft-versus-host disease (aGVHD). However, this strategy is not suitable for older patients and those with advanced leukemia due to increased rates of regimen-related toxicity, aGVHD, and relapse. Thus, alternative methods are needed for this subgroup of patients. Researchers are seeking to find lower-intensity regimens that rely less upon chemotherapy and radiation to kill the leukemia and rely more upon the graft-versus-leukemia (GVL) effect of the transplanted donor T cells. Therefore, high-risk patients are given a lower-intensity conditioning regimen before BMT, followed with standard immunosuppressive methods. Then, due to the increased risk of relapse, patients are slowly tapered off their immunosuppressive therapy, at which point donor leukocyte infusions (DLI) are given to promote a GVL effect. However, the optimal times at which to begin the IS taper and DLI are relatively unknown. If the IS taper begins too soon, aGVHD will occur. Conversely, if the DLI are begun too late, disease progression is likely. Therefore, given a set of possible regimens specifying the times of IS taper and DLI administration, we would like to find the regimen that is optimal, in the sense that aGVHD and disease progression rates are both kept near desired thresholds.

Single outcome study designs

The continual reassessment method, or CRM, was proposed to address limitations in historically simplistic phase I study designs like the 3+3 method [1,2]. Essentially, a one-parameter monotonic regression function is used to model the relationship of dose and probability of toxicity. Assuming each subject's outcome to be a Bernoulli random variable, the logarithm of such a probability model gives us a likelihood for the regression parameter β, to which we assign a noninformative prior distribution. A cohort of subjects enters the study at a specified dose, and based upon the cohort's experience and the prior placed upon β, the CRM computes a posterior distribution for β. The posterior mean of β is then used to estimate the probability of toxicity seen at each dose level. Another cohort is then entered into the trial at the dose whose estimated probability of toxicity is closest to and less than the desired level. When the prespecified stopping criteria have been met, the maximum tolerated dose (MTD) is selected as that dose with a toxicity rate closest to and less than the optimal rate, based upon the data from all the cohorts. Numerous other single outcome study designs have been proposed as alternatives to the CRM [3–5].

Composite outcome study designs

The CRM algorithm and its counterparts determine the MTD solely on toxicity and ignore disease response altogether. One solution to this problem was proposed by Thall and Russell, in which toxicity was combined with a measure of response [6]. Like the authors of a later publication, we refer to this method as TR [7]. In one specific example, T is defined to be a binary indicator of toxicity (0 = no toxicity; 1 = toxicity) and G = 0, 1, or 2 as an indicator of no, moderate, or severe graft-versus-host disease (GVHD), respectively. A final outcome variable Y takes one of three values as follows:

$$Y = \begin{cases} 0 & \text{if } G = 0 \text{ and } T = 0 \\ 1 & \text{if } G = 1 \text{ and } T = 0 \\ 2 & \text{if } G = 2 \text{ or } T = 1 \end{cases}$$

Thus, toxicity and response have been combined into a single trinomial variable Y. TR then uses a cumulative odds regression model for the relationship of dose and Y, and Bayesian methods identical to those of the CRM are used to estimate p_i(X) = Prob(Y = i | X), i = 0, 1, 2, for any dose level X in order to find the MTD.

Addressing both toxicity and response simultaneously was also suggested by Gooley et al., who via simulation compared three candidate study designs, each based upon a series of rules for dose escalation or deescalation [8]. Their findings demonstrated that two dose-response curves add complexity to any dose selection algorithm. Operating characteristics found through simulation can help investigators determine which design will best meet their goals before the study is actually implemented. Additional recent contributions to bivariate study designs can be found in Legedza [9]. Although the method of Gooley et al. can locate an optimal dose, the method is not directly generalizable to other situations. Furthermore, there is no estimation of a parameter summarizing a response curve for a continuum of doses. In contrast, TR does estimate a dose-response curve. However, TR combines toxicity and response into one variable, making it impossible to estimate the effect of dose upon each outcome separately. It seems intuitive that if both toxicity and response measurements exist for each subject, investigators would like to ascertain the separate effect of dose on toxicity and response. Furthermore, knowing how dose elicits a clinical response will assist in the design of future studies. Therefore, we suggest a model in which, for a given dose, the probability of toxicity is parameterized separately from the probability of response. The two outcomes of toxicity and response are then combined into a joint likelihood model that contains an additional parameter for the within-subject association of both outcomes. The Bayesian methods used in the CRM and TR can then be used in this setting to estimate how dose relates to both toxicity and response. We call this approach bCRM, for bivariate CRM, which extends the traditional CRM design for phase I trials to define the MTD as a function of two outcomes. As applied to our study of IS taper and DLI, we need to specify: (1) a regression model for the effect of dose on the probabilities of aGVHD and progression, (2) a joint distribution of two binary outcomes, namely the presence of aGVHD and progression of disease, and (3) a prior distribution for the regression parameters of interest. In the next section, we describe the regression model and distributions. We will then describe how bCRM is implemented, and the section following contains simulation results covering the operational characteristics of the design, as well as a comparison to designs examined by Gooley et al. in a study of T cell doses in bone marrow transplantation. Summarizing comments are in the final section.
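Before turning to the bivariate model, here is a minimal numerical sketch (ours, not from the paper) of the single-outcome CRM update described under "Single outcome study designs" above: a one-parameter power model for toxicity, an exponential prior on the parameter, a posterior mean computed by a simple grid approximation, and selection of the dose whose estimated toxicity probability is closest to, but not above, the target. The skeleton probabilities and target rate are illustrative assumptions, not values from the paper.

```python
import math

skeleton = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60]   # prior guesses of toxicity for doses 1-6 (illustrative)
target = 0.25                                      # desired toxicity rate (illustrative)

def crm_update(doses_given, tox_observed):
    """One CRM update: model p_j(beta) = skeleton[j]**beta with an Exp(1) prior on
    beta > 0; return posterior-mean toxicity estimates and the recommended next dose
    (closest to the target without exceeding it)."""
    def log_post(beta):
        lp = -beta                                 # log of the Exp(1) prior (up to a constant)
        for d, y in zip(doses_given, tox_observed):
            p = skeleton[d] ** beta
            lp += y * math.log(p) + (1 - y) * math.log(1 - p)
        return lp

    # Posterior mean of beta via a simple weighted grid over (0, 10].
    grid = [0.001 + 10.0 * i / 2000 for i in range(2001)]
    weights = [math.exp(log_post(b)) for b in grid]
    beta_hat = sum(b * w for b, w in zip(grid, weights)) / sum(weights)

    p_hat = [q ** beta_hat for q in skeleton]
    safe = [j for j, p in enumerate(p_hat) if p <= target]
    next_dose = max(safe, key=lambda j: p_hat[j]) if safe else 0
    return p_hat, next_dose

# Example: first cohort of three treated at dose 3 (index 2), one toxicity observed.
p_hat, nxt = crm_update([2, 2, 2], [0, 0, 1])
print([round(p, 3) for p in p_hat], "next dose index:", nxt)
```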

Model and distributional assumptions

Regression model for dose effect

Our phase I study attempts to discern which of J doses is the MTD, defined as that dose in which rates of toxicity and progression are closest to their targeted values. In our setting, each "dose" is actually a specified timing of IS taper and DLI. For each dose j, there is a probability of toxicity p1j and a corresponding probability of disease progression p2j. The doses are sorted in increasing order of probability of toxicity, which implies they are also sorted by decreasing order of probability of progression.


For each subject i, we measure two outcomes:

$$Y_i = \begin{cases} 0, & \text{no toxicity} \\ 1, & \text{toxicity} \end{cases} \qquad\qquad Z_i = \begin{cases} 0, & \text{no progression} \\ 1, & \text{progression} \end{cases}$$

Conditional on the value X = xj, where xj denotes the numeric value of dose j, the respective probabilities of toxicity and progression are associated to each dose by the equations

$$p_{1j} = h_1(x_j, \beta_1) = \text{Prob(toxicity seen at dose } j)$$

$$p_{2j} = h_2(x_j, \beta_2) = \text{Prob(progression seen at dose } j)$$

where the functions h1(.) and h2(.) are any two monotonic functions such that we have a low probability of toxicity and a high probability of disease progression on the lowest dose. A variety of forms of h1(.) and h2(.) are available. Some investigators may prefer the assumption that dose effects are additive on a logarithmic scale. This suggests h1(xj, β1) = xj^β1 and h2(xj, β2) = (1 − xj)^β2. Other investigators may wish to assume additivity on the log-odds scale, i.e.,

$$\log\left(\frac{p_{1j}}{1 - p_{1j}}\right) = -3 + \beta_1 x_j \qquad (1)$$

$$\log\left(\frac{p_{2j}}{1 - p_{2j}}\right) = 3 - \beta_2 x_j \qquad (2)$$

Beyond their biological plausibility, the functions (1) and (2) have the added statistical property of belonging to a group of functions known as canonical links. A third possibility is h1(xj, β1) = {(1 + tanh xj)/2}^β1 and h2(xj, β2) = {(1 − tanh xj)/2}^β2, originally suggested by O'Quigley, Pepe, and Fisher [1]. Also, if biologically plausible, there is no restriction that h1(.) and h2(.) be of the same form. Nonetheless, beyond personal preference, no clear overall recommendation exists regarding the appropriate forms of h1(.) and h2(.). In our current research, we use functions (1) and (2).

Assuming a continuous function for the dose-response curves requires that each dose xj has a numeric value. Although in some settings each dose will have an a priori numeric value, the dose values need to be rescaled in order to promote a reasonable fit to the functions. Selected primarily for its convenience, the method used to assign each dose's value is as follows. At the start of the trial, the investigator will supply values p1j0 and p2j0, for p1j and p2j respectively, for each dose xj. Adopting the convention of O'Quigley and Chevret, we choose β1^0 = 1 as a starting value for β1 [10]. Based upon the values of p1j0 and β1^0 = 1, we can use Eq. (1) to solve for each xj, giving us numeric values x̂1j. Similarly, we can use a starting value β2^0 = 1 for β2 and use p2j0 with Eq. (2) to compute numeric values x̂2j. The value we assign to each dose j is 0.5(x̂1j + x̂2j). One may be concerned that the numeric values assigned to each dose are arbitrary and that different starting values of the regression parameters would influence which value is assigned to each dose. However, the goal of this study is to estimate the probabilities of toxicity and disease progression, whose estimates will be unaffected as long as linearity of dose and the log-odds of both toxicity and response is a reasonable assumption. In support, we quote O'Quigley, Pepe, and Fisher, who stated, "in many situations, what matters are the toxic probabilities and not so much the model generating the probabilities" [1].
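To make the dose-rescaling step concrete, the following sketch (ours, in Python) inverts Eqs. (1) and (2) at the starting values β1 = β2 = 1 and averages the two solutions, using the investigator's prior guesses from Table 1 of the paper as inputs.

```python
import math

def rescale_doses(p1_prior, p2_prior):
    """Assign a numeric value to each dose from the investigator's prior guesses of
    toxicity (p1_prior) and progression (p2_prior), by inverting Eqs. (1) and (2)
    at the starting values beta1 = beta2 = 1 and averaging the two solutions."""
    doses = []
    for p1, p2 in zip(p1_prior, p2_prior):
        x1 = math.log(p1 / (1 - p1)) + 3.0   # solve Eq. (1) for x with beta1 = 1
        x2 = 3.0 - math.log(p2 / (1 - p2))   # solve Eq. (2) for x with beta2 = 1
        doses.append(0.5 * (x1 + x2))
    return doses

# Prior guesses from Table 1 (doses 1-6).
p1_prior = [0.02, 0.05, 0.15, 0.25, 0.40, 0.60]
p2_prior = [0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
print([round(x, 3) for x in rescale_doses(p1_prior, p2_prior)])
```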


Bivariate distribution for toxicity and progression

Once the functional relationships of dose to the probabilities of toxicity and progression have been specified, we need to select a bivariate distribution for toxicity and progression. Such a distribution should not only have desirable statistical properties, but also be biologically plausible. One important factor in the selection of an appropriate bivariate distribution is the direction and magnitude of the association of the two outcomes under investigation. For example, in our setting, toxicity and progression are likely to have a negative association because the presence of aGVHD may indicate a GVL effect that will reduce the probability of progression. However, the magnitude of the association is generally unknown. In other settings, the two outcomes under investigation may have a small positive association, or the direction and magnitude of the association may be completely unknown.

One commonly adopted model for correlated binary data is a beta-binomial model, which states that the probabilities p1j and p2j of the two binary outcomes (i.e., toxicity and disease progression) vary from subject to subject, most likely due to unmeasured subject characteristics [11]. The between-subject variability of p1j and p2j is modeled with a beta distribution, thereby creating a within-subject association of the two outcomes. However, a beta-binomial model is limited in our setting for two reasons. First, the association in a beta-binomial model must be nonnegative. Second, for a given dose, both outcomes are assumed to have equal probabilities, making a beta-binomial model inappropriate with outcomes that occur with different probabilities.

Instead of assuming a beta-binomial model, we could instead extend the regression functions h1(.) and h2(.) and insert an additional subject-specific parameter αi for subject i. These subject-specific parameters are assumed to be random with a known distribution G(α), whose properties will define the within-subject association of the two outcomes. For example, our functions (1) and (2) could include subject-specific intercepts, i.e.,

$$\log\left(\frac{p_{1ij}}{1 - p_{1ij}}\right) = -\alpha_i + \beta_1 x_j$$

$$\log\left(\frac{p_{2ij}}{1 - p_{2ij}}\right) = \alpha_i - \beta_2 x_j$$

However, there are limitations to this model as well. First, the association of the two binary outcomes is a function of their corresponding probabilities, which vary from dose to dose. Thus, the association within a subject will vary depending upon which dose a subject receives. We would instead prefer a model that assumes the association of toxicity and disease progression is constant across all doses. Second, except for specific forms of G(α), the value of the association will not have a closed form expression, making interpretation difficult. Third, the interpretations of β1 and β2 are conditional upon the value of the random intercept αi. Prentice contains an excellent discussion of the pros and cons of various correlated binary regression models [12].

As an alternative to the models just discussed, we assume that conditional upon dose x, each pair (Yi, Zi) has the bivariate distribution

$$f(y, z \mid x) = k(p_1, p_2, \psi)\, p_1^{y} q_1^{(1-y)}\, p_2^{z} q_2^{(1-z)}\, \psi^{yz} (1 - \psi)^{(1 - yz)}; \qquad y, z \in \{0, 1\},\quad 0 < \psi < 1$$
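A small sketch of this joint distribution (ours, not the paper's code): it computes the four cell probabilities, including the normalizing constant k(p1, p2, ψ), and draws a single (toxicity, progression) pair. The parameter values in the example call are illustrative.

```python
import random

def joint_pmf(p1, p2, psi):
    """Cell probabilities of (Y, Z) under the bivariate Bernoulli model
    f(y, z) = k * p1^y q1^(1-y) * p2^z q2^(1-z) * psi^(yz) * (1-psi)^(1-yz)."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    cells = {}
    for y in (0, 1):
        for z in (0, 1):
            cells[(y, z)] = (p1**y * q1**(1 - y) * p2**z * q2**(1 - z)
                             * psi**(y * z) * (1.0 - psi)**(1 - y * z))
    k = 1.0 / sum(cells.values())          # normalizing constant k(p1, p2, psi)
    return {yz: k * v for yz, v in cells.items()}

def draw_outcome(p1, p2, psi, rng=random):
    """Draw one (toxicity, progression) pair from the joint distribution."""
    cells = joint_pmf(p1, p2, psi)
    u, cum = rng.random(), 0.0
    for yz, prob in cells.items():
        cum += prob
        if u <= cum:
            return yz
    return (1, 1)

# Example: psi = 1/3 gives a negative association, as in the paper's simulations.
print(joint_pmf(p1=0.25, p2=0.30, psi=1/3))
```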


where p1 = probability of toxicity on dose x, p2 = probability of progression on dose x, qi = 1 − pi, i ∈ {1, 2}, ψ denotes the association between Y and Z, and k(.) is a normalizing constant. The parameter ψ, which is constant across all dose levels, has a simple interpretation: ψ/(1 − ψ) is the odds ratio between Y and Z. Therefore, Y and Z are independent if ψ = 1/2, while ψ > 1/2 indicates positive association and ψ < 1/2 indicates negative association. Furthermore, the range of ψ is independent of p1 and p2, unlike some models discussed by Prentice in which the association is restricted by p1 and p2 [12]. As a side note, the conditional density of Y given Z is Bernoulli with parameter

$$p_{y \mid z} = \frac{p_1 \psi^{z} (1 - \psi)^{(1 - z)}}{p_1 \psi^{z} (1 - \psi)^{(1 - z)} + (1 - p_1)(1 - \psi)}.$$

A similar statement can be made for the conditional distribution of Z given Y. See Arnold and Strauss for more details [13]. This property of the conditional distributions further promotes the biological plausibility of our model. In both subjects with toxicity and subjects without toxicity, progression has a Bernoulli distribution, although the probability of progression differs in the two groups of subjects. Likewise, in both subjects with progression and subjects without progression, toxicity has a Bernoulli distribution, although the probability of toxicity differs in the two groups of subjects.

Prior distribution for regression and association parameters

Once the joint distribution for both outcomes has been identified, the final decision concerns which prior distribution should be assumed for the vector of parameters θ in the joint distribution. Based upon the models we have selected, θ = (β1, β2, ψ). As with the choice of regression formulas h1(.) and h2(.), the selection of a prior distribution is guided mostly by preference and plausibility. The prior distribution should be sufficiently defined to offer direction to dose assignment of the first few subjects, but also be sufficiently vague so that the prior distribution's influence on dose assignment decreases as more and more subjects are enrolled. In the final section, we discuss a maximum-likelihood alternative to assigning a prior distribution. We assume a noninformative prior for θ. Specifically,

$$\pi(\theta) = 6\psi(1 - \psi)\exp\{-(\beta_1 + \beta_2)\}, \qquad \beta_1 > 0,\ \beta_2 > 0,\ 0 < \psi < 1$$

which places an exponential distribution with a mean of 1 on each regression parameter and a beta distribution with mean 1/2 on the association parameter. Note that β1, β2, and ψ are assumed to be marginally independent. Other joint priors for θ are certainly plausible, although the prior chosen should reflect the uncertainty the investigator has in the true dose effect on both toxicity and progression and the association between toxicity and progression.

After observing the results for a cohort of n subjects, there exists a vector of toxicity outcomes Yn = (Y1, Y2, ..., Yn) and a corresponding vector of progression outcomes Zn = (Z1, Z2, ..., Zn) that create a likelihood for θ, which we denote as L(θ; Yn, Zn). The likelihood L(θ; Yn, Zn) and prior distribution π(θ) are used to compute the posterior mean of θ via standard Bayesian methods. We then plug the posterior mean of θ into the functions h1(.) and h2(.) in order to create updated estimates p1j^n and p2j^n for the probabilities of toxicity and progression, respectively.
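To fix ideas, here is a sketch (ours, not the paper's code) of the unnormalized log posterior, log L(θ; Yn, Zn) + log π(θ), under Eqs. (1) and (2) and the joint distribution above. Its posterior mean could then be approximated numerically, for example with the Laplace approximation described next, by quadrature, or by Monte Carlo sampling.

```python
import math

def model_probs(x, b1, b2):
    """Model probabilities from Eqs. (1) and (2): logit(p1) = -3 + b1*x, logit(p2) = 3 - b2*x."""
    p1 = 1.0 / (1.0 + math.exp(-(-3.0 + b1 * x)))
    p2 = 1.0 / (1.0 + math.exp(-(3.0 - b2 * x)))
    return p1, p2

def log_posterior(b1, b2, psi, data):
    """Unnormalized log posterior log L(theta; Y, Z) + log pi(theta), where data is a
    list of (x, y, z) triples for the subjects observed so far."""
    if b1 <= 0 or b2 <= 0 or not (0 < psi < 1):
        return -math.inf
    # log prior: pi(theta) = 6 * psi * (1 - psi) * exp{-(b1 + b2)}
    lp = math.log(6.0) + math.log(psi) + math.log(1.0 - psi) - (b1 + b2)
    for x, y, z in data:
        p1, p2 = model_probs(x, b1, b2)
        cells = {}
        for yy in (0, 1):
            for zz in (0, 1):
                cells[(yy, zz)] = (p1**yy * (1 - p1)**(1 - yy)
                                   * p2**zz * (1 - p2)**(1 - zz)
                                   * psi**(yy * zz) * (1 - psi)**(1 - yy * zz))
        k = 1.0 / sum(cells.values())
        lp += math.log(k * cells[(y, z)])
    return lp
```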


We estimate the posterior mean of θ via methods proposed by Tierney and Kadane, which essentially use a Laplace approximation to the integrals that define the posterior mean of θ [14]. Alternatively, we could use standard Bayesian Monte Carlo methods to sample from the posterior distribution of θ and compute the posterior mean of θ. We use the integral approximation method because we found in simulations that it estimated the posterior means quite accurately, even in small samples, and was faster than Monte Carlo sampling.

Implementation of bCRM

Defining the MTD

A first cohort of c subjects enters the study on the dose the investigator thinks is most likely to be the MTD. Usually this is [J/2], where J is the number of doses and [.] is the nearest larger integer function. Once toxicity and progression outcomes are collected on all c subjects, the posterior mean of θ is computed, which is used to compute p1j^n and p2j^n. For these updated probabilities, we compute the Euclidean distance

$$d_j^n = \sum_{k=1}^{2} (p_{kj}^n - p_k^*)^2 \qquad (3)$$

for each dose j, in which p1* and p2* are the desired rates of toxicity and progression, respectively. The dose corresponding to the smallest value of dj^n is selected, and a new cohort of c subjects enters on that dose. Once those c subjects are evaluated, the experience of all n = 2c subjects is used to update the values of p1j^n and p2j^n. This procedure is repeated until K cohorts (n = Kc subjects) have been enrolled. The MTD is the dose determined by bCRM to minimize the metric (3) after all subjects in the trial are evaluated.

Selecting an appropriate metric

Note that there are alternate metrics to the Euclidean metric (3); the appropriate metric will vary mostly upon the goals of the study. First, we could generalize metric (3) to place more emphasis on toxicity or progression if an investigator deems one of the outcomes should have more influence on the definition of the MTD. We would then use the modified metric

$$d_j^n = \sum_{k=1}^{2} w_k (p_{kj}^n - p_k^*)^2 \qquad (4)$$

where 0 ≤ w1, w2 ≤ 1 and w1 + w2 = 1. Letting w1 = 0 or w2 = 0 would be equivalent to the univariate CRM design, whereas letting w1 = w2 = 0.5 would correspond to our current design. A simple example shows that serious thought should be made when selecting w1 and w2. Suppose we have three doses under consideration. For the three doses, the vectors of probabilities estimated for toxicity and disease progression are (0.10, 0.15, 0.22) and (0.50, 0.38, 0.20), respectively. The desired probabilities of toxicity and progression are 0.20 and 0.30, respectively. If we assume w1 = w2 = 0.5, then the second dose minimizes metric (4). Thus, the MTD is a dose whose probability of toxicity is below its threshold, but whose probability of progression is above its threshold. If instead w1 = 3/4, we find that the third dose minimizes metric (4), meaning the MTD is now a dose whose probability of toxicity is above its threshold and whose probability of progression is below its threshold. In fact, in this example, when w1 = 36/57, the second and third doses are equivalent candidates for the MTD. Any value of w1 < 36/57 will lead to defining the second dose as the MTD, while any value of w1 > 36/57 will lead to defining the third dose as the MTD. Therefore, we recommend using an unweighted metric if appropriate values for w1 and w2 are difficult to identify.
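The worked example above can be checked directly. This short Python snippet (ours) evaluates metric (4) for the three hypothetical doses at w1 = 0.5, 3/4, and 36/57.

```python
def weighted_metric(p_tox, p_prog, target_tox, target_prog, w1):
    """Metric (4): w1*(p1 - p1*)^2 + w2*(p2 - p2*)^2 with w2 = 1 - w1."""
    return w1 * (p_tox - target_tox) ** 2 + (1 - w1) * (p_prog - target_prog) ** 2

tox = [0.10, 0.15, 0.22]     # estimated toxicity probabilities, doses 1-3
prog = [0.50, 0.38, 0.20]    # estimated progression probabilities, doses 1-3

for w1 in (0.5, 0.75, 36 / 57):
    d = [weighted_metric(t, p, 0.20, 0.30, w1) for t, p in zip(tox, prog)]
    best = min(range(3), key=lambda j: d[j]) + 1
    # At w1 = 36/57, doses 2 and 3 tie (min() reports the first of the tied doses).
    print(f"w1 = {w1:.3f}: distances = {[round(x, 5) for x in d]}, dose {best} minimizes")
```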


The Euclidean metric assumes a dose whose toxicity probability is u units above the threshold p1* is equivalent to a dose whose toxicity probability is u units below the threshold. We could instead reject any dose j whose estimated toxicity probability is above the threshold. As a result, for any dose in which p1j^n > p1*, we would replace the value of (p1j^n − p1*)^2 in Eq. (3) with the value 1.0, the maximum distance between any two probabilities. Specifically, our metric would be

$$d_j^n = (p_{1j}^n - p_1^*)^2\, I[p_{1j}^n \le p_1^*] + \left(1 - I[p_{1j}^n \le p_1^*]\right) + (p_{2j}^n - p_2^*)^2 \qquad (5)$$

where I[p1j^n ≤ p1*] is an indicator function that p1j^n ≤ p1*. A similar metric could be constructed to reject any dose whose estimated probability of progression is above the desired threshold of progression. One drawback of Eq. (5) as a metric is that a dose whose toxicity probability is barely above the threshold p1* will be completely invalidated as the MTD, even if that dose's probability of disease progression is sufficiently below the threshold value p2*. In contrast, other studies may wish to consider a dose with a rate of toxicity marginally above the threshold p1* if that dose has a correspondingly low rate of disease progression. To address that concern, we modify the metric by multiplying the indicator function I[p1j^n > p1*] by a weight 0 < w < 1. However, as with the weights w1 and w2 used in Eq. (4), careful consideration should be made regarding the appropriate value of w.

Overriding the metric

It is obvious that an endless family of metrics can be used with the bCRM algorithm and further research is needed in this area. However, regardless of which metric is used, there are situations in which no dose should be selected and the metric alone cannot be used to define the MTD. For example, imagine that all J doses are too toxic, i.e., p1j^n > p1* for all j. The value of j that minimizes Eq. (3) corresponds to the dose whose estimated probability of disease progression is closest to the target level. However, since all doses are too toxic, no dose should be selected as the MTD. The same situation occurs if the roles of toxicity and disease progression are reversed. Therefore, we also monitor the observed rates of toxicity and disease progression and adjust our algorithm as follows.

After n subjects have been enrolled, we will have observed e1 toxic events. Based upon the values of n and e1, we can compute a one-sided 95% confidence interval for the overall rate of toxicity, assuming a binomial distribution for e1. If the lower bound of that confidence interval is greater than the target rate of toxicity, we will enroll the next cohort at the dose that minimized metric (3) only if that dose is lower than the current dose. Otherwise, we will only decrease the dose to the next lowest dose. Likewise, we can compute a 95% one-sided confidence interval for the overall rate of disease progression, based on observing e2 patients with disease progression. If the confidence interval's lower bound is above the target rate of disease progression, we will enroll the next cohort at the dose that minimized metric (3) only if it is higher than the current dose. Otherwise, we will escalate to the next highest dose. If both confidence intervals lie above the targeted rates, the study will terminate. Furthermore, if bCRM recommends increasing above dose J or decreasing below dose 1, the study will also terminate.

This study design is flexible enough to include other stopping rules if necessary. For example, if the observed rate of disease progression after two cohorts is above a certain threshold, the investigator may wish to terminate the trial. We could also add the practical constraint that dose escalation or deescalation be restricted to one level above or below the current dose. Such a procedure may be more acceptable to investigators who wish to avoid exposing an early cohort to a dose that later is determined to be overly toxic or ineffective.
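The paper does not state which interval construction is used for the override rule; the sketch below (ours) uses an exact one-sided binomial lower confidence bound, found by bisection, as one reasonable choice. The event counts in the example are hypothetical.

```python
from math import comb

def binom_sf(e, n, p):
    """P(X >= e) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(e, n + 1))

def lower_bound_95(e, n):
    """One-sided exact 95% lower confidence bound for the event rate: the largest
    p0 with P(X >= e | p0) <= 0.05, found by bisection on a monotone function."""
    if e == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if binom_sf(e, n, mid) <= 0.05:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical check: with 8 toxicities among 12 subjects and a target toxicity
# rate of 0.25, is the 95% lower bound above the target (triggering the override)?
lb = lower_bound_95(8, 12)
print(round(lb, 3), lb > 0.25)
```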


Operating characteristics

General simulations

We examine the operating characteristics of bCRM in a hypothetical study seeking to determine which of six possible doses is the MTD. The targeted rates of toxicity (aGVHD) and disease progression are 0.25 and 0.30, respectively. The investigator's prior estimates of toxicity and progression probabilities for the six doses are displayed in Table 1. As seen from the values displayed, the investigator believes dose 4 will be optimal. The study will begin by enrolling three patients on dose 3 and will continue to enroll additional cohorts of three subjects on the dose selected by bCRM until 30 total subjects have been enrolled. We choose a sample size of 30 patients as one that is reasonable both in terms of accuracy and feasibility. We reserve further discussion of sample size to the final section.

The performance of bCRM is dependent upon the true underlying probabilities of toxicity and disease progression. Thus, we examine the performance of bCRM under seven possible scenarios. In scenario 1, the MTD is dose 4, and the probabilities of toxicity and progression are correctly specified by the investigator. In scenario 2, the MTD is once again dose 4. However, the probabilities of toxicity and progression at the other doses fall steeply away from the target levels. Thus, in scenario 1, finding the MTD is more difficult than in scenario 2. In scenario 3, the investigator's prior expectations are incorrect, and the MTD is actually dose 3. In scenario 4, all doses are too toxic, and in scenario 5, all doses have excessive disease progression. Thus, in scenarios 4 and 5, bCRM should fail to find an MTD.

Table 1. Prior probability estimates of toxicity and disease progression supplied by investigator

Event                   Dose 1   Dose 2   Dose 3   Dose 4   Dose 5   Dose 6
Toxicity                  0.02     0.05     0.15     0.25     0.40     0.60
Disease progression       0.70     0.60     0.45     0.30     0.20     0.10


In scenario 6, all but the lowest dose meet the MTD definition, while in scenario 7, no strict MTD exists, although the three lowest doses have acceptable rates of toxicity and the three highest doses have acceptable rates of disease progression. For all seven scenarios, the true probabilities of toxicity and progression for each dose are displayed graphically in Fig. 1. The outcomes of each subject were simulated to be negatively associated with ψ = 1/3. We tried other values of ψ as well, but saw little change in our results.

Results for the seven scenarios are shown in Table 2. In scenarios 1 and 2, the MTD is dose 4, which bCRM correctly identified in 46.1% of simulations in scenario 1 and 61.6% of simulations in scenario 2. The higher success rate in scenario 2 was expected because the slopes of the dose-response curves around dose 4 in scenario 2 were steeper than those in scenario 1. More interestingly, bCRM incorrectly identified an MTD in 39.2% of simulations in scenario 1, which decreased to 18.1% in scenario 2. As seen from Fig. 1, the high rate of misidentification with bCRM is due to the close proximity of doses 3 and 4 in relation to the other doses. In fact, of the 392 simulations leading to an incorrect MTD in scenario 1, 240 of those had dose 3 labeled as the MTD.

Fig. 1. Seven scenarios with dose-response curves for probabilities of aGVHD and disease progression. Targeted rate of aGVHD is 25%. Targeted rate of progression is 30%.


Table 2. Operating characteristics of bCRM under seven possible scenarios

                                                                         Scenario
Parameter                                                                1      2      3      4      5      6      7
Trials where correct MTD found (%)                                    46.1   61.6   53.8    n/a    n/a   94.6    n/a
Trials where incorrect MTD found within one level of correct MTD (%)  36.3   15.8   32.8    n/a    n/a    1.2   73.6
Trials where incorrect MTD found beyond one level of correct MTD (%)   2.9    2.3    2.2    3.4   10.4    n/a   10.1
Trials where MTD not found (%)                                        14.7   20.3   11.2   96.6   89.6    4.2   16.3
Average number of aGVHD cases/trial                                    7.0    7.9    7.3    6.4    4.9    5.0    7.0
Average number of progressions/trial                                   9.4    9.6    8.6    5.9    7.4    7.3    8.7
Average number of subjects given dose other than MTD/trial            18.2   14.2   14.3    n/a    n/a    0.6    0.6
Average number of subjects/trial                                      27.6   27.2   28.3   12.6   14.2   29.3   27.6

All scenarios start enrollment on dose 3. Each estimate is based upon 1000 simulations.

This higher rate of misidentification also increased the average number of patients given a dose other than the MTD from 14.2 under scenario 2 to 18.2 under scenario 1.

Although bCRM is more likely to find the correct MTD when the dose-response curves are steeper around the true MTD, bCRM is also more likely to terminate early due to excess toxicity and disease progression. From Table 2, we see that 14.7% of simulations led to no MTD in scenario 1, which increased to 20.3% in scenario 2. Since enrollment begins with dose 3, we expect more disease progression than is desired. If two of the first three subjects experienced progression, we would escalate to dose 6 under both scenarios 1 and 2. It is likely that the next three patients enrolled would all experience aGVHD, while none would have progressive disease. As a result, the next cohort would be enrolled on dose 3. If we observed two subjects with progression and one subject with aGVHD, the total number of subjects with disease progression and aGVHD would be five and four, respectively. Based upon those figures, both confidence intervals would be above the targeted rates and we would terminate the study. Since disease progression on dose 3 is more likely in scenario 2 than scenario 1, early termination is more likely under scenario 2.

In scenario 3, we start enrollment on the MTD, which the investigator incorrectly believes to be dose 4. Dose 3 was correctly identified in 53.8% of the simulations, which is comparable to the rate seen under scenario 2. However, bCRM misidentified the MTD in 35.0% of simulations, which is between the misidentification rates under scenarios 1 and 2. Of those 350 simulations, doses 2 and 4 were selected in 184 and 144 simulations, respectively. Since we theorized that starting at the correct MTD led to the higher correct identification rate in scenario 3, we repeated simulations under scenario 3 with dose 4 as the starting dose. The MTD was correctly identified in 51.2% of the simulations, with a corresponding 38.0% misidentification rate. Thus, starting at a dose that is close to, but not equal to, the MTD has a marginal effect upon the probability of correctly identifying the MTD.


In scenario 4, we expect to terminate the study early due to excess toxicity, while in scenario 5, we expect early termination due to excess disease progression. bCRM correctly terminated the study in 96.6% of simulations under scenario 4 and 89.6% under scenario 5. The median enrollment in scenarios 4 and 5 was three cohorts (9 subjects) and four cohorts (12 subjects), respectively. We also ran simulations for a scenario when all doses were too toxic and led to excess disease progression. bCRM terminated the study early in all 1000 simulations, with a median enrollment of six patients.

In scenario 6, all but the lowest dose have tolerable rates of both toxicity and progression. Thus, not surprisingly, bCRM correctly identified an MTD in 946 of the 1000 simulations. More specifically, in those 946 simulations, doses 2, 3, and 4 were selected as the MTD in 307, 426, and 182 simulations, respectively. Doses 5 and 6 were rarely selected even though their rates of progression were lower than the rates of progression for doses 2–4. This scenario exemplifies a situation when a weighted metric might be employed in order to encourage the bCRM algorithm to select doses 5 and 6 more often.

In scenario 7, no dose fits the criterion of the MTD, although the lowest three doses have acceptable toxicity and the highest three doses have acceptable progression. Thus, the actual MTD lies somewhere between doses 3 and 4. Since the rate of toxicity for dose 3 and the rate of progression for dose 4 are both marginally above their respective thresholds, either dose might be an acceptable choice of MTD. The bCRM algorithm chose doses 3 and 4 in 73.6% of simulations and failed to find an MTD in another 16.3% of simulations. bCRM selected dose 3 more often than dose 4 (50.3% versus 23.3%), primarily because with an unweighted Euclidean metric, dose 3 is overall closer to the MTD than is dose 4.

In many phase I studies, investigators prefer to enter the first cohort on the least toxic dose. Thus, the simulations displayed in Table 2 were repeated but changed so that the first cohort was enrolled on the lowest dose. Since the lowest dose is also believed to be ineffective in terms of response, such a design implicitly states that for the first few patients there is a priority on maintaining a low rate of aGVHD rather than a low rate of disease progression. Results of these simulations are shown in Table 3.

Table 3. Operating characteristics of bCRM under seven possible scenarios

                                                                         Scenario
Parameter                                                                1      2      3      4      5      6      7
Trials where correct MTD found (%)                                    23.8   37.2   48.0    n/a    n/a   91.3    n/a
Trials where incorrect MTD found within one level of correct MTD (%)  34.2   15.7   33.5    n/a    n/a    6.2   64.0
Trials where incorrect MTD found beyond one level of correct MTD (%)   4.4    3.4    3.0    2.2    3.0    n/a   13.6
Trials where MTD not found (%)                                        37.6   43.7   15.5   97.8   97.0    2.5   22.4
Average number of aGVHD cases/trial                                    6.4    7.6    6.8    5.1    3.3    4.3    6.5
Average number of progressions/trial                                  10.0   10.3    9.2    5.0    6.4    7.5    9.3
Average number of subjects given dose other than MTD/trial            20.4   18.4   18.2    n/a    n/a    4.7    n/a
Average number of subjects/trial                                      26.4   24.6   27.3    9.8   10.6   29.5   26.4

All scenarios start enrollment on dose 1. Each estimate is based upon 1000 simulations.


By comparing Tables 2 and 3, we see that in the first two scenarios, starting at the lowest dose leads to increased rates of stopping early and finding no optimal dose. Such a result would be explained by the high observed rates of disease progression that would occur in the first few subjects. Conversely, the correct MTD was found less often in the first two scenarios when starting on the lowest dose. However, the rate of misidentification of the MTD was relatively unaffected. These conclusions also are reached in other scenarios. Overall, starting on the lowest dose will lead to conservative results unless the bCRM algorithm is modified to place less weight on the amount of disease progression observed in early cohorts.

Comparison to Gooley

We now investigate the ability of bCRM to determine an optimal dose of T cells to use in allogeneic bone marrow transplantation. There are 18 possible doses from which to choose, each with corresponding probabilities of GVHD and graft rejection in the recipient. The targeted rates of GVHD and rejection are 0.15 and 0.05, respectively. Gooley et al. discuss the motivation of this study, as well as three proposed study designs (denoted A, B, and C) which were examined in the context of that study [8]. Since design A was deemed inferior, we compare our results for bCRM to the results given by Gooley et al. for designs B and C in three scenarios: (1) 5 of 18 doses are acceptable, (2) 2 of 18 doses are acceptable, and (3) no doses are acceptable. The probabilities of GVHD and rejection in each scenario are graphically displayed in Fig. 2. In all simulations, we assume the probabilities of GVHD and rejection have been correctly identified by the investigator. We also assume that outcomes within each patient are independent (ψ = 1/2). Since we have 18 doses to examine, the bCRM trials are set to allow a maximum enrollment of 60 patients. All simulations start enrollment on dose 14.

A comparison of the three study designs is in Table 4. In the first scenario, doses 6 through 10 are all MTD candidates. Of the two proposed designs, B and C, design C performs the best, with the probability of identifying an MTD at 87.4% and a misidentification rate of only 0.1%. Although bCRM does not perform as well as design C in this scenario, bCRM identified the MTD correctly in 74.1% of the simulations, a rate slightly higher than that seen with design B. However, the misidentification rate of bCRM, although quite low at 6.1%, is higher than designs B and C. In comparison to design C, bCRM has slightly lower rates of GVHD and rejection, and bCRM is much more likely to terminate the study early (19.8% versus 12.5%). Thus, bCRM appears to be more conservative than designs B and C. The desire to keep observed rates of GVHD and rejection near the targeted rates causes bCRM to prematurely end the study if early cohorts have poor results. Although the average number of subjects is much higher in bCRM, this is due to the maximum enrollment being set at 60 subjects. We repeated simulations with a maximum enrollment of only 30 patients and found little change in the parameter estimates in Table 4. Specifically, with 30 subjects, an MTD was found correctly in 68.4% of simulations, found incorrectly in 14.5% of simulations, and no MTD was found in 17.1% of simulations. The average number of subjects per trial was 26.1. The slightly higher misidentification rate is likely due to the large number of doses under consideration and bCRM's inability to gain enough information after only 30 subjects.

In scenario 2, only doses 9 and 10 are MTD candidates; as a result, identifying the MTD is more difficult. However, our conclusions are similar to those reached under scenario 1.


Fig. 2. Three scenarios with dose-response curves for probabilities of GVHD and graft rejection. In each plot, doses within the box meet the criteria of MTD.

With regard to locating the MTD, design C is the best, while the performance of bCRM is slightly better than that of design B. Furthermore, the observed GVHD rate under bCRM is between the rates under designs B and C, and the observed rejection rate is lowest under bCRM. As with scenario 1, we ran simulations for scenario 2 with a maximum enrollment of 30 subjects. Once again, there was a sizable increase in the misidentification rate of the MTD. Specifically, an MTD was found correctly in 51% of simulations, found incorrectly in 24.8% of simulations, and no MTD was found in 24.2% of simulations.

The final scenario, in which no MTD exists, best demonstrates the advantage of a conservative study design such as bCRM. The rate of MTD misidentification is only 7.7%, far below either design B or C. Conversely, the rate of early termination seen with bCRM is 97.3%, far above the rates of the other designs. In comparison to designs B and C, bCRM has comparable rates of GVHD and rejection, and average enrollment is lower. Simulations with a maximum enrollment of 30 once again demonstrated a higher misidentification rate for bCRM than simulations with a maximum enrollment of 60 subjects. Most of this increased misidentification occurred at dose 11, whose rates of aGVHD and relapse are jointly closest in a Euclidean sense to the threshold rates of aGVHD and relapse.


Table 4. Values of parameters calculated for a hypothetical bone marrow transplant trial

                                            Scenario 1              Scenario 2              Scenario 3
Parameter                               bCRM      B      C      bCRM      B      C      bCRM      B      C
Trials where correct MTD found (%)      74.1   73.1   87.4      59.8   50.1   66.8       n/a    n/a    n/a
Trials where incorrect MTD found (%)     6.1    0.2    0.1      12.3   19.6   19.3       7.7   14.7   23.3
Trials where MTD not found (%)          19.8   26.7   12.5      27.9   22.7   14.9      92.3   85.3   76.7
Average % of aGVHD/trial                12.6   22.7   14.9      14.7   21.2   14.2      27.5   27.2   21.4
Average % of rejections/trial            4.0    6.2    5.9       5.3    6.8    7.2      13.9   13.3   17.2
Average number of subjects/trial        50.6   31.1   29.6      46.8   32.7   31.3      19.6   25.6   27.9

Scenario 1 has five possible choices of MTD, scenario 2 has two choices of MTD, and scenario 3 has no possible MTD. Results for bCRM are based upon 1000 simulations, while those of designs B and C are based upon 5000 simulations. bCRM enrolls a maximum of 60 subjects.

In summary, as with other study designs, bCRM tends to perform best when the dose-response curves slope steeply around the MTD. Furthermore, as compared to other designs, bCRM is more likely to terminate early when no MTD exists. However, bCRM tends to be conservative and may be outperformed by a specific design created for a specific study.

Discussion

Some investigators may be apprehensive in selecting a prior distribution for the dose-response parameters for fear that a poor selection of prior distribution may lead to aggressive and rapid dose escalation or deescalation before a sufficient number of subjects have been enrolled. Although O'Quigley and Shen created a maximum-likelihood alternative for the univariate CRM paradigm, extension of their methods to a bivariate design has a practical constraint: all parameters cannot be estimated until subject responses are fully heterogeneous [15]. Therefore, in our setting, we cannot compute parameter estimates until we observe at least one subject with each of the following bivariate outcomes: (1) no toxicity and no progression, (2) toxicity and no progression, (3) progression and no toxicity, and (4) both toxicity and progression. As a result, a bivariate maximum-likelihood approach will often be untenable since investigators may have to enroll a large number of subjects before observing heterogeneity in the subjects' outcomes.

Nonetheless, if sample size is not an issue, a maximum likelihood approach can be implemented. A cohort of c (two or three) subjects enrolls on a dose near the middle of all doses under consideration. Additional cohorts of c subjects are enrolled on the same dose until an event (toxicity or progression) is observed. If the event is a toxicity (progression), the next cohort of c subjects enrolls on the next lowest (highest) dose. If a subject has both toxicity and progression, then the results of the entire cohort should be used to judge whether the current dose should be escalated, deescalated, or held constant. This process of escalation and deescalation is continued until at least one subject is observed with each of the four possible combinations of toxicity and progression. Once heterogeneity has been reached, the joint likelihood specified earlier can be used to compute maximum-likelihood estimates (MLEs) of the regression and association parameters. The MLEs can then be used to update the probabilities of toxicity and progression for each dose and identify the MTD as specified.
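As a small illustration of the heterogeneity requirement (ours, not from the paper), the following check decides whether maximum-likelihood estimation can begin; the accumulating data in the example are hypothetical.

```python
from collections import Counter

def mle_estimable(outcomes):
    """True once every (toxicity, progression) combination -- (0,0), (1,0), (0,1),
    (1,1) -- has been observed at least once, so the joint likelihood can be
    maximized without a prior."""
    counts = Counter(outcomes)
    return all(counts[c] > 0 for c in [(0, 0), (1, 0), (0, 1), (1, 1)])

observed = [(0, 0), (0, 1), (0, 0), (1, 0)]      # (Y, Z) pairs observed so far
print(mle_estimable(observed))                   # False: (1, 1) not yet observed
print(mle_estimable(observed + [(1, 1)]))        # True: heterogeneity reached
```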


As an alternative to maximizing the joint likelihood, investigators could estimate the regression parameters via marginal likelihood models such as generalized estimating equations or alternating logistic regressions [16,17]. We are currently investigating the utility of joint and marginal likelihood model approaches in bivariate phase I designs.

Although in our simulations we selected 30 subjects as an adequate sample size, an optimal sample size can be derived more systematically through sensitivity analysis. For a given sample size, simulations can provide an estimate of the probability that bCRM selects the correct dose, as well as the variability of that estimate. Simulation studies can also be used to determine the smallest sample size needed to reach a desired probability of correctly locating the MTD. Conversely, we can define the optimal sample size as that which minimizes the probability of not finding an MTD.

The starting dose of the trial can influence the performance of bCRM. If the starting dose is close to the MTD, bCRM will perform very well; however, starting at a dose far from the MTD will increase the probability that the study will terminate early. At trial onset, if maintaining low toxicity is of greater concern than maintaining a high response rate, bCRM should begin on a dose near the beginning of the dose range. If maintaining a high response rate is of primary concern, then bCRM should begin on a dose near the end of the dose range. However, in either case, the bCRM algorithm should be modified so that early toxicities or nonresponses are given less weight than later events; otherwise, on average, the trial will stop early with no MTD found.

bCRM is useful when both toxicity and response occur in close temporal proximity to each other. However, in many situations, response may occur much later than toxicity, thereby increasing the length of time before all subjects have been fully observed and a dose can be modulated. We are examining a modification to bCRM in which we weight each subject's contribution to dose modulation based upon the amount of time each subject has been observed. Such a modification is comparable to the Time-to-Event CRM (TITE-CRM) [18]. As an alternative to the traditional CRM, TITE-CRM allows new subjects to be enrolled in a phase I study before all previously enrolled subjects have been fully observed. Extension of the TITE-CRM methodology to bCRM will accommodate situations in which responses occur significantly later than toxicities.

Acknowledgments

This research was supported by NIH Grant CA46592, in cooperation with the Bone Marrow Transplant Division of the University of Michigan Cooperative Cancer Center.

References

[1] O'Quigley J, Pepe M, Fisher L. A practical design for phase I clinical trials in cancer. Biometrics 1990;46:33–48.


[2] Storer BE. Design and analysis of phase I clinical trials. Biometrics 1989;45:925–937.
[3] Babb J, Rogatko A, Zacks S. Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat Med 1998;17:1103–1120.
[4] Gasparini M, Eisele J. A curve-free method for phase I clinical trials. Biometrics 2000;56:609–615.
[5] Leung DHY, Wang YG. Isotonic designs for phase I trials. Control Clin Trials 2001;22:126–138.
[6] Thall PF, Russell KE. A strategy for dose-finding and safety monitoring based on efficacy and adverse outcomes in phase I/II clinical trials. Biometrics 1998;54:251–264.
[7] Thall PF, Estey EH, Sung HG. A new statistical method for dose-finding based on efficacy and toxicity in early phase clinical trials. Investig New Drugs 1999;17:155–167.
[8] Gooley TA, Martin PJ, Fisher LD, Pettinger M. Simulation as a design tool for phase I/II clinical trials: an example from bone marrow transplantation. Control Clin Trials 1994;15:450–462.
[9] Legedza A. Bayesian approaches to the design of phase I clinical trials. Ph.D. Thesis. Harvard University, 1999.
[10] O'Quigley J, Chevret S. Methods for dose finding studies in cancer clinical trials: a review and results of a Monte Carlo study. Stat Med 1991;10:1647–1664.
[11] Prentice RL. Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. J Am Stat Assoc 1986;394:321–327.
[12] Prentice RL. Correlated binary regression with covariates specific to each binary observation. Biometrics 1988;44:1033–1048.
[13] Arnold BC, Strauss DJ. Bivariate distributions with conditionals in prescribed exponential families. J R Stat Soc Series B 1991;53:365–375.
[14] Tierney L, Kadane JB. Accurate approximations for posterior moments and marginal densities. J Am Stat Assoc 1986;81:82–86.
[15] O'Quigley J, Shen LZ. Continual reassessment method: a likelihood approach. Biometrics 1996;52:673–684.
[16] Zeger SL, Liang KY, Albert PS. Models for longitudinal data: a generalized estimating equation approach. Biometrics 1988;44:1049–1060.
[17] Carey V, Zeger SL, Diggle P. Modelling multivariate binary data with alternating logistic regressions. Biometrika 1993;80:517–526.
[18] Cheung YK, Chappell RJ. Sequential designs for phase I clinical trials with late-onset toxicities. Biometrics 2000;56:1177–1182.