Data Monitoring Committees and Interim Monitoring Guidelines

Boris Freidlin, PhD, Edward L. Korn, PhD, and Stephen L. George, PhD

From the Biometric Research Branch, National Cancer Institute, Bethesda, Maryland (B.F., E.L.K.), and the Division of Biometry, Duke University Medical Center, Durham, North Carolina (S.L.G.)

ABSTRACT: Most large randomized clinical trials have a data monitoring committee that periodically examines efficacy and safety results. A typical data monitoring committee meets every 6 months, but the interim monitoring guidelines for many trials specify formal analyses that are years apart. In this article we argue that study protocols should include monitoring guidelines with formal looks at each data monitoring committee meeting. Such guidelines are shown to reduce the average duration of a trial with negligible effect on power and estimation bias. Some of the common statistical monitoring guidelines require extreme evidence to stop a trial early and do not distinguish between stopping a trial during the active accrual and follow-up stages. We propose practical solutions for these issues. Control Clin Trials 1999;20:395–407 © Elsevier Science Inc. 1999

KEY WORDS: Bayesian methods, clinical trials, group sequential trials, interim analysis, stochastic curtailment

INTRODUCTION

It is common practice for large randomized clinical trials to have a Data Monitoring Committee (DMC) that periodically monitors interim outcome results in a confidential manner for the possibility of stopping early for (1) unexpectedly large treatment differences, or (2) lack of efficacy that suggests a positive result at the end of the trial would be extremely unlikely. The existence of data monitoring can be thought of as part of an implicit contract between the investigators and the patients that ensures that patients will not be entered in a trial in which it is clear that one of the treatment arms is inferior, nor in a trial that is unlikely to yield scientifically useful information. Because the decision to stop a trial involves many nonstatistical considerations and is in part subjective, DMCs have multidisciplinary representation, often including patient advocates and ethicists as well as physicians and statisticians.



As a consequence, the inclusion of interim monitoring guidelines in the protocol can be especially helpful to the DMC when deciding whether a trial should be stopped early, and specifically lessens the possibility that the DMC will recommend stopping a trial as soon as the p value crosses below 0.05 [1]. The undesirability of this latter monitoring procedure is well known; repeated significance testing of accruing data at the 0.05 level leads to vastly inflated type 1 errors [2]. Formal statistical methods of interim analysis that avoid inflated type 1 errors have been developed. These include sequential analysis [3, 4], group sequential designs and analyses [5–7] possibly utilizing spending functions [8], stochastic curtailment [9], repeated confidence interval approaches [10], and Bayesian methods [11]. Formal guidelines are not typically defined for toxicity and accrual monitoring, although the DMC also has a role to play in these issues [12].

Another important issue, often overlooked by formal statistical stopping rules, is that there are generally three types of early stopping. The first type is when a trial is stopped while patients are still being accrued. In this case stopping the trial means termination of accrual, a possible treatment change for the patients who are in the early phase of their treatments, and release of the study data (for publication). The second type refers to stopping a trial after accrual has been completed but while some patients are still receiving the randomized portion of their therapy. It may require a treatment change for the patients who are in the early phase of their treatments, in addition to release of the study data. The third type, stopping a trial in its follow-up stage, means only releasing the study data. The DMC might consider different approaches in these three situations: with type 3 stopping the complete follow-up data will still be collected and an additional analysis could be performed (as specified in the protocol), whereas with type 1 or type 2 stopping some of the data will be lost.

In this paper we address a problem that we have noted as part of our participation in DMCs for randomized trials sponsored by the National Cancer Institute. For each National Cancer Institute Cooperative Group, one DMC monitors all the trials led by that group, with the exception of some large trials or trials requiring special oversight, which may have their own DMCs [13]. The DMCs meet every 6 months, but the interim monitoring guidelines for many trials specify formal interim analyses that may be separated by years. This leads to two questions: (1) should DMCs evaluate relative treatment efficacy at every meeting, even when no formal analysis is planned? And, if they do, (2) what "reasonable" guidelines can accommodate frequent looks at the data? By "reasonable" we mean a guideline that does not require such strong evidence for stopping that a DMC member would feel uncomfortable continuing some trials that have not reached that level of evidence [14]. For example, we find unreasonable a guideline stating that a trial should continue unless the p value is less than 0.0001, which would imply that a trial with p = 0.0002 should continue. We address these two questions in turn.
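The type 1 error inflation from repeated 0.05-level testing noted above [2] is easy to verify directly. The following is a minimal simulation sketch (our illustration, not from the original article), using a normally distributed outcome with known variance; with 10 equally spaced looks the overall type 1 error comes out near 0.19 rather than 0.05.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_looks, n_per_look = 100_000, 10, 20

# Under H0 the per-look increments of the running sum are N(0, n_per_look).
inc = rng.normal(scale=np.sqrt(n_per_look), size=(n_trials, n_looks))
s = inc.cumsum(axis=1)                          # running sum at each look
n = n_per_look * np.arange(1, n_looks + 1)      # sample size at each look
z = s / np.sqrt(n)                              # z statistic at each look

# "Reject at the first look with two-sided p < 0.05" inflates the level.
inflated = (np.abs(z) > 1.96).any(axis=1).mean()
print(f"overall type 1 error with {n_looks} looks: {inflated:.3f}")  # roughly 0.19
```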


SHOULD DMCs EVALUATE RELATIVE TREATMENT EFFICACY AT EVERY MEETING?

A preliminary question is: Should anyone evaluate relative treatment efficacy as often as every 6 months? We would argue that after an initial time period when few events are expected, it is reasonable to monitor relative efficacy regularly, perhaps every 6 months. If something dramatic is happening in the data, then one would like to know sooner rather than later.

The arguments against looking at the data frequently are logistical and statistical. From a logistical point of view, if getting the data together and "cleaned up" for a relative-efficacy analysis is time consuming and expensive, then it may be hard to justify spending the resources required for many frequent looks. This would be especially true if the number of additional events expected in 6 months is small. In this case, one could argue for performing an efficacy analysis at time intervals longer than 6 months (e.g., yearly).

The other arguments against many looks at the data are statistical. To maintain the type 1 error, frequent looks require a lowering of the nominal p values for rejecting the null hypothesis, which results in a loss of power [15]. However, we find the loss of power (or alternatively, the increase in type 1 error) associated with fairly frequent looks to be acceptably small; see the example below. A second statistical argument concerns the gains expected from using frequent looks in terms of the number of patients treated; there are diminishing returns with increasing numbers of looks. For example, from Table 2 of Pocock [16], going from a fixed sample-size design to two looks reduces the expected number of patients by 28% under a specific alternative hypothesis (corresponding to 95% power). Going from two to three looks reduces the expected number of patients treated by 10%; going from three to five looks, by 7%; and so on. As we will demonstrate in the following example, however, consideration of the expected number of patients treated does not properly take into account the distribution of possible savings in the number of patients treated, nor the savings in time possible with more frequent looks.

Example

Consider a trial that accrues 600 patients uniformly over 4 years, with 2 years of additional follow-up. Consider the alternative hypothesis (HA) of median survival of 5 and 7.5 years in the control and experimental treatment arms, representing a hazard ratio of 1.5. Assuming exponentially distributed survival times, the trial has 90% power with a single analysis of the data at the end of 6 years using a log-rank test with one-sided type 1 error = 0.05 (a one-sided test is used here because it is assumed that there is no interest in differentiating between the experimental treatment being no better than the control versus worse than the control). This power calculation, and all stopping guidelines that follow, are based on asymptotic considerations. The actual levels and powers of the procedures (estimated by computer simulations) will only be approximately the same as the desired levels and powers. For example, the estimated power of this trial using a simulation of 100,000 datasets is 0.908 (simulation standard error = 0.0009).
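A minimal sketch of this kind of power simulation follows (our illustration, not the authors' code; the function and variable names are ours, and a Bernoulli 1:1 randomization is used rather than blocked randomization). The helpers defined here, simulate_trial, censored_data, and logrank_z, are reused in a later sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N, ACCRUAL, HORIZON = 600, 4.0, 6.0       # patients, accrual years, final analysis
MEDIAN_C, MEDIAN_E = 5.0, 7.5             # median survival under HA (years)

def simulate_trial(rng):
    arm = rng.integers(0, 2, N)                          # 1 = experimental
    entry = rng.uniform(0, ACCRUAL, N)                   # uniform accrual
    scale = np.where(arm == 1, MEDIAN_E, MEDIAN_C) / np.log(2)
    death = rng.exponential(scale)                       # exponential survival
    return entry, death, arm

def censored_data(entry, death, arm, at):
    """Observed (time, event, arm) for patients accrued by calendar time `at`."""
    on = entry < at
    follow = at - entry[on]
    event = (death[on] <= follow).astype(int)
    return np.minimum(death[on], follow), event, arm[on]

def logrank_z(time, event, arm):
    """One-sided log-rank z; positive when the experimental arm does better."""
    order = np.argsort(time)
    time, event, arm = time[order], event[order], arm[order]
    n_risk = len(time) - np.arange(len(time))     # total at risk at each time
    n_risk_e = arm[::-1].cumsum()[::-1]           # experimental arm at risk
    d = event == 1
    p = n_risk_e[d] / n_risk[d]                   # expected fraction experimental
    o_minus_e = (arm[d] - p).sum()                # observed minus expected deaths
    return -o_minus_e / np.sqrt((p * (1 - p)).sum())

n_sim, z_alpha = 10_000, 1.645                    # one-sided alpha = 0.05
hits = sum(logrank_z(*censored_data(*simulate_trial(rng), HORIZON)) > z_alpha
           for _ in range(n_sim))
print(f"simulated power ~ {hits / n_sim:.3f}")    # paper reports 0.908
```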


Table 1 displays two possible interim monitoring plans. In plan A, there are two interim looks (at 2 and 4 years), with a nominal p value for stopping of < 0.005 at each look. The nominal p value for rejection of the null hypothesis H0 at the final analysis is reduced from 0.05 to 0.045 to maintain the overall type 1 error at (approximately) 0.05. In plan B, there are eight interim looks, corresponding to analyses every 6 months starting at 2 years. The nominal p values for stopping are also p < 0.005; the final nominal p value for rejection is p < 0.040.

Table 1  Characteristics of Two Interim Monitoring Plans A and B for a Randomized Trial^a

                 Expected no.            Plan A^b                          Plan B^b
                 of events       ----------------------------     ----------------------------
Calendar                         Nominal p    Prob. of stopping   Nominal p    Prob. of stopping
time (y)         H0      HA      to stop^c    H0        HA        to stop^c    H0        HA
2.0               38      32     0.005        0.004     0.068     0.005        0.004     0.068
2.5               58      49     -            -         -         0.005        0.003     0.066
3.0               82      69     -            -         -         0.005        0.003     0.080
3.5              109      93     -            -         -         0.005        0.002     0.095
4.0              139     119     0.005        0.004     0.297     0.005        0.002     0.104
4.5              170     146     -            -         -         0.005        0.002     0.099
5.0              199     171     -            -         -         0.005        0.002     0.084
5.5              226     195     -            -         -         0.005        0.001     0.070
6.0              251     217     0.045        0.039     0.537     0.040        0.029     0.228
Total                                         0.050     0.902                  0.050     0.895

a  Trial accrues 600 patients over 4 years with 2 years of additional follow-up. Alternative hypothesis is exponential survival with median survival equal to 5.0 and 7.5 years for the control and experimental treatments. Probabilities were calculated from a simulation of 100,000 datasets.
b  Plan A: 2 interim looks; Plan B: 8 interim looks.
c  p values refer to one-sided hypothesis tests.

In the "Total" row of Table 1, it is seen that the power of the fixed sample-size trial is reduced (from 0.908) to 0.902 using plan A, and to 0.895 using plan B. We find the reduction from 0.902 to 0.895 due to the additional looks to be minor.

What about the advantages of using plan B over plan A? Under HA, the expected number of patients treated is 580 under plan A and 546 under plan B, only a 6% reduction, consistent with the point made earlier about diminishing returns in expected number of patients with a larger number of interim analyses. The relative advantage of plan B is better when considered in terms of the expected length of the trial. Under HA, the expected length is 5.1 years under plan A and 4.5 years under plan B, a 13% reduction corresponding to about 6 months. Furthermore, these average reductions tell only part of the story. Plan B will stop at a time that is not an interim monitoring time for plan A about 50% of the time. The distributions of the reductions in time and numbers of patients of plan B over plan A are given in Tables 2 and 3.
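Operating characteristics like those in Tables 1–3 can be estimated by applying plan B's boundary look-by-look to each simulated dataset. A sketch, reusing simulate_trial, censored_data, and logrank_z from the earlier sketch (again our illustration; the paper's own simulations used 100,000 datasets):

```python
from scipy.stats import norm
import numpy as np

looks = np.arange(2.0, 6.01, 0.5)                    # analyses every 6 months
z_bnd = np.append(np.full(8, norm.ppf(1 - 0.005)),   # p < 0.005 at interim looks
                  norm.ppf(1 - 0.040))               # p < 0.040 at the final look

def plan_b_stop_time(entry, death, arm):
    """Calendar time at which plan B rejects H0, or 6.0 if it never does."""
    for look, zb in zip(looks, z_bnd):
        t, e, a = censored_data(entry, death, arm, look)
        if e.sum() > 1 and 0 < a.sum() < len(a) and logrank_z(t, e, a) > zb:
            return look
    return looks[-1]

stops = [plan_b_stop_time(*simulate_trial(rng)) for _ in range(2_000)]
print(f"expected trial length under HA ~ {np.mean(stops):.2f} y")  # paper: 4.5
```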

Table 2  Distribution of Reduction in Time (Years) Under the Alternative Hypothesis by Using Plan B Instead of Plan A for the Interim Monitoring of the Trial Described in Table 1^a

Time      0     0.5    1.0    1.5    2.0    2.5    3.0    3.5
Percent   50    15     15     15     0      2      2      2

a  Distribution was calculated from a simulation of 100,000 datasets.


Table 3  Distribution of Number of Fewer Patients Treated Under the Alternative Hypothesis by Using Plan B Instead of Plan A for the Interim Monitoring of the Trial Described in Table 1^a

No. Patients   0     1–50    51–100   101–150   151–200   201–250
Percent        76    0       10       4         4         6

a  Distribution was calculated from a simulation of 100,000 datasets.

We see that the advantage of plan B over plan A in terms of time and patients can be considerable. It is important to remember that, in addition to the patients potentially treated in the trial, the reduction in time may translate into the entire patient population being treated with the superior therapy earlier. If the treatment actually works better than the hypothesized alternative used in designing the trial, the advantage of plan B would be even larger.

If one accepts that interim monitoring of relative treatment efficacy at 6-month intervals is worthwhile, then one alternative to presenting the analyses to the DMC is to have the study statistician perform the analysis and bring it to the attention of the DMC only if it requires action. In our view, this is unacceptable because (1) it places an undue burden on the study statistician, who will not have the range of expertise of the DMC, and (2) the study statistician has the appearance of a conflict of interest involving the study on which he or she has worked (for example, a study statistician may be reluctant to close a trial that he or she worked very hard to design and activate). Another alternative is to present relative treatment efficacy to the DMC every 6 months but without a "formal" analysis. For example, survival curves for the treatment arms could be shown without the p value associated with the difference in the curves. We find this to be unacceptable also, as the lack of p values (or confidence intervals) makes it hard to interpret the curves. We therefore believe it is advisable to include in the protocol a formal interim monitoring guideline that reflects frequent looks at the data. General proposals for doing this are given in the next section.

REASONABLE INTERIM MONITORING GUIDELINES WITH FREQUENT LOOKS

We begin with a discussion of upper boundaries for stopping early when the results from the experimental treatment arm are better than those from the control treatment arm. We will then discuss lower boundaries for stopping early when it appears highly likely that the experimental arm will not be proven better than the control arm. We do not explicitly consider symmetric boundaries; they can be constructed by simply taking the lower boundary to be symmetric around zero to the upper boundary, and using α/2 instead of α as the nominal significance level.

Generally, whether symmetric or asymmetric boundaries are appropriate for a trial depends on the nature of the therapies being evaluated. For a placebo-controlled trial, or a trial where the control arm is no intervention, asymmetric boundaries should be used, because one is interested only in establishing superiority of the experimental arm.


In situations where both arms involve active treatment, the decision is more complicated and involves considerations of toxicity, morbidity, cost, and logistics. If one of the treatments is more toxic, invasive, protracted, or much more expensive, asymmetric boundaries may be appropriate. On the other hand, if the two treatments are comparable in terms of the above factors, symmetric boundaries are often used.

The purpose of this section is not to recommend specific monitoring plans. Instead, we show that standard methods of constructing stopping boundaries can lead to unreasonable nominal p values required for stopping. Fortunately, as noted by Lan and DeMets [8] and Fleming et al. [17], one can truncate extreme p values to a less extreme value (e.g., 0.001) without materially affecting the power or level of the procedure. This is demonstrated by a computer simulation.

Upper Boundaries

Table 4 presents nominal p values for early stopping for 16 stopping plans, as well as the simulated rejection probabilities under the null hypothesis and the simulated rejection probabilities when the nominal p value at the last look is taken to be 0.05. For some of the plans (e.g., #3), the nominal p values in practice would be based on the number of observed events at the time of the interim analyses [24]. For our purposes, we have used the expected number of events under the alternative hypothesis to estimate the nominal p values. For plans with nominal p values less than 0.001 (e.g., #3), we have also considered the same plans with these low p values truncated to 0.001 (e.g., #4 is the truncated version of #3).

Plan 16 addresses a possible concern with frequent interim monitoring of the data: it may be difficult or resource intensive to "clean" the data and perform a formal analysis every 6 months. Although we believe this is becoming less of a concern with modern information systems, for some trials it may still be an issue. With plan 16, a p < 0.01 at an interim monitoring time induces a formal decision to stop at the next interim monitoring time if p < 0.005 at that time. This gives the study investigators 6 months to ensure that the data are reliable and up to date. Monitoring rules involving successive low p values have been suggested previously by Canner [25] and Falissard and Lellouch [26], although not specifically for the reason given here.

Many of the common monitoring plans have unreasonably low p values at early looks. We see in Table 4 that truncating the nominal p values at 0.001 only trivially affects the levels of the tests. We recommend always doing this type of truncation. We also note that using a 0.05 nominal p value at the final look does not lead to simulated levels that are much larger than 0.05. (An exception is plan 6, which does not attempt to control the level of the test.) This is important because many readers of reports of clinical trials will ignore the fact that interim monitoring has been done when interpreting the p value at the final look. In some situations it might be appropriate to use a more stringent stopping rule during the active accrual stage or before all patients complete the assigned therapy. For example, plan 15 requires p < 0.001 during the accrual stage and p < 0.005 during follow-up.
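The truncation device and plan 16's successive-p rule are simple enough to state as code. A minimal sketch (the function names are ours, added for illustration):

```python
def truncate_boundary(nominal_p, floor=0.001):
    """Raise any extreme early stopping p value up to `floor` [8, 17]."""
    return [max(p, floor) for p in nominal_p]

def plan16_should_stop(p_history):
    """Two successive low p values: stop when the previous look had p < 0.01
    and the current look confirms it with p < 0.005."""
    return (len(p_history) >= 2
            and p_history[-2] < 0.01
            and p_history[-1] < 0.005)

# e.g., plan 4 is obtained from plan 3 (O'Brien-Fleming):
# truncate_boundary([0.00000, 0.00004, 0.0004, 0.0026])
#   -> [0.001, 0.001, 0.001, 0.0026]
```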

Table 4  Nominal p Values Associated With 16 Interim Monitoring Plans^a

           Nominal p value to stop^c at calendar time (years)                                   Simulated rejection prob. under H0
Plan^b     2.0       2.5       3.0       3.5       4.0      4.5      5.0      5.5      6.0     As specified   0.05 at final look
 1         0.005     0.005     0.005     0.005     0.005    0.005    0.005    0.005    0.0401  0.0501         0.0593
 2         0.001     0.001     0.001     0.001     0.001    0.001    0.001    0.001    0.050   0.0515         0.0515
 3         0.00000   0.00004   0.0004    0.0026    0.0073   0.0140   0.0220   0.0296   0.0365  0.0498         0.0596
 4         0.001     0.001     0.001     0.0026    0.0073   0.0140   0.0220   0.0296   0.0365  0.0505         0.0602
 5         0.0025    0.0034    0.0041    0.0046    0.0052   0.0059   0.0067   0.0076   0.0427  0.0501         0.0568
 6         0.0020    0.0059    0.0107    0.0158    0.0202   0.0238   0.0265   0.0286   0.0302  0.0628         0.0748
 7         0.0027    0.0023    0.0022    0.0023    0.0027   0.0037   0.0057   0.0108   0.050   0.0544         0.0544
 8         0.00000   0.00000   0.00002   0.00025   0.0014   0.0047   0.0112   0.0220   0.050   0.0529         0.0529
 9         0.001     0.001     0.001     0.001     0.0014   0.0047   0.0112   0.0220   0.050   0.0538         0.0538
10         0.00000   0.00000   0.00001   0.00014   0.0009   0.0032   0.0081   0.0168   0.040   0.0420         0.0515
11         0.001     0.001     0.001     0.001     0.001    0.0032   0.0081   0.0168   0.040   0.0432         0.0526
12         0.00000   0.00000   0.00008   0.0009    0.0041   0.0110   0.0205   0.0325   0.045   0.0552         0.0586
13         0.001     0.001     0.001     0.001     0.0041   0.0110   0.0205   0.0325   0.045   0.0559         0.0594
14         0.0045    0.0038    0.0035    0.0035    0.0039   0.0049   0.0070   0.0123   0.049   0.0567         0.0574
15         0.001     0.001     0.001     0.001     0.005    0.005    0.005    0.005    0.0472  0.0504         0.0530
16         -         0.005/0.01 at each look from 2.5 through 5.5               0.050   0.0538         0.0538

a  For a randomized trial accruing 600 patients over 4 years with 2 years of additional follow-up. Simulated rejection probabilities are under the null hypothesis for the given plans and for modified plans where the nominal p value at year 6.0 is taken to be 0.05. Rejection probabilities were calculated from a simulation of 100,000 datasets.
b  Plans are (1) plan B of Table 1; (2) Haybittle-Peto [18, 19]; (3) O'Brien-Fleming [6, 20]; (4) O'Brien-Fleming truncated at 0.001 [20]; (5) Fleming-Harrington-O'Brien with conditional probability of rejection (π) at each of the first eight looks set equal to 0.0025 [17]; (6) Bayesian rule with skeptical prior representing a 0.05 probability (γ) of exceeding the alternative hypothesis, and with early stopping if the posterior probability of the treatment effect being < 0 is < 0.05 (ε) [11]; (7) Bayesian rule with noninformative prior, and with early stopping if the predictive probability of eventual rejection is > 0.99 [21]; (8) stochastic curtailment with α = 0.05 and with early stopping if the conditional probability of final rejection is > 0.8 (1 − γ) [9]; (9) same as (8), but truncated at 0.001; (10) stochastic curtailment with α = 0.04 and with early stopping if the conditional probability of final rejection is > 0.8; (11) same as (10), but truncated at 0.001; (12) Whitehead restricted procedure with maximum number of events = 216 [4, 22]; (13) same as (12), but truncated at 0.001; (14) sequential conditional probability ratio test with maximum conditional discordant probability (ρ) = 0.2 [23]; (15) p < 0.001 during active accrual stage, p < 0.005 during follow-up, p < 0.05 at final analysis; (16) two successive low p values required, p < 0.01 followed by p < 0.005, except at last look.
c  p values refer to one-sided hypothesis tests.


Table 5  Nominal p Values Associated With Three Interim Monitoring Plans for Stopping Early for Lack of Efficacy^a

                      Nominal p value associated with testing HA or H0 required to stop^d
Calendar              Plan 1^b     Plan 2^b     Plan 3^c
time (y)              HA           HA           HA             H0
2.0                   0.005        0.005        (<0.00001)     <0.00001
2.5                   -            0.005        (<0.00001)     0.0018
3.0                   -            0.005        (0.0002)       0.028
3.5                   -            0.005        (0.001)        0.145
4.0                   0.005        0.005        (0.005)        0.361
4.5                   -            0.005        (0.012)        0.585
5.0                   -            0.005        (0.026)        0.761
5.5                   -            0.005        (0.045)        0.873
Simulated power       0.904        0.901        0.903

a  For a randomized trial accruing 600 patients over 4 years with 2 years of additional follow-up. Simulated powers, calculated from a simulation of 100,000 datasets, are under the alternative hypothesis for the given plans with the nominal p value at year 6.0 taken to be 0.05.
b  Early stopping if the alternative hypothesis can be rejected at the p < 0.005 level.
c  Stochastic curtailment with α = 0.05 and with early stopping if the conditional probability of final rejection is < 0.2. This plan corresponds to early stopping if the null hypothesis can be rejected at the stated nominal level. p values in parentheses are for reference.
d  p values refer to one-sided hypothesis tests of either HA or H0. Low p values designate that the standard treatment arm is performing better than expected (under HA or H0, respectively) as compared to the experimental treatment arm.

Lower Boundaries

One can define many possible lower boundaries for stopping early for lack of efficacy. For our purposes, it is sufficient to restrict attention to the three monitoring plans given in Table 5. The first two plans are analogous to the plans given in Table 1. These plans have two and eight looks at the data, respectively, where at each look the trial is stopped if the p value for testing the alternative hypothesis is < 0.005 (low p values designate that the standard treatment arm is performing better than expected as compared to the experimental treatment arm). A similar approach was proposed by Fleming et al. [17]. The reduction in power due to the more frequent looks is negligible, from 0.904 to 0.901. The advantage of using the more frequent looks is that the trial may end earlier and with fewer patients when the null hypothesis is true (Tables 6 and 7).

Table 6  Distribution of Reduction in Time (Years) Under the Null Hypothesis by Using Plan 2 Instead of Plan 1 for the Interim Monitoring of the Trial Described in Table 5^a

Time      0     0.5    1.0    1.5    2.0    2.5    3.0    3.5
Percent   46    16     16     17     0      2      2      2

a  Distribution was calculated from a simulation of 100,000 datasets.


Table 7  Distribution of Number of Fewer Patients Treated Under the Null Hypothesis by Using Plan 2 Instead of Plan 1 for the Interim Monitoring of the Trial Described in Table 5^a

No. Patients   0     1–50    51–100   101–150   151–200   201–250
Percent        71    0       11       5         5         8

a  Distribution was calculated from a simulation of 100,000 datasets.
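Testing the alternative hypothesis, as plans 1 and 2 do, can be approximated with the usual normal theory for the log-rank statistic: under HA with hazard ratio 1.5, the expected value of the one-sided z is approximately log(1.5)·sqrt(d/4), where d is the number of events (Schoenfeld's approximation). A sketch of the conversion (an approximation we add for illustration; the paper does not give this calculation):

```python
from scipy.stats import norm
import numpy as np

def p_testing_ha(z_obs, n_events, hr_alt=1.5):
    """One-sided p value for testing HA given the observed log-rank z.
    Small values mean the control arm is doing better than HA predicts."""
    theta = np.log(hr_alt) * np.sqrt(n_events / 4.0)  # approx E[Z] under HA
    return norm.cdf(z_obs - theta)

# At the year-4 look (~119 events under HA), theta is about 2.21, so stopping
# with p_HA < 0.005 requires roughly z_obs < 2.21 - 2.58, i.e., below -0.37.
```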

Plan 3 in Table 5 is based on stochastic curtailment: if the conditional probability that the trial eventually rejects the null hypothesis, given that the alternative hypothesis is true, is less than 0.2, then the trial is stopped. In Table 5, this rule is converted into nominal p values for testing H0, and approximate nominal p values for testing HA (in parentheses), with small p values representing superiority of the control arm. We see that some of these p values are unreasonably small. In particular, at the first look, even a highly statistically significant advantage of the standard treatment over the experimental treatment (e.g., p = 0.0001) would not result in stopping the trial. At the third look, a rejection of the alternative hypothesis at, say, p = 0.0005 would also not lead to stopping the trial. One can truncate either the HA or the H0 p values in plan 3 to 0.001 and only trivially affect the power of the trial; either truncation reduces the power from 0.903 to 0.902.

For some trials the reasons for stopping early for lack of efficacy may not be as compelling as for positive results. For example, if the experimental treatment is not more toxic than the control treatment, then one is not reducing harm to the patients by stopping a trial because it appears that the experimental treatment will not produce better survival than the standard treatment. Additionally, a possible argument against truncating p values associated with testing HA is that the choice of HA used to design the trial is somewhat arbitrary: why should one stop a trial simply because HA can be rejected with some small p value? However, we find this argument less than convincing. If the study investigators have chosen HA to be the minimally clinically interesting treatment difference, then ruling out a difference at least this large with high probability would appear to be sufficient evidence to stop the trial.

A more aggressive rule for early stopping might be appropriate in some situations (e.g., when the experimental treatment is considerably more toxic than the control). In addition, one can argue that this generally should be true when the type 2 error of the trial is designated to be higher than the type 1 error (e.g., 0.1 versus 0.05). A more aggressive early stopping rule can be achieved by using a higher nominal stopping p value (e.g., 0.01 instead of 0.005) in plan 2 of Table 5. The reduction in power due to using 0.01 instead of 0.005 is negligible: from 0.901 to 0.893.

Specification of an upper boundary has become an almost universally accepted practice in trial design. At the same time, there does not seem to be a consensus on the need for a lower boundary. Although early stopping for lack of efficacy might not have as clear an ethical motivation as the upper boundary, we believe that in most cases it is desirable. In any event, the study investigators should make their position known on this issue by including in the protocol a detailed monitoring plan for stopping for lack of efficacy as well as for positive results, or by explaining why one is not needed.
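The conditional probability underlying plan 3 has a simple closed form under the standard Brownian motion approximation of the group sequential setting [9]. A sketch (our notation, added for illustration; t is the information fraction and theta the expected final z under HA):

```python
from scipy.stats import norm
import numpy as np

def conditional_power(z_obs, t, theta, alpha=0.05):
    """P(final one-sided rejection at level alpha | interim z at information
    fraction t), computed assuming the HA drift theta = E[final Z]."""
    b = z_obs * np.sqrt(t)                  # Brownian motion value B(t)
    mean_final = b + theta * (1.0 - t)      # E[B(1) | B(t)] under HA
    return norm.sf((norm.ppf(1 - alpha) - mean_final) / np.sqrt(1.0 - t))

theta = np.log(1.5) * np.sqrt(217 / 4.0)    # about 2.98 for the Table 1 trial
t = 119 / 217                               # year-4 look (events under HA)
# Solving conditional_power(z, t, theta) = 0.2 gives z of about -0.36, a
# nominal p of roughly 0.36 for testing H0, close to the year-4 entry (0.361)
# for plan 3 in Table 5.
```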


Example

A recent study coordinated by the Cancer and Leukemia Group B (CALGB) compared sequential administration of doxorubicin plus cyclophosphamide and paclitaxel with doxorubicin plus cyclophosphamide alone as adjuvant chemotherapy for patients with node-positive breast cancer [27]. This was a prospective, open-label, randomized phase III clinical trial with a 3 × 2 factorial design. Patients were randomized to one of three dose levels of doxorubicin (60, 75, or 90 mg/m2) in combination with cyclophosphamide, followed by either paclitaxel (175 mg/m2) therapy or no further chemotherapy. The primary outcome measure of the study was disease-free survival; secondary measures included overall survival and toxicity. The primary statistical objectives were to test for significant main effects (doxorubicin, paclitaxel) and for their interaction (doxorubicin × paclitaxel). The study was designed to enroll 3000 patients, with a final analysis planned 4 years after the end of accrual, at which time 1800 events were expected to have occurred. The trial actually randomized 3170 women to the available treatments.

The study was monitored by a standing DMC that reviews all phase III trials coordinated by the CALGB. This DMC meets twice a year. Formal interim analyses were planned after 450, 900, and 1350 events were observed, using an O'Brien-Fleming boundary. This corresponds to using two-sided p values of 0.000007, 0.0015, 0.0092, and 0.0440 at the information times 0.25, 0.5, 0.75, and 1, respectively. At the first interim analysis the observed p value was 0.0077, considerably higher than that required to stop the study (0.000007). Nevertheless, the DMC decided to release the study data at the time of this initial analysis for a number of reasons. First, a variety of calculations and simulations suggested strongly that the results were unlikely to change with further follow-up. For example, Bayesian predictive distributions were used to simulate the distribution of possible results at the time of the "final" analysis (i.e., after 1800 failures). These analyses suggested a very high probability of reaching a conventionally significant result at the final analysis. Second, at the time of this analysis, accrual was complete and all patients had completed their assigned protocol-specified treatments. Thus, the implication of the data release was less critical than if the trial were still accruing patients, because the full amount of information will be obtained eventually. Third, the initial O'Brien-Fleming p value was considered excessively conservative at this point because all patients had completed therapy. Other boundaries (such as a simple Bonferroni adjustment for four comparisons) would have been reached. And because the DMC was not consulted before this trial was initiated, the choice of one particular rule over another seemed quite arbitrary. All of these reasons were weighed against the known risks of releasing the results early, with the eventual decision made as noted above.

This example demonstrates that monitoring plans with extremely low p values at early looks can cause difficulties of interpretation for a DMC. The early conservatism of these rules is based on mathematical properties of the underlying models and does not reflect the full complexity of a clinical trial.
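The predictive calculation referred to here has a compact form in the normal approximation. With a noninformative prior on the drift, the predictive probability of rejection at the final analysis, given an interim statistic Z_t at information fraction t, is (following Spiegelhalter et al. [21]; our rendering)

```latex
\Pr\left(Z_1 > z_{1-\alpha} \,\middle|\, Z_t\right)
  \;=\; \Phi\!\left(\left(\frac{Z_t}{\sqrt{t}} - z_{1-\alpha}\right)
        \sqrt{\frac{t}{1-t}}\,\right).
```

For the first CALGB look (t = 0.25 and two-sided p = 0.0077, so Z_t is about 2.66, with z at 1.96 for a two-sided 0.05 criterion), this gives a predictive probability of roughly 0.97, consistent with the committee's assessment of a very high probability of eventual significance.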
DISCUSSION

We have demonstrated that in terms of level and power there is little reason not to monitor relative treatment efficacy frequently, and frequent monitoring offers advantages in being able to end some trials earlier.


But there are some potential disadvantages to using a frequent monitoring plan. One is the positive bias associated with estimators based on trials that allow early stopping for positive results. With more potential for stopping, the bias is larger. We believe the additional bias is small. For example, considering the trial simulations described in Table 1, the medians of the estimated hazard ratios over 100,000 datasets were 1.50 for a single analysis at year 6, 1.55 for plan A, and 1.58 for plan B.

With survival-type endpoints, another potential problem with stopping trials earlier is that the early experience with little follow-up may not be reflective of the complete survival curves. For example, a new treatment may be very toxic, leading to a few early deaths, but may also have much better long-term results than the standard treatment. Taken as a whole, the new treatment may be viewed as better than the standard, but an early look at the data may suggest stopping for lack of efficacy. On the other hand, it is possible for early indications of treatment efficacy to decline over time [28]. This problem can be lessened by starting the formal interim monitoring at a later time if one suspects that the survival hazards associated with the treatments cross. For example, rather than starting the monitoring at year 2 in Table 1, one could start at year 3 or later. Even if it happens that one erroneously stops a trial too early, the error can be partially corrected later: the complete survival curves will become apparent as the patients already enrolled in the trial continue to be followed. Although this is obviously not an ideal situation, it is better than continuing a trial with an early large and highly significant treatment difference that is later maintained with further follow-up.

A final disadvantage of early stopping is that the results of a trial with limited data, regardless of the p value, may not be viewed as convincing by the clinical community. The ultimate goal of a randomized phase III trial is to be sufficiently compelling to change medical practice. Although we believe the primary responsibility of the DMC is to protect the patients in the trial, there is an implicit contract between the study investigators and the trial participants that the results of the trial will be scientifically useful. If few clinicians believe or act on the results of the trial, then this contract has not been fulfilled. The DMC needs to weigh these issues when considering early stopping; this is one of the reasons monitoring guidelines are not monitoring rules [29]. One factor that should be considered by the DMC in its deliberations is whether other ongoing trials address the same clinical question, and whether a confirmatory trial will be mounted if the current trial is stopped early. In either of these situations, the DMC may be more willing to stop a trial with very limited data.

It is useful for monitoring plans to be simple and easily understood by all DMC members. This helps to avoid the possibility of undue influence of the study statistician or the DMC statisticians on the DMC deliberations. We thus find the stopping plans that use a fixed p value for each of the interim analyses (e.g., plans 1, 2, and 15) especially appealing, as they avoid the uncertainty associated with estimating the information fraction and do not involve any sophisticated statistical tools.

With frequent monitoring of efficacy data, we believe it is important that a reasonable monitoring plan be described in the trial protocol. This is perhaps the best mechanism by which the study investigators give their input to the manner in which the trial will be monitored.


They may be the most knowledgeable about when the appropriate time is to begin formal efficacy monitoring for positive results, and about the appropriate trade-offs between the expected toxicities of the treatments and early stopping for lack of efficacy. The DMC, an independent body free from conflicts of interest, is in the best position to monitor the trial once it begins. Providing it with a reasonable monitoring guideline will improve the quality of the monitoring.

The authors would like to thank the referees and associate editor for their helpful comments. We would also like to thank Dr. Ming Tan for use of his SCPRT program.

REFERENCES

1. Simon R. Some practical aspects of the interim monitoring of clinical trials. Stat Med 1994;13:1401–1409.
2. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J Roy Stat Soc 1969;A132:235–244.
3. Armitage P. Sequential Medical Trials, 2nd ed. New York: Wiley; 1975.
4. Whitehead J. Sequential methods based on the boundaries approach for the clinical comparison of survival times. Stat Med 1994;13:1357–1368.
5. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika 1977;64:191–200.
6. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979;35:549–556.
7. Ellenberg SS, Eisenberger MA. An efficient design for phase III studies of combination chemotherapies. Cancer Treat Rep 1985;69:1147–1152.
8. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika 1983;70:659–663.
9. Lan KKG, Simon R, Halperin M. Stochastically curtailed tests in long-term clinical trials. Sequential Analysis 1982;1:207–219.
10. Jennison C, Turnbull BW. Repeated confidence intervals for group sequential clinical trials. Controlled Clin Trials 1984;5:33–45.
11. Freedman LS, Spiegelhalter DJ, Parmar MKB. The what, why and how of Bayesian clinical trials monitoring. Stat Med 1994;13:1371–1383.
12. Korn EL, Simon R. Data monitoring committees and problems of lower-than-expected accrual or event rates. Controlled Clin Trials 1996;17:526–535.
13. Smith MA, Ungerleider RS, Korn EL, et al. Role of independent data-monitoring committees in randomized clinical trials sponsored by the National Cancer Institute. J Clin Oncol 1997;15:2736–2743.
14. Canner PL. Monitoring long-term clinical trials for beneficial and adverse treatment effects. Communications in Statistics, Theory and Methods 1984;13:2369–2394.
15. McPherson K. On choosing the number of interim analyses in clinical trials. Stat Med 1982;1:25–36.
16. Pocock SJ. Interim analyses for randomized clinical trials: the group sequential approach. Biometrics 1982;38:153–162.
17. Fleming TR, Harrington DP, O'Brien PC. Designs for group sequential tests. Controlled Clin Trials 1984;5:348–361.
18. Haybittle JL. Repeated assessment of results in clinical trials of cancer treatment. Br J Radiol 1971;44:793–797.
19. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient: introduction. Br J Cancer 1976;34:585–612.


20. Reboussin DM, DeMets DL, Kim K, et al. Programs for computing group sequential boundaries using the Lan-DeMets method. Technical Report, Department of Statistics, University of Wisconsin; 1996.
21. Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: conditional or predictive power? Controlled Clin Trials 1986;7:8–17.
22. Brunier H, Whitehead J. PEST 3.0 Operating Manual. University of Reading, U.K.; 1993.
23. Tan M, Xiong X, Kutner MH. Clinical trial designs based on sequential conditional probability ratio tests and reverse stochastic curtailing. Biometrics 1998;54:682–695.
24. Lan KKG, DeMets DL. Group sequential procedures: calendar versus information time. Stat Med 1989;8:1191–1198.
25. Canner PL. Monitoring treatment differences in long-term clinical trials. Biometrics 1977;33:603–615.
26. Falissard B, Lellouch J. A new procedure for group sequential analysis in clinical trials. Biometrics 1992;48:373–388.
27. Henderson IC, Berry D, Demetri G, et al. Improved disease-free and overall survival from the addition of sequential paclitaxel but not from the escalation of doxorubicin dose level in the adjuvant chemotherapy of patients with node-positive primary breast cancer. ASCO Proceedings 1998;17:101a.
28. Sylvester R, Bartelink H, Rubens R. A reversal of fortune: practical problems in the monitoring and interpretation of an EORTC breast cancer trial. Stat Med 1994;13:1329–1335.
29. DeMets DL. Stopping guidelines vs. stopping rules: a practitioner's point of view. Communications in Statistics, Theory and Methods 1984;13:2395–2417.