Overrunning and Underrunning in Sequential Clinical Trials John Whitehead, PhD Department of Applied Statistics, University of Reading, United Kingdom
ABSTRACT: In a sequential (or group sequential) clinical trial the fulfillment of some prespecified stopping rule will cause recruitment to be terminated. However, for various reasons, data on patients treated according to protocol may continue to accumulate for some time afterward. This phenomenon is called overrunning. On the other hand, a sequential clinical trial might be abandoned before the stopping rule has been fulfilled. This is called underrunning. In both of these situations the procedure for validly analyzing the study is unclear. In this paper we first review the arguments for sequential methodology and the practical way in which it can be integrated with normal clinical trial conduct. Turning next to the special situations of overrunning and underrunning, the conditions under which a valid analysis is possible are identified, and a method of analysis based on the frequentist philosophy is presented. The likelihood of first gaining and then losing significance due to overrunning and its consequences are examined. Examples based on experience with real studies are presented.
KEY WORDS: Clinical trials, delayed observations, sequential methods, stopping rules.
INTRODUCTION
Consider a clinical comparison of two medical treatments, conducted according to some sequential (or group sequential) design. The details of how and why a trial might be conducted sequentially are presented in the following section. The sequential design is chosen to allow early termination if the treatment difference is seen to be substantial at an interim analysis. The design may also allow early termination when it becomes evident that no such difference will be observed. Suppose that the stopping criterion for the study is reached but that data on patients treated according to protocol continue to accumulate for some time afterward. This is called overrunning. Conversely, suppose that the trial is forced to terminate for reasons not entirely under the control of the investigators, and before the stopping criterion has been reached. This is called underrunning. Many authors have described designs and corresponding analyses which are valid for sequential trials that stop on schedule [1-5]. Little has been written on the appropriate analysis when such
Address reprint requests to: John Whitehead, PhD, University of Reading, Department of Applied Statistics, Whiteknights, PO Box 217, Reading RG6 2AN, United Kingdom. Received May 4, 1990; revised August 8, 1991.
0197-2456/92/$5.00
Controlled Clinical Trials 13:106-121 (1992) © Elsevier Science Publishing Co., Inc. 1992 655 Avenue of the Americas, New York, New York 10010
trials overrun or underrun, and the possibility of such behavior sometimes inhibits the use of sequential methods when they would otherwise be appropriate. In this paper a strategy for analyzing trials under these circumstances will be proposed and evaluated. It should be remarked that sequential investigations based on Bayesian methods [6] or on repeated confidence intervals [7] do not suffer any problems in the face of overrunning and underrunning, although these methods are controversial for other reasons (see the correspondence on Bayesian methods [8] and the discussion published with [7]). Neither does a stopping rule alter the likelihood function of parameters based on the final observed set of data: Maximum likelihood estimates are unchanged, although their frequentist properties such as approximate unbiasedness are affected. Here we shall work within the traditional frequentist context of significance levels and single confidence intervals calculated at the end of the study. Overrunning can occur for a variety of reasons. Two of the most likely of these will now be described.
Delayed Observations
Many accounts of sequential clinical trials stress the requirement that patient responses be quickly available after randomization. Often this is not possible. Examples of delayed observations include coma rating 6 months after treatment for head injury and activity score 90 days after treatment for stroke. In each case a rapid response is not possible, and yet there remains a strong ethical motivation to react to efficacy differences emerging from early data. If a sequential procedure is used, then realization of the stopping criterion can be a signal to stop recruitment. However, data from those patients already recruited but not yet evaluated will continue to be received. In a trial of a treatment that is administered continuously from randomization to time of assessment, it may be desirable to discontinue use of the inferior treatment on trial patients as soon as the stopping criterion is reached. This will lead to unmasking of the study and a change from protocol treatment conditions. In such a case no delayed observations should be included in the analysis. There is then no problem of overrunning and the methods of this paper are inappropriate. On the other hand, the treatment might be administered only once or just for a short period following randomization (e.g., a surgical procedure, a radiosensitizing drug, or an immunosuppressive used prior to transplantation surgery). Under such circumstances blindness need not be broken after the stopping criterion is reached as the treatment given is irreversible, and the delayed observations are comparable with those originally collected. They should be included in the analysis.
Survival Data
A particular form of delayed observation is time from recruitment to death. In this case the delay is a random variable, and so it is not possible to wait for a fixed time after stopping recruitment and be sure of collecting all of the remaining data.
However, unless analysis follows cessation of recruitment immediately, extra information will be available by the time of analysis. If
this is collected according to protocol, then it should be included in the analysis. There are of course more prosaic reasons for small amounts of overrunning, mostly to do with various forms of administrative delay. Often commitment to an "intention-to-treat" analysis will force the inclusion of such late data. In the analyses which follow, it will be assumed that the amount of delayed information to be included is independent of the observed difference in efficacy between treatments. This rules out waiting a long time after recruitment ceases if the outcome is not what the investigator desires, but analyzing immediately if things look good! Underrunning also has two particularly likely causes.
Abandonment of the Study
The study may be abandoned for external reasons such as a policy change or publication of relevant results from another center, or internal reasons such as lack of recruitment. Provided that the premature termination is independent of the observed efficacy difference, the methods to be described will be appropriate. If the study is abandoned due to side effects of treatment, then our analysis may not be valid as side effects could be correlated with the efficacy responses.
Follow-up Studies
Suppose that patients are recruited quickly and then followed up for several years. Interim analyses of the accumulating data, and if necessary early stopping, would be unlikely to curtail recruitment to the trial. However, patients undergoing long-term treatment could be taken off the inferior therapy as soon as the stopping criterion was reached, and trial results could be published early, speeding up the flow of benefits from the trial and the pace of research. Whether a follow-up study does reach the stopping criterion will depend on whether the event rate is as anticipated. It is possible, when the event rate has been underestimated, for all of the recruited patients to respond without the stopping criterion being satisfied. This would be a form of underrunning.
The analyses proposed in this paper will be extensions of the analyses suggested by Tsiatis et al. [4] for use after a properly stopped sequential trial. Significance levels, confidence limits, and median unbiased estimates for treatment effect will be supplied. The applications studied in detail will be to the likelihood-based methods that are incorporated within the software package PEST2 [9]. These include the triangular test, truncated sequential probability ratio test, and the restricted procedure [3]. The methods described here can be implemented using the PEST2 package (see sections 2.5 and 3.5 of [9]). The same principles of analysis could be equally well applied to the methods described in [2] and [5], although suitable software is not as readily available.
Sequential methods in clinical trials have been developed from theory that was first applied in quality control. In the latter setting a rigid link existed between crossing an upper boundary and rejecting a batch of manufactured items as faulty, or crossing a lower boundary and accepting the batch as satisfactory. This formulation has been carried over to sequential methods for
clinical trials, and many authors have equated crossing certain boundaries with the acceptance or rejection of a null hypothesis of treatment equivalence. The approach taken here breaks that link. The stopping boundaries establish when the trial should be terminated. Then the data analysis takes place, significance is established, and estimation performed in a way that recognizes the stopping rule used. As well as being appropriate to the inferential objectives of clinical research, this approach allows the analyses described here, in which the sequential rule has only been partially obeyed. The designs in the examples of this paper are described in chapter 4 of [3] and in [9]. They allow maximum flexibility: the timings of interim analyses or even their number do not have to be specified in advance. All that is laid down is the rule to determine stopping when the interim analysis is made. The method of [5] allows the same degree of flexibility. This paper shows how trials can be analyzed when the stopping rules have been overridden by a certain class of eventualities, widening the scope of the methodology further.
IMPLEMENTATION OF STOPPING RULES IN CLINICAL TRIALS
Before beginning the development of technical procedures for overrunning and underrunning, the practical implementation of formal stopping rules will be discussed. Such rules should be seen as a tool to be used in the conduct of a clinical trial rather than as a complete recipe for monitoring. Any clinical trial will be monitored throughout its duration; often it is a legal obligation at least to report deaths and serious adverse events. In a major trial concerning a life-threatening condition, there is likely to be a Safety-Monitoring Committee, with responsibility for reviewing all reports of deaths and serious adverse events, and the power to terminate the trial if they believe this to be necessary.
In this section the people responsible for the monitoring of the trial, whether they are an independent Safety-Monitoring Committee or the investigators themselves, will be referred to as "the Committee." Sometimes the intention of the Committee will go beyond stopping for safety reasons; they may also want to stop if it seems that a treatment effect is unlikely to be discovered. Although this can be seen as being in the interests of economy, it also has the ethical advantage of diverting resources into more promising studies. If interim analyses are conducted, and these might lead to stopping, then the basis for a conventional frequentist analysis is invalidated. Conventionally calculated significance levels, confidence intervals, and unbiased estimates lose their frequentist properties. This is because such an analysis is founded on infinite repetitions of a fixed sample design with sample size equal to the number of patients observed rather than on repetitions of the sequential design (implicit or explicit) which was actually used. The author and his colleagues at Reading University have been involved in the design and conduct of several sequential clinical trials (many are ongoing, but two published examples are described in [10-12] and [13]), and we know of others that have used similar principles (e.g., [14] and [15]). The procedure has always started by ascertaining the conditions under which the Committee would wish to stop the trial. These are then translated into a design with the required properties. Several iterations may be necessary. The
formal rule will only attend to one of the many considerations of the Committee. This will be the most likely cause of stopping, and a cause of stopping after which accurate analysis is desirable. As an example to the contrary, in a trial concerning a mild condition, no formal stopping rule for mortality is needed. Excess mortality on the experimental arm is an unlikely event, and if it occurs no fine analysis is required; the experimental treatment will be abandoned. During the trial, the Committee will meet regularly. They will be presented with many features of the trial, including the results of the formal sequential procedure (perhaps disguised or even only described as "continuing" in order to preserve impartiality). The formal procedures might be updated more often than the holding of Committee meetings: crossing a boundary might trigger the calling of a special meeting. In any case, the Committee will have the power to stop, even if the formal rule says continue, or to continue even when the formal rule says stop. Our experience so far has been that such committees confirm the formal rule. The power to stop in unforeseen circumstances or because of surprising adverse reactions has not been used. These are low-probability scenarios, and so it is not surprising that we have not yet encountered them. As the rules and their properties and consequences have been carefully discussed ahead of time, it is unlikely for the Committee to go back on them. Often the rule is the result of careful and lengthy considerations before the trial begins, whereas an interim analysis creates the need for a sudden and urgent decision. Use of the predetermined rule is then an attractive option. In the analysis, only the formal rule is allowed for. The other interim deliberations of the Committee are ignored. However, despite its limitations, this is a much closer model of the true clinical trial than the fixed sample model.
This is why a special frequentist analysis is worthwhile and why the extensions to it made in this paper are being proposed. The extensions of this paper are therefore technical devices, intended to model more closely the true trial conduct, which has to be envisaged as being repeatedly applied in frequentist calculations. Further sections will be presented in terms of the idealized and formal representation of the trial.
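The loss of frequentist properties under repeated interim analysis can be illustrated by simulation. The following is a minimal sketch, not part of the original paper: it assumes equally spaced inspections with unit information increments, a null treatment effect, and a naive rule that stops as soon as the conventional two-sided 5% test is significant. The function name and defaults are hypothetical.

```python
import math
import random

def repeated_testing_error_rate(n_looks, n_sim=20000, crit=1.96, seed=1):
    """Estimate the chance that |Z_i| / sqrt(V_i) >= crit at some look when theta = 0.

    Assumes V_i = i (unit information per look), so Z_i is a sum of i
    independent standard normal increments.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        z = 0.0
        for i in range(1, n_looks + 1):
            z += rng.gauss(0.0, 1.0)          # increment of Z over one unit of V
            if abs(z) / math.sqrt(i) >= crit:  # naive fixed-sample test at look i
                rejections += 1
                break
    return rejections / n_sim

one_look = repeated_testing_error_rate(1)     # a genuine fixed sample design
five_looks = repeated_testing_error_rate(5)   # same test applied at each of 5 looks
```

With a single look the estimated error rate should be close to the nominal 0.05; with five looks it is markedly larger, which is why the analysis must refer to repetitions of the sequential design actually used.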
ANALYSIS AFTER A PROPERLY STOPPED SEQUENTIAL CLINICAL TRIAL
The context for this presentation will be the general formulation of Whitehead [3]. The trial is a comparison of an experimental treatment with a control, and the parameter $\theta$ measures the advantage of the experimental over the control. Interim inspections of the data are conducted in terms of the statistics $Z$ and $V$, where $Z$ represents the cumulative observed superiority of the experimental treatment and $V$ the information gathered so far. Mathematically $Z$ is the efficient score and $V$ is Fisher's information. Theoretical results are based on the result that $Z$ is asymptotically normally distributed with mean $\theta V$ and variance $V$, and further that a plot of $Z$ against $V$ resembles Brownian motion. The statistics $Z$ and $V$ can be illustrated for binary data. Suppose that the experimental and control treatments have success probabilities $p_E$ and $p_C$, and
that after $n_E$ and $n_C$ observations, respectively, there have been $S_E$ and $S_C$ successes and $F_E$ and $F_C$ failures. Let $n = n_E + n_C$, $S = S_E + S_C$, and $F = F_E + F_C$. Then choosing $\theta = \log[p_E(1 - p_C)/\{p_C(1 - p_E)\}]$ yields:
$$Z = (n_C S_E - n_E S_C)/n \tag{1}$$
and
$$V = n_E n_C S F / n^3. \tag{2}$$
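Equations (1) and (2) can be evaluated directly from a 2 x 2 table. The sketch below is illustrative only; the function name is hypothetical, and the counts are those reported for the final data set of Example 1 (Table 2).

```python
def score_statistics(s_e, f_e, s_c, f_c):
    """Return (Z, V) for binary data using Eqs. (1) and (2).

    s_e, f_e: successes and failures on the experimental treatment
    s_c, f_c: successes and failures on the control treatment
    """
    n_e = s_e + f_e
    n_c = s_c + f_c
    n = n_e + n_c
    s = s_e + s_c
    f = f_e + f_c
    z = (n_c * s_e - n_e * s_c) / n       # Eq. (1): efficient score
    v = n_e * n_c * s * f / n ** 3        # Eq. (2): Fisher's information
    return z, v

# Final counts of Example 1 (Table 2): 20/3 experimental, 11/10 control.
z, v = score_statistics(s_e=20, f_e=3, s_c=11, f_c=10)
chi2 = z * z / v                          # Pearson statistic, chi^2 = Z^2 / V
```

The values of `z` and `v` agree, to two decimal places, with the final row of Table 1 (3.80 and 2.29).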
These are related to the traditional Pearson $\chi^2$ statistic by $\chi^2 = Z^2/V$ and to the $O - E$ statistic used by Yusuf and Peto [16,17] by $O - E = Z$, $\mathrm{var}(O - E) = V$. Another equivalent expression for $Z$ is $Z = (n_E n_C/n)(\hat{p}_E - \hat{p}_C)$ where $\hat{p}_E = S_E/n_E$ and $\hat{p}_C = S_C/n_C$: $Z$ is a multiple of the observed difference between success rates. For survival data with $\theta$ equal to the log-hazard-ratio, $Z$ is the log rank statistic and $V$ its null variance. For other special cases, see [3] and [9].
As the trial proceeds, a series of interim inspections takes place. At the $i$th of these, the current value of $Z$, $Z_i$, is plotted against the current value of $V$, $V_i$. Most sequential designs can be represented as stopping boundaries drawn in this $Z$-$V$ plane, although some can only be drawn progressively as the trial proceeds rather than being laid out in advance. It will be assumed that there is an upper boundary, consisting of points $u_1, u_2, \ldots$, and a lower boundary consisting of points $\ell_1, \ell_2, \ldots$. The boundary points $u_i$ and $\ell_i$ may be functions of $V_1, \ldots, V_i$, but they may not depend on any of $Z_1, \ldots, Z_i$. If $Z_i \ge u_i$ or $Z_i \le \ell_i$, then the trial will stop. In some designs, there will be a maximum intended value of $V$, $V_{\max}$. Stopping will occur at the $i$th inspection if $V_i \ge V_{\max}$. Suppose that the trial terminates at the $T$th inspection. Thus $T$ is a random variable, and either $Z_T$ lies outside the interval $(\ell_T, u_T)$ or $V_T \ge V_{\max}$. Denote observed values of $T$ and $Z_T$ by $t$ and $z_t$, respectively. After a trial has terminated we can define the function $P(\theta)$ by
$$P(\theta) = P\{(T < t, Z_T \ge u_T) \text{ or } (T \ge t, Z_t \ge z_t); \theta\} \tag{3}$$
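Definition (3) can be approximated by simulation. The following is a minimal Monte Carlo sketch, not the method used in PEST2: it assumes the Brownian motion approximation stated above, so that increments of $Z$ between inspections are independent normal variables with mean $\theta \, \Delta V$ and variance $\Delta V$. The function name and argument layout are hypothetical.

```python
import math
import random

def p_value_function(theta, v, lower, upper, z_obs, n_sim=20000, seed=1):
    """Monte Carlo estimate of P(theta) in definition (3).

    v     : information levels V_1..V_t (so t = len(v))
    lower : lower boundary points l_1..l_{t-1}
    upper : upper boundary points u_1..u_{t-1}
    z_obs : observed value z_t at the terminating inspection
    """
    rng = random.Random(seed)
    t = len(v)
    hits = 0
    for _ in range(n_sim):
        z, v_prev = 0.0, 0.0
        for i in range(t):
            dv = v[i] - v_prev
            z += rng.gauss(theta * dv, math.sqrt(dv))  # Brownian increment
            v_prev = v[i]
            if i == t - 1:
                if z >= z_obs:        # T >= t and Z_t >= z_t
                    hits += 1
            elif z >= upper[i]:       # T < t, stopped on the upper boundary
                hits += 1
                break
            elif z <= lower[i]:       # T < t, stopped on the lower boundary
                break
    return hits / n_sim

# Hypothetical two-look design, symmetric about zero, evaluated at theta = 0.
p_sym = p_value_function(0.0, v=[1.0, 2.0], lower=[-1.0], upper=[1.0], z_obs=0.0)
```

For this symmetric illustration the exact value is 0.5, which gives a simple check on the simulation. A numerical recursion would be preferable to simulation when small tail probabilities are needed accurately.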
The function $P(\theta)$ is called the P-value function, although it is also used in estimation and setting confidence limits. Definition (3) is central to the frequentist analysis of the study. It is worth making its meaning explicit. The quantity $P(\theta)$ is equal to the proportion of an infinite number of repetitions of the same investigation, each conducted with a treatment difference of $\theta$, in which more extreme evidence in favor of the superiority of the experimental treatment is observed than in the investigation actually conducted. Two phrases need interpretation. One is "of the same investigation," and the other is "more extreme evidence." By "of the same investigation" we shall mean of an investigation conducted according to the same sequential design as the one actually conducted. It follows immediately that the conventional analysis will not be valid, for it refers to infinite repetitions of a fixed sample design that was not used. Often the details of a sequential design emerge as the study proceeds. Equation (3) has the advantage of depending only on the form of the design up to the point of termination; what might have happened later is not considered. "More extreme evidence" is interpreted in the context of a modified anticlockwise ordering of boundary points. The anticlockwise ordering of
points was discussed by Armitage [1] and Siegmund [18], and modified for use with discrete inspections by Tsiatis et al. [4]. An approximate method of calculation was given by Facey and Whitehead [19] following an approach due to Woodroofe (see section IX.5 of [18]). Alternative and equivalent expressions for $P(\theta)$ are
$$P(\theta) = 1 - P\{(T < t, Z_T \le \ell_T) \text{ or } (T \ge t, Z_t < z_t); \theta\}$$
and
$$P(\theta) = P\{(T < t, Z_T \ge u_T) \text{ or } (T = t, Z_t \ge z_t); \theta\}$$
The form given in (3) is easier to generalize to the cases of overrunning and underrunning. From $P(\theta)$ we can calculate:
$$p_1 = P(0), \tag{4}$$
which is the significance level against the one-sided alternative of superiority of the experimental treatment, and
$$p_2 = 2 \min\{P(0), 1 - P(0)\} \tag{5}$$
which is the significance level against the two-sided alternative. Also calculable are
$$(\theta_L, \theta_U) \text{ where } P(\theta_L) = 0.025, \; P(\theta_U) = 0.975 \tag{6}$$
which is a 95% confidence interval for $\theta$, and
$$\theta_M \text{ where } P(\theta_M) = 0.5 \tag{7}$$
which is a median unbiased estimate of $\theta$. When the trial stops with $T = t$, say, the analysis is based on Eq. (3). It is necessary to know the values of $V_i$, $i = 1, \ldots, t$ and $(\ell_i, u_i)$, $i = 1, \ldots, t - 1$ but unnecessary to know subsequent values that might have been used later. This is of importance in clinical trials. Although it is good practice to plan the complete sequence of $(V_i, u_i, \ell_i)$ in advance, it is practical reality that the plan may be departed from in minor or major respects as the trial proceeds. If such departures are in no way dependent on the nature of the values $Z_1, \ldots, Z_i$, then the validity of the statistical analysis is unimpaired. Dependence of $V_i$ on $V_1, \ldots, V_{i-1}$ is permissible and often desirable. It follows that if two designs share the same values of $V_i$ for $i = 1, \ldots, s$ and of $(u_i, \ell_i)$ for $i = 1, \ldots, s - 1$, and stopping occurs with $T = s$, then identical analyses will result. In particular the analysis following a sequential design stopping at the first interim inspection will be identical to the conventional analysis of a fixed sample study with information equal to $V_1$. This feature was pointed out by Rosner and Tsiatis [20]. The sequential design will only stop at the first inspection if the data are extreme, and so when early stopping occurs because of evidence that the experimental treatment is superior, the significance level against the one-sided alternative, $p_1 = P(0)$, will be small. It is not being suggested, in the terminology of [2], that the nominal and overall significance levels are equal. As stated at the end of the
Introduction, in this paper the stopping rule is viewed as a way of terminating recruitment, and not as a decision rule.
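Once a P-value function is available, Eqs. (4)-(7) reduce to one evaluation at $\theta = 0$ and three root-finding problems. The sketch below assumes only that $P(\theta)$ is increasing in $\theta$ and uses bisection; the helper names and the search interval are hypothetical. For illustration it uses a trial stopped at the first inspection, where the analysis coincides with the fixed sample one, with made-up values $z_1 = 3$ and $V_1 = 4$.

```python
from statistics import NormalDist

def solve_theta(p_func, target, lo=-10.0, hi=10.0, tol=1e-8):
    """Bisection for P(theta) = target, assuming P is increasing in theta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_func(mid) < target:
            lo = mid      # theta too small: P(theta) still below target
        else:
            hi = mid
    return 0.5 * (lo + hi)

def summarize(p_func):
    """Significance levels, 95% interval and median unbiased estimate."""
    p0 = p_func(0.0)
    return {
        "p1": p0,                                      # Eq. (4)
        "p2": 2.0 * min(p0, 1.0 - p0),                 # Eq. (5)
        "ci": (solve_theta(p_func, 0.025),
               solve_theta(p_func, 0.975)),            # Eq. (6)
        "median": solve_theta(p_func, 0.5),            # Eq. (7)
    }

# Hypothetical data: stopping at the first inspection with z_1 = 3, V_1 = 4.
z1, v1 = 3.0, 4.0

def p_first_look(theta):
    # P(Z_1 >= z_1; theta) with Z_1 ~ N(theta * V_1, V_1)
    return 1.0 - NormalDist().cdf((z1 - theta * v1) / v1 ** 0.5)

result = summarize(p_first_look)
```

Any other implementation of $P(\theta)$, including one that allows for earlier inspections, could be passed to `summarize` unchanged.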
ANALYSIS AFTER A SEQUENTIAL CLINICAL TRIAL WHICH HAS OVERRUN OR UNDERRUN
Underrunning
The case of underrunning is the simpler and will be treated first. Suppose that the trial is abandoned at the $a$th inspection, and that $\ell_i < Z_i < u_i$, $i = 1, \ldots, a$. Regardless, we can define $P(\theta)$ by
$$P(\theta) = P\{(T < a, Z_T \ge u_T) \text{ or } (T \ge a, Z_a \ge z_a); \theta\} \tag{8}$$
and use Eqs. (4)-(7) to conduct the analysis. In effect, we have imposed the design with upper and lower boundary points $(u_1, \ldots, u_{a-1})$ and $(\ell_1, \ldots, \ell_{a-1})$ and $V_{\max} = V_a$, and analyzed as if that had been in force. The definition of $P(\theta)$ now concerns infinite repetitions of the truncated investigation. Provided that the premature stopping really did occur independently of the position of the sample path, no bias is introduced by this maneuver. The design truncated at the $a$th inspection is of course less powerful than that originally intended, but loss of power is to be expected whenever a trial is stopped prematurely.
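The truncation implied by definition (8) can be written down as a small data transformation. This is a sketch under stated assumptions: the function name, the dictionary layout, and the numerical values are hypothetical and serve only to show which quantities enter the analysis.

```python
def underrun_design(v, lower, upper, z_path, a):
    """Design imposed when a trial is abandoned at the a-th inspection (1-based).

    v, lower, upper: planned information levels and boundary points
    z_path         : observed Z_1..Z_a
    Returns the truncated design used in definition (8).
    """
    # Underrunning presumes the path never left the continuation region.
    if not all(lower[i] < z_path[i] < upper[i] for i in range(a)):
        raise ValueError("sample path crossed a boundary; this is not underrunning")
    return {
        "v": v[:a],                 # inspections V_1..V_a
        "lower": lower[:a - 1],     # l_1..l_{a-1}
        "upper": upper[:a - 1],     # u_1..u_{a-1}
        "v_max": v[a - 1],          # V_max = V_a
        "z_final": z_path[a - 1],   # z_a, compared in the last term of (8)
    }

# Hypothetical plan with four inspections, abandoned at the third.
design = underrun_design(
    v=[1.0, 2.0, 3.0, 4.0],
    lower=[-3.0, -2.0, -1.0, 0.0],
    upper=[3.0, 3.5, 4.0, 4.5],
    z_path=[0.5, 1.0, 1.2],
    a=3,
)
```

The boundary points at the $a$th inspection are dropped because, in the truncated design, the trial stops there whatever the value of $Z_a$.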
Overrunning
Suppose that $T = t$, where $V_t < V_{\max}$. Thus $Z_t$ is either $\le \ell_t$ or $\ge u_t$. Nevertheless, data continue to be received until a final $(t + 1)$th inspection is conducted. It may be possible to extrapolate corresponding boundary points $\ell_{t+1}$ and $u_{t+1}$, and $Z_{t+1}$ may or may not lie within $(\ell_{t+1}, u_{t+1})$. That is immaterial. The flow of data has stopped, and the trial is to be analyzed. The extent of the overrunning, specifically the value of $(V_{t+1} - V_t)$, is assumed to be independent of the sample path. Now the constraint $T \ge t$, i.e., $\ell_i < Z_i < u_i$, $i = 1, \ldots, t - 1$, has been satisfied. Analysis is taking place at the $(t + 1)$th inspection. The position of $Z_t$ gives no information; it is as if we had used the design with upper and lower boundary points $(u_1, \ldots, u_{t-1}, z_{t+1})$ and $(\ell_1, \ldots, \ell_{t-1}, z_{t+1})$ and inspections at $V_1, \ldots, V_{t-1}, V_{t+1}$. The analysis will use
$$P(\theta) = P\{(T < t, Z_T \ge u_T) \text{ or } (T \ge t, Z_{t+1} \ge z_{t+1}); \theta\} \tag{9}$$
accordingly. If the stopping boundaries are crossed at the first look, then $t = 1$ and definition (9) gives
$$P(\theta) = P(T \ge 1, Z_2 \ge z_2; \theta) = P(Z_2 \ge z_2; \theta)$$
and the analysis is identical to the conventional fixed sample analysis. More generally, whenever overrunning forms a large part of the observed sample path, the analysis will be similar to conventional results. Conversely, when overrunning is slight, the analysis will be close to that of a properly stopped study.
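For the next simplest case, $t = 2$, definition (9) can be evaluated by one-dimensional numerical integration. The sketch below is illustrative and assumes the Brownian motion approximation, so that $Z_1 \sim N(\theta V_1, V_1)$ and $Z_3 - Z_1 \sim N(\theta(V_3 - V_1), V_3 - V_1)$ independently; the function name and the midpoint rule are choices made here, not part of the original method. Note that neither $V_2$ nor $Z_2$ appears: the position of $Z_t$ gives no information.

```python
import math
from statistics import NormalDist

PHI = NormalDist()

def overrun_p_value(theta, v1, v3, l1, u1, z3, n_grid=4000):
    """P(theta) from definition (9) with t = 2.

    v1, l1, u1: information and boundary points at the first inspection
    v3, z3    : information and observed Z at the final (t + 1 = 3rd) inspection
    """
    sd1 = math.sqrt(v1)
    sd_inc = math.sqrt(v3 - v1)
    # First term: T < 2 and Z_1 >= u_1.
    total = 1.0 - PHI.cdf((u1 - theta * v1) / sd1)
    # Second term: l_1 < Z_1 < u_1 (so T >= 2) and Z_3 >= z_3,
    # integrated over Z_1 by the midpoint rule.
    h = (u1 - l1) / n_grid
    for k in range(n_grid):
        x = l1 + (k + 0.5) * h
        density = PHI.pdf((x - theta * v1) / sd1) / sd1
        tail = 1.0 - PHI.cdf((z3 - x - theta * (v3 - v1)) / sd_inc)
        total += density * tail * h
    return total

# Hypothetical symmetric illustration: the exact answer is 0.5.
p_check = overrun_p_value(0.0, v1=1.0, v3=2.0, l1=-1.0, u1=1.0, z3=0.0)
```

When the first-inspection boundaries are moved far apart, the result approaches the conventional value $1 - \Phi\{(z_3 - \theta V_3)/\sqrt{V_3}\}$, in line with the remark that heavy overrunning gives an analysis similar to the conventional one.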
Definition (9) is a pragmatic choice. It is simple and can easily be calculated from software created to evaluate $P(\theta)$ defined by definition (3). The properties described in the previous paragraph are favorable. It can be operated without knowledge of values of $V_{t+2}, V_{t+3}, \ldots$, pertaining to inspections that were never carried out. It will produce significance levels and confidence intervals with the correct error and coverage probabilities. However, even for properly stopped sequential trials a wide variety of significance levels can be defined [20,21] and overrunning provides even more scope for controversy. Here we shall restrict ourselves to an exploration of the consequences of definition (9).
LOSS OF SIGNIFICANCE
In the overrunning situation of the previous section, the observation of the statistics $(Z_T, V_T)$ fulfilled the stopping criterion. Suppose that the upper boundary was crossed. An immediate analysis would result in the conclusion that the experimental treatment is significantly better than the control at the significance level $\alpha$, where $\alpha$ is a level specified when the trial was designed. In fact $p < \alpha$, where $p$ is the actual significance level given by (4) or (5) depending on the form of the alternative. If stopping has occurred particularly early, then $p$ will be considerably less than $\alpha$. Delayed observations are now received, and the statistics $(Z_{T+1}, V_{T+1})$ are calculated. Because stopping occurred precisely because $Z_T$ was large, it is likely that subsequent data will show a somewhat less marked treatment effect. This is a form of "regression to the mean." It is even possible that these values lie once more within the continuation region of the study. Such behavior is considered to be unsettling by some. Here we shall investigate its consequences and the probability of its occurrence. First note that three different events can occur:
E1: $(Z_{T+1}, V_{T+1})$ lies outside the continuation region for the trial.
E2: $(Z_{T+1}, V_{T+1})$ lies within the continuation region for the trial and $p \le \alpha$.
E3: $(Z_{T+1}, V_{T+1})$ lies within the continuation region of the trial and $p > \alpha$.
Event E1 is perfectly satisfactory and the most likely. Event E2 can occur because, when a sequential trial stops early, the significance level is not precisely equal to $\alpha$; there is some significance "to spare," and it is often considerably smaller than $\alpha$. For that reason, when forced to analyze a trial that has not been properly stopped, significance can still be observed. The situation has a parallel in fixed sample studies. A power calculation might lead to a specification of 500 subjects. If only 300 are recruited, it is not impossible to discover significant evidence.
Event E2 would be an embarrassment, and might make a graphical representation of the sequential trial confusing. However, the message that the result remained significant should probably satisfy the investigators. Event E3 is of even more concern. First, notice that E3 will be a rare event. The sequential procedure itself will have been planned with a high power. This is the investigators' guarantee that the results are repeatable. Thus a report of the same conclusion is only to be expected from the overrunning phase of the experiment. The probability of a loss of significance is quantified in an example in the following section.
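The three events can be told apart mechanically once the overrunning analysis has been carried out. The following sketch is illustrative; the function name is hypothetical, the boundary points at the extra inspection are assumed to have been extrapolated as described earlier, and the numerical inputs are invented.

```python
def classify_overrun(z_next, lower_next, upper_next, p, alpha):
    """Classify the outcome of overrunning as E1, E2 or E3.

    z_next                : Z_{T+1}, computed after the delayed observations
    lower_next, upper_next: extrapolated boundary points at V_{T+1}
    p                     : significance level from the overrunning analysis
    alpha                 : level specified when the trial was designed
    """
    outside = z_next >= upper_next or z_next <= lower_next
    if outside:
        return "E1"                      # still outside the continuation region
    return "E2" if p <= alpha else "E3"  # inside: significance kept or lost

# Hypothetical illustrations.
still_outside = classify_overrun(6.0, -1.0, 5.0, p=0.01, alpha=0.05)
kept = classify_overrun(4.0, -1.0, 5.0, p=0.03, alpha=0.05)
lost = classify_overrun(4.0, -1.0, 5.0, p=0.07, alpha=0.05)
```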
If E3 occurs, it is likely to do so in a way that changes a significance level just less than $\alpha$ to one just greater than $\alpha$. This is of concern only to those who set unreasonably great store by whether a result is significant at, say, the 5% level. If significance is considered as a continuous scale then little change has occurred. Finally, if a serious change of conclusion has been observed, then the investigators should accept that extra information and greater power have overturned what was in danger of being a premature and wrong conclusion. There is no scientific sense in refusing to look at the delayed observations just in case they tell us that a desirable positive result is not in fact true. A great discrepancy in behavior between the sequential and overrunning phases of the study might, on the other hand, lead to suspicion about the validity of the probability model for the data.
EXAMPLES
Both of the examples in this section are based on clinical trials conducted in the pharmaceutical industry; one has been completed and the other is ongoing. Both have been considerably simplified for inclusion here, and fictitious data have been attributed to the ongoing design. Because of these changes, and for reasons of confidentiality, no details of drugs or therapeutic areas will be given.
Example 1: Underrunning
In this study patients were randomized equally between an experimental and a control drug. Patient responses were available soon after treatment, and were classified as success or failure. A 60% success rate was anticipated on the control, and an improvement to 80% on the experimental drug was sought with power 0.9 for the 5% significance level (two-sided alternative). To satisfy the power specification with a fixed sample design, about 208 subjects (104 on each treatment) would be required. Monthly interim inspections were planned, and a triangular sequential design was adopted. To allow for irregular recruitment of subjects, the flexible approach described in chapter 4 of [3] was used. This formulation is now used in the computer program PEST2 [9], in which the adjustment for discrete monitoring is referred to as the "Christmas tree" adjustment. The justification for this nomenclature will be apparent from the PEST2.1 output given in Figures 1 and 2. The triangular test used has upper boundary
$$Z = 5.03 + 0.298V$$
and lower boundary
$$Z = -5.03 + 0.893V$$
where the statistics $Z$ and $V$ to be plotted are those calculated from the 2 x 2 contingency table of success or failure against treatment using Eqs. (1) and (2). Using the Christmas tree adjustment the $i$th pair of boundary points $(\ell_i, u_i)$ becomes
$$u_i = 5.03 + 0.298V_i - 0.583\sqrt{V_i - V_{i-1}}$$
and
$$\ell_i = -5.03 + 0.893V_i + 0.583\sqrt{V_i - V_{i-1}}$$
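The adjusted boundary points are simple to compute from the sequence of information values. The sketch below is illustrative; the function name is hypothetical, and it assumes $V_0 = 0$, which the text does not state explicitly. The inputs are the first two information values reported for Example 1.

```python
import math

def christmas_tree_boundaries(v, intercept, upper_slope, lower_slope, k=0.583):
    """Return [(l_i, u_i), ...] for a triangular test with Christmas tree adjustment.

    v          : information values V_1, V_2, ... at the inspections
    intercept  : boundary intercept (5.03 in Example 1)
    upper_slope: slope of the upper boundary (0.298 in Example 1)
    lower_slope: slope of the lower boundary (0.893 in Example 1)
    k          : adjustment constant, 0.583 as in the text
    """
    points = []
    v_prev = 0.0                      # assumption: V_0 = 0
    for v_i in v:
        adj = k * math.sqrt(v_i - v_prev)
        u_i = intercept + upper_slope * v_i - adj    # upper point pulled inward
        l_i = -intercept + lower_slope * v_i + adj   # lower point pulled inward
        points.append((l_i, u_i))
        v_prev = v_i
    return points

# First two inspections of Example 1 (V_1 = 0.19, V_2 = 0.41).
bounds = christmas_tree_boundaries([0.19, 0.41], 5.03, 0.298, 0.893)
```

To two decimal places these reproduce the first two rows of boundary points in Table 1. Because the adjustment depends on $V_i - V_{i-1}$, the points can only be computed as the trial proceeds, which is what allows irregular inspection times.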
Figure 1 Sample path when the clinical trial of Example 1 was abandoned (output from PEST2).
Figure 2 Sample path for the clinical trial of Example 2, showing the data received after termination of recruitment (output from PEST2).
Table 1 Results from the Trial of Example 1

| Inspection Number ($i$) | Number of Patients | $V_i$ | $Z_i$ | $\ell_i$ | $u_i$ |
|---|---|---|---|---|---|
| 1 | 5 | 0.19 | 0.60 | -4.61 | 4.83 |
| 2 | 12 | 0.41 | 0.17 | -4.39 | 4.88 |
| 3 | 17 | 0.88 | 0.65 | -4.24 | 4.89 |
| 4 | 23 | 1.22 | 1.65 | -3.60 | 5.05 |
| 5 | 26 | 1.38 | 2.00 | -3.56 | 5.21 |
| 6 | 28 | 1.53 | 2.50 | -3.44 | 5.26 |
| 7 | 30 | 1.67 | 3.00 | -3.32 | 5.31 |
| 8 | 32 | 1.80 | 3.50 | -3.21 | 5.36 |
| 9 | 33 | 1.83 | 3.33 | -3.29 | 5.47 |
| 10 | 37 | 1.93 | 3.65 | -3.12 | 5.42 |
| 11 | 44 | 2.29 | 3.80 | -2.64 | 5.36 |
The trial proceeded for 11 months. The sample path is shown in Figure 1 and the results are presented in Table 1. It was then abandoned because positive findings in other studies made its completion unnecessary. An analysis based on $P(\theta)$, as defined by Eq. (8), is provided by PEST2.1. The significance level $p_2$ (two-sided alternative) is 0.012, the 95% confidence interval is $(\theta_L, \theta_U) = (0.364, 2.955)$, and the median unbiased estimate of $\theta$ is $\theta_M = 1.659$. For comparison a conventional analysis based on $Z_{11}$ and $V_{11}$ gives a significance level of $2\{1 - \Phi(Z_{11}/\sqrt{V_{11}})\} = 2\{1 - \Phi(2.51)\} = 0.012$, a 95% confidence interval of $(Z_{11}/V_{11} \pm 1.96/\sqrt{V_{11}}) = (0.364, 2.957)$ and an approximate maximum likelihood estimate of 1.661. In this comparison, both analyses are based on $Z$ and $V$, and so differences would be due to allowance for the sequential design rather than alternative approximations to the binomial distribution. Other comparative analyses can be performed using the summary of the final data set given in Table 2. The effect of the 10 preceding interim analyses has been negligible. The effect is negligible because, for the range of $\theta$ values of interest, the probability of stopping before the 11th inspection is small, i.e.,
$$P(\theta) = P\{(T < 11, Z_T \ge u_T) \text{ or } (T \ge 11, Z_{11} \ge z_{11}); \theta\} \approx P(Z_{11} \ge z_{11}; \theta) = P_c(\theta)$$
where $P_c(\theta)$ is the P-value function used in a conventional analysis. The evidence of improvement due to the experimental drug is already significant, despite the upper boundary having not yet been crossed. Although in this case a significant finding has been achieved despite the premature abandonment of the study, it must be remembered that the power of the study as conducted will be appreciably less than 0.9.
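The loss of power can be gauged with a rough fixed sample calculation. This is a sketch under stated assumptions rather than a calculation from the paper: it uses the standard normal-theory formulas for a test based on $Z \sim N(\theta V, V)$, ignores the interim inspections (whose effect was found above to be negligible) and the lower rejection tail, and approximates the information per patient by $\bar{p}(1 - \bar{p})/4$ with $\bar{p}$ the average of the two anticipated success rates. Function names are hypothetical.

```python
import math
from statistics import NormalDist

N = NormalDist()

def required_information(theta_r, alpha=0.05, power=0.9):
    """Information V needed by a fixed sample test (two-sided alpha)."""
    return ((N.inv_cdf(1 - alpha / 2) + N.inv_cdf(power)) / theta_r) ** 2

def approximate_power(theta_r, v, alpha=0.05):
    """Approximate power at information v, ignoring the lower rejection tail."""
    return N.cdf(theta_r * math.sqrt(v) - N.inv_cdf(1 - alpha / 2))

# Design values of Example 1: 60% success on control, 80% on experimental.
p_c, p_e = 0.6, 0.8
theta_r = math.log(p_e * (1 - p_c) / (p_c * (1 - p_e)))  # log odds ratio
v_fixed = required_information(theta_r)

# Equal allocation: V is roughly n * p_bar * (1 - p_bar) / 4 (assumption).
p_bar = (p_e + p_c) / 2
n_fixed = v_fixed / (0.25 * p_bar * (1 - p_bar))

# Information actually accrued when the trial was abandoned (Table 1).
power_at_abandonment = approximate_power(theta_r, 2.29)
```

Under these assumptions `n_fixed` comes out close to the 208 subjects quoted for the fixed sample design, while the approximate power at $V_{11} = 2.29$ is far below 0.9.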
Table 2  Final Results from the Study of Example 1

                Success   Failure   Total
Experimental         20         3      23
Control              11        10      21
Total                31        13      44

Table 3  Results from the Trial of Example 2

Inspection    Number
Number (i)    of Deaths     V_i      Z_i      l_i      u_i
    1             21        5.03     1.62    -6.01     9.13
    2             73       17.70     7.36     0.65    10.33
    3            108       26.15    14.86     4.20    12.02
    4            139       33.12    14.05     7.29    13.26

Consequently, it should not be the policy of an investigating team to allow early abandonment
of a trial without good cause. It must also be remembered that the validity of the analysis will be negated if the value of $Z_n$ in any way influenced the decision to abandon the study. The difficulty of persuading critics of this independence should act as a deterrent against abandoning a trial except when absolutely unavoidable.

Example 2: Overrunning

The second example also concerns subjects equally randomized between an experimental and a control treatment, and compared using a triangular test. The patient response of interest is time from randomization to death. A proportional hazards model is assumed: the parameter $\theta$ is minus the log hazard ratio (experimental : control), and the log-rank statistic and its null variance are used for $Z$ and $V$. A value of $\theta$ of $-\log 0.6 = 0.511$ was to be detected as significant at the 5% level with power 0.9. The appropriate triangular test has upper boundary $Z = 9.66 + 0.155V$ and lower boundary $Z = -9.66 + 0.465V$.

Four inspections of the accumulating data were conducted, and the results are presented in Table 3. At the third inspection it was found that $Z_3 > u_3$, and so recruitment to the study was terminated. However, subjects already recruited were followed for a further 6 months, during which another 31 deaths were recorded. A fourth and final analysis was then conducted.

Table 4 presents the results of analyses performed at the third and fourth inspections. Both analyses based on Eqs. (3) and (9), which allow for the sequential design, and analyses which use $Z$ and $V$ in a conventional manner, ignoring the interim inspections, are included.

Table 4  Comparative Analyses for the Trial of Example 2

Type of                            Sig. Level   Estimate        95% Confidence
Analysis                           ($P_2$)      of $\theta$     Interval for $\theta$
Allowing for sequential design
  at 3rd inspection                0.010        0.544           (0.135, 0.937)
  at 4th inspection                0.020        0.418           (0.067, 0.761)
Conventional
  at 3rd inspection                0.004        0.568           (0.185, 0.952)
  at 4th inspection                0.015        0.424           (0.084, 0.765)

It can be seen by comparing the sequential analyses with the conventional
ones that the latter exaggerate the significance of the result, and present large estimated treatment effects and overprecise confidence limits. These are the expected consequences of ignoring a sequential design.

Figure 2 shows the sample path. It can be seen how the additional data run counter to the trend up to the third inspection, with $Z$ actually falling. This artificial example has been constructed so as to make a point rather than to reflect typical behavior. Although the triangular continuation region has been reentered, $Z_4 > u_4$, and stopping would occur by this criterion. The significance level is less extreme and the estimated benefit of the new treatment reduced, but the trial still yields a convincingly positive result.

DISCUSSION

In this paper it has been shown that the frequentist concepts of significance level and confidence interval can be defined and calculated after a sequential clinical trial with either overrunning or underrunning. Other definitions are of course possible. The stopping rule determines when recruitment should end, and then the methods of this paper can be used to provide a valid analysis once all of the data are in.

It should be emphasized again that overrunning only creates extra valid data when protocol conditions remain in force after termination of recruitment. In particular, if patients are receiving long-term therapy, then it may be ethically imperative to switch patients to the preferred treatment as soon as possible. No extra valid data will then accrue.

As was mentioned in the introduction, analyses based on Bayesian methods need no adjustment for overrunning or underrunning, nor for any form of sequential monitoring. Whether this is seen as a virtue of the Bayesian approach, or as a demonstration of its flaws, depends on personal opinion.
The debate between Bayesian and frequentist statisticians was most recently joined at a conference on methodological and ethical issues in clinical trials held at the London School of Economics in June 1991. The conference proceedings are to be published in the journal Statistics in Medicine during 1992, and papers by Berry on the Bayesian approach and by myself on frequentist methodology present the principles on which the two systems are built.

The likelihood principle suggests that statistical analyses should not depend on the stopping rules employed. The principle is described in the book of Berger and Wolpert [22]. Some statisticians adhere to the likelihood principle alone, although this leads to rather limited forms of data summary. Others, such as Berger and Wolpert themselves, see the likelihood principle as a justification for Bayesian methods. The likelihood principle essentially states that all inferences should be drawn from a set of data by consideration of the likelihood function. The behavior of summary statistics in repetitions of the experiment is ruled irrelevant. The principle disallows significance testing and consideration of the bias of estimates or the coverage probability of confidence intervals; sequential methods are but a minor casualty! The vast majority of clinical trial analyses are based on frequentist methods, a few have used Bayesian procedures, and the author knows of none based on likelihood principles alone. The methods of this paper extend the range of the most used and best known approach.
The repeated confidence interval method is another frequentist technique. It needs no adjustment for sequential monitoring, nor for overrunning or underrunning. This flexibility is bought at the price of conservatism in the intervals quoted. The method is discussed, together with yet more approaches to sequential analysis, in chapter 6 of [3], and also in the discussion of Jennison and Turnbull's paper [7].

In their reply to discussion (p. 354), Jennison and Turnbull [7] ask what happens if, when using the triangular test, "the lower boundary is crossed but, for some reason, the study is allowed to continue and then the upper boundary is crossed?" They claim that "common sense dictates that H0 should be rejected, but, if the originally stated size and power are to be preserved, the theory of the triangular test can only allow acceptance of H0." Their claim need not be true. The triangular test should not be seen as a decision procedure operating to some prespecified risks of type I and type II error (the risk of type I error is what Jennison and Turnbull refer to as "size"). Instead, the stopping rule is used to end recruitment. The analysis takes place in the usual way, but with a modified method of calculation, at the end of the study. After Jennison and Turnbull's scenario, the result could well be a significance level indicating that the new treatment is superior. It must be remarked that a crossing of both boundaries is an unlikely event. If it were to occur, then the consistency of trial conditions would need to be checked carefully before any form of analysis could be believed.

When a trial is abandoned early, or when overrunning is substantial, or when interim looks are few and far between, allowance for the sequential design may have little effect. However, a quantitative demonstration of that fact can effectively put an end to controversy about the influence of the stopping rule.
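As a small illustration of such a quantitative comparison, the conventional rows of Table 4 can be recomputed from the $Z$ and $V$ values in Table 3 and set beside the sequential results reported there. This Python sketch is only a check under stated assumptions: the function name is hypothetical, and the sequential figures are copied from Table 4 rather than recomputed, because the adjusted analysis of Eqs. (3) and (9) is not implemented here.

```python
from math import sqrt
from statistics import NormalDist


def conventional_analysis(z, v):
    """Fixed-sample analysis from Z and V, ignoring all interim looks."""
    p2 = 2 * (1 - NormalDist().cdf(abs(z) / sqrt(v)))  # two-sided P-value
    estimate = z / v                                   # approximate MLE of theta
    half_width = 1.96 / sqrt(v)
    return p2, estimate, (estimate - half_width, estimate + half_width)


# (Z_i, V_i) at the third and fourth inspections, taken from Table 3.
inspections = {3: (14.86, 26.15), 4: (14.05, 33.12)}

# Sequential results copied from Table 4 (not computed by this sketch).
sequential = {
    3: (0.010, 0.544, (0.135, 0.937)),
    4: (0.020, 0.418, (0.067, 0.761)),
}

for i, (z, v) in inspections.items():
    p2, est, (lo, hi) = conventional_analysis(z, v)
    seq_p2, seq_est, (seq_lo, seq_hi) = sequential[i]
    print(f"Inspection {i}")
    print(f"  conventional: P2 = {p2:.3f}, estimate = {est:.3f}, "
          f"CI = ({lo:.3f}, {hi:.3f})")
    print(f"  sequential:   P2 = {seq_p2:.3f}, estimate = {seq_est:.3f}, "
          f"CI = ({seq_lo:.3f}, {seq_hi:.3f})")
```

Setting the two analyses side by side in this way shows how far the allowance for the stopping rule moves the significance level, the estimate, and the interval limits at each inspection.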
It is hoped that the provision of an analysis method that can be used in these awkward situations will remove one more barrier to the wider implementation of sequential methods.
REFERENCES

1. Armitage P: Sequential Medical Trials (2nd ed). Oxford, Blackwell, 1975
2. Pocock SJ: Group sequential methods in the design and analysis of clinical trials. Biometrika 64: 191-199, 1977
3. Whitehead J: The Design and Analysis of Sequential Clinical Trials (2nd ed). Chichester, Ellis Horwood, 1991
4. Tsiatis AA, Rosner GL, Mehta CR: Exact confidence intervals following a group sequential test. Biometrics 40: 797-803, 1984
5. Lan KKG, DeMets DL: Discrete sequential boundaries for clinical trials. Biometrika 70: 659-663, 1983
6. Freedman LS, Spiegelhalter DJ: Comparisons of Bayesian with group sequential methods for monitoring clinical trials. Contr Clin Trials 10: 357-367, 1989
7. Jennison C, Turnbull BW: Interim analyses: The repeated confidence interval approach (with discussion). J Roy Stat Soc B 51: 305-361, 1989
8. Whitehead J: Letter to the Editor concerning Reference 6, and authors' reply. Contr Clin Trials 12: 340-350, 1991
9. Whitehead J, Brunier H: PEST2.0: Operating Manual. University of Reading, 1989 K.M. Facey
10. Jones DR, Newman CE, Whitehead J: The design of a sequential clinical trial for comparison of two lung cancer treatments. Stat Med 1: 73-82, 1982
11. Whitehead J, Jones DR, Ellis SH: The analysis of a sequential clinical trial for the comparison of two lung cancer treatments. Stat Med 2: 183-190, 1983
12. Newman CE, Cox R, Ford CHJ, Johnson JR, Jones DR, Wheaton M, Whitehead J: Reduced survival with radiotherapy and razoxane compared with radiotherapy alone for inoperable lung cancer in a randomised double-blind trial. Br J Cancer 51: 731-732, 1985
13. Storb R, Deeg J, Whitehead J, and 18 others: Methotrexate and cyclosporine compared with cyclosporine alone for prophylaxis of acute graft versus host after marrow transplantation for leukemia. N Engl J Med 314: 729-735, 1986
14. Deichsel G: A sequential triangular trial of surfactant-treated preterm neonates. Presented at the XVth International Biometric Conference, Budapest, 1990
15. Balkau BJ, Poupon RE, Trinchet JC, Eschwege E: A sequential clinical trial for the treatment of alcoholic hepatitis. Presented at ISCB 11, Nimes, 1990
16. Yusuf S, Peto R, Lewis J, Collins R, Sleight T: Beta blockade during and after myocardial infarction: An overview of the randomised trials. Prog Cardiovasc Dis 27: 335-371, 1985
17. Peto R: Why do we need systematic overviews of randomised trials? Stat Med 6: 233-240, 1987
18. Siegmund D: Sequential Analysis: Tests and Confidence Intervals. New York, Springer-Verlag, 1985
19. Facey KM, Whitehead J: An improved approximation for calculation of confidence intervals after a sequential clinical trial. Stat Med 9: 1277-1285, 1990
20. Rosner GL, Tsiatis AA: Exact confidence intervals following a sequential trial: A comparison of methods. Biometrika 75: 723-729, 1988
21. Chang MN: Confidence intervals for a normal mean following a group sequential test. Biometrics 45: 247-254, 1989
22. Berger JO, Wolpert RL: The Likelihood Principle. Hayward, CA, Institute of Mathematical Statistics, 1984