Comparison of Bayesian with Group Sequential Methods for Monitoring Clinical Trials

Laurence S. Freedman, MA
Biometry Branch, National Cancer Institute, Bethesda, Maryland

David J. Spiegelhalter, PhD
MRC Biostatistics Unit, Cambridge, UK
ABSTRACT: We describe some problems with applying methods based on classical sequential analysis to monitoring clinical trials. A Bayesian method is developed and the boundaries are compared with frequentist schemes. For the examples chosen, the Bayesian boundaries can be quite similar to those obtained from Pocock and O'Brien and Fleming (OBF) rules, depending on the choice of prior distribution. They converge less rapidly than OBF's but more rapidly than Pocock's. In general the Bayesian methods provide the same desirable features as frequentist methods, without sacrificing flexibility and simplicity of interpretation.

KEY WORDS: Clinical trials, sequential methods, stopping rules, Bayesian methods
INTRODUCTION

Clinical trial investigators like to inspect the data as they accumulate, for reasons of ethics, efficiency, and curiosity. Statisticians have long been concerned about the tendency to overinterpret treatment differences that are observed early on in studies. Repeated statistical testing of the null hypothesis at each inspection of the data increases the overall significance level beyond the nominal significance level employed at each analysis [1]. Therefore schemes have been developed, based on the theory of repeated significance testing, in which the significance levels at each analysis are reduced so as to maintain a chosen overall significance level $\alpha$ [2]. Typically $\alpha$ is 0.05 or 0.025. For a discussion of some of these schemes, see Geller and Pocock [3].

Many practical and conceptual difficulties arise in the use of repeated significance schemes in clinical trials. Some of these are as follows:
Address reprint requests to: Laurence S. Freedman, Biometry Branch, DCPC, National Cancer Institute, Executive Plaza North, Suite ~44, Bethesda, MD 20892. Received August 5, 1988; revised January 3, 1989.
Controlled Clinical Trials 10:357-367 (1989). © Elsevier Science Publishing Co., Inc. 1989, 655 Avenue of the Americas, New York, New York 10010. 0197-2456/1989/$3.50
(1) For a group sequential method with Pocock boundaries [4] and five analyses, a nominal significance level of 0.016 is used at each analysis, leading to an overall $\alpha$ of 0.05. Suppose a trial has evidence, at the fifth and last analysis, of a treatment difference with nominal p level 0.02. Then, according to the design, this would not be statistically significant at the 5% level, whereas an investigator with identical data carrying out a fixed size analysis would attain p = 0.02. It is difficult to explain this to clinical investigators, who wonder why previous inspections of the data should affect the interpretation of the final results. Several statisticians, including Cornfield [5], have related this difficulty to the contravention of the likelihood principle by classical sequential methods.

(2) Consider a trial that is completed but where the investigators feel that the observed treatment difference is promising but not yet conclusive, and that other studies are pointing in the same direction. They may wish to carry on with the study, but how should the statistician analyze the extra data? This type of problem also occurs when there is a delay between the entry of the patient and the assessment of response to treatment. If a trial is stopped prematurely on the basis of a stopping rule, how should the statistician deal with the extra data that become available after the trial has stopped?

(3) Those who advocate group sequential methods acknowledge that, although they can be a useful guide in the decision-making process, they cannot be used as rigid rules for stopping a trial. For example, considerations such as the toxicity of treatments or evidence from an extraneous study may override the formal statistical stopping rule. However, if a trial is stopped early because of toxicity, then how should the data on treatment differences be interpreted?
(4) Trials that stop early due to a large effect tend to overestimate the treatment difference. Pocock and Hughes claim that methods of estimation following sequential trials do not sufficiently correct the overestimation [6].

Other problems are discussed by Geller and Pocock [3]. Although some of these problems have been discussed and solutions proposed, our overall impression is that the frequentist framework becomes rather cumbersome in the context of interim analysis. However, Bayesian methods appear to fit naturally into a dynamic setting. A pretrial prior probability distribution of the magnitude of the treatment difference is specified, and the data collected are used to modify the prior distribution into a posterior distribution. The posterior distribution may be calculated at any stage of the trial and used to make decisions with regard to the future of the trial. To argue for the use of Bayesian methods in clinical trials is not new; Cornfield was a particularly lucid advocate of this approach [5,7,8].

In the next section we illustrate this approach by presenting equations for a Bayesian analysis of a two-treatment trial where the response to treatment is a normally distributed variable. We then compare the stopping boundaries obtained from Bayesian analysis with those obtained from frequentist schemes. In the final section we discuss the differences found.
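The inflation of the overall significance level under repeated testing, and the effect of lowering the nominal level as in difficulty (1), can be illustrated with a small simulation. This is a minimal sketch rather than anything taken from the cited schemes: the function name, the group size of 20 pairs, the number of simulated trials, and the random seed are all assumptions made for illustration, and the pairwise differences are taken to be standard Normal with no true treatment difference.

```python
import random
from statistics import NormalDist


def overall_type_i_error(nominal_p, n_looks=5, group_size=20, n_sims=20000, seed=1):
    """Monte Carlo estimate of the chance of at least one 'significant'
    two-sided test when the test is repeated at each of n_looks analyses
    and the true treatment difference is zero."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - nominal_p / 2)
    rejections = 0
    for _ in range(n_sims):
        total = 0.0  # running sum of pairwise differences
        n = 0        # pairs accumulated so far
        for _ in range(n_looks):
            # The sum of group_size independent N(0, 1) differences is N(0, group_size).
            total += rng.gauss(0.0, group_size ** 0.5)
            n += group_size
            if abs(total) / n ** 0.5 > z_crit:
                rejections += 1
                break
    return rejections / n_sims


naive = overall_type_i_error(0.05)    # unadjusted test at every analysis
pocock = overall_type_i_error(0.016)  # nominal level quoted in (1) above
print(f"nominal 0.05 at each of 5 analyses:  overall level about {naive:.3f}")
print(f"nominal 0.016 at each of 5 analyses: overall level about {pocock:.3f}")
```

Under these assumptions the first estimate should come out well above 0.05 and the second close to it, which is the behavior described in the Introduction.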
BAYESIAN MONITORING OF A CLINICAL TRIAL

The following theory may be found in any elementary text on Bayesian statistics, e.g., Lindley [9]. It is reproduced here to set out the argument in full.
Consider a randomized clinical trial comparing two treatments. Let $\delta$ be the true treatment difference, with a pretrial prior distribution $p(\delta)$. Suppose a total of $n$ pairs of patients (not necessarily matched) have been entered, one of each pair assigned to the experimental treatment and the other to the control treatment. The estimate of the treatment difference is denoted by $\hat{\delta}$. We make the following assumptions: (i) $p(\delta) \sim N(\delta_0, \sigma^2/n_0)$, (ii) $p(\hat{\delta} \mid \delta) \sim N(\delta, \sigma^2/n)$, where $\sigma^2$ is the variance of an individual pairwise difference, $\delta_0$ is the pretrial expectation of the treatment difference, and $n_0$ reflects the precision of the prior information about the treatment difference. The value of $n_0$ can be thought of as the number of pairs of patients in a comparative trial that would yield the same amount of information as in the prior distribution. It can be shown by Bayes's theorem that the posterior distribution $p(\delta \mid \hat{\delta})$ is given by
$$p(\delta \mid \hat{\delta}) \sim N\left[\frac{n\hat{\delta} + n_0\delta_0}{n + n_0},\ \frac{\sigma^2}{n + n_0}\right]. \tag{1}$$
This expression reflects the incorporation of an "imaginary" extra $n_0$ pairs of patients in the study. Suppose that we now divide the treatment-difference scale into a range where the experimental treatment is considered clinically superior ($\delta > \Delta_E$) and a range where the control treatment is considered superior ($\delta < \Delta_C$). Depending on the clinical situation, $\Delta_E$ and $\Delta_C$ will either coincide or $\Delta_C$ will be less than $\Delta_E$. In the latter case a range of clinical equivalence $(\Delta_C, \Delta_E)$ will exist within which the two treatments are considered roughly equivalent [10]; an operational definition of this range is that a clinician would have no preference for either treatment above the other were it known for certain that the treatment difference lay within this range. This partitioning of the scale of treatment differences will be based on the relative toxicities of the treatments and to a lesser extent on their cost and convenience. For example, clinicians may demand that to use a highly toxic combination of drugs given as adjuvant therapy for rectal cancer the 5-year survival is increased by 5% compared to surgery alone. In this case $\Delta_E$ would equal 5%. Between a 0% and 5% increase clinicians may feel that the extra toxicity approximately balances the gain in survival rate. If so $\Delta_C$ would equal 0%. We will require the trial to stop if either
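$$P_C = \int_{-\infty}^{\Delta_C} p(\delta \mid \hat{\delta})\,d\delta < \varepsilon \quad\text{or}\quad P_E = \int_{\Delta_E}^{\infty} p(\delta \mid \hat{\delta})\,d\delta < \varepsilon. \tag{2}$$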
Figure 1. Stopping criterion for a two-treatment trial based on the current posterior distribution: stop if $P_C < \varepsilon$ or $P_E < \varepsilon$. (The treatment-difference scale is divided at $\Delta_C$ and $\Delta_E$ into three regions: control treatment superior, range of equivalence, experimental treatment superior.)

This is equivalent to stopping if either there is very little chance ($P_C < \varepsilon$) that the control treatment is superior, in which case we would recommend using the experimental treatment, or the same is true for the experimental treatment ($P_E < \varepsilon$), in which case we would recommend using the control treatment. Figure 1 illustrates this rule. The condition (2) may be written as follows:
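$$\text{either}\quad \hat{\delta} > \frac{\sigma^2}{n}\left[\frac{\Delta_C}{\sigma_p^2} - \frac{n_0\delta_0}{\sigma^2} - \frac{\Phi^{-1}(\varepsilon)}{\sigma_p}\right] \quad\text{or}\quad \hat{\delta} < \frac{\sigma^2}{n}\left[\frac{\Delta_E}{\sigma_p^2} - \frac{n_0\delta_0}{\sigma^2} + \frac{\Phi^{-1}(\varepsilon)}{\sigma_p}\right], \tag{3}$$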
where $\Phi^{-1}(\varepsilon)$ is the standard Normal deviate (quantile) corresponding to probability $\varepsilon$ [e.g., $\Phi^{-1}(0.025) = -1.96$] and $\sigma_p^2 = \sigma^2/(n + n_0)$ is the posterior variance. Note that formal Bayesian decision theory is not introduced in this stopping rule. For a discussion outlining the reasons for this, see Spiegelhalter and Freedman [11], in which we argue that the full modeling of the consequences of a trial cannot be carried out with sufficient confidence to derive stopping boundaries, and that an implicit recognition of the costs of making errors is more realistic.
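The posterior distribution of Eq. (1), the tail probabilities of Eq. (2), and the critical values of Eq. (3) can be sketched in a few lines. This is a minimal sketch under the Normal assumptions (i) and (ii); the function names are hypothetical, and the numerical inputs in the usage lines ($\sigma^2 = 0.5$, $n_0 = 22$, $n = 60$, $\Delta_C = 0$, $\Delta_E = 0.05$, $\varepsilon = 0.025$) are chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def posterior(delta_hat, n, delta0, n0, sigma2):
    """Posterior mean and standard deviation of the treatment difference, Eq. (1)."""
    mean = (n * delta_hat + n0 * delta0) / (n + n0)
    sd = sqrt(sigma2 / (n + n0))
    return mean, sd


def tail_probabilities(delta_hat, n, delta0, n0, sigma2, delta_c, delta_e):
    """P_C (posterior mass below Delta_C) and P_E (mass above Delta_E), Eq. (2)."""
    mean, sd = posterior(delta_hat, n, delta0, n0, sigma2)
    post = NormalDist(mean, sd)
    return post.cdf(delta_c), 1.0 - post.cdf(delta_e)


def boundaries(n, delta0, n0, sigma2, delta_c, delta_e, eps):
    """Critical values of delta_hat from Eq. (3).

    Stop in favor of the experimental treatment if delta_hat > upper,
    and in favor of the control treatment if delta_hat < lower."""
    sd_p = sqrt(sigma2 / (n + n0))
    z = NormalDist().inv_cdf(eps)  # negative for eps < 0.5
    upper = (sigma2 / n) * (delta_c / sd_p**2 - n0 * delta0 / sigma2 - z / sd_p)
    lower = (sigma2 / n) * (delta_e / sd_p**2 - n0 * delta0 / sigma2 + z / sd_p)
    return upper, lower


# Illustrative inputs (assumptions, not values prescribed by the text).
args = dict(n=60, delta0=0.0, n0=22, sigma2=0.5, delta_c=0.0, delta_e=0.05)
upper, lower = boundaries(eps=0.025, **args)
p_c_at_upper, _ = tail_probabilities(upper, **args)
_, p_e_at_lower = tail_probabilities(lower, **args)
print(f"stop for experimental if delta_hat > {upper:.3f} (P_C there = {p_c_at_upper:.4f})")
print(f"stop for control if delta_hat < {lower:.3f} (P_E there = {p_e_at_lower:.4f})")
```

Evaluating Eq. (2) exactly on the boundaries of Eq. (3) returns $\varepsilon$, which is a convenient check that the two forms of the rule agree.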
Table 1. Stopping Rules (stop only if $|\hat{\delta}| > A$) for Three Frequentist and Four Bayesian Schemes in a Trial with 200 Patients

Interim    Number of    Frequentist                    Bayesian n0
analysis   pairs        Pocock    OBF(a)    H/P(b)     800     89      22      8
1          20           0.38      0.72      0.52       --      0.72    0.45    0.37
2          40           0.27      0.36      0.37       --      0.39    0.27    0.24
3          60           0.22      0.24      0.30       0.68    0.28    0.21    0.19
4          80           0.19      0.18      0.26       0.51    0.23    0.18    0.16
5          100          0.17      0.14      0.14       0.42    0.19    0.15    0.14

(a) O'Brien and Fleming. (b) Haybittle-Peto.
COMPARISON OF BAYESIAN AND FREQUENTIST STOPPING BOUNDARIES

In this section, we compare the Bayesian rule (Eq. 3) with three commonly cited frequentist rules. Consider the two-sided rules of Pocock [2], O'Brien and Fleming [12] (OBF), and Haybittle-Peto [13,14]. To simplify the comparison, assume that we are interested in departures, in either direction, from the null hypothesis that $\delta = 0$. As a consequence we choose, for the Bayesian stopping rule, the values $\Delta_C$ and $\Delta_E$ to be zero. Furthermore, since interest is in differences in either direction, assume that at the beginning of the trial the prior distribution of belief is centered on $\delta = 0$, that is, $\delta_0 = 0$. Also, set $\varepsilon = 0.025$ to compare with frequentist schemes using two-sided tests with overall significance level $\alpha = 0.05$. Assume that $\sigma^2 = 0.5$, approximately the variance of a difference in proportions that are in the range 0.2 to 0.8, and consider two trials: one with 100 pairs of patients and with five interim analyses, and one with 1,000 pairs of patients and with five interim analyses.

Tables 1 and 2 show the stopping rules derived from the frequentist schemes mentioned above, as well as four Bayesian rules corresponding to different standard deviations ($\sigma/\sqrt{n_0}$) of the prior distribution, namely 0.025, 0.075, 0.15, and 0.25. These correspond to values of $n_0$ equal to 800, 89, 22, and 8, respectively. These were chosen to represent situations where the prior information available is very precise, considerable, average, and scanty, respectively.

Table 2. Stopping Rules (stop only if $|\hat{\delta}| > A$) for Three Frequentist and Four Bayesian Schemes in a Trial with 2,000 Patients

Interim    Number of    Frequentist                    Bayesian n0
analysis   pairs        Pocock    OBF(a)    H/P(b)     800      89       22       8
1          200          0.121     0.228     0.165      0.219    0.118    0.103    0.100
2          400          0.085     0.114     0.116      0.120    0.077    0.071    0.070
3          600          0.070     0.076     0.095      0.086    0.061    0.058    0.057
4          800          0.060     0.057     0.082      0.069    0.052    0.050    0.049
5          1000         0.054     0.046     0.044      0.059    0.046    0.044    0.044

(a) O'Brien and Fleming. (b) Haybittle-Peto.

The results in these tables show the known property that the Pocock boundaries converge much more slowly than the OBF boundaries; the OBF boundaries start off much wider than Pocock's but end up narrower. The Bayesian boundaries tend to converge at a rate somewhere between those of the Pocock and OBF schemes.

Table 1 shows the boundaries for a trial with 200 patients. With $n_0 = 89$ (considerable prior information), the Bayesian boundaries begin near the OBF boundary but finish wider than the OBF boundary. With $n_0 = 8$ (scanty prior information), the boundaries start off very near to the Pocock boundaries, but end up narrower. With $n_0 = 22$ (moderate prior information) the boundaries start off somewhat wider than Pocock's and end up a little wider than OBF's. Note that $n_0 = 800$ produces a highly conservative boundary. The Haybittle-Peto rule converges slowly over the first four analyses, ending with a considerable narrowing at the final analysis. This behavior is quite unlike that of the Bayesian rules.

Another way of describing the boundaries is in terms of the z value (i.e., the estimated treatment difference divided by its standard error) required to recommend stopping the trial at a given interim analysis. For our example the standard error is $\sqrt{0.5/n}$, so this boundary is given by $A\sqrt{2n}$, where $A$ is the value given in Table 1 and $n$ is the number of pairs of patients included in the interim analysis.
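The Bayesian columns of Table 1, and their conversion to the z scale, can be recomputed from Eq. (3) with a short script. This is a sketch under the settings of this section ($\Delta_C = \Delta_E = 0$, $\delta_0 = 0$, $\sigma^2 = 0.5$, $\varepsilon = 0.025$); the function names are hypothetical, and using the unrounded $n_0 = \sigma^2/\mathrm{SD}^2$ (for example 22.2 rather than 22) is an assumption that appears to agree with the tabulated rounding.

```python
from math import sqrt
from statistics import NormalDist

SIGMA2 = 0.5  # variance of a pairwise difference, as assumed in this section
EPS = 0.025   # posterior tail probability required for stopping


def bayes_boundary(n, n0, sigma2=SIGMA2, eps=EPS):
    """Critical |delta_hat| from Eq. (3) when Delta_C = Delta_E = 0 and delta_0 = 0."""
    z = NormalDist().inv_cdf(1 - eps)
    sd_p = sqrt(sigma2 / (n + n0))  # posterior standard deviation
    return (sigma2 / n) * z / sd_p


def to_z(a, n, sigma2=SIGMA2):
    """Boundary A expressed on the z scale: A divided by the standard error sqrt(sigma2/n)."""
    return a / sqrt(sigma2 / n)


pairs = (20, 40, 60, 80, 100)
for prior_sd in (0.025, 0.075, 0.15, 0.25):
    n0 = SIGMA2 / prior_sd**2  # unrounded prior "sample size" (assumption, see lead-in)
    row = [bayes_boundary(n, n0) for n in pairs]
    z_row = [to_z(a, n) for a, n in zip(row, pairs)]
    print(f"prior SD {prior_sd}: n0 = {n0:.1f}")
    print("   A:", [round(a, 2) for a in row])
    print("   z:", [round(z, 2) for z in z_row])
```

For the most precise prior the first two computed values of $A$ are at or above 1, which is outside the possible range of a difference in proportions; that may be why the corresponding cells of Table 1 are blank.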
Figures 2-4 show these boundaries. In each figure we plot two group sequential boundaries (Pocock's and OBF's) and superimpose a Bayesian boundary with a different amount of prior information ($n_0$ = 89, 22, and 8, respectively). The Pocock boundary is horizontal whereas the OBF boundary is convergent. The Bayesian boundaries also converge but more slowly than OBF's.

These results indicate that the Bayesian stopping rule can, with certain choices of prior distribution, come quite close to frequentist stopping rules such as Pocock's or OBF's. In the above example a prior distribution with standard deviation 0.15 ($n_0 = 22$) leads to a rule that mediates between these two frequentist rules. The prior with $n_0 = 89$ contains nearly the same amount of information as the trial itself. This prior leads to a rule that is as conservative as OBF's at the first interim analysis and that becomes more conservative than OBF's thereafter. It is interesting that this rule demands more evidence to claim a significant treatment benefit at the end of the trial than does Pocock's scheme.

If one had a prior with $n_0 = 800$ then it would not be sensible to study only 200 patients in the trial itself, as the information accruing from this trial would be unlikely to alter one's opinion very much. The stopping rules derived from such a prior reflect this absurdity. We show them in Table 1 only for pedagogy's sake.

Table 2 shows the boundaries for a trial with 2,000 patients. When $n_0 = 89$ the Bayesian boundaries start off near the Pocock boundaries but end up near the OBF boundary. With $n_0 = 800$, the boundaries start off near the OBF boundary but finish a little wider than the Pocock boundary. With $n_0$ = 22 or 8 the Bayesian boundaries are narrower than both Pocock's and OBF's.
Figure 2. Stopping rules (stop only if $|z| > Z$) for two frequentist schemes and a Bayesian scheme with $n_0 = 89$ in a trial with 200 patients. (Horizontal axis: number of pairs of patients; curves: Pocock, OBF, Bayes.)
Figure 3. Stopping rules (stop only if $|z| > Z$) for two frequentist schemes and a Bayesian scheme with $n_0 = 22$ in a trial with 200 patients. (Horizontal axis: number of pairs of patients; curves: Pocock, OBF, Bayes.)

These results show that Bayesian schemes can come close to Pocock's or OBF's stopping rules for trials with thousands of patients as well as for those with hundreds. However, for large-scale trials the prior distributions that give rise to rules similar to the frequentists' have less dispersion than for smaller trials. For the large trials considered here, the range of $n_0$ necessary to yield frequentist-like rules was between 89 and 800, whereas for the small trials this range was 8 to 89. In fact, the information content of the prior relative to the planned size of trial ($n_0/n$) seems to be the determining factor. If the information equivalent of the prior lies between one fifth of the trial size (i.e., the size of each interim group) and the trial size itself, then our particular choice of Bayesian rule is close to Pocock's or OBF's.

Table 2 also shows that if prior information content is low relative to trial size ($n_0$ = 22 or 8) then the Bayesian rule is less conservative than Pocock's or OBF's. In the limit where prior information is negligible compared to trial size the Bayesian rule comes close to the "naive" method of repeated significance testing without adjustment of the nominal significance level. We believe that such a method is only truly naive for trials of limited size, and that the naivety lies in the assumption that little or no prior information exists. For trials with many thousands of patients this "naive" method of monitoring may sometimes be justified, once information on more than 500 patients, say, has accumulated. This would only be true, however, were the prior information less than one tenth the number of patients at the first interim analysis.
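The role of the ratio $n_0/n$ can be made explicit with a short derivation. The following is a sketch under the simplifying choices used in this section ($\Delta_C = \Delta_E = 0$, $\delta_0 = 0$); the symbol $z_{1-\varepsilon}$ is introduced here for illustration, and $n$ denotes the number of pairs available at the analysis in question.

```latex
% Eq. (3) with \Delta_C = \Delta_E = 0 and \delta_0 = 0: stop if |\hat\delta| > A, where
A = -\frac{\sigma^2}{n}\,\frac{\Phi^{-1}(\varepsilon)}{\sigma_p},
\qquad \sigma_p^2 = \frac{\sigma^2}{n + n_0}.

% Divide by the standard error \sqrt{\sigma^2/n} to move to the z scale:
Z = \frac{A}{\sqrt{\sigma^2/n}}
  = z_{1-\varepsilon}\,\sqrt{\frac{n + n_0}{n}}
  = z_{1-\varepsilon}\,\sqrt{1 + \frac{n_0}{n}},
\qquad z_{1-\varepsilon} = -\Phi^{-1}(\varepsilon).

% Check against Table 1: n_0 = 89, n = 100, \varepsilon = 0.025
Z = 1.96\,\sqrt{1.89} \approx 2.69 .
```

On the z scale $\sigma^2$ cancels and the prior enters only through $n_0/n$. As $n_0/n$ tends to zero, $Z$ tends to 1.96, which is the unadjusted repeated significance test referred to above as the "naive" method.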
DISCUSSION

By the time a randomized clinical trial has reached the planning stage there has already accumulated a substantial amount of information bearing on the likely treatment difference. Although this information is often of an indirect nature, arising from earlier-phase studies, observational studies, laboratory data, and previous experience with similar treatments, it is nevertheless relevant to the interpretation of clinical trial results. It is usually this information that makes one cautious about extreme differences that emerge early (or indeed late) in a clinical trial.

The attraction of classical sequential methods to clinical trialists lies in their property of discounting large treatment differences that arise early on in the study. We would expect Bayesian procedures to reproduce this qualitative behavior, and this has been demonstrated in the previous section. The particular prior distributions explored, with mean zero and sample size equivalent $n_0$, act as if an initial $n_0$ pairs of patients have been observed in which no treatment difference occurred. The need to overcome this "handicap" prevents unduly early termination.
Figure 4. Stopping rules (stop only if $|z| > Z$) for two frequentist schemes and a Bayesian scheme with $n_0 = 8$ in a trial with 200 patients. (Horizontal axis: number of pairs of patients; curves: Pocock, OBF, Bayes.)
Interviews with clinicians may be conducted to elicit their prior beliefs about treatment differences [10]. Our experience indicates that for cancer trials with several hundred patients planned the value of $n_0$ is usually in the range of 8-800, with typical values between 50 and 200. Thus the scheme with $n_0 = 89$ should be given particular attention.

Geller and Pocock [3] discuss the attributes of various frequentist sequential schemes. They mention that the Pocock scheme "has the disadvantage of undertaking the last analysis at a p value considerably smaller than 0.05" and that the OBF scheme "is perhaps too stringent at the first analysis, virtually assuring that the trial does not stop then." The results in Table 1 indicate that, for the particular $\sigma^2$ and $n$ chosen, the Bayesian-based rule, with $n_0 = 22$, mediates between these options. With $n_0 = 89$, however, the Bayesian rule is as conservative as OBF's at the first analysis and more conservative than both Pocock's and OBF's thereafter. This raises the question whether, for trials with hundreds of patients, these frequentist rules do not sometimes too liberally advocate that trials conclude in the face of a treatment difference. Moreover, with reference to the final analysis, the Bayesian rule poses the question as to whether 5% significance is a stringent enough criterion for accepting the existence of benefit from a new treatment. The Bayesian rule (with $n_0 = 89$) in Table 1 requires a difference of 0.19, which is approximately 2.7 standard errors, to infer the existence of a treatment effect should the trial continue to its planned size. Such a rule is considerably more demanding than the usual 2.0 or 1.96 standard errors requirement.

In Table 2, which refers to trials with 2,000 patients, the position is reversed. The Bayesian rule corresponding to $n_0 = 89$ is less conservative than both Pocock's and OBF's. This is because stopping a trial is not such a hazardous procedure when there are plenty of data in support and little prior information to contradict the data. However, very large trials would tend to be initiated only when the prior information is already quite precise, so there may be particular reason to consider values of $n_0$ larger than 89 in Table 2.

From a frequentist point of view, the Bayesian boundaries appear to provide a rational scheme for choosing between different sequential designs with type I error $\alpha$. Thus determining the amount of prior opinion (i.e., $n_0$) leads to a family of boundaries parametrized by $\varepsilon$, one member of which will have type I error exactly equal to $\alpha$. For a given $n_0$, the appropriate value of $\varepsilon$ leading to type I error $\alpha$ can be calculated by numerical integration. However, from a Bayesian point of view it is appropriate to select $\varepsilon$ on the basis of implicit losses and not on a fixed $\alpha$ level. In addition, Bayesians would not necessarily feel restricted to performing a prespecified number of interim analyses, and the boundaries would not be modified according to such a specification. More frequent applications of the Bayesian rule during the trial would, however, change the frequentist properties of the monitoring scheme.

The Haybittle-Peto boundaries provide extra caution with regard to early stopping for a reason not previously mentioned. A clinical trial is conducted not only to convince those investigators taking part but also the much larger group of clinicians outside the trial.
Results that appear convincing to people involved in the trial may be treated very conservatively by outsiders. It may therefore be argued that evidence from the trial should ideally be convincing enough to shift the opinion of the clinical community. To accept this argument would lead one to adopt a very "rigid" prior distribution ($n_0$ very large), for the purposes of stopping the trial, but not for reporting the final results. The question regarding the strength of evidence necessary to convince clinicians outside the trial merits further exploration.

The arguments for Bayesian methods have been rehearsed before. In particular the problems (1)-(4) raised in connection with frequentist methods in the Introduction are resolved naturally within the Bayesian framework. Although this article talks of stopping rules, it is more realistic to speak of stopping recommendations, since the monitoring of the major response measure in the trial, although important, is only one of a host of considerations that enter a decision to stop or continue a trial. In this regard the Bayesian method is an ideal tool because the inferences from the data are not affected by previous decisions regarding the continuation of the trial. Furthermore the mean of the posterior distribution may be used to estimate the treatment effect and this provides a natural shrinkage of unexpectedly extreme treatment differences towards the pretrial expectation [5]. Thus the difficulties of classical sequential methods appear to be overcome by Bayesian monitoring without sacrifice of their major benefits, namely, prevention of premature closing of a trial.
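The frequentist calibration mentioned in the Discussion, in which $\varepsilon$ is chosen for a given $n_0$ so that the type I error equals $\alpha$, can be approximated as follows. The text refers to numerical integration; this sketch substitutes Monte Carlo simulation with bisection, which is a different and cruder technique. It assumes $\Delta_C = \Delta_E = 0$ and $\delta_0 = 0$, in which case Eq. (3) divided by the standard error $\sqrt{\sigma^2/n}$ gives the z-scale boundary $-\Phi^{-1}(\varepsilon)\sqrt{1 + n_0/n}$. All function names, the seed, and the number of simulated trials are hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_null_z_paths(looks, n_sims=20000, seed=7):
    """z statistics at each analysis for trials simulated with no true difference."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_sims):
        total, prev, path = 0.0, 0, []
        for n in looks:
            # Standardized sum of the (n - prev) new pairwise differences.
            total += rng.gauss(0.0, sqrt(n - prev))
            prev = n
            path.append(total / sqrt(n))
        paths.append(path)
    return paths


def type_i_error(paths, looks, n0, eps):
    """Fraction of null trials that cross the Bayesian z boundary at any analysis."""
    z_eps = NormalDist().inv_cdf(1 - eps)
    bounds = [z_eps * sqrt(1 + n0 / n) for n in looks]
    hits = sum(any(abs(z) > b for z, b in zip(path, bounds)) for path in paths)
    return hits / len(paths)


def calibrate_eps(paths, looks, n0, alpha=0.05, iters=20):
    """Bisection on eps so that the simulated type I error is close to alpha."""
    lo, hi = 1e-6, 0.4
    for _ in range(iters):
        mid = (lo + hi) / 2
        if type_i_error(paths, looks, n0, mid) > alpha:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


looks = (20, 40, 60, 80, 100)  # pairs at each of five analyses, as in Table 1
paths = simulate_null_z_paths(looks)
err_89 = type_i_error(paths, looks, n0=89, eps=0.025)
eps_89 = calibrate_eps(paths, looks, n0=89)
print(f"n0 = 89, eps = 0.025: simulated type I error about {err_89:.3f}")
print(f"n0 = 89: eps giving type I error near 0.05 is about {eps_89:.3f}")
```

Because the same simulated trials are reused for every candidate $\varepsilon$, the estimated error rate is monotone in $\varepsilon$ and the bisection is well behaved. The result carries Monte Carlo error and should be read as a rough guide only.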
REFERENCES

1. Armitage P, McPherson CK, Rowe BC: Repeated significance tests on accumulating data. J R Stat Soc A 132:235-244, 1969
2. McPherson CK: Statistics: The problem of examining accumulating data more than once. N Engl J Med 290:501-502, 1974
3. Geller NL, Pocock SJ: Interim analyses in randomized clinical trials: Ramifications and guidelines for practitioners. Biometrics 43:213-223, 1987
4. Pocock SJ: Interim analyses for randomized clinical trials: The group sequential approach. Biometrics 38:153-162, 1982
5. Cornfield J: Sequential trials, sequential analysis and the likelihood principle. Am Stat 20:18-23, 1966
6. Pocock SJ, Hughes MD: Stopping rules, estimation problems, and reporting bias in clinical trials. Controlled Clin Trials, in press
7. Cornfield J: A Bayesian test of some classical hypotheses--with applications to sequential clinical trials. J Am Stat Assoc 61:577-594, 1966
8. Cornfield J: Recent methodological contributions to clinical trials. Am J Epidemiol 104:408-421, 1976
9. Lindley DV: Introduction to Probability and Statistics from a Bayesian Viewpoint. Cambridge: Cambridge University Press, 1965, vol 2
10. Freedman LS, Spiegelhalter DJ: The assessment of subjective opinion and its use in relation to stopping rules for clinical trials. The Statistician 33:153-160, 1983
11. Spiegelhalter DJ, Freedman LS: Bayesian approaches to clinical trials. Bayesian Stat 3:453-477, 1988
12. O'Brien PC, Fleming TR: A multiple testing procedure for clinical trials. Biometrics 35:549-556, 1979
13. Haybittle JL: Repeated assessment of results in clinical trials of cancer treatment. Br J Radiol 44:793-797, 1971
14. Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson CK, Peto J, Smith PG: Design and analysis of randomized clinical trials requiring prolonged observation of each patient. Br J Cancer 35:585-611, 1976