Monitoring of large randomised clinical trials: a new approach with Bayesian methods


ARTICLES

Mahesh K B Parmar, Gareth O Griffiths, David J Spiegelhalter, Robert L Souhami, Douglas G Altman, Emmanuel van der Scheuren,* for the CHART steering committee†

Summary

Background In judging whether or not to continue enrolling patients into a randomised clinical trial, most data-monitoring and ethics committees (DMECs) rely on the p value for the difference in effect between the study groups. In the 1990s, two randomised controlled trials (one in patients with lung cancer and one in those with head and neck cancer) were instead monitored by Bayesian methods. We assessed the value of this approach in the monitoring of these clinical trials.

Methods Before the trials opened, participating clinicians were asked their opinions on the expected difference between the study treatment (continuous hyperfractionated accelerated radiotherapy [CHART]) and conventional radiotherapy. These opinions were used to form an “enthusiastic” and a “sceptical” prior distribution. These prior distributions were combined with the trial data at each of the annual DMEC meetings. If, during monitoring, a result in favour of CHART was seen, the DMEC was to decide whether the results were sufficiently convincing to persuade a sceptic that CHART was worthwhile. Conversely, if there was apparently no or little difference, the DMEC was asked whether they thought the results sufficiently convincing to persuade an enthusiast that CHART was not worthwhile.

Findings At each of the annual meetings, the DMEC concluded that there was insufficient evidence to convert either sceptics or enthusiasts, and that the trials should therefore remain open to recruitment. Neither trial was closed to recruitment earlier than planned. However, if a conventional (p-value-based) stopping rule had been used, the lung-cancer trial would probably have been stopped.

Introduction During the accrual period of randomised clinical trials, the accumulating results are usually examined confidentially by an independent data-monitoring and ethics committee (DMEC) to assess whether there is sufficient evidence to close the trial to accrual of new patients on the basis of one treatment having a substantially better (or worse) effect than the other. Conventionally, most trials use the observed p value as the means of assessing whether to stop accrual. The observed p value is usually compared against some nominal significance level to allow for multiple looks at the data. A popular approach is that of O’Brien and Fleming.1 This approach requires strong evidence of a treatment difference at the start of monitoring; the evidence required lessens as the trial progresses until it almost reaches the boundary of conventional significance at the final analysis. If a trial is stopped early on the basis of one of these rules, the estimate of treatment effect is likely to be inflated. Adjusted estimates of treatment effect that allow for stopping early in this framework can be obtained, but there is no widely agreed approach to doing so.2

The results of two large randomised clinical trials, one of which compared continuous hyperfractionated accelerated radiotherapy (CHART) with conventional radiotherapy in lung cancer and the other of which compared these treatments in head and neck cancer, have been published.3 In these trials, a new approach was adopted in which emphasis was placed on the estimate of the effect, and its likely interpretation, given the amount of information available. This approach was formalised by use of Bayesian methods. In this paper, we discuss the Bayesian approach to the monitoring of these randomised trials.
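The shape of the O’Brien and Fleming rule described above can be illustrated with a short sketch. It is not taken from the paper: the number of looks (five, equally spaced) and the constant 2·040 are assumptions, the latter being the value commonly tabulated for five looks at an overall two-sided significance level of 5%. The names `obrien_fleming_z` and `nominal_two_sided_p` are hypothetical.

```python
import math
from statistics import NormalDist

# ASSUMPTION: commonly tabulated O'Brien-Fleming constant for 5 equally
# spaced looks and an overall two-sided significance level of 5%.
# This value is not given in the paper.
C_FINAL_5_LOOKS = 2.040


def obrien_fleming_z(n_looks: int, c_final: float) -> list[float]:
    """Critical |z| at look k is c_final * sqrt(n_looks / k)."""
    return [c_final * math.sqrt(n_looks / k) for k in range(1, n_looks + 1)]


def nominal_two_sided_p(z: float) -> float:
    """Two-sided nominal p value corresponding to a critical |z|."""
    return 2.0 * (1.0 - NormalDist().cdf(z))


bounds = obrien_fleming_z(5, C_FINAL_5_LOOKS)
nominal_p = [nominal_two_sided_p(z) for z in bounds]

for look, (z, p) in enumerate(zip(bounds, nominal_p), start=1):
    print(f"look {look}: stop if |z| > {z:.2f} (nominal p < {p:.5f})")
```

The boundary falls at every look, so very small p values are needed early on, and the final threshold sits only a little below the conventional 5% level.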

Interpretation This Bayesian approach to monitoring is simple to implement and straightforward for members of the DMEC to understand. In our opinion, it is more intuitively appealing than conventional approaches.

Lancet 2001; 358: 375–81

Methods

*E van der Scheuren died in March, 1998. †Members listed at end of paper. Cancer Division, MRC Clinical Trials Unit, 222 Euston Road, London NW1 2DA, UK (M K B Parmar DPhil, G O Griffiths MSc); MRC Biostatistics Unit, Institute of Public Health, Cambridge (Prof D J Spiegelhalter PhD); Royal Free and University College Medical School, University College London, London (Prof R L Souhami MD); ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford (Prof D G Altman MSc); UZ Gasthuisberg, Herestraat, Leuven, Belgium (E van der Scheuren MD) Correspondence to: G O Griffiths (e-mail: [email protected])

THE LANCET • Vol 358 • August 4, 2001

Trial design The design of these trials has been reported previously,2 as has the theoretical approach to the monitoring of trials with a Bayesian approach.4 However, we will briefly review the approach to design here, since this was used to aid the monitoring scheme. To help design the trial, all prospective clinical participants were given a questionnaire to find out what difference they expected between CHART and conventional radiotherapy, and what difference they regarded as clinically worthwhile, balancing benefit with expected extra toxicity and cost of CHART. The details of this elicitation have also been reported previously.2 For the lung-cancer trial, 11 clinicians planning to take part were asked to give their opinion about the likely difference in 2-year survival rates between CHART and conventional radiotherapy. Each clinician provided a probability distribution for the likely difference, and figure 1 shows the distribution averaged across all 11 clinicians. From this averaged distribution, the median anticipated 2-year survival advantage was 10% ranging from an expected 2-year survival of 15% in the conventional radiotherapy group to 25% in the CHART group. The clinicians further estimated a 90% probability


For personal use. Only reproduce with permission from The Lancet Publishing Group.

Figure 1: Prior distribution of 11 clinicians interested in participating in CHART lung cancer trial. [Plot not reproduced. Vertical axis: probability. Horizontal axis: absolute advantage of CHART over conventional (%), running from benefit to conventional to benefit to CHART, with the range of equivalence marked.]

that CHART would actually be more effective than conventional radiotherapy. With this information, a normal distribution was fitted to these data and is shown superimposed on the observed data. This normal distribution was actually fitted on the log hazard scale with a baseline survival of 15% in the conventional radiotherapy group. Radiobiological models indicated that CHART was expected to be more toxic than conventional radiotherapy in the first 6 months, but less toxic in the longer term. CHART would clearly involve a substantial change in working patterns, and be more expensive than conventional radiotherapy. Therefore results of the same

Figure 2: Prior distribution of nine clinicians interested in participating in CHART head and neck cancer trial. [Plot not reproduced. Vertical axis: probability. Horizontal axis: absolute advantage of CHART over conventional (%), running from benefit to conventional to benefit to CHART, with the range of equivalence marked.]

questionnaire showed that, on average, clinicians would use CHART routinely if it provided a 2-year survival improvement of at least 13·5% (from 15·0% to at least 28·5%), whereas they would prefer conventional radiotherapy if the 2-year survival improvement was less than 11% (from 15% to less than 26%). Between 11·0% and 13·5% was a range of equivalence where, on average, the clinicians could not firmly decide on one treatment or the other. Therefore, for this group of clinicians, an 11% improvement could be regarded as the minimum clinically worthwhile improvement required before they would consider using CHART.

For the head and neck trial, a similar questionnaire was sent out to nine clinicians planning to take part. The distribution, averaged across clinicians for the expected difference in 2-year disease-free survival rates between CHART and conventional radiotherapy, is shown in figure 2. The median expected difference was 11%, and ranged from an expected disease-free survival rate of 45% in the conventional radiotherapy group to 56% in the CHART group. The clinicians estimated a 92% probability that the 2-year disease-free survival rate would actually be better on CHART. They also indicated that they would definitely use CHART routinely if it provided a 2-year disease-free survival improvement of 13%, but would prefer conventional radiotherapy if the 2-year disease-free survival improvement was less than 10%. Between 10% and 13% was a range of equivalence in which, on average, the clinicians could not firmly choose one treatment or another. Thus, 10% would be the minimum clinically worthwhile improvement required before they would consider using CHART.

For practical reasons, a 60:40 randomisation in favour of CHART was used, which made CHART easier to schedule in participating centres. The trials were designed on the basis that, for analysis, the log-rank test would be used and that accrual would last about 4 years.
The principal analysis for both trials was planned to be 1 year after entry of the last patient, and 5 years after entry of the first patient, giving an average follow-up of 3 years. A two-sided significance level of 5% and power of about 90% were used in all design calculations. The main outcome measures were overall survival for the lung trial, and disease-free survival for the head and neck trial. On the basis of the elicited information on worthwhile and expected differences, and taking into account the projected accrual estimates across centres,2 about 600 patients needed to be entered into the lung cancer trial to observe an anticipated 470 events, enabling us to detect reliably an absolute improvement of 10% (from 15% to 25% in 2-year survival). 10% was the median expected survival difference for the 11 clinicians taking part in the lung trial. In the head and neck trial, about 500 patients were needed to observe an anticipated 220 events. This number would be sufficient to detect reliably an absolute difference of 15% (from 45% to 60%) in 2-year disease-free survival rates. The targeted improvement of 15% was somewhat larger than the clinicians’ median expected difference of 11%. However, a strong determinant of the planned size was the belief that 500 patients was the maximum number that could be recruited in 4 years. Both trials were opened to accrual in January, 1990.

Trial monitoring At each meeting of the independent DMEC, the following results were presented: the observed difference between treatments on the main outcome measures, associated p values and CIs, and information on accrual, toxicity, and compliance. At each meeting, the committee also


Year | Patients | Deaths | Hazard ratio (95% CI) | p | 2-year survival improvement* | Posterior 2-year survival improvement* | Probability >0% | Probability >5% | Probability >10%
1991 | 119 | 12 | ·· | ·· | ·· | ·· | ·· | ·· | ··
1992 | 256 | 78 | 0·55 (0·35–0·86) | 0·007 | 20% | 7% | 95% | 70% | 24%
1993 | 380 | 192 | 0·63 (0·47–0·83) | 0·001 | 15% | 9% | 99% | 86% | 31%
1994 | 460 | 275 | 0·70 (0·55–0·90) | 0·004 | 12% | 8% | 99% | 80% | 18%
1995 | 563 | 379 | 0·75 (0·61–0·93) | 0·006 | 9% | 7% | 99% | 73% | 9%

The first six columns are the trial data; the last four are posterior estimates with the sceptical prior. *See appendix.

Table 1: Results from lung-cancer trial at about annual intervals

asked the principal clinical investigators for any new information, from within the trial or external to it, that could have a direct effect on the committee’s deliberations. The approach used to monitor the trial was to assess the likely interpretation of the current result. The results of the survey showed that clinicians were enthusiastic about CHART and generally expected that it could improve the outcome in both cancers by a worthwhile amount. We can describe the opinions and enthusiasm of the clinicians planning to take part in the trials in the form of a distribution for each of the trials. These distributions are called prior distributions, because they represent opinions before the trial has begun. These prior distributions for both trials can be described as being “enthusiastic”, and will be referred to as such throughout the paper. Fitted normal distributions displaying these prior distributions for the two trials are shown in figures 1 and 2. However, many clinicians who were not taking part in the trial were sceptical about the likely benefit of CHART. To represent these sceptics, we can specify similar normal distributions, with the best guess (median) being that there is no difference between CHART and conventional radiotherapy and only a small probability, 5%, that the effect will be as large or larger than the expected target improvement in the design of the trial. There are good theoretical and practical reasons for choosing this distribution to represent sceptics.2,4 The enthusiastic and sceptical priors are explained in more detail in the appendix (available at www.thelancet.com or The Lancet office). If, during monitoring, a result in favour of CHART was seen, the question posed to the DMEC was: are the results sufficiently convincing to persuade a sceptic that CHART was worthwhile? 
By contrast, if there was apparently no or little difference between CHART and conventional radiotherapy, the question posed to the DMEC was: are the results sufficiently convincing to persuade an enthusiast that CHART was not worthwhile? To help the DMEC in its deliberations, a Bayesian approach was presented in which the current results were combined with the prior distributions to see if there was enough evidence to convert these two stylised groups. Formal combination of the data and prior distribution by use of Bayes’ theorem, which is essentially equivalent to pooling the observed data with the implicit data underlying the prior distribution, results in a so-called posterior distribution. For example, the sceptical prior for the lung trial is equivalent to implicit data in which 117 patients have died and no difference has been seen between the treatments (see appendix). The 1992 interim data from the trial comprised 78 deaths and an estimated 2-year survival improvement of 20% under CHART. When pooled, the posterior distribution is then equivalent to having observed 117+78=195 deaths and an estimated 2-year survival improvement of 7%. Thus, at this stage, 78/195=40% of the evidence comes from the data and 60% from the prior; as the trial progresses, the proportion of evidence arising from the data will increase. In our calculations of the posterior distribution we did not formally allow for the slight imbalance in randomisation. This has the effect of slightly underestimating the variance, but the point estimate remains unchanged.

With this posterior distribution, several probabilities of interest can be calculated from a sceptical point of view (or enthusiastic if an enthusiastic prior was used): the probability that CHART is actually better than conventional radiotherapy (ie, the effect is greater than zero); the probability that CHART is better than conventional radiotherapy by more than the minimum clinically worthwhile improvement (the lower limit of the range of equivalence); the probability that CHART is better than conventional radiotherapy by more than the upper limit of the range of equivalence; and, in case CHART was proving to be less toxic than expected, the probability that CHART is better than conventional radiotherapy by more than 5% (about halfway between 0% and the minimum clinically worthwhile improvement).

The DMEC consisted of two clinicians (RLS and EVDS) and one statistician (DGA). The results and interpretation calculations were presented to the DMEC by the statistician for the trial (MKBP). The DMEC planned to meet about annually, starting in 1991.

Year | Patients | Events | Hazard ratio (95% CI) | p | 2-year DFS improvement* | Posterior 2-year DFS improvement* | Probability >0% | Probability >5% | Probability >10%
1991 | 272 | 45 | ·· | ·· | ·· | ·· | ·· | ·· | ··
1992 | 531 | 188 | 0·91 (0·68–1·21) | 0·50 | 3% | 6% | 90% | 56% | 15%
1993 | 674 | 293 | 0·92 (0·73–1·16) | 0·16 | 3% | 5% | 90% | 46% | 7%
1994 | 791 | 387 | 0·89 (0·72–1·09) | 0·20 | 4% | 5% | 95% | 53% | 7%
1995 | 918 | 464 | 0·92 (0·76–1·11) | 0·33 | 3% | 4% | 91% | 39% | 2%

The first six columns are the trial data; the last four are posterior estimates with the enthusiastic prior. DFS=disease-free survival. *See appendix.

Table 2: Results from head and neck cancer trial at about annual intervals

Results The observed results of the two trials through the years are shown in columns 1–7 of tables 1 and 2. At the time of each DMEC meeting, the posterior distributions for the lung trial and head and neck trial were calculated with the evidence from the prior distributions and the evidence from the trial data, which are shown in figure 3. In the lung trial, an early difference in survival in favour of CHART


Figure 3: Prior distribution (sceptical for lung cancer trial and enthusiastic for head and neck cancer trial), posterior distribution, and distribution based on actual data by year. [Plots not reproduced. Two columns of panels (lung-cancer trial; head and neck cancer trial), one row per year from 1992 to 1995, each showing the prior, the evidence from the data, and the posterior. Horizontal axis: log (hazard ratio), running from conventional superior to CHART superior.]

was seen, and, subsequently, the results with the sceptical prior were presented to the DMEC. In the head and neck trial, no difference was apparent in terms of disease-free survival, and the results with the enthusiastic prior were presented to the DMEC. The calculated estimates and probabilities for the sceptic and the enthusiast are shown in the remaining columns of tables 1 and 2, respectively. We describe these results (the Bayesian interpretation) and the DMEC’s recommendations below.

At the first meeting of the DMEC held on May 28, 1991, 119 patients had been recruited into the lung trial and 12 deaths had been observed, whereas 272 patients had been recruited to the head and neck trial and 45 events had been observed. It was the DMEC’s opinion that too few events had been observed in either trial to make any sensible comparison between the two groups, and that the data on toxicity and compliance showed no worrying trends. Accrual to the head and neck trial was faster than anticipated: only 150 patients had been expected during this time. The DMEC therefore recommended increasing the planned size of the head and neck trial from 500 to a minimum of 750 patients. This number of patients gave about 90% power to detect an absolute difference at 2 years of 12%, and about 80% power for a difference of 10% (significance level of 5%). These improvements accorded much more closely with the median improvement expected by the clinical collaborators. The clinical collaborators endorsed this recommendation in June, 1991.

The DMEC held their second meeting on July 22, 1992. In the lung trial, 256 patients with a median follow-up of 7 months had been entered, of whom 78 had died. Table 1 shows the estimates at this time. Rather unexpectedly, there was no clear evidence of differences in either short-term or long-term toxicity, although long-term toxicity data were limited. As mentioned above, the question posed to the DMEC was: is this information sufficiently convincing to persuade a sceptic that CHART is worthwhile? This result was combined with the sceptical prior distribution to obtain the posterior distribution in figure 3. The evidence from the data contributed 40% of the information to the posterior distribution.
From a sceptical point of view, there was reasonably good evidence (95%) of some benefit for CHART—ie, a difference larger than zero. There was modest evidence (70%) of the difference being as large as 5%, but only slight evidence (24%) of the difference being as large as 11%. The DMEC agreed that the results were not yet convincing enough to persuade a sceptic that CHART was clinically worthwhile, and that therefore the trial should remain open. In the head and neck trial, 531 patients with a median follow-up of 9 months had been entered, of whom 188 had had an event. Table 2 shows the estimates at this time. A modest amount of extra early toxicity had been reported after CHART. There was no major difference in the amount or type of late toxicity, with reliable information up to 15–18 months. The evidence from the data contributed 71% of the information to the posterior distribution. From an enthusiastic point of view, there is reasonable evidence (90%) of some improvement with CHART, modest evidence (56%) of the difference being as large as 5%, but only slight evidence (15%) of the difference being as large as 10%. The DMEC concluded that there were no ethical or toxicity reasons why the trial should stop. The DMEC held their third meeting on June 7, 1993. In the lung trial, 380 patients with a median follow-up of 8·5 months had been entered, of whom 192 had died. Table 1 shows the estimates at this time. There was no major difference in the amount or type of late toxicity for up to 15–18 months of follow-up, although the figures indicated more frequent dysphagia in CHART and more radiation pneumonitis in the conventional group. The evidence from the data contributed 62% of the


information to the posterior distribution. Combining the sceptical prior with these data, there was reasonable evidence (86%) of the 2-year survival difference being as large as 5%, but only modest evidence (31%) of the difference being as large as 11%. The DMEC felt that, on balance, there were still no ethical or toxicity reasons why the trial should stop. In the head and neck trial, 674 patients with a median follow-up of 12 months had been entered, of whom 293 had had an event. Table 2 shows the estimates at this time. Early toxicity was slightly more frequent after CHART than conventional treatment; however, there were no major differences in the rate of late toxicity with reliable information up to 21–24 months. The evidence from the data contributed 79% of the information to the posterior distribution. By use of the enthusiastic prior distribution to calculate a posterior distribution, there was good evidence (90%) of some improvement, modest evidence (46%) of the difference being as large as 5%, and only slight evidence (7%) of the difference being as large as 10%. The DMEC concluded that there were insufficient data from the trial to yield results that would be regarded as reliable, and that the trial should remain open. The DMEC held their fourth and final meeting on May 16, 1994. In the lung trial, 460 patients with a median follow-up of 9 months had been entered, of whom 275 had died (table 1). The early toxicity data indicated no major differences between groups. The evidence from the data contributed 70% of the information to the posterior distribution. By use of the sceptical prior distribution to calculate a posterior distribution, there was very good evidence (99%) of some improvement in survival with CHART, modest evidence (80%) of the difference being as large as 5%, but only slight evidence (18%) of the difference being as large as 11%. 
The DMEC concluded that there was still insufficient evidence at this stage to yield conclusive results, and that therefore the trial should continue. In the head and neck trial, 791 patients with a median follow-up of 18 months had been entered, of whom 387 had had an event (table 2). The data on early toxicity indicated no major differences between groups. The data contributed 83% of the information to the posterior distribution. By use of the enthusiastic prior distribution to calculate a posterior distribution, there was good evidence (95%) of some improvement in disease-free survival, modest evidence (53%) of the difference being as large as 5%, and very slight evidence (7%) of the difference being as large as 10%. The DMEC again concluded that there was insufficient evidence at that stage to yield reliable and widely agreed results. In June, 1995, both trials were closed to recruitment. At this time, 563 patients with a median follow-up of 10 months had been entered into the lung trial, and 918 patients with a median follow-up of 24 months had been entered into the head and neck trial. The reason the trials were closed was because the end of the funding period for both trials had been reached and it did not seem appropriate to extend the accrual in either trial. The lung trial had almost reached its planned target of 600 patients and was sufficiently close to make little practical difference, whereas the head and neck trial had exceeded by a substantial amount the original planned number of 500 (and then 750) patients. Therefore further funding to recruit more patients was not requested. A meeting was held in open forum with CHART collaborators so that the first results of the trial could be discussed. In the lung trial, there were 379 deaths and an


estimated 25% reduction in the risk of death with CHART (hazard ratio 0·75 [95% CI 0·61–0·93], p=0·006). This result translates into a 9% absolute difference in 2-year survival in favour of CHART (from 15% to 24%). The evidence from the data contributed 76% of the information to the posterior distribution. By use of the sceptical prior distribution to calculate a posterior distribution, there was reasonable evidence (99%) of some improvement, modest evidence (73%) of the difference being as large as 5%, and very slight evidence (9%) of the difference being as large as 11%. In the head and neck trial, there were 464 events and an estimated 8% reduction in the risk of disease or death with CHART (hazard ratio 0·92 [95% CI 0·76–1·11], p=0·33). This finding translates into a 3% absolute difference in 2-year disease-free survival in favour of CHART (from 45% to 48%). The evidence from the data contributed 86% of the information to the posterior distribution. By use of the enthusiastic prior distribution to calculate a posterior distribution, there was reasonable evidence (91%) of some improvement, but only a small chance (39%) of a difference as large as 5%. The interim results presented at this final meeting were then published.3 Further, updated long-term results from both trials have also been published.5,6
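The posterior quantities quoted for the final analyses can be approximately reproduced with a short calculation. The sketch below is an illustration under stated assumptions, not the appendix method, which is not reproduced here. It assumes a normal approximation on the log hazard ratio scale with variance 4/(number of events), proportional hazards so that 2-year survival under CHART equals the baseline survival raised to the power of the hazard ratio, and log hazard ratios below zero favouring CHART. The sceptical prior (centred on no difference, 117 implicit deaths) is taken from the text. The enthusiastic head and neck prior is reconstructed here from the elicited median improvement (45% to 56%) and the 92% probability of benefit, which is an assumption. All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def log_hr_for_survival(base_surv: float, new_surv: float) -> float:
    """Log hazard ratio that moves 2-year survival from base_surv to
    new_surv, assuming proportional hazards (S_new = S_base ** HR)."""
    return math.log(math.log(new_surv) / math.log(base_surv))


def posterior(prior_mean: float, prior_events: float,
              data_hr: float, data_events: float):
    """Pool prior and data on the log HR scale.

    ASSUMPTION: each source is normal with variance 4 / events, so the
    posterior mean is an events-weighted average of the two estimates.
    Returns (mean, sd, share of information contributed by the data).
    """
    total = prior_events + data_events
    mean = (prior_events * prior_mean
            + data_events * math.log(data_hr)) / total
    sd = 2.0 / math.sqrt(total)
    return mean, sd, data_events / total


def prob_improvement_exceeds(mean: float, sd: float,
                             base_surv: float, improvement: float) -> float:
    """Posterior probability that the absolute 2-year improvement with
    CHART is larger than `improvement` (log HR below the threshold)."""
    threshold = log_hr_for_survival(base_surv, base_surv + improvement)
    return STD_NORMAL.cdf((threshold - mean) / sd)


# Lung trial, final analysis: sceptical prior, 379 deaths, HR 0.75.
lung_mean, lung_sd, lung_weight = posterior(0.0, 117, 0.75, 379)
lung_probs = [prob_improvement_exceeds(lung_mean, lung_sd, 0.15, d)
              for d in (0.0, 0.05, 0.11)]

# Head and neck trial, final analysis: enthusiastic prior (reconstructed,
# see assumptions above), 464 events, HR 0.92.
enth_mean = log_hr_for_survival(0.45, 0.56)
enth_sd = abs(enth_mean) / STD_NORMAL.inv_cdf(0.92)
enth_events = 4.0 / enth_sd ** 2
hn_mean, hn_sd, hn_weight = posterior(enth_mean, enth_events, 0.92, 464)
hn_probs = [prob_improvement_exceeds(hn_mean, hn_sd, 0.45, d)
            for d in (0.0, 0.05)]

print("lung:", round(lung_weight, 2), [round(p, 2) for p in lung_probs])
print("head and neck:", round(hn_weight, 2), [round(p, 2) for p in hn_probs])
```

Under these assumptions the results come out close to the figures quoted above for both trials. Because the simplifications may differ from those in the appendix, agreement to the nearest percentage point should not be relied on.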

Discussion Several large clinical trials have been monitored by means of guidelines for efficacy that explicitly take into account the effect of the accumulating evidence on clinical opinion. An example of such a guideline is: “The data monitoring committee should only alert the steering committee to stop the trial if there is (a) proof beyond reasonable doubt that for all or for some types of patients, one particular treatment is indicated, and (b) evidence that might influence the management of a patient by a clinician who is already aware of the results of other main studies.”7,8 The monitoring procedure in the CHART trials formalises this notion by using the evidence in the trial that would be sufficient to lead a reasonable set of preclinical trial opinions from sceptics to enthusiasts to a common conclusion. We report the use of the Bayesian approach in providing information to the DMEC to help in its decision as to whether a trial should be stopped early. When deciding whether a randomised clinical trial should continue, the DMEC needs to consider many issues such as the estimated benefit of the new treatment, toxicity, ethics, quality of life, and cost. There may also be external issues to the trial which warrant its early termination. The Bayesian approach aims to capture some of these issues by placing emphasis on the estimate of the effect and its likely interpretation rather than on the p value (which is emphasised in conventional approaches to stopping). Using the Bayesian approach in these two examples, we have shown how a reasonable range of prior opinion can be translated into a quantitative measure of scepticism or enthusiasm, expressed as implicit prior data that are to be added to the accumulating trial data. Together with these prior data, the approach emphasises the importance of considering the minimum clinically worthwhile effect (or range of such effects). 
The prior data serve as a “handicap” which the trial must overcome to convince either a sceptic that there is a worthwhile difference between treatments, or an enthusiast that there is no worthwhile difference between treatments. The examples we present serve to emphasise the limited role of the p value as the principal means of assessing whether to stop a trial, as advocated in most conventional (frequentist) approaches to interim analysis.9,10 Nearly all


such frequentist stopping rules would have stopped the lung trial at the 1993 interim analysis. At that time, any published report would typically have reported an estimated improvement of 15% with a p value of 0·001. Although the estimated p value did not change very much at the final analysis in 1995, the final estimated effect size of 9% was substantially smaller than that in 1993. In fact, the p value changes little over the course of the years (from 1992 to 1995), whereas the estimate of the effect on 2-year survival changes strikingly, from the early optimistic 20% to a final 9%. The Bayesian (posterior) estimate of the effect remains relatively constant throughout the 4 years of the trial, ranging from a maximum of 9% in 1993 to 7% in 1995, which is very close to the final estimate of 9%. When the lung trial was analysed after its closure incorporating the sceptical prior distribution, there was only modest evidence (73%) of a difference as large as, or larger than, 5%, and only very slight evidence (9%) of a difference as large as, or larger than, 11%. Some might ask why, given that CHART was showing superiority, the trial was not continued until the sceptics would be convinced of the benefit. When the trial was designed, a target number of patients to be recruited into the trial was specified and funding was requested to cover the cost of a trial of this size. In the framework presented, the Bayesian approaches were used only to help the DMEC make a decision on whether to stop the trial before the target recruitment had been reached. The remaining scepticism in the results in the Bayesian analysis at the end of the trial is reflected in current clinical practice. In particular, although CHART is routinely used in several hospitals, many remain sceptical towards CHART and have not yet adopted it into routine practice. 
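The contrast between the data-based estimate and the sceptic's posterior estimate in the lung trial can be traced year by year with a small sketch. It uses the deaths and hazard ratios from table 1 and the 117 implicit deaths of the sceptical prior. As assumptions not confirmed by the text, it takes the weight of each source on the log hazard ratio scale to be proportional to its number of deaths, and converts hazard ratios to 2-year survival under proportional hazards from a baseline of 15%. The names used are hypothetical.

```python
import math

BASE_SURVIVAL = 0.15        # 2-year survival with conventional radiotherapy
SCEPTIC_PRIOR_DEATHS = 117  # implicit deaths behind the sceptical prior

# year: (deaths observed, hazard ratio), as reported in table 1
LUNG_INTERIMS = {
    1992: (78, 0.55),
    1993: (192, 0.63),
    1994: (275, 0.70),
    1995: (379, 0.75),
}


def improvement_from_hr(hr: float, base_surv: float = BASE_SURVIVAL) -> float:
    """Absolute 2-year survival gain implied by a hazard ratio,
    assuming proportional hazards (S_new = S_base ** HR)."""
    return base_surv ** hr - base_surv


def sceptic_posterior_hr(deaths: int, hr: float) -> tuple[float, float]:
    """Shrink the observed log HR towards zero (no difference).

    ASSUMPTION: weights are proportional to numbers of deaths.
    Returns (posterior hazard ratio, share of evidence from the data).
    """
    weight = deaths / (deaths + SCEPTIC_PRIOR_DEATHS)
    return math.exp(weight * math.log(hr)), weight


data_gain = {}
posterior_gain = {}
data_share = {}
for year, (deaths, hr) in LUNG_INTERIMS.items():
    post_hr, weight = sceptic_posterior_hr(deaths, hr)
    data_gain[year] = 100 * improvement_from_hr(hr)
    posterior_gain[year] = 100 * improvement_from_hr(post_hr)
    data_share[year] = weight
    print(f"{year}: data {data_gain[year]:.1f}%, "
          f"posterior {posterior_gain[year]:.1f}%, "
          f"data share {weight:.0%}")
```

For 1992 the data share is 78/195, the 40% given in the Methods. The data-based gain falls steadily over the years while the posterior gain stays within a narrow band, which is the pattern described above.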
In the head and neck trial, the data consistently, and through the years increasingly reliably, showed little or no evidence of a worthwhile benefit for CHART on the outcome of disease-free survival. Here, the data steadily overcame the enthusiasm of investigators for CHART at the beginning of the trial, such that the data are finally sufficiently reliable to persuade an enthusiast that in fact CHART provides no worthwhile benefit over conventional radiotherapy. Like frequentist approaches,9,10 the Bayesian approach is idealised, and within this framework we have endeavoured to keep the approach as simple as possible. Like all statistical methods, it should be regarded as a tool to help in the complicated decision-making process. In this context, we believe the approach provides substantial insight and guidance in the process of monitoring a trial. There are ways in which the approaches presented here can be extended. In these trials, we considered two prior distributions: a sceptical prior and an enthusiastic prior. These perhaps represent the two (reasonable) extremes of opinion and one could consider more. In particular, one might wish to have a portfolio of prior distributions, each of which represents opinions of different groups of clinicians. Alternatively, the subjective opinion of patients or consumers, particularly as to what effects might be worthwhile, could be different, and such information should perhaps be elicited before the trial begins to help inform decision-making. We are not aware of any published studies of such approaches, but it is one natural extension of the Bayesian approach presented here. However, such important issues are only raised in the Bayesian approach—we have never seen this issue formally discussed as part of conventional p-value-based stopping rules. To expect statistical methods to address all the issues that a DMEC has to take into account is unrealistic. One

aspect addressed qualitatively by the committee in these trials was that stopping accrual to one trial would have an effect on the accrual to the other trial. This issue probably could not be addressed in a quantitative manner in any statistical framework (Bayesian or not).

This type of Bayesian monitoring has been tested retrospectively on the UK Medical Research Council (MRC) multicentre randomised trials of misonidazole for head and neck cancer,4 neutron therapy for pelvic cancer,4 and chemotherapy for osteosarcoma.11 It has been used prospectively on these two CHART trials, on an MRC oesophageal cancer trial (OE02),12 and on an MRC trial of timing of delivery of babies with a growth restriction.13 Guidelines and examples addressing the role of DMECs in clinical trials have now started to appear.14–16

This Bayesian approach to monitoring is simple to implement and straightforward for the members of the DMEC to understand, and in our opinion more intuitively appealing than conventional approaches. To implement the Bayesian approach, beliefs need to be elicited from prospective collaborators. In our experience, this process has been the basis of a very positive interaction, making explicit the issues and information needed to help design and monitor large randomised clinical trials.

Contributors
Mahesh Parmar implemented the Bayesian approach and presented the results to the DMEC. Robert Souhami, Douglas Altman, and Emmanuel van der Scheuren were the members of the DMEC. Gareth Griffiths, Mahesh Parmar, David Spiegelhalter, Robert Souhami, and Douglas Altman wrote the paper.

CHART Steering Committee
A Barrett (Chairman), S J Arnott, D Ash, C K Bomford, P J Bourdillon, B Cottier, M Cuthbert, P Dawes, S Dische, W Forbes, A Harvey, J M Henk, T A Hince, A H Laing, R H MacDougall, D A L Morgan, F E Neal, H Newman, M K B Parmar, A G Robertson, R I Rothwell, M I Saunders, V H Svoboda, J S Tobias, M J Whipp, H Yosef.

Acknowledgments
The following clinicians completed a questionnaire which was used to create the enthusiastic prior distribution: S Arnott, J M Bozzino, B Cottier, P Dawes, S Dische, J M Henk, D A L Morgan, F E Neal, H Newman, J T Roberts, M H Robinson, R I Rothwell, M I Saunders, M J Whipp, H Yosef.

References
1 O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35: 549–56.
2 Parmar MKB, Spiegelhalter DJ, Freedman LS, et al. The CHART trials: Bayesian design and monitoring in practice. Stat Med 1994; 13: 1297–312.
3 Saunders MI, Dische S, Barrett A, Parmar MKB, Harvey A, Gibson D. Randomised multicentre trials of CHART vs conventional radiotherapy in head and neck and non-small-cell lung cancer: an interim report. Br J Cancer 1996; 73: 1455–62.
4 Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomised trials. J R Stat Soc 1994; 157: 357–416.
5 Dische S, Saunders M, Barrett A, Harvey A, Gibson D, Parmar MKB. A randomised multicentre trial of CHART versus conventional radiotherapy in head and neck cancer. Radiother Oncol 1997; 44: 123–36.
6 Saunders M, Dische S, Barrett A, Harvey A, Griffiths G, Parmar M. Continuous, hyperfractionated, accelerated radiotherapy (CHART) versus conventional radiotherapy in non-small cell lung cancer: mature data from the randomised multicentre trial. Radiother Oncol 1999; 52: 137–48.
7 AXIS protocol. Adjuvant X-ray and 5-Fu Infusion study. London: Medical Research Council, 1989.
8 ISIS3 protocol. Third international study of infarct survival. Oxford: Clinical Trial Service Unit and Department of Cardiovascular Medicine, Radcliffe Infirmary, 1989.
9 Demets DL, Lan KKG. Interim analysis: the alpha spending function approach. Stat Med 1994; 13: 1341–52.
10 Whitehead J. The design and analysis of sequential clinical trials, revised 2nd edn. Chichester: Wiley, 1997.
11 Spiegelhalter DJ, Freedman LS, Parmar MKB. Applying Bayesian ideas in drug development and clinical trials. Stat Med 1993; 12: 1501–11.
12 Fayers PM, Ashby D, Parmar MKB. Tutorial in biostatistics: Bayesian data monitoring in clinical trials. Stat Med 1997; 16: 1413–30.
13 Thornton JG. Growth Restriction Intervention Trial Protocol. Leeds: University of Leeds, 1996.
14 Armitage P. Data and safety monitoring in the Concorde and Alpha trials. Control Clin Trials 1999; 20: 207–28.
15 Freidlin B, Korn EL, George SL. Data monitoring committees and interim monitoring guidelines. Control Clin Trials 1999; 20: 395–407.
16 Demets DL, Pocock SJ, Julian DG. The agonising negative trend in monitoring of clinical trials. Lancet 1999; 354: 1983–88.

THE LANCET • Vol 358 • August 4, 2001
