Pitfalls and biases in the reporting and interpretation of the results of clinical trials


Lung Cancer 10 Suppl. 1 (1994) S143-S150

Mahesh K.B. Parmar
MRC Cancer Trials Office, 5 Shaftesbury Road, Cambridge CB2 2BW, UK

Abstract

There are numerous biases (not all statistical) that can occur in the reporting and interpretation of the results of clinical trials. In this paper we review some of the major sources of such biases and propose some solutions for these problems. In particular, we review the biases that can occur in non-randomised studies; the effects of exclusion of patients after randomisation; reporting on the subset of patients receiving 'full' treatment only; comparing responders with non-responders; and emphasising secondary endpoints or subgroups. We also discuss the correct interpretation of a P-value and show the need for both estimates of the treatment effect and confidence intervals for reliable interpretation of the results. Finally, we consider the important but less well discussed problems of the interpretation of both classically 'negative' and 'positive' trials and the impact of the early stopping of the trial on the estimate of the treatment effect.

Key words: Bias; Reporting; Interpretation; Randomised; Clinical trial; Positive trial; Negative trial

1. Introduction

The aim of any investigator conducting a clinical trial is to answer a therapeutic question in an unbiased and reliable manner. But even if the trial is designed so that this is possible, a number of problems can arise in the reporting and interpretation of the results of the trial which introduce biases. In this paper we discuss many of the 'biases' that can arise, and discuss means by which they can (and should) be avoided.

0169-5002/94/$07.00 © 1994 Elsevier Science Ireland Ltd. All rights reserved.
SSDI 0169-5002(93)00274-D


2. The need for randomisation

We know that patients with inoperable Stage III non-small cell lung cancer will generally (but not always) fare worse than patients with operable Stage I disease. However, given two apparently similar patients (that is, of similar age, grade, etc.) with Stage I disease, both treated in the same way, one may live no more than a few months and the other may well live longer than 5 years. Our inability to predict this difference in prognosis is due to our lack of knowledge of all the factors determining prognosis. Thus, when comparing two groups of patients, we need to decide whether the difference in outcome between them is due to the different treatments used, or to their intrinsically differing prognoses. This is clearly difficult when we do not know all the important prognostic factors. The issue is particularly important when we can expect no more than a moderate improvement in survival from the new treatment, as an observed difference between groups may then easily be due to differing, but unknown, prognoses rather than to any treatment effect. Unfortunately, moderate differences are as much as can be expected from many new therapies [16], and so we often face this dilemma. However, the process of randomisation allows us to assess whether the unknown prognostic factors are likely to be approximately evenly balanced between the treatment and control groups. The P-value tells us how likely it is that the observed difference is due to a chance maldistribution of prognostic factors: for example, a P-value of 0.01 tells us that on only 1 in 100 occasions would a difference as large as that observed arise from a chance imbalance in prognostic factors rather than from the treatment.
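This interpretation of the P-value can be checked by simulation. The sketch below (plain Python; the function names, sample sizes and survival probability are illustrative choices, not from the paper) randomises patients between two arms of a treatment that truly does nothing, and counts how often a conventional two-proportion z-test nevertheless declares P < 0.05. By construction this happens roughly 1 time in 20.

```python
import math
import random

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided z-test P-value for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

def false_positive_rate(n_trials=2000, n_per_arm=100, p_survive=0.5, seed=1):
    """Randomise patients to two arms of an ineffective treatment and count
    how often the comparison is 'significant' at the 5% level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        x1 = sum(rng.random() < p_survive for _ in range(n_per_arm))
        x2 = sum(rng.random() < p_survive for _ in range(n_per_arm))
        if two_prop_p_value(x1, n_per_arm, x2, n_per_arm) < 0.05:
            hits += 1
    return hits / n_trials

rate = false_positive_rate()  # close to 0.05, as the significance level promises
```

Identical arms give a P-value of 1, and the long-run rate of spurious 'significant' differences tracks the chosen significance level.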
Evenly balancing the known prognostic factors between the treatment and control groups can be useful (stratified randomisation or minimisation), but to reduce the possibility of an imbalance in unknown factors requires many hundreds, and often several thousands, of patients to be randomised (the number depending on the size of the treatment effect that is anticipated). Thus, the results of any non-randomised trial claiming a benefit for a new form of therapy, albeit encouraging, should be treated with caution [2]. It may be that in years to come we will know all or nearly all of the factors determining a patient's outcome, and randomisation will not be necessary. Until then, however, randomisation offers the only means of assessing whether the differences between two groups of patients can be reliably attributed to the effect of the new treatment. Because of this implicit need for randomisation we limit our further comments to randomised trials only.

3. Exclusion of patients from the analysis

It is common to find that after randomisation a certain percentage of patients do not satisfy the eligibility criteria. For example, after randomisation it may be found that a patient just fails the requirement for a specified renal clearance, or did not have an appropriate diagnosis (perhaps small cell instead of non-small cell). Should we include such patients in the analysis? The immediate intuitive answer is no, as they do not satisfy the eligibility criteria. On reflection, however, this may not be the best approach, since it may be that these 'ineligibilities' have come to light only
because of the treatment assigned. For example, in a trial of adjuvant chemotherapy, the renal clearance ineligibility mentioned above may come to light only if the patient has been allocated to receive adjuvant treatment, and we will not know how many patients allocated no chemotherapy have a similarly low clearance level. Excluding this patient could therefore immediately introduce bias. Similar considerations may hold for the diagnosis question. If all diagnoses are reviewed independently by a pathology review panel (who do not know the allocated treatment), this is not likely to introduce bias, as bias does not arise when the reason for exclusion is independent of the treatment assigned. Clearly, the best solution is to have no exclusions after randomisation, or to limit them to those that are known for certain to be independent of treatment [11].

4. Comparison of those patients receiving full treatment with control group patients

It is often interesting on the completion of a trial to focus on those patients who were able to receive full treatment, and to compare this group with the control group. For example, in a randomised trial of surgery plus three cycles of adjuvant chemotherapy versus surgery alone, it may be of interest to compare those receiving the full three cycles with the control group. Although this analysis may help us to investigate whether chemotherapy is effective in those patients able to tolerate it, it would be a mistake to interpret it as an unbiased comparison. The analysis is biased in favour of the adjuvant chemotherapy group, because those patients unable to tolerate three cycles of chemotherapy are likely to be those with a poorer prognosis (for example, poorer performance status or poorer renal function).

5.
Comparison of survival between responders and non-responders

A survival comparison of responders with non-responders is sometimes performed to assess the efficacy of the treatment [1,19]. Such a comparison is biased in two ways. First, responders are likely to be those patients who have a more favourable prognosis regardless of the treatment received: for example, they may have less bulky disease or better performance status, and they are also likely to have more favourable unknown prognostic factors. Second, those patients who die quickly are by definition non-responders, so there is an implicit time bias in the comparison: patients dying early are automatically classified as non-responders, while only patients surviving a reasonable length of time have the opportunity to become responders. This type of comparison almost always leads to the 'conclusion' that the treatment is 'effective' even when it is truly ineffective. These comparisons cannot therefore be recommended as a means of establishing therapeutic efficacy.

6. Emphasis on secondary endpoints

In any clinical trial a number of endpoints will be important, for example survival, disease-free survival, occurrence of metastases and local recurrence. Nevertheless, in the protocol for the trial a principal endpoint is usually specified, on which the design of the trial is based. In lung cancer this endpoint is usually length of survival. However, at the time of reporting the results, it may be tempting to emphasise other endpoints, such as disease-free survival or local recurrence, simply because a difference has been observed on that endpoint. There is an obvious bias involved in this practice [12], especially when we acknowledge the rather subjective judgement involved in the assessment of many of these endpoints.

7. Subgroup analysis

Most clinical trials collect considerable information on possible prognostic factors. Hence, different subgroups can be investigated to assess whether there is any evidence of a significantly advantageous (or deleterious) effect in any well defined subgroup. However, it is not generally recognised that if, for example, we have ten (disjoint) subgroups then, by chance alone, there is a 40% probability of finding at least one statistically significant (P < 0.05) false positive treatment effect [10,14]. Thus any observed subgroup effect, unless it is very extreme, can be regarded only as providing a hypothesis to be tested in the next clinical trial.

8. Emphasis on P-values, with little or no discussion of estimates of treatment effect or confidence intervals

The results of clinical trials are commonly presented with the main emphasis on the test of the hypothesis that there is no difference between the treatments [5,7,12]. A P-value is often reported to assess how much evidence there is against this hypothesis. The P-value, however, gives no indication of the size of a treatment effect; it indicates only the degree of reliability with which one can say a treatment effect is likely to exist. Thus presenting P-values alone, or over-emphasising them, can lead to their being given greater weight than they deserve.
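The 40% figure quoted in the subgroup-analysis section above is simply 1 - 0.95^10 ≈ 0.40. A few lines of Python (illustrative, assuming ten independent subgroup tests with uniformly distributed P-values under the null) confirm it both analytically and by simulation:

```python
import random

def prob_at_least_one_false_positive(k, alpha=0.05):
    """P(at least one of k independent tests gives P < alpha) when the
    treatment truly has no effect in any subgroup."""
    return 1 - (1 - alpha) ** k

analytic = prob_at_least_one_false_positive(10)  # about 0.40

def simulated(k=10, alpha=0.05, n_reps=5000, seed=2):
    """Monte-Carlo check: under the null each P-value is uniform on (0, 1)."""
    rng = random.Random(seed)
    hits = sum(any(rng.random() < alpha for _ in range(k)) for _ in range(n_reps))
    return hits / n_reps

empirical = simulated()
```

The simulation and the closed-form probability agree, which is why an isolated 'significant' subgroup should be treated as hypothesis-generating rather than confirmatory.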
In particular, there is a tendency to equate statistical significance with clinical importance [5]. Statistical significance is often defined arbitrarily as P < 0.05, whereas clinical importance depends on the size of the treatment effect and on the associated toxicity and cost of the new treatment, particularly important considerations in the choice of treatments for lung cancer patients. Table 1 illustrates this problem: for a fixed true difference between the treatments, the expected P-value depends only on the number of patients in the trial; the more patients in the trial, the smaller the P-value. It is easy to see from the table that the estimate of the relative risk and the associated confidence interval provide a better summary of the trial results. In particular, the confidence interval, which describes plausible values of the treatment effect, narrows as the size of the trial increases. In acknowledgement of the problem of over-emphasis of the P-value, many major medical journals now require all papers describing clinical trials to contain estimates of the treatment effect and confidence intervals [5].
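The distinction between statistical significance and clinical importance can be made concrete with a two-proportion z-test; the sample sizes and survival figures below are illustrative, not from the paper. With 100 000 patients per arm, a clinically trivial 0.5% survival difference is comfortably 'significant', while a worthwhile 5% difference in a 200-patient trial is not:

```python
import math

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided z-test P-value for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))

# A trivial 0.5% survival difference (50.0% vs 50.5%) becomes 'significant'
# once enough patients are entered...
p_trivial = two_prop_p_value(50_000, 100_000, 50_500, 100_000)

# ...while a worthwhile 5% difference (50% vs 55%) in a 200-patient trial
# is nowhere near conventional significance.
p_worthwhile = two_prop_p_value(50, 100, 55, 100)
```

The P-value conflates effect size with sample size, which is exactly why the estimate and its confidence interval are the more informative summary.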

Table 1
The expected estimate of treatment effect, 95% confidence interval and P-value which would be observed when conducting trials of different sizes, when the true difference between treatment and control is defined by a relative risk of 0.82. This relative risk corresponds to an improvement in survival from 50% to 55%.

Trial size | Relative risk | 95% confidence interval | P-value
-----------|---------------|-------------------------|----------
20         | 0.82          | (0.12, 3.79)            | 0.33
200        | 0.82          | (0.47, 1.43)            | 0.24
2000       | 0.82          | (0.69, 0.98)            | 0.01
20 000     | 0.82          | (0.77, 0.87)            | < 0.00001
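The pattern in Table 1 can be sketched numerically. The fragment below is a rough approximation, not a reproduction of the paper's calculations: it fixes the observed relative risk at 0.82, assumes roughly half of each arm dies (50% vs 45%), and uses the standard large-sample result SE(log HR) ≈ sqrt(1/d1 + 1/d2), where d1 and d2 are the numbers of deaths per arm. The exact figures differ slightly from Table 1, but the qualitative behaviour is identical: as the trial grows the P-value shrinks and the confidence interval narrows, while the point estimate is unchanged.

```python
import math

def hr_ci_and_p(hr, d1, d2, z_crit=1.96):
    """Approximate 95% CI and two-sided P-value for an observed hazard ratio,
    using SE(log HR) ~ sqrt(1/d1 + 1/d2) with d1, d2 deaths per arm."""
    log_hr = math.log(hr)
    se = math.sqrt(1 / d1 + 1 / d2)
    ci = (math.exp(log_hr - z_crit * se), math.exp(log_hr + z_crit * se))
    p = math.erfc(abs(log_hr) / se / math.sqrt(2))
    return ci, p

results = {}
for n in (20, 200, 2000):
    deaths_control = round(0.50 * n / 2)   # ~50% of control patients die
    deaths_treated = round(0.45 * n / 2)   # ~45% of treated patients die
    results[n] = hr_ci_and_p(0.82, deaths_control, deaths_treated)
```

Printing `results` shows the same relative risk of 0.82 throughout, with the interval collapsing around it as the number of events grows.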

9. Interpretation of 'negative' trials

It is common to define a 'negative' trial as one which finds no statistically significant difference between treatments [4,12,15]. However, it is clear from Table 1 that small or moderately sized trials that find no statistically significant difference between treatments are generally inconclusive, rather than negative. For example, if we assume that there is a 5% absolute improvement in survival (relative risk 0.82), a trial of 200 patients is likely to produce a P-value of 0.24, a conventionally non-significant result, leading to the reporting of a 'negative' trial. We should note that a 5% improvement in survival from a widely practicable treatment for lung cancer would prevent many thousands of deaths worldwide each year. It is only when the trial size reaches 2000 patients that we achieve a conventionally significant result (P = 0.01). If, however, a confidence interval describing plausible values of the treatment effect had been presented in the first case (the 200-patient trial), both investigators and readers would have known that a 5% benefit (relative risk 0.82) was indeed still a possibility.

10. Interpretation of 'positive' trials

Little attention has been paid to the interpretation of positive results: for example, 'how positive' does a result have to be before it is widely accepted that the new treatment is effective and should influence general clinical practice? This is a complex question which involves issues such as the representativeness of the trial patients; the prior belief (before the trial) of each individual clinician in the worth of the new treatment; the clinical relevance of the observed treatment effect (that is, when balanced against toxicity and cost); and the plausible values of the treatment effect (that is, the confidence interval).
A particular example of the differing interpretations that can arise followed the recent publication of a trial of neoadjuvant chemotherapy for patients with Stage III non-small cell lung cancer [3]. In this trial 153 patients were randomly assigned to
receive either radiotherapy alone or chemotherapy before radiotherapy. The investigators had actually planned to enter 240 patients, but stopped the trial early because of 'clear statistical evidence of a treatment difference serving as the statistical basis for study closure' [13]. Although this group evidently found the results sufficiently convincing to stop the trial early, and presumably to offer all subsequently eligible patients neoadjuvant chemotherapy, other groups did not. The trial encouraged the United States National Cancer Institute to launch and successfully complete a large Intergroup trial addressing the same question, while others [17] decided to continue the trials of neoadjuvant chemotherapy that they were concurrently conducting; these groups therefore kept a control arm of radiotherapy alone. It is beyond the scope of this paper to discuss the issues behind these decisions in detail (see reference [9] for a more detailed discussion, and a statistical means of assessing this difference in interpretation). However, it is quite clear that different groups of clinicians made different decisions partly because of the issues discussed above, that is, their prior scepticism about the effectiveness of neoadjuvant chemotherapy and their assessment of whether there was sufficient evidence to state that such treatment is truly worthwhile. From this example alone it must be recognised that different clinicians will interpret the results of a 'positive' trial in a variety of ways, and thus any recommendations arising from a trial (or trials) should recognise and acknowledge this variety of interpretations (see Parmar et al. [9] on how this might be done).

11. Early stopping of a trial
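Stopping a trial at an interim analysis because the observed difference looks extreme selects for random highs, so the reported estimate tends to overstate the true effect. A small simulation illustrates this (plain Python; the normal outcome model, true effect of 0.3, and a single interim look at a z threshold of 1.96 are illustrative assumptions, not taken from the paper):

```python
import math
import random

def run_trial(delta, n_interim, n_final, z_stop=1.96, rng=random):
    """One two-arm trial: outcomes ~ N(0, 1) on control, N(delta, 1) on treatment.
    A single interim look at n_interim patients per arm stops the trial if the
    z-statistic crosses z_stop; otherwise it runs to n_final per arm.
    Returns (stopped_early, estimated_treatment_effect)."""
    control = [rng.gauss(0, 1) for _ in range(n_final)]
    treated = [rng.gauss(delta, 1) for _ in range(n_final)]
    interim_est = (sum(treated[:n_interim]) - sum(control[:n_interim])) / n_interim
    if interim_est / math.sqrt(2 / n_interim) > z_stop:
        return True, interim_est  # stopped early: report the interim estimate
    return False, (sum(treated) - sum(control)) / n_final

def mean_estimate_among_early_stops(delta=0.3, n_interim=50, n_final=200,
                                    n_trials=3000, seed=3):
    """Average reported effect among the trials that stopped at the interim look."""
    rng = random.Random(seed)
    early = [est for stopped, est in
             (run_trial(delta, n_interim, n_final, rng=rng) for _ in range(n_trials))
             if stopped]
    return sum(early) / len(early)

true_effect = 0.3
inflated = mean_estimate_among_early_stops(delta=true_effect)
```

Among the trials that cross the stopping boundary, the average reported effect is well above the true value of 0.3, precisely the inflation discussed below.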

Most lung cancer trials take a number of years both to accrue and to follow up their patients. During this period of recruitment and follow-up, data will become available on the principal endpoints. If these accumulating data show a marked difference between the treatments, it may be decided to terminate the trial early and conclude that the new treatment is effective. It is well documented that care needs to be taken in making decisions based on such early data [8]: looking at the accumulating data on multiple occasions increases the likelihood of a false-positive result. This problem has generated many methods which aim to control the overall false-positive rate (Type I error) when performing successive tests of hypotheses on accumulating data [6]. However, it is less well recognised that stopping a trial early gives an inflated (biased) estimate of the treatment effect [17], and a confidence interval which is too narrow and incorrectly centred (that is, on the observed treatment effect). The heuristic explanation is that the trial has been stopped early precisely because an extreme treatment effect has been observed. As mentioned before, an unbiased and reliable measure of the treatment effect is a prerequisite for deciding whether a new treatment is clinically worthwhile. In my experience, however, very few trials terminated early attempt to provide estimates of the treatment effect adjusted towards zero, with correspondingly adjusted confidence intervals. This may be partly because there is no generally agreed and accepted means of adjusting the observed treatment effect appropriately: although we know it needs to be adjusted towards zero, we do not know by how much [7]. Similarly, with confidence intervals
we do not know by how much they should be expanded [7]. Nevertheless, publications presenting the results of trials which have stopped early should, at the very least, acknowledge that the reported treatment effect is likely to be an overestimate of the true treatment effect. This problem clearly deserves wider recognition and publicity.

12. Conclusions

There are many potential sources of bias in the reporting and interpretation of a clinical trial which can arise even when the trial was well designed. We have reviewed many of the principal sources of bias here. It is important that trial investigators, editors of medical journals and, perhaps most importantly, the readers of these journals are aware of the possible biases that can arise.

13. References

1 Anderson JR, Cain KC, Gelber RD. Analysis of survival by tumour response. J Clin Oncol 1982; 1: 710-19.
2 Byar DP, Simon RM, Friedewald WT, Schlesselman JJ, DeMets DL, Ellenberg JH, Gail MH, Ware JH. Randomized clinical trials: perspectives on some new ideas. N Engl J Med 1976; 295: 74.
3 Dillman RO, Seagren SL, Propert KJ, Guerra KJ, Eaton WL, Perry MC, Carey RW, Frei EF, Green MR. A randomized trial of induction chemotherapy plus high-dose radiation versus radiation alone in Stage III non-small cell lung cancer. N Engl J Med 1990; 323: 940-5.
4 Freiman JA, Chalmers TC, Smith H Jr, et al. The importance of beta, the type II error, and sample size in the design and interpretation of the randomised clinical trial. Survey of "negative" trials. N Engl J Med 1978; 299: 690-4.
5 Gardner MJ, Altman DG. Estimation rather than hypothesis testing: confidence intervals rather than P values. In: Gardner MJ, Altman DG, editors. Statistics with confidence. London: British Medical Journal, 1989; 6-19.
6 Geller NL, Pocock SJ. Interim analyses in randomised clinical trials: ramifications and guidelines for practitioners. Biometrics 1987; 43: 213-23.
7 Hughes MD, Pocock SJ. Stopping rules and estimation problems in clinical trials. Stat Med 1988; 7: 1231-42.
8 McPherson K. The problem of examining accumulating data more than once. N Engl J Med 1974; 290: 501-2.
9 Parmar MKB, Ungerleider R, Simon R. Why and when should we perform confirmatory trials? (in preparation).
10 Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J, Smith PG. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. Introduction and design. Br J Cancer 1976; 34: 585-612.
11 Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J, Smith PG. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. Analysis. Br J Cancer 1976; 35: 1-39.
12 Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials. N Engl J Med 1987; 317: 426-32.
13 Propert KJ, Dillman RO, Seagren SL, Green MR. Chemotherapy and radiation therapy as compared with radiation therapy in Stage III non-small cell lung cancer. N Engl J Med 1991; 324: 1136-7.
14 Simon R. Patient subsets and variations in therapeutic efficacy. Br J Clin Pharmacol 1982; 14: 473-82.

15 Simon R, Wittes RE. Methodologic guidelines for reports of clinical trials. Cancer Treat Rep 1985; 69: 1-3.
16 Souhami RL. Large scale studies. In: Williams CJ, editor. Introducing new treatments for cancer: practical, ethical and legal problems. London: John Wiley and Sons, 1992; 173-88.
17 Souhami RL, Spiro SG, Cullen M. Chemotherapy and radiation therapy as compared with radiation therapy in Stage III non-small cell lung cancer. N Engl J Med 1991; 324: 1136.
18 Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomised trials. Submitted.
19 Weiss GB, Bunce H, Hokanson JA. Comparing survival of responders and non-responders after treatment: a potential source of confusion in interpreting cancer clinical trials. Controlled Clin Trials 1983; 4: 43-52.