Clinical Trials as Diagnostic Tests

Joseph L. Pater and Andrew R. Willan
National Cancer Institute of Canada
ABSTRACT: Concepts used in evaluating the results of diagnostic tests have been applied to clinical trials by several authors, and each has reached the same conclusion: positive trials are more often falsely positive than would intuitively be expected. This conclusion is, however, based on assumptions that require close examination. First, it depends upon equating the power of a clinical trial with the sensitivity of a diagnostic test. Although it is possible to define circumstances in which the two are equivalent, decisions made on the basis of the results of clinical trials usually employ a broader definition of "true positive" than, it is shown, is implied by equating sensitivity with power. Second, it is assumed that one can speak meaningfully of the baseline "prevalence" of positive trials. The practical application of this concept can be shown to be extremely difficult. Thus, although approaching clinical trials as if they were a type of diagnostic test is superficially appealing, it may lead to misleading conclusions.

KEY WORDS: Decision theory, false-positive clinical trials.
INTRODUCTION

Randomized clinical trials are widely accepted as the methodologically most satisfactory means of reaching conclusions about the comparative efficacy of different therapies [1]. Despite this acceptance, some authors claim that clinical trials may produce misleading results more often than is generally realized [2-5]. The basis for these arguments has been, to a large extent, to approach the outcomes of clinical trials as if they were the results of diagnostic tests [4]. Just as in the case of diagnostic tests, it can then be shown that, under apparently plausible assumptions, the frequencies of "false-positive" and "true-negative" trials are much higher than would intuitively be expected. Three sets of authors have taken this approach, each from a slightly different point of view. Peto et al. [3], in arguing for the necessity of large clinical trials, created a table (p. 594) that "demonstrates" that small positive clinical trials are more often than not "falsely positive." In constructing this table they assumed that:

1. Of all trials testing two treatments, the treatments actually differ in 16.7%.
2. The frequencies with which true differences would be overlooked and false differences declared are β and α, respectively.
In the case of a "small" trial, that is, one of such a size that 1 − β is 0.25, Peto et al. show that, given an α of 0.05, a "positive" result is as likely to occur when two treatments of equal efficacy are being compared as when one treatment is truly superior. Though Peto et al. do not draw the analogy, the similarity of this presentation to those in many of the publications dealing with the evaluation of diagnostic tests [6] is obvious if one equates the "prevalence" of the disease to the overall frequency of trials in which the treatments truly differ, 1 − sensitivity to β, and 1 − specificity to α. Thus, substituting Peto et al.'s values into the formula for calculating the predictive value of a positive test from prevalence, sensitivity, and specificity produces the same numerical result:

Predictive value = (Prevalence × Sensitivity) / [Prevalence × Sensitivity + (1 − Prevalence) × (1 − Specificity)]
                 = (0.167 × 0.25) / (0.167 × 0.25 + 0.833 × 0.05) = 0.50
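The arithmetic behind this predictive-value calculation is easy to verify. The following is a minimal Python sketch (our addition, not part of the original paper; the function name ppv and its argument names are ours) that reproduces the 0.50 figure from Peto et al.'s assumed values.

```python
def ppv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Positive predictive value via Bayes' theorem for a dichotomous test."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Peto et al.'s assumed values: 16.7% of trials compare truly different
# treatments, power (1 - beta) = 0.25, alpha = 0.05.
print(round(ppv(prevalence=0.167, sensitivity=0.25, specificity=0.95), 2))  # 0.5
```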
Staquet and his coauthors [4] have been more explicit in drawing the analogy between clinical trials and diagnostic tests. Their formula for the "error in positivity" of a clinical trial is the following:

δ = (1 − TER) α / [(1 − TER) α + TER (1 − β)]

where δ = the "error in positivity" and TER (true effectiveness ratio) = the prior probability of a positive outcome. As stated by the authors, this formula is simply an algebraic manipulation of the more familiar form of Bayes' theorem as applied to diagnostic tests [7]:

P(D+|T+) = P(D+) × P(T+|D+) / P(T+)

where D+ represents "disease" or "truth," T+ represents a positive "test" or "result," and D− and T− their opposites. Staquet et al.'s equation can be expressed in the usual notation as follows:

P(D−|T+) = P(D−) × P(T+|D−) / [P(D−) × P(T+|D−) + P(D+) × P(T+|D+)]
if the following are considered equivalent:

    P(D−|T+) and 1 − the positive predictive value (the error in positivity)
    P(D+) and TER (the disease prevalence; the prior probability of a positive trial)
    P(T+|D+) and 1 − β (the sensitivity of the test)
    P(T+|D−) and α (1 − specificity of the test)

On the basis of what they consider to be plausible values for TER, Staquet et al. show that positive trials may often be falsely positive, but negative trials are usually truly negative.
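This conclusion can be reproduced numerically. The Python sketch below is ours: it evaluates δ, together with the corresponding error for negative trials, for a small trial with α = 0.05 and 1 − β = 0.25, over a few values of TER chosen only for illustration.

```python
def error_in_positivity(ter, alpha, power):
    """Staquet et al.'s delta: P(no true difference | positive trial)."""
    return (1 - ter) * alpha / ((1 - ter) * alpha + ter * power)

def error_in_negativity(ter, alpha, power):
    """Counterpart for negative trials: P(true difference | negative trial)."""
    return ter * (1 - power) / (ter * (1 - power) + (1 - ter) * (1 - alpha))

alpha, power = 0.05, 0.25   # the small-trial values used above
for ter in (0.1, 0.167, 0.3, 0.5):
    print(f"TER={ter:5.3f}  delta={error_in_positivity(ter, alpha, power):.2f}  "
          f"neg. error={error_in_negativity(ter, alpha, power):.2f}")
```

For TER near Peto et al.'s 0.167, about half of all positive results are false, while only about 14% of negative results conceal a true difference: exactly the asymmetry these authors report.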
Finally, Ciampi and Till [5] have approached this problem from a much more explicitly Bayesian viewpoint, with similar results. Thus, three sets of authors have arrived at the same conclusion: "positive" clinical trials are more often falsely positive, and "negative" clinical trials more often truly negative, than is generally assumed. The question is whether there are assumptions hidden in their approach that invalidate these superficially reasonable and practically very important conclusions. The next two sections examine the major assumptions made in treating the outcomes of clinical trials as if they were the results of diagnostic tests.
THE EQUIVALENCE OF α TO 1-SPECIFICITY AND β TO 1-SENSITIVITY
One important feature of the calculations illustrated above is that they treat the Type I and Type II errors of clinical trials as if they were equivalent to the false-positive and false-negative rates of a diagnostic test. That is, they assume that α = 1-specificity and β = 1-sensitivity. It seems reasonable to argue that α = 1-specificity, since α expresses the proportion of times a difference will be "diagnosed" although, in fact, no difference is present. However, β can only be considered equivalent to 1-sensitivity in the limited sense that it expresses the proportion of times a difference of a given magnitude will not be "diagnosed" when, in fact, it is present. It is this limitation that renders the analogy between clinical trials and diagnostic tests misleadingly inexact. In the case of diagnostic tests, reality is dichotomous (a disease is present or not) and the denominators of sensitivity and specificity include all possible conditions. Reality in the context of a clinical trial is not, however, dichotomous in the same sense: there may be no difference at all between the treatments; there may be a difference anywhere in the wide range below that specified in the definition of β; and there may be a difference of at least the degree specified in the definition of β. It is inappropriate, then, to use 1 − β as a substitute for P(T+|D+) in the application of Bayes' theorem since, if it is so used, the denominator of the equation will not include all the circumstances under which the "test" (trial) can be positive. This point can, perhaps, be more clearly understood by referring to the denominator of the equation for the error in positivity given earlier, i.e., (1 − TER) α + TER (1 − β). As mentioned earlier, this formulation is considered by Staquet et al. to be algebraically equivalent to Bayes' theorem. If so, this denominator should equal P(T+), i.e., it should express all circumstances under which a positive "test" or "result" can occur. However, in Staquet et al.'s application, (1 − TER) α + TER (1 − β) does not include all situations in which a "test" or "trial" can be positive. Specifically, because 1 − β is substituted for sensitivity, it includes those situations in which there is a difference of the magnitude specified in the calculation of β, but excludes those in which there is a true difference of lesser magnitude. Reduction of the denominator by this amount results in a proportional (and artifactual) inflation of the "error in positivity." This is especially true for small trials, for which the treatment difference specified in the definition of β most likely exceeds the smallest difference considered clinically significant. Thus, the analogy of β to 1-sensitivity is least exact in precisely the circumstances, small clinical trials, where the arguments based on it have the most impact.
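To make the omitted-denominator point concrete, here is a small numerical sketch (ours; the three-state truth mixture and the effect sizes are assumed purely for illustration). It computes P(T+) for a one-sided z test both with and without trials whose true difference lies below Δ, and shows how dropping that middle term inflates the apparent error in positivity.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(effect, z_alpha=1.645):
    """Probability that a one-sided level-0.05 z test is significant when the
    true standardized effect (difference / SE of its estimate) is `effect`."""
    return 1 - phi(z_alpha - effect)

# Assumed illustrative distribution of "truth" over three states (ours):
p_none, p_below, p_at_delta = 0.70, 0.15, 0.15      # must sum to 1
effect_below, effect_at_delta = 0.50, 0.97           # powers ~0.13 and ~0.25

pr_pos_full = (p_none * power(0.0)                   # alpha-level positives
               + p_below * power(effect_below)       # positives the analogy omits
               + p_at_delta * power(effect_at_delta))
pr_pos_dichotomous = p_none * power(0.0) + p_at_delta * power(effect_at_delta)

# "Error in positivity" with the middle state dropped vs. restored:
print(round(p_none * power(0.0) / pr_pos_dichotomous, 2))   # ~0.48, inflated
print(round(p_none * power(0.0) / pr_pos_full, 2))          # ~0.38
```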
The fallacy of the analogy of β to 1-sensitivity can be further appreciated if one considers what is meant by a "false positive" in the two different circumstances. In the context of diagnostic tests, a false-positive result means that the test is positive when the disease is not present. The definitions of specificity and sensitivity given above imply, however, that a clinical trial yielding a "significantly" positive result is "falsely" positive if, in reality, a difference between the two treatments is present but is of lesser magnitude than that specified in the definition of β. This seems contrary to logic and explains, in part, why the authors have concluded that positive trials are often falsely positive. An analogous distortion of the meaning of "true negative" explains why they conclude that negative studies are usually "truly" negative. That is, a nonsignificantly positive result in the presence of a real difference which is, however, less than that specified in the definition of β is considered "truly negative." Figure 1 is an attempt to illustrate this point. This four-fold table has been constructed to display the results of a comparison of an experimental (E) with a control (C) treatment. For the sake of clarity, a one-sided test is used. Given the definitions of truth and positivity listed, α = b/(b + d) and β = c/(c + a). Thus, this is the four-fold table implied by equating α to 1-specificity and β to 1-sensitivity. However, it is obvious that the ratio a/(a + b), analogous to the true referral rate of a diagnostic test, is meaningless, since the denominator again does not include some of the circumstances in which a positive result can occur. A more appropriate table would be that given in Figure 2. The problem with this second formulation is how to regard results in the middle column; namely, are they "false" or "true" positives and negatives?
Figure 1.

                         TRUTH
                    E ≥ C + Δ (HA)   E ≤ C (H0)
RESULT OF TRIAL
  Positive (p ≤ α)        a               b
  Negative (p > α)        c               d

Δ = difference specified in the definition of β
Figure 2.

                         TRUTH
                    E ≥ C + Δ (HA)   C < E < C + Δ   E ≤ C (H0)
RESULT OF TRIAL
  Positive (p ≤ α)        a                a'              b
  Negative (p > α)        c                c'              d

Δ = difference specified in the definition of β

In fact, all three sets of authors cited appear, by implication, to consider a' a "false positive" but c' a "true negative." This explains, in part, why they conclude that positive trials are, more often than expected, "falsely" positive and negative trials "truly" negative. These implied evaluations of results a' and c' are at best debatable. For example, judged from the pragmatic viewpoint of Schwartz, Flamant, and Lellouch [8], a' would be a "true positive," since it leads to the choice of the superior therapy.
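The consequence of where one files a' can be shown with a short simulation (ours; the effect-size mixture and seed are assumed purely for illustration). Trials are simulated under the three truth columns of Figure 2, and the proportion of positive results that are "true" is computed twice: once counting only cell a (the strict reading implied by equating 1 − β with sensitivity) and once counting a + a' (the pragmatic reading of Schwartz et al. [8]).

```python
import random

random.seed(1)
Z_ALPHA, N = 1.645, 10_000          # one-sided 5% test; trials per truth state

def positive_rate(effect):
    """Simulate z statistics for trials with true standardized effect `effect`;
    return the fraction declared positive (p <= alpha, i.e., z >= Z_ALPHA)."""
    return sum(random.gauss(effect, 1) >= Z_ALPHA for _ in range(N)) / N

# Truth mixture over Figure 2's three columns (assumed for illustration):
p_h0, p_mid, p_ha = 0.70, 0.15, 0.15
b  = p_h0  * positive_rate(0.0)     # positives with no true difference
a_ = p_mid * positive_rate(0.5)     # positives with a real but sub-Delta difference (a')
a  = p_ha  * positive_rate(0.97)    # positives with a difference of at least Delta

print(f"strict PPV     a/(a+a'+b)     = {a / (a + a_ + b):.2f}")
print(f"pragmatic PPV  (a+a')/(a+a'+b) = {(a + a_) / (a + a_ + b):.2f}")
```

Under these assumed values the "predictive value" of a positive trial rises from roughly 0.4 to roughly 0.6 simply by reclassifying a' as a true positive, without changing any trial result.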
THE PREVALENCE OF POSITIVE TRIALS
A more subtle difficulty with this approach lies in the application of the concept of the prevalence, or "a priori probability," of positive trials. Again, in the context of diagnostic tests the concept has a clear meaning and refers to an entity that is at least potentially measurable: how many individuals in the population being studied have the disease in question. Further, this quantity can be (and may have been) determined by means other than the test under scrutiny. The analogous concept as applied to clinical trials is much less clear and not even potentially measurable. The different approaches used by the authors mentioned above illustrate this difficulty. Peto et al. and Staquet et al. both estimate the "prevalence" of positive trials on the basis of past experience, i.e., the proportion of previous trials that have had positive results. Both sets of authors fail, however, to specify which previous trials are relevant to the question of whether a current study is dealing with treatments that in reality differ. On reflection, the only trials that could possibly be relevant to this question are those that addressed virtually the same therapeutic issue. If so, the argument presented above is pertinent only to the interpretation of the results of "confirmatory" studies, an important but much narrower issue. Ciampi and Till's approach to this problem is much more "Bayesian." They ask (our paraphrase): "What is the probability, on the basis of everything that is known about the treatment to be studied, that this particular trial will have a positive result?" They go on to argue that, because of ethical concerns, this probability must of necessity be low. Ciampi and Till's approach has the virtue of addressing the specific probability that a given trial will be positive. In addition, one can at least imagine a group of investigators reaching a consensus about the "prior probability" of a positive result. However, if one holds that the results of a randomized trial are themselves the most powerful evidence for or against the effectiveness of a method of treatment, it is hard to accept that this "subjective" prior estimate should be given equal weight in making decisions on the basis of the outcome of a trial [9]. Furthermore, Ciampi and Till's argument that, for ethical reasons, there can be only a low probability of a "positive" outcome of a study holds only in the situation they describe: "one-sided" trials with a single outcome. It must be ethical to embark upon "two-sided" studies of treatments between which it is virtually guaranteed that some real difference exists, either in effectiveness or in toxicity, if one can honestly say that, on the basis of present knowledge, an informed choice cannot be made between them.
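How much the assumed "prevalence" drives the final verdict is easy to display by varying the prior in the same Bayes calculation. In the sketch below (ours; the small-trial values α = 0.05 and 1 − β = 0.25 are carried over from the earlier examples, and the grid of priors is arbitrary), the posterior probability of a true difference given a positive result swings from about 0.2 to above 0.9 as the unmeasurable prior moves.

```python
def posterior_true_difference(prior, alpha=0.05, power=0.25):
    """P(true difference | positive trial) under the diagnostic-test analogy."""
    return prior * power / (prior * power + (1 - prior) * alpha)

for prior in (0.05, 0.167, 0.5, 0.8):
    print(f"prior={prior:.3f} -> posterior={posterior_true_difference(prior):.2f}")
```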
SUMMARY AND CONCLUSIONS

The application of principles useful in the evaluation of diagnostic tests to the results of clinical trials is based, therefore, on misleading assumptions and faces nearly insurmountable difficulties in practice. Thus, conclusions based on this approach cannot be accepted at face value, although it is difficult to argue with the general principle that the results of a study cannot be considered in isolation.
REFERENCES

1. Byar DP: Why data bases should not replace randomized clinical trials. Biometrics 36:337-342, 1980
2. Freiman JA, Chalmers TC, Smith H, Kuebler RR: The importance of beta, the Type II error and sample size in the design and interpretation of the randomized control trial. N Engl J Med 299:690-694, 1978
3. Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J, Smith PG: Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 34:585-611, 1976
4. Staquet MJ, Rozencweig M, Von Hoff DD, Muggia FM: The delta and epsilon errors in the assessment of cancer clinical trials. Cancer Treat Rep 63:1917-1921, 1979
5. Ciampi A, Till JE: Null results in clinical trials: The need for a decision-theory approach. Br J Cancer 41:618-629, 1980
6. Griner PF, Mayewski RJ, Mushlin AI, Greenland P: Selection and interpretation of diagnostic tests and procedures. Ann Intern Med 94 (part 2): 557-600, 1981 7. Lusted L: Introduction to medical decision making. Springfield, Illinois: Charles C Thomas, 1968 8. Schwartz D, Flamant R, Lellouch J: Clinical trials. New York: Academic Press, 1980 9. Pocock SJ: The combination of randomized and historical controls in clinical trials. J Chronic Dis 29:175-188, 1976