EDITORIALS
Sensitivity and Specificity of Tests: Can the "Silent Majority" Speak?

MORTON E. TAVEL, MD, NATHAN H. ENAS, MS, and JOHN R. WOODS, PhD

From the Department of Medical Research, Methodist Hospital Graduate Medical Center, Indianapolis, Indiana. Manuscript received May 7, 1987; revised manuscript received and accepted June 18, 1987.

Diagnostic tests are virtually never perfect in the detection of a given disease; that is, a certain proportion of persons with the disease in question will have normal test results ("false negatives"), and, conversely, some of those without this disease will yield abnormal results ("false positives"). The percentages of false negatives and false positives for any given test are encompassed in the concepts of "sensitivity" and "specificity." Sensitivity is simply defined as the percentage of those persons with a given disease testing positively, whereas specificity is defined as the percentage of those without disease testing negatively. These test proportions have long been considered fixed attributes of diagnostic tests, provided that one uses a single and constant boundary between a "normal" and an "abnormal" test result. An example of a typical boundary might be the requirement of ≥0.1 mV electrocardiographic ST depression to designate an exercise response as "positive." In order to establish sensitivity and specificity values for a given test, the usual strategy is to obtain a large sample of subjects with and without the disease in question (in our example, coronary disease) through the use of another, more definitive, objective means, such as coronary cineangiography. One then performs the test to be evaluated (exercise stress test) in all such patients without prior knowledge of their disease status. The test to be evaluated cannot be used to select which patients are subjected to the definitive test procedure, for this would risk selectively biasing the resulting sensitivity and specificity values. Theoretically, as long as one obtains large, representative samples of patients with and without coronary disease, and as long as the cutoff point remains constant (0.1 mV ST depression in our example), the derived sensitivity and specificity values should remain fairly constant when applied in future testing situations.
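As a simple illustration of these definitions, the following sketch computes sensitivity and specificity for a single, constant cutoff of 0.1 mV ST depression; the small data set and the function are wholly hypothetical and serve only to show the arithmetic.

```python
# Minimal sketch: sensitivity and specificity for a fixed cutoff (>= 0.1 mV
# ST depression).  The data below are invented purely for illustration.

def sensitivity_specificity(st_depression_mv, has_disease, cutoff=0.1):
    """Return (sensitivity, specificity) for a single, constant cutoff."""
    test_positive = [st >= cutoff for st in st_depression_mv]
    tp = sum(1 for pos, d in zip(test_positive, has_disease) if pos and d)
    fn = sum(1 for pos, d in zip(test_positive, has_disease) if not pos and d)
    tn = sum(1 for pos, d in zip(test_positive, has_disease) if not pos and not d)
    fp = sum(1 for pos, d in zip(test_positive, has_disease) if pos and not d)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical verified sample: exercise ST depression (mV) and angiographic
# disease status for 8 subjects.
st = [0.25, 0.05, 0.15, 0.00, 0.20, 0.10, 0.05, 0.30]
disease = [True, True, True, False, False, True, False, False]
sens, spec = sensitivity_specificity(st, disease)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```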
Seemingly in contrast to the idea that sensitivity and specificity are fixed values, investigators from Duke University1 found that, when studying ST responses to exercise, sensitivity varied considerably with the clinical history, age, extent of disease and treadmill performance. These investigators acknowledged that several others had also found that severity of disease, as indicated by the number of diseased coronary vessels, influences the sensitivity of the stress test response. Specificity was also variable, but to a lesser extent. Subsequently, Diamond,2 using data from the Duke study, suggested that sensitivity-specificity test characteristics are quite variable for reasons differing from those given by the Duke group. Although the Duke University group suggested that the sensitivity of ST depression depended on additional clinical information, such as the appearance of this electrocardiographic pattern at low levels of stress, Diamond suggested that the variability of test sensitivity could be explained by differing frequencies of positive tests within the clinical subsets of patients. He invokes Bayes' theorem to explain these results, and then proceeds one step further, suggesting that test sensitivity can be estimated from a group of subjects in which only a portion of those tested have been referred for the definitive test of disease verification. In disregard of previous guidelines, he proposes a mathematical method for deriving sensitivity and specificity values for a test that has already been used to select those subjects to undergo definitive disease verification. His method "corrects" or "debiases" the values derived from those receiving verification by using members of a larger group ("silent majority") who have been subjected to the test in question, but who, for some unknown reason, have not been referred for verification.
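For orientation, the sketch below shows the general form of a verification-bias correction of this type (after Begg and Greenes5, whom Diamond cites): predictive values observed in the verified subset are combined with the test-positive rate among all tested subjects, under the assumption that referral for verification depends on the test result alone. The numeric values are assumed for illustration, not taken from any of the studies discussed.

```python
# Sketch of a Begg-Greenes-style verification-bias correction: debias
# sensitivity and specificity using predictive values from the VERIFIED subset
# plus the test-positive rate among ALL tested subjects.  Valid only if
# referral for verification depends on the test result alone.

def corrected_sens_spec(ppv_verified, npv_verified, p_test_positive):
    """Return verification-bias-corrected (sensitivity, specificity)."""
    p_pos, p_neg = p_test_positive, 1 - p_test_positive
    sens = (p_pos * ppv_verified) / (p_pos * ppv_verified + p_neg * (1 - npv_verified))
    spec = (p_neg * npv_verified) / (p_neg * npv_verified + p_pos * (1 - ppv_verified))
    return sens, spec

# Illustrative case in which the assumption holds, so the verified predictive
# values equal the population values: true sensitivity 0.70, specificity 0.90,
# prevalence 0.30 imply 28% positive tests, PPV = 0.75 and NPV = 0.875.
sens, spec = corrected_sens_spec(ppv_verified=0.75, npv_verified=0.875, p_test_positive=0.28)
print(f"corrected sensitivity = {sens:.2f}, corrected specificity = {spec:.2f}")
```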
If one allows the results of a given test to influence referral for disease verification, the resulting "referral bias" will hamper any attempt to assess the sensitivity and specificity of that test. Sensitivity results are apt to be spuriously high and specificity is likely to be falsely low. In the case of specificity, for instance, we may use a simple example to illustrate this cause of error. Suppose we wish to determine the specificity of a certain stress radiographic procedure designed for the detection of coronary artery disease. Let us assume that the actual specificity (which we haven't yet determined) is 90%, meaning that 90% of all persons without coronary disease will respond negatively to our new test. Thus, 10% of this group will have a false-positive test result. Now let us assume that we sample from a group of 100 individuals, which happens to contain no persons with coronary disease. We test all 100 with our new technique and, as expected, find 10 who test positively. Because of the aforementioned referral bias, we decide not to verify all 100 persons with the definitive test, namely, coronary cineangiography. We choose instead to take all 10 positive responders and an additional 10 negative responders as our group to undergo definitive verification. Therefore, of this group of 20, all of whom possess normal coronary arteries, the 10 who test positively do so falsely. This would lead to the mistaken conclusion that the specificity of our new test is low (50%); that is, only 50% of all those verified to have no disease have negative tests, a gross underestimation of the actual specificity. For an actual example of how this type of error has led to underestimation of specificity, one may cite a recent study by Rozanski et al3 in which the nuclear ventriculogram was used to a variable extent as the basis for referral for verification with coronary cineangiography. Not surprisingly, during the later years in which the test was used increasingly for referral, test specificity showed a commensurate decline.3
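The arithmetic of the hypothetical example above is summarized in the following sketch, using the same figures (true specificity 90%, 100 disease-free subjects, all positive responders but only 10 negative responders referred for verification).

```python
# Sketch of the referral-bias arithmetic described above, with the editorial's
# hypothetical numbers: 100 disease-free subjects and a true specificity of 90%.

disease_free = 100
true_specificity = 0.90

false_positives = round(disease_free * (1 - true_specificity))  # 10 positive tests
true_negatives = disease_free - false_positives                 # 90 negative tests

# Referral for angiographic verification favors positive responders:
# all 10 positives, but only 10 of the 90 negatives, are verified.
verified_positives = false_positives
verified_negatives = 10

# Apparent specificity computed only among the verified subjects,
# all of whom are actually disease-free.
apparent_specificity = verified_negatives / (verified_negatives + verified_positives)
print(f"true specificity = {true_specificity:.0%}")          # 90%
print(f"apparent specificity = {apparent_specificity:.0%}")  # 50%
```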
In the attempt to derive true test sensitivity and specificity in situations in which referral bias has occurred, a Bayesian redefinition of sensitivity-specificity in terms of differing patterns of referral for verification2 is flawed for several reasons. First, severity of disease can influence test sensitivity.4 The Duke data demonstrate, for instance, that if one tests only persons with 3-vessel coronary disease, the sensitivity of the ST response is approximately 85%, whereas the value derived for those with 1-vessel disease is about 48%. Thus, the proportion of positive ST responders would logically be greater in the group with 3-vessel disease than in the group with only 1 artery involved. Although the sensitivity of the test happens to be less in groups with fewer positive tests, one cannot conclude that test sensitivity can be mathematically derived from the frequency of positive test responders in a given population, particularly as one moves outside of Duke's observed prevalence boundaries. Virtually all the variations of sensitivity in the Duke data can be explained by the aforementioned mechanism of differing disease severity. Unfortunately, the Duke results also may not be totally accurate, for these investigators relied on a group in which prior treadmill testing was not excluded as a criterion for referral for disease verification.

Diamond also states that, similar to derived sensitivity, specificity is also influenced by the frequency of positive test responders in the various populations studied, and he presents graphs that seem to support this contention. The Bayesian "expected" specificity curve is hyperbolic, while the Duke data yield little more than a horizontal line, for these investigators, with one exception, did not demonstrate any significant difference in specificity among any of the groups without coronary disease. Their lone exception was found in the group manifesting ST changes at higher maximum heart rates. In this latter instance, selective referral for verification based more heavily on electrocardiographic changes alone could have explained the lower specificity (greater proportion of false-positive ST responses) within this group. When one looks carefully at the Bayesian expected curves,2 it is clear that the observed values of sensitivity and specificity do not conform to the expected curves, because they cluster either above or below these curves. A standard analysis of residuals would probably have indicated this fact clearly, yet none seems to have been performed. Although Diamond claims that the Bayesian curves explain 63% of the variance in sensitivity-specificity, he does not base this observation on the explanatory value of these curves themselves. Instead, he bases it on an analysis of individual points, which do not adequately reflect the nature and full range of variability he desires to explain. To make a statement about percentage variation from such a hyperbolic model, a nonlinear regression analysis would be required.

Up to this point, we have considered how varying spectra of disease and referral bias can invalidate any method of establishing test sensitivity and specificity. Although the Bayesian analysis is mathematically correct, when only a portion of those tested are verified by a definitive test, a Bayesian approach to determining sensitivity and specificity is almost certainly doomed to failure for the following reasons. First, Diamond's analysis is based on the flawed assumption that "angiographic referral is conditionally independent of disease status...and predictive accuracy is conditionally independent of angiographic referral."2 This assumption is vital to such an analysis, for it allows one to use Bayes' formula to calculate sensitivity-specificity values among only those referred to angiography for verification of disease status. Diamond quotes Begg and Greenes5 for support of this contention, yet he omits a key element of their discussion. That element is their concomitant information vector "X," which "represents an exhaustive list of the factors that can influence disease status and selection for verification." These factors might include clinical manifestations, such as the presence of chest pain or dyspnea on minimal exertion. Thus, in apparent contrast with the previous statement about angiographic referral, Diamond states that "because angiographic referral does not depend on the electrocardiographic ST-segment response alone, some of the residual variance is likely to be related to concomitant clinical factors that were not considered in this analysis." This contradicts Begg and Greenes, who state that "the validity of this (debiasing) procedure is dependent on the assumption that X" is as defined previously. That such "residual variance" associated with concomitant clinical information may strongly influence test results is also indicated by the Duke report.
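That such dependence undermines the correction can be illustrated with a simple simulation, constructed here with assumed parameters: referral depends on a symptom (a stand-in for the vector "X") as well as on the test result, and the correction is then applied as if the test result alone governed referral.

```python
# Simulation sketch (our construction, with assumed parameters): the
# verification-bias correction breaks down when referral depends on a symptom
# "X" in addition to the test result.
import random

random.seed(1)
N = 200_000
TRUE_SENS, TRUE_SPEC, PREV = 0.70, 0.90, 0.30

records = []
for _ in range(N):
    d = random.random() < PREV                                 # disease present
    x = random.random() < (0.80 if d else 0.20)                # symptom, e.g. typical chest pain
    t = random.random() < (TRUE_SENS if d else 1 - TRUE_SPEC)  # positive test
    # Referral depends on the test result AND on the symptom:
    p_refer = 0.90 if t else (0.50 if x else 0.05)
    v = random.random() < p_refer
    records.append((d, t, v))

verified = [(d, t) for d, t, v in records if v]
ppv = sum(d for d, t in verified if t) / sum(1 for d, t in verified if t)
npv = sum(not d for d, t in verified if not t) / sum(1 for d, t in verified if not t)
p_pos = sum(t for d, t, v in records) / N

sens_corr = p_pos * ppv / (p_pos * ppv + (1 - p_pos) * (1 - npv))
spec_corr = (1 - p_pos) * npv / ((1 - p_pos) * npv + p_pos * (1 - ppv))
# Because symptomatic test-negatives are referred preferentially, the verified
# negatives are enriched with disease, and the "corrected" sensitivity falls
# well below the true 0.70 even though the formula is applied exactly as intended.
print(f"'corrected' sensitivity = {sens_corr:.2f}  (true value {TRUE_SENS})")
print(f"'corrected' specificity = {spec_corr:.2f}  (true value {TRUE_SPEC})")
```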
Therefore, without taking both test results (R) and symptoms (X) into account, the Bayesian analysis has a weak foundation.

The final reason the Bayesian method fails when applied to this problem lies within the calculation itself. It is true that sensitivity and specificity depend on the distribution of positive and negative test responses among all patients tested, whether or not they are referred for verification. Accounting for this distribution is a necessary condition for arriving at an unbiased estimate of test characteristics. But it is not a sufficient condition. The calculation can be accurate only if disease prevalence happens to be the same among those who are referred and those who are not, a circumstance that would be highly unlikely in practice, particularly when the group referred for definitive verification contains a much higher percentage of positive test responders than does the nonreferred population! The validity of this statement is axiomatic, for if disease prevalence differs considerably between the 2 groups, verified and unverified, then Bayes' formula cannot be applied, because it uses a predictive value derived from the verified group only. It is universally accepted that predictive value varies greatly with disease prevalence within each tested population.
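The dependence of predictive value on prevalence follows directly from Bayes' theorem, as the following sketch shows; the sensitivity, specificity and prevalence values are assumed for illustration only.

```python
# Sketch: positive predictive value from Bayes' theorem for a test with fixed
# sensitivity and specificity, evaluated at two assumed prevalences.

def positive_predictive_value(sens, spec, prevalence):
    """P(disease | positive test) by Bayes' theorem."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.10, 0.60):   # e.g., a nonreferred group vs. a referred group
    ppv = positive_predictive_value(sens=0.70, spec=0.90, prevalence=prev)
    print(f"prevalence {prev:.0%}: predictive value of a positive test = {ppv:.2f}")
```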
We agree with Diamond's assertion that estimates of sensitivity and specificity will likely be biased whenever information about nonreferred patients is ignored. The bias will be particularly severe if the test to be evaluated is used as a criterion for referral. We do not agree that such estimates can be debiased simply by incorporating information about test responses in the nonreferred group and then applying Bayes' theorem to determine true sensitivity and specificity. Obviously, the best solution to this problem is to verify prospectively all patients who are tested; however, this is not always feasible. The method propounded by Begg and Greenes is a possible solution, but it may prove to be too complex and cumbersome for general application. In the meantime, let us not close our minds to this issue by accepting a misapplication of Bayes' theorem.
References

1. Hlatky MA, Pryor DB, Harrell FE Jr, Califf RM, Mark DB, Rosati RA. Factors affecting sensitivity and specificity of exercise electrocardiography. Am J Med 1984;77:64-71.
2. Diamond GA. Reverend Bayes' silent majority. Am J Cardiol 1986;57:1175-1180.
3. Rozanski A, Diamond GA, Berman D, Forrester JS, Morris D, Swan HJC. The declining specificity of exercise radionuclide ventriculography. N Engl J Med 1983;309:518-522.
4. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926-929.
5. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983;39:207-215.