J Clin Epidemiol Vol. 45, No. 1, pp. 9-13, 1992

Dissent

CLINICAL EPISTEMOLOGY OF SENSITIVITY AND SPECIFICITY

GEORGE A. DIAMOND*

Division of Cardiology, Cedars-Sinai Medical Center and the School of Medicine, University of California, Los Angeles, California, U.S.A.

(Received 3 June 1991)

*All correspondence should be addressed to: George A. Diamond, M.D., Cedars-Sinai Medical Center, Division of Cardiology, 8700 Beverly Boulevard, Los Angeles, CA 90048, U.S.A.

The price we pay for the opportunity to confirm our authority through the process of diagnostic testing is the risk of being disappointed. We are therefore well advised to use only those tests stamped with an imprimatur of accuracy. That imprimatur is commonly expressed in terms of sensitivity and specificity. A recent MEDLINE search uncovered 258,499 publications dealing with these two concepts. The concepts of sensitivity and specificity date back to the eighteenth century. Bayes' theorem, for example, is succinctly expressed by the statement that the posterior probability of an hypothesis given some observation is proportional to the product of the prior probability of the hypothesis and the likelihoods (the probability of the observation given each alternative hypothesis). When dealing with two mutually exclusive diagnostic hypotheses (such as disease and non-disease), these likelihoods are nothing more than sensitivity and specificity.
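Written out for this two-hypothesis case (a standard identity, restated here only to make the correspondence explicit; D denotes disease and T+ a positive test response):

```latex
P(D \mid T^{+}) \;=\;
  \frac{\text{sensitivity} \cdot P(D)}
       {\text{sensitivity} \cdot P(D) \;+\; \bigl(1 - \text{specificity}\bigr)\cdot\bigl(1 - P(D)\bigr)}
```

Sensitivity plays the role of the likelihood of the observation under the disease hypothesis, and 1 - specificity plays the same role under the non-disease hypothesis.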

LIMITATIONS OF SENSITIVITY AND SPECIFICITY

Although sensitivity and specificity are widely employed as comparative summary measures of diagnostic test performance, this practice is often naive (and sometimes grossly misleading). For example, it is generally assumed that sensitivity and specificity are invariant with respect to the severity of disease. In fact, these measures vary widely with the severity of disease and the associated criterion used to define disease status (the stricter the definition, the greater the sensitivity and the less the specificity of a particular test response). Sensitivity and specificity also vary with the threshold employed for categorical (normal vs abnormal) interpretation of the test response (the more extreme the threshold, the less the sensitivity and the greater the specificity). An orthogonal graphical representation of sensitivity and specificity at different thresholds of test response, the so-called receiver-operating characteristic (ROC) curve, mitigates some of this difficulty. The area under this curve is a measure of the test's discriminant accuracy independent of the arbitrary interpretive threshold (and of the prior probability of disease). The slope of the ROC curve at a given point is equivalent to the likelihood ratio (sensitivity divided by 1 - specificity) for categorical test interpretation at the threshold represented by that point. Likelihood ratio is inadequate as a measure of test performance, however, because it quantifies only the discriminant power of a particular test response without considering the discriminant power of each alternative response or their frequencies of occurrence. Moreover, as noted above, sensitivity and specificity (and their associated likelihood ratio) are usually expressed in terms of dichotomized categorical thresholds. Real patients, on the other hand, have a particular magnitude of test response. If one wishes to interpret a particular test response for a particular patient, one needs to know the sensitivity and specificity of that particular response (such as 1.7 mm exercise-induced electrocardiographic ST segment depression) rather than for some arbitrary categorical threshold (such as > 1.0 mm).
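These threshold effects are easy to illustrate numerically. The sketch below (illustrative Python with simulated test responses, not data from any study cited here) sweeps the interpretive threshold over a continuous response, tabulates sensitivity and specificity at each cut-off, reports the likelihood ratio sensitivity/(1 - specificity) at a few cut-offs, and accumulates the area under the resulting ROC curve by the trapezoidal rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated continuous test responses (e.g. mm of ST depression):
# diseased patients tend to have larger responses than non-diseased ones.
diseased = rng.normal(1.5, 1.0, 500)
non_diseased = rng.normal(0.5, 1.0, 500)

thresholds = np.linspace(-3, 5, 81)
sens = np.array([(diseased >= t).mean() for t in thresholds])
spec = np.array([(non_diseased < t).mean() for t in thresholds])

# The more extreme the threshold, the lower the sensitivity and the higher the specificity.
for t in (0.0, 1.0, 2.0):
    s = (diseased >= t).mean()
    c = (non_diseased < t).mean()
    lr = s / (1 - c) if c < 1 else float("inf")  # likelihood ratio at this cut-off
    print(f"threshold {t:+.1f}: sensitivity {s:.2f}, specificity {c:.2f}, LR {lr:.2f}")

# Area under the ROC curve (trapezoidal rule over 1 - specificity vs sensitivity),
# a summary of discriminant accuracy that does not depend on any single threshold.
fpr = 1 - spec
order = np.argsort(fpr)
fpr_sorted, sens_sorted = fpr[order], sens[order]
auc = np.sum(np.diff(fpr_sorted) * (sens_sorted[1:] + sens_sorted[:-1]) / 2)
print(f"ROC area: {auc:.2f}")
```

The printed table shows the trade-off described above, while the single ROC area stays fixed as the threshold moves.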

It has been suggested that estimates of sensitivity and specificity also vary considerably as a function of other factors in addition to disease status. For example, Hlatky et al. [1] partitioned 2269 patients who had undergone both stress electrocardiography and coronary angiography into various subsets defined by common clinical descriptors such as age, gender, exercise performance and the quality, severity or duration of symptoms. Using stepwise logistic regression analysis, they observed that the sensitivity of exercise-induced ST segment depression as a marker of coronary artery disease ranged from 41 to 89%, and the specificity ranged from 70 to 100%. They further identified the important clinical predictors of this variation (covariates), and concluded that the covariate factors affecting the sensitivity of this test were different from those affecting its specificity.

COVARIATE CORRECTIONS

Now, Coughlin et al. [2] have devised a method to estimate sensitivity and specificity as a function of such covariates. Briefly, they identify some subset of covariates (including disease status as defined by the putative verification procedure) as independent variables and the presence or absence of the test response under consideration as the dependent variable, and submit these variables to logistic regression analysis. The resultant set of regression coefficients allows calculation of the probability of the test response given the set of covariates. By setting the disease status variable to 1, the logistic probability represents the true positive rate or sensitivity of the test response conditioned on the set of covariates. Similarly, by setting the disease status variable to 0, the logistic probability represents the false positive rate or 1 - specificity of the test response conditioned on the same set of covariates.
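A minimal sketch of this kind of calculation is given below (illustrative Python with simulated data; the covariates, coefficients and covariate pattern are invented for illustration and are not the model or data of Coughlin et al. [2]). A single logistic model is fitted with the test response as the dependent variable and disease status plus other covariates as independent variables; evaluating the fitted model with disease status set to 1 yields a covariate-conditional sensitivity, and with disease status set to 0 a covariate-conditional false positive rate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000

# Simulated covariates (hypothetical: age in years, male sex) and disease status.
age = rng.normal(60, 10, n)
male = rng.integers(0, 2, n)
disease = rng.integers(0, 2, n)

# Simulated test response whose probability depends on disease status and covariates.
logit = -8.0 + 0.10 * age + 0.5 * male + 1.5 * disease
test_positive = rng.random(n) < 1 / (1 + np.exp(-logit))

# Dependent variable: test response; independent variables: covariates plus disease status.
X = np.column_stack([age, male, disease])
model = LogisticRegression(max_iter=1000).fit(X, test_positive)

# For a particular covariate pattern (a hypothetical 70-year-old man):
pattern = np.array([[70.0, 1.0]])
p_pos_given_diseased = model.predict_proba(np.hstack([pattern, [[1.0]]]))[0, 1]
p_pos_given_healthy = model.predict_proba(np.hstack([pattern, [[0.0]]]))[0, 1]

print(f"conditional sensitivity     : {p_pos_given_diseased:.2f}")
print(f"conditional 1 - specificity : {p_pos_given_healthy:.2f}")
```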

There are several practical problems with this otherwise clever approach. Coughlin et al. [2] advise "...using all of the data..." by selecting a single subset of covariates and including disease status as an additional covariate. But, as shown by Hlatky et al. [1], the important covariates for sensitivity are not necessarily the same as those for specificity. Use of a common set of covariates thereby adds a variable amount of noise to each estimate.

This noise can lead to some very puzzling consequences [3, 4]. Suppose some new diagnostic test is evaluated in 250 patients with coronary artery disease. We observe that 102 of the 150 men (68%) and 66 of the 100 women (66%) have a positive test response. Because 102/150 > 66/100, we conclude that the sensitivity of the test is greater in men than in women (we are not concerned here with the statistical significance of this conclusion). Now, we happen to know that the sensitivity of some test observations (left ventricular ejection fraction and coronary artery calcification, for example) is highly dependent on age. We therefore further stratify each population into two age categories: those above and below the age of 65. As suspected, we observe 78 positive responses among the 90 older men (87%) compared to only 24 positive responses among the 60 younger men (40%). Similarly, we observe 45 positive responses among the 50 older women (90%) compared to only 21 positive responses among the 50 younger women (42%). Because (24 + 21)/(60 + 50) < (78 + 45)/(90 + 50), we conclude that sensitivity is indeed lower in younger patients compared to older patients, and because 78/90 < 45/50 and 24/60 < 21/50 we also conclude that sensitivity is lower in each subgroup of men compared to the corresponding subgroup of women. Paradoxically, the overall sensitivity of this test is greater in men than in women, while the sensitivity in each age group is greater in women than in men (and if the samples were larger, these contradictory conclusions could be supported by conventional tests of statistical significance [5]).
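The reversal is purely arithmetic and can be verified directly from the counts quoted above (a few lines of Python, using only those counts):

```python
from fractions import Fraction

# Positive responses / patients tested, from the hypothetical example above.
men   = {"older": (78, 90), "younger": (24, 60)}
women = {"older": (45, 50), "younger": (21, 50)}

def sens(pos, n):
    return Fraction(pos, n)

# Within each age stratum, women have the higher sensitivity ...
for age in ("older", "younger"):
    print(age, "men:", float(sens(*men[age])), "women:", float(sens(*women[age])))

# ... yet pooled over age, men have the higher sensitivity (Simpson's paradox).
men_total   = sens(sum(p for p, _ in men.values()),   sum(n for _, n in men.values()))
women_total = sens(sum(p for p, _ in women.values()), sum(n for _, n in women.values()))
print("overall men:", float(men_total), "overall women:", float(women_total))
```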

LOGISTIC REGRESSION

Coughlin et al. [2] do not say how to select the putative set of covariates in the first place. One could, I imagine, use only those variables considered significant by stepwise regression analysis, but this procedure itself is subject to error [6-11]. For example, when stepwise regression models were constructed to predict cardiovascular risk using 30 candidate variables in 3 independent random samples containing 1065, 518 and 284 patients drawn from a parent population of 2113 patients with 208 outcome events [6], the resultant models looked very different from one another: 16 different variables were selected from the list of 30 variables, but only 2 of the 16 were common to all 3 models, and 7 occurred in only 1 model. In fact, the "most important variable" in 1 model (presence of a ventricular gallop) did not appear at all in the other 2, and the "most important variable" shared by them (radiographic cardiomegaly) did not appear anywhere in the other model. Similarly, Wells et al. [11] conclude that multivariable methods such as logistic regression are reliable in "...the stepwise choice of [the] first two powerful predictor variables, but not in the sequence of subsequent choices or in the standardized coefficients assigned to the same collection of 'forced' variables". These unreliable coefficients play a central role in the model proposed by Coughlin et al. [2].
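This kind of instability is easy to reproduce in simulation. The sketch below is illustrative Python; the data, the greedy forward-selection rule and the subsample sizes are invented and are not those of Harrell et al. [6] or Wells et al. [11]. Running the same selection procedure on three random subsamples of one simulated dataset typically yields variable sets that agree on the first choice or two and then diverge:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(2)
n, p = 2000, 12

# Correlated candidate predictors and an outcome driven by only a few of them.
latent = rng.normal(size=(n, 3))
X = latent[:, rng.integers(0, 3, p)] + rng.normal(scale=1.0, size=(n, p))
logit = 0.8 * X[:, 0] - 0.6 * X[:, 3] + 0.5 * X[:, 7] - 1.0
y = rng.random(n) < 1 / (1 + np.exp(-logit))

def forward_select(X, y, k=4):
    """Greedy forward selection of k predictors by in-sample log-loss."""
    chosen = []
    while len(chosen) < k:
        best, best_loss = None, np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            cols = chosen + [j]
            m = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
            loss = log_loss(y, m.predict_proba(X[:, cols])[:, 1])
            if loss < best_loss:
                best, best_loss = j, loss
        chosen.append(best)
    return chosen

# Repeat the selection on three random subsamples of decreasing size.
for size in (1000, 500, 250):
    idx = rng.choice(n, size=size, replace=False)
    print(f"n = {size}: selected variables {forward_select(X[idx], y[idx])}")
```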

VERIFICATION BIAS

Moreover, the data analyzed by this regression procedure are often biased in several ways. First, conventional diagnostic assessment is often highly subjective, even for the verification procedure itself. With respect to coronary angiography as a standard for the diagnosis of coronary artery disease, for example, it is not uncommon for a given patient to be considered severely diseased by one observer and entirely normal by another [12]. Moreover, the conventional defining threshold of 50% diameter narrowing (equivalent to 75% area narrowing) is highly arbitrary. Generally, the more severe the level of disease, the more severe the associated test abnormality. For this reason, the referral of patients with more extreme test responses for angiography simultaneously results in a bias toward more severe anatomic disease. The preferential referral of positive test responders toward diagnostic verification and negative test responders away from diagnostic verification (albeit readily justified as the exercise of good clinical judgment) results in substantial distortions of observed sensitivity and specificity [13-17].

Consider an extreme example. Suppose you have a diagnostic test with a sensitivity of 80% and a specificity of 80%. Suppose further that you so rely on this test that you refer every patient with a positive test response for diagnostic verification, but you never refer a patient with a negative test response for verification. Because only positive test responders will undergo verification, every diseased patient will have a positive test (observed sensitivity = 100%), but so will every non-diseased patient (observed specificity = 0%). Now suppose you use these biased estimates of sensitivity and specificity to calculate a posterior probability of disease based on the test response using Bayes' theorem. If the prior probability of disease is 50%, for example, the actual posterior probability of disease given a positive test response (positive predictive accuracy) is 80%, and the actual posterior probability of disease given a negative test response (negative predictive accuracy) is 20%. Based on the biased estimates of sensitivity and specificity, however, the positive predictive accuracy is estimated to be 100%, and the negative predictive accuracy is estimated to be 0%.
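The distortion produced by this extreme referral rule can be checked by brute force (a minimal Python simulation using only the numbers of the example above: a 50% prior, a test with true sensitivity and specificity of 80%, and verification restricted to positive responders; the patients are simulated, not real data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# True state of affairs: 50% prior, a test with 80% sensitivity and 80% specificity.
sens_true, spec_true, prior = 0.80, 0.80, 0.50
diseased = rng.random(n) < prior
positive = np.where(diseased, rng.random(n) < sens_true, rng.random(n) < 1 - spec_true)

# Actual posterior probabilities (computable here because the simulation knows everyone's status).
print("actual P(disease | positive test):", round(diseased[positive].mean(), 2))   # ~0.80
print("actual P(disease | negative test):", round(diseased[~positive].mean(), 2))  # ~0.20

# Referral rule: only positive responders are sent for verification.
verified = positive
obs_sens = positive[diseased & verified].mean()      # 1.0 by construction
obs_spec = (~positive[~diseased & verified]).mean()  # 0.0 by construction
print("observed sensitivity among verified patients:", round(obs_sens, 2))
print("observed specificity among verified patients:", round(obs_spec, 2))
```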

Selection for diagnostic verification depends not only on the test response, however, but also on a variety of additional clinical observations with putative diagnostic value [18, 19]. Referral for coronary angiography among patients undergoing thallium myocardial perfusion scintigraphy because of suspected coronary artery disease, for example, is influenced directly by the test response itself, but is also influenced indirectly by a variety of additional factors such as age, gender, the quality and severity of the presenting symptoms, and the magnitude of exercise-induced electrocardiographic ST segment depression. Thus, just as Bayes' theorem tells us that the predictive accuracy of any test is conditioned on the overall prevalence of disease in the population tested, it also tells us that test accuracy is conditioned on the overall frequency of abnormal responses in the population tested: those in whom disease status is verified and those in whom it is not [16]. As a result, the observed sensitivity and specificity are conditioned on the magnitude of bias introduced by the process of verification. Although ways to correct for verification bias have been described [18-20], Coughlin et al. [2] do not describe how to apply any of these approaches to their method.

This is an important matter because sensitivity and specificity are often exploited as marketing tools. Thus, proponents of a particular testing technology sometimes make artificial comparisons of sensitivity and specificity between tests that are not really in competition (thallium scintigraphy vs exercise electrocardiography, for example). Sometimes, these proponents improperly promote their test by quoting a high sensitivity (or specificity) without mentioning its more modest associated specificity (or sensitivity), or noting that these figures derive from unrepresentative populations at one or another extreme of the clinical spectrum [13, 15]. At other times, these proponents promote their test by pointing to a high likelihood ratio for a particular observation without mentioning its low frequency of occurrence. Rarely are the estimation errors ever reported explicitly.

PRACTICALITY

In contrast, Coughlin et al. [2] properly emphasize the estimation errors associated with their method. For example, they estimate the sensitivity of claudication ascertained using the Rose questionnaire for the detection of peripheral arterial disease to average 23% in a sample of 575 subjects, and report the 95% confidence interval associated with this estimate to range from 16 to 59%. Unfortunately, this relatively high level of uncertainty (equivalent to a binomial sample of something like 5 of 20) casts doubt on the practical relevance of such values. How many subjects would it take to provide reasonably precise estimates of sensitivity and specificity using this approach?

These limitations notwithstanding, the concepts of sensitivity and specificity do have practical relevance to all those who perform technology assessment and to all those who must ultimately judge the quality of that assessment: journal editors, granting agencies, third party payers and clinicians alike. Thus, if a rational health care planner wishes to assess the accuracy of a diagnostic test (to decide, for example, if it should be approved for reimbursement), then the area under an ROC curve generated from a series of sensitivity and specificity measurements provides a suitable measure of discrimination that is independent of the particular diagnostic threshold as well as the base rate of disease in the population, and is also relatively insensitive to verification bias [19]. Similarly, if a rational health care provider wishes to interpret some patient's test response, then she can (i) estimate the prior probability of disease in that patient, (ii) determine the sensitivity and specificity for that particular test response (rather than for some categorical threshold), (iii) correct the sensitivity and specificity for verification bias, (iv) calculate a posterior probability of disease using Bayes' theorem and (v) estimate the associated errors.
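Steps (i) through (v) amount to a short calculation once the inputs are in hand. The sketch below is illustrative Python only: the prior, the response-specific counts and the covariate-free setting are invented placeholders, the verification-bias correction of step (iii) is left as a stub rather than any method endorsed in the text, and the error estimate of step (v) is a simple Monte Carlo propagation of binomial uncertainty rather than a formally derived interval.

```python
import numpy as np

rng = np.random.default_rng(4)

# (i) Prior probability of disease in this patient (hypothetical value).
prior = 0.30

# (ii) Sensitivity and specificity for the particular test response observed,
#      as if estimated from a (hypothetical) verified series.
tp, n_diseased = 42, 60   # response observed in 42 of 60 verified diseased patients
fp, n_healthy = 9, 80     # and in 9 of 80 verified non-diseased patients
sens = tp / n_diseased
spec = 1 - fp / n_healthy

# (iii) A correction for verification bias (e.g. by the approaches cited in
#       refs [18-20]) would adjust sens and spec here; this sketch leaves them unchanged.

# (iv) Posterior probability of disease given the response (Bayes' theorem).
def posterior(prior, sens, spec):
    return sens * prior / (sens * prior + (1 - spec) * (1 - prior))

print("posterior probability:", round(posterior(prior, sens, spec), 2))

# (v) Associated error: propagate the binomial uncertainty in sensitivity and
#     specificity into the posterior by Monte Carlo resampling.
sens_draws = rng.beta(tp + 1, n_diseased - tp + 1, 10_000)
fp_draws = rng.beta(fp + 1, n_healthy - fp + 1, 10_000)
post_draws = posterior(prior, sens_draws, 1 - fp_draws)
lo, hi = np.percentile(post_draws, [2.5, 97.5])
print(f"approximate 95% interval: {lo:.2f} to {hi:.2f}")
```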

Although I seriously doubt anyone will actually follow such arcane prescriptions, they nevertheless do serve as enlightening thought experiments. Sir Arthur Eddington once illustrated the epistemological nature of the diagnostic testing process in a far more outlandish (and engaging) way [21]:

"Let us suppose that an ichthyologist is exploring the life of the ocean. He casts a net into the water and brings up a fishy assortment. Surveying his catch, he [concludes that no] sea-creature is less than two inches long. ... An onlooker may object that the ... generalization is wrong. 'There are plenty of sea-creatures under two inches long, only your net is not adapted to catch them.' The ichthyologist dismisses this objection contemptuously. 'Anything uncatchable by my net is ipso facto outside the scope of ichthyological knowledge, and is not part of the kingdom of fishes which has been defined as the theme of ichthyological knowledge. In short, what my net can't catch isn't fish.' ... Suppose that a more tactful onlooker makes a rather different suggestion: 'I realize that you are right in refusing our friend's hypothesis of uncatchable fish, which cannot be verified by any tests you and I would consider valid. By keeping to your own method of study, you have reached a generalization of the highest importance - to fishmongers, who would not be interested in generalizations about uncatchable fish. Since these generalizations are so important, I would like to help you. You arrived at your generalization in the traditional way by examining the fish. May I point out that you could have arrived more easily at the same generalization by examining the net and the method of using it?'"

REFERENCES

1. Hlatky MA, Pryor DB, Harrell FE, Califf RM, Mark DB, Rosati RA. Factors affecting sensitivity and specificity of exercise electrocardiography. Multivariable analysis. Am J Med 1984; 77: 64-71.
2. Coughlin SS, Trock B, Criqui MH, Pickle LW, Browner D, Tefft MC. The logistic modeling of sensitivity, specificity, and predictive value of a diagnostic test. J Clin Epidemiol 1992; 45: 1-7.
3. Simpson EH. The interpretation of interaction in contingency tables. J R Stat Soc (Ser B) 1951; 13: 238-241.
4. Hand DJ. Psychiatric examples of Simpson's paradox. Br J Psychiatry 1979; 135: 90-91.
5. Diamond GA, Snodick DH. CASSandRa. Am J Cardiol 1987; 59: 698-701.
6. Harrell FE Jr, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic prediction. Stat Med 1984; 3: 143-152.
7. Arnesen E, Brenn T. Selecting risk factors: a comparison of discriminant analysis, logistic regression and Cox's regression model using data from the Tromsø Heart Study. Stat Med 1985; 4: 413-423.
8. Infante-Rivard C, Villeneuve JP, Esnaola S. A framework for evaluating and conducting prognostic studies: an application to cirrhosis of the liver. J Clin Epidemiol 1989; 42: 791-805.
9. Feinstein AR, Wells CK, Walter SD. A comparison of multivariable mathematical methods for predicting survival. I. Introduction, rationale, and general strategy. J Clin Epidemiol 1990; 43: 339-347.
10. Walter SD, Feinstein AR, Wells CK. A comparison of multivariable mathematical methods for predicting survival. II. Statistical selection of prognostic variables. J Clin Epidemiol 1990; 43: 349-359.
11. Wells CK, Feinstein AR, Walter SD. A comparison of multivariable mathematical methods for predicting survival. III. Accuracy of predictions in generating and challenge sets. J Clin Epidemiol 1990; 43: 361-372.
12. Zir LM, Miller SW, Dinsmore RE, Gilbert JP, Harthorne JW. Interobserver variability in coronary angiography. Circulation 1976; 53: 627-631.
13. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978; 299: 926-930.
14. Diamond GA. An improbable criterion of normality. Circulation 1982; 66: 681.
15. Rozanski A, Diamond GA, Berman D, Forrester JS, Morris D, Swan HJC. The declining specificity of exercise radionuclide ventriculography. N Engl J Med 1983; 309: 518-522.
16. Diamond GA. Reverend Bayes' silent majority. An alternative factor affecting sensitivity and specificity of exercise electrocardiography. Am J Cardiol 1986; 57: 1175-1180.
17. Knottnerus JA. The effects of disease verification and referral on the relationship between symptoms and diseases. Med Decis Making 1987; 7: 139-148.
18. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983; 39: 207-215.
19. Diamond GA. Affirmative actions. Can the discriminant accuracy of a test be determined in the face of selection bias? Med Decis Making 1991; 11: 48-56.
20. Diamond GA, Rozanski A, Forrester JS, Morris D, Pollock BH, Staniloff HM, Berman DS, Swan HJC. A model for assessing the sensitivity and specificity of tests subject to selection bias: application to exercise radionuclide ventriculography for diagnosis of coronary artery disease. J Chron Dis 1986; 39: 343-355.
21. Eddington A. The Philosophy of Physical Science. Cambridge: Cambridge University Press; 1949: 16-19.