Clinical Radiology (1989) 40, 448-450
Pitfalls in the Use of Diagnostic Tests

C. J. ROSENQUIST
Department of Radiology, University of California, Davis School of Medicine

Correspondence to: C. John Rosenquist, Department of Radiology, Room 109R Professional Building, 4301 X Street, Sacramento, California 95817, USA

The usefulness of a diagnostic test has been defined as the extent to which it can save lives, restore health or alleviate suffering (McNeil et al., 1976). In recent years considerable progress has been made toward developing techniques which permit assessment of this usefulness. Interest in this subject has been stimulated by several factors that have influenced medical care. First, as statistical science has developed, it has become apparent that a more scientific approach to evaluating diagnostic tests would help the clinician assess the meaning of positive or negative results. Second, as all countries in the Western world have experienced a dramatic increase in the cost of medical care, evaluation of the usefulness of tests has become an essential part of attempts to reduce these costs. This is particularly important in radiology for new and expensive technologies, including computerised tomography and magnetic resonance imaging. Finally, interest in evaluating quality of care, and especially in measuring patient outcome, has prompted more thorough evaluation of diagnostic tests. Because of the nature of the speciality, these factors are especially important in radiology.

In this review, some of the well known parameters for evaluating various characteristics of diagnostic imaging procedures are briefly discussed. Perhaps more importantly, several pitfalls in the use of radiographic studies are described and analysed.
MEASUREMENTS OF USEFULNESS

In order to discuss the problems that may arise in the use of tests it is necessary to understand the basic concepts of how diagnostic performance is measured. This subject has been thoroughly described in several recent publications, and my intent is to provide only a brief review (McNeil et al., 1975; Wasson et al., 1985; Sox, 1986). The starting point in evaluating a test is usually the decision matrix, which relates the presence or absence of disease to the results of a test (Table 1). From this information it is possible to calculate several ratios that have become the generally used standards for measuring the accuracy of tests (Table 2). The true-positive ratio (sensitivity) and true-negative ratio (specificity) are the most commonly used measurements of test validity. The false-positive ratio and the false-negative ratio provide further useful information. It is important to realise that calculation of sensitivity and specificity is only the first step in developing information that will be clinically useful. In a clinical situation the question to be answered is usually the probability that a patient has (or does not have) a suspected disease when a test is positive (or negative). This is termed the positive (or negative) predictive value of the test, and it is determined by combining the results of the decision matrix analysis with the probability of disease in the population being studied. Table 3 shows how this is accomplished.
Table 1 - Decision matrix for analysis of test results

                            Disease
Test          Present                   Absent
Positive      True positive (TP)        False positive (FP)
Negative      False negative (FN)       True negative (TN)
Table 2 - Formulas used for calculation of test accuracy

True positive (TP) ratio (sensitivity) = TP / (TP + FN)
True negative (TN) ratio (specificity) = TN / (TN + FP)
False positive (FP) ratio = FP / (FP + TN)
False negative (FN) ratio = FN / (FN + TP)
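For readers who wish to experiment, the ratios in Table 2 translate directly into a few lines of code. The following Python sketch is illustrative only; the function name and the example counts are my own and do not come from any of the cited studies.

```python
# Illustrative only: compute the Table 2 ratios from the four cells of
# the decision matrix (Table 1). The counts below are invented examples.

def accuracy_ratios(tp, fp, fn, tn):
    """Return sensitivity, specificity, and the FP/FN ratios."""
    return {
        "sensitivity (TPR)": tp / (tp + fn),
        "specificity (TNR)": tn / (tn + fp),
        "false positive ratio": fp / (fp + tn),
        "false negative ratio": fn / (fn + tp),
    }

# Example: 80 TP, 20 FN, 95 TN and 5 FP give a sensitivity of 0.80 and
# a specificity of 0.95, the figures used in the examples below.
print(accuracy_ratios(tp=80, fp=5, fn=20, tn=95))
```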
Table 3 - Calculation of disease probability using Bayes' theorem

A.  P(D+|T+) = (TPR)(PD+) / [(TPR)(PD+) + (FPR)(PD-)]

B.  P(D+|T-) = (FNR)(PD+) / [(FNR)(PD+) + (TNR)(PD-)]

P(D+|T+) = probability of disease if test is positive
P(D+|T-) = probability of disease if test is negative
TPR = true positive ratio (sensitivity)
TNR = true negative ratio (specificity)
FPR = false positive ratio (1 - specificity)
FNR = false negative ratio (1 - sensitivity)
PD+ = probability of disease prior to test
PD- = probability of no disease prior to test
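The two formulas in Table 3 are equally simple to state in code. A minimal Python sketch, using the same notation as the table (the function names are mine):

```python
# Formula A of Table 3: probability of disease given a positive test.
def prob_disease_given_positive(tpr, fpr, p_disease):
    p_no_disease = 1 - p_disease  # PD- in the notation of Table 3
    return (tpr * p_disease) / (tpr * p_disease + fpr * p_no_disease)

# Formula B of Table 3: probability of disease given a negative test.
def prob_disease_given_negative(fnr, tnr, p_disease):
    p_no_disease = 1 - p_disease
    return (fnr * p_disease) / (fnr * p_disease + tnr * p_no_disease)
```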
Bayes' theorem was developed by the Reverend Thomas Bayes. If one knows or can accurately estimate the probability of disease in the group (P[D+]), the probability of disease given a positive test (P[D+|T+]) can be calculated (Table 3, formula A). Likewise, the probability of disease given a negative test (P[D+|T-]) can be calculated using the TN and FN ratios and the probability of disease (P[D+]) or no disease (P[D-]) (Table 3, formula B). The probabilities of disease before and after the test are called the prior probability and posterior probability respectively. Of course, the success of this approach depends upon accurate knowledge of the disease prevalence and of the sensitivity and specificity of the test. It is also assumed that the diseases being considered are mutually exclusive.

Even after this extensive analysis the usefulness of a test has not been determined. One must evaluate the effect of the test on patient outcome. As stated earlier, the usefulness of a test lies in its ultimate effect on the patient's well-being. This is probably the most complicated and least well developed part of the process. In some cases patient outcome is well defined and can be measured.
Table 4 - Effect of prevalence on predictive value using Bayes' theorem (formula A in Table 3)

If PD+ = 0.25:
P(D+|T+) = (0.25)(0.8) / [(0.25)(0.8) + (0.05)(0.75)] = 0.84

If PD+ = 0.05:
P(D+|T+) = (0.05)(0.8) / [(0.05)(0.8) + (0.05)(0.95)] = 0.46
More often, however, patient outcome consists of a variety of measurements, and even the physician and patient may not agree about the desired outcome. Nevertheless, attempts are being made to develop objective measurements that can be used to quantitate patient outcome (Roper et al., 1988).
PROBLEMS WITH THE USE OF DIAGNOSTIC TESTS

The first pitfall to be discussed is the effect of the prevalence of disease on the usefulness of a test. It is commonly thought that a test with high sensitivity and specificity will be generally useful. In fact, depending upon the prevalence of disease and the consequences of a positive or negative test, this may not be true. This is best understood by studying an example. If we assume that a test has a sensitivity of 80% and a specificity of 95%, the effect of prevalence on the predictive value of a positive or negative test can be calculated using the formulas in Table 3. If the prevalence of the disease in the population is 25%, the likelihood of disease when the test is positive (the positive predictive value) is 0.84, whereas if the prevalence of disease is only 5%, the positive predictive value falls to 0.46 (Table 4; see also the numerical check below). Similar calculations can be made for the predictive value of a negative test. This information still does not reveal the ultimate usefulness of the test, however, since we must also understand how a positive or negative result will affect patient outcome. For example, if a false positive result leads to further hazardous or costly examinations, the test may be unsatisfactory. Or if a false negative test leads to a serious oversight, the test may not be worthwhile. The sensitivity and specificity, and the prevalence of disease, will all play a role in making this decision.

A second pitfall in the use of diagnostic tests may occur if one is unaware that the sensitivity and specificity of a test performed in a given radiology laboratory may differ from those reported in the literature. Such variation may occur for several reasons. First, the population being investigated may be different, which may change sensitivity and specificity. For example, in an older population an abnormality might be more easily detected because of the stage of disease, leading to a greater sensitivity of the test. Second, the results reported in the literature may reflect years of experience with a procedure. Similar results, especially with a new technique, may be difficult to achieve. Finally, different equipment and different technique may give different results. How can this problem be overcome? Ideally, a study of the accuracy of each new diagnostic test should be carried out in every radiology department. Of course, this would be prohibitively difficult and costly. As a compromise, the radiologist should maintain careful records and obtain follow-up of the results of diagnostic tests, especially when starting a new one. Fortunately, the patient's primary physician usually needs little encouragement in keeping the radiologist informed of his diagnostic accuracy.
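Returning to the first pitfall, the prevalence effect is easy to verify numerically. Using the illustrative prob_disease_given_positive function sketched under Table 3, with a sensitivity of 0.80 and a false positive ratio of 1 - 0.95 = 0.05:

```python
# Reproduces the two calculations of Table 4 (prevalences of 25% and 5%).
for prevalence in (0.25, 0.05):
    ppv = prob_disease_given_positive(tpr=0.80, fpr=0.05, p_disease=prevalence)
    print(f"prevalence {prevalence:.2f}: positive predictive value {ppv:.2f}")
# prevalence 0.25: positive predictive value 0.84
# prevalence 0.05: positive predictive value 0.46
```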
The third pitfall is perhaps the most subtle and least well understood. As mentioned in the introduction, the ultimate goal of a test is to save lives, restore health or alleviate suffering. This means that a test, regardless of its sensitivity, specificity and predictive value, is of no use unless it has a positive effect on patient outcome. In order to make this assessment the benefits of a correct diagnosis must be weighed against the hazards of a false positive or false negative diagnosis and the hazards of the test itself. One interesting approach to this problem has been described by Pauker and Kassirer, who developed a threshold measurement for clinical decisions (Pauker & Kassirer, 1975); a minimal numerical sketch of this idea is given at the end of this section. Using their formula one can calculate the probability of disease at which treatment based upon the results of a test, versus treatment of all subjects without the test, gives the same expected outcome. In some clinical conditions such an evaluation of outcome is fairly straightforward. A recent report described how this could be done in evaluating the usefulness of ultrasound for the diagnosis of appendicitis (Rosenquist, 1988). For most clinical decisions, however, the evaluation of outcome is much more complex. Factors that need to be considered include morbidity and mortality from both the disease and the test, the alteration of morbidity and mortality as a result of using the test, and the benefit that a test result may have even when it does not change outcome. An example of the last is the value of knowing that a patient has a fatal disease even though treatment is ineffective. This is a complex evaluation that involves both objective measurements and the subjective feelings of the patient. Nonetheless, outcome assessment must be the ultimate goal in evaluating the usefulness of a diagnostic test.

There is one further step in this process that will only be briefly discussed in this paper. Once the usefulness of a test is determined, some type of cost-benefit or cost-effectiveness analysis should follow, in order to determine how the test compares with other diagnostic and treatment programmes. Recent examples in radiology include screening for hypertension, comparison of endoscopy with barium studies, and the use of low-osmolar contrast agents (Stason & Weinstein, 1977; Jacobson & Rosenquist, 1988; Simpkins, 1988). As new tests are developed their costs and benefits should be compared with those of the tests they replace or of alternative strategies. Marginal costs and benefits should then be evaluated. Health economists, health planners and radiologists in Great Britain have been among the leaders in this effort (Roberts, 1988). Although many physicians may believe that this type of analysis should not be the basis for medical care decisions, it is already used extensively, and physician involvement is important in making the proper choices. An understanding of the techniques for evaluating the usefulness of diagnostic tests will help the radiologist to play an appropriate role in this analysis.
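Finally, the threshold measurement mentioned under the third pitfall can be sketched as follows. This is my own minimal rendering of the usual cost/(cost + benefit) form of the Pauker-Kassirer threshold, with invented utility values; it is not taken from their paper.

```python
# Treatment threshold: the probability of disease at which treating and
# withholding treatment have the same expected value. "net_harm" is the
# net harm of treating a patient without the disease; "net_benefit" is
# the net gain of treating a patient with it. Values are invented.

def treatment_threshold(net_harm, net_benefit):
    return net_harm / (net_harm + net_benefit)

# Example: if treating a non-diseased patient costs 1 unit of utility and
# treating a diseased patient gains 4 units, treatment is favoured once
# the post-test probability of disease exceeds 0.20.
print(treatment_threshold(net_harm=1, net_benefit=4))  # 0.2
```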
REFERENCES

Jacobson, PD & Rosenquist, CJ (1988). The introduction of low-osmolar contrast agents in radiology. Journal of the American Medical Association, 260, 1586-1592.

McNeil, BJ & Adelstein, SJ (1976). Determining the value of diagnostic and screening tests. Journal of Nuclear Medicine, 17, 439-448.

McNeil, BJ, Keeler, E & Adelstein, SJ (1975). Primer on certain elements of medical decision making. New England Journal of Medicine, 293, 211-215.
Pauker, SG & Kassirer, JP (1975). Therapeutic decision making: a cost-benefit analysis. New England Journal of Medicine, 293, 229-234.

Roberts, CJ (1988). Annotation: towards the more effective use of diagnostic radiology: a review of the work of the Royal College of Radiologists working party on the more effective use of diagnostic radiology, 1976 to 1986. Clinical Radiology, 39, 3-6.

Roper, WL, Winkenwerder, W, Hackbarth, GM & Krakauer, H (1988). Effectiveness in health care. New England Journal of Medicine, 319, 1197-1202.
Rosenquist, CJ (1988). The usefulness of diagnostic tests. American Journal of Roentgenology, 150, 1189-1190.

Simpkins, KC (1988). What use is barium? Clinical Radiology, 39, 469-473.

Sox, HC (1986). Probability theory in the use of diagnostic tests. Annals of Internal Medicine, 104, 60-66.

Stason, WB & Weinstein, MC (1977). Allocation of resources to manage hypertension. New England Journal of Medicine, 296, 732-739.

Wasson, JH, Sox, HC, Neff, RK & Goldman, L (1985). Clinical prediction rules. New England Journal of Medicine, 313, 793-799.