February 1999, Vol. 6, No. 1

The Journal of the American Association of Gynecologic Laparoscopists

Evidence-Based Medicine: Evaluating Diagnostic Tests Elizabeth Anna Pritts, M.D., Antoni DuTeba, M.D., and David L. Olive, M.D.

Abstract

Physicians have at their disposal a great number of established diagnostic tests, and new ones continue to be developed that are potentially helpful in diagnosing and establishing the prognosis of disease. Many of these tests were either inadequately evaluated or were found, on more careful scrutiny, to be less helpful than first believed. To ensure optimal patient care as well as appropriate use of health care resources, practitioners must be adept in understanding the true efficacy of diagnostic tests that they ask patients to undergo. They must be able to understand the basic language applied in evaluating tests and be able to determine if and when tests are applicable. (J Am Assoc Gynecol Laparosc 6(1):105-112, 1999)

Inappropriate performance of a diagnostic test may be harmful to the patient in at least two ways. First is the inherent risk of the procedure itself, as well as cost and discomfort. Second, an erroneously positive or negative result may lead to unwarranted or dangerous clinical decisions. A simple, systematic approach can be applied to determine the efficacy of both established and new diagnostic tests. Practitioners also should be able to evaluate a test by answering a series of directed questions.

How Is the Population Selected?

Population selection in evaluating a diagnostic test is of utmost importance. Ideally, the population should include patients with a broad spectrum of disease, from mild to severe, with test results stable regardless of treatment unless a disease is proved to be eradicated and the patient has returned to health. Patients with other disorders commonly confused with the disease in question should be included in the study population. A test that differentiates among individual diseases and therefore increases a clinician's ability to diagnose disease is inherently more useful than one that does not. An example is the introduction of a new test to diagnose human immunodeficiency virus (HIV) infection. To ensure the quality of the test, the patient population must be scrutinized. A population that includes only those with end-stage disease may not be a good predictor of disease for a patient who is HIV positive but has no sequelae. The population should also include patients who have been treated for the disease to elucidate the effect of treatment on the test outcome. Patients who are seropositive for other viral diseases such as herpes simplex virus or hepatitis A, B, or C should also be included to ensure against cross-reactivity of the test.

From the Division of Reproductive Endocrinology and Infertility, Yale University School of Medicine, New Haven, Connecticut (all authors). Address reprint requests to David L. Olive, M.D., Yale University, 333 Cedar Street, P.O. Box 208063, New Haven, CT 06520-8063; fax 203 785 7134.



Cross-sectional studies are the most common method of evaluating diagnostic tests. They examine a specific sample to identify disease versus health. Inherent in this approach is the premise that an ideal test, a gold standard, can correctly identify disease and health. In developing a test that diagnoses seropositivity for persons with HIV, a specific population should be sampled at a single time point, and values collected from the experimental test should be compared against those obtained from the gold standard. The assumption is that results from the gold standard are the true measure of disease. Thus, a cross-sectional analysis and evaluation of a new diagnostic test is only as good as the existing gold standard.

What Is the Gold Standard?

By definition, a gold standard should correctly identify disease and health and therefore clearly dichotomize the population.¹ It must include borderline clinical situations, be unambiguous, and be independent of the observer. In some diseases the gold standard is self-evident; however, in most clinical circumstances gold standards themselves are imperfect. In diagnosing uterine leiomyoma, the present nonsurgical gold standard is magnetic resonance imaging. It fulfills the criteria by clearly identifying leiomyomas as distinct from other pelvic structures, and its interpretation is consistent among radiologists. Although ultrasonography is useful, it can be fraught with error caused by the quality of the machine, similarities of images of pelvic structures, and lack of expertise of the ultrasonographer. Comparison with existing gold standards is the established method for evaluating the value of new tests. For example, the gold standard for diagnosing endometriosis is surgical exploration and histologic evaluation of the specimen. Measurement of serum levels of CA 125 was introduced in an attempt to diagnose endometriosis nonsurgically, but is comparatively inadequate. Often, a gold standard is not the true measure of disease and health within a population, but it is the best available. For example, the gold standard for diagnosing endometriosis is not without ambiguities. If a random sample of peritoneum includes endometrial glands but not stroma, is it still endometriosis? If a sample of scarred peritoneum such as a peritoneal pocket contains neither glands nor stroma on histologic examination, can it be classified as probable "burnt out" endometriosis? Weaknesses of a gold standard must be clearly stated. The definition of a given disease is often arbitrary. Equally arbitrary is the identification of health. For example, the definition of fertility varies from test to test. Is it chemical pregnancy, a 20-week gestation, birth of a normal healthy infant, or graduation of the conceptus from law school? Is the couple who have never attempted conception to be considered fertile or infertile? Disease and health should be clearly defined. Several approaches accomplish this.

The Gaussian Curve

This usually defines as normal any test value within ±2 SD of a mean, encompassing 95% of the population (Figure 1). Although the method may initially appear simple and understandable, on further scrutiny it has serious shortcomings. First, the distribution of biologic parameters is rarely approximated by a bell-shaped curve that extends infinitely in both directions. This would allow the possibility of such measures as negative height or negative hemoglobin level, values that we know cannot exist. Second, defining normal within a fixed number of standard deviations assumes a priori the prevalence of disease. Since 2 SD would encompass 95% of a normally distributed population, every disorder thus defined would, by definition, have a prevalence of 5%.²
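As a numeric sketch of the ±2 SD rule (the hemoglobin values below are invented for illustration, not drawn from the article):

```python
import statistics

def gaussian_reference_range(values, n_sd=2.0):
    """Define 'normal' as mean +/- n_sd standard deviations.

    By construction, about 5% of a normally distributed population
    falls outside mean +/- 2 SD, so every disorder defined this way is
    assigned an a priori prevalence of roughly 5% -- the flaw noted above.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - n_sd * sd, mean + n_sd * sd

# Hypothetical hemoglobin values (g/dl), for illustration only.
hemoglobin = [11.2, 12.5, 13.1, 12.8, 14.0, 13.5, 12.2, 13.8, 12.9, 13.3]
low, high = gaussian_reference_range(hemoglobin)
```

Note that nothing in the formula prevents the lower limit from going negative for a skewed or high-variance measurement, which is the negative-hemoglobin objection raised in the text.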

The Percentile

This definition of normal was invoked in an attempt to eliminate the negative-to-infinity values of the Gaussian curve. Instead of a theoretical curve, this method uses actual values obtained from the test population.

FIGURE 1. Gaussian distribution with the central 95% defined by mean ±2 SD.


Normal is defined as a certain percentage of these values (Figure 2). For instance, if 100 fetuses are tested by ultrasonography for intrauterine growth retardation (IUGR) at 33 weeks' gestation, a range of weights is found. If IUGR is defined as a fetus with a weight below the tenth percentile, the highest 90 values would be called normal and the smallest 10 abnormal. However, as with the Gaussian curve, this method assumes a priori prevalence of disease.
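A sketch of the percentile method; the fetal weights are fabricated placeholders, not study data:

```python
def percentile_cutoff(values, pct=10.0):
    """Return the largest value still labeled abnormal when the lowest
    `pct` percent of observations are defined as abnormal.

    As with the Gaussian method, the abnormal fraction is fixed in
    advance (here 10%), regardless of true disease prevalence.
    """
    ordered = sorted(values)
    k = max(int(len(ordered) * pct / 100.0), 1)
    return ordered[k - 1]

# 100 hypothetical fetal weights (g) at 33 weeks, evenly spaced for clarity.
weights = [1500 + 10 * i for i in range(100)]  # 1500, 1510, ..., 2490
cutoff = percentile_cutoff(weights, pct=10)    # the 10th-smallest weight
abnormal = [w for w in weights if w <= cutoff]  # exactly 10 fetuses labeled IUGR
```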

Cultural and Social

These measurements of normal are based less on scientific premise and more on societal or cultural norms or opinions. An example is the diagnosis of homosexuality as a disease. With the passage of time, this definition has changed and homosexuality is seen as a variant of normal.

Diagnostic

This definition of normal relies on a previously defined and validated test. For example, if a new test for measuring serum β-human chorionic gonadotropin to diagnose pregnancy is developed, normal and abnormal must be defined with regard to hormone levels. Since the existing gold standard uses less than 5 mIU/ml as the cutoff for normal, the new test must also use this cutoff in determining pregnant versus nonpregnant.

Risk Factors

This method is based on the premise that certain risk factors may be associated with given states of disease. An example is the presence of varicocele in many men with abnormal sperm motility and subsequent infertility. Given this association, the presence of varicocele could be used to define the abnormal state of infertility. Often, though, proof that risk factors and their elimination will directly affect disease is inadequate. Although some studies showed a link between varicocele and abnormal semen analysis in some men, others showed many men with normal sperm motility and varicocele. Furthermore, in randomized clinical trials varicocele repair did not improve pregnancy rates. Thus, use of varicocele as a surrogate marker for male infertility is highly suspect.

Therapeutic

This model (also known as correlated normality) uses positive predictive value to dichotomize normal and abnormal. It is based on certain cutoff levels for test values above which therapeutic intervention does more good than harm, and below which the intervention does more harm than good. It is based on studies that directly relate test results to treatment and outcomes. This model is by far the most valuable in delineating abnormal and normal in clinical practice, but it is the most difficult and expensive to accomplish. Whenever a test is compared with a gold standard, each is divided into normal and abnormal; for the test, this pertains to results of the investigation, whereas for the gold standard, the breakdown is by health versus disease. A 2 x 2 table is commonly used to illustrate this. Once such a table is constructed, the sensitivity and specificity of a test can be determined.

What Are Sensitivity and Specificity?

Both sensitivity and specificity can be calculated easily with Table 1. Sensitivity is the probability of an abnormal test among diseased individuals, and it is calculated as the number of true positives over the number with disease = a/(a + c). When using the 2 x 2 table, column a + c relies on the gold standard to identify patients who truly have disease, and box a is the number of abnormal values in the test. Sensitivity, then, rates the index test against the gold standard with regard to its ability to identify a diseased individual. Specificity is defined as the probability of a normal test among healthy individuals: the number of true negatives over the number of all healthy = d/(d + b).
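These two definitions translate directly into code; a minimal sketch over the 2 x 2 cell counts a, b, c, d (the counts below are arbitrary examples, not data from the article):

```python
def sensitivity(a, c):
    """True positives over all diseased: a / (a + c)."""
    return a / (a + c)

def specificity(b, d):
    """True negatives over all disease free: d / (d + b)."""
    return d / (d + b)

# Arbitrary example: 90 of 100 diseased patients test positive,
# and 80 of 100 healthy patients test negative.
sens = sensitivity(a=90, c=10)  # 0.9
spec = specificity(b=20, d=80)  # 0.8
```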

FIGURE 2. Use of the percentile method to define normal. Each box represents one observation. The top 90 are termed normal and the bottom 10 abnormal.


In Table 1, column b + d represents the proportion of individuals who truly are disease free, and box d represents a normal or negative test result. Specificity thus rates the index test against the gold standard with regard to its ability to identify normal or healthy individuals. Sensitivity and specificity are referred to as stable properties because they do not change when different patient populations are studied. Thus they are unchanged if the prevalence of disease in the test group is 0.05% or 50%. A perfect test (sensitivity and specificity 100%) would be one in which all persons testing positive were diseased and all those testing negative were healthy (Figure 3). No overlap in test results would exist between the groups.⁴ Unfortunately, such a perfect test rarely exists. Diagnostic tests nearly always have overlapping values in which diseased individuals test normal and healthy individuals test abnormal.

Was a Receiver-Operating Characteristic Curve Used?

Receiver-operating characteristic (ROC) curves can statistically help decide the optimal demarcation site for a test value when determining normal versus abnormal or positive versus negative.⁵ They can also be used to compare the relative value of several tests attempting to diagnose the same disease. Sensitivity is placed on the y axis and 1 - specificity on the x axis (Figure 4). The area under the curve (AUC) indicates the value of the test: a useless test has an AUC of 0.5, and a perfect test has an AUC of 1.0. Applying simple equations, we can find the point that minimizes both false positives and false negatives. This point can be used as a cutoff when demarcating a normal versus an abnormal value. In general, the best cutoff is the point farthest from the diagonal (labeled A in Figure 5).

What Are Precision and Accuracy?

Among many characteristics required for a good test, the most important are precision and accuracy. Precision is the ability of repetitive evaluations of a given sample to produce similar results. Accuracy evaluates how closely index test results mirror the truth, as defined by the gold standard. For example, a test that is designed to measure serum progesterone levels must consistently measure the same value in a single patient being tested repeatedly and also match the true level of serum progesterone when indexed against the current gold standard. If a serum sample was known to have a value of 24 ng/ml as measured by the gold standard, an experimental test done in triplicate that gave values of 11.0, 11.4, and 11.0 ng/ml would be deemed precise but not accurate. If the same experimental test produced values of 24.0, 23.8, and 24.1 ng/ml, it would be regarded as highly precise and accurate. These calculations help us evaluate the test. The next step takes the results and applies them to a patient in an attempt to convert clinical suspicion from a pretest probability of disease to a posttest probability of disease. The degree of suspicion of disease when

TABLE 1. Calculation of Sensitivity, Specificity, and Predictive Values of a Positive Test (positive predictive value) and Negative Test (negative predictive value)

Test        Gold Standard: Disease                          Gold Standard: Disease Free                        Total
Positive    a = number diseased and positive                b = number disease free and positive               a + b = total number of positives
Negative    c = number diseased and negative                d = number disease free and negative               c + d = total number of negatives
Total       a + c = total number of diseased individuals    b + d = total number of disease-free individuals

Sensitivity = a/(a + c) = proportion of diseased labeled positive by the test.
Specificity = d/(b + d) = proportion of disease free labeled negative by the test.
Positive predictive value = a/(a + b) = proportion of those with a positive test who actually have disease as measured by the gold standard.
Negative predictive value = d/(c + d) = proportion of those with a negative test who actually are free of disease as measured by the gold standard.
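The precision/accuracy distinction applied to the serum progesterone example can be sketched as a simple check; the tolerance thresholds are arbitrary illustrations, not laboratory standards:

```python
import statistics

def assess_assay(replicates, gold_value, max_sd=0.5, max_bias=1.0):
    """Return (precise, accurate) for a set of replicate measurements.

    precise:  the replicates agree with one another (SD within max_sd)
    accurate: their mean agrees with the gold standard (bias within
              max_bias). Both thresholds are illustrative choices.
    """
    precise = statistics.stdev(replicates) <= max_sd
    accurate = abs(statistics.mean(replicates) - gold_value) <= max_bias
    return precise, accurate

# Progesterone example from the text: gold standard reads 24 ng/ml.
print(assess_assay([11.0, 11.4, 11.0], 24.0))  # (True, False): precise, not accurate
print(assess_assay([24.0, 23.8, 24.1], 24.0))  # (True, True): precise and accurate
```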


FIGURE 3. Possible relationships between test values of normal and abnormal subjects. (A) No overlap between distributions. (B) Partial overlap between distributions.

combining patient, test results, and prevalence of disease in the population is defined mathematically by the positive predictive value of the disease.

How Good Is Prediction of Disease?

FIGURE 4. The ROC curves for a perfect test, a useless test, and an ordinary test. The AUC for the perfect test is 1.0, for the useless test 0.5, and for the ordinary test 0.8.

FIGURE 5. The ROC curve used to define the optimal breakpoint for normal versus abnormal. Point A is farthest from the diagonal and should serve as the best threshold.

The value of any diagnostic test lies in its ability to increase or decrease the probability of disease in a particular patient. Predictive values are useful in clinical decision making. They do not directly reflect the quality of the test, which, it is hoped, was measured by sensitivity and specificity. Predictive values are

easily calculated again from a 2 x 2 table, and can turn raw test values into probabilities that can be applied to specific patients. A positive predictive value is defined as the probability of disease among those testing abnormal, or those with a positive test. It can be calculated using a 2 x 2 table as the number of true positives/number of all patients with an abnormal result = a/(a + b). Negative predictive value is the probability of health among those who test normal. Again using the 2 x 2 table, the value is the number of true negatives/number of all with a normal test = d/(d + c). These numbers differ from sensitivity and specificity in that they are influenced by the prevalence of disease in the population and will change considerably with small changes in prevalence. Prevalence of the disease can be identified either by known values in a given population, or by performing the simple calculation from the 2 x 2 table: (a + c)/(a + b + c + d). An example shown in Table 2 compares abnormal versus normal mammograms when diagnosing patients with and those without breast cancer.⁶ Sensitivity and specificity are 0.80 and 0.90, respectively, confirming that the probability of having an abnormal result in diseased individuals is high, as is the probability of having a normal test among healthy individuals. With a specific prevalence of disease, however, positive or negative predictive values indicate the actual utility of the test in this specific population. In the example,


TABLE 2. 2 x 2 Table for Mammography

Mammography   Breast Cancer   No Breast Cancer
Abnormal      a = 80          b = 1940
Normal        c = 20          d = 17,460

Sensitivity = a/(a + c) = 0.80. Specificity = d/(b + d) = 0.90. Positive predictive value = a/(a + b) = 0.04. Negative predictive value = d/(c + d) = 0.99. Prevalence = (a + c)/(a + b + c + d) = 0.005. Modified from reference 6.

prevalence is 100/19,500 = 0.5%. With such low prevalence, positive predictive value is only 4%. Thus, despite relatively good sensitivity and specificity, this test is of little value in predicting who has the disease in the face of low prevalence in the population.

What Are Pretest and Posttest Probabilities and Likelihood Ratios?

Another useful calculation in helping predict disease is posttest probability (Figure 6). Due to clinical suspicion, a patient carries the probability of being diseased or healthy before any diagnostic tests are completed. The test result adds information to clinical suspicion, and a posttest probability arises, either a higher probability (PTL+) or a lower probability (PTL-). To calculate PTL directly, one last definition will be introduced, the likelihood ratio (LR). It also is calculated with a 2 x 2 table (Figure 7). The positive LR is defined as sensitivity/(1 - specificity). The negative LR equals (1 - sensitivity)/specificity. Evaluation of the ensuing numbers enables the clinician to quantify the value of the test for that particular disease. The positive LR is a quantity greater than or equal to 1.0, measuring the test's ability to revise the probability of disease in an upward direction depending on its magnitude above 1.0. A test with a value of 2.0 to 5.0 is considered poor to fair, but one with a value above 10 is considered good. The converse is true for the negative LR, with its values being less than 1.0. The lower the value, the more a negative result lowers the probability of disease. A value of 0.5 to 0.2 is considered poor to fair, but a value less than 0.1 is considered good. To calculate posttest probability, the following calculations can be used: PTL+ = (P x LR+)/[(1 - P) + (P x LR+)]; PTL- = (P x LR-)/[(1 - P) + (P x LR-)] (Figure 8).
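A sketch of the pretest-to-posttest conversion, using the mammography figures from Table 2 as the worked example:

```python
def positive_lr(sens, spec):
    """LR+ = sensitivity / (1 - specificity)."""
    return sens / (1.0 - spec)

def negative_lr(sens, spec):
    """LR- = (1 - sensitivity) / specificity."""
    return (1.0 - sens) / spec

def posttest_probability(pretest, lr):
    """PTL = (P x LR) / [(1 - P) + (P x LR)]."""
    return (pretest * lr) / ((1.0 - pretest) + pretest * lr)

# Mammography example: sensitivity 0.80, specificity 0.90, prevalence 0.5%.
lr_pos = positive_lr(0.80, 0.90)                # 8.0
post_pos = posttest_probability(0.005, lr_pos)  # ~0.039
```

The positive posttest probability of roughly 3.9% reproduces the positive predictive value of about 4% obtained from the 2 x 2 table, showing that the likelihood-ratio route and the 2 x 2 route agree.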

FIGURE 6. Relationship between pretest and posttest probability of disease. A pretest probability (P) of 0.5 is revised upward by a positive test (here to 0.95) and downward by a negative test (here to 0.05).

What Kind of Test Is Being Created?

Tests are usually performed for several purposes, including screening, confirmation, and exclusion.²,⁴ The specific purpose dictates specific properties required in a test. For example, in a screening test it is imperative that all those with disease are identified, even if it means a high false positive rate. Thus the test has to have high sensitivity to ensure few false negatives. An example is mammography, which is highly sensitive but only moderately specific. It identifies women with an abnormal result who must then have a confirmatory test. The false positive rate has to be minimized and the probability of disease with a positive test maximized. This requires both high specificity and high positive predictive value in confirmatory tests. A confirmatory test after mammography is breast biopsy

Likelihood ratios: LR(+) = sensitivity/(1 - specificity); LR(-) = (1 - sensitivity)/specificity.
• Indicate the value of the test in altering the probability of diagnosis.
• Poor to fair: LR(+) 2-5; LR(-) 0.2-0.5.
• Excellent: LR(+) > 10; LR(-) < 0.1.

FIGURE 7. Definition of likelihood ratios.


and tissue evaluation. This indeed has high specificity and positive predictive value. Tests of exclusion rule out diseases being entertained in a differential diagnosis, so a negative test would truly have to identify healthy individuals. To do this, high sensitivity would be required, and also high negative predictive value. An example is a test that evaluates cervical swabs for the presence or absence of fetal fibronectin. The presence of fetal fibronectin in the sample is associated with preterm delivery, but a positive test may not necessarily increase the probability of a patient delivering prematurely. A negative test has high negative predictive value, ensuring that these women have a much lower probability of delivering prematurely.

Sample Problem

This sample problem uses actual data presented in support of a diagnostic test that was introduced in 1995. The working premise was that ultrasound could identify women with ovarian cancer nonsurgically by measuring Doppler flow velocity rates of ovarian arteries (Figure 8).

• Sensitivity, or the probability of an abnormal Doppler study in women who truly had ovarian cancer, was 0.67.
• Specificity, or the probability of a normal Doppler study for a woman without ovarian cancer, was 0.66.
• Positive likelihood ratio: LR+ = sensitivity (0.67)/[1 - specificity (0.66)] = 1.97.
• Negative likelihood ratio: LR- = [1 - sensitivity (0.67)]/specificity (0.66) = 0.5.
• Pretest probability is given at 1%.
• Positive posttest probability, or positive predictive value: PPV+ = [P (0.01) x LR+ (1.97)]/[(1 - P) + P x LR+] = 0.0197/1.0097 = 2%.
• Negative predictive value: PPV- = [P (0.01) x LR- (0.5)]/[(1 - P) + P x LR-] = 0.005/0.995 = 0.5%.

The sample calculations expose many shortcomings in that study. Although the proportion of those with disease labeled positive by the test (sensitivity) is 67%, and the proportion of disease free labeled negative by the test is 66%, a negative test has little value to lower the probability of disease and a positive test has little value to increase the probability of disease, as shown by the likelihood ratios and posttest probabilities. The positive predictive value of an abnormal Doppler test, or the probability that a positive test is associated with disease, is increased only from 1% to 2% after the test is run. The negative posttest probability, or the probability of disease despite a negative test, is decreased from 1% to 0.5%. Thus, this test is of little value given 1% prevalence (pretest probability). This is further confirmed by the LR+ and LR-, which rate the test as poor. The test does become useful in certain situations, but the population being tested and the inherent prevalence of disease are extremely important in determining this. Table 3 shows how different prevalences, or pretest probabilities, affect positive and negative predictive values.⁷ If the prevalence of ovarian cancer is 0.1%, positive and negative predictive values are exceedingly low. But if the prevalence of disease in a population is 80%, the values become exceedingly high, making this a useful test.
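The sample calculations above can be reproduced as a short sketch:

```python
def posttest(p, lr):
    """Posttest probability from pretest probability p and likelihood ratio lr."""
    return (p * lr) / ((1.0 - p) + p * lr)

sens, spec, prevalence = 0.67, 0.66, 0.01
lr_pos = sens / (1.0 - spec)            # ~1.97
lr_neg = (1.0 - sens) / spec            # 0.5
ptl_pos = posttest(prevalence, lr_pos)  # ~0.02: 1% barely rises to 2%
ptl_neg = posttest(prevalence, lr_neg)  # ~0.005: 1% falls only to 0.5%
```

Rerunning posttest() at higher prevalences reproduces the trend in Table 3: at 50% prevalence the same positive likelihood ratio gives a positive posttest probability of about 66%.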

Conclusion

Many factors go into evaluating a new or existing diagnostic test to ensure that it is indeed useful and

TABLE 3. Prediction of Ovarian Malignancy by Doppler Flow

Prevalence (%)   Positive PPV (%)   Negative PPV (%)
0.1              0.2                0.1
1                2                  0.5
10               18                 5.3
50               66                 33
80               95                 82

PPV = positive predictive value; NPV = negative predictive value. From reference 7.

FIGURE 8. Calculation of posttest probabilities.
PPV(+) = P x LR(+) / [(1 - P) + P x LR(+)]
PPV(-) = P x LR(-) / [(1 - P) + P x LR(-)]
PPV = posttest probability (positive predictive value); P = pretest probability; LR = likelihood ratio.


pertinent. With some simple calculations practitioners are able to evaluate any new or existing diagnostic test. They must be able to identify how the test population was selected to ensure that it was not too narrow. This could decrease the usefulness of a test if practitioners saw patients with a wide spectrum of disease, and not the narrow spectrum represented in the study population. The gold standard must also be evaluated, examining in particular the extent to which it is a true measure of disease. How the authors defined normal and abnormal is also of utmost importance, as ambiguous or incorrect definitions can lead to incorrect diagnoses. If the authors have not done so, a 2 x 2 table can be easily constructed and sensitivity and specificity calculated to compare the index test against the gold standard with regard to its ability to identify disease and health correctly. Likelihood ratios and predictive values can then be calculated to help identify how useful the test is to any particular patient, and to calculate a probability of disease in a person from an ensuing positive or negative value. By following the systematic approach outlined, any test can be evaluated for its clinical usefulness, and practitioners will become educated consumers in their quest for a test to diagnose disease.
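The simple calculations referred to here can be gathered into one sketch that takes the four cell counts of a 2 x 2 table; the function name and structure are illustrative, not from the article:

```python
def evaluate_test(a, b, c, d):
    """Summarize a diagnostic test from its 2 x 2 table against a gold
    standard: a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    sens = a / (a + c)
    spec = d / (b + d)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "prevalence": (a + c) / (a + b + c + d),
        "lr_positive": sens / (1.0 - spec),
        "lr_negative": (1.0 - sens) / spec,
    }

# Mammography counts from Table 2: sensitivity 0.80 and specificity 0.90,
# yet a PPV of only ~0.04 at 0.5% prevalence.
summary = evaluate_test(a=80, b=1940, c=20, d=17460)
```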

References

1. Daya S: Study design for the evaluation of diagnostic tests. Semin Reprod Endocrinol 14:101-110, 1996
2. Feinstein AR: Clinical Epidemiology: The Architecture of Clinical Research. Philadelphia, WB Saunders, 1985
3. Olive DL: The range of normal. Semin Reprod Endocrinol 14:119-123, 1996
4. Olive DL: Evaluating the infertility literature. In: Infertility: Evaluation and Treatment. Edited by WR Keye, RJ Chang, RW Rebar, et al. Philadelphia, WB Saunders, 1995, pp 42-54
5. McNeil BJ, Hanley JA: Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Med Decis Making 4:137-150, 1984
6. Rosenberg RD, Lango JE, Hunt WC, et al: The New Mexico mammography project. Screening mammography performance in Albuquerque, New Mexico, 1991 to 1993. Cancer 78(8):1731, 1996
7. Stein SM, Laifer-Narin S, Johnson MB, et al: Differentiation of benign and malignant adnexal masses: relative value of gray-scale, color Doppler, and spectral Doppler sonography. AJR 164(2):381-386, 1995
