J Clin Epidemiol Vol. 51, No. 3, pp. 219–231, 1998 Copyright 1998 Elsevier Science Inc. All rights reserved.
0895-4356/98/$19.00 PII S0895-4356(97)00264-3
Bias in Discrepant Analysis: When Two Wrongs Don’t Make a Right
William C. Miller*
Robert Wood Johnson Clinical Scholars Program, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
ABSTRACT. Imperfect reference standards (alloyed standards) often hinder evaluation of new diagnostic tests. Discrepant analysis, a two-stage modification of the standard evaluation of diagnostic tests, has been used to circumvent this problem. In discrepant analysis, additional testing is performed to resolve discrepant results of the new diagnostic test and the alloyed standard. This article demonstrates algebraically that sensitivity and specificity will be overestimated by discrepant analysis, even when a perfect gold standard is used to resolve the discrepant results. The magnitude of the bias depends on the true sensitivity and specificity of the new test and initial alloyed standard, the prevalence of disease in the study population, and the proportion of concordant errors between the two tests. The article also demonstrates substantial bias associated with the use of alloyed standard tests in both stages of discrepant analysis. This procedure should not be used routinely for evaluation of diagnostic tests. J Clin Epidemiol 51;3:219–231, 1998. © 1998 Elsevier Science Inc.
KEY WORDS. Sensitivity, specificity, diagnostic tests, discrepant analysis, sexually transmitted diseases, chlamydial infections
* Address for correspondence: William C. Miller, MD, PhD, MPH, Department of Epidemiology, CB# 7400, McGavran-Greenberg, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599. Accepted for publication on 27 October 1997.
INTRODUCTION
The performance of a new diagnostic test is typically evaluated by comparison to an accepted reference standard, or gold standard (GS). Ideally, the GS is a “perfect” test, correctly classifying persons with or without disease. The estimation of a test’s performance characteristics, commonly presented as sensitivity and specificity, depends on the results of the GS. This classical evaluation of the new test is limited by the quality of the reference test. Frequently, no “perfect” GS exists. Use of an imperfect or “alloyed standard” (AS) to evaluate a new test leads to biased estimates of sensitivity and specificity of the new test because disease status is misclassified. This bias, referred to as reference test bias, may lead to either over- or underestimation of the true performance of the new test [1–4]. Use of the GS may also be limited by cost, practicality, or ethical considerations. For example, if one wishes to evaluate a new screening test for colon cancer, it may be too costly, impractical, and perhaps unethical to subject large numbers of healthy individuals to a procedure such as colonoscopy, the accepted GS.
Discrepant analysis, also referred to as discordant analysis,
is used in an attempt to circumvent these problems with the GS. Unlike the classical evaluation of a diagnostic test, discrepant analysis uses a combination of reference standards applied in two stages. Initially, the results of the new test are compared to an AS. Subsequent testing is then performed to “resolve” the discordant or discrepant results. Under ideal conditions, the resolution of discrepant results is performed with a “perfect” GS. Because the GS is not applied to all specimens, discrepant analysis provides a less costly alternative to the classical procedure.
Recently, new molecular tests for chlamydial infection, such as polymerase chain reaction and ligase chain reaction, have been licensed for use in the United States. Published reports of these new tests relied extensively on discrepant analysis [5–11]. The increased acceptance of discrepant analysis for evaluation of new microbiological tests is evidenced by its use in studies of tests for Neisseria gonorrhoeae [12–14], Clostridium difficile [15,16], Mycobacterium tuberculosis [17–19], Toxoplasma gondii [20], Helicobacter pylori [21,22], Legionella species [23], Streptococcus pyogenes [24], cytomegalovirus [25,26], herpes simplex virus [27,28], and rotavirus [29–31]. Discrepant analysis has also been used in other medical fields, such as radiology [32].
Concerns regarding the use of discrepant analysis have been raised [33,34]. Hadgu described problems associated with the use of discrepant analysis with particular reference to studies evaluating tests for chlamydial infection [33,34]. The purpose of this article is to describe more fully and quantify the inherent biases of discrepant analysis. The article demonstrates that the concept of discrepant analysis is ill-founded because of a significant bias under “ideal” conditions. An algebraic formulation of the bias is presented and used to reveal the magnitude of the bias under varying conditions. Subsequently, the biases associated with the use of discrepant analysis when no GS is available are described.
TABLE 1. Calculation of sensitivity and specificity
$$\text{Sensitivity} = \text{Probability}(\text{test positive} \mid \text{disease}) = \frac{TP}{TP + FN} \tag{1}$$
$$\text{Specificity} = \text{Probability}(\text{test negative} \mid \text{no disease}) = \frac{TN}{TN + FP} \tag{2}$$
Abbreviations: TP = true positive of the new test; FP = false positive of the new test; FN = false negative of the new test; TN = true negative of the new test. TP, TN, FP, and FN represent the results of the new diagnostic test as compared to the true disease status.
TABLE 2. Process of discrepant analysis
Top panel. Stage 1: Identification of concordant and discordant results (note a)
Middle panel. Stage 2: Identification of misclassified specimens in cells b and c (note b)
Bottom panel. Completed Stage 2: Reassignment of specimens to cells a and d (note c)
DISCREPANT ANALYSIS PROCEDURE
Equations (1) and (2) and definitions of sensitivity and specificity are shown in Table 1. Although likelihood ratios are useful test parameters, for simplicity, formulae for likelihood ratios are not presented in this analysis. Throughout the discussion, the terms true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are used to represent the results of the new diagnostic test as compared to the true disease status.
$$\text{Sensitivity} = \text{Probability}(\text{test positive} \mid \text{disease}) = \frac{TP}{TP + FN} \tag{1}$$
$$\text{Specificity} = \text{Probability}(\text{test negative} \mid \text{no disease}) = \frac{TN}{TN + FP} \tag{2}$$
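As a minimal illustration of Equations (1) and (2), the following Python sketch computes sensitivity and specificity from the four counts. The function names and the example counts are hypothetical and are not taken from the article.

```python
def sensitivity(tp: float, fn: float) -> float:
    """Equation (1): TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no diseased specimens: sensitivity is undefined")
    return tp / (tp + fn)


def specificity(tn: float, fp: float) -> float:
    """Equation (2): TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("no non-diseased specimens: specificity is undefined")
    return tn / (tn + fp)


# Hypothetical counts: 100 diseased and 900 non-diseased specimens.
example_sensitivity = sensitivity(tp=90, fn=10)
example_specificity = specificity(tn=810, fp=90)
print(example_sensitivity, example_specificity)
```

Both functions take counts measured against the true disease status, which is the sense in which TP, TN, FP, and FN are used throughout the article.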
Table 2 abbreviations: “TP” = assumed true positive; TP_b = true positive misclassified by AS and found in cell b; “TN” = assumed true negative; TN_c = true negative misclassified by AS and found in cell c; FP = false positive; FN = false negative.
a Stage 1 testing with an AS is used to classify specimens as concordant or discordant. Concordant results (+ +, − −) are assumed to be true positives or true negatives, respectively. The assumption is indicated by the quotation marks.
b Stage 2 testing of the discordant specimens (cells b and c) is performed with the GS. The GS is used to identify true positive specimens in cell b (TP_b) and true negative specimens in cell c (TN_c). TP_b, TN_c, FP, and FN represent the results of the new diagnostic test as compared to the true disease status. Results assumed to be true positives or true negatives based on concordance of the new test and the AS are shown with quotation marks.
c True positive specimens in cell b (TP_b) are reassigned to cell a. True negative specimens in cell c (TN_c) are reassigned to cell d. Calculations of measured sensitivity and specificity are then performed.
The two-stage process for discrepant analysis is shown in Table 2. After initial testing with the new test and AS in
Stage 1, specimens are classified as concordant or discordant (discrepant). Concordant specimens, in cells a and d, are assumed to be true positives or true negatives, respectively, and are not subjected to any further testing (Table 2, top panel). This assumption is represented by the quotation marks. Potential misclassification of discordant specimens by the AS is assessed by testing specimens in cells b and c with the GS in Stage 2. Only specimens in cells b and c are tested with the GS. The additional testing allows “resolution” of the discrepant specimens. True positive specimens are reassigned from cell b to cell a; true negative specimens are reassigned from cell c to cell d. After resolution of the discrepant specimens, the measured sensitivity and specificity of the new test are calculated.
INHERENT BIAS
When discrepant analysis performs perfectly, the estimates obtained for sensitivity and specificity are biased. This consistent underlying bias arises because concordant specimens are not examined with the GS. Concordant positive specimens in cell a and concordant negative specimens in cell d, although assumed to be true positives or negatives, respectively, may actually be concordant false results. Inspection of Table 3 reveals that discrepant analysis, under ideal conditions, will consistently overestimate sensitivity and specificity of the new test.
Recall that the terms true positive, true negative, false positive, and false negative are used to reflect the true, unknown disease status. The use of an AS in Stage 1 leads to misclassification of some results of the new test (Table 3, top panel). For example, after completion of Stage 1 testing, both true positives and false positives of the new test are distributed in cells a and b. In Table 3, top panel, the distribution of true positives in cells a and b is shown by the quantities (1 − m)TP and (m)TP. Summing these quantities yields all TP. Similar notation is used to represent the distributions of false positives across cells a and b and true negatives and false negatives in cells c and d.
After completion of Stage 2 testing with the GS on the discordant specimens in cells b and c, the true positive and true negative specimens are redistributed, appropriately, to cells a and d, respectively (Table 3, middle panel). Thus, all true positive specimens are captured in cell a, and all true negative specimens are captured in cell d after completion of the discrepant analysis.
Equations (3) and (4), below the middle panel in Table 3, reveal the measured sensitivity and specificity derived from discrepant analysis. From these equations, it is clearly seen that sensitivity and specificity will be overestimated, unless the quantities n and q are both equal to 0. Inclusion of the term n(FP) in the numerator and denominator of the formula for sensitivity inflates the sensitivity estimate. This estimate is further inflated because the number of false negatives in the denominator, represented by (1 − q)FN, is only
a portion of the total false negatives. The estimate of specificity is affected in an identical manner.
TABLE 3. Algebraic representation of discrepant analysis
Top panel: After testing with Stage 1 alloyed standard (note a)
Middle panel: After testing with Stage 2 gold standard (note b)
$$\text{Measured Sensitivity} = \frac{TP + (n)FP}{TP + (n)FP + (1 - q)FN} \tag{3}$$
$$\text{Measured Specificity} = \frac{TN + (q)FN}{TN + (q)FN + (1 - n)FP} \tag{4}$$
Abbreviations: TP = true positive of new test; TN = true negative of new test; FP = false positive of new test; FN = false negative of new test; m = proportion of new test TP results misclassified by AS; n = proportion of new test FP results misclassified by AS; p = proportion of new test TN results misclassified by AS; q = proportion of new test FN results misclassified by AS.
a Stage 1 testing with AS leads to misclassification of new test results. True positives (TP) and false positives (FP) are distributed in cells a and b. True negatives (TN) and false negatives (FN) are distributed in cells c and d. TP, TN, FP, and FN represent the results of the new diagnostic test as compared to the true disease status. The proportions of misclassified results are represented by the quantities m, n, p, and q. The quantities (1 − m), (1 − n), (1 − p), and (1 − q) represent correctly classified proportions of TP, FP, TN, and FN, respectively. Note that across cells a and b, the proportions of TP sum to equal TP [(1 − m)TP + (m)TP = TP]. Similarly, FP, FN, and TN sum across the horizontal cells of the table.
b Stage 2 testing with the GS allows redistribution of misclassified true positive specimens from cell b to cell a. Misclassified true negative specimens from cell c are redistributed to cell d. Thus, all true positives are captured in cell a and all true negatives are captured in cell d. The misclassified false positives and misclassified false negatives are not addressed in Stage 2 testing and remain distributed in cells a and b, and cells c and d, respectively. The measured sensitivity and specificity shown beneath the table are determined directly from the table [measured sensitivity = a/(a + c); measured specificity = d/(b + d), where a, b, c, d are the cells of the table].
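The algebra behind Equations (3) and (4) can be sketched in Python as follows. The cell expressions are inferred from the definitions of m, n, p, and q in the Table 3 notes, with rows taken as the new test and columns as the reference result; this orientation is an assumption, chosen so that measured sensitivity equals a/(a + c). The class and function names and the example counts are hypothetical, not the author's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """New-test results against the true (unknown) disease status."""
    tp: float
    fp: float
    fn: float
    tn: float


def stage1_cells(c: Counts, m: float, n: float, p: float, q: float) -> dict:
    """Cells after Stage 1 (new test versus alloyed standard)."""
    return {
        "a": (1 - m) * c.tp + n * c.fp,   # new test +, AS +
        "b": m * c.tp + (1 - n) * c.fp,   # new test +, AS -
        "c": p * c.tn + (1 - q) * c.fn,   # new test -, AS +
        "d": (1 - p) * c.tn + q * c.fn,   # new test -, AS -
    }


def stage2_cells(c: Counts, n: float, q: float) -> dict:
    """Cells after a perfect GS resolves the discordant cells b and c."""
    return {
        "a": c.tp + n * c.fp,     # all TP plus concordant false positives
        "b": (1 - n) * c.fp,      # false positives exposed by the GS
        "c": (1 - q) * c.fn,      # false negatives exposed by the GS
        "d": c.tn + q * c.fn,     # all TN plus concordant false negatives
    }


def measured_sensitivity(c: Counts, n: float, q: float) -> float:
    """Equation (3): a / (a + c) after Stage 2."""
    cells = stage2_cells(c, n, q)
    return cells["a"] / (cells["a"] + cells["c"])


def measured_specificity(c: Counts, n: float, q: float) -> float:
    """Equation (4): d / (b + d) after Stage 2."""
    cells = stage2_cells(c, n, q)
    return cells["d"] / (cells["b"] + cells["d"])


# Hypothetical study: 1000 specimens, 10% prevalence, and a new test
# with true sensitivity and specificity of 0.90.
counts = Counts(tp=90, fp=90, fn=10, tn=810)
print(measured_sensitivity(counts, n=0.2, q=0.2))  # exceeds the true 0.90
print(measured_specificity(counts, n=0.2, q=0.2))  # exceeds the true 0.90
print(measured_sensitivity(counts, n=0.0, q=0.0))  # bias vanishes when n = q = 0
```

Note that m and p do not appear in the Stage 2 cells: a perfect GS returns every misclassified true positive and true negative to its proper cell, so only the concordant false results (n and q) drive the bias.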
See also Appendix A.
MAGNITUDE OF THE BIAS
The magnitude of the bias is determined by the interplay of several factors, as shown in Equations (3) and (4) in Table 3. The quantities FP, FN, n, and q are the principal components affecting the magnitude of the bias. The greater the absolute number of false results, both false positives and false negatives, the greater the potential for bias. Furthermore, the greater the proportion of false results which are concordant with false results of the AS (quantities n and q), the greater the bias.
The absolute number of false results of the new test depends upon the true, unknown sensitivity and specificity of the test and the prevalence of disease in the study population. Tests with better performance characteristics will, in general, have fewer false results and will be subject to less bias. False positive results will occur more commonly with low prevalence of disease; false negatives with high prevalence. Thus, the measured estimates of sensitivity and specificity will vary with the prevalence of disease in the study population.
The quantities n and q represent the proportions of false positives and false negatives, respectively, that are concordant with false results of the AS. The magnitudes of n and q depend upon the sensitivity and specificity of the AS, the true sensitivity and specificity of the new test, and the “relationship” between these two tests. By the relationship of the two tests, I refer to the conditional independence or dependence of test results. Two tests are conditionally independent if the errors of one test are not correlated with the errors of another test. In other words, two tests are conditionally independent if the probability of a false result by the second test is the same whether or not the initial test had a true or false result. If the AS and new test are conditionally independent, the magnitude of the quantity n is simply equal to 1 − specificity of the AS and the quantity q is equal to 1 − sensitivity of the AS.
In many circumstances, however, the new and AS tests will be conditionally dependent. Conditional dependence may be thought of as correlated false results. With conditional dependence, the probability of a false result by a second test is greater (or less) if an initial test gave a false result. An example of this concept is the use of the same test twice. Consider a test with known false negative results when an “inhibitor” is present in the specimen. If a false negative occurs due to the inhibitor, repetition of the same test will also yield a false negative result. Correlated false results may be associated with tests based on similar technologies, with problems with specimen collection, or with situations of asymptomatic or mildly symptomatic disease. The amount of correlation determines the magnitude of the quantities n and q.
Maximum values of n and q are determined by the relative true sensitivities and specificities of the new and AS tests. If the true specificity of the new test is higher than the specificity of the AS, then the maximum value of n = 1. This maximum would occur if all false positive results of the new test were concordant with false positive results of the AS. If the true specificity of the new test is less than the specificity of the AS, then the maximum value of n is equal to the ratio of false positive results from the AS to false positive results of the new test. This maximum would occur if all false positive results of the AS were concordant with false positive results of the new test. Similar derivations based on false negatives can be made for the quantity q.
The measured sensitivity and specificity will equal the true sensitivity and specificity under only one condition: all false results of the new test must be discordant with results of the AS. Importantly, this situation actually reflects complete “negative” correlation, since some concordance of false results is expected if the tests are independent, as discussed above.
EFFECT OF THE BIAS
To demonstrate the potential effects of the bias, I determined the measured sensitivity and specificity of a new diagnostic test with true sensitivity and true specificity equal to 0.90 under varying conditions. I performed calculations under conditions of conditional independence or conditional dependence (correlated errors). The sensitivity and specificity of the AS in Stage 1 and the prevalence of the disease in the study population were varied.
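The calculation just described can be sketched in Python under stated assumptions. The sketch uses expected proportions per unit of study population, sets n and q from the AS under conditional independence, and computes the maximum n and q as the ratio of AS false results to new-test false results (capped at 1), which is how the preceding paragraphs define them. All function names, the prevalence grid, and the AS values of 0.80 are illustrative choices, not taken from the article's own computations.

```python
def expected_counts(prevalence, sens_new, spec_new):
    """Expected TP, FP, FN, TN of the new test per unit of study population."""
    tp = prevalence * sens_new
    fn = prevalence * (1 - sens_new)
    fp = (1 - prevalence) * (1 - spec_new)
    tn = (1 - prevalence) * spec_new
    return tp, fp, fn, tn


def independent_n_q(sens_as, spec_as):
    """Under conditional independence: n = 1 - spec(AS), q = 1 - sens(AS)."""
    return 1 - spec_as, 1 - sens_as


def maximum_n_q(sens_new, spec_new, sens_as, spec_as):
    """Largest possible n and q (maximum correlation of errors).

    Both tests see the same specimens, so the ratio of AS false positives
    to new-test false positives is (1 - spec_as) / (1 - spec_new), and
    likewise for false negatives. Assumes the new test is imperfect.
    """
    if sens_new >= 1 or spec_new >= 1:
        raise ValueError("n and q are undefined for a new test without errors")
    n_max = min(1.0, (1 - spec_as) / (1 - spec_new))
    q_max = min(1.0, (1 - sens_as) / (1 - sens_new))
    return n_max, q_max


def measured(prevalence, sens_new, spec_new, n, q):
    """Measured sensitivity and specificity from Equations (3) and (4)."""
    tp, fp, fn, tn = expected_counts(prevalence, sens_new, spec_new)
    meas_sens = (tp + n * fp) / (tp + n * fp + (1 - q) * fn)
    meas_spec = (tn + q * fn) / (tn + q * fn + (1 - n) * fp)
    return meas_sens, meas_spec


# New test with true sensitivity and specificity of 0.90, evaluated
# against a hypothetical AS with sensitivity and specificity of 0.80.
n_ind, q_ind = independent_n_q(sens_as=0.80, spec_as=0.80)
rows = []
for prevalence in (0.01, 0.10, 0.50, 0.90):
    sens, spec = measured(prevalence, 0.90, 0.90, n_ind, q_ind)
    rows.append((prevalence, sens, spec))
    print(f"prevalence={prevalence:.2f}  sens={sens:.3f}  spec={spec:.3f}")
```

Under these assumptions the measured sensitivity is inflated most at low prevalence and the measured specificity most at high prevalence, because false positives dominate in the first case and false negatives in the second.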
For Stage 2 testing, a GS with both sensitivity and specificity equal to 1 was assumed. In Figure 1, the effects of varying the sensitivity and specificity of the AS at multiple disease prevalences are shown, assuming independence of the two tests: panel A demonstrates the effect on measured sensitivity and panel B on measured specificity. At low prevalences of disease in the study population, the distortion of measured sensitivity is maximized. Similarly, at high prevalence, the measured specificity is overestimated to a greater degree. AS tests with poorer test characteristics lead to more significant bias in measured sensitivity and specificity of the new test.
FIGURE 1. (A) Measured sensitivity of the new test assuming independence of new test and alloyed standard. The dependence of the measured sensitivity of a new test on the test characteristics of the alloyed standard (AS) and the prevalence under conditions of independence between the AS and new test is illustrated. The true sensitivity and specificity of the new test are both 0.90. Each line represents use of an AS with different sensitivity and specificity during Stage 1 testing. A perfect gold standard is used for Stage 2 of discrepant analysis. The curves are generated using Equation (3) (see Table 3 and Appendix A). (B) Measured specificity of the new test assuming independence of new test and alloyed standard. The dependence of the measured specificity of a new test on the characteristics of the AS and the prevalence under conditions of independence between the AS and the new test is shown. The true sensitivity and specificity of the new test are both 0.90. Each line represents use of a different AS in Stage 1 testing. A perfect gold standard is used for Stage 2 of discrepant analysis. The curves are generated using Equation (4) (see Table 3 and Appendix A).
In Figure 2, the effects of conditional dependence (correlated errors) are demonstrated. Both sensitivity and specificity of the AS test in this example were 0.80. The proportions of concordant false results, quantities n and q, are varied. For convenience, the quantities n and q were assumed to be equal and were varied jointly. The figure demonstrates the potential for substantial overestimation of the true parameters of sensitivity and specificity. The sensitivity and specificity are overestimated in every case except when both n and q are 0, indicating complete negative correlation of errors.
Figure 3 demonstrates the effect of maximum correlation of errors. The maximums of the quantities n and q depend on the sensitivity and specificity of the new test and the AS. The test with the best specificity determines the maximum of n and the test with the best sensitivity determines the maximum of q. In the figure, the effects of maximum correlation are shown using AS tests with three different characteristics. In situations with maximum correlation, when the sensitivity of the AS test is less than or equal to the sensitivity of the new test, all false negative tests of the new test will be considered true negatives. The measured sensitivity will then be 1.0, as reflected by the straight line at the top of the figure. Similarly, when the specificity of the AS test is less than or equal to the specificity of the new test, under conditions of maximum correlation, all false positives of the new test will be considered true positives by discrepant analysis and the measured specificity will be 1.0.
Although demonstrating bias in the measured sensitivity and measured specificity is straightforward from an algebraic standpoint, correcting the estimates obtained from discrepant analysis is not. Correction of measured sensitivity and measured specificity requires accurate estimates of the sensitivity and specificity of the AS, the prevalence of disease in the study population, and, importantly, the frequency of concordant errors. Without an estimate of the frequency of concordant errors, only a range of possible values for the true sensitivity and specificity can be derived. Consider the following example based loosely on studies of the recently licensed tests for chlamydial infection.
Assume that the studies are performed in a population with a prevalence of 10% and a perfect GS is used to resolve the discrepant samples. The AS used in these studies has been tissue culture, which is generally believed to have poor sensitivity (approximately 75%) and excellent specificity (approximately 99–100%). The reported sensitivity (measured sensitivity) for these tests has been approximately 94% [6–9,11]. Under conditions of maximal concordance and assuming an excellent true specificity of 99%, the true sensitivity could be as low as 70%. Assuming conditional independence, the true sensitivity would be 91.5%. The “truth” is likely to be somewhere between these extremes.
DISCREPANT ANALYSIS WITHOUT A GOLD STANDARD
Discrepant analysis has been used commonly when all available reference tests are alloyed standards. In this circumstance, alloyed standards are used for both Stage 1 and Stage 2 testing. This procedure is also biased.
The use of an AS for Stage 2 testing leads to further misclassification of the discordant specimens, as shown in Table 4. As described previously, after initial testing, true positive and false positive results are distributed in cells a and b; true negative and false negative results are distributed in cells c and d. However, unlike the situation with a “perfect” GS for Stage 2 testing, use of an AS in Stage 2 leads to misclassification of some specimens during redistribution of specimens from cell b to cell a and from cell c to cell d. After resolution of the discordant specimens in cell b, a portion of true positives, represented by (w)(m)TP, may remain in cell b. A portion of false positives, represented by (x)(1 − n)FP, will be erroneously transferred to cell a. Similarly, true negatives and false negatives will be erroneously retained in cell c and transferred to cell d, respectively. The resultant equations (5) and (6) for the measured sensitivity and specificity are shown below.
$$\text{Measured Sensitivity} = \frac{(1 - wm)TP + (n + x - xn)FP}{(1 - wm)TP + (n + x - xn)FP + (yp)TN + (1 - q - z + zq)FN} \tag{5}$$
$$\text{Measured Specificity} = \frac{(1 - yp)TN + (q + z - qz)FN}{(1 - yp)TN + (q + z - qz)FN + (wm)TP + (1 - x - n + xn)FP} \tag{6}$$
An important recognition regarding this form of discrepant analysis is that the Stage 2 AS test used to resolve discordant results may actually perform differently than if it were used to assess the entire study population. Conceptually, this is similar to a form of spectrum bias [35]. The population tested by the Stage 2 AS is selected by results of both the new test and the Stage 1 AS. If errors of the Stage 2 AS are correlated with errors of either the new test or the Stage 1 AS, the performance of the Stage 2 AS will be biased in the assessment of the discordant samples.
The possibility of conditional dependence (correlated errors) of the Stage 2 AS with the new test or Stage 1 AS can have very significant effects on the measured sensitivity and specificity of the new test. Consider the following situation. Assume that the new test and the Stage 2 AS are conditionally dependent because they are based on similar technologies. Many errors of the new test may be correlated with errors of the Stage 2 AS. Under this condition, the quantity x in equations (5) and (6) may be large and the associated quantity, (x)(1 − n)FP, will also be large. Thus, many of the “resolved” discrepant samples will be false positive results shifted erroneously from cell b to cell a.
When two AS tests are used for discrepant analysis, the interplay of several factors determines the magnitude of the bias. The prevalence of disease in the study population; correlation of errors between the first AS test and the new test,
the second AS test and the new test, and the two AS tests; and the true sensitivities and specificities of the three tests will influence the measured sensitivity and specificity. Importantly, the size of the bias may be substantially larger than when a perfect gold standard is used for Stage 2 testing. In other circumstances, the measured values may actually be closer to the true values.
FIGURE 2. (A) Measured sensitivity of the new test assuming conditional dependence of the new test and alloyed standard. The measured sensitivity for a new test is shown assuming conditional dependence (correlated errors) of the new test and the alloyed standard (AS). The true sensitivity and specificity of the new test are both 0.90. For Stage 1 testing, an AS with both sensitivity and specificity equal to 0.80 is used. Each line represents the measured sensitivity for differing proportions of concordant errors, represented by the quantities n and q. A perfect gold standard is used in Stage 2 testing. The curves are generated using Equation (3) (see Table 3 and Appendix A). (B) Measured specificity of the new test assuming conditional dependence of new test and alloyed standard. The measured specificity for a new test is shown assuming conditional dependence (correlated errors) of the new test and the alloyed standard (AS). The true sensitivity and specificity of the new test are both 0.90. For Stage 1 testing, an AS with both sensitivity and specificity equal to 0.80 is used. Each line represents the measured specificity for differing proportions of concordant errors, represented by the quantities n and q. A perfect gold standard is used in Stage 2 testing. The curves are generated using Equation (4) (see Table 3 and Appendix A).
FIGURE 3. (A) Maximum measured sensitivity. Using a perfect gold standard for Stage 2 testing, the maximum measured sensitivity for a new test with true sensitivity and specificity equal to 0.90 is shown. Each line represents the measured sensitivity for differing characteristics of the alloyed standard under the condition of maximum concordant errors. The curves are generated using Equation (3) (see Table 3 and Appendix A). (B) Maximum measured specificity. Using a perfect gold standard for Stage 2 testing, the maximum measured specificity for a new test with true sensitivity and specificity equal to 0.90 is shown. Each line represents the measured specificity for differing characteristics of the alloyed standard under the condition of maximum concordant errors. The curves are generated using Equation (4) (see Table 3 and Appendix A).
TABLE 4. Algebraic representation of discrepant analysis: two alloyed standards
Top panel: After testing with Stage 1 alloyed standard (note a)
Bottom panel: After testing with Stage 2 alloyed standard (note b)
$$\text{Measured Sensitivity} = \frac{(1 - wm)TP + (n + x - xn)FP}{(1 - wm)TP + (n + x - xn)FP + (yp)TN + (1 - q - z + zq)FN} \tag{5}$$
$$\text{Measured Specificity} = \frac{(1 - yp)TN + (q + z - qz)FN}{(1 - yp)TN + (q + z - qz)FN + (wm)TP + (1 - x - n + xn)FP} \tag{6}$$
Abbreviations: TP = true positive of new test; TN = true negative of new test; FP = false positive of new test; FN = false negative of new test; m = proportion of new test TP results misclassified by Stage 1 AS; n = proportion of new test FP results misclassified by Stage 1 AS; p = proportion of new test TN results misclassified by Stage 1 AS; q = proportion of new test FN results misclassified by Stage 1 AS; w = proportion of new test TP results misclassified by Stage 2 AS; x = proportion of new test FP results misclassified by Stage 2 AS; y = proportion of new test TN results misclassified by Stage 2 AS; z = proportion of new test FN results misclassified by Stage 2 AS.
a Stage 1 testing is the same as in Table 3. True positive (TP) and false positive (FP) results are distributed in cells a and b. True negative (TN) and false negative (FN) results are distributed in cells c and d. TP, TN, FP, and FN represent the results of the new diagnostic test as compared to the true disease status.
b Stage 2 testing with an AS leads to imperfect redistribution of specimens. Testing of specimens in cell b leads to retention of some TP in cell b, represented by (w)(m)TP, and redistribution of some FP to cell a, represented by (x)(1 − n)FP. Similarly, a portion of true negatives are retained in cell c, (y)(p)TN, and a portion of false negatives are redistributed to cell d, (z)(1 − q)FN. The measured sensitivity and specificity shown beneath the table are determined directly from the table [measured sensitivity = a/(a + c); measured specificity = d/(b + d), where a, b, c, d represent the cells of the table].
STRATEGIES TO REDUCE THE BIAS
A fundamental principle of the evaluation of diagnostic tests is that the determination of disease status should not be based on the test under evaluation [35]. Discrepant analysis violates this principle because the definition of disease status is based, in part, on the results of the test under study. Thus, preferred solutions to the bias of discrepant analysis involve the separation of the definition of disease from the new test results.
If an accepted GS is available, the preferred method to avoid the biases is to apply the GS to all specimens. If no “quality” gold standard exists, the preferred solution is to use the best possible alloyed standard test or combination of tests. For example, combining multiple tests with differing test characteristics, such as a test with high sensitivity with a test with high specificity, may provide relatively accurate assignment for most specimens. However, these tests should be applied to all specimens along with the new test in a single stage of testing.
Discrepant analysis represents a unique form of verification bias (also referred to as post-test referral bias or workup bias). Verification bias occurs when sensitivity and specificity estimates are derived from a subgroup of the study population for which verification with a gold standard has been performed. A method to correct verification bias is available and applies to discrepant analysis in limited circumstances [36,37]. Application of this method to reduce the bias of discrepant analysis requires that the GS test be applied to a randomly sampled selection of specimens concordant by the new test and the AS. In other words, the GS should be applied to a random sample of specimens from cells a and d of Table 2, top panel. From this random sample, the concordance rate of false results can be approximated and the estimates of sensitivity and specificity adjusted accordingly. Further work is necessary to identify the appropriate sample size for this process. Importantly, this procedure is adequate only if the GS available for the verification stage is perfect or nearly so.
More sophisticated modeling strategies, such as latent class analysis, are also available. These modeling strategies have been used in other circumstances in which no quality GS exists [38]. However, these strategies typically assume conditional independence.
DISCUSSION
In this article, I have quantified the biases associated with discrepant analysis. As suggested by Hadgu [33], the measured sensitivity and specificity obtained by discrepant analysis nearly always overestimate the true values, even when a perfect GS is used to evaluate the discordant specimens. The magnitude of the bias is determined by the true parameters of the tests, the prevalence of disease in the study population, and the presence or absence of conditional dependence between the tests. Correlated errors between the new test and either standard test lead to significant overestimation of the parameters of the new test.
The problem of correlated errors is a common one. Repeating the same test is likely to have some correlation of errors. Similarly, the same examiner is likely to make the same mistake repeatedly on an individual patient or procedure. Tests based on similar technologies are very likely to have correlated errors. For example, inhibitors of the enzymatic amplification steps of the polymerase chain reaction are likely to affect polymerase chain reaction assays similarly, even if the assays are based on different primers. For infectious diseases, conditions of low organism burden may be missed consistently by different types of tests.
Generally, subclinical or asymptomatic disease may affect test performance.

The most fundamental problem with discrepant analysis is that measured disease status is conditioned on the results of the test under examination, the new test. In Stage 1 testing, the results of the new test are used to decide whether or not the AS results are “believable.” If the two tests are concordant, the AS results are accepted; if they are discordant, the AS results are deemed questionable and further testing is performed. The “believability” of the AS results should not be conditioned on the results of the new test.

The philosophy of discrepant analysis suggests that the Stage 2 testing is “superior” to that performed in Stage 1 and has the capability of accurately resolving discrepancies. This assumption is true when a “perfect” GS is used for Stage 2. However, when AS tests are used in both Stage 1 and Stage 2, the assumed superiority may or may not hold. In any case, if tests of presumed higher quality are available (as would be used in Stage 2 testing), these tests should be applied to all specimens, including the initially concordant specimens.

Under most circumstances, the true sensitivity and specificity of the new test cannot be readily determined from the measured sensitivity and specificity obtained by discrepant analysis. Even if a perfect GS is used to resolve discrepant specimens, the proportion of concordant false results remains an unknown quantity that cannot be determined without evaluation of specimens concordant by the AS and new test. The proportion of concordant false results could be determined by evaluation of a random sample of specimens in these cells, as suggested above. However, if independence is assumed with known sensitivity and specificity of the Stage 1 AS test and a “perfect” Stage 2 GS, then the proportion of concordant false results may be estimated. Back calculations can be made to adjust the measured sensitivity and specificity in this situation.

Discrepant analysis should be avoided if at all possible. Widespread adoption of diagnostic tests in screening programs based on false estimates of sensitivity and specificity could have significant ramifications because of unexpected false positive or false negative results. In such screening programs, even small overestimates of sensitivity or specificity could have dramatic effects.

The author was a Robert Wood Johnson Clinical Scholar at the University of North Carolina at Chapel Hill during the completion of this work. The author wishes to thank Drs. Alula Hadgu, David Ransohoff, Irva Hertz-Picciotto, Arthur Evans, Timothy Carey, Marion Danis, Bradley Gaynes, Michelle Forcier, and Sandra Moody-Ayers for critique of the manuscript.
References

1. Boyko E, Alderman B, Baron A. Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease. J Gen Intern Med 1988; 3: 476–481.
2. Thibodeau L. Evaluating diagnostic tests. Biometrics 1981; 37: 801–804.
3. Vacek P. The effect of conditional dependence on the evaluation of diagnostic tests. Biometrics 1985; 41: 959–968.
4. Deneef P. Evaluating rapid tests for streptococcal pharyngitis: The apparent accuracy of a diagnostic test when there are errors in the standard of comparison. Medical Decision Making 1987; 7: 92–96.
5. Loeffelholz MJ, Lewinski CA, Silver SR, Purohit AP, Herman SA, Buonagurio DA, Dragon EA. Detection of Chlamydia trachomatis in endocervical specimens by polymerase chain reaction. J Clin Microbiol 1992; 30: 2847–2851.
6. Bauwens JE, Clark AM, Loeffelholz MJ, Herman SA, Stamm WE. Diagnosis of Chlamydia trachomatis urethritis in men by polymerase chain reaction assay of first-catch urine. J Clin Microbiol 1993; 31: 3013–3016.
7. Jaschek G, Gaydos CA, Welsh LE, Quinn TC. Direct detection of Chlamydia trachomatis in urine specimens from symptomatic and asymptomatic men by using a rapid polymerase chain reaction assay. J Clin Microbiol 1993; 31: 1209–1212.
8. Schachter J, Stamm WE, Quinn TC, Andrews WW, Burczak JD, Lee HH. Ligase chain reaction to detect Chlamydia trachomatis infection of the cervix. J Clin Microbiol 1994; 32: 2540–2543.
9. Lee HH, Chernesky MA, Schachter J, Burczak JD, Andrews WW, Muldoon S, Leckie G, Stamm WE. Diagnosis of Chlamydia trachomatis genitourinary infection in women by ligase chain reaction assay of urine. Lancet 1995; 345: 213–216.
10. Bassiri M, Hu HY, Domeika MA, Burczak J, Svensson LO, Lee HH, Mardh PA. Detection of Chlamydia trachomatis in urine specimens from women by ligase chain reaction. J Clin Microbiol 1995; 33: 898–900.
11. Wiesenfeld HC, Uhrin M, Dixon BW, Sweet RL. Diagnosis of male Chlamydia trachomatis urethritis by polymerase chain reaction. Sexually Transmitted Diseases 1994; 21: 268–271.
12. Ching S, Lee H, Hook EW, 3rd, Jacobs MR, Zenilman J. Ligase chain reaction for detection of Neisseria gonorrhoeae in urogenital swabs. J Clin Microbiol 1995; 33: 3111–3114.
13. Smith KR, Ching S, Lee H, Ohhashi Y, Hu HY, Fisher HC, 3rd, Hook EW, 3rd. Evaluation of ligase chain reaction for use with urine for identification of Neisseria gonorrhoeae in females attending a sexually transmitted disease clinic. J Clin Microbiol 1995; 33: 455–457.
14. Chapin-Robertson K, Reece EA, Edberg SC. Evaluation of the Gen-Probe PACE II assay for the direct detection of Neisseria gonorrhoeae in endocervical specimens. Diag Microbiol Infect Dis 1992; 15: 645–649.
15. Schue V, Green GA, Monteil H. Comparison of the ToxA test with cytotoxicity assay and culture for the detection of Clostridium difficile-associated diarrhoea disease. J Med Microbiol 1994; 41: 316–318.
16. De Girolami PC, Hanff PA, Eichelberger K, et al. Multicenter evaluation of a new enzyme immunoassay for detection of Clostridium difficile enterotoxin A. J Clin Microbiol 1992; 30: 1085–1088.
17. Pfyffer GE, Kissling P, Wirth R, Weber R. Direct detection of Mycobacterium tuberculosis complex in respiratory specimens by a target-amplified test system. J Clin Microbiol 1994; 32: 918–923.
18. Vuorinen P, Miettinen A, Vuento R, Hallstrom O. Direct detection of Mycobacterium tuberculosis complex in respiratory specimens by Gen-Probe Amplified Mycobacterium Tuberculosis Direct Test and Roche Amplicor Mycobacterium Tuberculosis Test. J Clin Microbiol 1995; 33: 1856–1859.
19. Vlaspolder F, Singer P, Roggeveen C. Diagnostic value of an amplification method (Gen-Probe) compared with that of culture for diagnosis of tuberculosis. J Clin Microbiol 1995; 33: 2699–2703.
20. Crouch CF. Enzyme immunoassays for IgG and IgM antibodies to Toxoplasma gondii based on enhanced chemiluminescence. J Clin Pathol 1995; 48: 652–657.
21. Pronovost AD, Rose SL, Pawlak JW, Robin H, Schneider R. Evaluation of a new immunodiagnostic assay for Helicobacter pylori antibody detection: Correlation with histopathological and microbiological results. J Clin Microbiol 1994; 32: 46–50.
22. Graham DY, Evans DJ Jr, Peacock J, Baker JT, Schrier WH. Comparison of rapid serological tests (FlexSure HP and QuickVue) with conventional ELISA for detection of Helicobacter pylori infection. Am J Gastroenterol 1996; 91: 942–948.
23. Edelstein PH, Bryan RN, Enns RK, Kohne DE, Kacian DL. Retrospective study of Gen-Probe rapid diagnostic system for detection of legionellae in frozen clinical respiratory tract samples. J Clin Microbiol 1987; 25: 1022–1026.
24. Knigge KM, Babb JL, Firca JR, Ancell K, Bloomster TG, Marchlewicz BA. Enzyme immunoassay for the detection of group A streptococcal antigen. J Clin Microbiol 1984; 20: 735–741.
25. Roseff SD, Campos JM. Detection of cytomegalovirus antibodies in serum using the TranSTAT-CMV and CMV Scan assays. Am J Clin Pathol 1993; 99: 539–541.
26. Zweygberg Wirgart B, Landqvist M, Hokeberg I, Eriksson BM, Olding-Stenkvist E, Grillner L. Early detection of cytomegalovirus in cell culture by a new monoclonal antibody, CCH2. J Virol Methods 1990; 27: 211–219.
27. LeBar WD, Resek CM, Crist AE Jr, Sautter RL. Comparison of a rapid latex agglutination assay and a fluorescent-antibody technique for the detection of herpes simplex antibody. Diag Microbiol Infect Dis 1988; 11: 21–24.
28. Dascal A, Chan-Thim J, Morahan M, Portnoy J, Mendelson J. Diagnosis of herpes simplex virus infection in a clinical setting by a direct antigen detection enzyme immunoassay kit. J Clin Microbiol 1989; 27: 700–704.
29. Cromien JL, Himmelreich CA, Glass RI, Storch GA. Evaluation of new commercial enzyme immunoassay for rotavirus detection. J Clin Microbiol 1987; 25: 2359–2362.
30. Sambourg M, Goudeau A, Courant C, Pinon G, Denis F. Direct appraisal of latex agglutination testing, a convenient alternative to enzyme immunoassay for the detection of rotavirus in childhood gastroenteritis, by comparison of two enzyme immunoassays and two latex tests. J Clin Microbiol 1985; 21: 622–625.
31. Dennehy PH, Gauntlett DR. Evaluation of a new enzyme immunoassay (TESTPACK rotavirus) for the detection of rotavirus in fecal specimens. Diagn Microbiol Infect Dis 1988; 11: 201–203.
32. Wester JP, Holtkamp M, Linnebank ER, et al. Non-invasive detection of deep venous thrombosis: Ultrasonography versus duplex scanning. Eur J Vasc Surg 1994; 8: 357–361.
33. Hadgu A. The discrepancy in discrepant analysis. Lancet 1996; 348: 592–593.
34. Hadgu A. Bias in the evaluation of DNA-amplification tests for detecting Chlamydia trachomatis. Stat Med 1997; 16: 1391–1399.
35. Ransohoff DR, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978; 299: 926–930.
36. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983; 39: 207–215.
37. Diamond GA. Affirmative actions: Can the discriminant accuracy of a test be determined in the face of selection bias? Med Decision Making 1991; 11: 48–56.
38. Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: A review. J Clin Epidemiol 1988; 41: 923–937.
APPENDIX A

I. Derivation of Equations (3) and (4).

a. With respect to the true underlying disease status, the results of the new diagnostic test may be classified as true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

b. After testing with the AS in Stage 1 of discrepant analysis, there are eight possible joint outcomes (refer to Table 3):

Concordant true positives: $(1 - m)\,\mathrm{TP}$
Discordant (misclassified) true positives: $(m)\,\mathrm{TP}$
Concordant (misclassified) false positives: $(n)\,\mathrm{FP}$
Discordant false positives: $(1 - n)\,\mathrm{FP}$
Concordant true negatives: $(1 - p)\,\mathrm{TN}$
Discordant (misclassified) true negatives: $(p)\,\mathrm{TN}$
Concordant (misclassified) false negatives: $(q)\,\mathrm{FN}$
Discordant false negatives: $(1 - q)\,\mathrm{FN}$
c. After testing discordant specimens in Stage 2 with a perfect gold standard, all true positives are identified as true positives and all true negatives are identified as true negatives. The discordant TP are reclassified and captured in cell a; the discordant TN are reclassified and captured in cell d.

\[ \text{Concordant TP} + \text{Discordant TP} = \text{All TP}: \quad (1 - m)\,\mathrm{TP} + (m)\,\mathrm{TP} = \mathrm{TP} \]

\[ \text{Concordant TN} + \text{Discordant TN} = \text{All TN}: \quad (1 - p)\,\mathrm{TN} + (p)\,\mathrm{TN} = \mathrm{TN} \]

d. The resultant measured sensitivity is equal to cell a over the sum of cells a and c; the measured specificity is equal to cell d over the sum of cells b and d.

\[ \text{Measured sensitivity} = \frac{a}{a + c} = \frac{\mathrm{TP} + (n)\,\mathrm{FP}}{\mathrm{TP} + (n)\,\mathrm{FP} + (1 - q)\,\mathrm{FN}} \]

\[ \text{Measured specificity} = \frac{d}{b + d} = \frac{\mathrm{TN} + (q)\,\mathrm{FN}}{\mathrm{TN} + (q)\,\mathrm{FN} + (1 - n)\,\mathrm{FP}} \]
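The cell bookkeeping in step d can be sketched in a few lines of Python. This is only an illustration under the appendix's assumptions (a perfect gold standard in Stage 2); the function name `measured_accuracy` and its argument names are hypothetical and do not appear in the original article.

```python
def measured_accuracy(tp, fp, tn, fn, n, q):
    """Measured sensitivity and specificity after discrepant analysis,
    assuming a perfect gold standard resolves all discordant specimens.

    tp, fp, tn, fn: results of the new test against true disease status.
    n: fraction of new-test false positives that the AS also calls positive.
    q: fraction of new-test false negatives that the AS also calls negative.
    (Function and argument names are illustrative assumptions.)
    """
    cell_a = tp + n * fp      # concordant false positives are never re-tested
    cell_b = (1 - n) * fp     # discordant false positives are corrected in Stage 2
    cell_c = (1 - q) * fn     # discordant false negatives are corrected in Stage 2
    cell_d = tn + q * fn      # concordant false negatives are never re-tested
    sensitivity = cell_a / (cell_a + cell_c)
    specificity = cell_d / (cell_b + cell_d)
    return sensitivity, specificity


# With no concordant errors (n = q = 0) the true values are recovered.
sens, spec = measured_accuracy(tp=90, fp=90, tn=810, fn=10, n=0.0, q=0.0)
print(round(sens, 3), round(spec, 3))  # prints 0.9 0.9
```

Any positive value of n or q moves errors of the new test into cells a and d, which is why both measured values can only rise above the true ones in this setting.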
e. The quantities (n)FP and (q)FN will largely determine the magnitude of the bias for both measured sensitivity and specificity. The maximum value for the quantity (n)FP is limited by the absolute number of false positives generated by the new test and the AS. The absolute number of false positives will depend on the prevalence of disease in the study population and the specificity of the new test or the AS. Similarly, the absolute number of false negatives generated by the new test and the AS will limit the maximum value of (q)FN.

\[ (n)\,\mathrm{FP} \le \min(\text{FP of new test},\ \text{FP of AS}) \]

\[ (q)\,\mathrm{FN} \le \min(\text{FN of new test},\ \text{FN of AS}) \]

f. If the true specificity of the new test is equal to or better than the specificity of the AS, the maximum value of n will be 1 (all FP of the new test are concordant with a false positive of the AS). If the specificity of the new test is less than the specificity of the AS, the maximum value of n will be equal to the ratio of false positives generated by the AS to false positives generated by the new test. Similar equations hold for false negatives.

\[ \max(n) = \min\left(1,\ \frac{\text{FP of AS}}{\text{FP of new test}}\right) \]

If true specificity of new test $\ge$ true specificity of AS, $\max(n) = 1$.
If true specificity of new test $<$ true specificity of AS, $\max(n) = \mathrm{FP}_{AS} / \mathrm{FP}_{NT}$.

\[ \max(q) = \min\left(1,\ \frac{\text{FN of AS}}{\text{FN of new test}}\right) \]

If true sensitivity of new test $\ge$ true sensitivity of AS, $\max(q) = 1$.
If true sensitivity of new test $<$ true sensitivity of AS, $\max(q) = \mathrm{FN}_{AS} / \mathrm{FN}_{NT}$.
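The bound in step f is a one-line computation. The sketch below is an illustration only: the helper name `max_concordant_fraction` is hypothetical, and returning 0 when the new test makes no errors is an assumption of this sketch, since the fraction is undefined in that case.

```python
def max_concordant_fraction(errors_new_test, errors_as):
    """Upper bound on n (or q): the largest share of new-test errors
    that can coincide with an error of the alloyed standard (AS).
    (Helper name is an illustrative assumption.)
    """
    if errors_new_test == 0:
        # Assumption of this sketch: nothing can be concordant when
        # the new test makes no errors of this type.
        return 0.0
    return min(1.0, errors_as / errors_new_test)


# AS less specific than the new test: 180 AS false positives vs. 90 new-test
# false positives, so every new-test false positive could be concordant.
print(max_concordant_fraction(errors_new_test=90, errors_as=180))  # prints 1.0

# AS more specific than the new test: only 45 AS false positives are
# available to match the 90 new-test false positives.
print(max_concordant_fraction(errors_new_test=90, errors_as=45))   # prints 0.5
```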
g. If the new test and AS are conditionally independent, the values of m, n, p, and q are equal to (1 − sensitivity of the AS), (1 − specificity of the AS), (1 − specificity of the AS), and (1 − sensitivity of the AS), respectively.

Concordant true positives: $(1 - m)\,\mathrm{TP} = (1 - (1 - \text{sensitivity of AS}))\,\mathrm{TP} = (\text{sensitivity of AS})\,\mathrm{TP}$
Discordant true positives: $(m)\,\mathrm{TP} = (1 - \text{sensitivity of AS})\,\mathrm{TP}$
Concordant false positives: $(n)\,\mathrm{FP} = (1 - \text{specificity of AS})\,\mathrm{FP}$
Discordant false positives: $(1 - n)\,\mathrm{FP} = (1 - (1 - \text{specificity of AS}))\,\mathrm{FP} = (\text{specificity of AS})\,\mathrm{FP}$
Concordant true negatives: $(1 - p)\,\mathrm{TN} = (1 - (1 - \text{specificity of AS}))\,\mathrm{TN} = (\text{specificity of AS})\,\mathrm{TN}$
Discordant true negatives: $(p)\,\mathrm{TN} = (1 - \text{specificity of AS})\,\mathrm{TN}$
Concordant false negatives: $(q)\,\mathrm{FN} = (1 - \text{sensitivity of AS})\,\mathrm{FN}$
Discordant false negatives: $(1 - q)\,\mathrm{FN} = (1 - (1 - \text{sensitivity of AS}))\,\mathrm{FN} = (\text{sensitivity of AS})\,\mathrm{FN}$

h. With conditional independence of the new test and AS, after resolution of discrepant specimens with a perfect GS, the resulting measured sensitivity and measured specificity will be:

\[ \text{Measured sensitivity} = \frac{\mathrm{TP} + (1 - \text{specificity of AS})\,\mathrm{FP}}{\mathrm{TP} + (1 - \text{specificity of AS})\,\mathrm{FP} + (\text{sensitivity of AS})\,\mathrm{FN}} \]

\[ \text{Measured specificity} = \frac{\mathrm{TN} + (1 - \text{sensitivity of AS})\,\mathrm{FN}}{\mathrm{TN} + (1 - \text{sensitivity of AS})\,\mathrm{FN} + (\text{specificity of AS})\,\mathrm{FP}} \]
II. Sample Calculations

a. Conditional Independence

With conditional independence of the new test and alloyed standard (AS), given prevalence (prev) = 0.10, sensitivity of AS = 0.80, specificity of AS = 0.80, true sensitivity of new test = 0.90, true specificity of new test = 0.90, a study population of 1000 persons (N), and a perfect gold standard for Stage 2 of discrepant analysis:

\[
\begin{aligned}
\text{Number of TP (new test)} &= \text{true sensitivity} \times \text{prev} \times N = 0.90 \times 0.10 \times 1000 = 90 \\
\text{Number of FN (new test)} &= (1 - \text{true sensitivity}) \times \text{prev} \times N = 0.10 \times 0.10 \times 1000 = 10 \\
\text{Number of TN (new test)} &= \text{true specificity} \times (1 - \text{prev}) \times N = 0.90 \times 0.90 \times 1000 = 810 \\
\text{Number of FP (new test)} &= (1 - \text{true specificity}) \times (1 - \text{prev}) \times N = 0.10 \times 0.90 \times 1000 = 90 \\
(n)\,\mathrm{FP} &= (1 - \text{specificity of AS}) \times \mathrm{FP} = (1 - 0.80) \times 90 = 18 \\
(q)\,\mathrm{FN} &= (1 - \text{sensitivity of AS}) \times \mathrm{FN} = (1 - 0.80) \times 10 = 2
\end{aligned}
\]

\[ \text{Measured sensitivity} = \frac{\mathrm{TP} + (n)\,\mathrm{FP}}{\mathrm{TP} + (n)\,\mathrm{FP} + (1 - q)\,\mathrm{FN}} = \frac{90 + (1 - 0.80) \times 90}{90 + (1 - 0.80) \times 90 + (1 - (1 - 0.80)) \times 10} = \frac{108}{116} = 0.931 \]

\[ \text{Measured specificity} = \frac{\mathrm{TN} + (q)\,\mathrm{FN}}{\mathrm{TN} + (q)\,\mathrm{FN} + (1 - n)\,\mathrm{FP}} = \frac{810 + (1 - 0.80) \times 10}{810 + (1 - 0.80) \times 10 + (1 - (1 - 0.80)) \times 90} = \frac{812}{884} = 0.919 \]

b. Maximum measured sensitivity and measured specificity

Given prevalence (prev) = 0.10, sensitivity of AS = 0.80, specificity of AS = 0.80, true sensitivity of new test = 0.90, true specificity of new test = 0.90, a study population of 1000 persons (N), and a perfect gold standard for Stage 2 of discrepant analysis:

The numbers of TP, FN, TN, and FP for the new test are the same as in IIa above.

Maximum n = minimum (1, FP of AS/FP of new test); since specificity of AS < specificity of new test, maximum n = 1.
Maximum q = minimum (1, FN of AS/FN of new test); since sensitivity of AS < sensitivity of new test, maximum q = 1.

\[ \text{Maximum measured sensitivity} = \frac{\mathrm{TP} + (n)\,\mathrm{FP}}{\mathrm{TP} + (n)\,\mathrm{FP} + (1 - q)\,\mathrm{FN}} = \frac{90 + (1) \times 90}{90 + (1) \times 90 + (1 - (1)) \times 10} = \frac{180}{180} = 1.00 \]

\[ \text{Maximum measured specificity} = \frac{\mathrm{TN} + (q)\,\mathrm{FN}}{\mathrm{TN} + (q)\,\mathrm{FN} + (1 - n)\,\mathrm{FP}} = \frac{810 + (1) \times 10}{810 + (1) \times 10 + (1 - (1)) \times 90} = \frac{820}{820} = 1.00 \]
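The conditional-independence calculation in IIa can be repeated for other prevalences with a short script. This is a minimal sketch, not part of the original article: the function name `discrepant_bias` and the chosen prevalence grid are assumptions made for illustration, and the code works with population proportions rather than counts, which leaves the ratios unchanged.

```python
def discrepant_bias(prev, sens_new, spec_new, sens_as, spec_as):
    """Bias (measured minus true) of discrepant analysis when the new test
    and the AS are conditionally independent and Stage 2 uses a perfect GS.
    (Function name and signature are illustrative assumptions.)
    """
    # New-test results against true disease status, as population proportions.
    tp = sens_new * prev
    fn = (1 - sens_new) * prev
    tn = spec_new * (1 - prev)
    fp = (1 - spec_new) * (1 - prev)
    # Under conditional independence (Appendix A, step g).
    n = 1 - spec_as   # share of new-test false positives the AS also gets wrong
    q = 1 - sens_as   # share of new-test false negatives the AS also gets wrong
    measured_sens = (tp + n * fp) / (tp + n * fp + (1 - q) * fn)
    measured_spec = (tn + q * fn) / (tn + q * fn + (1 - n) * fp)
    return measured_sens - sens_new, measured_spec - spec_new


# Same test parameters as example IIa, over an arbitrary grid of prevalences.
for prev in (0.01, 0.05, 0.10, 0.25, 0.50):
    bias_sens, bias_spec = discrepant_bias(prev, 0.90, 0.90, 0.80, 0.80)
    print(f"prevalence={prev:.2f}  "
          f"sensitivity bias={bias_sens:+.3f}  "
          f"specificity bias={bias_spec:+.3f}")
```

At a prevalence of 0.10 the two biases correspond to the measured values 0.931 and 0.919 against true values of 0.90.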