Statistical Methodology 9 (2012) 490–500
Statistical Methodology journal homepage: www.elsevier.com/locate/stamet
Sensitivity, specificity and ROC-curves in multiple reader diagnostic trials—A unified, nonparametric approach Katharina Lange ∗ , Edgar Brunner Department of Medical Statistics, University of Göttingen, 37073 Göttingen, Germany
Article info
Article history: Received 28 March 2011; received in revised form 11 October 2011; accepted 22 December 2011.
Keywords: Sensitivity; Specificity; Area under the ROC-curve (AUC); Diagnostic accuracy; Nonparametric; Behrens–Fisher problem
abstract In diagnostic trials, the performance of a product is most frequently measured in terms such as sensitivity, specificity and the area under the ROC-curve (AUC). In multiple-reader trials, correlated data appear in a natural way since the same patient is observed under different conditions by several readers. The repeated measures may have quite an involved correlation structure. Even though sensitivity, specificity and the AUC are all assessments of diagnostic ability, a unified approach to analyze all such measurements allowing for an arbitrary correlation structure does not exist. Thus, a unified approach for these three effect measures of diagnostic ability will be presented in this paper. The fact that sensitivity and specificity are particular AUCs will serve as a basis for our method of analysis. As the presented theory can also be used in set-ups with correlated binomial random-variables, it may have a more extensive application than only in diagnostic trials. © 2012 Elsevier B.V. All rights reserved.
1. Introduction In diagnostic trials, the main interest is focused on the ability of a modality (a particular diagnostic procedure) to distinguish between diseased and non-diseased subjects. Various measures can be used to quantify the accuracies of diagnostic tests. For tests with binary end-points, the ability to discriminate between diseased and non-diseased subjects can be measured by their sensitivity and specificity, which are defined as the probabilities of the test correctly identifying the diseased subjects or the non-diseased respectively. When the end-point of the modality or the test is measured on a metric or an ordinal scale, a cut-off has to be chosen in order to compute sensitivity and
∗ Correspondence to: Department of Medical Statistics, Humboldallee 32, 37073 Göttingen, Germany. Tel.: +49 551 39 4956; fax: +49 551 39 4995. E-mail address:
[email protected] (K. Lange). 1572-3127/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.stamet.2011.12.002
K. Lange, E. Brunner / Statistical Methodology 9 (2012) 490–500
specificity. The ROC-curve is a graph of sensitivity versus 1-specificity when varying the cut-off over all possible values. The area under this curve (AUC) expresses the probability that for a randomly chosen pair of diseased and non-diseased subjects the test values are ranked correctly. Predictive values or likelihood ratios might also be suitable measures of diagnostic ability. In this paper, however, we focus on sensitivity, specificity and the AUC as the most frequently used accuracy assessments. For details on the different measures of diagnostic accuracy, we refer to the excellent textbooks by Zhou et al. [13] and Pepe [9].

For the comparison of the diagnostic ability of imaging devices, binary (diseased/non-diseased) or ordinal end-points (e.g. a severity scale) are commonly used. In these studies regulatory authorities demand at least two, preferably three, blinded readers evaluating the data of each participant. Thus there are two factors influencing the end-point of a subject: the ability of the imaging device to display the health status of the patient and the ability of the reader to evaluate the images. Hence a factorial design with the two factors (reader and modality) is a suitable statistical model. In such a model, the questions
1. Is there any difference between the diagnostic ability of the modalities?
2. Is there any difference between the diagnostic ability of the readers?
3. Is the diagnostic ability of the readers consistent among the modalities?
can be translated into hypotheses and as such be tested by statistical tests. Of course, the third question can only be considered if each modality is evaluated by each reader, and this is the model considered here. Analyzing multiple reader trials using factorial designs allows consistency among readers not only to be measured (as done, e.g., with the kappa statistic) but also to be compared and tested statistically.
Furthermore, using a factorial design as the method of analysis avoids the problem of how to summarize the results of different readers. The review article by Obuchowski et al. [8] provides an excellent overview of the different (parametric as well as nonparametric) approaches for the evaluation of multiple reader trials by factorial designs. A completely nonparametric approach was first developed by Song [12] in 1997, who extended the theory of U-statistics developed by DeLong et al. [4] to evaluate the AUCs of several diagnostic tests in multiple reader trials. A further nonparametric method of analysis was proposed by Kaufmann et al. [5], who used the methodology of multivariate rank statistics to develop approaches of analysis for the AUC. Neither [12] nor [5], however, extended these methods to a nonparametric analysis of sensitivity and specificity. These two binary assessments of diagnostic ability may be analyzed in a factorial design by using, for instance, the method of generalized estimating equations (GEEs). To the best of our knowledge, however, apart from this paper there is no unified nonparametric approach for the analysis of the AUC, the sensitivity and the specificity. We therefore extend the approach for the AUC provided by Kaufmann et al. [5] to the analysis of sensitivity and specificity by regarding these quantities from a new point of view.

2. Statistical model and accuracy assessments

2.1. Statistical model

We consider a diagnostic trial involving $n_1$ diseased and $n_0$ non-diseased subjects. In the set-up of this trial, each subject may be observed repeatedly under $m = 1, \ldots, M$ modalities (e.g. imaging techniques) by $r = 1, \ldots, R$ different blinded readers. For the derivation of the asymptotic results, the combination of modality $m$ and reader $r$ is, for simplicity, relabeled by one index $l = 1, \ldots, d$, where $d = M \cdot R$.
The data of subject $k$ are collected in a vector $\mathbf{X}_{ik} = (X_{ik}^{(1)}, \ldots, X_{ik}^{(d)})$, $i = 0, 1$, $k = 1, \ldots, n_i$, where $X_{ik}^{(l)}$ has the marginal distribution function

$$F_i^{(l)}(x) = P(X_{ik}^{(l)} < x) + 0.5 \cdot P(X_{ik}^{(l)} = x),$$

i.e. in our approach the normalized version of the distribution function is used in accordance with the idea of Ruymgaart [11], who represents $F_i^{(l)}(x)$ as

$$F_i^{(l)}(x) = 0.5 \cdot \left( F_i^{+(l)}(x) + F_i^{-(l)}(x) \right),$$

where $F_i^{+(l)}(x)$ and $F_i^{-(l)}(x)$ denote the left- and the right continuous version of the distribution function respectively. The normalized version is used to allow for the analysis of both continuous and discrete observations, such as ordered categorical data. Thus we note that $F_i^{(l)}(x)$ may be an arbitrary distribution function with the exception of the trivial case of a one-point-distribution. Further, let $\mathbf{F}_i = (F_i^{(1)}, \ldots, F_i^{(d)})$ denote the vector of the marginal distributions in sample $i$.

To compute the sensitivity and the specificity let $\gamma^{(l)}$ denote the cut-off in the $l$-th component, i.e. subject $k$ is classified as diseased (by the $l$-th reader–modality-combination) if $X_{ik}^{(l)} > \gamma^{(l)}$ and as non-diseased if $X_{ik}^{(l)} < \gamma^{(l)}$. Without loss of generality, we assume $P(X_{ik}^{(l)} = \gamma^{(l)}) = 0$, $\forall\, l = 1, \ldots, d$. Note that this assumption is only required for discontinuous distribution functions. In this case the cut-off can be chosen as a value which is not included on the original scale.

To derive asymptotic results the following regularity assumptions are needed.

Assumption 1.
(1) For all $l, r = 1, \ldots, d$ the bivariate distribution of $(X_{ik}^{(l)}, X_{ik}^{(r)})$ is the same for all subjects $k = 1, \ldots, n_i$ within group $i$, $i = 0, 1$ (independent replications),
(2) $n_0 + n_1 = N \to \infty$ such that $N / n_i \le N_0 < \infty$, $i = 0, 1$, where $N_0$ denotes some finite constant.
As a diagnostic trial involves different comparative assessments of diagnostic performance, we will present the most frequently used (sensitivity, specificity and AUC) in the next section. In doing so we will approach the binary assessments from a different point of view, which will serve as the basis for the unified analysis of sensitivity, specificity and the area under the ROC-curve. With the help of this approach, results for sensitivity and specificity will be identified as a special case of the results obtained for the AUC.

2.2. Measures on diagnostic ability

AUC

The area under the ROC-curve is a commonly used quality criterion to describe the overall accuracy of a diagnostic agent. It is defined as a function of the distribution functions $F_0^{(l)}$ and $F_1^{(l)}$, namely,

$$\mathrm{AUC}^{(l)} = \mathrm{AUC}(F_0^{(l)}, F_1^{(l)}) = \int F_0^{(l)} \, dF_1^{(l)}. \tag{1}$$
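For discrete (e.g. ordinal) scores, the integral in (1) reduces to a finite sum over the support. The following minimal Python sketch illustrates this under the assumption that both distributions live on a common, sorted support; the function names and the two score distributions are hypothetical and not taken from the trial data.

```python
def normalized_cdf(pmf):
    """Normalized distribution function on a sorted support:
    F(x_j) = P(X < x_j) + 0.5 * P(X = x_j)."""
    cdf, below = [], 0.0
    for p in pmf:
        cdf.append(below + 0.5 * p)
        below += p
    return cdf


def auc_from_pmfs(pmf0, pmf1):
    """AUC = integral of F0 dF1 = P(X0 < X1) + 0.5 * P(X0 = X1)
    for two discrete distributions given on the same sorted support."""
    f0 = normalized_cdf(pmf0)
    return sum(f * p for f, p in zip(f0, pmf1))


# Hypothetical score distributions on the ordinal scale 1..5 (illustration only).
pmf_non_diseased = [0.4, 0.3, 0.2, 0.1, 0.0]
pmf_diseased = [0.0, 0.1, 0.2, 0.3, 0.4]

auc = auc_from_pmfs(pmf_non_diseased, pmf_diseased)
print(round(auc, 4))
```

With identical distributions the same function returns 0.5, which corresponds to a test without any discriminating ability.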
From a practical point of view, the AUC expresses the probability that a randomly chosen observation from the diseased and a randomly chosen observation from the non-diseased population are ranked correctly. The AUC hence describes the probability that a randomly chosen diseased subject is classified as more diseased than a randomly chosen non-diseased subject, i.e. it represents the probability that the diagnostic test result of a diseased subject has a higher value than the test result of a non-diseased subject. This is also referred to as the relative effect (see e.g. [2]), which was originally introduced by [6].

Sensitivity

The sensitivity is the probability of a diseased subject to be correctly diagnosed as diseased, i.e.:

$$\mathrm{se}^{(l)} = P(X_{ik}^{(l)} > \gamma^{(l)} \mid i = 1) = P(X_{1k}^{(l)} > \gamma^{(l)}), \quad \forall\, l = 1, \ldots, d,$$

for any $k = 1, \ldots, n_1$. Let $c(x) = 0, 0.5, 1$ according as $x <, =, > 0$ denote the normalized version of the count function and let $\Gamma^{(l)}(x) = c(x - \gamma^{(l)})$ denote the distribution function of a one-point distribution with point-mass at $\gamma^{(l)}$. Then the sensitivity is given by

$$\mathrm{se}^{(l)} = P(X_{1k}^{(l)} > \gamma^{(l)}) = 1 - P(X_{1k}^{(l)} < \gamma^{(l)}) = 1 - F_1^{(l)}(\gamma^{(l)}) = 1 - \int F_1^{(l)} \, d\Gamma^{(l)}.$$

Thus, the sensitivity can be regarded as the AUC of a particular ROC-curve generated by the distributions $\Gamma^{(l)}$ and $F_1^{(l)}$, namely,

$$\mathrm{se}^{(l)} = \int \Gamma^{(l)} \, dF_1^{(l)} = \mathrm{AUC}(\Gamma^{(l)}, F_1^{(l)}). \tag{2}$$

We note that for establishing the equalities the assumption $P(X_{ik}^{(l)} = \gamma^{(l)}) = 0$, $\forall\, l = 1, \ldots, d$, is required.
Specificity

The specificity is defined as the probability of a non-diseased subject to be correctly classified as non-diseased, namely,

$$\mathrm{sp}^{(l)} = P(X_{ik}^{(l)} < \gamma^{(l)} \mid i = 0) = P(X_{0k}^{(l)} < \gamma^{(l)}) = F_0^{(l)}(\gamma^{(l)}).$$

As in the case of sensitivity, specificity can also be regarded as the area under a particular ROC-curve, namely,

$$\mathrm{sp}^{(l)} = \int F_0^{(l)} \, d\Gamma^{(l)} = \mathrm{AUC}(F_0^{(l)}, \Gamma^{(l)}). \tag{3}$$
Fig. 1 illustrates this relationship between the AUC, the sensitivity, and the specificity. Regarding sensitivity and specificity as particular AUCs, it is sufficient to derive asymptotic results for the area under the ROC-curve if one of the distributions is allowed to be a point-mass.

3. Estimators and their asymptotic distributions

First, we present the nonparametric estimator for the usual AUC proposed by Bamber [1]. For the estimation of the AUC two samples are required, one of diseased and one of non-diseased subjects, whereas for the estimation of sensitivity and specificity in each case only one sample is necessary. Thus in a second step, we will expand the approach of estimation of the AUC to the estimation of sensitivity and specificity by showing that these quantities can be regarded as pseudo-two-sample-quantities. Furthermore, we will derive the asymptotic distribution of the estimators for sensitivity, specificity and the AUC and present these asymptotic results in a unified form.

3.1. Estimators

AUC

Let $R_{ik}^{(l)}$ denote the mid-rank of $X_{ik}^{(l)}$ among all $N = n_0 + n_1$ observations within the $l$-th component and let $\bar{R}_{i\cdot}^{(l)} = n_i^{-1} \sum_{k=1}^{n_i} R_{ik}^{(l)}$ denote the mean of the $R_{ik}^{(l)}$.

Theorem 1. In a nonparametric set-up, the area under the curve given in (1) is consistently estimated by

$$\widehat{\mathrm{AUC}}^{(l)} = \int \widehat{F}_0^{(l)} \, d\widehat{F}_1^{(l)} = \frac{1}{n_0} \left( \bar{R}_{1\cdot}^{(l)} - \frac{n_1 + 1}{2} \right). \tag{4}$$
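The rank representation in (4) can be checked numerically against the direct pairwise definition of the AUC. The sketch below is a minimal illustration for a single component $l$; the helper names and the small data set are hypothetical, and the mid-ranks are computed by brute force rather than with a statistics library.

```python
def midranks(values):
    """Mid-rank of each value: (#smaller) + 0.5 * (#equal, incl. itself) + 0.5."""
    return [
        sum(v < x for v in values) + 0.5 * sum(v == x for v in values) + 0.5
        for x in values
    ]


def auc_rank(x0, x1):
    """Estimator (4): (mean overall mid-rank of the diseased - (n1 + 1)/2) / n0."""
    n0, n1 = len(x0), len(x1)
    ranks = midranks(list(x0) + list(x1))   # ranks among all N = n0 + n1 values
    mean_rank_diseased = sum(ranks[n0:]) / n1
    return (mean_rank_diseased - (n1 + 1) / 2) / n0


def auc_pairs(x0, x1):
    """Direct plug-in: share of correctly ordered pairs, ties counted as 0.5."""
    total = sum(1.0 if a < b else 0.5 if a == b else 0.0 for a in x0 for b in x1)
    return total / (len(x0) * len(x1))


# Hypothetical ordinal scores (illustration only).
non_diseased = [1, 2, 2, 3]
diseased = [2, 3, 4, 5, 5]

print(auc_rank(non_diseased, diseased), auc_pairs(non_diseased, diseased))
```

Both functions return the same value for any data set, ties included, because the overall mid-rank of an observation minus its mid-rank within its own sample counts the observations of the other sample that lie below it.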
Plugging in the empirical counterparts of the distribution functions in (1) leads to the nonparametric estimator (4) of the AUC, which was first mentioned by Bamber [1] for the one-dimensional case. For further details and properties of this estimator (such as the representation with ranks and the case of discontinuous distributions) we refer to [2].

Sensitivity and specificity

To estimate the sensitivity and the specificity we apply the usual plug-in-method, which leads to

$$\widehat{\mathrm{se}}^{(l)} = \int \Gamma^{(l)} \, d\widehat{F}_1^{(l)} = \frac{1}{n_1} \sum_{k=1}^{n_1} \Gamma^{(l)}(X_{1k}^{(l)}) = \frac{n_1^{(c)}}{n_1},$$

$$\widehat{\mathrm{sp}}^{(l)} = \int \widehat{F}_0^{(l)} \, d\Gamma^{(l)} = 1 - \frac{1}{n_0} \sum_{k=1}^{n_0} \Gamma^{(l)}(X_{0k}^{(l)}) = \frac{n_0^{(c)}}{n_0},$$

where $n_1^{(c)}$ and $n_0^{(c)}$ denote the numbers of correctly identified diseased and non-diseased patients respectively. Note that these plug-in estimators are the well-known maximum likelihood estimators, which are unbiased and consistent in this case.

3.2. Asymptotic distribution

In this section, we derive the asymptotic distribution of the estimators presented in the last subsection.
Fig. 1. Densities of the distribution functions (a) used for the calculation of the ROC-curves (b), whose corresponding AUCs are the common AUC, the sensitivity and the specificity.
AUC

The AUCs for the $l = 1, \ldots, d$ different reader–modality-combinations are collected in a vector $\mathbf{AUC} = (\mathrm{AUC}^{(1)}, \ldots, \mathrm{AUC}^{(d)})'$, and so are the corresponding estimators $\widehat{\mathbf{AUC}} = (\widehat{\mathrm{AUC}}^{(1)}, \ldots, \widehat{\mathrm{AUC}}^{(d)})'$ as presented in (4). Let

$$B^{(l)} = \int F_0^{(l)} \, d\widehat{F}_1^{(l)} - \int F_1^{(l)} \, d\widehat{F}_0^{(l)} + 1 - 2 \cdot \int F_0^{(l)} \, dF_1^{(l)}, \quad l = 1, \ldots, d,$$

and $\mathbf{B} = (B^{(1)}, \ldots, B^{(d)})$. Furthermore, let $\mathbf{V}_N$ denote the covariance matrix of $\sqrt{N}\,\mathbf{B}$ and let $\lambda_1, \ldots, \lambda_d$ denote the eigenvalues of $\mathbf{V}_N$. To derive the asymptotic distribution we need the additional regularity condition that $\lambda_{\min} = \min\{\lambda_1, \ldots, \lambda_d\}$ is bounded away from 0.

Assumption 2. Let $\lambda_{\min} \ge \lambda_0 > 0$, where $\lambda_0$ is some constant.

Theorem 2. Under Assumptions 1 and 2, the statistic $\sqrt{N}(\widehat{\mathbf{AUC}} - \mathbf{AUC})$ has, asymptotically, a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\mathbf{V}_N$.

The proof is mainly based on Lindeberg's central limit theorem; for details we refer to [2].

Sensitivity

Let $\mathbf{se} = (\mathrm{se}^{(1)}, \ldots, \mathrm{se}^{(d)})'$ denote the vector of the sensitivities for each reader–modality combination and let $\widehat{\mathbf{se}} = (\widehat{\mathrm{se}}^{(1)}, \ldots, \widehat{\mathrm{se}}^{(d)})'$ denote its estimator as defined in Section 3.1. Furthermore, let $\nu_1, \ldots, \nu_d$ denote the eigenvalues of $\mathbf{S}_{n_1}^{se} = \mathrm{Cov}(\sqrt{n_1}\,\widehat{\mathbf{se}})$. Similarly to the last paragraph, an additional regularity condition on $\nu_{\min} = \min\{\nu_1, \ldots, \nu_d\}$ is required.

Assumption 3. Let $\nu_{\min} \ge \nu_0 > 0$, where $\nu_0$ is some constant.

Theorem 3. Under Assumptions 1 and 3, the statistic $\sqrt{n_1}(\widehat{\mathbf{se}} - \mathbf{se})$ has, asymptotically, a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\mathbf{S}_{n_1}^{se}$.

Proof. We obtain that

$$\widehat{\mathrm{se}}^{(l)} - \mathrm{se}^{(l)} = \int \Gamma^{(l)} \, d\widehat{F}_1^{(l)} - \int \Gamma^{(l)} \, dF_1^{(l)} = \frac{1}{n_1} \sum_{k=1}^{n_1} \Gamma^{(l)}(X_{1k}^{(l)}) - \int \Gamma^{(l)} \, dF_1^{(l)}.$$

Note that, apart from some constant, the difference $\widehat{\mathrm{se}}^{(l)} - \mathrm{se}^{(l)}$ is the mean of the independent random variables $\Gamma^{(l)}(X_{1k}^{(l)})$, $k = 1, \ldots, n_1$. Then the asymptotic normality is easily established by verifying Lindeberg's condition, which follows from Assumption 3, because the random variables $\Gamma^{(l)}(X_{1k}^{(l)})$ are uniformly bounded. (While it would be sufficient to use Lévy's central limit theorem in this case, the proof of Theorem 2 requires Lindeberg's version. Hence for convenience, we use Lindeberg's theorem in both cases.)
Specificity

Let $\mathbf{sp} = (\mathrm{sp}^{(1)}, \ldots, \mathrm{sp}^{(d)})'$ denote the vector of the specificities for each reader–modality combination and let $\widehat{\mathbf{sp}} = (\widehat{\mathrm{sp}}^{(1)}, \ldots, \widehat{\mathrm{sp}}^{(d)})'$ denote its estimator as defined in Section 3.1. Furthermore, let $\tau_1, \ldots, \tau_d$ denote the eigenvalues of $\mathbf{S}_{n_0}^{sp} = \mathrm{Cov}(\sqrt{n_0}\,\widehat{\mathbf{sp}})$. Here, we also need the additional regularity condition that $\tau_{\min} = \min\{\tau_1, \ldots, \tau_d\}$ is bounded away from 0.

Assumption 4. Let $\tau_{\min} \ge \tau_0 > 0$, where $\tau_0$ is some constant.

Theorem 4. Under Assumptions 1 and 4, the statistic $\sqrt{n_0}(\widehat{\mathbf{sp}} - \mathbf{sp})$ has, asymptotically, a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\mathbf{S}_{n_0}^{sp}$.

The proof is essentially the same as for sensitivity (see Theorem 3) and is therefore omitted here.

4. Estimation of the covariance matrix

In this section, we will present the estimators of the covariance matrices of the AUC, the sensitivity and the specificity.

AUC

Let $R_{ik}^{(l)}$ denote the mid-rank of $X_{ik}^{(l)}$ among all $N$ observations in the $l$-th component and let $\mathbf{R}_{ik} = (R_{ik}^{(1)}, \ldots, R_{ik}^{(d)})'$ denote the vector of these so-called overall-ranks as defined in Section 3.1. Further, let $Q_{ik}^{(l)}$, $k = 1, \ldots, n_i$, denote the mid-rank of $X_{ik}^{(l)}$ among all $n_i$ observations within the $i$-th sample and the $l$-th component and let $\mathbf{Q}_{ik} = (Q_{ik}^{(1)}, \ldots, Q_{ik}^{(d)})'$ denote the vector of these so-called internal-ranks $Q_{ik}^{(l)}$ ($i = 0, 1$, $l = 1, \ldots, d$). Furthermore, let

$$\bar{\mathbf{R}}_{i\cdot} = \frac{1}{n_i} \sum_{k=1}^{n_i} \mathbf{R}_{ik} \quad \text{and} \quad \bar{\mathbf{Q}}_{i\cdot} = \frac{1}{n_i} \sum_{k=1}^{n_i} \mathbf{Q}_{ik} = \frac{n_i + 1}{2}\, \mathbf{1}_d, \quad i = 0, 1,$$

denote the means of these vectors, where $\mathbf{1}_d = (1, \ldots, 1)'$ denotes the $d$-dimensional vector of 1s. Finally, let $\mathbf{Z}_{ik} = \mathbf{R}_{ik} - \mathbf{Q}_{ik}$ and $\bar{\mathbf{Z}}_{i\cdot} = \bar{\mathbf{R}}_{i\cdot} - \bar{\mathbf{Q}}_{i\cdot}$.

Theorem 5. The covariance matrix $\mathbf{V}_N$ of $\sqrt{N}(\widehat{\mathbf{AUC}} - \mathbf{AUC})$ can be estimated consistently by $\widehat{\mathbf{V}}_N = \widehat{\mathbf{V}}_{N,0} + \widehat{\mathbf{V}}_{N,1}$, where

$$\widehat{\mathbf{V}}_{N,i} = \frac{N}{(N - n_i)^2\, n_i\, (n_i - 1)} \sum_{k=1}^{n_i} (\mathbf{Z}_{ik} - \bar{\mathbf{Z}}_{i\cdot})(\mathbf{Z}_{ik} - \bar{\mathbf{Z}}_{i\cdot})'. \tag{5}$$
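A minimal sketch of how the estimator (5) might be computed is given below. It is plain Python without any statistics library; the function names and the small two-component data set are hypothetical, and each sample is assumed to contain at least two subjects so that the factor $n_i - 1$ is positive.

```python
def midranks(values):
    """Mid-rank of each value: (#smaller) + 0.5 * (#equal, incl. itself) + 0.5."""
    return [
        sum(v < x for v in values) + 0.5 * sum(v == x for v in values) + 0.5
        for x in values
    ]


def rank_differences(own, other):
    """Z_ik = R_ik - Q_ik per component: overall mid-rank minus internal mid-rank."""
    d = len(own[0])
    z = [[0.0] * d for _ in own]
    for l in range(d):
        own_l = [row[l] for row in own]
        overall = midranks(own_l + [row[l] for row in other])[: len(own)]
        internal = midranks(own_l)
        for k in range(len(own)):
            z[k][l] = overall[k] - internal[k]
    return z


def v_component(z, n_total):
    """V_{N,i} = N / ((N - n_i)^2 n_i (n_i - 1)) * sum_k (Z_ik - Zbar)(Z_ik - Zbar)'."""
    n_i, d = len(z), len(z[0])
    mean = [sum(row[l] for row in z) / n_i for l in range(d)]
    factor = n_total / ((n_total - n_i) ** 2 * n_i * (n_i - 1))
    return [
        [factor * sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in z)
         for b in range(d)]
        for a in range(d)
    ]


def auc_covariance(x0, x1):
    """Estimated covariance matrix V_N = V_{N,0} + V_{N,1} as in (5)."""
    n_total = len(x0) + len(x1)
    v0 = v_component(rank_differences(x0, x1), n_total)
    v1 = v_component(rank_differences(x1, x0), n_total)
    d = len(v0)
    return [[v0[a][b] + v1[a][b] for b in range(d)] for a in range(d)]


# Hypothetical scores for d = 2 reader-modality combinations (illustration only).
non_diseased = [[1, 2], [3, 3], [4, 2], [2, 1]]
diseased = [[4, 5], [2, 4], [5, 5], [3, 2], [4, 4]]

v_hat = auc_covariance(non_diseased, diseased)
print(v_hat)
```

The rows of each sample are subjects and the columns are components, so the correlation between repeated readings of the same subject enters through the off-diagonal elements of the matrix.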
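For the proof of the $L_2$-consistency of $\widehat{\mathbf{V}}_N$, we refer to [2]. Note that the assumption of not having one-point distributions is not required for the proof of this theorem.

Sensitivity and specificity

For the estimation of the covariance matrices of sensitivity and specificity the usual sample covariance matrix is used. With $\boldsymbol{\Gamma}(\mathbf{X}_{ik}) = (\Gamma^{(1)}(X_{ik}^{(1)}), \ldots, \Gamma^{(d)}(X_{ik}^{(d)}))'$,

$$\widehat{\mathbf{S}}_{n_1}^{se} = \frac{1}{n_1 - 1} \sum_{k=1}^{n_1} (\boldsymbol{\Gamma}(\mathbf{X}_{1k}) - \widehat{\mathbf{se}})(\boldsymbol{\Gamma}(\mathbf{X}_{1k}) - \widehat{\mathbf{se}})', \tag{6}$$

$$\widehat{\mathbf{S}}_{n_0}^{sp} = \frac{1}{n_0 - 1} \sum_{k=1}^{n_0} (\mathbf{1}_d - \boldsymbol{\Gamma}(\mathbf{X}_{0k}) - \widehat{\mathbf{sp}})(\mathbf{1}_d - \boldsymbol{\Gamma}(\mathbf{X}_{0k}) - \widehat{\mathbf{sp}})'. \tag{7}$$

Being sample covariance matrices, these estimators are unbiased and consistent.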
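As a small illustration of the plug-in estimators of Section 3.1 and of the sample covariance matrices above, the following sketch dichotomizes hypothetical scores with the normalized count function. All names and data are assumptions made for this example; for the specificity, the indicator of a correct classification, $1 - \Gamma^{(l)}(X_{0k}^{(l)})$, is centred at its own mean.

```python
def count_fn(x):
    """Normalized count function c(x): 0, 0.5 or 1 according as x <, =, > 0."""
    return 0.0 if x < 0 else 0.5 if x == 0 else 1.0


def gamma_vectors(rows, cutoffs):
    """Apply Gamma^(l)(x) = c(x - gamma^(l)) to every component of every subject."""
    return [[count_fn(x - g) for x, g in zip(row, cutoffs)] for row in rows]


def mean_vector(rows):
    n = len(rows)
    return [sum(row[l] for row in rows) / n for l in range(len(rows[0]))]


def sample_covariance(rows):
    """Usual unbiased sample covariance matrix (divisor n - 1)."""
    n, d = len(rows), len(rows[0])
    mean = mean_vector(rows)
    return [
        [sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]


# Hypothetical data: d = 2 components, cut-off 3.5 on a 1..5 scale.
cutoffs = [3.5, 3.5]
diseased = [[4, 5], [2, 4], [5, 5], [3, 2], [4, 4]]
non_diseased = [[1, 2], [3, 3], [4, 2], [2, 1]]

positive_1 = gamma_vectors(diseased, cutoffs)                 # 1 = test positive
negative_0 = [[1.0 - g for g in row]                          # 1 = test negative
              for row in gamma_vectors(non_diseased, cutoffs)]

se_hat = mean_vector(positive_1)        # share of diseased above the cut-off
sp_hat = mean_vector(negative_0)        # share of non-diseased below the cut-off
s_se = sample_covariance(positive_1)
s_sp = sample_covariance(negative_0)
print(se_hat, sp_hat)
```

The off-diagonal elements of `s_se` and `s_sp` capture the dependence between the binary readings of the same subject under different reader–modality combinations.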
Relation between the estimators of sensitivity, specificity and AUC

In this section, it is shown that not only the parameters but also the estimators of sensitivity and specificity can be treated as special AUCs. To this end, let $\boldsymbol{\gamma} = (\gamma^{(1)}, \ldots, \gamma^{(d)})'$ denote the vector of cut-offs for each component $l$ and let $\{\boldsymbol{\gamma}, \ldots, \boldsymbol{\gamma}\}$ denote a pseudo-sample, where each pseudo-observation equals $\boldsymbol{\gamma}$. This sample may be regarded as a realization of a random variable with point-mass at $\boldsymbol{\gamma}$. Let $n_{ps}$ denote the sample size of this pseudo-sample. Then we obtain that

$$\widehat{\Gamma}^{(l)}(x) = \frac{1}{n_{ps}} \sum_{k=1}^{n_{ps}} c(x - \gamma^{(l)}) = c(x - \gamma^{(l)}) = \Gamma^{(l)}(x).$$

Thus, the empirical distribution function $\widehat{\Gamma}^{(l)}$ equals the distribution function $\Gamma^{(l)}$ itself. Hence, sensitivity and specificity can be estimated in the same way as the usual AUC, when one sample is replaced by a pseudo-sample of one-point distributed observations:

$$\widehat{\mathrm{se}}^{(l)} = \int \Gamma^{(l)} \, d\widehat{F}_1^{(l)} = \int \widehat{\Gamma}^{(l)} \, d\widehat{F}_1^{(l)},$$

$$\widehat{\mathrm{sp}}^{(l)} = \int \widehat{F}_0^{(l)} \, d\Gamma^{(l)} = \int \widehat{F}_0^{(l)} \, d\widehat{\Gamma}^{(l)}.$$
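The pseudo-sample idea can be tried out with any routine that computes the rank-based AUC estimator. The sketch below is only an illustration for a single component: the function names, the data and the pseudo-sample size `n_ps = 3` are hypothetical, and the cut-off 3.5 is assumed not to occur among the observed scores.

```python
def midranks(values):
    """Mid-rank of each value: (#smaller) + 0.5 * (#equal, incl. itself) + 0.5."""
    return [
        sum(v < x for v in values) + 0.5 * sum(v == x for v in values) + 0.5
        for x in values
    ]


def auc_rank(x0, x1):
    """Rank-based AUC estimator: x0 plays the role of the non-diseased sample,
    x1 the role of the diseased sample."""
    n0, n1 = len(x0), len(x1)
    ranks = midranks(list(x0) + list(x1))
    return (sum(ranks[n0:]) / n1 - (n1 + 1) / 2) / n0


# Hypothetical ordinal scores and cut-off (illustration only).
cutoff = 3.5
n_ps = 3                          # arbitrary size of the pseudo-sample
pseudo = [cutoff] * n_ps          # one-point distributed pseudo-sample
diseased = [4, 2, 5, 3, 4]
non_diseased = [1, 3, 4, 2]

# Sensitivity: the non-diseased sample is replaced by the pseudo-sample.
se_via_auc = auc_rank(pseudo, diseased)
# Specificity: the diseased sample is replaced by the pseudo-sample.
sp_via_auc = auc_rank(non_diseased, pseudo)

# Direct proportions of correct classifications for comparison.
se_direct = sum(x > cutoff for x in diseased) / len(diseased)
sp_direct = sum(x < cutoff for x in non_diseased) / len(non_diseased)
print(se_via_auc, se_direct, sp_via_auc, sp_direct)
```

Changing `n_ps` leaves both estimates unchanged, since every pseudo-observation contributes the same comparisons.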
It can easily be verified that applying the approach of pseudo-samples to the covariance-estimator of the usual AUC leads to the following estimators:

$$\widehat{\mathbf{V}}_{n_1 + n_{ps}} = \frac{n_1 + n_{ps}}{n_1} \cdot \frac{1}{n_1 - 1} \sum_{k=1}^{n_1} (\boldsymbol{\Gamma}(\mathbf{X}_{1k}) - \widehat{\mathbf{se}})(\boldsymbol{\Gamma}(\mathbf{X}_{1k}) - \widehat{\mathbf{se}})' = \frac{n_1 + n_{ps}}{n_1}\, \widehat{\mathbf{S}}_{n_1}^{se},$$

$$\widehat{\mathbf{V}}_{n_0 + n_{ps}} = \frac{n_0 + n_{ps}}{n_0} \cdot \frac{1}{n_0 - 1} \sum_{k=1}^{n_0} (\mathbf{1}_d - \boldsymbol{\Gamma}(\mathbf{X}_{0k}) - \widehat{\mathbf{sp}})(\mathbf{1}_d - \boldsymbol{\Gamma}(\mathbf{X}_{0k}) - \widehat{\mathbf{sp}})' = \frac{n_0 + n_{ps}}{n_0}\, \widehat{\mathbf{S}}_{n_0}^{sp}.$$

Assuming that $n_{ps}$ is arbitrary but fixed, whereas $n_i \to \infty$, $i = 0, 1$, the results of Theorems 3 and 4 can be rewritten as follows: The statistics

$$\sqrt{\frac{n_{ps} + n_1}{n_1}} \cdot \sqrt{n_1}\,(\widehat{\mathbf{se}} - \mathbf{se}) = \sqrt{n_{ps} + n_1}\,(\widehat{\mathbf{se}} - \mathbf{se})$$

and

$$\sqrt{\frac{n_{ps} + n_0}{n_0}} \cdot \sqrt{n_0}\,(\widehat{\mathbf{sp}} - \mathbf{sp}) = \sqrt{n_{ps} + n_0}\,(\widehat{\mathbf{sp}} - \mathbf{sp})$$

have, asymptotically, a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrices $\frac{n_1 + n_{ps}}{n_1}\, \mathbf{S}_{n_1}^{se}$ and $\frac{n_0 + n_{ps}}{n_0}\, \mathbf{S}_{n_0}^{sp}$ respectively. Applying Theorem 2 to the sample of diseased and a one-point distributed sample of non-diseased patients would lead exactly to this result. Hence, the analysis of sensitivity and specificity can be performed in the same way as the analysis of the AUC by replacing one sample by a pseudo-sample. Note that the latter results are independent of the special choice of the sample size $n_{ps}$ because the factor $\frac{n_{ps} + n_i}{n_i}$ cancels out in test statistics and confidence intervals.

This approach of analysis offers considerable advantages concerning the computation and interpretation of results:
1. Only a single piece of software for the analysis of the AUC is required. If the sample of the diseased or the non-diseased is replaced by a one-point distributed pseudo-sample, the same software will be able to analyze sensitivity and specificity.
2. The unified approach allows us to analyze the different measures of diagnostic accuracy in a homogeneous and comparable way.

5. Hypotheses and test statistics

In a first step, we will present hypotheses which should be considered in a diagnostic trial with multiple readers; in a second step we will derive statistics to test these hypotheses. As the hypotheses are formulated in the same way as in the theory of linear models, an additive model is assumed to describe the set-up of the diagnostic trial. Note that this approach can easily be expanded to
a multiplicative or a logistic model by applying a log or a logit-transformation to the vectors of sensitivity, specificity or AUC and then using Cramér's δ-theorem to derive the asymptotic normal distribution of the transformed effect measures. Thus, the hypotheses and test statistics presented in the following are suitable for both additive and multiplicative or logistic models.

5.1. Hypotheses

To compare the diagnostic ability of several modalities, we will use the same ideas as described by Kaufmann et al. [5]. Here the nonparametric hypotheses are formulated in the same way as in the theory of linear models by multiplying the vector of the accuracies $\mathbf{AUC}$ from the left with a suitable contrast matrix $\mathbf{C}$. The vector of accuracies can either be the vector of the common AUCs or the vector of the sensitivities or the specificities. Using this approach of analysis, hypotheses of no reader-effect, of no modality-effect and of no reader–modality-interaction can be tested. Hence, the question of consistency among readers can be analyzed in detail: not only can we answer the question whether all readers have the same ability to evaluate the diagnostic modality, but we can also test whether the difference between these modalities is consistent among the readers. Here, the second point may be the more important one to consider when evaluating a diagnostic trial. Note that all these questions are relevant for the AUC as well as for sensitivity and specificity.

To test the hypotheses $H_0^R$ of no reader effect, $H_0^M$ of no modality effect, and $H_0^{MR}$ of no modality × reader-interaction, we return to the original structure of the index $l$ to distinguish between the $m = 1, \ldots, M$ modalities and the $r = 1, \ldots, R$ readers. Thus, we rewrite $\mathbf{AUC} = (\mathrm{AUC}^{(1)}, \ldots, \mathrm{AUC}^{(d)})'$ as $\mathbf{AUC} = (\mathrm{AUC}^{(1,1)}, \ldots, \mathrm{AUC}^{(1,R)}, \ldots, \mathrm{AUC}^{(M,1)}, \ldots, \mathrm{AUC}^{(M,R)})'$. Let

$$\mathbf{C}_M = \mathbf{P}_M \otimes \frac{1}{R} \mathbf{1}_R', \quad \mathbf{C}_R = \frac{1}{M} \mathbf{1}_M' \otimes \mathbf{P}_R, \quad \text{and} \quad \mathbf{C}_{MR} = \mathbf{P}_M \otimes \mathbf{P}_R$$

denote the contrast matrices to test the above mentioned hypotheses. Here, $\mathbf{P}_M = \mathbf{I}_M - \frac{1}{M} \mathbf{1}_M \mathbf{1}_M'$ denotes the centering matrix, $\mathbf{I}_M$ the $M$-dimensional unit matrix and $\mathbf{1}_M = (1, \ldots, 1)'$ the $M$-dimensional vector of 1s. For details we refer to [5]. In this two-way-layout, the three hypotheses are written as

$$H_0^M : \mathbf{C}_M \mathbf{AUC} = \mathbf{0} \quad \text{or} \quad \overline{\mathrm{AUC}}_{1\cdot} = \overline{\mathrm{AUC}}_{2\cdot} = \cdots = \overline{\mathrm{AUC}}_{M\cdot}, \quad \text{where } \overline{\mathrm{AUC}}_{m\cdot} = \frac{1}{R} \sum_{r=1}^{R} \mathrm{AUC}^{(m,r)},$$

$$H_0^R : \mathbf{C}_R \mathbf{AUC} = \mathbf{0} \quad \text{or} \quad \overline{\mathrm{AUC}}_{\cdot 1} = \overline{\mathrm{AUC}}_{\cdot 2} = \cdots = \overline{\mathrm{AUC}}_{\cdot R}, \quad \text{where } \overline{\mathrm{AUC}}_{\cdot r} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{AUC}^{(m,r)},$$

and

$$H_0^{MR} : \mathbf{C}_{MR} \mathbf{AUC} = \mathbf{0} \quad \text{or} \quad \mathrm{AUC}^{(m,r)} = \overline{\mathrm{AUC}}_{m\cdot} + \overline{\mathrm{AUC}}_{\cdot r} - \overline{\mathrm{AUC}}_{\cdot\cdot}, \quad m = 1, \ldots, M, \ r = 1, \ldots, R,$$

where $\overline{\mathrm{AUC}}_{\cdot\cdot} = \frac{1}{MR} \sum_{m=1}^{M} \sum_{r=1}^{R} \mathrm{AUC}^{(m,r)}$.

Note that this approach of modeling is similar to the model assumed in multi-center clinical trials, in which the center is assumed to be a fixed factor as well as the treatment. This method of evaluation of diagnostic trials is discussed by Obuchowski et al. [8], who provide an overview of the different approaches of analyzing multiple reader trials by factorial designs.
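The Kronecker-product structure of the three contrast matrices can be made concrete with a few lines of code. The sketch below builds them in plain Python for a hypothetical layout with $M = 2$ modalities and $R = 2$ readers; the helper names and the additive vector of accuracies are assumptions chosen for illustration.

```python
def centering(n):
    """Centering matrix P_n = I_n - (1/n) 1_n 1_n'."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def contrast_matrices(n_mod, n_read):
    """C_M, C_R and C_MR for accuracies ordered as (1,1), ..., (1,R), ..., (M,R)."""
    mean_readers = [[1.0 / n_read] * n_read]   # (1/R) 1_R'
    mean_mods = [[1.0 / n_mod] * n_mod]        # (1/M) 1_M'
    c_mod = kron(centering(n_mod), mean_readers)
    c_read = kron(mean_mods, centering(n_read))
    c_int = kron(centering(n_mod), centering(n_read))
    return c_mod, c_read, c_int


def matvec(c, v):
    return [sum(a * b for a, b in zip(row, v)) for row in c]


# Hypothetical accuracies with additive modality and reader effects,
# i.e. without interaction: order (1,1), (1,2), (2,1), (2,2).
acc = [0.80, 0.84, 0.70, 0.74]
c_mod, c_read, c_int = contrast_matrices(2, 2)

print(matvec(c_mod, acc))    # deviations of the modality means from the overall mean
print(matvec(c_read, acc))   # deviations of the reader means from the overall mean
print(matvec(c_int, acc))    # interaction contrasts, zero up to rounding here
```

Because the hypothetical accuracies are purely additive, the interaction contrasts vanish while the modality and reader contrasts do not.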
5.2. Test statistics To test the hypotheses formulated above, we restate the statistics by Kaufmann et al. [5]. An ANOVA-type statistic (ATS) was suggested by [2] and by [7] to test these hypotheses. Their simulation studies showed that even for small sample sizes the approximation achieved by the ATS controls the type-I-error better than the previously used rank version of the Wald-type statistic proposed by Puri and Sen [10] in 1971.
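As a rough illustration, an ANOVA-type statistic of the form (8) in Approximation procedure 1 and its approximate degrees of freedom might be computed as follows. The sketch makes a simplifying assumption: the hypothesis is described by a single contrast row $\mathbf{c}$ (as for each effect in a 2 × 2 layout), so that the projection matrix reduces to $\mathbf{T} = \mathbf{c}'\mathbf{c} / (\mathbf{c}\mathbf{c}')$ and no generalized inverse is needed. All names and numbers are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def projection_rank_one(c):
    """T = c'c / (c c') for a single contrast row c (assumption: rank-one contrast)."""
    norm2 = sum(x * x for x in c)
    return [[ci * cj / norm2 for cj in c] for ci in c]


def ats(acc_hat, v_hat, t, n_total):
    """Statistic F_N(T) = N / tr(T V) * acc' T acc and the approximate
    degrees of freedom f = tr(T V)^2 / tr((T V)^2)."""
    d = len(t)
    tv = matmul(t, v_hat)
    quad = sum(acc_hat[i] * t[i][j] * acc_hat[j] for i in range(d) for j in range(d))
    stat = n_total * quad / trace(tv)
    df = trace(tv) ** 2 / trace(matmul(tv, tv))
    return stat, df


# Hypothetical estimates for a 2 x 2 layout, order (1,1), (1,2), (2,1), (2,2).
acc_hat = [0.80, 0.84, 0.70, 0.74]
v_hat = [[0.5 if i == j else 0.1 for j in range(4)] for i in range(4)]
n_total = 50

t_mod = projection_rank_one([1, 1, -1, -1])   # modality contrast
stat, df = ats(acc_hat, v_hat, t_mod, n_total)
print(stat, df)
```

The p-value would then be obtained by comparing `stat` with a central $\chi^2_f / f$ distribution, for which a statistics library is needed. For a rank-one projection the degrees of freedom equal 1.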
Table 1
Data of the ultra-sound diagnostic trial.

Patient   Reader 1             Reader 2             Gold standard
          Base    Levovist     Base    Levovist
1         5       1            4       3            0
2         3       2            3       5            1
...       ...     ...          ...     ...          ...
48        3       5            3       5            1
49        2       2            4       3            1
Approximation procedure 1. Let $\sqrt{N}(\widehat{\mathbf{AUC}} - \mathbf{AUC})$ be asymptotically $N(\mathbf{0}, \mathbf{V}_N)$ distributed and let $\widehat{\mathbf{V}}_N$ be the consistent estimator of $\mathbf{V}_N$ as defined above. Furthermore, let $\mathbf{T} = \mathbf{C}'(\mathbf{C}\mathbf{C}')^{-}\mathbf{C}$ denote the projection matrix generated by the contrast-matrix $\mathbf{C}$. Then under the hypothesis $H_0 : \mathbf{C} \cdot \mathbf{AUC} = \mathbf{0}$ the asymptotic distribution of the ATS

$$F_N(\mathbf{T}) = \frac{N}{\mathrm{tr}(\mathbf{T}\widehat{\mathbf{V}}_N)} \cdot \widehat{\mathbf{AUC}}'\, \mathbf{T}\, \widehat{\mathbf{AUC}} \tag{8}$$

can be approximated by a central $\chi^2_{\hat f} / \hat f$ distribution with

$$\hat f = \frac{[\mathrm{tr}(\mathbf{T}\widehat{\mathbf{V}}_N)]^2}{\mathrm{tr}([\mathbf{T}\widehat{\mathbf{V}}_N]^2)}$$

degrees of freedom. For the derivation, we refer to [2,7].

This approximation procedure can be applied not only to the AUC but also to the sensitivity and the specificity because both of them can be regarded as AUCs (estimated with the help of a pseudo-sample). Confidence intervals can also be calculated by treating sensitivity and specificity as special AUCs. For more details, we refer to [5], who present confidence intervals for the AUC in multi-reader and multi-modality settings.

6. Application: sonography ultra-sound imaging study

As an application of the developed methods, we describe a diagnostic imaging study assessing leg or pelvic thrombosis by means of color-coded Doppler sonography. The aim of the study was to compare the accuracy of a contrast medium (Levovist) with the accuracy of a non-enhanced sonography (base). According to the usual practice in imaging studies, each image was evaluated by two independent, blinded readers. Each patient was diagnosed by both diagnostic methods and we therefore obtain a typical paired-case paired-reader design (Table 1). In order to assess a patient's true health status and to separate the truly diseased from the truly non-diseased, a phlebography was performed to obtain a 'gold standard'. Fig. 2 shows the resulting ROC-curves for both readers and both modalities. Note that it is important to consider reader-specific results because consistency among the readers might be regarded as a quality attribute of the diagnostic agent.

The accuracy was assessed by means of the AUC, the sensitivity and the specificity. In order to compute the latter two, a cut-off had to be chosen. We hence classified patients with a score of 4 or 5 as diseased and patients with scores of 1, 2, and 3 as non-diseased.
For the analysis of sensitivity (specificity) we, therefore, replaced the sample of the non-diseased (the diseased) by a one-point distributed pseudo-sample with point mass at 3.5 and analyzed the resulting AUC. Note that for this categorization every cut-off in the open interval (3, 4) would have been appropriate as point mass for the one-point distribution. The resulting dichotomization of the endpoint of the diagnostic test is of considerable importance in practical terms because after all each therapeutic decision is binary (treatment or no treatment). Hence, the sonography imaging study was not only analyzed by means
Fig. 2. ROC-curves for Reader 1 and Reader 2 in the ultra-sound imaging study.
Fig. 3. Standard (solid) and logit (dashed) confidence intervals of the AUC, the sensitivity and the specificity of the enhanced and the non-enhanced sonography in the ultra-sound imaging study.

Table 2
Results of the nonparametric analysis of the ultra-sound imaging study.

Factor        Assessment of diagnostic accuracy
              AUC                    Sensitivity            Specificity
              Statistic   p-value    Statistic   p-value    Statistic   p-value
Modality      12.98       <0.001     2.45        0.117      6.69        0.010
Reader        3.29        0.070      0.11        0.743      1.34        0.248
Interaction   1.21        0.272      1.00        0.317      4.10        0.043
of the AUC but also on the basis of sensitivity and specificity. Confidence intervals for all measures of accuracy (Fig. 3) were determined by applying the pivot-method on studentized test-statistics (see e.g. [5]). The p-values of the hypotheses are displayed in Table 2. Analyzed by means of the AUC, the two modalities show a significant difference (p < 0.001) while there is neither evidence for a difference (p = 0.07) nor for a heterogeneity (p = 0.272) between the two readers. If a cut-off is chosen and sensitivity and specificity are investigated at this specific cut-off, the situation changes. While there is no evidence for any differences in the sensitivity, the specificity shows a significant interaction between reader and modality. I.e. in the sample of the non-diseased
the increase in discrimination achieved by Levovist is not homogeneous between the two readers. Hence the result of the evaluation of sensitivity and specificity has to be discussed carefully because the only enhancement achieved by the contrast medium depends on the reader evaluating the image. The analysis of a specific point of the ROC-curve provides a lot of new information, and as every diagnostic test has to be binary in the end, this extra information cannot easily be neglected.

7. Discussion

In this paper, we have developed a unified approach for the analysis of sensitivity, specificity and the area under the ROC-curve in multiple reader trials. To this end, we have shown that sensitivity and specificity are areas under particular ROC-curves. Furthermore, we have shown that these binary assessments of diagnostic ability can be treated as pseudo-two-sample-quantities, i.e. either the sample of diseased subjects or the sample of non-diseased subjects was replaced by a one-point distributed pseudo-sample. Hence, the analysis of sensitivity and specificity was based on two samples just like the analysis of the AUC. We have estimated and analyzed the effect measures of a diagnostic trial nonparametrically, but note that the same approach can be used when it is assumed that $\mathbf{X}_{ik}$, $i = 0, 1$, $k = 1, \ldots, n_i$, are normally distributed.

As sensitivity and specificity turn out to be usual AUCs, no new statistical software is required for the analysis of these quantities. Common statistical software can be used if one sample is replaced by a pseudo-sample before the analysis is performed. For the analysis with SAS, a macro (diag.sas) that performs the test mentioned above is available at the website: http://www.ams.med.uni-goettingen.de/amsneu/diagn-en.html.

By formulating the hypotheses in the same way as in the theory of linear models, we have assumed an additive model.
The application of a suitable transformation function (log or logit) will also allow for the analysis of multiplicative or logistic models.

Applying the nonparametric asymptotic Wilcoxon-test to dichotomous data turns out to be the well-known $\chi^2$-test (see e.g. [3]), and thus the theory developed in this paper may be regarded as a factorial $\chi^2$-test on repeated measures. Note that the procedure presented here can be used not only for sensitivity and specificity but also for every set-up with correlated binomial random-variables, where a factorial model with two crossed, fixed factors is assumed. In this case, e.g. 0.5 has to be chosen as the cut-off for the one-point distribution. It is possible to have other factorial structures in the set-up of the trial, but these designs will be the topic of future research.

References

[1] D. Bamber, The area above the ordinal dominance graph and the area below the receiver operating characteristic graph, Journal of Mathematical Psychology 12 (1975) 387–415.
[2] E. Brunner, U. Munzel, M.L. Puri, The multivariate nonparametric Behrens–Fisher problem, Journal of Statistical Planning and Inference 108 (2002) 37–53.
[3] E. Brunner, M.L. Puri, Nonparametric methods in factorial designs, Statistical Papers 42 (2001) 1–52.
[4] E.R. DeLong, D.M. DeLong, D.L. Clarke-Pearson, Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach, Biometrics 44 (1988) 837–845.
[5] J. Kaufmann, C. Werner, E. Brunner, Nonparametric methods for analysing the accuracy of diagnostic tests with multiple readers, Statistical Methods in Medical Research 14 (2005) 129–146.
[6] H.B. Mann, D.R. Whitney, On a test of whether one of two random variables is stochastically larger than the other, Annals of Mathematical Statistics 18 (1947) 50–60.
[7] U. Munzel, E. Brunner, Nonparametric methods in multivariate factorial designs, Journal of Statistical Planning and Inference 88 (2000) 117–132.
[8] N.A. Obuchowski, S.V. Beiden, K.S. Berbaum, S.L. Hillis, H. Ishwaran, H.H. Song, R.F. Wagner, Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods, Academic Radiology 11 (2004) 980–995.
[9] M.S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press Inc., New York, 2003.
[10] M.L. Puri, P.K. Sen, Nonparametric Methods in Multivariate Analysis, Wiley, New York, 1971.
[11] F.H. Ruymgaart, A unified approach to the asymptotic distribution theory of certain midrank statistics, in: J.P. Raoult (Ed.), Lecture Notes in Mathematics, vol. 821, Springer, Berlin, 1980.
[12] H.H. Song, Analysis of correlated ROC areas in diagnostic testing, Biometrics 53 (1997) 370–382.
[13] X.H. Zhou, N.A. Obuchowski, D.K. McClish, Statistical Methods in Diagnostic Medicine, John Wiley & Sons, Inc., New York, 2002.