Statistics & Probability Letters 62 (2003) 61–70

Optimal dichotomization for repeated screening tests

Ming-Dauh Wang (a,*), Seymour Geisser (b)

(a) Lilly Research Laboratories, Eli Lilly and Company, Lilly Corporate Center, Indianapolis, IN 46285, USA
(b) School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA

Received October 2001; received in revised form September 2002

Abstract

Repeated use of a screening test for ascertaining a characteristic can often improve screening performance. We propose a Bayesian predictive decision-theoretic approach to choosing an optimal dichotomizer for the screening test variable in the repeat-test setting.
© 2003 Elsevier Science B.V. All rights reserved.

Keywords: Screening test; Repeated testing; Optimal dichotomizer; Bayesian predictive decision-theoretic approach; Monte Carlo optimization
1. Introduction

In medical diagnosis, a screening test applied repeatedly to patients has been used to improve screening performance (Nissen-Meyer, 1964; Politser, 1982). From a population of interest, suppose each individual being screened for a characteristic C undergoes n measurements of the same screening test, and a count of test positives greater than or equal to k ≤ n is taken as indicative of C-positive status. Nissen-Meyer (1964) and Lau (1991) considered the problem of choosing n and/or k to achieve desired levels of sensitivity and specificity of the repeated testing. In their case, the cut-off value of the screening outcome variable (continuous, discrete, or nominal) was assumed chosen and fixed. Murtaugh (1995) assumed the normal model for the joint distribution of the n measurements for both the diseased and non-diseased subpopulations, and with n = k he addressed the ROC curve of the combined test composed of the n repeats. In this paper, we shall consider screening tests with a continuous outcome variable. Suppose a screening test is repeated n times on each individual, and the screening status is based upon the n
repeated measurements. We are interested in determining an optimal cut-off for the screening test. We propose a Bayesian predictive decision-theoretic approach to the optimal dichotomization problem in Section 2. The approach is illustrated with the bivariate normal model for two screening repeats in Section 3, and with more than two repeats represented by multivariate normal distributions in Section 4. An example for dual repeats is given in Section 5.

2. Decision-theoretic approach to optimal dichotomization

Consider a properly defined population in which a characteristic C is present or absent in the individuals. To screen for C, a test represented by an outcome variable Y is employed, which is to be applied n times to each testee. A testee is classified by a decision rule D_k as screen positive if at least k of the n test repeats yield a value greater than a selected cut-off value (or dichotomizer). The optimal choice of n and/or k in meeting certain criteria for test efficiency parameters has been discussed by several authors (Nissen-Meyer, 1964; Politser, 1982; Wang and Geisser, 1999). In this paper, we assume n is specified, and we shall look into how to choose a dichotomizer for D_k, k = 1, ..., n, to attain optimal usage of the screening test. Our approach follows that taken by England (1988) and Geisser (1998). A loss is assigned to each of the four possible combinations of true C-status and decision rule outcome, as defined in Table 1, where the relative magnitudes of the losses reflect the seriousness of the consequences of the four decision combinations. Furthermore, we assume l_00 = l_11 = 0, l_10 = 1, and l_01 = a > 0. That is, no loss is incurred by correct decisions, and a false negative decision results in a loss a times as large as a false positive decision. Suppose that the outcome variable Y, with a high value presumed to be indicative of the presence of C, is cut at y. The following parameters are defined:

• π = P(C): the probability that a randomly drawn individual from the population exhibits characteristic C (the prevalence),
• η(y) = P(test positive | C): the probability that a decision rule correctly diagnoses the presence of C, which is called the sensitivity,
• θ(y) = P(test negative | not C): the probability that a decision rule correctly diagnoses the absence of C, which is called the specificity.

Accordingly, the expected loss associated with a decision rule is

$$E(\text{loss}) = L(y) = (1 - \pi)\{1 - \theta(y)\} + a\pi\{1 - \eta(y)\}.$$

(A numerical sketch of minimizing L(y) is given after Table 1.)

Table 1
Notation for the losses assigned to the four screening decisions

Decision rule outcome    True state: C    True state: not C
Positive                 l_11             l_10
Negative                 l_01             l_00
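To make this minimization concrete, here is a minimal numerical sketch for a single administration of the test (n = 1), assuming normal outcome models with known parameters; the parameter values are borrowed from the example in Section 5, and the function names are ours, not the paper's.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def expected_loss(y, prev, a, mu1, sigma1, mu2, sigma2):
    """L(y) = (1 - pi){1 - theta(y)} + a*pi*{1 - eta(y)} for a single test.

    Subpopulation 1 carries C (high values indicative of C); subpopulation 2 does not.
    """
    eta = 1.0 - normal_cdf(y, mu1, sigma1)    # sensitivity: P(Y > y | C)
    theta = normal_cdf(y, mu2, sigma2)        # specificity: P(Y <= y | not C)
    return (1.0 - prev) * (1.0 - theta) + a * prev * (1.0 - eta)

# Grid search for the minimizer (illustrative parameter values).
grid = np.linspace(5.0, 20.0, 1501)
losses = [expected_loss(y, prev=0.05, a=100.0, mu1=15.0, sigma1=2.0,
                        mu2=10.0, sigma2=3.0) for y in grid]
print("approximate optimal cut-off:", grid[int(np.argmin(losses))])
```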
Suppose the n repeated measures Y_1, ..., Y_n for each testee are exchangeable. Then, for the decision rule D_1, η(y) and θ(y) can be expressed in terms of two n-dimensional exchangeable distribution functions F_i(y_1, ..., y_n), i = 1, 2, as

$$\eta(y) = 1 - F_1(y, \ldots, y), \qquad \theta(y) = F_2(y, \ldots, y),$$

which is the case considered by Murtaugh (1995). For a general D_k, k = 1, ..., n,

$$\eta(y) = P_{F_1}(\#\{Y_{k^*} > y,\ k^* = 1, \ldots, n\} \ge k), \qquad \theta(y) = P_{F_2}(\#\{Y_{k^*} > y,\ k^* = 1, \ldots, n\} < k),$$

where #{A_{k*}, k* = 1, ..., n} denotes the number of indices k* = 1, ..., n such that A_{k*} is true. We intend to find an optimal dichotomizer y_o that minimizes the expected loss L(y). Note that L(y) could have more than one minimizer, including {−∞, ∞}, which is why we called y_o "an" optimal value in the preceding sentence. For simplicity, however, we will assume the minimizer is unique and call y_o "the" optimal dichotomizer. Assume that F_i, i = 1, 2, belong to a certain distribution family, each with density function f_i and parameter(s) denoted by V_i. For convenience in presentation, call the C-infected individuals subpopulation 1 and the C-free individuals subpopulation 2. To infer about the optimal dichotomizer when the V_i's are unknown, n measurements of the test variable Y are obtained from each of N_i randomly selected individuals from subpopulation i = 1, 2. In doing this, it is assumed that the true status of C can be confirmed, e.g., with a so-called "gold standard" test. Let the data collected from the jth individual of subpopulation i be denoted by the vector x_ij = (x_ij1, ..., x_ijn)^T, and denote X_i = {x_i1, ..., x_iN_i}. Taking the Bayesian approach, assume independent prior densities p_i(V_i), i = 1, 2, are available. With the sample X_i and the prior density p_i(V_i), the predictive distribution of the outcome measurements of the n test repeats for a new testee is

$$\tilde F_i(y_1, \ldots, y_n \mid X_i) \propto \int F_i(y_1, \ldots, y_n \mid V_i) \prod_{j=1}^{N_i} f_i(x_{ij} \mid V_i)\, p_i(V_i)\, dV_i.$$

If the prevalence π is also unknown, we assume that it can be inferred in a Bayesian manner by a value π̃. With the above predictive information, the minimizer ỹ_o of the following predictive expected loss function serves as the optimal dichotomizer y_o:

$$\tilde L(y) = (1 - \tilde\pi)\, P_{\tilde F_2}(\#\{Y_{k^*} > y,\ k^* = 1, \ldots, n\} \ge k) + a\tilde\pi\, P_{\tilde F_1}(\#\{Y_{k^*} > y,\ k^* = 1, \ldots, n\} < k). \tag{1}$$
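In the procedures below, (1) is estimated by replacing the predictive probabilities with indicator averages over simulated testees. The following generic sketch shows such an estimator; the argument names and array shapes are our assumptions.

```python
import numpy as np

def predictive_loss(y, k, x1, x2, prev, a):
    """Monte Carlo estimate of the predictive expected loss (1).

    x1, x2 : (M, n) arrays of draws from the predictive distributions
             for subpopulation 1 (C present) and 2 (C absent).
    A testee is screen positive under D_k when at least k of the n
    repeats exceed the dichotomizer y.
    """
    false_pos_rate = np.mean((x2 > y).sum(axis=1) >= k)  # estimates P_{F~2}(# positives >= k)
    false_neg_rate = np.mean((x1 > y).sum(axis=1) < k)   # estimates P_{F~1}(# positives < k)
    return (1.0 - prev) * false_pos_rate + a * prev * false_neg_rate
```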
In the following, we will focus on the normal model for F_i, i = 1, 2. However, the method can be extended generally.

3. Two repeats (n = 2)

In the 2-repeat case, assume F_i ∼ N_2(μ_i e_2, Σ_i), in which

$$\Sigma_i = \sigma_i^2 \begin{pmatrix} 1 & \rho_i \\ \rho_i & 1 \end{pmatrix}.$$
With unknown parameters, one may use the priors

$$p(\mu_i, \rho_i, \sigma_i^2) \propto \frac{1}{\sigma_i^2 (1 - \rho_i^2)}, \qquad i = 1, 2,$$

which tend to reflect "ignorance" of the information on the parameters (Geisser, 1964). With samples X_i = {x_i1, ..., x_iN_i}, i = 1, 2, where x_ij = (x_ij1, x_ij2)^T, define

$$z_{ij1} = \frac{x_{ij1} + x_{ij2}}{\sqrt 2}, \qquad z_{ij2} = \frac{x_{ij1} - x_{ij2}}{\sqrt 2},$$

$$\bar z_{ik} = \frac{1}{N_i}\sum_{j=1}^{N_i} z_{ijk}, \qquad s_{ik}^2 = \frac{1}{N_i - 1}\sum_{j=1}^{N_i} (z_{ijk} - \bar z_{ik})^2, \qquad k = 1, 2.$$
Then the predictive density for a new observable (Y_1, Y_2)^T in subpopulation i is proportional to

$$\left[\frac{N_i-1}{N_i}\, s_{i1}^2 + \frac{1}{N_i+1}\left(\frac{y_1+y_2}{\sqrt 2} - \bar z_{i1}\right)^{\!2}\right]^{-\frac{N_i+4}{2}} \left[\frac{N_i-1}{N_i}\, s_{i2}^2 + \frac{1}{N_i}\left(\frac{y_1-y_2}{\sqrt 2}\right)^{\!2} + \bar z_{i2}^2\right]^{-\frac{N_i+5}{2}}.$$

This is the kernel of the product of the two independent Student distributions

$$Z_{i1} = \frac{Y_{i1}+Y_{i2}}{\sqrt 2} \sim \mathrm{St}_1\!\left(\bar z_{i1},\ \frac{N_i(N_i+3)}{(N_i-1)(N_i+1)\, s_{i1}^2},\ N_i+3\right)$$

and

$$Z_{i2} = \frac{Y_{i1}-Y_{i2}}{\sqrt 2} \sim \mathrm{St}_1\!\left(0,\ \frac{N_i+4}{N_i\left(\frac{N_i-1}{N_i}\, s_{i2}^2 + \bar z_{i2}^2\right)},\ N_i+4\right),$$

where St_p(μ, λ, ν) denotes the p-dimensional Student distribution with the density

$$f(z) = \frac{\Gamma\!\left(\frac{\nu+p}{2}\right)}{(\nu\pi)^{p/2}\, \Gamma\!\left(\frac{\nu}{2}\right)}\, |\lambda|^{1/2} \left[1 + \frac{1}{\nu}(z-\mu)^{\mathsf T} \lambda\, (z-\mu)\right]^{-\frac{\nu+p}{2}}, \qquad z \in \mathbb{R}^p,\ p \ge 1.$$
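For reference, this density is straightforward to code; the following sketch takes λ as a precision matrix, matching the parameterization above (the function name is ours).

```python
import numpy as np
from math import gamma, pi

def student_density(z, mu, lam, nu):
    """Density of the p-dimensional St_p(mu, lam, nu) distribution,
    with lam a precision matrix as in the definition above."""
    z, mu = np.atleast_1d(z), np.atleast_1d(mu)
    lam = np.atleast_2d(lam)
    p = z.shape[0]
    quad = (z - mu) @ lam @ (z - mu)
    const = gamma((nu + p) / 2) / ((nu * pi) ** (p / 2) * gamma(nu / 2))
    return const * np.sqrt(np.linalg.det(lam)) * (1 + quad / nu) ** (-(nu + p) / 2)
```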
We determine the optimal dichotomizer by minimizing the predictive expected loss represented in (1). Since an explicit formula for the minimizer is unavailable for either D_1 or D_2, we propose approximating it numerically by the following procedure (P1):

1. Draw a sample of M observations {z̃_ikl}, l = 1, ..., M, from Z_ik, i = 1, 2; k = 1, 2.
2. Obtain x̃_ijl by the transformations x̃_i1l = (1/√2)(z̃_i1l + z̃_i2l) and x̃_i2l = (1/√2)(z̃_i1l − z̃_i2l).
3. Minimize

$$(1-\tilde\pi)\,\frac{1}{M}\sum_{l=1}^{M} 1(\tilde x_{21l} > y \text{ or } \tilde x_{22l} > y) + a\tilde\pi\,\frac{1}{M}\sum_{l=1}^{M} 1(\tilde x_{11l} < y \text{ and } \tilde x_{12l} < y) \quad \text{for } D_1,$$

$$(1-\tilde\pi)\,\frac{1}{M}\sum_{l=1}^{M} 1(\tilde x_{21l} > y \text{ and } \tilde x_{22l} > y) + a\tilde\pi\,\frac{1}{M}\sum_{l=1}^{M} 1(\tilde x_{11l} < y \text{ or } \tilde x_{12l} < y) \quad \text{for } D_2,$$

where 1(A) denotes the indicator function of the event A, and use the minimizer as an approximant of ỹ_o. It is shown in Appendix A that as M gets large, the approximant converges to ỹ_o. (A code sketch of step 1 is given below.)
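Step 1 of (P1) requires only univariate Student draws, which can be obtained by rescaling standard t variates according to the St_1 parameterization above. A sketch (function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_predictive_pairs(zbar, s2, N, M):
    """Draw M predictive pairs (Y1, Y2) for one subpopulation, per step 1 of (P1).

    zbar = (zbar_1, zbar_2) and s2 = (s_1^2, s_2^2) are the transformed-data
    summaries defined above; N is the training-sample size.
    """
    # Z_1 ~ St_1(zbar_1, N(N+3)/{(N-1)(N+1)s_1^2}, N+3); scale = precision^{-1/2}
    scale1 = np.sqrt((N - 1) * (N + 1) * s2[0] / (N * (N + 3)))
    z1 = zbar[0] + scale1 * rng.standard_t(N + 3, size=M)
    # Z_2 ~ St_1(0, (N+4)/{N[(N-1)/N s_2^2 + zbar_2^2]}, N+4)
    scale2 = np.sqrt(N * ((N - 1) / N * s2[1] + zbar[1] ** 2) / (N + 4))
    z2 = scale2 * rng.standard_t(N + 4, size=M)
    # Step 2: back-transform to the original coordinates.
    return (z1 + z2) / np.sqrt(2.0), (z1 - z2) / np.sqrt(2.0)
```

Step 3 then amounts to evaluating the two indicator averages at candidate cut-offs y (for example, at the pooled sample points), as in the `predictive_loss` sketch of Section 2.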
The above is a case in which the prior density p_i renders a closed form for the predictive distribution F̃_i. However, if F̃_i is not identifiable in closed form under other priors (elicited or conjugate), the following Monte Carlo procedure (P2) is proposed to approximate ỹ_o:

1. Generate a sample {V_il}, l = 1, ..., M, from F_i(V_i | X_i), the posterior distribution of V_i on R^d, i = 1, 2. Since F_i(V_i | X_i) also is generally unidentifiable in closed form, the Markov chain Monte Carlo methodology could be applied to generate {V_il}, which is then a Markov chain.
2. For each V_il, generate an x*_il = (x*_i1l, x*_i2l)^T from F_i(· | V_il).
3. Minimize

$$(1-\tilde\pi)\,\frac{1}{M}\sum_{l=1}^{M} 1(x^*_{21l} > y \text{ or } x^*_{22l} > y) + a\tilde\pi\,\frac{1}{M}\sum_{l=1}^{M} 1(x^*_{11l} < y \text{ and } x^*_{12l} < y) \quad \text{for } D_1,$$

$$(1-\tilde\pi)\,\frac{1}{M}\sum_{l=1}^{M} 1(x^*_{21l} > y \text{ and } x^*_{22l} > y) + a\tilde\pi\,\frac{1}{M}\sum_{l=1}^{M} 1(x^*_{11l} < y \text{ or } x^*_{12l} < y) \quad \text{for } D_2,$$

(A sketch of steps 1 and 2 is given below.)
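Here is a sketch of steps 1 and 2 of (P2) for n = 2, in which `posterior_sampler` is a stand-in for whatever posterior simulation (e.g., a Markov chain Monte Carlo run) is available; all names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def procedure_p2_draws(posterior_sampler, data, M):
    """Pair each posterior draw of V_i = (mu, rho, sigma^2) with one simulated testee.

    posterior_sampler(data, M) is assumed to return an iterable of M triples
    (mu, rho, sigma2), e.g., the states of an MCMC run.
    """
    x_star = np.empty((M, 2))
    for l, (mu, rho, sig2) in enumerate(posterior_sampler(data, M)):
        cov = sig2 * np.array([[1.0, rho], [rho, 1.0]])
        x_star[l] = rng.multivariate_normal([mu, mu], cov)  # step 2: x* ~ F_i(.|V_il)
    return x_star  # feed both subpopulations' draws into the step-3 minimization
```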
and use the minimizer to approximate ỹ_o. It is justified in the appendix that the minimizer approaches ỹ_o as M goes to infinity.

4. Multiple repeats (n > 2)

Consider the case in which the n repeats have the joint distribution

$$F_i \sim N_n\!\left(\mu_i e_n,\ \rho_i \sigma_i^2 e_n e_n^{\mathsf T} + (1-\rho_i)\sigma_i^2 I_n\right), \qquad -(n-1)^{-1} < \rho_i < 1,\ i = 1, 2,$$

where e_n is the one-vector of length n and I_n is the identity matrix of dimension n. When the parameters (μ_i, ρ_i, σ_i²) are unknown, the "ignorance" priors are

$$p(\mu_i, \rho_i, \sigma_i) \propto \frac{1}{\sigma_i (1-\rho_i)\{1+(n-1)\rho_i\}}, \qquad i = 1, 2$$

(Geisser, 1964). It was also noted there that there exists an orthogonal matrix P = (p_1, ..., p_n)^T with p_1 = (1/√n) e_n such that

$$P\left(\rho_i \sigma_i^2 e_n e_n^{\mathsf T} + (1-\rho_i)\sigma_i^2 I_n\right)P^{\mathsf T} = \begin{pmatrix} \delta_i & 0_{n-1}^{\mathsf T} \\ 0_{n-1} & \gamma_i I_{n-1} \end{pmatrix},$$

where δ_i = σ_i²{1 + (n−1)ρ_i} and γ_i = σ_i²(1 − ρ_i) are the two eigenvalues of ρ_i σ_i² e_n e_n^T + (1−ρ_i)σ_i² I_n and 0_{n−1} is the zero-vector of length n − 1. For example, we can choose

$$p_k^{\mathsf T} = \frac{1}{\sqrt{k(k-1)}}\,(\underbrace{1, \ldots, 1}_{k-1},\ -(k-1),\ \underbrace{0, \ldots, 0}_{n-k}), \qquad k = 2, \ldots, n.$$

(A numerical construction of such a P is sketched below.)
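Such a P is easy to construct and verify numerically; a sketch (the function name and the test values are ours):

```python
import numpy as np

def make_p(n):
    """Orthogonal P with first row e_n/sqrt(n) and remaining rows p_k as above."""
    P = np.zeros((n, n))
    P[0] = 1.0 / np.sqrt(n)
    for k in range(2, n + 1):
        P[k - 1, :k - 1] = 1.0 / np.sqrt(k * (k - 1))      # k - 1 leading ones
        P[k - 1, k - 1] = -(k - 1) / np.sqrt(k * (k - 1))  # then -(k - 1), rest zeros
    return P

# Check that P diagonalizes the equicorrelation covariance (illustrative values).
n, rho, sig2 = 4, 0.3, 2.0
Sigma = rho * sig2 * np.ones((n, n)) + (1 - rho) * sig2 * np.eye(n)
print(np.round(make_p(n) @ Sigma @ make_p(n).T, 10))  # diag(delta, gamma, ..., gamma)
```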
Obtain samples X_i = {x_i1, ..., x_iN_i}, i = 1, 2, where x_ij = (x_ij1, ..., x_ijn)^T, j = 1, ..., N_i, and transform the data by z_ij = P x_ij. Let z̄_ik = (1/N_i) Σ_{j=1}^{N_i} z_ijk and s_ik² = (1/(N_i − 1)) Σ_{j=1}^{N_i} (z_ijk − z̄_ik)², k = 1, ..., n. Then the predictive density for a future observable Y = (Y_1, ..., Y_n)^T in subpopulation i is proportional to

$$\left[\frac{N_i-1}{N_i}\, s_{i1}^2 + \frac{1}{N_i+1}\left(\frac{1}{\sqrt n}\sum_{k=1}^{n} y_k - \bar z_{i1}\right)^{\!2}\right]^{-\frac{N_i+4}{2}} \times \left[\sum_{k=2}^{n}\frac{N_i-1}{N_i}\, s_{ik}^2 + \frac{1}{N_i}\sum_{k=2}^{n}(p_k^{\mathsf T} y)^2 + \sum_{k=2}^{n}\bar z_{ik}^2\right]^{-\frac{(n-1)(N_i+4)+1}{2}}.$$
This is the kernel of the product of the two independent Student distributions
$$Z_{i1} = \frac{1}{\sqrt n}\sum_{k=1}^{n} Y_{ik} \sim \mathrm{St}_1\!\left(\bar z_{i1},\ \frac{N_i(N_i+3)}{(N_i-1)(N_i+1)\, s_{i1}^2},\ N_i+3\right)$$

and

$$Z_{i2} = (p_2, \ldots, p_n)^{\mathsf T}\, Y \sim \mathrm{St}_{n-1}\!\left(0_{n-1},\ \frac{(n-1)(N_i+3)+1}{N_i\left[\sum_{k=2}^{n}\left(\frac{N_i-1}{N_i}\, s_{ik}^2 + \bar z_{ik}^2\right)\right]}\, I_{n-1},\ (n-1)(N_i+3)+1\right).$$
Due to the infeasibility of a direct minimization of L̃, the following procedure (P3) is proposed to approximate ỹ_o:

1. Draw a sample of M observations {z̃_il = (z̃_i1l, ..., z̃_inl)^T}, l = 1, ..., M, from Z_i1 and Z_i2, i = 1, 2. This can be done by noticing that if U ∼ N_p(0_p, Σ) and V ∼ χ²_n, then T = √(n/V) U + μ ∼ St_p(μ, Σ⁻¹, n).
2. Obtain x̃_il = (x̃_i1l, ..., x̃_inl)^T by the transformation x̃_il = P^T z̃_il.
3. Minimize

$$\tilde L_M(y) = (1-\tilde\pi)\,\frac{1}{M}\sum_{l=1}^{M} 1(\#\{\tilde x_{2k^*l} > y,\ k^* = 1, \ldots, n\} \ge k) + a\tilde\pi\,\frac{1}{M}\sum_{l=1}^{M} 1(\#\{\tilde x_{1k^*l} > y,\ k^* = 1, \ldots, n\} < k)$$

and use the minimizer as an approximant of ỹ_o. For other priors which do not imply a closed form for the predictive distributions, the procedure (P2) can be extended to n > 2. (A sketch of the Student sampling in step 1 is given below.)
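The Student sampling in step 1 follows directly from the stated normal/chi-square representation; a sketch (the function name is ours, and Σ = λ⁻¹ is passed as a scale matrix):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_st(mu, sigma_mat, nu, M):
    """Draw M observations from St_p(mu, lam, nu) with lam = inv(sigma_mat),
    using T = sqrt(nu/V) U + mu, where U ~ N_p(0, sigma_mat) and V ~ chi^2_nu."""
    mu = np.atleast_1d(mu)
    U = rng.multivariate_normal(np.zeros(mu.shape[0]),
                                np.atleast_2d(sigma_mat), size=M)
    V = rng.chisquare(nu, size=M)
    return mu + U * np.sqrt(nu / V)[:, None]
```

Steps 2 and 3 then transform each draw by P^T and minimize L̃_M(y) over candidate cut-offs, exactly as in (P1).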
5. An example for n = 2

Assuming the ignorance priors, we consider the following simulated example for the 2-repeat screening of a test with μ_1 = 15, μ_2 = 10, σ_1 = 2, σ_2 = 3, ρ_1 = 0.4, and ρ_2 = 0.2. Two samples of size N = 200, regarded as observed data, were generated from the two symmetric bivariate normal distributions.
Table 2
Dichotomization for a test repeated twice

(σ_1, σ_2)                    (2, 3)            (3, 2)
Decision rule                 D_1     D_2       D_1     D_2
Optimal dichotomizer          12.4    10.4      12.0    9.30
Predictive expected loss      0.403   0.307     0.539   0.610
We then generated M = 1000 observations from each of the two predictive distributions. With π̃ = 0.05 and a = 100, the optimal dichotomizer approximated by (P1) and the corresponding predictive expected loss for decision rules D_1 and D_2 are given in Table 2. The resulting numbers when the values of σ_1 and σ_2 are reversed are also shown in the table. As the table shows, in the first case D_1 has a higher expected loss than D_2, so D_2 would be the preferable decision rule, whereas in the second case D_1 is superior to D_2 in terms of the magnitude of the expected loss. Other parameters may well influence the choice among the decision rules when the comparison is based upon the magnitude of the expected loss. (A sketch of this simulation setup is given below.)
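To replicate the structure of this example, the following sketch generates the training data with the stated parameters and forms the transformed-data summaries of Section 3; predictive sampling and minimization then proceed as in (P1). All function names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(mu, sigma, rho, N):
    """N draws from the symmetric bivariate normal with the given parameters."""
    cov = sigma ** 2 * np.array([[1.0, rho], [rho, 1.0]])
    return rng.multivariate_normal([mu, mu], cov, size=N)

def summaries(X):
    """Transformed-data summaries (zbar_k, s_k^2), k = 1, 2, for an (N, 2) sample."""
    Z = np.column_stack([(X[:, 0] + X[:, 1]) / np.sqrt(2.0),
                         (X[:, 0] - X[:, 1]) / np.sqrt(2.0)])
    return Z.mean(axis=0), Z.var(axis=0, ddof=1)

X1 = simulate(mu=15.0, sigma=2.0, rho=0.4, N=200)  # subpopulation with C
X2 = simulate(mu=10.0, sigma=3.0, rho=0.2, N=200)  # subpopulation without C
(zbar1, s2_1), (zbar2, s2_2) = summaries(X1), summaries(X2)
# M = 1000 predictive pairs per subpopulation would now be drawn (Section 3),
# and the (P1) criteria for D1 and D2 minimized over candidate cut-offs.
```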
6. Concluding remarks

In choosing cut-off values for a practical application of a screening test on individuals, the probability of making wrong decisions needs to be controlled at the lowest possible level. It is therefore natural to approach the problem from an optimization perspective. The loss function is introduced to summarize all causes of loss in one utility representation so that an optimal dichotomizer can be mathematically determined. The predictive approach reflects our belief in the importance of a better usage of screening tests on future subjects. We have assumed exchangeability among the screening repeats. However, if the repeats are applied along a time span, the time effect may also need to be taken into account.
Appendix A. Epiconvergence theory based Monte Carlo optimization

We are interested in obtaining the dichotomizer ỹ_o which minimizes the predictive expected loss function L̃(y) defined for the decision rules. In the case that a direct minimization of L̃(y) is not feasible, ỹ_o needs to be appropriately approximated. The approach of Monte Carlo optimization supported by the epiconvergence theory was first applied by Geyer and Thompson (1992) and Geyer (1992) to maximum likelihood calculations. Chen et al. (1999) then used it for a problem of sequential control. In the following, we explain how it can be adopted to facilitate the search for ỹ_o.
The epilimits inferior and superior of a sequence of functions g_n on a metric space W are functions defined as

$$(e\text{-}\liminf_n g_n)(w) = \sup_{B \in N(w)}\ \liminf_{n \to \infty}\ \inf_{u \in B}\, g_n(u),$$

$$(e\text{-}\limsup_n g_n)(w) = \sup_{B \in N(w)}\ \limsup_{n \to \infty}\ \inf_{u \in B}\, g_n(u),$$
where N(w) denotes the class of neighborhoods of a point w ∈ W. If the two limits are equal to a common function g = e-lim_n(g_n), then it is said that g_n epiconverges to g, denoted by g_n →^e g. Suppose that a direct minimization of a function g is not feasible, on account of either not knowing the exact form of g or the difficulty of working directly with g. One could instead consider an indirect minimization via a sequence of functions {g_n} which is epiconvergent to g. The following two theorems demonstrate the idea for the case that involves a random factor (Geyer, 1992; Chen et al., 1999):

Theorem 1. Suppose W is a separable metric space, (S, F, P) is a probability measure space, and L(w, s) is a nonnegative real-valued function on W × S. Let

$$r(w) = \int_S L(w, s)\, P(ds) \qquad \text{and} \qquad r_m(w) = \frac{1}{m}\sum_{l=1}^{m} L(w, s_l),$$

where s_1, s_2, ... are realizations of an ergodic Markov chain having stationary distribution P. If L(·, s) is

1. upper semicontinuous at each w for P-almost all s, where the exceptional set does not depend on w, and
2. lower semicontinuous at each w except for s in a P-null set which may depend on w,

then r_m →^e r for almost all sample paths of the Markov chain.
Theorem 2. Suppose r_m →^e r with probability 1 and a sequence {w_m} satisfies r_m(w_m) ≤ inf_{w∈W} r_m(w) + ε_m, where ε_m → 0. Furthermore, if {w_m} is contained in a compact set almost surely (or in probability), then

1. Every subsequence of {w_m} has a convergent subsubsequence {w_{m_k}} whose limit minimizes r. Moreover, r_{m_k}(w_{m_k}) → inf_{w∈W} r(w).
2. If r has a unique minimizer ŵ, then w_m → ŵ and r_m(w_m) → r(ŵ) almost surely (or in probability, respectively).
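As a toy illustration of the two theorems (our example, not the paper's): for r(w) = E|w − S| with S ∼ N(0, 1), the unique minimizer is the median ŵ = 0, and the minimizer of the Monte Carlo version r_m over a fixed grid approaches it as m grows.

```python
import numpy as np

rng = np.random.default_rng(4)

# r(w) = E|w - S|, S ~ N(0, 1), is minimized at the median w_hat = 0.
# The grid minimizer of r_m(w) = (1/m) sum_l |w - s_l| stabilizes near 0.
grid = np.linspace(-2.0, 2.0, 801)
for m in (10, 100, 1000, 5000):
    s = rng.normal(size=m)
    r_m = np.abs(grid[:, None] - s[None, :]).mean(axis=1)
    print(m, round(float(grid[int(np.argmin(r_m))]), 3))
```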
There are two cases in our application of the theory to the optimal dichotomizer searching problem. In (P1) and its extension (P3), a sample {x̃_il, l = 1, 2, ...} is generated from the predictive distribution F̃_i, the probability measure corresponding to the predictive distribution function F̃_i on the Borel measurable space (R^n, B^n). Let (R^n × R^n, B^n × B^n, F̃_1 × F̃_2) be the product of the two measure spaces (R^n, B^n, F̃_1) and (R^n, B^n, F̃_2). Below we display on the right-hand side the notation for optimal dichotomization with its corresponding notation for the epiconvergence theory on the left-hand side:

(S, F, P) ↔ (R^n × R^n, B^n × B^n, F̃_1 × F̃_2),
w ∈ W ↔ y ∈ R,
s ∈ S ↔ x̃ = (x̃_1, x̃_2) ∈ R^n × R^n,
s_l ↔ x̃_l = (x̃_1l, x̃_2l),
L(w, s) ↔ L̃(y, x̃) = (1 − π̃) 1(#{x̃_2k* > y, k* = 1, ..., n} ≥ k) + aπ̃ 1(#{x̃_1k* > y, k* = 1, ..., n} < k),
r(w) ↔ L̃(y) = (1 − π̃) P_{F̃_2}(#{Y_k* > y, k* = 1, ..., n} ≥ k) + aπ̃ P_{F̃_1}(#{Y_k* > y, k* = 1, ..., n} < k),
r_m(w) ↔ L̃_M(y) = (1 − π̃) (1/M) Σ_{l=1}^M 1(#{x̃_2k*l > y, k* = 1, ..., n} ≥ k) + aπ̃ (1/M) Σ_{l=1}^M 1(#{x̃_1k*l > y, k* = 1, ..., n} < k).
It is obvious that L̃(y, x̃) has the semicontinuity properties required in Theorem 1. Thus, with an independent-chain sample {x̃_1, x̃_2, ...}, the theorem implies that L̃_M(y) →^e L̃(y) with probability 1. Moreover, the minimizer of L̃_M(y) is attained at one of the sample points {x̃_ik*l, i = 1, 2, k* = 1, ..., n, l = 1, ..., M} without approximation, so ε_M = 0 for each M and ε_M → 0. Therefore, Theorem 2 assures that ỹ_o will be closely approached with sufficiently large M. If sampling from F̃_i is not feasible, (P2) may be considered. It is clear that {ξ_il = (V_il, x*_il), l = 1, 2, ...} in (P2) is an ergodic Markov chain whose stationary distribution is the joint distribution of V_i (drawn from its posterior) and x*_i (drawn from F_i(· | V_i)), denoted by P_i. Letting ξ_i = (V_i, x*_i), i = 1, 2, and (S, F, P) = (R^{d+n} × R^{d+n}, B^{d+n} × B^{d+n}, P_1 × P_2), the convergence of the minimizer of L̃_M(y) to ỹ_o can be similarly established as in the first case.

References

Chen, L.-S., Geisser, S., Geyer, C.J., 1999. Monte Carlo minimization for one step ahead sequential control. In: Geisser, S. (Ed.), Diagnosis and Prediction, IMA Volumes in Mathematics and Its Applications, Vol. 114. Springer, New York, pp. 109–130.
England, W.L., 1988. An exponential model used for optimal threshold selection on ROC curves. Med. Decision Mak. 8, 120–131.
Geisser, S., 1964. Estimation in the uniform covariance case. J. Roy. Statist. Soc. Ser. B 26, 477–483.
Geisser, S., 1998. Comparing two tests used for diagnostic and screening purposes. Statist. Probab. Lett. 40, 113–119.
Geyer, C.J., 1992. On the convergence of Monte Carlo maximum likelihood calculations. J. Roy. Statist. Soc. Ser. B 56, 261–274.
Geyer, C.J., Thompson, E.A., 1992. Constrained Monte Carlo maximum likelihood for dependent data (with discussion). J. Roy. Statist. Soc. Ser. B 54, 657–699.
Lau, T.-S., 1991. On dependent repeated screening tests. Biometrics 47, 77–86.
Murtaugh, P.A., 1995. ROC curves with multiple marker measurements. Biometrics 51, 1514–1522.
Nissen-Meyer, S., 1964. Evaluation of screening tests in medical diagnosis. Biometrics 20, 730–755.
Politser, P., 1982. Reliability, decision rules, and the value of repeated tests. Med. Decision Mak. 2, 47–69.
Wang, M.-D., Geisser, S., 1999. On the optimal administration of multiple screening tests. In: Geisser, S. (Ed.), Diagnosis and Prediction, IMA Volumes in Mathematics and Its Applications, Vol. 114. Springer, New York, pp. 31–64.
Zellner, A., 1971. An Introduction to Bayesian Inference in Econometrics. Wiley, New York.