Evaluation and Program Planning, Vol. 4, pp. 363-375, 1981
Printed in the U.S.A. All rights reserved.
0149-7189/81/030363-13$02.00/0
Copyright © 1982 Pergamon Press Ltd

ON DEFINING AT-RISK STATUS

E. WEBB STACY, JR.
Midland-Gladwin Community Mental Health Services
ABSTRACT

You can include many people in the at-risk group at the expense of including many who are not at-risk, or include few at the expense of missing many who truly are at-risk. This paper discusses the use of prevalence data and the relative costs of these two kinds of mistakes to help find the best way to separate the at-risk and not-at-risk groups. The implications of normally distributed data are discussed, and several examples, including one with a nonnormal distribution, are given.
At one time or another everyone who designs a prevention program wishes there were a foolproof way to determine who runs a high risk of developing the malady the program will focus on, and who does not. Reality soon asserts that every known criterion is less than perfect: some will be treated who do not need to be, and some who are at risk will be lost. This paper focuses on techniques for optimizing the decision about who to call at-risk, based on the number and types of mistakes that necessarily must be made. The techniques come from the psychophysical theory of signal detection (TSD) (Green & Swets, 1966; Coombs, Dawes, & Tversky, 1970; Egan, 1975), which in turn came from statistical decision theory (for example, Wilks, 1962; Raiffa & Schlaifer, 1961; Hogg & Craig, 1970). They have been used successfully in research on personality (Price, 1966), memory (Bernbach, 1967; Reitman, 1971), human information processing (Broadbent, 1971), and clinical psychological decision-making (Meehl & Rosen, 1973). With one notable exception (Jones, 1980), the techniques have not been applied to program planning. In medicine (e.g., McNeil, Keeler, & Adelstein, 1975; Metz, 1978) the same techniques are used, although there are some slight conflicts in terminology between TSD and the medical approach. Related techniques are used to evaluate the performance of a diagnostic methodology (e.g., Swets, Pickett, Whitehead, Getty, Schnur, Swets, & Freeman, 1979). They are certain to be of interest in evaluation and program planning in fields other than medicine, but we will not discuss them further here.

To get an idea of the problems addressed by TSD, imagine the plight of a SONAR operator in World War II. He had to listen through noisy headphones for faint sounds reflected off enemy ships. If he was instructed to report even mild suspicions, there were many false alarms, while if he was instructed to be sure of a signal before alerting the crew there were grave delays in detecting enemy submarines. Mistakes were unavoidable, but the types of mistakes made depended not only on the presence and strength of a signal, but on the costs of false alarms and misses.

STATISTICAL DECISION THEORY, TSD, AND PROGRAM PLANNING

Signal-detection researchers (Luce, Bush, & Galanter, 1963; Swets, Tanner, & Birdsall, 1961) compared the SONAR operator's situation with an analogous problem from statistical decision theory: given (1) a value x, (2) two candidate distributions X_1 and X_2 from which x might have come, (3) the payoffs associated with all possible correct and incorrect classifications, and (4) the prior probability that x came from X_1 or X_2, to which population should x be assigned? In TSD terms: given (1) a sensory experience, (2) the need to classify it as noise alone or as signal plus noise and the distributions of each, (3) a payoff matrix, and (4) the prior probability of signal (plus noise), should the observer report the presence of a signal?
This manuscript has benefitted by thoughtful comments from Richard Jentsch, Linda Falls-Stacy, Barry Tanner, Dennis Howard, Dwight Heil, E. Webb Stacy, Sr., Lynne Mischley, Sally Petersen, and Betty Tableman, and two anonymous reviewers. Harriet Raymond has provided timely assistance with the preparation of the figure. Requests for reprints should be sent to E. Webb Stacy, Jr., Midland-Gladwin CMHSB, 2620 W. Sugnet Rd., Midland, MI 48640.
Let's rephrase the same problem for our present discussion: given (1) a person's score on some measure known to be at least somewhat successful at predicting the incidence of the problem we would like to address, (2) the need to classify the person as one who will have the problem and therefore should be served by the prevention program (P) or one who will not and therefore should not (NP), and the distributions of scores of both populations, (3) the costs of serving someone who does not need it and of not serving someone who does, and (4) the prevalence of P, should we call the person at-risk, or not? Our task is to "detect" people giving us a "future-problems signal" from the background of those generating only "no-future-problems noise," using the screening measure as our "sensory apparatus." The situation is represented graphically in Figure 1.

Two kinds of mistakes were mentioned above: TSD calls them false alarms and misses. In the program planning context, a "false alarm" is a person admitted to at-risk status when the person will not in fact develop the problem. In statistical decision theory this is called a "Type II" error, and in medical decision-making it is called a "false positive." The other kind of mistake, called a "miss" in TSD, occurs when a person is not labelled at-risk when in fact they will develop the problem under consideration. It is also called a "Type I" error and a "false negative." Statistical decision theory has no special names for correct decisions, but correct classification of someone as at-risk would be called a "hit" in TSD terminology and a "true positive" in medicine; correct classification of someone as not-at-risk would be called a "correct rejection" or a "true negative."

In the medical literature, the term "sensitivity" refers to the fraction of positive cases captured by a diagnostic instrument or technique, and the term "specificity" refers to the fraction of negative cases rejected. As McNeil, Keeler, & Adelstein (1975) point out, there is a (sometimes specifiable) inverse relationship between these two, so that they are not fixed values for a given diagnostic method. Moreover, the term "sensitivity" is used with a different meaning in TSD: the separation of the noise and signal-plus-noise distributions. For this reason, we will avoid these words in this discussion and adopt the TSD error terminology.
[Figure 1. Relation of Cutoff Score x_B and Test Sensitivity d' to Distributions of Scores: two overlapping relative-frequency distributions of screening scores, one for no-problem people (NP, mean M_NP) and one for problem people (P, mean M_P), with their means separated by d' and the cutoff score x_B lying between them. (See text for full explanation.)]
It is worth exploring the analogy between TSD and prevention program planning further. In TSD, a measure called d' is the detectability of a signal, due to the skill of the observer, the intensity of the signal, or the quietness of the background. For program planning, it is the extent to which the screening instrument can distinguish the people who will develop the specific problem under consideration from those who will not. It is a measure of the separation of the noise and signal-plus-noise distributions in TSD, and of the separation of the distributions of the scores of NP and P people in program planning. Another measure, called B, is the cutoff on the psychophysical decision axis above which the observer says "Signal" and below which she or he does not. In the program planning context B is usually more concrete, since it is actually a score or set of scores (and therefore observable) on the screening measure, above which the person is identified as at-risk. When we define the cutoff score, x_B, below, it will correspond exactly to B. The cutoff score on the screening measure is actually a number; when humans are trying to detect signals, they produce no observable counterpart, and in this respect TSD faces more analytical difficulty.¹ Where TSD must use some sophisticated equation-solving algorithms to estimate relations among the parameters of the underlying distributions (Ogilvie & Creelman, 1968; Dorfman & Alf, 1969; Abramson & Leavitt, 1969; Grey & Morgan, 1972), in the program planning context we can estimate them directly, usually by calculating means and variances.

¹ It is conceivable that we would want to use non-quantitative human judgements in screening for prevention programs, in which case we would not have access to the underlying distributions of P and NP scores. However, unless we have an intermediate product (such as an X-ray or CAT scan in radiology; a "judgement-training" program would do) to which the judgements are applied, we would have information only about the quality of judgements made by the particular program staff, which might be interesting clinical-judgement research, but would not be helpful for others' program planning.
OPTIMIZING THE TRADEOFF BETWEEN FALSE ALARMS AND MISSES

It is clear that there is a tradeoff between the two kinds of mistakes: the higher the false alarm rate, the lower the miss rate, but a lower false alarm rate may be had only at the expense of a higher miss rate. We will see below that the consequences of each vary quite widely across program-planning situations. There are quite a number of ways to optimize the decisions: minimizing the absolute number of mistakes; minimizing the number of misses for a given false alarm rate (Neyman & Pearson, 1933), or vice versa; maximizing the minimum utility or minimizing the maximum loss; or maximizing the expected utility. The last option has been extensively studied, and it is the only one we will consider in this paper. An analysis of some other options may be found in Egan (1975).

In order to maximize the expected utility, we need to understand the concept of the likelihood ratio. It is much used throughout both TSD and mathematical statistics (cf. Edwards, Lindman, & Savage, 1963, or Raiffa & Schlaifer, 1961, among many others). It represents, for a given piece of evidence x, the odds that x came from the P population and not the NP population. More formally, the likelihood ratio function of x for a given random variable X is

    l(x) = Pr[x | P] / Pr[x | NP]          (1a)

for discrete distributions of x, where Pr[x | P] and Pr[x | NP] are the probabilities that X = x given that the score came from the problem and nonproblem populations respectively, and

    l(x) = f(x | P) / f(x | NP)          (1b)

for continuous distributions, where f(x | P) and f(x | NP) are the probability density functions of X at the point x for the problem and nonproblem populations, respectively.

We now turn to a derivation of the idea that optimal decisions may be made on the basis of the likelihood ratio. For the moment, assume that we have determined the costs of a false alarm, that is, saying "Yes" to nonproblem people (C_Yes,NP), and of a miss, saying "No" to problem people (C_No,P). We may assume, without loss of generality, that there is no cost or gain associated with correct decisions (i.e., C_Yes,P and C_No,NP are both 0). Then for any x, the expected cost of saying "Yes, you are at-risk" is

    E[Cost | Yes, x] = Pr[NP | x] C_Yes,NP + Pr[P | x] C_Yes,P = Pr[NP | x] C_Yes,NP

while the expected cost of saying "No, you're not" is

    E[Cost | No, x] = Pr[NP | x] C_No,NP + Pr[P | x] C_No,P = Pr[P | x] C_No,P

Obviously, these formulas have continuous counterparts. The best course of action will be to say "Yes" when
E[Cost | Yes, x] < E[Cost | No, x] and "No" otherwise. We will thus say "Yes" when

    Pr[P | x] / Pr[NP | x] > C_Yes,NP / C_No,P.

Let C = C_Yes,NP / C_No,P. Then, applying Bayes' rule,

    (Pr[P] Pr[x | P]) / (Pr[NP] Pr[x | NP]) > C          (1)

and, applying (1a) and rearranging, we find that

    l(x) > C (1 - Pr[P]) / Pr[P].          (2)

Again, this derivation has a continuous counterpart. Let the right-hand side of Equation 2, defined in terms only of the prevalence of P and the costs of mistakes, be called L. The formula tells us that L is an optimum likelihood ratio cutoff point, such that for scores on the criterion measure X with odds of P greater than L the optimum response is "Yes," and for odds of P less than L, "No." In addition, if l(x) is monotonic in x (if it never decreases while x increases), and if we know or may assume the distribution of X, we may find a cutoff score x_B above which we say "Yes" and below which we say "No."²

² In standard psychophysical detection theory, d' is always positive, because in psychophysics the signal-plus-noise distribution never has a smaller mean than the noise distribution. We have assumed the same ordering of means (i.e., M_P > M_NP) throughout this paper. If a screening device is such that the P distribution lies below the NP distribution, it can easily be rearranged (e.g., by subtracting all scores from the maximum possible).
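To make Equation 2 concrete, here is a minimal sketch, not part of the original paper, that computes the optimum likelihood-ratio cutoff L from a prevalence estimate and the two total mistake costs; the function and parameter names are ours, and the numbers in the usage line are purely illustrative.

```python
def optimal_likelihood_ratio_cutoff(prevalence, cost_false_alarm, cost_miss):
    """Equation 2: L = C * (1 - Pr[P]) / Pr[P], with C = C_Yes,NP / C_No,P.

    prevalence       -- Pr[P], the base rate of the problem in the screened population
    cost_false_alarm -- total cost (financial + human) of calling a no-problem person at-risk
    cost_miss        -- total cost (financial + human) of failing to call a problem person at-risk
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    C = cost_false_alarm / cost_miss
    return C * (1.0 - prevalence) / prevalence

# Illustrative values only: with equal mistake costs, a prevalence of 1 in 10
# gives L = 9; scores whose likelihood ratio l(x) exceeds L are called at-risk.
print(optimal_likelihood_ratio_cutoff(prevalence=0.10,
                                      cost_false_alarm=1.0,
                                      cost_miss=1.0))   # -> 9.0
```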
Prevalence

When good data are available, getting the prevalence rate is straightforward. Sometimes a little digging in a good reference (Hollingshead & Redlich, 1958; Jaco, 1960; Dohrenwend & Dohrenwend, 1966; Hare & Wing, 1970; Cooper, Kendall, Gurland, Sharpe, Copeland, & Simon, 1972; others cited in Zubin, Salzinger, Fliess, Gurland, Spitzer, Endicott, & Sutton, 1975) will produce an acceptably accurate figure. More often, the rate must be estimated from data about similar problems.

Prevalence rates are easy to ignore. According to Tversky and Kahneman (1974), people overvalue current information at the expense of what they knew before. There is a danger of weighing screening instrument scores too heavily (at the expense of prevalence rates) in trying to decide who belongs to the P population.
In novel situations, the effect seems as strong for trained decision-makers and statisticians as for others; it is analogous to a visual illusion that has an effect even though you have seen it before. This means that, unless just about half the population will develop P, it is important to take a good guess. Guessing, of course, makes the entire process less precise, but a prevalence of 1 in 10 increases the optimum likelihood ratio by a factor of 9, and 1 in 1000 increases it by a factor of 999! Clearly the prevalence is important and should not be ignored. Time spent deriving a careful estimate will pay dividends.

Costs

Figure 2 shows the two kinds of correct decisions and the two kinds of mistakes, together with their costs. Which mistake is preferable? At the extreme, if you want to guarantee that nobody is missed, you define everybody as at-risk, and if you want nobody incorrectly assigned at-risk status then you say nobody is at-risk. Unfortunately the best solution usually lies between these extremes; optimal definitions are seldom this clear. Fortunately there is often a specifiable relationship between the two kinds of mistakes that can be used to inform definitional strategies.

There are two kinds of costs of a miss: the financial and the human. If a person really will develop P, they (or society) will incur financial costs for the later treatment, rehabilitation, and support of someone who has suffered from P. For many kinds of mental health problems, this cost far exceeds the cost of preventive or early treatment, as Sussna (1977) convincingly shows. The human costs may be much more severe, although they may be more difficult to express in financial terms: if a person is not prevented from developing P, he or she will have to suffer through P.

There are also two kinds of costs of a false alarm. Financially, it means that prevention program resources are used on a person who does not need the services. The human and social cost is the possibility of being stigmatized as having, or being at risk for having, P. The severity of the human cost is determined in part by public attitude toward the problem: the human costs are different for being labelled at-risk for cancer, for schizophrenia, and (for children) for reading difficulties.

In combining costs, the derivation leading to Equation 2 tells us that we have to wind up with the ratio of the total false alarm costs (both financial and human) to the total miss costs. Since financial costs are usually measured in dollars, they are not too difficult to obtain (or estimate). The difficulty is that the human costs, which usually will be more significant than the financial, are harder to quantify, and even harder to compare with them.
[Figure 2. Four Decision Possibilities, Showing Costs of Mistakes.

    Fact: person will develop the problem
        Admitted to at-risk group:     HIT (correct admission to at-risk status); cost C_Yes,P
        Not admitted to at-risk group: MISS; costs: financial (later treatment, rehabilitation,
                                       support) and human (person suffers the problem later); cost C_No,P
    Fact: person won't develop the problem
        Admitted to at-risk group:     FALSE ALARM; costs: financial (wasted resources) and
                                       human (stigma); cost C_Yes,NP
        Not admitted to at-risk group: CORRECT REJECTION of at-risk status; cost C_No,NP]
Jones (1980) has suggested that different actors in the program setting may have different values. His actors tend to take extreme positions. By his account, the advocate assigns almost no cost to a false alarm, at least relative to a miss, while the auditor takes a symmetric but opposite stance. The program manager, not wanting to displease either, weights misses and false alarms equally. While these are probably plausible exaggerations of the positions the participants would take, rational participants might express their positions more subtly (and even if they were not rational, it might be worth trying to educate them). For instance, for a rare and painful but difficult-to-predict P, the auditor might be willing to tolerate a higher false alarm rate than for other problems, and for a common, mild, but again difficult-to-predict P the advocate might be willing to tolerate a few more misses than otherwise. If the actors can be thus budged from their extrema, the convenient practice of deciding that the costs of the two errors are equal must yield to a more careful analysis. One good way of combining multiple points of view, multi-attribute utility theory, has been lucidly described by Edwards, Guttentag, & Snapper (1975).

It is important to remember that, if the costs are not estimated explicitly, they are estimated implicitly: the choice of an arbitrary cutoff score on a screening instrument, if the prevalence of P is known and distribution types are assumed, determines a unique cost ratio. One might even work backwards from the cutoff score to what that ratio must have been, or work interactively with cutoff scores and costs until both are subjectively satisfying. It is also important to remember, though, that the techniques described in this paper do not
remove the necessity to make subjective cost-benefit judgements. The mathematical aspects of some of the paper should not fool the reader into believing that the problem has been solved.

Distributions of Scores

Having obtained some estimate of the prevalence of P and the costs of a false alarm, C_Yes,NP, and a miss, C_No,P, we use Equation 2 to calculate the "cutoff" likelihood ratio L. If the odds are greater than L that a person's score x on the criterion measure came from the X_P distribution rather than the X_NP distribution, we call that person at-risk; otherwise we do not. The smaller L is, the looser the criteria for at-risk status and the larger the number of people defined as at-risk. If L is bigger, criteria are tighter, and fewer will be in the group.

In this section we will explore the process of assigning a likelihood ratio region (above or below L) to the scores themselves. Readers not interested in the assumptions and methods involved in actually producing the optimal cutoff score should simply note that it is often possible to do so; they should then skip to the examples. Many results from the present analysis do not depend upon the exact nature of the distributions X_P and X_NP; those results will be summarized below. Numerical estimates of cutoff scores or regions, however, usually require some kind of knowledge of or assumption about the distributions.

As might be expected, the most common assumption about the distributions is that they are normal. There are several reasons for this. An important one is the central limit theorem.
It asserts that, when a large number of independent random variables, drawn from any probability distribution having finite mean and variance, are added together, their sum has (in the limit) a normal distribution. A wide variety of processes that can be thought of as the sum of many small independent processes can therefore be well approximated by a normal distribution. In the case of test scores on a screening instrument, the assumption of normality is natural. There are also applications where the assumption of normality has been shown to be a good approximation. Egan (1975) lists a number of examples from the psychophysics lab, Swets et al. (1979) found that scores on brain-tumor-diagnostic techniques were distributed normally, and Jones (1980) found gerontological needs assessment distributions to be normal.

In the cases to be examined below, we will assume that we have (from past experience, the literature on the criterion measure, or research grants) the prior scores on the criterion measure of people who have been found to have P and of others who have not developed P. More formally, we have sample scores x_P,1, x_P,2, ..., x_P,j from the X_P distribution and x_NP,1, x_NP,2, ..., x_NP,k from X_NP, where j and k are integers greater than 1. In the discussion to follow we will refer to the population score vectors as x_P and x_NP, and the vector of all scores (the concatenation of x_NP and x_P) simply as x.

It is possible to check the assumption of normality. For each score x_i in x, let n_NP,i be the number of scores in x_NP less than or equal to x_i, and n_P,i be the number of scores in x_P less than or equal to x_i. Then the empirical cumulative distribution functions (cdfs) of x_NP and x_P are

    F_NP(x_i) = n_NP,i / k   and   F_P(x_i) = n_P,i / j

for i = 1, 2, ..., j + k. Let z_NP,i be the value on the abscissa for which the unit-normal cdf is F_NP(x_i), and z_P,i be the corresponding value for F_P(x_i). We know that, for any two normal distributions F_1 and F_2,

    F_1(z) = F_2(az + b)          (3)

for some a and b and all z. If F_NP and F_P are normal, therefore, we should find that

    z_P,i = a z_NP,i + b

for some a and b and i = 1, 2, ..., j + k. If double-normal probability paper is available, Equation 3 can be verified directly by plotting F_NP(x_i) against F_P(x_i) for each i; if not, the F_NP(x_i) and F_P(x_i) must be converted into z_NP,i and z_P,i before plotting. For normality to be a good assumption, the result should be a straight line. Standard tests for linearity do not apply, however, because of the error of measurement in both the horizontal and vertical directions (Metz, 1978). If there is some doubt, a goodness-of-fit test (Hays, 1963) could be applied to each distribution. Incidentally, other families of distributions can be checked the same way by defining the z values with other cdfs.
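The check just described can be sketched in code. The following is our own illustration, not the author's procedure: it builds the two empirical cdfs, converts them to unit-normal z values with the inverse normal cdf (in place of double-normal probability paper), and fits a rough line to the (z_NP, z_P) pairs. The sample scores are hypothetical, and, as noted above, the fitted line is only an informal indication of linearity.

```python
from statistics import NormalDist, mean

def zz_points(x_np, x_p):
    """Empirical cdfs of the NP and P samples evaluated at every score,
    converted to unit-normal z values (the check behind Equation 3)."""
    inv = NormalDist().inv_cdf
    k, j = len(x_np), len(x_p)
    pts = []
    for xi in sorted(x_np + x_p):
        f_np = sum(x <= xi for x in x_np) / k      # F_NP(x_i) = n_NP,i / k
        f_p = sum(x <= xi for x in x_p) / j        # F_P(x_i)  = n_P,i  / j
        # inv_cdf is undefined at 0 and 1, so clip the extreme empirical values
        eps = 0.5 / max(j, k)
        f_np = min(max(f_np, eps), 1 - eps)
        f_p = min(max(f_p, eps), 1 - eps)
        pts.append((inv(f_np), inv(f_p)))          # (z_NP,i , z_P,i)
    return pts

def least_squares_line(pts):
    """Slope a and intercept b of z_P ~ a * z_NP + b; an informal linearity
    check only, since standard tests do not strictly apply here."""
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical screening scores, for illustration only.
x_np = [18, 20, 21, 22, 23, 24, 26, 28]
x_p = [30, 32, 34, 35, 36, 38, 41]
print(least_squares_line(zz_points(x_np, x_p)))
```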
ACCEPTANCE REGIONS WHEN NORMALITY IS ASSUMED

Things are simpler when we can assume that the variances of the P and NP distributions are equal. Since, unlike the situation in TSD, we have direct access to our estimates S_NP and S_P, we might as well use an F test (e.g., Hays, 1963) to determine their equality. We may want to reject the hypothesis at some level above .05, however, since the cost of a Type I error here is minimal. If we do determine that the variances are equal, we will pool them into a common estimate which we'll call S_all. We discuss the equal and unequal variance situations separately below.

It will be useful to be able to interpret the measure d', expressed in standard-deviation units, defined to quantify the separation of the P and NP distributions. This value corresponds to the parameter by the same name in TSD. For the equal-variance case, the common standard deviation is used, and for the unequal case, the value from the NP distribution. It is natural to compare d' with the x-axis values in tables of the cumulative unit-normal distribution. For example,
distributions separated by a d' of 1.00 can be assigned a "separation index" of 84.13% (corresponding to an x-axis value of 1.00); a d' of 2.00 (and an x-axis value of 2.00) would yield a separation index of 97.72%, and so on. When the distributions have different variances, the populations actually have two separation indices, but we will concentrate on the index relating to the NP distribution.

Case I: Equal Variances

Let us define a new transformation of the scores: for all scores in x_NP and x_P, let

    y_i = (x_i - M_NP) / S_all          (4)

where i = 1, 2, ..., k for x_NP and i = 1, 2, ..., j for x_P. We find that the transformed scores from the population of nonproblem people (call them the y_NP values) have mean 0 and standard deviation 1, and the transformed
scores of the problem people (the y_P values) have unit standard deviation but a nonzero mean. This mean, by definition, is d'. Recall that this is an index of the separation of the populations, and that it corresponds to the TSD parameter by the same name. We can estimate it directly by (M_P - M_NP)/S_all.

Let us define two new distributions Y_NP and Y_P obtained from X_NP and X_P using Equation 4. Obviously, the scores y_NP and y_P are samples from them. If we have assumed the distributions of X_NP and X_P to be normal, then Y_NP and Y_P are also normal. The likelihood ratio will then be the ratio of the density of the Y_P distribution to the density of the Y_NP distribution, and the optimal value of the likelihood ratio will occur at the point where the ratio of densities equals L, the right-hand side of Equation 2. Using normal densities and some algebra, we find that the y-score cutoff point is

    y_B = d'/2 + (ln L)/d'          (5)

where ln L is the natural logarithm of L. A derivation may be found in Egan (1975). Notice that, since L is always greater than 0 and d' is not equal to 0 (otherwise the P and NP distributions would be identical), y_B is guaranteed always to exist. Notice also that, for a fixed d', y_B is monotonic with L since, for all t > 0, ln t increases smoothly with t. This means that y_B, and the corresponding raw score x_B = S_all y_B + M_NP, are true cutoffs in the sense that they partition the scores into a lower not-at-risk region and an upper at-risk region.
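A short sketch of the equal-variance computation follows; it is our illustration rather than the author's code, applying Equation 5 and converting y_B back to a raw cutoff score. The values in the usage line anticipate the infant mental health example below (L = 2.93, d' = 1.84, M_NP = 22.1, S_all = 7.12).

```python
from math import log

def equal_variance_cutoff(L, d_prime, m_np, s_all):
    """Equation 5: y_B = d'/2 + ln(L)/d', then x_B = S_all * y_B + M_NP."""
    if d_prime == 0:
        raise ValueError("d' must be nonzero; otherwise the P and NP distributions coincide")
    y_b = d_prime / 2.0 + log(L) / d_prime
    x_b = s_all * y_b + m_np
    return y_b, x_b

# Values from the infant mental health example below:
# y_B is about 1.504 and x_B about 32.8, so a score of 33 or above is called at-risk.
print(equal_variance_cutoff(L=2.93, d_prime=1.84, m_np=22.1, s_all=7.12))
```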
Case II: Unequal Variances
If we have determined that the variances are unequal, we use a slightly different transformation of the scores, using S_NP as the standardizing factor: for scores in x_NP and x_P, let

    y_i = (x_i - M_NP) / S_NP          (6)

where i = 1, 2, ..., k for x_NP and i = 1, 2, ..., j for x_P. Transformed NP scores (the y_NP values) have mean 0 and standard deviation 1, and transformed P scores (the y_P values) have, in general, different parameters. Let the mean of the y_P values be d' (similar, but not identical, to the TSD notation) and let their standard deviation be S. Obviously, (M_P - M_NP)/S_NP estimates d' and S_P/S_NP estimates S.

We now define Y_NP and Y_P as new distributions obtained from X_NP and X_P by Equation 6; again we find that assuming the X distributions normal means that the Y distributions are normal, and that the scores y_NP and y_P are samples from them. As before, the likelihood ratio will be the ratio of the density of Y_P to Y_NP, and its optimum value will be L, the right-hand side of Equation 2. Using the appropriate normal densities and some algebra, we find that we should assign a score x to the P population whenever its corresponding y is outside

    y_B = [-d' ± S {(d')^2 + 2(S^2 - 1) ln(LS)}^(1/2)] / (S^2 - 1)          (7)

A derivation may be found in the appendix. These values always exist when both L > 1 and S > 1, and also when both L < 1 and S < 1. There are combinations of L, S, and d', however, that cause y_B to have no real solution. They involve values of L near 0, relatively large values of S, and small values of d'. Since this constellation will seldom be encountered in practice, we will not discuss it further except to say that, should the program planner encounter a negative number inside the braces in Equation 7, he or she should seriously consider finding another predictive criterion or scrapping the program.

There is another notable difference from the equal-variance situation. A given value of L generates two values of y_B: one from adding the right-hand portion of the numerator, the other from subtracting it. Our at-risk acceptance region is more complex: it is above the upper y_B and below the lower. As Egan (1975) points out, if S > 1, for moderate values of L and S this region "straddles" most of the scores in the NP distribution. If S < 1, the region straddles most of the P distribution. In either event, there is one point, below most of the NP distribution or above most of the P distribution, of very low probability. This point is ignored in practice because it involves so few scores, so that the less extreme of the two points generated by Equation 7 usually will be treated as a single cutoff point.
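The corresponding sketch for the unequal-variance case, again ours and not the paper's, solves Equation 7 for both candidate cutoffs; as just discussed, the less extreme root is usually treated as the single cutoff. The usage values anticipate the reading difficulties example below (L = 9.5, d' = 2.6, S = 2.0).

```python
from math import log, sqrt

def unequal_variance_cutoffs(L, d_prime, S):
    """Equation 7: y_B = (-d' +/- S * sqrt(d'^2 + 2(S^2 - 1) ln(LS))) / (S^2 - 1).

    Returns both roots; the at-risk acceptance region lies outside them.
    Assumes S != 1 (the equal-variance case is handled by Equation 5).
    """
    if S == 1.0:
        raise ValueError("S must differ from 1; use the equal-variance formula otherwise")
    disc = d_prime ** 2 + 2.0 * (S ** 2 - 1.0) * log(L * S)
    if disc < 0:
        # The constellation discussed in the text: no real solution, which suggests
        # the predictive criterion is too weak to be worth using.
        raise ValueError("no real cutoff exists for these values of L, S, and d'")
    root = S * sqrt(disc)
    return ((-d_prime + root) / (S ** 2 - 1.0),
            (-d_prime - root) / (S ** 2 - 1.0))

# Values from the reading difficulties example below: the roots are about 2.428
# and -4.162; the negative root covers so few scores that 2.428 is the sole cutoff.
print(unequal_variance_cutoffs(L=9.5, d_prime=2.6, S=2.0))
```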
EXAMPLES

The examples to follow are fictional. They are intended to illustrate various ways of applying the techniques just described, but are not intended to be models of the entire process. In particular, false alarm and miss costs must be estimated with more care. It is probably safe to assume that prevention program designers will have a more thorough knowledge of the program area than is represented in the examples, so that better cost (and prevalence, for that matter) estimates are sure to be a possibility.

Infant Mental Health

Let's imagine that a research team has just completed a five-year grant aimed at finding an instrument that
predicts severe childhood mental health problems by assessing the quality of mother-infant interaction in the first few days after birth. Assume that they have constructed an instrument called the Inhospital Mother-Infant Relationship Instrument (IMIRI), and that it has moderate success in predicting which infants will show some sort of childhood psychosis in the first four years of life. What we have, as a result of the research, are distributions of scores on the IMIRI, separated by whether the mother-infant dyad produced a psychotic child (P) or not (NP). We now need two other items: the prevalence of childhood psychosis, and the costs of both kinds of mistakes.

According to Hollingshead and Redlich (1958), in the 1950s the prevalence of all kinds of psychosis in adults was about 955 per 100,000, or .955%; Kessler (1966) reported a prevalence of childhood psychosis of 2.7% in outpatient psychiatric facilities; we might expect the actual prevalence to fall somewhere between these extremes. Let us choose 1.4%.

As for the costs of a miss, we can expect that a large fraction of the victims of childhood psychosis will be dependent their entire lives. According to Werry (1972), this fraction might approach 2/3. A false alarm will mean that treatment resources are wasted on a child not at-risk. Let us assume that early intervention, which might involve intensive parent education and counseling as well as constant professional monitoring of the child's behavior, will require services at a rate twice as intense as they would be later had the child developed a psychosis, but that the early intervention would last only a year, while the later intervention might last, on the average, 20 years. Then if a false alarm costs 1 unit (the actual units do not matter here, since we will be comparing human and financial costs), the miss will cost .5 x 20 = 10 units.

The human cost of a miss is the decimation of a child's life, and probably an adult's. It is difficult to put in financial terms, but let's say that it costs twice as much as the financial cost, resulting in 20 units. The human cost of a false alarm is only the extent to which the child will suffer by having been labelled at-risk for childhood psychosis. If the intervention program is relatively successful, so that being at-risk does not come to mean developing psychotic behavior patterns, and if the parents are able to live with their child's at-risk status, that cost will be minimal. We might guess that, on the average, the cost would be something like 1/8 the cost of a miss, or about .25 units (since we have decided to weigh the human costs twice as heavily as the financial). Costs are summarized in Figure 3, and our cost ratio is (1 + .25) / (10 + 20) = .04167. Plugging the prevalence and cost ratio into Equation 2, we obtain an optimum value L for the likelihood ratio of .04167 x (.986/.014) = 2.93.

The IMIRI scores range from 1 to 100, with the following parameters: M_NP = 22.1, S_NP = 7.1, M_P = 35.2, S_P = 8.5. With these figures, even with the large sample size involved in the research grant, we find no significant difference between the variances. We therefore pool them to get an estimate of S_all = 7.12, and we find that d' = 1.84. We inspect the graphs of z_NP and z_P plotted on double-probability paper, and find that they are approximately linear. We therefore decide that Equation 5 is appropriate, and we find a cutoff y_B of 1.504, which translates into a cutoff score of x_B = S_all y_B + M_NP = 32.81. Thus, if a mother-infant dyad scores at 33 or above on the IMIRI they will be assigned at-risk status, otherwise not.

Looking at the X_P and X_NP distributions, when the cutoff score is 33 we find that the false alarm rate is .046 and the miss rate is .403. This means that, for 1000 mother-infant dyads screened, 8 or 9 of the 14 children from the P group will be in the at-risk group, while 45 or 46 of the 986 children from the NP group will be labelled at-risk.
[Figure 3. Costs of Mistakes in Predicting Childhood Psychosis.

    Cost        False Alarm    Miss
    Financial   1              10
    Human       .25            20]
We may ask why, if a miss costs so much and a false alarm so little, we wind up with so many misses. The cost of one false alarm is more than 20 times smaller than the cost of one miss, but total false alarm costs mount up by sheer numbers. We just saw that 55 dyads will be identified as at-risk: 9 from the P population, 46 from the NP. In the at-risk group there will be about 5 times as many mistakes as correct classifications. Yet if we let more dyads into the group, we would have only a small chance of finding one of the 5 remaining P dyads among the 945 dyads classified not-at-risk, but a big chance of letting in many more false alarms. By adopting the calculated cutoff score of 33, we have done a good job of defining an at-risk group. The concentration of P cases (even though it is low) is almost 12 times higher than in the general population, and almost 31 times higher than in the population defined as not at-risk.
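The arithmetic behind counts like these can be sketched as follows; this is our illustration, not the paper's, and small rounding differences from the figures quoted in the text are to be expected.

```python
def screening_composition(n_screened, prevalence, false_alarm_rate, miss_rate):
    """Expected composition of the at-risk and not-at-risk groups."""
    n_p = n_screened * prevalence            # people who will develop the problem
    n_np = n_screened - n_p                  # people who will not
    hits = n_p * (1.0 - miss_rate)           # P people correctly called at-risk
    false_alarms = n_np * false_alarm_rate   # NP people incorrectly called at-risk
    misses = n_p * miss_rate                 # P people incorrectly passed over
    return {"at_risk_P": hits, "at_risk_NP": false_alarms,
            "not_at_risk_P": misses, "at_risk_total": hits + false_alarms}

# Infant mental health example: prevalence .014, false alarm rate .046, miss rate .403.
# Expect roughly 8.4 P dyads and 45.4 NP dyads in the at-risk group per 1,000 screened.
print(screening_composition(1000, 0.014, 0.046, 0.403))
```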
Reading Difficulties

Some maladies and disorders are measured continuously, rather than in an either-you-have-it-or-you-don't fashion. For instance, reading problems in the third grade might be measured with a reading test whose scores range from 0 to 100. Whenever a problem is defined this way, it is common to put the definitional cutoff at whatever score lies nearest the lowest 5% of scores. If only 5% of the third-grade pupils got a score of 14 or lower on the reading test, then we would define "reading problem" to be "score below 14 on the test." Note that the prevalence is then guaranteed to be 5 per 100.

Suppose we had a test like this (let's call it Define) for measuring third-grade reading problems. In this example, Define will be different from the test we will use on kindergarten children to try to predict later reading problems (a test called Predict). We will accept the "lowest 5% on Define" definition of reading problems. Our prevalence is therefore .05, and our task is to find the best cutoff score on Predict. The cost of missing a child is high, for the same reasons that the cost was high for missing a case of childhood psychosis; but the cost of a false alarm is greatly inflated, since the reading-problems prevention program would likely involve singling out the child for special attention. Parents may not like their child being in a special program, other pupils may ostracize children so identified, and the child could be negatively labelled throughout his or her educational career. As a first approximation, say that total miss costs (both financial and human) are only twice total false alarm costs, so that C = .5. Equation 2 tells us that L = 9.5.

Let's further assume that the NP and P variances are significantly different, such that the ratio of S_P to S_NP is about 2.0. If Predict had a d' of 2.6 (with a separation index of 99%), our y_B values would be 2.428 and -4.162. The negative value corresponds to an area containing only .00002 of the NP scores and .0004 of the P scores; it can safely be ignored. Setting our sole criterion at y_B = 2.428, we wind up with a false alarm rate of about .008 and a miss rate of .28. For every 1000 children screened, 44 will be called at-risk, of whom 36 will actually be at-risk; but 14 of the other 956 called not at-risk will actually develop reading problems.

On the other hand, if Predict only had a sensitivity of 1.16 (75% separation, still a fairly good test), our cutoff scores would be 2.1827 and -2.956. The negative cutoff would fall above .0015 of the NP scores and .02 of the P scores. This is a small, but not always negligible, number of scores. Nevertheless, we will let the upper score be our sole cutoff. If we do so, our false alarm rate will be .015 but our miss rate will be a whopping .69. Under these conditions, we will identify 31 of 1000 children as at-risk, but only 16 of them will develop reading problems. Of the 969 children identified as not at-risk, 34 will develop reading problems, while the other 935 will not. Obviously, a lower d' leads to an increase in both kinds of mistakes.

Outpatient Number-of-Visits

Suppose we were interested in detecting which of two different problems a client brings to our outpatient clinic. Problem A is the more common, and requires services delivered at only moderate intensity. Problem B, while rarer, requires highly intense services, more so the longer it goes undetected. Clients with undetected Problem B tend to make many more visits to the clinic than clients with Problem A. As soon as we are able to switch a client with Problem B to the more intensive services, he or she will begin rapid recovery. Suppose further that we would like to use the number of visits the client makes to our clinic as a predictive criterion. The question is: how many visits to the clinic should someone make before we switch them to the more intensive services?

Inspection of number-of-visits data for the typical outpatient clinic reveals that it is not even approximately normal. Usually the probability associated with 1 visit is higher than all others, and appears to slope down gradually with an increasing number of visits. A distribution that fits these data is the geometric distribution: the probability associated with the random variable X at the point x is

    Pr[X = x | s] = s(1 - s)^(x - 1),   x = 1, 2, ...;   0 otherwise.          (8)
A model generating this distribution would be one where the probability of "solving" the problem on any given visit, given that it has not already been solved, is the constant s. Appendix B provides a brief demonstration that the maximum-likelihood estimator for s, given a sample of scores x, is ŝ = 1/M_x, where M_x is the mean of the scores. Let us suppose that people who have been found to have Problem A have come a mean of 6 times, and people who have had undetected Problem B have come an average of 20 times. This means that ŝ_A is 1/6 and ŝ_B is 1/20.

Before we calculate the "cutoff" number of visits, we need to consider costs and prevalence. The financial cost of a false alarm is the wasted intensive treatment, but the cost of later treatment, let us say, has been found to be 10 times higher. We assign 1 cost-unit to the financial cost of a false alarm, and 10 to the financial cost of a miss. The human consequences are not quite as severe: there is little stigma attached to being falsely labelled at-risk for Problem B, and while it involves more suffering than Problem A, it is not severe. Let us assign .5 units to the false alarm and 5 units to the miss. We wind up with a false alarm-to-miss cost ratio C = .1. Let us further assume that we have found approximately 8% of the clients treated at the clinic to have Problem B. Then Equation 2 tells us that our optimum likelihood ratio value, L, is 1.15. The point at which this L is achieved turns out to be

    x_B = [ln L + ln(s_A(1 - s_B) / (s_B(1 - s_A)))] / ln[(1 - s_B) / (1 - s_A)].

A full discussion of the TSD applications of the geometric distribution (along with many other alternatives to the normal) may be found in Egan (1975). Using our particular values in the formula, we find that x_B is 11.2. This means that, if we have no other information about whether a client brings Problem A or Problem B to the clinic, we should assign her or him to intensive treatment following the 11th visit.
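A sketch of this calculation, ours rather than the paper's, applies the likelihood ratio of the two geometric distributions; the usage values are the ones from this example (s_A = 1/6, s_B = 1/20, L = 1.15).

```python
from math import log

def geometric_visit_cutoff(L, s_a, s_b):
    """Cutoff number of visits x_B for two geometric distributions:
    Problem A (no switch needed) with parameter s_A, Problem B (switch to
    intensive services) with parameter s_B, where s_B < s_A.
    """
    numerator = log(L) + log(s_a * (1.0 - s_b) / (s_b * (1.0 - s_a)))
    denominator = log((1.0 - s_b) / (1.0 - s_a))
    return numerator / denominator

# Values from this example: roughly 11.2, so clients are switched to the more
# intensive services following the 11th visit.
print(geometric_visit_cutoff(L=1.15, s_a=1/6, s_b=1/20))
```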
Community Placement

As a final example, consider a program that seeks to prevent hospitalization in state mental health facilities by providing alternative community living arrangements (CLAs). Even though such a program is aimed at preventing an event, not a specific problem, we can still apply the technique described above. Using scores on some predictive measure, we would like to define the at-risk group optimally.

In 1979 there were 105 institutionalized people per 100,000 in Michigan (Program Policy Guidelines, 1980), so the prevalence is .00105. To estimate the costs of a false alarm and a miss (see Figure 4), we begin by letting the financial cost of a false alarm (the cost of services to someone who would not be institutionalized) be 1 unit. The relative financial cost of a miss is just the relative cost of treating someone in an inpatient setting instead of the community. The results are not all in, but CLA costs might be half as much as the institution's, so we will call the financial cost of a miss 2 units. Shortly we will be rescaling the human costs, so for now we will set the human cost of a false alarm (the possibility of stigmatization) at 1 unit. Now, as unpleasant as the stigma of being labelled at-risk for institutionalization is, being placed in an institution instead of a CLA is usually much worse. Furthermore (again, the results are not all in), it appears that the CLA is often therapeutically more effective.³ We therefore set the human cost of a miss at 10.
³ A further complication is that if a person is so dysfunctional that they very clearly require institutionalization, a CLA may not be appropriate. For simplicity, the current analysis ignores this.
[Figure 4. Equal Weighting of Financial and Human Costs of Mistakes in Predicting Institutionalization.

    Cost         False Alarm        Miss                Sum
    Financial    1                  2                   3
    Human        1 x 3/11 = .27     10 x 3/11 = 2.73    11 x 3/11 = 3.00]
Because we are using taxpayers' money for partial support of the institutions and their alternatives, let us put the human costs on an equal footing with the financial: we note that the financial costs (1 for a false alarm, 2 for a miss) add to 3, and that the human costs (1 and 10, respectively) add to 11. We therefore multiply the human costs by 3/11 and get .27 and 2.73, respectively; now both human and financial costs sum to 3. They are summarized in Figure 4. Adding and taking the false alarm-to-miss cost ratio, we find that C = .2865. Equation 2 now tells us that L = 272.57. It is probably too high for any single instrument to be powerful enough for a screening program to be worthwhile. But this is just another way of saying that we should not be designing programs to screen the general population to
find appropriate residents of a mental health CLA. Screening would be of value only when there were a more extreme contrast between the consequences of institutionalization and those of living in a CLA when it is not necessary. Among people who have received services other than outpatient therapy from a community mental health center, about 10% eventually might be institutionalized. This would yield a "prevalence" of .10 and, with the same cost ratio, an L of 2.58, a workable figure. Given an appropriate predictive measure, it could be worthwhile to design a program to screen them for appropriateness of a CLA. Of course, if we limit the population to be screened this way, we necessarily miss those not screened.
SUMMARY

Some of the procedures above may seem arbitrary. To repeat: these judgements are made by default when they are not made deliberately. If the particular values and costs used in the examples are not in accord with your own (quite possibly better) estimates, by all means try yours in the formulas! The tools above are as much a framework for the clarification of values as they are analytical devices.

There are three areas of judgement involved in applying TSD analytical techniques to program planning: estimating the prevalence of the target problem, estimating the human and financial costs of the two kinds of mistakes, and finding an appropriate probability distribution to fit the data. We have seen that the techniques are useful when these judgements can be made. Some of the results, however, do not depend on judgement, and so are robust to whatever assumptions are made. There are essentially four of them:

1. The lower the prevalence of the target problem, the more selective you will have to be in choosing the at-risk population.
2. The higher the cost of prevention and/or stigmatization, the more selective you should be.
3. The more severe the consequences of developing the disorder, the less selective you should be.
4. The more extreme your selection strategy needs to be, the more sensitive your screening instrument needs to be. In some unusual cases, the best strategy will be to admit nobody to at-risk status.

These rules, and the procedure outlined above, will not result in screening techniques that always make correct decisions. Neither will they produce exact results. What they will do is crystallize program planners' values into a method to get optimum tradeoffs between unavoidable types of mistakes.

APPENDIX A: THE NORMAL, UNEQUAL-VARIANCE CUTOFF POINT

Let K = (2π)^(-1/2). We have assumed

    f(x | NP) = K exp[-y^2 / 2]

and

    f(x | P) = S^(-1) K exp[-(y - d')^2 / (2S^2)].

The likelihood ratio is

    l(x) = S^(-1) exp[-(y - d')^2 / (2S^2) + y^2 / 2].

Rearranging and taking the natural logarithm,

    ln l(x) = -ln S + (1 / (2S^2)) [y^2 (S^2 - 1) + 2yd' - (d')^2].

Rearranging again, we have a quadratic equation:

    0 = (S^2 - 1) y^2 + 2d'y - (d')^2 - 2S^2 (ln l(x) + ln S).

We are interested in the value of y when l(x) = L. Substituting appropriate terms and solving the quadratic, we find

    y_B = [-d' ± S {(d')^2 + 2(S^2 - 1) ln(LS)}^(1/2)] / (S^2 - 1),

as claimed.
APPENDIX B: MLE FOR THE PARAMETER OF THE GEOMETRIC DISTRIBUTION

Let x_1, x_2, ..., x_n be independent and identically distributed values from a geometric distribution, whose formula was given in Equation 8. Then the likelihood of the sample is

    L(s; x_1, ..., x_n) = Π_{i=1}^{n} s(1 - s)^(x_i - 1) = s^n (1 - s)^(Σ_i x_i - n)

and

    ln L(s; x_1, ..., x_n) = n ln s + (Σ_i x_i - n) ln(1 - s).

We take the derivative with respect to s and set it = 0:

    0 = n/ŝ - (Σ_i x_i - n)/(1 - ŝ) = n/ŝ - n(M_x - 1)/(1 - ŝ),

where M_x is the mean of the sample. It is now clear that

    ŝ = 1/M_x,

as claimed.
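As an informal check of this result, and not part of the original appendix, the following simulation draws geometric samples and compares the reciprocal of the sample mean with the true parameter; the sample size and seed are arbitrary.

```python
import random

def sample_geometric(s, n, rng):
    """Draw n values from Pr[X = x] = s (1 - s)^(x - 1), x = 1, 2, ..."""
    out = []
    for _ in range(n):
        x = 1
        while rng.random() >= s:   # keep "failing to solve" with probability 1 - s
            x += 1
        out.append(x)
    return out

rng = random.Random(0)
s_true = 1 / 6
visits = sample_geometric(s_true, 10_000, rng)
s_hat = 1 / (sum(visits) / len(visits))   # the MLE: 1 / M_x
print(s_true, s_hat)                      # s_hat should be close to 1/6
```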
REFERENCES

ABRAMSON, I.C., & LEAVITT, W. Statistical analysis of data from experiments in human signal detection. Journal of Mathematical Psychology, 1969, 6(3), 391-417.

BERNBACH, H.A. Decision processes in memory. Psychological Review, 1967, 74, 462-480.

BROADBENT, D.E. Decision and stress. London: Academic Press, 1971.

COOPER, J.E., KENDALL, K.E., GURLAND, B.J., SHARPE, L., COPELAND, J.R.M., & SIMON, R. Psychiatric diagnosis in New York and London: A comparative study of mental hospital admissions (Maudsley Monograph No. 20). London: Oxford University Press, 1972.

DOHRENWEND, B.P., & DOHRENWEND, B.S. Social status and psychological disorder. New York: John Wiley & Sons, 1966.

DORFMAN, D.D., & ALF, E., JR. Maximum-likelihood estimation of parameters of signal detection theory and determination of confidence intervals: Rating method data. Journal of Mathematical Psychology, 1969, 6(3), 457-496.

EDWARDS, W., GUTTENTAG, M., & SNAPPER, K. A decision-theoretic approach to evaluation research. In E.L. Struening & M. Guttentag (Eds.), Handbook of evaluation research (Vol. 1). Beverly Hills, Calif.: Sage, 1975.

EDWARDS, W., LINDMAN, H., & SAVAGE, L.J. Bayesian statistical inference for psychological research. Psychological Review, 1963, 70(3), 193-242.

EGAN, J.P. Signal detection theory and ROC analysis. New York: Academic Press, 1975.

GREEN, D.M., & SWETS, J.A. Signal detection theory and psychophysics. New York: John Wiley & Sons, 1966.

GREY, D.R., & MORGAN, B.J.T. Some aspects of ROC curve-fitting: Normal and logistic models. Journal of Mathematical Psychology, 1972, 9(1), 128-139.

HARE, E., & WING, J. (Eds.). Psychiatric epidemiology: An international symposium. London: Oxford University Press, 1970.

HAYS, W.L. Statistics. New York: Holt, Rinehart and Winston, 1963.

HOGG, R.V., & CRAIG, A.T. Introduction to mathematical statistics (3rd ed.). New York: The Macmillan Company, 1970.

JONES, B.D. The advocate, the auditor, and the program manager: Statistical decision theory and human service programs. Evaluation Review, 1980, 4(3), 275-305.

LUCE, R.D., BUSH, R.R., & GALANTER, E. (Eds.). Readings in mathematical psychology. New York: Wiley, 1963. (Originally published, 1954.)

MCNEIL, B.J., KEELER, E., & ADELSTEIN, S.J. Primer on certain elements of medical decision-making. New England Journal of Medicine, 1975, 293(5), 211-215.

MEEHL, P.E., & ROSEN, A. Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. In P.E. Meehl, Psychodiagnosis: Selected papers. New York: W.W. Norton & Co., Inc., 1973. (Originally published, 1955.)

METZ, C.E. Basic principles of ROC analysis. Seminars in Nuclear Medicine, 1978, 8(4), 283-298.

NEYMAN, J., & PEARSON, E.S. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London (Series A), 1933, 231, 289-337.

OGILVIE, J.C., & CREELMAN, C.D. Maximum likelihood estimation of ROC curve parameters. Journal of Mathematical Psychology, 1968, 5, 377-391.

PRICE, R.H. Signal detection methods in personality and perception. Psychological Bulletin, 1966, 66, 55-62.

Program policy guidelines: Fiscal year 1981-82 (5th ed.). Michigan Department of Mental Health, 1980.

RAIFFA, H., & SCHLAIFER, R. Applied statistical decision theory. Boston: Division of Research, Harvard Business School, 1961.

REITMAN, J.S. Mechanisms of forgetting in short-term memory. Cognitive Psychology, 1971, 2, 185-195.

SUSSNA, E. Measuring mental health program benefits: Efficiency or justice? Professional Psychology, 1977, 8(4), 435-441.

SWETS, J.A., PICKETT, R.M., WHITEHEAD, S.F., GETTY, D.J., SCHNUR, J.A., SWETS, J.B., & FREEMAN, B.A. Assessment of diagnostic technologies. Science, 1979, 205(4407), 753-759.

SWETS, J.A., TANNER, W.P., & BIRDSALL, T.G. Decision processes in perception. Psychological Review, 1961, 68, 301-340.

TVERSKY, A., & KAHNEMAN, D. Judgement under uncertainty: Heuristics and biases. Science, 1974, 184, 1124-1131.

WERRY, J.S. Childhood psychosis. In H.C. Quay & J.S. Werry (Eds.), Psychopathological disorders of childhood. New York: Wiley, 1972.

WILKS, S.S. Mathematical statistics. New York: John Wiley & Sons, 1962.

ZUBIN, J., SALZINGER, K., FLIESS, J.L., GURLAND, B., SPITZER, R.L., ENDICOTT, J., & SUTTON, S. Biometric approach to psychopathology: Abnormal and clinical psychology; statistical, epidemiological, and diagnostic approaches. In M.R. Rosenzweig & L.W. Porter (Eds.), Annual review of psychology (Vol. 26). Palo Alto, Calif.: Annual Reviews, Inc., 1975.