Formulas for threshold computations


COMPUTERS AND BIOMEDICAL RESEARCH 24, 514-529 (1991)

C. ROBERT, J. VERMONT, AND J. L. BOSSON
Département de Biostatistiques, Faculté de médecine, Domaine de la Merci, 38700 La Tronche, France

P. FRANCOIS
Clinique médicale infantile, CHU de Grenoble, 38700 La Tronche, France

AND

J. DEMONGEOT
Département de Biostatistiques, Faculté de médecine, Domaine de la Merci, 38700 La Tronche, France

Received May 3, 1990

Given a continuous variable $S$ whose density functions on two subgroups $\Omega^+$ and $\Omega^-$ of a population $\Omega$ are known (with, for instance, a higher mean value on $\Omega^+$ than on $\Omega^-$), we first define two strategies for classification into these groups. The first one (MWC) consists in determining a threshold $\alpha$ such that classifying in $\Omega^+$ when $S \ge \alpha$, and in $\Omega^-$ otherwise, leads to the highest percentage of well-classed elements. The second one consists in choosing the most probable group, given the observed value of $S$. We give mathematical formulas for the thresholds involved in these two strategies when the density functions, determined by the application of the maximum entropy principle, are those of normal distributions. These formulas prove that the two strategies are frequently equivalent, and we give simpler formulas when the partial variances of $S$ on $\Omega^+$ and $\Omega^-$ are unknown or approximately equal. All the formulas are adapted to the case where a cost coefficient is introduced to reflect the unequal seriousness of the two possible errors (misclassification in $\Omega^+$ or $\Omega^-$). We then consider an example, in which the computed thresholds can be validated graphically from empirical curves and have the same performance on the learning sample and on a test sample. © 1991 Academic Press, Inc.

0010-4809/91 $3.00. Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.

The choice of a threshold for a variable, with a dichotomous diagnostic classification in view, is often found in medical decision making. More precisely, let $S$ be a continuous variable taking its values in an interval $[a, b]$; we shall suppose, in the remainder of this article, that we study the pertinence of this variable $S$ in the diagnosis of a disease for which the mean value of $S$ is higher in the sick population (if this is not the case, one would replace $S$ with $-S$). To a threshold $\alpha$, we associate the following classification rule $R_\alpha$:

$R_\alpha$: If $S(i) \ge \alpha$, then $i$ is allocated to $\Omega^+$ (the sick population);
if $S(i) < \alpha$, then $i$ is allocated to $\Omega^-$ (the non-sick population),

where $S(i)$ denotes the value of $S$ on an individual $i$. We propose to describe simple strategies for the choice of the threshold, which lead to explicit formulas in the case where $S$ has a normal distribution conditionally on $\Omega^+$ and on $\Omega^-$. On an actual example concerning the study of gestational diabetes, we shall compare the thresholds obtained by the theoretical formulas with one another and with those obtained graphically from curves built on a sample; we shall also study the stability of the performance of the rules $R_\alpha$ corresponding to the theoretical thresholds, with the help of a test sample of $S$ values.

1. CLASSIFICATION STRATEGIES

1.1. Maximum of the Probability of Correct Classification

This procedure consists of choosing a threshold $\alpha_1$ such that the probability $P(\alpha_1)$ of correct classification by the rule $R_{\alpha_1}$ is maximum, that is,

\[ P(\alpha_1) = \max\{P(\alpha);\ a \le \alpha \le b\}. \]

1.1.1. Determination of Explicit Formulas Giving the Value of $\alpha_1$

Let $f^+$ (resp. $f^-$) be the probability density function of $S$ in $\Omega^+$ (resp. $\Omega^-$). Let $p^+$ be the probability of being sick and $p^- = 1 - p^+$ that of not being sick. For a threshold $\alpha$, the probability $P(\alpha)$ of correctly classifying an individual by the rule $R_\alpha$ is written (applying the Bayes formula):

\[ P(\alpha) = p^- \int_{-\infty}^{\alpha} f^-(t)\,dt + p^+ \int_{\alpha}^{+\infty} f^+(t)\,dt. \tag{1} \]

We therefore look for a number $\alpha_1$ such that $P(\alpha_1)$ is a maximum of the function $P$; in particular, denoting by $P'$ the derivative of $P$, at an interior maximum we must have $P'(\alpha_1) = 0$, where $P'$ is given by

\[ P'(\alpha) = p^- f^-(\alpha) - p^+ f^+(\alpha). \]
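As an illustration of [1] and of its derivative, here is a minimal Python sketch. It assumes normal densities for $f^+$ and $f^-$ (the model the paper adopts in the following paragraphs); the function and parameter names are ours, not the paper's.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of N(mean, sd^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prob_correct(alpha, p_plus, m_plus, sd_plus, m_minus, sd_minus):
    """P(alpha) from [1]: non-sick below alpha plus sick at or above alpha."""
    p_minus = 1.0 - p_plus
    return (p_minus * normal_cdf(alpha, m_minus, sd_minus)
            + p_plus * (1.0 - normal_cdf(alpha, m_plus, sd_plus)))


def prob_correct_derivative(alpha, p_plus, m_plus, sd_plus, m_minus, sd_minus):
    """P'(alpha) = p- f-(alpha) - p+ f+(alpha)."""
    p_minus = 1.0 - p_plus
    return (p_minus * normal_pdf(alpha, m_minus, sd_minus)
            - p_plus * normal_pdf(alpha, m_plus, sd_plus))
```

With equal priors, equal standard deviations, and means symmetric about zero, the derivative vanishes at $\alpha = 0$, which is the expected location of the maximum in that special case.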

Let us consider the case where $f^+$ (resp. $f^-$) is the density function of a normal distribution with mean $m^+$ (resp. $m^-$) and standard deviation $\sigma^+$ (resp. $\sigma^-$). Such a model is the natural one when we only know the partial means and standard deviations of $S$ on the two populations, according to a concentration theorem (cf. Robert (1)); this theorem is a justification of the maximum entropy principle (PME) for probabilistic modelling, and this paper also shows the practical efficiency of the classical formulas that can be elicited in the models coming from the application of the PME.

FIG. 1. Plot of $P(\alpha)$, with $\sigma^+ > \sigma^-$, in the case where $A(\alpha)$ has two roots $r_1$ and $r_2$. In practice, it often happens that $r_1 \le a$, where $a$ is the smallest value of $S$.

It can easily be shown, by taking logarithms, that $P'(\alpha)$ has the same sign as the polynomial:

\[ A(\alpha) = \frac{(\alpha - m^+)^2}{2(\sigma^+)^2} - \frac{(\alpha - m^-)^2}{2(\sigma^-)^2} - \log\frac{p^+ \sigma^-}{p^- \sigma^+}. \tag{2} \]
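The step "by taking the logarithm" can be sketched as follows, under the normal model stated above; the chain of equivalences is our own write-up of the omitted computation:

\[
\begin{aligned}
P'(\alpha) > 0
&\iff p^- f^-(\alpha) > p^+ f^+(\alpha) \\
&\iff \log\frac{p^-}{\sigma^-} - \frac{(\alpha - m^-)^2}{2(\sigma^-)^2} > \log\frac{p^+}{\sigma^+} - \frac{(\alpha - m^+)^2}{2(\sigma^+)^2} \\
&\iff \frac{(\alpha - m^+)^2}{2(\sigma^+)^2} - \frac{(\alpha - m^-)^2}{2(\sigma^-)^2} - \log\frac{p^+ \sigma^-}{p^- \sigma^+} > 0 .
\end{aligned}
\]

The common factor $1/\sqrt{2\pi}$ of the two normal densities cancels in the second line, and the left-hand side of the last line is the polynomial $A(\alpha)$.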

(1) Let us suppose that $\sigma^+$ is not equal to $\sigma^-$, and let us consider all the possibilities:

-The polynomial $A(\alpha)$ has two distinct roots $r_1$ and $r_2$ (with $r_1 < r_2$). If $\sigma^+ > \sigma^-$, then $A(\alpha)$ is positive if $\alpha$ lies between the roots. Therefore $P(\alpha)$ has a graph similar to that of Fig. 1 and the required threshold is $\alpha_1 = r_2$. If $\sigma^+ < \sigma^-$, we are in the case of Fig. 2 and the required threshold is $\alpha_1 = r_1$.

-The polynomial $A(\alpha)$ has either a double root or no roots. Such cases are, in practice, rare and present no problem: if $p^+ \ge p^-$, one takes $\alpha_1 = a$; if $p^+ < p^-$, one takes $\alpha_1 = b$.

FIG. 2. Plot of $P(\alpha)$, with $\sigma^+ < \sigma^-$, in the case where $A(\alpha)$ has two roots $r_1$ and $r_2$ (one could have $r_2 \ge b$).
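The case analysis in (1) can be sketched in Python as follows. This is a minimal illustration under the normal model, not the authors' implementation: the function names and the numerical values in the example are hypothetical, and the returned root is not clipped to $[a, b]$.

```python
import math


def poly_coeffs(p_plus, m_plus, sd_plus, m_minus, sd_minus):
    """Coefficients (c2, c1, c0) of A(alpha) = c2*alpha^2 + c1*alpha + c0, from [2]."""
    p_minus = 1.0 - p_plus
    c2 = 1.0 / (2.0 * sd_plus ** 2) - 1.0 / (2.0 * sd_minus ** 2)
    c1 = -m_plus / sd_plus ** 2 + m_minus / sd_minus ** 2
    c0 = (m_plus ** 2 / (2.0 * sd_plus ** 2)
          - m_minus ** 2 / (2.0 * sd_minus ** 2)
          - math.log(p_plus * sd_minus / (p_minus * sd_plus)))
    return c2, c1, c0


def mwc_threshold(p_plus, m_plus, sd_plus, m_minus, sd_minus, a, b):
    """Threshold alpha_1 when the two standard deviations differ (case (1))."""
    if sd_plus == sd_minus:
        raise ValueError("equal standard deviations: use formula [3] instead")
    c2, c1, c0 = poly_coeffs(p_plus, m_plus, sd_plus, m_minus, sd_minus)
    disc = c1 * c1 - 4.0 * c2 * c0
    if disc <= 0.0:
        # Double root or no real root: fallback stated in the text.
        return a if p_plus >= 1.0 - p_plus else b
    root = math.sqrt(disc)
    r1, r2 = sorted(((-c1 - root) / (2.0 * c2), (-c1 + root) / (2.0 * c2)))
    # The larger root is kept when the sick group is the more dispersed one.
    return r2 if sd_plus > sd_minus else r1


# Hypothetical parameters: p+ = 0.3, m+ = 2, sigma+ = 1.2, m- = 0, sigma- = 1.
print(mwc_threshold(0.3, 2.0, 1.2, 0.0, 1.0, a=-5.0, b=5.0))
```

In this example the threshold falls between the two means and closer to $m^+$, which is what one expects when the sick group is the less frequent one.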

(2) One has $\sigma^- = \sigma^+$; let us denote by $\sigma'$ the common value of the partial standard deviations $\sigma^-$ and $\sigma^+$. Then the polynomial $A(\alpha)$ is a first-degree polynomial and its root is the required value $\alpha_1$, which can be written

\[ \alpha_1 = \frac{m^+ + m^-}{2} - \frac{\sigma'^2}{m^+ - m^-}\,\log\frac{p^+}{p^-}. \tag{3} \]
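Formula [3] is simple enough to evaluate directly. The sketch below is an illustration only; the function name and the sample values are ours.

```python
import math


def threshold_equal_sd(p_plus, m_plus, m_minus, var_within):
    """alpha_1 from [3]; var_within is sigma'^2, the common partial variance."""
    p_minus = 1.0 - p_plus
    midpoint = (m_plus + m_minus) / 2.0
    return midpoint - var_within / (m_plus - m_minus) * math.log(p_plus / p_minus)


print(threshold_equal_sd(0.5, 2.0, 0.0, 1.0))   # equal priors: the midpoint of the means
print(threshold_equal_sd(0.25, 2.0, 0.0, 1.0))  # rarer disease: threshold moves toward m+
```

The second call shows the effect of the logarithmic correction term: when $p^+ < p^-$ the threshold is pushed above the midpoint, so fewer individuals are classed as sick.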

This case is, in fact, important because numerous practical situations lead back to it; more precisely, these situations correspond to the following facts:

* The standard deviations $\sigma^+$ and $\sigma^-$, although not equal, are nevertheless close. One approximates situation (1) by situation (2), taking as the value of $\sigma'^2$ the mean value of $(\sigma^+)^2$ and $(\sigma^-)^2$:

\[ \sigma'^2 = p^+(\sigma^+)^2 + p^-(\sigma^-)^2. \tag{4} \]

Let us recall that in some classical methods of statistics (linear regression, analysis of variance, factorial discriminant analysis) the parameters $(\sigma^+)^2$ and $(\sigma^-)^2$ are called the partial variances of $S$ and their mean value $\sigma'^2$ is called the within-group variance of $S$. We shall see that, in the examples dealt with, formula [3] gives a good numerical approximation of the number $\alpha_1$ computed in (1) (i.e., of $r_2$ if $\sigma^+ > \sigma^-$ and of $r_1$ if $\sigma^+ < \sigma^-$).

** We do not know $\sigma^+$ or $\sigma^-$, but only the standard deviation $\sigma$ of $S$ on the whole population (sick and normal). We have such a situation in particular when the size of the sample of $S$ at our disposal does not permit the calculation of $\sigma^+$ or of $\sigma^-$ with sufficient precision. In such a case, the application of the PME leads to the application of [3] with (see Robert (1))

\[ \sigma'^2 = \sigma^2 - \left(p^+(m^+)^2 + p^-(m^-)^2 - m^2\right), \tag{5} \]

where $m$ is the global mean of $S$, that is,

\[ m = p^+ m^+ + p^- m^-. \tag{6} \]

1.1.2. Graphical Threshold Determination

Suppose that we have at our disposal a sample of size $N$ of the variable $S$. We determine, for the thresholds $\alpha_i = a + i\,((b - a)/n)$, $i = 0, \ldots, n$ (for $n$ large enough), the number $N(\alpha_i)$ of individuals of the sample who are well classed by $R_{\alpha_i}$; we can approximate the required threshold $\alpha_1$ by a number $\alpha_{i_0}$ such that

\[ N(\alpha_{i_0}) = \max\{N(\alpha_i);\ i = 0, \ldots, n\}. \tag{7} \]

Obviously, there may be several indices $i_0$ satisfying [7], and it would then be difficult to make a choice.
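The grid search behind [7] can be sketched as below. The sample is a toy data set invented for illustration, and the function name is ours; the sketch returns every maximizing threshold so that ties remain visible.

```python
def grid_thresholds(values, is_sick, a, b, n):
    """Return (best_thresholds, best_count) over alpha_i = a + i*(b - a)/n."""
    best_count = -1
    best_thresholds = []
    for i in range(n + 1):
        alpha = a + i * (b - a) / n
        # An individual is well classed if (S >= alpha) agrees with the sick flag.
        count = sum((s >= alpha) == sick for s, sick in zip(values, is_sick))
        if count > best_count:
            best_count, best_thresholds = count, [alpha]
        elif count == best_count:
            best_thresholds.append(alpha)
    return best_thresholds, best_count


# Toy sample, invented for illustration only.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
is_sick = [False, False, False, True, True, True]
thresholds, count = grid_thresholds(values, is_sick, a=0.0, b=1.0, n=10)
print(thresholds, count)
```

On well-separated data such as this, a whole range of grid points reaches the maximum, which is the ambiguity the text warns about.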

It may seem unnecessary, if [7] holds for only one index $i_0$, to plot an empirical graph joining the points with coordinates $(\alpha_i, N(\alpha_i)/N)$, $i = 0, \ldots, n$, in order to determine it. However, the graph is indispensable for checking the precision to be attributed to $\alpha_{i_0}$ (see Section 2). We shall compare, on each treated example, the frequency $N(\alpha_{i_0})/N$ with the frequency $N(\alpha_1)/N$, where $\alpha_1$ is explicitly calculated by using one of the formulas from the preceding section ($\alpha_1$ is one of the roots of $A(\alpha)$, or satisfies [3]), and the appearance of the empirical graph obtained from an $N$-sample with one of the theoretical graphs of Figs. 1 or 2. Coincidence of the thresholds computed with formulas elicited within a probabilistic model with the graphically determined ones will be, for the classification purpose at hand, a justification of the model and of the thresholds.

Let us consider the following allocation strategy: let $s$ be the value of the variable $S$ on the individual we wish to classify; let $p_s^+$ (resp. $p_s^-$) be the probability of being sick (resp. of not being sick) conditionally on $S = s$. We say that $p_s^+$ and $p_s^-$ are the posterior probabilities. If $p_s^+ \ge p_s^-$, we allocate the individual to the sick group, and if $p_s^+ < p_s^-$, we class the individual in the other group. It is not, of course, obvious that this allocation rule can be put into the form $R_{\alpha_2}$ for a threshold $\alpha_2$ to be determined. We are going to see what happens.

1.2.1. Theoretical Calculations

The notation and hypotheses are those of Section 1.1.1. It is well known that

\[ p_s^+ = \frac{p^+ f^+(s)}{p^+ f^+(s) + p^- f^-(s)}, \qquad p_s^- = \frac{p^- f^-(s)}{p^+ f^+(s) + p^- f^-(s)}. \]

Thus, we can write

\[ p_s^+ \ge p_s^- \iff p^+ f^+(s) - p^- f^-(s) \ge 0. \]

But we see that $p^+ f^+(s) - p^- f^-(s) = -P'(s)$, where $P'$ is the derivative of the function $P$ defined in [1]. Thus, the sign of $p^+ f^+(s) - p^- f^-(s)$ is that of the polynomial $-A$, where $A$ is defined in [2]. Let us again study the various possible situations.

(i) In the case where $\sigma^+ \ne \sigma^-$, the allocation considered here (we will denote it by MAP, for maximum a posteriori) can be written as follows when the polynomial $A$ has two roots $r_1$ and $r_2$:

-In the case where $\sigma^+ > \sigma^-$ (see Fig. 3):
• If $r_1 < s < r_2$, then the individual is classed in the normal group $\Omega^-$.
• If $s \le r_1$ or $s \ge r_2$, then the individual is classed in the sick group $\Omega^+$.

-In the case where $\sigma^+ < \sigma^-$:
• If $s < r_1$ or $s > r_2$, then the individual is classed in $\Omega^-$.
• If $r_1 \le s \le r_2$, then the individual is classed in $\Omega^+$.

FIG. 3. One can see that, when $p^+ = p^-$, the roots of the polynomial are at the intersections of the curves of $f^+$ and $f^-$, since the equality $p^+ f^+ = p^- f^-$ reduces to $f^+ = f^-$; we see that for a value outside (resp. inside) $[r_1, r_2]$ we have $f^+ > f^-$ (resp. $f^+ < f^-$), thus the MAP classification leads to allocation to $\Omega^+$ (resp. $\Omega^-$).
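The MAP rule itself does not require the roots of $A$: it is enough to compare the two posterior probabilities. The following sketch assumes normal conditional densities; the names and the parameter values are hypothetical, chosen so that $\sigma^+ > \sigma^-$.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_sick(s, p_plus, m_plus, sd_plus, m_minus, sd_minus):
    """Posterior probability of being sick given S = s (Bayes formula)."""
    num = p_plus * normal_pdf(s, m_plus, sd_plus)
    den = num + (1.0 - p_plus) * normal_pdf(s, m_minus, sd_minus)
    return num / den


def map_is_sick(s, p_plus, m_plus, sd_plus, m_minus, sd_minus):
    """MAP rule: allocate to the sick group when the posterior is at least 1/2."""
    return posterior_sick(s, p_plus, m_plus, sd_plus, m_minus, sd_minus) >= 0.5


params = (0.3, 2.0, 1.2, 0.0, 1.0)  # hypothetical values with sd_plus > sd_minus
for s in (-12.0, 0.0, 1.4, 1.6, 3.0):
    print(s, map_is_sick(s, *params))
```

The very low value $s = -12$ is allocated to the sick group, illustrating that with $\sigma^+ > \sigma^-$ the MAP rule is not a single-threshold rule. For values much further in the tails both densities can underflow to zero in floating point, so a production version would compare log-densities instead.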

Fig. 4 allows the comparison of the MAP classification with the classification defined in Section 1.1.1 by $R_{\alpha_1}$, denoted hereafter as MWC (maximum of well classed). In practice, the percentages of correct classifications given by the two procedures are, in general, not significantly different, because the probability that the variable $S$ is smaller than $r_1$ (resp. larger than $r_2$) in the case where $\sigma^+ > \sigma^-$ (resp. $\sigma^+ < \sigma^-$) is negligible, and thus we may often come back to the rule $R_{\alpha_1}$. One can see, however, that in the MAP classification, the individuals having exceptional $S$ values (very large or very small) are always allocated to the more dispersed group (the dispersion being measured by the standard deviation of $S$ within this group).

(ii) In the case where $\sigma^+ = \sigma^-$, or whenever we find ourselves in such a situation by using [4], [5], [6], there is an equivalence between the two procedures: the threshold $\alpha_1$ determined by [3] then has two interesting characteristics at the same time: the classification by $R_{\alpha_1}$ gives the highest percentage of correctly classified individuals, and it is equivalent to allocating an individual to the a posteriori most probable group.

1.2.2. Graphical Representations

We will see, in the examples, that the superimposed histograms of $S$ in $\Omega^+$ and $\Omega^-$ enable us to graphically determine an interval containing $\alpha_2$.

FIG. 4. Schema of classification using the MAP and MWC classifications. [Diagram: allocation regions $\Omega^+$ and $\Omega^-$ along the $S$ axis between $a$ and $b$, delimited by $r_1$ and $r_2$, under MWC and under MAP; one panel is labelled $\sigma^+ > \sigma^-$.]

1.3. Cost-Benefits Considerations

The dichotomous classification described above can lead to two errors: incorrect classification in $\Omega^+$ (the individual is then called a "false positive") or in $\Omega^-$ (the individual is then called a "false negative"). In order to establish formulas where these two errors are differentiated, we must introduce a misclassification cost ratio. Let $k$ be the ratio of the cost of a misclassification in $\Omega^-$ to the cost of a misclassification in $\Omega^+$ (a false negative costs $k$ times more than a false positive). Let us look for the threshold $\alpha_3$ which minimizes the expected cost $C(\alpha)$ of misclassification with the rule $R_\alpha$:

\[ C(\alpha_3) = \min\{C(\alpha);\ a \le \alpha \le b\}. \]

Let us write the equation which gives $C(\alpha)$, up to a multiplicative unit constant:

\[ C(\alpha) = \underbrace{p^- \int_{\alpha}^{+\infty} f^-(t)\,dt}_{\text{cost for false positives}} + \underbrace{k p^+ \int_{-\infty}^{\alpha} f^+(t)\,dt}_{\text{cost for false negatives}}. \]

We can also write

\[ C(\alpha) = k p^+ + p^- - \left( k p^+ \int_{\alpha}^{+\infty} f^+(t)\,dt + p^- \int_{-\infty}^{\alpha} f^-(t)\,dt \right). \]

Let us write $C(\alpha) = k p^+ + p^- - Q(\alpha)$. The threshold $\alpha_3$ minimizing the cost therefore satisfies:

\[ Q(\alpha_3) = \max\{Q(\alpha);\ a \le \alpha \le b\}. \]

But on examining expression [1], one sees that $Q(\alpha)$ is obtained from $P(\alpha)$ by replacing the number $p^+$ by the number $k p^+$. Since, in the calculations leading to the formulas for the thresholds, we have never used the fact that $p^+ + p^- = 1$, all the formulas for the thresholds stay valid when replacing $p^+/p^-$ by $k p^+/p^-$. In particular, formula [3] becomes

\[ \alpha_3 = \frac{m^+ + m^-}{2} - \frac{\sigma'^2}{m^+ - m^-}\,\log\frac{k p^+}{p^-}, \]

where $\sigma'^2 = p^+(\sigma^+)^2 + p^-(\sigma^-)^2$ or where $\sigma'^2 = \sigma^2 - \left(p^+(m^+)^2 + p^-(m^-)^2 - (p^+ m^+ + p^- m^-)^2\right)$. We see, therefore, that all the proposed calculations are immediately adaptable to studies involving cost ratios.
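The cost-weighted threshold can be checked numerically against the expected cost it is meant to minimize. The sketch below assumes equal partial standard deviations and normal densities; all names and values are illustrative, not taken from the paper's data.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of N(mean, sd^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def cost_threshold(k, p_plus, m_plus, m_minus, var_within):
    """alpha_3: formula [3] with p+/p- replaced by k*p+/p-."""
    p_minus = 1.0 - p_plus
    return ((m_plus + m_minus) / 2.0
            - var_within / (m_plus - m_minus) * math.log(k * p_plus / p_minus))


def expected_cost(alpha, k, p_plus, m_plus, m_minus, sd):
    """C(alpha), in units of the cost of one false positive."""
    p_minus = 1.0 - p_plus
    false_pos = p_minus * (1.0 - normal_cdf(alpha, m_minus, sd))
    false_neg = k * p_plus * normal_cdf(alpha, m_plus, sd)
    return false_pos + false_neg


# Hypothetical setting: p+ = 0.25, m+ = 2, m- = 0, common sd = 1, k = 3.
alpha3 = cost_threshold(3.0, 0.25, 2.0, 0.0, 1.0)
print(alpha3, expected_cost(alpha3, 3.0, 0.25, 2.0, 0.0, 1.0))
```

Raising $k$ lowers the threshold, so that more individuals are classed as sick and fewer costly false negatives occur.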

Remarks

1. Many statistical software programs propose, at the end of a factorial discriminant analysis with a two-group partition, the classification rule $R_{\alpha_0}$, where $\alpha_0 = (m^+ + m^-)/2$ and where $S$ is the unique discriminant component of the analysis; in fact, with this classification rule, an individual is allocated to the group whose center of gravity is closest, using the Mahalanobis distance (see Robert (3)). Notice that this is equivalent to the application of expression [3] with $p^+ = p^-$ (when $p^+$ and $p^-$ are unknown, taking them as equal is a common practice in statistical modeling). The correction term $\alpha_3 - \alpha_0 = -(\sigma'^2/(m^+ - m^-)) \log (k p^+/p^-)$, where $k = 1$ for the case of equal error costs, is larger in absolute value the further $k p^+/p^-$ is from 1 (i.e., the further $p^+$ is from 1/2 when $k = 1$), the larger $\sigma'$ is (i.e., the more $S$ is dispersed on the two considered groups), and the closer $m^+$ and $m^-$ are. Let us note that, in general, when the variable $S$ is pertinent for the diagnosis of the illness, the difference $m^+ - m^-$ is large and $\sigma'$ is small.

2. One could imagine that $k$ is the ratio of the probabilities of accidents according to the type of error. In that case, $\alpha_3$ will minimize the global probability of an accident. In the same way, the formulas can be adapted to strategies with expected-utility considerations (see (2)).

3. Let us remember that the ROC curves (3) allow the comparison of the specificity $Sp(\alpha)$ and of the sensitivity $Se(\alpha)$ at the same threshold $\alpha$ when $\alpha$ varies (here $Se(\alpha) = \int_{\alpha}^{+\infty} f^+(t)\,dt$ and $Sp(\alpha) = \int_{-\infty}^{\alpha} f^-(t)\,dt$); the P-ROC curve (see (9)) enables us to compare the evolution of the positive predictive values $ppv(\alpha)$ and the negative predictive values $npv(\alpha)$ when $\alpha$ varies (the positive predictive value $ppv(\alpha)$ is the probability of being sick when $S \ge \alpha$, and the negative predictive value $npv(\alpha)$ at the threshold $\alpha$ is the probability of not being sick when $S < \alpha$). It should be noted that determining $\alpha$ so that the classification with $R_\alpha$ reaches a given sensitivity or specificity, or a given positive or negative predictive value, does not lead to explicit formulas for $\alpha$ (even when we assume that $S$ is normally distributed on each group) and is thus outside the scope of this paper. It seems that, in the medical decision domain, the difficulty of probabilistic modelling has discouraged the use of theoretical formulas, and thus the ROC curve has appeared as one of the most convenient tools for graphical threshold determination. In this paper, we want to show that other methods involving theoretical formulas can in fact easily be applied. For that purpose, we first stress the fact that a well-defined strategy must be chosen for medical decision making (such a strategy, if it is to be used for classification, does not involve sensitivity and specificity); then, using the PME easily allows probabilistic modelling and commonly leads to efficient models involving well-known distributions: for such distributions, it is then easy, as shown above, to determine formulas for the thresholds.

2. APPLICATION ON AN EXAMPLE

2.1. Presentation of the Example

A clinical datafile was made for the study of gestational diabetes; it consists of 173 records, each composed of four values of glycemia level, expressed in mmol/liter. The first, $T_0$, is the fasting glycemia; the others are the levels of glycemia at 60, 120, and 180 min after absorbing 100 g of glucose (variables $T_{60}$, $T_{120}$, $T_{180}$). A last variable takes the value 1 in the case of gestational diabetes and 0 otherwise; all the records relate to expectant women, about eight months into their pregnancy, and the diagnosis of gestational diabetes is established independently of the age of the patient, according to (5). For our study in comparing thresholds, we randomly picked 56 records (about 30% of the cases of the initial datafile) to build a test file. We thus have two datafiles: the first one, with 117 records, denoted LF (learning file), will be used to determine the thresholds by using the formulas of the preceding section or by means of graphical methods. For every threshold $\alpha$, we will compute the proportion $p(\alpha)$ of correct classifications in LF using the rule $R_\alpha$ and we will compare it to the proportion $p'(\alpha)$ of correct classifications in the datafile of the 56 remaining records (TF: test file).

2.2. Calculation of the Thresholds by the Formulas

In Table 1 one can see the thresholds associated with the formulas established in Section 1, as well as the corresponding percentages of correct classifications in LF and TF. We note that, for the four considered variables, we have $\sigma^+ > \sigma^-$ and the root $r_1$ of the polynomial $A(\alpha)$ is smaller than the minimum value taken by the variable on TF or LF. Therefore the MAP and MWC classifications coincide exactly and correspond to the rule $R_{\alpha_1}$ (where $\alpha_1 = \alpha_2 = r_2$). One can see from this table that $\alpha_3$ is very close to $r_2$ and gives the same percentage of correct classifications in LF and TF; as we can see in Table 1, $\sigma^+$ and $\sigma^-$ are very close for the four considered variables, and this explains why $\alpha_3$ is close to $r_2$. One can also note that the percentage of correct classifications in the test file is comparable to the percentage of correct classifications in the learning file, except possibly for the threshold $\alpha_0$ (but this threshold, which in fact also corresponds to the maximization of the sum of the sensitivity and the specificity, is not as good as the others, as already noted in (2)). Let us note that the approximation of a model with unequal partial standard deviations $\sigma^+$ and $\sigma^-$ by a model with equal partial standard deviations is excellent here. Therefore, we need only compare the graphical determination of the thresholds to the value $\alpha_3$.

TABLE 1
COMPARISON OF THE THRESHOLDS

Variable    I: a; b     II: $\alpha_0 = (m^+ + m^-)/2$    III: $r_1$    III: $r_2$           IV: $\alpha_3$
$T_0$       0.4; 1.5    0.85   73% (66%)                  -0.065        0.93   80% (86%)     0.93   80% (86%)
$T_{60}$    0.6; 2.6    1.5    73% (82%)                  0.24          1.75   87% (93%)     1.74   87% (93%)
$T_{120}$   0.5; 2.4    1.27   79% (82%)                  0.015         1.45   89% (95%)     1.45   89% (95%)
$T_{180}$   0.5; 1.8    1.07   73% (64%)                  0.218         1.25   81% (82%)     1.24   81% (82%)

Note. In column I, $a$ and $b$ are the minimum and the maximum observed for the variable (on the initial datafile of 173 records); in column III, $r_1$ and $r_2$ are the roots of the polynomial $A(\alpha)$ defined in [2]; in column IV, $\alpha_3$ is given by [3], where $\sigma'^2$ is calculated by formula [4]. In columns II, III, IV, each threshold is followed by the percentage of correct classifications, respectively by $R_{\alpha_0}$, $R_{r_2}$, $R_{\alpha_3}$, in LF and, in parentheses, in TF. For the four variables considered, the ratio $p^+/p^-$ equals 30/87 = 0.34.

TABLE 2
GRAPHICAL DETERMINATION OF THE THRESHOLD

Variable    MWC             MAP           $\alpha_3$
$T_0$       [0.91; 0.93]    [0.9; 1.1]    0.93
$T_{60}$    [1.70; 1.78]    [1.7; 1.9]    1.74
$T_{120}$   [1.38; 1.46]    [1.3; 1.5]    1.45
$T_{180}$   [1.16; 1.38]    [1.1; 1.3]    1.24

Note. Figures 6-9 allow the determination of an interval containing the threshold: the length of this interval depends on the aspect of the curves (e.g., the frequency curve for correct classifications of $T_{180}$ has several maxima) and on the step width of the threshold used to make the curves. Moreover, $\alpha_3$ is calculated on LF to the nearest 0.01.

FIG. 5. Graphical determination of the thresholds corresponding to the MWC classification (curves for $T_0$, $T_{60}$, $T_{120}$, $T_{180}$ plotted against the threshold). The step of the threshold for $T_0$ is 0.01; therefore the maximum $\alpha = 0.93$ is determined graphically to the nearest 0.01. The step of the variable for $T_0$ is 0.04, for $T_{60}$ and $T_{120}$ is 0.04, and for $T_{180}$ is 0.03.

3. GRAPHICAL DETERMINATIONS FOR THE MAP AND MWC CLASSIFICATIONS

In Fig. 5 one can see the curves $(\alpha, p(\alpha))$ of the proportion $p(\alpha)$ of correct classifications by $R_\alpha$ for the four variables $T_0$, $T_{60}$, $T_{120}$, $T_{180}$. For $T_0$, $T_{60}$, $T_{120}$, the thresholds are graphically determined to be 0.92, 1.74, 1.42 and correspond to percentages of correct classification of 79%, 87%, and 89%. Note that when looking at Table 2, the thresholds for these three variables, determined by explicit formulas, and the thresholds determined graphically differ by not more than 0.01. The plot relative to $T_{180}$ admits several maxima corresponding to a percentage of correct classifications of 81% and to the thresholds 1.19, 1.22, 1.32, 1.35, which means that we cannot graphically determine the threshold with good precision. For the graphical determination of the thresholds corresponding to the MAP classification, one can look at Figs. 6-9, which give the histograms of the variables on $\Omega^+$ and $\Omega^-$. One can see, on these histograms, that the probabilistic model with normal conditional distributions is not easily determined without the use of the PME. The precision on $\alpha$ is the same as the width of the histogram cells. The thresholds found for $T_0$, $T_{60}$, $T_{120}$, $T_{180}$ are: 1, 1.8, 1.4, 1.2 (to be compared to the theoretical values $\alpha_3$ from Table 2, which are 0.9, 1.75, 1.45, 1.25). Let us remember, however, that the thresholds graphically determined here rely on the precision with which the curves were plotted. We can in fact only determine an interval for each threshold; we can look at Table 2, which gives graphical results in terms of intervals, and we note that the theoretical threshold $\alpha_3$ always lies within the graphically determined intervals.

FIG. 6. Histograms of $T_0$ on $\Omega^+$ (shaded) and $\Omega^-$.

4. DISCUSSION AND CONCLUSION

Let us first summarize the preceding results; for the four considered variables, $\sigma^+$ is little different from $\sigma^-$ and the approximation of the model with unequal partial variances $(\sigma^+)^2$ and $(\sigma^-)^2$ by a model with partial variances equal to $\sigma'^2$ (where $\sigma'^2 = p^+(\sigma^+)^2 + p^-(\sigma^-)^2$) can be made. The threshold $\alpha_3$ determined by

\[ \alpha_3 = \frac{m^+ + m^-}{2} - \frac{\sigma'^2}{m^+ - m^-}\,\log\frac{p^+}{p^-} \]

is a good threshold to consider, since it corresponds to both the MAP and MWC classifications. In any case, the MAP and MWC classifications (with the rule $R_{\alpha_1}$, where, according to the sign of $\sigma^+ - \sigma^-$, $\alpha_1$ is the largest or the smallest of the roots of the second-degree polynomial $A(\alpha)$) give, in practice, insignificantly different percentages of correct classifications. In the considered examples, there is agreement between the graphical results obtained from the study of a population of 117 individuals (of whom 30 are sick) and the theoretical formulas under the hypothesis of normality of the variable on each of the two subpopulations considered; moreover, the percentages of correct classifications in the learning file and in the test file are close. It could happen, nevertheless, that in other practical situations the graphical and theoretical results will not agree; if the learning file is sufficiently large to contain, let us say, at least 30 sick and 30 well individuals, that could be due to too large a deviation of the conditional distributions of $S$ from normality; from our practical experience, this is often accompanied by a degradation of the percentages of correct classifications in the test file. The choice of a threshold according to the considered procedures must then be graphical and is therefore delicate; in practice, this often corresponds to a variable not known to be very pertinent for the diagnosis at hand.

FIG. 7. Histograms of $T_{60}$ on $\Omega^+$ (shaded) and $\Omega^-$.

FIG. 8. Histograms of $T_{120}$ on $\Omega^+$ (shaded) and $\Omega^-$.

One can show (see (9)) that the considered thresholds for the variables are also optimal for other strategies (which involve only graphical techniques). In this study, the practitioners knew, before studying the datafile, that the most pertinent variables were $T_{60}$ and $T_{120}$, and asked themselves whether one of these two would be better than the other. After choosing the thresholds $\alpha_3$ for $T_{60}$ and $T_{120}$, one can see no significant difference between the two variables as to the correct classifications ($\chi^2 = 0.162$; $p = 0.5$). Let us also remark that choosing a threshold often poses problems during the construction of an expert system for medical diagnosis, but that it is not recommended, when building an expert knowledge base, to choose a threshold for each continuous variable implicated in the envisaged diagnosis: it is advised to summarize the joint diagnostic information by means of the discriminant component $S$ obtained from a discriminant factorial analysis and then to choose a threshold for $S$ (see (1)). Lastly, let us remark that the formulas here have been demonstrated within the model with normal conditional distributions; other formulas can sometimes be elicited, with analogous computations, when the conditional density functions are known, even if they are not normal. For instance, if the variable $S$ has positive values and if we only know its conditional mean values, then using the PME leads to conditional distributions which are exponential:

\[ f^+(x) = \frac{1}{m^+}\exp(-x/m^+), \qquad f^-(x) = \frac{1}{m^-}\exp(-x/m^-) \qquad \text{for } x \ge 0. \]

Then the MAP and MWC strategies are equivalent and correspond to the threshold $\alpha$ given by the formula

\[ \alpha = \left(\frac{1}{m^-} - \frac{1}{m^+}\right)^{-1} \log\frac{m^+ p^-}{m^- p^+} \qquad (\text{with } m^+ > m^-). \]

FIG. 9. Histograms of $T_{180}$ on $\Omega^+$ (shaded) and $\Omega^-$.
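For the exponential case, the threshold is the point where the weighted densities $p^+ f^+$ and $p^- f^-$ cross, that is, $\alpha = (1/m^- - 1/m^+)^{-1} \log\left(m^+ p^- / (m^- p^+)\right)$ for $m^+ > m^-$. The sketch below computes it that way; this expression is our own derivation from the equality of the weighted densities, and the function name and example values are hypothetical.

```python
import math


def exponential_threshold(p_plus, m_plus, m_minus):
    """Crossing point of p+ f+ and p- f- for exponential densities (m+ > m- > 0)."""
    if not m_plus > m_minus > 0:
        raise ValueError("expected m_plus > m_minus > 0")
    p_minus = 1.0 - p_plus
    return (math.log(m_plus * p_minus / (m_minus * p_plus))
            / (1.0 / m_minus - 1.0 / m_plus))


def weighted_density(x, prior, mean):
    """prior * (1/mean) * exp(-x/mean), the weighted exponential density at x >= 0."""
    return prior * math.exp(-x / mean) / mean


alpha = exponential_threshold(0.5, 2.0, 1.0)
print(alpha, weighted_density(alpha, 0.5, 2.0), weighted_density(alpha, 0.5, 1.0))
```

When $p^+$ is large enough the computed value is negative; since $S \ge 0$, every individual is then classed as sick.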

Let us finally note that in this paper we only consider the case of two groups. In some situations, we have to consider more than two groups; the two strategies that we have considered here can be generalized, but they lead to more complicated decisions (which no longer involve a threshold, but rest on quadratic classification; see (7)). In some other situations, one has to consider, at the same time, two partitions of the population $\Omega$, and specific strategies have to be defined (see (4)).


REFERENCES

1. FRANCOIS, P., CREMILLEUX, B., AND ROBERT, C. Continuous variable processing in expert-system building. Submitted.
2. GLASZIOU, P., AND HILDEN, J. Test selection measures. Med. Decis. Making 9, 133 (1989).
3. MCNEIL, B. J., KEELER, E., AND ADELSTEIN, J. Primer on certain elements of medical decision making. N. Engl. J. Med. 293, 211 (1975).
4. NEASE, R. F., OWENS, J. R. D., AND SOX, H. C. Threshold analysis using diagnostic tests with multiple results, 1989.
5. O'SULLIVAN, J. B., AND MAHAN, C. Criteria for the oral glucose tolerance test in pregnancy. Diabetes 13, 278 (1964).
6. ROBERT, C. An entropy concentration theorem. J. Appl. Probab., to appear.
7. ROBERT, C. Modèles statistiques pour l'intelligence artificielle - Le principe du maximum d'entropie. Masson, Paris, to appear.
8. ROBERT, C. Analyse descriptive des données. Application à l'intelligence artificielle. Flammarion Médecine-Sciences, Paris, 1989.
9. VERMONT, J., BOSSON, J. L., ROBERT, C., FRANCOIS, P., RUEFF, A., AND DEMONGEOT, J. Curves for threshold determination. Submitted.