Computational Statistics & Data Analysis 1 (1983) 257-273 North-Holland


Assessment of Fisher and logistic linear and quadratic discrimination models *

C.K. BAYNE, J.J. BEAUCHAMP, V.E. KANE **
Union Carbide Corporation, Nuclear Division, Oak Ridge, TN 37830, USA

G.P. McCABE
Department of Statistics, Purdue University, West Lafayette, IN 47907, USA

Received July 1983
Revised September 1983

Abstract: This paper summarizes the results from a study comparing the performance of the Fisher and logistic linear and quadratic discriminant functions. Three types of bivariate distributions are studied. Each classification rule is compared to the optimal maximum likelihood procedure for the different data types. The theoretical misclassification probabilities of the sample discriminant functions are calculated directly and used for the comparison of the different procedures both in terms of bias and variation. Generalizations and recommendations are made to assist the applied statistician in making the correct choice of a discrimination procedure, and the results of this study are compared with earlier investigations. This study shows that specification of the form of the discriminant function may be one of the most important parts of a discriminant analysis.

Keywords: Linear and quadratic discrimination models, Logistic discrimination, Misclassification probability estimation.

1. Introduction

The purpose of this paper is to compare the performance of Fisher's linear (LDF) and quadratic (QDF) discriminant functions and the linear logistic (LLF) and quadratic logistic (QLF) discriminant functions for classifying observations from three types of bivariate distributions. The three distributional types are the bivariate normal distribution with equal and unequal covariance matrices, a

* Research sponsored by the Applied Mathematical Sciences Research Program, Office of Energy Research, US Department of Energy under contract W-7405-eng-26 with the Union Carbide Corporation. Free of rights for all U.S. Government purposes.
** Presently employed by Ford Motor Company, Detroit, Michigan.
0167-9473/83/$3.00 © 1983, Elsevier Science Publishers B.V. (North-Holland)


C.K. Bayne et al. / Assessment of discriminant models

mixture distribution of Bernoulli and normal random variables, and a bivariate Bernoulli distribution. In addition, the maximum likelihood discriminant function (MLF) discussed by Efron [8], corresponding to the LDF (QDF) for bivariate normal data with equal (unequal) covariance matrices, is given for the procedures involving discrete data. Each classification rule is compared to the MLF optimal procedure for the different data types. The theoretical misclassification probabilities of the sample discriminant functions are calculated directly. The performance of classification methods under nonoptimal conditions is an important consideration to the applied statistician, but little research on this question is available. Gilbert [11] compared the LDF and QDF for multivariate normal distributions with one covariance matrix a multiple (λ) of the other and all correlations equal. The results indicated that the misclassification errors for the two discriminant functions agreed only for a moderate range of λ values and the agreement became worse as the number of variables increased. Subrahmaniam and Chinganda [24] found the LDF robust against several skewed distributions, and Clarke, Lachenbruch, and Broffitt [5] used Johnson's transformation system to show that the QDF was robust to nonnormality except for highly skewed distributions. Broffitt, Clarke, and Lachenbruch [3] studied the effect of substituting robust estimates of the mean vector and covariance matrices to estimate the misclassification probabilities when the QDF is used with nonnormal distributions. Only slight gains were found for highly skewed distributions. Broffitt, Clarke, and Lachenbruch [4] also examined the robustness of the LDF under a random shift location contamination model, where each variate was allowed to be contaminated independently of the other variates, and found only minimal effects on the misclassification probabilities.
Lachenbruch [16] also found that initial misclassification can have serious effects on the error rate of the QDF. Wahl and Kronmal [25] found sample size to be a 'critical factor' in choosing between the LDF and QDF. Many additional results from the LDF and QDF robustness studies are summarized by Dillon [6], Krzanowski [14] and Lachenbruch [15]. The choice between Fisher's discriminant functions and the logistic discriminant functions depends on the criteria considered. Press and Wilson [23] give many pragmatic reasons for the LLF model, which arises from the exponential family of distributions. Efron [8] shows that the asymptotic relative efficiency of the LLF can be less than one-half in a comparison with the LDF for the exponential distribution family. McLachlan and Byth [18] claim the efficiency is much higher when considering other terms of the asymptotic expansion. O'Neill [21] has extended Efron's results for comparing the QDF with the QLF. This study compares the two procedures on the basis of misclassification probabilities for normal and nonnormal distributions.

2. Discrimination procedures

Consider the two-population bivariate discrimination problem where (XY1j, XY2j)' is observed from population ΠY with Y = 0 or 1 for j = 1, …, nY. The


probability density function (pdf) for ΠY is fY(x) and the prior probability is πY, where π0 + π1 = 1. Three different population distributions are considered. The bivariate normal distribution is represented by the canonical form X0 ~ N(0, I) and X1 ~ N((Δ, 0)', Σ1). The second distribution type is a normal-Bernoulli mixture, where XY1 given XY2 = x2 is distributed N(μY,x2, σY²) and XY2 has a Bernoulli distribution with expectation θY. The pdf for this distribution is given by

fY(x) = θY^x2 (1 − θY)^(1−x2) (1/σY) φ((x1 − μY,x2)/σY),

where φ(·) is the standard normal pdf. The final distribution type is a bivariate Bernoulli with P{XY1 = s, XY2 = t} = θYst for s, t = 0, 1, with pdf

fY(x) = θY00^((1−x1)(1−x2)) θY01^((1−x1)x2) θY10^(x1(1−x2)) θY11^(x1x2),

where Σs,t θYst = 1 for Y = 0, 1. These data types are considered because they include a variety of continuous and discrete data types and it is possible to compute theoretical misclassification error rates for any sample discriminant function. The traditional Bayes posterior classification rule results in assigning an arbitrary x to Π0 if Q(x) < 0 and to Π1 if Q(x) > 0, where

Q(x) = log[π1 f1(x) / (π0 f0(x))] = β0 + β1 x1 + β2 x2 + β3 x1 x2 + β4 x1² + β5 x2².  (1)

It should be noted that one or more of the β's will be equal to zero for some of the models. The discriminant function, Q(X), characterizes a discrimination procedure since the estimates of β = (β0, β1, …, β5)' in (1) determine the misclassification error rates. The focus of this study is the assessment of the misclassification probabilities of the sample discriminant functions. Let Q̂(X) denote (1) with an estimate, β̂, replacing the population β. The theoretical misclassification probabilities are

p0 = PX{Q̂(X) > 0 | Y = 0},
p1 = PX{Q̂(X) < 0 | Y = 1},
pT = π0 p0 + π1 p1.
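For a sample rule whose estimated discriminant is linear, these probabilities have a closed form under bivariate normal populations, since Q̂(X) is then univariate normal; the exact-computation procedures of [2] are more general. The following sketch illustrates only the linear-rule case (function and variable names are illustrative, not from [2]):

```python
import math

def norm_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def linear_misclass_probs(beta0, beta, mu0, S0, mu1, S1, pi0=0.5):
    """Exact error rates of a linear rule Q(x) = beta0 + beta'x that assigns
    x to population 1 when Q(x) > 0.  If X is bivariate normal, Q(X) is
    univariate normal, so each error rate is a single normal tail area."""
    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))
    def qform(b, S):
        return sum(b[i] * S[i][j] * b[j]
                   for i in range(len(b)) for j in range(len(b)))
    m0, s0 = beta0 + dot(beta, mu0), math.sqrt(qform(beta, S0))
    m1, s1 = beta0 + dot(beta, mu1), math.sqrt(qform(beta, S1))
    p0 = 1.0 - norm_cdf(-m0 / s0)   # p0 = P{Q(X) > 0 | Y = 0}
    p1 = norm_cdf(-m1 / s1)         # p1 = P{Q(X) < 0 | Y = 1}
    return p0, p1, pi0 * p0 + (1.0 - pi0) * p1

# Canonical equal-covariance case: X0 ~ N(0, I), X1 ~ N((delta, 0)', I).
# The optimal rule Q(x) = delta*x1 - delta**2/2 gives pT = Phi(-delta/2).
delta = 3.29
I2 = [[1.0, 0.0], [0.0, 1.0]]
p0, p1, pT = linear_misclass_probs(-delta**2 / 2, [delta, 0.0],
                                   [0.0, 0.0], I2, [delta, 0.0], I2)
```

Evaluated at the optimal equal-covariance rule with Δ = 3.29, this gives pT ≈ 0.05, matching the first entry of Table 1.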

An important feature of this study is that the exact rather than simulated expected misclassification probabilities are calculated for the sample discriminant functions using procedures given in [2]. Five procedures for estimating β are considered for each population distribution. The five procedures are LDF, QDF, LLF, QLF, and MLF. The MLF assumes a distributional form for the density functions, with the allocation rule derived from the likelihood ratio. The MLF is equivalent to the LDF (λ = 1) and QDF (λ ≠ 1) for the bivariate normal distribution. In addition, the MLF and QLF procedures are equivalent for the bivariate Bernoulli data in the case of nonzero cell frequencies. However, the MLF procedure is not equivalent to any of the other functions for the mixture distribution of Bernoulli and normal random variables and is considered separately for this case. Except for the MLF, estimation of β was performed in a standard manner, disregarding the continuous and


discrete nature of the data. The LDF and QDF are standard procedures for normal data (Lachenbruch [15]) and the LLF and QLF are theoretically appropriate for the exponential family of distributions (Efron [8]). For the LDF and LLF, the coefficients β3 through β5 were set to zero, and β0 through β2 were estimated by standard methods. For the QLF, β5 was set to zero for the normal-Bernoulli distribution, and β4, β5 were set to zero for the bivariate Bernoulli distribution. Otherwise standard estimators were computed. The logistic estimates required an iterative procedure described in the next section.
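As one concrete instance of these standard methods, the plug-in LDF estimates of β0 through β2 use the sample means and pooled covariance matrix. A minimal sketch under the usual normal-theory formulas (the function name and data below are illustrative, not from the paper):

```python
import math

def ldf_coefficients(x0, x1, pi0=0.5):
    """Plug-in LDF estimate of (beta0, beta1, beta2) in (1) for bivariate
    data.  x0, x1 are lists of (x1, x2) pairs from populations 0 and 1.
    The slope vector is S_pooled^{-1} (xbar1 - xbar0) and the intercept
    centres the rule at the midpoint of the two sample means."""
    def mean(xs):
        n = len(xs)
        return [sum(r[0] for r in xs) / n, sum(r[1] for r in xs) / n]
    def csscp(xs, m):  # corrected sums of squares and cross products
        return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in xs)
                 for j in (0, 1)] for i in (0, 1)]
    m0, m1 = mean(x0), mean(x1)
    n0, n1 = len(x0), len(x1)
    A0, A1 = csscp(x0, m0), csscp(x1, m1)
    S = [[(A0[i][j] + A1[i][j]) / (n0 + n1 - 2) for j in (0, 1)]
         for i in (0, 1)]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    Sinv = [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    beta = [Sinv[0][0] * d[0] + Sinv[0][1] * d[1],
            Sinv[1][0] * d[0] + Sinv[1][1] * d[1]]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    beta0 = math.log((1 - pi0) / pi0) - (beta[0] * mid[0] + beta[1] * mid[1])
    return beta0, beta[0], beta[1]
```

With equal priors the resulting rule Q̂(x) = β̂0 + β̂1 x1 + β̂2 x2 places the boundary halfway between the two sample means.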

3. Sampling experiment

The basic steps in the simulation for each type of distribution were: (1) the generation of a training set from the two populations, (2) the estimation of the parameters in each of the sample discrimination functions by the appropriate procedure, and (3) the calculation of the expected misclassification probabilities for each of the sample discriminant functions using procedures described in [2]. The results of Anderson [1] allow the specification of the sample sizes (n0, n1) and the proportions (π0, π1) in which the populations Π0 and Π1 appear. In this study the sample sizes are taken as n0 = n1 = n with n = 25, 50 and 100 and the proportions π0 = π1 = 0.5. From Efron [8] this case should be most favorable to the logistic regression procedure. The initial estimates for the LLF (QLF) are the estimates from the LDF (QDF) procedure, since the logistic estimation procedures require starting values. The optimization routine E04KBF from the Numerical Algorithms Group [20] computer procedures was used to obtain the final estimates of the β-coefficients for the LLF and QLF procedures. In addition to the usual tests for convergence during the iterative process, if the estimated β-coefficients resulted in perfect classification of the sample, then the estimation procedure was terminated and the final estimates used to calculate the expected misclassification probabilities. The bivariate normal random variables were generated using the K-R algorithm (see Kinderman and Ramage [13]) and the uniform random number generator URAND (Forsythe, Malcolm and Moler [10]). The Bernoulli random variables were also generated using URAND. Parameterizations were selected so that a plausible range of theoretical misclassification probabilities PT would be achieved. The values of PT ranged from approximately 0.05 to 0.45. Exact values of the misclassification probabilities to 8 places were computed for each simulation case.
The parameterizations for the bivariate normal distribution are given in Table 1. For these parameterizations the correlation between X1 and X2 is zero for both populations. The parameterizations for the normal-Bernoulli and bivariate Bernoulli distributions were more complex than those for the bivariate normal distribution. Two restrictions were assumed for the normal-Bernoulli data. The individual population misclassification probabilities were equal and the correlation between the normal and Bernoulli random variables was fixed at 0.5 in each of the populations. The parameterizations are given in Table 2.
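The iterative logistic estimation can be sketched as Newton-Raphson on the logistic log likelihood. The code below is an illustrative stand-in written for this summary, not the NAG routine E04KBF used in the study; in the study the starting value would be the LDF or QDF estimate:

```python
import math

def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c]
                              for c in range(k + 1, n))) / M[k][k]
    return x

def logistic_fit(X, y, beta=None, iters=50, tol=1e-10):
    """Maximum-likelihood logistic fit by Newton-Raphson.  X holds feature
    rows (a leading 1 plus whichever terms of (1) are in the model) and y
    the 0/1 population labels.  beta is the starting value; it defaults
    to zero here."""
    p = len(X[0])
    beta = list(beta) if beta is not None else [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for row, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, row))
            mu = 1.0 / (1.0 + math.exp(-eta))   # fitted P(Y = 1 | x)
            w = mu * (1.0 - mu)
            for i in range(p):
                grad[i] += (yi - mu) * row[i]
                for j in range(p):
                    hess[i][j] += w * row[i] * row[j]
        step = solve(hess, grad)                # Newton step: H step = grad
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta
```

Note that when the training set is perfectly separable the iterations diverge (the likelihood has no finite maximizer), which is why the study terminates the procedure when perfect classification occurs.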


Table 1
Bivariate normal parameterizations a

λ      Misclassification probability, PT
       0.05    0.10    0.15    0.20    0.25    0.35    0.45
1      3.29    b       2.07    b       1.35    0.77    0.25
2      3.91    3.01    2.40    1.90    1.46    0.56    c
3      4.31    3.26    2.52    1.90    1.29    c       c
4      4.61    3.42    2.55    1.76    0.77    c       c

a Table entries are Δ values. b No simulations run. c No Δ exists for specified λ and PT.

Two different types of parameterizations for the bivariate Bernoulli data were chosen so that individual misclassification probabilities were equal. In addition, the parameterizations were chosen so that the optimal discriminant function would not be linear in X1 and X2. In order to cover the range of misclassification probabilities used with the other data types and to have reasonable choices for the θ parameters, the correlation within each population between the observed Bernoulli variables differed only in sign between the populations. In the first parameterization (A), all the marginal probabilities were 0.5. For this case, it can be shown that the difference between the theoretical probabilities of misclassification for the linear and MLF procedures is maximized. The second parameterization (B) considered different configurations of the marginal probabilities as well as different partitionings of the misclassification probability components for a fixed value of PT = 0.15. Both parameterizations are given in Table 3. If the maximum likelihood estimate of θ for the bivariate Bernoulli data is zero, then some of the β̂'s become undefined. The method used in the simulation study to circumvent this problem was to replace any zero cell frequency by 0.5. The expected misclassification probabilities for a sample discriminant function were denoted by p̂0i, p̂1i and p̂Ti, i = 1, 2, …, m, for m = 50000/n simulations. For each parameterization and discrimination procedure, the mean, median, variance, and mean square errors (MSE) using the p̂0i, p̂1i and p̂Ti values were computed. Since some of the training sets did not result in estimates of the β-coefficients that converged for the LLF and QLF procedures, the summarized data sets included only those sets where the LLF and QLF procedures both

Table 2
Normal-Bernoulli parameterizations a

Mean   Misclassification probability, PT
       0.05    0.15    0.25    0.35    0.45
μ10    5.88    3.42    2.14    1.19    0.38
μ11    8.10    5.64    4.36    3.41    2.60

a θ0 = θ1 = 0.7; σ0 = 1, σ1 = 2; μ00 = 0, μ01 = 2.22.


Table 3
Bivariate Bernoulli parameterizations a

Population Π0 (Y = 0)

x1      x2 = 0                    x2 = 1                    Total
0       0.5 + γ − pT/(1+τ)        τpT/(1+τ)                 0.5 + γ − pT(1−τ)/(1+τ)
1       pT/(1+τ)                  0.5 − γ − τpT/(1+τ)       0.5 − γ + pT(1−τ)/(1+τ)
Total   0.5 + γ                   0.5 − γ

Population Π1 (Y = 1)

x1      x2 = 0                    x2 = 1                    Total
0       τpT/(1+τ)                 0.5 + γ − pT/(1+τ)        0.5 + γ − pT(1−τ)/(1+τ)
1       0.5 − γ − τpT/(1+τ)       pT/(1+τ)                  0.5 − γ + pT(1−τ)/(1+τ)
Total   0.5 − γ                   0.5 + γ

Table entries are θYst's. a Parameterization A: PT = 0.05 (0.10) 0.45; τ = 1; γ = 0. Parameterization B: PT = 0.15; τ = 1/3, 1, 2; γ = 0.05 (0.10) 0.25.

converged or perfect separation of the sample training set was achieved for the LLF or QLF procedures.
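For the bivariate Bernoulli case, the MLF (equivalently the QLF when all cells are nonzero) amounts to plugging the cell-frequency estimates of θYst into the log likelihood ratio. The sketch below includes the 0.5 zero-cell replacement described above; the function name and counts are illustrative, not from the paper:

```python
import math

def bernoulli_mlf_betas(counts0, counts1, pi0=0.5):
    """MLF coefficients for bivariate Bernoulli data.  counts[s][t] is the
    number of training observations with (x1, x2) = (s, t) from the given
    population.  Zero cells are replaced by 0.5, as in the study, so every
    log odds is defined.  Returns (beta0, beta1, beta2, beta3) of the log
    likelihood ratio expanded in 1, x1, x2, x1*x2 (for 0/1 data the pure
    quadratic terms of (1) vanish since x**2 = x)."""
    def cellprobs(counts):
        c = [[counts[s][t] if counts[s][t] > 0 else 0.5 for t in (0, 1)]
             for s in (0, 1)]
        n = sum(sum(row) for row in c)
        return [[c[s][t] / n for t in (0, 1)] for s in (0, 1)]
    t0, t1 = cellprobs(counts0), cellprobs(counts1)
    # Log ratio at each of the four sample points.
    r = [[math.log((1 - pi0) * t1[s][t] / (pi0 * t0[s][t])) for t in (0, 1)]
         for s in (0, 1)]
    # beta0 + beta1*x1 + beta2*x2 + beta3*x1*x2 interpolates r exactly.
    beta0 = r[0][0]
    beta1 = r[1][0] - r[0][0]
    beta2 = r[0][1] - r[0][0]
    beta3 = r[1][1] - r[1][0] - r[0][1] + r[0][0]
    return beta0, beta1, beta2, beta3
```

Because the sample space has only four points, the four coefficients reproduce the log likelihood ratio exactly at every observable x, which is why the MLF and QLF coincide here.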

4. Results

The results of the Monte Carlo study for the different data types are summarized in this section. The MLF is considered optimal and two relative measures are considered for each data type. The ratio of the MSEs and the bias ratio discussed by Efron [8] are both useful comparisons between the MLF and a test procedure, where

MSE Ratio = [MSE(p̂T(MLF))]/[MSE(p̂T(test))],
Bias Ratio = [MEAN(p̂T(MLF)) − PT]/[MEAN(p̂T(test)) − PT].

The range of the MSE Ratio is 0 ≤ MSE Ratio ≤ 1, with MSE Ratio = 1 being the optimal value. Note that since p̂T ≥ PT for any simulation, the Bias Ratio is a relative measure (≥ 0) of how close a sample discriminant function is to the optimal discriminant function. Two relative measures of variation and location for any procedure are

Relative Root MSE = [MSE(p̂T)]^(1/2)/PT and Relative Bias = (MEAN(p̂T) − PT)/PT.

These are important to consider since the previous ratios could indicate large discrepancies between methods when the absolute difference is, in fact, small.

4.1. Bivariate normal with equal covariance matrices

The optimal MLF for this data type corresponds to the LDF. The theoretical asymptotic value of the MSE ratio for the LLF procedure is derived in [2] using


results from Efron [8]. Figure 1(a) is a plot of this MSE ratio as a function of PT for the different procedures, as well as the asymptotic theoretical function for the comparison of the LDF and LLF procedures. The MSE ratio for the LLF procedure monotonically increases as PT increases, indicating that the LLF procedure should give comparable variability when the populations are not well separated. Even though the MSE ratio decreases rapidly for small values of PT, the practical importance of this decrease for the LLF procedure may be minimal since the populations are well separated. The behavior of the MSE ratio for the quadratic procedures is nonmonotonic and indicates a substantial increase in variability of the estimated misclassification probability (decrease in the MSE ratio) for cases when the populations are not well separated, e.g., PT = 0.35. Figure 1(b) is a plot of the bias ratio as a function of PT for the different

Fig. 1. MSE and bias ratio vs PT for bivariate normal data with equal covariance matrices (n = 100).


Fig. 2. Relative root MSE and bias vs PT for bivariate normal data with equal covariance matrices (n = 25).

procedures, as well as the asymptotic theoretical function given in Efron [8] for the comparison of the LDF and LLF procedures. The theoretical functions and the sample ratios from the simulations show good agreement. The behavior of this bias ratio as a function of PT is similar to the behavior of the MSE ratio. Therefore, the same conclusions follow with respect to bias considerations. Figure 2(a) is a plot of the relative root MSE for the different procedures when n = 25. The figure indicates the relative root MSE can be quite large for the nonoptimal procedures. This is especially true for small values of PT. Although not shown in this figure, the relative root MSE decreases considerably as the sample size increases. In particular, except for the QLF procedure with PT = 0.05, all relative root MSEs are less than 15% for n = 50 and decrease to less than 6% for n = 100. The practical importance of the large relative root MSE for the QLF procedure when PT is small may be minimal since the populations are well separated. The relative bias of the overall misclassification probability is shown in Figure 2(b) for the different procedures as a function of PT when n = 25. The first conclusion from an examination of this figure is the similarity in behavior of the relative bias and relative root MSE as a function of PT. The relative bias for the logistic procedures can be considerable (about 50% for QLF) when PT is small. The behavior of the relative bias was similar to that of the relative root MSEs for the larger sample sizes, with maximum values of about 50%, 20% and 8% for n = 25, 50, 100, respectively.


4.2. Bivariate normal with unequal covariance matrices

The MLF corresponds to the QDF for this case. The MSE ratios indicated that all three methods performed poorly when compared to the QDF, except for the LDF when the sample size and λ were small with the populations moderately separated. Also, the LDF and the LLF performed better than the QLF for small sample sizes when PT < 0.2. This is probably due to the balance between estimating fewer parameters with a small sample size versus using the correct form of the discrimination model. The misclassification probabilities p0 and p1 must be examined separately to properly evaluate the bivariate normal distribution with unequal covariance matrices. In general, p0 is larger than p1 in the cases examined. Therefore, any application of the selected discrimination method must consider which of the two misclassification probabilities should be given the greater weight. Because the

Fig. 3. Average MSE vs λ for bivariate normal data with covariance matrices of I and λI.


optimal rule is developed to minimize the total error rate, a minimax rule or an unequal cost function [15] may be more desirable to control the individual misclassification probabilities. The MSE, which is the sum of squared bias and variance, versus p0 (p1) was used to evaluate the four classification methods. These plots are summarized by averaging the MSEs for each method over the range of p0 or p1 for each combination of sample size and λ. This summarization is displayed for sample sizes 25 and 100 in Figure 3 as the average MSE versus λ. The summarization of the average MSE versus λ for p0 shows a very distinct separation of the linear and quadratic methods once the sample size is greater than 25. The QDF and QLF average MSEs are fairly constant for λ ≥ 2, but the LDF and LLF MSEs are somewhat irregular as they range from the minimum at λ = 1 to the maximum at λ = 4. The average MSE summary plots for p1 indicate much smaller MSE values than for p0 for all four methods. The average MSEs for the quadratic methods are fairly constant for λ ≥ 2. The average MSEs for the LDF monotonically increase with λ, while the average MSEs for the LLF first decrease then increase with λ. Separation of the linear and quadratic methods depends both on sample size and λ. The LLF average MSE is generally smaller than the QLF for λ < 4, except for the larger n = 100 when λ = 2. The performance of the QDF and the QLF depends primarily on the variance term, which is the main contribution to the MSE. For the LDF and LLF, performance depends on both the variance term and the bias term. The reason for this behavior is that the bias for the two linear functions changes monotonically from negative or near-zero values to higher values as p0 (p1) changes. When the bias is near zero, which usually occurs at small p0 and p1 values with the exact value depending on λ, the variance term dominates. Otherwise, the bias term is the main contributor to the MSE.
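The squared-bias-plus-variance decomposition used here can be computed directly from the simulated p̂T values; the data in the example are hypothetical:

```python
def mse_decomposition(estimates, true_value):
    """Decompose the MSE of a misclassification-probability estimator
    into squared bias plus variance, the comparison used throughout
    this section."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - true_value
    variance = sum((e - mean) ** 2 for e in estimates) / n
    return bias ** 2 + variance, bias, variance

p_hat = [0.10, 0.20, 0.30]   # hypothetical p^T values over training sets
mse, bias, var = mse_decomposition(p_hat, 0.15)
```

The returned MSE equals the average squared deviation of the estimates from the true value, split into the two components compared for the linear and quadratic methods.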

4.3. Normal-Bernoulli

Since the β5 term does not enter the model and β4 = 0 in (1), the optimal rule is quadratic only through the term involving β3. Thus the linear Q(X)'s that have β0 through β2 in the model have too few parameters, and the quadratic Q(X)'s that have β0 through β4 in the model have too many parameters. Only the MLF with β0 through β3 has the correct number of parameters in Q(X). The MSE ratios in Figure 4(a) indicate appreciably increased variability over the MLF for PT values between 0.15 and 0.35, where all ratios are below 0.5. The LDF and LLF curves are similar, except at PT = 0.05 where the LLF again performs poorly. At other PT values the LLF seems slightly superior to the LDF procedure. There is not a uniform dominance between the linear and quadratic rules. However, the QLF does perform relatively well at PT = 0.45 and dominates the QDF except at PT = 0.05. The bias ratio for all methods exhibits an identical pattern with values shifted slightly higher. Smaller sample sizes make the MLF procedure somewhat less desirable, but not over the entire range of PT. Figure

Fig. 4. MSE ratio vs PT for normal-Bernoulli data (n = 25, 100).

4(b) indicates that the relative performance of all methods is similar for PT greater than 0.25. The relative root MSE in Figure 5(a) indicates that appreciable variation is possible for n = 25 and that the variation depends on PT. Also, the logistic rules perform poorly for PT = 0.05. For the intermediate PT values of 0.15 and 0.25, the MLF offers a marked improvement over the other methods. The relative bias shown in Figure 5(b) for the logistic rules can also be 5% to 10%, which may not be important in applied problems. Only the QDF procedure appears appreciably worse than the other methods for most values of PT using the relative bias and MSE measures. For n = 100, the relative bias is less than 4% for all methods except the QDF (< 6%). The practical implication of the above comparisons is that in the important middle range (0.15 ≤

Fig. 5. Relative root MSE and bias vs PT for normal-Bernoulli data (n = 25).

4.4. Bivariate Bernoulli

The optimal MLF procedure can be shown to be equivalent to the QLF procedure for nonzero cell frequencies. No appreciable differences were observed between the results for the two sample sizes considered, so only the results for the case n = 100 will be discussed. Because the sample space contains only four points, a wide range of estimated β-values in Q(X) result in identical estimates of the misclassification probabilities. Therefore, for many of the parameterizations considered, the variation and bias in the estimated misclassification probability were zero. In those cases where the numerator and denominator of the MSE and/or bias ratio were both zero, the ratio was defined as one. The practical implication of the results from the MSE ratio is that the variation in the estimated values of pT is minimal for the MLF and QDF procedures, and a comparison of the linear and quadratic procedures is difficult to evaluate from this ratio. Except for a few instances of minimal practical importance, the

Fig. 6. Relative bias vs PT for bivariate Bernoulli data (parameterization A, n = 100).

behavior of the bias ratio was similar to the behavior of the MSE ratio. Therefore, the same conclusion follows with respect to bias considerations. The magnitude and behavior, as a function of PT, γ and τ, of the relative bias and relative root mean square error were similar. Therefore, only the relative bias will be considered. Figure 6 is a plot of the relative bias for parameterization A. The difference between the MLF and QDF was indistinguishable on the figure, so only the MLF symbol was used. In parameterization A the relative bias for the QDF and MLF was less than about 5% (5.6% at PT = 0.45) for the values of PT considered. However, the relative bias for the linear procedures was as much as 800% for small values of PT and decreased to between 10-15% as PT increased to 0.45. The relative bias as a function of γ and τ (parameterization B) was minimal (< 2%) for the QDF and MLF procedures. In parameterization B for the linear procedures, the relative bias did not change significantly over the range of τ considered. However, the relative bias did decrease linearly as a function of γ, from a high of around 200% when γ was 0.05 to around 60% when γ was 0.25.


5. Discussion and generalizations

The results presented in the previous section will be discussed in relation to four generalizations that are derived from previous studies or are suggested by this study. For this study, it is obvious that the p = 2 case is the simplest case of a data mixture. It is unclear how the conclusions relate to higher dimensions, but the comparison of the discrimination procedures is made on a relative basis. Poor performance by a procedure for p = 2 could surely be amplified in higher dimensions, and good performance when p = 2 may not be retained. The generalizations are as follows:
(1) The maximum likelihood discrimination function is preferable for the types of data considered over all sample sizes, but it requires distributional modeling.
(2) The standard LDF is distributionally robust for discriminant functions linear in the x's and is least sensitive to small sample sizes.
(3) The QDF distributional robustness can depend upon the type of nonnormality present, and its performance was found to be sensitive to sample size. In addition, the QDF may perform poorly if the LDF is the appropriate discrimination model.
(4) The logistic procedures, theoretically, are distributionally robust for the exponential family provided the discrimination model is correct (Efron [8]). However, the logistic procedures perform poorly for small misclassification probabilities (say ≤ 0.15) and small sample sizes (say ≤ 25).
The remainder of this section gives the supporting evidence for each of the above generalizations derived from this study and previous studies. The optimal performance of the MLF for all data types was not surprising, since the correct form of Q(X) was used and the β's in Q(X) were estimated by maximum likelihood using the correct underlying pdf's. The MLF provided a benchmark for comparing the other discrimination procedures.
O'Neill [21] extended Efron's [8] work to the general exponential family and concluded the MLF 'should be used whenever possible'. However, these authors did not consider the additional cases examined in this study where the discrimination functions have been incorrectly specified by the inclusion or exclusion of particular terms. Two observations were somewhat unexpected. First, the magnitude of the differences in MSE values was quite large even for n = 100. It was not unusual for the MSE of the MLF to be 50 to 75% lower than that of the other methods. Second, the performance varied appreciably with changing PT, with PT = 0.45 showing the least difference between methods. These results may suggest the desirability of modeling the data with a pdf and using the MLF approach. However, this is difficult for high-dimensional data, and the procedure may not be robust against incorrect specification of the pdf. The results of this study have extended and confirmed the work of earlier authors in demonstrating that the LDF can perform quite well or arbitrarily badly in some nonnormal situations. Therefore, as the results from the normal-Bernoulli and bivariate Bernoulli data show, care must be exercised in applying the LDF. In fact, in the normal-Bernoulli data, the covariance matrices were equal in both

populations, but the performance of the LDF varied greatly (Figures 4 and 5). This lack of consistency in the performance of the LDF procedure is caused by its failure to account for all the necessary terms in the ratio of the pdf's; i.e., the optimal discriminant function includes a cross-product term.

The original derivation of the LDF by Fisher [9] was based on a separability principle and did not require normality of the distribution of the observed random variables. The LDF was derived by maximizing the between-population variance relative to the within-population variance. Therefore, one might expect the LDF to perform well, i.e., to be distributionally robust, in cases where the covariance structure is the same in both populations, even for nonnormal variables. Gilbert [11] recommended the use of the LDF when the observed variables are binary. However, the parameterizations used in her study did not have the property of 'reversal' of the likelihood function present in this study, and only linear discrimination rules were considered. The presence or absence of 'reversals' in the log likelihood was also used by Moore [19] to determine when the LDF might perform well. Dillon and Goldstein [7] likewise found that as the correlation between the Bernoulli variables increased, the discrepancy between the LDF and MLF increased. Their work included the LDF as well as classes of classification rules based on different model representations of the cell probabilities, but considered only discrete random variables. Krzanowski [14] gives a good summary of conditions under which the LDF can yield poor results in nonoptimal situations, but did not include the logistic models considered in this study.

The performance of the QDF for the three distributions considered gave results similar to those reported by Marks and Dunn [17] for the normal distribution case.
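Fisher's separability derivation, and the parameter-count burden that separates the quadratic from the linear rules, can both be sketched in a few lines. This is a minimal illustration under our own conventions (midpoint cutoff, equal priors); the function names are hypothetical.

```python
import numpy as np

def fisher_ldf(X1, X2):
    """Fisher's LDF: w = Sw^{-1}(xbar1 - xbar2), cutoff at the midpoint.

    The derivation maximizes between-population variance relative to
    within-population variance; normality is not required.
    """
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    Sw = ((n1 - 1) * np.cov(X1, rowvar=False)          # pooled within-
          + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(Sw, xbar1 - xbar2)
    c = w @ (xbar1 + xbar2) / 2.0
    return w, c                                        # classify 1 iff x @ w > c

def n_params(p, rule):
    """Rough count of free parameters each rule estimates for p variables."""
    cov = p * (p + 1) // 2                  # entries of one covariance matrix
    return {"LDF": 2 * p + cov,             # two means + pooled covariance
            "QDF": 2 * p + 2 * cov,         # two means + two covariances
            "LLF": 1 + p,                   # intercept + linear terms
            "QLF": 1 + p + cov}[rule]       # intercept + linear + quadratics

print([n_params(2, r) for r in ("LDF", "QDF", "LLF", "QLF")])  # [7, 10, 3, 6]
```

For p = 2 the QDF estimates 10 parameters against 7 for the LDF and only 3 for the LLF, which is one way to see why small samples penalize the quadratic rules discussed below.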
In general, the QDF did not perform as well as the linear functions for small sample sizes and for populations with similar covariance matrices. Sample size considerations are important because many more parameters must be estimated for the QDF than for the linear discriminant functions. Even for large sample sizes in the normal-Bernoulli case, which had equal covariance matrices but required a cross-product term for the MLF, the performance of the QDF was unsatisfactory. Moore [19] used the mean actual error and the correlation between the observed and true log likelihood ratios to conclude that the QDF rarely performs as well as the LDF for binary data. However, this study has shown for the bivariate Bernoulli case that the QDF and MLF produce similar results.

The appeal of the logistic procedure is that the linear logistic discriminant function arises from a general exponential family of pdf's (Efron [8]) and is thus theoretically robust with respect to various underlying distributions. However, the price paid can be quite high, for the following reasons. First, the robustness of the logistic discriminant function can be lost for small sample sizes (Figures 2, 3, and 5). Second, even though the logistic procedure may be distributionally robust for large sample sizes, specification of the appropriate discriminant function was generally found to be of greater importance than the estimation method chosen (Figures 1, 3, and 6). Finally, the logistic procedure does not seem appropriate for small misclassification probabilities (Figure 1). Efron [8] examined the performance of the LLF procedure relative to the LDF for

the case where the observed variables are normally distributed with equal covariance matrices, and found, for small values of PT, a loss in efficiency of the logistic procedure. A theoretical explanation for the poor performance of the LLF for small values of PT in the bivariate normal case with equal covariance matrices is given in [2]. The explanation shows that, for small misclassification probabilities, only a few 'outlier' observations occurring between two separated populations can have an appreciable influence on the estimated β's of the LLF. This influence is shown by an evaluation of some of the logistic regression diagnostics given in Pregibon [22].

6. Conclusions

The decision concerning which discrimination procedure to use should consider the following points: (1) the sample size available to estimate the necessary parameters, (2) the form of the discriminant function, (3) the discrepancy between the covariance matrices, and (4) the population separation. The sample size should be considered first, since a small sample size may make it difficult to evaluate items (2)-(4). For normally distributed data and small sample sizes (n ≤ 25), the LDF or LLF procedures would generally be recommended, to reduce the number of parameters to be estimated, unless the discriminant function is dominated by quadratic terms. As the discrepancy between the covariance matrices increases, or as the population separation decreases, the use of a quadratic procedure and an examination of the individual misclassification probabilities become important. Therefore, if a sufficient sample size is available, the covariance matrices and the separation of the populations should be examined. When the covariance matrices do not differ significantly and the populations do not appear to be well separated, the LDF or LLF procedures would be recommended. If the covariance matrices do appear to differ significantly, the QDF would generally be recommended. Otherwise, some transformation to achieve normality could be considered and the covariance matrices then re-examined. Even if the transformation does not achieve normality, removal of skewness from the distribution might improve the performance of the QDF.

For discrete or mixed continuous and discrete data, this study indicates, at least for the bivariate case, that the logistic procedures may be appropriate provided the sample size is sufficiently large (say n ≥ 50), the misclassification probabilities are not too small (say PT ≥ 0.15), and the form of the discriminant function (e.g. linear or quadratic) is correct.
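The decision points above can be collected into a rough lookup. This is a paraphrase, not part of the original paper: the function name, inputs, and thresholds simply restate the recommendations in this section, and the analyst must still supply the judgments (normality, covariance discrepancy) as inputs.

```python
def recommend(n, cov_differ, quadratic_dominates=False):
    """Rough restatement of the paper's advice for normally distributed data."""
    if n <= 25:                      # small samples: estimate few parameters
        return "QDF" if quadratic_dominates else "LDF or LLF"
    if cov_differ:                   # clearly unequal covariance matrices
        return "QDF"
    return "LDF or LLF"              # similar covariances: linear rules

print(recommend(n=20, cov_differ=True))   # → LDF or LLF
print(recommend(n=100, cov_differ=True))  # → QDF
```

A transformation toward normality, as the text notes, may change the `cov_differ` judgment and hence the recommendation.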
Fisher's discriminant function seems appropriate for small sample sizes or small misclassification probabilities. It is not surprising that a discrimination procedure can be made to perform arbitrarily poorly if the form of the discriminant function is misspecified. This lack of discrimination model robustness indicates that more attention should be given to the model specification process.
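The small-PT failure of the logistic fit noted above can be illustrated with a toy one-dimensional example (gradient ascent on the log-likelihood; the learning rate, iteration count, and data are hypothetical). For well-separated classes the fitted slope keeps growing, driven entirely by the few observations near the gap between the populations, whereas an overlapping sample yields a stable, moderate slope.

```python
import numpy as np

def logistic_fit(x, y, lr=0.1, iters=5000):
    """Fit log-odds = b0 + b1*x by gradient ascent on the log-likelihood.

    A bare-bones sketch with no convergence check, for illustration only.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
        b0 += lr * np.mean(y - p)          # gradient in the intercept
        b1 += lr * np.mean((y - p) * x)    # gradient in the slope
    return b0, b1

x = np.array([-2.0, -1.0, 1.0, 2.0])
y_sep = np.array([0, 0, 1, 1])             # perfectly separated populations
y_mix = np.array([0, 1, 0, 1])             # heavily overlapping populations

_, b1_sep = logistic_fit(x, y_sep)
_, b1_mix = logistic_fit(x, y_mix)
print(b1_sep > b1_mix)  # → True: separation inflates the estimated slope
```

With separated classes the likelihood has no finite maximizer, so the estimated β's are determined by whatever stops the iteration, which is one way to see why a few boundary observations dominate the LLF when PT is small.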


References

[1] J.A. Anderson, Separate sample logistic discrimination, Biometrika 59 (1972) 19-35.
[2] C.K. Bayne, J.J. Beauchamp, V.E. Kane and G.P. McCabe, Assessment of Fisher and logistic linear and quadratic discrimination models, ORNL/CS/TM-200 Report, Oak Ridge National Laboratory, Oak Ridge, TN 37830 (1983).
[3] B. Broffitt, W.R. Clarke and P.A. Lachenbruch, The effect of Huberizing and trimming on the quadratic discriminant function, Communications in Statistics - Theory and Methods A9(1) (1980) 13-25.
[4] B. Broffitt, W.R. Clarke and P.A. Lachenbruch, Measurement errors - A location contamination problem in discriminant analysis, Communications in Statistics - Simulation and Computation B10(2) (1981) 129-141.
[5] W.R. Clarke, P.A. Lachenbruch and B. Broffitt, How nonnormality affects the quadratic discriminant function, Communications in Statistics - Theory and Methods A8(13) (1979) 1285-1301.
[6] W.R. Dillon, The performance of the linear discriminant function in nonoptimal situations and the estimation of classification error rates: A review of recent findings, Journal of Marketing Research 16 (1979) 370-381.
[7] W.R. Dillon and M. Goldstein, On the performance of some multinomial classification rules, Journal of the American Statistical Association 73 (1978) 305-313.
[8] B. Efron, The efficiency of logistic regression compared to normal discriminant analysis, Journal of the American Statistical Association 70 (1975) 892-898.
[9] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179-188.
[10] G.E. Forsythe, M.A. Malcolm and C.B. Moler, Computer Methods for Mathematical Computations (Prentice-Hall, Englewood Cliffs, 1977).
[11] E.S. Gilbert, On discrimination using qualitative variables, Journal of the American Statistical Association 63 (1968) 1399-1412.
[12] E.S. Gilbert, The effect of unequal variance-covariance matrices on Fisher's LDF, Biometrics 25 (1969) 505-516.
[13] A.J. Kinderman and J.G. Ramage, Computer generation of normal random variables, Journal of the American Statistical Association 71 (1976) 893-896.
[14] W.J. Krzanowski, The performance of Fisher's linear discriminant function under non-optimal conditions, Technometrics 19 (1977) 191-200.
[15] P.A. Lachenbruch, Discriminant Analysis (Hafner Press, New York, 1975).
[16] P.A. Lachenbruch, Note on initial misclassification effects on the quadratic discriminant function, Technometrics 21 (1979) 129-132.
[17] S. Marks and O.J. Dunn, Discriminant functions when covariance matrices are unequal, Journal of the American Statistical Association 69 (1974) 555-559.
[18] G.J. McLachlan and K. Byth, Expected error rates for logistic regression versus normal discriminant analysis, Biometrical Journal 21 (1979) 47-56.
[19] D.H. Moore, Evaluation of five discrimination procedures for binary variables, Journal of the American Statistical Association 68 (1973) 399-404.
[20] Numerical Algorithms Group, Inc., NAG FORTRAN Library Manual (NAG, Oxford, 1977).
[21] T. O'Neill, The general distribution of the error rate of a classification procedure with application to logistic regression discrimination, Journal of the American Statistical Association 75 (1980) 154-160.
[22] D. Pregibon, Logistic regression diagnostics, The Annals of Statistics 9 (1981) 705-724.
[23] S.J. Press and S. Wilson, Choosing between logistic regression and discriminant analysis, Journal of the American Statistical Association 73 (1978) 699-705.
[24] K. Subrahmaniam and E.F. Chinganda, Robustness of the linear discriminant function to nonnormality: Edgeworth series distribution, Journal of Statistical Planning and Inference 2 (1978) 79-91.
[25] P.W. Wahl and R.A. Kronmal, Discriminant functions when covariances are unequal and sample sizes are moderate, Biometrics 33 (1977) 479-484.