Mixture or logistic regression estimation for discrimination

Terence J. O'Neill

Department of Statistics, Faculty of Economics and Commerce, Australian National University, Canberra, A.C.T. 0200, Australia

Statistics & Probability Letters 20 (1994) 139-142

Received May 1993; revised June 1993
Abstract

When a training sample for a classification rule includes unclassified observations, the estimation can be done by maximum likelihood using both the classified and unclassified data (GM) or (assuming an exponential family) by logistic regression (L) on the classified data only. This paper shows that the choice depends on the separation and shape of the family.

Key words: Logistic regression; Mixtures; Unclassified observations

1. Introduction

Sapra (1991) recently discussed the relationship between three well-known maximum likelihood estimators of Fisher's linear discriminant function (FLDR) for two multivariate normal distributions. The first two estimators, the logistic regression rule, L, and the plug-in maximum likelihood rule, C, are derived from training samples of classified observations, while the third, M, is derived from a sample from the mixture of the two normal distributions. A good discussion of the logistic regression approach to discrimination, and of discrimination based on a partially or completely unclassified training set, is given in McLachlan (1992). Efron (1975) found that the efficiency of L compared to C was between one-half and two-thirds for the statistically interesting range of the parameters. O'Neill (1978) compared C with a generalization of M, GM, which is estimated from a training sample containing a proportion γ from the mixture and a proportion (1 − γ) of classified observations, and found that

observations from the mixture can contain significant information concerning FLDR. At the conclusion of his article, Sapra (1991) commented that "computations of the relative efficiency of L and M would be of interest". McLachlan (1993) also commented on the connection between the models. This short note makes precise what is meant by a comparison of L, C and GM in the more general context of exponential families and uses the results of Efron (1975) and O'Neill (1978) to tabulate their relative efficiencies.
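To fix ideas before the formal development, the following sketch simulates a simple, univariate version of the partially classified setting and fits the two competing estimators: logistic regression on the classified subset (L), and maximum likelihood on the full, partly unclassified sample via a short EM iteration (GM). It is an illustration only; the unit-variance normal model, the function names and the settings are assumptions made here, not part of the paper.

```python
# Illustrative sketch only (not the paper's code): a univariate, unit-variance
# version of the partially classified problem.  "L" fits logistic regression to
# the classified subset; "GM" maximises the joint likelihood of classified plus
# unclassified data by EM.  All settings below are arbitrary choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate(n, gamma, pi1=0.5, delta=2.0):
    """n observations from the two-class normal model; a fraction gamma of the
    labels is hidden."""
    y = rng.binomial(1, pi1, n)
    x = rng.normal(np.where(y == 1, delta / 2, -delta / 2), 1.0)
    classified = rng.random(n) >= gamma          # True where the label is kept
    return x, y, classified

def fit_L(x, y):
    """Logistic regression (Newton iterations) on the classified data only."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(25):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return beta                                   # (beta_0, beta_1)

def fit_GM(x, y, classified, n_iter=200):
    """EM for a two-component unit-variance normal mixture; classified points
    keep their observed label as a fixed responsibility."""
    pi1, mu0, mu1 = 0.5, np.quantile(x, 0.25), np.quantile(x, 0.75)
    for _ in range(n_iter):
        w = pi1 * norm.pdf(x, mu1) / (pi1 * norm.pdf(x, mu1)
                                      + (1 - pi1) * norm.pdf(x, mu0))
        w = np.where(classified, y, w)            # E-step; labels override
        pi1 = w.mean()                            # M-step
        mu1 = np.sum(w * x) / np.sum(w)
        mu0 = np.sum((1 - w) * x) / np.sum(1 - w)
    # discriminant coefficients implied by the fitted mixture
    return np.array([np.log(pi1 / (1 - pi1)) + (mu0**2 - mu1**2) / 2, mu1 - mu0])

x, y, classified = simulate(n=2000, gamma=0.7)
print("L  (classified subset only):", fit_L(x[classified], y[classified]))
print("GM (all data, via EM):      ", fit_GM(x, y, classified))
print("true (beta_0, beta_1):      ", np.array([0.0, 2.0]))
```

With γ = 0.7 only about 600 of the 2000 points carry a label, which is precisely the trade-off quantified by the relative efficiencies studied below.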

2. The relative efficiencies of L, C and GM

We use the notation of O'Neill (1980). Suppose that a p-dimensional vector x is measured on individuals who are from population Π_i if y = i, i = 0, 1. Also P(y = i) = π_i and

f(x | y = i) = f_i(x),   i = 0, 1,

are the densities of x with respect to Lebesgue measure, which are assumed to be of the general exponential family form

f_i(x) = C_1(\theta_i, \eta)\, C_2(x, \eta) \exp(\theta_i' x).
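As a concrete instance of this form (an illustrative check, not taken from the paper): for the N(θ_i, I_p) family one may take C_1(θ, η) = (2π)^{-p/2} exp(-θ'θ/2) and C_2(x, η) = exp(-x'x/2), as the following lines verify numerically.

```python
# Illustrative check (not from the paper): the N(theta_i, I_p) density is of the
# stated exponential-family form with
#   C1(theta, eta) = (2*pi)^(-p/2) * exp(-theta'theta / 2),  C2(x, eta) = exp(-x'x / 2).
import numpy as np
from scipy.stats import multivariate_normal

p = 3
theta = np.array([0.7, -0.2, 1.1])            # an arbitrary mean vector
x = np.array([0.3, 0.9, -1.4])                # an arbitrary point

C1 = (2 * np.pi) ** (-p / 2) * np.exp(-theta @ theta / 2)
C2 = np.exp(-x @ x / 2)
f_exp_family = C1 * C2 * np.exp(theta @ x)    # C1(theta) * C2(x) * exp(theta'x)

f_direct = multivariate_normal(mean=theta, cov=np.eye(p)).pdf(x)
print(f_exp_family, f_direct)                  # the two numbers agree
```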

Then the optimal classification rule (Anderson, 1958, p. 130) uses the linear discriminant function

g(x, \beta) = \beta_0 + \beta_1' x,

where \beta_1 = \theta_1 - \theta_0, \beta_0 = \lambda + \log\{C_1(\theta_1, \eta)/C_1(\theta_0, \eta)\} and \lambda = \log(\pi_1/\pi_0).

Let ℓ(x, y) denote the full log-likelihood of (x, y),

\ell(x, y) = \log\{(\pi_1 f_1(x))^y (\pi_0 f_0(x))^{1-y}\},   (2.1)

ℓ(x) the marginal log-likelihood of x,

\ell(x) = \log\{\pi_1 f_1(x) + \pi_0 f_0(x)\},   (2.2)

and ℓ(y | x) the conditional log-likelihood of y given x,

\ell(y \mid x) = \log\{\pi_1(x)^y \pi_0(x)^{1-y}\},   (2.3)

where π_i(x) = P(y = i | x). Let ℐ_C, ℐ_M and ℐ_LR denote the inverses of the asymptotic covariances of the estimate of β obtained from (2.1), (2.2) and (2.3), respectively. We denote the resulting estimates of the optimal classification rule by C, M and LR, respectively. Then, with appropriate regularity assumptions and using a suitable reparametrization, it can be shown in a similar manner to the lemma in O'Neill (1978) that

\mathcal{I}_C = \mathcal{I}_M + \mathcal{I}_{LR}.   (2.4)

Eq. (2.4) clearly demonstrates the interplay between the three estimates. The generalized mixture estimate, GM, is obtained from a sample with proportion γ from the mixture (2.2) and hence has Fisher information

\mathcal{I}_{GM} = (1 - \gamma)\mathcal{I}_C + \gamma\mathcal{I}_M = \mathcal{I}_C - \gamma\mathcal{I}_{LR}.

Now if D = {x; g(x, β) = 0} and

B = \pi_1 \lVert\beta_1\rVert^{-1} \int_D (1, x')'(1, x')\,\{f_1(x)/2\}\, dm(x),

where m is Lebesgue measure over D, then it follows from O'Neill (1980) that, for the three rules L, C and GM, the asymptotic error rate (AER) satisfies

\mathrm{AER(rule)} = \lim_{n \to \infty} n\{\text{error rate of rule} - \text{optimal error rate}\} = \operatorname{tr} B\Sigma_R,

where Σ_R is the asymptotic covariance matrix of the estimator used in the rule. The asymptotic relative efficiency of R_1 with respect to R_2 is defined as

\mathrm{ARE}(R_1, R_2) = \mathrm{AER}(R_2)/\mathrm{AER}(R_1) = \operatorname{tr} B\Sigma_{R_2}/\operatorname{tr} B\Sigma_{R_1}.

When comparing GM with either C or L, it is necessary to be careful about what is meant by the ARE. One possibility, for a sample with proportion γ unclassified, is to compare the L or C that would result from making the extra effort to ascertain the group memberships with the GM on the original data, giving ARE(·, GM). ARE(GM, C) was used by O'Neill (1978) to compare C to GM for equal-variance normal distributions. In practice, when confronted with a data set containing a proportion γ of unclassified observations, the choice is to apply either L or C to the classified subset or to apply GM to the whole data set. Denote the resulting ARE of either L or C to GM by ARE_s(·, GM). It is obvious that

\mathrm{ARE}_s(\cdot, GM) = (1 - \gamma)\, \mathrm{ARE}(\cdot, GM).

In the following section ARE(L, GM) is found for two examples.
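Identity (2.4) is the usual decomposition of the complete-data information into marginal plus conditional information, and it is easy to check numerically. The sketch below (an illustration under simplifying assumptions, not the paper's code) does so for a univariate normal model in which only the class-1 mean μ_1 is unknown; in that case the three scores are y(x − μ_1), w_1(x)(x − μ_1) and (y − w_1(x))(x − μ_1), where w_1(x) = P(y = 1 | x).

```python
# Monte Carlo check of (2.4), I_C = I_M + I_LR, in a deliberately simple case:
# x | y=i ~ N(mu_i, 1), P(y=1) = pi1, with only mu1 treated as unknown (an
# assumption made here for illustration; the identity holds more generally).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
pi1, mu0, mu1 = 0.4, -1.0, 1.0
n = 400_000

y = rng.binomial(1, pi1, n)
x = rng.normal(np.where(y == 1, mu1, mu0), 1.0)

# posterior probability w1(x) = P(y = 1 | x)
w1 = pi1 * norm.pdf(x, mu1) / (pi1 * norm.pdf(x, mu1) + (1 - pi1) * norm.pdf(x, mu0))

s_full = y * (x - mu1)          # score of the full log-likelihood (2.1) wrt mu1
s_marg = w1 * (x - mu1)         # score of the marginal (mixture) log-likelihood (2.2)
s_cond = (y - w1) * (x - mu1)   # score of the conditional (logistic) log-likelihood (2.3)

I_C, I_M, I_LR = (np.mean(s**2) for s in (s_full, s_marg, s_cond))
print(f"I_C = {I_C:.4f},  I_M + I_LR = {I_M + I_LR:.4f}")   # the two agree
```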

3. Examples of the efficiency of L to GM

3.1. Normals with common variance

Assume the following canonical model (Efron, 1975):

f_i(x) = N\big((-1)^{i+1} (\Delta/2)\, e_1,\; I_p\big),


where Δ ≥ 0 and e_1 is the vector with one in the first position and zeros elsewhere. Combining the notation of Efron (1975) and O'Neill (1978), let

a_i = \int \frac{x^i \phi(x) \exp(-\Delta^2/8)}{\pi_1 \exp(\Delta x/2) + \pi_0 \exp(-\Delta x/2)}\, dx,

A = \begin{pmatrix} a_0 & a_1 \\ a_1 & a_2 \end{pmatrix},

H^{-1} = \begin{pmatrix} 1 + \Delta^2/4 & -(\pi_0 - \pi_1)\Delta/2 \\ -(\pi_0 - \pi_1)\Delta/2 & 1 + 2\pi_0\pi_1\Delta^2 \end{pmatrix},

b = (1, -\lambda/\Delta)', \qquad \lambda = \log(\pi_1/\pi_0), \qquad q(\lambda, \Delta) = a_0\, b' A^{-1} b.

Then it is easy to show that

\mathrm{ARE}(L, GM) = \frac{q(\lambda, \Delta)\, \mathrm{Eff}_1(\lambda, \Delta, \gamma) + (p - 1)\, \mathrm{Eff}_2(\lambda, \Delta, \gamma)}{q(\lambda, \Delta) + (p - 1)},   (3.1)

where

\mathrm{Eff}_1(\lambda, \Delta, \gamma) = b'(H - \gamma A)^{-1} b \,/\, b' A^{-1} b

and

\mathrm{Eff}_2(\lambda, \Delta, \gamma) = a_0(1 + \pi_0\pi_1\Delta^2)/\{1 - \gamma a_0(1 + \pi_0\pi_1\Delta^2)\}.

Combining (3.1) with ARE(L, C), given by Eq. (3.20) of Efron (1975, p. 896), we can find ARE(·, ·) for any pair of C, L and GM. We note that

\lim_{\gamma \to 1} \mathrm{ARE}_s(L, GM) = 0.

In Fig. 1 we consider the case π_1 = ½ and γ = 1. Then GM corresponds to the ordinary mixture estimation of Sapra (1991). In that case q(0, Δ) = 1 and Eff_1 = Eff_2 = ARE(L, GM). It can be seen that mixture discrimination is actually more efficient than logistic for Mahalanobis distances greater than 3.45. For close populations, however, logistic discrimination is substantially better.

[Fig. 1. Relative efficiency of mixture to logistic for normals, plotted against the Mahalanobis distance Δ.]
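A numerical rendering of this comparison is straightforward. The sketch below (an illustration, not the paper's code) evaluates the a_i by quadrature, builds A and H as above, and computes ARE(L, GM) from (3.1) for π_1 = ½ and γ = 1; the ratio should pass through 1 near the value Δ ≈ 3.45 quoted above.

```python
# Illustrative numerical evaluation of (3.1) for the canonical normal model
# (a sketch, not the paper's code).  With pi1 = 1/2 and gamma = 1 (ordinary
# mixture estimation) the ratio ARE(L, GM) crosses 1 near Delta = 3.45.
import numpy as np
from scipy.integrate import quad

def ARE_L_GM(Delta, lam=0.0, gamma=1.0, p=1):
    pi1 = 1.0 / (1.0 + np.exp(-lam))          # lam = log(pi1 / pi0)
    pi0 = 1.0 - pi1

    def a(i):
        integrand = lambda x: (x**i * np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
                               * np.exp(-Delta**2 / 8)
                               / (pi1 * np.exp(Delta * x / 2)
                                  + pi0 * np.exp(-Delta * x / 2)))
        return quad(integrand, -40, 40)[0]     # effectively (-inf, inf)

    a0, a1, a2 = a(0), a(1), a(2)
    A = np.array([[a0, a1], [a1, a2]])
    Hinv = np.array([[1 + Delta**2 / 4,         -(pi0 - pi1) * Delta / 2],
                     [-(pi0 - pi1) * Delta / 2,  1 + 2 * pi0 * pi1 * Delta**2]])
    H = np.linalg.inv(Hinv)
    b = np.array([1.0, -lam / Delta])

    q = a0 * (b @ np.linalg.solve(A, b))
    eff1 = (b @ np.linalg.solve(H - gamma * A, b)) / (b @ np.linalg.solve(A, b))
    eff2 = a0 * (1 + pi0 * pi1 * Delta**2) / (1 - gamma * a0 * (1 + pi0 * pi1 * Delta**2))
    return (q * eff1 + (p - 1) * eff2) / (q + (p - 1))

for Delta in (2.0, 3.0, 3.45, 4.0, 5.0):
    print(f"Delta = {Delta:4.2f}:  ARE(L, GM) = {ARE_L_GM(Delta):7.3f}")
```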

3.2. Bivariate exponentials

Consider the case of bivariate exponentials with densities θ_{i1}θ_{i2} exp(−θ_i'x), i = 0, 1, where θ_i' = (θ_{i1}, θ_{i2}). Then, by the usual invariance arguments, the following canonical model can be assumed:

f_1(x) = \exp(-x_1 - x_2),

f_0(x) = (1 + c)(1 + bc) \exp\{-(1 + c)x_1 - (1 + bc)x_2\},

where c > 0 and −(1 + c)^{−1} …
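For concreteness, the sketch below (not from the paper; the values of b, c and π_1 are arbitrary choices for illustration) codes this canonical model and the optimal linear discriminant it implies, which follows from the general exponential-family form of Section 2, and estimates the optimal error rate by Monte Carlo.

```python
# Illustrative sketch (not the paper's code): the canonical bivariate exponential
# model of Section 3.2 and its optimal linear discriminant,
#   log{pi1*f1(x) / (pi0*f0(x))} = lam - log((1+c)*(1+b*c)) + c*x1 + b*c*x2.
# The choice of b, c and pi1 below is arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(2)
b, c, pi1 = 0.5, 2.0, 0.5
lam = np.log(pi1 / (1 - pi1))

def sample(n):
    """Draw from the mixture pi1*f1 + pi0*f0 (independent exponential margins)."""
    y = rng.binomial(1, pi1, n)
    rate1 = np.where(y == 1, 1.0, 1.0 + c)        # rate of x1 under each class
    rate2 = np.where(y == 1, 1.0, 1.0 + b * c)    # rate of x2 under each class
    x = np.column_stack([rng.exponential(1.0 / rate1), rng.exponential(1.0 / rate2)])
    return x, y

def g(x):
    """Optimal linear discriminant; classify to population 1 when g(x) > 0."""
    return lam - np.log((1 + c) * (1 + b * c)) + c * x[:, 0] + b * c * x[:, 1]

x, y = sample(200_000)
print("Monte Carlo optimal error rate:", np.mean((g(x) > 0) != (y == 1)))
```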
4. Summary

On the basis of relative efficiency, M compares favourably with L for normals. However, it was found to have a very poor performance relative to L or C for a bivariate exponential example.

Table 1
Relative asymptotic efficiency of mixture to logistic for bivariate exponentials (rows indexed by b, from −1 to 0.8; columns indexed by c = 0.25, 0.50, 0.75, 1, 2, 3, 4, 5, 10).
[Table entries not reliably recoverable from the source.]

Accordingly, for a training sample with proportion γ unclassified, the choice of whether to use GM or L will depend on the level of our confidence in the specification of the family and on the skewness and length of the tails of that family.

References

Efron, B. (1975), The efficiency of logistic regression compared to normal discriminant analysis, J. Am. Statist. Assoc. 70, 892-898.


McLachlan, G.J. (1992), Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York.
McLachlan, G.J. (1993), Letter to the Editor, The Am. Statist. 47, 88.
O'Neill, T.J. (1978), Normal discrimination with unclassified observations, J. Am. Statist. Assoc. 73, 821-826.
O'Neill, T.J. (1980), The general distribution of the error rate of a classification procedure with application to logistic regression discrimination, J. Am. Statist. Assoc. 75, 154-160.
Sapra, S.K. (1991), A connection between the logit model, normal discriminant analysis, and multivariate normal mixtures, The Am. Statist. 45, 265-268.