Statistics & Probability Letters 20 (1994) 139-142

Mixture estimation for discrimination or logistic regression

Terence J. O'Neill
Department of Statistics, Faculty of Economics and Commerce, Australian National University, Canberra, A.C.T. 0200, Australia

Received May 1993; revised June 1993
Abstract

When a training sample for a classification rule includes unclassified observations, the estimation can be done by maximum likelihood using both the classified and unclassified data (GM) or (assuming an exponential family) by logistic regression (L) on the classified data only. This paper shows that the choice depends on the separation and shape of the family.

Key words: Logistic regression; Mixtures; Unclassified observations
1. Introduction

Sapra (1991) recently discussed the relationship between three well-known maximum likelihood estimators of Fisher's linear discriminant rule (FLDR) for two multivariate normal distributions. The first two estimators, the logistic regression estimator, L, and the plug-in maximum likelihood estimator, C, are derived from training samples of classified observations, while the third, M, is derived from a sample from the mixture of the two normal distributions. A good discussion of the logistic regression approach to discrimination, and of discrimination based on a partially or completely unclassified training set, is given in McLachlan (1992). Efron (1975) found that the efficiency of L compared to C was between one-half and two-thirds for the statistically interesting range of the parameters. O'Neill (1978) compared C with a generalization of M, GM, estimated from a training sample with proportion γ from the mixture and (1 − γ) classified observations, and found that observations from the mixture can contain significant information concerning the FLDR. At the conclusion of his article, Sapra (1991) commented that "computations of the relative efficiency of L and M would be of interest". McLachlan (1993) also commented on the connection between the models. This short note makes precise what is meant by a comparison of L, C and GM in the more general context of exponential families and uses the results of Efron (1975) and O'Neill (1978) to tabulate their relative efficiencies.

2. The relative efficiencies of L, C and GM
We use the notation of O'Neill (1980). Suppose that a p-dimensional vector x is measured on individuals who are from population Π_i if y = i, i = 0, 1. Also P(y = i) = π_i and f(x | y = i) = f_i(x), i = 0, 1, are the densities of x with respect to Lebesgue measure, which are assumed to be of general exponential family form

$$f_i(x) = C_1(\theta_i, \eta)\, C_2(x, \eta) \exp(\theta_i' x).$$

Then the optimal classification rule (Anderson, 1958, p. 130) uses the linear discriminant function

$$g(x, \beta) = \beta_0 + \beta_1' x,$$

where λ = log(π_1/π_0).
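The coefficients of g follow directly from the family; writing out the posterior odds (a routine calculation included here for completeness; the identification of β_0 and β_1 below is ours, though it is standard), the factor C_2(x, η) cancels and

$$g(x, \beta) = \log\frac{\pi_1 f_1(x)}{\pi_0 f_0(x)} = \Big(\lambda + \log\frac{C_1(\theta_1, \eta)}{C_1(\theta_0, \eta)}\Big) + (\theta_1 - \theta_0)' x,$$

so β_1 = θ_1 − θ_0, β_0 is the bracketed constant, and the optimal rule assigns x to Π_1 when g(x, β) > 0.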
Let ℓ(x, y) denote the full log-likelihood of (x, y),

$$\ell(x, y) = \log\{(\pi_1 f_1(x))^y (\pi_0 f_0(x))^{1-y}\}, \tag{2.1}$$

ℓ(x) the marginal log-likelihood of x,

$$\ell(x) = \log\{\pi_1 f_1(x) + \pi_0 f_0(x)\}, \tag{2.2}$$

and ℓ(y | x) the conditional log-likelihood of y given x,

$$\ell(y \mid x) = \log\{\pi_1(x)^y \pi_0(x)^{1-y}\}, \tag{2.3}$$

where π_i(x) = π_i f_i(x)/{π_1 f_1(x) + π_0 f_0(x)} denotes the posterior probability of population Π_i.
Let $\mathcal{J}_C$, $\mathcal{J}_M$ and $\mathcal{J}_{LR}$ denote the inverses of the asymptotic covariances of the estimates of β obtained from (2.1), (2.2) and (2.3), respectively. We denote the resulting estimates of the optimal classification rule by C, M and LR, respectively. Then, with appropriate regularity assumptions and using a suitable reparametrization, it can be shown in a similar manner to the lemma in O'Neill (1978) that

$$\mathcal{J}_C = \mathcal{J}_M + \mathcal{J}_{LR}. \tag{2.4}$$

Eq. (2.4) clearly demonstrates the interplay between the three estimates. The generalized mixture estimate, GM, is obtained from a sample with proportion γ from the mixture (2.2) and hence has Fisher information

$$\mathcal{J}_{GM} = (1 - \gamma)\mathcal{J}_C + \gamma\mathcal{J}_M = \mathcal{J}_C - \gamma\mathcal{J}_{LR}.$$
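Eq. (2.4) is the additivity of Fisher information under the factorization of the joint likelihood; explicitly, since π_i(x) = π_i f_i(x)/{π_1 f_1(x) + π_0 f_0(x)},

$$\ell(x, y) = \ell(x) + \ell(y \mid x),$$

so the information about β carried by a classified observation splits into a marginal (mixture) part and a conditional (logistic) part, which is (2.4).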
Now if D = {x : g(x, β) = 0} and

$$B = \pi_1 \|\beta_1\|^{-1} \int_D f_1(x)\, (1, x')'(1, x')\, dm(x),$$

where m is Lebesgue measure over D, then it follows from O'Neill (1980) that for the three rules L, C and GM the asymptotic error rate (AER) satisfies

$$\mathrm{AER(rule)} = \lim_{n \to \infty} n\{\text{error rate of rule} - \text{optimal error rate}\} = \operatorname{tr} B\Sigma_R,$$

where Σ_R is the asymptotic covariance matrix of the estimator used in the rule. The asymptotic relative efficiency of R_1 with respect to R_2 is defined as

$$\mathrm{ARE}(R_1, R_2) = \mathrm{AER}(R_2)/\mathrm{AER}(R_1) = \operatorname{tr} B\Sigma_{R_2} / \operatorname{tr} B\Sigma_{R_1}.$$
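To make the error-rate regret behind the AER concrete, here is a small Monte Carlo sketch (our illustration, not a computation from the paper): for univariate normal populations it estimates n × regret for the logistic rule L fitted to classified data and for the mixture rule M fitted by EM to the same data with the labels hidden. The simulation design, function names and the EM routine are ours; the component variances are assumed known and equal to one.

```python
# Monte Carlo sketch of n x regret for L (logistic on classified data) and
# M (mixture ML via EM on the same data, labels hidden); our illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate(n, delta):
    y = rng.integers(0, 2, n)                        # labels, pi_1 = 1/2
    x = rng.normal((2 * y - 1) * delta / 2.0, 1.0)   # N(-d/2, 1) or N(+d/2, 1)
    return x, y

def logistic_fit(x, y, steps=50):
    """Newton-Raphson MLE of P(y=1|x) = 1/(1 + exp(-(b0 + b1*x)))."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)
        # small ridge guards against quasi-separation in rare replications
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X) + 1e-9 * np.eye(2),
                                      X.T @ (y - p))
    return beta

def mixture_fit(x, steps=300):
    """EM for a two-component normal mixture with unit variances; returns the
    coefficients (b0, b1) of the implied rule log{pi1 f1(x) / (pi0 f0(x))}."""
    mu0, mu1, pi1 = np.quantile(x, 0.25), np.quantile(x, 0.75), 0.5
    for _ in range(steps):
        r1 = pi1 * norm.pdf(x, mu1, 1.0)
        r0 = (1.0 - pi1) * norm.pdf(x, mu0, 1.0)
        w = r1 / (r0 + r1)                           # E-step: P(y=1|x)
        pi1 = w.mean()                               # M-step
        mu1 = np.sum(w * x) / np.sum(w)
        mu0 = np.sum((1.0 - w) * x) / np.sum(1.0 - w)
    b1 = mu1 - mu0
    b0 = np.log(pi1 / (1.0 - pi1)) - 0.5 * (mu1**2 - mu0**2)
    return np.array([b0, b1])

def true_error(beta, delta):
    """Exact error rate of the rule 'assign to population 1 iff b0 + b1*x > 0'."""
    b0, b1 = beta
    if b1 == 0.0:
        return 0.5
    t = -b0 / b1                                     # decision threshold
    if b1 > 0:                                       # classify 1 iff x > t
        return 0.5 * (norm.cdf(t - delta / 2) + 1.0 - norm.cdf(t + delta / 2))
    return 0.5 * (1.0 - norm.cdf(t - delta / 2) + norm.cdf(t + delta / 2))

delta, n, reps = 3.0, 500, 200
optimal = true_error(np.array([0.0, 1.0]), delta)    # optimal rule: x > 0
reg_L = reg_M = 0.0
for _ in range(reps):
    x, y = simulate(n, delta)
    reg_L += true_error(logistic_fit(x, y), delta) - optimal
    reg_M += true_error(mixture_fit(x), delta) - optimal
# n * regret is the finite-sample analogue of AER(rule)
print("n x mean regret  L:", n * reg_L / reps, "  M:", n * reg_M / reps)
```

Rerunning this for several values of delta should reproduce the qualitative behaviour of Fig. 1 below.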
When comparing GM with either C or L, it is necessary to be careful concerning what is meant by the ARE. One possibility for a sample with proportion γ unclassified is to compare the L or C that would result from making the extra effort to ascertain the group memberships with the GM on the original data, giving ARE(·, GM). ARE(GM, C) was used by O'Neill (1978) to compare C to GM for equal variance normal distributions. In practice, when confronted with a data set containing a proportion γ of unclassified observations, the choice is to apply either L or C to the classified subset or to apply GM to the whole data set. Denote the resulting ARE of either L or C to GM by AREs(·, GM). It is obvious that

$$\mathrm{AREs}(\cdot, GM) = (1 - \gamma)\,\mathrm{ARE}(\cdot, GM).$$
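As a numerical illustration (the numbers here are hypothetical, not taken from the tables of this paper): if ARE(L, GM) = 2 at some parameter value, so that GM on the complete data has twice the asymptotic regret of L on a fully classified sample, and γ = ½ of the sample is unclassified, then AREs(L, GM) = (1 − ½) × 2 = 1. Logistic regression on the classified half and GM on the whole data set then achieve the same asymptotic error rate, and any larger γ favours GM.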
In the following section ARE(L, GM), the efficiency of L relative to GM, is found for two examples.

3. Examples

3.1. Normals with common variance

Assume the following canonical model (Efron, 1975):

$$f_i(x) = N((-1)^{i+1}\,(\Delta/2)\, e_1,\; I_p), \qquad i = 0, 1,$$
where Δ ≥ 0 and e_1 is the vector with one in the first position and zeros elsewhere. Combining the notation of Efron (1975) and O'Neill (1978), let
$$a_i = \int \frac{x^i\, \phi(x)\, e^{-\Delta^2/8}}{\pi_1 e^{\Delta x/2} + \pi_0 e^{-\Delta x/2}}\, dx,$$

$$A = \begin{pmatrix} a_0 & a_1 \\ a_1 & a_2 \end{pmatrix}, \qquad
H^{-1} = \begin{pmatrix} 1 + \Delta^2/4 & -(\pi_0 - \pi_1)\Delta/2 \\ -(\pi_0 - \pi_1)\Delta/2 & 1 + 2\pi_0\pi_1\Delta^2 \end{pmatrix},
and

$$b = (1, -\lambda/\Delta)', \qquad \lambda = \log(\pi_1/\pi_0), \qquad q(\lambda, \Delta) = a_0\, b' A^{-1} b.$$
Then it is easy to show that

$$\mathrm{ARE}(L, GM) = \frac{q(\lambda, \Delta)\,\mathrm{Eff}_1(\lambda, \Delta, \gamma) + (p - 1)\,\mathrm{Eff}_2(\lambda, \Delta, \gamma)}{q(\lambda, \Delta) + (p - 1)}, \tag{3.1}$$
where

$$\mathrm{Eff}_1(\lambda, \Delta, \gamma) = \frac{b'(H - \gamma A)^{-1} b}{b' A^{-1} b}$$

and

$$\mathrm{Eff}_2(\lambda, \Delta, \gamma) = \frac{a_0(1 + \pi_0 \pi_1 q \Delta^2)}{1 - \gamma a_0(1 + \pi_0 \pi_1 q \Delta^2)}.$$
Combining (3.1) with ARE(L, C) given by Eq. (3.20) of Efron (1975, p. 896), we can find ARE(·, ·) for any pair of C, L and GM. We note that

$$\lim_{\gamma \to 1} \mathrm{AREs}(L, GM) = 0.$$

In Fig. 1 we consider the case π_1 = ½ and γ = 1. Then GM corresponds to the ordinary mixture estimation of Sapra (1991). In that case q(0, Δ) = 1 and Eff_1 = Eff_2 = ARE(L, GM). It can be seen that mixture discrimination is actually more efficient than logistic for Mahalanobis distances greater than 3.45. For close populations, however, logistic discrimination is substantially better.

[Fig. 1. Relative efficiency of mixture to logistic for normals. Horizontal axis: Mahalanobis distance Δ.]
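Eq. (3.1) involves only one-dimensional integrals, so it is easy to evaluate numerically. The following is a minimal sketch, assuming our reading of the reconstructed quantities a_i, A, H^{-1}, Eff_1 and Eff_2 above is the intended one; the function names and the use of scipy quadrature are ours, and the output should be checked against Fig. 1 rather than trusted.

```python
# Numerical sketch of Eq. (3.1) for the common-variance normal model;
# our reconstruction, not the author's code.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def a_int(i, delta, pi1=0.5):
    """a_i = int x^i phi(x) e^{-delta^2/8} / {pi1 e^{dx/2} + pi0 e^{-dx/2}} dx."""
    pi0 = 1.0 - pi1
    f = lambda x: x**i * norm.pdf(x) * np.exp(-delta**2 / 8.0) / (
        pi1 * np.exp(delta * x / 2.0) + pi0 * np.exp(-delta * x / 2.0))
    # phi(x) is negligible outside +-12, which also keeps the exponentials tame
    return quad(f, -12.0, 12.0)[0]

def are_L_GM(delta, gamma=1.0, pi1=0.5, p=1):
    """Eq. (3.1): ARE(L, GM) for populations N((+-delta/2) e1, I_p)."""
    pi0 = 1.0 - pi1
    lam = np.log(pi1 / pi0)
    a0, a1, a2 = (a_int(i, delta, pi1) for i in range(3))
    A = np.array([[a0, a1], [a1, a2]])
    Hinv = np.array([[1 + delta**2 / 4.0, -(pi0 - pi1) * delta / 2.0],
                     [-(pi0 - pi1) * delta / 2.0, 1 + 2 * pi0 * pi1 * delta**2]])
    H = np.linalg.inv(Hinv)
    b = np.array([1.0, -lam / delta])
    b_Ainv_b = b @ np.linalg.solve(A, b)
    q = a0 * b_Ainv_b                                  # q(lambda, delta)
    eff1 = (b @ np.linalg.solve(H - gamma * A, b)) / b_Ainv_b
    t = a0 * (1 + pi0 * pi1 * q * delta**2)
    eff2 = t / (1 - gamma * t)
    return (q * eff1 + (p - 1) * eff2) / (q + (p - 1))

# With pi1 = 1/2 and gamma = 1, the ratio should cross 1 near delta = 3.45.
for d in (1.0, 2.0, 3.0, 3.45, 4.0):
    print(d, round(are_L_GM(d), 3))
```

A handy internal check on the reconstruction: at λ = 0 the computed q equals 1 and eff1 equals eff2, as the text above requires.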
3.2. Bivariate exponentials

Consider the case of bivariate exponentials with densities θ_{i1} θ_{i2} exp(−θ_i' x), i = 0, 1, where θ_i' = (θ_{i1}, θ_{i2}). Then by the usual invariance arguments the following canonical model can be assumed:

$$f_1(x) = \exp(-x_1 - x_2),$$
$$f_0(x) = (1 + c)(1 + bc) \exp\{-(1 + c)x_1 - (1 + bc)x_2\},$$

where c > 0 and 1 + bc > 0, so that f_0 is a density. Table 1 gives the resulting relative asymptotic efficiencies of mixture to logistic.
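To see that this canonical model fits the exponential family framework of Section 2 (a direct calculation, written out here for completeness), note that

$$g(x, \beta) = \log\frac{\pi_1 f_1(x)}{\pi_0 f_0(x)} = \lambda - \log\{(1 + c)(1 + bc)\} + c\,x_1 + bc\,x_2,$$

so β_1' = (c, bc) and logistic regression on the classified observations is again correctly specified, with b and c controlling the shape and separation of the two populations.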
Table 1
Relative asymptotic efficiency of mixture to logistic for bivariate exponentials

         c
   b     0.25   0.50   0.75   1      2      3      4      5      10
   1.0   0      0      0      0.01   0.09   0.21   0.33   0.48   1.20
   0.8   0      0      0      0.01   0.07   0.17   0.29   0.41   1.05
   0.6   0      0      0      0.01   0.05   0.14   0.23   0.34   0.89
   0.4   0      0      0      0.01   0.04   0.11   0.19   0.27   0.72
   0.2   0      0      0      0.01   0.04   0.09   0.15   0.22   0.56
   0.0   0      0      0      0      0.03   0.07   0.11   0.16   0.38
  -0.2   0      0      0.01   0.02   0.14   0.31   0.60   -      -
  -0.4   0      0      0      0.02   0.51   -      -      -      -
  -0.6   0      0      0.01   0.06   -      -      -      -      -
  -0.8   0      0      0.06   0.55   -      -      -      -      -
  -1.0   0      0.02   0.35   -      -      -      -      -      -

(Entries are omitted where 1 + bc ≤ 0, since f_0 is then not a density.)

4. Summary

On the basis of relative efficiency, M compares favourably with L for normals. However, it was found to have a very poor performance relative to L or C for a bivariate exponential example. Accordingly, for a training sample with proportion γ unclassified, the choice of whether to use GM or L will depend on the level of our confidence in the specification of the family and on the skewness and length of the tails of that family.

References

Anderson, T.W. (1958), An Introduction to Multivariate Statistical Analysis (Wiley, New York).
Efron, B. (1975), The efficiency of logistic regression compared to normal discriminant analysis, J. Am. Statist. Assoc. 70, 892-898.
McLachlan, G.J. (1992), Discriminant Analysis and Statistical Pattern Recognition (Wiley, New York).
McLachlan, G.J. (1993), Letter to the editor, The Am. Statist. 47, 88.
O'Neill, T.J. (1978), Normal discrimination with unclassified observations, J. Am. Statist. Assoc. 73, 821-826.
O'Neill, T.J. (1980), The general distribution of the error rate of a classification procedure with application to logistic regression discrimination, J. Am. Statist. Assoc. 75, 154-160.
Sapra, S.K. (1991), A connection between the logit model, normal discriminant analysis, and multivariate normal mixtures, The Am. Statist. 45, 265-268.