Pattern Recognition, Vol. 29, No. 2, pp. 337-339, 1996
Copyright © 1996 Pattern Recognition Society. Published by Elsevier Science Ltd. Printed in Great Britain.
0031-3203/96 $15.00+.00
0031-3203(95)00079-8

AN EFFICIENT ALGORITHM TO FIND THE MLE OF PRIOR PROBABILITIES OF A MIXTURE IN PATTERN RECOGNITION

TZE FEN LI

Department of Applied Mathematics, National Chung Hsing University, Kuo-Kuang Road, Taichung 400, Taiwan, R.O.C.

(Received 23 September 1994; in revised form 11 May 1995; received for publication 2 June 1995)

Abstract--A number of techniques have been proposed to determine the parameters which define the unknown components of a mixture in pattern recognition. The most common method is maximum likelihood estimation (MLE). A direct ML approach requires maximizing the likelihood function with respect to the unknown prior probabilities of the classes in a mixture. This is a complicated multiparameter optimization problem, and the direct approach tends to be computationally complex and time consuming. In this study, we use the concavity of the Kullback-Leibler information number to derive a simple and accurate algorithm which finds the MLE of the prior probability of each class. The results of a Monte Carlo simulation study with normal and exponential distributions are presented to demonstrate the favorable prior estimation of the algorithm.

Keywords: Concave function; Maximum likelihood estimation; Mixture; Prior probability

1. INTRODUCTION

To solve the classification problem, one approach is to find a Bayes decision rule which separates classes based on the present observation X and minimizes the probability of misclassification.(1-3) To apply this approach, one needs sufficient information about the conditional density function f(x|ω) of X given the class and the prior probability p(c_i) of each class. Otherwise, the conditional density function and the prior probability have to be estimated from a set of past observations. In this study, we use a maximum likelihood (ML) approach to solve k-class classification problems in which the conditional density function f(x|ω) is known, where ω denotes one of the k classes c_i, i = 1, ..., k, and the prior probability of each class is unknown. Let θ = (θ_1, ..., θ_k) be the prior probabilities of the k classes, and let the k-class mixture distribution (marginal distribution) be

\[
p(x \mid \theta) = \sum_{i=1}^{k} \theta_i f(x \mid c_i). \tag{1}
\]

A number of techniques have been proposed to determine the parameters which define the unknown components of a mixture. The most common method is maximum likelihood (ML) estimation. The advantage of the ML approach is that it is asymptotically unbiased, but a direct ML approach requires solving

\[
\max_{\theta_1, \ldots, \theta_k} \Big\{ J = \sum_{i=1}^{n} \log p(x_i \mid \theta_1, \ldots, \theta_k) \Big\}. \tag{2}
\]

This is a complicated multiparameter optimization problem. First one has to evaluate the objective function J on a coarse grid to locate roughly the global maximum and then apply a numerical method (the Gauss method, Newton-Raphson, or some other gradient-search iterative algorithm). Hence, the direct approach tends to be computationally complex and time consuming. Furthermore, J may have several local maxima, so the MLE obtained above is not unique and may give a wrong estimate. Kazakos(4) used the ML method for the estimation of the prior probabilities of a mixture, assuming the remaining parameters to be known. The ML estimate is nonrecursive in nature and its implementation requires solving a set of nonlinear equations by numerical techniques which are complex. Hence, Kazakos(4) used stochastic approximation techniques to develop a recursive estimation scheme for the prior probabilities. Makov and Smith(5) used a Bayesian approach for identifying a mixture under the same assumptions. Young and Coraluppi(6) discussed a simple self-learning algorithm for decomposition of a Gaussian mixture using a stochastic approximation method. Mizoguchi and Shimura(7) extended their results to the multivariate case with a known mean vector and known covariance matrix. For the case in which all statistics of each class as well as the number of classes are unknown, Postaire and Vasseur(8) first proposed a technique for identifying the Gaussian mixture. In this study, we use the concavity of the Kullback-Leibler information number to derive an algorithm which finds the MLE of the prior probabilities of a mixture. Our procedure is simple and accurate.


The estimate of a prior can be easily computed. As the number of samples increases, the ML estimate converges to the true unknown prior probability. In the next section, an estimation method and some related theoretical results are presented. In Section 3, an algorithm and simulation results are presented to demonstrate the accuracy of the estimation procedure.

2. THE CONCAVITY OF THE KULLBACK-LEIBLER INFORMATION NUMBER

Let X be the present observation, which belongs to one of k classes c_i, i = 1, 2, ..., k. Consider the decision problem consisting of determining whether X belongs to c_i. Let f(x|ω) be the conditional density function of X given ω, where ω denotes one of the k classes, and let θ_i, i = 1, 2, ..., k, be the prior probability of c_i with Σ_{i=1}^{k} θ_i = 1. In this study, f(x|ω) is known and the θ_i are unknown. Let

\[
S = \Big\{\, \theta = (\theta_1, \theta_2, \ldots, \theta_k) : \theta_i > 0,\ \sum_{i=1}^{k} \theta_i = 1 \,\Big\} \tag{3}
\]

define the parameter space of prior probabilities of the k classes. Let X_n = (X_1, X_2, ..., X_n). Using a set of unidentified mixed samples to estimate the unknown parameters is known as unsupervised learning, or learning without a teacher.(1-3, 5-8) In a pattern recognition system with unsupervised learning, X_n is a set of unidentified input patterns. Let p(x_m|θ) be the marginal density of X_m with respect to the prior distribution of the classes, i.e. p(x_m|θ) = Σ_{i=1}^{k} θ_i f(x_m|c_i).
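As a concrete illustration (our own sketch, not part of the paper), the marginal density of equation (1) and the sample average of its logarithm can be written as follows. The function names, and the Gaussian class-conditional densities N(0, 1) and N(2, 1) borrowed from Case 1 of Section 3, are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, theta, class_densities):
    # p(x | theta) = sum_i theta_i * f(x | c_i), as in equation (1)
    return sum(t * f(x) for t, f in zip(theta, class_densities))

def avg_log_likelihood(samples, theta, class_densities):
    # (1/n) * sum_m log p(X_m | theta); its limit H(mu, theta) is analysed below
    return float(np.mean(np.log(mixture_density(samples, theta, class_densities))))

# Illustrative known class-conditional densities: N(0, 1) and N(2, 1)
f1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
f2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)
```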

Theorem 1. The Kullback-Leibler information number,(9) defined by

\[
H(\mu, \theta) = \int \log p(x \mid \theta)\, p(x \mid \mu)\, dx, \tag{4}
\]

is a strictly concave function of θ ∈ S with its maximum value at θ = μ, where μ = (μ_1, ..., μ_k) is the vector of true prior probabilities of the classes.

Proof. It is known(9) that H(μ, θ) has a maximum value at θ = μ. For concavity, let θ', θ'' be in S and 0 < δ < 1. Then

\[
\begin{aligned}
\delta H(\mu, \theta') + (1-\delta) H(\mu, \theta'')
&= \int \big[\delta \log p(x \mid \theta') + (1-\delta)\log p(x \mid \theta'')\big]\, p(x \mid \mu)\, dx \\
&< \int \log\big[\delta\, p(x \mid \theta') + (1-\delta)\, p(x \mid \theta'')\big]\, p(x \mid \mu)\, dx \\
&= H\big(\mu,\ \delta\theta' + (1-\delta)\theta''\big).
\end{aligned} \tag{5}
\]

The inequality in (5) follows from the concavity of the log function, and the last equality is obtained by a simple computation. □

From the Strong Law of Large Numbers,(9) we have the following corollary.

Corollary 1. (1/n) Σ_{m=1}^{n} log p(X_m|θ) converges a.s. to H(μ, θ), i.e. lim_{n→∞} (1/n) Σ_{m=1}^{n} log p(X_m|θ) is a strictly concave function of θ and has a global maximum at θ = μ.
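As an informal numerical illustration of Corollary 1 (ours, not the author's), one can evaluate the empirical average (1/n) Σ_m log p(X_m|θ) on a grid of two-class priors and observe a single maximum near the true prior. The random seed, the sample size n = 1000, and the setting μ = (0.34, 0.66) with N(0, 1) and N(2, 1) class densities are assumptions borrowed from the simulations of Section 3.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu1, n = 0.34, 1000                       # true prior of class 1 and sample size
labels = rng.random(n) < mu1              # class labels drawn with probability mu1
samples = np.where(labels, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))

def J_n(theta1):
    # empirical average log-likelihood of the mixture at theta = (theta1, 1 - theta1)
    p = theta1 * norm.pdf(samples, 0.0, 1.0) + (1.0 - theta1) * norm.pdf(samples, 2.0, 1.0)
    return np.log(p).mean()

grid = np.linspace(0.01, 0.99, 99)
best = grid[int(np.argmax([J_n(t) for t in grid]))]
print(best)   # Corollary 1 predicts a single maximum near the true prior 0.34
```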

3. AN ALGORITHM AND SIMULATION RESULTS

The concavity of the Kullback-Leibler information number in Theorem 1 indicates the direction in which to find the MLE of the true prior probabilities, i.e. to climb to the top of the concave Kullback-Leibler information function. Corollary 1 can be used to estimate the unknown prior probabilities of a mixture. Let J_n = (1/n) Σ_{m=1}^{n} log p(X_m|θ). As described in Corollary 1, J_n converges a.s. to H(μ, θ), which is a strictly concave function of θ ∈ S and has a maximum at θ = μ. Hence, for fixed n, we use the value θ̂ of θ which maximizes J_n as the estimate of the true μ, i.e. θ̂ is the MLE of the unknown μ. As n → ∞, θ̂ converges to μ. How do we find such an MLE θ̂? Since, for large n, J_n behaves like a concave function, any gradient-search algorithm will find the MLE of μ. For this study, we use an easy one. Let X_m, m = 1, ..., n, be a set of mixed samples. The algorithm for finding an estimate of μ can be stated as follows (a code sketch of this coarse-to-fine search is given below):

(1) Locate the point θ̂_1 on a coarse grid of S with grid interval 0.1 such that J_n has a maximum at θ̂_1.
(2) Let S_1 be the subset of S centered at θ̂_1 whose vertices are the points obtained by shifting each coordinate of θ̂_1 by ±0.1.
(3) Locate the point θ̂_2 on a finer grid of S_1 with grid interval 0.01 such that J_n has a maximum at θ̂_2 in S_1.
(4) Continue this process until the sequence θ̂_m, m = 1, 2, ..., converges to a point, which is used as the estimate of the true μ.

In this simulation, we use θ̂_2 as the estimate for both the two-class and three-class classifications. Note that this is not a gradient-search algorithm, which would be computationally complex; our algorithm is simple. Since the concavity of the Kullback-Leibler information function guarantees a single maximum value, our algorithm will find the MLE.

The mixed samples are generated from two classes of normal and exponential distributions, respectively, i.e. the normal and exponential distributions are used as conditional density functions. For i = 1, 2, ω_i is the mean of class c_i. For the two-class classification, the true prior probabilities (μ_1, μ_2) = (0.05, 0.95), (0.2, 0.8), (0.34, 0.66), (0.5, 0.5), (0.76, 0.24) and (0.95, 0.05) are used to generate the unclassified samples. The number of total mixed samples begins with 200 and increases by 200 until 1200 mixed samples are generated. For the three-class classification, the mixed samples are generated from three different bivariate normal distributions.
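The following is a minimal sketch, in our own code rather than the author's, of steps (1)-(3) above for a general k-class mixture, stopping at θ̂_2 as in the simulations. The helper names simplex_grid and grid_search_mle are ours, and the class-conditional densities are assumed to be supplied as vectorized functions.

```python
import itertools
import numpy as np

def simplex_grid(step, lower, upper):
    """Grid points theta in S: the first k-1 coordinates run over a grid of the
    given step inside [lower_i, upper_i]; the last coordinate is 1 minus their sum;
    a point is kept only if every coordinate lies strictly between 0 and 1."""
    axes = [np.arange(lo, hi + 1e-9, step) for lo, hi in zip(lower, upper)]
    points = []
    for free in itertools.product(*axes):
        theta = np.array(free + (1.0 - sum(free),))
        if np.all(theta > 0.0) and np.all(theta < 1.0):
            points.append(theta)
    return points

def grid_search_mle(samples, class_densities, step=0.1, refine=0.01):
    """Steps (1)-(3): maximize J_n on a coarse grid over S, then on a finer grid
    over the neighbourhood S_1 centred at the coarse maximizer."""
    k = len(class_densities)
    dens = np.column_stack([f(samples) for f in class_densities])  # n x k matrix of f(x_m | c_i)

    def J(theta):
        return np.log(dens @ theta).mean()    # J_n = (1/n) sum_m log p(X_m | theta)

    # (1) coarse grid with interval `step` over the whole simplex
    coarse = simplex_grid(step, [step] * (k - 1), [1.0 - step] * (k - 1))
    theta_hat_1 = max(coarse, key=J)
    # (2)-(3) finer grid with interval `refine` on S_1, the +/- step box around theta_hat_1
    lower = [max(t - step, refine) for t in theta_hat_1[:-1]]
    upper = [min(t + step, 1.0 - refine) for t in theta_hat_1[:-1]]
    theta_hat_2 = max(simplex_grid(refine, lower, upper), key=J)
    return theta_hat_2
```

For a two-class problem this can be called as grid_search_mle(samples, [f1, f2]), with f1 and f2 the Gaussian densities from the sketch in Section 2; for three classes it searches the two-dimensional probability simplex in the same way.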

Case 1. Two normal distributions with means 0 and 2 and with the same unit variance are generated. The ML estimates θ̂_2 of μ are presented in Table 1, which shows that the estimates converge to their true prior probabilities as the number of samples increases.

Case 2. Two exponential distributions with means 1 and 4 are generated. The ML estimates are also presented in Table 1. Since the variance of an exponential distribution is large, the convergence of the estimates is not as good as in the normal case, but the ML estimates are still close to the true prior probabilities.


Table 1. The convergence performance of the estimate θ̂_2 of the true prior as the number of mixed samples increases, where the estimate θ̂_2 is obtained by the algorithm. (Each entry is the first component of θ̂_2, i.e. the estimated prior of class 1.)

(Normal distributions with mean 1 = 0, mean 2 = 2, variance 1 = 1, variance 2 = 1)

                 No. of unidentified mixed samples
True priors      200     400     600     800     1000    1200
(0.05, 0.95)     0.02    0.02    0.02    0.03    0.04    0.04
(0.20, 0.80)     0.13    0.17    0.20    0.20    0.20    0.20
(0.34, 0.66)     0.29    0.32    0.33    0.34    0.34    0.34
(0.50, 0.50)     0.47    0.49    0.50    0.49    0.49    0.50
(0.76, 0.24)     0.73    0.76    0.74    0.76    0.76    0.76
(0.95, 0.05)     0.90    0.94    0.94    0.94    0.95    0.96

(Exponential distributions with mean 1 = 1, mean 2 = 4, variance 1 = 1, variance 2 = 16)

                 No. of unidentified mixed samples
True priors      200     400     600     800     1000    1200
(0.05, 0.95)     0.05    0.01    0.07    0.07    0.06    0.05
(0.20, 0.80)     0.22    0.27    0.24    0.23    0.22    0.21
(0.34, 0.66)     0.37    0.43    0.38    0.37    0.36    0.35
(0.50, 0.50)     0.52    0.60    0.54    0.52    0.52    0.51
(0.76, 0.24)     0.78    0.83    0.79    0.78    0.77    0.77
(0.95, 0.05)     0.97    0.98    0.96    0.96    0.96    0.95

Table 2. The convergence performance of the estimate θ̂_2 of the true prior as the number of mixed samples increases, where the estimate θ̂_2 is obtained by the algorithm

(Three classes of bivariate normal distributions with mean 1 = (8, 5), mean 2 = (8, 7), mean 3 = (10, 6) and identity covariance matrix)

                 No. of unidentified mixed samples
True priors      100     200     400     600     800     1000
0.4 (class 1)    0.46    0.47    0.43    0.39    0.40    0.40
0.4 (class 2)    0.34    0.32    0.36    0.39    0.40    0.39
0.2 (class 3)    0.20    0.21    0.21    0.22    0.20    0.21

0.6 (class 1)    0.76    0.73    0.65    0.62    0.60    0.61
0.2 (class 2)    0.13    0.12    0.18    0.20    0.21    0.19
0.2 (class 3)    0.11    0.15    0.17    0.18    0.19    0.20

0.4 (class 1)    0.55    0.49    0.45    0.42    0.40    0.40
0.3 (class 2)    0.29    0.28    0.29    0.30    0.30    0.30
0.3 (class 3)    0.16    0.23    0.26    0.28    0.30    0.30

Case 3. Three bivariate normal distributions with mean vectors (8, 5), (8, 7) and (10, 6) and identity covariance matrices are generated. The ML estimates are presented in Table 2. All ML estimates converge to the true prior probabilities.
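For readers who wish to reproduce a setting like Case 3, the following is a sketch in our own code; the random seed, the sample size, and the use of the grid_search_mle helper sketched in Section 3 are our assumptions, not part of the paper. It draws mixed samples from the three bivariate normals above with the third prior setting of Table 2 and estimates the priors.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
mu_true = np.array([0.4, 0.3, 0.3])          # third prior setting of Table 2
means = [np.array([8.0, 5.0]), np.array([8.0, 7.0]), np.array([10.0, 6.0])]
n = 1000

# draw mixed (unlabelled) samples: pick a class with probability mu_true, then sample from it
labels = rng.choice(3, size=n, p=mu_true)
samples = np.stack([rng.multivariate_normal(means[c], np.eye(2)) for c in labels])

# known class-conditional densities with identity covariance, as in Case 3
class_densities = [lambda x, m=m: multivariate_normal.pdf(x, mean=m, cov=np.eye(2))
                   for m in means]

theta_hat = grid_search_mle(samples, class_densities)   # helper from the sketch in Section 3
print(theta_hat)   # expected to land near (0.4, 0.3, 0.3) for large n, as in Table 2
```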


4. CONCLUSION

In this paper, the concavity of the Kullback-Leibler information number is used to establish an algorithm which can provide the ML estimate of the unknown prior probability of each class. The concavity shows the direction in which to find the MLE of the true prior probabilities, i.e. to climb to the highest value of the concave Kullback-Leibler information function. Our procedure is simple, and the estimate of a prior can be easily computed. The prior estimation accuracy can be improved by continuing to take unidentified samples, and according to Theorem 1 the estimates will converge to the true unknown priors. The results of a simulation study with normal and exponential distributions show that the estimates from our procedure converge to the true unknown prior probabilities of the mixtures.

Acknowledgements--The author is grateful to the editor and referees for their valuable suggestions for the improvement of this paper.

REFERENCES

1. K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, New York (1972).
2. R. L. Kashyap, C. C. Blaydon and K. S. Fu, Stochastic approximation, in Adaptation, Learning and Pattern Recognition Systems: Theory and Applications, J. M. Mendel and K. S. Fu, eds. Academic Press, New York (1970).
3. T. Y. Young and T. W. Calvert, Classification, Estimation and Pattern Recognition. Elsevier, New York (1974).
4. D. Kazakos, Recursive estimation of prior probabilities using a mixture, IEEE Trans. Inform. Theory IT-23, 203-211 (March 1977).
5. U. E. Makov and A. F. M. Smith, A quasi-Bayes unsupervised learning procedure for priors, IEEE Trans. Inform. Theory IT-23, 761-764 (November 1977).
6. T. Y. Young and C. Coraluppi, Stochastic estimation of a mixture of normal density functions using an information criterion, IEEE Trans. Inform. Theory IT-16, 258-263 (May 1970).
7. R. Mizoguchi and M. Shimura, An approach to unsupervised learning classification, IEEE Trans. Comput. C-24, 979-983 (October 1975).
8. J. G. Postaire and C. P. A. Vasseur, An approximate solution to normal mixture identification with application to unsupervised pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-3, 163-179 (March 1981).
9. S. S. Wilks, Mathematical Statistics. J. Wiley and Sons, New York (1962).

About the Author--TZE FEN LI received the B.A. degree in forestry from Chung Hsing University, Taichung, Taiwan, in 1962 and the Ph.D. degree in statistics from Michigan State University in 1981. Dr Li then taught statistics at the University of North Carolina at Charlotte for 2 years and computer science at Rutgers University at Camden for 6 years. He is at present a professor of computer science at Chung Hsing University. Dr Li is interested in statistics and pattern recognition and has published papers in statistics and computer science.