A comparison of Gaussian and Pearson mixture modeling for pattern recognition and computer vision applications




Pattern Recognition Letters 20 (1999) 305–313

Swarup Medasani, Raghu Krishnapuram *

Department of Mathematics and Computer Science, Colorado School of Mines, Golden, CO 80401, USA

Received 30 December 1997; received in revised form 29 September 1998

Abstract

Gaussians are widely accepted and used in mixture modeling. At the same time, other models such as Pearson Type II distributions have not received much attention. In this paper, we compare the modeling capabilities of Gaussian mixtures with those of Pearson Type II mixtures for certain pattern recognition and computer vision applications. In particular, we compare the performance of Gaussian and Pearson Type II mixtures on classification of several benchmark pattern recognition data sets, and on edge and plane modeling. We introduce an agglomerative technique for mixture decomposition which automatically determines the number of components required to model the data efficiently. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Maximum likelihood; Mixture decomposition; Gaussian mixtures; Pearson Type II mixtures; Agglomerative techniques

1. Introduction

Mixtures have been used extensively as models in a wide variety of important practical situations where data can be viewed as arising from several populations mixed in varying proportions. Mixture modeling can be viewed as a superposition of a finite number of component densities. The problem of estimating the parameters of the components of a mixture has been the subject of diverse studies (Everitt and Hand, 1981; Redner and Walker, 1984). The isotropic and unimodal nature of Gaussian functions, along with their capability to represent the distribution compactly by a mean vector and covariance matrix, have made Gaussian Mixture Decomposition (GMD) a popular technique.

* Corresponding author. Tel.: +1-303-273-3860; fax: +1-303-273-3875; e-mail: [email protected]

The Expectation Maximization (EM) algorithm (Dempster et al., 1977) is widely used to estimate the parameters of the components in a mixture. However, Gaussian mixtures may not be the best choice in all applications. For instance, the range image of a polyhedron can be modeled effectively as a mixture of uniformly distributed patches rather than as Gaussians. The Pearson Type II density function is ellipsoidally symmetric with a compact representation. As can be seen from Fig. 1, by changing a parameter denoted by K, the Pearson Type II density function can be made to approximate a wide range of shapes, including uniform distributions (K → 0) and Gaussians (K → ∞). In this paper, we consider the Pearson Type II density function and discuss its advantages in modeling certain problems. We introduce an agglomerative approach to mixture decomposition (Medasani and Krishnapuram, 1997) which automatically finds the number of components required to model the mixture efficiently. The agglomerative approach is computationally more attractive than traditional techniques, which usually run the EM algorithm a number of times, each time with a different number of components, and pick the case that maximizes a chosen best-fit criterion. Examples of such criteria are the Likelihood Ratio Test (Wolfe, 1971), the Akaike Information Criterion (Akaike, 1974) and the Minimum Description Length (Rissanen, 1978). We test the modeling capabilities of Gaussian and Pearson Type II mixtures on several kinds of real and synthetic data sets.

The rest of the paper is organized as follows. In Section 2, we briefly review GMD and Pearson Type II Mixture Decomposition (PMD), and present the EM update equations. In Section 3, we introduce the agglomerative version of mixture decomposition, present the update equations for Agglomerative Gaussian Mixture Decomposition (AGMD) and derive the update equations for Agglomerative Pearson Type II Mixture Decomposition (APMD). In Section 4, we compare the performance of the proposed APMD algorithm with that of AGMD on problems such as pattern classification, and edge and plane segmentation. Finally, in Section 5, we present the conclusions.




2. Mixture decomposition

Let X = {x_j | j = 1, ..., N} be a set of N random vectors drawn from an independent and identically distributed (i.i.d.) mixture model. The log-likelihood of the observed samples is defined as the logarithm of the joint density p(X|θ) = ∏_{j=1}^{N} p(x_j|θ), and is given by

J = \log p(X \mid \theta) = \sum_{j=1}^{N} \log p(x_j \mid \theta) = \sum_{j=1}^{N} \log\left(\sum_{i=1}^{c} P(\omega_i)\, p(x_j \mid \omega_i, \theta_i)\right) = \sum_{j=1}^{N} \log\left(\sum_{i=1}^{c} p_{ij}\right),    (1)

where p_ij = P(ω_i) p(x_j | ω_i, θ_i). In Eq. (1), P(ω_i), i = 1, ..., c, represent the mixing parameters, p(x_j | ω_i, θ_i) represents the component density corresponding to component ω_i, and θ = (θ_1, ..., θ_c) represents the mixture density parameters. The maximum likelihood estimate of θ is the one that maximizes J. If the mixture consists of c components, and p(X|θ) is differentiable with respect to θ, then the likelihood estimate for θ must satisfy the following necessary conditions:

\frac{\partial J}{\partial \theta_i} = \sum_{j=1}^{N} \frac{p_{ij}}{\sum_{k=1}^{c} p_{kj}}\, \frac{\partial}{\partial \theta_i} \log p_{ij} = 0, \qquad i = 1, \ldots, c.    (2)

Fig. 1. Plot of the univariate normal distribution and the Pearson Type II distribution for different values of K. The parameters are estimated from 11 points in [−5, 5] placed at uniform intervals.

We now present the necessary conditions for Gaussian and Pearson Type II mixtures.

2.1. Gaussian mixtures

For the Gaussian density function given by

p(x_j \mid \omega_i, \theta_i) = \frac{1}{(2\pi)^{n/2}\, |C_i|^{1/2}} \exp\left(-\frac{1}{2}\,(x_j - m_i)^T C_i^{-1} (x_j - m_i)\right),    (3)

it can be shown (Duda and Hart, 1973) that the maximum likelihood estimates of the means m_i and covariance matrices C_i of the c Gaussian components constituting the mixture must satisfy

m_i = \frac{\sum_{j=1}^{N} l_{ij}\, x_j}{\sum_{j=1}^{N} l_{ij}}    (4)

and

C_i = \frac{\sum_{j=1}^{N} l_{ij}\,(x_j - m_i)(x_j - m_i)^T}{\sum_{j=1}^{N} l_{ij}}.    (5)

In Eqs. (4) and (5), l_ij is the a posteriori probability of the sample x_j belonging to component i, and is given by

l_{ij} = \frac{p_{ij}}{\sum_{k=1}^{c} p_{kj}}.    (6)

The maximum likelihood estimate for P(ω_i) is obtained by maximizing Eq. (1), subject to P(ω_i) ≥ 0 and

\sum_{i=1}^{c} P(\omega_i) = 1.    (7)

This yields

P(\omega_i) = \frac{1}{N} \sum_{j=1}^{N} l_{ij}.    (8)

Starting with an initial estimate for l_ij, the EM algorithm (Dempster et al., 1977) uses Eqs. (4), (5), (8) and (6) in an iterative fashion to decompose the mixture.
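To make the iteration concrete, the following NumPy sketch (our illustration, not code from the paper) performs one EM pass using Eqs. (4)–(6) and (8); the function name gmd_em_step and the array layout are assumptions of this example.

```python
import numpy as np

def gmd_em_step(X, means, covs, priors):
    """One EM iteration for Gaussian mixture decomposition (Eqs. (4)-(6), (8)).

    X : (N, n) data, means : (c, n), covs : (c, n, n), priors : (c,) P(omega_i).
    """
    N, n = X.shape
    c = means.shape[0]

    # E-step: p_ij = P(omega_i) p(x_j | omega_i, theta_i), then posteriors l_ij (Eq. (6)).
    p = np.empty((c, N))
    for i in range(c):
        d = X - means[i]
        inv_cov = np.linalg.inv(covs[i])
        norm = 1.0 / ((2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(covs[i])))
        mahal = np.einsum('jk,kl,jl->j', d, inv_cov, d)
        p[i] = priors[i] * norm * np.exp(-0.5 * mahal)
    l = p / p.sum(axis=0, keepdims=True)

    # M-step: means (Eq. (4)), covariances (Eq. (5)), mixing parameters (Eq. (8)).
    w = l.sum(axis=1)                        # sum_j l_ij for each component
    new_means = (l @ X) / w[:, None]
    new_covs = np.empty_like(covs)
    for i in range(c):
        d = X - new_means[i]
        new_covs[i] = (l[i, :, None] * d).T @ d / w[i]
    new_priors = w / N
    return new_means, new_covs, new_priors, l
```

Iterating this step until the parameters stabilize reproduces the GMD procedure described above.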

2.2. Pearson Type II mixtures

The Pearson Type II density function (Kotz, 1997; Tou and Gonzalez, 1974) is a symmetric function given by

p(x_j \mid \omega_i, \theta_i) = \begin{cases} \dfrac{\Gamma(K + 1 + n/2)}{\Gamma(K + 1)\,\pi^{n/2}}\, |W|^{1/2}\, D^{K}, & x_j \in \text{region } R, \\[4pt] 0, & \text{elsewhere}, \end{cases}    (9)

where D = [1 − (x_j − m_i)^T W (x_j − m_i)] and Γ(·) refers to the gamma function. The region R denotes the interior of the hyper-ellipsoid (x_j − m_i)^T W (x_j − m_i) = 1, and the weight matrix W is given by W = (n + 2(K + 1))^{-1} C_i^{-1}, K ≥ 0, where C_i is the covariance matrix and n is the dimensionality of the data. The parameter K determines the shape of the density function. From Eqs. (1) and (9), we derive the necessary conditions for PMD:

m_i = \frac{\sum_{x_j \in R_i} l_{ij}\, x_j / q_{ij}}{\sum_{x_j \in R_i} l_{ij} / q_{ij}}    (10)

and

C_i = \frac{\sum_{x_j \in R_i} Z\, l_{ij}\,(x_j - m_i)(x_j - m_i)^T / q_{ij}}{\sum_{x_j \in R_i} l_{ij}},    (11)

where l_ij is the same as given in Eq. (6), q_ij = [1 − (x_j − m_i)^T C_i^{-1} (x_j − m_i)(n + 2(K + 1))^{-1}] and Z = 2K/(n + 2(K + 1)). The maximum likelihood estimate for P(ω_i) can be shown to be the same as in Eq. (8). Starting with initial estimates for l_ij, Eqs. (10), (11), (8) and (6) can be used in an iterative fashion to find the parameters of the components in the mixture.
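For illustration only (not from the paper), the Pearson Type II density of Eq. (9) can be evaluated as follows; the name pearson2_pdf and the use of scipy.special.gammaln are our own choices.

```python
import numpy as np
from scipy.special import gammaln

def pearson2_pdf(X, mean, cov, K):
    """Multivariate Pearson Type II density of Eq. (9); zero outside the support ellipsoid."""
    n = X.shape[1]
    W = np.linalg.inv(cov) / (n + 2.0 * (K + 1.0))       # weight matrix W
    d = X - mean
    q = np.einsum('jk,kl,jl->j', d, W, d)                # (x_j - m)^T W (x_j - m)
    D = np.clip(1.0 - q, 0.0, None)                      # D of Eq. (9)
    # log of Gamma(K + 1 + n/2) / (Gamma(K + 1) pi^{n/2}) * |W|^{1/2}
    log_const = (gammaln(K + 1.0 + n / 2.0) - gammaln(K + 1.0)
                 - (n / 2.0) * np.log(np.pi)
                 + 0.5 * np.log(np.linalg.det(W)))
    return np.where(q < 1.0, np.exp(log_const) * D ** K, 0.0)
```

The M-step of Eqs. (10) and (11) then differs from the Gaussian case only in the 1/q_ij weighting and the factor Z, so a gmd_em_step-style routine can be adapted by replacing the weighted sums accordingly.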

3. Agglomerative mixture decomposition

We use an agglomerative technique introduced in (Medasani and Krishnapuram, 1997; Frigui and Krishnapuram, 1997) to find the number of components in a mixture automatically. This is accomplished by adding a second term to the objective function in Eq. (1), as shown in Eq. (12). The agglomerative scheme starts with an over-specified number of components in the mixture, and as the algorithm proceeds, the components compete to model the data. Only the fittest, i.e., the components that model the data efficiently, survive, resulting in the "optimal" number of components required to model the data. Thus, the objective function used by the proposed agglomerative mixture decomposition technique is

J = \sum_{j=1}^{N} \log\left(\sum_{i=1}^{c} p_{ij}\right) + \alpha \sum_{i=1}^{c} P^{2}(\omega_i),    (12)

which is maximized subject to Eq. (7). The first term in Eq. (12) is the log-likelihood function, and it takes on its global maximum value when each component represents only one of the feature vectors (in this case, we can choose m_i = x_i and, for arbitrary non-singular C_i, p(x_j | ω_i, θ_i) = 1, ∀j). The second term in Eq. (12) is the agglomerative term, and it reaches its maximum when all the feature vectors are modeled by a single component, i.e., when P(ω_i) = 1 for some i and P(ω_j) = 0, ∀ j ≠ i.



When both terms in Eq. (12) are appropriately combined, the final partition maximizes the likelihood of the input feature vectors and at the same time models the mixture using the optimal number of clusters.

To maximize J with respect to the mixture parameters θ = [θ_1, ..., θ_c], we set the gradient of J with respect to θ_i = (m_i, C_i, P(ω_i)) to zero. This gives us the update equations for the agglomerative versions of GMD and PMD. Since the second term in Eq. (12) does not involve m_i and C_i, the update equations for m_i and C_i in the Agglomerative Gaussian Mixture Decomposition (AGMD) are still given by Eqs. (4) and (5), respectively. The update equation for the a priori probability can be shown to be

\bigl(P(\omega_i)\bigr)^{(t)} = \frac{\sum_{j=1}^{N} l_{ij} + 2\alpha\,\bigl(P(\omega_i)^{(t-1)}\bigr)^{2}}{N + 2\alpha \sum_{k=1}^{c} \bigl(P(\omega_k)^{(t-1)}\bigr)^{2}}.    (13)

In Eq. (13), t denotes the iteration number, and the superscripts (t) and (t − 1) indicate values from iterations t and t − 1, respectively. From Eq. (13), we can see that when α approaches zero, Eq. (13) reduces to the update equation for P(ω_i) in Eq. (8). In the case of APMD, the update equations for the mean m_i and the covariance matrix C_i are the same as in Eqs. (10) and (11), respectively, and the update equation for P(ω_i) is as in Eq. (13).

The choice of α is critical to the effective performance of the agglomerative scheme. Since the first term in Eq. (12) is the log-likelihood function and the second term is the agglomeration factor, α specifies the trade-off between the required likelihood of the data and the number of components to be found. Both in AGMD and APMD, we can choose α to be the ratio of the first term to the second term in Eq. (12) in each iteration, i.e.,

\alpha^{(t)} = g(t)\, \frac{\sum_{j=1}^{N} \log\left(\sum_{i=1}^{c} p_{ij}^{(t-1)}\right)}{\sum_{i=1}^{c} \bigl(P(\omega_i)^{(t-1)}\bigr)^{2}}.    (14)

We now present the implementation details of the AGMD/APMD algorithms. We use the Fuzzy C-Means (FCM) algorithm (Bezdek, 1981) for 10 iterations to generate an initial partition with an over-specified number of clusters c = c_max. Since FCM favors clusters of spheroidal shapes, we use the non-agglomerative mode of the AGMD/APMD algorithm (with α = 0) for 5 iterations so as to tune and adjust the covariance matrices according to the shapes and sizes of the initial components. The AGMD/APMD algorithm is then run using the result of the previous stage. To provide a good transition from the non-agglomerative mode to the agglomerative mode of the AGMD/APMD algorithm, we increase the value of α gradually, starting from zero, and let it remain steady at its peak value. During this stage, the weaker components lose their points to stronger ones and eventually vanish. After the agglomeration results in the necessary component reduction, α is gradually made to decay to zero. Once the value of α approaches zero, the AGMD/APMD algorithm once again runs in the non-agglomerative mode with the "optimal" number of clusters until convergence. Thus, we choose the "annealing schedule" for g(t) as follows:

g(t) = \begin{cases} 0, & t \le \tau_1, \\ e^{p(t - \tau_1)} - 1, & \tau_1 < t \le \tau_2, \\ e^{p(\tau_2 - \tau_1)} - 1, & \tau_2 < t \le \tau_3, \\ \bigl(e^{p(\tau_2 - \tau_1)} - 1\bigr)\, e^{-q(t - \tau_3)}, & t > \tau_3. \end{cases}    (15)

In Eq. (15), τ_1, τ_2, τ_3 are integer constants (measured in iterations t), and p and q are constants which control the agglomeration rate. A plot of a general profile of g(t) versus the iteration number t is shown in Fig. 2.

Fig. 2. Plot of the annealing schedule g(t) versus iterations t.
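For illustration (not the authors' code), Eqs. (13)–(15) translate almost directly into NumPy; the function names and the convention of passing the previous iteration's posteriors and priors explicitly are assumptions of this sketch.

```python
import numpy as np

def annealing_schedule(t, tau1, tau2, tau3, p, q):
    """Annealing factor g(t) of Eq. (15)."""
    if t <= tau1:
        return 0.0
    if t <= tau2:
        return np.exp(p * (t - tau1)) - 1.0
    peak = np.exp(p * (tau2 - tau1)) - 1.0
    if t <= tau3:
        return peak
    return peak * np.exp(-q * (t - tau3))

def agglomerative_prior_update(l, priors, p_ij, g_t):
    """alpha of Eq. (14) and the mixing-parameter update of Eq. (13).

    l : (c, N) posteriors, priors : (c,) previous P(omega_i),
    p_ij : (c, N) weighted component densities, g_t : annealing factor g(t).
    """
    N = l.shape[1]
    loglik = np.sum(np.log(p_ij.sum(axis=0)))           # first term of Eq. (12)
    alpha = g_t * loglik / np.sum(priors ** 2)           # Eq. (14)
    new_priors = (l.sum(axis=1) + 2.0 * alpha * priors ** 2) / (
        N + 2.0 * alpha * np.sum(priors ** 2))           # Eq. (13)
    return alpha, new_priors
```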


In the case of small data sets with high dimensionality, the covariance matrices can be assumed to be diagonal for stability reasons. The update equations for the mean and the a priori probabilities remain the same for this special case. The update equation for the covariance matrices is different, and is obtained as follows. Since the covariance matrix is assumed to be diagonal, we differentiate Eq. (12) w.r.t. σ_il^{-1}, the square root of the lth diagonal element of C_i^{-1}. This gives us

\frac{\partial J}{\partial \sigma_{il}^{-1}} = \sum_{j=1}^{N} l_{ij}\left\{\sigma_{il}^{2} - \frac{Z\,(x_{jl} - m_{il})^{2}}{q_{ij}}\right\} = 0, \quad \text{or} \quad \sigma_{il}^{2} = \frac{\sum_{j=1}^{N} Z\, l_{ij}\,(x_{jl} - m_{il})^{2} / q_{ij}}{\sum_{j=1}^{N} l_{ij}}.    (16)

Ideally, a component should be deleted when its mixing parameter P(ω_i) drops to zero. To prevent instabilities in the computation of C_i^{-1}, we delete a component whenever P(ω_i) is less than ε = (1% of N + n)/N, where n is the dimensionality of the input feature space, and N is the total number of points in the mixture. The AGMD/APMD algorithm is presented below.

AGMD (APMD) Algorithm
  Fix the maximum number of components c = c_max;
  Initialize l_ij as the fuzzy partition matrix provided by FCM;
  Compute P(ω_i) for i = 1, ..., c using Eq. (8);
  REPEAT
    Update prototypes m_i and C_i using Eqs. (4) and (5) for AGMD, and Eqs. (10) and (11) for APMD;
    Compute l_ij for 1 ≤ i ≤ c and 1 ≤ j ≤ N using Eq. (6);
    Compute α using Eq. (14);
    Update P(ω_i) for i = 1, ..., c using Eq. (13);
    IF P(ω_i) < ε, discard component i;
  UNTIL (prototypes stabilize).
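A compact driver corresponding to the pseudocode above might look like the sketch below (our illustration; it assumes the annealing_schedule and agglomerative_prior_update helpers sketched earlier, and caller-supplied E-step and M-step routines implementing either the Gaussian or the Pearson Type II updates).

```python
import numpy as np

def agglomerative_decomposition(X, init_means, init_covs, e_step, m_step,
                                tau=(5, 15, 25), p=0.1, q=0.3, max_iter=100):
    """Skeleton of the AGMD/APMD loop with component deletion.

    e_step(X, means, covs, priors) -> ((c, N) posteriors l_ij, (c, N) densities p_ij)
    m_step(X, posteriors)          -> updated (means, covs) via Eqs. (4)-(5) or (10)-(11)
    """
    N, n = X.shape
    means, covs = init_means.copy(), init_covs.copy()
    # The paper initializes P(omega_i) from the FCM partition via Eq. (8);
    # a uniform start is used here for brevity.
    priors = np.full(means.shape[0], 1.0 / means.shape[0])
    eps = (0.01 * N + n) / N                     # deletion threshold epsilon

    # A convergence test on the prototypes would replace the fixed iteration count.
    for t in range(1, max_iter + 1):
        l, p_ij = e_step(X, means, covs, priors)                # Eq. (6)
        means, covs = m_step(X, l)                              # M-step
        g_t = annealing_schedule(t, *tau, p=p, q=q)             # Eq. (15)
        alpha, priors = agglomerative_prior_update(l, priors, p_ij, g_t)  # Eqs. (13)-(14)
        keep = priors >= eps                                    # discard weak components
        means, covs, priors = means[keep], covs[keep], priors[keep]
        priors = priors / priors.sum()    # renormalization after deletion (not stated in the paper)
    return means, covs, priors
```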

4. Experimental results

In this section, we compare the modeling capabilities of mixtures represented by Gaussians with those represented by Pearson Type II density functions.


In the pattern recognition application, AGMD and APMD were used to model the class conditional densities in eight standard pattern recognition data sets. The classification was performed using the Bayes rule after the class conditional densities were estimated. While estimating the conditional density function for each class β_k, we use AGMD/APMD to model the density function for β_k as a mixture of multiple components ω_ki. Let p(x_j | ω_ki) be the conditional probability of selecting input x_j given component ω_ki, and let P(ω_ki) be the mixing proportion of component ω_ki in class β_k. Then the conditional probability of selecting input x_j, given class β_k, is given by p(x_j | β_k) = Σ_{i=1}^{N_k} P(ω_ki) p(x_j | ω_ki), where N_k represents the number of components in class β_k. If we also assume that P(β_k) is the a priori probability for class β_k, then we can classify x_j by assigning it to class β_k if P(β_k) p(x_j | β_k) ≥ P(β_l) p(x_j | β_l), ∀ l ≠ k.

Table 1 presents the characteristics of the data sets used, namely, the number of classes, the number of samples per class, and the dimensionality. The Synthetic 2-class data set was obtained from the University of Oxford (Ripley, 1996). The 2-D Noisy spiral data set was locally synthesized using the following equations for the two dimensions: x = (0.05 + 0.1·θ/2π) cos(θ) + 0.5 and y = (0.05 + 0.1·θ/2π) sin(θ) + 0.5. Uniformly distributed noise in the interval [−0.01, 0.01] was then added to both the x and y locations. The Breast Cancer, Pima Indian Diabetes, Heart Disease, Bupa Liver Disorders, and the German and Australian Credit Card data sets were obtained from the machine learning repository at the University of California, Irvine (Murphy and Aha, 1973).
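The decision rule above can be written compactly as follows (our sketch, not the paper's code); the list-of-components interface and the reuse of a density routine such as the hypothetical pearson2_pdf are assumptions.

```python
import numpy as np

def classify_bayes(X, class_models, class_priors, component_pdf):
    """Assign each row of X to the class k maximizing P(beta_k) p(x | beta_k).

    class_models[k] : list of (P(omega_ki), mean, cov) triples for class k
    class_priors[k] : a priori class probability P(beta_k)
    component_pdf   : callable(X, mean, cov) -> (N,) component densities
    """
    scores = []
    for k, components in enumerate(class_models):
        # p(x | beta_k) = sum_i P(omega_ki) p(x | omega_ki)
        cond = sum(w * component_pdf(X, m, C) for (w, m, C) in components)
        scores.append(class_priors[k] * cond)
    return np.argmax(np.stack(scores), axis=0)
```

For APMD one could pass, e.g., functools.partial(pearson2_pdf, K=K) as component_pdf; for AGMD a Gaussian density would be used instead.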

Table 1
Characteristics of the data sets

Data sets                 No. of classes   Data in Class 1   Data in Class 2   No. of features
Synthetic                 2                444               239               2
Noisy spiral              2                211               211               2
Breast cancer             2                444               239               9
Pima Indian diabetes      2                500               268               8
Heart disease             2                150               120               13
Bupa liver disorders      2                145               200               6
German credit card        2                525               225               24
Australian credit card    2                383               307               14



Table 2
Comparison of testing and training classification rates on eight benchmark data sets using the AGMD and APMD algorithms

                          AGMD                     APMD
Data sets                 Training   Testing       Training   Testing       K
Synthetic                 91.23      90.73         91.45      91.2          30
Noisy spiral              99.23      90.73         99.85      93.2          30
Breast cancer             96.68      95.9          95.6       96.8          10
Pima Indian diabetes      73.97      70.32         77         75.6          30
Heart disease             89.75      81.85         91.12      81.86         10
Bupa liver disorders      77.95      68.98         79.9       71.03         15
German credit card        69.8       69.5          73.07      70.2          15
Australian credit card    85.66      84.36         85.95      84.94         30

Each of the data sets was divided into training and testing sets using a 25% jackknife procedure. Four such jackknifes were generated randomly for each of the data sets. All the results presented in this section are the average values over the four jackknifes. The training set was used to estimate the component parameters, and the resulting mixture densities were used with a Bayes classifier to classify the testing data. The classification rates for the eight data sets using the AGMD and APMD algorithms are presented in Table 2. APMD was tried for a range of K values, with K varying from 5 to 40 in increments of 5. The K values corresponding to the best results on the training set were used for testing. These values are also given in Table 2. For the Breast Cancer, Pima Indian Diabetes, Heart Disease, German Credit Card, and Australian Credit Card data sets, the covariance matrices were constrained to be diagonal to eliminate instability problems. In Table 3, we present the number of components per class that were found by AGMD and APMD.

From Table 2, we can see that in the case of the test data sets the APMD algorithm gives consistently higher classification rates than the AGMD algorithm. In the case of the training sets, the APMD algorithm is almost always better than AGMD, except for the Breast Cancer data set. APMD uses approximately the same number of components as AGMD for the eight data sets.

In the computer vision applications, AGMD and APMD were used for modeling edges and planes.

Bivariate Pearson Type II distributions with a very small variance in one direction and a relatively large variance in the perpendicular direction can be used to model edge segments in 2-D. Similarly, trivariate Pearson Type II distributions can be used to model planar surfaces in 3-D. Due to space constraints, we show only one example of edge modeling, in Fig. 3. Fig. 3(a) shows the original image of a polyhedral object, and the Sobel edge image corresponding to Fig. 3(a) is shown in Fig. 3(b). The FCM result after 15 iterations with 45 components, used as the initialization by AGMD as well as APMD, is shown in Fig. 3(c). The means are represented by black squares, and the ellipses include points that are within a Mahalanobis distance of 4. The initial covariance matrices were estimated from the FCM partitions. The results of the AGMD and the APMD algorithms are shown in Fig. 3(d) and (e), respectively.

Table 3
Number of components per class used by the AGMD and APMD algorithms

                          AGMD                  APMD
Data sets                 Class 1   Class 2     Class 1   Class 2
Synthetic                 3.25      3.5         3.25      3.5
Noisy spiral              21.75     21.75       22        22
Breast cancer             2         3           2.5       3.25
Pima Indian diabetes      2.25      2.25        2.25      2
Heart disease             2.5       2.5         1         2.25
Bupa liver disorders      3.25      3.25        2.75      2.75
German credit card        3.25      3.5         2         2
Australian credit card    3         3           2.25      2.25



Fig. 3. Comparison of the edge detection results from the AGMD and APMD algorithms. (a) The original 256 × 256 image. (b) Edge image of (a). (c) Initialization used by AGMD and APMD. (d) Segmented edges from AGMD. (e) Segmented edges from APMD.

The APMD algorithm was run with K = 2. It can be seen that AGMD fails to detect the small edges on the top of the smaller object. The APMD algorithm finds 19 components, while the AGMD algorithm determines the number of components to be 17.

Fig. 4(a) and Fig. 5(a) show 200 × 200 range images of a block, obtained from the Environmental Research Institute of Michigan (ERIM), and of a chair, obtained from the University of Southern California (USC). The hard segmentation corresponding to the initialization obtained from FCM after running for five iterations with 25 components is shown in Fig. 4(b) and Fig. 5(b).

The boundaries between the components are shown in white. The AGMD algorithm models the block in Fig. 4(a) using seven components (Fig. 4(c)), while APMD (intuitively correctly) detects and models the block using six components (Fig. 4(d)). For the example in Fig. 5(a), the AGMD algorithm models the chair using 12 components (Fig. 5(c)), while APMD finds seven components (Fig. 5(d)) and models the data intuitively correctly. Since we know that planes are uniformly distributed, the parameter K was set to 1 for APMD. As explained earlier, the Pearson Type II distribution with a low value of K is a good approximation to the uniform distribution.
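To see this effect numerically, the short snippet below (our illustration, reusing the hypothetical pearson2_pdf helper from Section 2.2) evaluates a univariate Pearson Type II density for a small and a large K: the small-K profile is nearly flat over its support, while the large-K profile is close to a Gaussian bell.

```python
import numpy as np

# Assumes pearson2_pdf from the earlier sketch is available.
x = np.linspace(-5.0, 5.0, 11).reshape(-1, 1)    # 11 points in [-5, 5], as in Fig. 1
mean = np.zeros(1)
cov = np.array([[np.var(x)]])                     # sample variance as the 1-D covariance

near_uniform = pearson2_pdf(x, mean, cov, K=0.1)    # low K: nearly uniform over the support
near_gaussian = pearson2_pdf(x, mean, cov, K=30.0)  # high K: close to a Gaussian profile
print(np.round(near_uniform, 3))
print(np.round(near_gaussian, 3))
```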

Fig. 4. Comparison of the plane detection results from the AGMD and APMD algorithms. (a) The original 200 × 200 image of a block. (b) Initialization used by AGMD and APMD. (c) Segmented image from AGMD. (d) Segmented image from APMD.



Fig. 5. Comparison of the plane detection results from the AGMD and APMD algorithms. (a) The original 200 × 200 image of a chair. (b) Initialization used by AGMD and APMD. (c) Segmented image from AGMD. (d) Segmented image from APMD.

From the edge and plane modeling results, we can see that APMD models edges and planes better than AGMD.

5. Conclusions

In this paper, we consider Pearson Type II mixtures and compare the modeling capabilities of Gaussian and Pearson Type II mixtures. Pearson Type II mixtures have the advantage that, by varying a parameter K, they can be made to approximate a wide variety of shapes. The AGMD and APMD algorithms use an agglomerative technique to find the number of components required to model the mixture efficiently, thereby overcoming the drawback of the EM algorithm. From the results on the benchmark pattern recognition data sets, we can say that Pearson mixtures have better modeling and generalization capabilities than Gaussian mixtures, as indicated by the higher classification rates. The results on edge and range data indicate that Pearson Type II mixtures are more versatile than Gaussian mixtures for edge and plane modeling.

For a given value of K, the computational complexity of APMD is very similar to that of AGMD. For example, on the Bupa liver disorders data set, the AGMD algorithm took 48 milliseconds to train while the APMD algorithm took 63 milliseconds on a Sun SPARC Ultra 1 machine. (In theory, we could derive necessary conditions to optimize J in Eq. (12) with respect to K. However, the resulting equations are rather unwieldy.)

When the nature of the distribution is approximately known, we can pick an appropriate value for K. For instance, a value of 1 for K works well for edge and plane modeling. For general data sets, we recommend that APMD be run on the training data set for a few values of K (e.g., K = 1, 5, 10, 15, 20 and 30) and the value that gives the best performance be picked. Although this procedure increases the computational complexity to a certain degree, it is still quite reasonable because training is done "off-line" and we do obtain improved performance.
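A minimal sketch of this selection procedure (ours, not the paper's; train_apmd and evaluate are hypothetical helpers that fit an APMD model for a given K and return its training classification rate):

```python
def select_K(X_train, y_train, train_apmd, evaluate,
             candidate_Ks=(1, 5, 10, 15, 20, 30)):
    """Pick the K whose APMD model scores best on the training data."""
    scored = []
    for K in candidate_Ks:
        model = train_apmd(X_train, y_train, K)           # fit class-conditional mixtures
        scored.append((evaluate(model, X_train, y_train), K))
    best_rate, best_K = max(scored)                       # highest training rate wins
    return best_K, best_rate
```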

References

Akaike, H., 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control AC-19 (6), 716–723.
Bezdek, J.C., 1981. Pattern Recognition With Fuzzy Objective Function Algorithms. Plenum Press, New York.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39 (1), 1–38.
Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Everitt, B.S., Hand, D.J., 1981. Finite Mixture Distributions. Chapman and Hall, London, UK.
Frigui, H., Krishnapuram, R., 1997. Clustering by competitive agglomeration. Pattern Recognition 30 (7), 1223–1232.
Kotz, S., 1997. Multi-variate distributions at a cross-road. In: Patil, G.P., Kotz, S., Ord, J.K. (Eds.), Statistical Distributions in Scientific Work. Reidel, Dordrecht.
Medasani, S., Krishnapuram, R., 1997. Determination of the number of components in Gaussian mixtures using agglomerative clustering. In: Proceedings of the ICNN, Houston, June 1997, pp. 1412–1417.
Murphy, P.M., Aha, D.W., 1973. UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Department of Information and Computer Science, University of California, Irvine.

Redner, R.A., Walker, H.F., 1984. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26 (2), 195–239.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Rissanen, J., 1978. Modeling by shortest data description. Automatica 14, 465–471.


Tou, J.T., Gonzalez, R.C., 1974. Pattern Recognition Principles. Addison-Wesley, Reading, MA.
Wolfe, J.H., 1971. A Monte Carlo study of the sampling distribution of the likelihood ratio for mixtures of multi-normal distributions. Technical Bulletin STB 72-2, US Naval Personnel and Training Research Laboratory, San Diego.