Journal of the Korean Statistical Society 37 (2008) 135–143 www.elsevier.com/locate/jkss
Determining the number of clusters in cluster analysis

My-Young Cheong a, Hakbae Lee b,∗

a Department of Preventive Medicine, Seoul National University College of Medicine, Institute of Environmental Medicine, Seoul, 110-799, Republic of Korea
b Department of Applied Statistics, Yonsei University, Seoul, 120-749, Republic of Korea
Received 1 March 2007; accepted 1 October 2007; available online 22 April 2008
Abstract

Cluster analysis has been a popular method for statistical classification. Classical cluster analysis, however, has a theoretical shortcoming: it offers no principled guideline for determining the number of clusters. To estimate the number of clusters, this paper explores the problem through the EM algorithm, maximum a posteriori (MAP) estimation and the Gibbs sampler. In addition, we investigate the Bayesian information criterion (BIC), the Laplace–Metropolis criterion and the modified Fisher's criterion as tools for determining the number of clusters.

© 2008 The Korean Statistical Society. Published by Elsevier Ltd. All rights reserved.
AMS 2000 Subject Classification: primary 62H30; secondary 91C20

Keywords: Bayesian information criterion; Cluster analysis; EM algorithm; Gibbs sampler; Laplace–Metropolis criterion; Maximum a posteriori; Mixture model; Modified Fisher's criterion
1. Introduction

We focus on model-based clustering with the multivariate normal mixture model (McLachlan & Peel, 2000). In a mixture model, each component probability distribution corresponds to a cluster, and for each observation we estimate the probability that it belongs to each cluster. Many classical clustering methods have no rule for determining the number of clusters or for choosing an appropriate clustering model. To address this problem, we propose the modified Fisher's criterion.

Section 2 discusses clustering with the mixture model. In Section 3 we review estimation methods: the EM algorithm, maximum a posteriori (MAP) estimation and the Gibbs sampler. Section 4 explores the Bayesian information criterion (BIC), the Laplace–Metropolis criterion and the modified Fisher's criterion for determining the number of clusters. Section 5 presents real data studies that illustrate the discussion. Section 6 concludes.
This research was supported by the Fund for Supporting Basic Science Research in the College of Business and Economics, Yonsei University.
∗ Corresponding author.
E-mail addresses: [email protected] (M.-Y. Cheong), [email protected] (H. Lee).

doi:10.1016/j.jkss.2007.10.004
Table 2.1
Types of covariance matrix Σ_k

Model                  Volume       Shape        Orientation
λI                     Same         Same         None
λ_k I                  Different    Same         None
λA                     Same         Same         Coordinate axes
λ_k A                  Different    Same         Coordinate axes
λA_k                   Same         Different    Coordinate axes
λ_k A_k                Different    Different    Coordinate axes
λDAD^T                 Same         Same         Same
λ_k D_k A_k D_k^T      Different    Different    Different
λD_k A D_k^T           Same         Same         Different
λ_k D_k A D_k^T        Different    Same         Different

λ: all λ_k are the same. A: all A_k are the same. D: all D_k are the same.
2. Clustering with the mixture model

Assume that the data arise from K component distributions and that each of the n observations is assigned to one of them. Given data x = (x_1, x_2, . . . , x_n) consisting of independent multivariate observations x_1, . . . , x_n, the K-component mixture density is

  f(x_j) = \sum_{k=1}^{K} \tau_k f_k(x_j), \qquad j = 1, . . . , n,
where the mixing proportions τ = (τ_1, . . . , τ_K) are nonnegative quantities that sum to one, that is, 0 ≤ τ_k ≤ 1 (k = 1, . . . , K) and \sum_{k=1}^{K} \tau_k = 1. Here f_k(x_j) is called the k-th component density of the mixture, and f(x_j) is called the K-component mixture density, or the unconditional density of x_j. Each component corresponds to a cluster.

We focus on the mixture model with multivariate normal component densities, that is,

  f(x_j | \Theta) = \sum_{k=1}^{K} \tau_k \phi(x_j | \mu_k, \Sigma_k),        (2.1)
where Θ = (τ_1, . . . , τ_K, µ_1, µ_2, . . . , µ_K, Σ_1, Σ_2, . . . , Σ_K), µ_k is the k-th component mean, Σ_k is the k-th component covariance matrix and

  \phi(x_j | \mu_k, \Sigma_k) \equiv \frac{1}{\sqrt{\det(2\pi \Sigma_k)}} \exp\left\{ -\frac{1}{2} (x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k) \right\}.        (2.2)

Banfield and Raftery (1993) proposed a parametrization of the component covariance matrix through the standard spectral decomposition of Σ_k,

  \Sigma_k = \lambda_k D_k A_k D_k^T,        (2.3)
where D_k is the orthogonal matrix of eigenvectors of Σ_k, λ_k is the largest eigenvalue of Σ_k and A_k is a diagonal matrix whose elements are proportional to the eigenvalues of Σ_k; A_k = diag(1, λ_{k2}/λ_k, . . . , λ_{kp}/λ_k). Table 2.1 shows the parametrization types of the covariance matrix Σ_k. This parametrization has a geometric interpretation: λ_k controls the volume of the k-th cluster, while A_k and D_k represent its shape and orientation, respectively. In this paper we consider only three models, λI, λ_k I and λDAD^T. The models λI and λDAD^T are probably the most widely used Gaussian mixture models for clustering data (McLachlan & Basford, 1988), and generalizations such as λ_k I, which allow the volumes to differ across clusters, have proved powerful in many practical situations (Bensmail, Celeux, Raftery, & Robert, 1997; Celeux & Govaert, 1995).
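The decomposition (2.3) can be obtained with a standard eigendecomposition. The following is a minimal sketch (our own illustration, not code from the paper) of how one might recover the volume, shape and orientation parameters of a given covariance matrix and rebuild Σ_k from them; the function names are ours.

```python
import numpy as np

def decompose_covariance(sigma):
    """Split a covariance matrix into (lambda, A, D) as in Eq. (2.3):
    sigma = lambda * D @ A @ D.T, with A = diag(1, lam_2/lam_1, ...)."""
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, D = eigvals[order], eigvecs[:, order]
    lam = eigvals[0]                           # largest eigenvalue = volume parameter
    A = np.diag(eigvals / lam)                 # shape matrix, first diagonal entry 1
    return lam, A, D

def compose_covariance(lam, A, D):
    """Rebuild sigma = lambda * D @ A @ D.T."""
    return lam * D @ A @ D.T

# quick check on an arbitrary sample covariance matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
sigma = np.cov(X, rowvar=False)
lam, A, D = decompose_covariance(sigma)
assert np.allclose(compose_covariance(lam, A, D), sigma)
```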
Our goal is to assign an observation to the cluster with the largest posterior probability (Rencher, 1998). The posterior probability represents the probability that an observation x_j belongs to the i-th cluster, C_i:

  P(C_i | x_j) = \frac{\tau_i \phi(x_j | \mu_i, \Sigma_i)}{\sum_{k=1}^{K} \tau_k \phi(x_j | \mu_k, \Sigma_k)}.        (2.4)
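Given parameter values, the posterior probabilities (2.4) are straightforward to evaluate. A minimal sketch of ours (the function and variable names are not from the paper), working in log space for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_probs(X, taus, means, covs):
    """Posterior probability (2.4) that each observation belongs to each cluster.
    X: (n, p) data; taus: (K,) mixing proportions; means/covs: per-component parameters."""
    # log tau_k + log phi(x_j | mu_k, Sigma_k) for every observation and component
    log_w = np.column_stack([
        np.log(tau) + multivariate_normal.logpdf(X, mean=mu, cov=S)
        for tau, mu, S in zip(taus, means, covs)
    ])                                        # shape (n, K)
    log_w -= log_w.max(axis=1, keepdims=True) # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)
```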
The posterior probability (2.4) will be estimated by the EM algorithm, maximum a posteriori (MAP) estimation and the Gibbs sampler.

3. Estimation of models

3.1. EM algorithm

The basic idea of the EM algorithm (Dempster, Laird, & Rubin, 1977) is to transform an incomplete-data problem into a complete-data problem, because the complete-data likelihood is computationally more tractable for the required maximization. For the EM algorithm for a mixture model (Fraley & Raftery, 1999, 2002), let z = (x, y) denote the complete data, where x is observed and y is unobserved. The complete-data likelihood is

  L_C(z|\theta) = \prod_{j=1}^{n} f(z_j|\theta),
where L_C denotes the complete-data likelihood. Taking the logarithm of L_C(z|θ) gives

  \log L_C(z|\theta) = \sum_{j=1}^{n} \log f(z_j|\theta) = E_{\theta^{(k)}}\{\log f(\theta, y \mid x)\},        (3.1)
where log L_C(z|θ) is the objective function for the maximization, which is carried out through the conditional expectation of the complete-data log-likelihood given the current parameter estimate θ^(k). Because y is unobserved, the observed-data likelihood is obtained by integrating y out of the complete-data likelihood:

  L(x|\theta) = \int L_C(z|\theta)\, dy.

Let us consider the application of the EM algorithm to the multivariate normal mixture model

  f(x_j|\Theta) = \sum_{k=1}^{K} \tau_k \phi_k(x_j|\theta_k),
where θ_k = (µ_k, Σ_k). The EM algorithm has two steps, the E-step (expectation step) and the M-step (maximization step). In the (k + 1)-th iteration:

E-step: Estimate the conditional expectation of the complete-data log-likelihood given the observed data and the current parameter estimates,

  \hat{y}_{ji}^{(k+1)} = \hat{P}^{(k+1)}(C_i \mid x_j) = \frac{\hat{\tau}_i^{(k)} \phi_i(x_j \mid \hat{\mu}_i^{(k)}, \hat{\Sigma}_i^{(k)})}{\sum_{k=1}^{K} \hat{\tau}_k^{(k)} \phi_k(x_j \mid \hat{\mu}_k^{(k)}, \hat{\Sigma}_k^{(k)})},        (3.2)
which is the estimate of the posterior probability (2.4) that the j-th observation x_j belongs to the i-th cluster C_i of the mixture model (McLachlan & Krishnan, 1997).

M-step: Estimate the parameters maximizing the complete-data log-likelihood (3.1), evaluated at the values \hat{y}_{ji}^{(k+1)} computed in the E-step.

These two steps are iterated until the difference between the log-likelihood values at the (k + 1)-th and k-th iterations falls below a stopping criterion. Celeux and Govaert (1995) and McLachlan and Krishnan (1997) discuss the EM algorithm in detail. A sketch of these two steps for the spherical model λI is given below.
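The following is a minimal sketch of the two steps for the simplest covariance model, λI (equal spherical covariances). It is our own illustration, not the authors' code or the MCLUST implementation cited above, and it omits initialization details and safeguards against singular solutions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_spherical(X, K, n_iter=200, tol=1e-6, seed=0):
    """EM for a K-component Gaussian mixture with the lambda * I covariance model."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]          # crude initialization from data points
    lam = X.var()                                    # common spherical variance
    tau = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probabilities (3.2)
        log_w = np.column_stack([
            np.log(tau[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=lam * np.eye(p))
            for k in range(K)
        ])
        ll = np.sum(np.logaddexp.reduce(log_w, axis=1))      # observed-data log-likelihood
        y = np.exp(log_w - log_w.max(axis=1, keepdims=True))
        y /= y.sum(axis=1, keepdims=True)                    # y[j, k] = P-hat(C_k | x_j)
        # M-step: update tau, mu and the common variance lambda
        nk = y.sum(axis=0)
        tau = nk / n
        mu = (y.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances to means
        lam = (y * sq).sum() / (n * p)
        if abs(ll - prev_ll) < tol:                          # stopping criterion on the log-likelihood
            break
        prev_ll = ll
    return tau, mu, lam, y, ll
```

With estimates in hand, each observation is assigned to the cluster maximizing y[j, k], as in (2.4).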
3.2. Maximum a posteriori (MAP)

When the EM algorithm runs into singular covariance matrices, maximum a posteriori (MAP) estimation can be an alternative to maximum likelihood estimation. MAP estimation also has two steps, an E-step and an M-step, as in the EM algorithm. The E-step is effectively the same as in the maximum likelihood computation of θ. The M-step differs in that the objective function for the maximization is the conditional expectation of the complete-data log-likelihood augmented by the log prior density, log f(θ).

For the prior on the parameters of the multivariate normal mixture model, we consider the conjugate prior. A normal prior on the mean conditional on the covariance matrix is used,

  \mu \mid \Sigma \sim N(\mu_p, \Sigma/\kappa_p) \propto |\Sigma|^{-1/2} \exp\left\{ -\frac{\kappa_p}{2}\, \mathrm{trace}\left[ (\mu - \mu_p)^T \Sigma^{-1} (\mu - \mu_p) \right] \right\},

an inverse gamma prior on the variance for the diagonal and spherical models,

  \sigma^2 \sim \mathrm{inverseGamma}(\nu_p/2, \varsigma_p^2/2) \propto (\sigma^2)^{-\frac{\nu_p+2}{2}} \exp\left\{ -\frac{\varsigma_p^2}{2\sigma^2} \right\},

and an inverse Wishart prior on the covariance matrix for the ellipsoidal models,

  \Sigma \sim \mathrm{inverseWishart}(\nu_p, \Lambda_p) \propto |\Sigma|^{-\frac{\nu_p+d+1}{2}} \exp\left\{ -\frac{1}{2}\, \mathrm{trace}\left( \Sigma^{-1} \Lambda_p \right) \right\}.

The initial values for the prior hyperparameters are summarized as follows (G denotes the number of mixture components):

  µ_p: the mean of the data
  κ_p: 0.01
  ν_p: d + 2 (d = the number of dimensions)
  ς_p²: sum(diag(var(data))) / (d · G^{2/d})
  Λ_p: var(data) / G^{2/d}

The M-step estimators for the mean and variance of the multivariate normal mixture model with the normal inverse gamma and normal inverse Wishart conjugate priors are given by Fraley and Raftery (2005). In the (k + 1)-th iteration:

E-step: The posterior probability (2.4) is estimated using the MAP estimates instead of the MLEs.

M-step: Determine θ^{(k+1)} to maximize E_{θ^{(k)}}{log f(θ, y | x)} in (3.1), augmented by the log prior density.

These two steps are iterated until the difference between the log-likelihood values augmented by the log prior density at the (k + 1)-th and k-th iterations falls below a stopping criterion.

3.3. Gibbs sampler

Markov chain Monte Carlo (MCMC) has become a very important computational tool in Bayesian statistics, since it allows inference to be drawn from complex posterior distributions to which analytical or numerical integration techniques cannot be applied. From this point of view, MCMC is closely related to the EM algorithm: MCMC integrates over the posterior distribution of the model parameters given the data, whereas the EM algorithm integrates over the distribution of the unobserved data given parameter values. The Gibbs sampler is a popular MCMC algorithm; its appeal is that it requires little beyond the full conditional distributions of the parameters. The conjugate prior for the parameters of the multivariate normal distribution may be used (Smith & Roberts, 1993). The prior distribution of the mixing proportions τ = (τ_1, . . . , τ_K)^T is a Dirichlet distribution τ ∼ D(α_1, . . . , α_K) with α_1 = · · · = α_K = 1, the means are given priors µ_k | Σ_k ∼ N(ξ_k, Σ_k/κ_k), and the priors on the covariance matrices depend on the model (Bensmail et al., 1997). Estimation using the Gibbs sampler consists of the following steps (Bensmail et al., 1997) at the (k + 1)-th iteration:
Step 1. Simulate the posterior probabilities (2.4), \hat{y}^{(k+1)}. This step is the same as the E-step of the EM algorithm.

Step 2. Simulate the vector \hat{\tau}^{(k+1)} of proportions according to its posterior distribution conditional on the \hat{y}_{ji}^{(k+1)}.

Step 3. Simulate the parameters θ^{(k+1)} of the model according to their posterior distributions conditional on the \hat{y}_{ji}^{(k+1)}. A sketch of one full sweep of these three steps is given below.
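The following is a minimal sketch of one sweep of the three steps for the spherical model λ_k I. The full conditionals used here are the standard conjugate updates implied by the Dirichlet, normal and inverse gamma priors described above; they are our own illustration, not the authors' implementation, and Bensmail et al. (1997) give the exact conditional distributions for each covariance model.

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal

def gibbs_sweep(X, tau, mu, lam, xi, kappa, nu_p, s2_p, rng):
    """One sweep of the Gibbs sampler for the lambda_k * I model.
    tau: (K,) proportions; mu: (K, p) means; lam: (K,) spherical variances.
    xi, kappa, nu_p, s2_p are prior hyperparameters as in Section 3.2."""
    n, p = X.shape
    K = len(tau)
    mu = np.array(mu, dtype=float, copy=True)
    lam = np.array(lam, dtype=float, copy=True)
    # Step 1: simulate labels from the posterior probabilities (2.4)
    log_w = np.column_stack([
        np.log(tau[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=lam[k] * np.eye(p))
        for k in range(K)
    ])
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=w[j]) for j in range(n)])
    # Step 2: simulate the proportions from Dirichlet(1 + n_1, ..., 1 + n_K)
    nk = np.bincount(z, minlength=K)
    tau = rng.dirichlet(1.0 + nk)
    # Step 3: simulate (mu_k, lambda_k) from their conjugate full conditionals
    for k in range(K):
        Xk = X[z == k]
        xbar = Xk.mean(axis=0) if nk[k] > 0 else xi
        m = (kappa * xi + nk[k] * xbar) / (kappa + nk[k])
        mu[k] = rng.multivariate_normal(m, lam[k] / (kappa + nk[k]) * np.eye(p))
        ss = ((Xk - mu[k]) ** 2).sum() + kappa * ((mu[k] - xi) ** 2).sum()
        shape = nu_p / 2.0 + (nk[k] + 1) * p / 2.0
        scale = s2_p / 2.0 + ss / 2.0
        lam[k] = invgamma.rvs(shape, scale=scale, random_state=rng)
    return z, tau, mu, lam
```

Repeated sweeps yield draws from the posterior; the parameter draws are also the input to the Laplace–Metropolis estimator of Section 4.2.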
The details of Step 3 of the Gibbs sampler for the mean and variance of the multivariate normal mixture model with the normal inverse gamma and normal inverse Wishart conjugate priors are given by Bensmail et al. (1997). It remains to determine an appropriate model and the number of clusters; this issue is addressed in Section 4.

4. Assessment of models

4.1. Bayesian information criterion (BIC)

The Bayesian information criterion (BIC) of Schwarz (1978) is one of the most popular Bayesian model selection criteria. It is based on the Bayes factor and the posterior model probability (Kass & Raftery, 1995); the Bayes factor is the ratio of the marginal (integrated) likelihoods of two models. Consider K different clusters C_i (i = 1, . . . , K) which are mutually exclusive and exhaustive, and assign to C_i a prior probability p(C_i) with \sum_{i=1}^{K} p(C_i) = 1. After observing data x, the posterior probability of the cluster C_i is

  p(C_i | x) = \frac{p(C_i)\, p(x | C_i)}{\sum_{k=1}^{K} p(C_k)\, p(x | C_k)}, \qquad i = 1, 2, . . . , K,
where p(x | C_i) is the probability of the data x given the cluster C_i. The marginal density can be expressed as

  p(x | C_i) = \int p(x | \Theta_i, C_i)\, p(\Theta_i | C_i)\, d\Theta_i = E_{\Theta_i | C_i}\left[ p(x | \Theta_i, C_i) \right],        (4.1)
where Θ_i = (τ_i, µ_i, Σ_i) is the parameter under the given model. Thus p(x | C_i) is the expected value of all possible likelihoods, the expectation being taken with respect to the prior distribution of the parameters. Under regularity conditions this factor is approximated by

  \log p(x | C_i) \approx \log p(x | \tilde{\Theta}_i, C_i) - \frac{1}{2} p_i \log n,

where p_i is the number of parameters under the cluster C_i and the estimate \tilde{\Theta}_i of Θ_i is obtained by EM or MAP. Twice this approximation to the log marginal likelihood, and hence to the logarithm of the Bayes factor, is the Bayesian information criterion (BIC) in (4.2). Schwarz (1978) showed that for a sufficiently large number of observations n, the best model is the one which maximizes BIC:

  \mathrm{BIC} = 2 \log p(x | C_i) \approx 2 \log p(x | \tilde{\Theta}_i, C_i) - p_i \log n.        (4.2)

In other words, the number of clusters is chosen as the one maximizing BIC.

4.2. Laplace–Metropolis criterion

Taking the logarithm of the integrand in (4.1) and using a second-order Taylor series expansion, the integral takes a Gaussian form and can be evaluated readily. Hence p(x | C_i) can be approximated by
  p(x | C_i) \approx p(x | \hat{\Theta}_i, C_i)\, p(\hat{\Theta}_i | C_i)\, (2\pi)^{p_i/2}\, |H_{\hat{\Theta}_i}^{-1}|^{1/2},        (4.3)

where \hat{\Theta}_i is the posterior mode of p(x | \Theta_i, C_i)\, p(\Theta_i | C_i) and H_{\hat{\Theta}_i}^{-1} is the variance–covariance matrix of the Gaussian approximation to the posterior distribution. The approximation (4.3) is known as the Laplace method. The Laplace method requires the posterior mode \hat{\Theta}_i and H_{\hat{\Theta}_i}^{-1}. Thus the simplest way to estimate Θ_i is to approximate the
integrated likelihood using the Gibbs sampler output. While the Laplace method is often very accurate, it is not directly applicable here because the derivatives it requires are not easily available. The idea of the Laplace–Metropolis estimator (Lewis & Raftery, 1994; Raftery & Lewis, 1996) is to get around the limitations of the Laplace method by using posterior simulation to estimate the quantities it needs. The likelihood at the approximate posterior mode is given by

  p(x | \hat{\Theta}) = \prod_{j=1}^{n} \sum_{i=1}^{K} \hat{\tau}_i\, \phi(x_j | \hat{\mu}_i, \hat{\Sigma}_i).
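Both the BIC (4.2) and the Laplace–Metropolis approximation (4.3) can be computed from a fitted model; the sketch below is ours, not the authors' code. In the spirit of Lewis and Raftery (1994), the posterior mode is approximated by the posterior draw with the highest unnormalized log posterior, and H^{-1} is estimated by the sample covariance of the draws stacked as vectors (the stacking convention is our assumption).

```python
import numpy as np

def laplace_metropolis_log_marginal(log_lik_at_mode, log_prior_at_mode, draws):
    """Laplace-Metropolis approximation (4.3) on the log scale.
    draws: (T, d) array of posterior parameter draws flattened into vectors;
    the posterior covariance H^{-1} is estimated by the sample covariance of the draws.
    log_lik_at_mode / log_prior_at_mode: values at the draw maximizing their sum."""
    d = draws.shape[1]
    cov = np.cov(draws, rowvar=False)               # estimate of H^{-1}
    _, logdet = np.linalg.slogdet(cov)
    return (log_lik_at_mode + log_prior_at_mode
            + 0.5 * d * np.log(2.0 * np.pi) + 0.5 * logdet)

def bic(log_lik_at_estimate, n_params, n_obs):
    """BIC as in (4.2): twice the maximized log-likelihood minus n_params * log(n)."""
    return 2.0 * log_lik_at_estimate - n_params * np.log(n_obs)
```

In both cases the model, and hence the number of clusters, with the largest criterion value is selected.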
4.3. Modified Fisher's criterion

In discriminant analysis with several clusters, we compare within-cluster and between-cluster variability in order to choose the number of clusters. For any given vector a, the relative variation of y = a^T x between clusters with respect to the within-cluster variation is (a^T H a)/(a^T E a); its maximum over a, max_a (a^T H a)/(a^T E a), is a function of K. We consider a penalty that guards against choosing too large a K, which leads to the following modified Fisher's criterion:

  F_M(a, K) = \frac{a^T H a}{a^T E a} + \sum_{k=1}^{K} \log(\hat{\tau}_k),

where

  E = \sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T   (within-cluster sum of squares),

  H = \sum_{i=1}^{K} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T   (between-cluster sum of squares).
We propose

  F_M(K) = \max_a F_M(a, K)

as a criterion for choosing K. This is the maximal relative variation between the K clusters along the principal axis, penalized by the term \sum_k \log(\hat{\tau}_k). Note that the matrices E and H depend on K, and so does the maximizing vector, say a_1(K), of F_M(a, K). By letting v = E^{1/2} a, maximizing (a^T H a)/(a^T E a) subject to a^T E a = 1 is equivalent to maximizing v^T (E^{-1/2} H E^{-1/2}) v subject to ||v|| = 1. This is the well-known optimization problem of conventional principal component analysis. The solution is v = v_1(K), the eigenvector of E^{-1/2} H E^{-1/2} corresponding to its largest eigenvalue, say λ_1(K). This means that a_1(K) = E^{-1/2} v_1(K) is the eigenvector of E^{-1} H corresponding to λ_1(K). Thus, we obtain F_M(K) = F_M(a_1(K), K).

5. Real data study

Two examples are explored to compare the methods discussed in the previous sections. Table 5.1 summarizes the combinations of estimation method and assessment criterion used to determine the number of clusters. The common rule for choosing an appropriate number of clusters is to select the number of clusters that maximizes the BIC, the Laplace–Metropolis criterion or the modified Fisher's criterion; a short computational sketch of the latter follows the data description below.

5.1. Colon cancer data

This data set was presented and analyzed by Alon et al. (1999). Expression levels of about 2000 genes were measured for 40 tumor and 22 normal colon tissues. The data consist of two types of clusters: tumor and normal.
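The sketch below (our own, with hypothetical function names) computes the modified Fisher's criterion F_M(K) of Section 4.3 from a hard clustering and the estimated mixing proportions, using the eigenvalue formulation derived above.

```python
import numpy as np

def modified_fisher(X, labels, tau_hat):
    """Modified Fisher's criterion: F_M(K) = lambda_1(E^{-1} H) + sum_k log(tau_hat_k).
    X: (n, p) data; labels: cluster assignments in {0, ..., K-1}; tau_hat: (K,) proportions."""
    K = len(tau_hat)
    p = X.shape[1]
    xbar = X.mean(axis=0)
    E = np.zeros((p, p))                    # within-cluster sum of squares
    H = np.zeros((p, p))                    # between-cluster sum of squares
    for k in range(K):
        Xk = X[labels == k]
        xbar_k = Xk.mean(axis=0)
        D = Xk - xbar_k
        E += D.T @ D
        H += len(Xk) * np.outer(xbar_k - xbar, xbar_k - xbar)
    # largest eigenvalue of E^{-1} H equals that of the symmetric E^{-1/2} H E^{-1/2}
    lam1 = np.linalg.eigvals(np.linalg.solve(E, H)).real.max()
    return lam1 + np.sum(np.log(tau_hat))
```

In Tables 5.5 and 5.10 the number of clusters is chosen as the K that maximizes this quantity.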
Table 5.1
Combination of estimation method and assessment criterion for the number of clusters

Notation                   Estimation method         Assessment criterion for the number of clusters
EM (BIC)                   EM algorithm              BIC
MAP (BIC)                  Maximum a posteriori      BIC
Gibbs (Laplace)            Gibbs sampler             Laplace–Metropolis criterion
Gibbs (Modified Fisher)    Gibbs sampler             Modified Fisher's criterion
Table 5.2
Colon cancer data: EM (BIC)

Model        Number of clusters
             2           3           4           5
λI           −1586.48    −1556.30    −1533.41    −1548.24
λ_k I        −1588.99    −1563.00    −1527.55    −1545.81
λDAD^T       −1418.75    −1410.75    −1487.44    −1525.52
Table 5.3
Colon cancer data: MAP (BIC)

Model        Number of clusters
             2          3          4          5
λI           −794.14    −779.64    −769.99    −780.49
λ_k I        −795.93    −784.29    −770.08    −784.90
λDAD^T       −758.27    −755.69    −798.07    −824.44
Table 5.4
Colon cancer data: Gibbs (Laplace)

Model        Number of clusters
             2           3           4           5
λI           −1770.88    −1743.34    −1883.06    −1877.71
λ_k I        −804.71     −810.68     −841.62     −879.31
λDAD^T       −785.61     −818.52     −855.04     −905.55
Table 5.5
Colon cancer data: Gibbs (Modified Fisher)

Model        Number of clusters
             2         3         4         5
λI           −0.96     −3.27     −5.01     −5.01
λ_k I        28.74     27.77     23.97     15.59
λDAD^T       12.23     12.16     22.84     7.92
Tables 5.2 and 5.3 show that both EM (BIC) and MAP (BIC) estimate the number of clusters improperly as three rather than two. In contrast, Tables 5.4 and 5.5 show that both Gibbs (Laplace) and Gibbs (Modified Fisher) estimate the number of clusters correctly as two. Furthermore, Table 5.6 shows that Gibbs (Modified Fisher) has a smaller misclassification rate than Gibbs (Laplace). Hence, the Gibbs sampler with the modified Fisher's criterion outperforms the other methods.
Table 5.6
Colon cancer data: misclassification rate comparison

Estimation method                                EM         MAP        Gibbs      Gibbs
Criterion to determine the number of clusters    BIC        BIC        Laplace    Modified Fisher
Model                                            λDAD^T     λDAD^T     λDAD^T     λ_k I
Number of clusters                               3          3          2          2
Misclassification rate                           0.24       0.24       0.45       0.08
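Tables 5.6 and 5.11 report misclassification rates. The paper does not spell out how the estimated clusters are matched to the true classes; a common approach, shown in this sketch of ours, matches them by the assignment that maximizes agreement (via the Hungarian algorithm). When the number of clusters exceeds the number of classes, the unmatched clusters count as errors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclassification_rate(true_labels, cluster_labels):
    """Fraction of observations misclassified after optimally matching
    cluster labels to true class labels."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # contingency table: rows = clusters, columns = true classes
    table = np.array([[np.sum((cluster_labels == c) & (true_labels == t))
                       for t in classes] for c in clusters])
    # choose the cluster-to-class matching that maximizes total agreement
    row, col = linear_sum_assignment(-table)
    return 1.0 - table[row, col].sum() / len(true_labels)
```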
Table 5.7
IRIS data: EM (BIC)

Model        Number of clusters
             2           3          4          5
λI           −1123.41    −878.77    −784.31    −734.39
λ_k I        −1012.24    −853.81    −783.83    −746.99
λDAD^T       −688.10     −632.97    −591.41    −604.93
Table 5.8
IRIS data: MAP (BIC)

Model        Number of clusters
             2           3          4          5
λI           −1124.85    −885.70    −799.83    −808.43
λ_k I        −1016.35    −865.14    −796.02    −797.98
λDAD^T       −690.53     −638.90    −655.36    −618.93
Table 5.9
IRIS data: Gibbs (Laplace)

Model        Number of clusters
             2          3          4          5
λI           −872.46    −990.54    −974.46    −1096.43
λ_k I        −514.89    −450.07    −487.90    −497.28
λDAD^T       −427.37    −441.81    −382.16    −395.17
5.2. IRIS data

Let us explore Fisher's iris data (Fisher, 1936; Anderson, 1935), with the variables sepal length, sepal width, petal length and petal width. The data comprise 50 flowers from each of three species of iris (Iris setosa, versicolor and virginica), so they contain three clusters. Tables 5.7–5.9 display poor performance due to improper estimates of the number of clusters by EM (BIC), MAP (BIC) and Gibbs (Laplace), respectively. However, Tables 5.10 and 5.11 show that Gibbs (Modified Fisher) is superior to the remaining three methods; in addition, Gibbs (Modified Fisher) has the smallest misclassification rate.

6. Conclusion

We investigated the performance of EM (BIC), MAP (BIC), Gibbs (Laplace) and Gibbs (Modified Fisher) in non-hierarchical clustering based on the multivariate normal mixture. To address problems with the EM algorithm (such as singularities), maximum a posteriori (MAP) estimation and the Gibbs sampler have been proposed.
Table 5.10
IRIS data: Gibbs (Modified Fisher)

Model        Number of clusters
             2         3         4         5
λI           0.11      −2.57     −4.46     −7.68
λ_k I        14.34     21.76     12.90     12.90
λDAD^T       0.58      8.77      8.77      −6.19
Table 5.11
IRIS data: misclassification rate comparison

Estimation method                                EM         MAP        Gibbs      Gibbs
Criterion to determine the number of clusters    BIC        BIC        Laplace    Modified Fisher
Model                                            λDAD^T     λDAD^T     λDAD^T     λ_k I
Number of clusters                               4          5          4          3
Misclassification rate                           0.11       0.22       0.37       0.08
The modified Fisher's criterion has a smaller misclassification rate than the other criteria. For estimating the number of clusters, the modified Fisher's criterion performs better than the BIC or the Laplace–Metropolis criterion.

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96, 6745–6750.
Anderson, E. (1935). The irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.
Bensmail, H., Celeux, G., Raftery, A. E., & Robert, C. P. (1997). Inference in model-based cluster analysis. Statistics and Computing, 7, 1–10.
Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28, 781–793.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
Fraley, C., & Raftery, A. E. (1999). MCLUST: Software for model-based clustering. Journal of Classification, 16, 297–306.
Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association, 97, 611–631.
Fraley, C., & Raftery, A. E. (2005). Bayesian regularization for normal mixture estimation and model-based clustering. Technical report, Department of Statistics, University of Washington.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Lewis, S. M., & Raftery, A. E. (1994). Estimating Bayes factors via posterior simulation with the Laplace–Metropolis estimator. Technical report 279, Department of Statistics, University of Washington.
McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering. New York: Marcel Dekker.
McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Raftery, A. E., & Lewis, S. M. (1996). Implementing MCMC. In Markov chain Monte Carlo in practice. London: Chapman and Hall.
Rencher, A. C. (1998). Multivariate statistical inference and applications. New York: Wiley.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Smith, A. F. M., & Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society, Series B, 55, 3–24.