Mixture model selection via hierarchical BIC


Jianhua Zhao (a,b), Libin Jin (c), Lei Shi (a,b,*)

(a) School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming, 650221, China
(b) Yunnan Tongchuang Scientific Computing and Data Mining Center, China
(c) School of Statistics, Renmin University of China, Beijing, 100872, China
(*) Corresponding author at: School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming, 650221, China.

Highlights

• A new HBIC criterion is proposed for model selection in mixture models.
• BIC ignores the clustered data structure, while HBIC matches the structure well.
• HBIC is a large sample approximation of the variational Bayesian lower bound.
• BIC is a less accurate approximation.
• The results show that HBIC outperforms BIC and that BIC suffers from underestimation.

Article history: Received 20 September 2013; received in revised form 17 November 2014; accepted 27 January 2015; available online 3 March 2015.

Keywords: Model selection; Mixture model; EM; Maximum likelihood estimation; BIC; Hierarchical BIC; Clustering

Abstract

The Bayesian information criterion (BIC) is one of the most popular criteria for model selection in finite mixture models. However, it implausibly penalizes the complexity of each component using the whole sample size and completely ignores the clustered structure inherent in the data, resulting in over-penalization. To overcome this problem, a novel criterion called hierarchical BIC (HBIC) is proposed, which penalizes the complexity of each component using only its local sample size and therefore matches the clustered data structure well. Theoretically, HBIC is an approximation of the variational Bayesian (VB) lower bound when the sample size is large, whereas the widely used BIC is a less accurate approximation. An empirical study is conducted to verify this theoretical result, and a series of experiments is performed on simulated and real data sets to compare HBIC and BIC. The results show that HBIC outperforms BIC substantially and that BIC suffers from underestimation.

1. Introduction

Finite mixture models (McLachlan and Peel, 2000) provide flexible and powerful tools for analyzing or modeling heterogeneous multivariate data, in particular clustered/grouped data collected from multiple sources, and have been widely used in many fields such as pattern recognition, signal processing, computer vision, engineering, and bioinformatics. The Gaussian mixture model (GMM) has received much attention in the literature, mainly because of its simplicity and wide applicability in real data analyses. However, when the covariance structure of the GMM is full (FGMM), the number of model parameters to be estimated (i.e., the number of degrees of freedom) increases quadratically with the data dimension and linearly with the number of mixture components, so FGMM usually suffers from overfitting for high dimensional data. In such cases, it is necessary to use more parsimonious covariance structures to reduce the number of degrees of freedom in FGMM. Banfield and Raftery (1993), Bensmail et al. (1997) and Celeux and Govaert (1995) propose using various constraints on the eigen-decomposed component covariances, yielding a family of GMMs that enjoys a wide range of degrees of freedom. For clarity, this family is called eigen-decomposed GMMs (EDGMM) in this paper. The fitting of EDGMM can be carried out by maximum likelihood (ML) or maximum a posteriori (MAP) methods via the popular expectation maximization (EM) algorithm (Dempster et al., 1977). The well known R software mclust (Fraley and Raftery, 2002; Fraley et al., 2012) implements a subset of ten of these models; the Matlab software mixmod (Biernacki et al., 2006) and the R software mixture (Browne and McNicholas, 2014) implement all fourteen models. Recent developments of EDGMM include the incorporation of t distributions to deal with outliers (t-EDGMM; Andrews and McNicholas, 2012), the incorporation of skew distributions to further tackle asymmetry (skew-EDGMM; Vrbik and McNicholas, 2014), and the extension of t-EDGMM to the presence of missing data (Lin, 2014).

A central issue in learning EDGMM is to determine a suitable number of mixture components and covariance structure. The Bayesian information criterion (BIC) proposed in Schwarz (1978) is commonly used for this purpose and has been taken as the default model selection criterion in mclust, mainly due to its theoretical consistency (Keribin, 2000) and substantial practical evidence of satisfactory performance in a number of applications (see, e.g., Fraley and Raftery, 1998, 2002). Despite its popularity, as can be seen in Section 2.4.1, BIC implausibly penalizes the complexity of each component using the whole sample size and completely ignores the clustered structure inherent in the data, leading to over-penalization.

In this paper, we develop a novel criterion for model selection in EDGMM called hierarchical BIC (HBIC), which penalizes the complexity of each component using only its local sample size and thus matches the clustered data structure well. More importantly, we show theoretically that HBIC is an approximation of the variational Bayesian (VB) lower bound when the sample size is large, whereas the widely used BIC is a less accurate approximation. Our proposed HBIC is an extension of the criterion for mixtures of factor analyzers in Zhao et al. (2013), which is only applicable to the case of no constraints across components and hence cannot be used for learning EDGMM.

The remainder of the paper is organized as follows. Section 2 reviews the EDGMM family and two commonly used model selection criteria. Section 3 proposes a new criterion for EDGMM, called hierarchical BIC, which is derived from the variational Bayesian (VB) lower bound in the large sample limit. Section 4 presents experiments that test our theoretical result and compare HBIC with related competitors. Section 5 ends the paper with a conclusion and some discussion of future work.

2. Previous works

In this section, we briefly review the family of eigen-decomposed Gaussian mixture models (EDGMM), its associated algorithms, and two frequently used model selection criteria.

2.1. Gaussian mixture models (GMM)

Let x be a d-dimensional random vector drawn from a g-component Gaussian mixture model (GMM).
The probability density function (pdf) of x is given by

    p(x | \theta) = \sum_{k=1}^{g} \pi_k \, p(x | \theta_k),

where π_k is the prior probability of component k, satisfying π_k ≥ 0, k = 1, ..., g, and \sum_k \pi_k = 1; θ_k = (μ_k, Σ_k) is the component parameter specified by mean μ_k and covariance Σ_k; θ ≡ {π_k, θ_k; k = 1, ..., g} collects the set of parameters in the GMM; and the k-th component pdf p(x_i | θ_k) is given by

    p(x_i | \theta_k) = (2\pi)^{-d/2} |\Sigma_k|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x_i - \mu_k)' \Sigma_k^{-1} (x_i - \mu_k) \right\}.

Given a set of i.i.d. sample data X = {x_i}_{i=1}^N, the log-likelihood of a g-component GMM is

    L(X | \theta) = \sum_{i=1}^{N} \log\left\{ \sum_{k=1}^{g} \pi_k \, p(x_i | \theta_k) \right\}.

The ML estimate \hat\theta is obtained by maximizing the log-likelihood,

    \hat\theta = \arg\max_{\theta} \{ L(X | \theta) \}.

If a prior p(θ) for the model parameter θ is available, then the MAP estimate \hat\theta can be obtained by

    \hat\theta = \arg\max_{\theta} \{ L(X | \theta) + \log p(\theta) \}.
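As a concrete illustration of the quantities above, the following Python sketch (our own, assuming NumPy/SciPy; the function name gmm_loglik is not from the paper) evaluates the g-component GMM log-likelihood L(X | θ) with the log-sum-exp trick for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_loglik(X, pi, mus, Sigmas):
    """Log-likelihood L(X | theta) of a g-component GMM.

    X      : (N, d) data matrix
    pi     : (g,) mixing proportions
    mus    : (g, d) component means
    Sigmas : (g, d, d) component covariances
    """
    N, g = X.shape[0], len(pi)
    # log of pi_k * p(x_i | theta_k) for every (i, k)
    log_terms = np.empty((N, g))
    for k in range(g):
        log_terms[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(
            X, mean=mus[k], cov=Sigmas[k])
    # sum over components inside the log, then over observations
    return logsumexp(log_terms, axis=1).sum()
```

The MAP objective simply adds log p(θ) to the value returned by this function.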


Table 1
The 14 eigen-decomposed GMMs. Dk and D0 are the numbers of free and common parameters in θ_k, respectively; d is the data dimension and β = d(d+1)/2. 'E': equal across components, 'V': varying, 'I': an identity matrix.

Model   Σ_k                    D0        Dk
EII     λI                     1         d
VII     λ_k I                  0         d+1
EEI     λA                     d         d
VEI     λ_k A                  d-1       d+1
EVI     λA_k                   1         2d-1
VVI     λ_k A_k                0         2d
EEE     λDAD'                  β         d
VEE     λ_k DAD'               β-1       d+1
EVE     λDA_k D'               β-d+1     2d-1
VVE     λ_k DA_k D'            β-d       2d
EEV     λD_k AD_k'             d         β
VEV     λ_k D_k AD_k'          d-1       β+1
EVV     λD_k A_k D_k'          1         β+d-1
VVV     λ_k D_k A_k D_k'       0         β+d

2.2. Eigen-decomposed Gaussian mixture models (EDGMM)

If the covariance Σ_k is in general form, the number of free parameters of Σ_k to be estimated increases quadratically with the data dimension d and linearly with the number of components g. When the sample size N is relatively small compared with the data dimension d, the GMM usually suffers from overfitting, and it is necessary to employ more parsimonious covariance structures or to constrain the Σ_k's to prevent overfitting. For example, Σ_k = Σ, k = 1, ..., g, restricts all the components to share a common Σ, and Σ_k = λ_k I, k = 1, ..., g, restricts all of the component covariances to be spherical, where I denotes an identity matrix of suitable dimension. More generally, Celeux and Govaert (1995) propose the EDGMM family of 14 parsimonious GMMs, which enjoys a wide range of degrees of freedom. EDGMM is obtained by employing various constraints on the eigen-decomposed component covariances Σ_k as follows:

    \Sigma_k = \lambda_k D_k A_k D_k',    (1)

where λ_k = |Σ_k|^{1/d} is the scalar determining the volume of component k, D_k is the matrix of eigenvectors determining its orientation, and A_k is proportional to the diagonal matrix of eigenvalues determining its shape (|A_k| = 1). By either restricting some of λ_k, A_k and D_k to be equal ('E') across components or allowing them to vary ('V'), a family of fourteen different models is obtained. For notational convenience, the free and common component parameters in θ_k are denoted by θ_k^s and θ_0, respectively, i.e., θ_k = (θ_k^s, θ_0). Correspondingly, Dk and D0 are the numbers of free parameters of θ_k^s and θ_0, respectively. Table 1 lists all the members of the EDGMM family and their associated numbers of free (Dk) and common (D0) parameters to be estimated.
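To make the decomposition (1) concrete, here is a minimal NumPy sketch (our own illustration; eig_decompose is a hypothetical helper, not code from the paper) that splits a covariance matrix into the volume, orientation and shape factors used in Table 1.

```python
import numpy as np

def eig_decompose(S):
    """Volume/orientation/shape decomposition S = lam * D @ A @ D.T, cf. (1).

    Returns lam (scalar volume), D (orthogonal eigenvector matrix) and
    A (diagonal shape matrix with det(A) = 1).
    """
    d = S.shape[0]
    evals, D = np.linalg.eigh(S)          # S = D diag(evals) D'
    evals, D = evals[::-1], D[:, ::-1]    # sort eigenvalues in decreasing order
    lam = np.linalg.det(S) ** (1.0 / d)   # lambda = |S|^{1/d}
    A = np.diag(evals / lam)              # det(A) = 1 by construction
    return lam, D, A

# Rebuilding the covariance from the three factors:
# S_rebuilt = lam * D @ A @ D.T
```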

2.3. The EM algorithm

The expectation maximization (EM) algorithm (Dempster et al., 1977) is a well known iterative procedure for finding the ML estimate. In the t-th iteration, the current value θ^(t) of the model parameter is updated to θ^(t+1) through an E-step, which obtains the expected complete-data log-likelihood (the so-called Q function), followed by an M-step, which maximizes Q with respect to (w.r.t.) θ. Celeux and Govaert (1995) give the EM implementations for all members of the EDGMM family; below we give a sketch.

Consider the complete data (X, Z) = {x_i, z_i}_{i=1}^N, where z_i = (z_i1, ..., z_ik, ..., z_ig) and z_ik is an indicator variable taking the value 1 if x_i comes from the k-th component and 0 otherwise. The complete-data log-likelihood is

    L_c(X, Z | \theta) = \sum_{i=1}^{N} \sum_{k=1}^{g} z_{ik} \log(\pi_k \, p(x_i | \theta_k)).

E-step: Compute the expectation of L_c given X and θ^(t):

    Q(\theta | \theta^{(t)}) = \sum_{k=1}^{g} Q_k(\theta_k | \theta^{(t)}),

where

    Q_k(\theta_k | \theta^{(t)}) = \sum_{i=1}^{N} E(z_{ik}) \log(\pi_k \, p(x_i | \theta_k))

depends only on (π_k, θ_k) of component k. The posterior probability that data point x_i belongs to component k is computed as

    r_{ik}(\theta^{(t)}) \triangleq E(z_{ik}) = P(z_{ik} = 1 | x_i)
        = \frac{\pi_k^{(t)} p(x_i | \theta_k^{(t)})}{\sum_{l=1}^{g} \pi_l^{(t)} p(x_i | \theta_l^{(t)})}.


It is convenient to define the local sample covariance matrix

    S_k^{(t+1)} \triangleq \frac{1}{N \pi_k^{(t+1)}} \sum_{i=1}^{N} r_{ik}(\theta^{(t)}) (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})'.

M-step: For the ML estimate, maximizing Q w.r.t. θ under the restriction \sum_{k=1}^{g} \pi_k = 1 gives the updating equations

    \pi_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} r_{ik}(\theta^{(t)}),    (2)

    \mu_k^{(t+1)} = \frac{1}{N \pi_k^{(t+1)}} \sum_{i=1}^{N} r_{ik}(\theta^{(t)}) \, x_i,    (3)

and the equation for Σ_k depends on the parametrization. For the VVV model, we have

    [\lambda_k^{(t+1)}, D_k^{(t+1)}, A_k^{(t+1)}] = \mathrm{eig}(S_k^{(t+1)}),    (4)

where [λ, D, A] = eig(B) performs the eigen-decomposition of B as in (1). Let S^{(t+1)} = \sum_k \pi_k^{(t+1)} S_k^{(t+1)}. For the EEE model, we have

    [\lambda^{(t+1)}, D^{(t+1)}, A^{(t+1)}] = \mathrm{eig}(S^{(t+1)}).    (5)

The updating formulae for the other models are available in Celeux and Govaert (1995). For the MAP estimate, θ^(t+1) is obtained by

    \theta^{(t+1)} = \arg\max_{\theta} \{ Q(\theta | \theta^{(t)}) + \log p(\theta) \}.    (6)
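The E- and M-steps above can be summarized in a short sketch. The following Python code (our own, using NumPy/SciPy) performs one EM iteration for the unconstrained VVV model, i.e., updates (2)-(4); for clarity it works with densities rather than log-densities, so it is not numerically robust in high dimensions, and the constrained models would need a different covariance update.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_vvv(X, pi, mus, Sigmas):
    """One EM iteration for the unconstrained (VVV) Gaussian mixture."""
    N, d = X.shape
    g = len(pi)

    # E-step: responsibilities r_ik
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(g)])
    r = dens / dens.sum(axis=1, keepdims=True)

    # M-step: updates (2), (3) and the local covariances S_k (used in (4) for VVV)
    Nk = r.sum(axis=0)                       # effective sample sizes N * pi_k
    pi_new = Nk / N                          # (2)
    mus_new = (r.T @ X) / Nk[:, None]        # (3)
    Sigmas_new = np.empty((g, d, d))
    for k in range(g):
        Xc = X - mus_new[k]
        Sigmas_new[k] = (r[:, k, None] * Xc).T @ Xc / Nk[k]
    return pi_new, mus_new, Sigmas_new, r
```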

2.4. Previous model selection criteria

In this subsection, we briefly review two commonly used model selection criteria for the EDGMM family: the Bayesian information criterion (BIC; Schwarz, 1978) and the integrated complete-data likelihood criterion (ICL; Biernacki et al., 2000). Formally, both criteria can be summarized in the general form

    L^*(g, c, \hat\theta(g, c)) = L(X | \hat\theta(g, c)) - P(\hat\theta(g, c)),    (7)

where c ranges over the 14 candidate covariance structures detailed in Table 1 and P(\hat\theta(g, c)) is a penalty term that penalizes higher values of g and more complex covariance structures c. Given a range of values of g from g_min to g_max, which is assumed to include the optimal one, the optimal model (\hat g, \hat c) is chosen as

    (\hat g, \hat c) = \arg\max_{(g, c)} \{ L^*(g, c, \hat\theta(g, c)) \}.
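The selection rule above amounts to a double loop over (g, c). A minimal sketch follows; fit_edgmm and penalty are hypothetical placeholders for an EDGMM fitting routine and a criterion-specific penalty (they are assumptions of this sketch, not functions defined in the paper).

```python
import numpy as np

def select_model(X, candidate_structures, g_min=1, g_max=15,
                 fit_edgmm=None, penalty=None):
    """Generic selection rule under (7): maximize L* = loglik - penalty
    over the grid of (g, c) pairs.

    fit_edgmm(X, g, c) -> (loglik, params) and penalty(params, g, c, N)
    are assumed, criterion-specific callables (e.g. BIC, ICL or HBIC).
    """
    N = X.shape[0]
    best, best_score = None, -np.inf
    for g in range(g_min, g_max + 1):
        for c in candidate_structures:
            loglik, params = fit_edgmm(X, g, c)
            score = loglik - penalty(params, g, c, N)
            if score > best_score:
                best, best_score = (g, c), score
    return best, best_score
```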

2.4.1. Bayesian information criterion (BIC)

The Bayesian information criterion (BIC; Schwarz, 1978) is one of the most popular criteria for model selection in the EDGMM family due to its theoretical consistency in choosing the number of components (Keribin, 2000) and its satisfactory performance in a number of applications (Fraley and Raftery, 1998, 2002). For a g-component EDGMM, the penalty term of BIC in (7) is given by

    P_{bic}(\hat\theta) = \frac{1}{2} \sum_{k=1}^{g} D_k \log N + \frac{D_0 + g - 1}{2} \log N,    (8)

where Dk and D0 are the numbers of free and common parameters in θ_k, respectively. It can be observed from (8) that D0 and Dk are treated equivalently in the sense that both are penalized by the same sample size N, which means that BIC uses the whole sample size N to penalize the Dk free parameters of component k. However, from (2) and (3), the effective sample size for estimating θ_k^s of component k in EDGMM is not N but only N π̂_k. For illustration, Fig. 1 shows a toy data set of two well-separated clusters, where clusters 1 and 2 have 50 and 450 observations, respectively. For these data, BIC uses the total sample size N = 500 to penalize D1, rather than the effective sample size 50, resulting in over-penalization of component 1. In short, BIC ignores the clustered data structure completely and suffers from over-penalization.
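For illustration, the following Python sketch (our own; only a few covariance structures from Table 1 are spelled out) computes the BIC penalty (8) for a chosen structure.

```python
import numpy as np

def edgmm_dof(model, d):
    """(Dk, D0) counts from Table 1 for a few structures; beta = d(d+1)/2."""
    beta = d * (d + 1) // 2
    table = {
        "EII": (d, 1),
        "VII": (d + 1, 0),
        "EEE": (d, beta),
        "VEV": (beta + 1, d - 1),
        "VVV": (beta + d, 0),
    }
    return table[model]

def bic_penalty(model, d, N, g):
    """BIC penalty (8): every parameter is penalized with the full sample size N."""
    Dk, D0 = edgmm_dof(model, d)
    return 0.5 * g * Dk * np.log(N) + 0.5 * (D0 + g - 1) * np.log(N)
```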


Fig. 1. A toy data set with sample size N = 500.

2.4.2. Integrated complete-data likelihood (ICL)

The integrated complete-data likelihood criterion (ICL; Biernacki et al., 2000) is an extension of BIC that tends to favor well-separated mixtures. The penalty of ICL comprises the BIC penalty and an additional entropy term that measures the overlap among components:

    P_{icl}(\hat\theta) = P_{bic}(\hat\theta) - \sum_{i=1}^{N} \sum_{k=1}^{g} \mathrm{MAP}\{r_{ik}\} \log r_{ik},    (9)

where MAP{r_ik} is the MAP classification based on r_ik, taking the value 1 if \arg\max_l \{r_{il}\} = k and 0 otherwise.
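A minimal sketch of the entropy term in (9), given a responsibility matrix, might look as follows (our own illustration, assuming NumPy).

```python
import numpy as np

def icl_entropy_term(r, eps=1e-300):
    """Entropy term of (9): -sum_i sum_k MAP{r_ik} log r_ik.

    r is the (N, g) responsibility matrix; the MAP indicator selects, for each
    observation, the log-responsibility of its most probable component.
    """
    hard = r.argmax(axis=1)                    # MAP classification
    picked = r[np.arange(r.shape[0]), hard]    # r_{i, MAP(i)}
    return -np.sum(np.log(picked + eps))
```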

3. Hierarchical BIC (HBIC)

We propose in Section 3.1 a new criterion called hierarchical BIC (HBIC) for model selection in the EDGMM family. In Section 3.1.1, we discuss its relationship with BIC, and in Section 3.2 we show that HBIC is a large sample approximation of the variational Bayesian (VB) lower bound.

3.1. The proposed criterion

The proposed criterion also has the general form (7), but with the new penalty term

    P_{hbic}(\hat\theta) = \frac{1}{2} \sum_{k=1}^{g} D_k \log(N \hat\pi_k) + \frac{D_0 + g - 1}{2} \log N,    (10)

where π̂_k is the ML/MAP estimate of π_k and Dk (resp. D0) is the number of free parameters of θ_k^s (resp. θ_0). In contrast to BIC in (8), it can be seen from (10) that HBIC penalizes Dk using only the corresponding effective sample size N π̂_k, and thus HBIC matches the clustered data structure well, as discussed in Section 2.4.1.

The key idea behind HBIC is that parameters are penalized only by their relevant effective sample sizes. In fact, this idea is not new. For example, Steele (2002) examined a special case of HBIC under the VVV structure for model selection in one-dimensional GMMs, and Gollini and Murphy (2014) further investigated this criterion under mixtures of latent trait analyzers. A theoretical justification of this criterion is given in Zhao et al. (2013). The main difference from the present paper is that Zhao et al. (2013) consider the case of no constraints across components, whereas for EDGMM we must handle constraints across components. Similar criteria for hierarchical or random effects models have also been suggested in Pauler (1998) and Raftery et al. (2007). On the other hand, it can be seen from (10) that spurious components with small values of N π̂_k contribute little to the HBIC penalty, so HBIC alone may not rule them out. In Section 4.2.1, we find that this drawback can easily be eliminated by means of the component annihilation step detailed in Section 4.1.2.

3.1.1. Relationship between HBIC and BIC

Comparing (8) and (10), we see that (i) when g = 1 (i.e., a one-component mixture), HBIC degenerates into BIC; and (ii) when g > 1, since N π̂_k ≤ N, BIC penalizes the model more heavily than HBIC. Despite this difference, there exist close relationships between HBIC and BIC. In HBIC, BIC is performed in a 'hierarchical' manner: the first hierarchy is a global BIC and the second hierarchy is a local BIC.

• Global BIC for the π_k's and the common component parameter θ_0: Since all N data points are devoted to estimating θ_0 and the π_k's under the constraint \sum_k \pi_k = 1, the corresponding BIC penalty term for θ_0 is (D_0/2) log N and that for the π_k's is ((g-1)/2) log N, which together yield the second term in (10).


• Local BIC for the free component parameters θ_k^s: Since the estimation of θ_k^s is based on the local effective sample size N π̂_k, the corresponding BIC penalty term for θ_k^s is (D_k/2) log(N π̂_k). Summing over all g components yields the first term in (10) (see the numerical sketch below).
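The two-level structure just described can be written down in a few lines. The following sketch (our own; hbic_penalty and bic_penalty_simple are illustrative names) computes both penalties and makes the identity stated in (11) below easy to check numerically.

```python
import numpy as np

def hbic_penalty(Dk_list, D0, pi_hat, N):
    """HBIC penalty (10): a local BIC for each theta_k^s plus a global BIC
    for theta_0 and the mixing proportions."""
    g = len(pi_hat)
    local = 0.5 * sum(Dk * np.log(N * pk) for Dk, pk in zip(Dk_list, pi_hat))
    global_ = 0.5 * (D0 + g - 1) * np.log(N)
    return local + global_

def bic_penalty_simple(Dk_list, D0, g, N):
    """BIC penalty (8) written in the same notation."""
    return 0.5 * sum(Dk_list) * np.log(N) + 0.5 * (D0 + g - 1) * np.log(N)

# Identity (11) below: P_hbic = P_bic + 0.5 * sum_k D_k * log(pi_hat_k).
# E.g. with Dk_list = [5, 5, 5], D0 = 0, pi_hat = [0.2, 0.5, 0.3], N = 500,
# the two sides agree up to floating-point error.
```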

Furthermore, the relationship between P_bic(\hat\theta) in (8) and P_hbic(\hat\theta) in (10) can be written as

    P_{hbic}(\hat\theta) = P_{bic}(\hat\theta) + \frac{1}{2} \sum_{k=1}^{g} D_k \log \hat\pi_k.    (11)

The second term in (11) is only an order-1 term, so HBIC and BIC differ by O(1) in the large N limit; as shown in Section 3.2, this extra term is what makes BIC the less accurate approximation of the VB lower bound. It also indicates that the consistency of BIC in choosing the number of components (Keribin, 2000) carries over to HBIC.

3.2. HBIC: a large sample limit of the variational Bayesian lower bound

In the Bayesian treatment of EDGMM, the true posterior p(Z, θ | X) is usually intractable. In such a situation, it is often preferable to find a deterministic approximate solution. To this end, the variational Bayesian (VB) approach adopts a tractable distribution q to approximate p and optimizes a lower bound F on the log marginal likelihood log p(X). By Jensen's inequality,

    \log p(X) = \log \int\!\!\int p(X, Z, \theta) \, dZ \, d\theta
              \ge \int\!\!\int q(Z, \theta) \log \frac{p(X, Z, \theta)}{q(Z, \theta)} \, dZ \, d\theta = F(q).    (12)

The tractable q can often be obtained by assuming that q factorizes over Z and θ:

    q(Z, \theta) = q(Z) \, q(\theta).    (13)

Following Bishop (2006), we use the priors

    p(\theta) = p(\pi) \, p(\theta_0) \prod_{k=1}^{g} p(\theta_k^s), \qquad
    p(\pi) = \mathrm{Dir}(\pi | \alpha_0) = C(\pi_0) \prod_{k=1}^{g} \pi_k^{\alpha_0 - 1},    (14)

where Dir(·) denotes the Dirichlet distribution and C(π_0) = Γ(gα_0)/Γ(α_0)^g is the normalization coefficient of p(π). With the priors in (14), we obtain the additional factorization

    q(\theta) = q(\pi) \, q(\theta_0) \prod_{k} q(\theta_k^s).    (15)

Let E[·]_{q(·)} stand for an expectation w.r.t. the distribution q(·). Substituting (13) and (15) into (12) and maximizing F over q(Z), q(π), q(θ_0) and q(θ_k^s), k = 1, ..., g, yields the following VBEM updating steps.

• VBE-step:

    q(Z) = \prod_{i=1}^{N} \prod_{k=1}^{g} q(z_{ik}),    (16)

where

    \log q(z_{ik}) = z_{ik} \left( E[\log \pi_k]_{q(\pi)} + E[\log p(x_i | \theta_k)]_{q(\theta_k^s) q(\theta_0)} \right) + \mathrm{const}.    (17)

• VBM-step:

    q(\pi) = \mathrm{Dir}(\pi | \alpha),    (18)

    \log q(\theta_0) = \sum_{k=1}^{g} \sum_{i=1}^{N} r_{ik} \, E[\log p(x_i | \theta_k)]_{q(\theta_k^s)} + \log p(\theta_0) + \mathrm{const},    (19)

and

    \log q(\theta_k^s) = \sum_{i=1}^{N} r_{ik} \, E[\log p(x_i | \theta_k)]_{q(\theta_0)} + \log p(\theta_k^s) + \mathrm{const},    (20)

for k = 1, ..., g, where r_ik = E[z_ik]_{q(Z)}, α_k = α_0 + N_k, and N_k = \sum_i r_{ik}. The VBEM algorithm alternately iterates (16) and (18)-(20) until convergence.

Substituting (16) and (18)-(20) into (12), we can write F as

    F = \underbrace{E\!\left[ \log \frac{p(X, Z | \theta)}{q(Z)} \right]_{q(Z) q(\theta)}}_{F_D}
        - \underbrace{\mathrm{KL}[q(\pi) \,\|\, p(\pi)]}_{F_1}
        - \underbrace{\mathrm{KL}[q(\theta_0) \,\|\, p(\theta_0)]}_{F_2}
        - \underbrace{\sum_{k=1}^{g} \mathrm{KL}[q(\theta_k^s) \,\|\, p(\theta_k^s)]}_{F_3}.    (21)


In the large N limit, we have

    F_D = L(X | \hat\theta) + O(1),
    F_1 = \frac{g-1}{2} \log N - \log p(\hat\pi) + O(1),
    F_2 = \frac{D_0}{2} \log N - \log p(\hat\theta_0) + O(1),

and

    F_3 = \sum_{k=1}^{g} \left\{ \frac{D_k}{2} \log(N \hat\pi_k) - \log p(\hat\theta_k^s) \right\} + O(1).

The proofs of these limits are in general similar to those for the mixtures of factor analyzers in Zhao et al. (2013). Since the proof for the limit of F_1 is directly available from Zhao et al. (2013), we only give the proofs for F_2, F_3 and F_D, in Appendices A and B. By further dropping the order-1 terms \log p(\hat\pi) + \log p(\hat\theta_0) + \sum_k \log p(\hat\theta_k^s), we obtain the following result.

Theorem 1. As the sample size N tends to infinity,

    F = L(X | \hat\theta) - P_{hbic}(\hat\theta) + O(1),

where P_hbic(\hat\theta) is given by (10).

The HBIC criterion in Section 3.1 follows from Theorem 1. Theorem 1 shows that HBIC is a large sample limit of the VB lower bound F, which establishes a connection between HBIC and VB similar to that between BIC and the Bayesian marginal likelihood. With this connection, we can use HBIC reliably when the sample size is large.

4. Experiments

In this section, we perform a series of experiments on synthetic and real-world data sets. The first objective is to examine Theorem 1 empirically; the second is to compare the proposed HBIC with two commonly used model selection criteria: (i) BIC (Fraley and Raftery, 2002) and (ii) ICL (Biernacki et al., 2000). Parameter estimation is performed by a variant of the EM implementation of Celeux and Govaert (1995), detailed in Section 4.1, except that the EVE and VVE models are fitted with the majorization-minimization implementation recently proposed in Browne and McNicholas (2014). Since BIC, ICL, and HBIC differ only in the penalty term, for a fair comparison we compute the values of all criteria using the same estimate.

4.1. Implementation strategies

In this subsection, we discuss several strategies used in our implementation of EDGMM, including maximum a posteriori (MAP) estimation, component annihilation, initialization and the convergence criterion.

4.1.1. Maximum a posteriori (MAP) estimation

Since the ML estimate of EDGMM may suffer from singularities, in which case its BIC or HBIC value cannot be computed, Fraley and Raftery (2007) suggest using the MAP estimate. Inspired by Srivastava et al. (2007), we use the prior

    p(\theta) = p(\pi) \prod_{k=1}^{g} p(\mu_k) \, p(\Sigma_k)
              \propto \prod_{k=1}^{g} |\Sigma_k|^{-g/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(\Sigma_k^{-1} B) \right\},    (22)

where B is a positive definite matrix. The prior (22) places an inverted Wishart prior with g degrees of freedom on Σ_k and noninformative priors on the component probabilities π_k and means μ_k. Substituting (22) into (6), the updating equation for Σ_k in the M-step needs some small modifications. For example, for the VVV model, S_k^{(t+1)} in (4) is now replaced by

    \tilde S_k^{(t+1)} = \frac{1}{N \pi_k^{(t+1)} + g} \left( N \pi_k^{(t+1)} S_k^{(t+1)} + B \right)
                       = (1 - \gamma_1) S_k^{(t+1)} + \gamma_1 \frac{B}{g},    (23)

where γ_1 = g/(N π_k^{(t+1)} + g). For the EEE model, S^{(t+1)} in (5) is now replaced by

    \tilde S^{(t+1)} = \frac{1}{N + g} \left( N S^{(t+1)} + B \right)
                     = (1 - \gamma_2) S^{(t+1)} + \gamma_2 \frac{B}{g},    (24)

where γ_2 = g/(N + g).

Table 2
Successful rate (%) in selecting the number of components by BIC and HBIC for various values of the annihilation threshold n0.

Criterion   n0 = 2   n0 = 5   n0 = 15   n0 = 20   n0 = 30   n0 = 40
BIC         99       100      100       100       100       100
HBIC        86       99       100       100       100       100

If we let g = d and take B to be a scalar covariance matrix, e.g., B/g = r · tr(\sum_k \pi_k S_k)/d · I, then the ML estimates in (23) and (24) are shrunk towards scalar covariances, similarly to regularized discriminant analysis (Friedman, 1989). Nevertheless, the regularization parameters γ_1 and γ_2 here adapt to the data as the sample size N varies. In our experiments, unless stated otherwise, we set g = d and r = 0.05 in (23), which amounts to a small amount of regularization.
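A minimal sketch of the shrinkage update (23) and of the default choice of B described above (function names are ours; r = 0.05 and g = d follow the settings stated in this subsection):

```python
import numpy as np

def map_shrunk_cov(S_k, N_pi_k, B, g):
    """Regularized local covariance, eq. (23): a convex combination of the
    local sample covariance S_k and the prior scale matrix B/g."""
    gamma1 = g / (N_pi_k + g)
    return (1.0 - gamma1) * S_k + gamma1 * (B / g)

def default_B(S_list, pi, d, r=0.05, g=None):
    """Scalar prior scale: B/g = r * tr(sum_k pi_k S_k)/d * I, with g = d by default."""
    g = d if g is None else g
    pooled = sum(p * S for p, S in zip(pi, S_list))
    return g * r * np.trace(pooled) / d * np.eye(d)
```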


4.1.2. Spurious component annihilation

The spurious component annihilation mechanism (Zhang et al., 2004) is popular for learning GMMs in the machine learning community but seems to have received little attention in the statistics community. This mechanism is another effective way to alleviate the problem of singularities, and its origin should be credited to Figueiredo and Jain (2002). If only the MAP estimate is adopted, the final estimate may still include spurious or insignificant components that have small values of N π̂_k; this tends to happen frequently when the number of components g is larger than the optimal one. To overcome this problem, we follow Zhao et al. (2013) in adding a component annihilation step to the intermediate MAP estimates of the EM. In detail, we remove the components with N π̂_k ≤ n0, where n0 is a user-defined parameter. In other words, any component whose effective sample size is less than n0 is regarded as spurious and simply discarded. As a result, the obtained estimate excludes spurious components, and the HBIC value can always be computed. The possible disadvantage of this step is that, when some component j is of interest but its local sample size N π̂_j < n0, this component will not survive and will be forced to merge with other ones. Therefore, in practice, it may not be advisable to take a large value of n0. Unless stated otherwise, we set n0 = 15 in our experiments.

4.1.3. Initialization and convergence

To tackle the initialization issue, we choose the best run in terms of log-likelihood out of r = 10 runs with different K-means starts. To monitor convergence, we stop the EM algorithm once the relative change in the objective function is smaller than a threshold of 10^{-5}.

4.2. Synthetic data

In this subsection, we use synthetic data to perform three experiments: (i) sensitivity to the component annihilation threshold n0; (ii) the connection among the VB lower bound, HBIC and BIC; (iii) comparison of HBIC with BIC and ICL.

4.2.1. Sensitivity to the component annihilation threshold

As can be seen from (10), components with small values of N π̂_k contribute only a small amount to the HBIC penalty. Therefore, in this experiment, we investigate the effect of n0 on the performance of HBIC and BIC. We generate 10-dimensional data from the 3-component EVI model parameterized by

    (\pi_1, \pi_2, \pi_3) = (0.2, 0.5, 0.3);
    \mu_1 = \mu_2 = \mu_3 = (0, \ldots, 0)', \text{ then } \mu_{2i} = 0.5, \ \mu_{3i} = -0.5, \ i = 1, \ldots, 5;
    \Sigma_1 = \mathrm{diag}\{1, 1, 1, 1, 1, 0.1, 0.1, 0.1, 0.1, 0.1\},
    \Sigma_2 = \mathrm{diag}\{1, 5, 8, 10, 10, 0.09, 0.09, 0.09, 0.09, 0.09\},
    \Sigma_3 = \mathrm{diag}\{10, 10, 8, 5, 1, 0.11, 0.11, 0.11, 0.11, 0.11\},
    \text{and } \Sigma_k = \Sigma_k / |\Sigma_k|^{1/10}, \quad k = 1, 2, 3,

where diag{v} is a diagonal matrix with the elements of vector v on its main diagonal. We choose a moderately large sample size of 500, for which Theorem 1 is expected to hold approximately and BIC and HBIC to perform similarly. Table 2 shows the successful rates in choosing the number of components over 100 repetitions. It can be observed from Table 2 that BIC is less sensitive than HBIC to small values of the component annihilation threshold. Nevertheless, HBIC performs reliably when n0 ≥ 15. In practice, it may not be good to take a large value of n0, since the number of data points in a class of interest may be smaller than n0. Therefore, we choose the moderate value n0 = 15 in the following experiments.

4.2.2. Connection among the VB lower bound, HBIC and BIC

In this experiment, we empirically investigate our theoretical result that both HBIC and BIC converge to the VB lower bound in the large sample limit. To this end, we generate d = 10-dimensional data from the three-component EII model


Fig. 2. (a) An example of synthetic data with sample size N = 400. The objective function values of BIC, HBIC and the VB lower bound versus the number of components (varying from 1 to 7) when (b) the sample size is 400; (c) the sample size is 4000; (d) the sample size is 40 000.

Table 3
Relative difference, shown as mean(±std.), of the VB lower bound from BIC and HBIC over 100 replications.

Criterion   N = 400               N = 4000              N = 40 000
BIC         4.34(±0.09) × 10^-2   4.79(±0.03) × 10^-3   4.86(±0.01) × 10^-4
HBIC        2.83(±0.08) × 10^-2   3.07(±0.03) × 10^-3   3.11(±0.008) × 10^-4

parameterized by

    \pi_k = 1/3, \ \Sigma_k = I, \ k = 1, 2, 3; \quad \mu_1 = (0\ 0 \cdots 0)', \ \mu_2 = (4\ 4 \cdots 4)', \ \text{and } \mu_3 = -\mu_2.
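For completeness, a short sketch (our own, assuming NumPy) of how such EII data can be simulated:

```python
import numpy as np

def sample_eii_mixture(N, d=10, sep=4.0, seed=0):
    """Draw N points from the three-component EII model of Section 4.2.2:
    pi_k = 1/3, Sigma_k = I, mu_1 = 0, mu_2 = sep*1, mu_3 = -mu_2."""
    rng = np.random.default_rng(seed)
    mus = np.stack([np.zeros(d), np.full(d, sep), np.full(d, -sep)])
    labels = rng.integers(0, 3, size=N)              # pi_k = 1/3
    X = mus[labels] + rng.standard_normal((N, d))    # Sigma_k = I
    return X, labels
```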

Fig. 2(a) shows a typical example of these data with sample size N = 400 in the space of the first two dimensions. Fig. 2(b)-(d) shows the objective function values of BIC, HBIC and the VB lower bound from one replication at different sample sizes. Table 3 summarizes the results obtained by 3-component VVV models over 100 replications, where the relative difference of the VB lower bound F from BIC (resp. HBIC) is defined as (BIC − F)/F (resp. (HBIC − F)/F). From Fig. 2 and Table 3, it can be observed that (i) both HBIC and BIC converge to the VB lower bound in the large sample limit, and (ii) HBIC is a better approximation to the VB lower bound than BIC.

4.2.3. Comparison of HBIC with BIC and ICL

We generate data from the 3-component EVV model parameterized by

    (\pi_1, \pi_2, \pi_3) = (0.2, 0.5, 0.3); \quad \Psi = 0.2 \cdot \mathrm{diag}\{1.0, 1.1, \ldots, 1.9\};
    \mu_1 = (0, \ldots, 0)', \quad \mu_2 = 3.5 + \mu_1, \quad \mu_3 = -3.5 + \mu_1;
    A_1 = A, \quad A_2 = A(:, 1{:}3), \quad A_3 = A(:, 1{:}4), \quad
    A = \begin{pmatrix}
        1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
        1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
        0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \\
        0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\
        0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1
    \end{pmatrix}';
    \Lambda_1(7, 7) = 7, \quad \Lambda_2(5, 5) = 5, \quad \Lambda_3(6, 6) = 6;
    [D_k, \Lambda_k] = \mathrm{eig}(A_k A_k' + \Psi), \quad \Lambda_k = \Lambda_k / |\Lambda_k|^{1/10}, \quad
    \lambda = 3 |\Lambda_3|^{1/10}, \quad k = 1, 2, 3;
    \text{and } \Sigma_k = \lambda D_k \Lambda_k D_k',


Fig. 3. Results of different criteria on synthetic data sets: (a) the average estimated number of components and standard deviation versus training sample size N; (b) the average test log-likelihood divided by 10^4 and standard deviation versus training sample size N. The dotted lines in (a) and (b) signal the true number of components and test log-likelihood, respectively.


Fig. 4. Results of different criteria on synthetic data sets: (a) the average estimated number of components and standard deviation versus training sample size N; (b) successful rate (%) in choosing the component covariance versus training sample size N; (c) the average test log-likelihood divided by 10^4 and standard deviation versus training sample size N. The dotted lines in (a) and (c) signal the true number of components and test log-likelihood, respectively.

where A(:, 1:q) is the matrix consisting of the first q columns of matrix A, and [D, Λ] = eig(B) performs the eigen-decomposition of matrix B, with D an orthogonal matrix of eigenvectors and Λ a diagonal matrix of the corresponding eigenvalues.

We investigate the accuracy and the prediction/generalization performance of all criteria as the sample size N varies. To assess sampling variation, for each N we generate 100 training data sets. To measure accuracy, we use the estimated number of mixture components and/or the successful rate in choosing the correct covariance structure. To investigate the prediction performance of the model chosen by each criterion, we follow Zhao and Yu (2009) in using the test log-likelihood, i.e., the log-likelihood of the chosen model evaluated on an independent (unseen) test data set of size 5000. The test log-likelihood is useful for comparing different criteria because it shows which criterion yields the best performance even when its chosen model is not the true one.

In the first experiment, we investigate the performance in selecting the number of components only. To this end, we fix the covariance structure to the true EVV and perform model selection over the set {g}, where the number of components g ranges from g_min = 1 to g_max = 15. Fig. 3 shows the results. In the second experiment, we investigate the performance in selecting the number of components and the type of covariance structure simultaneously. For all criteria, we perform model selection over the set {(g, c)}, where g ranges from g_min = 1 to g_max = 15 and c ranges over the 14 candidate covariance structures shown in Table 1. Fig. 4 shows the results.

The main observations from Figs. 3 and 4 are: (i) Large sample size: when N is large, all criteria tend to perform similarly in both accuracy and prediction. (ii) Small sample size: HBIC outperforms BIC and ICL substantially in both accuracy and prediction. Since the true components hardly overlap, BIC and ICL perform similarly.

4.3. Real data

In this subsection, we perform unsupervised clustering experiments on two real data sets, Fisher's classical Iris data (Fisher, 1936) and the Seeds data (Charytanowicz et al., 2010), both available from the UCI Machine Learning Repository (Bache and Lichman, 2013):

(i) The Iris data set contains 150 observations of three classes of plants, measured on four variables: sepal length, sepal width, petal length and petal width. Each class has 50 observations.


Table 4
Results on Iris data: the chosen model (number of components, covariance structure), the corresponding adjusted Rand index and the misclassification rate (%).

Measure                   BIC        HBIC       ICL        mixture (BIC)
Adjusted Rand index       0.57       0.90       0.57       0.57
Misclassification rate    33.3       3.3        33.3       33.3
Model                     (2,VEV)    (3,VEV)    (2,VEV)    (2,VEV)

Table 5
Results on Seeds data: the chosen model (number of components, covariance structure), the corresponding adjusted Rand index and the misclassification rate (%).

Measure                   BIC        HBIC       ICL        mixture (BIC)
Adjusted Rand index       0.54       0.88       0.54       0.49
Misclassification rate    33.3       4.3        33.3       14.8
Model                     (2,EEV)    (3,VEV)    (2,EEV)    (5,EEV)

(ii) The Seeds data set contains 210 observations of three different varieties of wheat: Kama, Rosa and Canadian. Each class has 70 observations, consisting of values measured on seven geometric parameters of wheat kernels. Since these parameters have different measurement units, we preprocess the data by standardization.

To assess the performance of the different criteria, we use two measures: (i) the adjusted Rand index (Hubert and Arabie, 1985); and (ii) the misclassification rate. The adjusted Rand index takes values between 0 and 1; the larger the adjusted Rand index, the better the agreement between the clustering results and the true class labels of the data. The misclassification rate is obtained from a MAP classification of the data.
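For reference, a minimal sketch of the two performance measures (our own illustration; it uses scikit-learn's adjusted_rand_score, and the greedy majority matching in misclassification_rate is a simplification of the MAP-classification-based rate used in the paper):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def misclassification_rate(true_labels, cluster_labels):
    """Misclassification rate (%) after matching each cluster to its majority class.

    Assumes integer-coded class labels; every member of a cluster that is not
    in the cluster's majority class is counted as an error.
    """
    errors = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        majority = np.bincount(members).argmax()
        errors += (members != majority).sum()
    return 100.0 * errors / len(true_labels)

# ari = adjusted_rand_score(true_labels, cluster_labels)  # typically in [0, 1]
```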
4.3.1. Performance on Iris data

The detailed clustering results by BIC, HBIC and ICL on the whole Iris data can be found in Table C.7; a concise summary is given in Table 4 and further visualized in Fig. 5. From Table 4, it can be seen that HBIC selects a 3-component mixture, which yields the best performance whether judged by the adjusted Rand index or the misclassification rate, while both BIC and ICL choose similar 2-component solutions for the 3-class Iris data, which results in a significantly lower adjusted Rand index and a higher misclassification rate. The VEV structure chosen by HBIC means that the three types of Iris plant share a common shape but vary in orientation and volume. From Fig. 5, the solution given by HBIC looks much more satisfactory than those given by BIC and ICL.

From Table C.7, it can be observed that the 3-component mixture with the second largest BIC (resp. ICL) value is clearly better than the 2-component mixture with the largest BIC (resp. ICL) value. In fact, this 3-component mixture is very similar to the best one given by HBIC. However, this model is not chosen by BIC (resp. ICL) because of its slightly lower BIC (resp. ICL) value. This observation implies that BIC (resp. ICL) suffers from over-penalization while HBIC does not.

For comparison, we re-perform the clustering experiment with the R software mixture, as this software implements all fourteen models in the EDGMM family. The best model selected by mixture also has only two components. Table C.8 lists the detailed results under three different initialization strategies. From Tables C.7 and C.8, it can be seen that the main results are similar to ours; that is, the 3-component mixture with the second largest BIC is actually the best model but is not chosen by BIC.

4.3.2. Performance on Seeds data

The results by BIC, HBIC and ICL on the whole Seeds data can be found in Table C.9; a concise version is given in Table 5 and visualized in Fig. 6. The results are generally similar to those obtained on the Iris data. From Table 5, it can be seen that the 3-component mixture selected by HBIC yields the best performance, while both BIC and ICL suffer from underestimation, choosing similar 2-component solutions for the 3-class Seeds data. The VEV structure chosen by HBIC implies that the three varieties of wheat share a common shape but vary in orientation and volume. From Fig. 6, the solution given by HBIC looks much better than those given by BIC and ICL. It can be observed from Table C.9 that all the top three models by BIC (resp. ICL) have two components only, while those by HBIC have three components, which indicates that BIC (resp. ICL) suffers from serious over-penalization while HBIC does not.

The 5-component mixture obtained by the R software mixture yields a very low adjusted Rand index. For these data, MAP estimation can affect the results significantly, but the R software mixture does not provide it, and thus the detailed results listed in Table C.10 are not directly comparable with ours in Table C.9. To see how the regularization parameter r used for MAP estimation in Section 4.1.1 affects the performance of BIC, HBIC and ICL, we vary r over the wide range {0.0001, 0.001, 0.01 : 0.01 : 0.1, 0.2, 0.5}, where a : h : b denotes the set {a, a + h, ..., b}.



Fig. 5. Results on Iris data. (a) the 2-component estimate by BIC; (b) the 3-component estimate by HBIC; (c) the 2-component estimate by ICL, where the data is shown in the leading two linear discriminant component space.


Fig. 6. Results on Seeds data. (a) the 2-component estimate by BIC; (b) the 3-component estimate by HBIC; (c) the 2-component estimate by ICL, where the data is shown in the leading two linear discriminant component space.

Table 6
Results for various values of the regularization parameter r on Seeds data.

Chosen model
Criterion   r=0.0001   r=0.001   r=0.01   r=0.02   r=0.03:0.01:0.08   r=0.09   r=0.1   r=0.2   r=0.5
BIC         4 EEV      3 EEV     3 EEV    3 VEV    2 EEV              2 EEE    3 EEE   3 EEE   3 EEE
HBIC        5 VEV      4 EEV     3 VEV    3 VEV    3 VEV              3 VEV    3 VEV   3 VEE   3 EEE
ICL         4 EEV      3 EEV     3 EEV    2 EEV    2 EEV              2 EEV    2 EEV   3 EEE   3 EEE

Adjusted Rand index
Criterion   r=0.0001   r=0.001   r=0.01   r=0.02   r=0.03:0.01:0.08   r=0.09   r=0.1   r=0.2   r=0.5
BIC         0.63       0.61      0.85     0.88     0.54-0.55          0.54     0.88    0.90    0.89
HBIC        0.49       0.64      0.87     0.88     0.88               0.88     0.88    0.92    0.89
ICL         0.63       0.61      0.85     0.54     0.54-0.55          0.54     0.54    0.90    0.89

The smaller the value of r, the smaller the amount of regularization, and vice versa. Table 6 lists the results. The following can be seen from Table 6. (i) r has a significant effect on the results of BIC, HBIC and ICL. When r is small, e.g., r = 0.0001, all criteria choose a number of components greater than three; on the contrary, when r is large, e.g., r = 0.5, all criteria choose three components and the simple EEE structure. (ii) However, when 0.01 ≤ r ≤ 0.1, HBIC is much more stable than BIC and ICL: HBIC always chooses 3 components and the VEV structure, while the choices of BIC and ICL are sensitive to the value of r.

5. Conclusion and future works

In this paper, we proposed a new criterion called hierarchical BIC (HBIC) for model selection in finite mixture models. We showed both theoretically and empirically that HBIC is a large sample limit of the VB lower bound and that BIC is a less accurate approximation. Our empirical results demonstrate that the proposed HBIC is more accurate than BIC and ICL. Compared with HBIC, BIC ignores the clustered data structure and thus suffers from over-penalization; ICL also suffers from this problem because its penalty is the BIC penalty plus an additional entropy term that measures the overlap among components. On the other hand, BIC is known to over-estimate the number of mixture components in some cases, because it tries to select the number of components that best approximates the underlying distribution rather than the true number of clusters. In contrast, ICL is more tailored to selecting the number of clusters (Biernacki et al., 2000). Therefore, it would be interesting to develop a hierarchical ICL criterion that uses the HBIC penalty plus the entropy term of ICL, since such a criterion could match the clustered data structure and favor well-separated mixtures simultaneously, thus combining the advantages of both HBIC and ICL.


Finite mixture models based on the t distribution and skew distributions, namely the t-EDGMM and skew-EDGMM families, have been proposed in Andrews and McNicholas (2012) and Vrbik and McNicholas (2014) and found to perform better than their Gaussian counterparts in real applications. Extending our proposed HBIC to accommodate the t-EDGMM and skew-EDGMM families would be interesting future work.

Acknowledgments

The authors would like to thank the Associate Editor and the two anonymous reviewers for their valuable comments and suggestions, which improved this paper substantially. The work was supported partly by the Natural Science Foundation of China under Grant 11361071, Grant 61403337, Grant 11161053, and Grant U1302267, and partly by the Hong Kong Research Grants Council, General Research Fund, under Grant HKU 706710P.

Appendix A. The limit of F2 and F3

Let us consider E[log q(θ_k^s)] and E[log q(θ_0)]. For exponential family models with missing data, which include EDGMM as a special case, the variational posterior distribution q(θ) is approximately normally distributed (Wang and Titterington, 2004), from which it follows that q(θ_k^s) and q(θ_0) are also approximately normally distributed. Thus we obtain, as N → ∞,

    E[\log q(\theta_k^s)]_{q(\theta_k^s)} \approx \tfrac{1}{2} \log |2\pi H_1(\hat\theta_k^s)|,
    \quad\text{and}\quad
    E[\log q(\theta_0)]_{q(\theta_0)} \approx \tfrac{1}{2} \log |2\pi H_2(\hat\theta_0)|,

where H_1(\hat\theta_k^s) and H_2(\hat\theta_0) are the negative Hessian matrices of q(θ_k^s) and q(θ_0) evaluated at the MAP estimators \hat\theta_k^s and \hat\theta_0, respectively. Noting that H_1 scales linearly with \sum_i r_{ik} (from (20)) and that \sum_i r_{ik}/(N\hat\pi_k) \to 1, we have that, as N → ∞, H_1/(N\hat\pi_k) converges to a constant matrix, denoted by H_{10}, and thus

    E[\log q(\theta_k^s)]_{q(\theta_k^s)} = \frac{D_k}{2} \log(N\hat\pi_k) + \frac{1}{2} \log |2\pi H_{10}| + O(1)
                                          = \frac{D_k}{2} \log(N\hat\pi_k) + O(1).    (A.1)

Similarly, noting that H_2 scales linearly with N (from (19)), we have that, as N → ∞, H_2/N converges to a constant matrix, and thus

    E[\log q(\theta_0)]_{q(\theta_0)} = \frac{D_0}{2} \log N + O(1).    (A.2)

By (A.1) and (A.2), as N → ∞, we have

    F_2 = E[\log q(\theta_0)] - E[\log p(\theta_0)] = \frac{D_0}{2} \log N - \log p(\hat\theta_0) + O(1),

and

    F_3 = \sum_{k=1}^{g} \left\{ E[\log q(\theta_k^s)] - E[\log p(\theta_k^s)] \right\}
        = \sum_{k=1}^{g} \left\{ \frac{D_k}{2} \log(N\hat\pi_k) - \log p(\hat\theta_k^s) \right\} + O(1).

Appendix B. The limit of FD

In the large N limit, q(π), q(θ_0) and q(θ_k^s) will be strongly peaked at \hat\pi, \hat\theta_0 and \hat\theta_k^s (Attias, 1999), that is,

    q(\pi) = \delta(\pi - \hat\pi) + O(1),    (B.1)
    q(\theta_0) = \delta(\theta_0 - \hat\theta_0) + O(1),    (B.2)

and

    q(\theta_k^s) = \delta(\theta_k^s - \hat\theta_k^s) + O(1),    (B.3)

where δ(·) is the Dirac delta function. These results follow from the approximate normal distributions in the large sample limit. Combining (B.1)-(B.3) with (17) yields

    q(z_{ik}) = p(z_{ik} | x_i, \hat\theta) + O(1), \quad \text{as } N \to \infty.    (B.4)

By (B.1)-(B.4), we obtain

    F_D = L(X | \hat\theta) + O(1).

Appendix C. Some results

See Tables C.7-C.10.

Table C.7
Results of the top 3 models by various criteria on Iris data. The chosen model is shown as (number of components, covariance structure).

Criterion   Rank   Criterion value   Model      Adjusted Rand index
BIC         1      -563.82           (2,VEV)    0.57
BIC         2      -565.45           (3,VEV)    0.90
BIC         3      -576.11           (2,VVV)    0.57
HBIC        1      -529.08           (3,VEV)    0.90
HBIC        2      -529.39           (4,VEV)    0.78
HBIC        3      -537.34           (3,VVV)    0.90
ICL         1      -563.82           (2,VEV)    0.57
ICL         2      -568.85           (3,VEV)    0.90
ICL         3      -576.11           (2,VVV)    0.57

Table C.8
Results of the top 3 models on Iris data by the R software mixture with different initializations. The chosen model is shown as (number of components, covariance structure).

Initialization            Rank   BIC value   Model      Adjusted Rand index
K-means                   1      -561.77     (2,VEV)    0.57
K-means                   2      -562.65     (3,VEV)    0.90
K-means                   3      -574.02     (2,VVV)    0.57
10 random values          1      -561.77     (2,VEV)    0.57
10 random values          2      -574.02     (2,VVV)    0.57
10 random values          3      -580.17     (3,VEV)    0.67
Deterministic annealing   1      -574.02     (2,VVV)    0.57
Deterministic annealing   2      -580.17     (3,VEV)    0.67
Deterministic annealing   3      -592.67     (4,EVE)    0.87

Table C.9
Results of the top 3 models by various criteria on Seeds data. The chosen model is shown as (number of components, covariance structure).

Criterion   Rank   Criterion value   Model      Adjusted Rand index
BIC         1      -252.97           (2,EEV)    0.54
BIC         2      -255.65           (2,VEV)    0.54
BIC         3      -259.36           (2,EVV)    0.55
HBIC        1      -216.12           (3,VEV)    0.88
HBIC        2      -218.60           (3,EEV)    0.88
HBIC        3      -223.43           (3,VVV)    0.83
ICL         1      -254.73           (2,EEV)    0.54
ICL         2      -257.37           (2,VEV)    0.54
ICL         3      -260.28           (2,EVV)    0.55

Table C.10
Results of the top 3 models on Seeds data by the R software mixture with different initializations. The chosen model is shown as (number of components, covariance structure).

Initialization            Rank   BIC value   Model      Adjusted Rand index
K-means                   1      184.69      (5,EEV)    0.49
K-means                   2      167.14      (4,EEV)    0.62
K-means                   3      155.49      (4,VEV)    0.62
10 random values          1      139.21      (3,EEV)    0.63
10 random values          2      116.23      (5,EEV)    0.56
10 random values          3      99.67       (3,EVV)    0.63
Deterministic annealing   1      174.74      (4,EEV)    0.56
Deterministic annealing   2      139.21      (3,EEV)    0.63
Deterministic annealing   3      114.51      (5,EEV)    0.45


References

Andrews, J.L., McNicholas, P.D., 2012. Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Stat. Comput. 22 (5), 1021–1029.
Attias, H., 1999. Inferring parameters and structure of latent variable models by variational Bayes. In: Proc. 15th Uncertain. Artif. Intell., pp. 21–30.
Bache, K., Lichman, M., 2013. UCI machine learning repository. http://archive.ics.uci.edu/ml.
Banfield, J.D., Raftery, A.E., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49 (3), 803–821.
Bensmail, H., Celeux, G., Raftery, A.E., Robert, C.P., 1997. Inference in model-based cluster analysis. Stat. Comput. 7 (1), 1–10.
Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22 (7), 719–725.
Biernacki, C., Celeux, G., Govaert, G., Langrognet, F., 2006. Model-based cluster and discriminant analysis with the MIXMOD software. Comput. Statist. Data Anal. 51 (2), 587–600.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer, New York.
Browne, R.P., McNicholas, P.D., 2014. Estimating common principal components in high dimensions. Adv. Data Anal. Classif. 8 (2), 217–226.
Celeux, G., Govaert, G., 1995. Gaussian parsimonious clustering models. Pattern Recognit. 28 (5), 781–793.
Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P., Lukasik, S., Zak, S., 2010. A complete gradient clustering algorithm for features analysis of x-ray images. In: Pietka, E., Kawa, J. (Eds.), Information Technologies in Biomedicine. Springer-Verlag, Berlin, Heidelberg, pp. 15–24.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data using the EM algorithm. J. R. Stat. Soc. Ser. B 39 (1), 1–38 (with discussion).
Figueiredo, M.A.T., Jain, A.K., 2002. Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 24 (3), 381–396.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7 (2), 179–188.
Fraley, C., Raftery, A.E., 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J. 41 (8), 578–588.
Fraley, C., Raftery, A.E., 2002. Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97, 611–631.
Fraley, C., Raftery, A.E., 2007. Bayesian regularization for normal mixture estimation and model-based clustering. J. Classification 24 (2), 155–181.
Fraley, C., Raftery, A.E., Murphy, T.B., Scrucca, L., 2012. MCLUST version 4 for R: normal mixture modeling and model-based clustering. Tech. Rep. 597, Department of Statistics, University of Washington. URL: http://www.stat.washington.edu/mclust/.
Friedman, J.H., 1989. Regularized discriminant analysis. J. Amer. Statist. Assoc. 84 (405), 165–175.
Gollini, I., Murphy, T.B., 2014. Mixture of latent trait analyzers for model-based clustering of categorical data. Stat. Comput. 24 (4), 569–588.
Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classification 2 (1), 193–218.
Keribin, C., 2000. Consistent estimation of the order of mixture models. Sankhyā Ser. A 62 (1), 49–66.
Lin, T.-I., 2014. Learning from incomplete data via parameterized t mixture models through eigenvalue decomposition. Comput. Statist. Data Anal. 71 (3), 183–195.
McLachlan, G., Peel, D., 2000. Finite Mixture Models. John Wiley & Sons, New York.
Pauler, D.K., 1998. The Schwarz criterion and related methods for normal linear models. Biometrika 85 (1), 13–27.
Raftery, A.E., Newton, M.A., Satagopan, J.M., Krivitsky, P.N., 2007. Estimating the integrated likelihood via posterior simulation using the harmonic mean identity. In: Bayesian Statistics, Vol. 8. Oxford University Press, Oxford, pp. 1–45.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Srivastava, S., Gupta, M.R., Frigyik, B.A., 2007. Bayesian quadratic discriminant analysis. J. Mach. Learn. Res. 8, 1277–1305.
Steele, R., 2002. Practical importance sampling methods for finite mixture models and multiple imputation. Ph.D. thesis, University of Washington.
Vrbik, I., McNicholas, P.D., 2014. Parsimonious skew mixture models for model-based clustering and classification. Comput. Statist. Data Anal. 71 (3), 196–210.
Wang, B., Titterington, D.M., 2004. Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values. In: Proc. 20th Uncertain. Artif. Intell., pp. 577–584.
Zhang, B., Zhang, C., Yi, X., 2004. Competitive EM algorithm for finite mixture models. Pattern Recognit. 37 (1), 131–144.
Zhao, J., Yu, P.L.H., 2009. A note on variational Bayesian factor analysis. Neural Netw. 22 (7), 988–997.
Zhao, J., Yu, P.L.H., Shi, L., 2013. Model selection for mixtures of factor analyzers via hierarchical BIC. Tech. Rep., School of Statistics and Mathematics, Yunnan University of Finance and Economics, Yunnan, PR China. URL: http://www.ynufe.edu.cn/pub/tsxy/jhzhao/pub/en/a-mfa.pdf.