Accepted Manuscript

Improved Bayesian Information Criterion for Mixture Model Selection

Arash Mehrjou, Reshad Hosseini, Babak Nadjar Araabi

PII: S0167-8655(15)00344-X
DOI: 10.1016/j.patrec.2015.10.004
Reference: PATREC 6361

To appear in: Pattern Recognition Letters

Received date: 21 January 2015
Accepted date: 1 October 2015

Please cite this article as: Arash Mehrjou, Reshad Hosseini, Babak Nadjar Araabi, Improved Bayesian Information Criterion for Mixture Model Selection, Pattern Recognition Letters (2015), doi: 10.1016/j.patrec.2015.10.004
Research Highlights
• A new criterion for mixture model selection is proposed.
• The mathematical derivation of the criterion is justified.
• The proposed criterion works as well as state-of-the-art criteria for large sample sizes.
• The proposed criterion outperforms state-of-the-art criteria for small sample sizes.
• The proposed criterion performs well on real datasets.
Pattern Recognition Letters journal homepage: www.elsevier.com
Improved Bayesian Information Criterion for Mixture Model Selection

Arash Mehrjou^{a,∗∗}, Reshad Hosseini^{a,∗∗}, Babak Nadjar Araabi^{a,b}

^a Control and Intelligent Processing Center of Excellence, School of ECE, College of Engineering, University of Tehran, Tehran, Iran
^b School of Cognitive Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
ABSTRACT
In this paper, we propose a mixture model selection criterion obtained from the Laplace approximation of the marginal likelihood. Our approximation to the marginal likelihood is more accurate than the Bayesian information criterion (BIC), especially for small sample sizes. We show experimentally that our criterion works as well as other well-known criteria such as BIC and minimum message length (MML) for large sample sizes, and significantly outperforms them when fewer data points are available.

© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Mixture modeling is a powerful statistical technique for unsupervised density estimation, especially for high-dimensional data. Because of their usefulness as an extremely flexible modeling method, finite mixture models have received increasing attention over the years, with applications in pattern recognition, computer vision, signal and image processing, machine learning, and so forth (McLachlan and Peel, 2004; Bishop, 2007; Keener, 2010; Murphy, 2012). Mixture models divide the entire data space into several regions, and each region is modeled by a probability density that is usually chosen from a class of similar parametric distributions. In mixture density estimation this division is soft, meaning that each datum may belong to several components (Jordan and Jacobs, 1994). In clustering applications the division is hard, that is, each datum is assigned to only one cluster. Therefore, mixture models are applicable to clustering through probabilistic model-based approaches (Jain et al., 2000; McLachlan and Peel, 2004; McLachlan and Basford, 1988).

An important issue in mixture modeling is the selection of the number of mixture components (McLachlan and Peel, 2004). Too many components may over-fit the observations, meaning that the model fits the training data accurately but may not describe the underlying data-generating process well. On the other hand, too few components may not be flexible enough to approximate the underlying model.
∗∗ Corresponding authors (Arash Mehrjou, Reshad Hosseini): Tel.: +98-936-122-8008, Tel.: +98-21-6111-9799; e-mail: [email protected] (Arash Mehrjou), [email protected] (Reshad Hosseini)
Different approaches have been proposed in the literature for determining the number of mixture components (McLachlan and Basford, 1988). Some criteria select the number of components based on the generalization performance of the model. This is done either by testing performance on a separate validation set (Smyth, 2000) or by deriving an asymptotic bias correction for the goodness-of-fit, as done in the AIC measure (Akaike, 1974). Other criteria, like BIC, use a Bayesian framework for model selection, meaning that they try to find the model with the maximum posterior probability or maximum marginal likelihood under some regularity conditions (Schwarz, 1978; Fraley and Raftery, 2007). Integrated Complete Likelihood (ICL) is another criterion that, like BIC, approximates the marginal likelihood. ICL performs poorly when mixture components overlap and tends to underestimate the number of components (Figueiredo and Jain, 2002). As predicted by the theory behind ICL and shown in experiments, this criterion works well when each datum is assigned to only one cluster, that is, in clustering applications (Biernacki et al., 2000).

Mixture model selection has remained a topic of active research. Xie et al. (2013) used an adaptive method that investigates the stability of the log characteristic function versus the number of components to find the true model. Their method proved to be suitable for large sample sizes. Zeng and Cheung (2014) suggested a model-based clustering algorithm that uses a modified version of MML for model selection. Their method is tailored to clustering applications and also requires sufficiently large sample sizes. Maugis and Michel (2011) suggested a non-asymptotic penalized criterion for mixture model selection. Their method requires computing a quantity called the bracketing entropy, which is not easily obtainable for non-Gaussian components.
We propose a criterion for determining the number of components of mixture models that is based on the Laplace approximation to the marginal likelihood. The BIC criterion can also be viewed as the asymptotic Laplace approximation, neglecting the terms that do not grow with the number of data points. Instead of neglecting those terms, we assume the components are well-separated and derive a different approximation. It turns out that our approximation works better in many simulations.

We summarize the contributions of this paper as follows:

1. Under the assumption of non-overlapping components, we derive a new model selection criterion.
2. We show through different experiments that violating the well-separateness assumption of components does not have a detrimental effect on the performance. Our criterion actually works significantly better than other criteria even for the case of overlapping components.

The rest of the paper is organized as follows: We first review the concepts related to mixture model parameter estimation in Section 2. Popular model selection techniques are summarized in Section 3. We describe the derivation of our proposed method in Section 4. The simulation results and comparisons are given in Section 5. Finally, we finish the paper with a short conclusion and envision future directions of research in Section 6.

2. Mixture Models

In this section, we discuss some basics of mixture models; for an in-depth treatment of this subject see (McLachlan and Peel, 2004; McLachlan and Basford, 1988). The density of a mixture of K components assumes the form

p(x|\Theta) = \sum_{m=1}^{K} \pi_m \, p(x|\theta_m), \qquad \forall x \in \mathbb{R}^c,   (1)

where π1, ..., πK are the mixing probabilities coming from the K-dimensional probability simplex, i.e. \sum_{m=1}^{K} \pi_m = 1, and θm is the set of parameters defining the m-th component. The variable Θ = {θ1, ..., θK, π1, ..., πK} indicates the complete set of parameters of the mixture model. Let X = {x^(1), ..., x^(n)} be a set of n i.i.d. samples from the underlying distribution; the log-likelihood over this set is given by

\log p(X|\Theta) = \log \prod_{i=1}^{n} p(x^{(i)}|\Theta) = \sum_{i=1}^{n} \log \sum_{m=1}^{K} \pi_m \, p(x^{(i)}|\theta_m) .

The set of parameters that maximizes the log-likelihood function is called the maximum likelihood (ML) estimate and is given by

\hat{\Theta}_{ML} = \arg\max_{\Theta} \{ \log p(X|\Theta) \} .   (2)

Assuming a prior p(Θ) over the parameter set and maximizing the posterior likelihood over the parameters results in the maximum a posteriori (MAP) estimate, which takes the form

\hat{\Theta}_{MAP} = \arg\max_{\Theta} \{ \log p(X|\Theta) + \log p(\Theta) \} .   (3)

2.1. Maximum Likelihood Solution

The common procedure for solving the optimization problems in (2) and (3) is the expectation-maximization (EM) algorithm (McLachlan and Peel, 2004; Neal and Hinton, 1998). Despite its fast convergence, plain EM suffers from one main drawback: the problem of converging to a poor local maximum. We observed that simple strategies like multiple initializations are not able to solve this problem. To this end, we implemented the well-known split-and-merge EM (SMEM) algorithm of (Ueda et al., 2000), which nicely addresses this issue.¹

The ML estimation procedure does not return a reasonable estimate of the parameters when the log-likelihood of the mixture model is unbounded. Intuitively, this happens when one component gets a small number of data points but its log-likelihood becomes infinite. Several possible remedies have been proposed in the literature (Ciuperca et al., 2003; Hathaway, 1985). One of the simplest and most powerful of them is using a suitable prior on the parameter space, that is, estimating the MAP instead of the ML solution (Fraley and Raftery, 2007). Accordingly, in all of our experimental results, the parameters of mixture models are estimated using the MAP estimator.

3. Previous Model Selection Criteria

The aim of model selection is selecting a model in the hypothesis space that best describes the underlying distribution of the observed data. For the case of mixture models, each hypothesis corresponds to a mixture with a specific number of components. There are two main classes of model selection procedures: deterministic and stochastic. Deterministic methods are commonly used for determining the number of components in mixture models and are the main focus of the current paper.

3.1. Deterministic Methods

Given a set of models in the hypothesis space, deterministic procedures select the model with the optimum information criterion. An information criterion is a function of the data log-likelihood at the ML solution and the model complexity, represented as

IC(\hat{\Theta}, X) = -\alpha \log p(X|\hat{\Theta}) + \beta F(\hat{\Theta}),

where the function F(\hat{\Theta}) represents model complexity and is independent of the observed data. Also, α, β ≥ 0 are the weights determining the influence of each of these opposing terms. Normally, in the case of mixture models, F(\hat{\Theta}) increases with the number of components, penalizing complex mixture models with more components.

¹ The toolbox developed by our group can be downloaded from http://visionlab.ut.ac.ir/mixest
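To make the generic criterion above concrete, here is a minimal sketch (not the authors' MixEst toolbox, which uses MAP estimation and SMEM) that fits Gaussian mixtures by EM for a range of K and selects the K minimizing an information criterion; a BIC-style penalty is used purely as an illustration, and the helper name select_num_components is ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_num_components(X, max_K=6, covariance_type="full"):
    """Fit mixtures for K = 1..max_K and return the K minimizing a criterion
    of the generic form -2 log L + penalty (BIC-style penalty d*log(n) here)."""
    n, c = X.shape
    d_tilde = c + c * (c + 1) // 2            # free parameters per full-covariance Gaussian
    scores = {}
    for K in range(1, max_K + 1):
        gm = GaussianMixture(n_components=K, covariance_type=covariance_type,
                             n_init=5, random_state=0).fit(X)   # EM with several restarts
        loglik = gm.score(X) * n              # score() returns the average log-likelihood
        d = K * d_tilde + (K - 1)             # total free parameters (weights have K-1 dof)
        scores[K] = -2.0 * loglik + d * np.log(n)
    best_K = min(scores, key=scores.get)
    return best_K, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=m, scale=0.5, size=(300, 2)) for m in (-3.0, 0.0, 3.0)])
    print("selected K:", select_num_components(X)[0])
```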
Akaike Information Criterion. Let q(·) be the true data-generating density and p(·|Θ) be a parametric model density. The expected bias between the log-likelihood of the training data evaluated at the ML solution and the expected logarithm of the model density evaluated at its maximum Θ0 is written as

b = E_q[\log p(X|\hat{\Theta})] - E_q[\log p(x|\Theta_0)],

and can be used as a measure of complexity. In the limit of infinitely many data points, Akaike (1974) derived an analytic form for this bias and proved that the bias term is equal to the number of parameters. Thus, the bias-corrected log-likelihood, called the Akaike Information Criterion (AIC) and defined as

AIC = -2\,[\log p(X|\hat{\Theta}) - b] = -2 \log p(X|\hat{\Theta}) + 2d,

can be used for model selection. Here, d is the dimensionality of the parameter space.

Corrected Akaike Information Criterion. The bias used in AIC is not accurate in the case of a finite number of data points. However, in practice it has been used and has been an effective criterion for model selection. The only special case where the bias can be calculated analytically for finite sample size is the linear regression model. In this case, the corrected version of the information criterion becomes

AICc = -2 \log p(X|\hat{\Theta}) + \frac{2 d n}{n - d - 1},

which takes the number of data points n into account, as proposed in (Hurvich and Tsai, 1989).

Bayesian Information Criterion. The Bayesian Information Criterion (BIC) is another successful measure for model selection that can be interpreted as an approximation to the Bayes factor (Kass and Raftery, 1995). BIC is an asymptotically consistent criterion for model selection if the distribution behind the data is regular (e.g., exponential family) and the priors on both the hypothesis space and the parameter space are uniform. The BIC measure is formulated as

BIC = -2 \log p(X|\hat{\Theta}) + d \log n

in (Schwarz, 1978). It has been shown in (Keribiin, 1998) that under some regularity conditions, BIC is a consistent measure for selecting the number of components in mixture models. This criterion has been used for model selection in MoGs for large datasets in (Fraley and Raftery, 2007). It is also well known that the BIC criterion is an incorrect approximation to the Bayes factor if the objective function is unbounded, as is the case for the log-likelihood of mixture models (Rusakov and Geiger, 2002). Fraley and Raftery (2007) suggested using the MAP estimator instead of ML and, accordingly, using the penalized log-likelihood instead of the log-likelihood in the model selection criterion.

Minimum Message Length. There is another famous class of criteria with roots in information and coding theory. The class of minimum encoding length criteria (e.g., minimum message length (MML) and minimum description length (MDL)) considers the total code length necessary to represent the data information (Oliver et al., 1996). More accurate parameters need more bits for coding the model but fewer bits to represent the data. On the other hand, models that spend fewer bits on parameter coding require longer codes for data representation because of the inaccurate modeling of the data by inaccurate parameters. This rationale can be expressed in a simple equation as

Length(\Theta, X) = Length(X|\Theta) + Length(\Theta) .   (4)

The form of the MML measure which is inspired by (4) and used in this paper for comparison against our proposed criterion is adopted from (Figueiredo and Jain, 2002). It is formulated as

MML = \min_{\Theta} \Big\{ -\log p(\Theta) - \log p(X|\Theta) + \frac{1}{2} \log |I(\Theta)| + \frac{d}{2}\big(1 + \log \kappa_d\big) \Big\},   (5)

where I(\Theta) \equiv -E[\partial^2 \log p(X|\Theta)/\partial \Theta^2] is the expected Fisher information matrix, |I(\Theta)| is its determinant, and κd is a constant described in the next few lines. The parameter vector is quantized and Length(Θ) is computed depending on how the quantization regions are defined. Unlike in the scalar case, these quantization regions are not simple intervals. Optimal lattices for multivariate quantization were proposed in (Conway and Sloane, 1998) and adopted by the authors in (Figueiredo and Jain, 2002) for their derivation of the MML criterion specific to mixture models. The constant κd appearing in (5) is related to this quantization scheme and is called the optimal quantizing lattice constant for R^d (Conway and Sloane, 1998). The modified version of the MML criterion specialized for mixture models is given by

MML = \frac{\tilde{d}}{2} \sum_{m=1}^{K} \log \frac{n \pi_m}{12} + \frac{K}{2} \log \frac{n}{12} + \frac{K(\tilde{d} + 1)}{2} - \log p(X|\Theta),

where d̃ is the number of parameters specifying each mixture component (Figueiredo and Jain, 2002). In this equation, κd is approximated by 1/12 irrespective of the parameter space dimensionality d.

3.2. Stochastic Methods

Some stochastic methods try to approximate either a model selection criterion (Mengersen and Robert, 1993; Favaro et al., 2013) or the full Bayesian posterior (Richardson and Green, 1997; Chang and Fisher III, 2013). In both cases, a computationally expensive sampling procedure like Markov Chain Monte Carlo is used. Some other techniques take advantage of re-sampling (Efron and Tibshirani, 1997; Bischl et al., 2012) (e.g., the bootstrapping method) or cross-validation for computing the generalization performance of different models. In this paper, the focus is on the deterministic methods. In the next section, we introduce our proposed criterion and illustrate its mathematical derivation.
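As a reference sketch of the deterministic criteria reviewed in Section 3.1 (assuming a mixture already fitted elsewhere, with d̃ free parameters per component and mixing weights πm), the following example evaluates AIC, AICc, BIC, and the mixture-specific MML of Figueiredo and Jain (2002) directly from the maximized log-likelihood.

```python
import numpy as np


def information_criteria(loglik, n, K, d_tilde, weights):
    """Deterministic criteria of Section 3.1 for a K-component mixture.
    loglik  : maximized log-likelihood log p(X|Theta_hat)
    n       : number of data points
    K       : number of components
    d_tilde : free parameters per component (e.g. c + c(c+1)/2 for a full-covariance Gaussian)
    weights : mixing probabilities pi_1..pi_K
    """
    d = K * d_tilde + (K - 1)                          # total free parameters
    aic = -2.0 * loglik + 2.0 * d
    aicc = -2.0 * loglik + 2.0 * d * n / (n - d - 1)
    bic = -2.0 * loglik + d * np.log(n)
    w = np.asarray(weights, dtype=float)
    mml = (d_tilde / 2.0) * np.sum(np.log(n * w / 12.0)) \
          + (K / 2.0) * np.log(n / 12.0) \
          + K * (d_tilde + 1) / 2.0 \
          - loglik                                     # Figueiredo-Jain MML with kappa_d ~ 1/12
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "MML": mml}
```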
4. Proposed Criterion

Based on Bayes' theorem, the posterior probability of a hypothetical model given observed data is

p(H|X) = \frac{p(X|H)\, p(H)}{p(X)},   (6)

where H is the model hypothesis and X is the dataset. We assume that the hypothesis space contains the number of components in the mixture model. Consider a mixture distribution with K components as formulated in (1). Here, the hypothesis space and the parameter space are

H = \{K\}; \qquad \Theta = \{\theta_1, \theta_2, \ldots, \theta_K, \pi_1, \pi_2, \ldots, \pi_K\} .

Assume that the prior likelihood on the parameter set of (6) is uniform; then comparing model posteriors is tantamount to comparing marginal likelihoods. The marginal likelihood can be written as

p(X|H) = \int \exp\Big( n \underbrace{\tfrac{1}{n} \sum_{i=1}^{n} \log p(x^{(i)}|\Theta)}_{q(X|\Theta)} \Big) \, d\Theta,   (7)

where q(X|Θ) is the average log-likelihood. Using the Laplace approximation of (7), we obtain

p(X|H) \approx \frac{(2\pi)^{d/2}}{n^{d/2}\, |J(\hat{\Theta})|^{1/2}} \exp\big( n\, q(X|\hat{\Theta}) \big),   (8)

as formulated in (Kass and Raftery, 1995), where n is the number of data points and d is the number of free parameters in the mixture model. Also, J(\hat{\Theta}) is the Fisher information matrix of the mixture model evaluated at the ML estimate \hat{\Theta} of the parameters. In general, J(\hat{\Theta}) cannot be calculated (Titterington et al., 1985). To overcome this difficulty, we approximate J(\hat{\Theta}) by the complete-data Fisher information matrix defined as J_c(\hat{\Theta}) \equiv -E[\partial^2 \log p(X, Z|\Theta)/\partial \Theta^2], where Z is the set of labels determining which datum goes to which component. J_c(\hat{\Theta}) is an upper bound of J(\hat{\Theta}) and has the block-diagonal structure

J_c(\hat{\Theta}) = \mathrm{diag}\big( \pi_1 J(\hat{\theta}_1), \pi_2 J(\hat{\theta}_2), \ldots, \pi_K J(\hat{\theta}_K), M \big),

where J(\hat{\theta}_m) is the Fisher information matrix of the m-th component, and M is the Fisher information matrix of a multinomial distribution (Titterington et al., 1985). The approximation of J(\hat{\Theta}) by J_c(\hat{\Theta}) becomes exact when the components are well-separated. Therefore, the determinant of the Fisher information matrix can be approximated by

|J(\hat{\Theta})| \approx |J_c(\hat{\Theta})| = M \prod_{m=1}^{K} \pi_m^{d_m} |J(\hat{\theta}_m)| .   (9)

It is known that the Fisher information matrix of a multinomial distribution has determinant M = (π1 π2 ... πK)^{-1}. Putting (9) in (8), taking the logarithm and multiplying by −2 reveals

BIC_I = \underbrace{-2 n\, q(X|\hat{\Theta}) + d \log n}_{BIC} - d \log 2\pi - \sum_{m=1}^{K} \log \pi_m + \sum_{m=1}^{K} d_m \log \pi_m + \sum_{m=1}^{K} \log |J(\hat{\theta}_m)| .   (10)

The first two terms of (10) are the same as the BIC criterion, which has been successfully used to determine the number of components in mixture models when n → ∞ (Fraley and Raftery, 2007). The additional terms can be considered as an improvement over plain BIC which makes it applicable to moderate values of n. Thus we name it BICI, where 'I' emphasizes that the criterion is an improved version of the original BIC.

The Fisher information matrix of mixture models can become singular (Watanabe, 2009). The singularity happens either when two components completely overlap or when the probability weight of one component becomes zero. Since we are using a procedure to avoid poor local maxima, these problems cannot occur. The weights of some components can also become very small when the log-likelihood is unbounded. Since we are using the MAP estimator for computing the parameters, this problem is avoided as well. The assumed prior is a weak prior used to resolve the unboundedness of the log-likelihood; therefore, the assumption that the posterior likelihood of (6) is proportional, with a constant, to the marginal likelihood also holds approximately for MAP estimates.

5. Experiments

The derivation of the proposed criterion is general for all mixture models. However, the simulations are carried out on Gaussian mixture models, which are commonly tested in the model selection literature. The generic form of a c-variate Gaussian component in a MoG distribution is

p(x|\theta_m) = \frac{1}{(2\pi)^{c/2} |C_m|^{1/2}} \exp\Big\{ -\frac{1}{2} (x - \mu_m)^T C_m^{-1} (x - \mu_m) \Big\},   (11)

where µm and Cm are the mean vector and the covariance matrix of the m-th component in the mixture model of (1), respectively. The dimension of θm differs with respect to the assumed form of the covariance matrix for each mixture component. For example, a diagonal covariance has c free elements while a free-form covariance has c(c+1)/2 free elements. For a Gaussian mixture model with free-form covariance, we can write

\theta_m = (\mu_m, C_m); \qquad \tilde{d} = c + \frac{c(c+1)}{2},   (12)

where c is the data dimension and d̃ is the number of free parameters for each component. Most of our experiments are designed based on the experiments in (Figueiredo and Jain, 2002).² In all of our simulated results, we run each experiment 100 times (unless stated otherwise) and report the performance of each criterion. For the explanation of our estimation procedure see Section 2.

² The related codes to this paper can be downloaded from http://visionlab.ut.ac.ir/resources.html
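As an illustration of how (10) could be evaluated in practice, the sketch below (our own outline, not the authors' released code) computes BICI from the maximized log-likelihood, the mixing weights, the per-component parameter counts d_m, and the per-component Fisher information log-determinants log|J(θ̂_m)|; the latter must be supplied by the caller for the chosen component family, either analytically or numerically.

```python
import numpy as np


def bic_i(loglik, n, d, weights, d_m, logdet_J_components):
    """BIC_I of Eq. (10).
    loglik              : log p(X|Theta_hat) = n * q(X|Theta_hat)
    n, d                : sample size and total number of free parameters
    weights             : mixing probabilities pi_1..pi_K
    d_m                 : free parameters of each component (scalar or length-K array)
    logdet_J_components : log|J(theta_hat_m)| for each component, supplied by the caller
    """
    w = np.asarray(weights, dtype=float)
    d_m = np.broadcast_to(np.asarray(d_m, dtype=float), w.shape)
    logdet = np.asarray(logdet_J_components, dtype=float)
    bic = -2.0 * loglik + d * np.log(n)          # first two terms of Eq. (10)
    correction = (-d * np.log(2.0 * np.pi)
                  - np.sum(np.log(w))            # from the multinomial block M
                  + np.sum(d_m * np.log(w))      # from the pi_m^{d_m} factors in Eq. (9)
                  + np.sum(logdet))              # per-component Fisher determinants
    return bic + correction
```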
5.1. First Simulated Experiment

900 samples are generated from a 2-dimensional, 3-component Gaussian mixture model with probability coefficients π1 = π2 = π3 = 1/3, mean vectors µ1 = [0 −2]^T, µ2 = [0 0]^T, µ3 = [0 2]^T, and covariance matrices C1 = C2 = C3 = diag{2, 0.2}. The scatter plot of data points generated by this mixture is illustrated in Figure 1. The results showing the number of times that each criterion selects a specific number of components are shown in Table 1. It is seen that BIC and BICI can always find the true number of components, while MML makes only one mistake in these 100 runs. This is expected because the number of data points is large and the components are well-separated. AIC and AICc show significantly lower performance, which was expected due to their tendency toward overestimating the number of components for mixture models (Figueiredo and Jain, 2002).

Fig. 1. Scatter plot of data points generated by the mixture model of the first experiment. The elliptic contours of mixture components and the covariance eigenvectors are also shown by dashed and solid lines, respectively.

Table 1. The performance of different criteria for the dataset generated from the distribution of the first experiment with 900 data points.

Model    AIC   AICc   BIC   MML   BICI
k=1        0      0     0     0      0
k=2        0      0     0     0      0
k=3       65     72   100    99    100
k=4       23     20     0     1      0
k=5       10      7     0     0      0
k=6        2      1     0     0      0

In order to see how different criteria behave when the number of data points is small, we did an experiment with 100 data points sampled from the same distribution. As can be seen in Table 2, the BICI criterion works much better than BIC and MML. It is a known result that AICc works well when the number of data points is small; in this experiment, AICc shows a result comparable to BICI.

Table 2. The performance of different criteria for the dataset generated from the distribution of the first experiment with 100 data points.

Model    AIC   AICc   BIC   MML   BICI
k=1        0      0    21     0      1
k=2        0      2    14     0      7
k=3       48     82    62     2     84
k=4       30     16     3     9     10
k=5       12      0     0    24      1
k=6       10      0     0    65      0

To investigate the effect of overlapping, we use a three-component mixture model like that of Figure 1 but with spherical covariance matrices for each component. We introduce δ as the separation between these three components, meaning that the three Gaussians are located at µ1 = [0 0]^T, µ2 = [δ δ]^T, and µ3 = [−δ −δ]^T. Then, we change δ from 0.5 to 2.7 with a 0.2 step size and obtain the percentage of times (in 50 runs) that the true model is selected by each criterion. The diagram depicting the performance of each criterion versus δ, which is a measure of the separation between components, is shown in Figure 2. It is clearly seen in Figure 2 that for small δ two components are not distinguishable by any criterion. For large δ (δ ≥ 1.7), the components are well-separated and BICI is among the best performing criteria. In the regime where the components gradually become separated (1.3 ≤ δ ≤ 1.7), BICI actually performs better than the other criteria, showing that the initial well-separateness assumption was not a restricting factor for the BICI criterion.

Fig. 2. Comparison of different model selection criteria for different separation of mixture components (correct selection ratio versus separation δ).

5.2. Second Simulated Experiment

In this experiment, we test the applicability of the proposed criterion when mixture components overlap each other and are of different shapes, that is, when the covariance matrices are not equal. The scatter plot of data points generated for this experiment is shown in Figure 3. 200 data points are generated from a MoG with four components with probability coefficients π1 = π2 = π3 = π4 = 0.25, mean vectors µ1 = µ2 = [−4 4]^T, µ3 = [2 2]^T, µ4 = [−1 −6]^T, and covariance matrices

C_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \quad C_2 = \begin{bmatrix} 6 & -2 \\ -2 & 6 \end{bmatrix}, \quad C_3 = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}, \quad C_4 = \begin{bmatrix} 0.125 & 0 \\ 0 & 0.125 \end{bmatrix}.
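For concreteness, a dataset of this kind could be generated as in the following NumPy sketch (an assumed re-implementation of the sampling step, not the authors' original scripts).

```python
import numpy as np


def sample_second_experiment(n=200, seed=0):
    """Draw n points from the 4-component MoG of the second experiment."""
    rng = np.random.default_rng(seed)
    weights = np.array([0.25, 0.25, 0.25, 0.25])
    means = np.array([[-4.0, 4.0], [-4.0, 4.0], [2.0, 2.0], [-1.0, -6.0]])
    covs = np.array([[[1.0, 0.5], [0.5, 1.0]],
                     [[6.0, -2.0], [-2.0, 6.0]],
                     [[2.0, -1.0], [-1.0, 2.0]],
                     [[0.125, 0.0], [0.0, 0.125]]])
    labels = rng.choice(len(weights), size=n, p=weights)   # component of each datum
    X = np.array([rng.multivariate_normal(means[z], covs[z]) for z in labels])
    return X, labels
```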
The number of times that each criterion selects one of the six models is shown in Table 3. This experiment verifies that our proposed criterion is also applicable to the case of overlapping components, surpassing many common model selection criteria.

Fig. 3. Scatter plot of data points generated by the mixture model of the second experiment.

Table 3. The performance of different criteria for the dataset generated from the distribution of the second experiment with 200 data points.

Model    AIC   AICc   BIC   MML   BICI
k=1        0      0     0     0      0
k=2        0      0     0     0      0
k=3        0      1    56     0      4
k=4       49     74    42    32     77
k=5       26     16     2    23     17
k=6       25      9     0    45      2

5.3. Third Simulated Experiment

To investigate the effect of data dimension on the performance of the BICI model selection criterion, we use a dataset similar to that of the first experiment but with spherical covariances in a 10-dimensional space. The mean vectors are located at µ1 = [0 ... 0]^T, µ2 = [δ ... δ]^T, and µ3 = [−δ ... −δ]^T, where δ = 1.5. We tested the performance of the reviewed model selection criteria when we sampled a large number of data points (n = 20000) and observed that BICI, MML and BIC all have perfect performance. We also evaluated the performance of the criteria for a smaller number of data points (n = 2500). For this amount of data, the number of times that each criterion selects a model with k components is shown in Table 4. The performance of BIC and MML is the same as BICI in this experiment.

Table 4. The performance of different criteria for the dataset generated from the 10-dimensional 3-component distribution of the third experiment with 2500 data points.

Model    AIC   AICc   BIC   MML   BICI
k=1        0      0     0     0      0
k=2        0      0    13     0      0
k=3        9     84    86    86     86
k=4        8     15     1    14     14
k=5       39      1     0     0      0
k=6       44      0     0     0      0

5.4. Experiment on a Real Dataset

For an experiment on real data, we consider the univariate acidity dataset studied in (Richardson and Green, 1997). It is known that the underlying physical process that generates this dataset should have 3 clusters. The dataset contains only n = 155 data points, and as seen in Figure 4 the clusters are not well-separated. We observed that our proposed criterion is able to successfully estimate the true number of components.

Fig. 4. Histogram of the acidity dataset is shown by gray bars. The solid black line shows the fitted MoG density and the gray lines show the three Gaussian components of the model. The BICI criterion correctly detects 3 clusters in the dataset.

6. Conclusion

We proposed a model selection criterion specific to mixture models, named BICI, which is based on the Laplace approximation of the marginal likelihood.
In the derivation, we assumed that the components in the mixture model are well-separated. Through different experiments, we investigated the performance of our BICI criterion and compared it against other well-known information criteria. For the case of large sample size and well-separated components, we observed that our proposed criterion works perfectly, as do BIC and MML. Even in this simple scenario, the performance of AIC and AICc is rather low. When the number of data points is small, BICI mostly outperforms the other approaches, even when the components of the dataset overlap. We also investigated the performance of BICI on a real dataset and observed that BICI can correctly detect the underlying number of clusters.

We tested our proposed criterion on MoG models. Our criterion, however, is not specific to MoG densities and can also be applied to non-Gaussian mixture models. Evaluating the performance of the proposed criterion for other mixture models remains for future investigation.

In this paper, we concentrated on the density estimation problem using mixture models. Our proposed criterion can be easily modified for clustering. In clustering approaches, each datum is assigned to only one cluster. Therefore, the approximation of the Fisher information matrix that we used in our derivation is more accurate in this case, and our criterion should also perform better. In future work, the performance of the proposed criterion can be compared against sophisticated criteria for clustering such as ICL.

In this paper, we used an offline method to select mixture models, meaning that the parameters of mixture models with different numbers of components are estimated first; then, the model selection criterion is used to choose the correct model. It would be valuable if the model selection process became interlaced with the process of parameter estimation. We believe that by integrating model selection with parameter estimation, common issues in mixture modeling like unboundedness of the log-likelihood and poor local maxima will be treated more efficiently.

References

Akaike, H., 1974. A new look at the statistical model identification. Automatic Control, IEEE Transactions on 19, 716–723.
Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22, 719–725.
Bischl, B., Mersmann, O., Trautmann, H., Weihs, C., 2012. Resampling methods for meta-model validation with recommendations for evolutionary computation. Evolutionary Computation 20, 249–275.
Bishop, C.M., 2007. Pattern Recognition and Machine Learning. Springer.
Chang, J., Fisher III, J.W., 2013. Parallel sampling of DP mixture models using sub-cluster splits, in: Advances in Neural Information Processing Systems, pp. 620–628.
Ciuperca, G., Ridolfi, A., Idier, J., 2003. Penalized maximum likelihood estimator for normal mixtures. Scandinavian Journal of Statistics 30, 45–59.
Conway, J.H., Sloane, N.J.A., 1998. Sphere Packings, Lattices and Groups. Springer.
Efron, B., Tibshirani, R., 1997. Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 92, 548–560.
Favaro, S., Teh, Y.W., et al., 2013. MCMC for normalized random measure mixture models. Statistical Science 28, 335–359.
Figueiredo, M.A., Jain, A.K., 2002. Unsupervised learning of finite mixture models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24, 381–396.
Fraley, C., Raftery, A.E., 2007. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification 24, 155–181.
Hathaway, R.J., 1985. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. The Annals of Statistics, 795–800.
Hurvich, C.M., Tsai, C.L., 1989. Regression and time series model selection in small samples. Biometrika 76, 297–307.
Jain, A.K., Duin, R.P.W., Mao, J., 2000. Statistical pattern recognition: A review. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22, 4–37.
Jordan, M.I., Jacobs, R.A., 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6, 181–214.
Kass, R.E., Raftery, A.E., 1995. Bayes factors. Journal of the American Statistical Association 90, 773–795.
Keener, R.W., 2010. Theoretical Statistics. Springer Texts in Statistics, Springer.
Keribiin, C., 1998. Estimation consistante de l'ordre de modèles de mélange. Comptes Rendus de l'Académie des Sciences - Series I - Mathematics 326, 243–248.
Maugis, C., Michel, B., 2011. A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics 15, 41–68.
McLachlan, G., Peel, D., 2004. Finite Mixture Models. John Wiley & Sons.
McLachlan, G.J., Basford, K.E., 1988. Mixture Models: Inference and Applications to Clustering. Statistics: Textbooks and Monographs, New York: Dekker.
Mengersen, K.L., Robert, C.P., 1993. Testing for mixtures: a Bayesian entropic approach. INSEE/Dpt de la recherche.
Murphy, K.P., 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neal, R.M., Hinton, G.E., 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. Springer, pp. 355–368.
Oliver, J.J., Baxter, R.A., Wallace, C.S., 1996. Unsupervised learning using MML, in: International Conference on Machine Learning, pp. 364–372.
Richardson, S., Green, P.J., 1997. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59, 731–792.
Rusakov, D., Geiger, D., 2002. Asymptotic model selection for naive Bayesian networks, in: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 438–455.
Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
Smyth, P., 2000. Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 10, 63–72.
Titterington, D.M., Smith, A.F., Makov, U.E., 1985. Statistical Analysis of Finite Mixture Distributions. Wiley, New York.
Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G.E., 2000. SMEM algorithm for mixture models. Neural Computation 12, 2109–2128.
Watanabe, S., 2009. Algebraic Geometry and Statistical Learning Theory. Volume 25, Cambridge University Press.
Xie, C., Chang, J., Liu, Y., 2013. Estimating the number of components in Gaussian mixture models adaptively. Journal of Information Computational Science 10, 14.
Zeng, H., Cheung, Y.M., 2014. Learning a mixture model for clustering with the completed likelihood minimum message length criterion. Pattern Recognition 47, 2011–2030.