Infinite Liouville mixture models with application to text and texture categorization

Nizar Bouguila
Concordia Institute for Information Systems Engineering, Faculty of Engineering and Computer Science, Concordia University, Montreal, QC, Canada H3G 2W1
E-mail address: [email protected]
Article info

Article history: Received 23 December 2010; Available online 12 October 2011; Communicated by J.A. Robinson.
Keywords: Liouville family of distributions; Infinite mixture models; Proportional data; Nonparametric Bayesian inference; MCMC; Gibbs sampling.

Abstract

This paper addresses the problem of proportional data modeling and clustering using mixture models, a problem of great interest and importance for many practical pattern recognition, image processing, data mining and computer vision applications. Finite mixture models are broadly applicable to clustering problems, but they involve the challenging problem of selecting the number of clusters, which requires a certain trade-off. The number of clusters must be sufficient to provide the discriminating capability between clusters required for a given application. Indeed, if too many clusters are employed, overfitting may occur, and if too few are used, underfitting results. Here we approach the problem of modeling and clustering proportional data using infinite mixtures, which have been shown to be an efficient alternative to finite mixtures by removing the concern regarding the selection of the optimal number of mixture components. In particular, we propose and discuss an infinite Liouville mixture model whose parameter values are fitted to the data through a principled Bayesian algorithm that we have developed and which allows uncertainty in the number of mixture components. Our experimental evaluation involves two challenging applications, namely text classification and texture discrimination, and suggests that the proposed approach can be an excellent choice for proportional data modeling. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

With the progress in data capture technology, very large databases composed of textual documents, images and videos are created and updated every day, and this trend is expected to grow in the future. Modeling and organizing the content of these databases is an important and challenging problem which can be approached using data clustering techniques. Data clustering methods decompose a collection of data into a given number of disjoint clusters which are optimal in terms of some predefined criterion functions. The main goal is to organize unlabeled feature vectors into clusters such that vectors within a cluster are more similar to each other than to vectors belonging to other clusters (Everitt, 1993). The clustering problem arises in many disciplines and the existing literature related to it is abundant. A typical, widely used approach is the consideration of finite mixture models, which have been used to resolve a variety of clustering problems (McLachlan and Peel, 2000). A common problem when using finite mixtures is the difficulty of determining the correct number of clusters. Indeed, when using mixtures we generally face the model selection dilemma, where simple models cause underfitting and hence large approximation errors, while complex models cause overfitting and hence large estimation errors (Allen and Greiner, 2000; Bouchard and Celeux, 2006). There have been extensive research efforts that strive to provide model selection capability to finite mixtures. A lot of deterministic approaches
have been proposed and generally involve a trade-off between simplicity and goodness of fit (see Bouguila and Ziou, 2006, for instance, for discussions and details). Parametric and nonparametric Bayesian approaches have also been proposed (Robert, 2007). Successful studies of Bayesian approaches have been completed in a variety of domains including computer vision, image processing, data mining and pattern recognition. Indeed, for many problems it is possible to use Bayesian inference for model estimation and selection by using available prior information about the mixture's parameters (Robert, 2007). Bayesian approaches are attractive for several reasons and automatically embody Occam's razor (Blumer et al., 1987). In particular, there has recently been a good deal of interest in using nonparametric Bayesian approaches for machine learning and pattern recognition problems. Rooted in the early works of Ferguson (1973) and Antoniak (1974), progress has been made in both theory (Ishwaran, 1998; Gopalan and Berry, 1998; Kim, 1999; Neal, 2000) and application (Rasmussen, 2000; Kivinen et al., 2007; Bouguila and Ziou, 2010). This renewed interest is justified by the fact that nonparametric Bayesian approaches allow the number of mixture components to grow to infinity, which removes the problem of selecting the number of clusters and lets this number increase or decrease as new data arrive (Ghosh and Ramamoorthi, 2003). In approaches to machine learning and pattern recognition that use mixture models, success also depends on the ability to efficiently select the most accurate probability density functions (pdfs) to represent the mixture components. Exhaustive evaluation of all
possible pdfs is infeasible, thus it is crucial to take into account the nature of the data when a given pdf is selected. Unlike the majority of research works, which have focused on Gaussian continuous data, we focus in this paper on proportional data clustering, which naturally appears in many applications from different domains (Aitchison, 1986; Bouguila et al., 2004; Bouguila and Ziou, 2006; Bouguila and Ziou, 2007). Proportional data are subject to two restrictions, namely non-negativity and a unit-sum constraint. The most relevant application where proportional data are naturally generated is perhaps text classification, where text documents are represented as normalized histograms of keywords. Another application which has motivated our work is image categorization, where modern approaches to this problem are based on the so-called bag of features, a technique inspired by text analysis, extracted from local image descriptors and presented as vectors of proportions (Bouguila and Ziou, 2010). We therefore propose in this paper an autonomous unsupervised nonparametric Bayesian clustering method for proportional data that performs clustering without a priori information about the number of clusters. As mentioned above, the choice of an appropriate family of distributions is one of the most challenging problems in statistical analysis in general and in mixture modeling in particular (Cox, 1990). Indeed, the success of mixture-based learning techniques lies in the accurate choice of the probability density functions used to describe the components. Many statistical learning analyses begin with the assumption that the data clusters are generated from the Gaussian distribution, which is usually only an approximation used mainly for convenience. Although a Gaussian may provide a reasonable approximation to many distributions, it is certainly not the best approximation in many real-world problems, in particular those involving proportional data, as we have shown in our previous works (Bouguila and Ziou, 2010; Bouguila and Ziou, 2008), where we investigated the use of nonparametric Bayesian learning for Dirichlet (Bouguila and Ziou, 2008) and generalized Dirichlet (Bouguila and Ziou, 2010) mixture models. Both models have their own advantages but are not exempt from drawbacks. The Dirichlet involves a small number of parameters (a D-variate Dirichlet is defined by D + 1 parameters), but has a very restrictive negative covariance structure which makes it inapplicable in several real problems (Bouguila and Ziou, 2006). On the other hand, the generalized Dirichlet has a more general covariance matrix, whose entries can be positive or negative, but it involves a clearly larger number of parameters (a D-variate generalized Dirichlet is defined by 2D parameters) (Bouguila and Ziou, 2006). The present paper proposes another alternative, called the Beta-Liouville distribution, which we extract from the Liouville family of distributions. Like the generalized Dirichlet and in contrast to the Dirichlet, the Beta-Liouville has a general covariance structure which can be positive or negative, but it involves a smaller number of parameters (a D-variate Beta-Liouville is defined by D + 2 parameters). It is noteworthy that Liouville distributions have been used in the past only as priors to the multinomial (Wong, 2009); their potential as an effective parent distribution to model the data directly in its own right has long been neglected.
In this paper we adopt nonparametric Bayesian learning to fit infinite Beta-Liouville mixture models to proportional data, which, to the best of our knowledge, has never been considered in the past. In particular, we establish that improved clustering and modeling performance can result from this approach as compared to the infinite Dirichlet and infinite generalized Dirichlet approaches.

This paper is organized as follows. Preliminaries and details about the Liouville mixture model are given in Section 2. The principles of our infinite mixture model and its prior-posterior analysis are presented in Section 3, where a complete learning algorithm is also developed. Performance results, which involve two interesting applications, namely text categorization and texture discrimination, are presented in Section 4 and show the merits of the proposed model. Section 5 contains the summary, conclusions and potential future work.

2. Finite Beta-Liouville mixture model

2.1. The Beta-Liouville distribution

In dimension D, the Liouville distribution, with positive parameters (α_1, ..., α_D) and generating density f(·) with parameters ξ, is defined by (Fang et al., 1990)

p(\vec{X} \mid \alpha_1, \ldots, \alpha_D, \xi) = \Gamma\Big(\sum_{d=1}^{D} \alpha_d\Big) \, \frac{f(u \mid \xi)}{u^{\sum_{d=1}^{D} \alpha_d - 1}} \prod_{d=1}^{D} \frac{X_d^{\alpha_d - 1}}{\Gamma(\alpha_d)}    (1)

where \vec{X} = (X_1, \ldots, X_D), u = \sum_{d=1}^{D} X_d < 1, and X_d > 0 for d = 1, \ldots, D. The mean, the variance and the covariance of a Liouville distribution are given by (Fang et al., 1990)

E(X_d) = E(u) \, \frac{\alpha_d}{\sum_{d=1}^{D} \alpha_d}    (2)

\mathrm{Var}(X_d) = E(u^2) \, \frac{\alpha_d(\alpha_d + 1)}{\big(\sum_{d=1}^{D} \alpha_d\big)\big(\sum_{d=1}^{D} \alpha_d + 1\big)} - E(u)^2 \, \frac{\alpha_d^2}{\big(\sum_{d=1}^{D} \alpha_d\big)^2}    (3)

\mathrm{Cov}(X_l, X_k) = \frac{\alpha_l \alpha_k}{\sum_{d=1}^{D} \alpha_d} \left( \frac{E(u^2)}{\sum_{d=1}^{D} \alpha_d + 1} - \frac{E(u)^2}{\sum_{d=1}^{D} \alpha_d} \right)    (4)
where E(u) and E(u²) are the first and second moments of the generating density f(·), respectively. Note that, in contrast with the Dirichlet and like the generalized Dirichlet, the covariance can be positive or negative. Other interesting properties of the Liouville distribution can be found in (Sivazlian, 1981; Gupta and Richards, 1987). A convenient choice of generating density for u in Eq. (1) is the Beta distribution, whose shapes are flexible enough to allow the approximation of almost any arbitrary distribution on the unit interval (Bouguila et al., 2006), with positive parameters a and b, Beta(a, b) (i.e. ξ = (a, b)):

f(u \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, u^{a-1} (1-u)^{b-1}    (5)

and then

E(u) = \frac{a}{a+b}    (6)

E(u^2) = \frac{a(a+1)}{(a+b)(a+b+1)}    (7)

\mathrm{Var}(u) = \frac{ab}{(a+b)^2 (a+b+1)}    (8)
Substituting Eq. (5) into Eq. (1) gives

p(\vec{X} \mid \theta) = \Gamma\Big(\sum_{d=1}^{D} \alpha_d\Big) \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \prod_{d=1}^{D} \frac{X_d^{\alpha_d - 1}}{\Gamma(\alpha_d)} \, u^{a - \sum_{d=1}^{D} \alpha_d} (1-u)^{b-1}    (9)

where θ = (α_1, ..., α_D, a, b). This distribution is commonly called the Beta-Liouville distribution (Fang et al., 1990; Aitchison, 1986). Substituting Eqs. (6) and (7) into Eqs. (2)-(4) gives the mean, the variance and the covariance of the Beta-Liouville distribution. Note that when the density generator is a Beta distribution with parameters \sum_{d=1}^{D} \alpha_d and α_{D+1}, Eq. (1) reduces to the Dirichlet distribution with parameters α_1, ..., α_{D+1}. Thus, the Beta-Liouville distribution includes the Dirichlet as a special case.
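To make the density concrete, here is a minimal Python sketch (written for this text, not taken from the paper) that evaluates the logarithm of the Beta-Liouville density in Eq. (9); the function name and the use of NumPy/SciPy are our own choices.

import numpy as np
from scipy.special import gammaln

def beta_liouville_logpdf(x, alpha, a, b):
    """Log-density of the Beta-Liouville distribution of Eq. (9).

    x     : 1-D array of proportions with sum(x) < 1 and x_d > 0
    alpha : positive Liouville parameters (alpha_1, ..., alpha_D)
    a, b  : positive parameters of the Beta generating density
    """
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    u = x.sum()                      # u = sum_d X_d, must be < 1
    s_alpha = alpha.sum()
    log_norm = (gammaln(s_alpha) + gammaln(a + b)
                - gammaln(a) - gammaln(b) - gammaln(alpha).sum())
    return (log_norm
            + np.sum((alpha - 1.0) * np.log(x))
            + (a - s_alpha) * np.log(u)
            + (b - 1.0) * np.log(1.0 - u))

# Example: a 3-dimensional proportion vector
print(beta_liouville_logpdf([0.2, 0.3, 0.1], alpha=[2.0, 3.0, 1.5], a=4.0, b=2.0))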
2.2. The finite mixture model

Let 𝒳 = (\vec{X}_1, ..., \vec{X}_N) be a set of unlabeled feature vectors representing N images or documents, for instance. To capture the multimodality of the data (i.e. the data set generally contains several clusters), mixture models are generally adopted (McLachlan and Peel, 2000):

p(\vec{X} \mid \Theta) = \sum_{j=1}^{M} p_j \, p(\vec{X} \mid \theta_j)    (10)

where each cluster j has a weight p_j (positive and summing to one), M is the number of components in the mixture model, p(\vec{X} \mid \theta_j) is the Beta-Liouville density associated with cluster j, and Θ = ({θ_j}, P), with P = (p_1, ..., p_M), is the set of all the mixture parameters. The main goal here is to discover the cluster structure hidden in our unlabeled D-dimensional vectors. It is noteworthy that our mixture can be viewed as a generative model where each vector belonging to a certain component j is generated by first choosing a Beta-Liouville component with probability p_j and then drawing the vector from p(\vec{X} \mid \theta_j). An important problem here is the estimation of the mixture parameters. Estimation methods have generally been based on the standard maximum likelihood formulation using the expectation-maximization (EM) optimization technique (McLachlan and Krishnan, 1997). The EM algorithm is basically an iterative method for finding the maximum likelihood parameter estimates of the mixture model. The main idea is to work with the notions of complete and missing data (Little and Rubin, 1983). The missing data in our case are the membership variables Z = (Z_1, ..., Z_N), where Z_i = j if \vec{X}_i is generated from mixture component j. The different elements Z_i are assumed to be drawn independently from the distributions

p(Z_i = j) = p_j, \quad j = 1, \ldots, M    (11)

It follows from Bayes' theorem that p(Z_i = j \mid \vec{X}_i), the probability that vector i belongs to cluster j conditional on having observed \vec{X}_i, is given by

p(Z_i = j \mid \vec{X}_i) \propto p_j \, p(\vec{X}_i \mid \theta_j)    (12)
The iterations in EM alternate between estimating the expectation of the complete-data log-likelihood (E-step) and determining the new parameter estimates by maximizing this expected log-likelihood (M-step) (McLachlan and Krishnan, 1997). As a "hill climbing" algorithm, the final estimates are in general local maxima and depend highly on the initialization. Although the EM algorithm is widely used, several works have shown that the Bayesian approach has important advantages, such as the indirect incorporation of prior knowledge, which allows for prior uncertainty in the parameters, and its insensitivity to initial values, which avoids the laborious work required by EM to find appropriate initial parameter values (see, for instance, McLachlan and Krishnan, 1997; McLachlan and Peel, 2000, among many others). Thus, we propose in the following a Bayesian approach for the estimation of Beta-Liouville mixture models.
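As an illustration of Eq. (12), the following sketch computes the posterior membership probabilities (responsibilities) for a finite Beta-Liouville mixture. It reuses the hypothetical beta_liouville_logpdf helper sketched in Section 2.1 and is our own illustration, not part of the original EM or Bayesian algorithms.

import numpy as np

def responsibilities(X, weights, thetas):
    """Posterior membership probabilities of Eq. (12) for a finite mixture.

    X       : (N, D) array of proportion vectors
    weights : (M,) mixing proportions p_j
    thetas  : list of M tuples (alpha, a, b) for the Beta-Liouville components
    """
    N, M = X.shape[0], len(weights)
    log_resp = np.empty((N, M))
    for j, (alpha, a, b) in enumerate(thetas):
        for i in range(N):
            log_resp[i, j] = np.log(weights[j]) + beta_liouville_logpdf(X[i], alpha, a, b)
    # normalize in log-space for numerical stability
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)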
3. The Bayesian mixture model

Bayesian algorithms represent a class of learning techniques that have been intensively studied in the past. The importance of Bayesian approaches has been increasingly acknowledged in recent years, and there is now considerable evidence that Bayesian algorithms are useful in several applications and domains (Robert, 2007). This can be justified by advances in Markov chain Monte Carlo (MCMC) methods, which have made the application of Bayesian approaches feasible and relatively straightforward (Robert, 2007). In this section we present our model, specified hierarchically within the Bayesian paradigm, and we discuss the construction of its priors. We then develop the complete posteriors from which the model's parameters are simulated.
3.1. Priors and posteriors

We start by rewriting the Beta-Liouville distribution in Eq. (9) as follows:

p(\vec{X} \mid \zeta) = \frac{\Gamma\big(\sum_{d=1}^{D} m_d t\big)\,\Gamma(s)}{\Gamma(ms)\,\Gamma((1-m)s)} \prod_{d=1}^{D} \frac{X_d^{m_d t - 1}}{\Gamma(m_d t)} \; u^{ms - \sum_{d=1}^{D} m_d t} \, (1-u)^{(1-m)s - 1}    (13)

where ζ = ({m_d}, t, s, m), t = \sum_{d=1}^{D} \alpha_d, m_d = α_d/t, s = a + b, and m = a/s. It is noteworthy that this reparametrization provides interpretable parameters: m_d and m can be viewed as mean parameters, and t and s as shape parameters.
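The change of variables behind Eq. (13) is straightforward to implement; the following small sketch (our own illustration, with hypothetical function names) converts between (α_1, ..., α_D, a, b) and ({m_d}, t, s, m).

import numpy as np

def to_mean_shape(alpha, a, b):
    """(alpha_1..alpha_D, a, b) -> ({m_d}, t, s, m) as defined after Eq. (13)."""
    alpha = np.asarray(alpha, dtype=float)
    t = alpha.sum()          # t = sum_d alpha_d
    m_vec = alpha / t        # m_d = alpha_d / t
    s = a + b                # s = a + b
    m = a / s                # m = a / s
    return m_vec, t, s, m

def to_natural(m_vec, t, s, m):
    """Inverse mapping: ({m_d}, t, s, m) -> (alpha_1..alpha_D, a, b)."""
    alpha = np.asarray(m_vec, dtype=float) * t
    a = m * s
    b = (1.0 - m) * s
    return alpha, a, b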
Substituting this reparametrization into Eq. (10), the mixture parameters become Θ = (ζ_1, ..., ζ_M, P), where ζ_j = (\vec{m}_j, t_j, s_j, m_j) and \vec{m}_j = (m_{j1}, ..., m_{jD}), j = 1, ..., M. The Bayesian model adds a prior specification, reflecting certain assumptions about the set of parameters Θ and quantified as a distribution p(Θ), to the likelihood of the data p(𝒳 | Θ). Inference about Θ after observing the data is then based on the posterior distribution obtained by combining the likelihood and the prior multiplicatively according to Bayes' theorem (Robert, 2007):

p(\Theta \mid \mathcal{X}) \propto p(\mathcal{X} \mid \Theta)\, p(\Theta)    (14)

The first important step is thus to assess the prior distribution p(Θ). A common assumption when using Bayesian approaches to learn mixture models is to suppose that the mixture parameters are independent realizations from appropriately selected distributions. An appropriate prior distribution for m_jd and m_j is the Beta, since both are defined on the compact support [0, 1]:

m_{jd} \sim \mathrm{Beta}(a_1 r_1, a_1(1-r_1)), \qquad m_j \sim \mathrm{Beta}(ar, a(1-r))    (15)

where a_1 > 0, 0 < r_1 < 1, a > 0, and 0 < r < 1 are prior parameters (i.e. hyperparameters) chosen common to all components. Moreover, we assign independent inverted Gamma priors to t_j and s_j, since both are positive and control the final shape of the distribution:

t_j \sim \mathrm{IG}(b_1, c_1), \qquad s_j \sim \mathrm{IG}(b, c)    (16)

where b_1 > 0 and b > 0 are the shapes of the inverted Gamma distributions and c_1 > 0 and c > 0 are the scales, also chosen common to all mixture components. Having these priors in hand, the conditional posteriors of m_jd, t_j, m_j and s_j can now be determined as follows:
p(m_{jd} \mid \ldots) \propto m_{jd}^{a_1 r_1 - 1}(1 - m_{jd})^{a_1(1-r_1)-1} \left[\frac{\Gamma\big(\sum_{d=1}^{D} m_{jd} t_j\big)}{\prod_{d=1}^{D}\Gamma(m_{jd} t_j)}\right]^{n_j} \prod_{Z_i=j}\left[\Big(\sum_{d=1}^{D}X_{id}\Big)^{-\sum_{d=1}^{D} m_{jd}t_j}\prod_{d=1}^{D}X_{id}^{m_{jd}t_j-1}\right]    (17)

p(t_j \mid \ldots) \propto \frac{c_1^{b_1}\exp(-c_1/t_j)}{\Gamma(b_1)\,t_j^{b_1+1}} \left[\frac{\Gamma\big(\sum_{d=1}^{D} m_{jd} t_j\big)}{\prod_{d=1}^{D}\Gamma(m_{jd} t_j)}\right]^{n_j} \prod_{Z_i=j}\left[\Big(\sum_{d=1}^{D}X_{id}\Big)^{-\sum_{d=1}^{D} m_{jd}t_j}\prod_{d=1}^{D}X_{id}^{m_{jd}t_j-1}\right]    (18)

p(m_j \mid \ldots) \propto m_j^{ar-1}(1-m_j)^{a(1-r)-1} \left[\frac{\Gamma(s_j)}{\Gamma(m_j s_j)\Gamma((1-m_j)s_j)}\right]^{n_j} \prod_{Z_i=j}\left[\Big(\sum_{d=1}^{D}X_{id}\Big)^{m_j s_j}\Big(1-\sum_{d=1}^{D}X_{id}\Big)^{(1-m_j)s_j-1}\right]    (19)

where n_j denotes the number of vectors currently assigned to component j.
p(s_j \mid \ldots) \propto \frac{c^{b}\exp(-c/s_j)}{\Gamma(b)\,s_j^{b+1}} \left[\frac{\Gamma(s_j)}{\Gamma(m_j s_j)\Gamma((1-m_j)s_j)}\right]^{n_j} \prod_{Z_i=j}\left[\Big(\sum_{d=1}^{D}X_{id}\Big)^{m_j s_j}\Big(1-\sum_{d=1}^{D}X_{id}\Big)^{(1-m_j)s_j-1}\right]    (20)

In the spirit of Bayesian statistics, prior distributions can also be specified on the hyperparameters a_1, r_1, a, r, b_1, b, c_1, and c, which allows more flexibility by adding another layer to our Bayesian hierarchy (see Fig. 1). Thus, we select the following priors for our hyperparameters:

r_1 \sim \mathrm{Beta}(e_1 f_1, e_1(1-f_1)), \qquad r \sim \mathrm{Beta}(ef, e(1-f))    (21)

a_1 \sim \mathrm{IG}(g_1, h_1), \quad a \sim \mathrm{IG}(g, h), \quad b_1 \sim \mathrm{IG}(p_1, q_1), \quad b \sim \mathrm{IG}(p, q)    (22)

c_1 \sim \mathrm{Exp}(w_1), \qquad c \sim \mathrm{Exp}(w)    (23)

where Exp denotes the exponential distribution. Having these priors, the conditional posteriors of the hyperparameters are given by:

p(r_1 \mid \ldots) \propto \frac{r_1^{e_1 f_1 - 1}(1-r_1)^{e_1(1-f_1)-1}\,\Gamma(a_1)^{MD}}{\left[\Gamma(a_1 r_1)\Gamma(a_1(1-r_1))\right]^{MD}} \prod_{j=1}^{M}\prod_{d=1}^{D} m_{jd}^{a_1 r_1 - 1}(1-m_{jd})^{a_1(1-r_1)-1}    (24)

p(r \mid \ldots) \propto \frac{r^{ef-1}(1-r)^{e(1-f)-1}\,\Gamma(a)^{M}}{\left[\Gamma(ar)\Gamma(a(1-r))\right]^{M}} \prod_{j=1}^{M} m_j^{ar-1}(1-m_j)^{a(1-r)-1}    (25)

p(a_1 \mid \ldots) \propto \frac{h_1^{g_1}\exp(-h_1/a_1)\,\Gamma(a_1)^{MD}}{\Gamma(g_1)\,a_1^{g_1+1}\left[\Gamma(a_1 r_1)\Gamma(a_1(1-r_1))\right]^{MD}} \prod_{j=1}^{M}\prod_{d=1}^{D} m_{jd}^{a_1 r_1 - 1}(1-m_{jd})^{a_1(1-r_1)-1}    (26)

p(a \mid \ldots) \propto \frac{h^{g}\exp(-h/a)\,\Gamma(a)^{M}}{\Gamma(g)\,a^{g+1}\left[\Gamma(ar)\Gamma(a(1-r))\right]^{M}} \prod_{j=1}^{M} m_j^{ar-1}(1-m_j)^{a(1-r)-1}    (27)

p(b_1 \mid \ldots) \propto \frac{q_1^{p_1}\exp(-q_1/b_1)\,c_1^{M b_1}}{\Gamma(p_1)\,b_1^{p_1+1}\,\Gamma(b_1)^{M}} \prod_{j=1}^{M}\frac{\exp(-c_1/t_j)}{t_j^{b_1+1}}    (28)

p(b \mid \ldots) \propto \frac{q^{p}\exp(-q/b)\,c^{M b}}{\Gamma(p)\,b^{p+1}\,\Gamma(b)^{M}} \prod_{j=1}^{M}\frac{\exp(-c/s_j)}{s_j^{b+1}}    (29)

p(c_1 \mid \ldots) \propto \frac{w_1\exp(-w_1 c_1)\,c_1^{M b_1}}{\Gamma(b_1)^{M}} \prod_{j=1}^{M}\frac{\exp(-c_1/t_j)}{t_j^{b_1+1}}    (30)

p(c \mid \ldots) \propto \frac{w\exp(-w c)\,c^{M b}}{\Gamma(b)^{M}} \prod_{j=1}^{M}\frac{\exp(-c/s_j)}{s_j^{b+1}}    (31)

Fig. 1. Graphical model representation of the infinite Beta-Liouville mixture. Nodes in this graph represent random variables, rounded boxes are fixed hyperparameters, boxes indicate repetition (with the number of repetitions in the lower right) and arcs describe conditional dependencies between variables.

3.2. The infinite Beta-Liouville mixture model

A very important but difficult question for a mixture model is how many components there are in the mixture. This question is not easy to answer (see, for instance, McLachlan and Peel, 2000; Bouguila and Ziou, 2007). In recent years there has been much interest in infinite mixture models to resolve this inherent problem, and our approach here capitalizes on this trend. Infinite mixture models are based on the specification of an appealing prior distribution for the mixing proportions p_j, namely the symmetric Dirichlet distribution with positive parameter η/M:

p(P \mid \eta) = \frac{\Gamma(\eta)}{\prod_{j=1}^{M}\Gamma(\eta/M)} \prod_{j=1}^{M} p_j^{\eta/M - 1}    (32)

By noticing that

p(Z \mid P) = \prod_{j=1}^{M} p_j^{n_j}    (33)

which can be easily determined according to Eq. (11), we can show that

p(Z \mid \eta) = \int p(Z \mid P)\, p(P \mid \eta)\, dP = \frac{\Gamma(\eta)}{\Gamma(\eta+N)} \prod_{j=1}^{M}\frac{\Gamma(\eta/M + n_j)}{\Gamma(\eta/M)}    (34)

which can be considered as a prior on Z, from which we can show that (Neal, 2000; Rasmussen, 2000)

p(Z_i = j \mid \eta, Z_{-i}) = \frac{n_{-i,j} + \eta/M}{N - 1 + \eta}    (35)

where Z_{-i} = (Z_1, ..., Z_{i-1}, Z_{i+1}, ..., Z_N) and n_{-i,j} is the number of vectors, excluding \vec{X}_i, in cluster j. Letting M → ∞ in Eq. (35), the conditional prior reaches the following limits (Neal, 2000; Rasmussen, 2000):

p(Z_i = j \mid \eta, Z_{-i}) = \begin{cases} \dfrac{n_{-i,j}}{N-1+\eta} & \text{if } n_{-i,j} > 0 \ (\text{cluster } j \in R) \\ \dfrac{\eta}{N-1+\eta} & \text{if } n_{-i,j} = 0 \ (\text{cluster } j \in U) \end{cases}    (36)

where R and U are the sets of represented (i.e. nonempty) and unrepresented (i.e. empty) clusters, respectively. We can notice that p(Z_i = j | η, Z_{-i}) = p(Z_i ≠ Z_{i'} for all i' ≠ i | η, Z_{-i}) when n_{-i,j} = 0. It is noteworthy that the adjective infinite is used here because the number of clusters can increase as new data arrive, but it is actually bounded by the number of vectors N (i.e. of course we cannot have more than N nonempty clusters). Having the conditional priors in Eq. (36), the conditional posteriors are obtained by combining these priors with the likelihood of the data (Neal, 2000):

p(Z_i = j \mid \ldots) \propto \begin{cases} \dfrac{n_{-i,j}}{N-1+\eta}\, p(\vec{X}_i \mid \zeta_j) & \text{if } j \in R \\ \dfrac{\eta}{N-1+\eta}\displaystyle\int p(\vec{X}_i \mid \zeta_j)\, p(\zeta_j \mid a, r, a_1, r_1, b, c, b_1, c_1)\, d\zeta_j & \text{if } j \in U \end{cases}    (37)

where

p(\zeta_j \mid a, r, a_1, r_1, b, c, b_1, c_1) = p(m_j \mid a, r)\, p(t_j \mid b_1, c_1)\, p(s_j \mid b, c) \prod_{d=1}^{D} p(m_{jd} \mid a_1, r_1)    (38)

Thus, we can see that, besides their explicit Bayesian nature, infinite models have another implicit attractive property, namely the natural handling of online learning, which is crucial in several applications involving proportional data such as text classification and filtering (Yu and Lam, 1998; Tauritz et al., 2000; Chai et al., 2002). This is of particular interest since an important feature of any learning model is the ability to make structural changes to itself as new data are observed or new information is introduced, with the
intent of improving generalization capability. The model is expected to be gradually refined as new data are introduced. Further theoretical properties of infinite models have been extensively analyzed in (Rasmussen, 2000; Neal, 2000; Teh et al., 2006), for instance. In the following, we propose a complete Bayesian MCMC sampling algorithm to learn our infinite Beta-Liouville mixture model. For a detailed treatment of MCMC techniques, we refer the reader to Robert and Casella (1999) and the references therein.

3.3. Complete learning algorithm

Having all the full conditional posterior distributions in hand, we propose in this section an MCMC algorithm for posterior computation. In particular, the basic computational tool that we shall use is the Gibbs sampler, which was introduced in (Geman and Geman, 1984) in the context of statistical image restoration. Gibbs sampling is an iterative procedure which alternately samples the unknown quantities from their conditional distributions given the observed data and the current values of all the other unknown quantities. This technique has now become standard and has been studied, discussed and explained in so many contexts that we will not go into detail (see, for instance, Liu et al., 1994). Our complete algorithm can be summarized as follows:

1. Generate Z_i from Eq. (37), i = 1, ..., N, and then update n_j, j = 1, ..., M.
2. Update the number of represented components M.
3. Update the mixing proportions of the represented components, p_j = n_j/(N + η), j = 1, ..., M, and the mixing proportion of the unrepresented components, p_U = η/(η + N).
4. Generate m_jd, t_j, m_j and s_j from Eqs. (17)-(20), j = 1, ..., M, d = 1, ..., D.
5. Update the hyperparameters: generate r_1, r, a_1, a, b_1, b, c_1 and c from Eqs. (24)-(31), respectively.

The step updating the number of represented components is based on the previous step, namely the generation of the Z_i. Indeed, M is increased by one when a given sample is assigned to an unrepresented component. On the other hand, if a component becomes empty during the iterations, then M is decreased by one. For the initialization, we used the well-known method of moments (see, for instance, Bouguila et al., 2004), which is based on the first- and second-order moments of the Beta-Liouville distribution. It is worth mentioning that the conditional posteriors in our infinite model do not follow known simple forms. The hyperparameter posteriors given by Eqs. (24)-(31) are all log-concave, so their sampling is based on adaptive rejection sampling (ARS) (Gilks and Wild, 1993). For the sampling of the indicators Z_i we use the approach proposed in (Neal, 2000), which consists of approximating the integral in Eq. (37) by a Monte Carlo estimate obtained by sampling from the priors of m_jd, m_j, t_j and s_j. For the posteriors of m_jd, t_j, m_j and s_j, the sampling cannot be done directly, thus we use the Metropolis-Hastings algorithm driven by a Gaussian random walk (Robert, 2007).
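To make the sweep described above concrete, here is a minimal Python sketch of the indicator-sampling step of Eq. (37), together with prior draws of the component parameters following Eqs. (15), (16) and (38). It is an illustration written for this text, not the authors' implementation: the function names, the reuse of the hypothetical beta_liouville_logpdf and to_natural helpers sketched earlier, and the fixed number of Monte Carlo prior draws are our own assumptions, and the Metropolis-Hastings and ARS updates of Eqs. (17)-(20) and (24)-(31) are only indicated in the closing comment.

import numpy as np
from scipy.stats import invgamma

def sample_zeta_prior(hypers, D, rng):
    """Draw component parameters zeta = (m_vec, t, s, m) from the priors, Eq. (38)."""
    a1, r1, a, r, b1, c1, b, c = hypers
    m_vec = rng.beta(a1 * r1, a1 * (1 - r1), size=D)   # Eq. (15)
    m = rng.beta(a * r, a * (1 - r))
    t = invgamma.rvs(b1, scale=c1, random_state=rng)   # Eq. (16)
    s = invgamma.rvs(b, scale=c, random_state=rng)
    return m_vec, t, s, m

def zeta_logpdf(x, zeta):
    """Beta-Liouville log-density under the (m, t, s, m) parameterization, Eq. (13);
    relies on the beta_liouville_logpdf and to_natural sketches given earlier."""
    m_vec, t, s, m = zeta
    alpha, a, b = to_natural(m_vec, t, s, m)
    return beta_liouville_logpdf(x, alpha, a, b)

def sample_indicator(i, X, Z, params, hypers, eta, rng, n_mc=10):
    """Sample the indicator Z_i according to Eq. (37)."""
    N, D = X.shape
    labels = sorted(params)
    logp = []
    for j in labels:                                     # represented clusters, j in R
        n_mij = np.sum(Z == j) - (1 if Z[i] == j else 0) # n_{-i,j}
        if n_mij > 0:
            logp.append(np.log(n_mij) - np.log(N - 1 + eta) + zeta_logpdf(X[i], params[j]))
        else:
            logp.append(-np.inf)                         # empty clusters are pruned elsewhere
    # unrepresented clusters, j in U: Monte Carlo estimate of the integral in Eq. (37)
    draws = [sample_zeta_prior(hypers, D, rng) for _ in range(n_mc)]
    marg = np.mean([np.exp(zeta_logpdf(X[i], z)) for z in draws])
    logp.append(np.log(eta) - np.log(N - 1 + eta) + np.log(marg + 1e-300))
    p = np.exp(np.array(logp) - np.max(logp))
    k = rng.choice(len(p), p=p / p.sum())
    if k == len(labels):                                 # open a new component (M grows by one)
        new_j = max(labels, default=-1) + 1
        params[new_j] = draws[0]                         # initialise it from one of the prior draws
        return new_j
    return labels[k]

# One Gibbs sweep (steps 1-5 above, abbreviated): for each i set
# Z[i] = sample_indicator(i, X, Z, params, hypers, eta, rng); drop empty clusters;
# set p_j = n_j / (N + eta) and p_U = eta / (eta + N); update each zeta_j with a
# Gaussian random-walk Metropolis-Hastings step targeting Eqs. (17)-(20); and update
# the hyperparameters from Eqs. (24)-(31), e.g. via adaptive rejection sampling.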
4. Experimental results

4.1. Design of experiments

The main goal of our experiments is to compare the performance of the infinite Beta-Liouville mixture (or IBLM, as we shall henceforth refer to it) to two previously proposed models for proportional data, namely the infinite Dirichlet (IDM) and infinite generalized Dirichlet (IGDM) mixtures. We refer the reader to Bouguila and Ziou (2008) and Bouguila and Ziou (2010) for details about the IDM and the IGDM, respectively. In this section, we are chiefly concerned with applications which involve proportional data [1], namely text categorization and texture discrimination using the notion of textons. In these applications our specific choice for the hyperparameters is (e_1, f_1, e, f) = (2, 0.5, 2, 0.5); we set g_1, h_1, g, h, p_1, q_1, p, q and η to 1, and w_1 and w to 1/8. Note that these specific choices were based on our beliefs for our applications and were found convenient in our experiments [2]. For each of the applications that we shall present, we ran ten chains with varying initial parameter values and used 10,000 Gibbs iterates. Convergence of the Gibbs sampler was assessed using the diagnostic procedures widely discussed and recommended in (Cowles and Carlin, 1996). For each model, we discard the first 1000 iterations as burn-in and compute our posteriors from the remaining 9000 iterations.

4.2. Text categorization

The exponential growth of the internet and of intranets has led to the generation of a huge and ever-increasing amount of text documents, and hence to a great deal of interest in developing efficient techniques to assist in categorizing this textual content. The problem is challenging, and different statistical approaches have been proposed and used in the past (see, for instance, Li and Jain, 1998; Yang and Liu, 1999). Although they differ, the majority of the proposed techniques approach the problem as follows: given a set of labeled text documents belonging to a certain number of classes, a new unseen text is assigned to the category with the highest similarity with respect to its content. In this first experiment, we report the results of applying our infinite model to text categorization using a widely used data set, namely the ModApte version of the Reuters-21578 corpus [3]. The ModApte version has 9,603 training documents and 3,299 test documents. Generally, only the categories having more than one document in the training and testing sets are considered, which reduces the total number of categories to 90 (Joachims, 1998; Chai et al., 2002). The commonly used approach to represent text documents is the bag-of-words scheme, which represents each text document as a feature vector containing the frequencies of distinct words (after tokenisation, stemming, and stop-word removal) observed in the text. Having our frequency vectors, their dimensionality was then reduced using latent Dirichlet allocation (LDA) (Blei et al., 2003) as a pre-processing step, which results in the representation of each document by a vector of proportions. The vectors in the different training sets were then modeled by our infinite mixture using the algorithm of the previous section. After this stage, each class in the training set was represented by a mixture. Finally, in the classification stage, each test vector was assigned to a given class according to the well-known Bayes classification rule (Everitt, 1993). Following (Chai et al., 2002), and in order to evaluate our results, we used the F1 measure, which combines the precision and recall measures as follows:

F1 = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}

where

\mathrm{Recall} = \frac{\text{number of correct positive predictions}}{\text{number of positive examples}}, \qquad \mathrm{Precision} = \frac{\text{number of correct positive predictions}}{\text{number of positive predictions}}

[1] Note that the Gaussian mixture is ill-suited for this kind of data (see, for instance, Bouguila and Ziou, 2008; Bouguila and Ziou, 2010) and will not be considered for comparisons here.
[2] To evaluate the sensitivity of our model to the prior specification, we repeated the Gibbs sampler several times with different priors, hyperparameters and numbers of iterations. The results did not change significantly.
[3] http://www.daviddlewis.com/resources/testcollections/reuters21578.
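For reference, here is a short sketch (ours, not the paper's evaluation code) of the precision, recall and F1 computation above, together with the micro- and macro-averaging used below; the toy labels in the usage example are arbitrary.

import numpy as np

def f1_scores(y_true, y_pred, categories):
    """Per-category precision/recall/F1 plus micro- and macro-averaged F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_cat_f1, tp_all, fp_all, fn_all = [], 0, 0, 0
    for c in categories:
        tp = np.sum((y_pred == c) & (y_true == c))   # correct positive predictions
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * recall * precision / (recall + precision)) if (recall + precision) else 0.0
        per_cat_f1.append(f1)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro_p = tp_all / (tp_all + fp_all) if tp_all + fp_all else 0.0
    micro_r = tp_all / (tp_all + fn_all) if tp_all + fn_all else 0.0
    micro_f1 = 2 * micro_r * micro_p / (micro_r + micro_p) if (micro_r + micro_p) else 0.0
    macro_f1 = float(np.mean(per_cat_f1))            # average of within-category F1 values
    return micro_f1, macro_f1

# toy usage example with made-up labels
print(f1_scores(["acq", "earn", "earn", "crude"], ["acq", "earn", "acq", "crude"],
                categories=["acq", "earn", "crude"]))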
Table 1
Classification results for the ModApte data set.

         IBLM    IGDM    IDM
Micro    0.80    0.81    0.77
Macro    0.59    0.60    0.58
When we have different document classes, which is our case, the F1 scores are commonly summarized via their micro- and macro-averages, defined respectively as the F1 computed over all categories and documents and as the average of the within-category F1 values (Chai et al., 2002; Yang and Liu, 1999). Table 1 shows the classification results using the IBLM, IGDM and IDM models. According to this table, it is clear that the IBLM and IGDM perform better than the IDM and that the IGDM performs slightly better than the IBLM. It is important, however, to study the significance of these results. In order to investigate the significance of the obtained results we used the micro sign test (s-test) and the macro sign test (S-test) as proposed in (Yang and Liu, 1999). These tests show that the difference between the IBLM and the IDM models is statistically significant when considering the micro measure (p-value < 0.01). The difference between the IGDM and the IDM is also statistically significant (p-value < 0.01). The difference between the IBLM and the IGDM, however, is not significant (p-value > 0.05). In this case the IBLM is favored, since it involves a clearly smaller number of parameters to be estimated.

4.3. Texture image classification using textons

In recent years, the number of large image and video databases has dramatically increased due to advances in storage devices, networking, computational power, digital cameras and scanning. Huge databases can be found in several areas such as medicine, astronomy, space science, forestry, agriculture and military applications, which in turn has created an urgent demand for efficient tools for image categorization and retrieval (Fayyad et al., 1996). Modeling and classifying visual scenes requires powerful statistical models to represent their content (e.g., color, texture). In this application we focus on the problem of texture image modeling and classification, which arises in several disciplines such as image analysis, computer vision and remote sensing. Texture is a well-studied property of images, and several descriptors, categorization approaches and comparative measures have been proposed and discussed in the past (see, for instance, Smith and Burns, 1997). We shall not elaborate further on the different approaches that have been proposed to model textures, which is clearly beyond the scope of this paper. Rather, we focus on an important texture feature extraction approach that has received a lot of attention recently. This approach is based on the representation of texture using a visual dictionary obtained through the quantization of the appearance of local regions, described by local features, which gives characteristic texture elements generally called textons (see, for instance, Varma and Zisserman, 2003).
For our experimental evaluation, we use two texture data sets. The first is UIUCTex (Lazebnik et al., 2005), which is composed of 25 texture classes with 40 images per class, and the second is KTH-TIPS (Hayman et al., 2004), which is composed of 10 texture classes with 81 images per class. Figs. 2 and 3 show examples of images from the different classes in these two data sets, respectively.

Fig. 2. Examples of images from the UIUCTex data set.

Fig. 3. Examples of images from the 10 different classes in the KTH-TIPS data set.

For both data sets, we randomly select half of the images for training and the rest for testing. An important step in our application is the extraction of local features to describe the textures. In our case we use two main approaches. The first is based on detecting local regions with a Laplacian detector and describing them using SIFT descriptors (Lowe, 2004), giving a 128-dimensional vector for each local region. The second, previously proposed in (Varma and Zisserman, 2003), uses n × n pixel compact neighborhoods as image descriptors: each texture pixel is described by an n²-dimensional vector which contains the pixel intensities of its n × n square neighborhood. Following (Varma and Zisserman, 2003), we use n = 7, thus each pixel is represented by a 49-dimensional vector.
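As an illustration of the second descriptor, the sketch below (written for this text, not taken from Varma and Zisserman, 2003) extracts the n × n pixel neighborhoods, flattened into n²-dimensional vectors (49-dimensional for n = 7), from a grayscale image.

import numpy as np

def neighborhood_descriptors(image, n=7):
    """Return a (num_pixels, n*n) array of raw pixel-intensity descriptors.

    Each interior pixel is described by the intensities of its n x n square
    neighborhood, flattened into an n^2-dimensional vector (49-D for n = 7).
    """
    image = np.asarray(image, dtype=float)
    half = n // 2
    H, W = image.shape
    descriptors = []
    for y in range(half, H - half):
        for x in range(half, W - half):
            patch = image[y - half:y + half + 1, x - half:x + half + 1]
            descriptors.append(patch.ravel())
    return np.array(descriptors)

# toy usage on a random 32 x 32 "texture"
rng = np.random.default_rng(0)
descs = neighborhood_descriptors(rng.random((32, 32)), n=7)
print(descs.shape)   # (26 * 26, 49)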
Then, we construct, for each of these two approaches, a global texton vocabulary via the clustering of the generated descriptors. Following (Zhang et al., 2007), we extract 10 textons using K-means (i.e. the textons are actually the K-means cluster centers) for each texture class and then concatenate the textons of the different classes to form the visual vocabulary. Thus, the vocabulary sizes are 250 and 100 for UIUCTex and KTH-TIPS, respectively. Once the vocabulary of textons is built, each image in each texture data set can be represented by a vector of texton frequencies which, after normalization, gives a proportion vector in which each entry represents the probability that a given texton occurs in the image.

Evaluation results for the two data sets generated by the IBLM, the IGDM and the IDM, using both the SIFT and the neighborhood descriptors, are summarized in Table 2.

Table 2
Overall classification rate (%) of the different approaches for the two texture data sets.

            Using SIFT                  Using neighborhoods
            IBLM     IGDM     IDM       IBLM     IGDM     IDM
UIUCTex     97.86    97.78    93.88     90.74    90.13    86.66
KTH-TIPS    96.12    95.98    92.07     89.55    89.17    84.91

Clearly, the IGDM and the IBLM outperform the IDM, and a Student's t-test shows that the differences in performance are statistically significant. This can be explained by the fact that they make less restrictive assumptions than the IDM about the data covariance structure. The results also show that the performance of the IBLM is comparable to that of the IGDM (i.e. the difference is not statistically significant) despite the fact that it involves the estimation of fewer parameters. Fig. 4 displays the classification accuracy of the different approaches as a function of the number of images in the training set. It is clear that the accuracy increases as the number of training images increases.

Fig. 4. Accuracy as a function of the number of images, taken from each class, in the training sets. (a) UIUCTex data set using SIFT vectors, (b) KTH-TIPS data set using SIFT vectors, (c) UIUCTex data set using neighborhoods, (d) KTH-TIPS data set using neighborhoods.
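As a final illustration, the texton vocabulary and histogram computation described at the beginning of this subsection can be sketched as follows; this is our own illustration, assuming scikit-learn's KMeans, and the function names are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def build_texton_vocabulary(descriptors_per_class, textons_per_class=10, seed=0):
    """Concatenate per-class K-means centers into a global texton vocabulary."""
    vocab = []
    for descs in descriptors_per_class:        # one descriptor array per texture class
        km = KMeans(n_clusters=textons_per_class, n_init=10, random_state=seed).fit(descs)
        vocab.append(km.cluster_centers_)      # the textons of this class
    return np.vstack(vocab)                    # e.g. 25 classes x 10 = 250 textons

def texton_histogram(descriptors, vocabulary):
    """Normalized texton histogram of one image: a vector of proportions."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                # index of the closest texton
    hist = np.bincount(nearest, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                   # proportions summing to one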
5. Conclusion

In this paper, we have presented a hierarchical nonparametric Bayesian statistical framework based on infinite Beta-Liouville mixtures for proportional data modeling and classification, motivated by the importance of this kind of data in several applications. Infinite models have many advantages: they are general, consistent, powerful, extensible and flexible enough to be applied to a variety of learning problems. We estimate the posterior distributions of our model parameters using MCMC simulations. Through challenging applications involving text categorization and texture discrimination, we have shown how our infinite mixture, in conjunction with Bayesian learning and MCMC methods, can be used to provide excellent modeling and classification capabilities. The Bayesian framework developed in this article can be extended to integrate feature selection and/or outlier detection, which could further improve the classification results. It is our hope that many
different types of real-world problems involving proportional data will be tackled within the developed framework.

Acknowledgment

The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Aitchison, J., 1986. The Statistical Analysis of Compositional Data. Chapman & Hall, London, UK.
Allen, T.V., Greiner, R., 2000. Model selection criteria for learning belief nets: an empirical comparison. In: Proc. of the Internat. Conf. on Machine Learning (ICML), pp. 1047-1054.
Antoniak, C.E., 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist. 2 (6), 1152-1174.
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022.
Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.K., 1987. Occam's razor. Inform. Process. Lett. 24 (6), 377-380.
Bouchard, G., Celeux, G., 2006. Selection of generative models in classification. IEEE Trans. Pattern Anal. Machine Intell. 28 (4), 544-554.
Bouguila, N., Ziou, D., 2006. A hybrid SEM algorithm for high-dimensional unsupervised learning using a finite generalized Dirichlet mixture. IEEE Trans. Image Process. 15 (9), 2657-2668.
Bouguila, N., Ziou, D., 2006. Unsupervised selection of a finite Dirichlet mixture model: an MML-based approach. IEEE Trans. Knowledge Data Eng. 18 (8), 993-1009.
Bouguila, N., Ziou, D., 2007. High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Trans. Pattern Anal. Machine Intell. 29 (10), 1716-1731.
Bouguila, N., Ziou, D., 2008. A Dirichlet process mixture of Dirichlet distributions for classification and prediction. In: Proc. of the IEEE Workshop on Machine Learning for Signal Processing (MLSP), pp. 297-302.
Bouguila, N., Ziou, D., 2010. A Dirichlet process mixture of generalized Dirichlet distributions for proportional data modeling. IEEE Trans. Neural Networks 21 (1), 107-122.
Bouguila, N., Ziou, D., Vaillancourt, J., 2004. Unsupervised learning of a finite mixture model based on the Dirichlet distribution and its application. IEEE Trans. Image Process. 13 (11), 1533-1543.
Bouguila, N., Ziou, D., Monga, E., 2006. Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Statist. Comput. 16 (2), 215-225.
Chai, K.M.A., Ng, H.T., Chieu, H.L., 2002. Bayesian online classifiers for text classification and filtering. In: Proc. of the Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), pp. 97-104.
Cowles, M.K., Carlin, B.P., 1996. Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Amer. Statist. Assoc. 91, 883-904.
Cox, D.R., 1990. Role of models in statistical analysis. Statist. Sci. 5 (2), 169-174.
Everitt, B., 1993. Cluster Analysis, third ed. John Wiley & Sons, Inc.
Fang, K.T., Kotz, S., Ng, K.W., 1990. Symmetric Multivariate and Related Distributions. Chapman and Hall, New York.
Fayyad, U.M., Djorgovski, S.G., Weir, N., 1996. From digitized images to online catalogs. AI Mag. 17 (2), 51-66.
Ferguson, T.S., 1973. A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 (2), 209-230.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. 6 (6), 721-741.
Ghosh, J.K., Ramamoorthi, R.V., 2003. Bayesian Nonparametrics. Springer.
Gilks, W.R., Wild, P., 1993. Algorithm AS 287: adaptive rejection sampling from log-concave density functions. Appl. Stat. 42 (4), 701-709.
Gopalan, R., Berry, D.A., 1998. Bayesian multiple comparisons using Dirichlet process priors. J. Amer. Statist. Assoc. 93 (443), 1130-1139.
Gupta, R.D., Richards, D.St.P., 1987. Multivariate Liouville distributions. J. Multivar. Anal. 23 (2), 233-256.
Hayman, E., Caputo, B., Fritz, M., Eklundh, J.-O., 2004. On the significance of real-world conditions for material classification. In: Proc. of the 8th European Conf. on Computer Vision (ECCV), Prague, Czech Republic, pp. 253-266.
Ishwaran, H., 1998. Exponential posterior consistency via generalized Polya urn schemes in finite semiparametric mixtures. Ann. Statist. 26 (6), 2157-2178.
Joachims, T., 1998. Text categorization with support vector machines: learning with many relevant features. In: Proc. of the European Conf. on Machine Learning (ECML), pp. 137-142.
Kim, Y., 1999. Nonparametric Bayesian estimators for counting processes. Ann. Statist. 27 (2), 562-588.
Kivinen, J.J., Sudderth, E.B., Jordan, M.I., 2007. Learning multiscale representations of natural scenes using Dirichlet processes. In: Proc. of the IEEE 11th Internat. Conf. on Computer Vision (ICCV), pp. 1-8.
Lazebnik, S., Schmid, C., Ponce, J., 2005. A sparse texture representation using local affine regions. IEEE Trans. Pattern Anal. Machine Intell. 27 (8), 1265-1278.
Li, Y.H., Jain, A.K., 1998. Classification of text documents. Comput. J. 41 (8), 537-546.
Little, R.J.A., Rubin, D.B., 1983. On jointly estimating the parameters and missing data by maximizing the complete-data likelihood. Amer. Statist. 37 (3), 218-220.
Liu, J.S., Wong, W.H., Kong, A., 1994. Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika 81 (1), 27-40.
Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. Internat. J. Comput. Vision 60 (2), 91-110.
McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. Wiley Interscience, New York.
McLachlan, G.J., Peel, D., 2000. Finite Mixture Models. Wiley, New York.
Neal, R.M., 2000. Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Statist. 9, 249-265.
Rasmussen, C.E., 2000. The infinite Gaussian mixture model. In: Adv. Neural Inform. Process. Syst. (NIPS), pp. 554-560.
Robert, C.P., 2007. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer.
Robert, C.P., Casella, G., 1999. Monte Carlo Statistical Methods. Springer-Verlag.
Sivazlian, B.D., 1981. On a multivariate extension of the gamma and beta distributions. SIAM J. Appl. Math. 41 (2), 205-209.
Smith, G., Burns, I., 1997. Measuring texture classification algorithms. Pattern Recognition Lett. 18, 1495-1501.
Tauritz, D.R., Kok, J.N., Sprinkhuizen-Kuyper, I.G., 2000. Adaptive information filtering using evolutionary computation. Inform. Sci. 122, 121-140.
Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M., 2006. Hierarchical Dirichlet processes. J. Amer. Statist. Assoc. 101 (476), 1566-1581.
Varma, M., Zisserman, A., 2003. Texture classification: are filter banks necessary? In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 691-698.
Wong, T.-T., 2009. Alternative prior assumptions for improving the performance of naïve Bayesian classifiers. Data Min. Knowl. Disc. 18 (2), 183-213.
Yang, Y., Liu, X., 1999. A re-examination of text categorization methods. In: Proc. of the Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), pp. 42-49.
Yu, K.L., Lam, W., 1998. A new on-line learning algorithm for adaptive text filtering. In: Proc. of the Internat. Conf. on Information and Knowledge Management (CIKM), pp. 156-160.
Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C., 2007. Local features and kernels for classification of texture and object categories: a comprehensive study. Internat. J. Comput. Vision 73 (2), 213-238.