Pattern Recognition 47 (2014) 3132–3142
Bounded generalized Gaussian mixture model

Thanh Minh Nguyen (a), Q.M. Jonathan Wu (a), Hui Zhang (a,b)

(a) Department of Electrical and Computer Engineering, University of Windsor, 401 Sunset Avenue, Windsor, ON, Canada N9B-3P4
(b) School of Computer and Software and Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science and Technology, Nanjing 210044, China
Article history: Received 11 September 2013; received in revised form 26 February 2014; accepted 23 March 2014; available online 4 April 2014.

Abstract: The generalized Gaussian mixture model (GGMM) provides a flexible and suitable tool for many computer vision and pattern recognition problems. However, the generalized Gaussian distribution is unbounded, whereas in many applications the observed data are digitized and have bounded support. In this paper we present a new bounded generalized Gaussian mixture model (BGGMM), which includes the Gaussian mixture model (GMM), the Laplace mixture model (LMM), and the GGMM as special cases. We propose an extension of the generalized Gaussian distribution; the new distribution has the flexibility to fit different shapes of observed data, such as non-Gaussian and bounded support data. To estimate the model parameters, we propose an alternative approach that minimizes an upper bound on the negative log-likelihood of the data. We quantify the performance of the BGGMM with simulations and real data.

Keywords: Mixture model; Bounded support regions; Generalized Gaussian distribution

Funding and correspondence: This work was supported in part by the NSERC Discovery Grant and in part by the National Natural Science Foundation of China under Grants 61232016 and 61105007. Corresponding author: Thanh Minh Nguyen ([email protected]); co-authors: [email protected] (Q.M. Jonathan Wu), [email protected] (H. Zhang).
1. Introduction

The finite mixture model is widely used in areas where statistical modeling of data is needed, such as bioinformatics, pattern recognition, and machine learning. The main advantage of this technique is its capability to use prior knowledge to model uncertainty in a probabilistic manner. Among the algorithms based on the Bayesian technique, the Gaussian mixture model (GMM) [1–3] is a well-known method used in many applications. An advantage of the GMM is that it requires only a small number of parameters for learning; moreover, these parameters can be efficiently estimated by adopting the expectation-maximization (EM) algorithm [4,5] to maximize the log-likelihood function.

Although the GMM is a flexible and powerful tool for data analysis, it is sensitive to outliers and may be overly influenced by small numbers of data points. Moreover, for many applied problems, the tail of the Gaussian distribution is shorter than required. In order to improve the robustness of the algorithm for modeling data with different shapes, the Student's-t mixture model (SMM) has been proposed in [6–8]. Compared to the
GMM, each component of the SMM has one additional parameter, called the degrees of freedom (v), which can be viewed as a robustness tuning parameter. For the particular case v = 1, the Student's-t distribution reduces to the Cauchy distribution; as v tends to infinity, it approaches the Gaussian distribution. Hence, the SMM provides a more powerful and flexible approach for modeling data than the GMM.

Recently, there has been growing research interest in Bayesian techniques based on modeling the probability density function of the data via the generalized Gaussian distribution (GGD) [9–12]. The GGD has been successfully used in texture discrimination and retrieval [9–11], image and video coding [13,14], image denoising [15], and change detection [16]. This distribution has one parameter (λ) more than the Gaussian distribution; λ controls the tails of the distribution and determines whether it is peaked or flat. It is worth mentioning that the Laplacian and Gaussian distributions are particular cases of the generalized Gaussian distribution, obtained for λ = 1 and λ = 2, respectively. Thus, the generalized Gaussian mixture model (GGMM) has the flexibility required to fit the shape of the data better than the GMM.

One drawback of the above-mentioned mixture models is that their distributions are unbounded, with a support range of (−∞, +∞). In many real applications, however, the observed data always fall within a bounded support region [20–22]. For example, in signal processing the power spectrum is semi-bounded, and in computer vision image pixel values usually lie in a limited range. In [21,22], a bounded Gaussian mixture model (BGMM) was proposed for speech
processing. However, in many applications the tail of the Gaussian distribution is shorter than required, and the BGMM is not flexible enough to fit the shape of the data.

Motivated by the aforementioned observations, we introduce in this paper a bounded generalized Gaussian mixture model (BGGMM) for analyzing data, which includes the GMM, LMM, GGD, GGMM, and BGMM as special cases. Our approach differs from those discussed above in the following respects. First, we propose an extension of the generalized Gaussian distribution; this new distribution has the flexibility to fit different shapes of observed data, such as non-Gaussian and bounded support data. Second, each component of our model has the ability to model observed data with a different bounded support region. Finally, to estimate the parameters of the proposed model, we propose a new method that minimizes an upper bound on the negative log-likelihood of the data. We demonstrate through extensive simulations that the proposed model is superior to other methods based on modeling the probability density function of the data via a finite mixture model.

The remainder of this paper is organized as follows: Section 2 describes the proposed method in detail; Section 3 presents the parameter estimation; Section 4 sets out the experimental results; and Section 5 presents our conclusions.
2. Proposed method

Given a finite mixture model with K components, the marginal distribution of the random variable $x_i$ is

$$f(x_i|\Theta) = \sum_{j=1}^{K}\pi_j\, p(x_i|\Theta_j) \qquad (1)$$

where $\Theta$ represents the model parameters. The prior probabilities $\pi_j$ satisfy the constraints

$$\pi_j \ge 0 \quad \text{and} \quad \sum_{j=1}^{K}\pi_j = 1 \qquad (2)$$

As shown in (1), the main goal of statistical modeling is to establish a model that best describes the statistical properties of the underlying source. Mixture models rely on $p(x_i|\Theta_j)$ to model the underlying distributions, and $p(x_i|\Theta_j)$ can be any kind of distribution. In the GMM [1–3], SMM [6,7], and GGMM [9,11], $p(x_i|\Theta_j)$ is the Gaussian distribution $\Phi(x_i|\mu_j,\sigma_j)$, the Student's-t distribution $S(x_i|\mu_j,\sigma_j,v_j)$, and the generalized Gaussian distribution $T(x_i|\mu_j,\sigma_j,\lambda_j)$, respectively. These distributions are all unbounded, with support range $(-\infty,+\infty)$.

In order to overcome this problem, we propose a new finite mixture model that has the flexibility to fit different shapes of observed data, such as non-Gaussian and bounded support data. First, for each component (denoted by $\Omega_j$), we define $\partial_j$ to be the bounded support region in $\mathbb{R}$, and the indicator function as

$$H(x_i|\Omega_j) = \begin{cases} 1 & \text{if } x_i \in \partial_j \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

With the indicator function $H(x_i|\Omega_j)$ in (3), we define a bounded generalized Gaussian distribution $\Psi(x_i|\mu_j,\sigma_j,\lambda_j)$:

$$\Psi(x_i|\mu_j,\sigma_j,\lambda_j) = \frac{T(x_i|\mu_j,\sigma_j,\lambda_j)\, H(x_i|\Omega_j)}{\int_{\partial_j} T(x|\mu_j,\sigma_j,\lambda_j)\, dx} \qquad (4)$$

The distribution $T(x_i|\mu_j,\sigma_j,\lambda_j)$ in (4) is the generalized Gaussian distribution

$$T(x_i|\mu_j,\sigma_j,\lambda_j) = A(\lambda_j)\exp\left(-B(\lambda_j)\left|\frac{x_i-\mu_j}{\sigma_j}\right|^{\lambda_j}\right) \qquad (5)$$

The parameters $\mu_j$ and $\sigma_j$ are the mean and standard deviation. The parameter $\lambda_j$ controls the tails of the distribution and determines whether it is peaked or flat. The Laplacian and Gaussian distributions are particular cases of the generalized Gaussian distribution, obtained for $\lambda_j = 1$ and $\lambda_j = 2$, respectively. In (5), $A(\lambda_j)$ and $B(\lambda_j)$ are defined as

$$A(\lambda_j) = \frac{\lambda_j}{2\sigma_j\,\Gamma(1/\lambda_j)}\sqrt{\frac{\Gamma(3/\lambda_j)}{\Gamma(1/\lambda_j)}} \quad \text{and} \quad B(\lambda_j) = \left[\frac{\Gamma(3/\lambda_j)}{\Gamma(1/\lambda_j)}\right]^{\lambda_j/2} \qquad (6)$$

In (4), $\int_{\partial_j} T(x|\mu_j,\sigma_j,\lambda_j)\,dx$ is the normalization constant, identified as the share of $T(x_i|\mu_j,\sigma_j,\lambda_j)$ that belongs to the support region $\partial_j$. The idea behind the distribution $\Psi(x_i|\mu_j,\sigma_j,\lambda_j)$ in (4) is based on the fact that the observed data are digitized and have bounded support: $\Psi(x_i|\mu_j,\sigma_j,\lambda_j)$ is proportional to $T(x_i|\mu_j,\sigma_j,\lambda_j)$ inside the support region $\partial_j$, and zero outside it.

Two examples with the histogram of the observed data are presented in Fig. 1(a) and (b), where the observed data lie in the intervals (0.32, 0.39) and (0.25, 0.35), respectively. In these examples, we present the estimated distributions obtained by employing $\Phi(x_i|\mu_j,\sigma_j)$, $S(x_i|\mu_j,\sigma_j,v_j)$, $T(x_i|\mu_j,\sigma_j,\lambda_j)$, and $\Psi(x_i|\mu_j,\sigma_j,\lambda_j)$, respectively. The Gaussian distribution $\Phi(x_i|\mu_j,\sigma_j)$, the Student's-t distribution $S(x_i|\mu_j,\sigma_j,v_j)$, and the generalized Gaussian distribution $T(x_i|\mu_j,\sigma_j,\lambda_j)$ fit these data poorly. The distribution $\Psi(x_i|\mu_j,\sigma_j,\lambda_j)$ in (4) provides the flexibility needed to fit observed data with a limited range more effectively than a Gaussian distribution, a Student's-t distribution, or a generalized Gaussian distribution.
Fig. 1. Comparison of the distributions. (a) The first example with observed data in the interval (0.32, 0.39) and (b) the second example with observed data in the interval (0.25, 0.35).
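For readers who want to experiment with the density in (4)–(6), the following sketch evaluates it numerically. It is our own illustrative code (not from the paper), assumes NumPy and SciPy, models the support region ∂j as a simple interval (a, b), and estimates the normalization constant by Monte Carlo sampling in the same spirit as Eq. (15) of Section 3; the helper names are ours and are reused in the later snippets.

```python
import numpy as np
from scipy.special import gamma

def ggd_pdf(x, mu, sigma, lam):
    """Generalized Gaussian density T(x | mu, sigma, lam), Eqs. (5)-(6)."""
    A = lam / (2.0 * sigma * gamma(1.0 / lam)) * np.sqrt(gamma(3.0 / lam) / gamma(1.0 / lam))
    B = (gamma(3.0 / lam) / gamma(1.0 / lam)) ** (lam / 2.0)
    return A * np.exp(-B * np.abs((x - mu) / sigma) ** lam)

def ggd_sample(mu, sigma, lam, size, rng):
    """Draw samples from T via its gamma representation."""
    # If g ~ Gamma(shape=1/lam, scale=1), then x = mu +/- (sigma / B**(1/lam)) * g**(1/lam),
    # with a random sign, has density T(x | mu, sigma, lam).
    B = (gamma(3.0 / lam) / gamma(1.0 / lam)) ** (lam / 2.0)
    g = rng.gamma(shape=1.0 / lam, scale=1.0, size=size)
    sign = rng.choice([-1.0, 1.0], size=size)
    return mu + sign * (sigma / B ** (1.0 / lam)) * g ** (1.0 / lam)

def bounded_ggd_pdf(x, mu, sigma, lam, support, M=100_000, rng=None):
    """Bounded density Psi(x | mu, sigma, lam) of Eq. (4) on support = (a, b)."""
    rng = rng or np.random.default_rng(0)
    a, b = support
    s = ggd_sample(mu, sigma, lam, M, rng)       # samples s_mj from T
    norm = np.mean((s >= a) & (s <= b))          # Monte Carlo estimate of the normalization constant
    inside = (x >= a) & (x <= b)                 # indicator H(x | Omega_j), Eq. (3)
    return np.where(inside, ggd_pdf(x, mu, sigma, lam) / norm, 0.0)
```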
It is worth mentioning that the proposed distribution in (4) always satisfies the conditions of a probability density [2]:

$$\Psi(x_i|\mu_j,\sigma_j,\lambda_j) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty}\Psi(x|\mu_j,\sigma_j,\lambda_j)\,dx = 1 \qquad (7)$$

Given the distribution $\Psi(x_i|\mu_j,\sigma_j,\lambda_j)$ in (4), the log-likelihood function is written in the form

$$L(\Theta) = \sum_{i=1}^{N}\log\left\{\sum_{j=1}^{K}\pi_j\,\Psi(x_i|\mu_j,\sigma_j,\lambda_j)\right\} = \sum_{i=1}^{N}\log\left\{\sum_{j=1}^{K}\pi_j\,\frac{T(x_i|\mu_j,\sigma_j,\lambda_j)\,H(x_i|\Omega_j)}{\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,dx}\right\} \qquad (8)$$

From (8), we can see that each component of the proposed model can model observed data with a different bounded support region $\partial_j$; any shape can be defined based on prior knowledge about the observed data. As shown in Appendix A, comparing the mathematical expression of the proposed model with that of the GMM [1–3], if we set $\lambda_j = 2$ and assign the bounded support region as $H(x_i|\Omega_j) = 1$ for each label $\Omega_j$, our method reduces to the GMM. If we set $\lambda_j = 1$ and $H(x_i|\Omega_j) = 1$, our method reduces to the LMM. When $K = 1$ and $H(x_i|\Omega_j) = 1$, the proposed method reduces to the GGD in [9], and it reduces to the GGMM [11] when $H(x_i|\Omega_j) = 1$. Comparing our method with the BGMM in [21,22], if we set $\lambda_j = 2$ and $H(x_i|\Omega_j) = H(x_i|\Omega_k)$, $\forall j,k = 1,2,\ldots,K$, our method reduces to the BGMM. Similarly, our method reduces to the bounded Laplace mixture model (BLMM) when we set $\lambda_j = 1$ and $H(x_i|\Omega_j) = H(x_i|\Omega_k)$, $\forall j,k = 1,2,\ldots,K$. Therefore, the underlying model of the proposed method is a generalization of the GMM, LMM, GGD, GGMM, BGMM, and BLMM models.

Given the log-likelihood function in (8), the next objective is to optimize the parameter set in order to maximize this log-likelihood function.

3. Parameter learning

Thus far, the discussion has focused on $\Psi(x_i|\mu_j,\sigma_j,\lambda_j)$ in (4) for modeling the underlying distributions. In order to adjust the parameters $\Theta = \{\pi_j,\mu_j,\sigma_j,\lambda_j\}$, we need to maximize the likelihood function in (8). For convenience of presentation, this section is divided into five subsections.

3.1. Minimizing the negative log-likelihood function

Since the logarithm is a monotonically increasing function, it is more convenient to consider the negative logarithm of the likelihood function [17,18] as an error function:

$$J(\Theta) = -L(\Theta) = -\sum_{i=1}^{N}\log\sum_{j=1}^{K}\pi_j\,\Psi(x_i|\mu_j,\sigma_j,\lambda_j) \qquad (9)$$

Therefore, maximizing the likelihood $L(\Theta)$ in (8) is equivalent to minimizing $J(\Theta)$ in (9). In order to minimize the error function $J(\Theta)$, we define the variable $z_{ij}$ as

$$z_{ij} = \frac{\pi_j\,\Psi(x_i|\mu_j,\sigma_j,\lambda_j)}{\sum_{m=1}^{K}\pi_m\,\Psi(x_i|\mu_m,\sigma_m,\lambda_m)} \qquad (10)$$

Because the values of $z_{ij}$ in (10) always satisfy $\sum_{j=1}^{K} z_{ij} = 1$, we can apply Jensen's inequality [19], in the form $\log(\sum_{j=1}^{K} z_{ij}\,s_j) \ge \sum_{j=1}^{K} z_{ij}\log(s_j)$, to the error function in (9) to give

$$J(\Theta) \le -\sum_{i=1}^{N}\sum_{j=1}^{K} z_{ij}\left\{\log\pi_j + \log\Psi(x_i|\mu_j,\sigma_j,\lambda_j)\right\} \qquad (11)$$

Then, minimizing the negative log-likelihood function in (9) is equivalent to minimizing the error function $E(\Theta)$:

$$E(\Theta) = -\sum_{i=1}^{N}\sum_{j=1}^{K} z_{ij}\left\{\log\pi_j + \log T(x_i|\mu_j,\sigma_j,\lambda_j) + \log H(x_i|\Omega_j) - \log\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,dx\right\} \qquad (12)$$

3.2. Mean parameter estimation

To minimize the function in (12), we consider the derivative of the error function $E(\Theta)$ with respect to the mean $\mu_j$ at the $(t+1)$ iteration step. We have

$$\frac{\partial E(\Theta)}{\partial\mu_j} = B(\lambda_j)\frac{\lambda_j}{\sigma_j^{\lambda_j}}\sum_{i=1}^{N} z_{ij}\left\{|x_i-\mu_j|^{\lambda_j-2}(-x_i+\mu_j) - \frac{\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,\mathrm{sign}(\mu_j-x)\,|x-\mu_j|^{\lambda_j-1}\,dx}{\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,dx}\right\} \qquad (13)$$

where $\mathrm{sign}(x)$ is equal to 1 if $x \ge 0$ and $-1$ otherwise. Note that in (13), the term $\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,\mathrm{sign}(\mu_j-x)\,|x-\mu_j|^{\lambda_j-1}\,dx$ is the expectation of the function $\mathrm{sign}(\mu_j-x)\,|x-\mu_j|^{\lambda_j-1}$ under the probability distribution $T(x|\mu_j,\sigma_j,\lambda_j)$, restricted to $\partial_j$. This expectation can be approximated as [2,20]

$$\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,\mathrm{sign}(\mu_j-x)\,|x-\mu_j|^{\lambda_j-1}\,dx \approx \frac{1}{M}\sum_{m=1}^{M}\mathrm{sign}(\mu_j^{(t)}-s_{mj})\,|s_{mj}-\mu_j^{(t)}|^{\lambda_j^{(t)}-1}\,H(s_{mj}|\Omega_j) \qquad (14)$$

where $s_{mj}\sim T(x|\mu_j^{(t)},\sigma_j^{(t)},\lambda_j^{(t)})$ denotes a random variable drawn from the probability distribution $T(x|\mu_j^{(t)},\sigma_j^{(t)},\lambda_j^{(t)})$, and $M$ is the number of random variables $s_{mj}$. Note that $M$ is a large integer value; we use $M = 10^6$ in our experiments. Similarly, the term $\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,dx$ in (13) can be approximated as [2,20]

$$\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,dx \approx \frac{1}{M}\sum_{m=1}^{M} H(s_{mj}|\Omega_j) \qquad (15)$$

From (14) and (15), $\partial E(\Theta)/\partial\mu_j$ in (13) is rewritten as

$$\frac{\partial E(\Theta)}{\partial\mu_j} = B(\lambda_j)\frac{\lambda_j}{\sigma_j^{\lambda_j}}\sum_{i=1}^{N} z_{ij}\,|x_i-\mu_j|^{\lambda_j-2}\left\{-x_i+\mu_j - \frac{R_j}{|x_i-\mu_j|^{\lambda_j-2}}\right\} \qquad (16)$$

where

$$R_j = \frac{\sum_{m=1}^{M}\mathrm{sign}(\mu_j^{(t)}-s_{mj})\,|s_{mj}-\mu_j^{(t)}|^{\lambda_j^{(t)}-1}\,H(s_{mj}|\Omega_j)}{\sum_{m=1}^{M} H(s_{mj}|\Omega_j)} \qquad (17)$$

According to the theory of robust statistics [21,23], an estimate $U$ is defined by an implicit equation

$$\sum_i \Upsilon(x_i - U) = 0 \qquad (18)$$

which gives a numerical solution for the location $U$ as a weighted mean:

$$U = \frac{\sum_i w_i\,x_i}{\sum_i w_i}, \quad \text{where } w_i = \frac{\Upsilon(x_i - U)}{x_i - U} \qquad (19)$$

Applying (18) to $\partial E(\Theta)/\partial\mu_j$ in (16), the solution of $\partial E(\Theta)/\partial\mu_j = 0$ yields the update of $\mu_j$ at the $(t+1)$ step:

$$\mu_j^{(t+1)} = \frac{\sum_{i=1}^{N} z_{ij}\left(|x_i-\mu_j|^{\lambda_j-2}\,x_i + R_j\right)}{\sum_{i=1}^{N} z_{ij}\,|x_i-\mu_j|^{\lambda_j-2}} \qquad (20)$$
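To illustrate how the Monte Carlo terms (14), (15), and (17) enter the mean update (20), the sketch below computes the responsibilities z_ij of (10) and performs one update of μ_j for a single component. It reuses ggd_sample and bounded_ggd_pdf from the snippet in Section 2, and all function names are our own placeholders rather than the authors' implementation.

```python
import numpy as np

def responsibilities(x, pi, mu, sigma, lam, support, M=100_000, rng=None):
    """z_ij of Eq. (10): posterior probability of component j for each x_i."""
    rng = rng or np.random.default_rng(0)
    K = len(pi)
    num = np.stack([pi[j] * bounded_ggd_pdf(x, mu[j], sigma[j], lam[j], support[j],
                                            M=M, rng=rng) for j in range(K)], axis=1)
    return num / np.maximum(num.sum(axis=1, keepdims=True), 1e-300)

def update_mean(x, z_j, mu_j, sigma_j, lam_j, support_j, M=100_000, rng=None):
    """One update of mu_j, Eq. (20), with R_j estimated as in Eqs. (14), (15), (17)."""
    rng = rng or np.random.default_rng(0)
    a, b = support_j
    s = ggd_sample(mu_j, sigma_j, lam_j, M, rng)            # s_mj ~ T(. | current parameters)
    H = ((s >= a) & (s <= b)).astype(float)                 # H(s_mj | Omega_j)
    R_j = np.sum(np.sign(mu_j - s) * np.abs(s - mu_j) ** (lam_j - 1.0) * H) / np.sum(H)  # Eq. (17)
    d = np.maximum(np.abs(x - mu_j), 1e-12)                 # small floor avoids division by zero
    w = d ** (lam_j - 2.0)                                  # |x_i - mu_j|^(lam_j - 2)
    return np.sum(z_j * (w * x + R_j)) / np.sum(z_j * w)    # Eq. (20)
```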
3.3. Standard deviation estimation

Setting the derivative of the error function $E(\Theta)$ with respect to $\sigma_j$ at the $(t+1)$ iteration step, we have

$$\frac{\partial E(\Theta)}{\partial\sigma_j} = \sigma_j^{-1}\sum_{i=1}^{N} z_{ij}\left\{-1 + B(\lambda_j)\,|x_i-\mu_j|^{\lambda_j}\,\lambda_j\,\sigma_j^{-\lambda_j} - \frac{\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\left(-1 + B(\lambda_j)\,|x-\mu_j|^{\lambda_j}\,\lambda_j\,\sigma_j^{-\lambda_j}\right)dx}{\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,dx}\right\} \qquad (21)$$

Similar to (14), the term $\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\left(-1 + B(\lambda_j)|x-\mu_j|^{\lambda_j}\lambda_j\sigma_j^{-\lambda_j}\right)dx$ in (21) can be approximated as [2,20]

$$\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\left(-1 + B(\lambda_j)\,|x-\mu_j|^{\lambda_j}\,\lambda_j\,\sigma_j^{-\lambda_j}\right)dx \approx \frac{1}{M}\sum_{m=1}^{M}\left(-1 + B(\lambda_j^{(t)})\,|s_{mj}-\mu_j^{(t)}|^{\lambda_j^{(t)}}\,\lambda_j^{(t)}\,(\sigma_j^{(t)})^{-\lambda_j^{(t)}}\right)H(s_{mj}|\Omega_j) \qquad (22)$$

From (22) and (15), $\partial E(\Theta)/\partial\sigma_j$ in (21) is rewritten as

$$\frac{\partial E(\Theta)}{\partial\sigma_j} = \sigma_j^{-1}\sum_{i=1}^{N} z_{ij}\left\{-1 + B(\lambda_j)\,|x_i-\mu_j|^{\lambda_j}\,\lambda_j\,\sigma_j^{-\lambda_j} - G_j\right\} \qquad (23)$$

where

$$G_j = \frac{\sum_{m=1}^{M}\left(-1 + \lambda_j^{(t)}\,B(\lambda_j^{(t)})\,|s_{mj}-\mu_j^{(t)}|^{\lambda_j^{(t)}}\,(\sigma_j^{(t)})^{-\lambda_j^{(t)}}\right)H(s_{mj}|\Omega_j)}{\sum_{m=1}^{M} H(s_{mj}|\Omega_j)} \qquad (24)$$

Similarly, we can apply (18) to $\partial E(\Theta)/\partial\sigma_j$ in (23). The solution of $\partial E(\Theta)/\partial\sigma_j = 0$ yields the update of $\sigma_j$ at the $(t+1)$ step:

$$\sigma_j^{(t+1)} = \left(\frac{\lambda_j\,B(\lambda_j)\sum_{i=1}^{N} z_{ij}\,|x_i-\mu_j|^{\lambda_j}}{\sum_{i=1}^{N} z_{ij}\,(1+G_j)}\right)^{1/\lambda_j} \qquad (25)$$

3.4. Shape parameter estimation

The next step is to update the estimate of the parameter $\lambda_j$. This involves holding the other parameters fixed and improving the estimate of $\lambda_j$ using the Newton–Raphson method [11,22,24]. Each iteration requires the first and second derivatives of the objective function $E(\Theta)$ with respect to $\lambda_j$:

$$\lambda_j^{(t+1)} = \lambda_j^{(t)} + \gamma\left[\left(\frac{\partial^2 E(\Theta)}{\partial\lambda_j^2}\right)^{-1}\frac{\partial E(\Theta)}{\partial\lambda_j}\right]_{\lambda_j=\lambda_j^{(t)}} \qquad (26)$$

where $\gamma$ is a scaling factor. The derivative of the error function $E(\Theta)$ with respect to $\lambda_j$ is given by

$$\frac{\partial E(\Theta)}{\partial\lambda_j} = \sum_{i=1}^{N} z_{ij}\left\{f(x_i|\mu_j,\sigma_j,\lambda_j) - \frac{\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,f(x|\mu_j,\sigma_j,\lambda_j)\,dx}{\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,dx}\right\} \qquad (27)$$

In (27), $f(x|\mu_j,\sigma_j,\lambda_j) = \partial\log T(x|\mu_j,\sigma_j,\lambda_j)/\partial\lambda_j$ is given by

$$f(x|\mu_j,\sigma_j,\lambda_j) = \frac{1}{\lambda_j} + \frac{3\psi(1/\lambda_j)-3\psi(3/\lambda_j)}{2\lambda_j^2} - B(\lambda_j)\left|\frac{x-\mu_j}{\sigma_j}\right|^{\lambda_j}\left[\log\left|\frac{x-\mu_j}{\sigma_j}\right| + C(\lambda_j)\right] \qquad (28)$$

where $\psi(\cdot)$ denotes the digamma function and, for compactness,

$$C(\lambda_j) = \frac{1}{2}\log\frac{\Gamma(3/\lambda_j)}{\Gamma(1/\lambda_j)} + \frac{\psi(1/\lambda_j)-3\psi(3/\lambda_j)}{2\lambda_j}$$

Similar to (14), the term $\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,f(x|\mu_j,\sigma_j,\lambda_j)\,dx$ in (27) can be approximated as

$$\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,f(x|\mu_j,\sigma_j,\lambda_j)\,dx \approx \frac{1}{M}\sum_{m=1}^{M} f(s_{mj}|\mu_j^{(t)},\sigma_j^{(t)},\lambda_j^{(t)})\,H(s_{mj}|\Omega_j) \qquad (29)$$

The second derivative $\partial^2 E(\Theta)/\partial\lambda_j^2$ is obtained as

$$\frac{\partial}{\partial\lambda_j}\left(\frac{\partial E(\Theta)}{\partial\lambda_j}\right) = \sum_{i=1}^{N} z_{ij}\left\{g(x_i|\mu_j,\sigma_j,\lambda_j) + \frac{\left[\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,f(x|\mu_j,\sigma_j,\lambda_j)\,dx\right]^2}{\left[\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,dx\right]^2} - \frac{\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\left\{f^2(x|\mu_j,\sigma_j,\lambda_j)+g(x|\mu_j,\sigma_j,\lambda_j)\right\}dx}{\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\,dx}\right\} \qquad (30)$$

In (30), $g(x|\mu_j,\sigma_j,\lambda_j) = \partial f(x|\mu_j,\sigma_j,\lambda_j)/\partial\lambda_j$ is given by

$$g(x|\mu_j,\sigma_j,\lambda_j) = -\frac{1}{\lambda_j^2} + \frac{3\psi(3/\lambda_j)-3\psi(1/\lambda_j)}{\lambda_j^3} + \frac{9\psi'(3/\lambda_j)-3\psi'(1/\lambda_j)}{2\lambda_j^4} - B(\lambda_j)\left|\frac{x-\mu_j}{\sigma_j}\right|^{\lambda_j}\left[\left(\log\left|\frac{x-\mu_j}{\sigma_j}\right| + C(\lambda_j)\right)^2 + \frac{9\psi'(3/\lambda_j)-\psi'(1/\lambda_j)}{2\lambda_j^3}\right] \qquad (31)$$

where $\psi'(\cdot)$ denotes the trigamma function. Similar to (14), the term $\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\{f^2(x|\mu_j,\sigma_j,\lambda_j)+g(x|\mu_j,\sigma_j,\lambda_j)\}\,dx$ in (30) can be approximated as

$$\int_{\partial_j}T(x|\mu_j,\sigma_j,\lambda_j)\left\{f^2(x|\mu_j,\sigma_j,\lambda_j)+g(x|\mu_j,\sigma_j,\lambda_j)\right\}dx \approx \frac{1}{M}\sum_{m=1}^{M}\left\{f^2(s_{mj}|\mu_j^{(t)},\sigma_j^{(t)},\lambda_j^{(t)}) + g(s_{mj}|\mu_j^{(t)},\sigma_j^{(t)},\lambda_j^{(t)})\right\}H(s_{mj}|\Omega_j) \qquad (32)$$

3.5. Prior probability estimation and the algorithm

The next step is to update the estimate of the prior probability $\pi_j$. Note that $\pi_j$ should satisfy the constraints in (2); the constraint $\sum_{j=1}^{K}\pi_j = 1$ gives

$$\pi_j^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N} z_{ij}^{(t)} \qquad (33)$$

So far, the discussion has focused on estimating $\Theta = \{\pi_j,\mu_j,\sigma_j,\lambda_j\}$. The various steps of the proposed mixture model can be summarized as follows:

Step 1: Initialize the parameters $\Theta = \{\pi_j,\mu_j,\sigma_j,\lambda_j\}$.
  + The parameters $\pi_j$, $\mu_j$, and $\sigma_j$ are initialized in the same way as for the GMM [1].
  + The initial value of $\lambda_j$ is set to 2.
Step 2: Evaluate the variables $z_{ij}$ in (10).
Step 3: Re-estimate the parameters $\Theta = \{\pi_j,\mu_j,\sigma_j,\lambda_j\}$.
  + Update the means $\mu_j$ using (20).
  + Update the standard deviations $\sigma_j$ using (25).
  + Update the shape parameters $\lambda_j$ using (26).
  + Update the prior probabilities $\pi_j$ using (33).
Step 4: Evaluate the error function in (9) and check for convergence of either the error function or the parameter values. If the convergence criterion is not satisfied, go to Step 2.

Comparing the mathematical expressions for the parameter learning of $\pi_j$, $\mu_j$, and $\sigma_j$, we can see that our approach, which minimizes the upper bound on the negative log-likelihood of the data, offers a closed-form M-step with computational complexity similar to that of the standard EM algorithm. However, with the standard EM algorithm, estimating the parameter $\lambda_j$ would require setting $\partial L(\Theta)/\partial\lambda_j$ equal to zero and solving for $\lambda_j$ at the $(t+1)$ iteration step; because of the complexity of $L(\Theta)$ in (8), there is no closed-form update equation for this parameter under the standard EM algorithm. In contrast, our approach makes it easy to estimate $\lambda_j$ from the observations by minimizing the upper bound on the negative log-likelihood, as shown in (26). In the next section, we demonstrate the robustness, accuracy, and effectiveness of the proposed model, as compared with other approaches.
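The sketch below shows how the individual updates fit together into the Step 1–4 loop. It implements the σj update of (22)–(25), reuses the helpers from the earlier snippets, replaces the k-means initialization with a crude quantile-based one, and, as a simplification, holds λj fixed at its initial value instead of applying the Newton–Raphson step of (26); it is a schematic sketch under our own naming, not the authors' code.

```python
import numpy as np
from scipy.special import gamma

def update_sigma(x, z_j, mu_j, sigma_j, lam_j, support_j, M=100_000, rng=None):
    """One update of sigma_j, Eq. (25), with G_j estimated as in Eqs. (22), (24)."""
    rng = rng or np.random.default_rng(0)
    a, b = support_j
    B = (gamma(3.0 / lam_j) / gamma(1.0 / lam_j)) ** (lam_j / 2.0)
    s = ggd_sample(mu_j, sigma_j, lam_j, M, rng)                 # s_mj ~ T(. | current parameters)
    H = ((s >= a) & (s <= b)).astype(float)                      # H(s_mj | Omega_j)
    G_j = np.sum((-1.0 + lam_j * B * np.abs(s - mu_j) ** lam_j / sigma_j ** lam_j) * H) / np.sum(H)
    num = lam_j * B * np.sum(z_j * np.abs(x - mu_j) ** lam_j)
    return (num / np.sum(z_j * (1.0 + G_j))) ** (1.0 / lam_j)    # Eq. (25)

def fit_bggmm(x, K, support, max_iter=100, tol=1e-6, seed=0):
    """Minimal BGGMM fitting loop (Steps 1-4 of Section 3.5), with lambda_j held fixed."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                                     # Step 1: initialization
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))                # crude stand-in for k-means centers
    sigma = np.full(K, x.std())
    lam = np.full(K, 2.0)                                        # initial lambda_j = 2
    prev = np.inf
    for _ in range(max_iter):
        z = responsibilities(x, pi, mu, sigma, lam, support, rng=rng)  # Step 2: Eq. (10)
        for j in range(K):                                       # Step 3: parameter updates
            mu[j] = update_mean(x, z[:, j], mu[j], sigma[j], lam[j], support[j], rng=rng)      # Eq. (20)
            sigma[j] = update_sigma(x, z[:, j], mu[j], sigma[j], lam[j], support[j], rng=rng)  # Eq. (25)
            # Newton-Raphson update of lam[j], Eq. (26), omitted in this sketch.
        pi = z.mean(axis=0)                                      # Eq. (33)
        dens = np.zeros_like(x, dtype=float)                     # Step 4: error function (9)
        for j in range(K):
            dens += pi[j] * bounded_ggd_pdf(x, mu[j], sigma[j], lam[j], support[j], rng=rng)
        err = -np.sum(np.log(np.maximum(dens, 1e-300)))
        if abs(prev - err) < tol:
            break
        prev = err
    return pi, mu, sigma, lam
```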
4. Experiments

We demonstrate the performance of the proposed method in various experiments. The performance of the BGGMM is compared with that of the GMM [1], SMM [6], GGD [9], BGMM [21,22], and GGMM [11]. These methods are initialized with the k-means algorithm, as is the proposed method. To measure the fitting accuracy of each model, we use the goodness-of-fit statistic $X^2$, calculated as follows [11]:

$$X^2 = \sum_{x}\frac{[O(x) - E(x)]^2}{E(x)} \qquad (34)$$

where $O(x)$ and $E(x)$ represent the empirical and expected frequencies, respectively, for the observed data. Note that each compared method was run 10 times and the average of the statistic $X^2$ is reported.
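As a concrete illustration, the statistic in (34) can be computed from binned data as in the following sketch (our own helper, assuming NumPy; the bin count and the fitted density passed in are placeholders):

```python
import numpy as np

def goodness_of_fit(data, fitted_pdf, bins=100):
    """Chi-square-style statistic of Eq. (34) between observed and expected frequencies."""
    observed, edges = np.histogram(data, bins=bins)         # O(x): empirical frequencies
    centers = 0.5 * (edges[:-1] + edges[1:])
    widths = np.diff(edges)
    expected = fitted_pdf(centers) * widths * data.size     # E(x): expected frequencies
    mask = expected > 0                                     # skip empty bins to avoid division by zero
    return np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask])
```

Here fitted_pdf would be the estimated mixture density f(x|Θ) of (1); since the absolute value of X² depends on the binning, all compared methods should be evaluated on the same bins.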
We begin with one experiment on simulated data in Fig. 2 to demonstrate the robustness of the proposed distribution. As shown, the observed data lie in the interval (−0.5, 0.5). In this experiment, the number of components for all compared methods is set to K = 2. In Fig. 2(b), the GMM performs very poorly, with X² = 17,109.06. The SMM and GGMM slightly improve the result, with X² = 3102.11 and X² = 5203.80, respectively; however, the estimated histogram remains poor. Compared to the GMM, SMM, and GGMM, the BGGMM is the most robust and has the lowest X² = 74.14.

In the next experiment, shown in Fig. 2(c), we plot the statistic X² as a function of the number of components K for the GMM, SMM, GGMM, and BGGMM. We note that the SMM and GGMM require fewer components than the GMM to reach a high level of histogram-fitting accuracy, and that the GGMM fits the observed data better than the SMM as the number of components increases. Comparing these approaches, the BGGMM obtains the best estimate of the histogram. This demonstrates that, to fit the data, the proposed approach uses less complex models than the GMM, SMM, and GGMM. This aspect is of prominent importance for image coding and compression applications [13,14].

In order to further test the behavior of the proposed algorithm, we show the goodness-of-fit statistic as a function of the number of samples from which the histogram is derived. In Fig. 3(c), we show the values for the GMM, SMM, GGMM, and BGGMM for different numbers of samples (N = 676, 841, 1024, 1369, 1849, 2704, 4096, 7396, 16,384, and 65,536). In this experiment, the number of components for all compared methods is set to K = 2. Fig. 3(a) and (b) shows the performance of the compared methods for N = 676 and N = 1024. As we can see, the histogram appears more jagged when the number of samples is reduced; in comparison, the goodness-of-fit obtained by our method remains very accurate.

As mentioned above, the underlying model of the proposed method is a generalization of the GMM, LMM, GGMM, BGMM, and BLMM models. In the next experiment, shown in Fig. 4, we explain in detail why the performance of the proposed method is better than that of these methods. Two examples with the histogram of the observed data are presented in the first column of Fig. 4, where the observed data lie in the intervals (0.3, 0.337) and (0.34, 0.46), respectively. The second column of Fig. 4 shows the results of the GMM, LMM, GGMM, BGMM, BLMM, and our method, respectively. As shown, the methods without bounded support regions (GMM, LMM, and GGMM) perform very poorly. The BGMM slightly improves the result; however, the tail of the Gaussian distribution is shorter than required, and the BGMM is not flexible enough to fit the shape of the data. Comparing these approaches, we find that the BGGMM is the most robust.
Fig. 2. The estimated histogram. (a) The histogram of the observed data, (b) the estimated histogram of GMM (X² = 17,109.06), SMM (X² = 3102.11), GGMM (X² = 5203.80), and BGGMM (X² = 74.14), and (c) the statistic values X² for GMM, SMM, GGMM, and BGGMM.
Fig. 3. The estimated histogram with different numbers of samples. (a) N = 676 (GMM: X² = 119.39, SMM: X² = 47.71, GGMM: X² = 54.82, BGGMM: X² = 4.93), (b) N = 1024 (GMM: X² = 359.24, SMM: X² = 49.50, GGMM: X² = 95.39, BGGMM: X² = 7.66), and (c) the statistic values X² for GMM, SMM, GGMM, and BGGMM (N = 676, 841, 1024, 1369, 1849, 2704, 4096, 7396, 16,384, and 65,536).
Fig. 4. Comparison of the distributions for GMM, LMM, GGMM, BGMM, BLMM, and BGGMM. (1st row) The first observed data, in the interval (0.3, 0.337); (2nd row) the second observed data, in the interval (0.34, 0.46).
The approximation of wavelet coefficients is an important problem in computer vision, as it plays a major role in a wide range of applications. In the next experiment, shown in Fig. 5, we measure the accuracy of the proposed model for wavelet histogram fitting. In Fig. 5(a), we show one texture image (D90) from the Brodatz dataset [25]. This image is decomposed into three high-pass subbands (CH, CV, CD) and one low-pass subband (CA); in this paper, the Daubechies filter bank (db4) is used. In Fig. 5(b), we show the wavelet coefficients of the high-pass subband (CH). In Fig. 5(c)–(f), we present the results obtained with the GMM, SMM, GGD, and BGGMM, respectively. As shown, the tail of the Gaussian distribution in Fig. 5(c) is shorter than required. In Fig. 5(d), the SMM provides a more powerful and flexible fit than the GMM. Compared to the Student's-t distribution in Fig. 5(d), the GGD in Fig. 5(e) has the flexibility required to fit the shape of the data better than the GMM and SMM. As evident from the results, the proposed method outperforms the other methods, with the lowest X² = 82.41.

In order to give some implementation details about the log-likelihood function, an image (D73) from Brodatz, shown in Fig. 6(a), is used. In Fig. 6(b)–(e), we show the corresponding fits to the wavelet coefficients of the high-pass subband (CH).
Fig. 5. Approximation of the wavelet coefficients (K = 1). (a) The original image (D90 from the Brodatz dataset), (b) wavelet coefficients of the high-pass subband (CH), (c) GMM (X² = 1782.9), (d) SMM (X² = 164.17), (e) GGD (X² = 141.09), and (f) BGGMM (X² = 82.41).
Fig. 6. Approximation of the wavelet coefficients (K = 2). (a) The original image (D73 from the Brodatz dataset: high-pass subband CH), (b) GMM (X² = 95.80), (c) SMM (X² = 25.89), (d) GGMM (X² = 12.95), (e) BGGMM (X² = 9.38), and (f) comparison of the log-likelihood function.
In Fig. 6(f), we plot the log-likelihood function versus the number of iterations. From that plot we see that our method reaches higher maximum values, which implies better performance.
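For reference, the wavelet coefficients used in these experiments can be extracted with a standard wavelet toolbox. The sketch below assumes the PyWavelets package and returns the high-pass subbands (CH, CV, CD) of a three-level db4 decomposition; loading the Brodatz image itself is left as a placeholder.

```python
import pywt  # PyWavelets

def highpass_subbands(image, wavelet="db4", levels=3):
    """Return the high-pass subbands (cH, cV, cD) of each decomposition level."""
    coeffs = pywt.wavedec2(image, wavelet=wavelet, level=levels)
    # coeffs = [cA_L, (cH_L, cV_L, cD_L), ..., (cH_1, cV_1, cD_1)], coarsest detail first
    subbands = {}
    for k, detail in enumerate(coeffs[1:]):
        level = levels - k
        subbands[level] = {"cH": detail[0], "cV": detail[1], "cD": detail[2]}
    return subbands

# Example usage (image loading is a placeholder):
# image = ...  # a Brodatz texture loaded as a 2-D array
# cH1 = highpass_subbands(image)[1]["cH"].ravel()  # level-1 horizontal subband coefficients
```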
In the next experiment, shown in Fig. 7, we apply the methods with bounded support regions (BGMM, BLMM, and BGGMM) to real data.
Fig. 7. Comparison of the approximation of the wavelet coefficients for BGMM, BLMM, and BGGMM (K = 1). (a) D20 from Brodatz, cH1; (b) BGMM (X² = 80,255.09), BLMM (X² = 13,009.59), BGGMM (X² = 3275.31); (c) D25 from Brodatz, cH1; and (d) BGMM (X² = 59,528.23), BLMM (X² = 26,379.13), BGGMM (X² = 18,581.84).
Fig. 8. Approximation of the wavelet coefficients in Table 1. (1st row) The original image and (2nd row) BGGMM.
In this experiment, the wavelet coefficients (db4, CH, level 1) of two images (D20 and D25) from Brodatz are shown in Fig. 7(a) and (c). In Fig. 7(b) and (d), we show the results of the BGMM, BLMM, and BGGMM. As we can see, the accuracy of the BGMM is quite poor. The BLMM slightly improves the result; however, in this real-data experiment, the tail of the Laplacian distribution is shorter than required.
In this experiment, the BGGMM is flexible enough to fit the shape of these real data.

In the next experiment, a set of 20 texture images (D81 to D100) of size 640 × 640 from Brodatz [25] is used in our tests. We divide each image into 16 non-overlapping 160 × 160 samples. For each sample, a three-level wavelet decomposition with the Daubechies filter bank (db4) is applied, so every sample is decomposed into nine high-pass subbands and one low-pass subband. We focus only on the high-pass subbands; thus, for each wavelet level, we create a database of 960 testing images. Fig. 8 shows some of the testing images and the corresponding fits obtained with the proposed method. The average statistic values X² obtained by the GMM, SMM, GGD, GGMM, BGMM, and BGGMM algorithms are shown in Table 1. It is worth mentioning that the GGD is the special case of the GGMM with K = 1. As shown, the proposed method is more accurate than the other methods.

In the next experiment, we show the density estimates for the lottery data in Fig. 9. These real data come from the 1970 US draft lottery and are available on the StatLib website [26,27]. The data contain 365 observations, one for each day of the year; the second element of each observation is the priority score assigned to that day. All compared methods use the same initialization with the common k-means algorithm and K = 2. As shown in Fig. 9(b), our proposed method achieves the best accuracy.
Table 1
Comparison in the 960 testing images: the average statistic values X².

K       Level     GMM        SMM       GGMM      BGMM      BGGMM
K = 1   Level 1   2880.81    266.55    232.49    2872.82   158.09
        Level 2   2652.44    282.72    499.49    2650.21   167.09
        Level 3   3265.27    335.82    588.38    3247.86   177.76
K = 2   Level 1    106.15     53.02     46.71     103.32    34.50
        Level 2    107.78     68.10     43.40      93.21    35.16
        Level 3    155.65     85.27     53.36     132.49    39.39
Fig. 9. Real data from the draft lottery. (a) The original data and (b) the estimation of the observed data.
Fig. 10. Real data of the population of the USA. (a) The original data and (b) the estimation of the observed data.
The goodness-of-fit statistic values of the GMM, LMM, SMM, GGMM, and the proposed method are 1406.37, 9612.48, 1463.58, 1097.56, and 77.06, respectively.

In Fig. 10, we show one example of the two-boundary problem. This real dataset records the population of the USA from 1890 to 2011; it is updated quarterly and is available for download from the author of [28]. Fig. 10(b) shows the results of the GMM, LMM, SMM, GGMM, BGMM, and our method, each with K = 1. As we can see from Fig. 10(a), the observed data always fall within a bounded support region, and the intensity distribution is neither exactly Gaussian nor symmetric. For these reasons, the performance of the GMM, LMM, SMM, and GGMM is poor. The BGMM yields a better result than these four methods; however, a closer look at the estimated data of the BGMM (from 1990 to 2011) clearly shows that it cannot recover the ground-truth data. The proposed method is more accurate than the above-mentioned methods.

5. Conclusions

A finite mixture model based on modeling the probability density function with the bounded generalized Gaussian distribution has been proposed in this paper. The advantage of the proposed distribution is its flexibility to fit different shapes of observed data, such as non-Gaussian and bounded support data. To estimate the model parameters, we propose an alternative approach that minimizes an upper bound on the negative log-likelihood of the data. Experimental evaluation of our algorithm has been conducted using synthetic and real data, demonstrating the excellent performance of the proposed model.

One limitation is that the proposed method performs only local optimization: it depends on the initial starting point, and a bad initialization can lead to bad results. One possible solution is to apply global optimization to estimate the model parameters. Another limitation is that our method applies only to univariate data; one possible extension of this work is a bounded multivariate generalized Gaussian mixture model for analyzing correlated data. A further extension, for setting or finding K in practice, is to adopt variational Bayesian (VB) learning, the Bayesian information criterion (BIC), or the minimum description length (MDL) to optimize the parameter K automatically.

Conflict of interest

None declared.
References

[1] G. McLachlan, D. Peel, Finite Mixture Models, Wiley, New York, 2000.
[2] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, Berlin, Germany, 2006.
[3] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (1) (2000) 4–37.
[4] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. 39 (1) (1977) 1–38.
[5] M.S. Yang, C.Y. Lai, C.Y. Lin, A robust EM clustering algorithm for Gaussian mixture models, Pattern Recognit. 45 (11) (2012) 3950–3961.
[6] D. Peel, G. McLachlan, Robust mixture modeling using the t distribution, Stat. Comput. 10 (2000) 339–348.
[7] C. Liu, D. Rubin, ML estimation of the t distribution using EM and its extensions ECM and ECME, Stat. Sin. 5 (1) (1995) 19–39.
[8] W. Xin, Y. Zhen, The infinite Student's t-factor mixture analyzer for robust clustering and classification, Pattern Recognit. 45 (12) (2012) 4346–4357.
[9] M.N. Do, M. Vetterli, Wavelet-based texture retrieval using generalized Gaussian density and Kullback–Leibler distance, IEEE Trans. Image Process. 11 (2) (2002) 146–158.
[10] S.K. Choy, C.S. Tong, Statistical wavelet subband characterization based on generalized gamma density and its application in texture retrieval, IEEE Trans. Image Process. 19 (2) (2010) 281–289.
[11] M. Allili, Wavelet modeling using finite mixtures of generalized Gaussian distributions: application to texture discrimination and retrieval, IEEE Trans. Image Process. 21 (4) (2012) 1452–1464.
[12] G. Liu, J. Wu, S. Zhou, Probabilistic classifiers with a generalized Gaussian scale mixture prior, Pattern Recognit. 46 (1) (2013) 332–345.
[13] S.G. Chang, B. Yu, M. Vetterli, Adaptive wavelet thresholding for image denoising and compression, IEEE Trans. Image Process. 9 (9) (2000) 1532–1546.
[14] S. Kasaei, M. Deriche, B. Boashash, A novel fingerprint image compression technique using wavelets packets and pyramid lattice vector quantization, IEEE Trans. Image Process. 11 (12) (2002) 1365–1378.
[15] P. Moulin, J. Liu, Analysis of multiresolution image denoising schemes using generalized Gaussian and complexity priors, IEEE Trans. Inf. Theory 45 (3) (1999) 909–919.
[16] Y. Bazi, L. Bruzzone, F. Melgani, An unsupervised approach based on the generalized Gaussian model to automatic change detection in multitemporal SAR images, IEEE Trans. Geosci. Remote Sens. 43 (4) (2005) 874–887.
[17] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1995.
[18] M.N. Thanh, Q.M.J. Wu, S. Ahuja, An extension of the standard mixture model for image segmentation, IEEE Trans. Neural Netw. 21 (8) (2010) 1326–1338.
[19] W. Rudin, Real and Complex Analysis, McGraw-Hill, New York, 1987.
[20] A.A. Farag, A.S. ElBaz, G. Gimelfarb, Precise segmentation of multimodal images, IEEE Trans. Image Process. 15 (4) (2006) 952–968.
[21] P. Hedelin, J. Skoglund, Vector quantization based on Gaussian mixture models, IEEE Trans. Speech Audio Process. 8 (4) (2000) 385–401.
[22] J. Lindblom, J. Samuelsson, Bounded support Gaussian mixture modeling of speech spectra, IEEE Trans. Speech Audio Process. 11 (1) (2003) 88–99.
[23] P.J. Huber, Robust Statistics, Wiley, New York, 1981, pp. 43–44.
[24] J. Ashburner, K.J. Friston, Unified segmentation, NeuroImage 26 (3) (2005) 839–851.
[25] Texture database: <http://www.ux.uis.no/~tranden/brodatz.html>.
[26] <http://lib.stat.cmu.edu/DASL/Stories/DraftLottery.html>.
[27] H. Müller, U. Stadtmüller, Multivariate boundary kernels and a continuous least squares principle, J. R. Stat. Soc. Ser. B 61 (2) (1999) 439–458.
[28] R.J. Shiller, Irrational Exuberance, Princeton University Press, Princeton, New Jersey, 2005.
Appendix A. The special cases of the proposed method

See Table A1.

Table A1
The compared models are special cases of the BGGMM.

GMM    BGGMM with λj = 2 and H(xi|Ωj) = 1
LMM    BGGMM with λj = 1 and H(xi|Ωj) = 1
GGD    BGGMM with K = 1 and H(xi|Ωj) = 1
GGMM   BGGMM with H(xi|Ωj) = 1
BGMM   BGGMM with λj = 2 and H(xi|Ωj) = H(xi|Ωk), ∀ j, k = 1, 2, …, K
BLMM   BGGMM with λj = 1 and H(xi|Ωj) = H(xi|Ωk), ∀ j, k = 1, 2, …, K
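As a quick numerical check of the first two rows of Table A1, the snippet below (our own sketch, assuming NumPy/SciPy and the ggd_pdf helper from the Section 2 snippet) verifies that T(x|μ, σ, λ) coincides with the Gaussian density for λ = 2 and with a Laplacian of the same standard deviation for λ = 1:

```python
import numpy as np
from scipy.stats import norm, laplace

x = np.linspace(-3.0, 3.0, 7)
mu, sigma = 0.0, 1.2

# lambda = 2: the generalized Gaussian equals the Gaussian with the same mean and std.
assert np.allclose(ggd_pdf(x, mu, sigma, 2.0), norm.pdf(x, loc=mu, scale=sigma))

# lambda = 1: it equals a Laplace density with scale sigma / sqrt(2), i.e. the same std sigma.
assert np.allclose(ggd_pdf(x, mu, sigma, 1.0), laplace.pdf(x, loc=mu, scale=sigma / np.sqrt(2)))
```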
Thanh Minh Nguyen received his B.E. degree with honors in Electrical and Electronic Engineering from the HCM City University of Technology, Vietnam in 2004, the M.S. degree from the Department of Electrical Engineering at the Da-Yeh University, Taiwan in 2006, and the Ph.D. degree from the Department of Electrical Engineering at University of Windsor, Canada in 2011. He is currently a Post-Doctoral Fellow with the Department of Electrical and Computer Engineering at the University of Windsor, Canada. His research interests are computer vision and pattern recognition. He has consistently excelled in all areas of his research work. The best example of his ingenuity
is shown through the project on Autonomous Racing Challenge, which took top honors at University of Waterloo in 2008 and 2009. He is the author of several papers published in the IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Medical Imaging, IEEE Transactions on Fuzzy Systems, IEEE Transactions on Image Processing, IEEE Transactions on Systems, Man, and Cybernetics-Part B, and IEEE Transactions on Information Technology in Biomedicine.
Q.M. Jonathan Wu (M'92-SM'09): received the Ph.D. degree in electrical engineering from the University of Wales, Swansea, U.K., in 1990. He was with the National Research Council of Canada for ten years from 1995, where he became a Senior Research Officer and a Group Leader. He is currently a Professor with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada. He has published more than 250 peer-reviewed papers in computer vision, image processing, intelligent systems, robotics, and integrated microsystems. His current research interests include 3-D computer vision, active video object tracking and extraction, interactive multimedia, sensor analysis and fusion, and visual sensor networks. Dr. Wu holds the Tier 1 Canada Research Chair in Automotive Sensors and Information Systems. He is an Associate Editor for the IEEE Transactions on Systems, Man, and Cybernetics: Systems, the IEEE Transactions on Neural Networks and Learning Systems, and the International Journal of Robotics and Automation. He has served on technical program committees and international advisory committees for many prestigious conferences.
Hui Zhang received the B.S. degree in radio engineering, the M.S. degree in biomedical engineering, and the Ph.D. degree, all from Southeast University, Nanjing, China, in 2003, 2006, and 2010, respectively. From 2011 to 2013, he was a Research Associate with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, Canada. His research interest is mainly focused on pattern recognition, image watermarking, image segmentation and image registration.