Signal Processing 90 (2010) 774–783
Modelling with mixture of symmetric stable distributions using Gibbs sampling

Diego Salas-Gonzalez a,*, Ercan E. Kuruoglu b, Diego P. Ruiz c
a Department of Signal Processing, Networking and Communications, University of Granada, 18071 Granada, Spain
b ISTI-CNR, Via G. Moruzzi 1, 56124 Pisa, Italy
c Department of Applied Physics, University of Granada, 18071 Granada, Spain
Article history: Received 10 December 2008; received in revised form 4 May 2009; accepted 6 July 2009; available online 14 July 2009.

Abstract
The stable distribution is a very useful tool for modelling impulsive data. In this work, a fully Bayesian mixture of symmetric stable distributions model is presented. Despite the non-existence of a closed form for α-stable densities, the use of the product property makes it possible to infer on the parameters using straightforward Gibbs sampling. The model is compared to the mixture of Gaussians model. The proposed methodology is shown to be more robust to outliers than the mixture of Gaussians, and it is therefore suitable for modelling mixtures of impulsive data. Moreover, as the Gaussian is a particular case of the α-stable distribution, the proposed model is a generalization of the mixture of Gaussians. The mixture of symmetric α-stable distributions is tested intensively on both simulated and real data. © 2009 Elsevier B.V. All rights reserved.
Keywords: Symmetric alpha-stable distribution; Bayesian mixture model; Reversible jump Markov chain Monte Carlo
* Corresponding author.
E-mail addresses: [email protected] (D. Salas-Gonzalez), [email protected] (E.E. Kuruoglu), [email protected] (D.P. Ruiz).
doi:10.1016/j.sigpro.2009.07.003

1. Introduction

Gaussian mixture modelling is nowadays a powerful approach that allows us to model data sampled from a population composed of distinct subpopulations. Mixture models have been successfully applied to model, in a parametric manner, data with non-standard distributions (see [1] for a review). The mixture of Gaussian distributions is the most widely studied mixture model, because the Gaussian distribution can be theoretically justified by the central limit theorem and the stability property and, moreover, it is analytically straightforward to work with.

The Gaussian distribution is a particular case of a more general family of distributions called α-stable laws, which allows us to model data more impulsive than the Gaussian distribution. The α-stable distribution has many desirable properties which make it an alternative for modelling non-Gaussian signals in many fields, such as signal processing, electrical engineering, computer science, economics or physics [2]. Nevertheless, this distribution does not have a general analytical expression for its probability density function.

In this work, we propose a mixture of symmetric α-stable distributions which is a generalization of the Gaussian mixture model. Furthermore, the symmetric α-stable distribution admits a scale mixture of normals representation via the product property [3]. This allows us to write a symmetric α-stable distribution as a Gaussian, conditionally on a positive α-stable random variable. Therefore, the non-existence of a closed form for the probability density function is surmounted and, as the symmetric α-stable distribution is written as a Gaussian, straightforward Gibbs sampling can be used to estimate the unknown parameters.

Mixtures of α-stable distributions were also studied previously in an unpublished technical report [4], in which it is claimed that the approach needs further evaluation, especially in the case of symmetric stable mixtures. An estimation method based on an alternative representation of the mixture was discussed in [5,6], where the α-stable mixture
model is compared extensively with the work in [4]. The methodology used in [6] was shown to be computationally more efficient, to work better for mixtures with closer modes, to be capable of also estimating the number of components in the mixture, and to be easier to implement. The mixture of symmetric α-stable distributions proposed here has several advantages with respect to [6,4]: it is considerably easier to implement, the analytical expressions for the posterior distributions are very similar to those obtained for the Gaussian mixture model, and new estimates can easily be sampled from the posterior distribution using Gibbs sampling, whereas in [6,4] the full pdf needs to be evaluated by numerical integration. We believe the mixture of α-stable distributions has many potential applications in different fields. For instance, based on an α-stable mixture model and the scale mixture of normals property, in a recent work we derived a novel statistic that can be used to identify differential expression in microarray experiments [7].

This paper is organized as follows. In Section 2, the α-stable distribution and the main properties used throughout the paper are presented. In Section 3, the Bayesian symmetric α-stable mixture model is introduced. Section 4 provides the Markov chain Monte Carlo algorithm for inference in the Bayesian symmetric α-stable mixture model. Indications on how to extend the proposed univariate symmetric stable mixture to the multivariate case are included in Section 5. Simulation results are provided in Section 6 and, finally, conclusions are drawn in Section 7.
2. α-stable distribution

Due to the non-existence of an analytical expression for the α-stable probability density function, this family is usually expressed by means of the characteristic function of an α-stable distribution f_{α,β}(y|γ, μ), which is given by

φ(ω) = exp{ −|γω|^α [1 − i sign(ω) β tan(πα/2)] + iμω }   (α ≠ 1)
φ(ω) = exp{ −|γω| [1 + i sign(ω) (2/π) β log|ω|] + iμω }   (α = 1)

where the parameters of the stable distribution are: α ∈ (0, 2], the characteristic exponent, which sets the level of impulsiveness; β ∈ [−1, +1], the skewness parameter (β = 0 for symmetric distributions and β = ±1 for the positive/negative stable families, respectively); γ > 0, the dispersion, a scale parameter; and μ, the location parameter.

2.1. Product property

The main drawback of the α-stable distribution is the non-existence of a closed expression for the probability density function. Nevertheless, this problem can be surmounted by taking into account that the following property holds for symmetric α-stable distributions [8]: let X and Y > 0 be independent random variables with X ~ f_{α1,0}(σ, 0) and Y ~ f_{α2,1}((cos(πα2/2))^{1/α2}, 0). Then XY^{1/α1} is stable, with XY^{1/α1} ~ f_{α1α2,0}(σ, 0). Here f_{α,β}(γ, μ) denotes an α-stable distribution with parameters α, β, γ, μ. We are interested in the case in which α1 = 2 (the Gaussian case) and α2 < 1. For these parameter values, a scale mixture of normals (SMiN) representation can be used for the symmetric stable law as follows (see [9–11] for more details). If y_i is an i.i.d. sample from a symmetric α-stable distribution with location parameter μ and scale parameter γ,

y_i ~ f_{α,0}(γ, μ)   (1)

the product property can be used to obtain the following equivalent representation:

y_i ~ N(μ, λ_i γ²)   (2)

λ_i ~ f_{α/2,1}( 2[cos(πα/4)]^{2/α}, 0 )   (3)

where N(μ, λ_i γ²) is the normal distribution with mean μ and variance λ_i γ². This equivalent model is very useful for Bayesian inference since, conditionally on the auxiliary positive stable random variable λ_i, the symmetric α-stable variable y_i is Gaussian.

3. Symmetric α-stable mixture model

The symmetric α-stable mixture model is

p_Y(y) = Σ_{j=1}^{k} w_j f_{α_j,0}(y | γ_j, μ_j)   (4)

0 < w_j < 1   and   Σ_{j=1}^{k} w_j = 1   (5)
For this model, a mixture of k distributions is considered, where j is the index of each subpopulation and w_j is the weight of distribution j; γ_j, μ_j and α_j are the dispersion, location parameter and characteristic exponent of the symmetric α-stable component j, and y is the vector of observations. We want to infer the 4k + 1 unknown variables {w_j, γ_j, μ_j, α_j, k} using the available data y. In order to accomplish this goal, it is useful to introduce a latent variable z_i ∈ {1, 2, ..., k}, called the allocation variable, which indicates that a given observation y_i belongs to the component z_i with parameter values {w_{z_i}, γ_{z_i}, μ_{z_i}, α_{z_i}}. Namely, z_i = j denotes that the observation y_i has been drawn from subpopulation j. Therefore, in that case,

y_i | z_i = j ~ f_{α_{z_i},0}(y | γ_{z_i}, μ_{z_i}) = f_{α_j,0}(y | γ_j, μ_j)   (6)
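Before moving to inference, it is worth seeing the product property in action. The sketch below (our illustration, not part of the original paper; NumPy is assumed and the function names are ours) draws samples from one symmetric α-stable component f_{α,0}(γ, μ) by combining Eqs. (2)–(3): a positive stable draw, generated with Kanter's formula (the β = 1, α < 1 case of the Chambers–Mallows–Stuck method [13]), scales the variance of a Gaussian draw.

```python
import numpy as np

def positive_stable(a, size, rng):
    """Kanter's generator: totally skewed positive stable draw with
    Laplace transform E[exp(-s*X)] = exp(-s**a), for 0 < a < 1."""
    u = rng.uniform(0.0, np.pi, size)
    w = rng.exponential(1.0, size)
    return ((np.sin(a * u) / np.sin(u) ** (1.0 / a))
            * (np.sin((1.0 - a) * u) / w) ** ((1.0 - a) / a))

def symmetric_stable(alpha, gamma, mu, size, rng=None):
    """Symmetric alpha-stable samples via the product property,
    Eqs. (2)-(3): y = mu + sqrt(lambda) * gamma * N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    if alpha == 2.0:
        lam = np.full(size, 2.0)  # Gaussian limit: lambda degenerates to 2
    else:
        # The factor 2 gives lambda the dispersion 2*cos(pi*alpha/4)**(2/alpha)
        # required by Eq. (3); the Kanter draw itself has dispersion
        # cos(pi*alpha/4)**(2/alpha).
        lam = 2.0 * positive_stable(alpha / 2.0, size, rng)
    return mu + np.sqrt(lam) * gamma * rng.standard_normal(size)
```

For α = 2 the auxiliary variable degenerates to λ = 2, recovering the Gaussian with variance 2γ², consistent with γ = σ/√2 used in Simulation 3 below.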
The Bayesian paradigm provides a suitable methodology to infer unknown quantities. Let B = {w_j, γ_j, μ_j, α_j, k} be the unknown quantities and y the known data. Bayes' theorem allows us to build a hierarchical model in which the unknown quantities are estimated using the prior information and the available data via Bayes' rule:

p(B | y) = p(y | B) p(B) / p(y)   (7)

where p(B|y) denotes the posterior probability of B given the data y, p(y|B) is the likelihood of y given the parameters B, and p(B) is the prior distribution of the unknown variables; p(y) is only a normalizing constant.
Thus, Bayes' theorem can be rewritten as

p(B | y) ∝ p(y | B) p(B)   (8)

Replacing the unknown quantities B and the latent variable z_i in the last expression, we get

p(w_j, γ_j, μ_j, α_j, k, z | y) ∝ p(y | w_j, γ_j, μ_j, α_j, k, z) p(w_j, γ_j, μ_j, α_j, k, z)   (9)

As is usual in Bayesian methods, in order to introduce more flexibility into the model, we allow the prior distributions to depend on hyperpriors with corresponding hyperparameters. For the sake of clarity in the notation, we write the parameters of the symmetric α-stable as θ = {α, γ, μ} and the corresponding hyperparameters as η = {a₀, b₀, κ, ξ, a}:

p(k, w, z, θ, η, y) = p(y | k, w, z, θ, η) p(θ | k, η) p(z | w, k) p(w | k, ζ) p(k | k₀) p(k₀) p(ζ) p(η)   (10)

Fig. 1 shows the directed acyclic graph for the proposed Bayesian symmetric α-stable mixture model. The symmetric α-stable mixture model studied in this paper using a fully Bayesian methodology is very similar to the Bayesian Gaussian mixture model presented in [15]. The symmetric α-stable can be considered a generalization of the Gaussian distribution; in addition, the scale mixture of normals property of the symmetric stable distribution makes it possible to write the likelihood of the mixture model as a product of Gaussians, conditionally on the positive α-stable random variable λ. This enables the use of conjugate priors for the location, dispersion and weight parameters, and therefore an analytical expression for the posterior distribution of these parameters can be calculated. Thus, Gibbs sampling, one of the most straightforward forms of Markov chain Monte Carlo, can be used to sample from the posteriors. The main difference between the two models (Gaussian mixture and symmetric stable mixture) is the inclusion of the positive α-stable random variable λ and, as will be seen in the simulation results, the advantages of the symmetric α-stable mixture model over the mixture of Gaussians are notable.

Fig. 1. Directed acyclic graph (DAG) for the symmetric α-stable mixture model. Circles denote unknown variables, rectangles represent fixed hyperparameters or the observation vector, and arrows denote the conditional dependence between variables.

3.1. Prior distributions
We choose conjugate priors for the location, dispersion and weights. In Bayesian probability theory, a class of prior probability distributions is said to be conjugate to a class of likelihood functions if the resulting posterior distributions are in the same family as the prior distribution. Exponential families of distributions always have conjugate priors. In our model, the likelihood for the symmetric α-stable mixture model is Gaussian and, therefore, the conjugate priors are the same as in the Gaussian mixture case. Using conjugate priors makes it possible to obtain a closed-form expression for the posterior; this avoids difficult numerical integration, because it is possible to sample from the posterior distribution using Gibbs sampling.

The conjugate prior for the location parameter is the Gaussian N(μ | ξ, κ⁻¹), with density function

p(μ_j | ξ, κ⁻¹) = (1/√(2πκ⁻¹)) exp{ −(μ_j − ξ)² / (2κ⁻¹) }   (11)

The inverse gamma distribution is chosen for the dispersion parameter γ_j:

p(γ_j | a₀, b₀) = IG( a₀ + N/2, (1/2) Σ_{i:z_i=j} (y_i − μ_j)²/λ_i + b₀ )   (12)

and the conjugate prior on w is the Dirichlet distribution,

w ~ D(ζ, ..., ζ)   (13)

For the α parameter, the prior distribution is chosen to be uniform on its support α ∈ (0, 2]:

p(α_j | a) = 1/a = 1/2   for 0 < α_j ≤ 2   (14)
4. Markov chain Monte Carlo implementation

For the Bayesian symmetric α-stable mixture model, the unknown parameters are estimated, at every iteration, using the following Markov chain Monte Carlo scheme (a schematic sweep is sketched below):

(1) Updating the weights w and the parameters μ and γ using Gibbs sampling.
(2) Updating α using Metropolis sampling.
(3) Updating the allocation variables z.
(4) Estimating the auxiliary variable λ.
(5) Reversible jump Markov chain Monte Carlo (split/combine move) to estimate the number of components k.

Gibbs sampling is a Markov chain Monte Carlo method in which a sequence of proposal distributions is defined in terms of the conditional distributions of the joint distribution. It is assumed that, whilst the joint distribution is too complex to draw samples from directly, its conditional distributions are tractable to work with. For the symmetric α-stable mixture model these one-dimensional conditional distributions are straightforward to sample from, as will be shown below.
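The following skeleton (ours; every helper name is an assumption, standing in for the update rules detailed in the subsections below, and steps (2) and (5) are left as comments) shows how one sweep of the sampler fits together:

```python
import numpy as np

def mcmc_sweep(y, state, rng):
    """One iteration of the scheme above. `state` holds arrays w, mu,
    gamma2, alpha, lam and z for the current number of components k."""
    k = len(state["w"])
    # (1) Gibbs updates for the weights, locations and dispersions
    n = np.bincount(state["z"], minlength=k)
    state["w"] = gibbs_update_weights(n, rng=rng)
    for j in range(k):
        sel = state["z"] == j
        state["mu"][j], state["gamma2"][j] = gibbs_update_component(
            y[sel], state["lam"][sel], state["gamma2"][j], rng=rng)
    # (2) Metropolis update of each alpha_j (Section 4.4) -- omitted here
    # (3) Allocation variables z (Section 4.5)
    state["z"] = sample_allocations(y, state["lam"], state["w"],
                                    state["mu"], state["gamma2"], rng=rng)
    # (4) Auxiliary variables lambda_i by rejection sampling (Section 4.6)
    for i, j in enumerate(state["z"]):
        state["lam"][i] = sample_lambda(y[i], state["mu"][j],
                                        state["gamma2"][j],
                                        state["alpha"][j], rng=rng)
    # (5) Reversible jump split/combine move for k (Section 4.7) -- omitted
    return state
```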
4.1. Updating the weights (w) using Gibbs sampling

The full conditional for w is also a Dirichlet distribution, with parameters ζ + n_j. Thus, at every iteration, a new estimate for the weights can be obtained by sampling from

w_j ~ D(ζ + n₁, ..., ζ + n_k)   (15)

where n_j is the number of samples assigned to component j (n_j = Σ_i δ(z_i − j)) and δ denotes the Dirac delta function.

4.2. Updating the location parameter μ using Gibbs sampling

We estimate the location parameter μ_j by sampling from the posterior distribution using Gibbs sampling:

μ_j ~ N( [(1/γ_j²) Σ_{i:z_i=j} y_i/λ_i + κξ] / [(1/γ_j²) Σ_{i:z_i=j} 1/λ_i + κ], 1 / [(1/γ_j²) Σ_{i:z_i=j} 1/λ_i + κ] )   (16)

4.3. Updating the dispersion γ using Gibbs sampling

The full conditional for γ_j² is an inverse gamma distribution:

γ_j² ~ IG( a₀ + n_j/2, (1/2) Σ_{i:z_i=j} (y_i − μ_j)²/λ_i + b₀ )   (17)

The full conditional of every unknown parameter is very similar to the full conditional obtained in the Bayesian Gaussian mixture model. Thus, analogously to the Gaussian case, it is straightforward to sample from the posterior using Gibbs sampling. The posterior distributions for the parameters μ_j, γ_j and w are compared in Table 1, where they are shown to differ from the full conditionals obtained in the Bayesian mixture of Gaussians only by the inclusion of the positive α-stable random variable λ.

4.4. Updating the characteristic exponent α using Metropolis–Hastings

For the parameter α, it is not possible to write the full conditional in closed form. Hence, this parameter is estimated using the Metropolis–Hastings algorithm. For a given component j, samples α_j are obtained following this scheme:

(1) At each iteration t, we sample a candidate point for α_j (denoted α_j^new) from a proposal distribution q(·):

α_j^new ~ q(α_j^new | α_j^(t))

(2) We set α_j^(t+1) = α_j^new, i.e. we accept the proposed value α_j^new, with probability min{1, A_{α_j}}; if the new value is not accepted, we set α_j^(t+1) = α_j^(t),

where

A_{α_j} = min{ 1, [Π_{i:z_i=j} p(y_i | k, w_{z_i}, α_j^new, γ_{z_i}, μ_{z_i})] p(k, w_{z_i}, α_j^new, γ_{z_i}, μ_{z_i}) q(α_j^(t) | α_j^new) / [Π_{i:z_i=j} p(y_i | k, w_{z_i}, α_j^(t), γ_{z_i}, μ_{z_i})] p(k, w_{z_i}, α_j^(t), γ_{z_i}, μ_{z_i}) q(α_j^new | α_j^(t)) }   (18)

It is possible to simplify Eq. (18) for this model: as the priors are independent,

p(k, w_{z_i}, α^new, γ_{z_i}, μ_{z_i}) / p(k, w_{z_i}, α^(t), γ_{z_i}, μ_{z_i}) = p(α^new) / p(α^(t))

Thus, using a symmetric proposal q(α^new | α^(t)) = q(α^(t) | α^new) and taking into account that the prior p(α) is chosen to be uniform on its support (see Eq. (14)), the acceptance ratio A_{α_j} simplifies to

A_{α_j} = min{ 1, Π_{i:z_i=j} p(y_i | k, w_{z_i}, α_j^new, γ_{z_i}, μ_{z_i}) / Π_{i:z_i=j} p(y_i | k, w_{z_i}, α_j^(t), γ_{z_i}, μ_{z_i}) }   (19)

4.5. Updating the allocation variables z

The allocation variable z_i is an index that indicates which subpopulation the datum y_i is most likely to belong to. This step is accomplished via the full conditional for the allocation variables, p(z_i = j | ...):

p(z_i = j | ...) ∝ p(y_i | k, w_j, α_j, γ_j, μ_j) p(z)

Thus, an observation y_i is considered to be drawn from the α-stable component j with parameters θ_j = {α_j, γ_j, μ_j} with probability given by

p(z_i = j | ...) = w_j p(y_i | k, w_j, α_j, γ_j, μ_j)   (20)

where p(y_i | k, w_j, α_j, γ_j, μ_j) is the density of the Gaussian distribution N(y_i | μ_j, λ_i γ_j²).
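A vectorized sketch of this allocation step (ours; conditional on λ_i, each component density is the Gaussian N(y_i | μ_j, λ_i γ_j²) of Eq. (20); inputs w, mu, gamma2 are length-k arrays):

```python
import numpy as np

def sample_allocations(y, lam, w, mu, gamma2, rng=None):
    """Draw z_i from p(z_i = j | ...) proportional to
    w_j * N(y_i | mu_j, lam_i * gamma_j^2), Eq. (20)."""
    rng = np.random.default_rng() if rng is None else rng
    var = lam[:, None] * gamma2[None, :]          # (N, k) variances
    logp = (np.log(w)[None, :]
            - 0.5 * np.log(2.0 * np.pi * var)
            - 0.5 * (y[:, None] - mu[None, :]) ** 2 / var)
    logp -= logp.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(logp)
    p /= p.sum(axis=1, keepdims=True)
    # One categorical draw per observation via the inverse-CDF trick
    u = rng.uniform(size=(len(y), 1))
    return (np.cumsum(p, axis=1) < u).sum(axis=1)
```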
Table 1. Comparison between the full conditional of every unknown parameter for the Gaussian and symmetric α-stable mixture models.

Parameter σ_j² (Gaussian) / γ_j² (stable):
  Gaussian mixture:           IG( a₀ + n_j/2, (1/2) Σ_{i:z_i=j} (y_i − μ_j)² + b₀ )
  Symmetric α-stable mixture: IG( a₀ + n_j/2, (1/2) Σ_{i:z_i=j} (y_i − μ_j)²/λ_i + b₀ )

Parameter μ_j:
  Gaussian mixture:           N( [(1/σ_j²) Σ_{i:z_i=j} y_i + κξ] / [(1/σ_j²) n_j + κ], 1 / [(1/σ_j²) n_j + κ] )
  Symmetric α-stable mixture: N( [(1/γ_j²) Σ_{i:z_i=j} y_i/λ_i + κξ] / [(1/γ_j²) Σ_{i:z_i=j} 1/λ_i + κ], 1 / [(1/γ_j²) Σ_{i:z_i=j} 1/λ_i + κ] )

Parameter w_j:
  Gaussian mixture:           D(ζ + n₁, ..., ζ + n_k)
  Symmetric α-stable mixture: D(ζ + n₁, ..., ζ + n_k)
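The full conditionals of Table 1 translate directly into code. The sketch below (ours; hyperparameter defaults are illustrative only) performs the Gibbs draws of Eqs. (15)–(17) for one component, sampling the inverse gamma as the reciprocal of a gamma draw:

```python
import numpy as np

def gibbs_update_component(y_j, lam_j, gamma2, xi=0.0, kappa=1.0,
                           a0=1.0, b0=1.0, rng=None):
    """Gibbs draws of (mu_j, gamma_j^2) from Eqs. (16)-(17), given the
    observations y_j allocated to component j and their lambda values."""
    rng = np.random.default_rng() if rng is None else rng
    # Eq. (16): Gaussian full conditional for the location mu_j
    prec = np.sum(1.0 / lam_j) / gamma2 + kappa
    mean = (np.sum(y_j / lam_j) / gamma2 + kappa * xi) / prec
    mu = rng.normal(mean, np.sqrt(1.0 / prec))
    # Eq. (17): inverse gamma full conditional for the dispersion gamma_j^2
    shape = a0 + 0.5 * len(y_j)
    rate = b0 + 0.5 * np.sum((y_j - mu) ** 2 / lam_j)
    gamma2_new = 1.0 / rng.gamma(shape, 1.0 / rate)
    return mu, gamma2_new

def gibbs_update_weights(n, zeta=1.0, rng=None):
    """Eq. (15): Dirichlet full conditional for the weights, where n is
    the vector of occupation counts n_j."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.dirichlet(zeta + np.asarray(n, dtype=float))
```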
4.6. Estimating the auxiliary variable λ_i

The product property allows us to write a symmetric α-stable distribution as a Gaussian conditionally on a positive stable random variable λ_i. Different approaches can be used to obtain, at every iteration, samples from the posterior distribution p(λ):

p(λ_i | y, μ, γ, z_i = j) ∝ N(y_i | μ_j, λ_i γ_j²) f_{α_j/2,1}( λ_i | 2[cos(πα_j/4)]^{2/α_j}, 0 )   (21)

In [9,12], a Metropolis–Hastings algorithm is proposed to draw samples from Eq. (21). The chosen proposal distribution is λ_i* ~ f_{α/2,1}(2[cos(πα/4)]^{2/α}, 0), so the acceptance ratio A_{λ_i} can be calculated as

A_{λ_i} = min( 1, p(y_i | μ_j, λ_i* γ_j²) / p(y_i | μ_j, λ_i γ_j²) )   (22)

where λ_i* is the proposed new value for the auxiliary variable; λ_i* can easily be drawn using the Chambers–Mallows–Stuck algorithm [13]. The main drawback of the Metropolis–Hastings algorithm is that the proposed new values are not accepted at every iteration. For this reason, instead of using the Metropolis–Hastings algorithm, we calculate the auxiliary variable using rejection sampling, which accepts a new value of λ at every iteration. Note that the likelihood is bounded by a function which depends on |y_i − μ_{z_i}|:

p(y_i | μ_j, λ_i γ²) ≤ (1/(√(2π) |y_i − μ_{z_i}|)) exp(−1/2)   (23)

Therefore, rejection sampling can be used to draw samples of λ_i:

1. Draw λ_i* from the positive stable distribution f_{α/2,1}( 2[cos(πα/4)]^{2/α}, 0 ).
2. Draw u_i from the uniform distribution U( 0, (1/(√(2π) |y_i − μ_{z_i}|)) exp(−1/2) ).
3. If u_i > p(y_i | μ_j, λ_i* γ²), go to step 1; otherwise accept λ_i*.
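A sketch of this rejection sampler (ours; it reuses the positive_stable helper from the sketch in Section 3, and the envelope is the bound of Eq. (23)):

```python
import numpy as np

def sample_lambda(y_i, mu_zi, gamma2_zi, alpha, rng=None):
    """Rejection sampling for the auxiliary variable lambda_i: propose from
    the positive stable prior and accept against the Gaussian likelihood,
    whose maximum over lambda is bounded as in Eq. (23)."""
    rng = np.random.default_rng() if rng is None else rng
    bound = np.exp(-0.5) / (np.sqrt(2.0 * np.pi) * abs(y_i - mu_zi))
    while True:
        # Step 1: proposal from f_{alpha/2,1}(2 cos(pi*alpha/4)^{2/alpha}, 0)
        lam = 2.0 * positive_stable(alpha / 2.0, 1, rng)[0]
        # Step 2: uniform draw under the envelope
        u = rng.uniform(0.0, bound)
        # Step 3: accept when u falls under the likelihood N(y_i | mu, lam*gamma^2)
        var = lam * gamma2_zi
        lik = np.exp(-0.5 * (y_i - mu_zi) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        if u <= lik:
            return lam
```

Since the proposal is exactly the prior factor of Eq. (21), the accepted draws follow the full conditional p(λ_i | y, μ, γ, z_i = j).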
4.7. Updating the number of components k using RJMCMC

One of the main advantages of our algorithm is that it is possible to estimate the number of components in the mixture using variable-dimension Markov chain Monte Carlo methods. This algorithm jumps between parameter subspaces of different dimension and accepts or rejects the proposed values of the parameters and the number of components using the acceptance–rejection ratio given by the reversible jump Markov chain Monte Carlo technique; see [14] for more details.

In [14], it is explained how to jump between spaces of different dimension while attaining detailed balance. Suppose a trans-dimensional move, denoted by m, is proposed from a given state x to a new state x′. Reversible jump Markov chain Monte Carlo proposes to build a bijection between the two spaces of different dimension. This can be accomplished by introducing dim(x′) − dim(x) random variables u. A vector of continuous random variables u is drawn from a density q(u), independent of x, and the new values x′ are proposed using an invertible deterministic function x′(x, u). This transformation of the variables x → x′ is taken into account in the expression for the acceptance ratio by means of the density q(u) and the Jacobian of the transformation. Thus, the acceptance probability A is

A = min{ 1, [p(x′ | y) r_m(x′)] / [p(x | y) r_m(x) q(u)] |∂x′/∂(x, u)| }   (24)

where r_m(·) is the probability of choosing move type m in the given state and |·| is the Jacobian of the transformation.

In [15], an application of RJMCMC to the estimation of the number of components in a mixture of Gaussians is proposed. We estimate the number of components following a similar approach. In [15], two trans-dimensional moves are suggested: a birth–death move for empty components and a split–combine move for non-empty components, where a component j is said to be empty when no observation is allocated to it. We extend that work to mixtures of α-stable densities. The birth–death move suggested in [15] was implemented, but the acceptance rate for this move was found to be very low. For this reason, we only consider the split–combine move, which was found to be sufficient for our purposes.

For the split–combine move, the reversible jump mechanism is needed. The two moves must be designed in tandem, as they form a reversible pair. The new parameter settings are the same as in [15]:

w_{j*} = w_{j1} + w_{j2}   (25)

w_{j*} μ_{j*} = w_{j1} μ_{j1} + w_{j2} μ_{j2}   (26)

w_{j*} (μ_{j*}² + γ_{j*}²) = w_{j1} (μ_{j1}² + γ_{j1}²) + w_{j2} (μ_{j2}² + γ_{j2}²)   (27)

where two components j1 and j2, with weights, dispersion and location parameters (w_{j1}, γ_{j1}, μ_{j1}) and (w_{j2}, γ_{j2}, μ_{j2}) respectively, are combined into a new component, denoted j*, with parameters (w_{j*}, γ_{j*}, μ_{j*}). The allocation variables change in this move: for every datum y_i with z_i = j1 or z_i = j2 we set z_i = j*.

Although the combine move is deterministic, the reverse split move is not: there are three degrees of freedom due to the change of dimensionality, so three continuous random variables u must be introduced at this point.
As in [15], beta distributions Be(·, ·) are used, with the following parameters:

u₁ ~ Be(2, 2),  u₂ ~ Be(2, 2),  u₃ ~ Be(1, 1)

The proposed new values for the weights, location and dispersion parameters of the new components j1 and j2, split from a given existing component j*, are

w_{j1} = w_{j*} u₁   (28)

w_{j2} = w_{j*} (1 − u₁)   (29)

μ_{j1} = μ_{j*} − u₂ γ_{j*} √(w_{j2}/w_{j1})   (30)

μ_{j2} = μ_{j*} + u₂ γ_{j*} √(w_{j1}/w_{j2})   (31)

γ_{j1}² = u₃ (1 − u₂²) γ_{j*}² (w_{j*}/w_{j1})   (32)

γ_{j2}² = (1 − u₃) (1 − u₂²) γ_{j*}² (w_{j*}/w_{j2})   (33)
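In code, the split proposal of Eqs. (28)–(33) is a few lines (our sketch; the moment matching of Eqs. (25)–(27) holds by construction):

```python
import numpy as np

def split_proposal(w, mu, gamma2, rng=None):
    """Split one component (w, mu, gamma2) into two, Eqs. (28)-(33)."""
    rng = np.random.default_rng() if rng is None else rng
    u1, u2, u3 = rng.beta(2, 2), rng.beta(2, 2), rng.beta(1, 1)
    w1, w2 = w * u1, w * (1.0 - u1)
    g = np.sqrt(gamma2)
    mu1 = mu - u2 * g * np.sqrt(w2 / w1)
    mu2 = mu + u2 * g * np.sqrt(w1 / w2)
    gamma2_1 = u3 * (1.0 - u2 ** 2) * gamma2 * w / w1
    gamma2_2 = (1.0 - u3) * (1.0 - u2 ** 2) * gamma2 * w / w2
    # (u1, u2, u3) are kept for the acceptance ratio, Eq. (34)
    return (w1, mu1, gamma2_1), (w2, mu2, gamma2_2), (u1, u2, u3)
```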
After proposing these new values, we need to test whether the condition [μ₁ < μ₂ < ... < μ_k] holds; if it does not, the move is rejected. The new values of the allocation variables must be calculated after this move, by assigning to the observations labelled j* the new allocation, either j1 or j2, using expression (20). The acceptance ratio for the split move is

A = [ Π_{i:z_i=j1} (1/γ_{j1}) e^{−(y_i−μ_{j1})²/2λ_iγ_{j1}²} · Π_{i:z_i=j2} (1/γ_{j2}) e^{−(y_i−μ_{j2})²/2λ_iγ_{j2}²} / Π_{i:z_i=j*} (1/γ_{j*}) e^{−(y_i−μ_{j*})²/2λ_iγ_{j*}²} ]
    × (k + 1) w_{j1}^{ζ−1+n1} w_{j2}^{ζ−1+n2} / [ w_{j*}^{ζ−1+n1+n2} B(ζ, kζ) ]
    × √(κ/2π) e^{−(κ/2)[(μ_{j1}−ξ)² + (μ_{j2}−ξ)² − (μ_{j*}−ξ)²]}
    × (b₀^{a₀}/Γ(a₀)) (γ_{j1}² γ_{j2}²/γ_{j*}²)^{−a₀−1} e^{−b₀(γ_{j1}^{−2} + γ_{j2}^{−2} − γ_{j*}^{−2})}
    × [d_{k+1} / (b_k P_alloc)] {g_{2,2}(u₁) g_{2,2}(u₂) g_{1,1}(u₃)}^{−1}
    × w_{j*} |μ_{j1} − μ_{j2}| γ_{j1}² γ_{j2}² / [u₂ (1 − u₂²)(1 − u₃) γ_{j*}²]   (34)

where n1 and n2 are the numbers of samples y_i assigned to the components j1 and j2, B(·, ·) is the beta function, g_{p,q}(·) denotes the density of the beta distribution Be(p, q), P_alloc is the probability that the current allocation is chosen, and b_k and d_k = 1 − b_k are the probabilities of choosing the split and combine moves, respectively. Thus, at every iteration, a split into two new components is proposed with probability b_k (otherwise a combine into one component is proposed with probability d_k = 1 − b_k) and it is accepted with probability min{1, A}. If a combined new component is proposed, it is accepted with probability min{1, A⁻¹}. Lastly, we remark that it is not allowed to propose a combine move when k = 1, or a split move when k is greater than a given integer k₀. The first line in expression (34) is the likelihood ratio; the second is the ratio between the priors for α, w and z, where the term k + 1 in this line is due to the restriction to the set μ₁ < μ₂ < ... < μ_k; the third and fourth lines are the ratios between the priors for the location parameter μ and the dispersion γ; the fifth line is the proposal ratio; and the last one is the Jacobian of the transformation.

5. Extension to multidimensional data

The scale mixture of normals property presented in Section 2.1 can be extended to the multidimensional case as follows [8,10]. Choose

λ ~ f_{α/2,1}( 2[cos(πα/4)]^{2/α}, 0 )   (35)

and let

G = (G₁, G₂, ..., G_d)   (36)

be a zero-mean Gaussian vector in R^d independent of λ. Then the random vector

X = (λ^{1/2} G₁, λ^{1/2} G₂, ..., λ^{1/2} G_d)   (37)

has a symmetric α-stable distribution in R^d. Any vector X distributed as in Eq. (37) is called a sub-Gaussian symmetric α-stable random vector in R^d with underlying Gaussian vector G; it is also said to be subordinated to G. The characteristic function of a sub-Gaussian random vector has the following special structure:

φ(t) = exp( −(1/2) (tᵀ R t)^{α/2} )   (38)

where the matrix R is positive-definite and corresponds to the covariance matrix of the underlying Gaussian multivariate distribution. Therefore, using these properties of symmetric multivariate α-stable distributions, the symmetric α-stable mixture model proposed in this paper can be extended to multidimensional data. Note that, in that case, the prior distributions should be modified analogously to the multivariate Gaussian mixture case; thus, the conjugate prior distributions for the mean vector and covariance matrix would be the multivariate normal and the inverse Wishart, respectively.
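A sketch of Eqs. (35)–(37) (ours; it reuses the positive_stable helper from the sketch in Section 3): draw one subordinator λ per vector and scale a correlated Gaussian draw.

```python
import numpy as np

def sub_gaussian_stable(alpha, R, size, rng=None):
    """Sub-Gaussian symmetric alpha-stable vectors in R^d, Eq. (37):
    X = sqrt(lambda) * G with G ~ N(0, R) and lambda as in Eq. (35)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = 2.0 * positive_stable(alpha / 2.0, size, rng)          # Eq. (35)
    G = rng.multivariate_normal(np.zeros(R.shape[0]), R, size)   # Eq. (36)
    return np.sqrt(lam)[:, None] * G                             # Eq. (37)
```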
6. Simulation results

6.1. Synthetic data

Simulation 1: the proposed methodology is applied to N = 1000 samples with the following distribution:

p_Y(y) = 0.2 f_{1.4,0}(y | 0.2, −2) + 0.3 f_{1.4,0}(y | 0.5, 0) + 0.5 f_{1.4,0}(y | 0.6, 3)   (39)
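For reference, a dataset like that of Eq. (39) can be regenerated with the symmetric_stable sketch from Section 3 (our illustration; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
weights = [0.2, 0.3, 0.5]
components = [(1.4, 0.2, -2.0), (1.4, 0.5, 0.0), (1.4, 0.6, 3.0)]  # (alpha, gamma, mu)
counts = rng.multinomial(N, weights)
y = np.concatenate([symmetric_stable(a, g, m, n, rng)
                    for (a, g, m), n in zip(components, counts)])
rng.shuffle(y)
```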
Table 2. True value and estimated value for every unknown parameter.

Parameter   True value   Estimated value   Standard deviation
α₁          1.4          1.39              0.11
α₂          1.4          1.38              0.12
α₃          1.4          1.41              0.15
μ₁          −2           −1.983            0.018
μ₂          0            0.00              0.04
μ₃          3            3.01              0.03
γ₁          0.2          0.233             0.014
γ₂          0.5          0.51              0.04
γ₃          0.6          0.59              0.03
w₁          0.2          0.222             0.012
w₂          0.3          0.282             0.016
w₃          0.5          0.50              0.03
The settings for the hyperparameters of the prior distributions are: a₀ = 1, b₀ = 1, ξ = 0, κ = 1/32 and the Dirichlet parameter ζ = 1. The probability of choosing the split or combine move was set to b_k = d_k = 0.5, and the number of iterations is set to 500 with a burn-in period of 100 iterations. The number of components is initialized to k = 5. For p(k), a discrete uniform distribution between 1 and 10 is chosen. After a few iterations, the true number of components k = 3 is obtained and every parameter in the simulation is estimated very accurately. Table 2 presents the true values for every parameter and the estimated values obtained using the proposed methodology. Fig. 2 shows two histograms of the number of components k estimated for the symmetric α-stable mixture and the Gaussian mixture model. It is easily seen that the true number of components k = 3 is obtained for the mixture of symmetric α-stable case, whereas for the mixture of Gaussians the number of components is overestimated. Our method correctly estimates the true number of components (k = 3); therefore, in this example the symmetric stable mixture model does not suffer from overfitting. The number of Gaussian components needed to explain this dataset was k = 5. Note that three symmetric alpha-stable components in the mixture are described by 11 parameters (2 weights, 3 locations, 3 dispersions and 3 characteristic exponents), while five Gaussian components are described by 14 parameters (4 weights, 5 means and 5 variances). Therefore, in Simulation 1 the Gaussian mixture model can be said to overfit the data, as the dataset was generated using only 11 parameters.

The main difference between the Bayesian Gaussian mixture model and the symmetric α-stable one is the calculation of the parameter α using Metropolis–Hastings and the estimation, at each iteration, of a random vector λ of dimension N with positive α-stable distribution. For this dataset, the symmetric α-stable mixture approach takes approximately 100 times more computation time than the Gaussian mixture model. Nevertheless, we would like to point out that computation time is not the most relevant issue in Markov chain Monte Carlo methods, as they only require off-line computation. Furthermore, the symmetric α-stable mixture model was not implemented with computational efficiency in mind, and the calculation of λ could be optimized.

Fig. 2. Histograms of the number of components estimated after the burn-in period. The top panel shows the number of components estimated for the mixture of symmetric α-stable model; the bottom panel shows the results assuming a mixture of Gaussians.

Fig. 3. Discrete histogram of the mixture of three components in Eq. (39). Continuous line: predicted SαS density with the parameters given in Table 2. Dashed line: predicted density for a mixture of three Gaussian components.

Simulation 2: the mixture of symmetric α-stable distributions is more robust to outliers than the Gaussian mixture model. To show this, we model the N = 2000 samples distributed as in Eq. (39) with a mixture of three (fixed) normal components. The predicted densities and the discrete histogram of the data are plotted together in Fig. 3; the predicted symmetric α-stable density uses the parameters given in Table 2. The mixture of Gaussians approach with three components fails to model the data. This is due to the fact that the maximum and minimum values of the data vector y are 26.40 and −34.15, respectively. These values are very far from the means,
hence they cannot be explained under a Gaussian framework: a component in the mixture with very high variance and very low weight is obtained to model the outliers.

Simulation 3: as explained above, one of the main advantages of this model is that the Gaussian mixture model is a particular case of the symmetric α-stable mixture. N = 1000 samples are simulated with distribution given by a mixture of three normals with parameters w = [0.4, 0.3, 0.3], μ = [−3, 0, 2.5], σ = [0.8, 0.8, 0.4]. This corresponds to a mixture of symmetric α-stable distributions with α = [2, 2, 2] and dispersion γ = σ/√2 = [0.57, 0.57, 0.28]. The data are fitted using the mixture of symmetric α-stable approach. In Table 3, the estimated symmetric α-stable parameters and the true values are given; all the parameters are estimated accurately. The number of iterations of the MCMC was set to 1000 with a burn-in period of 500 iterations. The minimum mean square error estimator was used to obtain the estimated values. In this case, it would be more convenient to use the mode as the estimator of the characteristic exponent α, because the domain of the characteristic exponent is α ∈ (0, 2]; using the mode, the estimated value would be α = [2, 2, 2]. The discrete histogram of the mixture of Gaussian data and the predicted density are depicted in Fig. 4. The predicted density fits the data very accurately.

Simulation 4: in this simulation, the proposed methodology is applied to N = 2000 samples from a mixture of three semi-heavy-tailed distributions:

p_Y(y) = 0.3 L(y | 1, −5) + 0.3 t(y | 1, 0) + 0.4 t(y | 5, 5)   (40)

where L(y | b, μ_L) is a Laplace distribution with scale parameter b and location parameter (mean) μ_L, and t(y | ν, μ_t) denotes a Student's t-distribution centred at μ_t with ν degrees of freedom. The settings for the hyperparameters of the prior distributions are chosen to be the same as in Simulation 1. Again, after a few iterations, k = 3 symmetric α-stable components are obtained. Furthermore, a mixture of Gaussian distributions was also used to fit this dataset; in this case, the number of Gaussian components needed to explain the data was k = 6. Fig. 5 shows the histogram of the dataset in Eq. (40) together with the predictive symmetric stable mixture and Gaussian mixture models. Thus, we study the performance of the Gaussian and symmetric α-stable distributions when a more general mixture model is used. As in Simulation 1, the symmetric α-stable mixture works better than the Gaussian mixture for impulsive data and, analogously to Simulation 1, the mixture of Gaussians also suffers from overfitting on this dataset. Specifically, this dataset is fitted using three symmetric α-stable distributions (11 parameters), whereas even six Gaussian components (17 parameters) are not capable of fitting the data accurately, due to the presence of outliers in the data.

Fig. 4. Discrete histogram of the mixture of three Gaussian components. Continuous line: predicted SαS density with the parameters given in Table 3.
Table 3. Simulation 3: true value and estimated value for every unknown parameter.

Parameter   True value   Estimated value   Standard deviation
α₁          2            1.96              0.05
α₂          2            1.93              0.07
α₃          2            1.95              0.05
μ₁          −3           −3.05             –
μ₂          0            0.06              –
μ₃          2.5          2.48              –
γ₁          0.57         0.56              0.03
γ₂          0.57         0.60              0.06
γ₃          0.28         0.29              0.02
w₁          0.4          0.42              0.02
w₂          0.3          0.29              0.02
w₃          0.3          0.29              0.02
Fig. 5. Discrete histogram of the mixture of Laplace and Student's t distributions. Continuous line: predicted three-component symmetric α-stable density. Dashed line: predicted Gaussian mixture model with six components.
6.2. Real data

The proposed methodology is tested on three different datasets, chosen because they have been analysed before in the literature using mixture models. The 'Enzyme data' concern the distribution of enzymatic activity in the blood, for an enzyme involved in the metabolism of carcinogenic substances, among a group of 245 unrelated individuals [16,15]. The 'Acidity data' involve a log-scale acidity index measured in a sample of 155 lakes in the north-east of the USA [17,18,15]. The 'Galaxy data' consist of measurements of the velocities of 82 distant galaxies diverging from the Milky Way; this dataset has been analysed widely in the mixture model literature [19–21,15].

Table 4. Estimated values for three different real datasets.

Parameter   Enzyme data (k = 2)   Acidity data (k = 2)   Galaxy data (k = 3)
α₁          1.66                  1.80                   1.06
α₂          1.84                  1.82                   1.38
α₃          –                     –                      1.35
μ₁          0.20                  4.34                   9.71
μ₂          1.27                  6.25                   20.0
μ₃          –                     –                      23.0
γ₁          0.14                  0.30                   0.76
γ₂          0.37                  0.42                   0.74
γ₃          –                     –                      1.20
w₁          0.62                  0.59                   0.09
w₂          0.38                  0.41                   0.45
w₃          –                     –                      0.46
The number of components and the parameter estimates for these three datasets are presented in Table 4. The histograms for every dataset and the predictive symmetric α-stable mixture model are depicted in Fig. 6. Furthermore, the study of these datasets shows that the mixture of symmetric α-stable distributions performs well even when the dimension of the observation vector is small.

Fig. 6. Discrete histogram for every real dataset and predictive symmetric α-stable density. 'Enzyme data': two components. 'Acidity data': two components. 'Galaxy data': three components.

7. Conclusion

This work presents a new mixture model based on the mixture of symmetric α-stable distributions. Despite the non-existence of a closed form for the α-stable probability density function, Bayesian inference is possible because the density of a symmetric α-stable distribution can be written as a Gaussian, conditionally on a positive α-stable random variable. The main advantage of this model is that, as the Gaussian distribution is a particular case of the symmetric α-stable, the mixture of symmetric α-stable distributions is a generalization of the well-studied Gaussian mixture model. The proposed methodology was tested and every unknown parameter was shown to be estimated very accurately. The MCMC realization of the full Bayesian model considered was compared with the fully Bayesian Gaussian mixture model, and the two were shown to be very similar except for the auxiliary variable. Moreover, the symmetric α-stable mixture model was shown to be more robust to outliers than the Gaussian mixture model. The mixture of α-stable model was also tested on three different real datasets.
Acknowledgements
This work was partially supported by the project TEC2007-68030-C02-02/TCM of the MCyT of Spain, the PETRI project DENCLASES (PET2006-0253) of the Spanish MEC, the project TEC2008-02113 and the Excellence Projects TIC-02566 and TIC-03269 of the Consejería de Innovación, Ciencia y Empresa (Junta de Andalucía, Spain). The first author did part of the work while at ISTI-CNR, Pisa.

References
[1] G. McLachlan, D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics, 2000.
[2] C.L. Nikias, M. Shao, Signal Processing with Alpha-Stable Distributions and Applications, Wiley-Interscience, New York, USA, 1995.
[3] W. Feller, An Introduction to Probability Theory and its Applications, vol. 2, Wiley, New York, USA, 1971.
[4] R. Casarin, Bayesian inference for mixture of stable distributions, Technical Report No. 0428, CEREMADE, University Paris Dauphine, 2004.
[5] D. Salas-Gonzalez, E.E. Kuruoglu, D.P. Ruiz, Estimation of mixtures of symmetric alpha stable distributions with an unknown number of components, in: IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, 2006.
[6] D. Salas-Gonzalez, E.E. Kuruoglu, D.P. Ruiz, Finite mixture of α-stable distributions, Digital Signal Processing 19 (2009) 250–264.
[7] D. Salas-Gonzalez, E.E. Kuruoglu, D.P. Ruiz, A heavy-tailed empirical Bayes method for replicated microarray data, Computational Statistics and Data Analysis 53 (2009) 1535–1546.
[8] G. Samorodnitsky, M. Taqqu, Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance, Chapman & Hall, New York, 1994.
[9] S. Godsill, E.E. Kuruoglu, Bayesian inference for time series with heavy-tailed symmetric alpha stable noise processes, in: Proceedings of Applications of Heavy Tailed Distributions in Economics, Engineering and Statistics, Washington, DC, USA, June 1999.
[10] E.E. Kuruoglu, C. Molina, W.J. Fitzgerald, Approximation of alpha-stable probability densities using finite mixtures of Gaussians, in: Proceedings of the European Signal Processing Conference, Rhodes, Greece, 1998.
[11] E.E. Kuruoglu, C. Molina, S.J. Godsill, W.J. Fitzgerald, A new analytic representation for the alpha-stable probability density function, in: The Fifth World Meeting of the International Society for Bayesian Analysis (ISBA), Istanbul, Turkey, 1997.
[12] E.G. Tsionas, Monte Carlo inference in econometric models with symmetric stable disturbances, Journal of Econometrics 88 (1999) 365–401.
[13] J. Chambers, C. Mallows, B. Stuck, A method for simulating stable random variables, Journal of the American Statistical Association 71 (1976) 340–344.
[14] P.J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (4) (1995) 711–732.
[15] S. Richardson, P.J. Green, On Bayesian analysis of mixtures with an unknown number of components, Journal of the Royal Statistical Society B 59 (1997) 731–792.
[16] Y.C. Bechtel, C. Bonaiti-Pellie, N. Poisson, J. Magnette, P.R. Bechtel, A population and family study of N-acetyltransferase using caffeine urinary metabolites, Clinical Pharmacology and Therapeutics 54 (1993) 134–141.
[17] S.L. Crawford, M.H. DeGroot, J.B. Kadane, M.J. Small, Modeling lake chemistry distributions: approximate Bayesian methods for estimating a finite mixture model, Technometrics 34 (1992) 441–453.
[18] S.L. Crawford, An application of the Laplace method to finite mixture distributions, Journal of the American Statistical Association 89 (1994) 259–267.
[19] M.D. Escobar, M. West, Bayesian density estimation and inference using mixtures, Journal of the American Statistical Association 90 (1995) 577–588.
[20] D.B. Phillips, A.F.M. Smith, Bayesian model comparison via jump diffusions, in: W.R. Gilks, S. Richardson, D.J. Spiegelhalter (Eds.), Markov Chain Monte Carlo in Practice, Chapman & Hall, London, 1995, pp. 215–239.
[21] M. Stephens, Dealing with label switching in mixture models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 (2000) 795–809.