Analytic calculations for the EM algorithm for multivariate skew-t mixture models




Statistics and Probability Letters 82 (2012) 1169–1174


I. Vrbik, P.D. McNicholas*
Department of Statistics, University of Guelph, 50 Stone Road East, Guelph, Ontario, N1G 2W1, Canada
* Corresponding author. E-mail address: [email protected] (P.D. McNicholas).

Article history: Received 27 September 2011; received in revised form 21 February 2012; accepted 22 February 2012; available online 3 March 2012.

Keywords: Multivariate skew-t distribution; Mixture models; EM algorithm; Skewness

Abstract: The EM algorithm can be used to compute maximum likelihood estimates of model parameters for skew-t mixture models. We show that the intractable expectations needed in the E-step can be written out analytically. These closed form expressions bypass the need for numerical estimation procedures, such as Monte Carlo methods, leading to accurate calculation of maximum likelihood estimates. Our approach is illustrated on two real data sets.

1. Preliminaries

A finite mixture model assumes that a population is composed of a finite number of subpopulations, usually of the same statistical distribution type. The density of a g-component finite mixture model is given by f(y) = ∑_{i=1}^{g} π_i f_i(y), where the π_i > 0, with ∑_{i=1}^{g} π_i = 1, are the mixing proportions and f_i(y) is the ith component density, for i = 1, . . . , g. In recent work, mixtures of multivariate skew-t (MST) distributions have been developed (Wang et al., 2009; Pyne et al., 2009; Lin, 2010; Lee and McLachlan, 2011). Lin (2010) and Lee and McLachlan (2011) use the representation of the MST distribution described in Sahu et al. (2003). Herein, however, we employ the adapted version of the MST used by Pyne et al. (2009) and Wang et al. (2009). A random vector Y is said to follow a p-variate skew-t distribution with location vector ξ, covariance matrix Σ, skewness vector δ, and ν degrees of freedom if it can be represented by

Y = δ|U| + U0,    (1)

where U0 | w ∼ N_p(ξ, Σ/w), U | w ∼ N(0, 1/w), and W ∼ Γ(ν/2, ν/2), so that Y | u, w ∼ N_p(ξ + δ|u|, Σ/w). From (1), the joint distribution of Y, U, and W is

f(y, u, w) = C w^{R−1} e^{−wS},    (2)

where S = 2^{−1}[(y − ξ − δ|u|)^T Σ^{−1} (y − ξ − δ|u|) + u² + ν], R = (ν + p + 1)/2, and C = (ν/2)^{ν/2} [(2π)^{(p+1)/2} |Σ|^{1/2} Γ(ν/2)]^{−1}.

It can be shown that S can be written as au² + 2b|u| + c, where a = 2^{−1}(1 + δ^T Σ^{−1} δ), b = −2^{−1}(y − ξ)^T Σ^{−1} δ, and c = 2^{−1}(y − ξ)^T Σ^{−1}(y − ξ) + ν/2.
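As a concrete illustration (our own minimal sketch in Python, not part of the original article; the function and variable names are ours), the quantities a, b, c, D = c − b²/a, and R can be computed from a single observation and one set of component parameters as follows.

import numpy as np

def quadratic_coefficients(y, xi, Sigma, delta, nu):
    """Return a, b, c, D, R for one observation y and one MST component,
    following S = a u^2 + 2 b |u| + c, D = c - b^2/a, R = (nu + p + 1)/2."""
    p = len(y)
    Sigma_inv = np.linalg.inv(Sigma)
    r = y - xi
    a = 0.5 * (1.0 + delta @ Sigma_inv @ delta)
    b = -0.5 * (r @ Sigma_inv @ delta)
    c = 0.5 * (r @ Sigma_inv @ r) + nu / 2.0
    D = c - b**2 / a
    R = (nu + p + 1) / 2.0
    return a, b, c, D, R

# toy example with made-up numbers
y = np.array([1.0, 2.0])
xi = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
delta = np.array([0.5, -0.2])
print(quadratic_coefficients(y, xi, Sigma, delta, nu=5.0))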

Integrating out u and w from (2), we get

f(y) = 2CΓ(R) (D^{1/2−R}/√a) I1(R, b/√(Da)),    (3)

where D = c − b²/a and I1(·, ·) is defined in Section 2. Note that (3) approaches the multivariate skew-normal distribution – the parameterization used by Pyne et al. (2009) as opposed to that of Lin (2009) – as ν → ∞ and reduces to the multivariate t-distribution at δ = 0. Moreover, for δ = 0 we approach the multivariate normal distribution as ν → ∞. We denote the density function for the MST mixture model by f(y_j | Ψ) = ∑_{i=1}^{g} π_i f(y_j | θ_i, ν_i), where f(y_j | θ_i, ν_i) is the MST density defined in (3), θ_i = (ξ_i, Σ_i, δ_i), and Ψ = (π_1, . . . , π_g, θ_1, . . . , θ_g, ν_1, . . . , ν_g).

Important distinctions must be made between the MST mixture models used herein and other versions within the literature. The version of the MST density defined in Wang et al. (2009), and used by Lin (2010) and Lee and McLachlan (2011), can be expressed as a product of the probability density function (pdf) and cumulative distribution function (cdf) of the multivariate t-distribution. The version we borrow from Pyne et al. (2009) and Wang et al. (2009) relies only on the product of the p-dimensional pdf and the univariate cdf of the t-distribution. Furthermore, the latter representation requires a p-dimensional vector δ in its stochastic representation rather than the diagonal matrix Δ = diag{δ} needed in the former version. We adopt the terminology used by Lee and McLachlan (2011), referring to the version of the MST distribution we use as 'restricted' and the other as 'unrestricted'. The second prominent difference is in the calculation of the E-step. In Lin (2010), the expectations required in the E-step are estimated using Monte Carlo integration. As demonstrated by Lee and McLachlan (2011), this method requires costly Gibbs sampling techniques that become increasingly slow as the dimension of the data increases. Lee and McLachlan (2011) exploit the fact that, under the unrestricted version of the MST, the integrals can be expressed as functions of the moments and the cdf of the truncated multivariate non-central t-distribution (cf. Ho et al., 2012). Expressing the integrals in this way greatly increases the speed of the algorithm as well as the accuracy of parameter estimates. The key distinction between our method, using the restricted version, and the approach of Lee and McLachlan (2011), using the unrestricted version, is that all of the calculations required in our E-step are written in closed form and solved analytically.

2. Methodology

Hereafter, 'MST' shall be taken to mean the restricted version (cf. Section 1). Under the EM framework (Dempster et al., 1977), it is convenient to adopt the following notation. We regard the observed data y = (y_1', . . . , y_n') as being incomplete, where u = (u_1, . . . , u_n) and w = (w_1, . . . , w_n) are unobservable latent variables. We introduce missing data z = (z_1, . . . , z_n), where z_j = (z_{1j}, . . . , z_{gj})' for j = 1, . . . , n, with z_{ij} = 1 if y_j belongs to group i and z_{ij} = 0 otherwise. Following convention, we refer to (y, u, w, z) as the complete data. The complete-data log-likelihood is given by log L(Ψ | y, u, w, z) = log L1(π) + log L2(θ) + log L3(ν), where ν = (ν_1, . . . , ν_g),

g  n  zij i=1 j=1

log L3 (ν) =

g  n  i =1 j =1

g n i=1

j =1

zij log(πi ),



2



log |6i | − log(2π ) + wj (yj − ξ i − δi uj ) 6i (yj − ξ i − δi uj ) ,

 zij −

−1



1 2

(p − 1) log(wj ) + w

2 i uj



+

νi  2

−1

wj − log

 ν  i

2

+ log 0

ν  i

2

+

ν

i

2





− 1 log(wj ) .

Let Ψ^{(t)} represent the value of the model parameters Ψ at iteration t of the EM algorithm. Consider the expected value of the complete-data log-likelihood, Q(Ψ | Ψ^{(t)}) := E[log L(Ψ | y, u, w, z)]. Each iteration of the EM algorithm consists of an E-step, wherein Q(Ψ | Ψ^{(t)}) is computed, and an M-step, wherein Q(Ψ | Ψ^{(t)}) is maximized with respect to the model parameters Ψ to obtain Ψ^{(t+1)}. The algorithm is iterated until some convergence criterion is met. For mixtures of MST distributions, the following expectations are needed in the E-step:

E(Z_{ij} | y_j) = π_i^{(t)} f(y_j | θ_i^{(t)}, ν_i^{(t)}) / ∑_{h=1}^{g} π_h^{(t)} f(y_j | θ_h^{(t)}, ν_h^{(t)}) =: e_{0,ij},
E(Z_{ij} W_j | y_j) = P(Z_{ij} = 1 | y_j) E(W_j | y_j) =: P(Z_{ij} = 1 | y_j) e_{1,ij},
E(Z_{ij} |U_j| W_j | y_j) = P(Z_{ij} = 1 | y_j) E(|U_j| W_j | y_j) =: P(Z_{ij} = 1 | y_j) e_{2,ij},
E(Z_{ij} U_j² W_j | y_j) = P(Z_{ij} = 1 | y_j) E(U_j² W_j | y_j) =: P(Z_{ij} = 1 | y_j) e_{3,ij},
E(Z_{ij} log W_j | y_j) = P(Z_{ij} = 1 | y_j) E(log W_j | y_j) =: P(Z_{ij} = 1 | y_j) e_{4,ij},


where the first expectation is the posterior probability of y_j belonging to the ith group. We now define the key results necessary for the expressions in the following section:

I1(E, α) := ∫_α^∞ (1 + x²)^{−E} dx = πΓ(2E − 1)/(Γ(E)² 2^{2E−1}) − α ₂F₁(1/2, E; 3/2; −α²),

I2(E, α) := ∫_α^∞ x(1 + x²)^{−E} dx = −(1 + α²)^{1−E}/(2(1 − E)), for E > 1,

I3(E, α) := ∫_α^∞ log(1 + x²)(1 + x²)^{−E} dx
= −log(1 + α²) [ (1 + α²)^{1/2−E} ₃F₂(1/2, E − 1/2, E − 1/2; E + 1/2, E + 1/2; 1/(1 + α²)) / (1 − 2E)
  − (1 + α²)^{−1/2−E} ₃F₂(3/2, E + 1/2, E + 1/2; E + 3/2, E + 3/2; 1/(1 + α²)) / (2E + 1)² ]
  + 2(1 + α²)^{1/2−E} ₃F₂(1/2, E − 1/2, E − 1/2; E + 1/2, E + 1/2; 1/(1 + α²)) / (1 − 2E)²,

where pFq is the generalized hypergeometric function.
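To make the closed forms above concrete, here is a small verification sketch in Python (ours, not from the paper; it assumes SciPy and mpmath are available and, for I3, takes α ≥ 0). It evaluates I1, I2, and I3 via the hypergeometric expressions and compares them with direct numerical integration.

import numpy as np
from scipy.special import gamma, hyp2f1
from scipy.integrate import quad
from mpmath import hyper

def I1(E, alpha):
    # pi*Gamma(2E-1)/(Gamma(E)^2 2^(2E-1)) - alpha * 2F1(1/2, E; 3/2; -alpha^2)
    const = np.pi * gamma(2 * E - 1) / (gamma(E) ** 2 * 2 ** (2 * E - 1))
    return const - alpha * hyp2f1(0.5, E, 1.5, -alpha ** 2)

def I2(E, alpha):
    # -(1 + alpha^2)^(1-E) / (2(1-E)), valid for E > 1
    return -(1 + alpha ** 2) ** (1 - E) / (2 * (1 - E))

def I3(E, alpha):
    # closed form involving 3F2; written here for alpha >= 0
    z = 1.0 / (1 + alpha ** 2)
    f_a = float(hyper([0.5, E - 0.5, E - 0.5], [E + 0.5, E + 0.5], z))
    f_b = float(hyper([1.5, E + 0.5, E + 0.5], [E + 1.5, E + 1.5], z))
    bracket = ((1 + alpha ** 2) ** (0.5 - E) * f_a / (1 - 2 * E)
               - (1 + alpha ** 2) ** (-0.5 - E) * f_b / (2 * E + 1) ** 2)
    return (-np.log(1 + alpha ** 2) * bracket
            + 2 * (1 + alpha ** 2) ** (0.5 - E) * f_a / (1 - 2 * E) ** 2)

E, alpha = 3.5, 0.8   # arbitrary test values with E > 1 and alpha >= 0
for name, closed, integrand in [
        ("I1", I1(E, alpha), lambda x: (1 + x ** 2) ** (-E)),
        ("I2", I2(E, alpha), lambda x: x * (1 + x ** 2) ** (-E)),
        ("I3", I3(E, alpha), lambda x: np.log(1 + x ** 2) * (1 + x ** 2) ** (-E))]:
    numeric, _ = quad(integrand, alpha, np.inf)
    print(name, closed, numeric)   # closed-form and quadrature values should agree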

3. The expectations

Using the results given in the Appendix, the four required expectations can be written as









e_{1,ij} = ∫_{−∞}^{∞} ∫_0^{∞} w_j f(y_j, u_j, w_j) dw_j du_j ÷ ∫_{−∞}^{∞} ∫_0^{∞} f(y_j, u_j, w_j) dw_j du_j
= R I1(R + 1, b/√(Da)) / [D I1(R, b/√(Da))],

e_{2,ij} = ∫_{−∞}^{∞} ∫_0^{∞} |u_j| w_j f(y_j, u_j, w_j) dw_j du_j ÷ ∫_{−∞}^{∞} ∫_0^{∞} f(y_j, u_j, w_j) dw_j du_j
= R I2(R + 1, b/√(Da)) / [√(Da) I1(R, b/√(Da))] − (b/a) e_{1,ij},

e_{3,ij} = ∫_{−∞}^{∞} ∫_0^{∞} u_j² w_j f(y_j, u_j, w_j) dw_j du_j ÷ ∫_{−∞}^{∞} ∫_0^{∞} f(y_j, u_j, w_j) dw_j du_j
= (R − 2b e_{2,ij} − c e_{1,ij}) / a,

e_{4,ij} = ∫_{−∞}^{∞} ∫_0^{∞} log(w_j) f(y_j, u_j, w_j) dw_j du_j ÷ ∫_{−∞}^{∞} ∫_0^{∞} f(y_j, u_j, w_j) dw_j du_j
= ψ(R) − log(D) − I3(R, b/√(Da)) / I1(R, b/√(Da)),

where ψ(x) = (d/dx) log Γ(x) is the digamma function, and the quantities a, b, c, D, and R are computed from y_j and the parameters of the ith component. The parameter updates needed in the M-step are given by Pyne et al. (2009). To demonstrate the efficacy of this method, we apply our EM algorithm to two real data sets in Section 4. In these applications, we use a deterministic annealing approach with multiple random starts to help avoid local maxima (see Ueda and Nakano, 1998; Zhou and Lange, 2010, for details).
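To illustrate how these four analytic quantities might be assembled in code, the following sketch (ours; the helper names are hypothetical, and the I-functions are evaluated here by straightforward numerical quadrature rather than the hypergeometric closed forms) computes e_{1,ij}, e_{2,ij}, e_{3,ij}, and e_{4,ij} for a single observation and a single component.

import numpy as np
from scipy.integrate import quad
from scipy.special import digamma

def I_fun(E, alpha, kind):
    # I1, I2, I3 of Section 2, evaluated by quadrature for simplicity
    integrands = {1: lambda x: (1 + x**2) ** (-E),
                  2: lambda x: x * (1 + x**2) ** (-E),
                  3: lambda x: np.log(1 + x**2) * (1 + x**2) ** (-E)}
    val, _ = quad(integrands[kind], alpha, np.inf)
    return val

def e_step_moments(y, xi, Sigma, delta, nu):
    """e1 = E(W|y), e2 = E(|U|W|y), e3 = E(U^2 W|y), e4 = E(log W|y) for one component."""
    p = len(y)
    Sigma_inv = np.linalg.inv(Sigma)
    r = y - xi
    a = 0.5 * (1.0 + delta @ Sigma_inv @ delta)
    b = -0.5 * (r @ Sigma_inv @ delta)
    c = 0.5 * (r @ Sigma_inv @ r) + nu / 2.0
    D = c - b**2 / a
    R = (nu + p + 1) / 2.0
    alpha = b / np.sqrt(D * a)
    e1 = R * I_fun(R + 1, alpha, 1) / (D * I_fun(R, alpha, 1))
    e2 = R * I_fun(R + 1, alpha, 2) / (np.sqrt(D * a) * I_fun(R, alpha, 1)) - (b / a) * e1
    e3 = (R - 2 * b * e2 - c * e1) / a
    e4 = digamma(R) - np.log(D) - I_fun(R, alpha, 3) / I_fun(R, alpha, 1)
    return e1, e2, e3, e4

print(e_step_moments(np.array([1.0, 2.0]), np.array([0.0, 0.0]),
                     np.eye(2), np.array([0.5, -0.2]), nu=5.0))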


Fig. 1. Contour plots of the two-component MST mixture models fitted to the Old Faithful data (left) and the AIS data (right), where the variables have been standardized in both cases.

4. Applications

4.1. Old Faithful data

These data comprise the waiting time between eruptions and the duration of eruptions for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. Although there are no true groupings per se, the data appear to contain two groups, each located around a prominent mode (Fig. 1). These data, often used as an example of skewed data (e.g., Ali et al., 2010), exhibit skewed groups that many clustering algorithms cannot accommodate. Fig. 1 illustrates the contours of the fitted MST mixture model; we have clearly captured the bimodal density of the data, while the skewness parameter allows teardrop-shaped clusters that provide a close fit to the data.

4.2. Australian Institute of Sport data

We also apply our EM algorithm to the subset of the Australian Institute of Sport (AIS) data that was considered by Lin (2010) and Lee and McLachlan (2011). We consider the body mass index (BMI) and body fat (Bfat) variables for 202 athletes, taking their gender to be unknown. We fit a two-component mixture of MST distributions and report that the error rate (0.04455446) based on the misclassification of gender is lower than that reported by Lee and McLachlan (2011). The contour plot associated with our fitted model (Fig. 1) illustrates that the fitted model has captured the density of the data very well. Note that repeating our analysis without using deterministic annealing gives an identical error rate (0.0910891) to that found by Lee and McLachlan (2011).

5. Conclusions

The methodology presented herein provides analytic solutions for evaluating the intractable E-step for the mixture of MST distributions. These analytic solutions, which are expressed as functions of the generalized hypergeometric function, circumvent the need for numerical estimation and ensure accurate calculation of parameter estimates within the EM framework. Our approach was illustrated on two real data sets, where deterministic annealing was used to generate starting values. In future work, our MST EM algorithm will be adapted to incorporate some of the covariance decompositions that arise when families of mixture models are used. Such families, and the associated covariance decompositions, have recently been considered for the mixture of multivariate t-distributions; e.g., the extensions of multivariate t-factor analyzers considered by Andrews and McNicholas (2011).

Acknowledgments

The authors gratefully acknowledge very helpful comments from Professor Hira Koul, Co-Editor-in-Chief, and from an anonymous reviewer. This work was supported by a Queen Elizabeth II Graduate Scholarship in Science and Technology (Vrbik) and by an Early Researcher Award from the Ontario Ministry of Research and Innovation (McNicholas).


Appendix. Technical details







∫_{−∞}^{∞} ∫_0^{∞} C w^{R−1} e^{−wS} dw du = 2CΓ(R) ∫_0^{∞} [au² + 2bu + c]^{−R} du
= 2CΓ(R) ∫_{b/a}^{∞} [av² + D]^{−R} dv        (sub u = v − b/a)
= 2CΓ(R) D^{−R} √(D/a) ∫_{b/√(Da)}^{∞} [1 + t²]^{−R} dt        (sub v = √(D/a) t)
= 2CΓ(R) (D^{1/2−R}/√a) I1(R, b/√(Da)).

∫_{−∞}^{∞} ∫_0^{∞} C |u| w^{R−1} e^{−wS} dw du = 2CΓ(R) ∫_0^{∞} u [au² + 2bu + c]^{−R} du
= 2CΓ(R) (D^{1/2−R}/√a) [ √(D/a) I2(R, b/√(Da)) − (b/a) I1(R, b/√(Da)) ].

∫_{−∞}^{∞} ∫_0^{∞} C u² w^{R−1} e^{−wS} dw du = 2CΓ(R) ∫_0^{∞} u² [au² + 2bu + c]^{−R} du,

and writing u² = (D/a)(1 + t²) − 2(b/a)√(D/a) t + (2b²/a² − c/a), with u = √(D/a) t − b/a, gives

∫_{−∞}^{∞} ∫_0^{∞} C u² w^{R−1} e^{−wS} dw du
= 2CΓ(R) (D^{1/2−R}/√a) [ (D/a) I1(R − 1, b/√(Da)) − 2(b/a)√(D/a) I2(R, b/√(Da)) + (2b²/a² − c/a) I1(R, b/√(Da)) ].

∫_{−∞}^{∞} ∫_0^{∞} C log(w) w^{R−1} e^{−wS} dw du
= 2CΓ(R) [ ψ(R) ∫_{b/a}^{∞} [D(1 + (a/D)v²)]^{−R} dv − ∫_{b/a}^{∞} log[D(1 + (a/D)v²)] [D(1 + (a/D)v²)]^{−R} dv ]
= 2CΓ(R) ψ(R) (D^{1/2−R}/√a) ∫_{b/√(Da)}^{∞} (1 + t²)^{−R} dt − 2CΓ(R) (D^{1/2−R}/√a) ∫_{b/√(Da)}^{∞} [log(D) + log(1 + t²)] (1 + t²)^{−R} dt
= 2CΓ(R) (D^{1/2−R}/√a) { [ψ(R) − log(D)] I1(R, b/√(Da)) − I3(R, b/√(Da)) }.
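As an independent check of the first identity above (our own verification sketch, not part of the paper), one can compare ∫_0^∞ (au² + 2bu + c)^{−R} du computed by quadrature with (D^{1/2−R}/√a) I1(R, b/√(Da)):

import numpy as np
from scipy.integrate import quad

a, b, c, R = 0.9, -0.4, 3.0, 4.0      # arbitrary values with a > 0, c > b^2/a, R > 1/2
D = c - b**2 / a
alpha = b / np.sqrt(D * a)

lhs, _ = quad(lambda u: (a * u**2 + 2 * b * u + c) ** (-R), 0, np.inf)

I1_num, _ = quad(lambda t: (1 + t**2) ** (-R), alpha, np.inf)
rhs = D ** (0.5 - R) / np.sqrt(a) * I1_num

print(lhs, rhs)   # the two values should agree to quadrature accuracy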

References

Ali, M.M., Woo, J., Nadarajah, S., 2010. Some skew symmetric inverse reflected distributions. Brazilian Journal of Probability and Statistics 24 (1), 1–23.
Andrews, J.L., McNicholas, P.D., 2011. Extending mixtures of multivariate t-factor analyzers. Statistics and Computing 21 (3), 361–373.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1), 1–38.
Ho, H.J., Lin, T.-I., Chen, H., Wang, W.L., 2012. Some results on the truncated multivariate t distribution. Journal of Statistical Planning and Inference 142, 25–40.
Lee, S., McLachlan, G.J., 2011. On the fitting of mixtures of multivariate skew t-distributions via the EM algorithm. arXiv:1109.4706.
Lin, T.-I., 2009. Maximum likelihood estimation for multivariate skew normal mixture models. Journal of Multivariate Analysis 100, 257–265.
Lin, T.-I., 2010. Robust mixture modeling using the multivariate skew t distribution. Statistics and Computing 20, 343–356.
Pyne, S., Hu, X., Wang, K., Rossin, E., Lin, T.-I., Maier, L.M., Baecher-Allan, C., McLachlan, G.J., Tamayo, P., Hafler, D.A., Jager, P.L.D., Mesirow, J., 2009. Automated high-dimensional flow cytometric data analysis. Proceedings of the National Academy of Sciences 106, 8519–8524.
Sahu, S.K., Dey, D.K., Branco, M.D., 2003. A new class of multivariate skew distributions with application to Bayesian regression models. Canadian Journal of Statistics 31, 129–150.
Ueda, N., Nakano, R., 1998. Deterministic annealing EM algorithm. Neural Networks 11, 271–282.
Wang, K., Ng, S.K., McLachlan, G.J., 2009. Multivariate skew t mixture models: applications to fluorescence-activated cell sorting data. In: Proceedings of the Conference on Digital Image Computing: Techniques and Applications. IEEE Computer Society, Los Alamitos, California, pp. 526–531.
Zhou, H., Lange, K.L., 2010. On the bumpy road to the dominant mode. Scandinavian Journal of Statistics 37, 612–631.