Dimensionally reduced mixtures of regression models

Angela Montanari, Cinzia Viroli
Department of Statistics, University of Bologna, Italy

Journal of Statistical Planning and Inference 141 (2011) 1744–1752

Abstract

Article history: Received 15 January 2010; received in revised form 17 November 2010; accepted 22 November 2010; available online 1 December 2010.

A mixture of regression models for multivariate observed variables which contextually involves a dimension reduction step through a linear factor model is proposed. The model estimation is performed via the EM-algorithm, and a procedure to compute asymptotic standard errors for the parameter estimates is developed. The proposed approach is applied to the study of student satisfaction towards different aspects of their school as a function of various covariates. © 2010 Elsevier B.V. All rights reserved.

Keywords: Mixture models; Regression models; Factor analysis

1. Introduction

Finite mixture models have recently received wide interest in the statistical literature. According to this approach, a sample of observations arises from a specified number of underlying populations of unknown proportions. The distributional form of the density of each underlying population is specified, and the purpose of the finite mixture approach is to decompose the sample into its mixture components (McLachlan and Peel, 2000). In so doing one performs what is called model-based clustering.

For multivariate data, when the distributional form of each component is assumed to be Gaussian, it is well known that mixture models represent an over-parameterized solution, as they require, for each mixture component, the estimation of the mean vector and of the distinct elements of the variance–covariance matrix, besides the mixing proportions. This aspect has been variously addressed in the statistical literature. A well-known strategy is to parameterize the generic component-covariance matrix, thus directly reducing the number of free parameters; see Banfield and Raftery (1993), Celeux and Govaert (1995), Fraley and Raftery (2002) and Bouveyron et al. (2007). As an alternative, over-parameterized solutions can be avoided by performing dimension reduction in each mixture component by means of factor models. The mixture of factor analyzers (Ghahramani and Hinton, 1997; McLachlan et al., 2003) assumes that within each component the data are generated according to an ordinary factor model, thus reducing the number of parameters on which the variance–covariance matrices depend. More recently, McNicholas and Murphy (2008) introduced a general family of mixture models that encompasses this model along with some new proposals. In a slightly different perspective, the so-called heteroscedastic factor mixture analysis (Montanari and Viroli, 2010; Yoshida et al., 2006) simultaneously performs dimension reduction and model-based clustering by assuming a global factor model, with latent variables jointly modeled through a finite mixture of Gaussians. This approach can be viewed as an extension of independent factor analysis, in which the latent variables are marginally modeled through finite mixtures of Gaussians (Attias, 1999; Viroli, 2009; Montanari et al., 2008). It also generalizes the factor mixture models (Lubke and Muthén, 2005), in which heterogeneity

Corresponding author. E-mail address: [email protected] (C. Viroli).
doi:10.1016/j.jspi.2010.11.024


is exclusively ascribed to factor mean differences, while factor variances and covariances are conventionally held equal across classes. The classification performance of heteroscedastic factor mixture analysis has proved competitive with other dimensionally reduced classification methods, while involving fewer parameters.

In this work we address the problem of including a set of explanatory variables in the heteroscedastic factor mixture analysis. In the classical finite mixture approach without dimension reduction, covariates have been incorporated in the so-called conditional mixture model (Wedel and DeSarbo, 1995), which allows for the simultaneous probabilistic classification of observations and the estimation of regression models relating covariates to the expectations of the observed (dependent) variables within latent classes. The underlying idea is that a single set of regression coefficients across all observations may be inadequate and potentially misleading if the observations arise from a number of unknown groups in which the coefficients differ.

In this paper we propose a method which, by including covariates in the heteroscedastic factor mixture analysis, performs a mixture of regression models for multivariate observed variables, while contextually involving a dimension reduction step. The following real case study may clarify the goal of the procedure. In the 2006–07 school year, students of a secondary school in Bologna were asked to give their opinions on some aspects and services of their institute. A sample of 330 students answered an anonymous questionnaire addressing various aspects of the school, such as teaching activities and school cleaning. Some of the questions, say p of them, may be interpreted as indirect measures of an overall appreciation towards the different aspects of the institute, which therefore represents a set of latent variables, and it is not difficult to imagine that students may tend to cluster according to their different degrees of appreciation. This aspect can be described by jointly modeling the r factors, or latent variables, as a mixture of Gaussian distributions, where the mixture components, each representing a student cluster, should be allowed to be heteroscedastic. This also amounts to fitting a p-variate Gaussian mixture in the observed variable space, whose real latent dimensionality is r, with r < p. Some other items in the questionnaire, say q of them, may be interpreted as covariates which affect the latent overall appreciation. A further aspect worth considering is that, in this context, the covariates may influence student satisfaction differently as the student clusters vary. At the same time, covariates could differently affect the a priori probability of group membership.

One possible solution is to determine the student clusters and then separately fit a linear regression model within each of them. This solution, however, presents at least two drawbacks. First of all, it is a two-step procedure, so that estimation errors incurred in the first phase, concerning the latent variables, also propagate to the regression step. Secondly, in so doing one prevents covariates from conditioning the clustering phase, which may not be correct. In the following we propose a model and an estimation algorithm which address the above-described issues simultaneously.

The paper is organized as follows. In Section 2, the heteroscedastic factor mixture model is presented and extended to incorporate the covariate effects.
Then, in Section 3, we show that the proposed model may be reinterpreted as a model-based clustering method with a dimension reduction step. In Section 4, model estimation via the EM-algorithm is developed, identifiability conditions are discussed, and a procedure to compute standard errors for the parameter estimates is developed and illustrated for the regression coefficients. An application to real data about student satisfaction is presented in Section 5.

2. The model

Consider a p-dimensional vector y of continuous and mean-centered variables and a q-dimensional vector, denoted by x, of numerical or binary covariates. We assume that the p observed variables y have been generated according to a factor model involving r latent variables z, through a heteroscedastic factor mixture model. Let $\Lambda$ be a $p \times r$ factor loading matrix and u a p-dimensional Gaussian term which includes the so-called specific factors, with zero mean and diagonal covariance matrix $\Psi$. The specific factors u and the common factors z are assumed to be mutually independent. Then the heteroscedastic factor mixture model is

$$y = \Lambda z + u. \qquad (1)$$

The latent vector z is assumed to be distributed as a finite mixture of r-variate Gaussians,

$$f(z) = \sum_{i=1}^{k} \pi_i\, \phi^{(r)}(\mu_i, \Sigma_i), \qquad (2)$$

where the $\pi_i$ are the unknown mixing proportions and $\phi^{(r)}$ is the r-variate Gaussian density with component mean vector $\mu_i$ and component-covariance matrix $\Sigma_i$. This assumption means that, with respect to the latent variables, the overall population may be heterogeneous and composed of k distinct subgroups. Here we extend the factor model defined in (1) and (2) by incorporating the effect of the covariates x according to two general assumptions.

Assumption 1. It is assumed that within each mixture component the covariates x linearly affect the latent variable means:

$$\mu_i = \Gamma_i \tilde{x},$$

where $\tilde{x}$ is the (q+1)-dimensional vector containing both the covariates and a component equal to one, aimed at including the intercept in the model, and $\Gamma_i$ is a matrix of dimensions $r \times (q+1)$ containing the vectors of regression coefficients, which are allowed to vary across the components.

Assumption 2. It is also assumed that the covariates x may differently affect the a priori probability of group membership. To this purpose, let v be a k-dimensional allocation variable, or, in other terms, the vector of latent categorical variables denoting, for each unit, the component membership.

The covariates are allowed to influence the group allocation through a multinomial logit regression,

$$\pi_i(x) = \Pr(v_i = 1 \mid x) = \frac{\exp(\gamma_i^{\top} x)}{1 + \sum_{i'=1}^{k-1} \exp(\gamma_{i'}^{\top} x)}, \qquad (3)$$

where $i = 1, \ldots, k$ and $\gamma_k = 0$ for identifiability. Thus the covariates indirectly affect the observed variables through the two sets of latent variables z and v, and expression (2) is replaced by

$$f(z) = \sum_{i=1}^{k} \pi_i(x)\, \phi^{(r)}(\Gamma_i \tilde{x}, \Sigma_i). \qquad (4)$$
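To make the generative process defined by (1)-(4) concrete, the following Python sketch simulates data from the model. This is a minimal illustration, not the authors' code: the dimensions and all parameter values (Lam, Psi, Gam, Sig, gam) are invented for the example, and the intercept entry of $\tilde{x}$ is placed last by convention.

import numpy as np

rng = np.random.default_rng(0)
p, q, r, k, n = 10, 5, 2, 3, 330            # items, covariates, factors, components, units

Lam = rng.normal(size=(p, r))               # factor loading matrix Lambda
Psi = np.diag(rng.uniform(0.3, 1.0, p))     # diagonal specific-factor covariance Psi
Gam = rng.normal(size=(k, r, q + 1))        # regression matrices Gamma_i, r x (q+1)
Sig = np.stack([s * np.eye(r) for s in (0.2, 0.6, 0.9)])          # covariances Sigma_i
gam = np.vstack([rng.normal(size=(k - 1, q)), np.zeros((1, q))])  # logit coefs, gamma_k = 0

def mixture_weights(x):
    """Mixing proportions pi_i(x) from the multinomial logit in Eq. (3)."""
    eta = gam @ x                            # k linear scores, the last fixed at 0
    return np.exp(eta) / np.exp(eta).sum()

X = rng.normal(size=(n, q))
Y = np.empty((n, p))
for j in range(n):
    x_t = np.append(X[j], 1.0)                          # x~ = (x, 1), intercept last
    v = rng.choice(k, p=mixture_weights(X[j]))          # allocation, Eq. (3)
    z = rng.multivariate_normal(Gam[v] @ x_t, Sig[v])   # latent factors, Eq. (4)
    u = rng.multivariate_normal(np.zeros(p), Psi)       # specific factors
    Y[j] = Lam @ z + u                                  # observed vector, Eq. (1)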

3. Dimensionally reduced model-based clustering

It may be proved that modeling the latent variables as a finite mixture of Gaussians amounts to fitting a particular multivariate Gaussian mixture model to the observed variables as well. In fact, by combining Eqs. (1) and (4), the density of the observed variables may be rephrased as

$$f(y) = \int f(y \mid z)\, f(z)\, dz = \sum_{i=1}^{k} \pi_i(x) \int \phi^{(p)}(\Lambda z, \Psi)\, \phi^{(r)}(\Gamma_i \tilde{x}, \Sigma_i)\, dz,$$

that is,

$$f(y) = \sum_{i=1}^{k} \pi_i(x)\, \phi^{(p)}(\Lambda \Gamma_i \tilde{x},\; \Lambda \Sigma_i \Lambda^{\top} + \Psi), \qquad (5)$$

which allows for heteroscedastic mixture components sharing the same $\Lambda$ and $\Psi$ matrices. In a clustering perspective, each statistical unit is then allocated to the mixture component it has the maximum posterior probability of coming from (McLachlan and Peel, 2000). Eq. (5) can therefore be regarded as a dimensionally reduced model-based clustering method.

The proposed approach may be viewed as a generalization of the solutions proposed by Fokoué (2005) and Lubke and Muthén (2005). The former includes covariates in a mixture of factor analyzers model while assuming that the vector of regression coefficients is the same for all the mixture components (which are allowed to be heteroscedastic); the latter allows for varying intercepts only, in a mixture model with homoscedastic components. Our approach allows both for different covariate effects within the mixture components and for heteroscedastic components.

4. Model inference

4.1. Maximum likelihood estimation

Let $y = (y_1, \ldots, y_n)$ and $x = (x_1, \ldots, x_n)$ denote the p variables and the q covariates observed in a sample of size n, respectively. The log-likelihood of the proposed model is given by

$$\ell(\theta) = \sum_{j=1}^{n} \log f(y_j, x_j; \theta), \qquad (6)$$

where $\theta$ collectively denotes the set of model parameters. The maximum likelihood estimation problem can be solved using the EM-algorithm (Dempster et al., 1977), since the proposed model consists of two layers of missing data: the factors z and the k-dimensional allocation vector v, which derives from modeling the factors as a mixture of Gaussians. More specifically, with reference to the mixture distribution in (5), the allocation variable $v_{ij}$ takes values 0 or 1, indicating whether observation j belongs to group (component) i. Since a mixture distribution can be thought of as a heterogeneous population consisting of k groups of sizes proportional to $\pi_i$, the $v_{ij}$ are assumed to be i.i.d. multinomial variables:

$$f(v_j; \theta) = \prod_{i=1}^{k} \pi_i^{v_{ij}}. \qquad (7)$$

Thanks to these two levels of latent variables, the log-likelihood can be rephrased in terms of the so-called complete log-likelihood (i.e. the log-likelihood that we could compute if we knew the latent variables z and v):

$$\ell(\theta) = \sum_{j=1}^{n} \log \sum_{v} \int f(y_j, x_j, z_j, v_j; \theta)\, dz_j. \qquad (8)$$
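As a concrete (if computationally naive) check, the observed-data log-likelihood (6) can be evaluated directly through the mixture density (5). A sketch under the illustrative names introduced earlier (Lam, Psi, Gam, Sig, mixture_weights), not the authors' implementation:

import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(Y, X, Lam, Psi, Gam, Sig, weights):
    """Eq. (6): sum_j log f(y_j, x_j; theta), with f given by the mixture (5)."""
    n, k = Y.shape[0], Gam.shape[0]
    ll = 0.0
    for j in range(n):
        x_t = np.append(X[j], 1.0)                 # x~ with the intercept entry
        w = weights(X[j])                          # pi_i(x_j) from Eq. (3)
        dens = 0.0
        for i in range(k):
            mean = Lam @ (Gam[i] @ x_t)            # Lambda Gamma_i x~
            cov = Lam @ Sig[i] @ Lam.T + Psi       # Lambda Sigma_i Lambda' + Psi
            dens += w[i] * multivariate_normal.pdf(Y[j], mean, cov)
        ll += np.log(dens)
    return ll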


The complete density can be expressed in a hierarchical form by observing the following facts. The allocation variable v depends only on the covariates x, while the factors z depend on both the covariates and the allocation variable v. Moreover, the observed data y depend on the covariates and the allocation variable only through the latent variables z. In other terms, the following decomposition holds:

$$\log f(y, x, z, v; \theta) = \log f(y \mid z; \theta) + \log f(z \mid x, v; \theta) + \log f(v \mid x; \theta). \qquad (9)$$
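The decomposition (9) is what makes the M-step separable. A minimal sketch evaluating its three additive terms for one unit with known z and v, under the illustrative names used in the earlier sketches:

import numpy as np
from scipy.stats import multivariate_normal as mvn

def complete_loglik_unit(y, x, z, v, Lam, Psi, Gam, Sig, weights):
    """The three additive terms of Eq. (9) for a single unit."""
    x_t = np.append(x, 1.0)
    term_y = mvn.logpdf(y, Lam @ z, Psi)            # log f(y | z; theta)
    term_z = mvn.logpdf(z, Gam[v] @ x_t, Sig[v])    # log f(z | x, v; theta)
    term_v = np.log(weights(x)[v])                  # log f(v | x; theta)
    return term_y + term_z + term_v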

The EM-algorithm alternates between two steps, Expectation (E-step) and Maximization (M-step), until convergence of $\ell(\theta)$. In the E-step, the expected value of the complete log-likelihood given the observed data is calculated on the basis of provisional estimates of the parameters, denoted by $\theta'$. In the M-step, this expectation is maximized with respect to $\theta$ to obtain new provisional estimates. These steps are illustrated in Appendix A.

4.2. Identifiability

In the proposed model, the observed variables are taken as mean-centered in order to estimate zero-mean latent variables. This assumption does not completely solve the identifiability problem. Model (1) in fact implies that the latent variables are defined only up to a scaling transformation: scaling the factors z by a vector m would not affect the model if the mixing matrix $\Lambda$ were simultaneously scaled by $m^{-1}$. This kind of ambiguity causes numerical problems in the estimation process, due to the existence of multiple optimal solutions. The identifiability problem can be addressed by assuming an identity factor covariance matrix, thus dealing with standardized factors. More specifically, at each iteration of the EM-algorithm we perform the following scaling transformation:

$$\Sigma_i \rightarrow \Delta^{-1/2} \Sigma_i \Delta^{-1/2\,\top}, \qquad \Lambda \rightarrow \Lambda \Delta^{1/2}, \qquad \Gamma_i \rightarrow \Delta^{-1/2} \Gamma_i,$$

with

$$\Delta = \mathrm{Var}(z) = \sum_{i=1}^{k} \pi_i (\Sigma_i + \mu_i \mu_i^{\top}) - \left(\sum_{i=1}^{k} \pi_i \mu_i\right) \left(\sum_{i=1}^{k} \pi_i \mu_i\right)^{\top}, \qquad (10)$$

where $\mu_i = \Gamma_i \tilde{x}$. These transformations ensure $\mathrm{Var}(z) = I_r$.
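A sketch of the rescaling in (10), to be applied after each EM iteration. Here pi and mu stand for the mixing proportions and the component factor means; in the covariate-dependent model these vary with x, so treating them as fixed arrays is an assumption of this illustration, which is not the authors' code.

import numpy as np
from scipy.linalg import sqrtm

def rescale(Lam, Gam, Sig, pi, mu):
    """Scaling transformation (10), after which Var(z) = I_r."""
    mbar = pi @ mu                                   # overall factor mean
    Delta = sum(p_i * (S_i + np.outer(m_i, m_i))
                for p_i, S_i, m_i in zip(pi, Sig, mu)) - np.outer(mbar, mbar)
    D_half = np.real(sqrtm(Delta))                   # Delta^{1/2}
    D_inv = np.linalg.inv(D_half)                    # Delta^{-1/2}
    Lam_new = Lam @ D_half                           # Lambda <- Lambda Delta^{1/2}
    Gam_new = np.einsum('rs,ksq->krq', D_inv, Gam)   # Gamma_i <- Delta^{-1/2} Gamma_i
    Sig_new = np.stack([D_inv @ S @ D_inv.T for S in Sig])  # Sigma_i rescaled
    return Lam_new, Gam_new, Sig_new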

4.3. Standard errors of the estimates

The EM estimators, being maximum likelihood estimators, are asymptotically normal under typical regularity conditions (Cramér, 1946). This implies the possibility of performing statistical tests in order to evaluate the significance of the parameter estimates of the model. This is a crucial issue, since in this model we allow the covariates to differently affect the latent variables within each class: a covariate may be influential for one group but not relevant for another. In the following we derive asymptotic standard errors for the parameter estimates by evaluating the inverse observed information matrix, as established by the Cramér–Rao theorem. Unfortunately, in most estimation problems, and in the current one too, the observed information matrix can be very hard to evaluate directly. Meng and Rubin (1991) solved the problem by deriving the asymptotic variance–covariance matrix as a function of the complete-data variance–covariance matrix, which is instead algebraically tractable even in the most complicated estimation problems. Following this approach, we propose a procedure for computing the standard errors of the parameters $\Gamma_i$, $i = 1, \ldots, k$, since they measure how much the covariates linearly affect the factor means. The analytic developments are reported in Appendix B.

5. Data analysis

This section goes back to the student satisfaction problem discussed in the introduction, i.e. whether the questionnaire items can be synthesized by one or more latent variables expressing overall appreciation towards different aspects of the school services, whether the students tend to cluster according to their latent variable scores, and whether the covariates differently affect these latent variables within the clusters and also affect the a priori probability of group membership. The information on student satisfaction towards their institute was collected by self-administered questionnaires delivered during one lesson almost at the end of the school year. The students were asked to express their degree of agreement with a series of items according to a seven-level scale, which can be treated as a numerical response. The variables which entered our study are described in Table 1; they represent only a subset of the complete questionnaire. Table 2 describes the covariates included in the data analysis: sex, class year and three dummies related to the course curriculum, since the school offers four different learning programs. The traditional curriculum has been taken as the reference category.

5.1. Multiple linear regression results

We first fitted model (5) with a single component and a single latent variable. This amounts to assuming that students form a homogeneous group. The model log-likelihood is −4257.036 and Akaike's Information Criterion (AIC; Akaike, 1974) is 8621.


Table 1. Description of the items.

Y1: During the last two months, the classrooms have been properly cleaned
Y2: During the last two months, the corridors have been properly cleaned
Y3: During the last two months, the dressing rooms have been properly cleaned
Y4: During the last two months, desks and chairs have been properly cleaned
Y5: During the last two months, toilets have been properly cleaned
Y6: The institute protects and respects the student's rights
Y7: At school I feel safe and protected
Y8: For doubts I can ask my teachers also out of the lesson hours
Y9: My teachers are well prepared
Y10: The test planning helps me in my study organization

Table 2. Description of the covariates.

X1: Class year (from the second to the fifth)
X2: Sex (1 = male; 0 = female)
X3: Dummy for the course curriculum (1 = science)
X4: Dummy for the course curriculum (1 = informatics)
X5: Dummy for the course curriculum (1 = foreign languages)

Table 3. Regression coefficient estimates (standard errors in brackets).

X1: -0.97 (0.89)
X2: -0.19 (0.06)
X3: 0.38 (0.11)
X4: 0.65 (0.11)
X5: 1.00 (0.08)

It is evident from the factor loading vector (0.61, 0.42, 0.18, 0.57, 0.45, 0.56, 0.61, 0.59, 0.53, 0.47) that the latent variable has almost equal loadings on all the variables, and it may therefore be interpreted as a measure of overall satisfaction. Table 3 contains the estimated regression coefficients, with standard errors in brackets, obtained according to (B.4). All the covariates show significant regression coefficients which can be easily interpreted. As the class year increases, the overall school appreciation increases too among all the students; on average, males are less satisfied than females, and there are significant differences between the four learning programs. In general, students belonging to the traditional curriculum are less satisfied.

5.2. Dimensionally reduced mixtures of regression model results

The issue remains whether all students exhibit the same association of overall appreciation with the covariates, or whether there exist groups of students that exhibit different associations. The purpose of the following analysis is to identify such groups of students, showing a different association of overall school satisfaction with the covariates and hence different determinants of overall appreciation. To address this issue, our dimensionally reduced mixture of regression models was applied to the data for different values of k and different numbers of factors r. Covariates have been included under the two alternative assumptions that they affect only the latent variables, or that they affect both the latent variables and the group membership. On the basis of the AIC we chose a model with k = 3 mixture components and r = 2 latent variables, where the covariates are supposed to affect the factors but not the mixture weights. This solution has the smallest AIC value, equal to 8560. Table 4 contains the estimated factor loadings. As shown in the table, the first latent variable captures the appreciation towards the cleaning of the school (the first five items) and the other measures satisfaction towards the atmosphere and the teaching activities of the institute (the last five items). Table 5 presents the estimated regression coefficients for this three-group solution, together with some descriptive measures of the group characteristics.

The first group consists of few students (9.2%), and specifically of the most satisfied ones with respect to all the analyzed school aspects, as can be deduced by inspecting the group means of the latent variables, which in this case are positive and equal to 1.89 and 1.01, respectively. This group is also the one showing the lowest appreciation variability. In this group, only class year and curriculum have significant regression coefficients, while gender turns out to be not relevant. The regression coefficient of the class year is negative and significant only for the first factor, meaning that in the group of the most satisfied students appreciation towards school cleaning decreases as the class year increases.


Table 4. Factor loading estimates.

          Y1     Y2     Y3     Y4     Y5     Y6     Y7     Y8     Y9     Y10
Factor 1  0.71   0.47   0.63   0.57   0.65   0.14   0.00   0.14   -0.03  0.19
Factor 2  0.08   0.13   -0.04  0.19   0.20   0.61   0.69   0.51   0.64   0.44

Table 5. Parameter estimates (standard errors in brackets).

               Group 1          Group 2          Group 3
pi_i           0.092            0.638            0.271
n_i            31               212              87

Factor 1
m_i            1.95             -0.25            -0.09
s2_i           0.04             0.48             0.84
b0             2.22 (0.28)      -0.12 (0.61)     0.63 (0.61)
b1             -0.13 (0.02)     -0.06 (0.08)     -0.15 (0.02)
b2             0.01 (0.22)      -0.04 (0.11)     -0.28 (0.05)
b3             0.28 (0.04)      -0.28 (0.08)     -0.42 (0.09)
b4             -0.39 (0.04)     0.30 (0.08)      0.45 (0.08)
b5             0.27 (0.04)      0.43 (0.06)      -0.34 (0.07)

Factor 2
m_i            1.05             0.46             -1.50
s2_i           0.17             0.10             0.06
b0             -0.11 (0.57)     -0.44 (0.28)     -2.22 (0.17)
b1             0.06 (0.06)      0.17 (0.04)      0.14 (0.01)
b2             0.02 (0.16)      -0.16 (0.05)     -0.11 (0.04)
b3             0.84 (0.08)      0.53 (0.04)      0.57 (0.03)
b4             1.24 (0.07)      0.33 (0.03)      0.55 (0.05)
b5             1.13 (0.05)      0.60 (0.03)      0.51 (0.04)

[Fig. 1. Factor scores. Scatterplot of the estimated factor scores (Factor 1 on the horizontal axis, Factor 2 on the vertical axis), with units labeled by their cluster of allocation (1, 2, 3).]

Compared to the other two groups, the coefficients of the curriculum dummies for the second factor are much larger, indicating that the perceived atmosphere and the teaching activities are much more appreciated in the specialized learning programs than in the traditional one.

The second group, which includes 63.8% of the students, is the cluster of those moderately satisfied with the teaching activities and less satisfied with the cleaning of the school, its overall appreciation means being −0.25 and 0.46, respectively. In this group, all the covariates related to the second factor show a significant coefficient. In particular, student satisfaction increases as students progress in their school career, but it is lower for males. With reference to the other factor, the dummies for the school curricula are the only significant variables, and their effects differ as the curriculum varies. These results seem to indicate that students who have only recently entered the secondary school are more demanding towards their teachers than students who have already experienced school life; that gender is relevant for some school services (the items related to the perceived atmosphere in the school) but not for those related to the adequacy and cleaning of the spaces; and that the curriculum is always a determinant of the perceived appreciation.


The third group, which includes 27.1% of the students, is composed of the students unsatisfied with their teachers (the average of the second latent variable being −1.50). In this group, for the first time, all the covariates are significant. Boys are less satisfied than girls, and the class year has a decreasing effect on the first latent factor and an increasing effect on the second one. The covariates related to the specialized curricula still show a significant positive effect on the second latent variable, even if the coefficients are smaller than in the previous groups. A closer look at the factor score distribution across the three student clusters (see Fig. 1) shows that the first group has the highest factor scores, especially for the first latent variable, and the lowest variability. The other two groups have a higher variability and differentiate from each other mainly through the second latent variable scores, a result which is consistent with the above interpretation.

The results we have obtained show the existence of different student clusters with different associations between the two common latent variables and the explanatory variables. The study helps in understanding the dynamics of student appreciation and may suggest strategies for improving it. Moreover, the heterogeneity of the covariate effects at the group level seems to have masked the significance of their effects in the single-group case. Performing a two-step procedure also prevents one from recovering a differential covariate effect: we again assumed a three-group structure in the latent space spanned by the two factors and consequently recovered the student clusters. After fitting a separate multivariate regression model within the three groups (composed of the moderately satisfied students, 156 and 88 of them, and the very satisfied students, 86), we found that only the curriculum had a significant effect, and only in the group of the most satisfied students.

6. Discussion

The dimensionally reduced mixture of regression models developed in this paper finds potential application in all the research fields where recourse to a single linear regression model is suspected to be inadequate because the observations belong to a number of subgroups (thus making the sample heterogeneous) which differ in the parameters of the linear models, and where the covariates indirectly affect the observed variables through one or more latent variables common to all the subgroups. Assuming that the observed data have been generated according to a specific factor model, the proposed mixture likelihood approach simultaneously estimates the membership of the observations in a data-derived number of classes, the factor model parameters, and the parameters of the linear models that relate the latent variables to the explanatory variables within each class. This model generalizes the Wedel and DeSarbo (1995) approach to mixture regression modeling to cases requiring a dimension reduction step through a linear factor model. From a different perspective, it may also be interpreted as a variation of the mixture of factor analyzers (McLachlan et al., 2003) along three main lines: the assumption of a single factor model for all the observations, while allowing for different variances within the components; the resulting dimension reduction, not only as far as the component variance–covariance matrices are concerned but also for the component means; and the introduction of covariates aimed at explaining the within-group latent variable variability.

Appendix A. EM-algorithm

A.1. E-step

In order to compute the conditional expected value of the complete log-likelihood given the observed data, the conditional distribution of the missing data given the observed data and the covariates has to be derived on the basis of provisional estimates of $\theta$:

$$f(z, v \mid y, x; \theta') = f(z \mid y, x, v; \theta')\, f(v \mid y, x; \theta').$$

With reference to each mixture component i (for $i = 1, \ldots, k$), it may be proved that the first term on the right of the previous expression is distributed according to a Gaussian density,

$$f(z \mid y, x, v_i; \theta') = \phi^{(r)}(q_i(y, x), \Omega_i), \qquad (A.1)$$

with $q_i(y, x) = \Omega_i (\Lambda^{\top} \Psi^{-1} y + \Sigma_i^{-1} \Gamma_i \tilde{x})$ and $\Omega_i = (\Lambda^{\top} \Psi^{-1} \Lambda + \Sigma_i^{-1})^{-1}$.

The conditional distribution of $v_i$ given the observed data can be computed as a posterior probability using Bayes' rule:

$$f(v_i \mid y, x; \theta') = \frac{\pi_i\, f(y, x \mid v_i; \theta')}{\sum_{i=1}^{k} \pi_i\, f(y, x \mid v_i; \theta')}. \qquad (A.2)$$
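A sketch of the E-step quantities (A.1)-(A.2) for a single unit, where f(y, x | v_i; θ') is taken to be the i-th Gaussian component of (5). Names follow the earlier illustrative sketches; this is not the authors' implementation.

import numpy as np
from scipy.stats import multivariate_normal

def e_step_unit(y, x, Lam, Psi, Gam, Sig, pi):
    """Posterior factor moments (Eq. A.1) and posterior weights (Eq. A.2)."""
    k = len(pi)
    x_t = np.append(x, 1.0)
    Psi_inv = np.linalg.inv(Psi)
    q_i, Om_i, tau = [], [], np.empty(k)
    for i in range(k):
        Sig_inv = np.linalg.inv(Sig[i])
        Om = np.linalg.inv(Lam.T @ Psi_inv @ Lam + Sig_inv)        # Omega_i
        q = Om @ (Lam.T @ Psi_inv @ y + Sig_inv @ (Gam[i] @ x_t))  # q_i(y, x)
        q_i.append(q)
        Om_i.append(Om)
        tau[i] = pi[i] * multivariate_normal.pdf(                  # pi_i f(y, x | v_i)
            y, Lam @ (Gam[i] @ x_t), Lam @ Sig[i] @ Lam.T + Psi)
    tau /= tau.sum()                                               # Bayes' rule, Eq. (A.2)
    return np.array(q_i), np.array(Om_i), tau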

A.2. M-step

In order to maximize the expectation of the complete log-likelihood with respect to $\theta$, we need to evaluate the first derivatives of the expression

$$E_{z,v \mid y,x}[\log f(y, x, z, v; \theta)] = \sum_{v} \int f(z, v \mid y, x; \theta')\, \log f(y, x, z, v; \theta)\, dz, \qquad (A.3)$$

which can be decomposed as the sum of three terms according to (9). Since the three terms depend on different sets of parameters, all cross-derivatives are zero and they may therefore be maximized separately. The maximum of (A.3) with respect to $\Lambda$ and $\Psi$ is obtained by evaluating the first and second moments of the conditional distribution (A.1) at the current parameter estimates. In fact, as $y \mid z \sim \phi^{(p)}(\Lambda z, \Psi)$, we derive

$$E_{z,v \mid y}[\log f(y \mid z; \theta)] \propto -\tfrac{1}{2} \log(\det \Psi) - \tfrac{1}{2}\, \mathrm{tr}\, \Psi^{-1} \left( y y^{\top} - 2\, y\, E[z^{\top} \mid y]\, \Lambda^{\top} + \Lambda\, E[z z^{\top} \mid y]\, \Lambda^{\top} \right),$$

from which the estimates of the new parameters given the provisional ones are

$$\hat{\Lambda} = y\, E[z^{\top} \mid y]\, E[z z^{\top} \mid y]^{-1}, \qquad \hat{\Psi} = \mathrm{diag}\{ y y^{\top} - y\, E[z^{\top} \mid y]\, \hat{\Lambda}^{\top} \}.$$

The maximum likelihood estimates of $\Gamma_i$ and $\Sigma_i$ may be computed by evaluating the first derivative of $E_{z,v \mid y,x}[\log f(z \mid x, v; \theta)]$, where $f(z \mid x, v_i; \theta) = \phi^{(r)}(\Gamma_i \tilde{x}, \Sigma_i)$ as a consequence of Eq. (4). The estimates of the new parameters in terms of the provisional ones are

$$\hat{\Gamma}_i = \left( f(v_i \mid x, y; \theta')\, E[z \mid v_i, y, x]\, \tilde{x}^{\top} \right) \left( f(v_i \mid x, y; \theta')\, \tilde{x} \tilde{x}^{\top} \right)^{-1}, \qquad \hat{\Sigma}_i = \frac{f(v_i \mid x, y; \theta')\, E[z z^{\top} \mid v_i, y, x]}{f(v_i \mid x, y; \theta')} - (\hat{\Gamma}_i \tilde{x})(\hat{\Gamma}_i \tilde{x})^{\top}.$$

Finally, the estimates of the mixture weights can be computed by evaluating the score function of $E_{z,v \mid y,x}[\log f(v \mid x; \theta)]$ with respect to $\gamma_i$, for each $i = 1, \ldots, k$. It is easy to check that the gradient does not yield an explicit solution for the parameter vector $\gamma_i$, whose estimate can be obtained by nonlinear optimization methods. For this purpose Newton-type algorithms performed successfully (Dennis and Schnabel, 1983), leading to a generalized version of the EM-algorithm (McLachlan and Krishnan, 2008).
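A sketch of the closed-form updates for Γ_i and Σ_i with the sums over the sample written out explicitly (the compact expressions above suppress them); tau, Ez and Ezz collect the E-step output, with Ezz[j, i] = Ω_i + q_i q_i' from (A.1). The Λ, Ψ and γ_i updates are omitted. Illustrative code, not the authors' implementation.

import numpy as np

def m_step_regression(tau, Ez, Ezz, Xt):
    """Update Gamma_i and Sigma_i from the E-step output.

    tau: (n, k) posteriors f(v_i | y_j, x_j); Ez: (n, k, r) means E[z | v_i, y_j, x_j];
    Ezz: (n, k, r, r) conditional second moments; Xt: (n, q+1) rows x~_j."""
    n, k, r = Ez.shape
    Gam_new, Sig_new = [], []
    for i in range(k):
        w = tau[:, i]                                 # responsibilities for component i
        A = (w[:, None] * Ez[:, i, :]).T @ Xt         # sum_j w_j E[z|.] x~_j'
        B = (w[:, None] * Xt).T @ Xt                  # sum_j w_j x~_j x~_j'
        G = A @ np.linalg.inv(B)                      # Gamma_i update
        mu = Xt @ G.T                                 # fitted means Gamma_i x~_j
        S = (w[:, None, None] *
             (Ezz[:, i] - np.einsum('jr,js->jrs', mu, mu))).sum(axis=0) / w.sum()
        Gam_new.append(G)
        Sig_new.append(S)
    return np.stack(Gam_new), np.stack(Sig_new)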

Appendix B. Standard errors for the parameter estimates

The Meng and Rubin procedure is based on an appealing property, the so-called "missing information principle": the observed information can be computed as the difference between the complete information and the missing information,

$$I_o(\theta \mid y, x) = I_{oc} - I_{om}, \qquad (B.1)$$

in which the complete information is given by

$$I_{oc} = E_{z,v \mid y,x;\, \theta'} \left[ -\frac{\partial^2 \ln f(y, x, z, v; \theta)}{\partial \theta\, \partial \theta^{\top}} \right].$$

From Eq. (9) it follows that

$$I_{oc} = E_{z,v \mid y,x;\, \theta'} \left[ -\frac{\partial^2 \ln f(y \mid z; \theta)}{\partial \theta\, \partial \theta^{\top}} - \frac{\partial^2 \ln f(z \mid x, v; \theta)}{\partial \theta\, \partial \theta^{\top}} - \frac{\partial^2 \ln f(v \mid x; \theta)}{\partial \theta\, \partial \theta^{\top}} \right].$$

Since the three terms on the right of the previous expression depend on different sets of parameters, the relative standard errors can be derived separately. Hence, the complete information matrix can be thought of as a block-diagonal matrix. Eq. (B.1) can be written as

$$I_o(\theta \mid y, x) = (I - I_{om} I_{oc}^{-1})\, I_{oc}. \qquad (B.2)$$

Dempster et al. (1977) showed that the product $I_{om} I_{oc}^{-1}$ is equal to the matrix rate of convergence of the EM-algorithm, whose elements can be numerically evaluated at each step of the algorithm. Denoting this matrix by R, substituting it into (B.2) and inverting gives the asymptotic variance–covariance matrix V:

$$V = I_{oc}^{-1} (I - R)^{-1} = I_{oc}^{-1} + I_{oc}^{-1} R (I - R)^{-1} = I_{oc}^{-1} + \Delta V. \qquad (B.3)$$

In order to compute the asymptotic standard errors for the regression coefficients $\Gamma_i$, we define $F_B$ as the conditional expectation of $\ln f(z \mid x, v; \theta)$. Moreover, let $\tau_{il}$ be the vector of regression coefficients of the l-th covariate, $l = 1, \ldots, q$. The complete information matrix diagonal element for the regression coefficients $\tau_{il}$ is obtained from

$$-\frac{\partial^2 F_B}{\partial \tau_{il}\, \partial \tau_{il}^{\top}} = x_l^2\, \Sigma_i^{-1} f(v_i \mid y, x),$$

and taking its conditional expectation,

$$E\left[ -\frac{\partial^2 F_B}{\partial \tau_{il}\, \partial \tau_{il}^{\top}} \right] = x_l^2\, \pi_i\, \Sigma_i^{-1}.$$

Therefore,

$$\mathrm{S.E.}(\tau_{il}) = \sqrt{ \frac{1}{\sum_{h=1}^{n} x_{hl}^2\; \mathrm{diag}(\hat{\Sigma}_i^{-1})\; \hat{\pi}_i} + [\Delta V]_{\tau_{il}} }. \qquad (B.4)$$
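As an illustration, the complete-data part of (B.4) can be computed directly, while the increment [ΔV] from (B.3) requires the numerically estimated rate matrix R of the EM-algorithm and is passed in here as a precomputed quantity. A minimal sketch with illustrative names, reflecting the reconstruction of (B.4) above:

import numpy as np

def se_regression_coef(X, l, pi_i, Sig_i, DV_til=0.0):
    """Asymptotic standard errors of the coefficients tau_il via Eq. (B.4).

    X: (n, q) covariates; l: covariate index; pi_i: estimated mixing proportion of
    component i; Sig_i: (r, r) estimate of Sigma_i; DV_til: the [Delta V] increment
    from Eq. (B.3), taken as zero if the missing information is ignored."""
    complete_info = (X[:, l] ** 2).sum() * pi_i * np.diag(np.linalg.inv(Sig_i))
    return np.sqrt(1.0 / complete_info + DV_til)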

References

Akaike, H., 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723.
Attias, H., 1999. Independent factor analysis. Neural Computation 11, 803–851.
Banfield, J.D., Raftery, A.E., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821.
Bouveyron, C., Girard, S., Schmid, C., 2007. High-dimensional data clustering. Computational Statistics and Data Analysis 52, 502–519.
Celeux, G., Govaert, G., 1995. Gaussian parsimonious clustering models. Pattern Recognition 28, 781–793.
Cramér, H., 1946. Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B 39, 1–38.
Dennis, J.E., Schnabel, R.B., 1983. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ.
Fokoué, E., 2005. Mixtures of factor analyzers: an extension with covariates. Journal of Multivariate Analysis 95, 370–384.
Fraley, C., Raftery, A.E., 2002. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97, 611–631.
Ghahramani, Z., Hinton, G.E., 1997. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto, Canada.
Lubke, G.H., Muthén, B., 2005. Investigating population heterogeneity with factor mixture models. Psychological Methods 10, 21–39.
McLachlan, G.J., Krishnan, T., 2008. The EM Algorithm and Extensions, second ed. Wiley, Hoboken, NJ.
McLachlan, G.J., Peel, D., 2000. Finite Mixture Models. John Wiley & Sons, New York.
McLachlan, G.J., Peel, D., Bean, R.W., 2003. Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis 41, 379–388.
McNicholas, P.D., Murphy, T.B., 2008. Parsimonious Gaussian mixture models. Statistics and Computing 18, 285–296.
Meng, X.L., Rubin, D.B., 1991. Using EM to obtain asymptotic variance–covariance matrices: the SEM algorithm. Journal of the American Statistical Association 86, 899–909.
Montanari, A., Viroli, C., 2010. Heteroscedastic factor mixture analysis. Statistical Modelling 10, 441–460.
Montanari, A., Calò, D.G., Viroli, C., 2008. Independent factor discriminant analysis. Computational Statistics & Data Analysis 52, 3246–3254.
Viroli, C., 2009. Bayesian inference in non-Gaussian factor analysis. Statistics and Computing 19, 451–463.
Wedel, M., DeSarbo, W.S., 1995. A mixture likelihood approach for generalized linear models. Journal of Classification 12, 21–55.
Yoshida, R., Higuchi, T., Imoto, S., Miyano, S., 2006. ArrayCluster: an analytic tool for clustering, data visualization and model finder on gene expression profiles. Bioinformatics 22, 1538–1539.