Computational Statistics and Data Analysis 54 (2010) 750–766
Acceleration of the EM algorithm via extrapolation methods: Review, comparison and new methods

Foued Saâdaoui
Higher Institute of Management, Street Abdel-Aziz El-Bahi, Sousse 4000, Tunisia
Article history: Available online 24 November 2008
Abstract

EM-type algorithms are popular tools for modal estimation and are the most widely used parameter estimation procedures in statistical modeling. However, they are often criticized for their slow convergence. Despite the appearance of numerous acceleration techniques over the last decades, their use has been limited because they are either difficult to implement or not general. In the present paper, a new generation of fast, general and simple maximum likelihood estimation (MLE) algorithms is presented. In these cyclic iterative algorithms, extrapolation techniques are integrated with the iterations of gradient-based MLE algorithms, with the objective of accelerating the convergence of the base iterations. Some new complementary strategies, namely cycling, squaring and alternating, are added to these processes. The presented schemes generally exhibit either fast linear or superlinear convergence. Numerical illustrations allow us to compare a selection of their variants and generally confirm that this category of methods is extremely simple as well as fast.

© 2008 Elsevier B.V. All rights reserved.
1. Introduction

The expectation-maximization (EM) algorithm (Bishop, 1995; Dempster et al., 1977; Ghahramani and Jordan, 1995; Jordan and Jacobs, 1994; Redner and Walker, 1984; Wu, 1983) is a popular tool for maximizing objective functions. It gives an iterative procedure for calculating the maximum likelihood estimator (MLE) of stochastic models with hidden random variables. One reason for its popularity is its wide applicability and straightforward implementation. It also guarantees an increase in the objective function at every iteration and converges to a local maximum. Compared to many other optimization routines, it does not need to check that the stationary point is not a minimum (see McLachlan and Krishnan (1997) for more details). The idea behind the use of the EM is that, while the direct maximization of the likelihood for the incomplete data is difficult, augmenting the observed data with the missing information results in a simpler maximization.

The stability of the EM algorithm is often achieved at the heavy price of slow, linear convergence, with a convergence rate inversely related to the fraction of missing information. Such slowness severely limits the ability to perform computer-intensive tasks such as model exploration and validation, sensitivity analysis, bootstrapping, and simulations. Slow convergence also confines its usefulness, especially in modern applications such as hidden Markov models (HMM), neural networks, and classification in large databases. Despite the appearance of numerous acceleration techniques in the last decades, their use has been limited due to their lack of generality and stability. Monotone methods such as expectation conditional maximization (ECM) (Meng and Rubin, 1993; Liu and Rubin, 1994), flexible augmentation (Meng and Van Dyk, 1997) and parameter-expanded expectation maximization (PX-EM) (Liu et al., 1998) gave promising results but are not general. On the other hand, fast nonmonotone techniques were proposed, such as quasi-Newton (Lange, 1995b; Jamshidian and Jennrich, 1997) and conjugate
gradient (Jamshidian and Jennrich, 1993), but these methods are difficult to implement. In the present work, new iterative schemes are developed to accelerate the convergence of the EM algorithm at a minimal cost in terms of simplicity and stability. These procedures achieve a compromise between speed and simplicity, since they are based on the connection between sequence transformations and the fixed-point method. The extrapolated versions of the EM algorithm do not require the computation of auxiliary quantities such as the complete or incomplete data log-likelihood functions or their gradients, since they only require EM updates. They require only a modest additional effort and converge linearly with a faster rate; the gains can be substantial, especially for problems with slow EM convergence, such as those with large fractions of missing information. The flexibility of such schemes also led us to introduce some modifications, namely the cycling, squaring and alternating strategies, in order to improve the convergence.

In Section 2, the maximum likelihood method via the EM algorithm is reviewed. Next, extrapolation theory is re-introduced through two variants shown to be able to solve the EM fixed-point problem efficiently. Some recent supplementary strategies are presented and applied to their first order. The last section is devoted to numerical results based on a sequence of extended EM processes.

2. Background on the EM algorithm

The origins of the EM algorithm can be traced well into the past, but the seminal reference that formalized it and provided a proof of convergence is the fundamental paper by Dempster et al. (1977). A more recent book devoted entirely to the EM algorithm and its applications is due to McLachlan and Krishnan (1997). The algorithm is applied for the maximum likelihood estimation of so-called missing data problems, and is remarkable because of the simplicity and generality of the associated theory, as well as the variety of examples that fall within its reach.

The missing data problem refers to the situation when, given a sample Y = {y_1, . . . , y_N} with a probability distribution function (pdf) p(y_i|Θ), the estimation of the parameter vector through maximum likelihood is either difficult or even impossible. However, if the sample Y is augmented with some missing data Z, the estimation becomes trivial. Let X = (Y, Z) be the complete data distributed with the pdf p(x|Θ) = p(y, z|Θ). The EM algorithm consists of two steps. The expectation step (E-step) starts by considering an arbitrary parameter vector Θ^(0). As part of the data is missing, the log-likelihood function is a random variable. However, one can make it more precise by conditioning it on the known data y and approximating it through its expectation. Let h(., .) be a given function of two variables. Consider h(θ, Z), where θ is a constant and Z is a random variable governed by a distribution f_Z(z). Then

E_Z[h(θ, Z)] = ∫_{z∈Z} h(θ, z) f_Z(z) dz
is now a deterministic function that can be maximized. In the EM setting this conditional expectation of the complete-data log-likelihood, given the observed data y and a current parameter value Θ′, is denoted Q(Θ, Θ′). Notice the meaning of the two arguments in the function Q(Θ, Θ′). The argument Θ corresponds to the parameters that will ultimately be optimized to maximize the likelihood, while Θ′ corresponds to the parameters used to evaluate the expectation. The second step consists in maximizing the expectation computed in the first one. It is therefore known as the maximization step (M-step). One seeks
Θ^(i) = arg max_Θ Q(Θ|Θ^(i−1)).    (1)
The procedure is iterated until the difference between the old and new parameters is arbitrarily small. Under some restrictions, the vector Θ^(i) converges to the maximum likelihood parameters of the incomplete problem and shares their properties, such as efficiency, unbiasedness and consistency. A full proof is beyond the scope of this paper and we direct the reader to Dempster et al. (1977).

2.1. EM algorithm convergence

We hereafter review the convergence of the EM algorithm. The EM algorithm implicitly defines a fixed-point mapping F from the parameter space onto itself, F : Ω ⊂ R^p → Ω, such that
Θ^(i+1) = F(Θ^(i)),   i = 0, 1, . . . .    (2)
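To fix ideas, the following minimal sketch shows the EM iteration viewed as such a fixed-point map; the functions e_step and m_step, the tolerance and the iteration cap are illustrative placeholders rather than quantities from the paper.

    import numpy as np

    def em_fixed_point(theta0, e_step, m_step, tol=1e-8, max_iter=5000):
        """Generic EM iteration Theta^(i+1) = F(Theta^(i)), with F = m_step o e_step."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            theta_new = m_step(e_step(theta))            # one application of the EM map F
            if np.linalg.norm(theta_new - theta) < tol:  # stop near the fixed point
                return theta_new
            theta = theta_new
        return theta

All the accelerated schemes discussed below only assume access to such a map F.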
Under standard regularity conditions, a Taylor series expansion around the fixed point Θ^* yields
Θ^(i+1) − Θ^* = J(Θ^*)(Θ^(i) − Θ^*) + o(‖Θ^(i) − Θ^*‖),

where J(Θ^*) is the Jacobian matrix of F(Θ) evaluated at Θ^*. For sufficiently large i, the distance between Θ^(i) and Θ^* tends to be small, and the previous equation becomes
Θ^(i+1) − Θ^* = λ(Θ^(i) − Θ^*) + o(‖Θ^(i) − Θ^*‖),   i → ∞,

where λ is the largest eigenvalue of J(Θ^*). It has been shown in Dempster et al. (1977) that
J(Θ^*) = I_miss(Θ^*; y) I_com^{-1}(Θ^*; y),    (3)
where

I_miss(Θ; y) = − ∫_Z [∂² log f(z|y, Θ) / ∂Θ²] f(z|y, Θ) dz

and

I_com(Θ; y) = − ∫_Z [∂² L(x|Θ) / ∂Θ²] f(z|y, Θ) dz.

Next, denote by I_obs(Θ; y) the observed information matrix given by

I_obs(Θ; y) = − ∂² log p(y; Θ) / ∂Θ².
Using the missing information principle, which states that I_obs(Θ^*; y) = I_com(Θ^*; y) − I_miss(Θ^*; y), we obtain

J(Θ^*) = I_p − I_obs(Θ^*; y) I_com^{-1}(Θ^*; y),
where I_p is the p × p identity matrix. J(Θ^*) measures the fraction of missing information, and the EM convergence rate is governed by its largest eigenvalue λ. The convergence depends directly on the properties of I_obs(Θ^*; y). When this matrix is positive definite, Θ^* is a local maximum; more precisely, the EM iterates Θ^(k) converge to the local maximum Θ^*. When I_obs(Θ^*; y) is only positive semi-definite, the eigenvalues of J(Θ^*) lie in the interval [0, 1] and therefore convergence is not guaranteed. Finally, the convergence can also be discussed through the spectral radius of J(Θ), for Θ ∈ Ω. An interesting case with ρ(J(Θ)) > 1 is developed in Sidi (1986b), where it is proved that, using some extrapolation methods, convergence may be obtained even though the spectral radius is greater than 1.

2.2. EM algorithm acceleration

In this section, we exhibit accelerators requiring only the EM operator F(Θ), in order to distinguish them from the E- and M-steps described previously. Let f(Θ) = Θ − F(Θ) be the corresponding EM step. When the EM algorithm is slow, its steps are usually small. It can be accelerated by lengthening them without destroying its convergence (see for instance Lange (1995b)). The most celebrated extrapolation schemes are Romberg's method, based on the trapezoidal rule for numerical quadrature, and Aitken's Δ² process. A step-lengthening scheme called the lambda-method is developed in Laird et al. (1987), in which the amount of lengthening at each iteration is based on the two previous EM steps. It is in fact a re-formulation of the acceleration procedure suggested in Dempster et al. (1977). Recently, an extension has been proposed in Varadhan and Roland (2004), representing both a generalization and an improvement of extrapolation. The authors exploited the strong connection between extrapolation methods and fixed-point iterations for solving Eq. (2) using a cycling idea. The strategy of cycling consists in building, within each cycle, a vector sequence with a certain number of base iterations, and then using these iterates in the extrapolation schemes to compute an extrapolated vector. This vector is then used to start the next cycle, which generates another sequence with the base iterations, and so on. The object is to develop a faster iterative process for locating the fixed point of the EM mapping given in (2).

Consider the sequence of vectors {t_i}, i = 0, 1, . . ., generated by an EM algorithm starting with t_0 = Θ^(n). The sequence is defined by the iterative process t_{i+1} = F(t_i), i = 0, 1, . . . . Newton's method is given by
Θ^(n+1) = t_0 − J_n^{-1} f(t_0),    (4)

where J_n is the Jacobian of f(Θ) = F(Θ) − Θ at Θ^(n). It is obtained by finding the zero of the linear approximation of f(Θ),

f(Θ) = f(t_0) − J_n (Θ − t_0).

For fixed-point problems, extrapolation is an alternative to Newton's method since it provides quadratic convergence without the evaluation of derivatives. Sidi (1986b) proved that extrapolation methods, in particular the minimal polynomial extrapolation (MPE) and the reduced rank extrapolation (RRE), are efficient acceleration procedures for the iterative method solving (2). The simplest method is the rank-one extrapolation; a more developed one is due to Varadhan and Roland (2004). The first-order method can be considered as a Steffensen-type scheme for vector fixed-point problems, obtained by constructing a two-point secant-like approximation of the Jacobian. Consider two linear approximations of f(Θ), one about Θ^(n) and the other about F(Θ^(n)):

f_0(Θ) = f(t_0) − (1/α_n)(Θ − t_0)
and

f_1(Θ) = f(t_1) − (1/α_n)(Θ − t_1),
where the Jacobian is approximated by the matrix (1/α_n) I. The zeros of f_0(Θ) and f_1(Θ) provide two different approximations of the fixed point,

u_{n+1}^0 = t_0 − α_n f(t_0)   and   u_{n+1}^1 = t_1 − α_n f(t_1).

α_n is then chosen such that it minimizes some measure of discrepancy between u_{n+1}^0 and u_{n+1}^1. The simple choice of ‖u_{n+1}^1 − u_{n+1}^0‖² as the measure of discrepancy leads to the first-order RRE steplength given by
α_n = (r_n^T v_n) / (v_n^T v_n),    (5)
where

r_n = t_1 − t_0 = F(Θ^(n)) − Θ^(n),
v_n = f(t_1) − f(t_0) = F(F(Θ^(n))) − 2F(Θ^(n)) + Θ^(n).

However, the choice of ‖u_{n+1}^1 − u_{n+1}^0‖²/α_n² gives the first-order MPE steplength
α_n = (r_n^T r_n) / (r_n^T v_n).    (6)
The choice of (5) or (6) as steplength leads, respectively, to the two polynomial extrapolation methods mentioned above for finding the fixed point of the EM mapping, which is written as

Θ^(n+1) = Θ^(n) − α_n (F(Θ^(n)) − Θ^(n)).    (7)
Note that we recover the EM scheme for α_n = −1. Each method given above represents the lowest order of the polynomial extrapolation methods (Varadhan and Roland, 2004; Saâdaoui and Roland, submitted for publication). They can be considered as analogues of the classical Cauchy method, generally defined by

Θ^(n+1) = Θ^(n) − α_n r_n,

where α_n is the steplength and r_n is the gradient direction. To further improve convergence, Varadhan and Roland (2004) extended the squaring idea of Raydan and Svaiter (2002) to the fixed-point problem by combining the classical Cauchy method with the more recent Barzilai–Borwein method. It consists in first applying the one-stage extrapolation scheme at Θ^(n) to obtain an intermediate vector Φ_n, and then applying the extrapolation method once again at Φ_n to obtain the parameter update for the next cycle, Θ^(n+1). The resulting scheme can be formulated as

Θ^(n+1) = Θ^(n) − 2α_n r_n + α_n² v_n,

where r_n = F(Θ^(n)) − Θ^(n) and v_n = F(F(Θ^(n))) − 2F(Θ^(n)) + Θ^(n). The squaring idea was first employed in Raydan and Svaiter (2002) to accelerate the convergence of the classical Cauchy method for solving quadratic optimization problems. Roland and Varadhan (2005) extended the squaring technique to general nonlinear fixed-point problems and proposed a squared version of the Lemarechal iterative scheme.

It is straightforward to obtain the following error equations for every iterative scheme. For the one-step method given in (7), we have
ξ_{n+1} = (I − α_n(J − I)) ξ_n + o(ξ_n),

where ξ_n = Θ^(n) − Θ^* and J is the Jacobian of F at the fixed point Θ^*. The new scheme has the following equation for the error propagation:

ξ_{n+1} = (I − α_n(J − I))² ξ_n + o(ξ_n).    (8)
Therefore, the novel method can be considered as a squared one-step method. The dramatic difference in convergence behavior between the simple first-order method and its squared version is truly remarkable. Even though the first experiments (Raydan and Svaiter, 2002) were carried out on a simple linear problem, such behavior is quite typical even in nonlinear EM iterations.
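To make the cycling concrete, here is a minimal sketch (not the authors' code) of one cycle of the first-order polynomial extrapolations and of their squared variants; the EM map F, the starting vector and the choice between the RRE steplength of Eq. (5) and the MPE steplength of Eq. (6) are supplied by the caller.

    import numpy as np

    def extrapolation_cycle(theta, F, steplength="rre", squared=False):
        """One cycle of the first-order polynomial extrapolation (Eq. (7))
        or of its squared version, built on two EM updates."""
        t0 = np.asarray(theta, dtype=float)
        t1 = F(t0)                      # F(Theta^(n))
        t2 = F(t1)                      # F(F(Theta^(n)))
        r = t1 - t0                     # r_n = F(Theta) - Theta
        v = t2 - 2.0 * t1 + t0          # v_n = F(F(Theta)) - 2 F(Theta) + Theta
        if steplength == "rre":         # Eq. (5): first-order RRE steplength
            alpha = np.dot(r, v) / np.dot(v, v)
        else:                           # Eq. (6): first-order MPE steplength
            alpha = np.dot(r, r) / np.dot(r, v)
        if squared:                     # squared scheme: Theta - 2*alpha*r + alpha^2*v
            return t0 - 2.0 * alpha * r + alpha ** 2 * v
        return t0 - alpha * r           # one-step scheme, Eq. (7)

Within the cycling strategy, the returned vector becomes the starting point of the next cycle.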
When applied to EM iterations, the rates of convergence of the given extrapolations are governed by the spectral radius of (I − α_n(J − I)). More precisely, one can apply the recursive error relation for the ith component of the error (Varadhan and Roland, in press), given by

ξ_{n+1}^(i) = (1 − α_n/λ_i) ξ_n^(i),
where λ_i, i = 1, . . . , p, are the eigenvalues of (J − I)^{-1} such that 0 < λ_1 ≤ · · · ≤ λ_p. The theoretical proofs relating to the convergence of the derived schemes for accelerating the EM algorithm are developed in Varadhan and Roland (2004) for the first-order polynomial methods and their squared versions. The former can be considered as Steffensen processes, whose convergence properties are also established in Sidi's papers (Sidi, 1986b), whereas for the squared extrapolated schemes the relevant properties can be found in Roland and Varadhan (2005) and Raydan and Svaiter (2002). Recently, in Berlinet and Roland (2007), Roland and Berlinet reconsidered the problem from a geometrical point of view. Using simple geometric arguments, they clarified some former arguments and simultaneously derived a new class of acceleration schemes containing as a special case the class developed by Varadhan and Roland (2004). Finally, we recall that every method presented above is combined with a cycling process; this connection generates a fast alternative to the classical EM algorithm.

3. Main method

3.1. A cycling-based epsilon algorithm

This section presents our main idea for accelerating the EM convergence. We introduce a new technique that accelerates the convergence using the first-order scalar epsilon algorithm (SEA with k = 1) due to Wynn. More precisely, we extend the epsilon-accelerated EM algorithm of Kuroda and Sakakihara (2006) with a cycling procedure similar to the one used for the polynomial family. Despite the appealing features of the original method, it seems interesting to complement it with a cycling strategy; indeed, in Saâdaoui and Roland (submitted for publication), a general comparison allows us to consider cycling as the best and most natural way to implement vector extrapolation schemes. Raydan and Svaiter's squaring technique is also tested for this family of iterative schemes. Such processes yield a new generation of strategies with better features.

As in the previous section, and for the same purpose, we develop new iterative procedures for finding the fixed point of the EM mapping. Let us reconsider the function given in (4). It is known that the Cauchy methods recalled previously converge linearly, but with a slow convergence rate, especially for ill-conditioned problems. This is due to the choice of the steplength α_n, not to the gradient direction r_n. To overcome a similar problem, the Barzilai–Borwein method (Raydan, 1993) was proposed. Introduced in Barzilai and Borwein (1988) for unconstrained minimization problems, the Barzilai–Borwein method is a nonmonotone gradient procedure, briefly recalled hereafter:
Θ^(n+1) = Θ^(n) − α_{n−1} r_n,    (9)
in accordance with a Cauchy method. The steplength can be either

α_{n−1} = (r_{n−1}^T r_{n−1}) / (r_{n−1}^T v_{n−1})

or

α_{n−1} = (r_{n−1}^T v_{n−1}) / (v_{n−1}^T v_{n−1}),

where r_{n−1} = F(Θ^(n−1)) − Θ^(n−1) and v_{n−1} = F(F(Θ^(n−1))) − 2F(Θ^(n−1)) + Θ^(n−1). The parameter α_{n−1} is simply the optimal steplength at the previous iteration, where α_1 > 0 is chosen arbitrarily. Owing to its simplicity and numerical efficiency for well-conditioned problems, proved inter alia by Raydan (1997), the Barzilai–Borwein gradient method has received a great deal of attention. However, like all steepest descent and conjugate gradient methods, the Barzilai–Borwein method becomes slow when the problem happens to be more ill-conditioned. The subsequent iterations can be written in the scalar case as
Θ^(n+1) = F(Θ^(n)) − α_n F(r_n),

which is a particular case of Eq. (7). When using Eq. (5) as steplength, the latter can be written as
Θ^(n+1) = F(Θ^(n)) + { [Θ^(n) − F(Θ^(n))]^{-1} + [F(F(Θ^(n))) − F(Θ^(n))]^{-1} }^{-1}.
In accordance with the first orders of the polynomial methods, this iterative formula can also be considered as the first order of the well-known epsilon algorithm (Appendix A.1). Wynn's vector epsilon algorithm (VEA) is one of the well-known recursive lozenge methods. This algorithm is generally used to accelerate the convergence of slowly convergent scalar sequences. It has been extended to vector sequences by Wynn (1962), and it is known to be very effective for linearly converging sequences. Notice that our proposed Epsilon1 method (for vector sequences) derives from the SEA scheme by either considering it as a first-order VEA or simply as a Barzilai–Borwein method, which offers two efficient alternative strategies for accelerating the EM sequence.

Theorem 1. The vector sequence {Θ^(n)}_{n≥0} generated by the cycling-based epsilon algorithm converges to the stationary point Θ^* of the EM sequence.

Proof. See Appendix A.2.
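As an illustration, the following sketch implements one cycle of this first-order epsilon scheme for vector parameters, using the Samelson inverse [s]^{-1} = s/‖s‖² recalled in Appendix A.1; it is a plausible reading of Eqs. (17)-(19), not the author's code, and the function names are illustrative.

    import numpy as np

    def samelson_inverse(s):
        """Vector inverse [s]^{-1} = s / ||s||^2 used by the vector epsilon algorithm."""
        return s / np.dot(s, s)

    def epsilon1_cycle(theta, F):
        """One cycle of the cycling-based first-order epsilon (Epsilon1) scheme, Eq. (17)."""
        t0 = np.asarray(theta, dtype=float)
        t1 = F(t0)                       # F(Theta^(n))
        t2 = F(t1)                       # F(F(Theta^(n)))
        inner = samelson_inverse(t0 - t1) + samelson_inverse(t2 - t1)
        return t1 + samelson_inverse(inner)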
In Saâdaoui and Roland (submitted for publication), we apply a cycling-based VEA to obtain a general method for accelerating the EM algorithm. In the present paper, we only consider the cycled first order of this method, which we call Epsilon1, i.e. the order of extrapolation is k = 1. Similar to a Barzilai–Borwein gradient method (though not identical to it), the model can also be seen as an adjusted polynomial scheme similar to (7), since it can be written in scalar form as
Θ^(n+1) = F(Θ^(n)) + (Θ^(n) − F(Θ^(n))) − α_n r_n.

Therefore, it is also possible to strengthen the strategy with a squaring stage similar to the one used for the previous methods, and hence we have
Θ^(n+1) = F(Θ^(n)) + (Θ^(n) − F(Θ^(n))) − 2α_n r_n + α_n² v_n,

where r_n = F(Θ^(n)) − Θ^(n) and v_n = F(F(Θ^(n))) − 2F(Θ^(n)) + Θ^(n). Generally, the squaring supplement seems advantageous, as it is for a Barzilai–Borwein method. Indeed, given Θ ∈ Ω and the forward difference operator Δ^k such that Δ^k u_i = Δ^{k−1} u_{i+1} − Δ^{k−1} u_i, we can write
ΔΘ^(n) = F(Θ^(n)) − Θ^(n) = r_n   and   Δr_n = Δ²Θ^(n) = v_n.

At each iteration n we set

α_{n−1} = (r_{n−1}^T r_{n−1}) / (r_{n−1}^T v_{n−1}).
We first apply the one-stage extrapolation scheme at Θ^(n) to obtain an intermediate vector

Φ_n = Θ^(n) − α_{n−1} ΔΘ^(n).

Then we apply the extrapolation method once again, this time at Φ_n, to obtain the parameter update for the next cycle:

Θ^(n+1) = Φ_n − α_{n−1} ΔΦ_n
        = Θ^(n) − 2α_{n−1} ΔΘ^(n) + α_{n−1}² Δ²Θ^(n)
        = Θ^(n) − 2α_{n−1} r_n + α_{n−1}² v_n.

Thus, in the squared method, the steplength α_{n−1} is computed once but used twice. Two evaluations of the base iteration F are performed in each cycle of the squared method. The squared methods require only an additional scalar-vector product and a vector addition beyond those of the one-stage scheme, hence the additional computational burden is negligible. Furthermore, they do not require any additional vector storage beyond that of the one-stage methods. The propagation of the error can be implicitly deduced from the previous paragraph. For the simple first-order epsilon method, the corresponding error equation is equivalent to
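A compact sketch of this "compute once, use twice" squared cycle is given below; it assumes, as one reading of the text, that the Barzilai–Borwein steplength computed from the current cycle's differences is carried over to the next cycle, with an arbitrary positive starting value alpha_prev. The function name is illustrative.

    import numpy as np

    def squared_epsilon1_cycle(theta, F, alpha_prev=1.0):
        """One cycle of the squared epsilon/Barzilai-Borwein scheme:
        the steplength is computed once and used twice."""
        t0 = np.asarray(theta, dtype=float)
        t1 = F(t0)                           # first EM update
        t2 = F(t1)                           # second EM update
        r = t1 - t0                          # Delta Theta^(n) = r_n
        v = t2 - 2.0 * t1 + t0               # Delta^2 Theta^(n) = v_n
        theta_new = t0 - 2.0 * alpha_prev * r + alpha_prev ** 2 * v
        alpha_next = np.dot(r, r) / np.dot(r, v)   # BB steplength reused in the next cycle
        return theta_new, alpha_next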
ξ_{n+1} = (I − α_{n−1}(J − I)) ξ_n + o(ξ_n).

The squared version is written as

ξ_{n+1} = (I − α_{n−1}(J − I))² ξ_n + o(ξ_n),
where ξ_n = Θ^(n) − Θ^* and J is the Jacobian. Therefore, the new method can be considered as a squared Barzilai–Borwein method. Furthermore, we can make a small remark concerning the stability of the squared schemes. Considering the relaxation parameter as constant, i.e. α_{n−1} = α, a sufficient stability condition for a squared method is
ρ(I − α(J − I)) < 1,

i.e. the spectral radius (the modulus of the largest eigenvalue) must be strictly less than unity. In general, the eigenvalues of the Jacobian are complex since the Jacobian is not symmetric. In the EM setting, however, when the Hessian of the observed data log-likelihood is negative definite, not only are the eigenvalues real, but they also lie in the half-open unit interval [0, 1). Consequently, for convergence it is sufficient to have

sup_i (1 − α(λ_i − 1)) < 1,

where 0 ≤ λ_i < 1, ∀i. This condition can be rewritten as

2/(λ_i − 1) ≤ α < 0,   ∀i.
Since EM sequences converge linearly, the epsilon-accelerated EM algorithm accelerates the convergence of the EM sequence without affecting its simplicity and stability (Kuroda and Sakakihara, 2006). In fact, Brezinski and Redivo Zaglia (1991) have completely described the convergence and acceleration properties of the vector epsilon algorithm only for special classes of vector sequences, whereas cycling has proved its importance when combined with polynomial extrapolation (Sidi, 1986b) and squared polynomial extrapolation (Varadhan and Roland, 2004). Thus, the mathematical analysis of the cycling-based epsilon method is not trivial. Furthermore, a simplicity advantage encourages us to resort to the epsilon algorithm, since it requires only O(N) arithmetic operations while the Newton–Raphson and quasi-Newton algorithms require O(N³) and O(N²), respectively, so that their computational cost is likely to become expensive as N becomes large (Kuroda and Sakakihara, 2006). We show, for example, that for a scalar sequence {Θ^(i)}_{i>0}, convergence is guaranteed since Eq. (9) can be formulated as
ε_2^(n) = Θ^(n+1) + [ 1/(Θ^(n) − Θ^(n+1)) + 1/(Θ^(n+2) − Θ^(n+1)) ]^{-1}
        = Θ^(n+1) + [Θ^(n) − Θ^(n+1)][Θ^(n+2) − Θ^(n+1)] / (Θ^(n+2) − 2Θ^(n+1) + Θ^(n)),    (10)
which is the same as the Aitken Δ² method. Traub later proved that this method is able to accelerate the convergence of linearly convergent sequences, in the sense that it satisfies the condition lim_{n→∞} (ε_2^(n) − s)/(s_n − s) = 0. Louis (1982) gave an acceleration formula that resembles Eq. (10), based on the multivariate version of Aitken's acceleration method, as follows:
ε_2^(n) = Θ^(n−1) + [I_N − J^(n−1)]^{-1} [Θ^(n) − Θ^(n−1)],    (11)

where I_N denotes the N × N identity matrix and J^(n−1) is the Jacobian matrix of the EM mapping evaluated at Θ^(n−1). In order to estimate [I_N − J^(n−1)]^{-1}, Louis recommended making use of the equation

[I_N − J^(n−1)]^{-1} = E[η(Θ^(n−1)|X)|y] I[Θ^(n−1)|y]^{-1},

where
η(Θ^(n−1)|x) = [∂²/∂Θ∂Θ^T] log f(x|Θ) |_{Θ=Θ^(n−1)}

and

I(Θ^(n−1)|y) = [∂²/∂Θ∂Θ^T] log g(y|Θ) |_{Θ=Θ^(n−1)}.
As shown in Meilijson (1989), Louis's acceleration method based on Eq. (11) is essentially equivalent to the Newton–Raphson algorithm in the neighborhood of the MLE. To summarize, note that throughout this paper, for either the simple or the squared strategy, the extrapolation is cycling-based.
3.2. An alternating extrapolation-EM method

The proposal to alternate both previous methods with an EM update comes from the usefulness of a filtering stage for smoothing the process on its way towards the fixed point. A simple or compound EM step (F(Θ), F(F(Θ)), . . .) follows an extrapolation step and can be viewed as a stabilization stage that keeps the iteration on the right track. In Varadhan and Roland (2004), the resort to such procedures is only tied to numerical problems: it allows a method to restart its convergence. The idea of incorporating this stage into the strategy is interesting, bearing in mind the non-monotone convergence that characterizes the given extrapolation methods, in particular the polynomial family. Indeed, integrating an EM update between two cycles gives more regularity to the resulting iterative trajectory, so that it takes on an aspect similar to that of the EM algorithm, namely its stability and monotonicity. Consequently, the new strategy appears quasi-linearly convergent, like the EM algorithm, but with a faster rate of convergence. Such a practice can also be decisive, especially for the polynomial methods (which are often unstable), since these schemes can fail to converge due to numerical problems such as stagnation (for the first order of RRE) and near breakdown (for the first order of MPE). The hybrid feature that combines each scheme with an EM update provides better conditions, in such a manner that both stagnation and near breakdown are overcome. To summarize, such an alternating extrapolation-EM method contributes to:
• Attenuating the fluctuation range of the iterative trajectory on its way towards the fixed point, which reflects the speed and accuracy of the strategy.
• Overcoming some numerical problems, especially the stagnation of the first-order RRE method and the near breakdown of the first-order MPE method.

In only a few lines of code, each method can easily be programmed and computed. In the following we give the skeleton of such algorithms.

While not converged:
 - Let Θ^(0) be the starting parameters of the current cycle
 - Θ^(1) = EMupdate(Θ^(0))
 - Θ^(2) = EMupdate(Θ^(1))
 - Compute T(Θ^(0), Θ^(1), Θ^(2)), using either of Eqs. (7) and (17)
 - Θ′ = T(Θ^(0), Θ^(1), Θ^(2))
 - Θ^(0) = EMupdate^(i)(Θ′), the alternation stage (i = 1 or 2)
 - Check for convergence: ‖Θ^(0) − Θ′‖ < tolerance.
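A runnable sketch of this skeleton follows; the EM update and the extrapolation map T are passed in as functions, and the names and default tolerance are illustrative rather than taken from the paper.

    import numpy as np

    def alternating_extrapolation_em(theta0, em_update, extrapolate,
                                     n_alternate=1, tol=1e-8, max_cycles=1000):
        """Cycling extrapolation with an alternating EM stage.

        em_update(theta)        -> one EM update F(theta)
        extrapolate(t0, t1, t2) -> extrapolated vector, e.g. Eq. (7) or Eq. (17)
        n_alternate             -> number of EM updates in the alternation stage (i = 1 or 2)
        """
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_cycles):
            t1 = em_update(theta)                     # Theta^(1)
            t2 = em_update(t1)                        # Theta^(2)
            theta_prime = extrapolate(theta, t1, t2)  # Theta' = T(Theta^(0), Theta^(1), Theta^(2))
            theta_new = theta_prime
            for _ in range(n_alternate):              # alternation (stabilization) stage
                theta_new = em_update(theta_new)
            if np.linalg.norm(theta_new - theta_prime) < tol:
                return theta_new
            theta = theta_new
        return theta

Any of the first-order schemes discussed above can be plugged in as the extrapolation map T.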
4. Numerical results

This part of the paper is devoted, on the one hand, to illustrating the contribution of the extrapolation methods on some well-known univariate and multivariate problems and, on the other hand, to comparing several of the generated variants with each other. In this comparison we should not expect to find an optimal method, generated from each family, that outperforms all the others on all datasets, since more than one factor can be considered when evaluating the reliability of a method (speed, stability, monotonicity, . . .). The methods considered here are the lowest in the scale of extrapolation for each family. Because the first order is used, each generated scheme is marked with 1. This order seems reasonable since we would like to accelerate a method which is close to a linear iteration in the neighborhood of the limit, and this choice looks more practical given that we essentially need a trade-off between speed of convergence and ease of use. Saâdaoui and Roland (submitted for publication) propose higher-order models applied to some significant practices in the field of computational statistics. The strategy of cycling consists in constructing, within each cycle, a vector sequence with a certain number of base iterations, and then using these iterates in the extrapolation schemes given by (7) and (17) to compute an extrapolated vector.

We compare a number of generated strategies (from both families) to the EM algorithm and to each other. We base this comparison on a performance measure given by the number of fevals necessary to reach a stationary level. Since every scheme absorbs several base iterations (EM updates, or fevals) within each cycle, we take into account the "cost of evaluation" of a method, i.e. simply the number of fevals it absorbs; this makes the numerical comparison fair. In the first subsection, we present a scalar case where we limit ourselves to comparing the two different extensions and to materializing the role of the alternating technique in a simple MLE problem. Afterwards, we devote a part to illustrating the contribution on some multivariate problems. The latter are much more complex and slower than the first and permit us to expound our contribution clearly and exhaustively. The compound chain "Cycling-Squaring-Alternating" is applied to the given extrapolation schemes. All the following results are obtained with the MATLAB 7 software running on a Pentium 4 machine with a 3.00 GHz processor and 500 MB of RAM.

4.1. Univariate case: Multinomial distribution

We illustrate the extrapolated EM algorithm on a famous genetic example first considered by Fisher and Balmukand. It is also discussed in Rao's monograph and especially by McLachlan and Krishnan (1997). Suppose we have a random sample of size
n = N from the selfing of a double heterozygote, where the 4 phenotypic classes will be represented roughly in proportion to their theoretical probabilities, their joint distribution being multinomial,
Multinomial( N, ( (2 + Θ)/4, (1 − Θ)/4, (1 − Θ)/4, Θ/4 ) ).    (12)
We describe a variant of Fisher's approach to exemplify the EM algorithm. Let y = (1997, 906, 904, 32) be a realization of the vector y = (y_1, y_2, y_3, y_4), believed to come from the multinomial distribution in (12), starting from the initial estimate Θ^(0) = 1. In fact, we reconsider the problem in accordance with Thisted (1988), with the intention of slowing down the convergence so as to be able to compare the different accelerated schemes comfortably and clearly. The probability mass function, given the data, is

g(y; Θ) = [n! / (y_1! y_2! y_3! y_4!)] (1/2 + Θ/4)^{y_1} ((1/4)(1 − Θ))^{y_2+y_3} ((1/4)Θ)^{y_4}.
Assume that instead of the original value y_1, the counts y_11 and y_12, such that y_11 + y_12 = y_1, could be observed, and that their probabilities are 1/2 and Θ/4, respectively. The complete data can be defined as x = (y_11, y_12, y_2, y_3, y_4). The probability mass function of the incomplete data y is g(y; Θ) = Σ g_c(x; Θ), where

g_c(x; Θ) = c(x) (1/2)^{y_11} (Θ/4)^{y_12} (1/4 − Θ/4)^{y_2+y_3} (Θ/4)^{y_4},

c(x) is free of Θ, and the summation is taken over all values of x for which y_11 + y_12 = y_1. The complete log-likelihood is

log L_c(Θ) = (y_12 + y_4) log(Θ) + (y_2 + y_3) log(1 − Θ).

Our goal is to find the conditional expectation of log L_c(Θ) given y, using the starting point Θ^(0),

Q(Θ, Θ^(0)) = E_{Θ^(0)}[log L_c(Θ)|y].

As log L_c(Θ) is a linear function of y_11 and y_12, the E-step is done by simply replacing y_11 and y_12 by their conditional expectations given y; then, in the M-step, we choose Θ^(1) so that Q(Θ, Θ^(0)) is maximized. After replacing y_11 and y_12 by their conditional expectations y_11^(0) and y_12^(0) in the Q-function, the maximum is obtained at
Θ^(1) = (y_12^(0) + y_4) / (y_12^(0) + y_2 + y_3 + y_4).
Now, the E- and M-steps are alternated. At the kth iteration we have
Θ^(k+1) = (y_12^(k) + y_4) / (n − y_11^(k)),

where y_11^(k) = (1/2) y_1 / (1/2 + Θ^(k)/4) and y_12^(k) = y_1 − y_11^(k). Now that the problem is set up, we apply some of the schemes given above, in order to illustrate and simultaneously compare these extrapolated versions with each other and with their counterpart, the simple EM. The alternating extrapolation-EM method is illustrated in Figs. 1a and 1b, where the convergence behavior of the extrapolated schemes, with and without alternating, is confronted with that of the classical EM algorithm. In the scalar parameter case, the first-order extrapolated schemes reduce to only one polynomial and one epsilon method. We display, for the five mentioned strategies, the trajectory of the increase of the likelihood function versus the number of fevals. As explained above, these plots take into account the number of base iterations within each cycle. The interesting feature to observe is the superiority of the polynomial extrapolation scheme. The contribution of the alternation can be remarkably seen for the polynomial method. In spite of its poor performance during the first cycles, the epsilon method shows better characteristics, especially in the later steps. Generally, this new generation of schemes can be considered as an interesting family of methods able to succeed the classical EM algorithm.

The contribution can also be seen indirectly, but plainly, through a visual comparison of the numerical schemes based on the decline of an error measure along the path towards the fixed point. We evaluate it by the residual norm

residue = ‖method(Θ^(n)) − Θ^(n)‖,

where "method" is the implicit mapping relating to each of the compared strategies. In Fig. 1c, the residual decrease of every scheme is reported in decimal logarithm, plotted against the number of fevals. The procedure continues until this measure becomes practically negligible for the software, i.e. the value of the residual norm becomes close to 10^{−18}. The figure confirms the outcomes presented for the likelihood increase and supports the recourse to extrapolation, particularly the polynomial one. We observe a dramatic collapse of the residual for the polynomial method, reached in only 13 fevals, as opposed to its counterparts, the epsilon method and the simple EM algorithm, both at 54 fevals. Consequently, the speed of convergence exceeds 13 times that of the ordinary EM.
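For reference, a compact sketch of the EM map F for this genetic example (using the classical split of the first cell; the data vector is hard-coded here only for illustration), which can be plugged into any of the cycling accelerators above:

    import numpy as np

    def em_update_multinomial(theta, y=(1997, 906, 904, 32)):
        """One EM update Theta^(k+1) = F(Theta^(k)) for the Fisher multinomial example."""
        y1, y2, y3, y4 = y
        n = y1 + y2 + y3 + y4
        # E-step: expected split of the first cell count y1 into (y11, y12)
        y11 = y1 * 0.5 / (0.5 + theta / 4.0)
        y12 = y1 - y11
        # M-step: maximizer of the expected complete-data log-likelihood
        return (y12 + y4) / (n - y11)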
Fig. 1a. The performance of the Polynomial method: The log-likelihood function is represented according to the number of fevals in logarithmic scale.
Fig. 1b. The performance of the Epsilon method: The log-likelihood function is represented according to the number of fevals in logarithmic scale.
Fig. 1c. The performance of the methods: The decimal logarithm of the residual norm is plotted versus the number of fevals.
4.2. Multivariate cases

In order to examine the performance of the extrapolation-accelerated EM algorithm in multivariate cases, we apply the algorithm to a data set for which the speed of convergence of the EM algorithm is quite slow. We test the performance of the extrapolation implementations and compare them to the EM algorithm in a multivariate case of maximum likelihood estimation,
Table 1
Data from The London Times on deaths during 1910–1912.

Deaths i:        0    1    2    3    4   5   6   7  8  9
Frequency n_i: 162  267  271  185  111  61  27   8  3  1
i.e. some vectors or matrices of parameters have to be estimated. Both categories of extrapolation given above are exploited to build two generations of new and simple schemes for accelerating the EM algorithm. In particular, a sequence of schemes with and without the alternating strategy is executed. The performance of the compound procedure of an extrapolation is illustrated together with the cycling-squaring combination, especially when it is complemented with an alternation stage. In this numerical experiment we moreover aim to reassess the squaring technique. All results are given according to the residual measure between two successive cycles until the stopping criterion is met. The latter is expressed by
‖method(Θ^(n)) − Θ^(n)‖ ≤ δ,    (13)

where δ is a threshold parameter that must be small enough to ensure maximum accuracy of the current iterate. There are six cycling strategies to compare to the EM algorithm, with and without alternating. These strategies are listed as follows.
• The minimal polynomial extrapolation method: MPE1,
• The reduced rank extrapolation method: RRE1,
• The epsilon extrapolation method: Epsilon1,
• The squared minimal polynomial extrapolation method: Squared-MPE1,
• The squared reduced rank extrapolation method: Squared-RRE1,
• The squared epsilon extrapolation method: Squared-Eps1.
4.2.1. Poisson mixture distribution

We consider here the famous data from The London Times during the years 1910–1912 reporting the number of deaths of women 80 years and older (Roland and Varadhan, 2005; Varadhan and Roland, in press). In Table 1, the row "Deaths i" gives the number of deaths per day, and the row "Frequency n_i" gives the number of days on which i deaths were recorded. A two-component mixture of Poisson distributions provides a good fit to the data, whereas a single Poisson has a poor fit; this is due to the different patterns of death during winter and summer. Finite mixture distributions are an efficient and popular class of statistical models for describing heterogeneity in observed data, and the EM algorithm is the best-known computational approach for parameter estimation in such problems. The likelihood for this problem can be written as
L(Θ) = ∏_{i=0}^{9} ( p e^{−µ_1} µ_1^i / i! + (1 − p) e^{−µ_2} µ_2^i / i! )^{n_i}.
We have to estimate a three-parameter vector Θ = (p, µ_1, µ_2). The EM algorithm leads to the iterative functions

p^(k+1) = Σ_i n_i π̃_i1^(k) / Σ_i n_i   and   µ_j^(k+1) = Σ_i i n_i π̃_ij^(k) / Σ_i n_i π̃_ij^(k),   j = 1, 2,

where

π̃_ij^(k) = p_j^(k) (µ_j^(k))^i e^{−µ_j^(k)} / [ p^(k) (µ_1^(k))^i e^{−µ_1^(k)} + (1 − p^(k)) (µ_2^(k))^i e^{−µ_2^(k)} ],   j = 1, 2,

with p_1^(k) = p^(k) and p_2^(k) = 1 − p^(k).
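A sketch of this EM map in code form (the data come from Table 1, the parameter ordering Θ = (p, µ1, µ2) follows the text, and the function and variable names are illustrative):

    import numpy as np

    DEATHS = np.arange(10)                                          # i = 0, ..., 9
    FREQ = np.array([162, 267, 271, 185, 111, 61, 27, 8, 3, 1])     # n_i from Table 1

    def em_update_poisson_mixture(theta, i=DEATHS, n=FREQ):
        """One EM update for the two-component Poisson mixture Theta = (p, mu1, mu2)."""
        p, mu1, mu2 = theta
        # E-step: posterior probabilities (the i! terms cancel in the ratio)
        w1 = p * np.exp(-mu1) * mu1 ** i
        w2 = (1.0 - p) * np.exp(-mu2) * mu2 ** i
        post1 = w1 / (w1 + w2)
        post2 = 1.0 - post1
        # M-step: weighted proportion and weighted means
        p_new = np.sum(n * post1) / np.sum(n)
        mu1_new = np.sum(i * n * post1) / np.sum(n * post1)
        mu2_new = np.sum(i * n * post2) / np.sum(n * post2)
        return np.array([p_new, mu1_new, mu2_new])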
Starting from the initial vector Θ^(0) = (0.3, 1.0, 2.5)^T, obtained by the method of moments, the EM algorithm takes an excruciating 534 iterations for the log-likelihood to attain its maximum l^* = −1989.946. Even worse, it takes 2082 iterations for the parameters to attain the stationary vector Θ^* = (0.3599, 1.256, 2.663)^T, with an accuracy recorded at a threshold δ = 10^{−7}. We show that the "Cycling-Squaring-Alternating" approach spectacularly improves the convergence for this multivariate maximum likelihood estimation problem. For every iterative scheme listed above, the strategy of cycling works by building, within each cycle, a vector sequence with a certain number of EM updates (two EM updates without alternation and three EM updates with alternation). Therefore, we evaluate all the numerical schemes on the basis of the number of EM updates (fevals) required to reach the tolerance given in (13) at the level 10^{−7}. This value is sufficient for the comparison, given the moderate speed of convergence in this application. The results are reported in Table 2. Next, we give shape to these results through the representations in Figs. 2a–2d, where the two families of extrapolation given above, without and with a stabilization (alternating) stage, are confronted using cycles based on two and three EM updates, respectively. As in the univariate case, we display the evolution of the residual norm of every scheme in decimal logarithm; the convergence is thus investigated in terms of residual norm.
Table 2
Two-component Poisson mixture estimation: the strategies are compared based on the number of fevals needed to reach a threshold discrepancy δ = 10^{−7}.

Method          No. fevals without alter.   No. fevals with alter.
EM              2082                        2082
MPE1            1532                        195
RRE1            266                         171
Epsilon1        490                         207
Squared-MPE1    246                         84
Squared-RRE1    354                         66
Squared-Eps1    491                         139
Fig. 2a. The convergence of the First Order schemes without Alternating: The residual norm (in decimal logarithm) versus the number of cycles (equivalent to 2 EM-updates).
Fig. 2b. The convergence of the First Order schemes with Alternating: The residual norm (in decimal logarithm) versus the number of cycles (equivalent to 3 EM-updates).
Besides the interesting features as far as speed is concerned, the figures show how a cycling method can be improved using the alternating and squaring strategies. All the first-order methods, without and with alternating, are plotted in Figs. 2a and 2b, respectively. The non-monotonicity of the polynomial methods compared with the epsilon method is plainly observable, despite the considerable decrease in the frequency of the fluctuations in Fig. 2b. The squared extensions are represented, also without and with alternating, in Figs. 2c and 2d, respectively. We can observe an even more substantial improvement, especially in the second case, where the cycling methods are endowed with both squaring and alternating: for all the schemes, the convergence becomes more stable and faster. Note that the epsilon method remains the most stable throughout the comparison.

The EM is shown to be very slow to converge for this problem, regardless of the starting value, because the data do not contain adequate information to clearly separate the two components. This can be seen by examining the eigenvalues of
Fig. 2c. The convergence of the Squared schemes without Alternating: The residual norm (in decimal logarithm) versus the number of cycles (equivalent to 2 EM-updates).
Fig. 2d. The convergence of the Squared schemes with Alternating: The residual norm (in decimal logarithm) versus the number of cycles (equivalent to 3 EM-updates).
the Jacobian matrix of the EM mapping at the MLE, i.e. the eigenvalues of J(Θ^*), which are computed as 0.9957, 0.7204 and 0. The extremely slow convergence of the EM is well explained by the fact that the spectral radius ρ(J(Θ^*)) is very close to 1. To recapitulate, we observe that the squared extrapolated methods have the best performance in both the polynomial and the epsilon first-order extrapolation families. They achieve a major gain, accelerating the EM algorithm by a factor of around 6–8.5 without alternation and 25–32 with alternation. With less efficiency, the one-step polynomial schemes RRE1 and MPE1 and the Epsilon1 method show a promising gain, especially thanks to the great contribution of the alternation process. The new extended epsilon family exhibits appealing characteristics: it succeeds in retaining the most attractive features of the EM algorithm, such as stability, flexibility and simplicity, and it therefore seems promising to exploit it further. In general, we have observed how a family of new and efficient iterative schemes can be used to accelerate the EM algorithm, and we have shown their neat performance, especially when they are well guided to the fixed point by accompanying the procedure with an EM stage within each cycle.

4.2.2. Two-dimensional Gaussian mixture

The recent increase of interest in multivariate Gaussian mixture distributions in many fields motivates us to use them for further experiments and to give the proposed schemes a broader scope in various empirical studies. We proceed with the well-known example of the Old Faithful Geyser. The data are composed of 222 eruptions of the Old Faithful Geyser in Yellowstone National Park. Each observation consists of two measurements: the duration of the eruption and the waiting time before the next eruption, both measured in minutes. This example can be modeled by a mixture of two Gaussian distributions in two dimensions, a feature clearly visible in Fig. 3a: the scatter plot shows the data clustered in two parts having nearly the same statistical characteristics. Analytically, a multivariate m-dimensional Gaussian has the same parameters as the univariate normal distribution, except that the mean is in the form of a vector µ and the covariance is in the form of an m × m
Fig. 3a. The Scatter plot of the Old Faithful Geyser data. We can subdivide it into two subsets having the same statistical characteristics.
covariance matrix Σ. The probability density function of a Gaussian is given by

Φ(x, µ, Σ) = (2π)^{−m/2} |Σ|^{−1/2} exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) ),
and at each state, we assume the observation is generated by J Gaussians; therefore the overall density is

f(x) = Σ_{j=1}^{J} π_j Φ(x, µ_j, Σ_j),
where the initial values of the weights π_j of the mixture components should sum up to one. Summarizing, the EM estimates of the new parameters in terms of the old ones are given by

• µ_j^(k+1) = Σ_{i=1}^{N} t_j(x_i)^(k) x_i / n_j^(k),
• Σ_j^(k+1) = (1/n_j^(k)) Σ_{i=1}^{N} t_j(x_i)^(k) (x_i − µ_j^(k+1))(x_i − µ_j^(k+1))^T,
• π_j^(k+1) = n_j^(k) / N,

where

t_j(x_i)^(k) = π_j^(k) Φ(x_i, µ_j^(k), Σ_j^(k)) / f(x_i)   and   n_j^(k) = Σ_{i=1}^{N} t_j(x_i)^(k),   for j = 1, 2.
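A sketch of one EM update for this two-component bivariate Gaussian mixture; the parameters are carried as plain arrays and the function names are illustrative, not taken from the paper.

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        """Multivariate normal density Phi(x, mu, Sigma) for a batch of points x (N x m)."""
        m = mu.shape[0]
        diff = x - mu
        inv = np.linalg.inv(sigma)
        quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
        norm = (2.0 * np.pi) ** (-m / 2.0) * np.linalg.det(sigma) ** (-0.5)
        return norm * np.exp(-0.5 * quad)

    def em_update_gaussian_mixture(params, x):
        """One EM update for a two-component Gaussian mixture.
        params = (pi, mus, sigmas) with pi of shape (2,), mus (2, m), sigmas (2, m, m)."""
        pi, mus, sigmas = params
        # E-step: responsibilities t_j(x_i)
        dens = np.stack([pi[j] * gaussian_pdf(x, mus[j], sigmas[j]) for j in range(2)])
        resp = dens / dens.sum(axis=0)                  # shape (2, N)
        # M-step: weighted counts, weights, means and covariances
        nj = resp.sum(axis=1)                           # n_j
        pi_new = nj / x.shape[0]
        mus_new = (resp @ x) / nj[:, None]
        sigmas_new = np.empty_like(sigmas)
        for j in range(2):
            diff = x - mus_new[j]
            sigmas_new[j] = (resp[j][:, None] * diff).T @ diff / nj[j]
        return pi_new, mus_new, sigmas_new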
As in the previous examples, the choice of the starting parameters Θ^(0) = (π^(0), µ^(0), Σ^(0)) is simply based on the method of moments: the data set is first divided into two clusters and, for each cluster, the empirical moments representing the initial guesses are computed. Generally, the choice of starting values remains a relevant task for a good start of the EM algorithm (Biernacki et al., 2003), and the speed of the algorithm may depend strongly on these values. A more sophisticated case is used in Saâdaoui and Roland (submitted for publication) to guess starting parameters for a higher-order finite mixture.

In Table 3, we report the results for the same strategies as above, namely the number of fevals each method needs to attain a residual norm between two cycles equal to δ = 10^{−16}. This tolerance may seem very small as a stopping criterion, but the fast convergence compared with the previous application, especially in the advanced stages, compels us to choose a small value in order to go as far as possible and thus obtain a suitable representation of the convergence. Once the parameters are computed, we can construct the related diagrammatic representation illustrated in Fig. 3b.

A noticeable difference from the previous application is the moderate speed of convergence of the EM algorithm here; thus, we do not expect as great a contribution from the strategies in this example. Consequently, the schemes will be compared not so much according to their speed as according to their aptitude to get through the first, difficult cycles (because of the non-monotonicity of the extrapolations) towards the solution Θ^*. Another new point is that we integrate two styles of alternation stages for each scheme:
• a simple EM update (F(Θ)),
• a compound EM update (F²(Θ) = F(F(Θ))).
Fig. 3b. The 2-dimensional Gaussian mixture distribution of the Old Faithful Geyser data. On the whole, 14 parameters were estimated.

Table 3
Two-component multivariate Gaussian mixture estimation: the strategies are compared based on the number of fevals needed to reach a threshold discrepancy δ = 10^{−16}.

Method          Without alter.   Simple alter.   Compound alter.
EM              140              140             140
MPE1            Diverges         71              86
RRE1            Diverges         59              82
Epsilon1        84               95              63
Squared-MPE1    42               Diverges        54
Squared-RRE1    56               Diverges        60
Squared-Eps1    54               63              68
The compound (second-order) alternation is assumed to suffice, but we could opt for higher orders if necessary, i.e. if a method fails to converge. The numerical results in the table show, in particular, how useful an alternating procedure can be for a first-order extrapolation method. Consider the first-order polynomial methods: the simple polynomial schemes MPE1 and RRE1 completely fail to converge without alternating, whereas their squared versions succeed; we therefore need either alternating or squaring in order to force them to converge. Furthermore, the epsilon method performs better: for both the simple and the squared versions, the convergence remains regular and exhibits no failures. Notice especially the appealing behavior of the newly proposed scheme Squared-Eps1. Finally, we note the excellent results obtained whenever a compound alternation is used; it is therefore profitable to insert it into the iterative procedure, since it makes the convergence both safer and faster.

5. Conclusion

The EM algorithm is a ubiquitous tool for calculating maximum likelihood estimators. However, a common criticism is that its convergence is slow, which induces problems, especially in terms of machine cost, for many critical empirical studies. In this paper, simple and fast extrapolation-based EM algorithms are proposed that accelerate the convergence of the EM sequence by incorporating classical vector extrapolation algorithms. The vector extrapolation algorithm is in itself a fairly simple computational procedure, and its computational cost is lower than that of Newton-type algorithms. The extrapolation-based EM algorithm does not require the computation of a matrix inverse at each iteration and is thus numerically stable. Therefore, this accelerated EM algorithm is an extension within the framework of the EM algorithm that does not affect its simplicity, stability and flexibility. Numerical experiments demonstrate that the accelerated EM algorithm produces a sequence converging to the MLEs much faster than the sequence generated by the EM algorithm, in both univariate and multivariate cases. Hence, the accelerated EM algorithm finds sufficiently accurate estimates using a smaller number of EM iterations, especially with the squared extrapolation versions. We also highlighted a favorable relationship linking an extrapolated EM to an EM map at every cycle as far as stabilization is concerned, and we showed that an alternating process can dramatically decrease the number of iterations required to optimize the log-likelihood.

Acknowledgments

I would like to thank Christophe Roland for helpful discussions and comments on this work. I am also indebted to the anonymous referees for constructive and valuable comments.
Appendix

A.1. Epsilon algorithm

Let Θ^(i) denote a vector of dimension N that converges to a vector Θ^* as i increases. Let the inverse [s]^{-1} of a vector s be defined by

[s]^{-1} = s / ‖s‖²,   ‖s‖² = (s, s),

in which (s, s) is the scalar product of s with itself. In general, the vector epsilon algorithm for a sequence {Θ^(i)}_{i>0} starts with

ε_{-1}^(n) = 0,   ε_0^(n) = Θ^(n),

and then generates the vectors ε_{k+1}^(n) by

ε_{k+1}^(n) = ε_{k-1}^(n+1) + [ε_k^(n+1) − ε_k^(n)]^{-1},   k, n = 0, 1, 2, . . . .    (14)

For the case k + 1 = 2r + 2, we have the iteration form

ε_{2r+2}^(n) = ε_{2r}^(n+1) + [ [ε_{2r}^(n) − ε_{2r}^(n+1)]^{-1} + [ε_{2r}^(n+2) − ε_{2r}^(n+1)]^{-1} + [ε_{2r-2}^(n+2) − ε_{2r}^(n+1)]^{-1} ]^{-1}    (15)

from Eq. (14); see Brezinski and Redivo Zaglia (1991). For practical implementation, we apply the case r = 0 to Eq. (15):

ε_2^(n) = ε_0^(n+1) + [ [ε_0^(n) − ε_0^(n+1)]^{-1} + [ε_0^(n+2) − ε_0^(n+1)]^{-1} + [ε_{-2}^(n+2) − ε_0^(n+1)]^{-1} ]^{-1}.    (16)

Then, from the initial conditions ε_0^(n) = Θ^(n) and ε_{-2}^(n) = ∞ of Brezinski and Redivo Zaglia (1991), the iteration in Eq. (16) becomes

ε_2^(n) = Θ^(n+1) + [ [Θ^(n) − Θ^(n+1)]^{-1} + [Θ^(n+2) − Θ^(n+1)]^{-1} ]^{-1},    (17)

because [ε_{-2}^(n+2) − ε_0^(n+1)]^{-1} = [∞ − Θ^(n+1)]^{-1} = 0 from the definition of [s]^{-1}.
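For completeness, a small sketch of the recursion (14) with the vector inverse above; it is a direct, unoptimized transcription, and the function names are illustrative.

    import numpy as np

    def vector_inverse(s):
        """Samelson inverse [s]^{-1} = s / ||s||^2."""
        return s / np.dot(s, s)

    def vector_epsilon_table(thetas, k_max):
        """Build the epsilon table of Eq. (14) from iterates Theta^(0), ..., Theta^(m).
        Returns eps[k][n]; the even columns eps[2r][n] carry the extrapolated vectors."""
        m = len(thetas)
        eps = {-1: [np.zeros_like(np.asarray(thetas[0], dtype=float)) for _ in range(m + 1)],
               0: [np.asarray(t, dtype=float) for t in thetas]}
        for k in range(k_max):
            col = []
            for n in range(len(eps[k]) - 1):
                col.append(eps[k - 1][n + 1] + vector_inverse(eps[k][n + 1] - eps[k][n]))
            eps[k + 1] = col
        return eps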
A.2. Proof of Theorem 1

Inside a cycling procedure, an iteration can be written as

Θ̃^(n+1) = F(Θ̃^(n)) + [ [Θ̃^(n) − F(Θ̃^(n))]^{-1} + [F(F(Θ̃^(n))) − F(Θ̃^(n))]^{-1} ]^{-1},    (18)

where

Θ̃^(n) = F(Θ^(n)) + [ [Θ^(n) − F(Θ^(n))]^{-1} + [F(F(Θ^(n))) − F(Θ^(n))]^{-1} ]^{-1}.    (19)
{Θ^(n)}_{n≥0} and {Θ̃^(n)}_{n≥0} are the sequences generated, respectively, by the EM algorithm and by the cycling-based epsilon-accelerated EM algorithm. For n → ∞, we have Θ^(n) → Θ^*. The purpose is to show that the sequence generated by the cycling-based method given by Eq. (18) converges to the one generated by the EM algorithm, and consequently to the fixed point of the EM sequence, that is Θ^*. Therefore, it is necessary to prove first that Θ̃^(n) → F(Θ^(n)), or simply that the quantity
δ^(n) = [Θ^(n) − F(Θ^(n))]^{-1} + [F(F(Θ^(n))) − F(Θ^(n))]^{-1}

diverges to infinity as n → ∞, given that lim_{n→∞} Θ^(n) = Θ^*. From Eq. (3) we have

F(Θ^(n)) − Θ^* = λ(Θ^(n) − Θ^*) + o(‖Θ^(n) − Θ^*‖),   n → ∞,    (20)
then

Θ^(n) − F(Θ^(n)) = (1 − λ)(Θ^(n) − Θ^*) + o(‖Θ^(n) − Θ^*‖),   n → ∞.    (21)
Similarly to Eq. (20), we have

F(F(Θ^(n))) − Θ^* = λ(F(Θ^(n)) − Θ^*) + o(‖F(Θ^(n)) − Θ^*‖),   n → ∞,    (22)

and from Eqs. (20) and (22), we also have

F(F(Θ^(n))) − F(Θ^(n)) = −λ(Θ^(n) − F(Θ^(n))) + o(‖Θ^(n) − Θ^*‖),   n → ∞,    (23)
thus, using Eqs. (21) and (23), we can obtain
δ^(n) ≃ γ (Θ^(n) − Θ^*)^{-1},   n → ∞,

where γ = −1/λ. Since Θ^(n) → Θ^* as n → ∞, δ^(n) diverges to infinity, and Θ̃^(n) is close to F(Θ^(n)). Now, let δ̃^(n) = Θ̃^(n+1) − F(Θ̃^(n)). Given that lim_{n→∞} Θ̃^(n) = F(Θ^(n)), we can also write
δ̃^(n) ≃ γ (F(Θ^(n)) − Θ^*)^{-1} ≃ Γ (Θ^(n) − Θ^*)^{-1},   n → ∞,

where Γ = −γ². Also, δ̃^(n) diverges to infinity as n → ∞, and Θ̃^(n+1) → F(Θ̃^(n)) = F(F(Θ^(n))). Up to some order N, we can write

Θ̃^(n+N) → F^N(Θ^(n)),   n → ∞,
where F^i(Θ) denotes the composition of the mapping F applied i times to Θ.

References

Barzilai, J., Borwein, J.M., 1988. Two-point step size gradient methods. IMA Journal of Numerical Analysis 8, 141–148.
Berlinet, A., Roland, Ch., 2007. Acceleration schemes with application to the EM algorithm. Computational Statistics and Data Analysis 51, 3689–3702.
Biernacki, C., Celeux, G., Govaert, G., 2003. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis 41, 561–575.
Bishop, C., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
Brezinski, C., Redivo Zaglia, M., 1991. Extrapolation Methods. Theory and Practice. North-Holland, Amsterdam.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39 (1), 1–38.
Ghahramani, Z., Jordan, M., 1995. Learning from incomplete data. Technical Report AI Lab Memo No. 1509, CBCL Paper No. 108, MIT AI Lab.
Jamshidian, M., Jennrich, R.I., 1993. Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association 88, 221–228.
Jamshidian, M., Jennrich, R.I., 1997. Acceleration of the EM algorithm by using quasi-Newton methods. Journal of the Royal Statistical Society B 59, 569–587.
Jordan, M., Jacobs, R., 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6, 181–214.
Kuroda, M., Sakakihara, M., 2006. Accelerating the convergence of the EM algorithm using the vector epsilon algorithm. Computational Statistics and Data Analysis 51 (3), 1549–1561.
Laird, N., Lange, N., Stram, D., 1987. Maximum likelihood computation with repeated measures: Application of the EM algorithm. Journal of the American Statistical Association 82, 97–105.
Lange, K., 1995b. A quasi-Newton acceleration of the EM algorithm. Statistica Sinica 5, 1–18.
Liu, C., Rubin, D.B., 1994. The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81, 633–648.
Liu, C., Rubin, D.B., Wu, Y.N., 1998. Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika 85, 755–770.
Louis, T.A., 1982. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society B 44 (2), 226–233.
McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. John Wiley and Sons.
Meilijson, I., 1989. A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society B 51, 127–138.
Meng, X.L., Rubin, D.B., 1993. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80, 267–278.
Meng, X.L., Van Dyk, D., 1997. The EM algorithm: An old folk-song sung to a fast new tune. Journal of the Royal Statistical Society B 59, 511–567.
Raydan, M., 1993. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis 13, 321–326.
Raydan, M., 1997. The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization 7, 26–33.
Raydan, M., Svaiter, B.F., 2002. Relaxed steepest descent and Cauchy–Barzilai–Borwein method. Computational Optimization and Applications 21, 155–167.
Redner, R., Walker, H., 1984. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26 (2), 195–239.
Roland, Ch., Varadhan, R., 2005. New iterative schemes for nonlinear fixed point problems, with applications to problems with bifurcations and incomplete-data problems. Applied Numerical Mathematics 55, 215–226.
Saâdaoui, F., Roland, Ch., 2008. Review and comparison of some extrapolation methods for accelerating the EM algorithm (submitted for publication).
Sidi, A., 1986b. Convergence and stability properties of minimal polynomial and reduced rank extrapolation algorithms. SIAM Journal on Numerical Analysis 23, 197–209.
Thisted, R.A., 1988. Elements of Statistical Computing: Numerical Computation. Chapman and Hall, New York.
Varadhan, R., Roland, Ch., 2004. Squared extrapolation methods (SQUAREM): A new class of simple and efficient numerical schemes for accelerating the convergence of the EM algorithm. Department of Biostatistics Working Paper 63. Johns Hopkins University, pp. 1–70.
Varadhan, R., Roland, Ch., 2005. Simple, stable and general methods for accelerating any EM algorithm (in press).
Wu, C.F.J., 1983. On the convergence properties of the EM algorithm. The Annals of Statistics 11, 95–103.
Wynn, P., 1962. Acceleration techniques for iterated vector and matrix problems. Mathematics of Computation 16, 301–322.