Computational Statistics and Data Analysis 56 (2012) 255–265
Accelerating the quadratic lower-bound algorithm via optimizing the shrinkage parameter

Guo-Liang Tian a, Man-Lai Tang b,∗, Chunling Liu c

a Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam Road, Hong Kong, China
b Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
c Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China
Article history: Received 22 January 2011; received in revised form 14 July 2011; accepted 19 July 2011; available online 2 August 2011.

Keywords: Cox proportional hazards model; EM-type algorithms; Logistic regression; Newton–Raphson algorithm; Optimal QLB algorithm; QLB algorithm
Abstract

When the Newton–Raphson algorithm or the Fisher scoring algorithm does not work and EM-type algorithms are not available, the quadratic lower-bound (QLB) algorithm may be a useful optimization tool. However, like all EM-type algorithms, the QLB algorithm may also suffer from slow convergence, which can be viewed as the cost of having the ascent property. This paper proposes a novel 'shrinkage parameter' approach to accelerate the QLB algorithm while maintaining its simplicity and stability (i.e., monotonic increase in log-likelihood). The strategy is first to construct a class of quadratic surrogate functions Q_r(θ|θ^(t)) that induces a class of QLB algorithms indexed by a 'shrinkage parameter' r (r ∈ R) and then to optimize r over R under some criterion of convergence. For three commonly used criteria (i.e., the smallest eigenvalue, the trace and the determinant), we derive a uniformly optimal shrinkage parameter and find an optimal QLB algorithm. Some theoretical justifications are also presented. Next, we generalize the optimal QLB algorithm to problems with a penalizing function and investigate the associated convergence properties. The optimal QLB algorithm is applied to fit a logistic regression model and a Cox proportional hazards model. Two real datasets are analyzed to illustrate the proposed methods.
1. Introduction

The logistic regression model and the Cox proportional hazards model are two important models in applied statistics and are widely used in biomedical studies. Because of their quadratic convergence, the Newton–Raphson algorithm or the Fisher scoring algorithm is usually employed to find the maximum likelihood estimates (MLEs) of the parameters of interest. However, the two methods have drawbacks, such as requiring the tedious calculation of the matrix inverse of the observed information (i.e., the negative Hessian) or the expected information at each iteration. Furthermore, the Newton–Raphson algorithm does not necessarily increase the log-likelihood and may even diverge (Cox and Oakes, 1984, p. 172). For example, Böhning and Lindsay (1988, pp. 645–646) provided a simple example of a concave objective function for which the Newton–Raphson method does not converge. In Section 5.1 of this paper, we further show that the Newton–Raphson and Fisher scoring algorithms do not work for logistic regression with the cancer remission data (Lee, 1974). If a missing-data structure could be created, the EM algorithm (Dempster et al., 1977; Meng and Rubin, 1993) would be the best choice, as it possesses the ascent property guaranteeing stable, monotone convergence at a linear rate. It is well known, however, that the stability of the EM algorithm comes at the cost of slow convergence. Different methods have been
∗ Corresponding author. Tel.: +852 3411 7018. E-mail address: [email protected] (M.-L. Tang).
proposed to accelerate its convergence. They include the conjugate gradient method (Jamshidian and Jennrich, 1993), the quasi-Newton method (Lange, 1995; Jamshidian and Jennrich, 1997; Zhou et al., 2011), the space-alternating generalized EM algorithm (Fessler and Hero, 1994), the 'working parameter' method (Meng and van Dyk, 1997, 1998; van Dyk, 2000; Tan et al., 2007), the parameter expansion method (Liu et al., 1998), the vector epsilon method (Kuroda and Sakakihara, 2006; Wang et al., 2008), the squared iterative method (Varadhan and Roland, 2008), the parabolic acceleration method (Berlinet and Roland, 2009), and so on. However, EM-type algorithms may not apply to generalized linear models (e.g., logistic regression and log-linear models) or to Cox proportional hazards models because of the absence of a missing-data structure. Therefore, for problems in which a missing-data structure does not exist or is not readily available, minorization–maximization (MM) algorithms (Lange et al., 2000; Hunter and Lange, 2004) are often useful alternatives.

As a special case of MM algorithms, the quadratic lower-bound (QLB) algorithm (Böhning and Lindsay, 1988) is an optimization device which monotonically increases the log-likelihood at each iteration. Like EM-type algorithms, the QLB algorithm may also suffer from slow convergence, which can be viewed as the cost of holding the ascent property. For complicated problems or high-dimensional data, the slow convergence can become much worse. Motivated by the 'working parameter' idea (Meng and van Dyk, 1997) for accelerating the EM algorithm, we propose in this paper a novel 'shrinkage parameter' approach to accelerate the QLB algorithm while maintaining its simplicity and stability (i.e., monotonic increase in log-likelihood). The strategy is first to construct a class of quadratic surrogate functions Q_r(θ|θ^(t)) that induces a class of QLB algorithms indexed by a 'shrinkage parameter' r (r ∈ R) and then to optimize r over R under some criterion of convergence. For all three commonly used criteria of convergence (i.e., the smallest eigenvalue, the trace and the determinant of the matrix I_q − ∇M_r(θ̂), where I_q is the q × q identity matrix and ∇M_r(θ̂) denotes the matrix rate of convergence defined in (3.7)), we derive a uniformly optimal shrinkage parameter that corresponds to an optimal QLB algorithm.

The rest of the article is organized as follows. In Section 2, we briefly review the QLB algorithm. Section 3 proposes the 'shrinkage parameter' approach to accelerate the original QLB algorithm; some theoretical justifications are derived and an optimal QLB algorithm is provided. Section 4 generalizes the optimal QLB algorithm to penalized problems and investigates some convergence properties in this new setting. In Section 5, we apply the optimal QLB algorithm to fit a logistic regression model and a Cox proportional hazards model, and two real examples are analyzed to illustrate the proposed methods. We conclude with a discussion in Section 6. The technical proofs of all propositions are given in the Appendix.

2. The QLB algorithm

Let ℓ(θ) denote the log-likelihood function, which is assumed to be twice continuously differentiable and concave, and let θ ∈ R^q be the parameter vector of interest. The QLB algorithm is an iterative approach with monotonic convergence for computing the MLE
θ̂ = arg max_{θ∈R^q} ℓ(θ).    (2.1)
Let ∇ denote the partial derivative operator, ∇ℓ(θ) the score vector and −∇²ℓ(θ) the observed information. A key assumption for the QLB is that there exists a symmetric positive definite matrix B that does not depend on θ and globally majorizes the observed information, i.e.,

B ≥ −∇²ℓ(θ)  for all θ ∈ R^q.    (2.2)
In other words, B + ∇²ℓ(θ) is a non-negative definite matrix. Based on this B, Böhning and Lindsay (1988) constructed a quadratic surrogate function

Q_1(θ|θ^(t)) = ℓ(θ^(t)) + (θ − θ^(t))^⊤ ∇ℓ(θ^(t)) − 0.5 (θ − θ^(t))^⊤ B (θ − θ^(t)),  θ, θ^(t) ∈ R^q,    (2.3)
and showed that increasing Q_1(θ|θ^(t)) forces an increase in ℓ(θ) for all θ ∈ R^q. This ascent property implies that finding (2.1) is equivalent to iteratively finding

θ^(t+1) = arg max_{θ∈R^q} Q_1(θ|θ^(t)) = θ^(t) + B^{-1} ∇ℓ(θ^(t))    (2.4)
until the sequence {θ^(t)}_{t=0}^∞ converges. The original QLB algorithm defined in (2.4) is particularly suited to problems where EM-type algorithms cannot be applied because of the lack of a missing-data structure. Moreover, the algorithm retains the same ascent property as all EM-type algorithms, thus assuring monotone convergence. Another merit of the algorithm is that the matrix inverse B^{-1} needs to be computed only once for all iterations. The QLB algorithm can be applied to logistic regression models, Cox proportional hazards models (Böhning and Lindsay, 1988), multinomial logistic regression models (Böhning, 1992), and probit regression models (Böhning, 1999).

3. Accelerating the QLB algorithm via optimizing the shrinkage parameter

To accelerate the original QLB algorithm, we first construct a class of QLB algorithms indexed by a 'shrinkage parameter' r (0.5 ≤ r < 1), and then prove that these QLB algorithms retain the ascent property (i.e., the monotone convergence).
Next, we consider three commonly used criteria of convergence speed (i.e., the global speed or smallest eigenvalue, the trace and the determinant) and prove local convergence. Under these three criteria, we determine a uniformly optimal shrinkage parameter, which induces an optimal QLB algorithm.

3.1. A class of QLB algorithms

To construct a class of quadratic surrogate functions, we first formally define two notions.

Definition 1. A matrix B (> 0) is said to be a global majorization (GM) matrix if B does not depend on θ and satisfies condition (2.2). Further, B is said to be the smallest GM matrix if (i) B is a GM matrix; and (ii) for any B′ > 0 such that B ≥ B′ ≥ −∇²ℓ(θ) for all θ ∈ R^q, we have B = B′.

Remark 1. For logistic regression models, Cox proportional hazards models, multinomial logistic regression models, and probit regression models, Böhning and Lindsay have found such smallest GM matrices. In general, however, a GM matrix need not exist; for Poisson regression models, for example, no GM matrix exists. Even finding a GM matrix for a specific model is generally not an easy task, so further exploration along this direction is needed.

From now on, for notational convenience, we denote the smallest GM matrix by B. First, there are infinitely many GM matrices majorizing B. In fact,
{rB : r ≥ 1}    (3.1)

is a family of GM matrices. Second, the convergence rate of the QLB algorithm is entirely controlled by the length of the step vector B^{-1}∇ℓ(θ^(t)). When the smallest GM matrix B is too 'big', the step vector B^{-1}∇ℓ(θ^(t)) becomes too short, leading to slow convergence. This motivates us to consider the following matrix family:

{rB : r_0 ≤ r < 1},  0 < r_0 < 1,    (3.2)
where r is called the shrinkage parameter and r_0 the smallest shrinkage parameter. Obviously, the definition of B implies that no member of (3.2) is a GM matrix. Based on the two matrix families (3.1) and (3.2), we construct a class of quadratic surrogate functions as in (2.3),

Q_r(θ|θ^(t)) = ℓ(θ^(t)) + (θ − θ^(t))^⊤ ∇ℓ(θ^(t)) − 0.5 r (θ − θ^(t))^⊤ B (θ − θ^(t)),  r ≥ r_0,    (3.3)
and as in (2.4) define a class of QLB algorithms by iteratively computing
θ^(t+1) = arg max_{θ∈R^q} Q_r(θ|θ^(t)) = θ^(t) + (rB)^{-1} ∇ℓ(θ^(t)),  r ≥ r_0.    (3.4)
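To make iteration (3.4) concrete, here is a minimal sketch (ours, not part of the original algorithmic description) of the shrinkage-parameter QLB recursion for a generic concave log-likelihood; the functions loglik and score and the GM matrix B are assumed to be supplied by the user, r = 1 recovers the traditional QLB algorithm, and Proposition 1 below shows that the ascent property requires r ≥ 0.5.

```python
import numpy as np

def qlb(theta0, loglik, score, B, r=0.5, tol=1e-8, max_iter=10000):
    """Shrinkage-parameter QLB iteration (3.4): theta <- theta + (rB)^{-1} score(theta).

    theta0  : starting value (1-D array of length q)
    loglik  : function theta -> log-likelihood value
    score   : function theta -> gradient of the log-likelihood
    B       : global majorization matrix (q x q), factored once
    r       : shrinkage parameter, r >= 0.5 (r = 1 is the traditional QLB)
    """
    L = np.linalg.cholesky(r * B)                # one factorization for all iterations
    theta = np.asarray(theta0, float)
    ll_old = loglik(theta)
    for t in range(max_iter):
        g = score(theta)
        step = np.linalg.solve(L.T, np.linalg.solve(L, g))   # (rB)^{-1} g via Cholesky
        theta = theta + step                     # update (3.4)
        ll_new = loglik(theta)
        if abs(ll_new - ll_old) < tol:           # simple stopping rule on the log-likelihood
            break
        ll_old = ll_new
    return theta, t + 1
```

Note that the single factorization of rB is reused at every iteration, which is precisely the computational advantage of the QLB scheme over the Newton–Raphson algorithm.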
The following results show that, within the family (3.2), the monotone convergence (or ascent property) holds only for those QLB algorithms with 0.5 ≤ r < 1; that is, the smallest shrinkage parameter is r_0 = 0.5. The proof of Proposition 1 is given in the Appendix.

Proposition 1 (Ascent Property). Let B be the smallest GM matrix and let the QLB algorithms be defined by (3.3) and (3.4). (i) When r ≥ 1, the majorization inequality
ℓ(θ) − Q_r(θ|θ^(t)) ≥ 0  for all θ, θ^(t) ∈ R^q    (3.5)

holds, with equality if and only if θ = θ^(t); and the ascent property holds in the sense that increasing Q_r(θ|θ^(t)) forces an increase in ℓ(θ) for all θ ∈ R^q. (ii) When 0.5 ≤ r < 1, the majorization inequality (3.5) is in general not guaranteed, but the ascent property still holds; that is, ℓ(θ^(t+1)) ≥ ℓ(θ^(t)), where θ^(t+1) and θ^(t) are determined by (3.4).

3.2. Criteria of acceleration and local convergence

From this subsection on, we only consider the family of QLB algorithms (3.4) with r_0 = 0.5. The speed of convergence of these QLB algorithms depends on the shrinkage parameter r. A sequence {θ^(t)}_{t=0}^∞ is said to converge linearly to a fixed
point θ̂ ∈ R^q if there exists a constant c ∈ (0, 1) satisfying ‖θ^(t+1) − θ̂‖ ≤ c‖θ^(t) − θ̂‖. For a fixed r (≥ 0.5), let the sequence {θ^(t)} be generated by the QLB algorithm (3.4), which in fact defines a mapping θ → M_r(θ) from R^q to R^q, where

M_r(θ) = θ + (rB)^{-1} ∇ℓ(θ).    (3.6)

If the iterative mapping θ^(t+1) = M_r(θ^(t)), t = 0, 1, . . . , converges to θ̂ ∈ R^q, then θ̂ = M_r(θ̂) is a fixed point of the mapping. Expanding M_r(θ^(t)) in a Taylor series in a neighborhood of θ̂, we obtain (Meng, 1994)

θ^(t+1) − θ̂ = ∇M_r(θ̂)(θ^(t) − θ̂).    (3.7)
Therefore, in the neighborhood of θˆ , the QLB is also a linear iterative algorithm like the EM algorithm (Meng and Rubin, 1991).
As in Meng (1994), we refer to ∇M_r(θ̂) as the matrix rate of convergence of the QLB sequence {θ^(t)}, and to the largest eigenvalue ρ{∇M_r(θ̂)} of ∇M_r(θ̂) as the global rate of convergence of {θ^(t)}. In the optimization literature, the absolute value of ρ{∇M_r(θ̂)}, i.e., |ρ{∇M_r(θ̂)}|, is called the spectral radius of ∇M_r(θ̂) (Lange, 1999, p. 71). Since a larger value of ρ{∇M_r(θ̂)} corresponds to a slower algorithm, we refer to the smallest eigenvalue s{I − ∇M_r(θ̂)} = 1 − ρ{∇M_r(θ̂)} of I − ∇M_r(θ̂) as the global speed of the algorithm. Two other commonly used criteria are the trace and the determinant. We have the following result on local convergence.

Proposition 2 (Local Convergence). For a fixed r, let the sequence {θ^(t)}_{t=0}^∞ be generated by the QLB algorithm (3.3) and (3.4). Then the algorithm converges linearly to a local maximum θ̂ with matrix rate of convergence
∇M_r(θ̂) = I_q − (rB)^{-1}[−∇²ℓ(θ̂)]    (3.8)

for r ≥ 0.5.

3.3. Uniformly optimal shrinkage parameter

Under all three criteria, i.e., the global speed, the trace and the determinant, the following results show that the uniformly optimal shrinkage parameter is r̂_gs = r̂_tr = r̂_det = 0.5.

Proposition 3. Let the QLB algorithms be defined by (3.3) and (3.4). Then the smallest eigenvalue, the trace and the determinant of I_q − ∇M_r(θ̂) are given by

s{I_q − ∇M_r(θ̂)} = r^{-1} · s{I_q − ∇M_1(θ̂)} ∈ (0, r^{-1}),
tr{I_q − ∇M_r(θ̂)} = r^{-1} · tr{I_q − ∇M_1(θ̂)}  and
det{I_q − ∇M_r(θ̂)} = r^{-q} · det{I_q − ∇M_1(θ̂)},

respectively; they are monotonically decreasing functions of r, achieving their maximum values at the same point r = 0.5.

4. Generalization to penalized optimization

Optimization with linear equality or quadratic constraints (e.g., ridge regression) can be reformulated as a penalized optimization problem without constraints. In this section, we generalize the QLB algorithm (3.4) to the penalized maximization problem
θ̃ = arg max_{θ∈R^q} ℓ_λ(θ) = arg max_{θ∈R^q} {ℓ(θ) − λJ(θ)},    (4.1)
and study the corresponding matrix rate of convergence, where the assumption on ℓ(θ) is the same as in Sections 2 and 3, J(θ) is a penalty function and λ > 0 is a smoothing parameter. The penalized MLE θ̃ is closely related to the posterior mode (or maximum a posteriori estimate) in the sense that the function c · e^{−λJ(θ)} in (4.1) plays the computational role of a prior density of θ. The QLB algorithm (3.4) implies that θ̃ can be obtained by iteratively calculating
θ^(t+1) = arg max_{θ∈R^q} Q_rλ(θ|θ^(t))    (4.2)
        = arg max_{θ∈R^q} {Q_r(θ|θ^(t)) − λJ(θ)}    (4.3)
        = θ^(t) + (rB)^{-1}[∇ℓ(θ^(t)) − λ · ∇J(θ^(t+1))],  r ≥ 0.5,    (4.4)
where Q_r is defined by (3.3). Similar to Propositions 1–3, we have the following results.

Proposition 4 (Ascent Property). Let ℓ_λ(θ) be defined by (4.1) and the sequence {θ^(t)}_{t=0}^∞ be generated by (4.3). (i) When r ≥ 1, ℓ_λ(θ) − Q_rλ(θ|θ^(t)) achieves its minimum zero at θ = θ^(t); and the ascent property holds in the sense that increasing Q_rλ(θ|θ^(t)) forces an increase in ℓ_λ(θ) for all θ ∈ R^q. (ii) When 0.5 ≤ r < 1 and J(θ) is a linear or quadratic function of θ, the ascent property still holds, i.e., ℓ_λ(θ^(t+1)) ≥ ℓ_λ(θ^(t)), where θ^(t+1) and θ^(t) are determined by (4.4).

Proposition 5 (Global Speed of Convergence). Let θ̃ be defined by (4.1), λ be a fixed positive constant, and J(θ) be twice continuously differentiable and convex. Then, (i) the matrix rate of convergence for the QLB algorithm (4.3) is given by
∇M_rλ(θ̃) = [rB + λ · ∇²J(θ̃)]^{-1}(∇²ℓ(θ̃) + rB),  r ≥ 0.5,    (4.5)

and the global rate of convergence satisfies

ρ{∇M_rλ(θ̃)} < 1.    (4.6)

(ii) The global speed s{I_q − ∇M_rλ(θ̃)} is a monotonically decreasing function of r, achieving its maximum at r = 0.5.
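As a concrete illustration of the update (4.3)–(4.4), consider a quadratic (ridge-type) penalty J(θ) = θ^⊤Cθ with C ≥ 0, a case covered by Proposition 4(ii). Since ∇J(θ) = 2Cθ, the implicit relation (4.4) can be solved explicitly as θ^(t+1) = (rB + 2λC)^{-1}[rBθ^(t) + ∇ℓ(θ^(t))]. The following sketch (our illustration, with a user-supplied score function and GM matrix B) implements one such step.

```python
import numpy as np

def penalized_qlb_step(theta_t, score, B, C, lam, r=0.5):
    """One penalized QLB update (4.3)-(4.4) for the ridge penalty J(theta) = theta' C theta.

    Maximizing Q_r(theta | theta_t) - lam * theta' C theta leads to the linear system
        (r*B + 2*lam*C) theta_next = r*B theta_t + score(theta_t),
    which coincides with update (4.4) because grad J(theta) = 2 C theta.
    """
    lhs = r * B + 2.0 * lam * C
    rhs = r * B @ theta_t + score(theta_t)
    return np.linalg.solve(lhs, rhs)
```

Repeating this step until the penalized log-likelihood stabilizes gives the penalized QLB/OQLB iteration; with C = 0 it reduces to the unpenalized update (3.4).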
5. Applications

In this section, we apply the proposed methods to fit the logistic regression model and the Cox proportional hazards model, respectively. A cancer remission dataset and a liver dataset are used to illustrate the optimal QLB (OQLB) algorithm. Computational results show that the OQLB algorithm is twice as fast as the traditional QLB algorithm, confirming the theoretical results in Section 3.

5.1. Application to the logistic regression model

We consider the following logistic regression model:

y_i ~ind Binomial(n_i, p_i),  log[p_i/(1 − p_i)] = x_(i)^⊤ θ,  1 ≤ i ≤ m,    (5.1)

where y_i denotes the number of subjects with positive response in the i-th group with n_i trials, p_i the probability that a subject in the i-th group gives a positive response, x_(i) a vector of covariates, and θ_{q×1} the unknown parameters. The log-likelihood of θ is
ℓ(θ) = constant + Σ_{i=1}^m { y_i (x_(i)^⊤ θ) − n_i log[1 + exp(x_(i)^⊤ θ)] }.
Hence, the score vector and the observed information matrix are given by
∇ℓ(θ) = Σ_{i=1}^m (y_i − n_i p_i) x_(i) = X^⊤(y − Np)  and  −∇²ℓ(θ) = Σ_{i=1}^m n_i p_i(1 − p_i) x_(i) x_(i)^⊤ = X^⊤ N P X,    (5.2)

respectively, where X = (x_(1), . . . , x_(m))^⊤, y = (y_1, . . . , y_m)^⊤, N = diag(n_1, . . . , n_m), p = (p_1, . . . , p_m)^⊤,

p_i = exp(x_(i)^⊤ θ)/[1 + exp(x_(i)^⊤ θ)],  and  P = diag(p_1(1 − p_1), . . . , p_m(1 − p_m)).
Note that the observed information matrix does not depend on the observed data {y_i}_{i=1}^m, so that the Fisher information matrix and the observed information matrix are identical. Therefore, the Newton–Raphson algorithm and the Fisher scoring algorithm are the same for the logistic regression model (5.1). The Newton–Raphson algorithm is defined by
θ^(t+1) = θ^(t) + [−∇²ℓ(θ^(t))]^{-1} ∇ℓ(θ^(t)) = θ^(t) + [X^⊤ N P^(t) X]^{-1} X^⊤(y − Np^(t)),    (5.3)

where

P^(t) = diag(p_1^(t)(1 − p_1^(t)), . . . , p_m^(t)(1 − p_m^(t))),  p^(t) = (p_1^(t), . . . , p_m^(t))^⊤,  p_i^(t) = exp(x_(i)^⊤ θ^(t))/[1 + exp(x_(i)^⊤ θ^(t))],  i = 1, . . . , m.    (5.4)
The estimated covariance matrix of the MLE θ̂ is given by

Ĉov(θ̂) = (X^⊤ N P̂ X)^{-1},    (5.5)
so that the estimated standard errors of θ̂ are the square roots of the diagonal elements of Ĉov(θ̂).

If the observed information matrix X^⊤ N P^(t) X in (5.3) is singular or close to singular for some/all t ∈ {0, 1, 2, . . .}, then the Newton–Raphson algorithm does not work. In this case, the QLB algorithm could be a useful optimization tool. The key to applying the QLB algorithm is to find a positive definite matrix B satisfying condition (2.2). In view of (5.2) and the fact that p_i(1 − p_i) ≤ 1/4, Böhning and Lindsay (1988) obtained B = (1/4) X^⊤ N X ≥ −∇²ℓ(θ). From (3.4), we obtain

θ^(t+1) = θ^(t) + (r X^⊤ N X/4)^{-1} X^⊤(y − Np^(t)),    (5.6)

where p^(t) is defined in (5.4). When r = 1, the algorithm specified by (5.6) reduces to the traditional QLB algorithm; when r = 0.5, the algorithm (5.6) is the OQLB algorithm, and a short numerical sketch of this iteration is given below. Using the OQLB algorithm, the estimated standard errors of θ̂ can be obtained by a parametric bootstrap approach (Efron, 1979; Efron and Tibshirani, 1993).

Example 1 (Cancer Remission Data). We consider the cancer remission data (Lee, 1974) listed in Table 1. The binary outcome y_i = 1 indicates that the i-th patient's remission occurred and y_i = 0 otherwise. There are six explanatory variables, denoted by X_1, . . . , X_6. The logistic regression model is

log[p_i/(1 − p_i)] = θ_0 + x_i1 θ_1 + · · · + x_i6 θ_6.
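The sketch below (ours; the arrays X, y and n stand for a grouped or binary dataset such as Table 1, with a column of ones in X for the intercept) spells out iteration (5.6): the matrix r X^⊤NX/4 is formed once, r = 1 gives the traditional QLB algorithm and r = 0.5 the OQLB algorithm.

```python
import numpy as np

def oqlb_logistic(X, y, n, r=0.5, tol=1e-9, max_iter=100000):
    """QLB/OQLB iteration (5.6) for the logistic model (5.1).

    X : (m, q) design matrix, y : (m,) successes, n : (m,) numbers of trials.
    r = 1 is the traditional QLB algorithm, r = 0.5 the OQLB algorithm.
    """
    m, q = X.shape
    B = r * (X.T * n) @ X / 4.0                  # r * X' N X / 4, computed once
    theta = np.zeros(q)
    ll_old = -np.inf
    for t in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))               # p_i^(t) as in (5.4)
        theta = theta + np.linalg.solve(B, X.T @ (y - n * p))  # one step of (5.6)
        eta = X @ theta
        ll = np.sum(y * eta - n * np.logaddexp(0.0, eta))    # log-likelihood at the new iterate
        if ll - ll_old < tol:                    # the ascent property makes |.| unnecessary
            break
        ll_old = ll
    return theta, t + 1
```

Running the routine with r ranging over 0.5–1 reproduces the kind of iteration-count comparison reported in Table 3 below.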
Table 1
Cancer remission data.

 i   X1    X2    X3    X4   X5     X6     yi
 1   0.80  0.88  0.70  0.8  0.176  0.982  0
 2   1.00  0.87  0.87  0.7  1.053  0.986  0
 3   1.00  0.65  0.65  0.6  0.519  0.982  0
 4   0.95  0.87  0.83  1.9  1.354  1.020  0
 5   1.00  0.45  0.45  0.8  0.322  0.999  0
 6   0.95  0.36  0.34  0.5  0.000  1.038  0
 7   0.85  0.39  0.33  0.7  0.279  0.988  0
 8   0.70  0.76  0.53  1.2  0.146  0.982  0
 9   0.80  0.46  0.37  0.4  0.380  1.006  0
10   0.20  0.39  0.08  0.8  0.114  0.990  0
11   1.00  0.90  0.90  1.1  1.037  0.990  0
12   0.65  0.42  0.27  0.5  0.114  1.014  0
13   1.00  0.75  0.75  1.0  1.322  1.004  0
14   0.50  0.44  0.22  0.6  0.114  0.990  0
15   1.00  0.33  0.33  0.4  0.176  1.010  0
16   0.90  0.93  0.84  0.6  1.591  1.020  0
17   0.95  0.32  0.30  1.6  0.886  0.988  0
18   1.00  0.73  0.73  0.7  0.398  0.986  0
19   0.80  0.83  0.66  1.9  1.100  0.996  1
20   0.90  0.36  0.32  1.4  0.740  0.992  1
21   0.90  0.75  0.68  1.3  0.519  0.980  1
22   0.95  0.97  0.92  1.0  1.230  0.992  1
23   1.00  0.84  0.84  1.9  2.064  1.020  1
24   1.00  0.63  0.63  1.1  1.072  0.986  1
25   1.00  0.58  0.58  1.0  0.531  1.002  1
26   1.00  0.60  0.60  1.7  0.964  0.990  1
27   1.00  0.69  0.69  0.9  0.398  0.986  1

Source: Lee (1974).
Table 2
The maximal and minimal eigenvalues, the condition number and the determinant of the observed information matrix A = X^⊤ N P X defined in (5.2), for various initial values, for the cancer remission data.

θ^(0)                              λ_max(A)          λ_min(A)          κ(A)            Determinant
1_7                                0.57547529        1.1120 × 10^-5    51758           6.3926 × 10^-19
0.3 · 1_7                          16.3753500        2.8928 × 10^-4    56607           1.2655 × 10^-8
2 · 1_7                            0.00741540        6.8663 × 10^-8    107998          6.5716 × 10^-34
3 · 1_7                            0.00013891        3.6678 × 10^-10   378736          5.1523 × 10^-48
(1, 0, −0.2, 0.3, 3, 6, 10)^⊤      0.00003458        2.4160 × 10^-10   143162          2.6147 × 10^-50
(5, 3, 19, −1, 3, −90, 80)^⊤       3.7026 × 10^-1    5.4146 × 10^-13   6.83 × 10^11    1.9348 × 10^-42
(90, −2, 4, −10, −30, 0, 0)^⊤      1.0779 × 10^-10   9.8377 × 10^-29   1.09 × 10^18    5.980 × 10^-134
θ̂ given by (5.7)                   2.2065 × 10^1     8.9694 × 10^-5    246004          7.2113 × 10^-11

κ(A) = λ_max(A)/λ_min(A) denotes the condition number of the matrix A; 1_7 denotes the 7 × 1 vector of ones.
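The diagnostics reported in Table 2 amount to a spectral analysis of A = X^⊤NPX in (5.2). The following sketch (ours, assuming the design matrix X, the trial counts n and a trial value of θ are available as NumPy arrays) computes the same quantities.

```python
import numpy as np

def information_diagnostics(X, n, theta):
    """Eigenvalues, condition number and determinant of A = X' N P X in (5.2)."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    w = n * p * (1.0 - p)                      # diagonal of N P
    A = (X.T * w) @ X
    eig = np.linalg.eigvalsh(A)                # A is symmetric, eigenvalues in ascending order
    return {"lambda_max": eig[-1],
            "lambda_min": eig[0],
            "condition_number": eig[-1] / eig[0],
            "determinant": np.prod(eig)}
```

A very small minimal eigenvalue or determinant signals that the Newton–Raphson update (5.3) will be numerically unstable or undefined.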
Table 3
The number of iterations needed to reach the ε accuracy in (5.8), for various combinations of ε and r, for the cancer remission data.

 r     ε = 10^-2   10^-3   10^-4   10^-5   10^-6   10^-7   10^-8   10^-9
 0.5       5         6      10      48     117     194     274     354
 0.6       5         6      13      55     136     229     324     421
 0.7       6         7      14      60     153     260     372     485
 0.8       6         8      15      63     168     291     418     547
 0.9       6         8      17      66     183     320     463     609
 1.0       7         9      18      69     197     348     508     669
In Table 2, we report the maximal and minimal eigenvalues, the condition number and the determinant of the observed information matrix X^⊤ N P X defined in (5.2) for various initial values θ^(0). The Newton–Raphson algorithm does not work for this dataset because the observed information matrix evaluated at these θ^(0)'s is nearly singular (see the last column of Table 2). The problem persists with other initial values. Starting from the initial vector θ^(0) = 1_7, we apply the QLB algorithm (5.6) with r = 0.5, 0.6, 0.7, 0.8, 0.9, and 1, respectively, to the cancer remission data. After 321, 390, 456, 521, 587 and 653 iterations for the respective choices of r, the maximum value of the log-likelihood function for all the selected values of r is achieved at

θ̂ = (58.0385, 24.6615, 19.2936, −19.6013, 3.8960, 0.1511, −87.4339)^⊤.    (5.7)
These results show that the OQLB algorithm is almost twice as fast as the traditional QLB algorithm, confirming our findings in Proposition 3. Fig. 1 compares the log-likelihood functions versus the number of iterations for the OQLB algorithm and the traditional QLB algorithm; the solid (dashed) curve denotes the values of the log-likelihood function generated by the OQLB (traditional QLB) algorithm. Table 3 displays the number of iterations needed to reach the accuracy

|ℓ(θ^(t+1)) − ℓ(θ^(t))| < ε    (5.8)
for various combinations of ε and r. In fact, the absolute-value symbol in (5.8) is not necessary since all QLB algorithms satisfy the ascent property (see Proposition 1).
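The parametric bootstrap standard errors reported below can be obtained by repeatedly simulating data from the fitted model and refitting with the OQLB algorithm. A minimal sketch (ours; it reuses the hypothetical oqlb_logistic routine from the earlier sketch and the 1500 replications mentioned in the text):

```python
import numpy as np

def bootstrap_se(X, n, theta_hat, n_boot=1500, seed=2011):
    """Parametric bootstrap standard errors for the logistic MLE fitted by the OQLB algorithm."""
    rng = np.random.default_rng(seed)
    p_hat = 1.0 / (1.0 + np.exp(-(X @ theta_hat)))
    draws = np.empty((n_boot, theta_hat.size))
    for b in range(n_boot):
        y_star = rng.binomial(n.astype(int), p_hat)          # simulate from the fitted model
        draws[b], _ = oqlb_logistic(X, y_star, n, r=0.5)     # refit with the OQLB algorithm
    return draws.std(axis=0, ddof=1)                         # bootstrap standard errors
```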
Fig. 1. Comparison of log-likelihood functions between the OQLB and QLB algorithms for the cancer remission data given in Table 1.
The estimated standard errors of θ̂ are

(71.19, 47.70, 57.81, 61.54, 2.33, 2.27, 67.48)^⊤,

which are calculated by the parametric bootstrap approach with 1500 replications via the OQLB algorithm. It is noteworthy that the Newton–Raphson algorithm may not work when the observed information matrix (i.e., X^⊤ N P^(t) X) in (5.3) is close to singular. One possibility is collinearity among at least a pair of the explanatory variables; therefore, carrying out a variable selection procedure is generally advisable before the parameters of interest are estimated. Another possibility is the small number of binary observations relative to the six explanatory variables. Since the estimated standard errors of the regression coefficients are very large, the fitted model cannot be expected to give good predictions.

5.2. Application to the Cox proportional hazards model

Cox's regression (Cox, 1972) is a semi-parametric approach to survival analysis in which the hazard function
λ(t) = λ_0(t) exp(θ^⊤ z)

is modeled, where λ_0(t) is the unspecified baseline hazard function, z is the vector of covariates, and θ is the vector of regression coefficients. Let T_i denote the survival time of subject i for i = 1, . . . , N and δ_i the right-censoring indicator of subject i, which equals zero if T_i is right censored and one otherwise. The partial log-likelihood function is given by
ℓ(θ) = Σ_{i=1}^N δ_i { θ^⊤ z_i − log[ Σ_{j∈R_i} exp(θ^⊤ z_j) ] },

where R_i = {j : T_j ≥ T_i} is the risk set at T_i. For simplicity, in the sequel, we use the notation Y_i(t) = I(X_i ≥ t) instead of the risk set, where Y_i(t) is a 0–1 predictable process indicating, by the value one, whether the i-th subject is at risk at time t.
Denote aa^⊤ by a^{⊗2} for a vector a and define

p_ij = exp(θ^⊤ z_j) / Σ_{k=1}^N Y_k(T_i) exp(θ^⊤ z_k).

The score function and the observed information are given by

∇ℓ(θ) = Σ_{i=1}^N δ_i { z_i − Σ_{j=1}^N Y_j(T_i) p_ij z_j }  and
−∇²ℓ(θ) = Σ_{i=1}^N δ_i { Σ_{j=1}^N Y_j(T_i) p_ij z_j^{⊗2} − [ Σ_{j=1}^N Y_j(T_i) p_ij z_j ]^{⊗2} },    (5.9)
respectively. For the original QLB algorithm, Böhning and Lindsay (1988) found that the matrix

B = (1/2) Σ_{i=1}^N δ_i { Σ_{j=1}^N Y_j(T_i) z_j^{⊗2} − [ 1 / Σ_{k=1}^N Y_k(T_i) ] [ Σ_{j=1}^N Y_j(T_i) z_j ]^{⊗2} }

satisfies B ≥ −∇²ℓ(θ) for any θ ∈ R^q. We can see that this B is the smallest majorization matrix. Thus, our QLB iteration is given by

θ^(t+1) = θ^(t) + (rB)^{-1} ∇ℓ(θ^(t)).    (5.10)
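A minimal sketch of this Cox-model iteration (ours; the arrays times, delta and Z are assumed to hold the observed times, the censoring indicators δ_i and the covariate rows z_i^⊤, with no tied event times) evaluates B once and then applies (5.10).

```python
import numpy as np

def cox_qlb(times, delta, Z, r=0.5, tol=1e-9, max_iter=100000):
    """QLB/OQLB iteration (5.10) for the Cox partial likelihood (no tied event times)."""
    N, q = Z.shape
    at_risk = times[:, None] <= times[None, :]        # at_risk[i, j] = Y_j(T_i) = I(T_j >= T_i)
    # Majorization matrix B, computed once: for each event time T_i, half the
    # centered second moment of the covariates over the risk set.
    B = np.zeros((q, q))
    for i in np.flatnonzero(delta):
        Zi = Z[at_risk[i]]
        m_i = Zi.shape[0]
        zsum = Zi.sum(axis=0)
        B += 0.5 * (Zi.T @ Zi - np.outer(zsum, zsum) / m_i)
    theta = np.zeros(q)
    for t in range(max_iter):
        w = np.exp(Z @ theta)                         # exp(theta' z_j)
        grad = np.zeros(q)
        for i in np.flatnonzero(delta):
            pij = w * at_risk[i] / np.sum(w * at_risk[i])   # p_ij restricted to the risk set
            grad += Z[i] - pij @ Z                          # score contribution in (5.9)
        step = np.linalg.solve(r * B, grad)
        theta = theta + step                          # update (5.10)
        if np.max(np.abs(step)) < tol:                # simple stopping rule on the step size
            break
    return theta, t + 1
```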
In particular, when r = 0.5, this reduces to the OQLB algorithm. From (5.9), the estimated covariance matrix of the MLE θ̂ is given by

Ĉov(θ̂) = [−∇²ℓ(θ̂)]^{-1},    (5.11)

so that the estimated standard errors of θ̂ are the square roots of the diagonal elements of Ĉov(θ̂).

Example 2 (PBC Data). Consider the primary biliary cirrhosis (PBC) data of the liver collected from January 1974 to May 1984 in the Mayo Clinic trial comparing the drug D-penicillamine (DPCA) with a placebo (see Fleming and Harrington (1991), Appendix D.1). The data contain information about the survival time and prognostic factors for 418 patients. The variables in this dataset include the survival time T_i, the right-censoring indicator δ_i, and 17 covariates Z_1, . . . , Z_17. Using the LASSO method, Tibshirani (1997) analyzed the PBC data by considering only the 276 observations with fully observed covariates and decided to exclude the six covariates Z_1, Z_5, Z_9, Z_12, Z_14, and Z_15. Based on his results, we only need to estimate the coefficients of the other 11 covariates. Six of the 276 subjects were omitted so that there are no ties in the survival times in our calculation. Using θ^(0) = 1_11 as the initial value, we apply the QLB algorithm (5.10) with r = 0.5 and r = 1 to the PBC data, and arrive at
θ̂ = (0.262, −0.137, 0.023, 0.069, 0.203, 0.426, −0.334, 0.178, 0.233, 0.235, 0.368)^⊤

after 1303 and 2614 iterations, respectively. This further confirms that the OQLB algorithm is twice as fast as the QLB algorithm. Using (5.11), the estimated standard errors of θ̂ are

(0.628, 0.752, 0.829, 0.649, 0.770, 0.793, 0.587, 0.786, 0.692, 0.653, 0.540)^⊤.

6. Discussion

Because of its quadratic convergence, the Newton–Raphson algorithm or the Fisher scoring algorithm is always our first choice if it works. However, when the two algorithms do not work (as shown in Section 5.1) and EM-type algorithms are not available, the QLB algorithm is a useful optimization tool. The major objective of this paper is to provide a fast QLB algorithm, which is especially useful for the calculation of standard errors by a bootstrap approach.

From the proof of Proposition 1, we observe several facts. First, although it is a very strong assumption, (2.2) is merely a sufficient (not necessary) condition to ensure the ascent property. This strong assumption, combined with condition (A.2), which is weaker than (2.4) (i.e., θ^(t+1) need not be the maximizer of Q_r(θ|θ^(t))), yields the ascent property. Second, according to Lange et al. (2000), the majorization inequality (3.5) is the ''essence'' of all EM-type and MM-type algorithms, so that EM/MM algorithms are also generalized EM/MM algorithms. Thus, the QLB algorithms based on (3.4) are special members of the generalized MM algorithms only for r ≥ 1. In contrast, when 0.5 ≤ r < 1, the corresponding QLB algorithms do not belong to the MM family because of the lack of a majorization inequality, but they still maintain the ascent property
produced by updating θ^(t+1) according to the stronger condition (3.4) rather than the weaker condition (A.2). In other words, these QLB algorithms (0.5 ≤ r < 1) are not generalized MM algorithms. Finally, the essence of the QLB algorithms lies in the ascent property rather than in the majorization inequality; the exploration of this fact motivates our acceleration tactic for a class of QLB algorithms.

The QLB algorithm of Böhning and Lindsay (1988) is somewhat related to the trust region method, originally proposed by Celis et al. (1985). Essentially, the trust region algorithm (Fletcher, 1987) first approximates the objective function with a quadratic surrogate function in a neighborhood of the current iterate (the so-called trust region), and then optimizes the quadratic surrogate function over the trust region, leading to a constrained optimization problem (the so-called trust region subproblem). Finally, one needs to design an algorithm to solve the trust region subproblem. Fortunately, there are several algorithms for solving the trust region subproblem and these have been incorporated into effective optimization codes; the reader is referred to Nocedal and Wright (1999, Chapter 4) for more details. We observe two major differences between the QLB algorithm and the trust region algorithm: for the former, (i) one must first find a positive definite matrix (not depending on θ) that satisfies condition (2.2), so that one can iteratively maximize the quadratic surrogate function Q_1(θ|θ^(t)) defined by (2.3) over the whole parameter space R^q; and (ii) there exists an explicit solution, given by (2.4), to the maximization of Q_1(θ|θ^(t)).

In Section 5.1, we only discuss the case of independent binary data. How to apply the optimal QLB algorithm to logistic models with correlated or clustered binary data is a topic for future work. Furthermore, for some generalized linear models (e.g., Poisson regression for count data), a universal majorizing matrix does not exist, so the optimal QLB algorithm is not applicable; it would therefore be worthwhile to design a similar fast algorithm to estimate the parameters in such generalized linear models. Finally, since the QLB/OQLB algorithm can also be applied to probit regression models (Böhning, 1999), it would be worthwhile to numerically compare the OQLB algorithm with an accelerated EM algorithm such as the PX-EM algorithm of Liu et al. (1998) or the 'working parameter' approach of Meng and van Dyk (1997).

Acknowledgments

The authors would like to thank the Editor, an Associate Editor and one referee for their helpful comments and suggestions. G.L. Tian's research was supported in part by a grant (Project Code: 2009-1115-9042) from the HKU Seed Funding Program for Basic Research and in part by a grant (HKU 779210M) from the Research Grants Council of the Hong Kong Special Administrative Region.

Appendix

Proof of Proposition 1. Let
B_q(θ^(t), r_*) = {φ : ‖φ − θ^(t)‖² ≤ r_*²}

be a q-dimensional ball centered at θ^(t) with radius r_* = ‖θ − θ^(t)‖, where ‖·‖ denotes the Euclidean norm. The second-order Taylor expansion of ℓ(θ) in a neighborhood of θ^(t) ∈ R^q gives

ℓ(θ) = ℓ(θ^(t)) + (θ − θ^(t))^⊤ ∇ℓ(θ^(t)) + 0.5 (θ − θ^(t))^⊤ ∇²ℓ(θ*)(θ − θ^(t))    (A.1)

for all θ ∈ R^q and some point θ* ∈ B_q(θ^(t), r_*). (i) When r ≥ 1, we have rB ≥ B > 0. This fact together with condition (2.2) guarantees

ℓ(θ) − Q_r(θ|θ^(t)) = 0.5 (θ − θ^(t))^⊤ [∇²ℓ(θ*) + rB](θ − θ^(t)) ≥ 0.

Thus, ℓ(θ) − Q_r(θ|θ^(t)) achieves its minimum zero at θ = θ^(t). Therefore, (3.5) follows. This fact, combined with the condition

Q_r(θ^(t+1)|θ^(t)) ≥ Q_r(θ^(t)|θ^(t))  for all θ^(t), θ^(t+1) ∈ R^q,    (A.2)

leads to the ascent property
ℓ(θ^(t+1)) = {ℓ(θ^(t+1)) − Q_r(θ^(t+1)|θ^(t))} + Q_r(θ^(t+1)|θ^(t)) ≥ {ℓ(θ^(t)) − Q_r(θ^(t)|θ^(t))} + Q_r(θ^(t)|θ^(t)) = ℓ(θ^(t)),

where the inequality is strict if θ^(t+1) ≠ θ^(t). (ii) When 0.5 ≤ r < 1, substituting θ in (A.1) by θ^(t+1), we obtain

ℓ(θ^(t+1)) − ℓ(θ^(t)) = (θ^(t+1) − θ^(t))^⊤ ∇ℓ(θ^(t)) + 0.5 (θ^(t+1) − θ^(t))^⊤ ∇²ℓ(θ*)(θ^(t+1) − θ^(t)).    (A.3)

From (3.4), we have ∇ℓ(θ^(t)) = rB(θ^(t+1) − θ^(t)). Hence, (A.3) becomes

ℓ(θ^(t+1)) − ℓ(θ^(t)) = 0.5 (θ^(t+1) − θ^(t))^⊤ [B + ∇²ℓ(θ*) + (2r − 1)B](θ^(t+1) − θ^(t)).

If r = 0.5, since B + ∇²ℓ(θ*) ≥ 0, the above quadratic form is non-negative. If 0.5 < r < 1, since (2r − 1)B > 0, the above quadratic form is positive provided that θ^(t+1) ≠ θ^(t).
Proof of Proposition 2. Differentiating (3.6) with respect to θ and evaluating at θ = θ̂, we immediately obtain (3.8). According to Ostrowski's Theorem (Ortega, 1990) or Proposition 13.3.1 of Lange (1999, p. 162), it suffices for local convergence to show that the spectral radius |ρ{∇M_r(θ̂)}| < 1 for any r ≥ 0.5. To derive ρ{∇M_r(θ̂)}, we first deal with the special case of ρ{∇M_1(θ̂)}. When r = 1, (3.8) becomes

∇M_1(θ̂) = I_q − B^{-1}[−∇²ℓ(θ̂)] = [−∇²ℓ(θ̂) + (∇²ℓ(θ̂) + B)]^{-1}(∇²ℓ(θ̂) + B).

Let H = −∇²ℓ(θ̂) and A = ∇²ℓ(θ̂) + B; then ∇M_1(θ̂) = B^{-1}A = (H + A)^{-1}A. Let α be any eigenvalue of ∇M_1(θ̂), so that |∇M_1(θ̂) − αI_q| = 0 and hence |(1 − α)A − αH| = 0. So there exists a non-zero vector z such that z^⊤Az = [α/(1 − α)] z^⊤Hz. Since the concavity of ℓ(θ) implies that H is positive definite and assumption (2.2) indicates that A is non-negative definite, it follows that α/(1 − α) ≥ 0, hence α ∈ [0, 1). The largest eigenvalue ρ{∇M_1(θ̂)} ≠ 0; otherwise ∇M_1(θ̂) is a zero matrix, i.e., θ^(t+1) = θ̂. Thus, the largest eigenvalue ρ{∇M_1(θ̂)} ∈ (0, 1). From (3.8), we obtain

∇M_r(θ̂) = I_q − r^{-1}[I_q − ∇M_1(θ̂)].    (A.4)
Hence, the global rate of convergence
ρ{∇ Mr (θˆ )} = 1 − r −1 [1 − ρ{∇ M1 (θˆ )}] ∈ (1 − r −1 , 1),
(A.5)
resulting in |ρ{∇ Mr (θˆ )}| < 1 for any r ≥ 0.5. This fact indicates that the iteration scheme (3.6) is locally attracted to θˆ . To prove that the convergence of the QLB algorithm is linear, we need the identity
ρ{∇ Mr (θˆ )⊤ ∇ Mr (θˆ )} = ρ 2 {∇ Mr (θˆ )}.
(A.6)
In fact, defining P = ∇ Mr (θˆ ) and H = −∇ 2 ℓ(θˆ ), from (3.8), we have P = I − (rB)−1 H and P ⊤ = I − H (rB)−1 . Let λ be an eigenvalue of P, then there exists a non-zero vector z such that Pz = λz. Therefore, 1 − λ is the corresponding eigenvalue of (rB)−1 H. Since eigenvalues of (rB)−1 H and H (rB)−1 are the same, we have P ⊤ z = λz. The identity (A.6) follows since P ⊤ Pz = λP ⊤ z = λ2 z. On the other hand, from (3.7), we have
‖θ^(t+1) − θ̂‖² / ‖θ^(t) − θ̂‖² = (θ^(t) − θ̂)^⊤ P^⊤P(θ^(t) − θ̂) / [(θ^(t) − θ̂)^⊤(θ^(t) − θ̂)] ≤ sup_{z≠0} {z^⊤P^⊤Pz / z^⊤z} = ρ{P^⊤P} = ρ²{P}.

Therefore, ‖θ^(t+1) − θ̂‖ ≤ |ρ{∇M_r(θ̂)}| · ‖θ^(t) − θ̂‖, which implies that the convergence of the QLB algorithm is linear.
Proof of Proposition 3. From (A.4), we obtain I_q − ∇M_r(θ̂) = r^{-1}[I_q − ∇M_1(θ̂)]. Setting r = 1 in (A.5), we have ρ{∇M_1(θ̂)} ∈
(0, 1) so that s{Iq − ∇ M1 (θˆ )} = 1 − ρ{∇ M1 (θˆ )} ∈ (0, 1). Thus, the three assertions follow immediately.
Proof of Proposition 4. Note that ℓλ (θ ) − Qrλ (θ|θ (t ) ) = ℓ(θ ) − Qr (θ|θ (t ) ). (i) When r ≥ 1, (3.5) indicates that ℓλ (θ ) − Qrλ (θ|θ (t ) ) ≥ 0, achieving its minimum zero at θ = θ (t ) . On the other hand, (4.2) implies Qrλ (θ (t +1) |θ (t ) ) ≥ Qrλ (θ (t ) |θ (t ) ) for θ (t ) , θ (t +1) ∈ Rq . Thus
ℓλ (θ (t +1) ) = {ℓλ (θ (t +1) ) − Qrλ (θ (t +1) |θ (t ) )} + Qrλ (θ (t +1) |θ (t ) ) ≥ {ℓλ (θ (t ) ) − Qrλ (θ (t ) |θ (t ) )} + Qrλ (θ (t ) |θ (t ) ) = ℓλ (θ (t ) ), where the inequality is strict if θ (t +1) ̸= θ (t ) . (ii) From (4.4), we have ∇ℓ(θ (t ) ) = rB(θ (t +1) − θ (t ) ) + λ · ∇ J (θ (t +1) ). Thus
ℓ_λ(θ^(t+1)) − ℓ_λ(θ^(t)) = −λ[J(θ^(t+1)) − J(θ^(t))] + [ℓ(θ^(t+1)) − ℓ(θ^(t))]
  = −λ[J(θ^(t+1)) − J(θ^(t))] + (θ^(t+1) − θ^(t))^⊤ ∇ℓ(θ^(t)) + 0.5 (θ^(t+1) − θ^(t))^⊤ ∇²ℓ(θ*)(θ^(t+1) − θ^(t))
  = 0.5 (θ^(t+1) − θ^(t))^⊤ [2rB + ∇²ℓ(θ*)](θ^(t+1) − θ^(t)) + λ · δ,

where the first term is non-negative because 2rB + ∇²ℓ(θ*) ≥ 0 for r ≥ 0.5, and

δ = (θ^(t+1) − θ^(t))^⊤ ∇J(θ^(t+1)) − [J(θ^(t+1)) − J(θ^(t))].    (A.7)
If J (θ) is a linear function of θ , then δ = 0. Furthermore, if J (θ ) is a quadratic function, say, J (θ ) = θ ⊤ C θ with C ≥ 0, then ∇ J (θ) = 2C θ , and
δ = 2(θ (t +1) − θ (t ) )⊤ C θ (t +1) − θ (t +1)⊤ C θ (t +1) + θ (t )⊤ C θ (t ) = (θ (t +1) − θ (t ) )⊤ C (θ (t +1) − θ (t ) ) ≥ 0. Therefore, for the two cases, from (A.7), we have ℓλ (θ (t +1) ) − ℓλ (θ (t ) ) ≥ 0.
Proof of Proposition 5. (i) The iteration (4.4) defines a map Mrλ (θ ) = θ + (rB)−1 [∇ℓ(θ ) − λ · ∇ J (Mrλ (θ ))]. Differentiating this map with respect to θ and evaluating at θ = θ˜ , we have
∇ Mrλ (θ˜ ) = Iq − (rB)−1 [−∇ 2 ℓ(θ˜ ) + λ · ∇ 2 J (θ˜ ) · ∇ Mrλ (θ˜ )], which can be rearranged to give the desired result (4.5). Now, we prove (4.6). On the one hand, when r = 1, similar to Proposition 2, we can prove that the largest eigenvalue ρ{∇ M1λ (θ˜ )} ∈ (0, 1). On the other hand, from (4.5), we have
∇ Mrλ (θ˜ ) = Iq − Wrλ [Iq − ∇ M1λ (θ˜ )],
(A.8)
where Wrλ = [rB + λ · ∇ 2 J (θ˜ )]−1 (B + λ · ∇ 2 J (θ˜ )). From Proposition (a) of Green (1990), we immediately obtain (4.6). (ii) We can rewrite (A.8) as Iq − ∇ Mrλ (θ˜ ) = Wrλ [Iq − ∇ M1λ (θ˜ )] so that s{Iq − ∇ Mrλ (θ˜ )} = s{Wrλ [Iq − ∇ M1λ (θ˜ )]}, which is a monotonic decreasing function of r (r ≥ 0.5), achieving its maximum at r = 0.5.
References

Berlinet, A., Roland, C., 2009. Parabolic acceleration of the EM algorithm. Statistics and Computing 19, 35–47.
Böhning, D., 1992. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics 44, 197–200.
Böhning, D., 1999. The lower bound method in probit regression. Computational Statistics and Data Analysis 30 (1), 13–17.
Böhning, D., Lindsay, B.G., 1988. Monotonicity of quadratic approximation algorithms. Annals of the Institute of Statistical Mathematics 40, 641–663.
Celis, M., Dennis, J.E., Tapia, R.A., 1985. A trust region strategy for nonlinear equality constrained optimization. In: Boggs, P., Byrd, R., Schnabel, R. (Eds.), Numerical Optimization. SIAM, Philadelphia, pp. 71–82.
Cox, D.R., 1972. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34, 187–220.
Cox, D.R., Oakes, D., 1984. Analysis of Survival Data. Chapman & Hall, London.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39, 1–38.
van Dyk, D.A., 2000. Fitting mixed-effects models using efficient EM-type algorithms. Journal of Computational and Graphical Statistics 9 (1), 78–98.
Efron, B., 1979. Bootstrap methods: another look at the jackknife. The Annals of Statistics 7, 1–26.
Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman & Hall/CRC, Boca Raton.
Fessler, J.A., Hero, A.O., 1994. Space-alternating generalized expectation-maximization algorithm. IEEE Transactions on Signal Processing 42, 2664–2677.
Fleming, T.R., Harrington, D.P., 1991. Counting Processes and Survival Analysis. Wiley, New York.
Fletcher, R., 1987. Practical Methods of Optimization. Wiley, New York.
Green, P.J., 1990. On the use of the EM algorithm for penalized likelihood estimation. Journal of the Royal Statistical Society, Series B 52, 443–452.
Hunter, D.R., Lange, K., 2004. A tutorial on MM algorithms. The American Statistician 58, 30–37.
Jamshidian, M., Jennrich, R.I., 1993. Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association 88, 221–228.
Jamshidian, M., Jennrich, R.I., 1997. Acceleration of the EM algorithm by using quasi-Newton methods. Journal of the Royal Statistical Society, Series B 59 (3), 569–587.
Kuroda, M., Sakakihara, M., 2006. Accelerating the convergence of the EM algorithm using the vector epsilon algorithm. Computational Statistics and Data Analysis 51 (3), 1549–1561.
Lange, K., 1995. A quasi-Newton acceleration of the EM algorithm. Statistica Sinica 5, 1–8.
Lange, K., 1999. Numerical Analysis for Statisticians. Springer, New York.
Lange, K., Hunter, D.R., Yang, I., 2000. Optimization transfer using surrogate objective functions (with discussion). Journal of Computational and Graphical Statistics 9, 1–20.
Lee, E.T., 1974. A computer program for linear logistic regression analysis. Computer Programs in Biomedicine 4, 80–92.
Liu, C.H., Rubin, D.B., Wu, Y.N., 1998. Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika 85, 755–770.
Meng, X.L., 1994. On the rate of convergence of the ECM algorithm. The Annals of Statistics 22, 326–339.
Meng, X.L., Rubin, D.B., 1991. Using EM to obtain asymptotic variance–covariance matrices: the SEM algorithm. Journal of the American Statistical Association 86, 899–909.
Meng, X.L., Rubin, D.B., 1993. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80, 267–278.
Meng, X.L., van Dyk, D.A., 1997. The EM algorithm—an old folk-song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society, Series B 59, 511–567.
Meng, X.L., van Dyk, D.A., 1998. Fast EM-type implementations for mixed effects models. Journal of the Royal Statistical Society, Series B 60, 559–578.
Nocedal, J., Wright, S.J., 1999. Numerical Optimization. Springer, New York.
Ortega, J.M., 1990. Numerical Analysis: A Second Course. Society for Industrial and Applied Mathematics, Philadelphia.
Tan, M., Tian, G.L., Fang, H.B., Ng, K.W., 2007. A fast EM algorithm for quadratic optimization subject to convex constraints. Statistica Sinica 17 (3), 945–964.
Tibshirani, R., 1997. The Lasso method for variable selection in the Cox model. Statistics in Medicine 16, 385–395.
Varadhan, R., Roland, C., 2008. Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scandinavian Journal of Statistics 35 (2), 335–353.
Wang, M.F., Kuroda, M., Sakakihara, M., Geng, Z., 2008. Acceleration of the EM algorithm using the vector epsilon algorithm. Computational Statistics 23 (3), 469–484.
Zhou, H., Alexander, D., Lange, K., 2011. A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing 21, 261–273.