Computational Statistics & Data Analysis 52 (2007) 1119–1131
Fast robust regression algorithms for problems with Toeplitz structure

Nicola Mastronardi (a), Dianne P. O'Leary (b,*)

(a) Istituto per le Applicazioni del Calcolo "M. Picone", sede di Bari, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/D, I-70126 Bari, Italy
(b) Department of Computer Science, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
Available online 18 May 2007
Abstract

The problem of computing an approximate solution of an overdetermined system of linear equations is considered. The usual approach to the problem is least squares, in which the 2-norm of the residual is minimized. This produces the minimum variance unbiased estimator of the solution when the errors in the observations are independent and normally distributed with mean 0 and constant variance. It is well known, however, that the least squares solution is not robust if outliers occur, i.e., if some of the observations are contaminated by large error. In this case, alternate approaches have been proposed which judge the size of the residual in a way that is less sensitive to these components. These include the Huber M-function, the Talwar function, the logistic function, the Fair function, and the ℓ1 norm. New algorithms are proposed to compute the solution to these problems efficiently, in particular, when the matrix A has small displacement rank. Matrices with small displacement rank include matrices that are Toeplitz, block-Toeplitz, block-Toeplitz with Toeplitz blocks, Toeplitz plus Hankel, and a variety of other forms. For exposition, only Toeplitz matrices are considered, but the ideas apply to all matrices with small displacement rank. Algorithms are also presented to compute the solution efficiently when a regularization term is included to handle the case when the matrix of the coefficients is ill-conditioned or rank-deficient. The techniques are illustrated on a problem of FIR system identification.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Iteratively reweighted least squares; Robust regression; Toeplitz matrices; Outliers; Small displacement rank; FIR system identification
1. Introduction

Consider the approximation problem Ax ≈ b, where A ∈ R^{m×n} and b ∈ R^m (m ≥ n) are given and x ∈ R^n is to be determined. We define the residual r = b − Ax. The usual approach to the problem is least squares, in which we minimize the 2-norm of the residual over all choices of x. This produces the minimum variance unbiased estimator of the solution when the errors in the observation b are independent and normally distributed with mean 0 and constant variance.

* Corresponding author. Tel.: +1 301 405 2678; fax: +1 301 405 6707. E-mail addresses: [email protected] (N. Mastronardi), [email protected] (D.P. O'Leary).
It is well known, however, that the least squares solution is not robust if outliers occur, i.e., if some of the components of b are contaminated by large error. In this case, alternate approaches have been proposed which judge the size of the residual in a way that is less sensitive to these components. For example, in order to reduce the influence of the outliers, we might replace the least squares problem

\min_x \|r\|_2

by

\min_x \sum_{i=1}^{m} \rho(r_i(x)),    (1)

subject to r = b − Ax, where ρ is a given function. This robust regression problem has been extensively studied. Taking ρ(z) = z²/2 gives the ordinary linear least squares problem. Other possible functions include:

Huber (1964): ρ(z) = z²/2 for |z| ≤ β, and ρ(z) = β|z| − β²/2 for |z| > β;

logistic (Coleman et al., 1980): ρ(z) = β² log cosh(z/β);

Fair (1974) (Holland and Welsch, 1977): ρ(z) = β² (|z|/β − log(1 + |z|/β));

Talwar (Hinich and Talwar, 1975; Huber, 1964): ρ(z) = z²/2 for |z| ≤ β, and ρ(z) = β²/2 for |z| > β;

ℓ1: ρ(z) = |z|.
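For reference, these ρ functions are simple to tabulate. The following vectorized NumPy sketch collects them; the function names and the threshold argument beta are ours, chosen only for illustration, and are not part of the paper's notation.

import numpy as np

def rho_huber(z, beta):
    # z^2/2 inside the threshold, linear growth outside
    return np.where(np.abs(z) <= beta, z**2 / 2, beta * np.abs(z) - beta**2 / 2)

def rho_logistic(z, beta):
    return beta**2 * np.log(np.cosh(z / beta))

def rho_fair(z, beta):
    return beta**2 * (np.abs(z) / beta - np.log(1 + np.abs(z) / beta))

def rho_talwar(z, beta):
    # quadratic inside the threshold, constant (so zero influence) outside
    return np.where(np.abs(z) <= beta, z**2 / 2, beta**2 / 2)

def rho_l1(z):
    return np.abs(z)

The second derivatives of these functions supply the weights used by the Newton iteration of Section 4.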
In Section 2, we consider how the solution to weighted problems can be computed efficiently, in particular, when the matrix A has small displacement rank (Kailath and Sayed, 1995, 1999). Matrices with small displacement rank include matrices that are Toeplitz, block-Toeplitz, block-Toeplitz with Toeplitz blocks (BTTB), Toeplitz plus Hankel, and a variety of other forms. This structure has been effectively exploited in solving least squares problems (Kailath and Sayed, 1999; Ng, 1996), weighted least squares problems (Benzi and Ng, 2006), total least squares problems (Kalsi and O'Leary, 2006; Mastronardi et al., 2001a, 2004), and regression using other norms (Pruessner and O'Leary, 2003), so we now extend these ideas to robust regression. For exposition, in Section 2 we focus on Toeplitz matrices, but the ideas apply to all matrices with small displacement rank. In Section 3, we also show how to compute the solution efficiently when we include a regularization term in case the matrix A is ill-conditioned or rank-deficient. Section 4 extends these results to robust regression problems. Experiments illustrating these ideas are presented in Section 5, and Section 6 provides our conclusions.

2. Weighting to solve well-conditioned problems with outliers

In this section we assume that we have a well-conditioned model in which we know which components of the vector b have large errors. This situation arises, for example, if there are known defects in a set of detectors which reduce the effectiveness of some of them without making their data worthless. In Section 3, we consider ill-conditioned problems, and in Section 4, we extend these ideas to situations in which the number and indices of the outliers are unknown.

Suppose that p of our observations are defective, where p ≪ m. Let D be a diagonal weighting matrix that takes this into account, with d_jj = 1 for the trusted observations b_j and 0 < d_jj < 1 for the others. Then our goal is to solve

\min_x \| D^{1/2}(b − Ax) \|_2.    (2)
By setting the derivatives to 0 we obtain the normal equations

A^T D A x = A^T D b.    (3)
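As a small self-contained check of (3) (a sketch with arbitrary sizes and weights, not the structured algorithms developed below), the weighted normal equations agree with an ordinary least squares solve of the row-scaled system D^{1/2} A x ≈ D^{1/2} b:

import numpy as np

rng = np.random.default_rng(1)
m, n = 12, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
d = np.ones(m)
d[:3] = 0.1                      # downweight three distrusted observations

x_ne = np.linalg.solve(A.T @ (d[:, None] * A), A.T @ (d * b))        # eq. (3)
x_ls, *_ = np.linalg.lstsq(np.sqrt(d)[:, None] * A, np.sqrt(d) * b, rcond=None)
print(np.allclose(x_ne, x_ls))   # expect True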
If A has small displacement rank, then fast algorithms exist for solving the problem when D = I. We show that such algorithms can still be used efficiently when p ≪ m entries of the main diagonal of D are different from 1. Without loss of generality, we can order our observations so that the first p components have d_jj ≠ 1. Three matrices will be helpful to us. Let V contain the first p columns of the (m + n) × (m + n) identity matrix, let S_p be the p × p diagonal matrix with entries d_jj^{-1} − 1, and let W = V S_p^{1/2}. In the following three subsections, we develop three methods for solving our problem.
2.1. Method 1: use preconditioning

If none of the weights is 0, our problem (2) is equivalent to solving

\begin{bmatrix} D^{-1} & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} y \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix},    (4)
where D^{-1} y = r. If we use constraint preconditioning (Dyn and Ferguson, 1983) on this system, we obtain the iteration matrix

N = \begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix}^{-1} \begin{bmatrix} D^{-1} & A \\ A^T & 0 \end{bmatrix} \equiv B_I^{-1} B_D.

Benzi and Ng (2006) used a similar approach, replacing I by a scalar multiple of I, with the scalar chosen as the average value of the diagonal entries of D^{-1}. Taking the scalar equal to 1, as in the original constraint preconditioning procedure, has a great advantage for us, though. Since B_D = B_I + W W^T, the matrix N = B_I^{-1} B_D is the identity matrix plus a rank-p correction. Therefore, all but p eigenvalues of N are equal to 1, and an iterative method such as GMRES will terminate with the solution in p + 1 iterations.

The bulk of the work in each iteration is multiplication of vectors by the matrices B_D and B_I^{-1}. To multiply by B_D, we multiply by the Toeplitz matrix A and then by A^T, which can be done in O((m + n) log(m + n)) operations (see, for example, Kailath and Sayed, 1999; O'Leary and Simmons, 1981; Van Loan, 1992). Application of the preconditioning matrix B_I^{-1} is related to solving a standard least squares problem involving the Toeplitz matrix A (Björck, 1996, Section 8.4). We can solve

B_I \begin{bmatrix} z \\ w \end{bmatrix} = \begin{bmatrix} c \\ d \end{bmatrix}    (5)

in O(mn + n^2) operations: first we compute R, the upper triangular factor from the QR factorization of A, by means of the generalized Schur algorithm (GSA) (Kailath and Sayed, 1995, 1999), and then, since A^T A = R^T R, we solve R^T R w = A^T c − d and then form z = c − Aw (Björck, 1996; Björck and Paige, 1994). Thus the entire algorithm is O(pmn).

2.2. Method 2: use the Sherman–Morrison–Woodbury formula

An alternative approach to solve (4) uses the Sherman–Morrison–Woodbury formula (Golub and Van Loan, 1996, p. 50) to solve a linear system involving B_D using an algorithm to solve a linear system involving B_I. Since B_D = B_I + W W^T,

B_D^{-1} = B_I^{-1} − B_I^{-1} W (I + W^T B_I^{-1} W)^{-1} W^T B_I^{-1},
so we can solve a linear system B_D f = g at a cost about equal to that of solving p + 1 linear systems involving B_I:

• Find the (m + n) × p matrix Z that solves the linear system B_I Z = W.
• Find the vector f̂ that solves B_I f̂ = g.
• Solve the p × p linear system (I + W^T Z) q = W^T f̂.
• Form f = f̂ − Z q.

This approach will work when neither B_I nor I + W^T Z is too ill-conditioned. Since W = V S_p^{1/2}, and the diagonal entries of S_p are d_jj^{-1} − 1, it is necessary that no weight be too close to 0 relative to the others. Notice that the algorithm breaks down for a 0 weight.

Suppose now that the weights are 1's and 0's. In this case we can derive a stable algorithm that is a variant on the one above by avoiding computation with the matrix W, instead computing with V, which just contains columns of the identity. Consider the result of replacing the 0 weights by some small number ε, so that W = (ε^{-1} − 1)^{1/2} V. For ε small but positive, the Sherman–Morrison–Woodbury formula is still valid, and as ε → 0,

B_I^{-1} W (I + W^T B_I^{-1} W)^{-1} W^T B_I^{-1} → B_I^{-1} V (V^T B_I^{-1} V)^{-1} V^T B_I^{-1}.

Thus we have an algorithm for problems with 0–1 weights:

• Find the (m + n) × p matrix Z that solves the linear system B_I Z = V.
• Find the vector f̂ that solves B_I f̂ = g.
• Solve the p × p linear system (V^T Z) q = V^T f̂.
• Form f = f̂ − Z q.

As discussed at the end of Section 2.1, linear systems involving B_I can be solved in O(mn) operations when A is Toeplitz, so for either of these two algorithms the cost is O(pmn) operations; a small numerical sketch of the 0–1 variant follows.
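The sketch below checks the 0–1 weight variant on a small dense example (random A and arbitrary sizes of our choosing); the result agrees with the least squares solution obtained by simply deleting the p distrusted rows. In the structured O(pmn) algorithm, the dense solves with B_I would of course be replaced by the fast Toeplitz solver described above.

import numpy as np

rng = np.random.default_rng(0)
m, n, p = 20, 5, 3                          # the first p observations get weight 0
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

BI = np.block([[np.eye(m), A], [A.T, np.zeros((n, n))]])   # augmented matrix B_I
V = np.eye(m + n)[:, :p]                    # first p columns of the identity
g = np.concatenate([b, np.zeros(n)])        # right-hand side [b; 0]

Z = np.linalg.solve(BI, V)                  # B_I Z = V
f_hat = np.linalg.solve(BI, g)              # B_I f_hat = g
q = np.linalg.solve(V.T @ Z, V.T @ f_hat)   # (V^T Z) q = V^T f_hat
f = f_hat - Z @ q                           # f = [y; x]

x_ref, *_ = np.linalg.lstsq(A[p:], b[p:], rcond=None)      # drop the p distrusted rows
print(np.allclose(f[m:], x_ref))            # expect True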
2.3. Method 3: use the GSA

In order to reduce the effects of the observations in model (2) that are contaminated by large errors, the corresponding diagonal entries in the weight matrix D must have positive values much smaller than 1. We can write our least squares problem (2) as

\min_x \| D^{1/2}(b − Ax) \|_2 = \min_x \| (I − S^2)^{1/2}(b − Ax) \|_2,    (6)

where S is a diagonal matrix with diagonal entries (1 − d_jj)^{1/2}, j = 1, …, m. The normal equation (3) becomes

(A^T A − A^T S^2 A) x = A^T (I − S^2) b.

Since A is a Toeplitz matrix, the matrix

M_p = A^T A − A^T S^2 A    (7)

is a rank-p modification of the Toeplitz-like matrix A^T A, i.e., a matrix with small displacement rank (Kailath and Sayed, 1999). Therefore, the displacement rank (Kailath and Sayed, 1999) of the symmetric positive definite (SPD) matrix (7) is at most 4 + 2p. The least squares problem (6) can be solved via the seminormal equation

R_p^T R_p x = A^T (I − S^2) b,    (8)

where R_p^T ∈ R^{n×n} is the Cholesky factor of (7). In general, if p ≪ m, the Cholesky factorization of (7) can be computed in O(mn + n^2) operations by means of GSA (Kailath and Sayed, 1995, 1999). Once the Cholesky factor R_p^T is computed, we solve the seminormal equation (8) using the following linear systems:

R_p^T y = A^T (I − S^2) b    (9)

and

R_p x = y.    (10)
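Given the factor R_p, the seminormal solve (8)–(10) amounts to two triangular solves. The sketch below forms M_p densely and uses a dense Cholesky factor purely as a stand-in for the GSA factor; sizes and weight values are arbitrary choices of ours.

import numpy as np
from scipy.linalg import cholesky, solve_triangular, toeplitz

rng = np.random.default_rng(2)
m, n, p = 30, 6, 2
t = rng.standard_normal(m + n - 1)
A = toeplitz(t[n - 1:], t[n - 1::-1])        # m x n Toeplitz matrix as in (14)
b = rng.standard_normal(m)
s2 = np.zeros(m); s2[:p] = 0.9               # entries 1 - d_jj for the p outliers

Mp = A.T @ ((1 - s2)[:, None] * A)           # A^T (I - S^2) A, eq. (7)
Rp = cholesky(Mp)                            # upper triangular, Mp = Rp^T Rp
rhs = A.T @ ((1 - s2) * b)                   # right-hand side of (8)
y = solve_triangular(Rp, rhs, trans='T')     # eq. (9): Rp^T y = rhs
x = solve_triangular(Rp, y)                  # eq. (10): Rp x = y
print(np.allclose(Mp @ x, rhs))              # sanity check of the seminormal solve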
The right-hand side of (9) can be computed either in O(mn) or in O((m + n) log(m + n)) operations. The triangular linear systems (9) and (10) can be solved in O(n^2) operations via forward and backward substitution, respectively. Therefore, the solution of the Toeplitz least squares problem can be computed either in O(n^2 + mn) or O(n^2 + (m + n) log(m + n)) operations by means of GSA. In the remainder of this subsection we discuss the rather complicated details of this GSA.

2.3.1. Generalized Schur algorithm

The seminormal equation (8) can be solved using fast computation of the Cholesky factor of M_p defined in (7). This can be accomplished using displacement theory. For the sake of completeness, the basic concepts of displacement theory for Toeplitz and Toeplitz-like matrices are introduced here. A comprehensive treatment of the topic can be found in Kailath and Sayed (1999). Without loss of generality, we describe the case of the displacement rank of SPD matrices, since SPD matrices are involved in our problem.

Let M ∈ R^{n×n} be an SPD matrix. The displacement of M with respect to the shift matrix Z_n of order n,

Z_n = \begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ & \ddots & \ddots & \\ & & 1 & 0 \end{bmatrix},

is defined as

\nabla_{Z_n} M = M − Z_n M Z_n^T.    (11)
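A quick numerical illustration of this definition (a sketch with an arbitrarily generated Toeplitz T): although M = T^T T is full rank, its displacement M − Z_n M Z_n^T has rank 4, as exploited in Section 2.3.2.

import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(3)
m, n = 40, 8
t = rng.standard_normal(m + n - 1)
T = toeplitz(t[n - 1:], t[n - 1::-1])     # m x n Toeplitz matrix
M = T.T @ T                               # Toeplitz-like, but not Toeplitz

Zn = np.eye(n, k=-1)                      # down-shift matrix Z_n
disp = M - Zn @ M @ Zn.T                  # displacement of M, eq. (11)
print(np.linalg.matrix_rank(M), np.linalg.matrix_rank(disp))   # expect 8 and 4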
The displacement rank k̂ of M with respect to Z_n is the rank of ∇_{Z_n} M. Therefore, (11) can be written as the sum of k̂ = k̂_1 + k̂_2 rank-1 matrices,

\nabla_{Z_n} M = \sum_{i=1}^{k̂_1} g_i^{(p)} g_i^{(p)T} − \sum_{i=1}^{k̂_2} g_i^{(n)} g_i^{(n)T}.

The vectors g_i^{(p)}, i = 1, …, k̂_1, and g_i^{(n)}, i = 1, …, k̂_2, are, respectively, called positive and negative generators of M with respect to Z_n. GSA can be used to compute the Cholesky factorization of M from the generators. Let

G = [g_1^{(p)}, g_2^{(p)}, …, g_{k̂_1}^{(p)}, g_1^{(n)}, g_2^{(n)}, …, g_{k̂_2}^{(n)}],   J = diag(1, 1, …, 1, −1, −1, …, −1),

where J has k̂_1 entries equal to 1 and k̂_2 entries equal to −1. Therefore, we can write (11) as

\nabla_{Z_n} M = M − Z_n M Z_n^T = G J G^T.    (12)
Since

M − Z_n M Z_n^T = G J G^T,
Z_n M Z_n^T − Z_n^2 M Z_n^{2T} = Z_n G J G^T Z_n^T,
   ⋮
Z_n^{n−2} M Z_n^{(n−2)T} − Z_n^{n−1} M Z_n^{(n−1)T} = Z_n^{n−2} G J G^T Z_n^{(n−2)T},

and

Z_n^{n−1} M Z_n^{(n−1)T} = Z_n^{n−1} G J G^T Z_n^{(n−1)T},
the matrix M can be written as a function of the generator matrix G:

M = \sum_{j=0}^{n-1} Z_n^j G J G^T Z_n^{jT}.    (13)
Based on this expansion, we show how GSA works by describing the first iteration. Observe that the matrices involved in (13), except for G J G^T, have the first row equal to 0. A key role in GSA is played by J-orthogonal matrices, matrices H satisfying H J H^T = J. Let G_0 ≡ G and let H_1 be the J-orthogonal matrix (Bojanczyk and Steinhardt, 1989; Rader and Steinhardt, 1988) such that

e_1^T G̃_1 = [γ_1, 0, …, 0]   with γ_1 > 0,

where G̃_1 = G_0 H_1 and e_i, i = 1, …, n, are the columns of the identity matrix. Furthermore, let g̃_1 and Ĝ_1 be the first column and the last k̂ − 1 columns of G̃_1, respectively, i.e.,

G̃_1 = [g̃_1, Ĝ_1].

Observe that the first row of Ĝ_1 is 0. Define J̃ to be the matrix formed by deleting the first row and column of J. Then we can write (13) as follows:
n−1
jT
j
Zn G0 J GT0 Zn
j =0
=
n−1
jT
j
Zn G0 H1 J H T1 GT0 Zn
j =0
=
n−1
jT
˜ 1 ]T Z n ˜ 1 ]J [g˜ 1 , Zn [g˜ 1 , j
j =0
= g˜ 1 g˜ 1T +
n−1
j
jT
Zn g˜ 1 g˜ 1T Zn +
j =1
= g˜ 1 g˜ 1T +
n−2
n−2
T
jT
˜ 1 Zn ˜ 1 J˜ Zn j
j =0 j ˜ 1 ]J [Zn g˜ 1 , ˜ 1 ]T Znj Zn [Zn g˜ 1 ,
T
j =0
= g˜ 1 g˜ 1T +
n−2
j
jT
Zn G1 J GT1 Zn ,
j =0
where G_1 ≡ [Z_n g̃_1, Ĝ_1]; that is, G_1 is obtained from G̃_1 by displacing downward one position the entries in the first column. Thus, the first row of G_1 is 0. Hence, g̃_1 is the first column of the Cholesky factor of M.

To compute the second column of the Cholesky factor of M we proceed in a similar way. That is, determine a hyperbolic transformation H_2 such that

e_2^T G_1 H_2 = [γ_2, 0, …, 0]   with γ_2 > 0.

Define G̃_2 = G_1 H_2. Then the first column of G̃_2 is the second column of the Cholesky factor, and G_2 is obtained from G̃_2 by displacing downward one position the entries in the first column. Proceeding in this way, the Cholesky factorization of the matrix M is computed. It is important to note that GSA relies only on knowledge of the generators of M, not on knowledge of the matrix itself.

The jth iteration of GSA, j = 1, …, n, involves the product G_{j−1} H_j, which can be computed in O(k̂(n − j)) operations (Bojanczyk and Steinhardt, 1989; Rader and Steinhardt, 1988). Therefore, if the displacement rank k̂ of M is small compared to n, GSA computes its factorization in O(n^2) operations, rather than the O(n^3) operations required by standard algorithms (Golub and Van Loan, 1996).
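The following is a minimal NumPy sketch of this procedure for an SPD matrix given by its generators. Householder reflections play the role of the block orthogonal transformations, and a 2 × 2 hyperbolic rotation eliminates the leading negative-generator entry; the function and variable names are ours, and a careful implementation would add the stabilization discussed in the references cited below.

import numpy as np

def gsa_cholesky(Gp, Gn):
    """Sketch of the generalized Schur algorithm.

    Gp (n x k1) and Gn (n x k2) are positive and negative generators of an
    SPD matrix M, i.e. M - Zn M Zn^T = Gp Gp^T - Gn Gn^T with Zn the
    down-shift matrix.  Returns lower-triangular L with M = L L^T.
    """
    Gp, Gn = Gp.astype(float).copy(), Gn.astype(float).copy()
    n = Gp.shape[0]
    L = np.zeros((n, n))
    for k in range(n):
        # orthogonal (Householder) transform on the columns of each block so
        # that row k of the block becomes (gamma, 0, ..., 0)
        for B in (Gp, Gn):
            x = B[k, :].copy()
            nx = np.linalg.norm(x)
            if nx > 0.0:
                v = x
                v[0] += np.copysign(nx, x[0] if x[0] != 0.0 else 1.0)
                B -= np.outer(B @ v, 2.0 * v / (v @ v))
        if Gp[k, 0] < 0.0:
            Gp[:, 0] *= -1.0                      # make the pivot positive
        # hyperbolic rotation between the leading columns of Gp and Gn that
        # annihilates Gn[k, 0]; needs |Gp[k,0]| > |Gn[k,0]|, true for SPD M
        rho = Gn[k, 0] / Gp[k, 0]
        c = 1.0 / np.sqrt(1.0 - rho * rho)
        gp, gn = Gp[:, 0].copy(), Gn[:, 0].copy()
        Gp[:, 0] = c * (gp - rho * gn)
        Gn[:, 0] = c * (gn - rho * gp)
        L[:, k] = Gp[:, 0]                        # k-th column of the factor
        Gp[:, 0] = np.roll(Gp[:, 0], 1)           # shift the column down one position
        Gp[0, 0] = 0.0                            # and zero its first entry
    return L

On small tests, for example with the generators constructed in Section 2.3.2 below, one can check that L @ L.T reproduces the explicitly formed matrix.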
The GSA is weakly stable (Chandrasekaran and Sayed, 1996; Stewart and Van Dooren, 1997; Mastronardi et al., 2001b).

2.3.2. Applying GSA to our problem

Now let

A = T = \begin{bmatrix}
t_n & t_{n-1} & \cdots & t_2 & t_1 \\
t_{n+1} & t_n & \cdots & t_3 & t_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
t_{m+n-1} & t_{m+n-2} & \cdots & t_{m+1} & t_m
\end{bmatrix}    (14)
be an m × n Toeplitz matrix. Although the matrix M ≡ A^T A is not Toeplitz, it is Toeplitz-like, since its displacement rank with respect to the displacement matrix Z_n is equal to 4. Define m̂ = m + n − 1. In particular, since

M = \begin{bmatrix}
\sum_{i=n}^{m̂} t_i t_i & \sum_{i=n}^{m̂} t_i t_{i-1} & \cdots & \sum_{i=n}^{m̂} t_i t_{i-n+1} \\
\sum_{i=n}^{m̂} t_i t_{i-1} & \sum_{i=n}^{m̂} t_{i-1} t_{i-1} & \cdots & \sum_{i=n}^{m̂} t_{i-1} t_{i-n+1} \\
\vdots & & \ddots & \vdots \\
\sum_{i=n}^{m̂} t_i t_{i-n+1} & \sum_{i=n}^{m̂} t_{i-n+2} t_{i-n+1} & \cdots & \sum_{i=n}^{m̂} t_{i-n+1} t_{i-n+1}
\end{bmatrix},

then

\nabla_{Z_n} M = M − Z_n M Z_n^T =
\begin{bmatrix}
\sum_{i=n}^{m̂} t_i t_i & \sum_{i=n}^{m̂} t_i t_{i-1} & \cdots & \sum_{i=n}^{m̂} t_i t_{i-n+1} \\
\sum_{i=n}^{m̂} t_i t_{i-1} & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
\sum_{i=n}^{m̂} t_i t_{i-n+1} & 0 & \cdots & 0
\end{bmatrix}
+ \begin{bmatrix} 0 \\ t_{n-1} \\ \vdots \\ t_1 \end{bmatrix} [0 \; t_{n-1} \cdots t_1]
− \begin{bmatrix} 0 \\ t_{m̂} \\ \vdots \\ t_{m+1} \end{bmatrix} [0 \; t_{m̂} \cdots t_{m+1}].    (15)

Define v ≡ T^T T e_1 / (\sum_{i=n}^{m̂} t_i^2)^{1/2}, where e_1 is the first column of the identity matrix. The generators of M with respect to Z_n are

g_1^{(p)} ≡ v,
g_2^{(p)} ≡ [0, t_{n−1}, …, t_1]^T,
g_1^{(n)} ≡ Z_n Z_n^T v,
g_2^{(n)} ≡ [0, t_{m̂}, t_{m̂−1}, …, t_{m+2}, t_{m+1}]^T.

Using these four generators, GSA can compute the Cholesky factor of M, or, equivalently, the R factor of the QR factorization of T (required for solving linear systems such as (5)). Since the displacement rank k̂ = 4 ≪ n, the computational complexity of GSA is O(n^2). Moreover, v can be computed either in O(mn) or in O((m + n) log(m + n)) operations, exploiting the Toeplitz structure of T.

The computation of the Cholesky factor of M_p (required for solving (8)) can be done in a similar way. In fact, the rank of the displacement of M_p,

\nabla_{Z_n} M_p = \nabla_{Z_n} A^T (I − S^2) A,    (16)
is bounded by the sum of the displacement rank of A^T A and the displacement rank of the rank-p matrix A^T S^2 A. As already seen, the first term of the sum has displacement rank equal to 4. It is straightforward to check that the displacement rank of the second term of the sum is at most 2p. It can be shown that the positive and negative generators of M_p are

g_1^{(p)} ≡ v,
g_2^{(p)} ≡ [0, t_{n−1}, …, t_1]^T,
g_{k+2}^{(p)} ≡ (1 − d_{kk})^{1/2} Z_n A(k, :)^T,   k = 1, …, p,
g_1^{(n)} ≡ Z_n Z_n^T v,
g_2^{(n)} ≡ [0, t_{m̂}, t_{m̂−1}, …, t_{m+2}, t_{m+1}]^T,
g_{k+2}^{(n)} ≡ (1 − d_{kk})^{1/2} A(k, :)^T,   k = 1, …, p,    (17)

with v ≡ T^T T e_1 / (\sum_{i=n}^{m̂} t_i^2)^{1/2}. Computing the generators requires either O(mn) or O((m + n) log(m + n)) operations, depending on the choice of algorithm, and then GSA requires O(n^2) additional operations.

In particular, given the generators (17), the computation of the Cholesky factor R_p^T of M_p via GSA is accomplished in n iterations working on the generator matrix G = [G^{(p)}, G^{(n)}], with

G^{(p)} = [g_1^{(p)}, g_2^{(p)}, …, g_{p+2}^{(p)}]   and   G^{(n)} = [g_1^{(n)}, g_2^{(n)}, …, g_{p+2}^{(n)}],

and defining J = diag(1, 1, …, 1, −1, −1, …, −1), with p + 2 entries of each sign.
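As a sanity check (a sketch with sizes and a weight pattern chosen arbitrarily by us), the generators (17) can be verified numerically against the displacement of the explicitly formed M_p; the resulting generator blocks are exactly what would be handed to the GSA sketch of Section 2.3.1.

import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(4)
m, n, p = 25, 6, 2
t = rng.standard_normal(m + n - 1)
T = toeplitz(t[n - 1:], t[n - 1::-1])           # m x n Toeplitz matrix of (14)
d = np.ones(m); d[:p] = 0.05                    # weights d_kk of the p distrusted rows
Mp = T.T @ (d[:, None] * T)                     # M_p = A^T (I - S^2) A

Zn = np.eye(n, k=-1)                            # down-shift matrix Z_n
v = T.T @ T[:, 0] / np.linalg.norm(T[:, 0])     # v = T^T T e_1 / sqrt(sum t_i^2)
c1 = np.concatenate(([0.0], t[n - 2::-1]))      # [0, t_{n-1}, ..., t_1]
c2 = np.concatenate(([0.0], t[m + n - 2:m - 1:-1]))   # [0, t_{m_hat}, ..., t_{m+1}]

Gp = np.column_stack([v, c1] + [np.sqrt(1 - d[k]) * (Zn @ T[k, :]) for k in range(p)])
Gn = np.column_stack([Zn @ Zn.T @ v, c2] + [np.sqrt(1 - d[k]) * T[k, :] for k in range(p)])

disp = Mp - Zn @ Mp @ Zn.T                      # displacement of M_p, cf. (16)
print(np.allclose(disp, Gp @ Gp.T - Gn @ Gn.T)) # expect True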
The kth iteration, k = 1, 2, …, n, consists of the following steps.

(1) Reduce the generator matrices G^{(p)} and G^{(n)} by annihilating all the entries in the kth row of both matrices except for the first entry. This is accomplished by multiplying the matrices by two different Householder transformations. Since the first k − 1 rows of G^{(p)} and G^{(n)} are 0, neglecting terms of lower order, this step requires 8p(n − k) operations.
(2) Perform a hyperbolic rotation between the first columns of the matrices G^{(p)} and G^{(n)} in order to annihilate G^{(n)}(k, 1). Since the first k − 1 entries of the first columns of G^{(p)} and G^{(n)} are 0, this step requires 6(n − k) operations (plus lower order terms). Thus, at each iteration, the J-orthogonal matrix is given by the product of the two Householder transformations and the hyperbolic rotation involved. The first column of G^{(p)} is the kth row of R_p.
(3) Displace downward one position the entries of the first column of G^{(p)}.

Hence, the overall complexity of GSA is 2n(m + p) + 4pn^2 operations. (The complexity reduces to O((m + n) log(m + n)) + 4pn^2 if the vector v is computed via the fast Fourier transform (Van Loan, 1992).) Computing the Cholesky factor of M_p without exploiting the structure requires 2n^2(m − n/3) + nm operations (Golub and Van Loan, 1996). Therefore, GSA is faster if p is negligible with respect to m and n. The Toeplitz-block case can be handled in a similar way with the same complexity.

3. Weighting for the ill-conditioned or rank-deficient case

If the matrix A is ill-conditioned or rank-deficient, then a regularization term is often added, transforming our problem to

\min_x \| D^{1/2}(b − Ax) \|_2^2 + λ \| F x \|_2^2,    (18)
where F is a matrix designed to control the size of x and λ > 0 defines the relative importance of the two terms. Typical choices of the matrix F include the identity matrix and the matrices

\begin{bmatrix} 1 & -1 & & \\ & \ddots & \ddots & \\ & & 1 & -1 \end{bmatrix} \in R^{(n-1)\times n}
\quad and \quad
\begin{bmatrix} 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \end{bmatrix} \in R^{(n-2)\times n},

that are scaled approximations to the first and the second derivative operators, respectively (Hansen, 1998). By setting the derivatives to 0 we obtain the normal equations

(A^T D A + λ F^T F) x = A^T D b.    (19)
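For concreteness, a small dense sketch of (19) with the first-difference operator as F (sizes, weights, and the value of λ are arbitrary choices of ours); it agrees with the equivalent augmented least squares formulation:

import numpy as np

rng = np.random.default_rng(5)
m, n, lam = 30, 10, 1e-2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
d = np.ones(m); d[:4] = 0.1                       # downweighted observations

F = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)      # first-difference operator
x = np.linalg.solve(A.T @ (d[:, None] * A) + lam * F.T @ F,
                    A.T @ (d * b))                # normal equations (19)

x_ref, *_ = np.linalg.lstsq(np.vstack([np.sqrt(d)[:, None] * A, np.sqrt(lam) * F]),
                            np.concatenate([np.sqrt(d) * b, np.zeros(n - 1)]),
                            rcond=None)
print(np.allclose(x, x_ref))                      # expect True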
3.1. An approach based on preconditioning

If A and F have small displacement rank, then fast algorithms exist for solving the problem when D = I and when D is diagonal with 0–1 weights, and again we want to take advantage of this. If none of the weights is 0, our problem (18) is equivalent to solving

\begin{bmatrix} D^{-1} & A \\ A^T & -λ F^T F \end{bmatrix} \begin{bmatrix} y \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}.    (20)

If we use constraint preconditioning on this system, we obtain

N = \begin{bmatrix} I & A \\ A^T & -λ F^T F \end{bmatrix}^{-1} \begin{bmatrix} D^{-1} & A \\ A^T & -λ F^T F \end{bmatrix} \equiv B_I^{-1} B_D.

Again, N is the identity plus a rank-p matrix, so an iterative method such as GMRES will find the solution in p + 1 iterations.

3.2. Small rank updates using Sherman–Morrison–Woodbury

Again we can use the Sherman–Morrison–Woodbury formula to solve the linear system involving B_D at a cost about equal to that of solving p + 1 linear systems involving B_I.

3.3. Generalized Schur algorithm

The GSA can be easily updated to handle this case. In fact, if F ≡ I, the matrix J is the same as in the unregularized case and the generators are the same as in (17) except that we redefine

v ≡ (T^T T e_1 + λ e_1) / (λ + \sum_{i=n}^{m̂} t_i^2)^{1/2}.

If F and A have low displacement rank with respect to the same displacement operator, it is always possible to construct a fast GSA, since the number of generators is much smaller than the size of the coefficient matrix in the linear system (20).

4. Solving robust regression problems

In this section we apply the techniques we have developed to the robust regression problem (1). It is possible to add a regularization term, but for ease of exposition, we omit it. We consider solving our minimization problem (1) by Newton's method as, for example, in O'Leary (1990). We compute the gradient vector z defined by z_j = ρ′(r_j)
and define D(r) to be a diagonal matrix with entries d_jj = ρ″(r_j). (At points where the function ρ fails to be differentiable, standard techniques, such as use of subgradients, should be employed.) Then the function

f(x) = \sum_{i=1}^{m} \rho(r_i(x))

has gradient g = −A^T z and Hessian matrix H = A^T D A, so the Newton step for solving (1) is

s = −H^{-1} g = (A^T D A)^{-1} A^T z,

which causes a change

Δr = −As = −A (A^T D A)^{-1} A^T z

in the residual vector r. Once we update x using s and r using Δr, we can compute the new gradient vector in preparation for the next Newton step.

Observe now that we can determine s by solving the linear system

\begin{bmatrix} D^{-1} & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} y \\ s \end{bmatrix} = \begin{bmatrix} D^{-1} z \\ 0 \end{bmatrix},

which is the same problem we considered in (4) in Section 2, except for the scaling by D^{-1} on the right-hand side.

We can overcome the difficulty that arises when some component d_jj = 0; in this case, we can replace it by a modest value, using a modified Newton method (Nash and Sofer, 1996). (This value could be, for example, the square root of machine precision times the largest entry in D. See Fang and O'Leary (2006) for more information on modifying the Cholesky factor in Newton's method.) Furthermore, our weighting functions have the property that when |r_j| < β, either d_jj = 1 (Huber, Talwar) or d_jj ≈ 1 (logistic, Fair). In the first case, our Sherman–Morrison–Woodbury algorithm will work well, since the matrix A^T D A is a small-rank modification of A^T A, and in the second, we can use the preconditioning algorithm, since the modification is a small-rank matrix plus a matrix of small norm.

We can also take advantage of the fact that the matrix D might not change much from one Newton iteration to the next. In such cases, we can save a factorization of A^T D A used to compute one Newton step and use it after D has been changed (in place of the matrix B_I in Section 2) in either the preconditioning algorithm or the Sherman–Morrison–Woodbury algorithm in order to compute the next Newton step.

As noted, for example, in Watson (1998), the ℓ1 approximation can be computed from the Huber function by letting β → ε, where ε is a number smaller than the smallest component of the vector |r|. The initial guess for Newton's method is obtained by solving the ordinary least squares problem min_x ‖Ax − b‖_2. Newton's method is iterated until ‖g‖_2 ≤ tol, with tol a fixed tolerance. For the numerical examples in the next section, tol is chosen equal to 10^{-5}; a compact sketch of this iteration follows.
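The following dense NumPy sketch implements this iteration for the Huber function; the function name and parameters are ours, the dense solve stands in for the structured solvers of Sections 2 and 3, and the small value delta replaces zero weights as suggested above (no line search or other safeguards are included).

import numpy as np

def huber_newton(A, b, beta, tol=1e-5, max_iter=50, delta=1e-8):
    """Newton iteration for min_x sum_i rho(r_i), rho = Huber function (sketch)."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)      # ordinary least squares start
    for _ in range(max_iter):
        r = b - A @ x
        z = np.clip(r, -beta, beta)                # z_j = rho'(r_j) for Huber
        g = -A.T @ z                               # gradient of f
        if np.linalg.norm(g) <= tol:
            break
        d = np.where(np.abs(r) <= beta, 1.0, delta)    # d_jj = rho''(r_j), 0 -> delta
        s = np.linalg.solve(A.T @ (d[:, None] * A), A.T @ z)   # Newton step
        x = x + s
    return x

# usage on data with a few gross outliers (sketch, arbitrary sizes)
rng = np.random.default_rng(6)
A = rng.standard_normal((200, 10)); x_true = np.ones(10)
b = A @ x_true + 1e-4 * rng.standard_normal(200)
b[:5] += 5.0 * rng.standard_normal(5)              # contaminate five observations
print(np.linalg.norm(huber_newton(A, b, beta=5e-4) - x_true))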
5. Numerical examples

We performed some numerical experiments with the algorithms designed in this paper, considering an example taken from a paper on Toeplitz least squares problems with no outliers (Ng, 1996).

Example 1. This example concerns finite impulse response (FIR) system identification, which has wide applications in engineering. The input signal of an FIR system identification model drives the unknown system to produce the output sequence. We model the unknown system as an FIR filter. If the unknown system is actually an FIR system, then the model is exact. In the tests, we formulate a well-defined least squares prediction problem by estimating the autocovariances from the data samples. By solving the normal equations, the FIR system coefficients can be found. The entries t_k of the 100 × 30 rectangular Toeplitz matrices of Eq. (14) are generated from the second-order autoregressive process (Haykin, 2001) given by

t_k − 1.4 t_{k−1} + 0.5 t_{k−2} = y_k,
Table 1
Percentage of problems for which the relative error is greater than 10^{-3} on 100 simulations with different percentages of outliers (rows) and different values of the standard deviation (columns)

                 1.2   1.5   2.0   2.5   5.0   10.0
2% outliers       1     0     1     2     1      2
5% outliers      10    22     9    13    11     16
10% outliers     42    44    41    40    49     43
where t_{−1}, t_0 and y_k, k = 1, 2, …, m + n − 1, are sampled from the Gaussian distribution with mean 0 and standard deviation 1. We employ this input process to generate the Toeplitz data matrices. In the tests, we choose the solution x to be the vector with all entries equal to 1. The right-hand side b in (2) is computed by perturbing Tx by Gaussian noise,

b = T_{100,30} x + e,

with e_k generated from the Gaussian distribution with mean 0 and standard deviation 1.0e−4, k = 1, 2, …, 100.

We report in Table 1 only the results obtained by computing a solution from robust regression using the Talwar function; this seems to deliver the most reliable results. The parameter β for the Talwar function has been chosen equal to 0.5 × 10^{-4}, on the same order as the noise. For each case, we performed 100 simulations replacing, respectively, 2%, 5% and 10% of the entries of the vector b by values from the Gaussian distribution with mean 0 and standard deviation 1.2, 1.5, 2.0, 2.5, 5, 10, respectively. For each value of the standard deviation and for each level of changed observations, 100 simulations with Toeplitz matrices of size 100 × 30 are considered. The values reported in the table are the percentage of times that the relative error was larger than 10^{-3}:

‖x − x_c‖_2 / ‖x‖_2 > 10^{-3},

with x_c the solution computed by solving the robust regression problem. The algorithm produces a good solution when a small number of observations are outliers, and this seems to be somewhat independent of the standard deviation. On average, four steps of Newton's method are required.

Example 2. To show the computational efficiency of our ideas, we used Matlab 5.3 to compare the number of operations of our algorithm with that of a standard algorithm for solving (18). Let T ∈ R^{1000×400} be a Toeplitz matrix generated as in the previous example, and consider solving (19) with F = I. We computed 16 SPD matrices

T^T (I − S_p^2) T + λ I,   p = 0, 1, …, 15,    (21)

where S_p is a diagonal matrix with only p entries different from 0, corresponding to the indices of the outliers. We observe that the size of the regularizing factor, chosen for these experiments to be λ = 10^{-4}, is not relevant from a complexity point of view.

In Fig. 1 we compare the work of computing the Cholesky factorization of (21) using the GSA and the following algorithm, which does not exploit the Toeplitz structure:

(1) Compute T̂ = (I − S_p^2)^{1/2} T. This step requires O(pn) operations.
(2) Compute R_M, the R factor of the QR factorization of T̂, via the Matlab function qr. This step requires O(mn^2) operations (Golub and Van Loan, 1996).
(3) Compute the matrix Â = R_M^T R_M + λ I. This step requires O(n^3) operations (Golub and Van Loan, 1996).
(4) Compute the Cholesky factorization of Â via the Matlab function chol. This step requires O(n^3) operations (Golub and Van Loan, 1996).

Since m ≫ n, the most expensive step is the second one. At the same order of complexity, we could also compute the Cholesky factor of (21) by first computing the product T^T (I − S_p^2) T.
[Fig. 1 plots flops (logarithmic scale, about 10^6 to 10^9) against the number of outliers (0 to 15) for the GSA algorithm and the general algorithm.]

Fig. 1. Comparison of the number of floating point operations (flops) to compute the Cholesky factorization of T^T (I − S_p^2) T + λI. The solid curve is the cost when exploiting the Toeplitz structure using the GSA algorithm. Computing the generators takes about 1,605,000 operations, and GSA ranges from 2,102,000 to 14,228,000 operations. The dash-dotted curve is the result of computing T̂ (0–6030 operations), computing R_M (673,920,000 operations), computing Â (21,494,000 operations), and factoring Â (21,413,000 operations).
On this set of problems, we found that our algorithm is at least 45 times faster than the standard algorithm. The computation of the generators involved in GSA is very cheap because only one matrix–vector product is involved. Given the generators, the complexity of GSA grows linearly with the number of outliers.

6. Conclusions

We have shown how to efficiently compute the solution to robust regression problems when the data matrix has low displacement rank, and have also explained how regularization can be included in the problem formulation. Thus, the possibility of having outliers in the data is no longer an impediment to using fast methods for structured problems.

Acknowledgments

The work of the first author was partially supported by MIUR, Grant no. 2004015437, and by the National Research Council of Italy under the Short Term Mobility Program. This author would like to acknowledge the hospitality of the Department of Computer Science and Institute for Advanced Computer Studies of the University of Maryland, where part of this work took place. The work of the second author was supported by the National Science Foundation under Grant CCF 05-14213 and the Department of Energy under Grant DEFG0204ER25655. The authors would like to thank the anonymous referees for their helpful comments.

References

Benzi, M., Ng, M.K., 2006. Preconditioned iterative methods for weighted Toeplitz least squares problems. SIAM J. Matrix Anal. Appl. 27, 1106–1124.
Björck, Å., 1996. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, PA.
Björck, Å., Paige, C.C., 1994. Solution of augmented linear systems using orthogonal factorizations. BIT 34, 1–24.
Bojanczyk, A.W., Steinhardt, A.O., 1989. Stabilized hyperbolic Householder transformations. IEEE Trans. Acoust. Speech Signal Process. 37, 1286–1288.
Chandrasekaran, S., Sayed, A.H., 1996. Stabilizing the generalized Schur algorithm. SIAM J. Matrix Anal. Appl. 17, 950–983.
Coleman, D., Holland, P., Kaden, N., Klema, V., 1980. A system of subroutines for iteratively reweighted least squares computations. ACM Trans. Math. Software 6, 327–336.
Dyn, N., Ferguson, W.E., 1983. The numerical solution of equality-constrained quadratic programming problems. Math. Comp. 41, 165–170.
Fair, R.C., 1974. On the robust estimation of econometric models. Ann. Econom. Social Measurement 3, 667–678 (ref. by Holland, P.W., Welsch, R.E., 1977. Robust regression using iteratively reweighted least-squares. Comm. Statist. Theory Methods A 6, 813–827).
Fang, H.R., O'Leary, D.P., 2006. Modified Cholesky algorithms: a catalog with new approaches. Submitted for publication.
Golub, G.H., Van Loan, C.F., 1996. Matrix Computations, third ed. Johns Hopkins University Press, Baltimore, MD.
Hansen, P.C., 1998. Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. SIAM Monographs on Mathematical Modeling and Computation. SIAM, Philadelphia, PA.
Haykin, S., 2001. Adaptive Filter Theory, fourth ed. Prentice-Hall, Englewood Cliffs, NJ.
Hinich, M.J., Talwar, P.P., 1975. A simple method for robust regression. J. Amer. Statist. Assoc. 70, 113–119.
Holland, P.W., Welsch, R.E., 1977. Robust regression using iteratively reweighted least-squares. Comm. Statist. Theory Methods A 6, 813–827.
Huber, P.J., 1964. Robust estimation of a location parameter. Ann. Math. Statist. 35, 73–101.
Kailath, T., Sayed, A.H., 1995. Displacement structure: theory and applications. SIAM Rev. 37, 297–386.
Kailath, T., Sayed, A.H., 1999. Fast Reliable Algorithms for Matrices with Structure. SIAM, Philadelphia, PA.
Kalsi, A., O'Leary, D.P., 2006. Algorithms for structured total least squares problems with applications to blind image deblurring. J. Res. Nat. Inst. Standards Technol. 111 (2), 113–119.
Mastronardi, N., Lemmerling, P., Van Huffel, S., 2001a. Fast structured total least squares algorithm for solving the basic deconvolution problem. SIAM J. Matrix Anal. Appl. 22, 533–553.
Mastronardi, N., Van Dooren, P., Van Huffel, S., 2001b. On the stability of the generalized Schur algorithm. Lecture Notes in Comput. Sci. 1988, 560–567.
Mastronardi, N., Lemmerling, P., Kalsi, A., O'Leary, D.P., Van Huffel, S., 2004. Implementation of the regularized structured total least squares algorithms for blind image deblurring. Linear Algebra Appl. 391, 203–221.
Nash, S.G., Sofer, A., 1996. Linear and Nonlinear Programming. McGraw-Hill, New York.
Ng, M.K., 1996. Fast direct methods for Toeplitz least squares problems. In: Proceedings of IEEE TENCON '96, Digital Signal Processing Applications, vol. 2, pp. 743–748.
O'Leary, D.P., 1990. Robust regression computation using iteratively reweighted least squares. SIAM J. Matrix Anal. Appl. 11, 466–480.
O'Leary, D.P., Simmons, J.A., 1981. A bidiagonalization-regularization procedure for large scale discretizations of ill-posed problems. SIAM J. Sci. Statist. Comput. 2, 474–489.
Pruessner, A., O'Leary, D.P., 2003. Blind deconvolution using a regularized structured total least norm approach. SIAM J. Matrix Anal. Appl. 24, 1018–1037.
Rader, C.M., Steinhardt, A.O., 1988. Hyperbolic Householder transforms. SIAM J. Matrix Anal. Appl. 9, 269–290.
Stewart, M., Van Dooren, P., 1997. Stability issues in the factorization of structured matrices. SIAM J. Matrix Anal. Appl. 18, 104–118.
Van Loan, C., 1992. Computational Frameworks for the Fast Fourier Transform. SIAM Frontiers in Applied Mathematics Series. SIAM, Philadelphia, PA.
Watson, G.A., 1998. Choice of norms for data fitting and function approximation. Acta Numerica 7, 337–377.