An even faster algorithm for ridge regression of reduced rank data

Berwin A. Turlach
School of Mathematics and Statistics (M019), The University of Western Australia, 35 Stirling Highway, Crawley WA 6009, Australia

Computational Statistics & Data Analysis 50 (2006) 642–658, doi:10.1016/j.csda.2004.09.007

Received 13 January 2003; received in revised form 15 September 2004; accepted 21 September 2004. Available online 13 October 2004.

Abstract

Hawkins and Yin (Comput. Statist. Data Anal. 40 (2002) 253) describe an algorithm for ridge regression of reduced rank data, i.e. data where p, the number of variables, is larger than n, the number of observations. Whereas a direct implementation of ridge regression in this setting requires calculations of order $O(np^2 + p^3)$, their algorithm uses only calculations of order $O(np^2)$. In this paper, we describe an alternative algorithm based on a factorization of the (transposed) design matrix. This approach is numerically more stable, further reduces the amount of calculations and needs less memory. In particular, we show that the factorization can be calculated in $O(n^2 p)$ operations. Once the factorization is obtained, for any value of the ridge parameter the ridge regression estimator can be calculated in $O(np)$ operations and the generalized cross-validation score in $O(n)$ operations.

Keywords: Bidiagonal factorization; Cross validation; Matrix factorization; Ridge regression

1. Introduction

To address problems with highly correlated predictors, Hoerl and Kennard (1970) proposed the use of ridge regression. Writing X for the $n \times p$ design matrix and y for the vector of responses, the ridge regression estimator for $\beta$, for a given ridge parameter k, is given by

$$ \hat\beta(k) = \left( X'X + kI_* \right)^{-1} X'y. \qquad (1) $$

Here $I_*$ is the $p \times p$ identity matrix if the regression has no intercept. Otherwise, assuming that the first component of $\beta$ corresponds to the intercept term, it is an identity matrix with its top left element replaced by zero.

Hawkins and Yin (2002) point out that calculating $\hat\beta(k)$ directly using (1) requires computations of order $O(np^2 + p^3)$, namely $O(np^2)$ to set up the normal equations and $O(p^3)$ to perform the matrix inversion. However, in some applied sciences, e.g. chemometrics, it is not unusual that p, the number of variables, is larger than n, the number of observations—sometimes substantially larger. In these situations such a direct approach would be prohibitively expensive. Hawkins and Yin (2002) propose an alternative algorithm that, essentially, calculates the inverse of the matrix $X'X + kI_*$ via repeated applications of a rank-1 updating formula and allows $\hat\beta(k)$ to be calculated in $O(np^2)$ operations. Thus the algorithm of Hawkins and Yin (2002) offers some substantial savings in computing effort.

Although we have no doubts that their algorithm is adequate for a wide variety of problems, we have some concerns about its numerical stability and, hence, about its suitability as a "black-box" algorithm. The algorithm has several features that the numerical analysis literature considers problematic. In particular, the algorithm uses rank-1 matrix updates and it calculates a matrix inverse explicitly.

For calculating ridge regression estimates in the more familiar case where n is larger than p, several numerically stable algorithms based on various factorizations of the design matrix have been proposed. These algorithms also have the advantage that, once the factorization is calculated, it becomes relatively cheap to calculate the ridge regression estimator for different values of the ridge parameter k or to calculate the generalized cross-validation (GCV) score. For example, Golub and van Loan (1996) propose to calculate a QR factorization of the design matrix X. For the case n > p, this factorization can be calculated in $O(np^2)$ operations and, once the factorization is calculated, for any given k, one needs $O(p^3)$ operations to update the factorization and $\hat\beta(k)$ can be determined in $O(p^2)$ operations. Lawson and Hanson (1995) suggest to first calculate the singular value decomposition (SVD) of the design matrix X. Once the SVD is calculated, for any given k, one can update the SVD in $O(p)$ operations and calculate $\hat\beta(k)$ in $O(p^2)$ operations. Unfortunately, calculating the SVD of a matrix is an expensive and non-trivial task.

An approach which uses a factorization that is easy to calculate was proposed by Eldén (1977); see also Björck (1996, Section 5.3.4). Eldén (1977) proposed to calculate a bidiagonal factorization of the design matrix X. This factorization can be computed with the same effort as a QR factorization and its implementation is not more difficult than the implementation of a QR factorization. However, the bidiagonal factorization approach offers substantial computational savings for subsequent computations that equal the savings of the SVD approach. For a given value of the ridge parameter k, once the bidiagonal factorization is calculated, one can update the factorization in $O(p)$ operations, calculate $\hat\beta(k)$ in $O(p^2)$ operations (Eldén, 1977) and the GCV score in $O(p)$ operations (Eldén, 1984).
Eldén's algorithms have good numerical stability properties and can be adapted to quite a range of problems; e.g., Wood (2000) extends them to a setting with several quadratic penalties.

In this paper, we shall show how the algorithms of Eldén (1977, 1984) can be adapted to ridge regression problems with n < p. In particular, we show that by basing our algorithm on a factorization of the transpose of the design matrix we can calculate the factorization in $O(n^2 p)$ operations. Once the factorization is calculated, for any given value of the ridge parameter k, we can calculate $\hat\beta(k)$ in $O(np)$ operations and the GCV score in $O(n)$ operations.

The algorithm that we propose has the additional advantage that it does not need much memory. While calculating the factorization of the transpose of the design matrix, most information about the factorization can be stored at the memory locations that originally held the design matrix and only a few extra vectors of length n are needed to hold auxiliary information. By way of contrast, the algorithm of Hawkins and Yin (2002) constructs the $p \times p$ matrix $\left( X'X + kI_* \right)^{-1}$ explicitly and, thus, requires quite some memory resources.

The rest of the paper is organized as follows. Details of our algorithm are given in Section 2. In Section 3 we report on some timing experiments that we performed for our proposed algorithm.

2. An algorithm using a bidiagonal factorization

The algorithm described below is an adaptation of algorithms proposed by Eldén (1977, 1984). Eldén (1977) considers problems of the form

$$ \min_{\beta}\; (y - X\beta)'(y - X\beta) + k\, \beta'K'K\beta, \qquad (2) $$

which arise when one tries to solve Fredholm equations of the first kind. Here K may be a general (positive definite) matrix. If K is the identity matrix, then Eldén says that (2) is in standard form; see also Björck (1996). In the next subsection we show how to reduce the ridge regression problem to a problem in standard form. The following subsections discuss how the methods of Eldén (1977, 1984) for calculating the solution of (2) and for choosing k by GCV can be adapted to the ridge regression setting with n < p.

On occasion it may happen that the observations have weights attached and that one wants to perform weighted (ridge) regression. To keep the notation simple, we ignore prior weights in the discussion below. Such prior weights are easily incorporated by replacing X and y with $W^{1/2}X$ and $W^{1/2}y$, respectively, where W is a matrix with the prior weights on its diagonal.

2.1. Reducing the problem to standard form

Note that (1) can be written as

$$ \left( X'X + kI_* \right) \hat\beta(k) = X'y. $$




This shows that these are the normal equations of the following least-squares problem:

$$ \min_{\beta}\; \left\| \begin{pmatrix} X \\ \sqrt{k}\, I_* \end{pmatrix} \beta - \begin{pmatrix} y \\ 0 \end{pmatrix} \right\|^2 . \qquad (3) $$

Our first aim is to reduce (3) to a problem in standard form. Hence, partition $X = (X_1\; X_2)$ and $\beta = (\beta_1'\; \beta_2')'$ such that we can write (3) as

$$ \min_{\beta_1,\beta_2}\; \left\| \begin{pmatrix} X_1 & X_2 \\ 0 & \sqrt{k}\, I \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} - \begin{pmatrix} y \\ 0 \end{pmatrix} \right\|^2 . \qquad (4) $$

Assume that $X_1$ has $p_1$ columns. Next, we calculate a QR factorization of the matrix $X_1$, i.e. we find an $n \times n$ orthogonal matrix Q and a $p_1 \times p_1$ matrix R such that

$$ X_1 = Q \begin{pmatrix} R \\ 0 \end{pmatrix} = (Q_1\; Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix}, \qquad (5) $$

where $Q_1$ is $n \times p_1$ and $Q_2$ is $n \times (n - p_1)$. Since Q is an orthogonal matrix we have

$$ \left\| \begin{pmatrix} X \\ \sqrt{k}\, I_* \end{pmatrix} \beta - \begin{pmatrix} y \\ 0 \end{pmatrix} \right\|^2 = \left\| \begin{pmatrix} Q' & 0 \\ 0 & I \end{pmatrix} \left[ \begin{pmatrix} X \\ \sqrt{k}\, I_* \end{pmatrix} \beta - \begin{pmatrix} y \\ 0 \end{pmatrix} \right] \right\|^2 , $$

whence we can write (4) as

$$ \min_{\beta_1,\beta_2}\; \left\| \begin{pmatrix} R & Q_1'X_2 \\ 0 & Q_2'X_2 \\ 0 & \sqrt{k}\, I \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} - \begin{pmatrix} Q_1'y \\ Q_2'y \\ 0 \end{pmatrix} \right\|^2 . \qquad (6) $$

Thus, we can calculate $\hat\beta(k)$ using a two-step procedure. First, by solving

$$ \min_{\beta_2}\; \left\| \begin{pmatrix} Q_2'X_2 \\ \sqrt{k}\, I \end{pmatrix} \beta_2 - \begin{pmatrix} Q_2'y \\ 0 \end{pmatrix} \right\|^2 \qquad (7) $$

we obtain $\hat\beta_2(k)$. Then, using this result, we calculate

$$ \hat\beta_1(k) = R^{-1} Q_1'\left( y - X_2\hat\beta_2(k) \right). \qquad (8) $$

Note that (7) is a problem in standard form.

Remark 1. For ridge regression problems, $p_1$ is either 1 or 0 depending on whether the design matrix X has an intercept term or not; but the algorithm discussed here is also applicable for general $p_1$. If $p_1 = 0$, then (3) is already in standard form and nothing has to be done at this stage. Otherwise, we have to calculate the QR factorization in (5). Note that we would not calculate Q explicitly. Such QR factorizations are conveniently implemented using Householder matrices. If $p_1 = 1$ we have to calculate only one Householder matrix and the Householder vector that represents this matrix can be calculated in $O(n)$ operations. Using this Householder vector, $Q'X$ and $Q'y$ can be calculated in $O(np)$ operations; see Golub and van Loan (1996, p. 210ff).
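To make this reduction concrete, the following short R sketch (the function name and output structure are ours, not part of the implementation discussed in Section 3) carries out Section 2.1 for the usual ridge case $p_1 = 1$, with the intercept in the first column of X, using the Householder-based QR routines in base R:

```r
## Sketch of the reduction to standard form (Section 2.1) for p1 = 1.
## Returns the pieces needed for problem (7)/(9) and for (8).
to_standard_form <- function(X, y) {
  qr1 <- qr(X[, 1, drop = FALSE])            # QR factorization (5) of X1 (one Householder vector)
  QtX <- qr.qty(qr1, X[, -1, drop = FALSE])  # Q'X2, computed in O(np) operations
  Qty <- qr.qty(qr1, y)                      # Q'y,  computed in O(n) operations
  list(R     = qr.R(qr1),                    # the 1 x 1 factor R
       Q1tX2 = QtX[1, , drop = FALSE],       # Q1'X2, needed in (8)
       Q1ty  = Qty[1],                       # Q1'y,  needed in (8)
       Z     = QtX[-1, , drop = FALSE],      # Q2'X2, the matrix Z of Section 2.2
       ynew  = Qty[-1])                      # Q2'y,  the response of problem (9)
}
```

Given $\hat\beta_2(k)$ from the standard-form problem, (8) then reduces to the scalar back-substitution with R applied to Q1ty minus the inner product of Q1tX2 with $\hat\beta_2(k)$.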


2.2. Problems in standard form

To simplify notation in this section, let y and Z denote $Q_2'y$ and $Q_2'X_2$, respectively. Thus, we write (7) as

$$ \min_{\beta_2}\; \left\| \begin{pmatrix} Z \\ \sqrt{k}\, I \end{pmatrix} \beta_2 - \begin{pmatrix} y \\ 0 \end{pmatrix} \right\|^2 , \qquad (9) $$

which is a problem in standard form; note that Z is an $n_1 \times m$ matrix with $n_1 = n - p_1$ and $m = p - p_1$. Since (9) is a problem in standard form, we could now use the algorithms of Eldén (1977, 1984) directly. However, since these were developed for the case $n_1 > m$, this would be inefficient in our case where we assume that $n_1 < m$. To adapt Eldén's approach to the case where $n_1 < m$, note that the solution of (9) is

$$ \hat\beta_2(k) = \left( Z'Z + kI \right)^{-1} Z'y. \qquad (10) $$

Using the Sherman–Morrison–Woodbury formula (Golub and van Loan, 1996, p. 50) we see that

$$ \left( Z'Z + kI \right)^{-1} = \frac{1}{k}\left[ I - Z'\left( ZZ' + kI \right)^{-1} Z \right], $$

whence

$$ \hat\beta_2(k) = \frac{1}{k}\left[ I - Z'\left( ZZ' + kI \right)^{-1} Z \right] Z'y = \frac{1}{k}\left( Z'y - Z'\hat\gamma(k) \right) \qquad (11) $$

with $\hat\gamma(k) = \left( ZZ' + kI \right)^{-1} Zw$ and $w = Z'y$. Thus, once we have calculated $\hat\gamma(k)$, we can calculate $\hat\beta_2(k)$ in $O(n_1 m) = O(np)$ operations. To calculate $\hat\gamma(k)$ we note that $\hat\gamma(k)$ is the solution of the problem

$$ \min_{\gamma}\; \left\| \begin{pmatrix} Z' \\ \sqrt{k}\, I \end{pmatrix} \gamma - \begin{pmatrix} w \\ 0 \end{pmatrix} \right\|^2 . \qquad (12) $$

Thus, $\hat\gamma(k)$ is itself the solution of a ridge regression problem. In the following subsections we show how $\hat\gamma(k)$ can be efficiently calculated using the bidiagonal factorization approach of Eldén (1977).

Remark 2. For (very) small values of k one can expect (potentially disastrous) cancellation errors in (11) during the calculation of $Z'y - Z'\hat\gamma(k)$. These cancellation errors would then be magnified by the factor $k^{-1}$. Thus, if it is necessary to calculate $\hat\beta_2(k)$ for small values of k it might be more prudent to apply Eldén's (1977) algorithm directly to (9). That is, calculate a bidiagonal factorization of Z in $O(n^2 p)$ operations and then $\hat\beta_2(k)$ at the additional cost of $O(np)$ operations for each value of k.

2.2.1. Calculating a bidiagonal factorization of Z'

To solve (12) we first calculate a bidiagonal factorization of $Z'$, i.e. we calculate orthogonal matrices U ($m \times m$) and V ($n_1 \times n_1$) such that

$$ Z' = U \begin{pmatrix} B \\ 0 \end{pmatrix} V' = (U_1\; U_2) \begin{pmatrix} B \\ 0 \end{pmatrix} V', \qquad (13) $$


where $U_1$ and $U_2$ are $m \times n_1$ and $m \times (m - n_1)$ matrices, respectively, and B is an $n_1 \times n_1$ bidiagonal matrix, i.e.

$$ B = \begin{pmatrix}
a_1 & d_1 &        &            &             \\
    & a_2 & d_2    &            &             \\
    &     & \ddots & \ddots     &             \\
    &     &        & a_{n_1-1}  & d_{n_1-1}   \\
    &     &        &            & a_{n_1}
\end{pmatrix}. \qquad (14) $$
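In the R sketches used below it is convenient to carry B around as two vectors, its diagonal a and its superdiagonal d; the small helper here (our name, not part of the paper's code) rebuilds the dense matrix (14) from those two vectors whenever a dense check is wanted:

```r
## assemble the n1 x n1 bidiagonal matrix (14) from its diagonal a
## and its superdiagonal d
make_bidiag <- function(a, d) {
  n1 <- length(a)
  B <- diag(a, nrow = n1)
  if (n1 > 1) B[cbind(1:(n1 - 1), 2:n1)] <- d
  B
}
```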

Again, the matrices U and V are not calculated explicitly but an implicit representation of these matrices may be retained. For example, one possible approach multiplies $Z'$ alternately from the left and from the right with Householder matrices. The Householder vectors that represent these matrices can be stored in the elements of $Z'$ that are transformed to zero. Schematically, this algorithm proceeds as follows for a $7 \times 4$ matrix, where X denotes an entry in the matrix that may be non-zero:

X X X X    X X X X    X X 0 0    X X 0 0    X X 0 0    X X 0 0    X X 0 0
X X X X    0 X X X    0 X X X    0 X X X    0 X X 0    0 X X 0    0 X X 0
X X X X    0 X X X    0 X X X    0 0 X X    0 0 X X    0 0 X X    0 0 X X
X X X X →  0 X X X →  0 X X X →  0 0 X X →  0 0 X X →  0 0 0 X →  0 0 0 X
X X X X    0 X X X    0 X X X    0 0 X X    0 0 X X    0 0 0 X    0 0 0 0
X X X X    0 X X X    0 X X X    0 0 X X    0 0 X X    0 0 0 X    0 0 0 0
X X X X    0 X X X    0 X X X    0 0 X X    0 0 X X    0 0 0 X    0 0 0 0
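The following is a minimal, unoptimized sketch of this Householder bidiagonalisation (all names are ours; the implementation timed in Section 3 is written in FORTRAN and stores the Householder vectors in the zeroed entries rather than applying the reflections explicitly as done here):

```r
## Golub-Kahan bidiagonalisation of an m x n1 matrix A (A plays the role of Z')
## by alternating left and right Householder reflections.
house <- function(x) {                 # Householder vector v mapping x to a multiple of e1
  nx <- sqrt(sum(x^2)); v <- x
  if (nx > 0) v[1] <- v[1] + sign(x[1] + (x[1] == 0)) * nx
  v
}
reflect <- function(v, M) {            # (I - 2 v v' / v'v) %*% M without forming the reflector
  s <- sum(v^2)
  if (s == 0) return(M)
  M - 2 * v %*% (crossprod(v, M) / s)
}
bidiagonalise <- function(A) {
  m <- nrow(A); n1 <- ncol(A)
  for (j in seq_len(n1)) {
    v <- house(A[j:m, j])              # left reflection: zero A[(j+1):m, j]
    A[j:m, ] <- reflect(v, A[j:m, , drop = FALSE])
    if (j <= n1 - 2) {                 # right reflection: zero A[j, (j+2):n1]
      v <- house(A[j, (j + 1):n1])
      A[, (j + 1):n1] <- t(reflect(v, t(A[, (j + 1):n1, drop = FALSE])))
    }
  }
  zapsmall(A)                          # the top n1 x n1 block is now upper bidiagonal
}

## small example reproducing the 7 x 4 pattern sketched above
set.seed(1)
print(bidiagonalise(matrix(rnorm(7 * 4), 7, 4)))
```

Storing each Householder vector in the entries it zeroes out is what keeps the extra memory requirement of the factorization down to a few vectors of length $n_1$, as claimed in the Introduction.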


Further details are given in Golub and van Loan (1996, p. 251ff), who also describe an alternative approach for calculating the bidiagonal factorization (13). In their notation, the algorithm described above uses $4mn_1^2 - 4n_1^3/3$ flops while the other uses $2mn_1^2 + 2n_1^3$ flops. Thus, to minimize the computational effort, one could choose between these two algorithms based on the ratio $m/n_1$. For our count it suffices to note that these factorizations can be calculated in $O(mn_1^2 + n_1^3) = O(mn_1^2) = O(pn^2)$ operations.

2.2.2. Calculating $\hat\gamma(k)$

After calculating the bidiagonal factorization (13) and using the change of variables $\xi = V'\gamma$, we write (12) as

$$ \min_{\xi}\; \left\| \begin{pmatrix} B \\ \sqrt{k}\, I \end{pmatrix} \xi - \begin{pmatrix} U_1'w \\ 0 \end{pmatrix} \right\|^2 , \qquad (15) $$

where $U_1'w$ is calculated during the bidiagonal factorization (13). The solution $\hat\xi(k)$ of (15) is easily calculated. We first find an orthogonal matrix Q such that

$$ Q \begin{pmatrix} B & U_1'w \\ \sqrt{k}\, I & 0 \end{pmatrix} = \begin{pmatrix} B_k & z_1 \\ 0 & z_2 \end{pmatrix}. \qquad (16) $$

Again, we do not calculate the matrix Q explicitly. Rather, it is built up from a series of Givens rotations. Since B is a bidiagonal matrix, calculating the quantities on the right-hand side of (16) requires only $O(n)$ operations. Note also that $B_k$ is again a bidiagonal matrix, i.e.

$$ B_k = \begin{pmatrix}
\alpha_1 & \delta_1 &        &                &                \\
         & \alpha_2 & \delta_2 &              &                \\
         &          & \ddots   & \ddots       &                \\
         &          &          & \alpha_{n_1-1} & \delta_{n_1-1} \\
         &          &          &                & \alpha_{n_1}
\end{pmatrix}. \qquad (17) $$

Thus, we see that the solution of (15) can be calculated as

$$ \hat\xi(k) = \left( B'B + kI \right)^{-1} B'U_1'w = \left( B_k'B_k \right)^{-1} B_k'z_1 = B_k^{-1} z_1, \qquad (18) $$

that is, $\hat\xi(k)$ can easily be calculated by back-substitution in $B_k\hat\xi(k) = z_1$. This calculation involves $O(n)$ operations. If we use the first algorithm described in Golub and van Loan (1996) to calculate the bidiagonal factorization (13), then we obtain V as a product of $n_1 - 2$ Householder matrices and each Householder matrix can be represented by its Householder vector. Thus, when we do the final change of variables to calculate $\hat\gamma(k) = V\hat\xi(k)$ we can do this in $O(n^2)$ operations. Since n < p, this shows that once we have calculated the bidiagonal factorization (13), we can calculate the ridge regression estimator $\hat\beta_2(k)$ for any given k with $O(np)$ operations.
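A compact R sketch of this per-k step is given below (names are ours; the actual code would work on the stored bidiagonal representation). The Givens rotations fold the $\sqrt{k}\,I$ block of (15) into B, producing $B_k$ and $z_1$ as in (16), and back-substitution then yields $\hat\xi(k)$; a dense comparison against $(B'B + kI)^{-1}B'c$ is included as a check.

```r
## One O(n) pass of the reduction (16) followed by the back-substitution (18).
## a, d: diagonal and superdiagonal of B; cvec: the vector U1'w; k: ridge parameter.
elden_solve <- function(a, d, cvec, k) {
  n <- length(a)
  lam <- rep(sqrt(k), n)      # remaining diagonal of the sqrt(k) I block
  w   <- rep(0, n)            # right-hand side attached to that block
  for (i in seq_len(n)) {
    ## Givens rotation of row i of B with row i of sqrt(k) I: annihilate lam[i]
    r <- sqrt(a[i]^2 + lam[i]^2); co <- a[i] / r; si <- lam[i] / r
    a[i] <- r
    tmp     <- co * cvec[i] + si * w[i]
    w[i]    <- -si * cvec[i] + co * w[i]
    cvec[i] <- tmp
    if (i < n) {
      fill <- -si * d[i]      # fill-in created in the lower block at column i + 1
      d[i] <- co * d[i]
      ## second rotation (lower rows i and i + 1) removes that fill-in again
      r2 <- sqrt(lam[i + 1]^2 + fill^2); co2 <- lam[i + 1] / r2; si2 <- fill / r2
      lam[i + 1] <- r2
      tmp      <- co2 * w[i + 1] + si2 * w[i]
      w[i]     <- -si2 * w[i + 1] + co2 * w[i]
      w[i + 1] <- tmp
    }
  }
  xi <- numeric(n)            # back-substitution in B_k xi = z1
  xi[n] <- cvec[n] / a[n]
  if (n > 1) for (i in (n - 1):1) xi[i] <- (cvec[i] - d[i] * xi[i + 1]) / a[i]
  list(xi = xi, alpha = a, delta = d, z1 = cvec)
}

## dense check on a small example, using make_bidiag() from Section 2.2.1
set.seed(2); n <- 5; a <- rnorm(n); d <- rnorm(n - 1); cvec <- rnorm(n); k <- 0.3
B <- make_bidiag(a, d)
max(abs(elden_solve(a, d, cvec, k)$xi -
        solve(crossprod(B) + k * diag(n), crossprod(B, cvec))))  # rounding error only
```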


2.2.3. Generalized cross-validation

One of the practical problems in ridge regression is the choice of k. Typically, one chooses k such that it fulfills some optimality criterion, e.g. that it minimizes the generalized cross-validation (GCV) score. For problem (9) the GCV score is defined as

$$ \mathrm{GCV}(k) = \frac{\left\| y - Z\hat\beta_2(k) \right\|^2}{\left\{ \mathrm{tr}\left( I - ZS_k^{-1}Z' \right) \right\}^2}, \qquad (19) $$

where $\hat\beta_2(k)$ is the solution of (9) and

$$ S_k = Z'Z + kI. \qquad (20) $$

The following two subsections show how the two terms in the ratio (19) can be calculated fast and efficiently using the bidiagonal factorization of $Z'$. In particular, we show that each of these terms can be calculated in $O(n)$ operations. Thus, once we have calculated the bidiagonal factorization (13), the GCV score $\mathrm{GCV}(k)$ can be evaluated for any given k in $O(n)$ operations.

2.2.3.1. Calculating $\| y - Z\hat\beta_2(k) \|^2$: Using the bidiagonal factorization of $Z'$, it is easy to show that

$$ \left\| y - Z\hat\beta_2(k) \right\|^2 = \left\| \left[ I - Z\left( Z'Z + kI \right)^{-1} Z' \right] y \right\|^2 = \left\| \left[ I - B'\left( BB' + kI \right)^{-1} B \right] V'y \right\|^2 . $$

This result could be used to calculate the Euclidean norm of the residual vector fast and efficiently by calculating a decomposition of $BB' + kI$ similar to the calculations done for $B'B + kI$ in Section 2.2.2. However, for small values of k, one would expect (potentially disastrous) cancellation errors to occur. Thus, it is preferable to use the following identity, which can be easily proved (e.g. using the SVD of B),

$$ I - B'\left( BB' + kI \right)^{-1} B = k\left( B'B + kI \right)^{-1} , $$

to calculate the Euclidean norm of the residual vector as

$$ \left\| y - Z\hat\beta_2(k) \right\|^2 = \left\| k\left( B'B + kI \right)^{-1} V'y \right\|^2 = k^2 \| t \|^2 , $$

where $t = \left( B'B + kI \right)^{-1} V'y = \left( B'B + kI \right)^{-1} B' \left( B' \right)^{-1} V'y$. Note that we can calculate the vector $\left( B' \right)^{-1} V'y$ during the bidiagonal factorization (13) with a once-off effort of $O(n^2)$ operations. Thus, for any given k, we can calculate t in $O(n)$ operations using the same technique as we used in (18). Calculating the Euclidean norm of $kt$, and hence the first term in the GCV score, requires $O(n)$ operations.
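In the same R sketch notation as above (function names ours), the once-off vector $s = (B')^{-1}V'y$ is obtained by forward substitution with the lower bidiagonal $B'$, and the numerator of (19) then costs $O(n)$ per value of k, because t is just another call to the reduction used for (18):

```r
## forward substitution with B' (lower bidiagonal): solve B' s = v
solve_Bt <- function(a, d, v) {
  n <- length(a); s <- numeric(n)
  s[1] <- v[1] / a[1]
  if (n > 1) for (i in 2:n) s[i] <- (v[i] - d[i - 1] * s[i - 1]) / a[i]
  s
}
## numerator of (19): ||y - Z beta2_hat(k)||^2 = k^2 ||t||^2, t = (B'B + kI)^{-1} V'y
gcv_numerator <- function(a, d, s, k) {
  t_vec <- elden_solve(a, d, s, k)$xi   # same O(n) reduction as for (18), with s in place of U1'w
  k^2 * sum(t_vec^2)
}
```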

2.2.3.2. Calculating $\mathrm{tr}\left( I - ZS_k^{-1}Z' \right)$: It remains to calculate

$$ \mathrm{tr}\left( I - ZS_k^{-1}Z' \right) = (n - p_1) - \mathrm{tr}\left( ZS_k^{-1}Z' \right). \qquad (21) $$

If we set $T_k = ZZ' + kI$, then we can use the well-known relation $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ and the facts that

$$ S_k^{-1} = \frac{1}{k}\left( I - Z'T_k^{-1}Z \right) \qquad\text{and}\qquad T_k^{-1}ZZ' = I - kT_k^{-1} $$

to show that

$$ \begin{aligned}
\mathrm{tr}\left( ZS_k^{-1}Z' \right) &= \frac{1}{k}\,\mathrm{tr}\left( Z\left[ I - Z'T_k^{-1}Z \right] Z' \right) = \frac{1}{k}\left[ \mathrm{tr}\left( ZZ' \right) - \mathrm{tr}\left( ZZ'T_k^{-1}ZZ' \right) \right] \\
&= \frac{1}{k}\left[ \mathrm{tr}\left( ZZ' \right) - \mathrm{tr}\left( T_k^{-1}ZZ'ZZ' \right) \right] = \frac{1}{k}\left[ \mathrm{tr}\left( ZZ' \right) - \mathrm{tr}\left( \left( I - kT_k^{-1} \right) ZZ' \right) \right] \\
&= \frac{1}{k}\left[ \mathrm{tr}\left( ZZ' \right) - \mathrm{tr}\left( ZZ' \right) + k\,\mathrm{tr}\left( T_k^{-1}ZZ' \right) \right] = \mathrm{tr}\left( T_k^{-1}ZZ' \right) \\
&= \mathrm{tr}\left( I - kT_k^{-1} \right) = (n - p_1) - k\,\mathrm{tr}\left( T_k^{-1} \right).
\end{aligned} $$

Thus

$$ \mathrm{tr}\left( I - ZS_k^{-1}Z' \right) = k\,\mathrm{tr}\left( T_k^{-1} \right) $$

and we need a way to calculate the trace of this matrix fast. Using (13) we obtain

$$ \mathrm{tr}\left( T_k^{-1} \right) = \mathrm{tr}\left[ \left( ZZ' + kI \right)^{-1} \right] = \mathrm{tr}\left[ \left( VB'BV' + kI \right)^{-1} \right] = \mathrm{tr}\left[ \left( B'B + kI \right)^{-1} \right]. $$

From (16) it is seen that $B'B + kI = B_k'B_k$ and, letting $b_i$ denote the ith row of $B_k^{-1}$, we obtain

$$ \mathrm{tr}\left( T_k^{-1} \right) = \mathrm{tr}\left[ B_k^{-1}\left( B_k' \right)^{-1} \right] = \sum_{i=1}^{n_1} \| b_i \|^2 . $$

Using the identity $B_k B_k^{-1} = I$ we get, using (17), the relations

$$ \alpha_{n_1} b_{n_1} = e_{n_1}, \qquad \alpha_i b_i = e_i - \delta_i b_{i+1}, \quad i = n_1 - 1, \ldots, 1, $$

where $e_i$ denotes the ith unit vector. Since $B_k$ is bidiagonal it follows that $B_k^{-1}$ is upper triangular and that $b_{i+1}$ (the $(i+1)$st row of $B_k^{-1}$) is orthogonal to $e_i$. Hence, by taking in the above equations the scalar product of the left-hand side with itself and the scalar product of the right-hand side with itself, we obtain

$$ \| b_{n_1} \|^2 = \frac{1}{\alpha_{n_1}^2}, \qquad \| b_i \|^2 = \frac{1 + \delta_i^2 \| b_{i+1} \|^2}{\alpha_i^2}, \quad i = n_1 - 1, \ldots, 1. \qquad (22) $$

Using these recursions we can compute the trace term in (19) in $O(n)$ operations.
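Continuing the R sketch (names ours), the recursions (22) and the pieces above assemble into an $O(n)$ evaluation of (19) for any k; here a and d describe B and s is the precomputed vector $(B')^{-1}V'y$:

```r
## trace of T_k^{-1} = (B_k' B_k)^{-1} via the recursions (22)
trace_Tk_inv <- function(alpha, delta) {
  n1 <- length(alpha)
  nb2 <- 1 / alpha[n1]^2                           # ||b_{n1}||^2
  total <- nb2
  if (n1 > 1) for (i in (n1 - 1):1) {
    nb2 <- (1 + delta[i]^2 * nb2) / alpha[i]^2     # ||b_i||^2
    total <- total + nb2
  }
  total
}
## GCV score (19) for one value of k
gcv_score <- function(a, d, s, k) {
  red <- elden_solve(a, d, s, k)                   # one O(n) pass gives both t and B_k
  num <- k^2 * sum(red$xi^2)                       # numerator, as in gcv_numerator()
  den <- (k * trace_Tk_inv(red$alpha, red$delta))^2  # {tr(I - Z S_k^{-1} Z')}^2
  num / den
}
```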

2.3. Generalized cross-validation for problem (3)

It may be argued that if the ridge parameter k is chosen by GCV, then the GCV score for problem (3) should be minimized and not the GCV score for (9). The GCV score for (3) can be written as

$$ \mathrm{GCV}(k) = \frac{\| (I - H)y \|^2}{\{ \mathrm{tr}(I - H) \}^2}, \qquad (23) $$

where $H = X\left( X'X + kI_* \right)^{-1} X'$ is the hat matrix. Thus the question arises whether this GCV score can be calculated as efficiently as (19). We shall now show that this is indeed possible. Note that, using (5) and (8), the numerator in (23) can be written as

$$ \begin{aligned}
\| (I - H)y \|^2 &= \left\| y - X\hat\beta(k) \right\|^2 = \left\| Q'\left( y - X\hat\beta(k) \right) \right\|^2 \\
&= \left\| \begin{pmatrix} Q_1'y \\ Q_2'y \end{pmatrix} - \begin{pmatrix} R & Q_1'X_2 \\ 0 & Q_2'X_2 \end{pmatrix} \begin{pmatrix} \hat\beta_1(k) \\ \hat\beta_2(k) \end{pmatrix} \right\|^2 \\
&= \left\| Q_1'y - R\hat\beta_1(k) - Q_1'X_2\hat\beta_2(k) \right\|^2 + \left\| Q_2'y - Q_2'X_2\hat\beta_2(k) \right\|^2 \\
&= \left\| Q_2'y - Q_2'X_2\hat\beta_2(k) \right\|^2 .
\end{aligned} $$

The last term is exactly the quantity calculated in Section 2.2.3.1 and can be computed in $O(n)$ operations. Furthermore, using (7) and (8), we see that

$$ \hat\beta(k) = \begin{pmatrix} \hat\beta_1(k) \\ \hat\beta_2(k) \end{pmatrix} = \begin{pmatrix} R^{-1}Q_1'\left( y - X_2\hat\beta_2(k) \right) \\ \left( X_2'Q_2Q_2'X_2 + kI \right)^{-1} X_2'Q_2Q_2'y \end{pmatrix} = \begin{pmatrix} R^{-1}Q_1'\left[ I - X_2\left( X_2'Q_2Q_2'X_2 + kI \right)^{-1} X_2'Q_2Q_2' \right] \\ \left( X_2'Q_2Q_2'X_2 + kI \right)^{-1} X_2'Q_2Q_2' \end{pmatrix} y . $$

Using this result, we find the following decomposition of H:

$$ \hat y = X\hat\beta(k) = (X_1\; X_2) \begin{pmatrix} \hat\beta_1(k) \\ \hat\beta_2(k) \end{pmatrix} = X_1\hat\beta_1(k) + X_2\hat\beta_2(k) = (H_1 + H_2 + H_3)y = Hy, $$


where

$$ \begin{aligned}
H_1 &= X_1R^{-1}Q_1' = Q_1Q_1', \\
H_2 &= -X_1R^{-1}Q_1'X_2\left( X_2'Q_2Q_2'X_2 + kI \right)^{-1} X_2'Q_2Q_2' = -Q_1Q_1'X_2\left( X_2'Q_2Q_2'X_2 + kI \right)^{-1} X_2'Q_2Q_2', \\
H_3 &= X_2\left( X_2'Q_2Q_2'X_2 + kI \right)^{-1} X_2'Q_2Q_2'.
\end{aligned} $$

Thus, since $\mathrm{tr}(H) = \mathrm{tr}(H_1 + H_2 + H_3) = \mathrm{tr}(H_1) + \mathrm{tr}(H_2) + \mathrm{tr}(H_3)$, we can calculate the trace term in (23) as follows (using the fact that $Q = (Q_1\; Q_2)$ is an orthogonal matrix, i.e. $Q_1'Q_2 = 0$):

$$ \mathrm{tr}(H_1) = \mathrm{tr}\left( Q_1Q_1' \right) = \mathrm{tr}\left( Q_1'Q_1 \right) = p_1, \qquad \mathrm{tr}(H_2) = 0, \qquad \mathrm{tr}(H_3) = \mathrm{tr}\left[ Q_2'X_2\left( X_2'Q_2Q_2'X_2 + kI \right)^{-1} X_2'Q_2 \right]. $$

Note that, in the notation used in Section 2.2, the trace of $H_3$ is equal to $\mathrm{tr}\left( ZS_k^{-1}Z' \right)$. Thus, we see that

$$ \mathrm{tr}(I - H) = n - p_1 - \mathrm{tr}\left( ZS_k^{-1}Z' \right), $$

whose right-hand side is identical to the right-hand side of (21). Thus, perhaps surprisingly, the GCV scores (19) and (23) are identical and can be calculated in $O(n)$ operations once we have calculated the bidiagonal factorization.

2.4. Case diagnostics

After fitting a (ridge) regression model to the data, the researcher usually wants to calculate some case diagnostics; see Section 5 of Hawkins and Yin (2002). Unfortunately, it seems that some of these diagnostics are difficult to compute using the bidiagonal factorization approach described above. This holds in particular for those diagnostics that involve deleting one or more cases from the data: updating the bidiagonal factorization if a case is deleted is not trivial. Thus, in this section we discuss how these diagnostics can be calculated using an alternative approach based on a QR factorization of the design matrix.

2.4.1. Jackknifing and cross-validation

A diagnostic that is often used is leave-one-out cross-validation (jackknifing) or leave-several-out cross-validation (m-fold cross-validation). This involves refitting the model with one or more cases removed, and comparing the predictions of this refitted model on the cases that have been removed with the observed responses. Here we describe an algorithm to calculate jackknifed parameter estimates, but this algorithm can be easily extended to the case in which several observations are removed.


The jackknifed parameter estimates are defined as

$$ \hat\beta_{(i)} = \left( X_{(i)}'X_{(i)} + kI_* \right)^{-1} X_{(i)}'y_{(i)}, \qquad i = 1, \ldots, n, \qquad (24) $$

where $X_{(i)}$ and $y_{(i)}$ are the design matrix and the vector of responses with the ith case (row) deleted.

Hawkins and Yin (2002) note that for ridge regression it is conventional to "auto-scale" (studentize) the data to zero mean and unit standard deviation. Due to this scaling it would usually not be necessary to include an intercept term in the design matrix. However, the model that one is fitting implicitly has an intercept term, with the corresponding estimate being zero. As noted by Hawkins and Yin (2002), during cross-validation the removal of one (or more) cases affects the intercept and thus an intercept term should be in the design matrix. Hence, we assume that in (24) the first column of X corresponds to an intercept term and that the matrix $I_*$ is an identity matrix with its top left element replaced by zero.

To develop a fast algorithm that evaluates (24) we first return to the complete data situation. Note that if $I_*$ is an identity matrix with its top left element replaced by zero, then, using the Sherman–Morrison–Woodbury formula, we see that

$$ \left( X'X + kI_* \right)^{-1} = \left( X'X + kI - k\,e_1e_1' \right)^{-1} = \left( X'X + kI \right)^{-1} + \frac{k}{1 - k\,e_1'\left( X'X + kI \right)^{-1}e_1}\, ww' = \left( X'X + kI \right)^{-1} + \frac{k}{1 - k\,e_1'w}\, ww', \qquad (25) $$

where $e_1$ is the first unit vector and $w = \left( X'X + kI \right)^{-1} e_1$. Thus, if we have a fast method for multiplying the $p \times p$ matrix $\left( X'X + kI \right)^{-1}$ with a p-dimensional vector, then we can apply this method to $X'y$ and also use it to evaluate w. Using those two results and (25), we can evaluate (1) in an additional $O(p)$ operations. Using the Sherman–Morrison–Woodbury formula once more, we see that

$$ \left( X'X + kI \right)^{-1} = \frac{1}{k}\left[ I - X'\left( XX' + kI \right)^{-1} X \right] = \frac{1}{k}\left[ I - X'\left( R_k'R_k \right)^{-1} X \right] = \frac{1}{k}\left[ I - X'R_k^{-1}\left( R_k' \right)^{-1} X \right], \qquad (26) $$

where $R_k$ is an upper triangular matrix such that $R_k'R_k = XX' + kI$. If $R_k$ is available, then we can, for any p-dimensional vector z, use (25) and (26) to calculate $\left( X'X + kI_* \right)^{-1} z$ in $O(n^2 + np) = O(np)$ operations.

It remains to show how to calculate $R_k$. Two strategies are possible. If we want to evaluate (1) or (24) only for one k, e.g. the one selected by GCV, then we could find $R_k$ as the R-factor of the QR factorization

$$ \begin{pmatrix} X' \\ \sqrt{k}\, I \end{pmatrix} = Q_k \begin{pmatrix} R_k \\ 0 \end{pmatrix}, $$


which can be calculated in $O\left( n^2(p + n) \right) = O\left( n^2p + n^3 \right)$ operations. Alternatively, if the calculations should be done for several values of k, then it would be more efficient to first calculate the R-factor R in the QR factorization $X' = Q\binom{R}{0}$. This matrix R can be calculated in $O(n^2 p)$ operations, and for any given k we can update R to $R_k$ using Givens rotations in $O(n^3)$ operations (see, for example, Golub and van Loan, 1996; Stewart, 1998).

Finally, to calculate (24), note that removal of a case from the data corresponds to the deletion of a row in the design matrix X and hence the deletion of a column in $X'$. But the algorithm for evaluating (1) or (24) outlined above is based on a QR factorization of $X'$. It is well known how to update a QR factorization if a column of the factorized matrix is removed (see, for example, Golub and van Loan, 1996; Stewart, 1998). Thus, we can use such an update routine to calculate $R_{k,(i)}$ from $R_k$ and then use (25) and (26), with X replaced by $X_{(i)}$, to evaluate (24). The effort to calculate $R_{k,(i)}$ from $R_k$ depends on the column to be removed, but the worst case requires $O(n^2)$ operations. Thus, once we have calculated $R_k$, we can calculate a $\hat\beta_{(i)}$, on average, in $O(n^2 + np)$ operations. Using this approach we can calculate all $\hat\beta_{(i)}$ in $O(n^3 + n^2p) = O(n^2p)$ operations; note that this count includes the required QR factorization. This approach still yields a slight improvement over the algorithm by Hawkins and Yin (2002), which requires $O(np^2)$ operations to calculate all $\hat\beta_{(i)}$.

2.4.2. Leverages $h_i$

Other quantities that are needed for some diagnostics include the leverages $h_i$ of each case i. These leverages are defined as the ith diagonal elements of the hat matrix $X\left( X'X + kI_* \right)^{-1} X'$. To calculate these quantities, we suggest to first calculate the matrix

$$ W = \left( X'X + kI_* \right)^{-1} X' $$

by iteratively multiplying the n columns of the matrix $X'$ with $\left( X'X + kI_* \right)^{-1}$, using (25) and (26). This can be done in $O(n^2 p)$ operations. Given X and W, we can then calculate the n diagonal elements of XW in $O(np)$ operations.
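As a sketch of how (25) and (26) are used in practice (the helper names and the toy data are ours, not the paper's code), the following R fragment computes the full-data estimate (1) and the leverages $h_i$ for a single value of k from the $n \times n$ factor $R_k$ alone:

```r
## apply (X'X + kI)^{-1} to a p-vector z via (26), using only Rk with Rk'Rk = X X' + kI;
## an O(np + n^2) operation
apply_inv_ridge <- function(X, Rk, k, z) {
  u <- backsolve(Rk, forwardsolve(t(Rk), X %*% z))
  (z - crossprod(X, u)) / k
}
## Sherman-Morrison correction (25) giving (X'X + k I*)^{-1} z, I* = I with the (1,1) entry zeroed
apply_inv_ridge_star <- function(X, Rk, k, z) {
  w <- apply_inv_ridge(X, Rk, k, c(1, rep(0, ncol(X) - 1)))   # w = (X'X + kI)^{-1} e1
  v <- apply_inv_ridge(X, Rk, k, z)
  v + (k / (1 - k * w[1])) * w * sum(w * z)
}

## toy example: ridge estimate (1) and leverages h_i for one value of k
set.seed(3); n <- 20; p <- 200; k <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n)); y <- rnorm(n)
Rk <- chol(tcrossprod(X) + k * diag(n))                       # Rk'Rk = X X' + kI
beta_hat <- apply_inv_ridge_star(X, Rk, k, crossprod(X, y))   # equation (1)
W <- apply(X, 1, function(xi) apply_inv_ridge_star(X, Rk, k, xi))  # W = (X'X + kI*)^{-1} X'
h <- colSums(t(X) * W)                                        # leverages: diagonal of XW
```

The jackknifed estimates (24) would additionally require downdating $R_k$ to $R_{k,(i)}$ when column i of $X'$ is removed; that QR downdating step is not sketched here.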

3. Timing experiments

In this section, we report on a timing experiment that was designed to verify the performance of the proposed algorithms under various settings. The algorithms were implemented using the computational environment R (R Development Core Team, 2004). Some time-critical parts of the algorithms were written in FORTRAN and the compiled FORTRAN code was then interfaced to R.

The timing experiment was set up as follows. For n = 20, 40, 50, 75, 80, 100, 120, 125, and 150 (denoted by 1, ..., 9 in the plots of time against p in the right panels of Figs. 1–3) and p = 200, 400, 500, 750, 1000, 1500, 2000, 2500, and 3000 (denoted by 1, ..., 9 in the plots of time against n in the left panels of Figs. 1–3) we generated five random data sets. Each data set was used to time various aspects of the proposed algorithms.

Fig. 1. Time required to calculate a bidiagonal factorization of an n × p matrix for various values of n and p. The left panel plots the time (in s) required to factorize a matrix against n and on the right the time is plotted against p. The plots are on a log–log scale.

Fig. 2. Time required to calculate $\hat\beta$ using the bidiagonal factorization approach for various values of n and p. The left panel plots the time (in s) required to calculate $\hat\beta$ against n and on the right the time is plotted against p. The plots are on a log–log scale.

If the R function foo implemented an aspect that we wanted to time, the timing was done by using the following commands:

    gc(); tm <- system.time(for (ll in 1:lmax) foo(arguments))

The initial call to gc served to trigger a garbage collection by R; this was done in the hope that no garbage collection, which could potentially distort the timing results, would be triggered during the call to system.time. The times returned by system.time were corrected by the factor lmax to obtain the average time that a call to the function foo requires.

Fig. 3. Time required to calculate all jackknifed estimates $\hat\beta_{(i)}$ using the QR factorization approach for various values of n and p. The left panel plots the time (in s) required to calculate all $\hat\beta_{(i)}$ against n and on the right the time is plotted against p. The plots are on a log–log scale.

The timing experiment was done on 2.4 GHz Pentium IV computers that run under Debian GNU/Linux and have 512 MB RAM and about 1 GB swap space. Graphical analysis of the average times from the five data sets via boxplots showed that the recorded times did, in general, not vary much from one data set to another. A noticeable variation was observable for the function that called the FORTRAN code first. Presumably, the initial call to FORTRAN code involves some overhead that is not required on subsequent calls but has an influence on the timing of the function that makes the first call to FORTRAN. Occasionally, for some values of n and p, other routines also needed much longer for one data set than for the others. Presumably, on those occasions a garbage collection was triggered during the timing of the function. For these reasons, we report here the median time recorded on the five data sets; all times are reported in seconds.

Fig. 1 shows the time needed to calculate the bidiagonal factorization of a matrix. The left panel shows, on a log–log scale, how the time needed to calculate the factorization depends on n, while the right panel shows the dependence on p. Fitting a linear model with the logarithm of time as response, and log(n) and log(p) as regressors, shows that the bidiagonal factorization is an $O(n^2 p)$ algorithm.

Fig. 2 shows the time needed to calculate $\hat\beta$ once the bidiagonal factorization of a matrix is obtained. The left panel shows, on a log–log scale, how the time needed to evaluate (1) using our algorithm depends on n, while the right panel shows the dependence on p. Fitting a linear model with the logarithm of time as response, and log(n) and log(p) as regressors, shows that calculating $\hat\beta$ is an $O(np)$ algorithm.

Fig. 3 shows the time needed to calculate all jackknifed estimates (24) using a QR factorization of the design matrix. The left panel shows, on a log–log scale, how the time needed to calculate all jackknifed estimates depends on n, while the right panel shows the dependence on p. Fitting a linear model with the logarithm of time as response, and log(n) and log(p) as regressors, shows that one can calculate all jackknifed estimates with $O(n^2 p)$ effort.


We also timed how much effort is needed to evaluate a GCV score. A figure, similar to Figs. 1–3, for these times clearly showed that the evaluation of a GCV score does not depend on p, but increases with n. However, fitting a linear model with the logarithm of time as response, and log(n) and log(p) as regressors, indicates that calculating the GCV score needs an effort of order $O(n^{\alpha})$ with $\alpha > 1$, which is not possible. The only explanation that we can offer is that calculating the GCV score using our algorithm needs $O(n)$ operations and, with the values of n considered in this timing study and our method of measuring runtimes of functions, we were essentially timing how long R needs to run through a loop and not how much time the routine to calculate the GCV score needs. To obtain meaningful measurements of the time needed by the function that evaluates the GCV score, we would have had to put the for loop into the FORTRAN code. For this reason, we decided not to report the results for the timing of the function that evaluates the GCV score here.

4. Conclusion

We propose new algorithms to calculate ridge regression estimates and various diagnostics in the situation where one has more variables than observations. These algorithms are based on calculating a factorization of the design matrix, which makes them very inexpensive in terms of memory usage. Most of the information needed about the factorization can be stored in the memory locations that originally hold the design matrix. Only a little extra memory (of order $O(n)$) is needed to hold some auxiliary quantities.

Our timing experiment shows that for values of n and p that are typically encountered in chemometrics or with spectral data, these factorizations can be calculated very fast. Using these factorizations, subsequent calculations can also be implemented efficiently, so that the whole analysis could be done interactively. Thus our algorithms provide a substantial improvement, both in terms of memory usage and runtime, over the algorithm described in Hawkins and Yin (2002).

4.1. Final comments

During the computational work on this paper, we noticed that the formula for the residual sum of squares

$$ \mathrm{RSS}(k) = 1/C_{22} - k\left( 1 + \| \hat\beta(k) \|^2 \right) $$

given in Hawkins and Yin (2002, p. 256) should read

$$ \mathrm{RSS}(k) = 1/C_{22} - k\left( 1 + \hat\beta(k)'I_*\hat\beta(k) \right). $$

Also, the comment on p. 257 that during the rank-1 down-date to obtain $B_n$ from $A_n^{-1}$ "only the first row and column are altered" seems to be incorrect. This would only be true if all variables are standardized to have mean 0, if necessary on the weighted scale, prior to starting the calculations and if all observations are used; in particular, during jackknifing or cross-validation, when one or more observations are removed, this comment would not be correct.


Acknowledgements

The author would like to thank A/Prof Les Jennings for helpful discussions, and the Editor and a reviewer for helpful suggestions that greatly improved this paper. Timing experiments were performed on a cluster of Debian GNU/Linux machines funded by ARC Large Grant A69941083. The author is grateful to Prof A. Baddeley for access to this cluster.

References

Björck, Å., 1996. Numerical Methods for Least Squares Problems. SIAM, Philadelphia.
Eldén, L., 1977. Algorithms for the regularization of ill-conditioned least squares problems. BIT 17, 134–145.
Eldén, L., 1984. A note on the computation of the generalized cross-validation function for ill-conditioned least squares problems. BIT 24, 467–472.
Golub, G.H., van Loan, C.F., 1996. Matrix Computations, third ed. Johns Hopkins University Press, Baltimore.
Hawkins, D.M., Yin, X., 2002. A faster algorithm for ridge regression of reduced rank data. Comput. Statist. Data Anal. 40 (2), 253–262.
Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
Lawson, C.L., Hanson, R.J., 1995. Solving Least Squares Problems. Classics in Applied Mathematics, vol. 15. SIAM, Philadelphia (originally published by Prentice-Hall, 1974).
R Development Core Team, 2004. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-00-3. URL http://www.R-project.org
Stewart, G.W., 1998. Matrix Algorithms, vol. 1: Basic Decompositions. SIAM, Philadelphia.
Wood, S.N., 2000. Modelling and smoothing parameter estimation with multiple quadratic penalties. J. Roy. Statist. Soc. Ser. B 62 (2), 413–428.