Generalised matrix inversion and rank computation by successive matrix powering


Parallel Computing 20 (1994) 297-311

Lujuan Chen, E.V. Krishnamurthy *, Iain Macleod
Computer Sciences Laboratory, Australian National University, Canberra ACT 0200, Australia

(Received 15 January 1993; revised 8 October 1993, 29 November 1993)

Abstract

Computation of the generalised inverse A+ and rank of an arbitrary (including singular and rectangular) matrix A has many applications. This paper derives an iterative scheme to approximate the generalised inverse which can be expressed in the form of successive squaring of a composite matrix T. Given an m by n matrix A with n ≥ m, we show that the generalised inverse of A can be computed in parallel time ranging from O(log n) to O(log² n), similar to previous methods. The rank of matrix A is obtained along with the generalised inverse. The successive matrix squaring algorithm is generalised to higher-order schemes, where the composite matrix is repeatedly raised to an integer power l ≥ 2. This form of expression leads to a simplified notation compared with that of earlier methods, and helps to clarify the relationship between l, the order of the iterative scheme, and K, the number of iterations. In particular, the accuracy achieved in approximating A+ is a function only of the magnitude of l^K and does not depend on the particular values chosen for l and K; we argue that there is no obvious advantage in choosing l other than 2. Our derived error bound for the approximation to A+ is tighter than that previously established. The same bound applies to the rank. Numerical experiments with different test matrices (square, rectangular, complex, singular, etc.) illustrate the method. They further demonstrate that our tighter error bound provides a useful guide to the number of iterations required. In the examples given, the specified accuracy was achieved after the calculated number of iterations, but no earlier. Implementation studies on a general-purpose parallel machine (CM-5) demonstrated a smaller than expected penalty for direct squaring of matrix T in comparison with multiplication of its component block matrices. For special-purpose VLSI architectures, the simple structure of the matrix squaring algorithm leads to a straightforward parallel implementation with low communication overheads.

* Corresponding author. Email: [email protected]



Key words: Complexity; Data-parallel computation; Generalised inverse; Moore-Penrose inverse; Parallelism; Rank computation; Repeated squaring; Systolic architectures

1. Introduction

The generalised inverse and rank of an arbitrary matrix (including singular and rectangular matrices) have many applications in statistics, prediction theory, control system analysis, curve fitting, etc. Sequential algorithms for computing the generalised inverse have been known since 1962 [1-4], but implementation of a parallel algorithm has only recently been reported, when Krishnamurthy and Ziavras [5] described their adaptation of the Ben-Israel and Greville (B-IG) algorithm [2] for the CM-2 Connection Machine.

In contrast, parallel computation of non-singular matrix inverses has been studied for nearly twenty years. It was originally thought that parallel computation of a matrix inverse required time linear in the order of the input matrix. However, in 1976 Csanky [7] proved that given a non-singular matrix A of order n, A^{-1} could be computed in parallel time O(log² n) if sufficiently many processors were available. Attempts to prove that O(log² n) was a lower time bound for parallel matrix inversion continued until the discovery [8,9] of algorithms with a time bound of O(log^α n), where 1 < α < 2. Despite years of effort, Csanky's result for general non-singular matrices has not been improved substantially, and it remains an open question whether an O(log n) deterministic algorithm exists. Improved values for α and many interesting complexity results have arisen from this quest, through narrowing the class of matrices to be inverted and broadening the range of resources algorithms can employ. For example, in [10-12] a bound of O(log n) was obtained in this manner.

In recent papers, Codenotti [13,14] presents an iterative algorithm for the solution of linear systems and the inversion of matrices based on a repeated matrix squaring scheme. For very well conditioned matrices, i.e. κ(A*A) = λ1(A*A)/λr(A*A) = 1, where the non-zero eigenvalues λi(A*A) of A*A are ordered such that

λ1(A*A) ≥ λ2(A*A) ≥ ... ≥ λr(A*A) > 0,

this algorithm achieves a time bound O(log n). To date, all such repeated squaring methods have required that ρ(P) < 1 for convergence, where P = I − βA^T A is the iterative correction matrix in schemes such as X_{K+1} = P X_K + Q [13,14,7]. A further disadvantage of previously published algorithms is that they have not specified ranges for β within which their convergence is assured.

In the following sections we present a deterministic iterative algorithm for computing both the generalised inverse A+ and the rank of a matrix A ∈ C^{m×n}, the set of m by n complex matrices. This algorithm (called successive matrix powering) is obtained by generalising a repeated squaring algorithm to higher powers. The iterative steps in the computation are performed in sequence; each step comprises log₂ l parallel matrix multiplications, where l is the power to which the matrix is raised. Having derived A+ we can then proceed to compute either the minimum-norm solution or the least-squares solution of the linear system Ax = b (depending on the consistency or otherwise of this system). For very well conditioned matrices (as defined above), this algorithm enables calculation of both the generalised inverse and the rank of A in time O(log(m + n)) with relative error bounded by any ε ∈ (0, 1), independent of matrix size. Compared with the algorithm in [13], our iterative scheme X_{K+1} = P X_K + Q (derived from the properties which the generalised inverse satisfies) permits ρ(P) ≤ 1 rather than ρ(P) < 1. This means that A can have reduced rank.

2. Definition of the g-inverse

In 1955 Penrose [15] showed that for an arbitrary finite matrix A ∈ C^{m×n} there is a unique matrix X satisfying the set of equations:

AXA = A,    XAX = X,    (AX)* = AX,    (XA)* = XA.                  (1)

X is commonly known as the Moore-Penrose or generalised inverse (g-inverse in abbreviated form) and is often denoted by A+, which reduces to the conventional matrix inverse A^{-1} when A is square and non-singular.
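As an illustrative aside (not part of the original paper), the four conditions in (1) are easy to check numerically. The Python/NumPy sketch below verifies them for the pseudo-inverse of a rank-deficient rectangular example; the tolerance is an arbitrary choice.

import numpy as np

def satisfies_penrose(A, X, tol=1e-10):
    """Check the four Penrose conditions (1) for a candidate g-inverse X of A."""
    c1 = np.allclose(A @ X @ A, A, atol=tol)              # AXA = A
    c2 = np.allclose(X @ A @ X, X, atol=tol)              # XAX = X
    c3 = np.allclose((A @ X).conj().T, A @ X, atol=tol)   # (AX)* = AX
    c4 = np.allclose((X @ A).conj().T, X @ A, atol=tol)   # (XA)* = XA
    return c1 and c2 and c3 and c4

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])                 # rank 1, so A has no conventional inverse
print(satisfies_penrose(A, np.linalg.pinv(A)))   # True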

3. The successive matrix squaring algorithm

In this section we show that the unique g-inverse of matrix A is the solution of a simple matrix equation X = PX + Q. According to this equation we construct an iterative scheme to approximate A+. Then we speed up the method by successive matrix squaring (SMS) and show that it is equivalent in a certain sense to Ben-Israel and Greville's. Finally, we generalise our method to a successive matrix lth-order scheme for computing A+.

We start from (1) to construct a suitable matrix equation. Since AXA = A and (AX)* = AX, we have A* = (AXA)* = A*(AX)* = A*AX. Therefore

X = X − β(A*AX − A*) = (I − βA*A)X + βA*,

where β is a relaxation parameter. Letting P = (I − βA*A) and Q = βA*, the solution of the matrix equation X = PX + Q can be approximated by the following iterative scheme:

X_{K+1} = P X_K + Q,    X_1 = Q.                                    (2)

We can compute (2) in parallel given the following observations. Consider the (m + n) by (m + n) matrix T, defined as

T = [ P  Q ]
    [ 0  I ],

and notice that the matrix T^K, K = 1, 2, ..., is given by

T^K = [ P^K   Σ_{i=0}^{K−1} P^i Q ]
      [ 0               I         ].

By direct inspection, it is not difficult to see that Σ_{i=0}^{K−1} P^i Q is the Kth approximant to A+, as given in (2), that is, X_K = Σ_{i=0}^{K−1} P^i Q. It immediately follows that the calculation of X_K can be reduced to the computation of T^K. In turn, T^K can be computed by 'repeated squaring', that is

T_0 = T,
T_{i+1} = T_i²,    i = 0, 1, ..., J, where J is such that 2^J ≥ K.    (3)

Note that K steps of (3) are equivalent to 2^K steps of (2).
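The scheme just described translates directly into code. The following NumPy sketch is ours, for illustration only (the implementations reported later in the paper used Mathematica and CM Fortran): it builds T, squares it K times, and reads off the approximation to A+ and the rank; β is taken as the trace-based value β = 2/tr(A*A) discussed later in this section.

import numpy as np

def sms_pinv(A, K):
    """Successive matrix squaring (3): approximate A+ and rank(A) after K squarings."""
    m, n = A.shape
    As = A.conj().T                        # A*
    beta = 2.0 / np.trace(As @ A).real     # sub-optimal but safe relaxation parameter
    P = np.eye(n) - beta * (As @ A)        # P = I - beta A*A   (n x n)
    Q = beta * As                          # Q = beta A*        (n x m)

    T = np.block([[P, Q],
                  [np.zeros((m, n)), np.eye(m)]])   # T = [[P, Q], [0, I]]
    for _ in range(K):
        T = T @ T                          # after K squarings, T = T0^(2^K)

    X = T[:n, n:]                          # upper-right block: X_{2^K}, approximates A+
    rank = n - np.trace(T[:n, :n]).real    # r_{2^K} = n - tr(P^(2^K))
    return X, rank

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
X, r = sms_pinv(A, K=40)
print(np.allclose(X, np.linalg.pinv(A)), int(round(r)))   # True 2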

The following theorems prove that the iterative solution of (2) converges to A+ using scheme (3). First we prove:

Theorem 1. If T_0 and T_K are as given in scheme (3), we have

T_K = T^{2^K} = [ P^{2^K}   Σ_{i=0}^{2^K − 1} P^i Q ]
                [ 0                     I            ].

Proof. If K = 0, T_0 = [ P  Q ; 0  I ] = T, so the conclusion is true. Assume the conclusion is true for K − 1, that is

T_{K−1} = [ P^{2^{K−1}}   Σ_{i=0}^{2^{K−1} − 1} P^i Q ]
          [ 0                          I               ].

Now

T_K = T_{K−1} · T_{K−1}
    = [ P^{2^{K−1}} P^{2^{K−1}}   P^{2^{K−1}} Σ_{i=0}^{2^{K−1} − 1} P^i Q + Σ_{i=0}^{2^{K−1} − 1} P^i Q ]
      [ 0                                                   I                                            ]
    = [ P^{2^K}   Σ_{i=2^{K−1}}^{2^K − 1} P^i Q + Σ_{i=0}^{2^{K−1} − 1} P^i Q ]
      [ 0                                   I                                  ]
    = [ P^{2^K}   Σ_{i=0}^{2^K − 1} P^i Q ]
      [ 0                     I            ].

The theorem is thus proved by mathematical induction.  □

The next theorem gives the necessary and sufficient conditions for the SMS algorithm to converge to A+ in terms of the Euclidean or L2 norm, denoted here as ‖·‖.

Theorem 2. The sequences of approximations

X_{2^K} = Σ_{i=0}^{2^K − 1} (I − βA*A)^i · βA*

and

r_{2^K} = n − tr(I − βA*A)^{2^K} = n − tr(P^{2^K}),

determined by the SMS algorithm (3), converge to the g-inverse A+ in the matrix norm and to the rank r of A respectively if, and only if, β is a fixed real number such that 0 < β < 2/λ1. The error estimates are

‖A+ − X_{2^K}‖ / ‖A+‖ ≤ (1 − βλr)^{2^K},
|r − r_{2^K}| / r ≤ (1/r) Σ_{i=1}^{r} (1 − βλi)^{2^K}.              (4)

Furthermore, the optimum parameter is β_op = 2/(λ1 + λr) [16,17]. In this case the error estimates are given by

‖A+ − X_{2^K}‖ / ‖A+‖ ≤ ((λ1 − λr)/(λ1 + λr))^{2^K},
|r − r_{2^K}| / r ≤ ((λ1 − λr)/(λ1 + λr))^{2^K},      K = 0, 1, 2, ...,      (5)

where λ1 and λr are the largest and smallest non-zero eigenvalues of A*A respectively. Note that this is a tighter bound than that previously established by Petryshyn [16]. If λ1 = λr we can check directly that A+ = A*/λ1.

We can use a method similar to that of Petryshyn to prove the necessary and sufficient convergence condition of the SMS method. The proof that r_{2^K} converges to r can be found in [17]. We now give the proof for the error estimates (4). We know that A+AA+ = A+ and A+A X_{2^K} = X_{2^K}, since A+A = P_{R(A*)}, the orthogonal projector onto R(A*), and X_{2^K} ∈ R(A*). Therefore

‖A+ − X_{2^K}‖ = ‖A+AA+ − A+A X_{2^K}‖ = ‖A+(AA+ − A X_{2^K})‖ ≤ ‖A+‖ · ‖AA+ − A X_{2^K}‖,

and hence

‖A+ − X_{2^K}‖ / ‖A+‖ ≤ ‖AA+ − A X_{2^K}‖ = ‖AA+ − A X_0‖^{2^K} = (1 − βλr)^{2^K}.

Also,

tr(I − βA*A)^{2^K} = n − r + Σ_{i=1}^{r} (1 − βλi)^{2^K}.

If we use the optimum parameter β_op = 2/(λ1 + λr), we get

‖A+ − X_{2^K}‖ / ‖A+‖ ≤ ((λ1 − λr)/(λ1 + λr))^{2^K}.

There exists an i_0 (1 ≤ i_0 ≤ r) such that 1 − 2λi/(λr + λ1) ≥ 0 if i ≥ i_0 and 1 − 2λi/(λr + λ1) < 0 if i < i_0. Thus

0 ≤ 1 − 2λi/(λ1 + λr) ≤ 1 − 2λr/(λ1 + λr) = (λ1 − λr)/(λ1 + λr)     when i ≥ i_0,
0 > 1 − 2λi/(λ1 + λr) ≥ 1 − 2λ1/(λ1 + λr) = (λr − λ1)/(λr + λ1)     when i < i_0.

Furthermore

0 ≤ (1 − 2λi/(λ1 + λr))^{2^K} ≤ ((λ1 − λr)/(λ1 + λr))^{2^K}     when i ≥ i_0,
0 ≤ (1 − 2λi/(λ1 + λr))^{2^K} ≤ ((λ1 − λr)/(λ1 + λr))^{2^K}     when i < i_0.

Thus, independent of whether i ≥ i_0 or i < i_0, we have

(1 − 2λi/(λr + λ1))^{2^K} ≤ ((λ1 − λr)/(λ1 + λr))^{2^K}     for 1 ≤ i ≤ r.

We now consider the relative error. Since tr(I − βA*A)^{2^K} = n − r + Σ_{i=1}^{r} (1 − βλi)^{2^K}, r_{2^K} = n − tr(I − βA*A)^{2^K} and β_op = 2/(λ1 + λr), we have the following error estimate:

|r − r_{2^K}| / r = (1/r) |Σ_{i=1}^{r} (1 − β_op λi)^{2^K}| ≤ (1/r) Σ_{i=1}^{r} ((λ1 − λr)/(λ1 + λr))^{2^K} = ((λ1 − λr)/(λ1 + λr))^{2^K}.

The estimation of β based on eigenvalues becomes difficult for larger matrices. In this case we can estimate β using β_tr = 2/tr(A*A); this value will usually be sub-optimal, but it guarantees the convergence of the algorithm since 0 < 2/tr(A*A) ≤ 2/λ1(A*A).

Remark. For a rank 1 matrix A*A with λ1 = λ and λ2 = λ3 = ... = λn = 0, we have tr(A*A) = λ. Here the choice β_tr = β_op = 2/λ is at the disallowed limit. To avoid this difficulty, we may choose a smaller value for β within the convergence limit.

The original B-IG algorithm is given by X_{r+1} = X_r(2I − A X_r), starting with X_0 = βA*, where 0 < β < 2/λ1. It can be written in the equivalent form

X_{r+1} = (I + P_r) X_r,
P_{r+1} = P_r²,                                                      (6)

with X_0 = βA*, P_0 = I − βA*A. Again, using mathematical induction we can prove that

X_r = Σ_{i=0}^{2^r − 1} (I − βA*A)^i · βA*,

which shows the equivalence of SMS to the B-IG method. Thus, the SMS method has the same time bound.
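For comparison, the B-IG iteration itself is equally short in code. The sketch below (an illustration under the same assumptions as the earlier one, not the authors' implementation) applies X_{r+1} = X_r(2I − A X_r) from X_0 = βA*, with the trace-based choice of β.

import numpy as np

def big_pinv(A, iterations):
    """Ben-Israel and Greville iteration: X_{r+1} = X_r (2I - A X_r), X_0 = beta A*."""
    m, _ = A.shape
    beta = 2.0 / np.trace(A.conj().T @ A).real   # safe relaxation parameter
    X = beta * A.conj().T
    I = np.eye(m)
    for _ in range(iterations):
        X = X @ (2.0 * I - A @ X)                # error contracts quadratically, as in SMS
    return X

A = np.array([[2.0, 1.0],
              [4.0, 2.0],
              [0.0, 1.0]])                       # full-column-rank rectangular example
print(np.allclose(big_pinv(A, 20), np.linalg.pinv(A)))   # True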

Now we generalise the SMS scheme to a method which successively raises the matrix to an integer power l (≥ 2) in forming estimates of A+, using the scheme

T_0 = T,
T_{i+1} = T_i^l,     i = 0, 1, ..., K − 1.                          (7)

Again, using mathematical induction we can show that

T_K = T_0^{l^K} = [ P^{l^K}   Σ_{i=0}^{l^K − 1} P^i Q ]
                  [ 0                     I            ].

Scheme (7), which we call successive matrix powering of degree l, or SMP(l) for short, is equivalent to the lth-order iterative method proposed by Petryshyn [16], modified to take the following form:

X̃_K = (I + P_K + P_K² + ... + P_K^{l−2}) X_K,     l ≥ 2,
X_{K+1} = X_K + P_K X̃_K,                                            (8)

where the matrices P_K are defined by P_K = I − A X_K, with X_0 = βA*, K = 0, 1, 2, ....

Using the same approach and noting that P_{K+1} = P_K^l, we can prove that X_K in (8) is given by

X_K = Σ_{i=0}^{l^K − 1} P^i Q = Σ_{i=0}^{l^K − 1} (I − βA*A)^i · βA*.

Similarly, we can show that r_K = n − tr(I − βA*A)^{l^K}. For the error estimates we have the similar result

‖A+ − X_K‖ / ‖A+‖ ≤ ((λ1 − λr)/(λ1 + λr))^{l^K},
|r_K − r| / r ≤ ((λ1 − λr)/(λ1 + λr))^{l^K},                        (9)

when we use the optimum parameter β_op = 2/(λ1 + λr). (We have followed Petryshyn's notation in the above; X_K and r_K are equivalent to our earlier notation X_{2^K} and r_{2^K} for the case l = 2.)

Comparing the SMP(l) method with (8), we can make the following observations. When we consider iterative schemes (6) and (7) for estimating A+, we first choose the desired maximum relative error ε. From (9) we can then obtain the necessary magnitude of l^K which, for a given lth-order scheme, determines K, the required number of iterations. From the relationship T_K = T_0^{l^K}, which follows from (7), we simply have to raise the matrix T_0 to the power l^K. Note that with the problem formulated as in (7), we are free to choose l and K. However, when we look at the computational cost of obtaining T_0^{l^K}, there is no advantage in choosing l different from a power of 2. With l = 3, for example, we need two matrix multiplications to get T_0^3 from T_0, whereas we could obtain T_0^4 with the same number of matrix multiplications by successive squarings. The argument for l = 4, 8, etc. is the same as for l = 2. We thus find that we can compute T_0^{l^K} in d successive matrix squarings, where 2^d ≥ l^K. Thus, representation (7) makes it clear that there is no particular advantage in using l different from a power of 2. It thus follows that within this framework, SMP(2) = SMS is optimal.

4. Complexity analysis

In this section we analyse implementation of the SMS method in terms of the number of parallel steps and the number of parallel processors needed to compute the g-inverse and rank to a pre-specified precision.

Theorem 3. Let A ∈ C^{m×n}; then T_K = T_0^{2^K} of the iterations (3) can be computed in time O(K log(m + n)) on M(m + n) processors, where M(m + n) is the number of processors necessary to support matrix multiplication in time O(log(m + n)).

Proof. Since the exponent of T doubles at each squaring step of (3), log₂ 2^K = K matrix multiplications are sufficient to compute T^{2^K}. With regard to the number of processors, [18,19] show that M(n) ≤ n^ω, where the exponent ω is such that sequential matrix multiplication can be performed with O(n^ω) arithmetic operations. The lowest bound established to date is ω ≈ 2.38 [20].  □

In the remainder of this section we determine the number of iterations K as a function of the condition number λ1(A*A)/λr(A*A) and the required accuracy ε = 2^{−γ}. Let μ = λr(A*A)/λ1(A*A). Since A is not a zero matrix, it is easy to see that 0 < μ ≤ 1. Suppose μ = 2^{−δ}. First, if μ = 1 then, as previously mentioned, A+ = A*/λ1. Second, if 1/2 ≤ μ < 1 then we have 0 < (1 − μ)/(1 + μ) ≤ 1/3, leading to the inequality ((1 − μ)/(1 + μ))^{2^K} ≤ (1/3)^{2^K}. Letting (1/3)^{2^K} < ε = 2^{−γ} gives 2^K > γ/log₂ 3, which indicates that ⌈log₂ γ⌉ iterations, where ⌈·⌉ is the ceiling function, are sufficient to obtain the required accuracy. If 0 < μ < 1/2 then, since ln((1 − μ)/(1 + μ)) = −2(μ + μ³/3 + μ⁵/5 + ...), it follows that ln((1 − μ)/(1 + μ)) < −2μ. Therefore γ/(2μ) > −γ/ln((1 − μ)/(1 + μ)). If we ask that the relative error be less than ε, we have ((1 − μ)/(1 + μ))^{2^K} < ε, giving 2^K > −γ/ln((1 − μ)/(1 + μ)). Setting 2^K ≥ γ/(2μ), we find that K = ⌈log₂ γ + δ⌉ iterations guarantee the required accuracy. The above results can be summarised in the following theorem.

Theorem 4. Let Q = βA*, P = I − βA*A and 0 < β < 2/λ1. Using the successive matrix squaring algorithm (3), an approximate g-inverse X_{2^K} and rank r_{2^K}, such that the relative errors are less than ε = 2^{−γ}, can be computed
(a) in K = ⌈log₂ γ⌉ iterations, if 1/2 ≤ μ < 1;
(b) in K = ⌈log₂ γ + δ⌉ iterations, if 0 < μ < 1/2;
where μ = 2^{−δ} = λr/λ1 is the reciprocal of the condition number.
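As a concrete illustration of how the number of iterations is chosen in practice (the helper below is ours, not from the paper), one can simply invert the bound (5): find the smallest K with ((1 − μ)/(1 + μ))^{2^K} ≤ ε. This reproduces the iteration counts quoted in Section 5.

import math

def iterations_from_bound(mu, eps):
    """Smallest K with ((1 - mu)/(1 + mu))**(2**K) <= eps.

    mu = lambda_r/lambda_1 of A*A (reciprocal condition number),
    eps = target relative error for both A+ and the rank.
    """
    if mu >= 1.0:                  # equal eigenvalues: X_0 = beta_op A* is already A+
        return 0
    q = (1.0 - mu) / (1.0 + mu)
    needed_power = math.log(eps) / math.log(q)   # 2**K must reach at least this
    return max(0, math.ceil(math.log2(needed_power)))

print(iterations_from_bound(0.2300, 1e-6))    # 5,  as in Section 5.1 (ceiling of 4.91)
print(iterations_from_bound(2.45e-5, 1e-4))   # 18, as in Section 5.3 (ceiling of 17.52)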

A complexity analysis of the above algorithm is easy to perform. When calculating K, in common with other researchers [14] we assume an upper bound to the condition number, or equivalently a lower bound σ on μ. One step of the successive matrix squaring algorithm requires O(log(m + n)) time on M(m + n) processors. When we choose β = 2/(λ1 + λr) < 2/ρ(A*A), the iterated value X_{2^K} gives an approximate g-inverse of A within a relative error ε in K steps. Using K = O(log γ − log σ), an approximate g-inverse X_{2^K} of a matrix A ∈ C^{m×n} with μ ≥ σ can thus be obtained in time T = O(log(m + n)(log γ − log σ)) on O(M(m + n)) processors. From the above we have the following theorem:

Theorem 5. An approximate g-inverse X_{2^K} and rank r_{2^K} of A ∈ C^{m×n}, with n ≥ m and μ ≥ σ, can be obtained in time T = O(log(n)(log γ − log σ)) on O(M(m + n)) processors. In particular T = O(log(n)) if σ = O(1) and T = O(log²(n)) if σ = O(n^{−a}), where both ε and a (≥ 1) are independent of m and n. Similar results apply in the case n < m.

We conclude this section by observing that with minor modifications we can adapt the SMS algorithm to find solutions of Ax = b with the same asymptotic bounds. We know that for given A ∈ C^{m×n} and b ∈ C^m, with y an arbitrary vector in C^n, x = A+b + (I − A+A)y is the general solution of the equation Ax = b. If this equation is consistent, we call x_0 = A+b the minimum-norm solution. Otherwise x = A+b + (I − A+A)y is the least-squares solution and x_0 = A+b is the minimum-norm least-squares solution.
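As a brief illustration of this use of the g-inverse (our example; any approximation to A+, such as the SMS iterate, could be substituted for np.linalg.pinv), the sketch below forms the minimum-norm solution of a consistent underdetermined system and one member of the general solution family.

import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])             # full row rank, so Ax = b is consistent
b = np.array([2.0, 3.0])

Aplus = np.linalg.pinv(A)                   # stand-in for the SMS approximation of A+
x0 = Aplus @ b                              # minimum-norm solution x0 = A+ b
y = np.array([1.0, -1.0, 0.5])              # arbitrary vector in C^n
x = x0 + (np.eye(3) - Aplus @ A) @ y        # general solution x = A+ b + (I - A+A) y

print(np.allclose(A @ x0, b), np.allclose(A @ x, b))   # True True
print(np.linalg.norm(x0) <= np.linalg.norm(x))         # True: x0 has minimum norm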

5. Numerical examples

To verify that the matrix powering method performs as expected, we use the following examples to compare the observed rate of convergence with that predicted from (5), by contrasting the relative error obtained after the required number of iterations K estimated from Theorem 4 with the specified error ε. All calculations were performed using Mathematica on a Sun 4/690.

5.1. Square non-singular matrix

The 8 by 8 real test matrix A used here was formed by constructing a symmetric matrix with off-diagonal elements chosen from a uniform random distribution with range (0, 1). The diagonal elements were calculated as ten times the absolute sum of the elements in the corresponding row. The resulting matrix was thus of full rank and reasonably well conditioned, as demonstrated by the calculated eigenvalues of A^T A (2726.3, 2309.2, 1953.7, 1835.7, 1767.1, 1640.9, 1613.8, 627.12), giving μ = λ8/λ1 = 0.2300. The number of iterations necessary to achieve a specified relative error ε = 10⁻⁶ is calculated as K = ⌈4.91⌉ = 5. Using β_op = 2/(λ1 + λ8) = 5.96 × 10⁻⁴, after four and five iterations the relative error ‖A+ − X_{2^K}‖ / ‖A+‖ was 5.56 × 10⁻⁴ and 2.95 × 10⁻⁷ respectively, verifying the estimate of K. The corresponding relative errors for the rank were 1.39 × 10⁻⁴ and 7.72 × 10⁻⁸. In the absence of calculated eigenvalues, we can use β_tr = 2/tr(A^T A) = 1.38 × 10⁻⁴. In this case, achievement of the specified accuracy with ε = 10⁻⁶ required eight iterations, after which the relative errors for A+ and the rank were 8.36 × 10⁻¹¹ and 1.05 × 10⁻¹¹ respectively. (After seven iterations the corresponding errors were 9.15 × 10⁻⁶ and 1.14 × 10⁻⁶.)
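A matrix of this type can be generated along the following lines (a sketch only; the random seed, and hence the exact eigenvalues quoted above, are not reproducible from the paper).

import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed; the paper does not specify one
n = 8
U = np.triu(rng.uniform(0.0, 1.0, size=(n, n)), k=1)
A = U + U.T                                          # symmetric, off-diagonals uniform on (0, 1)
np.fill_diagonal(A, 10.0 * np.abs(A).sum(axis=1))    # diagonal = 10 * absolute row sum

lams = np.linalg.eigvalsh(A.T @ A)                   # ascending eigenvalues of A^T A
mu = lams[0] / lams[-1]                              # mu = lambda_8 / lambda_1
beta_op = 2.0 / (lams[-1] + lams[0])                 # optimum relaxation parameter
print(mu, beta_op)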

5.2. Rectangular complex matrix with reduced rank

The complex test matrix A used here is rectangular and of reduced rank. A*A has largest and smallest non-zero eigenvalues λ1 = 112.46 and λr = 15.544, giving μ = 0.1382. For specified errors of ε = 10⁻⁶ and ε = 10⁻⁸, we can calculate K as ⌈5.64⌉ and ⌈6.06⌉ (i.e. 6 and 7) respectively. In this case there are only two non-zero eigenvalues, so β_op = β_tr = 0.0156. After five, six and seven iterations, the relative errors ‖A+ − X_{2^K}‖ / ‖A+‖ obtained were 1.36 × 10⁻⁴, 1.50 × 10⁻⁸ and 3.56 × 10⁻¹⁶ respectively. The corresponding errors in computing the rank (= 2) were 1.36 × 10⁻⁴, 1.85 × 10⁻⁸ and 1.33 × 10⁻¹⁵. These figures provide further evidence of the rapid convergence achieved with moderately well conditioned matrices.

5.3. Ill-conditioned square matrix

Elements of the 8 by 8 random test matrix used here were chosen from a uniform random distribution with range (0, 1). Several matrices were constructed and their eigenvalues calculated. From these we selected one which had full rank but was rather ill-conditioned, and for which A^T A had eigenvalues of 17.197, 0.9980, 0.8457, 0.4939, 0.3606, 0.06760, 0.01226 and 0.0004218, giving μ = 2.45 × 10⁻⁵ and β_op = 0.116. For a specified relative error of ε = 10⁻⁴ we can estimate K as ⌈17.52⌉ = 18. Using β_op, after 17 and 18 iterations the relative errors obtained in A+ were 1.61 × 10⁻³ and 2.60 × 10⁻⁶. The corresponding errors in computing the rank were 4.04 × 10⁻⁴ and 6.51 × 10⁻⁷. Because the largest eigenvalue dominates, β_tr = 0.100 is not much smaller than β_op = 0.116. As a result, the specified error was also achieved in 18 iterations using β_tr. In this case the relative errors in calculating A+ after 17 and 18 iterations were 3.95 × 10⁻³ and 1.56 × 10⁻⁵. The corresponding errors in computing the rank were 4.93 × 10⁻⁴ and 1.95 × 10⁻⁶.

5.4. Matrix with equal eigenvalues

Consider the following matrix:

A = [   0      −2−2i   −2+2i     0   ]
    [ −2+2i      0       0     2+2i  ]
    [ −2−2i      0       0     2−2i  ]
    [   0      2−2i    2+2i      0   ].

A*A has four equal eigenvalues of 16.0, giving μ = 1.0 and β_op = 1/16. We can quickly show that A*A = 16I (a scalar matrix), that P = I − β_op A*A is a null matrix and that X_0 = Q = β_op A* is already equal to A+. The trace of P is zero, as expected, because we know that A is of full rank. If we use β_tr = 1/32 = β_op/2, then we need to iterate to obtain the generalised inverse and rank of A within a specified accuracy. In this case, we find that the relative error in computing A+ after four and five iterations is 1.53 × 10⁻⁵ and 2.33 × 10⁻¹⁰ respectively. The corresponding errors for the rank are 1.53 × 10⁻⁵ and 2.33 × 10⁻¹⁰.
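The shortcut used here is easy to confirm numerically: whenever A*A is a scalar matrix λI, the starting value X_0 = β_op A* = A*/λ already satisfies all four Penrose conditions. A sketch (ours) using the matrix shown above:

import numpy as np

A = np.array([[ 0,     -2-2j, -2+2j,  0    ],
              [-2+2j,   0,     0,     2+2j ],
              [-2-2j,   0,     0,     2-2j ],
              [ 0,      2-2j,  2+2j,  0    ]])

AsA = A.conj().T @ A
print(np.allclose(AsA, 16.0 * np.eye(4)))             # A*A = 16I (a scalar matrix)

beta_op = 2.0 / (16.0 + 16.0)                         # = 1/16
X0 = beta_op * A.conj().T                             # Q = beta_op A*
print(np.allclose(X0, np.linalg.pinv(A)))             # X0 is already A+
print(np.allclose(np.eye(4) - beta_op * AsA, 0.0))    # P is the null matrix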

6. CM-5 implementation results

The directly expressed SMS algorithm is primarily intended as (i) a conceptual tool for clarifying issues relating to the order of the iterative scheme employed, and (ii) a simple structure suitable for special-purpose VLSI implementation. On most general-purpose parallel architectures the simplicity of the direct SMS algorithm is offset by the additional unnecessary computations associated with the component block matrices 0 and I in T. These computations can be avoided by explicit multiplication of T's upper-half components, giving

P_{K+1} = P_K²,
Q_{K+1} = P_K Q_K + Q_K.
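In code the block form amounts to the following sketch (ours; note that the order of the two updates matters, since Q_{K+1} must use the old P_K).

import numpy as np

def sms_pinv_block(A, K):
    """Block form of SMS: iterate only on the P and Q blocks of T (Section 6)."""
    n = A.shape[1]
    As = A.conj().T
    beta = 2.0 / np.trace(As @ A).real
    P = np.eye(n) - beta * (As @ A)
    Q = beta * As
    for _ in range(K):
        Q = P @ Q + Q            # Q_{K+1} = P_K Q_K + Q_K   (uses the old P)
        P = P @ P                # P_{K+1} = P_K^2
    rank = n - np.trace(P).real
    return Q, rank               # Q now holds X_{2^K}, the approximation to A+

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
X, r = sms_pinv_block(A, K=40)
print(np.allclose(X, np.linalg.pinv(A)), int(round(r)))    # True 2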

(Note that multiplication of these components is equivalent to a second-order version of the modified Petryshyn method presented in Section 3 above.)

The direct and block forms of the SMS algorithm, plus the B-IG algorithm as expressed in (6), were implemented on the ANU's CM-5 Connection Machine, which has 32 processing nodes (four vector units per node). Tests with complex random test matrices of various shapes and sizes were then conducted with all three methods to determine the relative speed and the number of iterations required to achieve a specified accuracy of 10⁻⁶ in both the rank and the generalised inverse. All tests used β_tr = 2.0/tr(A*A) for the relaxation parameter, since this is the easiest value to compute in practice. Three shapes of matrix A ∈ C^{m×n} were tested: m = 2n and m = n/2 (rectangular), plus m = n (square). The matrix sizes varied from 128 × 128 to 1024 × 1024. Programs were written according to the CM Fortran data-parallel model, with CMSSL routines being used for matrix computations (all in double precision).

As far as speed of computation was concerned, the direct SMS algorithm was the slowest, with the block SMS and B-IG algorithms being comparable in execution time. In each iteration the direct algorithm requires a single multiplication of two (m + n) by (m + n) matrices; the other algorithms require two multiplications, one of n × n by n × m and the other of n × n by n × n, plus an addition of two n × m matrices. Ignoring matrix additions, the direct algorithm can be seen to require about (m + n)³ operations, whereas the others require about n²m + n³. For m = n, m = 2n and m = n/2, the ratios of the computational requirements of the direct and block algorithms are thus about 4.0, 9.0 and 2.25 respectively. The ratios observed in the CM-5 execution times were smaller, being in the range of approximately 2.1 to 3.3 for m = n, 3.9 to 6.1 for m = 2n and 1.4 to 1.9 for m = n/2. The smaller than expected penalties here can be accounted for partly by the improving ratio of computation to communication for parallel matrix multiplication as matrices get larger, and partly by the reduced setup and communication overheads possible with the simple SMS structure. As an example, the direct SMS, block SMS and B-IG methods took 608.3, 186.6 and 184.5 seconds respectively to perform 20 iterations with a 1024 by 1024 complex matrix.

As expected, execution times were dominated by matrix multiplication. Johnsson [21] found that matrix multiplication scales well on the CM-5 architecture, obtaining a performance of about 60 to 80 Mflops per node as the number of nodes varied from 1 to 512. Thus, given the dominance of matrix multiplication in the SMS and B-IG methods, we did not examine the way these methods scaled on the CM-5. The execution times obtained on our 32-node machine were respectably short, but the number of processing nodes (and density of interconnections) did not approach that required [18,19] for parallel matrix multiplication in time O(log(m + n)).

Although the random matrices employed were of full rank, they generally had one eigenvalue somewhat larger than the others. This meant that β_tr was usually close to β_op and that μ < 1/2, as required for application of Theorem 4(b) in estimating the number of iterations K. All three algorithms converged at a similar rate, achieving the specified relative accuracy of 10⁻⁶ within one iteration of each other (on completion of K iterations or shortly thereafter). Convergence of the rank towards an integer value proved to be a useful guide to convergence of the approximation to A+: with all our test matrices the specified error in the g-inverse was achieved with at most one additional iteration after the rank had converged.

7. Discussion

A significant advantage of the SMS algorithm is that its formulation yields a simple expression for the result of successive iterations. This much simplified notation, compared to that of earlier schemes, leads to a clarification of the role of the order of the iterative scheme l in relation to the number of iterations K. In particular, we have shown that it is only the magnitude of l^K which is significant (and not the individual values of l and K), and we have argued that there is no obvious advantage in choosing l other than 2. Note that the number of iterations required is a function of the condition number of the input matrix but not of its size.

The effective rank of an arbitrary matrix is often useful, but can be difficult to obtain with large matrices. The matrix rank is obtained here for little additional cost.

Another useful result derived above is a tighter upper bound (5) than that previously established on the relative error of the iterative estimate for A+ after K iterations. The results of the numerical experiments reported in Section 5 suggest that (5) provides not only an upper bound but also a realistic practical estimate of the number of iterations required to achieve a given accuracy. In the example of Section 5.1, the required number of iterations for ε = 10⁻⁶ was K = ⌈4.91⌉ = 5. The relative errors obtained in computing the g-inverse and rank after five iterations were 2.95 × 10⁻⁷ and 7.72 × 10⁻⁸ respectively. The real-valued estimate of the required number of iterations was close to the number of iterations used and the desired accuracy was achieved by a relatively narrow margin, suggesting that the error bound (5) is in fact quite tight. The results of Section 6 provide further evidence here. Note that, because of the exponent 2^K in (5) and the fact that (λ1 − λr)/(λ1 + λr) lies in the range [0, 1), the number of digits of accuracy approximately doubles with each further iteration (neglecting machine precision effects), so that achievement of a relative error of, for example, 10⁻⁹ when the required accuracy was 10⁻⁶ does not mean that fewer iterations could have been used.

Computing T² directly (rather than via its component block matrices) involves trivial multiplications by 0 and 1, but the simple structure could well be attractive in hardware implementations. Assuming that we have a parallel matrix squaring architecture which leaves the elements of the squared matrix in their original locations, we can then simply repeat the squaring operation K times. The obvious advantage here is that there is no need to save the results or to load the individual processors with fresh data items at the end of each cycle of successive squaring, global data movement of this type being a relatively time-consuming operation on typical parallel architectures. To the extent that the architecture employed has a large number of locally-connected processors and that the accompanying matrix squaring algorithm minimises global data communication, the smaller data communication overheads in the SMS algorithm will offset the approximate factor of four increase in arithmetical operations. As noted earlier, the number of non-trivial arithmetical operations in each iterative step of the SMS algorithm is similar to that of the B-IG scheme. Working within the context of VLSI systolic architectures, we are currently exploring strategies for taking advantage of the simple structure of the SMS algorithm while at the same time avoiding overheads arising from unnecessary trivial operations [22].

8. Acknowledgement

The authors gratefully acknowledge the assistance of the referees in making constructive suggestions towards improvement of the original version of this paper.

9. References

[1] L.D. Pyle, Generalised inverse computations using the gradient projection method, J. ACM 11 (1964) 422-428.
[2] A. Ben-Israel, An iterative method for computing the generalised inverse of an arbitrary matrix, Math. Comp. 19 (1965) 452-455.
[3] J.C.G. Boot, The computation of the generalised inverse of singular or rectangular matrices, Amer. Math. Monthly 70 (1963) 302-303.
[4] V.N. Kublanovskaya, On the computation of the generalised matrix inverse and the projection, J. Comp. Math. Phys. 6 (1966) 326-332.
[5] E.V. Krishnamurthy and S.G. Ziavras, Matrix g-inversion on the Connection Machine, Tech. Report CAR-TR-399, Univ. of Maryland, 1988.
[6] A. Ben-Israel and T.N.E. Greville, Generalized Inverses: Theory and Applications (Wiley, New York, 1974).
[7] L. Csanky, Fast parallel matrix inversion algorithm, SIAM J. Comput. 5 (1976) 618-623.
[8] S.A. Cook, A taxonomy of problems with fast parallel algorithms, Inform. and Control 64 (1985) 2-22.
[9] J. von zur Gathen, Parallel arithmetic computations: a survey, Lecture Notes in Computer Science 233 (Springer-Verlag, New York, 1986) 93-122.
[10] B. Codenotti and F. Flandoli, A Monte Carlo method for the parallel solution of linear systems, J. Complexity 5 (1989) 107-117.
[11] B. Codenotti and M. Leoncini, Parallel Complexity of Linear System Solutions (World Scientific, Singapore, 1991).
[12] F.P. Preparata, Inverting a Vandermonde matrix in minimum parallel time, Inform. Proc. Letters 38 (1991) 291-294.
[13] B. Codenotti, M. Leoncini and G. Resta, Repeated matrix squaring for the parallel solution of linear systems, PARLE '92 (Parallel Architectures and Languages in Europe), Lecture Notes in Computer Science 605 (Springer-Verlag, New York, 1992) 725-732.
[14] B. Codenotti, Parallel solution of linear systems by repeated squaring, Applied Math. Letters 3 (1990) 19-20.
[15] R. Penrose, A generalized inverse for matrices, Proc. Cambridge Philos. Soc. 51 (1955) 406-413.
[16] W.V. Petryshyn, On generalised inverses and on the uniform convergence of (I − βK)^n with application to iterative methods, J. Math. Anal. Appl. 18 (1967) 417-439.
[17] E.V. Krishnamurthy and S.K. Sen, Numerical Algorithms: Computations in Science and Engineering (East-West Press, New Delhi, 1986).
[18] V. Pan, Complexity of parallel matrix computations, Theor. Comput. Sci. 54 (1987) 65-85.
[19] V. Pan and J. Reif, Fast and efficient parallel solution of dense linear systems, Comput. and Math. with Appl. 17 (1989) 1481-1491.
[20] D. Coppersmith and S. Winograd, Matrix multiplication via arithmetic progressions, Proc. 19th Annual ACM Symposium on Theory of Computing (1987) 1-6.
[21] S.L. Johnsson, personal communication, 1993.
[22] L. Chen, B.B. Zhou and I. Macleod, Fast parallel solution of linear systems via successive matrix squaring, in preparation.