Parallel Computing 5 (1987) 187-196 North-Holland
A parallel implementation of the QR-algorithm *

G.W. STEWART

Department of Computer Science and Institute for Physical Science and Technology, University of Maryland, College Park, MD 20742, U.S.A.

* This work was supported in part by the Air Force Office of Sponsored Research under grant AFOSR-82-0078.
Abstract. In this paper a parallel implementation of the QR-algorithm for the eigenvalues of a non-Hermitian matrix is proposed. The algorithm is designed to run efficiently on a linear array of processors that communicate by accessing their neighbors' memory. A module for building such arrays, the Maryland CRAB, is also described.
Keywords. Parallel QR-algorithm, linear array of processors, communication through locally shared memory, implementation features.
1. Introduction

This paper has two purposes. The first is to describe a parallel version of the QR-algorithm for the eigenvalues of a nonsymmetric matrix. The second is to show how the algorithm may be implemented on a network of processors that communicate through locally shared memory.

We shall be especially concerned to ensure that our algorithm runs well at medium granularity. To see what this means, consider the problem of performing Gaussian elimination on a matrix A of order n with a linear array of p processors. Typically the matrix might be partitioned by columns in the form

    A = (A_1, A_2, ..., A_p),                                   (1.1)

where each A_k has roughly n/p columns. These blocks are then assigned in order to the processors, which perform in parallel the arithmetic operations required by the elimination algorithm. To perform the first step of the elimination, each processor needs certain numbers, called multipliers, computed by the first processor, and getting these numbers to the processors requires time at least proportional to n. For a small number of processors, this will ordinarily be a small part of the total work, since the computations performed on each processor will be proportional to n^2/p. However, as p increases, the communication time will account for an increasing share of the execution time. It may even happen that a point is reached after which adding more processors increases the execution time. Our aim is to defer this point as far as possible.

We shall attempt to obtain a medium-grained algorithm in two ways. First, we shall propose a variant of the Householder reduction to Hessenberg form that circumvents one of the computational roadblocks. Second, we shall show how the algorithm can be implemented on a
system of processors designed specifically to reduce communications overhead.

To avoid making this paper too long, we shall assume that the reader is familiar with the details of the QR-algorithm, both the reduction to Hessenberg form and the subsequent QR-iteration. In the next section we shall describe the parallel version of the reduction. In Section 3 we shall describe a processor module, called a CRAB, that is currently being designed at the University of Maryland [3], and show how it can be used to reduce the communications overhead. Finally, in Section 4 we will discuss the parallelization of the QR-iteration.
2. Reduction to Hessenberg form

In this section we shall discuss the parallel reduction of a nonsymmetric matrix to Hessenberg form by Householder transformations. As was indicated in the introduction, the reader is assumed to be familiar with the details of the reduction (for example, see [2, pp. 222-223]). What follows is a sketch of the first step.

From the first column of A, one computes a vector u defining a Householder transformation

    H = I - uu^T.

This transformation is pre- and postmultiplied into the matrix to yield

    A' = HAH = A - xu^T - uy^T + μuu^T,                         (2.1)

where

    x = Au,                                                     (2.2)

    y^T = u^T A,                                                (2.3)

and

    μ = u^T x.                                                  (2.4)
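To make equations (2.1)-(2.4) concrete, here is a minimal sequential sketch of this first step in C. The routine name, the dense row-major storage, and the normalization of u (so that u^T u = 2) are my own choices for the illustration and are not taken from the paper.

    #include <math.h>
    #include <stdlib.h>

    /* One sequential Householder step: choose u so that H = I - u u^T
       annihilates a[2][0], ..., a[n-1][0] (0-based indexing), then apply
       equations (2.1)-(2.4).  A sketch only: no blocking, no error checks. */
    void hessenberg_step1(int n, double a[n][n])
    {
        double *u = calloc(n, sizeof *u);
        double *x = calloc(n, sizeof *x);
        double *y = calloc(n, sizeof *y);

        /* Build the Householder vector from the first column. */
        double sigma = 0.0;
        for (int i = 1; i < n; i++) sigma += a[i][0] * a[i][0];
        sigma = sqrt(sigma);
        if (a[1][0] < 0.0) sigma = -sigma;                 /* avoid cancellation */
        if (sigma == 0.0) { free(u); free(x); free(y); return; }

        for (int i = 1; i < n; i++) u[i] = a[i][0];
        u[1] += sigma;
        double scale = sqrt(1.0 / (sigma * (sigma + a[1][0])));
        for (int i = 1; i < n; i++) u[i] *= scale;         /* now u^T u = 2 */

        /* x = A u,  y^T = u^T A,  mu = u^T x   (equations (2.2)-(2.4)). */
        double mu = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                x[i] += a[i][j] * u[j];
                y[j] += u[i] * a[i][j];
            }
        for (int i = 0; i < n; i++) mu += u[i] * x[i];

        /* A' = A - x u^T - u y^T + mu u u^T   (equation (2.1)). */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] += -x[i] * u[j] - u[i] * y[j] + mu * u[i] * u[j];

        free(u); free(x); free(y);
    }

After the update the elements a[2][0] through a[n-1][0] are zero, which is exactly the property the parallel version below must reproduce with A partitioned across processors.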
The effect of this transformation is that the elements a'_31, ..., a'_n1 of A' are zero. The process is then repeated on the trailing (n - 1) × (n - 1) principal submatrix of A', and so on until the reduction is complete.

Let us now consider what happens when we attempt to implement the first step on a linear array of processors. Suppose that A has been partitioned by columns as in (1.1), and block A_k has been assigned to the kth processor in the array. The first step is to compute u on the first processor and broadcast it to the other processors. The next step is to compute x and y defined by (2.2) and (2.3). The vector y is easy to compute; for if we partition

    y^T = (y_1^T, y_2^T, ..., y_p^T),

then y_k can be calculated on the kth processor according to the formula

    y_k^T = u^T A_k.

Moreover, from (2.1) it is seen that y_k will be on the processor where it is needed to form A'. The vector x = Au is another matter. To calculate it, we must partition

    u^T = (u_1^T, u_2^T, ..., u_p^T),

calculate z_k = A_k u_k, and then form

    x = z_1 + z_2 + ... + z_p.                                  (2.5)
If (2.5) is accumulated on a single processor, it will require roughly np floating-point additions. This means that for medium-grained computations, where p = O(n), the first step will take at least the time required for O(n^2) operations, and the entire reduction will take the time required for O(n^3) operations. Since the sequential algorithm also requires O(n^3) operations, we have not achieved the full benefits of parallelism.¹

The problem can be circumvented by organizing the computations in a wave front.² Let us denote the ith component of z_k by z_i^(k). The first processor passes z_1^(1) to the second processor, which forms the sum c = z_1^(1) + z_1^(2). The second processor then passes c to the third processor, which adds z_1^(3) to it. At the same time the first processor passes z_2^(1) to the second, which computes c = z_2^(1) + z_2^(2). In general, at step s processor k receives a value of c from the processor to its left, adds the corresponding component of z_k to it, and passes the result on to the right. At the end, the components of x are on processor p. The entire process takes time proportional to p + n.

    if (k == 1){
        compute u;
        broadcast(u; 2:p);
    }
    else
        receive(u);

    y_k = A_k^T u;   z_k = A_k u_k;

    for (i = 1; i <= n; i++){
        if (k == 1) c = 0;
        else receivel(c);
        c = c + z_i;
        if (k != p) sendr(c);
        else x_i = c;
    }

    if (k == p) broadcast(x; 1:p);
    else receive(x);

    mu = x^T u;
    A_k = A_k + u(mu u_k^T - y_k^T) - x u_k^T;

    Fig. 1. Reduction to Hessenberg form: First step, processor k.

Figure 1 contains pseudo-code describing the operations performed by the kth processor in the first step of the reduction. It divides naturally into four parts, which are separated by blank lines in the figure. In the first part, the action taken depends on the location of the processor. The first processor computes u and broadcasts it to the remaining processors, while the other processors enter a request for u. In this broadcast-receive protocol, each broadcast message must correspond to requests on the receiving processors that tell where to place the message. It is assumed that the processor requesting a message via receive is blocked until the message arrives. We will discuss the mechanics of broadcasting in the next section. The second part consists of the routine computation of y_k and z_k.
¹ A binary tree summation of (2.5) would reduce the count to n log_2 p. However, such a process does not lend itself naturally to a linear array of processors and still does not give the full benefits of O(n) processors.
² I am indebted to Rob Schreiber for suggesting this approach in place of my own clumsier solution.
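As a further illustration of the wave-front schedule, the following self-contained C program simulates it on a single machine. The arrays, the dummy data, and the double loop over steps and processors are a model of the schedule only; they are not meant to represent the actual asynchronous communication.

    #include <stdio.h>

    #define N 8   /* order of the matrix (illustrative only) */
    #define P 4   /* number of processors (illustrative only) */

    double z[P][N];   /* z[k][i] stands for the ith component of z_k = A_k u_k */
    double c[N];      /* partial sum of component i as it travels to the right */
    double x[N];      /* the finished components of x, held by the last processor */

    int main(void)
    {
        for (int k = 0; k < P; k++)
            for (int i = 0; i < N; i++)
                z[k][i] = k + 1.0;                 /* dummy data */

        /* At step s, processor k works on component i = s - k: the first
           processor starts the sum, every processor adds its own contribution,
           and the last processor keeps the result.  The sweep takes P + N - 1
           steps, which is the "time proportional to p + n" of the text. */
        for (int s = 0; s < P + N - 1; s++)
            for (int k = 0; k < P; k++) {
                int i = s - k;
                if (i < 0 || i >= N) continue;     /* processor k is idle at this step */
                if (k == 0)     c[i] = 0.0;
                c[i] += z[k][i];
                if (k == P - 1) x[i] = c[i];
            }

        for (int i = 0; i < N; i++)
            printf("x[%d] = %g\n", i, x[i]);       /* each should be 1+2+3+4 = 10 */
        return 0;
    }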
Sending Program
1. Request a message from the receiving program giving a pointer to the destination location for the data.
2. Transmit the block of data to the locations starting at the pointer location.
3. Send a message to the receiving program to indicate that the data has been delivered.

Receiving Program
1. Send a message to the sending program giving a pointer to the destination location for the data.
2. Request a message from the sending program. (Since the latter only sends the signal after the data has been transferred, the receiving program knows that the data is in place once this message arrives.)

Fig. 2. Communicating programs.
The third part consists of the wave-front computation of x and its distribution over the processors. The primitive sendr causes a number to be sent to the processor on the right, which must execute a corresponding receivel to get it. In the final part, A_k is updated.

This essentially completes the description of the parallel reduction to Hessenberg form, the other steps being analogous. At step s, the processor containing the sth column generates the vector u, which is now of order n - s, and broadcasts it to the other processors. The computation of x and y proceeds as above, allowance being made for the fact that the first s - 1 components of y are zero. When the blocks A_k are updated, time can be saved by observing that only the last n - s + 1 columns of A change.

How good is the algorithm? In terms of arithmetic operations, not bad. The greatest amount of work is done by the pth processor, and an elementary operation count shows that it performs approximately

    F_par = 3n^3/p + n^2/2 + n^2

floating-point additions and multiplications.³ For p = n this is 4.5n^2 additions and multiplications. On the other hand, the sequential algorithm performs

    F_seq = 5n^3/3

additions and multiplications. Thus we are within a factor of 2.7 of the optimal speedup.

³ This is an overestimate for small p. The last two terms are from the wave-front accumulation of x.

However, arithmetic parallelism is not enough to make a good algorithm. Communication costs must also be reasonable. Since we have subsumed communications under the broadcast and sendr primitives, let us now examine what is involved in implementing them.⁴

⁴ The following discussion is adapted from [3].

It is important to keep in mind that the actual moving of a block of data between two processors is only part of the job of communicating; for the sending program must not send the data prematurely, and the receiving program must not attempt to use the data before it has arrived. Thus the two programs must coordinate their activities by means of synchronizing messages. Figure 2 shows how this might be done. We will assume that once a program requests a message, its execution is suspended until that message arrives.

Let us now examine the overhead in passing this message. The time taken to transfer the data will consist of a setup time and a transmission time proper. The setup time will vary from system to system. On a system in which both processors must shake hands to transfer data, it
could be large. On the other hand, if the transfer is made by direct memory access, the setup time may be negligible. The actual transmission rate, after setup, is potentially very fast. To summarize, the transmission of a block of data between processors need not be unduly burdensome.

Synchronizing the transmission is a different story. A typical implementation of the send-request communication in Fig. 2 will have four stages:
(1) The message is wrapped in a packet containing its source and destination program.
(2) The packet is transmitted to the receiving processor.
(3) If the receiving program has not yet requested it, the message is queued.
(4) After the receiving program requests the message, it is retrieved from the queue and delivered.
There is a lot of overhead here; so much that in our application it can overwhelm the transmission and computation time unless we pass a very large amount of information with each transmission. In particular, the wave-front accumulation of x is impractical in such a setting.⁵

⁵ Since here the messages are of length one, we could treat them in the same way as a synchronizing message. Nonetheless the overhead still makes the wave-front algorithm impractical.

The fact that we must broadcast u makes matters worse. One way of implementing the broadcast is to pass u from processor to processor. If only transmission time is taken into account, it will take time proportional to np for u to reach the last processor. Since the first processor cannot begin to compute the second u until the last processor has received and processed the first u, this broadcast time will be a lower bound on the total execution time. In fact, the total execution time will be bounded below by the sum of the communication times required to broadcast the u's. An elementary analysis shows that this time is of order pn^2. When n = p, the broadcasting takes time proportional to n^3, which is of the same order as the arithmetic work for the sequential algorithm!

The problem would disappear if we could send each component of u from processor to processor, since then it would require time proportional to n to get u from the first processor to the last. However, this will be a solution only if the synchronization overhead involved in passing a component is very small. In the next section, we shall introduce a parallel architecture for which this is true.
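To make the cost of this kind of synchronization concrete, here is a small sketch in which two POSIX threads play the sending and receiving programs of Fig. 2. The mailbox structure, its field names, and the use of a mutex and condition variable are inventions for the illustration; they stand in for whatever message facility a real system would provide.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    /* A toy mailbox carrying the synchronizing messages of Fig. 2. */
    struct mailbox {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        double *dest;       /* destination pointer supplied by the receiver */
        int     delivered;  /* set by the sender after the copy             */
    };

    static struct mailbox mb = { PTHREAD_MUTEX_INITIALIZER,
                                 PTHREAD_COND_INITIALIZER, NULL, 0 };
    static double payload[4] = { 1.0, 2.0, 3.0, 4.0 };

    static void *sender(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&mb.lock);
        while (mb.dest == NULL)                     /* step 1: request the pointer  */
            pthread_cond_wait(&mb.cond, &mb.lock);
        memcpy(mb.dest, payload, sizeof payload);   /* step 2: transmit the data    */
        mb.delivered = 1;                           /* step 3: say it has arrived   */
        pthread_cond_signal(&mb.cond);
        pthread_mutex_unlock(&mb.lock);
        return NULL;
    }

    int main(void)
    {
        double buf[4];
        pthread_t t;
        pthread_create(&t, NULL, sender, NULL);

        pthread_mutex_lock(&mb.lock);
        mb.dest = buf;                              /* step 1: send the pointer     */
        pthread_cond_signal(&mb.cond);
        while (!mb.delivered)                       /* step 2: wait for "delivered" */
            pthread_cond_wait(&mb.cond, &mb.lock);
        pthread_mutex_unlock(&mb.lock);

        pthread_join(t, NULL);
        printf("received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
        return 0;
    }

Even in this stripped-down form, every transfer costs two synchronizing exchanges in addition to the copy itself, which is why passing single numbers this way is ruinous and why the next section takes a different route.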
3. The Maryland CRAB

The Maryland CRAB is a processor module for building networks of processors. Neighboring processors in a network communicate by writing in each other's memory.

Figure 3 shows a block diagram of a CRAB. From this it is seen that a CRAB consists of a central processing unit with local memory and an ethernet controller. The processing unit is connected to a device called a crabboard, which has a memory that it can share with other CRABs. The crabboard has a number of ports (indicated by the small squares in the figure) through which it can be connected to other crabboards.

    [Fig. 3. The Maryland CRAB: a processing unit with local memory and an ethernet controller, attached to a crabboard containing shared memory and ports 1, 2, ..., k.]

A CRAB shares memory with other CRABs as follows. Suppose its crabboard has k ports. The address space of the processor is divided into k + 2 blocks, which are associated with the local memory, the shared memory, and the ports, as shown in Fig. 4. Addresses in the local memory block are mapped directly to local memory. Addresses in the shared memory block are mapped to shared memory, after a base equal to the size of the local memory is subtracted. When a reference is made in a block associated with a port, the address is shunted to the crabboard connected to that port and mapped onto that crabboard's shared memory. In this way processors connected through their crabboards can share memory.

    [Fig. 4. Crab address space: consecutive blocks for local memory, shared memory, and ports 1 through k.]

This ability of neighboring processors to share memory has two important consequences. First, it is cheap to move blocks of data from one processor to another. In fact, such a transfer can be accomplished by a program loop that copies the data into the address space of the appropriate port. Moreover, in our implementation of the CRAB, synchronization primitives like test-and-set will function across the crabboards, so that programs can synchronize their own communication.

Let us see how a linear array of CRABs can broadcast a vector. We assume that each processor has an array u in its shared memory to contain the vector u. Figure 5 contains a program implementing the broadcast. The program begins by setting the variable numu to zero. This variable tells how many components of u have arrived at the processor and will be incremented by the processor to the left as it delivers each component.
    numu = 0; busy = FALSE;
    if (proc != p){
        while (testandset(right_busy)){}
        if (proc == 1){
            for (i=1; i<=n; i++){
                right_u[i] = u[i];
                right_numu++;
            }
        }
        else{
            for (i=1; i<=n; i++){
                while (numu < i){}
                right_u[i] = u[i];
                right_numu++;
            }
        }
    }
    else
        while (numu < n){}

    Fig. 5. Program to broadcast a vector.
The flag busy is set to FALSE to indicate that it is ready to receive the components of u. Since the last processor has only to receive the components of u, we treat it separately.

The statement

    while (testandset(right_busy)){}

synchronizes the passage of the data. The variable busy with the prefix right refers to the variable busy in the shared memory of the processor to the right. As long as it is TRUE, the function testandset merely returns the value TRUE. In this case the program spins, executing the null while statement. When the variable right_busy is FALSE, testandset returns FALSE and sets right_busy to TRUE. Thus the program falls through the while statement. But just as important, it cannot reenter this section of code unless the processor to the right resets its variable busy. Thus there is no danger of our processor moving u to the processor on the right and then attempting to move another vector before the processor on the right has had time to consume the first one.

The code is now slightly different for the first processor and for the other p - 2 processors. The first processor simply transfers the components of u in its shared memory to the corresponding components of u in the shared memory of the processor to the right. As it does so, it increments numu on the right to indicate that the component has arrived. The other processors (except the last) pass on components just like the first, but not before numu indicates that the component is there. The last processor simply waits until numu tells it that the last component has arrived.

Let us analyze the complexity of this program. There are two parts. The first is a setup time in which numu and busy are initialized and the processor waits for the processor on the right to signal that it is ready. If all the processors are assumed to be ready to broadcast at the same time, this startup cost is negligible. The second part is the time required to get all the data to the pth processor. Let τ be the maximum time required for one iteration of the transmission loop for processors 2, ..., p - 1. Then for the first component of u to arrive at processor p - 1 requires time (p - 2)τ. It then takes time nτ for processor p to receive all n components of u. Thus the total transmission time is

    t_trans = (p + n - 2)τ.

Since p is bounded by n, we have an O(n) time for broadcasting, as opposed to the O(n^2) time of Section 2. However, the right order for transmission will make little difference if τ is large. Unfortunately, we cannot give a deterministic bound on τ, since the number of times the statement
    while (numu < i){}

is executed depends on the action of an asynchronously executing processor. But once the processors are all busy passing components of u, it is not unreasonable to assume that this statement will be executed no more than two times for each component of u. Thus τ will be proportional to the time required to execute a small segment of code. The constant τ can be reduced by coding this segment in assembly language. If the loop fits in an instruction cache, τ will be yet smaller. In any event nτ will be of the same order of magnitude as the computations involving u.

It should not be thought that locally shared memory entirely obviates the need for message passing in the style of Fig. 2. In our broadcast example it is assumed that the variables busy and numu have been initialized to TRUE and zero. Thus the processors must pass messages to
indicate that the initializations are complete, and these messages cannot use shared variables that require initialization. However, this sequence of message passing is a one-time startup cost for the entire reduction to Hessenberg form, and for matrices of reasonable size it will represent only a small part of the total time.
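By way of comparison, the following sketch imitates the busy/numu handshake of Fig. 5 between two threads that share an address space, with C11 atomics standing in for the crabboard's test-and-set. The names right_u, right_numu, and right_busy echo the figure; everything else, including the pre-setting of the flag, is an invented stand-in rather than CRAB code.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define N 5

    /* Shared state standing in for the right-hand processor's shared memory. */
    static double      right_u[N + 1];                  /* 1-based, as in Fig. 5  */
    static atomic_int  right_numu = 0;                  /* components delivered   */
    static atomic_flag right_busy = ATOMIC_FLAG_INIT;   /* the busy flag          */

    static double u[N + 1] = { 0, 1, 2, 3, 4, 5 };      /* the vector to broadcast */

    static void *left_processor(void *arg)
    {
        (void)arg;
        while (atomic_flag_test_and_set(&right_busy))   /* spin until the right side is ready */
            ;
        for (int i = 1; i <= N; i++) {                  /* deliver u one component at a time  */
            right_u[i] = u[i];
            atomic_fetch_add(&right_numu, 1);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        atomic_flag_test_and_set(&right_busy);   /* busy starts out TRUE               */
        pthread_create(&t, NULL, left_processor, NULL);

        atomic_flag_clear(&right_busy);          /* busy = FALSE: ready to receive     */
        while (atomic_load(&right_numu) < N)     /* wait until all components are here */
            ;
        printf("last component received: %g\n", right_u[N]);

        pthread_join(t, NULL);
        return 0;
    }

The point is the same as in the text: once the flag and the counter live in memory that both sides can touch, delivering a component costs only a store and an increment, with no packets to wrap or queues to search.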
4. The parallel QR-step

In this section we shall describe how the QR-iteration may be implemented on a linear array of processors. For simplicity we confine ourselves to the explicitly shifted QR-algorithm. It is assumed that the reader is familiar with the details of the sequential QR-algorithm (see [2, pp. 220-221]).

The sth step of the QR-iteration goes as follows. Beginning with the current (upper Hessenberg) matrix A_s and a shift κ_s, determine rotations R_i^(s) in the (i, i+1)-plane so that

    B_s = R_{n-1}^(s) ... R_2^(s) R_1^(s) (A_s - κ_s I)

is upper triangular. Then form

    A_{s+1} = B_s R_1^(s)T R_2^(s)T ... R_{n-1}^(s)T + κ_s I.   (4.1)

The matrix A_{s+1} is also upper Hessenberg. If the shifts are chosen suitably, the element a_{n,n-1}^(s) will converge quadratically to zero and a_{n,n}^(s) to an eigenvalue.

In our implementation, the matrix will be distributed as usual in groups of t columns across the array of processors. Figure 6 shows the first two groups and part of the third for t = 3. The double vertical lines represent processor boundaries. The horizontal lines divide the groups of columns into t × t blocks within each processor. Each processor will generate the rotations for its diagonal block. Since the generation of rotations in a single step is a strictly sequential process, the processors will generate rotations for different steps: the first processor for step s, the second for step s - 1, the third for step s - 2, and so on. Each processor passes the rotations it generates to the right, where they are applied to the appropriate rows and then passed on. After a processor has processed all rotations from the left, it applies its own rotations to its columns (cf. (4.1)).

Figure 7 contains a more detailed description of the algorithm from the point of view of the kth processor. The variable lcol[k] is the index of the leftmost column on the processor, and rcol[k] the index of the rightmost column. The rotations are denoted by R with a suitable index, and the ith processor is denoted by P[i]. The code begins with an iteration loop, which only terminates when the entire QR-iteration has converged. How convergence is determined will be discussed later. The first part of the code consists of the generation and application of rotations to the rows contained in the processor. The processor must wait k - 1 iterations for the preceding k - 1 processors to generate the rotations corresponding to κ_1. Thereafter, at successive iterations it generates rotations for κ_1, κ_2, κ_3, and so on.
    [Fig. 6. Distribution of the matrix: each processor holds a group of t = 3 columns (double vertical lines mark processor boundaries), and horizontal lines divide each group into t × t blocks.]
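The operations generate, applyrow, and applycol used in Fig. 7 below are ordinary Givens rotations applied to a pair of rows or columns. The following sequential sketch shows one way to write them; the signatures, the 0-based dense storage, and the full-length loops are my own simplifications (in the Hessenberg case the loops would be restricted to the nonzero part of the rows and columns).

    #include <math.h>
    #include <stdio.h>

    /* generate: choose cosine c and sine s so that the rotation in the
       (i, i+1)-plane annihilates a[i+1][i]. */
    void generate(int n, double a[n][n], int i, double *c, double *s)
    {
        double x = a[i][i], y = a[i+1][i], r = hypot(x, y);
        if (r == 0.0) { *c = 1.0; *s = 0.0; return; }
        *c = x / r;
        *s = y / r;
    }

    /* applyrow: apply the rotation to rows i and i+1. */
    void applyrow(int n, double a[n][n], int i, double c, double s)
    {
        for (int j = 0; j < n; j++) {
            double t1 = a[i][j], t2 = a[i+1][j];
            a[i][j]   =  c * t1 + s * t2;
            a[i+1][j] = -s * t1 + c * t2;
        }
    }

    /* applycol: apply the transposed rotation to columns i and i+1, as in (4.1). */
    void applycol(int n, double a[n][n], int i, double c, double s)
    {
        for (int r = 0; r < n; r++) {
            double t1 = a[r][i], t2 = a[r][i+1];
            a[r][i]   =  c * t1 + s * t2;
            a[r][i+1] = -s * t1 + c * t2;
        }
    }

    int main(void)
    {
        double a[3][3] = { { 4, 1, 2 },
                           { 3, 5, 1 },
                           { 0, 2, 6 } };   /* a small Hessenberg matrix */
        double c, s;
        generate(3, a, 0, &c, &s);
        applyrow(3, a, 0, c, s);
        printf("after applyrow, a[1][0] = %g\n", a[1][0]);   /* should be (near) 0 */
        return 0;
    }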
    for (iter = 1; !converged; iter++){

        if (iter >= k){
            if (k != 1)
                wait for R[lcol[k]] from P[k-1];
            else
                generate R[1];
            for (i = lcol[k]; i <= rcol[k]; i++){
                send R[i] to P[k+1];
                applyrow R[i];
                if (i != rcol[k]) generate R[i+1];
            }
        }

        botblk = min(iter, k-1);
        for (blk = botblk; blk >= 1; blk--)
            for (i = lcol[blk]; i <= rcol[blk]; i++){
                wait for R[i] from P[k-1];
                send R[i] to P[k+1];
                applyrow R[i];
            }

        if (iter >= k){
            for (j = lcol[k]; j <= rcol[k]; j++)
                applycol R[j];
        }
    }

    Fig. 7. QR-step on processor k.
For reasons that will become apparent later, the rotation R[lcol[k]] cannot be generated on the kth processor unless k = 1. Consequently, the program begins with a busy wait for that rotation. Synchronization of the passage of rotations is accomplished by the techniques described in Section 3.

The code next enters a loop in which the rotations for the (k, k)-block are applied, generated, and passed on to the next processor. Here we use the notation applyrow to indicate the application of a rotation across two rows on a processor. In the next loop, the processor waits for rotations from the processor to the left and applies them to its rows. The loop is organized by blocks. In the final loop, the processor applies its rotations to its own columns (applycol works the same way as applyrow). For all but the last rotation this is a routine operation. The last rotation manipulates elements on processor k + 1, which must apply its own rotation R[lcol[k+1]] before processor k can apply R[rcol[k]]. Consequently, processor k must await a signal from processor k + 1 that a[i][lcol[k+1]] is free before applying the rotation.

Processor k must compute the first rotation for processor k + 1 for the following reason. In Fig. 6, just before R[3] is applied to the columns, the part of the matrix around the (4,3)-element has the form

    a_32    a_33    a_34
     0       0      a_44
     0       0      a_54
After the rotation has been applied, it has the form

    a_32      a_33      a_34
     0      -σa_44     γa_44
     0      -σa_54     γa_54
where γ and σ are the cosine and sine that determine R[3]. Ordinarily, the rotation R[4] would be determined from the elements a_44 and a_54. However, if γ is zero, these elements cannot be recovered by processor k + 1. It is therefore necessary for processor k to compute the new R[lcol[k+1]] as it applies R[rcol[k]] to its columns.

Some idea of the complexity of the parallel QR-step can be obtained by considering the time taken by the last processor. Let ρ be the time required to move a rotation between two processors. Assume that the time required to apply t rotations to a t × t block is bounded by μt^2 and that this same bound applies to the generation step. Then the time taken by the last processor is approximately

    T_last = ρn + 2μpt^2,

where as usual p is the number of processors. Since t = n/p, we have

    T_last ≈ ρn + 2μn^2/p.

Since the time for the sequential algorithm is approximately μn^2, we see that we get optimal order speedup at medium granularity, provided ρ is comparable to μ. The CRAB architecture ensures that this will be approximately true.

Of course this sketch does not represent a complete implementation. In a real program we would want to use the implicit double shift. The fact that this algorithm works with 3 × 3 Householder transformations rather than 2 × 2 rotations complicates the algorithm but does not affect the broad outlines. The shifts, which are generated on the last processor, can be passed back to the first. Since all but the last processor are idle at some time, this should entail no additional cost. However, note that the shift κ_s is determined from the matrix A_{s-p+1}. An analysis in [1] for the case of a once deferred shift (p = 2) indicates that the convergence of the iteration will be slowed but remain superlinear.

An important property of the QR-iteration is that while it is converging rapidly to an eigenvalue, the subdiagonal elements of the matrix are also approaching zero, albeit linearly. Eventually these elements become small enough to ignore, and the problem breaks into smaller problems. Each processor can detect when this is happening to its own subdiagonal elements and start computing its own shifts for the deflated problems, which it passes to the left. Convergence occurs when all subdiagonal elements are negligible.

In many applications it is necessary to have the product of the rotations generated by the algorithm. These can be accumulated on each processor in the same way as they were applied to the columns of A. This accumulation will improve the load balance, since it adds a constant amount of work for each processor.
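The deflation test described in the last two paragraphs might look as follows in a sequential setting. The routine and its tolerance are not specified in the paper; the criterion below, which compares a subdiagonal element with its two neighboring diagonal elements, is simply a common choice (cf. [2]).

    #include <math.h>
    #include <float.h>

    /* Scan the subdiagonal of an n x n Hessenberg matrix and set negligible
       entries to zero.  Returns the number of entries deflated; each zeroed
       entry splits the eigenvalue problem into smaller independent problems. */
    int deflate(int n, double a[n][n])
    {
        int count = 0;
        for (int i = 1; i < n; i++) {
            double small = DBL_EPSILON * (fabs(a[i-1][i-1]) + fabs(a[i][i]));
            if (fabs(a[i][i-1]) <= small) {
                a[i][i-1] = 0.0;          /* the problem splits at row i */
                count++;
            }
        }
        return count;
    }

In the parallel setting each processor would apply such a test only to the subdiagonal entries of its own columns, which is what allows it to start generating shifts for its own deflated subproblem as described above.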
References

[1] R. van de Geijn, Unpublished manuscript, 1986.
[2] G.H. Golub and C. Van Loan, Matrix Computations (Johns Hopkins University Press, Baltimore, 1983).
[3] D.P. O'Leary, R. Pierson, G.W. Stewart and M. Weiser, The Maryland CRAB: A module for building parallel computers, University of Maryland Computer Science Technical Report 1660, 1986.