Computers & Structures Vol. 62, No. 1, pp. 185-195, 1997
Copyright © 1996 Elsevier Science Ltd. Printed in Great Britain. All rights reserved
0045-7949/97 $17.00 + 0.00
PII: S0045-7949(96)00282-9
EFFICIENT PARALLEL GAUSS-DOOLITTLE MATRIX TRIANGULATION

W. J. Watkins, D. Kennedy† and F. W. Williams

Division of Structural Engineering, Cardiff School of Engineering, University of Wales, Cardiff, PO Box 917, The Parade, Cardiff CF2 1XH, U.K.

(Received 18 July 1995)

†To whom correspondence should be addressed.
Abstract - An efficient parallel procedure for the triangulation of real symmetric (or complex Hermitian) matrices is presented. The methods of Gauss and Doolittle have previously been combined to produce a hybrid sequential method that was computationally faster than both. Numerical results demonstrate that this advantage is retained when MIMD (multiple instruction, multiple data) distributed memory parallel processing is employed, so that parallel Gauss-Doolittle triangulation is faster than the equivalently parallelized Gauss elimination method. Copyright © 1996 Elsevier Science Ltd.
1. INTRODUCTION

Many computational problems in science and engineering involve the reduction of real symmetric, or complex Hermitian, matrices to upper triangular form, for example to solve sets of linear equations or to find the eigenvalues of linear systems. Eigenvalue problems occur frequently in structural engineering and involve a large banded stiffness matrix K relating nodal forces and displacements. In the Wittrick-Williams method for the exact solution of transcendental eigenvalue problems, K is a function of an eigenparameter representing the frequency in undamped vibration problems [1] or the load factor in critical buckling problems [2]. Assembly and triangulation of K at successive trial values of the eigenparameter enables all required eigenvalues to be located with certainty using the "sign count" [1] and optionally [3] the determinant of K, both of which are readily obtained from the triangulated matrix K^Δ. The triangulation dominates the computation time even when advantage is taken [4] of repetitive geometry, substructuring and bandwidth reduction techniques. Two of the authors have previously [5] published a triangulation procedure which successfully combines the elimination methods of Gauss and Doolittle, to form a hybrid Gauss-Doolittle algorithm which is computationally faster than both of them.

Many papers have discussed parallel forms of existing matrix triangulation methods such as Gauss elimination [6] and Choleski decomposition [7, 8]. These and other parallel blocked algorithms [9] have generally been developed for MIMD (multiple instruction, multiple data) parallel computers with
shared memory. However, it is envisaged that future design engineers will make increasing use of low cost parallel computers, including relatively small distributed memory MIMD systems such as networks of workstations or transputers. This paper presents extensions to the Gauss-Doolittle algorithm which enable it to be run on such distributed memory parallel computers, using two or more processors simultaneously, without requiring vectorization. The matrix is initially distributed among the available processors, all of which work together to triangulate it, with the amount of communication between them kept to a minimum. Although the method is specifically designed for distributed memory MIMD parallel computers, a broadly analogous method is also applicable to shared memory MIMD machines.

The objective of developing a parallel form of the Gauss-Doolittle method is to reduce the overall solution time for large problems. It is also desirable that the parallel method be computationally efficient, i.e. that p processors can triangulate a matrix in close to 1/p times the time taken by one processor. This is not the case if processors spend an excessive time idle (e.g. waiting for messages from other processors) or doing work not needed by a sequential solution. The sending and receiving of messages is an almost inevitable cause of inefficiency in parallel processing, tending to dominate problems when the total computational effort is relatively small. Speedup and efficiency (defined in eqns (6) and (7) below) results, obtained using the Gauss-Doolittle method to reduce large matrices on two typical distributed memory parallel computers, demonstrate that the method can usually be parallelized efficiently, to utilize an appropriate number of processors which depends on the size and bandwidth of the matrix.
2. BRIEF REVIEW OF THE GAUSS-DOOLITTLE METHOD

Let K = {k_ij; i, j = 1, ..., n} be a real symmetric, non-singular, banded n x n matrix, with bandwidth m, as shown in Fig. 1. (The extension to complex Hermitian matrices is straightforward, but is omitted here for clarity.) K can be reduced to upper triangular form K^Δ in a number of ways. The simplest form of Gauss elimination is often used in structural analysis, numerical ill-conditioning being generally avoided without resort to scaling or row interchanges [10]. In this method, each row is pivotal in turn and updates elements in succeeding rows by a single correction term. A typical element k_ij is therefore corrected on (m - 1) separate occasions, by

k_{ij} = k_{ij} - (k_{qi}/k_{qq}) k_{qj}    (1)

where the index q = i - m + 1, ..., i - 1 denotes the pivotal row. This method, although costly in terms of memory accesses on the elements k_ij, has the advantage that a single multiplier (k_{qi}/k_{qq}) is used to update all the elements in row i due to row q being pivotal. In Doolittle's method, each row is again pivotal in turn, but its corrections are only applied to elements in a subsequent row immediately prior to that row being pivotal. Thus a typical element k_ij is corrected only once, by

k_{ij} = k_{ij} - \sum_{q=i-m+1}^{i-1} (k_{qi}/k_{qq}) k_{qj}    (2)

Fig. 1. Banded symmetric matrix of order n and bandwidth m (k_ij = k_ji); x denotes a non-zero element.

Compared with Gauss elimination, this method requires fewer memory accesses on the element k_ij, but requires additional looping operations over the multipliers (k_{qi}/k_{qq}) when updating each element.
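As an illustration of the two update orderings, the sketch below applies eqns (1) and (2) to the upper triangle of a symmetric matrix in Python/NumPy. It is a minimal dense (unbanded) sketch with illustrative names, not the paper's Fortran implementation.

import numpy as np

def gauss_step(U, q):
    """Eqn (1): with row q pivotal, one correction is applied to every
    succeeding row i, the single multiplier U[q, i]/U[q, q] updating the
    whole of row i (dense sketch; the band limit i <= q + m - 1 is omitted)."""
    n = U.shape[0]
    for i in range(q + 1, n):
        U[i, i:] -= (U[q, i] / U[q, q]) * U[q, i:]

def doolittle_row(U, i):
    """Eqn (2): row i gathers, in one pass immediately before it becomes
    pivotal, the corrections from all preceding (already reduced) rows q."""
    for q in range(i):  # q = i - m + 1, ..., i - 1 in the banded case
        U[i, i:] -= (U[q, i] / U[q, q]) * U[q, i:]

# Both orderings accumulate the same correction terms, so applying gauss_step
# for q = 0, ..., n-2, or doolittle_row for i = 1, ..., n-1, to U = np.triu(K)
# of a symmetric K produces the same upper triangular matrix.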
Fig. 2. Partitioning of a banded matrix into (n/v) blocks b_l, each containing v rows.
The hybrid Gauss-Doolittle method, referred to as GD(v), combines the advantages of both these methods, as follows. For some integer v (assumed for convenience to be a factor of n and m), let K be partitioned into (n/v) blocks of v rows, as shown in Fig. 2. The lth block b_l (l = 1, 2, ..., n/v) comprises the {(l - 1)v + 1}th, {(l - 1)v + 2}th, ..., (lv)th rows of K. Instead of one row being pivotal at a time, as in the Gauss method, a block of v rows is pivotal, each element in succeeding rows being corrected by Doolittle updating, using each of these v rows simultaneously. A typical element k_ij is therefore updated a maximum of (m - 1)/v times by elements outside its own block l, depending on the values of l and (j - i), by

k_{ij} = k_{ij} - \sum_{q=(l_p - 1)v + 1}^{l_p v} (k_{qi}/k_{qq}) k_{qj}    (3)

where l_p = l - (m/v) + 1, ..., l - 1 denotes the pivotal block. Each element is also updated once by a maximum of (v - 1) elements in the preceding rows of its own block. Generalizing this additional updating and that of eqn (3) to blocks, the upper triangular form of the lth block, b_l^Δ, is given by
b_l^Δ = b_l + \sum_{q=l-(m/v)+1}^{l-1} c_q^l    (4)

where c_q^l is the contribution, calculated by the Gauss-Doolittle method, of the v rows of the qth block to the lth block (q < l). GD(1) is equivalent to Gauss elimination. For v > 1, GD(v) was demonstrated [5] to be substantially faster than Gauss elimination, with v = 6 giving a 25% time saving over Gauss on a VAX-11/780 computer. The computational time was minimized by ensuring optimum use of the arithmetic registers for storage during the element updating operations of eqn (3).
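A minimal sequential sketch of the GD(v) ordering of eqns (3) and (4) is given below, again for a dense symmetric matrix in Python/NumPy; the banded storage, Fortran coding and register optimizations of the paper are omitted, and all names are illustrative.

import numpy as np

def gd_triangulate(K, v):
    """Sequential GD(v) sketch: blocks of v rows become pivotal in turn,
    and each block is corrected by every preceding pivotal block, applying
    its v rows together as in eqn (3). GD(1) reproduces Gauss elimination."""
    U = np.triu(np.asarray(K, dtype=float))    # work on the upper triangle
    n = U.shape[0]
    assert n % v == 0, "v is assumed to divide n, as in the paper"
    for l in range(n // v):                    # block l becomes pivotal
        lo, hi = l * v, (l + 1) * v
        for q0 in range(0, lo, v):             # each previous pivotal block
            piv = U[q0:q0 + v]                 # its v already-reduced rows
            d = piv[np.arange(v), np.arange(q0, q0 + v)]   # diagonals k_qq
            for i in range(lo, hi):
                mult = piv[:, i] / d           # the v multipliers k_qi / k_qq
                U[i, i:] -= mult @ piv[:, i:]  # single accumulated update, eqn (3)
        for i in range(lo, hi):                # corrections within block l itself
            for q in range(lo, i):
                U[i, i:] -= (U[q, i] / U[q, q]) * U[q, i:]
    return U

# Check that GD(v) reproduces the usual Gauss/Doolittle factorization K = L U:
rng = np.random.default_rng(1)
A = rng.standard_normal((12, 12))
K = A @ A.T + 12.0 * np.eye(12)                # symmetric, well conditioned
U = gd_triangulate(K, v=4)
L = np.tril(U.T / np.diag(U))                  # Doolittle multipliers, unit diagonal
assert np.allclose(L @ U, K)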
3. PARALLEL GAUSS-DOOLITTLE TRIANGULATION
The Gauss-Doolittle method has been implemented on distributed memory, MIMD parallel computers. The parallel software operates in a hostless manner, with each processor running the same matrix triangulation code. The processors are assumed to be able to form a ring configuration, so that each processor knows its own position within the ring and has associated with it two neighbouring processors, one of which sends data to it while the other receives data from it. Most modern distributed architectures would easily facilitate this configuration [11], which is illustrated for hypercube architectures in Fig. 3. Note that the connections between processors which are permitted by the hypercube configuration, but which are additional to the ring connections (see the d = 3 case in Fig. 3), are not used, making the method machine independent. The matrix K, which is assumed to have been calculated in advance, is initially distributed throughout the ring of processors.
Fig. 3. Architecture of hypercube configurations of dimension d (d = 0, p = 1; d = 1, p = 2; d = 2, p = 4; d = 3, p = 8). Arrows indicate the ring formed by the p processors and the flow of messages.
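The paper does not specify how the ring is embedded in the hypercube; one standard construction, sketched below as an assumption rather than the authors' scheme, is a binary-reflected Gray code ordering, in which consecutive processors always share a direct hypercube link.

def ring_order(d):
    """Binary-reflected Gray code ordering of the p = 2**d processors of a
    d-dimensional hypercube; consecutive entries differ in exactly one bit,
    so the ring of Fig. 3 uses only direct hypercube links."""
    return [i ^ (i >> 1) for i in range(2 ** d)]

def ring_neighbours(position, p):
    """Predecessor and successor of a given position in a p-processor ring."""
    return (position - 1) % p, (position + 1) % p

# ring_order(3) -> [0, 1, 3, 2, 6, 7, 5, 4]; each adjacent pair, and the
# wrap-around from 4 back to 0, differs in a single bit (a hypercube edge).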
This distribution is completely determined by n, v and p, the number of processors in the ring. Let K be partitioned into (n/v) blocks b_l (each of v rows) for Gauss-Doolittle reduction, see Fig. 2. These blocks are assembled into n x n matrices in each of the p processors such that the matrix in processor q (q = 1, 2, ..., p) is null except for blocks q, (p + q), (2p + q), ..., (n/v - p + q), which contain the corresponding blocks b_l of K, as illustrated in Fig. 4(a) for the case (n/v) = 8, p = 4. Note that the explanation has been simplified by assuming that n is an integer multiple of pv, but that the theory and results cover non-integer multiples as well. Note too that, to avoid unnecessary complication, the method is initially described as if the matrix were fully populated (i.e. m = n), and the illustrative example of Fig. 4 is further restricted to have n/pv = 2. The key differences caused by the matrix being banded are described later.

The triangulation of the matrix begins with block b_1 being made pivotal by processor 1. This processor first calculates c_1^1 and completes b_1^Δ from eqn (4). It next calculates c_1^2, c_1^3, ..., c_1^p, storing them in its blocks 2, 3, ..., p. Then it sends (the contents of) these blocks in a message to the communications buffer where the next processor in the ring, processor 2, can read them. The contents of the blocks in each processor are now as shown in Fig. 4(b).

Figure 4(c) shows what happens next. Processor 2 finds the message waiting for it in the communications buffer and accumulates the contributions c_1^2, ..., c_1^p from processor 1 in its blocks 2, 3, ..., p. Processor 2 is now working in parallel with processor 1, which is finding the contributions c_1^{p+1}, c_1^{p+2}, ..., c_1^{n/v} and accumulating them in its blocks p + 1, p + 2, ..., n/v. Simultaneously, processor 2 makes block b_2 pivotal. Thus it calculates c_2^2 and completes b_2^Δ in block 2, then accumulates the contributions c_2^3, c_2^4, ..., c_2^{p+1} in its blocks 3, 4, ..., p + 1. It then sends these blocks in a message to the communications buffer for processor 3. The contents of these blocks are (c_1^3 + c_2^3), (c_1^4 + c_2^4), ..., (c_1^p + c_2^p), c_2^{p+1}. This enables processor 3 to calculate c_3^3 and complete b_3^Δ, etc., while processor 2 calculates its contributions c_2^{p+2}, ..., c_2^{n/v} to its blocks p + 2, ..., n/v. (If n had been larger, processor 1 would now be calculating c_1^{2p+1}, ..., c_1^{n/v}.)
All the processors are finally active when processor p collects its first message from processor p - 1. Figure 4(d) shows that processor p then proceeds to form b_p^Δ and to accumulate the contributions c_p^{p+1}, ..., c_p^{2p-1} to its blocks p + 1, ..., 2p - 1, which it sends on to processor 1. (Processor p will then proceed to find the contributions c_p^{2p}, ..., c_p^{n/v} to its blocks 2p, ..., n/v.) When processor 1 has completed the accumulation of contributions due to its first pivotal block b_1, it reads the message from processor p, and then forms b_{p+1}^Δ by adding (b_{p+1} + c_1^{p+1}) from its block p + 1 to the accumulated contributions to block p + 1 from all other processors, i.e. to the (c_2^{p+1} + c_3^{p+1} + ... + c_p^{p+1}) contained in the message. Processor 1 then makes its block p + 1 pivotal, accumulating the contributions to all succeeding blocks and sending the contributions to blocks p + 2, ..., 2p on to processor 2 when they have been calculated.
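The round-robin distribution and pivoting order described above can be summarized by a small helper; the sketch below is illustrative (the function names are not from the paper) and reproduces the layout of Fig. 4(a).

def block_owner(l, p):
    """Processor (1-based) that initially stores, and later makes pivotal,
    block l (1-based) under the round-robin distribution described above."""
    return (l - 1) % p + 1

def distribution(n_over_v, p):
    """Blocks initially held by each processor, as in Fig. 4(a)."""
    return {q: [l for l in range(1, n_over_v + 1) if block_owner(l, p) == q]
            for q in range(1, p + 1)}

# distribution(8, 4) -> {1: [1, 5], 2: [2, 6], 3: [3, 7], 4: [4, 8]}, so the
# blocks are made pivotal in the order 1, 2, ..., 8 by processors
# 1, 2, 3, 4, 1, 2, 3, 4 respectively, matching Fig. 4.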
Fig. 4. Contents of the blocks in each of the p processors during parallel Gauss-Doolittle triangulation, for the case (n/v) = 8 and p = 4. (a) Initially; (b) as processor 2 becomes active; (c) as processor 3 becomes active; (d) as processor 4 sends its first message; (e) finally. Arrows denote the passage of a message containing the blocks in the solid box, to be added to those in the dashed box of the receiving processor.
Successive blocks are made pivotal by successive processors until the final triangulated matrix K^Δ has been formed. It is distributed over the p processors as illustrated in Fig. 4(e). Therefore K^Δ can be recovered by combining the pivotal blocks b_l^Δ from each of the processors. (Note that n/v >> p^2 for nearly all practical problems, and so processor 1 will not have to wait for the message from processor p. This contrasts with the very simple illustrative problem of Fig. 4, for which n/v = 8 and p^2 = 16, so that delay occurs. The implications of this delay are discussed in Section 4.)

The triangulation of a banded matrix with pv < m < n proceeds in a similar manner. The matrix is split up into n/v blocks of v rows and distributed throughout the processor ring in the same way as for the fully populated matrix, except that each processor stores only the banded region of the matrix in memory. The main differences between the two cases after triangulation begins concern the size of the messages passed between processors, and also the amount of calculation required to complete the contributions from each pivotal block after communication of p - 1 blocks to the next processor in the ring, as each pivotal block can contribute to at most m/v succeeding blocks.
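A possible reading of the banded case is sketched below, under the assumption that a transmitted block holds v rows of at most m stored entries (consistent with the message size mv(p - 1) quoted in the next section); the helper names are illustrative.

def blocks_in_message(m, v, p):
    """Succeeding blocks carried in one message: at most p - 1, but never
    more than the m/v blocks a pivotal block can reach within the bandwidth."""
    return min(p - 1, m // v)

def message_elements(m, v, p):
    """Approximate size of one message in matrix elements, assuming each
    transmitted block holds v rows of at most m stored (banded) entries."""
    return blocks_in_message(m, v, p) * v * m

# For the Table 1 problem (m = 350) with v = 4 and p = 4:
# message_elements(350, 4, 4) -> 3 * 4 * 350 = 4200 elements per message.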
4. COMMUNICATIONS OVERHEADS

It can be shown that the updating operations in the parallel Gauss-Doolittle method are identical to those of eqn (3) for the sequential method. As stated earlier, inefficiencies generally occur in the parallel method when any processor is idle or performing a task not required in the sequential method. All such tasks are associated with the communication between the processors.

The first overhead of this type occurs at the start of the triangulation, due to the fact that each processor (other than the first) can only start work when it has received a message from its predecessor. This time penalty increases with p for three reasons: (1) each processor must calculate (p - 1) contributions to succeeding blocks before it can send its first message; (2) the size of each message when the matrix is banded is mv(p - 1) elements; (3) (p - 1) such messages must be passed before all the processors are working simultaneously. Overheads also arise because the processors do not complete their work simultaneously. However, the time penalties described in this paragraph are predicted to have little impact on computational efficiency for large problems with n/v >> p^2.

Associated with each message, time may also be lost as the receiving processor unpacks data from the communications buffer. This communications overhead also increases with the total number of processors, but is again predicted to be small in comparison with the overall computation time for large problems.

A more serious overhead occurs when the time between a processor sending a message to its successor and receiving the next message from its
predecessor exceeds the time taken for that processor to complete all the calculations it can do before becoming idle. This occurs, in processor 1 for example, when the following condition is met (see Fig. 4):

\sum_{i=2}^{p} ( T_{r_i} + \sum_{l=i+1}^{i+p-1} T_{c_l}^{i} + T_{s_i} ) > \sum_{l=p+1}^{n/v} T_{c_l}^{1}    (5)
where T_{r_i} is the time taken for processor i to read a message from processor (i - 1), T_{s_i} is the time taken for processor i to send a message to processor (i + 1) and T_{c_l}^{i} is the time taken for processor i to calculate the contribution of its current pivotal block to block b_l. In travelling completely around the ring, p messages of size mv(p - 1) elements will be sent in total. This overhead therefore depends on the number of processors p, and on the block size v, which determine the message size and the amount of necessary calculation prior to sending messages. The total number of elements passed in messages is approximately nm(p - 1) for m < n.
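The idle-time condition can be explored with a crude timing model; the sketch below is only an illustration under the simplifying assumptions that every read, send and contribution takes a constant time and that the matrix is fully populated, and it is not the authors' analysis.

def ring_trip_time(p, t_read, t_send, t_contrib):
    """Left-hand side of eqn (5): time for the pipeline to return to
    processor 1, with each of the other p - 1 processors reading, computing
    its p - 1 contributions and sending (uniform illustrative timings)."""
    return (p - 1) * (t_read + (p - 1) * t_contrib + t_send)

def own_work_time(n_over_v, p, t_contrib):
    """Right-hand side of eqn (5): contributions processor 1 can still
    compute for its current pivotal block (blocks p + 1, ..., n/v)."""
    return max(0, n_over_v - p) * t_contrib

def processor_1_goes_idle(n_over_v, p, t_read, t_send, t_contrib):
    return ring_trip_time(p, t_read, t_send, t_contrib) > \
           own_work_time(n_over_v, p, t_contrib)

# For the Fig. 4 example, n/v = 8 and p = 4, so p**2 > n/v and the condition
# tends to be met even when communication is cheap, as noted in Section 3.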
5. OPTIMAL USE OF ARITHMETIC REGISTERS
Sequential or parallel Gauss-Doolittle triangulation of a matrix requires [5] precisely the same number of calculations as do sequential Gauss and Doolittle reduction. The savings in computation time arise from the ability of the Gauss-Doolittle method to make optimal use of the small amount of fast on-chip memory available. This memory usually comprises a limited number of arithmetic registers, in which frequently used variables can be stored and from which they can be recovered without the overhead of memory access. The most time consuming part of the Gauss-Doolittle method is the loop over the elements in a row, see eqn (3), in which each element is updated by contributions from v rows in a previous block. The address of the element k_ij being updated, as well as the addresses of the v modifying elements k_qj from the pivotal block, should be stored in these registers and be subject to automatic incrementation at each pass through the loop. The v multipliers (k_qi/k_qq) are invariant in the loop and should also be stored in registers, along with the loop counter, the loop limit, the temporary storage of the running total \sum_q (k_qi/k_qq) k_qj and the next product (k_qi/k_qq) k_qj to be added to this running total. Full optimization of the loop therefore requires enough registers to store (v + 3) integers and (v + 2) floating point numbers, e.g. for v = 4, 13 registers are required (or 19 if double precision arithmetic is used and floating point numbers each require two registers).
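The structure of the fully optimized inner loop for v = 4 can be sketched as follows; this is only a Python illustration of which quantities are hoisted out of the loop (the values the compiler or assembler code is expected to keep in registers), not the paper's Fortran or assembler.

def update_row_v4(row_i, pivot_rows, pivot_diag, i):
    """Eqn (3) with v = 4: the four multipliers are computed once outside
    the loop over j, and each element k_ij receives a single accumulated
    update. Plain Python lists; all names are illustrative."""
    q0, q1, q2, q3 = pivot_rows              # the v = 4 pivotal rows
    m0 = q0[i] / pivot_diag[0]               # invariant multipliers k_qi / k_qq
    m1 = q1[i] / pivot_diag[1]
    m2 = q2[i] / pivot_diag[2]
    m3 = q3[i] / pivot_diag[3]
    for j in range(i, len(row_i)):           # loop over the elements of row i
        total = m0 * q0[j] + m1 * q1[j] + m2 * q2[j] + m3 * q3[j]
        row_i[j] -= total                    # single update of k_ij
    return row_i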
The method has been coded in a parallel version of Fortran for execution on the nCUBE2 distributed memory computer. An optimizing compiler was used to produce executable code which was tested on a variety of matrices of different sizes, as reported in Section 6. It was observed in both the sequential and parallel cases that the solution time decreased with increasing v up to v = 4, after which it increased drastically. Examination of the assembler code produced for each v showed that the compiler successfully optimized the loop structure for v ≤ 4 but failed completely for v = 5 and v = 6. Writing in assembler it was possible to improve both these cases substantially. Each nCUBE2 processor has 16 arithmetic registers available, each capable of holding one integer or one double precision floating point number. Thus the case v = 5 requires 15 registers, and the main loop was fully optimized. With v = 6, 17 registers are ideally required, so the main loop could not be fully optimized, but nevertheless a substantial improvement on the Fortran code was possible. Higher values of v were not attempted.

The Gauss-Doolittle method has also been implemented on a more modern distributed memory parallel machine, the Transtech Paramid. The main processing chip in this computer is the Intel i860, which employs 32 numerical registers. In this case the compiler fully optimized the main loops for v = 1, ..., 8.

6. NUMERICAL RESULTS
The parallel software was tested for various values of v on an nCUBE2 distributed memory parallel computer. This machine has 32 processors connected in a hypercube configuration and permits the necessary ring structure, see Fig. 3. Processors can only be allocated in powers of 2 and a maximum of 8 were actually used. Therefore in obtaining test results for other numbers of processors, up to 3 of the allocated processors were not used for the calculations. The hypercube architecture is such that for rings with p = 3, 5 and 7 it was not possible to link every processor directly to both its neighbours. When this happened one of the redundant processors was used purely as a communications link. The difference between the time taken to pass a message in such cases and when all messages are passed directly between adjacent processors is negligible. This is because the start up time for a message between any two processors is 120 μs and far exceeds the additional 2 μs required for a message to pass through an extra processor.

The elapsed solution times for the reduction of a large, well conditioned, real, symmetric banded matrix with n = 840 and m = 350 (the elements of which were generated using a trigonometrical formula) are shown in Table 1 for values of v from 1 to 6 and values of p from 1 to 8.
Table 1. Computational requirements for Gauss-Doolittle triangulation of an 840 by 840 matrix with bandwidth m = 350, for v = 1 to 6 and p = 1 to 8 processors on an nCUBE2 parallel computer. Computation times are given in seconds, with the corresponding speedup and efficiency given beneath them. The best time for each number of processors is underlined and the best time for each v is followed by *

                                        Number of processors p
v       1           2           3           4           5           6           7           8
1       70.7        37.8        25.7        19.8        16.3        14.2        13.2*       15.3
        1.00, 100%  1.87, 94%   2.75, 92%   3.57, 89%   4.34, 87%   4.98, 83%   5.36, 77%   4.62, 58%
2       54.1        28.8        19.8        15.4        13.0        12.4*       14.6        16.8
        1.00, 100%  1.88, 94%   2.73, 91%   3.51, 88%   4.16, 83%   4.36, 73%   3.71, 53%   3.22, 40%
3       49.0        25.9        17.9        14.0        12.0*       13.6        16.0        18.4
        1.00, 100%  1.89, 95%   2.74, 91%   3.50, 88%   4.08, 82%   3.60, 60%   3.06, 44%   2.66, 33%
4       45.6        24.3        16.8        13.3        12.2*       14.7        17.3        19.8
        1.00, 100%  1.88, 94%   2.71, 90%   3.43, 86%   3.74, 75%   3.10, 52%   2.64, 38%   2.30, 29%
5       57.9        30.5        21.0        16.5        14.9        18.0        21.1        24.2
        1.00, 100%  1.90, 95%   2.76, 92%   3.51, 88%   3.89, 78%   3.22, 54%   2.74, 39%   2.40, 30%
6       57.0        30.5        20.7        16.3        16.2        19.6        22.9        26.2
        1.00, 100%  1.87, 93%   2.75, 92%   3.50, 87%   3.52, 70%   2.91, 48%   2.49, 36%   2.18, 27%
5^a     46.1        24.4        17.0        13.5*       13.6        16.4        19.1        21.9
        1.00, 100%  1.89, 94%   2.71, 90%   3.41, 85%   3.39, 68%   2.81, 47%   2.41, 34%   2.11, 26%
6^a     49.1        25.4        17.7        14.1*       14.9        17.9        21.0        24.0
        1.00, 100%  1.93, 97%   2.77, 92%   3.48, 87%   3.30, 66%   2.74, 46%   2.34, 33%   2.05, 26%

^a Manually optimized assembler code.
The table also gives values of speedup and efficiency, defined in the usual manner as

speedup = (solution time on one processor)/(solution time on p processors)    (6)

efficiency = (speedup/p) × 100%    (7)

Linear speedup represents the ideal case where a task on p processors takes 1/p times as long as one processor takes, so that the efficiency is 100%. The upper part of the table shows the times, speedups and efficiencies for the parallel code produced by the optimizing compiler. Each row gives results for one value of v, demonstrating good efficiencies for four or fewer processors. With larger numbers of processors, the overheads associated with communication, mentioned previously, eventually dominate the computation and the efficiencies drop. As expected, this occurs more rapidly with the higher values of v because of the greater amount of work done by each processor prior to the sending of messages. The lower part of the table gives the results for v = 5 and 6 when the assembler coding was optimized manually. These improvements bring the times for v = 5 and 6 into line with those for the other values of v. The efficiencies vary in a similar way to the equivalent efficiencies in the upper part of the table, but are all slightly lower, since optimization of the code reduces only the arithmetic processing time, leaving the communications time unchanged.
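Eqns (6) and (7) translate directly into code; the short sketch below also checks one entry of Table 1 (function names are illustrative).

def speedup(t1, tp):
    """Eqn (6): ratio of the single-processor to the p-processor solution time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Eqn (7): speedup divided by p, expressed as a percentage."""
    return 100.0 * speedup(t1, tp) / p

# Table 1, v = 4: t1 = 45.6 s and t4 = 13.3 s give
# speedup(45.6, 13.3) ~= 3.43 and efficiency(45.6, 13.3, 4) ~= 86%.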
Fig. 5. Plot of solution time against p for v = 5, for the matrix triangulation of Table 1: (a) with assembler alteration; (b) without assembler alteration.
Fig. 6. Solution time plotted against p and v, for the matrix triangulation of Table 1.
The times with and without this assembler alteration are plotted against p for v = 5 in Fig. 5. The solution times are also plotted in three-dimensional form against v and p in Fig. 6. It is seen that the best times (but not the best efficiencies) occur most often when v = 4 or 5. Figure 7 shows the variation of speedup with p for each value of v.

In order to investigate the loss of efficiency which occurs when more than four processors are used, diagnostic software was applied to the case v = 4 for different values of p. Running on a single processor, this software measured the solution time as the sum of two components: (i) calculation time and (ii) communication time, including idle time waiting for messages. The results, plotted in Fig. 8, show that the calculation time component was approximately inversely proportional to the number of processors, as required for linear speedup, but with a slight loss of efficiency when p > 1 due to the additional calculations associated with preparation and processing of messages. The communication time component was negligible for p ≤ 4, but for p ≥ 5 appeared to increase linearly as additional processors were added. This indicates that for p ≤ 4 there was no idle time, but for larger values of p the condition of eqn (5) was satisfied, so that processors were idle waiting for messages. This is probably because, as p increases, the idle time of each processor during the triangulation of p blocks, as estimated from eqn (5), is dominated by the sum of p(p - 1) terms T_{c_l}^{i}.

Figure 9 shows, for the case v = 4, the variation in speedup with number of processors for five matrices, all with approximately 330,000 non-zero elements on or above the diagonal, but with different order n and bandwidth m, with the ratio n/m varying between 1 and 16. All the matrices show good improvement for 2 and 3 processors, but when p = 4 the speedup for the n/m = 16 matrix starts to deteriorate markedly.
Fig. 7. Speedup against number of processors, for all v (using the assembler alteration for v = 5 and 6), for the matrix triangulation of Table 1.
Fig. 8. Breakdown of activity within a single processor when v = 4, for the matrix triangulation of Table 1. The total solution time (A) is shown along with its constituent parts, i.e. the time taken for (B) calculation and (C) communication with other processors.
When p = 5 both the n/m = 8 and the n/m = 4 matrices are losing speedup. The n/m = 1 matrix maintains good speedup until p = 7. The reason for these phenomena is as follows. As the ratio n/m increases, each processor performs less calculation in making one block pivotal. As the number of processors increases, the time interval between a processor writing data to the next processor in the ring and reading data from the previous processor in the ring increases. In order for the processor to remain active, it must have sufficient work to do. Equation (5) has the consequence that, for each bandwidth m, there is a limiting number of processors (p_max) beyond which the time taken for a processor to complete its calculations for one pivotal block is exceeded by the time taken for communication around the ring. Furthermore, this value of p_max increases with m, so that the bigger the ratio n/m, the smaller the number of processors p on which the method can run efficiently.

Figure 10 plots the solution times for the same five matrices, also for v = 4. An interesting observation is that for p > p_max the solution times for all the matrices are approximately equal and increase linearly with p. This confirms that each processor lies idle after making a block pivotal, being constrained by the communication time predicted by the left-hand side of eqn (5). As p increases, this is dominated by the sum of terms T_{c_l}^{i}, giving an overall triangulation time approximately proportional to the product of (p - 1) and the number of non-zero elements in the matrix. Therefore, since each matrix has approximately the same number of non-zero elements, their solution times are approximately equal and increase linearly with p.
Fig. 9. Speedup against number of processors, on the nCUBE2 parallel computer, for the triangulation of five banded matrices, each containing approximately 330,000 non-zero elements on or above the diagonal, but with different order n and bandwidth m.
Fig. 10. Solution time against number of processors, on the nCUBE2 parallel computer, for the triangulation of five banded matrices, each containing approximately 330,000 non-zero elements on or above the diagonal, but with different order n and bandwidth m.
A preliminary evaluation of the method has also been made on a 48 processor Transtech Paramid computer. This is a newer, much faster computer than the nCUBE2, although it is more suited to coarse grain parallel applications because its ratio of calculation speed to communication speed is much higher than that of the nCUBE2. Table 2 gives the solution times for the triangulation of a matrix with order n = 1680 and bandwidth m = 700, for v = 1-8 and with up to eight processors. This represents a much larger, but coarser grain problem than that used to obtain the nCUBE2 results of Table 1. The trends exhibited in Table 2 are generally similar to those in Table 1. The increased number of arithmetic registers (32) allows the updating loop to be fully optimized by the compiler. Consequently on up to three processors, when calculation dominates, the solution time decreases as v increases, until v = 7. For v = 8, this improvement is offset by the increased amount of communication as each block is pivotal, see eqn (5). For all values of v, the efficiency begins to fall off when four or more processors are used. When six or more processors are used, the solution time is dominated by communications and is virtually independent of v, as predicted at the end of Section 4.
7. CONCLUSIONS AND PLANS FOR FUTURE WORK
An efficient parallel form of the Gauss-Doolittle GD(v) matrix triangulation method has been presented. For any parallel computer, the size of matrix that can be efficiently reduced to upper triangular form by this method using a particular number of processors depends on the relative speed of inter-processor communications to that of arithmetic operations. The optimal value of v for any machine depends on the time taken for individual arithmetic operations, general memory access times and the number of arithmetic registers available.
Table 2. Computational requirements for Gauss-Doolittle triangulation of a 1680 by 1680 matrix with bandwidth m = 700, for v = 1 to 8 and p = 1 to 8 processors on a Transtech Paramid parallel computer. Computation times are given in seconds, with the corresponding speedup and efficiency given beneath them. The best time for each number of processors is underlined and the best time for each v is followed by *

                                        Number of processors p
v       1           2           3           4           5           6           7           8
1       86.9        44.6        30.4        24.9*       31.4        40.7        47.2        56.4
        1.00, 100%  1.95, 97%   2.86, 95%   3.49, 87%   2.77, 55%   2.14, 36%   1.84, 26%   1.54, 19%
2       60.7        31.1        21.7*       24.3        31.5        40.8        47.4        56.7
        1.00, 100%  1.95, 98%   2.80, 93%   2.50, 62%   1.93, 39%   1.49, 25%   1.28, 18%   1.07, 13%
3       53.3        27.5        19.4*       24.5        31.9        41.4        48.0        57.3
        1.00, 100%  1.94, 97%   2.75, 92%   2.18, 54%   1.67, 33%   1.29, 21%   1.11, 16%   0.93, 12%
4       47.5        24.6        17.6*       24.8        32.3        41.6        48.5        57.9
        1.00, 100%  1.93, 97%   2.70, 90%   1.92, 48%   1.47, 29%   1.14, 19%   0.98, 14%   0.82, 10%
5       47.0        24.2        17.4*       25.1        32.9        42.3        49.3        58.7
        1.00, 100%  1.94, 97%   2.70, 90%   1.87, 47%   1.43, 29%   1.11, 19%   0.95, 14%   0.80, 10%
6       45.0        23.3        16.9*       25.4        33.5        43.0        49.9        59.4
        1.00, 100%  1.93, 97%   2.66, 89%   1.77, 44%   1.34, 27%   1.05, 17%   0.90, 13%   0.76, 9%
7       44.9        23.2        16.9*       25.9        34.0        43.5        50.6        60.2
        1.00, 100%  1.94, 97%   2.66, 89%   1.73, 43%   1.32, 26%   1.03, 17%   0.89, 13%   0.75, 9%
8       47.1        24.4        17.7*       26.1        34.6        44.4        51.6        61.3
        1.00, 100%  1.93, 97%   2.66, 89%   1.76, 44%   1.36, 27%   1.06, 18%   0.91, 13%   0.77, 10%
The best value for v was found to be usually four on the nCUBE2 and six or seven on the Transtech Paramid. For comparison, the sequential GD(v) method [5] gave best results on the VAX with v = 6. The results on the nCUBE2 with v = 4 suggest that as many as five processors can be used efficiently for problems above a minimum size. On the Paramid, three is a more suitable maximum number of processors. For very small problems, however, communication times dominate and no parallelization is practical.

A similar parallel scheme could be applied to other matrix triangulation methods, such as Gauss-Jordan or Choleski decomposition. The extension to complex Hermitian matrices is straightforward, but would only achieve comparable time savings on computers having enough registers to allow the updating loops to be programmed efficiently. Because the GD(v) method works best when p is small, it is suitable for combining with coarse grain parallel methods, such as that developed by the authors [12], which conducts repeated triangulations by performing one on each processor and which also works best when p is small. Thus hybrid methods can be developed which use p1 p2 processors, where p1 is the number of coarse grain parallel operations, each of which uses p2 processors to perform its triangulation. The determination of p1 and p2 is problem specific and can be performed either by inspection, or by software which decides an intelligent initial distribution of tasks and then, if conditions alter during the run, redistributes them in a more efficient manner.

Acknowledgements - The authors gratefully acknowledge support from the Engineering and Physical Sciences Research Council under Grant number GR/H45490. They also thank the Department of Physics and Astronomy and
the Department of Computer Science at the University of Wales, Cardiff for allowing the use of their computing facilities.

REFERENCES
1. W. H. Wittrick and F. W. Williams, A general algorithm for computing natural frequencies of elastic structures. Q. J. Mech. appl. Math. 24, 263-284 (1971).
2. W. H. Wittrick and F. W. Williams, An algorithm for computing critical buckling loads of elastic structures. J. struct. Mech. 1, 497-518 (1973).
3. F. W. Williams and D. Kennedy, Reliable use of determinants to solve non-linear structural eigenvalue problems efficiently. Int. J. numer. Meth. Engng 26, 1825-1841 (1988).
4. M. S. Anderson and F. W. Williams, BUNVIS-RG: exact frame buckling and vibration program, with repetitive geometry and substructuring. J. Spacecraft 24, 353-361 (1987).
5. F. W. Williams and D. Kennedy, Fast Gauss-Doolittle matrix triangulation. Comput. Struct. 28, 143-148 (1988).
6. L. S. Chien and C. T. Sun, Parallel processing techniques for finite element analysis of nonlinear large truss structures. Comput. Struct. 31, 1023-1029 (1989).
7. O. O. Storaasli, D. T. Nguyen and T. K. Agarwal, Parallel-vector solution of large-scale structural analysis problems on supercomputers. AIAA J. 28, 1211-1216 (1990).
8. D. Zheng and T. Y. P. Chang, Parallel Cholesky method on MIMD with shared memory. Comput. Struct. 56, 25-38 (1995).
9. J. J. Dongarra, I. S. Duff, D. C. Sorensen and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia (1991).
10. R. K. Livesley, Matrix Methods of Structural Analysis. Pergamon Press, Oxford (1964).
11. K. N. Chiang and R. E. Fulton, Structural dynamics methods for concurrent processing computers. Comput. Struct. 36, 1031-1037 (1990).
12. W. J. Watkins, D. Kennedy and F. W. Williams, Efficient parallel solution of structural eigenvalue problems. Adv. Engng Software 25, 281-289 (1996).