JOURNAL
OF PARALLEL
AND DISTRIBUTED
13,325-33 I ( 1991)
COMPUTING
A Tridiagonal System Solver for Distributed Memory Parallel Processors with Vector Nodes CHRISTOPHER L. Cox * AND JAMES A. KNISELY + Department of Mathematical Sciences, Clemson University, Clemson, South Carolina 29634-l 907
A variant of the odd-even cyclic reduction algorithm for solving tridiagonal linear systems is presented. Of particular interest is the case where the number of equations is much larger than the number of processors. The target architecture for this scheme is a distributed-memory parallel computer with nodes which are vector processors. The partitioning of the matrix system is governed by a parameter which determines the amount of redundancy in computations and the amount of communication. One feature of the method is that computations are well balanced, as each processor executes an identical algebraic routine. A brief description of the standard cyclic reduction algorithm is given. Then a divide and conquer strategy is presented along with some estimates of speedup and efficiency. Finally, FORTRAN programs for this algorithm which run on the Intel iPSC/Z-VX and Floating Point Systems FPS T-20 computer are discussed along with experimental results. Of particular interest is the performance evaluation of this algorithm according to Gustafson’s concept of scaled speedup. o 1991 Academic PRSS, IIIC.
where we set al = c, = 0. Note that all the odd unknowns can be expressed as linear combinations of the even unknowns, so that the odd-numbered unknowns can be factored out, and the system is “reduced” to a system of half the size in the even-numbered unknowns. This reduction is repeated until one equation in one unknown is left. After that unknown is found, the other unknowns (i.e. the oddnumbered unknowns at each step) can be determined through backsubstitution. If we set al ‘) = ai, for all i (and similarly for the other coefficients), and assume for convenience that n = 2 k - 1 for some integer k, then the reduction phase is Algorithm a(,j+l) I blj+‘)
1. Reduction
= 6”’21 =
(A (A a2i c2i-l b”’ 21 1
b;)
1. INTRODUCTION Tridiagonal systems of the form ( 1) arise in many applications, e.g., numerical methods for differential equations [ 121. b, CI adw2
fl
x .
L
a!ff!i’l
21 1
i= l,...,-.
1) (A (I) C2i a2i+I b(j) 21+ 1
+j)
1
2r+l
ffik
n+
LAJ
1
1
1, j=
2J
(1)
l,...,
k-
1.
Then after setting xik’ = f ‘,k’/b\k’, the backsubstitution phase proceeds with
An accepted method for solving such a system on serial computers is the banded LU factorization followed by a forward solve and backsubstitution [2]. The recursive nature of this algorithm makes it inappropriate for vector or parallel processing. A suitable alternative in these environments is the cyclic reduction method [ 71, The algorithm is easy to derive if we consider ( 1) in equation form, i.e., aix,-l + bixi + cixi+l =J;,
fjJ+l)
(ifi# 1
Algorithm 2.
Backsubstitution
xC$ = x(j+l) i= l,...,-I 5 (j) -_ f$i, xZiml
i = 1, . . . , n,
2J
- a:i!,x:j12 - c$I,x$) 6”’21 1 i=
*
[email protected].
[email protected].
n+l
l,...,-
1 3
n+l 2J
j = k - 1, . . . , 1. 325
0743-73 I5/9 I $3.00 Copyright 0 195, I by Academic Press, Inc. All rights of reproduction in any form reserved.
326
COX
AND
KNISELY
From a serial operation perspective, the cyclic reduction al- for SIMD machines, with an operation count nearly identical gorithm has twice as many operations as the standard tri- to cyclic reduction [ 161. Our algorithm differs from those diagonal algorithm [ 81. However, with respect to the i index, mentioned above with regard to the partitioning strategy the cyclic reduction algorithm is inherently parallel. Thus used. Also, we evaluate performance based on scaled speedup. on a computer which can perform concurrent operations, cyclic reduction is preferable. For example, speedup com2. DIVIDE AND CONQUER STRATEGY parisons of the two algorithms on a Cray Supercomputer The task at hand is to solve a tridiagonal system of n equaindicate that as the dimension of the system increases, the tions on a distributed-memory, vector-node parallel procestridiagonal algorithm takes nearly 8 times as long as cyclic sor with p nodes where n 9 p. As much as possible, work reduction [ 15 1. Lambiotte and Voigt experimented with should be evenly balanced among processors, and commucyclic reduction on a CDC STAR- 100 computer [lo]. This nication time should not dominate computation time. Ortega method has also been analyzed for use on distributed-memand Voigt [ 13 ] point out that there are two main principles ory parallel computers. Johnsson formulated the algorithm by which parallel numerical algorithms can be classified: difor various configurations, including binary trees and n-cubes vide and conquer and reordering. Cyclic reduction is based [ 81. Hackney and Jesshope describe a version of cyclic reduction for solving an n-dimensional system on n processors on the latter principle. The target architecture for our problem motivates an algorithm that reflects both principles. The which keeps the level of parallelism constant at 12[ 71. The goal of this effort is to adapt cyclic reduction to a strategy we follow is to partition the system of equations into distributed-memory parallel processor whose nodes are vec- p local systems in such a way that the reduction and backtor processors. Vector arithmetic will be used to exploit fine- substitution phases of cyclic reduction can be performed on each local system independently of the others. The cost for grain, or statement-level parallelism. A “divide and conquer” approach will be used to distribute the work among proces- this independence is some computational redundancy. The sors. This strategy is described in Section 2. In Section 3, algorithm is formulated in terms of a parameter which govways of measuring performance are discussed and estimates erns the amount of redundancy and the cost of communicating the coefficients of a small linear system after the refor the efficiency of the divide and conquer cyclic reduction algorithm are given. In Section 4 we present results of nu- duction phase. Let each local system consist of n, equations and suppose merical experiments with FORTRAN codes on two vectorthat k - 1 reduction steps leave q equations on each processor node hypercubes, the Intel iPSC/ZVX and the Floating Point Systems FPS T-20. These codes are extensions of an with the systems on neighboring processors having one unoccam code, written for the symmetric case, which ran on known in common. The following proposition relates n to the FPS T-20 [ 3 ] . Cyclic reduction generalizes in an obvious the parameters n,,, q, and k. way to systems with a block tridiagonal coefficient matrix PROPOSITION 1. Suppose the tridiugonal system of n [ 6, 91. Our discussion will be limited to the scalar case. We equations is to be divided into p overlapping local systems assume that the coefficient matrix is diagonally dominant, each with n, equations. Zf so that the cyclic reduction algorithm will not break down [6]. n, = (q + 1)2k-1 - 1, (2) In addition to the works already mentioned, we point the reader to four recent papers which are similar in nature to where q and k are integers, and ours. A general approach for parallelizing any tridiagonal solver was presented by Mtiller [ 111. A lower bound on efn, - 4 ficiency of 33% is given, and experimental results yield efn=pn,-(p1) q+l ( 1 ficiencies of 45 to 70%. Egecioglu et al. implemented a tridiagonal solver based on recursive doubling [4]. The dimension of the largest linear system which was solved is then k - 1 reduction steps leaves each processor with q equations in q + 2 unknowns where systems on neighboring pro8 192. The best measured efficiency listed is 39%, though they claim results between 50 and 60% were obtained. As cessors share one unknown. the authors point out, there are some questions about the Proof: From a global perspective, the system of n equanumerical stability of the recursive doubling algorithm. Three tions in n unknowns is reduced to a system of pq equations parallel algorithms for solving tridiagonal systems are pre- in pq unknowns. It is not difficult to show using induction sented in the paper by Sun et al. [ 141. A variant of recursive that this occurs in k - 1 steps if n = 2!+‘(pq + 1) - 1. doubling is used here, so there are some unanswered stability Combining Eqs. (2) and ( 3) yields this value. n questions. One advantage of their approach over ours and the others mentioned is that some pivoting can be incorIn practice, n and p are given and the parameters k and q porated, if necessary. Wang proposed a tridiagonal solver, must be determined so that n < 2k-’ (pq + 1) - 1. Just as
PARALLEL/VECTOR
TRIDIAGONAL
327
SOLVER
it can be done according to the rule that processor i contains one assumes that n = 2k - 1 for standard cyclic reduction, we will assume for convenience that n = 2k-’ (pq + 1) - 1 entries of the solution vector globally indexed by for some integers k, p , and q . As a simple example, consider thecasewheren= 19,p=2,q=2,andnp= ll.Figurel 2 [q(n, + l)] + 1 through L; 3 ; [drip + 1)l + %. illustrates the partitioning of the coefficient matrix for this case. The right-hand side is distributed in the same way. After the coefficients are distributed, each processor will 3. PERFORMANCE ESTIMATES contain a linear system of the form The best-known measures for evaluating the performance of a parallel algorithm are speedup and efficiency. Speedup a(l’x(” + b!l’x!l’ + c(l)x(l) (1) I r+l-fi , i=l,..., n,. I II 1 I is normally defined as S, = T,/ Tp, where T, is the CPU time of solving a problem of dimension n on one processor Note that on the first processor al” = 0 and on the pth and Tp is the CPU time for solving the same problem on p processor c,( ’ ) = 0 . Thus each processor holds the coefficients processors. Efficiency is then defined as E,, = SJp . Following of n, equations in n, + 2 unknowns. .Algorithm. 1 is then Gustafson et al. [ 51, consider T1 normalized to 1 with T, applied, with the modification that a{‘+” and c~$~),~, are = s + c, where s is the amount of time spent on the serial computed. After k - 1 reduction steps each local system has part of the program and c the time spent on that part of the the form program which contains operations which could be performed concurrently. An often-quoted (and possibly disa!k’x(k’ + b(k’r!W + C(k’X(W 1+1_-fi (k) , i= I ,..., q. I II l-r I couraging) rule in this context is Amdahl’s law [ 11, which notes that speedup is then The global system has been reduced to a system of pq equations in pq unknowns, with each processor holding q equasp = I c’ tions. One could send all of the equations to one processor, .Y+solve for the pq unknowns, and redistribute those values. P The latter communication step is unnecessary if, through an all-to-all broadcast, all processors obtain the global reduced which approaches 1 /s as p gets large. This law seems even system. Each processor then solves the same pqxpq system more foreboding for algorithms such as ours because if one to get the values of the q + 2 unknowns associated with its includes in s the time spent in communication, s is larger in local reduced system. Then starting with q equations in q the multiprocessor case than for a single processor. f 2 unknowns, Algorithm 2 is performed on each processor. Gustafson et al. argue that since, in practice, the size of If one wishes to gather all solution elements to one location, the problem solved scales with the number of processors, a more practical question to ask is How long would a problem which runs on p processors take to run on 1 processor? They propose that in the scaled problem size approach two aspects ‘1 C2 of efficiency should be addressed. First, one should consider ‘2 b2 3 the operation eficiency factor, defined as a b c 3
3
3
a4 b4 c4
Processor
a5 b5 5
1
a6 b6 ‘6
vp =
a7 b7 C7
bs% ‘a9b9 C9
a 10 bc10
10 a,,b,,cll
operation count to solve size n problem on 1 processor ’ operation count to solve size n problem on p processors
This factor reflects extra operations which are done in a parallel algorithm, but does not account for the loss in efficiency due to communication. The apparent eficiency factor is an indicator of inefficiency due to communication, and we use Gustafson’s definition [ 5 ] that
a12blA Y3%3 a14b14C14 alSb15c15 a,6b16c16 a17b,7c,7
a 18b
18c 18
a19 b19
FIG.l.
Partitioningforn=
19,p=2,q=2,nP=
11.
cYp=
compute time total time ’
Clearly, (Yeis a machine-dependent
factor, and in the next
328
COX AND KNISELY
section we estimate (Yefor our algorithm on two machines based on numerical experiments. Operation efficiency is an algorithmic factor, and for the divide and conquer cyclic reduction algorithm we have the following lower bound for v~. PROPOSITION2. Assume for the divide and conquer cyclic reduction algorithm that the operation count for the intermediate solve of a pq X pq system is negligible in comparison to the number of operations in the reduction and backsubstitution phases. Then the operation eficiency factor satisfies the inequality VP a-----
L
4
q+ 1’
n, =
From (3) we have
(4+
l)n+(l P4+1
-PI4 .
Assuming the number of operations to solve a system of m equations using cyclic reduction is proportional to m, we have n(p9 + 1)
qp=L PnP
P[(4+ l)n+(l
~ PC!+1 p(q+l)~q+l.
-PI41
4 .
Table I lists values of vp for various combinations
of p, q,
TABLE I
Values of qp as Parameters p and q Vary 4 P
4
k
nP
n
9n
4 4 4 8 8 8
12
8191
11 10
8191 8191
12
8191
11 10
8191 8191
12
8191
26,623 29,695 31,231 51,199 58,367 61,951 100,351
16
3 I 15 3 7 15 3 7
16
15
11 10
8191 8191
115,711 123,391
0.811 0.903 0.946 0.779 0.885 0.932 0.761 0.871 0.915
16
4
8
16
Number
Remark. The assumption made in this proposition is, essentially, that n, is at least an order of magnitude larger than pg. In the proof we also assume that the number of operations in solving a system of m equations using cyclic reduction on one processor is equal to a constant times m. (Johnsson [ 8 ] notes that a rigorous count actually gives the number of operations as 17m - 18 log*m + 2). Proof of Proposition 2.
; 2
4+ 0.750 0.875 0.938 0.750 0.875 0.938 0.750 0.875 0.938
1
32
64
128
256
512
’
of Processors
FIG. 2. Number of processors vsestimatedoperationefficiency.
and k which correspond to problem sizes which we have used in the numerical experiments discussed in Section 4. We calculated qD according to n ” = An, + P4) so that the operations in the intermediate solve of a reduced system were also accounted for. For comparison we list q/ (q + 1) in each case. Equation (4) is a worst case estimate of vp in the sense that it assumes that the intermediate solve is done redundantly on all processors to avoid a communication step, and that the intermediate solve is done using cyclic reduction. Note that the cases with q = 1 were not included in Table I. When q = 1 the greatest overlap in equations between neighboring processors occurs. In fact, from Eq. (3), when q = 1 and p is large pn, N 2n. Thus one would expect the operation efficiency factor to be about 0.50. Figure 2 illustrates the effect of the value of q on qp as the number of processors increases beyond the size of the machines to which we have access. The value of n, is 8 19 1 throughout. As p increases, the efficiency decreases less dramatically with the smaller value of q . The graph highlights an obvious alteration that one would make in the algorithm if p is large. Namely, do not solve the reduced system on every processor if the size of that system is of the same order of magnitude as n,. From another perspective, the graph provides information as to how many processors one would use to solve a problem of given size efficiently. 4. EXPERIMENTAL
RESULTS
The divide and conquer cyclic reduction algorithm described in Section 2 was coded in FORTRAN for numerical experiments on two vector-node hypercubes at Clemson University. The 16 nodes of the Intel iPSC/2-VX contain
PARALLEL/VECTOR
TRIDIAGONAL
329
SOLVER
TABLE II Standard Performance Measures for FPS T-20 Computations Time
P
I 2 4 8
8191 4095 2041 1023
(s)
n
Reduction and backsub.
Comm.
Solve
Total
Speedup
8191 7935 7807 7143
0.588 0.309 0.181 0.114
0.008 0.011 0.015
0.039 0.054 0.068
0.588 0.356 0.246 0.197
1 .oo 1.65 2.39 2.98
Intel 80386 processors plus 80387 math coprocessors. Eight of these nodes have a VX processor which performs vector arithmetic. The millisecond clock on this machine was imprecise for timings of single vector calls, so each time listed is based on an average of 1000 calls. We also report results for the FPS T-20 hypercube. Each of the 16 nodes has an Inmos T4 14 transputer plus a vector processing unit. There was not a substantial difference between the two machines with respect to times for internode communication and vector arithmetic. The major difference was in how fast data are moved around memory, for example in scatter-gather operations. The FPS-T took about 10 times as long as the iPSC/Z to rearrange data within each node. This accounts for the disparity between timings on the two machines. We first calculated efficiency and speedup in the standard way, i.e., comparing multiprocessor times to uniprocessor times. In Table II, timings are listed for four configurations of the FPS T-20 computer with q set (for the multiprocessor cases) at 15. Note that as the number of processors increases the global system size decreases slightly. We calculated efficiency and speedup, based on average total time per node, as if the sizes are uniform throughout. Analogous results for the iPSC/ 2 computer are given in Table III. Four times are listed for each configuration. The total time is composed of the times for reduction and backsubstitution, communication (one all-to-all broadcast), and the intermediate solve. Because scalar floating point operations on the T-20 are performed in software, which is very slow, the intermediate solve was done by cyclic reduction using the vector units. The
Efficiency 1 .oo 0.83 0.60 0.37
same approach was used on the iPSC/2 for this step, though in the future it may be more efficient to use the standard tridiagonal algorithm with the math coprocessors. The reduction and backsubstitution time includes the time for scatter-gather operations. On the iPSC/2, the ratio of vector operation time to scatter-gather time was at least 4 to 1, while on the T-20, the within-memory moves dominated vector arithmetic. In both Tables II and III, efficiency decreases drastically as the number of processors increases. Clearly, this is happening because the serial portion of the algorithm dominates as p gets large, confirming Amdahl’s law. One aspect of the scaled speedup approach (operation efficiency) for this algorithm was discussed in Section 3. Now estimates of the apparent efficient factor, CQ,,are given based on numerical experiments. Table IV is a listing of values of CY,,for 8 and 16 processorsexperiments on the FPS T-20. Table V contains eight processor results for the iPSC/2. It would be misleading to point out that ‘ypfor the FPS-T computations is higher in some cases than that for the iPSC/2 without noting the total time for each. On average, the total run time on the T-20 was 3 times as long as that on iPSC/ 2. Graphs summarizing the results listed in Tables IV and V are shown in Fig. 3. 5. CONCLUSIONS
AND
FURTHER
WORK
In this paper we have described an algorithm for solving tridiagonal linear systems on a parallel computer with vector
TABLE III Standard Performance Measures for iPSC/2 Computations Time
(s)
P
np
n
Reduction and backsub.
Comm.
Solve
Total
Speedup
1 2 4 8
8191 4095 2047 1023
8191 7935 7807 7743
0.192 0.092 0.053 0.029
0.009 0.018 0.030
0.0054 0.0073 0.0098
0.193 0.113 0.077 0.069
1.00 1.71 2.50 2.80
Efficiency I .oo 0.86 0.62 0.35
330
COX AND KNISELY
TABLE IV Apparent Efficiency Factor for FPS T-20 Computations Time (s) P
8
16
4
k
np
n
Total
Compute
Ok
3 6 12 15 3 7 15
11 10 9 9 11 10 9
8191 7167 6655 8191 8191 8191 8191
51,199 50,175 49,663 61,951 100,35 1 115,711 123,391
0.626 0.574 0.55 1 0.635 0.667 0.685 0.678
0.593 0.539 0.513 0.597 0.602 0.607 0.620
0.95 0.94 0.93 0.94 0.90 0.89 0.91
(a) (W
processing nodes. What distinguishes this algorithm from FIG. 3. qvs+.(a)p= 8onFPST-20,(b)p= 16onFPST-20,(c) others is the partitioning strategy. The work is evenly bal- p = 8 on iPSC/2. anced among processors, and some redundancy in computation minimizes communication. The algorithm is simple in its derivation and implementation, and for a given probThere are a number of areas for further study regarding lem size and number of processors one additional parameter this algorithm. The availability of fast scalar arithmetic (such controls both the partitioning of terms in the linear system as on the iPSC/2) warrants the consideration of a hybrid and the amount of overhead due to parallelism. algorithm, where the reduction phase terminates early, once Performance was evaluated in terms of the fixed-size and vector lengths reach a certain minimum size, and a fast scalar the scaled-size efficiency factors. Following Gustafson’s as- algorithm takes over. The intermediate solve of the reduced sertion that the latter approach is most practical, we base global system is another candidate for a scalar algorithm. our main conclusions about the effectiveness of the algorithm An obvious next step in this research is the extension of the on the scaled-size results. For the machines on which we algorithm to block tridiagonal systems. performed numerical experiments, the algorithm was shown to have a favorable operation efficiency factor, between 0.75 6. ACKNOWLEDGMENTS and 0.95 depending on the choice of the parameter q that governs redundancy. An estimate for operation efficiency which depends only on q was given, along with conditions The authors gratefully acknowledge the helpful comments and additional under which this estimate is reasonable. The apparent effi- references provided by the reviewers. ciency factor, based on computer experiments, was between 0.87 and 0.95. Using these results, the algorithm can be REFERENCES “tuned” by using the parameter which governs redundancy and communication in order to balance computation time 1. Amdahl, G. Validity of the single-processor approach to achieving largewith communication overhead. For the computers which scale computer capabilities. AFIPS Conf: Proc. 30 ( 1967), 483-485. were used for the numerical experiments presented here, this 2. Burden, R. L., and Faires, J. D. Numerical Analysis. PWS-Kent, Boston, parameter can be set so that operation efficiency and apparent 1989. efficiency are each close to 90%.
TABLE V Apparent Efficiency Factor for iPSC/Z Computations Time (s) P
4
k
np
n
Total
Compute
8
3 6 12 15
11 10 9 9
8191 7167 6655 8191
51,199 50,175 49,663 61,951
0.207 0.190 0.189 0.229
0.196 0.175 0.164 0.198
(Ye 0.95 0.92 0.87 0.87
3. Cox, C. L. A cyclic reduction algorithm for a parallel processor with vector nodes. Proc. Third Conference on Hypercube Concurrent Computers andApplications. ACM Press, 1988, pp. 1532-1538. 4. Egecioght, O., Koc, C. K., and Iaub, A. J. A recursive doubling algorithm for solution of tridiagonal systems on hypercube multiprocessors. J. Comput. Appl. Math. 27 (1989), 95-108. 5. Gustafson, J. L., Montry, G. R., and Benner, R. E. Development of parallel methods for a 1024-processor hypercube. SIAM J. Sci. Statist. Comput. 9, 4 (July 1988), 609-638. 6. Heller, D. Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems. SIAM J. Numer. Anal. 13 ( 1976), 484-496. 7. Hackney, R., and Jesshope, C. Parallel Computers: Architecture, Programming and Algorithms. Hilger, Bristol, 198 1. 8. Johnsson, L. Odd-even cyclic reduction on ensemble architectures and
PARALLEL/VECTOR the solution of tridiagonal systemsof equations. Dept. Computer Science Rep. YALEU/CSD/RR-339, Yale University, New Haven, CT, 1984. 9. Kershaw, D. Solution of single tridiagonal linear systems and vectorization of the ICCG algorithm on the CRAY- 1. In Rodrigue, G. (Ed.). Parallel Computations. Academic Press, New York, 1982. 10. Lambiotte, J., and Voigt, R. The solution of tridiagonal linear systems on the CDC STAR-100 computer. ACM Trans. Math. Software 1 (1975), 308-329. 11. Miiller. S. M., A method to parallelize tridiagonal solvers. Proc. Fifth Conference on Distributed Memory Computing. Apr. 8-l 2, 1990. 12. Ortega, J. M., and Poole, W. G. An Introduction to Numerical Methods fir D#krential Equations. Pitman, Marshfield, MA, 198 1. 13. Ortega, J., and Voigt, R. Solution of partial differential equations on vector and parallel computers. SIAM Rev. 27 ( 1985 ) , 149-240. 14. Sun, X. H., Zhang, H., and Ni, L. M. Parallel algorithms for solution of tridiagonal systemson multicomputers. Proc. 1989 ACM International Conference on Supercomputing, Crete, Greece, June 1989. 15. Vu, P., and Yang, C. Comparing tridiagonal solvers on the CRAY XMP/416 system. Cray Channels (Winter 1988), 22-25. Received March 15, 1990; accepted October 5, 1990
TRIDIAGONAL
SOLVER
331
16. Wang, H. H. A parallel method for tridiagonal equations. ACM Trans. Math. Software 7 (June 1981), 170-183.
CHRISTOPHER L. COX is an associate professor in the Department of Mathematical Sciences at Clemson University. He has published research papers in the areas of finite element methods for fluid dynamics problems, numerical linear algebra, and parallel processing algorithms. He received a B.S. degree in mathematics from Grove City College, and M.S. and Ph.D. degrees from Carnegie-Mellon University. His thesis research was supported by NASA, and funding for recent work has come from industry and the U.S. Department of Energy. Professor Cox is a member of the American Mathematical Society and the Society for Industrial and Applied Mathematics. JAMES A. KNISELY is a Ph.D. student in the Department of Mathematical Sciences at Clemson University. His current research area is graph theory. He received his B.S. degree from Bob Jones University, and an M.S. degree from Clemson University. Mr. Knisely is a member of SIAM.