Computers & Structures, Vol. 20, No. 1-3, pp. 99-105, 1985. Pergamon Press Ltd.

DIRECT FINITE ELEMENT EQUATION SOLVING ALGORITHMS†

R. J. MELOSH and SENOL UTKU
Civil and Environmental Engineering Department, Duke University, Durham, North Carolina, U.S.A.

MOKTAR SALAMA
Jet Propulsion Laboratory, Pasadena, California, U.S.A.

Abstract-This paper presents and examines direct solution algorithms for the linear simultaneous equations that arise when finite element models represent an engineering system. It identifies the mathematical processing of four solution methods and assesses their data processing implications using concurrent processing.

† This report represents part of the work completed under contract No. 956940 of the Jet Propulsion Laboratory.
1. INTRODUCTION

Selection of the best equation solving strategy for a concurrent processing computer depends on the problem, the computing hardware, the solving algorithm, and the processing goal. The problem of interest here is solution of the linear simultaneous equations that associate with finite element modeling of engineering systems. Thus, we wish to solve the equations

    K u = p,                                                    (1)
where K is a symmetric nonsingular and usually positive definite matrix, u is the vector of unknown responses, and p is the vector of imposed “loads.” The stiffness matrix is sparsely populated and equations and unknowns can always be ordered and the matrix partitioned so that all nonzero coefficients fall in the diagonal and first off-diagonal partitions. Thus, the stiffness matrix can always be arranged as
    K = | K_11  K_12                                      |
        | K_21  K_22  K_23                                |
        |       K_32  K_33  K_34                          |
        |             ...   ...   ...                     |
        |             K_{r-1,r-2}  K_{r-1,r-1}  K_{r-1,r} |
        |                          K_{r,r-1}    K_{r,r}   |      (2)

with all undesignated partitions null. The sparseness of the non-null partitions of eqn (2) depends on the discretization of the system. Though the stiffness partitions, particularly the K_ij, i ≠ j, get increasingly sparse, we can always organize the equations so that all the K_ii are square
matrices of the same order. Accordingly, we regard n as the order of the row partitions of the stiffness matrix and N as the total number of equations.

The finite element equation solving problem is cast into the form of eqns (1) and (2) whether the physical system is modeled conceptually as a linear or nonlinear elliptic, parabolic, or hyperbolic partial differential equation. The form evolves directly for the linear elliptic equation. The stiffness matrix plays the role of the time integration operator when rate and acceleration effects arise. By the Choleski-based decomposition, the eigenvector problem is also transformed into equations of this form, given an eigenvalue. In nonlinear problems, where stiffness models are predicated on a variational principle, the step-by-step linearized equations also take the form of eqns (1) and (2).

The computer hardware architecture and configuration details represent constraints on the solution strategy. Utku [1] describes a general mathematical model which serves as background for the current study. This model is based on use of a number of processors which may or may not have equal capabilities and which can be addressed either simultaneously or sequentially. The characteristics of a computer are particularized by a limited number of directly measurable parameters in this model.

With these considerations, the discussion centers on solution algorithms. By definition, solution algorithms consist of mathematical processing and data processing rules. Math processing decisions involve the essential operations of equation solving. We define these as the sequence in which matrix partitions are dealt with and the form sought in these operations. Data processing defines the detailed sequence of calculations, data allocation to storage, and the flow of data during solution. Together, mathematical processing and data processing rules define every step of the implementing computer program.

We note that the best solution strategy depends on the basis used to cast the problem into the form
of eqns (1) and (2). Thus the selection of finite element modeling over finite differences, variational formulation over boundary integral, potential energy over equilibrium principle, and, in the case of nonlinear analysis, a Lagrangian reference frame over an Eulerian one, affects computational efficiency. Furthermore, the modeling policies (regular or irregular gridding, global or local remeshing, h or p models) have an impact on efficiency.

This paper describes four solution algorithms. The next section presents schema for the mathematical processing of these direct methods. Section 3 provides study conclusions.
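To fix ideas, the block tridiagonal arrangement of eqn (2) can be sketched in a few lines of Python. The block values below are random placeholders chosen only to illustrate the partitioning and symmetry pattern, not a real stiffness matrix.

    import numpy as np
    from scipy import sparse

    n, r = 4, 6                           # block order n and number of block rows r (illustrative)
    rng = np.random.default_rng(0)

    def sym_block(n):
        a = rng.standard_normal((n, n))
        return a + a.T                    # symmetric placeholder for a diagonal partition K_ii

    K_diag  = [sparse.csr_matrix(sym_block(n)) for _ in range(r)]                       # K_ii
    K_super = [sparse.csr_matrix(rng.standard_normal((n, n))) for _ in range(r - 1)]    # K_{i,i+1}

    blocks = [[None] * r for _ in range(r)]       # None = null (undesignated) partition
    for i in range(r):
        blocks[i][i] = K_diag[i]
        if i + 1 < r:
            blocks[i][i + 1] = K_super[i]
            blocks[i + 1][i] = K_super[i].T       # symmetry: K_{i+1,i} = K_{i,i+1}^T
    K = sparse.bmat(blocks, format="csr")         # N = r*n equations, block tridiagonal as in eqn (2)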
2. MATHEMATICAL PROCESSING FOR DIRECT METHODS
Direct solution methods involve performing a sequence of matrix premultiplications to eliminate the coupling terms between equations. Thus, we premultiply eqn (1):

    A_a A_{a-1} ... A_1 K u = A_a A_{a-1} ... A_1 p,            (3)
where the A_i are nonsingular non-Boolean antecedent matrices and a has a minimum value of 1 and a maximum of 3Nn - 2n². In some cases, it is advantageous to make successive variable changes to solve the equations, i.e. we let

    u = B_1 B_2 ... B_b v,                                      (4)

where the B_i are nonsingular postmultiplier matrices. Thus, in general, a direct method consists of some mixture of antecedent and post-multiplication matrices to eliminate coupling terms. (The mathematical literature refers to this generalization of direct methods of linear equation solving as Aitken's method.) Since variable changes encumber the calculations with those required to transform the solution for v of the modified equations back to the original basis, processes which involve post-multiplication are ignored in the sequel.

One of two goals is involved in the choice of particularization of the antecedent matrices. We may seek the decomposition of the coefficient matrix, or we may seek to transform this matrix to an identity matrix. In the first class of processing, we want to represent the equations in the form

    F S T u = p,                                                (5)
where F is the first decomposition factor, S is the second decomposition factor, and T is the third decomposition factor. We choose the form of the factors so that we can deal with the equations one-by-one to find the unknowns by substitutions. Thus we successively solve the equations

    F w = p   for w,
    S v = w   for v,

and

    T u = v   for u.
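As a minimal illustration of this three-factor substitution pattern, the sketch below uses a Choleski factorization (one of the decomposition methods named next), taking F = L, S = I, and T = L^T for a symmetric positive definite K; the small test matrix is a placeholder.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    K = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])               # small symmetric positive definite example
    p = np.array([1.0, 2.0, 3.0])

    L = cholesky(K, lower=True)                   # K = L L^T, so F = L, S = I, T = L^T
    w = solve_triangular(L, p, lower=True)        # F w = p, solved for w
    v = w                                         # S v = w is trivial with S = I
    u = solve_triangular(L.T, v, lower=False)     # T u = v, solved for u

    assert np.allclose(K @ u, p)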
Aitken's "below the line method," Crout's method, Choleski's method, cyclic reduction [2], and the WZ method of Evans and Hatzopoulos [3] are members of this processing class. Of these methods, Gauss elimination and cyclic reduction merit further examination in the context of concurrent processors [4].

When the goal of mathematical processing is transformation of the coefficient matrix to the identity matrix, we may construct the inverse matrix or the product of the inverse by the p vector. Since the number of calculations for the substitutions and the extra premultiplications to change from the decomposed form to the identity matrix are the same, and, when the computer has infinite precision, produce the same result, the decision to do substitutions or extra premultiplications is a data processing decision. Accordingly, this study subsequently disregards inverse methods. Aitken's pivotal condensation, Gauss-Jordan reduction, and Morris' escalator method are members of the product-of-the-inverse-times-p processing class. But these methods inherently require about twice as many calculations as those of the decomposition without enhancing parallel processing opportunities.

Thus, we will examine and compare the mathematical processing implications of Gauss elimination and cyclic reduction. We will also introduce transfer reduction.

2.1 Block Gauss elimination

Table 1 defines the mathematical processing which associates with Gauss equation solution when N/n = 9. Here K denotes a fully-populated (square) partition; U, an upper triangular partition; and 0, a null partition created during the processing stage. In this example, there are ten stages of processing.

The result of the first stage is the explicit representation of the stiffness matrix. The processing which produces these data is a significant portion of the total processing time. Since this "preamble" processing is common to all the equation solving methods, it can be neglected in comparing these methods.

Stage two processing results in triangularizing the diagonal partition of the first row partition and annihilating the coupling partition between the second row partition and the first. The processing of stages three through nine is similar to that of stage two.
Table 1. Gauss elimination stages (stage 1, 2, 3, and 10 results). [Block-pattern diagrams showing the K, U, and 0 partitions at each stage; not reproduced here.] Subscript ij denotes the ith row, jth column partition; superscript kl denotes k multiplications by a triangular matrix and l subtractions of a triple matrix product.
The diagonal partition of row i is triangularized and row i decoupled from row i + 1. The result at the end of stage 10 is the upper triangular decomposition of the total stiffness matrix.

Because the arithmetic operations for triangularization can be viewed as matrix multiplications, the number of operations is a cubic polynomial in n. For comparing operations of methods for problems where n >> 1, we can consider only the cubic terms of the polynomial. Furthermore, since the operations consist of a sequence of vector dot products, we can satisfactorily consider the number of additions equal to the number of multiplications and thus define an operation to be one multiplication and one addition. Accordingly, the number of operations is given in Table 2 for each block transformation. Calculations to form the antecedent triangular matrix are neglected since they are of order n².

Table 2. Block arithmetic operation counts (symmetry exploited)

    Step                                                        Operations
    Triangularize a diagonal partition (K_ii -> U_ii)           n³/6
    Multiply an off-diagonal partition by a triangular matrix   n³/2
    Subtract a triple matrix product from a partition           n³

The Gauss process results in a symmetrical or triangular diagonal partition in every stage of the process. This fact is implied in the arithmetic operation counts above, in particular in the operations to produce the updated diagonal partitions. It can also be exploited to save storage locations for stage results.

The need to transfer data from one storage location to another during the solution process leads to a second measure of the computer resources required to support a particular set of math processing rules. Accordingly, we assume the stiffness matrix
has so many nonzero terms that it resides in secondary storage at the end of stage one. We measure data transfers by the total number of accesses to the blocks and the total number of data items transferred, assuming fully-populated partitions. New data required as input to a stage of calculations and output blocks needed for the substitutions are considered. Transfers for intermediate (erasable) results are counted only when between parallel processors of a stage. Under these rules, a transfer of a diagonal partition involves one access and n(n + 1)/2 ≈ n²/2 items of data, when symmetry is exploited. Transfer of an off-diagonal partition involves one access and n² items of data.

Opportunities for parallel computation occur in stages two through ten. These associate with the option to perform the vector operations of a matrix multiplication concurrently or of operating independently on several processors with pairs of dot product coefficients.
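To make the block operations counted in Table 2 concrete, here is a minimal serial numpy sketch of block forward elimination and back substitution on a block tridiagonal system. The block lists diag, upper, and lower (diag[i] = K_ii, upper[i] = K_{i,i+1}, lower[i] = K_{i+1,i}) are hypothetical inputs, and the sketch does not reproduce the authors' storage layout or processor assignment.

    import numpy as np

    def block_gauss_solve(diag, upper, lower, b):
        """Solve a block tridiagonal system K u = p by block Gauss elimination.
        diag[i] = K_ii, upper[i] = K_{i,i+1}, lower[i] = K_{i+1,i}; b[i] is the
        i-th block of p. Returns the list of solution blocks."""
        r = len(diag)
        D = [Di.copy() for Di in diag]
        f = [bi.copy() for bi in b]
        for i in range(r - 1):
            # multiplier block = K_{i+1,i} K_ii^{-1} (a triangular solve in practice)
            Li = np.linalg.solve(D[i].T, lower[i].T).T
            D[i + 1] -= Li @ upper[i]          # triple matrix product subtraction (Table 2)
            f[i + 1] -= Li @ f[i]              # same update applied to the load blocks
        u = [None] * r
        u[r - 1] = np.linalg.solve(D[r - 1], f[r - 1])
        for i in range(r - 2, -1, -1):         # block back substitution
            u[i] = np.linalg.solve(D[i], f[i] - upper[i] @ u[i + 1])
        return u

    # quick check against the assembled dense system, random blocks of order 3
    rng = np.random.default_rng(1)
    n, r = 3, 4
    diag = [5 * np.eye(n) + rng.standard_normal((n, n)) for _ in range(r)]
    upper = [rng.standard_normal((n, n)) for _ in range(r - 1)]
    lower = [rng.standard_normal((n, n)) for _ in range(r - 1)]
    b = [rng.standard_normal(n) for _ in range(r)]
    u = np.concatenate(block_gauss_solve(diag, upper, lower, b))

    K = np.zeros((r * n, r * n))
    for i in range(r):
        K[i*n:(i+1)*n, i*n:(i+1)*n] = diag[i]
        if i + 1 < r:
            K[i*n:(i+1)*n, (i+1)*n:(i+2)*n] = upper[i]
            K[(i+1)*n:(i+2)*n, i*n:(i+1)*n] = lower[i]
    assert np.allclose(K @ u, np.concatenate(b))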
2.2 Block cyclic reduction

Table 3 defines the mathematical processing for cyclic reduction when N/n = 9. The notation is the same as used to represent block Gauss elimination. In this example, there are five stages of processing.

In the second stage, the coupling terms are eliminated between odd-numbered row partitions. This results in transforming the first five row partitions into partitions of the triangular decomposition matrix. The last four row partitions become uncoupled from the first five. In addition, the last four row-column partitions assume the tridiagonal form if reordered with this interest.
Table 3. Cyclic reduction stages (stage 1, 2, 3, and 5 results). [Block-pattern diagrams showing the K, U, and 0 partitions at each stage; not reproduced here.] Subscript ij denotes the ith row, jth column partition; superscript kl denotes k multiplications by a triangular matrix and l subtractions of a triple matrix product.
Processing in stages three through five repeats this pattern. In each stage, INT[(R + 1)/2] rows are triangularized, where R is the remaining number of untriangularized row partitions.

Again, for approximating the number of arithmetic operations we regard calculations as matrix operations, ignore operations less than cubic in n, and consider one multiply and one addition to be an operation. The cyclic reduction process results in symmetrical or triangular diagonal partitions in every stage of the process. Thus, the arithmetic counts for block transformations are given by those of Table 2. Furthermore, storage allocation policies may save data transfers by exploiting symmetry of the total matrix and the diagonal blocks just as in Gauss elimination. Opportunities for parallel computation in stages two through five occur at the block, vector, and coefficient-pairs levels.

Significant economy accrues using cyclic reduction if the stiffness matrix is regular. We define a regular stiffness matrix as one in which

    K_ii = K_jj   for any i and j,

and

    K_ij = K_pq   for i < j, p < q, and K_ij non-null.

Then cyclic reduction preserves this regularity. Thus, in stage two of Table 3, all U partitions will be the same, all updated diagonal partitions will be the same, all updated off-diagonal partitions will be the same, and the remaining coupling partitions will be equal to or the transpose of each other. This means that the block parallelism results in the same number of stages, but the calculations of each stage need only involve three row partitions.
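The following is a minimal serial numpy sketch of block cyclic reduction applied recursively to a block tridiagonal system. It follows the textbook even/odd elimination pattern rather than the exact staging of Table 3, and the block lists D, L, U (with None marking absent couplings) and b are hypothetical inputs.

    import numpy as np

    def block_cyclic_reduction(D, L, U, b):
        """Solve a block tridiagonal system by recursive block cyclic reduction.
        Row i reads: L[i] x[i-1] + D[i] x[i] + U[i] x[i+1] = b[i], with L[0] and
        U[-1] equal to None. Returns the list of solution blocks x[i]."""
        m = len(D)
        if m == 1:
            return [np.linalg.solve(D[0], b[0])]
        Dn, Ln, Un, bn = [], [], [], []
        for i in range(0, m, 2):                  # keep these rows, eliminate their odd neighbours
            Di, bi, Li, Ui = D[i].copy(), b[i].copy(), None, None
            if i - 1 >= 0:                        # fold in the neighbour above
                A = np.linalg.solve(D[i - 1].T, L[i].T).T      # L[i] D[i-1]^{-1}
                Di -= A @ U[i - 1]
                bi -= A @ b[i - 1]
                Li = -A @ L[i - 1]                # new coupling to row i-2
            if i + 1 < m:                         # fold in the neighbour below
                B = np.linalg.solve(D[i + 1].T, U[i].T).T      # U[i] D[i+1]^{-1}
                Di -= B @ L[i + 1]
                bi -= B @ b[i + 1]
                if i + 2 < m:
                    Ui = -B @ U[i + 1]            # new coupling to row i+2
            Dn.append(Di); Ln.append(Li); Un.append(Ui); bn.append(bi)
        x_kept = block_cyclic_reduction(Dn, Ln, Un, bn)        # half-size system, same form
        x = [None] * m
        x[::2] = x_kept
        for i in range(1, m, 2):                  # back-substitute the eliminated rows
            rhs = b[i] - L[i] @ x[i - 1]
            if i + 1 < m:
                rhs = rhs - U[i] @ x[i + 1]
            x[i] = np.linalg.solve(D[i], rhs)
        return x

    # tiny check: m = 5 block rows of order n = 2, diagonally dominant diagonal blocks
    rng = np.random.default_rng(3)
    m, n = 5, 2
    D = [4 * np.eye(n) + rng.standard_normal((n, n)) for _ in range(m)]
    U = [rng.standard_normal((n, n)) for _ in range(m - 1)] + [None]
    L = [None] + [rng.standard_normal((n, n)) for _ in range(m - 1)]
    b = [rng.standard_normal(n) for _ in range(m)]
    x = block_cyclic_reduction(D, L, U, b)
    for i in range(m):
        resid = D[i] @ x[i] - b[i]
        if L[i] is not None:
            resid = resid + L[i] @ x[i - 1]
        if U[i] is not None:
            resid = resid + U[i] @ x[i + 1]
        assert np.allclose(resid, 0, atol=1e-8)

For a regular matrix the D, L, and U blocks repeat, so each level need only form one distinct set of updated blocks, which is the economy noted above.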
2.3 Block transfer reduction
Table 4 defines the mathematical processing for transfer reduction when N/n = 9. This process is a specialization of the wave front transform method [5] and a generalization of the transfer matrix method of Leckie and Pestel [6]. In this example, there are nine stages of processing.

In stage two, each row partition is multiplied by an upper triangular matrix which triangularizes the off-diagonal partition. In the third stage, this row partition is multiplied by square matrices and subtracted from the second and third row partitions to annihilate column-two coupling with the first row. Superdiagonal blocks are assumed nonsingular.

Processing in stages four through nine repeats this pattern. In each stage, row premultiplication is performed to annihilate coupling with subsequent row partitions. Coupling with the first partition persists until completion of the last stage. This results in n equations in the first column partition of unknowns, which are solved for the unknowns by triangularizing the (9,1) partition.

Assuming that n >> 1 and that an arithmetic operation involves one multiplication and one addition, we arrive at the number of operations for each matrix operation of Table 5. Note that symmetry of the diagonal partitions is not preserved. This not only results in more calculations, but doubles the data items of transfer for diagonal partitions.

Regularity reduces the number of calculations in each stage. In stage two, a typical row partition may represent all rows. In stages three through nine, only one of the K_ij transformation operations results in distinct results.
Table 4. Transfer reduction stages. [Block-pattern diagrams of the K, L, and U partitions at successive stages; not reproduced here.]
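The transfer matrix recursion that this method generalizes can be sketched directly: each block row with a nonsingular superdiagonal partition expresses the next partition of unknowns affinely in terms of the first, and the last block row then supplies n equations in that first partition alone. The serial numpy sketch below illustrates this recursion under the stated nonsingularity assumption; it is not the authors' block staging, and diag, upper, lower, and b are hypothetical block lists as before.

    import numpy as np

    def transfer_solve(diag, upper, lower, b):
        """Solve a block tridiagonal system by propagating u_i = A[i] @ u_0 + c[i]
        along the block rows (transfer matrix recursion), then closing the chain
        with the last block row to obtain n equations for the first partition u_0."""
        r, n = len(diag), diag[0].shape[0]
        A = [np.eye(n)]
        c = [np.zeros(n)]
        # first block row: K_00 u_0 + K_01 u_1 = b_0
        A.append(np.linalg.solve(upper[0], -diag[0] @ A[0]))
        c.append(np.linalg.solve(upper[0], b[0] - diag[0] @ c[0]))
        # interior rows: K_{i,i-1} u_{i-1} + K_ii u_i + K_{i,i+1} u_{i+1} = b_i
        for i in range(1, r - 1):
            A.append(np.linalg.solve(upper[i], -(lower[i - 1] @ A[i - 1] + diag[i] @ A[i])))
            c.append(np.linalg.solve(upper[i], b[i] - lower[i - 1] @ c[i - 1] - diag[i] @ c[i]))
        # last block row closes n equations in u_0 alone
        M = lower[r - 2] @ A[r - 2] + diag[r - 1] @ A[r - 1]
        g = b[r - 1] - lower[r - 2] @ c[r - 2] - diag[r - 1] @ c[r - 1]
        u0 = np.linalg.solve(M, g)
        return [A[i] @ u0 + c[i] for i in range(r)]

The persistence of coupling with the first partition, noted above, appears here as the accumulated A[i] factors carried to the last stage.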
Opportunities for parallel computation occur in stages two through nine at the block, vector, and coefficient-pairs levels.

2.4 Block parallel transfer reduction

To illustrate the possibilities of parallel computation in transfer reduction, we next describe a process which maximizes parallel processing on the block level using transfer reduction. Table 6 indicates the results of stage arithmetic in parallel transfer reduction. There are just four stages of processing. Stage two processing is the same as in transfer reduction. Stage three produces all the left-hand column partitions. Stage four develops the triangular representation of K_91.

To effect the changes of stage three, we employ the distributive and associative properties of matrix arithmetic to cast the calculation of the final values of the left-hand column partitions in a form more suitable for concurrent processing. Suppose, for example, we wish to evaluate the final value of K_61 in the transfer reduction method. Expressing this result in matrix notation following stages three through six of Table 4, and assuming the L matrices are reduced to identity matrices, yields K_61 as a nested expression in the L and K partitions of those stages [eqn (6); not legibly reproduced here]. Distributing the matrix multipliers over addition then results in 13 terms to be added to form K_61. The minimum number of matrix products for a term is two and the maximum is five.
We can assign a processor to each term of the sum. We can also increase parallel computation using the associative property of matrix multiplication. For example, if a term involves the product of four matrices, we assign two processors to the term, multiplying the first two and the last two matrices together in separate processors, and then taking the product of these matrix results to evaluate the term. This is "pyramiding."

When N/n is nine, K_91 involves 55 terms with from four to eight matrix products per term. If we wish to maximize parallelism in block computations, we would assign 141 processors to the task. The number of processors needed for maximizing parallelism in the regular matrix case is the same as for the irregular case. What happens is that fewer arithmetic operations are needed because of the repetition of factors in the terms. The opportunity to reduce arithmetic operations using the Cayley-Hamilton theorem also arises when n < N/n [6].

Between the little parallelism in stage three of transfer reduction and the extreme parallelism in stage three of block parallel transfer reduction lie a number of other intermediate math processing methods. Of all of these, transfer reduction requires the least number of arithmetic operations and data transfers, and parallel transfer reduction the most.
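A minimal Python sketch of this term-per-processor assignment and pyramiding structure follows; the helper names are hypothetical, and the thread pool merely stands in for separate processors (numpy matrix multiplication releases the GIL, so the term evaluations can genuinely overlap).

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def pyramid_product(chain):
        """Evaluate a chain of matrix products by pairing neighbours at each level.
        Executed serially here, but each pair at a level is independent, so a level
        could be spread over separate processors with depth ceil(log2(len(chain)))."""
        while len(chain) > 1:
            chain = [chain[i] @ chain[i + 1] if i + 1 < len(chain) else chain[i]
                     for i in range(0, len(chain), 2)]
        return chain[0]

    def sum_of_chain_products(terms, workers=4):
        """Evaluate a sum of terms, each a chain of matrix products, assigning
        one worker (standing in for one processor) per term."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(pyramid_product, terms))
        return sum(partials)       # the final additions could themselves be pyramided

    # example: 13 terms of two to five random n x n factors each
    rng = np.random.default_rng(2)
    n = 8
    terms = [[rng.standard_normal((n, n)) for _ in range(rng.integers(2, 6))]
             for _ in range(13)]
    result = sum_of_chain_products(terms)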
2.5 A comparison of mathematical processing methods
Table 5. Block arithmetic operation counts (nonsymmetric case)

    Step                                                        Operations
    Triangularize a diagonal partition (K_ii -> U_ii)           n³/6
    Multiply a partition by a triangular matrix                 n³/2
    Subtract a triple matrix product from a partition           3n³/2
Table 6. Results of stage arithmetic in parallel transfer reduction (stage 1, 2, 3, and 4 results). [Block-pattern diagrams; not reproduced here.]
Table 7. Computer arithmetic needs†

    Problem‡        Gauss elimination   Cyclic reduction   Transfer reduction   Parallel transfer reduction
    Nonsymmetric    105                 123                116                  1971
    Symmetric        81                  96                116                  1971
    Regular          81                  57                 67                  1184

† Arithmetic operations times 6/n³, N/n = 9.
‡ Block tridiagonal matrix with n constant.
Table 8. Computer arithmetic time†

    Problem‡        Gauss elimination   Cyclic reduction   Transfer reduction   Parallel transfer reduction
    Nonsymmetric    105                  54                 53                   26
    Symmetric        81                  45                 53                   26
    Regular          81                  45                 53                   26

† Minimum serial arithmetic operations times 6/n³, N/n = 9.
‡ Block tridiagonal matrix with n constant.
Table 9. Data transfers for equation solution†

    Problem         Gauss elimination   Cyclic reduction   Transfer reduction
    Nonsymmetric    (45, 31.5n²)        (77, 53n²)
    Symmetric       (36, 27n²)          (77, 53n²)
    Regular         (20, ...)           (8, 8n²)

† (a, b) = total number of files transferred, total number of data items transferred.
The math processing method selected defines the minimum number of calculations and the maximum number of stages. Thus comparison of methods, based on calculation counts, affords a rough measure of the computer resources and wall clock time for each method.

Accordingly, we choose the total number of arithmetic operations for the decomposition as the measure of the computer resources needed. This assumption implies that idle processors will be committed to other jobs or, equivalently, that the computer is a serial processor. Table 7 lists the number of arithmetic operations for the three classes of problems. These data demonstrate that the lowest resource needs for the general symmetric block tridiagonal matrix associate with Gauss elimination. Even were the tridiagonal matrices nonsymmetric, Gauss elimination is the clear winner. If the matrix is regular, however, cyclic reduction requires the least computer resources. These conclusions are valid as long as N/n > 2.

The fact that Gauss elimination exploits symmetry of the diagonal block to reduce arithmetic operations and avoids the penalties due to fill-in in originally null blocks is the basis for its advantage in the general case. Conversely, the facility of cyclic reduction to take advantage of matrix regularity to reduce arithmetic operations makes cyclic reduction the best for regular matrices.

To measure mathematical method computer wall clock time, we choose the number of arithmetic operations which must be performed serially. This is the sum of the maximum number of operations of each stage, under the assumption that block matrix operations are allocated to as many processors as can be effectively used in the stage. Table 8 cites the value of this measure for nine cases. These data are the synthesis of the data of Tables 1 through 5. They indicate the block-level concurrent processing superiority of parallel transfer reduction over the other math processes, for all three problem classes.

Table 9 notes the number of block accesses and items of data transferred for the cases of Tables 1, 3 and 4. We note that the number of data items transferred is of the order of 1/n of the number of arithmetic operations. Furthermore, the number of accesses and items transferred increases as redundant
transfers are needed to exploit parallel processing. Gauss elimination is best for symmetric matrices and cyclic reduction best when the matrix is regular. These data suggest that the reductions in computer arithmetic time from parallel computation on the block level are small (i.e. far from a factor of 1/p, where p is the number of processors) and are accompanied by relatively large increases in computer resource needs.
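As a rough check on the size of these reductions, the sketch below compares the best serial operation count of Table 7 with the best block-level critical-path count of Table 8 for each problem class; the resulting ratios of roughly two to four are consistent with the factor of about 4 cited in conclusion (2).

    # operation counts from Tables 7 and 8 (units of n^3/6, N/n = 9)
    total_ops = {      # Table 7: total arithmetic operations (serial resource measure)
        "Nonsymmetric": {"Gauss": 105, "Cyclic": 123, "Transfer": 116, "Parallel transfer": 1971},
        "Symmetric":    {"Gauss": 81,  "Cyclic": 96,  "Transfer": 116, "Parallel transfer": 1971},
        "Regular":      {"Gauss": 81,  "Cyclic": 57,  "Transfer": 67,  "Parallel transfer": 1184},
    }
    serial_path = {    # Table 8: minimum serial (critical path) operations with block parallelism
        "Nonsymmetric": {"Gauss": 105, "Cyclic": 54, "Transfer": 53, "Parallel transfer": 26},
        "Symmetric":    {"Gauss": 81,  "Cyclic": 45, "Transfer": 53, "Parallel transfer": 26},
        "Regular":      {"Gauss": 81,  "Cyclic": 45, "Transfer": 53, "Parallel transfer": 26},
    }

    for problem in total_ops:
        best_serial = min(total_ops[problem].values())
        best_parallel = min(serial_path[problem].values())
        print(f"{problem}: block-level time reduction factor ~ {best_serial / best_parallel:.1f}")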
3. CONCLUSIONS

This study of Gauss elimination, cyclic reduction, transfer reduction, and parallel transfer reduction at the block level leads to the following conclusions:

(1) Gauss elimination is best for irregular matrices and cyclic reduction best for regular matrices when optimization is with respect to the total number of arithmetic operations. Parallel transfer reduction is best when optimization is with respect to the total computer arithmetic time.

(2) Ideally, using P concurrent processors reduces computer time to a factor of 1/P times its value using one processor. Data on cyclic reduction, transfer reduction, and parallel transfer reduction suggest that the constraint of equation solving limits the factor to about 4, due to increasing arithmetic operation and data transfer penalties as P increases.

REFERENCES
1. S. Utku, A mathematical characterization of concurrent processing: Machines, jobs, and strategems. JPL Report, July 1983.
2. D. Heller, Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems. SIAM J. Numer. Anal. 13, 484-495 (1978).
3. D. Evans and M. Hatzopoulos, A parallel linear system solver. Int. J. Comput. Math. Sect. B 7, 227-238 (1979).
4. M. Salama, S. Utku and R. Melosh, Parallel solution of finite element equations. 8th Conference on Electronic Computation, ASCE, Houston, Feb. 21-23, 1983.
5. R. J. Melosh, Transformation for solving linear equations. ASCE Struct. Div. J. 6, 1289-1302 (1977).
6. E. C. Pestel and F. A. Leckie, Matrix Methods of Elastomechanics. McGraw-Hill, New York (1963).