Parallel algorithms for the solution of a sparse system

Computing Systems in Engineering, Vol. 4, Nos 4-6, pp. 505-511, 1993. Elsevier Science Ltd. Printed in Great Britain. 0956-0521/93 $6.00 + 0.00

PARALLEL ALGORITHMS FOR THE SOLUTION OF A SPARSE SYSTEM
JENN-CHING LUO
Department of Civil Engineering, Columbia University, New York, NY 10027, U.S.A.

Abstract--This paper explores the effectiveness of the author's decomposition in a parallel environment. The author's decomposition is a new class of direct method, which factorizes a nonsingular matrix into triangular matrices such that the product of the obtained triangular matrices is the inverse of the original matrix. This new technique provides an efficient procedure and a new feature in structural analysis. Both vectorization and parallelization are considered in this work. Performance in solving large-scale sparse systems and comparisons with the well-known type of direct method, Gaussian elimination, are reported. Examples of application to domain decomposition are also included.

NOMENCLATURE

[A]:       a nonsingular matrix
[A]^{-1}:  the inverse of [A]
[A_{IJ}]:  the Ith submatrix in the Jth block column of [A]
A_{ij}:    the ith entry in the jth column of [A]
{B}:       the right-side vector of a system of linear equations
[D]:       a diagonal matrix
[I]:       the identity matrix
[L]:       a lower triangular matrix
[L~]:      a normalized lower triangular matrix
S_i:       the ith temporary variable
[U]:       an upper triangular matrix
{X}:       a vector of unknowns to be determined
{X_i}:     the ith subvector in {X}

INTRODUCTION

Many engineering problems written in differential equations or in integral equations may be discretized by the finite element/difference method, resulting in algebraic systems with a sparse matrix. This kind of sparse system is an important branch of scientific and engineering computing, and the development of efficient procedures for the solution of sparse systems has become more and more important. The main difference between a sparse solver and a general solver is that a sparse solver does not operate on zero fill-ins. Both iterative and direct methods may be applied to sparse systems. However, nonlinear structural analyses, e.g. buckling, usually need to deal with unloading processes. An unloading process is equivalent to an indefinite algebraic system, which cannot be solved by the classic iterative methods, e.g. Gauss-Seidel iteration, the Jacobi method and the conjugate gradient method. For the purpose of general application, the direct methods, which provide unconditional convergence, are the most popular in nonlinear analyses. Gaussian elimination is the best-known direct method and has been widely applied to structural analyses. Some faster alternatives to Gaussian elimination, i.e. substructuring, band solvers, skyline solvers and sparse solvers (Ref. 1), have been proposed to deal with sparse systems. Some parallel alternatives to Gaussian elimination, for example Refs 2-6, are also available for structural analyses. This paper studies the parallelism of another kind of direct method derived by the author. The author's decomposition (Refs 7, 8) provides a special feature: it inverts a nonsingular matrix [A] directly into [A]^{-1}. The inverse of [A] is written as

    [A]^{-1} = [L][L]^T       if [A] is symmetric and definite;
    [A]^{-1} = [L][D][L]^T    if [A] is symmetric and indefinite;
    [A]^{-1} = [L][D][U]      if [A] is asymmetric.
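As a small numerical check of this definition (an example constructed here for illustration, not taken from the paper), consider a symmetric and indefinite 2 x 2 matrix and the corresponding [L][D][L]^T form:

    [A] = | 1  2 |       [L] = |  1   0 |       [D] = | -1/3   0 |
          | 2  1 |,            | -2   1 |,            |   0    1 |,

    [L][D][L]^T = | -1/3    2/3 |  =  [A]^{-1}.
                  |  2/3   -1/3 |

The product of the triangular and diagonal factors is the inverse itself, so once the factors are known no forward or back substitution is required.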

Based upon the author's decomposition, a system of linear equations in the form

    [A]{X} = {B}                                                  (1)

may be solved by matrix-vector multiplications as

    {X} = [L][D][U]{B}.                                           (2)
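A minimal sketch of Eq. (2) in Fortran is given below. The dense two-dimensional storage, double precision and the routine name are assumptions made for this illustration; the factors [L], [D] and [U] are those of [A]^{-1}, so the solution requires only matrix-vector products, not triangular substitutions.

      ! Sketch (not the paper's code): compute {X} = [L][D][U]{B}, Eq. (2).
      subroutine solve_by_inverse_factors(n, L, D, U, b, x)
        implicit none
        integer, intent(in)  :: n
        real(8), intent(in)  :: L(n,n), U(n,n)  ! triangular factors of [A]**(-1)
        real(8), intent(in)  :: D(n)            ! diagonal factor of [A]**(-1)
        real(8), intent(in)  :: b(n)
        real(8), intent(out) :: x(n)
        real(8) :: t(n)

        t = matmul(U, b)      ! {t} = [U]{B}
        t = D * t             ! {t} = [D][U]{B}
        x = matmul(L, t)      ! {X} = [L][D][U]{B}
      end subroutine solve_by_inverse_factors

Each of the three products is free of recurrences, which is what makes Eq. (2) attractive for vector and parallel hardware.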

As cited in Duff (Ref. 9), explicitly inverting the left-side matrix [A] may take advantage of multiprocessors when solving {X}. This point can be seen from Eq. (2), in which the matrix-vector multiplications have a high potential to be implemented in parallel. This enhances the importance of the author's decomposition. A scheme to achieve a balanced distribution of the author's decomposition onto the employed processors was studied in Ref. 10. Besides the potential for parallel computing, this new approach also has three other special features. The first (Ref. 11) is that the round-off error of the author's decomposition can be written in terms of the approximation, i.e. the solution obtained on a computer, rather than the exact solution.
We know that the round-off error is commonly defined as the difference between the exact solution and the obtained approximation. However, the exact solution cannot be obtained in most problems, so that the round-off error cannot be obtained in this way. This special feature shows an advantage of the author's decomposition, by which the round-off error can be estimated directly from the approximation. The second feature is that the complexity of inverting a dense matrix can be reduced by a block-based procedure (Ref. 12), and can be further reduced to a minimal amount (Ref. 13), only 48% of the complexity required by Gaussian elimination. This shows an important strategy for inverting a nonsingular matrix: once the inverse has been economically obtained, the solution {X} in Eq. (1) can be efficiently computed by Eq. (2). In addition to this advantage for dense matrices, the author's decomposition also provides a special feature for sparse matrices. We know that the inverse of a sparse matrix is generally a dense matrix, i.e. by conventional methods we cannot take advantage of sparsity in the inverse. However, the author's decomposition writes the inverse in the form [L][D][U], and therefore provides the possibility of keeping [A], [L] and [U] in the same sparsity, so that [A] and [A]^{-1} share the same sparsity. One type of sparsity, identified as stairs-shape sparsity (Refs 14, 15), has been found to be well suited to the author's decomposition. With these four advantages, the author's decomposition not only introduces a new concept for the solution of a system of linear equations but also provides an efficient way of carrying out scientific and engineering computations. Up to the present time, this new approach has not been applied to practical structural problems; this is the first time that the parallelism, including vectorization and parallelization, of the stairs-shape sparsity is explored in a parallel environment. Timing results show that the author's decomposition is superior to Gaussian elimination when dealing with sparse systems in stairs shape. The essential concepts of this approach are discussed in detail in the following sections.

THE AUTHOR'S DECOMPOSITION

Let us consider the general case in which [A] is asymmetric, so that [A]^{-1} = [L][D][U]. The author's decomposition has two versions: the element-based procedure and the block-based procedure. First, let us discuss the block-based procedure. If [A] has a stairs-shape sparsity, then [A] may be partitioned as (Ref. 15):

    [A] = | [A_{11}]   [A_{12}]   [A_{13}]   ...   [A_{1N}] |
          | [A_{21}]   [A_{22}]                             |
          | [A_{31}]              [A_{33}]                  |
          |    :                                .           |
          | [A_{N1}]                               [A_{NN}] |      (3)

Then the author's decomposition for inverting a matrix [A] in the form of Eq. (3) is written as follows (please see Ref. 15):

(a) For J = 2→N, do independently

        [A_{JJ}] ← [A_{JJ}]^{-1}.                                  (4)

(b) For I = 2→N, do independently

        [A_{1I}] ← -[A_{1I}][A_{II}].                              (5)

(c) With synchronization, compute

        [A_{11}] ← ( [A_{11}] + Σ_{K=2}^{N} [A_{1K}][A_{K1}] )^{-1}.   (6)

(d) For I = 2→N, do independently

        [A_{I1}] ← -[A_{II}][A_{I1}].                              (7)

The results of Eqs (4)-(7) represent

    [L] = | [I]                                  |
          | [A_{21}]   [I]                       |
          | [A_{31}]         [I]                 |
          |    :                     .           |
          | [A_{N1}]                      [I]    |                 (8)

    [D] = | [A_{11}]                             |
          |           [A_{22}]                   |
          |                       .              |
          |                          [A_{NN}]    |                 (9)

    [U] = | [I]   [A_{12}]   [A_{13}]   ...   [A_{1N}] |
          |       [I]                                  |
          |                  [I]                       |
          |                             .              |
          |                                  [I]       |           (10)

where each submatrix denotes its updated value from Eqs (4)-(7). The computing streams defined in Eqs (4), (5) and (7), with a high degree of data independence, can be implemented in parallel. Synchronization is required in Eq. (6), in which the product [A_{1K}][A_{K1}] for each K can be computed in parallel.
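A sketch of steps (a)-(d) above for dense blocks is given below. The block storage (only the first block row, the first block column and the diagonal blocks are kept), the routine names and the use of a single inverted matrix per block are assumptions made for this illustration; in the paper each diagonal block inverse is instead held in [L][D][U] form produced by the element-based procedure, and the loops marked "do independently" are the ones distributed over processors.

      ! Sketch (assumed storage, not the paper's code) of the block-based
      ! procedure, Eqs (4)-(7), for a stairs-shape matrix with N blocks of
      ! order m:
      !   Adiag(:,:,J) holds [A_{JJ}], Arow(:,:,J) holds [A_{1J}],
      !   Acol(:,:,J) holds [A_{J1}].
      ! invert_block() stands in for the element-based procedure.
      subroutine block_decomposition(m, N, Adiag, Arow, Acol)
        implicit none
        integer, intent(in)    :: m, N
        real(8), intent(inout) :: Adiag(m,m,N), Arow(m,m,2:N), Acol(m,m,2:N)
        real(8) :: S(m,m)
        integer :: J

        do J = 2, N                 ! step (a), Eq (4): invert diagonal blocks
           call invert_block(m, Adiag(:,:,J))
        end do
        do J = 2, N                 ! step (b), Eq (5): first block row
           Arow(:,:,J) = -matmul(Arow(:,:,J), Adiag(:,:,J))
        end do
        S = Adiag(:,:,1)            ! step (c), Eq (6): requires synchronization
        do J = 2, N
           S = S + matmul(Arow(:,:,J), Acol(:,:,J))
        end do
        call invert_block(m, S)
        Adiag(:,:,1) = S
        do J = 2, N                 ! step (d), Eq (7): first block column
           Acol(:,:,J) = -matmul(Adiag(:,:,J), Acol(:,:,J))
        end do
      end subroutine block_decomposition

After the call, Adiag, Arow and Acol hold the blocks of [D], [U] and [L] in Eqs (8)-(10), respectively.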

The block-based procedure employs the element-based procedure for the inverse of each diagonal submatrix, for example [A_{JJ}] in Eq. (4). The element-based procedure is written as follows (Ref. 8).

For j = n→1 with step (-1), do:

(a) For i = j+1→n, do independently

        S_i ← A_{ji} + Σ_{k=i+1}^{n} A_{jk} * A_{ki}.              (11)

(b) For i = j+1→n, do independently

        A_{ji} ← -S_i * A_{ii} - Σ_{k=j+1}^{i-1} S_k * A_{kk} * A_{ki}.    (12)

(c) Compute

        A_{jj} ← 1 / ( A_{jj} + Σ_{k=j+1}^{n} A_{jk} * A_{kj} ).   (13)

(d) For i = j+1→n, do independently

        S_i ← A_{ij} + Σ_{k=i+1}^{n} A_{ik} * A_{kj}.              (14)

(e) For i = j+1→n, do independently

        A_{ij} ← -A_{ii} * S_i - Σ_{k=j+1}^{i-1} A_{ik} * A_{kk} * S_k,    (15)

where n is the order of the matrix. The results of Eqs (11)-(15) represent

    [L] = | 1                            |
          | A_{21}   1                   |
          | A_{31}   A_{32}   1          |
          |   :                    .     |
          | A_{n1}   A_{n2}   ...    1   |

    [D] = | A_{11}                       |
          |         A_{22}               |
          |                   .          |
          |                      A_{nn}  |

    [U] = | 1   A_{12}   A_{13}   ...   A_{1n} |
          |     1        A_{23}   ...   A_{2n} |
          |              1                :    |
          |                        .           |
          |                             1      |

Equations (11), (12), (14) and (15) can be implemented in parallel, while Eq. (13) may employ a parallel inner product for Σ_{k=j+1}^{n} A_{jk} * A_{kj}. The overall performance of the element-based procedure (Refs 7, 8, 10) can reach an efficiency of 90% on eight computing elements of an Alliant FX/8 computer. Since the inverse of each diagonal submatrix is obtained by the element-based procedure, the inverse of each submatrix is not in a single matrix representation but in the form [L][D][U]; for example, the right-hand side of Eq. (5) implicitly contains four matrices, -[A_{1I}][L][D][U], where [L], [D] and [U] are obtained in Eq. (4). If matrix [A] is symmetric, then [A]^{-1} = [L][D][L]^T, and the procedure shown in Eqs (3)-(15) may be simplified by using the property that [U] = [L]^T. The computing strategies for symmetric and asymmetric systems are similar; the procedure for symmetric systems is not included here, please see Ref. 14.
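A serial sketch of Eqs (11)-(15) as written above is given below. The in-place dense storage, double precision and the routine name are assumptions made for this illustration, and the parallel and vector arrangement discussed in the next section is omitted.

      ! Sketch (not the paper's code) of the element-based procedure,
      ! Eqs (11)-(15).  On exit the strictly lower part of A holds [L],
      ! the diagonal holds [D] and the strictly upper part holds [U]
      ! (the unit diagonals of [L] and [U] are implicit).
      subroutine element_decomposition(n, A)
        implicit none
        integer, intent(in)    :: n
        real(8), intent(inout) :: A(n,n)
        real(8) :: S(n), acc
        integer :: i, j, k

        do j = n, 1, -1
           do i = j+1, n                       ! step (a), Eq (11)
              S(i) = A(j,i)
              do k = i+1, n
                 S(i) = S(i) + A(j,k)*A(k,i)
              end do
           end do
           do i = j+1, n                       ! step (b), Eq (12)
              acc = S(i)*A(i,i)
              do k = j+1, i-1
                 acc = acc + S(k)*A(k,k)*A(k,i)
              end do
              A(j,i) = -acc
           end do
           acc = A(j,j)                        ! step (c), Eq (13)
           do k = j+1, n
              acc = acc + A(j,k)*A(k,j)
           end do
           A(j,j) = 1.0d0/acc
           do i = j+1, n                       ! step (d), Eq (14)
              S(i) = A(i,j)
              do k = i+1, n
                 S(i) = S(i) + A(i,k)*A(k,j)
              end do
           end do
           do i = j+1, n                       ! step (e), Eq (15)
              acc = A(i,i)*S(i)
              do k = j+1, i-1
                 acc = acc + A(i,k)*A(k,k)*S(k)
              end do
              A(i,j) = -acc
           end do
        end do
      end subroutine element_decomposition

The loops over i within steps (a), (b), (d) and (e) have independent iterations, which is the data independence exploited in the next section.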

COMPUTATIONAL CONSIDERATIONS

The parallelism considered in this work includes vectorization and parallelization (concurrency). Vectorization processes a set of data, e.g. 32 elements of data on an Alliant FX/8 computer, at once; certainly, processing a set of data with a vector instruction is faster than processing the same data with a series of scalar instructions. The performance of vectorization depends on the instruction (operation) involved and on the degree of pipelining achieved over scalar mode, and it usually requires a reformulation of certain equations into other forms. Parallelization is a way to execute a single program concurrently on the available processors. Both vectorization and parallelization require data independence, and a high degree of data independence achieves a better performance. In this work, the decomposition procedure is executed in parallel in the outer loop and is vectorized within the loop; for example, the submatrices [A_{JJ}], J = 2→N, in Eq. (4) are distributed onto the employed processors, and then on each processor the inverse operation is vectorized. This can be illustrated by the following segment in FORTRAN:

C     execute Step (a) in parallel by the CVD$L CNCALL directive
CVD$L CNCALL
      DO IDCPU = 1,NP
         CALL STEPA(A,IDCPU,NP, ...)
      END DO

where NP is the number of employed processors. The computations of Eq. (4) are distributed among the employed processors; then, on each processor, subroutine STEPA invokes the directive CVD$L VECTOR for vectorization. Steps (b) and (d), with Eqs (5) and (7) respectively, are arranged in a similar way: parallelization in the outer loop and vectorization within the loop. The key to vectorization is to reformulate equations so that components are independent of each other. Vectorization of the matrix-vector multiplications on the right-hand sides of Eqs (5) and (7) is straightforward and is not discussed further. This paper only discusses a way to rewrite the element-based procedure into other forms suited for vectorization; for example, Step (d) with Eq. (14) may be rewritten as

        S_i ← A_{ij}        (i = j+1→n).                           (16)

        For k = j+1→n, do
            S_i ← S_i + A_{ik} * A_{kj}    (i = j+1→k-1).          (17)
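A minimal sketch of Eqs (16)-(17) in Fortran array syntax is given below (array names assumed; this is not the paper's code). In the original form of Eq. (14) each S_i is accumulated as its own inner product; in the rewritten form the inner index runs over independent components, so each statement maps directly onto vector instructions.

      ! Sketch of the vectorized Step (d), Eqs (16)-(17), for a fixed column j.
      subroutine step_d_vectorized(n, j, A, S)
        implicit none
        integer, intent(in)  :: n, j
        real(8), intent(in)  :: A(n,n)
        real(8), intent(out) :: S(n)
        integer :: k

        S(j+1:n) = A(j+1:n, j)                               ! Eq (16)
        do k = j+1, n                                        ! Eq (17)
           S(j+1:k-1) = S(j+1:k-1) + A(j+1:k-1, k)*A(k,j)    ! independent components
        end do
      end subroutine step_d_vectorized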

Equation (16) can be viewed as {S}_{j+1}^{n} ← {A_j}_{j+1}^{n}, which is fully vectorized, and Eq. (17) can be viewed as {S}_{j+1}^{k-1} ← {S}_{j+1}^{k-1} + {A_k}_{j+1}^{k-1} * A_{kj}, so Eq. (17) is also fully vectorized. Similarly, Step (e) containing Eq. (15) can be rewritten as

        A_{ij} ← -A_{ii} * S_i        (i = j+1→n).                 (18)

        For k = j+1→n, do
            T = A_{kk} * S_k
            A_{ij} ← A_{ij} - A_{ik} * T    (i = k+1→n),           (19)

so as to be fully vectorized. Steps (a) and (b), containing Eqs (11) and (12) respectively, can also be arranged in a way similar to Steps (d) and (e). Only Eq. (13) cannot be transformed into a vector mode, but it contains a relatively small portion of the computations and does not seriously degrade the overall performance.

Unlike Eqs (4), (5) and (7), which can easily be implemented in parallel-vector mode, implementing Eq. (6) requires a certain amount of memory on each employed processor as a temporary matrix for computing a portion of Σ_{K=2}^{N} [A_{1K}][A_{K1}]; the sum Σ_{K=2}^{N} [A_{1K}][A_{K1}] is then the sum of the temporary matrices. In this strategy, the computations of the temporary matrices are distributed as evenly as possible onto the processors, which can theoretically be implemented in parallel-vector mode. After computing the temporary matrices, the summation into [A_{11}] is also divided among the employed processors as evenly as possible; for example, processor 1 is assigned to assemble the first 50 consecutive entries, processor 2 assembles the next 50 entries, and so on. This stage can also be performed in parallel and vector mode. Finally, the inverse of [A_{11}] is computed by the element-based procedure in parallel-vector mode. Once [A]^{-1} = [L][D][U] has been obtained, the solution {X} of Eq. (1) may be computed in parallel-vector mode as

        {X_1} ← {X_1} + Σ_{J=2}^{N} [A_{1J}]{X_J},

        {X_i} ← [A_{ii}]{X_i}            (i = 1, 2, ..., N),

        {X_i} ← {X_i} + [A_{i1}]{X_1}    (i = 2, 3, ..., N),

where {X} and {B} share the same computer memory, and [A_{ii}] is not in a single matrix representation but in the form [L][D][U].
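A sketch of these three stages is given below, using the block storage assumed in the earlier sketches. OpenMP directives are used here as a present-day stand-in for the Alliant CVD$L concurrency directives that the paper actually employs; this substitution is an assumption made for illustration only.

      ! Sketch (assumed storage; OpenMP in place of the CVD$L directives) of
      ! the parallel-vector solution phase.  X(:,J) holds {X_J} and on entry
      ! contains {B_J}; Adiag, Arow, Acol hold the decomposed blocks.
      subroutine block_solve(m, N, Adiag, Arow, Acol, X)
        implicit none
        integer, intent(in)    :: m, N
        real(8), intent(in)    :: Adiag(m,m,N), Arow(m,m,2:N), Acol(m,m,2:N)
        real(8), intent(inout) :: X(m,N)
        integer :: I

        do I = 2, N              ! {X_1} <- {X_1} + sum_J [A_{1J}]{X_J}
           X(:,1) = X(:,1) + matmul(Arow(:,:,I), X(:,I))
        end do                   ! (kept serial here; the paper distributes such sums)
        !$omp parallel do
        do I = 1, N              ! {X_i} <- [A_{ii}]{X_i}
           X(:,I) = matmul(Adiag(:,:,I), X(:,I))
        end do
        !$omp end parallel do
        !$omp parallel do
        do I = 2, N              ! {X_i} <- {X_i} + [A_{i1}]{X_1}
           X(:,I) = X(:,I) + matmul(Acol(:,:,I), X(:,1))
        end do
        !$omp end parallel do
      end subroutine block_solve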

Some numerical examples will be used to demonstrate the efficiency of the presented method.

EXAMPLES AND DISCUSSIONS

An Alliant FX/8 computer, a shared-memory machine, will be used to demonstrate the method. An FX/8 model contains one required computational element and up to seven additional computational elements (CEs). Each CE is a processor compatible with the MC68000 architecture and, in addition, provides floating-point, concurrency and vector instructions. This kind of machine is well suited to an algorithm with the capacity for parallelization and vectorization. Four large-scale examples will be used to analyze performance. The first example is a system of linear equations [A]{X} = {B} with a symmetric but indefinite stairs-shape sparse matrix [A] of order (22656 x 22656), which is partitioned into (177 x 177) blocks, each block being of order (128 x 128), as

    [A] = | [A_{1,1}]                                              |
          | [A_{2,1}]    [A_{2,2}]                                 |
          | [A_{3,1}]                [A_{3,3}]                     |
          |     :                                    .             |
          | [A_{177,1}]                               [A_{177,177}]|      (20)

A pseudo-random procedure was employed to generate [A], which results in 4,344,896 non-zero fill-ins. Because of the indefiniteness, classic iterative methods cannot provide a convergent solution; a direct method is the major tool for this kind of general system. The timing results for solving the first example in parallel-vector mode are shown in Table 1, from which it can be seen that the time of 1068.48 s on one CE was reduced to 167.35 s on 8 CEs, giving a speedup of 6.38. This example, with the same sparsity, was also solved by the well-known Gaussian elimination, which took 5633.96 s on one CE without vectorization. Because processing a vector instruction on an FX/series machine can be two to four times faster than processing scalar instructions (Ref. 16), if a transformation into vector mode is possible, the ideal time for Gaussian elimination on 8 CEs with vectorization is 5633.96/(4 x 8) = 176.06 s, i.e. ideally speeded up eight times in parallel mode and four times in vector mode. This ideal time of 176.06 s for Gaussian elimination is still slower than the actual time of 167.35 s achieved by the author's decomposition. It should be emphasized that if a transformation into vector mode is impossible, then vectorization gains nothing, and it is not clear what percentage of Gaussian elimination can be transformed into parallel and vector mode. However, we can be sure that Gaussian elimination cannot be optimized to achieve the perfect speedup in parallel-vector mode, i.e. 32 times faster on 8 CEs. The author's decomposition is therefore superior to Gaussian elimination in the first example.

    Table 1. Performance in solving example 1
    Number of CEs   User time (s)   System time (s)   Total time (s)   Speedup   Efficiency (%)
          1            1068.40           0.08            1068.48        1.00        100.00
          2             545.88           0.05             545.93        1.96         98.00
          4             290.56           0.03             290.59        3.68         92.00
          8             167.33           0.02             167.35        6.38         79.75

The second example is also a system of linear equations [A]{X} = {B}, with an asymmetric stairs-shape matrix [A] of order (22656 x 22656) which is also partitioned into (177 x 177) blocks. A pseudo-random procedure was employed to generate [A], resulting in 8,667,136 non-zero entries, as

    [A] = | [A_{1,1}]    [A_{1,2}]   [A_{1,3}]   ...   [A_{1,177}] |
          | [A_{2,1}]    [A_{2,2}]                                 |
          | [A_{3,1}]                [A_{3,3}]                     |
          |     :                                    .             |
          | [A_{177,1}]                               [A_{177,177}]|      (21)

The timing results for solving the second example in parallel-vector mode are shown in Table 2. This asymmetric example shows a performance very similar to that of the first example: the required time is almost halved when the number of processors is doubled. This example was also solved by Gaussian elimination, which took 11,148.15 s on one CE without vectorization. The ideal time for Gaussian elimination on 8 CEs is 11,148.15/(4 x 8) = 348.38 s, which is slower than the author's decomposition. For the stairs-shape sparsity, the author's decomposition requires less execution time than Gaussian elimination.

It is not clear, at present, how to reorder a general sparse matrix into a stairs shape so that the author's decomposition can be applied. However, in structural analysis we usually consider a whole structure as an assembly of substructures. This concept is equivalent to domain decomposition, i.e. a subdomain is a substructure, and the interface between adjacent subdomains is equivalent to the interface between adjacent substructures. Based upon the concept of domain decomposition, the stairs-shape sparsity can easily be constructed; for example, we first number the degrees-of-freedom on the interfaces consecutively, which results in the first block column and the first diagonal submatrix (and the first block row if the matrix is asymmetric), as

    | [*]   [*]   [*]   ...   [*] |
    | [*]                         |
    | [*]                         |
    |  :                          |
    | [*]                         |                                 (22)

and then consecutively number the degrees-of-freedom in one subdomain after another, resulting in the entries from the second diagonal submatrix to the last diagonal submatrix, as

    |                              |
    |       [*]                    |
    |             [*]              |
    |                    .         |
    |                        [*]   |                                (23)

where [*] denotes a non-zero submatrix. A stairs-shape sparsity is therefore constructed.

[*I A stairs-shape sparsity is therefore constructed. The following two examples are used to demonstrate applications to domain decomposition. The third example is defined as Au + iu = 1

in

where 3, = 1.4, subject

R: (0,64n)

x (0,64x),

to the boundary

u=o

on

condition

an

(25)

Standard finite element discretizations of Eq. (24) yield linear algebraic systems with coefficient matrices which are symmetric and indefinite. Since the eigenvalues of -A for the domain Q are given by 1/642(m2+n2)wherem,n=1,2,3,...,weseethat 1 = 1.4 is not an eigenvalue and therefore the problem Eq. (24) leads to nonsingular coefficient matrices. Now, suppose we consider solving Eq. (24) by using domain decomposition with eight subdomains, each of which is of size (0,64~) x (0,8n). The eigenvalues of -A for each subdomain are then defined by m 2/642 + n */g2 where m,n,=1,2 ,.... Clearly, 1 = 1.4 is not an eigenvalue of each subdomam, and then each diagonal submatrix is nonsingular. The degrees-of-freedom are firstly assigned to the interfaces, and then consequently assigned to subdomains. This gives a sparse system in the form of Eq. (20), and the timing results were shown in Table 3. This performance also convinces that the time required by the author’s decomposition can be reduced approximately in half when doubling processors. This

    Table 2. Performance in solving example 2
    Number of CEs   User time (s)   System time (s)   Total time (s)   Speedup   Efficiency (%)
          1            1842.83           0.11            1842.94        1.00        100.00
          2             940.70           0.04             940.74        1.96         98.00
          4             497.94           0.02             497.96        3.70         92.00
          8             285.18           0.01             285.19        6.46         80.75

    Table 3. Performance in solving example 3
    Number of CEs   User time (s)   System time (s)   Total time (s)   Speedup   Efficiency (%)
          1            2543.12           0.98            2544.70        1.00        100.00
          2            1322.66           0.49            1323.15        1.92         96.00
          4             703.64           0.26             703.90        3.62         90.50
          8             402.55           0.16             402.71        6.32         79.00

The fourth example is defined as

    ∂u/∂x + ∂u/∂y = y    in Ω: (0, 64π) x (0, 64π),                (26)

subject to the boundary condition

    u = 1    on x = 0.                                              (27)

Equation (26) may be approximated by algebraic linear systems with asymmetric coefficient matrices; its exact solution is u = -x²/2 + xy + 1. Similarly to the third example, the domain Ω was divided into eight subdomains, each of size (0, 64π) x (0, 8π). The degrees-of-freedom are first assigned to the interfaces, and then consecutively assigned to the subdomains. This yields an asymmetric matrix with a stairs-shape sparsity as in Eq. (21). The timing results are shown in Table 4; the performance is consistent with the other examples, which confirms the power of the author's decomposition for both symmetric and asymmetric systems. For the purpose of comparison, this example was also solved by Gaussian elimination, taking 20,760.00 s on one CE without vectorization.

    Table 4. Performance in solving example 4
    Number of CEs   User time (s)   System time (s)   Total time (s)   Speedup   Efficiency (%)
          1            4015.46           0.98            4016.44        1.00        100.00
          2            2102.47           0.48            2102.95        1.91         95.50
          4            1120.52           0.25            1120.77        3.58         89.50
          8             649.73           0.19             649.92        6.18         77.25

This paper uses two partial differential equations to demonstrate a way of constructing the stairs-shape sparsity from the concept of domain decomposition, in which a subdomain can be viewed as a substructure. The same idea may be applied directly to structural analysis. The method not only rests on a clear concept of parallelism but also provides an easy way of programming, and the parallelism can be implemented on different types of machine. As mentioned previously, the solution of a sparse system of linear equations is an important part of scientific and engineering computing, and the presented method may be a useful tool for such computations. Comparisons of timing results show that the author's decomposition is superior to Gaussian elimination for the stairs-shape sparsity.

This work presents a new direct method for sparse systems. Direct methods provide a procedure with unconditional convergence and can serve as general solvers well suited to nonlinear analyses with unloading processes. The cost required by a sparse solver depends on the number of non-zero fill-ins. It is not clear, at present, how to renumber a general sparse matrix optimally into a stairs shape; such renumbering schemes are under study. As mentioned previously, the concept of domain decomposition is also a way to construct the stairs-shape sparsity, and it may provide a clearer approach than renumbering schemes. Each subdomain has a corresponding diagonal submatrix. Certainly, each diagonal submatrix, with its corresponding subdomain, can itself be renumbered into a stairs-shape submatrix by dividing the corresponding subdomain into sub-subdomains and then numbering the sub-interfaces first and the sub-subdomains afterwards, so as to further reduce the non-zero fill-ins.

From the timing results shown in Tables 1-4, it can be seen that the efficiency on 8 CEs with vectorization drops to about 80%. Compared with the performance without vectorization, which gives an efficiency of 93% on 8 CEs, the procedure with vectorization lowers the speedup. This kind of degradation with vectorization comes from the length of a vector: the performance of vectorization is a function of the vector length (Ref. 16), and a longer vector gives better vectorization. Unfortunately, when more CEs are employed, each vector becomes reciprocally shorter, which degrades the performance. The degradation in efficiency therefore comes partially from the implementation of vector instructions, not entirely from the algorithm. With its clear concept of parallelism, the presented sparse solver may be a useful tool in scientific and engineering computing.

REFERENCES

1. D. R. Mackay, K. H. Law and A. Raefsky, "An implementation of a generalized sparse/profile finite element solution method," Computers and Structures 41, 723-737 (1991).
2. T. K. Agarwal, O. O. Storaasli and D. T. Nguyen, "A parallel-vector algorithm for rapid structural analysis on high performance computers," AIAA Paper 90-1149 (1990).
3. O. O. Storaasli and P. Bergan, "Nonlinear substructuring method for concurrent processing computers," AIAA Journal 25, 871-876 (1987).
4. O. O. Storaasli, D. T. Nguyen and T. K. Agarwal, "Parallel-vector solution of large-scale structural analysis problems on supercomputers," AIAA Journal 28, 1211-1216 (1990).
5. M. Al-Nasra and D. T. Nguyen, "An algorithm for domain decomposition in finite element analysis," Computers and Structures 39, 277-289 (1991).
6. A. K. Noor, H. A. Kamel and R. E. Fulton, "Substructuring techniques - status and projections," Computers and Structures 8, 621-632 (1978).


7. J.-C. Luo, "A new class of decomposition for symmetric system," Mechanics Research Communications 19, 159-166 (1992).
8. J.-C. Luo, "A new class of decomposition for inverting asymmetric and indefinite matrices," Computers and Mathematics with Applications 25(4), 95-104 (1993).
9. I. S. Duff, A. M. Erisman and J. K. Reid, Direct Methods for Sparse Matrices, Oxford University Press, Oxford, 1986.
10. J.-C. Luo, "A note on parallel processing," Applied Mathematics Letters 5(2), 75-78 (1992).
11. J.-C. Luo, "Analysis of round-off error," Applied Mathematics Letters 5(4), 77-80 (1992).
12. J.-C. Luo, "A proof in favor of block-based procedure," Applied Mathematics Letters 5(5), 75-78 (1992).
13. J.-C. Luo, "An optimum partition for inverting a nonsingular matrix," Applied Mathematics Letters 6(5), 79-84 (1993).
14. J.-C. Luo, "A sparsity for decomposing a symmetric matrix," Computers and Mathematics with Applications 25(5), 83-90 (1993).
15. J.-C. Luo, "Extended concept of stair-shape sparsity for the inverse of an asymmetric matrix," Computers and Mathematics with Applications 25(5), 113-122 (1993).
16. Using Alliant FX/8, Argonne National Laboratory.