
Convergence Analysis for a Parallel Jacobi Algorithm

M. El-Attar
Department of Engineering Mathematics, Faculty of Engineering, Alexandria University, Alexandria, Egypt

Transmitted by Melvin Scott

ABSTRACT

An iterative parallel scheme for the solution of banded systems of equations is introduced. It is based on the implementation of the theory of matrix splitting. A convergence analysis for the proposed algorithm is carefully introduced, and the speed-up of the parallel implementation is studied.

1. INTRODUCTION

The solution of large systems of linear or linearized equations is one of the central problems in engineering applications and computational mathematics. Systems that have a banded symmetric coefficient matrix, usually resulting from finite element modeling, are of great interest; however, banded nonsymmetric systems are also common in the analysis of geomaterials modeled with nonassociative plasticity. The solution of these systems on conventional computer architectures is an algorithmically simple problem, but it suffers from severe time and memory limitations. Parallel and distributed computation is currently an area of intense research activity; the availability of powerful parallel computers is generating interest in the development of new distributed algorithms in order to solve problems that were not addressed in the past due to cost and speed constraints. Recent studies [1-5] have addressed the generation of parallel schemes for the solution of large systems of linear equations. In this paper, an iterative parallel block Jacobi-type algorithm for solving large banded linear systems is introduced. The algorithm is a two-stage


iterative type based on the theory of matrix multisplitting. A theoretical convergence analysis for the proposed algorithm is described. The present algorithm is tested for the solution of banded systems on a small transputer unit (a transputer is a microprocessor with its own local memory and with links that can connect one transputer to another). The architecture used in this study is based on a mesh connection of 16 transputers, as shown in Figure 1. Processors 1 and 16 are connected to an IBM AT, which acts as a controller that downloads instructions and data to the transputer unit. From a practical point of view, this connection represents a grid, a hypercube, a pipeline, and a ring. The pipeline configuration is the one used to implement the algorithm of this paper. The results for different linear systems with the same bandwidth are then carefully discussed.

2. DESCRIPTION OF THE ALGORITHM

Consider the linear system of equations

Ax = b,     (1)

where A is a banded matrix of size (n × n) with bandwidth w, and x and b are the unknown and known vectors, respectively.

FIG. 1. Transputer mesh architecture.

Based on the multisplitting of the matrix, an iterative method is presented that is structured so that operations can be performed in parallel.

DEFINITION. Let A, B_k, C_k, and D_k be (n × n) matrices, k = 1, 2, ..., K. Then (B_k, C_k, D_k) is called a multisplitting of A if

(i) A = B_k − C_k, k = 1, 2, ..., K, where each B_k is invertible, and
(ii) Σ_k D_k = I, where the matrices D_k are diagonal and D_k ≥ 0.

Using (i) above, (1) may be written as

B_k x = C_k x + b,     k = 1, 2, ..., K     (2)

or

x = (B_k)^{-1} C_k x + (B_k)^{-1} b,     k = 1, 2, ..., K.     (3)

Using the weighting matrices D_k, these K sets of equations can be combined as

x = Σ_k D_k (B_k)^{-1} C_k x + Σ_k D_k (B_k)^{-1} b.     (4)

Equation (4) yields the following algorithm. Choose x^0 arbitrarily, and then for i = 0, 1, 2, ... until convergence

x^{i+1} = H x^i + G b,     (5)

where

H = Σ_k D_k (B_k)^{-1} C_k     and     G = Σ_k D_k (B_k)^{-1}.


The algorithm of (5) is implemented using a two-stage splitting, which is described as follows. The system (1) can be written in the submatrix form

    [ A_11   A_12    0      0     ···    0           0    ] [ x_1 ]   [ b_1 ]
    [ A_21   A_22    A_23   0     ···    0           0    ] [ x_2 ]   [ b_2 ]
    [ 0      A_32    A_33   A_34  ···    0           0    ] [ x_3 ] = [ b_3 ]
    [ ···                                ···              ] [ ··· ]   [ ··· ]
    [ 0      0       0      0     ···    A_{K,K−1}   A_KK ] [ x_K ]   [ b_K ]

such that the diagonal submatrices A_kk are of size (m × m) when k is odd and (q × q) when k is even, as in Figure 2. Then the set of multisplitting matrices are

C_k = B_k − A, i.e.,

    C_k = [ I' − A_11    −A_12         0            ···           0          ]
          [ −A_21        I'' − A_22    −A_23         ···           0          ]
          [  ···         −A_{k,k−1}    0             −A_{k,k+1}    ···        ]
          [  0            ···          −A_{K,K−1}    I'' − A_KK               ],

where the kth diagonal block of C_k is zero and every other diagonal block j carries I' − A_jj or I'' − A_jj;

    B_k = diag(I', I'', ..., A_kk, ..., I', I''),

with A_kk in the kth diagonal position and identity blocks elsewhere; and

    D_k = s_k,    where s_k = diag(0, ..., 0, I', 0, ..., 0) for k odd,
                        s_k = diag(0, ..., 0, I'', 0, ..., 0) for k even,

with the identity block in the kth position. Here I' and I'' are identity matrices of size (m × m) and (q × q), respectively, and K is an even number. (B_k, C_k, D_k) is a multisplitting of A since

(a) A = B_k − C_k, k = 1, 2, ..., K, as long as each B_k is invertible, or equivalently each A_kk is invertible, which is the case with matrices that result from finite element problems, and
(b) Σ_k D_k = I.
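As a sanity check, the following sketch (toy sizes m, q, and K, and a hypothetical banded test matrix, not one from the paper) assembles B_k, C_k, and D_k in the block form above and verifies conditions (a) and (b):

```python
import numpy as np

# Assemble the multisplitting (B_k, C_k, D_k) of Section 2 for a banded A
# whose K diagonal blocks alternate between sizes m (k odd) and q (k even).
m, q, K = 3, 2, 4
sizes = [m if k % 2 == 1 else q for k in range(1, K + 1)]  # paper's k = 1, ..., K
offsets = np.concatenate(([0], np.cumsum(sizes)))
n = offsets[-1]

rng = np.random.default_rng(1)
band = np.tril(np.triu(rng.uniform(-1, 1, (n, n)), -m), m)  # half-bandwidth m
A = np.eye(n) * 6 + band

Bs, Cs, Ds = [], [], []
for k in range(K):
    sl = slice(offsets[k], offsets[k + 1])
    B = np.eye(n)
    B[sl, sl] = A[sl, sl]          # B_k = diag(I', ..., A_kk, ..., I'')
    D = np.zeros((n, n))
    D[sl, sl] = np.eye(sizes[k])   # D_k = s_k: identity in block k only
    Bs.append(B)
    Cs.append(B - A)
    Ds.append(D)

# Condition (a): A = B_k - C_k for every k; condition (b): sum_k D_k = I.
assert all(np.allclose(A, B - C) for B, C in zip(Bs, Cs))
assert np.allclose(sum(Ds), np.eye(n))
```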




Now that the matrices H and G are defined, the algorithm takes the following specific form. Choose x^0 arbitrarily, e.g., x^0 = [0, 0, ..., 0]^T. Introduce subsequently the following iterative scheme. For i = 0, 1, 2, ... until convergence:

x^{i+1} = H x^i + G b.     (6)

For the subsystem k, this has the form

(x_k)^{i+1} = −(A_kk)^{-1} A_{k,k−1} (x_{k−1})^i − (A_kk)^{-1} A_{k,k+1} (x_{k+1})^i + (A_kk)^{-1} b_k     (7)

FIG. 2. Two-stage multisplitting.

or

A_kk (x_k)^{i+1} = b_k − A_{k,k−1} (x_{k−1})^i − A_{k,k+1} (x_{k+1})^i.

This method is essentially a tridiagonal Jacobi.

3. PARALLEL IMPLEMENTATION OF THE ALGORITHM

The above scheme can be implemented in a group of (K/2) processors.
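Before the processor-level choreography, here is a minimal sketch of the kernel (7) that each processor repeatedly applies (plain NumPy on one machine; the transputer message passing is elided, and the block lists are hypothetical inputs):

```python
import numpy as np

# One parallelizable sweep of update (7): subsystem k refreshes its unknowns
# from its neighbors' previous iterates,
#   (x_k)^{i+1} = (A_kk)^{-1} (b_k - A_{k,k-1} (x_{k-1})^i - A_{k,k+1} (x_{k+1})^i).
def block_jacobi_sweep(diag, lower, upper, b, x):
    """diag[k] = A_kk, lower[k] = A_{k,k-1}, upper[k] = A_{k,k+1} (None at ends)."""
    x_next = []
    for k in range(len(diag)):
        rhs = b[k].copy()
        if lower[k] is not None:          # coupling to subsystem k-1
            rhs -= lower[k] @ x[k - 1]
        if upper[k] is not None:          # coupling to subsystem k+1
            rhs -= upper[k] @ x[k + 1]
        x_next.append(np.linalg.solve(diag[k], rhs))
    return x_next
```

Every entry of x_next depends only on the previous iterate x, which is what makes the sweep embarrassingly parallel across subsystems.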

Each subsystem is transferred through the links between processors until each processor has two subsystems, i.e., processor 1 has subsystems 1 and 2, while processor K/2 has subsystems K−1 and K. All processors perform the first step of Gaussian elimination. For example, processor k brings the matrices A_{2k−1,2k−1} and A_{2k,2k} to upper triangular form and updates accordingly the matrices A_{2k−1,2k−2}, A_{2k−1,2k}, A_{2k,2k−1}, and A_{2k,2k+1} and the vectors b_{2k−1} and b_{2k}. This is computationally the most expensive operation, but it is performed only once. Back substitution is then performed in parallel in all processors for the odd-numbered subsystems. The even-numbered subsystems receive the solution from the odd-numbered ones. In parallel, each processor updates the right-hand side vector of its even-numbered subsystem; then back substitution is performed in parallel in all processors for only the even-numbered subsystems. The solutions of the even-numbered subsystems are communicated to the odd-numbered ones and their right-hand side vectors

are updated. This is the end of the first iteration, where each processor obtains part of the solution vector. A convergence check is carried out after each iteration, based on the criterion ‖x^i − x^{i−1}‖_∞ < Tolerance, where x^i is the solution of the ith iteration and ‖·‖_∞ is the infinity norm. If this criterion is satisfied, the process is terminated. In this parallel implementation, two serial stages are combined in order to reduce the number of calculations relative to a possible one-stage algorithm.
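The division of labor just described can be mimicked serially as follows (a sketch, not the transputer code: SciPy's LU factorization stands in for the one-time Gaussian elimination, and inter-processor communication is elided):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Serial mimic of the two-stage (odd/even) sweep of Section 3: each diagonal
# block is eliminated once up front (the expensive step), and every iteration
# then costs only back substitutions plus neighbor exchanges.
# lower[k], upper[k] are A_{k,k-1}, A_{k,k+1}; indices run 0..K-1 here, so the
# paper's odd-numbered subsystems are the even Python indices.
def two_stage_jacobi(diag, lower, upper, b, tol=1e-8, max_iter=500):
    K = len(diag)
    factored = [lu_factor(Akk) for Akk in diag]          # performed only once
    x = [np.zeros_like(bk) for bk in b]
    for it in range(1, max_iter + 1):
        x_old = [xk.copy() for xk in x]
        for phase in (0, 1):                             # odd stage, then even
            for k in range(phase, K, 2):
                rhs = b[k].copy()
                if k > 0:
                    rhs -= lower[k] @ x[k - 1]
                if k < K - 1:
                    rhs -= upper[k] @ x[k + 1]
                x[k] = lu_solve(factored[k], rhs)        # cheap back substitution
        # stopping rule of the paper: ||x^i - x^{i-1}||_inf < Tolerance
        if max(np.max(np.abs(a - c)) for a, c in zip(x, x_old)) < tol:
            return x, it
    return x, max_iter
```

Note that the even stage already uses the freshly computed odd solutions, exactly as in the exchange pattern described above.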

4. CONVERGENCE ANALYSIS

In the sequence x^{i+1} = Hx^i + Gb, H is called the iteration matrix. The necessary and sufficient condition for convergence is that all of the eigenvalues of H have a magnitude smaller than one, i.e., that the spectral radius ρ(H) is smaller than one. This is the most general possible convergence condition, but it is of limited use because the eigenvalues of H are rarely known exactly. Thus, a more refined tool is introduced, based on a suitable distance function (norm) from the desired point of convergence. Convergence is then guaranteed if each iteration reduces the value of this norm.


DEFINITION. Given a vector ω > 0, we define the weighted maximum norm ‖·‖_∞^ω by ‖x‖_∞^ω = max_i |x_i|/ω_i. The vector norm ‖·‖_∞^ω induces a matrix norm, also denoted by ‖·‖_∞^ω, defined as

‖A‖_∞^ω = max_{x≠0} ‖Ax‖_∞^ω / ‖x‖_∞^ω,     (8)

where A is an (n × n) matrix. An alternative but equivalent definition of this norm is based on assuming ‖x‖_∞^ω = 1; then

‖A‖_∞^ω = max_i (1/ω_i) Σ_j |a_ij| ω_j,

where a_ij are the entries of A.

THEOREM 1. If A is an (n × n) nonnegative matrix, then for every ε > 0, there exists some ω > 0 such that ρ(A) ≤ ‖A‖_∞^ω < ρ(A) + ε.

PROOF. Let λ be an eigenvalue of A such that |λ| = ρ(A). Let x ≠ 0 be an eigenvector of A corresponding to the eigenvalue λ, normalized so that ‖x‖_∞^ω = 1. Then

‖A‖_∞^ω ≥ ‖Ax‖_∞^ω = ‖λx‖_∞^ω = |λ| ‖x‖_∞^ω = ρ(A).

From the above theorem, we can state that if A is a square nonnegative matrix such that ρ(A) < 1, then there exists some ω > 0 such that ‖A‖_∞^ω < 1. ∎

DEFINITION. A square block matrix A with blocks A_ij is (block) diagonally dominant if

Σ_{j≠i} ‖A_ij‖_∞^ω < ‖A_ii‖_∞^ω.

An equivalent definition is Σ_{j≠i} ‖(A_ii)^{-1} A_ij‖_∞^ω < 1.

THEOREM 2. If A is block diagonally dominant, then the block Jacobi method (6) converges.

PROOF. From the definition of the iteration matrix H,

H_ij = −(A_ii)^{-1} A_ij,     i ≠ j;

then

Σ_{j≠i} ‖H_ij‖_∞^ω = Σ_{j≠i} ‖(A_ii)^{-1} A_ij‖_∞^ω < 1.

Also H_ii = 0; then we can write ‖H‖_∞^ω < 1, and from Theorem 1 we conclude that ρ(H) < 1, and hence the theorem is proved. ∎
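The chain of the proof (block dominance gives ‖H‖_∞^ω < 1, which gives ρ(H) < 1) can be checked numerically, as in the following sketch with toy block sizes and the unweighted case ω = 1:

```python
import numpy as np

# Verify Theorem 2 on a random block tridiagonal A (toy sizes; w = 1, i.e.
# plain infinity norms). Dominance of the blocks forces rho(H) < 1.
m, K = 3, 4
rng = np.random.default_rng(3)
diag = [np.eye(m) * 5 + rng.uniform(-1, 1, (m, m)) for _ in range(K)]
off = {(i, i + 1): rng.uniform(-0.3, 0.3, (m, m)) for i in range(K - 1)}
off |= {(i + 1, i): rng.uniform(-0.3, 0.3, (m, m)) for i in range(K - 1)}

# Block diagonal dominance in its equivalent form:
#   sum_{j != i} ||(A_ii)^{-1} A_ij||_inf < 1 for every block row i.
dominant = all(
    sum(np.linalg.norm(np.linalg.solve(diag[i], off[i, j]), np.inf)
        for j in (i - 1, i + 1) if (i, j) in off) < 1
    for i in range(K))

# Iteration matrix of (6): H_ij = -(A_ii)^{-1} A_ij for j != i, H_ii = 0.
H = np.zeros((m * K, m * K))
for (i, j), Aij in off.items():
    H[i*m:(i+1)*m, j*m:(j+1)*m] = -np.linalg.solve(diag[i], Aij)

rho_H = np.max(np.abs(np.linalg.eigvals(H)))
print(dominant, rho_H < 1)   # both True: dominance implies convergence
```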

5. NUMERICAL RESULTS

The efficiency of a parallel algorithm is judged by its ability to provide significant speed gains as more processors participate in the solution of a problem. Table 1 gives the solution time as a function of the number of processors (p is the number of processors and N is the number of equations in the system solved). Four different systems were considered, but all with the same

TABLE 1
SOLUTION TIME (SEC), BANDWIDTH = 71

p \ N     486      972      1458     1944
 2        68.2     93.7     -        -
 4        12.6     42.5     -        -
 6         3.3     21.4     108.6    -
 8         -        8.7      74.3    138.2
10         -        2.6      58.2    104.7
12         -        -        36.6     79.1
14         -        -        20.1     63.4
16         -        -        12.4     52.1


bandwidth (w = 71). It is interesting to note here that, unless the number of iterations is very large (say over 100), the solution times are not affected significantly. This is because the forward elimination, which is performed only once, is computationally much more expensive than the back substitution, which is performed in every iteration. In all cases presented in Table 1, back substitution took only a fraction of a second (less than 4 sec for the 10 to 15 iterations of the most computationally intensive cases, such as 1944 equations on 12 processors). The observation of Table 1 may lead to the wrong conclusion that the speed-up and efficiency of the method are higher than they actually are. The parallel algorithm involves more calculations than the direct one-processor method for the same size system. As a result, some systems are solved faster with one processor than with two or sometimes four processors. In the example of 972 equations, the solution of the system with 14 processors is over one hundred times faster than the solution with four processors; however, it is only 15 times faster than the direct solution with one processor. Therefore, the speed-up provided by 14 processors is approximately 15, where the speed-up is defined as the ratio s = T_1/T, where T_1 is the total time it takes for the best direct sequential method to solve the system on one processor and T = T_fw + T_bs · N_IT is the total time spent in one processor in the parallel algorithm; here T_fw is the time spent for forward elimination, T_bs is the time spent for back substitution, and N_IT is the number of iterations performed to obtain the solution within a specific tolerance. The efficiency of the algorithm is defined mathematically as e = s/p, where s is the speed-up and p is the number of processors. It is clear that the efficiency of the proposed algorithm is very low for a small number of processors, but it reaches values as high as 0.8 and 0.9 when 16 processors are used.
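The speed-up and efficiency definitions reduce to a few lines; the timing values below are hypothetical placeholders chosen for illustration, not measurements from Table 1:

```python
# s = T_1 / T with T = T_fw + T_bs * N_IT, and e = s / p.
def speedup(t_seq, t_fw, t_bs, n_it):
    """Ratio of best sequential time to parallel time per processor."""
    return t_seq / (t_fw + t_bs * n_it)

def efficiency(s, p):
    """Speed-up per processor."""
    return s / p

# Hypothetical timings: forward elimination dominates, so extra iterations
# (each one back substitution) barely move the total, as noted above.
s = speedup(t_seq=38.0, t_fw=2.0, t_bs=0.05, n_it=12)
print(round(s, 1), round(efficiency(s, p=16), 2))   # 14.6 0.91
```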

6. CONCLUSION

A parallel algorithm to solve banded systems of equations is presented in this study. The algorithm is based on the theory of matrix multisplitting. It is iterative in nature and implemented on a pipeline architecture consisting of 16 transputers. Since the number of available processors is small, the proposed algorithm is presented as a two-stage splitting, which also performs best with thin systems (small bandwidth compared to size). Theoretical convergence criteria were presented and the results of a numerical experiment were discussed. It is concluded that the proposed algorithm can use parallel architectures very efficiently.


REFERENCES

1. K. Datta, Parallel complexities and computations of Cholesky's decomposition and QR factorization, Internat. J. Comput. Math. 18:67-82 (1985).
2. I. Gohberg, T. Kailath, I. Koltracht, and P. Lancaster, Linear complexity parallel algorithms for linear systems of equations with recursive structure, Linear Algebra Appl. 88/89:271-315 (1987).
3. S. L. Johnson, Solving tridiagonal systems on ensemble architectures, SIAM J. Sci. Statist. Comput. 8:354-392 (1987).
4. S. L. Johnson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distr. Comput. 4:133-172 (1987).
5. Y. Saad and M. H. Schultz, Parallel direct methods for solving banded linear systems, Linear Algebra Appl. 88/89:623-650 (1987).