Convergence Analysis for a Parallel Jacobi Algorithm

M. El-Attar
Department of Engineering Mathematics, Faculty of Engineering, Alexandria University, Alexandria, Egypt

Transmitted by Melvin Scott
ABSTRACT

An iterative parallel scheme for the solution of banded systems of equations is introduced. It is based on the implementation of the theory of matrix splitting. A convergence analysis for the proposed algorithm is carefully introduced, and the speed-up of the parallel implementation is studied.

1. INTRODUCTION
The solution of large systems of linear or linearized equations is one of the central problems in engineering applications and computational mathematics. Systems that have a banded symmetric coefficient matrix, usually resulting from finite element modeling, are of great interest; however, banded nonsymmetric systems are also common in the analysis of geomaterials modeled with nonassociative plasticity. The solution of these systems on conventional computer architectures is an algorithmically simple problem, but it suffers from extensive time and memory limitations. Parallel and distributed computation is currently an area of intense research activity; the availability of powerful parallel computers is generating interest in the development of new distributed algorithms in order to solve problems that were not addressed in the past due to cost and speed constraints. Recent studies [1-5] have addressed the generation of parallel schemes for the solution of large systems of linear equations. In this paper, an iterative parallel block Jacobi-type algorithm for solving large banded linear systems is introduced. The algorithm is a two-stage
iterative type based on the theory of matrix multisplitting. A theoretical convergence analysis for the proposed algorithm is described. The present algorithm is tested for the solution of banded systems on a small transputer unit (a transputer is a microprocessor with its own local memory and with links that can connect one transputer to another). The architecture used in this study is based on a mesh connection of 16 transputers, as shown in Figure 1. Processors 1 and 16 are connected to an IBM AT, which acts as a controller that downloads instructions and data to the transputer unit. From a practical point of view, this connection represents a grid, a hypercube, a pipeline, and a ring. The pipeline configuration is the one used to implement the algorithm of this paper. The results for different linear systems with the same bandwidth are then carefully discussed.

2. DESCRIPTION OF THE ALGORITHM

Consider the linear system of equations

Ax = b,   (1)

where A is a banded matrix of size (n × n) and bandwidth w; x and b are the unknown and known vectors, respectively.
FIG. 1. Transputer mesh architecture (transputer card in a PC).
Based on the multisplitting of the matrix, an iterative method is presented that is structured so that operations can be performed in parallel.
DEFINITION. Let A, B_k, C_k, and D_k be (n × n) matrices, k = 1, 2, ..., K. Then (B_k, C_k, D_k) is called a multisplitting of A if

(i) A = B_k - C_k, k = 1, 2, ..., K, where each B_k is invertible, and
(ii) Σ_k D_k = I, where the matrices D_k are diagonal and D_k ≥ 0.

Using (i) above, (1) may be written as

B_k x = C_k x + b,   k = 1, 2, ..., K,   (2)

or

x = (B_k)^{-1} C_k x + (B_k)^{-1} b,   k = 1, 2, ..., K.   (3)

Using the weighting matrices D_k, these K sets of equations can be combined as

x = Σ_k D_k (B_k)^{-1} C_k x + Σ_k D_k (B_k)^{-1} b.   (4)

Equation (4) yields the following algorithm. Choose x^0 arbitrarily, and then for i = 0, 1, 2, ... until convergence,

x^{i+1} = H x^i + G b,   (5)

where

H = Σ_k D_k (B_k)^{-1} C_k,   G = Σ_k D_k (B_k)^{-1}.
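To make the definition concrete, here is a minimal numerical sketch of iteration (5). It is not the paper's algorithm: the 4 × 4 test matrix, the triangular choice of the B_k, and the weighting matrices are all illustrative (assuming numpy); the paper's own two-stage choice of (B_k, C_k, D_k) is given below.

import numpy as np

# Illustrative multisplitting of a small banded matrix with K = 2.
A = np.array([[ 4., -1.,  0.,  0.],
              [-1.,  4., -1.,  0.],
              [ 0., -1.,  4., -1.],
              [ 0.,  0., -1.,  4.]])
b = np.array([1., 2., 2., 1.])

B = [np.tril(A), np.triu(A)]          # two invertible choices of B_k
C = [Bk - A for Bk in B]              # condition (i): A = B_k - C_k
D = [np.diag([1., 1., 0., 0.]),       # condition (ii): diagonal, >= 0,
     np.diag([0., 0., 1., 1.])]       # and D_1 + D_2 = I

H = sum(Dk @ np.linalg.inv(Bk) @ Ck for Dk, Bk, Ck in zip(D, B, C))
G = sum(Dk @ np.linalg.inv(Bk) for Dk, Bk in zip(D, B))

x = np.zeros(4)
for i in range(200):                  # iteration (5): x^{i+1} = H x^i + G b
    x_new = H @ x + G @ b
    if np.max(np.abs(x_new - x)) < 1e-12:
        break
    x = x_new
print(x, np.allclose(A @ x, b))       # the fixed point solves Ax = b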
The algorithm of (5) is implemented using a two-stage splitting, which is described as follows. The system (1) can be written in the submatrix form

\begin{bmatrix}
A_{11} & A_{12} &        &        &           &        \\
A_{21} & A_{22} & A_{23} &        &           &        \\
       & A_{32} & A_{33} & A_{34} &           &        \\
       &        & \ddots & \ddots & \ddots    &        \\
       &        &        &        & A_{K,K-1} & A_{KK}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_K \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_K \end{bmatrix},

such that the diagonal submatrices A_{kk} are of the same size (m × m) when k is odd and (q × q) when k is even, as in Figure 2. Then the set of multisplitting matrices are
C_k =
\begin{bmatrix}
I' - A_{11} & -A_{12}      &            &        &            &              \\
-A_{21}     & I'' - A_{22} & -A_{23}    &        &            &              \\
            & \ddots       & \ddots     & \ddots &            &              \\
            &              & -A_{k,k-1} & 0      & -A_{k,k+1} &              \\
            &              &            & \ddots & \ddots     & \ddots       \\
            &              &            &        & -A_{K,K-1} & I'' - A_{KK}
\end{bmatrix},

B_k =
\begin{bmatrix}
I' &     &        &        &        &     \\
   & I'' &        &        &        &     \\
   &     & \ddots &        &        &     \\
   &     &        & A_{kk} &        &     \\
   &     &        &        & \ddots &     \\
   &     &        &        &        & I''
\end{bmatrix},
\qquad
D_k =
\begin{bmatrix}
0 &        &     &        &   \\
  & \ddots &     &        &   \\
  &        & S_k &        &   \\
  &        &     & \ddots &   \\
  &        &     &        & 0
\end{bmatrix},

where A_{kk} and S_k occupy the kth diagonal block position, I' and I'' are identity matrices of size (m × m) and (q × q), respectively, K is an even number, and

S_k = I' for k odd,   S_k = I'' for k even.

(B_k, C_k, D_k) is a multisplitting of A since

(a) A = B_k - C_k, k = 1, 2, ..., K, as long as each B_k is invertible, or equivalently each A_{kk} is invertible, which is the case with matrices that result from finite element problems, and
(b) Σ_k D_k = I.
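As a check on this construction, the following sketch (assuming numpy; the sizes, the random test matrix, and all names are illustrative) assembles (B_k, C_k, D_k) as defined above for a small block tridiagonal matrix and verifies properties (a) and (b).

import numpy as np

# K = 4 diagonal blocks of alternating sizes: m for odd k, q for even k.
rng = np.random.default_rng(0)
m, q, K = 3, 2, 4
sizes = [m if k % 2 == 0 else q for k in range(K)]   # 0-based: k=0 is "k=1, odd"
offs = np.cumsum([0] + sizes)                        # block offsets
n = offs[-1]

A = np.eye(n) * 10.0                                 # strong diagonal: A_kk invertible
for k in range(K - 1):                               # random tridiagonal couplings
    A[offs[k]:offs[k+1], offs[k+1]:offs[k+2]] = rng.uniform(-1, 1, (sizes[k], sizes[k+1]))
    A[offs[k+1]:offs[k+2], offs[k]:offs[k+1]] = rng.uniform(-1, 1, (sizes[k+1], sizes[k]))

B, C, D = [], [], []
for k in range(K):
    sl = slice(offs[k], offs[k+1])
    Bk = np.eye(n)                                   # identities off the kth block
    Bk[sl, sl] = A[sl, sl]                           # A_kk in the kth diagonal block
    Dk = np.zeros((n, n))
    Dk[sl, sl] = np.eye(sizes[k])                    # S_k picks out block k
    B.append(Bk); C.append(Bk - A); D.append(Dk)

assert all(np.allclose(A, Bk - Ck) for Bk, Ck in zip(B, C))   # property (a)
assert np.allclose(sum(D), np.eye(n))                          # property (b)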
Now that the matrices H and G are defined, the algorithm takes the following specific form. Choose x^0 arbitrary, e.g., x^0 = [0, 0, ..., 0]^T. Introduce subsequently the following iterative scheme. For i = 0, 1, 2, ... until convergence:

x^{i+1} = H x^i + G b.   (6)

For the subsystem k, this has the form

(x_k)^{i+1} = -(A_{kk})^{-1} A_{k,k-1} (x_{k-1})^i - (A_{kk})^{-1} A_{k,k+1} (x_{k+1})^i + (A_{kk})^{-1} b_k,   (7)
or

A_{kk} (x_k)^{i+1} = b_k - A_{k,k-1} (x_{k-1})^i - A_{k,k+1} (x_{k+1})^i.

This method is essentially a tridiagonal Jacobi.

FIG. 2. Two-stage multisplitting (label in figure: bandwidth).

3. PARALLEL IMPLEMENTATION OF THE ALGORITHM

The above scheme can be implemented in a group of (K/2) processors.
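For concreteness, here is a minimal serial sketch of the subsystem update (7), assuming numpy; the data layout (lists diag, lower, upper holding the diagonal, subdiagonal, and superdiagonal blocks) and all names are illustrative, not the paper's code.

import numpy as np

def jacobi_sweep(diag, lower, upper, rhs, x):
    # One pure Jacobi sweep of (7): every new block solution uses only
    # the previous iterate of its two neighboring blocks.
    x_new = []
    for k in range(len(diag)):
        r = rhs[k].copy()
        if k > 0:
            r -= lower[k-1] @ x[k-1]                # A_{k,k-1} (x_{k-1})^i
        if k < len(diag) - 1:
            r -= upper[k] @ x[k+1]                  # A_{k,k+1} (x_{k+1})^i
        x_new.append(np.linalg.solve(diag[k], r))   # apply (A_kk)^{-1}
    return x_new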
Each subsystem is transferred through the links between processors until each processor has two subsystems; i.e., processor 1 has subsystems 1 and 2, while processor K/2 has subsystems K-1 and K. All processors perform the first step of Gaussian elimination. For example, processor k brings the matrices A_{2k-1,2k-1} and A_{2k,2k} to upper triangular form and updates accordingly the matrices A_{2k-1,2k-2}, A_{2k-1,2k}, A_{2k,2k-1}, and A_{2k,2k+1} and the vectors b_{2k-1}, b_{2k}. This is computationally the most expensive operation, but it is performed only once. Back substitution is performed in parallel in all processors for the odd-numbered subsystems. The even-numbered subsystems receive the solution from the odd-numbered ones. In parallel, each processor updates the right-hand side vector of the even-numbered subsystem; then back substitution is performed in parallel in all processors for only the even-numbered subsystems. The solutions of the even-numbered subsystems are communicated to the odd-numbered ones, and their right-hand side vectors are updated. This is the end of the first iteration, where each processor obtains part of the solution vector. A convergence check after each iteration, based on the criterion ||x^i - x^{i-1}||_∞ < Tolerance, is carried out, where x^i is the solution of the ith iteration and || · ||_∞ is the infinity norm. If this is true, the process is terminated. In this parallel implementation, two serial stages are combined in order to reduce the number of calculations in a possible one-stage algorithm.
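The following is a serial sketch of this schedule (assuming numpy and scipy, which are not part of the paper; the block data layout matches the previous sketch). LU factorization of each A_kk stands in for the one-time forward elimination, and the odd/even stages mirror the communication order above, so the even-numbered blocks use the freshly computed odd solutions.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def two_stage_jacobi(diag, lower, upper, rhs, tol=1e-10, max_iter=500):
    K = len(diag)
    factors = [lu_factor(Akk) for Akk in diag]    # "forward elimination", done once
    x = [np.zeros(len(bk)) for bk in rhs]
    for i in range(max_iter):
        x_old = [xk.copy() for xk in x]
        for stage in (0, 1):                      # stage 0: odd blocks (1, 3, ...)
            for k in range(stage, K, 2):          # stage 1: even blocks (2, 4, ...)
                r = rhs[k].copy()
                if k > 0:
                    r -= lower[k-1] @ x[k-1]      # even stage sees new odd values
                if k < K - 1:
                    r -= upper[k] @ x[k+1]
                x[k] = lu_solve(factors[k], r)    # back substitution each iteration
        if max(np.max(np.abs(xk - xo)) for xk, xo in zip(x, x_old)) < tol:
            break                                 # ||x^i - x^{i-1}||_inf < Tolerance
    return x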
4. CONVERGENCE ANALYSIS

In the sequence x^{i+1} = H x^i + G b, H is called the iteration matrix. The necessary and sufficient condition of convergence is that all of the eigenvalues of H have a magnitude smaller than one, i.e., that the spectral radius ρ(H) is smaller than one. This is the most general possible convergence condition, but it is of limited use because the eigenvalues of H are rarely known exactly. Thus, a more refined tool is introduced, based on a suitable distance function (norm) from the desired point of convergence. Convergence is then guaranteed if each iteration reduces the value of this norm.
DEFINITION. Given a vector ω > 0, we define the weighted maximum norm || · ||_∞^ω by ||x||_∞^ω = max_i |x_i / ω_i|. The vector norm || · ||_∞^ω induces a matrix norm, also denoted by || · ||_∞^ω, defined as

||A||_∞^ω = max_{x ≠ 0} ||Ax||_∞^ω / ||x||_∞^ω,   (8)

where A is an n × n matrix.

An alternative but equivalent definition of this norm is based on assuming ||x||_∞^ω = 1; then

||A||_∞^ω = max_i (1/ω_i) Σ_j |a_ij| ω_j,

where a_ij are the entries of A.
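These definitions are easy to exercise numerically. A small sketch (assuming numpy; the helper names are mine) of the vector norm, the row formula, and a sampled check that the row formula bounds the ratio in (8):

import numpy as np

def wmax_vec(x, w):
    # ||x||_inf^w = max_i |x_i / w_i|, for a weight vector w > 0
    return np.max(np.abs(x / w))

def wmax_mat(A, w):
    # induced norm via the row formula: max_i (sum_j |a_ij| w_j) / w_i
    return np.max((np.abs(A) @ w) / w)

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
w = rng.uniform(1, 2, size=4)
ratios = [wmax_vec(A @ x, w) / wmax_vec(x, w)
          for x in rng.normal(size=(1000, 4))]
print(max(ratios) <= wmax_mat(A, w) + 1e-12)   # True: no ratio exceeds the norm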
THEOREM 1. If A is an n × n nonnegative matrix, then for every ε > 0 there exists some ω > 0 such that

ρ(A) ≤ ||A||_∞^ω ≤ ρ(A) + ε.

PROOF. Let λ be an eigenvalue of A such that |λ| = ρ(A). Let x ≠ 0 be an eigenvector of A corresponding to the eigenvalue λ, normalized so that ||x||_∞^ω = 1. Then

||A||_∞^ω ≥ ||Ax||_∞^ω = ||λx||_∞^ω = |λ| = ρ(A).

From the above theorem, we can state that if A is a square nonnegative matrix such that ρ(A) < 1, then there exists some ω > 0 such that ||A||_∞^ω < 1. ∎
DEFINITION. A square block matrix A with blocks A_ij is (block) diagonally dominant if

Σ_{j≠i} ||A_ij||_∞^ω < ||A_ii||_∞^ω.

An equivalent definition is Σ_{j≠i} ||(A_ii)^{-1} A_ij||_∞^ω < 1.
THEOREM 2. If A is block diagonally dominant, then the block Jacobi method (6) converges.

PROOF. From the definition of the iteration matrix H,

H_ij = -(A_ii)^{-1} A_ij,   i ≠ j;

then

Σ_{j≠i} ||H_ij||_∞^ω = Σ_{j≠i} ||(A_ii)^{-1} A_ij||_∞^ω < 1.

Also, H_ii = 0; then we can write ||H||_∞^ω < 1, and from Theorem 1 we conclude that ρ(H) < 1, and hence the theorem is proved. ∎
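Theorem 2 suggests a simple a priori convergence test. The following sketch (an illustrative helper, not from the paper; assuming numpy) checks the equivalent condition Σ_{j≠i} ||(A_ii)^{-1} A_ij||_∞ < 1 block row by block row, using the unweighted infinity norm (ω = 1):

import numpy as np

def block_diag_dominant(A, offs):
    # offs: block offsets, e.g. [0, 3, 5, 8] for three blocks
    K = len(offs) - 1
    for i in range(K):
        si = slice(offs[i], offs[i+1])
        Aii_inv = np.linalg.inv(A[si, si])
        total = 0.0
        for j in range(K):
            if j == i:
                continue
            sj = slice(offs[j], offs[j+1])
            total += np.linalg.norm(Aii_inv @ A[si, sj], np.inf)
        if total >= 1.0:
            return False
    return True   # block Jacobi iteration (6) converges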
5. NUMERICAL RESULTS

The efficiency of a parallel algorithm is judged by its ability to deliver significant speed gains as more processors participate in the solution of a problem. Table 1 gives the solution time as a function of the number of processors (p is the number of processors and N is the number of equations in the system solved). Four different systems were considered, but all with the same bandwidth (w = 71).
TABLE 1
SOLUTION TIME (SEC), BANDWIDTH = 71

p/N      486      972     1458     1944
 2      68.2     93.7      -        -
 4      12.6     42.5      -        -
 6       3.3     21.4    108.6      -
 8        -       8.7     74.3    138.2
10        -       2.6     58.2    104.7
12        -        -      36.6     79.1
14        -        -      20.1     63.4
16        -        -      12.4     52.1
It is interesting to note here that, unless the number of iterations is very large (say over 100), the solution times are not affected significantly by the iteration count. This is because the forward elimination, which is performed only once, is computationally much more expensive than the back substitution, which is performed in every iteration. In all cases presented in Table 1, back substitution took only a fraction of a second (less than 4 sec in total for the 10 to 15 iterations of the most computationally intensive cases, such as 1944 equations on 12 processors). The observation of Table 1 may lead to the wrong conclusion that the speed-up and efficiency of the method are higher than they actually are. The parallel algorithm involves more calculations than the direct one-processor method for the same size system. As a result, some systems are solved faster with one processor than with two or sometimes four processors. In the example of 972 equations, the solution of the system with 14 processors is over one hundred times faster than the solution with four processors; however, it is only 15 times faster than the direct solution with one processor. Therefore, the speed-up provided by 14 processors is approximately 15, where the speed-up is defined as the ratio s = T_s/T. Here T_s is the total time it takes for the best direct sequential method to solve the system on one processor, and T = T_fw + T_bs · N_IT is the total time spent in one processor in the parallel algorithm, where T_fw is the time spent for forward elimination, T_bs is the time spent for back substitution, and N_IT is the number of iterations performed to obtain the solution within a specific tolerance. The efficiency of the algorithm is defined mathematically as e = s/p, where s is the speed-up and p is the number of processors. It is clear that the efficiency of the proposed algorithm is very low for a small number of processors, but it reaches values as high as 0.8 and 0.9 when 16 processors are used.
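A small worked example of these definitions (all timings below are illustrative placeholders, not measurements from the paper):

# speed-up s = T_s / (T_fw + T_bs * N_IT), efficiency e = s / p
T_s  = 60.0          # best sequential direct solve on one processor (sec)
T_fw = 50.0          # forward elimination, performed once (sec)
T_bs = 0.3           # one back substitution sweep (sec)
N_IT = 12            # iterations to reach the tolerance
p    = 14            # processors

T = T_fw + T_bs * N_IT       # total parallel time per processor
s = T_s / T                  # speed-up
e = s / p                    # efficiency
print(f"s = {s:.2f}, e = {e:.2f}")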
6. CONCLUSION
A parallel algorithm to solve banded systems of equations is presented in this study. The algorithm is based on the theory of matrix multisplitting. It is iterative in nature and is implemented on a pipeline architecture consisting of 16 transputers. Since the number of available processors is small, the proposed algorithm is presented as a two-stage splitting, which also performs best with thin systems (small bandwidth compared to size). Theoretical convergence criteria were presented, and the results of a numerical experiment were discussed. It is concluded that the proposed algorithm can use parallel architectures very efficiently.
REFERENCES

1. K. Datta, Parallel complexities and computations of Cholesky's decomposition and QR factorization, Internat. J. Comput. Math. 18:67-82 (1985).
2. I. Gohberg, T. Kailath, I. Koltracht, and P. Lancaster, Linear complexity parallel algorithms for linear systems of equations with recursive structure, Linear Algebra Appl. 88/89:271-315 (1987).
3. S. L. Johnson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distr. Comput. 4:133-172 (1987).
4. S. L. Johnson, Solving tridiagonal systems on ensemble architectures, SIAM J. Sci. Statist. Comput. 8:354-392 (1987).
5. Y. Saad and M. H. Schultz, Parallel direct methods for solving banded linear systems, Linear Algebra Appl. 88/89:623-650 (1987).