Parallel Computing 1 (1984) 325-330 North-Holland
325
Short Communication
Parallel marching Poisson solvers Marian VAJTERSIC Institute of Technical Cybernetics, Slovak Academy of Sciences, Bratislava, Czechoslovakia Received March 1984 Abstract. The paper presents parallel algorithms for solving Poisson equation a t N 2 mesh points. The methods based on marching techniques are structured for efficient parallel realization. Using orthogonal decomposition properties of arising matrices, the algorithms can be formulated in terms of transformed vectors. On a MIMD computer with not more than N processors, the computations can be performed in horizontal slices with minimal synchronization requirements. Considering an SIMD machine with N 2 processors, the complexity bound O(Iog N) has been achieved, whereby the single marching requires 10 log N steps only.
Keywords. Parallel algorithms, Poisson solvers, MIMD computer, SIMD computer.
1. Introduction There are numerous problems in diverse scientific applications, where the solution of Poisson equation is required. The problem is to seek a function u(x, y) which satisfies for given functions f(x, y) and g(x, y) Au(x, y ) = f ( x , y), u(x, y) = g(x, y),
(x, y ) ~ R (x, y) ~ OR
(1)
on the unit square R. The linear system of equations which results from the approximation of (1) at N 2 mesh points is of the form Mu = w .
(2)
Here, M is a block-tridiagonal matrix with N tridiagonal blocks A =(-1,4,
-1)
(3)
of order N on the main diagonal and with negative identity matrices on the both subdiagonals. This problem is time-consuming for large N and, therefore, there are serious endeavours to solve it on parallel computers. As a consequence, a number of procedures has been developed for the parallel solution of this equation (see e.g. [5,10]). Most recently two parallel variants of the cyclic odd-even reduction algorithm have been run and compared on CRAY-1 and ICL DAP computers [7] and the matrix decomposition algorithm was implemented with a considerable speed up ratio on the multiproeessor EGPA [12]. Although the marching methods belong to the class of fastest sequential Poisson solvers, no 0167-8191/84/$3.00 © 1984, E.Isevier Science Publishers B.V. (North-Holland)
326
M. VajterJic / Parallel marching Poisson solvers
parallel algorithms using this principle have been formulated so far. Due to the recurrent computation with the tridiagonal matrices in the forward and backward marching through system [2], the single marching [1] can not be parallelized straightforwardly. In the generalized (multiple) marching algorithm [2], there are also marching computations on subregions included and moreover, the evaluation of vectors of different reduction levels is strictly sequential. Although the vectors on the same level can be computed in parallel, each of them results from a recursive solution of a sequence of tridiagonal linear systems. Our aim is to overcome these obstacles and to propose such variants of the marching algorithms which allow to consider them as valuable candidates also for parallel processing. In the next section of this paper, a modification of the basic marching principle is suggested. There, the vector recurrencies are computed with diagonal matrices only and the dense system of equations is solved in a diagonalized form. The simplified form of the algorithm is a result of exploiting orthogonal decomposition properties of the modified Chebyshev polynomials. For the purpose of using only diagonal matrices in a computational process, two orthogonal transforms of real vectors with N 2 components are to be performed before and after the marching computations. Since we do not discuss any details of parallel machines organization we shall adopt model computer systems of the SIMD and MIMD types, as described in [10] and [8], respectively. Thus, on a MIMD computer with p = N processors all computational phases of the algorithms (except the final transform) can be performed without any interaction between processors. Hence, the complexity of the algorithms is reduced by a factor of :p. For p close to N, the cost of the transforms (performed as classical matrix by matrix multiplication) is O(N2), while the marching process requires O(N) operations only. However, the main advantage of this approach is that the computational process can be represented as a string of linearly ordered tasks. These can be divided into independently solvable parts, according to the number of processors. It is obvious that these algorithms would yield best results on computers where the cost of synchronization is of dominant influence on implementation effectivity. The hierarchical multiprocessor EGPA [4] could be serve as a typical example of such computer. For a computer of the SIMD type a direct parallelization of the marching algorithms would employ N processors in respectively O(N log N) and O(N log2N) steps for single and multiple marching. Our modifications of the algorithms allow to use N 2 processors and to achieve complexity O(log N). Three lemmas will be of importance in order to give precise estimations of the parallel arithmetical complexity. Lemma 1. The Fourier transform of N real data can be evaluated on N processors on 3 log N steps [10].
Lemma 2. Any tridiagonal Toeplitz symmetric system of N equations can be solved with N processors in 6 log N steps [10]. Lemma 3. Any ruth order recurrent relation of n equations can be computed in (log m + 2) log n + O(log2m) steps on 3 m 2 n / 4 + O ( m n ) processors [3].
2. Parallel marching algorithm The marching algorithm for sequential solution of the Poisson equation with a complexity O(N 2) was proposed in [1]. The first phase of the algorithm is the forward marching through the system (2), expressed by Z N _ I = --WN~ Z N _ 2
= A z N - 1 - wN-1, ZN-a = A z N - i + I - zN-~+2,
i = 3, 4 . . . . , N.
(4)
327
M. VajterJic / Parallel marching Poisson solvers
Denoting by S ~°, i = 0 . . . . . N the Chebyshev polynomials of the first kind in A, i.e. St°)=I, S°)=A,
s
i = 2 .... ,N,
(5)
the solution u u on the last vertical line of mesh points can be computed by solving (6)
S¢N)UN = - - z o.
The solution of (6) can be carried out by the recurrent solving of a sequence of tridiagonal systems i = 1 . . . . . N,
(A-q[N)I)xi=x,_,,
(7)
where qtiN) = 2 c o s ( i l r / N + 1) and X o = - z
o, X N = U N. The remaining blocks of the solution vector u result from the
backward marching UN_ 1 = A u N - wN, UN_ i = AUN_i+ 1 -- UN_i+ 2 -- WN_i+I,
i = 2 , . . . , N -- 1.
(8)
The formulae (4) and (8) represent vector recurrencies with the tridiagonal coefficient matrix A. Recurrent relations of this type are not well suited for parallel evaluation. Moreover, the solution of (6), carried out by (7), requires on the basis of Lemma 2 0 ( N log N) steps with N processors. In order to arrange these computations in a simpler manner, the matrix A can be considered in the decomposed form (9)
A = QTDQ,
where Qij=~12/(N+l)
sin(/j'n/N+l),
i, j = l . . . . . N
and D=diag(dl,
d z ..... dN),
di=4-2cos(i~r/N+
l),
i--1 ..... N
are explicitly known. Using (9), the matrices S can be expressed as S (o) = QTIQ, s (1) = QTDQ, S(i)
= QTDQ ( QT ~ ( i - a)o
-
oT~bO'-2)O)
Q T ( D q S ( i - 1 ) _ qbci-2))Q _ OT#(i)O,
=
i= 2 ..... N,
(10)
where q5(i) contains eigenvalues of S C°. In terms of (10), the three phases of the marching algorithm can be performed in terms of vectors transformed by Q. From the transformed right-hand side ~,,=Ow,,
i--1,2 ..... N
(11)
the vectors ~ can be evaluated by ~'N-1 = --WN, ZN-2 = Q ( Q D Q ( Q ~ N - 1 ) - Q ~ N - a ) = D~.N-a -- ~ N - a ~'N-i = Q ( Q D Q ( Q~.N-i+a) - Q~.N-i+ R - Q~N-~+a) =
D~.N-i+2 -- Z'N-a+2 -- WN-i+I,
i = 3 ..... N,
(12)
on the ground of the orthogonality and symmetry of Q. The Nth block of the solution vector u in the transformed form can be obtained from QT~b(N)Q(QfiN) = - Qr.o,
fix
=
- -
(~(N))--120"
(13)
328
M. VajterJic
/ Parallel marrying Poisson solvers
The final marching (8) can be expressed as FtN_1=DF~N--WN,
FtN_i=DBN_i+l--FtN_i+2--i-ON_i+l,
i=2 ..... N-1.
(14)
The required solution of (2) is then obtained by the back-transforming u i = Qg~i,
(15)
i = 1 ..... N.
Thus, the formulae (4)-(8) of the original algorithm can be replaced by (11)-(15). Since operations with the matrices A and S (~¢) do not occur in the modified process, it renders a change to implement this algorithm on parallel computers effectively. For a computer of the M I M D type consisting of p ~
P2 ~'~ P4
Q
wl
t tl .
w~
-
_
F"
,,
--
1
T
--tttttt-t-t-Ht Q
ui
/
L
ui
II
Fig. 1.
3'1. VajterJic / Parallel marching Poisson solvers
329
only and (14) can be realized analogically as (12), the total number of steps being 12 log N for this algorithm. Hence, this version of marching procedure belongs also to the class of parallel Poisson solvers with asymptotically optimal complexity bound O(log N). The number 12 log N is the same as the result achieved in [10]. This number can be reduced by the underlying on observation that the result z 0 of the forward marching (4) can be expressed as a sum --g O = S ( N - 1 ) W N - F S ( N - 2 ) w N _ I
.q- . . .
+S(°)w1"
The matrices S can be considered in the decomposed form (10) and hence, g0 = computed by --g0 =
~b(N-I)wN q- ~(N-2)~'~N-1"~
"" "
+Dw2 + wl-
(16)
Qzo can be (17)
Considering N 2 processors, the products in (17) can be obtained in 1 step. The sum of N
vectors can be calculated with the same number of processors by the log sum procedure in log N steps. Thus, 2 log N steps are saved and the complexity estimation 10 log N has been achieved. It appears to be the best result known so far for solving the Poisson equation, considering a SIMD computer with N 2 processors.
3. Concluding remarks Since the analysis of the marching has found it to be stable on most computers for N = 8 only, its stabilized variants have been developed [2]. One of them, the multiple marching, is tree structured and proceeds in 2 log N computational levels. Vectors of a given level depend upon vectors of preceding levels, i.e. a strong sequential dependence exists in the top-down direction of the computational process. A parallelism occurs in the horizontal direction only, i.e. by evaluating vectors within frame of one level. Decomposing the Chebyshev polynomials of the first and second types, this algorithm can be also reformulated for orthogonally transformed vectors. Since the computation can proceed by evaluating horizontal slices of vectors, it can be divided between processors of a MIMD machine proportionally, taking advantage of the independent performance. For a SIMD computer, with N 2 processors, the algorithm becomes extremely simple and requires 13 log N steps only. In contrast, the parallelization of the original method would give the complexity result O(N log2N), whereby only N processors can participate in computations: To achieve even faster methods the marching principle can be efficiently combined with other fast Poisson solvers. In [2], an incorporation of the matrix-decomposition algorithm and the cyclic odd-even reduction are described for sequential computation. Although the matrixdecomposition method in self is well suited for parallel computers [10,12], the recursive nature of marching makes it difficult to develop a fast parallel algorithm based on the combination of both methods. However, using the idea of orthogonal decomposition, the recurrent dependence of computation can be simplified. Analogically as in the previous marching algorithms, this version supports an effective MIMD implementation. In order to get the solution of (2) on a SIMD machine with N 2 processors, 12 log N steps are required. We note that also other fast Poisson solvers (e.g. FACR and Buneman's algorithms [6]) can be modified in the same way. The algorithm of the complete cyclic odd-even reduction is proposed in [9] for the slice-wise execution. The SIMD version of this algorithm requires also 12 log N steps.
330
M. Vajter~ic / Parallel marching Poisson solvers
References [1] R.E. Bank and D.J. Rose, An O(n 2) method for solving constant coefficient boundary value problems in two-dimensions, SIAM J. Numer. Anal. 12 (1975) 529-540. [2] R.E. Bank and D.J. Rose, Marching algorithms for elliptic boundary value problems. I: The constant coefficient case, SIAM J. Numer. Anal. 14 (1977) 792-829. [3] S.C. Chen and D.J. Kuck, Time and parallel processor bounds for linear recurrence systems, IEEE Trans. Comput. C-24 (1976) 1023-1029. [4] W. Hltndler, F. Hofmann and H.J. Schneider, A general purpose array with a broad spectrum of applications, in: Proc. of the Workshop on Comp. Arch. (Springer, Berlin, 1975). [5] D.E. Heller, Direct and iterative methods for block tridiagonal linear systems, Rep. Dept. of Comp. Sc., Came#e-Mellon Univ., 1977. [6] R.W. Hockney, Rapid elliptic solvers, in: Numerical Methods in Applied Fluid Dynamics (Academic Press, New York, 1980). [7] R.W. Hockney, Optimizing the FACR(I) Poisson-solver on parallel computers, in: K.E. Batcher, W.C. Meilander and J.L. Potter, eds. Proc. 1982 Int. Conf. on Parallel Processing (IEEE Press, 1982). [8] R.W. Hockney and C.R. Jesshope, Parallel Computers (Adam Hilger, Bristol, 1981). [9] J. Mildolko, R. Klette, M. Vajtergic and I. Vrlo, Fast algorithms and their realization on specialized computers (Veda, Bratislava, 1984) (in Slovak). [10] A.H. Sameh, S.C. Chen and D.L Kuck, Parallel Poisson and biharmonic solvers, Computing 17 (1976) 219-230. [11] M. Vajtergic, An elimination approach for fast parallel solution of the model fourth-order elliptic problem, Comp. Phys. Comm. 26 (1982) 357-361. [12] M. Vajter~ic, Parallel Poisson and biharmonic solvers implemented on the EGPA multiprocessor, in: K.E. Batcher, W.C. Me/lander and J.L. Potter, eds., Proc. 1982 Int. Conf. on Parallel Processing (IEEE Press, 1982).