Comput. Methods Appl. Mech. Engrg. 191 (2002) 1381–1394
www.elsevier.com/locate/cma
Block and full matrix ILU preconditioners for parallel finite element solvers

S.Ø. Wille a,b,*, Ø. Staff a, A.F.D. Loula b

a Faculty of Engineering, Oslo University College, Cort Adelersgate 30, N-0254 Oslo, Norway
b Departamento de Matemática Aplicada e Computacional, Laboratório Nacional de Computação Científica, LNCC/MCT, Avenida Getulio Vargas 333, Quitandinha, Petrópolis, Brazil
Received 19 February 2001
Abstract

Parallel finite element solvers based on ILU preconditionings are developed, implemented and tested in two- and three-dimensional Laplace problems. The computational domain is decomposed into N subdomains for parallel processing. The structure of the parallel computer system consists of the main processor and N satellite processors. Two algorithms are developed: a block ILU preconditioner at the subdomain level, without communication between the satellite processors, and a full matrix ILU preconditioner coupling the subdomain degrees of freedom and requiring communication between the satellite processors. Different node orderings, mesh sizes and numbers of satellite processors are tested. The efficiency of both block and full matrix ILU preconditioners is strongly dependent on the node ordering inside each subdomain. The finite elements in each subdomain must be connected. © 2002 Elsevier Science B.V. All rights reserved.
1. Introduction

In solving large scale systems associated with finite element approximations of partial differential equations, massively parallel computations are necessary to obtain solutions in reasonable computational time [1,2,4,5]. Parallel processing with iterative equation solvers for finite element problems has been the subject of intensive investigation [6–8]. The convergence rate of iterative conjugate gradient solvers can be considerably increased by preconditioning of the equation system. ILU preconditioning [9–11] has proved to be quite efficient. Block ILU preconditionings have been applied to increase the efficiency of parallel iterative solvers [3,12,13]. In previous work [10,11], the node ordering has been shown to be of great significance for the efficiency of ILU preconditioners. With a single processor, the highest efficiency is obtained when the nodes are sorted with respect to some point far away from the computational mesh. The fill-in in the ILU preconditioner from all nodes is then believed to propagate through the equation matrix in an optimal way during the factorization. When implemented serially on a single processor, ILU preconditioners are very efficient in terms of the number of iterations for convergence. In parallel processing, block ILU preconditioners have shown less efficiency in terms of the number of iterations, although they can be very efficient in terms of computational time per iteration. In the present work a full matrix parallel ILU preconditioner is developed. Compared to the block ILU preconditioner, the full matrix parallel ILU preconditioner requires communication between the satellite
* Corresponding author. E-mail address: [email protected] (S.Ø. Wille).
processors to pass information about the matrix and vector coefficients associated with the interfaces of the subdomains, but it also requires fewer iterations for convergence.
2. Test problem

As a model problem we consider the stationary Laplace equation in two or three dimensions
$$\nabla^2 u = 0 \quad \text{in } \Omega$$
with Dirichlet
$$u = u_0 \quad \text{on } \Gamma_u$$
and Neumann
$$\nabla u \cdot n = \sigma \quad \text{on } \Gamma_n$$
boundary conditions, where $\Omega \subset \mathbb{R}^d$, $d = 2$ or $d = 3$, and $\Gamma = \partial\Omega = \Gamma_u \cup \Gamma_n$. Let $U = \{v \in H^1(\Omega);\ v = u_0 \text{ on } \Gamma_u\}$ and $V = \{v \in H^1(\Omega);\ v = 0 \text{ on } \Gamma_u\}$. A variational formulation of our model problem reads: Find $u \in U$ such that, for all $v \in V$, $a(u,v) = f(v)$ with
$$a(u,v) = \int_\Omega \nabla u \cdot \nabla v \, d\Omega, \qquad f(v) = \int_{\Gamma_n} \sigma v \, d\Gamma.$$
This leads to the standard Galerkin finite element method: Find $u_h \in U_h$ such that
$$a(u_h, v_h) = f(v_h) \quad \forall v_h \in V_h,$$
where $U_h \subset U$ and $V_h \subset V$ are appropriate finite element spaces. In particular we consider $C^0$ linear finite element spaces and write
$$u_h = \sum_{i=1}^{N} U_i L_i,$$
yielding $AU = F$ with
$$A_{ij} = a(L_i, L_j), \qquad F_j = f(L_j).$$
To solve the linear system $AU = F$ we will consider preconditioned iterative methods well suited to parallel implementation. Both block ILU and full matrix ILU preconditioners will be developed, combined with the conjugate gradient method.
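To make the solution strategy concrete, the following sketch shows a preconditioned conjugate gradient loop for $AU = F$ in which the preconditioner is applied through a generic callback. The dense NumPy storage and the function names are illustrative assumptions rather than the authors' implementation; the same loop would accept the diagonal, block ILU or full matrix ILU preconditioners discussed in the following sections.

```python
import numpy as np

def preconditioned_cg(A, F, apply_prec, tol=1e-10, max_iter=1000):
    """Conjugate gradient iteration for A U = F, where apply_prec(r)
    returns z = M^{-1} r for a chosen preconditioner M (e.g. diagonal,
    block ILU or full matrix ILU).  A is symmetric positive definite."""
    U = np.zeros_like(F)
    r = F - A @ U                      # initial residual
    z = apply_prec(r)                  # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)          # step length
        U += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p      # new search direction
        rz = rz_new
    return U

# Example usage with the simplest choice, a Jacobi (diagonal) preconditioner:
# U = preconditioned_cg(A, F, lambda r: r / np.diag(A))
```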
3. Fill-in rules for incomplete Gaussian factorization

The fill-in rule for the preconditioner of iterative solvers for general algebraic equations is that fill-in is accepted at locations in the equation matrix where the magnitude of the coefficients is above a predefined limit. The order of fill-in is determined by the size of this limit. This fill-in rule is not applicable to finite element equations. Therefore a new and different fill-in rule for finite element equations has to be considered. Zero-order and first-order fill-in rules are defined as follows:

Zero-order fill-in: Fill-in is only accepted on the diagonal of the global matrix.
First-order fill-in: Fill-in is accepted at locations in the global matrix where the nodes belong to the same element.
We will apply the above fill-in rules to generate a diagonal preconditioner (zero-order fill-in) and block and full matrix ILU preconditioners (first-order fill-in) associated with appropriate nodal numbering (see Fig. 1).
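As an illustration, a minimal sketch of how the first-order fill-in pattern can be generated from the element connectivity is given below: fill-in at position (i, j) is accepted only when nodes i and j belong to the same element. The data layout and the function name are assumptions made for the example.

```python
def first_order_fill_pattern(elements, n_nodes):
    """Return, for each node i, the set of column indices j at which
    fill-in is accepted under the first-order rule: i and j must belong
    to the same element.  'elements' is a list of node-index tuples."""
    pattern = [{i} for i in range(n_nodes)]   # the diagonal is always kept
    for element in elements:
        for i in element:
            pattern[i].update(element)        # couple all nodes sharing the element
    return pattern

# Zero-order fill-in would keep the diagonal only:
# pattern = [{i} for i in range(n_nodes)]
```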
4. Processor communication

The logical architecture considered is shown in Fig. 2. The master processor can communicate with all satellites, and the satellites can communicate with their logical neighbors and the master processor. Communication primitives are restricted to point-to-point message passing. The implementation is based on a model where each satellite runs as a process on a standard computer networked with the other computers over Ethernet connected through a switch. Fig. 3 shows the connection between any two processors, with each process divided into two threads running on the same processor. This is clearly a superset of the required logical architecture, and subdividing into two threads provides the added benefit of asynchronous communication between processors. When sending a message, the thread doing the actual calculations queues the data in a fairly lightweight O(1) operation, and the heavier O(N) task of actually sending the message is left to the communicating thread, which acts in parallel with other calculations. For receiving a message, the asynchronous nature of the threads provides a benefit if the data were received while the processor was busy doing some other work. Then, receiving a message is reduced to getting a pointer to data already in the process' address space, which is a fairly low overhead operation performed in constant time. Fig. 4 shows a simple example, roughly corresponding to vector elimination for two satellites, of the difference between synchronous and asynchronous message passing. It is clear that the asynchronous case has the potential to make some message operations extremely cheap.
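A minimal sketch of the two-thread arrangement is given below: the computing thread enqueues outgoing data in a constant-time operation, a background thread performs the linear-time socket transfer, and incoming messages wait in a queue until the computing thread asks for them. The class name, the pickled message format and the use of Python's standard socket, threading and queue modules are assumptions made for illustration only.

```python
import pickle
import queue
import socket
import threading

class AsyncLink:
    """Asynchronous point-to-point link between two processors.
    send() is a constant-time enqueue; the background threads do the
    linear-time transmission, and received messages wait in a queue."""

    def __init__(self, sock: socket.socket):
        self.sock = sock
        self.outbox = queue.Queue()
        self.inbox = queue.Queue()
        threading.Thread(target=self._sender, daemon=True).start()
        threading.Thread(target=self._receiver, daemon=True).start()

    def send(self, data):               # O(1) for the computing thread
        self.outbox.put(data)

    def receive(self):                  # cheap if the data already arrived
        return self.inbox.get()

    def _sender(self):
        while True:
            payload = pickle.dumps(self.outbox.get())
            self.sock.sendall(len(payload).to_bytes(8, "big") + payload)

    def _receiver(self):
        while True:
            size = int.from_bytes(self._read_exact(8), "big")
            self.inbox.put(pickle.loads(self._read_exact(size)))

    def _read_exact(self, n):
        buf = b""
        while len(buf) < n:
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed connection")
            buf += chunk
        return buf
```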
Fig. 1. The left part of the figure shows zero-order fill-in for the middle node at the upper edge, the node at the center and the node at the lower left corner. The right part of the figure shows the first-order fill-in for the same nodes.
Fig. 2. The figure shows the configuration of a master and seven satellite processors. The master processor has two-way communication with all satellite processors. Each satellite processor has two-way communication with its neighbor satellite processors.
Fig. 3. Any two processors (master or satellite) connected through a network. Each process is threaded; the "queue" communication is a constant-overhead asynchronous operation, while the TCP communication is a synchronous linear-overhead operation.
Fig. 4. The potential difference between synchronous and asynchronous message passing. In the synchronous case, satellite 1 must wait for satellite 2 to finish calculating before the message can be sent. When the message is passed asynchronously, satellite 2 receives the message from satellite 1 while calculating, and when it needs the data, the message-passing overhead is reduced to a simple constant-time dequeue operation.
The added complexity of running multiple threads introduces some overhead. However, for the architecture at hand, this overhead is orders of magnitude lower than the time spent sending even fairly short messages (only a few bytes long). Additionally, the algorithm presented here requires few, large messages, rendering the added overhead negligible.
5. Node ordering in satellites

Fig. 5 shows a full domain mesh decomposed into three subdomain meshes. Each of the subdomains is processed in one satellite processor. Previous work [10,11] has shown that the efficiency of ILU preconditioning is strongly dependent on the node ordering. The local node ordering in each subdomain mesh is shown in Fig. 5. The internal subdomain nodes, which are all subdomain nodes except those on the upper
Fig. 5. The figure shows a domain decomposed into three subdomains for parallel processing. The node numbering of each subdomain mesh is divided into interior, receiving and transmitting nodes. The interior nodes are nodes 1–15. The receiving nodes are nodes 16–20 and the transmitting nodes are nodes 21–25. The interfaces between adjacent subdomains are the receiving and transmitting nodes.
and lower edges, are numbered in increasing order from the center towards the boundary. Then the nodes on the upper and lower edges, respectively, are numbered.

6. Structure of satellite matrices

With the node ordering given in Fig. 5, the matrix corresponding to a subdomain mesh is organized as in Fig. 6. The matrix $A_{II}$ corresponds to the internal degrees of freedom. The matrices $A_{IR}$, $A_{IT}$ and $A_{RI}$, $A_{TI}$ correspond to the coupling between the inner nodes and the interface nodes of the adjacent subdomains. The matrix $A_{RR}$ corresponds to the nodes on the upper interface in Fig. 5 and $A_{TT}$ corresponds to the nodes on the lower interface in Fig. 5. The zero matrices are due to the absence of coupling between the nodes on the upper and lower interfaces. The matrices shown in Fig. 6 are sparse, so the sparse storage structure in Fig. 7 is used. The vector pac points to where each row begins in the par vector, while the par vector contains the nodes that are present in the row. Since the finite element matrix is symmetric, only the nodes to the right of the diagonal need to be stored in par.

7. Parallel algorithms

Iterative solution algorithms for symmetric and non-symmetric equation systems have been subject to intensive investigation [6–8]. The iterative algorithm chosen in the present work is described in [7]. Three algorithms for solving the linear equation system are compared: the conjugate gradient algorithm without preconditioning, the conjugate gradient algorithm with block ILU preconditioning and the conjugate gradient algorithm with full matrix ILU preconditioning [9–11]. The block ILU preconditioning requires no communication between satellite processors, since the preconditioning is performed independently for each block matrix corresponding to a particular subdomain. In the full matrix ILU preconditioning method, matrices and vectors which correspond to the nodes on the interface between subdomain meshes are communicated. An algorithmic description of the factorization, the forward elimination and the backward substitution is given below. These three algorithms can be executed in parallel on the satellites.
Fig. 6. The figure shows the finite element matrix for a particular subdomain. The matrix $A_{II}$ corresponds to the internal degrees of freedom; the matrices $A_{RR}$, $A_{TT}$ correspond to the degrees of freedom on the interfaces to adjacent subdomains. The zero matrices, 0, arise because there are no couplings between the nodes of the two interfaces.
Fig. 7. The figure shows the sparse storage structure of the finite element matrix of a particular subdomain. The vector pac contains pointers to par, marking the location of the beginning of each row. The vector par contains the nodes above the diagonal which are present in each row.
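A minimal sketch of the pac/par storage described in Section 6 and Fig. 7 is given below: pac holds, for each row, the offset of its first stored entry in par, and par lists the column (node) indices from the diagonal onwards, so only the upper triangle of the symmetric matrix is kept. The function name, the companion coefficient array and the decision to store the diagonal inside par are assumptions made for the example.

```python
import numpy as np

def build_pac_par(fill_pattern):
    """Build a pac/par style storage from a fill-in pattern (one set of
    column indices per row).  Only columns j >= i are kept because the
    finite element matrix is symmetric."""
    pac = [0]                              # pac[i] = start of row i in par
    par = []                               # stored column indices, row by row
    for i, cols in enumerate(fill_pattern):
        row = sorted(j for j in cols if j >= i)
        par.extend(row)
        pac.append(len(par))
    coeff = np.zeros(len(par))             # matrix coefficients, same layout as par
    return np.array(pac), np.array(par), coeff

# Entry (i, j) with j >= i is located by searching par[pac[i]:pac[i+1]] for j.
```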
First, the matrix corresponding to the internal degrees of freedom is eliminated, Algorithm 7.1. Then, if it exists, the equation matrix corresponding to the interface from the above subdomain is added to the corresponding matrix of the present subdomain and factorized. If the subdomain is the last one, the matrix corresponding to the lower edge is also factorized. The forward elimination, Algorithm 7.2, of the right-hand side is performed in a similar way. The vector corresponding to the internal degrees of freedom is eliminated. Then, if it exists, the right-hand side vector corresponding to the interface from the above subdomain is added to the corresponding right-hand side of the present subdomain and eliminated. If the subdomain is the last one, the right-hand side vector corresponding to the lower edge is also eliminated.
The backward substitution, Algorithm 7.3, begins with substitution of the solution corresponding to the last subdomain mesh. Then the solution for the nodes corresponding to the upper edge is found and finally the solution corresponding to the internal degrees of freedom is computed.

7.1. Matrix factorization

                 Eliminate $A^n_{II}$
if ($n > 1$)     $A^n_{RR} = A^n_{RR} + A^{n-1}_{TT}$
                 Eliminate $A^n_{RR}$
if ($n =$ MPR)   Eliminate $A^n_{TT}$                                   (1)

7.2. Vector elimination

                 Eliminate $B^n_{I}$
if ($n > 1$)     $B^n_{R} = B^n_{R} + B^{n-1}_{T}$
                 Eliminate $B^n_{R}$
if ($n =$ MPR)   Eliminate $B^n_{T}$                                    (2)

7.3. Vector substitution

if ($n =$ MPR)   Substitute $B^n_{T}$
if ($n <$ MPR)   $B^n_{T} = B^{n+1}_{R}$
                 Substitute $B^n_{R}$
                 Substitute $B^n_{I}$                                   (3)
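A sketch of how Algorithms 7.1–7.3 might be driven on satellite n is given below, with the interface blocks exchanged through point-to-point links of the kind sketched in Section 4. The eliminate and substitute callbacks stand in for the actual incomplete factorization kernels, and the link API (send/receive towards the subdomains above and below) is an assumption; the authors' communication layout may differ in detail.

```python
def factorize(n, MPR, A, link_up, link_down, eliminate):
    """Algorithm 7.1 on satellite n, 1 <= n <= MPR (number of satellites)."""
    eliminate(A["II"])                          # internal degrees of freedom
    if n > 1:
        A["RR"] = A["RR"] + link_up.receive()   # add A_TT from subdomain n-1
        eliminate(A["RR"])                      # factorize the upper interface
    if n < MPR:
        link_down.send(A["TT"])                 # pass the lower-edge block on
    else:
        eliminate(A["TT"])                      # last subdomain: lower edge too

def eliminate_rhs(n, MPR, B, link_up, link_down, eliminate):
    """Algorithm 7.2 on satellite n: forward elimination of the right-hand side."""
    eliminate(B["I"])
    if n > 1:
        B["R"] = B["R"] + link_up.receive()     # add B_T from subdomain n-1
        eliminate(B["R"])
    if n < MPR:
        link_down.send(B["T"])
    else:
        eliminate(B["T"])

def substitute_rhs(n, MPR, B, link_up, link_down, substitute):
    """Algorithm 7.3 on satellite n: backward substitution, last subdomain first."""
    if n == MPR:
        substitute(B["T"])                      # lower edge of the last subdomain
    else:
        B["T"] = link_down.receive()            # equals B_R of subdomain n+1
    substitute(B["R"])
    substitute(B["I"])
    if n > 1:
        link_up.send(B["R"])                    # becomes B_T of subdomain n-1
```

In this sketch the data dependencies run downwards through the subdomains for the factorization and forward elimination and upwards for the backward substitution, which is what couples the subdomain degrees of freedom in the full matrix preconditioner.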
8. Numerical tests

We first tested the parallel ILU algorithm in a two-dimensional Laplace problem. The domain $\Omega = [0,1] \times [0,1]$ is decomposed into four subdomains. Each subdomain was discretized with 512 linear elements, 297 nodes and 297 degrees of freedom, as shown in Fig. 8. The full domain mesh consists of 2048 elements, 1089 nodes and 1089 degrees of freedom. Dirichlet boundary conditions, $u_0(y) = 1 - y^2$, were
Fig. 8. The figure shows a 2D regular mesh with 33 × 33 nodes and 2048 triangular elements. The domain is decomposed into four subdomains for parallel processing in four satellite processors. The mesh size in each satellite processor is 9 × 33 nodes and 512 elements.
Fig. 9. The figure shows a 2D solution computed on four parallel processors.
imposed on the opposite edges $x = 0$ and $x = 1$, and homogeneous boundary conditions, $\sigma = \partial u / \partial n = 0$, were adopted on the two other edges. The corresponding finite element solution of this Laplace problem is shown in Fig. 9. The initialization time as a function of the number of satellites used in the computation is shown in Fig. 10. The initialization time includes the initial communication of the mesh, matrix generation and ILU factorization of the element matrices in each satellite. The results show that the initialization time is maximum for one satellite processor. It is slightly higher than the computation time with no satellite processor because it includes the communication time to one processor. The initialization time decreases with the number of satellites, approaching an almost constant level.
Fig. 10. The figure shows the initialization time as a function of the number of satellites. The initialization time includes the initial communication of the mesh, matrix generation and ILU factorization of the element matrices in each satellite.
Fig. 11. The figure shows the communication time, the CPU time and the real time for one conjugate gradient iteration. The real time is the sum of CPU time and communication time.
Fig. 11 shows the communication time, the CPU time and the real time for one conjugate gradient iteration with a mesh size of 256 × 256 elements. The communication time with one or more satellites is fairly constant. The CPU time and the real time, which is defined as the sum of CPU time and communication time, decrease with the number of satellites from one satellite on. Both the CPU time and the real time seem to approach a constant level. Fig. 12 shows the convergence behavior of the conjugate gradient method and the preconditioned conjugate gradient method executed on one processor. The improvement in convergence rate with ILU preconditioning is clearly observed. In addition, the residual reaches a much smaller value than that obtained by the unpreconditioned conjugate gradient method. The convergence performance of the full matrix ILU preconditioned conjugate gradient method on one processor, the block ILU preconditioned conjugate gradient method on 32 processors and the full matrix ILU preconditioned conjugate gradient method on 32 processors is shown in Fig. 13. The figure shows that the full matrix ILU preconditioned conjugate gradient method for 32 satellite processors has convergence characteristics similar to those observed with the preconditioned conjugate gradient method on one processor. The block ILU preconditioned conjugate gradient method for 32 satellite processors has a markedly slower convergence rate than the full matrix ILU preconditioned conjugate gradient method for 32 processors. The convergence characteristics of the block ILU preconditioned conjugate gradient method for 1, 2, 4, 8, 16 and 32 satellite processors are illustrated in Fig. 14. For one processor, the block ILU preconditioning recovers the full matrix ILU preconditioning of the conjugate gradient method. We can observe that, for more than one satellite processor, the convergence characteristics of the block ILU preconditioner are almost independent of the number of satellite processors. However, the reduced rate of convergence observed when more than one satellite processor is employed is remarkable. The convergence rate of the full matrix ILU preconditioned conjugate gradient method is shown in Fig. 15. For eight satellite processors or more the convergence rate is nearly independent of the number of satellite processors. A comparison of Figs. 14 and 15 highlights the superiority of the full matrix ILU preconditioned conjugate gradient algorithm. A three-dimensional mesh is shown in Fig. 16. The global mesh has 5 × 5 × 9 nodes decomposed into four subdomain meshes with 5 × 9 × 3 nodes. A much more refined mesh with 33 × 33 × 17 nodes was
Fig. 12. The figure shows the residual as a function of the number of iterations for the conjugate gradient method and the preconditioned conjugate gradient method. Both computations are performed on one processor. The size of the mesh is 257 × 257 nodes.
Fig. 13. The figure shows the residual as a function of the number of iterations for the preconditioned conjugate gradient method for one processor, the block ILU preconditioned conjugate gradient method for 32 processors and the full matrix ILU preconditioned conjugate gradient method for 32 processors.
Fig. 14. The figure shows the residual as a function of the number of iterations for 1, 2, 4, 8, 16 and 32 parallel processors. The conjugate gradient algorithm is preconditioned by the block ILU preconditioner. The mesh size is 32 × 32 triangular elements.
Fig. 15. The figure shows the residual as a function of the number of iterations for 1, 2, 4, 8, 16 and 32 parallel processors. The conjugate gradient algorithm is preconditioned by the full matrix ILU preconditioner.
employed to compute the finite element approximation, as shown in Fig. 17. In this case each subdomain mesh has 33 × 33 × 5 nodes. The boundary conditions are $u_0(y,z) = (1 - y^2)(1 - z^2)$ on the two planes normal to the x-axis and zero normal derivatives, $\sigma = \partial u / \partial n = 0$, at the other boundary planes.
Fig. 16. The figure shows a 3D regular mesh of tetrahedra. The domain is decomposed into four subdomains for parallel processing on an equal number of satellite processors. The mesh size allocated to each satellite processor is 9 × 5 × 3 nodes.
Fig. 17. The figure shows a 3D iso-surface of the solution computed on four parallel processors with a 33 × 33 × 17 node mesh of regular tetrahedra.
9. Discussion

In the present work, a block and a full matrix parallel ILU preconditioner for iterative solvers have been developed, implemented and tested in two- and three-dimensional finite element solution of the Laplace equation. The efficiencies of both preconditioners are strongly dependent on the node numbering. In the full matrix parallel ILU preconditioner, the nodes have to be numbered in a special way in order to communicate the degrees of freedom on the interfaces of the subdomain meshes.
The simulations show that both the initialization time and the iteration time of the full matrix ILU preconditioner approach a constant level as the number of satellite processors increases. The interpretation of these observations is that the number of satellite processors adopted for a given problem can be increased until the constant time level is reached. Comparing the convergence characteristics of the block and full matrix parallel ILU preconditioners, the full matrix parallel preconditioner reveals superior convergence properties. The convergence properties of the full matrix parallel preconditioner are almost independent of the number of satellites used and are very similar to those of the full ILU preconditioner on one processor.

Acknowledgements

This research was supported in part by CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil), Grant No. 300520/98-0. This research was also supported in part by the Norwegian Research Council, Grant No. 131523/410.

References

[1] E. Barragy, G.F. Carey, A parallel element-by-element solution scheme, Int. J. Numer. Meth. Engrg. 26 (1988) 2367–2382.
[2] E. Barragy, G.F. Carey, Parallel-vector computation with high-P element-by-element methods, Int. J. Comput. Math. 44 (1988) 329–339.
[3] E. Barragy, G.F. Carey, R. van de Geijn, Parallel performance and scalability for block preconditioned finite-element (P) solution of viscous flow, Int. J. Numer. Meth. Engrg. 38 (1995) 1535–1554.
[4] S.K. Aliabadi, T.E. Tezduyar, Parallel fluid-dynamics computations in aerospace applications, Int. J. Numer. Meth. Fluids 21 (1995) 783–805.
[5] S. Mittal, T.E. Tezduyar, Parallel finite element simulation of 3D incompressible flows: fluid–structure interaction, Int. J. Numer. Meth. Fluids 21 (1995) 933–953.
[6] D.M. Young, K.C. Jea, Generalized conjugate-gradient acceleration of non-symmetrizable iterative methods, Linear Algebra Appl. 34 (1980) 159–194.
[7] P. Sonneveld, CGS, a fast Lanczos-type solver for nonsymmetric linear systems, SIAM J. Sci. Statist. Comput. 10 (1989) 36–52.
[8] D. Howard, W.M. Connolley, J.S. Rollett, Unsymmetric conjugate gradient methods and sparse direct methods in finite element flow simulation, Int. J. Numer. Meth. Fluids 10 (1990) 925–945.
[9] S.Ø. Wille, An element by element preconditioner for refined finite element grids, in: Proceedings of the First International Conference on Numerical Grid Generation in Computational Fluid Dynamics, Landshut, Germany, 1986.
[10] O. Dahl, S.Ø. Wille, An ILU preconditioner with coupled node fill-in for iterative solution of the mixed finite element formulation of the 2D and 3D Navier–Stokes equations, Int. J. Numer. Meth. Fluids 15 (1992) 525–544.
[11] S.Ø. Wille, A.F.D. Loula, A priori pivoting in incomplete Gaussian preconditioning for iterative solution of the mixed finite element formulation of the Navier–Stokes equations, Comput. Meth. Appl. Mech. Engrg. 190 (29–30) (2001) 3735–3747.
[12] G.F. Carey, A.I. Pehlivanov, P.S. Vassilevski, Least-squares mixed finite element methods for non-selfadjoint elliptic problems: II. Performance of block-ILU factorization methods, SIAM J. Sci. Comput. 16 (1995) 1126–1136.
[13] R.S. Silva, R.C. Almeida, A.C.N. Galeão, A. Coutinho, Iterative local solvers for the distributed Krylov–Schwarz method applied to convection–diffusion problems, Comput. Meth. Appl. Mech. Engrg. 149 (1997) 353–362.