Parallel Computing 27 (2001) 761–775
www.elsevier.com/locate/parco

Parallel multigrid 3D Maxwell solvers

Gundolf Haase, Michael Kuhn, Ulrich Langer *

Institute of Analysis and Computational Mathematics, Johannes Kepler University of Linz, Altenberger Str. 69, A-4040 Linz, Austria

Received 18 January 2000; received in revised form 25 September 2000; accepted 5 November 2000
Abstract

Three-dimensional magnetic field problems are challenging not only because of interesting applications in industry but also from the mathematical point of view. In the magnetostatic case, our Maxwell solver is based on a regularized mixed variational formulation of the Maxwell equations in $H_0(\operatorname{curl},\Omega) \times H_0^1(\Omega)$ and their discretization by the Nedelec and Lagrange finite elements. Eliminating the Lagrange multiplier from the mixed finite element equations, we arrive at a symmetric and positive definite (spd) problem that can be solved by some parallel multigrid preconditioned conjugate gradient (pcg) method. More precisely, this pcg solver contains a standard scaled Laplace multigrid regularizator in the regularization part and a special multigrid preconditioner for the regularized Nedelec finite element equations that we want to solve. The parallelization of the pcg algorithm, the Laplace multigrid regularizator and the multigrid preconditioner is based on a unified domain decomposition data distribution concept. We present the results of some numerical experiments on a parallel machine with distributed memory that show the high efficiency of our approach for real-life applications. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: 3D Maxwell equations; Magnetostatics; Finite element method; Domain decomposition data distribution; Parallel iterative solvers; Parallel multigrid preconditioners; Distributed memory
This research has been supported by the Austrian Science Fund FWF within the SFB "Numerical and Symbolic Scientific Computing" under the grant SFB F013.
* Corresponding author. E-mail address: [email protected] (U. Langer).

1. Introduction

The computer simulation of 3D technical magnetic field problems often requires powerful finite element solvers due to the geometrical and physical complexity.
Moreover, advanced problems such as nonlinear and/or transient problems, as well as coupled magneto-mechanical field problems and optimal design problems, require the solution of linearized problems several times (see, e.g., [16]). The availability of fast and efficient solvers is crucial especially in such cases. The most efficient solvers for finite element equations are certainly multigrid or multilevel methods. Typically, the multigrid convergence rate is independent of the mesh size parameter, and the arithmetical complexity grows linearly with the number of unknowns. However, the standard multigrid algorithms fail for the Maxwell finite element equations in the sense that the convergence rate deteriorates as the mesh size decreases. To overcome this drawback, Hiptmair [11] proposed to modify the smoothing iteration by adding a smoothing step in the discrete potential space. Similarly, Arnold et al. [1] suggested a special block smoother that has the same effect. In [2], one can find some implementation issues for the multigrid algorithm proposed by Hiptmair and numerical results for the eddy current problem, which gives an additional $L_2$-term after the time discretization. The parallelization of these or, more precisely, of appropriately modified multigrid solvers is certainly the only feasible way to enhance the efficiency of these algorithms. However, the parallelization is not straightforward due to the peculiarities of the multigrid methods for the Maxwell equations. In this paper, we propose a unified approach to the parallelization of multigrid methods based on a data distribution that is known from non-overlapping domain decomposition.

In order to develop a basic parallel Maxwell solver that can be used as a basic module for more advanced problems, it is sufficient to consider the magnetostatic case. In the magnetostatic case, the Maxwell equations can be reduced to the curl–curl equation, which is not uniquely solvable because of the large kernel of the curl-operator (gradient fields). In practice, a gauge condition is imposed in order to pick out a unique solution. The so-called Coulomb gauge aims at a divergence-free solution (vector potential). The weak formulation of the curl–curl equation and the gauge condition, together with a clever regularization, leads to a regularized mixed variational formulation of the magnetostatic Maxwell equations in $H_0(\operatorname{curl},\Omega) \times H_0^1(\Omega)$ that has a unique solution. The discretization by the Nedelec and Lagrange finite elements results in a large, sparse, symmetric, but indefinite system of finite element equations. Eliminating the Lagrange multiplier from the mixed finite element equations, we arrive at a symmetric and positive definite (spd) problem that can be solved by some parallel multigrid preconditioned conjugate gradient (pcg) method. More precisely, this pcg solver contains a standard scaled Laplace multigrid regularizator in the regularization part and a special multigrid preconditioner for the regularized Nedelec finite element equations that we want to solve. The parallelization of the pcg algorithm, the Laplace multigrid regularizator and the multigrid preconditioner is based on the unified domain decomposition data distribution concept mentioned above. From the parallelization point of view, we prefer Hiptmair's multigrid method, with some modifications, for the construction of the special multigrid preconditioner.

The rest of the paper is organized as follows. In Section 2, we recall the mixed variational formulation of the magnetostatic Maxwell equation, its regularizations,
and its finite element discretization. Further, we present the multigrid preconditioned conjugate gradient method for solving the spd reduced problem. Section 3 provides the data distribution concept and the parallelization of the basic algorithms, namely the pcg and the multigrid algorithms. In Section 4, we discuss the parallelization peculiarities that come along with the spd reduced problem (Schur complement system) and with the special multigrid preconditioner. Section 5 contains some results of our numerical experiments on a parallel machine with distributed memory that show the high efficiency of our approach to a real-life application.
2. 3D magnetostatic field problems

The magnetostatic equations, in which we are interested throughout the paper, can be rewritten as

    \operatorname{curl} H = J, \qquad H = \nu B, \qquad \operatorname{div} B = 0,                                   (1)

where $H$ and $B$ denote the magnetic field intensity and the magnetic flux density, respectively. The permeability $\mu$ ($\nu = 1/\mu$) and the impressed current density $J$ are supposed to be given. If permanent magnets are present, $J$ has to be replaced by $J + \operatorname{curl} H_0$, where $H_0$ is the initial (permanent) magnetic field intensity. Furthermore, we note that the current density $J$ is physically divergence-free, i.e.,

    \operatorname{div} J = 0.                                   (2)

Theoretically, the computational domain $\Omega$ is equal to the space $\mathbb{R}^3$ in any case. The behavior of the magnetic field at infinity is described by radiation conditions. In practice, one may often simplify the problem by considering a bounded, simply connected computational domain $\Omega \subset \mathbb{R}^3$ with Lipschitz boundary $\Gamma = \partial\Omega$ and by replacing the radiation condition by the boundary condition

    B \cdot n = 0 \quad \text{on } \partial\Omega,                                   (3)

where $n$ stands for the unit outward normal with respect to $\partial\Omega$. Of course, this modification causes some modeling error. However, if the sources are completely surrounded by iron or, more generally, if $\Omega$ is chosen large enough, this error can be neglected.

Introducing some vector potential $u$ for the $B$-field ($B = \operatorname{curl} u$), we can easily derive the canonical variational formulation: find $u \in H_0(\operatorname{curl},\Omega)$ such that

    \int_\Omega \nu \operatorname{curl} u \cdot \operatorname{curl} v \, dx = \int_\Omega J \cdot v \, dx \quad \forall v \in H_0(\operatorname{curl},\Omega),                                   (4)

from the magnetostatic equations (1) in $\Omega$ and the boundary conditions (3) on $\Gamma$, where $H_0(\operatorname{curl},\Omega) := \{ v \in L_2(\Omega)^3 \mid \operatorname{curl} v \in L_2(\Omega)^3, \ v \times n = 0 \text{ on } \Gamma \}$. Condition (2) ensures the solvability of the variational problem (4). However, the solution is not unique. Indeed, if one adds to some solution $u$ of (4) a gradient field $\nabla\varphi \in L_2(\Omega)^3$ with $\nabla\varphi \times n = 0$ on $\Gamma$, one obtains again a solution.
From now on we allow the coefficient $\nu = 1/\mu$ to be piecewise constant with $\nu \ge \nu_{\min} > 0$. We mention that the normal component of the $B$-field and the tangential component of the $H$-field are continuous across the interfaces. In order to obtain some variational formulation that has a unique solution, we have to take into account some gauge condition for the vector potential $u$. The Coulomb gauge

    \operatorname{div} u = 0                                   (5)

aims at divergence-free vector potentials. Treating the weak formulation of the Coulomb gauge (5) as an equality constraint, we arrive at the following mixed variational formulation that is fundamental for our approach to the numerical solution of the magnetostatic Maxwell equations (1): Find $(u,p) \in X \times M := H_0(\operatorname{curl},\Omega) \times H_0^1(\Omega)$ such that

    a(u,v) + b(v,p) = \langle f, v \rangle \quad \forall v \in H_0(\operatorname{curl},\Omega),                                   (6)
    b(u,q) = 0 \quad \forall q \in H_0^1(\Omega),                                   (7)

where $a(u,v) := \int_\Omega \nu \operatorname{curl} u \cdot \operatorname{curl} v \, dx$, $b(v,p) := \int_\Omega v \cdot \nabla p \, dx$, and $\langle f, v \rangle := \int_\Omega J \cdot v \, dx$. Now it is not difficult to conclude from the Brezzi–Babuska theory that the mixed variational problem (6) and (7) has a unique solution. Moreover, choosing $v = \nabla p \in H_0(\operatorname{curl},\Omega)$ in (6), we immediately observe from (2) and Friedrichs' inequality that

    p = 0.                                   (8)
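In more detail (our expansion of this argument): testing (6) with $v = \nabla p$, which belongs to $H_0(\operatorname{curl},\Omega)$ since $\operatorname{curl} \nabla p = 0$ and $\nabla p \times n = 0$ on $\Gamma$, gives $a(u,\nabla p) + b(\nabla p, p) = \langle f, \nabla p \rangle$. The first term vanishes because $\operatorname{curl} \nabla p = 0$, and integration by parts together with (2) and $p = 0$ on $\Gamma$ yields $\langle f, \nabla p \rangle = \int_\Omega J \cdot \nabla p \, dx = -\int_\Omega (\operatorname{div} J)\, p \, dx = 0$. Hence

    0 = b(\nabla p, p) = \int_\Omega |\nabla p|^2 \, dx,

so $\nabla p = 0$ and, by Friedrichs' inequality, $p = 0$.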
This simple observation is crucial for our approach since it ensures that the formulations being considered are equivalent to the original problem (4). The Lagrange multiplier $p$ and the corresponding form $b(\cdot,\cdot)$ are introduced in order to derive another formulation in the primal variable $u$ with an additional additive term ensuring uniqueness based on (5). In some cases this approach is favorable for the construction of efficient numerical methods, e.g., for the construction of preconditioners, see below. Indeed, adding an arbitrary symmetric and positive definite bilinear form $c(\cdot,\cdot): M \times M \to \mathbb{R}^1$ to the second equation of our mixed variational problem (6) and (7), we arrive at the equivalent mixed variational problem: Find $(u,p) \in X \times M$ such that

    a(u,v) + b(v,p) = \langle f, v \rangle \quad \forall v \in H_0(\operatorname{curl},\Omega),                                   (9)
    b(u,q) - c(p,q) = 0 \quad \forall q \in H_0^1(\Omega).                                   (10)
Let now $X_h \subset X$ and $M_h \subset M$ be the lowest-order edge element space (see [15]) and the space of piecewise linear nodal elements on a shape-regular tetrahedral triangulation with mesh width $h$ of $\Omega$, respectively [4]. Then the mixed finite element approximation to the regularized mixed variational problem (9) and (10) reads as follows: Find $(u_h, p_h) \in X_h \times M_h$ such that

    a(u_h, v_h) + b(v_h, p_h) = \langle f, v_h \rangle \quad \forall v_h \in X_h,                                   (11)
    b(u_h, q_h) - c(p_h, q_h) = 0 \quad \forall q_h \in M_h.                                   (12)
Again, using the Brezzi–Babuska theory, we conclude that the mixed finite element scheme (11) and (12) has a unique solution. From the same theory, we can derive the standard a priori error estimates (see, e.g., [3]). Similar to the continuous case, we observe that $p_h = 0$ since $q_h \in M_h$ implies $\nabla q_h \in X_h$. Let us choose the standard finite element basis in the spaces $X_h$ and $M_h$. Then the mixed finite element scheme (11) and (12) is equivalent to the regular, symmetric, but indefinite system

    \begin{pmatrix} A & B^T \\ B & -C \end{pmatrix} \begin{pmatrix} u_h \\ p_h \end{pmatrix} = \begin{pmatrix} f_h \\ 0 \end{pmatrix}                                   (13)

of linear finite element equations, where the matrices $A$, $B$, $C$ and the first component $f_h$ of the right-hand side are derived from the bilinear forms $a(\cdot,\cdot)$, $b(\cdot,\cdot)$, $c(\cdot,\cdot)$, and the linear form $\langle f, \cdot \rangle$, respectively. Eliminating $p_h = C^{-1} B u_h$ from the second equation in (13) and inserting it into the first equation, we obtain the spd Schur complement system

    G u_h := (A + B^T C^{-1} B) u_h = f_h.                                   (14)
Let $\tilde{C}$ be some spd matrix that is spectrally equivalent to $C$ (briefly, $\tilde{C} \sim C$), i.e., there are some positive constants $c_1$ and $c_2$, independent of $h$, such that

    c_1 (\tilde{C} q_h, q_h) \le (C q_h, q_h) \le c_2 (\tilde{C} q_h, q_h) \quad \forall q_h \in \mathbb{R}^{m_h},                                   (15)

where $m_h := \dim(M_h)$. Since the solution $(u_h, p_h)^T$ of (13) satisfies $p_h = 0$, the original Schur complement system (14) is equivalent to the modified Schur complement system

    \tilde{G} u_h := (A + B^T \tilde{C}^{-1} B) u_h = f_h.                                   (16)
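The equivalence can be seen directly (our expansion of this step): the second block row of (13) together with $p_h = 0$ gives $B u_h = C p_h = 0$, hence

    B^T C^{-1} B u_h = B^T \tilde{C}^{-1} B u_h = 0,

so the solution $u_h$ of (13) satisfies $G u_h = A u_h = \tilde{G} u_h = f_h$. Since both $G$ and $\tilde{G}$ are spd, this $u_h$ is the unique solution of (14) as well as of (16).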
Instead of solving the symmetric, but indefinite system (13), we solve the spd modified Schur complement system (16). Let us choose the spd bilinear form

    c(p,q) := \frac{1}{\nu_{\min}} \int_\Omega \nabla p \cdot \nabla q \, dx                                   (17)

corresponding to the Laplace operator scaled by $1/\nu_{\min}$, and let us consider an spd preconditioner $C_H$ for the spd matrix $H := A + M$, i.e., there are some positive constants $c_3$ and $c_4$, independent of $h$, such that

    c_3 (C_H v_h, v_h) \le (H v_h, v_h) \le c_4 (C_H v_h, v_h) \quad \forall v_h \in \mathbb{R}^{n_h},                                   (18)

where $n_h := \dim(X_h)$ and $M$ is the scaled mass matrix in $X_h$ defined by

    (M u_h, v_h) := \int_\Omega \nu_{\min} \, u_h \cdot v_h \, dx.                                   (19)
The discrete LBB-condition and the spectral equivalence inequalities (18) imply that $C_H \sim G \sim \tilde{G}$ (see [13] for the detailed proof). Once a good preconditioner $C_H$ and an appropriate regularizator $\tilde{C}$ are available, we can solve the modified Schur complement system (16) by the preconditioned conjugate gradient method. In practice, we choose the multigrid preconditioner

    C_H := H (I - M_H)^{-1}                                   (20)

and the multigrid regularizator

    \tilde{C} := C_C := C (I - M_C)^{-1},                                   (21)

where $M_H$ and $M_C$ are the corresponding multigrid iteration operators with respect to $H$ and $C$. Choosing the appropriate symmetric multigrid cycles, we can now conclude from the results of [1,10,11] that the pcg method is asymptotically optimal with respect to the operation count and to the memory demand [12]. The numerical results obtained from the serial implementation of this algorithm confirm this statement [14]. In this paper, we are interested in the parallel implementation of this algorithm. The parallelization of this algorithm is far from being straightforward because of the peculiarities connected with the multigrid regularizator $C_C$ and with the special multigrid preconditioner $C_H$.
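Definition (20) simply says that applying $C_H^{-1}$ amounts to one multigrid iteration for $H w = r$ with zero initial guess (our remark): the multigrid iteration reads

    w^{k+1} = M_H w^k + (I - M_H) H^{-1} r,

so starting with $w^0 = 0$ gives $w^1 = (I - M_H) H^{-1} r = C_H^{-1} r$. Analogously, one multigrid step for $C$ with zero initial guess realizes $\tilde{C}^{-1} = C_C^{-1}$ in (21).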
3. A unified data distribution concept

3.1. Notations

We use a non-overlapping domain decomposition for our approach, i.e., we decompose $\Omega$ into $P$ subdomains $\Omega_s$ such that $\overline{\Omega} = \bigcup_{s=1}^{P} \overline{\Omega}_s$ with $\Omega_s \cap \Omega_q = \emptyset$, $\forall q \ne s$, $s,q = 1,\ldots,P$ holds. Each subdomain $\Omega_s$ is discretized by a mesh $\tau_{h,s}$, such that the whole triangulation $\tau_h = \bigcup_{s=1}^{P} \tau_{h,s}$ of $\Omega$ is conform. A global finite element space $V_h$ is defined with respect to $\tau_h$ and the local spaces $V_{h,s}$ are projections of $V_h$ onto $\tau_{h,s}$. The index set of unknowns in $\overline{\Omega}_s$ is denoted by $\omega_s$. Note that $\omega = \bigcup_{s=1}^{P} \omega_s$ but $N := |\omega| \le \sum_{s=1}^{P} |\omega_s| =: \sum_{s=1}^{P} N_s$ holds, i.e., different subdomains may share unknowns whereas elements from different subdomains do not overlap. The mapping of a vector $u \in \mathbb{R}^N$ in global numbering onto a local vector $u_s \in \mathbb{R}^{N_s}$ in subdomain $\Omega_s$ ($s = 1,\ldots,P$) is represented symbolically by subdomain connectivity matrices $A_s$ of dimension $N_s \times N$ with entries

    A_s^{i,j} := \begin{cases} 1 & \text{if } j \text{ is the global number of } i, \\ 0 & \text{else,} \end{cases} \qquad \forall i \in \omega_s, \ \forall j \in \omega.                                   (22)

The transpose $A_s^T$ of these binary matrices $A_s: \mathbb{R}^N \mapsto \mathbb{R}^{N_s}$ maps a local vector back to the global one. The index set of all those subdomains an unknown $u_j$, $j \in \omega$, belongs to is denoted by

    \sigma_j := \{ s \mid u_j \in \overline{\Omega}_s \} = \{ s \mid \exists i \in \omega_s : A_s^{i,j} \ne 0 \}.                                   (23)
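In an implementation, the binary matrices $A_s$ are of course never stored explicitly. A minimal sketch, assuming that each process keeps the local-to-global index map of its subdomain (the names Subdomain, gather and scatterAdd are ours, not taken from FEPP): $u_s = A_s u$ becomes a gather, and the action of $A_s^T$ becomes a scatter-add.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical description of one subdomain Omega_s: its index set omega_s,
// represented as the map from local to global unknown numbers.
struct Subdomain {
  std::vector<std::size_t> local2global;  // local index i -> global index j
};

// u_s = A_s u : pick the local entries out of a global vector (gather).
std::vector<double> gather(const Subdomain& s, const std::vector<double>& u) {
  std::vector<double> us(s.local2global.size());
  for (std::size_t i = 0; i < us.size(); ++i) us[i] = u[s.local2global[i]];
  return us;
}

// r += A_s^T r_s : add local contributions into a global vector (scatter-add).
void scatterAdd(const Subdomain& s, const std::vector<double>& rs,
                std::vector<double>& r) {
  for (std::size_t i = 0; i < rs.size(); ++i) r[s.local2global[i]] += rs[i];
}
```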
3.2. Vector and matrix types

We store the data of a vector component $u_i$ in the subdomain $\Omega_s$ if $s \in \sigma_i$. There are (at least) two opportunities to store those components and finally that vector (see also Fig. 1).

· A vector $u_s$ ($u$) is called an accumulated vector if each vector component $u_i$ is stored in all subdomains $\Omega_s$, $s \in \sigma_i$, with its full value. The local vectors $u_s$ can be represented as

    u_s := A_s u.                                   (24)

· A vector $r$ is called a distributed vector if it is decomposed into local vectors $r_s$ such that

    r = \sum_{s=1}^{P} A_s^T r_s                                   (25)

holds, i.e., all subdomains $\Omega_s$, $s \in \sigma_i$, store only $r_s$ and possess a portion of the full vector value $r_i$, which can be determined only by taking the sum in (25).

The conversion of a distributed vector $v$ into an accumulated vector $w$ can be done by evaluating the sum in (25) followed by the restriction (24), i.e.,

    w = w(v): \quad w_s := A_s w = A_s \sum_{q=1}^{P} A_q^T v_q.                                   (26)

The conversion in the other direction is not unique; we prefer an equally weighted distribution of the accumulated vector. A matrix weighted distribution is also feasible in the case of jumping coefficients. The weights are chosen such that the re-conversion (26) will result in the original vector:

    v = v(w): \quad v_s := (R_s)^{-1} w_s                                   (27)

with $R_s = \operatorname{diag}\{ |\sigma_i| \}_{i \in \omega_s}$, i.e., $R_s^{i,i}$ stores the number of subdomains the component $i$ belongs to.
Fig. 1. Illustration for accumulated and distributed vectors and matrices.
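A minimal sketch of the conversions (26) and (27), assuming MPI and, for simplicity, that every process assembles its contribution into a buffer of global length N; a production code would exchange only interface values with neighbouring subdomains, so this illustrates the algebra rather than the actual FEPP communication pattern. All names are ours.

```cpp
#include <mpi.h>
#include <vector>
#include <cstddef>

// Local description of omega_s: global index and multiplicity |sigma_i|
// of every local unknown (both assumed to be set up in advance).
struct LocalNumbering {
  std::vector<std::size_t> local2global;
  std::vector<int> multiplicity;   // R_s^{i,i} = number of subdomains sharing i
  std::size_t globalSize;          // N
};

// w_s = A_s sum_q A_q^T v_q : convert a distributed vector v_s into an
// accumulated vector w_s, cf. (26).
std::vector<double> accumulate(const LocalNumbering& nb,
                               const std::vector<double>& vs, MPI_Comm comm) {
  std::vector<double> global(nb.globalSize, 0.0);
  for (std::size_t i = 0; i < vs.size(); ++i) global[nb.local2global[i]] += vs[i];
  MPI_Allreduce(MPI_IN_PLACE, global.data(), static_cast<int>(global.size()),
                MPI_DOUBLE, MPI_SUM, comm);          // realizes the sum over all subdomains
  std::vector<double> ws(vs.size());
  for (std::size_t i = 0; i < ws.size(); ++i) ws[i] = global[nb.local2global[i]];
  return ws;
}

// v_s = R_s^{-1} w_s : equally weighted re-distribution, cf. (27); no communication.
std::vector<double> distribute(const LocalNumbering& nb,
                               const std::vector<double>& ws) {
  std::vector<double> vs(ws.size());
  for (std::size_t i = 0; i < ws.size(); ++i) vs[i] = ws[i] / nb.multiplicity[i];
  return vs;
}
```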
The matrix defined by the bilinear form in (6) can also be stored in two ways. With respect to an element-wise domain decomposition, we can store the f.e. matrix accumulated or distributed.

· A matrix $M$ is called accumulated if its local restrictions $M_s$ possess the full entries of it, and we can write

    M_s := A_s M A_s^T.                                   (28)

· We call a matrix $K$ distributed if we have locally stored matrices $K_s$ such that

    K := \sum_{s=1}^{P} A_s^T K_s A_s                                   (29)

holds, i.e., each subdomain $\Omega_s$ stores only a part of its full values. We obtain distributed matrices $K_s$ automatically after the local f.e. accumulation with respect to $V_{h,s}$, due to our non-overlapping construction principle. The sum in (29) is equivalent to a global matrix accumulation, which will not be carried out in our approach.

3.3. Basic operations

This section collects only some results from previous papers; for details see [5–7,9]. It is trivial that additive combinations of vectors of the same type do not require any communication. The inner product of different-type vectors requires one global reduce operation of the local inner products:

    \langle w, r \rangle = w^T r = w^T \sum_{s=1}^{P} A_s^T r_s = \sum_{s=1}^{P} (A_s w)^T r_s = \sum_{s=1}^{P} \langle w_s, r_s \rangle.                                   (30)
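In MPI terms, (30) is one local inner product per process followed by a single global reduction; a minimal sketch (the function name is ours), assuming the accumulated vector $w_s$ and the distributed vector $r_s$ are stored as plain local arrays:

```cpp
#include <mpi.h>
#include <vector>
#include <cstddef>

// <w, r> for an accumulated vector w and a distributed vector r, cf. (30):
// one local inner product per subdomain plus a single MPI_Allreduce.
double dotAccDist(const std::vector<double>& ws, const std::vector<double>& rs,
                  MPI_Comm comm) {
  double local = 0.0;
  for (std::size_t i = 0; i < ws.size(); ++i) local += ws[i] * rs[i];
  double global = 0.0;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return global;
}
```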
Any other combination of vector types requires more effort. The multiplication of a distributed matrix with an accumulated vector,

    K w = \sum_{s=1}^{P} A_s^T K_s A_s w = \sum_{s=1}^{P} A_s^T (K_s w_s) = \sum_{s=1}^{P} A_s^T v_s = v,                                   (31)

results in a distributed vector. The realization requires no communication at all because we only have to compute locally $v_s = K_s w_s$. The situation changes if we use an accumulated matrix $M$. Here, we have to ensure that the pattern of $M$ fulfills the condition

    \forall i,j \in \omega: \quad \sigma_i \not\subseteq \sigma_j \ \Rightarrow \ M_{i,j} = 0.                                   (32)

If the pattern condition (32) holds, then the operations

    w = M u \quad \text{and} \quad d = M^T r                                   (33)

can be performed locally without any communication, i.e., for all $s = 1,\ldots,P$,

    w_s = M_s u_s \quad \text{and} \quad d_s = M_s^T r_s.                                   (34)
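A minimal sketch of (31), with the local matrix $K_s$ held in compressed row storage (layout and names ours): the multiplication is purely local, and the result is kept as the local part of the distributed vector $v$.

```cpp
#include <vector>
#include <cstddef>

// Local sparse matrix K_s in compressed row storage (hypothetical layout).
struct CRSMatrix {
  std::vector<std::size_t> rowPtr, col;
  std::vector<double> val;
};

// v_s = K_s w_s : local part of the distributed result v = K w, cf. (31).
std::vector<double> applyDistributed(const CRSMatrix& Ks,
                                     const std::vector<double>& ws) {
  std::vector<double> vs(Ks.rowPtr.size() - 1, 0.0);
  for (std::size_t i = 0; i + 1 < Ks.rowPtr.size(); ++i)
    for (std::size_t k = Ks.rowPtr[i]; k < Ks.rowPtr[i + 1]; ++k)
      vs[i] += Ks.val[k] * ws[Ks.col[k]];
  return vs;
}
```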
3.4. Basic algorithms

The operations (30) and (31) already allow us to formulate a parallel pcg algorithm for solving the matrix equation $K u = f$ with a preconditioner $C_K$.

Algorithm 1 (Parallel pcg PCG(K, u, f, C_K)).
    repeat
        v = K s
        \alpha = \sigma / \langle s, v \rangle
        u = u + \alpha s
        r = r - \alpha v
        w \Leftarrow C_K^{-1} r
        \sigma = \langle w, r \rangle
        \beta = \sigma / \sigma_{old},  \sigma_{old} = \sigma
        s = w + \beta s
    until termination

Besides the inner products, only the preconditioning step $w \Leftarrow C_K^{-1} r$ involves communication, indicated by using $\Leftarrow$ instead of $=$. In the case of $C_K = I$, i.e., no preconditioning, this step reduces to a type conversion (26) involving communication. We require that the communication costs for applying any other preconditioner $C_K^{-1}$ are in the same range. One possible choice for the preconditioner is $C_K^{-1} = (I - M_K) K^{-1}$, with $M_K$ being the multigrid iteration operator for $K$. The parallel multigrid iteration is presented in Algorithm 2, where $\ell$ denotes the level such that $\ell = 1$ stands for the coarsest grid. The algorithm needs a smoother SMOOTH with a good parallel performance, e.g., a block Jacobi smoother with Gauss–Seidel smoothing in the blocks containing the interior unknowns of the subdomains. Furthermore, the interpolation $P$ has to fulfill the pattern condition (32), and we take $P^T$ as restriction. The coarse grid system can be solved directly (either in parallel or sequentially on one processor) or again by some iterative method similar to the PCG in Algorithm 1. Apart from the coarse grid solver, only the smoothing sweep requires communication.

Algorithm 2 (Parallel multigrid PMG(K, u, f, \ell)).
    if \ell = 1 then
        u \Leftarrow Solve (\sum_{s=1}^{P} A_s^T K_s A_s) u = f
    else
        \tilde{u} \Leftarrow Smooth(K, u, f)
        d = f - K \tilde{u}
        d_H = P^T d
        w_H = 0
        w_H \Leftarrow PMG(K_H, w_H, d_H, \ell - 1)
        w = P w_H
        \hat{u} = \tilde{u} + w
        u \Leftarrow Smooth^T(K, \hat{u}, f)
    end if
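A sketch of Algorithm 1 in C++ with MPI, reusing the CRSMatrix, applyDistributed and dotAccDist helpers from the sketches above; the function name parallelPCG, the zero-based stopping test and the preconditioner callback are our assumptions, not the FEPP implementation.

```cpp
#include <mpi.h>
#include <vector>
#include <functional>
#include <cstddef>

// Parallel pcg for K u = f with K distributed (local block K_s), u accumulated
// and f distributed, cf. Algorithm 1. applyPrec realizes w <= C_K^{-1} r and is
// expected to return an accumulated vector; it hides the type conversion (26)
// and hence the communication of the preconditioning step.
void parallelPCG(const CRSMatrix& Ks,
                 std::vector<double>& u, const std::vector<double>& f,
                 const std::function<std::vector<double>(const std::vector<double>&)>& applyPrec,
                 MPI_Comm comm, int maxIter, double tol) {
  std::vector<double> v = applyDistributed(Ks, u);                 // K u (distributed)
  std::vector<double> r(f.size());
  for (std::size_t i = 0; i < r.size(); ++i) r[i] = f[i] - v[i];   // r = f - K u
  std::vector<double> w = applyPrec(r);                            // accumulated
  std::vector<double> s = w;
  double sigma = dotAccDist(w, r, comm);
  const double sigma0 = sigma;
  for (int it = 0; it < maxIter && sigma > tol * tol * sigma0; ++it) {
    v = applyDistributed(Ks, s);                                   // v = K s, no communication
    const double alpha = sigma / dotAccDist(s, v, comm);           // global reduce
    for (std::size_t i = 0; i < u.size(); ++i) { u[i] += alpha * s[i]; r[i] -= alpha * v[i]; }
    w = applyPrec(r);                                              // w <= C_K^{-1} r, communicates
    const double sigmaOld = sigma;
    sigma = dotAccDist(w, r, comm);                                // global reduce
    const double beta = sigma / sigmaOld;
    for (std::size_t i = 0; i < s.size(); ++i) s[i] = w[i] + beta * s[i];
  }
}
```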
4. Parallel multigrid Maxwell solver

We want to solve (16) by the pcg algorithm (Algorithm 1) using a multigrid preconditioner (Algorithm 2) for the realization of $C_K^{-1}$. In this section, we will discuss how these components have to be adapted to the case of our Maxwell solver presented in Section 2.

Our reduced primal formulation (16) has been derived from the block system (13). As discussed in the previous section, the matrices are generated locally, such that the local components $A_s$, $B_s$, $C_s$ are available. Denoting the subdomain connectivity matrices with respect to the spaces $X_h$ and $M_h$ by $A_{X,s}$ and $A_{M,s}$, respectively, we have the following relations:

    A = \sum_{s=1}^{P} A_{X,s}^T A_s A_{X,s}, \qquad B = \sum_{s=1}^{P} A_{M,s}^T B_s A_{X,s}, \qquad C = \sum_{s=1}^{P} A_{M,s}^T C_s A_{M,s}.

The system matrix in (16) is defined by $\tilde{G} := A + B^T \tilde{C}^{-1} B$, where $\tilde{C}$ is a preconditioner for $C$. In order to apply PCG($\tilde{G}$, u, f, $C_{\tilde{G}}$), we explain in Algorithm 3 how the matrix-by-vector operation is defined for the distributed matrix $\tilde{G}$. Hereby, the required operation $\tilde{C}^{-1}$ is realized by one multigrid iteration step PMG($C$, p, q, $\ell$) in the space $M_h$. Although the matrix $\tilde{G}$ is distributed, the corresponding matrix-by-vector operation requires as many communications as one multigrid iteration step for $C$ in $M_h$.

Algorithm 3 (The operation v \Leftarrow \tilde{G} s).
    q = B s
    p \Leftarrow PMG(C, p, q, \ell)
    v = A s + B^T p

Algorithm 4 (Parallel hybrid multigrid HYBRIDPMG(H, C, u, f, \ell)).
    if \ell = 1 then
        u \Leftarrow Solve (\sum_{s=1}^{P} A_s^T H_s A_s) u = f
    else
        \tilde{u} \Leftarrow HybridSmooth(H, C, u, f)
        d = f - H \tilde{u}
        d_H = P^T d
        w_H = 0
        w_H \Leftarrow HYBRIDPMG(H_H, C_H, w_H, d_H, \ell - 1)
        w = P w_H
        \hat{u} = \tilde{u} + w
        u \Leftarrow HybridSmooth^T(H, C, \hat{u}, f)
    end if

Furthermore, the operation $C_K^{-1}$ in Algorithm 1 is now realized by one iteration step of HYBRIDPMG(H, C, u, f, \ell) defined in Algorithm 4.
Fig. 2. The lifting operator $L: X_h \to M_h$.
Comparing Algorithms 4 and 2, we observe that only the smoother has to be adapted to our special application. In particular, we use a hybrid smoother as proposed in [11], which is suitable for parallelization. As in [11], we introduce the lifting operator $L: X_h \to M_h$, where $L_{i,j} := 1$ and $L_{i,k} := -1$ if the oriented edge with the unknown $i$ in $X_h$ connects the two unknowns $j$ and $k$ in $M_h$, see Fig. 2. Otherwise we have $L_{i,j} := 0$. We observe that $L^T$ satisfies the pattern condition (32). Now Algorithm 5 is the correct definition of the parallel hybrid smoother HYBRIDSMOOTH(H, C, u, f). The post-smoothing step HybridSmooth^T(H, C, u, f) in Algorithm 4 is obtained from Algorithm 5 by executing step 1 after steps 2-3-4-5 instead of executing the given order 1-2-3-4-5. Since HYBRIDSMOOTH involves at least two smoothing steps, one in $X_h$ and one in $M_h$, at least two subsequent communications are required. Note that, using the matrix $C$ derived from (17) for defining the smoother in $M_h$, we have to use the correct scaling by $\nu_{\min}^2$ corresponding to the definition of $M$ in (19). Finally, the smoother SMOOTH can be any standard smoother with a good parallel performance. In order to achieve the last goal, we propose to apply a block Jacobi smoother with Gauss–Seidel smoothing in the blocks containing the interior unknowns of the subdomains and Jacobi smoothing in the blocks of the interface unknowns. Hence, the definition of the smoother depends directly on the partition of the mesh. In particular, the blocks of unknowns at the interfaces where Jacobi steps are performed grow with the number of subdomains. Practically this means that the number of iterations grows slightly, see below. Note that the other components of the solver presented above are completely independent of the partitioning.

Algorithm 5 (Parallel hybrid smoother HYBRIDSMOOTH(H, C, u, f)).
    \tilde{u} \Leftarrow Smooth(H, u, f)
    q = L(f - H \tilde{u})
    p = 0
    \tilde{p} \Leftarrow Smooth(\nu_{\min}^2 C, p, q)
    u = \tilde{u} + L^T \tilde{p}
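A sketch of one pre-smoothing sweep of Algorithm 5, again with the CRSMatrix/applyDistributed helpers from Section 3 and with generic smoother callbacks; we assume that the callbacks perform their own accumulation and hence their own communication internally, and all names are ours.

```cpp
#include <vector>
#include <functional>
#include <cstddef>

// A smoother callback updates the iterate x for a fixed right-hand side rhs;
// it is assumed to hide its own parallel data handling.
using Smoother = std::function<void(std::vector<double>& x,
                                    const std::vector<double>& rhs)>;

// One hybrid smoothing step, cf. Algorithm 5: smooth in X_h, lift the residual
// to M_h via L, smooth there with nu_min^2 C, lift the correction back via L^T.
void hybridSmooth(const CRSMatrix& Hs, const CRSMatrix& Ls, const CRSMatrix& LsT,
                  const Smoother& smoothH, const Smoother& smoothC,
                  std::vector<double>& u, const std::vector<double>& f) {
  smoothH(u, f);                                         // u~ <= Smooth(H, u, f)
  std::vector<double> Hu = applyDistributed(Hs, u);      // H u~ (local, distributed)
  std::vector<double> res(f.size());
  for (std::size_t i = 0; i < res.size(); ++i) res[i] = f[i] - Hu[i];
  std::vector<double> q = applyDistributed(Ls, res);     // q = L (f - H u~)
  std::vector<double> p(q.size(), 0.0);
  smoothC(p, q);                                         // p~ <= Smooth(nu_min^2 C, 0, q)
  std::vector<double> corr = applyDistributed(LsT, p);   // L^T p~
  for (std::size_t i = 0; i < u.size(); ++i) u[i] += corr[i];  // u = u~ + L^T p~
}
```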
5. Numerical results

In this section, we present an example from magnetostatics and we apply the algorithms presented above. The geometry of our model problem together with the corresponding coarse surface mesh is shown in Fig. 3 on the left. We consider the model of a transformer
Fig. 3. Geometry, initial mesh (left), resulting magnetic flux density B (right).
with a core, three coils and the air domain around these parts. The outer boundary is given by an iron casing of high conductivity, which motivates the boundary condition (3). The magnetic field which is to be computed is generated by tangential currents within the three coils. The iron core has a permeability of 1000. The resulting magnetic flux density B is shown in Fig. 3 on the right.

The basis for our domain decomposition is a tetrahedral mesh generated fully automatically by NETGEN [17]. The surface mesh shown in Fig. 3 corresponds to a volume mesh of 2465 tetrahedra. We apply a modified recursive spectral bisection (rsb) algorithm which allows any number $P$ of subdomains to be used; see [8] and references therein. The use of the rsb ensures that the elements are distributed almost equally to the $P$ processors being used. The partition of the mesh now defines the domain decomposition introduced in Section 3.1. The partitions into 4 and 32 subdomains are shown in Fig. 4. Finer meshes corresponding to the levels $\ell = 2, 3, 4$ are obtained from uniform refinement, resulting in overall 1,262,080 tetrahedra. This refinement is purely local and can be realized without communication. However, newly created unknowns at interfaces between different subdomains have to be
Fig. 4. Partitions into 4 and 32 subdomains.
identified uniquely by all surrounding processors. For this purpose, one root process per interface receives data from all surrounding subdomains, identifies all unknowns uniquely and finally distributes this information again to all adjacent processors. This setup phase is required once after each refinement step.

The numerical experiments presented here are carried out on an SGI Origin 2000 machine with 64 R12000 CPUs (300 MHz) and overall 20 GB main memory. The numerical simulations are carried out using the object-oriented C++ code FEPP [14,18]. The message passing is based on MPI from the SGI Message Passing Toolkit 1.2. The wall-clock time has been measured by MPI_WTIME(.).

The system (16) has been solved using the pcg algorithm with the relative accuracy $10^{-4}$. For the multigrid preconditioner $C_H$ (20), a V-cycle with 1 pre- and 1 post-smoothing step HYBRIDSMOOTH has been used. The multigrid regularizator preconditioner $\tilde{C}$ (21) has been realized by a V-cycle with 2 pre- and 2 post-smoothing steps using a standard smoother with good parallel efficiency as described before. Table 1 shows the corresponding results including the number of unknowns (dof), the number of iterations (It.) and the wall-clock time in seconds (T [s]). We increase the number of unknowns from top to bottom, while the number of processors is increased from left to right.

First, we observe that the number of iterations is almost independent of the number of unknowns. This shows the optimality of our algorithm. However, the number of iterations depends slightly on the number of processors. This is because our smoothers depend on the partition of the mesh, as mentioned before. The wall-clock times for the pcg are given in the upper part of Table 1 for each level ℓ separately. Additionally, we present the overall time for the finest grid (ℓ = 4) and for all grids (Σ ℓ = 1,...,4) in the lower part of Table 1. This time includes the grid refinement together with the setup phase for the vector accumulation, the assembling of the matrices and the solution of the system by the pcg. As expected, the computation time can be reduced drastically if the number of processors is increased. Although most of the gain is due to the additional CPU capacity, the additional cache which comes with each processor also contributes to the acceleration.

Table 2 and Fig. 5 show the corresponding speedup results. First, the overall speedup for the accumulated time over all four levels is given.

Table 1
Wall-clock time for the solver on each level (upper part); overall wall-clock time for ℓ = 4 and accumulated over all levels (ℓ = 1,...,4) in seconds (lower part)

                              P = 1            P = 4    P = 16   P = 32   P = 48     P = 60
  ℓ   dof                  It.    T (s)        T (s)    T (s)    T (s)    T (s)      It.    T (s)
  1   3466                  5       0.1          0.1      0.1      0.1      0.1       5       0.1
  2   26 907                8       7.1          2.9      1.5      1.4      1.4      11       1.7
  3   212 597               9      84           28        7.3      4.7      3.8      13       4.1
  4   1 691 370            11    1398          496       81       42       28        14      24

  T (ℓ = 4)                      2593          806      160       86       61                54
  T (Σ ℓ = 1,...,4)              2852          886.4    188.4    108.5     83.8              79.6
Table 2
Speedup results for the overall time, the time for ℓ = 4, the pcg (solver) for ℓ = 4 and one iteration of the pcg for ℓ = 4

  P                   1      4      16     32     48     60
  Σ ℓ = 1,...,4       1.0    3.2    15.1   26.2   34.1   35.6
  ℓ = 4               1.0    3.2    16.2   30.0   42.5   48.0
  pcg (ℓ = 4)         1.0    2.8    17.2   33.0   49.6   58.5
  1 Iter. (ℓ = 4)     1.0    3.3    21.9   42.1   63.2   74.4
Fig. 5. Speedup results.
It performs well up to P = 16 and is no longer optimal for P = 60. This loss of efficiency is due to the setup phase for the interface unknowns, which shows a rather poor scalability in the current implementation: its time decreases only from 32 s for P = 16 to 15 s for P = 60. The speedup computed for ℓ = 4 shows a slightly better behavior since coarse grid effects are neglected. However, the speedup computed for the solver and ℓ = 4 only shows much better results. Here the speedup is almost optimal for P = 60. For a more detailed analysis we consider the speedup with respect to one iteration for ℓ = 4. Now we observe even superlinear speedups. However, this is due to cache effects: looking at Fig. 5, which shows the results of Table 2, we see that for P ≥ 16 the local problems are small enough to fit into the local caches. Summarizing, we observe an optimal scalability of our iterative solver. In the case of outer iterations, e.g., for nonlinear problems, where the linear solver is called repeatedly on a fixed mesh, the time required for the setup phase can be neglected and the solver, which shows optimal efficiency, will dominate.
References

[1] D. Arnold, R. Falk, R. Winther, Multigrid in H(div) and H(curl), Numer. Math. 85 (2000) 197–218.
[2] R. Beck, P. Deuflhard, R. Hiptmair, R. Hoppe, B. Wohlmuth, Adaptive multilevel methods for edge element discretizations of Maxwell's equations, Surv. Math. Ind. 8 (1999) 271–312.
[3] F. Brezzi, M. Fortin, Mixed and Hybrid Finite Element Methods, Springer Series in Computational Mathematics, vol. 15, Springer, Berlin, 1991.
[4] P. Ciarlet, The Finite Element Method for Elliptic Problems, North-Holland, Amsterdam, 1978.
[5] G. Haase, Parallel incomplete Cholesky preconditioners based on the non-overlapping data distribution, Parallel Comput. 24 (11) (1998) 1685–1703.
[6] G. Haase, Parallelization of Numerical Algorithms for PDEs, Teubner, 1999 (in German).
[7] G. Haase, A parallel AMG for overlapping and non-overlapping domain decomposition, Electron. Trans. Numer. Anal. (ETNA) 10 (2000) 41–55.
[8] G. Haase, M. Kuhn, Preprocessing for 2D FE-BE domain decomposition methods, Comput. Vis. Sci. 2 (1999) 25–35.
[9] G. Haase, U. Langer, A. Meyer, The approximate Dirichlet decomposition method. Part I, II, Computing 47 (1991) 137–167.
[10] W. Hackbusch, Multi-Grid Methods and Applications, Springer Series in Computational Mathematics, vol. 4, Springer, Berlin, 1985.
[11] R. Hiptmair, Multigrid methods for Maxwell's equations, SIAM J. Numer. Anal. 36 (1999) 204–225.
[12] M. Jung, U. Langer, A. Meyer, W. Queck, M. Schneider, Multigrid preconditioners and their applications, in: G. Telschow (Ed.), Third Multigrid Seminar, Report R-MATH-03/89, Karl-Weierstrass-Institut, Biesenthal 1988, pp. 11–52, Berlin, 1989.
[13] M. Kuhn, Efficient parallel numerical simulation of magnetic field problem, Ph.D. Thesis, Johannes Kepler University Linz, 1998.
[14] M. Kuhn, U. Langer, J. Schöberl, Scientific computing tools for 3D magnetic field problems, in: J.R. Whiteman (Ed.), The Mathematics of Finite Elements and Applications X, Elsevier, Amsterdam, 2000, pp. 239–258.
[15] J. Nedelec, A new family of mixed finite elements in R^3, Numer. Math. 50 (1986) 57–81.
[16] M. Schinnerl, J. Schöberl, M. Kaltenbacher, U. Langer, R. Lerch, Multigrid methods for the fast numerical simulation of coupled magnetomechanical systems, ZAMM 80 (2000) 117–120.
[17] J. Schöberl, NETGEN - an advancing front 2D/3D-mesh generator based on abstract rules, Comput. Vis. Sci. 1 (1997) 41–52.
[18] J. Schöberl, Object-oriented Finite Element Code FEPP, 1997, see also: http://www.sfb013.uni-linz.ac.at.