Applied Numerical Mathematics 7 (1991) 379-398
North-Holland

A multigrid solver for the Euler equations on the iPSC/2 parallel computer

L. Beernaert and D. Roose
Department of Computer Science, K.U. Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium

R. Struys and H. Deconinck
Von Karman Institute for Fluid Dynamics, Waterloosesteenweg 72, B-1640 Rode, Belgium

Abstract

Beernaert, L., D. Roose, R. Struys and H. Deconinck, A multigrid solver for the Euler equations on the iPSC/2 parallel computer, Applied Numerical Mathematics 7 (1991) 379-398.
The Euler equations describe inviscid compressible fluid flow and form a system of four nonlinear partial differential equations. We discuss the implementation of a multigrid solver for the two-dimensional Euler equations on a distributed memory multiprocessor. The discretization in space is based on quadrilateral cells, forming a “logically rectangular” grid. The parallelism is introduced by a decomposition of the domain into subdomains. Several relaxation schemes are considered. Some of these relaxation schemes are also used as smoother in the multigrid implementation. Timings and efficiency results obtained on the Intel iPSC/2 hypercube are discussed.
1. Discretization of the Euler equations

The Euler equations describe inviscid compressible flow [7]. Let p, ρ, u, v, e, H denote respectively pressure, density, the velocity components of the flow in x and y direction, the specific total energy and the total enthalpy. Then the Euler equations can be written as
$$\frac{\partial U}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} = 0, \tag{1}$$
where U denotes the vector of conservative variables and F and G are the flux vectors in x and y direction:
$$U = \begin{bmatrix} \rho \\ \rho u \\ \rho v \\ \rho e \end{bmatrix}, \qquad
F = \begin{bmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ \rho u H \end{bmatrix}, \qquad
G = \begin{bmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ \rho v H \end{bmatrix}. \tag{2}$$
At the Von Karman Institute for Fluid Dynamics an Euler solver has been developed for two-dimensional structured domains. The finite volume discretization in space is based on quadrilateral cells, forming a "logically rectangular grid" (see Fig. 1(a)).
Fig. 1. Finite volume discretization based on quadrilateral cells.
A first-order upwind discretization scheme is obtained as follows. After integrating (1) over each of the grid cells, U is taken to be constant within a cell and the integral over the boundary of the cell is replaced by a sum over the four faces shared with the adjacent cells. This leads to the following equation for each cell P:
$$\Omega_P \frac{\mathrm{d}U_P}{\mathrm{d}t} + \sum_{Q=1}^{4} H_{PQ}\,\Delta S_{PQ} = 0, \tag{3}$$
where Ω_P denotes the volume of cell P, ΔS_PQ the length of the boundary between cells P and Q, and H_PQ = F n_x + G n_y represents the flux across the boundary between P and Q ([n_x, n_y]^T denotes the normal vector). U_P are the "cell-centered" variables. The flux H_PQ can be written as a function of the variables in P and Q using an upwind flux vector splitting
$$H_{PQ} = H^+(U_P) + H^-(U_Q). \tag{4}$$
In this solver, H^+ and H^- are calculated using the Van Leer flux vector splitting, see [10,11]. In case Q is a boundary point, the unknowns are determined such that the numerical flux function H_PB, based on the boundary and adjacent interior variables, gives a flux which satisfies the boundary conditions [3]. In order to compute the steady state solution of (3) a time-marching approach is used, based on an implicit scheme. Applying the Euler backward scheme to (3) yields
$$\Omega_P \frac{U_P^{n+1} - U_P^{n}}{\Delta t} + \sum_{Q=1}^{4} H_{PQ}^{n+1}\,\Delta S_{PQ} = 0. \tag{5}$$
H ‘+I is a nonlinear function of Uin+i and U$+i. Equation (5) can be linearized by applying a N%ton linearization of Hipi 1 around time level n. Because we are only interested in the steady state solution, and not in an accurate solution at each time level, it is common practice to perform only one step of the Newton linearization at each time level. This leads to the following linear equations in terms of the differences AU = Un+l - U” for cell P and the four neighbouring cells Q:
[%+..I
AU,+
t B
Q=l
pQ
AU,= -RES=
-
i Q=l
[H+(u;)+H-(u;)]
Asp,. (6)
The right-hand side RES denotes the residual and the 4 × 4 matrices A_P and B_PQ contain the derivatives of the flux, i.e.
$$A_P = \sum_{Q=1}^{4}\left[\frac{\partial H^+}{\partial U_P}\right]\Delta S_{PQ}, \qquad
B_{PQ} = \left[\frac{\partial H^-}{\partial U_Q}\right]\Delta S_{PQ} \quad (Q = 1,\dots,4). \tag{7}$$
The boundary conditions lead, for each cell on the boundary, to an equation similar to (6), where U_P now denotes the boundary unknowns and U_Q the adjacent cell unknowns (see Fig. 1(b)) [3].

Assuming the logically rectangular grid has N_x × N_y cells, the computational grid thus consists of a mesh of (N_x + 2) × (N_y + 2) grid points, corresponding to the centers of the cells and the boundaries (see Fig. 1(c)). These points will be called interior and boundary grid points respectively.
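To make the structure of (3)-(6) concrete, here is a small Python sketch (ours, not the Von Karman code) that accumulates the residual RES over a logically rectangular grid with one layer of boundary points. The functions h_plus and h_minus are placeholders standing in for the Van Leer split fluxes H^± of (4); the array layout and all names are assumptions made for illustration only.

```python
import numpy as np

def h_plus(u_left, nx, ny):
    # Placeholder for the Van Leer forward split flux H+(U), see [10,11].
    # It only returns zeros of the right shape so that the sketch runs.
    return np.zeros(4)

def h_minus(u_right, nx, ny):
    # Placeholder for the Van Leer backward split flux H-(U).
    return np.zeros(4)

def residual(u, normals, ds):
    """Accumulate RES of (6) for every interior cell.

    u       : (Nx+2, Ny+2, 4) conservative variables, boundary layer included
    normals : (Nx+2, Ny+2, 4, 2) outward unit normals of the four faces of each cell
    ds      : (Nx+2, Ny+2, 4) face lengths Delta S_PQ
    """
    nxc, nyc = u.shape[0] - 2, u.shape[1] - 2
    res = np.zeros((nxc, nyc, 4))
    neigh = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # west, east, south, north neighbours Q
    for i in range(1, nxc + 1):
        for j in range(1, nyc + 1):
            for q, (di, dj) in enumerate(neigh):
                nx, ny = normals[i, j, q]           # outward normal of face q
                hpq = h_plus(u[i, j], nx, ny) + h_minus(u[i + di, j + dj], nx, ny)
                res[i - 1, j - 1] += hpq * ds[i, j, q]   # sum of H_PQ * Delta S_PQ as in (3)
    return res
```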
2. The test problem

In this section we describe our test problem. Consider a channel with a circular bump on the lower wall, see Fig. 2. Upper and lower walls are solid; on the left and right boundary the exact solution is imposed. Using an inlet Mach number M = 0.85 produces a weak shock near the end of the bump, see Fig. 3. We refer to [9] for a more precise description of the test problem. The grid used is denser close to the bump than further away. As initial condition uniform flow with M = 0.85 is taken. In our tests the convergence requirement was
$$\log\frac{\|\mathrm{RES}^{n}\|_2}{\|\mathrm{RES}^{0}\|_2} < -3.9 \qquad\text{and}\qquad \log\frac{\|\mathrm{RES}^{n}\|_\infty}{\|\mathrm{RES}^{0}\|_\infty} < -3.9,$$
and we used CFL = 100. We used a discretization with 96 × 32 cells, i.e. 98 × 34 (= 3332) grid points.
Fig. 2. "GAMM-channel" with circular bump and the finite volume grid.
Fig. 3. Solution of the "GAMM-channel" problem, with inlet Mach number M = 0.85. Isomach lines.
3. Sequential solution methods

At each time level a linear system of 4(N_x + 2)(N_y + 2) equations has to be solved. This is done either by a (point or line) Gauss-Seidel relaxation scheme or by a multigrid method in which Gauss-Seidel relaxation is used as smoother. In the latter case N_x and N_y must be multiples of 2^(l-1), with l the number of multigrid levels. Each coarser grid is then constructed by considering 2 × 2 neighbouring cells of the current fine grid as one new cell of the next coarser grid. In our test problem six multigrid levels can be used.
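A quick way to verify that six levels are possible for a 96 × 32 grid is to keep merging 2 × 2 cells until one of the cell counts becomes odd; the following sketch (ours, for illustration) does exactly that:

```python
def grid_hierarchy(nx, ny):
    """Return the sequence of interior cell counts obtained by repeatedly
    merging 2 x 2 cells, stopping when a further halving is impossible."""
    levels = [(nx, ny)]
    while nx % 2 == 0 and ny % 2 == 0:
        nx, ny = nx // 2, ny // 2
        levels.append((nx, ny))
    return levels

print(grid_hierarchy(96, 32))
# [(96, 32), (48, 16), (24, 8), (12, 4), (6, 2), (3, 1)]  ->  six levels
```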
3.1. Relaxation methods

Table 1 surveys the results obtained on one processor for pure relaxation methods. In the point Gauss-Seidel relaxation scheme the following sweep patterns are used alternately:
(a) the points are relaxed from bottom left to top right,
(b) from top right to bottom left.
In the line Gauss-Seidel scheme, four sweep patterns are used alternately:
(a) vertical lines, left to right,
(b) horizontal lines, bottom to top,
(c) vertical lines, right to left,
(d) horizontal lines, top to bottom.

Table 1
Timings and convergence results for relaxation schemes on one processor

Method                          Number of            Total time   Time per      Time per point
                                relaxation steps     (sec)        step (sec)    per step (msec)
Point Gauss-Seidel                   430               30091         70.0            21
Line Gauss-Seidel                    230               36797        160.0            48
Red-black point Gauss-Seidel        1032               79802         77.3            23
Red-black line Gauss-Seidel          482               79322        164.6            49
A red-black coloring of the points or lines allows updating all points or lines of the same color in parallel. In the red-black point Gauss-Seidel and red-black line Gauss-Seidel schemes, first all red points or lines are updated and afterwards all black points or lines. In the red-black line Gauss-Seidel scheme horizontal and vertical lines are alternated.

Table 1 clearly shows that the number of iterations needed to obtain the prescribed accuracy is much higher for the red-black schemes than for the corresponding lexicographic schemes. The line schemes require fewer but more expensive steps than the corresponding point schemes. For a more detailed comparison of these relaxation schemes, we refer to [11,13]. Notice that, due to implementation details, the lexicographic schemes require less time per relaxation step than the corresponding red-black schemes.

3.2. Multigrid

In a multigrid context a lot of parameters can be varied, e.g. the number of multigrid levels, the sequence of visiting the different levels, the smoothing method, and the number of smoothing steps. Depending on the choice of these parameters different convergence results are obtained. In this text we give timing results for the following values of the parameters: number of multigrid levels: 3, 4, 5 or 6; number of smoothing or relaxation steps on all but the coarsest grids (= nf): 4, 6 or 8; number of relaxations to solve the problem on the coarsest grid "exactly" (= nc): 20; relaxation schemes: lexicographic point Gauss-Seidel (Table 3) or red-black point Gauss-Seidel (Table 4).

Our test program starts with a full multigrid step and then continues with V-cycles until the convergence requirement is fulfilled (a control-flow sketch is given after Table 2). In order to compare the amount of work of the two programs, we define a workunit as the cost of one relaxation step on the finest grid. One relaxation or smoothing step on the next coarser grid costs one fourth of a workunit, on the next coarser grid one sixteenth of a workunit, and so on. The costs of a full multigrid step and a V-cycle on several numbers of grids are given in Table 2.
Table 2
Cost of a full multigrid step and a V-cycle on several numbers of grids

Number of grids   Full multigrid              V-cycle
6                 (5/1024) nc + (15/8) nf     (1/1024) nc + (213/128) nf
5                 (1/64) nc + (59/32) nf      (1/256) nc + (53/32) nf
4                 (3/64) nc + (7/4) nf        (1/64) nc + (13/8) nf
3                 (1/8) nc + (3/2) nf         (1/16) nc + (3/2) nf
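The driver logic just described can be summarized as follows. This is our own control-flow sketch, not the actual solver: smooth, restrict, prolongate and converged are placeholders. The smoothing counts it implies per V-cycle (nf on the finest level, 2·nf on every intermediate level, nc on the coarsest) reproduce the workunit costs of Table 2; how the actual code splits the intermediate-level smoothing between the downward and upward legs is not stated in the paper.

```python
def v_cycle(level, coarsest, nf, nc, smooth, restrict, prolongate):
    """One V-cycle starting at `level` (0 = finest)."""
    if level == coarsest:
        smooth(level, nc)              # "exact" solve: nc relaxations on the coarsest grid
        return
    smooth(level, nf)                  # pre-smoothing on the way down
    restrict(level, level + 1)
    v_cycle(level + 1, coarsest, nf, nc, smooth, restrict, prolongate)
    prolongate(level + 1, level)
    if level > 0:
        smooth(level, nf)              # post-smoothing, skipped on the finest level

def solve(levels, nf, nc, smooth, restrict, prolongate, converged):
    """Full multigrid start followed by V-cycles until the convergence test holds."""
    coarsest = levels - 1
    smooth(coarsest, nc)               # start by solving on the coarsest grid
    for lev in range(coarsest - 1, 0, -1):
        prolongate(lev + 1, lev)
        v_cycle(lev, coarsest, nf, nc, smooth, restrict, prolongate)
    prolongate(1, 0)
    smooth(0, nf)                      # the full multigrid step ends on the finest grid
    while not converged():
        v_cycle(0, coarsest, nf, nc, smooth, restrict, prolongate)
```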
Table 3
Timings and convergence results for multigrid runs on one processor. Relaxation scheme: lexicographic point Gauss-Seidel

Number of    nc   nf   Number of      Theoretical   Measured      Total time   Time per
multigrid              V-cycles (a)   number of     number of     (sec)        workunit
levels                                workunits     workunits                  (sec)
6            20    4        9            67.68         61.13         4609        75.4
6            20    6        6            71.37         62.28         4646        74.6
6            20    8        4            68.43         63.02         4674        74.2

5            20    4        8            61.31         57.81         4351        75.3
5            20    6        6            71.47         65.97         4894        74.2
5            20    8        4            68.38         65.58         4839        73.8

4            20    4        8            62.44         58.72         4397        74.9
4            20    6        6            71.81         63.06         4662        73.9
4            20    8        4            68.19         64.59         4735        73.3

3            20    4        7            59.25         56.50         4201        74.4
3            20    6        5            62.75         59.25         4341        73.3
3            20    8        4            67.50         65.00         4728        72.7

(a) Number of V-cycles, after a full multigrid step, to fulfill the convergence requirement.

Table 4
Timings and convergence results for multigrid runs on one processor. Relaxation scheme: red-black point Gauss-Seidel

Number of    nc   nf   Number of      Theoretical   Measured      Total time   Time per
multigrid              V-cycles (a)   number of     number of     (sec)        workunit
levels                                workunits     workunits                  (sec)
6            20    4       13            94.38         90.79         7561        83.3
6            20    6        9           101.38         92.07         7571        82.2
6            20    8        7           108.42         93.35         7631        81.8

5            20    4       13            94.83         90.46         7506        83.0
5            20    6        9           101.52         91.95         7533        81.9
5            20    8        7           108.36         94.69         7715        81.5

4            20    4       13            96.50         91.44         7555        82.6
4            20    6        9           102.00         90.75         7407        81.6
4            20    8        6            94.81         89.91         7286        81.0

3            20    4       13           102.75         95.69         7856        82.1
3            20    6        9           103.75         93.94         7603        80.9
3            20    8        7           107.25         98.50         7926        80.5

(a) Number of V-cycles, after a full multigrid step, to fulfill the convergence requirement.

In Tables 3 and 4 we distinguish the theoretical number of workunits (computed from the costs given in Table 2 and the number of V-cycles to be performed) and the measured number of workunits. The latter number is the lowest, since the smoothing process on a grid stops if the convergence condition is fulfilled on that grid. The time per workunit is higher than the time per
relaxation step in Table 1 since the transfers from one grid to another (restriction and prolongation) are neglected in counting the workunits. When the number of smoothing steps on all but the coarsest grid increases, the intergrid transfers become relatively less important and, furthermore, fewer V-cycles and thus also fewer transfers are required. This explains why, for a fixed number of multigrid levels, the time per workunit decreases if nf increases. Also, with nf kept fixed, the time per workunit decreases if the number of multigrid levels decreases. This reflects the fact that the boundary grid points are not counted in the definition of the workunit, but nevertheless must be treated. And the more multigrid levels are used, the coarser are the coarsest grids, and the higher becomes the relative importance of the boundary grid points on these coarse grids.

Compared with the timing results for the pure relaxation schemes we can conclude that the multigrid solver is at least six times faster for lexicographic point Gauss-Seidel, and at least ten times faster for red-black point Gauss-Seidel. Since fewer V-cycles are required for the lexicographic point Gauss-Seidel than for the red-black point Gauss-Seidel, we can also conclude from Tables 3 and 4 that for our test problem the lexicographic point Gauss-Seidel scheme has much better convergence properties than the red-black point Gauss-Seidel scheme, when these schemes are used as smoother in a multigrid procedure.

Using only three multigrid levels seems sufficient with lexicographic point Gauss-Seidel as smoother, but not when red-black point Gauss-Seidel is used as smoother. In the former case the required number of V-cycles increases slightly with the number of multigrid levels used, and both the number of workunits and the execution times for three multigrid levels are competitive. In the latter case the required number of V-cycles is nearly independent of the number of multigrid levels, and now both the number of workunits and the execution times for three multigrid levels are about the highest of those mentioned.

The number of smoothing steps does not influence the execution times very much. For the lexicographic point Gauss-Seidel the ratio of the execution time of the "worst" case to that of the "best" case is 1.16; for the red-black point Gauss-Seidel that ratio is 1.09, and only 1.06 if the execution times on three multigrid levels are left out. We found experimentally that the number of smoothing steps on all but the coarsest grids had to be at least four, since with a smaller number there was possibly no convergence.
4. Parallelization of the code

4.1. Relaxation solvers
The parallel implementation of the Euler solver uses a decomposition of the grid of cells into as many subdomains as there are nodes. A rectangular grid of p_x × p_y subdomains (p_x and p_y being powers of two) can be mapped onto a hypercube such that neighbouring subdomains are mapped onto neighbouring nodes. Each node is responsible for all computations for the grid points in its subdomain. The equations associated with a grid point on the border of a subdomain involve one or more grid points allocated to neighbouring nodes. It is common practice to store in each node also data for grid points lying just outside its subdomain, see e.g. [6,12]. These points are called the grid points in the overlap regions. The variables U_P associated with the overlap regions are made consistent by exchanging them after each relaxation step. Four reals per grid point in the overlap regions are then sent to and
received from the neighbouring nodes. Because of the original sequential algorithm, for the red-black schemes an additional work vector that contains four reals per red grid point in the overlap regions has to be exchanged after the relaxation of the red grid points. Global communication is required for the computation of the maximum allowable time step and the convergence test. No other communication is necessary, except for the block-tridiagonal solver (see Section 5.1.2).

Communication has been implemented using the iPSC-version of the SUPRENUM Communications Subroutine Library. This library has been designed for grid-oriented problems [5]. It contains subroutines for mapping two-dimensional (and three-dimensional) grids onto the processors, for exchanging data of the overlap regions, for global operations on arrays distributed over all nodes, etc. However, this library could not be used immediately, since it uses a different data structure than the Von Karman Euler solver. In the SUPRENUM Communications Subroutine Library all information of a grid function (i.e. one real per grid point) is stored in consecutive memory locations, while in the Von Karman Euler solver the variables U_P of one grid point P (four reals) are stored in consecutive memory locations. Thus we adapted the SUPRENUM Communications Subroutine Library, so that for each grid point it can now select a number of consecutive memory locations.

Each communication phase (either a global operation or exchanging the grid points of the overlap regions) causes an implicit synchronization of all the processors. Therefore it is important that each processor has the same amount of work to do in between two communication phases. This is only possible if each processor treats the same number of grid points (both interior and boundary grid points). Otherwise, some processors will temporarily be idle, waiting for the processor that treats the highest number of grid points. The amount of work to be done per relaxation step can be described as
$$t_{\mathrm{tot}} = t_i + t_b + t_{\mathrm{or}} + t_{\mathrm{gl}}, \tag{8}$$
where
t_tot: the total time for one relaxation step;
t_i: the time for treating the interior grid points; it is proportional to n_i, the maximum number of interior grid points;
t_b: the time for treating the boundary grid points; it is proportional to n_b, the maximum number of boundary grid points;
t_or: the time for exchanging the overlap regions; it is proportional to n_or t_s + k t_st-up, where n_or is the maximum number of points in the overlap regions, t_s is the time for sending the information of one grid point (four reals) (notice that t_s also includes the time for selecting the appropriate values from the data structure and copying these into the buffer for communication), t_st-up is the start-up time for sending a message, and k is the number of messages, i.e. the maximum number of neighbours a processor has in the configuration;
t_gl: the time for the global communication; it is proportional to log_2 P, the dimension of the hypercube.
t_i and t_b contribute to the computational part of each relaxation step and are thus also proportional to t_calc, the average time needed for a floating point operation. For the parallel computer under consideration (Intel iPSC/2), t_calc = 5.9 μs [1,2]. t_or and t_gl contribute to the time for communication. Two important values here are t_st-up, the start-up time, and t_send, the time for sending a floating point number from one processor to a neighbour. For the Intel iPSC/2 a distinction must be made whether the floating point number is part of a short (< 100 bytes) or of a long message. The respective values are t_st-up = 369 μs for short and 688 μs for long messages, and t_send = 0.67 μs in a short and 1.43 μs in a long message. For more details concerning these times, we refer to [1,2].

For a fixed number of processors in a hypercube, and with the restriction that we want to use a two-dimensional grid of processors whose sizes are a power of 2, still several configurations are possible. We will now discuss which of these configurations will lead to a minimal execution time. In order that the computations for the interior grid points are well balanced, the number of interior grid points in each direction must be a multiple of the number of processors in that direction, i.e. N_x = k_x p_x and N_y = k_y p_y. Notice that for the 96 × 32 interior grid points of our test problem, this condition is satisfied for all processor configurations possible on a four-dimensional hypercube.

The boundary grid points, however, can cause imbalance. Assuming that the interior grid points are perfectly balanced, the processors can, in the general case that both p_x > 1 and p_y > 1, be divided into three groups: (a) "interior" processors, without any boundary grid points; (b) processors lying on an "edge" of the processor grid, having N_x/p_x or N_y/p_y boundary grid points; (c) processors lying on a "corner" of the processor grid, having N_x/p_x + N_y/p_y + 1 boundary grid points. In order to minimize the imbalance, N_x/p_x + N_y/p_y + 1 must be minimized under the condition that p_x · p_y = P. It is easy to verify that the minimum will be reached when N_x/p_x = N_y/p_y, i.e. when the subdomains resulting from the decomposition of the computational grid are square. An exception to this "rule" occurs when p_x = p_y = 2. In that case the boundary grid points are perfectly balanced, independent of the size of the computational grid. On the other hand, when either p_x or p_y equals 1 (in fact the processors are then ordered in a one-dimensional array) there is an imbalance of N_x + 2 or N_y + 2 boundary grid points, and this imbalance cannot be removed, except for the 2 × 1 and 1 × 2 processor grids, for which configurations the boundary grid points are also perfectly balanced.

It is well known that square subdomains also minimize the total length of the interior boundaries [4], i.e. the amount of information to be communicated during the exchange of the overlap regions. Thus for a fixed number of processors, the processor configuration leading to (almost) square subdomains will also have the lowest value for n_or, the maximum number of grid points in the overlap regions. However, other configurations can lead to the lowest value for k, the number of messages to be sent during the exchange operation. Indeed, in a one-dimensional array of processors, each processor has at most two neighbours. Thus, for a fixed number of processors, the time for exchanging the overlap regions will be lowest when the subdomains are almost square, unless the start-up time t_st-up is very large compared to n_or t_s. Then it is very important to reduce the number of messages. In Table 5 we give the values of n_i, n_b, n_or, k and log_2 P for a grid of 98 × 34 grid points and all processor configurations possible on a four-dimensional hypercube. These values confirm the discussion given above.
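The entries of Table 5 can be recomputed from the decomposition alone. The following sketch is our own reconstruction of that bookkeeping, not code from the paper; the counting conventions (which boundary points a processor treats, and that an exchanged strip carries the domain-boundary points at its ends) are assumptions chosen because they reproduce the published values for the 98 × 34 grid.

```python
def decomposition_stats(Nx, Ny, px, py):
    """Per-configuration maxima of n_i, n_b, n_or and k for an Nx x Ny interior
    grid decomposed into px x py rectangular subdomains (one layer of boundary
    points around the whole domain, as in the paper)."""
    mx, my = Nx // px, Ny // py                  # interior points per subdomain
    n_i = mx * my                                # identical for every processor
    n_b_max = n_or_max = k_max = 0
    for i in range(px):
        for j in range(py):
            left, right = i == 0, i == px - 1    # subdomain touches domain boundary?
            bot, top = j == 0, j == py - 1
            # boundary points treated by this processor: side strips plus corners
            n_b = (left + right) * my + (bot + top) * mx \
                  + left * bot + left * top + right * bot + right * top
            # overlap points exchanged with each existing neighbour; a strip also
            # carries the domain-boundary points at its two ends
            n_or, k = 0, 0
            if not left:
                n_or += my + bot + top
                k += 1
            if not right:
                n_or += my + bot + top
                k += 1
            if not bot:
                n_or += mx + left + right
                k += 1
            if not top:
                n_or += mx + left + right
                k += 1
            n_b_max, n_or_max, k_max = max(n_b_max, n_b), max(n_or_max, n_or), max(k_max, k)
    return n_i, n_b_max, n_or_max, k_max

print(decomposition_stats(96, 32, 8, 2))   # (192, 29, 46, 3), the 8 x 2 row of Table 5
```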
Table 5
Values that determine the amount of work to be done in one relaxation step for a 98 x 34 grid. n_i: maximum number of interior grid points, n_b: maximum number of boundary grid points, n_or: maximum number of grid points in the overlap regions, k: maximum number of neighbours

Px x Py    n_i    n_b   n_or   k   log2 P
16x1       192     46     68   2   4
8x2        192     29     46   3   4
4x4        192     33     64   4   4
2x8        192     53    102   3   4
1x16       192    102    196   2   4

8x1        384     58     68   2   3
4x2        384     41     58   3   3
2x4        384     57    106   3   3
1x8        384    106    196   2   3

4x1        768     82     68   2   2
2x2        768     65     66   2   2
1x4        768    114    196   2   2

2x1       1536    130     34   1   1
1x2       1536    130     98   1   1

1x1       3072    260      0   0   0
The importance of the high values of n_or and n_b for some configurations must not be exaggerated, since the whole program is computationally dominant and an interior grid point is more computation intensive than a boundary grid point. For rather large problems the term t_i is dominant in (8), and the load imbalance and the communication time will not prohibit reaching a high speed-up, see Section 5.
4.2. Multigrid solver

As already mentioned, a coarser grid is constructed by considering 2 × 2 neighbouring cells of the fine grid as one new cell of the next coarser grid. A problem arises when the number of cells per processor in a direction is odd. In that case it is impossible to construct the next coarser grid on P processors. A possible solution for this problem is called agglomeration [5,12]. Instead of P processors, only ½P or ¼P (depending on the number of cells per processor in both directions) are active on the next coarser grid. As soon as the finer grid is worked on again, all processors become active again (deagglomeration). The agglomeration process can be repeated, so that the parallel program uses as many grids as the sequential one, but possibly only one processor is active on the coarsest grid.

For a given fine grid with N_x × N_y cells, and thus also N_x × N_y interior grid points, the number of multigrid levels that can be used without agglomeration, say l_max, depends on the processor configuration p_x × p_y. Suppose N_x/p_x = c_x · 2^(l_x) and N_y/p_y = c_y · 2^(l_y) with c_x and c_y odd. Then l_max = min(l_x, l_y) + 1. Notice that now the (almost) square subregions are not necessarily best. Starting from all possible configurations with 16 processors, we indicate in Table 6 how many processors are active on each multigrid level if the finest grid has 96 × 32 interior grid points.

Intergrid transfers require communication. Restriction of a fine grid to the next coarser one and prolongation of a coarse grid to the next finer one require that the overlap regions of the new grid are made consistent.
Table 6
Active processor configurations on all multigrid levels for a grid with 96 x 32 interior grid points

Multigrid level                     1       2       3       4       5       6
Number of grid points               98x34   50x18   26x10   14x6    8x4     5x3
Number of interior grid points      96x32   48x16   24x8    12x4    6x2     3x1
Active processor configurations     16x1    16x1    8x1     4x1     2x1     1x1
                                    8x2     8x2     8x2     4x2     2x2     1x1
                                    4x4     4x4     4x4     4x4     2x2     1x1
                                    2x8     2x8     2x8     2x4     2x2     1x1
                                    1x16    1x16    1x8     1x4     1x2     1x1
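The formula for l_max translates directly into a small helper (ours, for illustration); the printed values agree with Table 6: 4 × 4 stays fully active down to level 4, 8 × 2 down to level 3, and 16 × 1 only down to level 2.

```python
def levels_without_agglomeration(Nx, Ny, px, py):
    """l_max = min(l_x, l_y) + 1, where N_x/p_x = c_x * 2**l_x and
    N_y/p_y = c_y * 2**l_y with c_x and c_y odd."""
    def twos(n):              # exponent of the largest power of two dividing n
        l = 0
        while n % 2 == 0:
            n //= 2
            l += 1
        return l
    return min(twos(Nx // px), twos(Ny // py)) + 1

print([levels_without_agglomeration(96, 32, *p) for p in [(4, 4), (8, 2), (16, 1)]])
# [4, 3, 2]
```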
In an agglomeration phase the processors that go idle send their information to another processor, and in a deagglomeration phase idle processors await a message to become active. The SUPRENUM Communications Subroutine Library contains routines for agglomeration and deagglomeration. However, we preferred to write our own agglomeration and deagglomeration routines, because the corresponding SUPRENUM routines are complicated and the adaptation with respect to the data structure would have been a harder job.

For a smoothing or relaxation step on each of the six multigrid levels, (8) is valid with appropriate values of n_i, n_b, n_or, k and log_2 P. On each level n_i is perfectly balanced over the active processors, but n_b and n_or are not. n_i decreases faster with increasing multigrid level than n_b and n_or do (perimeter effect). The grids of levels 5 and 6 contain more boundary grid points than interior grid points. The coarser the grids, the more imbalance there is, and the time for communication becomes a larger fraction of the total execution time. On the very coarse grids t_i is not dominant any more. As a consequence, speed-up and efficiency will be very low on these grids. Fortunately, the relative importance of the coarse grids is small, see also Section 3.2.

Apart from the work on each grid, there is t_trans, the time needed for intergrid transfers. It is given by
$$t_{\mathrm{trans}} = t_p + t_r + t_{\mathrm{aggl}},$$
where
t_trans: the total time for intergrid transfers;
t_p: the time for the computational part of the restriction and prolongation phases; for each prolongation or restriction step it is proportional to n_p, the number of grid points of the new grid;
t_r: the time for exchanging the overlap regions in the restriction and prolongation phases; for each restriction or prolongation step it can be written as n_r t_s + k* t_st-up, where n_r is the number of points in the overlap regions of the new grid, k* is the number of messages, and t_s and t_st-up are as in (8);
t_aggl: the time for the agglomeration and deagglomeration steps; for each agglomeration or deagglomeration step it is proportional to n_aggl, the number of grid points of the finest grid involved.
Concerning the best processor configuration for a fixed number of processors in a multigrid context, the following holds:
(Almost) square subregions minimize the time for smoothing and also for restriction and prolongation, as long as no agglomeration is necessary. Agglomeration requires extra communication (t_aggl), and results in a severe imbalance during the treatment of the coarsest grids. However, agglomeration can be avoided by a proper choice of the computational grid size and of the processor configuration, and also by restricting the number of multigrid levels used, see Section 5.

In Tables 7 and 8 we give, as an example, the sum over all the multigrid levels of the values n_i, n_b, n_or, k, log_2 P, n_p, n_r, k* and n_aggl for a full multigrid step and a V-cycle on four multigrid levels. These sums are denoted in capital letters, i.e. N_i, N_b, etc. N_i, N_b, N_or, K and Log P depend on the number of relaxation or smoothing steps on each level. Therefore these numbers are given as functions of nc and nf. The values in Tables 7 and 8 give an idea of the imbalance and the communication that influence the execution times of the next section. We now give some more explanation concerning some of these values.

In the earlier discussion we stated that the number of interior grid points is perfectly balanced on each multigrid level. Of course, agglomeration has to be taken into account: n_i is perfectly balanced on the active processors of a given level, not on all processors. Hence different processor configurations can lead to different values for N_i. n_b is not perfectly balanced on each level. How bad this is for some configurations can be seen if the values given under N_b are compared with the value for one processor, divided by the number of processors. Here also some values are higher than one would expect, again due to agglomeration. For the values that tell something about the communication, i.e. N_or, K and Log P, we cannot compare with the values for one processor. But comparing these values for a fixed number of
Table 7
Values that determine the amount of work to be done in a full multigrid step on four levels for a grid with 98 x 34 grid points

Px x Py   Ni              Nb             Nor            K           Log P        Np     Nr    K*   Naggl
16x1      36nc+ 384nf     36nc+158nf     36nc+220nf     6nc+14nf    6nc+24nf      726   312   24    528
8x2       18nc+ 336nf     18nc+ 91nf     27nc+146nf     9nc+21nf    9nc+28nf      546   210   36    120
4x4        9nc+ 336nf     15nc+103nf     24nc+192nf    12nc+28nf   12nc+28nf      549   264   48      0
2x8       18nc+ 336nf     24nc+163nf     45nc+314nf     9nc+21nf    9nc+28nf      642   438   36    156
1x16      36nc+ 384nf     48nc+318nf     84nc+604nf     6nc+14nf    6nc+24nf      938   840   24    712

8x1       36nc+ 672nf     36nc+182nf     36nc+220nf     6nc+14nf    6nc+21nf     1092   312   24    240
4x2       18nc+ 672nf     18nc+127nf     27nc+182nf     9nc+21nf    9nc+21nf      987   255   36      0
2x4       18nc+ 672nf     24nc+175nf     45nc+326nf     9nc+21nf    9nc+21nf     1053   453   36      0
1x8       36nc+ 672nf     48nc+326nf     84nc+604nf     6nc+14nf    6nc+21nf     1284   840   24    312

4x1       36nc+1344nf     36nc+254nf     36nc+220nf     6nc+14nf    6nc+14nf     1974   312   24      0
2x2       36nc+1344nf     27nc+199nf     30nc+206nf     6nc+14nf    6nc+14nf     1896   288   24      0
1x4       36nc+1344nf     48nc+350nf     84nc+604nf     6nc+14nf    6nc+14nf     2106   840   24      0

2x1       72nc+2688nf     54nc+398nf     18nc+110nf     3nc+ 7nf    3nc+ 7nf     3792   156   12      0
1x2       72nc+2688nf     54nc+398nf     42nc+302nf     3nc+ 7nf    3nc+ 7nf     3792   420   12      0

1x1      144nc+5376nf    108nc+796nf      0             0           0            7584     0    0      0

Table 8
Values that determine the amount of work to be done in a V-cycle on four levels for a grid with 98 x 34 grid points

Px x Py   Ni              Nb             Nor            K           Log P        Np     Nr    K*   Naggl
16x1      12nc+ 336nf     12nc+126nf     12nc+180nf     2nc+10nf    2nc+18nf      486   192   12    224
8x2        6nc+ 312nf      6nc+ 75nf      9nc+120nf     3nc+15nf    3nc+20nf      399   129   18     40
4x4        3nc+ 312nf      5nc+ 85nf      8nc+160nf     4nc+20nf    4nc+20nf      405   168   24      0
2x8        6nc+ 312nf      8nc+135nf     15nc+260nf     3nc+15nf    3nc+20nf      461   275   18     52
1x16      12nc+ 336nf     16nc+262nf     28nc+500nf     2nc+10nf    2nc+18nf      626   528   12    304

8x1       12nc+ 624nf     12nc+150nf     12nc+180nf     2nc+10nf    2nc+15nf      798   192   12     80
4x2        6nc+ 624nf      6nc+105nf      9nc+150nf     3nc+15nf    3nc+15nf      741   159   18      0
2x4        6nc+ 624nf      8nc+145nf     15nc+270nf     3nc+15nf    3nc+15nf      783   285   18      0
1x8       12nc+ 624nf     16nc+270nf     28nc+500nf     2nc+10nf    2nc+15nf      922   528   12    104

4x1       12nc+1248nf     12nc+210nf     12nc+180nf     2nc+10nf    2nc+10nf     1482   192   12      0
2x2       12nc+1248nf      9nc+165nf     10nc+170nf     2nc+10nf    2nc+10nf     1434   180   12      0
1x4       12nc+1248nf     16nc+290nf     28nc+500nf     2nc+10nf    2nc+10nf     1566   528   12      0

2x1       24nc+2496nf     18nc+330nf      6nc+ 90nf     1nc+ 5nf    1nc+ 5nf     2868    96    6      0
1x2       24nc+2496nf     18nc+330nf     14nc+250nf     1nc+ 5nf    1nc+ 5nf     2868   264    6      0

1x1       48nc+4992nf     36nc+660nf      0             0           0            5736     0    0      0
processors shows that some configurations will spend much more time in communication than others. Notice that the values under Log P are smaller if more agglomeration steps are required: if fewer processors are active, the global operations become cheaper. For N_p a remark similar to that for N_b applies, and for N_r and K*, the communication values of the restriction and prolongation steps, a remark similar to that for the other communication values applies. Finally, N_aggl shows which processor configurations require agglomeration.
5. Timing and efficiency results

5.1. Relaxation methods

5.1.1. Red-black point Gauss-Seidel relaxation

Timing results obtained on the iPSC/2 for parallel red-black point Gauss-Seidel relaxation on all processor configurations possible on a four-dimensional hypercube are presented in Table 9. For a fixed number of processors, the highest speed-up and efficiency are obtained, as predicted, for the configuration with almost square subregions, i.e. for which n_b and n_or are lowest. However, due to the high start-up time on the iPSC/2, the influence of the number of messages for exchanging the overlap regions cannot be neglected. It explains why the configuration 16 × 1 can give faster execution than 4 × 4. It is well known that the more processors are used, the harder it is to use them efficiently [4]. This phenomenon applies here too. Nevertheless, it is still possible to choose a 16-processor configuration that yields more than 90% efficiency.
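To see how the start-up term can outweigh the volume term, one can evaluate the communication part of (8), n_or t_s + k t_st-up, for these two configurations with the constants of Section 4.1. This is a rough estimate of ours, not a measurement from the paper: it approximates t_s by four times t_send for a long message and ignores the copy and synchronization overheads.

```python
T_STARTUP_LONG = 688e-6      # s, start-up time of a long iPSC/2 message (Section 4.1)
T_SEND_LONG    = 1.43e-6     # s, per floating point number in a long message

def exchange_estimate(n_or, k):
    """Rough lower bound for the overlap-exchange time of one relaxation step:
    k message start-ups plus four reals per overlap grid point."""
    return k * T_STARTUP_LONG + n_or * 4 * T_SEND_LONG

# 16 x 1: k = 2, n_or = 68;  4 x 4: k = 4, n_or = 64  (Table 5)
for name, n_or, k in [("16x1", 68, 2), ("4x4", 64, 4)]:
    print(name, round(exchange_estimate(n_or, k) * 1e3, 2), "ms per step")
# the two extra start-ups of the 4 x 4 configuration cost more than its
# slightly smaller exchange volume saves
```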
Table 9
Timings and efficiency results obtained with red-black point Gauss-Seidel relaxation

Processor configuration   Total time (sec)   Speed-up   Efficiency (%)
16x1                           5511            14.5         90.5
8x2                            5424            14.7         92.0
4x4                            5595            14.3         89.1
2x8                            5996            13.3         83.2
1x16                           6800            11.7         73.4

8x1                           10421             7.66        95.7
4x2                           10242             7.79        97.4
2x4                           10679             7.47        93.4
1x8                           11540             6.92        86.4

4x1                           20171             3.96        98.9
2x2                           20085             3.97        99.3
1x4                           21068             3.79        94.7

2x1                           40070             1.99        99.6
1x2                           40364             1.98        98.9

1x1                           79802             1.00       100
5.1.2. Red-black line Gauss-Seidel relaxation

The main difficulty in parallelizing the line Gauss-Seidel scheme is that a parallel solver for the block-tridiagonal systems is needed, since the data are distributed among a row or column of processors. Such a solver requires communication and contains a sequential part. We used the parallel block-tridiagonal solver developed by Krechel, Plum and Stüben [8]. However, a few interface routines were required to have the data in the appropriate data structure for this solver. In this solver each processor first performs a parallel elimination on its local part of the system. Then an interface system is set up in one of the processors. That interface system is solved either by cyclic reduction or by Gauss elimination and the solution is broadcast among the participating processors. Finally, backsubstitution can be performed in parallel. The second phase, i.e. setting up the interface system, solving it and broadcasting the solution, requires communication and contains a sequential part. If several independent systems have to be solved, this is exploited by executing the sequential part for the different systems on different processors. In that case parallelism remains high in all phases and high efficiencies are obtained.

Timing results are reported in Table 10. Clearly, the parallel solution of the block-tridiagonal systems does not cause a significant loss of efficiency. For this relaxation scheme a decomposition of the two-dimensional grid in (horizontal or vertical) strips has on the one hand the advantage that either the horizontal or the vertical lines are completely stored in one processor and that the block-tridiagonal solver thus works completely in parallel, but has on the other hand the disadvantage that the other lines are split among many processors and thus require a lot of communication. On 16 processors the 4 × 4 configuration seems to offer a good compromise between the two extremes.
Table 10
Timings and efficiency results obtained with red-black line Gauss-Seidel relaxation

Processor configuration   Total time (sec)   Speed-up   Efficiency (%)
16x1                           5392            14.7         91.9
8x2                            5286            15.0         93.8
4x4                            5269            15.1         94.1
2x8                            5641            14.1         87.9
1x16                           6305            12.6         78.6

8x1                           10200             7.78        97.2
4x2                           10213             7.77        97.1
2x4                           10387             7.64        95.5
1x8                           11013             7.20        90.0

4x1                           19961             3.97        99.3
2x2                           20344             3.90        97.5
1x4                           20526             3.87        96.6

2x1                           39905             1.99        99.4
1x2                           40512             1.96        97.9

1x1                           79332             1.00       100
5.1.3. Block-Jacobi, point and line Gauss-Seidel relaxation

Because the lexicographic Gauss-Seidel schemes converge much faster than the red-black schemes, we also implemented the following strategies. On each subdomain, relaxation is performed by the lexicographic Gauss-Seidel schemes. This can be done in each processor in parallel. After each sweep the overlap regions are exchanged. This leads to a block-Jacobi, point Gauss-Seidel or to a block-Jacobi, line Gauss-Seidel scheme. Of course, the convergence properties of these methods are worse than when the lexicographic schemes are used on the whole domain. For each configuration the convergence requirement is fulfilled after a different number of relaxations, see Table 11. The more processors (i.e. the more subdomains) are used, the higher the number of relaxations that is necessary. But also for a fixed number of processors, the number of relaxations differs, depending on the configuration. Configurations leading to almost square subdomains are, here too, best with respect to the execution time (per relaxation). Furthermore, they are also best with respect to the number of relaxations for convergence of the block methods. These configurations thus combine both qualities.

In Table 12 we consider only these best configurations. The speed-ups and efficiencies reported in Table 12 are defined as the ratio of the execution time on one processor to that on P processors to reach the prescribed convergence requirement (i.e. taking into account the different number of relaxation steps). Of course, speed-up and efficiency are for these schemes much lower than for the corresponding red-black schemes. However, if the number of subdomains is not too high, the block schemes have shorter execution times than the corresponding red-black schemes. An additional advantage of the block-Jacobi line Gauss-Seidel scheme is that the block-tridiagonal systems need only be solved within each processor, without any communication.
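The structure of these block schemes is easy to sketch. The example below applies the same idea to a scalar Poisson model problem instead of the Euler system, purely to show the update pattern (a lexicographic Gauss-Seidel sweep inside each block, with the overlap values frozen during the sweep); all names and the model problem are ours.

```python
import numpy as np

def block_jacobi_gs(u, f, h, subdomains, sweeps=1):
    """One block-Jacobi step for the 2-D Poisson model problem -lap(u) = f:
    each subdomain does `sweeps` lexicographic Gauss-Seidel sweeps using the
    overlap (ghost) values frozen at the start of the step, after which the
    ghost values are implicitly refreshed ("exchanged")."""
    u_old = u.copy()                       # frozen overlap values, as on a parallel machine
    for (i0, i1, j0, j1) in subdomains:    # index ranges of the interior of each block
        for _ in range(sweeps):
            for i in range(i0, i1):
                for j in range(j0, j1):
                    def val(ii, jj):
                        # neighbours inside the block are already updated,
                        # neighbours outside come from the overlap region
                        inside = i0 <= ii < i1 and j0 <= jj < j1
                        return u[ii, jj] if inside else u_old[ii, jj]
                    u[i, j] = 0.25 * (val(i - 1, j) + val(i + 1, j) +
                                      val(i, j - 1) + val(i, j + 1) + h * h * f[i, j])
    return u

# 2 x 2 decomposition of a 10 x 10 interior grid (12 x 12 array with boundary)
n = 10
u = np.zeros((n + 2, n + 2)); f = np.ones((n + 2, n + 2)); h = 1.0 / (n + 1)
blocks = [(1, 6, 1, 6), (1, 6, 6, 11), (6, 11, 1, 6), (6, 11, 6, 11)]
for _ in range(200):
    u = block_jacobi_gs(u, f, h, blocks)
```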
Table 11
Number of relaxation steps necessary to fulfill the convergence requirement for the block-Jacobi schemes

Processor configuration   Block-Jacobi point Gauss-Seidel   Block-Jacobi line Gauss-Seidel
16x1                                  623                               349
8x2                                   570                               302
4x4                                   574                               326
2x8                                   627                               394
1x16                                  921                               620

8x1                                   527                               265
4x2                                   503                               278
2x4                                   546                               312
1x8                                   646                               386

4x1                                   489                               234
2x2                                   467                               262
1x4                                   540                               304

2x1                                   438                               234
1x2                                   457                               250

1x1                                   430                               230
Table 12
Timings and efficiency results obtained with block-Jacobi relaxation schemes

Method                    Processor        Number     Total time   Speed-up   Efficiency
                          configuration    of steps   (sec)                   (%)
Block-Jacobi,             8x2                570        2673         11.3        70.4
point Gauss-Seidel        4x4                574        2770         10.9        67.9
                          4x2                503        4519          6.66       83.2
                          4x1                489        8656          3.48       86.9
                          2x2                467        8249          3.65       91.2
                          2x1                438       15352          1.96       98.0
                          1x1                430       30091          1.00      100

Block-Jacobi,             8x2                302        3122         11.8        73.7
line Gauss-Seidel         4x4                326        3440         10.7        66.9
                          4x2                278        5605          6.57       82.1
                          4x1                234        9420          3.91       97.7
                          2x2                262       10468          3.52       87.9
                          2x1                234       18691          1.97       98.4
                          1x1                230       36797          1.00      100
Starting from a sequential code, these schemes can be implemented very easily on a parallel machine. This makes them very attractive as pure relaxation schemes. A disadvantage is, however, that they cannot be used as smoother in a multigrid solver, because they do not smooth in a uniform way over the whole domain.

5.2. Multigrid methods
Because block-Jacobi point Gauss-Seidel cannot be used as smoother in the multigrid program, we only give results for runs with red-black point Gauss-Seidel as smoother. From Table 6 it follows that the more multigrid levels are used, the fewer processors are active on the coarse grids. If six multigrid levels are used, all computations on the coarsest grid happen on one processor, whatever processor configuration was chosen. This fact influences the speed-up and efficiency in a negative way. In Table 13 we show the speed-ups and efficiencies obtained for nc = 20 and nf = 4, the "best" sequential case on six levels.

For our test problem using only three multigrid levels is not the best choice, but it is sufficient, since on one processor execution times are not much higher for three levels than when more levels are used (see Table 4). However, when working in parallel, using only three levels has the advantage that no agglomeration is required, except for the configurations 16 × 1 and 1 × 16. As a result speed-up and efficiency are considerably higher and execution times are (much) lower for three levels (Table 14) than for six levels (Table 13). The more processors used, the larger the difference in execution times.

In Table 15 we give timings and convergence results as in Table 4, but now for the processor configuration 8 × 2, i.e. the best configuration with 16 processors. Now it is clear that the highest
Table 13
Timings and efficiency results for six multigrid levels. nc = 20, nf = 4, smoother: red-black point Gauss-Seidel

Processor configuration   Total time (sec)   Speed-up   Efficiency (%)
16x1                            809             9.34        58.4
8x2                             760             9.94        62.1
4x4                             793             9.53        59.6
2x8                             831             9.10        56.9
1x16                            984             7.68        48.0

8x1                            1190             6.35        79.4
4x2                            1155             6.55        81.9
2x4                            1212             6.24        78.0
1x8                            1346             5.62        70.2

4x1                            2057             3.68        91.9
2x2                            2032             3.72        93.0
1x4                            2175             3.48        86.9

2x1                            3859             1.96        98.0
1x2                            3887             1.95        97.3

1x1                            7561             1.00       100
Table 14
Timings and efficiency results for three multigrid levels. nc = 20, nf = 6, smoother: red-black point Gauss-Seidel

Processor configuration   Total time (sec)   Speed-up   Efficiency (%)
16x1                            661            11.5         71.9
8x2                             620            12.3         76.6
4x4                             641            11.9         74.1
2x8                             696            10.9         68.2
1x16                            841             9.04        56.5

8x1                            1053             7.22        90.2
4x2                            1046             7.27        90.9
2x4                            1104             6.87        85.8
1x8                            1218             6.24        78.1

4x1                            1971             3.86        96.4
2x2                            1956             3.89        97.2
1x4                            2089             3.64        91.0

2x1                            3828             1.99        99.3
1x2                            3855             1.97        98.6

1x1                            7603             1.00       100
execution times occur for six and five (in this order) multigrid levels. For the computation of the reached speed-up and reached efficiency, we compared the execution times on 8 × 2 processors with those on 1 processor with the same multigrid parameters. The columns of the reached speed-up and reached efficiency contain strictly increasing numbers.
Table 15
Timings and convergence results for multigrid runs on 8 x 2 processors. Relaxation scheme: red-black point Gauss-Seidel

Number of   nc   nf   Number of      Theoretical   Measured     Possible   Total    Reached    Reached
multigrid             V-cycles (a)   number of     number of    speed-up   time     speed-up   efficiency
levels                               workunits     workunits               (sec)               (%)
6           20    4       13           103.25         97.78       14.6      760       9.94       62.1
6           20    6        9           109.56         98.59       14.8      739      10.2        64.0
6           20    8        7           116.50         99.47       14.9      728      10.5        65.5

5           20    4       13           100.81         95.16       15.1      707      10.6        66.4
5           20    6        9           106.81         96.03       15.2      706      10.7        66.7
5           20    8        7           113.44         98.69       15.3      694      11.1        69.5

4           20    4       13           101.50         94.88       15.2      675      11.2        69.9
4           20    6        9           105.75         93.50       15.4      646      11.5        71.6
4           20    8        6            97.63         92.31       15.5      623      11.7        73.1

3           20    4       13           102.75         95.69       16.0      661      11.9        74.2
3           20    6        9           103.75         93.94       16.0      620      12.3        76.6
3           20    8        7           107.25         98.50       16.0      637      12.4        77.7

(a) Number of V-cycles, after a full multigrid step, to fulfill the convergence requirement.
The number of workunits is a scaled number. A unit is now the cost of one relaxation step on one sixteenth of the finest grid, i.e. the (ideal) amount of work of each processor on the finest grid. In counting the workunits, agglomeration is taken into account in the following way: if the number of active processors on a multigrid level is divided by 2^a, the weight factor of that level is multiplied by 2^a. This explains why both the theoretical and the measured number of workunits on four, five and six multigrid levels are now higher than if only one processor is used. Notice that the use of a scaled number for the workunits has the advantage that one can simply compare the numerical values of these numbers, without thinking about a multiplicative factor. A consequence of the higher values for the number of workunits on 16 processors than on one is that it is not possible to obtain a speed-up of 16. The maximal speed-up that could be obtained, if there were no imbalance due to the boundary grid points and no communication, is also given in Table 15. It increases if fewer agglomeration steps are necessary and if nf increases.

From the previous discussion it follows that, for a (relatively) high number of processors, it may be advantageous to use fewer multigrid levels than in the sequential case. Reasons for this are: on the coarse grids the imbalance due to the boundary grid points is relatively much more important than on the fine grids; communication time is no longer negligible in the total time for a smoothing step; and, last but not least, agglomeration causes a severe load imbalance.
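As a small illustration of this scaled accounting (our own sketch, using the weighting rule described above and the active processor configurations of Table 6 for 8 × 2 processors):

```python
def scaled_workunits_per_step(level, active_procs, total_procs=16):
    """Scaled cost of one relaxation step on `level` (0 = finest): the unscaled
    weight 4**-level, expressed in units of 1/total_procs of the finest grid,
    multiplied by total_procs/active_procs to account for agglomeration
    (idle processors make a step on that level relatively more expensive)."""
    return (4.0 ** -level) * (total_procs / active_procs)

# 8 x 2 processors, six levels: active processor counts 16, 16, 16, 8, 4, 1 (Table 6)
active = [16, 16, 16, 8, 4, 1]
print([scaled_workunits_per_step(l, a) for l, a in enumerate(active)])
# [1.0, 0.25, 0.0625, 0.03125, 0.015625, 0.015625]: one step on the finest grid
# costs one scaled workunit, one step on the coarsest grid 1/64, even though
# only a single processor is then working
```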
6. Conclusion

Starting from a sequential version, two parallel Euler solvers have been implemented, one that uses pure relaxation methods and a multigrid one. In the relaxation Euler solver communication represents only a very small fraction of the total execution time, but the treatment of the boundary grid points induces a load imbalance. This load imbalance is the most important reason why the speed-up and efficiency, although still very high, are not perfect. For the test problem considered, block-Jacobi methods, which are easy to implement, also have better convergence properties than the corresponding red-black methods. However, these block-Jacobi methods cannot be used as a smoother in the multigrid program.

In the multigrid Euler solver there are many more factors that influence the speed-up and efficiency. There is still the load imbalance, which increases with an increasing number of multigrid levels. Furthermore, on the coarse grids communication is no longer negligible. And last but not least, there is the agglomeration that is necessary if the parallel program is to use as many multigrid levels as the sequential one; agglomeration induces a severe load imbalance. All these factors decrease the speed-up and efficiency considerably. But the big advantage of multigrid, i.e. a much shorter execution time than for the pure relaxation methods, remains valid in the parallel program.
Acknowledgement

The authors wish to thank SUPRENUM GmbH (Bonn, FRG) and G.M.D. (Sankt Augustin, FRG) for the use of the SUPRENUM Communications Subroutine Library and the use of the parallel Block Tridiagonal Systems Solver, developed at G.M.D. within the SUPRENUM project. Finally our thanks go to the anonymous referee for his constructive remarks.
References

[1] L. Bomans and D. Roose, Communication benchmarks for the iPSC/2, in: F. André and J.P. Verjus, eds., Hypercube and Distributed Computers (Elsevier Science Publishers, Amsterdam, 1989) 93-103.
[2] L. Bomans and D. Roose, Benchmarking the iPSC/2 hypercube multiprocessor, Concurrency: Practice and Experience 1 (1989) 3-18.
[3] H. Deconinck and R. Struys, Consistent boundary conditions for cell centered finite volume Euler solvers, in: K.W. Morton and M.J. Baines, eds., Numerical Methods for Fluid Dynamics III (Clarendon Press, Oxford, 1988).
[4] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon and D. Walker, Solving Problems on Concurrent Processors (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[5] R. Hempel, The SUPRENUM Communications Subroutine Library for grid-oriented problems, Rep. ANL-87-23, Argonne Nat. Lab., Argonne, IL (1987).
[6] R. Hempel and A. Schüller, Experiments with parallel multigrid algorithms using the SUPRENUM Communications Subroutine Library, GMD-Studien 141, St. Augustin, FRG (1988).
[7] A. Jameson, Numerical solution of the Euler equation for compressible inviscid fluids, in: F. Angrand et al., eds., Numerical Methods for the Euler Equation of Fluid Dynamics (SIAM, Philadelphia, PA, 1985).
[8] A. Krechel, H.-J. Plum and K. Stüben, Solving tridiagonal linear systems in parallel on local memory MIMD machines, in: F. André and J.P. Verjus, eds., Hypercube and Distributed Computers (Elsevier Science Publishers, Amsterdam, 1989) 49-63.
[9] A. Rizzi and H. Viviand, eds., Notes on Numerical Fluid Mechanics 3 (Vieweg, Braunschweig, 1981).
[10] R. Struys, Calculation of inviscid compressible flow using implicit finite volume schemes and relaxation, Project Rep. 1987-20, Von Karman Institute, St. Genesius Rode, Belgium (1987).
[11] J.L. Thomas, B. van Leer and R.W. Walters, Implicit flux-split schemes for the Euler equations, AIAA Paper 85-1680 (1985).
[12] S. Vandewalle and R. Piessens, A comparison of parallel multigrid strategies, in: F. André and J.P. Verjus, eds., Hypercube and Distributed Computers (Elsevier Science Publishers, Amsterdam, 1989) 65-79.
[13] B. van Leer and W.A. Mulder, Relaxation methods for hyperbolic equations, Rep. 84-20, Department of Mathematics and Informatics, Technical University Delft, Netherlands (1984).