Computer Methods in Applied Mechanics and Engineering 102 (1993) 61-88
North-Holland
Two-dimensional viscous flow computations on the Connection Machine: Unstructured meshes, upwind schemes and massively parallel computations

Charbel Farhat
Department of Aerospace Engineering and Center for Space Structures and Controls, University of Colorado, Boulder, CO 80309-0429, USA
Loula Fezoui and Stéphane Lanteri
INRIA Sophia-Antipolis, 2004, Route des Lucioles, 06560 Valbonne, France

Received 26 September 1991
Revised manuscript received 9 March 1992

Here we report on our effort in simulating two-dimensional viscous flows on the Connection Machine, using a second-order accurate monotonic upwind scheme for conservation laws (MUSCL) on fully unstructured grids. The spatial approximation combines an upwind finite volume method for the discretization of the convective fluxes with a classical Galerkin finite element method for the discretization of the diffusive fluxes. The resulting semi-discrete equations are time integrated with a second-order low-storage explicit Runge-Kutta method. A communication efficient strategy for mapping thousands of processors onto an arbitrary mesh is presented and proposed as an alternative to the fast north-east-west-south (NEWS) communication mechanism, which is restricted to structured grids. Measured performance results for the simulation of low Reynolds number chaotic flows indicate that an 8K CM-2 (8192 processors) with single precision floating point arithmetic is at least as fast as one CRAY-2 processor.
1. Introduction

Previously, we have reported on our experience with performing two-dimensional structured compressible flow computations on the Connection Machine CM-2 [1, 2]. We have found that this massively parallel processor is particularly well suited for explicit computations on regular grids. For grids that result in a high virtual processor ratio (VPR or VP ratio), using the north-east-west-south (NEWS) fast communication mechanism, we have measured the communication component of the simulation time to represent typically less than 10% of the total CPU time. We have concluded that on a 64K machine (65536 processors), computational rates in the neighborhood of 2 gigaflops are attainable. We have also found that for both inviscid
(Euler equations) and viscous (Navier-Stokes equations) flow structured computations, a 16 K CM-2 (16384 processors) can be 4 and 6 times faster than one CRAY-2 processor, respectively. In this paper, we focus on massively parallel viscous flow computations using fully unstructured grids. In Section 2, we formulate the problem to be solved, and in Section 3, we derive first-order and second-order spatial schemes that are characterized by an upwind integration of the convective fluxes. Second-order accuracy is achieved through a monotonic upwind scheme for conservation laws (MUSCL) technique. An explicit, and therefore nicely parallelizable, Runge-Kutta method is selected for time integration; it is summarized in Section 4. Because the mesh irregularities inhibit the use of the NEWS mechanism, interprocessor communication is bound to be carried out via the slower machine router. If a trivial processor mapping is used, up to 60% of the total CPU time is consumed in communication requirements. This bottleneck has been previously analyzed and documented by Farhat et al. [3] for massively parallel finite element computations in solid mechanics problems. It has also been recently addressed by several other investigators for fluid flow computations. In particular, Shapiro [4] has proposed the use of a graph coloring algorithm to allow a particular implementation of the communication steps which reduces the communication costs by a factor of two. Hammond and Barth [5] have developed a vertex-based partitioning scheme for inviscid flow computations which attempts to minimize both the computational and communication costs associated with unstructured grids. Here, we present a strategy for mapping thousands of processors onto an unstructured grid which leads to an efficient scheme for carrying out communications of an arbitrary pattern. The key elements of this strategy are discussed in Section 5. These include the selection of an appropriate parallel data structure, the partitioning of a given unstructured grid into subgrids, and the mapping of each individual processor onto an entity of these subgrids. Combining this mapping strategy with a communication compiler reduces the communication overhead by an order of magnitude and brings it down to 15% of the total simulation time. In Section 6, we apply our massively parallel code and its highly vectorized variant to the simulation of low Reynolds number chaotic flows. Measured performance results indicate that for such computations on unstructured grids, an 8K CM-2 with single precision floating point hardware is as fast as one CRAY-2 processor.
2. Mathematical modeling

First we recall the mathematical problem to be solved, and introduce the notation that is used in the sequel.
2.1. Governing equations

Let $\Omega \subset \mathbb{R}^2$ be the flow domain of interest and $\Gamma$ be its boundary. The conservation law form of the equations describing two-dimensional Navier-Stokes flows is given by
$$\frac{\partial W}{\partial t} + \vec{\nabla} \cdot \mathcal{F}(W) = \frac{1}{Re}\, \vec{\nabla} \cdot \mathcal{R}(W), \qquad W = (\rho,\ \rho u,\ \rho v,\ E)^{t},$$  (1)

where

$$\vec{\nabla} = \left(\frac{\partial}{\partial x},\ \frac{\partial}{\partial y}\right)^{t}, \qquad
\mathcal{F}(W) = \begin{bmatrix} F(W) \\ G(W) \end{bmatrix}, \qquad
\mathcal{R}(W) = \begin{bmatrix} R(W) \\ S(W) \end{bmatrix}.$$  (2)
The functions F(W) and G(W), and R(W) and S(W), denote the convective and diffusive fluxes, respectively. They can be written as
$$F(W) = \begin{bmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ u(E + p) \end{bmatrix}, \qquad
G(W) = \begin{bmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ v(E + p) \end{bmatrix},$$

$$R(W) = \begin{bmatrix} 0 \\ \tau_{xx} \\ \tau_{xy} \\ u\tau_{xx} + v\tau_{xy} + \dfrac{\gamma k}{Pr}\dfrac{\partial e}{\partial x} \end{bmatrix}, \qquad
S(W) = \begin{bmatrix} 0 \\ \tau_{xy} \\ \tau_{yy} \\ u\tau_{xy} + v\tau_{yy} + \dfrac{\gamma k}{Pr}\dfrac{\partial e}{\partial y} \end{bmatrix},$$  (3)
where $\rho$ is the density, $U = (u, v)$ is the velocity vector, $E$ is the total energy per unit of volume, $p$ is the pressure, and $e$ is the specific internal energy. The variables $\rho$, $E$, $p$, $U$, $e$ and the temperature $T$ are related by the state equation for a perfect gas

$$p = (\gamma - 1)\left(E - \tfrac{1}{2}\rho \|U\|^2\right)$$  (4)

and by

$$e = C_v T = \frac{E}{\rho} - \tfrac{1}{2}\|U\|^2,$$  (5)

where $\gamma$ denotes the ratio of specific heats. The components of the Cauchy stress tensor $\tau_{xx}$, $\tau_{xy}$ and $\tau_{yy}$ are given by
$$\tau_{xx} = \tfrac{2}{3}\mu\left(2\frac{\partial u}{\partial x} - \frac{\partial v}{\partial y}\right), \qquad
\tau_{yy} = \tfrac{2}{3}\mu\left(2\frac{\partial v}{\partial y} - \frac{\partial u}{\partial x}\right), \qquad
\tau_{xy} = \mu\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right),$$  (6)
where $\mu$ and $k$ are the normalized viscosity and thermal conductivity coefficients. Two other variables appear in the above equations: the Reynolds number $Re = \rho_0 U_0 L_0 / \mu_0$, where $\rho_0$, $U_0$, $L_0$ and $\mu_0$ denote, respectively, the characteristic density, velocity, length and viscosity of the flow under consideration, and the Prandtl number $Pr = \mu_0 C_p / k_0$. We consider the initial and boundary value problem (IBVP):

$$\begin{aligned}
&\frac{\partial W}{\partial t} + \vec{\nabla} \cdot \mathcal{F}(W) = \frac{1}{Re}\, \vec{\nabla} \cdot \mathcal{R}(W), && (x, t) \in \Omega \times \mathbb{R}^{+}, \\
&W(x, 0) = W_0(x), && x \in \Omega, \\
&W(x, t) = W_{\Gamma}(x), && x \in \Gamma = \partial\Omega,
\end{aligned}$$  (7)
where $W_0$ and $W_{\Gamma}$ are specified functions, and focus on finding a weak solution of (7) that is amenable to massively parallel computations.

2.2. Boundary conditions
We are mostly interested in external flows around airfoils. Therefore, we consider the case where the computational domain $\Omega$ is delimited by the boundary $\Gamma = \Gamma_b \cup \Gamma_\infty$. We denote by $\nu$ the outward unit normal at a given point of $\Gamma$ (Fig. 1). In the far field, we assume that the viscous effects are negligible so that the flow is uniform. We adopt a formulation where the physical variables are non-dimensionalized. The free-stream vector $W_\infty$ is given by

$$\rho_\infty = 1, \qquad U_\infty = \begin{bmatrix} \cos\alpha \\ \sin\alpha \end{bmatrix}, \qquad p_\infty = \frac{1}{\gamma M_\infty^2},$$  (8)

where $\alpha$ is the angle of attack and $M_\infty$ is the free-stream Mach number. On the wall boundary
Fig. 1. The computational domain.
$\Gamma_b$, we impose the no-slip condition and specify the temperature:

$$U = 0, \qquad T = T_b.$$  (9)

We do not impose any boundary condition on the density. Therefore, the total energy per unit of volume and the pressure on the wall are given by

$$E = \rho C_v T_b, \qquad p = (\gamma - 1)E.$$  (10)
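To make the non-dimensionalization and the wall treatment concrete, the following minimal Python sketch assembles the free-stream conservative state from (8) and recovers the wall energy and pressure from (9), (10) together with the perfect gas relations (4), (5). The function names and the choice gamma = 1.4 are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

GAMMA = 1.4  # ratio of specific heats (assumed value for air)

def freestream_state(mach, alpha):
    """Conservative free-stream vector W_inf = (rho, rho*u, rho*v, E) from (8)."""
    rho = 1.0
    u, v = np.cos(alpha), np.sin(alpha)          # |U_inf| = 1 by construction
    p = 1.0 / (GAMMA * mach**2)                  # p_inf = 1 / (gamma * M_inf^2)
    E = p / (GAMMA - 1.0) + 0.5 * rho * (u**2 + v**2)   # invert the state law (4)
    return np.array([rho, rho * u, rho * v, E])

def wall_state(rho, cv, t_wall):
    """Wall energy and pressure from (9)-(10): U = 0 and T = T_b prescribed."""
    E = rho * cv * t_wall        # (10), the kinetic contribution vanishes since U = 0
    p = (GAMMA - 1.0) * E        # perfect gas law (4) restricted to the wall
    return E, p

# Example: free stream at M_inf = 0.1 and 30 degrees angle of attack,
# matching the flow conditions used in Section 6.
W_inf = freestream_state(mach=0.1, alpha=np.radians(30.0))
```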
3. Spatial discretization

3.1. Preliminary
The flow domain $\Omega$ is assumed to be a polygonal bounded region of $\mathbb{R}^2$. Let $\mathcal{T}_h$ be a standard triangulation of $\Omega$, and $h$ the maximal length of the edges of $\mathcal{T}_h$. A vertex of a triangle $\Delta$ is denoted by $S_i$, and the set of its neighboring vertices by $K(i)$. At each vertex $S_i$, a cell $C_i$ is constructed as the union of the subtriangles resulting from the subdivision by means of the medians of each triangle of $\mathcal{T}_h$ that is connected to $S_i$ (Fig. 2). The boundary of $C_i$ is denoted by $\partial C_i$, and the unit vector of the outward normal to $\partial C_i$ by $\nu_i = (\nu_{ix}, \nu_{iy})$. The union of all of the constructed cells forms a non-overlapping partition of the domain $\Omega$:

$$\Omega = \bigcup_{i=1}^{ns} C_i.$$  (11)

For each cell $C_i$, a characteristic function $\varphi_i$ is defined as

$$\varphi_i(X) = \begin{cases} 1 & \text{if } X \in C_i, \\ 0 & \text{otherwise.} \end{cases}$$  (12)

Fig. 2. Cell definition in an unstructured grid.
Also, the following discrete spaces are introduced:

$$\begin{aligned}
\mathcal{V}_h &= \{ v_h \mid v_h \in C^0(\bar{\Omega}),\ v_h|_{\Delta} \in P_1,\ \forall \Delta \in \mathcal{T}_h \}, \\
\mathcal{W}_h &= \{ v_h \mid v_h \in L^2(\Omega),\ v_h|_{C_i} = v_i = \text{constant},\ i = 1, \ldots, ns \},
\end{aligned}$$  (13)

where $P_1$ is the space of polynomials in two variables and of degree 1. Clearly, any function $f$ belonging to $\mathcal{V}_h$ is uniquely determined by its values $f(S_i)$ at each vertex $S_i$, and can be expressed as

$$f(X) = \sum_{i=1,ns} f(S_i)\, N_i(X),$$  (14)

where $\{N_i\}_{i=1,ns}$ is a basis of $\mathcal{V}_h$. Finally, it is noted that a natural bijection between the spaces $\mathcal{V}_h$ and $\mathcal{W}_h$ can be constructed as

$$S(f(X)) = \sum_{i=1,ns} f(S_i)\, \varphi_i(X), \qquad \forall f \in \mathcal{V}_h.$$  (15)
3.2. Variational formulation and first-order spatial approximations

A variational formulation of the IBVP (7) is as follows: Find $W_h \in (\mathcal{V}_h)^4$ such that, $\forall \varphi_h \in \mathcal{V}_h$,

$$\int_{\Omega} \frac{\partial W_h}{\partial t}\, \varphi_h \,dx\,dy + \int_{\Omega} \vec{\nabla} \cdot \mathcal{F}(W_h)\, \varphi_h \,dx\,dy = \frac{1}{Re} \int_{\Omega} \vec{\nabla} \cdot \mathcal{R}(W_h)\, \varphi_h \,dx\,dy.$$  (16)

We construct a mixed finite volume/finite element (Galerkin) approximation for solving the above problem by introducing appropriate schemes for computing the left- and right-hand side integrals of (16). Choosing $\varphi_h$ as the shape function $N_i$ associated with the node $S_i$ and applying the operator $S$ to the left-hand side of (16) leads to a mass-lumped variational approach which transforms the above equation into

$$\int_{C_i} \frac{\partial W_h}{\partial t} \,dx\,dy + \int_{C_i} \vec{\nabla} \cdot \mathcal{F}(W_h) \,dx\,dy = \frac{1}{Re} \int_{\mathrm{Supp}\,N_i} \vec{\nabla} \cdot \mathcal{R}(W_h)\, N_i \,dx\,dy,$$  (17)

where $\mathrm{Supp}\,N_i = \bigcup_{\Delta,\, S_i \in \Delta} \Delta$. Using Green's formula for the convective term and integrating by parts the diffusive one leads to

$$\int_{C_i} \frac{\partial W_h}{\partial t} \,dx\,dy + \int_{\partial C_i} \mathcal{F}(W_h) \cdot \nu_i \,d\sigma
= -\frac{1}{Re} \sum_{\Delta,\, S_i \in \Delta} \int_{\Delta} \mathcal{R}(W_h) \cdot \vec{\nabla} N_i^{\Delta} \,dx\,dy
+ \frac{1}{Re} \int_{\Gamma} \mathcal{R}(W_h) \cdot \nu\, N_i \,d\sigma,$$  (18)
where $N_i^{\Delta}$ is the restriction of $N_i$ to triangle $\Delta$. Finally, we drop the right-hand side boundary integral, as we enforce the viscous boundary conditions in a strong form on $\Gamma_b$ and neglect the viscous effects on $\Gamma_\infty$, so that (18) simplifies to
$$\begin{aligned}
\int_{C_i} \frac{\partial W_h}{\partial t} \,dx\,dy
&+ \int_{\partial C_i} \mathcal{F}(W_h) \cdot \nu_i \,d\sigma && (1) \\
&+ \int_{\partial C_i \cap \Gamma_b} \mathcal{F}(\tilde{W}_h) \cdot \nu_i \,d\sigma && (2) \\
&+ \int_{\partial C_i \cap \Gamma_\infty} \mathcal{F}(\tilde{W}_h) \cdot \nu_i \,d\sigma && (3) \\
&= -\frac{1}{Re} \sum_{\Delta,\, S_i \in \Delta} \int_{\Delta} \mathcal{R}(W_h) \cdot \vec{\nabla} N_i^{\Delta} \,dx\,dy, && (4)
\end{aligned}$$  (19)

where $\tilde{W}_h$ is the specified value of $W_h$ at the boundaries. The reader should note that the above formulation leads to a locally one-dimensional computation of each convective term, along the normal direction $\nu$. For this purpose, the boundary $\partial C_i$ of the cell $C_i$ is split into bi-segments $\partial C_{ij}$ which join the middle point of the edge $[S_i S_j]$ to the centroids of the triangles having both $S_i$ and $S_j$ as vertices (Fig. 3), and the integral (1) is evaluated as

$$\int_{\partial C_i} \mathcal{F}(W_h) \cdot \nu_i \,d\sigma = \sum_{j \in K(i)} \int_{\partial C_{ij}} \mathcal{F}(W_h) \cdot \nu_i \,d\sigma \simeq \sum_{j \in K(i)} \mathcal{F}(W_{ij}) \cdot \int_{\partial C_{ij}} \nu_i \,d\sigma,$$  (20)

where $\mathcal{F}(W_{ij})$ is some approximation of the convective flux computed at the interface between cells $C_i$ and $C_j$.
Following Fezoui and Stoufflet [6], we choose $\mathcal{F}(W_{ij})$ to be a numerical flux function $\Phi$ associated with a first-order accurate upwind scheme [7]. It is denoted here by $H_{ij}^{(1)}$, where the superscript (1) emphasizes the first-order accuracy, and can be written as

$$H_{ij}^{(1)} = \Phi(W_i, W_j, \nu_{ij}), \qquad \nu_{ij} = \int_{\partial C_{ij}} \nu_i \,d\sigma,$$  (21)

where $W_i = W_h(S_i)$ and $W_j = W_h(S_j)$. For example, the following numerical flux functions can be used to construct $H_{ij}^{(1)}$.

Fig. 3. Splitting of $\partial C_{ij}$.
(a) Roe's scheme [8]:

$$\Phi^{\mathrm{Roe}}(U, V, \nu) = \tfrac{1}{2}\big(\mathcal{F}(U) \cdot \nu + \mathcal{F}(V) \cdot \nu\big) - d(U, V, \nu),$$  (22)

where $d(U, V, \nu)$ is a numerical diffusivity defined as

$$d(U, V, \nu) = \tfrac{1}{2}\,\big|\mathcal{A}(\tilde{W}, \nu)\big|\,(V - U),$$  (23)

$\mathcal{A}$ being the Jacobian matrix of the flux $\mathcal{F}(W) \cdot \nu$ and $\tilde{W}$ some mean value of $U$ and $V$.

(b) Steger and Warming's scheme [9]:

$$\Phi^{\mathrm{SW}}(U, V, \nu) = \mathcal{A}^{+}(U, \nu)\,U + \mathcal{A}^{-}(V, \nu)\,V,$$  (24)

where $\mathcal{A} = \mathcal{A}^{+} + \mathcal{A}^{-}$ and $|\mathcal{A}| = \mathcal{A}^{+} - \mathcal{A}^{-}$.

The viscous integral (4) is evaluated via a classical Galerkin finite element $P_1$ method which results in a centered scheme. Since the approximations of the physical variables are taken in $\mathcal{V}_h$, the components of the stress tensor and those of $\vec{\nabla} N_i^{\Delta}$ are constant in each triangle. The velocity vector in a triangle is computed as

$$U_{\Delta} = \frac{1}{3} \sum_{k=1}^{3} U_k.$$  (25)

Consequently, the viscous fluxes are evaluated as

$$\sum_{\Delta,\, S_i \in \Delta} \int_{\Delta} \mathcal{R}(W_h) \cdot \vec{\nabla} N_i^{\Delta} \,dx\,dy = \sum_{\Delta,\, S_i \in \Delta} \mathrm{area}(\Delta)\left(R_{\Delta}\,\frac{\partial N_i^{\Delta}}{\partial x} + S_{\Delta}\,\frac{\partial N_i^{\Delta}}{\partial y}\right),$$  (26)

where $R_{\Delta}$ and $S_{\Delta}$ are the constant values of $R(W)$ and $S(W)$ in the triangle $\Delta$.
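As an illustration of how the numerical flux (21) is evaluated along an interface normal, the following Python sketch implements a Steger-Warming type flux vector splitting for the perfect gas model and assembles a first-order flux from the split contributions of the two neighboring states, in the spirit of (24): by the homogeneity property of the Euler fluxes, $\mathcal{A}^{\pm}(W,\nu)W$ equals the split flux $\mathcal{F}^{\pm}(W)\cdot\nu$, which is what the code returns. This is a sketch under stated assumptions (gamma = 1.4, non-degenerate states), not the authors' implementation.

```python
import numpy as np

GAMMA = 1.4  # assumed ratio of specific heats

def split_flux(W, nu):
    """Steger-Warming split fluxes F^{+}(W).nu and F^{-}(W).nu for a conservative
    state W = (rho, rho*u, rho*v, E) and a (possibly non-unit) interface normal nu."""
    length = np.hypot(nu[0], nu[1])            # |nu_ij| = length of the bi-segment
    nx, ny = nu[0] / length, nu[1] / length    # unit normal
    rho, u, v = W[0], W[1] / W[0], W[2] / W[0]
    p = (GAMMA - 1.0) * (W[3] - 0.5 * rho * (u * u + v * v))   # state law (4)
    c = np.sqrt(GAMMA * p / rho)
    un = u * nx + v * ny
    lam = np.array([un, un + c, un - c])                       # eigenvalues of A(W, nu)
    lp, lm = 0.5 * (lam + np.abs(lam)), 0.5 * (lam - np.abs(lam))

    def assemble(l):
        # Classical 2D Steger-Warming expression of A^{+/-} W; the three split
        # eigenvalues are l[0] (multiplicity two), l[1] = (un+c)^{+/-}, l[2] = (un-c)^{+/-}.
        a = 2.0 * (GAMMA - 1.0) * l[0] + l[1] + l[2]
        f = np.array([
            a,
            a * u + c * (l[1] - l[2]) * nx,
            a * v + c * (l[1] - l[2]) * ny,
            0.5 * a * (u * u + v * v) + c * un * (l[1] - l[2])
              + c * c * (l[1] + l[2]) / (GAMMA - 1.0),
        ])
        return length * rho / (2.0 * GAMMA) * f   # scale by the interface length

    return assemble(lp), assemble(lm)

def flux_h1(W_i, W_j, nu_ij):
    """First-order numerical flux: upwind contribution of W_i plus downwind of W_j."""
    f_plus, _ = split_flux(W_i, nu_ij)
    _, f_minus = split_flux(W_j, nu_ij)
    return f_plus + f_minus
```

A useful consistency check on such a sketch is that the two split fluxes sum to the full normal flux, and that the minus part vanishes when the normal velocity is supersonic outgoing.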
3.3. Higher order extension

The numerical integration with an upwind scheme described above leads to a spatial approximation that is only first-order accurate. Here, we focus on constructing a second-order accurate solution without changing the space of approximations. We develop a second-order scheme that is an extension of Van Leer's MUSCL method [7] to the case of unstructured meshes. Usually, a second-order approximation requires the evaluation of the gradient of the solution at each vertex. Clearly, the gradient of a function $v_h$ of $\mathcal{V}_h$ is constant in each element and discontinuous in the flow domain. Following the MUSCL method, one way to achieve second-order spatial accuracy is to evaluate the fluxes with extrapolated values $W_{ij}$, $W_{ji}$ at the interface $\partial C_i \cap \partial C_j$. Basically, this leads to substituting $H_{ij}^{(1)}$ in the previous scheme by $H_{ij}^{(2)}$, which is given by
$$H_{ij}^{(2)} = \Phi(W_{ij}, W_{ji}, \nu_{ij}), \qquad
W_{ij} = W_i + \tfrac{1}{2} (\vec{\nabla} W)_i^{\beta} \cdot \vec{S_i S_j}, \qquad
W_{ji} = W_j - \tfrac{1}{2} (\vec{\nabla} W)_j^{\beta} \cdot \vec{S_i S_j},$$  (27)

where the approximate nodal gradients $(\vec{\nabla} W)_i^{\beta}$ are obtained via a $\beta$-combination of centered and fully upwind gradients:

$$(\vec{\nabla} W)_i^{\beta} = (1 - \beta)(\vec{\nabla} W)_i^{\mathrm{Cent}} + \beta\, (\vec{\nabla} W)_i^{\mathrm{Upw}}.$$  (28)

Here, a centered gradient $(\vec{\nabla} W)^{\mathrm{Cent}} = (\vec{\nabla} W)^{\beta=0}$ can be chosen as any vector satisfying

$$(\vec{\nabla} W)_i^{\mathrm{Cent}} \cdot \vec{S_i S_j} = W_j - W_i.$$  (29)

A nicely parallelizable scheme for computing the upwind gradients $(\vec{\nabla} W)^{\mathrm{Upw}}$ is as follows. First, we note that $(\vec{\nabla} W)^{\mathrm{Upw}} = (\vec{\nabla} W)^{\beta=1}$, and from (28) we derive

$$(\vec{\nabla} W)_i^{\beta=1} = 2\, (\vec{\nabla} W)_i^{\beta=1/2} - (\vec{\nabla} W)_i^{\beta=0}.$$  (30)

We compute the half-upwind gradients ($\beta = \tfrac{1}{2}$) via a linear interpolation of the Galerkin gradients computed in each triangle of $C_i$, so that

$$(\vec{\nabla} W)_i^{\beta=1/2} = \frac{\displaystyle\int_{C_i} \vec{\nabla} W|_T \,dx\,dy}{\displaystyle\int_{C_i} dx\,dy}
= \frac{1}{\mathrm{area}(C_i)} \sum_{T \subset C_i} \frac{\mathrm{area}(T)}{3} \sum_{k=1,3,\ k \in T} W_k\, \vec{\nabla} N_k^{T}.$$  (31)

Finally, we evaluate the nodal gradients using the following third-order biased ($\beta = \tfrac{1}{3}$) scheme:

$$(\vec{\nabla} W)_i^{\beta=1/3}
= \tfrac{2}{3}(\vec{\nabla} W)_i^{\beta=0} + \tfrac{1}{3}(\vec{\nabla} W)_i^{\beta=1}
= \tfrac{2}{3}(\vec{\nabla} W)_i^{\beta=0} + \tfrac{1}{3}\big(2(\vec{\nabla} W)_i^{\beta=1/2} - (\vec{\nabla} W)_i^{\beta=0}\big)
= \tfrac{1}{3}(\vec{\nabla} W)_i^{\beta=0} + \tfrac{2}{3}(\vec{\nabla} W)_i^{\beta=1/2}.$$  (32)
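The MUSCL extrapolation (27)-(32) reduces, along each edge, to blending the edge difference $W_j - W_i$ (which plays the role of the centered gradient through (29)) with the cell-averaged Galerkin gradient of (31). The Python sketch below illustrates this per-edge computation; the data layout (per-triangle gradient and state arrays, edge vectors) is an assumption made for the example, not the paper's data structure.

```python
import numpy as np

BETA = 1.0 / 3.0  # third-order biased combination, cf. (32)

def half_upwind_gradient(tri_areas, tri_grads, tri_states):
    """Cell-averaged Galerkin gradient (31) for one cell C_i.

    tri_areas[t]     : area of triangle t incident to S_i
    tri_grads[t][k]  : gradient of shape function N_k on triangle t (2-vector)
    tri_states[t][k] : state W_k at local vertex k of triangle t (4-vector)
    """
    acc = np.zeros((4, 2))
    for area, grads, states in zip(tri_areas, tri_grads, tri_states):
        for k in range(3):
            acc += (area / 3.0) * np.outer(states[k], grads[k])
    # With median-based cells, area(C_i) equals one third of the incident triangle areas.
    return acc / (np.sum(tri_areas) / 3.0)

def extrapolate(W_i, W_j, grad_half_i, grad_half_j, edge_vec):
    """Interface values W_ij, W_ji of (27) with the beta-combined gradients."""
    dW = W_j - W_i                              # centered contribution, cf. (29)
    up_i = 2.0 * grad_half_i @ edge_vec - dW    # fully upwind slope along S_iS_j, cf. (30)
    up_j = 2.0 * grad_half_j @ edge_vec - dW
    W_ij = W_i + 0.5 * ((1.0 - BETA) * dW + BETA * up_i)
    W_ji = W_j - 0.5 * ((1.0 - BETA) * dW + BETA * up_j)
    return W_ij, W_ji
```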
3.4. Boundary conditions

The boundary terms (2) and (3) of (19) contain the physical boundary conditions. These are represented by the vector $\tilde{W}_h$, which involves quantities that depend on the interior values of $W_h$, and quantities that are determined by the physical boundary conditions.
(a) Wall boundary: the no-slip condition is enforced in a strong form (9), (10), so that the corresponding boundary integral (2) does not need to be evaluated.
(b) Inflow and outflow boundaries: at these boundaries, a precise set of compatible exterior data which depend on the flow regime and the velocity direction must be specified. For that purpose, a plus-minus flux splitting is applied between exterior data and interior values. More precisely, the boundary integral (3) is evaluated using a non-reflective version of the flux splitting of Steger and Warming [9]:

$$\int_{\partial C_i \cap \Gamma_\infty} \mathcal{F}(\tilde{W}_h) \cdot \nu_i \,d\sigma = \mathcal{A}^{+}(W_i, \nu_i)\,W_i + \mathcal{A}^{-}(W_\infty, \nu_i)\,W_\infty.$$  (33)
4. Time discretization

The resulting semi-discrete fluid flow equations can be written as

$$\frac{dW}{dt} + \Psi(W) = 0.$$  (34)

Because it lends itself to massive parallelism, the explicit Runge-Kutta method is selected for integrating the above equations. A 3-step variant is used here. It is summarized as

$$W^{(0)} = W^{n}, \qquad
W^{(k)} = W^{(0)} - \frac{\Delta t}{4 - k}\, \Psi(W^{(k-1)}), \quad k = 1, 2, 3, \qquad
W^{n+1} = W^{(3)}.$$  (35)
The above scheme is often referred to as the low-storage Runge-Kutta method, as only the solution at substep $k-1$ is used to compute the one at substep $k$. It is third-order accurate in the linear case, but only second-order accurate in our case.
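For concreteness, the 3-step low-storage scheme (35) can be written as the short loop below; `residual` stands for an evaluation of $\Psi(W)$ assembled from the convective and diffusive fluxes of Section 3 and is only assumed here for illustration.

```python
def runge_kutta3_step(W_n, dt, residual):
    """One time step of the low-storage 3-step Runge-Kutta scheme (35).

    W_n      : array of nodal conservative states at time level n
    dt       : time step
    residual : callable returning Psi(W), cf. (34)
    """
    W0 = W_n.copy()
    W = W0
    for k in (1, 2, 3):
        # Only W0 and the previous substep W are kept: hence "low storage".
        W = W0 - dt / (4 - k) * residual(W)
    return W   # W^{n+1} = W^{(3)}
```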
5. Parallel implementation on the Connection Machine CM-2

Clearly, expressions (19) and (27)-(35) reveal that both the spatial and temporal integrations are in principle nicely parallelizable. In this section, our interest lies in investigating the most efficient way to implement these computations on a single instruction multiple data (SIMD) massively parallel computer such as the Connection Machine CM-2. Special care is given to interprocessor communication because mesh irregularities (a) inhibit the exploitation of the NEWS grid, so that the relatively slow router must be used, and (b) induce a different number of communication steps within each processor, which is not particularly desirable on a SIMD machine. Rather than overviewing the CM-2, we refer the reader to the technical summary of Thinking Machines [10] for architectural details, and to Farhat et al. [3] for an in-depth analysis of interprocessor communication on the CM-2 when computing over an irregular mesh.
5.1. Parallel data structure
Behind the performance of any parallel algorithm lies the choice of the corresponding parallel data structure. The latter is closely related to both the entity and the task to be assigned to each processor. Therefore, all of the computational, communication and memory requirements should be considered before the distributed data structure is determined. For the mixed finite volume/finite element method presented here, we consider four candidates for a fundamental entity (Fig. 4):
- the vertex $S_i$,
- the edge $E_{ij}$ joining the vertices $S_i$ and $S_j$,
- the element (here the triangle) $\Delta_{ijk}$ connecting the vertices $S_i$, $S_j$ and $S_k$, and
- the cell $C_i$ defined in Section 3.1.
Memory considerations

While regular grids are most often characterized (in terms of memory requirements) by their number of vertices $N_v$, irregular triangular grids can also be characterized by either their number of elements $N_{\Delta}$, or by their number of edges $N_E$. Here, we assume for simplicity that $\mathcal{T}_h$ is characterized by its number of vertices. Euler's relations for a triangulation state that

$$N_v + N_{\Delta} - N_E = 1, \qquad 2 N_E - N_{bv} = 3 N_{\Delta},$$  (36)

where $N_{bv}$ denotes the number of vertices at the boundary of the triangulation. This implies that

$$N_{\Delta} \simeq 2 N_v \quad \text{and} \quad N_E \simeq 3 N_v.$$  (37)
Fig. 4. Fundamental entity candidates.
Therefore, if $\mathcal{T}_h$ is designed, for example, so that its number of vertices matches a given Connection Machine size, the VP ratio associated with each data structure candidate varies as indicated below:

            Vertex    Edge    Element    Cell
    VPR       1         3        2         1
The reader should note that for the edge case, the machine automatically selects a VP ratio of 4, since it is the closest power of two to the theoretical VPR. Clearly, the vertex and cell entities are the best candidates on the sole basis of efficient memory usage.
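A small helper, shown below, makes the above sizing argument explicit: starting from the number of vertices, it estimates the element and edge counts through (37) and reports the VP ratio (rounded up to the next power of two, as the machine does) for each candidate entity on a given number of physical processors. The function is illustrative only.

```python
def vp_ratios(n_vertices, n_processors=8192):
    """Estimate VP ratios per candidate entity, using N_elem ~ 2 N_v and N_edge ~ 3 N_v (37)."""
    def vpr(n_entities):
        ratio, need = 1, -(-n_entities // n_processors)   # ceiling division
        while ratio < need:
            ratio *= 2             # the CM-2 rounds the VP ratio up to a power of two
        return ratio
    counts = {
        "vertex": n_vertices,
        "edge": 3 * n_vertices,
        "element": 2 * n_vertices,
        "cell": n_vertices,
    }
    return {entity: vpr(n) for entity, n in counts.items()}

# Example: a grid sized to an 8K machine (N_v = 8192) gives
# {'vertex': 1, 'edge': 4, 'element': 2, 'cell': 1};
# the theoretical edge ratio of 3 is rounded up to 4, as noted above.
print(vp_ratios(8192))
```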
Operation count

The numerical algorithms discussed in Sections 2 and 3 can be organized around three basic computational steps:

Step a. Evaluation of the Galerkin gradients (32).
Step b. Evaluation of the diffusive fluxes (26).
Step c. Evaluation of the convective fluxes (27).

While Step c is most efficiently performed using edge-wise computations, Step a and Step b are inherently element-level calculations. Therefore, whatever fundamental entity is selected, it must contain both edge and element information, which rules out the edge $E_{ij}$ data structure. On the other hand, in an element-based partition, every triangle $\Delta_{ijk}$ provides direct access to all of the three edges $E_{ij}$, $E_{jk}$ and $E_{ki}$. However, in that case, two VP sets must be used: one containing $N_{\Delta}$ processors which store triangle related data (geometrical data), and another one containing $N_v$ processors which store vertex related data (physical data). Otherwise, if only one set of virtual processors is used and assigned to both triangle and vertex data, a nodal result would be duplicated in as many processors as there are triangles connected to that vertex. The vertex entity $S_i$ is an effective candidate only when augmented with the auxiliary data structures that can handle the data associated with the elements and edges connected to a given vertex; that is, when transformed into a cell data structure. Finally, we note that the cell entity stores both vertex and element data, and therefore provides access to all of vertex, element and edge information. Consequently, only element and cell partitions are retained for further discussion. Next, we evaluate the operation count for each of Steps a, b and c, assuming an element- or cell-based data structure. We denote by $C_E$ and $C_{\Delta}$ the number of arithmetic operations associated with one edge computation during Step c, and with one triangle computation during Step a and Step b, respectively. The computational complexities characterizing the two retained candidates are tabulated below:
                         Element        Cell
    Step c               2 × C_E        2 × C_E
    Step a + Step b      C_Δ            3 × C_Δ
In both an element- and a cell-based partition, an edge is shared by two virtual processors, so that the flux $H_{ij}^{(2)}$ across $[S_i S_j]$ is computed twice. Only an edge partition would eliminate these redundant computations, but that choice has already been eliminated. In a cell-based partition, a triangle $\Delta_{ijk}$ is shared by three virtual processors, and therefore additional redundant computations are generated.

Communication costs

The computational steps discussed above require four communication steps, denoted here by c1, c2, c3 and c4. These are discussed below for the element and cell parallel data structures. First, we consider the case of an element-based partition. During the first communication step c1, each virtual processor assigned to a triangle $\Delta_{ijk}$ gets the physical states at vertices $S_i$, $S_j$ and $S_k$ from neighboring processors. Then, the computations in Step a and Step b are carried out. During the second communication step c2, the element-wise results are sent back to the virtual processors holding vertex data. The latter virtual processors use these values to compute the nodal gradients (32) and diffusive fluxes (26). In Step c3, the nodal gradients are communicated to neighboring processors. Next, each virtual processor evaluates the three second-order convective fluxes (27) across the three edges of triangle $\Delta_{ijk}$. During the last communication step, c4, the edge-wise fluxes are sent to the virtual processors holding vertex data. Communication with a cell-based partition is more complex, as each cell may have a different number of neighbors. However, fewer communication steps are needed, because each virtual processor stores within its local memory all of the element-wise values that are necessary for the evaluation of the nodal gradients and the diffusive fluxes, as well as the elemental convective fluxes. The communication count associated with the four steps c1-c4 is tabulated below for each of the two retained data structure candidates. $M^{\max}_{\mathrm{neigh}}$ denotes the maximum number of neighboring cells.
          Element    Cell
    c1    3          M^max_neigh
    c2    3          0
    c3    3          M^max_neigh
    c4    6          0
Selected candidate

The operation and communication counts are summarized below for both the element and cell data structures. Relations (36) and (37) are used to express the results in terms of the number of vertices in the mesh.
                           Element                        Cell
    Operation count        (6 × C_E + 2 × C_Δ) × N_v      (6 × C_E + 6 × C_Δ) × N_v
    Communication count    30 × N_v                       12 × N_v
Clearly, redundant arithmetic operations can be avoided only at the expense of additional communication characterized by an irregular pattern, which is usually not beneficial on a massively parallel processor such as the CM-2. Therefore, we have chosen the cell-based parallel data structure and have accepted the additional cost of redundant flux computations. Hammond and Barth [5] have invoked a graph theory result due to Chrobak and Eppstein [17] to eliminate redundant edge-based flux computations for Euler flows. This result states that for any planar graph, there exists an orientation of the edges such that no vertex has more than three edges directed out from it. This means that there exists a cell partition where no processor needs to compute the convective fluxes across more than three edges of the computational cell. However, this graph theory result does not apply for our viscous computations, because these also include element-based operations.
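To summarize what the retained cell-based partition stores in each virtual processor, the sketch below groups the per-cell data implied by Sections 3 and 5.1: the nodal state, the list of neighboring vertices $K(i)$ with the integrated interface normals of (20), and the incident triangles with their areas and shape-function gradients needed for (26) and (31). The field names and layout are illustrative assumptions; the paper does not prescribe this exact organization.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Cell:
    """Data held by one virtual processor in the cell-based partition (illustrative layout)."""
    W: np.ndarray                                                        # nodal conservative state (4,)
    neighbors: List[int] = field(default_factory=list)                   # vertex indices j in K(i)
    interface_normals: List[np.ndarray] = field(default_factory=list)    # integral of nu over dC_ij, per neighbor
    tri_areas: List[float] = field(default_factory=list)                 # area of each incident triangle
    tri_shape_grads: List[np.ndarray] = field(default_factory=list)      # (3, 2) gradients of N_k per triangle
    tri_vertices: List[List[int]] = field(default_factory=list)          # the three vertex indices of each triangle

    def gathered_states(self, all_cells):
        """Communication step c1: fetch the states of the neighboring cells."""
        return [all_cells[j].W for j in self.neighbors]
```

Storing the triangle data redundantly in every cell is precisely the trade-off discussed above: it removes the scatter steps c2 and c4 at the price of duplicated element computations.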
5.2. Grid decomposition and processor mapping

Efficiency in arbitrary communication on the CM-2 requires the minimization of both the 'hammering' on the router, that is, wire contention, and the distance that information has to travel (i.e. the number of hops between the sender and receiver processors). Here, this implies that (a) adjacent cells must be assigned, as much as possible, to directly connected processors or to processors that are lying in directly connected chips, and (b) contention for the wire connecting neighboring chips must be reduced. In a first step, the unstructured grid is decomposed into a series of subgrids each containing 16 adjacent numerical cells. Each subgrid is assigned to a certain CM-2 chip that is subsequently identified, so that adjacent cells within a subgrid are assigned to directly connected processors lying in the same chip. As a result, off-chip communication is needed only across the subgrid boundaries. Wire contention is reduced if each of the defined subgrids is surrounded by the largest possible number of neighboring subgrids. Indeed, wherever a subgrid boundary is shared with several other subgrids, off-chip communication is split between distinct chips and is distributed across several of the available inter-chip wires (Fig. 5). On the other hand, if for example a subgrid is adjacent only to two other subgrids, a maximum of two wires can be used during off-chip communication, which may create a severe wire contention that would serialize communication and significantly increase its cost. Here, we use the mesh decomposer of Farhat [11], which has proven to be very effective at reducing wire contention on the CM-2 [3].

Fig. 5. Grid decomposition with reduced wire contention.

The next step is to reduce the distance that information has to travel during off-chip
communication, that is when data are exchanged between centers of cells that are assigned to processors lying on different chips. This can be achieved by assigning adjacent subgrids as far as possible to directly connected chips. A combinatorial optimization-like procedure known as simulated annealing (see, for example, [12]) is probably the most popular technique for tackling this mapping problem. However, it is a very expensive procedure that has often proved to be impractical. Alternative heuristic-based schemes have been developed by several authors including Bokhari [13], Farhat [14], and recently Hammond and Schreiber [15]. In this work, we have adopted the mapper in [14]. It is based on a combined greedy/divide and conquer approach and is tuned for hypercube topologies. A detailed analysis of interprocessor communication on the CM-2 for unstructured grids can be found in [3]. In that reference, it is shown that mesh irregularities induce a multiple instruction multiple data (MIMD) style of programming for the communication phase which dominates the cost of communication. It is also suggested that since the irregular pattern of communication is fixed in time, a considerable improvement can be achieved if that pattern is evaluated during the first time step, then compiled or stored in the CM-2 for re-use in subsequent time steps. However, no software was available at that time for validating the proposed communication strategy. Recently, a communication compiler prototype has become available [16] and can be used for storing the routing pattern. In Section 6, we report on its performance.
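The decomposition step described above can be illustrated by a simple greedy growth of 16-cell subgrids over the cell adjacency graph, as sketched below; this is not the decomposer of [11] or the mapper of [14], only a minimal stand-in that shows the intended granularity (one subgrid per 16-processor CM-2 chip).

```python
from collections import deque

def greedy_subgrids(adjacency, cells_per_chip=16):
    """Partition a cell adjacency graph {cell: [neighbor cells]} into subgrids of
    at most `cells_per_chip` adjacent cells, one subgrid per CM-2 chip.
    Greedy breadth-first growth; a stand-in for the decomposer of [11]."""
    unassigned = set(adjacency)
    subgrids = []
    while unassigned:
        seed = next(iter(unassigned))
        subgrid, queue = [], deque([seed])
        while queue and len(subgrid) < cells_per_chip:
            c = queue.popleft()
            if c not in unassigned:
                continue
            unassigned.remove(c)
            subgrid.append(c)
            queue.extend(n for n in adjacency[c] if n in unassigned)
        subgrids.append(subgrid)
    return subgrids

# Each subgrid would then be mapped to a chip so that adjacent subgrids land,
# as far as possible, on directly connected chips (the role of the mapper of [14]).
```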
6. Application to low Reynolds number chaotic flows
Here we apply our computational algorithm to the simulation of two-dimensional low Reynolds number flows past a NACA0012 airfoil at high angle of attack and low Mach number. It is shown in [18] that such flows can be resolved with a reasonable number of grid points. The accuracy of the computed solutions is substantiated by successive mesh refinements and comparisons with the results reported in [18]. All computations are performed on an 8K CM-2 with 32 bit floating point hardware and 32 Kbytes of memory per processor. The code is implemented using a combination of C* and Paris instructions. Measured performance results are compared with those obtained on a CRAY-2 processor using a highly vectorized version of the same computational scheme.

6.1. Simulation results

Four unstructured meshes M1, M2, M3 and M4 are generated. Their characteristics (number of nodes, triangles and edges, the maximum number of neighbors for a given node, and the corresponding VP ratio for an 8K machine) are summarized below, and a partial view of M1 is shown in Fig. 6.
    Mesh    N_v      N_Δ       N_E       M^max_neigh    VPR
    M1      8119     15998     24117     9              1
    M2      16116    30936     47052     8              2
    M3      32236    63992     96228     9              4
    M4      63168    123744    186912    8              8
Fig. 6. Partial view of mesh M1 (NACA0012).
The flow conditions are $M_\infty = 0.1$, $\alpha = 30°$, and $Re = 800$ and $10\,000$. Initially, the free-stream conditions are assumed to prevail all over the flow region. The airfoil is impulsively started with no perturbations. These conditions result in unsteady flows which become chaotic for increasing Reynolds numbers. All simulation results reported below for $Re = 800$ are those obtained with mesh M2. Those reported for $Re = 10\,000$ are obtained with the finer mesh M3. Sample Mach contours at various times are shown in Figs. 7 and 8 for $Re = 800$ and $Re = 10\,000$, respectively. These contour plots show that both flows develop initially with a growth of the boundary layer on the upper and lower airfoil surfaces. A trailing edge starting vortex is shed, and the upper surface boundary layer grows to a significant height, especially for the case $Re = 10\,000$. Vortices are then shed periodically. Figures 9 and 10 report the lift history (using the non-dimensionalized pressure). For the flow at $Re = 800$, a transient regime is observed during the first 5 non-dimensionalized units of time, after which the lift oscillates with a single frequency. The flow at $Re = 10\,000$ is irregular. After 25 non-dimensionalized units of time, we still cannot observe a regular oscillatory behavior. A longer run is needed to establish a conclusion, and that means several dozens of CPU hours. Finally, we note that all of these flow results are consistent with the conclusions reported in [18].
Fig. 7. Mach contours for Re = 800 (snapshots at successive non-dimensionalized times).

Fig. 8. Mach contours for Re = 10 000 (snapshots at successive non-dimensionalized times).

Fig. 9. Lift history for Re = 800.

Fig. 10. Lift history for Re = 10 000.
6.2. Performance results on an 8K CM-2
Performance results are reported for three sets of runs R1, R2 and R3. The first set of runs, R1, is characterized by the use of a random processor mapping, while the second set, R2, is characterized by the use of the decomposer/mapper outlined in Section 5.2. For R3, we have used FASTGRAPH (written by Dahl [16]) to compile the communication patterns generated by our decomposer/mapper. The total CPU time and the ratio of communication time to CPU time are reported in Figs. 11 and 12, respectively, for one time step, that is, three sub-steps of the Runge-Kutta algorithm. Clearly, interprocessor communication dominates the CPU time when a random processor mapping is used and the resulting communication pattern is not compiled. The decomposer/mapper is shown to significantly improve the performance of the code, especially for the finer meshes, where communication costs are reduced by as much as 30%. However, the most dramatic improvement in performance is obtained when the optimal communication pattern is compiled with FASTGRAPH. In the latter case, the communication time is shown to represent less than 15% of the total simulation time.
Fig. 11. CPU time per time step (8K CM-2).

Fig. 12. Communication time/CPU time per time step (8K CM-2).
6.3. CM-2 versus CRAY-2
The CM-2 and CRAY-2 are representatives of the two architectural extremes of today's supercomputers. In order to compare their performances for the fluid flow computations described above, a highly vectorized version of the computational method presented in this paper is implemented on a single processor CRAY-2 (64 bit arithmetic hardware). We note that the CRAY-2 code is intrinsically more efficient than its CM-2 counterpart because it does not perform any redundant computation. In the vector code, the convective fluxes are computed in an edge-by-edge manner, and the diffusive ones in an element-by-element fashion. All scatter/gather operations are preprocessed with a coloring scheme which makes them vectorizable (see, for example, [19] or [20]). Figure 13 compares the performance of the CM-2 code using the decomposer/mapper and FASTGRAPH with that of the CRAY-2 code. The timings reported in Fig. 13 indicate that an 8K CM-2 is at least as fast as one processor CRAY-2. However, as stated earlier, these performance results are for an 8K CM-2 with 32 bit floating point hardware. A CM-2 with double precision floating point hardware was not available to us.
Fig. 13. Comparison of the 8K CM-2 with one processor CRAY-2 (CPU time per time step).
7. Closure

Mixed finite volume/finite element spatial schemes for fully unstructured grids are developed, implemented on the Connection Machine CM-2, and applied to the simulation of two-dimensional viscous flows. Second-order accuracy in the discretization of the convective fluxes is achieved through a monotonic upwind scheme for conservation laws (MUSCL) technique. The diffusive fluxes are computed using a classical Galerkin finite element method, and the resulting semi-discrete equations are time integrated with an explicit Runge-Kutta algorithm. A strategy for mapping thousands of processors onto an unstructured grid is presented. Its key elements are the selection of an appropriate parallel data structure, the careful partitioning of a given unstructured grid into specific subgrids, and the mapping of each individual processor onto an entity of these subgrids. Whenever the communication patterns are compiled during the first time step, the total time elapsed in interprocessor communication using the router is drastically reduced, to represent only 15% of the total CPU time of the simulation. The performance of the resulting massively parallel code is assessed on an 8K CM-2 with 32 bit floating point hardware, for various mesh sizes. Performance comparisons of the presented
solution algorithm with a highly vectorized version, where scatter/gather manipulations are treated with a coloring scheme, indicate that an 8K CM-2 with single precision floating point arithmetic is at least as fast as one processor CRAY-2.

Acknowledgment
The first author wishes to acknowledge partial support by the Air Force Office of Scientific Research under Grant AFOSR-89-0422, and partial support by INRIA Sophia-Antipolis.

References
[1] A. Saati, S. Biringen and C. Farhat, Solving Navier-Stokes equations on a massively parallel processor: Beyond the one gigaflop performance, Internat. J. Supercomput. Appl. 4 (1990) 72-80.
[2] S. Lanteri, C. Farhat and L. Fezoui, Structured compressible flow computations on the Connection Machine, INRIA Report No. 1322, 1990.
[3] C. Farhat, N. Sobh and K.C. Park, Transient finite element computations on 65536 processors: The Connection Machine, Internat. J. Numer. Methods Engrg. 30 (1990) 27-55.
[4] R.A. Shapiro, Implementation of an Euler/Navier-Stokes finite element algorithm on the Connection Machine, AIAA Paper 91-0483, 29th Aerospace Sciences Meeting, Reno, NV (1991).
[5] S. Hammond and T. Barth, An efficient massively parallel Euler solver for unstructured grids, AIAA Paper 91-0441, 29th Aerospace Sciences Meeting, Reno, NV (1991).
[6] L. Fezoui and B. Stoufflet, A class of implicit upwind schemes for Euler simulations with unstructured meshes, J. Comput. Phys. 84 (1989) 174-206.
[7] B. Van Leer, Towards the ultimate conservative difference scheme V: A second-order sequel to Godunov's method, J. Comput. Phys. 32 (1979).
[8] P.L. Roe, Approximate Riemann solvers, parameter vectors and difference schemes, J. Comput. Phys. 43 (1981) 357-371.
[9] J. Steger and R.F. Warming, Flux vector splitting for the inviscid gas dynamic equations with applications to finite-difference methods, J. Comput. Phys. 40 (1981) 263-293.
[10] Thinking Machines Corporation, Connection Machine Model CM-2: Technical summary, Version 6.0, 1990.
[11] C. Farhat, A simple and efficient automatic finite element mesh domain decomposer, Comput. & Structures 28 (1988) 579-602.
[12] J.W. Flower, S.W. Otto and M.C. Salama, A preprocessor for irregular finite element problems, CalTech/JPL Report C3P-292, 1986.
[13] S.H. Bokhari, On the mapping problem, IEEE Trans. Comput. 30 (1981) 207-214.
[14] C. Farhat, On the mapping of massively parallel processors onto finite element graphs, Comput. & Structures 32 (1989) 347-354.
[15] S. Hammond and R. Schreiber, Mapping unstructured grid problems to the Connection Machine, RIACS Technical Report 90.22, 1990.
[16] E.D. Dahl, Mapping and compiling communication on the Connection Machine system, Proc. Distributed Memory Computing Conference, Charleston, SC (1990).
[17] M. Chrobak and D. Eppstein, Planar orientations with low out-degree and compaction of adjacency matrices, Theoret. Comput. Sci., in press.
[18] T.H. Pulliam, Low Reynolds number numerical solutions of chaotic flows, AIAA Paper 89-0123, 27th Aerospace Sciences Meeting, Reno, NV (1989).
[19] T.J.R. Hughes, R.M. Ferencz and J.O. Hallquist, Large-scale vectorized implicit calculations in solid mechanics on a Cray X-MP-48 utilizing EBE preconditioned conjugate gradients, Comput. Methods Appl. Mech. Engrg. 61 (1987) 215-248.
[20] C. Farhat and L. Crivelli, A general approach to nonlinear FE computations on shared-memory multiprocessors, Comput. Methods Appl. Mech. Engrg. 72 (1989) 153-171.