Computing Systems in Engineering, Vol. 6, No. 2, pp. 97-109, 1995
PROBLEM DECOMPOSITION FOR ADAPTIVE hp FINITE ELEMENT METHODS
ABANI PATRA and J. TINSLEY ODEN
Texas Institute for Computational and Applied Mathematics, The University of Texas at Austin, Austin, TX 78712, U.S.A.

Abstract--Problem decomposition strategies for load balancing parallel computations on adaptive hp finite element discretizations are discussed in this work. The special difficulties that arise in partitioning these discretizations are highlighted. Three classes of algorithms are discussed: mesh traversal based on orderings, interface based decompositions, and recursive bisection of orderings. A new ordering scheme for efficient recursive bisection of orderings is introduced. Details of the algorithms and examples, along with discussions of their merits and demerits, are presented. Recursive bisection on the new ordering introduced here outperforms several known algorithms on test cases.
1. INTRODUCTION
In this work, we explore an important aspect of parallel computing for adaptive hp finite element methods: the efficient automatic decomposition of adaptive hp finite element discretizations into "load balanced" subdomains, so that the subproblems posed on them can be solved by different processors at roughly equal effort. An efficient strategy for doing this will lead to better utilization of the processors and computing times closer to optimal. Conventionally, the decomposition of computation for solving partial differential equations has been attempted at the data structure level by partitioning matrices and arrays. While this approach is quite efficient in a fixed or uniform refinement type of structure, it becomes very cumbersome and expensive in an hp adaptive setting. The standard tree type of data structure used in hp finite element codes is difficult to partition this way. In the present investigation, the partitioning problem is approached at the physical domain level, the idea being to obtain a distribution of computational effort over the entire domain and then to split up the problem into the required number of partitions.

Previous attempts at a solution of such problems, for h-version finite element methods, have been made by several researchers. Notable among them have been the contributions of Farhat,1 Miller, Vavasis,2,3 Simon4 and Pothen et al.5 Miller, Vavasis and their coworkers used a graph theoretic approach2,3 in their work. This approach has been shown to yield good results, and theoretical bounds on achievable balance and interface size can be developed. Simon4 and Pothen et al.5 have popularized another family of algorithms, the spectral bisection algorithms, which use the second eigenvector of the Laplacian matrix of the graph associated with the finite element mesh for partitioning the mesh. Both of these methods have been adapted for our problems and the details are discussed in the sequel. Another interesting approach has recently been used by Salmon and Warren6 for parallel N-body problems. In their technique the list of two-dimensional or three-dimensional elements is ordered based on the centroids of the elements and this element ordering is then partitioned. This approach has largely motivated a new algorithm we discuss in the following sections. There has also been some contemporary work on this idea by Pilkington and Baden7 and Ou et al.8

Most of the algorithms presented in the literature thus far have been for finite element grids with only h-refinements and are based on equidistributing degrees of freedom. For higher order elements these algorithms may not perform satisfactorily. The communication graph models outlined in some of the recent literature do possess the inherent flexibility to incorporate a weighting scheme to account for the uneven computational effort associated with different elements. However, we have not seen any such ideas in the recent literature. In this presentation we shall confine ourselves to a more intuitive approach, without the formalism associated with such graph theoretic models. The schemes discussed here seem to be satisfactory for the non-uniform high order meshes typical of adaptive hp finite element methods.
2. ADAPTIVE hp FEM PROBLEM DECOMPOSITION
We have in mind here the management of computational effort for so-called hp finite element methods, in which the mesh size h and the local element spectral order p can both be varied as parameters over a mesh. The management of different h and p non-uniformly over a mesh requires a special data structure and is generally considerably more complex than conventional h-version or p-version finite elements. The
payoff, however, can be significant, and with a proper orchestration of hp meshes could include exponential convergence and very high accuracies with minimum degrees of freedom. The first such hp data structure was presented in a series of papers by Oden, Demkowicz and their colleagues9,10 and extensions of such approaches have appeared in several subsequent publications (e.g. Ref. 11). The possibility of further increasing the efficiencies of such methods through multiprocessor parallel computation is intriguing, but we know of no studies of such problems in the literature.

To begin our study of decomposition strategies we establish the goals of such decompositions. A successful problem decomposition strategy needs to achieve:
i. equidistribution of computational effort among the subdomains, and
ii. minimization of the interfaces between the subdomains.

The first objective is clearly necessary to achieve maximum utilization of the processors. Further, most domain decomposition algorithms result in independent subproblems for the subdomains and an additional interface problem. Normally this interface problem is quite poorly conditioned and computationally the most intractable. The size of this problem also determines the amount of inter-processor communication required to solve the problem. Communication among processors, especially in distributed memory multiprocessor types of computers, either MIMD or SIMD, is often a very expensive process. Hence minimization of the interface is treated as an equally important objective in the problem decomposition strategy. It may often be necessary to accept a tradeoff between the two goals, as the advantages of a minimum interface may outweigh the disadvantages of small imbalances in the load.

However, the sequences of hp finite element meshes, with different orders of approximation and different sizes of elements in different parts of the domain, needed to produce the exponential convergence and very high accuracies associated with adaptive hp finite element methods, present special difficulties in generating partitionings. In general such meshes result in:
1. non-uniform computational load across elements,
2. non-uniform communication patterns,
3. the necessity of enforcing constraints across partitions,
4. the necessity of identifying an appropriate choice of computational load measure for partitioning.

The non-uniform computational load across elements is the result of the different orders of approximation used. The result is that a trivial cell/element count cannot be used as a load measure. The dof (degrees of freedom) associated with different elements are clearly very different, and these differences can be very large. For example, a trilinear hexahedral element for the elasticity operator in three dimensions
has 24 dof associated with it while an element of order 8 has 1029 dof. The non-uniform communication patterns are an inevitable result of this non-uniform distribution of dof. A third difficulty arises in the implementation of the constrained elements of the type used in the hp finite element methods, as described in Demkowicz et al.9 If an irregular node falls on a partition line, then special measures have to be taken to enforce the constraint. The problem of identifying an appropriate choice of computational load measure for partitioning is discussed in detail in the next section.

We note here that there are two approaches to the partitioning problem for sequences of adaptive meshes. In the first "static" approach, after every adaptation the entire mesh is repartitioned. This approach requires that the grid geometric data be either duplicated across all processors or be easily accessible. In the second "dynamic" approach, a base partitioning is carried out on an initial mesh and the partitioning is adapted along with the mesh. However, this approach requires considerable data reshuffling among processors since the load distribution can be dramatically altered by a p adaptation. Considering the low volume of grid data (less than 5% of all our data9) and the low cost of the partitioning schemes we use, we choose a static approach in which we repartition the grid after every adaptation.

3. COMPUTATIONAL EFFORT MEASURES
A major issue in partitioning these adaptive hp meshes is the appropriate choice of a measure of computational effort, which should be minimized and equidistributed among the processors. In the simpler h-adaptive meshes and solution strategies using direct solvers, computational effort is related simply to the dof in the problem as O(N^α), where α is determined by the choice of solver. In the current study of higher order elements and iterative solvers, computational effort is not a simple function of the dof. The distribution of computational effort for these meshes is highly non-uniform. The conditioning of the system and the error distribution in an initial mesh may also influence the computational effort. Furthermore, a usable measure of computational effort must possess the local additivity property (i.e. one must be able to compute local measures of computational effort on a per element basis that add up to the global computational effort). Good candidate measures appear to be:
1. the square of the norm of the error in a coarser mesh solution,
2. the number of degrees of freedom in the mesh.

The motivation for error as a partitioning measure can be found in the hp adaptive strategy developed in our earlier work.11 Since the final distribution of h and p is determined by the error distribution in the initial mesh, it is reasonable to assume that the local
distribution of computational effort on the final mesh would also be determined by the error distribution. This type of error based problem decomposition also allows the decomposition to be easily embedded in an overall algorithm using the three-step hp adaptive strategy developed by Oden et al.11 The first step of the three-step method, which involves solution on a very coarse mesh, could be carried out on a single processor. Thereafter, based on the error in this solution, the domain could be partitioned and distributed among the processors, which would then carry out the remaining parts of the three-step strategy on their portions of the domain. Communications among the processors would be required during the solution and error estimation stages. Another possible advantage of this kind of decomposition, especially in a MIMD environment where each processor accesses a local data structure, is that each processor could efficiently implement the tree-type data structure in current hp adaptive finite element codes independently. The problems associated with dynamically partitioning the global tree-type data structure would be avoided, as the partitioning would have been done a priori at the "root" of the tree. However, this will necessitate the use of distributed dynamic data structures currently unavailable.

We now discuss the three families of decomposition algorithms:
1. mesh traversal based decompositions,
2. interface based decompositions,
3. recursive bisection of orderings.

4. MESH TRAVERSAL BASED DECOMPOSITIONS (MTBD)
In this family of decomposition algorithms, the mesh is traversed in some fashion and elements are accumulated into partitions based on some choice of computational effort. In the first naive ordering implemented, the mesh is traversed in a nearest neighbor with lowest load fashion, and in the second a more sophisticated ordering is employed.

4.1. Nearest neighbor ordering based mesh traversal

In this method the mesh is traversed in a nearest neighbor with lowest load fashion. This method ensures some degree of locality in the decomposition, as each element in a decomposition has at least one neighboring element also in the same subdomain. An algorithm to partition the domain Ω into M subdomains based on these principles is outlined below:

Algorithm 1
Θ: estimated global computational effort
Θ_K: estimated computational effort for the Kth finite element in the mesh
D_I: set of element indices in the Ith partition
E_I: running total of computational effort for the Ith partition
1. Compute Ē = Θ/M.
2. Set I = 1; E_I = 0.
   For K = 1 to the number of elements, do
3. Select the next element K such that Θ_K = min_L Θ_L, where ∂Ω_K ∩ ∂Ω_I ≠ ∅ (i.e. the lowest load element adjacent to the current partition).
4. E_I + Θ_K → E_I.
   If E_I < Ē, then set D_I = D_I ∪ {K}, find the next element and let K + 1 → K.
   Else I + 1 → I; D_I = D_I ∪ {K}; K + 1 → K.
   End for
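A minimal Python sketch of Algorithm 1 is given below. The data layout (a neighbour list per element and a per-element effort estimate) and all names are our own illustration, not the authors' code; the fallback used when no free neighbour remains is ad hoc and anticipates the "locking" problem discussed next.

```python
# A minimal sketch of Algorithm 1 (nearest neighbor, lowest load traversal).
# neighbors[k]: indices of elements sharing a boundary with element k (assumed given)
# effort[k]:    estimated computational effort of element k

def nn_lowest_load_partition(neighbors, effort, n_parts):
    n_elem = len(effort)
    target = sum(effort) / n_parts          # E-bar = Theta / M
    part_of = [-1] * n_elem                 # subdomain index of each element
    current, load, k = 0, 0.0, 0
    for _ in range(n_elem):
        part_of[k] = current
        load += effort[k]
        if load >= target and current < n_parts - 1:
            current, load = current + 1, 0.0
        # next element: the lowest-load unassigned neighbour of the current element
        frontier = [j for j in neighbors[k] if part_of[j] < 0]
        if not frontier:                    # "locking": no free neighbour left
            frontier = [j for j in range(n_elem) if part_of[j] < 0]
            if not frontier:
                break
        k = min(frontier, key=lambda j: effort[j])
    return part_of

# Example: a chain of six equal-load elements split into three parts
nbrs = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
print(nn_lowest_load_partition(nbrs, [1.0] * 6, 3))   # -> [0, 0, 1, 1, 2, 2]
```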
While this algorithm does result in somewhat load balanced subdomains, it has significant drawbacks. Two primary drawbacks are:
i. the interface problem is not directly addressed, and
ii. the nearest neighbor traversal strategy often fails when all nearest neighbors have been allocated, a phenomenon we call "locking".

The first problem can be somewhat addressed by adding interface minimization constraints to Steps 3 and 4, which select the next element for the partition. The second problem is less tractable and will always arise in any algorithm that traverses a mesh element by element in a nearest neighbor fashion seeking to construct partitions. While ad hoc measures can be devised to continue the traversal, all remedial measures will cause disconnected subdomains and large interfaces. This problem can be eliminated by using alternate orderings as outlined below.

4.2. Peano-Hilbert ordering based mesh traversal

The problem of "locking" can be eliminated by traversing the mesh in other orderings. One idea that has been tested is to generate a Peano-Hilbert ordering of the elements of the mesh. This ordering is based on the discoveries of Peano12 and Hilbert13 of a class of continuous mappings, h_n: R → U_n, where R is the unit interval [0, 1] and U_n is the unit hypercube. Remarkably, given such a mapping, one can completely fill a hypercube with the images of the unit interval. Such curves are thus denoted "space filling curves". Conversely, one might argue that given any set of points in n dimensional space one can find a space filling curve that passes through all of them and, consequently, a mapping from a point in the unit interval onto each of these points, after some appropriate scaling. Each point in n dimensional space can then be associated with a point in the unit interval, i.e. a number t ∈ [0, 1]. Sorting the points according to the value of this number will thus define an ordering of the set of points in n dimensional space. If this mapping also preserves locality, that is, points that are close to each other in the n dimensional space are mapped onto points close to each other in the unit interval, then this mapping will preserve the adjacency properties of the points in n dimensional space.
Fig. 1. Sample space filling curve and illustration of some fundamental properties for a 3 × 3 stencil: (a) first level curve; (b) second level curve; (c) mixed first and second level curve. The coordinates are in base 3 digits. The location numbers of a and b share the same first 2 digits since they belong to the same first degree subcube.
Now, if to the centroid of each element of a finite element mesh we associate an index generated by such a mapping, we implicitly define an ordering of the elements due to the values of these indices. It is easy then to obtain an ordering of the mesh by sorting the elements based on the value of the index associated with the centroid of each element.

Fig. 2. Finite element mesh mapped to a graph: the finite element mesh, its associated node graph and its associated dual graph.
Fig. 3. Error distribution in Poisson's equation with solution u = tan⁻¹(x + y − x₀) x(1 − x) y(1 − y).
Several techniques for generating these indices can be derived from the work of Patrick et al.14 The subdomains are now generated, using whatever computational effort metric we choose, by traversing the mesh in the order of these indices. A variation of this idea was used by Salmon and Warren6 in the parallel data structure used in their analysis of N-body problems.

We now review some fundamental properties of these curves as described by Bially.15 The recent book by Sagan16 also has a very comprehensive treatment of the subject. Figure 1 illustrates some of these fundamental properties.
1. Sub-cube concept: If each axis of the unit hypercube U_n is decomposed into M units, then U_n is decomposed into M^d subcubes, where d is the space dimension of the domain. Note that this process may be carried on recursively and each subcube can again be decomposed into M^d subcubes, leading to a total of M^(dl) subcubes, where l is the number of levels of this type of recursion. Figures 1(a) and (b) show two levels of such a decomposition for M = 3. An approximation of a space filling curve is now easily generated by joining the centroids of the subcubes at the finest level with a line passing through each subcube. Figures 1(a) and (b) show two such curves at different levels of decomposition. Figure 1(c) shows a mixed curve composed of a combination of first and second level curves.
2. Self similarity: While there are many ways that such curves may be constructed, it is possible to duplicate the path of the curve through each subcube with the path through the higher level subcube containing it, i.e. a stencil such as the one shown in Fig. 1(a) is carried on recursively through the levels, subject only to spatial rotations.
3. Digital causality: Let x = (x_1, x_2) be a point in the unit cube U_2 and let x_1 = 0.x_11 x_21 ... x_k1 ... and x_2 = 0.x_12 x_22 ... x_k2 ... be the M-ary representations of its coordinates. The point x can now be assigned a location number by interleaving the digits of its coordinates in the M-ary representation:
   L(x) = 0.x_11 x_12 x_21 x_22 ... x_k1 x_k2 ...
   It is clear that the location numbers of all points in a subcube will have the same digits in the first dl places, where d is the spatial dimension of the cube and l is the level of the subcube.
4. Vertex property: The curve as constructed above will enter and leave a subcube at a particular level at exactly one point. This implies that the curve must necessarily pass through all points in a subcube before passing through points outside of that subcube.

Fig. 4. Problem decomposition on h-adaptive mesh using the error distribution and Algorithm 4 (RLBBC).

Fig. 5. Adaptive hp finite element mesh.

Fig. 6. Problem decomposition using recursive load based bisection of coordinates.

Fig. 7. Problem decomposition using recursive load based bisection of a Peano ordering.

Table 1. Imbalance and shape fractions for RLBBC partitioning (Fig. 6)

                  Global   SubDom1   SubDom2   SubDom3   SubDom4
  Tot.dof           2072       587       621       572       403
  Interface.dof      111        79        52        56        43
  IF                  --       6.7      12.1       4.2        23
  SF                 5.4      13.4       8.4       9.7      10.7

Table 2. Imbalance and shape fractions for RLBBO partitioning (Fig. 7)

                  Global   SubDom1   SubDom2   SubDom3   SubDom4
  Tot.dof           2072       473       520       447       699
  Interface.dof       67         3        44        39        47
  IF                  --       8.9       2.1      12.5      23.4
  SF                 3.2       0.6       8.5       8.7       6.7

An algorithm to generate the Peano-Hilbert ordering of the elements of the mesh in two dimensions is presented below. Extensions to higher dimensions are obvious. The algorithm is motivated by the digital causality property of the location numbers. Several other more complex algorithms for generating space filling curves of different types can be found in the treatise of Sagan.16
Algorithm 2
1. Let (x, y) be the element centroid coordinates, scaled so that (x, y) ∈ (0, 1)², and compute binary representations of x and y as
   x = 0.a_1 a_2 a_3 a_4 a_5 ...
   y = 0.b_1 b_2 b_3 b_4 b_5 ...
   where the a's and b's are 0 or 1 (i.e. binary expansions).
2. Form t = 0.(2a_1)(2b_1)(2a_2)(2b_2)(2a_3)(2b_3)..., where t is written in base 3.
3. Order the elements by the t values of their centroids.
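The short Python sketch below illustrates Algorithm 2 under our reading of step 2, in which (2a_j) and (2b_j) denote the base-3 digits 2·a_j and 2·b_j. The helper names and the number of digits retained are our own choices, not part of the original algorithm.

```python
# A minimal sketch of Algorithm 2, assuming centroids are given as (x, y) pairs
# already scaled to (0, 1)^2.

def location_number(x, y, n_digits=16):
    """Base-3 location key t = 0.(2a1)(2b1)(2a2)(2b2)... built from the
    binary expansions x = 0.a1a2a3..., y = 0.b1b2b3..."""
    t, scale = 0.0, 1.0 / 3.0
    for _ in range(n_digits):
        x *= 2.0
        a = int(x)            # next binary digit a_j of x
        x -= a
        t += 2 * a * scale    # base-3 digit 2*a_j
        scale /= 3.0
        y *= 2.0
        b = int(y)            # next binary digit b_j of y
        y -= b
        t += 2 * b * scale    # base-3 digit 2*b_j
        scale /= 3.0
    return t

def peano_order(centroids, n_digits=16):
    """Return element indices sorted by the location numbers of their centroids."""
    keys = [location_number(x, y, n_digits) for (x, y) in centroids]
    return sorted(range(len(centroids)), key=lambda k: keys[k])

# Example: four element centroids in the unit square
print(peano_order([(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]))
```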
Fig. 8. Problem decomposition using recursive spectral bisection of the weighted node graph corresponding to the adaptive hp mesh.
For such an ordering to be effective in achieving our partitioning objectives, the one dimensional mapping must preserve the locality properties of the two or three dimensional mesh (i.e. points close to each other in the original domain must also be close to each other, or at least not too far away, in the transformed domain). The hierarchical nature of the subcubes, along with the property of digital causality described before, implies that this type of mapping does preserve locality except at subcube boundaries. The jumps across subcube boundaries of a particular level must again be confined to the higher level subcube. We now illustrate a simple proposition that ensures that such jumps are at most of the order of the size of the domain.

Proposition. Let x_1 = (x_1, y_1) and x_2 = (x_2, y_2) be two points in U = [0, 1] × [0, 1] ⊂ R², and let t_1 and t_2 be the corresponding images under the mapping τ described in the algorithm above, i.e. τ(x_1) = t_1 and τ(x_2) = t_2. If

   α = ||x_1 − x_2||² = |x_1 − x_2|² + |y_1 − y_2|²   and   β = |t_1 − t_2|²,

then β/α ≤ C, where C is O(1), for α ≪ 1 and β ≪ 1.

Let us use n digits in the mapping τ, so that

   x_1 = 0.a_1^1 a_2^1 ... a_n^1,   x_2 = 0.a_1^2 a_2^2 ... a_n^2,
   y_1 = 0.b_1^1 b_2^1 ... b_n^1,   y_2 = 0.b_1^2 b_2^2 ... b_n^2,

and, using the above algorithm,

   τ(x_1) = t_1 = 0.(2a_1^1)(2b_1^1) ... (2a_n^1)(2b_n^1),   τ(x_2) = t_2 = 0.(2a_1^2)(2b_1^2) ... (2a_n^2)(2b_n^2)

in base 3. Then

   β = (t_1 − t_2)² = [Σ_{j=1..n} (a_j^1 − a_j^2) 3^(−(4j−2)) + Σ_{j=1..n} (b_j^1 − b_j^2) 3^(−4j)]²
     ≤ 2 {[Σ_{j=1..n} (a_j^1 − a_j^2) 3^(−(4j−2))]² + [Σ_{j=1..n} (b_j^1 − b_j^2) 3^(−4j)]²} = A² + B².

Now look at

   α = ||x_1 − x_2||² = |x_1 − x_2|² + |y_1 − y_2|²
     = [Σ_{j=1..n} (a_j^1 − a_j^2) 2^(−j)]² + [Σ_{j=1..n} (b_j^1 − b_j^2) 2^(−j)]² = C² + D².

The ratio is thus

   β/α ≤ (A² + B²)/(C² + D²).

If x_1 = x_2 then α = 0 and β = 0, and the ratio collapses to 0/0. If x_1 ≠ x_2 then α ≠ 0. Let k be the lowest value of j for which a_j^1 and a_j^2 differ and let l be the lowest value of j for which b_j^1 and b_j^2 differ. The leading terms dominate the sums above, so we can write

   β/α ≤ [(a_k^1 − a_k^2)² × 3^(−(4k−2)) + (b_l^1 − b_l^2)² × 3^(−8l)] / [(a_k^1 − a_k^2)² × 2^(−2k) + (b_l^1 − b_l^2)² × 2^(−2l)]
       ≤ [3^(−(4k−2)) + 3^(−8l)] / [2^(−2k) + 2^(−2l)] ≤ C,

where C can be determined as the maximum value of the above expression. For example, if k = l = 1 then C ≈ 0.222. If k ≪ l and k is small (i.e. close to 1), then β/α is of the order of 3^(−(4k−2)) 2^(2k), which is a bounded number. Similarly, for l ≪ k and l small, β/α is of the order of 3^(−8l) 2^(2l), which is also a bounded number. For k and l both large the ratio can be large, but large k and l indicate that both α and β are very small numbers, i.e. x_1 and x_2 are close and t_1 and t_2 are close. We emphasize that this result is not a proof of locality; it merely establishes that the jumps in the location indices may not be too large.
Fig. 9. Adaptive hp mesh for Example 3.
The basic mesh partitioning algorithm is now outlined:

Algorithm 3
Θ: estimated global computational effort
Θ_K: estimated computational effort for element K
J_I = {K : element K ∈ subdomain I}
M: the number of partitions desired
Φ: running total of subdomain computational load
1. Compute Θ and the corresponding element computational effort measures Θ_K.
2. Compute Θ_T = Θ/M.
3. Create an ordering of the elements by mapping the centroids of the elements onto a Peano-Hilbert curve.
4. Set I = 1; Φ = 0.
5. For K = 1 to the number of elements, do
   Φ = Φ + Θ_K
   J_I = J_I ∪ {K}
   If (Φ ≥ Θ_T) then I = I + 1; Φ = 0.
   K = K + 1.
   End for
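A minimal Python sketch of Algorithm 3 follows. It assumes the element ordering along the space filling curve and the per-element effort estimates are available (for instance from the sketches above); the function and variable names are ours, not the authors'.

```python
# A minimal sketch of Algorithm 3: accumulate elements along the space filling
# curve ordering until the running load reaches the target share, then start a
# new subdomain.
# order:  element indices sorted along the space filling curve
# effort: per-element computational effort (dof, error measure, etc.)

def traverse_partition(order, effort, n_parts):
    target = sum(effort[k] for k in order) / n_parts   # Theta_T = Theta / M
    parts = [[] for _ in range(n_parts)]
    i, load = 0, 0.0
    for k in order:
        parts[i].append(k)
        load += effort[k]
        if load >= target and i < n_parts - 1:
            i, load = i + 1, 0.0
    return parts

# Example with a toy load distribution over ten elements
print(traverse_partition(list(range(10)), {k: 1 + (k % 3) for k in range(10)}, 4))
```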
A partitioning generated by using these indices will enforce some locality on the subdomains generated. However, the interface problem is again not addressed explicitly. This drawback shows up when the meshes get finer and can often produce disconnected subdomains. One alternative is to seek specific partitioning interfaces and then try to balance the work in the resulting partitions. An algorithm based on this technique is outlined in the next section.
5. INTERFACE BASED DECOMPOSITIONS
These methods consist of selecting candidate separator surfaces and then selecting a separator based on lowest total workload, smallest load imbalance and smallest interface. The selection involves assigning to each candidate separator surface a number indicative of the computational effort associated with the resulting domain partitions and the interfaces. The assignment of this number can be controlled by the problem and is machine dependent. On computers with slower message passing architectures, it is possible to penalize the interface requirements more heavily so that minimal interfaces are achieved even at the cost of some load imbalance.

5.1. Recursive load based bisection of coordinates

Various types of recursive load balancing schemes based on bisections along coordinate lines have been proposed by Vavasis2,3 and Miller et al.17 The advantage of these methods is that both objectives, load balance and minimum interface, are explicitly addressed. However, the cost of doing so may inhibit the method. An algorithm that will implement this strategy is outlined below. The simplest of these algorithms leads to a type of recursive bisection. The drawback is that partitions that are powers of 2 are obtained. This seems to be acceptable for current hypercube-type distributed memory architectures, but will be a drawback on grid-type architectures where the number of processors need not be a power of 2.

Fig. 10. Problem decomposition using RLBBO on Example 3.
Algorithm 4
Θ_K: computational effort estimate for element K. Θ_K is specified as dof in the description of the algorithm; it may be replaced by any alternate measure of computational effort.
D_I: list of elements in subdomain I.
n_t: number of trial separator surfaces.
q_i: quality index for a trial separator.
1. Compute the maximum and minimum coordinates of the entire domain in one of the dimensions, x1_min and x1_max.
   For i = 1 to n_t do
2. Compute
   x1_i = x1_min + i (x1_max − x1_min)/n_t
   q_i = abs(dof_left/dof_right − 1) × dof_tot + dof_inter
   where dof_left and dof_right are the degrees of freedom to the left and right of x1_i and dof_inter is the degrees of freedom on the trial separator x1_i.
3. Choose as interface the separator corresponding to the lowest q_i.
4. For each element K: if the center of mass of element K has an x1 coordinate less than that of the interface, then
   D_1 ← D_1 ∪ {K}
   else
   D_2 ← D_2 ∪ {K}.
   At this stage, the original domain has been split into two.
5. For the next level of decomposition apply Steps 1-4 with D_1 and D_2 instead of the entire domain and x2 as the coordinate. This will result in four subdomains. In three dimensions, for the next level use Steps 1-4 with these four subdomains and x3 as the coordinate. This process is recursively continued until the desired number of domains is attained.

Clearly, for better shaped domains, equal numbers of splits in each coordinate must be done. Thus, in two dimensions 4^n subdomains and in three dimensions 8^n subdomains are obtained. One disadvantage of the method observed in numerical tests is that for non-convex domains it results in disconnected subdomains.
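The sketch below illustrates this recursive load based bisection of coordinates under simplifying assumptions: elements are described only by a centroid and a dof count, and the dof lying on the separator itself are ignored in the quality index. All names are our own.

```python
# A rough sketch of Algorithm 4 (recursive load based bisection of coordinates).

def best_separator(elements, axis, n_trials=16):
    """Pick the trial separator with the lowest quality index
    q = |dof_left/dof_right - 1| * dof_tot (separator dof omitted here)."""
    lo = min(e["centroid"][axis] for e in elements)
    hi = max(e["centroid"][axis] for e in elements)
    dof_tot = sum(e["dof"] for e in elements)
    best = None
    for i in range(1, n_trials):
        x = lo + i * (hi - lo) / n_trials
        left = sum(e["dof"] for e in elements if e["centroid"][axis] < x)
        right = dof_tot - left
        if left == 0 or right == 0:
            continue
        q = abs(left / right - 1.0) * dof_tot
        if best is None or q < best[0]:
            best = (q, x)
    return best[1] if best else 0.5 * (lo + hi)

def rlbbc(elements, levels, dim=2, axis=0):
    """Recursively bisect along alternating coordinates, giving 2**levels parts."""
    if levels == 0 or len(elements) <= 1:
        return [elements]
    x = best_separator(elements, axis)
    left = [e for e in elements if e["centroid"][axis] < x]
    right = [e for e in elements if e["centroid"][axis] >= x]
    nxt = (axis + 1) % dim
    return rlbbc(left, levels - 1, dim, nxt) + rlbbc(right, levels - 1, dim, nxt)

# Example: a small grid of unit-dof elements split into 4 subdomains
elems = [{"centroid": (i + 0.5, j + 0.5), "dof": 1} for i in range(4) for j in range(4)]
print([len(p) for p in rlbbc(elems, levels=2)])
```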
6. RECURSIVE BISECTION OF ORDERINGS
In this family of algorithms an attempt is made to combine the advantages of the mesh traversal algorithms with those of the interface based partitionings. The elements are ordered using some ordering and then a recursive splitting is applied to the resulting one-dimensional mapping.

6.1. Recursive load based bisection of an ordering (RLBBO)

In this method the elements are ordered using the space filling curve ordering outlined in Section 4.2 and then a recursive splitting is applied to the resulting one-dimensional mapping. The basic algorithm is outlined below:
Algorithm 5
D_I: list of elements in subdomain I.
n_t: number of trial separator surfaces.
q_i: quality index for a trial separator.
1. Create an ordering of the elements by mapping the centroids of the elements onto a Peano-Hilbert curve.
2. Let t_K be the distance of the centroid of element K along the space filling curve.
3. Compute the maximum and minimum of t_K, t_max and t_min.
4. Compute n_t trial separator levels as
   t_i = t_min + i (t_max − t_min)/n_t.
5. For each t_i compute a quality of interface index
   q_i = abs(dof_left/dof_right − 1) × dof_tot + dof_inter.
   Replace dof by error or another load estimate as appropriate.
6. Choose as interface t_int the t_i that corresponds to the lowest q_i.
7. For K = 1 to the number of elements
   If t_K ≤ t_int then
   D_1 ← D_1 ∪ {K}
   else
   D_2 ← D_2 ∪ {K}
   end if
   end for
   At this stage, the original domain has been split into two.
8. Apply Steps 1-7 recursively on each of the generated subdomains.

Initial results obtained using this algorithm are quite promising. One particularly demanding mesh and the corresponding decomposition are shown in Figs 5 and 7.
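A rough Python sketch of this algorithm is given below. It assumes the space filling curve coordinate t_K of each element and a per-element load are available; interface dof are omitted from the quality index for brevity, and all names are our own.

```python
# A rough sketch of Algorithm 5 (RLBBO): recursive bisection of the space
# filling curve ordering.
# t[k]:   curve coordinate of element k (e.g. its location number)
# dof[k]: per-element load estimate

def split_by_ordering(elem_ids, t, dof, n_trials=32):
    """Pick the separator level t_i with the lowest quality index
    q = |dof_left/dof_right - 1| * dof_tot (interface dof omitted)."""
    t_min = min(t[k] for k in elem_ids)
    t_max = max(t[k] for k in elem_ids)
    dof_tot = sum(dof[k] for k in elem_ids)
    best = None
    for i in range(1, n_trials):
        ti = t_min + i * (t_max - t_min) / n_trials
        left = sum(dof[k] for k in elem_ids if t[k] <= ti)
        right = dof_tot - left
        if left == 0 or right == 0:
            continue
        q = abs(left / right - 1.0) * dof_tot
        if best is None or q < best[0]:
            best = (q, ti)
    if best is None:                              # degenerate case: split arbitrarily
        return elem_ids[:1], elem_ids[1:]
    ti = best[1]
    return ([k for k in elem_ids if t[k] <= ti],
            [k for k in elem_ids if t[k] > ti])

def rlbbo(elem_ids, t, dof, levels):
    """Recursive bisection of the ordering, giving 2**levels subdomains."""
    if levels == 0 or len(elem_ids) <= 1:
        return [elem_ids]
    left, right = split_by_ordering(elem_ids, t, dof)
    return rlbbo(left, t, dof, levels - 1) + rlbbo(right, t, dof, levels - 1)

# Example: eight elements with increasing curve coordinate and uneven loads
ids = list(range(8))
print(rlbbo(ids, {k: k / 8.0 for k in ids}, {k: 1 + k % 4 for k in ids}, levels=2))
```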
Fig. 11. Problem decomposition using RLBBC on Example 3.
6.2. Recursive spectral bisection (RSB)

Another technique, popularized by Simon, Pothen and others,4,18 generates a partitioning based on the Laplacian matrix of the graph associated with the mesh. Two types of graphs can be associated with a mesh. The first type, called a node graph, can be formed by associating each node of the mesh with a vertex of the graph and constructing an edge between two vertices if the interaction of the corresponding nodes is non-zero. The second type, called a dual graph, is formed by associating each element of the mesh with a vertex of the graph and constructing an edge between two vertices if the associated elements share a side. A simple mesh and its associated graphs are shown in Fig. 2. The Laplacian matrix L(G) of a graph is defined through its components l_ij by

   l_ij = 1   if (v_i, v_j) have a common edge, for vertices v_i, v_j
   l_ij = −m  if i = j, where m is the total number of edges meeting at v_i
   l_ij = 0   otherwise.

The second eigenvector of the Laplacian matrix, the so-called Fiedler vector, is used to define an ordering of the nodes of the elements, and these are then recursively bisected. The standard Laplacian matrix can be used only for h adaptive meshes. To apply such strategies to hp meshes we use the node weight feature provided by Barnard and Simon in their recursive spectral bisection code. We use p², where p is the spectral order associated with the node, as the node weight. The algorithm is now outlined briefly as follows:

Algorithm 6
1. Construct the graph associated with the hp mesh. Use an appropriate weighting scheme to represent the spectral orders. (Weights proportional to p² have performed well.)
2. Compute the second eigenvector of the Laplacian matrix (Fiedler vector).
3. Sort the vertices of the graph according to the value of their associated components in the Fiedler vector.
4. Assign half the vertices to each subdomain.
5. Repeat Steps 1-4 recursively on each subdomain.

This algorithm is applied to a test adaptive hp mesh and appears to produce good decompositions.
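The following sketch illustrates the spectral bisection step of this algorithm in Python, assuming the node graph is supplied as a symmetric adjacency matrix. It is our own illustration, not the Barnard-Simon code referenced above, and the p² vertex weighting is omitted for brevity.

```python
# A compact sketch of recursive spectral bisection using the Fiedler vector.
import numpy as np

def fiedler_bisect(adj):
    """Split vertex indices into two halves ordered by the Fiedler vector of the
    graph Laplacian L = D - A (the sign convention in the text differs only by
    an overall factor of -1, which leaves the ordering unchanged)."""
    a = np.asarray(adj, dtype=float)
    lap = np.diag(a.sum(axis=1)) - a
    _, vecs = np.linalg.eigh(lap)       # eigenvalues in ascending order
    order = np.argsort(vecs[:, 1])      # second eigenvector = Fiedler vector
    half = len(order) // 2
    return order[:half], order[half:]

def rsb(adj, vertices, levels):
    """Recursive spectral bisection of the vertex set into 2**levels parts."""
    if levels == 0 or len(vertices) <= 2:
        return [list(vertices)]
    sub = np.asarray(adj)[np.ix_(vertices, vertices)]
    left, right = fiedler_bisect(sub)
    return (rsb(adj, [vertices[i] for i in left], levels - 1) +
            rsb(adj, [vertices[i] for i in right], levels - 1))

# Example: a six-vertex path graph split into two halves
A = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
print(rsb(A, list(range(6)), levels=1))   # e.g. [[0, 1, 2], [3, 4, 5]]
```

For meshes of realistic size a sparse eigensolver (e.g. a Lanczos scheme, as used in the RSB code) would replace the dense eigendecomposition used in this sketch.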
7. EXAMPLES

Some examples are now presented to demonstrate the basic performance of the algorithms presented in the previous sections. The first example uses error as the measure of computational effort and an interface based partitioning algorithm. In Fig. 3 the error distribution contours are shown, and in Fig. 4 the corresponding domain decomposition from the interface based decomposition algorithm is shown.

In the second example, a particularly demanding combination of a non-convex geometry, a full range of spectral orders from 1 through 6, and three levels of h refinement is considered. The mesh is shown in Fig. 5. The naive MTBD class of algorithms failed for this problem. The partitioning generated by Algorithm 4 (RLBBC) is shown in Fig. 6. The load balance of the decomposition is good, but because of the non-convex nature of the domain the subdomains generated are disconnected. In Fig. 7 the partitioning generated by Algorithm 5 (RLBBO) is shown. This partitioning seems to be balanced, with compact subdomains. The CPU time requirements for this algorithm lie primarily in the one sort operation needed, which is O(N log N), and are insignificant compared to the rest of the solution process.

To quantify the results of this example, we define two new quantities to measure the quality of the partitioning.
Imbalance fraction (IF). This is defined as the percentage, with respect to the largest subdomain, of the deviation from the average dof in each partition:

   IF = (N_i − N_av)/N_max × 100,   (1)

where N_i is the dof associated with subdomain i, and N_av and N_max are the average and maximum dof associated with the different subdomains.

Shape fraction (SF). This is defined as the percentage of interface dof in a particular partition with respect to the total dof associated with that partition. For the whole domain, all dof on interfaces are compared to the total dof in the problem:

   SF = N_int^i / N_i × 100,   (2)

where N_int^i is the number of dof on all the interfaces associated with subdomain i.

Tables 1 and 2 summarize these performance measures for the partitionings associated with Example 2. While the imbalance fractions of both decompositions are comparable, the shape fractions, especially for the global interface, of RLBBO are much lower. Thus the size of the interface is surprisingly better controlled by the RLBBO algorithm for this example. Since in most domain decomposition solvers the size of the interface largely controls the inter-processor communication requirements, this can have a significant effect on the computation times. Further, iterative substructuring type algorithms will also be more efficient for this partitioning.
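A small helper for evaluating these two measures from per-subdomain dof counts is sketched below; the data layout and names are ours, and the tabulated IF values appear to correspond to the magnitude of eqn (1).

```python
# A small helper computing the imbalance and shape fractions of eqns (1) and (2)
# from per-subdomain dof counts.

def quality_fractions(total_dof, interface_dof):
    """total_dof[i], interface_dof[i]: dof and interface dof of subdomain i."""
    n_av = sum(total_dof) / len(total_dof)
    n_max = max(total_dof)
    imbalance = [abs(n - n_av) / n_max * 100.0 for n in total_dof]       # |IF|, eqn (1)
    shape = [m / n * 100.0 for n, m in zip(total_dof, interface_dof)]    # SF, eqn (2)
    return imbalance, shape

# Subdomain columns of Table 2 (RLBBO)
print(quality_fractions([473, 520, 447, 699], [3, 44, 39, 47]))
```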
Table 3. Imbalance and shape fractions for RLBBO partitioning (Fig. 10). Solution time on an 8 processor i860 is 10.2 seconds

                  Global      1      2      3      4      5      6      7      8
  Tot.dof            840    151     36    209    130    132     86    105    126
  Interface.dof      130     28     16     34     30     65     26     42     24
  IF                  --   13.9   41.1   41.7    3.9    4.8   17.1    8.1      2
  SF                15.4   44.4   16.3   23.1   49.4   30.2     40     19

Table 4. Imbalance and shape fractions for RLBBC partitioning (Fig. 11). Solution time on an 8 processor i860 is 9.8 seconds

                  Global      1      2      3      4      5      6      7      8
  Tot.dof            840    149    132    117    118    150    106    108    102
  Interface.dof      126     35     42     34     42     33     23     53     25
  IF                  --   17.6    6.3      3      3   18.3     11    9.7   13.8
  SF                  15   23.3   31.8   29.1   35.5     22   21.7   24.5
In Fig. 8, the partitioning generated by the RSB code from Simon et al. is shown for comparison. To implement the method, we use a weighting of the nodes corresponding to the square of the order of approximation associated with the node. The partitioning looks quite similar to that generated by the RLBBO. One can possibly argue that the Fiedler vector used in this partitioning is in fact an ordering of the nodes and, implicitly, also of the elements, and hence it is no surprise that the partitionings look similar. The only drawback we observe is that a few isolated nodes are located apart from the domain. A good smoothing pass ought to take care of these. None of the results shown have employed any smoothing passes for either the RLBBO or the RLBBC algorithms.

In Fig. 9 we depict a third example adaptive hp mesh generated automatically by the algorithm described in Ref. 11. The corresponding partitionings are shown in Figs 10 and 11. Tables 3 and 4 show the imbalance fractions, shape fractions and solution times for parallel solution of Poisson's equation on an 8 processor Intel i860 using this mesh and partitioning.
Acknowledgements--The financial support of ARPA under contract no. DABT63-92-C-0042 is gratefully acknowledged. We also thank Dr. H. D. Simon for providing a copy of the code for recursive spectral bisection.
REFERENCES
1. C. Farhat and F.-X. Roux, "Implicit parallel processing in structural mechanics," Computational Mechanics Advances, IACM, Vol. 1, No. 2, Elsevier Scientific Publishers, Amsterdam, 1994.
2. S. A. Vavasis, "Automatic domain partitioning in three dimensions," SIAM J. Sci. Statistical Comput. 12(4), 950-970 (July 1991).
3. S. A. Vavasis, "Automatic domain partitioning in three dimensions," TR 90-1082, Department of Computer Science, Cornell University, Ithaca, NY 14853-7501, January 1990.
4. H. D. Simon, "Spectral partitioning for dynamically changing calculations on parallel machines," Seventh International Conference on Domain Decomposition Methods in Scientific and Engineering Computing, State College, October 1993.
5. A. Pothen, H. Simon and K. P. Liou, "Partitioning sparse matrices with eigenvectors of graphs," SIAM J. Matrix Analysis Appl. 11, 430-452 (1990).
6. J. Salmon and M. Warren, "Parallel hashed oct-trees," in Proceedings, Supercomputing '93, Portland, Oregon, November 1993.
7. J. R. Pilkington and S. B. Baden, "Partitioning with spacefilling curves," Technical Report CS94-349, Department of Computer Science and Engineering, University of California, San Diego, March 1994.
8. C.-W. Ou, S. Ranka and G. Fox, "Fast mapping and remapping algorithms for irregular and adaptive problems," Proceedings of the International Conference on Parallel and Distributed Systems, Taipei, Taiwan, 1993.
9. L. Demkowicz, J. T. Oden, W. Rachowicz and O. Hardy, "Toward a universal hp adaptive finite element strategy, Part 1. Constrained approximation and data structure," Comput. Methods Appl. Mech. Engng 77, 79-112 (1989).
10. W. Rachowicz, J. T. Oden and L. Demkowicz, "Toward a universal hp adaptive finite element strategy, Part 3. Design of hp meshes," Comput. Methods Appl. Mech. Engng 77, 181-212 (1989).
11. J. T. Oden, A. Patra and Y. S. Feng, "An hp adaptive strategy," in Adaptive, Multilevel and Hierarchical Computational Strategies (edited by A. K. Noor), AMD-Vol. 157, pp. 23-46, 1992.
12. G. Peano, "Sur une courbe, qui remplit toute une aire plane," Math. Annalen 36, 157-160 (1890).
13. D. Hilbert, "Über die stetige Abbildung einer Linie auf ein Flächenstück," Math. Annalen 38 (1891).
14. E. A. Patrick, D. R. Anderson and F. K. Bechtel, "Mapping multidimensional space to one dimension for computer output display," IEEE Trans. Comput. C-17(10) (October 1968).
15. T. Bially, "A class of dimension changing mappings and its application to bandwidth compression," Ph.D. thesis, Polytechnic Institute of Brooklyn, 1967.
16. H. Sagan, Space Filling Curves, Springer Verlag, Berlin, 1994.
17. G. L. Miller, S. Teng, W. Thurston and S. A. Vavasis, "Automatic mesh partitioning," CTC92TR112, Advanced Computing Research Institute, Cornell University, Ithaca, NY 14853-7501, 1992.
18. H. D. Simon, "Partitioning of unstructured problems for parallel processing," Comput. Systems Engng 2, 135-148 (1991).
19. M. Ainsworth and J. T. Oden, "A unified approach to a posteriori error estimation using element residual methods," Numer. Math. 65, 23-50 (1993).
20. J. T. Oden, A. Patra and Y. S. Feng, "Domain decomposition for adaptive hp finite element methods," in Domain Decomposition Methods in Scientific and Engineering Computing (edited by D. Keyes and J. Xu), Contemporary Mathematics, Vol. 180, AMS, 1994.