Efficient parallel algorithms for numerical simulation

Future Generation Computer Systems 17 (2001) 961–967

David Lecomber a, Mike Rudgyard b,∗

a Trinity College, Oxford, UK
b Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK

∗ Corresponding author. Tel.: +44-24-7652-2417; fax: +44-24-7657-3024.
E-mail addresses: [email protected] (D. Lecomber), [email protected] (M. Rudgyard).

Abstract

COUPL+ is a programming environment for applications that use unstructured and hybrid grids for numerical simulation. It automates parallelization by partitioning data and the data sets that depend on it, and by maintaining halo interfaces and copy coherency. We explore some of the algorithms behind this package. A multi-level partitioning method is described which is effective in the presence of skewed data, solving the multi-set median-finding problem. Partitioning elements over a set of pre-partitioned nodes is explored, and a novel method is suggested for reducing communication in the resulting distribution. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Numerical simulations; Parallel algorithms; Multi-set median-finding problem

1. Introduction

In computational fluid dynamics, electromagnetics and structural analysis, a common problem is partitioning data across parallel systems so as to achieve low communication between partitions and good load balance. The majority of data-parallel applications partition data and connectivities relating distinct sets that the data span. Typically, these are meshes of tetrahedral or triangular elements. Connectivities define relationships between elements, faces, edges and nodes of the mesh, or the edges or nodes of bounding surfaces. Data sets can be of such magnitude that the partitioning itself must be carried out in parallel. Applications must maintain data coherency across the processors using communication libraries. This requires management of nodes, data and elements. For example, having partitioned the nodes, a partitioning of elements must be induced, or vice versa.


Connectivities, described by lists of nodes, must have these nodes relabelled to reflect the new location of the nodes following partitioning. Such tasks are non-trivial; it is essential that tools manage them for the programmer. COUPL+ [9,11] 1 is a parallel library and environment motivated by this. It features distributed grid-manipulation functions, distributed partitioning of data and connectivities, and a simple parallel loop syntax used to ensure that copies of interface data are maintained. The PVM [2], MPI [13] and BSP [14] communication libraries are supported. Programmability is greatly assisted by the system creating a halo of boundary nodes from adjacent partitions and relabelling connectivities so that all nodes appear local; the system ensures consistent values. The goal is for existing codes to be parallelized with minimal effort. This paper explores some of the inner workings of the COUPL+ library, describing a new bulk-synchronous approach to these tasks. We present results on a cluster of PCs and also on an SGI Origin machine.

1 See http://www.comlab.ox.ac.uk/oucl/work/david.lecomber.
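The halo mechanism described above can be illustrated in miniature. The following sketch is not the COUPL+ API: the Partition class, the refresh_halos function and the data layout are hypothetical, and a real implementation would refresh halo copies with message passing rather than shared dictionaries.

```python
# Illustrative sketch (not the COUPL+ API): each partition owns some nodes and
# keeps "halo" copies of boundary nodes owned by neighbouring partitions.
# After a compute phase, halo copies are refreshed from their owners so that
# all nodes appear local and consistent.

class Partition:
    def __init__(self, owned, halo_owner):
        self.values = dict.fromkeys(owned, 0.0)     # node id -> value (owned here)
        self.halo = dict.fromkeys(halo_owner, 0.0)  # node id -> cached copy
        self.halo_owner = halo_owner                # node id -> owning partition id

def refresh_halos(partitions):
    """Copy each halo node's value from the partition that owns it."""
    for part in partitions:
        for node, owner in part.halo_owner.items():
            part.halo[node] = partitions[owner].values[node]

# Two partitions sharing boundary node 2, which is owned by partition 0.
p0 = Partition(owned=[0, 1, 2], halo_owner={})
p1 = Partition(owned=[3, 4], halo_owner={2: 0})
parts = [p0, p1]

p0.values[2] = 7.5      # the owner updates the boundary node
refresh_halos(parts)    # the halo copy on partition 1 is brought up to date
assert p1.halo[2] == 7.5
```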



2. Performance prediction

The BSP programming model [10,14] is adopted for the algorithms presented in this paper. In this model, the communication mechanism is only guaranteed to have completed pending remote writes and reads at barrier synchronization points, where all the processors synchronize. The model is simple, which enables our algorithms to be explained clearly. A BSP system is parameterized as follows:

• p — the number of processors;
• g — the parameter such that the time required to effect h-relations in continuous message traffic is gh time units. An h-relation is a message pattern in which no processor receives, nor sends, more than h words;
• l — the minimum time between successive synchronization operations.

BSP predicts the time between successive synchronization points to be directly related to the maximum communication traffic, h, and computation, w, seen at any of the processors, by the formula t = w + gh + l. Algorithms have been written in this bulk-synchronous style for portability and may be analyzed effectively. BSP-style programming is supported by the main parallel libraries.
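The cost formula can be evaluated directly; the minimal sketch below shows how a single superstep is costed. The function name and the example parameter values are illustrative, not measured figures from the paper.

```python
# Minimal sketch of the BSP superstep cost t = w + g*h + l.
def superstep_cost(w, h, g, l):
    """Cost of one superstep: local work w, an h-relation of h words,
    gap g (time per word of communication) and synchronization latency l."""
    return w + g * h + l

# e.g. 10^6 units of local work, a 5*10^4-word h-relation, g = 4, l = 10^5
print(superstep_cost(w=1_000_000, h=50_000, g=4, l=100_000))  # -> 1300000
```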

3. Geometric partitioning

Geometric partitioning is the partitioning of data based purely on the geometric location of the nodes. Optimal bisection of a mesh is computationally infeasible, and hence heuristic approaches are preferred [1,12]. One approach is to split the 3D space by some plane: this provides a small boundary surface area for a bounded region if the plane is oriented correctly. Methods to determine the orientation and location of the plane can be found in [1,12]. Partitioning techniques are often based on bisection methods, e.g. (a serial sketch of the first method follows the list):

1. Recursive inertial bisection (RIB)
   1.1. Compute the eigenvector of the inertial matrix for the data [1].
   1.2. Bisect the partition at the mid-point.
   1.3. Repeat (1.1) and (1.2) until the number of partitions is p.
2. Slicing
   2.1. Compute the eigenvector of the entire set as above, once.
   2.2. Slice evenly along this axis.
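As a point of reference, the following serial sketch applies recursive inertial bisection by taking the principal eigenvector of the covariance ("inertial") matrix of the node coordinates and splitting each set at the median projection onto that axis. It assumes NumPy and is an illustration of the idea, not the parallel COUPL+ implementation.

```python
# Serial sketch of recursive inertial bisection (RIB).
import numpy as np

def principal_axis(points):
    """Eigenvector of the largest eigenvalue of the inertial (covariance) matrix."""
    centred = points - points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(centred.T @ centred)
    return eigvecs[:, -1]

def rib(points, index, depth):
    """Assign a partition id in [0, 2**depth) to every node index in `index`."""
    if depth == 0 or len(index) < 2:
        return {int(i): 0 for i in index}
    axis = principal_axis(points[index])
    proj = points[index] @ axis
    median = np.median(proj)
    left, right = index[proj <= median], index[proj > median]
    labels = rib(points, left, depth - 1)
    labels.update({i: part + 2 ** (depth - 1)
                   for i, part in rib(points, right, depth - 1).items()})
    return labels

nodes = np.random.rand(1000, 3)                  # 1000 random 3D node coordinates
labels = rib(nodes, np.arange(len(nodes)), 3)    # eight partitions
```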

We concentrate on the first method here; obvious extensions of our methods enable the second. To bisect the nodes, the plane must evenly apportion the points. Given the normal to a cutting plane, we must find the median distance along this normal. Median finding is significantly faster than sorting. Parallel median finding is feasible in O(n/p) time for practical parallel machines [3,6]. For inertial bisection [1], the computation of the inertial matrix and the consequent eigenvector is also O(n/p). However, both these phases of the high-level algorithm require perfectly balanced initial data to achieve optimality.

3.1. Parallelizing recursive bisection

There are two opposing methods that can be used to apply the RIB algorithm to parallel machines.

Definition 1. A partitioning P of a set S is defined to be a sequence P0, . . . , Pk−1 of disjoint sets (partitions), such that ∪0≤i<k Pi = S.

The requirement of balanced data is met initially: each processor holds n/p elements of the initial set. Thereafter, if the bisection algorithm is applied to each set of an existing partitioning P in turn to create a partitioning Q, then we risk skew in Pi for each i. It is evident that, even though each processor holds a total of n/p elements from the whole data set, this balance does not apply individually to each of the partitions which make up this set.

3.2. p-Way partitioning

An algorithm for computing a partitioning of S into p partitions, where p = 2k, is presented; this is easily generalized to any p ∈ N+.

Note 1. The subset of elements of a set P that reside on processor j is written P^j. Furthermore, for any P, ∪0≤j<p P^j = P.



a single value and computes the appropriate vector in O(p) time. Results are broadcast using only one synchronization, which is easily o(n/p) for any realistic machine. Step 2(2.3), parallel multi-median finding, is described later. Step 2(2.4) is perfectly balanced, as the total number of elements moved around locally is n/p for each processor. Provided parallel multi-median finding is O(n/p), the total complexity of the entire method is O((n/p)(g + log p)). The g-term is incurred in the final movement of data to the destination processor.

3.3. Results in parallel median finding

In [3], the median of an evenly distributed set is computed with computation 2n/p + o(n/p) and communication time of (1 + o(1))n/p, provided g = o(n^ε/log^{1/3} n) for a fixed 0 ≤ ε < 1/3. Their algorithm attains these bounds with high probability (i.e. 1 − O(n^{1−ρ}) for some ρ > 1). The crux of the algorithm in [3], and of others in [6], is random sampling of a small subset of the data. The sample is then sorted. With high probability the median of the whole set lies between two predetermined elements, l and u, of the sample. The number, k, of elements of the original data less than the lower bound, l, is then computed. The original data lying between l and u are then selected; this subset is shown to be small, and it is now sorted. The median is then picked out as the element at position n/2 − k of this set.

3.4. A parallel multi-median finding algorithm

The novel algorithm used by COUPL+ to bisect a set of partitions applies the technique of Section 3.3 concurrently to all partitions in a partitioning to remove imbalance. We begin by defining the problem.

Definition 2. The parallel multi-median finding problem is to find the median of k sets P0, . . . , Pk−1, each of total size n/k, distributed such that, for each processor 0 ≤ j < p, |∪0≤i<k Pi^j| = n/p.

A sample of size t is drawn from each set (the choice of t follows [3]), and ρ > 1 is a constant governing the probability of success. Let µ = (n/k)^{1/3}(3ρ log n)^{2/3} and define l = t/2 − µ and u = t/2 + µ. These parameters follow from [3] routinely:


1. Each processor broadcasts the sizes of each subset held locally.
2. For each 0 ≤ i < k, a sample Qi of size t is chosen, with each processor choosing a set of size t/p. Processor j randomly selects elements from Pi, using the sizes returned in step 1 to determine the processors on which the required elements lie. The elements are exchanged in one aggregated communication.
3. Using any efficient deterministic sorting algorithm, sort the sample sets concurrently, i.e. do the appropriate computation and communication for each set before synchronizing in each phase of the algorithm.
4. Broadcast the l and u elements (li and ui) of each sorted sample Qi from the processors holding these elements to all processors.
5. For each processor j and set i, compute li^j and ui^j, the number of elements in Pi^j less than li and greater than ui, respectively.
6. Concurrently, for each i, reduce (sum) the li^j and ui^j. Determine whether, for all i, the median of Pi lies between li and ui.
7. On the assumption of success in the previous step, let Mi be the values held between li and ui for each i. Locally calculate Mi^j for each i, j.
8. Balance these sets by randomly distributing values across processors. Using any efficient deterministic sorting algorithm, sort these sets concurrently.
9. For each i, select the required element, at position n/(2k) − Li of the sorted set Mi, where Li is the total count of elements below li computed in step 6.

The costs of these steps are detailed in [9]. Briefly, steps 1, 4 and 6 are broadcasts of O(k) values from each processor, and k < p. In step 2, each processor determines the t/p elements that it needs from Pi by choosing elements uniformly from Pi; this requires knowledge of the local sizes Pi^l for each l (hence step 1). This ensures a random distribution of the elements in the sample. Near-perfect balance for the sourcing of elements is obtained, but only over the union of P0, . . . , Pk−1. Aggregating this step into one synchronization yields balance; furthermore, the work and communication is o(n/p) per processor. Steps 3 and 8 perform a sort in time O((kt log t)/p) [3], which is o(n/p) by our choice of t. The initial balancing phase of step 8 requires at most ktg time, which is also o(n/p). Steps 5 and 7 can be performed together. Each processor holds a total of n/p elements, hence these steps are completed in total time n/p. Step 9 is dominated by the preceding steps. Consequently, the significant stages of this algorithm are steps 5 and 7. These stages require no more than two comparisons per element and thus the algorithm has complexity 2n/p + o(n/p) [9], as required.
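The sampling idea underlying steps 2-9 can be sketched serially as follows. The sketch works in a single memory space, chooses the sample size t and the margin mu ad hoc, and falls back to a full sort if a sample misses; the distributed algorithm above instead spreads every set over all processors and aggregates the work for all k sets into shared supersteps.

```python
# Serial sketch of sample-and-narrow median finding applied to k sets.
import random

def median_by_sampling(data, t=200, mu=20):
    """Median of `data` via a sorted random sample and a small middle slice."""
    n = len(data)
    sample = sorted(random.sample(data, min(t, n)))
    lo = sample[max(0, len(sample) // 2 - mu)]                 # lower pivot l
    hi = sample[min(len(sample) - 1, len(sample) // 2 + mu)]   # upper pivot u
    below = sum(1 for x in data if x < lo)                     # rank of lo in data
    middle = sorted(x for x in data if lo <= x <= hi)          # small with high prob.
    target = n // 2 - below                                    # median's offset in middle
    if 0 <= target < len(middle):
        return middle[target]
    return sorted(data)[n // 2]                                # sample missed: fall back

# The same routine applied "concurrently" (here: one loop) to k skewed sets.
random.seed(1)
sets = [[random.gauss(mean, 1.0) for _ in range(10_000)] for mean in range(4)]
medians = [median_by_sampling(s) for s in sets]
```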

4. Importing a connectivity

After partitioning the nodes, the tetrahedra or other such elements must be distributed consistently with this. The simplest method for doing so is highly parallel. During node partitioning, we keep track of data as it is moved around the COUPL+ system. Partitioning information is distributed amongst the processors: each processor holds an array giving the original location of the nodes currently stored at this processor. The inverse of this mapping can be stored to record where formerly local nodes have moved to.

Definition 3. If E represents a set of elements, then define n(E) to be the set of nodes pointed to by E.

Assuming that the inverse mapping for nodes is available, the algorithm proceeds as follows. Let E be the set of elements.

1. Each processor j computes n(E^j), the set of nodes whose location is required by the locally held portion of E.
2. Nodes in the sets n(E^j) are sent to the original hosting processors of those nodes.
3. For each node received by processor j from processor k in the previous step, send back the new destination information to k.
4. Locally, for each j and for each element of E^j, relabel the nodes pointed to, using the destination information received for n(E^j).
5. Locally, for each j and for each element e of E^j, determine a destination for e according to the following rule:
   5.1. If the majority of the nodes of e are held by processor k, send e to k.
   5.2. Otherwise, choose a random processor from the set of processors holding nodes of e.
6. Move the elements to their destinations.
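A minimal sketch of the destination rule in step 5 follows, assuming a node_owner map from node id to processor; the names are illustrative and not taken from COUPL+.

```python
# Destination rule for one imported element (step 5 of the import algorithm).
import random
from collections import Counter

def element_destination(element_nodes, node_owner):
    """element_nodes: node ids of the element; node_owner: node id -> processor."""
    counts = Counter(node_owner[n] for n in element_nodes)
    proc, freq = counts.most_common(1)[0]
    if freq > len(element_nodes) / 2:          # rule 5.1: strict majority of nodes
        return proc
    return random.choice(list(counts))         # rule 5.2: any processor holding a node

owner = {0: 0, 1: 0, 2: 1, 3: 1}
print(element_destination((0, 1, 2), owner))     # 0: two of three nodes on processor 0
print(element_destination((0, 1, 2, 3), owner))  # no majority: randomly 0 or 1
```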


4.1. Smoothing an imported graph

Evidently, the distribution of elements must be improved. Only boundary elements need be considered, those having at least one node on another processor; the other elements are already perfectly placed. Standard graph-based techniques such as those of Kernighan and co-workers [4,7,8] are not directly applicable: they move nodes in order to improve the edge-cut. Our innovation is to use these edge-refinement algorithms on the dual graph of the boundary elements. The dual graph expresses dependencies between elements in the graph: two elements are linked if they require the same node.

Fig. 1. Example partitioning of simple mesh of edges.

Note that the penultimate step of the import algorithm is carried out locally. This method raises a problem: a locally favorable move for an element is made independently of the elements surrounding it. If, for example, the elements are simply edges and the task is to minimize the number of remote nodes that are communicated between processors, Fig. 1 shows the possible mix between partitions at a boundary (indicated with a lightly dashed line). Most edges require the import of a different boundary node. In contrast, suppose the boundary edges were grouped so that in the top half of the figure the edges were assigned to the (dashed) left-hand partition and those in the bottom half to the right-hand one. In this arrangement, each boundary edge shares its remote node with another boundary edge held on the same processor. The communication requirement is halved without a balance penalty. For triangle-based meshes, non-optimal sawtooth-like results are often obtained.

Definition 4. Let G = (V, E) be a graph with vertices v ∈ V and elements e ∈ E. The dual graph is defined by D(G) = (V', E'), where V' = E and E' = {(e1, e2) | e1, e2 ∈ E, n(e1) ∩ n(e2) ≠ ∅}.

Minimizing the edge-cut in the dual corresponds to minimizing communication between elements in the source graph. There is an added complication in that the problem is not strictly isomorphic to edge-cut refinement problems: the freedom to move elements is less, as this task cannot be executed in isolation from knowledge of the locations of the nodes to which the elements point. COUPL+ currently uses a greedy algorithm for refining the dual graph of elements. Elements are moved if doing so improves the edge-cut of the dual graph locally; thus an element is moved to a processor that already requires some of the nodes which the element requires. Table 1 shows that a 10% improvement in edge-cut is obtained with one pass of this algorithm. Fig. 2(ii) shows the improvement obtained over the initial partitioning of edges in a four-way partitioning of the M6 wing.
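One plausible reading of this greedy pass is sketched below: build the dual adjacency from shared nodes, and move an element to the processor holding most of its dual neighbours when that strictly improves its local edge-cut. Balance and the boundary-only restriction are ignored for brevity, and the code is an illustration of the idea rather than the COUPL+ implementation.

```python
# Sketch of one greedy refinement pass over the dual graph of elements.
from collections import Counter, defaultdict

def dual_neighbours(elements):
    """elements: elem id -> tuple of node ids. Returns elem id -> adjacent elem ids."""
    by_node = defaultdict(set)
    for e, nodes in elements.items():
        for n in nodes:
            by_node[n].add(e)
    return {e: {d for n in nodes for d in by_node[n] if d != e}
            for e, nodes in elements.items()}

def greedy_smooth(elements, elem_proc):
    """One greedy pass; updates elem_proc (elem id -> processor) in place."""
    dual = dual_neighbours(elements)
    for e in elements:
        counts = Counter(elem_proc[d] for d in dual[e])
        if not counts:
            continue                           # isolated element: nothing to gain
        best = max(counts, key=counts.get)
        if counts[best] > counts.get(elem_proc[e], 0):
            elem_proc[e] = best                # fewer dual edges are now cut

# Three edges sharing node 2; the lone edge on processor 0 joins the others.
elements = {0: (0, 2), 1: (1, 2), 2: (3, 2)}
elem_proc = {0: 0, 1: 1, 2: 1}
greedy_smooth(elements, elem_proc)
print(elem_proc)   # {0: 1, 1: 1, 2: 1}
```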

Table 1
Set communication requirements (distinct remote nodes per processor)

Mesh        Nodes    Elements   Processors   Initial partitioning   Post-smoothing
F15         690991   4247412    8            6881                   6235
                                4            5905                   5357
                                2            3942                   3559
M6 coarse   10812    57564      8            550                    491
                                4            600                    526
                                2            464                    399
Falcon      155932   847361     8            2188                   1926
                                4            2294                   2005
                                2            1817                   1592

Fig. 2. Imported graph: (i) without smoothing; (ii) with smoothing.

5. Experimental results

The performance shown is for an eight-node P-II-450 cluster using a standard PVM library and 100 Mbit Ethernet, and for an SGI Origin with the BSP library [5]. The nodes of a substantial mesh, the F15, are partitioned into eight sets in under 4 s on the PC cluster. In fact, our data sets are not sufficiently large to show the asymptotic picture. Performance for the Falcon and F15 aircraft meshes is good. The M6 wing is far too small to hide the latency when partitioned, and consequently poor performance is obtained; for the Falcon and F15 we have found that latency also becomes a factor beyond 16 processors on the SGI Origin (Table 2). Independently of this paper, we have analyzed performance using the BSPlib [5] tool, and this has shown that our balance meets the theoretical expectation.

Table 2
Node partitioning times (seconds)

Mesh        Nodes    Processors   Linux cluster   SGI Origin
F15         690991   16           –               1.35
                     8            3.61            1.35
                     4            3.70            1.51
M6 coarse   10812    8            0.84            –
                     4            0.26            –
Falcon      155932   8            1.47            –
                     4            1.10            –

Table 3
Edge import and interface smoothing on P-II-450 cluster

Mesh        Elements   Processors   Import time (s)   Smoothing time (s)
F15         4247412    8            20.69             20.02
                       4            24.26             13.63
                       2            33.68             5.88
M6 coarse   57564      8            0.32              1.50
                       4            0.37              1.36
Falcon      847361     8            3.46              6.05
                       4            4.56              5.05

The second area of our paper, importing connectivities, is slower (Table 3). However, the size of the problem is magnified: the F15 has six times more elements than nodes, and each element consists of four node pointers. For comparison, loading the elements of the F15 grid onto the system took 81 s. Investigations revealed that half of the above running time was associated with the movement of data to the destination distribution; little improvement on the above would seem possible without a better network or communication library.

6. Conclusions

We have presented a parallel node-partitioning algorithm that achieves high performance and guarantees consistency. Partitioning is atypical of parallel algorithms in that as p increases so does the work we ask of it; our algorithm has complexity O(n log n/p) and avoids the pitfalls of the naïve algorithms. Importing a connectivity is heavily communication bound; analysis shows this cost to be incurred in distributing the elements.


This cost compares favorably with the initial I/O required, and it is difficult to conceive of improvements to our algorithm, given the highly local nature of the element distribution method used. We see that interface smoothing is costly but improves communication. Most of the cost is incurred in constructing the dual graph rather than in the refinement and, as with importing, these costs should be seen in perspective against the initial I/O. Although we feel the idea has potential, these costs need to be addressed. Related work on the ParMetis library [7] addresses node partitioning but does not appear to investigate the smoothing of data sets that depend on existing partitionings. It is possible that the partitioning of the nodes can lead to imbalance when used to import elements. This balance can be important to solvers performing computation on both elements and nodes. So-called multi-constraint refinement methods [8], a generalization of Kernighan–Lin [4], may be used. These areas are of future interest.

Acknowledgements

This work was partly funded by the European Community as part of the JULIUS project under contract ESPRIT EP25050.

References

[1] C. Farhat, A simple and efficient automatic FEM domain decomposer, Comput. Struct. 28 (5) (1988) 579–602.
[2] A. Geist, A. Beguelin, J. Dongarra, J. Weicheng, R. Manchek, V. Sunderam, PVM: A Users Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, MA, 1994.
[3] A.V. Gerbessiotis, C.J. Siniolakis, Deterministic sorting and randomized median finding on the BSP model, in: Proceedings of the Eighth ACM Symposium on Parallel Algorithms and Architectures (SPAA'96), ACM Press, New York, June 1996.
[4] B. Hendrickson, Graph partitioning and parallel solvers: has the emperor no clothes, in: Proceedings of Irregular'98, 1998.
[5] J.M.D. Hill, D.B. Skillicorn, Lessons learned from implementing BSP, Lecture Notes in Computer Science, Vol. 1225, 1997, 762 pp.
[6] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley, Reading, MA, 1992.
[7] G. Karypis, V. Kumar, Parallel multilevel k-way partitioning scheme for irregular graphs, in: Proceedings of Supercomputing'96, 1996.
[8] B.W. Kernighan, S. Lin, An efficient heuristic for partitioning graphs, Bell Syst. Tech. J. 49 (2) (1970) 291–307.
[9] D. Lecomber, M. Rudgyard, Algorithms for generic tools in parallel numerical simulation, in: Proceedings of HPCN Europe 2000, Lecture Notes in Computer Science, Vol. 1823, Springer, Berlin, May 2000.
[10] W.F. McColl, Scalable computing, Lecture Notes in Computer Science, Vol. 1000, 1995, p. 46.
[11] M. Rudgyard, D. Lecomber, T. Schönfeld, The COUPL+ User Manual, 1998.
[12] H.D. Simon, Partitioning of unstructured problems for parallel processing, Comput. Syst. Eng. 36 (5) (1991) 745–764.
[13] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra, MPI: The Complete Reference, MIT Press, Cambridge, MA, 1995.
[14] L. Valiant, A bridging model for parallel computation, Commun. ACM 33 (8) (1990) 103–111.


David Lecomber completed his doctorate at Oxford University Computing Laboratory in 1998 developing formal methods of parallel programming for BSP computers and exploring automatic-mode BSP programming. His research interests are in formal software derivation and performance prediction for scalable parallel computers. Following his doctorate, he spent 2 years at Oxford as part of the COUPL+ project developing the core parallel library.


Michael Rudgyard is presently the Lecturer in Scientific Computing in the Department of Computer Science at Warwick University. After taking an undergraduate degree in Mathematics and Engineering at Nottingham University, Michael completed his doctorate at Oxford University Computing Laboratory in 1990, followed by four-and-a-half years as a researcher at CERFACS in Toulouse, France. He began the development of COUPL (the CERFACS and Oxford University Parallel Library), and then led the development of COUPL+, while at CERFACS and on his subsequent return to Oxford in 1995 as a Smith Institute Research Fellow.