Graph partitioning based methods and tools for scientific computing ¹

Parallel Computing 23 (1997) 153-164

François Pellegrini *
LaBRI, URA CNRS 1304, Université Bordeaux I, 351 cours de la Libération, 33405 Talence, France

Abstract

The combinatorial optimization problem of assigning the communicating, coexisting processes of a parallel program to the processors of a parallel machine so as to minimize its overall execution time is called static mapping. This paper presents the work that has been carried out to date at the LaBRI on static mapping. We introduce a static mapping algorithm based on the recursive bipartitioning of both the source process graph and the target architecture graph, and describe the capabilities of SCOTCH 3.1, a software package that implements this method. SCOTCH can efficiently map any weighted source graph onto any weighted target graph in a time linear in the number of source edges and logarithmic in the number of target vertices. We give brief descriptions of the algorithm and its bipartitioning methods, and compare the performance of our mapper with that of other mapping and partitioning software packages.

Keywords: Static mapping; Graph partitioning; Dual recursive bipartitioning; Domain decomposition

1. Introduction

The efficient execution of a parallel program on a parallel machine requires that the communicating processes of the program be assigned to the processors of the machine so as to minimize its overall running time. When processes are assumed to coexist simultaneously for the whole duration of the program, this optimization problem is called mapping. It amounts to balancing the computational weight of the processes among the processors of the machine, while reducing the communication overhead induced by parallelism by keeping intensively intercommunicating processes on nearby processors. In many such programs, the underlying computational structure can be conveniently

* E-mail: [email protected].

¹ This work was supported by the French GDR PRS.

0167-8191/97/$17.00 Copyright © 1997 Elsevier Science B.V. All rights reserved.
PII S0167-8191(96)00102-0

modeled as a graph in which vertices correspond to processes that handle distributed pieces of data and edges reflect data dependencies. The mapping problem can then be addressed by assigning processor labels to the vertices of the graph, so that all processes assigned to some processor are loaded and run on it. In an SPMD context, this is equivalent to the distribution of data structures across processors; in this case, all pieces of data assigned to some processor are handled by a single process located on this processor. A mapping is called static if it is computed prior to the execution of the program and is never modified at run-time.

Static mapping is NP-complete in the general case [4]. Therefore, many studies have been carried out in order to find sub-optimal solutions in reasonable time. Specific algorithms have been proposed for mesh [13] and hypercube [2,5] topologies. When the target machine is assumed to have a communication network in the shape of a complete graph, the static mapping problem turns into the partitioning problem, which has also been intensively studied [1,8,10,18].

SCOTCH is a project carried out at the Laboratoire Bordelais de Recherche en Informatique (LaBRI) of the Université Bordeaux I, by the ALiENor (algorithmics and environments for parallel computing) team. Its goal is to study static mapping by means of graph theory, using a 'divide and conquer' approach. It has resulted in the development of the dual recursive bipartitioning (or DRB) mapping algorithm and in the analysis of several graph bipartitioning heuristics, all of which have been embodied in the SCOTCH software package for static mapping. This package allows the user to map efficiently any weighted source graph onto any weighted target graph, or even onto disconnected subgraphs of a given target graph, in a time linear in the number of source edges and logarithmic in the number of target vertices.
This paper summarizes the results that have been obtained to date within the SCOTCH project. The rest of the paper is organized as follows. Section 2 presents some definitions and Section 3 outlines the most important aspects of the dual recursive bipartitioning algorithm. Section 4 defines some of the bipartitioning algorithms that we use and Section 5 describes SCOTCH 3.1 itself. Section 6 compares its performance to other partitioning and mapping software packages. Then follows the conclusion.

2. Static mapping and cost functions

The parallel program to be mapped onto the target architecture is modeled by a weighted unoriented graph S called source graph or process graph. Vertices v_S and edges e_S of S are assigned integer weights w(v_S) and w(e_S), which estimate the computation weight of the corresponding process and the amount of communication to be transmitted on the inter-process channel, respectively. The target machine onto which the parallel program is mapped is also modeled by a valuated unoriented graph T called target graph or architecture graph. Vertices v_T and edges e_T of T are assigned integer weights w(v_T) and w(e_T), which estimate the computational power of the corresponding processor and the cost of traversal of the inter-processor link, respectively. A mapping of a source graph S onto a target graph T consists of two applications τ_{S,T}: V(S) → V(T) and ρ_{S,T}: E(S) → P(E(T)), where P(E(T)) denotes the set of all the simple loopless

paths which can be built from E(T). τ_{S,T}(v_S) = v_T if process v_S of S is mapped onto processor v_T of T, and ρ_{S,T}(e_S) = {e¹_T, e²_T, ..., e^n_T} if communication channel e_S of S is routed through communication links e¹_T, e²_T, ..., e^n_T of T. |ρ_{S,T}(e_S)| denotes the dilation of edge e_S, that is, the number of edges of E(T) used to route e_S.

The computation of efficient static mappings requires an a priori knowledge of the dynamic behavior of the target machine with respect to the programs which are run on it. This knowledge is synthesized in a cost function, the nature of which determines the characteristics of the desired optimal mappings. To avoid aggregate functions which combine terms of different nature by means of weighted sums, the coefficients of which are hard to tune and depend on target machine technologies, we have chosen, as several authors did before [3,12,18], to separate computation criteria from communication ones. The goal of our mapping algorithm is thus to minimize some communication cost function, while keeping the load balance within a user-specified tolerance. The communication cost function f_C that we have chosen is the sum, for all edges, of their dilation multiplied by their weight:

f_C(τ_{S,T}, ρ_{S,T}) = Σ_{e_S ∈ E(S)} w(e_S) · |ρ_{S,T}(e_S)|.

This function is easy to compute and can be updated incrementally, and its minimization favors the mapping of intensively intercommunicating processes onto nearby processors. The strong positive correlation between its values and effective execution times has been experimentally verified by several authors [2,5,7].

The quality of mappings is evaluated with respect to the criteria that we have chosen: the balance of the computation load across processors, and the minimization of the inter-processor communication cost modeled by function f_C. However, since the maximum load imbalance ratio is provided by the user as an input of the mapping, the effective measure of load imbalance is of little interest; what matters is the minimization of the communication cost function under this load balance constraint. For communication, the salient parameter to consider is f_C.
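As a minimal illustration of this cost function, the following sketch computes f_C for a toy mapping; the graph representation and all names are illustrative stand-ins, not SCOTCH's actual data structures or API.

```python
# Sketch of the communication cost function f_C: the sum, over all source
# edges, of the edge weight times its dilation (the number of target links
# on the routing path). Illustrative code only, not SCOTCH's implementation.

def dilation(mapping, target_dist, u, v):
    """Dilation of source edge (u, v): the distance, in the target graph,
    between the processors onto which its two endpoints are mapped."""
    return target_dist[mapping[u]][mapping[v]]

def comm_cost(source_edges, edge_weight, mapping, target_dist):
    """f_C = sum over source edges e of w(e) * |rho(e)|."""
    return sum(edge_weight[e] * dilation(mapping, target_dist, *e)
               for e in source_edges)

# Example: a 4-vertex path graph mapped onto a 2-processor target
# (distance 0 within a processor, 1 between the two processors).
edges = [(0, 1), (1, 2), (2, 3)]
weights = {e: 1 for e in edges}
dist = [[0, 1], [1, 0]]
mapping = {0: 0, 1: 0, 2: 1, 3: 1}   # only edge (1, 2) is cut
print(comm_cost(edges, weights, mapping, dist))  # -> 1
```

Because each vertex move only changes the dilations of its incident edges, such a sum can indeed be updated incrementally, as the text notes.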

3. The dual recursive bipartitioning algorithm

3.1. Outline of the algorithm

Our mapping algorithm [14] is based on a divide and conquer approach. It starts by considering a set of processors, also called a domain, containing all the processors of the target machine, with which is associated the set of all the processes to map. At each step, the algorithm bipartitions a yet unprocessed domain into two disjoint subdomains, and calls a graph bipartitioning algorithm to split the subset of processes associated with the domain across the two subdomains. Whenever a domain is restricted to a single processor, its associated processes are assigned to it and recursion stops. The association of a subdomain with every process defines a partial mapping of the process graph. The complete mapping is achieved when successive bipartitionings have reduced all subdomain sizes to one.
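The recursion just described can be sketched as follows. This is a deliberately minimal, assumption-laden skeleton: the two bipartitioning functions are trivial stand-ins that split sets in half, not SCOTCH's actual domain and process bipartitioning methods.

```python
# Sketch of the dual recursive bipartitioning (DRB) scheme: at each step a
# domain (set of processors) and its associated process subset are both
# split in two; recursion stops when a domain holds a single processor.
# The split functions below are illustrative stand-ins.

def drb_map(processes, domain, split_domain, split_processes, result=None):
    if result is None:
        result = {}
    if len(domain) == 1:                 # domain reduced to one processor:
        for p in processes:              # assign its processes and stop
            result[p] = domain[0]
        return result
    d0, d1 = split_domain(domain)                 # domain bipartitioning
    p0, p1 = split_processes(processes, d0, d1)   # process bipartitioning
    drb_map(p0, d0, split_domain, split_processes, result)
    drb_map(p1, d1, split_domain, split_processes, result)
    return result

# Stand-in bipartitioners: split both sets in half (load balance only,
# with no communication term).
halve = lambda s: (s[:len(s) // 2], s[len(s) // 2:])
mapping = drb_map(list(range(8)), list(range(4)),
                  halve, lambda ps, d0, d1: halve(ps))
print(mapping)  # -> {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
```

A real process bipartitioning function would of course use the two subdomains and the distance function to minimize the communication cost, as the following sections describe.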


The above algorithm relies on the ability to define five main objects:
- A domain structure, which represents a set of processors in the target architecture.
- A domain bipartitioning function, which, given a domain, bipartitions it into two disjoint subdomains.
- A domain distance function, which gives, in the target graph, a measure of the distance between two disjoint domains. Since domains may be neither convex nor connected, this distance may be estimated. However, it must respect some locality properties, such as giving more accurate results as domain sizes diminish. The domain distance function is used in the process bipartitioning algorithms to compute the communication function to minimize, since it allows us to estimate the dilation of the edges that link vertices which belong to different domains. Using such a distance function amounts to considering that all routings use shortest paths on the target architecture. This is not unreasonable to assume, as most existing parallel machines handle routing dynamically with shortest-path routings. We have thus chosen that our program would not provide routings for the communication channels, leaving their handling to the communication system of the target machine.
- A subgraph structure, which represents the subgraph induced by a subset of the vertex set of the original source graph.
- A process bipartitioning function, which, given a domain, its two subdomains and a set of processes, bipartitions the latter into two disjoint subsets of processes to be mapped onto each subdomain.
All of these routines are seen as black boxes by the mapping program, which can thus accept any kind of target architecture and process bipartitioning function.

3.2. Partial cost function

The production of efficient complete mappings requires that all graph bipartitionings favor the criteria that we have chosen.
Therefore, the bipartitioning of a subgraph S' of S should maintain load balance within the user-specified tolerance and minimize the partial communication cost function f'_C, defined as

f'_C(τ_{S,T}, ρ_{S,T}) = Σ_{u ∈ V(S')} Σ_{(u,u') ∈ E(S)} w((u,u')) · |ρ_{S,T}((u,u'))|,

which accounts for the dilation of the edges internal to subgraph S' as well as for that of the edges which belong to the cocycle of S', as shown in Fig. 1. Taking into account the partial mapping results issued by previous bipartitionings makes it possible to avoid local choices that might prove globally bad, as explained below.

3.3. Execution scheme

From an algorithmic point of view, our mapper behaves as a greedy algorithm (the mapping of a process to a subdomain is definitive), at each step of which iterative algorithms can be applied. The double recursive call performed at each step induces a recursion scheme in the shape of a binary tree, each vertex of which corresponds to a bipartitioning job, that is, the bipartitioning of both a domain and its associated subgraph.

Fig. 1. Edges accounted for in the partial communication cost function when bipartitioning the subgraph associated with domain D between the two subdomains D0 and D1: (a) initial position; (b) after one vertex is moved. Dotted edges are of dilation zero, their two ends being mapped onto the same subdomain. Thin edges are cocycle edges.
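The cocycle accounting pictured in Fig. 1 can be sketched as a small function. All names here are hypothetical: `domain_of` records the (sub)domain already assigned to each vertex by previous bipartitioning jobs, and `domain_dist` is the domain distance function of Section 3.1.

```python
# Sketch of the partial communication cost f'_C for a subgraph S': it sums
# the weighted (sub)domain distances of edges internal to S' and of cocycle
# edges whose other end was assigned by an earlier bipartitioning job.
# Illustrative code only, not SCOTCH's implementation.

def partial_comm_cost(sub_vertices, all_edges, edge_weight,
                      domain_of, domain_dist):
    """Sum, over u in V(S') and edges (u, u') of E(S), of the edge weight
    times the distance between the domains of u and u'. Internal edges are
    seen from both ends, so each contribution is halved."""
    cost = 0.0
    sub = set(sub_vertices)
    for (u, v) in all_edges:
        for a, b in ((u, v), (v, u)):
            if a in sub and domain_of.get(b) is not None:
                d = domain_dist(domain_of[a], domain_of[b])
                cost += edge_weight[(u, v)] * d * (0.5 if b in sub else 1.0)
    return cost

# Toy example: S' = {0, 1}; edge (0, 1) is internal and cut, edge (1, 2)
# is a cocycle edge whose far end already sits in the same domain as 1.
sub = [0, 1]
edges = [(0, 1), (1, 2)]
w = {e: 1 for e in edges}
assigned = {0: "A", 1: "B", 2: "B"}
dist = lambda a, b: 0 if a == b else 1
print(partial_comm_cost(sub, edges, w, assigned, dist))  # -> 1.0
```

Only the internal cut edge contributes here; the cocycle edge has distance zero, which is exactly the feedback that keeps previously grouped vertices together.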

In the case of depth-first sequencing, bipartitioning jobs run in the left branches of the tree have no information on the distance between the vertices they handle and the neighbor vertices to be processed in the right branches. By contrast, sequencing the jobs according to a by-level (breadth-first) traversal of the tree ensures that, at any level, every bipartitioning job has information on the subdomains to which all the processes have been allocated during the previous level. Thus, when deciding in which subdomain to put a given process, a bipartitioning job can account for the communication costs induced by the neighbor processes, whether they are handled by the job itself or not, since it can estimate the dilation of the corresponding edges. This results in an interesting feedback effect: once an edge has been kept in a cut between two subdomains, the distance between its end vertices will be accounted for in the partial communication cost function to be minimized, and following jobs will thus be more likely to keep these vertices close to each other, as illustrated in Fig. 2. Moreover, since all domains are split at each level, they all have equivalent sizes, which respects the locality properties of the distance function and gives the algorithm more coherence.

Experimental comparisons of the depth-first and breadth-first sequencing schemes show that the efficiency of the schemes strongly depends on the structure of the source and target graphs [16]. When source graphs are strongly connected and/or have heavily weighted edges, depth-first sequencing does better, because the contribution of the heaviest edges dominates the cost function, and knowing the dilations of these edges more accurately thus compensates for the risk of computing worse partial mappings in the left branches of the bipartitioning tree.
On the other hand, when source graphs are loosely connected, exhibit great locality and are of small dimensionality, breadth-first sequencing is much more efficient in preserving this locality in the resulting mapping, so that the source graph can be ‘unfolded’ as efficiently as possible on the target architecture. One can note that, by using the hypercube as target topology and depth-first execution, our mapping program is identical in nature to the one of Ref. [2]. In that

sense, our work, by formalizing the concepts of domain, distance, and execution scheme, can be seen as a generalization of their work that handles many target topologies and graph bipartitioning methods.

Fig. 2. Influence of depth-first and breadth-first sequencings on the bipartitioning of a domain D belonging to the leftmost branch of the bipartitioning tree: (a) depth-first sequencing; (b) breadth-first sequencing. With breadth-first sequencing, the partial mapping data regarding vertices belonging to the right branches of the bipartitioning tree are more accurate (CL stands for 'cut level').
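The two sequencing schemes differ only in the order in which bipartitioning jobs are taken from the work pool, which a short sketch makes concrete; the job and split representations are illustrative, not SCOTCH's.

```python
# Sketch of job sequencing in the bipartitioning tree: a FIFO queue yields
# breadth-first (by-level) sequencing, so every domain of one level is split
# before any domain of the next; taking jobs from the same container as a
# LIFO stack yields depth-first sequencing. Illustrative code only.

from collections import deque

def run_jobs(root_job, bipartition, breadth_first=True):
    """bipartition(job) returns a pair of child jobs, or None at a leaf."""
    jobs = deque([root_job])
    order = []
    while jobs:
        job = jobs.popleft() if breadth_first else jobs.pop()
        order.append(job)
        children = bipartition(job)
        if children:
            jobs.extend(children)
    return order

# Toy two-level tree whose jobs are labelled by bit strings.
split = lambda j: (j + "0", j + "1") if len(j) < 2 else None
print(run_jobs("", split))                       # level by level
print(run_jobs("", split, breadth_first=False))  # branch by branch
```

With the FIFO order, every job at a given level can consult the subdomains assigned at the previous level, which is precisely what enables the feedback effect described above.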

4. Domain and graph bipartitioning methods

4.1. Domain bipartitioning methods

Since, in our approach, the recursive bipartitionings of target graphs are fully independent of those of source graphs (the opposite, however, is false), the recursive decomposition of a given target architecture needs only to be computed once, and the resulting data stored in look-up decomposition tables which are used in the mapping process. Decomposition tables can easily be computed with our mapper, by mapping the considered target graph onto the complete graph with the same number of vertices. Mapping onto the complete graph zeroes the contribution of cocycle edges, so that only cut minimization is considered. In the resulting decomposition, strongly-connected clusters of processors are kept uncut as long as possible, and strongly-connected clusters of processes therefore tend to be mapped onto them. In the case of heterogeneous architectures, the minimization of the communication function favors the cut of the edges of smallest weight, that is, of biggest bandwidth. From the communication point of view, we obtain a hierarchical decomposition in which links of highest bandwidth act as backbones between subdomains containing links of smaller bandwidth.

We have evidenced that the decomposition of the target architecture has a great impact on the quality of the mappings, mostly because of the influence that it has on the behavior of the distance function [16]. As a matter of fact, the principle of the DRB


algorithm is to make less informative choices first and to refine partial mappings as domain sizes diminish. In order to produce mappings of quality, the distance function must give more accurate results as the sizes of the end domains diminish, and its variations must decrease accordingly; otherwise, decisions resulting in great variations of the communication cost function could be made after ones of smaller impact, when the number of degrees of freedom has become too low to optimize the cost function as well as it could have been otherwise. This explains why the decompositions of grid architectures computed following a nested dissection approach are more efficient than the ones built by performing all recursive bipartitionings along one dimension before considering the other dimensions.

In practice, the decompositions of classical topologies (such as meshes, hypercubes, complete graphs, multi-stage networks, etc.) are computed algorithmically at run-time by means of specific built-in functions. This algorithmic handling, which may seem redundant with respect to the general-purpose decomposition table mechanism, allows us to handle huge regular target architectures without storing tables whose sizes evolve as the square of the number of processors. Moreover, algorithmically computed decompositions give in most cases mappings of better quality than the ones that use decompositions computed by mapping. This is not really surprising, since the definition of decomposition algorithms requires some knowledge of the topological properties of the considered target architectures, which can be exploited to provide more regular and efficient decompositions that preserve the locality properties of the distance function. However, for non-standard target architectures (such as, for instance, a non-rectangular subdomain of a bidimensional grid architecture), the built-in functions cannot be used and a proper decomposition table must be computed.

4.2. Graph bipartitioning methods

The core of our recursive mapping algorithm uses process graph bipartitioning methods as black boxes. This allows the mapper to run any type of graph bipartitioning method compatible with our criteria for quality. Bipartitioning jobs maintain an internal image of the current bipartition, indicating for every vertex of the job whether it is currently assigned to the first or second subdomain. It is therefore possible to apply several different methods in sequence, each one starting from the result of the previous one, and to select the methods with respect to the job characteristics, leading to the definition of mapping strategies. Several graph bipartitioning methods have been implemented to date: random and greedy algorithms to compute initial bipartitions and refine them, a backtracking method, and an improved version of the Fiduccia-Mattheyses heuristic [3].

The Fiduccia-Mattheyses (FM) graph bipartitioning heuristic tries to reduce the cut of a partition by moving vertices between subsets. To achieve fast convergence, vertices whose moving would bring the same gain to the cut are linked into lists which are indexed in a gain array. The almost-linearity in time of this algorithm is based on the assumption that the range of gain values is small, so that the search in the gain array for the vertices of best gain takes an almost-constant time. To handle the huge gains generated by the possibly heavy weights and large dilations of the graph edges, we have implemented a


logarithmic indexing of gain lists, which keeps the gain array at a reasonable size while guaranteeing an almost-constant access time. Results obtained with the linear and logarithmic data structures are equivalent in quality, which shows that the approximation induced by logarithmic indexing is of the same order of magnitude as the one inherent to the FM algorithm [16].

Our mapper also implements a multi-level bipartitioning method, which should be considered as a strategy rather than a method, since it uses other methods as parameters. This method, which derives from the multi-grid algorithms used in numerical physics and has already been studied by several authors in the context of graph partitioning [1,7,10], repeatedly reduces the size of the graph to bipartition by finding matchings that collapse vertices and edges, computes a partition for the coarsest graph obtained, and projects the result back to the original graph. Experiments carried out to date show that the multi-level method, used in conjunction with our FM method to compute the initial partitions and refine the projected partitions at every level, reduces the cost of mappings by 15% on average with respect to the plain Fiduccia-Mattheyses method, with gains of up to 70% in some cases. By coarsening the graph used by the FM method to compute and project back the initial partition, the multi-level algorithm broadens the scope of the FM algorithm and makes it possible for it to account for topological structures of the graph that would otherwise be at too high a level for it to encompass in its local optimization process.
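The idea behind logarithmic gain indexing can be sketched as follows. This is a simplified illustration of the general technique, not SCOTCH's actual data structure: gains are bucketed by sign and magnitude so the number of buckets grows with the logarithm of the gain range rather than with the range itself.

```python
# Sketch of logarithmic gain bucketing for a Fiduccia-Mattheyses-style
# refinement: instead of one bucket per gain value (an array whose size is
# proportional to the gain range), gains are mapped to O(log(max_gain))
# buckets by sign and most-significant bit, keeping best-gain lookups near
# constant time even with heavy edge weights and large dilations.
# Illustrative code only, not SCOTCH's implementation.

def bucket_index(gain):
    """Map a signed gain to a small bucket index: its sign times the
    position of its most significant bit (0 for a zero gain)."""
    if gain == 0:
        return 0
    sign = 1 if gain > 0 else -1
    return sign * abs(gain).bit_length()

buckets = {}
for vertex, gain in [("a", 3), ("b", 100000), ("c", -7), ("d", 2)]:
    buckets.setdefault(bucket_index(gain), []).append(vertex)

# Vertices of (approximately) best gain are found by scanning the few
# highest buckets rather than a huge dense gain array.
best_bucket = max(buckets)
print(best_bucket, buckets[best_bucket])  # -> 17 ['b']
```

The bucketing is approximate, since gains sharing a magnitude class share a bucket, which matches the paper's observation that the approximation is of the same order as the one inherent to FM itself.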

5. The SCOTCH software package

SCOTCH [17] is a software package for static mapping, which embodies the algorithms developed within the SCOTCH project. Apart from the mapper itself, the SCOTCH 3.1 package contains programs to build and test source graphs, compute target graph decompositions, and visualize mapping results. Advanced command-line interface and vertex labeling capabilities make them easy to interface with other programs (see Ref. [15] for details).

The mapper can map any weighted source graph onto any weighted target graph, or even onto disconnected subgraphs of a given target graph, which is very useful in the context of multi-user parallel machines. On these machines, when users request processors in order to run their jobs, the partitions allocated by the operating system may be neither regular nor connected, because of existing partitions already attributed to other users. With SCOTCH, it is possible to build a target decomposition corresponding to such a partition, and therefore to map processes onto it, automatically and regardless of the partition shape.

The SCOTCH 3.1 academic distribution, which implements multi-level bipartitioning as well as advanced strategy handling, may be obtained from the WWW at http://www.labri.u-bordeaux.fr/~pelegrin/scotch/, or by anonymous ftp at ftp.u-bordeaux.fr in directory /pub/Local/Info/Software/Scotch. The distribution file, named scotch_3.1.tar.gz, contains the executables for

several machines and operating systems, along with documentation and sample files. A collection of test graphs in our format, gathered from other packages and from individuals, is also available from the www page; it contains all of the test graphs that are referred to in this paper.


6. Performance evaluations

The mappings computed by SCOTCH 3.1 exhibit a great locality of communications with respect to the topology of the target architecture. For instance, when mapping the toroidal graph PWT onto a hypercube with 16 vertices, SCOTCH finds a Hamiltonian cycle in the hypercube, such that about 98% of the edges are kept local, 2% are mapped at distance 1, and less than 0.2% are of dilation 2, as illustrated in Fig. 3. Edges of dilation 2 are only located on the small sphere detailed in Fig. 3b, which is mapped onto a sub-hypercube of dimension 2, i.e. a square. In this area, SCOTCH minimizes the communication cost function by assigning buffer zones to intermediate processors, so that edges of dilation 2 are replaced by at most twice their number of edges of dilation 1.

When mapping onto the complete graph, our program behaves as a standard graph partitioner. Table 1 summarizes the edge cuts that we have obtained for classical test graphs, compared to the ones computed by the recursive graph bisection algorithms of CHACO 1.0 [7] and METIS 2.0 [11]. Over all the graphs that have been tested, SCOTCH produces the best partitions of the three in two thirds of the runs. It can therefore be used as a state-of-the-art graph partitioner. However, not accounting for the target topology generally leads to worse performance results of the mapped applications [5,8], due to long-distance communication, which makes static mapping more attractive than strict partitioning for most communication-intensive applications.

Recently, CHACO gained static mapping capabilities by the addition of a feature called terminal propagation [9], which is similar to the accounting for cocycle edges that we do in f'_C for our DRB algorithm. Tables 2 and 3 summarize some results that have been obtained by CHACO 2.0 with terminal propagation and by SCOTCH 3.1 when mapping graphs 4ELT and BCSSTK32 onto hypercubes and meshes of various sizes.
Here again, SCOTCH outperforms CHACO in most cases. A complexity analysis of the DRB algorithm shows that, provided that the running time of all graph bipartitioning algorithms is linear in the number of edges of the graphs,

Fig. 3. Result of the mapping of graph PWT onto a hypercube with 16 vertices: (a) global view; (b) detail. Vertices with the same grey level are mapped onto the same processor.


Table 1
Edge cuts produced by CHACO 1.0, P-METIS 2.0 and SCOTCH 3.1 for partitions with 64, 128 and 256 blocks (CHACO and METIS data extracted from [10])

Graph      CHACO 1.0                  P-METIS 2.0                SCOTCH 3.1
           64      128     256        64      128     256        64      128     256
4ELT       2928    4514    6869       2965    4600    6929       2906    4553    6809
BCSSTK30   241202  318075  423627     190115  271503  384474     188240  270165  382888
BCSSTK31   65764   98131   141860     65249   97819   140818     66780   98148   140703
BCSSTK32   106449  153956  223181     106440  152081  222789     104651  152082  220476
BRACK2     34172   46835   66944      29983   42625   60608      29187   42169   59454
PWT        9166    12737   18268      9130    12632   18108      9225    13052   18459
ROTOR      53804   75140   104038     53228   75010   103895     52864   73461   102697

Table 2
Edge cuts and communication costs produced by CHACO 2.0 with terminal propagation and by SCOTCH 3.1 for mappings of graph 4ELT onto hypercube (H) and bidimensional grid (M2) target architectures (CHACO data extracted from [6])

Target      CHACO 2.0-TP          SCOTCH 3.1
            cut      fc           cut      fc
H(1)        168      168          166      166
H(2)        412      484          396      447
H(3)        769      863          708      841
H(4)        1220     1447         1178     1405
H(5)        1984     2341         2050     2332
H(6)        3244     3811         3194     3712
H(7)        5228     6065         5051     5887
M2(5,5)     1779     2109         1629     2039
M2(10,10)   4565     6167         4561     6001

Table 3
Edge cuts and communication costs produced by CHACO 2.0 with terminal propagation and by SCOTCH 3.1 for mappings of graph BCSSTK32 onto hypercube (H) and bidimensional grid (M2) target architectures (CHACO data extracted from [6])

Target      CHACO 2.0-TP          SCOTCH 3.1
            cut      fc           cut      fc
H(1)        5562     5562         4797     4797
H(2)        15034    15110        10100    11847
H(3)        26843    27871        24354    28813
H(4)        49988    53067        43078    49858
H(5)        79061    89359        71934    87093
H(6)        119011   143653       112580   141516
H(7)        174505   218318       164532   211974
M2(5,5)     64156    76472        69202    87737
M2(10,10)   150846   211672       150715   223968


the running time of the mapper is linear in the number of edges of the source graph and logarithmic in the number of vertices of the target graph. Due to the graph partitioning algorithms that we use, this is verified in practice [16]. For instance, on a 190 MHz R10000-based SGI Onyx machine with 128 Mb of main memory, the BCSSTK30 graph, with 28924 vertices and 1007284 edges, is mapped in 9, 17 and 40 CPU seconds onto hypercubes with 4, 16 and 256 vertices, respectively.

7. Conclusion

In this paper, we have presented the work that has been carried out to date within the SCOTCH project. We have described the dual recursive bipartitioning mapping algorithm that we have developed, and outlined the principles and capabilities of SCOTCH, a software package for static mapping which implements it and is able to map any weighted source graph onto any weighted target graph. Due to the graph bipartitioning algorithms used, its running time is linear in the number of source edges and logarithmic in the number of target vertices.

SCOTCH is currently being evaluated to decompose unstructured meshes into domains for parallel aerodynamics codes that run on Cray's T3D. We expect this study to help us determine the characteristics of efficient mappings with respect to the type of numerical method used, in order to develop suitable bipartitioning strategies. A nested-dissection ordering code for direct block solvers is also being developed, based on the graph partitioning library that makes up the core of our mapping program.

References

[1] S.T. Barnard and H.D. Simon, A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems, Concurrency: Pract. Exper. 6 (1994) 101-117.
[2] F. Ercal, J. Ramanujam and P. Sadayappan, Task allocation onto a hypercube by recursive mincut bipartitioning, J. Parallel Distrib. Comput. 10 (1990) 35-44.
[3] C.M. Fiduccia and R.M. Mattheyses, A linear-time heuristic for improving network partitions, in: Proc. 19th Design Autom. Conf. (IEEE, 1982) 175-181.
[4] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-completeness (W.H. Freeman, San Francisco, 1979).
[5] S.W. Hammond, Mapping unstructured grid computations to massively parallel computers, Ph.D. Thesis, Rensselaer Polytechnic Institute, Feb. 1992.
[6] B. Hendrickson, personal communication, Jul. 1996.
[7] B. Hendrickson and R. Leland, The CHACO user's guide: Version 2.0, Technical Report SAND94-2692, Sandia National Laboratories, 1994.
[8] B. Hendrickson and R. Leland, An empirical study of static load balancing algorithms, in: Proc. SHPCC'94, Knoxville (IEEE, May 1994) 682-685.
[9] B. Hendrickson, R. Leland and R. Van Driessche, Enhancing data locality by using terminal propagation, in: Proc. 29th Hawaii Int. Conf. on System Sciences (IEEE, Jan. 1996).
[10] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, TR 95-035, University of Minnesota, June 1995.
[11] G. Karypis and V. Kumar, METIS: Unstructured graph partitioning and sparse matrix ordering system, Version 2.0, University of Minnesota, June 1995.
[12] B.W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, Bell Syst. Tech. J. (Feb. 1970) 291-307.
[13] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distrib. Comput. 23 (1994) 119-134.
[14] F. Pellegrini, Static mapping by dual recursive bipartitioning of process and architecture graphs, in: Proc. SHPCC'94, Knoxville (IEEE, May 1994) 486-493.
[15] F. Pellegrini, SCOTCH 3.1 user's guide, Technical Report, LaBRI, Université Bordeaux I, Aug. 1996, available at URL http://www.labri.u-bordeaux.fr/~pelegrin/papers/scotch_user3.1.ps.gz.
[16] F. Pellegrini and J. Roman, Experimental analysis of the Dual Recursive Bipartitioning algorithm for static mapping, Research Report 1138-96, LaBRI, Université Bordeaux I, Sept. 1996, available at URL http://www.labri.u-bordeaux.fr/~pelegrin/papers/scotch_expanalysis.ps.gz.
[17] F. Pellegrini and J. Roman, SCOTCH: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs, in: Proc. HPCN'96, Brussels, LNCS 1067 (April 1996) 493-498.
[18] A. Pothen, H.D. Simon and K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. 11(3) (1990) 430-452.

http://w.labri.u-bordeaux.fr/-pelegrin/papers/ Experimental

analysis

of the Dual Recursive

Bipartitioning

algorithm

for

static mapping, Research Report 1138-96, LaBRI, Universid Bordeaux 1, Sept. 1996, available at URL http://ww.labri.u-bordeaux.fr/-pelegrin/papers/ scotch_expanalysis.ps.gz. [ 171 F. Pellegrini and J. Roman. SCOTCH: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs, in: Proc. HPCN’96. Brussels, LNCS 1067, April (1996) 493-498. [ 18) A. Pothen, H.D. Simon and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Murrix And. I l(3) (1990) 430-452.