Partitioning graphs on message-passing machines by pairwise mincut


Information Sciences 111 (1998) 223-237

P. Sadayappan a, F. Ercal b, J. Ramanujam c,*,1

a Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210, USA
b Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65401, USA
c Department of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803-5901, USA

Received 2 July 1997; received in revised form 23 November 1997; accepted 3 January 1998
Communicated by Subhash C. Kak

Abstract

Realizing the potential of massively parallel machines requires good solutions to the problem of mapping computations among processors so that execution is load-balanced with low inter-processor communication, resulting in low execution time. This problem is typically treated as a graph partitioning problem. We develop a parallel heuristic algorithm for partitioning the vertices of a graph into many clusters so that the number of inter-cluster edges is minimized. The algorithm is designed for message-passing machines such as hypercubes. This algorithm is suitable for use with runtime approaches that have been recently developed for parallelizing unstructured scientific computations. We present a parallelization of the Kernighan-Lin heuristic that starts with an initial random multiway partition and performs pairwise improvements through application of the mincut bisection heuristic, known as Partitioning by Pairwise Mincut (PPM). A novel parallel scheme providing nearly linear speedup is developed for PPM that is optimal in terms of communication. © 1998 Published by Elsevier Science Inc. All rights reserved.

Keywords: Mapping; Graph partitioning; Parallel partitioning by pairwise mincut; Linear speedup; Hypercube

* Corresponding author. E-mail: [email protected]
1 Supported in part by an NSF Young Investigator Award CCR-9457768, by an NSF grant CCR-9210422 and by the Louisiana Board of Regents through contract LEQSF(1991-94)-RD-A09.
0020-0255/98/$19.00 © 1998 Published by Elsevier Science Inc. All rights reserved. PII: S0020-0255(98)10005-1


1. Introduction

Massively parallel machines with tens of thousands of processors promise very high peak performance. The effective use of such machines requires prudent assignment of the tasks in an application program onto processors: the computational load must be balanced among processors in such a way that inter-processor communication is minimized, resulting in minimal execution time. The problem of assigning tasks among processors in a load-balanced fashion that minimizes inter-processor communication is referred to as the mapping problem [2,4,11]. The mapping problem is usually treated as a graph partitioning problem: the vertices of the graph represent the tasks (computation) and the edges represent the interaction (communication) among tasks. Viewed this way, the mapping problem is one of assigning an equal number of vertices to each processor such that the total number of edges crossing between processors is minimized. Since the problem is NP-complete, it is solved in practice through heuristic procedures. A very effective heuristic for graph bisection has been the Kernighan-Lin mincut procedure [7], which we will refer to as the KL-heuristic. This paper deals with the parallelization of the KL-heuristic.

While several sequential heuristics exist, very few parallel algorithms for graph partitioning have been developed. Parallel approaches to this problem have gained considerable importance recently. For many important applications that exhibit substantial data-level parallelism, such as unstructured grid computations and sparse system solvers, the computational structure of the problem is not known until runtime. Saltz and coworkers have developed efficient runtime support software that crucially depends on good partitioners [12]. Thus, there is a need to partition the computation quickly, in parallel, at runtime [12]. Pothen et al. [9] have shown that the KL-heuristic in combination with spectral partitioning yields the best results in practice. Thus, parallelization of the KL-heuristic is important in practice.

We present a parallel KL-heuristic known as Parallel Partitioning by Pairwise Mincut (PPPM). PPPM starts with a random initial multiway partition and performs a number of pairwise interchanges between partitions to reduce the number of inter-partition edges. Kernighan and Lin [7] observed that PPM generally provides superior partitions and is hence preferable to a recursive bisection strategy. PPM is not trivially parallelized, and we propose a parallel algorithm, optimal in the sense of communication, for a local-memory multiprocessor with a hypercube interconnection.

In Section 2, we outline the Kernighan-Lin heuristic for graph bisection and then use it to derive the sequential algorithm for graph partitioning by pairwise mincut (PPM); a discussion of related work follows. Section 3 presents the parallelization of PPM for a hypercube, its proof of correctness and an analysis of its complexity. Section 4 concludes with a summary and discussion.


2. Mapping by sequential multiway graph partitioning

Kernighan and Lin [7] proposed an effective mincut heuristic for bisection of graphs. Fiduccia and Mattheyses [1] developed a more efficient variant of that heuristic, with linear time complexity. We begin by formalizing the multiway graph partitioning problem addressed in this paper. A heuristic procedure for bisection of a graph is then presented, since the multiway partitioning scheme that we consider uses the bisection heuristic. Following this, the sequential algorithm for multiway partitioning is presented.

Given a graph G(V, E) with |V| = N, the partitioning problem can be formalized as follows. Let the weight of an edge (i, j) ∈ E be e_ij, and let 𝒞 = {1, 2, ..., C} be the set of indices corresponding to the C partitions. The partitioning problem is that of finding a mapping M : V → 𝒞 from vertices (V) to partitions (𝒞) such that

$$\sum_{(i,j)\in E,\; M(i)\neq M(j)} e_{ij} \tag{1}$$

is minimized under the constraint

$$|W_1| = |W_2| = \cdots = |W_C| = \frac{|V|}{C}, \tag{2}$$

where |W_i| is the number of vertices assigned to partition i. For convenience of exposition, we assume, without loss of generality, that |V| is a multiple of C. Relation (1) represents the total amount of communication incurred and relation (2) represents the load-balance constraint. In terms of graph partitioning, relation (1) represents the total number of edges cut; an edge in the graph is said to be cut if its endpoints belong to different partitions. Relation (2) represents the equipartitioning constraint.
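As a concrete illustration of relations (1) and (2), the following minimal Python sketch (the edge-list representation and function names are illustrative, not part of the formulation) computes the total weight of cut edges under a mapping and checks the equipartitioning constraint.

```python
# Illustrative sketch (names and edge-list representation are not from the
# paper): total weight of cut edges under a mapping M (relation (1)) and the
# equipartitioning check (relation (2)).
from collections import Counter

def cut_weight(edges, mapping):
    """edges: iterable of (i, j, e_ij); mapping: dict vertex -> partition index."""
    return sum(w for i, j, w in edges if mapping[i] != mapping[j])

def is_balanced(mapping, C):
    """True if every one of the C partitions holds exactly |V| / C vertices."""
    counts = Counter(mapping.values())
    target = len(mapping) // C
    return all(counts.get(c, 0) == target for c in range(C))

# Example: a 4-cycle split into C = 2 partitions; two edges are cut.
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
mapping = {0: 0, 1: 0, 2: 1, 3: 1}
print(cut_weight(edges, mapping), is_balanced(mapping, 2))   # 2 True
```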

2.1. A mincut (KL-) heuristic for graph bisection

We now present an algorithm for bisection of a graph, based on the mincut interchange heuristic proposed by Kernighan and Lin [7] and refined by Fiduccia and Mattheyses [1]. The algorithm is essentially an iterative improvement procedure with a built-in hill-climbing ability and is shown in Fig. 1. The mincut procedure starts with two randomly generated initial partitions and iteratively attempts to perform node transfers between the partitions, attempting to decrease the total weight of inter-partition edges. A sequence of node swaps is tried, maintaining the cumulative gain resulting from each swap. This process is repeated, disallowing marked nodes from moving again, till all nodes have been tried. After one complete pass that has considered every node in the graph exactly once, the cumulative gain will clearly be zero, since the final configuration is exactly equivalent to the initial configuration, with each node having been moved to the opposite partition. The cumulative gain for the sequence of trial-moves is therefore examined to find the largest positive intermediate value, if any. If a positive maximum is found, the trial-moves up to the one that resulted in that cumulative maximum are "actually" performed, and all nodes are reset as unmarked. The above procedure is repeated till a pass produces no improvement. The use of an appropriate data structure to facilitate efficient determination of the maximal-gain node, similar to that proposed in [1], results in a bisection heuristic whose running time is essentially O(N + E), linear in the size of the graph being partitioned.


Algorithm mincut(origC1, origC2)
/* accepts two clusters origC1, origC2 as input and tries to reduce the */
/* cutsize between these clusters by moving vertices between them. The total */
/* gain in cutsize achieved by vertex movements is recorded. The algorithm */
/* returns the final clusters and the total gain. */

- Initially C1 ← origC1, C2 ← origC2
- Mark all nodes unlocked
- Associate a gain value g_v with each node v; initialize g_v to zero
- tot-gain ← 0
repeat
{
  - Compute, for all v ∈ V, g_v = total reduction in the cost of the cut when v is moved from its current cluster to the other cluster
  - Compute W1 and W2, the total number of vertices in C1 and C2 respectively
  - seqno ← 0
  repeat
  {
    - seqno ← seqno + 1
    - Among the unlocked vertices, identify v1* ∈ C1 with maximal gain g1*
    - Assume that v1* is moved to C2; update the gains of all unlocked nodes
    - Among the unlocked vertices, identify v2* ∈ C2 with maximal gain g2*
    - Assume that v2* is moved to C1; update the gains of all unlocked nodes
    - If no such (v1*, v2*) pair exists then exit this loop
    - Lock v1*, v2*; record the status of the movement and gain[seqno] ← g1* + g2*
  }
  - Let G* = max_k Σ_{i=1}^{k} gain[i] = Σ_{i=1}^{k*} gain[i], i.e., k* is the value of k that maximizes the cumulative gain G
  - if (G* > 0) then
      perform all moves from 1 to k*
      tot-gain ← tot-gain + G*
    endif
} until G* ≤ 0
return (C1, C2, tot-gain)

Fig. 1. Graph bi-partitioning algorithm using mincut heuristic.
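The following Python sketch outlines the bisection heuristic of Fig. 1. For brevity it recomputes gains with simple linear scans instead of maintaining them incrementally in the bucket structure of [1], so it is quadratic rather than linear in the graph size; the graph representation and function names are illustrative only.

```python
# Sketch of the bisection heuristic of Fig. 1 (KL [7] / FM [1] style).  For
# brevity, gains are recomputed by linear scans after every tentative swap, so
# this version is quadratic rather than linear; [1] maintains gains
# incrementally in a bucket structure.  Graph representation (dict of
# neighbor -> weight maps) and the side dict (vertex -> 0/1) are illustrative.

def gain(graph, side, v):
    """Reduction in cut weight if v alone were moved to the other side."""
    return sum(w if side[u] != side[v] else -w for u, w in graph[v].items())

def mincut_pass(graph, side):
    """One pass of trial swaps; commits the best positive prefix, returns its gain."""
    locked, trial, cum = set(), [], 0
    while True:
        best = None
        for v1 in graph:                       # best unlocked vertex on side 0
            if v1 in locked or side[v1] != 0:
                continue
            for v2 in graph:                   # paired with best unlocked vertex on side 1
                if v2 in locked or side[v2] != 1:
                    continue
                g = gain(graph, side, v1) + gain(graph, side, v2) \
                    - 2 * graph[v1].get(v2, 0)
                if best is None or g > best[0]:
                    best = (g, v1, v2)
        if best is None:                       # no unlocked (v1, v2) pair left
            break
        g, v1, v2 = best
        side[v1], side[v2] = 1, 0              # tentative swap, then lock both
        locked.update((v1, v2))
        cum += g
        trial.append((cum, v1, v2))
    if not trial:
        return 0
    k_best = max(range(len(trial)), key=lambda k: trial[k][0])
    if trial[k_best][0] > 0:
        for _, v1, v2 in trial[k_best + 1:]:   # undo swaps beyond the best prefix
            side[v1], side[v2] = 0, 1
        return trial[k_best][0]
    for _, v1, v2 in trial:                    # no positive prefix: undo everything
        side[v1], side[v2] = 0, 1
    return 0

def mincut(graph, side):
    """Repeat passes until one yields no improvement; returns the total gain."""
    total = 0
    while True:
        g = mincut_pass(graph, side)
        if g <= 0:
            return total
        total += g

# Example: path 0-1-2-3 with unit weights, starting from the split {0,2} | {1,3}.
graph = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}
side = {0: 0, 1: 1, 2: 0, 3: 1}
print(mincut(graph, side), side)   # 2, with 0,1 on one side and 2,3 on the other
```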


For the kinds of graphs encountered in practice, the number of edges, E, is essentially O(N); hence the complexity of the mincut heuristic can be considered O(N). If the number of partitions required, C, is a power of 2, then a simple recursive procedure can be used to perform a C-way partition of the graph: first create two equal-sized partitions, then divide each of these independently into two subpartitions, and so on till C partitions are created (a short sketch follows). While multiway partitioning by recursive bisection seems quite natural, Kernighan and Lin [7] use the following reasoning to argue that the recursive partitioning approach might not be very effective in general: the mincut procedure at the first level attempts to minimize the number of inter-partition edges, thus maximizing the number of intra-partition edges; the second-level partitions therefore attempt to create mincut subpartitions from subgraphs that are highly internally connected, precisely because the first-level cut maximized the number of intra-partition edges. In other words, a succeeding lower-level partition works against the work done by a preceding higher-level partition. Kernighan and Lin [7] therefore suggested a pairwise interchange heuristic, which we elaborate on next.
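A minimal sketch of the recursive-bisection procedure just described, assuming a hypothetical bisect(vertices) helper that splits a vertex set into two equal halves using the mincut heuristic above:

```python
# Minimal sketch of C-way partitioning by recursive bisection (C a power of 2),
# assuming a hypothetical bisect(vertices) helper that splits a vertex set into
# two equal halves using the mincut heuristic above.

def recursive_partition(vertices, C, bisect):
    if C == 1:
        return [vertices]
    left, right = bisect(vertices)             # mincut bisection of this subgraph
    return (recursive_partition(left, C // 2, bisect) +
            recursive_partition(right, C // 2, bisect))
```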

2.2. Multiway partitioning by pairwise mincut

Partitioning by pairwise mincut (PPM) starts with C equal-sized initial partitions, formed randomly without concern for inter-partition edge costs. A pair of partitions is then considered and the mincut procedure applied, to try and reduce the total edge-weight between the two selected partitions. The mincut bisection procedure is likewise tried between each possible pair of partitions, and this process is continued till no further improvement results between any of the possible pairs of partitions. The sequential algorithm for PPM is shown in Fig. 2.

Let T_e(N) denote the time for mincut bisection of a graph with N nodes, and let T_c(N) denote the time taken to communicate N values between two neighbor processors. With the Fiduccia-Mattheyses improvement to the KL-heuristic, T_e(N) is linear in N and can thus be written as K_e N. For the above algorithm, C(C − 1)/2 instances of the pairwise mincut procedure are invoked, where each invocation works on a subgraph of 2N/C nodes. Assuming that the outer repeat loop is executed κ times (usually a small bounded number, relatively independent of the graph), the time complexity for sequential execution is given by

$$T_{seq}(N, C) = \kappa\,\frac{C(C-1)}{2}\,T_e\!\left(\frac{2N}{C}\right).$$

If T_e(N) = K_e N, this reduces to

$$T_{seq}(N, C) = \kappa\,K_e N\,(C - 1).$$


Algorithm PPM(V, C, S_C)
/* V: vertex set of the graph G = (V, E) to be partitioned into C clusters */
/* S_C: set of clusters obtained */

- Initially obtain C random clusters C1, C2, ..., C_C
repeat
{
  TOTgain ← 0
  for i ← 1 to C do
    for j ← i + 1 to C do
      (C_i, C_j, gain) ← mincut(C_i, C_j)
      TOTgain ← TOTgain + gain
    endfor
  endfor
} until TOTgain = 0

Fig. 2. Sequential algorithm for PPM.
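The driver loop of Fig. 2 can be sketched directly in Python, assuming a pairwise_mincut function with the interface of Fig. 1 (the list-of-clusters representation is illustrative).

```python
# Sketch of the sequential PPM driver of Fig. 2, assuming a pairwise_mincut
# function with the interface of Fig. 1 (returns the two updated clusters and
# the gain achieved).  The list-of-clusters representation is illustrative.

def ppm(clusters, pairwise_mincut):
    C = len(clusters)
    while True:
        tot_gain = 0
        for i in range(C):
            for j in range(i + 1, C):
                clusters[i], clusters[j], gain = pairwise_mincut(clusters[i], clusters[j])
                tot_gain += gain
        if tot_gain == 0:                      # no pair improved: done
            return clusters
```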

The PPM algorithm is applicable even when C is not a power of 2, unlike recursive bisection. However, because it exhaustively tries for improvement between all possible pairs of partitions, its time complexity is greater than that of recursive bisection. This more exhaustive search also means that it has the potential to produce partitions with lower total inter-partition edge-weight.

2.3. Related work

Gilbert and Zmijewski [3] and Moore [8] have considered parallel graph partitioning on a hypercube. Gilbert and Zmijewski [3] apply a recursive divide-and-conquer approach that uses recursive bisection to achieve multiway partitions, whereas our approach parallelizes the algorithm that uses pairwise mincut. Moore's algorithm for PPM [8] requires 2C − 1 phases, compared to our algorithm which requires C − 1 phases; since each cluster has to meet C − 1 other clusters, our algorithm for PPM is optimal in the number of phases. The separator-based approaches [9,14,15] employ eigenvalue solvers, which exhibit a high degree of parallelism. Hence, our parallel KL-heuristic is suitable for use on parallel machines, as suggested in [14]. Recently, Karypis and Kumar [5,6] have addressed the multiway partitioning of irregular graphs.


3. Parallel multiway graph partitioning using PPM

We now consider the parallelization of the PPM algorithm. The notation used in specifying the parallel algorithm is shown in Table 1. Since the algorithm involves multiple pairwise applications of the mincut procedure, there clearly is potential for parallelism between independent disjoint pairs of partitions. Thus mincut can be simultaneously applied, for example, to (𝒞1, 𝒞2), (𝒞3, 𝒞4), ..., (𝒞_{C−1}, 𝒞_C). The problem, however, is in finding sets of partition pairs so that all pairs in each set are disjoint and the sets collectively include all of the possible C(C − 1)/2 pairs.

We present a novel tournament approach to parallelizing the PPM algorithm on a hypercube parallel computer. The algorithm shown in Fig. 3 performs a C-way partition using C/2 processors, i.e., half as many processors as the number of required clusters (partitions). Each processor P_i contains two of the C partitions, labeled C1_i and C2_i, with no two processors containing the same partition. The processors alternate between computation and communication, repeatedly performing:
1. a pairwise mincut on the two locally held clusters;
2. communication of one of the resulting clusters to a neighbor processor, in turn receiving some other cluster from a neighbor.

During each outer pass of the algorithm (the repeat loop), a pairwise mincut is tried between every possible pair of clusters. Each outer pass comprises p + 1 phases (indexed by d), where p is the number of dimensions of the hypercube system. Each phase consists of two subphases: a cyclic-pairwise-mincut subphase where processors communicate in closed rings, and a ring-fragmentation subphase where each ring subdivides into two isolated subrings. The ring structure thus changes from phase to phase, with 2^d independent rings of size 2^{p−d} being formed for communication during phase d, as illustrated for a 4-dimensional hypercube in Fig. 4. Note that the loop variable s in the parallel PPM algorithm of Fig. 3 counts the communication steps within a ring of size 2^{p−d}. RN_d(k) and LN_d(k) are used to denote respectively the right neighbor and left neighbor of processor P_k in the appropriate ring during phase d.

Table 1
Notation (for a processor numbered i, 0 ≤ i < P)
b_j(i)      the jth bit of the binary representation of i
h_d(i)      the d higher-order bits of i
l_d(i)      the d lower-order bits of i
x || y      concatenation of the bit strings x and y
𝒢(i)        i ⊕ (i >> 1), i.e., the Gray code of i
𝒢^{-1}(i)   i ⊕ (i >> 1) ⊕ (i >> 2) ⊕ ... ⊕ (i >> n − 1), i.e., the Gray code inverse of an n-bit number i
C = 2P      the number of clusters (partitions) to be formed
P = 2^p     the number of hypercube processors


Parallel Algorithm for PPM: Processor P_i executes:

repeat
  for d ← 0 to p do
    for s ← 1 to 2^{p−d} − 1 do
      (C1_i, C2_i, cutgain) ← mincut(C1_i ∪ C2_i);
      Lgain_i ← Lgain_i + cutgain;
      send(C2_i, RN_d(i));
      receive(C2_i, LN_d(i));
    endfor
    (C1_i, C2_i, cutgain) ← mincut(C1_i ∪ C2_i);
    Lgain_i ← Lgain_i + cutgain;
    if (d < p) then
      if (b_{p−d−1}(i) = 1) then
        send(C1_i, i ⊕ 2^{p−d−1});
        receive(C1_i, i ⊕ 2^{p−d−1});
      else
        send(C2_i, i ⊕ 2^{p−d−1});
        receive(C2_i, i ⊕ 2^{p−d−1});
      endif
    endif
  endfor
  Ggain ← calc_Ggain(Lgain_i);
until (Ggain ≤ 0)

Fig. 3. Parallel PPM algorithm.

The ring-neighbor relationship can be precisely specified using a cyclic Gray code of the appropriate dimension [10], as follows:

$$RN_d(k) = h_d(k) \,\|\, l_{p-d}\!\left(\mathcal{G}\big(\mathrm{mod}(\mathcal{G}^{-1}(l_{p-d}(k)) + 1,\; 2^{p-d})\big)\right),$$
$$LN_d(k) = h_d(k) \,\|\, l_{p-d}\!\left(\mathcal{G}\big(\mathrm{mod}(\mathcal{G}^{-1}(l_{p-d}(k)) - 1,\; 2^{p-d})\big)\right),$$

where h_d(k), l_d(k), 𝒢(k), 𝒢^{-1}(k) are as defined previously (Table 1). For example, with a 4-dimensional hypercube,

$$RN_1(1100) = h_1(1100) \,\|\, l_3\big(\mathcal{G}(\mathrm{mod}(\mathcal{G}^{-1}(l_3(1100)) + 1, 2^3))\big) = 1 \,\|\, l_3\big(\mathcal{G}(\mathrm{mod}(\mathcal{G}^{-1}(100) + 1, 8))\big) = 1 \,\|\, l_3\big(\mathcal{G}(\mathrm{mod}(111 + 1, 8))\big) = 1 \,\|\, l_3(\mathcal{G}(000)) = 1000,$$

$$RN_2(1100) = h_2(1100) \,\|\, l_2\big(\mathcal{G}(\mathrm{mod}(\mathcal{G}^{-1}(l_2(1100)) + 1, 2^2))\big) = 11 \,\|\, l_2\big(\mathcal{G}(\mathrm{mod}(00 + 1, 4))\big) = 11 \,\|\, l_2(\mathcal{G}(01)) = 1101.$$
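For illustration, the ring-neighbor computation can be sketched as follows (the bit-level helper names are our own; the assertions reproduce the worked example above).

```python
# Sketch of the ring-neighbor computation used by the parallel PPM algorithm
# (cyclic Gray code within each subcube [10]).  Helper names are our own;
# p is the hypercube dimension and addresses are p-bit integers.

def gray(i):
    """Gray code of i: i XOR (i >> 1)."""
    return i ^ (i >> 1)

def gray_inv(i):
    """Gray code inverse: i XOR (i >> 1) XOR (i >> 2) XOR ..."""
    mask = i >> 1
    while mask:
        i ^= mask
        mask >>= 1
    return i

def ring_neighbor(k, d, p, step=+1):
    """RN_d(k) for step = +1, LN_d(k) for step = -1."""
    low_bits = p - d
    high = k >> low_bits                       # h_d(k): the d higher-order bits
    low = k & ((1 << low_bits) - 1)            # l_{p-d}(k): the p-d lower-order bits
    pos = (gray_inv(low) + step) % (1 << low_bits)
    return (high << low_bits) | gray(pos)

# Reproduces the worked example for a 4-dimensional hypercube:
assert ring_neighbor(0b1100, d=1, p=4) == 0b1000   # RN_1(1100) = 1000
assert ring_neighbor(0b1100, d=2, p=4) == 0b1101   # RN_2(1100) = 1101
```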


Fig. 4. Illustration of communication in different phases of the parallel PPM algorithm (4-dimensional hypercube): (a) d = 0, one ring of size 16; (b) d = 1, 2 rings of size 8; (c) d = 2, 4 rings of size 4; (d) d = 3, 8 rings of size 2.


During a phase, corresponding to one iteration of the d-loop of the algorithm, each processor keeps one of its clusters (C1) local, while it repeatedly receives, transforms and passes on the second cluster (C2). Consider phase 0, with all processors communicating along a single all-inclusive ring: at the end of the 2^p − 1 steps in the cyclic-pairwise-mincut subphase, all clusters constituting the various C1_i's (denoted 𝒞𝒮1) will have been matched up and optimized with respect to every cluster in the 𝒞𝒮2 set. Thus the only pairings between clusters that have not been attempted are those among the members of the 𝒞𝒮1 set and, likewise, those among the members of 𝒞𝒮2. During the ring-fragmentation subphase of phase 0, pairwise communication exchanges occur between each processor and the neighbor that differs from it in address in the highest bit. During this subphase, each processor P_i with highest address bit of one (b_{p−1}(i) = 1) swaps its C1 cluster for the C2 cluster of its partner processor (P_l, with b_{p−1}(l) = 0). Thus, after this subphase, all processors P_i with b_{p−1}(i) = 1 hold only clusters from the original 𝒞𝒮2 set, while all processors with b_{p−1}(i) = 0 hold all the clusters comprising the original 𝒞𝒮1 set. For the rest of the current outer pass, no communication between the "highest-bit-1" processors and the "highest-bit-0" processors takes place, i.e., the hypercube gets fragmented into two lower-dimensional subcubes. Thus in phase 1, two rings of size 2^{p−1} are formed for the cyclic-pairwise-mincut subphase, and communication occurs between processors differing in their (p − 2)th bit during the ring-fragmentation subphase.

During each phase of the algorithm, new cluster-pairs meet at the processors, for application of the pairwise mincut algorithm. The algorithm guarantees that during an outer pass, no pair of clusters is ever matched up more than once. Fig. 5 illustrates this "no-repetition" property of the algorithm. In order to focus on the nature of the ring-fragmentation subphase, the effects of the alternating cyclic-pairwise-mincut subphase are intentionally omitted. Eight clusters are shown, mapped onto four processors, two clusters per processor. During phase 0 (d = 0), the application of the cyclic-pairwise-mincut subphase results in the optimization of each of the members of 𝒞𝒮1 with respect to each of the members of 𝒞𝒮2, since a complete ring interconnection of the four processors implies that each of A00, A01, A10, A11 will "meet" each of B00, B01, B10, B11. Ignoring the actual permutation of the C2 clusters that results at the end of the cyclic-pairwise-mincut subphase, and assuming it to be as shown, the ring-fragmentation subphase of phase 0 results in the state shown for d = 1. Processors P00 and P01 are left with clusters A00, A01, A10, A11, whereas P10 and P11 now have clusters B00, B01, B10, B11. After the ring-fragmentation subphase of phase 0, the P_{0*} processors and the P_{1*} processors (those whose highest address bit is 0 and 1, respectively) never again communicate with each other. Since none of the A-clusters were mutually optimized during phase 0, and since none of the B-clusters can any longer meet any of the A-clusters, all pairs of clusters that align at any processor are unique combinations that have not occurred earlier. The same property clearly holds recursively, as illustrated in Fig. 5.
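The no-repetition property (and Theorem 1 in Section 3.1) can be checked by a small simulation of the data movement. Which cluster pairs meet does not depend on the particular cyclic order within a ring, so the sketch below (illustrative code, not part of the algorithm specification) rotates the C2 clusters within each ring in simple index order rather than in the Gray-code order used by the actual algorithm.

```python
# Simulation sketch (illustrative, not the paper's code) of one outer pass of
# parallel PPM, checking that every pair of the C = 2P clusters is matched
# exactly once.  Which pairs meet does not depend on the cyclic order inside a
# ring, so C2 clusters are rotated here in plain index order rather than in the
# Gray-code order used by the actual algorithm.
from collections import Counter
from itertools import combinations

def one_outer_pass(p):
    P = 1 << p                                 # number of processors
    c1 = [f"A{i}" for i in range(P)]           # cluster held as C1_i
    c2 = [f"B{i}" for i in range(P)]           # cluster held as C2_i
    matches = Counter()
    for d in range(p + 1):
        ring_size = 1 << (p - d)
        rings = [list(range(g * ring_size, (g + 1) * ring_size))
                 for g in range(1 << d)]       # processors sharing the top d bits
        for ring in rings:
            for _ in range(ring_size):         # 2^(p-d) matchings per phase (cf. Lemma 1)
                for proc in ring:
                    matches[frozenset((c1[proc], c2[proc]))] += 1
                last = c2[ring[-1]]            # rotate C2 one position around the ring
                for k in range(len(ring) - 1, 0, -1):
                    c2[ring[k]] = c2[ring[k - 1]]
                c2[ring[0]] = last
        if d < p:                              # ring fragmentation along bit p-d-1
            bit = 1 << (p - d - 1)
            old_c1, old_c2 = c1[:], c2[:]
            for proc in range(P):
                if proc & bit:
                    c1[proc] = old_c2[proc ^ bit]   # bit-1 half keeps only C2-set clusters
                else:
                    c2[proc] = old_c1[proc ^ bit]   # bit-0 half keeps only C1-set clusters
    return matches

matches = one_outer_pass(p=3)                  # P = 8 processors, C = 16 clusters
clusters = [f"A{i}" for i in range(8)] + [f"B{i}" for i in range(8)]
assert all(matches[frozenset(pair)] == 1 for pair in combinations(clusters, 2))
```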


Fig. 5. Illustration of 8-way partitioning by parallel PPM on a 2-D hypercube (four processors); the C1 and C2 clusters held by processors P00, P01, P10, P11 are shown for d = 0, 1, 2.

Gilbert and Zmijewski [3] and Moore [8] have considered parallel graph partitioning on a hypercube. Moore's algorithm for PPM requires 2C − 1 phases, compared to our algorithm which requires C − 1 phases. Since each cluster has to meet C − 1 other clusters, our algorithm for PPM is optimal in the number of phases.

Before going on to prove the algorithm correct more formally, we address the issue of termination. Whereas a single flag that kept track of any change during an outer pass was adequate for the sequential algorithm, a distributed approach to detecting a global lack of improvement during an outer pass is needed. A simple way of doing this is to keep track locally, within each processor, of the cumulative gain improvements due to the mincut applications during the current pass, and to use a global sum algorithm [13] to compute the global gain. This takes O(log₂ P) time, which is of a lower order than the time taken by each outer pass, and hence negligible.
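A recursive-doubling global sum of the local gains makes the O(log₂ P) cost concrete; the sketch below (illustrative, not the paper's code) simulates the p = log₂ P pairwise exchange rounds sequentially.

```python
# Illustrative sketch (not the paper's code) of the O(log2 P) global-sum step
# used for termination detection: after p = log2 P pairwise exchange rounds,
# every processor holds the sum of all local gains.  The message exchanges of a
# real hypercube implementation are simulated here sequentially.

def global_sum(local_gains):
    P = len(local_gains)                   # assumed to be a power of two
    totals = list(local_gains)
    dim = 1
    while dim < P:
        # processor i exchanges its partial sum with partner i XOR dim and adds it
        totals = [totals[i] + totals[i ^ dim] for i in range(P)]
        dim <<= 1
    return totals                          # totals[i] == sum(local_gains) for every i

assert global_sum([3, 0, 5, 1]) == [9, 9, 9, 9]
```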

3.1. Correctness of the PPM algorithm

We now argue more formally about the correctness of the parallel algorithm, before addressing its time complexity. In the following we often refer to a "match" occurring between clusters 𝒞_i and 𝒞_j.


Given any two clusters 𝒞_i and 𝒞_j, a match (𝒞_i, 𝒞_j) is said to occur for every attempted mincut optimization, mincut(𝒞_i, 𝒞_j) or mincut(𝒞_j, 𝒞_i), during the execution of the parallel PPM algorithm.

Lemma 1. The total number of pairwise cluster combinations encountered during one outer pass of the parallel PPM algorithm is C(C − 1)/2.

Proof. Each processor performs one pairwise comparison during every step of every phase of the algorithm, as is clear from the algorithm specification. The number of steps in phase d is 2^{p−d}. Hence the total number of pairwise combinations tried is

$$P \sum_{d=0}^{p} 2^{p-d} = P \cdot 2^{p} \sum_{d=0}^{p} 2^{-d} = P(2P - 1) = C(C - 1)/2. \qquad\Box$$

Lemma 2. Any processor P_k is part of a communication ring of size 2^{p−d} during the cyclic-pairwise-mincut subphase of phase d of the algorithm.

Proof. During the cyclic-pairwise-mincut subphase of phase d, P_k communicates only with P_l, where l is either RN_d(k) or LN_d(k). From the definition of RN_d(k) and LN_d(k), the d higher-order bits of k form the d higher-order bits of RN_d(k) and LN_d(k). Thus the d higher-order bits of any two communicating processors during this subphase are identical. Further, the p − d lower-order bits of k and RN_d(k) form successive elements of a (p − d)-digit binary Gray code. Likewise, the p − d lower-order bits of LN_d(k) and k also form successive elements of a (p − d)-digit Gray code. Since a (p − d)-bit Gray code sequence forms a cycle with 2^{p−d} elements, P_k is part of a cyclic communication ring of size 2^{p−d}. □

Lemma 3. Given any clusters 𝒞_i and 𝒞_j, a match (𝒞_i, 𝒞_j) can occur at most once during any cyclic-pairwise-mincut subphase.

Proof. A match (𝒞_i, 𝒞_j) can occur in phase d only if one of 𝒞_i or 𝒞_j belongs to 𝒞𝒮1 for some ring-connected subcube of processors while the other belongs to 𝒞𝒮2 for the same processor ring. The cyclic-pairwise-mincut subphase of phase d has 2^{p−d} − 1 communication steps, during which the C1-cluster-set stays stationary while the elements of the C2-cluster-set circulate among the processors in a cyclic ring. From Lemma 2, the ring length is 2^{p−d}, so a match (𝒞_i, 𝒞_j) cannot occur more than once. □

Lemma 4. Given two processors P_k and P_l with b_{p−d}(k) ≠ b_{p−d}(l), and clusters 𝒞_i and 𝒞_j belonging respectively to P_k and P_l at the beginning of phase d, a match (𝒞_i, 𝒞_j) cannot occur during any phase d′ ≥ d.


Proof. During phase d, P_k belongs to a ring of processors whose d higher-order address bits are identical. Hence, for P_k and P_l to belong to the same ring, h_d(k) = h_d(l). However, since b_{p−d}(k) ≠ b_{p−d}(l), we have h_d(k) ≠ h_d(l), i.e., P_k and P_l cannot belong to the same ring of processors. Therefore (𝒞_i, 𝒞_j) cannot match in phase d. During the cube-fragmentation subphase of phase d, communication is again only between members of the same ring, so that no processor that P_k can reach can communicate with any processor reached by P_l. The same argument applies for any value of d′ > d. Hence (𝒞_i, 𝒞_j) cannot match during any phase d′ ≥ d. □

Lemma 5. Given any clusters 𝒞_i and 𝒞_j, a match (𝒞_i, 𝒞_j) can occur at most once during an outer pass of the algorithm.

Proof. Let d be the earliest phase in which a match (𝒞_i, 𝒞_j) occurs. By Lemma 3, at most one such match can occur during the cyclic-pairwise-mincut subphase of phase d. For such a match to occur, one of the clusters must belong to the C1-cluster-set and the other to the C2-cluster-set. Since they belong to different cluster-sets, during the cube-fragmentation subphase of phase d, 𝒞_i and 𝒞_j necessarily end up in different processors P_k, P_l, where k and l differ at least in bit p − d − 1. By Lemma 4, they cannot get matched in any later phase d′ > d. Hence at most one match (𝒞_i, 𝒞_j) can occur during an outer pass. □

Theorem 1. Given any two clusters 𝒞_i and 𝒞_j, a match (𝒞_i, 𝒞_j) occurs exactly once during an outer pass of the algorithm.

Proof. Theorem 1 follows immediately from Lemmas 1 and 5. By Lemma 1, a total of C(C − 1)/2 matches occur during one outer pass of the algorithm, and by Lemma 5, no match (𝒞_i, 𝒞_j) can occur more than once per outer pass. Since the number of possible distinct combinations of cluster pairs is C(C − 1)/2, all possible matches must occur exactly once during an outer pass of the algorithm. □

If C > 2P, the parallel PPM algorithm can be extended in a straightforward fashion. For C = MP, M = 2^k, k > 1, we can imagine groups of M/2 partitions in place of the single partitions in the presented algorithm. Now, instead of a single pairwise mincut operation, (M/2)^2 pairwise mincut operations need to be performed at each step of the algorithm between the member partitions of the two (M/2)-ary partition-groups in a processor. With such an (M/2)-ary group of clusters in place of single clusters, the algorithm for parallel PPM is essentially the same as above, except for one added set of mincut operations between the components of each (M/2)-ary group of partitions. The time complexity of the generalized parallel PPM algorithm is, per outer pass,

$$T_{par}(N, C, P) = \frac{M}{2}\left(\frac{M}{2} - 1\right) T_e\!\left(\frac{2N}{C}\right) + (2P - 1)\left(\frac{M}{2}\right)^{2} T_e\!\left(\frac{2N}{C}\right) + (2P - 2)\, T_c\!\left(\frac{N}{2P}\right)$$

$$\phantom{T_{par}(N, C, P)} = \frac{C(C-1)}{2P}\, T_e\!\left(\frac{2N}{C}\right) + 2(P - 1)\, T_c\!\left(\frac{N}{2P}\right).$$

With T_e(N) = K_e N, T_c(N) = K_c N, and κ outer passes as before,

$$T_{par}(N, C, P) = \kappa\,\frac{N}{P}\left[(C - 1)K_e + (P - 1)K_c\right],$$

$$\mathrm{Speedup} = \frac{T_{seq}(N, C)}{T_{par}(N, C, P)} = P\left[\frac{1}{1 + \dfrac{(P-1)K_c}{(C-1)K_e}}\right].$$

Thus, nearly linear speedup is feasible as long as (P − 1)K_c ≪ (C − 1)K_e, i.e., for P much smaller than C when K_c and K_e are of comparable magnitude.
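For example (purely illustrative numbers), with P = 16 processors, C = 64 clusters and K_c = K_e, the speedup expression gives

$$\mathrm{Speedup} = \frac{16}{1 + \dfrac{15\,K_c}{63\,K_e}} \approx 12.9,$$

i.e., about 81% parallel efficiency.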

sizes. Extensions of the approach to partition graphs with weighted vertices seem feasible but non-trivial, and are the subject of current investigation.

References

[1] C.M. Fiduccia, R.M. Mattheyses, A linear time heuristic for improving network partitions, Proceedings of the 19th Design Automation Conference, 1982, pp. 175-181.
[2] G.C. Fox, Load balancing and sparse matrix vector multiplication on the hypercube, Caltech Concurrent Computation Project, Report #327, 1985.
[3] J.R. Gilbert, E. Zmijewski, A parallel graph partitioning algorithm for message passing computers, International Journal of Parallel Programming 16 (1987) 427-449.
[4] S. Hammond, Mapping unstructured grid computations to massively parallel computers, Ph.D. thesis, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, 1992.
[5] G. Karypis, V. Kumar, Parallel multilevel k-way partitioning scheme for irregular graphs, Proceedings of Supercomputing '96, Pittsburgh, 1996.
[6] G. Karypis, V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM Journal on Scientific Computing (to appear).
[7] B.W. Kernighan, S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System Technical Journal 49 (2) (1970) 291-307.
[8] R. Moore, A round-robin partitioning algorithm, Technical Report, Computer Science Department, Cornell University, 1988.
[9] A. Pothen, H. Simon, K. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM Journal on Matrix Analysis and Applications 11 (1990) 430-452.
[10] Y. Saad, M. Schultz, Topological properties of hypercubes, IEEE Transactions on Computers C-37 (7) (1988) 867-872.
[11] P. Sadayappan, F. Ercal, J. Ramanujam, Cluster-partitioning approaches to mapping parallel programs onto a hypercube, Parallel Computing 13 (1) (1990) 1-16.
[12] J. Saltz, H. Berryman, J. Wu, Multiprocessors and runtime compilation, Concurrency: Practice & Experience 3 (6) (1991) 573-592.
[13] K. Schwan, W. Bo, Topologies: computational messaging for multicomputers, Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications (HC3A), ACM Press, 1988, pp. 580-593.
[14] H. Simon, Partitioning of unstructured problems for parallel processing, Computing Systems in Engineering 2 (2/3) (1991) 135-148.
[15] R. Williams, Performance of dynamic load balancing algorithms for unstructured mesh calculations, Concurrency: Practice & Experience 3 (5) (1991) 457-481.