Two-stage m-way graph partitioning


Parallel Computing 19 (1993) 1359-1373 North-Holland


PARCO 816

Two-stage m-way graph partitioning

H.B. Zhou *

Department of Computer Science, University of Zürich, Winterthurerstrasse 190, CH-8057 Zürich, Switzerland

Received 4 January 1993; revised 21 June 1993

Abstract

This paper presents a group of multiple-way partitioning algorithms for graphs with weighted nodes and edges, based on a two-stage constructive-and-refinement mechanism. The graph partitions can be used to control the allocation of program units to distributed processors so as to minimize the completion time, and for design automation applications. In the constructive stage, four clustering algorithms are used to construct raw partitions; the second, refinement stage first adjusts the cluster number to the processor number and then iteratively improves the partitioning cost by employing a Kernighan-Lin based heuristic. This approach extends the state-of-the-art methods in several ways. A performance comparison of the proposed algorithms is given, based on experimental results.

Keywords. Clustering; undirected graph (UDG); Kernighan-Lin heuristic; task allocation (assignment); design automation

1. Introduction

The purpose of task allocation in a set of interconnected processors is to reduce the job turnaround time. A program or computation can be represented by a graph with data interrelationships (edges). The graph can be directed or undirected, depending on whether there are strict precedence constraints among the nodes or not. For some computations, such as some image processing operations [38] and finite element computing [1], the precedence constraints can be neglected: since in such cases the data which need to be communicated among the nodes are only trivial boundary data, the precedence constraint is weak [39]. Therefore, their data dependency graph can be represented by an undirected graph (UDG). There are many works on the precedence-constrained graph scheduling problem [40] and on the non-precedence-constrained graph allocation problem [10], respectively. Our approaches in this paper focus on solving the latter problem. Given a homogeneous distributed multiprocessor with m processing elements, the partitioning problem involves finding an allocation of n program or computation segments to the m processors which is optimal with respect to some set of measures. Here we further assume that the target processor network is a fully connected network with identical node processing capacity, that interprocessor communication is expensive, and that intraprocessor communication is free. The data dependency graph of the computation is an undirected graph with both node and edge weights. This paper deals with the problem of partitioning the nodes of the data dependency graph into two or more disjoint subsets and the problem of finding the

* Email: [email protected].
0167-8191/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved


optimal assignment of the nodes of the dependency graph to a processor network with a fixed number of processors, in such a way that a minimum completion time (the measure) is reached when the computation is executed by the processor network. This problem also arises in VLSI design applications such as component layout [21], so the methods proposed here can also be applied to such issues with modification. It is well known that finding an optimal solution to the partitioning problem is NP-complete [13]. The problem of finding the minimum bisection of a graph lies at the heart of many partition problems on which multiple-partition solutions depend [5]. Unfortunately, the graph bisection problem for graphs with unit edge cost is also NP-complete. Even for the case of planar graphs, which always have bisection widths O(n^{1/2}) [23], not even approximation algorithms are known. As a result, people either resort to other approaches to find a suboptimal solution or consider restricted classes of graphs [5,30]. A two-stage constructive-and-refinement mechanism is proposed in this paper to find a suboptimal solution. Although our approach is for the general graph class, the present experimental results only show the performance of partitioning planar graphs.

2. Previous work

There are two major communities interested in the graph partitioning problem: the VLSI design automation community and the distributed computing community. The former group is interested in the circuit network partitioning problem [21]. Most of the graph partitioning algorithms to date have been proposed in that community. Those algorithms are mostly used for so-called hypergraph or network partitioning [31,37]. Similar basic approaches can also be used (with some change of the measurement functions) for graph partitioning in parallel and distributed system applications. Three standards are usually considered for the assignment problem in approaches that seek a suboptimal solution: (1) minimize interprocessor communication by assigning nodes connected in the graph to the same processor; (2) maximize the parallelism of the graph by assigning nodes which can run in parallel to different processors; (3) balance the computational load evenly among the processors. Obviously, it is very difficult to find an assignment satisfying all the requirements simultaneously, since these three standards are mutually contradictory. The first two standards form the so-called min-max dilemma. Maximum throughput is achieved by load balancing (without interprocess communication), which tries to distribute modules as widely as possible, but overhead due to interprocess communication drives the allocation strategy to cluster modules onto as few processors as possible. Different approaches to solving the problem have been proposed in previous research. Graph-theoretic approaches and network flow techniques for task assignment have been used under a variety of processor conditions [35,5,36,8,22]. Mathematical programming approaches are formulated in [8,22,9]. Well-known computing schemes such as branch and bound [26], evolution-based approaches [32], quadratic assignment [7], clustering [16,11,14], and simulated annealing [17,34,10,6] have also found applications here.
Many heuristic algorithms have been presented in [18,12,19,24,25,4,2,31]. Stone's and Bokhari's optimal solutions suffer from limitations on the processor number and graph structure. The Ford-Fulkerson algorithms find a maximum flow, and hence a minimum cost cut; this represents a minimum cost partition of the graph into two subsets, but of


unspecified sizes. Therefore the load balancing requirements cannot be met. The clustering methods suffer from a similar load balancing problem, i.e. the number of clusters is usually not equal to the number of available processors [11]. Other formal approaches [3,7] are mathematically elegant but lack practical effectiveness. Although its performance improves uniformly with increasing running time, simulated annealing usually needs much more running time than almost all other approaches. The Kernighan-Lin heuristic is the recognized champion among the classical approaches to the graph bisection problem [6]. Experiments run by several authors [15,5] indicate that iterative improvement (Kernighan-Lin like) heuristics yield very good bipartitions if the minimum vertex degree or the density of the graph is large. This is because for dense graphs the partitioning problem has few local optima that are not global optima, and the Kernighan-Lin style 2-way exchange heuristic is a local method. Network flow (multicommodity flow, [22]) is a global method whose effectiveness needs further research. In this sense, simulated annealing is the best, as it is asymptotically optimal. New Kernighan-Lin based variants have been emerging continuously due to the flexibility of such heuristics. [31] developed a multi-way partitioning method based on the level gain concept by adding the lookahead (backtrack) ability proposed in [19], which in turn is based on the Fiduccia-Mattheyses approach. Some other developments are based on finding a so-called graph separator. One of the most significant results is [23]'s theorem on planar graph separators. One common problem with the separator (approximation) algorithms is that the edges of the graph are assumed to have unit cost [23,30], so it is difficult to make those results useful for our partition problem on graphs with both weighted vertices and edges.

3. The problem and the two-stage approach

A parallel program or computation consists of n separate cooperating and communicating modules called tasks. Its behavior is represented by a data dependency graph called a task graph. An edge (i, j) between two tasks i and j exists if there is a data linkage between the two tasks, which means that task j has a certain amount of data to transfer to task i during or after the computation, and/or vice versa. Such graphs can represent a number of interesting program structures such as message passing, dataflow graphs, and other distributed software.

UDG definition. A UDG is a graph G of n nodes with sizes (load weights) w_i, 0 < w_i <= w_max, i = 1, ..., n, and k edges with sizes (communication costs) c_ij >= 0, i, j = 1, ..., n.

Definition of m-way partition. Let m be a positive integer (the processor number in a network). An m-way partition of G is a set of nonempty, pairwise disjoint subsets of G, C_1, ..., C_m (uppercase C), such that the union of C_1, ..., C_m equals G. A partition is admissible if

    -w_max <= |C_i| - |C_j| <= w_max    (3.1)

for all i and j (i, j = 1, ..., m), where the symbol |x| stands for the size of a set x and equals the sum of the w_i's of all the elements of x. The cost of a partition is then defined as

    max_{i=1..m} ( W_i + max_{j=1..m} C_ij ),    (3.2)


where the uppercase C_ij denotes the communication between partitioned subsets i and j, and the uppercase W_i denotes the sum of node weights in subset i. This cost corresponds to the maximum time requirement if the computation is to be executed by a homogeneous distributed multiprocessor system with m identical processing elements. So, the objective is to minimize the cost function (3.2). Previous research almost exclusively defines the cost of a partition as the summation of c_ij over all i and j such that i and j are in different subsets; the cost is thus the sum of all external costs in the partition. Minimization of such a cost function does not directly relate to achieving the minimum completion time. That is, there is no guarantee that a minimum execution time will be reached even if an optimal partitioning has been found based on the communication edge summation cost function. Minimizing the new cost function (3.2) proposed here will directly guarantee a minimum completion time. In the case of coarse-grain computing, the computation load is an order of magnitude or more larger than the communication cost, so load balance is a very important factor to consider in the partitioning, because the communication cost is negligible compared with the computation load. Our m-way partition definition emphasizes the load balancing factor in several ways through the introduction of condition (3.1) and cost function (3.2). In both fine- and coarse-grain cases, the lower limit of an optimal partitioning cost is the average computation load. Two main disadvantages of the previous methods are that they assume vertices of equal size [18] or edges of equal size [30]. That results in a weak treatment of load balancing, which is one of the most important factors in partitioning data dependency graphs for parallel and distributed computations. This paper gives a full consideration and treatment of such problems.
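As an executable illustration of the completion-time cost function (3.2), the following sketch evaluates it for a given partition. This is our illustration, not the paper's code; the dictionary-based graph layout and all names are assumptions.

```python
# Illustrative sketch: evaluating cost (3.2) = max_i ( W_i + max_j C_ij )
# for an m-way partition of a graph with weighted nodes and edges.
# Assumed layout: node_weight[v] = w_v, edge_cost[(u, v)] = c_uv,
# part[v] = index of the subset containing v.

from collections import defaultdict

def partition_cost(node_weight, edge_cost, part, m):
    # W_i: total node weight per subset
    W = [0.0] * m
    for v, w in node_weight.items():
        W[part[v]] += w
    # C_ij: total edge cost between distinct subsets
    C = defaultdict(float)
    for (u, v), c in edge_cost.items():
        i, j = part[u], part[v]
        if i != j:
            C[(i, j)] += c
            C[(j, i)] += c
    # cost (3.2): heaviest subset load plus its largest outgoing communication
    return max(W[i] + max((C[(i, j)] for j in range(m) if j != i), default=0.0)
               for i in range(m))

# Tiny example: 4 nodes split into 2 subsets; only edge (1, 2) is cut.
nw = {0: 3.0, 1: 2.0, 2: 4.0, 3: 1.0}
ec = {(0, 1): 1.0, (1, 2): 5.0, (2, 3): 2.0}
p = {0: 0, 1: 0, 2: 1, 3: 1}
print(partition_cost(nw, ec, p, 2))  # prints 10.0
```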
Partitioning algorithms can be divided into two categories [19]: (1) algorithms that construct a pre-partition by considering some basic constraints, called constructive algorithms; (2) algorithms that improve upon the pre-partition by considering some finer and stricter requirements, called refinement algorithms. A practical approach is a combination of the two steps in that order; our two-stage algorithm originates from this thought. To date, iterative improvement techniques [18,12,19,31] that make local changes to an initial partition are still the most successful partitioning algorithms in practice [21], despite the fact that we cannot say much in theory about the performance of these heuristics. On the basis of some experimental observations, we believe that a good pre-partition may improve the performance of Kernighan-Lin based heuristics significantly and, at the same time, shorten the running time of their iterations. The clustering algorithms [11,10,14] are fast, effective, and easy to implement, and they offer relatively good performance. Their disadvantage is that they suffer from the so-called reassignment problem [11]. Based on the above thoughts, a two-stage partitioning algorithm is proposed, with the clustering algorithms as the pre-partitioning constructive procedure and the Kernighan-Lin based heuristics as the fine-tuning refinement procedure.

4. The constructive clustering procedure

The clustering methods have been adapted to graph partitioning by several authors [16,11,10]. An intuitive interpretation of a cluster in a graph is that it is a part which has denser, heavier-weighted edges among its nodes than the remaining graph. The clustering


approach can more suitably be used as a constructive method in the sense that it finds the natural groups in the graph, which can be used as seeds for further partitioning refinements. Four constructive clustering algorithms have been developed here. Algorithm 1: Heavy edge first merge (HEF); Algorithm 2: Heaviest edge first merge (HstEF); Algorithm 3: Minimum local communication merge (MLC); Algorithm 4: Maximum node and minimum local communication merge (MnMLC). At the constructive stage, we do not take the load balance factor into account. Only two threshold values are introduced to constrain the clustering process. (1) Average load value (avgLoad): this avoids the generation of extremely large and load-unbalanced nodes. (2) Average edge cost value (avgEdgeCost): this avoids merging very low cost edges.
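The paper does not spell out formulas for the two thresholds, so the sketch below is our assumed reading: avgLoad as the total node weight divided by the processor number m, and avgEdgeCost as the mean edge cost. All names are illustrative.

```python
# Hedged sketch of the two clustering thresholds (our interpretation,
# not the paper's formulas). weights: {node: w}, edges: {(u, v): c}.

def thresholds(weights, edges, m):
    # avgLoad: average computation load per processor, caps cluster size
    avg_load = sum(weights.values()) / m
    # avgEdgeCost: average edge cost, used to skip merging low-cost edges
    avg_edge_cost = sum(edges.values()) / len(edges)
    return avg_load, avg_edge_cost

w = {1: 4, 2: 6, 3: 2}
e = {(1, 2): 3, (2, 3): 1}
print(thresholds(w, e, 2))  # prints (6.0, 2.0)
```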

4.1. HEF and HstEF clustering

The two algorithms are based on the idea that, if two nodes connected by heavy edges can be allocated to the same processor, the communication cost can be substantially reduced.

4.1.1. HEF clustering

The algorithm runs iteratively until no more nodes can be merged. Every iteration starts from the smallest existing (not yet merged) node number in the graph.

Algorithm HEF;
begin
  nodesLeft1 := m; nodesLeft2 := 0;
  while (nodesLeft2 < nodesLeft1) do
    Count the current number of nodes left in the graph: nodesLeft1;
    for activeNode := 1 to m do
      Find the heaviest edge heavEdge and the corresponding node contdNode connected to activeNode;
      while ((heavEdge > avgEdgeCost) and ((contdNode.weight + activeNode.weight) < avgLoad)) do
        Merge contdNode into activeNode, if (contdNode < activeNode); or vice versa;
        activeNode := Max{contdNode, activeNode};
        Find the heaviest edge heavEdge and the corresponding node contdNode connected to activeNode;
      end while;
    end for;
    Count the current number of nodes left in the graph: nodesLeft2;
  end while;
end.
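The HEF merge loop above can be written as the following executable sketch. This is a hedged illustration, not the paper's THINK Pascal implementation; the adjacency-map layout, the merge helper, and all names are our assumptions.

```python
# Hedged sketch of HEF ("heavy edge first") clustering.
# graph: {node: {neighbour: edge_cost}}, weights: {node: load weight}.

def merge(graph, weights, a, b):
    """Contract node a into node b: weights add, parallel edges sum."""
    for nb, c in graph.pop(a).items():
        graph[nb].pop(a, None)
        if nb != b:                       # the a-b edge itself becomes internal
            graph[b][nb] = graph[b].get(nb, 0) + c
            graph[nb][b] = graph[b][nb]
    weights[b] += weights.pop(a)

def hef_cluster(graph, weights, avg_edge_cost, avg_load):
    changed = True
    while changed:                        # repeat until no node can be merged
        changed = False
        for v in sorted(list(graph)):     # scan nodes in increasing label order
            if v not in graph or not graph.get(v):
                continue                  # already merged away, or isolated
            u = max(graph[v], key=graph[v].get)   # heaviest incident edge
            if graph[v][u] > avg_edge_cost and weights[v] + weights[u] < avg_load:
                a, b = (v, u) if v < u else (u, v)
                merge(graph, weights, a, b)       # keep the larger label
                changed = True

g = {1: {2: 5, 3: 1}, 2: {1: 5, 3: 2}, 3: {1: 1, 2: 2}}
w = {1: 2, 2: 3, 3: 4}
hef_cluster(g, w, avg_edge_cost=2, avg_load=6)
print(g, w)   # nodes 1 and 2 merged into node 2
```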

4.1.2. HstEF clustering

This algorithm runs iteratively until no more nodes can be merged. Every iteration selects edges in decreasing order, from heaviest to lightest.

Algorithm HstEF;
begin
  nodesLeft1 := m; nodesLeft2 := 0;
  while (nodesLeft2 < nodesLeft1) do
    Count the current number of nodes left in the graph: nodesLeft1;
    Find the heaviest edge heavEdge in the graph and the two nodes node1 and node2 connected to it;
    if ((heavEdge > avgEdgeCost) and ((node1.weight + node2.weight) < avgLoad)) then
      Merge node1 into node2, if (node1 < node2); or vice versa;
    else
      while ((heavEdge <= avgEdgeCost) or ((node1.weight + node2.weight) >= avgLoad)) do
        Find the next heaviest edge heavEdge in the graph and the two nodes node1 and node2 connected to it;
        if no suitable heaviest edge found then halt;
      end while;
      if ((heavEdge > avgEdgeCost) and ((node1.weight + node2.weight) < avgLoad)) then
        Merge node1 into node2, if (node1 < node2); or vice versa;
      end if;
    end if;
    Count the current number of nodes left in the graph: nodesLeft2;
  end while;
end.

4.2. MLC and MnMLC clustering

The two algorithms are based on the idea that nodes are allocated to proper clusters so that the total weight of the edges radiating from each cluster is minimized.

4.2.1. MLC clustering

This algorithm runs in a similar way to algorithm HEF.

Algorithm MLC;
begin
  nodesLeft1 := m; nodesLeft2 := 0;
  while (nodesLeft2 < nodesLeft1) do
    Count the current number of nodes left in the graph: nodesLeft1;
    for activeNode := 1 to m do
      Find the node contdNode and the edge connected to activeNode with minimal 'radiating edge sum';
      if ((edge > avgEdgeCost) and ((contdNode.weight + activeNode.weight) < avgLoad)) then
        Merge contdNode into activeNode, if (contdNode < activeNode); or vice versa;
        activeNode := Max{contdNode, activeNode};
      end if;
    end for;
    Count the current number of nodes left in the graph: nodesLeft2;
  end while;
end.


4.2.2. MnMLC clustering

This algorithm runs in a similar way to algorithm HstEF.

Algorithm MnMLC;
begin
  nodesLeft1 := m; nodesLeft2 := 0;
  while (nodesLeft2 < nodesLeft1) do
    Count the current number of nodes left in the graph: nodesLeft1;
    Find the node activeNode with maximal edge sum (edges radiating from it) in the graph;
    Find the node contdNode and the edge connected to activeNode with minimal 'radiating edge sum';
    if ((edge > avgEdgeCost) and ((contdNode.weight + activeNode.weight) < avgLoad)) then
      Merge contdNode into activeNode, if (contdNode < activeNode); or vice versa;
    else
      while ((edge <= avgEdgeCost) or ((contdNode.weight + activeNode.weight) >= avgLoad)) do
        Find the next activeNode with maximal edge sum in the graph;
        Find the node contdNode and the edge connected to activeNode with minimal 'radiating edge sum';
      end while;
      if no suitable contdNode found then halt;
      if ((edge > avgEdgeCost) and ((contdNode.weight + activeNode.weight) < avgLoad)) then
        Merge contdNode into activeNode, if (contdNode < activeNode); or vice versa;
      end if;
    end if;
    Count the current number of nodes left in the graph: nodesLeft2;
  end while;
end.
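The 'radiating edge sum' criterion used by MLC and MnMLC admits more than one reading; the sketch below takes it as the total weight of edges leaving the would-be merged cluster, which matches the stated goal of minimizing each cluster's outward edge weight. This interpretation and all names are our assumptions.

```python
# Hedged sketch of the MLC/MnMLC merge criterion (our reading):
# among activeNode's neighbours, pick the one whose merged cluster
# has the smallest total weight of edges radiating outward.
# graph: {node: {neighbour: edge_cost}}.

def radiating_sum_after_merge(graph, a, b):
    """Total weight of edges leaving the cluster formed by merging a and b."""
    total = 0
    for v in (a, b):
        for nb, c in graph[v].items():
            if nb not in (a, b):      # the a-b edge itself becomes internal
                total += c
    return total

def best_mlc_candidate(graph, active):
    # neighbour minimizing the post-merge radiating edge sum
    return min(graph[active],
               key=lambda nb: radiating_sum_after_merge(graph, active, nb))

g = {1: {2: 5, 3: 1}, 2: {1: 5, 3: 2, 4: 9}, 3: {1: 1, 2: 2}, 4: {2: 9}}
print(best_mlc_candidate(g, 1))  # prints 3
```

Note the contrast with HEF: merging node 1 with its heaviest neighbour 2 would leave 12 units of radiating edge weight, while merging with node 3 leaves only 7.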

5. The heuristic refinement procedure

The second-stage refinement procedure consists of two steps. First, since the former clustering procedure does not guarantee reaching exactly the same number of clusters as the processor number p, a grain packing procedure is used to adjust the cluster number to p. Then, a new Kernighan-Lin based heuristic is proposed to further fine-tune the partition under load-balance constraints. This final procedure improves the performance significantly. The whole second-stage refinement procedure elegantly solves the 'reassignment problem' stated in [11].

5.1. Grain packing

The term grain packing was first used in [20]. It is a technique to automatically determine the grain size for a parallel system [27]. The determination of a suitable grain size is important in trading off the min-max dilemma and for load balancing. The grain packing algorithm here, which takes only partial account of such factors, works as a preliminary step


towards the final grain size determination and allocation. A final, fuller consideration of the min-max dilemma and the load balancing trade-off will be presented in the next heuristic procedure. The main task of the grain packing procedure is to merge the trivial and small nodes (clusters) left in the clustered graph by the former constructive clustering step into larger nodes, in order to reduce the resulting cluster number to the available processor number p.

Algorithm grainPacking;
begin
  nodesLeft1 := 1; nodesLeft2 := 0;
  while (nodesLeft2 < nodesLeft1) do
    Count the current number of nodes left in the graph: nodesLeft1;
    Find the current maximum node weight maxNodeWt and the minimum-weight node minNode in the clustered graph;
    Find the maximum edge connected to minNode and the corresponding node contdNode with contdNode.weight < maxNodeWt;
    Merge contdNode into minNode, if (contdNode < minNode); or vice versa;
    Count the current number of nodes left in the graph: nodesLeft2;
  end while;

  nodesLeft := nodesLeft2;
  while (NumberProcessor < nodesLeft) do
    Find the minimum-weight node minNode in the graph;
    Find the maximum edge connected to minNode and the corresponding node contdNode;
    Merge contdNode into minNode, if (contdNode < minNode); or vice versa;
    Count the current number of nodes left in the graph: nodesLeft;
  end while;
end.

5.2. The Kernighan-Lin based heuristic

As mentioned in the preceding sections, the Kernighan-Lin heuristic is the recognized champion among the classical approaches to the partitioning problem [6,21]. It is also a cost-effective and flexible algorithm (this may be due to the fact that it is mathematically less formalized). In the Kernighan-Lin heuristic, we can accommodate additional constraints, such as required sides for certain vertices. The features of specific graph classes (e.g. planar graphs) can also be taken into consideration easily, compared with the multicommodity-flow [37] and simulated annealing [10] approaches. This feature is important in task allocation, since in practice we may want some tasks to be assigned to pre-specified processors. We can meet this requirement simply by assigning the node to the corresponding side and locking it during the heuristic process. Besides, the Kernighan-Lin heuristic is quite robust. Some earlier obstacles to applying it in practice, such as the unit-vertex constraint and the bisection limitation, have been overcome as research has deepened. The extended heuristics of [12,19,3] (mentioned in the preceding section) are called Kernighan-Lin based or Kernighan-Lin like heuristics in the literature. One of the most effective extensions is the Fiduccia-Mattheyses algorithm [12], which makes consideration of the load-balancing factor a simple task. Nevertheless, the Kernighan-Lin based algorithms share the common weakness that they are often trapped in local minima when the size of the graph is very large. One way to


overcome this difficulty is to group highly connected subgraphs into clusters and then condense these clusters into single nodes prior to the execution of the Kernighan-Lin based algorithms [37,6]. This idea has been an elementary factor in the formation of our two-stage approach. Our algorithm follows the Fiduccia-Mattheyses way of moving only one node in each step. Before going into the technical details, we first define some concepts.

Definition of the binding value of a node n_i on cluster j. The binding value of node n_i on cluster j equals the sum of all edge costs c_ik such that n_k is in cluster j. That is:

    B_ij = sum over all n_k in C_j of c_ik

Definition of the gain. Assume that, without the contribution of node k in cluster i, the communication cost between cluster i and cluster j is C. The gain of moving node k from cluster i to cluster j (here W_i - W_j > w_max; if not, no move will be conducted) is defined as:

if (W_i + C + B_ki - w_k) > (W_j + C + B_ki + w_k),

    Gain = (W_i + C + B_kj) - (W_i + C + B_ki - w_k) = w_k + B_kj - B_ki;    (5.1)

or

if (W_j + C + B_ki + w_k) > (W_i + C + B_ki - w_k),

    Gain = (W_i + C + B_kj) - (W_j + C + B_ki + w_k) = W_i - W_j + B_kj - B_ki - w_k.    (5.2)

Proof of the reasonableness of the definition. Before the move, the minimum-completion-time based local cost (deduced from (3.2) in the preceding section) is:

    W_i + C + B_kj,    since (W_i + C + B_kj) > (W_j + C);

after the move, the local cost becomes:

    W_i + C + B_ki - w_k,    if (W_i + C + B_ki - w_k) > (W_j + C + B_ki + w_k);

or

    W_j + C + B_ki + w_k,    if (W_j + C + B_ki + w_k) > (W_i + C + B_ki - w_k);

thus, the gain is:

if (W_i + C + B_ki - w_k) > (W_j + C + B_ki + w_k),

    Gain = (W_i + C + B_kj) - (W_i + C + B_ki - w_k) = w_k + B_kj - B_ki;

or

if (W_j + C + B_ki + w_k) > (W_i + C + B_ki - w_k),

    Gain = (W_i + C + B_kj) - (W_j + C + B_ki + w_k) = W_i - W_j + B_kj - B_ki - w_k.

Observation. If there is a local unbalance, i.e. W_i - W_j > w_max, then from (5.1) we have:

    Gain = w_k + B_kj - B_ki > B_kj - B_ki;

and from (5.2), since w_k <= w_max, we have:

    Gain = W_i - W_j + B_kj - B_ki - w_k > w_max - w_k + B_kj - B_ki >= B_kj - B_ki;


in both cases, we have: Gain > B_kj - B_ki. That means a move from the heavier cluster i to the lighter cluster j of a load-unbalanced cluster pair is guaranteed a positive gain as long as the binding value of the to-be-moved node k on its current cluster i is not larger than its binding value on the destination cluster j.
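The gain (5.1)/(5.2) can be checked numerically with the small sketch below (our illustration, not the paper's code; the common term C cancels, so it is omitted):

```python
# Hedged sketch of the gain (5.1)/(5.2) for moving node k from cluster i to j.
# B_ki / B_kj: binding values of k on clusters i and j; W_i, W_j: cluster
# weights; w_k: node weight. The shared communication term C cancels out.

def move_gain(W_i, W_j, w_k, B_ki, B_kj):
    before = W_i + B_kj                   # local cost before the move (minus C)
    after = max(W_i - w_k + B_ki,         # cluster i still dominates: case (5.1)
                W_j + w_k + B_ki)         # cluster j dominates after: case (5.2)
    return before - after

# Unbalanced pair with B_kj >= B_ki: positive gain, as the Observation states.
print(move_gain(W_i=100, W_j=40, w_k=10, B_ki=3, B_kj=8))  # prints 15
```

Here 15 = w_k + B_kj - B_ki, matching case (5.1).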

Algorithm KLBheuristic;
begin
  repeat
    Find the cluster minCluster with minimum weight in the clustered graph;
    Find the cluster localMaxCluster connected to minCluster with maximum weight;
    if ((localMaxCluster.weight - minCluster.weight) > w_max) then
      Find a node activeNode in localMaxCluster with maximum gain maxGain when moved to minCluster;
      if (maxGain > 0) then
        Move activeNode to minCluster and adjust all the weights and connections;
  until no more suitable minCluster and localMaxCluster can be found;
end.

5.3. The complexity

We shall see from the experiments described in the next section that the HEF and MLC based two-stage approaches have the best performance among the four algorithms mentioned. These two algorithms also spend much less run time than the HstEF and MnMLC based approaches. We show here, as an example, that the HEF based approach runs in linear time with respect to the node number of the graph to be allocated. The HEF clustering process goes over every node in the graph just once. For each node, it tries to find the maximum cost edge and then merges it. If the degree d of the graph is an order of magnitude or more smaller than the node number n (d << n), which is mostly the case in normal applications, then the time spent by this process is O(n). The grain packing process tries to find trivial nodes in the clustered graph and merges them with bigger clusters. It finally adjusts the cluster number to the number of available processors p. The time spent by this process is of the order O(n). Finally, the Kernighan-Lin based load balance adjusting heuristic tries to shorten the overall execution time by moving nodes between the clusters. Similarly to the grain packing process, the time spent by this process is also of the order O(n). So, the overall time spent by the HEF based approach is O(n), which is linear. The experimental results shown in the following section confirm this analysis.
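The KLB refinement loop of Section 5.2 can be sketched as follows. This is a hedged executable illustration, not the paper's implementation: it simplifies by pairing the globally heaviest cluster with the lightest one (instead of the heaviest cluster connected to minCluster), and the data layout and names are our assumptions.

```python
# Hedged sketch of the KLB refinement loop: repeatedly move the best-gain
# node from the heaviest cluster to the lightest one while the pair is
# unbalanced by more than w_max.

def klb_refine(clusters, node_w, edge_c, w_max):
    # clusters: list of sets of nodes; node_w: {node: weight};
    # edge_c: {(u, v): cost} with both orientations present.
    def weight(c):
        return sum(node_w[v] for v in clusters[c])
    def binding(k, c):                    # B_kc: edges from k into cluster c
        return sum(edge_c.get((k, v), 0) for v in clusters[c])
    while True:
        j = min(range(len(clusters)), key=weight)   # minCluster
        i = max(range(len(clusters)), key=weight)   # heaviest (simplified)
        if weight(i) - weight(j) <= w_max:
            return clusters
        def gain(k):                      # gain (5.1)/(5.2) of moving k: i -> j
            before = weight(i) + binding(k, j)
            after = max(weight(i) - node_w[k] + binding(k, i),
                        weight(j) + node_w[k] + binding(k, i))
            return before - after
        k = max(clusters[i], key=gain)
        if gain(k) <= 0:
            return clusters
        clusters[i].remove(k)             # perform the move
        clusters[j].add(k)

clusters = [{0, 1, 2}, {3}]
node_w = {0: 5, 1: 5, 2: 5, 3: 2}
edge_c = {(0, 1): 2, (1, 0): 2, (2, 3): 4, (3, 2): 4}
print(klb_refine(clusters, node_w, edge_c, w_max=5))
```

On this example the node most strongly bound to the light cluster (node 2) is moved, after which the pair is balanced within w_max and the loop stops.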

6. Discussion of results

The adjacency-structure representation [33] has been used for the graph in our practical implementation. As we have in mind a planar grid graph partitioning problem for an image processing application [38,39], the present experiments test the performance of the two-stage approach on planar grid graph partitioning. A planar grid graph is an n^{1/2} by n^{1/2} planar graph with nodes located on the crossing points of a weighted edge mesh network. A 3 by 3 planar grid graph is shown below (Fig. 1). All the experiments have been done on a Macintosh IIfx machine using THINK Pascal.

Fig. 1. A 3 by 3 planar grid graph.

The fine- and coarse-grain computing models have both been considered in our experiments, since the algorithms may behave differently on differently featured graphs. A grain is defined as a set of computation steps that is to be executed sequentially by a single processor. Coarse-grain computing, with regard to a specific multiprocessor system, has the feature that the computation load on a single processor in that system is an order of magnitude or more larger than the communication cost in the same system. In the case of a fine-grain computation, the computation load and the communication cost are of approximately the same order of magnitude. Considering that the proposed algorithms may behave differently on these two kinds of computations, experiments have been conducted on partitioning both computation graphs.

6.1. Performance for coarse-grain computing

The graphs processed here are a group of five planar grid graphs, each with 100 nodes. The average edge weight is 10 units, with edge weights varying from 1 to 20. The average node weight is 147.6 units, with node weights varying from 1 to 300. The partitionings have been done on 2- and 10-processor systems respectively. Costs and timings are given in Tables 1 and 2.

Table 1
Costs

2 processor costs         HEF       HstEF     MLC       MnMLC
Before KLB   Costs        9850.2    7932      7980.2    8173.4
             Variances    3316.8    562.2     631       910.8
After KLB    Costs        7748.6    7601.6    7633      7625.4
             Variances    132.2     76.8      90        98.4
KLB improvements          21.34%    4.2%      4.35%     6.70%

10 processor costs        HEF       HstEF     MLC       MnMLC
Before KLB   Costs        2258.6    2442      2221.4    2227.2
             Variances    1534.8    1456.6    1327.6    1173.2
After KLB    Costs        1700.4    1714.6    1705.4    1711.8
             Variances    317.8     339.6     324.6     341.8
KLB improvements          24.71%    29.18%    23.23%    23.14%


Table 2
Time spending (in milliseconds)

2 processors     HEF      HstEF     MLC      MnMLC
+ Packing        775.4    2992.8    730      1453
KLB              115.6    19.6      16.6     20.6
Total            890      3012.4    746.6    1473.6

10 processors    HEF      HstEF     MLC      MnMLC
+ Packing        666.2    2847.4    739.4    1592.8
KLB              89.6     64.2      52.6     52
Total            755.8    2911.6    792      1644.8

Table 3
Costs

2 processor costs         HEF       HstEF     MLC       MnMLC
Before KLB   Costs        1097.2    924.6     929.2     944.4
             Variances    344.2     58.4      65.8      93.8
After KLB    Costs        906.2     909.2     885.6     897.8
             Variances    13.4      23.6      8.8       9.2
KLB improvements          17.41%    1.67%     4.69%     4.93%

10 processor costs        HEF       HstEF     MLC       MnMLC
Before KLB   Costs        236       284.6     259.6     253
             Variances    150       136.8     118.6     81.4
After KLB    Costs        288.2     230.2     233.2     229.8
             Variances    174.4     49.4      61.4      70.4
KLB improvements          18.11%    19.11%    10.17%    9.17%

6.2. Performance for fine-grain computing

The test conditions are almost the same as in the former test; only the average node weight has been changed, to 14.8 units, with node weights varying from 1 to 30 (Tables 3 and 4). We can see from the preceding experimental results that the algorithms perform consistently for both fine-grain and coarse-grain applications.

Table 4
Time spending (in milliseconds)

2 processors     HEF      HstEF     MLC      MnMLC
+ Packing        753.6    3137.8    730.2    1427
KLB              140      18.6      20.8     23
Total            893.6    3156.4    751      1450

10 processors    HEF      HstEF     MLC      MnMLC
+ Packing        727.2    2842.6    740.2    1550.6
KLB              89.4     77.6      54.2     45.2
Total            816.6    2920.2    794.4    1595.8


Fig. 2. Relative performance (I = 10000): performance measure versus number of processors (0-12) for HEF, HstEF, MLC and MnMLC.

To give an intuitive view of the relative performance of the described algorithms, a measure relating the improvement in cost to the execution time has been introduced. Clearly, an algorithm performs better if it reaches a smaller cost while, at the same time, using a smaller amount of running time. Suppose a large integer I is bigger than all the minimized costs achieved by the algorithms; then the relative performance is measured as

    Performance Measure = ln((I - Cost) / Time)                    (6.1)

A performance curve has been drawn in Fig. 2 based on the performance measure (6.1) (with I = 10000) and the experimental data, which consist of both fine- and coarse-grain tests. The data show that the two-stage approaches based on MLC and HEF supply a much better performance; furthermore, they are easier to implement. A simulated annealing implementation [39] has also been used for the same partitioning, as presented in a separate paper; the collected performance data show that our algorithms significantly outperform the simulated annealing algorithm under the (6.1) measure.
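The measure (6.1) rewards low cost and low running time simultaneously. A small sketch of how it could be evaluated is given below; the function follows the formula directly, while the cost values are hypothetical (the paper's cost data are in Table 3, not reproduced here) and the times are the 2-processor "Total" row of Table 4:

```python
import math

def performance_measure(cost, time_ms, I=10000):
    """Measure (6.1): ln((I - Cost) / Time). Larger is better, since it
    rewards both a smaller minimized cost and a smaller running time.
    I must exceed every minimized cost so the logarithm is defined."""
    return math.log((I - cost) / time_ms)

# Hypothetical minimized costs (for illustration only) paired with the
# measured 2-processor total times from Table 4 (milliseconds).
results = {
    "HEF":   (2500, 893.6),
    "HstEF": (2600, 3156.4),
    "MLC":   (2400, 751.0),
    "MnMLC": (2550, 1450.0),
}

for name, (cost, time_ms) in results.items():
    print(f"{name}: {performance_measure(cost, time_ms):.3f}")
```

With these illustrative numbers, MLC scores highest, matching the paper's observation that a smaller cost reached in less time yields a larger measure.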

7. Conclusion and future work

This paper has given a comprehensive discussion of the state-of-the-art partitioning approaches, and an effective two-stage partitioning procedure has been proposed based on that work. The following four extensions to earlier work are the major contributions of this paper: (1) A new gain measurement ((5.1) and (5.2)) has been proposed, based on a new cost function (3.2) that aims at a minimum completion time instead of minimizing the total communication cost in a general way;


(2) The former methods have been generalized to graphs with both weighted nodes and weighted edges, a capability lacking in most earlier methods; (3) The so-called reassignment problem has been solved elegantly in a new and flexible way; (4) An iterative Kernighan-Lin-based heuristic, in which load balancing plays an important role, has been presented. A comprehensive test of the two-stage approach on general graphs and its comparison with a simulated annealing implementation is ongoing work. A complexity analysis of the two-stage approach will follow soon. These results will be presented in a separate paper.

Acknowledgements

The author would like to express sincere thanks to his supervisor, Prof. L. Richter, for his continuous encouragement and support during this work. Without them, the work would not have been possible.

References

[1] I.G. Angus, G.C. Fox, J.S. Kim and D.W. Walker, Solving Problems on Concurrent Processors: General Techniques and Regular Problems, Vol. 1 (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[2] J. Baxter and J.H. Patel, The LAST algorithm: A heuristic-based static task allocation algorithm, Proc. IEEE Internat. Conf. on Parallel Processing (1989).
[3] E.R. Barnes, An algorithm for partitioning the nodes of a graph, SIAM J. Algeb. Discr. Math. 3 (4) (Dec. 1982).
[4] E.R. Barnes, A. Vannelli and J.Q. Walker, A new heuristic for partitioning the nodes of a graph, SIAM J. Discrete Math. 1 (3) (1988).
[5] T. Bui, S. Chaudhuri, T. Leighton and M. Sipser, Graph bisection algorithms with good average case behavior, Proc. 25th Ann. IEEE Symp. on Foundations of Computer Science (1984).
[6] T. Bui, C. Heigham, C. Jones and T. Leighton, Improving the performance of the Kernighan-Lin and simulated annealing graph bisection algorithms, Proc. 26th ACM/IEEE Design Automation Conf. (1989).
[7] R.B. Boppana, Eigenvalues and graph bisection: An average case analysis, Proc. 28th Ann. IEEE Symp. on Foundations of Computer Science (1987).
[8] S.H. Bokhari, Assignment Problems in Parallel and Distributed Computing (Kluwer, Dordrecht, 1987).
[9] K.M. Doty, P.L. McEntire and J.G. O'Reilly, Task allocation in a distributed computer system, Proc. IEEE Infocom 82 (1982) 33-38.
[10] J.G. Donnett, M. Starkey and D.B. Skillicorn, Effective algorithms for partitioning distributed programs, Proc. Internat. Phoenix Conf. on Computers and Communications (1988).
[11] K. Efe, Heuristic models of task assignment scheduling in distributed systems, IEEE Comput. (June 1982).
[12] C.M. Fiduccia and R.M. Mattheyses, A linear time heuristic for improving network partitions, Proc. 19th ACM/IEEE Design Automation Conf. (1982).
[13] M.R. Garey, D.S. Johnson and L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976).
[14] J. Garbers, H.J. Proemel and A. Steger, Finding clusters in VLSI circuits, Proc. Internat. Conf. on Computer-Aided Design (1990) 520-523.
[15] M.K. Goldberg and M. Burstein, Heuristic improvement technique for bisection of VLSI networks, Proc. IEEE Internat. Conf. on Computer Design: VLSI in Computers (1983).
[16] V.B. Gylys and J.A. Edwards, Optimal partitioning of workload for distributed systems, Proc. Compcon Fall (1976) 353-356.
[17] D.S. Johnson, C.R. Aragon, L.A. McGeoch and C. Schevon, Optimization by simulated annealing: An experimental evaluation (Part I), Preprint, AT&T Bell Laboratories, Murray Hill, NJ, 1985.
[18] B.W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, Bell Syst. Tech. J. 49 (1970).
[19] B. Krishnamurthy, An improved min-cut algorithm for partitioning VLSI networks, IEEE Trans. Comput. C-33 (5) (1984).
[20] B. Kruatrachue and T. Lewis, Grain size determination for parallel processing, IEEE Software (Jan. 1988).
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout (Wiley, New York, 1990).
[22] T. Leighton and S. Rao, An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms, Proc. 29th Ann. IEEE Symp. on Foundations of Computer Science (1988).
[23] R.J. Lipton and R.E. Tarjan, A separator theorem for planar graphs, SIAM J. Applied Math. 36 (2) (April 1979).
[24] V.M. Lo, Heuristic algorithms for task assignment in distributed systems, Proc. 4th Internat. Conf. on Distributed Computing Systems (May 1984) 30-39.
[25] P. Ma, A model to solve timing-critical application problems in distributed computer systems, Computer 17 (1) (Jan. 1984) 62-68.
[26] P. Ma, E.Y. Lee and M. Tsuchiya, A task allocation model for distributed computing systems, IEEE Trans. Comput. C-31 (1) (Jan. 1982) 41-45.
[27] C. McCreary and H. Gill, Automatic determination of grain size for efficient parallel processing, Comm. ACM 32 (9) (Nov. 1989) 1073-1078.
[28] D.A. Plaisted, A heuristic algorithm for small separators in arbitrary graphs, SIAM J. Comput. 19 (2) (April 1990).
[29] J.H. Reif, Minimum s-t cut of a planar undirected network in O(n log2(n)) time, SIAM J. Comput. 12 (1) (Feb. 1983).
[30] S. Rao, Finding near optimal separators in planar graphs, Proc. 28th Ann. IEEE Symp. on Foundations of Computer Science (1987).
[31] L.A. Sanchis, Multiple-way network partitioning, IEEE Trans. Comput. 38 (1) (Jan. 1989).
[32] Y. Saab and V. Rao, An evolution-based approach to partitioning ASIC systems, Proc. 26th ACM/IEEE Design Automation Conf. (1989).
[33] R. Sedgewick, Algorithms (Addison-Wesley, Reading, MA, 1988) 421.
[34] J. Sheild, Partitioning concurrent VLSI simulation programs onto a multiprocessor by simulated annealing, IEE Proc. 134 (Jan. 1987) 24-30.
[35] H.S. Stone, Multiprocessor scheduling with the aid of network flow algorithms, IEEE Trans. Software Engrg. 3 (1) (Jan. 1977) 85-94.
[36] D. Towsley, Allocating programs containing branches and loops within a multiple processor system, IEEE Trans. Software Engrg. 12 (10) (Oct. 1986).
[37] C.W. Yeh, C.K. Cheng and T.T. Lin, A general purpose multiple way partitioning algorithm, Proc. 28th ACM/IEEE Design Automation Conf. (1991).
[38] H.B. Zhou, Image processing in a workstation-based distributed system, Proc. 2nd Internat. Conf. on Automation, Robotics and Computer Vision, Singapore (1992).
[39] H.B. Zhou, Effective algorithms for distributed program allocation, in preparation.
[40] Y. Zhu, Workload scheduling: A new technique for scheduling task graphs with communication costs in parallel systems, Proc. Internat. Conf. on Parallel Processing (1991) II 288-289.