JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 38, 92–100 (1996), ARTICLE NO. 0131

Toward Efficient Scheduling of Evolving Computations on Rings of Processors*

LI-XIN GAO AND ARNOLD L. ROSENBERG†

Department of Computer Science, University of Massachusetts, Amherst, Massachusetts 01003

* A portion of this paper was presented at the 6th IEEE Symposium on Parallel and Distributed Processing, 1994, under the title "On Balancing Computational Load on Rings of Processors."
† E-mail: [email protected].

We study a simple, low-overhead policy for scheduling dynamically evolving computations in which tasks that spawn produce precisely two offspring, on rings of processors. Such computations include, for instance, tree-structured branching computations. We believe that our policy yields good parallel speedup on large classes of these computations, but we have not yet been able to verify this. In the current paper, we adduce evidence that the policy works well on computations that end up being large and "bushy," by showing (a) that it balances loads well as long as tasks keep spawning, and (b) that it yields asymptotically optimal parallel speedup when the evolving computations end up having the structure of complete binary trees or of two-dimensional pyramidal meshes. Specifically, we show that a p-processor ring can execute a computation that evolves into the height-n complete binary tree (which has $2^n - 1$ nodes) in time

$$T_{\mathrm{tree}}(n; p) \le \frac{1}{p}(2^n - 1) + p + (2\cos(\pi/p))^n = (1 + o(1))\,\frac{1}{p}(2^n - 1).$$

Similarly, the ring can execute a computation that evolves into the side-n pyramidal mesh (which has $\binom{n+1}{2}$ nodes) in time

$$T_{\mathrm{mesh}}(n; p) \le \frac{1}{p}\binom{n+1}{2} + \frac{3}{2}n + 2 = (1 + o(1))\,\frac{1}{p}\binom{n+1}{2}.$$

© 1996 Academic Press, Inc.

1. INTRODUCTION

The promise of parallel computers to accelerate computation relies on an algorithm designer's ability to keep all (or most) of the computers' processors fruitfully occupied (that is, busy on the computation of interest, not, e.g., on auxiliary tasks to keep loads balanced) all (or most) of the time. The problem of balancing computational loads to approach this goal has received considerable attention since the advent of (even the promise of) parallel computers (cf. [1]). In this paper, we describe a simple, low-overhead policy which we believe balances and schedules loads well for a variety of dynamically evolving computations, on parallel architectures whose underlying structure is a ring of identical processors. The challenge in balancing loads on such architectures lies in their large diameters and small bisection-bandwidths, which preclude certain provably effective balancing-via-randomizing policies [5–8]. While we have not yet been able to delimit the class of computations on which our policy works well, we report here a first step toward this end. We adduce evidence here that the policy works well on computations that end up being large and "bushy," by showing (a) that it balances loads well as long as tasks keep spawning, and (b) that it yields asymptotically optimal parallel speedup on dynamically evolving computations that end up having the structure of complete binary trees or of two-dimensional pyramidal meshes. Specifically, our policy allows a p-processor ring to execute a computation that ends up with the structure of the height-n complete binary tree (which has $2^n - 1$ nodes) in time

$$T_{\mathrm{tree}}(n; p) \le \frac{1}{p}(2^n - 1) + p + (2\cos(\pi/p))^n = (1 + o(1))\,\frac{1}{p}(2^n - 1).$$

Similarly, it allows the ring to execute a computation that ends up with the structure of the side-n pyramidal mesh (which has $\binom{n+1}{2}$ nodes) in time

$$T_{\mathrm{mesh}}(n; p) \le \frac{1}{p}\binom{n+1}{2} + \frac{3}{2}n + 2 = (1 + o(1))\,\frac{1}{p}\binom{n+1}{2}.$$

These results support our belief, since our scheduling policy assumes no a priori knowledge of the final shape of the computation. We are in the process of studying our scheduling policy, together with some refinements, empirically, to determine its behavior on a much broader class of dynamically evolving computations [4]. We present the current results separately, since we believe that our techniques of analysis here may apply in other arenas.


FIG. 1. (a) The height-4 tree T_4. (b) The height-4 mesh G_4.

2. THE FORMAL SETTING

2.1. Our Load-Balancing Problem

The Architecture. We focus on rings of identical processors (PEs, for short). The p-PE ring, R_p, has PEs P_0, P_1, ..., P_{p-1}, with each P_i connected directly to its clockwise neighbor P_{i+1 mod p} and its counterclockwise neighbor P_{i-1 mod p}. PEs observe a single-port communication regimen, so in a single step a PE:
1. receives a message from one of its neighbors;
2. performs a computation;
3. transmits a message to one of its neighbors.
(In fact, our load-balancing algorithm transmits data only in a clockwise direction.)

The Computational Load. The computations we schedule have the structure of dynamically growing leveled dags (directed acyclic graphs), whose nodes represent computational tasks and whose arcs denote functional dependencies (in a sense that will become clear).

Binary Tree-Dags. An N-node binary tree-dag (tree, for short) T is a dag whose nodes comprise a set of N binary strings that is full (the string x0 is a node of T precisely when the string x1 is) and prefix-closed (the string x is a node of T whenever x0 and x1 are). The arcs of T lead each nonleaf (parent) node x to its left child x0 and its right child x1. The null string λ is the root of T; each childless node is a leaf of T. The length |x| of node x is its level in T (so the root is the unique node at level 0). The height of T is the number of distinct node-levels (= the number of nodes on the longest root-to-leaf path). The weight WGT(x) of node x is the number of 1's in the binary string x. Of particular interest is the height-n complete tree T_n, whose nodes comprise all $2^n - 1$ binary strings of length < n and whose leaves are the $2^{n-1}$ nodes at level n - 1; see Fig. 1a.

The (dynamic) computation that "generates" a tree T starts with the root of T (which is its only active leaf-task initially) and proceeds as follows, until no active leaf-tasks remain. At each step, some subset of the then-active leaf-tasks (the particular subset depending on the scheduling policy) get executed. An executed leaf-task may:
• halt, thereby becoming a permanent leaf;
• spawn two new active leaf-tasks, thereby becoming a parent.

A Motivating Scenario. We illustrate one "real" computational problem that abstracts to a tree. Consider the problem of numerically integrating a function f on a real interval [a, b] using the trapezoid rule. (Simpson's rule yields another motivating example.) Each task in this computation corresponds to a subinterval of [a, b]. The task associated with subinterval [c, d] proceeds as follows:
1. Evaluate the area of the trapezoid T having corners (in clockwise order) (c, 0), (c, f(c)), (d, f(d)), (d, 0).
2. If the quantity ½(d - c) is less than some prespecified (resolution) threshold, then return the area of T as the integral of f on [c, d], and halt; otherwise, proceed.
3. Evaluate the area of the trapezoid T′ having corners (c, 0), (c, f(c)), (½(c + d), f(½(c + d))), (½(c + d), 0); evaluate the area of the trapezoid T″ having corners (½(c + d), 0), (½(c + d), f(½(c + d))), (d, f(d)), (d, 0).
4. If the sum of the areas of T′ and T″ differs from the area of T by less than some prespecified (accuracy) threshold, then return the area of T as the integral of f on [c, d], and halt; otherwise, proceed.
5. Solve the two new tasks corresponding to the intervals [c, ½(c + d)] and [½(c + d), d].
In step 5, an active leaf spawns two new leaves; in steps 2 and 4, an active leaf halts with a subarea. Subareas are accumulated centrally as they become available.
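To make the control flow of such a task concrete, here is a minimal Python sketch of the subinterval task just described, together with a sequential driver; this is our illustration, not code from the paper, and the function f, the two thresholds, and the driver are placeholder assumptions.

```python
import math

def trapezoid_task(f, c, d, resolution, accuracy):
    """Execute one leaf-task for the subinterval [c, d].

    Returns ("halt", area) when the task becomes a permanent leaf, or
    ("spawn", (c, m), (m, d)) when it spawns two child tasks.
    """
    area = 0.5 * (d - c) * (f(c) + f(d))            # step 1: trapezoid T
    if 0.5 * (d - c) < resolution:                  # step 2: resolution test
        return ("halt", area)
    m = 0.5 * (c + d)
    left = 0.5 * (m - c) * (f(c) + f(m))            # step 3: trapezoid T'
    right = 0.5 * (d - m) * (f(m) + f(d))           #         trapezoid T''
    if abs(left + right - area) < accuracy:         # step 4: accuracy test
        return ("halt", area)
    return ("spawn", (c, m), (m, d))                # step 5: two new leaf-tasks

def integrate(f, a, b, resolution=1e-6, accuracy=1e-9):
    """Grow the tree sequentially and accumulate subareas centrally."""
    total, active = 0.0, [(a, b)]
    while active:
        c, d = active.pop()
        result = trapezoid_task(f, c, d, resolution, accuracy)
        if result[0] == "halt":
            total += result[1]
        else:
            active.extend(result[1:])
    return total

print(integrate(math.sin, 0.0, math.pi))   # roughly 2.0
```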

Two-Dimensional Mesh-Dags. An N-node two-dimensional mesh-dag (mesh, for short) G is a dag whose nodes comprise a set of N pairs of nonnegative integers that is full (the pair ⟨k+1, l⟩ is a node of G precisely when the pair ⟨k, l+1⟩ is) and prefix-closed: if the pair ⟨k, l⟩ is a node of G, then:
• if k = 0, then the pair ⟨k, max(0, l-1)⟩ is also a node of G;
• if l = 0, then the pair ⟨max(0, k-1), l⟩ is also a node of G;
• else, at least one of ⟨k-1, l⟩ and ⟨k, l-1⟩ is also a node of G.
The arcs of G lead from each nonsink (parent) node ⟨k, l⟩ to its left child ⟨k, l+1⟩ and its right child ⟨k+1, l⟩. The pair ⟨0, 0⟩ is the origin of G; each childless node is a sink of G. The sum k + l is the level of node ⟨k, l⟩ in G (so the origin is the unique node at level 0). The height of G is the number of distinct node-levels (= the number of nodes on the longest origin-to-sink path). Of particular interest is the side-n pyramidal mesh G_n, whose nodes comprise all $\binom{n+1}{2}$ pairs of nonnegative integers ⟨k, l⟩ such that k + l < n and whose sinks are the n pairs at level n - 1; see Fig. 1b.

The (dynamic) computation that "generates" a mesh G starts with the origin of G (which is its only active sink-task initially) and proceeds as follows, until no active sink-tasks remain. At each step, some subset of the then-active sink-tasks (the particular subset depending on the scheduling policy) get executed. An executed sink-task may:
• halt, thereby becoming a permanent sink;
• spawn two new task-arcs, thereby becoming a parent. Each newly spawned task-arc is either:
  - a unary task-arc leading from the executed task to a new active sink-task, or
  - a binary task-arc leading from the executed task-node either to a new inactive sink-task, or to a preexisting inactive sink-task that was created by another executed node (this clause precludes having "parallel" arcs from one node to another), which thereby becomes active.
Intuitively, an inactive sink-task becomes active when it gets the correct number of parents, meaning that its associated task has received all needed inputs.

A Motivating Scenario. Whereas our "real" tree-generating scenario produces a complete binary tree only occasionally, the following "real" mesh-generating scenario always produces a pyramidal mesh. We grow a pyramidal mesh whose nodes are the slots of a dynamic-programming table [2, Chap. 16], in order to allocate the dynamic program's tasks to the PEs of R_p. The dynamic program is executed by running the generating schedule "backward."
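As a concrete illustration of this growth rule (ours, not the paper's), the following Python sketch grows the side-n pyramidal mesh G_n, activating a sink only once it has received an arc from each of its parents; the assumption that every active sink spawns whenever its children lie inside G_n mirrors the dynamic-programming scenario above.

```python
from collections import defaultdict

def grow_pyramidal_mesh(n):
    """Grow the side-n pyramidal mesh G_n (nodes <k, l> with k + l < n).

    A sink becomes active once it has received an arc from each of its
    parents: one parent if k == 0 or l == 0, two parents otherwise.
    Returns the executed nodes, one list per level of the growth.
    """
    def parents_needed(k, l):
        return 1 if k == 0 or l == 0 else 2

    received = defaultdict(int)
    active = {(0, 0)}                  # the origin is the only active sink initially
    executed = []
    while active:
        executed.append(sorted(active))            # here, all active sinks execute
        next_active = set()
        for (k, l) in active:
            for child in ((k, l + 1), (k + 1, l)):   # left child, right child
                if child[0] + child[1] < n:          # stay inside G_n
                    received[child] += 1
                    if received[child] == parents_needed(*child):
                        next_active.add(child)
        active = next_active
    return executed

levels = grow_pyramidal_mesh(4)
assert sum(len(level) for level in levels) == 4 * 5 // 2   # binom(5, 2) nodes in G_4
print(levels)
```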

2.2. Our Main Results

We now describe our balancing-plus-scheduling policy. As noted earlier, we believe that the policy works well on dynamically evolving trees and meshes that grow into large, "bushy" dags. The results we present here support this belief by showing that our policy: (a) balances loads well while tasks keep spawning children; (b) schedules dags well when an evolving tree grows into a complete tree, or an evolving mesh grows into a pyramidal mesh: in these cases, our policy achieves asymptotically optimal parallel speedup, i.e., a factor-of-p speedup with p PEs. (In the numerical integration scenario, for instance, a complete tree emerges from an evolving tree all of whose leaf-tasks halt because of the resolution threshold.) We stress that neither our scheduling policy nor our analysis assumes preknowledge of the ultimate shape of an evolving tree or mesh; see the discussion after Theorem 2.2.

The KS–BF Policy. The balancing component of our balancing-plus-scheduling policy has each PE observe the regimen Keep-left–Send-right (KS) in response to a spawning task: a PE keeps the left child of the spawning task and sends the right child to its clockwise neighbor. The scheduling component of the policy mandates that each PE execute the tasks assigned to it in a locally Breadth-First (BF) manner: a PE keeps its as-yet unexecuted tasks in a priority queue, ordered by their levels (in the tree or mesh being executed) and, within a level, in breadth-first order. With trees, "breadth-first order" means lexicographic order of the string-names of the nodes; with meshes, it means order of the first entries of the integer-pair names of the nodes.

Details for Trees. Each computation begins with the root (and initial leaf) λ of the evolving tree T as the sole occupant of PE P_0's (priority-ordered) task-queue. At each step of the computation, the task-queue of each PE P_i contains some subset of the then-active leaves of T. Each P_i having a nonempty task-queue performs the following actions:
1. P_i executes the first active leaf x in its task-queue.
2. If leaf x spawns two children, then P_i adds the new leaf x0 (the left child) to its task-queue and transmits the new leaf x1 (the right child) to the task-queue of PE P_{i+1 mod p}.
We assess one time unit for this two-phase process. Adapting these details to meshes is straightforward and is left to the reader.

Work Distribution by the KS–BF Policy. We lend some intuition for the KS load-balancing regimen by illustrating in Table I how the regimen distributes the nodes of T_6 and of G_6 within R_4. Table I is computed easily via the following lemma, which specifies exactly which PE of R_p will execute each node of any evolving tree or mesh.

LEMMA 2.1. Under the KS load-balancing regimen:
(a) each node x of an evolving tree T is executed at PE P_{WGT(x) mod p} of R_p;
(b) each node ⟨k, l⟩ of an evolving mesh G is executed at PE P_{k mod p} of R_p.
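The lemma makes the assignment purely arithmetic, so it can be tabulated mechanically. The following Python sketch (our illustration; the helper names are ours) computes the assignments for T_6 and G_6 on R_4 and reproduces the rows of Table I below.

```python
from itertools import product

def tree_pe(x, p):
    """PE index for tree node x (a binary string): WGT(x) mod p."""
    return x.count("1") % p

def mesh_pe(k, l, p):
    """PE index for mesh node <k, l>: k mod p (l plays no role)."""
    return k % p

def tree_assignment(n, p):
    """Map each PE to its nodes of the height-n complete tree T_n, in level order."""
    table = {i: [] for i in range(p)}
    for length in range(n):                         # levels 0 .. n-1
        for bits in product("01", repeat=length):
            x = "".join(bits)                       # the empty string is the root
            table[tree_pe(x, p)].append(x or "λ")
    return table

def mesh_assignment(n, p):
    """Map each PE to its nodes of the side-n pyramidal mesh G_n."""
    table = {i: [] for i in range(p)}
    for k in range(n):
        for l in range(n - k):
            table[mesh_pe(k, l, p)].append((k, l))
    return table

print(tree_assignment(6, 4)[0])   # P0's tree nodes: λ, 0, 00, 000, 0000, 1111, ...
print(mesh_assignment(6, 4)[3])   # P3's mesh nodes: (3, 0), (3, 1), (3, 2)
```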


TABLE I
Node Assignments When R_4 Executes T_6 and G_6 under the KS Regimen

T_6 (tree nodes, by PE and level):
  P0: level 0: λ; level 1: 0; level 2: 00; level 3: 000; level 4: 0000, 1111; level 5: 00000, 01111, 10111, 11011, 11101, 11110
  P1: level 1: 1; level 2: 01, 10; level 3: 001, 010, 100; level 4: 0001, 0010, 0100, 1000; level 5: 00001, 00010, 00100, 01000, 10000, 11111
  P2: level 2: 11; level 3: 011, 101, 110; level 4: 0011, 0101, 0110, 1001, 1010, 1100; level 5: 00011, 00101, 00110, 01001, 01010, 01100, 10001, 10010, 10100, 11000
  P3: level 3: 111; level 4: 0111, 1011, 1101, 1110; level 5: 00111, 01011, 01101, 01110, 10011, 10101, 10110, 11001, 11010, 11100

G_6 (mesh nodes, by PE and level):
  P0: level 0: ⟨0,0⟩; level 1: ⟨0,1⟩; level 2: ⟨0,2⟩; level 3: ⟨0,3⟩; level 4: ⟨0,4⟩, ⟨4,0⟩; level 5: ⟨0,5⟩, ⟨4,1⟩
  P1: level 1: ⟨1,0⟩; level 2: ⟨1,1⟩; level 3: ⟨1,2⟩; level 4: ⟨1,3⟩; level 5: ⟨1,4⟩, ⟨5,0⟩
  P2: level 2: ⟨2,0⟩; level 3: ⟨2,1⟩; level 4: ⟨2,2⟩; level 5: ⟨2,3⟩
  P3: level 3: ⟨3,0⟩; level 4: ⟨3,1⟩; level 5: ⟨3,2⟩

Proof. We prove only part (a), leaving the similar proof of part (b) to the reader. Note first that the root λ of T, which has weight WGT(λ) = 0, is executed at PE P_0. Assume inductively that some given (but arbitrary) nonleaf node x of T is executed at PE P_{WGT(x) mod p}. By definition of the KS regimen:
• if x spawns a left child x0, then x0 is executed at PE P_{WGT(x) mod p} (the "keep left" part of the regimen);
• if x spawns a right child x1, then x1 is executed at the clockwise neighbor P_{WGT(x)+1 mod p} of PE P_{WGT(x) mod p} (the "send right" part of the regimen).
The induction is thus extended, because WGT(x0) = WGT(x) and WGT(x1) = WGT(x) + 1. ∎

We infer directly from Lemma 2.1 that the KS–BF policy does not perform well on all trees or all meshes. To wit:

Trees. Consider the complete binary tree T_{h,λ_p} whose node-set comprises all binary strings of length < h that have weight ≤ λ_p, where λ_p denotes log p. When h ≫ λ_p, the number of nodes in T_{h,λ_p} is easily shown to be N(h; p) = Θ(h^{λ_p}). Clearly, no scheduling policy that employs the KS balancing regimen can even approach optimal parallel speedup on such trees, since the regimen assigns work to only λ_p PEs.

Meshes. Consider next meshes whose node-sets have the form ({i, j} × {0, 1, ..., m-1}) ∪ ({0, 1, ..., m-1} × {0}), where i ≡ 0 (mod p) and j ≡ 1 (mod p). Note that, for sufficiently large m, the KS balancing regimen assigns almost all the work on such meshes to PEs P_0 and P_1.

Situations Where KS–BF Does Well. Despite its bad worst-case behavior, we believe that the KS–BF policy performs well when scheduling evolving trees and meshes that end up being large and "bushy." This belief is supported by the following two results: the first (Theorem 2.1) suggests that the KS regimen balances loads well on "bushy" dags; the second (Theorem 2.2) presents a natural class of "bushy" dags which the KS–BF policy schedules asymptotically optimally. We view Theorem 2.2 as the more important of these results, since just balancing computational loads does not preclude pathological situations wherein PEs do equal work, but with so little concurrency that one achieves no speedup over a sequential computer.

THEOREM 2.1. Focus on an evolving dag in which every executed task spawns two new tasks. After N ≥ p - 1 steps of the KS balancing regimen, the numbers of unexecuted tasks residing in the heaviest- and lightest-loaded PEs of R_p differ by p - 2.

Proof. Easily, after p - 1 steps of the KS regimen, every PE of R_p contains at least one task awaiting execution. Less obviously, the disparity in load between the heaviest- and lightest-loaded PEs at this moment is exactly p - 2. To wit, when PE P_{p-1} first receives a task to execute, the work profile within R_p is as follows: PE P_0 contains a single task awaiting execution, while, for the remaining PE-indices i ∈ {1, 2, ..., p - 1}, PE P_i contains p - i tasks awaiting execution. As long as all PEs have nonzero load, the KS regimen uniformly adds one more task to each PE at each step, hence continually maintains disparity p - 2. ∎
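As a quick sanity check on Theorem 2.1 (ours, not part of the paper), the following sketch simulates the KS regimen on a computation in which every executed task spawns two offspring and reports the load disparity; it assumes that every PE with work executes exactly one task per step.

```python
def ks_all_spawn_loads(p, steps):
    """Simulate the KS regimen when every executed task spawns two tasks.

    Each step, every PE with work executes one task, keeps the left child,
    and sends the right child to its clockwise neighbor.  Returns the load
    vector after `steps` steps.
    """
    load = [0] * p
    load[0] = 1                          # the root task starts at P0
    for _ in range(steps):
        sent = [1 if load[i] > 0 else 0 for i in range(p)]
        for i in range(p):
            # executing a task removes it but keeps one child (net change 0);
            # the right child arrives from the counterclockwise neighbor.
            load[i] += sent[i - 1]       # sent[-1] is P_{p-1}'s message to P_0
    return load

p = 8
for n in (p - 1, 20, 50):
    load = ks_all_spawn_loads(p, n)
    print(n, load, max(load) - min(load))   # disparity is p - 2 = 6 once n >= p - 1
```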

THEOREM 2.2. When the ring R_p uses the KS–BF balancing-plus-scheduling policy:

(a) It executes each evolving tree that grows into the height-n complete binary tree T_n in time
$$T_{\mathrm{tree}}(n; p) \le \frac{1}{p}(2^n - 1) + p + (2\cos(\pi/p))^n = (1 + o(1))\,\frac{1}{p}(2^n - 1).$$

(b) It executes each evolving mesh that grows into the side-n pyramidal mesh G_n in time
$$T_{\mathrm{mesh}}(n; p) \le \frac{1}{p}\binom{n+1}{2} + \frac{3}{2}n + 2 = (1 + o(1))\,\frac{1}{p}\binom{n+1}{2}.$$

Both of these times are asymptotically optimal.

It is important to stress the significance of the phrase "that grows into" in the Theorem. KS–BF is an on-line policy: it does not know what shape the evolving dag will end up with. Despite this, the policy achieves asymptotically optimal performance on complete binary trees and pyramidal meshes. If we knew a priori that we were scheduling a complete binary tree or a pyramidal mesh, then we could easily produce simple, exactly optimal schedules (which are left to the reader). We prove part (a) of Theorem 2.2 in Section 3 and part (b) in Section 4.

3. ANALYZING THE KS–BF POLICY ON TREES

We prove part (a) of Theorem 2.2 in two steps. First, in Section 3.1, we prove that the KS regimen asymptotically balances the workload of R_p's PEs while executing any complete binary tree T_n. Then, in Section 3.2, we prove that the BF schedule for a KS-balanced workload ensures that, once a PE of R_p first receives a task of T_n to execute, it will always have work to do until it has completed all of its work. Note that, whereas p is a fixed but arbitrary constant throughout, n ranges over all (positive) integers for each value of p.

3.1. Work Distribution under the KS Regimen

Our analysis of the workload of each PE of R_p while executing a tree that grows into T_n builds on Lemma 2.1(a), which allows us to profile the distribution of work among the PEs. For 0 ≤ i ≤ p - 1, let W_i(n) denote the total work done by PE P_i during this execution.

LEMMA 3.1. The exact value of W_i(n) is given by
$$W_i(n) = \sum_{\substack{1 \le k \le n \\ k \equiv i+1 \ (\mathrm{mod}\ p)}} \binom{n}{k}. \tag{3.1}$$
This yields the implicit bound
$$\left| W_i(n) - \frac{2^n - 1}{p} \right| \le (2\cos(\pi/p))^n = o(2^n). \tag{3.2}$$

Proof. For any k, there are $\binom{k}{w}$ length-k binary strings of weight w. By Lemma 2.1(a) and the definition of complete binary tree, therefore, precisely
$$W_i(n; l) = \sum_{j \equiv i \ (\mathrm{mod}\ p)} \binom{l}{j}$$
nodes from level l of T_n (where 0 ≤ l < n) are executed by PE P_i of R_p. Of course, the workshare W_i(n) is just the summation of W_i(n; l) over all levels of T_n. In other words,
$$W_i(n) = \sum_{l=0}^{n-1} W_i(n; l) = \sum_{l=0}^{n-1} \ \sum_{j \equiv i \ (\mathrm{mod}\ p)} \binom{l}{j}.$$
This double summation yields Eq. (3.1) if one interchanges the order of summation (and invokes the identity $\sum_{l=0}^{n-1} \binom{l}{j} = \binom{n}{j+1}$).

We derive the more perspicuous bound (3.2) on W_i(n) by gauging the actual workshares' deviations from the ideal workshare. In what follows, γ is a primitive pth root of unity, and F_p(γ) is the order-p DFT (Discrete Fourier Transform [2, p. 786]) matrix
$$F_p(\gamma) = \begin{pmatrix}
1 & 1 & 1 & 1 & \cdots & 1 \\
1 & \gamma & \gamma^2 & \gamma^3 & \cdots & \gamma^{p-1} \\
1 & \gamma^2 & \gamma^4 & \gamma^6 & \cdots & \gamma^{2(p-1)} \\
1 & \gamma^3 & \gamma^6 & \gamma^9 & \cdots & \gamma^{3(p-1)} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \gamma^{p-1} & \gamma^{2(p-1)} & \gamma^{3(p-1)} & \cdots & \gamma^{(p-1)^2}
\end{pmatrix}.$$
(The authors discovered after completing this paper that essentially the same analysis appears, with different motivation, in the unpublished master's thesis [3].)

The following facts allow us to obtain information about the workshares W_i(n) via calculations involving F_p(γ). First, because every node of T_n is executed exactly once, we have

Fact 3.1. The cumulative workload of all PEs of R_p when executing T_n is
$$\sum_{i=0}^{p-1} W_i(n) = 2^n - 1.$$

Next, by regrouping terms and noting that γ is a primitive pth root of unity, we have

Fact 3.2. For each k ∈ {1, 2, ..., p - 1},
$$\sum_{i=0}^{p-1} W_i(n)\,\gamma^{k((i+1) \bmod p)} = \sum_{i=0}^{p-1} \gamma^{k((i+1) \bmod p)} \sum_{\substack{1 \le l \le n \\ l \equiv i+1 \ (\mathrm{mod}\ p)}} \binom{n}{l} = (1 + \gamma^k)^n - 1.$$

Combining the well-known fact that the matrix F_p(γ) is nonsingular and has inverse
$$F_p^{-1}(\gamma) = \frac{1}{p}\,F_p(\gamma^{-1})$$
with Facts 3.1 and 3.2, we obtain the following expression for the workshares W_i(n):
$$\begin{pmatrix} W_0(n) \\ W_1(n) \\ \vdots \\ W_{p-2}(n) \\ W_{p-1}(n) \end{pmatrix}
= F_p^{-1}(\gamma)
\begin{pmatrix} 2^n - 1 \\ (1+\gamma)^n - 1 \\ (1+\gamma^2)^n - 1 \\ \vdots \\ (1+\gamma^{p-1})^n - 1 \end{pmatrix}. \tag{3.3}$$

Equation (3.3) yields the explicit expression for W_i(n) in terms of γ:
$$W_i(n) = \frac{1}{p}\left[(2^n - 1) + \sum_{j=1}^{p-1} \gamma^{-j(i+1)}\big((1+\gamma^j)^n - 1\big)\right]
= \frac{1}{p}\left[2^n + \sum_{j=1}^{p-1} \gamma^{-j(i+1)}(1+\gamma^j)^n\right], \tag{3.4}$$
the second equality following since $\sum_{j=0}^{p-1} \psi^j = 0$ for any pth root of unity ψ ≠ 1 (the sum is invariant under multiplication by ψ). The final expression in (3.4) leads directly to bound (3.2), after we invoke the triangle inequality and recall that every γ^j is a pth root of unity:
$$\left| W_i(n) - \frac{2^n - 1}{p} \right|
= \left| \frac{1}{p} \sum_{j=1}^{p-1} \gamma^{-j(i+1)}(1+\gamma^j)^n + \frac{1}{p} \right|
\le \frac{1}{p} \sum_{j=1}^{p-1} \big| (1+\gamma^j)^n \big| + \frac{1}{p}. \tag{3.5}$$
By setting
$$\alpha_p = \max_{0 < j < p} |1 + \gamma^j| = |1 + \gamma| = 2\cos(\pi/p) < 2,$$
we finally obtain from inequality (3.5) the asymptotically optimal (since α_p < 2) bound of (3.2):
$$\left| W_i(n) - \frac{2^n - 1}{p} \right| \le \frac{p-1}{p}\,\alpha_p^n + \frac{1}{p} \le \alpha_p^n. \ \ ∎$$
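The closed form (3.1), Fact 3.1, and the deviation bound (3.2) are easy to check numerically. The sketch below (our illustration, not part of the paper) compares the binomial-sum workshares against a brute-force count of T_n's nodes and prints the largest deviation next to the bound $(2\cos(\pi/p))^n$.

```python
import math
from itertools import product

def workshares_closed_form(n, p):
    """W_i(n) = sum of binom(n, k) over 1 <= k <= n with k = i+1 (mod p), per (3.1)."""
    return [sum(math.comb(n, k) for k in range(1, n + 1) if k % p == (i + 1) % p)
            for i in range(p)]

def workshares_brute_force(n, p):
    """Count the nodes of T_n assigned to each PE by Lemma 2.1(a)."""
    w = [0] * p
    for length in range(n):
        for bits in product((0, 1), repeat=length):
            w[sum(bits) % p] += 1
    return w

n, p = 16, 5
closed = workshares_closed_form(n, p)
assert closed == workshares_brute_force(n, p)
assert sum(closed) == 2 ** n - 1                      # Fact 3.1
ideal = (2 ** n - 1) / p
bound = (2 * math.cos(math.pi / p)) ** n
print(closed)
print(max(abs(w - ideal) for w in closed), "<=", bound)   # bound (3.2)
```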

3.2. Running Time under the KS–BF Policy

We now analyze the time required by R_p to execute the tree T_n under the KS–BF policy. To this end, we establish the notation in Table II, which will be reused in Section 4.2.

TABLE II
Notation for Lemmas 3.2 and 4.2

PROC(x): the index of the PE at which node x is executed
PRECEDE(x): the set of nodes that are assigned to PE P_PROC(x) and that precede x in breadth-first order

LEMMA 3.2. Each node x of T_n is executed by PE P_PROC(x) of R_p at step τ(x), defined as |PRECEDE(x)| + PROC(x). Therefore, R_p executes an evolving tree that grows into T_n in time
$$T_{\mathrm{tree}}(n; p) \le p + \max_{0 \le i \le p-1} W_i(n).$$

Proof. We argue: (a) that the KS balancing regimen ensures that each PE P_i of R_p starts working at time i; (b) that the BF scheduling policy guarantees that each PE P_i of R_p performs all of its work in an uninterrupted block of W_i(n) steps.

Assertion (a) is immediate by induction, since only P_0 has work at step 0 of the execution, and the KS regimen passes work along precisely one PE per step.

We establish assertion (b) by verifying that each node x of T_n gets executed by PE P_PROC(x) at step τ(x). Note that node x could not be executed any earlier, because PE P_PROC(x) does not start working until step PROC(x), and there are |PRECEDE(x)| tree-nodes that P_PROC(x) must execute (because of the BF policy) before it gets to node x. We complete the proof by verifying, by induction on the breadth-first order of the nodes of T_n, that node x is available for execution at time τ(x).

We remark first that the root λ of T_n gets executed at PE P_0 at step τ(λ) = 0, as predicted by the fact that PROC(λ) = |PRECEDE(λ)| = 0. We next focus on a nonroot node xd of T_n, where d ∈ {0, 1}, and assume that every node y of T_n which precedes xd in breadth-first order is executed at step τ(y). Now, since the parent x of node xd precedes it in breadth-first order, we know by induction that node x is executed at step τ(x). Therefore, node xd resides in the task-queue of PE P_PROC(xd) beginning at step |PRECEDE(x)| + PROC(x) + 1. Consider two cases.

If node xd is a left child of its parent x (i.e., d = 0), then PROC(x0) = PROC(x) under the KS regimen. Therefore, |PRECEDE(x)| < |PRECEDE(x0)|, so that node x0 is available to be executed by step τ(x0).

Else, node xd is a right child of its parent x (i.e., d = 1), so that PROC(x1) = PROC(x) + 1 mod p under the KS regimen. In this case, |PRECEDE(x1)| ≥ |PRECEDE(x)| because, for each y ∈ PRECEDE(x), we have $y0^{|x|-|y|}1$ ∈ PRECEDE(x1). (We know that $y0^{|x|-|y|}1$ is a node of T_n because it precedes x1 in breadth-first order, and, by hypothesis, x1 is a node of T_n.) We now distinguish two subcases.

If PROC(x) < p - 1, then PROC(x1) = PROC(x) + 1, so that
$$|\mathrm{PRECEDE}(x)| + \mathrm{PROC}(x) + 1 = |\mathrm{PRECEDE}(x)| + \mathrm{PROC}(x1) \le |\mathrm{PRECEDE}(x1)| + \mathrm{PROC}(x1).$$

Else, we must have PROC(x) = p - 1, so that PROC(x1) = 0 = PROC(x) - p + 1. In this case, |x1| ≥ p, since node x1 is a right child, so its weight is positive and (by Lemma 2.1(a)) divisible by p. It follows that PRECEDE(x1) must contain, in addition to all nodes of the form $y0^{|x|-|y|}1$ where y ∈ PRECEDE(x), at least the p + 1 additional weight-0 nodes $\{0^i \mid 0 \le i \le p\}$. We thus have
$$|\mathrm{PRECEDE}(x)| + \mathrm{PROC}(x) + 1 = |\mathrm{PRECEDE}(x)| + \mathrm{PROC}(x1) + p < |\mathrm{PRECEDE}(x1)| + \mathrm{PROC}(x1).$$

In either subcase, node x1 is available to be executed no later than step τ(x1). Thus, node xd is always executed precisely at step τ(xd), extending the induction. ∎
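To make Lemma 3.2 concrete, here is a small discrete-time simulation (ours; the queueing details are assumptions consistent with the model above) of the KS–BF policy on an evolving tree that grows into T_n. It checks that each PE P_i starts at step i, works without idling until it finishes, and that the makespan respects the bound p + max_i W_i(n).

```python
import heapq

def simulate_ks_bf_tree(n, p):
    """Simulate the KS-BF policy on an evolving tree that grows into T_n.

    Every leaf of length < n-1 spawns; leaves at level n-1 halt.  Returns
    (makespan, per-PE lists of the steps at which each PE worked).
    """
    queues = [[] for _ in range(p)]
    heapq.heappush(queues[0], (0, ""))              # the root λ starts at P0
    busy = [[] for _ in range(p)]
    step = 0
    while any(queues):
        outbox = []                                 # right children sent this step
        for i in range(p):
            if queues[i]:
                level, x = heapq.heappop(queues[i]) # BF order: (level, lexicographic)
                busy[i].append(step)
                if level < n - 1:                   # spawn two children
                    heapq.heappush(queues[i], (level + 1, x + "0"))
                    outbox.append(((i + 1) % p, (level + 1, x + "1")))
        for dest, task in outbox:                   # deliveries visible next step
            heapq.heappush(queues[dest], task)
        step += 1
    return step, busy

n, p = 10, 4
makespan, busy = simulate_ks_bf_tree(n, p)
w = [len(b) for b in busy]                          # observed workshares W_i(n)
assert sum(w) == 2 ** n - 1                         # every node executed once
assert all(b == list(range(i, i + len(b))) for i, b in enumerate(busy))  # Lemma 3.2
assert makespan <= p + max(w)                       # running-time bound of Lemma 3.2
print(makespan, w)
```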

4. ANALYZING THE KS–BF POLICY ON MESHES

We now prove part (b) of Theorem 2.2. Since our proof will follow both the organization and underlying reasoning of Section 3, we shall be somewhat sketchy in this section.

4.1. Work Distribution under the KS Regimen

Our analysis of the workload of each PE of R_p while executing a mesh that grows into G_n builds on Lemma 2.1(b). For 0 ≤ i ≤ p - 1, let W_i(n) denote the total work done by PE P_i of R_p during this execution.

LEMMA 4.1. The exact value of W_i(n) is given by
$$W_i(n) = n - i + \Big(n - i - \frac{p}{2}\Big)\Big\lfloor \frac{n-i}{p} \Big\rfloor - \frac{p}{2}\Big\lfloor \frac{n-i}{p} \Big\rfloor^2. \tag{4.1}$$
This yields the implicit bound
$$\left| W_i(n) - \frac{1}{p}\binom{n+1}{2} \right| \le \frac{3}{2}n + 2. \tag{4.2}$$

Proof. Lemma 2.1(b) tells us that each node ⟨k, l⟩ of G_n is executed by PE P_{k mod p}. It follows that, for each i ∈ {0, 1, ..., p - 1},
$$W_i(n) = \sum_{\substack{0 \le k < n \\ k \equiv i \ (\mathrm{mod}\ p)}} (n - k)
= \sum_{j=0}^{\lfloor (n-i)/p \rfloor} \big(n - (i + jp)\big)
= \Big(n - i - \frac{p}{2}\Big\lfloor \frac{n-i}{p} \Big\rfloor\Big)\Big(\Big\lfloor \frac{n-i}{p} \Big\rfloor + 1\Big).$$
Elementary manipulations convert the last expression to expression (4.1). We verify bound (4.2) by quantifying the deviation of PEs' actual workshares from the ideal. We note first how much work each PE of R_p would do in a perfectly balanced computation.

Fact 4.1. In a perfectly balanced computation of G_n, each PE of R_p would do work
$$\hat{W} \;=\; \frac{1}{p} \sum_{i=0}^{p-1} W_i(n) \;=\; \frac{1}{p}\binom{n+1}{2}.$$

Obviously, some workshares W_i(n) exceed Ŵ, while others are smaller. In fact, the progression from the smallest workshare to the largest is monotonic:

Fact 4.2. For all n, W_0(n) > W_1(n) > ··· > W_{p-1}(n).

Fact 4.2 can be verified as follows. For each k ∈ {0, 1, ..., n - 1}, we call the set of mesh-nodes {⟨k, l⟩ | 0 ≤ l ≤ n - k - 1} the kth row of G_n. Lemma 2.1(b) assures us that all nodes in each row of G_n are executed at the same PE of R_p. We partition G_n's rows into bands, the ith band (where 0 ≤ i ≤ (n - 1)/p) comprising those rows whose indices fall in the set {ip, ip + 1, ..., ip + p - 1}. (The last band may contain fewer than p rows.) Now, note that within each band, the sum of a row's size and its index stays constant (as long as the row exists). Hence, each W_i(n) ≥ W_{i+1}(n) + 1. ∎

Facts 4.1 and 4.2 combine to ensure that W_{p-1}(n) < Ŵ < W_0(n). We can, therefore, bound the deviations of the actual workshares from the ideal by bounding the differences W_0(n) - Ŵ and Ŵ - W_{p-1}(n). For the former difference, we have
$$W_0(n) - \hat{W} = n + \Big(n - \frac{p}{2}\Big)\Big\lfloor \frac{n}{p} \Big\rfloor - \frac{p}{2}\Big\lfloor \frac{n}{p} \Big\rfloor^2 - \frac{1}{p}\binom{n+1}{2} < \frac{3}{2}\,n.$$
For the latter difference, we have
$$\hat{W} - W_{p-1}(n) = \frac{1}{p}\binom{n+1}{2} - (n - p + 1) - \Big(n - p + 1 - \frac{p}{2}\Big)\Big\lfloor \frac{n-p+1}{p} \Big\rfloor + \frac{p}{2}\Big\lfloor \frac{n-p+1}{p} \Big\rfloor^2 \le \frac{3}{2}\,n + 2. \ \ ∎$$
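Equation (4.1), Fact 4.1, and bound (4.2) can again be checked numerically; the following sketch (ours, not part of the paper) compares the closed form against a row-by-row count of G_n's nodes.

```python
def mesh_workshares_brute_force(n, p):
    """Count the nodes <k, l> of G_n assigned to each PE by Lemma 2.1(b)."""
    w = [0] * p
    for k in range(n):
        w[k % p] += n - k                  # row k of G_n has n - k nodes
    return w

def mesh_workshare_closed_form(n, p, i):
    """W_i(n) from Eq. (4.1); zero when PE i owns no row of G_n."""
    if i >= n:
        return 0
    q = (n - i) // p
    return (n - i) + (n - i - p / 2) * q - (p / 2) * q * q

n, p = 23, 4
brute = mesh_workshares_brute_force(n, p)
closed = [mesh_workshare_closed_form(n, p, i) for i in range(p)]
ideal = n * (n + 1) / (2 * p)              # Fact 4.1
assert brute == [round(x) for x in closed]
assert all(abs(w - ideal) <= 1.5 * n + 2 for w in brute)   # bound (4.2)
print(brute, ideal)
```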

4.2. Running Time under the KS–BF Policy

Finally, we analyze the time required by R_p to execute G_n under the KS–BF policy. Recall the notation from Table II.

LEMMA 4.2. Each node ⟨k, l⟩ of G_n is executed by PE P_PROC(k,l) of R_p at step τ(k, l), defined as |PRECEDE(k, l)| + PROC(k, l). Therefore, R_p executes an evolving mesh that grows into G_n in time
$$T_{\mathrm{mesh}}(n; p) \le \max_{0 \le i \le p-1} \big(W_i(n) + i\big) \le W_0(n).$$

Proof. As in Lemma 3.2, each PE P_i of R_p starts working at time i. We now show that P_i performs all of its work in an uninterrupted block of W_i(n) steps. We claim that each node ⟨k, l⟩ of G_n gets executed by PE P_PROC(k,l) at the earliest possible time, namely (as argued in Lemma 3.2), step τ(k, l). We verify, by induction on the breadth-first order of the nodes of G_n, that the node is available for execution at that time.

We remark first that the origin node ⟨0, 0⟩ gets executed at PE P_0 at step τ(0, 0) = 0, as predicted by the fact that PROC(0, 0) = |PRECEDE(0, 0)| = 0. We next focus on a nonorigin node x′ = ⟨k + r, l + s⟩ of G_n, where {r, s} = {0, 1}, and assume that every node ⟨φ, ψ⟩ of G_n that precedes x′ in breadth-first order is executed at step τ(φ, ψ). Note that the parent x = ⟨k, l⟩ of node x′ precedes x′ in breadth-first order, hence is executed at step τ(x). Therefore, node x′ resides in the task-queue of PE P_PROC(x′) beginning at step |PRECEDE(x)| + PROC(x) + 1.

If node x′ is a left child of its parent x (i.e., r = 0, s = 1), then PROC(x′) = PROC(x), so that |PRECEDE(x)| < |PRECEDE(x′)|. This means that node x′ is available to be executed by step τ(x′).

Else, node x′ is a right child of its parent x (i.e., r = 1, s = 0), so that PROC(x′) = PROC(x) + 1 mod p. In this case, |PRECEDE(x′)| ≥ |PRECEDE(x)| because, for each ⟨a, b⟩ ∈ PRECEDE(x), we have ⟨a + 1, b⟩ ∈ PRECEDE(x′). (We know that ⟨a + 1, b⟩ is a node of G_n because it precedes x′ in breadth-first order, and, by hypothesis, x′ is a node of G_n.) We distinguish two subcases.

If PROC(x) < p - 1, so that PROC(x′) = PROC(x) + 1, then
$$|\mathrm{PRECEDE}(x)| + \mathrm{PROC}(x) + 1 = |\mathrm{PRECEDE}(x)| + \mathrm{PROC}(x') \le |\mathrm{PRECEDE}(x')| + \mathrm{PROC}(x').$$

Else, we must have PROC(x) = p - 1, so that PROC(x′) = 0 = PROC(x) - p + 1. In this case, node x′ has level at least p in G_n, since, being a right child, its row-number k + 1 is positive and (by Lemma 2.1(b)) divisible by p. It follows that PRECEDE(x′) must contain, in addition to all nodes of the form ⟨a + 1, b⟩ where ⟨a, b⟩ ∈ PRECEDE(x), at least the p + 1 additional row-0 nodes {⟨0, i⟩ | 0 ≤ i ≤ k + 1}. We thus have
$$|\mathrm{PRECEDE}(x)| + \mathrm{PROC}(x) + 1 = |\mathrm{PRECEDE}(x)| + \mathrm{PROC}(x') + p < |\mathrm{PRECEDE}(x')| + \mathrm{PROC}(x').$$

In either subcase, node x′ is available to be executed no later than step τ(x′). Thus, node ⟨k + r, l + s⟩ is always executed precisely at step τ(k + r, l + s), extending the induction. ∎
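Finally, a mesh analogue of the earlier tree simulation (again our illustration, with assumed queueing details): it runs the KS–BF policy on an evolving mesh that grows into G_n, activating a node once all of its parents have executed, and checks the running-time bound of Lemma 4.2.

```python
import heapq
from collections import defaultdict

def simulate_ks_bf_mesh(n, p):
    """Simulate the KS-BF policy on an evolving mesh that grows into G_n."""
    def in_degree(k, l):
        return 1 if k == 0 or l == 0 else 2      # number of parents inside G_n

    queues = [[] for _ in range(p)]
    heapq.heappush(queues[0], (0, 0, 0))         # (level, k, l): the origin at P0
    received = defaultdict(int)
    busy = [[] for _ in range(p)]
    step = 0
    while any(queues):
        arrivals = []
        for i in range(p):
            if queues[i]:
                _, k, l = heapq.heappop(queues[i])   # BF order: (level, first entry)
                busy[i].append(step)
                for child in ((k, l + 1), (k + 1, l)):   # left kept, right sent
                    if child[0] + child[1] < n:          # stay inside G_n
                        received[child] += 1
                        if received[child] == in_degree(*child):
                            arrivals.append(child)
        for (k, l) in arrivals:                  # activated nodes join the queue of
            heapq.heappush(queues[k % p], (k + l, k, l))   # PE k mod p (Lemma 2.1(b))
        step += 1
    return step, [len(b) for b in busy]

n, p = 12, 4
makespan, w = simulate_ks_bf_mesh(n, p)
assert sum(w) == n * (n + 1) // 2                # every node of G_n executed once
assert makespan <= max(w[i] + i for i in range(p)) <= w[0]   # Lemma 4.2
print(makespan, w)
```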

ACKNOWLEDGMENTS

It is a pleasure to acknowledge helpful conversations with Vittorio Scarano and Zhi-Li Zhang. This research was supported in part by NSF Grant CCR-92-21785. A portion of the second author's research was supported by a Lady Davis Fellowship at the Technion.

REFERENCES

1. Brent, R. P. The parallel evaluation of general arithmetic expressions. J. Assoc. Comput. Mach. 21 (1974), 201–206.
2. Cormen, T. H., Leiserson, C. E., and Rivest, R. L. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
3. Even, G. Construction of Small Probability Spaces for Deterministic Simulation. M.Sc. thesis, The Technion, Israel. [In Hebrew]
4. Gao, L.-X., Gregory, D. E., Rosenberg, A. L., and Cohen, P. R. Efficient scheduling of branching computations on rings of processors: An empirical study. Typescript, University of Massachusetts, 1996.
5. Karp, R. M., and Zhang, Y. Randomized parallel algorithms for backtrack search and branch-and-bound computation. J. Assoc. Comput. Mach. 40 (1993), 765–789.
6. Lüling, R., and Monien, B. A dynamic, distributed load-balancing algorithm with provable good performance. 5th ACM Symposium on Parallel Algorithms and Architectures, 1993, pp. 164–172.
7. Ranade, A. G. Optimal speedup for backtrack search on a butterfly network. Math. Systems Theory 27 (1994), 85–101.
8. Rudolph, L., Slivkin, M., and Upfal, E. A simple load balancing scheme for task allocation in parallel machines. 3rd ACM Symposium on Parallel Algorithms and Architectures, 1991, pp. 237–244.

LI-XIN GAO is an assistant professor of computer science at Smith College. She received a B.S. in computer science from the University of Science and Technology of China in 1986, and an M.S. in computer engineering from Florida Atlantic University in 1987; she expects to receive a Ph.D. in computer science from the University of Massachusetts at Amherst in Dec., 1996. Prior to studying at the University of Massachusetts, Gao was a software engineer at Bendix/King. Gao’s research interests include distributed systems, parallel computing, and computer networks. Gao is a member of the ACM.


ARNOLD L. ROSENBERG is a distinguished university professor of Computer Science at the University of Massachusetts at Amherst. Prior to joining the University of Massachusetts, Rosenberg spent 5 years as a professor of computer science at Duke University and 16 years as a research staff member at the IBM Watson Research Center. Additionally, he has held visiting or adjunct positions at New York University, the Polytechnic Institute of New York, the Technion (Israel Institute of Technology), the University of Toronto, and Yale University, and he has had short-term visiting positions at several European institutions. Dr. Rosenberg holds the A.B., A.M., and Ph.D. from Harvard University. Dr. Rosenberg's current research focuses on theoretical aspects of parallel architectures and communication networks, with emphasis on the use of algorithmic techniques to design better networks and architectures and to use them more efficiently. He is the author of more than 100 technical papers on these and other topics in theoretical computer science and discrete mathematics. Dr. Rosenberg is a Fellow of the ACM, a senior member of the IEEE, and a member of the IEEE Computer Society and SIAM.

Received May 30, 1995; revised June 3, 1996; accepted June 4, 1996.