A partially asynchronous and iterative algorithm for distributed load balancing


Jianjian Song
National Supercomputing Research Center, National University of Singapore, Singapore 0511, Singapore
Parallel Computing 20 (1994) 853-868. (Received 9 February 1993; revised 19 May 1993)
An extended summary of this paper was presented at the 7th International Parallel Processing Symposium, April 1993, Newport Beach, California, USA. Email: [email protected]

Abstract

Defining tasks as independent entities with identical execution time and the workload of a processor as its number of tasks, load balancing distributes tasks among the processors of a network so that the resulting workload of every processor is as close as possible to the average over all the workloads. We propose in this paper a partially asynchronous and iterative algorithm for distributed load balancing, show its properties, and report its simulation results. The algorithm converges geometrically, as assured by a theorem on balancing continuous workload. We prove that the algorithm achieves a maximum load imbalance of no more than ⌈d/2⌉ tasks, where d is the diameter of the network. Our simulation not only validated these properties but also showed that the algorithm produces much smaller load imbalances for hypercubes. The obtained imbalances for hypercubes of order up to ten were no more than two tasks, and 56% of the sample runs produced only one task difference, as opposed to the theoretical maximum of six tasks.

Key words: Distributed load balancing; Hypercubes; Load sharing; Load imbalance; Partial asynchronism

1. Introduction

In a parallel (or distributed) computing system, tasks (jobs, processes) may be generated, received, completed and transmitted on the processors (nodes) of the system.


One way to accomplish an even workload distribution is to assign tasks, as they are generated, to processors so that the workload of each processor remains the same; this is called load balancing. It is a general consensus that load balancing is necessary to achieve high performance. Load balancing has been studied in the context of both distributed computing and parallel processing and in various areas of computation, such as distributed computer networks [15], the branch-and-bound problem [19], and molecular dynamics simulation [4]. It has been found that even simple load balancing techniques can greatly improve the performance of a parallel computation [7].

A distributed load balancing proposal may have the following components: (1) a transfer policy for the sender or receiver to decide when load balancing should be initiated; (2) a location policy to specify who decides (the sender or receiver) where to transfer a task; (3) information gathering to assist in decision making (global or nearest-neighbor knowledge); (4) stability control to assure convergence; (5) performance criteria to justify the algorithm (load imbalance, average response time, or system speedup against no load balancing); and (6) performance evaluation (mathematical proofs or simulation) [7,8,24]. The general concepts and some projects of distributed load balancing are reviewed in [24].

There are basically two beliefs about the purpose of load balancing: to balance the workload (balancing) or to keep every processor busy (sharing), resulting in two distinct transfer policies: sender-initiated vs. receiver-initiated. Load balancing is usually initiated by an overloaded processor (sender) with an extra task to distribute. Load sharing is usually initiated by an under-loaded (or idling) processor (receiver) trying to get tasks from the other processors.

There are two distinct location policies represented by two terms: bidding and drafting. 'Bidding' refers to the technique of letting a processor (sender) with a newly-generated (or arrived) task decide where to send the task [9]. 'Drafting', on the other hand, means that an underloaded processor (receiver) decides from which overloaded processor it should get a task [18]. Although bidding usually is sender-initiated while drafting is receiver-initiated, there are cases of receiver-initiated bidding [14,16]. Static load balancing generally refers to a location policy that is independent of the current system state, while dynamic load balancing is based on the current state of the system [7,23].

Load balancing is asynchronous when a processor balances its load regardless of what the other processors do. Bertsekas and Tsitsiklis in [3] further divide asynchronous algorithms into two groups: totally asynchronous and partially asynchronous. To paraphrase them, totally asynchronous algorithms "can tolerate arbitrarily large communication and computation delays", but partially asynchronous ones "are not guaranteed to work unless there is an upper bound on those delays."

Our algorithm is a bidding process that is distributed, asynchronous, and dynamic.


It assumes the following: tasks are independent and have identical execution time; the workload is the number of tasks; and the performance of the algorithm is measured by load imbalance and evaluated by mathematical proof plus simulation validation. The information gathering is nearest-neighbor. Its transfer policy could be either sender- or receiver-initiated. The algorithm iteratively balances the workload until it converges.

There are a number of distributed load balancing proposals based on the same definitions of task, workload, and performance as ours [4,6,11,12,27]. Cybenko in [6] assumed that workload could be infinitely divisible and proposed an algorithm called "Dimension Exchange" that can completely balance the workload of a hypercube of degree d in d synchronous steps. Hosseini et al. in [11] extended the work in [6] from hypercubes to arbitrary networks using graph coloring techniques and showed that the dimension exchange algorithm can balance an integer workload of a hypercube such that every processor has a load no more than n/2 tasks away from the average. Jaja in [12] presented an algorithm for hypercubes that achieves a load imbalance of at most one task between any two processors. Woo in [27] proposed a synchronous tree algorithm for hypercubes, achieving a load imbalance of at most one task in d steps for a hypercube of degree d.

All of the above-mentioned algorithms are synchronous bidding and some of them are static [6,11]. In addition, they do not specify the transfer policy but assume that load balancing is started somehow; hence, they basically describe various location policies. In comparison, our algorithm is asynchronous, iterative, and dynamic and can be either sender- or receiver-initiated. For example, the algorithm in [11] exchanges the load conditions of a pair of neighbors one pair at a time synchronously, while our algorithm exchanges load information massively and asynchronously. Furthermore, the claim in [11] that "each processor has a load not more than n/2 away from the average" for a hypercube of diameter n actually means a worst case of n tasks difference, which is worse than the n/2 that our algorithm achieves. For load balancing techniques that assume variable task execution times, the reader is referred to the following articles: [1,5,7,8,17,18,25].

Section 2 of the paper formally introduces the load balancing problem and a proposition from [3]. Section 3 proposes an algorithm for task balancing. Section 4 shows some properties of our algorithm. Section 5 presents simulation results for it and Section 6 concludes the paper.

2. A proposition for distributed load balancing

Following the notation in [3], we describe a network of processors as an undirected graph G = (N, A), where N = {1, ..., n} is a set of n processors and A is a set of arcs connecting the processors. The set of processors that have direct links with processor i is represented by A(i) = {k | k ∈ N and (i, k) ∈ A}. Each processor i has a number of tasks x_i(t) ≥ 0 at time instant t to be executed.


One distributed, asynchronous and iterative bidding technique is proposed by Bertsekas and Tsitsiklis in [3]. A processor, say i, keeps in its memory a variable x_j^i(t) ≥ 0 at time instant t to represent the workload of processor j ∈ A(i). Due to communication delay and asynchrony, x_j^i(t) may not be x_j(t) but one of its earlier values: x_j^i(t) = x_j(τ_j^i(t)), where τ_j^i(t) is an integer with 0 ≤ τ_j^i(t) ≤ t. A diffusion model of distributed task balancing for processor i at time instant t can then be expressed as

    x_i(t+1) = x_i(t) − Σ_{j∈A(i)} s_ij(t) + Σ_{j∈A(i)} r_ji(t),        (1)

where s_ij(t) is the number of tasks migrated from processor i to processor j (s_ij(t) = 0 if x_i(t) ≤ x_j^i(t)) and r_ji(t) is the number of tasks received by processor i from processor j at time t.
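Read as a conservation law, (1) simply says that whatever processor i sends out is eventually added to its neighbours' loads. A minimal synchronous Python sketch of one such diffusion round is given below; the particular send rule (offer part of the surplus over each lighter neighbour) is a hypothetical illustration, not the policy analysed in this paper, and delays are taken to be zero so that r_ji(t) = s_ji(t).

    def diffusion_round(adj, load):
        """One synchronous application of update (1) on an undirected graph.
        adj: dict node -> list of neighbours; load: dict node -> task count."""
        s = {(i, j): 0 for i in adj for j in adj[i]}                       # s_ij(t)
        for i in adj:
            for j in adj[i]:
                if load[i] > load[j]:                                      # send only towards lighter neighbours
                    s[(i, j)] = (load[i] - load[j]) // (2 * len(adj[i]))   # hypothetical send rule
        new_load = {}
        for i in adj:
            sent = sum(s[(i, j)] for j in adj[i])
            received = sum(s[(j, i)] for j in adj[i])                      # r_ji(t) = s_ji(t) with zero delay
            new_load[i] = load[i] - sent + received                        # equation (1)
        return new_load

Iterating such a round moves load from heavier to lighter processors while conserving the total, which is the setting of Proposition 1 below.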
Proposition 1 ([3], Proposition 4.1). lim_{t→∞} x_i(t) = L/n, i = 1, ..., n, where x_i(t) is the load of processor i, n is the number of processors and L is the total number of tasks of the network. L need not be constant. Furthermore, x_i(t), i = 1, ..., n, converges geometrically.

Three sets of assumptions were made in proving Proposition 1. The first one is that the x_i(t)'s are continuous variables. The second one is called the partial asynchronism assumption ([3], Assumption 4.1), stated informally as Assumption 1 below.

Assumption 1. (a) Processors should do load balancing regularly. (b) The latest load information of a processor should be broadcast to its neighbors as soon as possible. (c) Communication channel delay is not arbitrarily large.

The third set ([3], Assumption 4.2) contains two assumptions related to an actual load balancing algorithm, as follows.

Assumption 2. (a) ([3], Assumption 4.2 (a)) When a processor distributes some of its tasks to its neighbors, the most lightly loaded neighbor must always be given some tasks.


(b) ([3], Assumption 4.2 (b)) If j ∈ A(i) and x_i(t) > x_j^i(t), then during load balancing processor i should still maintain the largest number of tasks among all j ∈ A(i), which can be expressed as

    x_i(t) − Σ_{k∈A(i)} s_ik(t) ≥ x_j^i(t) + s_ij(t),   ∀ j ∈ A(i) with x_i(t) > x_j^i(t).        (2)

Proposition 1 establishes the possibility of a distributed and iterative solution for load balancing, although it lacks details for implementation, and the assumption that the x_i(t)'s are continuous may not be realistic. Assuming that Proposition 1 remains true for integer x_i(t)'s, we propose in this paper a partially asynchronous and distributed load balancing algorithm for integer x_i(t)'s that satisfies Assumptions 1 and 2.

3. Our algorithm

We present in this section an algorithm that satisfies the assumptions in Section 2 and explain briefly how the algorithm works with two examples. Since every processor runs its own copy of the algorithm asynchronously and independently of the other processors, we only discuss the execution of the algorithm on one processor. The transferred tasks will be called task migrants.

Without loss of generality, assume that processor 0 is trying to balance its load with its n neighbors. Let x_j(t) be initialized with the workload of processor j known to processor 0, i.e. x_j(t) = x_j^0(t), j ∈ A(0). (Notice that this x_j(t) is different from the definition in Section 2.) Assume x_n(t) ≥ ... ≥ x_k(t) ≥ x_{k−1}(t) ≥ ... ≥ x_2(t) ≥ x_1(t) and find the index k such that x_0(t) > x_i(t) for i ≤ k and x_0(t) ≤ x_i(t) for i > k. The latter case is not considered for load balancing, since processor 0 will not send any task to processor j if x_0(t) ≤ x_j(t).

The first step tries to transfer m_0 tasks to processor 1 so that x_1(t) + m_0 = x_2(t) while x_0(t) − m_0 ≥ x_k(t), which is Inequality (2). If successful, we then try to transfer an equal number of tasks, say m_1, to processors 1 and 2 so that x_1(t) + m_0 + m_1 = x_2(t) + m_1 = x_3(t) and x_0(t) − m_0 − 2·m_1 ≥ x_k(t). We continue the process with processors 1, 2 and 3, etc. until it is not possible to do so. In the second step, each processor that has been given some task migrants may get an identical number of additional tasks from processor 0, provided that Inequality (2) holds. In the third step, one extra task may be given to each of the above processors. The three-step process is shown in the following example, where processor 0 is connected with processors 1, 2, 3, and 4, which all have fewer tasks than processor 0.

Example 1. Given the initial loads x_i(t) known to processor 0, each step of our algorithm distributes the load more evenly, as shown below.


                    X0    X4    X3    X2    X1
    Initial load    20     9     7     5     3
    After Step 1    14     9     7     7     7
    After Step 2    11     9     7     7     7
    After Step 3     9     9     8     9     9
    # migrants     -11     0     1     4     6

The above three-step procedure can obtain a load imbalance of no more than d tasks, as shown later, where d is the diameter of the network. It will not reduce the load imbalance when processors are connected as a one-dimensional array and have loads forming a consecutive number sequence. One example is when processors 0, 1, 2, 3, and 4 form a linear array and x_0(t) = 10, x_1(t) = 11, x_2(t) = 12, x_3(t) = 13, x_4(t) = 14. In order to reduce the load imbalance further, we propose to add another step that is executed by a processor in the middle of such a sequence. When a processor finds that its load is the middle number of its two neighbors' loads, it should tell the neighbor with more tasks that it needs one task migrant. The neighbor may send a task migrant in the next iteration of load balancing, as shown in the following example.

Example 2. When processor 2 finds its load value to be a middle number of its two neighbors 1 and 3, it tells processor 3 that it needs one task. Processor 3 may send a task to processor 2. (Processors 1 and 3 may do the same to 2 and 4 respectively.)

                    X0    X1    X2    X3    X4
    Initial load    10    11    12    13    14
    After Step 4    10    11    13    12    14
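The trigger for this extra step can be expressed as a small local predicate; the sketch below is our reading of the rule (ask the heavier neighbour for one task when the three loads form a consecutive run with one's own load in the middle), with an illustrative function name and interface of our own choosing.

    def step4_request_target(left, mine, right):
        """Return which neighbour ('left' or 'right') should be asked for one task,
        or None. Applies when the three loads are consecutive integers with `mine`
        in the middle, as in Step 4 of Algorithm 1 below."""
        lo, hi = min(left, right), max(left, right)
        if lo + 1 == mine and mine + 1 == hi:          # consecutive run, e.g. 11, 12, 13
            return 'left' if left == hi else 'right'   # ask the neighbour holding the larger load
        return None

    # Example 2: processor 2 holds 12 tasks between neighbours holding 11 and 13,
    # so it asks the neighbour holding 13 (processor 3) for one task.
    assert step4_request_target(11, 12, 13) == 'right'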

A formal description of our four-step algorithm is given as Algorithm 1 for processor 0 at time instant t.

Algorithm 1.
{ Assume x_n(t) ≥ ... ≥ x_k(t) ≥ x_{k−1}(t) ≥ ... ≥ x_2(t) ≥ x_1(t) and that the index k is found such that x_0(t) > x_i(t) for i ≤ k and x_0(t) ≤ x_i(t) for i > k.
m := k;  x_i(t+1) := x_i(t), i = 1, ..., k;  x_0(t+1) := x_0(t).  If (x_k(t) = 0) go to Step 2.

Step 1. Find the largest index m that satisfies the following inequality:

    x_0(t) − (m−1)·x_m(t) + Σ_{i=1}^{m−1} x_i(t) ≥ x_k(t),   m ∈ {1, ..., k}.        (3)

If (m does not exist) go to Step 3, else do the following:

    x_i(t+1) := x_m(t),   i = 1, ..., m−1;
    x_0(t+1) := x_0(t) − Σ_{i=1}^{m−1} [x_m(t) − x_i(t)].        (4)

Step 2. Distribute excess tasks as evenly as possible to the neighbors 1 to m.

    If (m = k)  p := ⌊(x_0(t+1) − x_k(t)) / (m+1)⌋
    else        p := ⌊(x_0(t+1) − x_k(t)) / m⌋.

(⌊y⌋ means the largest integer ≤ y.) If (p > 0) { x_i(t+1) := x_i(t+1) + p, i = 1, ..., m;  x_0(t+1) := x_0(t+1) − p·m. }

Step 3. Distribute one more task to each neighbor if there are enough.

    j := 1.
    while ((x_0(t+1) − x_k(t+1)) > 1) {
        x_j(t+1) := x_j(t+1) + 1.  x_0(t+1) := x_0(t+1) − 1.  j := j + 1. }

Step 4. Break a consecutive number sequence.
If any processor i has asked for a task and x_0(t+1) > x_i(t+1), then x_i(t+1) := x_i(t+1) + 1 and x_0(t+1) := x_0(t+1) − 1.
If (x_0(t+1) + 1 = x_k(t+1) + 2 = x_j(t+1) for any j with k < j ≤ n), then ask processor j to send a task.

Step 5. Send [x_i(t+1) − x_i(t)] task(s) to processor i, i = 1, ..., m, and send the load value x_0(t+1) to all the neighbors. }
End of Algorithm 1.

Steps 1 to 4 calculate the number of tasks that should be migrated to other processors and Step 5 does the actual task transfer. If x_k(t) = 0, then x_i(t) = 0 for i ≤ k and Step 1 is skipped to avoid useless computation. Otherwise, Step 1 tries to find as many x_i(t)'s as possible that can be made equal to the next larger load by transferring tasks from processor 0. Starting with the smallest load x_1(t), an attempt is made to increase it to equal x_2(t) by taking [x_2(t) − x_1(t)] tasks from x_0(t) while maintaining x_0(t) − [x_2(t) − x_1(t)] ≥ x_k(t). Next, if possible, x_1(t) and x_2(t) are made to equal x_3(t) by subtracting 2·[x_3(t) − x_2(t)] tasks from {x_0(t) − [x_2(t) − x_1(t)]} and adding [x_3(t) − x_2(t)] each to x_1(t) and x_2(t). The process continues until x_m(t) is found, m being the largest index such that x_0(t) − Σ_{i=1}^{m−1} [x_m(t) − x_i(t)] ≥ x_k(t), which is another form of (3). All x_i(t+1)'s for i < m are equal after Step 1.

Step 2 then tries to distribute more tasks evenly to processors i for i ≤ m. If possible, Step 3 will assign one more task migrant to every processor so that they may get the same load. Step 4 is activated only if three loads including x_0(t+1) form consecutive numbers with x_0(t+1) being the middle number. For example, if x_1(t+1) = 24, x_0(t+1) = 25, and x_4(t+1) = 26, processor 0 will execute Step 4 to request one task from processor 4. When the algorithm finishes one iteration of the five steps, the number of task migrants from processor 0 to processor i will be [x_i(t+1) − x_i(t)].

Besides load balancing, every processor does the following to satisfy the partial asynchronism assumption: (i) balance its load whenever its load is larger than any neighbor's; (ii) send its load update to the neighbors as soon as it is available, so that every processor will have the latest load figures of its neighbors, in order to satisfy Assumption 1 (b).


Assumption 1 (c) is satisfied easily since nearest-neighbor communication is assumed. Assumption 2 (a) is true because the algorithm always assigns the largest number of task migrants to the processor that had the smallest load. Assumption 2 (b) is satisfied as proven in the next section.
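A minimal Python sketch of Steps 1 to 3, written directly from the formal description above, is given next. The function name, data layout and the handling of loop boundaries are our own choices rather than the authors' code, and Steps 4 and 5 (the task request and the actual message exchange) are left to the surrounding system.

    def balance_once(x0, x):
        """One iteration of Steps 1-3 of Algorithm 1 for processor 0.
        x0: own load. x: loads of the neighbours that are strictly lighter than
        processor 0, in non-decreasing order, so x[-1] plays the role of x_k(t).
        Returns (new_x0, migrants), where migrants[i] is the number of tasks to be
        sent to the neighbour whose load is x[i]."""
        k = len(x)
        if k == 0:
            return x0, []
        new = list(x)                      # new[i] holds x_{i+1}(t+1), initialised to x_{i+1}(t)
        new0, m = x0, k
        xk = x[-1]                         # x_k(t)
        if xk > 0:
            # Step 1: largest m with  x0 - (m-1)*x_m + sum_{i<m} x_i >= x_k   (inequality (3))
            m = max(mm for mm in range(1, k + 1)
                    if x0 - (mm - 1) * x[mm - 1] + sum(x[:mm - 1]) >= xk)
            new0 = x0 - sum(x[m - 1] - x[i] for i in range(m - 1))     # formula (4)
            for i in range(m - 1):
                new[i] = x[m - 1]
        # Step 2: spread the remaining surplus as evenly as possible over neighbours 1..m
        p = (new0 - xk) // (m + 1 if m == k else m)
        if p > 0:
            for i in range(m):
                new[i] += p
            new0 -= p * m
        # Step 3: hand out one extra task at a time while processor 0 exceeds x_k(t+1)
        # by more than one; the j < k guard is a safety bound, not part of the paper's text
        j = 0
        while new0 - new[k - 1] > 1 and j < k:
            new[j] += 1
            new0 -= 1
            j += 1
        migrants = [new[i] - x[i] for i in range(k)]
        return new0, migrants

Because the routine only inspects the (possibly outdated) loads of the processor's own neighbours, it can be run fully asynchronously, which is the setting analysed in the next section.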

4. Analysis of the algorithm

Since the algorithm is iterative and runs asynchronously and in a distributed fashion, it is difficult to discuss its execution process. But we can say something about its final results, especially its performance in terms of the resulting maximum load imbalance of a network. This section examines the theoretically largest load imbalance after Algorithm 1 fails to generate any task migrant on any processor of a network. It concludes that the maximum load imbalance will be no more than d tasks if only Steps 1 to 3 are applied, and no more than ⌈d/2⌉ tasks if all four steps are utilized, where d is the diameter of the network (see Corollary 2 and Proposition 5). It also shows that each of the steps satisfies (2), which, for convenience, is written as

    x_0(t+1) ≥ x_i(t+1),   i = 1, ..., k,        (5)

where x_0(t+1) and x_i(t+1) are the values after the execution of each step of Algorithm 1.

Proposition 2. After the completion of Step 1, (5) and the following inequality hold:

    x_0(t+1) − x_k(t) < m·[x_{m+1}(t) − x_m(t)].        (6)

Proof. We prove first that x_0(t+1) ≥ x_i(t+1), i = 1, ..., k, after Step 1. Combining the update formula (4) of Step 1,

    x_0(t+1) := x_0(t) − Σ_{i=1}^{m−1} (x_m(t) − x_i(t)) = x_0(t) − (m−1)·x_m(t) + Σ_{i=1}^{m−1} x_i(t),

with (3), we obtain x_0(t+1) ≥ x_k(t). We know x_k(t+1) = x_k(t), since x_k(t) is not modified in Step 1. The task distribution process assures that x_i(t+1) ≥ x_j(t+1) for i > j. Therefore, x_0(t+1) ≥ x_k(t+1) ≥ x_i(t+1), i = 1, ..., k.

To show that (6) is true, we use the fact that m is the largest index satisfying (3), so that the following is true:

    x_0(t) − Σ_{i=1}^{m} [x_{m+1}(t) − x_i(t)] < x_k(t).

Combining the above with (4), we obtain x_0(t+1) − x_k(t) < m·[x_{m+1}(t) − x_m(t)]. Since Step 1 only assigns tasks to processors 1, ..., m−1, the x_j(t)'s for j > m are not changed in Step 1. □

Proposition 3. After the completion of Step 2, (5) and the following inequality hold:

    x_0(t+1) − x_k(t+1) < m.        (7)

Proof. (5) and (7) are true if p = x_0(t+1) − x_k(t) = 0. We prove that Proposition 3 is true for both m < k and m = k.

When m < k, the following update formulas are used:

    x_0(t+1) := x_0(t+1) − m·⌊(x_0(t+1) − x_k(t)) / m⌋,
    x_i(t+1) := x_i(t+1) + ⌊(x_0(t+1) − x_k(t)) / m⌋,   i = 1, ..., m.

Since the following are true:

    x_0(t+1) − m·⌊(x_0(t+1) − x_k(t)) / m⌋ ≥ x_0(t+1) − m·(x_0(t+1) − x_k(t)) / m = x_k(t)

and

    x_i(t+1) + ⌊(x_0(t+1) − x_k(t)) / m⌋ ≤ x_k(t),

we know x_0(t+1) ≥ x_k(t) ≥ x_i(t+1), i = 1, ..., k. To prove (7) we notice that x_k(t+1) = x_k(t) and

    x_0(t+1) − m·⌊(x_0(t+1) − x_k(t)) / m⌋
        = x_0(t+1) − m·(x_0(t+1) − x_k(t) − Q_m) / m
        = x_0(t+1) − [x_0(t+1) − x_k(t) − Q_m]
        = x_k(t) + Q_m,

where Q_m is the remainder of (x_0(t+1) − x_k(t)) divided by m and Q_m < m. Therefore, we know x_0(t+1) − x_k(t) = x_k(t) + Q_m − x_k(t) = Q_m < m after Step 2.

When m = k, it is clear that x_i(t+1) = x_k(t), i = 1, ..., k−1, and the update formulas are as follows:

    x_0(t+1) := x_0(t+1) − m·⌊(x_0(t+1) − x_k(t)) / (m+1)⌋,
    x_i(t+1) := x_i(t+1) + ⌊(x_0(t+1) − x_k(t)) / (m+1)⌋,   i = 1, ..., k.

And we know

    x_0(t+1) − m·⌊(x_0(t+1) − x_k(t)) / (m+1)⌋ ≥ x_0(t+1) − m·(x_0(t+1) − x_k(t)) / (m+1) = (x_0(t+1) + m·x_k(t)) / (m+1)

and

    x_k(t) + ⌊(x_0(t+1) − x_k(t)) / (m+1)⌋ ≤ (x_0(t+1) + m·x_k(t)) / (m+1).

Therefore, x_0(t+1) ≥ x_k(t+1) = x_i(t+1), i = 1, ..., k, after Step 2. It is easy to see that (7) is true for m = k by the same remainder calculation with ⌊(x_0(t+1) − x_k(t)) / (m+1)⌋ in place of ⌊(x_0(t+1) − x_k(t)) / m⌋. □
Proposition 4. After the completion of Step 3, (5) and the following inequality

hold. 0 0 is always true during the execution of Step 3. Hence, it is true that Xo(t + 1) >x~(t + 1). From the termination condition for Step 3, we know 0

Corollary 1. |x_i(t) − x_j(t)| ≤ 1, ∀(i, j) ∈ A, if and only if no task migrant is generated anywhere in the network after Algorithm 1 is executed.

Proof. We prove first that if 0 ≤ x_i(t) − x_j(t) ≤ 1 for any (i, j) ∈ A, processor i will not generate any task migrant for processor j. If x_i(t) − x_j(t) = 0, then no task can be migrated from processor i to j. If x_i(t) − x_j(t) = 1 and processor i gives one task to processor j, then x_i(t+1) = x_i(t) − 1 = x_j(t) < x_j(t) + 1 = x_j(t+1), which contradicts (5). Therefore, no task migrant from processor i to j can be generated by Algorithm 1.

Next, we prove that if x_i(t) − x_j(t) > 1, Algorithm 1 will produce at least one task migrant from processor i. If j = k, then x_i(t) − x_k(t) > 1 and Step 3 will produce at least one task migrant. If j < k, then x_i(t) ≥ x_k(t) + 1, and Step 3 will take one task away from processor i and assign it to some processor m with x_m(t+1) ≤ x_k(t), since x_i(t) − 1 ≥ x_k(t). Therefore, Corollary 1 holds. □
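Corollary 1 turns termination into a purely local test on every edge of the network. A small sketch of that test (the adjacency-list representation and function name are ours):

    def is_stable(adj, load):
        """Termination test of Corollary 1: no task migrant will be generated
        anywhere iff every pair of neighbouring processors differs by at most one task.
        adj: dict node -> iterable of neighbours; load: dict node -> task count."""
        return all(abs(load[i] - load[j]) <= 1 for i in adj for j in adj[i])

    # The consecutive-load linear array used in the proof of Corollary 2 below passes
    # this test (so Steps 1 to 3 generate no migrants), even though the end-to-end
    # imbalance equals the diameter of the array:
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    assert is_stable(path, {0: 10, 1: 11, 2: 12, 3: 13, 4: 14})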

Corollary 2. When Algorithm 1 with only Steps 1 to 3 terminates, the maximum load imbalance between any two processors of a network is d tasks, where d is the diameter of the network.

Proof. Take a linear array with n processors as an example. The diameter of the array is (n − 1). If consecutive integers are assigned as the workloads of the processors, starting from one end of the array to the other, Algorithm 1 cannot generate any task migrant for this workload distribution, since the workload difference of any two neighboring processors is one task (Corollary 1). The maximum workload difference of the network is (n − 1) = d tasks.

If a network has a maximum load imbalance of more than d tasks, say x_p(t) − x_q(t) > d between processors p and q, we can always find a link (i, j) ∈ A along a path from p to q such that x_i(t) − x_j(t) > 1. Corollary 1 then guarantees that at least one task migrant will be generated. Take, for example, a shortest path from p to q. The length of the path in terms of the number of links is at most d. Even if the x_i(t)'s, i being the processors on the path, strictly decrease along the path from p to q, the load of the processor next to q will be at least x_p(t) − d + 1. The load difference between this processor and processor q is then [x_p(t) − d + 1 − x_q(t)] > 1. □

Proposition 5. If Steps 1 to 4 are utilized and no task migrant is generated anywhere in the network, the load difference between any two processors will be less than or equal to d/2 for even d and (d + 1)/2 for odd d, where d is the diameter of the network. In other words, the load difference is no more than ⌈d/2⌉.

Proof. When Step 4 fails to generate any task migrant, two conditions are true: (i) |x_i(t) − x_j(t)| ≤ 1 for any (i, j) ∈ A; (ii) if processors i, j, k are connected in cascade, x_i(t), x_j(t) and x_k(t) do not form a consecutive integer sequence in any order. The first condition is guaranteed by Step 3 and the second condition is the termination condition of Step 4.


Given any path of a network, we can find a sequence of processors with loads x_i(t) along the path. The x_i(t)'s of any adjacent processors must satisfy Condition (i) and the x_i(t)'s of any three cascade-connected processors must satisfy Condition (ii). In order to prove Proposition 5, we need to show that the following lemma is true.

Lemma 1. Define a_i = a_0 + i, a_0 ≥ 0, i being a positive integer, to represent the number of tasks of a processor. A path of processors with the following sequence of tasks along it produces the largest task difference of all the possible sequences of a_i's that satisfy Conditions (i) and (ii):

    a_0 a_1 a_1 a_2 a_2 a_3 a_3 ... a_{n−2} a_{n−2} a_{n−1} a_{n−1} a_n,

where each position is one processor on the path.

Proof. The proof is by induction for sequences of odd length; the proof for even length is similar. Of all the sequences of four a_i's, a_0 a_1 a_1 a_2 is obviously a sequence with the largest difference, namely two tasks, which can be shown by enumerating all the combinations. Assuming that Lemma 1 is true for a_0 a_1 a_1 a_2 a_2 a_3 a_3 ... a_{k−1} a_{k−1} a_k, we prove that it is true for a_0 a_1 a_1 a_2 a_2 a_3 a_3 ... a_{k−1} a_{k−1} a_k a_k a_{k+1}. Since the latter sequence is formed by the former one plus a_k a_{k+1} and the former one has the largest task difference, we only need to show that a_k a_{k+1} gives the largest overall task difference of all the a_i a_j's that can be added to the former sequence. To satisfy Conditions (i) and (ii), a_i can only be a_{k−1} or a_k. If a_i is a_{k−1}, a_j can only be a_k or a_{k−1}. In general, a_i a_j ∈ {a_{k−1} a_{k−1}, a_{k−1} a_k, a_k a_k, a_k a_{k+1}}, among which a_k a_{k+1} forms the sequence with the largest load difference. □

Let the diameter of the network be d. A path with the largest load difference should have length d. The largest load difference of any network is then (d + 1)/2 tasks for odd d and d/2 for even d. The latter is formed by the sequence a_0 a_1 a_1 a_2 a_2 a_3 a_3 ... a_{n−2} a_{n−2} a_{n−1} a_{n−1}. □
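To make the counting concrete, the extremal sequence of Lemma 1 for a path of diameter d = 5 (six processors) and the resulting bound of Proposition 5 can be written out as follows; the displayed numbers are our illustration, not an additional result.

    \[
      a_0,\ a_1,\ a_1,\ a_2,\ a_2,\ a_3 \qquad (\text{6 processors, } d = 5 \text{ links}),
      \qquad a_3 - a_0 \;=\; 3 \;=\; \tfrac{d+1}{2} \;=\; \left\lceil \tfrac{d}{2} \right\rceil .
    \]

For even d, the sequence a_0 a_1 a_1 ... a_{n−1} a_{n−1} with d = 2(n−1) links gives a difference of a_{n−1} − a_0 = n − 1 = d/2.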

5. Simulation results for linear arrays and hypercubes

Linear arrays and hypercubes were simulated to see the maximum load imbalance of networks with a random initial load distribution after the algorithm was applied to them. A linear processor array constitutes a worst case for the algorithm, and the hypercube is the most common structure. To simplify programming and make the results more comprehensible, we assumed that the algorithm was synchronized globally. All processors would do the following in lock step:

    Loop:
        Exchange load information.
        Execute Algorithm 1.
        Exchange messages.
        Add received tasks to its own load.
    End of loop.

The iteration continued until no task migrant was generated.
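A compact Python sketch of this lock-step loop, reusing the hypothetical balance_once routine sketched after Algorithm 1, is given below. The per-processor message queues used to model communication delays (described next) are omitted, so this corresponds to the zero-delay experiments.

    def simulate(adj, load):
        """Synchronous simulation of the load balancing loop: in every round all
        processors exchange load information, run the balancing routine, exchange
        task migrants, and add the received tasks to their own loads. Stops when a
        full round generates no task migrant. adj: dict node -> list of neighbours;
        load: dict node -> initial task count (modified in place and returned)."""
        while True:
            snapshot = dict(load)                    # load information exchanged at round start
            pending = {i: 0 for i in adj}            # task migrants in transit this round
            moved = False
            for i in adj:
                lighter = sorted((snapshot[j], j) for j in adj[i] if snapshot[j] < snapshot[i])
                if not lighter:
                    continue
                new_i, migrants = balance_once(snapshot[i], [l for l, _ in lighter])
                load[i] = new_i                      # own load after deciding what to send
                for (_, j), m in zip(lighter, migrants):
                    if m > 0:
                        pending[j] += m
                        moved = True
            for j in adj:
                load[j] += pending[j]                # received tasks are added at the end of the round
            if not moved:
                return load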


Each processor has a communication queue to store outgoing messages for load information, task migrants and task requests. The queue length is the delay of a message. All messages from one processor to the others have the same delay time, measured in terms of a number of the above loops. If the queue length is n for processor 0, every neighbor of processor 0 has at time instant t the load figure of processor 0 at time (t − n + 1). If (t − n + 1) < 0, neighbor loads are assumed to be zero. The queue lengths can be made different to simulate asynchronous behavior. Although the simulation did not mimic the truly asynchronous behavior of the algorithm, its results can still help us to understand the performance of the algorithm, since the final load imbalances are similar whether the algorithm is implemented synchronously or asynchronously, according to Proposition 1. The difference would be the speed of convergence.

To see the effect of Step 4, we ran one program implementing Steps 1 to 3 and another with Steps 1 to 4. For each network, 200 sample problems with an initial random task distribution were solved to get the occurrence frequencies of the load imbalances of a network with different load distributions after the algorithm failed to generate task migrants. The initial number of tasks on each processor was chosen randomly between 0 and 1000. We ran the two programs with networks of diameters 5, 8, and 10. In terms of the number of processors, the linear arrays have 6, 9 and 11 processors and the hypercubes have 32, 256 and 1024 processors, respectively.

Table 1 presents the frequencies of the maximum task differences obtained by our algorithm for the 200 trials of each network, assuming no communication delay. As can be seen, the results for the linear arrays validated Corollary 2 and Proposition 5, meaning that they represent the worst-case performance of the algorithm. The results for the hypercubes are more interesting in that the maximum load imbalances did not increase proportionally with the diameters of the hypercubes.

Table 1
Frequencies of the maximum task differences from 200 examples of each network with random initial load and no communication delay (rows: maximum task difference 0-10; columns: linear arrays and hypercubes of diameter 5, 8 and 10, each under (i) Steps 1 to 3 and (ii) Steps 1 to 4).


Table 2
Frequencies of the maximum task differences from 200 examples of each network with random initial load and a random communication delay of 0 to 3 loops (rows: maximum task difference 0-10; columns: linear arrays and hypercubes of diameter 5, 8 and 10, each under (i) Steps 1 to 3 and (ii) Steps 1 to 4).

As a matter of fact, the differences were all four tasks or less for case (i) and two tasks or less for case (ii), regardless of the diameters of the hypercubes. 56% of the imbalances were only one task in case (ii). We do not have a satisfactory explanation for this result. It is postulated that the result might have something to do with the rich hypercube interconnection.

Table 2 shows the results from the same networks as in Table 1, except for communication delays of from 0 to 3 loops chosen randomly for the processors to simulate the asynchronous behavior of the algorithm. Since some obsolete load values were used by the algorithm due to the delays, the load balancing calculations were not accurate but somewhat chaotic. But as seen in Table 2, the delays actually made the load imbalances even smaller, by one or two tasks, than in the zero-delay case of Table 1.

6. Discussions

A partially asynchronous and iterative load balancing algorithm (Algorithm 1 in Section 3) has been proposed, analyzed and evaluated experimentally. The algorithm with Steps 1 to 3 converges geometrically, according to a theorem for balancing continuous load in [3]. The convergence of the algorithm with all four steps is not guaranteed, although all the examples in our simulation did converge. We have proved in this paper that, when the algorithm converges, the load imbalance between any two processors will be no more than ⌈d/2⌉ tasks, d being the diameter of a network. The simulation results of the algorithm with linear arrays and hypercubes of degrees 5, 8 and 10 validated the proof.


For the 200 trials of each hypercube of degree 5, 8 and 10, the worst load imbalance was two tasks and more than 56% of the trials showed an imbalance of just one task, as opposed to the theoretical maximum of six tasks. The load imbalances were even smaller when communication delays were added.

Although our proofs assumed a constant total load, that the algorithm should work equally well for problems with a varying total load is guaranteed by Proposition 1, which says that lim_{t→∞} x_i(t) = L/n even if L changes when tasks are generated and destroyed. A probabilistic analysis of this case is reported in [6].

The main body of the algorithm actually specifies one kind of location policy, and the simulation implemented a sender-initiated transfer policy. The algorithm can, however, be combined with a receiver-initiated transfer policy so that load balancing is started by an idling processor asking its neighbors for tasks. Once it is started, the algorithm will iterate until it converges. Notice that we have not shown in this paper that Algorithm 1 always converges. The convergence of Algorithm 1 is assumed to be guaranteed by the work of Bertsekas and Tsitsiklis in [3]. We simply presented a necessary and sufficient condition for the convergence of the algorithm in Corollary 1 and predicted the system state when the algorithm converges in Corollary 2 and Proposition 5.

Assuming that tasks are independent and uniform in execution time may restrict the application of the algorithm. But the algorithm can still be useful in cases where the execution time of a task cannot be decided a priori and, therefore, the best one can do is to assume identical tasks. One immediate example is the OR parallelism of branch-and-bound tree search, where each live node is independent and the number of live nodes on a processor can be considered the workload of the processor. The algorithm can be applied to distribute live nodes among a network of processors as the nodes are being generated [19]. It can also be useful in molecular dynamics simulation [4].

References

[1] I. Ahmad and A. Ghafoor, Semi-distributed load balancing for massively parallel multicomputer systems, IEEE Trans. Software Eng. SE-17 (10) (1991) 987-1004.
[2] A. Barak and A. Shiloh, A distributed load-balancing policy for a multicomputer, Software Practice & Experience 15 (9) (1985) 901-913.
[3] D.P. Bertsekas and J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods (Prentice-Hall, Englewood Cliffs, NJ, 1989).
[4] F. Bruge and S.L. Fornili, A distributed dynamic load balancer and its implementation on multi-transputer systems for molecular dynamics simulation, Comput. Phys. Commun. 60 (1990) 39-45.
[5] S. Chowdhury, The greedy load sharing algorithm, J. Parallel Distributed Comput. 9 (1990) 93-99.
[6] G. Cybenko, Dynamic load balancing for distributed memory multi-processors, J. Parallel Distributed Comput. 7 (2) (1989) 279-301.
[7] D.L. Eager, E.D. Lazowska and J. Zahorjan, Adaptive load sharing in homogeneous distributed systems, IEEE Trans. Software Eng. SE-12 (5) (1986) 662-675.


[8] K. Efe and B. Croselj, Minimizing control overheads in adaptive load sharing, 9th Int. Conf. on Distributed Computing Systems (1989) 307-315.
[9] D.J. Farber, The distributed computing system, Proc. Compcon Spring (1973) 31-34.
[10] A. Hac and T.J. Johnson, Sensitivity study of the load balancing algorithm in a distributed system, J. Parallel Distributed Comput. 10 (1990) 85-89.
[11] S.H. Hosseini, et al., Analysis of a graph coloring based distributed load balancing algorithm, J. Parallel Distributed Comput. 10 (1990) 160-166.
[12] J. Jaja and K.W. Ryu, Load balancing and routing on the hypercube and related networks, J. Parallel Distributed Comput. 14 (1992) 431-435.
[13] H. Kuchen and A. Wagener, Comparison of dynamic load balancing strategies, Proc. 2nd Workshop on Parallel & Distributed Comput. (Bulgaria, 1990) 303-314.
[14] F.C.H. Lin and R.M. Keller, The gradient model load balancing method, IEEE Trans. Software Eng. SE-13 (1) (1987) 32-38.
[15] M. Livny and M. Melman, Load balancing in homogeneous broadcast distributed systems, Proc. ACM Comput. Network Performance Symp. (1982) 47-55.
[16] R.P. Ma, F.S. Tsung and M.H. Ma, A dynamic load balancer for a parallel branch and bound algorithm, 3rd Conf. on Hypercube Concurrent Computers and Applications (1988) 1505-1513.
[17] R. Mirchandaney, D. Towsley and J.A. Stankovic, Adaptive load sharing in heterogeneous systems, 9th Int. Conf. on Distributed Computing Systems (1989) 298-306.
[18] L.M. Ni, C.W. Xu and T.B. Gendreau, A distributed drafting algorithm for load balancing, IEEE Trans. Software Eng. SE-11 (10) (1985) 1153-1161.
[19] S. Patil and P. Banerjee, A parallel branch and bound algorithm for test generation, IEEE Trans. CAD 9 (3) (1990) 313-322.
[20] X.S. Qian and Q. Yang, Load balancing on generalized hypercube and mesh multiprocessors with LAL, 11th Int. Conf. on Distributed Computing Systems (1991) 402-409.
[21] K. Ramamritham, J.A. Stankovic and W. Zhao, Distributed scheduling of tasks with deadlines and resource requirements, IEEE Trans. Comput. 38 (8) (1989) 1110-1123.
[22] E. Shamir and E. Upfal, An approach to the load-sharing problem in distributed systems, J. Parallel Distributed Comput. 4 (1987) 521-531.
[23] K.G. Shin and Y.C. Chang, Load sharing in distributed real-time systems with state change broadcasts, IEEE Trans. Comput. 38 (8) (1989) 1124-1142.
[24] N.G. Shivaratri, et al., Load distributing for locally distributed systems, IEEE Comput. (Dec. 1992) 33-44.
[25] T.T.Y. Suen and J.S.K. Wong, Efficient task migration algorithm for distributed systems, IEEE Trans. Parallel Distributed Syst. 3 (4) (1992) 488-499.
[26] M. Willebeek-LeMair and A.P. Reeves, A localized dynamic load balancing strategy for highly parallel systems, 3rd Symp. on the Frontiers of Massively Parallel Computation (1990) 380-383.
[27] J. Woo and S. Sahni, Load balancing on a hypercube, Int. Parallel Processing Symp. (1991) 525-530.
[28] J. Xu and K. Hwang, Heuristic methods for dynamic load balancing in a message-passing supercomputer, Supercomputing Conf. (1990) 888-897.