An algorithm for load balancing in multiprocessor systems

Michael C. Loui and Milind A. Sohoni *

Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1101 West Springfield Avenue, Urbana, IL 61801, USA

Information Processing Letters 35 (1990) 223-228, North-Holland, 28 August 1990

Communicated by G.R. Andrews
Received 8 June 1989
Revised 26 April 1990

* Current address: Department of Computer Science, Indian Institute of Technology, Bombay, India. Supported by the Office of Naval Research under Contract N00014-85-K-0570.

We present an algorithm for dynamic load balancing in a multiprocessor system that minimizes the number of accesses to the shared memory. The algorithm makes no assumptions, probabilistic or otherwise, regarding task arrivals or processing requirements. For k processors to process n tasks, the algorithm incurs O(k log k log n) potential memory collisions in the worst case. The algorithm itself is a simple variation of the strategy of visiting the longest queue. The key idea is to delay reporting task arrivals and completions, where the delay is a function of dynamic loading conditions.

Keywords: Algorithm, load balancing, load sharing, scheduling, multiprocessor, shared memory

Introduction

Many researchers have studied load balancing in multiprocessor systems [1,5]. Here we consider the shared memory model of Manber [4]. There are k processors, each with unlimited private memory, and a shared memory. Tasks arrive, or are generated by ongoing tasks, at unpredictable instants, and their computation times are also unpredictable. All tasks are independent of each other and may be assigned in any order to any processor. The processors must perform these tasks promptly and with minimum interference from each other. All processors execute the same load balancing algorithm, which keeps track of the loads of the processors and assigns a suitable number of available tasks whenever a processor becomes free. This dynamic information is maintained in the shared memory.

As processors access the shared memory asynchronously, they may interfere with each other. Like Manber [4], we assume that processors use fine-grained locking to coordinate accesses to shared data structures. If a processor P attempts to acquire a lock held by another processor, then P must wait, and we say that a collision has occurred. The number of potential collisions is our sole performance criterion (an operational sketch follows at the end of this section).

We present a new algorithm for dynamic load balancing. For k processors to complete n tasks, our algorithm incurs O(k log k log n) collisions in the worst case. This improves on the algorithm of Manber [4], which incurs O(k^ε log n) (ε > 1) collisions in the worst case.
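For concreteness, the collision criterion can be stated operationally. The following sketch is ours, not the paper's, and all names in it are illustrative: it counts a potential collision whenever an acquisition finds the lock already held.

import threading

class CountingLock:
    # A lock that records collisions: acquisitions that find the lock
    # already held and therefore must wait for the holder to release it.
    def __init__(self):
        self._lock = threading.Lock()
        self.collisions = 0

    def acquire(self):
        # Try without blocking first; failure means another processor
        # holds the lock, so this access counts as a collision.
        if not self._lock.acquire(blocking=False):
            self.collisions += 1          # approximate: the count itself is unlocked
            self._lock.acquire()          # now wait for the lock

    def release(self):
        self._lock.release()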

The problem

One approach to the problem is to retain all tasks (or their descriptors) in one common data structure C. Each processor is assigned only one task, and on its completion the processor accesses C to acquire another task. There are two disadvantages to this strategy. First, if a task generates a new task, then the new task must be inserted into C. Second, at the completion of a task each processor must access C to obtain a new task. Both events can lead to collisions; to process n tasks, there could be 2n collisions.

An alternative strategy is to assign several tasks to each processor. Then an idle processor examines the loads of other processors, visits a busy processor, and takes some of the busy processor's tasks. Here, the load of a processor is the number of tasks assigned to it. Thus a load balancing algorithm must handle two issues: (i) an idle processor must decide which busy processor to visit; (ii) efficient data structures should enable the idle processor to take a fraction of the busy processor's load.

We briefly sketch Manber's algorithm; we refer the reader to [4] for a complete discussion. Each processor P has an individual area in the shared memory, called its local segment, in which P stores the tasks (or their descriptors) assigned to P. The local data structure is a tree that allows insertion, deletion, and splitting in constant (i.e., O(1)) time. The split operation splits the tree into two trees whose sizes are between 1/3 and 2/3 of the original (a sketch of these split semantics appears below). Suppose P is executing a task. If this task generates more tasks, or if some external tasks arrive, then P puts the new tasks into its local segment. When P finishes its current task, it accesses its local segment. If the local segment is not empty, then P selects a task from its local segment for execution and deletes the task from its local segment. If the local segment is empty, however, then P visits another processor, called a host, that has a nonempty local segment. This local segment is split, and the visitor P takes a part of the host's local segment into its own local segment. We say that P steals some of the host's tasks, which migrate to P.

A processor is busy if it is executing a task or if its local segment is nonempty; otherwise it is idle. A global data structure maintains the status (i.e., busy or idle) of each processor. The data structure is organised as an m-ary tree with each of the k processors as a leaf. Each internal node maintains information about the loads of its leaves. Finding a busy processor takes O(k^ε) collisions, where ε decreases toward 1 as m increases.
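As an illustration of the split semantics, here is a minimal sketch in Python (ours, not Manber's data structure): a list-backed segment splits in O(m) time rather than O(1), and always handing the visitor the newest third is an arbitrary choice within the required 1/3 to 2/3 bounds.

class LocalSegment:
    # Tasks assigned to one processor, oldest first. Manber's tree
    # supports insert, delete, and split in O(1) time; this sketch has
    # the same interface but only O(m) splitting.
    def __init__(self, tasks=()):
        self.tasks = list(tasks)

    def insert(self, task):
        self.tasks.append(task)           # new tasks sit at the tail

    def delete(self):
        return self.tasks.pop()           # select and remove some task

    def split(self):
        # Give a visitor a tail block of ceil(m/3) tasks; for m >= 2 this
        # leaves both parts within 1/3 and 2/3 of the original size.
        m = len(self.tasks)
        cut = m - (m + 2) // 3
        stolen = LocalSegment(self.tasks[cut:])
        del self.tasks[cut:]
        return stolen

    def load(self):
        return len(self.tasks)

Splitting off the tail block also matches the first-in first-out convention assumed later, in the analysis of reporting collisions.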


For k processors to finish n tasks, this scheme incurs O(k^ε log n) collisions in the worst case. Also, Manber established a lower bound of Ω(k log n) collisions.

Our algorithm differs from Manber's [4] only in how it finds a processor to visit. We use a global data structure specified as an abstract data type on k objects {x_i} with real weights {w_i}, and the following operations:

max: find the object with the maximum weight;
incr(x, r): increment the weight of x by a positive real r;
decr(x, r): decrement the weight of x by a positive real r.

In an implementation with a height-balanced binary tree, each operation takes O(log k) time. In an implementation with a Fibonacci heap [3] or a relaxed heap [2], the max and incr operations take O(1) (amortized) time, and the decr operation takes O(log k) time. In our analyses below, we assume that each operation involves O(log k) memory accesses, all of which are potential collisions.

A global data structure S with the operations specified above maintains information about the processors' loads in the shared memory. Each processor P_i is an object, and its reported load R_i is the weight associated with P_i. For ease of analysis, we first present a simple version of our algorithm. Rather than report every change in its load, a processor reports only an increase over its last reported load, using the incr operation on S. Thus the reported load is always an overestimate of the actual load. When processor P_i completes a task and discovers that its local segment is now empty, P_i performs the operation max on S to obtain the name of the processor P_j whose reported load is the largest. After P_i visits P_j, processor P_i has between 1/3 and 2/3 of the old load of P_j, and P_j has the remainder. Processor P_i uses incr and decr to record the new loads of P_i and P_j in S. (A sketch of S and of this visit protocol follows below.)

Each report involves one operation on S, hence O(log k) collisions. Each visit involves three operations on S, hence O(log k) collisions, plus O(1) collisions in the local segment of the host. (When a processor selects a task from its own local segment, the access to the local segment involves a collision only if there is a concurrent visit.) So there are two types of collisions: reporting collisions and visiting collisions. We estimate them separately.
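Before the analysis, a concrete illustration may help. The sketch below is ours, not the authors' code: it realizes S as a flat tournament tree over k slots (one admissible choice besides the height-balanced trees and heaps mentioned above), giving max, incr, and decr in O(log k) steps, and it implements the visit protocol using the LocalSegment sketch from the previous section. Locking is omitted, and the visit routine assumes the visitor's reported load is already zero.

class GlobalLoads:
    # Reported loads R_i of k processors. Leaf k+i of a flat binary
    # tournament tree holds R_i; each internal node holds the maximum
    # over its subtree, so the root always knows the heaviest load.
    def __init__(self, k):
        self.k = k
        self.tree = [0.0] * (2 * k)

    def _set(self, i, value):
        i += self.k
        self.tree[i] = value
        i //= 2
        while i >= 1:                     # O(log k) path back to the root
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def incr(self, i, r):                 # raise R_i by a positive real r
        self._set(i, self.tree[self.k + i] + r)

    def decr(self, i, r):                 # lower R_i by a positive real r
        self._set(i, self.tree[self.k + i] - r)

    def max(self):
        # Descend from the root toward the heaviest leaf.
        i = 1
        while i < self.k:
            i = 2 * i if self.tree[2 * i] >= self.tree[2 * i + 1] else 2 * i + 1
        return i - self.k, self.tree[i]

def visit(idle, S, segments):
    # The simple version of the algorithm: an idle processor steals from
    # the processor whose reported load is largest, then records both new
    # loads exactly. Three operations on S, hence O(log k) collisions.
    host, reported = S.max()
    segments[idle] = segments[host].split()
    S.decr(host, reported - segments[host].load())   # host's exact new load
    S.incr(idle, segments[idle].load())              # visitor's load (assumed 0 before)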

Visiting collisions

First we estimate visiting collisions in the "static" case. Assume that there are n tasks, distributed over the local segments of the k processors. Further, no new tasks are generated within, or added to, the system, so no reporting collisions occur.

Theorem 1. In the static case, if there are n tasks in a k-processor system, then in the worst case, the processors complete all tasks with O(k log k log n) collisions.

Proof. Define a step to be an occurrence of a report or a visit. In the static case, there are no reports, and every step is a visit. We show that to complete all n tasks, our scheme takes O(k log n) visits, each of which can incur O(log k) collisions. Steps are numbered, and by convention, the computation begins at step 0. Let

    R_i(t) = reported load of P_i at step t;
    L_i(t) = actual load of P_i at step t;
    m(t) = max_i {R_i(t)}.

Thus m(t) is the largest weight in S at step t. If P_i visits P_j at step u, then P_i steals between 1/3 and 2/3 of the tasks at P_j. In our notation,

    R_i(u) = L_i(u) ≤ (2/3) L_j(u-1) ≤ (2/3) R_j(u-1),
    R_j(u) = L_j(u) ≤ (2/3) L_j(u-1) ≤ (2/3) R_j(u-1).

Hence R_i(u) ≤ (2/3) m(u-1) and R_j(u) ≤ (2/3) m(u-1).

Now consider m(u+k), the maximum reported load k steps after u. We claim that m(u+k) ≤ (2/3) m(u). At step u there can be at most k processors with reported load greater than (2/3) m(u), and the processor with the largest reported load is visited at each step. Thus in the k visits following u, all processors with reported load greater than (2/3) m(u) must be visited, and our claim holds. But m(0) ≤ n, so the number of steps in the computation is O(k log n).  □

In the static case, there were no additional tasks or reporting collisions to complicate our analysis. Nevertheless, the bound on the visiting collisions holds even in the dynamic case, when visits are interspersed with reports.

Theorem 2. Under dynamic conditions, with generation and arrivals of new tasks, if there are initially n tasks in a k-processor system, then in the worst case, the processors complete n tasks with at most O(k log k log n) visiting collisions.

Proof. In the proof of Theorem 1, since no tasks were introduced into the system, m(t) was a monotonic decreasing function. This is not true in the dynamic case. Let us assume that all the n tasks in the system at step 0 are coloured blue, and all tasks generated or added later are green. Further, as a theoretical convenience, we assume that when P_i visits P_j, processor P_i steals blue and green tasks in proportion to their current numbers at P_j. As before, a step is a report or a visit, which involves O(1) operations on S, hence O(log k) collisions.

In addition to the previously defined functions R_i(t), L_i(t), m(t), define R_i*(t) to be the number of blue tasks in the reported load of P_i at step t, and m*(t) = max_i {R_i*(t)}. Also define n*(t) to be the number of blue tasks in the system at step t. For all t, m*(t) ≥ n*(t)/k.

Fix a step t and consider the next 9k visits after t. Let t' be the step of the last of these visits. We shall show that if m*(t') > (8/9) m*(t), then by step t', n*(t) tasks have finished execution.

Assume m*(t') > (8/9) m*(t). Let P_h be the processor such that R_h*(t') = m*(t'). First we show that none of the 9k visits following t involves P_h. Since m* is monotonically nonincreasing, and since blue and green tasks migrate in proportion to their numbers at the host processor in every visit, if P_h made or hosted a visit after step t, then the number of blue tasks at P_h after this visit, and thereafter, would be at most (2/3) m*(t) < m*(t'); this would contradict the definition of P_h. Thus P_h neither makes nor hosts a visit between t and t', and hence R_h changes only when P_h reports increases in R_h. Furthermore, between these reports of R_h, the number of blue tasks at P_h cannot increase, and consequently R_h*(t) ≥ R_h*(t'). Since each visit occurs at the processor that appears to be the most heavily loaded, the 9k visits between t and t' must involve hosts whose reported loads are greater than or equal to

    R_h(t) ≥ R_h*(t) ≥ R_h*(t') = m*(t') > (8/9) m*(t).

Let r = R_h(t). We shall show that, with the exception of the last visit by each processor between t and t', each of the 9k visits implies the completion of (1/6) r > (4/27) n*(t)/k tasks either before or after the visit. Consequently, at least (9k - k)(1/6) r > (8k)((4/27) n*(t)/k) > n*(t) tasks must finish by step t'.

Suppose P_i visits P_j at step u, where t < u ≤ t'. Then R_j(u-1) ≥ r. If L_j(u-1) < (1/2) R_j(u-1), then P_j has completed more than (1/2) R_j(u-1) ≥ (1/2) r tasks since the last time R_j was reported; we associate (1/4) r of these tasks with the visit at step u. If L_j(u-1) ≥ (1/2) R_j(u-1), then after the visit, both P_i and P_j have at least (1/3) L_j(u-1) ≥ (1/6) R_j(u-1) ≥ (1/6) r tasks.

Now we show that if P_i makes another visit at some step u' after step u, with u' ≤ t', then P_i completes at least (1/6) r tasks between u and u'. If P_i is not visited between u and u', then P_i must complete all tasks that it stole at step u before making the visit at step u', and we associate these (1/6) r or more tasks with step u. If P_i hosts one or more visits between u and u', then two cases arise. If L_i ≥ (1/2) R_i at all of these visits, then P_i still has at least (1/6) r tasks after the last visit, which it must complete before step u'. If L_i < (1/2) R_i at some visit, then P_i has completed at least (1/2) r tasks, and we associate (1/4) r of these tasks with step u. (As mentioned above, the other (1/4) r tasks are associated with this visit to P_i.)

Each completed task is associated with at most one of the 9k visits. Thus between t and t', either m* decreases by (1/9) m*(t), i.e., m*(t') ≤ (8/9) m*(t), or n*(t) tasks finish. But n*(0) = n is the number of blue tasks in the system at step 0. Therefore n tasks, blue or green, finish in O(k log n) steps. This proves the theorem.  □
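The decay argument behind these theorems can be watched numerically. The following toy simulation is our illustration, not part of the paper: it models the static case in rounds, where every idle processor steals a third of the heaviest queue and every busy processor completes one task, and compares the visit count with the O(k log n) bound on steps. The parameters are arbitrary.

import math
import random

def static_visits(k, n, seed=0):
    # Static case: n tasks spread at random over k queues; an empty
    # processor always steals a third (rounded up) of the largest queue.
    rng = random.Random(seed)
    loads = [0] * k
    for _ in range(n):
        loads[rng.randrange(k)] += 1
    visits = 0
    while sum(loads) > 0:
        for i in range(k):
            if loads[i] == 0 and max(loads) > 0:
                j = max(range(k), key=loads.__getitem__)
                stolen = (loads[j] + 2) // 3        # ceiling of one third
                loads[j] -= stolen
                loads[i] += stolen
                visits += 1
            if loads[i] > 0:
                loads[i] -= 1                       # complete one task
    return visits

for k, n in [(4, 10**3), (16, 10**3), (16, 10**5)]:
    print(k, n, static_visits(k, n), "vs k*log(n) =", round(k * math.log(n)))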


Reporting collisions

Next, we estimate reporting collisions. First, note that our proofs of Theorems 1 and 2 rely on the assertion that the reported load is always an overestimate of the actual load. This assertion follows from the policy that processors report only increases in their previously reported loads. Under these conditions we established that O(k log k log n) collisions suffice, in the worst case, to process n tasks. Hence if O(k log k log n) reporting collisions suffice to introduce n tasks into the system, then our algorithm incurs O(k log k log n) total collisions.

To reduce the number of reporting collisions, we relax the requirement that processors report every increase in load immediately. Instead, if the current load at processor P is L, then P reports the number ⌊log_ρ L⌋, where ρ > 1 is a constant to be determined later. If at a task arrival or generation this number does not change, then P does not report an increase. Thus our modified reporting policy is as follows: do not report any load reductions; report increases only if ⌊log_ρ L⌋ changes, where L is the current load of the processor. Of course, if P_i visits P_j, then the new loads of both processors are reported, but these reports have already been counted among the visiting collisions. (A sketch of this reporting policy appears at the end of this section.)

Let us see how this new reporting policy affects our proofs of Theorems 1 and 2. The actual load under the new reporting policy is at most ρ times the reported load. Hence as long as (2/3)ρ < 1, i.e., ρ < 3/2, Theorems 1 and 2 hold.

First, a static result. We analyse the worst-case number of reporting collisions required to introduce an additional n' tasks when no visits occur.

Theorem 3. In the static case, in a k-processor system, O(k log k log n') reporting collisions suffice to add n' tasks to the system, in the worst case.

Proof. Let n_i be the load of processor P_i at time 0, and of the n' new tasks, let n'_i be added to P_i. Clearly, in the static case, adding n'_i tasks to processor P_i takes no more than log_ρ(n_i + n'_i) - log_ρ(n_i) ≤ log_ρ n'_i reporting steps. Although there may be no visits, P_i may complete some of its assigned tasks. Under our reporting policy, this could only lead to fewer reports, since even more tasks must arrive at P_i before P_i reaches the next reporting level. Hence the total number of reports is less than Σ_{i=1}^{k} log_ρ n'_i ≤ k log_ρ n'. Since each report involves O(log k) collisions, the number of possible collisions is O(k log k log n').  □

In the dynamic case, as a theoretical convenience, assume that the tasks form a first-in first-out (FIFO) queue in front of their processors. Thus if tasks b_1, ..., b_m are in the queue of P_1, then b_1 entered the queue the earliest and b_m the latest. The next task arriving at or generated for P_1 sits after b_m. Further, assume that when P_1 has b_1, ..., b_{m'}, b_{m'+1}, ..., b_{m'+m} in its queue, and a visit by P_2 steals m tasks from P_1, then after the visit, P_1 has b_1, ..., b_{m'} and P_2 has b_{m'+1}, ..., b_{m'+m} as their respective queues. Thus visits disturb neither the order of the migrating tasks nor the order of the tasks left behind.

Let b be the m-th task from the head of the queue of some processor at time t. Define the level number l(b, t) = ⌊log_ρ m⌋. Clearly, for each b, l(b, t) only decreases with time. Further, since ρ < 3/2, when task b migrates to a visiting processor, l(b, t) decreases by at least 1 in this migration: the visitor takes a tail block comprising at most two thirds of the host's queue, so the position of b from the head of the queue shrinks by a factor of at least 3/2 > ρ. Let l_0(b) denote the level number of task b when it first entered the system. Then l(b, t) ≤ l_0(b) for every b, at any time t.

For the following theorem we need an assumption on l_0.

Steady state assumption: There is a constant c such that for every task b, l_0(b) ≤ c log n.

This local steady state assumption appears reasonable. For example, the assumption holds when the number of tasks ever at a processor is bounded above by a polynomial in n, the number of tasks currently in the system.

Theorem 4. At time t, let processor P_1 have n' tasks in its queue, and let n be the total number of tasks in the system. Then, under the above scheme, and under the steady state assumption, no more than O(log k log n) reporting collisions could have occurred while introducing the n' tasks into the system.

Proof. Here again a step is a report or a visit. Let b_1, ..., b_{n'} be the tasks in the queue of P_1 at time t. Let 0 = m_0 < m_1 < ... < m_r = n' be numbers such that tasks b_{m_j + 1}, ..., b_{m_{j+1}} originated at the same processor, for all 0 ≤ j < r. For all x, y, z > 0 such that x > y, we have log(x + z) - log(x) < log(y + z) - log(y), and hence

    ⌊log(x + z)⌋ - ⌊log(x)⌋ ≤ ⌊log(y + z)⌋ - ⌊log(y)⌋ + 2.

Applying this with y = m_j, z = m_{j+1} - m_j, and x the number of tasks ahead of the j-th segment when it entered the system (x ≥ y, since positions only decrease with time), we obtain

    l_0(b_{m_{j+1}}) - l_0(b_{m_j + 1}) ≤ l(b_{m_{j+1}}, t) - l(b_{m_j + 1}, t) + 2.

But l_0(b_{m_{j+1}}) - l_0(b_{m_j + 1}) is the number of possible reporting steps incurred in introducing the j-th segment into the system at its processor of origin. Summing both sides of this inequality yields

    Σ_{j=0}^{r-1} [l_0(b_{m_{j+1}}) - l_0(b_{m_j + 1})] ≤ Σ_{j=0}^{r-1} [l(b_{m_{j+1}}, t) - l(b_{m_j + 1}, t) + 2] ≤ ⌊log_ρ n'⌋ + 2r ≤ ⌊log_ρ n'⌋ + 2 l_0(b_1)

(the last inequality holds because each segment boundary reflects a migration of the block containing b_1, and each such migration decreases the level number of b_1 by at least 1, so r ≤ l_0(b_1)). By the steady state assumption, l_0(b_1) = O(log n). Since n' ≤ n, we conclude that O(log n) reporting steps, and hence O(log k log n) reporting collisions, occurred during the addition of the n' tasks to the queue of P_1.  □
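In code, the modified policy amounts to caching the last reported level number. The sketch below is ours, not the paper's: S may be any structure supporting the incr operation (for instance, the tournament tree sketched earlier), and ρ = 1.4 is an arbitrary admissible constant below 3/2.

import math

RHO = 1.4    # any constant with 1 < rho < 3/2 keeps Theorems 1 and 2 valid

class Reporter:
    # One processor's side of the reporting policy: report an increase
    # only when floor(log_rho L) grows; never report load reductions.
    def __init__(self, pid, S):
        self.pid, self.S = pid, S
        self.load = 0
        self.level = -1                   # last reported level number
        self.reported = 0.0               # weight currently recorded in S

    def task_arrived(self):
        self.load += 1
        new_level = math.floor(math.log(self.load, RHO))
        if new_level > self.level:
            # One incr on S, hence O(log k) potential collisions; between
            # reports the actual load stays below rho times the reported load.
            self.S.incr(self.pid, RHO ** new_level - self.reported)
            self.reported = RHO ** new_level
            self.level = new_level

    def task_completed(self):
        self.load -= 1                    # reductions are never reported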

We have shown that it takes O(k log k log n) visiting collisions following step t to process the n tasks in the system at step t. Similarly, we have shown that it takes O(k log k log n) reporting collisions preceding step t to arrive at a configuration of n tasks at step t. We have not shown that over O(k log k log n) collisions the system completes n tasks; this is not true in general. The second assertion holds only when we assume steady loading conditions.

Acknowledgment

A referee noticed an error in the previous version of the proof of Theorem 2.

References

[1] G. Cybenko, Dynamic load balancing for distributed memory multiprocessors, J. Parallel and Distributed Comput. 7 (1989) 279-301.
[2] J.R. Driscoll, H.N. Gabow, R. Shrairman and R.E. Tarjan, Relaxed heaps: an alternative to Fibonacci heaps with applications to parallel computation, Comm. ACM 31 (1988) 1343-1354.
[3] M.L. Fredman and R.E. Tarjan, Fibonacci heaps and their uses in improved network optimization algorithms, J. ACM 34 (1987) 596-615.
[4] U. Manber, On maintaining dynamic information in a concurrent environment, SIAM J. Comput. 15 (1986) 1130-1142.
[5] Y.T. Wang and R.J.T. Morris, Load sharing in distributed systems, IEEE Trans. Comput. 34 (1985) 204-217.