Parallel Computing 25 (1999) 1013±1033
www.elsevier.com/locate/parco
Analysis of nearest neighbor load balancing algorithms for random loads Peter Sanders*,1 Max-Planck-Institut f ur Informatik, Im Stadtwald, 66123 Saarbr ucken, Germany
Abstract Nearest neighbor load balancing algorithms, like diusion, are popular due to their simplicity, ¯exibility, and robustness. We show that they are also asymptotically very ecient when a random rather than a worst case initial load distribution is considered. We show that diusion needs H
log n2=d balancing time on a d-dimensional mesh network with nd processors. Furthermore, some but not all of the algorithms known to perform better than diffusion in the worst case also perform better for random loads. We also present new results on worst case performance regarding the maximum load deviation. Ó 1999 Elsevier Science B.V. All rights reserved. Keywords: Parallel algorithm analysis; Nearest neighbor load balancing algorithm; Load balancing random loads; Maximum load deviation; Diusion load balancing
1. Introduction Load balancing is one of the key aspects of parallel computing. The problem has so many aspects that abstract models are important for separating a load balancing algorithm from the machine and the particular application used. One interesting load model is the following: the load of a processor (PE) can be accurately determined and arbitrarily subdivided into multiple pieces which can be independently worked on by dierent PEs. Any portion of load can be communicated to a neighboring PE in unit time. A related model assumes the load to consist of discrete, equally sized load units and transmission costs which are proportional to the number of units transferred.
* 1
E-mail:
[email protected] Most of this work was done while the author was at the University of Karlsruhe.
0167-8191/99/$ ± see front matter Ó 1999 Elsevier Science B.V. All rights reserved. PII: S 0 1 6 7 - 8 1 9 1 ( 9 9 ) 0 0 0 4 0 - X
1014
P. Sanders / Parallel Computing 25 (1999) 1013±1033
This model can be used to analyze a family of load balancing algorithms which can be characterized as nearest neighbor methods. They only require local communication, and every PE cycles through identical iterations during which a PE determines its own load and that of (some of) its neighbors. Based on this local information the PE decides how much of its load is transmitted to each neighbor. The PEs are usually assumed to proceed synchronously (but refer to [17] for an asynchronous algorithm). This is no strong restriction for nearest neighbor schemes because algorithms which only use local communication can always replace synchronous execution by local synchronization [13]. Algorithms with local control are particularly attractive for on-line load balancing applications, i.e., in settings where a load balancing process should run concurrently with an application which can generate or consume load at any time (sometimes this situation is also called dynamic). Load balancing algorithms with global control may have diculties in this context because global coordination can take so long that the situation has changed before a global data exchange is ®nished. There are a few results on on a rather special on-line model [14]. Most papers consider the easier to characterize o-line load balancing problem where a load balancer is judged by looking at the time required to balance an initial load distribution. For comparability and simplicity we adopt this convention. Note that the relative o-line performance of the local algorithms considered here at least re¯ects the performance in those on-line settings where the load changes rarely. In particular, the local algorithms have the following self-stabilizing behavior: once the load stops changing, the algorithms are immediately moving towards a balanced state with the speed of the o-line setting. There are signi®cant results about the worst case performance of nearest neighbor load balancing algorithms, but little progress has been made in understanding the gap between the worst case and the actual load patterns of real applications. A worst case analysis is therefore often complemented by simulations, e.g., for random loads [20]. Which model is more appropriate certainly depends on the application one has in mind. But there is an important argument which favors other models: For worst case instances, there are simple and ecient load balancing algorithms which use global operations. (One example is described in Section 6.2.) Local algorithms have diculties competing here. Nearest neighbor algorithms are potentially superior for load distributions which do not require global load movements. We show that random problem instances can be solved exponentially faster by some local algorithms than by global algorithms. This is interesting because one might intuitively expect that ``random'' means ``irregular'' means ``dicult to balance''. 1.1. Some nearest neighbor algorithms Perhaps the conceptually simplest local algorithm is diusive load balancing. A constant fraction of the load dierence between all neighboring PEs is exchanged at every step. By exploiting the fact that this algorithm can be modeled by a simple partial dierence equation (the diusion equation or Poisson equation), analytical
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1015
results about the convergence rate in the worst case can be obtained for meshes, tori, hypercubes, and (less accurately) for arbitrary networks [1,2,18,20]. It turns out that diusion can be rather slow in the worst case. For example, on an n-PE linear array X
n2 iterations are required in order to achieve approximate load balancing. A discrete analog for diusion is to exchange a single load unit whenever there is a load dierence [9]. This algorithm has been claimed to be more ecient than continous diusion based on a comparison of the worst case bound for diusion with simulations of the discrete algorithm for random instances. The present results show that this argumentation is invalid. In addition, the discrete scheme often gets stuck in arbitrarily unbalanced con®gurations. The situation becomes less puzzling if load is exchanged with one neighbor after the other rather than with all neighbors at once. There are several such dimension exchange algorithms [2,8,20]. The discrete version of dimension exchange quickly converges up to a maximal (global) load dierence of d load units on a d-dimensional mesh, and a simple extension even ensures the best possible convergence [16]. Similar algorithms for expander graphs are considered in [4]. Another approach which improves the worst case behavior of diusion is second order diusion [5] which uses the load of two preceding iterations to compute the load transfers. 1.2. Overview To summarize, our objective is to evaluate the performance of local load balancing algorithms for random initial loads, giving tight bounds if possible. The results may help in understanding previous simulation results and in guiding future studies. Furthermore, they give a rigorous justi®cation, why local load balancing can be more ecient than algorithms with global control (some local algorithms turn out to be exponentially faster for random loads on mesh networks). In addition, we de®ne a criterion for the quality of load balancing which is more directly related to the execution time of the overall application than previous criteria. In order to simplify comparisons, we also derive worst case bounds for this criterion. The remainder of this paper is structured as follows. In Section 2 we give a more detailed description of the machine and load models. Section 3 introduces the diffusion algorithm and adapts previous results in order to derive worst case lower bounds on the balancing time. The core part of the paper is Section 4 which analyzes the performance of diusion for random problem instances and derives upper bounds for worst case instances. Because it turns out that all the main points can already be made for the linear array, we ®rst restrict the proofs to this notationally simpler case. Section 5 then explains which modi®cations have to be made in order to extend the results to higher dimensional mesh and torus networks. The scope is further expanded in Section 6 where we discuss algorithms which are superior to diusion for worst case instances. Section 7 summarizes and discusses the results and identi®es some possible directions of future research. An extended abstract summarizing some of the results presented here has been published in [15].
1016
P. Sanders / Parallel Computing 25 (1999) 1013±1033
2. The model The considered machine consists of N identical PEs which interact by message passing and work synchronously. Most of the time we assume a d-dimensional mesh with N nd PEs numbered by an index set I : f0; . . . ; n ÿ 1gd else assume I : f0; . . . ; N ÿ 1g. We sometimes note how the results can be adapted to torus networks. In the conclusions we shortly discuss the case of nonequal side lengths. The load at time t is described by the N -dimensional load vector l
t. The load of PE i, li
t is measured by the sequential execution time required to consume it. Whenever the intended P point in time is clear form the context, the t is omitted. Let lavg : 1=N i2I li denote the average load. As explained in the introduction, we analyze a scenario where the initially given load l
0 is to be balanced while the sum of all PE loads (and therefore also lavg ) remains constant. The fully balanced situation T can therefore be de®ned by the vector lavg :
lavg ; . . . ; lavg . Let e : l ÿ lavg denote the deviation from the balanced state or error vector. Previous papers judge the quality pof P the 2 achieved load balancing in terms of the Euclidean norm kek 2 i2I ei of e. This is the ``default'' in mathematics, and it is easier to handle than other measures. But we do not know of any immediate connection between kek2 and the performance of an underlying application. We therefore use a dierent model. De®nition 1. The load is e-balanced when kek1 maxi2I jei j 6 e for a given positive constant e. (We omit the e-pre®x from now on since we always use the same parameter e.) In particular, in a balanced situation the remaining parallel execution time required for consuming the load is at most
1 elavg , and for the next
1 ÿ elavg time units all PEs can work productively without additional communication. The balancing time Tbal can then be de®ned as the smallest t for which l
t is balanced. Algorithms with discrete loads have the additional constraint that li
t must always be an integer and that only one load unit per time step can be transmitted. Random loads are modeled by considering the li
0 to be independent, identically distributed, bounded random variables with expectation Eli
0 l and li
0 6 l^ such ^ l is a constant. For example, the frequently assumed uniform distribution that l= [9,20] has l^ 2l. We use the following frequently used de®nition of asymptotic behavior with high probability for our results (e.g.[12]). ~
f
n i De®nition 2. X 2 O 8b 2 R : 9c 2 R : 9n0 : 8n P n0 : P X 6 cf
n P 1 ÿ nÿb : For example, the execution time of a load balancing algorithm is considered to be of the order T
n with high probability if cT
n iterations suce for all but a polynomially small fraction of all random instances (for some constant c and suciently large n).
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1017
For suciently large n, lavg will be so close to l that the two We have Elavg l. quantities can be used almost interchangeably. In particular, the following lemma allows us to analyze the load balancing time in terms of the deviation from the expected load rather than in terms of the deviation from the actual average load: (we omit the proof which is quite simple and rather uninteresting). Lemma 3. For all e > 0 there is a d > 0 such that the time required to achieve . . . ; l T ) also suffices to achieve kek 6 elavg with high jj l jj1 ÿ l < dl (with l :
l; probability. 3. Diusive load balancing For any undirected graph with maximum node degree at most 1=a, synchronous diusive load balancing can be de®ned by the simple rule: X li
t 1 : li
t a
lj ÿ li ;
1 j2D
i
where D
i is the set of neighbors of PE i. A useful observation is that diusive load balancing is closely related to random walks. Lemma 4. If the loads are normalized, i.e., nlavg 1, li
t is identical to the probability pi
t that a particle is at PE i at time t, if this holds for t 0 and the particle performs a random walk moving to each possible neighbor with probability a. Proof. By induction on t. By assumption pi
0 li
0. Now assume pi
t li
t. pi
t 1 is the probability that the particle has been there before and does not move away plus the probability that it moves in from a neigboring PE, i.e., ! X X X a apj
t pi
t a
pj
t ÿ pi
t pi
t 1 pi
t 1 ÿ j2D
i
li
t a
X
j2D
i
lj
t ÿ li
t lj
t 1:
j2D
i
j2D
i
A lower bound for the eciency of diusion for meshes can immediately be derived from previous results: Theorem 5. The worst case balancing time for diffusion on d-dimensional meshes with nd PEs is in X
n2 . Proof. Let e denote any positive constant. Using [18,20] it can be seen that there is a choice of l
0 such that ke
0k1 alavg (a > e) and ke
tk1 jkjt ke
0k1 such that jkj 2 1 ÿ H
1=n2 . (k is the subdominant eigenvalue of a matrix which can be used to
1018
P. Sanders / Parallel Computing 25 (1999) 1013±1033
de®ne the diusion rule and l
0 is lavg plus an appropriate scaling of the corresponding eigenvector.) Hence, there is a constant c such that for suciently large n, jkj P 1 ÿ c=n2 , and therefore 2
t
ke
tk1 P
1 ÿ c=n2 a eÿct=n a > e for t <
ln ae 2 n: c
4. The linear array The analysis of diusion for the linear array proceeds as follows. In Section 4.1 we derive a simple closed form formula for li
t as a linear combination of the coecients of l
0. Two key observations make this possible. The linearity of diusion implies that the behavior for a unit load already contains all the information we need. Furthermore, a special treatment of the border PEs 0 and n ÿ 1 can be avoided by introducing periodic boundary conditions. Section 4.2 establishes the main result 2 of this section, namely that with high probability Tbal 2 O
log n for random instances. A key tool making this possible are Hoeding bounds. Finally, in Sections 4.3 and 4.4 the missing lower bounds for random instances respectively upper bounds for the worst case are derived. 4.1. A closed form solution for li
t To get the main idea for ®nding a closed form solution, we start with some simpli®cations. We assume that there is an in®nite number of PEs ± one for each integer. Also, we ®x the diusion parameter a to 1=2. This value is identi®ed as optimal for the linear array in [20]. (For the torus it is optimal for odd n.) The diusion rule Eq. (1) can now be written as li
t 1
liÿ1
t li1
t : 2
Furthermore, assume2 li
0 i 0, i.e., initially there is only a unit load on PE 0. Then we get the following load development:
2
l
0
. . . ;
l
1 1=2
. . . ;
l
2 1=4 l
3 1=8
. . . ;
. . . ;
l
4 1=16
. . . ; 0;
. . .T
0;
1;
0;
0;
1;
0;
1;
0;
0;
0; 1;
1; 0;
0; 3;
2; 0;
0; 3;
1; 0;
0; 1;
0;
1;
0;
4;
0;
6;
0;
4;
0;
1;
T
. . .
T
. . . . . .T T
0; . . .
We adopt the notation from [6] to de®ne P 1 if the predicate P is true and P 0 else.
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1019
Fig. 1. Periodic boundary conditions for the mesh.
l
t is basically the tth row of a Pascal triangle. More precisely, t ÿt li
t 2 ti even
t i: 2
The above result can be generalized for arbitrary initial load patterns by scaling, shifting and adding several unit-load solutions. Theorem 6. li
t 2
ÿt
X j
t
t
iÿj 2
lj
0even
t
i ÿ j:
Proof. Instead of justifying the entire sequence of the above heuristic steps the theorem can be proved directly by induction over t using the well known relation. (Note that j runs over all integers but for all j not in ft ÿ i; . . . ; t ig the binomial coecient becomes zero.) t t t1 : k kÿ1 k
This simple formula can also be used for a ®nite array if we introduce the periodic boundary condition li : lg
i mod 2n with g
i min
i; 2n ÿ i ÿ 1.3 So, instead of considering a ®nite linear array, we now use an in®nite line of PEs with a periodic repetition of the initial load of the ®nite array alternating with its mirror image. Fig. 1 shows the structure of this periodic pattern. 4.2. Upper bound for random instances This section is devoted to the proof of the following theorem. Theorem 7. For random problem instances diffusion with parameter a 1=2 on an nPE linear array has ~
log n2 : Tbal 2 O
3
The ring topology can be treated similarly by using the periodic boundary condition li : li mod n .
1020
P. Sanders / Parallel Computing 25 (1999) 1013±1033
Our principal tool will be the following Hoeffding bound theorem which we cite from [7, Theorem 2.6.7] in a form slightly rewritten for our purposes: Theorem 8 (Hoeding bounds). If X1 ; . . . ; Xn are independent random variables with P ai 6 Xi 6 bi , then for eE i Xi > 0 " P
X
Xi >
1 eE
i
X i
#
"
ÿ P 2 # e 2 E i Xi Xi 6 exp ÿ P : 2 i
bi ÿ ai
2 2
Let b denote an arbitrary positive constant and let t c
ln n for a value of c still to be determined. Since P jj l ÿ 1 jj1 > el 6 nP jli
t ÿ lj > el for some ®xed i, it suces to show that P jli
t ÿ lj l 6 enÿbÿ1 for every ®xed i. We consider the cases li
t >
1 el and li
t <
1 ÿ el separately: Part 1: P li
t >
1 el 6 12 nÿbÿ1 . From Theorem 6 we know that li
t
X
ÿt even
t
i ÿ jlj
0: t
iÿj 2 t
2
j
By appropriately reordering this sum it can be written as X t 2ÿt ljk
0 li
t k k and for t < n the nonzero terms are independent random variables in the range t ^ l : 0; 2ÿt k Note that this also holds if PE i is near the border since for t < n there are never any PEs whose initial load contributes to two nonzero terms of the sum. Furthermore, Therefore we can use the Hoeding-bound Eq. (2) to see that Eli
t l. 2 6 h i 6 P li
t >
1 el 6 exp 6 ÿ 4
We now use the relation X t 2 k
k
2t t
3 7 e2 l2 7 2 7: 5 P t 4ÿt l^2 k k
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1021
(for example [6, Eq. 5.23]) and the Stirling approximation for the binomial coecient q 2t 6 4t pt1 . t 3 2 " p # 6 h i e2 l2 4t 7 e2 l2 pt 7 6 P li
t >
1 el 6 exp 6 ÿ 7 6 exp ÿ 4 ^2 2t 5 l^2 l t p 2 e2 l2 cp 1
b 1 1= log n l^4 ÿ n l^2 6 nÿbÿ1 for c P : 2 pe4 l4 Part 2: P li
t <
1 ÿ el 6 12 nÿbÿ1 . The argument is analogous to the argument for Part 1. The Hoeding bound can be used to bound deviations below the expected value by substituting e ÿe and ÿXi . We can now conclude that P j li
t ÿ l j> el 6 nÿbÿ1 . Xi 4.3. Lower bound for random instances Theorem 9. There are distributions of lj
0 such that for most random instances of 2 diffusive load balancing the execution time is in X
log n . l; ^ P lj
0 0 1 ÿ l= l. ^ Fix any Proof. Consider the distribution P lj
0 l^ l= ^ constant c < 1= log
l=l. (Mentally) subdivide the array into equal sized intervals of length c log n.4 Using elementary but lengthy calculations, it can be proven that for suciently large n, with probability > 1/2, there is at least one interval in which every ^ Balancing the load of this highly loaded interval can be shown to PE receives load l. 2 require X
log n steps (using similar techniques we are using in Theorem 13 to show worst case upper bounds). The proof can also be adapted to other load distributions like the uniform distribution. The only requirement is that there is a constant nonzero probability for an unbalanced initial value li
0. 4.4. Upper bound for worst case instances We start with some simple observations which immediately follow from Theorem 6. These lemmas are formulated for in®nite input patterns. Lemma 10. ke
tk1 is a decreasing function. Lemma 11. li
t does only depend on lj
0 if even
t even
i ÿ j.
4
Slightly tighter results are possible by selecting subsets of PEs which include only every other PE in an interval of length 2c log n.
1022
P. Sanders / Parallel Computing 25 (1999) 1013±1033
Fig. 2. Transformation of initial load for worst case bound.
In Theorem 6, the coecients before lj
t decrease with growing distance from i. Therefore, manipulating l
0 by ``moving'' initially present load closer to a PE i can only increase its load at time t. (Note that this manipulation can produce nonperiodic load patterns but we do not claim that the modi®ed pattern corresponds to a legal initial situation.) Lemma 12. For any t if b P a, k ÿ j is even and jk ÿ ij 6 ji ÿ jj then substituting lj
0 ÿ a and lk
0 lk
0 b cannot decrease li
t. lj
0 We now have the building blocks for the proof of the following upper bound. Theorem 13. The worst case balancing time for diffusion on the linear array with diffusion parameter a 1=2 is in O
n2 . Proof. We again look at the maximum load ®rst. Consider any PE i and any legal initial load pattern. Due to the monotonicity of ke
tk1 , we can restrict ourselves to even times t without loss of generality. Due to Lemma 11, we can therefore disregard all positions j with odd
i ÿ j. 5 By exploiting the symmetry properties of l
0 and repeatedly applying Lemma 12 it can be shown that li
t cannot decrease if we transform l
0 to lj
0 : lavg i 6 j ^ even
i ÿ j 2nlavg i j: Strictly speaking, we only need to rede®ne lj
0 for ji ÿ jj 6 t. Fig. 2 shows this transformation operation. Therefore, ÿt
li
t 6 2 lavg
X j6i
6 lavg 1 2n2ÿt
t t
iÿj 2
t
even
t
i ÿ j 2n
t=2
using the Stirling approximation for
5
2t t
t
!
t=2
again, we get
We do not lose any information by doing that. For our periodic input patterns, all loads with odd
i ÿ j will also appear as a ``mirror image'' at an even distance.
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1023
r! 2 8n2 1 2n 6 lavg
1 e for t P 2 : pe tp
li
t 6 lavg
To bound the behavior of the minimum load we can construct a similar set of lemmas and transform l
0 to lj
0 : lavg i 6 j ^ even
i ÿ j ÿ 2nlavg i j. Then the ``hole'' in l
0 will produce a load de®cit smaller than elavg for some t 2 O
n2 .6 This result is easy to adapt for tori with odd n. For even n, convergence in the worst case is only possible for a < 1=2d. However, our calculations can be adapted for the case a 1=4d. 5. Higher dimensions With some eort, the results for the one-dimensional case can be adapted for (constant) higher dimensions. For conciseness, we only outline what has to be changed. Some generalizations are immediate. The optimal diusion parameter for a d-dimensional mesh is a 1=2d [20]. By introducing periodic boundary conditions for all dimensions we can write li
t 1
d 1 X
liuk
t liÿuk
t 2d k1
(where uk denotes the kth unit vector). Furthermore, X xiÿj
tlj
0 li
t j2Zd
and the coecients xiÿj are independent of the load. In order to determine xk , it is again sucient to consider the development of a unit load at PE 0. The interpretation of this setting as a random walk is still applicable. Furthermore, x
k1 ;...;kd is zero if t k1 kd is odd. For even t, we have xk 6 x0 for all k. Still, the coecients xk are more dicult to determine and manipulate now. We know a relatively simple closed form only for d 2. Theorem 14. For d 2; x
x;y
t 4ÿt
t
txy 2
t
txÿy 2
even
t x y:
Proof. By induction over t.
6
The negative load at PE 0 causes no problems because the transformed con®guration need not be a legal initial load con®guration.
1024
P. Sanders / Parallel Computing 25 (1999) 1013±1033
Fortunately we do not really need to know all the xk if we have a good estimate for x0 . ÿ Lemma 15. x0
t 2 O tÿd=2 . Proof. Assume without loss of generality that t is divisible by 2d. In the probabilistic interpretation, x0 is the ``return to origin'' probability of a d-dimensional random walk. Following Feller [3, Section XIV.7] we get x0 2
ÿt
t t=2
X
d
ÿt=2
k1 kd t=2
t=2! Q i ki !
2
P 2 P 2 Using ai 6
max ai
ai and the fact that the terms inside
constitute a multinomial distribution (and therefore sum to 1) we have x0 6 2ÿt
t=2!
t=2! t t d ÿt=2 max Q d ÿt=2 2ÿt d t=2 t=2 k1 kd t=2 k !
t=2d! i i
and using the Stirling approximation r p
t=2 2 d ÿt=2 pt
t=2 eÿt=2
1 H
1=t x0 6 p d t pt pt=d
t=2d
t=2d eÿ2d p pt ÿd=2 1 2 1H : d t In particular, a generalization of Theorem 13 is easy now since it is straightforward to adapt lemmas 10±12 for higher dimensions, and since the proof of Theorem 13 only needs the value of x0 . Theorem 16. The worst case balancing time for diffusion on a nd -mesh with diffusion parameter a 1=2d is in O
n2 . For random instances, we encounter a new problem. If i is near one of the corners of the mesh then the random variables lj
0 which contribute to li
t are not all independent even for small t. The reason is that due to the re¯ections implicit in the periodic boundary conditions there may be up to d indices which are determined by the same value (for t < n). But li
t is nevertheless the sum of H
td independent random variables so that the Hoeding bound can be applied. The expectation of the Using a similar trick as in the proof of Lemma 15, the sum of these variables is still l. sum over the squares of the bounds of these random variables is not larger than l^2 dx0 . We can now modify the proof of the one-dimensional case (Theorem 7) to d=2 2=d conclude that P li
t >
1 el 6 eÿHt so that there is a t 2 O
log n which
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1025
We conclude ensures with high probability that no PE has load more than
1 el. the following theorem. Theorem 17. For random problem instances diffusion with parameter a 1=2d on the d-dimensional mesh has ~
log n2=d : Tbal 2 O The corresponding theorem for torus networks is even simpler to prove because there is no trouble with dependent variables. The lower bound for random instances can be proved as in the one-dimensional case (Theorem 9) by showing that with high 1=d probability there is a highly loaded subcube of side length X
log n . Theorem 18. There are distributions of lj
0 such that for most random instances of diffusive load balancing the execution time is in X
log n2=d . 6. Other algorithms We now discuss some algorithms known to be superior in the worst case. Section 6.1 introduces second order diusion and Section 6.2 explains the dimension exchange approach and a simple yet ecient global algorithm that can be viewed as an instance of dimension exchange. The performance of these algorithms for random instances is compared with ordinary diusion in Section 6.3. In Section 6.4, we look at the discrete variant of dimension exchange. 6.1. Second order diusion In general, ordinary (®rst order) diusion schemes can be de®ned by the recurrence l
t 1 Ml
t where M is an N N doubly stochastic matrix. This can be extended to a second order diusion scheme [5] by setting l
t 1 bMl
t
1 ÿ bl
t ÿ 1: For the d-dimensional mesh b
2 1 q 2 1 ÿ H n 2 1 1 ÿ
1 ÿ 1=n2
is identi®ed as an optimal value for worst case instances. At least with respect to the Euclidean norm, the resulting algorithm is quite ecient and has a load balancing time in O
n.
1026
P. Sanders / Parallel Computing 25 (1999) 1013±1033
6.2. Dimension exchange The (generalized) dimension exchange approach is based on a k-coloring of the edges of the interconnection network, i.e., a mapping c from edges to ``colors'' in f1; . . . ; k g with the property that no vertex (PE) is incident to two edges with the same color. A balancing cycle of dimension exchange consists of k phases. During phase i a PE a interconnected by an edge of color i to a PE b exchanges load according to the rule la
t 1
1 ÿ kla
t klb
t: For the d-dimensional mesh and worst case instances an optimal value 1 1 21ÿH kopt 1 sin
2p=n n can be identi®ed for a simple 2d coloring [20]. For the Euclidean norm, the resulting algorithm has a load balancing time in O
n. However, we get no improvement for random instances. Theorem 19 (Generalized). dimension exchange load balancing with the optimal parameters needs load balancing time Tbal 2 X
n both in the worst case and for random instances. Proof. In every phase a PE gives a fraction in 1 ÿ H
1=n of its load to its current communication partner. Using a calculation analogous to the proof of Theorem 5 we can conclude that it takes X
n steps until a piece of load of size
1 alavg (a > e) can be reduced below
1 elavg . Similar arguments are likely to apply to second order diusion. An interesting special case is dimension exchange for the hypercube with k 1=2 and the straightforward edge coloring where every interconnection is colored with its dimension. This algorithm achieves perfect load balance within a single cycle, i.e. in log N phases [2]. By embedding the hypercube, this algorithm can also be used on a mesh. For many embeddings, e.g., the one described in [11, Fig. 6.11-(c)], we get a load balancer working in time H
n. We pay for its speed and accuracy by getting a global algorithm with the disadvantages discussed in the introduction. In particular, there is no hope to get faster execution for random instances. 6.3. A general pattern? We now compare diusion with second order diusion and dimension exchange. We resort to simulation but nevertheless try to identify asymptotic developments.
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1027
Fig. 3. Performance of second order diusion, dimension exchange, and diusion for random loads. Parameters are optimized for worst case instances. (Double logarithmic plot.)
We therefore concentrate on the linear array, which makes it easier to simulate very large networks. For dimension exchange we count the number of phases and for diusion the number of iterations. e is set to 1=4. Measurements for random instances are based on uniformly distributed random loads. The load balancing times are averages over 100 runs. Fig. 3 illustrates that the worst case optimal parameters are rather poor for 2 random instances and large n. We have plotted Tbal scaled by 1=
log n because this makes it possible to see how close the simulations come to the asymptotic bounds. It turns out that the ratio is still slowly increasing for large n. Since we have proved the O
log n2 upper bounds, this indicates that there are lower order terms which still have in¯uence for n 214 . For small n, the performance of the algorithms is similar up to a factor of 2:2. For large n, the ``optimized'' algorithms are inferior to diusion since they still need linear time for balancing random loads. It is interesting to note that second order diusion needs about 0:95n steps whereas dimension exchange is about two times faster, not counting the fact that its individual steps are simpler because at each phase every PE communicates only over a single link. For a heuristic interpretation of this negative result it is again useful to employ the analogy of the random movement of a particle. Diusion corresponds to a completely symmetric random walk, and this is the reason why it is so slow in the worst case ± it takes time H
k 2 to get k steps away from the origin on the average. Dimension exchange or second order diusion with a very asymmetric parameter is like adding a lot of ``stiness'' to the motion of the particle such that a random motion only emerges on large scales. 7
7
We use a similar strategy when we stir a cup of tea in order to ``balance'' the sugar.
1028
P. Sanders / Parallel Computing 25 (1999) 1013±1033
The following theorem explains that this is disadvantageous for random instances because it is sucient to balance O
log n neighboring PEs. Theorem 20. For random loads, on a network with N PEs it suffices with high probability to load balance within arbitrary but fixed subnetworks of size H
log N which form a partition of the network. Proof. With an argumentation analogous to the proof of Theorem 9 we can see that in general a subnetwork size in X
log N will be required. For the upper bound assume a subnetwork size of c ln N for a constant c still to be determined. Using Hoeding bounds, we can estimate the probability that a given subnetwork has a total load which does not allow load balancing. Without loss of generality, let the PEs in the subnetwork be numbered from 0 to c ln N ÿ 1. Then, " # ! 2 2 X 2 e
lc ln N ÿc
el^ ln N 6 exp ÿ l li
0 >
1 elc : N P l^2 c ln N i
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1029
Fig. 4. Performance of second order diusion, dimension exchange, and diusion for random loads. Parameters are optimized for a scale of 2 log n. (Double logarithmic plot.)
Indeed, from Theorem 20 we know that we get exactly this performance if we really chop the network into subnetworks and do not allow communication between subnetworks. The assumption seems to be reasonable that an algorithm which is identical to the one it was derived from, with the only dierence being that its communication abilities are pruned, performs no better on the average than the original algorithm. We still have a problem. In order to exploit the full power of dimension exchange or second order diusion we need to adapt their parameters to the kind of loads to be balanced. Fig. 5 shows that, for worst case instances, diusion is the overall looser and the other algorithms tuned for the worst case are far superior to their
Fig. 5. Performance of local load balancing algorithms for a dicult instance (li
0 : i < n=2). ``scale'' denotes the network size for which the parameters are tuned. (Double logarithmic plot.)
1030
P. Sanders / Parallel Computing 25 (1999) 1013±1033
Fig. 6. Performance of discrete dimension exchange for random loads and dierent l^ between 20 and 200.
1 elavg is rounded.
counterparts tuned for random loads. For this particular instance, second order diusion needs about 30% less steps then dimension exchange. But note that depending on the PE architecture a diusion step may take up to two times longer than a dimension exchange step. The above problem is even worse in practice since we have neither worst case nor random loads but something hard to characterize. We would prefer a load balancing algorithm as robust as diusion but as fast as a tuned variant of the more complex methods. 6.4. Discrete dimension exchange For the discrete load model there is a simple algorithm which is ecient for both worst case and random instances. Discrete dimension exchange uses the same communication pattern as its continous relative, but it exchanges a single load unit whenever the load at a PE diers from the load of its current communication partner. For higher dimensions, it is additionally important to color the communication edges in the right way. For details refer to [16]. The cited paper shows that the algorithm is ecient for worst case instances. For random instances, consider Fig. 6 which shows Tbal scaled by log n for dierent l^ und n. Tbal 2 O
l^log n seems to be reasonable model for random instances. This is again in accordance with Conjecture 21. 7. Conclusions Simple local load balancing algorithms can be much more ecient for random instances than in the worst case. For example, diusion requires worst case time
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1031
Tbal 2 H
n2 to achieve approximate load balance on a d-dimensional mesh with nd 2=d PEs. For random problem instances H
log n steps are sucient with high probability. These results can also be adapted to torus networks. In contrast, algorithms which employ some kind of global control often have the same asymptotic performance for worst case and random instances and are therefore inferior for situations in which random instances are a more appropriate model. The same problem can be observed with nearest neighbor algorithms, like second order diusion and dimension exchange, if their parameters are tuned for eciency in the worst case. Experiments and heuristic arguments suggest that tuning these algorithms for subnetworks of size O
log n yields very ecient algorithms for random instances. Discrete dimension exchange achieves good performance both for dicult and random instances and there is no need to adjust any parameters. We therefore see the role of diusion mainly as a simple and robust algorithm which can be used as a ``®rst try'' when little is known about the load pattern. Often, it may then turn out that load balancing is no bottleneck. Furthermore, the simplicity of diusion makes it a good choice for analyzing fundamental properties of load balancing. 7.1. Future work There are a number of ways the considered model could be generalized. For example, meshes with unequal side length could be considered. While worst case performance will be governed by the maximum side length [20], this will not aect performance for random instances as long as the minimum side length is large 1=d compared to
log N . It suces to balance within subcubes with O
log N PEs. Theorem 20 shows that random loads are a special instance of loads which only require balancing within some radius R. The methods used to show the worst case upper bound from Theorem 13 might also help to show a more general result like Tbal 2 O
R2 for diusion. ^ l on Tbal because the Furthermore, we had to neglect the in¯uence of e and l= Hoeding bounds yield only rather loose bounds with respect to these parameters. Perhaps tighter results are possible if more knowledge about some particular load distributions (e.g. uniform) is employed. From a practical point of view, the largest gap is the discrepancy between the load model (static load balancing of random instances) and the intended use (on-line load balancing for real applications). What we would really like to have would be on-line load models which are meaningful for a variety of applications and analytically tractable at the same time. However, we believe that this will at least be very dicult because the more accurate a model gets, the narrower is the domain of applications for which it yields meaningful results. For example, consider two of the applications for which nearest neighbor balancing has been used successfully. For particle simulation (e.g. [9]), it is crucial that interacting particles are at nearby PEs. While this property is easy to maintain on one-dimensional networks if nearest neighbor load balancing is used, the situation is much more complicated for
1032
P. Sanders / Parallel Computing 25 (1999) 1013±1033
higher-dimensional networks. For best ®rst branch-and-bound (e.g. [10,19]), we are not so much interested in a balanced distribution of the number of search tree nodes but we want to have some ``important'' work on every PE. We believe that there will always be room for simpli®ed abstract models because they make it possible to judge the relative merits of dierent algorithms for a wide spectrum of settings.
Acknowledgements I would like to thank Thomas Worsch for fruitful cooperation, in particular on the discrete dimension exchange algorithm. Discussions with Miltos Grammatikakis have helped to clarify the importance of the analogy between diusion and random walks.
References [1] J.E. Boillat, Load balancing and Poisson equation in a graph, Concurrency: Practice and Experience 2 (4) (1990) 289±311. [2] G. Cybenko, Dynamic load balancing for distributed memory multiprocessors, Journal of Parallel and Distributed Computing 7 (1989) 279±301. [3] W. Feller, An Introduction to Probability Theory and its Applications, 3rd ed., Wiley, New York, 1968. [4] B. Ghosh, F.T. Leighton, B.M. Maggs, S. Muthukrishnan, C.G. Plaxton, R. Rajaraman, A.W. Richa, T.E. Tarjan, D. Zuckerman, Tight analyses of two local load balancing algorithms, in: ACM Symposium on the Theory of Computing, 1995. [5] B. Ghosh, S. Muthukrishnan, M.H. Schultz, First and second order diusive methods for rapid coarse distributed load balancing, in: ACM Symposium on Parallel Architectures and Algorithms, 1996, pp. 72±81. [6] R.L. Graham, D.E. Knuth, O. Patashnik, Concrete Mathematics, Addison Wesley, Reading, MA, 1992. [7] M. Hofri, Probabilistic Analysis of Algorithms, Springer, Berlin, 1987. [8] G. Horton, A multi-level diusion method for dynamic load balancing, Parallel Computing 19 (1993) 209±218. [9] G.A. Kohring, Dynamic load balancing for parallelized particle simulations on MIMD computers, Parallel Computing 21 (1995) 683±693. [10] N. Kuck, M. Middendorf, H. Schmeck. Generic branch-and-bound on a network of transputers, in: R. Grebe et al. (Eds.), Transputer Applications and Systems, IOS Press, 1993, pp. 521±535. [11] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing. Design and Analysis of Algorithms, Benjamin/Cummings, New-York, 1994. [12] F.T. Leighton, B.M. Maggs, A.G. Ranade, S.B. Rao, Randomized routing and sorting on ®xedconnection networks, Journal of Algorithms 17 (1994) 157±205. [13] K. Nakamura, Asynchronous cellular automata and their computational ability, Systems, Computers, Controls 5 (1974) 58±66. [14] X. Qian, Q. Yang, An analytical model for load balancing on symmetric multiprocessor systems, Journal of Parallel and Distributed Computing 20 (1994) 198±211. [15] P. Sanders, On the eciency of nearest neighbor load balancing for random loads. In Parcella 96, VII. International Workshop on Parallel Proccessing by Cellular Automata and Arrays, Akademie Verlag, Berlin, 1996, pp. 120±127.
P. Sanders / Parallel Computing 25 (1999) 1013±1033
1033
[16] P. Sanders, T. Worsch, Konvergente lokale Lastverteilungsverfahren und ihre Modellierung durch Zellularautomaten, in: PARS Workshop, number 14 in PARS Mitteilungen, 1995, pp. 285±291. [17] J. Song, A partially asynchronous and iterative algorithm for distributed load balancing, Parallel Computing 20 (1994) 853±868. [18] R. Subramanian, I.D. Scherson, An analysis of diusive load-balancing, in: ACM Symposium on Parallel Architectures and Algorithms, 1994, pp. 220±225. [19] S. Tsch oke, M. R acke, R. L uling, B. Monien, Solving the traveling salesman problem with a parallel branch-and-bound algorithm on a 1024 processor network, Technical report, Universit at Paderborn, 1994. [20] C. Xu, F.C. Lau, The generalized dimension exchange method for load balancing in k-ary n-cubes and variants, Journal of Parallel and Distributed Computing 24 (1) (1995) 72±85.