Parallel Computing 19 (1993) 1199-1219
North-Holland

PARCO 808

Load balancing data parallel programs on distributed memory computers †

J. De Keyser * and D. Roose

Computer Science Department, K.U. Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium

Received 16 July 1992
Revised 29 April 1993

† This text presents research results of the Belgian Incentive Program 'Information Technology' - Computer Science of the Future, and the Belgian programme on Interuniversitary Poles of Attraction, initiated by the Belgian State Prime Minister's Service - Science Policy Office. The scientific responsibility is assumed by its authors.
* Corresponding author. Email: [email protected]

Abstract

In this paper a set of programming constructs for the implementation of data parallel algorithms on distributed memory parallel computers is proposed. The load balancing problem for data parallel programs is cast in a special form, and its relation to the general load balancing problem is analyzed. The applicability of these constructs is demonstrated for a number of grid-oriented numerical applications. A software tool provides run-time support for data parallel programs based on the proposed constructs. While the application - according to the data parallel programming paradigm - partitions the grid, the tool assigns the partitions to the processors, using built-in mapping algorithms. The approach is general enough to accommodate data parallel algorithms with varying communication structure and variable calculation requirements, using pseudo-dynamic load balancing strategies.

Keywords. Data parallelism; load balancing; distributed memory computers

1. Data parallel programs

Any algorithm can be described by a set of interacting operations. When an operation uses information that is produced by another operation, it is said to be data-dependent on it.

Definition. A program is a set of processes 𝒫 = {p_i} equipped with a partial ordering → which represents the data dependencies between them. The acyclic directed graph (𝒫, →) is known as the precedence graph.

It can be assumed that it has a unique minimum m (a begin operation). In [13] it is shown that the partial ordering (𝒫, →) allows the definition of time in terms of a discrete logical clock. Such a clock must obey the clock condition:

∀a, b ∈ 𝒫: a → b ⇒ clock(a) < clock(b).

Because of this property the graph is also called the temporal process graph. One possible clock definition is clock(p) = Δ(p, m), in which Δ(a, b) represents the length of the longest path between a and b in the graph (𝒫, →). Two operations a, b ∈ 𝒫 are simultaneous (a ≈ b) when clock(a) = clock(b). The relation ≈ is an equivalence relation, implying a partitioning 𝒞 = {c_i} of the temporal graph, such that operations within each partition class are simultaneous. The clock relation induced on the set of classes 𝒞 is defined by

c_i → c_j ⇔ ∃ p_i ∈ c_i, p_j ∈ c_j: clock(p_i) < clock(p_j).

(𝒞, →) is totally ordered.
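One admissible clock assigns to each operation the length of the longest path from the begin operation in the precedence DAG. A minimal sketch (an illustration, not taken from the paper; the names succ and begin are hypothetical):

def clock(succ, begin):
    """succ[a] lists the operations that are data-dependent on a."""
    order, seen = [], set()
    def visit(a):                          # depth-first traversal of the DAG
        if a in seen:
            return
        seen.add(a)
        for b in succ.get(a, []):
            visit(b)
        order.append(a)                    # reverse topological order
    visit(begin)
    level = {a: 0 for a in order}
    for a in reversed(order):              # process in topological order
        for b in succ.get(a, []):
            level[b] = max(level[b], level[a] + 1)
    return level

print(clock({"m": ["a", "b"], "a": ["c"], "b": ["c"]}, "m"))
# {'c': 2, 'a': 1, 'b': 1, 'm': 0}: a and b are simultaneous (a ≈ b)

By construction this satisfies the clock condition: a path a → b forces clock(a) < clock(b).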


A parallel computer is an ensemble of connected processors. In distributed memory parallel computers each processor has its own private memory.

Definition. A parallel computer is a set of processor nodes 𝒬 equipped with a communication system. The graph (𝒬, ↔) represents the machine connectivity. Weights can be associated with its edges; they may be used to carry information about the speed of the communication links.

A parallel program can be executed on a parallel computer in many ways. The goal is to execute it as fast as possible.

Definition. The general load balancing problem (GLBP) is the optimization problem: schedule the processes of the temporal process graph on the processors of the parallel machine so as to minimize the total execution time.

Scheduling implies determining when (scheduling in time) and where (assignment to a processor) each process is executed. This optimization problem is not trivial. A first difficulty is finding an objective function that models the execution time accurately. Especially the communication delay is hard to approximate [27], because it depends on the message traffic in a nonlinear way. Even with the simplest cost models the GLBP is an NP-hard optimization problem. Additionally, it is subject to restrictions (the temporal precedence constraints). Another difficulty is the large size of the problem. Attempts to solve the GLBP are currently undertaken in the context of parallelizing compilers, code optimization and parallelization tools. This is only possible if the temporal graph is known at compile time (e.g. extracted from the program by the compiler, or specified by the programmer using compiler directives).

1.1. Data parallelism

A data parallel algorithm is based on distributing the problem's data items ℰ = {e_i} among the processors. The algorithm consists of a sequence of calculations {c_i}, each of which is applied by every processor q_j to the data assigned to it (process p_ij). The values of variables that have been modified in calculation c_{k-1} have to be communicated to the processors that need them as input for calculation c_k. For a data parallel program, the discrete clock introduced previously is clock(p_ij) = i. The totally ordered partitioning (𝒞, →) of the temporal graph coincides with the sequence of data parallel calculations. The operations within each partition class are mutually simultaneous: there exist no data dependencies between them. They can therefore be executed in parallel without further communication. It is now possible to simplify the problem of scheduling (𝒫, →). The calculations are executed sequentially (scheduling in time). For each calculation c_i the operations p_ij can be scheduled without having to deal with temporal precedence constraints [8].

Definition. The restricted load balancing problem (RLBP) is the optimization problem: distribute the processes (or the corresponding data) among the processors in order to minimize the execution time for one calculation.

The GLBP for the temporal graph is replaced by a series of simpler RLBP problems. It is obvious that the total execution time obtained in this way cannot be smaller than the minimal time for the GLBP, since we restrict the space of feasible solutions. However, it is easier to find a good solution. Moreover, in adaptive problems (where the data decomposition changes throughout the program execution) the temporal graph is not known in advance, so there is no alternative to tackling one RLBP at a time.

Fig. 1. Temporal process graph for integration.

1.2. An example

The following example illustrates the relationship between the GLBP and the RLBP. Suppose that an N-point integration rule is parallelized. The temporal precedence graph for one possible parallel algorithm is shown in Fig. 1. Each processor q_i calculates the contribution of the N_i integration points assigned to it. All partial results are gathered and summed in processor q_1. The time to calculate the contribution for one abscissa is denoted by α, the time to transmit a partial result to processor q_1 is β, and the time to add two partial results is γ. The execution time T_P^GLBP on P processors can be calculated by the recursion:

T_1^GLBP = α N_1
T_i^GLBP = max{T_{i-1}^GLBP, α N_i + β} + γ,   1 < i ≤ P.

For the particular case in which P = 3, N = 12, and α = β = γ = 1, there are several work distributions that give rise to a minimal execution time T_P^GLBP = 7. Figure 2 shows a time-line representation for one of these possibilities (N_1 = 4, N_2 = 3, N_3 = 5). If the aforementioned clock definition is used, the algorithm is partitioned into a sequence of P steps. The first step consists of the concurrent calculation of the contributions of all processors. Steps 2 up to P are communication steps, sequentially adding the partial results from each processor to the current total in processor q_1. Cost models for each step are:

T_P^{RLBP,1} = max_{q_i ∈ 𝒬} {α N_i}
T_P^{RLBP,k} = β + γ,   1 < k ≤ P.
Fig. 2. Time-line for GLBP solution (legend: partial sum, add, send).


Fig. 3. Time-line for RLBP solution (legend: partial sum, add, send).

The total execution time then is

T_P^RLBP = Σ_{k=1,...,P} T_P^{RLBP,k} = max_{q_i ∈ 𝒬} {α N_i} + (P − 1)(β + γ).

Optimizing the execution time of each step separately yields ⌊N/P⌋ ≤ N_i ≤ ⌈N/P⌉, i.e. an even load distribution. For the same particular case as before, we now have T_P^RLBP = 8 for N_1 = N_2 = N_3 = 4. The time-line diagram shown in Fig. 3 illustrates how the execution is broken up into 3 steps: 1 calculation step followed by 2 communication steps. This example shows that separately solving each RLBP is easier, and corresponds to what is intuitively understood by a 'balanced' data distribution, but at the same time it demonstrates the superior quality of the GLBP solution: T_P^RLBP ≥ T_P^GLBP.
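Both cost models are easily evaluated in a few lines. A minimal sketch reproducing the numbers above (α = β = γ = 1 as in the example):

def t_glbp(n, alpha=1.0, beta=1.0, gamma=1.0):
    """GLBP recursion: q_1 adds the partial result of q_i as soon as both the
    running sum and the message from q_i have arrived."""
    t = alpha * n[0]                       # T_1 = alpha * N_1
    for n_i in n[1:]:
        t = max(t, alpha * n_i + beta) + gamma
    return t

def t_rlbp(n, alpha=1.0, beta=1.0, gamma=1.0):
    """RLBP cost: one synchronized calculation step, then P-1 sequential
    communication steps."""
    return max(alpha * n_i for n_i in n) + (len(n) - 1) * (beta + gamma)

print(t_glbp([4, 3, 5]))    # 7.0: the optimal GLBP distribution
print(t_rlbp([4, 4, 4]))    # 8.0: the best 'balanced' RLBP distribution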

1.3. Data-parallel programming abstractions

In order to allow a precise formulation of data parallel algorithms and in order to describe their load balancing requirements, several authors have tried to formalize the data parallel programming paradigm by identifying a set of fundamental concepts [2,5,26]. We generalize these ideas to accommodate algorithms with varying and irregular data dependencies. In particular, the parallel implementation of iterative methods for solving partial differential equations is considered [4,25,10]. In this case the problem data set is a grid.

Definition. The basic data item of a data parallel algorithm is called a point.

The calculation in a point e ∈ ℰ requires information associated with several other points, called the set of neighbors 𝒩(e) ⊂ ℰ. In general this data dependency relation can be non-symmetric. In grid-oriented applications, the basic data item is the structure representing a grid point or a grid cell. The data dependencies then have a geometric interpretation, hence the term geometric parallelism. For large scale applications, it is inconvenient that a software tool that manages a problem's distributed data set should handle individual points. The application usually decomposes the set of points into a smaller number of (atomic) aggregates which are involved in the parallel execution of an operation. The precise nature of this decomposition is closely related to the RLBP and will be discussed in the next section.

Definition. Let 𝒰 = {u_i} be a partitioning of the problem data set ℰ. The partition classes are called units.

The units are considered to be atomic during the parallel execution of all operations in a class, i.e. they are the smallest data entities and they do not change. It is the application programmer who decides which aggregates are most suited for each parallel operation, by imposing some conditions on the structure representing a unit. In grid-oriented applications, units correspond to grid partitions which usually have some geometric properties.


As a consequence of the data partitioning, a point may be data dependent upon points belonging to different units.

Definition. The set ℬ(u_i) := ∪_{u_j ∈ 𝒜(u_i)} ℬ_j(u_i), with ℬ_j(u_i) := u_j ∩ (∪_{e_k ∈ u_i} 𝒩(e_k)), is the exchange buffer or the external interaction region [2] of u_i.

The data dependencies between points induce a similar relation on the set of units, which can be defined formally by: u_j ∈ 𝒜(u_i) ⇔ ℬ_j(u_i) = u_j ∩ ℬ(u_i) ≠ ∅. This relation is represented by the unit interconnection graph (𝒰, 𝒜). Logically rectangular grids give rise to a regular and fixed topology of the unit interconnection graph, which makes it possible to design efficient communication schemes, e.g. by embedding the problem topology into the processor interconnection topology. For general block-structured grids and for irregular grids the problem topology is irregular. In the case of adaptive grids, this topology varies throughout the program execution; it is not known in advance but arises as a result of the computation [8]. Assume that p_i is the process responsible for performing the calculations for the points of unit u_i. In order to do this, it must have access to the information associated with all points of u_i ∪ ℬ(u_i). Therefore, the execution of p_i must be preceded by the receipt of the information for ℬ(u_i) from all processes holding the adjacent units u_j ∈ 𝒜(u_i).

Definition. The data exchange or flush [10] operation is the communication required to update the exchange buffers of all units.

Each calculation in a data parallel program corresponds to a phase, defined as follows:

Definition. A phase is the execution of all operations in a partition class of the temporal precedence graph, i.e. first performing the data exchanges and then executing the operations in parallel.

Any data parallel program can be written as a sequence of phases, an alternation of communication and computation steps:

program data_parallel_program
  for all phases do
    for u_i ∈ 𝒰 do
      for u_j: u_i ∈ 𝒜(u_j) do send ℬ_i(u_j) to u_j
    for u_i ∈ 𝒰 do
      for u_j: u_j ∈ 𝒜(u_i) do receive ℬ_j(u_i) from u_j
    for u_i ∈ 𝒰 do
      calculate on u_i
  end for
end program
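The abstract program above maps naturally onto message passing. A minimal sketch with mpi4py, assuming a symmetric unit adjacency; the names run_phase, pack, unpack and calculate are hypothetical application callbacks, not the paper's LOCO interface:

from collections import defaultdict
from mpi4py import MPI

def run_phase(comm, my_units, adj, owner, pack, unpack, calculate):
    # group all exchange buffers by destination rank: one aggregated
    # message per neighboring processor
    outgoing = defaultdict(list)
    for u in my_units:
        for v in adj[u]:
            outgoing[owner[v]].append((u, v, pack(u, v)))   # B_u(v) to v's owner
    requests = [comm.isend(msgs, dest=rank) for rank, msgs in outgoing.items()]
    # with symmetric adjacency, exactly one message arrives per peer rank
    for _ in range(len(outgoing)):
        for u, v, data in comm.recv(source=MPI.ANY_SOURCE):
            unpack(v, u, data)           # fill the exchange buffer of v with u's data
    MPI.Request.Waitall(requests)
    for u in my_units:                   # buffers are consistent: calculate in parallel
        calculate(u)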

Figure 4 shows a time-line diagram for a phase, assuming that all processors start the execution at the same time, and that all messages are sent simultaneously and without startup time.

Fig. 4. Time-line diagram for a phase (legend: calculate, wait, communicate).


The phase concept has been formulated by several authors for various purposes. In applications with a numerical background, a phase often corresponds to a loop or an iteration. In fact, the phase concept is a generalization of loop parallelism [12,23]. In [26] a work/exchange model is considered, in which a work/exchange cycle corresponds to a phase as defined above. One of the benefits of the phase concept is the possibility to treat each phase separately. Each phase constitutes a separate program module, whose correctness is not influenced by previous phases. This separation of concerns makes programming with phases attractive from the program developer's point of view. Another advantage is that such a program can never deadlock.

2. Load balancing data-parallel algorithms

Dynamic load balancing techniques assume no prior knowledge about the temporal graph and allow work redistribution at any time. In the special case of data parallel programs, data redistribution should be considered only at the beginning of a phase. Repeated use of a static load balancer at fixed points in time is called iterative static load balancing or quasi-dynamic load balancing [27]. For grid-oriented numerical applications, reliable predictions of the calculation and communication costs associated with each unit in the next phase are easy to obtain, as will be shown in the examples later on. If such predictions are not available, a decision model may be used to determine when a load balancer should be invoked [18].

The RLBP that must be solved in each phase consists of finding a good distribution of the problem data. It can be tackled in two ways: either by assigning individual data points to the processors, or by first decomposing the data set into units and subsequently mapping these units onto the parallel machine. There are several reasons to prefer the latter approach:
- In the former case the set of data points that is mapped onto the same processor is determined by properties of the cost function and the mapping algorithm only. Prior decomposition into units that obey certain criteria may be used to impose additional conditions. Such a condition might be, for example: 'the data structure representing the data items in a processor is an array'.
- In [5] a distinction is made between decomposition and mapping in order to simplify the optimization procedure. Different objective and profit functions are associated with each step.
- The mapping problem stated in terms of units is easier to solve because of the smaller space of feasible solutions. Of course the resulting mapping cannot be better than the optimal solution of the mapping problem for individual points.

Extensive overviews of criteria for decomposition and mapping are found in [5,19,27]. In the following sections the two-step approach will be used. First the data decomposition is discussed. Then the mapping problem is formulated in terms of units. Only fast suboptimal techniques for decomposition and mapping have been considered, because it is our purpose to calculate a new data distribution at run-time, as part of the parallel implementation of solution-adaptive algorithms. As soon as the mapping algorithm decides upon a new data distribution, the unit data structures are physically transferred to the processor onto which they are mapped.

2.1. Partitioning

When partitioning the problem data set, one has to take into account the conditions that are imposed by the application's definition of a unit. Often these conditions are implied by the nature of the data structure that is used to represent a unit and its exchange buffers. Additionally, it is advantageous when the partitioning implicitly meets some of the load balancing optimization criteria, e.g. limiting the size of the exchange buffers, or producing units of the same size. Many heuristic approaches to partitioning exist. A well-known strategy is based on recursive 2-way partitioning of the data set. Such recursive bisection heuristics have been studied extensively [5,22,27]. The application must provide one primitive: the split-operation or partitioner [2], which splits a unit into several new ones. Note that the new units also have to obey the imposed conditions. This might call for a modification of the basic recursive bisection heuristic. In a block-structured grid code, for example, a block must be partitioned along grid lines in order to ensure that each new block can be represented by an array structure.

A consequence of data partitioning is the corresponding decomposition of the address space: a data item is no longer referenced by one address, but by the couple (unit name, local address). The unit name is a globally unique identification. The local address locates the data item within the unit. For distributed array structures, address calculation schemes exist that derive the location of an array element from the globally known address of the array partition and the locally known offset of the element in the partition [3,23]. In the case of irregular or varying distributions, one associates with each unit name the physical location of the unit data structure: the couple (processor number, address), also known as a foreign pointer in the distributed memory address space [27]. When units move from one processor to another, their new location must become known before new messages are sent. This association is stored in a translation table; a sketch of such a table is given below. Since this table is small, it can be replicated among the processors, which makes table lookup very fast. In the case of pointwise mapping, this translation table is so large that it has to be distributed among the processors, which slows down table lookup [6].
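A minimal sketch of such a replicated translation table (an illustration under our own naming, not LOCO's actual data structure):

class TranslationTable:
    def __init__(self):
        self.location = {}                 # unit name -> (processor, address)

    def lookup(self, unit):
        return self.location[unit]        # fast: the table is replicated locally

    def remap(self, unit, processor, address):
        # called on every processor after a unit has been transferred, so
        # that subsequent messages go to the unit's new owner
        self.location[unit] = (processor, address)

table = TranslationTable()
table.remap("u7", processor=2, address=0x4F00)
print(table.lookup("u7"))                  # (2, 20224)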

2.2. Cost functions for the mapping problem

Assume that the data decomposition is given. A good load distribution is found by minimizing the estimated parallel execution time of the phase. This can be done either by explicitly modeling the program execution time with some (approximate) cost function, or by implicitly using load balancing criteria in heuristic procedures [5,29,27]. Accurate cost functions are nonlinear [27] and sometimes there are additional constraints [5]. Since there exists an analogy between the mapping problem and the graph partitioning problem [17], some of these heuristics have their origin in combinatorial optimization.

When the set of P processors is denoted by 𝒬 and the set of N units by 𝒰, the solution to the mapping problem can be represented as a function m: 𝒰 → 𝒬 that assigns unit u to processor m(u). In the graph (𝒰, 𝒜) weights can be assigned to graph nodes and edges, which may be used to hold information about the cost involved in the corresponding calculation and communication. This information is stored in a weight matrix A. It allows the construction of simple cost models. Let t_calc be the time to perform one floating point operation on the particular machine and let c_i be the number of floating point operations to be performed for unit u_i. Then the calculation cost associated with u_i is A_ii = t_calc c_i. Let t_comm be the time to send one word between neighboring processors and let c_ij be the number of words to be sent from u_i to u_j. Then the communication cost can be approximated by A_ij = t_comm c_ij.

The set of units assigned to processor q is denoted by 𝒰(q). Δ(p, q) represents the graph distance between processors p and q, i.e. the minimal number of graph edges that have to be traversed in order to go from p to q. The distance between the processors holding units u_i and u_j is denoted by Δ_ij. The diameter of the machine graph is Δ_max. The following cost functions were used to model the execution time of a phase:

C_1(m) = max_{q ∈ 𝒬} Σ_{u_i ∈ 𝒰(q)} A_ii

C_2(m) = max_{q ∈ 𝒬} [ Σ_{u_i ∈ 𝒰(q)} A_ii + max_{u_i ∈ 𝒰(q), u_j ∈ 𝒜(u_i)} A_ji Δ_ji ]

C_3(m) = max_{q ∈ 𝒬} [ Σ_{u_i ∈ 𝒰(q)} A_ii + (1 + (θ/(Δ_max N²)) Σ_{u_i ∈ 𝒰(q), u_j ∈ 𝒜(u_i)} Δ_ji) max_{u_i ∈ 𝒰(q), u_j ∈ 𝒜(u_i)} A_ji Δ_ji ]

In all these cost functions it is assumed that the message startup time is negligible. Cost function C_1 is the model of a fully connected machine with infinite communication bandwidth. C_2 takes into account the machine topology via the Δ_ji factor. This will force the maximum distance between the processors onto which connected units are mapped to be minimal, at least if this does not endanger the calculation load balance. Cost function C_3 is similar to C_2, but it models contention by introducing a communication delay factor based on the average communication link occupation. Note that in the cost functions C_2 and C_3 the calculation costs A_ii and the communication costs A_ij, i ≠ j, must be given in the same units. This requires knowledge of the machine parameters t_comm and t_calc [27]. Similarly, the contention factor θ is determined by the machine architecture.
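A minimal sketch (under our own naming conventions) of evaluating C_1 and C_2 for a mapping m, with A the weight matrix (A[i][i] the calculation cost of u_i, A[j][i] > 0 marking u_j ∈ 𝒜(u_i) with its communication cost) and delta the processor graph distance:

def c1(m, A, P):
    return max(sum(A[i][i] for i in range(len(m)) if m[i] == q)
               for q in range(P))

def c2(m, A, delta, P):
    cost = 0.0
    for q in range(P):
        units = [i for i in range(len(m)) if m[i] == q]
        calc = sum(A[i][i] for i in units)
        comm = max((A[j][i] * delta[m[j]][m[i]]
                    for i in units for j in range(len(m))
                    if j != i and A[j][i] > 0), default=0.0)
        cost = max(cost, calc + comm)
    return cost

Edges between units on the same processor contribute nothing, since their graph distance is zero.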

2.3. Exact solution techniques

The assignment of N units to P processors is a discrete optimization problem. The solution space has cardinality P^N, which increases very fast with P and N. Enumeration of all possible solutions is prohibitively expensive. However, some simplifications can be made. In general the solution is not unique because of the symmetry properties of the cost function (and the underlying hardware). For cost function C_1 the optimum is P!-fold, since a fully connected homogeneous machine is symmetric with respect to interchanging each pair of processors. A unique symmetry class is selected by imposing m(u_k) ∈ {q_1, ..., q_k}, k = 1, ..., min{P, N}, which reduces the number of combinations to be inspected to P^N/P!. For cost functions C_2 and C_3 the machine symmetry manifests itself through the properties of Δ. If the system is homogeneous, a reduction by a factor P is obtained by imposing m(u_1) = q_1.

A second simplification is based on the following observation. Suppose that 𝒱 = ∪_{q ∈ 𝒬} 𝒱(q) is a set of units which are already allocated. For each of the three cost functions C_k (k = 1, 2, 3) it is possible to define partial cost functions PC_k with the properties PC_k(m, 𝒰) = C_k(m) and 𝒱_a ⊂ 𝒱_b ⇒ PC_k(m, 𝒱_a) ≤ PC_k(m, 𝒱_b). A branch-and-bound algorithm has been implemented for these cost functions using both simplifications [7]. A subtree with partial cost PC(m, 𝒱), corresponding to a partial allocation of units 𝒱, is pruned if this partial cost exceeds the cost PC_min of the current best solution. Indeed, allocating the remaining units will only augment the cost, so that any complete allocation will be more expensive than the current best one: PC_min ≤ PC(m, 𝒱) ≤ PC(m, 𝒰) = C(m). Cost function C_3 is less symmetric than C_2, causing more subtrees to be pruned and yielding a more efficient search. In spite of all these simplifications, the number of combinations inspected in this pruned search is so large that the algorithm turns out to be impractical for realistic problems.
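A minimal branch-and-bound sketch for cost function C_1 (an illustration of the pruning idea, not the implementation of [7]); calc_cost[i] plays the role of A_ii, and the symmetry reduction allows each unit to open at most one new empty processor:

def branch_and_bound(calc_cost, P):
    best = {"m": None, "cost": float("inf")}
    load = [0.0] * P
    m = []

    def pc1():                             # partial cost PC_1 = current max load
        return max(load)

    def recurse(i):
        if pc1() >= best["cost"]:          # prune: the cost can only grow
            return
        if i == len(calc_cost):
            best["m"], best["cost"] = list(m), pc1()
            return
        tried_empty = False                # symmetry: open one empty processor only
        for q in range(P):
            if load[q] == 0.0:
                if tried_empty:
                    continue
                tried_empty = True
            load[q] += calc_cost[i]; m.append(q)
            recurse(i + 1)
            m.pop(); load[q] -= calc_cost[i]

    recurse(0)
    return best["m"], best["cost"]

print(branch_and_bound([4, 3, 5, 2], P=2))   # ([0, 0, 1, 1], 7.0)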


2.4. Stochastic techniques

Another approach to the graph partitioning problem is proposed in [17]. The mapping problem is transformed into a continuous optimization problem by representing the allocation m as an N × P matrix X, in which X_ij ∈ [0, 1] denotes the probability that unit u_i is allocated to processor (partition) q_j. The cost function

C_4(m) = (X A X^T)^{-1} = (Π_{q_k ∈ 𝒬} Σ_{u_i, u_j ∈ 𝒰(q_k)} A_ij X_ik X_jk)^{-1}

represents the inverse of the intra-partition connectivity, and has to be minimized. C_4 assumes a fully connected machine topology, and is therefore better suited for partitioning than for mapping. For this continuous cost function, the steepest descent algorithm can be implemented in a highly efficient way. More involved stochastic techniques, which use the evolution algorithm as a hill-climbing component, are genetic algorithms. These have been applied successfully to the decomposition-and-mapping problem for individual points [16,19]. Other techniques include neural nets and simulated annealing [15]. They yield very good results, even for complicated cost functions, but they are expensive.

2.5. Heuristics

A heuristic frequently applied to the graph partitioning problem is recursive bisection [22,27]. The underlying philosophy is to replace P-way partitioning by repeated 2-way partitioning (bisection). Such a procedure is defined by a recursion with depth log_2 P. At level k, 2^{k-1} bisections are required. Although recursive bisection is used most often for decomposition [20,22,27], it can also be applied to the mapping problem [5]. In this case both the problem graph and the processor graph are partitioned recursively; the mapping assigns a problem graph partition to the corresponding processor. For a hypercube machine topology, the algorithm is:

program recursive_bisection
  bisect(𝒰, log_2 P, 0)
end program

procedure bisect(𝒱, l, s)
  split 𝒱 in two subsets 𝒱^(1) and 𝒱^(2)
  if l > 1 then
    bisect(𝒱^(1), l − 1, s with bit l set to 0)
    bisect(𝒱^(2), l − 1, s with bit l set to 1)
  end if
end procedure

In each recursion step the graph partition 𝒱, which is currently assigned to the l-dimensional subcube with binary representation s, is bisected. Each subset is assigned to one of both (l − 1)-dimensional subcubes of s. Particular versions of this algorithm differ in the technique used to split a partition. The splitting can be based on minimizing a cost function as before, but now applied to the 2-processor case. This minimization can be exact, e.g. by using the branch-and-bound technique, or approximate, by applying a stochastic or heuristic algorithm. Even when each 2-way problem is solved exactly, recursive bisection does not yield the global optimum.


Applying a heuristic reduces the computational cost. The heuristic we use is based on the properties of partial cost functions; a sketch in code is given at the end of this subsection. A set 𝒰 of N units is bisected by sequentially adding a unit u_i to the subset 𝒱 of already allocated units, and assigning it to partition 𝒰^(1) or 𝒰^(2) such that the partial cost associated with the partial allocation is minimal. The assignment of a unit to a partition is never undone; no backtracking is performed.

procedure split_in_two(𝒰)
  𝒱 ← ∅; 𝒰^(1) ← ∅; 𝒰^(2) ← ∅
  for i = 1, ..., N do
    select u from 𝒰 \ 𝒱
    𝒱 ← 𝒱 ∪ {u}
    if assigning u to 𝒰^(1) is cheaper than assigning it to 𝒰^(2) then
      𝒰^(1) ← 𝒰^(1) ∪ {u}
    else
      𝒰^(2) ← 𝒰^(2) ∪ {u}
    end if
  end for
  return 𝒰^(1), 𝒰^(2)
end procedure

The partial costs associated with both possible allocations of u can be calculated in an incremental way. For the case of C_2, the partial cost function PC_2 can be written for bi-partitioning as:

PC_2 = max_{k=1,2} { C_calc^(k) + C_comm^(k) } = max_{k=1,2} { Σ_{u_i ∈ 𝒰^(k)} A_ii + max_{u_i ∈ 𝒰^(k), u_j ∈ 𝒜(u_i) ∩ (𝒱 \ 𝒰^(k))} A_ji Δ_ji }.

Given the values of C_calc^(k) and C_comm^(k) associated with the partial allocation of 𝒱_{i-1}, the costs for 𝒱_i = 𝒱_{i-1} ∪ {u_i} when u_i is allocated to partition l ∈ {1, 2} are calculated as:

if k = l then
  C_calc^(k)(𝒱_i) = C_calc^(k)(𝒱_{i-1}) + A_ii
  C_comm^(k)(𝒱_i) = max{ C_comm^(k)(𝒱_{i-1}), max_{u_j ∈ 𝒜(u_i) ∩ (𝒱_{i-1} \ 𝒰^(k))} A_ji }
else
  C_calc^(k)(𝒱_i) = C_calc^(k)(𝒱_{i-1})
  C_comm^(k)(𝒱_i) = max{ C_comm^(k)(𝒱_{i-1}), max_{u_j ∈ 𝒰^(k), u_i ∈ 𝒜(u_j)} A_ij }
end if

These update formulas allow a fast bisection; the actual algorithmic complexity of such heuristics depends on the precise nature of the graph.
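A greedy bisection following this scheme can be sketched as follows (an illustration under the assumptions that the unit adjacency adj is symmetric and that Δ = 1 across the cut, as in the 2-processor case):

def split_in_two(units, A, adj):
    part = [set(), set()]
    calc = [0.0, 0.0]
    comm = [0.0, 0.0]

    def cost_if(u, k):
        # incremental partial cost PC_2 when u would join partition k
        c_calc = calc[k] + A[u][u]
        c_comm = max([comm[k]] + [A[v][u] for v in adj[u] if v in part[1 - k]])
        other = max(comm[1 - k],
                    max((A[u][v] for v in adj[u] if v in part[1 - k]), default=0.0))
        return max(c_calc + c_comm, calc[1 - k] + other)

    for u in units:                        # greedy, in order, no backtracking
        k = 0 if cost_if(u, 0) <= cost_if(u, 1) else 1
        part[k].add(u)
        calc[k] += A[u][u]
        comm[k] = max([comm[k]] + [A[v][u] for v in adj[u] if v in part[1 - k]])
        comm[1 - k] = max([comm[1 - k]] + [A[u][v] for v in adj[u] if v in part[1 - k]])
    return part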

2.6. Iterative improvement

All mapping algorithms described above yield a solution whose practical quality is limited by the accuracy of the cost function and/or the minimization technique. It is therefore interesting to try to improve this solution. If the minimization is done exactly but the cost model is crude, improvement with respect to a more accurate cost function might be worthwhile. If the minimization is done approximately, an improvement with respect to either the same or a more accurate model can yield a solution with a lower cost.


Such an iterative improvement algorithm, which starts from an initial assignment m_0 and generates a sequence of successively better mappings, has been proposed a.o. in [5] (a special case of the Metropolis algorithm). We developed a variant of this algorithm that examines transfers of individual units and pairwise exchanges. In each iteration this algorithm performs an exhaustive search in a neighborhood of the current solution. For each processor p, exchanges with units in neighboring processors 𝒜_ε(p) = {q ∈ 𝒬: 0 < Δ(p, q) ≤ ε} are examined. We always used ε = 1. Since only a limited number of alternative allocations is considered, the final solution will in general not be the global optimum. If the cost function does have a global minimum, this iterative procedure is guaranteed to terminate. The trend is that more iterations are required when the initial solution is worse, which is the case when the size of the mapping problem increases.

procedure iterative_improvement(m_0, ε)
  m ← m_0
  repeat
    modified ← FALSE
    for u_i ∈ 𝒰(q_k), with q_k ∈ 𝒬 do
      repeat
        gain ← 0
        for q_l ∈ 𝒜_ε(q_k) do
          for u_j ∈ 𝒰(q_l) do
            gain ← max{gain, gain when m(u_i) = q_l and m(u_j) = q_k}
            gain ← max{gain, gain when m(u_i) = q_l}
          end for
        end for
        if gain > 0 then
          exchange or transfer accordingly
          modified ← TRUE
        end if
      until gain = 0
    end for
  until modified = FALSE
end procedure
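As an illustration of the gain test inside the loop above, a minimal sketch of a single-unit transfer under the simple cost model C_1 (the paper's variant evaluates gains with C_2; this simplification is our own):

def transfer_gain(load, w, p, q):
    """load: per-processor calculation load; w: calculation cost of the unit
    considered for a transfer from processor p to processor q."""
    before = max(load[p], load[q])
    after = max(load[p] - w, load[q] + w)
    return before - after               # positive gain: the transfer pays off

load = [10.0, 4.0]
print(transfer_gain(load, 3.0, 0, 1))   # 3.0: the bottleneck drops from 10 to 7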

Similar improvement algorithms are simulated annealing [15,27] and recursive clustering and mincut heuristics [20-22]. They are usually applied to the graph of points, rather than the graph of units. They yield very good results, but at a high cost.

2.7. Implementation variants

The mapping algorithms given above can be implemented in a global or local and in a centralized or distributed way. The term global refers to the case in which a new load distribution for the entire parallel machine is determined. Local mapping redistributes the load in processor subsets only. We will present results for both variants. Centralized mapping indicates that the optimization procedure is implemented as a sequential algorithm. For the test cases discussed below a centralized mapping has been used, since the time needed for the optimization is still small compared to the time consumed by the application. On larger machines the centralized approach should be replaced by a distributed minimization (i.e. a parallel algorithm).

3. Software supporting data parallelism

A software tool LOCO has been developed that assists the execution of data parallel applications. In order to make this tool applicable to a wide range of problems, the interaction with the application program is limited to a well-defined interface based on the concepts of unit and phase as introduced earlier. This tool serves as a framework in which load balancing strategies have been incorporated.

3.1. Building programs with phases

A data parallel program consists of a sequence of phases. LOCO requires that for each phase the graph of units is specified. The application associates calculation and communication cost estimates with each unit. The application provides for each phase one or more of the following:
- the calculation routine that must be applied to each unit;
- a routine to construct the data exchange message that has to be sent from one particular unit to a given neighbor unit, and a routine to unpack the information contained in this message in the receiving unit;
- a routine to pack the data structure representing a unit in a transfer message, and one to unpack it.
The latter routines cannot be provided by the tool itself, because it is not aware of the data structures used by the application.

During the execution of a data parallel program, LOCO determines the mapping and schedules calculations and exchanges. All interprocessor communication is done within LOCO: data exchange messages, load information gathering, and the transfer of units. The application code therefore does not contain any communication primitives. LOCO consists of three subtasks:
- a scheduler, that decides when data exchanges and calculations are performed while avoiding processor idle time;
- a minimizer, that maps the units to the processors at run-time, using the methods described earlier;
- a communication subprocess that handles the receipt of data exchange and transfer messages.
LOCO uses one (large) buffer for sending and one for receiving messages. This choice was made in order to limit memory consumption, although it slows down communication. These subtasks all use a global name table that contains the actual locations of the unit data structures. It is sufficient to re-establish the consistency of the name table at the end of phases in which units have been created, destroyed, or remapped. The LOCO implementation relies heavily on the use of asynchronous (buffered) communication. A detailed description of this tool can be found in [7].

3.2. Applicability of the phase concept

Any data parallel program can be written as a sequence of phases, and can therefore be implemented using the LOCO tool. Such an implementation can be efficient only when sufficiently accurate load estimates are available. Since one inevitably incurs some overhead when using the tool, it is only interesting to use it for compute-intensive applications. The presented concepts and the tool based on them proved to be useful in the implementation of a variety of applications. In the next section some typical numerical applications are reviewed. Other parallel algorithms that have been implemented using the tool include an adaptive block-structured Navier-Stokes solver [14], parallel mesh refinement and mesh partitioning [8], and multigrid methods with solution-adaptive irregular grids [9]. The latter application demonstrates how the tool deals with varying communication patterns (intergrid and intragrid communication) and varying calculation load (after adaptive mesh refinement).

4. Comparison of mapping techniques

In order to compare the proposed mapping algorithms, we present results obtained with the LOCO tool for parallel iterative solvers for partial differential equations. We consider solvers for a 1-dimensional parabolic equation, a linear 2-dimensional hyperbolic equation on an irregular grid, and for the 2-dimensional Euler equations that describe the inviscid flow of compressible fluids. While partitioning the problem data set is different for the one- and two-dimensional cases, the mapping problem remains essentially the same. As mapping techniques, we will use cost-based recursive bisection with cost functions C_1 and C_2 and the evolution algorithm. All three of them were combined with iterative improvement based on C_2. These mapping procedures will be referred to as RBI1, RBI2, and EVOLI.

In order to evaluate the quality of the solutions obtained with these mapping algorithms, some parallel program performance characteristics are introduced (a small computation with these metrics is sketched below). Let t_j^phase denote the time needed by processor q_j to complete a phase, and let t_i^calc be the calculation time required for unit u_i. Define:
- T(P, N) = max_j t_j^phase: the time spent in the phase;
- Λ(P, N) = (Σ_{u_i ∈ 𝒰} t_i^calc)/(P · max_{q ∈ 𝒬} Σ_{u_i ∈ 𝒰(q)} t_i^calc): the calculation load balance;
- ε_p(P, N) = T(1, N)/(P · T(P, N)): the parallel efficiency;
- ε_s(P, N) = T(1, N)/T(P, P · N): the scaled parallel efficiency.
While ε_p measures the effect of using more processors to solve a fixed size problem, ε_s compares problems which are proportionally larger [11]. Both are related by 0 ≤ ε_p(P, N) ≤ ε_s(P, N) ≤ 1.
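A minimal sketch of the two efficiency metrics, computed from measured phase times (the timing values are hypothetical):

def parallel_efficiency(t1_fixed, tP, P):
    return t1_fixed / (P * tP)            # eps_p = T(1, N) / (P * T(P, N))

def scaled_efficiency(t1_scaled, tP_scaled):
    return t1_scaled / tP_scaled          # eps_s = T(1, N) / T(P, P*N)

print(parallel_efficiency(100.0, 14.0, 8))   # 0.892...
print(scaled_efficiency(100.0, 110.0))       # 0.909...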

4.1. A one-dimensional problem

Consider the parabolic partial differential equation defined for x ∈ [0, 1] and t ∈ [0, ∞):

∂u(x, t)/∂t = a ∂²u(x, t)/∂x²

u(0, t) = 0,  u(1, t) = 0,  u(x, 0) = 1.

This equation was discretized with finite differences for the spatial coordinate. The arising system of ordinary differential equations is integrated in time using forward Euler time-stepping. Let there be N equidistant points in the interior of [0, 1], separated by a distance h = 1/(N + 1). If the time step is chosen as k = h²/3a < h²/2a (stability condition), then this explicit method is:

u_i^(l) = (1/3) (u_i^(l-1) + u_{i+1}^(l-1) + u_{i-1}^(l-1)).
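One phase of this scheme, restricted to a single unit, can be sketched as follows (an illustration under our own naming; left and right stand for the values delivered by the flush from the adjacent subintervals):

def timestep(u, left, right):
    """u: solution values of this unit's grid points; left/right: exchange
    buffer values from the neighboring units, or 0.0 at the physical
    boundaries u(0,t) = u(1,t) = 0."""
    ext = [left] + u + [right]
    return [(ext[i - 1] + ext[i] + ext[i + 1]) / 3.0
            for i in range(1, len(ext) - 1)]

u = [1.0, 1.0, 1.0]                        # u(x, 0) = 1 on this subinterval
print(timestep(u, left=0.0, right=1.0))    # [0.666..., 1.0, 1.0]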

The calculation of the solution at a new time level corresponds to a phase. According to the principle of geometric parallelism, the problem domain is partitioned into subintervals. Let there be N discretization points, P processors, and B·P units. In each timestep the solution value in the points on either side of each interval must be exchanged; hence the unit interconnection graph has the topology of a chain. The cost estimates are easily obtained: a data exchange message consists of 1 number (communication time ≈ 350 µs on an iPSC/2 hypercube), and the calculation requires 3 floating point operations per grid point (each ≈ 6 µs). The ratio t_comm/t_calc is ≈ 60. In order to experiment with different load balances, the number of consecutive interior points N_i assigned to each unit is not taken equal, but is chosen as:

N_i = (N/(BP)) (1 + σ · (…)),   i > 0
N_0 = N − Σ_{i>0} N_i,

in which σ ∈ (−1, +1) determines the variation in block sizes. Note that if B = 2 an almost perfect load balance is possible by assigning blocks i and BP − i to the same processor.

By solving the mapping problem exactly with the branch-and-bound algorithm, it is possible to compare the accuracy of the cost models. We assume that ∀u_i, u_j: A_ii >> A_ij, i.e. the communication costs are small compared to the calculation costs. The contention parameter was chosen θ = 1. The allocations found for a 2-dimensional hypercube with processors 00, 01, 10 and 11, when the interval is partitioned into 8 units u_i of equal size, are displayed in Table 1. Both C_2 and C_3 map neighboring units onto physically neighboring processors. Only cost function C_3 groups the units in adjacent pairs and maps these pairs in the binary reflected Gray code order known from static regular allocations [10], yielding the lowest communication cost (a short sketch of the Gray code is given below).

Table 1
Assignment of 8 units to 4 processors

m(u)   u_0   u_1   u_2   u_3   u_4   u_5   u_6   u_7
C_1    00    00    01    01    10    10    11    11
C_2    00    00    01    01    11    10    10    11
C_3    00    00    01    01    11    11    10    10

The parallel efficiencies were determined on an iPSC/2 hypercube using global mapping algorithms. For a problem of fixed size N = 30000, decomposed unevenly (σ = 0.4) in twice as many blocks as there are processors (B = 2), the parallel efficiency was measured for P = 1, 2, 4, 8, 16 and 32 processors. Strategies RBI1 and RBI2 always gave the same results; EVOLI performs worse. Figure 5 shows how the efficiency degrades as P increases. This is due to the increasing importance of communication, difficulty of mapping, and overhead.
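The binary reflected Gray code mentioned above is a standard construction; a one-line sketch:

def gray(i):
    # consecutive codes differ in exactly one bit, so consecutive pairs of
    # units land on physically neighboring hypercube nodes, as C_3 reproduces
    return i ^ (i >> 1)

print([format(gray(i), "02b") for i in range(4)])   # ['00', '01', '11', '10']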

,, (%) 40

20 ~ 0 2

4

8

18

Number of Processors P

Fig. 5. Parallel efficiencyep.

32



Fig. 6. Scaled efficiency ε_s.

In Fig. 6 the scaled efficiency is shown for a fixed block size N_i = 15000. In this case the relative importance of calculation and communication is constant, which gives a more realistic impression of the scalability of the algorithm. The deterioration of ε_s with increasing machine size is much less severe. A closer look at the obtained mappings reveals that with RBI1 and RBI2 a nearly perfect load balance is always obtained. Iterative improvement is necessary when P increases in order to reduce the communication cost, but it may be time-consuming. EVOLI tends to be slightly worse than the other mapping algorithms, because the evolution algorithm that generates the initial solution for the iterative improvement does not exploit the machine topology. The efficiency degradation observed for large P is mainly due to overhead and communication delay. The assignments are not optimal with respect to communication, but the solutions are reasonable.

Figure 7 displays the influence of the communication to calculation ratio on the scaled efficiency. The scaled efficiency was determined for both the iPSC/2 and iPSC/860 hypercubes, using RBI2 mapping. The iPSC/860 calculates much faster than the iPSC/2, while the communication speed is not accelerated by the same factor. The dominating effect of communication causes a fast deterioration of ε_s as the number of processors increases.

Fig. 7. Scaled efficiency ε_s (iPSC/2 and iPSC/860).



Fig. 8. Scaled efficiency for local mapping (curves for g = 0, 1, 2, 3, 4).

The time invested in the calculation of a new allocation can be limited by resorting to non-global optimization. In this case the load is redistributed only within processor subsets. The quality of the best achievable mapping degrades, since the set of feasible solutions is limited when compared to global optimization. On the other hand, each local optimization problem is smaller. The scaled efficiency for local RBI2 mapping (see Fig. 8) was determined on the iPSC/2 for the same problem as before. The load balancing strategy consisted of two steps. First, blocks of unequal size were distributed among the processors assuming that their calculation load is equal (communication costs were ignored), using RBI2. In a second step, local RBI2 remapping was performed within g-dimensional subcubes. One would expect a better assignment when g is large. This is not always true: since the heuristic performs worse for a large problem, it does not necessarily find a solution that is better than the one obtained for small g.

4.2. A two-dimensional irregular grid problem

Consider the solution of the linear first-order hyperbolic equation

∂u/∂x + 2 ∂u/∂y = 0

on the two-dimensional domain depicted in Fig. 9. This hyperbolic equation is discretized using first order accurate upwind differences on a polygonal finite volume mesh.

Fig. 9. Linear hyperbolic problem domain (boundary segments Γ_1, ..., Γ_4).


Fig. 10. Irregular grid and solution for linear hyperbolic equation.

Boundary conditions are specified on a curve Γ_1 ∪ Γ_2 ∪ Γ_3 ∪ Γ_4 such that the solution is unique (u = 0 on Γ_1 and Γ_2, u = 1 on Γ_3 and Γ_4). The discrete equations were solved using damped Jacobi relaxation (ω = 0.5); a sketch of the update is given at the end of this subsection. The initial approximation was taken uniformly equal to zero. The solution is shown in Fig. 10.

Partitionings of the two-dimensional mesh containing 2549 volumes with the orthogonal recursive bisection (ORB) and the inertial recursive bisection (IRB) heuristics [27] are shown in Fig. 11. Calculation cost estimates are obtained by counting the number of floating point operations per volume. Given the fact that the units contain approximately the same number of volumes, and assuming that the number of edges per volume does not vary too much, a balanced computation load is obtained when there are as many units as processors. Interprocessor communication consists of exchanging the solution along the interfaces between units; the communication volume is proportional to the number of edges along these interfaces. The communication cost estimates are obtained by counting the corresponding number of words that have to be exchanged. The bisection heuristics implicitly try to minimize this communication volume, since the clustering of adjacent volumes reduces the length of the interfaces.

Figure 12 compares the parallel efficiencies ε_p obtained on an iPSC/860 hypercube with RBI2 mapping, for ORB and IRB grid partitioning. From geometric considerations it is known that in general IRB generates units with a smaller perimeter. There is, however, the disturbing effect of the unit interfaces along the domain border, for which no communication is required, but an additional amount of work for imposing the boundary conditions. As expected, the parallel efficiency drops fast as more processors are used: the problem that remains to be solved by each processor becomes smaller, so that communication overhead is increasingly important. The parallel efficiency for this application is lower than for the one-dimensional example, because there is relatively more communication.
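The damped Jacobi update referenced above can be sketched as follows (an illustration assuming a per-row sparse storage of the discrete system A x = b, not the paper's data structure):

def damped_jacobi(x, diag, offdiag, b, omega=0.5):
    """offdiag[i] is a list of (j, a_ij) pairs for the neighbors of volume i;
    one sweep computes x_new = (1 - omega) x + omega D^{-1} (b - R x)."""
    x_new = []
    for i in range(len(x)):
        r = b[i] - sum(a_ij * x[j] for j, a_ij in offdiag[i])
        x_new.append((1.0 - omega) * x[i] + omega * r / diag[i])
    return x_new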

Fig. 11. ORB and IRB partitioning on 4 processors.


Fig. 12. ε_p for the linear hyperbolic problem.

4.3. The two-dimensional Euler equations

In the following example the Euler equations are considered. This is a system of 4 hyperbolic partial differential equations that describe the flow of a compressible fluid in the absence of viscosity. A two-dimensional irregular mesh was constructed adaptively using the refinement algorithm proposed in [24]; the streamwise entropy gradient was used as refinement criterion. The mesh was partitioned and mapped at run-time. The equations were discretized using a first-order accurate upwind finite volume scheme, based on van Leer flux vector splitting [1]. Forward Euler local time stepping was used to integrate the equations in order to obtain the stationary solution. The internal flow through a channel with a contraction was computed. Supersonic inlet boundary conditions were such that the flow enters the channel at Mach 1.8. As the density contours in Fig. 14 demonstrate, a compression shock is formed in front of the contraction and is reflected between the upper and lower channel walls. An expansion fan is generated at the end of the contraction. The quality of the solution is limited because the numerical method is only first-order accurate.

Fig. 13. ε_p for the Euler solver.


Fig. 14. Supersonic flow through a channel - mesh, IRB partitioning and density contours.

The mesh (Fig. 14) contains 1873 finite volumes, which is rather small, as a significant amount of storage is associated with each finite volume. Inertial recursive bisection was used for partitioning. The mapping technique is RBI2. Figure 13 shows the parallel efficiency obtained for this application on the iPSC/860. Calculation and communication costs are obtained by analyzing the forward timestepping algorithm: per finite volume, on the order of 700 floating point operations are needed, while only 4 words have to be communicated per edge along partition interfaces. It should be stressed that these are only estimates, as the precise amount of computation per finite volume depends on the flow regime. Moreover, the amount of work involved in imposing the boundary conditions cannot be accurately predicted.

Compared to the scalar hyperbolic equation, the ratio between the computation and communication requirements of the Euler solver is quite large. On the other hand, because more information is stored for each volume, smaller meshes are used. The effect of communication therefore is not negligible, although it has less effect than in the scalar example. A good parallel efficiency can be maintained with more processors. Still higher parallel and scaled efficiencies are obtained when the processor's local memory is expanded, i.e. when larger grids are used. The latter will certainly be needed when one computes three-dimensional flow fields, as the relative importance of communication then becomes even larger.

5. Conclusion

Data parallel algorithms exhibit a special kind of parallelism. In this paper programming constructs were introduced that allow an accurate description of such programs. This formalism enabled us to reformulate the general load balancing problem, and to cast it in a simpler form. Several optimization techniques were applied to solve it at run-time, in situations where computation and communication cost estimates can be computed. We implemented a software tool that is based on these concepts and that supports these load balancing techniques.

Results obtained with different load balancing algorithms were compared for several applications. They indicate that there exist suitable mapping heuristics based on partial cost


functions for solving the simplified load balancing problem. The problem of partitioning the data set must be addressed for each individual application. Good parallel efficiencies can be achieved when the application is compute-intensive, provided that the local memory size allows the storage of a sufficiently large problem. The efficiency does not appear to depend strongly on the precise value of the cost estimates, as long as the correct relative magnitude of calculation and communication costs is specified.

References

[1] W.K. Anderson, J.L. Thomas and B. van Leer, A comparison of finite volume flux vector splittings for the Euler equations, AIAA Paper No. 85-0122, presented at the AIAA 23rd Aerospace Sciences Meeting, Reno, Nevada, 1985.
[2] S.B. Baden, Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors, SIAM J. Sci. Stat. Comput. 12 (1) (1991) 145-157.
[3] B.M. Chapman, H. Herbeck and H.P. Zima, Automatic support for data distribution, in: Proc. 6th Distributed Memory Computing Conf. (IEEE Computer Society Press, Silver Spring, MD, 1991) 51-57.
[4] G. Chesshire and A. Jameson, FLO87 on the iPSC/2: A parallel multigrid solver for the Euler equations, in: J. Gustafson, ed., Proc. Fourth Conf. on Hypercubes, Concurrent Computers and Applications (Golden Gate Enterprises, 1990) 957-966.
[5] N.P. Chrisochoides, E.N. Houstis and C.E. Houstis, Geometry based mapping strategies for PDE computations, in: Proc. ACM Supercomputing Conf. '91 (ACM, New York, 1991) 115-127.
[6] R. Das, D. Mavriplis, J. Saltz, S. Gupta and R. Ponnusamy, The design and implementation of a parallel unstructured Euler solver using software primitives, in: Proc. 30th Aerospace Sciences Meeting and Exhibit, AIAA-92-0562 (1992).
[7] J. De Keyser and D. Roose, Load balancing data-parallel programs on distributed memory computers, Report TW 162, K.U. Leuven, Leuven, Belgium, 1991.
[8] J. De Keyser and D. Roose, A software tool for load balanced adaptive multiple grids on distributed memory computers, in: Proc. 6th Distributed Memory Computing Conf. (IEEE Computer Society Press, Silver Spring, MD, 1991) 122-128.
[9] J. De Keyser and D. Roose, Partitioning and mapping adaptive multigrid hierarchies on distributed memory computers, Report TW 166, K.U. Leuven, Leuven, Belgium, 1992.
[10] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon and D. Walker, Solving Problems on Concurrent Processors (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[11] J.L. Gustafson, G.R. Montry and R.E. Benner, Development of parallel methods for a 1024 processor machine, SIAM J. Sci. Stat. Comput. 9 (4) (1988) 609-638.
[12] C. Koelbel and P. Mehrotra, Programming data parallel algorithms on distributed memory machines using KALI, in: Proc. ACM Supercomputing Conf. '91 (ACM, New York, 1991) 414-423.
[13] L. Lamport, Time, clocks, and the ordering of events in a distributed system, Comm. ACM 21 (7) (1978) 558-565.
[14] K. Lust, J. De Keyser and D. Roose, A parallel multiblock Euler/Navier-Stokes code with adaptive refinement and run-time load balancing, Report TW 187, K.U. Leuven, Leuven, Belgium, 1993.
[15] D.R. Mallampati, P.P. Mutalik and R.L. Wainwright, A parallel multi-phase implementation of simulated annealing for the traveling salesman problem, in: Proc. 6th Distributed Memory Computing Conf. (IEEE Computer Society Press, Silver Spring, MD, 1991) 488-491.
[16] N. Mansour and G.C. Fox, An evolutionary approach to load balancing parallel computations, in: Proc. 6th Distributed Memory Computing Conf. (IEEE Computer Society Press, Silver Spring, MD, 1991) 200-203.
[17] H. Mühlenbein, M. Gorges-Schleuter and O. Krämer, Evolution algorithms in combinatorial optimization, Parallel Comput. 7 (1988) 65-85.
[18] D.M. Nicol and P.F. Reynolds, Optimal dynamic remapping of data parallel computations, IEEE Trans. Comput. 39 (2) (1990) 206-219.
[19] D. Roose, J. De Keyser and R. Van Driessche, Load balancing grid-oriented applications on distributed memory parallel computers, in: P. Dewilde and J. Vandewalle, eds, State of the Art in Computer Systems and Software Engineering (Kluwer, Dordrecht, 1992) 191-216.
[20] P. Sadayappan and F. Ercal, Nearest-neighbor mapping of finite-element graphs onto processor meshes, IEEE Trans. Comput. C-36 (12) (1987) 1408-1424.
[21] P. Sadayappan, F. Ercal and J. Ramanujam, Cluster partitioning approaches to mapping parallel programs onto a hypercube, Parallel Comput. 13 (1) (1990) 1-16.


[22] P. Sadayappan, F. Ercal and J. Ramanujam, Parallel graph partitioning on a hypercube, in: J. Gustafson, ed., Proc. 4th Distributed Memory Computing Conf. (Golden Gate Enterprises, CA, 1990) 67-70.
[23] J. Saltz, R. Das, R. Ponnusamy, D. Mavriplis, H. Berryman and J. Wu, Parti procedures for realistic loops, in: Proc. 6th Distributed Memory Computing Conf. (IEEE Computer Society Press, Silver Spring, MD, 1991) 67-74.
[24] R. Struijs, P. Van Keirsbilck and H. Deconinck, An adaptive grid polygonal finite volume method for the compressible flow equations, in: Proc. AIAA 9th CFD Conf. (1989).
[25] S. Vandewalle, J. De Keyser and R. Piessens, The numerical solution of elliptic partial differential equations on a hypercube multiprocessor, in: J.T. Devreese and P.E. Van Camp, eds, Scientific Computing on Supercomputers (Plenum Press, New York, 1989) 69-97.
[26] M.C. Wikstrom, The generalized work/exchange model, TR 91-11b, Iowa State University, Ames, IA, 1991.
[27] R.D. Williams, Performance of dynamic load balancing algorithms for unstructured mesh calculations, Concurrency: Practice and Experience 3 (1991) 457-481.