Operations Research Letters 8 (1989) 351-356 North-Holland
December 1989
AN ALGORITHM FOR THE MULTIPROCESSOR ASSIGNMENT PROBLEM
V.F. MAGIROU and J.Z. MILIS
Athens School of Economics and Business Science, Patission 76, 10434 Athens, Greece
Received September 1988
Revised May 1989
An exhaustive search algorithm is presented for the assignment of tasks to processors in a distributed processing system so that the sum of execution and communication costs is minimized. The algorithm relies on an efficient lower bound generated by reducing the original task graph to a tree, for which the optimization problem is polynomially solvable. It is also pointed out that the problem is NP-complete even in the case of 3 processors.
multiprocessor assignment * computational complexity * branch-and-bound
1. Introduction
Significant research has been done on the problem of allocating the tasks of a software system to the processors of a distributed computing environment. The problem is usually presented in terms of an allocation model that incorporates constraints as well as several performance criteria. The constraints might reflect memory and processing limitations, precedence relations, etc. Performance criteria that have been suggested are: total interprocessor communication cost, total execution and interprocessor communication cost, job completion time, and processor load balancing.
For some distributed processing applications the minimization of the total execution and interprocessor communication cost is the most important allocation criterion. For such systems, Stone [10] has given a polynomial min-cut algorithm for a two-processor system. An extension of this method to an arbitrary number of processors requires the solution of a minimum N-cut problem, which (as shown in [3]) is NP-complete for $N \geq 3$. We will formally show this relation between the N-cut and the processor allocation problem in Section 2.
Polynomial time optimal assignment algorithms exist in the case where the graph showing data transfers among the tasks is of a special form. Bokhari [1,2] presents an $O(MN^2)$ algorithm for trees and an $O(MN^3)$ algorithm for series-parallel graphs of M tasks and N processors. For general problems, Price [8] and Ma et al. [5] present exact branch-and-bound methods, while heuristic methods are shown in [7].
In this paper we give a branch-and-bound algorithm which uses an effective lower bound for the problem. This bound is obtained by disregarding selected communications among tasks, leading to a tree type data transfer structure. We also report on our computational experience with the algorithm.
2. Problem statement and complexity
Consider a modular program consisting of tasks indexed by $i = 1, 2, \ldots, M$. The data transfers between the tasks can be represented by a task graph $G = (T, E)$ with nodes $T$ corresponding to the tasks and edges $E = \{(i, j) \mid \text{data is transferred from } i \text{ to } j\}$. For scheduling applications the precedence $(i, j)$ vs. $(j, i)$ is important. However, in our context precedence makes no difference and we can consider $G$ to be undirected. A number $C_{i,j}$ on $(i, j)$ in $E$ shows the amount of data to be transferred. For our later presentation it will be useful to allow in the task graph degenerate edges with $C$ equal to zero. Thus $G$ is essentially complete with $C$ representing the actual data transfer.
0167-6377/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)
Volume 8, Number 6 OPERATIONS RESEARCH LETTERS
Fig. 1. Communication times and the effect of replication. Optimal cost (no replication): 5.5 (A, B, C -> 2). Optimal cost (replication): 5.0 (A -> 2, 3; B -> 2; C -> 3). $m_{i,p,j,q} = 20$ if $p \neq q$, 0 if $p = q$.
Each task can be executed on any available processor from the set $P$ indexed by $p = 1, 2, \ldots, N$. Due to differences in processors the required time of task $i$ on processor $p$, $E_{i,p}$, depends on $p$. If task $i$ runs on processor $p$, task $j$ on $q$, and $i$ sends (receives) data to (from) $j$, there is a communication time involved, which depends on the time needed to set up a communication channel between $p$ and $q$ and the time of channel use. A reasonable formula for this time is
$$m_{i,p,j,q} = C_{i,j}\, d_{p,q} + D_{p,q}\, f_{i,j}$$
with $D_{p,q}$ the setup time and $d_{p,q}$ the time required to transfer a unit of data on the channel. The term $f_{i,j}$ is 1 if $C_{i,j} > 0$ and 0 otherwise and is required to give a setup cost only if there is actual data transfer.
The minimal total cost multiprocessor assignment problem can be formulated [7] as a 0-1 programming problem: using the 0-1 variables $X_{i,p}$, which are 1 if task $i$ runs on $p$ and 0 otherwise, we want to minimize
$$\sum_{i,p} X_{i,p} E_{i,p} + \sum_{i,j,p,q} m_{i,p,j,q} X_{i,p} X_{j,q} \tag{1}$$
such that
$$\sum_{p=1}^{N} X_{i,p} = 1, \qquad i = 1, 2, \ldots, M. \tag{2}$$
A more general formulation might allow a task to run on more than one processor. An example where this duplication of effort might lead to a lower cost is given in Figure 1 and Table 1.

Table 1
Processing times

Tasks    Processors
         1      2      3
A        0.5    2      2
B        4      0.5    4
C        2      3      0.5

The task assignment problem (1), (2) without replication has been reported [1] to be NP-complete for $N \geq 4$. However, Stone [10] established a close correspondence between the k-processor and the min k-cut problem. Furthermore, we have the following theorem:
Theorem (Dahlhaus et al. [3]). The min 3-cut problem is NP-complete.
We can easily show that the 3-processor problem is NP-complete provided we prove the following:
Lemma. The minimum 3-cut problem transforms polynomially to the 3-processor problem.
Proof. Consider a weighted graph $G = (V, E)$ with $S_1, S_2, S_3$ distinguished vertices and weights $W_{i,j}$. We consider the following related 3-processor problem: there are $|V| - 3$ tasks corresponding to the non-$S_p$ vertices. The processing times $E_{i,p}$ are defined as $E_{i,p} = \sum_{q \neq p} W_{S_q,i}$, and the communication costs as $m_{i,p,j,q} = W_{i,j}$ for $p \neq q$. Any processor assignment corresponds to a 3-cut of the same cost. Conversely, it is easy to show that a minimal cut corresponds to some processor assignment of equal cost. An example of this correspondence is shown in Figure 2 and Table 2. Thus the two problems are polynomially equivalent, since they have polynomially related input lengths.

Fig. 2. The cut shown has weight 9, which is exactly the cost of the assignment A -> 1, B -> 2, C -> 1.

Table 2
Execution costs

Tasks    Processors
         1      2      3
A        1      1      2
B        5      3      2
C        2      5      3
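The construction in the Lemma can be checked numerically on a small instance. The following is a minimal sketch, not the authors' code: the terminal edge weights `term_w` are back-derived from Table 2 (an assumption, since Figure 2 is not recoverable here), the two edges in `inner_w` are hypothetical, and all function names are illustrative.

```python
from itertools import product

# ASSUMPTION: weights of edges (S_q, i) back-derived from Table 2.
# term_w[i] = (weight to S1, weight to S2, weight to S3)
term_w = {"A": (1, 1, 0), "B": (0, 2, 3), "C": (3, 0, 2)}
# HYPOTHETICAL edges among non-terminal vertices (Figure 2 is not available).
inner_w = {("A", "B"): 1, ("B", "C"): 2}
tasks = sorted(term_w)
procs = (0, 1, 2)  # processor index p stands for terminal S_{p+1}


def exec_times(term_w):
    """E[i][p] = sum of W(S_q, i) over the terminals q != p."""
    return {i: tuple(sum(w[q] for q in procs if q != p) for p in procs)
            for i, w in term_w.items()}


def assignment_cost(E, inner_w, assign):
    """Objective (1) with m = W_ij when the two tasks sit on different processors."""
    cost = sum(E[i][assign[i]] for i in assign)
    cost += sum(w for (i, j), w in inner_w.items() if assign[i] != assign[j])
    return cost


def cut_weight(term_w, inner_w, assign):
    """Weight of the 3-cut that places vertex i on the side of terminal assign[i]."""
    cut = sum(w[q] for i, w in term_w.items() for q in procs if q != assign[i])
    cut += sum(w for (i, j), w in inner_w.items() if assign[i] != assign[j])
    return cut


E = exec_times(term_w)
all_assignments = [dict(zip(tasks, a)) for a in product(procs, repeat=len(tasks))]
costs_match = all(assignment_cost(E, inner_w, a) == cut_weight(term_w, inner_w, a)
                  for a in all_assignments)
example_cost = assignment_cost(E, inner_w, {"A": 0, "B": 1, "C": 0})
print(E, costs_match, example_cost)
```

With these assumed weights, `exec_times` reproduces the entries of Table 2, and every one of the 27 assignments has the same cost as its 3-cut.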
3. A branch-and-bound algorithm
We define a partial assignment on a task set $Q \subset T$ as an assignment of the $X_{i,p}$ binary variables for $i \in Q$ and all $p = 1, 2, \ldots, N$, such that the appropriate subset of constraints (2) is satisfied. An extension of a partial assignment is an assignment of values to all $X_{j,p}$ which is consistent with the partial one. The optimal extension of a partial assignment corresponds to a problem similar to the original. In particular, the optimal extension problem is to minimize
$$\sum_{i \notin Q} \sum_{p} \Bigl( E_{i,p} + \sum_{j \in Q} \sum_{q} m_{i,p,j,q} X^{*}_{j,q} \Bigr) X_{i,p} + \sum_{i,j \notin Q} \sum_{p,q} m_{i,p,j,q} X_{i,p} X_{j,q} + \sum_{i \in Q} \sum_{p} E_{i,p} X^{*}_{i,p} + \sum_{i,j \in Q} \sum_{p,q} m_{i,p,j,q} X^{*}_{i,p} X^{*}_{j,q}, \tag{3}$$
where $X^{*}_{i,p}$, for $i \in Q$, are the variables that have been fixed by the partial assignment of the $Q$ tasks. Hence (3) becomes
$$\sum_{i \notin Q} \sum_{p} E'_{i,p} X_{i,p} + \sum_{i,j \notin Q} \sum_{p,q} m_{i,p,j,q} X_{i,p} X_{j,q} + \text{constant},$$
which is again a processor assignment problem on $M - |Q|$ tasks, with modified execution times
$$E'_{i,p} = E_{i,p} + \sum_{j \in Q} \sum_{q} m_{i,p,j,q} X^{*}_{j,q} \tag{4}$$
and $m_{i,p,j,q}$ identical to those of the original problem. The new task graph can be obtained from the original by erasing the vertices corresponding to tasks in $Q$, as well as the incident edges.
A lower bound for the cost of any assignment problem $P$ can be derived as follows. Consider the corresponding task graph of the problem. If this graph is a tree or a forest, the optimal assignment can be computed by a dynamic programming algorithm in $O(MN^2)$ time [1]. However, if the graph is not a tree we can consider erasing a minimal edge set that reduces it to one. Erasing an edge $(i, j)$ means setting $C_{i,j}$ equal to 0 in a modified processor assignment problem. In particular, given any spanning tree (or forest) $H$, we consider a modified problem $P(H)$ which differs from the original $P$ in the data transfer and has
$$C'_{i,j} = \begin{cases} C_{i,j} & \text{if } (i, j) \text{ in } H, \\ 0 & \text{if } (i, j) \text{ not in } H. \end{cases}$$
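The relaxation $P(H)$ can be illustrated with a small program. This is a minimal sketch under stated assumptions rather than the implementation used in the paper: the forest $H$ is chosen by Kruskal's method with the heaviest edges first, the tree problem is solved by a plain bottom-up dynamic program, the communication function `m(c, p, q)` is a hypothetical stand-in for $m_{i,p,j,q}$ with no setup cost, and the instance data are invented for illustration.

```python
def max_spanning_forest(n, C):
    """Kruskal on edge weights C[(i, j)], heaviest first; returns the kept edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for (i, j), c in sorted(C.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            kept.append((i, j))
    return kept


def tree_optimum(E, C, H, m):
    """Optimal cost and assignment when only the forest edges H carry data."""
    n, N = len(E), len(E[0])
    adj = {i: [] for i in range(n)}
    for i, j in H:
        adj[i].append((j, C[(i, j)]))
        adj[j].append((i, C[(i, j)]))
    best = [None] * n   # best[i][p]: cheapest cost of i's subtree with i on p
    choice = {}         # choice[(child, p)]: child's processor when its parent is on p
    assign, total, seen = [0] * n, 0.0, set()

    def solve(i, par):
        seen.add(i)
        best[i] = list(E[i])
        for j, c in adj[i]:
            if j == par:
                continue
            solve(j, i)
            for p in range(N):
                q = min(range(N), key=lambda q: best[j][q] + m(c, p, q))
                choice[(j, p)] = q
                best[i][p] += best[j][q] + m(c, p, q)

    def recover(i, par, p):
        assign[i] = p
        for j, _ in adj[i]:
            if j != par:
                recover(j, i, choice[(j, p)])

    for root in range(n):
        if root not in seen:
            solve(root, None)
            p = min(range(N), key=lambda p: best[root][p])
            total += best[root][p]
            recover(root, None, p)
    return total, assign


def full_cost(E, C, m, assign):
    """Objective (1) on the original graph, every edge counted once."""
    return (sum(E[i][p] for i, p in enumerate(assign))
            + sum(m(c, assign[i], assign[j]) for (i, j), c in C.items()))


# HYPOTHETICAL instance: 4 tasks, 2 processors, unit transfer time, no setup cost.
E = [[1, 4], [7, 1], [2, 1], [1, 3]]
C = {(0, 1): 2, (0, 2): 3, (1, 2): 1, (2, 3): 2}
m = lambda c, p, q: c if p != q else 0
H = max_spanning_forest(len(E), C)
Z, assign = tree_optimum(E, C, H, m)   # optimum of the relaxed problem P(H)
U = full_cost(E, C, m, assign)         # same assignment priced on the original P
print(H, Z, U, assign)
```

The value `Z` is the optimum of the tree problem, and `U` prices the same assignment with all edges of the original graph restored.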
Obviously the coefficients $m'_{i,p,j,q}$ of the quadratic terms in the modified problem are not greater than the corresponding ones in the original problem, i.e. $m' \leq m$. Since the execution times $E_{i,p}$ are the same in both $P$ and $P(H)$, it is clear that the optimal assignment on $P(H)$ provides an easily computable lower bound on the cost of the problem. The tree leading to the best lower bound is not easily determined, but a reasonable choice is to use the maximum spanning tree of the graph with edge weights $C_{i,j}$. It should be noted that the optimal assignment on $P(H)$ is feasible on $P$, and hence its cost computed on the original cost function provides an upper bound for $P$. This upper bound is expected to be sharp for sparse task graphs. The above lower and upper bound generating procedure can be incorporated in the framework of a basic branch-and-bound algorithm [6]:
Branch-and-bound algorithm

begin
    active_set := {0}; (comment: this is the original problem)
    generate U0, Z0;
    set U := U0; current_best := assignment of U0;
    while active_set is not empty do
    begin
        remove from active_set the node k with the smallest Zk;
        if Zk < U then
        begin
            generate the N children of k and the corresponding lower and upper bounds Zp and Up;
            for all p do
                if Up < U then set U to Up and update current_best;
            for all p do
                if Zp < U then add child p to active_set;
        end
    end
end.

The active set of nodes in the algorithm consists of partial assignments of the first $Q$ tasks, $Q = 1, 2, \ldots, M$. The numbering of the tasks is arbitrary. A child $p$ of a partial assignment $k$ of $Q$ tasks is an assignment of $Q + 1$ tasks with $X_{Q+1,p} = 1$. The index $p$ corresponds to the processors. We remove nodes from the active set according to the smallest lower bound and generate all of the node's children. For every child we determine a lower bound $Z_p$ and an upper bound $U_p$ corresponding to the assignment leading to $Z_p$. Children are added to the active set only if $Z_p < U$, $U$ being the current best upper bound, which is updated whenever $U_p < U$.
The following example clarifies the operation of the algorithm for $M = 5$ and $N = 2$. The communication costs between tasks that run on different processors are shown in the graph of Figure 3, with execution costs in Table 3.

Fig. 3. Task graph.

Table 3
Execution times

Proc.    Tasks
         A    B    C    D    E
1        1    7    2    1    1
2        4    1    1    3    4

Corresponding to the 0-th partial assignment, a lower bound is found by looking at the maximum spanning tree, {AC, CE, AB, DE}. The optimal assignment for that problem is B on processor 2, while the other tasks run on 1 (see Figure 4(a)). The lower bound is $Z_0 = 8$ while the actual cost of the assignment is $U_0 = 11$. The algorithm generates extensions {A to 1} and {A to 2}. The maximal spanning tree with A removed becomes {BC, CE, ED}, and the optimal assignment when {A to 1} is formed by looking at the problem in Figure 4(b). The optimal assignment for {A to 2} is shown in Figure 4(c). Note that the execution times in the second column of Figure 4(b) and in the first of Figure 4(c) have been altered according to (4). The complete tree is given in Figure 5.
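The control flow of the algorithm above can be sketched as a best-first search over a priority queue. To keep the block short and self-contained, this sketch swaps in a weaker bound than the one described in the text: instead of a maximum spanning tree, it erases all edges among the unassigned tasks (an empty forest $H$), so each unassigned task simply takes the processor that minimizes its modified execution time (4). The execution times are those of Table 3; the edge weights, the communication function `m`, and all names are hypothetical, since the weights of Figure 3 are not recoverable here.

```python
import heapq
from itertools import count, product


def total_cost(E, C, m, assign):
    """Objective (1): execution plus communication over every edge of the task graph."""
    return (sum(E[i][p] for i, p in enumerate(assign))
            + sum(m(c, assign[i], assign[j]) for (i, j), c in C.items()))


def bounds(E, C, m, fixed):
    """Lower bound, upper bound and assignment for the partial assignment `fixed`.

    ASSUMPTION: edges among unassigned tasks are all erased (empty forest H),
    a weaker relaxation than the maximum spanning tree used in the text."""
    n, N, k = len(E), len(E[0]), len(fixed)
    # Constant part: cost already committed by the fixed tasks 0..k-1.
    z = sum(E[i][fixed[i]] for i in range(k))
    z += sum(m(c, fixed[i], fixed[j]) for (i, j), c in C.items() if i < k and j < k)
    assign = list(fixed)
    for i in range(k, n):
        # Modified execution times (4): add communication with the fixed tasks.
        mod = [E[i][p] + sum(m(c, p, fixed[b if a == i else a])
                             for (a, b), c in C.items()
                             if (a == i and b < k) or (b == i and a < k))
               for p in range(N)]
        p = min(range(N), key=mod.__getitem__)
        z += mod[p]
        assign.append(p)
    return z, total_cost(E, C, m, assign), assign


def branch_and_bound(E, C, m):
    n, N = len(E), len(E[0])
    z0, U, best = bounds(E, C, m, [])
    tick = count()                      # tie-breaker so the heap never compares lists
    active = [(z0, next(tick), [])]
    while active:
        z, _, fixed = heapq.heappop(active)   # node with the smallest lower bound
        if z >= U:
            continue
        children = [(fixed + [p],) + bounds(E, C, m, fixed + [p]) for p in range(N)]
        for _, _, u_p, a_p in children:
            if u_p < U:
                U, best = u_p, a_p
        for child, z_p, _, _ in children:
            if z_p < U and len(child) < n:
                heapq.heappush(active, (z_p, next(tick), child))
    return U, best


# Execution times from Table 3 (tasks A..E as 0..4); edge weights are HYPOTHETICAL.
E = [[1, 4], [7, 1], [2, 1], [1, 3], [1, 4]]
C = {(0, 1): 2, (0, 2): 4, (1, 2): 1, (2, 4): 3, (3, 4): 2}
m = lambda c, p, q: c if p != q else 0
cost, best = branch_and_bound(E, C, m)
brute = min(total_cost(E, C, m, a) for a in product(range(2), repeat=len(E)))
print(cost, best, brute)
```

A binary heap stands in for the active set, which makes the removal of the node with the smallest lower bound cheap; the brute-force value is computed only as a cross-check on this small instance.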
4. Algorithm implementation and computational results
The major burden of the algorithm is the computation of the lower bounds $Z_p$. Upon the removal of a node from the active set its children are generated corresponding to the assignment of the next task $t_{next}$ to one of the $N$ processors. To determine the lower bounds $Z_p$ one first evaluates the maximum spanning tree on the graph of the unassigned tasks. Then the dynamic programming algorithm runs for each child. The tree is the same for all children since the $m_{i,p,j,q}$ have not changed, but the $E'_{i,p}$ are different for each child, requiring different runs of the optimization algorithm.
The computational burden can be reduced at the expense of the quality of the bound: we generate the maximum spanning tree of the graph that includes $t_{next}$ as well as the unassigned tasks, and compute the optimal assignment. It can be shown that from this single optimization we can derive, with little extra computation, lower bounds for the children corresponding to the specific processor assignments of $t_{next}$. These bounds are not as sharp as the one derived earlier.
The computational requirement for the two bounds can be estimated as follows: for the original bound, the generation of the $Z_p$ for all children with $R$ unassigned tasks requires $R \log R$ operations for the spanning tree [9] and $N^3 R$ operations for $N$ runs of the optimal assignment. For the modified lower bound the operations are $R \log R + N^2 R$.

Fig. 4. Application of dynamic programming algorithm (panels (a), (b), (c)).

Fig. 5. Branch-and-bound tree.

To test the performance of the algorithm with respect to changes in the relevant parameters we ran it on several randomly generated problems.
The generation of communication data was done in accordance with the first method outlined in [4], so that we could select the density $d = 2|E| / (M(M - 1))$ of the resulting task graph. We ran examples for $N = 3, 6$; $M = 5, 10, 15, 20$; and $d = 1/3, 1/2, 2/3$. For every $(M, N, d)$ value we generated 5 random topology problems and solved each using both the original lower bound and the modified one. Table 4 provides results for the original bound and Table 5 for the modified one. We report in these tables (a) the average number of steps required (a step is a removal of a node from the active_set), and (b) the average time, normalized by that of the smallest problem.

Table 4
Computational results of the original algorithm (a)

                 d = 1/3            d = 1/2            d = 2/3
N = 3
  M = 5          1      1.0         1      1.2         2      2.2
  M = 10         3      8.8         6      17.7        11     35.7
  M = 15         11     52.8        24     137.7       79     484.1
  M = 20         175    1313.6      289    2509.1      338    3372.6
N = 6
  M = 5          1      6.7         2      10.1        3      15.7
  M = 10         4      69.0        15     280.6       25     515.0
  M = 15         45     1279.5      59     2019.5      190    7178.3

(a) The first entry specifies the number of steps; the second one the time.

The results show an exponential average time increase with respect to $M$ and $N$, and a sizeable increase with $d$. Note that there is a large variance associated with the averages, which makes it difficult to compare individual problems of different size. As expected, fewer steps are required for the runs with the original, sharper lower bound, but the overall time requirements of the modified lower bound were smaller. One can try several variations of the above algorithm by generating more or less efficient lower bounds by appropriate edge removals. The computational results in Tables 4 and 5 show that the overall performance of the algorithm cannot be readily assessed a priori.
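Test instances with a prescribed density can be produced along the following lines. This is only a sketch under stated assumptions: it does not reproduce the generation method of [4], it simply samples the required number of edges uniformly at random, and the weight range and function names are hypothetical.

```python
import random
from itertools import combinations


def random_task_graph(M, d, max_weight=10, seed=None):
    """Random task graph on M tasks with about d * M * (M - 1) / 2 weighted edges.

    ASSUMPTION: edges are sampled uniformly and weights are uniform integers in
    1..max_weight; neither choice is taken from the paper or from [4]."""
    rng = random.Random(seed)
    pairs = list(combinations(range(M), 2))
    n_edges = round(d * len(pairs))
    return {e: rng.randint(1, max_weight) for e in rng.sample(pairs, n_edges)}


def density(M, C):
    """d = 2|E| / (M (M - 1)), as defined in the text."""
    return 2 * len(C) / (M * (M - 1))


C = random_task_graph(10, 1 / 3, seed=0)
print(len(C), density(10, C))
```

The returned dictionary maps each edge $(i, j)$ with $i < j$ to its amount of data $C_{i,j}$, so the achieved density can be checked directly against the requested one.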
Acknowledgement
We would like to thank Prof. C.H. Papadimitriou for several suggestions and for pointing out reference [3]. We would also like to thank a referee for a most thorough and constructive review. Of course the authors are solely responsible for any errors.
Table 5
Computational results of the modified algorithm (a)

                 d = 1/3            d = 1/2            d = 2/3
N = 3
  M = 5          1      0.6         2      0.9         3      1.6
  M = 10         5      5.5         10     13.1        18     24.4
  M = 15         14     28.8        42     99.4        139    357.0
  M = 20         305    974.5       516    2010.4      566    2431.6
N = 6
  M = 5          1      2.0         3      4.8         3      5.6
  M = 10         7      25.0        31     116.1       56     232.7
  M = 15         89     525.5       121    853.1       432    3461.4

(a) The first entry specifies the number of steps; the second one the time.

References
[1] S.H. Bokhari, "A shortest tree algorithm for optimal assignment across space and time in a distributed processor system", IEEE Trans. Soft. Eng. 7, 583-589 (1981).
[2] S.H. Bokhari, Assignment Problems in Parallel and Distributed Computing, Kluwer Academic Publishers, MA, 1987.
[3] E. Dahlhaus, D.S. Johnson, C.H. Papadimitriou, P. Seymour and M. Yannakakis, "The complexity of multiway cut", Unpublished manuscript, 1987.
[4] W.V. Gehrlein, "On methods for generating random partial orders", Oper. Res. Lett. 5, 285-291 (1986).
[5] P.R. Ma, E.Y.S. Lee and M. Tsuchiya, "A task allocation model for distributed computing systems", IEEE Trans. Comput. 31, 41-47 (1982).
[6] C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, NJ, 1982.
[7] C.C. Price, "The assignment of computational tasks among processors in a distributed system", Proc. Nat. Comput. Conf., 291-296 (1981).
[8] C.C. Price and U.W. Pooch, "Search techniques for a nonlinear multiprocessor scheduling problem", Naval Res. Logist. Quart. 29, 213-233 (1982).
[9] R. Sedgewick, Algorithms, Addison-Wesley, MA, 1983.
[10] H.S. Stone, "Multiprocessor scheduling with the aid of network flow algorithms", IEEE Trans. Soft. Eng. 3, 85-93 (1977).