Journal of Parallel and Distributed Computing 46, 15–27 (1997)
Article No. PC971376
A Task Duplication Based Scalable Scheduling Algorithm for Distributed Memory Systems

Sekhar Darbha*,1 and Dharma P. Agrawal†,2

*Department of Electrical and Computer Engineering, Rutgers University, Piscataway, New Jersey 08855-0909; and †Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, North Carolina 27695-7911
One of the major limitations of distributed memory systems (DMSs) is the high cost of interprocessor communication, which can be minimized by an efficient task partitioning and scheduling algorithm. It is well known that scheduling the tasks of a directed acyclic graph (DAG) to obtain an optimal solution is a strong NP-hard problem. This paper presents a scalable task duplication based scheduling (STDS) algorithm which can schedule the tasks of a DAG onto the processors of a DMS with a worst case complexity of O(|V|²), where |V| is the number of nodes of the DAG. The algorithm generates an optimal schedule for a DAG provided a cost relationship is satisfied and the required number of processors is available. The STDS algorithm generates a schedule for whatever number of processors is available in the system. The performance of the STDS algorithm has been observed by comparing the parallel execution times for practical DAGs with the theoretical lowerbound. © 1997 Academic Press

1 E-mail: [email protected]. 2 E-mail: [email protected].
0743-7315/97 $25.00. Copyright © 1997 by Academic Press. All rights of reproduction in any form reserved.

1. INTRODUCTION

Advances in VLSI technology, interprocessor communication networks, and routing algorithms have increased the popularity of distributed memory systems (DMSs). In addition, the emergence of several high performance workstations has led to the development of DMSs using networks of workstations (NOWs). The scalability of DMSs gives them a major advantage over other types of systems. Some of the applications which use DMSs are fluid flow, weather modeling, database systems, real-time applications, and image processing. The data for these applications can be distributed to the processors of the DMS, and with fast access to local data on a processor, high speed-ups can be obtained. To obtain maximum benefits from DMSs, an efficient task partitioning and scheduling strategy is essential. A task partitioning algorithm partitions an application into tasks and represents them in the form of a directed acyclic graph (DAG). Once the application is transformed to a DAG, the tasks are scheduled onto the processors. This paper deals with the compile time (static) scheduling problem with the assumption that a partitioning algorithm is available which takes the application program as input and transforms it to a DAG.

Obtaining an optimal schedule for assigning tasks to DMSs has been proven to be an NP-complete problem [13]. There are very few restrictive cases [18] where the optimal schedule can be generated in polynomial time. Suboptimal polynomial time solutions to the static scheduling problem can be obtained by heuristic methods which are based on certain assumptions. The earliest scheduling algorithms were priority-based algorithms [1, 14, 19, 23, 24]. These algorithms assign a priority to each task of the DAG and, whenever a free processor is available, the task with the highest priority among all the ready tasks (i.e., tasks for which all the predecessor tasks have completed their execution) is assigned to the free processor. A simple priority-based algorithm is to assign a value of level (or colevel) to each node of the DAG [1] and assign a higher priority to the task which has a higher level (or lower colevel). The level of a node is the length of the longest path from the node to an exit node, and the colevel of a node is the length of the longest path from the node to an entry node. An entry node is a node which is not dependent on data generated by other tasks, i.e., the number of incoming edges at an entry node is zero; an exit node is a node which does not communicate its data to other nodes, i.e., the number of outgoing edges at an exit node is zero. When the level and colevel are computed, the communication costs are ignored and only the computation costs are taken into account. The priority-based algorithms are simple to implement. The problem with these algorithms is that most of them do not take interprocessor communication (IPC) time into account. Even those algorithms which do take IPC costs into account suffer from the fact that they try to balance the workload rather than minimize the overall schedule length. There are many scheduling algorithms based on clustering schemes [10, 11, 16, 20–22] which cluster the tasks which communicate heavily onto the same processor. Even these schemes do not guarantee optimal execution time. Also, if the number of processors available is less than the number of clusters, mapping the clusters onto the available processors becomes a problem.
Another set of algorithms is based on the concept of duplicating certain critical tasks [2, 5, 6, 18] in an effort to reduce the communication costs. The SDBS algorithm [7] provides an optimal solution with a complexity of O(|V|²), where |V| is the number of tasks of the DAG, if certain conditions are satisfied and if the required number of processors is available. The limitation of the SDBS algorithm is that, because of unnecessary duplications, it is applicable only to small problems and/or larger systems. Also, the algorithm is not scalable to the actual number of processors available in the system.

The basic strategy involved in most of the scheduling schemes is to group the tasks into a certain number of clusters and to assume that the available number of processors is greater than or equal to the number of clusters generated by the algorithm. A major limitation of most of these algorithms is that they do not provide an allocation which can be scaled down gradually to the available number of processors if that number is less than the number of processors required by the initial clusters. Also, in case the available number of processors is higher than the initially required number of processors, the unused or surplus processors could be used to obtain a lower parallel time. In a DMS, once the resources have been assigned to a user, they remain under the user's control until the program completes its execution. Thus, if some assigned resources are not used, they remain unutilized, and there would be no overhead in terms of resource requirements if certain tasks are duplicated onto these unused processors.

This paper introduces a scalable task duplication based scheduling (STDS) algorithm which can scale the schedule to the available number of processors on the system. The algorithm initially generates clusters similar to linear clusters and uses them to generate the final schedule. The concept of duplicating critical tasks is used in this algorithm, which helps in obtaining a better schedule.

The rest of the paper is organized as follows. Section 2 gives a brief overview of the scheduling algorithm. The running trace of the STDS algorithm on an example DAG is illustrated in Section 3. The results obtained by our algorithm are reported in Section 4. Finally, Section 5 provides the conclusions.

2. STDS ALGORITHM
The idea behind this work is to introduce a fast, scalable, optimal algorithm which would require a practically feasible number of processors. It is assumed that the task graph represented in the form of a DAG is available as an input to the scheduling algorithm. The DAG is defined by the tuple (V, E, τ, c), where V is the set of task nodes and E is the set of edges. The set τ consists of computation costs and each task i ∈ V has a computation cost represented by τ (i). Similarly, c is the set of communication costs and each edge from task i to task j, ei, j ∈ E, has a cost ci, j associated with it. In case two communicating tasks are assigned to the same processor, the communication cost between them is assumed to be zero.
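For concreteness, the DAG input can be represented directly in code. The following is a minimal sketch of one way to hold V, E, τ, and c; the class and field names (DAG, comp, comm, and so on) are illustrative only and are not taken from the paper. It follows the convention stated above that the communication cost is treated as zero when both endpoints of an edge are assigned to the same processor.

```python
# Minimal sketch of the DAG input (V, E, tau, c) assumed by the algorithm.
# Names and structure are illustrative, not the authors' implementation.

class DAG:
    def __init__(self):
        self.tasks = set()   # V: task identifiers
        self.comp = {}       # tau(i): computation cost of task i
        self.comm = {}       # comm[(i, j)]: communication cost c_{i,j} of edge e_{i,j}
        self.pred = {}       # pred(i): set of immediate predecessors of i
        self.succ = {}       # succ(i): set of immediate successors of i

    def add_task(self, i, cost):
        self.tasks.add(i)
        self.comp[i] = cost
        self.pred.setdefault(i, set())
        self.succ.setdefault(i, set())

    def add_edge(self, i, j, cost):
        # edge e_{i,j} carrying data from task i to task j
        self.comm[(i, j)] = cost
        self.succ[i].add(j)
        self.pred[j].add(i)

    def cost(self, i, j, same_processor=False):
        # the communication cost is taken as zero when both tasks share a processor
        return 0 if same_processor else self.comm[(i, j)]
```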
Without loss of generality, it can be assumed that there is one entry node and one exit node. If there are multiple entry or exit nodes, the multiple nodes can always be connected through a dummy node which has zero computation cost and zero communication cost edges. A task is an indivisible unit of work and is nonpreemptive. The underlying target architecture is assumed to be homogeneous and connected, and the communication costs for a fixed length of message between any pair of processors are the same. It is assumed that an I/O coprocessor is available and thus computation and communication can be performed simultaneously. The STDS algorithm generates the schedule based on certain parameters; the mathematical expressions used to evaluate these parameters are given below:

pred(i) = { j | e_{j,i} ∈ E }    (1)

succ(i) = { j | e_{i,j} ∈ E }    (2)

est(i) = 0,  if pred(i) = ∅    (3)

est(i) = min_{j ∈ pred(i)} max_{k ∈ pred(i), k ≠ j} ( ect(j), ect(k) + c_{k,i} ),  if pred(i) ≠ ∅    (4)

ect(i) = est(i) + τ(i)    (5)

fpred(i) = j ∈ pred(i) such that (ect(j) + c_{j,i}) ≥ (ect(k) + c_{k,i}) for all k ∈ pred(i), k ≠ j    (6)

lact(i) = ect(i),  if succ(i) = ∅    (7)

lact(i) = min( min_{j ∈ succ(i), i ≠ fpred(j)} (last(j) − c_{i,j}),  min_{j ∈ succ(i), i = fpred(j)} last(j) ),  if succ(i) ≠ ∅    (8)

last(i) = lact(i) − τ(i)    (9)

level(i) = τ(i),  if succ(i) = ∅    (10)

level(i) = max_{k ∈ succ(i)} level(k) + τ(i),  if succ(i) ≠ ∅    (11)
The computation of the earliest start and completion times proceeds in a top-down fashion starting with the entry node and terminating at the exit node. The latest allowable start and completion times are determined in a bottom-up fashion in which the process starts from the exit node and terminates at the entry node. For each task i, a favorite predecessor fpred(i) is assigned using (6), which signifies that assigning the task and its favorite predecessor on the same processor will result in a lower parallel time. As defined earlier, level(i) is the length of the longest path from node i to an exit node. This algorithm will yield optimal results if the condition given below is satisfied. This condition needs to be true only for join nodes. A join node is a task of the DAG for which the number of predecessor tasks is greater than one. For nonjoin nodes, i.e., for nodes with number of predecessors less than or equal to one, satisfying this condition is not necessary.
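To make this dataflow concrete, the sketch below (continuing the hypothetical Python representation above; it is a direct transcription of (3)–(11), not the authors' implementation) evaluates est, ect, and fpred in topological order and then lact, last, and level in reverse topological order.

```python
from collections import deque

def topological_order(dag):
    # Kahn's algorithm over the pred/succ sets of the DAG sketch above.
    indeg = {i: len(dag.pred[i]) for i in dag.tasks}
    ready = deque(i for i in dag.tasks if indeg[i] == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        for j in dag.succ[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(j)
    return order

def compute_parameters(dag):
    est, ect, fpred = {}, {}, {}
    order = topological_order(dag)

    # Top-down pass: equations (3)-(6).
    for i in order:
        P = dag.pred[i]
        if not P:
            est[i] = 0                                                   # (3)
            fpred[i] = None
        else:
            # est(i) = min over j of max(ect(j), max over k != j of ect(k) + c_{k,i})   (4)
            est[i] = min(
                max([ect[j]] + [ect[k] + dag.comm[(k, i)] for k in P if k != j])
                for j in P
            )
            # fpred(i): the predecessor maximizing ect(j) + c_{j,i}      (6)
            fpred[i] = max(P, key=lambda j: ect[j] + dag.comm[(j, i)])
        ect[i] = est[i] + dag.comp[i]                                    # (5)

    # Bottom-up pass: equations (7)-(11).
    lact, last, level = {}, {}, {}
    for i in reversed(order):
        S = dag.succ[i]
        if not S:
            lact[i] = ect[i]                                             # (7)
            level[i] = dag.comp[i]                                       # (10)
        else:
            candidates = [last[j] - dag.comm[(i, j)] for j in S if fpred[j] != i]
            candidates += [last[j] for j in S if fpred[j] == i]
            lact[i] = min(candidates)                                    # (8)
            level[i] = max(level[j] for j in S) + dag.comp[i]            # (11)
        last[i] = lact[i] - dag.comp[i]                                  # (9)

    return est, ect, fpred, lact, last, level
```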
Condition. Let m and n be the predecessor tasks of task i that have the highest and second highest values of {(ect(j) + c_{j,i}) | j ∈ pred(i)}, respectively. Then one of the following must be satisfied:
• τ(m) ≥ c_{n,i} if est(m) ≥ est(n), or
• τ(m) ≥ (c_{n,i} + est(n) − est(m)) if est(m) < est(n).

This condition needs to be satisfied to obtain optimal results and is not a prerequisite for the algorithm to execute. The condition intuitively states that allocating the predecessors of a task to separate processors will lead to a better schedule. The condition will be satisfied if the granularity of the DAG as defined in [12] is greater than or equal to 1.0, but it is not necessary for the granularity to be greater than or equal to 1.0: the computation and communication costs can cause the granularity of the DAG to be less than 1.0 and still satisfy the condition. Some example DAGs which have low granularity and which satisfy the condition are considered in Section 4.

The pseudocode in Fig. 1 gives an overview of the steps involved in this algorithm. The first two steps of the algorithm compute est, ect, fpred, level, last, and lact. These values are computed using (1)–(11) and are used in step 3, where the tasks are assigned to the processors.

Step 3, shown in Fig. 2, generates the initial task clusters and is based on the parameters computed in steps 1 and 2 and on the array queue. The elements in the array queue are the nodes of the task graph sorted in smallest-level-first order. Each cluster is intended to be assigned to a different processor, and the generation of a cluster is initiated from the first task in the array queue which has not yet been assigned to a processor. The generation of the cluster is completed by performing a search similar to a depth-first search starting from the initial task. The search is performed by tracing the path from the initial task selected from queue to the entry node by following the favorite predecessors along the way. If the favorite predecessor is unassigned, i.e., not yet assigned to a processor, then it is selected. In case the favorite predecessor has already been assigned to another processor, it is still duplicated if there are no other predecessors of the current task or if all the other predecessors of the current task have been assigned to another processor. For example, in the DAG shown in Fig. 3, task 5 is the only predecessor of tasks 6, 7, and 8, and thus task 5 is duplicated on all three processors. In case the favorite predecessor is already assigned to another processor and there are other predecessors of the current task which have not yet been assigned to a processor, then the other predecessors which have not been assigned to a processor are examined to determine if they could initially have been the favorite predecessor. This could have happened if, for another task k (k ∈ pred(i)), (ect(k) + c_{k,i}) = (ect(j) + c_{j,i}), where i is the current task and j is its favorite predecessor. If there exists such a task k, the path to the entry node is traced by traversing through the task k. If none of the other predecessors could initially have been the favorite predecessor,
then the process of cluster generation can be continued by following through any other unassigned predecessor of task i. This process helps reduce the number of tasks which are duplicated. The generation of a cluster terminates once the path reaches the entry node. The next cluster starts from the first unassigned task in queue. If all the tasks are assigned to a processor, then this step terminates. In this step, the algorithm also keeps track of all the tasks which did not make use of the favorite predecessor to complete the task allocation.

After the initial clusters are generated, the algorithm examines whether the number of clusters in the initial schedule, which is the number of required processors (RP), is less than, equal to, or greater than the number of available processors (AP). If RP is equal to AP, the algorithm terminates. If RP is greater than AP, the processor reduction procedure is executed, and if RP is less than AP, the processor incrementing procedure is executed. The code for reducing and increasing the number of processors is shown in Fig. 4.

In the reduction step, each processor i is initially assigned a value exec(i), which is the sum of the execution costs of all the tasks on that processor. For example, if tasks 1, 5, and 9 are assigned to processor i, then exec(i) would be equal to (τ(1) + τ(5) + τ(9)). After computing exec(i) for all the processors, the algorithm sorts the processors in ascending order of exec(i) and merges the task lists of processors in logarithmic fashion. In the first pass the task lists are merged to obtain half the initial number of processors: the task lists of the processors with the highest and lowest values of exec(i) are merged, the task lists of the processors with the second highest and second lowest values of exec(i) are merged, and so on. In case the required number of processors is odd, the processor with the highest exec(i) remains unchanged and the tasks of the rest of the processors are merged. If the number of required processors is still higher than the number of available processors, multiple passes through this procedure are required and each time the number of processors is reduced to half of the previous number. For example, suppose the number of processors required by the initial clusters is 7 and the number of available processors is 2, and suppose the values of exec(i) of those processors are {30, 14, 15, 13, 10, 11, 8}. In this case the task lists of processors 3 and 7 are merged, the task lists of processors 2 and 5 are merged, and the task lists of processors 4 and 6 are merged. After the first pass the number of required processors is 4, which is still higher than the number of available processors. The values of exec(i) for the new set of processors are {30, 23, 21, 24}. The task lists of processors 2 and 4 are merged and the task lists of processors 1 and 3 are merged to yield the final allocation. The exec(i) of a merged processor will not necessarily be the sum of the exec(i) values of the earlier processors, because a task which has been duplicated on both of the initial processors need not be executed twice on the merged processor. When the tasks are merged, the tasks on the new processor are executed in highest-level-first order.
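A rough sketch of the reduction pass described above, under the same hypothetical representation (clusters held as task lists, with level as computed earlier), is given below. It pairs the heaviest processor with the lightest, keeps a task duplicated on both processors only once, and orders the merged list highest level first; how the final, partial pass selects its pairs is simplified here.

```python
def exec_cost(tasks, comp):
    # exec(i): sum of computation costs of the distinct tasks assigned to a processor
    return sum(comp[t] for t in set(tasks))

def merge_pass(clusters, comp, level):
    """One reduction pass: sort processors by exec, pair the heaviest with the
    lightest, and merge each pair's task lists."""
    clusters = sorted(clusters, key=lambda ts: exec_cost(ts, comp))
    merged = []
    lo, hi = 0, len(clusters) - 1
    if len(clusters) % 2 == 1:
        merged.append(clusters[hi])   # odd count: the heaviest processor carries over unchanged
        hi -= 1
    while lo < hi:
        union = set(clusters[lo]) | set(clusters[hi])      # duplicated tasks kept once
        merged.append(sorted(union, key=lambda t: level[t], reverse=True))
        lo += 1
        hi -= 1
    return merged

def reduce_processors(clusters, comp, level, available):
    # Roughly halve the number of clusters per pass until they fit on the
    # available processors (the refinement for a partial final pass is omitted).
    while len(clusters) > available:
        clusters = merge_pass(clusters, comp, level)
    return clusters
```

On the worked example above, sorting {30, 14, 15, 13, 10, 11, 8} and pairing heaviest with lightest reproduces the merges of processors (3, 7), (2, 5), and (4, 6), with processor 1 carried over.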
FIG. 1. Overall code for the STDS algorithm.
FIG. 2. Code for Step 3 of the STDS algorithm.
FIG. 3. DAG with critical node and its clusters.
Finally, in the incrementing step, the surplus processors are used in an effort to lower the overall schedule length. This is achieved by duplicating certain critical tasks. For each extra processor that is available, the algorithm examines a task that was not placed with its favorite predecessor when it was originally allocated. The new allocation uses the favorite predecessors of all the tasks while traversing back to the entry node, and the original task list from the task that did not use its favorite predecessor is copied to a new processor. For example, suppose task i on processor j did not use its favorite predecessor initially. Then the initially traversed list from i to the entry node that is on processor j is copied to a new (additional) processor. On processor j, the favorite predecessors of all the tasks from task i onward are traversed till the entry node is encountered.
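The sketch below illustrates this incrementing step under the same hedged assumptions as the earlier sketches, with the additional illustrative assumptions that each cluster is stored as a list running from its starting task back to the entry node, and that the tasks which did not get their favorite predecessor have been recorded as (task, processor) pairs in a list called unfavored.

```python
def increment_processors(clusters, unfavored, fpred, entry, extra):
    """Use up to `extra` surplus processors to duplicate critical tasks.
    `unfavored` holds (task, processor) pairs where the task was not placed
    with its favorite predecessor; `entry` is the entry node of the DAG."""
    for task, p in unfavored[:extra]:
        tasks = clusters[p]
        pos = tasks.index(task)
        # The originally traversed list from `task` to the entry node is copied
        # to a new (additional) processor.
        clusters.append(tasks[pos:])
        # On processor p, rebuild the tail by following favorite predecessors
        # from `task` back to the entry node.
        chain, t = [], task
        while True:
            chain.append(t)
            if t == entry:
                break
            t = fpred[t]
        clusters[p] = tasks[:pos] + chain
    return clusters
```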
FIG. 4. Code for Step 4 of the STDS algorithm.
2.1. Complexity of the Algorithm

The first two steps compute the various parameters by traversing all the tasks and the edges of the task graph. In the worst case, the algorithm would have to examine all the edges of the DAG. Thus, the worst case complexity of these steps is O(|E|), where |E| is the number of edges. Since the third step deals with a traversal of the task graph similar to the depth-first search, the complexity of this step would be the same as the complexity of the general search algorithm, which is O(|V| + |E|) [3], where |V| is the number of nodes. Since the task graph is connected, the complexity is O(|E|). In step 4, the complexity of the reduction procedure depends upon the difference between the number of processors required by the linear clusters and the number of processors provided by the system. For each pass through the reduction code shown in Fig. 4, the time required is of order O(|V| + |P| log |P|), where |P| is the number of processors. The number of passes through the reduction code (step 4 of the algorithm) is given by

Q = ⌈log₂(RP / AP)⌉.    (12)
The complexity of this step would be O(Q(|V| + |P| log |P|)). In the worst case, RP would be equal to |V|, AP would be equal to one, and Q would be log₂|V|. Also, in the worst case, |P| would be equal to |V|. The complexity of this step in the worst case would therefore be O(|V| log₂|V| + |V|(log₂|V|)²). For all values of |V|, the condition |V| > log₂|V| holds, and for |V| ≥ 16, the condition |V| ≥ (log₂|V|)² is satisfied. For practical applications, |V| would always be larger than 16 and, consequently, the worst case complexity of this step is O(|V|²). For the incrementing step, in the worst case, for each task which did not use its favorite predecessor, all the nodes of the graph might have to be examined. Thus, the worst case complexity of this step is O(|T| |V|), where |T| is the number of tasks which did not use their favorite predecessors. In the worst case, |T| would be equal to |V| and the worst case complexity of the incrementing step would be O(|V|²). The overall time complexity of the STDS algorithm is thus O(|V| + |E| + |V|²), whether the number of available processors is less than or greater than the number of processors required by the initial clusters. For a dense graph the number of edges |E| is O(|V|²). Thus, the worst case complexity of the algorithm in both cases is O(|V|²).

3. ILLUSTRATION OF THE STDS ALGORITHM
The working of the STDS algorithm is illustrated by a simple example DAG shown in Fig. 5. The steps involved in scheduling the example DAG are explained below:
FIG. 5. Example directed acyclic graph.
Step 1. Find est, ect, and fpred for all nodes i ∈ V. These are shown in Table I.

Step 2. Find last, lact, and level for all nodes i ∈ V. These are shown in Table I.

Step 3. Generate clusters similar to linear clusters. For this DAG, the array queue is:

queue = [10, 7, 9, 8, 5, 2, 4, 6, 3, 1]

The generation of linear clusters starts from the first node in queue, i.e., node 10. The search proceeds to the entry node by tracing through the favorite predecessors of each task. Since this is the first pass, none of the tasks have been assigned to any processor. Consequently, all the tasks can have their favorite predecessors allocated to the current processor, and the allocation list {10, 8, 6, 3, 1} is obtained for processor 1. The next cluster starts from the first unassigned task in the array queue, which is task 7. The search from task 7 yields task 6 as its favorite predecessor. Since task 6 is already allocated to another processor, the search continues with task 5, but makes a note that task 7 did not use task 6 on processor 2. In case an additional processor is available, task 7 duplicates task 6 on its processor. Proceeding with the search yields the allocation list {7, 5, 3, 1} for processor 2. It can be observed that even though task 3 has already been assigned to processor 1, it is still duplicated on processor 2 because it is the only predecessor of task 5. The next search starts with task 9. Following this procedure, the task clusters shown in Fig. 6 can be obtained. In these clusters, task 7 on processor 2 and task 9 on processor 3 did not use their favorite predecessors.

TABLE I
Start and Completion Times for Nodes of the DAG

Node   level   est   ect   last   lact   fpred
 1      20      0     3     0      3      —
 2      10      3     9     6     12      1
 3      17      3     7     3      7      1
 4      12      3     8     3      8      1
 5       9      7    12     7     12      3
 6      13      7    11     8     12      3
 7       4     15    18    15     18      6
 8       9     11    19    13     21      6
 9       7     12    18    12     18      6
10       1     21    22    21     22      8

FIG. 6. Initial processor schedule.

Step 4. The number of required processors is 4. If the available number of processors is less than 4, then the task lists of different processors have to be merged. For example, suppose the number of processors available is 2. Then the number of processors has to be reduced by 2. The procedure to reduce the number of processors is as follows:
1. Find the values of exec(i) for each processor i. These values are 20, 15, 14, and 9 for processors P1, P2, P3, and P4, respectively.
2. Since the number of processors is being reduced to half, only one pass through this procedure is required. In this pass the task lists of processors (P1, P4) and processors (P2, P3) are merged.
In case only three processors are available, the task lists of processors (P3, P4) are merged. The modified allocations for the two and three processor cases are shown in Fig. 7.

Step 5. The initial number of required processors is 4. Suppose 10 processors are available for the execution of this application. In this DAG, there are two places where a favorite predecessor was not used when the tasks were initially assigned to processors. For tasks 7 and 9, allocated to processors 2 and 3, respectively, task 6 is the favorite predecessor. In the initial clusters, tasks 7 and 9 were not allocated to the same processor as task 6. In the modified allocation, task 6 is assigned to the same processor as tasks 7 and 9. Even if there are more than six processors available, only six processors are used to generate the schedule. In case five processors are available, this procedure is applied only to the first encountered task, which is task 7 in this case. Using the processor incrementing procedure, the processor allocations shown in Fig. 7 are obtained. For the example task graph, Table II shows the schedule length that would be achieved as the number of available processors is varied from 1 upward. Even if there are more than six processors available, the algorithm utilizes only six processors.

4. PERFORMANCE OF THE STDS ALGORITHM
The performance of the STDS algorithm has been observed for three cases. The first case is if the condition stated in Section 2 is satisfied and the required number of processors is available. In this case, an optimal schedule is guaranteed and the proof is shown in [8, 9]. Next, the cases where either the condition is not necessarily satisfied or the required number of processors is not available are considered. For this case, extensive simulations have been performed: random data for edge and node costs have been generated and the schedule length generated by the STDS algorithm has been compared with the theoretical lowerbound. Finally, the algorithm has been applied to four different application DAGs having nearly 3000 tasks each.

4.1. Application of Algorithm on Random Data

The performance of the STDS algorithm on DAGs with random edge and node costs that do not necessarily satisfy the condition has been observed in this section. To observe the performance, the example DAG shown in Fig. 5 has been taken and the edge and node costs for the DAG have been generated using a random number generator. For each of these data sets and for a varying number of processors, the ratio of the STDS generated schedule and the absolute lowerbound has been obtained. The absolute lowerbound for an application on P processors is defined as

Absolute Lowerbound = max( level(entry node), (1/P) Σ_{i ∈ V} τ(i) ).    (13)

The first term of the equation corresponds to the longest path (of computation costs) of the DAG. Since the dependencies have to be maintained, the schedule length can never be lower than this longest path, which represents the level of the entry node. The second term of the equation is the overall sum of all the computation costs divided by the number of processors. The schedule length has to be greater than or equal to the second term. Thus, the maximum of the two terms will be the theoretical lowerbound, which may or may not be practically achievable.

The random number generator has been used 1000 times to generate 1000 sets of data. For each of these data sets, the costs were taken as (the generated number modulo 100) + 1. In effect, the edge and node costs lie in the range of 1 to 100.

FIG. 7. Final processor schedule.

TABLE II
Number of Processors vs Schedule Length

Number of processors    Schedule length
1                        45
2                        30
3                        26
4                        26
5                        24
6 and above              22
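As a quick check of (13), the bound can be computed directly from the node costs. For the example DAG of Section 3 (entry-node level 20 and computation costs summing to 45, from Table I), it gives 45 on one processor, matching Table II, and 22.5 on two processors. A minimal sketch:

```python
def absolute_lowerbound(total_work, entry_level, processors):
    """Absolute lowerbound of (13): the larger of the entry node's level
    (longest computation-cost path) and the total work spread over the processors."""
    return max(entry_level, total_work / processors)

# Example DAG of Section 3: entry-node level 20, computation costs summing to 45.
print(absolute_lowerbound(45, 20, 1))   # 45.0, matching Table II for one processor
print(absolute_lowerbound(45, 20, 2))   # 22.5
```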
The random number generator generates DAGs with varying granularities (as defined in [12]). Using this definition of granularity, the percentage of graphs versus the granularity values for these data sets has been plotted in Fig. 8. All of these DAGs required five or fewer processors for generating the schedule. Thus, the ratio of the algorithm generated schedule and the absolute lowerbound has been obtained for these random DAGs as the number of processors is varied from 1 to 5 and is shown in Fig. 9. On the average, the STDS algorithm required 26% more time than the absolute lowerbound to schedule these DAGs.

FIG. 8. Percentage of graphs vs granularities.

FIG. 9. Ratio of schedule lengths vs number of processors.

4.2. Application of Algorithm to Practical DAGs

Finally, this algorithm has been applied to various practical DAGs. The four applications are the Bellman–Ford algorithm
[4, 17], the systolic algorithm [15, 17], the master–slave algorithm [17] and the Cholesky decomposition algorithm [11]. The first three applications are part of the ALPES project [17]. These four applications have varying characteristics. The number of nodes of these DAGs is around 3000 but the number of edges varies from 5000 to 25,000. Thus, these graphs vary between sparse and dense. The number of predecessors and successors varies from 1 to 140 and the computation and communication costs also vary in a wide range. The schedule
length is generated by the STDS algorithm for different numbers of processors and is compared against the absolute lowerbound. These are shown in Figs. 10–13. These figures have four parts and the description of the plots is as follows:
• In the (a) plots, the variation of schedule length with the number of processors has been plotted. It can be observed that the schedule length varies in a stepwise fashion. The algorithm initially merges the task lists of the idle processors, but there comes a stage where it is necessary to merge the task lists with the most busy processor. At that point, there is a noticeable increase in the schedule length. After that point, the schedule length again increases very slowly till the next critical point is reached.
• The (b) plots show the variation of the absolute lowerbound with the number of processors. This, in effect, is a plot of (13).
• In the (c) plots, the interesting portions of the (a) and (b) plots for each of the applications have been magnified. In these plots, the stepwise increase in the schedule length generated by the STDS algorithm is more noticeable.
• Finally, in the (d) plots, the ratio of the STDS generated schedule to the absolute lowerbound as the number of processors is varied is shown. It can be observed from the plots that the ratios fluctuate heavily when the number of processors is low. As the number of processors increases, the ratios remain constant. The reason for this is that as the number of processors is lowered, the schedule length increases rapidly; i.e., the number of visible steps is large. Since the absolute lowerbound varies smoothly, the ratio between the STDS schedule and the absolute lowerbound is dictated by the STDS schedule.

FIG. 10. Plots for Bellman–Ford shortest path algorithm.

FIG. 11. Plots for Cholesky decomposition algorithm.

FIG. 12. Plots for systolic algorithm.

FIG. 13. Plots for master–slave algorithm.

The characteristics for each application are shown in Table III.
TABLE III
Performance Parameters for Different Applications

Characteristics                                      Bellman–Ford    Cholesky         Systolic     Master–slave
                                                     algorithm       decomposition    algorithm    algorithm
Granularity of DAG                                   0.002           0.013            0.002        0.002
Condition satisfied                                  Yes             Yes              No           No
Initial processors required by STDS                  203             75               50           50
Maximum processors required by STDS                  1171            342              97           50
Average ratio of STDS/absolute lowerbound            1.095           1.118            1.315        1.50
Ratio of STDS/absolute lowerbound schedule
  when maximum processors are available              1.0004          1.0              1.0017       1.0019
5. CONCLUSIONS
A scalable task duplication based scheduling algorithm for DMSs has been presented in this paper. The algorithm is based on the concept of task duplication and operates in two phases. Initially, linear clusters are generated; if the number of processors required by the linear clusters is more than the number of processors provided by the system, the task lists are merged in an effort to lower the number of required processors. In contrast, if there are more processors available, then the surplus processors are utilized to duplicate certain critical tasks, reducing the schedule length and bringing it as close to optimal as possible. The complexity of this algorithm is of order O(|V|²), where |V| is the number of nodes or tasks in the DAG. The results obtained by its application to randomly generated DAGs are very promising. Also, the STDS algorithm has been applied to large application DAGs, and it has been observed that the schedule length generated by the STDS algorithm varies from 1.1 to 1.5 times the theoretical lowerbound. The STDS algorithm performs very well in reducing the number of processors, increasing the schedule length only gradually. Since the algorithm is scalable to the available number of processors, it can be applied to DMSs of all sizes.

ACKNOWLEDGMENTS

We thank the anonymous reviewers who helped improve the quality of the paper. Also, we thank Dr. J. P. Kitajima for providing us with the data for the application DAGs which were used as a part of the ALPES project. This work has been supported by the Army Research Office under Grant DAAH04-94G-0306.
REFERENCES

1. Adam, T. L., Chandy, K. M., and Dickson, J. R. A comparison of list schedules for parallel processing systems. Comm. ACM 17, 12 (Dec. 1974), 685–690.
2. Ahmad, I., and Kwok, Y.-K. A new approach to scheduling parallel programs using task duplication. Proceedings of the International Conference on Parallel Processing, Vol. II, 1994, pp. 47–51.
3. Aho, A. V., Hopcroft, J. E., and Ullman, J. D. The Design and Analysis of Computer Algorithms. Addison–Wesley, Reading, MA, 1974.
4. Bertsekas, D. P., and Tsitsiklis, J. N. Parallel and Distributed Computation: Numerical Methods. Prentice–Hall International, Englewood Cliffs, NJ, 1989.
5. Chen, H. B., Shirazi, B., Kavi, K., and Hurson, A. R. Static scheduling using linear clustering with task duplication. Proceedings of the ISCA International Conference on Parallel and Distributed Computing and Systems, 1993, pp. 285–290.
6. Colin, J. Y., and Chrétienne, P. C.P.M. scheduling with small communication delays and task duplication. Oper. Res. 39, 4 (July 1991), 680–684.
7. Darbha, S., and Agrawal, D. P. SDBS: A task duplication based optimal scheduling algorithm. Proceedings of the Scalable High Performance Computing Conference, 1994, pp. 756–763.
8. Darbha, S., and Agrawal, D. P. Scalable scheduling algorithm for distributed memory systems. Proceedings of the Eighth IEEE Symposium on Parallel and Distributed Processing, New Orleans, 1996, pp. 84–91.
9. Darbha, S. Task scheduling algorithms for distributed memory systems. Ph.D. Thesis, North Carolina State University, 1995.
10. El-Rewini, H., and Lewis, T. G. Scheduling parallel program tasks onto arbitrary target architectures. J. Parallel Distrib. Comput. 9 (1990), 138–153.
11. Gerasoulis, A., and Yang, T. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. J. Parallel Distrib. Comput. 16 (Dec. 1992), 276–291.
12. Gerasoulis, A., and Yang, T. On the granularity and clustering of directed acyclic task graphs. IEEE Trans. Parallel Distrib. Systems 4, 6 (June 1993), 686–701.
13. Graham, R. L., Lawler, E. L., Lenstra, J. K., and Rinnooy Kan, A. H. G. Optimization and approximation in deterministic sequencing and scheduling: A survey. Ann. Discrete Math. 5 (1979), 287–326.
14. Hwang, J. J., Chow, Y. C., Anger, F. D., and Lee, C. Y. Scheduling precedence graphs in systems with interprocessor communication times. SIAM J. Comput. 18, 2 (April 1989), 244–257.
15. Ibarra, O. H., and Sohn, S. M. On mapping systolic algorithms onto the hypercube. IEEE Trans. Parallel Distrib. Systems 1, 1 (Jan. 1990), 48–63.
16. Kim, S. J., and Browne, J. C. A general approach to mapping of parallel computation upon multiprocessor architectures. International Conference on Parallel Processing, 1988, Vol. 3, pp. 1–8.
17. Kitajima, J. P., and Plateau, B. Building synthetic parallel programs: The project (ALPES). Proceedings of the IFIP WG 10.3 Workshop on Programming Environments for Parallel Computing, Edinburgh, 1992, pp. 161–170.
18. Kruatrachue, B. Static task scheduling and grain packing in parallel processing systems. Ph.D. Thesis, Oregon State University, 1987.
19. Lee, C. Y., Hwang, J. J., Chow, Y. C., and Anger, F. D. Multiprocessor scheduling with interprocessor communication delays. Oper. Res. Lett. 7, 3 (June 1988), 141–147.
20. Pande, S. S., Agrawal, D. P., and Mauney, J. A new threshold scheduling strategy for Sisal programs on distributed memory systems. J. Parallel Distrib. Comput. 21, 2 (May 1994), 223–236.
21. Pande, S. S., Agrawal, D. P., and Mauney, J. A scalable scheduling method for functional parallelism on distributed memory multiprocessors. IEEE Trans. Parallel Distrib. Systems 6, 4 (April 1995), 388–399.
22. Sarkar, V. Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. MIT Press, Cambridge, MA, 1989.
23. Sih, G. C., and Lee, E. A. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans. Parallel Distrib. Systems 4, 2 (Feb. 1993), 175–187.
24. Wang, Q., and Cheng, K. H. List scheduling of parallel tasks. Inform. Process. Lett. 37 (Mar. 1991), 291–297.
SEKHAR DARBHA received his B.Tech. in electrical engineering from the Institute of Technology, Banaras Hindu University, Varanasi, India in 1989. He received his M.S. and Ph.D. in computer engineering from North Carolina State University, Raleigh, in 1991 and 1995, respectively. He has been working as an assistant professor in the Electrical and Computer Engineering Department at Rutgers University since August 1995. His research interests are in program partitioning and scheduling for multiprocessing systems. He has served as a minitrack coordinator for the partitioning and scheduling minitrack at the 30th Hawaii International Conference on System Sciences (HICSS) and he will be involved with the 31st HICSS as a minitrack coordinator for the minitrack on compiling for distributed embedded systems. He is a member of the IEEE and the IEEE Computer Society. DHARMA P. AGRAWAL received the B.E. in electrical engineering from Ravishanker University, Raipur, in 1966, the M.E. (Hons.) in electronics and communication engineering from the University of Roorkee, Roorkee, 1968,
and the D.Sc. Tech. in electrical engineering from the Swiss Federal Institute of Technology, Lausanne, 1975. He is currently a professor of electrical and computer engineering at North Carolina State University, Raleigh. His current research interests are in parallelism detection and scheduling, mobile and communication networks, and system reliability. He is an author of a self-study guide, "Parallel Processing" (IEEE Press, 1991), an editor of a tutorial text, "Advanced Computer Architecture" (IEEE Comp. Soc. Press, 1980), and a co-editor of tutorial texts on "Distributed Computing Network Reliability," 1990, and "Advances in Distributed System Reliability," 1990. He is a recipient of the Certificate of Appreciation and three Meritorious Service Awards from the IEEE Computer Society. He has served as an editor of IEEE Computer, IEEE Transactions on Computers, and the IEEE Computer Society Press, and is currently an editor of the Journal of Parallel and Distributed Computing (Academic Press) and the International Journal of High-Speed Computing (World Scientific). Dr. Agrawal has served as the General Chair for the MASCOTS 1996 workshop, the founding Chair of the 1995 ICPP Workshop on Challenges for Parallel Processing, the Program Chair of the 1994 International Conference on Parallel Processing, the General Chair for the 6th International Conference on Parallel and Distributed Computing Systems, and the Program Chairman of the 11th International Symposium on Computer Architecture (1984) and the Army Research Office Workshops on Future Directions in Computer Architecture and Software (1986). Currently, he is the Chairman of the IEEE–CS Harry Goode Memorial Award and the IEEE–CS W. Wallace McDowell Award. He has been a member of the ACM Lecturership Committee, the Computer Science Accreditation Board, and the IEEE–CS Distinguished Visitor Committee.

Received June 15, 1995; revised March 27, 1996; accepted August 21, 1997