Parallel Computing 31 (2005) 653–670 www.elsevier.com/locate/parco

A high performance, low complexity algorithm for compile-time task scheduling in heterogeneous systems

T. Hagras *, J. Janeček

Czech Technical University in Prague, Department of Computer Science and Engineering, Karlovo nam. 13, 12135 Praha 2, Czech Republic

Received 31 October 2004; received in revised form 15 March 2005; accepted 15 April 2005

Abstract

Heterogeneous computing systems are an interesting computing platform because a single parallel architecture may not be adequate for exploiting all of a program's available parallelism. In some cases, heterogeneous systems have been shown to produce higher performance at lower cost than a single large machine. Task scheduling is the key issue when aiming at high performance in this kind of system. A large number of scheduling heuristics have been presented in the literature, but most of them target only homogeneous computing systems. In this paper we present a simple scheduling algorithm based on list-scheduling and task-duplication on a bounded number of heterogeneous machines, called Heterogeneous Critical Parents with Fast Duplicator (HCPFD). The analysis and experiments show that HCPFD outperforms, on average, all the other higher complexity algorithms considered. © 2005 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail addresses: [email protected], [email protected] (T. Hagras), [email protected] (J. Janeček).

0167-8191/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.parco.2005.04.002


1. Introduction

A heterogeneous computing system is defined as a distributed suite of computing machines with different capabilities, interconnected by different high-speed links and utilized to execute parallel programs [1]. The performance of parallel program execution on such platforms is highly dependent on the scheduling of the parallel program tasks onto these machines [1,2]. The main objective of the scheduling mechanism is to map the multiple interacting program tasks onto machines and to order their execution so that precedence requirements are satisfied and the minimum overall completion time is achieved [3,4].

When the characteristics of the parallel program, in terms of task execution times, task dependencies and amounts of communicated data, are known a priori, the parallel program can be represented by the static model [3–5], and scheduling can be accomplished at compile-time. In the general form of the static model, the application is represented by a directed acyclic graph (DAG), in which the nodes represent application tasks and the edges represent inter-task data dependencies. Each node is labeled with the computation cost (the expected computation time) of the task and each edge is labeled with the communication cost (the expected communication time) between tasks [3–5].

The task scheduling problem in general is NP-complete [2–6]; therefore heuristics are used to obtain sub-optimal schedules rather than exhaustively searching all possible schedules. Task scheduling has been extensively studied, and various heuristics have been proposed in the literature [6–18]. In compile-time scheduling, these heuristics are classified into a variety of categories, such as list-scheduling, clustering and task-duplication. List-scheduling basically consists of two phases: a task prioritization phase, where a certain priority is computed and assigned to each node of the DAG, and a machine assignment phase, where each task (in order of its priority) is assigned to the machine that minimizes a suitable cost function. List-scheduling is generally accepted as an attractive approach, since it combines low complexity with good results [3–5,7–13]. The basic idea of task-duplication is to duplicate the parents of the currently selected task onto the selected machine or onto other machine(s), aiming to reduce or optimize the task start or finish time [14–18]. The main weaknesses of duplication-based algorithms are their high complexity and the fact that they mainly target an unbounded number of computing machines.

In this paper we propose a compile-time task scheduling algorithm based on list-scheduling and task-duplication, called Heterogeneous Critical Parents with Fast Duplicator (HCPFD). The algorithm works for a bounded number of fully connected heterogeneous machines and introduces a simple listing mechanism instead of the classical prioritization phase, together with a low complexity duplication-based mechanism for the machine assignment phase.

The remainder of this paper is organized as follows: the next section introduces the compile-time scheduling problem in the heterogeneous environment and provides definitions of the parameters utilized in the algorithm. The HCPFD scheduling algorithm is presented in Section 3. Section 4 contains a brief overview of the most recent heterogeneous scheduling algorithms, which we use for performance comparison.


A performance evaluation, based on a large number of randomly generated application graphs and two real-world applications, is presented in Section 5. Section 6 concludes the paper.

2. Problem definition

This section presents the model of the application used for compile-time scheduling, the model of the heterogeneous computing system, the scheduling attributes, and the scheduling objective.

The application is represented by a directed acyclic graph G(V, E, W, l), as shown in Fig. 1, where:

• V is the set of v nodes; each node v_i ∈ V represents an application task, which is a sequence of instructions that must be executed serially on the same machine.
• E is the set of communication edges. The directed edge e_{i,j} joins nodes v_i and v_j, where v_i is called the parent and v_j is called the child. This implies that v_j cannot start until v_i finishes and sends its data to v_j.
• W is a v × p computation cost matrix in which each w_{i,j} gives the estimated time to execute task v_i on machine p_j.
• l is the set of amounts of communicated data; edge e_{i,j} carries l_{i,j} ∈ l bytes of data.

A task without any parent is called an entry task and a task without any child is called an exit task. If there is more than one exit (entry) task, they may be connected to a zero-cost pseudo exit (entry) task with zero-cost edges, which do not affect the schedule.

The heterogeneous computing environment model is a set P of p heterogeneous machines (processors) connected in a fully connected topology. It is also assumed that:

Fig. 1. Application model and computation model matrices of heterogeneous computing systems.


• any machine (processor) can execute a task and communicate with other machines at the same time,
• once a machine (processor) has started task execution, it continues without interruption, and after completing the execution it immediately sends the output data to all children tasks in parallel.

The communication costs per transferred byte between any two machines are stored in a matrix R of size p × p. The communication startup costs of the machines are given in a p-dimensional vector S. The communication cost of edge e_{i,j}, for transferring l_{i,j} bytes of data from task v_i (scheduled on p_m) to task v_j (scheduled on p_n), is defined as

    c_{i,j,m,n} = S_m + R_{m,n} \cdot l_{i,j},

where S_m is the communication startup cost of p_m (in seconds), l_{i,j} is the amount of data transmitted from task v_i to task v_j (in bytes), and R_{m,n} is the communication cost per transferred byte from p_m to p_n (in seconds/byte). The assumption of fully connected machines is made only for simulation simplicity; if the network topology is to be taken into account, the communication cost given above can easily be modified accordingly.

In order to compute the attributes needed to find the critical tasks, the average computation and communication costs are utilized. The average computation cost is computed as

    \bar{w}_i = \frac{1}{p} \sum_{j=1}^{p} w_{i,j},

and the average communication cost is computed as

    \bar{c}_{i,j} = \bar{S} + \bar{R} \cdot l_{i,j},

where \bar{S} = \sum_{q=1}^{p} S_q / p is the average communication startup cost over all machines and \bar{R} = \sum_{i=1}^{p} \sum_{j=1}^{p} R_{i,j} / (p(p-1)) is the average communication cost per transferred byte over all machines.

The average earliest start time AEST(v_i) can be computed recursively by traversing the DAG downward, starting from v_entry:

    AEST(v_i) = \max_{v_m \in prnt(v_i)} \{ AEST(v_m) + \bar{w}_m + \bar{c}_{m,i} \},

where prnt(v_i) is the set of immediate parents of v_i and AEST(v_entry) = 0. The average latest start time ALST(v_i) can be computed recursively by traversing the DAG upward, starting from v_exit:

    ALST(v_i) = \min_{v_m \in chld(v_i)} \{ ALST(v_m) - \bar{c}_{i,m} \} - \bar{w}_i,

where chld(v_i) is the set of immediate children of v_i and ALST(v_exit) = AEST(v_exit).

The main objective of the scheduling mechanism is to determine an assignment of the tasks of a given application to a given machine (processor) set P such that the scheduling length (makespan) is minimized while all precedence constraints are satisfied.
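To make these attributes concrete, the following is a minimal sketch (ours, not the authors' implementation) that computes the average costs and the AEST/ALST attributes; the dictionary-based input format (W, R, S, l, parents, children) is an assumption of the sketch. A task v with AEST(v) = ALST(v) is a critical task.

```python
def average_costs(W, R, S, l):
    """W[i]: list of costs of task i on each machine; R: p x p per-byte
    cost matrix; S: list of startup costs; l[(i, j)]: bytes on edge (i, j)."""
    p = len(S)
    w_bar = {i: sum(costs) / p for i, costs in W.items()}
    S_bar = sum(S) / p
    R_bar = sum(R[m][n] for m in range(p) for n in range(p) if m != n) / (p * (p - 1))
    c_bar = {edge: S_bar + R_bar * size for edge, size in l.items()}
    return w_bar, c_bar

def aest_alst(parents, children, w_bar, c_bar, entry, exit_task):
    aest, alst = {entry: 0.0}, {}
    def AEST(v):  # downward traversal from the entry task
        if v not in aest:
            aest[v] = max(AEST(u) + w_bar[u] + c_bar[(u, v)] for u in parents[v])
        return aest[v]
    alst[exit_task] = AEST(exit_task)  # ALST(v_exit) = AEST(v_exit)
    def ALST(v):  # upward traversal from the exit task
        if v not in alst:
            alst[v] = min(ALST(u) - c_bar[(v, u)] for u in children[v]) - w_bar[v]
        return alst[v]
    for v in parents:
        AEST(v); ALST(v)
    return aest, alst  # critical tasks satisfy AEST(v) == ALST(v)
```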

3. Proposed algorithm

Recently, a list-scheduling heuristic for homogeneous computing environments, called CNPT, has been presented [10]. This heuristic achieves the lower-bound complexity of list-scheduling heuristics with performance comparable to, or even better than, the higher complexity heuristics. This section presents a list-scheduling algorithm for a bounded number of fully connected heterogeneous machines, called Heterogeneous Critical Parents with Fast Duplicator (HCPFD), which aims to achieve high performance at low complexity. The algorithm consists of two mechanisms: a listing mechanism, which is a modified version of the CNPT heuristic for heterogeneous systems, and a machine assignment mechanism, which is a near lower-bound complexity mechanism based on task-duplication.

3.1. Listing mechanism

The main idea of both CNPT and the proposed listing mechanism comes from the fact that delaying a critical task (CT) will delay its children CTs and hence the exit task. A CT is defined as a task that has zero difference between its AEST and ALST. Based on this idea, the mechanism tries to add the critical tasks to the list as soon as possible. The modified mechanism (Fig. 2) divides the task graph into a set of unlisted parent-trees; the root of each parent-tree is a CT. The algorithm starts with an empty queue L and an auxiliary stack S that contains the CTs pushed in decreasing order of their ALSTs, i.e., the entry task is on the top of S (top(S)). Then top(S) is examined repeatedly: if top(S) has an unlisted parent (i.e., a parent not in L), this parent is pushed onto the stack; otherwise, top(S) is popped and enqueued into L. For the DAG in Fig. 1, the listing trace is given in Table 1 (a code sketch of the procedure follows Table 1). Fig. 3 indicates each CT's unlisted parent-tree and the order of all tasks in L. The critical tasks (v1, v3, v7 and v10) are shown connected by thick edges.

3.2. Machine assignment mechanism

The main idea of the proposed machine assignment mechanism is to:

1. select the machine that minimizes the start time of the currently selected task,
2. examine the idle time left by the selected task on the selected machine for duplicating one selected parent, and
3. confirm this duplication if it will reduce the start time of the currently selected task.


Fig. 2. The listing mechanism.

Table 1
Listing trace of the listing mechanism

Step   top(S)   Checked parents   Pushed on S   Popped from S
1      1        –                 –             1
2      3        –                 –             3
3      7        4                 4             –
4      4        –                 –             4
5      7        –                 –             7
6      10       8, 9              8             –
7      8        2, 6              2             –
8      2        –                 –             2
9      8        6                 6             –
10     6        –                 –             6
11     8        –                 –             8
12     10       9                 9             –
13     9        5                 5             –
14     5        –                 –             5
15     9        –                 –             9
16     10       –                 –             10
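Under the same dictionary-based assumptions as above, the listing mechanism reduces to the following minimal sketch (ours, not the authors' code); which unlisted parent is pushed first when several exist is a tie-breaking assumption.

```python
def listing_mechanism(parents, aest, alst, eps=1e-9):
    # critical tasks, pushed in decreasing ALST order (entry task ends on top)
    cts = sorted((v for v in parents if abs(aest[v] - alst[v]) < eps),
                 key=lambda v: alst[v], reverse=True)
    S, L, listed = list(cts), [], set()
    while S:
        unlisted = [u for u in parents[S[-1]] if u not in listed and u not in S]
        if unlisted:
            S.append(unlisted[0])   # push one unlisted parent of top(S)
        else:
            v = S.pop()             # pop top(S) and enqueue it into L
            listed.add(v)
            L.append(v)
    return L                        # tasks are assigned to machines in this order
```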

In contrast to the basic idea of general duplication algorithms, the proposed mechanism selects the machine first and then checks for duplication. Instead of examining one task at each step, the mechanism examines one task and one parent, which does not increase the complexity of the classical non-insertion-based machine assignment mechanism.


Fig. 3. Critical tasks unlisted parent-trees.

The following definitions clarify the proposed machine assignment mechanism (Fig. 4), which is used in the machine assignment phase:

Definition 1. The Task Finish Time on a machine (TFT) is defined as

    TFT(v_i, p_q) = \max_{v_n \in prnt(v_i)} \{ RT(p_q), FT(v_n) + k \cdot c_{n,i} \} + w_{i,q},

where prnt(v_i) is the set of immediate parents of v_i, RT(p_q) is the time when p_q is available, FT(v_n) is the finish time of the scheduled parent v_n, and k = 1 if the machine assigned to parent v_n is not p_q, k = 0 otherwise.

Definition 2. The Task Start Time on a machine (TST) is defined as

    TST(v_i, p_q) = TFT(v_i, p_q) - w_{i,q}.

Definition 3. The Duplication Time Slot is

    DTS(v_i, p_m) = TST(v_i, p_m) - RT(p_m).


Fig. 4. The machine assignment mechanism.

Definition 4. The Critical Parent (CP) is the parent v_cp (scheduled on p_q) of v_i (tentatively scheduled on p_m) whose data arrival time at v_i is the latest.

Definition 5. DAT(v_cp2, p_m) is the data arrival time of the second critical parent v_cp2 on p_m.

Definition 6. The duplication condition is

    DTS(v_i, p_m) > w_{cp,m}  and  TFT(v_cp, p_m) < TST(v_i, p_m).
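Read operationally, Definitions 1–6 amount to the following minimal sketch of one assignment step (our reading, not the authors' code); the container names (prnt, w, c, FT, proc_of, RT) are assumptions of the sketch and tie-breaking is arbitrary.

```python
def schedule_task(v, machines, prnt, w, c, FT, proc_of, RT):
    """Assign one listed task v. w[(t, m)]: cost of task t on machine m;
    c[(u, t)]: average communication cost of edge (u, t); FT/proc_of: finish
    time and machine of already-scheduled tasks; RT[m]: machine-ready time."""
    def arrival(u, t, m):           # data arrival time of parent u of t on m
        return FT[u] + (0 if proc_of[u] == m else 1) * c[(u, t)]

    def tft(t, m):                  # Definition 1: task finish time on m
        return max([RT[m]] + [arrival(u, t, m) for u in prnt[t]]) + w[(t, m)]

    m = min(machines, key=lambda q: tft(v, q))       # machine minimizing TFT
    tst = tft(v, m) - w[(v, m)]                      # Definition 2
    dts = tst - RT[m]                                # Definition 3: idle slot

    if prnt[v]:
        cp = max(prnt[v], key=lambda u: arrival(u, v, m))   # Definition 4
        tft_cp = tft(cp, m)          # finish time of a copy of cp run on m
        # Definition 6: the slot fits cp and the copy finishes before TST(v)
        if dts > w[(cp, m)] and tft_cp < tst:
            others = [arrival(u, v, m) for u in prnt[v] if u != cp]
            new_tst = max([tft_cp] + others)
            if new_tst < tst:        # confirm only if v really starts earlier
                tst = new_tst
    FT[v] = tst + w[(v, m)]
    proc_of[v], RT[m] = m, FT[v]
    return m
```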


Simply, the mechanism selects the machine p_m that minimizes the TFT of v_i and duplicates v_i's critical parent in the idle time between v_i and the previous task on p_m, if this time slot is large enough and the duplication will reduce the TST of v_i on p_m. The machine selection trace is shown in Table 2, where the second and third columns indicate the selected task and machine in each step, the fourth column indicates the critical parent task, the fifth column indicates whether the duplication condition is satisfied, and the last two columns indicate the finish time before and after duplication. The machine selection trace preserving the communication heterogeneity is presented in Fig. 5.

3.3. Algorithm complexity analysis

Complexity is usually expressed in terms of the number of nodes v, the number of edges e, and the number of machines p:

• the complexity of computing the AEST and ALST of all nodes is O(e + v),
• the complexity of listing all unlisted parent-trees is O(e + v), and
• the complexity of assigning all task nodes to p machines, with or without the duplication mechanism, is O(pv²).

The total algorithm complexity is therefore O(e + v) + O(pv²). Taking into account that e is O(v²), the total algorithm complexity is O(pv²).

4. Task scheduling in heterogeneous systems

This section briefly overviews the list-scheduling algorithms that we will use for comparison with HCPFD. These algorithms are: Fast Load Balancing (FLB-f) [7],

Table 2
Machine selection trace of the proposed machine assignment mechanism preserving communication heterogeneity

Step   Slct. v   Slct. p   CP   Dup. con. sat.   Finish time
                                                 Before dup.   After dup.
1      1         2         –    –                15            15
2      3         2         1    No               20            20
3      4         2         1    No               39            39
4      7         2         4    No               51            51
5      2         3         1    Yes              46            22
6      6         3         1    No               44            44
7      8         1         4    Yes              74            71
8      5         1         1    No               68            68
9      9         2         5    Yes              83            77
10     10        2         7    No               94            94

Fig. 5. Scheduling of the Fig. 1 application graph preserving communication heterogeneity (makespan = 94).

Heterogeneous Earliest Finish Time (HEFT) [8], and Critical Path on a Processor (CPOP) [8].

4.1. Fast load balancing (FLB-f) algorithm

The FLB-f algorithm [7] utilizes a list called the ready-list, which contains all ready-nodes to be scheduled at each step. A ready-node is defined as a node that has all its parents scheduled. In each step, the execution finish time of each ready-node in the ready-list is computed on all machines, and the node-machine pair that minimizes the earliest execution finish time is selected. The complexity of the FLB-f algorithm is O(pv²).

4.2. Heterogeneous earliest finish time (HEFT) algorithm

The HEFT algorithm [8] has two phases: a task prioritizing phase, where the upward-rank attribute is computed for each task and a task list is generated by sorting the tasks in decreasing order of upward rank, and a machine selection phase, where tasks are selected in order of their priorities and each is scheduled on the machine that minimizes its finish time in an insertion-based manner. The complexity of the HEFT algorithm is O(pv³).

4.3. Critical path on a processor (CPOP) algorithm

The CPOP algorithm [8] also has two phases: task prioritizing and machine selection. In the task prioritizing phase, two attributes are used: the upward and downward ranks. Each task is labeled with the sum of its upward and downward ranks. A priority queue is used to maintain the ready tasks at a given instant (initially, it contains the entry node). At any instant, the ready task with the highest priority is selected for machine selection. In the machine selection phase, CPOP defines the critical-path machine as the machine that minimizes the cumulative computation cost of the tasks on the critical path. If the selected task is on the critical path, it is assigned to the critical-path machine; otherwise, it is assigned to the machine that minimizes its finish time. The complexity of the CPOP algorithm is O(pv²).
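As a reference point for the prioritization phases of HEFT and CPOP, here is a minimal sketch of the upward-rank attribute, rank_u(v) = w̄_v + max over children u of (c̄_{v,u} + rank_u(u)); the definition follows [8], while the dictionary inputs are our assumption.

```python
def upward_ranks(children, w_bar, c_bar):
    rank = {}
    def rank_u(v):  # computed bottom-up from the exit task (rank memoized)
        if v not in rank:
            rank[v] = w_bar[v] + max((c_bar[(v, u)] + rank_u(u)
                                      for u in children[v]), default=0.0)
        return rank[v]
    for v in children:
        rank_u(v)
    return rank  # HEFT schedules tasks in decreasing rank_u order
```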

5. Experimental results and discussion

This section presents a performance comparison of the HCPFD algorithm with the algorithms presented above. For this purpose, we consider two sets of graphs as the workload: randomly generated application graphs, and graphs that represent real-world numerical problems. Several metrics were used for the performance evaluation. To compare the performance of HCPFD with the other algorithms, homogeneous communication is assumed and the average communication cost is used as the communication cost. As an example, the schedule produced by each algorithm for the application graph in Fig. 1 is given in Fig. 6, where the gray tasks in the HCPFD schedule are the duplicated tasks.

The performance of the algorithms was compared with respect to graph size and with respect to the number of available machines. Sixteen (16) machines with different capabilities were used as the target computing system in the evaluation with respect to graph size, and a set p ∈ {4, 8, 16, 32} of machines was used in the evaluation with respect to the number of available machines.

5.1. Comparison metrics

The comparisons of the algorithms are based on the following metrics.

5.1.1. Makespan
The makespan, or scheduling length, is defined as

    makespan = FT(v_exit),

where FT(v_exit) is the finish time of the scheduled exit task.


Fig. 6. Scheduling of the task graph in Fig. 1 with: (a) HCPFD assuming homogeneous communication (makespan = 79), (b) FLB-f (makespan = 91), (c) HEFT (makespan = 91), and (d) CPOP (makespan = 97).

5.1.2. Scheduling length ratio (SLR)
The main performance measure is the makespan. Since a large set of application graphs with different properties is used, it is necessary to normalize the schedule length to a lower bound, which is called the Schedule Length Ratio (SLR). The SLR is defined as

    SLR = \frac{makespan}{\sum_{v_i \in CT} \min_{p_j \in P} \{ w_{i,j} \}}.

The denominator is the sum of the minimum computation costs of the critical tasks. We utilized average SLR values over a number of generated task graphs in our experiments.

5.1.3. Quality results of schedules
The percentage of times that an algorithm produced a better, worse, or equal quality of schedule compared to every other algorithm is counted in the experiments.

5.2. Randomly generated application graphs

A random graph generator was implemented to generate application graphs with various characteristics (a sketch of such a generator follows the parameter list below). The generator requires the following input parameters:

• number of tasks in the graph, v,
• number of graph levels, GL,
• heterogeneity factor b, where \bar{w}_i is generated randomly for each task and w_{i,j} = b \cdot \bar{w}_i for all v_i on p_j, using a b selected randomly from the b set for each p_j,


• communication-to-computation ratio CCR, defined as the ratio of the average communication cost, \frac{1}{e} \sum_{e_{i,j} \in E} \bar{c}_{i,j}, to the average computation cost, \frac{1}{v} \sum_{i=1}^{v} \bar{w}_i.

In all experiments, only graphs with a single entry node and a single exit node were considered. The input parameters were restricted to the following values: v ∈ {20, 40, 60, 80, 100, 120}, 0.2v ≤ GL ≤ 0.8v, b ∈ {0.5, 1.0, 2.0}, and CCR ∈ {0.5, 1.0, 2.0}. We generated sets of 10,000 application graphs with randomly selected GL, b and CCR for each v from the above list. The average SLR values for each set of application graphs (with a selected v and random GL, b and CCR) are shown in Fig. 7. The quality of the results is given in Table 3.

For each number of available machines, a set of 3000 application graphs was generated for each v ∈ {40, 80, 120} in the second experiment. The CCR, GL and b were selected randomly from the above sets for each graph. The average SLR over the 9000 generated graphs for each number of available machines is shown in Fig. 8. The quality of the results is given in Table 4.
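The following is a hedged sketch of such a generator under our own assumptions: the layer-assignment scheme, the edge density (at least one parent and one child per task), and the w_mean scale are ours, not the paper's, and it requires GL ≥ 3.

```python
import random

def random_dag(v, p, gl, b_set, ccr, w_mean=20.0):
    # place tasks on levels: task 0 is the single entry, task v-1 the single exit
    level = {0: 0, v - 1: gl - 1}
    for t in range(1, v - 1):
        level[t] = random.randrange(1, gl - 1)          # requires gl >= 3
    w_bar = {t: random.uniform(1.0, 2.0 * w_mean) for t in range(v)}
    # per-machine costs: w[i][j] = b * w_bar[i] with b drawn from the b set
    W = {t: [random.choice(b_set) * w_bar[t] for _ in range(p)] for t in range(v)}
    edges = set()
    for t in range(1, v):       # every non-entry task gets at least one parent
        cands = [u for u in range(v) if level[u] < level[t]]
        edges.add((random.choice(cands), t))
    for t in range(v - 1):      # every non-exit task gets at least one child
        cands = [u for u in range(v) if level[u] > level[t]]
        edges.add((t, random.choice(cands)))
    # draw communication costs so that average(c) / average(w) ~ CCR
    c_mean = ccr * (sum(w_bar.values()) / v)
    c = {e: random.uniform(0.0, 2.0 * c_mean) for e in edges}
    return W, c, edges
```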

Fig. 7. Average SLR of randomly generated graphs with respect to graph size.

Table 3
Results quality of randomly generated graphs with respect to graph size

HCPFD vs.   FLB-f     HEFT      CPOP
Better      74.37%    83.96%    95.23%
Equal       3.3%      2.83%     1.89%
Worse       22.33%    13.22%    2.88%


Fig. 8. Average SLR of randomly generated graphs with respect to number of available machines.

Table 4
Results quality of randomly generated graphs with respect to number of available machines

HCPFD vs.   FLB-f     HEFT      CPOP
Better      72.04%    80.72%    93.55%
Equal       2.31%     1.73%     1.23%
Worse       25.65%    17.56%    5.21%

5.3. Applications

Finally, we compared the performance of the scheduling algorithms on two real-world applications: Gaussian elimination and a Molecular Dynamic Code.

5.3.1. Gaussian elimination
In Gaussian elimination the structure of the application graph is fixed by the algorithm: the number of tasks v and the number of graph levels GL depend on the matrix size m (a hedged sketch of this graph appears after Table 5). In the performance evaluation with respect to matrix size, only CCR and b were selected randomly for each m. Fig. 9 presents the average SLR of 1000 generated graphs for each matrix size m ∈ {4, 8, 16}; the quality of the results is given in Table 5.

For each number of available machines, a set of 1000 application graphs was generated for m = 16. The CCR and b were selected randomly from the above sets for each graph. The average SLR for each number of available machines is shown in Fig. 10. The quality of the results is given in Table 6.

5.3.2. Molecular Dynamic Code
The Molecular Dynamic Code [19] is an irregular application with a fixed number of tasks (v = 41) and a known graph structure. Hence, we varied only CCR and b in our experiments. For each CCR from the above set, 1000 graphs were generated. The average SLR for each CCR is shown in Fig. 11. Table 7 indicates the quality of the schedules.


Fig. 9. Average SLR of Gaussian elimination with respect to graph size.

Table 5
Results quality of Gaussian elimination with respect to graph size

HCPFD vs.   FLB-f     HEFT      CPOP
Better      91.83%    83.27%    90.13%
Equal       5.03%     12.53%    6.6%
Worse       3.13%     4.2%      3.27%

Fig. 10. Average SLR of Gaussian elimination with respect to number of available machines.

Table 6
Results quality of Gaussian elimination with respect to number of available machines

HCPFD vs.   FLB-f     HEFT      CPOP
Better      94.65%    75.38%    92.25%
Equal       0.13%     2.75%     0.78%
Worse       5.23%     21.88%    6.98%


Fig. 11. Average SLR of Molecular Dynamic Code with respect to CCR.

Table 7
Schedule quality of Molecular Dynamic Code graphs with respect to CCR

HCPFD vs.   FLB-f     HEFT      CPOP
Better      92.07%    90.67%    96.27%
Equal       0.63%     1.03%     0.33%
Worse       7.3%      8.3%      3.4%

Fig. 12. Average SLR of Molecular Dynamic Code with respect to number of available machines.

Table 8
Schedule quality of Molecular Dynamic Code graphs with respect to number of available machines

HCPFD vs.   FLB-f     HEFT      CPOP
Better      83.23%    78%       94.28%
Equal       0.75%     0.9%      0.23%
Worse       16.03%    21.1%     5.5%


For each number of available machines, a set of 10,000 application graphs was generated. The CCR and b were selected randomly from the above CCR and b sets for each graph. The average SLR for each number of available machines is shown in Fig. 12. Table 8 indicates the quality of the schedules.

6. Conclusion

In this paper we presented the HCPFD algorithm for scheduling tasks onto a bounded number of computing machines with different capabilities. The algorithm consists of two mechanisms: a simple listing mechanism that replaces the classical prioritization phase of list-scheduling, and a fast machine assignment mechanism based on task-duplication. In an experimental study using a large set of randomly generated application graphs with various characteristics and application graphs of real-world problems (Gaussian elimination and Molecular Dynamic Code), HCPFD outperformed the other algorithms in terms of both performance and complexity.

References

[1] J.G. Webster, Heterogeneous distributed computing, in: Encyclopedia of Electrical and Electronics Engineering, vol. 8, 1999, pp. 679–690.
[2] D. Feitelson, L. Rudolph, U. Schwiegelshohn, K. Sevcik, P. Wong, Theory and practice in parallel job scheduling, in: Proc. JSSPP, 1997, pp. 1–34.
[3] Y. Kwok, I. Ahmad, Benchmarking the task graph scheduling algorithms, in: Proc. IPPS/SPDP, 1998.
[4] J. Liou, M. Palis, A comparison of general approaches to multiprocessor scheduling, in: Proc. Int. Parallel Processing Symp., 1997, pp. 152–156.
[5] A. Khan, C. McCreary, M. Jones, A comparison of multiprocessor scheduling heuristics, ICPP 2 (1994) 243–250.
[6] A. Gerasoulis, T. Yang, A comparison of clustering heuristics for scheduling DAGs on multiprocessors, Journal of Parallel and Distributed Computing 16 (1992) 276–291.
[7] A. Radulescu, A. van Gemund, Fast and effective task scheduling in heterogeneous systems, in: Proc. 9th Heterogeneous Computing Workshop, 2000, pp. 229–238.
[8] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Trans. Parallel Distributed Syst. 13 (3) (2002) 260–274.
[9] G. Sih, E. Lee, A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures, IEEE Trans. Parallel Distributed Syst. 4 (1993) 75–87.
[10] T. Hagras, J. Janeček, A high performance, low complexity algorithm for compile-time job scheduling in homogeneous computing environments, in: Proc. Int. Conf. Parallel Processing Workshops (ICPP'03), October 2003, pp. 149–155.
[11] K. Taura, A. Chien, A heuristic algorithm for mapping communicating tasks on heterogeneous resources, in: Proc. Heterogeneous Computing Workshop, 2000, pp. 102–118.
[12] M.-Y. Wu, D. Gajski, Hypertool: a programming aid for message-passing systems, IEEE Trans. Parallel Distributed Syst. 1 (1990).
[13] Y. Kwok, I. Ahmad, Dynamic critical-path scheduling: an effective technique for allocating task graphs to multiprocessors, IEEE Trans. Parallel Distributed Syst. 7 (1996) 506–521.
[14] I. Ahmad, Y. Kwok, A new approach to scheduling parallel programs using task duplication, in: Proc. Int. Conf. Parallel Processing, vol. 2, 1994, pp. 47–51.


[15] S. Darbha, D. Agrawal, A fast and scalable scheduling algorithm for distributed memory systems, in: Proc. Symp. on Parallel and Distributed Processing, 1995, pp. 60–63.
[16] I. Ahmad, Y. Kwok, A comparison of task-duplication-based algorithms for scheduling parallel programs to message-passing systems, in: Proc. 11th Int. Symp. on High-Performance Computing Systems, 1997, pp. 39–50.
[17] G. Park, B. Shirazi, J. Marquis, DFRN: a new approach for duplication based scheduling for distributed memory multiprocessor systems, in: Proc. Int. Conf. Parallel Processing, 1997, pp. 157–166.
[18] O.-H. Kang, D.P. Agrawal, S3MP: a task duplication based scalable scheduling algorithm for symmetric multiprocessors, in: Proc. IPDPS 2000, 2000, pp. 451–456.
[19] S. Kim, J. Browne, A general approach to mapping of parallel computation upon multiprocessor architectures, in: Proc. Int. Conf. Parallel Processing, vol. 2, 1988, pp. 1–8.