Task scheduling using Bayesian optimization algorithm for heterogeneous computing environments


Applied Soft Computing 11 (2011) 3297–3310


Task scheduling using Bayesian optimization algorithm for heterogeneous computing environments

Jiadong Yang a, Hua Xu a,∗, Li Pan b, Peifa Jia a, Fei Long b, Ming Jie b

a State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
b Emerging Automation, Corporate Technology, SIEMENS Ltd., 7 Zhonghuan Nanlu, Wangjing, Chaoyang District, Beijing 100102, China

Article info

Article history: Received 13 October 2009; Received in revised form 17 November 2010; Accepted 30 November 2010; Available online 16 December 2010.

Keywords: Multiprocessor scheduling; Heterogeneous; Parallel computing; Bayesian optimization algorithm

Abstract

Efficient task scheduling, a crucial step toward high performance on multiprocessor platforms, remains a challenging problem despite numerous studies. This paper presents a novel scheduling algorithm based on the Bayesian optimization algorithm (BOA) for heterogeneous computing environments. In the proposed algorithm, scheduling is divided into two phases. First, according to the task graph of the multiprocessor scheduling problem, a Bayesian network is initialized and learned to capture the dependencies between tasks, and promising solutions assigning tasks to processors are generated by sampling the network. Second, the execution sequence of the tasks on each processor is set by the heuristic-based priority used in the list scheduling approach. The proposed algorithm is evaluated against related approaches in empirical studies on random task graphs and benchmark applications. The experimental results show that the proposed algorithm delivers more efficient schedules, and further experiments indicate that it maintains almost the same performance under different parameter settings. © 2010 Elsevier B.V. All rights reserved.

1. Introduction

Efficiently scheduling the tasks of a multiprocessor application [1] is one of the critical issues for heterogeneous computing systems, in which resources with different computing abilities are interconnected. The general form of the multiprocessor scheduling problem is to assign tasks, partitioned from a large program, to different processors and to determine their execution sequences so that the precedence relationships among tasks are satisfied and the overall time required to execute the entire application, known as the makespan, is minimized. The multiprocessor scheduling problem is NP-complete in the general case as well as in some restricted cases [2]. A large set of approaches has been proposed to trade off complexity against performance. According to the rationale behind the solving process, scheduling approaches can be divided into two main categories: heuristic-based approaches and stochastic optimization techniques [3]. The former solve the problem efficiently, but the resulting performance is not always acceptable; the latter produce more competitive solutions at the cost of higher time complexity [4].

∗ Corresponding author. E-mail address: [email protected] (H. Xu). 1568-4946/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.asoc.2010.11.029

In order to take advantage of both kinds of methods mentioned above while avoiding their drawbacks, this paper presents a novel multiprocessor scheduling algorithm for heterogeneous computing environments based on the Bayesian optimization algorithm (BOA) [5,6]. The proposed algorithm divides scheduling into two phases. First, Bayesian networks [7] are initialized and learned according to the task graph of the multiprocessor scheduling problem to find the optimal assignment of tasks to processors. Then the execution sequence of the tasks on each processor is determined by the heuristic-based priority used in the list scheduling approach. A comparison study on random and benchmark applications shows that the proposed algorithm significantly outperforms the related ones, and further experiments demonstrate that it maintains almost the same performance under different parameter settings. The remainder of this paper is organized as follows. Section 2 reviews background material on the multiprocessor scheduling problem. Section 3 introduces related work on the multiprocessor scheduling problem and BOA. Section 4 describes the proposed algorithm in detail. The performance of the proposed algorithm is evaluated on random task graphs and well-known applications in Sections 5 and 6, respectively. Section 7 analyzes the sensitivity of the performance to the key parameters, and Section 8 discusses the merits of the proposed algorithm. Finally, the paper is summarized in Section 9.


[Fig. 1(a): a simple task graph with nine tasks (1–9); edge labels give the communication costs. The graph structure itself is not recoverable from the extracted text.]

Computation costs of each task on processors P1–P4 (Fig. 1(b)):

Task   P1   P2   P3   P4
1      14   16   14   17
2      10   12   12   14
3      17   19   15   16
4      20   22   18   22
5      21   18   16   20
6      12   14   14   16
7      16   16   20   17
8      19   17   20   18
9      14   15   16   16

Fig. 1. An example of multiprocessor scheduling problem. (a) A simple task graph with communication costs and (b) computation costs.

2. Scheduling problem

In the multiprocessor task scheduling problem, the application to be scheduled is usually represented by a directed acyclic graph (DAG), also called a task graph or macro-dataflow graph. A DAG G = (V, E) consists of a set V of v nodes and a set E of e edges. In general, the nodes represent tasks partitioned from the application, and the edges represent precedence constraints between dependent tasks. Specifically, each edge ei,j ∈ E between tasks ni and nj means that the result of task ni has to be transmitted to task nj before task nj can start its execution. A task without any predecessors is called an entry task, and a task without any successors is called an exit task.

The positive weight wi associated with each task ni ∈ V represents its computation cost. Since the processors in a heterogeneous computing environment differ in computing ability, the computation cost of task ni on processor pj is denoted wi,j and its average computation cost is denoted w̄i. The nonnegative weight ci,j associated with edge ei,j ∈ E represents the communication cost between the dependent tasks ni and nj. Note that the communication cost is only incurred when dependent tasks are assigned to different processors; the real communication cost is zero when the tasks are assigned to the same processor. Here, it is assumed that the communication network is fully connected and that there is no contention in the links.

Assume that a task graph with T tasks is to be assigned to a multiprocessor platform with P processors. For task ni, its earliest start time Tstart(ni, pj) on processor pj is defined as

Tstart(ni, pj) = max{Tfree(pj), Tready(ni, pj)}

(1)

where Tfree(pj) is the time when processor pj is free for task ni to be executed. Most of the time, Tfree(pj) is the time when processor pj completes the execution of its last assigned task. Sometimes, however, task ni can be inserted into an idle time slot between two already-scheduled tasks on processor pj. In other words, when the length of an idle time slot (i.e., the difference between the execution start time and finish time of two tasks consecutively scheduled on processor pj) is larger than wi, Tfree(pj) may be earlier than the finish time of the last task assigned to the processor. In Eq. (1), Tready(ni, pj) denotes the time when all the data needed by task ni have arrived at processor pj, which is defined as

Tready(ni, pj) = max_{nk ∈ pred(ni)} {Tfinish(nk) + ck,i}

(2)

where Tfinish(nk) is the time when task nk actually finishes its execution and pred(ni) denotes the set of immediate predecessors of task ni. In nonpreemptive environments, the earliest finish time Tfinish(ni, pj) of task ni on processor pj is defined as

Tfinish(ni, pj) = Tstart(ni, pj) + wi,j

(3)

When task ni is determined to be scheduled on processor pj, its earliest start time and earliest finish time on pj become the actual start time and actual finish time of the task, respectively (i.e., Tstart(ni) = Tstart(ni, pj) and Tfinish(ni) = Tfinish(ni, pj)). The total schedule length of the entire application, also called the makespan, is the actual finish time of the exit task, which is the largest finish time among all tasks; see Eq. (4).

makespan = max{Tfinish(nexit)}

(4)
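To make Eqs. (1)–(4) concrete, the following sketch computes start and finish times for a given assignment and execution order; the task names and costs are hypothetical, and the idle-slot insertion discussed under Eq. (1) is omitted for brevity:

```python
# Illustrative sketch of Eqs. (1)-(4) (simplified: no idle-slot insertion).
# Task names, costs, and the schedule below are hypothetical, not from Fig. 1.

def makespan(tasks, preds, w, c, proc_of, order):
    """preds[t]: immediate predecessors of t; w[t][p]: computation cost of
    t on processor p; c[(u, t)]: communication cost of edge u -> t;
    proc_of[t]: processor assigned to t; order: chosen execution order."""
    free = {}      # T_free(p): when each processor becomes free
    finish = {}    # T_finish(t): actual finish time of each task
    for t in order:
        p = proc_of[t]
        # Eq. (2): data from predecessors; cost is zero on the same processor
        ready = max((finish[u] + (0 if proc_of[u] == p else c[(u, t)])
                     for u in preds[t]), default=0)
        start = max(free.get(p, 0), ready)          # Eq. (1)
        finish[t] = start + w[t][p]                 # Eq. (3)
        free[p] = finish[t]
    return max(finish.values())                     # Eq. (4)

# Tiny example: t0 -> t1 with communication cost 5, on two processors
w = {"t0": {0: 4, 1: 6}, "t1": {0: 3, 1: 2}}
preds = {"t0": [], "t1": ["t0"]}
c = {("t0", "t1"): 5}
print(makespan(["t0", "t1"], preds, w, c, {"t0": 0, "t1": 1}, ["t0", "t1"]))
# -> 11: t0 finishes at 4 on P1; t1 starts at 4 + 5 = 9 on P2, finishes at 11
```

Note that placing both tasks on the same processor would give a makespan of 7 here, since the communication cost of Eq. (2) then vanishes.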

Therefore, the aim of the task scheduling problem is to allocate a set of tasks to a set of processors so as to minimize the makespan without violating the precedence constraints. The efficiency of a schedule is measured by its speedup, which is calculated as follows:



speedup = min_j { Σ_i wi,j } / makespan    (5)

where the numerator denotes the sequential execution time. It is computed by assigning all tasks to the single processor that minimizes the cumulative execution time. Given the description above, a simple task graph with nine nodes and the average communication costs between them is shown in Fig. 1(a), and the computation costs of each node are detailed in Fig. 1(b). One schedule for this problem is given in Fig. 2; it requires a makespan of 106 with four processors available, and the speedup is equal to 1.35.

Fig. 2. A schedule for the multiprocessor scheduling problem given in Fig. 1.

3. Related work

3.1. Scheduling approach

At the highest level, the multiprocessor scheduling problem can be divided into two classes: the static scheduling problem and the dynamic scheduling problem. In the former, all information about the tasks and their dependencies, such as execution times, data dependencies and communication costs between dependent tasks, is known as constant before scheduling. Conversely, in the dynamic scheduling problem, some information about tasks and their relations may be undeterminable until runtime. In this article, only the static scheduling problem is addressed. Generally, algorithms solving the static multiprocessor scheduling problem fall into two main categories [3]: deterministic approaches, which are based on hard-coded rules extracted from developers' intuition to find schedules straightforwardly, and non-deterministic approaches, which are guided by random search-based algorithms. The deterministic approaches can be further classified into three categories: list scheduling algorithms [3,8], clustering scheduling algorithms [9], and task duplication algorithms [10]. Among list scheduling algorithms, HEFT [3] is one of the most representative. It leverages the average execution and communication costs to calculate the priority of each task. During the scheduling process, tasks are selected one by one in decreasing order of their priorities, and in each step the selected task is assigned to the processor that minimizes the makespan so far. In clustering scheduling algorithms, each task is initially regarded as a cluster; clusters are merged repeatedly as long as merging shortens the makespan or the number of clusters exceeds that of processors. Finally, the tasks in each cluster are assigned to the same processor and ordered according to their precedence constraints. Duplication-based scheduling allows the same task to be executed on multiple processors; replicating tasks eliminates some communication among the processors, which shortens the total makespan. Algorithms based on the heuristics mentioned above can schedule tasks with high efficiency, as they seek feasible solutions in a considerably small portion of the search space. However, the performance of these algorithms is heavily dependent on the effectiveness of the heuristics.
Therefore, they are unlikely to produce consistent results on a wide range of problems, especially in heterogeneous computing environments. Contrary to deterministic algorithms, non-deterministic algorithms explore the problem space stochastically to a certain extent. During the solving process, these methods combine the knowledge gained from previous search results with randomizing features to generate new and better results. In this group, genetic algorithms (GAs) [11–13] are the most popular techniques; simulated annealing (SA) [14] and artificial immune systems (AIS) [15] have been adopted to solve scheduling problems as well. It has been shown that these approaches are able to deliver schedules of better quality; however, their computational costs are usually much larger than those of the heuristic-based techniques. There also exist some studies combining deterministic and non-deterministic algorithms for the multiprocessor scheduling problem, which are similar to our work. Ref. [16] uses GAs within list scheduling techniques to evolve the priority of each task, and


Ref. [17] incorporates AIS with a list scheduling heuristic to schedule tasks for heterogeneous computing environments. Different from these algorithms, the proposed one incorporates BOA, one of the estimation of distribution algorithms, with a list scheduling heuristic to deliver scheduling results. Additionally, estimation of distribution algorithms (EDAs) [18], which have recently been identified as novel paradigms in the field of GAs, have also been applied to different types of scheduling problems: [19–21] focused on the job shop scheduling problem, [22] tackled the nurse scheduling problem, and [23] solved the curriculum scheduling problem. However, to the best of our knowledge, none is dedicated to our target problem, multiprocessor scheduling within heterogeneous computing environments. In fact, the target problem differs from the ones mentioned above in both the formal model and the scheduling approach.

3.2. Bayesian optimization algorithm

The Bayesian optimization algorithm (BOA) [5,6], one of the state-of-the-art estimation of distribution algorithms (EDAs) [18], combines Bayesian networks [7] and evolutionary algorithms to solve nearly decomposable problems. To overcome the disruption of building blocks in traditional GAs, it explicitly extracts global statistical information from the promising solutions found so far and models it with Bayesian networks; new candidate solutions are then generated by sampling the built models. Since the Bayesian networks capture the dependencies among variables at each iteration, BOA is able to efficiently and robustly solve hard optimization and search problems in a variety of domains. More detailed information about BOA can be found in Ref. [6].

4. The proposed algorithm

4.1. Framework

As the computation costs of each task on different processors in the heterogeneous multiprocessor scheduling problem are not the same, the efficiency of a schedule depends on two independent factors: task assignment (i.e., mapping tasks to different processors) and task ordering (i.e., setting the execution sequence of tasks on each processor). In the proposed algorithm, BOA is responsible for assigning tasks, and a heuristic-based priority determines the execution sequence of the tasks. The total makespan is then calculated and serves as the fitness to evaluate each candidate assignment in BOA. The procedure of the proposed algorithm is described in Algorithm 1. In the subsections below, the algorithm is described in detail. Algorithm 1.

Procedure of the proposed algorithm

1  Generate initial population of assignments at random
2  Build a Bayesian network according to the task graph
3  for all assignments in population do
4      Determine the task execution sequence according to heuristic-based task priority
5      Calculate the total makespan
6  end for
7  repeat
8      Select population of promising assignments
9      Learn Bayesian network
10     Sample Bayesian network to generate new population
11     for all assignments in new population do
12         Determine the task execution sequence according to heuristic-based task priority
13         Calculate the total makespan
14     end for
15     Incorporate new population into the current one
16 until the average makespan of the population is not improved
17 Select the best solution in the final generation as the result
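A skeleton of Algorithm 1 might look as follows. For brevity, the learned model is reduced to independent per-task frequencies (a univariate stand-in for the full Bayesian-network learning and sampling of Sections 4.3–4.5), and `evaluate` stands for the makespan calculation of Algorithm 2; both are assumptions for illustration only:

```python
import random

# Simplified skeleton of Algorithm 1. The "model" here is per-task
# processor frequencies rather than a full Bayesian network; `evaluate`
# maps an assignment vector to its makespan (smaller is better).

def boa_schedule(n_tasks, n_procs, evaluate, pop_size=50, max_iters=100):
    pop = [[random.randrange(n_procs) for _ in range(n_tasks)]
           for _ in range(pop_size)]
    best_avg = float("inf")
    for _ in range(max_iters):
        pop.sort(key=evaluate)                       # smaller makespan first
        promising = pop[: pop_size // 2]             # truncation selection
        # "Learn" the model: frequency of each processor for each task
        model = [[sum(ind[i] == p for ind in promising) + 1   # +1 smoothing
                  for p in range(n_procs)] for i in range(n_tasks)]
        # Sample new assignments from the model
        offspring = [[random.choices(range(n_procs), weights=model[i])[0]
                      for i in range(n_tasks)] for _ in range(pop_size // 2)]
        pop = promising + offspring                  # incorporate new population
        avg = sum(map(evaluate, pop)) / len(pop)
        if avg >= best_avg:                          # average not improved: stop
            break
        best_avg = avg
    return min(pop, key=evaluate)
```

With `evaluate` bound to a makespan routine such as Algorithm 2, `boa_schedule(n, p, evaluate)` returns the best assignment found in the final generation.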



Fig. 3. Bayesian networks initialization.

4.2. Encoding of solutions

The population in BOA consists of a group of solutions, which represent candidate assignments for the scheduling problem. Each solution is encoded as a string of integers. Suppose the problem has to schedule n tasks on m available processors and each task has a unique index (from t0 to tn−1). Each solution can then be represented by n integer variables (i.e., p(t0), p(t1), ..., p(tn−1)), each taking a value from 0 to m − 1; the i-th variable gives the processor to which task ti is assigned. Consequently, the search space of the algorithm has size m^n.

4.3. Bayesian networks initialization

Bayesian networks [7] are directed acyclic graphical models in which each node corresponds to one variable of the candidate solutions and arcs between nodes represent probabilistic dependencies. As mentioned in Section 4.2, each variable in a solution indicates the index of the processor to which the corresponding task is assigned; hence, each node in the Bayesian network corresponds to a unique task of the multiprocessor scheduling problem. On the other hand, the application in the scheduling problem is usually described as a directed acyclic graph (see Section 2), in which nodes represent tasks partitioned from the application and directed edges specify both precedence constraints and communication paths between tasks (see Fig. 1). Therefore, there is a one-to-one correspondence between the nodes in the Bayesian network and those in the task graph. Given the discussion above, the number of nodes in the Bayesian network is equal to the number of tasks in the scheduling problem. Furthermore, as edges in the task graph indicate precedence constraints between tasks, it is reasonable to add dependencies between the corresponding nodes in the Bayesian network. Consequently, the initial Bayesian network is constructed following the structure of the task graph (see Fig. 3). 4.4.
Learning Bayesian networks

A Bayesian network consists of two components: a directed graph representing the variables and their dependencies, and a set of conditional probability tables (CPTs) specifying the conditional probability distribution of each variable. Consequently, learning Bayesian networks includes two corresponding subtasks: learning the structure and learning the CPT of each variable. As mentioned in Section 4.3, the structure of the initial Bayesian network is consistent with that of the task graph; such a structure reflects all dependencies induced by the predecessor relationship. Instead of performing a greedy strategy on a network without any edges [5], it is sufficient to consider only the dependencies produced by the sibling relationship. More specifically, if two nodes in the task graph are siblings, the algorithm tentatively adds a corresponding directed edge to the Bayesian network (see Fig. 4) and calculates a scoring metric measuring the quality of the resulting structure. If the addition improves the quality of the current network, the directed edge is actually added. The process continues until all sibling node pairs have been tested. There are a variety of score


Fig. 4. Structural learning.

metrics based on different assumptions [24]. Among them, the two-part MDL metric known as the Bayesian information criterion (BIC) [25] and the Bayesian–Dirichlet (BD) metric [26] are the most used. It has been pointed out that MDL metrics favor simple models, so no unnecessary dependencies need to be considered; however, MDL metrics often result in overly simple models and require large populations to learn a model that captures all necessary dependencies. Consequently, the BD metric is adopted as the scoring metric in this paper. Since there is no prior information about the parameters when the proposed algorithm deals with scheduling problems, we employ the BD metric with uniform priors [26], which can be expressed as

BD(B) = p(B) ∏_{i=1}^{n} ∏_{j=1}^{qi} [ (ri − 1)! / (Nij + ri − 1)! ] ∏_{k=1}^{ri} Nijk!    (6)

where Nijk is the number of cases in which variable xi takes its k-th value (k = 1, 2, ..., ri) while its parents Πi are instantiated to their j-th value (j = 1, 2, ..., qi), and Nij = Σ_{k=1}^{ri} Nijk.

According to the discussion in Section 4.3, the Bayesian network is initialized according to the task graph of the scheduling problem, and only edges between sibling nodes are candidates for addition during structure learning. Consequently, the maximum number of incoming edges into each node of the network is a constant determined by the task graph. Therefore, the time to construct the network using the greedy algorithm with the BD metric is O(n^3 + n^2 N) [27], where N is the population size and n is the number of variables in the problem (i.e., the number of tasks in the scheduling problem). There are two strategies to implement structure learning: updating the network structure in every iteration, or performing a structural update only once every m iterations. Compared with the first choice, the latter significantly improves the overall efficiency. Furthermore, as structure learning always starts from the initial network instead of an empty one, the network structure may change very little, and more information is required to add edges to the model accurately. Accordingly, the proposed algorithm adopts the latter strategy. Learning the CPT of each variable is relatively simple, as the value of each variable in the population of promising solutions is specified: given each possible configuration of its parents, the conditional probability of each value of a node is obtained by counting the number of times that value appears.

4.5. Sampling of Bayesian network

Once the structure and CPTs of a Bayesian network have been learned, new candidate solutions are generated by sampling the network.
In this paper, the probabilistic logic sampling of Bayesian networks [28] is adopted, which includes two steps. The first step computes an ancestral ordering of the nodes, where each node is preceded by its parents. Then, the values of all variables in a new candidate solution are generated based on the corresponding CPTs in the network under the ordering computed in the first step. The time complexity of generating an instance of all variables is then

J. Yang et al. / Applied Soft Computing 11 (2011) 3297–3310

Bayesian network structure (Fig. 5(a)): X1 → X3, X1 → X4, X2 → X4.

Conditional probability table (Fig. 5(b)):

X1  X2   P(X3 = 1|X1)   P(X4 = 1|X1, X2)
0   0    0.30           0.20
0   1    0.30           0.25
1   0    0.40           0.45
1   1    0.40           0.35

Fig. 5. An example of mutation in sampling Bayesian networks. Before mutation on X2 : X1 = 1, X2 = 1, P(X3 = 1) = 0.40, P(X4 = 1) = 0.35. After mutation on X2 : X1 = 1, X2 = 0, P(X3 = 1) = 0.40, P(X4 = 1) = 0.45. (a) Bayesian network structure and (b) conditional probability table.

bounded by O(n), where n is the number of tasks in the scheduling problem [27]. To generate offspring, sampling from Bayesian networks leverages the global statistical information extracted from the current promising solutions; however, the information about the locations of the solutions found so far is not used directly. In order to make full use of the local information of the solutions found so far, while still exploiting the dependencies between variables modeled by the Bayesian network, a mutation operator is introduced to complement probabilistic logic sampling. If each variable were mutated after the entire solution is generated, as in the standard mutation operator, the probabilistic distribution would be disrupted because the dependencies between variables are ignored. Instead, with the aim of keeping the dependencies, each variable in the proposed algorithm is mutated immediately after its value is generated, given the mutation rate, and the successors of that variable are produced according to its actual value after the mutation. Assume there are four binary variables whose dependencies and conditional probabilities are given in Fig. 5. If both X1 and X2 take value 1, then P(X3 = 1) and P(X4 = 1) are equal to 0.40 and 0.35, respectively. When X2 is mutated and X1 is not, the mutation affects variable X4 (i.e., P(X4 = 1) = 0.45), while X3 remains unchanged. In this way, sampling integrated with mutation produces solutions that can hopefully fall in or around a promising region characterized by the Bayesian network; the dependencies modeled by the network are maintained throughout the search, and the efficiency of the evolution is improved.

4.6. Makespan calculation according to heuristic-based priority

After the tasks are assigned, it is necessary to determine the execution sequence of the tasks on each processor; the total makespan is then calculated to evaluate the fitness of each assignment. The list scheduling approach is applied here.
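The per-variable mutation of Section 4.5 can be sketched as follows, assuming the ancestral order, the parent sets, and the learned CPTs are given (all names here are illustrative):

```python
import random

# Probabilistic logic sampling with in-line mutation: variables are
# generated in ancestral order, and each value may be mutated immediately,
# so descendants are sampled conditioned on the *mutated* value.

def sample_with_mutation(order, parents, cpt, m, mutation_rate=0.05):
    values = {}
    for node in order:                       # ancestral order: parents first
        pv = tuple(values[p] for p in parents[node])
        weights = [cpt[node][(pv, v)] for v in range(m)]
        v = random.choices(range(m), weights=weights)[0]
        if random.random() < mutation_rate:  # mutate *before* sampling children
            v = random.randrange(m)
        values[node] = v
    return values
```

Mutating a node before its children are sampled is exactly what preserves the dependencies in the example of Fig. 5: a mutated X2 changes the distribution used for X4, while X3 is untouched.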
More specifically, all tasks are arranged into the execution queue one by one; in each step, the highest-priority task among the ready tasks is scheduled. A task is called ready if it has no predecessors or if all its predecessors have already been scheduled. Therefore, how to prioritize the tasks is the key issue in calculating the total makespan. Intuitively, the priority should account for two factors: the completion time of the task itself and the completion time of the tasks that depend on it. Inspired by HEFT [3], the concept of upward rank is introduced to represent the priority of each task. The upward rank rank(ni) of a task ni is defined recursively as follows.



rank(ni) = wi,p(ni),                                              if succ(ni) = ∅
rank(ni) = wi,p(ni) + max_{nj ∈ succ(ni)} {ci,j + rank(nj)},      if succ(ni) ≠ ∅    (7)

where p(ni ) denotes the processor on which task ni is assigned, wi,p(ni ) specifies the execution time of task ni on the processor p(ni ), succ(ni ) is the set of immediate successors of task ni and ci,j denotes

the real communication cost between task ni and task nj , which is defined in Eq. (8):



ci,j = com(ni, nj),   if p(ni) ≠ p(nj)
ci,j = 0,             if p(ni) = p(nj)    (8)

Although both the proposed algorithm and HEFT [3] use the notion of upward rank to determine priorities, there are obvious differences between them. First, the proposed algorithm relies on the execution time on the assigned processor, instead of the average over different processors, to prioritize each task. Second, the proposed algorithm uses the real communication cost between dependent tasks (zero or nonzero), while the upward rank calculation in [3] always assumes that the communication cost exists. Therefore, the priority in the proposed algorithm reflects more accurately how each task influences the total makespan. After the priority (i.e., the rank value) of each task is calculated, the execution sequence of the tasks on each processor is determined according to the priority. In nonincreasing order of priority, the earliest start time of each task is computed one by one. As defined in Eq. (1) (see Section 2), the earliest start time of a task on a certain processor is the later of the processor's free time and the task's ready time on that processor. Since each task ni has been assigned to a specific processor p(ni) before prioritization, the earliest start time of task ni on processor p(ni) is its actual start time, i.e., Tstart(ni) = Tstart(ni, p(ni)). Then we immediately obtain the actual finish time of task ni, i.e., Tfinish(ni) = Tstart(ni) + wi,p(ni). Consequently, the makespan is computed according to Eq. (4) in Section 2. The procedure of makespan calculation is described in Algorithm 2. Algorithm 2.
Makespan calculation according to heuristic-based priority

1 Compute the rank value of all tasks upwardly, according to the computation and communication costs determined by the assignment
2 Sort all tasks into a list in nonincreasing order of rank value
3 repeat
4    Select and remove the first task ti in the list
5    Compute the earliest start time of ti on its assigned processor
6    Compute the finish time of ti
7 until there are no tasks left in the list
8 Set the makespan to the maximum finish time over all tasks
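A minimal sketch of Algorithm 2, combining the upward rank of Eq. (7) with the timing rules of Section 2 (the toy graph and costs are hypothetical, and idle-slot insertion is omitted):

```python
from functools import lru_cache

# Sketch of Algorithm 2: prioritize tasks by the assignment-aware upward
# rank of Eq. (7), then compute start/finish times in nonincreasing rank
# order. Since each task's rank strictly exceeds its successors' ranks,
# this order is also a valid topological order.

def schedule_makespan(tasks, succ, preds, w, com, proc_of):
    def c(u, v):                                    # Eq. (8): real comm. cost
        return 0 if proc_of[u] == proc_of[v] else com[(u, v)]

    @lru_cache(maxsize=None)
    def rank(t):                                    # Eq. (7)
        tail = max((c(t, s) + rank(s) for s in succ[t]), default=0)
        return w[t][proc_of[t]] + tail

    free, finish = {}, {}
    for t in sorted(tasks, key=rank, reverse=True):
        p = proc_of[t]
        ready = max((finish[u] + c(u, t) for u in preds[t]), default=0)
        finish[t] = max(free.get(p, 0), ready) + w[t][p]
        free[p] = finish[t]
    return max(finish.values())                     # Eq. (4)

tasks = ["t0", "t1"]
succ = {"t0": ["t1"], "t1": []}
preds = {"t0": [], "t1": ["t0"]}
w = {"t0": {0: 4, 1: 6}, "t1": {0: 3, 1: 2}}
com = {("t0", "t1"): 5}
print(schedule_makespan(tasks, succ, preds, w, com, {"t0": 0, "t1": 1}))  # -> 11
```

Note how the rank of `t0` already accounts for the zero communication cost when both tasks share a processor, which is the first difference from HEFT discussed above.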

5. Performance evaluation based on random task graphs

In this section, the performance of the proposed algorithm is evaluated on a series of random task graphs. More specifically, we compare the proposed algorithm with three related algorithms mentioned in Section 3: HEFT [3], a GA-based scheduling algorithm (CPGA) [13], and a hybrid scheduling algorithm based on AIS [17]. These are the most representative algorithms of the deterministic, non-deterministic and hybrid approaches described in


Table 1. Parameters used to generate random task graphs.

Parameter                              Value
Number of nodes                        {16, 36, 64, 100}
Shape coefficient of the graph         {0.5, 1, 2}
Out-degree coefficient of each node    {0.25, 0.5, 0.75}
CCR value                              {0.25, 0.50, 1.00, 2.00, 4.00}
Number of processors                   {2, 4, 8}

Section 3, respectively. In the experiments, these algorithms were implemented by ourselves, since the corresponding references give full details about them.

5.1. Experimental setup

To generate random task graphs, several parameters describing the structure are considered:

• The number of nodes (i.e., tasks) in the graph (n).
• The shape coefficient of the graph (es): the ratio of the average width (the number of tasks in a level) to the average height (the number of levels) of the task graph. The shape coefficient indicates whether the task graph has high (e.g., es >> 1) or low (e.g., es << 1) parallelism.
• The out-degree coefficient of each node (ed): the ratio of the average out-degree of each node to the width of the graph. This parameter reveals how the communication cost affects the schedule. When ed is high (e.g., ed > 0.5), the outputs of tasks have to be transmitted to many others, and the schedule is more sensitive to the communication cost; conversely, the schedule is influenced less by the communication cost when ed is low (e.g., ed < 0.5).

Besides the parameters above, some others influence the execution of the task graph:

• The number of available processors (p).
• The communication-to-computation ratio (CCR): the ratio of the communication cost to the average computation cost of the task graph. CCR indicates whether an application is communication-intensive (higher CCR) or computation-intensive (lower CCR).

The values of these parameters used to generate the random task graphs are listed in Table 1. Furthermore, the execution time of each task on each processor follows a Poisson distribution with mean 40, and the communication cost between dependent tasks is assumed to be consistent. In the experiments below, each random task graph produces twenty real scheduling problems with different execution times for each task.
Each algorithm runs 20 times on each of these real problems. Unless otherwise specified, the parameters used in the proposed algorithm and in the compared algorithms are set according to Table 2. The adaptive crossover and mutation rates of CPGA are set according to [13].

5.2. Experimental results

The performance comparison among the proposed algorithm, the AIS-based hybrid algorithm [17], CPGA [13] and HEFT [3] on the random task graphs is presented in Tables 3, 5 and 7, in which the best result obtained in each test case is marked in bold font. Furthermore, we also investigate the statistical significance of the differences among these results. Tables 4, 6 and 8 show the p-values obtained by an upper-tailed paired t-test for the results in Tables 3, 5 and 7, respectively. The significance level of each test is set to 0.05.
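The upper-tailed paired t-test used throughout these comparisons can be sketched in a few lines. The helper name and the sample values are illustrative only; the paper uses 20 paired runs per test case, so the one-sided critical value at the 0.05 level has 19 degrees of freedom.

```python
from statistics import mean, stdev

def upper_tailed_paired_t(x, y):
    """t statistic for H1: mean(x) > mean(y) on paired samples."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / n ** 0.5)

# Hypothetical speedups over 20 paired runs (illustrative numbers only).
boa  = [2.14, 2.12, 2.15, 2.13, 2.16] * 4
heft = [1.97, 1.96, 1.98, 1.95, 1.97] * 4
t = upper_tailed_paired_t(boa, heft)
# The one-sided critical value t(0.95, df=19) is about 1.729; here t far
# exceeds it, so the difference is significant at the 0.05 level.
assert t > 1.729
```

In practice the same computation (including the p-value itself) is available as `scipy.stats.ttest_rel(x, y, alternative='greater')`.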

Table 3 presents the experimental results with various graph sizes. The shape coefficient and out-degree coefficient are fixed to 1 and 0.5, respectively. From Table 3, it is obvious that the proposed approach is able to produce more efficient schedules in almost all the experimental cases. The speedup is generally improved by more than 10% when using the proposed algorithm, and the highest improvements reach 28.74%, 21.45% and 22.40% compared with AIS, CPGA and HEFT, respectively. Looking at the table from the horizontal perspective, the gaps in the speedups between the algorithms are more remarkable when more processors are available. From the vertical perspective, it is noticeable that an increase in the graph size also enlarges the gaps. The reason behind this phenomenon is that more available processors and more embedded tasks increase the available parallelism; in consequence, the proposed algorithm is able to exploit more efficient schedules. The significance tests in Table 4 validate that the proposed algorithm is indeed significantly better than AIS, CPGA and HEFT in almost all test cases. The performance comparison between the different algorithms with three different shape coefficients is shown in Table 5. The graph size is set to 64 and the out-degree coefficient is set to 0.5. It is obvious that the proposed algorithm outperforms HEFT in all of the experimental cases. Except for the cases in the bottom left of the table, which have a larger shape coefficient (e.g., 2.0) and fewer processors (e.g., 2), the proposed algorithm outperforms AIS and CPGA as well. On average, the results of the proposed algorithm are higher than those of the others by about 10%. As more processors become available, the improvement of the proposed algorithm becomes more remarkable, and the highest improvements reach 35.44%, 25.88% and 24.01% compared with AIS, CPGA and HEFT, respectively.
From Table 5, it can also be seen that this advantage weakens when the shape coefficient becomes larger (e.g., 2.0). This is because when the shape coefficient is larger than 1.0, the task graph tends to be "thin", which impedes the potential parallelizability of applications; thus, the improvement of the proposed algorithm is limited as well. According to the significance tests in Table 6, it is also obvious that the proposed algorithm is remarkably better than AIS, CPGA and HEFT with various shape coefficients. In Table 7, the results of the experiment with various out-degree coefficients are presented. The graph size and shape coefficient are equal to 64 and 1.0, respectively. From Table 7, it is clear that the proposed algorithm yields schedules with higher speedup than the other three algorithms in almost all the experimental cases, except for the ones in the top left and bottom left of the table. The performance is enhanced by about 10% overall in terms of the speedup. Similar to Tables 3 and 5, the proposed algorithm outperforms the other algorithms more remarkably when more processors are available. It is also observable that in cases with a lower out-degree coefficient (e.g., 0.25), the advantage of the proposed algorithm tends to be more obvious; the highest improvements reach 44.25%, 27.41% and 35.68% compared with AIS, CPGA and HEFT, respectively. This is because a task graph with a low out-degree coefficient has relatively sparse dependencies between tasks, which reduces the potential communication cost; therefore, the proposed algorithm is able to deliver more competent schedules. Furthermore, the p-values in Table 8 validate as well that the proposed algorithm is able to produce much more competent results. It has to be pointed out that the performance of the proposed algorithm sometimes degrades when the number of tasks or the out-degree coefficient of each node becomes larger.
In such cases, the task graph becomes denser, which means that every node in the graph except the entry nodes has many predecessors. If the Bayesian network is initialized completely according to the task graph, it probably contains some redundant or conflicting dependen-

J. Yang et al. / Applied Soft Computing 11 (2011) 3297–3310

3303

Table 2
Parameter settings.

Parameter                                  BOA          AIS [17]       CPGA [13]
Population size                            2000         2000           2000
Crossover rate                             –            –              Adaptive
Mutation rate                              0.04         0.1            Adaptive
Selection method                           Tournament   Proportional   Tournament
Interval of network's structure update     10           –              –
Terminal condition                         Convergence  Convergence    Convergence

Table 3
Speedups of schedules on random task graphs with different numbers of tasks. Entries are Ave (SD); only the average is reported for HEFT.

p = 2
n     CCR    BOA          AIS          CPGA         HEFT
16    0.25   2.08 (0.07)  2.08 (0.07)  2.07 (0.07)  1.89
      0.50   2.07 (0.07)  2.07 (0.07)  2.05 (0.07)  1.86
      1.00   2.04 (0.05)  2.04 (0.05)  2.03 (0.06)  1.80
      2.00   1.88 (0.08)  1.88 (0.08)  1.87 (0.06)  1.66
      4.00   1.46 (0.04)  1.45 (0.04)  1.48 (0.06)  1.20
36    0.25   2.13 (0.03)  2.06 (0.03)  2.06 (0.03)  1.92
      0.50   2.10 (0.03)  2.04 (0.03)  2.04 (0.03)  1.90
      1.00   2.05 (0.03)  2.02 (0.03)  2.00 (0.03)  1.87
      2.00   1.88 (0.03)  1.85 (0.03)  1.87 (0.04)  1.75
      4.00   1.59 (0.04)  1.42 (0.03)  1.52 (0.05)  1.33
64    0.25   2.14 (0.03)  2.02 (0.02)  2.04 (0.03)  1.97
      0.50   2.12 (0.03)  2.01 (0.03)  2.03 (0.03)  1.96
      1.00   2.05 (0.02)  1.98 (0.02)  1.99 (0.03)  1.90
      2.00   2.02 (0.02)  1.85 (0.03)  1.88 (0.04)  1.79
      4.00   1.47 (0.03)  1.45 (0.03)  1.44 (0.04)  1.41
100   0.25   2.16 (0.03)  2.02 (0.03)  1.98 (0.08)  2.01
      0.50   2.15 (0.03)  2.02 (0.03)  1.98 (0.08)  1.99
      1.00   2.12 (0.02)  1.97 (0.02)  1.95 (0.07)  1.93
      2.00   1.89 (0.02)  1.87 (0.02)  1.87 (0.05)  1.87
      4.00   1.54 (0.02)  1.51 (0.02)  1.53 (0.04)  1.52

p = 4
16    0.25   3.61 (0.14)  3.29 (0.08)  3.36 (0.13)  3.24
      0.50   3.05 (0.10)  2.97 (0.10)  2.99 (0.11)  2.93
      1.00   2.75 (0.08)  2.56 (0.05)  2.63 (0.09)  2.41
      2.00   1.98 (0.05)  1.96 (0.05)  1.99 (0.05)  1.85
      4.00   1.34 (0.04)  1.32 (0.04)  1.32 (0.08)  1.25
36    0.25   3.54 (0.10)  3.26 (0.08)  3.52 (0.10)  3.38
      0.50   3.39 (0.07)  3.17 (0.07)  3.32 (0.09)  3.12
      1.00   2.91 (0.06)  2.66 (0.06)  2.84 (0.08)  2.81
      2.00   2.26 (0.05)  2.03 (0.05)  2.14 (0.07)  2.14
      4.00   1.70 (0.04)  1.33 (0.03)  1.45 (0.06)  1.42
64    0.25   3.63 (0.11)  3.45 (0.06)  3.61 (0.11)  3.48
      0.50   3.51 (0.07)  3.30 (0.05)  3.43 (0.10)  3.38
      1.00   3.23 (0.12)  2.85 (0.05)  3.04 (0.09)  2.94
      2.00   2.58 (0.05)  2.26 (0.03)  2.32 (0.06)  2.41
      4.00   1.72 (0.03)  1.57 (0.02)  1.54 (0.03)  1.54
100   0.25   4.25 (0.11)  3.77 (0.07)  3.67 (0.09)  3.79
      0.50   4.12 (0.09)  3.68 (0.05)  3.50 (0.08)  3.57
      1.00   3.71 (0.07)  3.14 (0.04)  3.18 (0.08)  3.17
      2.00   2.89 (0.04)  2.55 (0.04)  2.54 (0.05)  2.58
      4.00   1.89 (0.02)  1.74 (0.03)  1.75 (0.03)  1.72

p = 8
16    0.25   3.84 (0.08)  3.51 (0.08)  3.75 (0.10)  3.24
      0.50   3.10 (0.06)  3.07 (0.08)  3.07 (0.10)  2.89
      1.00   2.59 (0.06)  2.54 (0.07)  2.55 (0.08)  2.38
      2.00   1.94 (0.05)  1.92 (0.05)  1.92 (0.05)  1.81
      4.00   1.29 (0.03)  1.28 (0.03)  1.31 (0.03)  1.24
36    0.25   5.42 (0.15)  4.21 (0.10)  5.07 (0.14)  4.53
      0.50   4.59 (0.10)  3.81 (0.10)  4.29 (0.10)  3.75
      1.00   3.49 (0.07)  3.02 (0.07)  3.30 (0.08)  2.94
      2.00   2.44 (0.06)  2.28 (0.06)  2.25 (0.05)  2.26
      4.00   1.62 (0.03)  1.30 (0.02)  1.41 (0.07)  1.48
64    0.25   6.21 (0.14)  5.33 (0.13)  5.56 (0.24)  5.71
      0.50   5.42 (0.10)  4.72 (0.11)  4.91 (0.18)  4.83
      1.00   4.33 (0.08)  3.42 (0.09)  3.90 (0.10)  3.73
      2.00   2.81 (0.06)  2.46 (0.06)  2.71 (0.05)  2.62
      4.00   1.72 (0.03)  1.64 (0.03)  1.68 (0.02)  1.67
100   0.25   7.02 (0.15)  5.82 (0.13)  5.78 (0.23)  5.98
      0.50   6.04 (0.11)  5.13 (0.08)  5.28 (0.15)  5.23
      1.00   4.83 (0.08)  4.21 (0.07)  4.29 (0.11)  4.42
      2.00   3.32 (0.07)  3.03 (0.04)  3.10 (0.04)  3.12
      4.00   1.92 (0.03)  1.83 (0.03)  1.98 (0.01)  1.98

Table 4
p-Values obtained by an upper-tailed paired t-test on random task graphs with different numbers of tasks.

p = 2
n     CCR    BOA vs. AIS   BOA vs. CPGA   BOA vs. HEFT
16    0.25   6.56E−1       4.12E−1        5.09E−10
      0.50   4.59E−1       8.23E−2        1.21E−10
      1.00   4.91E−1       4.19E−1        1.70E−13
      2.00   6.94E−1       6.50E−2        2.19E−10
      4.00   3.69E−1       2.40E−3        9.89E−18
36    0.25   1.38E−17      3.73E−10       1.34E−17
      0.50   9.40E−3       1.06E−2        4.04E−14
      1.00   2.33E−14      1.67E−11       2.32E−17
      2.00   7.67E−4       4.64E−2        1.29E−12
      4.00   1.71E−13      1.90E−4        1.79E−17
64    0.25   3.45E−15      2.69E−9        3.88E−16
      0.50   2.15E−17      1.26E−12       8.26E−17
      1.00   1.15E−1       3.91E−1        3.04E−12
      2.00   2.84E−2       1.28E−1        1.68E−11
      4.00   1.70E−2       9.03E−2        3.77E−8
100   0.25   5.93E−12      3.96E−11       6.89E−15
      0.50   1.22E−12      8.96E−11       1.64E−14
      1.00   4.79E−15      2.60E−8        3.30E−18
      2.00   2.18E−5       3.26E−2        6.69E−6
      4.00   2.01E−4       3.05E−2        2.88E−4

p = 4
16    0.25   2.36E−9       7.06E−6        6.70E−10
      0.50   1.02E−3       2.23E−3        5.09E−5
      1.00   5.55E−13      2.99E−5        3.40E−14
      2.00   2.78E−3       3.81E−1        5.58E−10
      4.00   1.18E−2       4.90E−3        1.02E−9
36    0.25   1.43E−14      1.15E−8        5.10E−7
      0.50   7.47E−18      2.80E−14       3.44E−13
      1.00   7.17E−4       7.11E−6        5.94E−7
      2.00   1.49E−3       4.73E−3        2.63E−9
      4.00   3.87E−17      1.66E−13       1.99E−17
64    0.25   3.81E−10      3.06E−1        8.15E−6
      0.50   2.10E−14      5.83E−2        2.66E−7
      1.00   3.99E−12      1.91E−4        3.44E−18
      2.00   2.75E−18      4.78E−12       3.98E−12
      4.00   1.94E−18      4.97E−16       3.01E−17
100   0.25   5.21E−13      2.03E−12       1.80E−12
      0.50   1.24E−13      6.42E−15       3.46E−15
      1.00   1.14E−14      6.29E−17       2.32E−16
      2.00   8.14E−17      6.02E−14       2.31E−16
      4.00   4.40E−13      3.26E−15       1.17E−17

p = 8
16    0.25   6.66E−15      1.86E−5        3.00E−18
      0.50   3.68E−2       1.80E−3        1.70E−11
      1.00   1.63E−3       2.67E−4        2.66E−11
      2.00   2.58E−3       2.67E−4        1.48E−10
      4.00   3.20E−3       2.58E−1        1.26E−7
36    0.25   3.15E−17      3.31E−8        4.90E−16
      0.50   8.93E−17      2.07E−7        2.10E−18
      1.00   8.16E−16      1.73E−7        1.17E−17
      2.00   3.42E−13      2.37E−8        1.11E−17
      4.00   1.28E−18      5.00E−11       2.82E−15
64    0.25   2.31E−14      1.34E−9        2.14E−12
      0.50   2.00E−13      6.81E−10       6.73E−16
      1.00   8.95E−19      1.48E−12       1.26E−17
      2.00   5.03E−15      1.23E−5        5.77E−11
      4.00   3.48E−9       4.35E−5        6.92E−7
100   0.25   1.09E−18      9.81E−13       1.27E−15
      0.50   1.42E−16      3.34E−15       1.23E−16
      1.00   1.63E−14      1.44E−12       8.50E−14
      2.00   1.69E−10      7.61E−10       3.16E−10
      4.00   1.58E−8       1.20E−9        9.98E−1

cies, which prevent the algorithm from finding promising solutions. The simplest way to tackle this problem is to initialize the network according to a simplified task graph, which is obtained by randomly removing half of the edges in the initial one. In the experiment, four cases are handled in this way, and the results with and without edge removal in these cases are shown in Table 9.
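The edge-removal simplification described above amounts to a single random subsampling step before the Bayesian network is initialized. A minimal sketch, with a hypothetical helper name (the paper only states that half of the edges are removed at random):

```python
import random

def simplify_edges(edges, fraction=0.5, seed=0):
    """Randomly drop `fraction` of the task-graph edges before using the
    remaining graph to seed the Bayesian network structure.

    Illustrative sketch; edge selection is uniform at random.
    """
    rng = random.Random(seed)
    keep = len(edges) - int(len(edges) * fraction)
    return sorted(rng.sample(sorted(edges), keep))

# A small dense task graph: two entry tasks feeding two middle tasks.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 4), (3, 4)]
kept = simplify_edges(edges, 0.5)
assert len(kept) == 3 and set(kept) <= set(edges)
```

Dropping dependencies this way trades fidelity to the task graph for a sparser, less conflicted initial network structure, which matches the motivation given in the text for dense graphs.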


6. Performance evaluation based on task graphs of well-known applications

In addition to randomly generated task graphs, the task graphs of three well-known real parallel applications, including Gauss elimination [29], Gauss–Jordan elimination [30] and fast Fourier


Table 5
Speedups of schedules on random task graphs with different shape coefficients. Entries are Ave (SD); only the average is reported for HEFT.

p = 2
es    CCR    BOA          AIS          CPGA         HEFT
0.5   0.25   2.02 (0.02)  1.82 (0.03)  1.95 (0.04)  1.88
      0.50   1.97 (0.02)  1.75 (0.03)  1.88 (0.04)  1.79
      1.00   1.82 (0.04)  1.58 (0.04)  1.70 (0.04)  1.58
      2.00   1.41 (0.03)  1.24 (0.02)  1.33 (0.03)  1.26
      4.00   1.06 (0.01)  0.87 (0.02)  1.01 (0.03)  0.88
1.0   0.25   2.14 (0.03)  2.02 (0.02)  2.04 (0.03)  1.97
      0.50   2.12 (0.03)  2.01 (0.03)  2.03 (0.03)  1.96
      1.00   2.05 (0.02)  1.98 (0.02)  1.99 (0.03)  1.90
      2.00   2.02 (0.02)  1.85 (0.03)  1.88 (0.04)  1.79
      4.00   1.47 (0.03)  1.45 (0.03)  1.44 (0.04)  1.41
2.0   0.25   2.11 (0.03)  2.07 (0.02)  2.08 (0.03)  1.97
      0.50   2.04 (0.02)  2.07 (0.03)  2.08 (0.03)  1.97
      1.00   2.03 (0.02)  2.06 (0.03)  2.06 (0.03)  1.97
      2.00   2.02 (0.02)  2.06 (0.02)  2.05 (0.03)  1.96
      4.00   2.00 (0.02)  2.00 (0.03)  1.98 (0.03)  1.91

p = 4
0.5   0.25   3.20 (0.08)  2.82 (0.06)  3.00 (0.07)  3.03
      0.50   2.70 (0.06)  2.56 (0.04)  2.56 (0.05)  2.66
      1.00   2.07 (0.03)  1.74 (0.03)  1.97 (0.04)  2.02
      2.00   1.44 (0.02)  1.24 (0.03)  1.35 (0.02)  1.35
      4.00   1.07 (0.01)  0.79 (0.01)  0.85 (0.02)  0.94
1.0   0.25   3.63 (0.11)  3.45 (0.06)  3.61 (0.11)  3.48
      0.50   3.51 (0.07)  3.30 (0.05)  3.43 (0.10)  3.38
      1.00   3.23 (0.12)  2.85 (0.05)  3.04 (0.09)  2.94
      2.00   2.58 (0.05)  2.26 (0.03)  2.32 (0.06)  2.41
      4.00   1.72 (0.03)  1.57 (0.02)  1.54 (0.03)  1.54
2.0   0.25   4.21 (0.11)  3.96 (0.05)  3.96 (0.08)  3.89
      0.50   4.06 (0.09)  3.83 (0.06)  3.91 (0.08)  3.81
      1.00   3.86 (0.08)  3.71 (0.06)  3.80 (0.09)  3.63
      2.00   3.60 (0.08)  3.35 (0.05)  3.48 (0.09)  3.42
      4.00   3.07 (0.07)  2.60 (0.04)  2.72 (0.07)  2.62

p = 8
0.5   0.25   3.46 (0.10)  2.85 (0.08)  3.45 (0.06)  2.79
      0.50   2.83 (0.07)  2.42 (0.06)  2.87 (0.05)  2.39
      1.00   2.13 (0.03)  1.87 (0.05)  2.13 (0.03)  1.88
      2.00   1.40 (0.03)  1.29 (0.03)  1.41 (0.02)  1.30
      4.00   0.86 (0.01)  0.79 (0.02)  0.84 (0.01)  0.88
1.0   0.25   6.21 (0.14)  5.33 (0.13)  5.56 (0.24)  5.71
      0.50   5.42 (0.10)  4.72 (0.11)  4.91 (0.18)  4.83
      1.00   4.33 (0.08)  3.42 (0.09)  3.90 (0.10)  3.73
      2.00   2.81 (0.06)  2.46 (0.06)  2.71 (0.05)  2.62
      4.00   1.72 (0.03)  1.64 (0.03)  1.68 (0.02)  1.67
2.0   0.25   7.78 (0.16)  6.33 (0.08)  6.96 (0.16)  6.87
      0.50   7.20 (0.14)  5.93 (0.11)  6.66 (0.16)  6.55
      1.00   6.61 (0.09)  5.37 (0.09)  5.95 (0.12)  5.85
      2.00   5.10 (0.07)  4.30 (0.08)  4.73 (0.09)  4.64
      4.00   3.56 (0.05)  3.06 (0.03)  3.26 (0.06)  3.20

Table 6
p-Values obtained by an upper-tailed paired t-test on random task graphs with different shape coefficients.

p = 2
es    CCR    BOA vs. AIS   BOA vs. CPGA   BOA vs. HEFT
0.5   0.25   1.12E−18      1.76E−7        1.05E−17
      0.50   2.17E−16      2.74E−9        4.86E−18
      1.00   1.37E−19      2.25E−13       1.60E−16
      2.00   7.41E−20      1.29E−7        6.76E−16
      4.00   6.74E−22      2.71E−7        2.38E−24
1.0   0.25   3.45E−15      2.69E−9        3.88E−16
      0.50   2.15E−17      1.26E−12       8.26E−17
      1.00   1.15E−1       3.91E−1        3.04E−12
      2.00   2.84E−2       1.28E−1        1.68E−11
      4.00   1.70E−2       9.03E−2        3.77E−8
2.0   0.25   5.09E−10      8.00E−3        5.27E−10
      0.50   5.09E−10      6.81E−1        1.11E−9
      1.00   5.09E−10      6.19E−1        3.59E−10
      2.00   5.09E−10      5.17E−1        3.41E−13
      4.00   7.74E−1       1.53E−2        1.12E−10

p = 4
0.5   0.25   9.84E−18      1.54E−7        1.94E−8
      0.50   1.43E−19      1.15E−7        3.68E−3
      1.00   1.10E−17      9.01E−9        1.91E−6
      2.00   5.87E−20      1.34E−10       2.85E−15
      4.00   1.63E−26      2.13E−20       2.49E−24
1.0   0.25   3.81E−10      3.06E−1        8.15E−6
      0.50   2.10E−14      5.83E−2        2.66E−7
      1.00   3.99E−12      1.91E−4        3.44E−18
      2.00   2.75E−18      4.78E−12       3.98E−12
      4.00   1.94E−18      4.97E−16       3.01E−17
2.0   0.25   8.64E−11      2.79E−7        1.57E−10
      0.50   6.64E−9       1.27E−4        1.74E−10
      1.00   3.82E−7       8.41E−2        8.22E−9
      2.00   2.72E−10      3.28E−4        4.03E−15
      4.00   8.87E−17      5.02E−14       1.64E−15

p = 8
0.5   0.25   2.74E−14      3.23E−1        6.44E−17
      0.50   1.78E−13      6.14E−1        5.98E−16
      1.00   7.39E−14      2.98E−1        8.77E−18
      2.00   3.03E−9       6.35E−1        6.25E−11
      4.00   1.41E−12      5.47E−4        8.09E−14
1.0   0.25   2.31E−14      1.34E−9        2.14E−12
      0.50   2.00E−13      6.81E−10       6.73E−16
      1.00   8.95E−19      1.48E−12       1.26E−17
      2.00   5.03E−15      1.23E−5        5.77E−11
      4.00   3.48E−9       4.35E−5        6.92E−7
2.0   0.25   2.92E−18      1.32E−9        8.36E−14
      0.50   1.46E−16      6.25E−7        2.43E−18
      1.00   6.65E−19      3.85E−11       2.74E−16
      2.00   2.96E−18      7.17E−8        6.36E−18
      4.00   7.22E−20      7.85E−13       3.63E−14

Table 7
Speedups of schedules on the random task graphs with different out-degree coefficients. Entries are Ave (SD); only the average is reported for HEFT.

p = 2
ed     CCR    BOA          AIS          CPGA         HEFT
0.25   0.25   2.09 (0.03)  2.05 (0.02)  1.97 (0.04)  1.99
       0.50   2.06 (0.03)  2.05 (0.03)  1.95 (0.03)  1.99
       1.00   2.01 (0.03)  2.03 (0.03)  1.86 (0.04)  1.93
       2.00   2.00 (0.03)  1.99 (0.03)  1.74 (0.05)  1.89
       4.00   1.96 (0.03)  1.73 (0.03)  1.69 (0.04)  1.80
0.50   0.25   2.14 (0.03)  2.02 (0.02)  2.04 (0.03)  1.97
       0.50   2.12 (0.03)  2.01 (0.03)  2.03 (0.03)  1.96
       1.00   2.05 (0.02)  1.98 (0.02)  1.99 (0.03)  1.90
       2.00   2.02 (0.02)  1.85 (0.03)  1.88 (0.04)  1.79
       4.00   1.47 (0.03)  1.45 (0.03)  1.44 (0.04)  1.41
0.75   0.25   1.96 (0.03)  1.99 (0.02)  1.99 (0.02)  1.89
       0.50   1.94 (0.03)  1.97 (0.02)  1.98 (0.03)  1.87
       1.00   1.88 (0.03)  1.91 (0.03)  1.94 (0.02)  1.85
       2.00   1.68 (0.03)  1.71 (0.03)  1.72 (0.04)  1.59
       4.00   1.44 (0.03)  1.31 (0.02)  1.34 (0.03)  1.17

p = 4
0.25   0.25   3.84 (0.07)  3.63 (0.07)  3.34 (0.08)  3.70
       0.50   3.64 (0.06)  3.52 (0.07)  3.01 (0.07)  3.55
       1.00   3.31 (0.05)  3.24 (0.08)  2.82 (0.07)  3.23
       2.00   2.79 (0.05)  2.64 (0.05)  2.26 (0.06)  2.67
       4.00   2.51 (0.05)  1.74 (0.03)  1.97 (0.05)  1.85
0.50   0.25   3.63 (0.11)  3.45 (0.06)  3.61 (0.11)  3.48
       0.50   3.51 (0.07)  3.30 (0.05)  3.43 (0.10)  3.38
       1.00   3.23 (0.12)  2.85 (0.05)  3.04 (0.09)  2.94
       2.00   2.58 (0.05)  2.26 (0.03)  2.32 (0.06)  2.41
       4.00   1.72 (0.03)  1.57 (0.02)  1.54 (0.03)  1.54
0.75   0.25   3.59 (0.07)  3.34 (0.04)  3.55 (0.08)  3.46
       0.50   3.16 (0.06)  3.07 (0.06)  3.11 (0.07)  3.12
       1.00   2.74 (0.05)  2.71 (0.03)  2.70 (0.07)  2.70
       2.00   2.24 (0.05)  2.14 (0.04)  2.19 (0.04)  2.12
       4.00   1.45 (0.03)  1.41 (0.03)  1.44 (0.02)  1.43

p = 8
0.25   0.25   6.34 (0.17)  5.44 (0.11)  5.57 (0.08)  6.25
       0.50   5.55 (0.21)  4.81 (0.09)  5.00 (0.06)  5.40
       1.00   4.80 (0.07)  3.85 (0.07)  3.81 (0.06)  4.25
       2.00   3.68 (0.09)  2.94 (0.04)  3.05 (0.06)  3.17
       4.00   2.48 (0.07)  2.09 (0.03)  2.21 (0.06)  2.16
0.50   0.25   6.21 (0.14)  5.33 (0.13)  5.56 (0.24)  5.71
       0.50   5.42 (0.10)  4.72 (0.11)  4.91 (0.18)  4.83
       1.00   4.33 (0.08)  3.42 (0.09)  3.90 (0.10)  3.73
       2.00   2.81 (0.06)  2.46 (0.06)  2.71 (0.05)  2.62
       4.00   1.72 (0.03)  1.64 (0.03)  1.68 (0.02)  1.67
0.75   0.25   6.33 (0.10)  5.27 (0.09)  5.61 (0.20)  5.59
       0.50   5.37 (0.08)  4.65 (0.07)  4.88 (0.14)  4.80
       1.00   4.02 (0.07)  3.31 (0.04)  3.84 (0.09)  3.88
       2.00   2.79 (0.06)  2.40 (0.04)  2.67 (0.04)  2.69
       4.00   1.70 (0.03)  1.52 (0.03)  1.66 (0.02)  1.67


Table 8
p-Values obtained by an upper-tailed paired t-test on random task graphs with different out-degree coefficients.

p = 2
ed     CCR    BOA vs. AIS   BOA vs. CPGA   BOA vs. HEFT
0.25   0.25   4.25E−10      4.89E−13       2.48E−12
       0.50   1.54E−3       1.71E−10       1.16E−9
       1.00   1.82E−1       1.10E−8        9.81E−12
       2.00   1.04E−2       1.28E−13       1.58E−12
       4.00   1.18E−18      1.52E−18       5.68E−16
0.50   0.25   3.45E−15      2.69E−9        3.88E−16
       0.50   2.15E−17      1.26E−12       8.26E−17
       1.00   1.15E−1       3.91E−1        3.04E−12
       2.00   2.84E−2       1.28E−1        1.68E−11
       4.00   1.70E−2       9.03E−2        3.77E−8
0.75   0.25   5.09E−10      6.50E−1        1.48E−6
       0.50   5.09E−10      6.55E−1        1.30E−4
       1.00   5.09E−10      6.29E−1        3.45E−10
       2.00   5.09E−10      6.12E−1        4.65E−21
       4.00   1.89E−12      4.59E−10       2.51E−7

p = 4
0.25   0.25   1.03E−9       5.86E−10       8.80E−8
       0.50   6.41E−11      7.86E−13       1.15E−6
       1.00   2.00E−7       1.72E−13       3.67E−7
       2.00   3.07E−12      5.20E−16       4.12E−19
       4.00   1.23E−20      9.74E−18       1.28E−22
0.50   0.25   3.81E−10      3.06E−1        8.15E−6
       0.50   2.10E−14      5.83E−2        2.66E−7
       1.00   3.99E−12      1.91E−4        3.44E−18
       2.00   2.75E−18      4.78E−12       3.98E−12
       4.00   1.94E−18      4.97E−16       3.01E−17
0.75   0.25   7.08E−15      2.84E−1        8.76E−3
       0.50   3.31E−5       6.79E−7        1.32E−3
       1.00   3.57E−5       3.35E−8        2.22E−9
       2.00   1.36E−10      8.32E−2        8.76E−3
       4.00   1.67E−3       2.42E−4        7.68E−3

p = 8
0.25   0.25   7.42E−15      3.89E−14       1.90E−4
       0.50   5.27E−12      8.19E−13       1.42E−3
       1.00   7.90E−19      1.42E−9        1.84E−17
       2.00   3.17E−16      1.52E−15       2.21E−15
       4.00   2.80E−14      1.75E−12       5.23E−14
0.50   0.25   2.31E−14      1.34E−9        2.14E−12
       0.50   2.00E−13      6.81E−10       6.73E−16
       1.00   8.95E−19      1.48E−12       1.26E−17
       2.00   5.03E−15      1.23E−5        5.77E−11
       4.00   3.48E−9       4.35E−5        6.92E−7
0.75   0.25   1.53E−18      6.34E−13       1.11E−17
       0.50   4.22E−18      1.79E−11       3.12E−17
       1.00   2.23E−18      8.25E−7        8.80E−8
       2.00   9.23E−15      1.08E−6        1.91E−6
       4.00   7.57E−13      3.93E−3        4.98E−4

Table 9
Performance comparison with and without edge removal (ER). Entries are Ave (SD) per CCR value.

Case                           Situation    CCR = 0.25   0.50         1.00         2.00         4.00
n=64, p=8, es=1.0, ed=0.50     With ER      6.21 (0.14)  5.42 (0.10)  4.33 (0.08)  2.81 (0.06)  1.72 (0.03)
                               Without ER   5.59 (0.10)  4.75 (0.09)  3.59 (0.07)  2.51 (0.05)  1.64 (0.03)
n=64, p=8, es=2.0, ed=0.50     With ER      7.78 (0.16)  7.20 (0.14)  6.61 (0.09)  5.10 (0.07)  3.56 (0.05)
                               Without ER   6.52 (0.12)  6.34 (0.10)  5.71 (0.08)  4.49 (0.06)  3.13 (0.04)
n=64, p=8, es=1.0, ed=0.75     With ER      6.33 (0.10)  5.37 (0.08)  4.02 (0.07)  2.79 (0.06)  1.70 (0.03)
                               Without ER   5.42 (0.09)  4.59 (0.07)  3.49 (0.05)  2.44 (0.03)  1.62 (0.03)
n=100, p=8, es=1.0, ed=0.50    With ER      7.02 (0.15)  6.04 (0.11)  4.83 (0.08)  3.32 (0.07)  1.92 (0.03)
                               Without ER   5.97 (0.10)  5.02 (0.08)  4.12 (0.05)  3.01 (0.04)  1.83 (0.03)

transformation (FFT) [31], are treated as the workload in this section. The statistical significance of the performance differences is also investigated.

6.1. Experimental setup

In the experiments, task graphs representing Gauss elimination and Gauss–Jordan elimination with matrix sizes equal to {6, 9, 12}, together with 1D FFT on 4, 8 and 16 data points, are adopted as the test bed. Fig. 6(a) and (b) depict the task graphs of the Gauss–Jordan elimination and the Gauss elimination, respectively, when the dimension of the matrix is equal to 5. Similarly, the task graph of the 1D FFT algorithm for the case of 4 data points is given in Fig. 6(c). Each node in Fig. 6(a)–(c) represents an independent operation in these applications. As the structure remains the same for different dimensions, the total numbers of nodes in the task graphs of Gauss–Jordan elimination, Gauss elimination and 1D FFT are equal to (d^2 + d)/2, (d^2 + d)/2 − 1, and d log2(d) + 2d − 1, respectively, where d represents the matrix size. The values of CCR and the numbers of processors, together with the matrix sizes used in the experiment, are summarized in Table 10. Each combination of them produces a meta-graph for the exper-

Table 10
Experimental parameters.

Parameter                      Value
Matrix size of applications    {6, 9, 12} for Gauss–Jordan elimination; {6, 9, 12} for Gauss elimination; {4, 8, 16} for fast Fourier transformation
CCR value                      {0.25, 0.50, 1.00, 2.00, 4.00}
The number of processors       {2, 4, 8}

iment. Additionally, the computation and communication costs of each task are set in the same way as in Section 5.1. Each meta-graph then produces twenty real scheduling problems with different execution times of each task, and the proposed algorithm runs 20 times on each of these real problems. Based on the data sets described above, the proposed algorithm is again compared with AIS [17], CPGA [13] and HEFT [3]. The values of the parameters in the algorithms are also set according to Table 2. Undoubtedly, the higher the speedup, the better the scheduling of tasks executed on multiprocessors.

6.2. Experimental results

The performance comparisons on the task graphs of Gauss elimination, Gauss–Jordan elimination and FFT are presented in Tables 11, 13 and 15, respectively. The best result obtained in each test case is marked in bold font. Similar to Section 5.2, we also investigate the statistical significance of the performance differences. The p-values obtained by an upper-tailed paired t-test for the results in Tables 11, 13 and 15 are presented in Tables 12, 14 and 16. The significance level of each test is set to 0.05. It is obvious that the proposed approach is able to schedule tasks with speedups higher than those produced by HEFT in all of the experimental cases. Generally, the proposed algorithm is better than HEFT by more than 10%. As the scale of the scheduling problem increases (e.g., more processors or a larger matrix size), the gap in the speedups between the two algorithms becomes more remarkable. The proposed algorithm is able to outperform HEFT by at most 23.6%, 31.4% and 31.1% for Gauss elimination, Gauss–Jordan elimination and FFT, respectively. The significance tests in Tables 12, 14 and 16 confirm that the proposed algorithm certainly outperforms HEFT as well.
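The node-count formulas stated in the setup can be checked with a few lines. The helper name is hypothetical; the formulas are taken directly from the text.

```python
import math

def num_tasks(application, d):
    """Number of nodes in the application task graph, per the formulas
    in the text, where d is the matrix size (or number of data points)."""
    if application == "gauss-jordan":
        return (d * d + d) // 2
    if application == "gauss":
        return (d * d + d) // 2 - 1
    if application == "fft":
        return d * int(math.log2(d)) + 2 * d - 1
    raise ValueError(application)

# Sizes used in the experiments: {6, 9, 12} and {4, 8, 16} data points.
assert num_tasks("gauss-jordan", 12) == 78
assert num_tasks("gauss", 9) == 44
assert num_tasks("fft", 16) == 95
```

So the largest benchmark instances here (78, 77 and 95 tasks) are of the same order as the n = 100 random graphs of Section 5.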


Fig. 6. Sample task graphs of the selected applications. (a) Gauss elimination, (b) Gauss–Jordan elimination and (c) FFT.

Table 11
Speedups of schedules on the task graphs of Gauss elimination. Entries are Ave (SD); only the average is reported for HEFT.

p = 2
Size   CCR    BOA          AIS          CPGA         HEFT
6      0.25   1.82 (0.05)  1.80 (0.04)  1.81 (0.04)  1.70
       0.50   1.75 (0.04)  1.74 (0.04)  1.75 (0.03)  1.64
       1.00   1.62 (0.04)  1.60 (0.04)  1.59 (0.04)  1.41
       2.00   1.42 (0.03)  1.39 (0.03)  1.40 (0.03)  1.25
       4.00   1.17 (0.02)  1.14 (0.04)  1.15 (0.03)  1.08
9      0.25   1.97 (0.04)  1.79 (0.04)  1.93 (0.04)  1.83
       0.50   1.95 (0.04)  1.75 (0.05)  1.89 (0.03)  1.78
       1.00   1.89 (0.03)  1.65 (0.05)  1.79 (0.04)  1.68
       2.00   1.75 (0.04)  1.49 (0.05)  1.62 (0.04)  1.53
       4.00   1.54 (0.03)  1.24 (0.05)  1.39 (0.04)  1.35
12     0.25   2.04 (0.03)  1.82 (0.04)  1.95 (0.04)  1.89
       0.50   2.02 (0.03)  1.78 (0.03)  1.93 (0.04)  1.87
       1.00   1.99 (0.03)  1.71 (0.03)  1.86 (0.04)  1.80
       2.00   1.90 (0.04)  1.53 (0.04)  1.74 (0.04)  1.65
       4.00   1.73 (0.03)  1.22 (0.03)  1.54 (0.05)  1.48

p = 4
6      0.25   2.11 (0.06)  2.03 (0.07)  2.10 (0.06)  1.91
       0.50   2.02 (0.05)  1.88 (0.07)  1.99 (0.06)  1.85
       1.00   1.75 (0.06)  1.60 (0.05)  1.70 (0.06)  1.53
       2.00   1.46 (0.03)  1.25 (0.06)  1.38 (0.05)  1.21
       4.00   1.19 (0.04)  1.15 (0.04)  1.08 (0.05)  1.11
9      0.25   2.89 (0.08)  2.77 (0.05)  2.75 (0.09)  2.66
       0.50   2.76 (0.08)  2.67 (0.08)  2.57 (0.09)  2.57
       1.00   2.33 (0.06)  2.24 (0.07)  2.19 (0.07)  2.13
       2.00   1.95 (0.04)  1.82 (0.05)  1.72 (0.06)  1.68
       4.00   1.49 (0.04)  1.33 (0.03)  1.31 (0.06)  1.23
12     0.25   3.45 (0.04)  3.27 (0.04)  3.09 (0.07)  3.18
       0.50   3.36 (0.04)  3.12 (0.03)  2.93 (0.07)  3.06
       1.00   2.93 (0.05)  2.82 (0.06)  2.55 (0.07)  2.76
       2.00   2.46 (0.04)  2.27 (0.04)  2.04 (0.07)  2.10
       4.00   1.92 (0.03)  1.76 (0.04)  1.51 (0.07)  1.61

p = 8
6      0.25   2.13 (0.07)  1.96 (0.06)  2.05 (0.07)  1.91
       0.50   2.04 (0.04)  1.78 (0.06)  2.00 (0.06)  1.84
       1.00   1.74 (0.05)  1.58 (0.05)  1.68 (0.06)  1.52
       2.00   1.44 (0.04)  1.32 (0.03)  1.29 (0.06)  1.22
       4.00   1.17 (0.03)  1.13 (0.03)  0.98 (0.07)  1.09
9      0.25   2.94 (0.05)  2.81 (0.06)  2.87 (0.06)  2.67
       0.50   2.80 (0.05)  2.70 (0.04)  2.64 (0.08)  2.58
       1.00   2.36 (0.03)  2.27 (0.06)  2.22 (0.07)  2.20
       2.00   1.94 (0.03)  1.78 (0.06)  1.67 (0.06)  1.66
       4.00   1.51 (0.03)  1.39 (0.03)  1.19 (0.07)  1.26
12     0.25   3.80 (0.08)  3.58 (0.09)  3.60 (0.09)  3.42
       0.50   3.56 (0.06)  3.40 (0.06)  3.27 (0.10)  3.31
       1.00   3.06 (0.05)  2.85 (0.04)  2.73 (0.08)  2.73
       2.00   2.41 (0.03)  2.19 (0.06)  2.04 (0.07)  2.08
       4.00   1.83 (0.03)  1.65 (0.01)  1.42 (0.07)  1.48

From Tables 11, 13 and 15, it is also noticeable that the proposed algorithm is superior to the AIS-based hybrid algorithm [17] in almost all experimental cases, except for FFT on 4 data points with a smaller CCR value and fewer processors. More specifically, when the scheduling problem is large-scale or the application is computation-intensive (e.g., CCR equal to 0.25 or 0.5), the proposed algorithm outperforms the AIS-based algorithm remarkably. In terms of the speedup, the result could be improved by more than 10% on average, and the highest improvements reach 28.87%, 50.44%, and 25.26% for Gauss elimination,

Gauss–Jordan elimination and FFT, respectively. The p-values in Tables 12, 14 and 16 also verify that the proposed algorithm is able to deliver better results than the AIS-based algorithm. The experimental results in Tables 11, 13 and 15 demonstrate that the proposed algorithm outperforms CPGA in all test cases of Gauss and Gauss–Jordan elimination and in almost all test cases of FFT. Generally, the average gap in speedup between the proposed algorithm and CPGA is about 10%. However, as the numbers of tasks and processors increase and CCR is reduced,

Table 12
p-Values obtained by an upper-tailed paired t-test on the task graphs of Gauss elimination.

p = 2
Size   CCR    BOA vs. AIS   BOA vs. CPGA   BOA vs. HEFT
6      0.25   1.27E−3       4.87E−1        1.74E−3
       0.50   1.87E−3       1.58E−1        3.00E−3
       1.00   4.53E−4       1.82E−2        7.38E−3
       2.00   1.66E−5       1.67E−5        1.55E−3
       4.00   9.21E−7       8.06E−4        8.55E−3
9      0.25   8.05E−14      2.05E−8        6.56E−13
       0.50   1.08E−14      1.60E−7        1.12E−13
       1.00   4.37E−17      2.37E−12       4.57E−17
       2.00   1.56E−14      3.34E−11       2.52E−15
       4.00   4.96E−14      8.34E−12       5.51E−17
12     0.25   9.95E−22      2.01E−6        2.27E−15
       0.50   1.00E−21      2.05E−8        3.28E−15
       1.00   1.04E−19      2.99E−8        7.05E−16
       2.00   5.67E−20      6.72E−11       1.23E−16
       4.00   2.38E−23      2.49E−13       2.90E−18

p = 4
6      0.25   2.47E−9       2.28E−2        1.73E−3
       0.50   6.56E−11      8.68E−3        5.19E−3
       1.00   1.83E−12      7.82E−6        1.21E−3
       2.00   4.60E−14      1.31E−8        3.62E−3
       4.00   2.98E−6       2.15E−9        3.87E−3
9      0.25   4.28E−9       2.70E−8        1.48E−10
       0.50   5.25E−5       4.50E−13       1.79E−9
       1.00   2.73E−4       1.18E−9        3.30E−11
       2.00   3.71E−11      5.19E−13       3.21E−17
       4.00   2.62E−10      5.72E−9        2.79E−16
12     0.25   3.61E−11      5.19E−13       1.40E−12
       0.50   1.16E−14      6.78E−15       5.34E−17
       1.00   2.69E−6       1.02E−15       2.33E−12
       2.00   1.73E−14      2.25E−18       4.48E−18
       4.00   1.27E−13      4.63E−16       3.45E−20

p = 8
6      0.25   6.36E−14      7.16E−5        2.19E−3
       0.50   3.13E−17      9.30E−5        1.21E−3
       1.00   1.02E−12      8.54E−6        1.76E−3
       2.00   4.22E−12      1.39E−10       5.93E−3
       4.00   2.81E−3       2.43E−11       6.87E−3
9      0.25   6.76E−11      3.46E−7        8.86E−15
       0.50   1.93E−7       1.15E−7        1.02E−13
       1.00   1.28E−7       1.45E−10       6.43E−14
       2.00   4.20E−12      3.41E−16       3.19E−19
       4.00   2.43E−10      2.73E−11       1.71E−17
12     0.25   2.14E−8       1.02E−7        4.13E−14
       0.50   3.01E−7       4.48E−8        1.61E−13
       1.00   4.17E−12      3.06E−12       1.42E−16
       2.00   3.70E−10      1.64E−16       5.68E−21
       4.00   2.71E−15      6.76E−17       1.52E−21


Table 13
Speedups of schedules on the task graphs of Gauss–Jordan elimination. Entries are Ave (SD); only the average is reported for HEFT.

p = 2
Size   CCR    BOA          AIS          CPGA         HEFT
6      0.25   1.93 (0.03)  1.93 (0.03)  1.92 (0.04)  1.88
       0.50   1.91 (0.03)  1.91 (0.03)  1.88 (0.03)  1.81
       1.00   1.85 (0.02)  1.84 (0.02)  1.79 (0.04)  1.70
       2.00   1.63 (0.05)  1.61 (0.05)  1.60 (0.05)  1.40
       4.00   1.29 (0.03)  1.28 (0.03)  1.27 (0.04)  1.26
9      0.25   2.08 (0.02)  1.94 (0.03)  2.00 (0.03)  1.89
       0.50   2.06 (0.02)  1.92 (0.03)  1.97 (0.03)  1.87
       1.00   2.02 (0.02)  1.85 (0.04)  1.91 (0.04)  1.78
       2.00   1.93 (0.02)  1.70 (0.04)  1.77 (0.04)  1.63
       4.00   1.62 (0.03)  1.37 (0.03)  1.54 (0.05)  1.20
12     0.25   2.12 (0.02)  1.95 (0.01)  2.01 (0.03)  1.95
       0.50   2.12 (0.02)  1.94 (0.03)  1.99 (0.03)  1.93
       1.00   2.10 (0.02)  1.91 (0.03)  1.95 (0.03)  1.89
       2.00   2.04 (0.01)  1.80 (0.03)  1.87 (0.03)  1.80
       4.00   1.91 (0.02)  1.63 (0.03)  1.70 (0.04)  1.60

p = 4
6      0.25   3.01 (0.06)  2.75 (0.07)  2.87 (0.09)  2.72
       0.50   2.71 (0.05)  2.51 (0.05)  2.58 (0.08)  2.44
       1.00   2.12 (0.04)  2.03 (0.04)  2.05 (0.04)  1.97
       2.00   1.76 (0.05)  1.59 (0.04)  1.65 (0.06)  1.35
       4.00   1.33 (0.04)  1.15 (0.04)  1.26 (0.06)  1.15
9      0.25   3.63 (0.05)  3.37 (0.05)  3.27 (0.08)  3.29
       0.50   3.42 (0.05)  3.23 (0.06)  3.04 (0.08)  3.03
       1.00   2.89 (0.05)  2.66 (0.05)  2.61 (0.07)  2.52
       2.00   2.33 (0.03)  1.99 (0.04)  2.05 (0.07)  1.87
       4.00   1.74 (0.05)  1.34 (0.05)  1.50 (0.06)  1.25
12     0.25   4.02 (0.06)  3.71 (0.09)  3.44 (0.08)  3.60
       0.50   3.89 (0.05)  3.57 (0.06)  3.28 (0.09)  3.49
       1.00   3.60 (0.05)  3.24 (0.05)  2.99 (0.08)  3.12
       2.00   3.00 (0.07)  2.84 (0.03)  2.42 (0.08)  2.44
       4.00   2.25 (0.04)  1.98 (0.04)  1.77 (0.09)  1.64

p = 8
6      0.25   3.35 (0.10)  3.06 (0.07)  3.20 (0.11)  2.97
       0.50   2.92 (0.06)  2.76 (0.05)  2.71 (0.09)  2.64
       1.00   2.20 (0.04)  2.04 (0.03)  2.05 (0.06)  1.96
       2.00   1.79 (0.05)  1.50 (0.03)  1.58 (0.07)  1.39
       4.00   1.32 (0.04)  1.13 (0.03)  1.15 (0.08)  1.05
9      0.25   4.52 (0.11)  4.16 (0.08)  4.27 (0.11)  3.97
       0.50   3.87 (0.06)  3.56 (0.08)  3.65 (0.08)  3.37
       1.00   2.94 (0.04)  2.67 (0.05)  2.78 (0.06)  2.58
       2.00   2.32 (0.06)  1.86 (0.03)  2.05 (0.07)  1.75
       4.00   1.70 (0.06)  1.13 (0.02)  1.40 (0.08)  1.06
12     0.25   5.55 (0.10)  5.24 (0.12)  5.05 (0.17)  5.15
       0.50   4.95 (0.08)  4.76 (0.08)  4.46 (0.16)  4.56
       1.00   4.00 (0.06)  3.84 (0.05)  3.60 (0.11)  3.52
       2.00   3.03 (0.05)  2.73 (0.04)  2.57 (0.09)  2.42
       4.00   2.09 (0.03)  1.82 (0.02)  1.72 (0.09)  1.59

Table 14
p-Values obtained by an upper-tailed paired t-test on the task graphs of Gauss–Jordan elimination.

p = 2
Size   CCR    BOA vs. AIS   BOA vs. CPGA   BOA vs. HEFT
6      0.25   2.38E−23      1.79E−3        1.47E−6
       0.50   2.38E−23      2.38E−7        7.59E−11
       1.00   2.38E−23      1.85E−7        9.18E−16
       2.00   2.38E−23      2.22E−2        1.18E−13
       4.00   2.38E−23      2.00E−2        7.20E−4
9      0.25   3.81E−17      8.87E−13       6.77E−20
       0.50   5.93E−16      6.17E−14       9.28E−23
       1.00   8.28E−16      5.33E−12       1.49E−21
       2.00   1.18E−14      5.69E−17       1.62E−21
       4.00   4.75E−16      2.67E−7        6.86E−17
12     0.25   8.26E−21      2.87E−12       6.63E−19
       0.50   4.78E−18      6.90E−12       2.00E−19
       1.00   8.28E−16      9.68E−12       2.96E−20
       2.00   1.21E−18      7.35E−17       4.19E−24
       4.00   1.88E−16      2.07E−14       1.11E−21

p = 4
6      0.25   4.83E−9       1.20E−9        2.09E−14
       0.50   1.28E−9       1.34E−7        5.10E−15
       1.00   2.40E−6       7.01E−10       7.46E−13
       2.00   1.23E−8       3.22E−7        8.47E−17
       4.00   2.46E−12      3.70E−6        3.14E−13
9      0.25   1.21E−14      1.20E−12       3.46E−18
       0.50   1.27E−11      9.15E−15       2.69E−17
       1.00   1.04E−11      3.90E−13       3.75E−22
       2.00   8.41E−17      3.65E−12       4.10E−20
       4.00   5.20E−6       9.57E−13       6.83E−23
12     0.25   2.56E−11      5.29E−17       2.48E−17
       0.50   3.54E−12      3.79E−17       5.07E−18
       1.00   2.54E−19      2.32E−19       1.39E−20
       2.00   1.04E−9       3.37E−18       1.25E−18
       4.00   5.71E−14      6.83E−16       9.25E−23

p = 8
6      0.25   3.58E−14      4.63E−8        1.25E−14
       0.50   3.73E−8       3.96E−9        1.17E−17
       1.00   6.75E−11      1.64E−8        2.20E−23
       2.00   4.08E−14      2.35E−12       5.59E−20
       4.00   9.40E−17      4.15E−9        1.87E−20
9      0.25   4.58E−10      3.47E−9        1.29E−17
       0.50   1.14E−12      4.21E−11       9.12E−19
       1.00   9.74E−15      1.29E−12       2.61E−18
       2.00   2.88E−18      1.20E−13       2.41E−20
       4.00   1.12E−19      1.53E−10       7.17E−20
12     0.25   2.05E−9       3.03E−10       4.86E−13
       0.50   1.94E−9       6.29E−12       2.75E−14
       1.00   1.09E−9       8.22E−12       1.62E−17
       2.00   2.38E−8       1.60E−15       4.81E−22
       4.00   3.82E−8       5.94E−15       3.23E−23

the improvement of our algorithm becomes more and more remarkable. The largest improvements reach 28.87%, 27.12% and 38.33% for Gauss elimination, Gauss–Jordan elimination and FFT, respectively, when comparing the proposed algorithm with CPGA. Furthermore, according to the results in

Tables 12, 14 and 16, it can safely be concluded that the proposed algorithm is statistically better than CPGA at a significance level of 0.05. The evolutionary trends over generations are also compared among the proposed algorithm, AIS, and CPGA.

Table 15
Speedups of schedules on the task graphs of FFT. Cells give Ave (SD) over runs; HEFT is deterministic, so only Ave is reported.

p = 2
Size  CCR   BOA          AIS          CPGA         HEFT
4     0.25  1.89 (0.04)  1.90 (0.04)  1.89 (0.04)  1.76
4     0.50  1.85 (0.04)  1.86 (0.04)  1.85 (0.04)  1.69
4     1.00  1.70 (0.04)  1.72 (0.04)  1.71 (0.04)  1.51
4     2.00  1.41 (0.04)  1.43 (0.04)  1.41 (0.03)  1.25
4     4.00  1.09 (0.03)  1.11 (0.03)  1.09 (0.03)  1.02
8     0.25  2.07 (0.04)  2.01 (0.03)  2.08 (0.04)  1.94
8     0.50  2.06 (0.04)  1.98 (0.02)  2.06 (0.04)  1.92
8     1.00  2.02 (0.03)  1.97 (0.04)  2.05 (0.03)  1.93
8     2.00  1.94 (0.03)  1.93 (0.04)  1.91 (0.03)  1.91
8     4.00  1.80 (0.03)  1.78 (0.03)  1.81 (0.04)  1.76
16    0.25  2.15 (0.03)  1.99 (0.02)  2.04 (0.03)  1.95
16    0.50  2.14 (0.03)  1.98 (0.01)  2.03 (0.03)  1.94
16    1.00  2.13 (0.03)  1.96 (0.03)  2.01 (0.03)  1.93
16    2.00  2.09 (0.03)  1.91 (0.02)  1.97 (0.03)  1.89
16    4.00  2.03 (0.03)  1.87 (0.03)  1.87 (0.03)  1.81

p = 4
Size  CCR   BOA          AIS          CPGA         HEFT
4     0.25  2.65 (0.08)  2.49 (0.10)  2.60 (0.09)  2.29
4     0.50  2.31 (0.08)  2.17 (0.07)  2.25 (0.07)  2.02
4     1.00  1.82 (0.05)  1.75 (0.05)  1.78 (0.05)  1.59
4     2.00  1.52 (0.06)  1.38 (0.04)  1.43 (0.06)  1.25
4     4.00  1.13 (0.04)  1.08 (0.03)  1.06 (0.04)  1.04
8     0.25  3.79 (0.07)  3.63 (0.10)  3.75 (0.11)  3.61
8     0.50  3.65 (0.06)  3.62 (0.07)  3.64 (0.12)  3.57
8     1.00  3.37 (0.05)  3.35 (0.07)  3.32 (0.12)  3.29
8     2.00  3.01 (0.06)  2.79 (0.07)  2.67 (0.12)  2.62
8     4.00  2.43 (0.04)  1.94 (0.05)  1.99 (0.12)  1.81
16    0.25  4.28 (0.07)  3.86 (0.05)  3.61 (0.07)  3.60
16    0.50  4.20 (0.07)  3.80 (0.05)  3.51 (0.09)  3.56
16    1.00  4.03 (0.06)  3.63 (0.07)  3.35 (0.09)  3.39
16    2.00  3.72 (0.05)  3.30 (0.03)  2.96 (0.10)  3.17
16    4.00  3.14 (0.05)  2.65 (0.04)  2.27 (0.09)  2.52

p = 8
Size  CCR   BOA          AIS          CPGA         HEFT
4     0.25  2.85 (0.12)  2.63 (0.10)  2.81 (0.12)  2.38
4     0.50  2.42 (0.09)  2.25 (0.08)  2.38 (0.09)  2.10
4     1.00  1.84 (0.05)  1.75 (0.05)  1.81 (0.05)  1.62
4     2.00  1.52 (0.05)  1.32 (0.05)  1.39 (0.06)  1.27
4     4.00  1.13 (0.04)  1.09 (0.05)  1.02 (0.06)  1.04
8     0.25  5.55 (0.14)  5.28 (0.09)  5.42 (0.19)  5.12
8     0.50  5.00 (0.14)  4.62 (0.10)  4.73 (0.14)  4.47
8     1.00  3.98 (0.09)  3.64 (0.07)  3.68 (0.11)  3.47
8     2.00  3.30 (0.07)  2.98 (0.07)  2.77 (0.12)  2.64
8     4.00  2.47 (0.03)  2.10 (0.04)  2.00 (0.16)  1.84
16    0.25  6.97 (0.13)  6.49 (0.11)  5.69 (0.25)  6.22
16    0.50  6.53 (0.10)  6.14 (0.10)  5.26 (0.23)  5.90
16    1.00  5.55 (0.10)  5.32 (0.05)  4.56 (0.21)  5.09
16    2.00  4.36 (0.06)  4.13 (0.06)  3.47 (0.15)  3.93
16    4.00  3.46 (0.06)  2.92 (0.02)  2.37 (0.11)  2.64
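The speedup reported in these tables is conventionally the ratio of the sequential execution time to the schedule's makespan; in the heterogeneous-scheduling literature the sequential time is usually taken as the sum of each task's minimum computation cost over all processors. The exact definition used here is an assumption, and the small instance below is hypothetical.

```python
def speedup(comp_costs, makespan):
    """Speedup of a schedule: sequential time (each task on its fastest
    processor) divided by the makespan. comp_costs[i][p] is the cost of
    task i on processor p. This mirrors the common convention; the paper's
    exact definition is assumed, not quoted."""
    seq = sum(min(row) for row in comp_costs)
    return seq / makespan

# Hypothetical 4-task, 2-processor instance with a schedule of makespan 9.
costs = [[4, 6], [3, 5], [8, 7], [2, 2]]
print(round(speedup(costs, 9), 2))  # sequential time = 4+3+7+2 = 16 → 1.78
```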


J. Yang et al. / Applied Soft Computing 11 (2011) 3297–3310

Table 16
p-Values obtained by an upper-tailed paired t-test on the task graphs of FFT.

p = 2
Size  CCR   BOA vs. AIS  BOA vs. CPGA  BOA vs. HEFT
4     0.25  8.43E−1      4.81E−1       3.38E−12
4     0.50  7.82E−1      3.03E−1       7.34E−12
4     1.00  8.81E−1      5.31E−4       2.90E−13
4     2.00  7.47E−1      2.22E−1       1.05E−13
4     4.00  6.38E−1      2.39E−2       1.65E−8
8     0.25  9.71E−10     3.20E−1       2.42E−11
8     0.50  1.39E−7      1.13E−1       3.56E−12
8     1.00  2.37E−8      5.03E−1       6.92E−10
8     2.00  5.78E−7      8.97E−5       3.81E−4
8     4.00  4.33E−2      5.22E−1       2.95E−6
16    0.25  2.44E−14     2.62E−18      6.01E−16
16    0.50  7.21E−16     1.28E−14      5.32E−16
16    1.00  2.30E−12     5.26E−16      3.10E−16
16    2.00  2.13E−14     1.59E−15      3.46E−16
16    4.00  2.82E−15     2.25E−15      1.26E−17

p = 4
Size  CCR   BOA vs. AIS  BOA vs. CPGA  BOA vs. HEFT
4     0.25  4.93E−9      5.25E−5       5.35E−14
4     0.50  5.16E−10     6.99E−7       1.98E−12
4     1.00  1.23E−8      8.69E−6       1.71E−14
4     2.00  3.65E−8      3.67E−6       4.59E−14
4     4.00  3.92E−10     7.24E−8       7.82E−10
8     0.25  5.24E−6      6.03E−2       6.72E−10
8     0.50  1.60E−2      2.94E−1       1.14E−5
8     1.00  6.99E−10     6.92E−4       2.83E−6
8     2.00  6.99E−10     3.22E−10      5.18E−16
8     4.00  2.41E−18     2.27E−13      4.05E−23
16    0.25  1.75E−14     1.02E−19      3.32E−20
16    0.50  7.45E−16     6.35E−19      1.57E−18
16    1.00  1.70E−18     5.56E−19      7.08E−21
16    2.00  2.73E−18     2.37E−17      2.88E−21
16    4.00  7.00E−20     1.62E−20      3.28E−21

p = 8
Size  CCR   BOA vs. AIS  BOA vs. CPGA  BOA vs. HEFT
4     0.25  6.86E−14     5.83E−5       1.60E−12
4     0.50  2.50E−11     2.83E−4       3.49E−12
4     1.00  1.11E−12     2.60E−5       1.30E−13
4     2.00  8.07E−15     9.73E−12      1.60E−14
4     4.00  2.72E−5      1.37E−10      1.19E−9
8     0.25  6.89E−7      1.28E−3       4.62E−11
8     0.50  3.17E−9      6.05E−6       8.98E−13
8     1.00  1.30E−10     1.20E−10      5.41E−16
8     2.00  8.48E−13     5.63E−14      2.72E−19
8     4.00  2.24E−18     1.72E−11      4.47E−26
16    0.25  4.35E−10     8.05E−16      8.50E−16
16    0.50  9.80E−11     4.87E−16      3.14E−16
16    1.00  2.72E−8      2.46E−18      5.36E−14
16    2.00  1.09E−13     1.92E−14      9.52E−18
16    4.00  1.05E−18     8.97E−19      1.03E−22

Fig. 7 depicts the results when these algorithms handle Gauss–Jordan elimination and Gauss elimination with matrix dimension 6, and FFT on 4 data points. The number of available processors is 8 and the CCR is set to 0.25. According to Fig. 7, the gaps in the number of generations needed to converge to the final schedules are not significant when comparing our algorithm with AIS and CPGA.

7. Parameter analysis

In this section, two parameters of the algorithm are discussed: the interval of structural learning on Bayesian networks and the mutation rate. The performance sensitivity to each is investigated by experiments on Gauss–Jordan elimination with matrix size 6. In the experiments, the number of available processors is set to 8 and the CCR is set to 0.25.

7.1. Performance sensitivity to the interval of structural learning

As mentioned in Section 4.4, it is more efficient and reliable to perform structure learning only once every several iterations. In this subsection, different intervals of structure update (from 5 to 30) are examined. The results are depicted in Fig. 8(a), in which the solid line (left Y-axis) indicates the change in speedup and the dotted line (right Y-axis) represents the execution time relative to runs without structural update. An interval of 0 means that the structure of the Bayesian network remains unchanged after it is initialized according to the task graph.

Fig. 8(a) shows that the algorithm performs best when the update interval is set to 10, 15 or 20. The speedup degrades with update intervals below 10 or above 20, but not significantly. This illustrates that the structure update, which is always performed on top of the structure of the initial Bayesian network (as mentioned in Section 4.4), does not improve the results very much. Since the initial structure is built following the task graph of the scheduling problem, this implies that the dependencies between tasks are mainly caused by immediate ancestral relationships in the task graph. On the other hand, the execution time of the algorithm, represented by the dotted line, fluctuates only slightly. Consequently, the interval can be set within a wide range.

7.2. Performance sensitivity to the mutation rate

The effect of mutation on the performance is investigated as well. The algorithm is rerun with various mutation rates ranging
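The two-level update schedule described in Section 7.1 (structure relearned only every few generations, parameters refit every generation) can be sketched as follows. All names and function bodies are hypothetical stand-ins, not the paper's model; a univariate frequency model replaces the Bayesian network so the loop is runnable.

```python
import random

def evolve(n_genes=8, pop_size=40, generations=30, interval=10, seed=1):
    """Toy evolution loop illustrating the structure-update interval.
    Structure learning is a placeholder counter here; interval == 0 means
    the structure fixed at initialization is never updated."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    structure_updates = 0
    for g in range(generations):
        elite = sorted(pop, key=sum, reverse=True)[: pop_size // 2]  # fitness = ones count
        if interval and g % interval == 0:
            structure_updates += 1  # placeholder for relearning network edges
        # Parameters are refit from the elite every generation.
        freq = [sum(ind[i] for ind in elite) / len(elite) for i in range(n_genes)]
        pop = [[1 if rng.random() < f else 0 for f in freq] for _ in range(pop_size)]
    return structure_updates

print(evolve(interval=10))  # structure relearned at g = 0, 10, 20 → 3 times
```

With interval = 0 the (cheap) parameter refit still runs every generation, which matches the paper's observation that skipping structure updates costs little performance.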

p=8

from 0.00 to 0.20 in increments of 0.02. As in Fig. 8(a), the solid line and the dotted line in Fig. 8(b) describe the speedup and the execution time relative to runs without mutation, respectively. A mutation rate of 0.00 means the algorithm runs without the mutation operator.

From Fig. 8(b), it is clear that even without mutation our algorithm produces better results than the related ones: it delivers a schedule with a speedup of 3.25, while AIS, CPGA and HEFT reach 3.06, 3.20 and 2.97, respectively (see Table 13). The mutation operator improves the results further. However, the execution time rises significantly as the mutation rate increases. More specifically, running the algorithm with a mutation rate of 0.2 requires more than ten times the computation time of running without the mutation operator. The reason is that a larger mutation rate requires more iterations, because frequent mutation slows convergence. Therefore, a mutation rate below 0.10 appears sufficient to guarantee the performance at relatively low computation cost.

8. Discussion

From the experimental results in Sections 5 and 6, it can be concluded that the proposed algorithm delivers more efficient schedules than the other algorithms, for the following reasons.

First, since tasks are assigned to definite processors before prioritization, tasks can be prioritized according to their real execution times on the specified processors and the real communication costs. Consequently, the priority of each task reflects more exactly which task is critical for potentially reducing the makespan of a schedule, so scheduling based on these priorities produces more competent results.

Next, in order to capture the relationships between dependent tasks, the proposed algorithm initializes and builds Bayesian networks according to the task graph of the scheduling problem. The global information of the scheduling problem is therefore embedded in the built model, and guided by such models the algorithm is able to generate more promising candidate schedules during the evolution.

Additionally, a mutation operator is employed to complement the sampling of Bayesian networks when generating offspring. The idea behind the operator is to make full use of the local information of the promising solutions while utilizing the global statistical information encoded in the Bayesian networks. Rather than being mutated after the entire solution is generated, variables are flipped immediately after their values are sampled, according to the mutation rate. The sampling integrated with mutation produces solutions that hopefully fall in or around a promising region characterized by the Bayesian networks. Therefore, the dependencies modeled by the Bayesian networks are maintained along the search, and the efficiency of the evolution is improved.

Fig. 7. Comparison of the evolutionary trends between the proposed algorithm (i.e., BOA), AIS and CPGA: average speedup versus generation for (a) Gauss elimination, (b) Gauss–Jordan elimination and (c) FFT. [Plot data omitted.]

Fig. 8. Performance comparison with different parameter settings. The speedup and the computation time are plotted against the left and right Y-axes, respectively: (a) with different intervals of structural update and (b) with different mutation rates. [Plot data omitted.]
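The in-line mutation just described can be sketched as follows. This assumes ancestral (probabilistic logic) sampling in topological order of the network; the function names and the toy three-variable chain are hypothetical, not the paper's model.

```python
import random

def sample_with_mutation(order, conditionals, rate, rng):
    """Sample a solution variable by variable; each value may be flipped with
    probability `rate` immediately after it is drawn, so later variables are
    sampled conditioned on the mutated values (rather than mutating the
    finished solution afterwards)."""
    solution = {}
    for var in order:                      # topological order of the Bayesian network
        p1 = conditionals(var, solution)   # P(var = 1 | already-sampled parents)
        value = 1 if rng.random() < p1 else 0
        if rng.random() < rate:            # in-line mutation
            value = 1 - value
        solution[var] = value
    return solution

# Toy 3-variable chain: P(v = 1) depends on the previous variable's value.
def cond(var, partial):
    prev = partial.get(var - 1)
    return 0.5 if prev is None else (0.9 if prev == 1 else 0.1)

rng = random.Random(7)
print(sample_with_mutation([0, 1, 2], cond, rate=0.05, rng=rng))
```

Because the flip happens before descendants are sampled, the remaining variables stay consistent with the mutated value, which is why solutions land in or near the region the network models.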


9. Conclusions

This paper has proposed a BOA-based multiprocessor scheduling algorithm for heterogeneous computing environments. In the algorithm, Bayesian networks are first initialized and learned according to the task graph of the scheduling problem to find the optimal assignment of tasks to processors, and then the heuristic-based priority used in the list scheduling approach determines the execution sequence of tasks on the same processor. Empirical studies on random task graphs and benchmark applications show that the algorithm delivers more competent results than related approaches, and further experiments indicate that the proposed algorithm maintains almost the same performance under different parameter settings.

Future research includes evaluating the proposed algorithm in real heterogeneous computing environments. Moreover, since the proposed algorithm only adds dependencies to the Bayesian network after its initialization, it is worth designing efficient strategies to remove redundant dependencies during the evolution.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable suggestions. This work was supported by the National Natural Science Foundation of China (Grant No. 60875073), the National Key Technology R&D Program of China (Grant No. 2009BAG12A08) and the Important National Science & Technology Specific Projects of China (Grant No. 2009ZX02001).