Computers & Operations Research 40 (2013) 2670–2676
An efficient implementation of the Min-Min heuristic

Pablo Ezzatti, Martín Pedemonte, Álvaro Martín

Instituto de Computación, Universidad de la República, 11300 Montevideo, Uruguay
Available online 30 May 2013

Abstract
Min-Min is a popular heuristic for scheduling tasks to heterogeneous computational resources, which has been applied either directly or as part of more sophisticated heuristics. However, for large scenarios such as grid computing platforms, the time complexity of a straightforward implementation of Min-Min, which is quadratic in the number of tasks, may be prohibitive. This has motivated the development of high performance computing (HPC) implementations, and the use of simpler heuristics for the sake of acceptable execution times. We propose a simple algorithm that implements Min-Min requiring only O(mn) operations for scheduling n tasks on m machines. Our experiments show that, in practice, a straightforward sequential implementation of this algorithm significantly outperforms other state-of-the-art implementations of Min-Min, even compared to HPC implementations. In addition, the proposed algorithm is at least as suitable for parallelization as a direct implementation of Min-Min.
Keywords: Heterogeneous computing; Grid computing; Scheduling; Min-Min heuristic
1. Introduction

A distributed heterogeneous computing (HC) environment, such as a grid computing platform, can employ thousands or even millions of computational resources to solve several difficult problems, involving the simultaneous execution of a large number of tasks. In this context, an efficient use of the resources is a critical issue that calls for schedules with short makespan, defined as the time required to complete the execution of all the tasks according to the schedule [1]. It is well known, however, that computing a minimum makespan schedule, for a given estimate of the execution time of each task on each machine, is an NP-hard problem [2,3]. This has motivated a large research effort focused on heuristic methods, trying to find accurate solutions with moderate runtime (see, e.g., the comparison of heuristics in [1,4,5]).

One of the most popular techniques for scheduling tasks in HC environments is the deterministic heuristic Min-Min [6, Algorithm D], which, in general, yields good schedules in acceptable execution times [5]. For this reason, Min-Min has also been used as a building block in more sophisticated heuristics [7-9], and applied to produce an initial solution that is then successively improved, typically using some local search method [1,9]. Min-Min is a greedy algorithm; having constructed a partial schedule for a subset of the given set of tasks, the schedule is extended in such a way that the makespan increment is minimum. A direct implementation of Min-Min requires O(mn²) operations [6]
to schedule n tasks on m machines. (In [6], m is regarded as a constant.) Although it is faster than other more complex heuristics (e.g., cellular memetic algorithms [10], ant colony optimization [11], and tabu search algorithms [12]), a time complexity that is quadratic in n may be prohibitive in real modern HC environments, where several thousands of tasks need to be scheduled. As a consequence, some authors have proposed simplifications of Min-Min [13]. Others have explored high performance computing (HPC) implementations, including the use of parallel techniques and non-traditional hardware, such as Graphics Processing Units (GPUs) [14,15].

In this paper we propose an efficient implementation of Min-Min. The algorithm works in two phases: first, for each machine, the tasks are sorted in nondecreasing order of execution time estimate; then, in a second phase, the Min-Min schedule is constructed. Using a radix exchange sort [16] for the first phase, and exploiting in the second phase the fact that, for each machine, the tasks have been sorted, we show that the algorithm requires only O(mn) operations on registers of length O(log mn). Since, for each machine, the algorithm stores the permutation that sorts the set of tasks, some additional memory is required with respect to a direct implementation of Min-Min: the latter requires O(mn) space, while the space complexity of the new algorithm is O(mn log n). Our experiments demonstrate, in practice, that this new algorithm significantly outperforms the state-of-the-art implementations of Min-Min, even compared to HPC implementations.

Summing up, the main contributions of this paper are:
- a new algorithm that efficiently implements the Min-Min heuristic,
- a theoretical analysis of the algorithm, and
- an experimental evaluation of the algorithm.

The rest of the paper is structured as follows. In Section 2, we review the related work in more detail. In Section 3, we recall the formal definition of the Min-Min heuristic and establish the notation for the rest of the paper. In Section 4, we introduce our proposal, an efficient algorithm to compute the Min-Min heuristic, and analyze it theoretically; we also discuss, briefly, some practical considerations related to the choice of the sorting algorithm and the potential parallelization of our implementation of Min-Min. In Section 5 we present experimental results that validate our proposal and, finally, we discuss conclusions and future work in Section 6.
2. Related work

Ibarra and Kim presented one of the pioneering works on static heterogeneous computing scheduling [6], in which five different heuristics were evaluated, including Min-Min. Additionally, the authors studied other strategies for two particular cases: when the tasks have to be scheduled on only two machines, and when the machines are identical. Since the pioneering work of Ibarra and Kim, many researchers have shown the benefits of the Min-Min heuristic for heterogeneous computing scheduling, since it makes it possible to obtain good quality solutions in an acceptable runtime. Some of these works are summarized below.

Braun et al. [1] experimentally studied 11 heuristics for static scheduling in HC environments, including a wide range of simple greedy constructive heuristic approaches and Min-Min. Xhafa et al. [4] have also evaluated several static scheduling strategies, including Min-Min. In the same line of work, Luo et al. [5] analyzed and compared a set of 20 greedy heuristics under different conditions.

In another line of work, researchers have proposed several extensions of Min-Min, or new algorithms with several points of contact with this heuristic. Segmented Min-Min, proposed by Wu et al. [7], is an algorithm closely related to Min-Min. In this algorithm, tasks are sorted according to some score function of the expected time to compute on all machines (it could be the maximum, minimum or average expected time to compute among all machines). Then, the ordered sequence is segmented into groups, and finally Min-Min is applied to each of these groups of tasks.

Another common idea is to apply a local search method in order to improve the quality of the solution obtained with Min-Min. Ritchie and Levine [11] proposed a local search method that selects the best task movement from a neighborhood of solutions, which includes the solutions where a single task from the most loaded machine is moved to another machine, or swapped with a task that executes on another machine. On the other hand, Pinel et al. [9] proposed the H2LL local search operator, which randomly chooses a task from the most loaded machine and moves it to one of the least loaded machines.

Other interesting extensions are briefly described next. He et al. [17] proposed a QoS (Quality of Service) guided Min-Min heuristic that can guarantee the QoS requirements of certain tasks while minimizing the makespan at the same time. He and Sun [18] integrated the time associated with data movement into traditional scheduling on grid environments using the Min-Min heuristic. Finally, Chauhan and Joshi [8] combined Min-Min with the weighted mean time-min heuristic.

Despite the good results of Min-Min, when considering large scenarios (in both tasks and machines), the complexity of the algorithm makes its direct application impracticable. This is particularly important when working in grid environments, because they consist of thousands of machines, and the number of tasks to be
scheduled is of the order of several thousands. For this reason, two different approaches have been proposed to reduce the execution time of this heuristic: modifying the original heuristic so that it runs faster, or implementing the algorithm in parallel.

Diaz et al. [13] proposed a two-phase heuristic for the energy-efficient scheduling of independent tasks on computational grids that can be considered a simplification of Min-Min. In the first phase, the tasks are sorted by a certain criterion (average, minimum, or maximum expected time to compute). In the second phase, the tasks are processed in order, searching for the best machine assignment. This heuristic is in fact a smart extension of the B heuristic described by Ibarra and Kim [6].

The parallel implementation of Min-Min on GPU platforms [15] was studied considering three different approaches: a single-GPU version, and synchronous and asynchronous versions that execute on four GPUs concurrently. Pinel et al. [14] also addressed the parallelization of Min-Min for solving large scenarios, proposing both a multicore and a GPU implementation.

From all the reviewed works it is clear that the Min-Min heuristic has wide applicability and is widely used by the community to solve scheduling problems. On the other hand, the resolution of large scenarios, especially in grid environment contexts, poses an important challenge to this heuristic due to its computational complexity. For these reasons, a significant improvement in the computational efficiency of the algorithm could be welcomed and adopted by the community.
3. The Min-Min scheduling heuristic

In this section we formally define the scheduling problem and the Min-Min heuristic. We follow, loosely, the definitions and notation from [6].

Consider a set M of m machines and a set T of n tasks. We assume that an expected time to compute (ETC) of a task J on a machine i, denoted μ_i(J), is available for all J∈T and all i∈M. A schedule S = {L_i}_{i∈M} assigns a set L_i of tasks to be executed on machine i, for each i∈M. The makespan of S, denoted f̂(S), is the largest amount of execution time assigned by S to a machine i, among all i∈M, i.e.,

    f̂(S) = max_{i∈M} { ∑_{J∈L_i} μ_i(J) }.    (1)
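As a concrete illustration of (1), the following Python fragment computes the makespan of a given schedule (a minimal sketch of our own; the list-of-lists schedule representation and the function name are illustrative assumptions, not notation from [6]):

    def makespan(schedule, etc):
        """Eq. (1): the largest total ETC assigned to a single machine.

        schedule[i] is the list L_i of tasks assigned to machine i;
        etc[i][J] is the ETC mu_i(J) of task J on machine i.
        """
        return max(sum(etc[i][J] for J in L_i) for i, L_i in enumerate(schedule))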
Ideally, we look for a schedule with minimum makespan. As mentioned, however, this is computationally unaffordable for problem instances of large size, which motivated the definition of several heuristics in [6], including Min-Min.

Algorithm 1. Min-Min heuristic.

    input:  A set T of tasks, a set M of machines, and the ETC μ_i(J) for all J∈T, i∈M
    output: A schedule S = {L_i}_{i∈M}
    1   foreach i∈M do
    2       Set L_i = ∅, t_i = 0
    3   end
    4   Set U = T
    5   while U ≠ ∅ do
    6       Let (i,J) = arg min_{(i,J)} { t_i + μ_i(J) : i∈M, J∈U }   // Solve ties arbitrarily
    7       Set L_i = L_i ∪ {J}; t_i = t_i + μ_i(J); U = U \ {J}
    8   end
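For reference, a direct Python transcription of Algorithm 1 might read as follows (an illustrative sketch; ties in Step 6 are broken by iteration order, and the exhaustive search makes the cost O(mn²), as discussed below):

    def min_min_direct(etc):
        """Direct Min-Min (Algorithm 1) for an m x n ETC matrix etc,
        where etc[i][J] is the ETC of task J on machine i."""
        m, n = len(etc), len(etc[0])
        L = [[] for _ in range(m)]      # Step 2: L_i = empty set
        t = [0.0] * m                   # Step 2: t_i = 0
        U = set(range(n))               # Step 4: U = T
        while U:                        # Step 5
            # Step 6: O(m|U|) exhaustive search for the minimum finalization time.
            i, J = min(((i, J) for i in range(m) for J in U),
                       key=lambda p: t[p[0]] + etc[p[0]][p[1]])
            L[i].append(J)              # Step 7: schedule J on machine i
            t[i] += etc[i][J]
            U.remove(J)
        return L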
An algorithmic definition of Min-Min is shown in Algorithm 1. The algorithm iterates through a loop in Step 5, assigning one task to one machine in each iteration. For each machine i, the sum of the execution time of all tasks assigned so far to i is maintained in a variable t_i. In each iteration of the loop, Step 6 selects for scheduling a machine i and a task J, such that the finalization time of J on i, given by t_i + μ_i(J), is minimized among all tasks and machines.¹ Notice that, since several pairs (i,J) may attain the minimum in Step 6, Algorithm 1 may output different schedules depending on how ties are broken. Any such schedule is called a Min-Min schedule. As a byproduct of the algorithm, the makespan of the produced schedule, S, can be obtained as f̂(S) = max_{i∈M} {t_i}.

¹ Equivalently, Step 6 is formulated in [6] as a two-step minimization. First, the minimum completion time for each unassigned task on all machines is calculated, in order to find the best machine to execute each task. Then, the task-machine pair with overall minimum completion time is chosen among all the pairs determined in the first stage.

As observed in [6, Theorem 1], a direct implementation of Min-Min requires O(mn²) operations. Indeed, an exhaustive search over U × M to find the minimum in Step 6 requires O(m|U|) operations, and this step is repeated n times, with |U| varying from n down to 1.

4. An efficient implementation of Min-Min

In this section we present an efficient implementation of the Min-Min heuristic, which is the main contribution of this paper. We prove that the algorithm is correct and analyze its time and space complexity. Finally, we discuss some practical implementation details.

4.1. The algorithm

Our implementation of Min-Min is based on the following simple observation, which allows for a fast implementation of the minimization in Step 6.

Lemma 4.1. Let i∈M be a machine and let J, J′ be two tasks scheduled to i in iterations number ℓ, ℓ′ of the loop in Step 5, respectively. If ℓ ≤ ℓ′, then μ_i(J) ≤ μ_i(J′).

Proof. Since ℓ ≤ ℓ′, at iteration number ℓ both J and J′ belong to U. Thus, the selection of (i,J) in Step 6 implies that μ_i(J) ≤ μ_i(J′). □

Since, by Lemma 4.1, the tasks scheduled to a machine i are delivered in nondecreasing order of ETC, the minimization in Step 6 is simplified if, for each machine, the set T of tasks is sorted in that order. Specifically, for each i∈M, let J_i be a mapping from {1, 2, …, n} onto T, such that

    μ_i(J_i(r)) ≤ μ_i(J_i(r′))   if r ≤ r′,  r, r′ ∈ {1, 2, …, n}.    (2)

The proposed algorithm, Algorithm 2, starts by computing the mappings J_i, i∈M. In addition, for each machine i, a variable n_i maintains a pointer to the first task, in ETC order, that has not been assigned yet. The loop in Step 5 follows essentially the same steps as in Algorithm 1, except that each variable n_i, i∈M, is updated in the loop in Step 7. Notice also that the minimization in two variables, i and J, in Step 6 of Algorithm 1 is now replaced by a minimization in the single variable i in Step 11 of Algorithm 2. This fact yields a dramatic reduction in the time complexity of Algorithm 2 with respect to Algorithm 1.

Algorithm 2. An efficient implementation of Min-Min.

    input:  A set T of tasks, a set M of machines, and the ETC μ_i(J) for all J∈T, i∈M
    output: A schedule S = {L_i}_{i∈M}
    1   foreach i∈M do
    2       Set L_i = ∅, t_i = 0, n_i = 1, and compute J_i
    3   end
    4   Set U = T
    5   while U ≠ ∅ do
    6       foreach i∈M do
    7           while J_i(n_i) ∉ U do
    8               Increment n_i
    9           end
    10      end
    11      Let i = arg min_i { t_i + μ_i(J_i(n_i)) : i∈M }   // Solve ties arbitrarily
    12      Set L_i = L_i ∪ {J_i(n_i)}; t_i = t_i + μ_i(J_i(n_i)); U = U \ {J_i(n_i)}
    13  end
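A Python transcription of Algorithm 2 could look as follows (a sketch of our own; the built-in sorted() stands in for the radix exchange sort of Step 2, so the first phase of this version costs O(mn log n) rather than O(mn), and the pointers n_i are 0-based here):

    def min_min_fast(etc):
        """Efficient Min-Min (Algorithm 2) for an m x n ETC matrix etc."""
        m, n = len(etc), len(etc[0])
        # Step 2: J_i = permutation of tasks in nondecreasing ETC order, per Eq. (2).
        Ji = [sorted(range(n), key=lambda J: etc[i][J]) for i in range(m)]
        L = [[] for _ in range(m)]
        t = [0.0] * m
        in_U = [True] * n               # Boolean representation of the set U
        ni = [0] * m                    # pointers n_i (0-based)
        for _ in range(n):              # the main loop runs exactly n times
            # Steps 6-10: advance each pointer past tasks no longer in U.
            for i in range(m):
                while not in_U[Ji[i][ni[i]]]:
                    ni[i] += 1
            # Step 11: minimization over machines only, not over tasks.
            i = min(range(m), key=lambda h: t[h] + etc[h][Ji[h][ni[h]]])
            J = Ji[i][ni[i]]
            L[i].append(J)              # Step 12
            t[i] += etc[i][J]
            in_U[J] = False
        return L

Because each pointer n_i only moves forward, the total work of Steps 6-10 over the whole run is O(mn), matching the analysis in Section 4.2.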
The following theorem establishes the correctness of Algorithm 2.

Theorem 4.1. Algorithm 2 correctly constructs a Min-Min schedule.

Proof. After the initialization in Step 2, the only step of Algorithm 2 that updates the value of n_i, i∈M, is Step 8, where n_i is incremented if and only if J_i(n_i) ∉ U. Since no task is added to U after Step 4, this implies that, at every time during the execution of the main loop in Step 5, we have, for each i∈M,

    J_i(r) ∉ U   for all r < n_i.    (3)

Hence, since, by the condition checked in Step 5, the set U is nonempty, the loop in Step 6 must finish before the value of n_i exceeds n. At this time, by the condition in the loop of Step 7, we must have J_i(n_i) ∈ U. Therefore, by (3) and the definition of J_i in (2), at the time of executing Step 11, the value of n_i satisfies, for each i∈M,

    μ_i(J_i(n_i)) = min{ μ_i(J) : J∈U },    (4)

or, adding t_i on both sides,

    t_i + μ_i(J_i(n_i)) = t_i + min{ μ_i(J) : J∈U }    (5)
                        = min{ t_i + μ_i(J) : J∈U }.    (6)

By (6), the machine i selected in Step 11 satisfies

    t_i + μ_i(J_i(n_i)) = min{ t_h + μ_h(J) : h∈M, J∈U },

which implies that scheduling the task J_i(n_i) to machine i is correct according to the Min-Min policy. □

4.2. Complexity analysis

We now analyze the time complexity and memory requirements of Algorithm 2. We assume that ETCs are represented with some fixed numeric precision, independent of n and m. Therefore, the computation of J_i in Step 2 can be implemented with a radix exchange sort algorithm [16], which requires O(n) operations.

Theorem 4.2. Assume a constant precision representation of the ETCs. Then:
1. Algorithm 2 requires O(mn log n) memory.
2. Algorithm 2 requires O(mn) operations on registers of O(log mn) bits.²

² Thus, as usually assumed, memory access operations such as indexing into an array of size O(mn) and transferring O(log n) bits require O(1) time. It is readily verified that the number of bit operations required by Algorithm 2 is O(mn log mn).
Proof. We start with Part 1. We identify each task with an integer in the range 1, …, n, and each machine with an integer in the range 1, …, m. The set U is represented by a Boolean vector of size n, such that the ith element is true if and only if the task i, 1 ≤ i ≤ n, belongs to U. Similarly, we use a matrix of dimension m × n to represent the ETCs, μ_i(·), i∈M, which requires O(mn) memory. The sets L_i, i∈M, admit several memory representations; a Boolean matrix of dimension m × n, for instance, requires O(mn) space. Each of the m variables n_i requires O(log n) bits. Also, since the value of each variable t_i is upper-bounded by n times the maximum ETC admitted by the constant precision representation of μ, storing t_i requires O(log n) bits. Finally, each of the m mappings J_i is represented by a vector of n indexes, each index requiring O(log n) bits. The total memory requirement is, thus, O(mn log n), as claimed.

We now consider Part 2. Each of the m executions of Step 2 requires, using a radix exchange sort, O(n) operations to compute J_i. Thus, the overall time complexity of Step 2 is O(mn). We next show that the loop in Step 5 also runs in time that is O(mn). Notice that, since each iteration removes a task from U in Step 12, the loop is repeated exactly n times and, therefore, the emptiness of U in Step 5 need not be checked explicitly. In each of the n iterations of the main loop, the condition checked in Step 7 evaluates to false once and only once for each of the m machines that the loop in Step 6 iterates through, for a total of mn comparisons that evaluate to false. On the other hand, each time this condition evaluates to true for some machine i, the value of n_i is incremented to point to the next task. Since there are n tasks, we conclude that, for each machine i, the condition checked in Step 7 evaluates to true, and n_i is incremented, at most n−1 times. With our choice of data structures to represent U and J_i, checking whether J_i(n_i) ∉ U requires O(1) operations. Thus, the overall execution time of Step 7, considering all iterations of the main loop and all machines (loops in Step 5 and Step 6), is O(mn). Since, clearly, each execution of Step 11 and Step 12 requires O(m) and O(1) operations, respectively, we conclude that the overall execution time of the loop in Step 5 is O(mn). □

4.3. Practical aspects

The time complexity of computing J_i in Step 2 using a radix exchange sort algorithm depends linearly on the number of bits used in the binary representation of the ETCs, which is constant. If this constant is large, however, computing J_i using a comparison-based sorting algorithm, which requires O(n log n) operations, may yield better performance in practice, even though the theoretical time complexity of Algorithm 2 then becomes O(mn log n). Some implementations of the popular quicksort algorithm [19], for example, are highly optimized and perform exceptionally well, even though the worst-case time complexity of quicksort is not the best possible.

In addition, an appealing feature of Algorithm 2 is that the mappings J_i, i∈M, can be computed independently of each other. Therefore, the loop in Step 1 is very well suited to parallel computing, as the sketch below illustrates.
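A minimal sketch of this first-phase parallelization using the Python standard library (our illustration, not the Matlab implementation evaluated in Section 5):

    from concurrent.futures import ProcessPoolExecutor

    def sort_one_machine(row):
        """Compute J_i for one machine: the permutation of task indices
        sorting that machine's ETC row in nondecreasing order."""
        return sorted(range(len(row)), key=row.__getitem__)

    def compute_mappings(etc, workers=4):
        """Phase 1 of Algorithm 2: the m mappings J_i are independent,
        so each machine's row can be sorted by a different worker."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(sort_one_machine, etc))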
5. Experimental results and discussion

In this section we evaluate, empirically, the performance of Algorithm 2. We first specify some details of our implementation, the execution platform, and the test cases used. Then, we comparatively analyze the performance of the novel algorithm and the performance reported by other authors using the traditional
implementation of Min-Min in medium and large scenarios. Finally, we discuss the results.

5.1. Test environment

Algorithm 2 was implemented in Matlab R2011b, using double precision arithmetic and the native sort routine of Matlab (Matlab implements a variant of quicksort [20]). The execution platform was a PC with an Intel i7-2600 processor, with four cores at 3.40 GHz, and 16 GB of RAM, running the CentOS Linux 6.2 operating system. All the executions were launched with the -singleCompThread option of Matlab, to guarantee a strictly sequential execution.

5.2. Test instances

We evaluated Algorithm 2 in several scenarios with different numbers of tasks and machines. The test instances can be classified into two groups:
- Medium sized scenarios: 20 instances for each scenario, with a number of tasks ranging from 1024 to 32,768 and a number of machines ranging from 32 to 1024 (1024×32, 2048×64, 4096×128, 8192×256, 16,384×512 and 32,768×1024). These instances were taken from the repository publicly available at http://par-cga-sched.gforge.uni.lu/instances/etc/. Although in [14] instances with 65,536 tasks and 2048 machines are also used, those instances are not available in the repository.
- Large sized scenarios: 20 instances with 65,536 tasks and 2048 machines (65,536×2048) and 20 instances with 131,072 tasks and 4096 machines (131,072×4096). These instances were created using the generator publicly available at http://www.fing.edu.uy/inco/grupos/cecal/hpc/HCSP.
5.3. Experimental results

We compare the performance of our implementation of Algorithm 2 with the runtimes of the state-of-the-art implementations of Min-Min reported in the literature. The runtime for each instance of each scenario is detailed in Appendix A. In order to compare with the other implementations of Min-Min, we consider the mean runtime of Algorithm 2 for each scenario (notice, from Appendix A, that the standard deviation of the runtime is small for all scenarios).

Table 1 shows the mean runtime of our proposal for computing a Min-Min schedule in each scenario, together with the runtimes reported in [15] for a single-threaded PC implementation (QuadCore Xeon E5530 at 2.4 GHz, 48 GB RAM, 8 MB cache and 4 NVidia Tesla C1060 GPUs), a GPU implementation, and a four-GPU implementation. The hardware platforms involved in this comparison, although not exactly the same, are modern PCs in all cases and, clearly, are not responsible for the large difference in performance that we report in this section.³ All times are measured in seconds. The table also includes the improvement factor (IF) of our algorithm over each of the other implementations. We have left out of the comparison another four-GPU implementation presented in [15], with a shorter execution time, that uses a domain decomposition on the tasks, since it computes a different schedule than Min-Min, with a gap in the makespan between 15% and 30%.

³ Furthermore, the sequential version presented in [15] outperforms our Matlab implementation of Algorithm 1, running on our experimental platform, by almost a factor of four.

Table 1
Comparative analysis of Algorithm 2 versus the classical Min-Min (all runtimes in seconds).

Instance size        Algorithm 2      Sequential [15]       GPU [15]              4 GPUs [15]
(tasks×machines)     mean time (s)    Runtime (s)    IF     Runtime (s)    IF     Runtime (s)    IF
1024×32              3.19×10⁻³        0.07           21.9   0.23           72.1   1.03           322.9
2048×64              1.26×10⁻²        0.39           31.0   0.37           29.4   1.26           100.0
4096×128             6.84×10⁻²        2.25           32.9   1.02           14.9   1.95           28.5
8192×256             3.02×10⁻¹        15.67          51.9   4.77           15.8   4.16           13.8
16,384×512           1.10             119.74         108.9  24.83          22.6   14.84          13.5
32,768×1024          4.69             848.62         180.9  176.85         37.7   60.16          12.8
65,536×2048          23.50            6352.94        270.3  1049.34        44.7   292.75         12.5
131,072×4096         98.45            49,764.76      505.5  7253.88        73.7   2236.72        22.7

These results demonstrate that an impressive reduction in the execution time of the single-threaded implementation is obtained, for all the scenarios evaluated, by using the algorithm proposed in this paper instead of the classical Min-Min. Another important aspect of these results is that, as expected, the improvement factor increases with the size of the instances. Thus, the lower asymptotic time complexity translates, effectively, into a better scaling of the algorithm in practice.

Algorithm 2, even with our non-optimized Matlab implementation in double precision arithmetic, running in a single thread, is also far superior to both parallel implementations on GPUs, including the version that uses the computing power of four GPUs. When our algorithm is compared with the parallel implementations, the improvement factor also increases with the size of the instances.

In another work, Pinel et al. [14] show a great improvement with a parallel GPU implementation of the Min-Min heuristic. For the largest instance (65,536×2048) that they consider, the authors report a speedup of 538 over a sequential runtime of three days (more than 259,200 s), i.e., at least 482 s of execution time for the GPU implementation (the execution time is not reported for smaller instances). In contrast, for an instance of the same size, our implementation of Algorithm 2 runs in 23.50 s on average.

We next compare the empirical runtime of our algorithm with the theoretical complexity derived in Section 4. Fig. 1 compares the real mean execution times, obtained experimentally (Table 1), with theoretical ones, which were estimated by scaling the real mean execution time of the smallest instance proportionally to mn. The figure shows that, for the evaluated instances, the algorithm execution time scales closely to O(mn) as m and n increase (even though our implementation uses a variant of quicksort and, thus, the theoretical time complexity is O(mn log n) in the average case). Fig. 2 shows, on a logarithmic scale, the real execution time of our algorithm, the theoretical execution times estimated both proportionally to mn and to mn², and the runtimes reported in [15]. Notice how the gap between real execution times grows at a rate similar to the gap between theoretical estimates.

[Fig. 1. Real and theoretical execution time of Algorithm 2. x-axis: instances (tasks × machines).]

[Fig. 2. Comparison between real and theoretical runtimes of Algorithm 2 and traditional Min-Min, on a logarithmic scale. x-axis: instances (tasks × machines).]

5.4. Discussion

The experimental results show an impressive reduction in the execution time of the new algorithm with respect to a direct implementation of Min-Min. It should be noticed that, since our implementation is not highly optimized, the execution times reported in this paper are far from the best possible for the specified experimental setting. Porting the code to C/C++ and using shorter data types, as well as a better sorting algorithm, could produce an additional runtime reduction.

In addition, it should be emphasized that the parallelization of the proposed algorithm is much more natural than that of the original Min-Min. In the first stage of the algorithm, the tasks can be sorted independently for each machine i∈M, using different computing resources. The second stage is also easily parallelizable, since
the selection of the next task to be scheduled can be computed using the well-known reduction pattern [21] to choose a task with minimum finishing time.
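As an illustrative sketch of that reduction (our own, assuming each machine i contributes the candidate pair (i, t_i + μ_i(J_i(n_i)))):

    from functools import reduce

    def min_pair(a, b):
        """Associative reduction operator: keep the (machine, finish_time)
        pair with the smaller finishing time."""
        return a if a[1] <= b[1] else b

    def select_machine(candidates):
        """candidates: one (i, t_i + mu_i(J_i(n_i))) pair per machine.
        reduce() applies min_pair sequentially; a parallel runtime can
        evaluate the same reduction as a tree in O(log m) steps."""
        return reduce(min_pair, candidates)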
6. Conclusions and future work

We have presented an efficient algorithm for computing the Min-Min heuristic, which is simple and straightforward to implement. Compared to classical implementations, the asymptotic time complexity of Algorithm 2 is much smaller, and this complexity reduction is reflected, in practice, in low computation times even for very large problem instances. The proposed algorithm is also easily parallelizable. In the light of these results, the use of the Min-Min heuristic in large scenarios, typical of grid computing environments, is now far more accessible. In addition, other more sophisticated heuristics that depend on Min-Min, for example for generating an initial solution, can also benefit from Algorithm 2.

We identify three lines for future work. The first one is the implementation of a high performance version of Algorithm 2, taking special care of implementation details, and including the use of parallel techniques and the evaluation of different hardware configurations, such as multi-core, distributed computing, and GPU computing. This leads to a second line of interest, which is using the high performance version to solve even larger scenarios. Finally, we aim to study the application of local search techniques to the solutions obtained with Algorithm 2, in order to improve the overall quality of the scheduling.
Acknowledgments

P. Ezzatti, M. Pedemonte and Á. Martín acknowledge support from Programa de Desarrollo de las Ciencias Básicas (PEDECIBA), Uruguay, and Agencia Nacional de Investigación e Innovación (ANII), Uruguay.
Appendix A. Detailed numerical results

Tables A1 and A2 present the runtime required by Algorithm 2 to compute the Min-Min schedule on the 120 instances of the repository publicly available at http://par-cga-sched.gforge.uni.lu/instances/etc/. Table A3 presents the runtime required by Algorithm 2 to compute the Min-Min schedule on 40 instances created using the generator publicly available at http://www.fing.edu.uy/inco/grupos/cecal/hpc/HCSP. The instances were generated with high task heterogeneity and machine heterogeneity.

Table A1
Runtime for computing the Min-Min schedule with Algorithm 2 on the 1024×32, 2048×64 and 4096×128 scenarios. Instance k denotes etc_c_<tasks>x<machines>_hihi_<k>.

k     1024×32 (×10⁻³ s)   2048×64 (×10⁻² s)   4096×128 (×10⁻² s)
1     3.205               1.275               6.857
2     3.220               1.257               6.841
3     3.187               1.262               6.849
4     3.200               1.258               6.827
5     3.183               1.256               6.867
6     3.184               1.258               6.830
7     3.187               1.264               6.831
8     3.209               1.267               6.881
9     3.178               1.266               6.836
10    3.193               1.268               6.834
11    3.183               1.266               6.824
12    3.189               1.268               6.829
13    3.186               1.264               6.844
14    3.168               1.261               6.829
15    3.169               1.263               6.854
16    3.179               1.265               6.849
17    3.192               1.258               6.855
18    3.201               1.257               6.854
19    3.173               1.264               6.837
20    3.182               1.257               6.822

Mean runtime (s)   3.188×10⁻³   1.263×10⁻²   6.843×10⁻²
Std. dev. (s)      1.334×10⁻⁵   5.041×10⁻⁵   1.560×10⁻⁴
Min runtime (s)    3.168×10⁻³   1.256×10⁻²   6.822×10⁻²
Max runtime (s)    3.220×10⁻³   1.275×10⁻²   6.881×10⁻²
Table A2
Runtime for computing the Min-Min schedule with Algorithm 2 on the 8192×256, 16,384×512 and 32,768×1024 scenarios. Instance k denotes etc_c_<tasks>x<machines>_hihi_<k>.

k     8192×256 (×10⁻¹ s)   16,384×512 (s)   32,768×1024 (s)
1     3.019                1.097            4.695
2     3.025                1.094            4.665
3     3.016                1.105            4.678
4     3.020                1.098            4.703
5     3.024                1.108            4.696
6     3.013                1.102            4.673
7     3.022                1.105            4.681
8     3.023                1.102            4.696
9     3.026                1.101            4.709
10    3.016                1.103            4.687
11    3.024                1.099            4.675
12    3.025                1.103            4.685
13    3.021                1.102            4.688
14    3.015                1.103            4.695
15    3.011                1.105            4.674
16    3.011                1.101            4.697
17    3.022                1.098            4.671
18    3.010                1.099            4.681
19    3.011                1.100            4.697
20    3.013                1.103            4.683

Mean runtime (s)   3.018×10⁻¹   1.101        4.687
Std. dev. (s)      5.560×10⁻⁴   3.284×10⁻³   1.174×10⁻²
Min runtime (s)    3.010×10⁻¹   1.094        4.665
Max runtime (s)    3.026×10⁻¹   1.108        4.709
Table A3
Runtime (s) for computing the Min-Min schedule with Algorithm 2 on the 65,536×2048 and 131,072×4096 scenarios. Instance k denotes etc_c_<tasks>x<machines>_hihi_<k>.

k     65,536×2048   131,072×4096
1     23.427        98.664
2     23.319        98.337
3     23.322        97.480
4     22.792        99.646
5     23.476        98.883
6     23.255        98.505
7     23.672        97.987
8     23.380        97.532
9     23.701        98.827
10    23.789        97.988
11    23.547        98.368
12    23.698        98.197
13    23.488        99.058
14    22.892        98.629
15    23.681        99.098
16    23.547        99.039
17    23.895        97.917
18    23.777        98.878
19    23.668        99.122
20    23.644        96.910

Mean runtime (s)   23.499        98.453
Std. dev. (s)      2.835×10⁻¹    6.714×10⁻¹
Min runtime (s)    22.792        96.910
Max runtime (s)    23.895        99.646
References

[1] Braun TD, Siegel HJ, Beck N, Bölöni LL, Maheswaran M, Reuther AI, et al. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. Journal of Parallel and Distributed Computing 2001;61(6):810-37. http://dx.doi.org/10.1006/jpdc.2000.1714.
[2] Fernandez-Baca D. Allocating modules to processors in a distributed system. IEEE Transactions on Software Engineering 1989;15(11):1427-36. http://dx.doi.org/10.1109/32.41334.
[3] Garey MR, Johnson DS. Computers and intractability: a guide to the theory of NP-completeness. New York, NY, USA: W.H. Freeman & Co.; 1979.
[4] Xhafa F, Barolli L, Durresi A. Batch mode scheduling in grid systems. International Journal of Web and Grid Services 2007;3(1):19-37. http://dx.doi.org/10.1504/IJWGS.2007.012635.
[5] Luo P, Lü K, Shi Z. A revisit of fast greedy heuristics for mapping a class of independent tasks onto heterogeneous computing systems. Journal of Parallel and Distributed Computing 2007;67(6):695-714. http://dx.doi.org/10.1016/j.jpdc.2007.03.003.
[6] Ibarra OH, Kim CE. Heuristic algorithms for scheduling independent tasks on nonidentical processors. Journal of the ACM 1977;24(2):280-9. http://dx.doi.org/10.1145/322003.322011.
[7] Wu M-Y, Shu W, Zhang H. Segmented Min-Min: a static mapping algorithm for meta-tasks on heterogeneous computing systems. In: Proceedings of the 9th heterogeneous computing workshop, HCW '00. Washington, DC, USA: IEEE Computer Society; 2000. p. 375-85.
[8] Chauhan SS, Joshi RC. QoS guided heuristic algorithms for grid task scheduling. International Journal of Computer Applications 2010;2(9):24-31.
[9] Pinel F, Dorronsoro B, Pecero J, Bouvry P, Khan S. A two-phase heuristic for the energy-efficient scheduling of independent tasks on computational grids. Cluster Computing 2012:1-13. http://dx.doi.org/10.1007/s10586-012-0207-x.
[10] Xhafa F, Alba E, Dorronsoro B, Duran B. Efficient batch job scheduling in grids using cellular memetic algorithms. Journal of Mathematical Modelling and Algorithms 2008;7:217-36. http://dx.doi.org/10.1007/s10852-008-9076-y.
[11] Ritchie G, Levine J. A fast, effective local search for scheduling independent jobs in heterogeneous computing environments. In: PLANSIG 2003: proceedings of the 22nd workshop of the UK planning and scheduling special interest group; 2003. p. 178-83.
[12] Xhafa F, Carretero J, Dorronsoro B, Alba E. A tabu search algorithm for scheduling independent jobs in computational grids. Computers and Artificial Intelligence 2009;28(2):237-50.
[13] Diaz CO, Guzek M, Pecero JE, Danoy G, Bouvry P, Khan SU. Energy-aware fast scheduling heuristics in heterogeneous computing systems. In: Proceedings of the 2011 international conference on high performance computing & simulation (HPCS 2011); 2011. p. 478-84.
[14] Pinel F, Dorronsoro B, Bouvry P. Solving very large instances of the scheduling of independent tasks problem on the GPU. Journal of Parallel and Distributed Computing 2013;73(1):101-10. http://dx.doi.org/10.1016/j.jpdc.2012.02.018.
[15] Canabé M, Nesmachnow S. Parallel implementations of the Min-Min heterogeneous computing scheduler in GPU. In: Latin American symposium on high performance computing, HPCLatam; 2012. URL: http://www.clei.cl/cleiej/papers/v15i3p8.pdf.
[16] Hildebrandt P, Isbitz H. Radix exchange - an internal sorting method for digital computers. Journal of the ACM 1959;6(2):156-63.
[17] He X, Sun X, von Laszewski G. QoS guided Min-Min heuristic for grid task scheduling. Journal of Computer Science and Technology 2003;18(4):442-51. http://dx.doi.org/10.1007/BF02948918.
[18] He X, Sun X-H. Incorporating data movement into grid task scheduling. In: Zhuge H, Fox G, editors. Proceedings of the 4th international conference on grid and cooperative computing, GCC 2005. Lecture notes in computer science, vol. 3795. Springer; 2005. p. 394-405.
[19] Hoare CAR. Quicksort. The Computer Journal 1962;5(1):10-6.
[20] Mathworks, Matlab, http://www.mathworks.com/, accessed December 18, 2012.
[21] McCool MD, Robison AD, Reinders J. Structured parallel programming: patterns for efficient computation. Morgan Kaufmann; 2012.