Priority-grouping method for parallel multi-scheduling in Grid

Goodhead Tomvie Abraham*, Anne James, Norlaily Yaacob

Distributed Systems and Modelling Group, Coventry University, Priory Street, Coventry, CV1 5FB, United Kingdom

* Corresponding author. E-mail addresses: [email protected] (G.T. Abraham), [email protected] (A. James), [email protected] (N. Yaacob).

Journal of Computer and System Sciences 81 (2015) 943–957. http://dx.doi.org/10.1016/j.jcss.2014.12.009

Article history: Received 3 July 2014; Received in revised form 20 September 2014; Accepted 1 October 2014; Available online 30 December 2014.

Keywords: Grid computing; Scheduling; Parallel programming; Multicore systems; Multi-scheduling

Abstract

This article presents a method of enhancing the efficiency of Grid scheduling algorithms by grouping jobs on the basis of priority and grouping Grid machines on the basis of their configuration, before running a suitable scheduling algorithm within the paired groups. The Priority method is employed to sort jobs into four groups, while two different methods, SimilarTogether and EvenlyDistributed, are employed to sort machines into four groups, before the MinMin Grid scheduling algorithm is run within the paired groups simultaneously. Running the scheduling algorithm simultaneously within paired groups (multi-scheduling) ensures a high degree of parallelism, increases throughput and improves the overall performance of the scheduling algorithm. Two sets of controlled experiments were carried out on an HPC system. Analysis of the results shows that the Priority grouping method improved scheduling efficiency by very large margins over the non-grouping method.

1. Introduction

As multicore technology becomes ever more pervasive in our daily lives and Grid computing continues to grow, a method that exploits one of these technologies for the benefit of the other becomes ever more appealing. In an age characterised by multicore systems, scheduling Grid jobs without considering the parallelism achievable from the evolving multicore hardware does not augur well for the current trend in computing hardware. Neglecting the underlying hardware in scheduling Grid jobs will hamper the growth of the Grid as the technology continues to evolve, because sequential scheduling constitutes a bottleneck when the number of jobs increases. Current Grid scheduling algorithms fall short of exploiting the advantages of the underlying multicore systems and hence are inadequate to address the future of the Grid. To achieve any commensurate gain in Grid scheduling in tandem with advances in hardware, Grid scheduling algorithms need to be designed to exploit the available multicores. A parallel multi-scheduling method that considers the underlying multicore hardware of the scheduler and of the Grid machines will not only sustain the projected growth of the Grid but will also redress the neglect, by most Grid scheduling algorithms, of the paradigm shift needed to exploit the advantages of multicore systems in Grid scheduling.

This work is intended for implementation on multicores. It uses a priority-based grouping method to split jobs into four priority groups and splits machines into the same number of groups, before running the MinMin scheduling algorithm between each paired job group and machine group in parallel. A priority group in this context means a group containing a set of jobs split on some priority attribute. A machine group refers to a list containing machines (Grid resources) in the same category based on our grouping method. Two methods, SimilarTogether and EvenlyDistributed, are used as alternatives to split machines into groups.


A machine is sorted into a particular group on the basis of either similarity or dissimilarity of configuration. Scheduling of jobs is carried out within the groups, hence several instances of threads can execute scheduling algorithms independently. Multi-scheduling means scheduling in parallel from within the matched groups of jobs and machines. Parallelism enables the optimised use of the available processing components within a system.

This paper is organised as follows. Section 1 introduces the Priority-based Parallel Multi-scheduler for Grids. Section 2 provides some background for our research and Section 3 places the research in the context of related work and current trends. Section 4 discusses our Priority grouping method for jobs and our machine grouping methods. Section 5 describes our simulation for the experiment and Section 6 describes the experimental design. Section 7 presents and discusses the results, analysing the performance of the Priority grouping method. A conclusion and future thoughts are provided in Section 8.

2. Background

The backbone of the Grid [1] is the already established Internet, powerful supercomputers, multiple computing clusters, large-scale distributed networks and the connectivity of these resources. The aggregation and integration of these powerful computing systems, clusters, networks and resources, implemented with policies that ensure the delivery of computing services to users' specifications or requirements, is what constitutes the Grid. The Grid computing environment enables jobs submitted by users to be executed at Grid sites that may be located thousands of miles apart. This demands reliability of the hardware and software, efficiency in time consumption and effectiveness in the utilisation of resources. It also requires that schedules are as close to optimal as possible. Finding the optimal schedule for a set of jobs is an NP-complete problem, so heuristics are typically used. Alternatively, non-deterministic algorithms such as genetic algorithms can be used. However, if the scheduling algorithm becomes too complex, the benefits of obtaining an optimal solution are outweighed by the time it takes to schedule. Our aim is to reduce scheduling time by using parallelism in the multicore environment.

2.1. Pervasive Grid and multicores

With advances in chip technology, current computer systems are now imbued with multiple cores [2–4]. These advances are due largely to the paradigm shift in hardware design and call for a corresponding paradigm shift in software design if the potential gains of multicore computing are to be achieved. The requirement that jobs submitted by users from many different locations are processed at distant locations demands effective and fundamental scheduling algorithms [5,6]. The scheduling of Grid jobs and the growth of the Grid will be hampered if Grid scheduling algorithms are not designed to benefit from the hardware technology of the day [5,7]. For Grid scheduling to gain from these advances in hardware technology, meet its projected growth and the challenges of the future, and capitalise on the revolution in computer hardware design, a paradigm shift in the software programming model is imperative [8]. This is because sequential programs do not scale with multicore systems and therefore do not benefit from parallelism [9–13].
Hence, effort should focus on methods that translate the gains in computing hardware into gains in software generally, and in Grid scheduling in particular. This is a necessity because Grid scheduling can become a bottleneck when the number of jobs increases and the scheduling process remains sequential. It is therefore important to exploit parallelism in scheduling and make use of multicore availability to improve throughput in the scheduler.

2.2. Some impediments to the impact of multicore systems

The advent of multicore systems created a gap in application design, as executing jobs in sequence in the midst of several processors does not optimise the utilisation of the available processors. Some of the impediments to legacy systems and applications in the multicore era, according to Gurudutt Kumar [9], include:

• Inefficient parallelisation – an impediment in legacy systems or applications that fail to support multi-threading or, in some cases, spawn too many threads,

• Serial bottlenecks – most common in applications that share a single data source among contending threads, or that serialise data-accessing processes to maintain integrity,

• Over-dependence on operating system or runtime environment – which arises when too much is handed to the operating system or runtime environment to scale and optimise the application,

• Workload imbalance – where tasks are unevenly spread across the various cores,

• I/O bottlenecks – which occur due to blocking disk I/O, and

• Inefficient memory management – a performance inhibitor caused by the sharing of memory by several CPUs.

To correct or eliminate these impediments, many scientific and engineering platforms have reacted appropriately by embracing and implementing mechanisms that put multicores to greater benefit [7,11,14–16], while other researchers have issued calls for the design of applications that focus on parallelism in an effort to increase throughput and ensure hardware optimisation [17,18].


Despite such efforts, the Grid scheduling community has not done much to remove these impediments. So far, most Grid scheduling algorithms have concentrated on scheduling parallel task execution rather than parallelising the actual scheduling process. Interesting work in scheduling parallel tasks has indeed been and continues to be achieved [19–27]. However, in these works the scheduling is first carried out serially and then the tasks are executed in parallel using the multiple resources of the Grid. Thus the scheduler can become a bottleneck. We avoid this problem by developing a system in which the scheduling, as well as the task execution, is carried out in parallel.

To leverage Grid scheduling in line with the trend in the technology, we focus on how we can exploit parallelism on multicores. We group Grid jobs based on their priorities, categorise Grid machines based on two selected methods, SimilarTogether and EvenlyDistributed, and then schedule the jobs in parallel from within the paired groups.

3. Related work

3.1. Grid scheduling algorithms

Current Grid scheduling algorithms are mostly concerned with the delivery of Quality of Service (QoS), using factors like execution time, schedule and required completion time for optimisation. Scheduling can be carried out in immediate mode or batch mode [28]. Immediate mode is when a job is assigned to a machine as it arrives; batch mode is when a number of jobs are batched and scheduled together.

The algorithms studied in the work described below include the MinMin and MaxMin algorithms introduced by Ibarra and Kim [29]. The MinMin algorithm computes the completion time for all jobs on all machines, then assigns the job with the minimum completion time to the processor that can complete it the earliest. The MaxMin algorithm applies a similar principle, computing the completion time for all jobs on all processors, but the job with the maximum completion time is assigned to the processor that can complete it earliest. Another algorithm investigated is the Sufferage algorithm introduced by Maheswaran et al. [28]. The Sufferage heuristic is based on the idea that better mappings can be generated by assigning a machine to the task that would "suffer" most in terms of expected completion time if that particular machine were not assigned to it.

Algorithms for immediate mode include: the traditional First Come First Serve (FCFS); Easy-Backfill, which optimises FCFS by allowing jobs to jump the queue where they can fit a gap which would otherwise be left empty due to the requirements of the next-in-line job; Opportunistic Load Balancing (OLB), where a task is assigned to the machine that becomes ready next, without considering the execution time of the task on that machine; minimum execution time (MET), where the job with minimum execution time is selected next; minimum completion time (MCT), where the job with minimum completion time is selected next; and k-percent best (KPB). The KPB heuristic considers only a subset of machines while mapping a task. The subset is formed by picking the k-percent best machines based on the execution times for the task; the task is assigned to the machine in the subset that provides the earliest completion time [28].
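Since MinMin recurs throughout this paper, both as the baseline and as the InsideGroupsScheduling algorithm of Section 4, a minimal sketch may help. This is our own illustration, not the authors' code: a matrix of estimated execution times stands in for the job and machine objects.

```java
class MinMinSketch {
    // Batch-mode MinMin. eta[j][m] is the estimated execution time of job j
    // on machine m; ready[m] is the time at which machine m next becomes free.
    static int[] minMin(double[][] eta) {
        int nJobs = eta.length, nMachines = eta[0].length;
        int[] assignment = new int[nJobs];      // chosen machine for each job
        double[] ready = new double[nMachines]; // machine ready times
        boolean[] done = new boolean[nJobs];

        for (int round = 0; round < nJobs; round++) {
            int bestJob = -1, bestMachine = -1;
            double best = Double.MAX_VALUE;
            // Find the (job, machine) pair with the minimum completion time.
            for (int j = 0; j < nJobs; j++) {
                if (done[j]) continue;
                for (int m = 0; m < nMachines; m++) {
                    double completion = ready[m] + eta[j][m];
                    if (completion < best) {
                        best = completion;
                        bestJob = j;
                        bestMachine = m;
                    }
                }
            }
            assignment[bestJob] = bestMachine; // assign the winning pair
            ready[bestMachine] = best;         // machine is busy until then
            done[bestJob] = true;
        }
        return assignment;
    }
}
```

The doubly nested search repeated once per job is what gives MinMin the polynomial cost (roughly proportional to n²m for n jobs and m machines) that the later sections exploit through grouping.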
We now consider some work of others in which these algorithms have been investigated.

Freund et al. developed SmartNet [30,31], a resource scheduling system for distributed computing environments. Their work focused on the benefits that can be achieved when the scheduling system considers both computer availability and the performance of each task on the computer. The system requires jobs to be broken down into tasks and also requires estimates of the execution time of tasks. It collects and uses data on jobs, tasks, machines and networks in order to tune the scheduling outcomes. The system uses various scheduling algorithms to attempt to assign tasks to the computer that will run each task best. The work is interesting as it was one of the first to consider details of computer, task, job and network characteristics in scheduling. SmartNet implemented the MinMin and MaxMin algorithms introduced by Ibarra and Kim [29]. The SmartNet approach showed improvement over simple load balancing.

In 1999, Maheswaran et al. compared new and previously proposed dynamic matching and scheduling heuristics for mapping independent tasks onto heterogeneous computing systems under a variety of simulated computational environments [28]. Five immediate mode heuristics and three batch mode heuristics were studied. For immediate mode they investigated opportunistic load balancing (OLB), minimum execution time (MET), minimum completion time (MCT) and k-percent best (KPB). For batch mode they considered MinMin, MaxMin and Sufferage. Sufferage was a new algorithm proposed by the researchers. The authors showed that the choice of dynamic scheduling heuristic in a heterogeneous environment depends on parameters such as the heterogeneity characteristics of tasks and machines as well as the arrival rate of tasks.

Etminani and Naghibzadeh [32] designed a new scheduling algorithm, called the Selective Algorithm, to select at each decision point the better algorithm between MinMin and MaxMin according to the lengths of tasks in the batch. For instance, if long tasks are prevalent among the remaining tasks in a batch, MaxMin is chosen; if short tasks are prevalent, the choice is MinMin. Their algorithm increased throughput but did not consider the deadline of each task, the cost of execution on each resource or the cost of communication.

Some work has taken more of a user perspective and concentrated on providing or maintaining quality of service. Chen, Li and Wang [20] employed a resource reservation mechanism. Their method supported resource reservation request scheduling models implemented on First Come First Serve (FCFS) and Easy-Backfilling. Albodour, James and Yaacob [22] proposed BGQoS, a QoS model for business-oriented and commercial applications on Grid computing systems. BGQoS allows Grid Resource Consumers (GRCs) to request specific QoS requirements from Grid Resource Providers (GRPs) for their resources to be utilised.


Dynamically calculated QoS parameters are supported, such as resource reliability. This increases the accuracy of meeting the GRC's requirements. GRPs are capable of advertising their resources, their capabilities, their usage policies and their availability both locally and globally. This leads to a flexible model that can be carried across domains without altering the core operations and that can easily be expanded to accommodate different types of GRC, resources and applications. Monitoring and reallocation are used to ensure that QoS targets are met. Xiao and Liu [33] proposed a Multi-Scheme Co-Scheduling Framework (MSCSF) to provide enhanced deadline guarantees in heterogeneous environments. The system works by integrating multiple co-scheduling schemes and quantitatively evaluating the deadline guarantee of each co-scheduling scheme. The system can then select the best scheduling scheme for real-time applications at run time. Experimental results show that it can provide enhanced deadline guarantees.

The schemes described above concentrate on overall performance, in terms of scheduling to complete the whole task set more efficiently, or on providing improved quality of service to users. They do not concentrate on improving the efficiency of the scheduler in terms of how long the scheduling task takes. Our work aims to improve the efficiency of the scheduler by exploiting parallelisation in a new way. This will improve throughput further, in addition to the improvements achieved by the research described above.

3.2. Parallel scheduling algorithms for Grid

In this section, we consider some works employing parallelism to improve the efficiency of the scheduler itself. Canabé and Nesmachnow [34,35] have investigated the use of massively parallel GPUs (Graphical Processing Units) to improve scheduling time. In 2011 [35] they implemented the MinMin and Sufferage algorithms on a GPU architecture. They recorded improvements in scheduling time when the number of tasks exceeds 8000 and the number of machines exceeds 250. In their experiment the number of machines was 32 times smaller than the number of tasks. In 2012 [34], the same researchers applied four parallel variants of the MinMin algorithm and obtained large improvements in computation time over serial scheduling as the number of tasks increases. The experimental evaluation of the proposed parallel methods demonstrates that a significant reduction in computing times can be attained when using parallel GPU hardware.

Other authors have proposed genetic and memetic algorithms which exploit GPUs in solving the scheduling problem. Pinel, Dorronsoro and Bouvry [36] presented CPU and GPU multi-threaded parallel designs of the MinMin algorithm. As would be expected, the GPU design outperformed the CPU design because of the massive parallelisation, and the parallel CPU solution outperformed the serial one. They also proposed a cellular genetic algorithm (CGA) to solve the MinMin problem. The CGA produced more accurate results but took longer to run. Mirsoleimani et al. [37] proposed a memetic algorithm, which uses combinations of non-deterministic approaches to solve the scheduling problem in a GPU environment. Very high speed-up was achieved.
The difference between the above work and ours is that most other works which concentrate on parallelisation of the scheduler have focussed on a GPU environment and/or on non-deterministic algorithms such as genetic or memetic algorithms. The GPU environment offers massive parallelisation, but non-deterministic algorithms have unpredictable run times. The scope of our work is the more general-purpose environment. We selected this environment because we wanted to create a facility that does not require a specialised environment. We also concentrated on deterministic algorithms to have better control over scheduler execution time. As mentioned above, Pinel, Dorronsoro and Bouvry [36] presented research on multicore CPUs as well as GPUs and genetic algorithms. That part of their work is related to ours. However, our work differs in the novel use of grouping to achieve greater efficiency through parallelisation.

Parallel multi-scheduling can improve Grid scheduling performance and should be exploited. Our aim is therefore to exploit the use of multicores both on our scheduler and on the Grid sites. We do this through our innovative grouping algorithm. This work, our Priority-based Parallel Multi-scheduler, is an attempt to steer Grid scheduling algorithms towards exploiting the benefits of multicore systems in order to enhance and optimise performance. A parallel scheduler for the Grid will facilitate increased throughput and scalability. Our challenge is to develop a scheduler that is dynamic, optimises resource utilisation, ensures QoS and, above all, increases throughput.

Grouping-based scheduling aims at splitting Grid jobs into various groups and utilising independent threads to execute scheduling algorithms on those groups. Multi-scheduling allows multiple independent scheduling instances to occur simultaneously within the groups. This enables thread-based parallelisation, allowing threads to independently access the groups to select jobs for scheduling based on the scheduling policy. We use the term Parallel Multi-scheduling to mean that scheduling is carried out in parallel from within the matched job and machine groups. Parallelism enables the optimised use of all available processing components within a system. Our method lets each thread execute the scheduling algorithm within a group pair, automatically improving the total schedule time of the grouping method by factors approaching N, if N is the number of group pairs used. However, our results show speed-up greater than N. This is because, as well as benefiting from parallelisation, our grouping method also benefits from feeding smaller input sets to a polynomial scheduling algorithm, thus providing further speed-up. In the next section, we shall discuss the design of the Priority-based Parallel Multi-scheduler for the Grid.
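Before turning to that design, a rough cost model makes the super-linear effect concrete. Under the simplifying assumption (ours, not a measured fact) that a MinMin-style scheduler on n jobs and m machines costs about c·n²·m, splitting the work into N balanced group pairs gives

$$T_{\text{serial}} \approx c\,n^{2}m, \qquad T_{\text{group}} \approx c\left(\frac{n}{N}\right)^{2}\frac{m}{N} = \frac{c\,n^{2}m}{N^{3}}$$

so grouping alone saves a factor of about N² even when the groups are scheduled one after another, and running the N pairs in parallel ideally saves up to N³. The measured speed-ups in Section 7 (roughly 5–12× for N = 4) sit between N and N² because, as discussed there, the job groups were far from balanced.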


Table 1. Stages for implementing the Priority-based Parallel Multi-scheduler.

i. Read jobs into scheduler
ii. JobsSplitMode – method for splitting jobs into priority groups
iii. WorkersGroupingMethod – method for splitting machines into groups
iv. InsideGroupsScheduling – Grid scheduling algorithm used inside the groups
v. Output result – the time used in scheduling the jobs is recorded

4. Priority-based Parallel Multi-scheduling method

4.1. Overview of method

The Priority-based Parallel Multi-scheduling method aims to exploit parallelism on multicores to enhance Grid scheduling algorithms. Our design targets multicores at both the scheduler and the resource level. To achieve our objectives, we used job grouping and machine grouping methods. Jobs are split into groups (priority groups), and so are machines. Job groups are then paired with machine groups, and independent threads are deployed to execute a scheduling algorithm across each group pair. Grid jobs submitted by users consist largely of independent tasks, hence we have the opportunity of reaping the full benefits of massive parallelism. Using multiple threads to execute the scheduling algorithm independently allows our method to gain high parallelism on the multicore platform and enhances the performance of the scheduling algorithm. The steps taken to achieve this are as follows (summarised in Table 1):

i. The number of groups to use is specified.
ii. Jobs are batched into groups based on their characteristics – priority in this case. There are four priority groups in this test, hence jobs can only be placed into one of four groups.
iii. Machines are split into groups based on their characteristics. Because we are dealing with only four priority groups of jobs, the machines are also split into four groups.
iv. The scheduling algorithm (InsideGroupsScheduling) is then applied to the jobs in those groups in parallel using thread implementation.

The InsideGroupsScheduling algorithm we use is the MinMin scheduling algorithm. Our method of parallelism was implemented with a dynamic thread pool. We created a pool of threads in which threads are activated when needed and deactivated when no longer needed. With the thread pool, we had the option of choosing in our test parameters how many threads to use for each execution. The threads executed independently within the matched groups, each scheduling Grid jobs from a particular job group onto the set of machines paired with that group. This gave us a great degree of parallelism using multi-threading within the groups.

InsideGroupsScheduling refers to the scheduling algorithm executed within the priority groups. After grouping both jobs and machines, we implemented the MinMin scheduling algorithm within each group in parallel using threads. Implementing the scheduling algorithm within independent groups creates room for a high level of parallelism, exploits the multicores, enhances throughput and increases the efficiency of the scheduling algorithm. Fig. 1 shows the stages of the method and Fig. 2 provides an overview architecture of our system.
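The driver for stages ii–iv can be pictured as one scheduling task per matched group pair, dispatched to a thread pool. The sketch below is our own reconstruction under stated assumptions: Job, Machine and the minMin body are placeholders, not the authors' actual classes.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelMultiScheduler {
    record Job(long id, int priority, double size) {}       // placeholder job type
    record Machine(long id, int cores, double speedGHz) {}  // placeholder machine type

    // Placeholder for InsideGroupsScheduling (MinMin in this paper);
    // see the MinMin sketch in Section 3.1.
    static void minMin(List<Job> jobs, List<Machine> machines) { /* ... */ }

    public static void scheduleAll(List<List<Job>> jobGroups,
                                   List<List<Machine>> machineGroups,
                                   int nThreads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int g = 0; g < jobGroups.size(); g++) {
            List<Job> jobs = jobGroups.get(g);
            List<Machine> machines = machineGroups.get(g);
            // One independent scheduling instance per matched group pair.
            pool.submit(() -> minMin(jobs, machines));
        }
        pool.shutdown();                          // no further tasks accepted
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all pairs to finish
    }
}
```

With four group pairs and the thread counts used in Section 6 (1, 2, 4 and 8), the pool size simply caps how many group pairs are scheduled concurrently.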
4.2. Grid job source and relevant attributes

Jobs used for our experiment were downloaded from the Grid Workload Archive [38]. The Grid Workload Archive (GWA) is designed to make traces of Grid workloads available to researchers and developers. It contains files both in plain text format and in the Grid Workload Format (.GWF). The attributes of user jobs are the distinct items that characterise Grid jobs. Grid scheduling algorithms depend largely on these attributes when optimising the algorithm and meeting requirements such as completion time and execution time. The priority grouping method used those job attributes from the GWA which were relevant for our purpose. The method requires an estimate of job size. The completion time of a job on a processor can then be estimated. The attributes used for estimating the completion time of jobs are shown in Table 2.

4.3. Job grouping

The Priority method requires that Grid jobs are categorised or sorted into priority groups before scheduling. From the GWA files, the attribute we have used to determine priority is the number of processors requested by the user (ReqNProcs).


Fig. 1. Model of the Priority-based Parallel Multi-scheduler.

Before submission, a user either states or selects the number of processors his job is to be executed on. The choices vary from not specifying at all to specifying several hundred. As provided in the Grid Workload Archive, this attribute (ReqNProcs) may reflect the importance a user ascribes to his job: Grid users who specify a higher number of processors for the execution of their jobs can be regarded as desiring a higher priority for them. In our method, jobs from the GWA file are first allocated priorities based on the number of processors requested, and are then sorted into groups based on that priority. The choice of attribute(s) on which to determine priority is somewhat arbitrary for our purposes. Other characteristics could be used to determine priority, even a priority choice made directly by the customer [22]. In production environments, suitable attributes would be determined depending on the application and the available metadata. Table 3 shows the algorithm that determines the priority of jobs and allocates them to priority groups based on the number of processors requested by the user. We use four priority groups.

4.4. Machine grouping

A Grid machine possesses the following attributes:

I. WorkerId: used to identify the machine.
II. Number of CPUs: determines the potential of the machine.
III. CPUSpeed: the speed of the CPU, given in GHz.


Fig. 2. System diagram for the Priority-based Parallel Multi-scheduler.

Table 2. Selected attributes from the GWA trace file.

Attribute            Description
JobID                Identifies the job
NProcs               Number of allocated processors
ReqTime              Requested time, measured in wallclock seconds
ReqNProcs            Requested number of processors
RunTime              Time the job actually executes
AverageCPUTimeUsed   Average CPU time over all the allocated processors

Table 3. Algorithm for determining the priority of jobs and allocating them to groups.

Assigning jobs to groups according to priority:
i. Allocate 4 job groups, one per priority (1 – very low, 2 – medium, 3 – high, 4 – very high)
ii. For each job:
   a. Assign a priority based on the relevant attributes
   b. Add the job to the group with the matching priority
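In code, the procedure of Table 3 is a single pass over the job list. The sketch below is illustrative only: the paper does not state the exact ReqNProcs cut-offs, so the threshold values here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class PriorityGrouping {
    // Sketch of Table 3: derive a priority from ReqNProcs and place each job
    // (represented here by its index) into the matching group.
    // The threshold values are assumed, not taken from the paper.
    static List<List<Integer>> groupByPriority(int[] reqNProcs) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int g = 0; g < 4; g++) groups.add(new ArrayList<>());
        for (int job = 0; job < reqNProcs.length; job++) {
            int priority;
            if (reqNProcs[job] <= 1)       priority = 0; // very low
            else if (reqNProcs[job] <= 4)  priority = 1; // medium
            else if (reqNProcs[job] <= 16) priority = 2; // high
            else                           priority = 3; // very high
            groups.get(priority).add(job);
        }
        return groups;
    }
}
```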

Machines are split into groups based on their attributes or characteristics. Two methods used to split machines into groups are:


Table 4. Algorithm for the SimilarTogether method.

SimilarTogether:
i. Sort machines based on configuration (e.g. number and speed of processors)
ii. Divide the machines into groups of equal size N:
   a. Add the top N machines to the first group
   b. Add the next N machines to the next group
   c. Repeat step b until all machines are assigned

Table 5. Algorithm for the EvenlyDistributed method.

EvenlyDistributed:
i. Sort machines based on configuration (e.g. number and speed of processors)
ii. Divide the machines into groups of equal size N:
   a. Add the first machine to the first group
   b. Add the next machine to the next group
   c. Repeat step b until the last group is reached
   d. Add the next machine to the first group
   e. Repeat steps b–d until all machines are assigned to groups
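The two strategies differ only in how a sorted machine list is dealt out, as the prose below explains. A compact sketch of both follows; it is our illustration, with the Machine record and its performance score (cores × speed) assumed from the attribute list in Section 4.4.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class MachineGrouping {
    record Machine(long workerId, int numCPUs, double cpuSpeedGHz) {
        double performance() { return numCPUs * cpuSpeedGHz; } // sort key
    }

    // SimilarTogether (Table 4): cut the sorted list into contiguous blocks,
    // so each group holds machines of similar power.
    static List<List<Machine>> similarTogether(List<Machine> machines, int nGroups) {
        List<Machine> sorted = new ArrayList<>(machines);
        sorted.sort(Comparator.comparingDouble(Machine::performance));
        List<List<Machine>> groups = new ArrayList<>();
        int block = sorted.size() / nGroups;
        for (int g = 0; g < nGroups; g++) {
            int from = g * block;
            int to = (g == nGroups - 1) ? sorted.size() : from + block;
            groups.add(new ArrayList<>(sorted.subList(from, to)));
        }
        return groups;
    }

    // EvenlyDistributed (Table 5): deal the sorted machines round-robin,
    // so every group receives a similar mix of slow and fast machines.
    static List<List<Machine>> evenlyDistributed(List<Machine> machines, int nGroups) {
        List<Machine> sorted = new ArrayList<>(machines);
        sorted.sort(Comparator.comparingDouble(Machine::performance));
        List<List<Machine>> groups = new ArrayList<>();
        for (int g = 0; g < nGroups; g++) groups.add(new ArrayList<>());
        for (int i = 0; i < sorted.size(); i++) {
            groups.get(i % nGroups).add(sorted.get(i));
        }
        return groups;
    }
}
```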

SimilarTogether: In this method, machines with similar configuration are grouped together based on the number of CPUs and the speed of the CPU. First, machines are sorted by performance (number of CPUs × CPU speed) from slowest to fastest; each group is then guaranteed the same number of machines. The first group takes the first N machines, the second group takes the next N machines, the third group the third N and the fourth group the fourth N. As a result, the first group is guaranteed to have the slowest machines, followed by the second group, then the third; the fourth group is guaranteed the best set of machines. This means some groups have better-performing machines than others. Groups with better-performing machines complete jobs faster than groups with slower machines, so the lowest-ranked group will perform worse than the rest if the same number of tasks is assigned to all groups. With this grouping method, higher-priority jobs can be assigned to higher-configuration machine groups and lower-priority job groups to lower-configuration machine groups. The method has the potential to better address quality-of-service requirements for particular jobs and also to improve makespan, if priority job groups are appropriately balanced. Table 4 shows the algorithm for splitting machines using the SimilarTogether method.

EvenlyDistributed: This method ensures that machines of all configurations are equally split and distributed into the groups (based on their configuration). It guarantees that each group has the same, or almost the same, mix of machine configurations. The first machine is added to the first group, the second machine to the second group, the third to the third group and the fourth to the fourth group; the process is then repeated until all machines have been allocated. This method provides a more balanced processing infrastructure, which might suit some input job sets better than SimilarTogether. Table 5 shows the algorithm for splitting machines according to the EvenlyDistributed method.

5. Simulation

This section discusses the simulation carried out in the experiment.

5.1. Simulation of execution time

The simulation of execution time was based on the job attributes from the GWA shown in Table 2 in Section 4.2. Variations to the calculations needed to be made depending on the attribute values available in the file, because values were sometimes missing. Table 6 shows the algorithm for simulating the execution time of jobs.

5.2. Simulation of Grid

A Grid site is characterised by the following attributes: Category; CPU; RAM; Bandwidth. For example, A; 1200; 2 000 000; 1000 represents Grid site A with CPU 1200, RAM 2 000 000 and Bandwidth 1000. A machine is defined by the following attributes: Cores; CPU speed; RAM. For instance, {2; 2000; 2 000 000} represents a Grid resource (machine) with 2 cores, 2000 MHz (2 GHz) and 2 000 000 B (2 MB) of RAM. Table 7 shows the characteristics of the simulated Grid.


Table 6. Algorithm for simulating the execution time of jobs.

Simulation of time of execution:
i. Estimate the job execution time (T) on a reference machine (1 GHz, 1 core):
   a. Return the actual execution time (if available)
   b. Return an estimate based on job size (otherwise)
ii. Scale the expected time to match the current machine:
   a. Calculate the performance ratio (R) between the current and the reference machine
   b. Return the expected execution time divided by the performance ratio (T/R)
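Read literally, Table 6 reduces to a few lines of code. The sketch below is ours; in particular, the size-based fallback estimate and the exact form of the performance ratio (cores × GHz against a 1-core, 1 GHz reference) are assumptions.

```java
// Sketch of Table 6: estimated execution time of a job on a given machine.
class ExecutionTimeSimulator {
    static final double REF_PERFORMANCE = 1.0 * 1.0; // reference: 1 core at 1 GHz

    // Step i: time T on the reference machine. Uses the recorded run time
    // when the trace has one, otherwise a crude size-based estimate (assumed).
    static double referenceTime(double recordedRunTime, double jobSize) {
        return recordedRunTime > 0 ? recordedRunTime : jobSize;
    }

    // Step ii: scale by the performance ratio R and return T / R.
    static double simulate(double recordedRunTime, double jobSize,
                           int cores, double speedGHz) {
        double t = referenceTime(recordedRunTime, jobSize);
        double r = (cores * speedGHz) / REF_PERFORMANCE;
        return t / r;
    }
}
```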

Table 7. Characteristics of our simulated Grid.

Grid site          Number of machines   Speed of CPU   Number of CPUs/cores
A (240 machines)   30                   1 GHz          1
                   30                   2 GHz          1
                   30                   3 GHz          1
                   30                   4 GHz          1
                   30                   1 GHz          2
                   30                   2 GHz          2
                   30                   3 GHz          2
                   30                   4 GHz          2
B (400 machines)   50                   1.5 GHz        2
                   50                   2 GHz          2
                   50                   3.5 GHz        2
                   50                   4 GHz          2
                   50                   1.5 GHz        4
                   50                   2 GHz          4
                   50                   3.5 GHz        4
                   50                   4 GHz          4
C (440 machines)   60                   1.5 GHz        2
                   60                   2 GHz          2
                   60                   3.5 GHz        2
                   60                   4 GHz          2
                   60                   1.5 GHz        4
                   60                   2 GHz          4
                   60                   3.5 GHz        4
                   60                   4 GHz          4
D (600 machines)   50                   1.5 GHz        2
                   50                   2 GHz          2
                   50                   3.5 GHz        2
                   50                   4 GHz          2
                   50                   1.5 GHz        4
                   50                   2 GHz          4
                   50                   3.5 GHz        4
                   50                   4 GHz          4
                   50                   1.5 GHz        8
                   50                   2 GHz          8
                   50                   3.5 GHz        8
                   50                   4 GHz          8

6. Experimental design

This section discusses the experimental design, constraints and the platform of execution. The full range of experiments is shown in Table 8.

6.1. The experiment parameters

The experiment was executed on one of Coventry University's HPC systems, Pluto. We used one node, which consisted of a number of processors; the configuration of the HPC is given in Section 6.2. In the experiment, a Grid environment was simulated comprising four Grid sites with machines of different CPU speeds and numbers of processors, as described in Table 7 above. Job scheduling was aimed directly at the CPUs on the individual machines, such that jobs were assigned to individual processors.

In the first experiment, the MinMin algorithm was executed on the HPC to schedule a range of jobs (from 1000 to 10 000 jobs in steps of 1000). In each instance of the experiment, the time of scheduling was recorded. The time of scheduling is the time taken to schedule each set of jobs, for instance the time taken to schedule 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000 and 10 000 jobs in turn for each method. This experiment did not use groups.

In the second experiment, the Priority method was used to group jobs and the SimilarTogether method to group the machines before implementing the MinMin algorithm to schedule the same range of jobs (1000 to 10 000 in steps of 1000). The MinMin algorithm operated within the matched groups. In the third experiment, the Priority method was again used to group the jobs and the EvenlyDistributed method was used to group machines before implementing the MinMin algorithm within each paired group to schedule the same range of jobs (1000 to 10 000 in steps of 1000).

In all of the above experiments, the number of threads was varied from one to eight in powers of 2 (1, 2, 4, 8) and the results for each specific run were recorded separately.


Table 8. Test scenarios.

Experiment   Group characteristic                     Scheduling method   Number of groups   Number of jobs
1            Non-grouping method                      MinMin              0                  1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10 000
2            Priority grouping – EvenlyDistributed    MinMin              4                  1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10 000
3            Priority grouping – SimilarTogether      MinMin              4                  1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10 000

Table 9. Results from the experiments and correlation between methods. Scheduling times are in milliseconds; EvenDist and SimTog denote the Priority method with EvenlyDistributed and SimilarTogether machine grouping respectively.

Jobs limit   MinMin     EvenDist   SimTog
1000         654        95         105
2000         3230       340        412
3000         7601       673        839
4000         12 920     1092       1345
5000         18 219     1776       2008
6000         22 671     2837       3339
7000         29 504     3860       4570
8000         39 074     5312       7500
9000         48 178     7818       8830
10 000       59 982     12 004     12 058
Total        242 033    35 807     41 006
Average      24 203.3   3580.7     4100.6

Correlation: between MinMin and EvenDist = 0.9740; between MinMin and SimTog = 0.9895; between EvenDist and SimTog = 0.9876.

Anova significance test: between MinMin and EvenDist, P-value = 0.006 (significant); between MinMin and SimTog, P-value = 0.006 (significant); between EvenDist and SimTog, P-value = 0.772 (not significant).

6.2. Configuration of the HPC machine

The configuration of the HPC node on which our experiment was executed is as follows:

Number of physical CPUs per node/head: 2
Number of cores per compute node/head: 12
CPU family: Intel(R) Xeon(R) CPU X5650 2.67 GHz, stepping 02
Operating system: Linux x86_64 RHEL 5

The scenarios tested are shown in Table 8.

7. Results and evaluation

This section presents the results of the experiments and their analysis.

7.1. Results, tables and analysis

The results of the experiments and the computation of the correlation and Anova significance tests are shown in Table 9, while Table 10 shows the computation of speed-up in multiples and in percentages. The correlation results between the methods show that the same general pattern is observed across methods; however, the raw figures show the differences in scheduling time. Figs. 3 and 4 show the speed-up achieved by the Priority method with the two grouping methods. Figs. 5 and 6 illustrate how the Priority method compares with the MinMin method when MinMin is used alone.

7.2. Computation methods

The methods applied for calculating the various performance measures are as follows.

Speed-up in multiples (X). This formula computes the performance improvement at each step of the schedule by dividing the MinMin schedule time by the Priority method's schedule time:

$$\text{Speed-Up (X)} = \frac{MinMin_{schedtime}}{Priority_{schedtime}} \tag{1}$$


Fig. 3. Speed-up in multiples achieved by Priority method.

Fig. 4. Speed-up in percentage achieved by Priority method.

Fig. 5. Comparison between MinMin and the Priority methods.

Speed-up in percent (%). This value is obtained by subtracting the scheduling time of the Priority method at each stage from the scheduling time of MinMin at the corresponding stage, dividing by the scheduling time of MinMin, and multiplying by 100:

$$\text{Speed-Up (\%)} = \left(\frac{MinMin_{schedtime} - Priority_{schedtime}}{MinMin_{schedtime}}\right) \times 100 \tag{2}$$
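As a worked check against the tabulated values: at 4000 jobs, Table 9 gives a MinMin scheduling time of 12 920 ms against 1092 ms for Priority with EvenlyDistributed, so Eq. (1) gives 12 920/1092 ≈ 11.8× and Eq. (2) gives (12 920 − 1092)/12 920 × 100 ≈ 92%, matching the corresponding entries in Table 10.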

7.3. Analysis of data

Table 10 shows the values of speed-up recorded by the Priority methods over MinMin, in percentages and in multiples.


Fig. 6. Total and average scheduling time between the MinMin and the Priority methods.

Table 10. Computation of speed-up in percentages and in multiples with increase in jobs. Schedule times are in seconds; EvenDist and SimTog denote the Priority method with EvenlyDistributed and SimilarTogether machine grouping respectively.

             Schedule time (s)                Speed-up (%)          Speed-up (X)
Jobs limit   MinMin   EvenDist   SimTog      EvenDist   SimTog     EvenDist   SimTog
1000         0.7      0.1        0.1         85%        84%        6.9        6.2
2000         3.2      0.3        0.4         89%        87%        9.5        7.8
3000         7.6      0.7        0.8         91%        89%        11.3       9.1
4000         12.9     1.1        1.3         92%        90%        11.8       9.6
5000         18.2     1.8        2.0         90%        89%        10.3       9.1
6000         22.7     2.8        3.3         87%        85%        8.0        6.8
7000         29.5     3.9        4.6         87%        85%        7.6        6.5
8000         39.1     5.3        7.5         86%        81%        7.4        5.2
9000         48.2     7.8        8.8         84%        82%        6.2        5.5
10 000       60.0     12.0       12.1        80%        80%        5.0        5.0
Total        242.1    35.8       40.9
Average                                      87%        85%        6.8        5.9

7.3.1. Speed-up in percentages (%)

The EvenlyDistributed method recorded between 80% and 92% speed-up against the MinMin algorithm, with an average of 87%. This represents 5.0 to 11.8 times (with an average of 6.8 times) better than the MinMin algorithm. The SimilarTogether method recorded a range of 80% to 90% with an average of 85% speed-up against the MinMin algorithm, representing 5.0 to 9.6 times (with an average of 5.9 times) better than the MinMin algorithm.

7.3.2. Speed-up in multiples (X)

The EvenlyDistributed method recorded between 5.0 and 11.8 times speed-up (with an average of 6.8 times) against the MinMin algorithm, while the SimilarTogether method recorded 5.0 to 9.6 times (with an average of 5.9 times), as the number of jobs increases from 1000 to 10 000 (see Table 10 and Fig. 3). The best improvement is achieved when the number of jobs equals 4000; at this point the speed-up is 11.8 for the EvenlyDistributed method and 9.6 for the SimilarTogether method.

In Fig. 3, we see that the speed-up improved from 1000 jobs to a maximum at 4000 jobs and then began a downward trend. This negative slope of the speed-up as the number of jobs increases indicates that, even though the method was generally better than MinMin, its performance was degrading as the number of jobs increased. This can be attributed to the types of jobs used in the experiment and to the priority grouping method for jobs. In our experimental set, more than 90% of the jobs had low priority and hence were sorted into only one group; as a result, the scheduling time for the method approaches that of the non-grouping MinMin method. MinMin has a polynomial curve as jobs increase. This indicates that it is important to adjust priority scheduling according to the characteristics of the incoming jobs: if the input set has a large number of jobs of the same priority, then some adjustment to the grouping needs to be applied or performance will deteriorate. Fig. 4 shows the speed-up in terms of percentage. Figs. 5 and 6 show the actual times for the three methods tested.

7.3.3. Discussion of the results

There was a significant difference between the results of the MinMin approach and those of both the EvenlyDistributed and the SimilarTogether methods. Both the EvenlyDistributed and the SimilarTogether methods performed better than MinMin alone, with high speed-ups being achieved. There was no significant difference in performance between the EvenlyDistributed and the SimilarTogether methods (see Table 9).


Fig. 7. Pattern of the graph for the Priority methods.

Though both EvenlyDistributed and SimilarTogether performed better than the MinMin algorithm, the pattern of the graph for both methods was generally polynomial (see Fig. 7). Performance was therefore degrading, in relative terms, as the number of jobs increased. This is exacerbated by the fact that, for our test data set, most jobs were scheduled to one machine group while the other machine groups ended up with fewer jobs. As the number of jobs increases, the number of jobs in that one priority group approaches the number in the non-grouping method, thereby degrading the general performance. This effect can be dampened by making sure that jobs are equally distributed among the groups. Polynomial time has the characteristic that as the number of instances in the input set increases, so does the time per instance. Thus grouping to create smaller sets and running in parallel improves makespan performance.

This paper has concentrated on improving scheduling time. However, improvements in scheduling time are not valuable if the resulting schedule is inferior to one which would have been produced by a slower scheduling algorithm. Using our method, we found that if many of the input jobs have the same priority, they will be assigned to the same group of machines, which could cause an imbalance in processing activity. This might result in a poorer schedule than would have been the case if machines were not grouped. On the other hand, if priority is evenly distributed across the input jobs, then the resulting schedule is likely to be of equal if not better quality than one produced without grouping. If the EvenlyDistributed method is used for machine grouping, then the execution time should be the same as without grouping, whereas if SimilarTogether is used, the execution time and quality of service should be improved, as higher-priority and larger jobs can be assigned to the more powerful machines. In either case, makespan should be improved, where we consider makespan to be a combination of both scheduling time and execution time. However, much depends on the exact requirements of the incoming jobs and the characteristics of the receiving Grid. Thus we conclude that an even balance of jobs across groups is desirable, and also that tuning the scheduling parameters according to incoming job characteristics would be beneficial towards achieving a better schedule.

8. Conclusion and future work

We have introduced a Priority-grouping method which can be used to parallelise Grid scheduling. The method is designed to be used in batch scheduling and involves grouping jobs according to some priority and also grouping the machines into the same number of groups. Job groups and machine groups can then be paired so that scheduling can take place in parallel over the paired groups. Two different methods of grouping the machines were explored in this paper: EvenlyDistributed and SimilarTogether. We used the MinMin scheduling algorithm both within our Priority method and, for comparison, as a standalone method. Our results show that the Priority method achieved significant speed-up for both machine grouping methods. We conclude that the Priority-grouping method can be an effective way of reducing scheduling time. The splitting of jobs into groups means that fewer read accesses overall need to be made on jobs and machines to complete the checks required by the MinMin algorithm.
The MinMin method, which is used in our approach, is polynomial in nature; thus savings can be made by using smaller groups even without parallelisation. However, running each grouped pair in parallel achieves still greater processing-time benefits. We also note that the nature of the input set and the machine grouping approach have an impact on the effectiveness of the method. Future work will explore how this relationship can be exploited to achieve greater speed-up.


Acknowledgment

We are grateful to the following:

1. The Parallel and Distributed Systems Group at Delft University of Technology (TU Delft), Netherlands, for making the Grid Workload Archive freely available. Members of the group are: Shanny Anoep (TU Delft); Catalin Dumitrescu (TU Delft); Dick Epema (TU Delft); Alexandru Iosup (TU Delft); Mathieu Jan (TU Delft); Hui Li (U. Leiden); and Lex Wolters (U. Leiden). (GWA 2014) (http://www.pds.ewi.tudelft.nl/).
2. The e-Science Group of HEP at Imperial College London for providing the LCG data, Hui Li for making the data publicly available, and Dr. Feitelson of the Parallel Workloads Archive. http://lcg.web.cern.ch/LCG.
3. The Grid'5000 team (especially Dr. Franck Cappello) and the OAR team (especially Dr. Olivier Richard and Nicolas Capit) for the trace. http://oar.imag.fr.
4. John Morton (john_x_sharrcnet.ca) for providing the trace file and the Parallel Workloads Archive for making it publicly available.
5. The AuverGrid team, with special thanks to Dr. Emmanuel Medernach, owner of the AuverGrid system, made available through the Grid Workloads Archive. http://auvergrid.fr.
6. The NorduGrid team, with special thanks to Dr. Balasz Knoya, owner of the NorduGrid system, made public through the Grid Workloads Archive.

References

[1] I. Foster, C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, 1999, pp. 159–180.
[2] P. Gepner, M.F. Kowalik, Multicore processors: new way to achieve high system performance, in: IEEE Parallel Computing in Electrical Engineering, 2006, pp. 9–13.
[3] W. Knight, Two heads are better than one [dual-core processors], IEEE Rev. 51 (9) (2005) 32–35.
[4] R. Kalla, B. Sinharoy, J.M. Tendler, IBM Power5 chip: a dual-core multithreaded processor, IEEE MICRO 24 (2) (2004) 40–47.
[5] H. Koziolek, S. Becker, J. Happe, P. Tuma, T. de Gooijer, Towards software performance engineering for multicore and manycore systems, ACM SIGMETRICS Perform. Eval. Rev. 41 (3) (2014) 2–11.
[6] X. Tang, K. Li, M. Qiu, E.H.M. Sha, A hierarchical reliability-driven scheduling algorithm in grid systems, J. Parallel Distrib. Comput. 72 (4) (2012) 525–535.
[7] J. Stone, D. Gohara, G. Shi, OpenCL: a parallel programming standard for heterogeneous computing systems, Comput. Sci. Eng. 12 (3) (2010) 66.
[8] M.D. McCool, Scalable programming models for massively multicore processors, Proc. IEEE 96 (5) (2008) 816–831.
[9] G.V.J. Gurudutt Kumar, Considerations in software design for multicore/multiprocessor architectures, developerWorks, IBM (20 May 2013), available from http://www.ibm.com/developerworks/aix/library/au-aix-multicore-multiprocessor (accessed 06/03/2014).
[10] M.D. Hill, M.R. Marty, Amdahl's law in the multicore era, IEEE Comput. 41 (7) (2008) 33–38.
[11] J. Nickolls, I. Buck, M. Garland, K. Skadron, Scalable parallel programming with CUDA, ACM Queue 6 (2) (2008) 40–53.
[12] D.A. Bader, G. Cong, SWARM: a parallel programming framework for multicore processors, in: Encyclopedia of Parallel Computing, Springer, US, 2011, pp. 1966–1971.
[13] R. Dolbeau, S. Bihan, F. Bodin, HMPP: a hybrid multi-core parallel programming environment, in: Workshop on General Purpose Processing on Graphics Processing Units, GPGPU, 2007.
[14] H. Jin, D. Jespersen, P. Mehrotra, R. Biswas, L. Huang, B. Chapman, High performance computing using MPI and OpenMP on multicore parallel systems, Parallel Comput. 37 (9) (2011) 562–575.
[15] P. Ciechanowicz, H. Kuchen, Enhancing Muesli's data parallel skeletons for multi-core computer architectures, in: Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications, HPCC, IEEE, 2010, pp. 108–113.
[16] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, C. Kozyrakis, Evaluating mapreduce for multi-core and multiprocessor systems, in: Proceedings of the 13th International Symposium on High Performance Computer Architecture, HPCA, IEEE, 2007, pp. 13–24.
[17] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, K. Yelick, A view of the parallel computing landscape, Commun. ACM 52 (10) (2009) 56–67.
[18] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, J. Zhou, SCOPE: easy and efficient parallel processing of massive data sets, Proc. VLDB Endow. 1 (2) (2008) 1265–1276.
[19] K.H. Khan, K. Qureshi, M. Abd-El-Barr, An efficient grid scheduling strategy for data parallel applications, J. Supercomput. (2014) 1–16.
[20] J. Chen, B. Li, E.-F. Wang, Parallel scheduling algorithms investigation of support strict resource reservation from grid, Appl. Mech. Mater. 519 (2014) 108–113.
[21] F. Liang, Y. Liu, H. Liu, S. Ma, B. Schnor, A parallel job execution time estimation approach based on user submission patterns within computational grids, Int. J. Parallel Program. (2013) 1–15.
[22] R. Albodour, A. James, N. Yaacob, High level QoS-driven model for grid applications in a simulated environment, Future Gener. Comput. Syst. 28 (7) (2012) 1133–1144.
[23] A. Quezada-Pina, A. Tchernykh, J.L. González-García, A. Hirales-Carbajal, J.M. Ramírez-Alcaraz, U. Schwiegelshohn, R. Yahyapour, V. Miranda-López, Adaptive parallel job scheduling with resource admissible allocation on two-level hierarchical grids, Future Gener. Comput. Syst. 28 (7) (2012) 965–976.
[24] M. Kalantari, M.K. Akbari, A parallel solution for scheduling of real time applications on grid environments, Future Gener. Comput. Syst. 25 (7) (2009) 704–716.
[25] W. Zhang, A.M.K. Cheng, M. Hu, Multisite co-allocation algorithms for computational grid, in: Proceedings of the 20th International Conference on Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2006, p. 8.
[26] G. Sabin, R. Kettimuthu, A. Rajan, P. Sadayappan, Scheduling of parallel jobs in a heterogeneous multi-site environment, in: Job Scheduling Strategies for Parallel Processing, Springer, Berlin, Heidelberg, 2003, pp. 87–104.
[27] B.G. Lawson, E. Smirni, Multiple-queue backfilling scheduling with priorities and reservations for parallel systems, in: Job Scheduling Strategies for Parallel Processing, Springer, Berlin, Heidelberg, 2002, pp. 72–87.


[28] M. Maheswaran, S. Ali, H.J. Siegel, D. Hensgen, R.F. Freund, Dynamic mapping of a class of independent tasks onto heterogeneous computing systems, J. Parallel Distrib. Comput. 59 (2) (1999) 107–131.
[29] O.H. Ibarra, C.E. Kim, Heuristic algorithms for scheduling independent tasks on non-identical processors, J. Assoc. Comput. Mach. 24 (2) (1977) 280–289.
[30] R.F. Freund, M. Gherrity, S. Ambrosius, M. Campbell, M. Halderman, D. Hensgen, E. Keith, T. Kidd, M. Kussow, J.D. Lima, F. Mirabile, L. Moore, B. Rust, H.J. Siegel, Scheduling resources in multi-user, heterogeneous, computing environments with SmartNet, in: Proceedings of the 7th Heterogeneous Computing Workshop, HCW, IEEE, 1998, pp. 184–199.
[31] R.F. Freund, T. Kidd, D. Hensgen, L. Moore, SmartNet: a scheduling framework for heterogeneous computing, in: Proceedings of the 2nd International Symposium on Parallel Architectures, Algorithms, and Networks, ISPAN, IEEE, 1996, pp. 514–521.
[32] K. Etminani, M. Naghibzadeh, A Min-Min Max-Min selective algorithm for grid task scheduling, in: Proceedings of the 3rd IEEE/IFIP International Conference in Central Asia on Internet, IEEE, 2007, pp. 1–7.
[33] P. Xiao, D. Liu, Multi-scheme co-scheduling framework for high-performance real-time applications in heterogeneous grids, Int. J. Comput. Sci. Eng. 9 (1) (2014) 55–63.
[34] M. Canabé, S. Nesmachnow, Parallel implementations of the MinMin heterogeneous computing scheduler in GPU, CLEI Electron. J. 15 (3) (2012) 1–11.
[35] S. Nesmachnow, M. Canabé, GPU implementations of scheduling heuristics for heterogeneous computing environments, in: XVII Congreso Argentino de Ciencias de la Computación, 2011, pp. 292–301.
[36] F. Pinel, B. Dorronsoro, P. Bouvry, Solving very large instances of the scheduling of independent tasks problem on the GPU, J. Parallel Distrib. Comput. 73 (1) (2013) 101–110.
[37] S.A. Mirsoleimani, A. Karami, F. Khunjush, A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments, in: Proceedings of the Fifteenth Annual Conference on Genetic and Evolutionary Computation, ACM, 2013, pp. 1181–1188.
[38] GWA, The Grid Workload Archive, available from http://gwa.ewi.tudelft.nl/dataset/ (accessed 27 June 2013).