Journal of Parallel and Distributed Computing 62, 1203–1222 (2002) doi:10.1006/jpdc.2002.1844
Evolving toward an Optimal Scheduling Solution through Adaptivity
Michael J. Oudshoorn and Lin Huang
Department of Computer Science, University of Adelaide, Adelaide, SA 5005, Australia
E-mail: michael@cs.adelaide.edu.au
Received January 29, 2002; accepted February 15, 2002
Different users of the same application may have vastly different requirements, and therefore completely different usage patterns of the software. This makes the determination of an efficient distribution of the software tasks across the available processors within the distributed system an extremely difficult problem. This paper presents an adaptive system to automatically allocate tasks to processing nodes based on the past usage statistics of each individual user. The system evolves to a stable and efficient allocation scheme. The rate of evolution of the distribution scheme is determined by a collection of parameters that permits the user to fine-tune the system to suit their individual needs. © 2002 Elsevier Science (USA)
Key Words: scheduling mechanisms; adaptive systems; distributed systems.
1. INTRODUCTION

Parallel programming is an attempt to produce solutions to complex problems in an efficient manner in comparison with a sequential solution. Difficult as it is to partition the problem into tasks and to allocate them to the available processors, luxuries such as distributed shared memory and a homogeneous processing environment minimize the complexities. Distributed programming has the same goal, as well as the additional need to employ the available resources, such as databases, in an efficient and effective manner. These resources may not be available to all nodes of the system, and it may be infeasible to migrate the necessary data to a specific node so that computation can proceed. Furthermore, the complexity of partitioning the program into tasks, task scheduling, task communication and synchronization is what makes both parallel and distributed programming difficult. A programming environment can assist in significantly reducing a programmer's workload, and increase system and application performance, by automating the allocation of tasks to the available processing nodes [1]. Such automation also minimizes errors through the elimination of tedious chores and permits the
programmer to concentrate on the problem at hand rather than burdening them with details that are somewhat peripheral to the real job.

This paper focuses on parallel and distributed applications in which task spawning and communication may occur conditionally at runtime. That is to say, the application code may contain conditional statements guarding the spawning of tasks and intertask communication. This means that it is not possible to statically examine the code, determine which tasks will execute, and perform task allocation on that basis. The best that is achievable prior to execution is an educated guess.

The scheduling problem has been proved to be NP-complete [2]. Various heuristics [3, 4] and software tools [4, 5] have been developed to pursue a suboptimal solution within acceptable computational complexity bounds. A detailed survey of such systems is found in [6, 7]. Most scheduling heuristics assume the existence of a task model that represents the application to be executed. The general assumption is that the task model does not vary between program executions. This assumption is valid in domains where the problem presents itself in a regular way (e.g., solving partial differential equations). It is, however, generally invalid for general-purpose applications where activities such as task spawning and communication may take place conditionally, and where the interaction between the application and a user may differ between executions. Consequently, such an approach does not lead to an optimal distribution of tasks across the available processors.

The scheduling of conditional tasks remains an open problem. The tasks may be scheduled at runtime, but such a scheduling scheme suffers from several performance issues: the workflow through the processors must be monitored, and on-the-fly decisions must be made regarding task allocation. Any inappropriate choice early in the execution may force an extremely poor and expensive allocation at a later time.
The alternative approach is the static allocation of tasks to processing elements. This probabilistic approach is the one explored in this paper. El-Rewini et al. [8] proposed an algorithm based on simulation. Prior to execution, a number of simulations are conducted of the possible task models (according to the execution probability of the tasks involved) that may occur in the next execution. Based on the results of these simulations, a scheduling algorithm is employed to obtain a scheduling policy for each task model. These policies are then combined to form a single policy to distribute tasks and arrange the execution order of tasks allocated to the same processor. The algorithm simplifies the task model in order to minimize the computational overhead involved. It is clear, however, that the computational overhead involved in simulation remains excessive. In essence, this technique derives an average scheduling policy based on the probability that each task may run in the next execution of the application.

E-business applications on the Internet are examples of general-purpose applications that may benefit from such scheduling. Users have a variety of mechanisms to interact with the company, and each user is likely to utilize the company's services in a different way. For example, one may be listening to music selections, while another may be purchasing a book, while yet another may be making a query to determine when a previously ordered product will be delivered. These E-business applications may be distributed geographically over a number of
locations and may require access to databases that may be maintained centrally. It is also clear that, when dealing with a global customer base, it is not feasible to predict the best distribution policy for the application. Furthermore, as fads and fashions occur, the interactions between the applications and the client base may vary significantly, and this may require a different scheduling policy. These trends may differ across geographical locations and may consequently require regional scheduling policies.

The dynamic allocation of tasks to processors may not be an attractive option for such an application. As each user interacts with the application, computational overhead is incurred to manage the resources and determine the best node on which to execute the tasks spawned by the user's request. Such overheads may detract from the service levels that a company may wish to offer its on-line customers. The simulation-based static allocation method of El-Rewini et al. [8] clearly suffers from computational overhead and, furthermore, assumes that each user will interact with the software in a similar manner.

The practical approach advocated in this paper is predictive and adaptive. It is sufficiently flexible that an organization can allow it to adapt on an individual, regional, or global basis. This leads to a tailored distribution policy, delivering good performance, to suit the organization.

This paper is organized as follows: Section 2 explores the issues involved in conditional task scheduling and introduces the task model used in the adaptive approach. Section 3 introduces ATME [9], an adaptive task mapping environment, and Section 4 discusses the conditional task scheduling algorithm. An evaluation of the performance of the scheduling algorithm is provided in Section 5, and an evaluation of ATME is provided in Section 6. Section 7 provides some conclusions and a discussion of future work.
2. ISSUES IN CONDITIONAL TASK SCHEDULING

The task-scheduling problem can be decomposed into three major components: (1) the task model, which portrays the constituent tasks and the interconnection relationships among tasks of a parallel program; (2) the processor model, which abstracts over the architecture of the underlying parallel system on which the parallel program is to be executed; and (3) the scheduling algorithm, which produces a scheduling policy by which tasks of a parallel program are distributed onto the available processors and possibly ordered for execution on the same processor.

The aim of the scheduling policy is to optimize the performance of the application relative to some performance measurement. Typically, the aim is to minimize the total execution time of the application [3, 4, 10], or the total cost of communication delay and load balance [11-13]. The scheduling algorithm and the scheduling objective determine the critical attributes associated with the tasks and processors in the task and processor model, respectively. Assuming a scheduling objective of
minimizing the total parallel execution time of the application, the task model is typically described as a weighted directed acyclic graph (DAG) [1, 3, 5, 14], with the edges representing relationships between tasks [15]. The DAG contains a unique start and exit node. The processor model typically illustrates the processors available and their interconnections. Edges show the cost associated with the path between nodes. Figure 1 illustrates a typical processor model. It shows three nodes, P1, P2 and P3, with relative processing speeds of 1, 2, and 5, respectively. Edges represent network latency between nodes.

Since ATME provides support for preemptive task execution [16, 17], it is assumed that message transmission may occur at any point within a task. For the sake of simplicity, it is further assumed that all data are received before a task commences execution, so that the issues of busy waiting and excessive context switching are avoided. Applications supported by ATME are those based on multiple processors that are loosely coupled, execute in parallel, and communicate via message-passing over networks. With the development of high-speed, low-latency communication networks and technology [18-21], and the low cost of computer hardware, such multiprocessor architectures have become commercially viable platforms on which to solve application problems cooperatively and efficiently.

A software application featuring a number of interrelated tasks, owing to data or control dependencies between the tasks, is known as a conditional task system. Each node in the corresponding task model identifies a task in the system and an estimate of the execution time for that task should it execute. Edges between the nodes are labeled with a pair which represents the communication costs (volume and time) between the tasks and the probability that the second task will actually execute (i.e., be spawned) as a consequence of the execution of the first task.
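As an illustration, such a conditional task model can be captured in a small adjacency structure. The sketch below is our own invention, not part of ATME; the class and field names are made up, and the numbers follow the Fig. 2a fragment in which task S certainly spawns A and spawns C with probability 0.4:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    exec_time: float  # estimated execution time, should the task run
    # child task name -> (communication cost, execution probability)
    children: dict = field(default_factory=dict)

# Fragment of the Fig. 2a model: S spawns A with probability 1.0
# and C with probability 0.4.
s = Task("S", 5)
s.children["A"] = (4, 1.0)
s.children["C"] = (6, 0.4)

# A task that may not execute has a "ripple effect": its descendants
# can only run if it runs.
print(s.children["C"][1])  # 0.4
```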
Figure 2a shows an example of a conditional task model: tasks A and C depend on the successful execution of task S, but task C has only a 40% probability of executing if S executes, whereas A is certainly spawned by S. A task, such as C, which may not be executed has a "ripple effect" in that it cannot spawn any dependent tasks unless it itself executes. Figure 2b represents an actual task model that reflects what actually took place during program execution. Note that all probabilities are now either 0 or 1, indicating whether communication and subsequent execution actually took place. The actual task model is used to predict the next likely usage pattern for that particular individual or group of users.

The task model and the processor model are provided to ATME in order to determine a scheduling policy for the application. The scheduling policy determines the allocation of tasks to processors and specifies the execution order on each
FIG. 1. Processor model.
FIG. 2. (a) Conditional task model and (b) actual task model for a particular execution.
processor. The scheduling policy performs this allocation with the express intention of minimizing total parallel execution time based on the previous execution history. The attributes of the processors and the network are taken into consideration when performing this allocation. Figure 3 provides an illustration of the task scheduling process. To avoid cluttering the diagram, all probabilities are set to 1.
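For illustration, a scheduling policy of this kind amounts to an allocation of tasks to processors together with an execution order per processor. The sketch below is ours; the task and processor names are invented:

```python
# A scheduling policy: per-processor task lists, in execution order.
policy = {
    "P1": ["S", "A"],
    "P2": ["C", "E"],
    "P3": ["D"],
}

# Derive the task-to-processor allocation from the policy.
allocation = {task: proc for proc, tasks in policy.items() for task in tasks}
print(allocation["C"])  # P2
```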
3. ATME

ATME is developed as a parallel and distributed programming environment that addresses the task scheduling issue. Input into ATME consists of the user-defined parallel tasks (i.e., the application), the task interconnection structure and the processor topology specification. ATME then annotates and augments the user
FIG. 3. The process of solving the scheduling problem.
source code and distributes the tasks over the available processors for physical execution.

ATME is developed over the PVM platform [22]. The user tasks are physically mapped onto the virtual machines provided by PVM, but the use of PVM is entirely transparent to the user. This permits the underlying platform to be changed with ease, and ensures that ATME is portable. In addition, the programmer is relieved of the need to be concerned with the subtle characteristics of a parallel and distributed system.

ATME tackles the conditional scheduling problem in a novel manner. The approach comprises two steps. First, the task model of the forthcoming program execution is estimated by analyzing execution profiles of past executions. Second, a new algorithm, named CET [23], is invoked to generate a scheduling policy. Figure 4 illustrates the functional components and their relationships.

The target machine description component presents the user with a general interface to specify the available processors, processor interconnections and their physical attributes, such as processing speed and data transfer rate. The user-supplied application code is preprocessed, instrumented and analyzed by the program preprocessing and analysis component, which enables it to execute under PVM and produces the information to be captured at runtime. The task interconnection structure is also generated in this component. ATME provides explicit support through PVM-like runtime primitives to realize task spawn and message-passing operations.
FIG. 4. Structure of the ATME environment.
With the task model obtained from the task model construction component and the processor model from the target machine description, the task scheduling component generates a policy by which the user tasks are distributed onto the underlying processors. At runtime, the runtime data collection component collects traces produced by the instrumented tasks; after the execution completes, these are stored in the program database and taken as input by the task model construction component to predict the task model for the next execution. The post-execution analysis and report generation components provide various reports and tuning suggestions to the user and to ATME for program improvement.

The underlying target architecture is generally stable; that is to say, the system generally does not change along with the application program. In this case, the processor model of the target machines, once established, does not have to be reconstructed every time an application program is executed. There is also no need to undertake further program preprocessing and analysis until the application code is modified.

A feedback loop exists in the ATME environment: starting with the task model construction, through task scheduling and runtime data collection, and back to task model construction. This procedure makes ATME an adaptive environment in that the task model offered to the scheduling algorithm is incrementally established based on the past usage patterns of the application. Accurate estimates of task attributes are obtained for relatively stable usage patterns, which admits improvement in execution efficiency. The data collected in the program database are aged so that older data have less influence on the determination of the task attributes. This ensures that ATME responds to evolving usage patterns, but not at such a rapid rate that a single execution which does not fit the usual profile forces a radical change in the mapping strategy.
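Aging of this kind can be sketched as an exponentially weighted average over the retained execution histories. This is an illustrative sketch only; the decay factor is an assumed parameter, not a value taken from ATME:

```python
def aged_estimate(history, decay=0.5):
    """Estimate a task attribute from past executions (most recent first).
    Older observations receive exponentially smaller weights, so a single
    atypical run cannot radically change the estimate."""
    weights = [decay ** i for i in range(len(history))]
    return sum(w * h for w, h in zip(weights, history)) / sum(weights)

# Three recent runs took ~10 time units; one old run took 100.
print(aged_estimate([10, 10, 10, 100]))  # 16.0 -- dominated by recent runs
```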
Experimental results [17] support the theoretical analysis [15] and demonstrate that the use of ATME leads to a considerable reduction in total parallel execution time for a large variety of applications when compared to a random or round-robin distribution of tasks. This includes applications that have constant usage patterns, as well as those that evolve over an extended period of time and those whose usage pattern varies significantly. The key to this improvement is the scheduling algorithm employed and the adaptive nature of the ATME environment.
4. PREEMPTIVE TASK SCHEDULING

A typical feature of preemptive task execution is that the pre-determined scheduling policy is not altered while the program is executing; only the execution sequence and commencement time of tasks on each host processor may vary. The assumption is made that all information required by a task in order for it to execute is available to it prior to it commencing its execution. This avoids having to deal with idle time within the execution of a task. We consider two forms of preemption in the scheduling algorithm presented:

* α-Preemption: where the execution of a task may be interrupted by another task which is distributed to the same processor.
* β-Preemption: where a dependent task does not wait for the parent task to complete before it commences its execution. It merely waits until all of the information it requires is available.
Generic strategies are proposed for managing task execution where preemption is permitted. These strategies are termed αP and βP, dealing with α- and β-preemption, respectively. The notations NαP and NβP are used to refer to a scheduling policy that does not permit α- or β-preemption, respectively. A scheduling algorithm supporting the policies NαP and NβP is clearly non-preemptive. The αP strategy is discussed in the literature dealing with job scheduling [24, 25]. A detailed discussion of the performance improvements gained by each strategy is provided in [17]. It is discovered through case analysis that the βP strategy leads to an improvement in system performance over the non-preemptive case, whereas αP does not necessarily result in an improvement.

Several proposals to deal with non-preemptive task scheduling have been put forward. There exist two main techniques to statically handle this scheduling problem. One approach is cluster scheduling [5, 14], which first groups the tasks into "clusters" and then, in a second step, distributes them onto the available processors. Further clustering may be required in the second step when the number of available processors is less than the number of task clusters produced. The other main strategy is list scheduling, in which each task is first assigned a priority and is then allocated to an idle processor according to that priority.

In order to schedule preemptive tasks it is necessary to modify the task model to incorporate a new task attribute that more accurately reflects the execution of parallel tasks. This new attribute is referred to as the preemption start point. It represents the point at which message transmission within a parent task to a dependent child task may first occur at runtime, and is defined as the ratio

    V(m, c) = (CT(c) - CT(m)) / U(m),

where m is a parent task and c its dependent child task, CT(t) is the execution commencement time of task t, and U(t) is the minimum completion time of task t, defined as CT(t) plus the total expected execution time of t assuming no preemption. V(m, c) provides some measure of the point at which the parent task may communicate with a dependent task allocated to the same processor, and therefore some measure of when the parent task may be swapped out and the dependent task swapped in.

The preemptive scheduling algorithm PET deals with the preemptive task model. PET is based on the list-scheduling algorithm ERT [4]. At any instant, each schedulable or free task (i.e., a task for which all tasks on which it depends have been distributed onto processors) in the application program is assigned a priority value based on its "earliest start time." The PET algorithm, which schedules tasks prior to execution, is presented below. The following definitions are required first. Let:
* M(t1, t2) represent the time latency to transmit data from task t1 to task t2, resident on processors p1 and p2, respectively. If the two tasks are on the same processor, M(t1, t2) is assumed to be 0. If the execution probability between t1 and t2 is estimated to be 0, then M(t1, t2) is also regarded as 0.
* A(p) represent the time when processor p is available to execute another task.

PET ALGORITHM.

Step 1 (Initialization). All tasks are considered unallocated. All processors are idle and available immediately. Set the current time t = 0.

Step 2 (Task scheduling).

* Let W be the set of all tasks that have not yet been scheduled and whose parent tasks have all been scheduled. If W is empty, then exit.
* For each task ti in W and each processor pj, compute

      S1(ti, pj) = max(F(tip, pip) · V(tip, ti) + M(tip, ti))

  over every parent task tip of ti residing on host pip, where F(tip, pip) is the finish time of tip on pip. Let

      S2(ti, pj) = max{S1(ti, pj), A(pj)}

  be the "earliest start time" of task ti on processor pj.
* Evaluate the smallest "earliest start time"

      S0(tk, pl) = min(S2(ti, pj))

  among all tasks ready to be scheduled on the available processors. Task tk is then scheduled onto processor pl.
Step 3 (Update). The earliest time that the selected processor pl will be available is set to its previous earliest available time plus the estimated length of time required to execute the selected task tk on processor pl.

Step 4. Go to Step 2.

The complexity of the PET algorithm is O(mn²), where m is the number of processors in the distributed system and n is the number of tasks of the parallel program. The efficiency of PET is illustrated in Section 5 through extensive simulation experiments. The PET algorithm employs the policy NαP with βP (no α-preemption, with β-preemption) to manage program execution. This policy can ensure a performance improvement for parallel programs.
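As an illustrative sketch (not ATME's implementation), the greedy earliest-start-time core of such a list scheduler can be written as below. It is deliberately simplified: processors are uniform, the preemption term V is folded out (each parent's data is assumed available at its finish time), and tasks are visited in a fixed topological order rather than by globally selecting the minimum S0. All names are our own:

```python
def list_schedule(tasks, parents, exec_time, comm, num_procs):
    """Greedy earliest-start-time list scheduling, in the spirit of PET.

    tasks:     task names in topological order
    parents:   task -> list of parent task names
    exec_time: task -> estimated execution time
    comm:      (parent, child) -> message latency if hosts differ (M)
    """
    finish = {}                # task -> finish time, F
    host = {}                  # task -> processor index
    avail = [0.0] * num_procs  # A(p): when each processor becomes free
    for t in tasks:
        best_start, best_proc = None, None
        for p in range(num_procs):
            # S1: data from every parent must have arrived at p
            s1 = max((finish[m] + (0 if host[m] == p else comm[(m, t)])
                      for m in parents[t]), default=0.0)
            s2 = max(s1, avail[p])          # S2 = max(S1, A(p))
            if best_start is None or s2 < best_start:
                best_start, best_proc = s2, p
        host[t] = best_proc
        finish[t] = best_start + exec_time[t]
        avail[best_proc] = finish[t]        # Step 3: update A(pl)
    return host, finish

# Two processors; S spawns A and B; communication is expensive, so the
# scheduler keeps all three tasks on one processor.
host, finish = list_schedule(
    tasks=["S", "A", "B"],
    parents={"S": [], "A": ["S"], "B": ["S"]},
    exec_time={"S": 2, "A": 3, "B": 3},
    comm={("S", "A"): 5, ("S", "B"): 5},
    num_procs=2)
print(finish["B"])  # 8.0
```

Raising the number of processors or cutting the communication latencies shifts B onto the second processor, which is exactly the computation/communication trade-off that PMRatio captures in Section 5.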
5. EVALUATION OF THE PET ALGORITHM

Experiments conducted illustrate the performance improvement due to the consideration of preemption among tasks within ATME. A constant usage pattern is assumed so that the adaptive nature of ATME does not affect this experiment. A large
number of parallel programs are simulated by simply generating task attributes such as task computation time, intertask communication data volume, and β-preemption start point. Parallel applications are classified according to PMRatio, the ratio of the average magnitude between task computation and communication in a task model. A high value of PMRatio (e.g., 10.0) models computation-intensive applications, while a low value (e.g., 0.1) represents communication-intensive applications. A range of experiments with varying values of PMRatio is undertaken with different task numbers, different task interconnections and various numbers of processors to obtain the average performance in each situation. The usage patterns for all experiments were constant so that the adaptive nature of ATME did not affect the results.

Three kinds of experimental results for each simulated parallel program are studied and compared. First, assume all constituent tasks are non-preemptive and undertake the scheduling procedure to distribute tasks onto processors; the "non-preemptive" performance is determined in this experiment. Next, allow tasks to be preemptive at runtime while ignoring this factor when performing task scheduling; this is "preemptive task execution." Finally, an experiment is conducted which considers the preemption start point while distributing tasks; thus, "preemptive task scheduling" performance results are determined.

The term GNR is introduced to measure the parallel execution time difference between preemptive task execution (PTE) and non-preemptive execution (NPTE):

    GNR = (Exec. Time(NPTE) - Exec. Time(PTE)) / Exec. Time(NPTE).

PNR is defined as the performance discrepancy between preemptive task scheduling (PTS) and the non-preemptive case (NPTS):

    PNR = (Exec. Time(NPTS) - Exec. Time(PTS)) / Exec. Time(NPTS).
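The grouping reported in Table 1 can be reproduced from a list of per-application ratios. A small sketch (the sample ratios are invented, not taken from the experiments):

```python
def summarize(ratios):
    """Group ratio values (GNR or PNR) by sign and report, per group,
    the percentage of executions (Exec%) and average magnitude (Diff%)."""
    groups = {"<0": [], "=0": [], ">0": []}
    for r in ratios:
        key = "<0" if r < 0 else ">0" if r > 0 else "=0"
        groups[key].append(r)
    n = len(ratios)
    return {k: (100.0 * len(v) / n,
                100.0 * sum(abs(x) for x in v) / len(v) if v else 0.0)
            for k, v in groups.items()}

stats = summarize([0.1, 0.2, -0.05, 0.15])  # invented GNR values
print(stats[">0"][0])  # 75.0 -- Exec% of runs showing an improvement
```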
Table 1 shows the experimental results of the performance comparison between (a) PTE and NPTE, and (b) PTS and NPTS, respectively. The data are grouped depending on the value of GNR and PNR. The table displays the percentage of applications (Exec%) falling into each group and the average magnitude of GNR and PNR (Diff%), respectively. The context switch time between tasks on the same host is ignored in the experiments.

The experimental results are consistent with those predicted by the formal case analysis. From Table 1a, preemptive task execution shows better performance than non-preemptive execution in about 90% of the simulated applications. From Table 1b it is seen that preemptive scheduling can achieve up to 20% better performance than non-preemptive scheduling when PMRatio is 10.0, but in a very limited number of cases PTS performs slightly worse than NPTS. In general, both the theoretical discussion and extensive empirical experiments indicate that consideration of preemption in task execution and scheduling can achieve improved system performance.
TABLE 1
Performance Comparison between (a) Preemptive Task Execution and Non-Preemptive Execution, and (b) Preemptive Task Scheduling and Non-Preemptive Task Scheduling

(a)          GNR = 0    GNR < 0           GNR > 0
PMRatio      Exec%      Exec%    Diff%    Exec%    Diff%
0.1          0          2        0        98       4
0.5          0          1        0        99       10
1.0          0          2        0        98       10
5.0          0          9        0        91       7
10.0         0          9        0        91       7

(b)          PNR = 0    PNR < 0           PNR > 0
PMRatio      Exec%      Exec%    Diff%    Exec%    Diff%
0.1          5          4        10       91       6
0.5          1          1        1        98       12
1.0          3          1        1        96       14
5.0          2          2        1        96       17
10.0         1          1        1        99       20
6. EVALUATION OF ATME

The experiments discussed below illustrate the performance of ATME in generating a good scheduling policy (and thus increasing system performance) for conditional task models. These experiments include an evaluation of the performance of ATME, an examination of the need for accuracy in the task model, the adaptiveness of ATME as the usage patterns of the parallel program change, and a comparison with a random distribution of parallel tasks. The experiments focus on evaluating the application execution time, which is assumed to be solely affected by the scheduling policy.

In the experiments designed to simulate user interaction, the parallel system and user programs are abstracted by a set of parameters, as shown in Table 2. The parallel system is assumed to be composed of identical processors which are fully connected via identical networks. The architecture of the system is therefore determined once the number of available processors, indicated by the ProcNum parameter, is given. The processor execution speed and network transmission speed are both presumed to be 1.0. TaskNum gives the number of tasks in an application program. For a fixed TaskNum, a number (determined by Sets) of concrete task models are established, each of which differs from the others with respect to task attribute values and task interconnection structure. The task model represents the execution time and sequence of the parallel tasks, while task interconnections reflect the task precedence relationships in the application. Tasks and task interconnections are assigned values to represent the computation time, communication volume, and the execution probability (generally referred to as "task attributes"). The definition of PMRatio remains unchanged from Section 5.
TABLE 2
Experimental Control Values

Control parameter     Adopted values
ProcNum               5, 11, 17
PMRatio               0.1, 0.5, 1.0, 5.0, 10.0
TaskNum               20, 32, 44
RegLB, RegUB          3, 10
FreqLB, FreqUB        10, 15
InitRuns              3
Runs                  15
Sets                  10
ExecHistory           3
A parallel program, in its lifetime, can have different usage patterns imposed by different or even the same users. The usage pattern of a program is portrayed by the set of input parameters provided to each task when invoking the program. In the simulation, for a fixed task model, the variation in each input parameter is assumed to follow a sine curve across executions, simulating the fluctuation of usage patterns. Such a curve can be tuned to represent rapid or gradual change in value (frequency within [FreqLB, FreqUB]), and small or significant change (magnitude within [RegLB, RegUB]), reflecting typical use by a user.

Once the input parameters to a task are determined, the task attributes attached to the task model are fixed. In the simulated task models, the input parameters to a task have a linear relationship to task computation time and task communication data size. The value of the execution probability between a pair of interconnected tasks is discrete (either 0 or 1) and is determined by evaluating the value range of the task input parameters. Such functional relationships between task input parameters and task attributes are not available to ATME. ATME can only identify the usage pattern of a program by analyzing the execution information captured at runtime, from which it makes its own estimates of the task attributes for the next execution.

After fixing a task model and a processor model, InitRuns executions are initially simulated with a default scheduling policy (all tasks are assigned to a single processor and ordered by their precedence constraints) in order to collect appropriate task attribute values; Runs executions are then performed to evaluate the performance of ATME and conduct other experiments regarding, say, the influence of an accurately estimated task model on system performance. The number of execution histories kept to assist in estimation is determined by ExecHistory.
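The fluctuating input parameters can be simulated as described. In this sketch the base value, magnitude, and frequency are invented stand-ins for values drawn from [RegLB, RegUB] and [FreqLB, FreqUB]:

```python
import math

def input_parameter(run, base, magnitude, frequency):
    """Value of one task input parameter at a given execution number.
    The sine term models gradual drift in a user's usage pattern."""
    return base + magnitude * math.sin(2 * math.pi * run / frequency)

# A parameter oscillating between 17 and 27 over a 15-execution cycle.
values = [input_parameter(r, base=22, magnitude=5, frequency=15)
          for r in range(15)]
assert 17 <= min(values) and max(values) <= 27
```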
In the following discussion, experimental results illustrate the efficiency of ATME and issues involved in the conditional task scheduling problem.
6.1. The Application Performance when Using ATME

The application performance of ATME is measured by the performance discrepancy between ATME and the ideal case in which the task model is assumed
to be precisely available while producing the scheduling policy. For this purpose, the term AIR is introduced:

    AIR = (Ideal Exec. Time - ATME Exec. Time) / Ideal Exec. Time.
A positive value for AIR indicates that ATME performs better than when the task attributes are precisely known prior to execution; that is, the parallel execution time of the program under ATME is shorter than in the ideal case. A negative AIR shows the opposite. That the performance of ATME can exceed that of the ‘‘ideal’’ implies that the ‘‘ideal’’ performance discussed here is not necessarily the optimal performance, but merely represents the ‘‘best’’ solution as anticipated by the scheduling algorithm.

The experiment is conducted as follows. Prior to program execution, a scheduling policy is generated from the ATME-predicted task model to distribute tasks onto the available processors. The tasks of the program are executed in parallel on their hosting processors and, at the end of the execution, the ‘‘actual’’ execution time is obtained. This is the program performance achieved by employing ATME. After program execution, the actual task attributes (captured at runtime) are known. These values reconstruct the ‘‘ideal’’ task model that precisely reflects the task behavior at runtime. We repeat the execution with the ideal task model and obtain the ‘‘ideal execution time.’’ The first two lines in Fig. 5 display this process. The experiment compares the ‘‘actual’’ and ‘‘ideal’’ execution times in order to show the variation in application performance when employing ATME.

Table 3 shows the execution time of application programs when employing ATME to automate task scheduling. The results illustrate the performance of ATME in different situations represented by the value of PMRatio. For a fixed value of PMRatio, the executions are divided into three groups, depending on the value of AIR. Each group has sub-columns representing the percentage of all executions falling into the group and, where AIR is nonzero, the average difference from the ‘‘ideal execution time.’’
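As a concrete illustration, AIR and the three-way grouping used in Table 3 can be computed as follows; the execution times are hypothetical:

```python
def air(ideal_time, atme_time):
    """AIR = (ideal - ATME) / ideal; positive means the ATME-scheduled
    run finished faster than the 'ideal' schedule built from exact
    task attributes."""
    return (ideal_time - atme_time) / ideal_time

def bucket(runs):
    """Group (ideal, atme) execution-time pairs as in Table 3: for each
    group report the share of runs (Exec%) and, where AIR is nonzero,
    the mean |AIR| as a percentage (Diff%)."""
    groups = {">0": [], "=0": [], "<0": []}
    for ideal, atme in runs:
        a = air(ideal, atme)
        key = ">0" if a > 0 else ("<0" if a < 0 else "=0")
        groups[key].append(abs(a))
    n = len(runs)
    return {k: (100.0 * len(v) / n,
                100.0 * sum(v) / len(v) if v else 0.0)
            for k, v in groups.items()}

# Hypothetical (ideal, ATME) execution-time pairs for four runs.
runs = [(10.0, 10.0), (10.0, 9.0), (10.0, 11.0), (10.0, 10.0)]
print(bucket(runs))
```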
From Table 3 it can be observed that in around 60% of executions ATME performs as well as, or better than, the ideal situation (in which all task attributes are precisely known and available prior to execution). In each category of PMRatio, it is possible for the performance of ATME to be better than that of the ideal (the AIR > 0 column). The performance discrepancy between ATME and the ideal case is around
[Figure: the flow from task attributes (ATME-estimated, ideal/actual, and assumed-actual) through the task model and scheduling algorithm to a schedule policy, execution, and the corresponding actual, ideal and assumed-actual execution times.]

FIG. 5. The actual, ideal and assumed-actual execution times.
TABLE 3
Application Performance under ATME

                 AIR > 0         AIR = 0        AIR < 0
PMRatio      Exec%   Diff%       Exec%      Exec%   Diff%
0.1           14      7.2         45         41      15.1
0.5           12      3.8         50         38       7.6
1.0            9      2.7         50         41       7.7
5.0            8      3.2         53         39       6.1
10.0           9      3.7         55         36       4.0
10% (the AIR > 0 column). In addition, when utilizing ATME, computation-intensive applications (PMRatio is 10.0) show better performance than those that are communication intensive. Tasks in computation-intensive applications are distributed evenly among the underlying processors according to their computation time, which ATME estimates with high accuracy. On the other hand, when PMRatio is 0.1 (communication-intensive applications), minor inaccuracy in the task attribute estimates may result in significant differences in the execution time.

6.2. Impact of the Accuracy of Estimates

An experiment is conducted to show the performance improvement brought about by accurate task attribute estimates. Consider the first and the third lines in Fig. 5. Given that the task attributes are precisely known after execution, we assume that one of the three task attributes is precisely known prior to execution (while the other two attributes still use the estimates that ATME predicts), and thereby obtain the ‘‘assumed-actual task model.’’ This task model is ‘‘submitted’’ for execution and thus the ‘‘assumed-actual’’ time is obtained. We therefore get three ‘‘assumed-actual’’ times under three different assumptions: accurate task computation time, accurate communication data size between tasks, and accurate execution probability between tasks. Similar to AIR, we introduce a term, SIR, to represent the performance difference between the assumed-actual and the ideal situation:

    SIR = (Exec. Time(Ideal) - Exec. Time(Assumed actual)) / Exec. Time(Ideal).
The experiment illustrates the performance improvement that is obtained with accurate task attributes, especially a precise execution probability. In Table 4, column Actual lists the percentage of executions and the performance difference (when AIR > 0) from the ideal situation when ATME uses only its estimated task attributes to distribute tasks onto processors. Columns AssAcuComp, AssAcuComm and AssAcuProb show situations where ATME generates the scheduling policy from a mixture of estimated task attributes and one assumed-actual attribute. All execution performance is compared to the ideal situation, and Table 4 lists only results for which AIR or SIR > 0.
TABLE 4
Performance Improvement Based on Precise Task Attributes

              Actual       AssAcuComp      AssAcuComm      AssAcuProb
             AIR > 0        SIR > 0         SIR > 0         SIR > 0
PMRatio   Exec%  Diff%   Exec%  Diff%    Exec%  Diff%    Exec%  Diff%
0.1        64     15      61     16       59     15       49      9
0.5        54     10      51     10       54      9       39      5
1.0        58      9      57      8       59      9       36      5
5.0        55      7      49      7       55      7       36      4
10.0       58      6      49      6       57      6       40      3
The experiment indicates that significant performance improvement occurs when the execution probability is assumed to be precisely available prior to execution. For instance, when PMRatio is 1.0, the performance of 58% of simulated executions is 9% worse than that of the ideal execution; in this situation, if the execution probability is assumed to be precisely known, the new scheduling policy results in only 36% of executions falling behind the corresponding ideal execution, with an average efficiency difference of 5%. The experimental results also imply that the execution probability attribute in conditional task scheduling cannot be ignored. Table 4 further shows that accurate task computation or communication estimates bring about a small improvement in application performance. This may explain why, in the area of deterministic task scheduling, in which the execution probability for all tasks is 1.0, some researchers [5] have claimed that the scheduling algorithm is not weight-sensitive.
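The construction of an ‘‘assumed-actual’’ task model can be sketched minimally as follows, assuming each task's attributes are held in a simple dictionary; the task names, attribute keys and values are illustrative:

```python
def assumed_actual_model(estimated, actual, accurate_attr):
    """Build an 'assumed-actual' task model: every task keeps ATME's
    estimates except for one attribute, which is taken from the values
    actually observed at runtime. `accurate_attr` is one of
    'comp' (computation time), 'comm' (communication data size) or
    'prob' (execution probability)."""
    model = {}
    for task, est in estimated.items():
        attrs = dict(est)                            # start from estimates
        attrs[accurate_attr] = actual[task][accurate_attr]
        model[task] = attrs
    return model

# Hypothetical two-task model; attribute values are illustrative.
estimated = {"t1": {"comp": 12.0, "comm": 3.0, "prob": 1},
             "t2": {"comp": 7.5,  "comm": 1.0, "prob": 0}}
actual    = {"t1": {"comp": 10.0, "comm": 4.0, "prob": 1},
             "t2": {"comp": 8.0,  "comm": 1.0, "prob": 1}}

m = assumed_actual_model(estimated, actual, "prob")
print(m["t2"])   # estimates kept, but 'prob' replaced by the actual value
```

Scheduling from such a mixed model isolates the contribution of one precisely known attribute, which is exactly the comparison reported in Table 4.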
6.3. Adaptiveness of ATME

When the usage pattern of a parallel program is stable, i.e., the program is provided with the same set of input parameters over a number of executions, ATME can adaptively capture the precise task model and its attributes at runtime and thus gradually approach the performance of the ideal situation. When the usage pattern varies, ATME needs to collect a new set of execution histories so that it can reestablish a model that truly reflects the parallel program. This experiment evaluates how many executions ATME needs to achieve its ideal performance. For a fixed set of control parameters, such as PMRatio, task number and processor number, Sets task models are simulated, each differing from the others in task structure and task attributes. For each task model, Runs program executions are simulated: the initial several runs use exactly the same task model; the usage pattern of the program is then abruptly changed (simulating a change of task attributes, especially the execution probability between interconnected tasks); and the new task model is adopted for the remaining test runs to show the rate at which ATME tends toward the ideal performance.
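The history-based estimation underlying this adaptiveness can be sketched as a sliding-window average. This is an illustration of the trade-off discussed below, not ATME's actual prediction algorithm:

```python
from collections import deque

class AttributeEstimator:
    """Sliding-window estimator standing in for ATME's attribute
    prediction: the estimate for the next execution is the mean of the
    last `exec_history` observed values. A shorter window adapts faster
    after an abrupt usage-pattern change; a longer one is steadier when
    the pattern fluctuates."""
    def __init__(self, exec_history=3):
        self.window = deque(maxlen=exec_history)

    def record(self, observed):
        """Store the attribute value captured at runtime."""
        self.window.append(observed)

    def estimate(self, default=0.0):
        """Predicted attribute value for the next execution."""
        if not self.window:
            return default
        return sum(self.window) / len(self.window)

est = AttributeEstimator(exec_history=3)
for t in [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]:  # abrupt pattern change
    est.record(t)
print(est.estimate())   # the window now holds only post-change values
```

With a window of three, the estimator returns to the new steady value three executions after the change, mirroring the recovery behavior reported in Table 5.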
TABLE 5
The Adaptiveness of ATME

PMRatio   TN   PN   Exec.    AIR    AcuComp   AcuComm   AcuProb
1.000     32   11     1      0.00     0.91      0.92      1.00
                      2      0.00     0.98      0.99      1.00
                      3      0.00     1.00      1.00      1.00
                      4      0.00     1.00      1.00      1.00
                      5      0.00     1.00      1.00      1.00
                      6      0.00     1.00      1.00      1.00
                      7      0.00     1.00      1.00      1.00
                      8     -0.13     0.26      0.23      0.64
                      9     -0.04     0.91      0.91      0.82
                     10     -0.10     0.62      0.67      0.82
                     11      0.00     1.00      1.00      1.00
                     12      0.00     1.00      1.00      1.00
                     13      0.00     1.00      1.00      1.00
                     14      0.00     1.00      1.00      1.00
                     15      0.00     1.00      1.00      1.00
Table 5 displays the experimental results when PMRatio is 1.0, with 32 parallel tasks in the program (TN column) and 11 processors available in the parallel system (PN column). Fifteen test runs (Exec. column) are performed for the fixed task model; in the middle (the 8th run) the task model is changed, and it remains the same for the following runs. The AIR column shows the performance discrepancy between ATME and the ideal situation. AcuComp, AcuComm and AcuProb represent the accuracy of the ATME estimates of task computation time, communication data size and execution probability, respectively. When the task model suddenly changes, ATME needs to recollect the task information and rebuild the task model. With only one sharp drop in ATME performance (13% from the ideal performance), and two cases of performance fluctuation (4% and 10%, respectively), ATME quickly adapts to the ideal situation. Within three program executions of the pattern change, ATME again achieves the ideal performance, and the accuracy of its estimates of the task attributes returns to 100% as well. The adaptiveness of ATME depends on the fluctuation of the usage patterns of the parallel program and on the execution history ATME requires to predict and construct the task model. The more stable the usage pattern, the more quickly ATME can adapt; and when the pattern is quite stable, the shorter the execution history, the faster the adaptation. The users of a parallel program should know how they intend to use it (sharp or soft pattern changes between executions); therefore, we leave the choice of the parameter ExecHistory to the user.
6.4. ATME vs. Random Distribution

This experiment compares the performance of ATME with that of a random distribution strategy. The reason for the comparison
against a random distribution strategy (denoted RDIST) is that no other algorithms addressing conditional task scheduling have been identified. For each set of ‘‘control parameters’’ (such as PMRatio, the number of tasks in the program, the number of processors available, etc.), a number of task models are simulated to represent different kinds of parallel programs. For each ‘‘program,’’ ATME is used to obtain a scheduling policy (based on previous program executions); a random distribution policy (RDIST) is then used with the same task model to distribute tasks onto processors. The parallel execution time of each ‘‘program’’ under these two scheduling strategies is compared and the average values are obtained. We introduce the term RAR to measure the difference in parallel execution time (i.e., the scheduling objective) achieved by ATME against the random strategy RDIST:

    RAR = (Exec. Time(RDIST) - Exec. Time(ATME)) / Exec. Time(RDIST).
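The RDIST baseline and the RAR computation can be sketched as follows; the execution times, task and processor names are hypothetical:

```python
import random

def rar(rdist_time, atme_time):
    """RAR = (RDIST - ATME) / RDIST; positive means the ATME schedule
    executed faster than the random distribution."""
    return (rdist_time - atme_time) / rdist_time

def random_schedule(tasks, processors, seed=0):
    """RDIST baseline: assign each task to a uniformly random processor,
    ignoring all task attributes. Seeded for reproducibility."""
    rng = random.Random(seed)
    return {t: rng.choice(processors) for t in tasks}

print(random_schedule(["t1", "t2", "t3"], ["p1", "p2"]))
print(rar(12.0, 9.0))   # -> 0.25, i.e., ATME 25% faster than RDIST
```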
A positive value for RAR indicates that ATME performs better than the random strategy RDIST; a negative RAR shows the opposite. Table 6 shows the experimental results. For each group (depending on the value of RAR), the table displays the percentage of applications (Exec%) falling into the group and the average performance difference (AveDiff%) between RDIST and ATME. As illustrated in Table 6, ATME significantly outperforms RDIST in all cases. For instance, when PMRatio is 0.5, that is, when the communication load is slightly higher than the computation load in a parallel application, 98% of applications show as much as a 46% performance gain when utilizing ATME to schedule tasks, as compared to the random strategy. From Table 6 it can also be seen that the performance of applications with an intensive communication load is more sensitive to the scheduling policy than that of other kinds of parallel applications. When PMRatio is 0.1, RDIST performs up to 124% worse than ATME in around 99% of executions. From the above experiments, we can see that application programs utilizing ATME to automate scheduling can generally achieve good performance. The performance is affected by the accuracy of the task attribute estimates: accurate task attributes, especially the execution probability, can improve the execution
TABLE 6
The Performance of ATME vs. RDIST

               RAR < 0        RAR = 0       RAR > 0
PMRatio   Exec%  AveDiff%      Exec%    Exec%  AveDiff%
0.1         2        2           1       98      124
0.5         2        3           1       97       46
1.0         3        4           2       95       35
5.0         4        3           2       94       22
10.0        3        3           2       95       20
efficiency markedly. Therefore, in addition to research into efficient scheduling algorithms, there is a need to focus on providing accurate task information to the algorithm, which is precisely what ATME aims to do.
7. CONCLUSIONS AND FUTURE WORK

This paper addresses the issue of scheduling preemptive tasks in a parallel and distributed system that supports a message-passing protocol. Significant previous work has focussed on the specification of the task model and the development of scheduling algorithms for non-preemptive tasks, largely ignoring the preemptive nature of tasks within a general-purpose parallel and distributed application. The scheduling algorithm discussed in this paper, when used with the adaptive environment provided by ATME, results in a good automated distribution strategy that outperforms most human strategies, especially those based on random or round-robin approaches. Furthermore, the adaptive nature of the system allows it to reconfigure itself in response to the way the application is being used. This is ideal for applications such as internet-based distributed E-business applications, for which it is impossible to predict the behavior of the user base and therefore impossible to generate an optimal scheduling policy by hand. ATME has demonstrated that it is feasible to statically allocate tasks to processors and to build an efficient and adaptive system to deal with tasks supporting preemption.

The problems of node failure, and of performance within a multi-user environment where other users may place demands on the available processors, are ignored in ATME. Work is now underway to build a dynamic scheduling system which is itself distributed across the available processing elements and which supports fault tolerance. If a node fails, this failure is detected and remedial action is taken to ensure that a result is eventually returned. The remedial action ranges from replicating tasks where support for real-time systems is required, to simply rescheduling a task when its failure is detected. Roll-back recovery mechanisms may be required to ensure that correct results are achieved.
If the fault is transient, as in the case of a temporary network failure, it is possible that a second instance of a task is spawned to calculate the result. If the fault then rectifies itself, there will be two instances of the task calculating the result. It is therefore necessary to deal with multiple results (which may differ in accuracy) being returned to the calling instance. Conflict resolution based on timestamping may be used to ensure that the most recent and up-to-date result is used. Although such a proposal offers many benefits over the approach taken by ATME, it does have some drawbacks. It is likely to be more computationally intensive at runtime, as significant overhead will be incurred in detecting and dealing with failure. Load-balancing issues will also have a negative impact on performance, as the system must deal with other users accessing the resources required by the distributed application.
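The timestamp-based conflict resolution proposed above could be sketched as follows; this is a simplified illustration of the proposal, not an implemented component of ATME:

```python
def resolve(results):
    """Timestamp-based conflict resolution for duplicate task results:
    when a transient fault leaves two instances of a task running, the
    caller keeps the result bearing the most recent timestamp.
    `results` is a non-empty list of (timestamp, value) pairs."""
    return max(results, key=lambda r: r[0])[1]

# Two instances of the same task returned after a transient network fault.
print(resolve([(1002.5, "stale"), (1007.1, "fresh")]))   # -> "fresh"
```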
This further research aims to develop an automated scheduling system that provides the support required for modern distributed applications. The goal is to build an efficient and responsive system, with minimal runtime overheads, so that an acceptable distribution in a multi-user environment is achieved with support for fault tolerance. This should eliminate the need for the majority of programmers to focus on the tedious details associated with distribution in all but the most specialized and critical applications.

REFERENCES

1. M. Y. Wu and D. Gajski, Hypertool: A programming aid for message-passing systems, IEEE Trans. Parallel Distrib. Systems 1(3) (July 1990), 330–343.
2. J. Ullman, NP-complete scheduling problems, J. Comput. System Sci. 10 (1975), 384–393.
3. H. El-Rewini and T. Lewis, Scheduling parallel program tasks onto arbitrary target machines, J. Parallel Distrib. Comput. 9(2) (June 1990), 138–153.
4. C. Y. Lee, J. J. Hwang, Y. C. Chow, and F. D. Anger, Multiprocessor scheduling with interprocessor communication delays, Oper. Res. Lett. 7(3) (June 1988), 141–147.
5. T. Yang, ‘‘Scheduling and Code Generation for Parallel Architectures,’’ Ph.D. thesis, Rutgers, The State University of New Jersey, 1993.
6. T. L. Casavant and J. G. Kuhl, A taxonomy of scheduling in general-purpose distributed computing systems, IEEE Trans. Software Eng. 14(2) (February 1988), 141–154.
7. H. El-Rewini, H. H. Ali, and T. Lewis, Task scheduling in multiprocessing systems, Computer 28(12) (December 1995), 27–37.
8. H. El-Rewini and H. H. Ali, Static scheduling of conditional branches in parallel programs, J. Parallel Distrib. Comput. 24(1) (January 1995), 41–54.
9. M. J. Oudshoorn and L. Huang, Conditional task scheduling on loosely-coupled distributed processors, in ‘‘10th International Conference on Parallel and Distributed Computing Systems,’’ pp. 136–140, New Orleans, October 1997.
10. V. Sarkar, Determining average program execution times and their variance, in ‘‘Proceedings of the SIGPLAN ’89 Conference on Programming Language Design and Implementation,’’ ACM SIGPLAN Notices 24(7) (July 1989), 298–312.
11. W. W. Chu, L. J. Holloway, M.-T. Lan, and K. Efe, Task allocation in distributed data processing, Computer 13(11) (November 1980), 57–69.
12. F. Harary, ‘‘Graph Theory,’’ Addison-Wesley, New York, 1969.
13. H. S. Stone, Multiprocessor scheduling with the aid of network flow algorithms, IEEE Trans. Software Eng. SE-3(1) (January 1977), 85–93.
14. V. Sarkar, ‘‘Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors,’’ The MIT Press, Cambridge, MA, 1989.
15. L. Huang and M. J. Oudshoorn, ‘‘Preemptive Task Execution and Scheduling of Parallel Programs in Message Passing Systems,’’ Technical Report TR98-04, Department of Computer Science, University of Adelaide, 1998.
16. L. Huang and M. J. Oudshoorn, Static scheduling of conditional parallel tasks, Chinese J. Adv. Software Res. 6(2) (1999), 121–129.
17. L. Huang and M. J. Oudshoorn, Scheduling preemptive tasks in parallel and distributed systems, Aust. Comput. Sci. Commun. 21(1) (February 1999), 289–301.
18. H. Detmold and M. J. Oudshoorn, Communication constructs for high performance distributed computing, Aust. Comput. Sci. Commun. 18(1) (February 1996), 252–261.
19. H. Detmold and M. J. Oudshoorn, Responsibilities: Support for contract-based distributed computing, Aust. Comput. Sci. Commun. 18(1) (February 1996), 224–233.
20. H. Detmold, M. Hollfelder, and M. J. Oudshoorn, Ambassadors: Structured object mobility in worldwide distributed systems, in ‘‘Proceedings of the 19th IEEE International Conference on Distributed Computing Systems,’’ Austin, TX, pp. 442–449, 31 May–5 June 1999.
21. M. Hollfelder, H. Detmold, and M. J. Oudshoorn, A structured communication mechanism for mobile Java objects as ambassadors, Aust. Comput. Sci. Commun. 21(1) (February 1999), 265–276.
22. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, ‘‘PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing,’’ The MIT Press, Cambridge, MA, 1994.
23. L. Huang and M. J. Oudshoorn, ATME: A parallel programming environment for applications with conditional task attributes, in ‘‘3rd International Conference on Algorithms and Architectures for Parallel Processing,’’ pp. 275–282, Melbourne, December 1997.
24. E. G. Coffman Jr., ‘‘Computer and Job-Shop Scheduling Theory,’’ Wiley, New York, 1976.
25. T. Gonzalez and S. Sahni, Flowshop and jobshop schedules: Complexity and approximation, Oper. Res. 26(1) (1978), 36–52.
MICHAEL OUDSHOORN received his B.Sc. in 1983, his B.Sc.(Hons) in 1984 and his Ph.D. in 1992, all from the University of Adelaide. He commenced as a lecturer in the Department of Computer Science at the University of Adelaide in 1989 and became a senior lecturer in 1994. His research interests are distributed systems, software engineering and compiler techniques. He is a member of the ACM, IEEE, ISCA and the Australian Computer Society.

LIN HUANG completed her Ph.D. at the University of Adelaide in 1999 under the guidance of Dr. Oudshoorn. Her research interests are in scheduling techniques for distributed systems. Since completing her Ph.D., Dr. Huang has been working in industry, where she now develops commercial distributed software.