Mapping and scheduling algorithms for heterogeneous architectures

Mapping and scheduling algorithms for heterogeneous architectures

MAPPING AND SCHEDULING ALGORITHMS FOR HETEROGENEOD. .. 14th World Congress oflFAC Q-9d-Ol-4 Copy right CO 1999 IFAC 14th Triennial World Congress. ...

3MB Sizes 0 Downloads 52 Views

MAPPING AND SCHEDULING ALGORITHMS FOR HETEROGENEOD. ..

14th World Congress oflFAC

Q-9d-Ol-4

Copy right CO 1999 IFAC 14th Triennial World Congress. Beijing, P.R. China

MAPPING AND SCHEDULING ALGORITHMS FOR HETEROGENEOUS ARCHITECTURES

D. N. Ramos-Hernandez :/: M. O. Tokhi :j: and J. M. Bass

-r

t Department ofAutomatic Control and Systems Engineering, The University ofSheJJield, SheJJield, United Kingdom. email: [email protected]@SheJJield.ac. uk t The University of Wales, Bangor, United Kingdom. email:JBass~~ees.bangor.ac . uk

Abstract: This paper presents an investigation into the development of mapping and scheduling strategies for heterogeneous architectures. The strategies attempt to minimise the execution time of a control application (an avionic autopilot) implemented for a parallel heterogeneous architecture. The paper, firstly describes previously proposed mapping and scheduling strategies including a simple heuristic and a genetic algorithm (GA). To improve the performance of the application a new approach, namely the Priority-Based Genetic Algorithm (pBGA) is developed and reported in this paper. Its performance is verified in comparison to previously developed strategies. Copyright © 1999IFAC Keywords: Genetic algorithms, heuristic algorithms, heterogeneous architectures, mapping, scheduling. 1. INTRODUCTION

A single application often comprises widely differing computational requirements, which can be efficiently accomplished with different types of processing elements. Heterogeneous computing is thc effective use of the diverse hardware and software components in a suite of machines connected by a high-speed network to meet the varied computational requirements of a given application (Siegel, et aI., 1996). For heterogeneous computing, one of the objectives is to decompose an application into several tasks and match each task to the machine where it is best suited for execution. In this manner, mapping and scheduling are two important factors to be considered in a heterogeneous system. The scheduling problem depends heavily on the topology of the task gmph representing the precedence relations among the computational tasks, the topology of the multiprocessor system, the number of parallel processors, the uniformity of the task processing time, and the performance criteria chosen (Hou, et aI., 1994). To efficiently eX"ploit a multiprocessor system, it is desinible to minimise execution time, to maximise the system throughput and also minimise the number of processors to reduce the cost and increase the processor utilisation (Ahmad and Dhodhi, 1996).

The multiprocessor scheduling problem IS computation ally complex, so that heuristic algorithms have been proposed to obtain optimal and suboptimal solutions. Most of the existing techniques are based on list scheduling, where each task is assigned a priority either statically or dynamically. Whenever a processor becomes available a task with the highest priority is selected from the list and assigned to the processor (Ahmad and Dhodhi, 1996) Kashara and Narita (1984) proposed a heuristic algorithm (critical path/most immediate successors first) and an optimisation/approximation algorithm (depth first/implicit heuristic search). In 1988, Chen et a1. developcd a state-space search algorithm coupled with a heuristic to solve the multiprocessor scheduling problem. Recently, genetic algorithms (GAs) have been applied for solving parallel processing problems such as scheduling, data organisation and partitioning, communication and TOuting. Genetic algorithms are defined by Goldberg (1989) as search algorithms based on the mechanics of natural selection and natura] genetics. GAs have becn applied to optimisation problems. However, GAs differ from traditional optimisation methods in the following ways: a GA uses a coding of the parameter set rather than the parameters themselves, a GA searches from a population of search nodes instead of from a single one, a GA uses an objective function instead of auxiliary knowledge. Finally, a GA utilises probabj]jstic transition rules. Ahmad and Dhodhi,

8653

Copyright 1999 IFAC

ISBN: 0 08 043248 4

MAPPING AND SCHEDULING ALGORITHMS FOR HETEROGENEOU. ..

(1996) developed a technique based on the problemspace genetic algorithm (PSGA) for the static onto scheduling of direct acyclic graphs homogeneous multiprocessor systcms to reduce the response-time. Baxter et at. (1996) proposed several mapping algorithms that attempt to minimise the cycle-time of the application algorithms for heterogeneous architectures. The mapping algorithms are simple heuristic named Modified Menasce (MM) and a simple genetic algoritlun (SGA).

The intention of this paper is compare the different approaches proposed for Ba;;,:ter et al. (I996) \\ci.th a new approach named Priority-Based Genetic Algorithm (pBGA). This approach is similar to the SGA approach except for the objective function. The comparison is done for a heterogeneous architecture. The paper will continue with the description of the target architecture and an m·ionic autopilot application. Thc M'M and SGA approaches, along ..."itb the proposed PBGA approach are presented in Section 4. A comparative assessment of the different approaches is given in Section 5. Finally, conclusions and suggestions for further work a~~ presented in Section 6.

14th World Congress ofTFAC

HOST

T8

Fig. 1. Heterogeneous architecture.

3. CASE STUDY The algorithm considered as case study is the Versatile Auto-Pilot (VAP) control law. This law is the most complex of the two laws for approach and landing developed in the theoretical studies at Royal Aerospace Establishment, Bedford (Garcia Nocetti andFleming, 1992; Goddard, 1979). Thc control law has previously been employed on the Civil Avionics Section's BAC 1-11, using a single-processor (M68000) implementation. The V AP is a four inputs, two outputs control algorithm. Fig. 2 shows a block diagram of the V AP algorithm. The inputs are pitch rate ( q ), barometric height error (hE ), vertical acceleration ( d 2 hi dt 2

)

and airspeed error (u a

).

The

2. TARGET ARCHITECTURE

outputs are elevator rate demand (drJ/dt) and

The architecture utilised for the eXllerimental investigations is illustrated in Fig. 1. This comprises an Iumos TS05 (TS) trnnsputcr which is connected to a sub-network of three Texas instruments TMS320C40 (C40) DSP devices. Software support is provided by 3L Parallel C for both processors types and the hardware is hosted by a Sun Workstation, via a SCSI interface.

throttle demand (vd)' The notation in Fig. 2 follows the convention that the gain between the elevator position and an aircraft state error input is denoted by G with the error variable as a subscript. Similarly, A is used to denote the gain associated with throttle poSition (Garcia Nocetti and Fleming, 1992). This algorithm possesses significant cross-coupling tenns and due to its complexity can be of significant interest in mapping and scheduling analysis.

U

r;;/;ll

~G12(S) 1 +

a ......::.-----+i.~f----_

V~d.

.I--_ _ _ _ _ _

Au

+

Fig. 2. Block diagram of the VAP algorithm.

8654

Copyright 1999 IF AC

ISBN: 008 0432484

MAPPING AND SCHEDULING ALGORITHMS FOR HETEROGENEOU. ..

The mapping and scheduling problem depends heavily on the topology of the task graph representing the prccedence relations. It is therefore necessary to convert the algorithm's block diagram representation into a task graph form. To do this the Development Framework is used. This is an advanced Computer Aided Control System Design (CACSD) environment which incorporates a number of tools to provide a robust design tool for real-time control systems (Bass, et aI., 1994). When the algorithm is imported into the Development Framework this is represented as a data-flow diagram (DFD). Fig. 3 shows the DFD for the V AP algorithm which contains 40 processes. The algorithm in this form is architecture independent and each node in the DFD represents a software module within the algorithm. The scheduling problem also depends on the uniformity of the task processing time, 50 that using the available tools from the Development Framework, the computation of tasks cxecution and intcrprocessor communication times of the V AP are obtained. 4. MAPPING AND SCHEDULING APPROACHES The mapping and scheduling problems are simple in concept. However, several factors need to be considered. For example, a single task must be mapped to a single node with the aim of minimising its execution time, but the upper and lower bounds of the execution time are difficult to obtain. In heterogeneous architectures execution time of a task will vary depending upon whieh processor type it is mapped on. Thus, each task should be matched on

14th World Congress ofTFAC

the processor that achieves minimum execution time. Interprocessor communication overhead, which often dominates the computing time, is important. Furthermore, it is important to note that a DFD has precedence, so that, tasks cannot be executed until their predecessors have been completed. Finally, it is important to consider the interference cost that occurs when the combination of several tasks on a processor affect their expected execution times (Baxter, et aI., 1996). The remainder of this section focuses on first describing the MM: and SGA approaches pre'viously proposed (Baxter, et a., 1996) and then presenting the PBGA approach. The MM approach is considered for reasons of comparison of the GA algorithms with a heuristic technique.

4.1 A10dified AIenasce approach The 1vIM. algorithm is a simple heuristic that constructs a mapping in a single pass (Baxter, et aI., 1996). This algorithm is based on task execution times for each processor, communication costs and the precedence of tasks. The algorithm determines the mapping of the tasks and thc order in which they should be executed.

4.2 Simple GA

The SGA is a basic three operator approach (Baxter, et a1., 1996). This algorithm uses a simple encoding strategy, where the chromosome length is equal to the number of tasks and the value of each element of the chromosome corresponds to the processor to which that task is mapped to.

Fig. 3. Data flow diagram of the converted YAP.

8655

Copyright 1999 IF AC

ISBN: 008 0432484

14th World Congress oflFAC

MAPPING AND SCHEDULING ALGORITHMS FOR HETEROGENEOU ...

The number of individuals in the population is currently set to twice the number of tasks in the application. Its objective function is based on the precedence tasks graph of the application. The fitness values are derived from the objective values by linear ranking, with a selection pressure of 2, prior to selection. The selection strategy is stochastic universal sampling with a generation gap of 1. The crossover is reduced to surrogate shuffle crossover with a survival rate of 0.3. Finally, the mutation survival rate is sel to 0 .02.

4.3 Priori(y-Based Genetic Algorithm

This PBGA is also a basic three operator approach. Each individual (chromosome) of the population, as in the SGA approach, has a Jength equal to the number of tasks and the value of each element represents thc processor to which it is mapped, as shown in Fig. 4.

7

the precedence of the tasks in the DFD and the data communication are also passed as parameters to the objective function. In the objective function, the tasks with highest priority (one) are scheduled first on the processor where they were mapped. This process is repeated until all tasks are scheduled. In the PBGA approach the cost of inter-processor communication is also considered. The fitness values are derived from the objective values by linear ranking. A stochastic universal sampling selection strategy is used with a gencration gap of I. A simple onc-point crossover operator is used with a crossover rate of PXover = 0.6. Finally, a mutation operator is used 'with a mutation rate of Pm = 0.71 (number of tasks) to indicate the percentage of the total number of genes in the population which are mutated in each generation.

"J

.,oe... o, numb., 10 which

1

H ,n cE

L

lhe t"~k '" m.p~d

c-

o

I

2

0

3

1

3

0

1

0

2

3

2

t

3

1 2o 0

,0

•• • • •

,

I

0

,

2

,

numlw!r of tl'lsks

0

t

-1

Fig. 4. Example of a small population. The fitness value of each chromosome is calculated according to the fitness function (objective function). The objective function of the PBGA is based on a priority assigned to each task. In a real-time system this will ensure minimum execution time with the allocation of processes. The priorities are obtained according to the sequence of the tasks in the dataflow diagram. A priority equal to onc (highest priority) is assigned to all these tasks with zero precedence tasks (input tasks). Then, the priority is increased by one for each successor. This procedure is repeated until either the successor is an output or its priority has been assigned. If the task' s priority has been assigned already, this means that a loop has been found and the task needs lO keep its priority. After this procedure a temporal priority list is created. Thus, all input task can have several temporal priority lists, from among which the one '\-vith the highest priority for each task, is chosen as the priority list. Similarly, the application can have several priority lists, from which the final priority list, is constructed and passed as parameter to the objective function. Fig. 5 shows a simple example illustrating the process of obtaining the final priority list. The execution times for each task in different processors, thc inter-processor conununication time,

d

I • 1-;::" /1st

Fig. 5. Procedure of obtaining the final priority list.

5. RESULTS

The execution times for the V AP algorithm using the .M:M and SGA approaches were previously obtained for the case of four processors (Baxtcr, et aI., 1996). In this paper, the execution times for two and three processors are also obtained to compare with the execution times of tlle PBGA approach. Thus, the V AP was mapped using the MM, SGA and PBGA accordingly. These approaches were executed on a Sun workstation and for the GA algorithms the GA Toolbox ofMATLAB 4.2b was used. For the case of the GA approaches, after several experiments, the generation gap (GGPA) and the mutation rate (Pm) were set to 1 and 0.02, respectively. The population size used was 68 (34 tasks x 2) and the nnmber of generations considered was 20. The results obtained with the PBGA are shown in Table 1. The execution times for the MM and SGA are also presented in this table.

8656

Copyright 1999 IF AC

ISBN: 0 08 043248 4

MAPPING AND SCHEDULING ALGORITHMS FOR HETEROGENEOU. ..

Table 1 V AP execution times (sec)

MM

NumberQf processors 2 3 4

14th World Congress ofTFAC

Table 2 Execution time speedup with the PBGA over MMandSGA PBGA

SGA

3.208 4.157

2.512 2.797

1.254 1.716

4.614

2.882

1.808

Table 1 confirms that the GA based approaches outperform the :MM approach. In case of the GA approaches, it can be scen that an excellent reduction in execution time is obtained with the PBGA approach than with the SGA. This indicates that improving the objective function a reduction in the execution time is obtained. Table 2 shows the performance in terms of execution time speedup achieved with the PBGA over the Iv1M and SGA approaches. It is noted, that the PBGA outperforms the MM approach by 2.559 and the SGA approach by 2.004 for the cases of two processors.

Number of processors

MM

2 3 4

2.559 2.423 2.552

SGA 2.004 1.631 1.594

Fig. 6 shows the distribution of the tasks onto two, three and four processors using the PBGA and considering inter-processor communication (pctt) for each case. The distribution of the tasks in time is shown along the horizontal axis, as the task starts and finishes. The number of tasks is indicated along the vertical axis. It can be observed that for the case of two and three processors the inter-proccssor communication lime is acceptable, as it takes less than half of the total execution time of the application. However, in the case of four processors this time increases, due to heavy communication between processors when the number of processors increases.

~~~~~~~~~~=-~~--~ 11..... (rnwc,

i:""[P~"

~

I

=

"---J

='

i"['-

I

]"

Cl-.l"O

,

~,.

"

~~~~~=-~"=oo~~~~~~~

~~~~~=-~=~~=-~~--~ Ti .... jm_r:)

1....."("......'1

(a) Two processor 1mplementation.

l="

j,.

~t-:

4

~10~ 00

zoo

-400

600

i:r:

600 lQC)(l Tim .. lm:l ac:)

a

:

200

1'10

4tlU

1000

~-

GM

"c: ;11 .

11JU1J

l;NO

1«10

llHIU

12tH)

1400

1600

l'_lf11SI;W;:)

SOil 1000 nrn..CITeec:.)

(;2(10

,;:2()

(In

BOO

1000

120C1

1400

?M

T"~(rn:;ec)

800

11.JO(J

12Q{)

1400

Tmrt(m ___) __" . . . , , -

(c) Four processor 1mplementation. [Proc Ill: T8, Pme #2: C40, Proc It",: C40 and Pmc #4: C40. pctt: inter-processor communication.

J Fig. 6. Scheduling and inter-processor communication of the V AP for two, three and four processors.

8657

Copyright 1999 IF AC

ISBN: 008 0432484

14th World Congress oflFAC

MAPPING AND SCHEDULING ALGORITHMS FOR HETEROGENEOU ...

Fig. 7 shows the objective values reached in each generation after executing the PBGA approach for the case of two, three and four processors. It is noted that an acceptable objective value is obtained \'r'ith a small number of generations. 1 . $5~Hi'

\

\

1.15~

\'\

1~t).L-C;----7-"""""~,-c,;;-,~"""""7,.c--~~",,,,;

can be obtained with only running this approach for few generations. Meanwhile for the SGA, it is necessary to run the algorithm for several hours to reach an optimal value. It is important to note that the inter-processor communication time has a heavy influence on the total execution time of the application and to obtain an efficient mapping and scheduling this time must be reduced, It is envisaged to concentrate more on new strategies for the mapping instead of using an encoding method. Moreover, the PBGA has been considered for parallel heterogeneous architectures, however it should also be interesting to apply this approach to a distributed heterogeneous system.

g.........I -

(a) Two prOCeSl'OTS. 2.1",,1<,/'

ACKNOWLEDGEMENTS The authors acknowledge the support of CONACYT-MEXICO and the EPSRC, Grant No.: GRIK 64310.

"~.~-:--;---:-~,,--::,,;---c,~,~==;~ (JI" ...... ~U"n

Ch) Three processors.

(c) Four processors.

Fig. 7. Objective values reached for the V AP with two, three and four processors.

6. CONCLUSIONS AND FURTHER WORK An investigation into different mapping and scheduling approaches has been presented. The M1v1 approach is a simple strategy to map and schedule tasks. However, its perfonnance is poor compared with the GA approaches. Thus, the MM is taking the maximum execution time expected for the application. The two GA approaches have the same complexity and they only differ in the objective function. Nevertheless, the PBGA outperforms the SGA approach. This means, that assigning a priority to each task of the application is a better strategy tJlan only considering the precedence of the tasks. In addition, the PBGA approach can find the mapping and scheduling solution of an application that includes some feedback loops in its control representations in contrast with the other approaches, Also, with the PBGA acceptable results

REFERENCES Ahmad, 1. and M. K. Dhodhi (1996). Multiprocessor scheduling in a genetic paradigm. Parallel Computing, 22, pp. 395-406. Bass, J.M., A.R Browne, M.S. Hajji, D.G. Marriott, P.R Crol1 and P,l. Fleming (1994) Automating the development of distributed control software, IEEE Parallel and Distributed Technology, 2, pp. 9-19. Ba:x1:er, M.l., M.O. Tokhi and PT Fleming (1996). An investigation of the heterogeneous mapping problem using genetic algoritllms. UKACC International Conference on Control '96, pp. 448-453. Chen, c.L., C.S.G. Lee and E.S.H. Hou (1988). Efficient scheduling algorithms for robot inverse dynamics computation on a multiprocessor system. IEEE Trans. Syst. Man. Cybernetics, 18, pp. 729-743. Goddard, K.F. (1979). Theoretical Studies of Automatic Control Laws for a BAC 1-11 Aircraft Utilising the Wing Spoilers for Direct Lift Control. Technical Report 79034, Royal Aircrqfi Establishment. Goldberg, DE (1989). Genetic Algorithms in Search, Optimization, and 1'v1achine Learning. Addison-Wesley, Reading, MA. Hou, E.S.H., N. Ansari and H. Ren (1994). A Genetic Algorithm for Multiprocessor Scheduling. IEEE Trnns. on Parallel and Distributed Systems, 5, pp. 113-120. Kasahara, H. and S. Narita (1984). Practical multiprocessing scheduling algorithms for efficient parallel processing. .lEl!.E Trans. on Compu t., 33, pp. 1023-1029. Siegcl, H.J., J.K. Antonio, RC. Metzger, M. Tan and y.A. Li (1996). Heterogeneous computing. In: Parallel and Distributed Compuling Handbook, (Zomaya (Ed», pp. 725-761 , McGraw-Hill, New York.

8658

Copyright 1999 IF AC

ISBN: 0 08 043248 4