ARCHITECTURAL SUPPORT FOR PREDICTABILITY IN HARD REAL TIME SYSTEMS

Matjaz Colnaric
University of Maribor, Faculty of Technical Sciences, Smetanova 17, 62000 Maribor, Slovenia, [email protected]

Wolfgang A. Halang
University of Groningen, Department of Computing Science, P.O. Box 800, 9700 AV Groningen, The Netherlands, halang@cs.rug.nl

Abstract: There is evidence that, among all design domains of hard real time systems, architectural issues gained the lowest research interest. Universal architectures, which are generally applied as hardware bases for hard real time applications, are seldom behaving in a fully predictable way. In the paper, several commonly used techniques which prevent temporal determinism of instruction execution are enumerated. An asymmetrical multiprocessor architecture for hard real time applications is presented, whose temporal behaviour is fully predictable. Some adequate features are discussed which are incorporated into the processor's implementation.

Keywords: Real time computer systems; computer architecture; parallel processing; temporal behaviour analysis; predictability.

INTRODUCTION

Predictability of the temporal behaviour of task executions is a precondition for schedulability analysis, which is the crucial point of interest in the design of hard real time systems. In order to be able to predict a program's temporal behaviour, every feature of the program and of the underlying computing system must behave predictably (layer-by-layer predictability, according to Stankovic and Ramamritham (1990)). Some of the design domains (e.g., scheduling and schedulability, programming issues, etc.) gained more research interest than others; among the latter are certainly all issues concerned with hard real time system architectures. Common microprocessor architectures are mainly designed to meet the classical objective of maximising (average) resource utilisation. The application of the same guiding principle in hard real time systems design is based on one of the common misconceptions pointed to by Stankovic (1988), namely that real time systems were equal to fast systems. However, optimisation of resource utilisation and average performance usually contradicts the real time, viz., worst-case, requirements. On the other hand, through the development of hardware technology, hardware costs and, hence, optimal utilisation are losing importance. By the year 2000, an integration density of up to 10^9 transistors per die is expected (Gelsinger, Gargini, Parker and Yu, 1989; Zorpette, 1990), opening up totally new perspectives in the design of architectures for hard real time systems (Furht and Halang, 1991).

UNDESIRABLE PROPERTIES OF CONVENTIONAL ARCHITECTURES

Undesirable properties of conventional architectures are those which make a system's temporal behaviour (1) unpredictable and/or (2) difficult to estimate. A common counterproductive characteristic of their design is complexity, which is usually unmanageable by state-of-the-art verification techniques and tools. In this section, first processor and, then, system architectures are screened for such properties.

Processor architectures

The increase in microprocessor performance was mainly influenced by two factors, viz., technological advances and new ideas employed in architectural design. These new ideas are the RISC philosophy, parallel processing (pipelining), and caching. While RISC ideas appear to be very useful also for architectures employed in hard real-time systems, because of the simplicity and verifiability achieved, the other two features are to be used with extreme precaution only - if at all.

Independent parallel operation of internal components. In order to fully exploit inherent parallelism of operations, the internal units of processors are becoming more and more autonomous, resulting in highly asynchronous operation. By this development the average performance of processors is increased. However, it is a complex task to analyse corresponding sequences of machine code instructions in order to determine their execution times. For the time being, verification of such complex processors is practically impossible.

The most common way of improving the throughput of a processor by exploiting instruction level parallelism is pipelining. Various techniques are used to cope with the well-known pipeline breaking problem due to the dependence of instruction flow control on the results generated by previous instructions. One of the most common techniques is the delayed branch, i.e., the instruction immediately following a conditional branch instruction is always executed. It is a responsibility of the compiler to place an instruction there which needs to be executed in any case, or, if impossible, to insert a No Operation instruction. Thus, if branches appear frequently in a program, the delayed branch pipelining principle provides predictable operation, but little gain in performance. To predict execution times, complex machine code analysers would be necessary, based on detailed information about instruction execution, which is often proprietary. Owing to its complexity, a processor with pipelined operation is also very difficult to verify. Thus, it is recommendable to either prevent pipeline breaking or to renounce pipelining altogether.

Caching. Fetching data from off-processor sources represents a traditional bottleneck, especially in the von Neumann processor concept. To avoid it, cache memories were introduced. Since instruction fetching from memory takes a considerable amount of execution time, the latter thus highly depends on whether an instruction was found in the cache or not. To predict that, complex analysers emulating the caching operation would be necessary. Even the most sophisticated ones would fail in the case of multiprogramming, where caches are filled with new contents on every context switch. Thus, to predict worst case execution times, the time required to fetch instructions from memory via the system bus is to be taken into account.

Registers. To reduce data memory accesses, in contemporary microprocessors a trend to increase the number of registers can be observed. However, they can only be utilised in assembly language programming, because the high level language programmer is not aware of registers. Very sophisticated compilers are needed to optimise the allocation of registers to variables. Also, to achieve compatibility with previous processor versions, compilers may often only use parts of given register sets. Even when register use is optimised, a lot of unproductive moving of data between registers and storage must be performed. Although behavioural predictability is not necessarily endangered, another kind of on-chip temporary data storage shall be used to avoid the above mentioned problems.
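The contrast drawn above between cache-dependent timing and fixed bus-fetch timing can be illustrated with a small sketch. All cycle counts below are hypothetical and not taken from any real processor; the point is only that, when every instruction fetch goes to main memory with a fixed cost, the worst-case execution time of straight-line code is an exact sum rather than a cache-dependent estimate.

```python
# Worst-case execution time (WCET) of straight-line code when every
# instruction has a fixed, cache-independent cost. All cycle counts are
# hypothetical, for illustration only.
FETCH_CYCLES = 4          # assumed bus access time per instruction fetch
EXECUTE_CYCLES = {        # assumed worst-case execute cycles per opcode
    "load": 3, "store": 3, "add": 1, "mul": 6, "branch": 2,
}

def wcet(instructions):
    """Sum fixed fetch + execute costs; the result is exact, not a bound
    that depends on cache contents or pipeline state."""
    return sum(FETCH_CYCLES + EXECUTE_CYCLES[op] for op in instructions)

program = ["load", "load", "mul", "add", "store"]
print(wcet(program))  # 5 fetches (20 cycles) + 3+3+6+1+3 = 36 cycles
```

With a cache, the fetch term would instead vary between a hit time and a miss time depending on execution history, which is exactly what makes the analysers criticised above necessary.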

System architectures

Similar to processor architecture considerations, in classical system architecture design some features were implemented to improve average performance. Again, these features may lead to undesirable consequences which make process execution time prediction difficult or even impossible. Some of them are listed in the sequel.

Direct memory access. Since general processors are ineffective in transferring blocks of data, direct memory access (DMA) techniques were designed. With respect to the delays caused by DMA transfers, there are two general modes of DMA operation, viz., the cycle stealing and the burst mode. A DMA controller operating in the cycle stealing mode is literally stealing bus cycles from the processor, while in the other mode the processor is stopped until the DMA transfer is completed. Although the processor has no control over its system during such a delay, this mode appears to be more appropriate when predictability is the main goal. If block length and data transfer initiation instant are known at compile time, the delay can be calculated and considered in the program execution time estimation. The same prediction for the cycle stealing mode would be much more complicated. Furthermore, block transfer is also faster, because bus arbitration needs only to be done once. However, precautions have to be taken to enable a processor, whose operation was suspended for DMA operation, to react in the case of catastrophic events.

Virtual addressing. To cope with the limited amount of memory available in computer systems, virtual addressing techniques were developed. Using them, the data access times very much depend on whether requested items are already in memory or pages have to be loaded first. To determine this beforehand, a complex program analyser would be necessary. Even then, restrictions would have to be imposed; for example, register indirect addressing modes should only access data in the same page, which is difficult to assure when using a high level programming language. From the above it can be concluded that virtual addressing should be renounced in hard real time applications.

Data transfer protocols. Data transfer via microcomputer busses also deserves some consideration. Synchronous data transfer protocols by definition ensure predictable data transfer times. Asynchronous ones are more flexible; however, their behaviour is very difficult to control, especially in shared-bus systems. The only possible way to guarantee realistic data transfer times seems to be synchronous operation of all potential bus masters. The same conclusions hold for local area networks as well.
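The burst-mode delay calculation mentioned above can be sketched in a few lines. All timing parameters are hypothetical; the sketch only shows why a burst transfer with known block length yields a closed-form stall time, while cycle stealing forces a much more pessimistic bound.

```python
# Burst-mode DMA: the processor is stopped for the whole transfer, so the
# delay is one bus arbitration plus one bus cycle per word. All timing
# parameters below are hypothetical.
BUS_CYCLE_NS = 100      # assumed time to transfer one word over the bus
ARBITRATION_NS = 200    # assumed one-off cost of acquiring the bus

def burst_delay_ns(block_words):
    """Worst-case processor stall caused by one burst-mode DMA transfer."""
    return ARBITRATION_NS + block_words * BUS_CYCLE_NS

def cycle_stealing_bound_ns(block_words):
    """Cycle stealing may re-arbitrate for every word, so a safe bound
    charges arbitration per word - one reason burst mode is easier to
    account for in execution time estimation."""
    return block_words * (ARBITRATION_NS + BUS_CYCLE_NS)

print(burst_delay_ns(64))           # 200 + 64*100 = 6600 ns
print(cycle_stealing_bound_ns(64))  # 64 * 300 = 19200 ns
```

Given the block length and the initiation instant at compile time, the first figure can simply be added to the static execution time estimate of the affected task.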

A NOVEL SYSTEM CONCEPT

The ultimate requirement of hard real time systems is that all tasks must be completed within predefined time frames. In multi-tasking systems, dynamic scheduling algorithms must be implemented to generate appropriate schedules. The ones which fulfill the above requirement are referred to as feasible. In the literature, several such algorithms have been reported. For our purpose, the earliest-deadline-first scheduling algorithm was chosen as described by Halang and Stoyenko (1991). It has been shown to be feasible for scheduling tasks on single processor systems; with the throwforward extension it is also feasible on homogeneous multiprocessor systems. However, this extension leads to more pre-emptions and is more complex and, thus, less practical.

For process control applications, where process interfaces are usually physically wired to sensors and actuators establishing the contact to the environment, it is natural to implement either single processor systems or dedicated multiprocessors acting and being programmed as separate units. Thus, earliest-deadline-first scheduling can be employed, resulting in a number of advantages discussed by Halang and Stoyenko (1991). The main idea can be illustrated by an everyday life example: the job of a manager's secretary is to relieve him from routine work, so that he has more time left for more important tasks. The secretary also administers the manager's schedules and prevents him from being interrupted for unimportant reasons. In the classical computer architecture the operating system is running on the same processor(s) as the application software. In response to any occurring event, the context is switched, system services are performed, and scheduling is carried out. Although it is very likely that the same process will be resumed, a lot of performance is wasted by superfluous overhead. This suggests employing a "secretary", i.e., a parallel processor, to carry out the operating system services. Such an asymmetrical architecture turns out to be advantageous, since, by dedicating a multi-layer processor to the real time operating system kernel, the user task processors are relieved from any administrative overhead. This concept was elaborated in detail by Halang (1987, 1988a, 1988b). The computer system to be employed in hard real time applications is hierarchically organised. As shown in Fig. 1, it consists of the task processor and the kernel processor, which are fully separated from each other.

The external process and the peripherals are controlled by tasks running in the task processor. The OS kernel co-processor is responsible for all operating system services; it hierarchically consists of the following three layers:

Hardware layer: On this layer basic HW functions are implemented. A high resolution clock is maintained and real time management is performed by generating various time related events. Also, precise timing of operations is possible by sending a continuation signal to (a number of) halted processes at predefined instants. When events from the environment are signalled, they are latched together with the times of their appearance. Synchronisers and shared variables are represented and supervised. Events are generated as they change states or acquire certain values, respectively. Finally, there is an optional separate programmable interrupt generator for software simulation purposes.
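The earliest-deadline-first policy chosen above can be sketched as a small simulation. The task set is hypothetical, and the sketch covers only the preemptive single-processor case, without the throwforward extension; it merely shows that at every unit time step the ready task with the earliest deadline runs, and that a deadline miss renders the set infeasible.

```python
import heapq

# Preemptive earliest-deadline-first on one processor, simulated in unit
# time steps. Each task is (release, deadline, execution_time); the task
# set used below is hypothetical.
def edf_feasible(tasks):
    events = sorted((r, d, c) for r, d, c in tasks)
    ready, t, i = [], 0, 0
    while i < len(events) or ready:
        if not ready and events[i][0] > t:
            t = events[i][0]                 # idle until the next release
        while i < len(events) and events[i][0] <= t:
            r, d, c = events[i]
            heapq.heappush(ready, (d, c))    # ready queue ordered by deadline
            i += 1
        d, c = heapq.heappop(ready)          # run earliest-deadline task
        if t + 1 > d:
            return False                     # deadline would be missed
        t += 1
        if c > 1:
            heapq.heappush(ready, (d, c - 1))  # preemptible remainder
    return True

print(edf_feasible([(0, 4, 2), (1, 3, 1), (2, 8, 3)]))  # True
print(edf_feasible([(0, 2, 3)]))                        # False
```

Note that the simulation itself presupposes known execution times - which is precisely the predictability argument the architectural measures in this paper are meant to secure.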

(Figure 1 appears here: a conceptual diagram showing the external process and the peripherals attached to the task processor, and the kernel processor layered into the secondary reaction level, the primary reaction level, and the HW level.)

Primary reaction layer: The purpose of the second layer is low level servicing of events. The possible sources (either internally generated time events, changes of synchroniser or shared variable states, or external events) represented by the hardware layer in the form of storage elements are cyclically polled. When an event is recognised, an appropriate secondary reaction is initiated. Also, time schedules and critical instants are managed.

Secondary reaction layer: This layer represents an interface to the task processor. Here, deadline-driven scheduling with overload handling and all tasking operations are performed. Further, functions for synchroniser and shared variable management are available. As a result of the above, processor activities are initiated or otherwise controlled.
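The cyclic polling performed by the primary reaction layer can be sketched as follows. The event sources and reaction table are hypothetical; the point of the sketch is that every latched source is inspected exactly once per round in a fixed order, so one polling round has a bounded, load-independent duration.

```python
# Primary reaction layer as a fixed polling cycle over latched event
# sources. Source names and reactions are hypothetical illustrations.
class Latch:
    """Storage element set by the hardware layer together with the
    timestamp of the event's appearance."""
    def __init__(self):
        self.pending, self.timestamp = False, None
    def signal(self, t):
        self.pending, self.timestamp = True, t

def polling_round(sources, reactions, log):
    """One cycle: inspect every source once, in fixed order, and initiate
    the appropriate secondary reaction for each recognised event."""
    for name, latch in sources.items():
        if latch.pending:
            latch.pending = False
            log.append(reactions[name](latch.timestamp))

sources = {"timer": Latch(), "sync_var": Latch(), "external": Latch()}
reactions = {name: (lambda t, n=name: (n, t)) for name in sources}
sources["external"].signal(42)   # hardware layer latches an event at t=42
log = []
polling_round(sources, reactions, log)
print(log)  # [('external', 42)]
```

Because the loop body is executed once per source regardless of how many events occurred, the worst-case reaction latency is simply the length of one full round.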

The structure of the whole system is shown in Fig. 1: the external process and the peripherals are controlled by the task processor via I/O data communication. Events from the environment, however, are fed into the hardware layer of the kernel processor. This way, the task processor is only interrupted if an event is relevant and a context switch is really necessary. There is a direct connection between the task and the kernel processor by which low level signals are interchanged, e.g., the continuation signal. Typically, the application interface between the two processors is available on the secondary reaction layer in the form of high level language constructs.

OUTLINE OF THE SYSTEM ARCHITECTURE DESIGN


Following design guidelines derived from the above concept resulted in the architecture presented in Fig. 2. In this section, an outline of a number of particular features is given. For the details of the implementation, the reader is referred to Colnaric (1992).


Fig. 1. Conceptual diagram of the architecture

Data transfer. To avoid the obvious problems caused by common bus architectures, serial point-to-point inter-unit connections were chosen. Parallel point-to-point connections are not feasible because of excessive wiring and inconvenient bus dimensions. To enable predictability, and also simple determination of data access times, synchronous data transfer protocols were implemented, which also do not require too much overhead. State-of-the-art technology (such as fibre optics) already allows very fast serial data transfer, whose capacity is comparable to parallel bus data transfer when considering the necessary overhead.
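With a synchronous protocol and a fixed clock, the access time over such a serial link is a closed-form function of the message length, which can be sketched as follows; the link parameters are hypothetical.

```python
# Transfer time over a synchronous serial point-to-point link: with a
# fixed bit clock and fixed framing there is nothing to estimate - the
# time follows directly from the message length. Parameters below are
# hypothetical illustrations.
BIT_TIME_NS = 10          # assumed 100 Mbit/s serial link
FRAME_OVERHEAD_BITS = 16  # assumed framing cost (start/stop bits, CRC)

def transfer_time_ns(payload_bytes):
    """Exact, load-independent transfer time of one message."""
    return (payload_bytes * 8 + FRAME_OVERHEAD_BITS) * BIT_TIME_NS

print(transfer_time_ns(4))   # (32 + 16) * 10 = 480 ns
print(transfer_time_ns(64))  # (512 + 16) * 10 = 5280 ns
```

An asynchronous shared bus, by contrast, would add an arbitration delay that depends on the behaviour of all other bus masters, which is exactly what the text above rules out.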

Kernel processor

The functions of the kernel processor have already been outlined above. Via its link to the task processor, tasking operations such as task pre-emption, initiation etc. are requested to be performed by the secondary reaction layer. In the communication protocol, the task processor has the possibility to disable data transfer while the executing task is inside a non-pre-emptable critical region. In addition to this link, there is a direct line for the continuation signal, enabling precise timing of process execution by requesting continuation of temporarily delayed processes. The kernel processor is also connected to a serial bus transferring data between the data processing unit and the external data access unit. By this link, the kernel processor monitors synchronisers and shared data accesses and can, as a consequence, generate events in specified situations.

Task processor

In the task processor the application processes are executed without any influence from the operating system functions. Functionally it is divided into the task control and the data processing units, residing on the same board - or even on the same chip. For that reason they are connected by a uni-directional parallel instruction path and some signals. In designing the task processor, several guidelines were followed which cope with the problems stated above. Program and data memories are separated for safety reasons: programs are kept in ROM to make self-modification of code impossible. Instructions are executed at the most appropriate places: flow control instructions are processed close to program memory, where also the program counter is located. To the data processing unit only instructions dealing with data are sent, in a straight-forward instruction stream. Data are exclusively handled in the data processing unit, from where external sources are accessed.

Task control unit. The function of this unit is to execute task programs. From the program memory ROM, instructions are fetched and pre-decoded. Flow control instructions are executed which have an impact on the program counter only. To allow for conditional branches, a status flag (B) is accessible from the data processing unit. For the administration of subroutine return addresses a stack is provided. Data processing instructions are forwarded via the parallel link to the data processing unit. While they are being executed there, the synchronisation line indicates the busy state, preventing the task control unit from sending further data processing instructions. In parallel, flow control instructions can be executed. This distribution of instruction execution minimises unproductive data transport. The program counter is exclusively administered in the task control unit, also eliminating the need for program address transmissions. A certain level of inherent parallelism is utilised, yielding similar effects as pipelining.

Instruction set. For safety critical real time applications, a reduced instruction set appears to be the only reasonable choice with respect to enabling system verification. The instruction set is divided into flow control and data processing instructions. Flow control instructions are branches, subroutine calls and returns, and the wait instruction. By the latter, instruction sequencing is delayed for a specified period of time. If the continuation signal is received before the period expires, a branch is performed; this way, time bound synchronisation mechanisms can be implemented easily. The data processing instructions perform basic arithmetic and Boolean operations. There are basically two data addressing modes, absolute and indirect, both referring to the local data storage.

Data processing unit. Instead of the above criticised register files, a local storage called variable file is implemented, which appropriately supports the variable concept of high level programming languages. Advances in technology allow for a large number of transistors per die; part of them should be used to implement the local storage. The variable file keeps local variables, constants, and references to global variables as occurring in HLL programs. Its contents are loaded blockwise from main storage when a software module is scheduled to run. This way, the data cache is substituted by a predictably behaving technique (because of the constant block length and the known transfer initiation instants). To enable dyadic operations, the special design of the local storage allows fully synchronous reading of two different locations; writing is always done to a single location. Some kind of automatic access to indirectly addressed main memory locations and peripheral device registers using the local storage is implemented, to unify local and external data accesses. To perform arithmetic and Boolean operations, an ALU is implemented. The main requirement for its design is temporal determinism of operation. The basic arithmetic operations on integer and 32-bit floating point numbers need to be provided. For the floating point implementation, the IEEE P754 and P854 standards were followed, including their exception handling.

External data access unit

When a location in the variable file is referenced by indirect addressing, its contents are used to address an external unit location. There are two types of external data sources, viz., the global data storage and peripheral device interfaces.
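The time-bound synchronisation achievable with the wait instruction and the continuation signal, described above, can be modelled as a timeout-guarded wait. The sketch below uses a thread event purely as a stand-in for the kernel processor's continuation line; the period and signal source are hypothetical.

```python
import threading

# Model of the processor's "wait" instruction: instruction sequencing is
# suspended for a bounded period; if the kernel processor's continuation
# signal arrives first, execution branches (modelled as returning True)
# immediately. The period and the signal source are hypothetical.
def wait_instruction(continuation: threading.Event, period_s: float) -> bool:
    """Return True if continued early, False if the full period expired."""
    return continuation.wait(timeout=period_s)

cont = threading.Event()
threading.Timer(0.01, cont.set).start()  # kernel sends continuation at t=10 ms
print(wait_instruction(cont, 0.5))       # True: continued before expiry
```

Either way the instruction terminates within the specified period, so the construct has a known worst-case duration - the property that makes it usable in a verifiably timed instruction set.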


Global data storage. Global data storage is required for shared and large data structures like arrays, sets of records, etc. Owing to advances in technology it is easy to implement considerable capacities; thus, techniques to cope with insufficient storage amounts, such as virtual addressing, are not necessary any longer. Dynamic RAM chips are still prevailing in large main memory implementations. However, precautions have to be taken regarding their refresh procedure; refreshing must not influence the data access time. To support the serial transmission of constant-length data blocks, a special dynamic RAM design was introduced in (Halang, 1986).

Peripheral devices and their controllers. Considering the properties of peripheral devices and following the fundamental requirement of temporal predictability, some general implementation guidelines can be derived. First of all, because of the inherently serial communication of a number of peripherals, serial data transfer is also feasible between processors and peripherals. To this group belong, for instance, telecommunication interfaces, video display interfaces, certain data acquisition interfaces, etc. Further, peripheral devices can be classified into two groups according to the amount of data to be transferred. Peripherals of the first group transmit data in words, the others in blocks. To access the peripheral devices of the first group, their registers are mapped into the global memory address space. Because it is unsuitable for the processor to transmit larger blocks of data, for the second group the direct memory access method appropriate for hard real time environments is implemented. To support point-to-point communications and to minimise data traffic, the DMA controller is integrated into the external data access module. Again, the proposed dynamic RAM implementation is very suitable for block-oriented data transfer. It appears appropriate to physically connect the peripheral device controllers to a serial bus. This way the flexibility of the system's configuration is considerably enhanced. The operation, however, should remain synchronous and predictable. Peripheral devices utilised in hard real time environments should have deterministic data access and transfer times as well as transfer initiation instants. Precaution should be taken when selecting mass storage devices; fixed-head and semiconductor disks seem the most appropriate.

CONCLUSION

In this paper, an architecture for hard real time applications was presented. Its ultimate goal - predictable temporal behaviour - seems achievable following the guidelines proposed. In setting up these guidelines, forecasts of the technological development in the near future were taken into consideration. Instead of classical pipelining, inherent parallelism of instruction execution is supported. Instruction caching is replaced by a separate, directly addressable program memory located close to the instruction fetch unit. For similar reasons, data caching is superfluous, too. To support high level language programming, a local memory file takes over the function of conventional registers. Direct memory access is only permitted in a form working predictably. Finally, data transfer is implemented by synchronous serial point-to-point links. The predictability of the described architecture's behaviour provides the basis necessary for further research on other real time features.

REFERENCES

Matjaz Colnaric (1992). Predictability of temporal behaviour of hard real-time systems. Ph.D. Thesis, University of Maribor.

Borko Furht and Wolfgang A. Halang (1991). The future of real time and embedded computing systems. Unpublished paper.

Patrick P. Gelsinger, Paolo A. Gargini, Gerhard H. Parker, and Albert Y.C. Yu (1989). Microprocessors circa 2000. IEEE Spectrum, 26(10), 43-47.

Wolfgang A. Halang (1986). On methods for direct memory access without cycle stealing. Microprocessing and Microprogramming, 17, 277-283.

Wolfgang A. Halang (1987). Architectural support for high-level real-time languages. In Proceedings IEE International Conference on Software Engineering for Real Time Systems, Cirencester, pp. 77-86.

Wolfgang A. Halang (1988a). Definition of an auxiliary processor dedicated to real-time operating system kernels. Technical report UILU-ENG-88-2228 CSG-87, University of Illinois at Urbana-Champaign.

Wolfgang A. Halang (1988b). Parallel administration of events in real time systems. Microprocessing and Microprogramming, 24, 687-692.

Wolfgang A. Halang and Alexander D. Stoyenko (1991). Constructing Predictable Real Time Systems. Kluwer Academic Publishers, Boston-Dordrecht-London.

John A. Stankovic (1988). Misconceptions about real-time computing. IEEE Computer, 21(10), 10-19.

John A. Stankovic and Krithi Ramamritham (1990). Editorial: What is predictability for real-time systems. Real-Time Systems, 2(4), 246-254.

Glenn Zorpette (1990). Technology 90: Minis and mainframes: Microprocessors - Expert opinion of Stephen E. Nelson. IEEE Spectrum, 27(1), 30-34.


(Figure 2 appears here.)

Fig. 2. Asymmetrical multiprocessor architecture for hard real time systems: the task processor (task control unit with control unit, instruction sequencer and decoder, and subroutine stack; data processing unit with local memory (variable file), ALUs, and instruction decoder, coupled by the sync and B status lines), the external data access unit (control unit, DMA controller, serial bus master, and global data storage (main memory)), the peripherals P1-P3 with microcontroller interfaces on a serial bus, and the kernel processor with its SRL, PRL, and HW layers, interconnected by serial link master/slave pairs and the continuation line.