A SAFE AND EFFICIENT PROTOCOL FOR OASIS-BASED DISTRIBUTED SAFETY-CRITICAL REAL-TIME SYSTEMS
Jean-Sylvain Camier, Damien Chabrol, Vincent David and Christophe Aussaguès
CEA, LIST BP 65, GIF SUR YVETTE CEDEX, F-91191 FRANCE
Abstract: Distributed real-time systems are widely used in most key industries to perform safety-critical functions. Designing such systems remains a challenge because of the many problems raised by the combination of real-time, distribution, performance and safety constraints. Fault-tolerant real-time networks have been a major research subject in recent years. OASIS provides a time-triggered multitasking and communication approach for designing safety-critical real-time systems. In order to fulfil industrial needs, OASIS has been extended to distributed architectures. This extension includes a deterministic real-time protocol based on TDMA and its optimized version, called A-TDMA, which is presented here. Copyright © 2007 IFAC
Keywords: Distributed systems, Real-time networks
1. INTRODUCTION
In order to fulfil industrial needs, work has been carried out to extend OASIS to distributed architectures. A deterministic real-time protocol based on TDMA and its optimized version, called A-TDMA, have been developed for the time-triggered context. Both allow the use of proprietary or COTS network components.
To prepare the next generation of sensitive systems, such as instrumentation and control or embedded systems in the transport and energy domains, our research on distributed safety-critical real-time systems has set the following goals:
• Maintain safety and enhance availability
• Cover functional and performance needs
• Afford easier operation and maintenance
• Shorten development time
2. THE OASIS APPROACH
To reach these aims, our contribution is based on the OASIS approach developed by the CEA (French Atomic Energy Commission) and AREVA NP (Nuclear Power Firm), in cooperation with EDF (Électricité de France). This approach provides methods to design and implement safety-critical real-time systems on single- and multiple-processor architectures.
OASIS provides an environment for real-time multitasking and communication design and an execution environment based on a safety-oriented embedded real-time kernel. The tools include a safety-oriented kernel performing a memory segmentation during link edition, as well as code generation, offline resource sizing features and verification/validation capabilities for all stages during development.
Additional information on the OASIS model and its implementation can be found in David et al. (1998).
To ensure safe execution, the OASIS kernel uses every protection mechanism available in the computer system. Rigorous protection is based on the following two mechanisms:
• Limiting privileged processor mode to the kernel in order to control access to hardware resources as well as to the system layer and the micro-kernel. The application runs in user mode.
• Enabling memory protection, through an MMU, for instructions and constant/variable data exchanged by the tasks, the system layer and the micro-kernel.

The OASIS system is predictable and reproducible in both the logical and temporal domains, even in case of failure, thanks to error detection and confinement, control of sources of interruption and strict communication mechanisms.
Communication through shared variables is prohibited: combined with execution asynchronism, shared variables are responsible for non-deterministic behaviours. OASIS provides two communication mechanisms based on strict observation principles. Communications are strictly managed to preserve determinism and therefore independence from the implementation.
These fault detection and confinement mechanisms are covered by a CEA/Framatome-ANP patent (David and Delcoigne, 2000).
Real-time data flows are defined by the use of temporal variables: values, only available to allowed tasks, are stored and updated by a single writer, the owner, at a predetermined temporal rhythm. This rhythm is expressed in the source code as a regular time period.
3. A SAFE AND EFFICIENT PROTOCOL
In traditional networks, maximizing the bandwidth and minimizing the average transmission delay are the usual performance criteria. In the context of safety-critical real-time networks, these criteria are not trivial to meet. Indeed, such networks must satisfy timeliness and must conform to safety requirements, especially determinism and fault tolerance.
In addition to conventional sending attributes, real-time messages in OASIS are typed and checked during compilation, and their visibility dates (dates beyond which messages can be seen and extracted by the recipient tasks) are specified by the sender. A recipient task has queues for receiving messages. A queue can only accommodate messages of the same type and has an associated time limit, which defines a validity period for the received messages. To achieve determinism, the ordering of messages by visibility date is made total.
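As an illustration only, the following sketch models a typed receive queue whose messages become observable at their visibility date and expire after the time limit of the queue. The class and field names, the use of a sender identifier and a sequence number as tie-breakers to make the order total, the validity period counted from the visibility date, and the integer time base are all assumptions made for this example; they are not taken from the OASIS implementation.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(order=True)
class Message:
    # Ordering key: visibility date first, then hypothetical tie-breakers
    # (sender id, per-sender sequence number) so that the order is total.
    visibility: int
    sender: int
    seq: int
    payload: Any = field(compare=False)


class ReceiveQueue:
    """Queue accepting a single payload type, with a validity period (assumed model)."""

    def __init__(self, payload_type: type, time_limit: int) -> None:
        self.payload_type = payload_type
        self.time_limit = time_limit
        self._heap: List[Message] = []

    def post(self, msg: Message) -> None:
        # OASIS checks message types at compile time; this run-time check only mimics it.
        if not isinstance(msg.payload, self.payload_type):
            raise TypeError("message type does not match the queue type")
        heapq.heappush(self._heap, msg)

    def extract(self, now: int) -> List[Message]:
        """Return, in total order, the messages that are visible and still valid at `now`."""
        out = []
        while self._heap and self._heap[0].visibility <= now:
            msg = heapq.heappop(self._heap)
            # Assumption: the validity period starts at the visibility date.
            if now <= msg.visibility + self.time_limit:
                out.append(msg)
        return out
```

With this model, two messages carrying the same visibility date are always delivered in the same order, whatever the order in which they were posted, which is the property the total order is meant to provide.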
3.1 TDMA
After an analysis of medium access protocols (Rushby, 2001), Time Division Medium Access (TDMA) has been identified as a good support for safe and timely communication. Global time is divided into a sequence of time slots, each of them assigned to a single unit of the network. These assignments and the time slot sizes are stored in a static schedule table that is computed offline during an analysis stage called network scheduling.
A safety-critical application must be designed according to hypotheses about its sizing and its execution. In order to ensure a safe behaviour, OASIS provides a safety-oriented real-time kernel whose role is to detect every deviation from these hypotheses. For this purpose, defensive programming and fault-tolerant mechanisms are implemented in order to ensure early detection and confinement.
Communication latencies are precisely predictable because they can be easily computed. Indeed, TDMA ensures that, at any time, only one board uses the network bus. To this end, during network scheduling, each time slot is sized according to the network occupation of its transmission. In other words, a time slot is the temporal interval during which a unit is guaranteed to be alone on the bus.
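A minimal sketch of a static schedule table is given below, assuming integer time in microseconds and hypothetical unit names and slot values. It only illustrates how a fixed table, repeated every network cycle, determines which unit owns the bus at a given date; it does not reproduce the OASIS network scheduling stage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Slot:
    unit: str   # owner of the slot (hypothetical unit name)
    start: int  # offset of the slot start within the cycle, in microseconds
    end: int    # offset of the slot end within the cycle (exclusive)


def check_table(table: List[Slot], cycle: int) -> None:
    """Reject tables whose slots overlap or do not fit in the cycle."""
    ordered = sorted(table, key=lambda s: s.start)
    for s in ordered:
        if not 0 <= s.start < s.end <= cycle:
            raise ValueError(f"slot of {s.unit} does not fit in the cycle")
    for a, b in zip(ordered, ordered[1:]):
        if a.end > b.start:
            raise ValueError(f"slots of {a.unit} and {b.unit} overlap")


def owner_at(table: List[Slot], cycle: int, date: int) -> Optional[str]:
    """Return the unit allowed on the bus at `date`, or None if no slot covers it."""
    offset = date % cycle
    for s in table:
        if s.start <= offset < s.end:
            return s.unit
    return None


# Hypothetical table for three units and a 1000 us network cycle.
CYCLE = 1000
TABLE = [Slot("A", 0, 300), Slot("B", 300, 550), Slot("C", 600, 900)]
check_table(TABLE, CYCLE)
```

Because the table is fixed offline, `owner_at` gives the same answer for a date and for that date shifted by any whole number of cycles, which is what makes the latencies computable in advance.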
The OASIS kernel ensures detection of:
• Non-conforming task behaviours. Each task is monitored by using the state-transition diagram extracted during the compilation stage. This diagram describes all of the logical and temporal behaviours of a task;
• Usual exceptions (e.g. addressing faults). The OASIS kernel is built without mutual exclusion zones and by disabling all hardware interrupts except those of the clock timer;
• Violations of temporal constraints (deadline and quota);
• Attempts of memory overflow.
Because of the jitter caused by clock drifts (see Fig. 1) and the usual sources of asynchronism, successive transmissions (repeated at each cycle) cannot be performed at exactly the same effective dates:
\[ -\rho \le S' - S \le \rho + \sigma \pmod{C} \tag{1} \]
where S and S′ are effective dates of transmission and C is the network cycle period.
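The bound of equation (1) can be checked numerically. The sketch below is an illustration under assumptions: dates are integers in microseconds, ρ and σ denote the clock synchronization accuracy and the transmission precision introduced with the transmission window, and the difference S′ − S is reduced modulo C to the representative closest to zero. The function name and the sample values are hypothetical.

```python
def within_jitter_bound(s: int, s_prime: int, rho: int, sigma: int, cycle: int) -> bool:
    """Check -rho <= S' - S <= rho + sigma (mod C), as in equation (1)."""
    if 2 * (rho + sigma) >= cycle:
        # Assumption: the jitter must be small compared with the cycle,
        # otherwise the reduction modulo C below would be ambiguous.
        raise ValueError("the jitter bound must be smaller than half the cycle")
    # Centered residue of S' - S modulo C (the representative closest to zero).
    diff = (s_prime - s + cycle // 2) % cycle - cycle // 2
    return -rho <= diff <= rho + sigma


# Hypothetical values: 1000 us cycle, rho = 2 us, sigma = 5 us.
print(within_jitter_bound(100, 1103, rho=2, sigma=5, cycle=1000))  # 3 us late
print(within_jitter_bound(100, 1110, rho=2, sigma=5, cycle=1000))  # 10 us late
```

The first call is within the bound and the second is not, since a transmission may be at most ρ early and ρ + σ late with respect to the same date in another cycle.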
Fig. 1. TDMA scheduling. Busy intervals are coloured in dark gray and silent intervals in light gray.

According to equation (1), the network occupation of one transmission must be upper bounded, such that the computed time slot is the sum of its associated transmission window, the spacing minspace required by the underlying network, and the packet transmission latency:
\[ \mathrm{Slot} = [T_s, T_e], \qquad T_e = T_s + \|W\| + \mathit{minspace} + \|\mathit{packet}\| \tag{2} \]
where T_s is the earliest starting date of transmission and T_e is the latest ending date of transmission. The transmission window W is the interval during which the transmission command must occur in order to meet the time slot ending date. It is computed from an upper bound of the clock synchronization accuracy ρ and the transmission precision σ, such that:
\[ W = [S - \rho,\; S + \rho + \sigma] \tag{3} \]
where S is the formal date of the transmission command, computed off-line.

Since TDMA brings a static network scheduling, its behaviour is reproducible in both the temporal and logical domains. To ensure timeliness, the network scheduling must be computed so that each time slot is placed before the network deadline. However, this is insufficient to ensure data timeliness. Indeed, communication latencies are significant in comparison with memory copies. If the end of the data production is not anticipated and guaranteed, data will not be available in time for distant consumers. Therefore, the network deadline must be set early enough to allow complete production and transmission.

Network fault tolerance begins with the ability to detect errors coming from the network. Here, error detection is simple and efficient. Since the network scheduling is static, each unit has a perfect knowledge of the network behaviour. The error detection is therefore based on a qualitative approach where missing or swapped packets are critical errors. Moreover, the network is supervised in order to detect violations of execution and sizing hypotheses such as the absence of collision or the bounded clock synchronization accuracy.

3.2 Advanced-TDMA
To ensure data availability in time, a deadline is set on each producing task so that its processing ends early enough before the beginning of the dedicated packet transmission. TDMA provides a network scheduling where each unit has a reserved part of the bandwidth (its time slots), in order to ensure timeliness of its packet transmissions. However, the bandwidth reserved for one unit is only partially used and cannot be given to any other unit. This loss of network bandwidth affects the sizing of the system. Indeed, data must be produced before its availability deadline: the production has to be performed in less time, which increases the processor load.

In order to optimize the network sizing, we propose an evolution of the TDMA protocol, called Advanced-TDMA (noted A-TDMA), where time slots overlap while determinism, timeliness and integrity of the communication are still ensured.

To carry out this concept, the network hardware must provide a carrier sense mechanism which starts the transmission immediately if the network is silent and, when the transmission is attempted while the network is busy, starts it automatically once the ongoing transmission has ended. It is important to note that this behaviour is not a collision: the network signal is not disturbed. When transmission commands are issued during an effective transmission, they are left pending until the bus is free. Such a mechanism is widespread in COTS components (e.g. Ethernet).
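The deferral behaviour expected from the carrier sense mechanism can be illustrated with a small simulation. This is a sketch under assumptions, not a model of any particular network controller: time is an integer number of microseconds, a pending transmission starts as soon as the ongoing one has ended and an optional interframe spacing has elapsed, and two commands pending behind the same transmission are treated as a collision. The function and parameter names are hypothetical.

```python
from typing import List, Optional, Tuple


def effective_transmissions(
    commands: List[Tuple[int, int]], minspace: int = 0
) -> List[Tuple[int, int]]:
    """Map (command date, packet duration) pairs to (start, end) bus intervals.

    A command issued while the bus is silent starts at once; a command issued
    while the bus is busy is left pending until the bus is free again.
    Commands must be given in non-decreasing date order.
    """
    intervals: List[Tuple[int, int]] = []
    bus_free_at = 0
    prev_start: Optional[int] = None
    for date, duration in commands:
        if prev_start is not None and date < prev_start:
            # The previous command was still pending when this one was issued:
            # both would start at the same instant once the bus is released.
            raise ValueError("two commands pending at once: collision")
        start = max(date, bus_free_at)
        end = start + duration
        intervals.append((start, end))
        prev_start = start
        bus_free_at = end + minspace
    return intervals


# Hypothetical scenario: the second command is issued while the first packet is on the bus.
print(effective_transmissions([(0, 100), (50, 100)], minspace=10))
```

In this scenario the second transmission is simply delayed until the first one has ended and the spacing has elapsed; the signal of the first transmission is never disturbed.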
During network scheduling, time slots are sized as in TDMA, except that the starting and ending dates of transmissions are postponed where possible. Indeed, the objective is to overlap an effective transmission with the transmission window of the next communication. This optimization is performed in accordance with the following conditions in order to preserve safety and timeliness requirements.
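Before the conditions are stated, it may help to see how the sizing rules of equations (2) and (3) and two of the checks that follow (Conditions 1 and 3) could be expressed as an offline verification. The sketch below is illustrative only: the function names, the integer microsecond time base and the sample values are assumptions, and it does not reproduce the OASIS network scheduling tool.

```python
from typing import Tuple

Interval = Tuple[int, int]  # closed interval [start, end], in microseconds


def window(s: int, rho: int, sigma: int) -> Interval:
    """Transmission window W = [S - rho, S + rho + sigma], equation (3)."""
    return (s - rho, s + rho + sigma)


def length(iv: Interval) -> int:
    """Length of an interval, written ||.|| in the text."""
    return iv[1] - iv[0]


def tdma_slot(ts: int, w: Interval, minspace: int, packet: int) -> Interval:
    """Time slot [Ts, Te] with Te = Ts + ||W|| + minspace + ||packet||, equation (2)."""
    return (ts, ts + length(w) + minspace + packet)


def windows_disjoint(w1: Interval, w2: Interval) -> bool:
    """Condition 1: two transmission windows never overlap."""
    return w1[1] < w2[0] or w2[1] < w1[0]


def slot_long_enough(slot1: Interval, w1: Interval, w2: Interval, minspace: int) -> bool:
    """Condition 3: ||Slot1|| > ||W|| + ||W'|| + minspace."""
    return length(slot1) > length(w1) + length(w2) + minspace


# Hypothetical values: rho = 2 us, sigma = 5 us, minspace = 10 us, 100 us packets.
w1 = window(100, rho=2, sigma=5)
slot1 = tdma_slot(w1[0], w1, minspace=10, packet=100)  # Ts taken as the window start
w2 = window(150, rho=2, sigma=5)                       # next window, inside slot 1
print(w1, slot1, w2)
print(windows_disjoint(w1, w2), slot_long_enough(slot1, w1, w2, minspace=10))
```

Here the second transmission window falls inside the first time slot, which is the overlap A-TDMA aims for, while the two windows themselves remain disjoint.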
Fig. 2. Advanced-TDMA. Time slot 1 is postponed from interval (T1, T2) to (T1′, T2′).

Condition 1 Transmission windows must never overlap one another; otherwise a collision or a transmission swap may occur:
\[ W_i \cap W_j = \emptyset \quad (i \neq j) \tag{4} \]

Condition 2 Time slot overlapping must be performed only between two consecutive time slots. If more than one transmission window overlaps a time slot, even in accordance with condition 1, a collision will inevitably occur. Indeed, the transmissions left pending after their transmission commands will all attempt to start at the same time, thus causing a collision.

Condition 3 In accordance with conditions 1 and 2, the time slot overlapped by the transmission window of the next time slot must be sufficiently long, such that:
\[ \|\mathrm{Slot}_1\| > \|W\| + \|W'\| + \mathit{minspace} \tag{5} \]
where packet 1 is sent before packet 2 and minspace is the minimum interframe spacing required by the physical network support (e.g. 9.6 µs over 10 Mbit/s Ethernet) (see Fig. 2).

Condition 4 When a system starts running, the network bus is always silent. The first time slot may therefore induce a drift on the network scheduling. However, if its transmission window is computed as overlapping the previous transmission, its jitter is not directly taken into account, because it is assumed to run in parallel with the previous transmission. Consequently, no optimization must be performed for the first transmission; otherwise the network scheduling will be wrong, because it cannot ensure timely transmission. Even so, an error would be raised by the OASIS kernel, since a packet would not be received in time. For this purpose, a synchronization barrier is identified in the network by all the boards. It corresponds to a date that no communication may span: no communication can begin before it and end after it. This prevents drift propagation.

A-TDMA improves bandwidth usage by avoiding unused silent intervals, while remaining in conformance with the network requirements. Bandwidth is gained and made available to other units (see Fig. 2). This gain is equivalent to the sum of all the transmission windows of the overlapped time slots.

4. CONCLUSION
The OASIS approach allows the design of multitasking real-time systems with reproducible behaviours, in conformance with the safety requirements. OASIS supplies all the tools, from the formal dialect to the binary generation, at an industrial level for Motorola 68k, IA32/PC-AT and POSIX targets. The described research focuses on a synchronous protocol where timeliness and determinism are ensured, together with an efficient bandwidth allocation. Combined with the extension to distributed architectures, OASIS provides a design approach for increasingly complex safety-critical real-time systems. The proposed package frees designers from concerns such as clock synchronization and communication, thus affording flexible use and a transparent architecture. Work on automatic network sizing and scheduling will be detailed in further papers. Moreover, R&D work is in progress to provide an additional package that assists the network architecture design in accordance with the targeted safety levels.

REFERENCES
V. David and J. Delcoigne (CEA/Framatome-ANP). Patent WO 02/39277 A1, Security method making deterministic real-time execution of multitasking applications of control and command type with error confinement, 2000.
V. David, J. Delcoigne, E. Leret, A. Ourghanlian, and Ph. Hilsenkopf. Safety properties ensured by the OASIS model for safety critical real time systems. IFIP-SAFECOMP'98 Conf., 1998.
J. Rushby. Bus architectures for safety-critical embedded systems. Lecture Notes in Computer Science, 2211, 2001.