INCOM'2006: 12th IFAC/IFIP/IFORS/IEEE/IMS Symposium Information Control Problems in Manufacturing May 17-19 2006, Saint-Etienne, France
A TIME-TRIGGERED CONTROLLER AREA NETWORK PLATFORM WITH ESSENTIALLY DISTRIBUTED CLOCK SYNCHRONIZATION Fabiano C. Carvalho ∗ Edison P. Freitas ∗ Carlos E. Pereira ∗∗ Fernando H. Ataide ∗∗
∗ Instituto de Informática, Universidade Federal do Rio Grande do Sul ∗∗ Departamento de Engenharia Elétrica, Universidade Federal do Rio Grande do Sul
Abstract: Recently, the development of control systems for safety-critical industrial applications has gained special attention from international committees. Standards such as IEC-61508 introduce guidelines for risk assessment considering failure rates lower than $10^{-6}$ per year. For a distributed system to meet that requirement, one alternative is to employ fault-tolerance techniques such as active redundancy and message cross-checking. Considering that, for cost and locality reasons, the processing units of these distributed systems are usually interconnected through a shared bus, the underlying communication platform becomes the most important building block. It must provide low-level support for deterministic data transmission as well as a global time base to coordinate the actions of replicated units. Within this context, this paper presents a time-triggered extension of the CAN protocol as a communication architecture for safety-critical applications. Unlike other related work that relies on a centralized time reference, our communication platform is enhanced with a low-cost, essentially distributed clock synchronization algorithm. Copyright © 2006 IFAC
Keywords: Real-Time Protocols, Clock Synchronization, Fault-Tolerance
1. INTRODUCTION
The introduction of fieldbuses, i.e. industrial digital networks, has offered manufacturers the possibility of replacing the excessive cabling of relay panels and programmable logic controllers (PLCs) with a single communication cable. The first solutions relied on a high-performance processing unit that was responsible for executing all control tasks, since field devices were no more than signal transducers with network interfaces. Thanks to the continuing evolution of VLSI technology, it is now possible to produce field devices in which analog interfaces, network controllers, processor and memory cells are all assembled on the same silicon die at low cost. As a consequence, the new trend in industrial systems is to move the computational effort away from a centralized control unit toward a number of intelligent devices in the field. Modern fieldbuses are indeed distributed systems implemented as a number of average (or low) performance processing units which execute control tasks in parallel.
Nevertheless, there has been much discussion on how to map a safety application onto a distributed set of processing units (Dilger et al., 1997), (Bridal et al., 1998) and (Srinivasan and Lundqvist, 2002). To use a distributed system instead of a hard-wired PLC, for instance, it has to be proven that the risks of harm are at most equivalent in both. One particular problem is that faults are more likely to occur in intelligent devices than in robust, dedicated circuitry because intelligent devices are more complex (Bowles, 1992), embodying hardware and software components from various sources, including off-the-shelf and pre-designed legacy blocks. In order to increase the trustworthiness of a distributed system in a practicable way, one alternative is to employ fault-tolerance techniques at the component level (Avizienis, 1995). Assuming that processing units communicate by means of message-passing only, the underlying communication platform must offer features like time composability and reliable clock synchronization (Rushby, 2001). Unfortunately, most current fieldbus protocols lack these features, which would provide the means to employ distributed fault-tolerance in a straightforward manner. Even the IEC-61158 fieldbus (Durante and Valenzano, 1999), which is an attempt to create an international standard, is not reliable because of its centralized arbitration and time management mechanisms.
In this paper, the authors describe the implementation of a Communication Architecture for Safety-Critical Applications, or simply CASCA, which is a time-triggered extension of the original CAN protocol. The work was first motivated by the growing interest of the automotive industry in developing embedded protocols with clock synchronization and a static message transmission schedule intended for x-by-wire applications. However, the authors claim that these ideas can also be applied in the industrial domain in order to build dependable distributed systems.

This paper is organized as follows: in section 2 some related work is briefly surveyed; the CASCA platform is presented in section 3, covering architectural and functional aspects; section 4 presents practical implementation results and finally, in section 5, some concluding remarks are made.

2. RELATED WORK

2.1 The IEC-61158 Fieldbus

The IEC-61158 (Durante and Valenzano, 1999) is an attempt to create an international fieldbus standard. Because of economic and political disputes, only the physical layer of the IEC-61158 has been approved, while the remaining parts of the document, which describe the data link and application layers, are still the subject of intense debate.

The current definition of the data link layer merges two well-known arbitration schemes in an attempt to satisfy all interested parties. The first one is a centralized arbitration method very similar to that used in WorldFIP, whilst the other is based on a token-passing strategy which is a clear heritage of the Profibus protocol. A single network unit called the LAS (Link Active Scheduler) is responsible for coordinating access permissions for all nodes within a single bus segment. During runtime, it requests the transmission of field state messages at predefined instants by sending a specific message named Compel Data. Whenever the LAS detects that free bandwidth is available, it sends a delegated token message to another requesting node in order to give it access rights. Each node that receives the token must respect the delegated token hold time (DTHT) to ensure that real-time traffic is not jeopardized. While the target token rotation time (TTRT) has not expired, the token circulates according to a round-robin strategy in order to allow each active network unit to send its pending messages.

In effect, the IEC-61158 protocol does not offer the possibility of building essentially distributed real-time systems, simply because it depends on stable operation of the LAS for bus arbitration and time control. Clock synchronization, which is a fundamental service for many fault-tolerance techniques, is very rudimentary since it relies solely on the LAS, which periodically broadcasts time information. In this case, even short disturbances on the bus during the transmission of a time message may result in loss of synchrony.

2.2 The Time-Triggered Architecture

The TTA (Kopetz and Bauer, 2003) defines a model of distributed processing and communication in which relevant system actions are triggered as a global time base progresses. Bus arbitration is performed by a time-division multiple access (TDMA) control logic which is tightly coupled with a low-level implementation of the Fault-Tolerant Average (FTA) clock synchronization algorithm (Welch and Lynch, 1988). There is a sequence of TDMA slots in which each node transmits one message, thus forming a TDMA round. After one TDMA round finishes, the next one is started. The temporal access pattern of the new TDMA round is basically the same, but possibly different messages are sent. The number of different TDMA rounds determines the length of a cluster cycle. After a cluster cycle is finished, the transmission pattern starts over again. This cyclical operation is maintained as long as half of the nodes remain synchronized with each other.

From the TTA three protocols derive: TTP/A, TTP/B and TTP/C, the last of which is the fault-tolerant version. One of the most important characteristics of the TTP/C is the confinement of all communication activity inside the platform. It offers the possibility of executing replicated units in absolute synchrony with minimal communication overhead and without interfering with the application layer, thereby facilitating the design and validation of functional and non-functional blocks. Moreover, each processing unit knows exactly when all system-wide periodic actions happen, since a global schedule is defined offline. This is very useful for fast error detection, which is achieved by comparing the actual system behavior with the expected behavior.
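The cyclic structure of TDMA slots, rounds and cluster cycles described in section 2.2 can be illustrated with a minimal sketch. It assumes equal-length slots and a global time expressed in microseconds; the function name `tdma_position` and all parameter names are hypothetical and are not taken from the TTA or TTP/C specifications.

```python
def tdma_position(t_us, slot_us, slots_per_round, rounds_per_cluster):
    """Map a global time (in microseconds) to (round index, slot index).

    Assumption for illustration: all slots have the same length, so a TDMA
    round lasts slot_us * slots_per_round and a cluster cycle lasts
    rounds_per_cluster rounds.
    """
    round_us = slot_us * slots_per_round
    cluster_us = round_us * rounds_per_cluster
    t = t_us % cluster_us            # the pattern repeats every cluster cycle
    round_index = t // round_us      # which TDMA round of the cluster cycle
    slot_index = (t % round_us) // slot_us  # which slot inside that round
    return round_index, slot_index


if __name__ == "__main__":
    # 4 slots of 100 us per round, 2 different rounds per cluster cycle.
    for t in (0, 399, 450, 850):
        print(t, tdma_position(t, 100, 4, 2))
```

Because every node evaluates the same mapping against the same global time, each node can tell which slot is active, and therefore which sender is expected, without any run-time negotiation.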
2.3 Time-Triggered Extensions of CAN

The time-triggered paradigm is being adopted as the primary design concept of many dependable embedded protocols because of the desirable properties mentioned above. Researchers from Robert Bosch GmbH have created TT-CAN (Time-Triggered on CAN) (Fuhrer et al., 2000), which is an extension of the CAN protocol. The physical and data link layers were left unchanged to facilitate migration and preserve compatibility with existing controllers. Communication in TT-CAN is organized in several TDMA rounds, each one being triggered by a centralized master which broadcasts the reference message (RM). Another example is FTT-CAN, proposed in (Almeida et al., 2002). There, the authors also emphasize the importance of combining periodic and aperiodic traffic and of providing dynamic admission control of transmission scheduling. The main drawback of both solutions is again that bus arbitration and time control are centralized functions that depend on the correct operation of a single processing unit.
3. THE CASCA PLATFORM

3.1 Architectural Aspects

The CASCA network controller was implemented entirely in synthesizable VHDL code in order to take full advantage of the low turnaround time of rapid prototyping with FPGA devices. It is composed of a set of hardware blocks and a PIC processor (PIC16F84A) attached to an internal parallel bus, as shown in figure 1. The PIC processor is the core of the system since it performs safety functions, such as cross-checking of messages from redundant units, and supervises all data flow. This strategy provides the best error containment since a faulty block cannot adversely affect another one without confusing the central core first.

Furthermore, the PIC processor is also responsible for executing the system initialization procedure described in subsection 3.3. Its software routines are implemented in C and compiled with the SourceBoost IDE¹. After compilation, the resulting .hex file is converted into a VHDL program memory description that is directly attached to the instruction bus. The main steps of clock synchronization (clock reading and clock updating) are performed in hardware inside the Global Clock entity, which contains the essential counters that control the size of TDMA slots and rounds. The CAN BSP, the BTL and an external transceiver (82C250) implement the CAN protocol in accordance with the ISO-11898 standard, except that the automatic retransmission and error signaling features were not included, in order to avoid slot overlapping in case of bus errors. To employ bus redundancy using duplicated channels, the entire data link layer must be replicated by attaching another instance of the BSP and BTL to the internal bus, as suggested in figure 1.

During runtime, control data is taken from dual-port memories at the host interface and then loaded into the BSP block for transmission. Likewise, messages from the bus(es) are transferred from the BSP to the correct destination in the reception memory area. Moreover, the host is able to read the global time base directly from the Global Clock block so that interrupts can be programmed to trigger significant actions at the application level.

¹ The SourceBoost IDE is available at: http://www.picant.com/c2c/c.html

Fig. 1. Structural Organization

3.2 Time Management

Essentially distributed clock synchronization is performed by executing the Daisy-Chain algorithm (Lonn, 1999) with dedicated hardware support. This method is simpler to implement since there is no need to temporarily store clock readings, as happens in other distributed clock synchronization algorithms like the FTA. All nodes adjust their clocks according to the time view of the current transmitter, hence there is no need to rely on a centralized time reference. At any time, the maximum skew between non-faulty nodes provided by the Daisy-Chain algorithm is bounded by:

$$\delta_{max} = \epsilon + 2\rho R \qquad (1)$$

where $\rho$ is an upper bound on the clock drift, $\epsilon$ is the reading error resulting from variable propagation delays and $R$ is the synchronization interval, that is, the time between two successive clock corrections. There is no need for high-precision oscillator devices, mainly because clock adjustments take place frequently (at every message transmission).

For this work we adopted a time hierarchy very similar to that used in the FlexRay automotive protocol (FlexRay, 2004). The microtick is the primary source of discrete time pulses and it is taken directly from the oscillator device. Since processing units can have different oscillator frequencies, the length of a microtick is not required to be a global parameter. Above the microtick level is the macrotick, which corresponds to the finest-grained unit of global time; its duration must be defined at design time and must be the same for all nodes. Moreover, this duration should not exceed the assumed value for $\delta_{max}$ to ensure that at any point of physical time $t$ all nodes read the same clock value $C(t)$.

Fig. 2. The CASCA Time-Hierarchy (labels: vCycleCounter, cCycleCountMax, gdCycle; macrotick level: vMacrotick, gdMacrotick; microtick level: vMicrotick)

At each message transmission the clock reading of the sender is inferred at all receivers as follows. Whenever a start-of-frame interrupt signal is raised at the BSP interface, the PIC processor sends a clock reading command to the Global Clock entity. Dedicated hardware circuitry inside this entity then calculates the difference between the actual time and the expected arrival time of that message, taking into account the propagation delay, that is, the time the signal takes to travel through the bus from source to destination. Once calculated, this difference (denoted as $\pm D$) is applied to the microtick counter (vMicrotick) the next time it reaches its upper limit (cMicrotickMax), as shown in figure 3.

Fig. 3. Principle of Clock Correction (labels: vMicrotick, cMicrotickMax, vMacrotick, +D, -D)

System time is defined within a bounded time horizon called a cluster cycle. In a CASCA system, at any moment the global time is represented as vCycleCounter:vMacrotick, and it progresses from 0:0 up to cCycleCountMax:cMacrotickMax. Once the global time reaches its upper limit, it starts over again. The beginnings of all TDMA slots lie on this time horizon, as do all control actions that are triggered at the host.

3.3 System Lifecycle

The design of a distributed system using the CASCA platform starts with the requirements on the transmission periods of messages, from which a static cluster schedule must be defined. An example is shown in figure 4, where it is assumed for simplicity that the system has only 3 messages of equal length (MSG 1, MSG 2 and MSG 3) to be transmitted with different periods (1ms, 2ms and 4ms). It is important to note that all periods must be defined as an integer number of macroticks, otherwise they cannot be scheduled for transmission. Moreover, the duration of the resulting cluster cycle must be the least common multiple of the periods. A direct consequence of this is the existence of empty slots.

Fig. 4. Example of Cluster Cycle Schedule
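The derivation of a static cluster schedule from message periods, as described in section 3.3, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' tool: the round length is assumed to equal the shortest period, each message is assumed to own one slot per round, and the phase of each message is chosen so that only the last round is fully occupied (in line with the remark in section 3.4). The function `build_cluster_schedule` and the 1 us macrotick default are hypothetical.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def build_cluster_schedule(periods_us, macrotick_us=1):
    """Return (cluster cycle length in us, list of rounds).

    periods_us maps a message name to its period in microseconds.
    Each round is a list with one entry per message slot: the message
    name if it is sent in that round, or None for an empty slot.
    """
    round_us = min(periods_us.values())  # assumption: round = shortest period
    for name, period in periods_us.items():
        if period % macrotick_us != 0:
            raise ValueError(f"{name}: period is not a whole number of macroticks")
        if period % round_us != 0:
            raise ValueError(f"{name}: period is not a multiple of the round length")
    cluster_us = reduce(lcm, periods_us.values())  # cluster cycle = LCM of periods
    rounds = []
    for r in range(cluster_us // round_us):
        slots = []
        for name, period in periods_us.items():
            every = period // round_us  # message is sent once every `every` rounds
            slots.append(name if (r + 1) % every == 0 else None)
        rounds.append(slots)
    return cluster_us, rounds


if __name__ == "__main__":
    cycle, rounds = build_cluster_schedule(
        {"MSG 1": 1000, "MSG 2": 2000, "MSG 3": 4000}
    )
    print("cluster cycle:", cycle, "us")
    for index, slots in enumerate(rounds):
        print("round", index, slots)
```

For the three example periods the sketch yields a 4 ms cluster cycle with four rounds, and the `None` entries correspond to the empty slots mentioned in the text.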
Once the cluster schedule is defined, the system is ready to run. However, a problem that comes into play is the establishment of normal TDMA operation, considering that initially the clocks are not aligned. In CASCA, this initial synchronization is greatly simplified since possible initial collisions are resolved by the non-destructive bitwise arbitration strategy of the original CAN protocol.
Table 1. Resource Utilization of the Xilinx Spartan2E Device

Block Name      Slices   Slice FFs   LUTs
BSP + BTL       315      239         556
Global Clock    353      140         637
PIC Core        527      346         948

When any node is powered on, it first listens to the bus in order to detect ongoing communication. If a message is received during this listen time, the node adjusts its counters according to the contents of the corresponding message identifier, which carries the current slot and TDMA round. On the other hand, if there is no bus traffic, the node enters the coldstart mode and attempts to create a new time base by trying to send the first message on the bus. Since logical clocks are not initially synchronized, collisions may occur when the listen timeouts of two nodes expire at about the same time. When that happens, the node with the highest priority wins arbitration and becomes the leading node. The other nodes (including the one that lost arbitration) adopt the time base of the leading node and start communication. From then on, clock synchrony is maintained by the Daisy-Chain algorithm.
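The startup procedure described above can be illustrated with a small simulation sketch. It is a simplified model and not the CASCA firmware: the `Node` fields, the `elect_leader` function and the `collision_window_us` parameter are hypothetical names introduced for illustration. The only protocol fact relied upon is that, in CAN bitwise arbitration, the frame with the numerically lowest identifier has the highest priority.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    can_id: int              # lower CAN identifier = higher priority
    power_on_us: int         # instant at which the node is powered on
    listen_timeout_us: int   # how long the node listens before coldstart

    @property
    def coldstart_at(self):
        """Instant at which the node would try to send the first message."""
        return self.power_on_us + self.listen_timeout_us


def elect_leader(nodes, collision_window_us=1):
    """Return (leader name, names of nodes that adopt the leader's time base).

    Nodes whose listen timeout expires within collision_window_us of the
    earliest expiry are assumed to start transmitting together; CAN
    arbitration then selects the one with the lowest identifier.
    """
    first = min(n.coldstart_at for n in nodes)
    contenders = [n for n in nodes if n.coldstart_at - first <= collision_window_us]
    leader = min(contenders, key=lambda n: n.can_id)
    followers = [n.name for n in nodes if n is not leader]
    return leader.name, followers


if __name__ == "__main__":
    cluster = [
        Node("A", can_id=0x20, power_on_us=0, listen_timeout_us=1000),
        Node("B", can_id=0x10, power_on_us=0, listen_timeout_us=1000),
        Node("C", can_id=0x05, power_on_us=500, listen_timeout_us=1000),
    ]
    print(elect_leader(cluster))
```

In this example nodes A and B collide at coldstart and B wins arbitration; node C is still listening at that moment, so it detects the traffic and integrates into B's time base even though its identifier has higher priority.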
3.4 Allocation of Asynchronous Traffic

Although CASCA is essentially time-triggered, the host can also transmit asynchronous messages in the empty slots resulting from the definition of the global schedule. As one can see in figure 4, there is free bandwidth available in all TDMA rounds except the last one. That bandwidth can be used to schedule dynamic traffic during runtime. This is a service of utmost importance in industrial applications for diagnostic and parameterization purposes. In contrast to timed messages, which always overwrite the output buffer, asynchronous messages must be queued for transmission. Despite being subject to contention, the transmission of asynchronous messages starts at the action time of the corresponding slot, as if it were a pre-scheduled message. Because of that, they can also be used for clock synchronization.

4. SYSTEM PROTOTYPING AND PRACTICAL RESULTS

The proposed platform was validated by means of prototyping. A cluster consisting of 3 nodes and a serial twisted cable that interconnects them was implemented using Xilinx Spartan2E development boards. Successful synthesis of the CASCA controller was achieved, resulting in a low utilization of programmable resources, as indicated in table 1.

First, in order to verify the correctness of the bit stream processor, the internal transmit flag was tied to TRUE in the VHDL code to force successive collisions. The bitwise arbitration can be clearly distinguished in figure 5, which shows a scope image of the TxD signals at two nodes. Note the acknowledgment pulse at the receiver, signaling that the frame was correctly processed.

Fig. 5. CAN Bitwise Arbitration

Thereafter, each CASCA controller was configured with the same cluster cycle consisting of 5 TDMA slots and one single TDMA round pattern with a total duration of 2ms. The macrotick length, that is, the precision of the global time, was set to 1us considering oscillator drifts of $10^{-5}$ and a maximum synchronization interval equal to the duration of a TDMA round. Figure 6 shows stable communication just after the startup phase, in which all three nodes have participated.

Fig. 6. Synchronized Operation

5. CONCLUDING REMARKS
This paper discussed some issues in the development of dependable distributed systems and presented CASCA, an optimized time-triggered communication platform intended to provide support for the design of safety functions at low cost. Correct operation of protocol behavior and clock synchronization was validated by means of prototyping using FPGA devices. It was shown with practical results that there is no need to rely on a centralized time source, either at the startup phase or during normal operation.
As future work, the authors are involved in the development of synthesis and modelling tools aiming to facilitate the mapping from the system functional specification to the hardware platform, considering non-functional fault-tolerance requirements. The design of bus guardian devices is also being considered in order to enhance the reliability of the communication architecture itself.

REFERENCES

Almeida, Luís, P. Pedreiras and J. Fonseca (2002). The FTT-CAN protocol: Why and how. IEEE Transactions on Industrial Electronics 49(6), 1198–1201.
Avizienis, Algirdas (1995). Dependable computing depends on structured fault tolerance. Proceedings of the Sixth International Symposium on Software Reliability Engineering pp. 158–168.
Bowles, John B. (1992). A Survey of Reliability Prediction Procedures for Microelectronic Devices. IEEE Transactions on Reliability.
Bridal, Olof, Rolf Snedsbøl and Lars-Åke Johansson (1998). On the design of communication protocols for safety-critical automotive applications. Proceedings of the IEEE 44th Vehicular Technology Conference 34, 1098–1102.
Dilger, Elmar, Thomas Führer, Bernd Müller and Stefan Poledna (1997). The x-by-wire concept – time triggered information exchange and fail silence support by new system services. Society of Automotive Engineers.
Durante, Lucas and Adriano Valenzano (1999). On the Performance of the IEC 61158 Fieldbus. Computer Standards and Interfaces 21(3), 241–250.
FlexRay (2004). FlexRay Communications System Protocol Specification Version 2.0. FlexRay Consortium.
Fuhrer, Thomas, Bernd Müller, Werner Dieterle, Florian Hartwich, Robert Hugel, Michael Walther and Robert Bosch GmbH (2000). Time triggered communication on CAN. Proceedings of the 7th International CAN Conference.
Kopetz, Hermann and Günther Bauer (2003). The time-triggered architecture. In: IEEE. Vol. 91. pp. 112–126.
Lonn, Henrik (1999). The fault tolerant daisy chain clock synchronization algorithm. Research report. Chalmers University of Technology.
Rushby, John (2001). Bus architectures for safety-critical embedded systems. In: EMSOFT.
Srinivasan, Jayakanth and Kristina Lundqvist (2002). Real-time architecture analysis: A COTS perspective. Proceedings of the 21st Digital Avionics Systems Conference pp. 5D4-1–5D4-9.
Welch, Jennifer Lindelius and Nancy Lynch (1988). A new fault-tolerant algorithm for clock synchronization. Inf. Comput. 77(1), 1–36.