Copyright © IFAC 9th Triennial World Congress
Budapest, Hungary, 1984

FAULT-TOLERANT AND FAIL-SAFE MICROCOMPUTER SYSTEMS FOR MODERN AUTOMATIC PROCESS CONTROL BY REDUNDANCY

E. Schrodi

Produktionsautomatisierung und Automatisierungssysteme, Siemens AG, D-7500 Karlsruhe, Federal Republic of Germany
Abstract. Fault-tolerant automation subsystems have to be added to a modern distributed process control system in order to fulfill the most stringent requirements for reliability and safety of process control. The paper describes design features of, and experience with, both a new 1-out-of-2 hot-standby system and a new 2-out-of-3 system, each operating in a fully synchronized mode. The subsystems have the same scope of functions as a basic non-redundant system and are suitable for non-stop process control. Operator communication, observation and configuration can be carried out either from a central control room via a bus system and special operating subsystems, or from a directly coupled local operating station.

Keywords. Computer hardware; digital computers; industrial control; microprocessors; process control; redundancy; fault-tolerant system.

INTRODUCTION
Modern high-capability computers and process control systems are very powerful but also relatively complex. In spite of all the measures taken during development and manufacturing to ensure a high level of reliability of microcomputer-based control systems, the probability of faults or hardware defects is not zero. There is therefore a need for systems with additional hardware redundancy to satisfy applications that require extremely high system reliability or security. Modern automatic process control in particular offers many applications of microcomputer systems that call for high reliability and availability, and even for fail-safe and non-stop features. Such systems have to fulfill very special conditions, because not only material assets but often also human life depend on their fault-free operation. This paper points out some special conditions imposed on redundant process control systems and outlines the fundamental redundancy concepts. Furthermore, some new development results are presented.

REDUNDANCY AND SYSTEM HIERARCHY

First, some words on the use of redundancy in the hierarchical structure of a modern process control system. Current practice shows that there is no longer one centralized process computer; instead, more and more decentralized mini- or microcomputers each operate and monitor a separate part of a large process (e.g. a power station). These microcomputers, called subsystems, are linked via a bus system in order to communicate with each other, with one or more high-level operating systems, and even with process computers or data processing systems.

An important point is that the higher the level in this hierarchical structure, the more complex the data processing and the information state, but the lower the degree of desired or required reliability and safety. The highest degree of reliability therefore has to be introduced on the lowest level of the overall process control system, here called the automation subsystems.

These microprocessor-based automation subsystems can be divided into three groups: first, the operating systems; second, the communication or bus systems; and third, the subsystems with automation functions for open or closed loop control, computation and optimization. This three-level structure close to the process calls for the highest requirements in availability, reliability and safety. According to these three levels, hardware redundancy can be placed at any of them:
- first, more than one operating system can be used;
- second, the communication network can be realized in a completely decentralized way without any central master or control module, but perhaps with a flying-master principle and a hot-standby data link;
- third, the automation subsystems can be designed in a redundant, fault-tolerant or fail-safe way.

All these features are included in the Siemens decentralized automatic process control system TELEPERM M. The task of this paper is to show some features of the redundant automation subsystems.

REQUIREMENTS FOR REDUNDANT PROCESS CONTROL SYSTEMS

Redundant systems need the management and control of active and spare units, the detection and location of defects and faulty modules, and, last but not least, bumpless switchover to fault-free standby spares after the isolation of faulty units. All these actions must be carried out during normal process operation without any disturbance or breakdown of the online mode and without any effect on the system's performance or throughput. The main aspects are therefore the detection and location of faults and hardware defects, followed by the reaction to a fault (reconfiguration).
These three topics are essential; all measures taken in designing a redundant system have to be chosen with regard to detection, location and reconfiguration, because they determine the effectiveness and performance of the redundancy. Some requirements therefore are:
- Any fault in the system hardware must be detected and located quickly, within a few microseconds or milliseconds.
- The standby hardware must be completely updated and possess the latest information from both operator and process inputs; all data sets have to be valid in order to ensure a bumpless switchover to a standby spare in a logical sense.
- The reconfiguration time (the time from locating a fault to the fault-free operation of the online tasks) has to be negligible in order to ensure a bumpless switchover in a temporal sense.

Only when all these conditions are met is uninterrupted process control possible, even in the case of fast open-loop or closed-loop control tasks. Further conditions have to be fulfilled with regard to the system's monitoring and operating behaviour and its throughput, namely:
- A reconfiguration must not change the way the user operates the system; operating and monitoring have to remain homogeneous even after a module has failed.
- The processing of online tasks must not be burdened with a heavy additional load, so that a high throughput is achieved.
- All spare units must be checked while they are in their non-operating mode.

To fulfill these requirements, all known redundancy concepts have to be examined; the next section does this briefly.

SOME DIFFERENT REDUNDANCY CONCEPTS

The development of our TELEPERM M redundant subsystems is based on an analysis of various redundancy concepts and their suitability for process automation functions. The well-known redundancy structures are:
- synchronous and asynchronous standby systems in 1-out-of-2 mode,
- asynchronous 1-for-n floating standby systems and asynchronous systems with graceful degradation,
which represent the systems with a dynamic form of redundancy. In addition, there are systems with static (or masking) redundancy, such as
- the 2-out-of-3 or N-out-of-M systems with M ≥ 3, N ≥ 2.
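The difference between these structures can be illustrated with the textbook binomial reliability model. The following is a minimal sketch and not part of the original analysis: it assumes independent, identical channels, each working with probability r, and ideal (fault-free) voting, detection and switching hardware. The function name `n_out_of_m_reliability` is chosen here for illustration only.

```python
from math import comb


def n_out_of_m_reliability(n: int, m: int, r: float) -> float:
    """Probability that at least n of m identical channels are working.

    Assumptions (not from the paper): channels fail independently, each
    works with probability r, and voting/switching hardware never fails.
    """
    if not (0 <= n <= m):
        raise ValueError("need 0 <= n <= m")
    if not (0.0 <= r <= 1.0):
        raise ValueError("r must be a probability")
    # Sum the binomial probabilities of exactly k working channels, k = n..m.
    return sum(comb(m, k) * r**k * (1.0 - r) ** (m - k) for k in range(n, m + 1))


if __name__ == "__main__":
    r = 0.9  # illustrative channel reliability, not a measured value
    for n, m in [(1, 1), (1, 2), (2, 3)]:
        print(f"{n}-out-of-{m}: {n_out_of_m_reliability(n, m, r):.4f}")
```

Under these idealized assumptions the 1-out-of-2 figure presupposes that every fault is detected and located so that the healthy unit can take over, whereas the 2-out-of-3 figure needs no fault location at all because the majority masks the faulty channel.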
The last-mentioned systems are relatively expensive in hardware and require at least three identical subsystems, so that this method is used in safety-oriented applications (e.g. automation systems requiring approval by authorities, such as burner controls in conventional power stations or reactor monitoring systems in nuclear power stations). On the other hand, systems with dynamic redundancy (usually two subsystems) provide the user with an increase in the availability of the required functions within the automation system at a reasonable cost.

An evaluation of redundant configurations shows the critical points of all asynchronous systems. These are:
- no devices for very fast detection of all hardware faults,
- no devices for preserving a complete set of current process and operator data in the standby system,
- no capability of continuing operation after switchover to the standby spares in a bumpless manner (i.e. in a 1-out-of-2 system, with the current software instruction in sequence).

It was found that the fully synchronized operating mode, even though it requires slightly more hardware, is the one that fulfills most of the requirements mentioned in the section on requirements. A synchronous configuration was therefore chosen as the fundamental structure for the design of our 1-out-of-2 subsystem as well as our 2-out-of-3 subsystem.

DESIGN FEATURES OF REALISED HOT-STANDBY AND FAIL-SAFE SYSTEMS

The automation subsystems AS 220H and AS 220HF are members of the decentralized Siemens process control system TELEPERM M. Both are based on the well-known non-redundant subsystem AS 220 for optimal open and closed loop control and have the same scope of functions. The AS 220H is a fully synchronized 1-out-of-2 system for high-reliability applications; the AS 220HF is a fully synchronized 2-out-of-3 system for high-reliability and fail-safe applications (e.g. for burner control), which is to be licensed by a German TÜV authority. Both are designed for non-stop operation; operating, communication, and configuration of the user's software can be carried out either from a central control room using special operating subsystems via a redundant bus system, or from a local control station including CRT, keyboards, printers, and floppy disc units.

The Fault-Tolerant Hot-Standby System

Basic configuration. In this hot-standby configuration both central units operate in clock and instruction synchronism (Fig. 1). Depending on the system status, one of them is connected in a 1-out-of-2 configuration via a redundant switching unit to the single-channel process peripherals. Operator entries and process data are branched to both central units. Although the output data are processed by both subsystems, they are linked to the process only by the active subsystem. The system status is determined by the switching units in conjunction with the operating system, so that either subsystem may be operating in master, standby or passive mode.
In the master status, the subsystem actively controls the output modules, the bus and the operator console. In the standby status, the subsystem operates fault-free and is ready to take over control in case of a failure of the subsystem which is in the master mode. The passive status means that the subsystem is disconnected from the input/output modules for testing, repairs or maintenance. It cannot take over control in case of failure of the other subsystem. A fourth status is the backup status, which occurs for a short time during commissioning or after repair of a defective subsystem. In the backup state, a passive system is transferred to the standby state (automatic synchronization) without disturbing the master subsystem in its online process control function.

Synchronous operation enables assignment of either of the two subsystems to the desired operating status, i.e. either master/standby or standby/master. All data is fed into both subsystems so that they are always automatically updated. Online backup data transmission between the two central units is not necessary.

Synchronous operation of the AS 220H subsystem allows a very effective fault-finding strategy, namely the complete separation of fault detection from fault localization. A comparator unit provides fast detection of all faults; it compares each bit on all bus signal lines of both subsystems and signals any deviations. This is only possible because both subsystems operate in synchronism. Thus, any fault which can affect the system bus is detected within a few microseconds. The proper fault detection function of the comparator unit is checked by cyclically simulating deviations on the bus lines.

Fault localization, which begins after fault detection, is based on suitable hardware combined with software self-diagnostic routines. For likely failures, e.g. failure of the power supply, failure of the clock pulse, parity error in system memory, delay in acknowledgement, processor stop, failure of the bus interface and failure of the watchdog, the hardware initiates disconnection of the defective subsystem simultaneously with error detection. The self-diagnostic programs are only activated in the event that the hardware localization measures do not succeed. They perform, amongst others, tests of memory, processor, floating point arithmetic and stop unit as well as tests on the central busses and peripherals.

The complete separation of fault detection from fault localization made it possible to dispense with complex and time-consuming online testing by cyclically triggered self-testing routines. The load on the system was thus reduced, and the time available for user software is not limited by self-diagnostic functions.

Once the AS 220H dual system has detected and localized the fault, it immediately initiates decoupling of the defective subsystem from the input/output modules. This can be the case for the master system as well as for the standby system. In the first case, a changeover to the standby system is effected. This system now operates as the master and takes over all the functions at the correct instruction points and with fully updated data. Process control is not delayed or interrupted by reconfiguration or restarting. In the second case, the standby system is switched over to the passive mode, i.e.
it is tagged as being faulty and is not available if the master subsystem fails, too. The master subsystem itself is not affected by this action. If faults cannot be localized (e.g. sporadic or intermittent faults), the standby subsystem is switched over to the passive mode and updating with automatic synchronizing is started.

The freedom to select master or standby mode is utilized by the operating system to switch over from master/standby to standby/master and vice versa at fixed intervals, e.g. at intervals of minutes or hours. By this means, the standby system is handed control (master status), and all data channels and modules which are not activated during the standby mode are checked for proper functioning. If the check is positive, the system continues operation in the master status; otherwise the defect is detected and the system is decoupled. This function provides additional monitoring of all components in the standby subsystem. It is an important feature of the AS 220H subsystem and provides protection against undetected failure of the switching units, which would otherwise only become apparent when the master subsystem failed.

Design Structure. The new redundant subsystem is designed around the basic AS 220 subsystem and incorporates symmetrical operation of the dual system and duplicated hardware. An important feature is the physical separation of the two subsystems, with only a few clearly defined interfaces in the form of front connectors with flat ribbon cables. The basic unit of the AS 220H subsystem is accommodated in two double-height subracks of the packaging system used. The two central processing units are accommodated in the basic unit, whereas the extension subrack accommodates two power supply modules, two changeover modules and seven input/output modules. The standard TELEPERM M cabinet can accommodate two further extension units, so that a maximum of 35 input/output modules can be connected.

The central processing units (a microprogrammed 16-bit system with floating point arithmetic for closed loop control and a separate bit handling unit for fast open loop control), the memory modules and the power supply modules, as well as the interface modules for the input/output bus, for the bus system, for the mini floppy unit, for the CRT and for the serial interface, are duplicated.

Synchronizing, comparator and switching units are necessary for symmetrical operation of the dual subsystem. The synchronizing unit provides for synchronization of the clock pulse, of all interrupt requests and of the read, write and acknowledge signals. The synchronizing units of the two subsystems are connected by means of two 16-way front connectors and a flat ribbon cable; they serve to keep their own subsystem operational in case of failure or faults in the partner module.

Both switching units are responsible for linking the input/output busses as well as for feeding the process data to both subsystems. Furthermore, each switching unit has a fault and system status logic as well as a master/standby control, and organizes the behaviour of the system in conjunction with the operating system. The cyclic master/standby changeover is used to monitor the functioning of these modules, too.

The comparator unit is the only central component of the dual system. It consists of two sandwich-type modules which are interconnected via two 60-way front cables. It contains the actual bus signal comparator as well as a backup control logic, together with the corresponding data transmission equipment for the backup phase during commissioning (automatic synchronization of both subsystems). This ensures that a fault or failure even here does not affect both subsystems simultaneously.
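The interplay of bus comparison, decoupling and changeover described for the AS 220H can be summarized in a small model. This is a minimal sketch under stated assumptions, not the actual hardware or operating system logic: the names `Role`, `bus_deviation`, `DualSystem` and its methods are hypothetical, the two subsystems are simply numbered 0 and 1, and the short-lived backup status is collapsed into a single `resynchronize` step.

```python
from enum import Enum


class Role(Enum):
    MASTER = "master"
    STANDBY = "standby"
    PASSIVE = "passive"


def bus_deviation(word_a: int, word_b: int) -> int:
    """Return a bit mask of the bus lines on which the two subsystems differ.

    A non-zero result corresponds to the comparator signalling a deviation.
    """
    return word_a ^ word_b


class DualSystem:
    """Toy model of two synchronized subsystems (hypothetical indices 0 and 1)."""

    def __init__(self) -> None:
        self.roles = [Role.MASTER, Role.STANDBY]

    def on_localized_fault(self, faulty: int) -> None:
        """Decouple the faulty subsystem; the standby takes over if needed."""
        partner = 1 - faulty
        if self.roles[faulty] is Role.MASTER:
            if self.roles[partner] is not Role.STANDBY:
                raise RuntimeError("no standby available to take over")
            self.roles[partner] = Role.MASTER
        self.roles[faulty] = Role.PASSIVE

    def on_unlocalized_fault(self) -> None:
        """Fault detected but not located: the standby side goes passive."""
        for i, role in enumerate(self.roles):
            if role is Role.STANDBY:
                self.roles[i] = Role.PASSIVE

    def cyclic_changeover(self) -> None:
        """Swap master and standby, as done at fixed intervals."""
        if set(self.roles) == {Role.MASTER, Role.STANDBY}:
            self.roles.reverse()

    def resynchronize(self, unit: int) -> None:
        """Automatic synchronization finished: passive unit becomes standby."""
        if self.roles[unit] is Role.PASSIVE:
            self.roles[unit] = Role.STANDBY


if __name__ == "__main__":
    system = DualSystem()
    if bus_deviation(0b1010_0001, 0b1010_0011):  # comparator sees a mismatch
        system.on_localized_fault(0)             # assume unit 0 was located
    print([role.value for role in system.roles])
```

The sketch keeps detection (`bus_deviation`) apart from localization (the caller decides which unit is faulty), mirroring the separation of the two functions in the text.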
The failure of the comparator module can be equated to the failure of the standby system: the standby marking is erased and the original master subsystem continues to operate in a simplex mode like a non-redundant AS 220 subsystem. Thus, each of the subsystems is capable of fulfilling all its functions even when the comparator fails. This is of importance for maintenance and repairs.

Operator Communication and Observation. The type of configuration, operator communication and observation is identical to that of the AS 220 basic subsystem. It does not change even in case of a change of status caused by internal faults. The system can function under all operating conditions either as a bus-coupled installation with central operator communication and monitoring, using the bus subsystem together with the operator communication and observation subsystem, or as a stand-alone system with direct operator communication and monitoring (Fig. 2).

Each AS 220H subsystem is connected to the bus line through two bus interface modules which have identical addresses on one and the same redundant bus. Only the interface module of the master operates actively on the bus, whereas that of the standby subsystem carries out internal test routines. Here also, cyclic changeover from master to standby and vice versa causes activation and thus monitoring of both bus interface modules. The operator does not notice the redundant configuration of the complete installation, as he communicates via an operating subsystem; there is thus no difference compared to operator communication with the non-redundant AS 220 subsystem.

A signal distribution unit is provided for connection of either one or two local operator consoles with the process keyboard, configuration keyboard, log printer, floppy disc unit, and CRT. If two operator consoles are connected, one of them is active, the other being in the standby mode. Changeover to the second operator console has to be effected manually by the operator.

Maintenance, Repairs and Testing in Online Operation. The system enables maintenance and repairs without disturbing process control. For this purpose, the subsystem concerned, regardless of whether it is in the master or standby status, is disconnected from synchronous operation. The partner system is assigned, or automatically takes over, active control of the process, i.e. master status. The blocked, and thus passive, subsystem now operates fully independently, but without the input/output modules, which are required by the master subsystem. This simplex operating mode with two independent subsystems furthermore offers the possibility of structuring new user software or of loading, testing and trying out new software on the process. This flexibility avoids undesirable interruptions in process control during modifications to the configuration of the automation system.

The changeover from the passive status of a subsystem to the standby status is initiated by the operator and completed without any effect on the online operation of the master subsystem. The final stage in this process is automatic synchronization. The second subsystem is then again in the standby mode and is ready to operate as the master if required.

The Fault-Tolerant 2-out-of-3 System

This highly reliable, fail-safe-oriented subsystem in a 2-out-of-3 mode was designed for security applications in chemical, petrochemical and power station plants, e.g. for burner control systems. The licensing procedure with German authorities, such as the Technischer Überwachungsverein (TÜV), is currently under way; the AS 220HF subsystem furthermore fulfills a number of guidelines and standards, e.g. for oil and gas burners.

Basic Configuration. Similar to the hot-standby system AS 220H, the 2-out-of-3 configuration uses three central units operating in clock and instruction synchronism.
All input data are branched to each subsystem; the output data are voted in a 2-out-of-3 scheme and passed to the peripheral modules in different peripheral units (Fig. 3). Furthermore, there is an internal total voting mechanism: any communication between the CPU modules and their memory modules operates in a bidirectional 2-out-of-3 mode, too. This holds for memory read data as well as for memory write data and enables the system architecture to tolerate more than one hardware defect, provided that the same module has not failed in different subsystems: e.g. a faulty CPU in subsystem 1, a faulty random access memory in subsystem 2 and additionally a faulty read-only memory in subsystem 3 can be tolerated at the same time (Fig. 4).

Safety Applications. The independence of the different peripheral units, each with its own 2-out-of-3 voting module, confines faults in an input/output module to its own peripheral unit. No single fault, either in the central units or in one of the peripheral units, can damage the overall system. This feature is very important for safety applications: in a burner control system, for example, each burner is controlled by input/output modules in its own peripheral unit, so any single fault causes the breakdown of only a single burner. For steam boilers with eight or more burners, this breakdown does not affect the available boiler performance. Electrical safety circuits are attached to input/output modules in two or more different peripheral units. Here too, no single failure leads to a dangerous process operating state: a switching relay can always be disconnected via a second channel.

Design Structure. The AS 220HF is also designed around a basic AS 220 subsystem. It incorporates a triplex central unit, each channel with CPU, memory modules, power supply modules, and interface modules for the input/output bus, for the mini floppy unit, for the CRT and for the serial interface. Additional components here are the clock generator unit, the synchronizer unit and the voting units for the internal central data busses as well as for the peripheral data busses.

Operator Communication, Maintenance and Repairs. The operator communication console with CRT, process and configuration keyboard, log printer and mini floppy unit is connected via an operator interface. Maintenance and repairs of one of the central units can be carried out without any effect on the others; the unit concerned can be switched to operate in an independent mode. The remaining two central units carry out their online tasks, masking the third, independent central unit by means of their 2-out-of-3 voting.

CONCLUSIONS

The redundant automation systems discussed here for highly reliable, non-stop and fail-safe applications are an important addition to a modern overall process automation system. The user can thus plan redundancy to meet the requirements of the particular process control application. Both of the redundant systems presented can operate as subsystems linked to others via a bus system as well as stand-alone systems, hiding their redundancy behind the operator interface.
Fig. 1. Basic configuration of the AS 220H redundant automation subsystem. (Figure labels: CS 275 bus subsystem; standard I/O units, monitor, printer, process keyboard, configuration keyboard.)

Fig. 2. Direct or central operator communication with the AS 220H system. (Figure labels: central operator communication, observation and configuration with basic units OS 250 and OS 251, OS 252, colour VDU, light pen, mini floppy, configuration keyboard, process keyboard and printer; CS 275 bus subsystem; basic units AS 220H, AS 230 and AS 220; direct operator communication, observation and configuration with black-and-white VDU, mini floppy unit, configuration keyboard, process keyboard and printer.)
Fig. 3. Basic configuration of the AS 220HF 2-out-of-3 system. (Figure labels: standard I/O devices; basic units (BU) AS 220HF, BU I, BU II; I/O buses; extension units; security circuits.)

Fig. 4. The total 2-out-of-3 voting mechanism of the AS 220HF system.
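The voting principle of Fig. 4 can be illustrated by a bitwise majority function applied to every bus transfer. The following is a minimal sketch, not the AS 220HF voting hardware: the function names `vote_2oo3` and `disagreeing_channels` and the example data words are assumptions made for illustration.

```python
from typing import List


def vote_2oo3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three words: each output bit agrees with at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_channels(a: int, b: int, c: int) -> List[int]:
    """Indices (0, 1, 2) of the channels whose word differs from the voted result."""
    voted = vote_2oo3(a, b, c)
    return [i for i, word in enumerate((a, b, c)) if word != voted]


if __name__ == "__main__":
    good = 0x12A4  # illustrative data word
    bad = 0xFFFF   # illustrative corrupted word
    # A different module is faulty in each subsystem, yet every voted
    # transfer still delivers the correct word to all three channels.
    rom_read = vote_2oo3(good, good, bad)   # faulty ROM in subsystem 3
    ram_read = vote_2oo3(good, bad, good)   # faulty RAM in subsystem 2
    cpu_write = vote_2oo3(bad, good, good)  # faulty CPU in subsystem 1
    print(hex(rom_read), hex(ram_read), hex(cpu_write))
    print(disagreeing_channels(good, good, bad))
```

Because the vote is taken separately on each transfer, faults in different module types of different subsystems are masked independently, which corresponds to the multiple-defect tolerance described in the text; the sketch does not model simultaneous failures of the same module in two subsystems, which the scheme cannot mask.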