A MODULAR FAULT-TOLERANT MULTIPLE PROCESSOR FOR FRONT-END PROCESS CONTROL

K. Chmillon, J. Heger and H. Herzberger

BBC Brown Boveri & Company, Research Centre, D-6900 Heidelberg, FRG

Abstract: Distributed control systems inherently have the quality of higher availability compared to centralized systems. If this degree of availability is still insufficient for more complex and critical processes, fault-tolerant computing nodes are needed.

In the front-end layer of process control systems a large number of such nodes is used to perform fast control algorithms. Therefore cost-optimal solutions for fault tolerance are required in this field. The architecture of "SAFIR"* was developed for the application field mentioned above. It allows modularly configurable fault tolerance and supports an application oriented programming language. The main goal of the project is nearly unlimited availability. This is accomplished by quickly replacing defective hardware modules in the running system ("on-line repair").

Keywords: process control; application oriented language; AOL; duplex redundancy; fault tolerance; availability; software transparency; on line replacement; on line repair.

In the past duplication of critical system components had to be explicitly formulated within the application program. The use of fault-tolerant components frees the programmer/engineer from having to deal with the management of the redundant components.

FAULT TOLERANCE IN DISTRIBUTED PROCESS CONTROL SYSTEMS

Industrial processes depend more and more on integrated computer systems. A natural result of this is the increasing requirement for features such as safety, high availability, data consistency or even non-stop capability.

In medium and large scale processes (i.e. power plants, energy distribution, steel mills) the total computing power needed for monitoring, controlling, optimization and safeguarding is usually distributed in layers according to the structure of the process.

Process control systems structured this way have an inherent high availability due to their quasi-autonomous functional partitioning. In addition it is possible to realize the layer-specific functional requirements in a cost-optimal way.

This degree of availability may still be insufficient for complex and critical processes, i.e. undisturbed and safe continuation of the control functions is required even in the presence of faults in the controlling system.

In the front-end layer of such systems a large number of microprocessor-based computing modules is used. These modules are quasi-autonomous and execute fast algorithms for process control. It is very important to have economical solutions for fault tolerance in this field.

These are the reasons why Brown Boveri Research started a project to evaluate and realize an appropriate architecture. The following requirements were deemed as most important:
- a modular fault-tolerant multiple processor system programmable in an application oriented language (AOL) for control,
- configuration of fault-tolerant and non-fault-tolerant systems with the same components,

*SAFIR - Safety by Alternated Failure-test and Internal Replacement


- minimal additional hardware cost-factor for fault-tolerance,
- upgrading of non-fault-tolerant to fault-tolerant systems by only plugging in additional (duplicated) processor modules,
- software transparency of fault-tolerance,
- self-detection of errors and status-monitoring, both locally and from outside,
- on-line replacement of faulty components with automatic synchronisation and updating.

The system that has been developed to fulfil these requirements is called 'SAFIR'. The fault-tolerance concept is based on special characteristics of the application oriented language to be used.

PROGRAMMING THE SAFIR

Application Oriented Programming Languages (AOL) allow the user to solve problems using the well-known terminology and representation of his special application field. In the field of process control so-called 'functionblocks' and their connections are the basic elements of the AOL. A functionblock represents an AND-function, a PID-algorithm etc. The connections are made between the inputs and outputs of different functionblocks. Programming can be supported by graphical editors, which allow an interactive CAD-like design of the application program (see Fig. 1). The editor output is the source for a compiler and can be used in further editing sessions. The compiler produces an 'intermediate code', which is a sequential list of parametrized subprogram calls. Every subprogram is contained in a library and represents a 'functionblock'.

The intermediate code is executed in the technique of threaded code. The subprograms are implemented in the processor's native code. The last action of each functionblock is a jump to the next functionblock defined in the intermediate code. Special conditional blocks decide between alternative successor blocks. Sequences of functionblocks are bound together and build 'tasks'. These 'tasks' can be executed in parallel or quasi-parallel. The tasks are cyclic but they differ in their priority and their scheduling: event-driven, time-driven, or permanently rescheduled. The program structure described above allows all mechanisms of fault-tolerance handling to be implemented below the intermediate-code layer. Therefore the application programmer need not care about fault tolerance ("user transparency").

THE SAFIR-ARCHITECTURE

The primary goal of fault tolerance is not to avoid 'failures' but to ensure correct functionality even in the presence of failures. In all cases redundancy is required for each system resource under consideration. Redundancy can be achieved through: structural or static redundancy, functional or dynamic redundancy, time domain redundancy, or a mixture of these fundamental types of redundancy.

Furthermore it is necessary to provide an architecture which handles redundancy with the global result of 'fault-tolerance'.

Three basic functions are required:
- detection of malfunction -> error detection
- discovery of the defective hardware module -> error localization
- instantaneous isolation of the defective module, on-line replacement, automatic initialization and actualization of the replaced module -> error correction

Fig. 1. Example of a function plan; connection names are automatically generated

These functions may be implemented in hardware and/or software. It is essential that the basic functions are executed correctly, independent of any type of failure and its system-wide consequences. A further essential demand is physical and logical operation of all system components with no back-effect on each other. Even in a pure software implementation of fault-tolerance mechanisms there are critical parts which are assumed to be failure-free ('hardcores'). Since these parts cannot be completely avoided they have to be minimized. The functionality of these hardcores must be controlled continuously by test routines running in the background of the system.

The qualitative requirements mentioned above determine 'fault-tolerance' as a functionality.

Because of the large number of components in a modular process control system, restrictive economical aspects must additionally be considered. This limits the usage of fault-tolerant architectures based on 2-of-3 or even 2-of-4 systems. The SAFIR was designed as a totally redundant and symmetric multiple processor system. It serves all requirements with an overall cost-factor of about 2.4 compared to non-fault-tolerant solutions. Nevertheless it has the characteristics of a 2-of-3 structure in the considered application field. Figure 2 shows the minimal fault-tolerant configuration consisting of Input-, Output-, and two identical Processor modules, connected with two system busses (Bus A, Bus B).


Every module in the system can be quickly disconnected from the busses by isolating switches. Additionally the modules are linked via serial support busses (bus a, bus b). Communication for error handling uses these serial busses. Output devices have simple hardware comparators (hardcore!) to compare the synchronized data from the two busses. All comparators in the system are tested periodically in background; the results are compared and logged redundantly in internal state tables. Majority decisions concerning the results of these periodical tests can be made if there are more than two output devices in the system. Processors of a fault-tolerant pair (PA, PB) are algorithm-synchronized. They operate on the system busses A and B and on the serial support busses a and b respectively. Additionally a private serial link ('umbilical cord') is used for synchronisation, for the start of error-locating routines and for the exchange of intermediate results between each pair of processors. After replacement of a defective processor module this umbilical cord is used to initialize the hardware and the operating system and to update the plugged-in processor. This updating is done in background with negligible influence on the application program ('on the fly'). To enable preventive checking of bus transceivers the processor module has two separate bus interfaces and periodically runs complete read/write cycles on its bus. Test results are mutually exchanged between paired processors and logged redundantly in internal status tables.


Fig. 2. Minimal fault-tolerant configuration using three different hardware modules.



Fig. 3. Example of a multiple processor configuration containing two fault-tolerant processor pairs (P1, P2), one single processor (P3) and a monitoring module (M)

A SAFIR system usually contains multiple processors in a mixture of fault-tolerant processor pairs and non-fault-tolerant singles (Fig. 3).

Every module reports its internal state by four LEDs on its front. The monitoring module M provides an additional facility of monitoring the system status locally and from outside via phone line. In each case there are two forms of the monitoring information:
- speech output in natural language,
- graphical form on screen, showing the configuration and the status of each module.

If an error occurs that is not located on a processor module, the system roughly goes through the following states:
- rerun of all background tests,
- exchange of the last entries in the status tables of a processor pair,
- comparison of the status tables since the last good entry by both processors,
- positive (and negative) decision about the error location,
- isolation of the defective module(s),
- activation of mechanisms to discriminate transient and permanent errors.

If there is an error on a processor module the described sequence might not be correctly executed. In this case a special error-locating routine is activated which runs exclusively on a processor pair. This routine is executed within less than 10 milliseconds. During this time the last output data are frozen. This can be regarded as an application-independent fail-safe state. In case of an error situation which cannot be overcome by isolating modules from the bus (i.e. mechanical defect of the backplane) the respective hemisphere of the duplexed system is shut down. The remaining hemisphere continues operation without any redundancy until repair is made.

LIMITS OF THE ARCHITECTURE

The fault-tolerance characteristics of SAFIR depend essentially on the structure of the software, i.e. the special features of the application oriented language and its supporting runtime system. Application software written in other programming languages may use the SAFIR architecture if it can be structured in an adequate way. Whether the resulting restrictions allow the use of the SAFIR architecture in other fields may be the object of further research.

DISCUSSION

Muller: You mentioned that you use Multibus I for your system. What modifications or additions to the standard system did you have to make to allow for the direct connection of the two processors, etc.?

Heger: On every module of the system there is a power module, and on Multibus there is only a power line, so the regulation has to be done in each module.

Dummermuth: I see the problems of fault-tolerance and error protection in two areas. One is simply checking that the hardware is doing its job, but the more important thing is to make sure that the process itself has no errors. If we use two or three processors to look at the same input and if, say, an input wire breaks, then the computers say that everything is fine! In order to protect the process we have to employ redundant input data, such as that mentioned in the paper, using, say, two contacts of which one is normally closed and the other normally open for the same event! So you are actually reading in redundant data. In my opinion, the fault-tolerance should be done at the process level, not simply at the CPU level. Experience says that about 90% of process failures occur because of physical problems such as a die breaking, a limit switch failing, a motor stalling, a part falling off a clamp, and so on. These are all process failures, not CPU failures, and the CPU actually survives most of the time. In an automated factory we have to approach the problem of process fault-detection, not CPU fault-detection, in order to keep a production line up.

Halling: Concerning fault-tolerance, we see some implementations, and we get a feeling that a lot has been done. What is the state of the art in this field?

Kopetz: We have seen, in the last few years, a number of commercial companies appearing, with good designs which have been very successful on the market, for example, STRATUS and TANDEM. Even in the area of real-time process control, a number of newcomers are providing hardware, so it is clearly moving from research into commercial products. The questions of design diversity and design faults are still open questions, and to my knowledge there is no system on the market which claims to handle design faults.

Halling: Mr Heger, are there any plans to use your system in a factory environment?

Heger: At the moment we have set up the system in our research centre, but there are some groups within the company who are interested in using the system in the future.

Frentzen: In my case, the prototype is being checked by the "Berufsgenossenschaft", and this process is still going on. Then it will take about a year to minimise the hardware. Each microcomputer consists of three modules, so you have nine modules, and the hardware costs are still too high.

Koslacz: You have three independent channels, and if one fails you stop. Decision logic is more typical, selecting two out of three, etc. What is your concept of synchronization in three independent microcomputer systems?

Frentzen: We did two things. First, we realized a fault-tolerant type of system, consisting of three identical microcomputers. In this prototype there was no time synchronization; synchronization was achieved by a comparison of the process inputs and outputs. In the second prototype, which is only fail-safe, we have synchronization of the three microcomputers by a comparison of inputs and outputs and, in addition, a comparison of user times. In order to realize fault-tolerance, we would need another process connection.

Kopetz: I would like to address the question of the relationship between reliability and safety. The system discussed in this paper puts more safety into the system, with the result that the reliability is going to be reduced, because the probability that the system will be shut down is higher than that for a single-processor system! Did you do any quantitative analysis on the reliability reduction caused by an increase in safety?

Frentzen: I think the reliability is the same as in a system with one processor.

Dummermuth: Actually, the reliability does not go down linearly with the number of parts. If you have three times as many memories and processors in the system, you have a good chance that your reliability will drop by a factor of ten. If, however, the reliability of the CPU is a hundred times higher than that of the press itself, then such a loss is, of course, only 1 or 2% of the total machine


down-time. On the other hand, if something goes wrong, whether in the press or in one of your CPUs, you don't know where it is, because any failure (in the press or in the CPU) is recognized as a failure, and you have the capability of making a safe stop.

Frentzen: In this paper I did not discuss the question of handling faults in the process, since these are user-dependent, and this is the task of the user program.

Inamoto: The reliability of hardware systems is very high, but the software, especially the operating system and the application software system, is not so reliable in comparison. We have to think about fault-tolerance to protect against software failures. Also, we must consider maintenance: the ability to expand, or to repair and test, software modules without disturbing the on-line system.

Halling: So what you are saying is that we are also facing an integration problem.

Halling: Possibly, what you are asking for is real-time on-line diagnostics for running software systems! In many industrial processes, the programmers do not use general-purpose software; instead they use relatively simple software, such as that used on PLCs. The load is known, and they do not have intensive load-switching. It seems that this is a good way to go, particularly as remarkable results are being achieved, especially when compared to normal, general-purpose systems, in which the software is often never fully checked out.

Dummermuth: You have to look at the total picture. If you want to make twice as many automobiles, you have to find what mechanical failure causes the line to stop. CPUs do generally run a hundred thousand times better, in terms of faults, than mechanical parts.

Frentzen: In our situation, if there is a change (for example driving another machine say, a robot rather than a press), then the user program is different. This user program consists of logical equations in a simple language, and must be checked explicitly.

Purkayastha: If you have an absolutely fail-safe state, then everything is all right. But if you have many possible state-dependent fail-safe states, there is the problem of getting to one of these.

Frentzen: The user must define the safe state.

Koslacz: Again, if you really have to change something, what are the rules? If, for example, a small enhancement takes place, do you really have to go back to the full procedure?

Frentzen: Yes, you have to go through all procedures again.

Sievert: What is your opinion regarding the effectiveness of diverse programming? This assumes that failure occurrences are statistically independent, but there are research results showing that there are typical programming failures, and the probability that they will occur in more than one program is high.

Halling: This is probably the state of the art. In Germany we have the TÜV operation; this group has great difficulties in understanding questions about the reliability of software, and of systems which contain computers plus software, etc. Whenever you have to get into contact with them, there is a very complex and difficult procedure to pass, because they themselves do not have strict rules - they are not used to this kind of environment. They are used to mechanical devices and so on. This situation may change, but it's now very difficult to get approval for a system to be declared safe.

Dummermuth: Unfortunately, because of the laws and the claims that are made, if somebody gets a hand cut off, it is always claimed that the CPU supplier is at fault! So the CPU supplier says, "I'll triple my circuit, and then you cannot blame me anymore!" But this doesn't really help a lot to increase production by keeping the production line running!

Kopetz: The assumption that two separate versions are statistically independent is definitely rejected. There is a correlation between two versions, even if they have been produced by different teams on different computers, with different programming languages. A number of extensive experiments have confirmed this. The question is, "To what extent? How big is the correlation, and what is the improvement in fault coverage obtained by two different versions?" There is evidence of significant improvement! This has also been shown by experiments and observations on systems which use diversity in real operations. An experiment has been done in Newcastle, and they give some figures on their improvement. Of course, it depends on the form and the complexity of the specifications, but it is definitely not independent, and there definitely is an improvement! There have been a number of systems installed world-wide which use design diversity in operational systems; the Swedish railway system, the flight control system on the Airbus, parts of the flight control systems on the Boeing, and on some McDonnell-Douglas systems, for example.

Rodd: Just a comment to end this session: many papers about DCCS receive attention because the subject is "fun", and most engineers are simply boys that have grown up and have toys that are bigger than those they had earlier! Just listening to this discussion, I feel we are tackling a "fun" area: it really is fun to design hardware, because we know how to do it, we know what the problems are, and we can actually solve those problems! The worry is that we are not actually tackling the real issues. Professor Kopetz's comments are one hundred percent right. If you do a full reliability analysis, then you often find that we are wasting our time building TMR computers, since for every failure of a single processor, the plant is failing a thousand times! For every failure of the hardware of a microprocessor, the software is failing a hundred times because of specification errors!