Systems Design and Interfaces
FUNCTIONAL REDUNDANCY TO ACHIEVE HIGH RELIABILITY

M. H. Gilbert and W. J. Quirk
Atomic Energy Research Establishment, Harwell, Oxon, England
Abstract. In conventional protection and control systems, hardware is dedicated to performing a given function, and to protect against failure this hardware has to be replicated. There is a trend to continue this practice into the field of computer-based protection systems. However, in such systems a single piece of hardware is responsible for many different functions, and redundancy at this level may not be the most effective way of achieving the desired high integrity of the resultant system. Furthermore, such an approach ignores the capability of computer-based systems to accomplish a large amount of self-checking. The important aspect of total system integrity is not the hardware but the functional integrity of the processes supported by the hardware. A fault in a processor can be looked upon as causing a change in the transfer functions of the processes it is supporting. By starting from a proper functional specification, it is possible to produce a system whose overall transfer function remains unchanged, or at least acceptable, in the face of a limited number of changed functional units. Thus the system integrity is maintained even if some of the processes do not implement the function specified - i.e. contain design faults. We explore a suitable specification technique in this paper, showing how it is possible to design such a system, and also how the individual functional units can be mapped onto the available processors in a decentralised way. A suitable architecture for such a system is discussed, based on a broadly conceived capability structure. Within this architecture, no process is statically assigned to a given processor element, and the overall system behaviour is unaffected by the failure of a specified number of processors.

Keywords. Computer Architecture, Multiprocessing Systems, Reliability Theory, Special Purpose Computers, System Failure and Recovery, System Integrity, Functional Redundancy.
INTRODUCTION
The problem to be addressed is simple to state and understand; it concerns overkill, how to recognise it and how to prevent it. In a conventional hardware-based safety or control system, there is a more-or-less 1-1 correspondence between 'recognisable hardware boxes' or processing units and 'functions to be performed' or functional units. On the assumption that the processing units are initially functionally correct, the only thing which causes a system malfunction is a hardware failure of some sort. Replicating the processing units and voting between the results of ab-initio identical and correct units then increases the system reliability, in that the overall system performance is unaffected by small numbers of failures in replicated sets of units.

At the risk of stating or restating the obvious, this approach to increasing system reliability does not immediately relate to computer-based systems, and this for two reasons: first, there is usually a many-to-one correspondence between functional units and processing units; second, it is much less obvious that any functional unit is correct ab initio. One might add that there may well be subtle interactions between hardware and software which reveal themselves not immediately as a hardware error, but rather as a failure of some subset of the functional units mapped onto that processing unit.
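To make the conventional replicate-and-vote scheme concrete, here is a minimal sketch in Python (our illustration, not anything from the original; the trip function and its threshold are purely hypothetical):

    from collections import Counter

    def majority_vote(outputs):
        """Return the value produced by a strict majority of the replicas."""
        value, count = Counter(outputs).most_common(1)[0]
        if 2 * count <= len(outputs):
            raise RuntimeError("no majority: too many replica failures")
        return value

    def trip(level):
        """The intended functional unit: trip if the level exceeds a threshold."""
        return level > 7.5

    # Three ab-initio identical replicas; the third has suffered a hardware
    # failure and now implements a different (inverted) transfer function.
    replicas = [trip, trip, lambda level: not trip(level)]
    print(majority_vote([f(9.0) for f in replicas]))  # -> True despite the failure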
However, this does not mean that computer-based systems are inevitably less reliable than hardware-based systems, for computers do have advantages in other respects. In particular, they offer a much greater checking ability (either self-checking or one checking another) and, because of the many-to-one correspondence between functions and processes, they allow a significant decrease in the amount of hardware needed to implement a given system, and a corresponding increase in basic system availability. These advantages will be lost, however, if the computer is treated as a standard hardware device. Not only does straight redundancy not necessarily increase system reliability, but its introduction may cause the resultant overkill to prejudice adversely the overall performance achievable by the system. Overkill implies the sub-optimal allocation of system resources and therefore endangers the system availability, while the replication of similarly 'out-of-spec' units does not produce an 'in-spec' component.
RELIABILITY, RESOURCES AND FUNCTIONAL UNITS

While not wishing to get bogged down in detailed definitions and nuances of language, we need to be clear about what we are trying to achieve, and what is at hand to help us meet the objective. A system comprises a set of functional units which interact to produce the total system function. These units should be realised at specification or early design time. Resources are available to implement these functions at run-time. They will be realised during the system design. Both these sets will usually be redundant, in that not all the functional units need to be exercised, or even to meet the specification, for the system to behave in an acceptable manner. Further, some resources may become unavailable as time proceeds, and again, within limits, this should not adversely affect the system as a whole. Thus as a working but ad hoc definition, let us define a system to be n-reliable if its overall behaviour is unaffected by the behaviour of any m of its resources or functional units, where m < n.

Now, although the visible parts of a system are the resources, the parts that really matter are the functional units. And for any particular invocation of a functional unit, it is the combination of functional unit and resource which will 'do' something. In a standard triple modular redundant system, the use of three identical blocks and three voters is seen as a way of ensuring the correct majority result in the case of any single hardware failure. That is, the system is 1-reliable. We would suggest a slightly different slant to this: that the overall function of the 6 blocks is 'unaffected' by whatever function is implemented in any one of the blocks. 'Unaffected' is in quotation marks because clearly there is some effect, but in this particular case it is the majority of the three outputs which is of concern, and the majority is truly unaffected by any one function. So we see we must modify slightly our definition of n-reliable and say that a system S is n-reliable with respect to property P if P(S') = P(S), where S' is any derivation of S with less than n+1 of its active functional units changed.

A simple example is in order, if only to clarify the above definition. Figure 1 shows in block form a system to raise an output alarm if any 3 or more of the 4 input lines are activated. Since any block could fail, we could have one output line raised spuriously or one output line not raised in an alarm situation.
[Fig. 1. A Triply-Redundant 3/4 Voter - block diagram: the four input lines a, b, c, d feed three 3/4 voter blocks F1, F2, F3, producing outputs X, Y, Z.]
Thus if P is the property '<3 active inputs imply <2 raised outputs', then the system S is 1-reliable with respect to P.
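This claim is small enough to check exhaustively. The sketch below is our illustration (the block and property names are ours); it enumerates every input pattern, every choice of failed block, and every output the failed block could produce. Note that we also check a natural converse - 3 or more active inputs raise at least 2 outputs - which the text does not state explicitly but the alarm function requires:

    from itertools import product

    def f34(a, b, c, d):
        """One 3/4 voter block: its output is raised iff >=3 inputs are active."""
        return (a + b + c + d) >= 3

    def property_P(inputs, outputs):
        """P: fewer than 3 active inputs imply fewer than 2 raised outputs
        (and, as an assumed converse, >=3 active imply >=2 raised)."""
        raised = sum(outputs)
        return raised < 2 if sum(inputs) < 3 else raised >= 2

    ok = all(
        property_P(ins, [bad if i == failed else f34(*ins) for i in range(3)])
        for ins in product([0, 1], repeat=4)   # every input pattern
        for failed in range(3)                 # every choice of failed block
        for bad in (False, True)               # every output it could produce
    )
    print("1-reliable with respect to P:", ok)  # -> True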
Before we leave this example, let us expand it slightly to show another property of the functional unit approach. If we consider the horrifying formulae:

X = ((a∧b)∧(c∨d)) ∨ ((a∨b)∧(c∧d))
Y = ((a∧c)∧(b∨d)) ∨ ((a∨c)∧(b∧d))
Z = ((a∧d)∧(b∨c)) ∨ ((a∨d)∧(b∧c))
a realisation of which is shown in figure 2, then after a fair amount of analysis we can discover that it also represents a 1-reliable system satisfying the functional requirements of figure 1. The interesting point is that there is no replication of any function in this implementation, although it does depend quite heavily on the symmetry of the situation. And because there is no replication, a software implementation would presumably be less prone to common-mode failures.
[Fig. 2. A Realisation of Figure 1 - logic diagram over the inputs a, b, c, d.]
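The 'fair amount of analysis' can be delegated to a brute-force check. This sketch (ours, not from the paper) confirms that each of the three structurally different formulae realises the same 3-out-of-4 vote, and that the majority of X, Y, Z therefore survives an arbitrary change to any one of them:

    from itertools import product

    def X(a, b, c, d): return ((a and b) and (c or d)) or ((a or b) and (c and d))
    def Y(a, b, c, d): return ((a and c) and (b or d)) or ((a or c) and (b and d))
    def Z(a, b, c, d): return ((a and d) and (b or c)) or ((a or d) and (b and c))

    def vote34(a, b, c, d):
        return (a + b + c + d) >= 3

    for ins in product([0, 1], repeat=4):
        # each structurally different formula computes the same 3/4 vote ...
        assert bool(X(*ins)) == bool(Y(*ins)) == bool(Z(*ins)) == vote34(*ins)
        # ... so with any one formula corrupted to an arbitrary value, the
        # majority of the three outputs is still the correct 3/4 vote
        for failed, bad in product(range(3), (False, True)):
            outs = [bool(f(*ins)) for f in (X, Y, Z)]
            outs[failed] = bad
            assert (sum(outs) >= 2) == vote34(*ins)

    print("figure 2 is a 1-reliable realisation of the 3/4 voter")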
In fact it is useful to stretch this definition very slightly. On one hand, if a processing unit is recognised as faulty and the system takes some action in recognition of this fact, then after this 'reconfiguration' the system may again be tolerant of a single failure. Thus a system may well be p-sequentially 1-reliable, meaning that up to p single errors may occur and the system still maintain the desired property. On the other hand, if a functional unit fails, it may fail repeatedly on each invocation. This is worse than just failing on a single invocation, and so we have a stronger requirement of reliability in time. We shall say that a system is 1*-reliable if the desired system property is maintained in the presence of a continually failing functional unit.
THE ALLOCATION OF FUNCTIONAL UNITS

The target we are trying to achieve is thus to produce a system which is both p-sequentially n-reliable and n*-reliable with respect to some safety property P. If this can be achieved at all, we need to know how to analyse 'functional' systems for failures, how to produce functional units and how to allocate them to processing units; the latter being the real hardware of the system. Again there are well-established traditional approaches, and perhaps fault trees are a good example. Can these be applied to computerised systems? The usual answer is 'no', and this is because the computer and its software offer too many different failure states. One cannot even classify these states into a tenable number of equivalent states. What is wrong, however, is not the theory but the magnitude of the analysis. Moreover, this is almost entirely due to the requirements of general-purpose machines. The capability for a particular piece of software to be able to perform complex actions has led to the design of hardware in which any piece of software has such capabilities. This last statement is a little unfair - most machines around now have at least two levels of privilege, and some have more. But clearly no single-processor system can be 1-reliable, and most microprocessors - the obvious candidates for multi-processor machines - do not have such protection features. Indeed most mini-computers have the minimum possible in hardware and then rely on software at a particular privilege level to judge whether or not a request to that level from a lower privilege level is acceptable. In contrast, many micro-based systems do not have a vestige of such protection software. Thus we see in general that the more complicated a machine - the more complex its architecture - the more failure states there are. What is needed is an architecture which reduces the number of such states.

If a system can be made 1-reliable, then since the single failure has no effect on the system, the exact nature of the failure is unimportant. In other words, the fault tree for the failure of functional units is just two-way branching. Unfortunately this has only hidden the problem, because in general a single hardware fault can fail many functional units. However, one thing which the microprocessor does allow one to do economically is to allocate a single functional unit at a time to a resource. 'At a time' is very important, for we are certainly not talking about a static design in which anything in the older-style hardwired systems is replaced by a less reliable modern counterpart. The significant point is that if there is only one function per resource at any one time, then one need not differentiate between resource failure and functional unit failure, and the fault tree analysis is now tenable. Thus we are led to the position of requiring that functional units be allocated dynamically to processing units. Furthermore, this must be done in a decentralised way or else 1-reliability cannot be achieved. Can such a system be produced?
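As a sketch of what 'one functional unit per resource at a time' might look like, the following is entirely our illustration (the ProcessorPool class and its names are hypothetical, with threads standing in for processing units):

    import threading

    class ProcessorPool:
        def __init__(self, n_processors):
            self.free = list(range(n_processors))
            self.lock = threading.Lock()

        def allocate(self, functional_unit, *args):
            """The primitive 'allocate free processor': bind one functional
            unit to one free processing unit (modelled here by a thread)."""
            with self.lock:
                if not self.free:
                    return None          # no free processor; the caller may retry
                proc = self.free.pop()

            def run():
                try:
                    functional_unit(*args)
                finally:                 # the processor rejoins the free pool
                    with self.lock:      # even if its functional unit failed
                        self.free.append(proc)

            threading.Thread(target=run).start()
            return proc

    pool = ProcessorPool(4)
    pool.allocate(print, "one functional unit per processor at a time")

Because a processing unit never hosts more than one functional unit, a resource failure is indistinguishable from, and can be treated as, the failure of the single function it was running.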
FUNDAMENTAL 1-RELIABLE SYSTEM ARCHITECTURE
Let us recall the 3 fundamental building blocks identified in the earlier work on system specification (Quirk, 1978). These blocks were essentially Clocks, Functional Units and Channels. The semantic meaning of the SPECK model necessitated the allocation of the functional units to processing units, so this concept matches very neatly onto the 1-reliable machine concept. Clocks and Channels are in reality only special types of processes, and thus these too should match quite easily. However, to achieve this, some extension of the normal functional definitions is required, because to achieve 1-reliability one must localise the effects of any error situation, and this in turn requires that any action associated with error recovery is also local. In a system which is designed to be error-tolerant and perhaps also reconfigurable, reconfiguration and error handling should be seen as a normal part of the system operation. Thus the input and output spaces of each function should be extended to include a value 'failed', and the function definition expanded to be suitably defined on these values. Furthermore, the act of 'doing nothing' ought always to be an acceptable action for a single process, and should result in the corresponding output for that process being the value 'failed'. This can be achieved by having the associated output channel initialise itself to this value at process invocation. There are other ways of coding the outputs of functional units so that failures can be recognised. Another example, often used in conventional equipment, is to use an alternating output: the boolean state true is represented not by a constant value but by a sequence such as 0,1,0,1,...

FAILURE DETECTION AND ASSOCIATED ACTION

With this very loose view of system components in mind, consider how failures could be dealt with in a 1-reliable system. It is dangerous to tread the dividing line between software failure and hardware failure, but we need some working rules for deciding how such a system should cope with errors. Self-checking of machines is quite feasible up to a point, but the problem then arises that any associated action to be taken on the self-discovery of a fault has to be taken by the failed machine. Thus there is no guarantee of this action being successfully taken. For a 1*-reliable system, this implies that any self-checking mechanism must have associated actions with only local effects. In fact any sort of checking with associated global actions leads to problems (as we shall see) when the process with such global capabilities itself faults. We shall deal with this further below. Of immediate concern is the following question: given that a fault has been detected, is it more likely to imply that the processing unit has failed, or that the functional unit has failed? If the former, the processing unit should be disabled; if the latter, it should not. It is not clear how best to make this assessment, but the 'synchronised pair' of identical microprocessors (Cooper, Quirk and Clark, 1979) does satisfy the 1-reliable criterion; since only 1 failure at a time is to be catered for, any discrepancy between the two processors (possibly after some retries) should lead to the pair being disabled. Since our processing units are rather more than just a bare processor, some similar hardware replication may be needed in the local memory, clock etc. The point to be made is that errors which are likely to be hard are the ones to detect at this level. Other software-detected errors could put the offending processor into a loop or halt, for example; this would then rely on the error being detected when the process over-ran its run-time bounds. If the channels are initialised to the 'failed' value for the process, then this value could be left there, as required. Thus the simplest possible action taken by the faulty process is acceptable to the system.

Fault annunciation is usually demanded of such systems, but we are wary of the value of such mechanisms for several reasons. In the first place, the annunciation mechanism is an added complication to the system. Secondly, engineers invariably want to run special diagnostics to trouble-shoot systems. Thirdly, a time of failure is a time of stress for the system, in that while attempting to recover or mask the failure it is less tolerant of other failures. Thus it is not the time to be worried about annunciation. This is not of central importance to this paper, but we feel that equipment with regular service periods is probably better maintained by specialised diagnostic procedures than by trying to make sense of a dated log of possibly spurious error reports.
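The two failure-coding schemes described above - a channel that initialises itself to the distinguished 'failed' value at each invocation, and an alternating output - might be sketched as follows (our own minimal illustration; all names are hypothetical):

    FAILED = object()    # the distinguished 'failed' value

    class Channel:
        """An output channel that initialises itself to 'failed' at each
        process invocation; a process which then 'does nothing' is
        automatically seen by its readers to have failed."""
        def __init__(self):
            self.value = FAILED

        def invoke(self):             # called when the writing process is allocated
            self.value = FAILED

        def write(self, v):
            self.value = v

        def read(self):
            return self.value

    class AlternatingOutput:
        """The conventional alternative: boolean 'true' is represented not
        by a constant level but by the sequence 0,1,0,1,... so that a
        stuck line cannot masquerade as a live 'true'."""
        def __init__(self):
            self.phase = 0

        def emit(self, state):
            if not state:
                return 0              # 'false' is a constant level
            self.phase ^= 1           # 'true' alternates on every emission
            return self.phase

    ch = Channel()
    ch.invoke()                       # process allocated; nothing written yet
    print(ch.read() is FAILED)        # -> True: 'doing nothing' reads as failure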
THE ANALYSIS OF SINGLE SYSTEM FAILURES

The minimum necessary organisation of such a system must include the scheduling and allocation of functional units, and the data transmission between these units. Our 'processor pool' machine must therefore have as a primitive operation 'allocate free processor' (say, to the code in a particular piece of memory). To invoke a functional unit, one must allocate a free processor to that function. The obvious candidate for the allocator is that it is itself a function allocated to a processor, and this function has the capability to allocate a free processor. One can now see how to decentralise scheduling: rather than have a continuously-running scheduler, we have a process which

i) waits for whatever time delay is required,
ii) allocates the functions it controls,
iii) allocates a copy of itself,
iv) terminates, returning its own processor to the free pool.
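One incarnation of such a scheduler might look like the following sketch (ours; `allocate` is a stand-in for the 'allocate free processor' primitive, and a generation bound is added so the demonstration terminates):

    import threading, time

    def allocate(functional_unit):
        """Stand-in for 'allocate free processor': a thread models a
        processing unit drawn from the free pool."""
        threading.Thread(target=functional_unit).start()

    def make_scheduler(period, functions, generations=3):
        def scheduler():
            time.sleep(period)                  # i)   wait the required delay
            for f in functions:                 # ii)  allocate the functions
                allocate(f)                     #      it controls
            if generations > 1:                 # iii) allocate a copy of itself
                allocate(make_scheduler(period, functions, generations - 1))
            # iv) terminate: falling off the end returns this processor
            #     (thread) to the free pool
        return scheduler

    allocate(make_scheduler(0.1, [lambda: print("functional unit ran")]))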
Unfortunately, even this is not 1-reliable, because if a scheduler dies it cannot re-allocate itself. Thus we need a watch-dog of some kind - again decentralised. The process, as well as allocating the functions and a copy of itself, allocates a monitor. The monitor oversees all the processes invoked by this incarnation of the scheduler, excluding itself but including the next incarnation of the scheduler. This still has a window, because the scheduler might fail before allocating the monitor. Monitoring monitors is starting to climb the dangerous and infinite pyramid of layer upon layer of checking processes. But a moment's reflection on how we have decentralised the schedulers gives a clue. Just as the scheduler reschedules another copy of itself, so let the monitor look after another copy of itself. To be precise, we allow the monitor to oversee both the current and next invocation of the scheduler and its associated processes. Each monitor monitors the next incarnation of itself, and now there is no window.

Recall that our system is supposed to be 1-reliable; let us see if this design is so. There are four single failure modes for a particular allocation:

i) an allocated function fails,
ii) the next scheduler fails,
iii) the monitor fails,
iv) a channel fails.
Failure of a Scheduler. A scheduler failure clearly requires action by the monitor to reallocate a scheduler. Thus the monitor needs the capability to reallocate a scheduler but not a function. Without this capability, it should be impossible for a failed monitor to run amok among the other allocated processors, but one possible failure mode is for it to try to reallocate a scheduler when it need not. This would, of course, run two or more copies of the same scheduler. This is obviated by making its capability an 'abort-and-reallocate', for notice that aborting a scheduler does not harm the system provided that a new copy is almost immediately started. Leaving function reallocation to the next scheduler invocation is a good example of where the simplest possible form of error-associated action is utilised.

Failure of a Functional Unit. Functional units can fail in two ways: either they fail to produce an answer in the allotted time, or they produce the wrong answer. In the latter case, it may or may not be recognisably wrong. If our system is 1-reliable, then it will not strictly speaking matter, but a system which can recognise errors can usually be made more effective. Failure to produce an answer can be spotted by the monitor for that function. If we assume that the communication mechanism can be informed of this failure, then all the monitor needs is to be able to abort-function. This unfortunately has severe implications for monitor failure.

Failure of a Channel. Inter-process communications are somewhat harder to deal with. Firstly, the input to the system and output from it will be down physical wires which are fixed. These channels cannot thus be moved around within the system. Even those which could be moved present somewhat of a problem, in that copying data during a move may itself cause errors. The failure of a channel may be more catastrophic to the system than a failure of the function feeding it, because the memory of a channel makes its failure equivalent to a succession of failures of the function. This will be of no effect if the system as a whole is 1*-reliable. Some latitude is available to cope with failed channels. Leaving aside those which are immobile, the others could be moved occasionally (a scheduled process would move them). But more interestingly, the monitors could check that reasonable data was being delivered. For example, sequence numbers could be attached to data when written, and the monitor could then check that the correct number of different data values was provided at each channel read. A detected fault could then lead to the channel being aborted and a new one created. No data need be copied from failed channel to new one, because of the 1*-reliability of the design. In this situation rollback is not necessary, so another potentially difficult area of system error recovery is bypassed. Indeed, one of the advantages of the channel-based architecture is the lack of replicated data which has to be kept 'in step' at all updates.
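The sequence-number check might be sketched like this (our illustration; the class and function names are hypothetical):

    class SequencedChannel:
        """Data is written with an attached sequence number; a monitor
        can then check that each read sees the expected number of fresh
        values, and abort-and-recreate the channel on a detected fault.
        No data is copied to the new channel: the 1*-reliability of the
        design makes the lost history tolerable."""
        def __init__(self):
            self.seq = 0
            self.value = None

        def write(self, v):
            self.seq += 1
            self.value = (self.seq, v)

    def monitor_check(channel, last_seen_seq, expected_new):
        """True iff the channel delivered the expected number of
        distinct new values since the last check."""
        if channel.value is None:
            return expected_new == 0
        seq, _ = channel.value
        return seq - last_seen_seq == expected_new

    # A stuck channel keeps returning the same sequence number, so the
    # monitor sees too few new values and can abort the channel.
    ch = SequencedChannel()
    ch.write("reading-1"); ch.write("reading-2")
    print(monitor_check(ch, 0, 2))   # -> True: two fresh values delivered
    print(monitor_check(ch, 2, 1))   # -> False: the channel has stalled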
The immobile channels can of course only be dealt with by ensuring that sufficient redundancy exists in the presentation of data to the system, so that single failures here do not adversely affect its operation. Notice here especially the value of the reading processes being able to determine the failure of the input channels, to enable the function algorithms to make best use of the available data.

Failure of a Monitor. But what if a monitor fails? Since the monitor is not carrying out a crucial system function, in principle its failure has no impact on system performance. However, since the monitor has some global capabilities, one has to worry about the misuse of these capabilities. The proposal here is that the combination of software and hardware techniques reduces the probability of such misuse to the 'inconceivable' level. The capabilities granted to any process in a system should be just the minimum required by that process, so even a monitor is not free to do anything and everything. The software will be well trusted - probably proved - and by suitably coding the capabilities, the probability of a corrupt capability being accepted as another capability should be made very low. Similarly, the capability mechanism will be a well-trusted hardware construction. Finally, the monitor should have the capability to abort the single monitor it itself is monitoring. Thus any attempt to 'run amok' by one monitor will be arrested by its overseer.
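How the minimum-capability principle confines a monitor can be sketched as follows (ours; the Capability class is a software stand-in for what the paper envisages as a trusted hardware mechanism):

    class Capability:
        """A token granting exactly one privileged operation; a process is
        handed the minimum set it needs, so even a faulty monitor cannot
        'run amok' outside its own small remit."""
        def __init__(self, operation):
            self._operation = operation

        def exercise(self, *args):
            return self._operation(*args)

    def make_monitor(abort_and_realloc_scheduler, abort_next_monitor):
        # The monitor receives only two capabilities: it may restart a
        # (possibly healthy) scheduler - harmless, since a fresh copy is
        # started at once - and arrest the single monitor it oversees.
        caps = {
            "scheduler": Capability(abort_and_realloc_scheduler),
            "monitor":   Capability(abort_next_monitor),
        }

        def monitor(scheduler_alive, next_monitor_sane):
            if not scheduler_alive:
                caps["scheduler"].exercise()
            if not next_monitor_sane:
                caps["monitor"].exercise()
        return monitor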
CONCLUSION

Let us stand back at this point, for a paper of this nature should not be a detailed design document. Clearly there will be problems producing an architecture which meets some of the requirements just mentioned, but these are largely conventional hardware problems. What we hope to have indicated is that with a broad concept of capability mechanisms, one can at least envisage architectures which allow provable general statements about the system behaviour to be made. The possibility of achieving a 1*-reliable design seems to be real enough, providing that the basic system specification allows one to recognise the consequences of a different function being implemented in one of the functional blocks. TMR is a standard example of this. The results of such a specification are that the inherently redundant parts of the system are not needlessly replicated, but that the redundancy necessary to achieve a given level of fault tolerance can be implemented. A look at a specification will often reveal the typical pyramid shape of much redundancy at the input tailing off to very little at the output. With a SPECK-type specification, the consequences of a function change at any point can be traced, and a hand analysis or simulation can establish whether or not there is 1- or 1*-reliability with respect to this function.

We are currently working on the problems of producing a suitable architecture, and we can say little more until this work is more advanced. We suspect it will be a case where a small amount of spectacular testing - for example, pulling boards out of a working machine without disturbing it - will seem more convincing than pages of detailed argument. However, surely the important point to bear in mind is that the ability to offer some sort of proof of system integrity is crucial to licensing authorities, and that such an unconventional system as we have been speculating about here is only justified if it offers some advantage over a more normal approach. Problems such as the importance of common-mode failures are all too real in systems such as these, but these difficulties do not seem to be any more acute in such systems than in many others. Finally, this does seem to be an approach in which the full and powerful potential of multi-processors can be achieved in a controlled and flexible manner.
REFERENCES

Quirk, W. J. (1978). The automatic analysis of formal real-time system specifications. AERE-R 9046.
Cooper, M. J., W. J. Quirk, and D. W. Clark (1979). Specification, design and implementation of computer-based reactor safety systems. AERE-R 9362.
DISCUSSION

Walze: You described a system using redundancy in a dynamical way, i.e. dynamic allocation of hardware modules to functions. Other approaches are cold redundancy or stand-by redundancy. My impression is that your approach needs much more organizational effort.

Gilbert: The redundancy we described is on the functional level; therefore there is the possibility of optimization and failure recovery.

Lauber: When you talk about optimization criteria, do you mean hardware cost?

Gilbert: Not necessarily; we expect to achieve optimization in total, that means hardware and software.