Selfchecking computer module based on the Viper
microprocessor
Conventional microprocessors are designed without the rigour or features required for safety critical applications. Max P Halbert describes how the formally specified Viper microprocessor can be used as the basis for a selfchecking, fault tolerant computer module
A selfchecking computer module suitable for use in a variety of safety critical applications is described. The module is able to detect any errors generated in normal processing, and also to test itself comprehensively for latent faults. The basis of the module is an enhanced version of the Viper microprocessor, a formally proven device designed specifically for use in safety critical applications. To minimize component count, methods of incorporating the error detection and handling mechanisms into the Viper chips themselves are given. Also described are techniques for implementing the module as a multicard system. By defining a standard backplane and designing a general-purpose set of cards, selfchecking computers of various sizes can be configured without additional hardware design. The paper concludes by showing how two or more selfchecking modules can be connected in various fault tolerant configurations.
microsystems, error detection, microprocessors, Viper 1A, fault tolerance
The need for microprocessor-based systems to perform safety critical tasks is increasing dramatically in both civil and military sectors. Typical applications include flight control for dynamically unstable aircraft, protection systems in the nuclear and petrochemical industries, railway signalling and the control of equipment in automated factories.
The designer of such systems is faced with a problem. Commercial microprocessors, intended for the mass markets of office automation and consumer goods, are designed with neither the rigour nor the features required for safety critical applications. Of particular concern is the great complexity of modern devices; this frequently leads to misunderstandings by the programmers, and also opens up the possibility of design errors in the hardware itself.
A further problem is encountered when board-level products based on such microprocessors are examined. Invariably, these provide very little support for fault tolerance, and the coverage which can be achieved with built-in test routines is highly uncertain. Although quite reliable by many standards, such systems are rarely adequate when the loss of lives could be the consequence of failure.
The designer has two choices: to return to first principles and attempt the custom design of an appropriate fault tolerant system; or to try to find an off-the-shelf fault tolerant system which meets his/her requirements. The first option is very costly and time consuming, and may still be subject to the limitations of commercial microprocessors. The second option suffers the usual compromises of buying off-the-shelf equipment. A particular problem is that most commercial fault tolerant systems are directed at the online transaction processing market and are only packaged for the office environment.
As a solution to some of the above problems, this paper describes the design of a selfchecking computer module which serves as a building block for constructing a range of fault tolerant systems. The module is based on a modified version of the Viper microprocessor, a 32-bit device specifically designed for safety critical applications.
The essential characteristic of the selfchecking module is concurrent error detection, i.e. the automatic and immediate detection of any error which occurs during normal processing. In addition, the module can be thoroughly tested by means of a self test program stored in ROM. If run periodically, this can guarantee the absence of latent faults in the module.
Cambridge Consultants Ltd, Science Park, Milton Road, Cambridge CB4 4DW, UK
Paper presented at the Safety and Reliability Society Symposium held at Altrincham, UK, on 11-12 November 1987. Revised: 29 February 1988
0141-9331/88/05264-07 $03.00 © 1988 Butterworth & Co. (Publishers) Ltd
Microprocessors and Microsystems
A feature of the selfchecking module is that it can be implemented as a multicard system, based on general-purpose designs for the processor, memory and I/O cards. As with existing families of standard boards (e.g. STD, Multibus, VME), this enables systems of various sizes to be constructed without the need to engage in detailed hardware design.
VIPER MICROPROCESSOR
The Viper microprocessor, on which the present work is based, was designed at RSRE, Malvern, UK, specifically for use in safety critical applications. Although in many respects the methods described here are general, the use of Viper is appropriate because it overcomes many of the problems of relying on commercial microprocessors in safety critical areas.
One unique feature of Viper is that the instruction set is specified mathematically using the language LCF-LSM [1,2] and the gate-level logic design has been proven to conform to this specification. This guards against the risk of design errors in the processor logic.
Simplicity and freedom from ambiguity are the other distinguishing features of the Viper instruction set. All instructions are 32 bit in length, and the only data type supported is 32-bit integer. Interrupts have been excluded, and illegal instructions or instructions which produce unexpected results cause Viper to stop, preventing the processing of invalid data.
Testability is an area where Viper has an important advantage over commercial microprocessors. This results partly from the relative simplicity of the circuitry, and partly because circuit descriptions are available to the user. It is possible, therefore, to write a relatively compact test program for Viper, and to evaluate its coverage by fault simulation.
Fabrication of Viper is initially on gate arrays, and is being implemented by two independent companies using three different processes (bipolar, CMOS and silicon on sapphire).
BACKGROUND TO THE WORK
In general, the features of Viper are directed at avoiding the problem of design errors. As it stands, however, the initial version of Viper is still vulnerable to silicon failures, whether in the processor chips themselves or in the surrounding components. This problem was recognized by the Viper designers, and a contract was let to study its use in systems, particularly those with built-in test or fault tolerant capabilities. The output of that contract is the subject of this paper.
The aim of the project was to devise system architectures, based on Viper, suitable for a wide range of safety critical applications. The required response to faults would vary with application: some would simply require that the processor stopped in a known failsafe state; others would require that processing continued unhindered; yet others would tolerate a reduced-performance mode of operation, perhaps due to loading of a simpler, alternative program.
Because of the wide range of potential applications for Viper, flexibility was a key requirement of the architectural approach to be adopted. This tended to rule out an approach based on voted triple modular redundancy because of the rather rigid architectures that this entails. It was also desirable that the solutions would not require excessive modifications to the Viper design, and would not incur substantial overheads in chip count. An approach which could be easily understood by a nonspecialist would also be highly desirable, since this can greatly ease the task of convincing certifying authorities that adequate safety standards have been achieved.
The approach that was adopted, which is based on the use of selfchecking computer modules as building blocks, meets all of the above criteria. The next section describes the design of the selfchecking module; the subsequent section then discusses the various configurations in which it can be used.
Vol 12 No 5 June 1988
DESCRIPTION OF SELFCHECKING MODULE
This section describes the main design features of the selfchecking computer module, and discusses some of the reasons behind the choices. Methods of error detection are discussed first, looking at the techniques used for the microprocessor, memory and I/O interfaces. The action to be taken on detection of errors is then discussed, followed by some comments on ensuring complete testability of the module. Finally, the extension of the selfchecking module to a multicard system is described.
Microprocessor error detection
Two techniques for detection of microprocessor errors are described in the literature: redesign of the microprocessor as a custom selfchecking circuit [3-5], and duplication and comparison [6,7]. The first technique has the advantage of making it possible to implement the microprocessor in a selfchecking system as a single chip. However, in many other respects, the technique of duplication and comparison is preferable: for example, it avoids the need for extensive redesign of the microprocessor circuitry; it guarantees the detection of any type of error that the processors might produce, not just those caused by a restricted fault set such as single 'stuck-at' faults; and the strategy adopted is highly visible, easing the task of safety justification.
The one potential disadvantage of duplication and comparison is the large chip count that can result. Not only are two microprocessors required, but a large number of SSI devices can be needed to make up the comparator and the associated error handling circuitry. To overcome this problem, we have suggested a method of incorporating the comparators into the microprocessor chips themselves. The fundamentals of the technique are depicted in Figure 1, and a description is given below.
All data and address lines of the two modified Vipers (referred to as Viper 1A) are connected together. One Viper 1A is designated 'active' and the other 'monitor', these being selected by a single pin on each Viper 1A. In active mode, a Viper 1A will drive the bus as normal. In monitor mode, the write buffers are disabled, allowing the read buffers to observe the data written by the other processor. The internal comparator in the monitor processor then compares what the monitor would have written on the bus with what the active processor has actually written, flagging an error if they disagree. Bus clashes or stuck faults on the external bus will probably also be detected in the active processor.
Note that, to guard against open circuits, the two Viper 1A chips on a processor card should be positioned at opposite ends of the bus. This ensures that all spurs will take valid data off the bus.
Figure 1. Block diagram showing a pair of Viper 1A microprocessors with bus comparators and memory coding circuitry incorporated
Memory error detection
Memory errors are most efficiently detected by an appropriate error-detecting code. Once again, a number of advantages are gained by incorporating the coding circuitry into the microprocessor chips themselves, as shown diagrammatically in Figure 1. These advantages include the following.
• The need for external coding and decoding circuits is obviated. Such circuits are normally expensive in terms of pin count and/or chip count.
• The coding circuitry is effectively duplicated, and the duplicated microprocessors compare coded data buses. Thus any errors in the coding circuitry itself are detected.
The choice of memory code for Viper 1A was made under the assumption that the 32-bit-wide memory would be constructed with byte-wide memory devices. Accordingly, a code with an 8-bit package error detection (SSED) property was chosen. To implement this, one additional byte-wide memory package is required for the check bits. Each check bit is formed as the parity of the corresponding bit position in each of the other four memory packages. In this way, any errors due to faults within a single package are detected, including address decoder faults, coupling faults and multiple stuck memory cells.
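The check-byte construction described above can be illustrated with a short software sketch. It assumes even parity per bit position, so that the check byte is simply the bitwise XOR of the four data bytes; the parity sense and the function names `check_byte`, `encode` and `is_valid` are illustrative assumptions, not details taken from the Viper 1A design.

```python
from functools import reduce
from operator import xor


def check_byte(data_bytes):
    """Return the check byte: bit i is the parity of bit i of the four data bytes.

    Even parity is assumed here; the source does not state the parity sense.
    """
    if len(data_bytes) != 4:
        raise ValueError("expected four byte-wide packages")
    return reduce(xor, data_bytes, 0)


def encode(word):
    """Split a 32-bit word into four byte-wide packages and append the check byte."""
    data = [(word >> (8 * i)) & 0xFF for i in range(4)]
    return data + [check_byte(data)]


def is_valid(packages):
    """A codeword is valid when the XOR of all five packages is zero."""
    return reduce(xor, packages, 0) == 0


codeword = encode(0x12345678)
corrupted = list(codeword)
corrupted[2] ^= 0b10110001  # arbitrary multi-bit error confined to one package
print(is_valid(codeword), is_valid(corrupted))
```

Because any error confined to one package changes at least one bit position in exactly one of the five bytes, the XOR over that position can no longer be zero, which is why single-package faults of any multiplicity are caught.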
Error detection in I/O interfaces
The interface for I/O devices normally comprises tristate buffers for inputs (to enable data onto the microprocessor bus), latches for outputs (to store output values between successive output instructions) and an address decoder to select the required device. Errors in the I/O interfaces can be categorized as address or data errors. Address errors potentially have the most serious effect on the system, since they can simultaneously affect all data lines. The methods we have proposed for detection of I/O interface errors are summarized briefly below.
Address errors
The address decoders for the selection of I/O devices or memory banks are duplicated and compared. To overcome the risk of breaks in the address or control lines, the inputs to the two address decoders are taken as separate spurs from the lines passing between the two Viper 1A chips. This is shown in Figure 1. The risk of breaks in the select lines themselves is overcome in either of two ways.
• The select lines are routed to their respective I/O interfaces before being compared.
• The two copies of the select lines are used for different fields of the data word, thus giving a high probability of a data code violation if one copy is faulty.
Data errors
The criticality of data errors depends on the device which is being interfaced. The methods used for detection of data errors can therefore be chosen according to the application. Some possible techniques are as follows.
• I/O wraparound. Output data is read back via an input buffer.
• Duplication. This can be done at separate addresses (taking care to avoid common-mode faults on the data bus) or at the same address using different fields of the 32-bit data word (e.g. one set in the top 16 bit, the other in the bottom 16 bit).
• Coding. In some cases, I/O devices may be able to generate or check the memory code used by the Viper 1A. In other cases, it may be convenient to use some other code checked by software. One example would be a checksum or CRC (cyclic redundancy check) applied to a block of input data.
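As an illustration of the duplication technique using different fields of the same 32-bit word, the following sketch packs a 16-bit value into both halves of a word and checks the halves on input. The function names and the use of an exception to signal a data error are hypothetical choices made for this example only.

```python
def pack_duplicated(value16):
    """Place the same 16-bit value in the top and bottom halves of a 32-bit word."""
    if not 0 <= value16 <= 0xFFFF:
        raise ValueError("value must fit in 16 bits")
    return (value16 << 16) | value16


def unpack_duplicated(word32):
    """Return the 16-bit value, or raise if the two copies disagree."""
    top = (word32 >> 16) & 0xFFFF
    bottom = word32 & 0xFFFF
    if top != bottom:
        raise ValueError("data error: top and bottom halves disagree")
    return bottom


word = pack_duplicated(0xBEEF)
print(hex(word), hex(unpack_duplicated(word)))
```

A fault that disturbs only one half of the word is detected by the comparison; a fault that disturbs both halves identically is not, which is why the source warns about common-mode faults on the data bus.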
Error handling strategy
The above sections have discussed methods of detecting errors. We now consider the actions which should be taken following the detection of errors.
The specification for the original Viper calls for it to be driven into a stopped state following the detection of an error such as an illegal instruction code or an illegal memory address. The processor must then remain in the stopped state until it receives a reset input. In the context of a fault tolerant system based on replicated selfchecking modules, this characteristic of the selfchecking module is reasonably convenient. For example, it makes it easy to ensure that the module will always fail in a known failsafe state, and it makes it easy for other modules to recognize the failure due to the absence of output signals. It was therefore decided to retain the basic strategy of stopping the selfchecking module on the detection of errors. However, this leads to two problems which must be overcome.
• Testing of the error detection mechanisms is made difficult, particularly from a ROM-based program, since the processor must be manually restarted each time an error is injected.
• Transient errors will shut a module down until someone is available to intervene manually. In certain circumstances this could lead to an unwarranted loss of system availability.
To overcome these problems, a limited automatic restart facility was proposed. The principles of the scheme are demonstrated by Figure 2. For the purposes of simplicity, this diagram omits any duplication which is necessary for error detection purposes. The block labelled 'Viper' shows the relevant functions in an existing Viper chip. When an error is detected, the processor stops and outputs an indication on the STOPPED pin. The external circuitry feeds this signal back to the RESET input via a set of gates. Provided that the restart latch is set, the processor will restart automatically. However, the restart latch can only be set by an explicit instruction in the Viper program, and is cleared again each time the processor is actually reset.
Figure 2. Function of the automatic restart circuitry
This arrangement overcomes the two problems noted above as follows. To test the stopping mechanism, a test program sets the restart latch and introduces a deliberate error. The processor should reset. If an unexpected error occurs subsequently, the processor still stops as required. To allow recovery from transient faults, the program sets the restart latch. If a transient fault occurs, the program is reset. In a typical program structure, this would lead to the application program being temporarily suspended while a test program is run to determine whether a permanent fault is present. If so, the processor stops. If not, the application program is restarted from a suitable recovery point. A count of the transient faults can be maintained so that the processor can still decide to stop if the frequency of errors is too great.
The lower latch in Figure 2, referred to as the power-up latch, is required so that a program can distinguish between the two main reasons for a reset. In practice, several other diagnostic latches can be provided to enable the source of a detected error to be identified.
As with the bus comparators and memory coding circuitry, it is possible to reduce system chip count by incorporating the restart circuitry and diagnostic registers into the Viper 1A chip. This has the further advantage that, in duplicating the Viper 1A processors, the restart and power-up latches are also duplicated, thus ensuring that faults in these will not result in undetectable erroneous behaviour.
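The behaviour of the restart latch can be sketched as a small state model. This is a behavioural illustration only, not a description of the actual gate-level circuit: the class name `RestartLogic` and its methods are hypothetical, and the assumption that the power-up latch is set at power-on and cleared by an automatic restart is one plausible reading of the text rather than a stated fact.

```python
class RestartLogic:
    """Behavioural sketch of the limited automatic restart facility."""

    def __init__(self):
        self.power_on()

    def power_on(self):
        # Assumed: a power-on (or system) reset sets the power-up latch.
        self.restart_latch = False
        self.power_up_latch = True
        self.stopped = False

    def enable_restart(self):
        # The restart latch can only be set by an explicit program instruction.
        if not self.stopped:
            self.restart_latch = True

    def error_detected(self):
        """Stop on error; restart only if the latch was armed. Returns True if running."""
        self.stopped = True
        if self.restart_latch:
            self._automatic_reset()
        return not self.stopped

    def _automatic_reset(self):
        # The restart latch is cleared each time the processor is actually reset,
        # so a second error without re-arming leaves the processor stopped.
        self.restart_latch = False
        self.power_up_latch = False  # assumed: lets software tell the reset causes apart
        self.stopped = False


module = RestartLogic()
module.enable_restart()
print(module.error_detected())  # first error: restarted automatically
print(module.error_detected())  # second error, latch not re-armed: stays stopped
```

The single-shot nature of the latch is the point of the design: a test program can provoke one deliberate error and recover, while any further unexpected error still halts the module.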
Testability of the selfchecking module
One of the design objectives for the selfchecking module is that it should be thoroughly testable by means of a built-in self test program resident in ROM. This is particularly important for applications such as protection systems, where latent faults can seriously affect the probability of failure to act on demand. Note that the test program should cover not only obvious faults, such as in the main processing circuitry, but also the error detection mechanisms themselves.
The restart mechanism just described is an essential first step toward achieving this objective. It now remains to ensure that sufficiently thorough test patterns can be applied to each piece of circuitry in the module.
The task of testing the microprocessor and memory is relatively straightforward in the selfchecking module, because the error detection mechanisms are already in place. Essentially what is required is to exercise the circuitry sufficiently to ensure that any existing faults will provoke an error at the detection boundaries. Even this task is very difficult with many commercial microprocessors. With Viper, however, we have estimated that a program of only 1500-2000 instructions will be necessary for a full test of the processors. In the case of memory, a march test can be carried out using the existing contents of each location and its complement. Provided that each location is complemented an even number of times, this enables the entire RAM to be tested without destroying its contents.
A rather more difficult testing problem is encountered when considering some of the error detection mechanisms themselves. Consider for example the bus comparator, whose task it is to detect disagreement between the two Viper 1A processors. In normal use, this comparator only receives equal inputs, and we have no evidence that it will correctly detect an error when one actually occurs. To test a typical comparator circuit for all single stuck-at faults, we must apply the combinations 01 and 10 to each bit position in turn, whilst keeping the other bit positions equal. In the context of the selfchecking module, this would require the deliberate injection of faults into each address and data line in turn, which is clearly not an attractive proposition.
The solution adopted to this problem was to use a selfchecking comparator design. The circuitry for a 4-bit selfchecking comparator is shown in Figure 3. This is made up of three two-rail checker circuits (labelled T2 in the diagram). A selfchecking comparator of any size can be made by adding further branches to the tree. The selfchecking comparator circuit has the following properties.
• If the circuit is fault free, the output is 01 or 10 when the inputs are equal and 00 or 11 when they are not equal.
• If the circuit contains a single stuck-at fault, the output will be 00 or 11 for at least one set of valid (i.e. equal) inputs.
The major advantage of this comparator circuit, as it is shown, is that it can be tested merely by the use of equal inputs. Thus there is no need to inject an extensive set of deliberate errors. In practice, however, the need to drive the error handling circuitry makes it convenient at some stage to convert the two-rail output into single-rail format. This is done with an exclusive NOR gate. Once this is done the selfchecking property of the comparator is lost, and deliberate errors must again be injected to test the exclusive NOR gate and any subsequent error handling circuitry. Fortunately, however, it is now only necessary to inject errors at one bit position instead of all possible bit positions. The selfchecking comparator therefore results in considerable savings both in testing time and in error injection circuitry.
Similar selfchecking circuitry was adopted for the memory decoding circuits. It is worth noting that the additional cost of selfchecking circuits can be quite low. For example, a standard memory decoder requires 101 gate array cells, whereas the selfchecking version requires only 117.
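The first property of the selfchecking comparator can be checked exhaustively with a small logic model. The sketch below uses the textbook two-rail checker equations, which are assumed here and may differ in detail from the T2 cell actually drawn in Figure 3; it also assumes that each bit pair is formed from a bit of one operand and the inverse of the corresponding bit of the other. All function names are illustrative.

```python
def t2(x, y):
    """Textbook two-rail checker (assumed form): valid output iff both inputs are valid."""
    x0, x1 = x
    y0, y1 = y
    return ((x0 & y0) | (x1 & y1), (x0 & y1) | (x1 & y0))


def selfchecking_compare(a_bits, b_bits):
    """Reduce the pairs (a_i, NOT b_i) through a tree of T2 cells to one two-rail output."""
    pairs = [(a, b ^ 1) for a, b in zip(a_bits, b_bits)]
    while len(pairs) > 1:
        reduced = [t2(pairs[i], pairs[i + 1]) for i in range(0, len(pairs) - 1, 2)]
        if len(pairs) % 2:
            reduced.append(pairs[-1])  # odd pair carried up to the next level
        pairs = reduced
    return pairs[0]


def error_flag(rails):
    """Single-rail conversion with an exclusive NOR: 1 when the rails are equal."""
    z0, z1 = rails
    return 1 ^ (z0 ^ z1)


def to_bits(value, width=4):
    return [(value >> i) & 1 for i in range(width)]


print(selfchecking_compare(to_bits(0b1010), to_bits(0b1010)))  # complementary rails
print(selfchecking_compare(to_bits(0b1010), to_bits(0b1000)))  # equal rails: mismatch
```

For four bits the reduction uses two T2 cells at the first level and one at the second, matching the three checkers mentioned in the text. The second property (detection of internal stuck-at faults using only equal inputs) is not modelled here.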
Extension to multicard systems
If the selfchecking computer can be implemented as a multicard system, this allows the possibility of tailoring the amount of memory and I/O to the application in hand using only general-purpose cards. This section describes the main features of a standard backplane and the methods used to ensure that the selfchecking properties are retained across all cards in the system.
Address bus
Two copies of the address bus are transmitted along the backplane. Each copy originates as a distinct spur from the address lines that pass between the Viper 1A pair. Therefore any fault (e.g. open circuit conductor) which prevents correct propagation of the address will appear as a disagreement between the two copies on the backplane, or will be detected by the monitor Viper 1A chip.
Data bus
The data bus is transmitted with its eight check bits. Memory references will always use the code for error detection. I/O devices which prefer not to generate the code can negate a pair of CHECKENABLE lines and use some other method of error detection. One convenient technique is to communicate with I/O devices in 16-bit words, duplicating the word in the top and bottom halves of the 32-bit data bus.
Control lines
Two versions of the major control lines are transmitted along the backplane. Again these are derived as separate spurs from the lines that pass between the pair of Viper 1A processors.
Figure 3. A 4-bit selfchecking comparator: a, comparator circuit; b, expansion of the two-rail checker T2
Address decoders
Address decoders for I/O devices or banks of extra memory will exist on each I/O or memory extension card. These address decoders will be duplicated on each card, and each decoder will be connected to a different copy of the address and control buses. The select lines output by the duplicated address decoders will be compared, and the result will be fed back to the processor via a set of n error lines. To test this comparator, a gate in each of the two strobe lines will disable each address decoder in turn, thus allowing any desired error pattern to be fed to the comparator.
APPLICATIONS IN FAULT-TOLERANT SYSTEMS
This section provides examples of the many configurations in which sets of selfchecking modules might be used. Two main classes of application are considered: shutdown (or protection) systems and control systems.
Shutdown systems
The function of a shutdown system is to monitor the readings produced by a number of sensors and to implement a shutdown sequence if they are outside an allowable envelope. The outputs are usually discrete; outputs from redundant modules can therefore be resolved by relay voting.
Any number of selfchecking modules can be used in a shutdown system, depending on the reliability required. In some applications, just one selfchecking module may be adequate, e.g. where it is serving as a backup to some other system. The value of the selfchecking module in this situation is that, provided it has not stopped (in which case an alarm could be activated), it can be guaranteed to be producing correct results, and to be free of latent faults.
A pair of selfchecking modules would be ideal for many shutdown systems. Both modules would run completely independently, and two-out-of-two voting would be done on the outputs, i.e. a trip would only be generated when demanded by both modules. Each module would be arranged to fail in the tripped state. Note therefore that only if failures occurred in two modules would the system behave erroneously, in which case it would generate a false alarm. Failure to act on demand is precluded by the comprehensive selfchecking features.
This dual selfchecking configuration should be compared with a system which implements two-out-of-three voting on simplex lanes. Failure probabilities will be similar, since again two failures are necessary for erroneous results. However, the detection of latent faults in simplex lanes is a great deal more difficult, and in general is unlikely to achieve the comprehensive coverage obtained in the selfchecking modules described.
Greater numbers of selfchecking modules are also possible in the most highly critical applications. For example, four modules with three-out-of-four voting would be a good choice. This would allow for the vestigial possibility of a selfchecking module failing in an untripped state. It would also allow three module failures to occur before a false alarm is generated.
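The voting arrangements discussed above can be explored with a simple model in which a failed module is assumed to stop in the tripped state, as the text requires. The function names and the boolean representation of module outputs are assumptions made for illustration; a real system would resolve the outputs by relay voting rather than in software.

```python
def module_output(demand, failed):
    """A healthy module trips on demand; a failed (stopped) module reads as tripped."""
    return True if failed else demand


def system_trips(demands, failed, threshold):
    """m-out-of-n vote: trip when at least `threshold` module outputs are tripped."""
    outputs = [module_output(d, f) for d, f in zip(demands, failed)]
    return sum(outputs) >= threshold


# Two-out-of-two: one failed module alone does not cause a false alarm.
print(system_trips([False, False], [True, False], threshold=2))
# Two-out-of-two: the surviving module can still trip the system on demand.
print(system_trips([True, True], [True, False], threshold=2))
# Three-out-of-four: three failures are needed before a false alarm occurs.
print(system_trips([False] * 4, [True, True, True, False], threshold=3))
```

The model makes the trade-off visible: because failures are biased towards the tripped state, raising the voting threshold reduces false alarms without compromising the ability to act on a genuine demand.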
Control systems
The requirements of control systems vary greatly, one important factor being the nature of the actuators and the possibility of exploiting redundancy in the actuators. Two possible fault tolerant control systems based on pairs of selfchecking modules are shown in Figure 4.
In Figure 4a, the two modules operate totally independently, and their outputs are summed by the actuator. If one of the modules fails, this ceases to drive the actuator, allowing the other module to continue, albeit at reduced loop gain. This is a very simple scheme, but has been found very effective in many avionics applications. More than two selfchecking modules can be used if necessary.
Figure 4b shows a more difficult situation where it is only acceptable to drive one actuator (or one actuator input) at any given time. In this case one of the selfchecking modules must be designated active and the other standby. This necessitates communication between the pair of modules and a protocol for agreeing which module should be active. When a module fails, the output of its communication channel is automatically silenced. This is sensed by the other module, which is then able to assume the active role. Note that, once communications channels are installed, the possibility of implementing more advanced protocols arises. For example, the modules may be able to exchange state information to ensure that the handover of control does not result in serious perturbation of the system.
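The active/standby arrangement of Figure 4b can be sketched as follows. The source does not specify the handover protocol, so this is a hypothetical minimal version in which the standby module takes over as soon as it observes that the peer's communication channel has gone silent; the class `Module` and the function `run_cycle` are invented for the example.

```python
class Module:
    """Selfchecking module in an active/standby pair (hypothetical protocol)."""

    def __init__(self, name, active):
        self.name = name
        self.active = active
        self.failed = False

    def channel_alive(self):
        # A failed module stops, which silences its communication channel.
        return not self.failed

    def step(self, peer_alive):
        if self.failed:
            self.active = False
        elif not self.active and not peer_alive:
            self.active = True  # standby assumes the active role

    def drives_actuator(self):
        return self.active and not self.failed


def run_cycle(a, b):
    """Sample both channels, then let each module react to what it observed."""
    a_alive, b_alive = a.channel_alive(), b.channel_alive()
    a.step(b_alive)
    b.step(a_alive)


primary = Module("A", active=True)
standby = Module("B", active=False)
run_cycle(primary, standby)
primary.failed = True  # an error is detected and the module stops
run_cycle(primary, standby)
print(primary.drives_actuator(), standby.drives_actuator())
```

Because a selfchecking module stops on any detected error, silence on the channel is a trustworthy indication of failure, which is what makes such a simple takeover rule plausible. Exchange of state information for a smooth handover is not modelled.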
Figure 4. Two possible configurations for a fault tolerant control system: a, with two modules operating independently; b, with only one actuator to be driven at a time
CONCLUSIONS
A building block approach for constructing a variety of computer systems suitable for safety critical applications has been described. The basis of the approach is a selfchecking computer module. This is able to detect any errors generated in normal processing, and also test itself comprehensively for latent faults. It can be used alone or in configurations of two, three or four to achieve various levels of system reliability, availability and integrity. Examples of its use in both shutdown systems and control systems have been given.
The selfchecking module itself is also flexible in terms of memory size and I/O capability. The error detection techniques employed allow the definition of a standard backplane and the design of general-purpose processor, memory and I/O cards. This will enable the user to configure various systems with minimal hardware design effort.
Finally, the fact that the module is based on the Viper microprocessor overcomes many of the concerns of constructing safety critical systems with commercial microprocessors. In particular, Viper is relatively simple and easy to understand, its behaviour under all conditions has been specified unambiguously, and its design has been proven correct. The incidence of design errors, both in hardware and software, should therefore be substantially reduced.
ACKNOWLEDGEMENTS
This work has been carried out with the support of the Procurement Executive of the UK Ministry of Defence. The author would like to thank Dr C H Pygott, Dr W J Cullyer and Dr J Kershaw of RSRE, Malvern, UK, for their valuable support and guidance. Important contributions to this work were also made by G P Pink, N L Bragg, P Beynon and S M Bose of Cambridge Consultants Ltd.
REFERENCES
1 Cullyer, W J 'Viper microprocessor: formal specification' Report No 85013 Royal Signals and Radar Establishment, Malvern, UK (October 1985)
2 Gordon, M 'LCF-LSM' Technical Report No 41 University of Cambridge Computer Laboratory, UK
3 Disparte, C P 'A selfchecking VLSI microprocessor for electronic engine control' Dig. 11th Int. Symp. Fault-tolerant Computing (FTCS-11) (June 1981) p 253
4 Halbert, M P and Bose, S M 'Design approach for a VLSI self-checking MIL-STD-1750A microprocessor' Dig. 14th Int. Conf. Fault-tolerant Computing (FTCS-14) (June 1984) pp 254-259
5 Nicolaidis, M 'Evaluation of a selfchecking version of the MC68000 microprocessor' Dig. 15th Int. Conf. Fault-tolerant Computing (FTCS-15) (June 1985) pp 350-356
6 Rennels, D A 'Architectures for fault-tolerant spacecraft computers' Proc. IEEE Vol 66 No 10 (October 1978) pp 1255-1268
7 Peterson, C B et al. 'Two chips endow 32-bit processor with fault-tolerant architecture' Electronics (7 April 1983)
Max Halbert is a senior engineer at Cambridge Consultants Ltd, where for the last seven years he has been engaged in a wide range of research and development projects for government and commercial clients. His interests include fault tolerant computing, design for testability and digital signal processing. Applications for his work have included military avionics, civil aviation, nuclear reactor protection, road tunnel services and digital communications systems. He initially studied electrical engineering at the University of Western Australia, and was subsequently awarded a PhD degree by the University of Cambridge, UK.