Hardware voter for fault-tolerant transputer systems
Transputer parallelism allows redundancy which improves system fault tolerance. J Standeven, M J Colley and D M Lyons describe a hardware voter that organizes this redundancy in a manner that is transparent to the software
The paper describes the design of a hardware voter which is applicable to n modular-redundant (nMR) transputer systems. The alternative approaches to fault tolerance, including both hardware and software strategies, are discussed. The transputer is reviewed from the point of view of its use in fault-tolerant systems and the reasons for proposing majority voting of replicated link transmissions are considered. Initial design studies suggest that a VLSI hardware voter can be made which has minimal effect on system performance and can be used in a way which is largely transparent to the application's software.
Keywords: microsystems, transputers, hardware voter, fault-tolerant systems
There is an increasing awareness of the need to make computer systems resilient so that an acceptable level of service can be maintained in the presence of faults. At best, a system should continue to operate without loss of functionality and, at worst, there should be a graceful degradation of the system's performance. Applications where some fault tolerance is desirable or essential include array processors, flight guidance systems and process control systems. Realtime control applications will generally require the system to be able to keep going with minimal performance degradation, which implies substantial parallelism and redundancy 1-3. Triplication can enable a system to mask any single point of failure and higher levels of protection can be achieved at proportionally greater cost by increasing the redundancy still further. Figure 1 shows a triplicated scheme in which copies of the code are executed on different computational units and a vote is taken on the outcome. The voting can be performed by hardware or software with various tradeoffs depending on the application.
Figure 1. Basic TMR configuration comprising three processors and one voter
Multiple communication links make the transputer an attractive processor on which to build fault-tolerant systems. It allows flexible interconnections, has excellent cross-checking capabilities and redundancy can be provided in a straightforward manner. This paper considers the outline design of a hardware voter which matches the communication links of the transputer and can be used to build n modular-redundant (nMR) systems. The hardware voter should enable overall performance to be maintained close to that for an unreplicated system in a way which is essentially transparent to the application software. The voter design is itself modular in nature and the degree of redundancy (n) can be varied from three (TMR), the minimum value, to arbitrarily larger values.
Department of Computer Science, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK. Paper received: 4 August 1988. Revised: 17 April 1989
0141-9331/89/08588-09 $03.00 © 1989 Butterworth & Co. (Publishers) Ltd 588
Microprocessors and Microsystems
ALTERNATIVE APPROACHES TO FAULT TOLERANCE
Roll-back and recovery blocks
Software techniques can be used to make a system fault tolerant without substantial redundancy in the hardware. Such systems rely on being able to roll back to predetermined recovery points when a problem occurs. If the fault is transitory, the reattempt should be successful and the system can continue. If the initial reattempt is unsuccessful, the task may be initiated on a different hardware unit within the system or with a different version of the software module involved. These techniques have been the subject of considerable research effort 4-7, and for many application areas they can provide satisfactory solutions. A feature of software-based approaches, and particularly those not emphasizing redundant hardware, is that there can be a substantial overhead which affects performance.
Replicated hardware
A significant loss of performance is unlikely to be acceptable in many realtime applications, where time-critical processes may make heavy demands on the system. A more appropriate approach to providing fault tolerance in such systems is to replicate all the critical hardware. If a fault occurs the system should have sufficient capacity to maintain performance close to that obtainable from a single channel of the system.
Self-checking modules
A self-checking module consists of two channels made up of duplicate hardware and organized to run the same tasks in close synchronization. Comparators are inserted at suitable points and one channel is allocated a master role while the other acts as monitor. Any difference detected by the comparators indicates an error, but it is not possible to say in which channel the error has occurred. When an error is detected the parts of both channels covered by a comparator must be isolated even though it is probable that one channel is working quite satisfactorily. It is only possible, therefore, to use the pairs of channels as described above to form self-checking modules which have fail-stop properties 1. For a fault-tolerant system which can mask an error, at least two self-checking modules must be provided. This implies a total of four redundant channels to mask single faults (Figure 2). This is greater than for a triple-redundant system with majority voting which has the same capability. Only a few processors have been designed with the possibility of self-checking properties in mind. One such processor is the Intel 432 8 and there are proposals for adding comparison hardware to the Viper processor 9.
Figure 2. Duplicated self-checking modules
Majority voting
Majority voting involves executing multiple (possibly different) copies of the program code and using a voting algorithm to determine an agreed result. It is usual that each copy of the code is executed on a different processor, although this need not always be the case. If the voting is done by software there must necessarily be a substantial communications overhead between the replicated processors. This will significantly reduce the performance compared with an unreplicated system. If the system has point-to-point links, as for example a transputer-based system, a faulty processor will require alternative message routing which will result in quite different path times for the replicated channels. This implies that a substantial amount of effort has to be devoted to coping with out-of-sequence messages which must be sorted before voting 10. Whether the overhead is acceptable or not will depend on the nature of the application and the amount of 'excess capacity' a single channel possesses.
Hardware voting techniques are attractive as they offer a way of making fault-tolerant systems which possess minimal performance overheads and are largely transparent to the applications software. Voting usually takes place at a common communications point, for example the memory access paths, and ensures that an agreed result is always passed through. A penalty of this approach is that the hardware for voting is additional to simple replication of each processing channel. This increases costs and introduces yet another potential source of system unreliability. It is therefore essential to minimize the number of components and connections associated with the extra hardware, and VLSI implementations of the voter are required. A consequence of adding special hardware to a system to provide fault tolerance is that it too must be replicated to avoid the problem of dependency on a single unit. A minimum configuration based on three processors is shown in Figure 3.
Vol 13 No 9 November 1989
TRANSPUTERS AND FAULT TOLERANCE
General principles
The transputer has been specifically designed to assist in the creation of highly parallel systems. The four independent, bidirectional, serial communication links available on each transputer allow point-to-point connection of processor-memory nodes in a wide variety of topologies (Figure 4). Compatible cross-bar switches can potentially be used to allow reconfiguration of a system. It is proposed to exploit multiple transputers both to accommodate the parallelism in the application and also to provide redundancy for fault tolerance. By using a hardware-oriented solution, a fault-tolerant module would contain a number of transputer-memory pairs to provide redundancy plus the checking/voting logic which must itself be replicated. Redundancy on the links between modules can be implemented by using one link from each of the transputers to form a channel.
Figure 3. Example of nMR using three processors and three voters
Figure 4. Basic transputer showing four bidirectional links
Location and granularity of the checking hardware
There are potentially two locations where hardware voting algorithms could be implemented in transputer systems: the memory data paths and the channel links. Checking the memory highways would certainly provide a fine-grained level of fault tolerance, but it would be essential that the transputers operate in exact lockstep. As the issues involved do not specifically relate to the use of transputers, this route, which is closely linked to self-checking concepts, is not pursued in this paper. An alternative approach would be to implement error correction in the memory units by extending the word length and using appropriate codes.
Inserting checking/voting hardware into the transputer links is a more general method and can also accommodate faults in the communication paths. Although it offers a coarser level of fault monitoring, this approach does provide more flexibility. Faults generated within the transputer-memory pair or a channel can be contained within the suspect region and are thus prevented from propagating to neighbouring regions. Owing to the redundant nature of the data presented to the voting unit, it is possible to ensure that correct data are always forwarded. The requirement for exact lockstep operation is also relaxed.
Transputer link protocol
Transputers communicate with each other using high-speed, asynchronous serial data links capable of operating at either 5, 10 or 20 Mbit/s 11. These communications are in the form of messages which consist of a series of data packets. An acknowledgement must be received for each data packet that is sent before the next data packet can be output. The acknowledgement is used only to indicate that data were received and does not imply that any kind of checking has been carried out. The format of a data packet comprises a start bit, a packet type bit (one for data packets), 8 data bits and a stop bit, while an acknowledge packet has only a start bit and a packet type bit, which is always zero. It is possible therefore to determine after only two bit times whether a data field is present.
The link protocol allows data and acknowledgement packets to be interleaved in an unrestricted manner. Also, as transputer link connections are bidirectional, the transmission of an acknowledgement may be overlapped with a data packet. This is possible with some versions of the transputer, the T800 for example, and the acknowledgement packet is transmitted as soon as an incoming data packet is recognized. Transputers with this capability can maintain the maximum data rate allowed by the raw bit rate.
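The two packet formats can be illustrated with a minimal Python sketch. The 11-bit and 2-bit layouts follow the description above; the start bit value of 1, the stop bit value of 0, the least-significant-bit-first ordering of the data bits and all function names are assumptions made only for this illustration.

```python
DATA_PACKET_BITS = 11  # start + packet type + 8 data bits + stop
ACK_PACKET_BITS = 2    # start + packet type


def encode_data_packet(byte):
    """Return the bit sequence of a data packet carrying one byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("a data packet carries exactly one byte")
    # Assumption: data bits are sent least significant bit first.
    data_bits = [(byte >> i) & 1 for i in range(8)]
    # Assumption: start bit is 1 and stop bit is 0; the type bit is 1 for data.
    return [1, 1] + data_bits + [0]


def encode_ack_packet():
    """Return the bit sequence of an acknowledge packet (type bit is 0)."""
    return [1, 0]


def packet_length(first_two_bits):
    """Decide the packet length after only two bit times."""
    start, packet_type = first_two_bits
    if start != 1:
        raise ValueError("no start bit seen")
    return DATA_PACKET_BITS if packet_type == 1 else ACK_PACKET_BITS


def decode_data_packet(bits):
    """Recover the byte from a well-formed data packet."""
    if len(bits) != DATA_PACKET_BITS or bits[:2] != [1, 1] or bits[-1] != 0:
        raise ValueError("not a well-formed data packet")
    return sum(bit << i for i, bit in enumerate(bits[2:10]))
```

A receiver built along these lines only needs the first two bits to know whether nine further bits will follow, which is the property the voter's link interface relies on.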
Interconnection schemes
A minimum system based on majority voting would require a single group of just three transputers (TMR). This implies no inherent parallelism in the application and the parallel properties of the transputer are used just to provide redundancy. Replacing each of the general processors shown in Figure 3 with a transputer illustrates such a system. The implication of the voters being connected to the processor outputs is that checking would only occur naturally when signalling to an output device occurs. Dummy output messages to force voter activity artificially could be considered, but at the cost of some loss of performance.
For applications which contain identifiable parallelism, which can be distributed over a number of transputers, the question of the form of interconnection then arises. One possibility is a pipeline which, for a TMR implementation, would have three processors in each stage. If hardware voters are inserted between stages, two of the four links of each transputer are used, one for input, the other for output (Figure 5). The transputers in each triple do not need to share information for the purpose of fault tolerance. If the pipeline has a fixed physical topology then it is not possible to exploit reconfiguration to replace faulty nodes. A pipeline model can also be used for software-based voting which uses all the transputer links. Two links are used for input and output transfers along the pipeline and the other two for exchanging voting messages between transputers in each triple.
Fault-tolerant transputer-based building block
The most general way of considering hardware voting in a transputer array is to treat each of the four links symmetrically. This leads to a fault-tolerant building-block approach where each cluster of transputers, the minimum always being three, has the functionality of a single transputer, as shown schematically in Figure 6. The advantage of this approach is that any topology which might be selected as being appropriate for the application can then be made fault tolerant.
Figure 5. Pipeline configuration using TMR with hardware voting, showing stages M and M + 1. Key: P = processor; V = voter
Figure 6. Fault-tolerant building block showing triplicated transputers and voting hardware on each of the four transputer links. Key: T = transputer; V = voter
The concept of a fault-tolerant building block applies to the logical rather than physical organization. Repair and maintenance considerations may make physical distribution desirable. It is also attractive to allow the fault-tolerance mechanism to operate transparently, although provision should be made for status information to be used by higher level software to exploit the full fault-tolerant potential of the system. Transparency will allow existing software to execute unchanged, but still gain the benefit of improved reliability.
Synchronization issues
The details of the voter design are dealt with in the next section but the issue of synchronization is introduced now. The clocking of transputers and voters in each redundant channel must be independent for reliability reasons and, in the limit, each could have its own clock. Consequently tight synchronization between transputers is not possible, or necessary, for the purpose of the link protocol. What is necessary is that process execution in the distinct channels should remain reasonably closely in step. Voting will be undertaken on each data and acknowledgement packet transmitted along a link and this will automatically result in enforced processor resynchronization at each packet end.
The replicated voters on each channel will have separate clocks and the issue of synchronization between them, as well as the transputers, must be addressed. Drift between voters is only significant during data transmission and timing commences when the first bit of a packet is received. The maximum spread between the absolute start times of equivalent voter inputs is limited to half a clock cycle, which is a small fraction of a bit time. Resynchronization occurs at the start of every data and acknowledgement packet and, because these packets are only 1 byte long, their short length will not allow significant drift to accumulate. The voters, therefore, can be assumed to resynchronize effectively the packets arriving from transputers.
A typical realtime control system will involve communication with a variety of input and output sensors. If these are replicated for redundancy purposes their input signals are likely, in electronic terms, to be highly asynchronous. The buffering available within the voters cannot reasonably be expanded to cope with this situation. This implies that nodes of the system connected to input sensors must have software buffers and communication between channels 10. The issue of software synchronization for asynchronous realtime sensors is not specifically addressed in this paper. In general the application will be distributed over internal processing nodes with many processes allocated to each processor. Buffer processes between the physical links and the main processes, combined with the fixed scheduling enforced by the transputer, will ensure that interprocess skew between replicated channels is not a problem 12.
VOTER DESIGN
Overall structure
The voting mechanism assumes that three (or more) transputers provide information from one of their serial communications links. This information is voted on and the majority consensus is forwarded to the receiving group of transputers. Figure 7 shows a block diagram of a voting unit taking three transputer links as input and producing a single majority voted output. The unit is constructed from replicated link sampler-FIFO buffer pairs, one for each voter input, which have sufficient decoding logic to determine the packet type as well as storage for data. The buffer outputs are connected to the two-from-three majority voting logic. Synchronization between the buffer outputs and voter timing is performed by the controller/timer circuit. This also communicates error conditions to the voting logic which may arise from transmission skew effects.
Figure 7. Block diagram of the hardware voter (three sampler and FIFO buffer pairs feeding the voting logic, with a controller/timer handling buffer status, buffer control and control status lines)
In the ideal case the data on each of the replicated links would be transmitted simultaneously in a synchronous manner and under these conditions the delay due to the voter logic would be limited to hardware propagation effects. Generally, however, there will be some transmission skew and voting will await the last input, provided the period is not excessive. If the transmission skew is excessive, which could arise either because of a fault or because of a long interval between messages, the integrity of the voter must be protected by a timeout mechanism. If the timeout is exceeded, voting will take place on those inputs which are available. This gives a degree of determinism to the data output of the voter, allowing the worst case time delay to be calculated.
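The two-from-three majority function itself is a small piece of combinational logic. As an illustration only, the following Python sketch applies the standard bitwise majority expression to three integer values; the function names are hypothetical and the paper does not specify the gate-level form used in the voter.

```python
def majority3(a, b, c):
    """Bitwise two-from-three majority of three equal-width values."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_inputs(inputs, voted):
    """Return the indices of inputs that differ from the voted value."""
    return [i for i, value in enumerate(inputs) if value != voted]
```

For example, with inputs 0x5A, 0x5A and 0x5B the voted output is 0x5A and the third input is reported as disagreeing. Because the expression works bit by bit, the same logic serves whether the data are voted serially or a byte at a time.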
Interfacing to transputer links
It is necessary to oversample the link signal, as with all asynchronous communication, to determine the information content. The sampling rate should be at least twice the data rate of the link according to sampling theory, but some slow asynchronous devices, such as UARTs, sample at 16 times the data rate. As Inmos specifies the edge times of the link waveforms quite tightly 10, it is only required that the link is sampled sufficiently fast to prevent data loss due to edge jitter of the link data. In the case of the voter, sampling at four times the link data rate will most likely be adequate.
Each of the incoming transputer links is connected to the voter via a link interface, which performs the function of an asynchronous-to-synchronous serial line converter. The link interface detects the start of a packet and then clocks the information from the link into its associated buffer. However, unlike conventional asynchronous receivers such as the UART, the link interface must be capable of identifying the two differently-sized packet formats correctly, to ensure that no extraneous data are transferred to the FIFOs.
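A rough software model of such a link interface is sketched below in Python. It assumes four samples per bit, an idle line level of 0, a start bit of 1 and mid-bit sampling; none of these details is stated in the paper, and the helper `oversample` exists only to build test input.

```python
OVERSAMPLE = 4  # samples per bit time, the rate suggested in the text


def oversample(bits, idle_before=3):
    """Hypothetical helper: expand a bit sequence into line samples."""
    samples = [0] * idle_before  # assumption: the idle line reads as 0
    for bit in bits:
        samples.extend([bit] * OVERSAMPLE)
    return samples


def receive_packet(samples):
    """Return the bits of the first packet on the line, or None if idle."""
    if 1 not in samples:
        return None
    start = samples.index(1)  # first sample of the start bit

    def bit_at(n):
        # Sample near the middle of bit n to stay clear of the edges.
        return samples[start + n * OVERSAMPLE + OVERSAMPLE // 2]

    # The packet type bit decides how many bits are clocked into the FIFO.
    length = 11 if bit_at(1) == 1 else 2
    return [bit_at(n) for n in range(length)]
```

The point of the sketch is the length decision on the second bit: an acknowledge packet is never followed by nine bits of whatever happens to be on the line, so no extraneous data reach the buffer.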
Packet buffering
Transputers demand an acknowledgement for every data packet transmitted and will suspend further transmission until an acknowledgement is received. This implies there is no need for the FIFO buffers to hold more than a single byte and data can never be lost due to overflow. Data are transferred into and out of each FIFO in a conventional manner using separate WRITE and READ clock signals. The only status information needed is a DATA PRESENT signal as data packets are of fixed, single byte length. The DATA PRESENT signal will become TRUE when the first bit of the data packet has been accepted.
Voting logic
The algorithm adopted is to start a timeout counter as soon as any one of the FIFO buffers signals DATA PRESENT. Data will start to be extracted in parallel from the FIFO buffers when either all the active buffers indicate DATA PRESENT or the timeout period expires. A FIFO associated with an input which has been prevented from further voting by an error condition is deemed to be inactive and excluded from the algorithm. This means that voting can always take place with the minimum possible delay. Other approaches are possible but it is felt that this provides the best response time under all possible conditions.
The voting logic produces the majority consensus from the active FIFOs. If a FIFO is forced into the inactive state because of a known error condition or because its buffer is empty, then that channel of the voter is inhibited to prevent erroneous voting. Each time a new set of data bits is accessed from the FIFOs and presented to the voter inputs a VOTE ENABLE signal is generated to ensure synchronous operation of the voter logic. An input is considered to be in error and excluded from future voting if one of the following conditions arises:
• it does not agree with the majority consensus for a particular output value
• no data were received before the expiry of the timeout period
• its FIFO buffer signalled empty before the end of packet.
An error caused by any of these conditions will be latched by the voting logic as a FAULT STATUS which will remain until corrective action is taken. The length of the timeout period determines the maximum allowable skew between the cooperating processors and it is expected that it will not exceed two data packet times.
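The behaviour described above can be modelled at packet level with a short Python sketch. It is a behavioural illustration under stated assumptions, not the authors' design: time is counted in bit times, the default timeout of 22 bit times is simply two 11-bit data packet times, the class and method names are hypothetical, the third error condition (buffer empty before end of packet) is not modelled, and returning no value when there is no strict majority is a choice made here because the paper does not cover that case.

```python
from collections import Counter


class Voter:
    def __init__(self, n_inputs=3, timeout=22):
        self.timeout = timeout                  # assumed: two data packet times
        self.fault_status = [False] * n_inputs  # latched until corrective action

    def vote(self, arrivals):
        """arrivals[i] is (arrival_time, value), or None if nothing arrived.

        Returns (voted_value, vote_time).
        """
        active = [i for i, faulty in enumerate(self.fault_status) if not faulty]
        present = {i: arrivals[i] for i in active if arrivals[i] is not None}
        if not present:
            return None, None
        # The timeout counter starts at the first DATA PRESENT.
        deadline = min(t for t, _ in present.values()) + self.timeout
        on_time = {i: v for i, (t, v) in present.items() if t <= deadline}
        if len(on_time) == len(active):
            vote_time = max(t for t, _ in present.values())  # last input arrived
        else:
            vote_time = deadline                             # timeout expired
        for i in active:
            if i not in on_time:
                self.fault_status[i] = True  # no data before the timeout
        value, count = Counter(on_time.values()).most_common(1)[0]
        if 2 * count <= len(on_time):
            return None, vote_time           # assumption: no strict majority
        for i, v in on_time.items():
            if v != value:
                self.fault_status[i] = True  # disagrees with the majority
        return value, vote_time
```

Once an input has been latched as faulty, later votes no longer wait for it, which is how the sketch reflects the statement that voting can always take place with the minimum possible delay.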
Voter packaging
To keep the extra hardware required for voting to a minimum, use of VLSI implementations of the voting elements is envisaged. This would allow multiple voting elements to be contained within the same physical package. It is important that with this approach each voting element is completely independent from others contained within the same package so that common failure points are not introduced. Replicated voter hardware can then be constructed using elements contained within different packages, each of which would be separately powered and clocked. This allows economy of construction to be combined with a distributed structure for reliability.
SYSTEM ASPECTS
Overview
The hardware voting element described in the previous section produces one majority-voted output from multiple inputs. This can be used to provide a single, unidirectional, checked link connection between nMR modules. It would be necessary, however, to 'fanout' to each of the succeeding module inputs (Figure 1). The weakness of such an approach is that the system is susceptible to single point failure of a voter. The only secure organization is to replicate the voting elements by 'n' as shown in Figure 3. The output of each voter can now be directly connected to a corresponding module input and the need for fanout is transferred to the voter inputs. By using this arrangement, a voter failure can be masked in the same way as other hardware faults. To accommodate the bidirectional nature of the transputer's links it is necessary to provide an identical voting arrangement on the return side of a link. The two halves of a link connection then appear as mirror images of each other.
Single- or double-ended voting
For single-ended voting one voting element is provided in each direction for each nMR interconnection. It may be positioned at either the sending or receiving end of a link. In either case it is not possible to determine where in the system the error first occurred as the module and link connections will be seen as one unit by the voter. The position of the voting element only determines the ability to localize fault conditions. Placing the voting unit at the sending end of a link will indicate an error that occurs within the sending module or any of its incoming link connections. However, placing the voting units at the receiving end of the link will only indicate an error that occurs within the sending module or the sending link.
Double-ended voting involves using two voting elements in each direction for each nMR interconnection. These are positioned at the link ends and provide the capability of distinguishing between link and processor faults. The overall fault cover provided is not greater than for single-ended voting with the voting unit placed at the receiving end, but separation of link and processor faults may be useful if interconnections can be altered dynamically. The use of double-ended voting is expensive in terms of the number of voters required and probably cannot be justified unless advantage can be taken of the separate identification of processor and link faults. Single-ended voting at the receiving end of the link connections offers the required level of fault cover and, although it is not possible to distinguish between transputer and link faults, it is possible to indicate reasonably closely where the fault has occurred.
Performance implications
Each voting element will of course introduce delay into the link channel. Initial simulation results indicate that under ideal conditions any delay will be of the order of two bit times, which is the same as a C004 crossbar switch. The round trip delay (data message out, acknowledgement return) for a single-ended system would be twice this and for a double-ended system four times.
Using a non-overlapped acknowledgement scheme, which will apply if T414 transputers are used, the absolute maximum rate at which data packets can be transmitted is one packet per 13 bit times. This is the time taken to transmit the data packet plus the time for the acknowledgement. If single-ended voting is considered with a voter introducing a 2-bit delay, the packet time now becomes 13 bit times to transmit the data (packet time + delay time) and 4 bit times to transmit an acknowledgement. This yields 17 bit times between data packet transmissions, an increase of approximately 31%. If double-ended voting is considered then the delay time would also be doubled, yielding 21 bit times between data packet transmissions, an increase of approximately 62%.
If the system uses T800 transputers, overlapped acknowledgement will take place, and a data packet can be transmitted every 11 bit times. This is because the acknowledgement is transmitted as soon as the data packet is recognized, which occurs on arrival of the second packet bit. The acknowledgement can be completely transmitted within the remaining data packet transmission time and the overall data rate is not affected. If single-ended voting is used then a total delay of 4 bit times must be added into the calculation. This means that 6 data packet bits are transmitted before the acknowledgement arrives (as opposed to 2 under normal conditions), which still allows complete overlap to occur. Thus there is no increase in data packet transmission times. In the double-ended voting system, a total delay of 8 bit times must be added. This means that 10 data packet bits are transmitted before the acknowledgement arrives, increasing the data packet transmission time by 1 bit time. This corresponds to a fall in maximum data rate of approximately 8%.
In summary, if transputers with overlapped acknowledgement mechanisms are used, even the most general voting scheme only introduces a small reduction in link data rate. It would seem therefore that unless the ability to determine exactly where a fault occurred is important, the single-ended voting scheme offers the minimum link delays over the complete transputer family.
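The bit-time figures quoted above can be reproduced with a few lines of Python. The sketch assumes the numbers given in the text (11-bit data packet, 2-bit acknowledgement, recognition after 2 bits, 2 bit times of delay per voter); the function name and the way the overlapped case is expressed as a maximum are choices made for this illustration.

```python
DATA_BITS = 11       # start + type + 8 data + stop
ACK_BITS = 2         # start + type
RECOGNIZE_BITS = 2   # a data packet is recognized after its second bit
VOTER_DELAY = 2      # bit times per voting element (simulation estimate)


def packet_period(voters_per_direction, overlapped):
    """Bit times between successive data packets on one link."""
    one_way = VOTER_DELAY * voters_per_direction
    if overlapped:
        # The acknowledgement sets out once the packet is recognized and
        # pays the voter delay in both directions before it is complete.
        ack_complete = RECOGNIZE_BITS + 2 * one_way + ACK_BITS
        return max(DATA_BITS, ack_complete)
    # Non-overlapped: the data packet and then the acknowledgement,
    # each delayed by the voters in its own direction.
    return (DATA_BITS + one_way) + (ACK_BITS + one_way)


for name, voters in [("none", 0), ("single-ended", 1), ("double-ended", 2)]:
    print(name, packet_period(voters, False), packet_period(voters, True))
```

With no voters this gives 13 and 11 bit times, with single-ended voting 17 and 11, and with double-ended voting 21 and 12, matching the increases of about 31% and 62% for the non-overlapped case and the fall in data rate of about 8% (1 bit time in 12) for double-ended voting with overlapped acknowledgement.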
Fault masking
The objective of a system based on nMR techniques is to mask any single hardware failure within a replicated unit. If the system comprises a number of replicated units, which would be the general case for the system described in this paper, then multiple failures may be covered. In all fault conditions with which the system is able to deal, the masking must be transparent to the application and its effects must not propagate to neighbouring units.
For the system described here faults will result in either data messages or acknowledgements not being sent and the receiving transputer will wait indefinitely. However, as each replicated link contains at least one three-input voter, the missing messages are regenerated by the two-from-three voter action and the faults are contained within the suspect unit. This is illustrated in Figure 8 which shows the connections for two of the transputer's links. This is sufficient for a pipeline structure, but the other pair of links may be handled in a symmetrical manner. Links are bidirectional and shown as two-wire connections with the voter at the receiving end. The voters are also replicated and each wire of the link drives one input on each of the three receiving voters. It is thus guaranteed that each voter receives the same information at the same time.
Figure 8. TMR unit that could form part of a pipe. Key: T = transputer; V = voter
In this system faults could occur in three places: the transputer, the voter, or the connections. First consider the transputer. A fault here results in the loss of both data and acknowledgement messages. These messages are transparently replaced by the remaining two transputers in the TMR unit at the voter inputs of the neighbouring units.
Next consider the voter, where a fault results in the loss of either a data or an acknowledgement message depending on the direction of information flow. In either case the receiving transputer will wait indefinitely for information that will never arrive. In the worst case this will cause the transputer to stop producing any form of message. This is now the same situation as a transputer fault and the missing messages are replaced as for a faulty transputer.
Finally consider the connections. A fault in a transputer link output results in one input of all the voters not receiving any messages. These messages will however be automatically regenerated from the remaining two inputs to the voter. This type of fault has the same appearance as a failed transputer in a neighbouring unit, where it would also be the case that one input to all voters stops receiving messages. If the fault is on a voter output then no messages will be received by the associated transputer. This fault has the same appearance as a failed voter.
The same recovery procedures apply in the case of double-ended voting. However, the regeneration of missing messages occurs at the sending end of the links, and it is possible to separate link faults from transputer or voter faults.
Dealing with errors
The voting mechanisms can both detect and mask error conditions as they arise, but it is also necessary to have an error-reporting scheme. This allows the presence of fault conditions to be communicated to a supervisory system which may be able to undertake corrective action. Such a system could be based on a fault bus which would be external to the link communication structure. The nMR modules could use the bus to report any error conditions directly to a supervisory system, and the supervisory system could use it to interrogate the faulty component and take the necessary corrective action. The nature of the supervisory system will be very much a function of the application and further consideration is beyond the scope of this paper.
CONCLUSIONS
The availability of the transputer has enabled computer designers to build powerful, physically compact, parallel processing systems with many fewer components and connections than has previously been possible. This in itself helps to increase the inherent reliability of a system. The modular building block structure of transputer systems also makes it straightforward to exploit a variety of topologies and to vary the number of processing nodes to suit the application. When considering fault tolerance, the potential parallelism within the system can be shared between the application and the need for redundancy based on replication. Organization of the redundancy can be done by software, but this implies a significant performance overhead which may not be acceptable in a realtime application. The overhead can be minimized by delegating as much as possible of the organizational aspects to special hardware which votes the communications between the nodes of a replicated nMR system. This paper has considered the design of such a voter, which would fit easily into a transputer system, and also the system aspects of the alternative ways in which it could be used. The use of hardware for fault tolerance allows it to be provided in a way which is largely transparent to the application software. The complexity of the voter is relatively low compared with the transputer and its associated memory, and the serial nature of link communications minimizes the number of additional connections. While it has to be accepted that any increase in the amount of hardware will also increase the probability of a fault occurring, fault masking through the use of replicated voters more than compensates for this. VLSI fabrication of the voter is necessary, and high-level modelling of the voter's functions has been undertaken as a preliminary step along this route. As well as masking faults, the voter is able to provide valuable information on the nature of faults in a system and can potentially be used to assist in reconfiguration, which would allow faulty nodes to be replaced. The transputer family of devices already includes a crossbar switch which would be a key component in a reconfiguration scheme. The mechanisms for achieving reconfiguration in a fault-tolerant manner form a separate area of investigation.
REFERENCES
Vol 13 No 9 November 1989
1 Rennels, D A 'Fault-tolerant computing: concepts and examples' IEEE Trans. Comput. Vol C33 No 12 (1984) pp 1116-1129
2 Pradhan, D E (ed.) Fault-Tolerant Computing Theory and Techniques Vol II, Prentice-Hall, Englewood Cliffs, NJ, USA (1986)
3 Anderson, T and Lee, P A Fault Tolerance: Principles and Practice Prentice-Hall, Englewood Cliffs, NJ, USA (1981)
4 Randell, B 'System structure for software fault-tolerance' IEEE Trans. Software Eng. Vol SE-1 No 1 (June 1975) pp 220-232
5 Chen, L and Avizienis, A 'N-version programming: a fault tolerant approach to reliability of software operation' Proc. IEEE 8th Annu. Int. Symp. on Fault-Tolerant Computing (June 1978) pp 3-9
6 Knight, J C, Leveson, N G and St Jean, L D 'A large scale experiment in N-version programming' Proc. IEEE 15th Annu. Int. Symp. on Fault-Tolerant Computing (June 1985) pp 135-139
7 Shrivastava, S K, Mancini, L V and Randell, B On the Duality of Fault Tolerant System Structures Technical Report No 248, Computing Laboratory, University of Newcastle upon Tyne, UK (November 1987)
8 Intel 432 System Summary Intel Corporation, Aloha, OR, USA (1981)
9 Halbert, M P R 'Self-checking computer module based on the VIPER microprocessor' Microprocess. Microsys. Vol 12 No 5 (June 1988) pp 264-270
10 Mancini, L V and Shrivastava, S K 'Exception handling in replicated systems with voting' Proc. IEEE 16th Annu. Int. Symp. on Fault-Tolerant Computing (June 1986) pp 384-389
11 Rygol, M and Watson, T Connecting Inmos Links Inmos Technical Note 18, Inmos Ltd, Bristol, UK (April 1987)
12 Welch, P H 'Managing hard real-time demands on transputers' Proc. 7th Occam Users Group (September 1987)
John Standeven is a senior lecturer in the Department of Computer Science at the University of Essex. He obtained his PhD at the University of Manchester in 1968 and was a lecturer there until 1972 before moving to Essex. He has been involved in numerous projects in the computer architecture and realtime applications areas and his current research relates to distributed systems.
Martin Colley obtained a BSc in computer science from Queen Mary College, London, UK in 1977 and his PhD from the University of Essex, UK in 1984. He is currently a lecturer in the Department of Computer Science at the University of Essex. His research interests include display systems, distributed computing and fault-tolerant systems.
David Lyons obtained a BSc in electrical engineering from the University of Alberta, Canada in 1962. He worked at the research labs of Northern Electric in Ottawa until he became a Research Fellow at Harwell's Math Branch in 1966. He joined the Department of Computing Science at the University of Essex in 1969. His research interests are in computer architecture and operating systems, and in computer-aided education for the severely mentally handicapped.