Design and implementation of fault-tolerant multimicrocomputer systems
D Bernhardt and E Schmitter describe a partially meshed ring structure for realizing reliability and fault tolerance in multimicrocomputer systems
Distributed computer systems based on a multimicrocomputer structure offer the best preconditions for improving system reliability and realizing fault tolerance. The basic fault-tolerant system (BFS) described in this paper is an implementation of a fault-tolerant multimicrocomputer system. The architecture of BFS is a partially meshed ring structure, based on previous work. This kind of ring structure is appropriate for system monitoring and reconfiguration mechanisms.
and the principles of system level diagnosis 11,12. Module i+1 is kept informed about the status of module i, and module i is monitored by module i+1. In case of a fault of module i, the neighbour i+1 causes a test of module i by module i-1. Afterwards the two neighbours, i-1 and i+1, form a 'diagnostic unity' which manages recovery and reconfiguration (Figure 2). They can inform one another by the alternative path from module i-1. The architecture, control mechanisms and reconfiguration strategies guarantee that faults of every second computer module are tolerable. If two neighbours are defective, the system is no longer monitored.
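The linkage and tolerance rules can be illustrated with a minimal Python sketch. It assumes modules are numbered 0 to n-1 around the ring and that n is at least 5, so that the four neighbours are distinct; the function names `neighbours` and `is_monitored` are illustrative and do not come from the BFS implementation.

```python
def neighbours(i: int, n: int) -> list[int]:
    """Modules linked to module i: next and next-but-one on both sides."""
    return [(i + d) % n for d in (-2, -1, 1, 2)]


def is_monitored(failed: set[int], n: int) -> bool:
    """True while no two ring-adjacent modules are both defective.

    This mirrors the statement that faults of every second module are
    tolerable, whereas two defective neighbours leave the system unmonitored.
    """
    return not any(i in failed and (i + 1) % n in failed for i in range(n))


def bypass_link_exists(i: int, n: int) -> bool:
    """After a fault of module i, i-1 and i+1 are still directly linked."""
    return (i + 1) % n in neighbours((i - 1) % n, n)
```

With eight modules, `is_monitored({0, 2, 4, 6}, 8)` holds, while `is_monitored({3, 4}, 8)` does not.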
The new application areas of microcomputer systems 1 require a higher standard of availability because the breakdown of the overall system is not tolerable for the user. Multimicrocomputer systems, with their inherent hardware and software redundancies, are well adapted for meeting the requirements of fault-tolerant computing 2-4. Initially, an architecture must be found which facilitates system monitoring and allows an efficient reconfiguration mechanism. In case of a fault, the defective element should be replaceable by another system component; therefore single bus structures and hierarchically organized systems like master/slave structures were excluded. A system with identical components (master/master) and a bus structure with alternative communication paths between the modules was chosen. This system offers the best preconditions for solving the problems of fault tolerance with decentralized control, fault diagnosis on different levels and reconfiguration strategies. In the laboratories, some of the aspects of BFS were realized 5.
HARDWARE DESCRIPTION OF BFS
To simplify system monitoring, reconfiguration strategies and decentralized control, all microcomputer modules of BFS are identical. In the first implementation they were based on Intel's SBC boards. Every microcomputer module consists of an 8086 master CPU board, an intelligent communication controller board with an 8085 CPU, and memory and I/O boards. The performance of the master processor is therefore not diminished by message handling. A dual port memory is used for the data exchange between the two CPU boards (Figure 3). The memory boards are equipped with internal error correcting circuits (ECC), so that single bit errors are detected and corrected without disturbing running programs. The hardware-based memory checking is very important because software-based memory tests
OF BFS
Satisfying the requirements of fault tolerance 5-8, the basic model BFS was implemented with the special feature that each microcomputer module has bidirectional linkages to its next and next-but-one neighbours (Figure 1). This corresponds to the ideas of Metze, Preparata and Chien 9,10
Research Laboratories, Siemens AG, München, FRG. This work has been supported under the technological programme of the Federal Department of Research and Technology (BMFT) of the FRG (project number IT 1017 3).
vol 5 no 4 may 1981
Figure 1. Structure of BFS
0141-9331/81/040153-4 $02.00 © 1981 IPC Business Press
are time consuming and therefore not acceptable in real-time applications. The serial interface of our implementation is not a high-speed linkage, but we make use of this board because:
• it is an easy way to implement the communication network without special hardware development
• it is a flexible solution which allows a faster communication medium to be set up without changing other system parts
• it is possible to realize the main requirements of fault tolerance without regard for the communication network and the system topology
• the separation of CPU and communication processor optimizes the interprocessor communication and reduces the expense of managing fault tolerance through parallel processing on both processors
SOFTWARE COMPONENTS
With regard to decentralized control, reconfiguration and recovery, the special software components of fault tolerance require tables. These tables describe the status of the individual module and its near environment. With this information the fault of a module is tolerable and its tasks can be distributed to other system components. The tables are described below.
Configuration table
This table consists of a list of the system resources which can actually be used by this module. It is required by the communication processor for determining whether a task can be accepted by this module or not. Beyond this, the table contains a list of the resources which could be switched over to this module in case of a fault of the previous owner.

Path table
The path table describes the status of the linkages to the four neighbours of this module, and it contains a list of the defective neighbours.

Table of delegated tasks
The table of delegated tasks contains a description of all tasks which were delegated by this module and not yet finished. All of these tables together clarify the relationship between all delegated tasks in the whole system. Module i+1 is kept informed about every change in this table. Each element of the table contains:
• number of the task
• number of the calling task
• number of the executing module

Task queue
The description of each task which has been accepted by this module is stored in this queue. In case of a fault, all these tasks must be distributed anew. The queue of module i is duplicated in module i+1. Each entry contains:
• name of the task
• number of the task
• number of the calling task
• number of the calling module
• pointer to a block which contains the start or restart information

Redundancy table
The information about the status of module i-1 is stored in module i. The redundancy information consists of:
• actual configuration of module i-1
• table of delegated tasks of module i-1
• task queue of module i-1

Figure 3. BFS module
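As an illustration only, the tables of one module could be modelled as follows in Python. All class and field names are assumptions chosen to match the descriptions above; the paper does not give the actual memory layout, and `restart_info` merely stands in for the pointer to the start or restart block.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class DelegatedTask:
    task_no: int
    calling_task_no: int
    executing_module: int


@dataclass
class QueuedTask:
    name: str
    task_no: int
    calling_task_no: int
    calling_module: int
    restart_info: bytes = b""  # stands in for the pointer to the restart block


@dataclass
class ModuleTables:
    usable_resources: set = field(default_factory=set)      # configuration table
    switchable_resources: set = field(default_factory=set)  # takeover candidates
    link_ok: dict = field(default_factory=dict)              # path table: neighbour -> status
    defective_neighbours: set = field(default_factory=set)
    delegated: list = field(default_factory=list)            # table of delegated tasks
    task_queue: list = field(default_factory=list)


@dataclass
class RedundancyInfo:
    configuration: set
    delegated: list
    task_queue: list


def redundancy_copy(tables: ModuleTables) -> RedundancyInfo:
    """Snapshot of module i-1 as it would be held by module i."""
    return RedundancyInfo(
        configuration=copy.deepcopy(tables.usable_resources),
        delegated=copy.deepcopy(tables.delegated),
        task_queue=copy.deepcopy(tables.task_queue),
    )
```

The deep copy reflects that the neighbour holds an independent duplicate, which must survive the loss of the original module.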
DELEGATING OF TASKS
Figure 2. Diagnostic unity of BFS
BFS makes use of a simple strategy of delegating tasks in its first implementation. If module i wants to delegate a task, it forms a message containing the program name. This message is sent from one module to the next, as a 'ring message'. Each module able to accept the task registers the length of its own task queue in the ring message. The returning message is analysed by module i and the definite commission is sent to the selected module j. Afterwards module i inserts the task in its table of delegated tasks and module j registers it in its task queue. The neighbours i+1 and j+1 must be informed so that they can update their redundancy tables. The calling task is suspended until the completion of the commission is announced. Meanwhile another
microprocessors and microsystems
task becomes the running task. The last restart point which is registered in the task queue of the calling task was therefore necessarily set up before the call. In case of a fault and of redistribution, the call will be cancelled by the recovery organization and repeated by the reconfiguration mechanism.
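The ring-message step of the delegation strategy can be sketched in Python as follows. The selection rule (shortest registered task queue, first module in ring order on a tie) is an assumption, since the text only says that the returning message is analysed; `delegate`, `queue_len` and `can_accept` are hypothetical names.

```python
from typing import Callable, Optional


def delegate(origin: int, n: int, queue_len: dict[int, int],
             can_accept: Callable[[int], bool]) -> Optional[int]:
    """Circulate a ring message from `origin` and select an executing module.

    Each module able to accept the task registers the length of its task
    queue. Assumed rule: the module with the shortest queue is selected.
    Returns None if no module accepted the task.
    """
    ring_message: dict[int, int] = {}
    j = (origin + 1) % n
    while j != origin:  # the message travels once around the ring
        if can_accept(j):
            ring_message[j] = queue_len[j]
        j = (j + 1) % n
    if not ring_message:
        return None
    return min(ring_message, key=ring_message.get)
```

After the selection, module i would enter the task in its table of delegated tasks and module j in its task queue, as described above.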
FAULT DETECTION STRATEGIES
BFS has three main fault-detection strategies: a selftest, a test of each module by its two neighbours, and cyclic redundancy checking (CRC) of messages. Additionally it is possible to execute special test routines initialized by a service function.
Selftest
Each module consists of four basic components: CPU, memory, communication subsystem and I/O interfaces. Software memory tests are not efficient because the time taken to check 1 MByte is in the range of seconds; therefore memory boards with error correcting codes were used. The remaining components are tested periodically by a software selftest. As a selftest program runs on the processor which is to be tested, it is necessary to start the test with a minimal nucleus, which is assumed to be correct. The test is supervised by a watchdog timer, which is started at the beginning of the test and reset if the test indicates a fault-free system. Otherwise the processor reaches a defined error status, which could be the Halt state 13-15.
Test by neighbours
Every module i is tested by a test message from module i+1. The test is done periodically within a certain time-frame. If module i+1 detects an error, it informs module i-1. Thereupon module i-1 starts a test of module i. If the results of the two tests are the same, it is concluded that module i is defective, and the two neighbours i-1 and i+1 form a diagnostic unity and manage the reconfiguration together (Figure 2). If the results differ, the conclusion is that one linkage has failed and therefore the message or the acknowledgement did not arrive. In this case the path tables of module i and module i+1 are updated.
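The decision rule of the neighbour test can be written down as a short Python sketch. The names `Verdict` and `diagnose` are illustrative; the second test is passed in as a callable to show that module i-1 only tests module i after module i+1 has reported an error.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    OK = "module i passed the periodic test"
    MODULE_DEFECTIVE = "both neighbours saw an error: form a diagnostic unity"
    LINK_FAILED = "results differ: update the path tables"


def diagnose(test_by_next_ok: bool, retest_by_prev: Callable[[], bool]) -> Verdict:
    """Combine the test by module i+1 with the confirming test by module i-1."""
    if test_by_next_ok:
        return Verdict.OK
    if retest_by_prev():
        # i-1 reaches module i although i+1 did not: a linkage has failed
        return Verdict.LINK_FAILED
    return Verdict.MODULE_DEFECTIVE
```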
Message check
Every message in the ring is checked by means of a 16-bit CRC. The two check bytes are calculated from the polynomial by the sender and verified by the receiver. In case of an error the receiver sends a negative acknowledgement. After three unsuccessful attempts, an algorithm for localizing the faulty linkage is started.
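A minimal Python sketch of the message check follows. The paper does not name the CRC polynomial, so the CRC-CCITT variant provided by the standard library function `binascii.crc_hqx` is used here purely as an assumption; `frame`, `check`, `send` and the `channel` callable are hypothetical names.

```python
import binascii
from typing import Callable

MAX_ATTEMPTS = 3  # after three unsuccessful attempts, link localization starts


def frame(payload: bytes) -> bytes:
    """Sender side: append the two CRC bytes to the payload."""
    crc = binascii.crc_hqx(payload, 0)
    return payload + crc.to_bytes(2, "big")


def check(message: bytes) -> bool:
    """Receiver side: recompute the CRC and compare with the received bytes."""
    payload, received = message[:-2], message[-2:]
    return binascii.crc_hqx(payload, 0).to_bytes(2, "big") == received


def send(payload: bytes, channel: Callable[[bytes], bytes]) -> bool:
    """Return True on a positive acknowledgement within MAX_ATTEMPTS tries.

    A False result means the caller should start localizing the faulty link.
    """
    for _ in range(MAX_ATTEMPTS):
        if check(channel(frame(payload))):
            return True
    return False
```

Here `channel` models the link: it receives the framed message and returns what arrives at the receiver, so a fault can be simulated by a function that flips a bit.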
RECONFIGURATION STRATEGIES
In case of a fault of module i, the two neighbours i-1 and i+1 form a diagnostic unity. Module i+1 is informed about the status of module i and starts the reconfiguration. Every action of module i+1 is first sent to module i-1. Thus a faulty module can be prevented from starting reconfiguration, which would cause an undefined status of the overall system. The first thing to do is to inform the two other neighbours i-2 and i+2 about the failure of module i, and the four neighbours must change
their path tables. Afterwards all tasks which were delegated by module i are cancelled. The basis for this is the copy of the table of delegated tasks of module i. When a task of module j is cancelled, all the tasks which were created by this task must be cancelled too. This can be done with the table of delegated tasks of module j, in which all tasks created by the cancelled task are entered. In this way the tree of task relationships can be built up. The third thing which must be done is to redistribute the resources of the failed module. All modules which are able to take over the resources must register this in a ring message which is initialized by the diagnostic unity. The redistribution is managed by the diagnostic unity after analysing this message. The last thing to do is to delegate the tasks which were registered in the task queue of the faulty module and to update the task queues of the commissioners.
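The cancellation step, which follows the tree of task relationships through the tables of delegated tasks, can be sketched in Python as below. The sketch assumes that task numbers are unique across the system and that each table entry is a tuple (task number, calling task number, executing module); both the representation and the name `tasks_to_cancel` are assumptions made for illustration.

```python
def tasks_to_cancel(failed: int,
                    delegated: dict[int, list[tuple[int, int, int]]]) -> list[int]:
    """Collect all tasks to cancel after a fault of module `failed`.

    delegated[m] holds the entries (task_no, calling_task_no,
    executing_module) of the table of delegated tasks of module m.
    The entry for the failed module stands for the copy kept by its neighbour.
    """
    cancelled: list[int] = []
    pending = list(delegated.get(failed, []))
    while pending:
        task_no, _caller, executor = pending.pop(0)
        if task_no in cancelled:
            continue
        cancelled.append(task_no)
        # descend: tasks which the executing module delegated for this task
        pending.extend(e for e in delegated.get(executor, []) if e[1] == task_no)
    return cancelled
```

Tasks in the table of another module that were not created by a cancelled task are left untouched.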
REINSERTION OF REPAIRED MODULES
It is possible to reinsert a module after it has been repaired. The reorganization is also done by the two neighbours. First they initiate a selftest. Some time afterwards they test the new module. If no fault is detected, they start reinsertion by setting up a defined status of the repaired module i, by passing it the redundancy information of module i-1 and by updating their own tables.
CONCLUSION
First experiences in the use of BFS showed that distributed multimicrocomputer structures are well suited to fault tolerance. The diagnostic unity which is supported by the partially meshed ring structure of BFS is very useful for system monitoring and for reconfiguration. The diagnostic unity takes the place of the hardware voters of a known system. After fault detection the tasks of the faulty module are delegated to other modules and the overall system is dynamically reconfigured following the principles of fail-safe operation and graceful degradation 4. The periodic tests of all modules within a certain time-frame reduce the diagnosis time of the overall system and balance the amount of computation for testing and recovery 11. Fault diagnosis on different levels improves the test coverage of BFS. For debugging the overall system, some component faults (memory errors, faulty I/O interfaces), failures of modules, faulty links between system components and manipulation of the check polynomial are simulated by a computer connected to one system component. The fault tolerance of BFS could be demonstrated in this way. Nevertheless there are many questions which must be solved before these fault-tolerant multimicrocomputer systems can be used in application areas like traffic control, patient monitoring or process control.
ACKNOWLEDGEMENTS This article is based on a paper given at the sixth ACM European regional conference on systems architecture, ICS 81, held in London, 30 March-1 April 1981. The proceedings of this conference are published by Westbury House, the books imprint of IPC Science and Technology Press Ltd.
REFERENCES
1 Enslow, P H 'What is a 'distributed' data processing system?' Computer Vol 11 No 1 (1978) pp 13-21
2 Avizienis, A 'Computer system reliability: an overview' Infotech State of the Art Conf: system reliability London (1977) pp 1-13
3 Avizienis, A 'Fault-tolerance: the survival attribute of digital systems' Proc. of IEEE special issue on fault-tolerant digital systems (October 1978) pp 1109-1125
4 Siewiorek, D P 'Multiprocessors: reliability modelling and graceful degradation' Infotech State of the Art Conf: system reliability London (1977) pp 47-73
5 Schmitter, E 'Structure principles for fault-tolerant microprocessor systems' Siemens Forschungs- und Entwicklungsberichte Vol 7 No 6 (1978) pp 328-331
6 Dal Cin, M 'Fehlertolerante Systeme' Teubner Studienbücher Stuttgart (1979)
7 Konrad, W 'Fault-tolerance aspects of distributed computer systems' MSc Thesis Universität Karlsruhe (1980)
8 Nilsson, S 'Konzept und Architektur eines fehlertoleranten Mehrmikrorechner-Systems' Dissertation Universität Karlsruhe (1980)
9 Friedmann, A D and Simoncini, L 'System-level fault diagnosis' Computer Vol 13 No 3 (1980) pp 47-53
10 Preparata, F P, Metze, G and Chien, R T 'On the connection assignment problem of diagnosable systems' IEEE Trans. on Electronic Computers Vol EC-16 No 6 (1967) pp 848-854
11 Abraham, J A and Metze, G 'Roving diagnosis for high performance digital systems' Proc. of 1978 Conf. on Information Sciences and Systems The Johns Hopkins University (March 1978) pp 221-226
12 Bernhardt, D, Birzele, P, Buchmann, K, Geitz, G, Schmitter, E and Stock, P 'Design eines fehlertolerierenden Multimikrocomputer-Systems' BMFT-Forschungsbericht (1979)
13 Giordano, A and Nilsson, S A 'Vollständiger Selbsttest eines Mikrorechnertestkerns' ACM Workshop München (1979)
14 Maehle, E 'Entwurf von Selbsttestprogrammen für Mikrocomputer' ACM Workshop München (1979)
15 Nilsson, S 'Selbsttestverfahren für Mikrorechner' VDI-Berichte No 328 (1978) pp 77-85