North-Holland Microprocessing and Microprogramming 21 (1987) 333-338
Improving the Reliability of Bus Systems: Fault Isolation and Fault Tolerance Richard Vogt University of Karlsruhe Department of Computer Science (Prof. K. Bender) Zirkel 2, PO Box 6980, D-7500 Karlsruhe 1
Abstract A shared bus is the commonly used medium to interconnect the various modules of a (micro-) computer. In this paper, the reliability of a shared bus is examined and the limiting factors are determined. Then, a method is given to overcome these limitations by tolerating faults of active components. In contrast to other solutions, the proposed method requires no modification to the interfaces of the modules, and so it is applicable to existing and perhaps standardized bus systems. The reliability improvement is investigated, and some realization details of a VMEbus prototype are described. Finally, the method will be extended to tolerate faults of the bus itself, still without changing the interfaces.
1. Motivation
With the increasing spread of microelectronic components and systems into almost every imaginable application, efforts at hardware standardization have led to some remarkable results: bus systems, used to interconnect microcomputer modules such as processing units, memory and input/output devices, have been introduced and standardized by national and international committees. Popular examples are the VMEbus, the MULTIBUS I and II and the G64/96-Bus. These bus systems are supported by many manufacturers, and the market for the so-called "board computer systems" is still growing. Important consequences of standardized bus systems are reduced costs for development and production of computer systems, and the dropping prices for the compatible modules cause an expansion of the market [1].

At the same time, there is an increasing number of applications in which the dependability (i. e. reliability, availability) of the computing system is of central importance. In this field, especially in process automation, the shared bus becomes the limiting factor: first, the availability is restricted because there is no possibility to implement an on-line repair facility; all the boards connected to the bus have to be switched off before a faulty board can be replaced. Second, the reliability is restricted because the shared bus forms a "single point of failure". For these reasons the usage of a standardized bus system may be impossible when the demands on dependability increase. Special hardware will be necessary, and the benefits of standardization mentioned above are lost.

In this paper a method is introduced which improves the dependability of bus systems. The hardware interface of the modules remains unchanged. Thus it is possible to extend the range of applications of standardized bus oriented (micro-) computers, and to avoid the expense of a special hardware design in many cases.
2. Reliability Analysis of the Shared Bus
Fig. 1 shows the basic concept of a bus oriented system [2]:
Fig. 1: Bus oriented system

Due to the modularity of this design it is relatively easy to introduce redundancy on module level by connecting additional modules to the shared bus. Correct operation of this system, however, requires that the shared bus performs its function correctly. A failure of the shared bus can be caused either by a failure of the bus itself (e. g. stuck-at fault of a bus wire, line break, faulty termination network) or by a special class of faults of any module connected to the bus (e. g. stuck-at fault of a bus driver circuit, continuous access, protocol violation). The latter faults are called global faults of a module, in contrast to the local faults which only affect the function of a module itself. Therefore it is appropriate to divide each module into two (notional) parts, a local part ML and a global part MG, where any fault in MG will directly cause a failure of the shared bus (e. g. bus drivers, interface units etc. are part of MG). Fig. 2 shows the resulting model of the bus oriented system:
R. Vogt / Improving the Reliability of Bus Systems
Fig. 2: Model of the bus oriented system

As can be seen, the bus system consists of the shared bus and all the global parts MG. To obtain the correct function of the bus system, it is necessary that all these components are fault-free. Therefore the shared bus together with the MGs forms a series system [3]. On the assumption that the reliability of all MGs is equal (i. e. $R_{MG}(t) := R_{MG_i}(t)$ for $i = 1 \dots n$) and that failures are independent, the following equation holds (for simplicity the parameter t is omitted):

$$R_{\text{bus system}} = R_{\text{shared bus}} \cdot (R_{MG})^{n} \qquad (1)$$

Hence trying to increase the overall system reliability by introducing redundant modules reduces the reliability of the bus system (as n in eq. 1 increases). Depending on the reliability of bus and MGs it is even possible that the overall system reliability is decreased by redundant modules [4].

3. Fault Isolation

As stated above a single fault in one of the MGs causes the failure of the shared bus. To recover from this failure the faulty module has to be disconnected from the shared bus, preventing it from further hanging up the bus. It is obvious that avoiding a system breakdown requires that this disconnection is performed by the system itself. Therefore bus switches are introduced, as shown in fig. 3:

Fig. 3: Usage of bus switches (sw i) to disconnect faulty modules from the shared bus

The function of each switch (sw 1 ... sw n) is to disconnect all the signal and power lines of a faulty module. The switches are driven by control-units which have the task to detect and localize any module fault concerning the shared bus. So switches and control-units provide a means to isolate a faulty module and to protect the shared bus against module failures. Together with redundancy on the module level this arrangement makes it possible to tolerate faults in ML and MG as well.

The advantages of this method are:
• There is no need to change the interface of the modules, neither electrical nor mechanical. Standardized modules can be used.
• Both signal and power lines of a faulty module are disconnected. This enables an on-line repair facility and improves the availability of the system (in the case of a faulty ML this is true even when the control-units are replaced by a simple switch which has to be activated manually by repair personnel to disconnect the affected module).
• Bus wires, switches and control-units can be integrated to form a new and fault-protected bus with the same connections as the original one. Thus it is possible to increase the dependability of an existing system by replacing its bus.

Realization of this method only seems useful if a gain in reliability can be reached and if appropriate switching elements are available. These two points of interest are discussed in the following chapters.

3.1 Reliability Analysis [5]

Referring to fig. 3 the bus system is fault-free if the following requirements are met:
1. the shared bus itself is fault-free (e. g. bus wires, termination networks)
2. none of the switches disturbs the function of the bus (e. g. due to a stuck-at error inside a switch)
3. sw i is open, given MG i is faulty
4. sw i is closed, given MG i is fault-free

Condition no. 4 is required to prevent a system breakdown due to disconnection of fault-free modules (although this would not affect the bus system). The stated conditions yield the reliability block diagram [3] shown in fig. 4:

Fig. 4: Reliability block diagram of the bus system shown in fig. 3

Conditions no. 1 and no. 2 are represented by the blocks bus and sw i. The blocks $P_{\text{sw}_i\text{ closed}/MG_i}$ and $P_{\text{sw}_i\text{ open}/\overline{MG_i}}$ refer to no. 4 and no. 3, respectively; both blocks express a conditional probability which is determined mainly by the coverage of fault detection and fault localization of the control-units [6]. Finally, $\overline{MG_i}$ denotes that MG i is faulty, which is the opposite case of $MG_i$.
To calculate the reliability of the fault isolating bus system the following assumptions are made:
• $R_{sw} := R_{sw_i}$, $P_{\text{sw}\,*} := P_{\text{sw}_i\,*}$, $R_{MG} := R_{MG_i}$, $i = 1 \dots n$ (2)
• module and switch failures are independent

Then the reliability of the fault isolating bus system ($R_{\text{fibs}}$) is given by the expression [5]:

$$R_{\text{fibs}} = R_{\text{shared bus}} \cdot \left[ R_{sw} \cdot \left( P_{\text{sw closed}/MG} \cdot R_{MG} + P_{\text{sw open}/\overline{MG}} \cdot (1 - R_{MG}) \right) \right]^{n} \qquad (3)$$

In comparison with eq. 1 (original bus system) the term $R_{MG}$ in eq. 1 is replaced by the term in brackets in eq. 3. Hence, with

$$R_{MG\_sw} := R_{sw} \cdot \left( P_{\text{sw closed}/MG} \cdot R_{MG} + P_{\text{sw open}/\overline{MG}} \cdot (1 - R_{MG}) \right),$$

there is a gain in reliability if the condition

$$R_{MG\_sw} > R_{MG} \qquad (4)$$

is satisfied. This is true if:

$$R_{sw} > R_{MG} \qquad (5)$$

and

$$P_{\text{sw closed}/MG} \cdot R_{MG} + P_{\text{sw open}/\overline{MG}} \cdot (1 - R_{MG}) > R_{MG} / R_{sw} \qquad (6)$$

(eq. 5 is necessary because the left-hand side of eq. 6 cannot exceed one).

Eq. 5 indicates that the reliability of the switch (see condition no. 2) plays a key role in the design of the fault isolating bus system. In chapter 3.2 this fact will influence the choice of an appropriate switching element.

Eq. 6 directly concerns the coverage (fault detection, fault localization; see chapter 4) of the control-units. As can be seen, the conditional probabilities $P_{\text{sw closed}/MG}$ and $P_{\text{sw open}/\overline{MG}}$ are multiplied by $R_{MG}$ and $(1 - R_{MG})$, respectively. Therefore, if the reliability of MG has a value close to one the conditional probability $P_{\text{sw closed}/MG}$ is of greater importance than $P_{\text{sw open}/\overline{MG}}$. In other words: even if the fault-detection capability of the control-units is poor the reliability of the fault isolating bus system can be improved compared to the original bus system. Fig. 5, for example, shows a plot of the minimum values of $P_{\text{sw open}/\overline{MG}}$ needed to satisfy eq. 6 (for $R_{sw} = 0.999$). For any value greater than these the reliability is improved.

The impact of bus switches on the reliability can also be demonstrated by the reliability improvement factor (RIF) which is defined as follows:

$$RIF = (1 - R_{\text{old}}) / (1 - R_{\text{new}}) \qquad (7)$$

With the term $RIF_{M}$ indicating the reliability improvement factor of a single MG due to the bus switch,

$$RIF_{M} = (1 - R_{MG}) / (1 - R_{MG\_sw}) \qquad (8)$$

and the values $R_{MG} = 0.95$, $R_{\text{shared bus}} = 0.9$, the following figure shows the RIF of the bus system as a function of the number n of modules [5]:

$$RIF_{\text{bus system}} = (1 - R_{\text{shared bus}}) / (1 - R_{\text{fibs}}) \qquad (9)$$
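As an illustration of eqs. 1 and 3 to 7, the following Python sketch compares the original and the fault isolating bus system numerically. It is not part of the original analysis: the function names are chosen here for illustration, the values R_MG = 0.95, R_shared bus = 0.9 and R_sw = 0.999 are the ones quoted in the text, and the two coverage values are assumed. The RIF is evaluated according to eq. 7, taking the original bus system (eq. 1) as the old and the fault isolating bus system (eq. 3) as the new system.

```python
def r_bus_system(r_shared_bus, r_mg, n):
    """Eq. 1: series system of the shared bus and n global parts MG."""
    return r_shared_bus * r_mg ** n


def r_mg_sw(r_sw, p_closed, p_open, r_mg):
    """Bracket term of eq. 3: one MG together with its bus switch."""
    return r_sw * (p_closed * r_mg + p_open * (1.0 - r_mg))


def r_fibs(r_shared_bus, r_sw, p_closed, p_open, r_mg, n):
    """Eq. 3: reliability of the fault isolating bus system."""
    return r_shared_bus * r_mg_sw(r_sw, p_closed, p_open, r_mg) ** n


def min_p_open(r_sw, p_closed, r_mg):
    """Eq. 6 solved for P(sw open | MG faulty) at equality.

    A result above 1 means eq. 6 cannot be met for the given parameters.
    """
    return max(0.0, (r_mg / r_sw - p_closed * r_mg) / (1.0 - r_mg))


def rif(r_old, r_new):
    """Eq. 7: reliability improvement factor."""
    return (1.0 - r_old) / (1.0 - r_new)


R_SHARED_BUS, R_MG, R_SW = 0.9, 0.95, 0.999  # values quoted in the text
P_CLOSED, P_OPEN = 0.999, 0.9                # assumed coverage values

for n in (2, 5, 10, 20):
    old = r_bus_system(R_SHARED_BUS, R_MG, n)
    new = r_fibs(R_SHARED_BUS, R_SW, P_CLOSED, P_OPEN, R_MG, n)
    print(f"n={n:2d}  R_bus_system={old:.4f}  R_fibs={new:.4f}  RIF={rif(old, new):.2f}")

print("minimum P_open for a gain:", min_p_open(R_SW, P_CLOSED, R_MG))
```

At the threshold returned by `min_p_open` the bracket term of eq. 3 equals R_MG, so both systems have the same reliability; any larger value of the open-switch coverage yields a gain.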
Fig. 6: Reliability improvement factor of the bus system (x-axis: number n of modules; y-axis: RIF of the bus system)
3.2 Switching Elements

Several restrictions have to be taken into account when selecting a suitable switching element: Firstly, bus lines are normally used in a bidirectional way. Therefore, building up the bus switch with standard bus transceivers would require a direction signal which enables either the transmitter or the receiver. To avoid the overhead involved in generating this signal a bidirectional switching element should be used. Secondly, the switching element must be compatible with the electrical requirements of the shared bus. It must be able to switch the current of a bus line (e. g. VMEbus: up to 64 mA), it must provide a low on-resistance to reduce the voltage drop between module and bus and it must not affect the bus timing. Additional restrictions are the high reliability needed (as noted above) and the fact that in parallel bus systems with up to 100 or more bus wires an equal number of bus switches per module is necessary.
Fig. 5: Minimum values of $P_{\text{sw open}/\overline{MG}}$ to satisfy eq. 6 (parameter: $P_{\text{sw closed}/MG}$; x-axis: $R_{MG}$ from 0.85 to 1.00; y-axis: minimum $P_{\text{sw open}/\overline{MG}}$ from 0.001 to 1)
For these reasons power MOSFETs (Metal-Oxide-Semiconductor Field-Effect Transistors) have been chosen as switching elements. These semiconductor devices are distinguished by their high input impedance and by the fact that, being majority carrier devices, they do not suffer from minority carrier storage time effects, thermal runaway, or second breakdown. In contrast to ordinary lateral MOSFETs a power MOSFET provides a significantly lower on-state channel resistance (RDS(on)) for the same blocking voltage and faster switching [7]. Typical values of RDS(on) go down to 0.03 Ω. Fig. 7a shows the circuit symbol of an N-channel, enhancement mode power MOSFET with its inherent parasitic diode:
Fig. 7a: Power MOSFET circuit symbol
Fig. 7b: Bus switch (connections: to module, to bus, gate driven by the control-unit)

To avoid the effect of the parasitic diode, two power MOSFETs have to be connected in series to build up a bidirectional bus switch (fig. 7b; see chapter 5). This simple design yields high reliability. Furthermore, the power MOSFETs operate in a static state, i. e. they are either switched on or switched off. The actual switching occurs only if a faulty module is disconnected from the bus or if a replaced module is reintegrated. In addition, the reliability is improved by the fact that gate and channel of a power MOSFET are separated by an insulating oxide layer, thus preventing the gate control signal from spreading to source and/or drain. The consequence is a power MOSFET failure rate of about $10^{-10}$ to $10^{-12}\ \mathrm{h}^{-1}$ [8], so eq. 5 can be fulfilled.
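The figures quoted above can be put into a rough back-of-the-envelope estimate. The following Python sketch computes the on-state voltage drop of a two-MOSFET switch and the reliability of all switches of one module. Several points are assumptions made here and not statements of the paper: a constant failure rate (exponential law), independent MOSFET failures, 100 switched lines per module, a mission time of ten years, and the neglect of driving circuitry and control-units. The function names are illustrative.

```python
import math


def series_on_resistance(r_ds_on_ohm, n_series=2):
    """On-resistance of n_series MOSFETs connected in series (fig. 7b uses two)."""
    return n_series * r_ds_on_ohm


def voltage_drop(current_a, resistance_ohm):
    """Ohmic voltage drop between module and bus across a closed switch."""
    return current_a * resistance_ohm


def switch_reliability(failure_rate_per_h, hours, n_mosfets):
    """Reliability of n_mosfets devices that must all work.

    Assumes a constant failure rate and independent failures, i.e.
    R = exp(-lambda * n * t). This model is an assumption of this sketch.
    """
    return math.exp(-failure_rate_per_h * n_mosfets * hours)


# Values from the text: RDS(on) = 0.03 ohm, 64 mA line current,
# failure rate at the pessimistic end of the quoted range.
drop_v = voltage_drop(0.064, series_on_resistance(0.03))

# Assumed for illustration: 100 lines per module, two MOSFETs each, ten years.
r_sw = switch_reliability(1e-10, hours=10 * 8760, n_mosfets=100 * 2)

print(f"voltage drop per line: {drop_v * 1000:.2f} mV")
print(f"R_sw over ten years:   {r_sw:.5f}")
print("eq. 5 satisfied for R_MG = 0.95:", r_sw > 0.95)
```

Under these assumptions the switches of a module remain far more reliable than the value R_MG = 0.95 used in chapter 3.1, which is the situation eq. 5 asks for.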
4. Fault Detection and Fault Localization

As noted in the discussion of eq. 6, the reliability improvement of the fault isolating bus system depends on successful fault detection and fault localization by the control-units. In the following, several possibilities to realize these two steps are considered (only faults due to a faulty module are assumed).

4.1 Fault Detection

The detection of faults concerning the shared bus can take place at different levels:
• At bus level the provisions for detecting faults depend only on the specification of the given bus system. A fault is present if these specifications are violated. Hence fault detection can be implemented by checking the conformance of the shared bus with its specifications. Some examples are the examination of voltage levels, currents, parity and error lines (if present) etc. and, of course, the examination of bus timing (especially the relationship between various signals and timeout conditions). As no information about the structure of the computer system is needed the fault detection at bus level requires no modification to system hardware and software.
• The next level of fault detection uses more detailed information about each module connected to the bus, e. g. type of module (master, slave, or both), access rights (read, write, or both), access limitations (time; memory space, I/O space). If this information is invariable it is possible to initialize the control-unit accordingly and hence to minimize or even avoid any modification to system software.
• Finally the fault detection capabilities of the control-units can be supported by the computer system itself. This is achieved, for example, by transferring predefined test patterns between modules, by appending check bytes to each transmission, or by introducing an "I'm alive" message which has to be sent by each module.

In addition to the measures mentioned so far there may be further possibilities to detect faults, depending on the bus system under consideration (e. g., several bus systems have an additional serial bus which can be used for this purpose).

4.2 Fault Localization

If the bus itself is fault-free then any failure of the shared bus can only occur if a faulty module is driving the bus. Hence, localization of this module can be done by monitoring its driving current, or rather the voltage drop across the bus switch caused by this current. This results in the structure shown in fig. 8, where the control-units have access to both sides of the switches.

Fig. 8: Fault localization

In addition to this method it is also possible to localize the faulty module by disconnecting one module after the other and checking the shared bus at the same time. However, this is an awkward and error-prone way compared with the method described above.
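The localization by voltage drop can be sketched in a few lines of Python. The sketch assumes that the control-unit can sample the voltage on both sides of each closed switch for one bus line, that the on-resistance of the switch is known, and that a current threshold separates a driving module from modules that merely listen. The function name, the threshold and the sample values are hypothetical and not taken from the prototype.

```python
def localize_driving_modules(v_module_side, v_bus_side, r_switch_ohm, i_threshold_a):
    """Return the indices of modules that appear to drive the bus line.

    The current through each closed switch is estimated from the voltage
    drop across it (Ohm's law). A module is reported if the magnitude of
    this current exceeds the threshold, regardless of its direction.
    """
    suspects = []
    for index, (v_mod, v_bus) in enumerate(zip(v_module_side, v_bus_side)):
        current = (v_mod - v_bus) / r_switch_ohm
        if abs(current) > i_threshold_a:
            suspects.append(index)
    return suspects


# Hypothetical samples for one bus line and four modules (volts).
# Module 2 shows a drop of 3 mV across a 0.06 ohm switch, i.e. about 50 mA.
v_module = [0.400, 0.400, 0.403, 0.400]
v_bus = [0.400, 0.400, 0.400, 0.400]

print(localize_driving_modules(v_module, v_bus, r_switch_ohm=0.06, i_threshold_a=0.010))
```

In practice the drops are in the millivolt range for low on-resistances, so the measurement accuracy of the control-unit limits the usable threshold; the values above only illustrate the principle.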
5. Application to the VMEbus

The method introduced so far has been applied to the VMEbus, a widely used microcomputer bus system [9]. In the prototype implementation bus switches and associated driving circuitry for each module are mounted on an adapter board which is inserted between backplane and module (see fig. 9). This adapter board is connected to a centralized control-unit by a separate cable.

Fig. 9: Prototype implementation (backplane, adapter board, VMEbus board, control-unit)

Some features of this implementation are the possibility to disconnect distinct groups of signal lines separately (e. g. bus request/bus grant/bus busy; interrupt request/interrupt acknowledge; serial clock/serial data) and the possibility to read or drive a module's signal line by the control-unit if this line is disconnected from the bus (e. g. the SYSFAIL signal). Special attention has been paid to the daisy-chain lines, where it is necessary to avoid an interruption of the daisy chain if a module is disconnected from the bus. Fig. 10 shows an appropriate switch design.

Fig. 10: Bus switch used for daisy-chain signals

As a result of the investigations made with this implementation it has been shown that not only the on-state resistance of the power MOSFETs used is important, but also their parasitic capacitances. To minimize crosstalk and capacitive loading, the gate-source, gate-drain and drain-source capacitances have to be low, and the driving circuitry has to be designed accordingly. Due to the structure of power MOSFETs low parasitic capacitances yield a high RDS(on) and vice versa, so a suitable compromise has to be found.

In the present design commercially available devices and discrete power MOSFETs are used. Therefore the obvious supposition is that with advanced technologies the adapter board becomes unnecessary and that it is possible to integrate all the devices into a new and fault isolating backplane.

6. Tolerating Bus Failures

In the fault isolating bus system there remains a single point of failure, that is the bus (including bus switches) itself. With the structure given in fig. 11 it is possible to tolerate bus failures, too (fig. 11 shows the case of centralized control-units).

Fig. 11: Fault tolerant bus system

The bus control-unit (left side) checks bus A and generates an appropriate signal F which is distributed to the AND gates. If F=1 (which should be the case if bus A is fault-free) the top row of bus switches is enabled. These switches together with bus A and the switch control A form a fault isolating bus system as discussed above.

As soon as the bus control-unit detects a failure of bus A, the signal F=0 is generated. Therefore the switches of the top row are opened, the bottom row is enabled and work can go on with bus B and switch control B.
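The switch-over between bus A and bus B by means of the signal F can be expressed as a small piece of combinational logic. The following Python sketch is an interpretation of the description of fig. 11 and not the circuit of the paper: it assumes that each switch of the top row is gated by F and the output of switch control A for its module, and that each switch of the bottom row is gated by the inverted F and the output of switch control B. The function and parameter names are illustrative.

```python
def switch_enables(f, ctl_a, ctl_b):
    """Compute the state of both rows of bus switches.

    f      -- signal F of the bus control-unit (True: bus A considered fault-free)
    ctl_a  -- per-module outputs of switch control A (True: module may connect)
    ctl_b  -- per-module outputs of switch control B (True: module may connect)

    Returns (top_row, bottom_row); True means the switch is closed.
    """
    top_row = [f and a for a in ctl_a]             # modules connected to bus A
    bottom_row = [(not f) and b for b in ctl_b]    # modules connected to bus B
    return top_row, bottom_row


# Three modules; switch control A has isolated module 1 as faulty.
print(switch_enables(True, [True, False, True], [True, True, True]))
# After a failure of bus A has been detected (F=0), bus B takes over.
print(switch_enables(False, [True, False, True], [True, True, True]))
```

With this gating no module is ever connected to both busses at the same time, so a faulty bus A cannot disturb bus B through a module.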
The reliability study of this design is similar to that presented above and is carried out in [5] (single-point failures due to failures of line F can be avoided, e. g. by TMR techniques). To sum up, it can be stated that the effect of the switch control-units and the bus switches in this design is comparable with that in the non-redundant, fault isolating bus system discussed in chapter 3. However, the main difference is the possibility to use a redundant bus. Therefore, the block "bus" shown in the reliability block diagram of the non-redundant bus system (see fig. 4) is now replaced by the following combination of bus A, bus B and the conditional probabilities which are determined by the coverage of the bus control-unit.
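The combination of bus A, bus B and the coverage of the bus control-unit can also be explored numerically. The following Python sketch sums the three branches of the reliability block diagram of the redundant busses (fig. 12) and compares the result with the closed form given as eq. 12 in this chapter. The function names and the sample values are chosen here for illustration; equal reliability of both busses is assumed, as in eq. 11.

```python
def r_red_bus_branches(r_bus, p_f1_a_ok, p_f0_a_faulty):
    """Sum of the three branches of the block diagram of the redundant busses."""
    upper = r_bus * p_f1_a_ok                        # bus A ok, F=1 generated correctly
    lower = r_bus * (1.0 - p_f1_a_ok) * r_bus        # bus A ok, wrong F=0, bus B ok
    middle = (1.0 - r_bus) * p_f0_a_faulty * r_bus   # bus A faulty, F=0 correct, bus B ok
    return upper + lower + middle


def r_red_bus_closed_form(r_bus, p_f1_a_ok, p_f0_a_faulty):
    """Closed form of eq. 12."""
    coverage_sum = p_f1_a_ok + p_f0_a_faulty
    return r_bus * coverage_sum - r_bus ** 2 * (coverage_sum - 1.0)


def rif_bus(r_bus, r_red_bus):
    """Eq. 14: reliability improvement factor of the redundant busses."""
    return (1.0 - r_bus) / (1.0 - r_red_bus)


R_BUS = 0.9  # sample value
for p1, p0 in ((1.0, 1.0), (0.99, 0.9), (0.9, 0.5)):
    r = r_red_bus_branches(R_BUS, p1, p0)
    print(f"P(F=1|A ok)={p1}  P(F=0|A faulty)={p0}  R_red_bus={r:.4f}  RIF={rif_bus(R_BUS, r):.2f}")
```

For perfect coverage the sketch reproduces the 1-of-2 result of eq. 13, and for poor coverage it shows how the improvement factor can fall below one.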
Fig. 12: Reliability block diagram of the redundant busses

The upper branch represents the case that bus A is fault-free and that the bus control-unit generates the correct signal F=1 (with the conditional probability $P_{F=1/A}$). In the lower branch bus A is fault-free, too, but the wrong signal F=0 is generated. Yet work can go on if bus B is fault-free. Finally the middle branch represents the case that bus A is faulty, F=0 is generated correctly and bus B is fault-free.

With

$$P_{F=0/A} = 1 - P_{F=1/A} \qquad (10)$$

and the assumption that

$$R_{\text{bus A}} = R_{\text{bus B}} =: R_{\text{bus}} \qquad (11)$$

the reliability of the redundant busses is given by eq. 12:

$$R_{\text{red. bus}} = R_{\text{bus}} \cdot (P_{F=1/A} + P_{F=0/\overline{A}}) - R_{\text{bus}}^{2} \cdot (P_{F=1/A} + P_{F=0/\overline{A}} - 1) \qquad (12)$$

For perfect coverage, i. e. $P_{F=1/A} + P_{F=0/\overline{A}} = 2$, this yields the well-known equation of a 1-of-2 system:

$$R_{\text{red. bus max.}} = 2 \cdot R_{\text{bus}} - R_{\text{bus}}^{2} \qquad (13)$$

Fig. 13 shows a plot of the reliability improvement factor,

$$RIF_{\text{bus}} = (1 - R_{\text{bus}}) / (1 - R_{\text{red. bus}}) \qquad (14)$$

as a function of the sum $P_{F=1/A} + P_{F=0/\overline{A}}$ and the parameter $R_{\text{bus}}$.

Fig. 13: Reliability improvement factor of the bus [5] (x-axis: $P_{F=1/A} + P_{F=0/\overline{A}}$ from 0.0 to 2.0; y-axis: $RIF_{\text{bus}}$; parameter: $R_{\text{bus}}$)

With $P_{F=1/A} + P_{F=0/\overline{A}}$ near 2.0, a high value of $RIF_{\text{bus}}$ can be reached. Some additional advantages of the fault tolerant bus system are:
• Centralized units (e. g. bus arbiter, system clock driver, power monitor, system controller, etc.) can be regarded as parts of the bus. Hence switching over to the redundant bus also means switching over to a redundant set of centralized units.
• Since the faulty bus is disconnected completely it can be replaced without interruption of system operation, thus increasing the overall system dependability.
• The redundant bus and associated switches make it possible to establish an exclusive path to one of the modules, e. g. for diagnosis or initialization purposes.

Due to the increased expense involved with redundant switches and busses it is to be expected that the field of application of this design is limited to those bus systems which use a less reliable interconnection medium, i. e. the probability of failures of the bus itself is relatively high. In particular, this may be the case in (serial) bus systems covering some distance, e. g. local area networks, bus systems in a factory environment etc.

As before, the interface to the modules remains unchanged, so standard modules can be used.

7. Summary

With the application of bus switches it is possible to tolerate several classes of faults occurring in a bus oriented (micro-) computer system. The proposed method requires no modification to the given interface of the modules, so standard and marketable modules can be used. In comparison to other, i. e. special purpose solutions (e. g. the redundant "Dynabus" in the Tandem 16 NonStop system [10]) this results in high economic efficiency.

Acknowledgement

The author wishes to thank A. Pfitzmann and B. Pfitzmann for their help in the preparation of this manuscript.

References

[1] Bailey: Bus standards enrich microcomputer board variety. Electronic Design, April 15, 1982, p. 77.
[2] Thurber, K., Jensen, A., Jack, L., Kinney, L., Patton, P., Anderson, L.: A systematic approach to the design of digital bussing structures. In: AFIPS Fall Joint Computer Conference, 1972, pp. 719-740.
[3] Siewiorek, D., Swarz, R.: The theory and practice of reliable system design. (Digital Press, 1982).
[4] Pfitzmann, A., Härtig, H.: Grenzwerte der Zuverlässigkeit von Parallel-Serien-Systemen. In: Nett, E. and Schwärtzel, H. (eds.), Fehlertolerierende Rechnersysteme (Springer-Verlag, Berlin Heidelberg, 1982), pp. 1-11.
[5] Vogt, R.: Ein Verfahren zur Fehlerausgrenzung und Fehlertolerierung in busorientierten Rechensystemen. Accepted paper for the 3rd international conference "Fault-Tolerant Computing Systems", Sept. 9-11, 1987, Bremerhaven (Springer-Verlag, Berlin Heidelberg, 1987).
[6] Amer, H., McCluskey, E.: Calculating the coverage parameter for the reliability modelling of fault-tolerant computer systems. International Symposium on Circuits and Systems, May 1986 (ISCAS 86), San Jose, Ca., 1986.
[7] Dunning, P., Locher, R.: Introduction to power MOSFETs and their applications. Fairchild Application Note DIS-2, January 1985.
[8] HEXFETs - the most reliable semiconductor-devices ever made. International Rectifier Inc., data sheet no. M320, October 1983.
[9] VMEbus specification manual, revision C.1. PRINTEX Publishing Inc., October 1985.
[10] Katzman, J.: A fault-tolerant computing system. Tandem Computers Inc., 1977.