Reliability and safety analysis of fault tolerant and fail safe node for use in a railway signalling system


Reliability Engineering and System Safety 57 (1997) 177-183

ELSEVIER

PII: S0951-8320(97)00020-3

© 1997 Elsevier Science Limited. All rights reserved. Printed in Northern Ireland. 0951-8320/97/$17.00

Vinod Chandra & K. Vijaya Kumar

Department of Electrical Engineering, Indian Institute of Technology, New Delhi-110 016, India

(Received 3 September 1993; accepted 29 January 1997)

In this paper, we propose a Markov reliability model for a transputer-based fail safe and fault tolerant node for use in a network of distributed safety critical railway signalling systems. Using the Markov model, we quantify reliability and safety in terms of the probability of being in an unsafe state and the probability of safe shutdown, both during the useful life period and during the last phase of the bath-tub curve. A fault analysis of the fail safe and fault tolerant node, addressing Byzantine (malicious) faults through an extension of the Byzantine Generals problem, is also given. © 1997 Elsevier Science Limited.

1 INTRODUCTION

Distributed safety critical railway signalling systems are based on fault tolerant and fail safe techniques to provide high reliability and safety. The various safety critical system functions distributed geographically in a railway yard can be grouped and interconnected to form a local area network. For a dual ring topology of the local area network, each node provides the following functions:

1. Supports the fault tolerance provided in the given network topology.
2. Takes care of a single processor fault within the node.
3. Performs digital input/output to drive the safety system functions under its jurisdiction.
4. Ensures a safe reaction in the safety system functions under its jurisdiction in the event of two or more processor faults.

The reliability and safety of the node depend on the failure rate of the components used in the node. To quantify the reliability and safety of the node, a Markov reliability model with a finite number of node states is proposed. The node consists of four transputers and is based on a new technique of leadership by rotation described in [1-3]. The Markov chain contains both safe states and unsafe (dangerous) states of the node. The transient and recurrent analysis of the Markov model is described for a given component failure rate and fault recovery rate, and the state transition probability matrix quantifying the probability of the node being in an unsafe state and the probability of node failure is obtained. The component failure rate follows a bath-tub curve with age, and the worst case probability of being in an unsafe state, the worst case probability of node failure, the mean time to failure (MTTF) of the node and the availability of the node are obtained.

In this paper, Section 2 describes the basic scheme of the fail safe node, Section 3 gives the analysis of Byzantine faults in the fail safe node, Section 4 describes the Markov model for the fail safe node, Section 4.1 gives the reliability and safety analysis of the Markov model, Section 4.2 gives the results and Section 5 gives the conclusions.

2 BASIC SCHEME OF FAIL SAFE NODE

The node of the full duplex dual ring network consists of four transputers (T1, T2, T3, T4) configured as a square mesh using the serial links of the transputers, as shown in Fig. 1 [1-3]. Of the four serial links of each transputer, one maintains connectivity with the neighbouring node, one performs input/output for driving system functions, and the remaining two connect to the two neighbouring transputers of the same node. The serial data of the first channel of the dual ring (DR1)


Fig. 1. Fail safe node using four transputers. (Block diagram: transputers A, B, C and D with serial links 0, 1, 2, 3; dual ring channel 1 (DR1) and dual ring channel 2 (DR2); I/O controller with multiplexer, demultiplexer and clock selection circuit to MUX/DEMUX; two rail fail safe circuit for shutdown logic; AC coupled logic circuit for external reset; vital system outputs and non-vital systems with restrictive, status quo and specific safe states at shutdown.)

is processed by both T1 and T2 in series for the purpose of redundancy. Normally, T1 enables permissive outputs and T2 enables restrictive outputs. Similarly, the serial data from the second channel (DR2) is processed by T3 and T4, where T3 normally enables permissive outputs and T4 enables restrictive outputs. To provide fault tolerance [5-7] in the above architecture, the following options are available:

1. Majority voting of the outputs of the four transputers.
2. Election of a leader from among the four transputers for a fixed tenure of time.
3. Leadership based on rotation.

Majority voting [4] requires an external reliable majority voter, while electing a leader from among the four transputers for a fixed tenure requires an election process, which involves a time overhead. In addition, both these options require fault detection mechanisms. Selecting the leader on a rotation basis for a fixed tenure does not have these disadvantages.
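As an illustration only, leadership by rotation can be sketched as a round-robin schedule over the four transputers. The function name `leadership_schedule`, the 50 ms tenure used in the example (taken from the mean leadership time quoted in Section 4) and the rule that isolated transputers are skipped are assumptions of this sketch, not details of the scheme in [1-3].

```python
from itertools import cycle


def leadership_schedule(transputers, tenure_ms, total_ms, isolated=()):
    """Return (start_ms, master) pairs for a round-robin leadership rotation.

    Assumption of this sketch: transputers listed in `isolated` are simply
    skipped when the leadership is handed over.
    """
    active = [t for t in transputers if t not in isolated]
    if not active:
        raise ValueError("no active transputer left to lead")
    masters = cycle(active)
    # One entry per tenure slot: the slot start time and the master for it.
    return [(start, next(masters)) for start in range(0, total_ms, tenure_ms)]


if __name__ == "__main__":
    for start, master in leadership_schedule(["T1", "T2", "T3", "T4"], 50, 300):
        print(f"{start:4d} ms: master = {master}")
```

No election and no external voter is needed in such a schedule: each transputer knows its slot from the fixed order and the tenure alone.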

3 ANALYSIS OF BYZANTINE FAULTS IN FAIL SAFE NODE

In the fail safe node, conflicting messages about the health of the system can be received from the two neighbouring transputers. For example, one transputer calls for safe system shutdown on account of a failure, while another transputer, on account of a malicious fault (Byzantine fault), does not call for safe system shutdown for the same failure. This type of disagreement has been reported as the Byzantine Generals problem [8, 9], which is briefly described below.

Several divisions of the Byzantine army, each commanded by its own General, camp outside an enemy city. After observing the enemy, the Generals must decide upon a common battle plan. Generals from different divisions communicate only through messengers, and some of the Generals may be traitors. An algorithm must be designed to meet the following requirements:

1. All loyal Generals decide upon the same plan of action.
2. A small number of traitors cannot cause the loyal Generals to adopt a wrong plan.

Theorem 1 of [9] shows that the problem is solvable if and only if $n \ge 3m + 1$, where $m$ is the number of traitors and $n$ is the total number of Generals. It is assumed that the Generals communicate through a fully connected network and that a loyal General need not know who the traitors are. Some extensions to the Byzantine Generals problem on a partially connected network with authenticated messages have been reported in [10].

Applying Theorem 1 of [10] to the fail safe node, which is not a fully connected network, tolerating one Byzantine fault requires four transputers interconnected by at least two disjoint paths between each pair of transputers. A fail safe node consisting of three or fewer transputers cannot agree on the same result in the event of a single Byzantine fault, and therefore the fail safe node should have at least four transputers. It is assumed that a hardware fault in the processing element, a fault in the communication system, or a fault in the software being run in the processing element corresponds to a malicious fault (Byzantine fault) and hence to a traitor.

3.1 Extension of Byzantine generals on a network of interconnected fail safe nodes with authenticated messages

Let each fail safe node have $n$ transputers (Generals), where $n$ is even. Each fail safe node is a partially connected network in which the smallest number of disjoint paths between each pair of transputers (Generals) is two. There are $N$ fail safe nodes interconnected to form $n/2$ multiple rings (Fig. 1 shows a dual ring for $n = 4$), with one of the $N$ fail safe nodes acting as a command unit (central controller). Let each transputer of the fail safe node be designated as $p_{ij}$, where $i$ is the number of the transputer within the node ($1 \le i \le n$) and $j$ is the node number ($1 \le j \le N$). The command unit fail safe node guides each transputer (General) of the fail safe node when it is indecisive.

In the $n/2$ multiple ring network, consider a fail safe node $j$ with its adjacent nodes $j - 1$ and $j + 1$. For all $j$, the activities of the transputers (Generals) of fail safe node $j$ are monitored by the neighbouring transputers of nodes $j - 1$ and $j + 1$ by executing a process S. Each transputer $p_{ij}$ performs the tasks of the fail safe node and, in addition, executes a process S in parallel to observe the activities of the neighbouring transputer $q_{uv}$ (where $1 \le u \le n$ and $1 \le v \le N$) and conveys them to the command unit. This process S is called the spy process and differs from a General of the fail safe node in the following ways:

1. The spy process does not have any control over the outputs of the fail safe node or over the transputer (General) on which it is spying.
2. A spy process does not require any additional resources; it runs in parallel with the General's functions of the fail safe node on each transputer.

Thus each transputer has a dual role: it performs the functions of the fail safe node to which it belongs and, at the same time, spies on the activities of the General of the adjacent node connected to it and conveys them to the command unit. The spy does not take any decisions but conveys the following in a message frame to the command unit:


1. Aberration in the incoming data rate from the General of the adjacent node connected to it.
2. Deviation in the limits of the incoming data from the General.
3. Deviation in the leadership time of the General of the adjacent node connected to it.

There are $r$ spy processes for the $n$ Generals of each fail safe node. Each of the $r$ spy processes (where $r \le n$) conveys a message about the activities of the General of the adjacent node to the command unit through the partially connected network with authenticated messages. The command unit decides whether a given General of a fail safe node is a traitor or not, based on the messages from the $r$ spy processes monitoring the $n$ Generals. The following assumptions are made:

1. An alternate external path is available through the $n/2$ multiple ring network for the loyal Generals of each fail safe node to exchange authenticated messages.
2. If the number of traitor Generals of a fail safe node is more than the number of loyal Generals, the node is not valid; it is shut down and isolated from the network.

If a General becomes a traitor, the spy reports this in a message. Each loyal General (non-faulty transputer) of the fail safe node detects that there is a General who is differing, and this is monitored by the spy. The command unit receives the messages from all $r$ spies and then ascertains whether the General is the traitor or the spy of the General is the traitor.

Lemma. If $n$ is the number of transputers (Generals) in the fail safe node (where $n$ is even), $r$ is the number of spy processes (where $2 < r \le s$ for any $s \ge n$) and $r'$ is the number of traitor spy processes ($r' \le r$), then the number of traitor Generals $t$ that can be tolerated within the fail safe node to reach an agreement among the loyal Generals of the fail safe node is

$$t \le n/2 \quad \text{if } r' < r/2, \tag{1}$$

$$t \le (n-1)/3 \quad \text{if } r' \ge r/2. \tag{2}$$
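The two cases of the lemma can be written as a small helper. This is a minimal sketch: the function name `max_tolerable_traitors` is hypothetical, and rounding the bounds $n/2$ and $(n-1)/3$ down to whole Generals is an interpretation made here, not something the lemma states.

```python
def max_tolerable_traitors(n, r, r_traitor):
    """Largest whole number of traitor Generals tolerated inside one node.

    n         -- number of transputers (Generals) in the node, even
    r         -- number of spy processes
    r_traitor -- number of traitor spy processes (r' in the lemma)
    """
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even number")
    if not 0 <= r_traitor <= r:
        raise ValueError("need 0 <= r' <= r")
    if 2 * r_traitor < r:
        # r' < r/2: the command unit's majority vote over the spies holds.
        return n // 2
    # r' >= r/2: only the bound (n - 1)/3 remains (rounded down here).
    return (n - 1) // 3


if __name__ == "__main__":
    print(max_tolerable_traitors(4, 4, 1))  # fewer than half the spies are traitors
    print(max_tolerable_traitors(4, 4, 2))  # half the spies are traitors
```

For the four-transputer node of Fig. 1 this gives two tolerable traitors while fewer than half of the spies are traitors, and one otherwise.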

Proof. As per Theorem 1 of [10], for the partially connected network of $n$ Generals of a fail safe node with the smallest number of disjoint paths between each pair of Generals being two, the number of traitors that can be tolerated within the fail safe node is less than or equal to 2 for the loyal Generals to reach an agreement. If the number of traitors within the fail safe node exceeds two, it is not possible for the loyal Generals to exchange authenticated messages because of relay traitors. However, as an external alternate path is available through the neighbouring transputers, loyal Generals can exchange authenticated messages by rerouting them through the alternate route provided by the adjacent fail safe node or its neighbours. Therefore, the number of traitors that can be tolerated within the fail safe node, from Theorem 1 of [10], is less than or equal to $(n-1)/3$.

The command unit performs majority voting on the messages received from the $r$ spies to identify the traitor General. If $t > (n-1)/3$, the loyal Generals, which cannot come to an agreement, receive the information about the traitor Generals from the command unit, which then enables them to reach agreement. If more than half of the spies become traitors ($r' > r/2$), the command unit may reach a false decision. If $r' < r/2$ and more than $n/2$ Generals are traitors, the fail safe node goes to the safe shutdown state. Therefore, if $r' < r/2$, the number of traitors that can be tolerated within the fail safe node is less than or equal to $n/2$. If $r' \ge r/2$, the number of traitors that can be tolerated within the node is less than or equal to $(n-1)/3$.

3.2 Safe shutdown model

In this model, for a four transputer fail safe node as shown in Fig. 1, a transputer can exhibit arbitrary behaviour on account of a first single Byzantine fault. Based on Theorem 1 of [10], the remaining three non-faulty transputers arrive at the same result. The fault detection mechanism of the fail safe node detects and isolates the faulty transputer within 200 ms. The fail safe node continues to operate in degraded mode with the three non-faulty transputers. If a second Byzantine fault occurs in any one of the remaining three transputers, the two non-faulty transputers, with the help of the command unit, independently initiate safe shutdown action.

3.3 The fault coverage of the safe shutdown model

The fault coverage, giving the output of each transputer of the fail safe node for each Byzantine fault condition, is as follows:

Byzantine fault free: the output is based on the decentralised majority voting of 3 out of 4.

First (Byzantine) fault in any one of the four transputers: fault detection, and isolation of the faulty transputer by the three non-faulty transputers on agreement of the same.

First Byzantine fault detected and isolated, and second Byzantine fault not occurred: the output is based on the decentralised agreement of the 3 non-faulty transputers.

Occurrence of a second Byzantine fault in any one of the three remaining transputers: the two non-faulty transputers initiate safe shutdown.
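The fault coverage can be sketched as two small functions, one for the operating mode of the node and one for the agreement on an output. The names `node_mode` and `agreed_output`, the mode labels and the example output values `"go"` and `"stop"` are hypothetical; the sketch assumes agreement needs at least three identical outputs, following the 3-out-of-4 rule described for the four-transputer node.

```python
from collections import Counter


def node_mode(byzantine_faults):
    """Map the number of Byzantine faults seen so far to the node's mode."""
    if byzantine_faults < 0:
        raise ValueError("fault count cannot be negative")
    if byzantine_faults == 0:
        return "normal"          # 3-out-of-4 decentralised majority voting
    if byzantine_faults == 1:
        return "degraded"        # faulty transputer isolated, 3 remain
    return "safe_shutdown"       # second fault: remaining two shut down


def agreed_output(outputs, quorum=3):
    """Return the value at least `quorum` transputers agree on, else None."""
    if not outputs:
        return None
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes >= quorum else None


if __name__ == "__main__":
    print(agreed_output(["go", "go", "go", "stop"]))    # one deviating output
    print(agreed_output(["go", "go", "stop", "stop"]))  # no 3-out-of-4 agreement
    print(node_mode(2))
```

A single deviating transputer is outvoted, while a two-against-two split yields no agreed output, which corresponds to the case where the node has to fall back to safe shutdown.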

4 MARKOV MODEL FOR FAIL SAFE NODE

The Markov reliability model of the fail safe node is shown in Fig. 2. The model has a finite number of states which together form a continuous-time Markov chain. The transitions, representing constant failure rates and fault recovery rates, are described below:

$\lambda$: Failure rate of a transputer during the useful life period of the bath-tub curve = $6.3246 \times 10^{-8}$/sec (1 in 6 months).
$\lambda_S$: Failure rate of a slave transputer = $1.14 \times 10^{-5}$/sec ($\lambda \times 0.75$).
$\lambda_M$: Failure rate of the rotating master transputer = $6.954 \times 10^{-8}$/sec ($\lambda \times 0.25$).
$\lambda_D$: Failure rate of a node which already contains multiple undetected faults leading to potentially dangerous unsafe output = $\lambda \times 2.5/10^{-4}$.
$\mu_d$: Single fault detection and isolation rate = 5/sec (fault detection and isolation time of 200 ms).
$\mu_l$: Leadership rotation rate = 20/sec (mean leadership time 50 ms).
$\mu_{reset}$: Manual reset rate for repair and restart = 1 in 120 sec (2 minutes).
$\mu_{manual}$: Manual intervention rate for taking care of dangerous output = 1 in 3600 sec (one hour).
$\lambda_{f1}$: Failure rate of a transputer during the first 15 days of the last phase period of the bath-tub curve = $2\lambda$.
$\lambda_{f2}$: Failure rate of a transputer during the second 15 days of the last phase period of the bath-tub curve = $4\lambda$.

The nine states of the state transition diagram are described below:

Fig. 2. Markov reliability model for fail safe node.

State 1. All four transputers of the node are in operation, as shown in Fig. 2. In the event of a single fault in the master transputer, the transition to state 7 occurs. In the event of a single fault in any one of the three slave transputers, the transition to state 2 occurs.

State 7. A first single fault has occurred in the master transputer while all four transputers are in operation. After the leadership time, the transition to state 2 occurs for fault detection and isolation. The present master transfers its leadership to the next slave transputer on the basis of rotation.

State 2. A first fault has occurred in any one of the three slave transputers and has not yet been detected. This state also covers a single fault in a master transputer which has not been detected within the leadership time of that master. If this single fault is detected and isolated within the fault detection time, the transition to state 4 occurs. If a second fault occurs in the node before the first fault is detected, the transition to state 3 occurs. This means that any two out of the four transputers have faults not yet detected.

State 3. At least two out of the four transputers have faults that are not detected and isolated. The faults may or may not be correlated, and they may or may not lead to a potentially unsafe state. The faults could be in hardware or software. Multiple faults in more than one transputer that lead the faulty transputers to the same, potentially dangerous unsafe output enable the transition to state 9.

State 9. This is a potentially dangerous unsafe state in which multiple faults have occurred in the node and are yet to be detected and isolated. These multiple faults lead to potentially dangerous outputs, which are to be observed and corrected by external manual intervention. The system has not yet shut down, but the incorrect dangerous outputs are to be corrected by external manual intervention before or after the effect of these outputs.

State 4. The node operates with degraded performance. The faulty transputer has been isolated and the node is in operation with three transputers. In the event of a single fault in any one of the remaining two slave processors, the transition to state 5 occurs. In the event of a fault in the master transputer, the transition to state 8 occurs.

State 8. A first single fault has occurred in the master transputer while the node is in the degraded state, with only three of the four transputers in operation. After the leadership time, the transition to state 5 occurs for fault detection and isolation. The mastership is also transferred to the next slave transputer on the basis of rotation.

State 5. The node is in the degraded state, a fault has occurred in any one of the two slave processors, and the fault is not yet detected. This state also covers the degraded node in which a single fault in the master transputer has not been detected within the leadership time. If a second fault occurs in the degraded node before the above first single fault is detected, the transition to state 3 occurs. This means that at least two out of the three transputers in the degraded state are faulty and the faults have not been detected and isolated.

State 6. The single fault in any one of the remaining three transputers of the degraded node has been detected, the faulty transputer has been isolated, and safe shutdown actions have been taken. In the event of repair and manual reset, the node restarts with normal operation from state 1.
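The transitions named in the state descriptions can be collected into a small table, which makes the structure of the chain easy to check. This is a sketch of the structure only, without rates. Two points are assumptions made here rather than statements of the text: the transition from state 5 to state 6 on detection of the fault is inferred from the description of state 6, and no outgoing transition is listed for state 9 because its destination after manual intervention is not stated.

```python
# (source state, destination state) -> event, as read from the state descriptions.
TRANSITIONS = {
    (1, 7): "fault in master transputer",
    (1, 2): "fault in a slave transputer",
    (7, 2): "leadership rotation",
    (2, 4): "fault detected and isolated",
    (2, 3): "second fault before detection",
    (3, 9): "faulty transputers produce the same dangerous output",
    (4, 5): "fault in a slave transputer (degraded node)",
    (4, 8): "fault in master transputer (degraded node)",
    (8, 5): "leadership rotation (degraded node)",
    (5, 3): "second fault before detection (degraded node)",
    (5, 6): "fault detected, safe shutdown (assumed from state 6)",
    (6, 1): "repair and manual reset",
}


def successors(state):
    """Return the sorted list of states directly reachable from `state`."""
    return sorted(dst for (src, dst) in TRANSITIONS if src == state)


def reachable(start):
    """Return the set of all states reachable from `start`, including it."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in successors(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


if __name__ == "__main__":
    for state in range(1, 10):
        print(state, "->", successors(state))
    print("reachable from state 1:", sorted(reachable(1)))
```

A table of this kind is also a convenient starting point for filling in a transition rate matrix once a rate has been assigned to each event.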

4.1 Reliability and safety analysis of fail safe node

The transition probabilities of the Markov chain shown in Fig. 2 satisfy the Chapman-Kolmogorov differential equations [11] given by

$$\frac{dP}{dt} = P(t)Q$$

and

$$\frac{dP}{dt} = QP(t)$$

where $Q$ is the state transition rate matrix and $P(t)$ is the transition probability matrix. The solution to this matrix equation is given by

$$P(t) = P(0)e^{Qt}. \tag{3}$$

Assuming that the eigenvalues of $Q$ are all distinct, $Q$ can be put in the form $Q = MDM^{-1}$, where $M$ is a nonsingular matrix formed with the eigenvectors of $Q$ and $D$ is the diagonal matrix with the distinct eigenvalues of $Q$ as its elements. Then we can obtain

$$P(t) = Me^{Dt}M^{-1}. \tag{4}$$

For the recurrent analysis of the Markov model of the node, the failure rate of the transputer of the node is made to follow a bath-tub curve. The effect of the initial phase of the bath-tub curve (burn-in or infant mortality period) can be neglected, as the node would start its operations only after the burn-in test. In the next phase of the bath-tub curve (useful life period), the failure rate of the transputer is constant. The useful life period is chosen as 10 yrs (typical for commercial equipment). The last phase (wear-out period) of the bath-tub curve, in which the failure rate increases linearly with age, is approximated by two steps of failure rates. The last phase period is chosen as 30 days (720 hrs).

The state transition rate matrix $Q$, with the initial condition $P(0)$, is used to obtain the unknown probabilities $p_{ij}(t)$ from the differential equations above for each of the phases of the bath-tub curve. The matrix $Q$ is built from the failure rate and recovery rate corresponding to each phase of the bath-tub curve. The reliability of the fail safe node is the sum of the probability of being in state 1 and the probability of being in state 4. The probability of being in state 6 (shutdown), $F$, quantifies the probability of safe shutdown failure of the fail safe node. The availability of the node is $(1 - F)$. The safety of the fail safe node depends on the probability of being in the potentially unsafe state 9. As per the design philosophy of the fail safe node, a first single fault is detected and isolated within 200 ms, before a second fault could occur. The safety of the fail safe node is quantified as the probability of being in state 9. The initial conditions are $P(0) = [1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0]$.

For the transient analysis of the Markov model, the above analysis is repeated after setting $\mu_{reset}$ and $\mu_{manual}$ to zero (no manual repair), with the node in operation for a period of one hour. The recurrent and transient analysis of the Markov model has been done and the results are given below.

4.2 Results
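As a minimal illustration of how state probabilities follow from $P(t) = P(0)e^{Qt}$, the sketch below evaluates the matrix exponential in plain Python by scaling and squaring with a truncated Taylor series. This numerical method is chosen here for self-containment and is not the eigenvalue decomposition of eqn (4). The three-state chain in the example (operational, fault undetected, fault isolated) and the failure rate `lam` are hypothetical; only the detection rate of 5/sec is taken from Section 4.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=20):
    """Approximate exp(Q t) by scaling and squaring with a Taylor series."""
    n = len(q)
    a = [[q[i][j] * t for j in range(n)] for i in range(n)]
    # Halve the matrix until its largest absolute row sum is at most 0.5.
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = 0
    while norm > 0.5:
        norm /= 2.0
        squarings += 1
    a = [[x / (2.0 ** squarings) for x in row] for row in a]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term now holds a^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    # Undo the scaling: exp(A) = (exp(A / 2^s)) ^ (2^s).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def state_probabilities(p0, q, t):
    """Return the row vector P(0) exp(Q t)."""
    e = mat_exp(q, t)
    n = len(q)
    return [sum(p0[i] * e[i][j] for i in range(n)) for j in range(n)]


if __name__ == "__main__":
    lam, mu_d = 1.0e-3, 5.0           # lam is a hypothetical failure rate
    q = [[-lam, lam, 0.0],            # operational -> fault undetected
         [0.0, -mu_d, mu_d],          # fault undetected -> fault isolated
         [0.0, 0.0, 0.0]]             # fault isolated (absorbing here)
    print(state_probabilities([1.0, 0.0, 0.0], q, 10.0))
```

Each row of $Q$ sums to zero, so the returned probabilities sum to one. For the nine-state model the same call would be made with a 9 x 9 rate matrix and $P(0) = [1\ 0\ \dots\ 0]$.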

4.2.1 Transient analysis of the Markov model

For one hour of operation of the fail safe node, the transition probability matrix $P(t)$ was obtained. From this matrix, for the initial conditions $P(0)$, the reliability of the fail safe node is obtained as 0.9999689. The probability of safe shutdown failure (probability of being in state 6) is $1.1334 \times 10^{-7}$. The probability of being in the potentially dangerous unsafe state 9 is $5.237 \times 10^{-19}$.

4.2.2 Recurrent analysis of Markov model

For operation of the fail safe node over its useful life (10 yrs), the transition probability matrix $P(t)$ was obtained. From this matrix, for the given initial condition $P(0)$, the probability of safe shutdown failure (probability of being in state 6) during the useful life time is $F = 7.81 \times 10^{-6}$. The availability of the fail safe node is 0.999992 over 10 yrs. The reliability of the fail safe node is 0.99999 over the useful life time of 10 yrs and its MTTF is 9.99991 years. The probability of being in the potentially dangerous unsafe state 9 is $5.36 \times 10^{-14}$. During the last phase of the bath-tub curve, the reliability of the fail safe node is 0.999963, the probability of being in unsafe state 9 is $1.01 \times 10^{-14}$, the probability of safe shutdown failure of the fail safe node is $3.66 \times 10^{-5}$ and the availability of the node is 0.999965.

5 CONCLUSIONS

The Byzantine Generals problem has been applied to fail safe nodes, and the number of traitor Generals that can be tolerated within the fail safe node has improved from $\le (n-1)/3$ to $\le n/2$ by means of an extension of the Byzantine Generals problem on the network of interconnected fail safe nodes with authenticated messages using spy processes. The transient and recurrent analysis of the Markov model of fail safe nodes has enabled us to quantify reliability and safety during the useful life period and the last phase period of a bath-tub curve: the worst case probability of the node being in an unsafe state is $5.36 \times 10^{-14}$, the worst case probability of safe shutdown failure of the node is $3.66808 \times 10^{-5}$, and the worst case reliability of the node is 0.999963. The increase in failure rate during the last phase of the bath-tub curve would result in an increase in the manual reset rate ($\mu_{reset}$). The increase in the value of $\mu_{reset}$ can be monitored and, if it exceeds a given limit, the node can be replaced.

REFERENCES

1. Kumar, K. V. and Chandra, V., Transputer-based fault-tolerant and fail-safe node for dual ring distributed railway signalling systems. Microprocessors and Microsystems, 1994, 18(8), 141-150.
2. Kumar, K. V. and Chandra, V., A fail safe node using transputers for railway signalling applications. In IEEE TENCON, Melbourne, Australia, 1992.
3. Kumar, K. V. and Chandra, V., Simulation of multi-transputer fault tolerant system for railway safety applications. In European Simulation Multiconference, Lyon, France, 1993.
4. Standven, J., Colley, M. J. and Lyons, D. M., Hardware voter for fault-tolerant transputer systems. Microprocessors and Microsystems, 1989, 13(9), 588-596.
5. Thompson, H. A., Transputer-based fault tolerance in safety-critical systems. Microprocessors and Microsystems, 1991, 15(5), 243-248.
6. Thompson, H. A., Fault-tolerant-based controller configurations for gas-turbine engines. IEE Proceedings, 1990, 137(D4), 253-260.
7. Garcia-Nocetti, F., Thompson, H. A., De Olivera, M. C. F., Jones, C. M. and Fleming, P. J., Implementation of a transputer-based flight controller. IEE Proceedings, 1990, 137(D3), 130-136.
8. Pease, M., Shostak, R. and Lamport, L., Reaching agreement in the presence of faults. Journal of ACM, 1980, 27(2), 228-234.
9. Lamport, L., Shostak, R. and Pease, M., The Byzantine Generals problem. ACM Transactions on Programming Languages and Systems, 1982, 4(3), 382-401.
10. Wu, J. and Fernandez, E. B., Some extensions and applications of the Byzantine Generals problem. Dept of Computer Engineering, Florida Atlantic University, Boca Raton, Florida.
11. Viswanadham, N., Sarma, V. V. S. and Singh, M. G., Reliability of Computer and Control Systems. North-Holland Systems and Control Series, volume 8, 1987.