ARTICLE IN PRESS
Reliability Engineering and System Safety 93 (2008) 1720–1729 www.elsevier.com/locate/ress
Modeling demand rate and imperfect proof-test and analysis of their effect on system safety
Manoj Kumar a,*, A.K. Verma b, A. Srividya b
a Control Instrumentation Division, BARC, Trombay, Mumbai 400085, India
b Reliability Engineering Group, IIT Bombay, Powai, Mumbai 400076, India
Received 28 February 2007; received in revised form 27 November 2007; accepted 2 December 2007 Available online 15 December 2007
Abstract
Quantitative safety assessment of a safety system plays an important role in comparing design alternatives at design stage and deciding appropriate design options to apply for safety systems. There are a number of such indices given in the literature. Most of the safety indices consider only system parameters (hazard rate, repair rate, diagnosis, coverage, etc.) along with proof-tests (or inspection). This paper extends the underlying model to incorporate demand rate and imperfect proof-tests. It also introduces a new safety index, average probability of failure on actual demand (PFaD), and an availability index, manifested availability (mAv). This paper uses Markov regenerative process-based analysis for state probabilities. Based on state-probability values of various states of the underlying Markov chain, solutions are derived for safety index PFaD and availability mAv.
© 2008 Elsevier Ltd. All rights reserved.
Keywords: Probability of failure on demand; Fail dangerous; Fail safe; Markov model; Markov regenerative process; Continuous-time Markov chain; Proof-test; Availability
* Corresponding author. Tel.: +91 22 25591822; fax: +91 22 25505151. E-mail address: [email protected] (M. Kumar).
0951-8320/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.ress.2007.12.001

1. Introduction
Safety systems are used for automatic shutdown of equipment under control (EUC), whenever the equipment or plant parameters go beyond the acceptable limits for more than acceptable time. These kinds of systems are used in a variety of industries, such as oil refining, nuclear power plants, chemical and pharmaceutical manufacturing, etc. When the safety system is functioning correctly (successfully), it permits the EUC to continue provided its parameters remain within safe limits. If the parameters move outside of an acceptable operating range for a specified time, the safety system automatically shuts down the EUC in a safe manner.
Safety systems generally have some redundancy and can tolerate some failures while continuing to operate successfully. As discussed in Refs. [1–3,11], a system's independent channels can fail, leading the system to the following states:
1. Safe failure (SF) state: where it erroneously commands to shut down a properly operating equipment. Taking a channel offline and shutting down of a channel is also referred to as safe failure.
2. Fail dangerous detected (DD) state: where a channel(s) has (have) failed in dangerous mode, but is (are) detected by internal diagnostics, and announced.
3. Fail dangerous undetected (DU) state: where a channel(s) has (have) failed in dangerous mode and is (are) not detected by internal diagnostics, hence not announced.
The safety system can fail in two distinctly different ways [1–3,7,11]:
1. Safe failure (FS): Failure that does not have potential to put the safety system in a hazardous or fail-to-function state [1]. This occurs when more than tolerable numbers
of channels are in safe failure. This type of failure is referred to in a variety of ways including fail safe [3,7], false trip and false alarm.
2. Dangerous failure (DF): Failure that has the potential to put the safety system in a hazardous or fail-to-function state [1]. More than tolerable numbers of channels in DD and/or DU lead to this failure. The system fails in such a way that it is unable to shut down the EUC properly when shutdown is required (or demanded). Dangerous failures are important from a safety point of view.
A survey of recent research work related to safety quantification indicates that there are diverse safety indices, methods and assumptions about safety systems. Safety indices used are PFD (probability of failure on demand) [1,2,4,11–15], MTTFD (mean time to dangerous failure) [3,7], MTTFsys (mean time to system failure) [5], MTTUF (mean time to unsafe failure) and SSS (steady-state safety) [6], and MTTHE (mean time to hazardous event) [9]. Simplified equations [1,4,11–13], Markov models [2,3,5–7,9,10,14,15] and fault trees [12] are the methods used for safety quantification. Safety indices of [6,9] consider only repair, [3] considers repair as well as periodic inspection to uncover undetected faults, [1,2,4,11–13] consider common cause failure (CCF) and periodic inspection along with repair, and [10] considers demand rate. Ref. [4] discusses the CCF model (beta factor) of [1] and suggests a generalization, the multiple beta factor (MBF) model.
Safety index PFD [1] has already been published as a standard. We take this index as a basis for this paper. IEC [1] gives simplified equations for safety evaluation. A review of different techniques by Rouvroye [8] suggests that Markov analysis covers most aspects for quantitative safety evaluation. Bukowski [14] compares various techniques for PFD evaluation and defends Markov models. Zhang [2] provides a Markov model for PFD evaluation that considers neither demand rate nor imperfect proof-tests. Bukowski [10] gives a safety measure based on PFD considering demand rate, but this model does not consider periodic proof-tests. A detailed comparison of these models with the one proposed here is given in Section 2.
The system model taken here for analysis is similar to the model of IEC [1] and uses the Markov model for analysis. This model explicitly considers periodic proof-test (perfect or imperfect), demand rate and safe failures. Incorporation of safe failure enables modeling of all possible system states and estimation of additional measures such as availability (or probability of being in one or more specified states) for a given amount of run time.
The paper is organized as follows. Section 2 gives system description and assumptions about the system to derive a Markov model. Performance-based safety index and availability are derived in Section 3. In Section 4 an example is taken to illustrate computation of safety index and availability. Advantage of modeling safe failures and availability along with safety index is discussed in Section 5. Conclusions are given in Section 6.

Nomenclature
CTMC    continuous-time Markov chain
CCF    common cause failure
DC    diagnostic coverage
DD    dangerous detected (failure category in IEC 61508)
DU    dangerous undetected (failure category in IEC 61508)
SF    safe failure (failure category in IEC 61508)
DF    dangerous failure (failure category in IEC 61508)
DEUC    damage to EUC (or accident)
EUC    equipment under control or process plant
PFD    average probability of failure on demand
PFaD    average probability of failure on actual demand
MTBD    mean time between demands
MBF    multiple beta factor
Tproof    proof-test interval
mAv    manifested availability
FDU    dangerous undetected state of the safety system
FS    safe failure state of the safety system
$\lambda_{SF}$    hazard rate of a channel leading to SF
$\lambda_{DD}$    hazard rate of a channel leading to DD
$\lambda_{DU}$    hazard rate of a channel leading to DU
$\mu$    repair rate of a channel in FS
$\mu_P(t)$    time-dependent proof-test rate
$\lambda_{arr}$    demand arrival rate (1/MTBD)
$\Delta$    probability redistribution matrix
demand    refers to a condition when the safety system must shut down EUC. The condition arises when EUC parameters move outside of an acceptable operating range for a specified time. The safety system shuts down the EUC by opening its output control switch(s)
IEC    safety standard IEC 61508

2. System model
The systems being discussed here fall into the category of programmable electronic systems (PES) as defined in IEC [1]. These systems are used for control, protection or monitoring based on one or more programmable electronic devices [1]. The elements of the system (sensors, processing devices, actuators, power supplies and wiring, etc.) are grouped into channels that independently perform a function. To model the system, most of the assumptions taken here are similar to those given in Annex B of part 6 of IEC [1]. Assumptions such as (i) failure rates are constant over
system life and (ii) channels in a voted group all have the same failure rate, diagnostic rate, diagnostic coverage, mean time to restore and proof-test interval are taken unchanged. Some of the assumptions of IEC [1] are modified or generalized. These are given below along with new ones:
(i) The overall failure rate of a channel is the sum of the dangerous failure rate and the safe failure rate for that channel. Their values need not be equal. This generalizes the assumption [1,11] that these two failure rates are equal in value.
(ii) At least one repair team is available to work on all known failures. This is the generalization of the assumption in [1,11]. One repair facility works for one known failure only. Availability of a single repair crew in many cases has been discussed in [14].
(iii) The fraction of failures specified by diagnostic coverage is detected, and the corresponding channel is put into a safe state and restored thereafter. This assumption is contrary to the assumption of IEC [1] for the low-demand case, which assumes online repair, but in the high-demand case IEC [1] assumes that a system achieves a safe state after detecting a dangerous fault. With this assumption, 1oo1 and 1oo2 voted groups put the EUC in a safe state on any detected fault.
(iv) A failure of any kind (SF, DD, DU) that has once occurred to any channel cannot be changed to other types without being restored to a healthy state [6]. This means that if a channel fails to SF state, then unless it is repaired back to a healthy state it cannot have failures of type DD or DU, and vice versa.
(v) Proof-tests (inspections or functional tests) are conducted online. Proof-test of a healthy channel changes neither the system's state nor the EUC's, while a channel with undetected faults is put into a safe state following proof-tests. This is a new assumption. It is mainly based on the practice followed in the nuclear industry.
(vi) Proof-tests are periodic with negligible duration. The proof-test interval is at least 3 orders of magnitude greater than the diagnostic test interval. This assumption modifies the assumption of IEC [1], which puts the limit at 1 order of magnitude. It is based on the fact that the diagnostic test interval is usually less than or equal to tens of seconds, while proof-test intervals are not less than a day.
(vii) The expected interval between demands is at least 3 orders of magnitude greater than the diagnostic test interval. IEC [1] defines two different limits for low-demand and high-demand modes of operation. Here the limit of high-demand operation is taken, with the limit increased to 3 orders. This is based on the assumption that the expected interval between demands is not less than a day [10] even in high-demand mode.
(viii) On occurrence of a safe fault, the channel is put into a safe state, independent of other channels. Hence all safe failures, even in voted groups, are detectable.
(ix) Time between demands is assumed to follow an exponential distribution with the demand rate as its parameter. This is as in Bukowski [10].
(x) Time to restart the EUC following safety action by the safety system on demand is assumed to be negligible.
(xi) Following safe failure of a safety system, the EUC can be started as soon as a sufficient number of channels of the safety system are operational.
(xii) The fraction of failures that have a common cause is assumed to be equal for both safe and dangerous undetected failures.
A state-transition diagram of a generic system is given in Fig. 1. System state OK represents a healthy state of all its channels. The state in which some channels are in SF or DU state, but a sufficient number of channels are healthy to take safety action on demand, is denoted by Dr. When more than the tolerable number of channels are in SF state or in DU state, the system goes to FS or FDU, respectively. A demand for safety action when the system is in FDU leads to damage to EUC (DEUC), i.e. an accident condition. Transitions from OK state to Dr state are represented by $\lambda_{1,2}$. This includes CCF of more than one channel, considering MBF [4]. Failure of all channels due to CCF to the safe or dangerous undetected state is represented by $\lambda_{1,3}$ and $\lambda_{1,4}$, respectively. Further safe or DU faults of healthy channels from system state Dr lead the system to the safe state (FS) or dangerous undetected state (FDU), respectively. These transition rates are denoted by $\lambda_{2,3}$ and $\lambda_{2,4}$, including CCF of more than one healthy channel. By means of repair, channels with SF are restored to healthy states; this is denoted by $i\mu$. Here $i$ denotes the number of known failures or the number of identical repair facilities, whichever is smaller.
From FDU the system goes to the DEUC state on demand arrival, as represented by $\lambda_{arr}$.
Fig. 1. Generic Markov model for a safety system.
Proof-test is modeled by $\mu_P(t)$. A proof-test converts channels with DU failure to SF.
The state-transition diagram of Fig. 1 explicitly considers safe failures. Refs. [3,7,15] also model the safe failure state for different purposes. Here the intention is to be able to estimate the probability of being in all possible states up to a specified time.
In addition to IEC [1,2,11], periodic inspections (proof-tests) are considered in [3]. Bukowski [3] defined MTTFD and MTTFS. While estimating MTTFS (or MTTFD), repair (or restoration) from MTTFD (or MTTFS) is assumed. This means it gives the mean time to reach the safe (or dangerous) state irrespective of the number of visits to the dangerous failure (or safe) state. Similarly, it defines availability as "probability that a system is successfully operating at time t without regard to previous failures or repairs".
Demand rate is incorporated in [10] as in our model. If the system is in a dangerous failure state on demand arrival, then the system goes to a state similar to DEUC of our model. Except for the modeling of demand, the model of Ref. [10] is totally different from the one considered here. The key differences are that the model of Ref. [10] (i) changes the safety system state on demand arrival in healthy states, (ii) considers online repair, (iii) does not incorporate safe failures, (iv) does not incorporate CCF, and (v) does not incorporate periodic proof-tests (or inspection or functional test). Our model can be considered as combining, in a conceptual sense, the ideas of periodic proof-test [3] and process demand rate [10].

3. Performance-based safety and availability indices
All transitions of Fig. 1, except $\mu_P(t)$, are constant and independent of time. Excluding this transition (i.e. $\mu_P(t) = 0$ for all $t$) transforms the state-transition diagram into a continuous-time Markov chain (CTMC) [19,20]. The infinitesimal generator matrix $Q$ of the CTMC of Fig. 1 is given by
\[ Q = \begin{bmatrix} \Lambda_{TT} & 0 \\ \Lambda_{TA} & 0 \end{bmatrix}. \tag{1} \]
The CTMC of Fig. 1 is absorbing, so its infinitesimal generator matrix is singular. This is also obvious from (1).
To analyze such a system analytically, we use the Darroch and Seneta [18] technique of considering only the transient states. $\Lambda_{TT}$ is the infinitesimal generator matrix of the CTMC restricted to the transient states. All transient states of the CTMC communicate with the absorbing state, ensuring that $\Lambda_{TT}$ is regular (i.e. $|\Lambda_{TT}| \neq 0$) [18]. Time-dependent transient state probabilities are given by solving the Chapman–Kolmogorov equation [19,20]:
\[ \dot{p}(t) = \Lambda_{TT}\, p(t), \qquad p(t) = [P_1(t)\;\; P_2(t)\;\; P_3(t)\;\; P_4(t)]. \tag{2} \]
Solution of (2) gives time-varying transient state probabilities without periodic proof-tests. Incorporation of periodic proof-tests in this model makes it a non-Markovian model. Marsan [16] proposed a method to analyze a non-Markovian system for steady states, which satisfies two conditions: (i) the non-Markovian event is deterministic and (ii) only one deterministic event is enabled at any instant. Mainkar [17] uses a method based on the Markov regenerative process [19] to solve availability problems with a periodic load pattern. In the terminology of Markov regenerative processes, the time instants $T_{proof}$, $2T_{proof}$, $3T_{proof}$, etc. are, in the present context, called Markov regeneration epochs. State probabilities of the model can be obtained by sequentially solving the Markov chain between Markov regeneration epochs and redistributing the state probabilities at the regeneration epochs. Bukowski [3] also uses a similar method to determine MTTF under various conditions, and calls it the piece-wise CTMC method. Here we have used Markov regenerative process-based analysis to determine state probabilities with periodic proof-tests.
State probabilities for time up to the first regeneration epoch can be obtained from (2):
\[ p(t) = e^{\Lambda_{TT} t}\, p(0), \qquad 0 \le t < T_{proof}. \tag{3} \]
State probabilities just before the first proof-test are given by
\[ p(\tau^-) = e^{\Lambda_{TT} \tau}\, p(0), \qquad \tau = T_{proof}. \]
Let the state probability redistribution matrix be given by $\Delta$:
\[ \Delta = [\delta_{ij}]. \]
The redistribution matrix ($\Delta$) is a square matrix. The rows of this matrix correspond to each state of the Markov model. In any state, channels of the system can have OK, SF or DU states. Values of the elements of the redistribution matrix ($\delta_{ij}$) are determined as follows:
\[ \delta_{ij} = \begin{cases}
1 & \forall\, i = j \text{ and all channels are either OK or SF in the system state corresponding to } i; \\
(1-\eta) & \forall\, i = j \text{ and any channel state is DU in the system state corresponding to } i; \\
\eta & \forall \text{ state } i \text{ having channels with SF, state } j \text{ having channels with DU, and conversion of all channels with DU of state } j \text{ to SF leading to state } i; \\
0 & \text{otherwise,}
\end{cases} \]
where $\eta$ is the degree of perfection of proof-tests.
State probabilities just after the first proof-test are given by
\[ p(\tau^+) = \Delta\, e^{\Lambda_{TT} \tau}\, p(0), \qquad \tau = T_{proof}, \]
and
\[ p(t) = p(\tau + S) = e^{\Lambda_{TT} S}\, \Delta\, e^{\Lambda_{TT} \tau}\, p(0), \qquad \tau < t < 2\tau,\ \tau = T_{proof}. \]
Generalization of the above equation gives
\[ p(t) = p(n\tau + S) = e^{\Lambda_{TT} S}\, \alpha^{n}\, p(0), \qquad n\tau < t < (n+1)\tau, \tag{4} \]
where $\alpha = \Delta\, e^{\Lambda_{TT} \tau}$. Let the system run continuously for time duration $T$; then the mean state probabilities for this duration can be computed as
\[ E[p(t)] = \bar{p} = \frac{\int_0^T p(t)\, dt}{\int_0^T dt}, \]
\[ \bar{p} = \frac{1}{T} \left( \sum_{j=1}^{n} \int_{(j-1)\tau}^{j\tau} p(t)\, dt + \int_{n\tau}^{n\tau + S} p(t)\, dt \right), \]
\[ \bar{p} = \frac{1}{T}\, \Lambda_{TT}^{-1} \left( [e^{\Lambda_{TT} \tau} - I]\,[I - \alpha]^{-1}\,[I - \alpha^{n}] + [e^{\Lambda_{TT} S} - I]\, \alpha^{n} \right) p(0), \tag{5} \]
where $T = n\tau + S$. Eq. (5) gives the closed-form solution for the mean state probabilities considering demand rate and periodic proof-test (perfect as well as imperfect).

3.1. Safety index: probability of failure on actual demand (PFaD)
The safety index PFaD is intended to measure the probability of reaching the DEUC state. $p(t)$ of the previous section gives the probability of all the states except DEUC. From the state-transition diagram of Fig. 1, it is clear that the system is conservative. The sum of all state probabilities at any instant is 1, i.e. the system is in one of five states. The probability of not being in any of the states of $p(t)$ is the DEUC state probability, i.e. PFaD. Using (4), PFaD(t) is given by
\[ \mathrm{PFaD}(t) = 1 - [\mathbf{1}_x]\, p(t), \tag{6} \]
where $\mathbf{1}_x$ is a vector of 1's, equal in size to $p(t)^T$.
Mean PFaD can be evaluated using (5) and (6):
\[ \text{mean PFaD} = 1 - [\mathbf{1}_x]\, \bar{p}, \tag{7} \]
where $\mathbf{1}_x$ is a vector of 1's, equal in size to $\bar{p}^T$.
Since the Markov model is absorbing, $p(t)$ decreases to 0 as time increases. So PFaD(t), like a failure distribution (i.e. complement of reliability) [20], is a non-decreasing function of time.

3.2. Comparison with PFDPRS [10]
As discussed earlier, formulation of PFaD can be considered as a combination of the research work of [3,10]. Ref. [3] uses a different index (MTTF) while [10] uses a similar performance-based safety index, PFDPRS(t). PFDPRS(t) does not include periodic proof-test. When proof-tests are not considered in the safety index PFaD(t), then PFaD(t) and PFDPRS(t) are, by definition, the same except for differences in the respective models. To compare PFaD(t) values with those of PFDPRS(t) [10], we take the parameter values of [11]. Ref. [11] does not consider safe failure hazard rate and periodic proof-tests; we assume a safe failure hazard rate equal to zero and keep the proof-test interval longer than the run time for our model. For convenience, all parameter values are given in Table 1. Results of the two models for different demand rates are given in Table 2. At higher demand rates, PFaD(t) values are less than the corresponding PFDPRS(t) values. The main reason for this difference is that Bukowski's [10] model considers online repair and unsafe failure from a dangerous detected state. At low demand rates, both models are in good agreement. Mean PFaD values for the specified time duration are also given in Table 2.
For the same parameter values (i.e. Table 1), average PFD and PFH (probability of failure per hour) [1] are $5.25 \times 10^{-3}$ and $2 \times 10^{-7}$, respectively. These two values are the probability of being in a dangerous undetected state in low and high demand modes, while the safety index average PFaD for the specified run time gives the actual probability of failing on demand.

Table 1
Parameter values used for comparison
Parameter                    Value
$\lambda_S$ (h$^{-1}$)       0
$\lambda_D$ (h$^{-1}$)       $2 \times 10^{-6}$
DC                           0.9
$\eta$                       1
$\mu$ (h$^{-1}$)             1/8
T (h)                        5000
TProof (h)                   6000

Table 2
Comparison of result
MTBD          PFDPRS(t) [10]    PFaD(t)     Mean PFaD    % Difference
1/day         0.0015            0.000995    0.000495     33.6886667
1/week        0.0014            0.000966    0.000467     31.0071429
1/month       0.00085           0.000856    0.000377     0.6752941
1/year        0.00022           0.000238    0.000083     0.3636364
1/10 years    0.00002           0.000028    0.000009     0.3996500
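As a check on the closed form of Eq. (5), the interval-by-interval integration can be written out. The following is only a sketch: it assumes that both $\Lambda_{TT}$ and $I - \alpha$ are invertible, and it uses $\tau = T_{proof}$, $\alpha = \Delta e^{\Lambda_{TT}\tau}$ and $T = n\tau + S$ as in Eq. (4).

```latex
% Integral over the j-th proof-test interval, using p((j-1)\tau + s) = e^{\Lambda_{TT} s} \alpha^{j-1} p(0):
\int_{(j-1)\tau}^{j\tau} p(t)\,dt
  = \int_{0}^{\tau} e^{\Lambda_{TT} s}\,\alpha^{j-1} p(0)\,ds
  = \Lambda_{TT}^{-1}\,[e^{\Lambda_{TT}\tau} - I]\,\alpha^{j-1} p(0).

% Summing the n complete intervals is a matrix geometric series:
\sum_{j=1}^{n} \alpha^{j-1} = [I - \alpha]^{-1}\,[I - \alpha^{n}].

% Remaining partial interval of length S:
\int_{n\tau}^{n\tau + S} p(t)\,dt
  = \Lambda_{TT}^{-1}\,[e^{\Lambda_{TT} S} - I]\,\alpha^{n} p(0).
```

Adding the two contributions and dividing by $T$ gives the expression in Eq. (5).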
3.3. Availability index: manifested availability
PFaD(t) denotes the fraction of systems that have failed on demand by time t. Failing on demand often endangers the plant or EUC. In this model we have assumed that safety systems that failed on demand cannot be brought back to operation. If a safety system survives unsafe failures (i.e. does not reach DEUC) up to time t, then the conditional state probability is given as
\[ \hat{p}_i(t) = \frac{p_i(t)}{\sum_i p_i(t)} = \frac{p_i(t)}{1 - \mathrm{PFaD}(t)}. \tag{8} \]
With this condition, the probability of not being in FS gives the availability of the safety system. This availability is termed "manifested availability" (mAv). The average mAv value up to time t is given as
\[ \text{avg. } \mathrm{mAv}(t) = 1 - \sum_i \hat{p}_i(t), \tag{9} \]
where the sum runs over the system states $i \in \mathrm{FS}$. This definition of availability is different from that of Bukowski [3], which considers the probability of being in state OK (with reference to the Markov model of Fig. 1). mAv takes into account both types of failures affecting the availability of the EUC.
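The piecewise solution of Eq. (4), together with Eqs. (6), (8) and (9), can be sketched numerically. The example below is an illustration under stated assumptions, not the model of this paper: it uses a hypothetical single-channel system with transient states ordered (OK, FS, FDU), treats detected dangerous failures as transitions to FS, and borrows illustrative rate values. The helper names (`expm`, `state_probs`, `pfad`, `manifested_availability`) are invented for this sketch, and the matrix exponential is a simple scaling-and-squaring Taylor approximation.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_vec(a, v):
    """Product of a matrix and a column vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def expm(a, t):
    """exp(a*t) by scaling and squaring with a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a) * t
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    step = t / 2 ** squarings
    term = [[float(i == j) for j in range(n)] for i in range(n)]
    result = [row[:] for row in term]
    for k in range(1, 20):
        term = [[x * step / k for x in row] for row in mat_mul(term, a)]
        result = [[r + s for r, s in zip(rr, tt)] for rr, tt in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical single-channel model (assumption); all values are illustrative.
LAM_S, LAM_D, DC, ETA, MU, MTBD = 5e-6, 5e-6, 0.9, 0.9, 1 / 8, 43800.0
lam_safe = LAM_S + DC * LAM_D   # safe plus detected dangerous failures -> FS
lam_du = (1 - DC) * LAM_D       # dangerous undetected failures -> FDU
lam_arr = 1 / MTBD              # demand arrival rate, FDU -> DEUC (absorbing)

# Column convention: entry [i][j] is the rate from state j to state i.
LAMBDA_TT = [
    [-(lam_safe + lam_du), MU, 0.0],
    [lam_safe, -MU, 0.0],
    [lam_du, 0.0, -lam_arr],
]
# Redistribution at a proof-test: FDU moves to FS with probability ETA.
DELTA = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, ETA],
    [0.0, 0.0, 1.0 - ETA],
]

def state_probs(t, tau, p0=(1.0, 0.0, 0.0)):
    """Transient state probabilities at time t, following Eq. (4)."""
    n = int(t // tau)               # completed proof-test intervals
    s = t - n * tau                 # time since the last proof-test
    e_tau = expm(LAMBDA_TT, tau)
    p = list(p0)
    for _ in range(n):              # p <- alpha p, with alpha = DELTA * exp(LAMBDA_TT * tau)
        p = mat_vec(DELTA, mat_vec(e_tau, p))
    return mat_vec(expm(LAMBDA_TT, s), p)

def pfad(t, tau):
    """Eq. (6): probability of having reached DEUC by time t."""
    return 1.0 - sum(state_probs(t, tau))

def manifested_availability(t, tau):
    """Eqs. (8) and (9) for this model, where FS is the only unavailable state."""
    p = state_probs(t, tau)
    return 1.0 - p[1] / sum(p)
```

For example, `pfad(87600.0, 8760.0)` evaluates the index after ten yearly proof-tests. Applying the redistribution step by step, rather than forming the n-th power of the matrix, keeps the sketch short; for a real study a tested linear-algebra library would be the safer choice.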
4. Example of PFaD and manifested availability computation
To illustrate the evaluation of PFaD and manifested availability, two commonly used hardware architectures of a safety system, 1oo2 and 2oo3, are chosen. Schematics of these architectures are shown in Fig. 2.
The 1oo2 architecture consists of two s-identical channels with output switches wired in series, also called output ORing. If either channel undergoes SF, then its control switch opens and the EUC is shut down. Therefore, the 1oo2 architecture is sensitive to safe failures of its channels: even a single channel's safe failure causes the EUC to shut down. On the other hand, if one channel is in DU, then its switch is unable to open on demand, but the other channel's switch opens and shutdown of the EUC is ensured. So a 1oo2 system can endanger the EUC only if both channels are in DU at demand; that is, 1oo2 with output ORing can tolerate the DU failure of one channel only. Both channels can fail to SF or DU due to CCF. We use the beta-factor model [1,4] for CCF. The Markov model of the system with one repair station is shown in Fig. 3.
Fig. 2. Schematic of (a) 1oo2 and (b) 2oo3 architectures.
The states of the Markov chain are denoted by a three-tuple, (i, j, k):
i denotes the number of channels in OK state,
j denotes the number of channels in SF state,
k denotes the number of channels in DU state.

Fig. 3. Markov model of 1oo2 system.
The 2oo3 architecture consists of three s-identical channels with a pair of output control switches from each channel. These control switches are used in implementing majority voting logic. The 2oo3 architecture shuts down the EUC when two channels go to SF. It endangers the EUC when two channels are in the dangerous failure state (DU) at demand. This means that 2oo3 can tolerate one channel's safe or dangerous failure. To incorporate CCF we have used the MBF of Hokstad [4]. The advantage of MBF is that it can model a variety of cases: beta-factor, gamma-factor and base case. Beta-factor allows only simultaneous failure of three channels, gamma-factor allows simultaneous failure of two channels only, while base case allows a combination of simultaneous failure of two and three channels. The Markov model of the system with one repair station is shown in Fig. 4.
Fig. 4. Markov model of 2oo3 system.
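The (i, j, k) state notation and the shutdown and danger conditions described above for the two architectures can be sketched in a few lines. The function names and the labels are hypothetical, and giving FS precedence over FDU is an assumption of this sketch (with two or three channels the two conditions cannot hold at the same time anyway).

```python
def channel_states(n_channels):
    """All (i, j, k) with i OK, j SF and k DU channels, where i + j + k = n_channels."""
    return [(n_channels - j - k, j, k)
            for k in range(n_channels + 1)
            for j in range(n_channels + 1 - k)]

def classify(state, sf_to_trip, du_to_fail=2):
    """Label a system state from its channel counts.

    sf_to_trip: number of SF channels that shuts down the EUC
                (1 for 1oo2, 2 for 2oo3, as described in the text).
    du_to_fail: number of DU channels that leaves the system unable
                to act on demand (2 for both architectures).
    """
    _, j, k = state
    if j >= sf_to_trip:
        return "FS"
    if k >= du_to_fail:
        return "FDU"
    return "operational"

# 1oo2: a single safe failure trips the EUC; both channels in DU is dangerous.
labels_1oo2 = {s: classify(s, sf_to_trip=1) for s in channel_states(2)}
# 2oo3: two safe failures trip the EUC; two channels in DU is dangerous.
labels_2oo3 = {s: classify(s, sf_to_trip=2) for s in channel_states(3)}
```

With this enumeration the 1oo2 system has six channel-count states and the 2oo3 system has ten.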
Table 3
Parameter values
Parameter                    Value
$\lambda_S$ (h$^{-1}$)       $5 \times 10^{-6}$
$\lambda_D$ (h$^{-1}$)       $5 \times 10^{-6}$
DC                           0.9
$\eta$                       0.9
$\mu$ (h$^{-1}$)             1/8
$\beta$                      0.1
$\beta_2$                    0.3
T (h)                        87,600
MTBD (h)                     43,800
Fig. 5. Variation of avg. PFaD w.r.t. Tproof.

4.1. Parameter values
Table 3 gives system parameter values, such as the channel hazard rate and repair rate, along with proof-test interval or mission time, used for the example architectures. The probability redistribution matrices, $\Delta$, for the two architectures are derived from the discussion of Section 3. These are given below.
For 1oo2:
\[ \Delta = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0.9 & 0 & 0 \\
0 & 0 & 1 & 0 & 0.9 & 0.9 \\
0 & 0 & 0 & 0.1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.1
\end{bmatrix}. \]
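For 2oo3:
\[ \Delta = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0.9 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0.9 & 0 & 0.9 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0.9 & 0 & 0.9 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0.1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.1
\end{bmatrix}. \]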
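The redistribution rule of Section 3 can be turned into a short routine that builds a redistribution matrix from a list of (i, j, k) states. This is a sketch under assumptions: the function names are hypothetical, the state ordering (by number of DU channels, then by number of SF channels) is chosen only for illustration, and the rule is applied uniformly to every state that contains a DU channel. It is shown for the 1oo2 case with a degree of perfection of 0.9.

```python
def channel_states(n_channels):
    """All (i, j, k) with i OK, j SF and k DU channels, where i + j + k = n_channels."""
    return [(n_channels - j - k, j, k)
            for k in range(n_channels + 1)
            for j in range(n_channels + 1 - k)]

def redistribution_matrix(states, eta):
    """Build the redistribution matrix for a proof-test of perfection eta.

    Entry [row][col] is the probability of moving from state `col` to
    state `row` at a proof-test (column convention, as in p(tau+) = Delta p(tau-)).
    """
    index = {state: n for n, state in enumerate(states)}
    size = len(states)
    delta = [[0.0] * size for _ in range(size)]
    for (i, j, k), col in index.items():
        if k == 0:
            # No DU channel: the proof-test leaves the state unchanged.
            delta[col][col] = 1.0
        else:
            # With probability eta all DU channels are revealed and become SF;
            # otherwise the state is unchanged.
            delta[col][col] = 1.0 - eta
            delta[index[(i, j + k, 0)]][col] = eta
    return delta

states_1oo2 = channel_states(2)
delta_1oo2 = redistribution_matrix(states_1oo2, eta=0.9)
```

Each column of the result sums to 1, which reflects that a proof-test only moves probability between states and does not remove any.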
4.2. Calculation and results
Average PFaD and availability values for the specified period are evaluated using (7) and (9). These values are evaluated with the parameter values given in Table 3 for proof-test intervals from 100 h to 12,000 h in increments of 100 h. The plots of average PFaD and mAv for the 1oo2 and 2oo3 architectures are given in Figs. 5 and 6, respectively.
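The sweep described above needs, for each proof-test interval, the number of completed intervals n and the remaining time S in T = n Tproof + S, as used in Eq. (5). A minimal sketch of that bookkeeping follows; the names `split_mission` and `schedule` are hypothetical, and the mission time is the value of T from Table 3.

```python
def split_mission(mission_time, proof_interval):
    """Return (n, S) such that mission_time = n * proof_interval + S, with 0 <= S < proof_interval."""
    n = mission_time // proof_interval
    return n, mission_time - n * proof_interval

MISSION_TIME_H = 87_600                    # T from Table 3, in hours
proof_intervals = range(100, 12_001, 100)  # 100 h to 12,000 h in steps of 100 h

# Map each proof-test interval to its (n, S) pair.
schedule = {tau: split_mission(MISSION_TIME_H, tau) for tau in proof_intervals}
```

Each entry of `schedule` supplies the n and S for which Eq. (5) would be evaluated at that proof-test interval.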
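Fig. 6. Variation of avg. mAv w.r.t. Tproof.

Table 4
Parameter values for different designs
Parameter                    Case I                Case II                  Case III
$\lambda_S$ (h$^{-1}$)       $2 \times 10^{-6}$    $4.45 \times 10^{-5}$    $4.45 \times 10^{-6}$
$\lambda_D$ (h$^{-1}$)       $2 \times 10^{-6}$    $4.10 \times 10^{-5}$    $4.10 \times 10^{-6}$
C                            0.9                   0.99                     0.99
$\alpha$                     1                     1                        1
$\beta$                      0                     0                        0
$\beta_2$                    0                     0                        0
$\mu$ (h$^{-1}$)             1/8                   1                        1
T (h)                        10,000 (all cases)
MTBD (h)                     [5000, 7500] (all cases)
TProof (h)                   2000 (all cases)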
5. Discussion
For the same settings, PFaD values of the 1oo2 architecture are lower than those of the 2oo3 architecture for all proof-test intervals. PFaD values for both architectures increase with increase in proof-test interval. Manifested availability values of the 2oo3 architecture are high compared with 1oo2. From Fig. 6 it can be observed that there is no appreciable decrease in availability of the 2oo3 architecture with proof-test interval, while the 1oo2 architecture shows a decrease in availability with increasing proof-test interval. One factor for the lower PFaD value of the 1oo2 architecture is that it spends more time (compared with 2oo3) in FS (i.e. safe shutdown). The lower value of availability for 1oo2 supports this.

5.1. Advantage of modeling safe failures
Safety index PFaD(t) along with availability index mAv(t) can be very useful at the safety system design phase as well as at the operation phase. During the design phase, these two indices can be evaluated for different design alternatives (architecture, hazard rates, DC, CCF) with specified external factors (proof-test interval, MTBD) for a specified time. The design alternative that gives the lowest PFaD(t) (lower than required) and maximum mAv(t) is the best design option. We take an example of three cases whose system parameter values, along with common environment parameters, are given in Table 4. Both PFaD and availability are evaluated for all three cases for 1oo2 and 2oo3 architectures at two values of MTBD. The results are shown in Table 5a and b. Case I with 1oo2 architecture and case III with 2oo3 architecture give the lowest value of PFaD for both MTBD values. Of these two, the availability values are highest for case III with 2oo3 architecture. So, case III with 2oo3 architecture seems to be preferable.
During the operational phase of the system, mainly the proof-test interval is tuned to achieve target safety. Frequent proof-tests increase safety but decrease the availability of the safety system as well as of the EUC. These two indices are helpful in deciding the maximum proof-test interval that meets the required PFaD(t) value while maximizing mAv(t).

Table 5
                    Case I                      Case II                     Case III
MTBD                5000 h       7500 h         5000 h       7500 h         5000 h       7500 h
(a) PFaD values
1oo2                4.3565E-08   3.1114E-08     1.8186E-07   1.2512E-07     3.1914E-08   1.0918E-07
2oo3                1.3091E-07   8.9673E-08     5.5001E-07   3.7824E-07     3.3288E-08   5.2946E-08
(b) Availability values
1oo2                0.99993666   0.99993665     0.99982906   0.99982906     0.99998292   0.99998292
2oo3                0.99999999   0.99999999     0.99999996   0.99999996     1.00000000   1.00000000

6. Conclusions
Quantitative safety index PFD is published in safety standard IEC 61508. Various researchers have contributed
to make the method clearer, more usable and more relevant. Contributing in the same direction, we have derived a Markov model for the system considering safe failures, periodic proof-tests (perfect as well as imperfect) and demand rate. The analysis has been done to derive a closed-form solution for the performance-based safety index PFaD and availability. We have shown the advantages of modeling safe failures with the help of an example. Reliable data on process demands are needed to correctly estimate demand rate and its distribution.
Future directions for further research include (i) a systematic study to estimate the value of $\beta_2$ (MBF), (ii) incorporation of the EUC state in the model and (iii) incorporation in the model of the practice that online proof-tests are conducted only when none of the channels is in SF.

Acknowledgments
We thank Shri U. Mahapatra (BARC) for fruitful discussions on this work. A special thanks to Prof. Varsha Apte (IIT Bombay) for providing useful insight and guidance.

References
[1] IEC 61508. Functional safety of electric/electronic/programmable electronic safety-related systems, Parts 0–7, October 1998–May 2000.
[2] Zhang T, Long W, Sato Y. Availability of systems with self-diagnostic components - applying Markov model to IEC 61508-6. Reliab Eng Syst Saf 2003;80:133–41.
[3] Bukowski JV. Modeling and analyzing the effects of periodic inspection on the performance of safety-critical systems. IEEE Trans Reliab 2001;50(3):321–9.
[4] Hokstad P, Corneliussen K. Loss of safety assessment and the IEC 61508 standard. Reliab Eng Syst Saf 2004;83:111–20.
[5] Scherrer C, Steininger A. Dealing with dormant faults in an embedded fault-tolerant computer system. IEEE Trans Reliab 2003;52(4):512–22.
[6] DeLong TA, Smith T, Johnson BW. Dependability metrics to assess safety-critical systems. IEEE Trans Reliab 2005;54(3):498–505.
[7] Bukowski JV, Goble WM. Defining mean time-to-failure in a particular failure-state for multi-failure-state systems. IEEE Trans Reliab 2001;50(2):221–8.
[8] Rouvroye JL, Brombacher AC. New quantitative safety standards: different techniques, different results? Reliab Eng Syst Saf 1999;66:121–5.
[9] Choi CY, Johnson BW, Profeta III JA. Safety issues in the comparative analysis of dependable architectures. IEEE Trans Reliab 1997;46(3):316–22.
[10] Bukowski JV. Incorporating process demand into models for assessment of safety system performance. In: RAMS 2006. p. 577–81.
[11] Guo H, Yang X. A simple reliability block diagram method for safety integrity verification. Reliab Eng Syst Saf 2007;92:1267–73.
[12] Summers A. Viewpoint on ISA TR84.0.02 - simplified methods and fault tree analysis. ISA Trans 2000;39(2):125–31.
[13] Brown S. Overview of IEC 61508: functional safety of electrical/electronic/programmable electronic safety-related systems. Comput Control Eng J 2000;11(1):6–12.
[14] Bukowski JV. A comparison of techniques for computing PFD average. In: RAMS 2005. p. 590–5.
[15] Goble WM, Bukowski JV. Extending IEC61508 reliability evaluation techniques to include common circuit designs used in industrial safety systems. In: Annual reliability maintenance symposium, 2001. p. 339–43.
[16] Marsan MA, Chiola G. On Petri nets with deterministic and exponentially distributed firing times. Advances in Petri Nets 1986, Lecture Notes in Computer Science, vol. 266. Berlin: Springer; 1987. p. 132–45.
[17] Mainkar V, Trivedi KS. Transient analysis of real-time systems using deterministic and stochastic Petri nets. In: International workshop on quality of communication-based systems, 1994.
[18] Darroch JN, Seneta E. On quasi-stationary distributions in absorbing continuous-time finite Markov chains. J Appl Prob 1967;4:192–6.
[19] Cox DR, Miller HD. The theory of stochastic processes. Methuen & Co., 1970.
[20] Trivedi KS. Probability & statistics with reliability, queueing, and computer science applications. Englewood Cliffs, NJ: Prentice-Hall; 1982.