Reliability Engineering 2 (1981) 135-145
PITFALLS IN RISK CALCULATIONS
G. APOSTOLAKIS† & S. KAPLAN

Pickard, Lowe and Garrick Inc., 17840 Skypark Boulevard, Irvine, California 92714, USA
(Received: 17 December, 1980)
ABSTRACT

Two pitfalls in risk and reliability calculations are identified. The first is the treatment of the failure rates of nominally identical items as independent variables, when the states of knowledge that determine their distributions are identical. Such analyses result in underestimation of the uncertainties. The second is due to the use of the linear approximation to the exponential distribution or the use of lognormal distributions for frequencies of failure per demand, which results in erroneous means and variances.
INTRODUCTION
In the computation of risk and reliability of various technological systems, certain convenient mathematical approximations are customarily used, among them the use of lognormal shapes to represent probability or frequency distributions of elemental events. These elemental distributions must then be combined in various ways to produce distributions describing the probability or frequency of compound events. Although these customary approximations usually give reasonable results when passed through the combining operations, there are subtleties, limitations and pitfalls that must be fully recognized by the analysts, so that the results will remain meaningful. In this paper, we identify and discuss two such pitfalls that can cause severe error in both the mean values and the shapes of the resulting combined distributions.

† Permanent address: University of California, School of Engineering and Applied Science, Los Angeles, California 90024, USA.
PITFALL 1: DEPENDENCY OF STATE-OF-KNOWLEDGE DISTRIBUTIONS
In risk assessments one often has to deal with functions of the form
Q = g(q1, q2, ..., qN)    (1)
For example, Q may be the unavailability of a system, in which case q_i may be a component failure (repair) rate, a human error rate, etc. Or, Q may be the frequency of a particular accident sequence, in which case the q_i are frequencies of the events comprising the sequence, e.g. system unavailabilities. The function g is known from the structure of the system or sequence, e.g. it may have been determined by a fault tree. However, the numerical values of q_i are usually not known. In this case, the analyst's state of knowledge concerning the possible values of q_i is expressed in terms of probability density functions π_i(q_i). An important part of risk assessment is the determination of the state of knowledge of Q implied by the state of knowledge of the q_i. For this purpose, we must calculate the distribution of Q by combining the distributions of the q_i according to eqn. (1). While a great deal of effort has been expended on the development of consistent and efficient methods for the performance of this calculation, the fact that some of the π_i(q_i) may be expressing the same state of knowledge has often been overlooked.

Understanding the significance of this omission requires an explanation of the meaning of the distributions π_i(q_i) and of the modelling process. The distribution π_i(q_i) represents the frequency of a certain failure mode of a class of nominally identical components, e.g. failure of motor-operated valves (MOVs) to open. Generic risk studies, like the Reactor Safety Study,¹ interpret π_i(q_i) as expressing mainly the plant-to-plant variability of q_i. In other words, at a specific plant, q_i has a specific numerical value, which will be revealed after sufficient statistical evidence has been collected. This numerical value is, in general, different from one plant to another and π_i(q_i) quantifies the analyst's state of knowledge about this variability.

Consistent with this interpretation of the generic distributions, it is natural and convenient to adopt a model in which all components of the same type, at the same plant, have the same failure rate. In assessing the risk at a specific plant, then, it is appropriate to group together the experience of all the nominally identical components in that plant. Thus, all the failures of MOVs to open in that plant are aggregated into a single number. Likewise, the total number of demands in that plant is aggregated. This evidence can then be used in Bayes' theorem to specialize the generic distributions to that particular plant.²

Now suppose we are applying eqn. (1) to a specific plant, and suppose that q1 and q2 represent the failure rates of two physically distinct but nominally identical MOVs. Since our model assumes that all such MOVs have the same failure rate, we must set q1 = q2. Moreover, since our state of knowledge is the same for the two valves we have

π_1(q1) = π_2(q2)
Thus, when we come to numerically evaluate eqn. (1) for the distribution of Q, we must regard π_1 and π_2 as being not only equal distributions but, beyond that, we must treat them as completely dependent distributions. It should be pointed out that this kind of dependence is different from the dependence between events (common cause failures). The latter are contained in the function g itself and they are unrelated to the dependencies that stem from the analyst's state of knowledge concerning failure rates. The following examples illustrate the dependencies that we are discussing.
Example 1: Series configuration

Consider the simple case where a system fails when one of two motor-operated valves fails to open on demand. The system unavailability Q is given in terms of the MOV failure frequency q (per demand) by

Q = 2q    (2)
Equation (2) expresses the fact that the failure rates of the two MOVs are identical (more precisely, that the distributions of the failure rates express the same state of knowledge). If the system unavailability were calculated by the equation

Q = q1 + q2    (3)
where q1 and q2 are independent variables having the same distribution, the result would be incorrect. To demonstrate the consequences of using eqn. (3), rather than the appropriate eqn. (2), we calculate the mean, α_Q, and variance, β²_Q, of Q. Using eqn. (2) we get

α_Q = 2α_q    (4)

and

β²_Q = 2²β²_q = 4β²_q    (5)

On the other hand, eqn. (3) gives

α_Q = α_q1 + α_q2 = 2α_q    (6)

and

β²_Q = β²_q1 + β²_q2 = 2β²_q    (7)
As eqns. (4) and (6) show, the mean value of Q is unaffected. However, eqns. (5) and (7) show that use of eqn. (3) results in a smaller variance and hence an understatement of the uncertainty in the system unavailability.
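The effect is easy to check by simulation. The following is a minimal sketch, assuming an illustrative lognormal distribution for q (median 9.5 × 10⁻⁴ and σ = 0.97, roughly back-calculated from the first row of Table 1 below); it is our own check, not the paper's computation. The means of the two treatments agree, while the independent treatment yields about half the variance, as eqns. (4)-(7) predict.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Hypothetical state-of-knowledge distribution for the MOV failure
# frequency q (per demand): lognormal, median 9.5e-4, sigma 0.97.
mu, sigma = np.log(9.5e-4), 0.97

# Eqn (2): both valves share ONE sampled value of q (dependent).
q = rng.lognormal(mu, sigma, N)
Q_dep = 2.0 * q

# Eqn (3): q1 and q2 sampled independently (the pitfall).
Q_ind = rng.lognormal(mu, sigma, N) + rng.lognormal(mu, sigma, N)

print(f"means    : {Q_dep.mean():.3e} vs {Q_ind.mean():.3e}")   # ~equal
print(f"variances: {Q_dep.var():.3e} vs {Q_ind.var():.3e}")     # ~2x ratio
```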
Example 2: Parallel configuration

If the two MOVs are in parallel and system failure occurs when both fail to open, the appropriate equation is

Q = q²    (8)
Again, the equation that is sometimes thoughtlessly used in lieu of eqn. (8) is

Q = q1q2    (9)

The mean value of Q, if eqn. (8) is used, is

α_Q = α²_q + β²_q    (10)

and, if eqn. (9) is used,

α_Q = α_q1 α_q2 = α²_q    (11)
Comparing eqns. (10) and (11) we see that, for the parallel configuration, even the mean value of the system unavailability is underestimated when the state-of-knowledge dependence is ignored.

The effect of eqn. (9) versus eqn. (8) on the "spread" or variance of Q can be illustrated for the important case of lognormally distributed q as follows. The most convenient way to look at a lognormal distribution is to write

q = m e^(σz)    (12)
If z in eqn. (12) is a standard normal variate, then q is lognormally distributed. The median value of q is m, and σ is called the lognormal standard deviation, or the "multiplicative" standard deviation. The reason for the latter term is seen, in eqn. (12), from the fact that if we increase z by the additive amount 1.0 then q increases by the multiplicative factor e^σ. Thus, the multiplier e^σ plays the same role in a lognormal curve as the additive quantity σ plays in a normal curve. That is, when q changes by the factor e^σ, the cumulative probability changes by one standard deviation's worth. Now, with this understanding of lognormal curves, we note that if q is lognormal, then from eqns. (12) and (8) we get
Q = q² = m² e^(2σz)    (13)

i.e. q² is also lognormal, with median m² and lognormal standard deviation 2σ. Also, if q1 and q2 are two lognormal variates,

q1 = m1 e^(σ1 z1),    q2 = m2 e^(σ2 z2)    (14)

then

q1q2 = m1m2 e^(σ1 z1 + σ2 z2)

Since the sum of two normal variates is itself a normal variate, we have

σ1 z1 + σ2 z2 = σ3 z3

where

σ3 = √(σ1² + σ2²)    (15)
Thus, from eqn. (9) we see that

Q = q1q2 = m1m2 e^(σ3 z3)

i.e. that Q is a lognormal variate with median m1m2 and lognormal standard deviation given by eqn. (15). Now, if q1 and q2 have the same distribution,

m1 = m2 = m,    σ1 = σ2 = σ

then

Q = m² e^(√2 σz)    (16)
Comparing eqn. (16) with eqn. (13) we see that using eqn. (9) rather than eqn. (8) leads to a tighter distribution for Q (lognormal standard deviation √2 σ instead of 2σ). Thus, neglect of the dependence of the probability distributions for q1 and q2 leads to a severe understatement of the uncertainty in Q. The median of Q is correct, but the distribution about that median is too narrow, leading, as we saw in eqns. (10) and (11), also to an erroneous and unconservative calculation of the mean value of Q.

TABLE 1
EXAMPLES SHOWING THE EFFECT OF STATE-OF-KNOWLEDGE DEPENDENCE

Variable     Mean          5th Percentile   Median        95th Percentile
q            1.5 × 10⁻³    1.9 × 10⁻⁴       9.5 × 10⁻⁴    4.7 × 10⁻³
q1q2         2.3 × 10⁻⁶    9.4 × 10⁻⁸       9.0 × 10⁻⁷    8.7 × 10⁻⁶
q²           6.0 × 10⁻⁶    3.7 × 10⁻⁸       9.0 × 10⁻⁷    2.2 × 10⁻⁵
q1q2q3       3.5 × 10⁻⁹    5.4 × 10⁻¹¹      8.6 × 10⁻¹⁰   1.4 × 10⁻⁸
q³           6.0 × 10⁻⁸    7.1 × 10⁻¹²      8.6 × 10⁻¹⁰   1.1 × 10⁻⁷
q1q2q3q4     5.4 × 10⁻¹²   3.3 × 10⁻¹⁴      8.2 × 10⁻¹³   2.0 × 10⁻¹¹
q⁴           1.6 × 10⁻⁹    1.4 × 10⁻¹⁵      8.2 × 10⁻¹³   4.9 × 10⁻¹⁰
Table 1 contains numerical examples that further demonstrate the importance of handling state-of-knowledge dependencies correctly. All the variates are lognormally distributed. As expected, the mean values and the uncertainties are underestimated. The worst case is the fourth order redundant system, where the mean is underestimated by a factor of 300 and the 95th percentile by a factor of 25.
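The entries of Table 1 follow directly from the lognormal algebra of the preceding example, so they are easy to reproduce. The sketch below is our own check, not part of the original analysis; it uses the standard lognormal mean and percentile formulas, with the median and σ of q back-calculated from the first row of the table.

```python
import math

m, s = 9.5e-4, 0.97          # median and lognormal sigma of q (Table 1, row 1)
z95 = 1.645                  # standard normal 95th-percentile point

def summarize(label, median, sigma):
    mean = median * math.exp(sigma**2 / 2)     # lognormal mean
    p05  = median * math.exp(-z95 * sigma)     # 5th percentile
    p95  = median * math.exp(+z95 * sigma)     # 95th percentile
    print(f"{label:9s} mean={mean:.1e}  5th={p05:.1e}  "
          f"median={median:.1e}  95th={p95:.1e}")

for n in (2, 3, 4):
    # Dependent case (eqn (8) generalized): q**n has sigma' = n*sigma, eqn (13).
    summarize(f"q^{n}", m**n, n * s)
    # Independent case (eqn (9) generalized): sigma' = sqrt(n)*sigma, eqn (15).
    summarize(f"q1..q{n}", m**n, math.sqrt(n) * s)
```

Running this recovers the table, including the factor of about 300 between the means of q⁴ and q1q2q3q4.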
PITFALL 2: ERRORS IN MEANS AND VARIANCES RESULTING FROM USE OF LOGNORMAL DISTRIBUTIONS
While the state of knowledge about an uncertain quantity is properly expressed in terms of a distribution, e.g. the π_i(q_i) of the preceding section, very often it is
convenient to summarize the distribution by citing several characteristic values, e.g. various percentiles, the mean value, etc. The mean and variance also possess useful mathematical properties, e.g. eqns. (10) and (11), which can be used in propagating uncertainties. Furthermore, the mean values are often used to rank the contributors to risk.

It is customary in risk and reliability work to use lognormal distributions to represent our state-of-knowledge distributions of failure rates, etc. The lognormal is very convenient mathematically, but has the defect that the mathematical curve has a tail extending to infinity, whereas the true state-of-knowledge distribution truncates at finite values. (For example, a true failure rate on demand cannot be greater than 1.0.) The presence of this tail can be regarded as a mathematical artefact which in most cases does not distort the results of the analysis. In other cases it does. The purpose of this section, in fact, is to point out that this artefact can result in severe error in calculated means and variances even while the usual percentiles are unaffected.

To demonstrate the sensitivity of the mean value to the tails of distributions we can use the following simple example. Let the variable X take on two values as follows:

Pr(X = 10⁻²) = 0.9999 = 1 − 10⁻⁴    (17)

Pr(X = 2) = 10⁻⁴    (18)

The mean value of X is, then,

α_X = 10⁻² × 0.9999 + 2 × 10⁻⁴ = 1.020 × 10⁻²    (19)
The major part of this value comes from the point 10⁻², which has the highest probability. Suppose, now, that we are interested in the cube of X. Then

Pr(X³ = 10⁻⁶) = 0.9999    (20)

Pr(X³ = 8) = 10⁻⁴    (21)

The mean value is now

α_X³ = 10⁻⁶ × 0.9999 + 8 × 10⁻⁴ = 8.009 × 10⁻⁴    (22)
a result that is completely dominated by the value X = 2. If this value happens to be unrealistic, but is tolerated due to some modelling approximation that assigns to it a very small probability (eqn. (18)), it will distort any calculations that include X³. Situations like this do arise in risk analysis, as the following cases show.
The exponential approximation

It is common practice to treat the time-to-failure of items as an exponentially distributed random variable, i.e. the failure distribution is

F(t) = 1 − e^(−λt)    (23)
where λ is the failure rate. Due to the smallness of λ, F(t) is almost always approximated by

F(t) ≈ λt,    for λt < 0.10    (24)

The requirement that λt be less than 0.10 is often forgotten, since the most likely values of λ are indeed small. However, the distribution of λ is usually the lognormal, which is unbounded on the right. The condition of eqn. (24), then, is violated by the high percentiles, resulting in an overestimate of the mean and variance. The following example demonstrates this point.
Example 3

Consider a component that has a failure rate λ which is lognormally distributed with μ_λ = −7.98 and σ_λ = 1.26. We are interested in the unreliability of the component during a 24-h period. The variable 24λ is also lognormally distributed, with parameters μ_24λ = −4.80 and σ = 1.26. Table 2 shows the results of several calculations of interest. We observe the following:

1. The mean and the variance are overestimated when the linear approximation is used. The accuracy is worse for the variance and it worsens as the degree of redundancy increases.
2. The commonly used percentiles are fairly insensitive to the approximation.

TABLE 2
THE ACCURACY OF THE EXPONENTIAL APPROXIMATION

Function         Mean          Variance      5th Percentile   Median        95th Percentile
24λ              1.82 × 10⁻²   1.29 × 10⁻³   1.04 × 10⁻³      8.23 × 10⁻³   6.54 × 10⁻²
1 − e^(−24λ)     1.82 × 10⁻²   7.84 × 10⁻⁴   1.05 × 10⁻³      8.18 × 10⁻³   6.33 × 10⁻²
(24λ)²           1.62 × 10⁻³   1.50 × 10⁻³   1.07 × 10⁻⁶      6.77 × 10⁻⁵   4.28 × 10⁻³
(1 − e^(−24λ))²  1.08 × 10⁻³   3.21 × 10⁻⁵   1.10 × 10⁻⁶      6.68 × 10⁻⁵   4.01 × 10⁻³
(24λ)³           7.06 × 10⁻⁴   0.80          1.11 × 10⁻⁹      5.57 × 10⁻⁷   2.80 × 10⁻⁴
(1 − e^(−24λ))³  1.52 × 10⁻⁴   3.21 × 10⁻⁵   1.15 × 10⁻⁹      5.46 × 10⁻⁷   2.54 × 10⁻⁴
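Because no closed form is needed for the moments of 1 − e^(−24λ), the behaviour summarized in Table 2 can also be checked by direct simulation. The following minimal sketch is our own illustration using the stated parameters (μ_λ = −7.98, σ = 1.26); up to sampling noise it reproduces the pattern of the table: inflated means and variances for the linear approximation at higher orders, with nearly unchanged percentiles.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = rng.lognormal(-7.98, 1.26, 1_000_000)   # failure rate (per hour)
x = 24.0 * lam                                # 24-h exposure

approx = x                    # linear approximation, eqn (24)
exact  = 1.0 - np.exp(-x)     # exponential distribution, eqn (23)

for k in (1, 2, 3):           # single component, doubly/triply redundant
    a, e = approx**k, exact**k
    print(f"order {k}: mean {a.mean():.2e} vs {e.mean():.2e}, "
          f"var {a.var():.2e} vs {e.var():.2e}, "
          f"95th {np.quantile(a, 0.95):.2e} vs {np.quantile(e, 0.95):.2e}")
```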
The lognormal distribution

The frequency of failure on demand (unavailability), q, can take on values in the interval (0, 1). Very often our state of knowledge concerning the possible values of q is expressed in terms of a lognormal distribution, which, of course, covers the interval (0, ∞). Even though the probability of the (nonphysical) values (1, ∞) is usually negligibly small, this tail may distort the mean and the variance just as in the preceding case. To prevent this distortion, we may desire to use a distribution whose range is the interval (0, 1), e.g. the beta distribution, or to retain the lognormal distribution but to truncate it at q = 1, i.e. to limit q to values smaller than unity. If N(z) is the
cumulative standard normal distribution, then the truncated cumulative lognormal distribution will be

Φ(q) = N[(ln q − μ)/σ] / N(−μ/σ),    0 < q ≤ 1
Φ(q) = 1,    q > 1    (25)
Use of this distribution will eliminate the distorting and unrealistic values of q greater than unity.
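One way such a truncation might be implemented is sketched below; the parameter values are hypothetical illustrations. The inverse-CDF sampler simply confines the draws to the probability mass the untruncated lognormal places below q = 1, which is equivalent to eqn. (25).

```python
import numpy as np
from scipy import stats

mu, sigma = -3.0, 1.5          # hypothetical lognormal parameters for q

# Probability mass the untruncated lognormal puts on q <= 1:
# ln q <= 0  <=>  z <= -mu/sigma.
mass = stats.norm.cdf(-mu / sigma)

def truncated_cdf(q):
    """Eqn (25): lognormal CDF renormalized to the interval (0, 1]."""
    return np.where(q >= 1.0, 1.0,
                    stats.norm.cdf((np.log(q) - mu) / sigma) / mass)

def sample(n, rng=np.random.default_rng(2)):
    """Inverse-CDF sampling confined to the mass below q = 1."""
    u = rng.uniform(0.0, mass, n)
    return np.exp(mu + sigma * stats.norm.ppf(u))

q = sample(1_000_000)
print(q.max() <= 1.0, q.mean())   # no nonphysical values; tail-free mean
```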
SOME CONSEQUENCES
The fallacy of point estimates

A point-estimate calculation usually means that some representative point values are used as inputs to a risk model and the result of the model is also a point estimate. For example, one may use point values for the inputs, q_i, of eqn. (1), and thus obtain a point value for Q. The Reactor Safety Study¹ carried out such calculations using the medians of the distributions as point estimates. The Reactor Safety Study was later criticized for these calculations, even though a separate uncertainty analysis was done, and the point estimates played a minimal role in the overall risk assessment. It was widely believed that a correct point estimate should be based on mean values rather than medians (or any other percentiles). The results of this paper show that even a point estimate based on mean values may be seriously misleading when the uncertainties are large and highly redundant systems are analysed (see Table 1). A rigorous risk analysis requires a rigorous treatment of the uncertainties. Our results further show that the mean value is very sensitive to the tails of the distributions, while the percentiles are not. This insensitivity makes the use of percentiles preferable when a few representative values are desirable, e.g. in setting up safety goals.
Fault tree analysis

Consider the simple fault tree of Fig. 1. The Y events are similar, e.g. human errors, as are the Z events. There are nine minimal cut sets of order two, namely Y1Y2, Y1Z2, Y1Z4, Z1Y2, Z1Z2, Z1Z4, Z3Y2, Z3Z2 and Z3Z4. If Y and Z also denote the corresponding failure rates, a conventional calculation of the system unavailability would be based on the expression

Q = (Y1 + Z1 + Z3)(Y2 + Z2 + Z4)    (26)
[Fig. 1. Simple fault tree.]
If Y and Z are lognormal variables with parameters μ_Y = −6.96, σ_Y = 0.698, μ_Z = −9.26, σ_Z = 0.698, we will get for Q:

5th percentile = 4.2 × 10⁻⁷
median = 1.4 × 10⁻⁶
mean = 2.1 × 10⁻⁶
95th percentile = 5.8 × 10⁻⁶

The correct expression, which replaces eqn. (26), is

Q = (Y + 2Z)²    (27)

which gives

5th percentile = 2.4 × 10⁻⁷
median = 1.5 × 10⁻⁶
mean = 3.1 × 10⁻⁶
95th percentile = 10⁻⁵

As expected, use of eqn. (26), which ignores the state-of-knowledge correlation, results in an underestimation of the uncertainties. Equation (27) suggests that we define the supercomponents A and B (Fig. 1) as

A ≡ {Y1, Z1, Z3}    (28)

B ≡ {Y2, Z2, Z4}    (29)
The two supercomponents are identical; therefore the system unavailability is

Q = A²    (30)
Equation (30) can be used as the basis for further calculations. For example, the mean and variance of each supercomponent are

α_A = α_Y + 2α_Z = 1.452 × 10⁻³

β²_A = β²_Y + 4β²_Z = 9.5 × 10⁻⁷

therefore, the mean of Q is, according to eqn. (10),

α_Q = α²_A + β²_A = (1.452 × 10⁻³)² + 9.5 × 10⁻⁷ = 3.1 × 10⁻⁶

which is the correct value.

Appendix IV of the Reactor Safety Study contains calculations in which the same numerical value is assigned to nominally identical failure rates. The reason given, however, is related to potential common cause failures and not to the common state of knowledge that we have discussed. This is the wrong reason for doing the right thing. Common cause failures are a different type of dependence and they are part of the function g (eqn. (1)) itself. The dependencies that we are referring to have to do with the fact that our beliefs and judgement, and possibly data sources, about the failure rates of nominally identical items, are the same. Because of this misunderstanding the Reactor Safety Study erroneously applied this coupling to hardware failures only, explicitly excluding human error rates, test and maintenance contributions, etc. Even with this limited coupling the Reactor Safety Study found that the system unavailabilities had wider uncertainties (Table IV 4-1, ref. 1).
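The fault tree numbers quoted above can be checked by simulation. The sketch below is our own illustration: it evaluates eqn. (26) with six independent draws and eqn. (27) (equivalently, eqn. (30)) with one draw per state-of-knowledge class, using the lognormal parameters given for Y and Z.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1_000_000
muY, muZ, s = -6.96, -9.26, 0.698

# Incorrect, eqn (26): six independent draws for the six basic events.
Y1, Y2 = rng.lognormal(muY, s, N), rng.lognormal(muY, s, N)
Z1, Z2 = rng.lognormal(muZ, s, N), rng.lognormal(muZ, s, N)
Z3, Z4 = rng.lognormal(muZ, s, N), rng.lognormal(muZ, s, N)
Q26 = (Y1 + Z1 + Z3) * (Y2 + Z2 + Z4)

# Correct, eqns (27)/(30): one draw per state-of-knowledge class,
# i.e. a single Y and a single Z shared by all nominally identical events.
Y = rng.lognormal(muY, s, N)
Z = rng.lognormal(muZ, s, N)
Q27 = (Y + 2.0 * Z) ** 2

for name, Q in (("eqn (26)", Q26), ("eqn (27)", Q27)):
    p05, p50, p95 = np.quantile(Q, [0.05, 0.50, 0.95])
    print(f"{name}: mean={Q.mean():.1e}  5th={p05:.1e}  "
          f"median={p50:.1e}  95th={p95:.1e}")
```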
Event tree analysis

State-of-knowledge dependence should also be included in the calculation of the frequencies of accident sequences, as the following simple example demonstrates. Consider an accident sequence in which the initiating event has frequency f_IE and the unavailabilities of two safety systems are f_A and f_B. Then, in a conventional calculation, the frequency f of the sequence would be

f = f_IE f_A f_B    (31)

The systems, however, may be unavailable due to causes that are nominally identical. For example, system A may be down due to a human error of frequency f_A^HE or due to other causes that have frequency f_A^0. Similarly, system B fails due to human error, f_B^HE, and other causes, f_B^0. Equation (31) becomes

f = f_IE (f_A^HE + f_A^0)(f_B^HE + f_B^0)    (32)

If the human errors were of the same type, e.g. forgetting to close valves after testing, the appropriate expression, which accounts for the state-of-knowledge dependence, would use the same human error frequency for both systems:

f′ = f_IE (f_A^HE + f_A^0)(f_A^HE + f_B^0)    (33)
As a numerical example we consider the frequencies to be lognormally distributed with the following parameters:

f_IE:     μ = −6.90,    σ = 1.4
f_A^HE:   μ = −8.85,    σ = 1.6
f_A^0:    μ = −10.10,   σ = 1.4
f_B^HE:   μ = −8.85,    σ = 1.6
f_B^0:    μ = −9.70,    σ = 1.5
The results are shown in Table 3. We observe that when the incorrect eqn. (32) is used, the mean is underestimated by about one order of magnitude and the 95th percentile by about a factor of 2.

TABLE 3
FREQUENCIES OF AN ACCIDENT SEQUENCE

Expression   Mean          5th Percentile   Median         95th Percentile
eqn. (32)    1.04 × 10⁻⁹   1.96 × 10⁻¹²     7.55 × 10⁻¹¹   3.46 × 10⁻⁹
eqn. (33)    1.27 × 10⁻⁸   1.14 × 10⁻¹²     6.74 × 10⁻¹¹   7.46 × 10⁻⁹
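Table 3 can likewise be approximated by simulation; the only modelling change between the two cases is whether the human error frequency is drawn once or twice. The following is a minimal sketch with the parameters listed above (our own check; sampling and rounding differences aside, it shows the same order-of-magnitude gap in the means).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2_000_000

f_IE  = rng.lognormal(-6.90, 1.4, N)    # initiating event frequency
f_A0  = rng.lognormal(-10.10, 1.4, N)   # system A, other causes
f_B0  = rng.lognormal(-9.70, 1.5, N)    # system B, other causes
f_AHE = rng.lognormal(-8.85, 1.6, N)    # human error, system A
f_BHE = rng.lognormal(-8.85, 1.6, N)    # human error, system B

# Eqn (32): human errors treated as independent (the pitfall).
f32 = f_IE * (f_AHE + f_A0) * (f_BHE + f_B0)

# Eqn (33): one common human error frequency for both systems.
f33 = f_IE * (f_AHE + f_A0) * (f_AHE + f_B0)

for name, f in (("eqn (32)", f32), ("eqn (33)", f33)):
    print(f"{name}: mean={f.mean():.2e}  "
          f"95th={np.quantile(f, 0.95):.2e}")
```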
It would appear from this analysis that a rigorous treatment of the uncertainties would require the analysis of a giant event tree that would contain all the initiating events and the fault tree of each system. That would ensure the correct handling of the state-of-knowledge dependence. It would also make the risk calculations extremely complex. Perhaps a preliminary analysis that ignored the dependencies would eliminate the obviously insignificant sequences. Furthermore, each fault tree could be replaced by the dominant causes of system unavailability, before an analysis of the dependencies would be carried out. These ideas are promising and are currently being investigated.
ACKNOWLEDGEMENT
This work was performed for Pickard, Lowe and Garrick Inc., in support of a risk analysis for the Zion and Indian Point Nuclear Generating Stations. The authors thank B. J. Garrick, who provided the environment that made this work possible, and M. Kazarians and D. Stillwell, who helped with the numerical examples.
REFERENCES
1. Reactor Safety Study, WASH-1400, US NRC, NUREG-75/014, October, 1975.
2. APOSTOLAKIS, G., KAPLAN, S., GARRICK, B. J. and DUPHILY, R. J. Data specialization for plant specific risk studies, Nucl. Eng. Des., 56 (1980), pp. 321-9.