oLI98-1354/91 s3.00 + 0.00
Computers&em. Engng,Vol. 15, No. 3, pp. 157-169, 1991 Printedin Great Britain.All rightsreserved
DIRECT-EVALUATION
ALGORITHMS PROBABILITIES L.B. PAGE~~
‘Department
Copyright
FOR
8
1991
Pergamon
Ress plc
FAULT-TREE
and J.E. PERRY*
of Mathematics,
Box 8205, North Carolina State University, Raleigh NC 27695-8205, U.S.A. *Department of Computer Science. Box 8206, North Carolina State University, Raleigh, NC 27695-8206, U.S.A. (Received 22 November 1989;final revision received 9 June 1990; received for publication I5 October 1990) Abatraet-Traditional fault-tree methodologies have estimated the probability of system failure based upon a prior determination of minimal cut sets. Recent direct-evaluation methods offer an alternative which is exact rather than approximate and which often is faster than cut-set based methods. Furthermore, these direct-evaluation methods are easily implemented and used in a microcomputer environment. We examine and compare some of the most promising approaches that have emerged during the past few years. A large number of examples are included to illustrate not only the manner in which the various direct-evaluation methods differ from each other but also the potential for such methods to replace cut-set methods for probability calculations involving systems of moderate size.
1. INTRODUCTION Fault trees are used for reliability analysis of complex
systems such as nuclear or chemical reactors, aircraft or spacecraft systems, and various types of industrial processes. The purpose of a fault tree is to represent all the manners in which system failure can occur and to enable calculation (or estimation) of the probability of system failure. The fault tree itself consists of AND gates, OR gates and basic events, and the structure of the trees shows all the sequences of basic events which, in combination with each other, can lead to system failure. A “minimal cut set” is a minimal set of basic events whose occurrence causes system failure. In a 1986 article in Computers & Chemical Engineering, we described some top-down algorithms for recursively determining fault tree probabilities without recourse to any consideration of cut sets (Page and Perry, 1986a). Pioneer work on direct evaluation methods had been done by B. V. Koen and others during the 1970s and early 198Os, and references to some of these earlier works may be found in the bibliography of a recent paper by Patterson-Hine and Koen (1987). Standard fault-tree methodologies, by contrast, typically defer probability calculations until some collection of minimal cut sets has been obtained and then utilize a form of inclusion-exclusion to approximate the probability based on knowledge of some or all of the minimal cut sets. Later articles (Page and Perry, 1986b, 1988; Patterson-Hine and Koen, 1989) continued to develop the ideas that we explored in the first paper. For example, our subsequent work presented algor-
ithms for treating noncoherent fault trees in which complements of basic events appear in the tree. (For instance, if. one basic event represents failure of a valve, then the complementary basic event would be the event that the valve does not fail, but the complementary basic event may itself contribute to system failure in some other way.) Most importantly, however, the later algorithms are substantially more sophisticated than the earlier ones and enable analysis of much larger fault trees. The examples in the later papers (Page and Perry, 1986b, 1988) include some of the largest fault trees presented in the recent literature to illustrate the workings of various algorithms. Other researchers who also have continued to develop direct-evaluation methods for fault-tree probabilities include McCullers and Wood (1988) and Helman and Rosenthal (1989). The approach of McCullers and Wood is somewhat similar in spirit to Page and Perry (1988) in that both rely on “factoring” or “pivotal decomposition” in conjunction with other ways of simplifying and modularizing fault trees. Response to the original 1986 article indicated to us a significant interest in fault-tree methodologies among the readership of this journal, and the nature of correspondence from readers further showed a particular interest in methodologies which are userfriendly and which can be implemented so as to perform well in a microcomputer environment. Spinoffs from the original paper continue, as illustrated by the paper by Patterson-Hine and Koen (1989) which adapts our earliest algorithms to an object-oriented programming environment and improves perform-
ance of the algorithms by “remembering” solved modules in the fault tree so as to avoid unnecessary
tTo whom all correspondence should be addressed. 157
I58
L. B. PAGE and J. E. PERRY
recomputations. This experimentation was done in a LISP environment using a dedicated LISP processor. Subsequently, we have incorporated the basic idea of Patterson-Hine and Koen, namely the idea of storing ‘intermediate results to avoid recomputation, into the Pascal environment that was used to develop implementations of our algorithms. As a result there is now available a wide range of direct-evaluation approaches to fault-tree probabilities which includes far more powerful algorithms than the original ones presented in our 1986 paper. The purpose of this article is to make these known to the readers of this journal who expressed interest in the original article. These approaches include new features such as the “memory” idea of PattersonHine and Koen, as well as some new refinements to the ideas that we developed later which lead to very substantial improvements in performance of the algorithms. A large number of examples are used in Section 4 to show that direct-evaluation exact methods can now treat problems that a few years ago were difficult to treat with the best approximation methods. A consequence of the improvements in methods now available is that it is no longer possible to test these algorithms in any meaningful fashion on the fault trees that have commonly been used in the literature to demonstrate performance of earlier fault-tree methodologies. These fault trees are simply too small or too elementary to demonstrate the relative merits of the improved generation of algorithms now under discussion. For that reason we have constructed several new fault trees which are larger and more complex than those commonly used in the literature for demonstration purposes. In order to include realistic test cases we have also obtained a very large fault tree used in analysis of a commercial nuclear reactor and have included 16 intermediate size subtrees (ranging in size from 103 to 769 nodes) from this large fault tree among our test cases. Additionally, we do include treatment of the largest fault trees we have found in the recent literature as a means of comparing our algorithms and implementations with earlier published results. Three basic strategies for direct evaluation of faulttree probabilities will be assessed in this paper. One is the algorithm TDPP (Page and Perry, 1986b). Another is a modification of TDPP which “remembers” solved modules in the fault tree to avoid unnecessary repeated computations. This idea is found in the recent paper by Patterson-Hine and Koen, though the algorithms to which the idea was applied in their work were the earliest algorithms (Page and Perry, 1986a) rather than the improved later algorithm TDPP. The third strategy is a refined and improved version of our factoring algorithm (Page and Perry, 1988). All of these approaches can be successfully used in a microcomputer environment, and the advantages and disadvantages of each are spelled out in Sections 4 and 5. These three
strategies, as well as implementations of the strategies, are referred to extensively throughout this paper by the acronyms IN-EX, IN-EX W/MEM and FACTOR. A description of these three algorithms is the subject of Section 2. Section 3 illustrates why direct evaluation methods such as those of Section 2 are often faster than traditional cut-set approaches. Section 4 then compares the three algorithms of Section 2 on a wide range of fault trees. Run-time data given in Sections 3 and 4 are obtained using implementations written in THINK@ Pascal 2.0 and using a Macintosh II microcomputer. One reason for including so many examples (25 are included) is that any choice of a few examples would not illustrate the divergence in performance of the algorithms depending upon size and structure of fault trees. Since most examples that have previously appeared in the literature are too small to test the current generation of algorithms, it is time to introduce new examples and challenge other researchers to test their methods on these new and more complex examples. Figures are included or referenced for Examples l-9. Examples 1O-25 which arise in analysis of a commercial nuclear plant will be supplied to any reader who requests them. These figures are not included because of space limitations, but extensive descriptions of the fault trees for Examples 10-25 are found in Table 3. 2. STRATEGIES FOR FAULT-TREE
DIRECT EVALUATION PROBABILITIES
OF
Differing fault-tree methodologies make different assumptions about the nature of the underlying fault tree. For example, some standard methods (such as the widely-used program MOCUS) do not incorporate treatment of complemented basic events. In order to clarify the generality of the algorithms and implementations described in this paper, we wiil first describe precisely the assumptions that our algorithms make about the nature of the fault trees being analyzed. 2. I. Assumptions
about fault
trees
1. Every node in the fault tree is one of the following: (1) an AND gate; (2) an OR gate; or (3) a BASIC EVENT. The leaf nodes in the fault tree are BASIC EVENTS, and all other nodes in the fault tree are AND gates or OR gates. The probability of occurrence of each BASIC EVENT is assumed to be known. 3. BASIC EVENTS may appear at multiple locations in the fault tree provided that the event is given the same label at each place it appears in the tree. Logical complements of BASIC EVENTS may also appear as BASIC EVENTS. 4. Aside from complements, any set of BASIC EVENTS is assumed to be an independent collection of events. In other words, given any collection of BASIC EVENTS in which no event is the complement of another event in the
Direct-evaluation algorithms for fault-tree probabilities
collection, the probability that all events in the collection occur is simply the product of their individual probabilities of occurrence. 5. AND gates and OR gates may have any number of children. The same label may be used for two AND gates or two OR gates in the fault tree provided that the subtrees rooted at the two locations are identical. The implementations described in this article assume that integer labels are used to label the nodes in the fault tree. The convention used to denote complements of basic events is to use a negative label for this purpose. Thus, the complement of a basic event which has been labeled “13” would be labeled “_ 13”. Labels for AND gates and OR gates are assumed to be positive integers. This restriction is not included in the list of assumptions above because the labeling convention used is simply a matter of convenience and has nothing to do with the logical structure of the fault trees which the algorithms can treat. Recoding the implementations so that alphanumeric labels rather than integer labels are used, for example, would be a routine matter that has nothing to do with the workings of the algorithms themselves. The alphanumeric labels would then simply be converted to integer labels by a utility procedure attached to the present implementations. Efficiency of direct-evaluation algorithms relies heavily on finding modules within the fault tree which are statistically independent. When mutually independent events A and B are found, then the elementary identities: Pr(AnB}
= Pr{A>Pr{B}
(1)
= Pr{A) + Pr(B} - Pr{A}Pr{B},
(2)
and Pr{AuB}
can be used. Of course, equations (1) and (2) can be viewed as simple formulas for evaluating probabilities of AND gates and OR gates with two mutually independent children. Trees without replicated inputs or complementary basic events (in other words, trees in which each basic event appears only once in the tree and complements of basic events do not appear in the tree) are trivial to evaluate.in this manner using either a top-down or bottom-up approach, and the time requirement is proportional to the number of nodes in the tree. (Further discussion of this appears in Section 3.) The performance of a direct-evaluation algorithm, however, is largely determined by how the algorithm treats unions and intersections of dependent events. The primary means for evaluating more complicated Boolean expressions in which independence is not present are the following: Pr{AuB}
= Pr{A) + Pr(B} - Pr{AnB}
(3)
and Pr{A} = Pr{B}Pr(A]B}
+ Pr(S}Pr(A]B},
(4)
159
[equation (4) is the familiar pivotal decomposition formula, where B denotes the complement of event B and Pr{A(B) denotes the conditional probability of A given 31. It is not difficult to develop simple recursive algorithms based on (3) or (4) for evaluating fault-tree probabilities. The simplest of these, such as Algorithm 1 (Page and Perry, 1986a) are computationally inefficient and are useful only on quite small fault trees or for pedagogical purposes. Figure 1 shows an outline of the recursive function for the simplest algorithm utilizing equation (3) above and capable of treating fault trees with replicated basic events. The probability that is returned by the call PROB(S) in Fig. 1 is the probability Pr{A,n . __ nA,} if S is the set of node labels corresponding to the events A,, . . ., A, in the fault tree. [For example, if AZ, A3, and & are the events represented by gates 2, 3 and 9 in the fault tree, then Pr{A,nA,nA,} = PROB(S) where S = {2,3,9}.] An algorithm based on this recursive function (of Fig. 1) need only read the fault tree from an input file and issue a call to PROB in which the parameter set S consists of the single node label corresponding to the top event in the fault tree. By recursively calling itself, PROB breaks the problem of top-event probability down into subproblems until the recursion terminates via calls in which S is a set of basic events. The last line of Fig. 1 (the line which treats OR gates) is logically equivalent to an application of equation (3). This algorithm makes no use of any independence in the fault tree until the resursive function PROB is called with the parameter set S consisting exclusively of basic events (which are necessarily mutually independent). In that case, the product is returned as an application of equation (1) above. The algorithms described in the fhst paper (Page and Perry, 1986a) are all fundamentally based on equation (3) with varying degrees of searching for independence in the fault tree so as to utilize (1) and (2) as alternatives to (3) whenever possible. The algorithm TDPP (Page and Perry, 1986b) also is based on (3) but requires a more sophisticated form of parameter passing in order to avoid use of (3) until
Fig. 1. Programskeletonfor the simplestrecursivefunction whichcan be used as the basis for an algorithmto evaluate
top-event probability of a fault tree with replicated basic events. The program need only read the fault tree from an input tile and then issue a call to PROB in which S is the set consisting of a singleelement which is the node label of the top event in the fault tree.
L. B. PAOE and J. E.
160
a more exhaustive search for alternative decompositions based on independent subtrees has been made. In the original work (Page and Perry, 1986a), the strategy used was to evaluate the probability of the top event via a recursive function that evaluates Pr{A,n
. . . AA,),
(5)
where A,, . . ., A, are events corresponding to the nodes in the fault tree. (The recursive function of Fig. 1 is such an example.) In the later work (Page and Perry, 1986b), the purpose of the more sophisticated parameter passing used is to enable the recursive function to evaluate an event of the form: Pr{(B,u
. . . uB,,,)n(A,n
. . . nA,)},
(6)
where B,, _ . ., B, and A,, . . ., A, are events corresponding to nodes in the fault tree. In this setting, the recursive function is evaluating the probability that all of the events A,, . . ., A, occur and that at least one of the events B, , . . _,B,,, occurs. The effect of the more complicated information passed to the recursive fimction is to enable more independence to be found in the fault tree and therefore more use of equations (1) and (2) and less reliance on equation (3) than the earlier algorithms. The factoring approach (Page and Perry, 1988) uses equation (4) rather than equation (3). After making preliminary checks for ways to take advantage of independence in the tree if any independent modules can be found, the factoring algorithm picks a basic event and “factors” the problem into two subproblems depending upon whether or not the chosen basic event occurs. So in equation (4), event B always corresponds to a basic event in the fault tree and the problem splits into two subproblems depending upon whether or not event B occurs. Both the algorithm TDPP (Page and Perry, 1986b) and the factoring algorithm (Page and Perry, 1988) also employ other techniques for presimplifying and modularizing the fault tree that are not included in earlier algorithms. These techniques include a bottom-up presimplification of certain modules in the fault tree that involve only basic events not replicated outside of such modules. These techniques, along with the more complex information being passed to the recursive function, cause the later algorithms to be far superior to the earlier ones. Furthermore, the later algorithms treat more general types of fault trees including noncoherent fault trees where complemented basic events appear. We hope to convey to the reader a sense of the advantages and disadvantages of each of the three direct-evaluation algorithms under discussion. Since they will be compared and contrasted frequently in the next three sections, it will be helpful to establish simple acronyms by which they can be referenced. The acronyms we shall use are as follows: IN-EX+an acronymn for “inclusionexclusion”.) This is the algorithm TDPP (Page and Perry, 1986b). It is the simplest of the three
PEKRY
algorithms and always uses the inclusionexclusion formula equation (3) to evaluate probabilities when no independence can be found. The approximate length of our implementation of IN-EX is 550 lines of documented Pascal code. IN-EX W/MEM-(an acronym for “inclusionexclusion with memory”.) This algorithm has not previously appeared. It represents a modification of IN-EX to include the basic idea of PattersonHine and Koen (1989), that performance of recursive algorithms can be significantly improved by “remembering” all calls to the function PROB in order to avoid recomputation of any subproblem which has previously been generated. The advantage from doing so is that the amount of computation can be significantly reduced. The disadvantage is that one has to dedicate a potentially large amount of computer memory to a “memory” area where a record is maintained of every call to PROB. In our implementation the “memory” of all calls to PROB is implemented via a hash table. At the beginning of each new call to PROB the hash table is checked to see whether the current subproblem has already been treated, and if so the stored return value is called up from the hash table so as to avoid recomputation. The approximate length of our implementation of IN-EX W/MEM is 750 lines of documented Pascal code. FACTOR-This is a refinement of the factoring algorithm (Page and Perry, 1988). Here we use a different strategy for selecting the basic event B to “pivot” on whenever equation (4) is used. Roughly speaking, the present strategy is to pivot on the basic event that appears most often in the subproblem being treated. By contrast, the node selection strategy used in Page and Perry (1988) stipulated pivoting on a basic event that appears under the maximum possible number of gates in the subproblem passed to PROB. In effect, this means picking the basic event that appears under the maximum possible number of gates corresponding to the events A,, . - ., A, and B,, . . ., B,,, in equation (6). The new node-selection strategy requires determining not just the fact that a given basic event appears somewhere as a descendent of a given gate but precisely the number of times that each basic event appears beneath a given gate. Computational experience indicates, however, that the new node-selection strategy significantly improves average performance of the algorithm. The approximate length of our implementation of FACTOR is 1000 lines of documented Pascal code. An inherent advantage that FACTOR has over the inclusion-exclusion approaches is that equation (4) generates only two subproblems whereas equation (3) generates three. This is significant in terms of the
Direct-evaluationalgorithmsfor fault-trorprobabilities overall computational complexity as the examples of Section 4 indicate. The number of recursive function calls is significantly less for FACTOR than for IN-EX or IN-EX W/MEM. A disadvantage of FACTOR is that the entire fault tree must be rebuilt each time equation (4) is used. This means that FACTOR requires significantly more real time per recursive function call. Further comments on the relative advantages and disadvantages of implementations of these algorithms is found in Section 5. Observing in Section 5 that FACTOR generally performs faster than either of the inclusion-exclusion algorithms, the reader may wonder why a FACTOR WITH MEMORY hasn’t been implemented to speed up FACTOR in the same way that IN-EX W/MEM acts as a faster version of IN-EX. The answer is simply that the subproblems encountered in FACTOR involve entirely reconstructed fault trees rather than different subproblems posed in the framework of the original fault tree. Therefore it is far less likely that subproblems will be encountered more than once, and it would be difficult to detect such an occurrence even if it did arise. 3. DIRECTEVALUATIONvs CUT SETS The most widely used methods for evaluating fault-tree probabilities have relied on prior determination of all or some of the minimal cut sets (or prime implicants in the case of noncoherent fault trees). Typically one might create a list of all minimal cut sets with probability above some arbitrary threshold level and then attempt to estimate the probability that the top event of the fault tree occurs by applying the inclusion-exclusion formula (usually in some approximate form, since the number of terms is so great) to estimate the probability that at least one of these minimal cut sets retained in the list does in fact occur. Direct-evaluation algorithms often vastly outperform cut-set methods. The following proposition provides some insight to why this is true. It is easy to derive Proposition 1 by using the Fussell algorithm (as described by Barlow and Lambert, 1975) for determining the cut sets.
n
AND gate
a
161
Proposition 1 -For each positive integer n, let T. denote a balanced and complete binary fault tree with 2rrlevels and with the property that every gate at odd numbered levels in the tree is an OR gate and every gate at even numbered levels in the tree (except for level 2n where the leaf nodes appear) is an AND gate. If none of the basic events at level 2n in the tree are replicated, then the number of minimal cut sets for the fault tree T, is given by 2(“-‘I. Figure 2 shows the fault tree T,. The number of minimal cut sets in T, is 128 according to Proposition 1. By comparison, T, has 2,147,483,648 minimal cut sets. An exact methodology based on cut sets is therefore not feasible for such a tree. Direct-evaluation algorithms, however, are elementary for fault trees such as this because the children of every gate constitute mutually independent events. (This independence is inherent in the assumption that basic events are not replicated in the fault tree.) Such direct evaluation can be accomplished by either a bottom-up approach (as one would do manually), or by a top-down approach using recursion. The direct-evaluation algorithms described in this paper combine both approaches in that an initial bottom-up presimplification is performed, and later when independence is found in the tree the top-down recursion utilizes the independence to split the problem efficiently into smaller problems. The net effect is that these algorithms, for fault trees in which distinct subtrees are automatically independent, require only a number of computations proportional to the number of nodes in the fault tree. In going from T3 to T5 for example, the number of nodes increases by a factor of 4. So the number of computations required for direct evaluation of top-event probability only quadruples while the number of minimal cut sets grows from 128 to 2,147,483,648. Variations of the fault tree in Fig. 2 are useful to illustrate differences in the direct-evaluation algorithms described in Section 2. Consider the fault tree of Fig. 3, for example. Here the mutual independence of disjoint subtrees has been destroyed by the fact that basic event number 1 is replicated in every subtree. No elementary bottom-up presimplification
OFI gate
Fig. 2. The fault tree T,.
event 0 basic
162
L. B.
PAGE and
Fig.
is possible, and furthermore no top-down recursion can utilize equations (1) or (2) because independence of subtrees is lacking. Figure 3 is an extreme case which is trivial for the “factoring” algorithm but not for either of the “inclusion-exclusion” algorithms. “Pivoting” on event 1 splits the problem of top-event probability into two cases depending upon whether or not event 1 occurs. This generates two subproblems consisting of modified fault trees in which event 1 is not present at all. (One of the subproblems is the modification of the original fault tree to reflect the fact that event 1 has occurred; the other is a modification corresponding to event 1 not occurring.) The key point is that in both subproblems no further replicated basic events appear, so efficient calculation is possible just as in Fig. 2. Figure 3 presents a much greater difficulty for algorithms based on use of the inclusion-exclusion formula because such an approach offers no way of “eliminating” basic event 1, which is the event that causes the difficulty. As a result, IN-EX or IN-EX W/MEM require vastly more computations to evaluate the fault tree of Fig. 3 than does FACTOR. In our own implementations of these algorithms, IN-EX requires 163 s and 14,183 calls to the recursive function PROB, and IN-EX W/MEM requires 25 s and 166 1 calls to PROB, while FACTOR requires only 4 s and 32 calls to PROB. Whereas Fig. 3 represents an extreme case in which all dependence in the fault tree is caused by a single basic event, Fig. 4 represents the opposite extreme in which dependence is caused by a large number of basic events being replicated once in the fault tree
Fig.
J. E.
Pmuxy
3
rather than a single basic event being replicated many times. FACTOR cannot now split the original problem into two trivial subproblems. Instead, the replicated basic events will have to be “factored out” one at a time, causing many more subproblems to be generated. Runtime data on this fault tree is as follows: IN-EX requires 1154 s and nearly 50,000 calls to PROB; IN-EX W/MEM performs about the same as for Fig. 3, requiring now 26 s and 1617 calls to PROB; and FACTOR on Fig. 4 requires 39 s and 781 calls to PROB. 4. COMPUTATIONAL
EXPERIENCE
The determination (or approximation even) of the probability of occurrence of the top event in a fault tree has long been known to constitute an intractable problem (Rosenthal, 1975). This means that no known algorithms behave well on all examples. A consequence is that inevitably some type of truncation methodology is required in order to treat very large fault trees if basic events are replicated throughout the fault trees. With exact evaluation methods, and even with most approximation methods, there is a clear need to understand the general scope of problems that can be treated. With exact methods the limiting factor is the time requirement for the number of computations inherent in large problems. Approximation methods, on the other hand, must be judged both in terms of time and in terms of accuracy of results. Accuracy is not a consideration in discussion of the exact methods of this paper. The three strategies described here produce identical results (within the confines of
4
Direct-evaluation algorithms for fault-treeprobabilities
floating point arithmetic, which means to 12-15 significant digits in a good computer environment) whether calculation is done in a few seconds with a small number of computations or in several minutes with many thousands of recursive function calls. Small fault trees are sometime useful for clarifying the details of the manner in which an algorithm may work, but generally such small fault trees provide little insight into the computational difficulties presented by larger problems. Furthermore, it is not so much the size as the structure of a fault tree that determines the computational complexity of probability evaluation. “Structure” is a hard concept to quantify, and even with familiar algorithms it is often quite difficult to anticipate how well an algorithm will perform on a fault tree simply from a visual inspection of the tree. Compounding the difficulty in evaluating algorithms is the fact that as algorithms have improved they have outrun the supply of good examples in the literature to test them on. While it is worthwhile to peruse the literature of the past 10 yr and collect the fault trees that have been most widely used as illustrative examples to demonstrate methods, most of these examples present so little challenge to the methods of this paper that they are of little use in distinguishing the performance of one method from another. In this section we will focus on computational experience with fault trees from three different sources: (1) we have collected some of the most frequently cited fault trees in the standard literature over the past 10 yr; (2) we have constructed some fault trees of moderate size (in the range of 10&150 nodes) that present difficulties to direct-evaluation methods because of excessive replication of basic events or other modules in the fault tree which limit ones ability to find mutual independence. Most of these fault trees are noncoherent since they include complemented basic events; and (3) in order to include realistic examples as are encountered in industrial applications, we have obtained a very large fault tree used in analysis of a commercial nuclear plant. While this tree is too large for our direct-evaluation strategies to be applicable to the whole tree, it is sometimes useful to develop information about particular modules in such a fault tree as a step in the process of understanding the entire tree. Whereas the entire fault tree in this case included about 400 distinct gates (AND gates, OR gates and a few 2-out-of-3 and 2-out-of-4 gates) and 400-508 distinct basic events, we decided to analyze subtrees for which the number of distinct events (that is, the number of distinct gates plus the number of distinct basic events) in the tree was in the range of 30-120. (The few 2-out-of-3 and 2-out-of-4 gates were changed to OR gates because K-out-of-n gates are not presently implemented in the algorithms.) There are 67 such subtrees, of size ranging from 31 nodes to 769 nodes. By “size” here we mean the number of nodes in the fault tree counted according to multiplicity. (For
163
example, if a given basic event appears at three different locations in the fault tree, it is counted three times.) Of these 67 fault trees, we eliminated those with fewer than 100 nodes. This left 16 logically distinct fault trees having from 103 to 769 nodes. These 16 fault trees constitute Examples l&25 of this section and are described in Table 3. We give references for figures corresponding to the first five examples. Examples 6-9 are new, so we include figures for them. The structures of these fault trees are sufficiently complex to test the limits of some or all of the algorithms under discussion. Table 1 gives data on the fault trees used in Examples l-9. All run-time data listed in this section (as well as Section 3) is obtained using implementations of the three algorithms of Section 2 coded in THINK@ Pascal 2.0 on a Macintosh II microcomputer. Example I Kumamoto et al. (1980) used a coherent fault tree with 45 nodes to illustrate their dagger sampling technique. This fault tree appears as Fig. 4 in Page and Perry (1986b). Laviron et al. (1982) used the same fault tree to demonstrate use of ESCAF, a special electronic apparatus developed for fault tree analysis. In the latter paper the authors report that ESCAF can determine the 39 minimal cut sets in “a few seconds” and that additionally 3 s is required to compute top event probability to three significant digits “if each basic event is assigned a probability of occurrence of 10m3”. Since ESCAF utilizes an approximation methodology, time requirements depend on the degree of accuracy desired, and for a given degree of accuracy this in turn is dependent upon the magnitude of the probabilities of the basic events. They further report that, “for a probability of lo-*, ESCAF takes 17 s to provide the result, whereas” Kumamoto et al. (1980) “required a sophisticated Monte-Carlo method involving thousands of trials on a full-size computer to get the same answer: 3.72 x lo-‘.” With exact methods such as ours, time requirements do not fluctuate with changes in the basic event probabilities. Table 2 shows the time requirement and the number of recursive function calls for each of the three direct evaluation methods of this paper to obtain exact top event probability. Example 2 Example 2 was introduced by Henley and Kumamoto (198 1) appearing in their text. It was used by Page and Perry (1986a) to demonstrate the methods introduced in their paper, and it was one of the three examples used by Patterson-Hine and Koen (1989). In the LISP environment of a Texas Instruments Explorer Workstation used in the latter research, computation of top-event probability for this fault tree required “less than 3 s” and 91 function calls. This calculation employed the object-oriented modification of the main algorithm of Page and Perry
L. B. PAGEand J. E.
164
PHRRY
Table 1. Data on the fault trees wed
Example
No. of distinct AND Bates
No. of distinct OR gat=
No. of distinct basic events
4 3 16 6 14 9 14 15 11
15 15 20 15 13 10 ::
11 c
I 2 3 4 5 6 7 8
9
13
t8’ 2z 15 14
a8 Examplea
No. of =Kit? eVeSItS. 9” 24 1:
:: 58 14
6 19
;‘: 51
i:
l-9
No. of duplications among basic eVCMSb
z
No. of node3 in fault trteO
coberalt?
45 z
* yes
43
; Y= Y=
12 97 110 94
tl0 El0
no
‘The number of distinct basic events which am replicated of whose complement -18 aomcwherc in thefault tree. For this counting. the complement of a basic is not distinguished from the -t. For -pk, if a basii event appears twice and its complement onoe. this is counted as a single distinct repeated basic event. bNumber of nodes in fault tree representing basic. events that are duplicated (or whose compkment appeam) elsewhere in the fault tree. For example. if a basic event appears twioe and its complement once. this is counted as three appearances of a duplicated basic event. =Total number of gates and basic events when each is counted acoording to the number of times it appears in the fault tree. -
(1986a) which was cited in Section 1. As seen in Table 2, IN-EX W/MEM is the fastest among the three strategies which we are considering, requiring 1.5 s and 66 function calls. A figure for Example 2 appears in Page and Perry (1986a) and again as Fig. 4 in Patterson-Hine and Koen (1989). (For accuracy, the latter reference should be consulted because it corrects a typographical error in one of the fault tree labels.) The dominant characteristic of this fault tree is that a large subtree (about one-third the size of the entire tree) is duplicated in the fault tree. The common feature of the approach of Patterson-Hine and Koen and of IN-EX W/MEM is the ability to remember partial results and reuse them to avoid recomputation. The large repeated subtree makes such storage of partial results particularly worthwhile in this instance. Example
3
Example 3 is another fault tree due to Henley and Kumamoto. A figure appears as Fig. 5 in Page and Perry (1986b). Locks (1980) had suggested that a methodology introduced by Kumamoto and Henley (1978) was unnecessarily complicated. Henley and Kumamoto responded with this example in the “author reply” portion of the dialogue with Locks (1980) as a further case on which they might demonstrate their method. Locks (1981) later wrote an entire paper discussing aspects of this particular fault
tree, and Worrell et al. (1981) used it to demonstrate use of the SETS program on noncoherent fault trees. To argue the need for new methods to treat this fault tree, Henley and Kumamoto report& in the exchange with Looks (1980) that “our computer code generated 352 prime implicants in a computatin time of 62s, while MCCUS failed to obtain the Boolean indicators after 120 s.” Worrell ef al. (1981) stated that they were able to obtain Boolean indicators and prime implicants “by a single SETS job that required 6s of CPU time on a CDC 7600”. (SETS is still a widely used commercial software package for fault tree analysis.) None of the references that cite this fault tree mention probability calculations. Using such cut-set based methodologies, tBis would involve some type of approximation performed after the cut sets had been determined. The times and number of recursive function calls for each of our three approaches are given in Table 2, where the fastest time is obtained with FACTOR and is 7.5 s for exact probability evaluation. Examples
4 and 5
To demonstrate further their object-orientated LISP implementation of the main algorithm of Page and Perry (1986a), Patterson-Hine and Koen include two randomly generated coherent fault trees of moderate size. These are shown as Figs 5 and 6 in
Table 2. Performance of the direct-evaluation algorithms of this paper on the fault trres of Exampks
EXamDlC 1 2 3 4 5 7 6 8 9
IN-EX time requirement(s) 2.3 3.3 14.3 11.5 20.3 216 284 2826
IN-EX No. of function calls
IN-EX W/MEM time muirmmt(a)
270 361 1295 1455 1904
1.7 1.5 12.3 Z:Z
44,860 20,030 291,761
53.6 2.8 218 135
IN-RX W/MEM No. of func.calls 137 8:
FAmOR time rwuimmentls~
l-9
FACTOR No. of function calla
623 195
x 7:s 2.8 13.4
47 117 141 91 241
4150 326 13.603 G446
21.0 2.4 30.9 12.6
4: 1223 203
Direct-evaluationalgorithmsfor fault-treeprobabilities
Fig. 5. Fault tree for Example 6. The module T4 appearsat six locations in the fault tree. Patterson-Hine and Koen (1989) and are described as Examples 4 and 5 in Table 1. The time that the implementation of Patterson-Hine and Koen requires for the evaluation of top-event probability for Example 4 is 155 s, and 7368 function calls are used. Our times shown in Table 2 vary from a fastest time of 2.8 s for FACTOR to a slowest of 11.5 s for IN-EX. The number of function calls ranges from 91 for FACTOR to 1455 for IN-EX. Example 5 is a slightly larger fault tree with more repeated basic events. The time requirement for this
165
fault tree is 45 s in Patterson-Hine and Koen (1989) and 2223 function calls are used. They explain, in terms of their object-oriented environment, why execution is faster in this example in spite of the facts that the tree is larger than in Example 4 and there are more repeated events. Our results also demonstrate that storing computed partial results is extremely beneficial for this fault tree. Here the fastest evaluation occurs with IN-EX W/MEM which takes 3 s and 195 function calls. FACTOR, on the other hand, requires 13.4 s and 241 function calls. In terms of either real time or amount of recursion required, the performance of IN-EX W/MEM relative to that of FACTOR improves by about a factor of 8 in moving from Example 4 to Example 5. All three of the strategies described in this paper significantly outperform the implementation found in Patterson-Hine and Koen on these two examples. This is not a mark against their object-oriented environment, but rather reflects the fact that the methods we are describing incorporate much more sophisticated ways of utilizing independence in modules of the fault tree than does the algorithm which Patterson-Hine and Koen started with. Examples
69
The first five examples have been taken from other recent articles and are generally too simple to demonstrate the significant differences in the three methods we are considering. Larger fault trees with more replicated basic events are shown as Figs 6-9, and some of the characteristics of these fault trees are
a 7
Fig. 6. Fault tree for Example 7.
166
L. B.
PAGE
and J. E. PERay
Fig. 7. Fault tree for Example 8.
shown in Table 1. Performance of the algorithms on these examples is found in Table 2. Notice that Example 4 (shown in Fig. 5) has a module that is duplicated at six different locations in the fault tree. This causes the “memory” feature of IN-EX W/MEM to pay off handsomely. Another feature is that the replicated basic events consist of a relatively small number of events replicated a large number of times. This causes FACTOR to work very
well. IN-EX performs poorly by comparison because it canpot take advantage of either of these features of the fault tree. Example 7 (Fig. 6) has two large modules that appear twice in the fault tree. There are many more replicated basic events than in Example 6, but they are not replicated as many times. Again FACTOR is the fastest and IN-EX slowest, but the gap is less extreme. Example 8 (Fig. 7) is in a sense the most
Fig. 8. Fault tree for Example 9.
Direct-evaluation
algorithms
complex of the first nine examples in that it has four replicated modules with three of them appearing twice in the fault tree and one appearing at three locations. This fault tree is too complex for IN-EX, and FACTOR remains the fastest. In Example 9 (Fig. S), FACTOR is an order of magnitude faster than its nearest competitor. Examples IO-25 As indicated earlier, Table 3 gives data on 16 fault trees of intermediate size which are encountered in analysis of a commercial nuclear plant. The fault trees we use here as Examples 10-25 are those whose total number of distinct gates and basic events is in the range 30-l 20 and whose size is in the range of 100 nodes or greater (when each node is counted the number of times it appears in the fault tree). FACTOR evaluates all but one of these fault trees in an acceptable amount of time ranging from 2.3 to 108 s. The largest fault tree (with 769 nodes and 471 duplications among the basic events) is too complex for FACTOR to solve. With IN-EX W/MEM we can evaluate 14 of the 16 fault trees, and IN-EX succeeds on 11 of the 16. In three of the examples, INEX W/MEM is slightly faster than FACTOR. In a few cases, FACTOR is an order of magnitude faster than either of the inclu&on-exclusion methods. 5. CONCLUSIONS
AND NOTES ON IMPLEMENTATION
The evidence provided by the 25 examples treated in this paper indicates that the “factoring” strategy that was described in Page and Perry (1988) remains among the fastest approaches yet suggested for direct evaluation of fault tree probabilities. The implementation of FACTOR that has been used to generate the results of this paper is a refinement of the one described earlier in that a more sophisticated method for selecting basic events for “factoring” is now used,
for fault-tree probabilities
167
and a few other methods for simplifying the fault tree are now incorporated. The disadvantages of FACTOR are two: (1) its requirement that the fault tree be continually rebuilt means that entire fault trees rather than just information about the original fault tree must be passed to the recursive function; and (2) this continual rebuilding of the fault trees makes it unclear how one might usefully store results of computations for reuse as has been suggested by Patterson-Hine and Koen and incorporated into IN-EX W/MEM. The first of these two disadvantages means that a significantly larger amount of memory must be available for use as a recursion stack to store the parameters passed to the recursive function. About 0.5 MB is adequate for the 25 examples treated here. By comparison, IN-EX and IN-EX W/MEM do not alter the original fault tree, but rather their recursive function repeatedly solves subproblems in the original tree. This means that much less information must be passed to the recursive function and then stored on the recursion stack. The 64 k of memorv that is available as stack space in some programming environments on MSDOS microcomputers is more than ample for the needs of IN-EX and IN-EX W/MEM for the examples of this paper. While many of these examples could be treated via FACTOR using a 64 k stack if one were very careful with the programming, we doubt that the largest fault trees could be treated in such an environment. Our implementation of IN-EX W/MEM utilizes a hash table to store preliminary results. The size of the hash table will depend, of course, on the amount of information one wants to record and the size of the fault tree. We opted to store a record of all calls to the recursive function, and again about 0.5 MB of memory is adequate for most of the fault trees treated in this paper. One could clearly elect not to store records of calls that are very simple to evaluate and
Table 3. Data on fault trees of intcrmcdiate sire encountered in analysis of a commercial nuclear power plant. The fault trees included here an those having between 30 and 120 distinct events (AND and OR gates plus basic events) and more than 100 nodes in the fault tree when nodes are counted according to multiplicity (each node being counted as many times as it appears in the tree). The same counting conventions apply for columns 57 as are described in footnotes a. b and c in Table 1
Example IO 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
No. of distinct AND gates
No. of distinct OR gates
No. of distinct basic events
No. of distinct repeated basic events
5 z
22 20 22 24 28 27 34 20 22 23 24 22 29 30 31 33
;:
11 24 21 ::
5 5 : 4 2 5 5 7 7 7 8
: z 35 34 37 39 ;: 42 64
21 21 25 25 25 24 26 40 34 34 58
No. of duplications among basic events 35 61 49 38 53 53 7’: 76 76 74 100 144 133 133 471
No. of nodes in fault tree 103 107 109 ill 123 124 125 138 144 148 155 173 250 253 256 769
Coherent? Yes Yes Yes YYes Yes Yes Yes Yes yes YYes Yes Yes WS
G=s
L. B. PAGE and J. E. PERRY
168
Table 4. Pcrformanoc of the direct-evaluation ~orithms IN-RX time
Example
~UirClllCaf(S)
10 II 12 13 14 15 16 17 18 ::
16.6 4.8 4.2 270 22.9 29.5 26.9 115 119 127 26.0
IN-EX No. of function
calls
1878 542 27.:;: 2590 2999 2592 13,627 13,632 13,635 2670
of this paper on the fault tree of Examples 10-25
IN-EX W/MEM time
IN-EX W/MEM
requiremeat
No. calls
6.3 2.7 3.0 68.6 9.9 7.5 10.6 22.1 23.4 24.6 14.4
FACTOR time
FACTOR No. of function
calls
759 479 761 2039 2043 2046 1146
ns~=t@) 5.2 2.3 4.8 8.8 5.9 8.8 6.7 19.2 22.3 24.8 7.7
472 211 5:z
166 99 99 213 66 309 68 510 515 518
21 22
-
-
139
10.876
16.1 16.4
8: 648
z 25
-
-
342 331 -
21,329 21.333 -
108 94.5 -
776 781 -
for which recomputation would be quick, and this would cut down on memory requirements at the expense of a slight loss of speed. Memory for a hash table typically comes from the heap rather than the stack, and in some environments (such as some MS-DOS programming environments) more memory is available in the heap than on the stack. So with IN-EX W/MEM it might be easier to take advantage of available memory in some programming environments than would be the case with FACTOR. IN-EX, while generally the slowest of the three methods we have discussed, required very little memory and can be used on the smallest current generation of microcomputers without difficulty. It is also the simplest of the three algorithms in terms of the number of lines of code required to implement it. Cut-set methods are in no danger of becoming obsolete. Even if direct-evaluation methods do frequently surpass cut-set methods for reliability calculations, engineers will always need to know “why” systems fail. Cut sets are also useful for verifying and validating fault tree synthesis steps and for computing the relative importance of gates in the tree. With direct-evaluation methods it is, of course, possible to catalog intermediate reliabilities during top-event evaluation. The scheme devised by Patterson-Hine and Koen (laS9) utilizes such an approach to improve algorithm efficiency. But the need to analyze cut sets remains. On some of the examples of this paper we can now perform direct evaluations in a microcomputer environment significantly faster than the best approximation methods performed on mainframe computers 10 yr ago. The computational complexity of the largest fault trees will continue to require the use of approximation methods, however. All the directevaluation methods we describe are exact methodologies. An interesting question posed by a referee is whether probabilistic or failure rate bounds might be developed using direct-evaluation methods. Even if truncation methods are to be used for the largest problems, however, there is a clear need to push exact
algorithms to their limits, if for no other reason than to provide useful validation for the truncation method that one might ultimately have to use. This is particularly essential in light of the difficulty in estimating the error when truncation methods are used. As Modarres and Dezfuli (1984) point out, “. . . two types of truncation are used in PRA studies; one based on minimal cut set size . . . and the other on minimal cut set probability. . . . The selection of truncation value in either case is highly subjective and depends on the judgment of the analyst.” Where great reliability is required, one would prefer not to have the validity of reliability calculations depend on the subjective ‘judgment of the analyst”. Directevaluation methods such as those presented in this paper significantly extend the realm of problems for which exact analysis is possible. REFERENCES
Barlow R. E. and H. E. Lambert, Introduction to fault tree analysis. Reliability and Fault Tree Analysis (R. E. Barlow and J. E. Fussell, Eds). SIAM (1975). Helman P. and A. Rosenthal, A decomposition scheme for the analysis of fault trees and other combinatorial circuits. IEEE Trans. Reliab. R38, 312-327 (1989). Henley E. J. and H. Kumamoto, Top-down algorithm for obtaining prime implicant sets of non-coherent fault trees. IEEE
Trans. Reliab.
R27,
242-249
(1978).
Henley E. J. and H. Kumamoto, Reliabili:y Engineering and Risk Assessmenr. Prentice-Hall, Englewood Cliffs, New Jersey (1981). Kumamoto H. and K. Tanaka, K. Inoue and E. J. Henley, Dagger-sampling Monte-Carlo evaluation of system reliability. IEEE Trans. Refiab. R29, 122-125 (1980). Laviron A., A. Camino and J. C. Manaranche, ESCAF-a new and cheap system for complex reliability analysis and computation, IEEE Trans. Reliab. R31, 339-348 (1982). Locks M. O., Fault trees, prime implicants, and noncoherence; E. I. Ogunbiyi. Author reply No. 1; H. Kumamoto, E. J. Henley, Author reply No. 2; M. 0. Locks, Rebuttal. IEEE
Trans. Reliab.
R29,
130-135
(1980).
Locks M. O., Modularizing, minimizing, and interpreting the K&H fault tree. IEEE Trans. Reliab. R30, 411415 (1981).
McCullers W. and R. K. Wood, Probabilistic analysis of fault trees using pivotal decomposition. Applications of Discrete
Mathematics:
Proc.
Third Conj:
Discrete
Math -
Direct-evaluation algorithms for fault-tree probabilities emutics (R. Ringeisen and F. Roberts, Eds), pp. 107-121 SIAM. (1988). Modarres M. and H. Dezfuli, A truncation methodology for evaluating large fault trees. IEEE Tmns. Refiab. R33, 325-328 (1984). Page L. B. and J. E. Perry, A simple approach to fault tree probabilities, Computers them. EnsnglO, 249-257 (1986a). Page L. B. and J. E. Perry, An algorithm for exact fault-tree probabilities without eut sets. IEEE Trans. Aeliob. R35, 544-558 (1986b). Page L. B. and J. E. Perry, An algorithm for fault-tree probabilities using the factoring theorem. Microelect. Reliab. 28, 273-286 (1988).
169
Patterson-Hine F. A. and B. V. Koen, Comment on: an algorithm for exact fault-tree probabilities without out sets. IEEE Trans. Rehkb. 836, 640-641 (1987). Patterson-Hiae F. A. and B. V. Eoen, Direct evaluation of fault trees using object-oriented programming techniques. IEEE Trans. Reliab. R38, 186192 (1989). Rosenthal A., A computer scientist looks at reliability computations. Reliability and Fault Tree AnaCysb (R. E. Barlow and J. E. Fussell, Eds). SIAM (1975). Worm11 R. B., D. W. Stack and B. L. Hulme, Prime implicants of noncoherent fault trees. IEEE Trans. Reliab. R30, 98-100 (1981).