Reliability Engineering and System Safety 34 (1991) 309-326
Common Cause Failure Analysis: A Critique and Some Suggestions

Gareth W. Parry

NUS Corporation, 910 Clopper Road, Gaithersburg, Maryland 20878, USA
ABSTRACT

This paper presents a critical review of a recently proposed procedure for a quantitative common cause failure analysis. Limitations resulting from the quality of the data base and the lack of guidance for event interpretation are highlighted. It is suggested that more effort needs to be made to understand and provide for a more consistent and detailed description of failure mechanisms. A review process, based on the assessment of the adequacy of the defensive strategy in place at a plant, is proposed as a supplement to the current approach based on a review of historical events to identify, in a more complete fashion, potential CCF mechanisms.
1 INTRODUCTION

The numerous probabilistic safety assessments (PSAs) that have been performed over the last few years have focused attention on the importance of treating correctly the dependency between the events representing failures of redundant and diverse components of various systems. As briefly reviewed in the next section, although many sources of dependency are treated explicitly in the logical structure of the models (the event trees and fault trees), it has been found convenient to include, in these models, events that are called common-cause failure events. A recent NRC/EPRI document¹ presents a procedure which addresses the definition of these events and the quantification of their probabilities. The procedure has not been without its critics. Dörre,² for example, has suggested that the introduction of the idea of common-cause failure as an additional
phenomenon to be treated with extra methods is confusing, and has suggested an alternative approach to accounting for the experience data. Others have pointed to the lack of an adequate data base with which to apply the procedure. In the light of these criticisms, this paper presents a critical review of the state of the art as represented in Ref. 1 and suggests some ways in which the quality of common-cause failure analyses could be improved, particularly in the area of emphasizing the role of defenses. Firstly, however, the origin of the common-cause failure concept is summarized.
2 THE ORIGIN OF THE COMMON-CAUSE FAILURE CONCEPT

In constructing the logic models of PSAs, many causes of dependency are explicitly included in the structure of the model. For example, the failure of a motor-operated pump is modeled as arising from failures of the pump itself, and also from failures of the power from the bus which supplies the motor. Further, the model of failure of the bus to supply power includes contributions from failures to supply power to the bus from both normal and emergency power supplies. Similarly, if the pump is such that it requires seal, bearing, or room cooling, failures of the cooling systems are modeled as potential failures of the pump. In this way a model is constructed that reflects explicitly the hard-wired functional dependencies between the various components and systems of the plant. For further discussion of the modeling of dependencies see, for example, the PRA Procedures Guide.³

The typical plant logic model, constructed in this way, models the plant response in terms of units, called basic events, which include events that represent particular failure modes of the components. Examples of such basic events are: pump A fails to start on demand because of local faults; diesel generator D1 fails to run for 6 hours because of local faults. To turn this model into a quantitative assessment tool, a probability model must be constructed that allows the evaluation of the probability of each basic event, preferably on the basis of some historical experience. The final stage in the quantification of the model is to specify the rules for combining the basic event probabilities and initiating event frequencies, to obtain the accident sequence frequencies and system unavailabilities.

The assumption of statistical independence of the basic events is an appealing one to start with, but experience with collecting data on component failures has shown that the number of multiple coincidental failures of like components is higher than expected on the basis of that independence assumption. Hence the concept of a common-cause failure model to correct for this statistical dependence between the basic events has been introduced. New common-cause
basic events are introduced into the logic structure to explicitly account for the expected excess number of multiple component failures.

To understand the reason for this dependency, it is important to recognize that a single component/failure mode basic event represents the logical sum of all the mechanisms that would result in that failure. As long as there are some mechanisms that have a higher propensity than others for failing a second or third component simultaneously with the first, it is clear that there will be a statistical dependence between these events. Dörre² has shown this clearly in the case of a model in which this propensity results from an increased failure rate for a given subset of the failure mechanisms, even though failures are independent at the level of the subset. Thus the dependency arises because the probability of each basic event itself characterizes average behavior, the average being taken over the failure mechanisms contributing to that event.

The philosophy behind the common-cause failure approach, as discussed in the next section, is to accept that a level of decomposition in terms of failure modes is appropriate, and to provide a correction factor, or correction factors, to get the numbers right. What Dörre seems to be suggesting is an alternative approach in which basic events are developed at a more fundamental level, that of the occurrence of an environment, with different conditional failure probabilities associated with components in different environments. In this way a more explicit accounting for the residual dependency results. However, that model has yet to be developed to the level of application to a full-scale PRA. Consequently, the critique given in the next section is directed largely at the approach described in Ref. 1. However, in principle, any alternative approach should address the same issues raised here.
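To make concrete the way in which averaging over failure mechanisms induces dependence, the following minimal numerical sketch (the figures are invented for illustration and are not taken from any data base) shows the effect in the spirit of Dörre's argument: conditional on the failure environment, the two components fail independently, yet the unconditional probability of a double failure greatly exceeds the product of the unconditional single-failure probabilities.

```python
# Illustrative only: two redundant components and two failure "environments".
# Conditional on the environment, failures are assumed independent.
p_env = {"benign": 0.99, "harsh": 0.01}    # probability of being in each environment
p_fail = {"benign": 1e-3, "harsh": 0.2}    # per-component failure probability in each

# Unconditional (average) probability that a given component fails
p_single = sum(p_env[e] * p_fail[e] for e in p_env)

# Unconditional probability that both components fail
p_double = sum(p_env[e] * p_fail[e] ** 2 for e in p_env)

print(f"P(A fails)               = {p_single:.2e}")     # ~3.0e-03
print(f"P(A and B fail)          = {p_double:.2e}")     # ~4.0e-04
print(f"P(A)*P(B) if independent = {p_single**2:.2e}")  # ~8.9e-06
# The joint probability exceeds the independence prediction by more than an
# order of magnitude: the basic events are positively dependent because each
# probability is an average over mechanisms with very different failure rates.
```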
3 THE CCF APPROACH AND ITS LIMITATIONS

The general approach to common-cause failure modeling, described in Mosleh et al.,¹ begins with the identification of common-cause component groups, i.e. those groups of components for which the independence assumption is suspected to be incorrect, and includes new basic events in the logic model which represent common-cause failures of those groups and subsets of those groups. Then, for each new basic event, a probability model is constructed in the same way as for the single component/failure mode basic events. The set of basic event probability models associated with the components in a common-cause component group is collectively known as a common-cause failure model.

The models discussed below, and described in Ref. 1, generally have very little causal structure. With one exception, they do not propose a
relationship between single and multiple failure events; they merely recognize that such events can exist.

The basic parameter model¹ defines the same type of probability model for each basic event of different multiplicity within a CCF group. For example, the constant failure rate, or constant failure probability model, is assumed for the single, double, triple, etc., component failure basic events. The parameters of the basic event models are then free parameters to be fitted to the available data on the set of observed single, double, triple, etc., failure events. The multiple Greek letter or alpha factor methods (and the more primitive beta factor method) are reparameterizations of this model.

The binomial failure rate model does have an underlying causal structure in that it proposes a mechanism in which certain agents, called shocks, impact all components in the group. The probabilities of multiple failures are prescribed on the basis of the frequency of the shocks and according to whether the shocks are lethal, in which case all components fail, or nonlethal, in which case the numbers of components failed are distributed according to a binomial distribution. The model has, at most, four parameters. The multinomial failure rate model⁴ also uses the concepts of independent and shock-related failures, but the probabilities of multiple failure events are essentially free parameters.

The models discussed above, with the exception of the BFR, by definition, have enough free parameters to fit any data; more free parameters are created as needed for higher redundancy systems. Therefore, with the exception of the BFR, and even then only with specific assumptions, none of the models can be used to predict what would be the effect of an increase in the level of redundancy from three to four, say, on system unavailability. With the possible exception of the BFR model, they are not in any way theoretical models of the physics of failure or characterizations of causal mechanisms. Their sole purpose is to guide an analyst in partitioning event data in a meaningful way, to enable estimates to be made of probabilities of failure events of varying multiplicity, which, in the language of Ref. 1, are the basic events of the basic parameter model. The creation of these basic events is an intermediate analysis step to bridge the gap between the algebraic (logical) solution to the system unavailability, expressed as component state cut-sets, and the quantification of that unavailability, and can be regarded as a mathematical convenience to create quasi-independent basic events. (See the discussion in Appendix C of Ref. 1.) In addition, they also give increased visibility to the importance of this very real class of dependent failure mechanisms.

While the models themselves do not explain why the common-cause failures occur, the procedure does require that an attempt to understand failure mechanisms is made during the analysis of event data to provide the
information necessary to estimate the parameters of the model. Recognizing that little data are likely to be obtained from a single plant, the procedure proposes pooling data from all plants, and then by a process of interpretation (identifying the elements in the causal chain involved in each event in the data base) and a reinterpretation (performing a thought experiment to determine if the causal chain could occur at the plant of interest, and characterizing its likely effect at that plant in terms of the number of components failed) creating a pseudo plant-specific data base. This pseudo plant-specific data base gives the number of events with 1, 2, ..., n component failures, where n is the degree of redundancy. Reference 1 provides details of how this data base is used in parameter estimation.

While this is a logical approach, there are some very real practical difficulties.

Firstly, the data available to perform this analysis are limited. As discussed in Ref. 1, to create the pseudo data base, it is imperative to perform the analysis on a data set that incorporates all failure events, whatever their multiplicity. The LER reporting rules changed in 1984 and ceased requiring many single failure events to be reported. Other data bases, such as the Nuclear Plant Reliability Data System (Institute for Nuclear Power Operations, Atlanta, Georgia), do not record, in a readily retrievable way, multiple failure events. Thus a consistent data base in the post-1984 era is not readily available.

Secondly, the event descriptions themselves are often insufficient to understand the true reasons why the failures occurred, so that the interpretation and reinterpretation process is difficult and becomes very subjective. One key ingredient in understanding how an event could have occurred at one plant but might not occur at another is the ability to assess the relative quality of the defenses against failures, and particularly multiple failures, at those plants. In this regard in particular, the procedure¹ is deficient in providing guidance on how to interpret event reports.

Thirdly, the analysis of the data is very time consuming, since, if it is to be done properly, not only do the multiple failure events have to be reinterpreted for the plant of interest but so indeed do the single failure events, as they may be reinterpretable as multiple failure events (which was the basis of the C-factor approach⁵) or they may be reinterpretable as nonfailures.

Fourthly, for some important components, the data are sufficiently few that some analysts have resorted to constructing models of failure mechanisms. An example is the analysis of battery common-cause failures,⁶ where the quantification involved the use of generic human error probabilities.
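As a deliberately simplified illustration of how such a pseudo plant-specific data base of event counts feeds parameter estimation, the sketch below computes simple point estimates of the alpha factors and of a beta factor for a group of three redundant components. The counts are invented for illustration only; Ref. 1 gives the full estimators, including the treatment of demand-based versus time-based failures and of uncertainty, which are not reproduced here.

```python
# Invented counts: number of events in the pseudo plant-specific data base
# in which exactly k of the m = 3 redundant components failed.
n = {1: 120, 2: 6, 3: 2}

total_events = sum(n.values())

# Alpha-factor point estimates: the fraction of events of each multiplicity.
alpha = {k: n[k] / total_events for k in n}

# One simple beta-factor point estimate: the fraction of component failures
# that occurred in events involving more than one component.
component_failures = sum(k * count for k, count in n.items())
beta = sum(k * count for k, count in n.items() if k >= 2) / component_failures

for k in sorted(alpha):
    print(f"alpha_{k} = {alpha[k]:.3f}")   # 0.938, 0.047, 0.016
print(f"beta    = {beta:.3f}")             # 0.130
```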
Thus, while to a first approximation the procedure¹ gives an acceptable and, most importantly, documentable approach to the quantification of some common-cause failure probabilities, it has to be recognized that it is not without practical problems and, although these are not addressed here, conceptual ones as well. However, because they relate largely to the quality of our information base, these problems are shared with alternative methods such as that proposed by Dörre.² Increased confidence in the quantification process can only be gained through the establishment of an improved data base and a more systematic procedure for the review and interpretation of event data, particularly with respect to the role and deficiencies of defenses. The former is clearly not a short-term solution, and will be unavailable on the time scale of the IPEs required by Ref. 8. The latter is feasible in the short term, but given the shortcomings of the data itself, may not provide much immediate improvement. The limitations discussed above have raised the question in some people's minds of the value of the detailed quantification process. This is discussed in the next section.
4 HOW VALUABLE IS DETAILED QUANTITATIVE ANALYSIS?

One of the questions often asked about common-cause failure analysis is: 'Is it worth going through all the effort when the CCF probability may change by a relatively small factor, if at all, particularly given the large uncertainty in the estimates?'⁸ Factors of improvement as high as two or three over the generic values obtained from EPRI NP-3967⁹ are difficult to justify. If the measures of risk, such as core damage frequency, obtained using some generic values are acceptable and meet a numerical safety goal, for example, then it is hard to argue that a detailed analysis need necessarily be done solely for the purpose of 'improving the numbers'.

My usual response, however, has been to say that the reason for doing the detailed data analysis is that, since it concentrates on identifying failure mechanisms that have occurred, it provides insights into those features of plant design or operation that might be improved to prevent CCFs. Thus its value is as much for qualitative as quantitative reasons. This proposition has particular validity when the sample of data is large enough that the spectrum of mechanisms is well represented. However, the smaller the amount of data, the more certain it is that this approach will be incomplete. Furthermore, as discussed in Section 3, there is often too little information to determine the failure mechanism and relatively little guidance on how to interpret and reinterpret the event data, particularly with regard to hypothesizing which defense (or defenses) was deficient. This, in addition to limiting the ability to gain
qualitative insights, implies that estimates of the CCF parameters based on these data could have large uncertainty, and, more importantly, are likely to be highly analyst dependent. This is of considerable concern for those wishing to compare analyses performed by different analysts.

These concerns notwithstanding, there is no doubt that there is a significant role for performing quantitative analyses. Some degree of quantification is essential for prioritizing detailed analyses. Furthermore, a single analyst can, by applying the detailed analysis approach consistently, establish a ranking of potential CCFs within a plant and across plants. However, it is clearly important to find ways of improving consistency between analysts. A first step is to improve qualitative understanding of common-cause failure mechanisms. Section 5 describes some concepts that are useful in considering failure mechanisms. Their adoption to describe events could improve understanding and consistency of interpretation. In addition, some form of consistent language is a necessary precursor to providing more explicit guidance on how to improve event reporting and on event interpretation. Section 6 proposes a CCF review, based on an assessment of the quality of defenses, that is intended to complement a historical event-based review and improve the completeness of the search for potential CCF mechanisms. Currently, this is seen primarily as a qualitative tool. The work in these two sections was performed as part of the NRC-sponsored dependent failures project described in Ref. 10.

5 THE CAUSE-COUPLING-DEFENSE PICTURE OF COMMON-CAUSE FAILURES

In order to understand why CCFs arise, it is first important to recognize how failures occur, and how they can occur simultaneously in several components. The meaning of simultaneous in this context is that the failures occur within the required mission time. As discussed in Ref. 1, there are three separate issues to be discussed in relation to CCF events: causes, coupling factors, and defenses.

5.1 The mechanics of failure

The description of a failure in terms of a single 'root cause' is in many cases too simplistic. For some purposes it may be quite adequate to identify that a pump failed because of high humidity. However, to understand in a detailed way the potential for multiple failures, or how to prevent further failures, it is necessary to identify why the humidity was high and why it affected the pump. There are many different paths by which this ultimate reason for
failure could be reached. The sequence of events that constitute a particular failure path, or failure mechanism, is not necessarily simple. As an aid to thinking about failure mechanisms, the following concepts are useful.

A proximate cause that is associated with a failure event is a characterization of the condition that is readily identifiable as leading to the failure, but it does not in itself necessarily provide a full understanding of what led to that condition. In the above example humidity could be identified as the proximate cause. The proximate cause in a sense can be regarded as a symptom of the failure cause. As such it may not, in general, be the most useful characterization of failure for the purposes of identifying appropriate corrective actions.

To expand the description of the causal chain of events resulting in the failure, it is useful to introduce the concepts of conditioning events and trigger events. These concepts are introduced as an aid to a systematic review of event data, but it is not always necessary or convenient to consider both concepts. A conditioning event is an event which predisposes a component to failure or increases its susceptibility to failure, but does not of itself cause failure. A trigger event is an event that activates a failure or initiates the transition to the failed state, whether or not that failed state is revealed at that time. A trigger event, particularly in the case of CCF events, is usually an event external to the components in question. An event that led to high humidity in a room and subsequent equipment failure would be such a trigger event. It is not always necessary, or even possible, to uniquely define a conditioning event and a trigger event for every type of failure. However, the concepts are useful in that they focus on the ideas of an immediate cause and subsidiary causes whose function is to increase susceptibility to failure given the appropriate ensuing conditions. Some examples of the use of these concepts are:

1. A design error is such that under real demand conditions a component fails its function even though it has successfully performed its function during testing. For this type of event the design error could be regarded as a trigger event. The design error is such that the component is already in an unrevealed failed state with respect to the real demand.

2. A pump fails to run because of moisture in a control cabinet. The proximate cause is the high humidity and the trigger event is the event leading to occurrence of the high humidity. If the level of humidity experienced was within design conditions and the control cabinet was designed against intrusion of humidity, the failure might be an indication of the occurrence of a conditioning event, such as a failure of the implementation of the protection (e.g. by properly sealing the control cabinet) against high humidity. On the other hand, if
the event that caused the high humidity was outside the design envelope, there is no conditioning event per se, unless the choice of design envelope had been erroneous in that the possible occurrence of high humidity was not foreseen.

3. Following a specific maintenance act, a component fails. The proximate cause is a maintenance error. The trigger event is the maintenance act that caused the failure. If there is an error in the maintenance procedure, this error can be regarded as a conditioning event. If the failure is a result of a slip on the part of the crew that performed the maintenance, there is no need to identify a conditioning event unless the reason why the specific crew made an error is being investigated. (In this case it might be more appropriate to define conditioning factors rather than a conditioning event.)

4. A pump shaft fails due to the cumulative effect of high vibration. If the excessive vibration were due to an installation error, this error would be a conditioning event. The trigger 'event' is the cumulative exposure of the pump to the excessive vibration.

The fundamental point here is that a conditioning event happens prior to the failure, and its effect is to weaken the component in some sense. A trigger event is related directly to the transition to the failed state.

Another concept that is important when discussing failure mechanisms is the speed with which they act. Lofgren et al.¹¹ have distinguished between impulsive (fast acting) mechanisms and persistent (slow acting) mechanisms. This concept is especially important when considering possible defenses. Failure mechanisms that act slowly, with detectable evidence of degradation, have a greater chance of being discovered before they cause catastrophic failures.

So far the concept of 'root cause' has not been mentioned. This is deliberate and is a reflection of the fact that its identification is subjective and is related to the defensive strategy adopted to prevent recurrence.
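To illustrate how these concepts can be applied consistently, the sketch below simply restates the four examples above in a structured form. The format is hypothetical (it is not one prescribed in Ref. 1); fields are left empty where the text does not identify a conditioning event or a proximate cause.

```python
# Hypothetical structured restatement of the four example failure mechanisms
# discussed above. None marks fields the text does not identify.
example_events = [
    {"failure": "component fails its function on a real demand despite passing tests",
     "proximate_cause": None,
     "trigger_event": "design error (component already in an unrevealed failed state)",
     "conditioning_event": None},
    {"failure": "pump fails to run because of moisture in a control cabinet",
     "proximate_cause": "high humidity",
     "trigger_event": "event leading to the high humidity",
     "conditioning_event": "failure to seal the control cabinet properly (if within design conditions)"},
    {"failure": "component fails following a maintenance act",
     "proximate_cause": "maintenance error",
     "trigger_event": "the maintenance act that caused the failure",
     "conditioning_event": "error in the maintenance procedure (if one exists)"},
    {"failure": "pump shaft fails from cumulative high vibration",
     "proximate_cause": "excessive vibration",
     "trigger_event": "cumulative exposure to the excessive vibration",
     "conditioning_event": "installation error causing the vibration"},
]

for event in example_events:
    print(f"{event['failure']}: trigger = {event['trigger_event']}")
```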
5.2 Coupling factors

For failures to become multiple failures, the conditions have to be conducive for the trigger event and the conditioning events to affect all the components simultaneously, with the meaning of simultaneity discussed earlier. It is convenient to define a set of coupling factors. A coupling factor is a property of a group of components or piece parts that identifies them as being susceptible to the same mechanisms of failure. Such factors include similarity in design, location, environment, mission, and operational, maintenance, and test procedures. These, in some references, have been referred to as examples of a coupling mechanism, but because they really identify a potential for common susceptibility it is preferable to think of
them as factors that help to define a potential common-cause component group. In fact, it is questionable whether it is necessary to talk about a coupling mechanism as an entity separate from the failure mechanism. What is important is to identify the specific features of the coupling factors that lead to a simultaneous impact on the components in the group. This is a function of how the trigger and conditioning events are introduced at the system or common-cause component group level. Consider again the four examples discussed previously:

1. Components of the same design in a similar usage will all fail on a common demand if there is a fatal design error. The trigger event (the design error) is common to all components in the group and is introduced simultaneously into the group.

2. More than one pump may fail if the conditions of high humidity exist in more than one control cabinet. However, for this to happen all control cabinets have to be susceptible to humidity intrusion, and they also have to be located in a similar environment. Following on from the previous example, if there were a conditioning event of failing to seal the cabinets properly, it would have to have occurred in more than one cabinet and, in addition, the trigger event causing the high humidity would have to affect the cabinets simultaneously, within our definition of simultaneity, for there to be a common-cause failure. This implies a common domain of influence of the source of the trigger and the conditioning events.

3. There are many ways a maintenance-related error can propagate, depending on how it arose. For example, a basic error in the procedure would be systematic across all crews no matter when they do the maintenance. The conditioning event, the procedure error, is common to all crews. On the other hand, ambiguity in the procedure may result in one crew adopting an alternative approach consistently, no matter when they perform the maintenance. The ambiguity in the procedure would be one conditioning event; the trigger event is the particular maintenance activity in which the crew misinterpreted and misapplied the procedures. In this case the crew acts as the agent introducing the failure. However, the failure may not become a common-cause failure unless the crew performs the maintenance on redundant equipment close enough in time with respect to opportunities for discovery of the error. Slips or errors made during maintenance are unlikely to be sources of common-cause failures unless maintenance is performed on several redundant components during a short time interval, and the slip or error is somehow made systematically over that maintenance episode.
4. A common, systematic, installation error could lead to simultaneous high levels of degradation. The coupling of failures is introduced into the system through a conditioning event and is activated by the trigger event.

The way the potential for coupling is activated is different for the different conditioning and trigger events. In addition, it is not easy to postulate separate coupling and failure mechanisms; the coupling occurs at all points of the causal chain to some degree or other, and hence is inextricably a part of the failure mechanism.

5.3 Defensive tactics

Common-cause failures can be prevented by a variety of defenses. A defense can operate to prevent the occurrence of the failure mechanisms. Another approach is to decouple failures by effectively decreasing the similarity of components and their environment in some way that prevents a particular type of root cause from affecting all components simultaneously, and allows more opportunity for detecting failures before they appear in all components of the group. The key to successful mitigation and prevention of common-cause failures is to understand how the primary defenses failed. In the examples considered previously, therefore, the following are plausible reasons why the failures occurred, in terms of failure of defenses:

1. The design review process as a primary defense and proof testing as a secondary defense were deficient.

2. If the error was that, at the design stage, it was not envisioned that a high humidity condition might arise, the design review process was potentially deficient. On the other hand, the failure may have been a failure to maintain an adequate barrier against humidity, given that it was realized that such a barrier was necessary.

3. Insufficient or inadequate training could have led to the conditioning of a particular crew such that they had a high likelihood of making errors. In another case it could have been a failure in the procedures review process that resulted in faulty or ambiguous procedures. In the case of ambiguous procedures, training can be a secondary defense. An error resulting from ambiguous procedures can therefore be regarded as resulting from a failure of two defenses, the procedures review and training.

4. The quality control process during installation may have been deficient.

A general set of defensive tactics can be defined. The set below is loosely based on those in Ref. 12, but with definitions extended to include their
application against failure mechanisms as well as coupling:
--Barriers. Any physical impediment that tends to confine and/or restrict a potentially damaging condition.
--Personnel training. A program to assure that the operators and maintainers are familiar with procedures and are able to follow them correctly during all conditions of operation.
--Quality control. A program to assure that the product is in conformance with the documented design and that its operation and maintenance are according to approved procedures, standards, and regulatory requirements.
--Redundancy. Additional, identical, redundant components added to a system solely for the purpose of increasing the likelihood that a sufficient number of components will survive exposure to a given cause of failure and be available to perform a given function. This is a tactic to improve system availability, but, by definition, common-cause failures decrease the positive impact of this particular tactic. However, increasing redundancy will generally still have value.
--Preventive maintenance. A program of applicable and effective preventive maintenance tasks designed to prevent premature failure or degradation of components.
--Monitoring, surveillance testing, and inspection. Monitoring via alarms, frequent tests, and/or inspections so that unannounced failures from any detectable causes are not allowed to accumulate. This includes special tests performed on redundant components in response to observed failures.
--Procedures review. A review of operational, maintenance, and calibration/test procedures to eliminate incorrect or inappropriate actions resulting in component or system unavailability.
--Diversity. The mixture of interchangeable components made by different manufacturers (equipment diversity) or the introduction of a totally redundant system with an entirely different principle of operation (functional diversity) for the express purpose of reducing the likelihood of a total loss of function that might occur because all like components are vulnerable to the same cause(s) of failure. Diversity in staff is another form of applying this concept, i.e. using different teams to maintain and test redundant trains. This is a tactic that specifically addresses common-cause failures.
5.4 Event descriptions

From the foregoing discussions it is evident that, to describe a CCF event (or indeed any failure event) in sufficient detail, it is desirable to include
descriptions of all those factors that allowed the event to occur. This includes the identification of trigger events, conditioning events, the factors that allowed these conditioning events and trigger events to impact more than one component, and the particular aspect or aspects of the defenses that were deficient.

At this stage it is useful to introduce the idea of a root cause. The root cause, following Gano,¹³ is defined as the most basic reason for an effect which, if corrected, will prevent recurrence. The root cause can be seen to be a subjective notion, since it is tied to the implementation of a defense; different operators will identify different corrective actions, implying different deficiencies in defenses. Since, as the examples discussed previously have shown, there are, in many cases, several possible defenses that are applicable to a particular causal mechanism, there could in principle be many 'root causes'. The root cause could be a trigger event or a conditioning event. To be a root cause of a CCF event, however, it must have the capacity for impacting all components in the group.

While the arguments are not developed in this paper, it can be argued that mechanisms of failure can be grouped according to general root cause characteristics and coupling factors, and that a suitable defense can be assigned to each group. This grouping is the basis of the review philosophy outlined in Section 6.
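A minimal sketch of what such an event description might look like as a structured record is given below. The field names and the vocabulary of defenses are assumptions introduced here for illustration (the defense list follows Section 5.3); this is not a format defined in Ref. 1 or Ref. 14.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Vocabulary of defensive tactics, following the list in Section 5.3.
DEFENSES = [
    "barriers", "personnel training", "quality control", "redundancy",
    "preventive maintenance", "monitoring/surveillance/inspection",
    "procedures review", "diversity",
]

@dataclass
class CCFEventRecord:
    """Hypothetical structured description of a (potential) CCF event."""
    proximate_cause: str                       # e.g. "high humidity"
    trigger_event: str                         # event initiating the transition to the failed state
    conditioning_events: List[str] = field(default_factory=list)
    coupling_factors: List[str] = field(default_factory=list)    # design, location, procedures, ...
    deficient_defenses: List[str] = field(default_factory=list)  # drawn from DEFENSES
    components_failed: int = 1
    root_cause: Optional[str] = None           # subjective: tied to the chosen corrective action

# Example: the humidity event of Section 5.1, described at the group level.
event = CCFEventRecord(
    proximate_cause="high humidity",
    trigger_event="event leading to high humidity in the pump room",
    conditioning_events=["control cabinets not sealed properly"],
    coupling_factors=["shared location", "identical cabinet design"],
    deficient_defenses=["barriers", "quality control"],
    components_failed=2,
)
print(event.deficient_defenses)
```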
6 A PROCEDURE FOR A CCF REVIEW BASED ON DEFENSES

This section discusses the possibility of developing a procedure for performing a review of nuclear power plant design and operations that is focused on identifying potential common-cause failure mechanisms. The overall philosophy proposed is to structure the review according to a generic class of defenses against equipment unavailability, such as those identified in Section 5. These defenses can be argued to be present in some form or other at all plants. Since it is the role of defenses in inhibiting the CCFs that is of primary interest here, however, the objective of the review is to identify particular weaknesses in the application of these defenses that could allow simultaneous multiple failures.

The reason for attacking the problem from the point of view of the defenses is that it can be argued that a good defense can prevent a whole class of common-cause failures for many types of components, irrespective of the details of the failure mechanisms. Thus identification of the existence of particular strengths in the defenses can lead to increased assurance that certain types of CCFs are unlikely to occur. The identification of weaknesses leads to an identification of the type of mechanisms for which a more
detailed analysis is warranted. The first level of the review, therefore, is a high level screening analysis.

Because of the emphasis on common-cause failures, a major requirement for a review is that it should provide a means to assess the adequacy of the defensive strategy being applied at a plant, as a means of maintaining independence between redundant components with respect to the occurrence of failure causes. However, it should not be forgotten that an equally good strategy is to prevent CCFs by preventing the failure mechanisms themselves. Thus a review of the defensive strategy that is focused on the assurance of a low probability of common-cause failures cannot be divorced from one that is designed to minimize unavailability or maximize reliability of individual components.

In order to structure the review, it is necessary first to identify how each tactic can help prevent CCFs. This can be used to define a requirement, or set of requirements, for the successful implementation of that tactic. Further, it will help to define the information that will be necessary to assess the quality of the implementation. An essential part of the review process will be the establishment of methods for determining the significance of any observed weaknesses. It is at this stage that an appreciation of individual failure mechanisms becomes more important. As an example of the approach, consider the case of barriers as a defensive tactic.

6.1 A review procedure for the assessment of barriers as a defense
Definition of requirements

Barriers are effective as defenses against common-cause failures resulting from environmental, or external, agents if they:
(a) create separate environments for redundant components and therefore reduce the susceptibility of components to a single trigger event which affects the quality of the environment;
(b) shield some or all of the components from potential trigger events.

The barriers may separate redundant components, separate the trigger source from one or more components, or be local, at the component, and thus relate to the effectiveness of the component design against prevailing or abnormal environmental conditions. Barriers are primarily designed to protect against internal or external environmental disturbances which are harmful to the components. (The internal environment is the fluid for fluid systems, the electrical current for electrical systems. A barrier for the fluid systems can be the provision of a separate water supply, for
example. A fuse or protective relay acts as a barrier in an electrical system.) The specific characteristics of the barriers are different for different classes of environmental disturbance.

Review guidelines
For the purposes of the review it is initially sufficient to identify the potential types of environmental disturbances and their impact domain, which will generally depend on the type of environmental disturbance. The disturbance type can be equated to the proximate cause, which is the agent causing failure, e.g. humidity, smoke. Thus a review of the effectiveness of barriers should include identification of the type, location, and purpose of barriers, the type of disturbance to which a barrier is impermeable, the quality of its installation, and the quality of administrative controls that maintain its integrity, coupled with an identification of potential sources for trigger events for the various environmental disturbances, in terms of their location and severity. It should be noted that barriers are not the only defenses against environmental disturbances. Surveillance tests, monitoring, and preventive maintenance can also be effective against those agents whose effects result in measurable degradation of performance.

A review process for identifying potential locations of concern has been developed for dealing with fires and floods. It can be adapted to cover other types of environmental disturbance by following the steps below for each disturbance in turn:

1. Identify the location of the components of interest.
2. Identify the piece parts of the components that are susceptible to each disturbance.
3. Identify the locations of the barriers against the disturbance and divide the plant into nominally independent zones.
4. Identify potential sources of significant environmental disturbances.
5. Identify those zones that contain more than one component, or vulnerable piece parts of more than one component, and a source or sources.
6. Identify potential pathways between zones containing components or vulnerable piece parts, and/or sources, via penetrations/connections or defective barriers.

The process identifies the zones, or groups of zones, on a qualitative basis and does not require a detailed analysis of the specific failure mechanisms. This constitutes a coarse screening analysis. It may be refined further, following again the examples of fire and flood analysis.
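The six steps above can be read almost directly as an algorithm. The sketch below is a purely illustrative implementation under simplifying assumptions: zones are given directly as labels (i.e. step 3 has already been performed), pathways between zones are listed explicitly, and only one disturbance type is considered. All names are hypothetical.

```python
from collections import defaultdict

# Steps 1-4 (assumed done): locations of vulnerable piece parts of the
# redundant components, locations of disturbance sources, and pathways
# (penetrations or defective barriers) for one disturbance type.
component_zones = {"pump_A_cabinet": "zone_1",
                   "pump_B_cabinet": "zone_1",
                   "pump_C_cabinet": "zone_2"}
source_zones = {"steam_line": "zone_1"}
pathways = [("zone_1", "zone_2")]

components_by_zone = defaultdict(list)
for comp, zone in component_zones.items():
    components_by_zone[zone].append(comp)

# Step 5: zones containing vulnerable piece parts of more than one
# component together with a disturbance source.
flagged_zones = [z for z, comps in components_by_zone.items()
                 if len(comps) > 1 and z in source_zones.values()]

# Step 6: zone pairs linked by a pathway that jointly contain vulnerable
# piece parts of more than one component and a source.
flagged_pairs = []
for z1, z2 in pathways:
    comps = components_by_zone[z1] + components_by_zone[z2]
    has_source = any(s in (z1, z2) for s in source_zones.values())
    if len(comps) > 1 and has_source:
        flagged_pairs.append((z1, z2))

print("Zones needing detailed review:", flagged_zones)
print("Zone pairs needing barrier-adequacy review:", flagged_pairs)
```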
The adequacy of a design with barriers that have allowed a zone identified in step 5 to contain vulnerable piece parts of more than one redundant component and a potential source of a trigger event has to be assessed against the likelihood of the trigger occurring and affecting the component group. This type of analysis is done routinely in fire and flood risk analysis. The factors taken into account include the relative locations of the source(s) and the vulnerable component piece parts, the magnitude of the disturbance (as a function of frequency), the potential for propagation of the disturbance, and the possibility of early detection and mitigation of the disturbance; this evaluation thus depends on a more detailed assessment of failure mechanisms.

For the groups of zones identified in step 6 the primary review should be directed toward establishing the adequacy of the barrier, as its existence implies that it is believed to be necessary. The barrier has to be investigated for its design adequacy, its installation, and the adequacy of, and adherence to, the administrative controls designed to maintain the integrity of the barrier.

6.2 Summary

The example discussed above is relatively well developed conceptually, because such a procedure has been used in PRAs to model the effects of fires and floods. Development of detailed guidelines for other defenses will require substantial effort. It is relatively straightforward to define general requirements for, and general characteristics of, a good defensive strategy, which can be used as a tool for screening analysis, but developing guidelines for a more detailed analysis is more complicated. For example, setting criteria for the measurement of the quality of training or the clarity of procedures is not a trivial task. In addition, the qualities of the defenses are not independent. For example, the quality of the preventive maintenance program can be influenced by that of the ISI program, and the quality of the training program by that of the procedures review. It is clear that the establishment of the guidelines and their application in a review will be multi-disciplinary and involve design engineers, operating staff, and human factors specialists.

Nevertheless, the idea of basing a review on the analysis of the quality of the defensive strategy seems to be an approach that has considerable merit, particularly as a complement to a review against historically occurring events. It is of particular value because it approaches the problem from a different direction, thus increasing the scope of the review. The increased appreciation of the role of defenses provides input to the cause-defense matrices introduced in Ref. 10, and is also important for improving the quantification of CCF probabilities.
7 SUMMARY

In this paper some of the limitations in applying the procedure for quantitative common-cause failure analysis, presented in Ref. 1, have been described. The principal difficulties arise from the quality and quantity of data, and the lack of guidance for the interpretation of event descriptions. This forces the analyst into making assumptions when analyzing data, and since different analysts may have different biases, comparisons of analyses performed by different analysts are difficult. Furthermore, the data set on which the quantitative analysis is based will in general represent only a subset of those failure mechanisms that are possible.

In response to these concerns, a proposal is made to supplement the quantitative analysis with a qualitative approach, based on a review of the adequacy of the application of a generic set of defenses to the prevention of common-cause failures. This combined approach would increase the completeness of the search for potential common-cause mechanisms, as well as providing a means to identify corrective actions.

In order to set up such a review it is necessary to have a comprehensive picture of the types of common-cause failure mechanisms that can arise. This can be achieved by considering some general characteristics of CCF mechanisms based on a review of those that have occurred. Some concepts that are helpful in describing these characteristics have been introduced. Their inclusion in event reports would greatly improve the quality of the data base.

The full arguments supporting a defense-based review have not been developed here. They appear in NUREG/CR-5460.¹⁴ The purpose of this paper is to introduce the concept and to propose the development of more comprehensive qualitative tools, which are a necessary prerequisite for developing more detailed quantitative models. The search for potential CCF mechanisms using a review of the adequacy of defenses provides a way of approaching the problem from a different viewpoint and increases the chance of identification of significant failure mechanisms.
ACKNOWLEDGMENTS

Many of the ideas discussed in Sections 5 and 6 were developed while performing work for Sandia National Laboratories under the guidance of Allen Camp and Don Mitchell, for the dependent failures program being carried out for USNRC under FIN No. A1384. The author has especially benefited from discussions with Henrique Paula, Dale Rasmuson, and Ali Mosleh, and comments from D. Campbell.
REFERENCES
1. Mosleh, A., Fleming, K. N., Parry, G. W., Paula, H. M., Worledge, D. H. & Rasmuson, D. M., Procedures for Treating Common Cause Failures in Safety and Reliability Studies, Vol. 1, Procedural Framework and Examples, NUREG/CR-4780, EPRI NP-5613, Electric Power Research Institute, January 1988; Vol. 2, Analytical Background and Techniques, Electric Power Research Institute, January 1989.
2. Dörre, P., Basic aspects of stochastic reliability analysis for redundancy systems. Reliability Engineering and System Safety, 26 (1989) 251-375.
3. PRA Procedures Guide. NUREG/CR-2300, USNRC, Washington, DC, January 1983.
4. Apostolakis, G. & Moieni, P., The foundation of models of dependence in probabilistic safety assessment. Reliability Engineering, 18 (1987) 177-95.
5. Parry, G. W., Incompleteness in data bases: impact on parameter estimation uncertainty. Paper presented at the Annual Meeting of the Society for Risk Analysis, Knoxville, Tenn., 15 September-3 October 1984.
6. Paula, H. M., A probabilistic dependent-failure analysis of a DC electric power system in a nuclear power plant. Nuclear Safety, 29(2) (1988) 196-208.
7. Parry, G. W., Comments on basic aspects of stochastic reliability analysis for redundancy systems. Reliability Engineering, 24 (1989) 377-85.
8. United States Nuclear Regulatory Commission, Individual plant examination for severe accident vulnerabilities--10CFR50.54(f) (generic letter 88-20), February 1988.
9. Fleming, K. N., Mosleh, A. & Acey, D. L., Classification and analysis of reactor operating experience involving dependent events. Report EPRI NP-3967, Electric Power Research Institute, June 1985.
10. Parry, G. W., Paula, H. M., Mitchell, D. B., Whitehead, D. W. & Rasmuson, D. M., A cause-coupling-defence approach to common cause failures. Presented at PSA'89, Pittsburgh, Pa, 2-7 April 1989.
11. Lofgren, E. V., Rothleder, B. M. & Karimium, S., Guidelines for using reliability programs to defend against common-cause failures (draft). June 1988.
12. Smith, A. M., Mott, J. E. & Crellin, G. L., Defensive Strategies for Reducing Susceptibility to Common-Cause Failures, Vol. 1, Defensive Strategies. EPRI NP-5777, June 1988.
13. Gano, D. L., Root cause and how to find it. Nuclear News, 30(10) (August 1982) 39-43.
14. Paula, H. M. & Parry, G. W., A cause-defense approach to the understanding and analysis of common cause failures. NUREG/CR-5460, USNRC, Washington, DC, February 1990.