Reliability Engineering and System Safety 45 (1994) 201-204. 1994 Elsevier Science Limited. Printed in Northern Ireland. 0951-8320/94/$7.00
Occurrence of common mode failure
H. Buchner*
Siemens AG, KWU NDS4, PO Box 32 20, D-91050 Erlangen, Germany
In technical facilities, for example in nuclear power plants, redundant systems are used to prevent random failures from disabling the complete system function. However, although this redundancy concept is adequate to cope with random failures in single redundancies, its applicability is limited in the case of multiple failures due to a systematic failure cause to which all redundancies are subject because of their identical features. Some general considerations have been formulated to rule out the occurrence of such common mode failure (CMF) in redundant systems under certain circumstances. CMF means that the systematic failure cause is activated in more than one redundancy at the same time, or within the same frame of time (e.g. during the mission time for an accident). A distinction therefore has to be made between the systematic cause and the actual occurrence of CMF: a latently existing systematic cause does not necessarily lead to simultaneous failure; it must be activated and is therefore only the prerequisite for CMF. A systematic cause results in simultaneous failure if
--the systematic cause is activated by specific circumstances associated with the accident: a triggering effect to which the redundancies are subjected because of their identical features, or
--previous failures have accumulated undetected before the accident and now appear on demand due to the accident.
For exclusion of CMF, both exclusion of a triggering effect and exclusion of accumulation are necessary. A trigger can be excluded if the components are not affected by the accident at all, or are not subjected to any 'abnormal' operation. Accumulation can be ruled out by self-annunciation. From this a matrix for excluding CMF-susceptibility has been derived.
1 INTRODUCTION
In technical facilities such as nuclear power plants, redundant systems are used to prevent random failures from disabling the complete system function: if one redundancy loses its function, the mission can be fulfilled by the other redundancy(ies). Independent random failures in more than one redundancy could disable the whole system function, but are of low probability. However, although this redundancy concept is adequate to cope with random failures in single redundancies, its applicability is limited in the case of multiple failures due to a systematic failure cause to which all redundancies are subject because of their identical features. Some general considerations have been formulated to rule out the occurrence of such common mode failure (CMF) in redundant systems. They are based on the definition of CMF, as compiled e.g. in Ref. 1, and exploit that definition in order to draw the opposite conclusion: under which circumstances CMF will not occur. These considerations are to be seen as a supplement to the CMF-defense strategies presented e.g. by Paula et al. (Ref. 2) and in Ref. 3.
2 EXPRESSION 'COMMON MODE FAILURE'
Possible definitions of CMF have been compiled in Ref. 1. Based on these definitions, the following understanding of CMF is used for this work: common mode failure means the failure of more than one component fulfilling similar tasks
--for the same reason (otherwise it would not be CMF, but multiple random failure). This means that a systematic cause exists within the component itself or within the system environment. Because it exists latently, failure due to that systematic cause could occur at any time, and
--simultaneously (on demand due to an accident, or within the same frame of time, e.g. during the mission time necessary for an accident).
* Present address: P + C Projekt Consult GmbH, Allersberger Str. 61, D-90461 Nürnberg, Germany.
However, common mode failure does not include
--secondary failure (failure as a consequence of e.g. rupture of another component),
--commanded failure (failure of intact components as a consequence of common resources, or as a consequence of interconnections due to the process).
Failure of that kind has to be modelled explicitly in the fault tree.
Thus, the latently existing systematic cause must be activated in more than one component and at the same time. If the systematic cause is not activated simultaneously, the failures due to it are distributed over time, possibly resulting in an enhanced failure rate. Random failures, on the other hand, have no reason to occur at exactly the same time; the probability that they do so is very low. This again shows that CMF can only be due to a systematic cause.
The systematic cause is therefore a prerequisite for CMF. Existence of a systematic cause means that there is potential for CMF, but does not necessarily mean that CMF will actually occur: the systematic cause must be activated simultaneously in more than one component to result in CMF. Otherwise a systematic cause may result in an enhanced failure rate, but the failures will not occur at exactly the same time. 'Systematic cause' and 'CMF' are therefore not synonymous: CMF means systematic cause plus simultaneous occurrence. There must be a coupling effect, acting simultaneously on more than one of the components that bear the systematic cause.
Simultaneity requires either
(i) a trigger, i.e. the components are affected by the accident; something 'happens' to the components due to the accident, e.g. a change of operation conditions or a shock. A trigger causes components that were faultless up to then to lose their function on demand. A trigger can be considered as a lethal/non-lethal shock to which all/some of the components are subjected; or
(ii) accumulation, i.e. failures that accumulated undetected beforehand become apparent on demand. The undetected loss of function occurred prior to the demand, e.g. due to aging, but is detected only on demand due to the accident.
For exclusion of CMF occurrence, both exclusion of a trigger and exclusion of accumulation are necessary.
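The remark above that simultaneous random failures are of very low probability can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: it assumes a constant failure rate (exponential model) and independent redundancies, and the rate and mission time are purely illustrative values.

```python
import math


def p_fail_within(rate_per_hour: float, mission_hours: float) -> float:
    """Probability that one component fails within the mission time.

    Assumes a constant failure rate (exponential model); this model is an
    assumption of the sketch, not something stated in the paper.
    """
    return -math.expm1(-rate_per_hour * mission_hours)


def p_all_fail_independently(rate_per_hour: float, mission_hours: float,
                             n_redundancies: int) -> float:
    """Probability that all n redundancies fail within the mission time,
    assuming the failures are random and mutually independent."""
    return p_fail_within(rate_per_hour, mission_hours) ** n_redundancies


# Illustrative (hypothetical) numbers: 1e-4 failures per hour, 24 h mission.
RATE = 1e-4
MISSION = 24.0

single = p_fail_within(RATE, MISSION)
double = p_all_fail_independently(RATE, MISSION, 2)

print(f"one redundancy fails:   {single:.3e}")
print(f"both redundancies fail: {double:.3e}")
```

Under these assumptions the probability for two redundancies is the square of the single-redundancy value, which is why coincident random failures are negligible compared with a systematic cause that is activated in all redundancies at once.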
2.1 Exclusion of trigger
A trigger resulting from an accident can be excluded if:
--the accident does not affect the component at all: the component 'sees' nothing from the accident. This means that there is no change of state or of operation conditions; the component does exactly the same before and after the accident has occurred, or
--there is a change of state (e.g. cut-in of a system or of a switch) or a change of operation condition (e.g. a different flow rate), but there is no abnormal operation, i.e. no difference between operation under accident conditions and under normal conditions. The available operational experience then covers both operation under normal conditions and operation under accident conditions. Under accident conditions the components therefore have no reason to behave differently than under normal operation conditions: the components do not 'know' that this is a demand due to an accident; they remain within the same operational range for which operational experience exists from normal operation.
However, it should be noted that tests are not necessarily identical to demands under accident conditions.
In this context normal operation does not mean 'within design', but 'within usual operational range'. If the component is operated during the accident within the same operational range as during normal operation, it can be inferred that it will also work under accident conditions. If, however, the component is operated differently under accident conditions than under normal conditions, this may be a triggering effect, even if it is operated within the design basis.
Examples
A car is designed to run at 200 km/h. Due to traffic restrictions it is only driven at up to 130 km/h. Operational experience therefore covers only the range up to 130 km/h, although the car is designed for 200 km/h, and it is not certain that it will really reach 200 km/h when demanded.
A stand-by pump is periodically tested only at 10% flow rate. It is therefore not certain whether it will really reach 100% flow rate when demanded, or whether it will operate properly under 100% conditions. The same example applies e.g. to diesel generators.
In the above examples there is experience over a certain range and over certain functions, but it does not cover the complete design basis.
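The two examples above reduce to one check: does the operating point demanded in the accident lie inside the range covered by operational experience? A minimal sketch of that check follows. The class and function names are hypothetical, and the lower bound of 0 km/h for the car is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingRange:
    """Range covered by operational experience (hypothetical helper type)."""
    low: float
    high: float

    def covers(self, value: float) -> bool:
        return self.low <= value <= self.high


def demand_is_covered(experience: OperatingRange, demand: float) -> bool:
    """True if the demanded operating point lies within the experienced range,
    i.e. the demand is 'within usual operational range', not merely 'within design'."""
    return experience.covers(demand)


# Car: designed for 200 km/h, but only driven at up to 130 km/h.
car_experience = OperatingRange(low=0.0, high=130.0)
car_trigger_possible = not demand_is_covered(car_experience, 200.0)

# Stand-by pump: tested only at 10% flow rate, demanded at 100%.
pump_experience = OperatingRange(low=10.0, high=10.0)
pump_trigger_possible = not demand_is_covered(pump_experience, 100.0)

print("car: trigger cannot be excluded ->", car_trigger_possible)
print("pump: trigger cannot be excluded ->", pump_trigger_possible)
```

In both cases the demand lies within the design basis but outside the experienced range, so a triggering effect cannot be excluded.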
Additional aspect: trigger independent of an accident
If there is no triggering effect, the components have no reason to behave differently under accident conditions than under normal operation conditions. On the other hand, if redundant components are already liable to simultaneous failure under normal operation conditions, such failure can also occur under accident conditions. As an example, simultaneous failure of cooling water pumps due to pollution of the river (e.g. hay on the river) or due to the sudden appearance of a large number of fish may be taken. Such additional events must occur during the mission time of the accident, but they are in principle independent of the accident. Their coincidence with an accident is therefore not very probable, but may be considered as a lower limit for CMF occurrence. Such 'additional' events, however, are not considered in this approach: only CMF due to the accident is treated here. They can be regarded as triggering effects which are independent of the accident and may therefore occur at any time. If necessary, such effects should be superimposed on this approach.
2.2 Exclusion of accumulation
Accumulation of undetected failures can be excluded by self-annunciation, i.e. failures are monitored during normal operation. The state of the systems/components is thus known at any time from operation before the accident: there is no time for failures to accumulate unrecognized. If a failure occurs, it is recognized immediately and countermeasures are initiated at once: repair of the failed component or switch-over to an intact stand-by system. Self-annunciation is given if
--prior to the accident the system is already in the same operation state as under accident conditions: e.g. systems already running prior to the accident, where a failure would result in switch-over to the respective stand-by system, or components like switches/valves which are already in the 'accident' position prior to the accident. Loss of function would therefore be recognized immediately, before the accident, resulting in switch-over to an intact train.
--individual components are frequently operated during normal operation (e.g. switching of relays, or control devices such as the turbine control valve). A failure would be noticed after a short time, on the occasion of the next demand.
--a large number of identical components exists and is operated under similar circumstances: a systematic cause would be self-reporting via the observation of that large number of identical components during normal plant operation. In this case it is not individual components that are monitored, but a collective of many identically operated components.
For a collective, the large number of operated components provides extensive operational experience, indicating that there are no problems with the components in general. A systematic cause would result in an enhanced failure rate within the collective during normal operation and would thus lead to countermeasures. From the behaviour of the collective it can therefore be concluded that a systematic cause, if any, does not lead to failures under normal operation conditions.
However, a collective of identically operated components is characterized by both type-identical components and similar operation conditions: 'identical components' means not only type-identical, but identical within the complete 'system environment', e.g. same operation history, similar operation frequency, same maintenance, same ambient conditions etc. The collective therefore cannot be extended, e.g. to type-identical components which are not operated during power state but only under shutdown conditions, because the time during which they are not operated is not monitored and failures can accumulate undetected during that time.
In summary, in operational systems demands on a collective occur 'always', for operational reasons. A systematic failure cause under the respective operation conditions would therefore result in an enhanced failure rate within the collective, and would be recognised before the accident during normal plant operation. Hence only individual components may fail randomly on demand due to an accident, but not all components together due to a systematic cause, provided no triggering effect exists.
Component not affected by the accident (e.g. no change of state or operation condition)
OR
no abnormal operation (no difference between operation under normal condition and under accident condition)
==> no trigger
no change of state (e.g. no cut-in or cut-off)
OR
frequent operation under normal conditions
OR
collective of identical operated components
==> no accumulation
no trigger AND no accumulation --> no CMF susceptibility
Fig. 1. Matrix for excluding CMF-susceptibility.
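The exclusion logic of Sections 2.1 and 2.2, as collected in the matrix of Fig. 1, is a Boolean combination of five criteria. It can be sketched as follows. This is only an illustration: the field and function names are hypothetical, and judging each criterion for a real component remains an engineering assessment that the code does not replace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentAssessment:
    """Engineering judgement per component (field names are hypothetical)."""
    # Trigger criteria (Section 2.1)
    not_affected_by_accident: bool    # no change of state or operation condition
    no_abnormal_operation: bool       # accident operation equals normal operation
    # Accumulation criteria (Section 2.2)
    no_change_of_state: bool          # already in the state demanded by the accident
    frequent_normal_operation: bool   # component is frequently operated
    in_monitored_collective: bool     # member of a collective of identically operated components


def no_trigger(a: ComponentAssessment) -> bool:
    return a.not_affected_by_accident or a.no_abnormal_operation


def no_accumulation(a: ComponentAssessment) -> bool:
    return (a.no_change_of_state
            or a.frequent_normal_operation
            or a.in_monitored_collective)


def cmf_excluded(a: ComponentAssessment) -> bool:
    """CMF susceptibility is excluded only if BOTH a trigger and accumulation are excluded."""
    return no_trigger(a) and no_accumulation(a)


# Pump already running before the accident, operation unchanged by the accident.
running_pump = ComponentAssessment(True, True, True, False, False)
# Stand-by pump started on demand and operated outside its tested range.
standby_pump = ComponentAssessment(False, False, False, False, False)
# Frequently switched relay that sees abnormal conditions in the accident.
stressed_relay = ComponentAssessment(False, False, False, True, False)

for name, a in [("running pump", running_pump),
                ("stand-by pump", standby_pump),
                ("stressed relay", stressed_relay)]:
    print(f"{name}: CMF excluded = {cmf_excluded(a)}")
```

The relay case shows why both conditions are needed: accumulation is excluded by frequent operation, but a possible trigger remains, so CMF susceptibility cannot be ruled out.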
2.3 Exclusion of CMF-occurrence
Exclusion of triggering effect and of accumulation is summarized in Fig. 1. If no trigger exists and no accumulation is possible, then there is no reason for CMF to occur. A trigger can be excluded either if the components are not affected by the accident or if there is no abnormal operation due to the accident. Accumulation can be excluded if the systems are already in the state demanded by the accident (e.g. pumps already running before the accident, switchgear in accident position), or if individual components are operated frequently (e.g. relays), or if a collective of identical components exists from which, due to the frequent operation of the collective as a whole, it can be concluded that the components in general are intact and that a systematic cause, if any, does not lead to failure under normal operation conditions (i.e. no trigger).
REFERENCES
1. Smith, A. M. & Watson, I. A., Common Cause Failures--A Dilemma in Perspective. RAMS, San Francisco, USA, 1980.
2. Paula, H. M., Campbell, D. J. & Rasmuson, D. M., Qualitative cause-defense matrices: Engineering tools to support the analysis and prevention of common cause failures. Reliability Engineering and System Safety, 34 (1991) 389-415.
3. Bourne, A. J., Edwards, G. T., Hunns, D. M., Poulter, D. R. & Watson, I. A., Defences against common-mode failures in redundancy systems. A guide for management, designers and operators. SRD R 196 (UKAEA), Jan. 1981.