Reliability Engineering and System Safety (2015)
Common cause failures in safety-instrumented systems: Using field experience from the petroleum industry

S. Hauge (a), P. Hokstad (a), S. Håbrekke (a), M.A. Lundteigen (b)

(a) SINTEF Technology and Society, Safety Research, Trondheim, Norway
(b) The Norwegian University of Science and Technology (NTNU), Norway

Keywords: Common cause failures (CCFs); Safety instrumented systems (SIS); Safety integrity level (SIL); Reliability data; Data collection; Field experience
Abstract: Safety instrumented systems often employ redundancy to enhance reliability, but the intended effect may be reduced when common cause failures are taken into account. It is often assumed that a certain fraction of component failures will occur close in time, due to a shared cause. Unfortunately, few attempts have been made to systematically investigate field experience on common cause failures, with the exception of the nuclear industry, which has been in the forefront of research in this area. This paper presents selected results from a research project carried out in the Norwegian oil and gas industry to collect and analyze reported failures. This includes the presentation and derivation of generic (i.e. industry average) values of beta-factors for typical components in the oil and gas industry, and the demonstration of how failure data may be used to construct checklists for updating the value of beta in operation. The results are based on a review of some 12,000 maintenance notifications from six different onshore and offshore petroleum facilities. It is found that the new beta-values are higher than what is seen in many data sources, and some possible explanations are discussed. © 2015 Elsevier Ltd. All rights reserved.
1. Introduction

Many of the safety barriers that control the risk in hazardous process industries are implemented by safety instrumented systems (SIS). A SIS is designed to bring the process or protected system to a safe state in response to critical events at the facility. Examples of such events are process upsets, releases of hazardous materials, and fires. The SIS is usually split into three main subsystems: initiating elements, such as sensors and push buttons; a logic solver, such as a programmable logic controller (PLC); and actuating devices, such as valves and circuit breakers. Redundancy is often introduced to enhance reliability, but this positive effect may be reduced if components are prone to the same (shared) cause of failure. Such failures, often referred to as common cause failures (CCFs), may result in a major disabling or complete loss of safety instrumented functions (SIFs). An important part of SIS management is therefore to assess and implement measures to reduce the influence of CCFs on reliability.

CCFs have received considerable attention over several decades, but the main focus has been on the development of models rather than on collecting data to support the models; see e.g. Hokstad and Rausand [10] for an overview. Reliability modeling of CCFs was
Corresponding author. E-mail address: [email protected] (M.A. Lundteigen).
introduced in the nuclear industry about 40 years ago, and early results were presented e.g. in NUREG-75/014 [24] and Edwards and Watson [2]. The Three Mile Island accident in 1979 (caused by CCFs) resulted in a further advancement of this work, and a number of papers [1,2,12,19,21,30,31,33,36,37] and reports [26,27] were published. The aviation industry has also paid close attention to CCFs, and more recently the standard IEC 61508 [14] has pointed out the importance of controlling these failures in order to maintain the integrity of SIFs.

The most widely adopted model is the standard beta-factor model [6,15,17,26], with the parameter β (also referred to as the beta-factor, or just beta) defined as the fraction of a component's failure rate that represents CCFs. A crucial assumption in this model is that when there is a CCF, all components of the specified CCF group (i.e. a group of similar components for which a CCF event can be registered) will fail. The PDS method [6] uses a variant of this approach called the multiple beta-factor model [9], in which the multiplicity of the CCF (i.e. the number of components affected) is also explicitly treated. In the standard beta-factor model the CCFs are implicitly modeled, meaning that a collection of shared causes is catered for by the beta-factor. Explicit modeling of CCFs is also relevant and should be used when sufficient information is available [21,26]; in this case explicit CCF causes are identified and included in the system failure model. The focus of the present
http://dx.doi.org/10.1016/j.ress.2015.09.018 0951-8320/© 2015 Elsevier Ltd. All rights reserved.
Please cite this article as: Hauge S, et al. Common cause failures in safety-instrumented systems: Using field experience from the petroleum industry. Reliability Engineering and System Safety (2015), http://dx.doi.org/10.1016/j.ress.2015.09.018i
paper is on implicit modeling, considering estimation of the beta-factor. However, common components (such as shared utility systems, power supply, and common logic) and common external events (including fire, flooding and earthquakes) should be modeled explicitly and not be included in the beta-factor modeling. Thus, when estimating new values of beta, the contribution from these types of events has been excluded.

It is often assumed that CCFs account for 1–10% of a component's failure rate; see e.g. the range of values given in the checklists for determining beta-factors proposed in IEC 61508 [14]. Earlier versions of such checklists, such as the one in Humphreys [12], even suggested 30% as the maximum value of the beta-factor. These checklists are primarily based on expert judgments, and few attempts have been made to link them to operational data.

The nuclear industry is the only industry sector that has run a major project on collecting CCF data. The results were published in several open reports [22,23,35], but these data are not necessarily applicable to other industries. The occurrence of CCFs is highly influenced by local conditions [34], and experience from nuclear plants is not necessarily transferable to other sectors, due to differences in design and engineering practices, environmental exposure, and the way operation and maintenance are organized and managed. The most extensive data sources for the oil and gas industry, the OREDA database and the OREDA handbooks [28,29], only mention CCFs in relation to fire and gas detectors, and the data are also rather old. Hauge et al. [7] carried out a survey in 2005–2006 of CCF experience among manufacturers and oil companies, but the study was of a qualitative nature. The PDS data handbook [5] proposes values of beta for typical SIS equipment in the oil and gas industry, but these are mainly based on expert judgments and only to a minor extent on reported failures.
The lack of detailed insight into historical CCF data in the oil and gas industry led to a research project initiative by SINTEF, with broad participation from the industry through the PDS forum. The project, with main funding from the Norwegian Research Council, carried out operational reviews for six oil and gas facilities; in total, some 12,000 notifications for reported failures were reviewed. All failures were identified, further analyzed, and classified into the failure categories used in IEC 61508 [14] and IEC 61511 [15]. Three main objectives were formulated for the project: (1) gain a deeper understanding of common causes of dangerous undetected (DU) failures and the context in which they occur, (2) update generic values of beta for typical SIS components, to reflect an industry average based on field experience, and (3) develop new equipment-specific checklists that may be used to adjust generic values with conditions and experience relevant for a specific facility.

Checklists are already regarded as good engineering practice for determining beta-factors in many of the key standards for SIS, such as IEC 61508 [14] and IEC 62061 [16]. In addition, the checklists are useful in pointing to specific measures to reduce the likelihood of CCFs. However, the checklists provided in the standards are not well explained (in terms of underlying assumptions), and they are also too general and too design-related to fully capture the effects of local, operational impacts and the variations between component types.

This paper describes the main results of this research project. It refers to initial results presented at the ESREL conference in 2014 [4] and to the final report of the PDS research project [3].
The remaining part of the paper is organized as follows: Chapter 2 reviews some definitions related to CCF and specifies how the relevant concepts are used in the present paper. Chapter 3 discusses the estimation of the beta-factor, focusing on the NUREG estimators and the PDS estimator. Chapter 4 describes the operational reviews that have been carried out to collect CCF data, and presents new suggested beta-estimates based on the reviews. Chapter 5 discusses the equipment-specific checklists being developed, and provides the checklist for shutdown valves. Finally, some concluding remarks are given in Chapter 6.

(Footnote 1: The PDS forum is a co-operation between 26 participating companies, including oil companies, drilling contractors, engineering companies, consultants, safety system manufacturers and researchers, with a special interest in SISs; see www.sintef.no/pds.)
2. Definitions, interpretations, and practical challenges

It was recognized early in the project that a precise definition of CCFs was needed to support the operational reviews. Considerable attention was therefore devoted to the foundation for CCF modeling, which was also discussed with the PDS forum participants. The following sub-sections summarize some of the reflections and discussions, based on experience from the operational reviews and on the feedback received from the PDS forum participants.

2.1. CCF related terms and definitions

Smith and Watson [36] conducted a rather detailed survey of CCF definitions already in 1980, and concluded that a CCF has the following characteristics: (1) the components affected are unable to perform as required, (2) multiple failures exist within (but are not limited to) redundant configurations, (3) the failures are "first in line" failures and not the result of cascading failures (i.e. where the failure of one component has triggered the failure of another), (4) the failures occur within a defined critical time period (e.g., the time a plane is in the air during a flight, or the time between two tests), (5) the failures are due to a single underlying defect or physical phenomenon (the common cause of the failures), and (6) the effect of the failures must be some major disabling of the system's ability to perform as required.

This definition is often regarded as rather exhaustive, but more recent research suggests some clarifications. In particular, it has been proposed, e.g. in [20,32], that it may be reasonable to also add human errors to the common causes in condition (5) of Smith and Watson [36]. The generic standard on SIS, IEC 61508 [14], defines a CCF as a "failure, that is the result of one or more events, causing concurrent failures of two or more separate channels in a multiple channel system, leading to system failure".
Unfortunately, it is not straightforward to apply this definition directly [3]:

(1) The requirement that a CCF shall lead to a system failure is not considered very appropriate. If two components in a 2oo4 voting configuration fail due to a common cause, this should in our opinion be considered a CCF event (being a multiple failure and a major disabling of the function), even if the system is still functioning (e.g. in a 2oo2 mode).

(2) The distinction between a CCF and a CCF event is unclear in the IEC 61508 definition, cf. the statement quoted above ("… failure that is the result of one or more events"). A more consistent wording would in our opinion be that a CCF event is an event where two or more components fail simultaneously due to a common cause.

(3) A limitation of the CCF definition in IEC 61508 is the focus on multi-channel systems. The concept of CCF has a wider application area, and a CCF may also involve multiple failures of several single-channel systems. This aspect is recognized in other and more recent definitions of CCF in ISO/TR 12489 [18] and IEC 60050-192 [13], which are more in line with the interpretation of CCF in this paper.
Based on the definitions of CCF, we identified the need to clarify our use of the following terms:

Common cause: the event or condition that explains why the CCF occurred.
Remark: Causes of CCFs may be split into three main levels. The first level is the direct failure cause that triggered the failure; this may be related to a rupture, breakage, or loss of signal. The second level is the underlying cause that explains why the failure was triggered. At the third level, the root cause is the most basic explanation for the component failure, which, if corrected, would prevent recurrence. Parry [30] and Paula et al. [31] presented another classification by splitting the shared cause into two explanatory factors, root causes and coupling factors, an approach that has been adopted in many CCF analyses [20,26,27]. The coupling factors explain why the failures may be regarded as dependent rather than independent.

CCF: a component failure where conditions (1)–(5) of Smith and Watson [36] are fulfilled, but with the extension that condition (5) can include human errors. This interpretation should be in line with the definition in e.g. IEC 60050-192 [13].

CCF event: a failure event where at least two components fail due to a common cause (i.e. each having a CCF as defined above).
Remark: A CCF event is not restricted to events where all the failed components belong to the same safety function. Our experience indicates that the shared cause of failure relates to the type of components, their location, or the way they are operated and maintained, and this may go beyond the boundaries of a single function.

Complete CCF event: a set of failures where all six conditions of Smith and Watson [36] are satisfied. This means that not all CCF events are complete CCF events.

Potential CCF: an observed component failure where it is judged likely that the same type of failure, due to the same cause, could be replicated for similar components within a relatively short time window, e.g. a test interval.
Remark: Potential CCFs from the operational reviews have been excluded from the estimation of beta, as their inclusion would give too conservative results; if they were to be included, they could not be given the same weight as complete CCFs. The main rationale for identifying potential CCFs is to make a more open-minded prediction of what can happen in the future. Potential CCFs may capture a wider range of experienced vulnerabilities in the design or its operating environment, and they have been found useful as input to the development of the new proposed checklists.

CCF group: a collection of identical/similar components for which a CCF event can be registered.
Remark: A CCF group is a group of components for which the independence assumption is suspected not to hold. A CCF group is usually significantly smaller than the total population of comparable components on the facility; it can for example be the set of IR gas detectors located in a certain area.
2.2. Causes of CCFs

The causes of CCF may be randomly occurring events or systematic failures that are replicated across several items of equipment. IEC 61508 [14] and ISO/TR 12489 [18] define systematic and random hardware failures slightly differently: IEC 61508 restricts random hardware failures to natural degradation, while ISO/TR 12489 recognizes that random hardware failures will to some extent also include other random events, like human errors. The PDS method [6] defines human errors as systematic failures (as is done in IEC 61508), but suggests (unlike IEC 61508) that these and other systematic failures are to be reflected in the component failure rates. The operational reviews indicate that most of the CCFs (and also several independent failures) are systematic failures and human errors, and they therefore support the position of ISO/TR 12489 and the PDS method that such failures should be included in the reliability parameters.

For the purpose of identifying CCFs it is not essential whether a failure is classified as systematic or not, as long as it is agreed that random failures, human errors, and other recurring failures are to be included in the estimation of CCFs (implicitly or explicitly). For instance, consider a pressure transmitter failing to respond to high pressure due to a calibration error. This could become a CCF if the same erroneous calibration is made for other components as well: either the procedure is erroneous (systematic failure), or the procedure is right and the miscalibration is due to a misconception of the operator (systematic), or it could be a random human error (random). For other failure modes the relevant CCF causes could be external stresses like corrosion (systematic) or fire and flooding (random events). Thus, possible disagreement on which failures are systematic or not is of little relevance in this context and will not be discussed further in this paper.

2.3. Practical challenges for the estimation of the CCF fraction

The starting point for the operational reviews was a list of all reported failures (notifications) registered over a period of 2–3 years. It was assumed that all components had undergone at least one proof test. The identification of CCFs was done by detailed assessment of the notifications. Unfortunately, many notifications had a limited description of the failure cause; sometimes the notification only included the statement that "the component has failed".
In other cases, a failed component was simply replaced by a new one without further investigation. The operators and maintenance personnel are often focused on restoring the system as soon as possible, and may not always have the resources to carry out an analysis of the failure cause. The manufacturer could have supplemented the notification with more information about the failure cause, but this information is seldom added. In the operational reviews, we often saw that, for example, failed detectors were simply replaced by a similar type without further analysis. Stuck valves were sometimes lubricated before completing the proof test, without further analysis of why the valves experienced this type of problem.

The same oil company may operate several facilities, and provide statistics and failure reports for all of these facilities. Data from all facilities are combined in order to provide estimates for the generic CCF rates. The data will also be relevant for studying possible triggers, root causes and coupling factors for various CCFs. And when investigating how the occurrence of CCFs depends on local conditions, such as work practices, operation and maintenance, environmental exposure, and design solutions, it is also essential to utilize facility-specific information.

Estimating the beta of the beta-factor model is not straightforward. The estimate is very sensitive to the maximum number of components assumed to take part in a CCF event. The term CCF group is used to define a collection of similar components (often located in the vicinity of each other) for which a CCF event can be observed. A CCF group may consist of identical components within the same system, but this is not a requirement, as the CCF group might also be defined to include several systems. In practice, it can be a challenge that failure data are sometimes not sufficiently detailed to decide whether the components of a multiple failure event belong to the same CCF group.
A problem related to the standard beta-factor model is that it assumes that all components of a CCF group fail when a CCF event occurs: either there are independent failure(s), or there is a CCF event in which all components of the CCF group fail. If the CCF group has, for instance, n = 4 components, we will according to the beta-factor model never observe a CCF event with 2 or 3 failed components. In practice we will, however, get data where such events do occur, and this causes problems in the estimation of the beta-factor; cf. Chapter 3.
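The split of the failure rate implied by the standard beta-factor model can be illustrated with a short sketch. The formulas below are the usual low-demand approximations for the average probability of failure on demand (PFD) of a 1oo2 group; the numerical values are illustrative only and are not taken from this paper:

```python
# Standard beta-factor model: a component's dangerous undetected (DU)
# failure rate lambda_du is split into an independent part (1 - beta)
# and a common cause part beta that is assumed to fail ALL channels
# of the CCF group simultaneously.

def pfd_avg_1oo2(lambda_du: float, beta: float, tau: float) -> float:
    """Approximate average PFD of a 1oo2 group, proof-test interval tau.

    Common low-demand approximations:
      independent double failure: (lambda_i * tau)**2 / 3
      common cause failure:        lambda_c * tau / 2
    """
    lambda_i = (1 - beta) * lambda_du  # independent part of the rate
    lambda_c = beta * lambda_du        # common cause part of the rate
    return (lambda_i * tau) ** 2 / 3 + lambda_c * tau / 2

# Illustrative values: lambda_du = 2e-6 per hour, annual proof test.
tau = 8760.0
print(f"{pfd_avg_1oo2(2e-6, 0.00, tau):.2e}")  # no CCF contribution
print(f"{pfd_avg_1oo2(2e-6, 0.10, tau):.2e}")  # beta = 10%
```

With these illustrative numbers, the common cause term is roughly an order of magnitude larger than the independent term, which shows why the value chosen for beta dominates the reliability of redundant configurations.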
3. Determining the beta-factor

As indicated above, estimation of beta is problematic due to the very restrictive assumptions of the beta-factor model. The commonly used estimators for beta, here referred to as the NUREG estimators, therefore have some limitations, and a new estimator has been suggested in the PDS project. The various estimators are described in the following sections.
3.1. NUREG estimators

Let N_DU be the total number of DU failures experienced for a homogeneous population of components observed over a specified period of time. Further, let N_DU,CCF be the total number of DU failures included in all CCF events that occurred in the same period (i.e. N_DU,CCF components experienced a CCF), without considering whether the components belong to the same CCF group or not. A well-recognized estimator, β̂1, in NUREG/CR-4780 [25] is:

β̂1 = N_DU,CCF / N_DU   (1)

When considering a large population (involving more than one CCF group), this approach gives a rather high potential for having a CCF, so this estimator will often give conservative results. The same report [25] therefore suggests an alternative (less conservative) estimator:

β̂2 = 2·N_CCF / (N_DU,I + 2·N_CCF)   (2)

Here, N_CCF is the number of observed CCF events (regardless of the number of failed components in each CCF event). Further, N_DU,I equals the number of independent DU failures, so that N_DU = N_DU,I + N_DU,CCF. The estimator (2) assumes that each CCF event results in failures of exactly two components; i.e., N_DU,CCF in Eq. (1) is replaced by 2·N_CCF.

We should note that neither of the two NUREG estimators requires that a CCF event affect components of the same safety function. When components fail due to the same cause, this is considered a CCF event also when the components belong to different systems.

Note 1: It is essential to observe that beta refers to the fraction of a component's failure rate that is a CCF; the number of CCF events as such is irrelevant. For example, if two components A and B (of a CCF group) in total experience three failure events, one single A-failure, one single B-failure and one CCF event (with both A and B failing), then both components have failed twice, and beta is estimated to equal 50%, even if only one third of the events was a CCF event.

Note 2: If the CCF group has, for instance, four components, and we observe 10 independent failures and two CCF events with two and three failures, respectively, we obviously have a problem: the beta-factor model itself is in conflict with the data, and no ideal beta-estimator exists. The solution of the standard beta estimator (1) is to simply add the number of components having a CCF, i.e. five, giving the beta-estimate 5/(10 + 5) = 33%. The estimator (2) gives the result 4/(10 + 4) = 29%. If all four components had failed in the two observed CCF events (as should be the case according to the beta-factor model), the estimate would be 8/(10 + 8) = 44%.

3.1.1. Numerical example

Based on the operational reviews of the equipment group level transmitters (ref. Section 4.2.4), with a total population of 346 transmitters and an accumulated operational time of 12.4 million hours, 54 components had a DU failure. 13 of these failures occurred in CCF events, implying that N_DU = 54, N_DU,I = 41 and N_DU,CCF = 13. Further, N_CCF = 3; in these three events, respectively 9, 2 and 2 components failed (giving the total of 13 components having failed due to a CCF). The data do not specify which loops are affected by the CCF events, but this information is not required by the NUREG estimators. The estimates then become:

β̂1 = 13/54 ≈ 0.24

β̂2 = 2·3 / (41 + 2·3) ≈ 0.13
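The two NUREG estimates in the numerical example can be reproduced with a few lines of code. This is a sketch of Eqs. (1) and (2) only, not tooling from the project:

```python
# NUREG/CR-4780 estimators applied to the level-transmitter data:
# N_DU = 54 DU failures in total, of which N_DU_CCF = 13 occurred in
# N_CCF = 3 CCF events (with 9, 2 and 2 failed components, respectively).

def beta_1(n_du_ccf: int, n_du: int) -> float:
    """Eq. (1): fraction of all DU failures that occurred in CCF events."""
    return n_du_ccf / n_du

def beta_2(n_ccf: int, n_du_i: int) -> float:
    """Eq. (2): each CCF event is counted as exactly two failed components."""
    return 2 * n_ccf / (n_du_i + 2 * n_ccf)

N_DU, N_DU_CCF, N_CCF = 54, 13, 3
N_DU_I = N_DU - N_DU_CCF  # 41 independent DU failures

print(round(beta_1(N_DU_CCF, N_DU), 2))  # 0.24
print(round(beta_2(N_CCF, N_DU_I), 2))   # 0.13
```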
3.2. New estimators suggested in the project ("PDS estimators")

The PDS estimator for the beta-factor differs rather significantly from the NUREG estimators. First, in contrast to the standard beta-factor model, the definition of β in the PDS method is entirely related to a pair of components: β equals the fraction of the component's failure rate that causes both components of a pair to fail due to a CCF (also when the CCF group has more than two components). This beta-factor can be given the following interpretation: consider a specific pair of components of a CCF group; then, given that one of the two components has failed, β equals the probability that the other component also fails "simultaneously" due to a CCF. So the β in the PDS method is identical to the standard beta-factor for a dual system only. In addition to the beta-factor, a modification factor is applied to adjust the beta-factor depending on the system voting [6]. The advantage of the PDS definition is that it allows any number of components to fail in a CCF event, and as this beta explicitly refers to a pair of components, we do not encounter the problem referred to in Note 2 above.

Hokstad [8] introduced an estimator of this β, assuming that the failure data were collected for systems with the same number (n) of components in the CCF group. For the purpose of this estimator, the following parameters were used: n = number of components in the affected CCF groups (say n = 2 if the CCF group is simply the components of a duplicated system), and K = number of failure events, i.e. the number of independent failures plus the number of CCF events. Sometimes it is judged that, say, two CCF groups have been affected even if the failures apparently have occurred "simultaneously"; in that case this is counted as two CCF events. Further, Y_j = number of failed components (i.e. the failure multiplicity) in failure event j (j = 1, 2, …, K).
In this case, the maximum likelihood estimate (MLE) for β equals [8,11]:

β̂_MLE,K = [ Σ_{j=1..K} Y_j·(Y_j − 1) ] / [ (n − 1) · Σ_{j=1..K} Y_j ]   (3)

Note here that Σ_{j=1..K} Y_j equals the total number of failed components, i.e. N_DU = Σ_{j=1..K} Y_j. Also observe that if we consider just one failure event, in which Y_1 (out of n) components failed, the estimator becomes (inserting K = 1):

β̂_MLE,1 = (Y_1 − 1) / (n − 1)
In the PDS method this is the rather intuitive β estimator based on a single observation: given that one specific of the n components has failed, exactly Y_1 − 1 of the other n − 1 components have also failed, and the conditional probability that one of the other components fails is estimated by (Y_1 − 1)/(n − 1). This means that for a duplicated system (n = 2), the β estimator equals 0 when Y_1 = 1, and 1 when Y_1 = 2. Further, in the numerical example with the transmitters above, now assuming a CCF group of n = 50 components of which Y_1 = 9 failed, we get the estimate β̂_MLE,1 = 8/49 ≈ 0.16.

When we have data with different n values, the situation is more complex, but this was treated in [11], giving an expression which in the current notation can be written:

β̂3 = [ Σ_{j=1..K} Y_j·(Y_j − 1) / (n_j·(n_j − 1)) ] / [ Σ_{j=1..K} Y_j/n_j ]   (4)

Here n_j equals the size of the CCF group affected by failure event no. j. Observe that we can write this estimator as

β̂3 = Σ_{j=1..K} c_j·(Y_j − 1)/(n_j − 1)

That is, β̂3 represents a weighting of the above intuitive estimate β̂_MLE,1 = (Y_j − 1)/(n_j − 1), using the weights

c_j = (Y_j/n_j) / Σ_{i=1..K} (Y_i/n_i),   j = 1, 2, …, K.

The weights might also be chosen differently than in (4); a simpler expression is obtained by letting c_j = 1/K. Observe that the result (4) reduces to (3) when all n_j are equal to n.

3.2.1. Numerical example (continued)

We will use the same example as in Section 3.1 with level transmitters. For estimating β̂3 we now need to define the parameters n_j, K and Y_j. To find K we observe that there are three CCF events with 9, 2 and 2 DU failures, respectively; in addition we have the 41 single failure events. Defining n_j is somewhat challenging due to the format of the underlying data. The CCF data have been collected from six facilities with level transmitter populations of 30–140 components. For a given facility many of the transmitters will be single, others will be redundant (1oo2), and some may be triplicated (2oo3). This information, i.e. how the failed components are "looped" together, is, however, not explicitly known. We therefore have to make some assumptions about the size of each CCF group within which the CCF events have been observed. Based on the observed CCF events, we see that these events have typically occurred for components that are physically located close to each other. We will therefore assume that the CCF event with 9 failures is "dimensioning", in the sense that every component group in which a single or a multiple failure has occurred is assumed to include 9 components. Summarizing, we assume that 41 single DU failures have occurred and three CCF events with 9, 2 and 2 failures each. The number of failure events K then becomes 41 + 3 = 44, with n_1 = n_2 = … = n_44 = 9 and Y_1 = Y_2 = … = Y_41 = 1, Y_42 = 9, Y_43 = Y_44 = 2. Applying formula (3), this gives:

β̂3 = (41·1·0 + 9·8 + 2·2·1) / (8·(41·1 + 9 + 2·2)) = 0.1759 ≈ 18%

It should again be noted that the β̂3 estimator is rather sensitive to the assumptions made concerning group size (i.e. the n_j). Observe that this estimate is somewhat more optimistic than β̂1 but less optimistic than β̂2.

4. Collection of CCF data

Failure notifications from the last 2–3 years, covering six facilities, were reviewed in the operational reviews. Each notification was reviewed with the purpose of classifying failures according to IEC 61508, considering CCFs in particular. The following equipment groups were included:

Transmitters: level transmitters, pressure transmitters, temperature transmitters, flow transmitters.
Detectors: point gas detectors, line gas detectors, flame detectors, smoke detectors, heat detectors.
Final elements: ESD and PSD valves, ESD riser valves, blowdown valves, pressure safety valves, fire dampers, deluge valves.
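The β̂3 calculation in Section 3.2.1 can likewise be reproduced directly from Eq. (4). The sketch below uses the same "dimensioning" assumption of a common group size of nine; it is an illustration, not the project's actual tooling:

```python
# PDS estimator of Eq. (4): a weighted average of (Y_j - 1)/(n_j - 1)
# over all K failure events, with weights proportional to Y_j / n_j.

def beta_3(multiplicities, group_sizes):
    num = sum(y * (y - 1) / (n * (n - 1))
              for y, n in zip(multiplicities, group_sizes))
    den = sum(y / n for y, n in zip(multiplicities, group_sizes))
    return num / den

# Level-transmitter example: 41 single DU failures plus three CCF events
# with 9, 2 and 2 failed components; every CCF group is assumed to have
# 9 components ("dimensioned" by the largest observed CCF event).
Y = [1] * 41 + [9, 2, 2]  # failure multiplicity per event (K = 44)
n = [9] * len(Y)          # assumed group size for each event

print(round(beta_3(Y, n), 4))  # 0.1759
```

Because all n_j are equal here, Eq. (4) reduces to Eq. (3), and the script reproduces the 18% estimate from the text.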
4.1. Operational reviews

The failure reviews were carried out as 3–5 day meetings with representatives from the operators and SINTEF/NTNU. As input to the meetings, a comprehensive Excel report was prepared, summarizing all notifications registered in the maintenance system (e.g. SAP) for SIL-rated equipment in the period under consideration. This Excel report was then used as a basis for further failure classification during the meetings. A maintenance system expert always attended the meetings, so that it was possible to access the maintenance system and extract additional details if required (and if available). The operational reviews were facilitated by SINTEF/NTNU and involved key discipline personnel (automation, safety, mechanical, process, maintenance, etc.) from the specific facility and company in question. Subject matter experts from the company were also frequently called upon during the meetings to resolve ad hoc questions regarding the notifications and the equipment under consideration. A multidisciplinary group is important both for quality assurance of the classifications and for achieving a collective understanding of the status and challenges for the operator and personnel involved. The work during the failure reviews typically included:
- A thorough review of each notification, in particular with respect to detection method, failure mode, criticality of failure and failure cause.
- A classification of each failure according to IEC 61508 and IEC 61511: is the failure Safe (S), Dangerous Detected (DD) or Dangerous Undetected (DU)? Or is the registered notification considered not applicable (NA), i.e. a notification written against the tag but not affecting the main function of the component (e.g. removal of fire insulation on an ESD valve)?
- A discussion of the identified failures, in particular those classified as DU, and those that are potential or actual CCFs.
Ideally, an operational review should be performed once a year, as the notifications would then be more recent and fresher in mind, and thus easier to classify with the support of operating and maintenance personnel. Annual reviews would also reduce the number of notifications per review to a more manageable size. The typical
Please cite this article as: Hauge S, et al. Common cause failures in safety-instrumented systems: Using field experience from the petroleum industry. Reliability Engineering and System Safety (2015), http://dx.doi.org/10.1016/j.ress.2015.09.018i
number of notifications for SIS related components on a facility is 300–600 per year. Operational reviews can be considered an integral part of the required barrier management in the operational phase. The overall aim of such reviews is to verify the performance requirements from design and, if necessary, to carry out measures to improve the reliability of the equipment. Manufacturers are interested in operational feedback on the equipment they deliver, but often express frustration about the lack of information from the operators. Operational reviews can thus be seen as a good opportunity to systematically collect information on the equipment, which can also be shared with the manufacturers.

4.2. CCF data for selected equipment groups

For the purpose of this paper, we present results for some selected (and key) equipment groups: shutdown (ESD/PSD) valves, pressure safety valves (PSVs), gas detectors, and level transmitters, see Table 1. Here "Total operational time" is the aggregated operational hours for the entire population (across the six facilities), and the other parameters are as defined in Section 3.1. Based on these data and the different estimators described in Section 3, three estimates of the β-value have been calculated for each equipment group; the results are summarized in Table 2. From Table 2 we see that the chosen type of estimator and the underlying assumptions have a major influence on the estimated β-value. Neither β̂₁ nor β̂₂ explicitly takes into consideration the number of failures in each CCF event, or the size of the component groups for which common cause failures are not registered. As discussed in Section 3, the β̂₃ estimator is calculated under the assumption that the size of each equipment group (n_j) equals the size of the CCF event with the most failures. Hence, β̂₃ is considered somewhat better "adjusted" to the underlying data material than β̂₁ and β̂₂.
The main approach has therefore been to give most credit to the β̂₃ estimates, adjusting them for several considerations as discussed in [3]. One important observation from the table is that, for all equipment groups under consideration, the estimated β-values based on operational experience are significantly higher than what has traditionally been assumed in design calculations. It can therefore be concluded that more effort should be put into systematically analyzing and reducing the extent of CCFs.

Table 1
Input data from operational reviews for the selected equipment groups.

Equipment group                     No. of comps.   Total operational time (h)   N_DU   N_DU,I   N_CCF   N_DU,CCF
Shutdown valves (ESD/PSD valves)    1120            2.52 · 10^7                  279    211      12      68
Pressure safety valves (PSVs)       2356            7.45 · 10^7                  148    116      11      32
Gas detectors (point and line)      2239            6.51 · 10^7                  74     54       5       20
Level transmitters                  346             1.24 · 10^7                  54     41       3       13

It should also be emphasized that the estimated βs are average values and vary significantly between facilities. On some facilities no CCF events have been observed for certain equipment groups, whereas on other facilities a very high frequency of CCFs has been observed for the same equipment. For example, for ESD/PSD valves, where the accumulated number of observed DU failures and CCFs is highest, the estimated β-values vary between installations from 5% to 40%. It is therefore important to adjust the generic estimates of β based on facility specific conditions. For this purpose, equipment specific CCF checklists have been developed (see Section 5). In the following sections, some more details for the selected equipment groups are given. For further results and details, reference is made to the main project report [3].

4.2.1. Shutdown valves
The emergency and process shutdown valves shall close upon a demand. Several of them also have requirements with respect to closing time, and some with respect to maximum leakage rate in closed position. In total, 279 DU failures were observed for the shutdown valves from the six operational reviews. The total underlying population included 1120 valves. The 12 registered CCF events involved the following failures:

- 19 DU failures (delayed operation) were due to poor design of hydraulic connections, resulting in too long closing times.
- 11 other DU failures (fail to close) were due to incorrect valve
type for the specific application (fail-open valves installed in a fail-close application).
- Six additional DU failures (fail to close) of another type of valve were due to poor design; these valves were not designed for the intended use and degraded immediately. All of them had to be replaced.
- All 10 ESD riser valves on one facility failed to close upon low pressure, due to an oxidized aluminum gasket and a defective (creased) spring. It should be noted that the valves did close under normal operation (upon pressure).
- Six DU failures (delayed operation) were due to changing hydraulic oil viscosity (caused by temperature variations).
- Four DU failures (fail to close) were due to insufficient actuator capacity. The operation of these valves is normally assisted by pipeline pressure, but when tested with no (delta) pressure over the valve they were unable to close.
- Three DU failures (fail to close) were due to damaged gaskets combined with dirt from operation.
- Three DU failures (leakage in closed position) were due to a leakage problem with a group of valves of similar type. No root cause of this problem was revealed.
- Two DU failures (fail to close) were caused by wrong mounting of solenoid valves while modifying the control of these valves.
- Two DU failures (fail to close) on similar types of valves were detected simultaneously (unknown cause).
- Two DU failures (fail to close) were caused by corrosion on the actuator stem due to wrong material selection.
Table 2
Different β-estimates for the selected equipment groups.

Equipment group                     β̂₁ (%)   β̂₂ (%)   β̂₃ (%)   New suggested β (%)   β from PDS data handbook (for comparison) (%)
Shutdown valves (ESD/PSD valves)    24        10        16        12                    5
Pressure safety valves (PSVs)       22        16        11        11                    5
Gas detectors (point and line)      27        16        16        15                    7
Level transmitters                  24        13        18        15                    6
- An additional (unknown) number of DU failures (delayed operation) were due to inadequate bleed-off (wrong tuning of the bleed-off valve).
In addition, several failures were observed with the potential to cause CCF events. Examples are valves that did not function due to low temperatures and ice/freezing, a stuck hand wheel due to rust, insufficient lubrication and cleaning of valves, nitrogen leakage from valve accumulators, and incorrect adjustment of a valve during previous maintenance. Three CCF categories were defined, and the contribution of each was estimated based on the observed CCF events and the potential CCFs. Some CCF events have more than one cause, and their contribution is therefore distributed among two (or three) of the categories (Table 3). The distribution of these CCF categories is an important input to the weights given in the equipment specific checklists, as discussed in Section 5.

4.2.2. Pressure safety valves (PSVs)
The PSVs shall open at a predefined set point. During testing of PSVs, a failure is usually registered if the valve does not open within 120% of the set point pressure. The critical failure mode for a PSV is thus fail to open (FTO). In total, 148 failures were defined as DU failures for the facilities where PSVs have been reviewed (a total population of 2356 PSVs). Eleven CCF events were identified, involving the following failures:
- Seven of the CCF events comprised simultaneous failures (i.e. failures on the same test/demand) of PSVs located near each other (e.g. on the same vessel), without any further information about the failure cause(s). One event involved DU failures of six PSVs, one event involved three PSVs, and five events involved DU failures of two PSVs.
- One CCF event involved failure of five PSVs. The pilot exhaust lines from these valves were plugged.
- Two CCF events each involved failure of three PSVs located near each other due to rust.
- Two PSVs located near each other failed to open due to medium inside the valve, probably from operational conditions.
Also, a number of systematic failures were observed with the potential to cause CCF events. Examples of such potential CCFs include:

- One PSV failed due to hydrate formation caused by loss of heat tracing. This could also have affected other PSVs.
- One PSV was filled with sand.
- One PSV had loose bolts.
- For one PSV the insulation was defective.
- Three PSVs (on two different facilities) had incorrect set-point adjustments.
- At least three PSVs failed (at different times and/or facilities) due to degraded O-rings (which were replaced).
Table 3
Distribution of CCF categories for ESD/PSD valves.

CCF category                                        Distribution (%)
Design properties                                   45
Environmental control (internal and external)       20
Operation, maintenance and modifications            35

As can be seen from the above descriptions, limited information was found in the notifications regarding the causes of PSV failures. So even if the number of failures (and CCFs) is rather high, the quality of the data could have been better.

4.2.3. Gas detectors
Gas detectors shall detect the presence of gas and initiate an alarm at specified concentration(s). Besides hydrocarbon gas detectors, some H2S, CO2 and O2 detectors are also included in the sample. The sample of gas detectors included both IR point gas detectors (1341 units) and IR line gas detectors (898 units). In total, 59 failures were defined as DU failures for point gas detectors and 15 for line gas detectors from the six operational reviews. The five registered CCF events involved the following failures:

- Ten DU failures (wrong measurement) were due to poor design of a point gas detector type, so that all the detectors had to be replaced.
- Four DU failures (wrong measurement) were all detected at the same time by random observations (no alarm). The four point gas detectors had frozen measurement and had to be replaced. Cause unknown.
- Two point gas detectors of an old design (located in the same area) were exposed to corrosive attack (unknown failure mode, but assumed fail to function).
- One DU failure (and possibly up to around ten additional DU failures) was due to wrong calibration caused by an error in the test procedure.
- One DU failure (affecting an unknown number of line gas detectors) was due to the wrong type of cabling (not intrinsically safe) and physical damage to this cable.

Potential CCFs included detectors that did not function due to environmental influences causing dirty lenses, dense filters due to inadequate maintenance procedures, poor design of the detector mirror, clogged filters, and erroneous calibration. From the failure history we see that failures related to incorrect calibration and wrong measurements are important causes. Also, design related properties (including location) and environmental influences seem to give significant contributions. The distribution between the three CCF categories was estimated based on the actual and the potential CCF events, see Table 4. Note that some CCF events have more than one cause, and for several DU failures the underlying failure cause was unknown, so some subjective considerations had to be made.

Table 4
Distribution of CCF categories for gas detectors.

CCF category                                        Distribution (%)
Design properties (incl. location)                  35
Environmental control (internal and external)       30
Operation, maintenance and modifications            35

4.2.4. Level transmitters
A level transmitter shall measure the level in a vessel or tank, and send a signal to the logic when the level is outside the set-point limits. In total, 54 failures were defined as DU failures for level transmitters from the six operational reviews. The total population included 346 level transmitters. The three registered CCF events involved:

- Nine DU failures (wrong measurement) were due to a wrong measuring principle for the specific application, resulting in
wrong level measurements (revealed randomly or during tests). It should be noted that many vessels have very challenging measurement conditions, and a suitable level measurement principle working for all operational situations may therefore be hard to find.
- Two DU failures (wrong measurement) were due to wrong data values in their data sheets.
- DU failures (wrong measurement) of two transmitters located near each other were both revealed at the same time (by observations in the control room).
Failures that could potentially have resulted in CCFs included transmitters exposed to cold temperatures resulting in icing/hydrate formation in impulse lines, incorrect mounting of a transmitter, and a transmitter exposed to an electrical failure.

4.3. Credibility of results

A main problem in the estimation of beta described above is the quality of the data. Often the data do not provide sufficient details about the underlying failure causes (which is even more problematic when the checklists are formulated), and the notifications provided no or very limited information on the (size of the) CCF groups. Altogether, this implies some uncertainty regarding the validity of the results. As explained, for the NUREG estimators of the standard beta-factor, the main problem is the beta-factor model itself (which assumes that all components of the CCF group fail). We found the data insufficient for carrying out explicit modeling of (some of) the CCFs. However, the high beta-values obtained here indicate a need for explicit modeling, and this may be a focus of future analyses. There is, however, no clear definition of which beta-value represents the boundary of what is acceptable for implicit modeling. As pointed out, IEC 61508-6, appendix D, suggests an upper value of 10% for field sensors and final elements, and Humphreys [12] proposed 30% as a worst case value of the beta-factor. The result (i.e., the calculated reliability of a SIF) should not be influenced by whether explicit and/or implicit modeling is used, but the inclusion of explicit failure causes may increase the awareness of specific common causes. The high values obtained for beta should serve as a warning to the industry, and the fundamental reasons for this unsatisfactory result need further investigation. So far we can only present some hypotheses. It could be a symptom that experience from operation is not sufficiently shared with manufacturers.
Perhaps it is a symptom that poor solutions are not subjected to a proper root cause analysis, but instead put back into operation after replacement or a "quick fix". "Copy and paste" of solutions from one facility to the next may be another possible explanation. It could also be due to the rather limited focus on CCFs, and the fact that a component failure is followed up more or less in isolation, without any experience transfer to other similar components in the maintenance system. Finally, it might be a symptom of the technology being too complex. But again, this is definitely a topic for further work. In the current study we have focused on the failure categories of IEC 61508, i.e. DU, DD and S. An alternative approach could have been to focus on the actual failure modes of the various components, like FTO (fail to open) and FTC (fail to close) for valves. However, for the application of reliability data for SIS, the classification into DU, DD and S is essential, and most of the relevant components in the oil and gas industry have the same functions (and therefore the same dangerous and safe failure modes). So the grouping of failure modes into the categories DU, DD and S captures the essential features. As a step towards explicit modeling, one could of course consider providing betas for the various failure modes, but that would require rather detailed data in large quantities.
Some of the data indicate failure causes that may stem from deficiencies in the design/certification of components, or from inadequate checks and tests in later phases such as installation, commissioning, and pre-startup testing. Our approach has been to exclude failures from the first period (1–2 years) of operation, to ensure that some time is given to correct such failures. Beyond this period, however, we believe that any previous attempts to correct the failures have been unsuccessful, or that there has not been a sufficiently in-depth analysis of failure causes to identify the modifications needed. In the latter case, we find it reasonable to include the failures in the estimation of beta.
5. New equipment specific checklists

5.1. Application of equipment specific checklists

As discussed above, the experienced β-values from the operational reviews vary significantly between the facilities and depend on factors and conditions related both to design and to operation. The uncertainty related to these factors and conditions, and thus the uncertainty concerning the beta-factor, will be largest at the start of a project and will decrease throughout later phases as more information and operational experience becomes available (given that the operational experience is systematically evaluated and acted upon to reduce the occurrence of CCFs). The purpose of the equipment specific checklists is therefore to utilize available information to reduce the uncertainty related to the β-estimate. Typical applications of the checklists will be:
- During early design, to determine a facility specific beta-factor for early SIL calculations.
- During detailed design, to revise the beta-factor based on additional knowledge about facility specific factors and conditions. The updated beta-factors will typically be applied in SIL calculations and possibly as input to QRA.
- During the operational phase, to verify that the assumptions behind the facility specific beta-factor are still valid (in practice seldom done).
The purpose of developing CCF checklists is not necessarily limited to adjusting a β-value after having judged a specific design in light of control questions. Another, more qualitative application of the checklists, in any phase, is to assess the vulnerability of a particular design and a foreseen operational regime to CCFs. A checklist may also be used to point to possible defense tactics that, if implemented, will result in a reduction of the beta-factor.

5.2. CCF categories and defenses

Equipment specific checklists have been developed based on experience from the operational reviews as well as input from various literature. The suggested checklist format is based on the following assumptions: (i) Component specific CCF categories can be listed on the basis of the review of reported failures. These are specific causes (and couplings) known to affect the occurrence of CCFs (e.g. sand in the production flow, presence of ice and snow, the same operators calibrating several transmitters, etc.), and these CCFs are assumed to affect the beta-factor value in some way. (ii) Defenses may be introduced to reduce the impact of specific CCF causes and couplings, but not to a zero level (for example sand detectors, heat tracing of impulse lines, 3rd party control of work, etc.).
Please cite this article as: Hauge S, et al. Common cause failures in safety-instrumented systems: Using field experience from the petroleum industry. Reliability Engineering and System Safety (2015), http://dx.doi.org/10.1016/j.ress.2015.09.018i
S. Hauge et al. / Reliability Engineering and System Safety ∎ (∎∎∎∎) ∎∎∎–∎∎∎
The following CCF categories and sub-categories (examples – not a complete list) have been suggested:
Design properties:
(1) Equipment specification and manufacturing
(2) Material selection
(3) Complexity
(4) Associated utility systems
(5) Location and separation
(6) Software/logic
(7) Prior use

Environmental control (external and internal):
(1) Climate and temperatures: ice, snow, fog, rain, sea spray, etc.
(2) Variability in climatic conditions
(3) Sand, dirt, hydrates and deposits
(4) Corrosion and erosion

Operation, maintenance and modifications:
(1) Installation commonalities: personnel, procedures and routines
(2) Latent failures introduced during commissioning, installation and modifications
(3) Maintainability and HMI
(4) Procedures and routines
(5) Personnel competency and training
(6) Operator errors (of omission and commission)
(7) Maintenance commonalities: personnel, procedures and routines
(8) Management of change

Training, experience and competence:
(1) Technical equipment competence
(2) SIS equipment and process specific courses
(3) Simulator training

Design related reviews:
(1) Design analyses and reviews (w.r.t. sizing, material selection, actuator capacity, response time, etc.)
(2) Process analyses (w.r.t. temperatures, pressure, flow, keeping tight in closed position, etc.)
(3) Environmental and location analyses

Operational follow-up and analyses:
(1) Root cause analyses/analyses to identify systematic and common/shared failure causes
(2) Design analyses and reviews (w.r.t. sizing, material selection, actuator capacity, response time, etc.)
(3) Analyses to improve testing/maintenance and to identify measures to reduce systematic failures and CCFs
(4) Handling of changes and modifications
(5) Audits and reviews to validate the quality of testing/maintenance

Examples of physical/technical/operational defenses are:

Operational and maintenance measures:
(1) Cleaning
(2) Lubrication
(3) Draining
(4) Adjustment checks
(5) Removing ice, snow, etc.
(6) Monitoring and inspection
(7) Check of heat tracing/heaters
(8) Maintenance staffing and scheduling (staggered testing and staff diversity)

Design and functionality measures:
(1) Fit for purpose
(2) Diversity
(3) Changes/improvements of design/functionality
(4) Improved technology: improved materials, increased diagnostics
Physical and layout measures:
(1) Separation/segregation
(2) Enclosure
(3) Insulation
(4) Heat tracing/heaters
5.3. Equipment specific checklists – format, content and example

The CCF checklists include a set of columns, explained below. Some of the columns relate to topics, figures, or issues that have been pre-set on the basis of the operational reviews, while others relate to aspects that must be judged by the user of the checklist.

- CCF category, as discussed above. These are pre-set categories.
- Weights (in %), based on results from the operational reviews. Each CCF category and sub-category is given a certain weight based on the distributions exemplified in Section 4 (i.e. from the actually observed data). For each equipment group, the weights of the three CCF categories add up to 100%.
- Relevance, an assessment of the relevance of the associated statement or question, to be judged by the user of the checklist. There are three options:
  Yes: The question/statement is considered relevant and can be answered at the given point in time of filling in the checklist.
  Premature: The question/statement is considered relevant but cannot be answered at the given point in time (e.g. questions related to operational follow-up are difficult to answer in the design phase). The generic pre-defined weight is then applied in the beta-factor estimation.
  NA: The question/statement is not considered relevant for the design or facility under consideration. The generic weight of the CCF sub-category is then subtracted in the beta-factor estimation.
- Defenses related to the specific CCF sub-category under consideration, i.e. measures or strategies to prevent or reduce the likelihood of a CCF occurring. These are pre-set and based on the classification of defenses discussed above.
- Efficiency (Eff.), a scaling comparable to the scaling used for potential CCFs in [27], with one added level.
The following scaling factors are suggested:

3.0: No measure is implemented, resulting in a "penalty" factor of 3.
1.0: Measures are implemented but are judged to have a Low (or average) efficiency that does not go beyond what is considered the average/current CCF protection standard in the industry.
0.5: Measures beyond average protection have been implemented, but the effect of the measure is considered Medium (or limited), or it has not been documented that the foreseen effects have been achieved.
0.1: Measures have been taken, and it has been documented that the measures have a High (and positive) effect on the issue in question.

The value is selected by the user of the checklist, with a default of 1.0. The modification factor (Mod. factor) is the product of the efficiency factor assigned to each defense and the weight related to the specific CCF question/statement. By summing the modification factors over all CCF questions/statements, an aggregated modification factor is obtained. This is further illustrated in the example in the next section. The three CCF categories have, for each equipment type, been assigned a basic weight based on experience and the classification of CCF and potential-CCF failure causes from the operational reviews. For each category a number of equipment specific checklist questions (or statements) have been suggested. The weight of each such "question" is again based on findings from the
Table 5
Example of a filled-in CCF checklist for shutdown (ESD/PSD) valves. User entries are the relevance assessments ('Y'/'Pre'/'NA') and the re-evaluated efficiency factors; each modification factor is the weight multiplied by the efficiency factor.

Design properties (category weight 45%)
1. Valves within the same equipment group are of the same type (make, manufacturer, material selection, etc.). Weight 10%. Relevance: Y. Defense: Has the design been reviewed with the purpose of revealing common vulnerabilities associated with sizing, material selection, location and so on, and are defenses implemented to reduce common vulnerabilities (separation, diversity)? Eff.: 1.0 (L). Mod. factor: 0.10.
2. Response times (max/min) are critical for the successful performance of the valves. Weight 15%. Relevance: Y. Defense: Have design reviews been carried out with the purpose of identifying possible design constraints in view of response time requirements? Eff.: 1.0 (L). Mod. factor: 0.15.
3. Inadequate actuator force can cause the valve not to close under particular process conditions, e.g. low pressures. Weight 5%. Relevance: Y. Defense: Has it been verified by analyses (and possibly testing) that there is sufficient actuator force to close the valve under all foreseeable process conditions, and/or has extra sizing/dimensioning of the actuator been implemented to ensure that the valve will function under all foreseeable process conditions? Eff.: 0.1 (H). Mod. factor: 0.005.
4. Failure of common utility systems, such as accumulators, hydraulic or pneumatic systems, can result in valves failing to close. Weight 5%. Relevance: NA. Defense: Has it been confirmed that the associated utility systems, as well as the individual valve actuators, have sufficient capacity, and are procedures in place to ensure this? Eff.: 1.0 (L). Mod. factor: 0.
5. Valves are required to keep leak tight in closed position. Weight 5%. Relevance: Y. Defense: Are condition monitoring or other operational measures implemented to prevent too high leakage rates in closed position? Eff.: 1.0 (L). Mod. factor: 0.05.
6. Whether the design can be considered fit for purpose is an issue for the particular application. Weight 5%. Relevance: Y. Defense: Is any operational experience related to the specific valve type available, or is prior-use experience available for the valve and relevant for the current application? Eff.: 0.5 (M). Mod. factor: 0.025.
Subtotal modification factor for 'Design properties': 0.33

Environmental control, external and internal (category weight 20%)
1. The valves are exposed to an internal environment such as dirt, sand, hydrates, etc. with a potential to affect valve performance. Weight 3%. Relevance: Y. Defense: Have specific procedures for control of sand, dirt, hydrate formation, etc. been implemented, including procedures for cleaning and lubrication of valves? Eff.: 1.0 (L). Mod. factor: 0.03.
2. As above. Weight 4%. Relevance: Y. Defense: Have physical measures for control of internal environmental influences, such as sand traps, inhibitors, etc., been implemented? Eff.: 0.5 (M). Mod. factor: 0.02.
3. The valves are exposed to snow, temperature changes, icing conditions, sea spray, etc., possibly affecting valve performance. Weight 3%. Relevance: Y. Defense: Are procedures for removing ice and build-up of snow, controlling hydraulic oil viscosity, etc. during periods with cold temperatures in place and implemented (including procedures to check that e.g. heat tracing is functioning/is on)? Eff.: 0.5 (M). Mod. factor: 0.015.
4. As above. Weight 4%. Relevance: Y. Defense: Are valves exposed to snow, ice, sea spray or cold temperatures equipped with separate weather protection/insulation/heat tracing? Eff.: 0.5 (M). Mod. factor: 0.02.
5. The valves are subject to a corrosive environment (internal and/or external) that can affect valve performance. Weight 3%. Relevance: Y. Defense: Are inspection procedures and associated acceptance criteria in place for controlling and preventing corrosion? Eff.: 0.5 (M). Mod. factor: 0.015.
6. As above. Weight 3%. Relevance: Y. Defense: Have physical measures for control of the corrosive environment been implemented, such as material choice, corrosion inhibitor, etc.? Eff.: 1.0 (L). Mod. factor: 0.03.
Subtotal modification factor for 'Environmental control': 0.13

Operation, maintenance & modifications (category weight 35%)
1. The valves are periodically tested and maintained according to a predefined maintenance program. Weight 3%. Relevance: Y. Defense: Are test and maintenance procedures readily available, made familiar among the maintenance personnel, and kept continuously updated throughout the operational phase? Eff.: 0.5 (M). Mod. factor: 0.015.
2. As above. Weight 3%. Relevance: Y. Defense: Are there routines/procedures in place to periodically review and, if necessary, adjust the frequency of proof testing for the valves in light of registered failure history and experienced failure causes? Eff.: 3.0 (N). Mod. factor: 0.09.
3. As above. Weight 4%. Relevance: Y. Defense: Do the maintenance personnel always check for similar failures on other valves if a failure is revealed during testing or operation? Eff.: 0.5 (M). Mod. factor: 0.02.
4. Results from periodic testing and maintenance are logged into some kind of electronic (or written) maintenance system. Weight 5%. Relevance: Y. Defense: Have the maintenance operators been given particular training with respect to understanding the valves' functionality and critical failure modes, and in registering maintenance notifications so as to give a good description of e.g. failure cause, detection method and failure mode? Eff.: 1.0 (L). Mod. factor: 0.05.
5. As above. Weight 5%. Relevance: Pre. Defense: Are maintenance notifications regularly gone through in order to reveal repeating valve failures, to compare results for all relevant valves across the facility, and to initiate and perform root cause analyses in order to identify measures to remove these failure causes? Eff.: 1.0 (L). Mod. factor: 0.05.
6. In case of a valve failure during testing or operation, it may be a relevant measure to run the valve again until it functions. Weight 5%. Relevance: Pre. Defense: If a valve has to be run several times to function satisfactorily (e.g. close fast enough), are additional measures (lubrication, cleaning, finding the root cause) always put in place to prevent this from happening again? Eff.: 1.0 (L). Mod. factor: 0.05.
7. Wrong tuning of the bleed-off valve and/or weaknesses in the bleed-off arrangement/control block may result in the valve failing to close or closing too slowly. Weight 5%. Relevance: Pre. Defense: Are operational measures implemented to ensure that the bleed-off arrangement is correctly tuned and (if relevant) periodically calibrated towards required closing times? Eff.: 1.0 (L). Mod. factor: 0.05.
8. Adding new valves or valve modifications, including changes to valve performance requirements, may become relevant during the operational phase. Weight 5%. Relevance: Y. Defense: Are there procedures for independent checks after valve modifications, valve adjustments and adjustments of closing/opening time? Eff.: 0.1 (H). Mod. factor: 0.005.
Subtotal modification factor for 'Operation, maintenance & modifications': 0.33

Aggregated modification factor: 0.79
S. Hauge et al. / Reliability Engineering and System Safety ∎ (∎∎∎∎) ∎∎∎–∎∎∎
Please cite this article as: Hauge S, et al. Common cause failures in safety-instrumented systems: Using field experience from the petroleum industry. Reliability Engineering and System Safety (2015), http://dx.doi.org/10.1016/j.ress.2015.09.018i
Table 5 Example of filled-in CCF checklist for shutdown (ESD/PSD) valves.
S. Hauge et al. / Reliability Engineering and System Safety ∎ (∎∎∎∎) ∎∎∎–∎∎∎
operational reviews (and some expert judgments) and is also related to the particular failure modes observed for the equipment under consideration. For example, for shutdown valves it is observed that closing time is often a problem; this particular vulnerability is therefore assigned a relatively high weight in the checklist for shutdown valves. The defenses related to each of the CCF questions have been predefined in the checklist, whereas the users themselves evaluate the efficiency of the defenses according to the scheme described above. Default efficiencies (of 1.0) are, however, filled in, and the users can adjust these based on facility-specific knowledge. Table 5 gives an example of a checklist (for shutdown valves) and how it could be filled in. The user fills in the relevant column with 'Y', 'Pre' or 'NA' (shown in bold italic) and also re-evaluates the defaulted efficiency factors (re-evaluated efficiency factors shown in bold italic). The modification factors are calculated automatically (Table 5). From the example checklist we see that the estimated modification factor is 0.79. If the generic beta-factor for shutdown valves is 12%, the updated β-value based on the evaluation of the checklist questions becomes 12% × 0.79 ≈ 9.5%. Note that some issues related to the category "operation, maintenance and modifications" have not been evaluated (i.e., marked "Pre"). In such cases the checklist can be reviewed at a later stage when more information is available. Such an update should then preferably include a re-evaluation of all checklist questions.
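The update step above is simple arithmetic. A minimal sketch of how a filled-in checklist could drive it is given below; the aggregation rule, weights, efficiencies, and answers are hypothetical illustrations, not the calibrated values of the actual checklist, and only the generic beta of 12% is taken from the text.

```python
# Sketch of checklist-based updating of a generic beta-factor.
# All checklist numbers below are invented for illustration; the
# paper's checklist has its own calibrated weights and efficiency
# categories (N/L/M/H).

def modification_factor(items):
    """items: list of (weight, efficiency, answered).
    An answered question ('Y') contributes weight * efficiency;
    an unanswered question ('Pre'/'NA') retains its full weight,
    i.e. no credit is taken for that defense."""
    return sum(w * e if answered else w for w, e, answered in items)

checklist = [
    (0.35, 0.5, True),   # defense in place, medium efficiency (assumed)
    (0.30, 1.0, True),   # defense present but judged ineffective
    (0.20, 0.5, True),
    (0.15, 1.0, False),  # 'Pre': not yet evaluated, full weight retained
]

mod = modification_factor(checklist)
beta_generic = 0.12              # generic beta for shutdown valves (12%)
beta_updated = beta_generic * mod
print(f"modification factor = {mod:.3f}")
print(f"updated beta = {beta_updated:.3f}")
```

With these invented numbers the modification factor is 0.725, and the updated beta becomes 0.087; with the paper's factor of 0.79 the same multiplication reproduces the 12% × 0.79 ≈ 9.5% result quoted above.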
6. Discussions, conclusions, and further work

The purpose of this paper has been two-fold: (1) to present new generic estimates of beta-factors based on operational experience, and (2) to demonstrate how such results can be used to derive a tool (i.e., a checklist) for updating the value of beta at a specific facility. Generic data for beta-factors are required in the early stage of SIS design, when no decision about the choice of manufacturers has been made. Generic data are important as an indication of expected performance, considering the average performance of components with the requested functionalities. Once the manufacturers have been selected, it is possible to update the initial reliability estimates with new and design-specific data. Manufacturer data often suggest a higher reliability performance than generic data, due to the exclusion of, e.g., systematic faults that are beyond the control of the manufacturers. The new values estimated for beta are significantly higher than suggested in current generic data sources [5], recent checklists [14,16], and manufacturer data. It is therefore reasonable to ask: (1) Are the amount and quality of data sufficient to claim a credible result? (2) Are the estimators used to calculate beta adequate? (3) Is it reasonable to expect higher values of beta today than in the past?
We claim that the answer to question 1 is yes. The amount of data should provide a sufficient basis for estimation, but not without uncertainty. About 12 000 failure notifications were reviewed, and the evaluation and classification were made with the involvement of responsible engineers, facility operators, and maintenance personnel. Despite their valuable contribution, it is always difficult to reach a sufficient level of understanding when many of the reported failures occurred a few years back in time, or were handled by different shifts at the plant. The credibility could have been increased if more information about underlying causes had been included in each notification when it was first registered. Increased awareness concerning CCFs among operating and maintenance personnel, and their technical support organizations, is very important to ensure that sufficient information is provided when failures are registered in the maintenance system. Operational reviews may be used to verify that the information provided is sufficient. Involving the manufacturers and their competence may help add more in-depth explanations of underlying failure causes.

We have attempted to overcome some weaknesses in the NUREG estimators by also applying a new estimator. Our answer is therefore yes also to question 2, and we would emphasize the usefulness of having compared the results of different estimators. Still, we hope the discussion of estimators in this paper may inspire other researchers to look for even better alternatives. Further, when the estimates reach values above, e.g., 10%, it may be questioned whether some CCFs could have been modeled explicitly. Explicit modeling of CCFs is a good alternative, but data will often not support such estimates.
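As a baseline for that discussion, the simplest textbook point estimate of beta is the fraction of component failures that occurred as part of a multiple (shared cause) failure event. The sketch below shows this naive estimator only as a point of reference; it is neither the NUREG estimators nor the new estimator applied in the paper, and the event data are invented.

```python
# Naive beta-factor point estimate from classified failure records.
# Illustrative baseline only: not the NUREG estimator nor the new
# estimator discussed in the paper. Event sizes are invented.

def beta_hat(event_sizes):
    """event_sizes: list of failure-event sizes, i.e. the number of
    components that failed in the same event from a shared cause.
    Returns the fraction of component failures belonging to
    multiple-failure (CCF) events."""
    total = sum(event_sizes)
    ccf = sum(k for k in event_sizes if k >= 2)
    return ccf / total if total else 0.0

# Hypothetical record: 44 independent failures and 6 double failures.
sizes = [1] * 44 + [2] * 6
print(f"beta estimate = {beta_hat(sizes):.2f}")
```

With these invented counts the estimate is 12/56 ≈ 0.21, illustrating how a modest number of multiple-failure events can already push beta well above the values commonly assumed in generic data sources.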
It is difficult to give a clear yes or no to the third question. However, we have noticed that some manufacturers are concerned about the increasing complexity of safety systems. In the past, the focus was often on keeping safety systems as simple as possible. Some of the complexity seen in systems today may come from the wish for more flexibility in how the systems are operated and tested. Logic (software) may be added to manage start-up situations, including the overriding of certain alarms and the control of interlocks to avoid mal-operation of valves. Logic is sometimes also added to reduce the global effects (at the facility level) of a single SIF activation. Safety systems are also subject to more modifications than in the past, as it has become more common to prolong the life of facilities with new (subsea) field tie-ins. Such modifications may also add complexity. It is not possible to argue for any direct relationship between high complexity and the occurrence of CCFs, but proneness to human errors, which represent an important contributor to CCFs, may increase with increasing complexity.

Many of the failures classified as CCFs have been traced back to inadequate design, installation, and commissioning. Such failures are included in the estimation of beta if they have been reported after the initial period of operation. It may be relevant to ask whether the industry "copies and pastes" designs from one design project to the next, without verifying all the underlying assumptions. Reusing previous design solutions is not necessarily negative, provided the underlying assumptions are checked. However, we have observed that many manufacturers and system integrators complain about too little experience transfer from operation. Without such transfer, it is difficult to increase the awareness of poor design principles.
Verifying design against the specific operating environment is important, but this cannot be achieved without more efficient sharing of operational experience. OREDA handbooks are published every 4–5 years, but the experience transfer should be more continuous and include more details about improper design solutions. It may be questioned why many design-, installation-, and commissioning-related failures were not revealed before start-up, or corrected in the first (initial) period of operation. Commissioning seems to be a particularly critical phase, since it covers the detailed testing after installation. Commissioning is carried out with many temporary arrangements, to allow system-by-system testing and to compensate for not having all interfacing systems available. Improper dismantling of temporary arrangements may introduce new failures. The research did not cover any in-depth analysis of pre-start-up failures, but it is a concern that the introduction of standards like IEC 61508 (and IEC 61511) does not seem to have brought any significant reduction of such failures, despite all the requirements regarding verification and validation activities. The topic may be pursued in future research.

Generic data and/or manufacturer data define the starting point for the expected reliability performance, but a single facility may alone experience too few CCFs to be able to update beta with operational experience. This was the main motivation for introducing equipment-specific checklists as a means to update the initial beta-values with facility-specific considerations about CCF vulnerabilities. New values of beta may be used in the updating of quantitative risk analyses. The checklists themselves may be used more actively in the management of barrier performance. More research is needed to verify that the most important factors have been embedded in the CCF checklists. With more insight it may be possible to reassess the choice of questions and weights, and the possibility of more automated updating of the checklists should be investigated.

The drawback of operational reviews is the manual effort needed to study the reported failures. At the same time, it is important to emphasize that this effort (despite being time-consuming) has a value of its own: operational reviews bring engineering disciplines and operational managers together. In the discussions of failures and failure classification, we have seen an increased awareness of why reporting is important and what role each person can take to contribute to higher quality. However, many systems automatically harvest large amounts of data that are not fully exploited. Combining different types of data to obtain more insight into failure causes is an interesting topic for further research. For example, data about the technical state of components (e.g., degradation state or age) may be combined with data about process states and conditions (e.g., sand production level, variations in temperatures or pressures, demand rates).
The exploitation of such data may require the bridging of several types of models, including statistical models, dynamic process models, and models of physics of failure.
Acknowledgment

The present work has been carried out as part of the research project "Tools and guidelines for overall barrier management and reduction of major accident risk in the petroleum industry", sponsored by the Research Council of Norway (Grant number 220841/E30) and the PDS participants (www.sintef.no/pds). Thanks to all our colleagues and the personnel from the operating companies who have contributed with their knowledge, comments, and operational experience.
References

[1] Brand P. UPM 3.1: A pragmatic approach to dependent failures assessment for standard systems. UK: AEA Technology; 1996.
[2] Edwards G, Watson A. A study of common-mode failures. SRD R-146. Wigshaw Lane, Culcheth, Warrington (WA3 4NE): United Kingdom Atomic Energy Authority; 1979.
[3] Hauge S, Hoem ÅS, Håbrekke S, Lundteigen MA. Common cause failures in safety instrumented systems: Beta-factors and equipment specific checklists based on operational experience (Report no. SINTEF A26922). Trondheim, Norway: SINTEF; 2015.
[4] Hauge S, Håbrekke S, Lundteigen MA. Using field experience in the treatment of common cause failures in reliability assessment. In: Safety and reliability: Methodology and applications. Proceedings of the European Safety and Reliability Conference, ESREL 2014, Poland, 14–18 September 2014. London, UK: Taylor & Francis; 2015.
[5] Hauge S, Håbrekke S, Onshus T. Reliability data for safety instrumented systems. Trondheim, Norway: SINTEF; 2013.
[6] Hauge S, Kråkenes T, Hokstad P, Håbrekke S, Jin H. Reliability prediction method for safety instrumented systems. Trondheim, Norway: SINTEF; 2013.
[7] Hauge S, Onshus T, Øien K, Grøtan TO, Holmstrøm S, Lundteigen MA. Uavhengighet av sikkerhetssystemer offshore – status og utfordringer [Independence of offshore safety systems – status and challenges]. STF50 A06011; 2006 (in Norwegian).
[8] Hokstad P. A generalisation of the beta factor model. Probab Saf Assess Manag 2004;1–6:1363–8.
[9] Hokstad P, Corneliussen K. Loss of safety assessment and the IEC 61508 standard. Reliab Eng Syst Saf 2004;83:111–20.
[10] Hokstad P, Rausand M. Common cause failure modeling: Status and trends. In: Misra KB, editor. Handbook of performability engineering. London: Springer; 2008.
[11] Hokstad P, Maria A, Tomis P. Estimation of common cause factors from systems with different numbers of channels. IEEE Trans Reliab 2006;55:18–25.
[12] Humphreys RA. Assigning a numerical value to the beta factor common cause evaluation. Reliability 1987.
[13] IEC 60050-192. International electrotechnical vocabulary – Part 192: Dependability. Geneva: International Electrotechnical Commission; 2015.
[14] IEC 61508. Functional safety of electrical/electronic/programmable electronic safety-related systems, parts 1–7. Geneva: International Electrotechnical Commission; 2010.
[15] IEC 61511. Functional safety – Safety instrumented systems for the process industry sector. Geneva: International Electrotechnical Commission; 2003.
[16] IEC 62061. Safety of machinery – Functional safety of safety-related electrical, electronic and programmable electronic control systems. Geneva: International Electrotechnical Commission; 2005.
[17] ISA TR84.00.02. Safety instrumented functions (SIF) – safety integrity level (SIL) evaluation techniques, parts 1–5. Research Triangle Park, NC: The Instrumentation, Systems, and Automation Society; 2002.
[18] ISO/TR 12489. Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems. Geneva: International Organization for Standardization; 2013.
[19] Johnston B. A structured procedure for dependent failure analysis (DFA). Reliab Eng 1987;19:125–36.
[20] Lundteigen MA, Rausand M. Common cause failures in safety instrumented systems on oil and gas installations: Implementing defenses through function testing. J Loss Prev Process Ind 2007;20:218–29.
[21] Mosleh A. Common cause failures: An analysis methodology and examples. Reliab Eng Syst Saf 1991;34:249–92.
[22] NEA. International common-cause failure data exchange. ICDE general coding guidelines. Technical note NEA/CSNI/R(2004)4. Nuclear Energy Agency; 2004.
[23] NEA. International Common Cause Failure Data Exchange Project web page, 〈http://www.nea.fr/html/jointproj/icde.html〉; 2014.
[24] NUREG-75/014. Reactor safety: An assessment of accident risk in U.S. commercial nuclear power plants. WASH-1400. Washington, DC: U.S. Nuclear Regulatory Commission; 1975.
[25] NUREG/CR-4780. Procedures for treating common cause failures in safety and reliability studies, vol. 2. Washington, DC: U.S. Nuclear Regulatory Commission; 1989.
[26] NUREG/CR-5485. Guidelines on modeling common-cause failures in probabilistic risk assessment. Washington, DC: U.S. Nuclear Regulatory Commission; 1998.
[27] NUREG/CR-6268. Common-cause failure databases and analysis system: Event data collection, classification, and coding. Washington, DC: U.S. Nuclear Regulatory Commission; 2007.
[28] OREDA. OREDA reliability data. 5th ed. Oslo, Norway: OREDA Participants; 2009.
[29] OREDA. OREDA reliability data. 6th ed. Oslo, Norway: OREDA Participants; 2015.
[30] Parry G. Common cause failure analysis: A critique and some suggestions. Reliab Eng Syst Saf 1991;34:309–20.
[31] Paula H, Campbell D, Rasmuson D. Qualitative cause-defense matrices: Engineering tools to support the analysis and prevention of common cause failures. Reliab Eng Syst Saf 1991;34:389–415.
[32] Rahimi M, Rausand M, Lundteigen MA. Management of factors that influence common cause failures of safety instrumented systems in the operational phase. In: Advances in safety, reliability and risk management – Proceedings of the European Safety and Reliability Conference, ESREL 2011. CRC Press; 2012.
[33] Rasmuson DM. Some practical considerations in treating dependencies in PRAs. Reliab Eng Syst Saf 1991;34:327–43.
[34] Rausand M. Reliability of safety-critical systems: Theory and applications. Hoboken, NJ: Wiley; 2012.
[35] SKI. Investigates events that (potentially) lead to CCF. SKI Technical Report 98:09. Sweden: Swedish Nuclear Power Inspectorate; 1998.
[36] Smith A, Watson I. Common cause failures – a dilemma in perspective. Reliab Eng 1980;1:127–42.
[37] Watson IA, Johnston BD. Treatment of dependent failure in PSA. Nucl Eng Des 1989;115:133–42.