IPL2 and 3 performance improvement method for process safety using event correlation analysis


Computers and Chemical Engineering 34 (2010) 2007–2013 Contents lists available at ScienceDirect Computers and Chemical Engineering journal homepage...



Computers and Chemical Engineering journal homepage: www.elsevier.com/locate/compchemeng

IPL2 and 3 performance improvement method for process safety using event correlation analysis Junya Nishiguchi ∗ , Tsutomu Takai Yamatake Corporation, 1-12-2, Kawana, Fujisawa, Kanagawa 251-8522, Japan

Article info

Article history:
Received 11 December 2009
Received in revised form 22 July 2010
Accepted 26 July 2010
Available online 5 August 2010

Keywords: EEMUA; IPL; Alarm management; Event correlation; Clustering; Knowledge extraction

Abstract

Alarm management efforts have recently intensified, and many tools based on the guidelines in EEMUA 191 have been used to analyze alarm system performance at chemical sites. However, attempts to improve alarm systems using conventional methods have not made satisfactory progress, because they focused on evaluating current performance without providing information that would be useful for improving and rationalizing the alarm system. In this paper a novel method using event correlation analysis is proposed as a means of improving IPL2 and 3 performance, including alarm systems and operator actions. This method’s effectiveness was evaluated with data from an alarm system improvement project at a chemical site. © 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Safe operation is a top priority for chemical plants. To provide protection from hazardous incidents, a safety concept that has been extensively applied at various plants (AIChE/CCPS, 1993) is that of eight independent protection layers (IPLs), as shown in Table 1. The second and third layers (IPL2 and 3) are related to the alarm system. The primary purpose of IPL2 is the supervision of a normally operating plant with a basic process control system. When the process variables deviate from the set points, the system activates alarms and notifies the operator to take corrective action. IPL3 represents critical alarms and corresponding operator interventions. Failure in the functions of IPL2 and 3 results in production loss. Therefore, it is important from both a safety and a production standpoint for IPL2 and 3 to function effectively.

For rationalization of alarm systems, EEMUA 191 (Engineering Equipment & Materials Users’ Association, 1999) is widely accepted as a de facto standard. EEMUA 191 defines an alarm system as “a very important way of automatically monitoring the plant condition and attracting the attention of the process plant operator to significant changes that require assessment or action.” It specifies several performance metrics that can be used to assess alarm system performance:

∗ Corresponding author. Tel.: +81 466 52 7123; fax: +81 466 24 3995. E-mail addresses: [email protected] (J. Nishiguchi), [email protected] (T. Takai). 0098-1354/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.compchemeng.2010.07.029

• Operator questionnaires;
• Alarm usefulness surveys;
• Assessment of number of alarms in a system;
• Measurement of average alarm rate;
• Measurement of number of alarms following a major plant upset;
• Measurement of operator response time;
• Measurement of number of standing alarms;
• Analysis of the priority distribution of alarms configured and occurring;
• Correlation techniques.

Among these performance metrics, operator questionnaires are especially important for improvement of IPL2 and 3 performance because they evaluate the alarms’ appropriateness for the operators (Bransby & Jenkinson, 1998). However, questionnaire-based evaluation is often subjective, and may contain bias arising from the format of the questionnaires or from individual respondents. Therefore, in order to evaluate IPL2 and 3 performance more objectively, it is necessary to analyze how the actual alarms propagate after a fault, and how the operators intervene in response to the alarms.

As conventional methods for fault propagation analysis, signed directed graphs (Shiozaki, Shibata, Matsuyama, & O’Shima, 1989) and multilevel flow models (Bergquist, Ahnlund, & Larsson, 2003; Dahlstrand, 2002) have been actively researched. Also, there have been a few systematic methods for rationalizing operator actions, the most notable being virtual operator models (Liu, Kosaka, Noda, & Nishitani, 2007a; Liu, Kosaka, Noda, & Nishitani, 2007b; Noda & Nishitani, 2009). However, such model-based methods are complicated and difficult for site engineers or operators to put into practical use because much effort is required to adjust the models whenever the process devices are replaced or operating policy is changed.

In this paper, we provide a new method of improving IPL2 and 3 performance that deals with operator actions as well as alarms, and is easy for site engineers and operators to use. We focus on the relationship between alarms and operator actions, which are extracted from historical event log data with event correlation analysis. In Section 2 of this paper we describe a new method for IPL2 and 3 performance improvement. In Section 3, we provide chemical plant data showing the effectiveness of the method. Section 4 provides conclusions. Section 5 discusses future work.

Table 1. Independent protection layers for process safety.

2. New method for IPL2 and 3 performance improvement

2.1. Concept

This paper proposes a novel method by which even those without particular expertise in data analysis, such as site engineers and operators, can improve the IPL2 and 3 performance of their plant quickly, easily, and effectively. In the method, IPL2 and 3 performance is evaluated by extracting the relationships between alarms and operator actions in their temporal context, as mentioned in EEMUA 191. In other words, the method focuses on whether the alarms and corresponding operator actions are appropriate for each other. Since this method provides clues for analyzing fault propagation paths, the origin of chain alarms, and operator knowledge extracted from event log data for alarms and operator actions, it reduces the need for on-site investigations with the use of piping and instrumentation diagrams.

As a way to extract clues from event data to improve IPL2 and 3, event correlation analysis (Nishiguchi & Tsutsui, 2005), which will be described in detail in the next section, is used. Event correlation analysis defines the correlation of discrete event data, and quantifies the degree of relationships and the order of occurrence of alarms and operator actions. (Note that the target process is assumed to be equipped with a non-faulty alarm system, in the sense that each alarm is able to occur correctly to signal the defined fault.) Examples of extracted relationships that may signify a problem are listed below, together with the corresponding solution.

i. Consequential alarms
When several alarms are strongly related, they are likely to be consequential alarms. The number of such alarms can be reduced by alarm filtering techniques.

ii. Complex operator actions
When several operator actions are strongly related, they are likely to be complex sequential operations, which can be reduced by automation.

iii. Unnecessary alarms
When there is a high-frequency alarm without related operator actions, it may be unnecessary. The occurrence of such alarms can be reduced by changing the alarm settings or putting the contents of the alarm into an operator message.

iv. Causes of upset
When alarms and operator actions occur in a consistent order, the first event to occur is likely to be the source of plant upset.

2.2. Event correlation analysis

This section provides more detail concerning event correlation analysis, which quantifies the interrelationship and sequence of events (alarms and operator actions). Although the correlation coefficient is usually used to measure the relationship between two continuous values, it is well known that it cannot be applied to event data (Li, 1990), such as alarm and operator action log data. In event correlation analysis, event pairs separated by consistent time intervals are considered to be related, since the length of the time lags is determined by factors such as process dynamics and operator response time. A similarity measure between all event pairs, explained below, is calculated from the log data using the probability distribution of the correlation for independent event pairs. Calculating similarities and intervals between occurrence times results in the definition of groups with consequential events as well as their order of occurrence. Specifically, we use the following four steps.

• Step 1. Convert to binary sequences

In the first step, alarm and operator action log data containing occurrence time and event type are obtained from the distributed control system (DCS). The log data is converted into multiple event time series $s_i(t)$, specifically a binary sequence (or point process [Daley & Vere-Jones, 1988]) for each type of event $i$, the value of which is 1 if it occurs one or more times within the time window $\Delta t$ and 0 if it does not (Fig. 1; for simplicity, in the following we refer to event type $i$, etc. as event $i$).



$$
s_i(n) =
\begin{cases}
1, & \text{if some points in } (n\Delta t,\ (n+1)\Delta t] \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

where $n$ is a unit of time, and $\Delta t$ is the window size, which should be adjusted according to variations in the process dynamics and operator response time.
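Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `to_binary_sequences`, the `(time, event_type)` tuple layout, and the use of seconds measured from the start of the observation period are all hypothetical choices made for the example. The window size (written Δt here) is passed as `window`.

```python
import math


def to_binary_sequences(events, observation_time, window):
    """Convert an event log into one 0/1 sequence per event type, as in Eq. (1).

    events: iterable of (time, event_type), with time in seconds from the
            start of the observation period (an assumed log layout).
    observation_time: total observation time T, in seconds.
    window: window size (Delta t), in seconds.
    """
    n_windows = int(observation_time // window)
    sequences = {}
    for time, event_type in events:
        # Window n covers the half-open interval (n*window, (n+1)*window].
        n = math.ceil(time / window) - 1
        if 0 <= n < n_windows:
            sequences.setdefault(event_type, [0] * n_windows)[n] = 1
    return sequences


# Hypothetical log: two event types, 20 min of data, 5 min windows.
log = [(30, "ALARM_1"), (45, "ALARM_1"), (300, "ALARM_1"),
       (310, "OP_1"), (900, "OP_1")]
sequences = to_binary_sequences(log, observation_time=1200, window=300)
# sequences["ALARM_1"] is [1, 0, 0, 0]; sequences["OP_1"] is [0, 1, 1, 0]
```

Repeated occurrences inside one window collapse to a single 1, which is why the three `ALARM_1` records above produce only one non-zero entry.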


Fig. 1. Conversion to binary sequences.

A binary sequence is generated for each event type (i.e., each kind of alarm or operator action).

• Step 2. Cross-correlation

Fig. 2. Cross-correlation.

The cross-correlation function (2) between events $i$ and $j$ is calculated; it indicates the number of occurrences of event $j$ following event $i$ in the time range $(m\Delta t, (m+1)\Delta t)$.

$$
c_{ij}(m) =
\begin{cases}
\displaystyle\sum_{n=1}^{T/\Delta t - m} s_i(n)\, s_j(n+m), & m \ge 0 \\
c_{ji}(-m), & m < 0
\end{cases}
\qquad -K \le m \le K
\tag{2}
$$

where $T$ is the observation time, and $K$ is the maximum lag as measured in a time unit defined as a parameter (Fig. 2). The cross-correlation function for binary sequences is equivalent to counting the windows in which both sequences are one when one sequence is shifted along the time axis. The maximum correlation value $c_{ij}^*$ and the time lag at maximum correlation $m_{ij}^*$ are defined as follows:

$$
c_{ij}^* = \max_m c_{ij}(m), \qquad m_{ij}^* = \arg\max_m c_{ij}(m)
\tag{3}
$$
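The counting in Eqs. (2) and (3) can be illustrated with a short Python sketch. The function names are hypothetical, list indices start at 0 rather than 1, and ties between lags are broken in favor of the smallest lag, which is an assumption the text does not specify.

```python
def cross_correlation(s_i, s_j, m):
    """c_ij(m) from Eq. (2): number of windows n with s_i(n) = 1 and s_j(n + m) = 1."""
    if m < 0:
        # Negative lags are defined by symmetry: c_ij(m) = c_ji(-m).
        return cross_correlation(s_j, s_i, -m)
    # zip stops at the shorter sequence, so n runs over the overlapping part only.
    return sum(a * b for a, b in zip(s_i, s_j[m:]))


def max_correlation(s_i, s_j, max_lag):
    """Return (c*_ij, m*_ij) from Eq. (3), searching lags -max_lag..max_lag."""
    best_lag = max(range(-max_lag, max_lag + 1),
                   key=lambda m: cross_correlation(s_i, s_j, m))
    return cross_correlation(s_i, s_j, best_lag), best_lag


# Hypothetical sequences in which event j follows event i by one window.
s_i = [1, 0, 0, 1, 0, 0, 1, 0]
s_j = [0, 1, 0, 0, 1, 0, 0, 1]
c_star, m_star = max_correlation(s_i, s_j, max_lag=3)
# c_star is 3 and m_star is 1: j occurs one window after i on all three occasions.
```

A positive `m_star` means the second event tends to follow the first, which is how the order of occurrence is read off later in the method.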

Although the maximum correlation $c_{ij}^*$ indicates the degree of relationship between the event pairs, it is not appropriate for comparing event pairs with greatly different occurrence rates.

• Step 3. Similarity between event pairs with probability distribution

Therefore, a probability distribution is introduced to evaluate the similarity so that the results are not affected by different occurrence rates. The actual maximum correlation value $c_{ij}^*$ is evaluated by means of the probability distribution of the correlation for independent event pairs. As a result, the similarity between two events $R_{ij}$ is defined by the probability that the correlation between two independent events is lower than $c_{ij}^*$ within the maximum lag $K$.

$$
R_{ij} = P\left(c_{ij}(m) < c_{ij}^* \mid -K \le m \le K\right) \cong \left\{ \sum_{l=0}^{c_{ij}^* - 1} \frac{\lambda^l e^{-\lambda}}{l!} \right\}^{2K+1}
\tag{4}
$$

In (4), $\lambda$ is the expected value of the Poisson distribution for the occurrence of independent events, which is approximated by the average number of co-occurrences between two independent events as shown in Eq. (5).

$$
\lambda = \frac{T}{\Delta t} \cdot p_i \cdot p_j \cong \frac{\Delta t}{T} \sum_{n=0}^{T/\Delta t} s_i(n) \cdot \sum_{n=0}^{T/\Delta t} s_j(n)
\tag{5}
$$

In other words, this method conducts a statistical test of the hypothesis that two events $i$ and $j$ are generated independently. The actual maximum correlation $c_{ij}^*$ is compared with the distribution of the correlation between two independent events. The similarity $R_{ij}$ is calculated by subtracting the reject rate from one.

Fig. 3. Similarity between event pairs.


Fig. 4. Clustering.

Even if the calculated maximum correlation values $c_{ij}^*$ are the same, an event pair with frequent occurrence has a high probability of co-occurrence, resulting in a small similarity. Conversely, an event pair with rare occurrence has a low probability of co-occurrence, resulting in a large similarity (Fig. 3).
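The similarity of Step 3 can be sketched as follows. This is a minimal illustration using only the standard library; the function names are hypothetical, and the expected co-occurrence count is computed as the product of the two occurrence counts divided by the number of windows, which is how Eq. (5) is read here.

```python
import math


def expected_cooccurrences(s_i, s_j):
    """Poisson mean for independent events: (number of windows) * p_i * p_j, read from Eq. (5)."""
    n_windows = len(s_i)
    return sum(s_i) * sum(s_j) / n_windows


def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with mean lam (0 when k < 0)."""
    return sum(lam ** l * math.exp(-lam) / math.factorial(l) for l in range(k + 1))


def similarity(c_max, lam, max_lag):
    """R_ij from Eq. (4): probability that an independent pair stays below
    c_max at every one of the 2K + 1 lags."""
    return poisson_cdf(c_max - 1, lam) ** (2 * max_lag + 1)


# Same maximum correlation (5 co-occurrences), different occurrence rates.
# The Poisson means 0.5 and 4.0 and the lag K = 6 are hypothetical values.
rare_pair = similarity(c_max=5, lam=0.5, max_lag=6)
frequent_pair = similarity(c_max=5, lam=4.0, max_lag=6)
# rare_pair is close to 1, frequent_pair is close to 0.
```

The two calls reproduce the behavior described above: five co-occurrences are strong evidence of a relationship for a rarely occurring pair, and weak evidence for a pair that would often coincide by chance.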

• Step 4. Clustering using pair-wise similarities

In the final step, groups with highly related events are identified. Although any kind of clustering method that utilizes a pair-wise similarity can be applied, we use the hierarchical clustering method, which has been used widely and is easily understood. Hierarchical clustering merges clusters (which include event pairs with high similarity) iteratively (Fig. 4). We can adjust the threshold to extract the clusters that contain the events with higher similarity than the threshold. In addition, the order of occurrence of each event pair is estimated with the time lag $m_{ij}^*$, which represents the most probable time delay.

It will be noted that in chemical plants, the probability of occurrence of one event may affect the probability of another event due to dependence. While such conditional probability of events is not considered in our method, the similarity can be defined as the relationship between three or more events. However, determining the similarity between three or more events would yield little advantage and would require a large amount of calculation. For example, consider three related process variables (temperature A, pressure B, and flow rate C) and the corresponding upper limit alarms as shown in Fig. 5. In this case, the similarity between the three events will be zero, since the upper limit alarms did not occur simultaneously. In contrast, our method appropriately estimates the relationships between three or more events by means of the similarities between event pairs. In the case of Fig. 5, the similarities between temperature A and pressure B, and between pressure B and flow rate C, will be large values. As a result of clustering, these three events will belong to the same group. As shown in the example, since three or more simultaneously correlated events seldom exist in the actual process event log, calculating the similarity measure of event pairs is sufficient for practical use.

3. Validation with chemical plant data

3.1. Target plant

The proposed method was applied to data from a chemical plant for validation. The target plant has a typical multiple-stage gas purification unit. According to the event log data obtained from this unit, 1267 types of alarms and operator actions occurred a total of 56,350 times over a period of 2 months.

3.2. Parameter adjustment

In this method, three parameters (maximum time lag $K$, similarity threshold in clustering (mentioned in step 4), and window size $\Delta t$) should be adjusted. The first, maximum time lag, represents the time delay of the target process. Since this parameter does not affect the results greatly, it can be set at a rather large value. In the case of this chemical plant, it was set at 1 h. The second parameter, the similarity threshold, is simply for arrangement of the results, and it also is not critical for the calculation. It was set at 99.5.

The last parameter, the window size $\Delta t$, is the most significant for this method. Its value should be larger than the variations in the process dynamics and operator response time. If $\Delta t$ is smaller than these variations, event pairs that are related will not have high similarities. However, if it is too large, unrelated event pairs may have high similarities. Since process dynamics and operator response times vary widely, and operator responses differ with the situation, it is difficult to estimate a suitable $\Delta t$ from the event log data only. For that reason, a systematic method for tuning $\Delta t$ remains an open issue. In practical cases, however, the parameter can be determined using the variability of process dynamics and operator response time as estimated by the site engineers and operators. In case studies with several plants, including refineries and petrochemical plants, about 5–10 min has been an effective range for this parameter, with no appreciable difference due to type of plant. However, since the proposed method assumes that variability is constant, applying the method to event log data from plants where variability changes greatly at different times is difficult at present.

Fig. 5. Example for correlation of three alarms.

Fig. 6. Example of consequential alarms.

3.3. Results

Event log data from the chemical plant was analyzed with a software product that implements event correlation analysis. The software generated the groups of consequential events in a few seconds. From discussions with site engineers and operators, we were able to plan solutions for IPL2 and 3 performance improvements for the top 35 groups, which accounted for 60% of the occurrences, within only 6 h.
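How groups of related events might be formed from pair-wise similarities (Step 4) can be sketched as below. The paper does not state which linkage its hierarchical clustering uses, so this example assumes single linkage; cutting a single-linkage hierarchy at a threshold gives the same groups as merging every pair whose similarity exceeds that threshold, which is what the union-find code does. The function name, event names, similarity values, and the threshold of 0.995 are all hypothetical.

```python
def cluster_events(similarities, threshold):
    """Group events whose pair-wise similarity exceeds the threshold.

    similarities: dict mapping (event_a, event_b) to a similarity in [0, 1].
    Returns a sorted list of groups (each a sorted list) with two or more events.
    Assumes single linkage, which the source does not specify.
    """
    parent = {}

    def find(event):
        # Union-find root lookup with path compression.
        parent.setdefault(event, event)
        while parent[event] != event:
            parent[event] = parent[parent[event]]
            event = parent[event]
        return event

    for (a, b), r in similarities.items():
        root_a, root_b = find(a), find(b)
        if r > threshold:
            parent[root_a] = root_b

    groups = {}
    for event in parent:
        groups.setdefault(find(event), []).append(event)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


# Hypothetical similarities patterned on the three-alarm example of Fig. 5.
pair_similarities = {
    ("TEMP_A_HI", "PRES_B_HI"): 0.999,
    ("PRES_B_HI", "FLOW_C_HI"): 0.998,
    ("TEMP_A_HI", "FLOW_C_HI"): 0.40,
    ("FLOW_D_LO", "TEMP_A_HI"): 0.10,
}
groups = cluster_events(pair_similarities, threshold=0.995)
# groups is [["FLOW_C_HI", "PRES_B_HI", "TEMP_A_HI"]]
```

The three alarms end up in one group even though the A-C similarity is low, because A-B and B-C are both above the threshold. `FLOW_D_LO` has no strong partner and is left out of every group, which is the pattern the method associates with a possibly unnecessary alarm when such an event occurs frequently.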
We will next give some examples of the results, which reflected actual process properties and operator behavior correctly.

I. Consequential alarms
The five types of events listed in Fig. 6 were extracted as related events by means of event correlation analysis. The total count of these events was 702. From the occurrence times in the log data, we found that Alarms 1, 2, 3 and 4 followed Operation 1 at around midnight every day. From interviews with operators we learned that when they switched the blower to manual operation, Alarm 4 notified them of blower operation in the pipes. As a result, the air and steam flow rates fluctuated, triggering Alarms 1, 2 and 3. There were no operator actions corresponding to these alarms. This is a typical example of consequential alarms due to improper state-based alarming, in this case during blower operation in the pipes. Since there was no corresponding operator action, it was decided to eliminate Alarms 1, 2 and 3 during manual blower operation using an alarm filtering technique. In addition, Alarm 4, which only gave guidance to the operators, was changed to a message event.

II. Complex operator actions
The five events listed in Fig. 7 were also extracted as related events. The total count was 495. From the occurrence times in the log data, we found that Operations 2 and 3 and Alarm 6 occurred simultaneously at around 4:00 a.m., and Operations 2 and 4 and Alarm 5 occurred simultaneously at around 4:00 p.m. Investigation of the daily report found that these events are related to pipe flushing. In system A, operators first changed the alarms to pipe flushing mode (Operation 2), then the pipes were flushed (Operation 3), and then Alarm 6 signaled pipe flushing. Likewise, in system B, operators first changed the alarms to pipe flushing mode (Operation 2), then the pipes were flushed (Operation 4), and then Alarm 5 signaled pipe flushing. This example is representative of regularly performed operator actions. In general, regular operator intervention should not be required in order to maintain stable operation. Therefore, Operations 2, 3, and 4 were automated with a timer, and Alarms 5 and 6 were changed to messages.

III. Unnecessary alarms
Alarm 7, shown in Fig. 8, is a low limit alarm for flow rate, and occurred 657 times. According to the event correlation analysis, there were no alarms or operator actions related to Alarm 7. Thus, this alarm was judged to occur independently of other alarms and operator actions. The operators confirmed that they did not take any action in response to this alarm, but waited until it returned to normal by itself. This kind of alarm, which did not function as a true alarm (i.e., it was unnecessary), should be removed, because it can reduce the operators’ sensitivity to alarms.

IV. Cause of upset

Fig. 7. Example of complex operator actions.


Fig. 8. Example of unnecessary alarms.

Fig. 9. Example of cause of plant upset.

Fig. 9 shows an example of a cause of plant upset. As seen in the figure, six events occurred at devices located near each other. According to the event log data, Operation 5 was a manipulated variable change action, Operation 6 was an upper threshold change action, and Alarms 8, 9, 10, and 11 were all upper limit alarms. According to the order of occurrence indicated by the analysis, Alarm 8 occurred first and Operation 6 occurred last. These results suggested a problem with pipes near the heat exchanger that were cooled with seawater. In fact, a field investigation revealed that these pipes easily became clogged, resulting in poor cooling. It was therefore decided that the pipes would be cleaned regularly to solve the problem.

4. Conclusions

This paper describes a new method for enhancing IPL2 and 3 performance, for better process safety and productivity. Our method is based on event correlation analysis utilizing event log data that includes alarms and operator actions. The method is superior in terms of extracting process knowledge, including process behaviors and corresponding operator interventions, without the need for detailed process information. We describe the application of this method during process improvement activities at a chemical plant. The method had a significant effect by identifying consequential alarms, complex operator actions, unnecessary alarms, and causes of process upset.

5. Future work

One of the open issues for our method is the establishment of guidelines for adjusting the window size parameter $\Delta t$, as mentioned in Section 3.2. In order to improve IPL2 and 3 performance more effectively, we must utilize the results of our method (number of unnecessary alarms or average operator response time, etc.) as performance metrics for process safety and productivity. While performance metrics for evaluation of alarm systems and operator actions have begun to be proposed recently (Takai, Higuchi, Shimameguri, Kurooka, & Noda, 2010), the use of metrics combined with the result of the event correlation analysis will enable us to evaluate processes more objectively. Making such metrics available in real time would lead to significant advantages. Now that our method has been applied at a number of plants (Higuchi, Yamamoto, Takai, Noda, & Nishitani, 2009), in order to extend its range of application we must apply it also to batch plants. A potential issue is the need for preprocessing, such as dividing all of the log data by product and using archival records of field operations in the event log data.

References

AIChE/CCPS. (1993). Guidelines for engineering design for process safety. New York: American Institute of Chemical Engineers, Center for Chemical Process Safety.
Bergquist, T., Ahnlund, J., & Larsson, J. E. (2003). Alarm reduction in industrial process control. In Proceedings of the ETFA 2003, vol. 2 (pp. 58–65).
Bransby, M. L., & Jenkinson, J. (1998). The management of alarm systems. HSE contract research report 166/1998. HSE Books.
Dahlstrand, F. (2002). Consequence analysis theory for alarm analysis. Knowledge-Based Systems, 15(1), 27–36.
Daley, D. J., & Vere-Jones, D. (1988). An introduction to the theory of point processes. In Springer series in statistics. Springer.
Engineering Equipment & Materials Users’ Association. (1999). Alarm systems. A guide to design, management and procurement. EEMUA publication no. 191.
Higuchi, F., Yamamoto, I., Takai, T., Noda, M., & Nishitani, H. (2009). Use of event correlation analysis to reduce number of alarms. In Proceedings of 10th international symposium on process systems engineering PSE2009 (pp. 1521–1526).
Li, W. (1990). Mutual information versus correlation functions. Journal of Statistical Physics, 60, 823–831.
Liu, X., Kosaka, H., Noda, M., & Nishitani, H. (2007a). Model-based dynamic evaluation to support the design of alarm systems. Part 1. Development of virtual subject. Human Factors in Japan, 11(2), 118–127.
Liu, X., Kosaka, H., Noda, M., & Nishitani, H. (2007b). Model-based dynamic evaluation to support the design of alarm systems. Part 2. Case study of a boiler plant simulator. Human Factors in Japan, 11(2), 128–138.
Nishiguchi, J., & Tsutsui, H. (2005). A new approach to process alarm reduction using statistical point processes. In SICE annual conference 2005 (pp. 443–448).
Noda, M., & Nishitani, H. (2009). Optimal assignment of plant operators on basis of shift’s ability evaluation. In Proceedings of 10th international symposium on process systems engineering PSE2009 (pp. 2067–2072).
Shiozaki, J., Shibata, B., Matsuyama, H., & O’Shima, H. (1989). Fault diagnosis of chemical processes utilizing signed directed graphs - Improvement by using temporal information. IEEE Transactions on Industrial Electronics, 36(4), 469–474.
Takai, T., Higuchi, F., Shimameguri, A., Kurooka, T., & Noda, M. (2010). A Comprehensive evaluation method of alarm system from the standpoint of 8 characteristics. In Proceedings of 5th international symposium on design, operation and control of chemical processes, PSE Asia 2010.