10th International Symposium on Process Systems Engineering - PSE2009 Rita Maria de Brito Alves, Claudio Augusto Oller do Nascimento and Evaristo Chalbaud Biscaia Jr. (Editors) © 2009 Elsevier B.V. All rights reserved.
IPL2&3 Performance Improvement Method for Process Safety Using the Event Correlation Analysis
Junya Nishiguchi, Tsutomu Takai
Yamatake Corporation, 1-12-2, Kawana, Fujisawa-shi, Kanagawa, Japan
Abstract
Alarm management efforts have recently intensified worldwide, and many tools based on the EEMUA191 guideline are available to evaluate alarm system performance. However, the improvement and rationalization of alarm systems have not made satisfactory progress, because these tools usually provide only alarm system performance metrics and not the information needed to improve and rationalize alarm systems. In this paper, a novel method for improving the performance of IPL2&3, including the alarm system, using the Event Correlation Analysis is proposed.
Keywords: EEMUA, IPL, alarm management, event correlation, clustering
1. Introduction
Safe operation is the top priority for process plants. As a safe-design concept that provides protection from hazardous incidents, Independent Protection Layers (IPLs), which consist of the eight layers shown in Table 1, have been applied extensively to various plants (AIChE/CCPS 1993). The second layer (IPL2) and the third layer (IPL3) are related to the alarm system. The primary purpose of IPL2 is the supervision of the plant under normal operation with the Basic Process Control System. When process variables deviate from their set points, the system activates alarms and requires corrective actions from the operator. IPL3, on the other hand, represents the critical alarms and the corresponding operator interventions. Failure of the IPL2&3 functions results in production loss. Therefore, it is important that IPL2&3 function effectively from both the safety and the production standpoint.

Table 1. Independent Protection Layers for Process Safety

Layer   Definition
IPL8    Community Emergency Response
IPL7    Plant Emergency Response
IPL6    Physical Protection (Dikes)
IPL5    Physical Protection (Relief Devices)
IPL4    Automatic Action SIS or ESD
IPL3    Critical Alarms, Operator Supervision, and Manual Intervention
IPL2    Basic Controls, Process Alarms, and Operator Supervision
IPL1    Process Design

Functions (each spanning a group of layers): Minimize damage from an incident; Ensure safe and productive operations; Prevent an incident from happening.
For the rationalization of alarm systems, EEMUA191 (EEMUA 1999) is widely accepted as a de facto standard guideline. In EEMUA191, an alarm system is defined as a very important way of automatically monitoring the plant condition and attracting the attention of the process plant operator to significant changes that require assessment or action. It specifies several performance metrics that can be used to assess alarm system performance:
- Operator questionnaires
- Alarm usefulness surveys
- Assessment of number of alarms in a system
- Measurement of average alarm rate
- Measurement of number of alarms following a major plant upset
- Measurement of operator response time
- Measurement of number of standing alarms
- Analysis of the priority distribution of alarms configured and occurring
- Correlation techniques
Unfortunately, correlation techniques that are effective for rationalizing IPL2&3 have not been well defined. Analyzing the correlation between alarms and operator actions during a process upset is expected to reveal the propagation path, the source of the upset, and nuisance alarms, all of which are useful for improving the alarm system and for understanding IPL2&3 performance properly.
2. New Method for IPL2&3 Performance Improvement

2.1. Concept
As stated in EEMUA191, every alarm should be defined uniquely to notify a specific upset situation, and every operator action should be defined to resolve the corresponding alarm. In a plant with poor IPL2&3, once a certain upset causes alarms, it propagates via the process fluid and consequently leads to an alarm flood; meanwhile, alarms without corrective actions increase the operators' workload as nuisance alarms. Therefore, to rationalize IPL2&3 performance, every alarm and operator action should be properly related to each other.

As a conventional method for fault propagation analysis, Signed Directed Graphs (Shiozaki et al. 1989) have been actively researched. However, this method is difficult to put into practical use because it requires considerable effort to adjust the process model whenever process devices are replaced. Also, to our knowledge, there has been no systematic method for rationalizing operator actions.

The concept of our method is to provide a fast, easy, and effective way to improve IPL2&3 performance. The relationships and occurrence order between events, which comprise alarm events and operator action events, are extracted from event log data with the Event Correlation Analysis (Nishiguchi et al. 2005). In this method, event pairs with constant occurrence time lags are considered to be similar events, since the time lags depend on the time delay of the process dynamics and the human reaction time. From the similarity and the occurrence time lag calculated by the Event Correlation Analysis, we can estimate groups of consequential events as well as their occurrence order. Our method enables us to extract the relationships between changes in process variables and operator behavior from accumulated log data alone, so that solutions for improving the IPL2&3 design and management can be found rapidly and easily.
Examples of extracted relationships and the corresponding solutions are as follows.
I. Consequential Alarms
When several alarms are strongly related, they are likely to be consequential alarms, which can be reduced by alarm filtering techniques.
II. Complex Operator Actions
When several operator actions are strongly related, they are likely to be complex sequential operations, which can be reduced by automating the operations.
III. Redundant Alarms
When a frequently occurring alarm has no related operator actions, it may be a redundant alarm, which can be reduced by changing the set point or replacing the alarm with a message.
IV. Causes of Upset
When a constant occurrence order of alarms and operator actions is found, the first event to occur is likely to be the source of the plant upset.

2.2. Event Correlation Analysis
Event Correlation Analysis quantifies the relationship and occurrence order between alarms and operator actions. Although the correlation coefficient is usually used to measure the relationship between two continuous values, it is well known that the coefficient cannot be applied to event data (Li 1990) such as alarm and operator action logs. In this method, the similarities between all event pairs are calculated from the log data using the probability distribution of the correlation for independent event pairs (Figure 2). The alarm and operator action log data obtained from a Distributed Control System (DCS) contain the occurrence time and the event kind. The log data are converted into an event time series $s_i(n)$, defined as a binary series for each event $i$, which is 1 if the event occurs within the time window $\Delta t$ and 0 if it does not.

$$ s_i(n) = \begin{cases} 1, & \text{if event } i \text{ occurs at some point in } (n\Delta t,\ (n+1)\Delta t] \\ 0, & \text{otherwise} \end{cases} \tag{1} $$
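where $\Delta t$ is the window size and $n$ is the time index in units of $\Delta t$. The cross correlation function (2) between events $i$ and $j$ gives the number of times that event $j$ follows event $i$ with a delay in the time range $(m\Delta t,\ (m+1)\Delta t]$.

$$ c_{ij}(m) = \begin{cases} \displaystyle\sum_{n=1}^{T/\Delta t - m} s_i(n)\, s_j(n+m), & m \ge 0 \\ c_{ji}(-m), & m < 0 \end{cases} \qquad -K \le m \le K \tag{2} $$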
where $T$ is the observation time and $K$ is the maximum lag in time units. From (2), the maximum correlation value $c^*_{ij}$ and the time lag at maximum correlation $m^*_{ij}$ are defined as follows.

$$ c^*_{ij} = \max_m c_{ij}(m), \qquad m^*_{ij} = \arg\max_m c_{ij}(m) \tag{3} $$
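As a minimal sketch of Eqs. (1) to (3), the following Python code bins event timestamps into a binary series and evaluates the cross correlation over a range of lags. The function names (`to_binary_series`, `cross_correlation`, `max_correlation`), the zero-based half-open windows, and the example timestamps are assumptions made for illustration; they are not taken from the paper.

```python
import math

def to_binary_series(times, dt, total_time):
    """Eq. (1): s[n] = 1 if the event occurs in window n of width dt, else 0.
    Assumption: windows are indexed from 0 and taken as [n*dt, (n+1)*dt)."""
    n_windows = math.ceil(total_time / dt)
    s = [0] * n_windows
    for t in times:
        n = int(t // dt)
        if 0 <= n < n_windows:
            s[n] = 1
    return s

def cross_correlation(s_i, s_j, m):
    """Eq. (2): number of windows in which event j follows event i by m windows.
    Negative lags use c_ij(m) = c_ji(-m). Both series must have equal length."""
    if m < 0:
        return cross_correlation(s_j, s_i, -m)
    return sum(s_i[n] * s_j[n + m] for n in range(len(s_i) - m))

def max_correlation(s_i, s_j, max_lag):
    """Eq. (3): maximum correlation c*_ij and the lag m*_ij that attains it."""
    best_lag = max(range(-max_lag, max_lag + 1),
                   key=lambda m: cross_correlation(s_i, s_j, m))
    return cross_correlation(s_i, s_j, best_lag), best_lag

# Hypothetical timestamps in seconds: event j tends to follow event i by about 20 s.
times_i = [5, 105, 205, 305]
times_j = [25, 125, 225, 400]
s_i = to_binary_series(times_i, dt=10, total_time=500)
s_j = to_binary_series(times_j, dt=10, total_time=500)
c_star, m_star = max_correlation(s_i, s_j, max_lag=5)
print(c_star, m_star)  # 3 co-occurrences at a lag of 2 windows
```

With a 10-second window, three of the four occurrences of event i are followed by event j two windows later, so the maximum correlation is 3 at lag 2.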
As a result, the similarity $R_{ij}$ between two events is defined as the probability that the correlation between two independent events is lower than $c^*_{ij}$ within the time lag $K$.

$$ R_{ij} = P\left( c_{ij}(m) < c^*_{ij} \mid -K \le m \le K \right) \cong \left\{ \sum_{l=0}^{c^*_{ij}-1} \frac{Q^l e^{-Q}}{l!} \right\}^{2K+1} \tag{4} $$
where $Q$ is the expected value of the Poisson distribution describing the co-occurrence of independent events, which is approximated by the average co-occurrence count of two independent events, as shown in equation (5).

$$ Q = \frac{T}{\Delta t}\, p_i\, p_j \cong \frac{\Delta t}{T} \sum_{n=0}^{T/\Delta t} s_i(n) \sum_{n=0}^{T/\Delta t} s_j(n) \tag{5} $$
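Equations (4) and (5) can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, the number of windows $T/\Delta t$ is taken as the length of the binary series, and the example inputs (a maximum correlation of 3, four occurrences of each event in 50 windows, $K = 5$) are made up.

```python
import math

def poisson_cdf(k, q):
    """P(X <= k) for X ~ Poisson(q); returns 0.0 when k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-q)   # P(X = 0)
    total = term
    for l in range(1, k + 1):
        term *= q / l     # P(X = l) from P(X = l - 1)
        total += term
    return total

def expected_cooccurrence(s_i, s_j):
    """Eq. (5): Q ~ (dt/T) * sum(s_i) * sum(s_j), with T/dt = len(s_i)."""
    return sum(s_i) * sum(s_j) / len(s_i)

def similarity(c_star, q, max_lag):
    """Eq. (4): probability that the correlation of independent events stays
    below c* at each of the 2K+1 lags."""
    return poisson_cdf(c_star - 1, q) ** (2 * max_lag + 1)

# Hypothetical binary series: 4 occurrences each in 50 windows.
s_i = [1 if n in (0, 10, 20, 30) else 0 for n in range(50)]
s_j = [1 if n in (2, 12, 22, 40) else 0 for n in range(50)]
q = expected_cooccurrence(s_i, s_j)          # 16 / 50 = 0.32
r_related = similarity(3, q, max_lag=5)      # observed c* = 3
r_unrelated = similarity(1, q, max_lag=5)    # observed c* = 1
```

A maximum correlation of 3 is unlikely for independent events with $Q = 0.32$, so the similarity is close to one, whereas a maximum correlation of 1 is easily explained by chance and yields a similarity close to zero.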
In other words, this method performs a statistical test of the hypothesis that two events $i$ and $j$ are generated independently. The actual maximum correlation $c^*_{ij}$ is compared with the distribution of the correlation value between two independent events. The similarity $R_{ij}$ is defined as one minus the rejection rate. By applying a clustering algorithm (e.g. hierarchical clustering) to the similarities of all event pairs, groups of similar events are extracted automatically. In addition, the occurrence order of each event pair is estimated from the time lag $m^*_{ij}$, which represents the most probable time delay.
Figure 2. Event Correlation Analysis. (Panels: event log data with time and event name columns; binary series of Event i and Event j shifted by a lag and counted; cross correlation function versus lag, marking the maximum correlation $c^*_{ij}$ at time lag $m^*_{ij}$; probability distribution of the cross correlation, from which the similarity $R_{ij}$ is obtained.)
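The grouping step can be illustrated with a small sketch. The paper names hierarchical clustering only as an example, so the code below swaps in a simple stand-in: it joins all events connected by a pairwise similarity at or above a threshold, which is equivalent to cutting a single-linkage dendrogram at that similarity. The function name `cluster_events`, the threshold of 0.9, and the similarity values are hypothetical.

```python
def cluster_events(similarity, threshold):
    """Group events whose pairwise similarity reaches the threshold
    (connected components, i.e. single-linkage clusters)."""
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), r in similarity.items():
        root_a, root_b = find(a), find(b)
        if r >= threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for event in list(parent):
        groups.setdefault(find(event), set()).add(event)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

# Hypothetical similarities R_ij for a few event pairs.
similarity = {
    ("Alarm1", "Alarm2"): 0.97,
    ("Alarm2", "Alarm3"): 0.95,
    ("Alarm1", "Alarm7"): 0.10,
    ("Operation2", "Operation3"): 0.99,
}
groups = cluster_events(similarity, threshold=0.9)
print(groups)
# [['Alarm1', 'Alarm2', 'Alarm3'], ['Alarm7'], ['Operation2', 'Operation3']]
```

An event that ends up alone in its group, such as the hypothetical Alarm7 here, has no strongly related partner and is a candidate for the redundant-alarm review described in Section 2.1.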
2.3. Numerical Experiment
We evaluated the Event Correlation Analysis with synthetic data consisting of six event types, generated by the following rules for 30 days in one-second units. Events 1 and 4 are generated by Poisson distributions with means of 0.0001 and 0.0004, respectively. Events 2 and 3 are generated from event 1, with noise from a Poisson distribution with a mean of 0.000005 added and occurrences removed with a probability of 50%. In addition, the occurrence time of each event is shifted by the Normal distributions N(600,100) and N(1800,100), respectively. Events 5 and 6 are generated from event 2, with noise from a Poisson distribution with a mean of 0.0002 added and occurrences removed with a probability of 50%. In addition, the occurrence time of each event is shifted by the Normal distributions N(600,100) and N(1800,100), respectively.

Table 3. Results for the synthetic data

Order i → j   Similarity R_ij   Delay m*_ij
1 → 2         0.99              600
1 → 3         0.98              1800
2 → 3         0.96              1200
4 → 5         0.98              600
4 → 6         0.97              1900
5 → 6         0.95              1100
Table 3 shows the similarity and delay time of the related event pairs with higher similarity than 0.9. From the table, we can see the proposed method properly extracted the related event pairs and their occurrence time lags even though the data contained noise.
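The generation rules for events 1 to 3 can be sketched as follows. Several details are assumptions rather than statements from the paper: the per-second Poisson means are treated as rates of a continuous-time Poisson process, the second parameter of N(600,100) is read as a standard deviation in seconds, thinning and shifting are applied before the noise is added, and the random seed and function names are arbitrary.

```python
import random

T = 30 * 24 * 3600  # observation time: 30 days in seconds

def poisson_times(rate, total_time, rng):
    """Occurrence times of a Poisson process with the given rate per second."""
    times = []
    t = rng.expovariate(rate)
    while t < total_time:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def derive_event(parent_times, mean_delay, sd_delay, noise_rate, total_time, rng):
    """Keep each parent occurrence with probability 0.5, delay it by a
    normally distributed lag, then add independent Poisson noise."""
    kept = [t + rng.gauss(mean_delay, sd_delay)
            for t in parent_times if rng.random() < 0.5]
    noise = poisson_times(noise_rate, total_time, rng)
    return sorted(t for t in kept + noise if 0 <= t < total_time)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
event1 = poisson_times(0.0001, T, rng)
event2 = derive_event(event1, 600, 100, 0.000005, T, rng)
event3 = derive_event(event1, 1800, 100, 0.000005, T, rng)
```

Event 1 occurs a few hundred times in 30 days at this rate, so even after half of its occurrences are dropped, the pairs 1 → 2 and 1 → 3 retain enough co-occurrences near the 600-second and 1800-second delays to stand out from independent events.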
3. Validation with Actual Plant Data

3.1. Target Plant
The proposed method was applied to actual chemical plant data for validation. The target plant has a typical multiple-stage gas purification unit. According to the event log data obtained from this unit, 1,267 types of alarms and operator actions occurred a total of 56,350 times over a period of two months.

3.2. Results
The event log data were analyzed with Alarm Analyst R20, a software product that implements the Event Correlation Analysis. The software generated the groups of consequential events within a few seconds. Through discussions with the plant engineers and operators, we were able to plan IPL2&3 improvement measures for the top 35 groups, which accounted for 60% of all occurrences, within only six hours. The following examples show that the results correctly reflect the actual process properties and operator behavior.

I. Consequential alarms
The five events listed in Figure 4 occurred synchronously, with a total count of 702. From the result of the Event Correlation Analysis, we found that Alarms 1, 2, 3 and 4 followed Operation 1 at around midnight every day. According to interviews with the operators, when they switched the blower to manual operation, Alarm 4 notified them of pipe blowing. As a result, the air and steam flow rates became unstable and consequently Alarms 1, 2 and 3 occurred. Since these alarms did not have corresponding operator actions in this situation, it was decided to suppress Alarms 1, 2 and 3 with an alarm filtering technique during manual pipe blowing procedures. In addition, Alarm 4, which only gave guidance to the operators, was changed to a message.

Event        Description
Alarm1       Upper and lower limit of tower air flow rate
Alarm2       Abnormality in steam flow rate
Alarm3       Lower limit of air flow rate
Alarm4       Message about blowing
Operation1   Blower auto start OFF

(Event occurrence timeline over 1 month.)
Figure 4. Example of consequential alarms
II. Complex operator actions
The five events listed in Figure 5 also occurred synchronously, with a total count of 495. From the result of the Event Correlation Analysis, we found that Operations 2 and 3 and Alarm 6 occurred simultaneously at around 4:00 am, and that Operations 2 and 4 and Alarm 5 occurred simultaneously at around 4:00 pm. An investigation of the daily reports found that these events are related to pipe flushing. In system A, the operators first invalidated alarms with Operation 2, then carried out the flushing procedure with Operation 3, which was followed by the pipe-flushing notification from Alarm 6. Likewise, in system B, the operators first invalidated alarms with Operation 2, then carried out the flushing procedure with Operation 4, which was followed by the pipe-flushing notification from Alarm 5. In general, operators should not have to intervene in such regular procedures. Therefore, Operations 2, 3 and 4 were automated with a timer, and Alarms 5 and 6 were changed to messages.

Event        Description
Operation2   Invalidating alarms for flushing system A and B
Operation3   Flushing pipes for system A
Alarm5       Message indicating system B is being flushed
Operation4   Flushing pipes for system B
Alarm6       Message indicating system A is being flushed

(Event occurrence timeline over 1 month.)
Figure 5. Example of complex operator actions
III. Redundant alarms
Alarm 7 is a lower-limit flow rate alarm that occurred 657 times. According to the result of the Event Correlation Analysis, no alarms or operator actions were related to Alarm 7. Thus, this alarm is considered to occur independently of other alarms and operator actions. In fact, the operators told us that they usually did not take any action in response to this alarm, but waited until the flow rate returned to a stable state by itself. We removed alarms such as this one that did not function as alarms, because they could dull the operators' sensitivity to alarms.

IV. Cause of upset
Figure 6 shows an example of the cause of a plant upset. As shown in Figure 6, these six events occurred at devices located near each other. According to the event log data, Operation 5 was a manipulated variable change action, Operation 6 was an upper threshold change action, and Alarms 8, 9, 10 and 11 were all upper limit alarms. The occurrence order estimation found that Alarm 8 occurred first and Operation 6 occurred last. These results suggested a problem with the pipes near the heat exchanger that is cooled with seawater. In fact, a field investigation revealed that the pipe easily became clogged, resulting in poor seawater cooling. Therefore, it was decided that this pipe should be cleaned regularly to solve this problem.
Figure 6. Example of cause of upset. (Process diagram with Material, Cooler1, Tower 1, Tower 2, Tank1, Tank2, Cooling Water, and Product, annotated with the event sequence. Legend: Alarm8: Oil temperature after cooling; Alarm9: Oil temperature into Tower2; Alarm10: Seawater coolant temperature; Alarm11: Gas temperature; Operation5: Oil temperature after cooling; Operation6: Oil temperature into Tower2.)
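The paper does not spell out how pairwise time lags are combined into an occurrence order for a whole group. One simple heuristic, offered here only as an assumption, is to rank events by their mean lead over the other events in the group. The function name `order_events` and the lag values below are hypothetical; only the outcome (Alarm 8 first, Operation 6 last) mirrors the example above.

```python
def order_events(lags):
    """Sort events from earliest to latest using pairwise lags.
    lags[(i, j)] = m*_ij > 0 means event j tends to follow event i."""
    lead = {}
    for (i, j), m in lags.items():
        lead.setdefault(i, []).append(m)    # i leads j by m
        lead.setdefault(j, []).append(-m)   # j trails i by m
    mean_lead = {event: sum(v) / len(v) for event, v in lead.items()}
    # The largest mean lead comes first: that event precedes the others.
    return sorted(mean_lead, key=lambda event: -mean_lead[event])

# Hypothetical lags (in time windows) within one event group.
lags = {
    ("Alarm8", "Alarm9"): 3,
    ("Alarm9", "Operation6"): 4,
    ("Alarm8", "Operation6"): 7,
}
order = order_events(lags)
print(order)  # ['Alarm8', 'Alarm9', 'Operation6']
```

The first event in the returned order is the candidate source of the upset, to be confirmed by a field investigation as in the example above.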
4. Conclusion
In this paper, a new method for IPL2&3 performance improvement using Event Correlation Analysis was proposed. The validation with actual plant data indicates that the method can improve process safety as well as productivity rapidly, easily, and effectively. In addition, since the method extracts operator behavior from the log data, it can contribute to the standardization of operator actions.
References
AIChE/CCPS, 1993, Guidelines for Engineering Design for Process Safety, AIChE.
Engineering Equipment & Materials Users' Association, 1999, Alarm Systems: A Guide to Design, Management and Procurement, EEMUA Publication No. 191.
J. Shiozaki, B. Shibata, H. Matsuyama, and E. O'Shima, 1989, Fault Diagnosis of Chemical Processes Utilizing Signed Directed Graphs - Improvement by Using Temporal Information, IEEE Transactions on Industrial Electronics, vol. 36, no. 4, pp. 469-474.
J. Nishiguchi and H. Tsutsui, 2005, A New Approach to Process Alarm Reduction Using Statistical Point Processes, SICE Annual Conference 2005, pp. 443-448.
W. Li, 1990, Mutual Information Versus Correlation Functions, Journal of Statistical Physics, vol. 60, pp. 823-831.