Computers chem. Engng, Vol. 14, No. 12, pp. 1379-1396, 1990
Printed in Great Britain. All rights reserved
0098-1354/90 $3.00 + 0.00 Copyright © 1990 Pergamon Press plc
A ROBUST EVENT-ORIENTED METHODOLOGY FOR DIAGNOSIS OF DYNAMIC PROCESS SYSTEMS
F. E. FINCH,† O. O. OYELEYE‡ and M. A. KRAMER§
Laboratory for Intelligent Systems in Process Engineering, Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
(Received 18 December 1989; final revision received 16 July 1990; received for publication 29 August 1990)
Abstract-The Model-Integrated Diagnostic Analysis System (MIDAS) is a program for diagnosing abnormal transient conditions in chemical, refinery and utility systems. MIDAS determines the root causes of disturbances using an event model that expresses process causality and conditions for violation of quantitative process constraint equations. Root causes typically considered are equipment degradation and failure, sensor failure, incorrect operation and external disturbances. The diagnostic algorithm utilizes the event model to construct clusters of related events, each cluster attributable to a single malfunction. The algorithm is designed to handle practical problems such as unreliable sensors, variations in the order of symptom detection, complex dynamics induced by feedback and process controllers, and multiple simultaneous malfunctions. Temporal features in the dynamic propagation of a malfunction are employed, and the diagnosis is incrementally evolved as disturbances appear and abate. The algorithm has been applied to a case study of a continuous reaction process. Results show MIDAS to be accurate in 99% of the cases studied, dealing successfully with many instances of complex process dynamics and out-of-order events.
1. INTRODUCTION
Fault diagnosis is the problem of identifying and isolating the root causes of process disturbances from observable symptoms. Two classes of root causes can be considered: physical failures, such as sensor and controller failures, fouling, blockages and leaks; and external disturbances, such as feedstock or utility variations. Here, we will refer to both classes as process malfunctions. Prompt and accurate diagnosis of process malfunctions is essential in maintaining economical and safe plant operation. The diagnosis task is currently performed mainly by human operators. However, computer diagnostic aids can make diagnosis faster, more accurate and more reliable.

The computer has been used mainly in two different modes. Consultation systems, usually rule-based or hybrid expert systems, are used primarily off-line, and only on a periodic basis. The human-computer interaction typically has the form of a question-and-answer dialog, with the computer prompting the operator for symptom and test information, and then interpreting the data. These systems are usually intended to diagnose a specific equipment item or class of processing problems. Monitoring or data-filtering systems, on the other hand, are interfaced directly with the process data acquisition system, and used to provide a continuous assessment of the plant performance and safety, largely independent of the process operator. The term filtering in this context is particularly associated with programs that prioritize process alarms, to help the operator focus on the most significant incoming alarms. Monitoring systems detect performance problems and may also diagnose the root causes of disturbances. Monitoring and filtering systems often employ some type of qualitative process model, rather than being wholly based on process-specific expertise, and have a real-time character not present in consultation systems.

†Currently with Gensym Corporation, Cambridge, MA 02140, U.S.A.
‡Currently with Arco Chemical Co., Channelview, TX 77530, U.S.A.
§Author to whom all correspondence should be addressed.

The Model-Integrated Diagnostic Analysis System (MIDAS) is a process monitoring system that performs malfunction diagnosis of continuous chemical, refinery and related processes. It is an outgrowth of a system called DEA originally described by Kramer (1988). MIDAS employs a deep knowledge approach based on reasoning about causality and constraints, similar to several previous techniques (Iri et al., 1979; Kokawa et al., 1983; Shiozaki et al., 1985; Kramer and Palowitch, 1987; Dhurjati et al., 1987; Kramer, 1987a; Rich and Venkatasubramanian, 1987; Ulerich and Powers, 1988). However, MIDAS addresses several significant problems ignored or only partially treated in previous diagnostic systems, including:

1. Utilization of evolving information. Temporal features in the dynamic evolution of a malfunction contain important diagnostic information (Washio et al., 1987). Most current methods analyze a “snapshot” of the plant state at a single time point. Since processes often exhibit nonmonotonic behaviors including compensatory (normal→high→normal) and
inverse (normal→high→low) responses, a snapshot is only partially informative, even misleading. Models and algorithms in MIDAS deal with the full range of dynamic behaviors produced by the plant, including nonmonotonic responses due to control systems and other feedback mechanisms.

2. Robustness to symptom variation. When disturbances caused by a malfunction propagate through a plant, they rarely follow an exactly predictable pattern. The trajectories of continuous variables and the temporal sequence of discrete events such as alarms can be influenced by the malfunction extent vs time profile, the exact location of the malfunction, and (in the case of discrete events) noise and decision thresholds (Kramer, 1987b). By representing all plant dynamics as sequences of events, and employing diagnostic algorithms that are robust to variations in the order of occurrence of events, as well as to false and missing events, MIDAS can cope with considerable variability in symptom patterns.

3. Diagnosis of multiple malfunctions. Multiple malfunctions can be simultaneous independent malfunctions, or induced failures (malfunctions caused by other malfunctions). Although the general multiple malfunction diagnosis problem remains intractable, MIDAS can diagnose cases of induced sensor failures (which account for most induced failures), and multiple failures provided that each malfunction influences a different set of sensors, so that the symptoms form separate groups or clusters (“non-overlapping” failures). The ability to diagnose non-overlapping failures assures that unrelated events (including false alarms) occurring during diagnosis of a malfunction will not affect the ability to diagnose the malfunction successfully.

4. Combined use of qualitative and quantitative models. Causal models represent local relationships between process variables, and indicate the probable order of malfunction propagation. Constraint equations can model global relationships, such as overall material and energy balances, that cannot be represented by causal models. By combining causal and quantitative constraint models, MIDAS produces diagnoses with resolution similar to that which would be produced by intersecting a causal diagnosis with a constraint equation diagnosis. Kramer and Finch (1989) and Kramer and Palowitch (1985) have presented examples demonstrating that intersection of diagnoses produced by different types of models can significantly improve the overall results.

5. Automatic knowledge base preparation. MIDAS includes interactive programs and subroutines for deriving the diagnostic model automatically from a given process flowsheet. This reduces the engineering effort for creating the diagnostic system for a new plant to a small fraction of what would be needed for a custom-designed system. Use of automatic routines also helps to assure the accuracy of the diagnostic model. The algorithm for building the causal model from the process flowsheet utilizes several new results
on interpretation of feedback that lead to a less ambiguous qualitative model.

The diagnostic methodologies of MIDAS are the major focus of this paper. The theoretical background and details of the modeling methodologies used are presented elsewhere (Oyeleye et al., 1990). The diagnostic algorithm is general in the sense that it is not limited to a single process or type of process, but can be applied to any system for which an appropriate event model can be developed. Furthermore, it will produce accurate diagnoses (with the actual malfunction in the ranked candidate set) with any sensor configuration, although the resolution will be greater or lesser depending on the actual sensor deployment. The present work focuses on applications in continuous chemical and petrochemical processes possessing a nominal steady state.

In Section 2, we briefly summarize the modeling basis of MIDAS. Details of the knowledge representations employed follow in Section 3, and the diagnostic methodology and inference cycle are described in Section 4. Results of a case study follow in Section 5. Examples of the treatment of dynamics, out-of-order alarms and multiple malfunctions, using scenarios from the case study, are presented.
2. EVENT MODELING
The key object in MIDAS modeling is the event, which can be used to represent any significant observable change in process behavior or condition. Events represent transitions between process states and are considered to have no temporal extent. For application to dynamic processes, trajectories of continuous variables are represented by a sequence of discrete events. Examples of the types of events important in chemical process modeling and diagnosis are:
• Changes in variable or parameter states (e.g. temperature was NORMAL and is now HIGH)
• Changes in variable or parameter trends (e.g. concentration was STEADY and is now INCREASING)
• Changes in equipment status (e.g. level sensor was OK and has now FAILED)
• Changes in the status of quantitative constraints (e.g. mass balance was SATISFIED and is now VIOLATED HIGH)
• Results from off-line tests (e.g. inspection for corrosion was NEGATIVE)
• Actions initiated by operators (e.g. flow control loop has been placed on MANUAL, or manual by-pass valve has been OPENED).

It can be seen that a wide range of occurrences can be treated as events. MIDAS views the time evolution of the process state as a sequence of events, and the diagnostic reasoning is based on the type and order of events. Complex dynamics, such as compensatory
and inverse response, are represented by a sequence of events on a single variable.

The event-based representation has a particular advantage over a state-based representation, related to truth maintenance. Truth maintenance (De Kleer, 1986) is the task of managing data and conclusions so that when data changes, the conclusions derived from the original data will be retracted. In real-time expert systems, truth maintenance takes on the additional element of time dependency (Moore and Kramer, 1986), and conclusions may need to be retracted after the expiration of a “validity interval” based on the characteristic time scale for changes in underlying variables. In a state-based representation, all deductions based on a state must be identified and retracted whenever the state changes. For example, if the state “REACTOR TEMPERATURE HIGH” supports the conclusion “HOT SPOT IN REACTOR”, then when the temperature returns to normal, the conclusion of a hot spot must be retracted. In the event-oriented representation, most such retractions are avoided. Once introduced, an event such as “REACTOR TEMPERATURE HIGH AT 13:00 HOURS” remains true forever. A return to normal is represented as a second event, such as “REACTOR TEMPERATURE NORMAL AT 13:30 HOURS”, which does not conflict with the initial event because they occur at different times. Similarly, a conclusion such as “REACTOR HOT SPOT DEDUCED AT 13:00 HOURS” is not retracted, but is followed by another conclusion, such as “REACTOR HOT SPOT ABATED AT 13:30 HOURS”. While this representation eliminates much of the routine truth maintenance, truth maintenance is still required for restructuring assumptions on the causal relationships among events in response to certain cases of out-of-order events (see Section 4.2).

To represent process causality, event models function similarly to the digraph models described by Iri et al. (1979) and Kokawa et al. (1983).
As in the digraph, the causal influences among root causes and process variables are represented by directed arcs. In fact, any digraph model can be reformulated as an event model, and Oyeleye (1990) presents an automatic procedure for this conversion. There are, however, some differences in the format of digraphs and event models. In an event model:
• Each possible qualitative state is represented explicitly by a separate node
• There are no unmeasured event graph nodes
• Links are not signed and their existence may be conditional on past events
• Nodes are not limited to process variable states but can also represent trends, constraint residuals, equipment status and operator actions.
The representation of a single state variable by multiple nodes representing different qualitative states allows modeling of different qualitative effects for each qualitative state. The use of this
representation is discussed in Allen (1984), who outlines several situations where the single node representation is insufficient. The elimination of unmeasured variables is intended to simplify on-line diagnostic reasoning, so MIDAS does not have to manage hypotheses dealing with unmeasured variables. Unmeasured nodes are eliminated automatically by the model preparation algorithm.

The event model can also include events associated with the satisfaction and violation of quantitative constraint equations. Conceptually, quantitative constraints are any relations in the form:

$$F(X, p) = R, \qquad (1)$$
where X is a vector of measurements, p is a vector of unmeasured parameters related to malfunctions, and R is the constraint residual, near zero (normal) as long as p is fixed. Transition of the residual from a normal to an abnormal value is represented as an event with root causes corresponding to variation in the elements of p. Because unverified measurements are used directly in calculation of the constraint, p typically includes not only process failures but also sensor errors associated with X. Similar to other constraint-oriented diagnostic systems (Kramer, 1987a; Petti et al., 1990), constraints do not involve unmeasured variables, so some manipulation of the basic process equations may be required. The modeling accuracy of the constraint as well as process and sensor noise dictate the variance of R under normal conditions, which in turn determines the threshold for constraint violation. It is incumbent upon the user to select constraints that represent the process with reasonable accuracy; otherwise thresholds must be set loosely and sensitivity to malfunctions is sacrificed. While constraint satisfaction is not subsequently counted as evidence against a malfunction hypothesis, wide thresholds mean that the constraint is less likely to contribute information to the diagnostic process.

MIDAS does not mandate a particular form for F, and in case studies, linear and non-linear, algebraic and differential constraints have been used. The calculation of the constraint residual is done externally to MIDAS, so conceivably any computation can be used, provided that it operates on a measurement vector X and produces an output R that can be interpreted in terms of process (including sensor) malfunctions. Since the event model format supports both causality and constraints, any combination of causality and constraints can be used.
One strategy, which might be preferred if constraints are difficult to develop, is to create a complete causal model, with a few key constraint equations. Another strategy is to use constraints only, in which case MIDAS operates more or less like the method of governing equations (Kramer, 1987a) or the Diagnostic Model Processor (Petti et al., 1990). Such a model would be less susceptible to ambiguities, but harder to produce.
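As an illustration of equation (1), the following minimal sketch classifies a constraint residual against a threshold derived from its variance under normal operation. The function names, the 3-sigma multiplier, the example mass balance and the sample data are all assumptions made for illustration; MIDAS leaves the residual calculation to external code and does not prescribe any of them.

```python
from statistics import pstdev


def residual_threshold(normal_residuals, k=3.0):
    """Threshold as k standard deviations of R under normal operation.

    The multiplier k is an assumed tuning choice, not a MIDAS default.
    """
    return k * pstdev(normal_residuals)


def constraint_status(residual, threshold):
    """Map a residual R to a qualitative constraint state."""
    if residual > threshold:
        return "VIOLATED HIGH"
    if residual < -threshold:
        return "VIOLATED LOW"
    return "SATISFIED"


def mass_balance_residual(flow_in, flow_out):
    """Hypothetical steady-state constraint F(X, p) = R with X = (flow_in, flow_out)."""
    return flow_in - flow_out


# Residuals logged during normal operation set the threshold (made-up data).
history = [0.1, -0.2, 0.05, -0.1, 0.15]
limit = residual_threshold(history)
status = constraint_status(mass_balance_residual(10.0, 8.0), limit)
```

A noisier or less accurate constraint gives a larger `limit`, which is the loss of sensitivity the text warns about.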
Fig. 1. Gravity flow tank.

It is useful to visualize the event model as a graph. For example, consider the gravity flow tank with two sensors, depicted in Fig. 1. A part of the event graph for this system is shown in Fig. 2 (status transitions have been omitted for clarity). Nodes in the event graph represent states of process variables or constraints. States of different variables may be connected by intervariable precursor/successor links, shown as solid arrows, that depict relationships between states. Intravariable precursor/successor links, shown as bold arrows, relate states on the same variable, and are indicative of the occurrence of events, rather than causality. Given the two sensors attached to the tank process, there are eight possible events (for each variable: NORM→HIGH, NORM→LOW, HIGH→NORM and LOW→NORM). Root causes are the different malfunctions that could affect the process, such as High Inflow and Level Sensor High Bias. Root causes are linked to events via another type of link, shown as dashed arrows, called local cause links. The states linked to a root cause are referred to as the primary symptoms of the root cause. One of the primary symptoms should be the first observed when the root cause is present.

Diagnostic information is attached to precursor/successor links using link conditions. Several types of conditions are defined, the most useful being the :NOT condition, used to block propagation of a
specified malfunction along a link. For example, the :NOT Level Sensor Bias condition between Level High and Flow High indicates that under the assumption of Level Sensor Bias, Flow High is not expected to follow Level High. Therefore, that sequence of events would be counted as strong evidence against Level Sensor Bias. The event graph representation can also model inverse and compensatory response. Links from Flow High to Flow Norm and Flow Low to Flow Norm model the fact that outflow is expected to return to normal following a blockage, when a new steady-state level is achieved. The :ONLY-IF-TRANSIENT link conditions from Level High to Level Norm and Level Low to Level Norm indicate that level will return to normal only when the root cause of the disturbance has been removed, since level is not a compensatory variable to any disturbance variable in this process.

The derivation of the event model, even for simple processes, can be complicated. MIDAS provides routines that automate the construction of the event model. The model-building process starts with a library of digraph models of common process equipment which are linked together, analyzed for global interactions, and then converted to event models. During this process, the digraphs are analyzed to determine the behavior of variables affected by negative feedback, using the criteria of Oyeleye and Kramer (1988). As a result of this analysis, MIDAS event models contain less ambiguity than standard causal models. The method of construction of the event model appears elsewhere (Oyeleye, 1990; Oyeleye et al., 1990).
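The effect of a :NOT link condition can be sketched in a few lines. The `Link` class and the helper names below are hypothetical and are not MIDAS data structures; only the Level High to Flow High link and its :NOT Level Sensor Bias condition come from the tank example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A precursor/successor link between two event-graph nodes (illustrative)."""
    source: str                            # precursor node, e.g. "Level High"
    result: str                            # successor node, e.g. "Flow High"
    not_conditions: frozenset = frozenset()  # malfunctions blocked on this link


# One intervariable link of the tank example, with its :NOT condition.
LINKS = [
    Link("Level High", "Flow High", frozenset({"Level Sensor Bias"})),
]


def propagation_allowed(link, hypothesis):
    """A :NOT condition blocks propagation of the named malfunction."""
    return hypothesis not in link.not_conditions


def opposes(hypothesis, first_event, second_event, links=LINKS):
    """True if seeing first_event then second_event counts against the hypothesis."""
    for link in links:
        if (link.source, link.result) == (first_event, second_event):
            return not propagation_allowed(link, hypothesis)
    return False  # no link between the two events, so no evidence either way
```

With this sketch, `opposes("Level Sensor Bias", "Level High", "Flow High")` is true, while the same sequence does not oppose a High Inflow hypothesis.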
Fig. 2. Part of the tank event graph.

3. KNOWLEDGE REPRESENTATION AND STRUCTURE

The MIDAS diagnostics utilizes two types of knowledge: factual knowledge stored in the form of frames, and procedural knowledge contained in a set of algorithms coded in Common LISP. Frames are a standard way of representing objects in knowledge-based systems. (For terminology associated with frames, see Stefik and Bobrow, 1986.) Some of the LISP functions in MIDAS are directly associated with frames as demon functions (active values), but the majority of the functions are not associated with any particular object type. The computational paradigm of MIDAS thus contains elements of object-oriented programming, but is not entirely consistent with this style. The overall structure of the MIDAS diagnostics is illustrated in Fig. 3. Five major components are defined:
• Monitors
• Event interpreter
• Process model
• Hypothesis model
• User interface
Fig. 3. Structure of MIDAS diagnostics.

3.1. Monitors

The primary function of a MIDAS monitor is to convert the sampled process data into a series of events for further diagnostic analysis. One monitor is associated with each process sensor and constraint equation in the MIDAS knowledge base. A monitor collects plant data for a specific measured variable or constraint residual and performs tests to detect qualitative events. Monitors operate independently of one another and could be distributed if desired, although in the current implementation, monitors share a single CPU with other system functions.

Monitors can detect three types of events: state changes (including changes of state of constraint residuals), changes in trend and gross sensor failures.† Detection of state and trend events (appraisal) is based on the Shewhart control chart technique of statistical process control (Shewhart, 1931), although conceivably other techniques could be used for the same purpose. Data is taken continuously at the frequency of the data acquisition system. A state change event is declared when one sample is outside the control limit or two consecutive samples are outside a warning limit. A trend event is detected if a run of five or more samples is increasing or decreasing (Nelson, 1984). The level of significance of the tests can be modified by the user. Gross sensor failure events are detected using sensor range and noise limit checks. If these tests are passed, the sensors are not automatically validated: drifts, biases and in-range failures may still be present, and are considered in subsequent analysis.

†Events for operator actions and tests are not generated by monitors, but are entered manually via the user interface.

Interaction between the monitors and the event interpreter is mostly one-way, with monitors sending events to the interpreter. The one exception to this rule is interrogation, wherein the interpreter queries a monitor for trend information. This occurs whenever
the interpreter discovers a situation where the diagnosis can be simplified with the addition of one or more events not detected by the monitors. If the monitor responsible for the variables of interest is able to forecast a possible future event based on current trends (prediction), the interpreter adds the event to the inference procedure, acting as if the event had already occurred. Prediction entails extrapolating a linear or low-order polynomial fit of smoothed historical data for the variable involved to a specified prediction horizon. Forecast values are subjected to statistical tests similar to those used in appraisal, but at lower significance. Interrogation is a robustness feature that allows MIDAS to overcome certain problems associated with out-of-order events. Techniques for handling variations in event order are necessary, because no matter how well monitors are tuned, it may still be impossible to guarantee that events will be detected in causal order (Finch and Kramer, 1989).

A description of the slots (attributes) defined in the DATA-MONITOR object class is presented in Table 1. Slots are related to analysis control, analysis results, bookkeeping, data queues and threshold values. Numerical information is stored in the data queues and compared to threshold values according to the analysis control parameters. Current qualitative state, trend and status are placed in the analysis results slots. These slots contain demon functions that activate an event whenever the state, trend or status changes. Threshold values and analysis parameters can be modified from the user interface.

3.2. Event interpreter

The event interpreter creates a diagnosis from events detected by the monitors. The procedural knowledge in the event interpreter is general to the extent that the event interpreter can be used to diagnose malfunctions in any process for which an event model and monitors exist.
The event interpreter is comprised of over 200 Common LISP functions, and there are no event interpreter objects.
Table 1. DATA-MONITOR object description
Attribute name          Typeᵃ  Data type  Description
ANALYSIS-HORIZON        A      Integer    Maximum length of sample statistics data queues
LAST-UPDATE             B      String     MIDAS time stamp of last data point
MAXIMUM-VALUE           T      Real       Maximum credible raw data value
MEAN-LCL                T      Real       Lower control limit for sample mean
MEAN-UCL                T      Real       Upper control limit for sample mean
MEAS-TIME-VECTOR        D      List       Queue of measurement times
MEAN-VECTOR             D      List       Queue of sample means
MINIMUM-VALUE           T      Real       Minimum credible raw data value
MONITOR-EVENTS          ?      Symbolic   POTENTIAL-EVENT instances associated with change in status
PERCEIVED-STATE         ?      Symbolic   State determined by last data analysis
PREDICTED-STATE         ?      Symbolic   Predicted future state
PREDICTION-CONFIDENCE   ?      Symbolic   Level of confidence in predictions (HIGH, LOW or NONE)
PREDICTION-HORIZON      ?      Integer    Number of time steps into future to compute predictions
PREDICTION-TIME-VECTOR  ?      List       Queue of predicted measurement times
RANGE-LCL               T      Real       Lower control limit for sample ranges
RANGE-UCL               T      Real       Upper control limit for sample ranges
RAW-MEAS-VECTOR         D      List       Queue of measurement values
STATE-EVENTS            B      Symbolic   POTENTIAL-EVENT instances associated with change in state
TREND-EVENTS            B      Symbolic   POTENTIAL-EVENT instances associated with change in trend
ᵃA = analysis control, B = bookkeeping, D = data queue, R = results, calculated, T = threshold.
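The appraisal tests described in Section 3.1 can be sketched as follows. This is a minimal illustration, not the MIDAS monitor code (which is written in Common LISP). The function names and the numeric limits in the usage note are assumptions, and for brevity the warning-limit test does not require both samples to lie on the same side of the mean.

```python
def outside(x, limits):
    """True if x lies outside the (lower, upper) limit pair."""
    lower, upper = limits
    return x < lower or x > upper


def state_change(samples, control, warning):
    """Shewhart-style test on the most recent samples.

    Declares a state change when the latest sample is outside the control
    limits, or the latest two samples are both outside the warning limits.
    """
    if not samples:
        return False
    if outside(samples[-1], control):
        return True
    return len(samples) >= 2 and all(outside(x, warning) for x in samples[-2:])


def trend(samples, run=5):
    """Classify the trend from the last `run` samples (five in the text)."""
    if len(samples) < run:
        return "STEADY"
    recent = samples[-run:]
    diffs = [b - a for a, b in zip(recent, recent[1:])]
    if all(d > 0 for d in diffs):
        return "INCREASING"
    if all(d < 0 for d in diffs):
        return "DECREASING"
    return "STEADY"
```

For example, with assumed control limits of (7, 13) and warning limits of (8, 12), the sequence 10, 12.5, 12.6 triggers a state change through the warning-limit rule even though no sample crosses a control limit.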
For every event detected, a similar sequence of tasks is performed, as illustrated in Fig. 4:
1. The event is cataloged in the active knowledge base.
2. The interpreter searches for possible associations or relationships that might exist between the new event and previously detected events.
3. The previously existing diagnosis is revised or a new diagnosis created to incorporate the additional information provided by the new event.

This set of tasks is the same for all events, regardless of whether the event represents a state change, a change in trend, sensor failure or operator action. While performing these tasks, the event interpreter interacts with the process model and hypothesis model components, where events are cataloged, causal searches are conducted and diagnoses are logged.

A central concept in the event interpreter involves event clusters. When a new event is linked to previous events in Step 2, the new event becomes part of a cluster of causally-related events. It is assumed that each separate cluster of events arises from a single malfunction. The possibility that two or more malfunctions can lead to symptoms attributable to one malfunction is not entertained in MIDAS, following Ockham’s razor.† MIDAS, therefore, produces a minimal set of malfunctions spanning the set of observed events. Event clusters are represented by the object INFERRED-MALFUNCTION (see Section 3.4). More details on the inference cycle are given in Section 4.

†“Non sunt multiplicanda entia praeter necessitatem”, i.e. “entities are not to be multiplied beyond necessity” (William of Ockham, ca 1285-1349).
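The clustering step can be illustrated with a small sketch. The `add_event` function, the `related` predicate and the "Temp High" alarm are hypothetical. Merging every cluster that the new event links to is an assumption made here to keep the number of malfunctions minimal; the actual MIDAS inference cycle is the subject of Section 4.

```python
def add_event(clusters, event, related):
    """Place a new event into the cluster structure.

    clusters: list of lists of earlier events
    related(prev, new): True if a causal link connects prev to new (assumed test)
    """
    linked, unlinked = [], []
    for cluster in clusters:
        if any(related(prev, event) for prev in cluster):
            linked.append(cluster)
        else:
            unlinked.append(cluster)
    # Linked clusters plus the new event form one cluster (one malfunction).
    # An event with no links starts a cluster of its own.
    merged = [e for cluster in linked for e in cluster] + [event]
    return unlinked + [merged]


# Toy causal links: only Level High -> Flow High, as in the tank example.
PS_LINKS = {("Level High", "Flow High")}


def related(earlier, later):
    return (earlier, later) in PS_LINKS


clusters = []
for event in ["Level High", "Temp High", "Flow High"]:
    clusters = add_event(clusters, event, related)
```

After the three events, `clusters` holds two groups: the unrelated "Temp High" alarm on its own, and "Level High" together with "Flow High". Each group would be attributed to a separate malfunction.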
Fig. 4. Event interpreter cycle.

3.3. Process model

The process model, or static knowledge base, represents the event model of the process expressed in frame (object) format. The process model is created off-line and remains essentially unchanged during the operation of the MIDAS diagnostics. The object classes corresponding to the components of the event graph are shown in Table 2. The word “POTENTIAL” differentiates process model objects from objects in the hypothesis model that represent actual events or currently hypothesized root causes. The process model contains a superset of the events and root causes that may be observed or hypothesized during any given malfunction episode.

Table 2. Process model objects
Event graph component          Corresponding process model object class
Event                          POTENTIAL-EVENT (PE)
Root cause                     POTENTIAL-ROOT-CAUSE (PRC)
Local cause (LC) link          Noneᵃ
Precursor/successor (PS) link  COMPILED-LINK (CL)
ᵃLC links are modeled via pointers between root causes and events.

Table 3 lists selected attributes of the POTENTIAL-EVENT (PE) object class. The most important attributes include the name of the affected variable (the OBJECT slot), and the states before and after the event (PRIOR-STATE and CONSEQUENT-STATE slots). When a monitor detects an event, it searches for an instance of POTENTIAL-EVENT with the appropriate prior and consequent state, status and trend and inserts the value “yes” into the ACTIVE slot. This action initiates the inference cycle. Other slots of the POTENTIAL-EVENT object contain bookkeeping information.

Table 3. POTENTIAL-EVENT object description
Attribute name             Data type  Description
ACTIVE                     Symbolic   Set to YES, activates demon initiating inference cycle for the event
CONSEQUENT-STATE           Symbolic   State that exists after event detection
EXCLUSIVE-EVENTS           Symbolic   IDs of other POTENTIAL-EVENT instances with the same OBJECT
OBJECT                     Symbolic   ID of the instance of MEASURED-OBJECT affected by the event
PRIOR-STATE                Symbolic   State that exists prior to event detection
RECORD-OF-LAST-OCCURRENCE  Symbolic   ID of RECORDED-EVENT instance created the last time the event was detected
TIME-OF-LAST-DETECTION     String     Time stamp of raw data point resulting in event detection
TIME-OF-LAST-OCCURRENCE    String     Time stamp of last RECORDED-EVENT

The attributes of POTENTIAL-ROOT-CAUSE (PRC) objects are listed in Table 4. The PRIMARY-DEVIATION slot lists the primary symptoms of the root cause. Tests that could be used to either confirm or eliminate the root cause are given in the APPLICABLE-TESTS slot. The PRIOR-PROBABILITY slot can be used to assign a relative probability to the root cause, used in the likelihood ranking (Section 4.4).

Table 4. POTENTIAL-ROOT-CAUSE (PRC) object description
Attribute name     Data type    Description
APPLICABLE-TESTS   Symbolic     IDs of TEST instances containing diagnostic conditions useful in diagnosis of the root cause
PRIMARY-DEVIATION  List         All primary symptoms of the root cause (listed in MIDAS event description format)
PRIOR-PROBABILITY  Real [0, 1]  A relative probability factor indicating the a priori likelihood of the root cause compared to other root causes

The COMPILED-LINK object, given in Table 5, contains information on the initial and terminal nodes for the link (SOURCE-DESCRIPTION and RESULT-DESCRIPTION) and slots for link conditions.

Table 5. COMPILED-LINK object description
Attribute name      Data type  Description
ACTIVATE-IF         List       Conditions that trigger link activation
ACTIVE              Symbolic   A YES indicates the link is active in diagnostic inference; otherwise, set to NO
DEACTIVATE-IF       List       Conditions that trigger link deactivation
NOT-CONDITIONS      List       :NOT conditions attached to link
ONLY-IF-CONDITIONS  List       :ONLY-IF conditions attached to link
RESULT-DESCRIPTION  List       DESCRIPTION of terminal EVENT
SOURCE-DESCRIPTION  List       DESCRIPTION of initial EVENT

Other objects of the process model include:
• The TEST class, giving manual tests and other off-line measurements that provide useful diagnostic information, along with the inferences that should be made when the test result is positive or negative.
• The MEASURED-OBJECT class, which stores information on the current and past states of a measured variable or constraint, and provides certain pointers between POTENTIAL-EVENT, POTENTIAL-ROOT-CAUSE and COMPILED-LINK objects.
Table 6. Hypothesis model objects
Event graph component | Corresponding hypothesis object class
Event | RECORDED-EVENT (RE), EXPECTED-EVENT (EE), LATENT-EVENT (LE)
Root cause | HYPOTHESIZED-ROOT-CAUSE (HRC)
Local cause (LC) link | None¹
Precursor/successor (PS) link | None²
None | INFERRED-MALFUNCTION (IM)
¹LC links are modeled via pointers between root causes and events.
²PS links are modeled via pointers between events.
3.4. Hypothesis model
The hypothesis model, or active knowledge-base, contains only objects that are created on-line by the event interpreter during the diagnosis process. The objects in the hypothesis model, listed in Table 6, represent realizations of potential events and root causes in the static knowledge base. For example, a RECORDED-EVENT (RE) object in the hypothesis model represents the observation of a specific POTENTIAL-EVENT in the process model. Two additional classes, EXPECTED-EVENTS (EE) and LATENT-EVENTS (LE), represent events created as a result of interrogation and are very similar to RECORDED-EVENTS. The clarity of the knowledge base is improved by the separation of potential and observed objects. For example, the separation gives a mechanism for representing multiple occurrences of a single POTENTIAL-EVENT.

Table 7. RECORDED-EVENT, EXPECTED-EVENT and LATENT-EVENT object description
Attribute name | Data type | Description
ACTIVE | Symbolic | A YES indicates the event has not been superseded
CAUSED-BY | List | Other observed events linked to the event via causal links terminating at this event
CAUSE-OF | List | Other observed events linked to the event via causal links initiating at this event
CLASSIFICATION | Symbolic | Classification of this event in the causal network (SOURCE or CONSEQUENCE)
EVENT | Symbolic | Corresponding POTENTIAL-EVENT instance
EVENTS-EXPLAINED | Symbolic | Other observed events in the cluster that can be explained by this event via causal propagation
EVENTS-NOT-EXPLAINED | Symbolic | Other observed events in the cluster that cannot be explained by causal propagation
OBJECT | Symbolic | ID of the instance of MEASURED-OBJECT affected by the event
PROBABILITY-OF-ACCURATE-DETECTION | Real [0, 1] | Probability of true detection (defaults: 1 for TESTS, 0.95 for REs, 0.40 for EEs and LEs)
PROBABILITY-OF-INACCURATE-DETECTION | Real [0, 1] | Probability of false detection (defaults: 0 for TESTS, 0.05 for REs, 0.60 for EEs and LEs)
SYMPTOM-OF | Symbolic | ID of the INFERRED-MALFUNCTION instance associated with this event
TIME-OF-CESSATION | String | Time stamp of the EVENT superseding this event
TIME-OF-OCCURRENCE | String | Time stamp of the EVENT

Table 8. HYPOTHESIZED-ROOT-CAUSE object description
Attribute name | Data type | Description
ACTIVE | Symbolic | A YES indicates the root cause is active in diagnostic inference
HYPOTHESIZED-BY | Symbolic | IDs of EVENT instances that are primary symptoms of the cause
INFERRED-MALFUNCTION | Symbolic | ID of the INFERRED-MALFUNCTION instance associated with the root cause
NORMALIZED-CONDITIONAL-PROBABILITY | Interval [0, 1] | Evidential interval based on prior probabilities
OPPOSING-TESTS | Symbolic | IDs of active TEST instances opposing the root cause candidate
RELATIVE-LIKELIHOOD | Symbolic | A likelihood ranking based on evidence weights
ROOT-CAUSE | Symbolic | ID of the POTENTIAL-ROOT-CAUSE instance corresponding to this root cause
STRONGLY-OPPOSING-EVIDENCE | Symbolic | IDs of RECORDED-EVENT instances opposing the root cause candidate
STRONGLY-SUPPORTING-EVIDENCE | Symbolic | IDs of RECORDED-EVENT instances supporting the root cause candidate
SUPPORTING-TESTS | Symbolic | IDs of active TEST instances supporting the root cause candidate
WEAKLY-OPPOSING-EVIDENCE | Symbolic | IDs of EXPECTED-EVENT and LATENT-EVENT instances opposing the root cause candidate
WEAKLY-SUPPORTING-EVIDENCE | Symbolic | IDs of EXPECTED-EVENT and LATENT-EVENT instances supporting the root cause candidate
Table 9. INFERRED-MALFUNCTION object description
Attribute name | Data type | Description
ACTIVE | Symbolic | Set to YES whenever active in diagnostic inference
APPARENT-SOURCE-EVENTS | Symbolic | IDs of associated RECORDED-EVENT instances classified as SOURCE events
CREATED-BY | Symbolic | ID of the first RECORDED-EVENT instance associated with the INFERRED-MALFUNCTION
EVENTS-EXPLAINED | Symbolic | IDs of all EVENT instances associated with the INFERRED-MALFUNCTION
HYPOTHESIZED-ROOT-CAUSES | Symbolic | IDs of all HYPOTHESIZED-ROOT-CAUSE instances associated with the INFERRED-MALFUNCTION
POSSIBLE-LATENT-SOURCES | Symbolic | IDs of associated LATENT-EVENT instances
STATUS | Symbolic | Qualitative description of INFERRED-MALFUNCTION (PERSISTENT, OSCILLATORY, ONGOING-TRANSIENT, COMPLETED-TRANSIENT, SPURIOUS, CORRECTED)
TESTS-APPLIED | Symbolic | IDs of active TEST instances associated with the INFERRED-MALFUNCTION
TIME-OF-CREATION | String | Time stamp at INFERRED-MALFUNCTION creation
TIME-OF-REMOVAL | String | Time stamp at INFERRED-MALFUNCTION deactivation
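The slots of Tables 7-9 map naturally onto simple record types. The following is only an illustrative sketch in Python, not the MIDAS implementation: the class names, field names and the `link` helper are hypothetical, only a subset of the slots is shown, and the default detection probabilities are the ones quoted in Table 7.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Default detection probabilities quoted in Table 7 (RE, EE and LE objects).
DEFAULT_POAD = {"RE": 0.95, "EE": 0.40, "LE": 0.40}
DEFAULT_POID = {"RE": 0.05, "EE": 0.60, "LE": 0.60}


@dataclass
class Event:
    """A recorded, expected or latent event (subset of the Table 7 slots)."""
    event_id: str
    kind: str                               # "RE", "EE" or "LE"
    potential_event: str                    # EVENT slot: matching POTENTIAL-EVENT
    time_of_occurrence: str
    classification: Optional[str] = None    # "SOURCE" or "CONSEQUENCE"
    caused_by: List[str] = field(default_factory=list)
    cause_of: List[str] = field(default_factory=list)
    symptom_of: Optional[str] = None        # owning INFERRED-MALFUNCTION
    active: bool = True

    @property
    def poad(self) -> float:
        return DEFAULT_POAD[self.kind]

    @property
    def poid(self) -> float:
        return DEFAULT_POID[self.kind]


@dataclass
class HypothesizedRootCause:
    """A root cause candidate (subset of the Table 8 slots)."""
    hrc_id: str
    root_cause: str                         # matching POTENTIAL-ROOT-CAUSE
    inferred_malfunction: str
    hypothesized_by: List[str] = field(default_factory=list)


@dataclass
class InferredMalfunction:
    """An event cluster (subset of the Table 9 slots)."""
    im_id: str
    created_by: str
    events_explained: List[str] = field(default_factory=list)
    hypothesized_root_causes: List[str] = field(default_factory=list)
    status: Optional[str] = None


def link(cause: Event, effect: Event) -> None:
    """Store a causal association in the CAUSE-OF and CAUSED-BY slots."""
    cause.cause_of.append(effect.event_id)
    effect.caused_by.append(cause.event_id)


# Hypothetical two-event cluster.
re1 = Event("RE-1", "RE", "PE-1", "t0")
re2 = Event("RE-2", "RE", "PE-3", "t1")
link(re1, re2)
im1 = InferredMalfunction("IM-1", created_by="RE-1",
                          events_explained=["RE-1", "RE-2"])
```

Because the hypothesis model has no link objects, the association between `re1` and `re2` lives only in the two events' own lists, which is the design the text describes for the CAUSE-OF and CAUSED-BY slots.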
Table 7 lists the attributes of the RECORDED-EVENT, EXPECTED-EVENT and LATENT-EVENT object classes. Because these events represent actual observations, they contain slots such as TIME-OF-OCCURRENCE and PROBABILITY-OF-ACCURATE-DETECTION. Other slots store information on related objects in the hypothesis model. Since the hypothesis model has no equivalent of a COMPILED-LINK, the associations between RECORDED-EVENTS, EXPECTED-EVENTS and LATENT-EVENTS are stored in the event objects in the CAUSE-OF and CAUSED-BY slots.

HYPOTHESIZED-ROOT-CAUSE (HRC) objects, listed in Table 8, correspond to the POTENTIAL-ROOT-CAUSE objects of the process model. HYPOTHESIZED-ROOT-CAUSES are created during diagnosis when a hypothesis is made by the event interpreter. The HYPOTHESIZED-ROOT-CAUSE object contains information on the time and reason for the hypothesis, the events and tests that support or oppose the HYPOTHESIZED-ROOT-CAUSE, and the calculated likelihood of the hypothesis.

Another important object is the INFERRED-MALFUNCTION (IM), shown in Table 9. This object class stores information pertaining to clusters of causally related events (RECORDED-, EXPECTED- and LATENT-EVENTS), HYPOTHESIZED-ROOT-CAUSES and TESTS. Because the observed events may not all be attributable to a single underlying cause, there may be multiple clusters of events and, therefore, multiple INFERRED-MALFUNCTION objects. One malfunction is assumed to exist for each INFERRED-MALFUNCTION. Each INFERRED-MALFUNCTION object maintains a list of all events and root cause candidates associated with an event cluster, the tests that have been performed, and the classification (STATUS) of the malfunction, such as persistent, ongoing-transient, completed-transient, spurious or corrected.

Figure 5 illustrates how the structure of the hypothesis model “mirrors” that of the process model.
In this hypothetical example, there have been three recorded events. The recorded events RE-1, RE-2 and RE-3 are realizations of potential events PE-1, PE-3 and PE-7, respectively. In addition, expected event EE-1 has been created on a forecast of a future event corresponding to PE-2. Causal relations indicate two disjoint event clusters, the first (IM-1) containing RE-1, EE-1 and RE-2, and the other (IM-2) consisting of only RE-3. IM-1 has the candidate malfunctions HRC-1 and HRC-2, corresponding to PRC-1 and PRC-2 in the process model, and IM-2 has the single malfunction candidate HRC-3, a realization of PRC-8.

3.5. User interface
The user interface allows the user to view or modify information contained in the knowledge-base and to control certain aspects of event interpretation. The interface consists of a set of windows and menus and is mouse-controlled. During on-line operation, a window shows the current malfunction candidates in order of likelihood rank.

4. THE MIDAS INFERENCE CYCLE

The MIDAS inference cycle is performed whenever a new recorded event is detected. The goal of the inference cycle is to integrate the new event into the existing hypothesis model, if possible, or to create new hypotheses to explain the event. The inference cycle consists of four major phases:
1. Event creation.
2. Search and linkage.
3. Source evaluation.
4. Evidence evaluation.
4.1. Event creation
Whenever a monitor detects a change of state, trend or status, it checks the contents of the STATE-EVENTS, TREND-EVENTS and MONITOR-EVENTS slots for POTENTIAL-EVENT instances that match the detected change. Inserting “yes” in the ACTIVE slot of the POTENTIAL-EVENT triggers a demon to begin the inference cycle, starting with the creation of a new RECORDED-EVENT instance and the updating of the associated MEASURED-OBJECT instance. At this stage, most of the slots of the new RECORDED-EVENT will be empty. Only basic information, copied from the POTENTIAL-EVENT, and detection information such as TIME-OF-DETECTION and PROBABILITY-OF-ACCURATE-DETECTION will be available.

Fig. 5. Correspondence between objects of process and hypothesis models. (Legend: potential root cause object, potential event object, recorded and expected event objects, hypothesized root cause object.)

4.2. Search and linkage
The next stage of the inference cycle is to search the hypothesis model for existing event clusters (INFERRED-MALFUNCTIONS) to which the new RECORDED-EVENT can be linked. The process model is used to guide the search. Based on the results of the search, one of several diagnostic actions is performed. Diagnostic actions create a network of causal links between events in the hypothesis model, create necessary HYPOTHESIZED-ROOT-CAUSE instances and evaluate the evidence contained in the causal network.
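The use of the process model to guide the search can be pictured as a bounded traversal of causal links, run once against the direction of causality (backward) and once along it (forward). The following Python sketch is a simplified illustration under stated assumptions, not the MIDAS code: links are plain `(cause, effect)` pairs, monitor interrogation of intervening events is ignored, the bound `max_links` is taken to count links (an assumed reading of the global search-depth parameter), and the returned case numbers follow the four search outcomes distinguished in this section. All function names are hypothetical.

```python
def build_graph(links):
    """Index (cause, effect) pairs as successor and predecessor maps."""
    succ, pred = {}, {}
    for cause, effect in links:
        succ.setdefault(cause, set()).add(effect)
        pred.setdefault(effect, set()).add(cause)
    return succ, pred


def nearest_observed(start, neighbors, observed, max_links):
    """Depth-first search from `start`; each branch stops at the first
    observed event or after `max_links` links, whichever comes first."""
    hits = set()
    fewest = {start: 0}              # fewest links needed to reach a node
    stack = [(start, 0)]
    while stack:
        node, dist = stack.pop()
        if dist == max_links:
            continue
        for nxt in neighbors.get(node, ()):
            if nxt in observed:
                hits.add(nxt)
            elif dist + 1 < fewest.get(nxt, max_links + 1):
                fewest[nxt] = dist + 1
                stack.append((nxt, dist + 1))
    return hits


def classify_new_event(new_event, links, observed, max_links=2):
    """Return (case, explained_by, explains) for a newly detected event."""
    succ, pred = build_graph(links)
    explained_by = nearest_observed(new_event, pred, observed, max_links)
    explains = nearest_observed(new_event, succ, observed, max_links)
    if not explained_by and not explains:
        case = 1                     # start a new cluster
    elif explained_by and not explains:
        case = 2                     # causal propagation
    elif explains and not explained_by:
        case = 3                     # previously undetected source
    else:
        case = 4                     # alternate path or bridge
    return case, explained_by, explains


# Hypothetical chain of potential events A -> B -> C -> D.
chain = [("A", "B"), ("B", "C"), ("C", "D")]
print(classify_new_event("B", chain, observed={"A"}))
```

With the chain above, an event detected at `D` while only `A` has been observed is three links away, so it is linked when `max_links=3` but starts a new cluster when `max_links=2`.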
Before the search begins, all previously postulated EXPECTED-EVENTS and LATENT-EVENTS are checked. It is possible that the new RECORDED-EVENT is the realization of a previously predicted event. If so, the existing EXPECTED-EVENT or LATENT-EVENT is removed in favor of the new RECORDED-EVENT, but otherwise the existing event network is undisturbed.

If the new event is not the realization of a previously postulated EXPECTED-EVENT or LATENT-EVENT, a search for causal links to other events (LATENT-, EXPECTED- or RECORDED-EVENTS) begins. The search is a depth-first progression through the process model starting at the POTENTIAL-EVENT corresponding to the new RECORDED-EVENT. The search is conducted in both forward and backward directions. While the backward search is usual, MIDAS includes the forward search to protect against errors when event detections occur in reverse causal order. A COMPILED-LINK that originates or terminates at the POTENTIAL-EVENT is selected, and the POTENTIAL-EVENT at the other end of the link is checked for corresponding RECORDED-EVENTS, EXPECTED-EVENTS or LATENT-EVENTS. If such an event exists, then a link between the new RECORDED-EVENT and the existing observed event is established. If the search was in the forward direction, the new RECORDED-EVENT can explain the existing event. If the search was backward, the new RECORDED-EVENT can be explained by the existing event.

If the search fails, interrogation may be performed to try to link the new RECORDED-EVENT with events separated by one or more undetected events. If the monitor responsible for an intervening event can predict a future event on that variable, an EXPECTED-EVENT or LATENT-EVENT is created,† and the search continues until a previous event is found, or until an interrogation fails. The maximum number of intervening nodes that can be traversed in this manner is bounded by a global parameter. Thus, the event interpreter can perform correctly even when there are one or more missing events.

†If the search is in the backward direction an LE is created; if the search is in the forward direction an EE is created.

When two or more clusters are being developed simultaneously, there is a remote possibility that an event can be explained by events in more than one cluster. If this occurs, the assumption of non-overlapping malfunctions is violated. In such a case, MIDAS adds the event arbitrarily to one of the clusters to which it can be linked. While this is a reasonable strategy, there is no guarantee of accurate diagnoses in this circumstance, because of possible interference and cancellation of symptoms. However, failure to diagnose overlapping malfunctions should not be considered a flaw in MIDAS, since in most cases the probability of multiple, closely-related malfunctions is very low, and these situations are difficult to diagnose by any method.

When malfunctions are non-overlapping, there are four possible outcomes of the search procedure:

Case 1. The new RECORDED-EVENT cannot be linked to any existing events. The new RECORDED-EVENT becomes the first event associated with a new INFERRED-MALFUNCTION object. The CREATED-BY and EVENTS-EXPLAINED slots of the INFERRED-MALFUNCTION are filled accordingly, and new HYPOTHESIZED-ROOT-CAUSE objects are created according to the LOCAL-CAUSES of the MEASURED-OBJECT corresponding to the event.‡

‡A root cause is a local cause of an event if the event is a primary symptom of the root cause.

Case 2. The new RECORDED-EVENT can be explained by previously existing events. The RECORDED-EVENT is the result of causal propagation of a disturbance, as illustrated in Fig. 6a. Causal propagation is expected and tends to confirm the existing diagnosis. The necessary links are created and the EVENTS-EXPLAINED slot is updated in the corresponding INFERRED-MALFUNCTION.

Case 3. The new RECORDED-EVENT can explain previously existing events. The RECORDED-EVENT is a previously undetected source event, as shown in Fig. 6b. This case signals a major monitor error (a LATENT-EVENT missed by previous interrogation), and significant revisions to the existing diagnosis can be expected. In this case, the HYPOTHESIZED-ROOT-CAUSES related to the previous source event are deleted in favor of the local causes of the new RECORDED-EVENT. Related attributes of the INFERRED-MALFUNCTION are updated. If the event is the source of two or more separate event clusters, they are coalesced into a single cluster, and one of the INFERRED-MALFUNCTIONS is deleted, reducing the number of hypothesized malfunctions by one. Again, fundamental revisions to existing hypotheses are required. The corrections are carried out by subprograms that can be considered the truth maintenance routines of MIDAS.

Case 4. The new RECORDED-EVENT can explain existing events and can be explained by others. This situation arises when the new RECORDED-EVENT completes an alternate causal path within a single event cluster, illustrated in Fig. 6c, or if the new RECORDED-EVENT constitutes a bridge between two INFERRED-MALFUNCTIONS, as in Fig. 6d. These cases can occur if an EXPECTED-EVENT was missed due to a failed interrogation in a previous step. In the former case, the new RECORDED-EVENT is added to the existing INFERRED-MALFUNCTION, and changes occur mainly in the EVENTS-EXPLAINED slots. In the latter case, the second INFERRED-MALFUNCTION is removed in favor of the first, and the remaining INFERRED-MALFUNCTION acquires all the RECORDED-EVENTS.

Fig. 6a. Case 2 event.
Fig. 6b. Case 3 event.
Fig. 6c. Case 4 event (single cluster).
Fig. 6d. Case 4 event (dual cluster).

4.3. Evaluating source events
All events in a cluster are classified as source events or consequence events. Within each INFERRED-MALFUNCTION, there must be at least one source event. A source event is one that cannot be explained by other events in the cluster, or one that can explain all other events in the cluster. Source events are assumed to be primary symptoms of the root cause. Multiple source events arise only if a root cause has multiple primary symptoms, or if there are completed causal loops among the events. Even though causal loops could be broken using the time stamps on events (i.e. the first event could be considered the source event), this was not done because of the possibility of inaccuracy when events are detected out of order. This is a conservative design choice that favors accuracy at the expense of poorer resolution. Some of the subsequent case study results would display better resolution if cycles of events had been broken using information on time of occurrence.

HYPOTHESIZED-ROOT-CAUSES are created for all local causes of the source events of the INFERRED-MALFUNCTION. Therefore, determining source events is a critical step in developing the final diagnosis. If an event is classified as a source event and later reclassified as a consequence event, any HYPOTHESIZED-ROOT-CAUSES created as a result of the event being a source are removed.

4.4. Evaluating root cause evidence
When a new RECORDED-EVENT has been linked to an INFERRED-MALFUNCTION, classified as source or consequence, and all HYPOTHESIZED-ROOT-CAUSES have been created, the causal network of the INFERRED-MALFUNCTION cluster is evaluated to determine which events support and oppose various HYPOTHESIZED-ROOT-CAUSES. MIDAS does not delete HYPOTHESIZED-ROOT-CAUSES for which there is opposing evidence, as was done in the DEA system (Kramer, 1988), because subsequent events may increase the ranking. However, HYPOTHESIZED-ROOT-CAUSES with low rankings may be suppressed from the operator display.

Each event in a cluster is a piece of evidence that can support a HYPOTHESIZED-ROOT-CAUSE (increase its relative likelihood) or oppose a HYPOTHESIZED-ROOT-CAUSE (decrease its relative likelihood). MIDAS sorts the set of root cause candidates associated with an INFERRED-MALFUNCTION in order of relative likelihood, with the highest ranked HYPOTHESIZED-ROOT-CAUSE representing the most likely candidate. During evidence evaluation, all INFERRED-MALFUNCTIONS are considered individually.

An event supports a HYPOTHESIZED-ROOT-CAUSE if:
(1) the event is a source event that is a primary symptom of the HYPOTHESIZED-ROOT-CAUSE, or
(2) a causal pathway exists from a primary symptom of the HYPOTHESIZED-ROOT-CAUSE to the event, satisfying all the link conditions on the path.

An event opposes a HYPOTHESIZED-ROOT-CAUSE if:
(1) the event is a source event that is not a primary symptom of the HYPOTHESIZED-ROOT-CAUSE, or
(2) no causal pathway exists from any primary symptom of the HYPOTHESIZED-ROOT-CAUSE to the event, or
(3) all causal pathways from the primary symptoms of the HYPOTHESIZED-ROOT-CAUSE to the event are blocked by link conditions.
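The source-event definition of Section 4.3 and the support and oppose rules above can be sketched together. This Python fragment is a hedged illustration, not the MIDAS routine. It assumes that link conditions have already been reduced to a boolean per link, and it adopts one possible reading of the rules: a source event supports a candidate only if it is a primary symptom, while a consequence event supports it if it is a primary symptom or is reachable from one over unblocked links. Function and variable names are hypothetical.

```python
def source_events(events, explains):
    """Events that no other cluster event explains, or that explain all others.

    `explains` maps an event to the set of cluster events it directly explains.
    """
    def reach(start):
        seen, stack = set(), [start]
        while stack:
            for nxt in explains.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reached = {e: reach(e) for e in events}
    sources = set()
    for e in events:
        others = set(events) - {e}
        unexplained = not any(e in reached[o] for o in others)
        explains_all = others <= reached[e]
        if unexplained or explains_all:
            sources.add(e)
    return sources


def evidence(events, sources, primary_symptoms, links):
    """Split cluster events into (supporting, opposing) for one candidate.

    `links` holds (cause, effect, conditions_ok) triples from the process model.
    """
    succ = {}
    for cause, effect, conditions_ok in links:
        if conditions_ok:                      # blocked links are dropped
            succ.setdefault(cause, set()).add(effect)
    reachable, stack = set(), list(primary_symptoms)
    while stack:
        for nxt in succ.get(stack.pop(), ()):
            if nxt not in reachable:
                reachable.add(nxt)
                stack.append(nxt)
    supporting, opposing = set(), set()
    for e in events:
        if e in sources:
            ok = e in primary_symptoms
        else:
            ok = e in primary_symptoms or e in reachable
        (supporting if ok else opposing).add(e)
    return supporting, opposing


# Hypothetical cluster E1 -> E2 -> E3 with the last link blocked.
cluster = ["E1", "E2", "E3"]
src = source_events(cluster, {"E1": {"E2"}, "E2": {"E3"}})
print(evidence(cluster, src, {"E1"},
               [("E1", "E2", True), ("E2", "E3", False)]))
```

In a completed two-event loop, both events explain each other, so `source_events` returns both, which matches the conservative choice in the text of not breaking loops with time stamps.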
Events can provide different levels of evidence. TESTS nominally provide the strongest evidence for or against a particular HYPOTHESIZED-ROOT-CAUSE. RECORDED-EVENTS are strong evidence for or against a particular HYPOTHESIZED-ROOT-CAUSE because the statistical tests involved in detection of a RECORDED-EVENT carry a high level of significance. EXPECTED-EVENTS and LATENT-EVENTS are considered weak evidence for or against a HYPOTHESIZED-ROOT-CAUSE. These events are created using tests of lower significance than RECORDED-EVENTS. Levels of evidence are specified in the slots PROBABILITY-OF-ACCURATE-DETECTION (POAD) and PROBABILITY-OF-INACCURATE-DETECTION (POID) of the various event and TEST objects. Default values for PROBABILITY-OF-ACCURATE-DETECTION and PROBABILITY-OF-INACCURATE-DETECTION, given in Table 7, are consistent with the default levels of significance of the monitor tests.

MIDAS can calculate the likelihood of a HYPOTHESIZED-ROOT-CAUSE using several formulas. Generally, the specific formula will not strongly affect the order of the hypothesis ranking; the options are provided to allow for user preferences. The normalized conditional probabilistic likelihood (NCPL), below, is serviceable in most cases. Dempster-Shafer type formulae (Shafer, 1976) are provided if upper and lower bounds on the calculated likelihoods are desired.

The NCPL is based on the product of the probabilities of supporting and opposing events and a prior probability. If PP(H_i) represents the prior probability associated with H_i, then the conditional probabilistic likelihood (CPL) is:

$$\mathrm{CPL}(H_i) = \mathrm{PP}(H_i) \cdot \prod_{j \in OE} \mathrm{POID}_j \cdot \prod_{k \in SE} \mathrm{POAD}_k \qquad (2)$$

In this formula, OE represents the subset of cluster events opposing H_i and SE represents the subset of cluster events supporting H_i. Note that opposing evidence counts less against the hypothesis if the PROBABILITY-OF-INACCURATE-DETECTION is high and, conversely, supporting evidence counts less for the hypothesis if the PROBABILITY-OF-ACCURATE-DETECTION is low. The normalized conditional probabilistic likelihood (NCPL) is then simply:

$$\mathrm{NCPL}(H_i) = \frac{\mathrm{CPL}(H_i)}{\sum_{m} \mathrm{CPL}(H_m)} \qquad (3)$$

where the sum runs over all root cause candidates H_m of the INFERRED-MALFUNCTION.
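A small numerical sketch may help make equations (2) and (3) concrete. The Python below is illustrative only: the function names and the candidate labels are hypothetical, normalization is assumed to run over the candidates of a single INFERRED-MALFUNCTION, and the detection probabilities are the RECORDED-EVENT defaults from Table 7 (POAD = 0.95, POID = 0.05).

```python
from math import prod


def cpl(prior, supporting_poad, opposing_poid):
    """Equation (2): prior times POAD of supporting and POID of opposing events."""
    return prior * prod(supporting_poad) * prod(opposing_poid)


def ncpl(candidates):
    """Equation (3): normalize CPL over all candidates of one cluster.

    `candidates` maps a candidate name to (prior, supporting POADs, opposing POIDs).
    """
    raw = {name: cpl(*args) for name, args in candidates.items()}
    total = sum(raw.values())
    if total == 0:                       # every candidate fully contradicted
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}


# Seven equally likely candidates and two recorded events: two candidates are
# supported by both events, five are supported by one and opposed by the other.
POAD, POID = 0.95, 0.05
candidates = {f"H{i}": (1 / 7, [POAD, POAD], []) for i in (1, 2)}
candidates.update({f"H{i}": (1 / 7, [POAD], [POID]) for i in range(3, 8)})
ranking = ncpl(candidates)
print(round(ranking["H1"], 3), round(ranking["H3"], 3))
```

Under these assumptions the two fully supported candidates each receive about 0.442 and the remaining five about 0.023, which agrees with the certainties quoted for the second event of the jacket leak example in Section 5.4.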
5. CASE STUDY
A case study was performed to compile performance statistics on MIDAS and to verify the diagnostic algorithm. Full details of the case study are given in Finch (1989).

5.1. CSTR process
The process used in the study is the jacketed CSTR process illustrated in Fig. 7. This process incorporates many features that make diagnosis challenging. The process contains multiple interacting feedback loops, nonlinear relations, and compensating and inverse responses. In this process, two parallel first-order reactions take place in the CSTR. The primary reaction is exothermic, and the side reaction is endothermic. Both reactions are essentially irreversible and governed by Arrhenius temperature dependencies. Temperature control is accomplished by cascade control using measurements of reactor temperature and cooling water flowrate to adjust cooling water flow. Fluid level in the reactor is controlled by varying outlet flow. In addition to the indicated sensors, it is assumed that all controller output signals are available. The temperature controller output signal is the cooling water flow setpoint. Other controller setpoints are not measured directly. All sensors include noise, including sensors input to control loops.

Fig. 7. Jacketed CSTR process.

A program was written to simulate 109 different malfunction modes. The malfunctions include blockages, leaks, heat transfer faults, reaction faults, controller and valve malfunctions, variations at process boundaries and sensor failures. The failure rapidity and ultimate extent could be specified, in effect making the set of possible failure cases infinite. Seventy-six malfunctions were chosen randomly (with replacement) from the set of 109 POTENTIAL-ROOT-CAUSES, and assigned random extents and rapidities. The prior probabilities of the malfunctions were assumed to be equal. Of the selected malfunctions, 34 were sensor failures, distributed as follows:
• 10 in-range sensor failures
• 11 out-of-range sensor failures
• 13 sensor biases.
Fourteen malfunctions affected sensors that were input to controllers. All episodes lasted 90 simulated minutes, with the malfunction introduced after 15 min of normal operation.

5.2. MIDAS model
The causal portion of the event model was constructed using the MIDAS Model Builder and Model Translator programs. Four constraint equations were manually added to the model: an overall reactor material balance, a reactor chemical species balance, a cooling water pressure drop equation and a reactor effluent pressure drop equation. The material balances were written as dynamic difference equations, and the pressure drop equations were algebraic. Information on the relative strength of effects was used in a small number of cases to resolve ambiguities resulting from parallel feedforward causal paths of opposite sign during the model building phase. About 7% of the measurements over the complete set of malfunctions were disambiguated in this fashion. The entire model building process was completed in less than 12 h. Details of the model construction are found in Oyeleye (1990). The final process model consisted of:
• 144 POTENTIAL-EVENT instances
• 109 POTENTIAL-ROOT-CAUSE instances
• 334 COMPILED-LINK instances
• 18 MEASURED-OBJECT instances
• 18 DATA-MONITOR instances.

Monitor thresholds were set from fault-free data generated by the simulation program, using an automatic tuning utility provided in MIDAS. No additional tuning of the monitors was performed.

5.3. Case study results
A total of 486 events were detected and interpreted. Of the 76 cases, 31 included compensatory response and 16 exhibited inverse response of at least one measured variable. There were 23 major monitor errors: false alarms or failure to detect latent or expected events. In addition, in almost every case, there was at least one minor monitor error: out-of-order events or erroneous latent or expected events.

The following performance measures were used. The accuracy of a diagnosis was defined as 1 if the true malfunction was a member of the ranked malfunction candidate set, and 0 otherwise. Resolution was defined as the number of malfunction candidates (including the actual malfunction) with likelihood rank greater than or equal to the rank of the true malfunction. Resolution represents how far down the list of possible causes one must go to find the true malfunction. If the true root cause is the top-rated candidate, then the resolution equals one. A performance index Φ, capturing both the accuracy and resolution, is:

$$\Phi = \text{accuracy} \times \frac{\text{No. malfunctions} - \text{resolution}}{\text{No. malfunctions} - 1} \qquad (4)$$

If the diagnosis is inaccurate, Φ = 0. Otherwise, Φ represents the fraction of malfunctions eliminated from consideration or ranked below the true malfunction. If two malfunctions are hypothesized to exist in any case where only one root cause exists, then all candidates of the second malfunction were considered to be ranked higher than the true malfunction in the calculation of resolution and performance index.

The NCPL ranking is based on the relative amount of evidence supporting and opposing a candidate. With equal prior probabilities, in INFERRED-MALFUNCTION clusters with only a single event, all root cause candidates must have equal ranking. In INFERRED-MALFUNCTION clusters with more than one event, candidates will separate into several tiers. Within a given tier, all candidates have equal ranking. The basic result of the study can be summarized as follows:
• MIDAS produced an accurate initial diagnosis in 93% of all cases
• MIDAS produced an accurate final diagnosis in 100% of all cases
• After all events, the true root cause was ranked in the first tier in 82% of all cases
• After all events, the true root cause was ranked in the second tier in 8% of all cases
• The final diagnosis had a performance rating of Φ = 0.97
• The overall accuracy was 98.9%.

Some more detailed results are as follows. Table 10 presents the accuracy, tier of true malfunction and performance of MIDAS as a function of the number of events observed. Note that all 76 malfunction cases exhibited at least one event, 68 exhibited at least two events, etc. MIDAS consistently eliminated from the diagnosis over 90% of all possible root causes, and the true root cause ranked in the first or second tier in over 90% of cases with seven or fewer events observed. As the number of events increases, the number of tiers grows, and this accounts for the large drop in the tier ranking with increased number of events. However, with more events, each tier has fewer members, so the performance does not experience the same decline.

The best performance was produced by cases with 4-7 events. With fewer events, MIDAS had too little information to produce better resolution. The decline in performance with eight or more events is related to completion of a particular eight-event causal loop in the event model representing the temperature control system. When the disturbance origin was located within this loop and the control system underwent an oscillation, the causal loop was completed in the event graph. As mentioned previously, time stamps were not used to resolve the source events in completed loops. If this had been done, the performance would not have declined as the number of events grew, unless the accuracy were adversely affected by misidentification of the source event.

Table 10. Overall diagnostic performance of MIDAS¹
Events observed | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
No. of cases | 76 | 68 | 59 | 54 | 49 | 38 | … | … | 18 | 12 | 10
Accuracy (% of cases) | 93 | 100 | 100 | 100 | 100 | 100 | 97 | 100 | 100 | 100 | 100
Average performance (Φ) | 0.88 | 0.94 | 0.95 | 0.95 | 0.96 | 0.96 | 0.95 | 0.94 | 0.93 | 0.92 | 0.92
Ranked in top tier (% of cases), column incomplete | 89, 90, 82, 91, …, 76, 67, 63
Ranked in first or second tier (% of cases), column incomplete | 97, 98, 98, 92, 90, 80, 84, 76, 63
¹Statistics based on a sample of 76 cases.

Figure 8 shows the final resolution for all cases. The mean resolution is 3.9 malfunctions, and the median resolution is two malfunctions, meaning that in most cases, only one malfunction was ranked ahead of or at the same level as the actual malfunction. The higher mean is the result of a small number of cases with poorer resolution.

Fig. 8. Distribution of final resolution (horizontal axis: resolution, 1 to 16).

5.4. Diagnosis examples
In this section, we present a detailed examination of two diagnosis examples. These examples were not part of the random case study, but rather were chosen to demonstrate specific features of MIDAS.

Example 1: Jacket leak to the reactor

Table 11. Diagnosis of jacket leak to reactor
Event description | Detection time (min) | Tier of true fault | Certainty of true fault | Candidates in top tier | Φ
Inventory constraint high | 25.5 | 1 | 0.143 | 7 | 0.94
Level controller signal high | 26.0 | 1 | 0.442 | 2 | 0.99
Product flowrate high | 30.0 | 1 | 0.473 | 2 | 0.99
Cooling water flowrate low¹ | 31.5 | 1 | 0.473 | 2 | 0.94
Reactor level high | 32.0 | 1 | 0.333 | 3 | 0.93
Cooling water flow controller signal low² | 32.0 | 1 | 0.333 | 3 | 0.94
Product concentration B low | 33.0 | 1 | 0.333 | 3 | 0.94
Cooling water setpoint low³ | 35.0 | 1 | 0.333 | 3 | 0.98
Reactor level normal | 52.5 | 1 | 0.487 | 2 | 0.99
¹Creates (incorrectly) second malfunction, IM-2. All candidates of IM-2 are considered to be ranked higher than the true fault in calculating performance (Φ).
²Linked to IM-2.
³Event causes hypothesis of malfunction IM-2 to be rescinded.
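The performance values reported for the diagnosis examples can be checked against equation (4). The short Python sketch below is an illustration, not part of MIDAS: the function name is hypothetical, the number of malfunctions is assumed to be the 109 POTENTIAL-ROOT-CAUSES of the case study model, and the resolution is taken to equal the number of candidates in the top tier when the true fault is itself in that tier.

```python
def performance_index(accurate, resolution, n_malfunctions):
    """Equation (4): fraction of malfunctions eliminated or ranked below the true one."""
    if not accurate:
        return 0.0
    return (n_malfunctions - resolution) / (n_malfunctions - 1)


N_MALFUNCTIONS = 109  # POTENTIAL-ROOT-CAUSE instances in the case study model

for top_tier_size in (7, 3, 2, 1):
    phi = performance_index(True, top_tier_size, N_MALFUNCTIONS)
    print(top_tier_size, round(phi, 2))
```

With seven candidates tied in the top tier the index is about 0.94, and with two it is about 0.99, consistent with the first two rows of Table 11.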
In this example, a leak develops that allows water to enter the reactor from the cooling jacket. At its ultimate extent, the leak represents approx. 10% of the reactor feed flowrate. The malfunction episode begins at 15 min and takes almost 5 min to reach its full extent. Table 11 shows the progression of the diagnosis. This example demonstrates the dynamic character of MIDAS, as well as robustness and recovery from errors. Event 1: Violation of the reactor inventory constraint suggests excess fluid entering the reactor and seemingly leads directly to diagnosis of a jacket leak into the reactor, but because the constraint is calculated using three separate sensor measurements, the event is also consistent with fixed failure or bias (offset or drift) in any of these sensors. Event 2: High level controller signal is consistent with a high bias of the reactor level sensor, so this candidate remains in the first tier with jacket leak. High fixed failure of the level sensor is not supported because the level itself is not recorded as high; on the other hand, level Sensor bias is consistent with the observed events because the controller can compensate for a level sensor bias, making the level appear normal. The other candidates are not supported by this event and drop to the second tier with low likelihood of 0.023. Event 3: High product Bowrate is a downstream consequence of Event 2. Jacket leak and
F. E. FINCH et al.
1394
Table 12. Diannosis of coolina water pressure surge, with induced sensor failure Event description Cooking water pressure high CW temperature sensor failed’ CW flow controller signal low CW pressure sensor failed CW pressure sensor OK cw pressure nomlal CW Row controller sinnal normal
Detection
Tier of true fault
Certainty of true fauk
16.5 17.0 20.0 22.5 40.5 46.5 53.0
I I
0.333 0.333 0.905 0.487 0.487 0.487 0.487
time (min)
1 1
I I
1
Candidates in top tier
81
3 3 I
0.98 0.98 1.00
: 2 2
0.99 0.99 0.99
’ Performance 8 for first malfunction (pressure surge). ‘Creates (correctly) second malfunction related to temperature sensor failure; resolution = 2, @ = 0.99.
level sensor bias remain in the first tier, but evidence provided by this event raises their likelihoods, while lowering the likelihood of other candidates.
Event 4: When a low cooling water flowrate is detected, MIDAS cannot find a consistent single hypothesis for all events, and formulates a two-malfunction explanation. The event is actually related to the action of the temperature control system to correct the extra cooling resulting from the leak, but it is out of order since the cooling water setpoint and cooling water flow controller signal events are expected first. MIDAS is unable to anticipate the correct event order because the system parameter Maximum-Precursor-Search-Depth was set to two for all case study runs. The closest abnormal event is three links away from Cooling Water Flowrate Low. Therefore, there is no interrogation of monitors to try to fill in the missing events.
Event 5: The fifth event is high reactor level. The diagnosis marginally degrades since this event correlates well with high fixed failure of the level sensor, which moves to the same tier as level sensor high bias and jacket leak.
Event 6: Low cooling water controller signal is linked to low cooling water flowrate, but cooling water setpoint fails the interrogation test (a monitor error), so a two-malfunction explanation remains.
Event 7: The detection of low product concentration is a downstream consequence of previous events and does not change the primary diagnosis. Although the diagnosis does not improve, MIDAS performs an alarm filtering function by relating this event to the previously postulated malfunction.
†This is a minor semantic problem in the process model. The malfunction "cooling water temperature sensor failed" is represented by two separate malfunction modes representing each possible failure direction.
Event 8: With the detection of low cooling water setpoint, MIDAS is able to see that the cooling system response was related to the other events, and the second malfunction is eliminated, correcting the earlier error. The primary diagnosis is unchanged by this procedure. Despite the fact that MIDAS temporarily postulated two malfunctions, the true root cause was always in the first tier of the first malfunction candidate set.
Event 9: When the reactor level returns to normal, fixed failure of the level sensor is removed from the first tier to produce a final diagnosis consisting of two top ranked malfunction candidates: jacket leak to reactor and level sensor high bias.

Example 2-Pressure surge induced failure of the cooling water temperature sensor
In this example, a transient surge in cooling-water source pressure results in failure of the temperature probe in the jacket feed pipe. The progression of the diagnosis is presented in Table 12. This example demonstrates the handling of multiple malfunctions, specifically an induced sensor failure, and the diagnosis of transient disturbances.
Event 1: The pressure surge is detected. The malfunction candidates are cooling water pressure sensor high bias, high failure and high cooling water source pressure.
Event 2: The signal from the cooling water temperature sensor goes out-of-range and the sensor is flagged as having failed. This event creates a second malfunction, correctly diagnosing the two separate malfunctions affecting the process. The two root cause candidates of the second cluster are cooling water temperature sensor failed high and cooling water temperature sensor failed low.†
Event 3: Detection of a low cooling water controller signal is confirmation that a real disturbance is underway and not a dual sensor failure. High cooling water source pressure becomes the sole candidate for the first malfunction with an NCPL of 0.905.
Event 4: The sudden pressure surge violates a range check and the monitor flags the sensor as having failed, another example of monitor error. Pressure sensor failure is lifted to the first tier of INFERRED-MALFUNCTION cluster 1. The true malfunction is retained in the first tier. No events associated with the cooling water pressure sensor will be detected as long as the sensor is flagged as failed.
Event 5: With the pressure surge over, the monitor finds the pressure sensor OK.
Event 6: Cooling water pressure is verified as normal.
Event 7: The cooling water flow controller signal returns to normal. The first INFERRED-MALFUNCTION cluster is classified as a COMPLETED-TRANSIENT and archived. The second INFERRED-MALFUNCTION cluster is still a PERSISTENT failure and remains in the active memory awaiting correction.
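Event 4 of Example 1 turns on the Maximum-Precursor-Search-Depth parameter: with a depth of two, an abnormal event three links upstream is never reached, so no missing intermediate events are interrogated. The following Python sketch illustrates that behavior with a depth-limited backward search. It is not the MIDAS implementation; the causal links, the event names (in particular `recorded_upstream_event`) and the function `find_recorded_precursor` are assumptions made up for illustration.

```python
from collections import deque

# Hypothetical causal links (cause -> effects), loosely following Example 1.
# "recorded_upstream_event" is a placeholder for the closest abnormal event.
CAUSES = {
    "recorded_upstream_event": ["cw_setpoint_low"],
    "cw_setpoint_low": ["cw_controller_signal_low"],
    "cw_controller_signal_low": ["cw_flowrate_low"],
}


def precursors(event):
    """Return the immediate causes of an event in the hypothetical model."""
    return [cause for cause, effects in CAUSES.items() if event in effects]


def find_recorded_precursor(event, recorded, max_depth):
    """Breadth-first backward search limited to max_depth links.

    Returns the causal path from the nearest recorded precursor down to
    `event`, or None if no recorded precursor lies within max_depth links.
    """
    queue = deque([(event, [event])])
    seen = {event}
    while queue:
        node, path = queue.popleft()
        if len(path) - 1 >= max_depth:
            continue  # search depth exhausted along this branch
        for cause in precursors(node):
            if cause in seen:
                continue
            seen.add(cause)
            new_path = [cause] + path
            if cause in recorded:
                return new_path
            queue.append((cause, new_path))
    return None


recorded = {"recorded_upstream_event"}
shallow = find_recorded_precursor("cw_flowrate_low", recorded, max_depth=2)
deep = find_recorded_precursor("cw_flowrate_low", recorded, max_depth=3)
print(shallow)     # None: the recorded event is three links away
print(deep[1:-1])  # intermediate events a deeper search could interrogate
```

In this sketch, a depth of three would expose the setpoint and controller signal events as candidates for monitor interrogation, whereas a depth of two leaves the new event unlinked, which corresponds to the temporary two-malfunction explanation in the example. A deeper search presumably costs more per event, which this toy model does not attempt to quantify.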
6. DISCUSSION
New techniques in the areas of representation and reasoning have allowed MIDAS to achieve the following objectives:
• Utilization of information contained in the dynamic evolution of the disturbance
• Robustness to symptom variation evidenced as missing, false and out-of-order events
• Diagnosis of induced sensor failures and multiple non-overlapping failures
• Combined utilization of qualitative and quantitative process knowledge
• Automatic knowledge base preparation, minimizing the amount of engineering effort needed to diagnose new or modified processes.
Case study results show MIDAS to be accurate over 98% of the time, and able to place the actual malfunction in the first or second tier of identified causes in 90% of the cases simulated. A large percentage of these cases included complex dynamics, false, missing and out-of-order events, as would likely be encountered in practice. While successful within the scope of the examples tried, MIDAS naturally has certain limitations. Some of the issues remaining for further development are as follows:
1. Since event models can be complicated, it is generally advantageous to utilize the model preparation programs provided with MIDAS. However, at present these programs require an amount of computation that scales exponentially with the size of the strongly connected components in the signed directed graph. The computation may take a “long time” for large interconnected graphs.†
2. Event detection by process monitors is error-prone. Many errors in the case study occurred because events were missed by the monitors, or detected seriously out of order. MIDAS dealt successfully with many of these errors, but the overall system reliability could be enhanced with more reliability and sensitivity at the level of event detection.
3. Currently, the hypothesis model gives an incomplete explanation of how MIDAS reaches its conclusions. Conceptually, there is no reason why MIDAS could not develop more coherent explanations of its diagnoses, automatically.
4. If the process does not have a nominal steady state, the causal diagnostics would require a dynamic process model running in parallel with the plant to distinguish abnormal transients from normal dynamic behavior.
5. Since arcs of the event model represent “may cause” rather than “will cause” relations, MIDAS can explain events but it cannot predict them. Therefore, only events that occur can currently be factored into the diagnostic reasoning. On the other hand, the “may cause” interpretation enables MIDAS to handle varying rates of propagation of disturbances, and diagnose correctly even if there are long time delays between events.
6. MIDAS assumes the process remains within a single qualitative regime, and malfunctions that add unmodeled new causalities may not be diagnosed correctly. Grantham and Ungar (1989) have recently made interesting comments on this problem.
7. Information on process delays, gains and the like can only be used in the context of quantitative constraints within MIDAS. Such information cannot currently be specified as a property of causal relationships.
†Recent work by Rose (1990) appears to be successful in removing this limitation as it relates to the analysis of feedback effects.
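The first limitation above ties model-preparation cost to the size of the strongly connected components of the signed directed graph. As a rough way to gauge that size before running the preparation programs, the components can be computed directly. The Python sketch below uses Kosaraju's algorithm on a small, hypothetical signed digraph; the variable names and arc signs are illustrative assumptions, not taken from the MIDAS case study, and the sketch does not reproduce the MIDAS preparation programs themselves.

```python
def strongly_connected_components(edges):
    """Kosaraju's algorithm. `edges` maps (source, target) -> sign (+1 or -1).

    The signs are carried in the data structure but do not affect
    connectivity, so only the arc directions are used here.
    """
    nodes = sorted({n for pair in edges for n in pair})
    forward = {n: [] for n in nodes}
    backward = {n: [] for n in nodes}
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)

    # First pass: record nodes in order of DFS completion on the forward graph.
    visited, order = set(), []

    def visit(n):
        visited.add(n)
        for m in forward[n]:
            if m not in visited:
                visit(m)
        order.append(n)

    for n in nodes:
        if n not in visited:
            visit(n)

    # Second pass: sweep the reversed graph in reverse completion order.
    assigned, components = set(), []

    def collect(n, comp):
        assigned.add(n)
        comp.append(n)
        for m in backward[n]:
            if m not in assigned:
                collect(m, comp)

    for n in reversed(order):
        if n not in assigned:
            comp = []
            collect(n, comp)
            components.append(sorted(comp))
    return components


# Hypothetical signed digraph with a level loop and a temperature loop.
edges = {
    ("feed_flow", "level"): +1,
    ("level", "level_controller_signal"): +1,
    ("level_controller_signal", "product_flow"): +1,
    ("product_flow", "level"): -1,
    ("cw_flow", "reactor_temperature"): -1,
    ("reactor_temperature", "cw_setpoint"): +1,
    ("cw_setpoint", "cw_flow"): +1,
}
components = strongly_connected_components(edges)
largest = max(len(c) for c in components)
print(components)
print("largest strongly connected component:", largest)
```

Here each feedback loop forms its own three-node component and the feed flow stands alone. Coupling the two loops with arcs in both directions would merge them into a single six-node component, which is the kind of growth the limitation warns about. The recursive traversal is adequate for small graphs; a very large model would call for an iterative version.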
These limitations notwithstanding, we believe MIDAS in its current form presents a viable approach for on-line diagnosis of chemical and related process systems.

Acknowledgements-This research was funded by the National Science Foundation under Grants CTS8946888, CTS-8814226 and CBT-8605253. Dr Oyeleye was partially supported by a grant from the Nigerian government.
NOMENCLATURE
EE = Expected-event (hypothesis model object)
HRC = Hypothesized-root-cause (hypothesis model object)
IM = Inferred-malfunction (hypothesis model object)
LE = Latent-event (hypothesis model object)
MIDAS = Model-integrated diagnostic analysis system
NCPL = Normalized conditional probabilistic likelihood
PE = Potential-event (process model object)
POAD = Probability of accurate detection
POID = Probability of inaccurate detection
PRC = Potential-root-cause (process model object)
RE = Recorded-event (hypothesis model object)
REFERENCES
Allen D. J., Digraphs and fault trees. Ind. Engng Chem. Fundam. 23, 175-180 (1984).
De Kleer J., An assumption-based TMS. Artificial Intell. 28, 127-162 (1986).
Dhurjati P. S., D. E. Lamb and D. L. Chester, Experience in the development of an expert system for fault diagnosis in a commercial scale chemical process. In Foundations of Computer Aided Process Operations (G. V. Reklaitis and H. D. Spriggs, Eds). CACHE/Elsevier (1987).
Finch F. E., Automated fault diagnosis of chemical process plants using model-based reasoning. Sc.D. Thesis, Massachusetts Institute of Technology (1989).
Finch F. E. and M. A. Kramer, The handling of dynamics, multiple faults, and out-of-order alarms in the MIDAS diagnosis system. Paper 37e, AIChE Spring Meeting, Houston (1989).
Grantham S. and L. H. Ungar, A first principles approach to automated troubleshooting of chemical plants. Paper 26f, AIChE Natl Meeting, San Francisco (1989).
Iri M., K. Aoki, E. O'Shima and H. Matsuyama, An algorithm for diagnosis of system failures in the chemical process. Computers chem. Engng 3, 489-493 (1979).
Kokawa M., S. Miyazaki and S. Shingai, Fault location using digraph and inverse direction search with application. Automatica 19, 729-735 (1983).
Kramer M. A., Malfunction diagnosis using quantitative models with non-Boolean reasoning in expert systems. AIChE Jl 33, 130-140 (1987a).
Kramer M. A., Expert systems for process fault diagnosis: a general framework. In Foundations of Computer Aided Process Operations (G. V. Reklaitis and H. D. Spriggs, Eds). CACHE/Elsevier (1987b).
Kramer M. A., Automated diagnosis of malfunctions based on object-oriented programming. J. Loss Prev. Process Ind. 1, 226-232 (1988).
Kramer M. A. and F. E. Finch, Fault diagnosis of chemical processes. In Knowledge-Based Systems Diagnosis, Supervision and Control (S. G. Tzafestas, Ed.). Plenum, New York (1989).
Kramer M. A. and B. L. Palowitch Jr, Expert systems and knowledge-based approaches to process malfunction diagnosis. Paper 70b, AIChE Natl Meeting, Chicago (1985).
Kramer M. A. and B. L. Palowitch Jr, A rule-based approach to fault diagnosis using the signed directed graph. AIChE Jl 33, 1067-1078 (1987).
Moore R. L. and M. A. Kramer, Expert systems in on-line process control. In Chemical Process Control III (M. Morari and T. J. McAvoy, Eds). CACHE/Elsevier (1986).
Nelson L. S., The Shewhart control chart-tests for special causes. J. Quality Technol. 16, 237-239 (1984).
Oyeleye O. O., Qualitative modeling of continuous chemical processes and applications to fault diagnosis. Sc.D. Thesis, Massachusetts Institute of Technology (1990).
Oyeleye O. O. and M. A. Kramer, Qualitative simulation of chemical process systems: steady-state analysis. AIChE Jl 34, 1441-1454 (1988).
Oyeleye O. O., F. E. Finch and M. A. Kramer, Qualitative modeling and fault diagnosis of dynamic processes by MIDAS. Chem. Engng Commun. 96, 205-228 (1990).
Petti T. F., J. Klein and P. S. Dhurjati, Diagnostic model processor: using deep knowledge for process fault diagnosis. AIChE Jl 36, 565-575 (1990).
Rich S. H. and V. Venkatasubramanian, Model-based reasoning in diagnostic expert systems for chemical process plants. Computers chem. Engng 11, 111-112 (1987).
Rose P., A model-based system for fault diagnosis of chemical process plants. M.S. Thesis, Massachusetts Institute of Technology (1990).
Shafer G., A Mathematical Theory of Evidence. Princeton Univ. Press, Princeton (1976).
Shewhart W. A., Economic Control of Quality in Manufactured Product. Van Nostrand, New York (1931).
Shiozaki J., H. Matsuyama, E. O'Shima and M. Iri, An improved algorithm for diagnosis of system failures in the chemical process. Computers chem. Engng 9, 285-293 (1985).
Stefik M. and D. G. Bobrow, Object-oriented programming: themes and variations. AI Mag. 6, 40-62 (1986).
Ulerich N. H. and G. J. Powers, On-line hazard aversion and fault diagnosis in chemical processes: digraph, fault tree method. IEEE Trans. Reliab. 37, 171-177 (1988).
Washio T., M. Kitamura and K. Sugiyama, Development of failure diagnosis method based on transient information of nuclear power plant. J. nucl. Sci. & Technol. 24, 30-39 (1987).