Expert Systems With Applications 14 (1998) 371–383
Knowledge based process control supervision and diagnosis: the AEROLID approach

C. Alonso González a,*, G. Acosta b, J. Mira c, C. de Prada d

a Department of Computer Science, Edificio de las Tecnologías de la Información y las Telecomunicaciones, Camino del Cementerio s/n, 47009 Valladolid, Spain
b Grupo ADQDAT, Department of Electromechanics, Faculty of Engineering, Buenos Aires Province Center National University, Av. del Valle 5737, 7400 Olavarría, Argentina
c Department of Automatics and Computer Science, Spanish Open University (UNED), Madrid, Spain
d Department of System Engineering and Automatics, Faculty of Sciences, University of Valladolid, Prado de la Magdalena s/n, 47011 Valladolid, Spain
Abstract

The impact of artificial intelligence on process control, although an active research area with some implementations in industrial environments, has not been sufficiently evaluated in numerical terms over the long term. The present article provides such an evaluation of a knowledge based system that performs supervisory control tasks in the production of sugar from sugar beet, paying particular attention to fault detection and diagnosis. A way of conceiving supervision for continuous processes is presented and supported by this industrial application. The expert system carrying out the supervisory tasks runs on a VAX™ workstation, directly over the distributed control system. The expert system development tool is G2™, which has real-time facilities. Although the core system was developed in G2, it also consists of some external modules, because it combines both analytical and artificial intelligence problem-solving techniques. The global architecture, as well as the implementation details of the modules necessary for fault identification, are presented together with the experimental results obtained in the factory. © 1998 Elsevier Science Ltd. All rights reserved.
1. Introduction

Supervision is a fundamental aspect of every agent interacting with its medium in order to achieve an objective (whatever it may be). In human terms, even the most mundane activities involve some kind of feedback in order to assure a minimum degree of accomplishment. Nevertheless, it should also be noted that the need for supervision arises when the global outcome of this interaction is under strong constraints. A very simple activity illustrates these points. Imagine you want to cook something for dinner, let us say a nice spaghetti dish. Obviously, some supervision is needed to reach the simple objective of eating something acceptable for dinner: you have to add a minimum amount of water to the boiling pan, a certain amount of salt, the spaghetti must boil for a reasonable period of time, and so on. Now, imagine you are giving a welcome party for a friend who very much likes spaghetti. Interaction with the environment is basically the same in both cases.
* Corresponding author. Tel.: +34-983-423000, ext. 25602; fax: +34-983-423671; e-mail: [email protected]
But now you have a much more demanding objective: you want to offer him a really superb spaghetti dish. In this new situation you will probably execute the same basic steps, but two important differences can be pointed out: every step will be carried out with exquisite care and—here comes supervision—you will try to assure that every step actually accomplishes its partial objective.

However, before getting deeper into more detailed considerations, we need to make our language more precise. What do we understand by supervision? Since it is an umbrella term covering many ideas, closer to colloquial language than to technical or scientific jargon, a clarification of its meaning deserves some effort. We do not look for a broad and ultimate definition but, at least, we try a characterization that can be operative in dealing with continuous dynamic systems. We will define supervision as the activity that:
(a) from some given production objectives, is able to produce desired trajectories for some pre-specified variables of the system, considering it as the union of the operator and the physical system (control layer plus process);
(b) checks that these pre-specified variable trajectories
are close enough (within some tolerance) to the known desired trajectories;
(c) investigates further the reasons for discrepancies between desired and actual trajectories;
(d) corrects the discrepancies when they exist.

As may be seen, some kind of model will be needed (i.e. rules reflecting domain causal relationships), because in many supervision aspects, keeping track of the physical system may involve a comparison between the operator–control–process output and that model. From the previous definition, the necessary elements for a supervisory system (SS) to carry out such supervision may be set out as follows:
1. first, someone (a human being, say the factory director) must set up the production objectives;
2. then, the SS needs to be supplied with a module to translate these objectives into the desired trajectories of some key variables;
3. afterwards, the knowledge based SS must be able to quantify these key variables;
4. then, the SS must possess the 'intellectual' capabilities to explain the differences between desired and real trajectories of these key variables; and
5. finally, the SS must propose solutions to the causes of these troubles in order to either re-establish the desired trajectories or change the original objectives.

Obviously, we need a further step to make this characterization operative in the process control domain. In other words, how can items (2), (4) and (5) be implemented with present technology? There are basically two main approaches. The first, and the most usual in the recent past, is the natural one: a team in the control room supervising the process evolution. The other, emerging approach is the artificial one. That is, in order to achieve automatic supervision, we must shift our attention to artificial devices. Among them, computer programs using artificial intelligence (AI) techniques seem to be the most promising candidates. Of course, we are far from considering a computer using AI software to have some kind of intellectual capability. Instead, such techniques provide a programming methodology to simulate, in a very limited way, the most rudimentary human mental capabilities. This is due to:
• access to a great deal of data;
• a strong organization of these data;
• the possibility of generating new data from previous data.
In AI terms, they offer knowledge representation and inference capabilities. Then, a SS resorting to computerized human expertise in managing the process under study will be referred to as a knowledge based supervisory system (KBSS).
2. Computer based supervision in process control

When setting up the necessary elements for supervision, we identified the need to know the desired trajectories of some variables. This implicitly requires, at least in the control domain, the possibility of governing those variables. In other words, we need (as a necessary condition) a control system that solves the regulation issues to a reasonable degree. A data link between the controllers and the sensors, from which we can access the process data, is also necessary. Fortunately, this status quo is not infrequent in most modern industrial factories of medium to large size. In practice, we are going to face plants containing PLCs, PIDs, discrete regulators, etc., performing locally, a field bus, and a control room. Usually, in this control room there will be a distributed control system (DCS). Through this DCS, the operator keeps track of the process evolution, may modify the control scheme to some extent, and commands the plant by changing set points. To supervise this configuration, a natural solution is to add a supervisory layer over the DCS–operator interaction. In Fig. 1(a), a standard automation scheme for a medium to large plant may be seen, and an ideal abstraction of it is presented in Fig. 1(b). There is a clear hierarchical organization of functions. At higher layers, functions are more complex and rely on the capabilities of the level underneath. Conversely, at lower layers, they are less complex but more critical for the normal running of the process. The resulting configuration is very reliable, because any operation error at one level can be overcome by manually acting on the immediately lower layer, without leaving the process uncontrolled.

Once the control layer performs properly, we are able to add supervisory functions to our system. As we proposed in our previous definition, several functions may be included under the term supervision. Hence, it may be useful to refine it even further. Regarding the information source, we can distinguish local and global supervision. By local supervision we mean that the supervisory system only receives the same information as the control law. For example, for a classical PID control, this information will be the set point and the measurement of the controlled variable (it may also involve the controller's internal computations, such as the error and the manipulated variable). Note that the controller may not be local in the sense of using information which is external to the control loop (for instance, our PID plus a feedforward). However, supervision of this controller using its same variables is local. Thus, the locality of supervision is always relative to the corresponding controller. This local supervision is usually called supervisory control. Global supervision, instead, receives information from points which are beyond the scope of the control loop. In this way, the objective of global supervision is achieved by considering one or more controllers from a general process perspective.
Fig. 1. (a) Schematic view of a standard automated industrial process; (b) an ideal abstraction of (a).
That is, global supervision works with knowledge of situations that may arise at different places of the plant, which are obviously external to the controller environment. As an example, consider that the set point of a reaction tank may depend on the production level (through the residence time). Usually this set point is changed by the operator following a change in the process operating point. A global supervision system may check that this new set point is in a proper range, taking into account the current production level. A KBSS, as conceived in this article, is in charge of global supervision.

We have found it useful to cluster global supervisory tasks into the following main sets: planning, monitoring, operation mode, fault diagnosis, prognosis, and man–machine interface (MMI) (Acosta and Alonso González, 1996a). Planning is the generation of desired trajectories for physical system variables from given production objectives. Monitoring is essentially the comparison of key variable values with the known trajectories derived from the preceding task and/or from human empirical knowledge. The simplest monitoring case is just the comparison of a measured variable against a fixed threshold. More sophisticated monitoring may comprise (although not exhaustively) the comparison of an estimated parameter against a trajectory, including additional knowledge, mass and/or energy balances, and fault detection. Once something abnormal is detected, the system should be able to focus on the origin of such misbehavior. This hypothesis generation and discrimination is carried out by the diagnosis task, which may be further decomposed into the fault diagnosis sub-task (physical device malfunctions) and the operation mode sub-task (operating errors, generally due to the operator's involuntary mistakes). Prognosis is the dual task of diagnosis. While the latter finds the causes that explain observed symptoms, the former foresees the future effects of taking a certain action in the present (a what-if task).
The man–machine interface is a filter that decides which information, and at what time, is really useful to the user, based on design criteria and user requirements. In a way, it is intended to avoid operator cognitive overload. Note that with this kind of rudimentary taxonomy, a modular system design can be undertaken, with all the corresponding development advantages: in principle, any change in one of these modules should not affect the remaining ones. Note also that, except for monitoring and MMI, a process model must be elaborated. Even monitoring may consist of the comparison of some variables with a model of correct behavior, resembling consistency based diagnosis (Reiter, 1987).

To illustrate these ideas, consider the following example. Based on demand studies, the factory management team orders a milling output of about 300 tons per hour. Usually this is achieved by three mills with maximum capacities MM1 = 50 tons per hour, MM2 = 100 tons per hour and MM3 = 200 tons per hour. The planning module outcome may be M1 = 0.6 × MM1, M2 = 0.9 × MM2 and M3 = 0.9 × MM3 because, say, the second and third mills are newer than the first one. The desired trajectories, in this case thresholds, may be DM1 = (0.6 ± 0.1) × MM1, DM2 = (0.9 ± 0.1) × MM2 and DM3 = (0.9 ± 0.1) × MM3, respectively. In a few seconds, the monitoring module detects something abnormal with the first mill output (M1 = 10 tons per hour). If, for instance, the manual reference was set incorrectly (an operating error), the operation mode module gives a warning that will be intercepted by the man–machine interface module, which, according to some criteria, will advise whether or not the operator should restore the set point to the correct value. If the case is that M1 is broken, the fault diagnosis module will again generate a warning, managed by the man–machine interface. The KBSS correction may advise the operator to increase the set points of the remaining mills (or even increase them
automatically if the loop is closed), and generate a warning on the screen about the problem with M1 and how to repair it. Suppose that the original objective was 350 tons per hour instead of 300. The KBSS will give a warning, through the MMI module, to the management team in order to change the objective temporarily. If the factory director (or the planning module) now wants to know whether, under the present conditions, a milling rate of about 250 tons per hour is possible, he (or she, or it!) will consult the prognosis module. The role of this last module may seem simple in this example, but in a real, critical situation it may help a lot.
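As a minimal illustration of the arithmetic in this example, the following Python sketch (ours, not part of AEROLID; the tolerance and the measured values other than M1 are assumptions) derives the threshold bands from the planning fractions and flags a mill whose measured output falls outside its band:

```python
# Hypothetical sketch of the planning/monitoring arithmetic in the milling example.
MAX_CAPACITY = {"M1": 50.0, "M2": 100.0, "M3": 200.0}   # tons per hour
PLAN_FRACTION = {"M1": 0.6, "M2": 0.9, "M3": 0.9}        # planning module outcome
TOLERANCE = 0.1                                          # +/- band around the plan

def desired_band(mill):
    """Return the (low, high) desired trajectory for a mill, here a constant band."""
    nominal = PLAN_FRACTION[mill] * MAX_CAPACITY[mill]
    half_width = TOLERANCE * MAX_CAPACITY[mill]
    return nominal - half_width, nominal + half_width

def monitor(measured):
    """Compare measured outputs with the desired bands and report anomalies."""
    for mill, value in measured.items():
        low, high = desired_band(mill)
        if not (low <= value <= high):
            print(f"{mill}: {value} t/h outside desired band [{low}, {high}]")

monitor({"M1": 10.0, "M2": 92.0, "M3": 178.0})  # only M1 is flagged, as in the text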
3. Focusing on an industrial application

In this article we present a KBSS, AEROLID (abductive expert reasoning in an on-line industrial diagnoser), which has been implemented following the criteria set down in the previous sections and tested in an industrial continuous process plant. It is also a building block of a larger system (Prada et al., 1994; Alonso González et al., 1994). The process considered here is a beet-sugar plant at Benavente, Spain, consisting of four main sections: diffusion, purification, evaporation and crystallization. Raw pieces of beet go into the diffusion section to obtain a juice rich in saccharose (about 15%). This is achieved in the diffuser, a cylinder of about 25 m length by 8 m diameter, rotating at about 28 r.p.m., which receives the beet at one end and water at about 75°C. The temperature and the pH must be closely observed to avoid infections. This hot water bursts the beet cells, freeing both the saccharose and impurities (non-sugars). The impurities are eliminated in the purification section, in order to obtain a juice of great purity (about 95%). The next stage, evaporation, increases the juice concentration, extracting water by boiling and producing syrup (about 65% saccharose). The evaporation section consists of nine evaporators in series. Finally, concentration is further increased in batch units until sugar crystals can grow in an over-saturated environment. It is a typical process industry of average dimension, highly automated via a DCS, in this case TELEPERM™ (Siemens™). The process is monitored and governed from the central control room, where operators supervise nearly four hundred control loops. The expert system (ES) carrying out the supervisory tasks operates, on a VAX™ workstation, over the DCS through a database, sharing process variables. The laboratory personnel may access this database to introduce or to read variable values through a TCP/IP based network. The expert system development tool (ESDT) is G2™, which has real-time facilities, frames, forward and backward chaining of rules, a graphic editor, and other interesting features for developing a SCADA system. Although this architecture was implemented in G2, the system also consists of some external modules, because the main system combines both analytical and artificial intelligence problem-solving techniques. The operator, as seen in Fig. 2, may
Fig. 2. Hierarchical organization of the implemented KBSS.
view the process both by means of the DCS consoles and the mimics on the VAX screen. It is clear that the KBSS operates at a higher layer than the previous control scheme. As regards the run-time environment, shown in Fig. 3, we added the VAX, which is the KBSS hardware support, communicating with the DCS through a Siemens proprietary bus. It should also be noted that the data coming from the laboratory were not previously introduced into the DCS; with the new architecture, this feedback is made directly to the KBSS.
4. Global architecture for diagnosis

The design of an intelligent diagnostic system, useful for the process control domain, is still an open problem. Though some of the principles underlying any diagnostic system have already emerged, and there are important theoretical results regarding model based diagnosis (Reiter, 1987; de Kleer, 1989; Poole, 1992; de Kleer et al., 1992), process control has some characteristics that make the application of these ideas particularly difficult (Ng, 1991; Terpstra et al., 1992).
Fig. 3. Run-time environment.
Fig. 4. Modular architecture of AEROLID.
The main problems arise because: (a) it is a dynamic problem; (b) it is nearly impossible to simulate every possible behavior; (c) there is a high number of components; (d) there is a small number of measurements; and (e) there are real-time requirements. Furthermore, the main objective of this development is for the system to be an everyday tool for managing a real plant from the control room.

When describing global supervision in a preceding section, we stated several tasks. AEROLID, in order to carry out diagnosis, includes the following tasks and sub-tasks: monitoring, operation mode, fault diagnosis, and MMI. In the present application, they are the kernel of the diagnosis system. Hence, the architecture of the developed system supports them, resorting to the modules that can be seen in Fig. 4. These modules are cooperative agents sharing a common memory area, in a typical blackboard architecture. Thus, the monitoring module performs monitoring, the fault identification module performs operation mode and fault diagnosis, and the alarm management module performs MMI. These modules are related to diagnosis. As regards the remaining modules that appear in Fig. 4, the user interface module also executes MMI, the process interface module carries out the data communication between the blackboard and the DCS, and the data checking module applies a preliminary filter to raw data in order to reason only with variable values which lie in pre-defined, habitual ranges. There are also a predictive control module (planning), a data validation module, a what-if module (prognosis), and others. From our experience, we found it very useful and operative to decompose the diagnosis process into the
mentioned tasks of monitoring, fault diagnosis, operation mode and man–machine interface. Evidently, these tasks have an intrinsic time sequence, although the final temporal order is given by the diagnosis strategy selected. Note also that different strategies operating over these tasks may produce very different system behavior. In the following sections, we explain how we implemented them in order to face real-time, on-line diagnosis.
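To make the blackboard arrangement concrete, a toy Python sketch follows (module names, keys and the trivial diagnosis logic are ours, invented for illustration); each module only reads from and writes to the shared memory area:

```python
# Toy blackboard: cooperative modules that communicate only through a shared
# memory area (names and structures are illustrative, not AEROLID's).
blackboard = {"mv_states": {}, "diagnoses": [], "alarms": []}

def monitoring_module(bb, mv_name, value, threshold):
    """Posts the state of a monitored variable onto the blackboard."""
    bb["mv_states"][mv_name] = "critical" if value > threshold else "OK"

def fault_identification_module(bb):
    """Reads critical MVs and posts a (here trivial) candidate cause for each."""
    for mv, state in bb["mv_states"].items():
        if state == "critical":
            bb["diagnoses"].append({"mv": mv, "cause": "unknown"})

def alarm_management_module(bb):
    """Turns posted diagnoses into operator messages (the MMI task)."""
    for d in bb["diagnoses"]:
        bb["alarms"].append(f"Problem around {d['mv']}; suspected cause: {d['cause']}")

monitoring_module(blackboard, "diffuser-vase-level-oscillation", 12.0, 10.0)
fault_identification_module(blackboard)
alarm_management_module(blackboard)
print(blackboard["alarms"])
```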
5. Monitoring

Fault detection, or monitoring, is critical for the behavior of the whole fault detection and diagnosis system (Reiter, 1987), so this point deserves some comments. Our approach is based on following the evolution of some particular symptoms. These are represented by the concept of a monitored variable (MV) and, in the knowledge base, by the corresponding frame. That is, we do not detect the presence of every possible symptom, but only the most representative ones, meaning those with more discriminating power. In a process like the one in the sugar factory, there are plenty of these variables. In every industrial application, the set of candidate variables to be monitored is potentially large, and only a subset of them can be selected. This is an important decision involving the classic trade-off between quick detection and time resources. If the system monitors many variables, it is potentially able to detect problems at a very early stage, before they propagate to other units; but as the total
Fig. 5. MV evolution and state changes.
time needed to check every variable increases with their total number, a compromise must be found. The selection of monitored variables was made based on the experience of the plant personnel, taking into account the amount of information they convey (to monitor as few variables as possible) and the nature of the faults that can be detected (to detect important faults quickly). As a result, MVs are not only measured variables. Very often the value of a direct measurement does not convey enough information to focus the diagnosis, but some of its properties (or characteristics) do: oscillation (amplitude, frequency), average, standard deviation, evolution, … Sometimes even a rather sophisticated estimation derived from several indicators is useful (e.g. the error of a balance). The concept of MV is useful in at least two ways: firstly, it allows a unified treatment, independent of the monitored data source; secondly, it allows one to distinguish close properties involving very different physical causes (e.g. a tendency to increase or to decrease).

AEROLID resorts to the comparison of these MVs with fixed thresholds, or constant trajectories, so the monitoring module can only operate in a stationary mode for a given production level range. In this situation the evolution of the main plant variables (flows, temperatures, concentrations
and levels of buffer and reaction tanks) is restricted to fixed and known intervals. Any departure from these limits may be symptomatic of some kind of trouble. These desired trajectories were given by the plant's expert personnel, by our observation of the process in regular operation, and by plant simulations used to obtain more proper values (Acebes et al., 1994). Even so, they required additional adjustment in the factory environment. As a concluding general remark, these desired trajectories must be known and specified beforehand.

An important problem to cope with when crisp thresholds are used for monitoring is the presence of unstable diagnoses due to small changes in the value of a variable around the thresholds. There are two basic methods to face this problem: one is to use some kind of approximate reasoning scheme (e.g. fuzzy logic); the other is to add more knowledge to the monitoring task. We have opted for the second alternative, defining three thresholds per variable (trigger, confirmation and recovery thresholds) which, joined to temporal information, govern the value of the state attribute of the MV (see Fig. 5, where the evolution of an MV estimated from factory data—Thermal Jump at Evaporator 4—with the corresponding state changes is
presented). As soon as a variable crosses the trigger threshold, the state changes from OK to vigilance; if it then remains over the confirmation threshold for a certain amount of time, it becomes critical. When it descends below the confirmation threshold, it changes back to vigilance, and when it stays under the recovery threshold for a given time, the state is OK again. The transitions towards vigilance are the only ones that rest on threshold crossing alone. The transition from vigilance to critical always includes a persistence condition over a threshold different from the triggering one, which is usually enough to avoid unstable monitoring. It may also include additional knowledge to help discern whether the detection is justified.

Obviously, this approach to monitoring is more time consuming than mere threshold crossing and must be rationally planned. Every MV has its own checking period, depending on the dynamics of the faults it is watching for. While its state is OK, it is only checked against the trigger threshold (normal monitoring). Only when the MV state is vigilance does the intensive monitoring sub-task start. This sub-task uses additional knowledge, as explained in the previous paragraph, to decide the final variable state, i.e. critical or OK again. Intensive monitoring rules are invoked with a higher frequency and are effectively deactivated as soon as the variable state is OK. Thus, intensive monitoring can be considered a simple focus-of-attention mechanism. In Fig. 6, the frame of this basic knowledge unit is shown. Its outstanding attributes are as follows.

Fig. 6. The frame of a MV.
Formula: the equation from which the variable gets its current value. Notice that, as this value is computed from data which may or may not be consistent, in the latter case the MV simply will not have a current value. A more elaborate consistency check for this variable is under development.
Estado (state): a linguistic label indicating whether or not the variable is far from its known ranges. The state can be OK, vigilance or critical.
Umbral disparo, umbral confirmación and umbral recuperación (trigger, confirmation and recovery thresholds, respectively): they are used by the monitoring module to change the state of the MV in the way explained above.
Validity interval: given by the validity intervals of the constituents of the formula.
Default update interval: MVs are refreshed on request of the monitoring module, so none is the default for this attribute.
Tiempo crítico (critical time): the real time at which the variable state turned to critical.
Texto (text): the text that will appear, if pertinent, in the on-screen alarm advising of the irregular situation.

As soon as monitoring states a problem with an MV, the fault identification module must start its activity.
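The following Python sketch is our illustrative reconstruction of the three-threshold mechanism (it is not AEROLID code; class, attribute and state names are assumptions, and a high value is assumed to be the abnormal direction):

```python
# Minimal sketch of the trigger/confirmation/recovery state machine of an MV.
class MonitoredVariable:
    def __init__(self, trigger, confirmation, recovery, dt1, dt2):
        self.trigger = trigger            # umbral disparo
        self.confirmation = confirmation  # umbral confirmación
        self.recovery = recovery          # umbral recuperación
        self.dt1 = dt1                    # persistence over confirmation -> critical
        self.dt2 = dt2                    # persistence under recovery -> OK
        self.state = "OK"
        self._cond = None                 # which persistence condition is being timed
        self._since = None                # when that condition started to hold

    def update(self, value, now):
        if self.state == "OK":
            # Normal monitoring: only the trigger threshold is checked.
            if value > self.trigger:
                self.state, self._cond, self._since = "vigilance", None, None
        elif self.state == "vigilance":
            # Intensive monitoring: persistence over/under the other thresholds.
            cond = "high" if value > self.confirmation else \
                   "low" if value < self.recovery else None
            if cond != self._cond:                    # condition changed: restart timer
                self._cond, self._since = cond, now
            if cond == "high" and now - self._since >= self.dt1:
                self.state = "critical"
            elif cond == "low" and now - self._since >= self.dt2:
                self.state = "OK"
        elif self.state == "critical":
            if value < self.confirmation:
                self.state, self._cond, self._since = "vigilance", None, None
        return self.state
```

With the values of the example in Section 8 (thresholds 10, 8 and 7, Δt1 = Δt2 = 2 min), such an object would go through the OK–vigilance–critical transitions described there, except that the sketch does not credit time already spent over the confirmation threshold before the trigger crossing, as AEROLID does.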
6. Obtaining a diagnostic

Obtaining a diagnostic may be seen as the generation and discrimination of hypotheses, carried out by the fault diagnosis and operation mode tasks. AEROLID implements both of them in the same way, resorting to the same technique. This is because fault diagnosis as well as operation mode help to find the reason for a departure from nominal behavior; the symptoms are therefore shared, regardless of the actual cause of the problem. A tank level (symptom) may be too low because of an erroneous set point (cause one) or because of a broken pump (cause two). The diagnosis task resorts to the expert system methodology, where human experience is extracted and codified in an appropriate knowledge representation language. Our decision is founded on the abundance of causal relationships
Fig. 7. Hierarchical cause–effect graph.
existing in the domain to model a faulty situation, and on the natural way of translating them into an expert system when human experts are available (see complex problems in Davis (1982)). Besides, everybody agrees on the problems of the expert system approach: a time-consuming development phase, total device dependence, incompleteness (a list of pre-enumerated faults) and potential inconsistencies. However, these caveats can be alleviated by working in at least two directions. Firstly, we found it very profitable to talk in terms of the tasks the system must undertake, as presented in previous paragraphs. By so doing, a real analysis of the problem structure can be done, which is then mirrored in the modular design. Note that this approach is the opposite of the gradually growing prototype. Secondly, during the analysis phase, objectives were fixed in a realistic way, to really tackle the real world problems. Hence, the complexity of the diagnosis system is reduced by considering a single fault hypothesis per MV, grouping different faulty situations around different MVs, and clustering MVs by factory sections. This allows a hierarchical organization like the one shown in Fig. 7, which improves system development and maintenance, as well as its on-line efficiency.

With respect to diagnosis, for every MV an acyclic cause–effect sub-graph was developed, trying to capture the causal relationships among symptoms (effects) and causes (diagnostics). This is also represented in Fig. 7. Thus, the root node represents the whole plant, the first depth level nodes represent factory sections, and the second level nodes, MVs. Under every MV there is the acyclic cause–effect sub-graph. The leaves of this sub-graph
are the primary causes responsible for the potential problems and are underlined in Fig. 7. These sub-graphs are implicitly described by a set of rules with one of the following generic forms:

If {symptom and}+ symptom then cause-1
If cause-1 and {symptom and}+ symptom then cause-2

So the symptoms lead to the solution, in a typical abductive reasoning. The diagnosis system looks for a primary cause that can explain the symptom manifested through the MV. It is remarkable that, even in the more complicated settings we have considered, it was enough to chain two rules: a cause responsible for the troubles detected around the monitored variable (e.g. cause-1: obstruction@diffuser in Fig. 7) and, sometimes, a secondary one (e.g. cause-2: diffuser's-low-speed). This last one is considered the original cause of the problem. Then, although at first glance one might think of empirical associations in the domain, the diagnosis is based on actual causal models of the dynamic process failures, put down using rules. It must also be said that cause–effect relationships were allowed among sub-graphs under distinct MVs, in order to let a single cause be the explanation of more than one MV departing from its nominal value. Finally for this section, a diagnostic sketch can be summarized as follows: when an MV reaches the critical state, only the rules defining its associated sub-graph of possible causes are backward chained. As soon as one matches every symptom (the rule antecedents), the system considers that a cause has been found, and the final diagnostic remains associated with that cause.
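A minimal Python sketch of this two-level rule chaining follows (it is ours, not G2 rule syntax; the symptom names and the premise of the second rule are invented for illustration):

```python
# Toy rule base for one MV sub-graph: each cause needs a set of symptoms and,
# optionally, a previously established cause (the two-rule chain of the text).
RULES = [
    ("obstruction@diffuser",
     {"symptoms": {"high-vase-level-oscillation", "low-juice-flow"}}),
    ("diffuser's-low-speed",
     {"requires": "obstruction@diffuser", "symptoms": {"low-drum-speed"}}),
]

def diagnose(observed):
    """Chain the rules of the sub-graph and return the deepest confirmed cause,
    or 'unknown' if no rule matches (single fault hypothesis per MV)."""
    confirmed = []
    for cause, rule in RULES:                       # ordered primary -> secondary
        prerequisite = rule.get("requires")
        if prerequisite is not None and prerequisite not in confirmed:
            continue
        if rule["symptoms"] <= observed:            # all required symptoms present
            confirmed.append(cause)
    return confirmed[-1] if confirmed else "unknown"

print(diagnose({"high-vase-level-oscillation", "low-juice-flow", "low-drum-speed"}))
# -> diffuser's-low-speed, the original cause of the problem
```

For simplicity the sketch evaluates the sub-graph rules forward, whereas AEROLID invokes them in backward chaining mode; the cause that is finally retained is the same.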
Fig. 8. The frame of a diagnosis object.

6.1. Diagnostic dynamics

As soon as the monitoring task detects a problem and the state of any MV becomes critical, a transitory object called diagnosis (Di) is created. This is another basic knowledge unit in AEROLID. Initially, this object is related to the critical MV and only contains time stamp information (refer to Fig. 8). Then, a set of rules related to the MV is used in backward chaining mode in order to identify the possible cause of the problem. The diagnosis task finishes as soon as a terminal rule fires (single fault hypothesis per MV) or when every related rule has already been tried without success. Hence, the output of the diagnosis may be the cause of the detected anomaly or an unknown cause (if no rule succeeds). In this way, the system is not able to wait for something to come (an effect to appear in the future); it works with past data. However, it must cope with fault intermittence and with the repairing policy. Also, owing to the cause–effect chains that can propagate through the plant units, it is necessary to take into account the set of fault causes that have been detected in the recent past. In a way, the diagnostic is an entity with a temporal life: it must be created when a problem is detected, it must grow when a candidate cause is found, it must be present while the problem has not been recovered, and it must be remembered for a certain amount of time and then be effectively forgotten, because computer memory is a finite resource and the system is expected to operate uninterruptedly for months. To obtain this behavior, the diagnosis strategy uses the transitory objects, named 'diagnosis' (Di), that can effectively be created and deleted. We also resorted to the following features.
• A phase attribute of the diagnosis object, to control the dynamics of the diagnosis task itself.
• A state attribute of the diagnosis object, to control the dynamic behavior of the MMI, through an object named alarm.
Fig. 9 shows a simplified graph with the main transitions during the life of a Di. The Di is created with an initial value, previous, in the phase attribute, and this value is kept until a cause can be assigned to explain the real origin of the problem. At that point, the value of this attribute changes to final. The diagnosis strategy never starts a second diagnosis related to the same MV while there is a current diagnosis in the previous phase or in the active state. This is coherent with the single-fault-per-MV hypothesis, and it is useful to prevent errors due to very sensitive monitoring. Also to prevent this kind of error, the diagnosis is deleted if, when the phase becomes final, the related MV is already OK. This allows us to cope with false positives during the monitoring phase. Continuing with the description of the diagnosis dynamics, when the Di phase is final, a candidate cause has already been obtained. Then, if the MV state is critical, the Di state attribute changes from inactive to active, and it only changes to recent when the associated MV becomes OK again. It should be noted that when the state is recent, a new diagnosis for the same MV may be started, with a possibly different outcome. Then, regarding the possibility of multiple faults, the system is able to diagnose different troubles simultaneously for different MVs. That is, the single fault hypothesis per MV restricts only the set of faults associated with that MV, but the system can cope with a sequence of different faults related to the same variable by keeping track of the fault series. From the recent state there are two possible transitions: to active again, if a new diagnosis concludes the same cause (think of an intermittent fault, for instance); or to forgetting, after a certain amount of time. Forgetting is a non-recoverable state, useful to remember the presence of the diagnostic. Finally, the Di is deleted from the system. These last temporal transitions are governed by temporal parameters which are local to every terminal cause (see the recent and forgetfulness periods in Fig. 9).
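A minimal Python sketch of this life cycle (ours, for illustration; attribute names and the time handling are assumptions) could be:

```python
# Illustrative sketch of the Di (diagnosis object) life cycle described above.
import time

class Diagnosis:
    def __init__(self, mv_name, recent_period, forgetfulness_period):
        self.mv_name = mv_name
        self.phase = "previous"       # previous -> final
        self.state = "inactive"       # inactive -> active -> recent -> forgetting
        self.cause = None
        self.created_at = time.time() # time stamp information
        self.recent_period = recent_period
        self.forgetfulness_period = forgetfulness_period
        self._recent_since = None

    def set_cause(self, cause, mv_state):
        """Terminal rule fired (or every rule failed): the phase becomes final."""
        self.cause = cause
        self.phase = "final"
        if mv_state != "critical":
            return "delete"           # false positive: the MV recovered meanwhile
        self.state = "active"
        return self.state

    def on_mv_ok(self, now):
        """The associated MV returned to OK: the diagnostic becomes recent."""
        if self.state == "active":
            self.state, self._recent_since = "recent", now
        return self.state

    def tick(self, now):
        """Advance recent -> forgetting -> deletion as the temporal parameters elapse."""
        if self.state == "recent" and now - self._recent_since >= self.recent_period:
            self.state = "forgetting"
        elif self.state == "forgetting" and \
                now - self._recent_since >= self.recent_period + self.forgetfulness_period:
            return "delete"
        return self.state
```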
Fig. 9. Dynamic hypothesis generation and discrimination task.

7. Alarm management

The last task in the diagnosis is alarm management, which may consist of a hard error message, a simple warning, or even the automatic restoration of the trouble. In the present approach, whenever a problem is detected and a cause is identified, a message appears on the system screen, warning the user about the trouble and about the possible causes, and suggesting a recovery policy. Thus, an alarm is directly coupled to a diagnostic and, by so doing, in AEROLID they are almost synonymous. Also, as became evident, we considered every alarm to have the same importance (nothing similar to a criticality distinction). This is perhaps the simplest way to implement the MMI, but the drawback is that it may cause a cognitive overload in critical situations. In some cases, in which the KBSS is in a closed loop with the DCS, together with the message, the troubleshooting is done automatically prior to the operator's acknowledgment. In terms of the knowledge units presented earlier, the messages, as well as the automatic troubleshooting, are issued whenever the Di state is active, which happens when the cause of the Di receives a value (possibly the unknown value, for completeness). This is associated with the alarm object in the knowledge base. Its attributes are, among others: the advice, a text with the legend of the alarm; the recent period, or the time during which the Di must remain in latency to become active again if the MV is critical again; and the forgetfulness period. The sum of these two periods is the time during which the transitory object Di must be 'alive' after the MV which caused it reaches the OK state.
8. AEROLID in action

This example, a habitual situation at the factory, is taken from the evolution of an MV, the oscillation of the diffuser vase level, on 15 February 1994. The formula for this MV is: the maximum value of the vase_level of the_diffuser during the last 70 s minus the minimum value of the vase_level of the_diffuser during the last 70 s, where vase_level is a reference to the level indicator of the object the_diffuser. The MV evolution may be seen in Fig. 10 for 355 min, as the normal monitoring task updates and tracks it. It is also possible to see the trigger, confirmation and recovery thresholds, 10, 8 and 7, respectively, for this particular variable. As soon as the MV crosses the trigger threshold, intensive monitoring takes over the tracking activity at a higher sample rate. The intervals Δt1 and Δt2 vary from one MV to another. In the present case, Δt1 = 2 min and Δt2 = 2 min (not drawn to scale in the figure, for better comprehension). The former is defined as the time interval during which, having crossed the trigger threshold, the MV must remain over the confirmation threshold to be considered in the critical state. The latter is the time interval during which the MV must stay under the recovery threshold to be declared in the normal (OK) state again. See Table 1 for the records of the trace, from birth to death, of the corresponding diagnostic. The MV crosses the trigger threshold at 11:11:07 a.m., changing the state to vigilance. As it has been over the confirmation threshold for the previous 2 min, it is declared in the critical state. The Di is created and, at 11:11:17 a.m., its cause slots are filled, concluding that the diagnosis is the lack of anti-foam.
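For illustration, the formula amounts to a sliding max-minus-min range over the level signal; a minimal Python sketch (with an assumed 5 s sampling period and invented level readings) could be:

```python
from collections import deque

WINDOW_S = 70          # the 'last 70 s' window of the formula
SAMPLE_PERIOD_S = 5    # assumed sampling period of vase_level

def oscillation(samples):
    """Oscillation of the diffuser vase level: max - min over the last 70 s."""
    window = deque(maxlen=WINDOW_S // SAMPLE_PERIOD_S)
    for level in samples:
        window.append(level)
        yield max(window) - min(window)

# Invented level readings; a swing larger than the trigger threshold (10)
# would move the MV state from OK to vigilance.
levels = [52, 53, 51, 58, 47, 60, 49, 61, 46]
print(list(oscillation(levels))[-1])   # 15 -> above the trigger threshold
```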
9. Experimental results

In a first approach (campaign 1993–1994), in order to evaluate the diagnosis performance, we recorded every outcome of the diagnosis process in an MV attribute called 'estadísticas' (statistics), shown in Fig. 6. This record (the total number of outcomes) was kept regardless of the operator's acknowledgment. The statistics slot is a frame storing the following items: (a) the total number of diagnoses; (b) the number of diagnoses with unknown cause; (c) the number of alarms acknowledged by the operator; (d) the number of correct diagnoses (out of the set in (c)); and (e) the number of incorrect diagnoses (i.e. those for which the system made a mistake in the diagnosis task).
Fig. 10. Example of a diagnosis.
This evaluation was made for the diffusion section, over five shifts distributed across five different days of normal operation. During this interval, 149 diagnoses were recorded, which are shown in Table 2. The third, fourth, fifth and sixth columns are subsets of the acknowledged alarms. The fourth and sixth columns reflect the fact that sometimes the detection was appropriate (the MV in critical state made real sense) but the diagnosis, i.e. the actual explanation of the problem, was not correct. In a sense, this may be seen as a separate analysis of, on one hand, the detection and, on the other hand, the identification of the real cause of the problem. From a first view of Table 2, the results of Table 3 can be inferred. However, on inspecting Table 2 more closely, two MVs were clearly tuned in a wrong manner as regards the thresholds, 'Variación del pH de Agua Fría' and 'Caudal a TEJC inferior a consigna', with a high number of total diagnoses and the impossibility of determining the cause. Without them, the new (and more significant) statistics yield the data of Table 4, from a total of 97 diagnoses. A correctness of 79.6% over the acknowledged alarms was considered a very promising result.
Table 1
A diagnosis trace taken from the factory at Benavente

Variable-monitorizada: OSCILACION-NIVEL-VASO-DIFUSOR
Etapa: PREVIA. Estado: INACTIVO. Diagnóstico: NO-CALCULADO
Creación diagnosis: 15 Feb 94 11:11:07 a.m.
Creación traza: 15 Feb 94 11:11:09 a.m.
Diagnóstico: NO-ANTIESPUMANTE. Obtención del diagnóstico: 15 Feb 94 11:11:17 a.m.
Etapa: FINAL. Fecha y hora: 15 Feb 94 11:11:18 a.m.
Estado: ACTIVO. Fecha y hora: 15 Feb 94 11:11:24 a.m.
Estado de OSCILACION-NIVEL-VASO-DIFUSOR: VIGILANCIA. Fecha y hora: 15 Feb 94 11:17:00 a.m.
Estado de OSCILACION-NIVEL-VASO-DIFUSOR: OK. Fecha y hora: 15 Feb 94 11:18:46 a.m.
Estado: RECIENTE. Fecha y hora: 15 Feb 94 11:18:47 a.m.
Table 2
Activity record at campaign 1993–1994

Monitored variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Oscilación Nivel Vaso del Difusor | 10 | 0 | 5 | 0 | 0 | 0 | 0.0 | 50.0
Tendencia a Bajar Nivel Vaso Difusor | 7 | 0 | 2 | 0 | 0 | 0 | 0.0 | 28.6
Tendencia a Subir Nivel Vaso Difusor | 3 | 0 | 2 | 0 | 0 | 0 | 0.0 | 66.7
Alto Nivel de Jugo Frío | 5 | 3 | 0 | 0 | 0 | 1 | 60.0 | 20.0
Apertura Válvula Aporte a D-AP-2 | 1 | 0 | 1 | 0 | 0 | 0 | 0.0 | 100.0
Alto Nivel Agua Prensas sin Tamizar | 12 | 6 | 3 | 0 | 2 | 0 | 50.0 | 41.7
Válvulas Vasos TEJC Cerradas | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
Bajo Nivel Agua Fría | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
Variación pH Agua Fría | 28 | 20 | 2 | 0 | 4 | 4 | 71.0 | 35.7
Caudal a TEJC Inferior a Consigna | 24 | 22 | 0 | 0 | 6 | 4 | 92.0 | 41.7
Disminución de Molienda vs. Consigna | 2 | 0 | 1 | 1 | 0 | 0 | 0.0 | 100.0
Alto Nivel Agua Prensas Tamizada | 11 | 0 | 4 | 0 | 0 | 0 | 0.0 | 36.4
Tendencia Subir Nivel Prensas Tamizada | 7 | 1 | 1 | 0 | 0 | 0 | 14.3 | 14.3
Bajo Nivel Agua Prensas Tamizada | 1 | 1 | 0 | 0 | 0 | 0 | 100 | 0.0
Bajo Caudal Jugo Frío | 3 | 2 | 0 | 0 | 2 | 0 | 66.7 | 66.7
Bajo Nivel Tolva Remolacha | 20 | 0 | 13 | 1 | 0 | 0 | 0.0 | 70.0
Tendencia Bajar Nivel Tolva Remolacha | 14 | 5 | 3 | 0 | 2 | 0 | 35.7 | 35.7

Columns: 1, total diagnoses; 2, unknown diagnoses; 3, correct diagnoses; 4, incorrect diagnoses; 5, unknown correct; 6, unknown incorrect; 7, % unknown alarms; 8, % acknowledged alarms.
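As a cross-check of how the Table 3 summary follows from Table 2, the short Python sketch below uses the column sums of the reconstructed table (these sums are our derivation, not figures quoted directly by the authors):

```python
# Column sums of Table 2 (our derivation from the reconstructed rows).
total       = 149  # column 1: all diagnoses
unknown     = 60   # column 2: diagnoses with unknown cause
correct     = 37   # column 3: correct diagnoses among acknowledged alarms
incorrect   = 2    # column 4: incorrect diagnoses among acknowledged alarms
unknown_ack = 25   # columns 5 + 6: acknowledged alarms left with unknown cause

acknowledged = correct + incorrect + unknown_ack      # 64 alarms (~42.9% of 149)
print(f"known cause {1 - unknown / total:.1%}, unknown cause {unknown / total:.1%}")
print(f"correct {correct / acknowledged:.1%}, incorrect {incorrect / acknowledged:.1%}, "
      f"unknown {unknown_ack / acknowledged:.1%}")
# -> 59.7% / 40.3% and 57.8% / 3.1% / 39.1%, matching Table 3
```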
However, to properly understand this result, two facts should be considered: 18.5% of unknown diagnoses and only 45.4% of alarm acknowledgment. To improve this last figure, in the next analysis period, i.e. campaign 1995–1996, acknowledgment was made mandatory: the alarm persists on the screen until it is acknowledged. These results were obtained at the diffusion section again, over five days of normal running, and are shown in Table 5. Note that the results are not given as percentages, because only ten diagnoses were obtained over a period similar to that of the previous campaign. The reasons for this situation may be found in:
• MV thresholds were tuned in a more refined and realistic way;
• process operators take closer care of the problems reported by the system, because they dislike being pointed out by an 'artificial tutor'.
However, we consider this last performance the more adequate one for the normal running of the supervisory system.

Table 3
Campaign results 1993–1994
Completeness (over a total of 149 diagnoses): known cause 59.7%; unknown cause 40.3%
Consistency (over the 42.9% of acknowledged alarms): correct 57.8%; incorrect 3.1%; unknown 39.1%

Table 4
Campaign results 1993–1994 (revised)
Completeness (over a total of 97 diagnoses): known cause 81.5%; unknown cause 18.5%
Consistency (over the 45.4% of acknowledged alarms): correct 79.6%; incorrect 4.5%; unknown 15.9%

Table 5
Campaign results 1995–1996
Correct: 7; Incorrect: 1; Unknown: 2

10. Conclusions

As the previous section shows, AEROLID has reasonably fulfilled its design objectives, becoming an everyday tool in the control room, able to give warnings about more than 80 anomalous situations with a high degree of reliability at factory level (up to now with more than 600 h of running). Consequently, we consider that this work goes in the direction of confirming that current AI technology is mature enough to face real process control problems. It is also true that if more ambitious objectives are pursued (multiple faults per MV, fault modes with more sophisticated dynamics, …), the weak abductive inference must be strengthened. Furthermore, this approach might then no longer be valid, and one would have to resort to methodologies that are still under research (Acosta et al., 1994; Acosta and Alonso González, 1996b). But with the task taxonomy developed, this labor is easier to accomplish. In the same direction, from this experience we confirmed the idea presented in Struss (1991) about the necessity of a dynamic diagnosis strategy that can evolve with the process dynamics. Truly, it is not only necessary to have a model of the physical process or its sections; an appropriate strategy must also guide efficiently towards the final diagnosis. This induced us to think about a diagnosis protocol. In effect, in AEROLID, fault diagnosis and operation mode are treated jointly, in a flat way. This may be done more efficiently if the diagnosis protocol looks for operator mistakes before looking for faulty components. For example, when examining a
problem in the level of a buffer tank, it may be faster and more direct to first consider a problem in the control loop. If the controller works properly and the set point is consistent with the operating point of the factory section, then the strategy may go on with more infrequent cases, say, a hole in the tank. In other words, there are some kinds of boundary conditions (necessary conditions) that must be checked before a faulty device is looked for. As the experimental results also show, AEROLID performs better as a trouble detector (95.5% in the first test and 90% in the second) than as a diagnostician. Of course, this last task is more difficult than the former. We consider it very hard to implement a diagnoser with an efficiency greater than 80% resorting to this abductive reasoning approach in real time and in an industrial environment. However, it must also be said that the people at control room level feel satisfied with the results and consider the assistance given by the system to be more than sufficient. With respect to the utility of a KBSS in the control room, the lesson learned from this work is that, as the process operation becomes more homogeneous, the plant extends its useful life and, perhaps more importantly, the operator is induced to drive the process taking care of some pre-defined aspects (in order to avoid the uncomfortable appearance of alarms on the screen), thus warranting an 'operation style' established by the consensus of the control team involved in the knowledge base development. We hope with this work to contribute to some degree to showing quantitative results about the evaluation of a knowledge based system, for we consider that there is not yet enough precise information about large-scale, factory-tested systems like AEROLID.
Acknowledgements

This work has been possible thanks to the financial support of Sociedad General Azucarera Española S.A. (SGAE) and is the result of the collaboration of numerous people. It is not possible to mention all of them, but we wish especially to acknowledge R. Vargas, G. D'Aubarède, M. Bollain, J. Barahona, L. Lorente and A. Bastos from SGAE. We also wish to acknowledge the work of J. Achirica and L. F. Acebes, from the University of Valladolid. This article was also written under the support of 'Proyecto Precompetitivo CYTED VII.5: Técnicas de Inteligencia Artificial en Diagnosis, Supervisión y Control de Procesos Continuos'. GA wishes to acknowledge the National Scientific and Technical Research Council (CONICET), Argentina, for its support through a fellowship during his stay in Spain.
References

Acebes, L.F., Álvarez, M.T., Achirica, J., Alonso, C., Acosta, G.G., & De Prada, C. (1994). A simulator to validate fault detection in an industrial process with expert system. In Proceedings of the International Conference on Simulation of Continuous Processes, Barcelona, Spain, June 1–3, 1994, pp. 709–713.
Acosta, G.G., & Alonso González, C. (1996a). Towards a task taxonomy in knowledge based systems for the process control supervision. In Proceedings of II Congreso Internacional de Informática y Telecomunicaciones, IV Simposio de Inteligencia Artificial (INFOCOM '96), Buenos Aires, Argentina, June 10–14, 1996, pp. 316–325.
Acosta, G.G., & Alonso González, C. (1996b). Knowledge based diagnosis for continuous processes using causal fault modes: a tested proposal. In Proceedings of 3er Congreso Interamericano de Computación Aplicada a la Industria de Procesos (CAIP '96), Villa María, Argentina, November 12–15, 1996, pp. 451–457.
Acosta, G.G., Acebes, L.F., Sánchez, A., & De Prada, C. (1994). Knowledge based diagnosis: dealing with fault modes and temporal constraints. In Proceedings of the IEEE XXth International Conference on Industrial Electronics (IECON '94), Bologna, Italy, September 5–9, 1994, pp. 1419–1424.
Alonso González, C., Acosta, G.G., de Prada, C., & Mira Mira, J. (1994). A knowledge based approach to fault detection and diagnosis in industrial processes: a case study. In Proceedings of the IEEE International Symposium on Industrial Electronics (ISIE '94), Santiago, Chile, May 25–27, 1994, pp. 397–402.
Davis, R. (1982). Expert systems: where are we? And where do we go from here? The AI Magazine, Spring, 3–22.
de Kleer, J. (1989). Diagnosis with behavioral modes. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI-89, pp. 104–109.
de Kleer, J., Mackworth, A.K., & Reiter, R. (1992). Characterizing diagnoses and systems. In Readings in Model Based Diagnosis (pp. 54–65). Morgan Kaufmann (originally appeared in Artificial Intelligence, 56).
Ng, H.T. (1991). Model-based, multiple-fault diagnosis of dynamic, continuous physical devices. IEEE Expert, December, 38–43.
Poole, D. (1992). Normality and faults in logic-based diagnosis. In Readings in Model Based Diagnosis (pp. 71–77). Morgan Kaufmann.
Prada, C., Alonso, C., Morilla, F., & Bollain, M. (1994). Hierarchical supervision and control in a sugar factory. In Invited Session of the IFAC Symposium on Artificial Intelligence in Real Time Control (AIRTC '94), València, Spain, October 1994.
Reiter, R. (1987). A theory of diagnosis from first principles. Artificial Intelligence, 32, 57–95.
Struss, P. (1991). What's in SD? Towards a theory of modeling for diagnosis. In Working Notes of the 2nd International Workshop on Principles of Diagnosis, 1991, pp. 41–51.
Terpstra, V.J., Verbruggen, H.B., Hoogland, M.W., & Ficke, R.A. (1992). A real-time, fuzzy, deep-knowledge based fault-diagnosis system for a CSTR. In Proceedings of the IFAC Symposium on On-line Fault Detection and Supervision in the Chemical Process Industries, Newark, DE, USA, pp. 26–31.