Copyright IFAC Safecomp '83 Cambridge, UK 1983
THE MAN-MACHINE INTERFACE FOR A FAULT TOLERANT CONTROL SYSTEM
J. H. Wensley
August Systems, Inc., 18277 S. W. Boones Ferry Road, Tigard, Oregon 97223, USA
Abstract. Special problems arise in the design of the man-machine interface to a control computer that is fault tolerant. The interface to such a fault tolerant system presents challenges in design, operation and continuing diagnosis and maintenance. Design of a computer control system involves issues such as dealing with interrupts, satisfying the real time requirements, and scheduling of tasks that are incorporated with a real time operating system. A user friendly design interface is described, in which the system designer is freed from detailed considerations concerned with fault tolerance. During operation of the control system, the interface is designed not only for normal operator use, but also to deal with abnormal conditions such as system alarms. Normal operation also implies modification and enhancement of the control process. With a fault tolerant control system, it is additionally necessary to deal with the problem of detecting latent faults within the system. Diagnosis and repair combine hardware and software issues such as how to diagnose control problems, how to carry out hot repair of a system that is continuing to function, and the warm start problem of how to bring a newly repaired or replaced unit back into cooperative service. The paper examines the above issues and the particular problems and solutions to them that are appropriate when the control system is fault tolerant, i.e., is designed to mask the effect of faults.

Keywords. Fault-Tolerance; Control Computer; Error Detection; Error Reporting; Diagnosis.
INTRODUCTION

The traditional view of the man-machine interface in a control system is that of an operator station with its facilities and its procedures. This view includes such factors as ease of use, clarity of display, operator fatigue, and comprehension. This viewpoint is appropriate where primarily routine operator interaction with a machine occurs. This may be justified at a supermarket check out counter, or an airline reservation station. Where dangerous processes are being controlled, the safety of the process is of foremost importance. It is thus necessary that the interface between humans and the control system concentrate upon potentially critical situations that can arise. This implies that the man-machine interface should be designed according to criteria that apply under abnormal conditions. These are usually represented by failure of either components of the process equipment that is being controlled, or of the components of the control system itself.

We further take the view that looking only at the interaction between the control system and the plant operator is too restrictive. We should include all human interaction by those who design, maintain and upgrade the control system. Their interactions with the system present significant problems in all but the simplest of cases. We hope to show that the appropriate design of interface will make their operations both easier and more reliable.

More importantly, we consider the total life cycle of a system. This includes such phases as design, integration, test, commissioning, operation, repair, upgrading and modification. Figure 1 illustrates the different phases and represents the total life cycle man-machine interface. A central theme of this paper is to study the impact of a fault tolerant control system on the above man-machine aspects. It is assumed that fault tolerance is employed because the process being controlled is in some manner critical, i.e., the consequences of control failure are severe, or represent an extreme economic penalty. Under such circumstances, the extra cost of a fault tolerant system can be readily justified. This, however, causes significant issues to be raised in the design of the man-machine interface. For example, a fault tolerant system is designed to continue correct operation even when a fault in a component has occurred. Failure of the control
system could occur if a second fault manifested itself before the first fault had been removed. Therefore, in such a system it is necessary that faults be removed in an expeditious manner. This involves issues of diagnosis and error reporting that can properly be regarded as part of the total man-machine interface. In summary, this paper attempts to widen the breadth of discussion of man-machine interface to include all different human interactions throughout the complete life cycle of the system.

A FAULT TOLERANT CONTROL SYSTEM

We assume that the control requirements are such that a computer is necessary. We further recognize that computer faults do occur and that when such an event happens, the fault may cause actions of an abstruse and subtle kind. In critical processes such failures cannot be tolerated. Our solution to this is to use triplicated units operating independently, but with sufficient interaction that they are able to check each other's operation, and can continue to control the process so that only correct control signals are applied [1, 2, 3]. This is illustrated conceptually in Figure 2, where three independent computer channels read all inputs and carry out all the control algorithms. The computers are able to read the memories of each other, and are therefore able to check that each has operated correctly. At the conclusion of the calculation of the control algorithm, the outputs to the process are transmitted to a voter circuit that removes the effect of a fault in any of the three channels. This voter circuit is designed so that it is internally fault tolerant, i.e., a component fault within itself will not corrupt the outputs [4].

THE DESIGN PHASE

The man-machine interface at the design phase is dominated by the need for ease of design.
In some fault tolerant system designs, it is necessary for the user to specify the error detection procedures, the fault diagnosis procedures, and also in certain systems, mechanisms for checkpointing the system so that a restart can be carried out after a fault. Our view is that such complexity does not represent a good interface between the designer and the control system, and that the designer should be absolved from such considerations. The designer is typically a control engineer with his particular skills, rather than a computer engineer. Therefore the procedures to be used to achieve fault tolerance should be transparent to the control system designer.

One technique for designing a control program is to use a conventional computer programming language, such as FORTRAN, BASIC, PASCAL, or the like. The program is written as if it were to be executed on a conventional single processor. Before the program is executed, the operating system ensures
that all input data is compared across the three processors and votes on that data so that all processors start the program with the same input even if a fault had perturbed the input data in one channel. At the conclusion of the program, the output data is sent to the hardware output voters to remove the effect of any errors that could have been introduced by a faulty processor. This output can be monitored, and the voted output is compared with each processor's attempted output. Thus it is possible to detect whether the voted output differs from the attempted output of one of the processors, thereby clearly identifying a faulty processor. The use of input voting before the program runs, and output voting and monitoring at its conclusion, enables the main body of the program to be written without consideration for fault tolerance, which is handled automatically by the operating system.

Another technique is to use a prepackaged software system that already incorporates all the fault tolerance features in its structure. In the August Systems Series 300, two such packages are offered. One system, TRIGARD, is a system wherein the designer specifies his control algorithms as if creating a relay ladder network. The control computer then simulates the action of the relays. This method is commonly used in programmable logic controllers, and the relays are usually augmented by such elements as timers, comparators, and arithmetic operators. This software system already contains all the required fault tolerance by appropriate calls to an underlying operating system that performs such functions as voting, error detection, diagnosis, etc. By use of this system, the designer is freed from concerns with the fault tolerant nature of the control system.

In certain cases, particularly those involving continuous processes, the relay ladder system is not adequate and more complex control algorithms are necessary.
For those applications another system, TRIDAC, may be used. This system incorporates more sophisticated algorithms such as PID (Proportional Integral Derivative) control. In addition, TRIDAC allows for very sophisticated displays of the process operation. Like the relay ladder system, TRIDAC has already incorporated into its structure all the fault tolerance that is required, thus freeing the designer from such considerations.

It is possible, therefore, to achieve a man-machine interface during the design phase that either is simple to use, or is already preprogrammed. This is illustrated in Figure 3. The lowest levels of the hierarchical structure are concerned with the fault tolerance primitives of the system. At the next highest level, the traditional operating system primitives of scheduling, dispatching, interrupt handling, etc., are dealt with. Above that level, we either
have a prepackaged system incorporating the necessary calls to the fault tolerance primitives, or a user-written program in which the rules for such calls are both simple and well understood.
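The input voting, output voting and output monitoring described in this section can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the Series 300 implementation: the names `vote` and `run_cycle` are hypothetical, channel values are assumed to be exactly comparable (real analog inputs would need a tolerance), and a faulty processor is modelled simply as a channel program that returns a wrong result.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence, Tuple


def vote(values: Sequence[Hashable]) -> Tuple[Hashable, Optional[int]]:
    """Majority-vote the values produced by three channels.

    Returns the voted value and the index of the channel that disagrees
    with it (None if all three agree). Raises ValueError if no two
    channels agree, i.e. the limits of fault tolerance are exceeded.
    """
    if len(values) != 3:
        raise ValueError("expected exactly three channel values")
    value, count = Counter(values).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among the three channels")
    suspect = next((i for i, v in enumerate(values) if v != value), None)
    return value, suspect


def run_cycle(channel_programs: Sequence[Callable], raw_inputs: Sequence[Hashable]):
    """One control cycle: vote the inputs, run each channel, vote the outputs.

    channel_programs holds the same single-processor control program three
    times; a differing entry stands in for a faulty processor (assumption).
    """
    voted_input, input_suspect = vote(raw_inputs)
    # Every channel starts from the same voted input.
    attempted = [program(voted_input) for program in channel_programs]
    # Monitoring: compare the voted output with each attempted output.
    voted_output, output_suspect = vote(attempted)
    return voted_output, input_suspect, output_suspect
```

The control program itself (each entry of `channel_programs`) contains no fault tolerance logic at all, which is the point made in the text: voting and monitoring wrap around an ordinary single-processor program.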
OPERATION

The predominant factors in the man-machine interface during normal conditions are ease of use and clarity in the information exchange between man and machine. These two factors have been widely studied, but are not addressed in detail in this paper. Rather, we are concerned with aspects of the interface that must be taken into account when conditions are abnormal. This is usually because of a fault in either the physical plant that is being controlled, or the control computer. As mentioned above, we assume a fault tolerant computer, and thus a single fault should leave it still operational. However, the fault cannot be ignored, because further faults could take the system beyond its limits of fault tolerance.

Abnormal plant conditions are usually represented by alarms. Typical of such alarms are excess pressure, excess temperature, lack of cooling water, bearing failure, etc. The design of the man-machine interface should be such that an operator can, in a timely fashion and without ambiguity, identify the reason behind the abnormal condition and can quickly take whatever actions are needed to correct the situation. In certain cases, such actions may be a relatively lengthy sequence of smaller actions.

The TRIDAC system uses a number of images on a color graphics terminal to portray the parameters of the plant. One of the ways in which it greatly assists identification of the cause of plant malfunction is to allow many alternate viewpoints concerning each part of the plant, so that a physical unit may be represented on many different images. For example, a cooling water pump may be represented on an image that portrays all the water flow. It may also be represented on another image showing the electrical power distribution. It may appear also on a composite image associated with a particular part of the plant.
The use of multiple images that reference a single component of the physical plant is an important factor in enabling speedy identification of the cause of alarm conditions.

Following identification of the fault the operator should be able with little physical effort, e.g., a small number of key strokes on a control panel, either to shut the plant down, or otherwise bring it back to proper operating conditions. It is thus necessary for the operator to be able to specify what unit of the plant is to be controlled. This is accomplished by a cursor that "points" at the location of the unit on one of the images on which it appears. Following this, a simple key stroke can be used to turn it
off, or on, or to ramp an operating parameter up or down.

A fault tolerant control system is able to continue correct operation even in the presence of one or more faulty components. Because the system is then in jeopardy if further faults occur, these faults must be detected and reported to operating personnel so that repair action can be initiated. The first step in these actions is to maintain reports of all detected errors. In the August Systems triplicated design, the most powerful method of detecting errors is to be able to compare data across all three channels. Any discrepancy in one channel is logged in the memory of all three computers. Such errors could have arisen from two sources, either permanent or transient faults. A permanent fault will in general cause repeated errors. The system is designed so that after a certain number of such errors have occurred, the offending unit is declared to be faulty. This may be a complete processor, or could be part of the input/output subsystem. If only isolated errors occur, it would most likely indicate a transient fault, usually due to an external disturbance such as excessive electromagnetic noise. If the errors do not persist, the errors are merely logged and no corrective action is taken.

From time to time the error reports can be printed for examination. Persistent recurrence of transient faults in one part of the system can be used as an indicator that some part of the system is too susceptible to disturbance. For example, this could be due to connector contacts that are corroded. The ability to capture data concerning transient errors and their underlying faults is an important benefit of a fault tolerant system.

As well as the error reports, the interface to operating personnel includes certain status indicators that are very conspicuously mounted. A central diagnostic panel adjacent to the computers provides summary information about the status of the system.
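The error logging policy described above (log every discrepancy, and declare a unit faulty only after a certain number of errors) can be sketched as follows. This is a minimal sketch under assumptions: the class name `ErrorLog`, the unit names, and the threshold value of 3 are hypothetical, and a simple cumulative count is used because the paper does not specify the exact counting rule.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List, Set, Tuple

FAULT_THRESHOLD = 3  # hypothetical value; the paper says only "a certain number"


@dataclass
class ErrorLog:
    """Keeps a report of every detected error and flags persistent offenders."""

    threshold: int = FAULT_THRESHOLD
    entries: List[Tuple[int, str]] = field(default_factory=list)
    counts: DefaultDict[str, int] = field(default_factory=lambda: defaultdict(int))
    faulty: Set[str] = field(default_factory=set)

    def record(self, cycle: int, unit: str) -> bool:
        """Log one discrepancy; return True if the unit is now declared faulty."""
        self.entries.append((cycle, unit))
        self.counts[unit] += 1
        if self.counts[unit] >= self.threshold:
            # Repeated errors suggest a permanent fault in this unit.
            self.faulty.add(unit)
        return unit in self.faulty

    def report(self) -> List[str]:
        """Summary lines suitable for periodic printing and examination."""
        lines = []
        for unit in sorted(self.counts):
            status = "FAULTY" if unit in self.faulty else "transient (logged only)"
            lines.append(f"{unit}: {self.counts[unit]} error(s), {status}")
        return lines
```

Because isolated errors stay in the log even when no corrective action is taken, a printed report can still reveal a unit that suffers transient errors unusually often.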
The diagnostic panel enables faulty equipment to be quickly identified down to the unit level, i.e., a processor or an input/output module. The specific input/output cards are each provided with a triple of indicating LEDs. Because those cards contain triplicated circuitry, it is possible that one of the three channels on them has failed, but the card maintains correct operation. The three LEDs are each associated with a channel on the card and are used to indicate fault conditions. They are directly turned on or off by the processors that are also responsible for detecting the fault on the card.

All units of the system may be replaced while the system continues operation. This is accomplished without any loss of control. In the case of the processors, the procedure is to switch off the unit, remove it, and replace it with a new unit (or the same unit after repair). When restarting, the
processor will enter a procedure of "warm starting". This procedure uses the ability of the processor to read the other processors' memories in order to capture the state of the system. After this action, the total system is brought back to its original configuration of three operating processors. If the fault is in an input or output card, the corrective action requires inserting an equivalent card in an adjacent slot, following which the operator uses a switch on the card to indicate to the processors that this new card should be used. When the new card has taken over all the functions of the one being replaced, the latter card has all its LEDs turned off by the processor, thus indicating that it can be removed for repair. This ability to carry out hot repair while the process continues undisturbed enables continued correct operation.

UPGRADING AND MODIFICATION

In many control systems, it is necessary from time to time to change or augment the control system. This could be due to new requirements such as change of operating conditions, or could be due to physical changes in the plant such as new sensors or other types of equipment. The two prepackaged systems TRIGARD and TRIDAC both utilize an operator and programmer work station on which new control algorithms can be specified. Following this specification it is possible to simulate their operation off line from the process to ensure correct
performance. Following this simulation, the new algorithms can be cut in to control the actual process. It is thus possible to augment the control system without lengthy down time of the process.
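The sequence of specifying a new algorithm, simulating it off line, and then cutting it in can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `Controller` class, the use of a recorded input trace, and the acceptance rule (every simulated output must stay within given limits) are hypothetical, whereas in the systems described the judgment of correct performance rests with the person at the work station.

```python
from typing import Callable, Iterable, Tuple

Algorithm = Callable[[float], float]


class Controller:
    """Holds the algorithm currently controlling the process (hypothetical model)."""

    def __init__(self, algorithm: Algorithm) -> None:
        self.active = algorithm

    def step(self, measurement: float) -> float:
        """Compute one control output with the active algorithm."""
        return self.active(measurement)

    def cut_in(self, candidate: Algorithm,
               trace: Iterable[float],
               limits: Tuple[float, float]) -> bool:
        """Simulate the candidate off line and swap it in only if it passes.

        The candidate is run against recorded measurements; if any simulated
        output falls outside the limits, the existing algorithm stays in control.
        """
        low, high = limits
        for measurement in trace:
            if not low <= candidate(measurement) <= high:
                return False
        self.active = candidate  # the process keeps running during the swap
        return True
```

The design choice illustrated is that the old algorithm remains active until the new one has passed simulation, so a rejected candidate never touches the process.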
CONCLUSIONS

We have attempted to show that it is necessary to take both a wider and longer time scale view of the man-machine interface in a control system. The wider view encompasses not just the operator, but all human interaction with the system. The longer view requires consideration of design, operation, repair, upgrade and modification.

REFERENCES

[1] Wensley, J.H., "Fault-Tolerant Computers Ensure Reliable Industrial Controls", Electronic Design, June 25, 1981.
[2] Wensley, J.H. and Harclerode, C.S., "Programmable Control of a Chemical Reactor Using a Fault Tolerant Computer", Transactions of IECI, October, 1982.
[3] Wensley, J.H., "A Fault Tolerant Computer for Industrial Control", Mini-Micro Conference, September, 1982, Anaheim, CA.
[4] Wensley, J.H., "Fault Handling in the August Systems' Series 300", SAFECOMP Conference, October 1982, Purdue, IL.
Figure 1. The Life Cycle of a Control System. (Diagram labels: HARDWARE, SENSORS, ACTUATORS, PROCESS INTERFACES, CONTROL COMPUTERS, INSTALLATION AND COMMISSIONING, MAINTENANCE AND UPGRADES, SOFTWARE, OPERATING SYSTEM, RELAY LADDER SYSTEM, CONTROL AND DATA ACQUISITION SYSTEM, APPLICATION SOFTWARE.)
Figure 2. A Conceptual View of a Triplicated Control System. (Diagram labels: INPUT, OUTPUT.)
Figure 3. The Software Structure. (Diagram labels, top to bottom: OPERATOR; TRIDAC, TRIGARD, USER-WRITTEN PROGRAM; OPERATING SYSTEM; FAULT TOLERANCE PRIMITIVES.)