Reliability Engineering and Systems Safety 63 (1999) 91–97
Diversity in computerized reactor protection systems
H.D. Fischer*, L. Piel
Communications Engineering, Ruhr-Universitaet Bochum, D-44780 Bochum, Germany
Received 23 September 1997; accepted 16 May 1998
Abstract
Based on engineering judgement, the most important measures to increase the independency of redundant trains of a computerized safety instrumentation and control system (I&C) in a nuclear power plant are evaluated with respect to practical applications. This paper is intended to contribute to an objective discussion on the necessary and justifiable arrangement of diversity in a computerized safety I&C system. Important conclusions are: (i) diverse equipment may be used to control dependent failures only if the measures necessary for designing, licensing, and operating a computerized safety I&C system homogeneous in equipment are neither technically nor economically feasible; (ii) the considerable operating experience in France with a digital reactor protection system using non-diverse equipment does not call for equipment diversity. Although there are no generally accepted methods, the licensing authority is still required to take into account dependent failures in a probabilistic safety analysis; (iii) the frequency of postulated initiating events implies which I&C functionality should be implemented on diverse equipment. Using non-safety I&C equipment in addition to safety I&C equipment is attractive because its necessary unavailability to control an initiating event in teamwork with the safety I&C equipment is estimated to range from 0.01 to 0.1. This can be achieved by operational experience. © 1998 Elsevier Science Ltd.
Keywords: Computerized safety I&C; Diversity; Dependent failures
1. Introduction
The objectives of a safety instrumentation and control (I&C) system in a nuclear power plant are to control the postulated initiating events, to prevent the escalation from a minor event to a significant incident at an early stage in case countermeasures of the closed-loop controls are not successful, and to transfer the plant into a safe state. In the safety I&C system, as in other equipment, failures may occur. Loss of system functionality, in parts or even in essential areas, is caused both by maintenance/repair and by faults occurring in the system. Only a few of them cause loss of one or even several safety functions. A fault is tolerated as long as no failure results from it. The following types of failures are distinguished and considered during the licensing procedure in Germany: the random failure, the dependent failure (common cause failure or common mode failure), and the system degradation in case of maintenance and repair. Each of these types may also result in consequential failures. Since all these failures cannot be entirely avoided, one must guarantee by appropriate measures that the probability of their occurrence is sufficiently low.
* Corresponding author
2. Use of redundant and diverse means
In general, the redundancy principle is agreed upon. For a system consisting of n independent trains operating in parallel, the total unavailability is calculated by multiplying the individual unavailabilities $u_i$ of the trains, $u = \prod_i u_i$. Because $u_i < 1$, any upper limit for $u$ is achievable by adding more trains (Fig. 1). With respect to random failures, maintenance and repair, and corresponding consequential failures, the postulated independence among the individual trains can be assumed even if functional elements of the same kind are used. However, the frequency of repairs, which increases with the number of trains, n, practically limits the achievable unavailability. Thus, increasing n above a certain limit no longer decreases the unavailability significantly. In case of a dependent failure, however, the trains
1353-8020/99/$ - see front matter © 1998 Elsevier Science Ltd. All rights reserved. PII: S0951-8320(98)00033-7
H.D. Fischer, L. Piel/Reliability Engineering and Systems Safety 63 (1999) 91–97
Fig. 1. Unavailability of a redundant system.
operating in parallel will fail simultaneously. Dependent failures may originate either in faulty behavior of identical nature in the trains or in fault propagation mechanisms with common consequences. In this case, the independency of the individual trains can no longer be assumed. Adequate design provisions are made to annunciate as many dependent failures as reasonably achievable. External impacts are one source of dependent failures. In general, this problem is solved by suitable isolation of the trains, for example by locating them in physically separated buildings, or in rooms or cubicles that are properly protected. However, if a dependent fault in the hardware or in the software of the functional elements must be assumed, then the probability of a dependent failure can be reduced sufficiently, or the common cause can even be avoided entirely, if functional elements different in design or in mode of operation are chosen. Dynamic faults in the software of a digital computer system embedded in a plant are difficult to discover and become apparent only through variation of input data. In this respect, the use of functionally different criteria to initiate a protective action is of essential significance if the probability of dependent faults is to be reduced below an acceptable value. This statement of the authors seems to be generally accepted by the nuclear community, but one should also remember the objections Leveson raised at a recent OECD workshop [1]. If technically workable means differ in mode of operation or in design, we call them diverse means. A different mode of operation is referred to as functional diversity, whereas a different design is referred to as physical diversity (Fig. 2). All practical means to avoid dependent failures should be used. Some necessary measures often even turn out to be sufficient to reduce the probability of occurrence of dependent failures to an acceptable limit. These include the following.
• The stringent verification of system elements which perform protection actions by proper methods during the quality-assured design process, possibly with participation of a third party other than the designer or manufacturer.
• The adequate monitoring of correct system function during plant operation. Only a computerized safety I&C system provides the potential for extensive diagnoses and detailed tests performed on-line.
• Modularity and fault encapsulation measures at interfaces which sufficiently limit the possibilities for fault propagation. Furthermore, modularity contributes essentially to the understandability of the plant protection system and to the reproducibility of its control actions.
• Sufficient operational experience with the system elements. This establishes confidence to such an extent that the undetermined probability of occurrence of dependent failures of unknown cause can be estimated as being adequately low. Constructive measures against failures of known cause are already taken during the design phase.
People from other disciplines, e.g. material science, regard this methodology as well suited to reduce dependent failures to an acceptable value. Sometimes these measures are neither technically nor economically feasible, because the system elements are not sufficiently observable, because expensive testing is needed, or because one must take into account very rare failures or failures of unknown cause. In this case, the use of diverse means is necessary. The use of diverse means requires the following.
• The operation of the diverse elements is independent, which is to be proved. Only then will this measure be effective.
• No new sources of faults originate, for example, from different maintenance requirements or from insufficient compatibility of the individual elements.
• The expenditure in design, in verification, and in operation caused by the application of diverse elements does not exceed economically acceptable limits.
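The product rule for independent trains stated at the beginning of this section, and the reason a dependent failure limits the benefit of adding trains, can be illustrated numerically. The following is only a sketch: the function names, the per-train unavailability of 0.01, and the simple model in which a dependent failure with probability `q_dependent` defeats all trains at once are assumptions made for illustration and are not taken from the paper.

```python
from math import prod


def independent_unavailability(train_unavailabilities):
    """Product rule u = prod(u_i) for trains that fail independently."""
    return prod(train_unavailabilities)


def unavailability_with_dependent_failure(train_unavailabilities, q_dependent):
    """Illustrative assumption, not the authors' model.

    With probability q_dependent a dependent failure disables all trains
    simultaneously; otherwise the trains fail independently.
    """
    u_independent = independent_unavailability(train_unavailabilities)
    return q_dependent + (1.0 - q_dependent) * u_independent


if __name__ == "__main__":
    # Hypothetical numbers: every train has u_i = 0.01, dependent failure 1e-4.
    for n in range(1, 5):
        trains = [1e-2] * n
        print(
            n,
            independent_unavailability(trains),
            unavailability_with_dependent_failure(trains, 1e-4),
        )
```

Under these assumed numbers the independent product keeps shrinking with every added train, whereas the total with a dependent contribution never drops below `q_dependent`, which is the situation that diverse means are intended to address.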
Fig. 2. Different categories of diversity.
After all, the application of diverse means should promote a significant relief in the licensing procedure. For these reasons, one should carefully weigh whether it is better
• to merely compensate for the insufficient availability of individual trains or parts of them, or
• to take additional measures against rare failures or against failures of unknown cause in general.
Diverse means are used to gain sufficient independency among all trains which initiate a certain protection action with respect to postulated dependent failures. Appropriate means or measures are as follows.
• Technically different functional elements of hardware as well as of system software by using products of different manufacturers. Equipment of the safety I&C system may be supplemented by equipment of the non-safety I&C system, for example.
• Dissimilar functional elements which implement the same function by different codes. For example, the formation of the second maximum analog value in the relevant supervised direction with subsequent limit value monitoring can be used in addition to the binary two out of four voting logic.
• Technologically different functional elements which process different input data but initiate the same protective action. Among these there is the application of diverse initiation criteria, which results in diverse application software in a natural way. The term diverse initiation criteria addresses the functional diversity of a reactor protection system: high temperatures or high pressurizer water level together with high coolant pressure are used to reduce reactor power in a pressurized water reactor (refer to Section 4).
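The second item of the list above names two dissimilar codes for the same function. A minimal sketch of why they agree, assuming a supervised direction of "high" (exceeding an upper limit) and using hypothetical function names and sample values, is:

```python
from itertools import product


def two_out_of_four(values, limit):
    """Binary voting: compare each channel with the limit, then vote 2-out-of-4."""
    if len(values) != 4:
        raise ValueError("exactly four redundant channels are expected")
    exceeded = [value > limit for value in values]
    return sum(exceeded) >= 2


def second_maximum_exceeds(values, limit):
    """Analog variant: form the second-largest value, then monitor one limit."""
    if len(values) != 4:
        raise ValueError("exactly four redundant channels are expected")
    second_maximum = sorted(values, reverse=True)[1]
    return second_maximum > limit


def implementations_agree(limit=300.0, samples=(290.0, 300.0, 305.0, 310.0)):
    """Compare both codes on every combination of the hypothetical samples."""
    return all(
        two_out_of_four(channels, limit) == second_maximum_exceeds(channels, limit)
        for channels in product(samples, repeat=4)
    )
```

The second-largest value exceeds the limit exactly when at least two channels exceed it, so both routines reach the same decision through different code paths. For a supervised direction of "low", the second minimum and a reversed comparison would be used instead.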
The use of totally heterogeneous systems with equipment including different hardware, different binary codes, and different operating systems is seen as a possibility to avoid dependent faults [2]. The demand for diverse equipment is essentially made to promote the use of diverse system software. In some experts’ opinion, this high degree of diversity is justified only if the sensors and the actuators of the corresponding protection channel are each diverse as well [3]. Otherwise, the probability of dependent failures among sensors and among actuators will overwhelm the correspondingly low failure rate of a diverse information processing part within a total protective action. Such a prerequisite is rarely met for the safety I&C system in existing nuclear power plants. If there is no stringent restriction to the most important I&C functions, an overall heterogeneous system produces an excessively high expenditure during implementation and a considerable number of problems during maintenance and repair. Since each automation device sets different demands, the service operators have to be individually
instructed for each type of automation equipment. Service of different equipment by the same employees on-site represents additional potential for faults. Furthermore, licensing experts might require a third(!) computing system for countermeasures that are not clearly safety-oriented, like the thrust reversal in a modern aircraft, because in such a case situations might arise where no decision can be made between alternative commands to the actuators. Such an expenditure is only justified if safety aspects require it.
3. The use of physical diversity
Physical diversity means the use of different computers with diverse system software. The same I&C functions (they represent the specification of the system engineering task for the I&C system) are implemented in redundant trains consisting of either different automation devices or the same devices with different components. Correspondingly, the same statement holds for the system software. This procedure may be applied in particular if the realization of functional diversity by diverse initiation criteria is impossible. Since physical diversity is evaluated differently by licensing experts, some aspects are treated here in more detail. The measuring equipment and the actuators are excluded from the discussion, although they are the ones that determine the unavailability of a protection action. A reasonable way to apply heterogeneous equipment to the safety I&C system is the use of non-safety I&C equipment to implement safety I&C functions [4]. The non-safety I&C system guarantees economical generation of electric power and therefore shows good reliability during plant operation. Nevertheless, it must be granted a higher unavailability than the safety I&C, despite missing proofs for qualification. However, the (small) parts of non-safety I&C equipment used for implementation of safety I&C functions should not be excluded in the probabilistic safety analysis. This procedure should be sufficient to prove the required availability for the corresponding postulated initiating events including the safety I&C system. Such a solution is attractive, because the aforementioned disadvantages of heterogeneous equipment in the implementation phase as well as during maintenance and repair are avoided. The employees have experience in dealing with the non-safety I&C devices and all tools for them are already available on-site.
Furthermore, one knows from experience that a safety I&C system with homogeneous equipment might miss the required unavailability of some I&C functions by only a few orders of magnitude. Then, it is sensible and suitable to use measuring signals and actuators different from those which are processed and actuated by the safety I&C equipment to control the same initiating event. It is just this requirement which makes such a solution less attractive for existing plants than for new plants, because a renewal of the I&C system will not lead to the installation of new actuators specifically for I&C purposes.
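The argument that non-safety I&C equipment only has to close a limited gap can be made concrete with a small calculation. The sketch below assumes that the safety and the non-safety equipment fail independently (which presupposes the separate measuring signals and actuators mentioned above); the function name and the numerical values are hypothetical.

```python
def required_backup_unavailability(target, achieved):
    """Largest unavailability the supplementary equipment may have.

    Assumes independence, so that achieved * backup <= target must hold.
    Both arguments are probabilities per demand in the interval (0, 1].
    """
    if not (0.0 < target <= 1.0 and 0.0 < achieved <= 1.0):
        raise ValueError("unavailabilities must lie in the interval (0, 1]")
    # If the target is already met, any backup (even none) is sufficient.
    return min(1.0, target / achieved)


if __name__ == "__main__":
    # Hypothetical case: the homogeneous system misses its target by one
    # order of magnitude (1e-4 achieved against 1e-5 required).
    print(required_backup_unavailability(target=1e-5, achieved=1e-4))
```

With a gap of one or two orders of magnitude, the supplementary equipment needs an unavailability of roughly 0.1 or 0.01, which matches the range quoted in the abstract.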
Littlewood [5] extended the Hughes [6] model for forced diversity of hardware components by making the assumption that there may be two, or more, different types of components available (pp. 108–110). The formal mathematical derivation uses statistical averages and comes to the conclusion that the best design uses all the available types of components, but uses each as little as possible. This result is obtained by averaging over all randomly chosen environments. However, in a nuclear power plant there exists a specific, well-known environment for the computerized safety I&C system. In fact, we have some knowledge about the reliabilities of particular components in a particular environment. Thus, we agree with Littlewood that then “… things can change dramatically. In general, it can be shown that preferences that we have ‘on average’ are precisely reversed when we have sufficient knowledge to be able to talk of ‘best of’”. The practitioners of safety I&C systems in nuclear power plants take into account a large body of knowledge gained and collected with hardwired I&C systems as well as with digital I&C systems in nuclear power plants over more than two decades. In this way, the practitioners are capable of evaluating design measures in a specific nuclear power plant environment from a system-oriented point of view, which otherwise are not manageable by the mathematical methods currently available. Now some examples of those elements which might be implemented diversely are considered.
(i) In the computing equipment, all devices that belong to the environment of the real computing elements are summarized, e.g. assembling technique, connecting technique, power supply, and ventilation. In these fields the same problems as in hardwired protection systems are found. Therefore, the well-known means for function testing and function monitoring can be used, especially because a comparably simple functionality is applied.
(ii) However, the processor boards and their installed chip-set contribute considerably to correct operation of the hardware. But their range of functions is restricted, so that they seem to be controllable by tests and diagnoses. The additional requirement for documented operational experience should be observed.
(iii) In contrast, the processors show the highest complexity and the lowest observability of the internal functions at the same time. The recent example of the Intel Pentium shows that systematic faults of the implementation still occur despite intensive testing. The same example, however, impressively demonstrates that such faults are detected and corrected the faster, the more varied the uses of the processor are. Nevertheless, diverse processors might be used. Application of processors belonging to the same family, such as, for example, the AMD K5 and the Intel Pentium, may be considered. However, we do not consider the use of different families of processors, like the 80x86 family on the one hand and the 680x0 family on the other. The different macro codes will cause adaptations not only of the processor boards but also of all system software which generates or processes binary code. The associated expenditure, in particular for the necessary proofs, and the potential for additional sources of faults seem to be out of proportion to the gain in availability.
(iv) The use of a diverse operating system cannot reduce the necessary depth of proofs. As correct function of operating system routines is fundamental for sufficient availability of the safety I&C system, poor correctness cannot be compensated by the selection of a diverse but also deficient operating system in a redundant train. The hardware-related parts of the operating system in particular are rather unsuited for the application of diverse elements, because the techniques used for them result from the hardware itself and so can rarely be varied systematically without also changing the hardware. Furthermore, for the processing of I&C functions by commercial off-the-shelf automation devices in particular, the operating system is comparatively simple and tests may be performed economically because of a limited number of specified operating modes. In contrast to the PC world, large changes by modification of the system will not occur during plant operation here. In summary, the potential for systematic faults and thereby induced dependent failures is considered small enough that adequate correctness can be claimed after sufficient operating experience. Therefore efforts should focus on careful specification, design, and verification and validation (V&V) of a single operating system. Recently, Lawrence and Gallagher [7] made a proposal for performing a software safety hazard analysis. Parts of the proposal seem to be valuable for operating systems and communication software, which are often referred to as system software.
Software safety hazard analysis focuses on the early phases of the software lifecycle and may be regarded as a supplement to IEC 880. Recommendations for design changes are gained from a hazard analysis of the software architectural design. If the proposal by Lawrence and Gallagher can be restricted to the architectural aspects of system software, then the systematic way to design and to examine software for safety-critical applications will not only support licensing requirements but also contribute to design improvements. System software specification, design, and coding differ considerably from those of the application software which implements the I&C functions that are defined by safety-system engineering requirements. Especially for the modernization of reactor protection systems in existing nuclear power plants, the numerous documents already available should be taken into account. The advantage will be the reduction of economic penalty for a computerized
safety I&C system burdened by extreme licensing requirements.
(v) Particularly for the communication software, a homogeneous system must be preferred simply to avoid conflicts. Erroneous behavior by conflicts when accessing a common transfer medium, for example, can be eliminated by careful design and in-depth V&V including a software hazard analysis. There are considerable doubts that a simple, adequate, reliable and fault-tolerant communication system particularly developed for use in nuclear power plants represents a concept-determining detail.
(vi) In the field of software development for safety I&C, the diversity of compilers and their corresponding linkers and loaders is discussed. Due to the uncertainty of colloquial language and the underlying programming rules of a manually operated compiler, a propensity to systematic false reasoning occurs. Taking this into account, the use of a diverse programming team, which applies a different programming language, seems to be advantageous. The problem, however, is solved with much less expenditure by powerful graphical development tools, because:
• they relieve the programmer from programming details during implementation of an I&C function;
• they formalize the real set-up of the programme code, which essentially simplifies verification and validation of the automatically generated code. This is just the reason why a diverse development tool is no longer relevant;
• in addition, they increase the comprehensibility of the I&C specification, particularly for system engineers. Thus, systematic faults from misleading functional requirements are essentially avoided.
The use of a powerful graphical development tool like SPACE [8], for example, is considered a sensible solution and may be used together with products of different manufacturers for the same programming language, like C or C++. The development tool is used for specification and functional testing of the application software. If processor diversity to the target hardware is observed, then the application software is proved for correct function with diverse compiler, linker and loader. In this manner, confidence in the application software increases. As far as the reliability of the compiler, linker and loader is concerned, the same conditions result as with microprocessors, if diverse components are chosen.
4. The use of functional diversity
Functional diversity means the application of several different I&C functions for the same goal (diverse application software). Several different functions commonly initiate a protective action. Among them there may be a hierarchy of
categories, as proposed by IEC 1226, with respect to efficiency or to response time (earlier–later effective). At first, functional diversity is generated by the choice of diverse initiation criteria, i.e. different measuring signals are used to initiate a protection action [9]. The measuring signals are considered independent, because they are obtained via different measuring devices. Nevertheless, their information on the physical transient is correlated with respect to the underlying natural laws. Therefore, the corresponding I&C functions for evaluation of this information differ from each other. The same elementary functional units are reused for the same subfunctions in the graphical specification of I&C functions. This way the code of the I&C functions is systematically structured and therefore easy to survey, but a certain potential for systematic errors does still remain. Modularization and restriction to only a few elementary functional units permit a considerably easier code verification: the size of the code to be proven is smaller and the more frequent use of the same code provides an efficient operational experience. Since the code is automatically generated from the I&C specification by unique rules, the creativity of some engineers during the coding process is strictly limited: the coding process is almost beyond human influences. Finally, it is very improbable for the same common subfunctions to be in the same state during system operation because they are used within different I&C functions and with different input data. Therefore the diverse I&C functions may be assumed to be sufficiently independent of each other. Another possibility to reduce the dependence among redundant trains of equivalent I&C functions is their dissimilar implementation. 
As far as I&C functions of the highest safety category are concerned, the potential of this design measure is rated lower, because in this category signal processing is simple and so alternatives are rarely offered to the designer. This is different for other I&C functions in a hierarchical protection system: here, the more intelligent signal processing offers an adequate number of dissimilar implementations. In addition, manual operations are used as protective actions. These are taken into account if there is sufficient response time for operator actions and the necessary protective action requires no complicated sequence of manually given commands. This has to be proven for operator actions required by class-S alarms [10] and is valid for countermeasures initiated by condition limitations [10]. Attention must be paid to the time-dependent total load of the operator in order to avoid overloading the operator. Manual countermeasures may be used as a diversity to the automatic actions of a safety I&C system.
4.1. One source of systematic faults
One source of systematic faults, however, cannot be eliminated by all of these design measures, namely an
erroneous requirement specification. Digital and hardwired protection systems do not differ in this respect: in both implementations, a requirement fault causes a dependent failure. This problem is not solved by diverse specifications of I&C functions but by the choice of diverse initiation criteria together with the introduction of a hierarchical safety I&C system [11–13], which considerably facilitates the application of functional diversity. The main goal is still to stop a postulated initiating event as early as possible and to transfer the plant back into normal operation. In this respect, a second but equivalent protection function initiates a protection action instead of the first if the latter is lost. Alternatively, a protection function with a higher safety relevance may be designed to mitigate an evolving incident. It is advisable in any case to design a safety I&C system according to the principle of defence-in-depth. This means responding to frequently postulated events by more symptom-oriented protection actions, while during loss of function or in the face of incidents or very rare events a more event-oriented approach is applied and, thus, more effective countermeasures are initiated. The rarer an initiating event, the more tolerable severe countermeasures are. If an event is controlled by scaled protection actions in this way, the requirements on the first acting system can be reduced. At the same time the expenditure for the required proofs of the first acting system is decreased. This fact permits the use of more intelligent I&C functions in a safety I&C system with scaled safety requirements. Apart from application of a defence-in-depth concept within the I&C system, the verification and validation of the requirements specification may be strengthened. A major source of dependent failures is the specification of the safety-system engineering requirements.
It is recommended that derivation and validation of these requirements be done in a somewhat ‘orthogonal’ manner: if the derivation of the system engineering requirements (called I&C functions) is done by using a systematic safety goal-oriented approach, the validation of each I&C function should follow an event-oriented methodology, and vice versa. Such a procedure enforces a completely different mode of thinking during elaboration of I&C functions and during their validation: in the safety goals’ approach, the I&C functions are defined by asking the question, ‘what should be done automatically or manually to guarantee the specific safety goal?’, whereas in the event-oriented approach, the question, ‘what kind of I&C functions are actuating in which way to control a specific event?’ must be asked.
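The diverse initiation criteria discussed in this section can be sketched as two I&C functions that evaluate different measuring signals but initiate the same protective action. The example follows the pressurized water reactor case mentioned in Section 2 and reads it as "high temperature, or high pressurizer level combined with high coolant pressure"; this reading, the signal names, the units, and all limit values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlantSignals:
    """Hypothetical measuring signals obtained via different measuring devices."""
    coolant_temperature: float  # degrees Celsius
    pressurizer_level: float    # metres
    coolant_pressure: float     # MPa


# Hypothetical limit values, chosen only to make the sketch executable.
TEMPERATURE_LIMIT = 330.0
LEVEL_LIMIT = 10.0
PRESSURE_LIMIT = 16.0


def temperature_criterion(signals):
    """First I&C function: initiation on high coolant temperature."""
    return signals.coolant_temperature > TEMPERATURE_LIMIT


def level_and_pressure_criterion(signals):
    """Second I&C function: high pressurizer level together with high pressure."""
    return (
        signals.pressurizer_level > LEVEL_LIMIT
        and signals.coolant_pressure > PRESSURE_LIMIT
    )


def reduce_reactor_power(signals):
    """The protective action is initiated if either diverse criterion responds."""
    return temperature_criterion(signals) or level_and_pressure_criterion(signals)
```

In this sketch a temperature reading that stays erroneously low does not suppress the protective action as long as the level and pressure signals respond to the same transient, because the two functions share no input data.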
5. Conclusions
In most cases [14], a homogeneous computer system for the safety I&C system fulfils the requirements of a nuclear power plant of established design, because the compliance
with hardware as well as software specification is adequately provable in type tests, suitability tests, integration tests, and interconnection tests. Diverse software does not need to be used if the compliance with the specification is provable. This is achievable by an adequate programming methodology, particularly by modularization and the use of fault confinement areas, or even by mathematical proofs of correctness. This is true for application software as well as system software, where a safety hazard analysis may improve the software design. At most, the use of diverse processors or compilers is suitable if systematic faults are not adequately improbable because of deficient observability and specific complexity of the underlying processes. To keep the expenditure economically reasonable, products of the same processor family and the same programming language are preferred. The demand for operational experience will also result in a definite reduction of the potential for faults. If physical diversity is necessary in a few cases, the devices of the non-safety I&C system may be used to form a proper equipment diversity. Concerning hardware and software that is operated and serviced by humans, the use of diverse elements may result in a considerable potential for dependent failures [14]. It may be even higher than the gain in availability achieved by a diverse design. It is just the increasing number of failures caused by operator errors that represents an argument against physical diversity where it is not necessary. The application of functional diversity is considered to reduce the potential of dependent failures essentially, because common input data, being the most important cause of dependent failures, are avoided by independent processing of diverse initiation criteria.
Dependent failures by erroneous specification of I&C functions are controlled by a hierarchical safety I&C system which responds with more efficient protection actions initiated by I&C functions of a higher safety category. A supplementary way to deal with erroneous specification of I&C functions is the application of an orthogonal approach for elaboration and validation of I&C functions. The discussed measures should be adequate for a design of the safety I&C system in a nuclear power plant to make the occurrence of dependent failures sufficiently unlikely.
Acknowledgements
The authors thank the reviewers for their valuable comments which contribute to a better comprehension of the paper.
References
[1] Leveson NG. High-integrity software: is it safe? Could it be safer? Presented at the International OECD Workshop on Licensing of Computer-based Systems Important to Safety, Munich, March 1996. OECD Nuclear Energy Agency, NEA/CNRA/R(97)2, Appendix:336–362.
[2] The Royal Academy of Engineering. Proceedings of a Forum on Safety Related Systems in Nuclear Applications, 28 October 1992. The Royal Academy of Engineering, ISBN 1 87163424 5, December 1992, 2 Little Smith Street, Westminster, London SW1P 3DL.
[3] Ichiyen NM. Benefits from use of computers in CANDU shutdown systems. IAEA Specialists’ Meeting, Saclay, France, 28–30 November 1984.
[4] Baffie O. Safety classified I&C at Chooz B. In-depth information after a visit of the German Advisory Committee for Reactor Safety at Chooz, 19 December 1995.
[5] Littlewood B. The impact of diversity upon common mode failures. Reliability Engineering and System Safety, 1996;51:101–113.
[6] Hughes RP. A new approach to common cause failure. Reliability Engineering and System Safety, 1987;17:211–236.
[7] Lawrence JD, Gallagher JM. A proposal for performing software safety hazard analysis. Reliability Engineering and System Safety, 1997;55:267–282.
[8] Stöcker S, Waedt K. In: Kafka P, Wolf J, editors. SPACE, Specification And Coding Environment. A toolkit allowing graphical specification of safety critical programs for automation, safety and reliability assessment - An integral approach. Amsterdam: Elsevier Science, 1993:825–839.
[9] Bachmann G, Geyer KH, Helf H, Hellmerichs K. Safety Systems of KWU Pressurized Water Reactors. Kerntechnik, 18. Jahrgang, 1976:374–381.
[10] KTA 3501 standard. Reactor protection system and monitoring equipment of the safety system, edition 6/85 (in German).
[11] Aleite W. The contribution of KWU-PWR-NPP Leittechnik important to safety to minimize reactor scram frequency. In: Proceedings of the International Symposium on Reducing Reactor Scram Frequency, Paper No. 6.4, OECD-NEA, Tokyo, 14–18 April 1986.
[12] IAEA. Safety Guide D8, Revision 8, Safety Related Instrumentation and Control Systems. Vienna, 1982.
[13] IEC 1226. Nuclear power plants - instrumentation and control systems important to safety - classification, 1993 edition.
[14] Fischer HD, Hellmerichs K, Parry A. Digital instrumentation and control for future nuclear power plants within the French–German cooperation. In: Proceedings of the International Symposium on Nuclear Power Plant Instrumentation and Control, Tokyo, 18–22 May 1992.