Copyright © IFAC SAFECOMP '88, Fulda, FRG, 1988
SOFTWARE SAFETY

PRINCIPLES FOR DESIGN FOR SAFETY

W. J. Quirk

UKAEA Harwell Laboratory, Oxfordshire, UK
Abstract. With the increasing use of computers in safety related applications, a structured approach is required for analysing, implementing and validating critical systems. This paper discusses and justifies one such scheme, recently proposed by the European Workshop on Industrial Computer Systems Technical Committee on Reliability, Safety and Security. Keywords. System Design, Safety, Risk.
INTRODUCTION

Designing for safety, particularly for computer based systems, is a goal which everybody wants to achieve but up to now little guidance existed on how to achieve it. Traditionally, most work has been targeted on establishing correctness but, as is discussed below, safety is a wider concept than correctness. EWICS TC7 has recently produced guidelines on Design for System Safety (EWICS87a) covering the area and this paper gives an outline of the philosophy on which those guidelines are based and some principles underlying them.

Increasingly, computer systems are required to control or monitor potentially hazardous operations in order to maintain overall safety. Such a safety-related computer system is called the "target system" in this paper. It is assumed to be responsible for controlling some aspects of another system (called the "plant") which could suffer an accident and cause damage to the environment.

Design is one of the major phases in establishing the target system. The aim of this procedure is to minimise, by good design, the risk of any accidents caused by the plant. Safety is best achieved when the design of the plant and target system go hand in hand. A higher degree of safety can be achieved more easily, or for lower cost, when the organisation of safety measures can be integrated over the whole system. Indeed, this design procedure may reveal problem areas which are best resolved by changes to the plant design. It should be noted though that this design procedure aims to enhance safety by special design of the target system and not by improving plant components. For example, the procedure might lead to a dual braking circuit in an automobile but it would not lead to better brake pipes.

MODELLING SAFETY

Any system can be characterised by a specification of the service the system is to provide. The operation of the system can then be broken into alternating intervals where proper (ie. in accordance with the specification) and improper services are or can be delivered. The switching between the two states is by failure and restoration of the service. An underlying assumption of this is that the agreed specification of the service to be provided by the system actually exists. Only then do the notions of proper and improper services become meaningful.

Every system exists within some broader context. An explicit distinction can be made between the target system, which is the computerised system to be designed, the plant, which is the part of the world which is controlled by the system, and the environment, which is the part of the rest of the world directly affected by the plant. The safety question concerns what influence the plant has on its environment and, in particular, what could be the negative consequences of that influence. In that respect, the consequences can be classified according to their severity. The most serious event is when an accident occurs. However, there may be other serious environmental consequences not associated with accident occurrences (such as the emission of gases from burning coal leading to the production of acid rain).

An accident is a transition between safe operation of the plant and a calamity with possible loss of human life, significant destruction of property, etc. The model disregards the possibility of switching back from a calamity to safe operation. After an accident has occurred, the consequences are usually so severe that restoring the target system is not of primary importance. Both the environment and the system may be partially destroyed, and investigation commissions may be involved for long periods.
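The model is described here only informally. Purely as an illustration, the following minimal sketch (in Python, with hypothetical names throughout) encodes the plant states and the permitted transitions just described, including the assumption that there is no transition back out of the calamity state.

```python
from enum import Enum, auto

class PlantState(Enum):
    """States of the plant as used in the safety model (hypothetical encoding)."""
    SAFE = auto()       # proper service, risk below the agreed acceptable value
    DANGEROUS = auto()  # a hazard is active; an accident is possible but not certain
    CALAMITY = auto()   # an accident has occurred; the model allows no return

# Permitted transitions: a dangerous state may be restored to a safe one
# (often the whole purpose of the safety system), but a calamity is terminal.
TRANSITIONS = {
    PlantState.SAFE:      {PlantState.DANGEROUS},
    PlantState.DANGEROUS: {PlantState.SAFE, PlantState.CALAMITY},
    PlantState.CALAMITY:  set(),
}

def can_move(current: PlantState, target: PlantState) -> bool:
    """True if the model permits the transition from `current` to `target`."""
    return target in TRANSITIONS[current]

assert can_move(PlantState.DANGEROUS, PlantState.SAFE)     # restoration
assert not can_move(PlantState.CALAMITY, PlantState.SAFE)  # no switching back
```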
Typical examples of accidents are train or plane crashes and major environmental pollution. Typical examples of target systems are computerised protection systems, auto-pilots and traffic light controls. The target systems are not in themselves dangerous and will not directly cause any accidents, as the target systems concerned here will be computer systems. But the target system can indirectly cause an accident through the plant, examples of which are power stations, trains and planes. The environment consists of those people, houses, other vehicles, nature and other objects that can be harmed by an accident to the plant. This also includes, for example, train or flight passengers and employees at a plant.

A crucial concern is how the target system acts on the plant and, in particular, how a safety failure of the target system can bring the plant into a state where an accident is likely to occur. Not all target system failures lead to accidents. The progression from a target system failure to a calamity goes via an intermediate, dangerous state of the plant. This state may lead to an accident, but it may in principle also be restored into a safe state. Indeed, this restoration is often the goal of the plant safety system.

It is important to realise that hazards continue to exist despite the safety system. The only way a hazard can be completely eliminated is by designing it out of the plant. Rather, each hazard has a certain risk associated with it. The purpose of a safety system is to reduce the risks associated with the hazards to below an acceptable level according to some appropriate criteria. The safety system can reduce the risk either by reducing the probability of an accident (eg. a shutdown system) or by mitigating the consequences (ie. reducing the cost) of the accident should it occur (eg. a fire extinguishing system), or both. By definition, a plant will be safe if the risk its operation poses is less than the agreed acceptable value.

The plant may be in a dangerous state for several reasons (or a combination of these):
- A hazard was not identified at all and its associated risk takes plant operation outside the specified safe value.
- The risk associated with a hazard was not correctly estimated. This could come about by mis-estimating either the associated cost or the probability, possibly due to incomplete Fault Tree Analysis or Failure Modes and Effects Analysis (FMEA) studies: eg. failing to recognise particular initiating events or dangerous combinations of events. As a result, insufficient mitigating features may be designed into the system.
- The safety system fails (either in its design or due to a random failure within it) in such a way that it no longer provides the level of mitigation necessary to operate the plant within its agreed safe risk. Here, "operate" may also mean "shut down";
the failure of a residual heat removal system would be included in this category.

Malfunctions of the target system can influence plant safety in four principal ways:
- there may be no initiation of action (output to actuators) when there should be from the point of view of the plant state.
- the target system may initiate an incorrect control action on the plant in response to some real demand from the plant.
- the initiation of the action may be delayed by more than an acceptable time limit.
- the target system may initiate an action in cases where no initiation should take place.

As already mentioned, not every service failure (ie.
a failure of the target system to fulfil its specified function) is a safety failure. The main reason is the difference in the level of generality. The service failures are seen in the perspective of the plant, and the judgement of what is and what is not a failure is based on the system specification, which concentrates on the mission to be fulfilled by the system. Safety failures require a much broader context because safety is intended to capture all possible dangerous consequences of the existence of the system. Due to the difference in generality it may happen that even some desired behaviours of the system constitute a threat to safety. Thus safety failures can also result from "unsafe" specifications. The unsafe specification can in turn be the result of an inadequate perspective chosen while defining the system requirements or of requirements specification errors (the requirements do not capture the intended mission of the system).

One could argue that if the system specification were prepared from a sufficiently general point of view, such that all safety requirements were included, a 100% reliable system would be completely safe and the notion of safety could be replaced by that of reliability. The present state of technology does not, and will not in the foreseeable future, provide for building 100% reliable systems. Due to this fact, safety and reliability are qualities of the system which may sometimes conflict. For example, an ultra-reliable system whose rare failures have dangerous consequences may be less safe than a completely unreliable system which cannot do anything and therefore cannot do any harm.

There may also be inescapable design compromises between safety and availability. An aircraft which never flies is safe but unacceptable. On the other hand, unavailability may be the dominant danger factor in, for example, a life support system. Such a system is not modifying the cost of death; rather it is (trying to) reduce the probability of dying. This illustrates the possibility of the system failure itself causing an accident. At the other
extreme, the non-availability of a fire detection system does not itself cause a calamity. In this case, the system is not trying to reduce the probability of fire breaking out, but rather is trying to mitigate the cost of a fire by giving early detection so that it may be extinguished quickly. A shutdown system represents an intermediate position. It is there to reduce the probability of an accident and not the cost. If it fails, the plant operation may be unsafe but another event is necessary for an accident to occur (ie. an initiating event to which the shutdown system is unable to respond). In this case, like the fire detection example, plant shutdown may well be a safe state and, providing that failure of the system can be translated into such a shutdown with a sufficiently high probability (ie. fail-safe), then lack of availability is not of itself a risk to safety. Of course there may be secondary considerations: the cost of not running the plant, the cost of prematurely ageing the plant and the increased likelihood of the system being bypassed by frustrated operators or managers.
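The notion of risk used above combines the probability of an accident with its cost. As a rough numerical illustration only, and not a method prescribed by the guidelines, the following sketch shows the two routes by which a safety system can reduce risk below an agreed acceptable value; all names and figures are hypothetical.

```python
def risk(probability_per_year: float, cost: float) -> float:
    """Risk as expected loss per year: probability of the accident times its cost."""
    return probability_per_year * cost

ACCEPTABLE_RISK = 1.0e3   # agreed acceptable value (hypothetical units, eg. cost per year)

# An unprotected hazard: rare but very costly.
unprotected = risk(probability_per_year=1.0e-3, cost=1.0e8)      # 1.0e5 -> unsafe

# Route 1: a shutdown system reduces the probability of the accident.
with_shutdown = risk(probability_per_year=1.0e-6, cost=1.0e8)    # 1.0e2 -> safe

# Route 2: a fire extinguishing system mitigates the cost should the accident occur.
with_mitigation = risk(probability_per_year=1.0e-3, cost=5.0e5)  # 5.0e2 -> safe

for label, r in [("unprotected", unprotected),
                 ("with shutdown system", with_shutdown),
                 ("with mitigation", with_mitigation)]:
    print(f"{label:22s} risk = {r:10.1f}  safe = {r < ACCEPTABLE_RISK}")
```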
SAFE DESIGN PRINCIPLES

The above discussion demonstrates that safety and reliability are different concepts. Safety assumes a more general point of view and its interest is in avoiding disastrous consequences the system can produce in the surrounding world. This means that the approach to the design of a safety-related system is not covered by (or included in) the approach taken to design (ultra-)reliable systems. Five distinguishing principles emerge which should be considered when designing safety into a system. These are discussed below.
i) Independent analysis of system safety.
In order to establish safety requirements, one has to identify the possible impact the plant has on its environment. A common approach is to start with a list of possible accidents in the plant and their consequences on the environment. The next step is to work backwards looking for possible causes of those accidents. These will further be referred to as hazards. This Preliminary Hazard Analysis (PHA) step aims to identify and evaluate hazards and also to identify general safety design criteria to be used. PHA provides input to an analysis which aims to identify potential hazards in the target system which may lead to dangerous failures. This should be used to provide a set of criteria for the design of software to reduce the risk posed by these hazards. Having identified the target system safety requirements, which are the output of this analysis, one should relate them to the remaining target system requirements specifications, in order to identify potential conflicts, inconsistencies and omissions. This step demonstrates that the functional specification of the system is safe and classifies
the functions according to their criticality to the safety goals. The above steps are (and should be) independent of the functional requirements oriented specification activities. The reason is that the latter are usually defined from a different (less general) point of view and their prime purpose is in the mission to be fulfilled by the system. Independence also helps to reduce the chance of having errors in common in both the mission and safety aspects of the specification. This is an important point to note. The specification is the common starting point for all designs and implementations of the target system. Any error in it is likely to exist in all implementations and, indeed, the better the design process, the more likely it is that the error will be propagated. If the specification is itself unsafe, then the best designs will also be unsafe.
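A common way to organise the output of such a Preliminary Hazard Analysis is a simple record linking each postulated accident to the hazards that could cause it and to the design criteria they imply. The sketch below is only one possible representation, with hypothetical example entries; it is not taken from the guidelines.

```python
from dataclasses import dataclass, field

@dataclass
class Hazard:
    description: str
    causes: list[str]            # postulated causes, found by working backwards
    design_criteria: list[str]   # safety design criteria the hazard implies

@dataclass
class Accident:
    description: str
    consequence: str
    hazards: list[Hazard] = field(default_factory=list)

# Hypothetical PHA fragment: start from the accident and work backwards.
overpressure = Accident(
    description="vessel rupture",
    consequence="release of contents to the environment",
    hazards=[
        Hazard(description="overpressure in reactor vessel",
               causes=["relief valve stuck closed", "control loop demands excess heat"],
               design_criteria=["independent pressure trip", "relief valve monitoring"]),
    ],
)

# Collect the target system safety requirements implied by all identified hazards.
safety_requirements = sorted({c for h in overpressure.hazards for c in h.design_criteria})
print(safety_requirements)
```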
ii) Structuring according to criticality.
Normally, the computer based target system will include a significant software component, and the complexity of the software is often a major cause of problems. Limiting and mastering this complexity is one of the primary goals of the design process. The separation of the software system functions into safety-critical and non-safety-critical classes is used to guide the design, which should aim to preserve this separation at the software component (module) level. Having preserved this separation, an extra effort can be devoted to increasing the quality of the safety critical modules (eg. by performing formal verification, applying advanced fault-tolerance techniques, etc.). However, this is not enough to guarantee safety. Malfunctions of even ultra-reliable modules can result from faulty operation of other, non-critical modules.

The separation of critical and non-critical functions is a useful approach which provides for separation of concerns. Nevertheless, any arguments related to system safety which refer to this separation have to be accompanied by supporting analysis which proves that there is no way that the system safety can be compromised by faults in the non-critical software. This problem can be remedied by enforcing protection and access limitation to safety critical modules. For instance, the critical modules can be implemented in a separate computer and the access from other modules can be subjected to strict authorisation and authentication checks. Data passed to the critical module could be validated by both the calling module and the critical one. In some situations, certain authority limitation with regard to inadvertent activation may be implemented by retaining a human in the decision process and only allowing critical actions to be executed after receiving a positive confirmation from the operator. In other cases, it may be considered that the human is the weaker link in the safety chain. Then the above situation would be reversed, with operator requested actions being allowed only if the plant state, as established by the target system from the sensor data it was receiving, indicated it was safe so to do.
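As an illustration of the kind of protection and access limitation described above, the following sketch shows a critical module that validates data passed by a caller, checks a simple authorisation, and, for one class of action, requires positive operator confirmation before executing. The structure and all names are hypothetical and not taken from the guidelines.

```python
AUTHORISED_CALLERS = {"protection_logic"}   # modules allowed to request critical actions

class CriticalActuator:
    """Safety critical module guarding access to a plant actuator (hypothetical sketch)."""

    def __init__(self, operator_confirms):
        # operator_confirms: callable returning True only on explicit operator confirmation
        self._operator_confirms = operator_confirms

    def request_action(self, caller: str, demanded_position: float) -> bool:
        # Authorisation: only named callers may drive the actuator.
        if caller not in AUTHORISED_CALLERS:
            return False
        # Validation: the critical module re-checks data already validated by the caller.
        if not (0.0 <= demanded_position <= 100.0):
            return False
        # Authority limitation: an irreversible action needs positive confirmation.
        if demanded_position == 0.0 and not self._operator_confirms("full closure requested"):
            return False
        self._drive(demanded_position)
        return True

    def _drive(self, position: float) -> None:
        print(f"actuator driven to {position:.1f}%")

# Usage: a confirmation function standing in for the operator interface.
actuator = CriticalActuator(operator_confirms=lambda message: True)
assert actuator.request_action("protection_logic", 42.0)
assert not actuator.request_action("logging_module", 42.0)   # unauthorised caller rejected
```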
iii) Achieving ultra-high reliability of the safety critical modules.

The previous principle states that the system structure should support the separation between safety-critical and non-safety-critical functions. If such separation is guaranteed then the primary concern is the correctness and reliability of the safety critical part. Any techniques which increase the reliability of the safety critical part of the system can be used. They include methods of fault avoidance, which concentrate on decreasing the number of design faults in the software, and fault tolerance techniques, which concentrate on mitigating the consequences of the faults occurring in the operational version of the system.

This concentration on the safety critical part of the target system offers two benefits. The first is that the scale of the problem is reduced, since the safety critical part is usually only a small proportion of the complete system. The second is that, since timescales and budgets are never limitless, it delineates those parts of the system which need special attention from the rest. This is not to say that the quality of the rest of the system is of no importance, but it has to be recognised that it is neither possible nor cost effective to build the whole system to the same quality level.

iv) Design for safe states.

Should a failure occur which prevents the continued normal operation of the system, it is clearly paramount to maintain the plant adequately safe. To achieve this, the possibility of bringing the system into a safe state, or into a state of reduced risk, should be investigated. Such safe states will usually be states where other quality factors (eg. reliability, availability) are considerably reduced, which is why the safety system should dominate the control system. The most radical approach is to shut down the system completely, assuming that the "dead" system will not cause any hazards. In other situations there may be intermediate safe states with limited functionality, especially for those systems for which a complete shutdown would itself be hazardous (eg. an aircraft autopilot). For instance, the safety-related control modes for a process control system might include (see the sketch following this list):

(1) Partial Shutdown - the system has partial or degraded functionality,
(2) Hold - no functionality is provided, but steps are taken to maintain safety or to limit the amount of damage,
(3) Emergency Shutdown - the system is shut down completely,
(4) External Control - the system continues to function, but control is switched to a source external to the computer (eg. manual control),
(5) Restart - the system is in a transitional state from abnormal to normal.

Whether or not any of these is applicable in a given situation depends upon details of the application and general guidance cannot be given. Different failures of a single application may, in any case, be handled separately: complete shutdown may be a last resort after attempts to maintain partial functionality have failed. There may be general frustration and even lifetime implications from too frequent shutdowns, and frustrated humans are not renowned for safety.
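The five control modes listed above translate naturally into an explicit mode variable in the target system software. The following sketch, with hypothetical names, simply enumerates the modes and gives one possible last-resort ordering for choosing between them; which mode is appropriate for a given failure remains an application decision, as stressed above.

```python
from enum import Enum, auto

class ControlMode(Enum):
    """Safety-related control modes for a process control system (hypothetical sketch)."""
    NORMAL = auto()              # full functionality
    PARTIAL_SHUTDOWN = auto()    # partial or degraded functionality
    HOLD = auto()                # no functionality, but damage is limited
    EMERGENCY_SHUTDOWN = auto()  # the system is shut down completely
    EXTERNAL_CONTROL = auto()    # control handed to a source outside the computer
    RESTART = auto()             # transitional state from abnormal back to normal

def select_mode(partial_operation_possible: bool, operator_available: bool) -> ControlMode:
    """One possible ordering: prefer degraded operation, then handover, then shutdown."""
    if partial_operation_possible:
        return ControlMode.PARTIAL_SHUTDOWN
    if operator_available:
        return ControlMode.EXTERNAL_CONTROL
    return ControlMode.EMERGENCY_SHUTDOWN

print(select_mode(partial_operation_possible=False, operator_available=True))
```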
v) Continuous monitoring of safety.
Despite the effort to increase the reliability of the safety-critical software, the software itself and the environment should be continuously monitored to intercept failures before they develop into accidents. This principle is based on the recognition that no presently available means can guarantee absolute safety. Therefore the safety system should monitor the most safety critical parameters in order to implement the "last chance" barrier which guards against accidents in situations when they are most likely to occur. As already mentioned, it is also easier to obtain the necessary ultra-high reliability for a small safety kernel than for a large and relatively complex system. It is also possible to introduce a degree of diversity here. The characterisation of plant safety can be entirely orthogonal to the algorithmic control parts of the system. The safe operating envelope can often be defined without reference to how it is intended that the plant will be kept inside that envelope.
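The safe operating envelope mentioned above can often be written down independently of the control algorithms, simply as limits on a few measured parameters. Purely as an illustration of such a "last chance" barrier, the sketch below checks hypothetical sensor readings against hypothetical limits and demands a trip whenever any limit is violated; it is not the monitoring scheme of the guidelines.

```python
# Safe operating envelope: limits on the most safety critical parameters
# (hypothetical parameter names and values).
ENVELOPE = {
    "pressure_bar":     (0.0, 155.0),
    "temperature_degC": (10.0, 320.0),
    "coolant_flow_pct": (85.0, 110.0),
}

def violations(readings: dict[str, float]) -> list[str]:
    """Return the parameters whose readings fall outside the safe envelope."""
    out = []
    for name, (low, high) in ENVELOPE.items():
        value = readings.get(name)
        # A missing or implausible reading is treated as a violation (fail-safe bias).
        if value is None or not (low <= value <= high):
            out.append(name)
    return out

def monitor(readings: dict[str, float]) -> str:
    """Independent 'last chance' check: demand a trip whenever the envelope is violated."""
    bad = violations(readings)
    return "TRIP: " + ", ".join(bad) if bad else "OK"

print(monitor({"pressure_bar": 150.2, "temperature_degC": 301.0, "coolant_flow_pct": 96.0}))
print(monitor({"pressure_bar": 158.7, "temperature_degC": 301.0, "coolant_flow_pct": 96.0}))
```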
DESIGN PROCEDURE STEPS

The principles just described can be embodied in four main steps, as shown in the figure. The first two steps do not deal directly with the design process. Rather, they are part of the preparation needed before one can design safety features into the target system. They usually require a broad range of inputs, including some specialised plant knowledge. Remember that the final specification of the target system must itself be safe.

Step 1: Safety relevant information on the plant and the environment is compiled in this activity. Sources of information can be plant and environment descriptions, risk analyses etc. Auxiliary information such as regulations and general safety criteria should also be gathered in this step. The result of this
step is a set of safety goals which should be fulfilled in the design of the target system. These safety goals are often expressed as a set of unsafe states the plant should not be brought into by the target system. A set of specific design constraints is produced during this step, along with a partial validation plan.
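Since the safety goals of Step 1 are often expressed as a set of unsafe plant states, one convenient representation is a named predicate per unsafe state, to which later steps and the validation plan can refer. The sketch below is an illustration only, with hypothetical state names and limits.

```python
# Step 1 output (hypothetical): each safety goal named and expressed as a predicate
# over the plant state, true exactly when the plant is in the unsafe state to avoid.
UNSAFE_STATES = {
    "drum level too low": lambda plant: plant["drum_level_m"] < 0.5,
    "turbine overspeed":  lambda plant: plant["turbine_rpm"] > 3300,
}

def violated_goals(plant: dict) -> list[str]:
    """Names of the safety goals violated by a given plant state."""
    return [name for name, unsafe in UNSAFE_STATES.items() if unsafe(plant)]

print(violated_goals({"drum_level_m": 0.4, "turbine_rpm": 2950}))  # ['drum level too low']
```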
Step 2: The initial functional specification should be examined closely with respect to the safety goals established in Step 1. The potentially safety critical parts of the target system should be highlighted in order to identify where specific safety measures should be designed into the target system. The result of this step is a specification of the special features which should be designed into the target system in order to enhance safety. These should be added to complete the functional specification.

The third and fourth steps concern the design of the target system and its verification.

Step 3: The target system is designed on the basis of the functional specifications and the description of the interfaces to the plant, as well as the specification of safety enhancing features made in Step 2 and the specific design constraints determined in Step 1. During this activity one should also utilise general principles and techniques for safe design, which should have been identified before design started. The result of this step will be the target system design document.

Step 4: The design documents are verified with respect to the functional specification, and it is also confirmed that the specific safety relevant requirements and recommendations developed in the first three steps have been followed. (Not all the inputs to this step are shown in the figure.) Information on any required corrections is returned to Step 3. After this step, the final design document is complete.

Needless to say, a high quality of project management and quality control is needed in carrying out these four steps if the potential safety gain is to be realised. The verification procedure in the final step can only proceed properly if full records exist of the progress through the first three steps. The choice of suitable techniques for these steps and the application of the chosen ones are not trivial, although there is not the opportunity here to go deeply into these matters. Details of the techniques may be found in the Safety Assessment and Design of Industrial Computer Systems Techniques Directory, another of the recent EWICS TC7 documents (EWICS87b). This contains a standardised description of each technique and its application, an assessment of its potential and references to more detailed literature about it. Suffice it to say here that formalised approaches are being increasingly recognised as valuable if not crucial to the attainment and demonstration of safety and that automated record keeping and documentation control ease considerably the drudgery of the verification stage.

CONCLUSION

Philosophy is little comfort after an accident and the practical implementation of the scheme discussed still stretches current technology. Each of the four design steps can be further decomposed, as indeed they are in the full guidelines (EWICS87a), and suitable techniques applied to assist in collecting and organising the mass of information necessary for a safe and validated system. Much is now known about the power of the techniques currently available for safe design, although the details have not been discussed here. Some, such as FMEA, have traditionally been limited to hardware only but now some cautious steps are being taken in their application to software. Mathematically formal techniques for specification, design and validation are also becoming available in an industrial context. Their use, particularly in regard to safety critical systems, is being strongly advocated and rightly so. This is not to say that all the problems have yet been solved and that computer system safety is a routine matter. Nevertheless, the approach discussed in this paper forms a sound basis for the design and implementation of safety related systems.

ACKNOWLEDGEMENT

The two EWICS TC7 documents referred to and this whole approach to safe design were the results of many colleagues working in committee. However, special mention should be made of G. Dahll, J. Gorski and U. Kammerer, all of whose contributions were particularly notable and some of whose text has been included here.

REFERENCES

EWICS87a: EWICS TC7, "Guidelines to Design Computer Systems for Safety", December 1987.

EWICS87b: EWICS TC7, "Safety Assessment and Design of Industrial Computer Systems: Techniques Directory", November 1987.
[Figure: An overview of the main steps. The ovals represent the four main steps of the guidelines, and the rectangles represent data and information. A rectangle with an arrow into it and an arrow out represents information or data which is produced by one step and used by the next. A rectangle with only one arrow out represents data one may assume already exists, but that must be gathered during the step.]