Safety Science 40 (2002) 813–833 www.elsevier.com/locate/ssci
The dynamic flowgraph methodology as a safety analysis tool: programmable electronic system design and verification

Michel Houtermans (a), George Apostolakis (b,*), Aarnout Brombacher (c), Dimitrios Karydas (d)

(a) Automation, Software and Electronics, TUV Product Service, IQSE, 5 Cherry Hill Drive, Danvers, MA 01923, USA
(b) Massachusetts Institute of Technology, Nuclear Engineering, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
(c) Eindhoven University of Technology, Faculty of Mechanical Engineering, Reliability of Mechanical Equipment, PO Box 513, 5600 MB Eindhoven, The Netherlands
(d) Factory Mutual Engineering, 1151 Boston-Providence Turnpike, Norwood, MA 02062, USA

* Corresponding author.
Abstract

The objective of this paper is to demonstrate the use of the Dynamic Flowgraph Methodology (DFM) during the design and verification of programmable electronic safety-related systems. The safety system consists of hardware as well as software. This paper explains and demonstrates the use of DFM, and how DFM can be used to verify the hardware and application software design. DFM is used not only to analyze newly developed software but also to verify existing software. The outcome of the design verification of the safety system is used to define the necessary diagnostic capabilities that are essential to guarantee the correct functioning of the safety functions. © 2002 Elsevier Science Ltd. All rights reserved.
1. Introduction It is generally recognized that safety is a property of the total system (Leveson, 1995). Addressing hardware and software safety issues independently assumes the correct behavior of software while analyzing hardware and vice versa. This is seldom the case. An industrial system can only be safe if all the individual elements interact with each other in a safe manner. Therefore, safety needs to be addressed in a way
that considers both the individual elements and their interaction within the context of the total system.

Currently, different safety techniques exist in industry. These techniques have been developed for different purposes or to address different aspects of a system, e.g. hardware, software, or process analysis. For example, a hazard and operability (HAZOP; Redmill et al., 1999) study is a technique for identifying and analyzing hazards and operational concerns of a system. As the name reveals, it focuses not only on safety but also on operational issues. Redmill et al. (1999) cite examples where the HAZOP technique is used to address software. A failure modes and effects analysis (FMEA; US MIL-STD-1620) is another widely used safety analysis technique. The FMEA is a bottom-up, or inductive, procedure: component failure modes and their effects or consequences are evaluated at the system level. The goal of fault tree analysis (FTA; Henley and Kumamoto, 1992) is to identify the basic events that eventually lead to an undesired top event. FTA can be used to investigate causes of hazards or to identify hazards. FTA is a deductive analysis, i.e. it follows a top-down approach, in contrast to FMEA. Fault trees have also been used in software safety analysis applications (Leveson, 1995). New safety standards, e.g. IEC 61508 (1999), require quantitative analysis of industrial processes, justification of the necessary safety functions, and demonstration of the performance of the corresponding systems. Parts-count analysis (Brombacher, 1992), reliability block diagrams, FTA, and Markov analysis (Henley and Kumamoto, 1992) are some of the techniques used to carry out probabilistic calculations for safety systems.

All methods listed above focus on physical equipment and/or processes. Modern plants are heavily dependent on programmable electronic systems, in which software plays an extremely important role, particularly in terms of safety. Therefore, the basic safety management issues must be extended beyond hardware and physical process or production equipment to cover software as well. The traditional techniques used to analyze software in terms of system safety include an evaluation of the software development process, software requirement specification analysis, software criticality analysis, software module and system testing, and software fault injection testing (Faller, 1998). New developments in software safety analysis recommend not only examining software in isolation but also addressing it in the context of its operating environment. This is known as the error-forcing context of software (Garrett and Apostolakis, 1998). Only if the error-forcing context is understood is it possible to address the appropriate behavior of the software regarding safety. Software hazard analysis cannot be separated from system hazard analysis, since the latter generates invaluable and indispensable input information for the former.

The objective of this paper is to demonstrate the use of a relatively new safety analysis technique, called the Dynamic Flowgraph Methodology (DFM), during the design and verification phases of programmable electronic (PE) safety-related systems. This methodology is used to model the error-forcing context of the application software used in a programmable electronic safety system. With DFM it is possible to collect information about software hazards and how they can affect the operation
of the safety system. This paper gives a brief overview of the state of the art in programmable electronic system assurance, an introduction to the DFM technique, and a practical example that demonstrates its usage in addressing the safety issues of programmable electronic safety systems. The purpose of this paper is to demonstrate the capabilities of DFM; the reader interested in the technical details behind the DFM concept is referred to the references listed in the text and the references therein.
2. Programmable electronic system software assurance

The DFM technique has been introduced in this paper because it can address hardware, software, and the interaction between hardware and software. The main issue in the safety assurance of programmable electronic systems lies in the software aspects: it is very difficult to verify the correct behavior of software with conventional safety analysis tools. Currently, the assurance of programmable electronic safety-related software is not handled much differently from that of any other type of software for real-time applications (such as communications software). Three principal types of software assurance philosophies can be recognized in the published literature; they are briefly described and discussed below. For the interested reader a more detailed overview is given in Yau et al. (1995).

2.1. Testing and software reliability

Assurance by testing is the most common approach to software assurance. Testing is often performed by feeding random inputs into the software and observing the produced output to discover incorrect behavior. Software reliability models have been proposed to aid the testing strategies (Goel, 1985), although the applicability to software of reliability models extrapolated from the hardware reliability realm is seriously questioned, even from within the software reliability community itself (Littlewood and Miller, 1990).

2.2. Formal verification

Formal verification is another approach to software assurance; it applies logic and mathematical theorems to prove that certain abstract representations of software, in the form of logic statements and assertions, are consistent with the specifications expressing the desired software behavior. Recent work has been directed at developing varieties of this type of technique specifically for the handling of timing and concurrency problems (Razouk and Gorlick, 1989). However, the abstract nature of the formalism adopted in formal verification makes this approach rather difficult to use properly by practitioners with non-specialized mathematical backgrounds. Furthermore, the issue of modeling and representation of hardware/software interaction (let alone the human element), which is an important issue in programmable electronic system assurance analysis, does not appear to have surfaced
as one of the current objectives of formal verification research. Recent developments have extended fault tree analysis with a formal notation that allows the specification of temporal and causal relationships within a system (Górsky and Wardziński, 1995). These formal notations make it possible to address timing aspects and help specify software requirements. DFM, as will be seen later, produces dynamic fault trees that can be derived automatically for any time and any system top condition of interest. Another approach to more detailed process models has been presented by Halbwachs et al. (1993), using a synchronous dataflow language as a specification language.

2.3. State machine methods

The third type of approach to software assurance is one that analyzes the timing and logic characteristics of software executions by means of discrete state simulation models commonly referred to as state machines, such as queue networks and Petri nets (Leveson and Stolzy, 1987). Simulated executions are analyzed to discover undesirable execution paths. Although this approach can be extended to model combined hardware/software behavior (since the hardware behavior can in principle be approximated in terms of transitions within a set of predefined discrete states), difficulties arise from the "march-forward" nature (in time and causality) of this type of analysis, which forces the analyst to assume knowledge of the initial conditions from which a system simulation can be started. In large systems, many combinations of initial states may exist and the solution space may become unmanageable. A different approach, which reverses the search logic by using fault trees to trace backward from undesirable outcomes to possible cause conditions, offers an interesting solution to this problem, but encounters difficulties due to limitations in its ability to represent dynamic effects, and due to the fact that a separate model needs to be constructed for each software state whose initiating causes are to be identified (Leveson and Harvey, 1983).
3. Dynamic flowgraph methodology

The dynamic flowgraph methodology arose from the logic flowgraph methodology (LFM), which was developed in the early 1980s (Guarro and Okrent, 1984). In the beginning, LFM was mainly used for applications in the nuclear industry (Guarro and Okrent, 1984; Guarro et al., 1992; Guarro, 1998). Later, other application areas were investigated, i.e. aerospace systems (Ting, 1990) and aerospace embedded systems (Guarro et al., 1991). Over the years, LFM proved to be effective as the underlying methodology in process failure diagnosis and decision support systems. Research and further efforts to improve the LFM concept resulted in a modeling and analysis approach eventually named DFM (Garrett et al., 1995; Yau et al., 1995). The major improvement that DFM added to the LFM concept was the capability to analyze time-dependent aspects of a system, in addition to logical behavior elements.
DFM is very general and can model the logical and dynamic behavior of complex systems, including such elements as hardware, software, and human actions. DFM models the relationships between important process parameters arising from the cause-and-effect and timing functions inherent to the system. If the system is a programmable electronic system (PES), i.e. a system where mechanical devices and physical parameters are controlled, managed, and operated by software, then the physical system as well as the software controlling it can be taken into account by the DFM system model. As such, DFM is a useful methodology for the analysis and test of hardware- and/or software-related systems (Garrett et al., 1995; Milici et al., 1998; Yau et al., 1998). DFM is a relatively recent development, but, as we demonstrate below, it incorporates the basic aspects of pre-existing techniques, e.g. FTA, FMEA, and HAZOP, into one modeling and automated analysis approach.

3.1. The DFM model

A DFM model of a system is a directed graph (digraph) representing the logical and dynamic behavior of the system in terms of important system parameters (physical, software, human interaction, or any other parameter). A digraph model explicitly identifies the cause-and-effect and timing relationships that exist between the key parameters and system states that describe the system behavior (Garrett et al., 1995). The DFM system model extends a normal digraph model in that it integrates three networks: a "causality network", a "conditioning network", and a "time-transition network". DFM uses a set of basic modeling elements to represent the system parameters and their relationships in terms of these three networks. The possible modeling elements are (Fig. 1):

1. variable and condition nodes;
2. causality and condition edges; and
3. transfer and transition boxes and their associated decision tables.

The nodes of the DFM system model represent important system components, parameters, or variables and their normal and abnormal functional conditions. Nodes are discretized into a finite number of states that represent the parameter. This discretization can represent much more than success/failure or on/off situations; it can represent, for example, a temperature range, or possible representative values of a software variable, e.g. an integer or a real variable.

Fig. 1. DFM modeling elements.
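To make the discretization of variable nodes concrete, the following minimal sketch (Python; names and thresholds are illustrative assumptions, not the DFM tool's actual data structures) maps a continuous process parameter onto a finite set of node states:

    from dataclasses import dataclass

    @dataclass
    class VariableNode:
        name: str
        states: tuple  # finite, ordered set of discrete states

        def discretize(self, value, thresholds):
            # thresholds[i] is the upper bound of states[i]; a value above
            # every threshold falls into the last (highest) state
            for state, upper in zip(self.states, thresholds):
                if value <= upper:
                    return state
            return self.states[-1]

    # Hypothetical example: a tank temperature node with three states.
    temp = VariableNode("T_TANK", ("LOW", "OK", "TOO_HIGH"))
    print(temp.discretize(37.2, thresholds=(10.0, 35.0)))  # -> "TOO_HIGH"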
The transfer and transition boxes represent the relationships that exist between parameters. Each box has an associated decision table that provides a multi-state representation of the cause-and-effect and timing relationships that can exist among the connected parameters. The two kinds of edges are only used to visually represent the kind of relationship that exists between parameters, i.e. a cause-and-effect or a conditioning relationship.

3.2. Modeling and analyses

The DFM model is created independently of the analyses of interest. This means that it is possible to carry out a number of analyses. The DFM model contains every system feature that determines the possible desirable and undesirable behavior of the system. This is one of the key features that distinguish DFM from other methods, which usually focus on modeling the undesirable behavior only. In the analysis phase, two different approaches can be utilized: deductive analysis and inductive analysis. Event sequences can be traced backward from effects to causes, or forward from causes to effects. The DFM analysis engine contains sophisticated software algorithms that have been developed to support automated deductive and inductive analysis. The ability to combine inductive and deductive analysis in one methodology makes DFM a very powerful tool for analyzing systems within the context of design verification, failure analysis, and/or automatic test sequence and vector generation.

3.3. Deductive analysis

A deductive analysis of a DFM system model starts with the identification of a particular system condition of interest, depending on the objective of the analysis. This system condition, or top event, can contain system states that represent failure, success, or a combination of both. The objective of a deductive analysis is to find the root causes of the top event of interest, just as traditional fault tree analysis identifies the basic events causing a predefined top event. A DFM top event is expressed in terms of the state(s) of one or more process variable nodes at a particular time. To find the root causes of the top event, the model is analyzed by backtracking through the network of nodes, edges, and transfer and transition boxes of the DFM system model, using a specially developed analytical software algorithm. This automated backtracking algorithm works backward through the cause-and-effect flow to identify the paths and conditions by which the top event may originate. Because of this algorithm, the analysis identifies the causes or conditions at the root of the top event. When deductive analysis is used, DFM generates prime implicants (Yau et al., 1998). A prime implicant consists of a set of variables of interest that are in a certain state at a certain time and that together cause a predetermined system state. Prime implicants are similar to the minimal cut sets of fault tree analysis; they differ in that they deal with multi-valued variables, in contrast to the binary variables of fault tree analysis.
Each variable state is also associated with a time, stating when this prime implicant variable needs to be in this state. Each variable with its properties is called a literal. Fig. 2 depicts an example of a prime implicant. There are three properties of interest for each literal: the time step ("-3"), the variable name ("CNF"), and the state of the variable ("High", meaning "Stuck High"). The "AND" at the end of the line represents the Boolean relationship that exists between the literals: each literal needs to be true in order for the prime implicant to be true and thus for the top event to happen. The number of prime implicants depends on the complexity, size, and required level of detail of the system to be modeled. Because of the time dependency, the number of literals is also strongly correlated with the total analysis time. As can be seen from this example, a prime implicant can contain a wealth of information. Prime implicant #XYZ is a mixture of software signals, hardware states, and process parameters. This makes it much more valuable than the results of other existing reliability or safety methodologies such as FMEA or FTA. The prime implicants are a very useful resource for risk management and should be explored for that purpose.

Fig. 2. Example prime implicant.
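The structure of such a result can be illustrated with a minimal sketch, assuming a simple trace representation (Python; the names and the trace format are illustrative assumptions, not the DFM tool's actual interface). A prime implicant is a conjunction of literals, each fixing a variable to a state at a time step, and it can be checked against a recorded system trace:

    from typing import NamedTuple

    class Literal(NamedTuple):
        time: int      # time step at which the state must hold (negative = steps before time 0)
        variable: str  # node name, e.g. "CNF"
        state: str     # required discrete state, e.g. "High" (stuck high)

    # Conjunction of literals corresponding to Fig. 2 (illustrative values only)
    prime_implicant = [Literal(-3, "CNF", "High")]

    def implicant_holds(implicant, trace):
        # AND semantics: every literal must hold in the trace,
        # where trace maps (time, variable) -> observed state
        return all(trace.get((lit.time, lit.variable)) == lit.state
                   for lit in implicant)

    print(implicant_holds(prime_implicant, {(-3, "CNF"): "High"}))  # -> True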
3.4. Inductive analysis

Inductive analysis follows a bottom-up approach: it introduces a set of component states and analyzes how this particular set propagates through the system and what its effect will be on a system state of interest. Inductive analysis follows the principles of FMEA and is useful for examining the consequences of hazards at the system level. Once an initial set of conditions is defined, inductive analysis is used to trace forward the events that can occur from this starting condition, as the sketch below illustrates.
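A minimal sketch of such a forward trace, assuming transfer boxes reduced to simple decision-table functions (Python; all names are hypothetical and the real DFM engine is considerably more elaborate):

    def propagate(initial, transfer_functions, steps):
        # Inductive (forward) trace: starting from a dict of node states,
        # apply each transfer box's decision-table function once per time step.
        history = [dict(initial)]
        current = dict(initial)
        for _ in range(steps):
            nxt = dict(current)
            for node, fn in transfer_functions.items():
                nxt[node] = fn(current)
            current = nxt
            history.append(current)
        return history

    # Hypothetical two-node chain: a sensed tank temperature feeds a trip decision.
    fns = {"TRIP": lambda s: "OPEN" if s["T_TANK"] == "TOO_HIGH" else "CLOSE"}
    for t, states in enumerate(propagate({"T_TANK": "TOO_HIGH", "TRIP": "CLOSE"}, fns, 2)):
        print(t, states)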
The actual algorithms used to carry out this task are described in Yau et al. (1998). The initial and boundary conditions can be defined to represent desired and undesired states. Starting from a combination of desired states, an inductive analysis can be used to verify whether the system meets its design requirements. Starting with undesired states, inductive analysis can be used to verify the safety behavior of the system.

3.5. Conclusions on DFM

DFM is introduced and used in this work because it has certain advantages over conventional safety and reliability methods such as FMEA, FTA, ETA, and HAZOP. Although DFM is a fairly new methodology, it integrates the capabilities of the previously mentioned techniques in one approach. DFM has the capability to backtrack through a system or to trace forward through it. This allows us to find root causes for specified events of interest, or to see the effect of conditions or events on an output parameter of interest. DFM automates this process, which makes it possible to investigate thousands of combinations of events and thus to analyze systems that would otherwise be too complex for a human analyst to comprehend or to analyze.
4. Overview of the safety system

To demonstrate the capabilities of DFM as a safety and, particularly, software safety analysis tool, a programmable electronic safety system has been modeled (Fig. 3).

Fig. 3. Logic solver (excluding field devices and related equipment).

To prevent accidents, the safety system carries out specific safety functions. It reads inputs from the field (process parameters) via sensors. The programmable electronic logic solver uses these inputs to execute the application software designed
by the process safety engineers. If necessary, the application logic will initiate the actuation of field devices (e.g. valves) by generating the respective output values. The main safety function executed by this safety system is to monitor the temperature in a tank. If the temperature goes above a specified limit, the content of the tank is dumped into a drowning tank by opening a drain valve. The safety system has to switch a diverter to the drowning tank before dumping the material, as the drain is also connected to the next step in the batch process. The safety system consists of three distinct modules: the input module, the main processor module, and the output module. The input module has several input channels that read signals from the field. The input module communicates with the main processor module via bus communication. The main processor module carries out the safety logic using application software. The operator can upload the application software into the main processor module via human-machine interface software. The main processor communicates with the output module via bus communication. The output module has several output channels that actuate field devices. The safety function carried out by the safety system requires three input signals and one output signal. The required signals are listed in Table 1. One temperature sensor signals the temperature in the tank. The only safety-function output signal goes to the valve that controls the hydraulic pressure supply to the drain valve.
Table 1
Description of safety system input and output signals

Signal   Description                                                  Possible states
LUI1     Logic Unit Input signal 1: temperature switch in the tank   [Ok, Too high]
LUI2     Logic Unit Input signal 2: diverter drowning tank switch    [Drowning Tank, Off]
LUI3     Logic Unit Input signal 3: diverter filter switch           [Filter, Off]
LUO1     Logic Unit Output signal 1: drain valve                     [Open, Close]

5. DFM model of the safety system

The DFM model of the safety system is shown in Fig. 4.

Fig. 4. DFM model of the safety system (hardware and software).

The objective of the safety system is to collect information from the field, interpret this information, and decide what action to feed back to the field. The model is created taking into account the required or necessary information flow, and not only the possible failure behavior of the safety system. The model is divided into functional blocks that represent the flow of information. In the upper left corner the model starts with a representation of the Input Channels. The Input Channels are directly connected to the Common Input Circuitry, which communicates through Bus Communication with the Controller. The Controller itself consists of Application Software, ROM and RAM memory, a CPU, and a Clock device. The Controller communicates via Bus Communication with the Common Output Circuitry. The Common Output Circuitry is connected to the Output Channel that passes the signal back to the field. The Controller also interfaces with the BPCS and an Operator that can change
software values in the RAM memory. The correct execution of the Application Software depends on three values it receives from the field, on set points stored in the RAM memory, and on the correct execution of the application logic itself. The application logic is modeled with DFM using the modeling elements presented in Fig. 1.

The undesired behavior of the system has been modeled taking into account the failure mode requirements of the IEC 61508 standard (IEC 61508, 1999) (Table 2). The level of detail in this model has been chosen such that it reflects identifiable functional blocks that, if they fail, will cause the loss of the function block. Whether this loss will lead to a system failure can be determined using the DFM analysis. This model is at a high level and does not address the lowest possible individual component failures, but at first instance it allows the verification of the system structure for safety issues before the design is worked out and verified in detail.

The DFM model includes the complete functional behavior of the safety system, including the interaction between hardware and software. The application software design is also modeled by DFM. In addition, the interaction with the BPCS and the Operator is modeled in a simplistic way. It is assumed that the BPCS can fail and exchange wrong information. The operator can change the critical values and thus enter wrong values, for example setting the limit higher than the supposed 35 °C (High-High).

This model shows one of the capabilities of DFM that is worth pointing out: it contains hardware as well as software and is capable of modeling the interaction between them. As this model is eventually integrated with the DFM model of the process, the model incorporates the interaction between the physical field parameter and the representation of this parameter in the software. The application software is thus modeled in the context of its operating environment.

An example of one of the software routines is the routine that compares the measured temperature in the tank with the limit programmed by the operator and then decides whether the temperature is above the programmed limit or not. The software routine looks as follows:

IF (SLUI1 < STL) THEN ST="OK" ELSE ST="TOO_HIGH"

The SLUI1 variable represents a temperature that is either OK or TOO_HIGH. The variable STL is the temperature limit programmed by the operator and can be at the required LIMIT or, if the operator made a mistake, at a HIGHER_LIMIT. A lower limit is not assumed to be a failure, since it would easily be found during testing or normal operation: the process would shut down even though the high temperature limit had not been reached. The software routine sets the output variable ST, which has only two possible states, OK or TOO_HIGH. The software routine can only be executed if the controller (CCNT), the clock (CCLK), and the ROM memory (CROM) function normally. It is assumed that any failure of these components will lead to
Table 2
Overview of typical failure modes (IEC 61508, 1999)

Component                                    Failure modes to consider
CPU
  Register, internal RAM                     DC model for data and addresses; dynamic cross-over for memory cells; no, wrong or multiple addressing
  Coding and execution incl. flag register   No definite failure assumption
  Address calculation                        No definite failure assumption
  Program counter, stack pointer             DC model
Bus
  General                                    Time out
  Memory management unit                     Wrong address decoding
  Direct memory access                       All faults which affect data in the memory; wrong data or addresses; wrong access time
  Bus-arbitration                            No, continuous, or wrong arbitration
Interrupt handling                           No or continuous interrupts; cross-over of interrupts
Clock (quartz)                               Sub- or super-harmonic
Invariable memory                            All faults which affect data in the memory
Variable memory                              DC model for data and addresses; dynamic cross-over for memory cells; no, wrong or multiple addressing
Discrete hardware
  Digital I/O                                DC model; drift and oscillation
  Analogue I/O                               DC model; drift and oscillation
  Power supply                               DC model; drift and oscillation
Communication and mass storage               All faults which affect data in the memory; wrong data or addresses; wrong transmission time; wrong transmission sequence
Electromechanical devices                    Does not energize or de-energize; individual contacts welded, no positive guidance of contacts; no positive opening
Sensors                                      DC model; drift and oscillation
the OK state for the ST variable, which is the worst-case assumption, because the safety function then basically assumes that the temperature is low enough while this might not be the case. The actual relationship between the parameters SLUI1, STL, ST, CCNT, CCLK, and CROM is represented by a mapping of the possible combinations of states in a decision table (Table 3).
Table 3
Decision table (a)

INPUT                                                                 OUTPUT
SLUI1      STL           CROM           CCLK            CCNT          ST
OK         LIMIT         NORMAL         NORMAL          NORMAL        OK
OK         HIGHER_LIMIT  NORMAL         NORMAL          NORMAL        OK
TOO_HIGH   LIMIT         NORMAL         NORMAL          NORMAL        TOO_HIGH
TOO_HIGH   HIGHER_LIMIT  NORMAL         NORMAL          NORMAL        OK
- (b)      -             -              -               NO EXEC       OK
-          -             -              -               WRONG CODING  OK
-          -             -              -               STUCK DATA    OK
-          -             -              -               STUCK ADDRESS OK
-          -             -              SUB HARMONIC    -             OK
-          -             -              SUPER HARMONIC  -             OK
-          -             STUCK DATA     -               -             OK
-          -             STUCK ADDRESS  -               -             OK

(a) SLUI1 = Software variable Logic Unit Input 1; STL = Software variable Temperature Limit; CROM = Condition ROM memory; CCLK = Condition Clock; CCNT = Condition Controller; ST = Software variable Temperature.
(b) The "-" represents "no matter what the state is"; the other input variables, and not this one, determine the state of the output variable(s).
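Table 3 translates directly into an executable lookup. The following sketch (Python; an illustration of the table's worst-case logic, not the DFM tool's internal format) encodes the table with "-" as a don't-care entry:

    # Table 3 as executable data: "-" matches any state.
    ROWS = [
        # (SLUI1,     STL,            CROM,            CCLK,             CCNT,            ST)
        ("OK",       "LIMIT",        "NORMAL",        "NORMAL",         "NORMAL",        "OK"),
        ("OK",       "HIGHER_LIMIT", "NORMAL",        "NORMAL",         "NORMAL",        "OK"),
        ("TOO_HIGH", "LIMIT",        "NORMAL",        "NORMAL",         "NORMAL",        "TOO_HIGH"),
        ("TOO_HIGH", "HIGHER_LIMIT", "NORMAL",        "NORMAL",         "NORMAL",        "OK"),
        ("-",        "-",            "-",             "-",              "NO EXEC",       "OK"),
        ("-",        "-",            "-",             "-",              "WRONG CODING",  "OK"),
        ("-",        "-",            "-",             "-",              "STUCK DATA",    "OK"),
        ("-",        "-",            "-",             "-",              "STUCK ADDRESS", "OK"),
        ("-",        "-",            "-",             "SUB HARMONIC",   "-",             "OK"),
        ("-",        "-",            "-",             "SUPER HARMONIC", "-",             "OK"),
        ("-",        "-",            "STUCK DATA",    "-",              "-",             "OK"),
        ("-",        "-",            "STUCK ADDRESS", "-",              "-",             "OK"),
    ]

    def lookup(slui1, stl, crom, cclk, ccnt):
        # Return ST for the first row whose entries match ("-" matches anything).
        inputs = (slui1, stl, crom, cclk, ccnt)
        for *pattern, st in ROWS:
            if all(p in ("-", v) for p, v in zip(pattern, inputs)):
                return st
        raise ValueError("no matching row")

    print(lookup("TOO_HIGH", "LIMIT", "NORMAL", "NORMAL", "NORMAL"))  # -> TOO_HIGH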
Even though the example only focuses on the software routine, it needs to be pointed out that the temperature representation in the software depends on the actual mixing process that takes place in the nitrator tank. The complete process is modeled, and the actual value of the software variable SLUI1 depends on the process that takes place and on the equipment involved in measuring and communicating the value up to the point where the software variable SLUI1 is generated. A lot of equipment must operate correctly before the software value is set. When it comes to hardware, this value depends on the temperature sensor, the temperature transmitter, the input channel, the common circuitry, the bus communication, and any other equipment that makes the loop work. The same considerations apply to the variable STL. This software variable depends not only on the correct operation of hardware, but also on the value programmed by the operator. The operator must be extra careful with this variable, as it is safety critical. Special procedures should exist for setting and programming this variable.
6. Analyzing the safety system

Once the DFM model is created, the analysis starts with the definition of the top event. A top event of interest would be to analyze what would prevent the safety system from opening the drain valve (which is necessary, for example, when the temperature in the tank is too high). The top event can be specified as follows:

Top Event #4
At time 0, LUO1=Closed (Valve closed)

This top event results in 1190 prime implicants (1) with a total of 15 440 literals, if the deductive analysis traces back 14 time steps. Out of these 1190 prime implicants, there are 83 prime implicants with a single condition and 654 prime implicants where two conditions need to occur. The remaining prime implicants consist of three or four conditions. The resulting prime implicants can now be analyzed.

The following prime implicant is found by applying an importance measure that prioritizes prime implicants with only one literal. The prime implicant represents a stuck-at-high output channel. This is a dangerous fault: everything else in the safety system can be working correctly, but if the output is frozen nothing will happen in case of a process demand (2). In practice this means that diagnostic systems must be incorporated so that the operation of the output channel is verified (Section 8).

Prime Implicant #1105
At time -1, COC=High (Stuck high)

(1) The number of prime implicants can be enormous. More papers by the main authors will be published in the near future presenting newly developed importance measures that can be used to address this problem.
(2) It is the authors' experience that this is a frequent mistake made in safety system design.
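Such importance measures are, in essence, filters over the generated set of prime implicants. A minimal sketch, assuming the literal representation from the sketch in Section 3 (Python; the function names are hypothetical), of the single-literal filter used above and the single-point-of-failure filter used next:

    def single_literal(implicants):
        # implicants consisting of exactly one literal, such as #1105
        return [pi for pi in implicants if len(pi) == 1]

    def single_points_of_failure(implicants, is_failure_state):
        # implicants in which exactly one literal is a failure state and all
        # other literals describe normally operating components
        return [pi for pi in implicants
                if sum(1 for lit in pi if is_failure_state(lit.state)) == 1]

    # Hypothetical use: states other than "Normal" count as failures.
    # spofs = single_points_of_failure(all_implicants, lambda s: s != "Normal")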
It is also possible to apply an importance measure that filters for single points of failure regardless of the number of literals. A very interesting failure involving the basic process control system (BPCS) is found when the prime implicants are prioritized for single points of failure. In this way two prime implicants are found. The problem revealed by these prime implicants is that the correct functioning of the safety function actually depends on the correct operation of the BPCS; the safety system apparently does not work independently of the BPCS. After analyzing the condition represented in the prime implicants, it is found that the safety system can sense the position of the diverter, but it cannot switch the diverter to the drowning tank by itself, if necessary. For this functionality it relies on the BPCS. A design recommendation should be made to make the safety function operate independently of the BPCS (or to involve the BPCS as part of the safety system), as recommended in the safety protection layer philosophy (Section 1). The safety system should be able to detect the position of the diverter, but should also be able to switch the diverter by itself, if necessary.

Prime Implicant #94
At time 10, CIC=Normal (Operates normally) AND
At time 9, COP2=Normal (Operates normally) AND
At time 9, CCOMI=Normal (Operates normally) AND
At time 9, COP1=Normal (Operates normally) AND
At time 8, CBUS=Normal (Operates normally) AND
At time 8, CRAM=Normal (Operates normally) AND
At time 7, CROM=Normal (Operates normally) AND
At time 7, CCLK=Normal (Operates normally) AND
At time 7, CCNT=Normal (Operates normally) AND
At time 7, CBPCS=Not (Not functioning) AND
At time 5, CCLK=Normal (Operates normally) AND
At time 2, CCOMO=Normal (Operates normally) AND
At time 1, COC=Normal (Operates normally)

Prime Implicant #289
At time 10, CIC=Normal (Operates normally) AND
At time 9, COP2=Normal (Operates normally) AND
At time 9, CCOMI=Normal (Operates normally) AND
At time 9, COP1=Normal (Operates normally) AND
At time 8, CBUS=Normal (Operates normally) AND
At time 8, CRAM=Normal (Operates normally) AND
At time 7, CROM=Normal (Operates normally) AND
At time 7, CCLK=Normal (Operates normally) AND
At time 7, CCNT=Normal (Operates normally) AND
At time 6, CBPCS=Not (Not functioning) AND
At time 5, CCLK=Normal (Operates normally) AND
At time 2, CCOMO=Normal (Operates normally) AND
At time 1, COC=Normal (Operates normally)
The prime implicants have also been prioritized for possible human interaction, and the following interesting prime implicant was found. It demonstrates the strength of DFM and the importance measures. Prime implicant #29 shows that, when the operator sets a higher value for the temperature limit (COP1=Higher) and all other components are normal, the output signal will keep the drain valve closed.

Prime Implicant #29
At time 13, COP1=Higher (Higher value) AND
At time 12, CRAM=Normal (Operates normally) AND
At time 10, CIC=Normal (Operates normally) AND
At time 9, COP2=Normal (Operates normally) AND
At time 9, CCOMI=Normal (Operates normally) AND
At time 9, COP1=Normal (Operates normally) AND
At time 9, CROM=Normal (Operates normally) AND
At time 9, CCNT=Normal (Operates normally) AND
At time 8, CBUS=Normal (Operates normally) AND
At time 8, CRAM=Normal (Operates normally) AND
At time 7, CROM=Normal (Operates normally) AND
At time 7, CCLK=Normal (Operates normally) AND
At time 7, CCNT=Normal (Operates normally) AND
At time 5, CCLK=Normal (Operates normally) AND
At time 2, CCOMO=Normal (Operates normally) AND
At time 1, COC=Normal (Operates normally)
From a design point of view, it must be clear that this value is safety critical. This means that additional procedures must be in place to ensure that only knowledgeable and designated safety people can change these values. Modern programmable electronic safety systems should be able to designate software values as safety critical, which means that such values are password protected and cannot be overwritten without using a specified procedure for doing so.
7. Analyses of existing software

This section demonstrates how DFM can be used to find errors in existing software by specifying test cases. Suppose that the complete safety system, i.e. hardware and software, exists, and that one acting as an independent assessor would like to analyze whether the software works correctly. The software is modeled as a black box and DFM is used to model the environment of the software (Fig. 5).

Fig. 5. Testing existing software with DFM.

The safety function consists of several software routines, and one of these routines checks whether the temperature has exceeded its specified limit. On purpose, the error specified in Table 4 has been programmed into this particular software routine: the programmer used the ">" sign instead of the "<" sign.

Table 4
Example software error

Software routine
Correct:       IF T < 35 °C THEN Close Drain Valve
Design error:  IF T > 35 °C THEN Close Drain Valve

Inductive analysis is used to trace initial conditions forward and see how these conditions affect an end system state of interest. As we are only interested in the
software behavior, all conditional values are held constant at their normal behavior. This means that throughout the analysis they cannot fail; it is thus simulated as if the hardware is operating normally at all times. The starting conditions, represented as a test vector, are specified in Table 5. When this test vector is applied and the DFM model is traced forward, the safety system eventually results in a state for LUO1 that puts the drain valve in a closed position. This should not have happened, as the temperature is LUI1=High at all times. Since the hardware is operating normally, this can only indicate a failure in the software. The actual ">" programming error is not yet localized, but the fact that the software did not perform as intended is a significant discovery. One can now start more detailed analyses to localize where the failure resides, as sketched below.
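Such a black-box check can be sketched as follows (a minimal Python sketch, assuming the 35 °C limit of the application logic; the function names are hypothetical, and the DFM tool performs this comparison over the full model rather than over a single routine):

    LIMIT = 35.0  # degrees C, from the application logic

    def routine_correct(t):
        # keep the drain valve closed only while the temperature is below the limit
        return "CLOSE" if t < LIMIT else "OPEN"

    def routine_with_design_error(t):
        # the error of Table 4: ">" instead of "<"
        return "CLOSE" if t > LIMIT else "OPEN"

    # Test vector of Table 5: temperature high at all times, hardware normal,
    # so the required output is to open the drain valve.
    for t_step in range(3):
        observed = routine_with_design_error(40.0)  # LUI1 = High at all times
        if observed != "OPEN":
            print(f"time {t_step}: LUO1={observed}, expected Open -> software fault")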
8. Diagnostic systems

One of the reasons that the PES safety system market has grown so fast over the years is its excellent capability to incorporate on-line diagnostics within the safety system. The main role of the diagnostics is to verify that the PES is capable of performing the specified safety functions without interruption. The diagnostic functions monitor the performance of the PES within the context of its operating environment. In this example, only the logic solver has been modeled with DFM, and therefore only the necessary diagnostic systems for the logic solver can be determined.

The prime implicants contain valuable information that can be used to define the necessary diagnostic systems. The DFM model will produce all the necessary diagnostics if a model is first created without taking into account any diagnostic capabilities. The prime implicants can then serve as a checklist: every prime implicant that represents an unwanted situation can be used to define a diagnostic function. For example, prime implicant #21 reveals that a stuck-at failure in the common circuitry of the output module is a reason not to open the drain valve. The diagnostic system handling this kind of failure mode could be a feedback loop in combination with a software test vector that periodically tests the correct functionality of the common output circuitry; a sketch of such a test is given at the end of this section.

The diagnostic systems can also go beyond the safety system. DFM can identify certain software parameters as being critical. If an operator can change these parameters in the application program (like the temperature limit or the position of the diverter), then special procedures can be created that determine how, when, and by whom these variables can be changed.
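As an illustration of the feedback-loop diagnostic mentioned above, the following minimal sketch (Python; the write and read-back interface and all names are assumptions made for illustration) periodically drives a short test pattern onto the output channel and compares the commanded value with the value read back through an independent path:

    def output_channel_healthy(write_output, read_back, test_pattern=(0, 1, 0)):
        # Command each test value and read it back; a mismatch indicates a
        # stuck-at (or otherwise frozen) output channel.
        for commanded in test_pattern:
            write_output(commanded)
            if read_back() != commanded:
                return False
        return True

    # In practice the test pulses must be short enough not to actuate the
    # connected field device, and a detected fault must be annunciated
    # and/or force the process to its safe state.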
Table 5
Test vector

Starting condition   Variable state
For all times        LUI1=High (Temp high)
For all times        LUI3=Tank (Drowning tank)
For all times        LUI4=Off (Off)
For all times        ODP=Tank (Drowning tank)
For all times        OTL=35C (35 degrees C)
For all times        BPCS=DPTD (DP to DT)
For all times        CBPCS=Normal (Operates normally)
For all times        CBUS=Normal (Operates normally)
For all times        CCLK=Normal (Operates normally)
For all times        CCNT=Normal (Operates normally)
For all times        CCOMI=Normal (Operates normally)
For all times        CCOMO=Normal (Operates normally)
For all times        CIC=Normal (Operates normally)
For all times        CLUI1=Ok (Temp ok)
For all times        CLUI3=Tank (Drowning tank)
For all times        CLUI4=Off (Off)
For all times        CLUO1=Open (Open drain valve)
For all times        COC=Normal (Operates normally)
For all times        COP1=Normal (Operates normally)
For all times        COP2=Normal (Operates normally)
For all times        CRAM=Normal (Operates normally)
For all times        CROM=Normal (Operates normally)
At time 0            LUO1=Close (Close drain valve)
At time 0            ILUI1=Ok (Temp ok)
At time 0            ILUI3=Tank (Drowning tank)
At time 0            ILUI4=Off (Off)
At time 0            OLOU1=Close (Close drain valve)
At time 0            RDP1=Tank (Drowning tank)
At time 0            RDP2=Off (Off)
At time 0            RTL=Limit (Limit value)
9. Conclusion

DFM is introduced and used in this paper because it has major advantages over conventional safety and reliability techniques and methods. DFM incorporates the capabilities of techniques like FMEA, FTA, and HAZOP into one methodology. Only one DFM model is necessary to capture the complete dynamic behavior of a system. This model can be used to verify design requirements, to perform failure analysis, and to define test cases.

In this paper, DFM was used to model the hardware and software of a programmable electronic safety-related system. DFM was used as a deductive technique to automatically generate prime implicants. These prime implicants contain the necessary information (in terms of hardware and software states, failure and non-failure states, and human interaction) to carry out safety design verification. The content of the prime implicants makes it easy to find mistakes made in the software. The list
of prime implicants is used as a checklist to verify the required diagnostic capabilities of the programmable electronic system. DFM is also used to automatically generate test vectors that verify the effect on system states of interest. This is particularly useful for identifying the error-forcing context of the software. In practice, this means that the software behavior is verified for unexpected and unanticipated situations in the context of the total hardware and software safety system.
Acknowledgements

The authors would like to thank Dr. Michael Yau of ASCA Inc. for the use of the DFM software tool and for his support and advice during the modeling and analysis of the safety system.
References

Brombacher, A.C., 1992. Reliability by Design. John Wiley & Sons, Chichester.
Faller, R., 1998. Quality Assurance Procedures. Automation, Software and Electronics, IQSE, Public Version 2.0, Munich.
Garrett, C.J., Apostolakis, G.E., 1998. Context and software safety assessment. In: 2nd Workshop on Human Error, Safety, and System Development, Seattle, WA.
Garrett, C.J., Guarro, S.B., Apostolakis, G.E., 1995. The dynamic flowgraph methodology for assessing the dependability of embedded software systems. IEEE Transactions on Systems, Man and Cybernetics 25, 824-840.
Goel, A.L., 1985. Software reliability models: assumptions, limitations and applicability. IEEE Transactions on Software Engineering SE-11.
Górsky, A., Wardziński, 1995. Formalizing fault trees. In: Achievement and Assurance of Safety. Springer-Verlag, pp. 311-327.
Górsky, A., Wardziński, 1996. Deriving real-time requirements for software from safety analysis. In: Proceedings of the 8th EUROMICRO Workshop on Real-Time Systems. IEEE Computer Society Press.
Guarro, S.B., 1998. A logic flowgraph-based concept for decision support and management of nuclear plant operation. Reliability Engineering and System Safety 22, 313-330.
Guarro, S.B., Okrent, D., 1984. The logic flowgraph: a new approach to process failure modeling and diagnosis for disturbance analysis applications. Nuclear Technology 67, 348-359.
Guarro, S.B., Wu, J.S., Apostolakis, G.E., Yau, M., 1991. Embedded system reliability and safety analysis in the UCLA ESSAE project. In: Proceedings of the International Conference on Probabilistic Safety Assessment and Management (PSAM), Beverly Hills, CA, 4-7.
Guarro, S.B., Milici, T., Wu, J.-S., Apostolakis, G., 1992. Accident Management Advisor System (AMAS): a decision aid for interpreting instrument information and managing accident conditions in nuclear power plants. In: OECD/CSNI Specialists Meeting on Instrumentation to Manage Severe Accidents, Cologne, Germany, 16-17 March.
Halbwachs, N., Fernandez, J.-C., Bouajjani, A., 1993. An executable temporal logic to express safety properties and its connection with the language Lustre. In: Sixth International Symposium on Lucid and Intensional Programming (ISLIP'93), Quebec.
Henley, E.J., Kumamoto, H., 1992. Probabilistic Risk Assessment: Reliability Engineering, Design, and Analysis. IEEE Press, Piscataway, NJ.
IEC 61508, 1999. Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems, Parts 1-7. International Electrotechnical Commission.
Leveson, N.G., 1995. Safeware: System Safety and Computers. Addison-Wesley, Reading, MA.
Leveson, N.G., Harvey, P.R., 1983. Analyzing software safety. IEEE Transactions on Software Engineering 9.
Leveson, N.G., Stolzy, J.L., 1987. Safety analysis using Petri nets. IEEE Transactions on Software Engineering 13.
Littlewood, B., Miller, D., 1990. Preface: special issue on software reliability and safety. Reliability Engineering and System Safety 32.
Milici, A., Yau, M., Guarro, S., 1998. Software safety analysis of the Space Shuttle Main Engine control software. In: PSAM 4, New York.
Razouk, R.R., Gorlick, M.M., 1989. A real-time interval logic for reasoning about executions of real-time programs. ACM Software Engineering Notes 14.
Redmill, F., Chudleigh, M., Catmur, J., 1999. System Safety: HAZOP and Software HAZOP. Wiley.
Ting, Y.T.D., 1990. Space Nuclear Reactor System Diagnosis: A Knowledge-Based Approach. PhD thesis, UCLA.
US MIL-STD-1620. Failure Mode and Effect Analysis. National Technical Information Service, Virginia.
Yau, M., Guarro, S.B., Apostolakis, G.E., 1995. Demonstration of the Dynamic Flowgraph Methodology using the Titan II space launch vehicle digital flight control system. Reliability Engineering and System Safety 49, 335-353.
Yau, M., Apostolakis, G., Guarro, S., 1998. The use of prime implicants in dependability analysis of software controlled systems. Reliability Engineering and System Safety 62, 23-32.