SAFETY AND RELIABILITY: THEIR TERMS AND MODELS OF COMPLEX SYSTEMS
H. H. Frey
Quality Assurance Department, Div. E, Brown Boveri & Co. Ltd., CH-5401 Baden, Switzerland
Abstract. Some terms concerning reliable and safe industrial computer systems are qualitatively defined (European Purdue Workshop, TC Safety and Security). The fundamental model of reliability and safety of complex systems is defined by the set of system failure modes, and the basic difference between hardware and software failure behaviour is discussed. Furthermore, the characteristics and application of the model are illustrated and discussed using the example of a car in traffic, where utility versus cost is evaluated by the term "system effectiveness", the comprehensive figure of merit of complex systems.

Keywords. System safety, system reliability, reliability model, safety model, terms and definitions, hardware failures, software failures, system effectiveness.
INTRODUCTION
In order to deal with a new field in science or technology, it is necessary to build up a vocabulary of special terms to allow efficient, unequivocal communication. This was also the case for the field of computers and computer systems.

However, engineers dealing with the new field of reliable and safe industrial computer systems still find themselves in a rather awkward position when trying to make themselves clear in specifying, engineering and verifying these complex systems. This contribution tries to shed some more light on the models, terms and definitions related to the safety and reliability of industrial computer systems or complex systems. A "Glossary of Terms and Definitions", evolving from the intensive discussions and work of the "European Purdue Workshop" TC "Safety and Security" and its Sub-Group "Terms and Definitions", is presented in [12].

The qualitative definitions of the most important terms in the context of this contribution are [12]*:

Safety: Avoidance of, or protection against, danger to life and limb, the environment and property arising from failures.

Safe: Protection against all classes of threat, whether accidental or deliberate, to an installation or to the plant which it controls.

Security: Protection against all classes of threat, whether accidental or deliberate, to an installation or to the plant which it controls.

Secure: Adequately protected against all classes of threat.

Reliability: The ability of an item to perform a required function under stated conditions for a stated period of time.

Availability: The ability of an item to perform its intended function when required.

*Acknowledgement: At this point I would also like to thank all my colleagues in the European Purdue Workshop TC Safety and Security for all their work and support on Terms and Definitions.
Correct: In accordance with the intended specification.

Error: a) A human mistake or omission. b) The difference between the actual and the desired output of a control system or computer.

Failure: The termination of the ability of an item to perform its specified function.

Failure mode: The way an item or system fails.

Fault tolerance: The ability of a system to perform correctly in the presence of a limited number of faults.

Verification: Demonstrating the truth or correctness of a statement, system, etc., by special investigation or comparison of data.
THE BASIC MODEL OF SAFE AND RELIABLE SYSTEM PERFORMANCE

Mathematical model

The basic assumption is that the functioning of the system depends deterministically on the correct function or correctness of its components or elements (hardware, software, human factors, data, etc.). An obvious way to present the functional relationship between the system components and system operation is by a state space.

Let the "System" state space Z be defined by the set of all observable system states $x_j$, $j = 1, \ldots, m$, this set being partitioned into at least two sub-sets S (functional, reliable, safe states, etc.) and F (failure modes or unreliable, unsafe states, etc.); thus

$$S \cup F = Z \qquad (1)$$

$$S \cap F = \emptyset \quad \text{(disjoint)} \qquad (2)$$

Similar to the system, a state space is accordingly defined for each "subsystem" and "component" and also partitioned into at least two sub-sets, thus describing the functional hierarchy of the system (for the mathematical formulation see [1] - [3], [11]).

Furthermore, the system's environment, or the process it controls, must be described. That is, the "Environment" or "Process" is defined by a set of events, operating modes and their relationship to the "System". The relationship is given by the sequence of states in time of the System, which will exert a particular effect on the proper (safe, reliable, available, etc.) operation of the Environment or the Process. On the other hand, the operating mode and certain events occurring in the Environment independently of the System may disturb the correct performance of the System and be the cause of failures. Thus, only knowledge of the system states or the system failure modes, their effects on the correct operation of the Environment, and the environmental factors or events disturbing the correct system performance will be sufficient to define qualitatively the safety and reliability of the System in question.

For example, the "System" may be a safeguard in a nuclear power plant, where power plant operation is regarded as the "Environment" or "Process" (see also, e.g., [4]). This abstract formulation will be elaborated upon by an example in the next section.

THE SYSTEM "AUTOMOBILE" AND ITS ENVIRONMENT, AN EXAMPLE

Thinking of a car (= System), one may associate with it many terms concerning its environment and life cycle. These are, e.g.:

Table I: Terms associated with a car

- speed, gasoline, utility, cost, performance, safety, power, failure
- crash, accident, pedestrians, traffic, traffic lights, freeway, highway, day/night
- rush hour, regulations, winter/summer, tax, licence, repair, maintenance, check-up
- repair cost, garage, personnel, insurance, driver training, life expectation, running cost, maintenance cost
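As a minimal sketch of the partition in equations (1) and (2), the following Python fragment represents a state space Z by its two sub-sets S and F and rejects any partition in which the two overlap. The class name `StateSpace`, its field names and the example car states are assumptions made for illustration only; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateSpace:
    """State space Z partitioned into S (functional) and F (failure modes)."""

    functional: frozenset  # sub-set S
    failed: frozenset      # sub-set F

    def __post_init__(self):
        # Equation (2): S and F must be disjoint.
        overlap = self.functional & self.failed
        if overlap:
            raise ValueError(f"S and F overlap: {sorted(overlap)}")

    @property
    def all_states(self) -> frozenset:
        # Equation (1): Z is the union of S and F.
        return self.functional | self.failed

    def is_failure(self, state: str) -> bool:
        """Return True if the observed state belongs to F."""
        if state not in self.all_states:
            raise KeyError(f"unknown state: {state}")
        return state in self.failed


# Hypothetical states of the system "car".
car = StateSpace(
    functional=frozenset({"parked", "driving"}),
    failed=frozenset({"does not start", "brake fails"}),
)
```

The same class could be instantiated once per subsystem or component to mirror the functional hierarchy described in the text; how component states combine into system states is left open here.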
It is now necessary to order and classify these terms and to decide which characteristics, factors, events and conditions must be taken into consideration for the definition of safe and reliable car performance. As suggested in the preceding section, the failure modes of the car are first investigated and classified (see Table II).

Table II shows that we are able to define the main terms availability, reliability, safety and production quality, or rather their complements, once we know the car's failure modes. But why do we decide, for instance, that c) "brake fails" is an unsafe or dangerous state? It seems that behind this classification further assumptions concerning the "Environment" are made unconsciously: the speed may be high in comparison to the available run-out, third persons and other cars could be endangered, a traffic light may be ahead, etc. Hence, it is necessary to define the "Environment" explicitly, with its sequence of events with respect to time and its operating modes or phases, in order to define the safety of the "System" clearly.
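The dependence of the classification on the environment can be made explicit in a small sketch. The default ratings below follow Table II; the `Environment` fields and the rule that a dangerous failure only counts as "not safe" when the car is moving and something can be hit are assumptions chosen purely for illustration, not rules stated in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Rating(Enum):
    NOT_AVAILABLE = "not available"
    NOT_RELIABLE = "not reliable"
    NOT_SAFE = "not safe"
    BAD_PRODUCTION_QUALITY = "bad production quality"


@dataclass(frozen=True)
class Environment:
    """Hypothetical environment description; the fields are illustrative."""

    speed_kmh: float
    hazard_ahead: bool  # e.g. other cars, pedestrians or a traffic light


# Default ratings as listed in Table II (environment assumed implicitly).
TABLE_II = {
    "does not start": Rating.NOT_AVAILABLE,
    "motor fails": Rating.NOT_RELIABLE,
    "headlight fails": Rating.NOT_RELIABLE,
    "brake fails": Rating.NOT_SAFE,
    "loss of steering": Rating.NOT_SAFE,
    "early rusting": Rating.BAD_PRODUCTION_QUALITY,
}


def rate(failure_mode: str, env: Environment) -> Rating:
    """Rate a failure mode with the environment stated explicitly."""
    default = TABLE_II[failure_mode]
    if default is Rating.NOT_SAFE:
        # Assumed rule: the failure endangers someone only if the car is
        # moving and there is something in the environment that can be hit.
        endangered = env.speed_kmh > 0 and env.hazard_ahead
        return Rating.NOT_SAFE if endangered else Rating.NOT_RELIABLE
    return default
```

With this sketch, `rate("brake fails", Environment(100.0, True))` yields `Rating.NOT_SAFE`, while the same failure in a stationary car with no hazard is only rated `Rating.NOT_RELIABLE`. A real analysis would replace the assumed rule with the documented sequence of events and operating modes of the environment.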
Table II: Failure modes of a car

a) Not operable (does not start, in repair, in maintenance, etc.): NOT AVAILABLE
b) Fails during operation (motor fails, oil pump fails, headlight fails, etc.): NOT RELIABLE
c) Dangerous failure during operation (brake fails, loss of a wheel, loss of steering, etc.): NOT SAFE
d) Defects, minor failures (early rusting, loose interior, early rattling, etc.): BAD PRODUCTION QUALITY

Another question is why (Table II b) the "failure of one headlight" is classified as a system failure, since a car could also be operated at night with the one remaining, redundant headlight. The answer is that, if the law is also considered, a traffic regulation says that a car may not be operated with one headlight. The redundant headlight would, however, be enough for a safe stop. If one also distinguishes between the failure of the left and the right headlight, the failure of the left headlight may be considered an unsafe system failure if the approaching traffic is taken into account.

These examples also show how carefully the terms safety, reliability and availability must be dealt with in contracting and engineering in order to avoid misinterpretation and omissions, and hence design errors and performance failures [5] - [8], [10]. Ordering and classifying the terms of Table I will lead to the more generalized presentation (Fig. 1) of the complex system "car".

SYSTEM EFFECTIVENESS

The connection to the terms cost and overall utility considered over the whole life cycle (Table I) is made in Fig. 2. There the term "system effectiveness" is defined by the ratio of utility to cost referred to the total life cycle of a system [8] - [10]. The system effectiveness is thus the most general figure of merit to evaluate system performance, long-term operating behaviour and cost over time, considering all possible conditions. The term "operational effectiveness" is displayed in Fig. 3, together with its relationship to the specifications.

For the example of the car, a system analysis could result in the system effectiveness of an expensive, fast car being low compared to that of a small, slower car if, according to the definition, the utility of high speed is compared to procurement, operating and maintenance costs, considering a speed limit of 130 km/h or 55 mph (don't use a steam-hammer to crack a nut!).

Figure 1: Generalized presentation of the complex system "car".

Figure 2: System effectiveness, the ratio of utility to cost referred to the total life cycle of the system (operational effectiveness (utility), life cycle costs, environment, long-term operating behaviour).

Figure 3: Operational effectiveness and its relationship to the specifications.

DEFINING SAFETY AND RELIABILITY BY THE SET OF SYSTEM FAILURE MODES

Generalization of the example in Fig. 1 will lead to Fig. 4.

Figure 4: System and Environment (Process): effects adversely affecting reliability, effects adversely affecting safety, disturbances, threats.

Following the procedure of determining and classifying all the system failure modes in order to evaluate system safety and reliability leads first to the difficult task of determining all failure modes of each component (hardware, software, etc.). If the relationship between component
failure modes and system failure modes is established (system structure), each system failure mode is evaluated and rated as to its effect on the safe or reliable operation of the Environment or the Process the System controls (including multiple failures).

While both hardware and software perform a system function, the basic difference between the failure behaviour of hardware and software must be taken into consideration for the system specification, realization and verification.

Hardware

Hardware failures occur randomly in time, the failure distribution depending on the specific physical properties of each component, its strength and the particular stress applied, as given by its function and the environmental conditions. Hardware failures therefore occur in many modes. Due to some mechanisms, a hardware failure may trigger others simultaneously or in a certain sequence of failures spaced in time, e.g. [6], [13], [18]. Hardware components generally age or wear, especially under certain conditions, and may thus be the cause of common mode failures, e.g. [21], [22]. Once corrected, a hardware failure may occur again. Hardware performance is therefore improved by maintaining the components.

Software

Software errors (not failures [12]) are occurrences of a fundamental nature. They may be classified as concept, design or logical errors and coding or programming errors. Consequently, software errors are basically not of a stochastic nature. But the occurrence of software errors, triggered by randomly changing process states, operating conditions and modes, as well as by hardware failures, may cause them to appear to be stochastic. A software error, once corrected, will not occur any more. Hence software cannot be maintained in the sense of hardware, e.g. [18],
[16], [17].

As a result, the generalized model presented in Fig. 3 and 4 is in fact a convenient, simple guide for understanding and dealing with complex systems concerning safety and reliability aspects, and for defining terms in connection with safe and reliable industrial computer systems not defined elsewhere [12].

QUANTITATIVE EVALUATION OF SAFETY, RELIABILITY AND AVAILABILITY

For the quantitative evaluation of safety, reliability and availability, mathematical models are used, e.g. [1], [2], [5], [11]. For hardware, the basic data required are the probabilities of occurrence of component failure modes, component failure rates, frequencies of environmental events, etc. (e.g. [15], [19], [20]).

The quantitative evaluation of software is not yet very advanced. Several models are available, but statistical data concerning software errors are scarce, e.g. [23], [24], [25]. A comparison between reliability and safety characteristics is given in Table III [11].

CONCLUSIONS

The starting point of an engineering project is the system specification. In order to serve as a measure, the specifications ought to be a true description of what the system's function is to be and of how the system is expected to perform over a certain length of time, considering all environmental and human factors and conditions (Fig. 3). To assure "correct" performance ("according to specification" [12]), a validation and verification process must be imposed during system planning, design, production, final testing and the operating phase, i.e. over the whole life cycle.

In the context of the engineering of reliable and safe computer systems and the verification of correct performance, many questions already arise as to what the terms hardware and software performance, integrity, verification, etc. really mean and comprise. In order to support unequivocal, efficient communication and to prevent misunderstanding, misinterpretation and, finally, failures in system engineering and operation, the Glossary of Terms and Definitions [12] was worked out, and the model presented in this paper was developed to give concrete form to the reliability, safety and availability of complex systems.
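The figures of merit compared in Table III can be illustrated numerically. The sketch below assumes, purely for illustration, a constant failure rate so that R(t) = exp(-lam * t); the function names, the value of `lam`, the integration horizon and the up and down times are hypothetical and not taken from the paper. It approximates the mean time to failure as the integral of R(t), recovers the failure rate z(t) = -(1/R(t)) dR(t)/dt by a finite difference, and computes the steady-state availability from mean up time and mean down time.

```python
import math


def reliability(t: float, lam: float) -> float:
    """Assumed reliability function R(t) for a constant failure rate lam."""
    return math.exp(-lam * t)


def mean_time_to_failure(r, horizon: float, steps: int = 100_000) -> float:
    """Approximate the integral of R(t) from 0 to horizon (trapezoidal rule)."""
    dt = horizon / steps
    total = 0.5 * (r(0.0) + r(horizon))
    for i in range(1, steps):
        total += r(i * dt)
    return total * dt


def failure_rate(r, t: float, h: float = 1e-2) -> float:
    """z(t) = -(1/R(t)) dR(t)/dt, using a central difference for dR/dt."""
    derivative = (r(t + h) - r(t - h)) / (2.0 * h)
    return -derivative / r(t)


def steady_state_availability(mean_up_time: float, mean_down_time: float) -> float:
    """A = MUT / (MUT + MDT)."""
    return mean_up_time / (mean_up_time + mean_down_time)


lam = 1e-3  # hypothetical failure rate per hour


def r(t: float) -> float:
    return reliability(t, lam)


mttf = mean_time_to_failure(r, horizon=20_000.0)       # close to 1 / lam
z = failure_rate(r, t=500.0)                           # close to lam
availability = steady_state_availability(990.0, 10.0)  # 990 h up, 10 h down
```

The safety figures of merit have the same form: passing a function S(t) instead of R(t) to `mean_time_to_failure` and `failure_rate` gives the mean time to safety breach and the rate of safety breaches, and `steady_state_availability` applied to MTBSB and MTTRTS gives the availability of safe operation.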
Table III: Figures of merit for system long-term operating behaviour

Set F of all observable system failure modes j*:
  RELIABILITY: all system failure modes j* (set F)
  SAFETY: only certain system failure modes j* (subset $F_u \subset F$)

Repair of passive redundant components without interruption of operation:
  RELIABILITY:
    $R(t) = 1 - \sum_{j \in F} P_j(t)$
    Mean Time To Failure: $MTTF = \int_0^\infty R(t)\,dt$
    System failure rate: $z(t) = -\frac{1}{R(t)} \frac{dR(t)}{dt}$
  SAFETY:
    $S(t) = 1 - \sum_{j \in F_u} P_j(t)\,G_j(t)$
    Mean Time To Safety Breach: $MTTSB = \int_0^\infty S(t)\,dt$
    Rate of safety breaches: $z_s(t) = -\frac{1}{S(t)} \frac{dS(t)}{dt}$

Repair of any component, permitting interruption of operation:
  RELIABILITY:
    Availability of the system: $A(\infty) = \frac{MUT}{MUT + MDT}$
    MUT: Mean Up Time
    MDT: Mean Down Time
  SAFETY:
    Availability of safe operation of the environment: $A_s(\infty) = \frac{MTBSB}{MTBSB + MTTRTS}$
    MTBSB: Mean Time Between Safety Breach
    MTTRTS: Mean Time To Return To Safety

* see Figure 4

LITERATURE

[1] H. Frey: Computerorientierte Methodik der Systemzuverlässigkeits- und Sicherheitsanalyse. Diss. Nr. 5244, ETH Zürich, 1973.

[2] H. Frey: Safety evaluation of mass transit systems by reliability analysis. IEEE Transactions on Reliability, August 1974.

[3] H. Frey: Die Sicherheit als Bewertungskenngrösse für Automatisierungssysteme. Tagung Technische Zuverlässigkeit, 1975, Nürnberg, No DK 62-52: 65.011.56.
[4] H. Frey: Die Fahrsicherheit als Kenngrösse für die Beurteilung des elektrischen Systems eines Hochgeschwindigkeitsfahrzeuges. 3. Statusseminar für "Spurgebundener Schnellverkehr mit berührungsfreier Fahrtechnik". BRD: Forschungsvorhaben NT 248-250 BMFT, März 1974.

[5] H. Frey, H. Glavitsch, H. Wahl: Availability of Power as Affected by the Characteristics of the System Control Center. Part I: Specification and Evaluation. Part II: Realization and Conclusions. IFAC Symposium, Feb. 21-25, 1977, Melbourne.

[6] H. Frey: Zuverlässigkeitsplanung von Elektroniksystem, E und M Elektronik und Maschinenbau, Heft 6/7, Juni/Juli, 1978.

[7] H. Frey, K. Roth: On the Relationship between Redundancy, Maintenance and Safety Margins for Testing Thyristor Valves. CIGRE Study Committee No 14, August 1978, Paris.

[8] H. Frey: Optimierung der Systemwirksamkeit elektrischer Anlagen mittels System- und Zuverlässigkeitsplanung. Vortrag im Rahmen des Seminars "Industrielle Elektronik und Messtechnik" an der ETH Zürich vom 8.11.1978.

[9] H. Blatter, H. Frey: System Effectiveness and Quality Assurance of Electronic Systems. BBC Review, No 11, 1978.

[10] H. Frey, K. Signer: Reliability Planning of Systems. BBC Review, No 3, 1979.

[11] H. Frey: A General Safety Model and the Relation to Reliability and Availability. European Purdue Workshop on Industrial Computer Systems, TC Safety and Security, 1976, Paper No 45.

[12] H. Frey: Glossary of Terms and Definitions Related to the Safety of Industrial Computer Systems. European Purdue Workshop on Industrial Computer Systems, TC Safety and Security, 1977, Paper No 132.

[13] J.R. Taylor: Sequential Effects in Failure Mode Analysis. Conference on Reliability and Fault Tree Analysis, Berkeley 1974 (DAEC, Riso-M-1740).

[14] J.R. Taylor: A semiautomatic method for qualitative failure mode analysis. April, 1974 (DAEC, Riso-M-1707).

[15] Reliability Analysis Center RADC/RBRAC. Reliability Data Books MDR-4i8, DSR-2, 1976-1978.

[16] H. Schüller: Methoden zum Erreichen und zum Nachweis der nötigen Hardwarezuverlässigkeit beim Einsatz von Prozessrechnern. Dissertation, 30.10.78, Technische Universität München.

[17] H. Frey: Reliability of non-redundant and redundant digital control computer systems. Proc. of IFAC/IFIP Conference on Digital Computer Applications to Process Control, Zürich, 1974. Springer Verlag: Lecture Notes in Economics and Mathematical Systems; Control Theory Nr. 94 (Part II), p. 500-513.

[18] G. Weber, L. Gmeiner, U. Voges: Methoden der Zuverlässigkeitsanalyse und -sicherung bei Hardware und Software. Kernforschungszentrum Karlsruhe Nr. 01.02.19p03G, August 1978.

[19] MIL-HDBK 217 B: Reliability Prediction of Electronic Equipment. 1974, updated 1978.

[20] IEEE Std 500: IEEE Guide to the Collection and Presentation of Electrical, Electronic and Sensing Component Reliability Data for Nuclear-Power Generating Stations. 1977.

[21] IEEE Std 323-1974: Qualifying Class 1E Equipment for Nuclear Power Generating Stations.

[22] IEEE Std 381-1977: Standard Criteria for Type Tests of Class 1E Modules Used in Nuclear Power Generating Stations.

[23] M.L. Shooman: Software Reliability Models and Measurement. INFOTECH State of the Art Conference on Reliable Software, London 28.2. - 2.3.1977.

[24] B. Littlewood: How to measure Software Reliability, and how not to .... 3rd Internat. Conf. on Software Eng., May 10-12, 1978, p. 37-45. IEEE Cat. Nr 78 CH 1317-7 C.

[25] S. Bologna, W. Ehrenberger: Applicability of Statistical Software Reliability Models for Reactor Safety Software Verification. Comitato Nazionale Energia Nucleare, RT/ING(79)1.