Statistics of design error in the process industries

Safety Science 45 (2007) 61–73 www.elsevier.com/locate/ssci

J. Robert Taylor
Rambøll Danmark A/S, Bredevej 2, DK-2830 Virum, Denmark

Abstract

The paper addresses questions on how frequently incidents and accidents are caused by design errors and how significant design reviews are in removing design errors before a system is put into operation. It is based on a review of earlier studies, mainly from the chemical and nuclear industries. The studies report that from about 20% to 50% of the studied incidents and accidents have at least one root cause attributed to erroneous design. The number of design errors actually occurring during the design process is much higher, but 80–95% of them are removed by thorough design reviews. To improve the design process further, it is necessary to analyse the nature and causes of design errors through first hand knowledge about the design process.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Safety; Incidents and accidents; Design error; Design review; Chemical industry; Nuclear industry

doi:10.1016/j.ssci.2006.08.013

1. Introduction

Design error is one of the most frequent causes of system failure and of accidents in the process industries, but has nevertheless been largely overlooked in risk analysis of process systems and control systems. Evidence for this is given in this paper. It is based on a series of studies by the author as participant in the design process over a period of 35 years, covering both studies of design errors and methods for reducing the incidence of such errors.

Collection of data on design error is not straightforward. Evidence of design error appears in accident reports and in trouble reports from customers to manufacturers. As will be shown, only a small percentage of errors actually reach the stage where they cause accidents or operations problems. By far the majority of errors are removed from systems and plants before they are put into operation. For this reason, it is necessary to participate in the actual design process, in order to be able to collect the data on which improvements in the design process can be based.

Methods such as hazard and operability and functional failure analyses are partially effective in detecting design errors. The actual effectiveness is reviewed here on the basis of practical studies of large-scale systems and on experiments intended to elucidate specific problems. One of the classical methods, Hazard and Operability (Hazop) analysis, is found to give more than an 80% chance of discovering those errors which lie within its domain of application.

2. Definition of design error

A design error may be defined as a feature of a design which makes it unable to perform according to its specification. It is rare that a design fails under all circumstances, and the definition generally means that there are some circumstances, within the scope of the specification, under which the system does not match its specification.

There are some problems with this definition. For many systems, the specification is inadequate, and needs to be supplemented by general statements, such as “additionally the systems should work in a European climate” or “in addition to performing according to the specification, the system should not produce hazardous outputs”. For most systems, there are a very large number of requirements which are included in the specification by reference, or are implicit. Many of the requirements which are not stated in the design documents are nevertheless explicit in legal requirements, or standards which are legally binding. To understand design error, and even to determine whether a design error has occurred, it is necessary to understand this implicit or indirect background.
To complicate matters even further, specifications may contain errors, which lead them to diverge from the designer’s, or the purchaser’s, true intentions. For these reasons, a more pragmatic definition may sometimes be used (Taylor, 1975): “During analysis of incident records, a design error is deemed to have occurred, if the design or operating procedures are changed after an incident has occurred.”

3. Statistics of design error

3.1. Accident statistics

Statistics of accident causes are important because they give us an idea of how accidents arise in practice, and help prevent us focussing on the purely anecdotal. One of the first published studies of design error is useful in this way. It was carried out on “abnormal occurrence reports” published by the US Nuclear Regulatory Commission (NRC) in the 1960s and early 1970s (Taylor, 1975, 1976). The criterion for whether a design error occurred was an objective one. If a design change was made as a result of the incident, then a design error or omission was considered to have occurred. In order to make this definition compatible with that in the previous section, we need to add the errors in procedures, making 45% design errors in total. The results of this study are given in Tables 1–3. In all, 250 reports were assessed over a 10 year period of operation.
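The difference between the two criteria (design change only, versus design or procedure change) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the procedure used in the original study: the record structure, field names and sample records are hypothetical, and only the decision rule itself is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    """Hypothetical incident record; field names are illustrative only."""
    report_id: str
    design_changed: bool      # a design change was made after the incident
    procedure_changed: bool   # an operating procedure was changed after the incident


def is_design_error(record: IncidentRecord, include_procedures: bool = True) -> bool:
    """Apply the pragmatic criterion: a change after the incident implies an error.

    With include_procedures=False only design changes count (the narrower
    criterion); with True, procedure changes count as well (the broader
    definition quoted from Taylor, 1975).
    """
    if record.design_changed:
        return True
    return include_procedures and record.procedure_changed


# Invented sample records, purely for illustration.
records = [
    IncidentRecord("A-1", design_changed=True, procedure_changed=False),
    IncidentRecord("A-2", design_changed=False, procedure_changed=True),
    IncidentRecord("A-3", design_changed=False, procedure_changed=False),
]

narrow = sum(is_design_error(r, include_procedures=False) for r in records)
broad = sum(is_design_error(r) for r in records)
print(f"narrow criterion: {narrow}, broad criterion: {broad}")
```

The broader count can never be smaller than the narrow one, which is why adding procedural errors raises the design error share in the study.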


Table 1
Incident causes for nuclear reactors

Error cause (total number of errors N = 422)    % of total errors
Design error                                    35
Component failure                               18
Operator error                                  12
Error in procedure                              10
Maintenance or installation error               12
Fabrication fault                               1
Cause unknown or unrecorded                     12

Table 2
Causes of design errors for nuclear reactors

Design error cause                    % of design errors
Component selection                   14
Oversight                             17
Effect unknown at design time         25
Sizing, dimensioning error            13
Complex interactions overlooked       7
Communications problems               1
Cause unknown or unrecorded           22

Table 3
Causes of errors in procedures in nuclear reactors

Cause of procedure error                                   % of errors in procedures
Omission of step in procedure                              56
Omission due to effect or need unknown at design time      16
Procedure unclear or ambiguous                             7
Wrong test frequency specified                             2
Wrong procedure specified                                  2
Extra checks required                                      6
Cause unknown or unrecorded                                14
Total number of procedural errors                          42

Some conclusions can be drawn from this assessment. Firstly, improving component selection could presumably reduce the total number of incidents by about 5%. This could be accomplished by having better specifications, specification checking, and application rules. Sizing and dimensioning errors could probably be reduced in a similar way by using computer aids which calculated a wider range of constraints and requirements. By far the biggest reduction would arise by preventing oversights and effects due to lack of knowledge. These causes point to some kind of checking based on actual process knowledge.

If procedural design is included in the range of problems treated, then the most obvious place for improvements is in planning and scoping procedures, since a simple oversight in the need for a particular procedure is the prime cause of error shown in Table 3.

Note that in all cases, design and procedural design errors are more significant than random component failures of the kind treated by traditional reliability analysis techniques. Note also that some of the operator errors and maintenance errors could probably be reduced by better man-machine design, and procedure design. In all, this study estimated that 70% of incidents appear to be susceptible to improvements in the design process.
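The arithmetic behind these estimates can be made explicit. The sketch below is an illustration rather than part of the original study. It uses the percentages of Tables 1 and 2 and assumes that the share of all recorded errors due to a given design-error cause is the product of the two percentages; it also treats shares of recorded errors as a rough proxy for shares of incidents. The dictionary and function names are illustrative.

```python
# Percentages taken from Table 1 (share of all 422 recorded errors).
TABLE_1 = {
    "Design error": 35,
    "Error in procedure": 10,
    "Component failure": 18,
}

# Percentages taken from Table 2 (share of design errors only).
TABLE_2 = {
    "Component selection": 14,
    "Oversight": 17,
    "Effect unknown at design time": 25,
    "Sizing, dimensioning error": 13,
}


def share_of_total(design_cause: str) -> float:
    """Share of all recorded errors (in %) attributable to one design-error cause.

    Assumption: the Table 2 breakdown applies uniformly to the 35% of errors
    classed as design errors in Table 1.
    """
    return TABLE_1["Design error"] * TABLE_2[design_cause] / 100.0


# Design errors plus errors in procedures, as combined in the text.
combined_design_share = TABLE_1["Design error"] + TABLE_1["Error in procedure"]

component_selection = share_of_total("Component selection")
oversight_and_unknown = (
    share_of_total("Oversight") + share_of_total("Effect unknown at design time")
)

print(f"design + procedure errors: {combined_design_share}% of all errors")
print(f"component selection:       {component_selection:.1f}% of all errors")
print(f"oversight + unknown effect: {oversight_and_unknown:.1f}% of all errors")
```

Under these assumptions component selection accounts for just under 5% of all recorded errors, consistent with the "about 5%" quoted above, while oversights and unknown effects together account for roughly three times as much.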


Table 4
Incident primary causes for nuclear reactor safety related incidents from 1980s

Error cause (total number of errors N = 100)    % of total errors
Design error                                    46
Component failure                               11
Operator error                                  9
Error in procedure                              15
Maintenance or installation error               17
Fabrication fault                               1
Cause unknown or unrecorded                     8

The data from the 1975 study are necessarily out of date; they represent design practice from the 1950s. In order to test whether design practices have changed, the abnormal occurrence study was repeated for incidents on newer plant, using data from the 1980s (Taylor, 1997). The results nevertheless represent the design practices of the 1970s, since the data are for nuclear plants operating in the 1980s. One hundred incident reports were studied (Table 4). The results showed an increased percentage of incidents involving design error. This may be because other causes of failures became less frequent, since there was a significant improvement in equipment reliability resulting from more mature designs and the move to solid-state electronics over the 10–15 years between the studies.

The studies resulted in some important observations:

• Distinguish between system design (interconnecting a set of components) and component design (selecting specific component types and dimensioning). Many generalised analysis techniques, such as FMEA and Hazop, exist for checking system designs. Component design involves selection and calculation, which almost always depends on specific knowledge, and for which there are at best check list methods to support design review.
• The importance of lack of knowledge among designers and introduction of the classification “non culpable ignorance” to cover the cases where an incident provides completely new knowledge about accidents. The frequency of “new and unusual” accident phenomena has been of continuing interest since this study, because they set the limit to how well we can analyse and predict accidents, and therefore how well we can prevent them. If every accident occurred in some new way, then our risk reduction and loss prevention efforts would be useless.

A similar kind of study to this was carried out by Haastrup (1984) in the chemical industry, using Manufacturing Chemists Association and Loss Prevention Bulletin incident descriptions. The number of incidents classified as arising due to design error is here about 25%.

All of the above analyses are to some extent pre-selected, or are in some specialised area of engineering. The first were from the nuclear industry, while Haastrup’s set was taken from publications which focus on “interesting” accidents. A further study was made by the present author, of 121 accidents reported under the major hazards scheme to the European Joint Research Centre MARS database, and published by Drogaris (1993). These are not pre-selected and should therefore, within the limits of the obedience to the reporting requirements, be more representative of accident causes. The results are given in Figs. 1–3 and show that over 50% of accidents have some contribution from design error.


[Figure: bar chart. Horizontal axis: % of accidents (0 to 70). Categories: Unavoidable component failure; Maintenance error, procedure not followed; Operator error, performance error; Operator error, did not follow procedures; Inadequate codes or standards, wrong code used; Inadequate safety analysis; Inadequate lab analysis; Design; Managerial.]

Fig. 1. Causes of 121 chemical industry accidents reported to the MARS accident database.

This figure is potentially higher if the use of wrong codes and inadequate safety analysis are added. This “modern” classification gives several contributing causes for most accidents (the percentages add up to more than 100), hence the exact figure for such a conclusion is hard to estimate. Management error and design error dominate to a larger extent than in the data of Tables 1–3, as might be expected from the change in expectations over a 20-year period. More significant perhaps is the light which the study threw onto management error mechanisms (Fig. 2) and design error mechanisms (Fig. 3).

A set of data from the US Risk Management Programme was also studied. These data are likewise not pre-selected and can be regarded as statistically representative (US Environmental Protection Agency, 2005). This programme requires a risk management report for all plants which have above a certain level of inventory. The reports include a five year accident history for the plants, covering all accidents with off site consequences, defined in terms of off site concentrations of toxic substances from the incidents. The causes of the accidents are self-assessed by the companies (see Figs. 4–6).

As might be expected, the proportion of accidents attributed to design error is lower when a self-assessment is made, and when only one cause can be given. Indeed there is no one category of “design error” defined. Also, the reporting provides little detail about how the errors were made. However, in these figures we can assume that all “unsuitable equipment” causes, some “excessive corrosion”, many “equipment failures” and some “improper procedure” causes may be considered as design errors. Even then, as can be seen from these data, design error is assessed as a cause of only a small fraction of the accidents. A large fraction is still regarded by the engineers reporting them as in some way unavoidable for designers. Consider for example the large proportion attributed to adverse weather conditions. However, most safety engineers today would consider a plant badly designed if bad weather could cause a significant release of toxic material.

[Figure: bar chart. Vertical axis: error cause. Horizontal axis: % (0 to 30). Categories: Managerial, inadequate emergency preparedness; Managerial, no MOC, delayed follow up; Managerial, poor communications; Managerial, inadequate security; Managerial, poor storage procedures; Managerial, understaffing; Managerial, failure to respond to warnings; Managerial, inadequate preparation for maintenance; Managerial, inadequate permitting; Managerial, inadequate inspection, integrity audit; Managerial, inadequate training; Managerial, poor safety culture; Managerial, inadequate operation or maintenance procedures.]

Fig. 2. Causes of the managerial errors from Fig. 1.

What conclusions can we draw from these data, if we take the Risk Management Programme records of cause at face value? Design error is obviously a contributor to risk. One approach is then to take the average non-design error accident frequency, and just increase it by about 20% to account for design error in a risk analysis. Alternatively one could take a base rate of “unavoidable” accidents, related to unavoidable equipment failure, and multiply by a factor of about 3 to take into account the failures which we would generally regard as avoidable. This line of thinking has been used for many years to justify ignoring process plant design error in risk assessments, or at least not considering it as a special field worthy of study. Even the extensive data collections of failure rates, such as OREDA or Mil Std. 217, do not look at why failures occur (Sintef, 2002; US Department of Defence, 1995).

The problem here is that the more serious accidents, more or less by definition, involve design error or management error. No one would today accept the possibility that piping carrying toxic material would simply fail as a result of a long period of corrosion. Similarly, one would expect a pressure vessel today to have a probability of failure of 10⁻⁷ per year or less, and would expect design, manufacturing procedures and non-destructive testing to ensure this. We must expect that the serious accidents in the future will be dominated by those caused

[Figure: bar chart. Horizontal axis: % (0 to 120). Categories: Lack of knowledge; Lack of knowledge, novel system; LTA feedback; LTA safety awareness; LTA MOC; Lack of qualified staff; LTA communication; LTA standard; LTA analysis; LTA design procedure; LTA specification.]

Fig. 3. Causes of design error for accidents from Fig. 1 (LTA = less than adequate, MOC = management of change).

by conditions over which we do not yet have full control. At present, these are maintenance error, management error, and design error, and a few types of operator error.

To see whether there is any evidence of this kind of relationship in the Risk Management Programme data, the cause classes were correlated with the size of releases. The results are shown in Figs. 7 and 8. There is a clear correlation, with the design errors (including equipment, corrosion, etc.) leading to larger releases. The results are not as convincing as the arguments above, based on the logic of our expectations of safety engineering, but this is not surprising. The data covered in the figures represent only about 1000 plant years of experience, and do not include the largest accidents which could occur. Nevertheless, it can be seen that “unsuitable equipment” is correlated with large releases, at least for the refinery units.

From all of these studies taken together we get a varying picture, but all show at least 20% of incidents having a significant causal factor in design, and most show much higher figures, around half or even more, especially if we use the same definition as Kinnersley and Roelen (see this issue) and include inadequate procedures as design errors.

3.2. Collection of design error data in the design office and at the plant

Using historical data from incidents and accidents to determine distributions of causes means, unavoidably, that the data reflect out of date design practices. Another source of data comes from design review and safety studies, which look explicitly for

[Figure: bar chart titled "Cause distribution, refinery piping". Series: Alkylation piping; Crude unit piping; Reformer piping. Horizontal axis: percentage (0 to 70). Categories: Improper procedure; Hot tap; Bypass; Weather; Upset; Unsuitable equipment; Overpressure; Management; Maintenance; Operator error; Flame impingement; Equipment; Excessive corrosion.]

Fig. 4. Causes of failure (as recorded by plant management) for 58 refinery piping failures.

[Figure: bar chart. Horizontal axis: % (0 to 25). Categories: Weather; Plant upset; Unsuitable equipment; Overpressure; Management; Human error; Hot tapping; Flame impingement; Equipment; Defective pipe, LTA monitoring; Bypass.]

Fig. 5. Causes of failure (as recorded by plant management) for refinery equipment in general.

design error. The results of these show far more errors made and recovered before they resulted in incidents or accidents.

Three such systematic methods are available, which provide significant amounts of data: hazard and operability analysis (Hazop), action error analysis and formal design

[Figure: bar chart titled "Causes, Ammonia plant releases". Horizontal axis: % (0 to 45). Categories: Weather; Unsuitable equipment; Unrecorded; Purging; Overpressure; Management; Human error; Equipment.]

Fig. 6. Causes (as recorded by plant management) for 96 ammonia plant releases.

[Figure: chart titled "Size of releases (kg/s) vs release count for ammonia plants". Vertical axis: release size (0 to 25). Horizontal axis: release count (1 to 31). Series: Equipment; Human error; Unsuitable equipment; Weather.]

Fig. 7. Sizes of releases recorded by class, for ammonia plants.

review. Hazop can be regarded as an effective design error reduction method. It finds design errors in process plants at the piping and instrument diagram level. However, it does not deal with mechanical design errors, and only rarely with dimensioning errors such as pump sizing, and it does not deal very well with the kind of problems arising during start up, shut down, etc., so some design errors will be overlooked.

Results from Hazop studies were available from the design process for eight chemical plants (Taylor, 1991). The data were available from earliest design through to operation. The results give information not only on the occurrence of design errors, but also on the effectiveness of the Hazop procedure. The initial designs seem to include about one systems design error per vessel, at the initial design stage, depending somewhat on the degree of standardisation. The Hazop process is effective in eliminating between 80% and 98% of these errors, depending on how the analysis is carried out. (The most complete results were obtained by performing thorough cross check analyses, using a very powerful expert system

[Figure: chart titled "Refinery releases, size kg/s vs number of failures". Vertical axis: release, kg/s (logarithmic, 0.00001 to 1000). Horizontal axis: number (1 to 86). Series: Corrosion; Equipment; Heater flame; Human error; Management; Unsuitable equipment; Weather.]

Fig. 8. Sizes of releases recorded by cause class, for refinery units.

and then following up discrepancies.) Presumably, therefore, between 0.2 and 0.02 systems design errors per vessel persist through to commissioning. Note that there may be other design errors, arising from mechanical design, electrical design, and layout, which are not amenable to Hazop. It has only been possible to gather precise information on these from one of these projects, but the results indicate about 0.5 of those errors per vessel persisting through to the commissioning stage.

Table 5 gives error rates found by Hazop for a biological waste processing plant, for a chemical waste incinerator, for a sintering furnace, and for a nitric acid plant, all made during the design. Not all the recommendations from the Hazops can be regarded as reflecting design errors. In many cases the designs were not wrong; they just did not match up to safety expectations. For these aspects, the Hazops should be regarded as part of the design process intended to ensure that safety expectations are met. The data have therefore been divided into two groups of findings/recommendations, those which are needed to make the design work, and those needed to ensure a high level of safety (see the last two columns in Table 5).

Five plants from the first study (Taylor, 1991) were followed up in detail in post commissioning or operational audits. There were fairly complete details of design errors available, and interviews were carried out with the designers involved. Fig. 9 shows the

Table 5
Errors per vessel during design, found by Hazop study during design

Analysis                               Design errors   Vessels   Errors per vessel found during HAZOP   Safety related errors per vessel
Biological waste processing plant      9               1         9                                      2
Chemical waste incinerator             48              9         5.5                                    5.5
Sintering furnace                      8               1         8                                      6
Nitric acid plant                      84              8         10                                     8.5
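The per-vessel figures quoted in the text for errors persisting to commissioning follow from simple arithmetic, which the sketch below makes explicit. It is an illustration under stated assumptions, not a model from the paper: it takes the quoted rate of about one systems design error per vessel, the 80% to 98% Hazop removal range, and the roughly 0.5 other errors per vessel reported for one project. Adding the two residual rates together and the 10-vessel plant size are assumptions made here for illustration, and the function names are hypothetical.

```python
INITIAL_SYSTEM_ERRORS_PER_VESSEL = 1.0   # "about one systems design error per vessel"
OTHER_ERRORS_PER_VESSEL = 0.5            # mechanical, electrical, layout (one project only)


def residual_errors_per_vessel(initial_per_vessel: float, removal_fraction: float) -> float:
    """Systems design errors per vessel left after a review removes a given fraction."""
    if not 0.0 <= removal_fraction <= 1.0:
        raise ValueError("removal_fraction must lie between 0 and 1")
    return initial_per_vessel * (1.0 - removal_fraction)


def expected_errors_at_commissioning(n_vessels: int, removal_fraction: float) -> float:
    """Assumed total: residual systems errors plus errors not amenable to Hazop."""
    per_vessel = (
        residual_errors_per_vessel(INITIAL_SYSTEM_ERRORS_PER_VESSEL, removal_fraction)
        + OTHER_ERRORS_PER_VESSEL
    )
    return n_vessels * per_vessel


low = residual_errors_per_vessel(INITIAL_SYSTEM_ERRORS_PER_VESSEL, 0.80)
high = residual_errors_per_vessel(INITIAL_SYSTEM_ERRORS_PER_VESSEL, 0.98)
print(f"residual systems errors per vessel: {high:.2f} to {low:.2f}")

# Hypothetical 10-vessel plant.
for fraction in (0.80, 0.98):
    total = expected_errors_at_commissioning(10, fraction)
    print(f"removal {fraction:.0%}: about {total:.1f} errors at commissioning")
```

On these assumptions the errors that Hazop cannot address dominate the residual once the review is thorough, which matches the text's caution that mechanical, electrical and layout errors fall outside the method.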

[Figure: bar chart titled "Proximate causes of design errors". Horizontal axis: number of errors (0 to 25). Categories: Wrong specification; Requirement not recognised; Practical constraint not recognised; Mistake; Lack of understanding; Lack of knowledge of standards; Lack of knowledge; Lack of analysis; Inadequate analysis; Drawing error; Difficulty in finding solution.]

Fig. 9. Design error causes for 52 design errors in five chemical plants.

proximate causes of the design errors. Fig. 10 gives design error root causes. The x-axis of both these figures is the absolute number of errors, not the percentage.

The number of errors persisting through to operation varied from 0.2 to 2.4 errors per vessel type. Of the 52 errors, three gave actual accidents. About 15 would have given accidents within about a year, based on calculated accident frequencies from risk analyses. A further 20 might well have caused accidents over a 10-year period. The remainder would have led to increased consequences if other accidents or disturbances arose.

It was shown in Section 3.1 that a significant fraction of accidents arising in process plant involve design error. The percentage varies from about 20% up to over 50%, depending on the standards set for designs. The studies in this section show that the number of

[Figure: bar chart titled "Design errors classified by root cause". Horizontal axis: number of errors (0 to 16). Categories: Novel system; Poor feedback from operation; Poor communication; Poor safety culture; Lack of training; Lack of qualified staff; Lack of information; Inadequate standard; Inadequate specification procedure; Inadequate MOC procedure; Inadequate drafting procedure; Inadequate design procedure; Inadequate awareness of field conditions; Inadequate analysis procedures.]

Fig. 10. Design error root causes for 52 design errors in five chemical plants.


design errors actually occurring during the design process is much higher than the number causing accidents. Fortunately, not all design errors are transferred to the final construction; many are removed as part of the design review, Hazop analysis, and commissioning audit process. Of those design errors which do survive until the operational stage, only a fraction actually cause accidents. Lack of risk analysis, lack of knowledge and inadequate specification and design codes and procedures are the most significant proximal or underlying causes of errors.

4. Conclusions

From the data presented here we can conclude that design error plays a major role in process plant risk. Current design review methods discover and remove between 80% and perhaps 95% of the errors made, but there is still a design element present in between 20% and 50% of the accidents and incidents which happen in chemical process plant. That percentage depends very much on the quality of the data reporting, the representativeness of the data analysed and the way in which that analysis is performed. In particular, the definition of what to include under the term design error is decisive, as is the way in which the analyst conceives of the responsibility of the designer.

In order to understand more about the nature and causes of design error it is necessary to dig deeper, behind the statistics and the incident analyses. The other paper by this author in this special issue will do that, based on long experience of involvement in the process of design and safety review. That discussion is therefore based on anecdote and case studies, but rooted in a systematic description of the design process.

References

Drogaris, D., 1993. Major Accidents Reporting System. Commission of the European Communities, Joint Research Centre, Ispra.
Haastrup, P., 1984. Design Error in the Chemical Industry. Report Risø-R-500, Risø National Laboratory, Roskilde.
Roelen, A., Kinnersley, S., Drogoul, this issue.
Sintef, 2002. OREDA, Offshore Reliability Database. Sintef, Trondheim.
Taylor, J.R., 1975. A Study of Abnormal Occurrence Reports. Report RISØ-M-1837, Risø National Laboratory, Roskilde.
Taylor, J.R., 1976. A Study of Abnormal Occurrence Reports. IAEA Conference on Reliability of Nuclear Power Plants, IAEA-SM-195/6, Innsbruck.
Taylor, J.R., 1991. Quality and Completeness of Risk Assessment. In: Hazard Identification, 1. Taylor Associates, Copenhagen.
Taylor, J.R., 1997. Design Error. Taylor Associates, Copenhagen.
US Department of Defence, 1995. Reliability Prediction of Electronic Equipment. Military handbook MIL-HDBK-217, Rome Laboratory, Griffiss AFB, NY.
US Environmental Protection Agency, 2005. Chemical Emergency Preparedness and Prevention. Available from: .