The need for an internationally recognised standard for engineering failure analysis


Journal Pre-proof

Nigel K. Booker, Richard E. Clegg, Peter Knights, Jeff Gates

PII: S1350-6307(19)30520-5
DOI: https://doi.org/10.1016/j.engfailanal.2019.104357
Reference: EFA 104357
To appear in: Engineering Failure Analysis
Received Date: 14 April 2019
Revised Date: 16 December 2019
Accepted Date: 24 December 2019

Please cite this article as: Booker, N.K., Clegg, R.E., Knights, P., Gates, J., The need for an internationally recognised standard for engineering failure analysis, Engineering Failure Analysis (2019), https://doi.org/10.1016/j.engfailanal.2019.104357

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier Ltd.

THE NEED FOR AN INTERNATIONALLY RECOGNISED STANDARD FOR ENGINEERING FAILURE ANALYSIS

Nigel K. Booker, PhD Research Student, School of Mechanical and Mining Engineering, The University of Queensland, Brisbane, Australia. [email protected]
Dr Richard E. Clegg, School of Chemistry, Physics and Mechanical Engineering, Queensland University of Technology, Brisbane, Australia. [email protected]
Prof. Peter Knights, School of Mechanical and Mining Engineering, The University of Queensland, Brisbane, Australia. [email protected]
Dr Jeff Gates, UQ Materials Performance and School of Mechanical & Mining Engineering, The University of Queensland, Brisbane, Australia. [email protected]

Declarations of interest: none

Highlights

· The selection and rigour of failure analysis methodology has a direct impact on the accuracy and expediency of the decision-making process.
· It has been established that no internationally recognised standard or diagnostic accuracy guidelines exist for engineering failure analysis.
· At present, the outcome of failure analysis is an opinion, albeit a considered one.
· Currently, a failure analysis methodology is chosen by the analyst without due diligence and applied according to their interpretation.
· It has been established that there is a need for an internationally recognised failure analysis standard and diagnostic accuracy guidelines, as a codification and clarification of ideas, in order to:
a) be expedient and repeatable, so that the results can be peer reviewed and lessons learned;
b) be quality assured by being regulated and conducted in accordance with the standard; and
c) mitigate the risk of wrong diagnosis by ensuring that all procedural steps are executed correctly.

Abstract

At present, the outcome of failure analysis is an opinion, albeit a considered one, based on the best facts available, informed by a series of non-standardised diagnostic tests and by experience. Acknowledging that many tools have been developed and implemented successfully by failure analysts over many years, this paper specifically identifies the need for (a) an internationally recognised failure analysis standard and (b) diagnostic accuracy guidelines. It is hypothesised that a standard is required to make the process repeatable, so that the results can be peer reviewed, lessons learned, and the outcome made objective. Through a thorough literature review and the analysis of 132 failure analysis case studies, various failure analysis methodologies were identified and diagnostic accuracy methodologies appraised. It was concluded that, currently, a failure analysis methodology is chosen by the analyst without due diligence. The chosen methodology is then applied according to the analyst's interpretation, which in turn yields a diagnostic decision. Consequently, the result is a methodology that may not be the most accurate or expedient, which in turn has the potential to increase the risk and cost of the decision-making process. It is recommended that a structured approach to decision analysis be established, leading to a set of diagnostic accuracy guidelines which, once adopted and implemented by the engineering fraternity, allows the further development of an internationally recognised failure analysis standard, thereby reducing the risk of misdiagnosis and of repeat failures.

Key Words

Behavioural Engineering; Cognitive Engineering; Decision Analysis; Diagnostic Accuracy Guidelines; Failure Analysis; Forensic Engineering; Heuristics; Hypothetico-deductive; Inductive.

Introduction

Failure analysis is the application of engineering principles and methodologies to determine the causes of failures. The failures to be investigated can vary in nature and magnitude, from material components and minor equipment breakdowns to corporate processes and catastrophic events with significant environmental impact and possibly multiple human fatalities. The modifier "forensic" typically connotes a connection with the law or some other form of adversarial debate, but can validly be broadened to investigations of any failure where there are implications for public safety or major financial losses. The assessment of causation is based on a combination of four types of evidence:

· “History” (what is known about the physical conditions that were present prior to the failure);
· “Examination” (a non-destructive visual and microscopic characterisation of the failed item prior to any further destructive action);
· “Simulation” (an accident reconstruction or simulation); and
· “Testing” (measurements and diagnostics conducted on samples collected from the failed item).

From the evidence collected, hypotheses are formulated to establish how the pre-event conditions could have led to the failure event. The evidence is then assessed through a process of formulating and testing various hypotheses for possible ways in which the pre-event conditions could have become the post-event conditions. The competing hypotheses are then tested, either by critical reasoning alone, based on the existing body of evidence, or by further critical evidence if required. The selection and rigour of the failure analysis methodology has a direct impact on the accuracy and expediency of the decision-making process; however, no internationally recognised standard currently exists for selecting an engineering diagnostic test of appropriate rigour.

Objective

This paper will demonstrate the lack of, and therefore the need for, internationally recognised diagnostic accuracy guidelines and, subsequently, an internationally recognised engineering failure analysis standard.

Literature Review

Before ascertaining what would constitute a suitable engineering failure analysis standard, it is important to fully investigate and understand the established practice regarding the selection and conduct of diagnostics.

Competing Failure Analysis Methodologies

According to Cherry, there exist two competing approaches to forensic engineering investigations (Cherry, 2002). These may be termed the "inductive" approach and the "hypothetico-deductive" approach. The "inductive" approach is characterised by formalised (and often commercialised) diagnostic decision analysis tools described by terms such as "Root Cause Analysis". 'Induction' is a process or method of reasoning in which the premises of an argument are deemed to support the conclusion. Sir Francis Bacon (1561-1626) is regarded as the first person to study the methodology that scientists employ. Bacon was a clear proponent of the inductive method, placing experimentation and observation as the absolute foundation of scientific method. Inductive methodologies employ a formulaic approach in which all possible evidence is collected and all possible explanations are considered in a comprehensive, time-consuming process of elimination. To keep track of a multiplicity of possible causation paths, these tools often use visual aids such as decision tree diagrams and "fishbone" cause-and-effect diagrams. While rigorous and (in principle) reliable, these approaches may waste time and resources investigating numerous hypothetical possibilities which have only a very low probability of being correct, and consequently may unnecessarily delay the identification of causes.
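To make the bookkeeping behind such inductive tools concrete, the sketch below shows one possible way a causation tree of the kind used in Root Cause Analysis might be represented and pruned as explanations are eliminated. It is an illustrative Python sketch only; the class names, causes and statuses are invented for the example and are not drawn from any of the tools cited in this paper.

```python
# A minimal, illustrative sketch (not from the paper) of how an inductive
# investigation might track a multiplicity of causation paths while
# eliminating explanations. All names and statuses are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Cause:
    description: str
    status: str = "open"                     # "open", "eliminated" or "supported"
    children: List["Cause"] = field(default_factory=list)

def open_paths(node: Cause, trail: Optional[List[str]] = None) -> List[List[str]]:
    """Return every causation path that has not yet been eliminated."""
    trail = (trail or []) + [node.description]
    if node.status == "eliminated":
        return []
    if not node.children:
        return [trail]
    paths: List[List[str]] = []
    for child in node.children:
        paths.extend(open_paths(child, trail))
    return paths

# Hypothetical fishbone-style tree for a shaft fracture investigation.
root = Cause("Shaft fracture", children=[
    Cause("Material", children=[
        Cause("Out-of-specification alloy", status="eliminated"),
        Cause("Pre-existing inclusion"),
    ]),
    Cause("Loading", children=[
        Cause("Fatigue under normal service loads"),
        Cause("Single overload event", status="eliminated"),
    ]),
    Cause("Manufacture", children=[Cause("Weld repair not to procedure")]),
])

for path in open_paths(root):
    print(" -> ".join(path))
```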


Root Cause Analysis: Simplified Tools and Techniques (Andersen and Fagerhaug, 2000) demonstrates that an advantage of inductive methodologies is that they do not rely on highly experienced practitioners and may be able to be conducted by a person with little engineering experience. An experienced forensic engineer or failure analyst, however, can often get to the heart of the matter more expeditiously using a generic heuristic method known as a 'search tree', or the "hypothetico-deductive" approach, which is based on tests designed explicitly to falsify or disprove the researcher's hypothesis (Popper, 1963). There is a risk that, when using a "lean", "expedient" or "intuitive" heuristic approach to fault diagnosis, the resulting level of certainty or confidence in the correctness of the diagnosis might not be as great as that obtained from a more "rigorous" cognitive inductive approach.

Failure Analysis Methodologies

Published literature contains very little direct consideration of the diagnostic accuracy of failure analysis methodologies. Literature on failure analysis falls into two basic groups: the most prevalent is a generic 'case-based reasoning' approach, and the other, less common, approach is based on the strategic application of methodology principles (Davis, 2004); both are discussed below.

Case-Based Reasoning

Most of the literature on failure analysis consists of case studies, where the focus is on identification of the cause of a failure (Mascarenhas, 2010). In Case Studies in Engineering Failure Analysis, published by Elsevier, the Editor, Clegg, observed:

"Most engineering failures have points of commonality with other, previously investigated failures. Being able to study the analysis of previous similar failures can be of enormous benefit in being able to quickly and efficiently get to an understanding of the failure."

In these studies, an investigation methodology has evidently been used, but rarely, if ever, do such papers analyse the methodology or justify its selection; they only report the outcomes of its application. No consideration is given to the possibility that an alternative methodology might have been used. Indeed, in the clear majority of cases, the methodology is not even named.

Strategic Application of Methodology Principles

A relatively small number of publications exist which, rather than consisting merely of a collection of case studies, discuss the over-arching principles of the strategies and methodologies by which failure investigations can and should be conducted. That is, they are general rather than specific, and in this sense such publications are more valuable to a trainee investigator, because they teach principles which can be applied to a broad range of specific cases. These publications include the more standard references on performing "failure analysis" and "root cause failure analysis", such as: The Origins and History of Loss Prevention (Kletz, 1999); Understanding How Components Fail (Wulpi, 1985); Volume 11: Failure Analysis and Prevention, Metals Handbook, Ninth Edition; and Chapter 1: Corrosion Failure Analysis with Case Histories (Eiselstein and Huet, 2011).

Failure Analysis of Engineering Materials (Brooks and Choudhury, 2002) even describes itself as: "…designed to be of benefit to materials engineers and materials scientists, and other engineers involved in the design of components, specification of materials, and fabrication of components. It serves as an introduction to failure analysis for the novice, and as a refresher and a source book for those already familiar with the subject."

Whereas Lees' Loss Prevention in the Process Industries: Hazard Identification, Assessment and Control (Lees, 2012) clearly indicates that the chemical process industry looks at failure analysis differently from the manufacturers of home appliances or heavy equipment, being more interested in understanding what processes were or were not in place that allowed the failures to occur. In some cases, these publications assign a name to the methodology being taught, for example Kepner-Tregoe's "Root Cause Analysis" or the Juran Institute's "Six Sigma". In other cases, no name is given, and it might appear that the authors assume that the methodology described is the only logical methodology that a failure analyst could use. In any case, regardless of whether or not they give a name to the methodology, rarely if ever is explicit consideration given to the existence of competing candidate approaches and their relative merits.

Decision Analysis

Almost as soon as von Neumann and Morgenstern outlined their theory of expected utility or 'rationality' in 1944, economists began adopting it, not just as a model of rational behaviour, but also as a description of how people actually make decisions (Edwards, 1954). In 1965, Howard of MIT coined the term "decision analysis" for General Electric's nuclear headquarters. Howard was asked to apply the new decision-making theories to a nuclear power plant. He combined expected utility and Bayesian statistics with computer modelling and engineering techniques, thereby pioneering the policy iteration method for solving Markov decision problems: a mathematical framework for modelling decision making in situations where outcomes are partly random and partly under the control of a decision maker. He was also instrumental in the development of the 'influence diagram' for the graphical analysis of decision situations. Influence diagrams and decision trees are complementary views of a decision problem. The influence diagram shows the dependencies among the variables more clearly than the decision tree. Decision trees display the set of alternative values for each decision and chance variable as branches coming out of each node (a minimal expected-value sketch over such a tree is given at the end of this section). However, it was not until 1973 that Kahneman and Tversky collaborated to show that people assess probabilities and make decisions in ways systematically different from what the decision analysts advised. Tversky and Kahneman (1973) wrote:

"In making predictions and judgements under uncertainty, people do not appear to follow the calculus of chance or the statistical theory of prediction. They rely on a limited number of heuristics which sometimes yield reasonable judgement and sometimes lead to severe and systematic errors."
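As a concrete illustration of the decision-tree view described above, the following minimal Python sketch rolls a small tree back by expected value: chance nodes average over their outcomes and decision nodes take the best alternative. The structure, probabilities and costs are invented for illustration and are not taken from Howard's work.

```python
# Illustrative sketch only: rolling back a small decision tree by expected
# value. The probabilities and costs below are invented for the example.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Chance:
    """A chance node: (probability, child) outcomes."""
    outcomes: List[Tuple[float, Union["Decision", "Chance", float]]]

@dataclass
class Decision:
    """A decision node: (name, child) alternatives."""
    options: List[Tuple[str, Union["Decision", "Chance", float]]]

def expected_value(node) -> float:
    if isinstance(node, (int, float)):
        return float(node)
    if isinstance(node, Chance):
        return sum(p * expected_value(child) for p, child in node.outcomes)
    # Decision node: the rational choice is the alternative with the best expected value.
    return max(expected_value(child) for _, child in node.options)

# Hypothetical example: repair a component now, or keep running it.
tree = Decision(options=[
    ("repair now", -10_000.0),                      # known repair cost
    ("keep running", Chance(outcomes=[
        (0.9, 0.0),                                 # survives the campaign
        (0.1, -200_000.0),                          # unplanned failure
    ])),
])

print(expected_value(tree))   # -10000.0 vs -20000.0: repairing now is preferred
```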


Behavioural Economics

Kahneman won an economics Nobel in 2002 (Tversky died in 1996), and the heuristics-and-bias insights relating to money became known as behavioural economics. They modelled decision-making by comparing two systems, described by them as:

· a "hot and fast" System 1, which takes shortcuts using heuristics, is biased by emotion, and therefore makes less optimal decisions; and
· a "cold and slow" System 2 that is more logical and optimal (Figure 1).

Figure 1 - Kahneman's model of cognitive thinking, in which Intuition (system 1) acts quickly and Reasoning (system 2) plans more carefully.

They hypothesised that, to help people make better decisions, they should be encouraged towards using Reasoning (System 2), or that smart defaults and paternalistic policies should otherwise be used to influence their choice (Marc Resnick, 2014). Much of the base material of behavioural economics came from straightforward observations of how companies behaved.

Bounded Rationality

Simon (1957) developed the notion of 'bounded rationality', claiming that decision makers seldom have the time or mental processing power to follow the optimisation process outlined by the decision analysts, so they make do by taking shortcuts and going with the first satisfactory course of action rather than continuing to search for the best. The psychologist Gigerenzer further developed this in the 1980s. He believed that, by manipulating the framing of a question, it is sometimes possible to make apparent cognitive illusions (an interaction based on assumptions about the world, which lead to unconscious inferences) irrelevant (1991).

Gigerenzer finds two main faults in Kahneman and Tversky's approach. Firstly, he says that Intuition (System 1) is often correct when Reasoning (System 2) fails. This happens because Reasoning (System 2) is limited by working memory, and many complex decisions go beyond what working memory can handle. So Intuitive (System 1) decisions, although we cannot articulate how we came to them, are sometimes nevertheless better. Secondly, he finds that many of the people who use Intuition (System 1) inappropriately do so only because of a lack of statistical knowledge, and that if these people were taught some basic statistical techniques, they would be much better. Gigerenzer highlights the difference between risk and uncertainty, one of the causes of the financial market crash of 2006-8 (Fisk, 2011), when reckless behaviour by the US financial sector spread the culture of risk-taking. The incentive structures for most of the top executives and many of the lending officers of these financial institutions were designed to encourage short-sighted behaviour and excessive risk-taking (Stiglitz, 2010). Gigerenzer also describes some of the intentional tricks designed to mislead consumers that companies use to sell products, thereby deliberately manipulating the decision-making process of the consumer. Gigerenzer states that the Bayesian approach to probability favoured by decision analysts is just one of several options, explaining that doctors and patients are far more likely to assess disease risks correctly when statistics are presented as natural frequencies (1 in 563) rather than as a percentage (0.178%) (Hoffrage and Gigerenzer, 1998). Tversky and Kahneman (1979) hypothesised that representations of choice problems are induced by shifts of reference points. Suppose you are compelled to play Russian roulette but are given the opportunity to purchase the removal of one bullet from the loaded gun. Would you pay as much to reduce the number of bullets from four to three as you would to reduce the number of bullets from one to zero? Most people feel that they would be willing to pay much more for a reduction of the probability of death from 1/6 to zero than for a reduction from 4/6 to 3/6, thereby supporting Simon's notion of bounded rationality.

Ecological Rationality

Gigerenzer is not alone in arguing that we should not be too quick to dismiss the heuristics, gut feelings, snap judgements and other methods humans use to make decisions as necessarily inferior to the probability-based verdicts of the decision analysts. Even Kahneman shares this belief to some extent, as demonstrated by his discussions with the psychologist and decision consultant Klein. In the book Blink: The Power of Thinking Without Thinking (Gladwell, 2005), Klein studies how fire-fighters, soldiers and pilots develop expertise, and he generally sees the process as being a lot more naturalistic and impressionistic than the models of the decision analysts. Upon further studies alongside Kahneman, Klein concluded that reliable intuitions need predictable situations with opportunities for learning, thereby relying on pattern recognition, a concept similar to Koffka's explanation of Gestalt Theory (Koffka, 1922). Gigerenzer argues that those situations are not the only times in which heuristics outperform decision analysis. He argues that when there is uncertainty, you can no longer optimise; instead you must simplify to be robust.
Another way of explaining this is that, when the probabilities being fed into a decision-making model are unreliable, it might be better to follow a rule of thumb. This has led Gigerenzer to develop "ecological rationality". In environments where uncertainty is high, the number of potential alternatives many, or the sample size small, the argument is that heuristics are likely to outperform more analytic decision-making approaches.
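The natural-frequency point made above can be illustrated with a small calculation. The sketch below uses invented numbers (the prevalence, sensitivity and false-alarm rate are assumptions for the example, not figures from Hoffrage and Gigerenzer) to show how reasoning in counts of a notional population makes the post-test probability easier to see than the equivalent percentages.

```python
# Worked illustration (invented numbers) of why natural frequencies are easier
# to reason about than percentages when judging a post-test probability.
prevalence  = 0.01    # 1 in 100 components has the defect (assumed)
sensitivity = 0.90    # defective components flagged by the inspection (assumed)
false_alarm = 0.05    # sound components incorrectly flagged (assumed)

population = 10_000
defective = population * prevalence                          # 100 components
true_positives = defective * sensitivity                     # 90 flagged, truly defective
false_positives = (population - defective) * false_alarm     # 495 flagged, actually sound

# "Of the 585 components that are flagged, only 90 are really defective."
post_test_probability = true_positives / (true_positives + false_positives)
print(round(post_test_probability, 3))   # about 0.154, not 0.90
```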

This concept, that smart decision-making consists of a mix of rational models, error avoidance and heuristics, seems to be gaining support among authors. Alternatively, it has been argued (Heintze & Jaffar, 1991) that decision-making tasks can be assigned to computers, which in principle should not be subject to the information-processing limits or biases which humans face. Due to the familiarity of core techniques such as the decision tree, decision analysis has become commonplace in mainstream business; however, only a handful of universities offer it as a subject of advanced academic research (Fox, 2015).

Cognitive Engineering

Cognitive engineering is the application of cognitive psychology and related disciplines to the design and operation of human-machine systems, combining multiple methods and perspectives to achieve the goal of improved system performance (Wilson et al., 2013). Developed as an engineering method used in the 1970s at Bell Labs, it focused on how people form a cognitive model of a system based upon common metaphors. Part of the science of the field called 'human factors' in the United States or 'ergonomics' in Europe, cognitive engineering has the purpose of improving system effectiveness and the safety and productivity of the human constituents of the system (Cooke & Durso, 2007).

Nudge Theory

Decision analysis is heavily used in industries such as pharmaceuticals or oil and gas, in which managers must make big decisions with long investment horizons, and in which they have somewhat reliable input data. Chevron is an enthusiastic adherent, with 250 decision analysts on staff (Hammond, Keeney & Raiffa, 1998). Aspects of the field have also enjoyed an informal renaissance among computer scientists; the 2016 US Presidential election forecasts were a straightforward application of Bayesian methods (Tunggawan & Soelistio, 2016). In 2008, building on the work of Kahneman and Tversky, Richard Thaler, a professor of economics at the University of Chicago, and Cass Sunstein, a Harvard lawyer, published Nudge, a manual for manipulating people into behaving as they should (Wilkes, 2015). The British Government adopted the models they developed in 2010, applying them to push policy through UK Government departments. The central promise of behavioural economics as applied to policy is to use people's weaknesses to help them achieve their personal goals. It is a simple and soft form of paternalism rather than a coercive one (Reeves, 2015). Thaler won an economics Nobel in 2017 for this work. This paper investigates the need for an internationally recognised standard for engineering failure analysis and may, perhaps, contribute to the conceptualisation of "behavioural engineering".

Development of Methodology

"Everything should be made as simple as possible, but not simpler." Albert Einstein, "On the Method of Theoretical Physics", the Herbert Spencer Lecture, Oxford, June 10, 1933.

How does a diagnostic test relate to failure analysis? The definition of a test implies that an answer is obtained by carrying out some form of scientific experiment. A test must be objective: dispassionate and without opinion.

The answer to the test should, in some way, be Boolean, even if there is some degree of scientific uncertainty. For example, if a tensile test is carried out, measurements of physical properties such as yield strength, tensile strength and elongation are obtained. These may or may not tell us something about the failure. If, for instance, the strength of the material is lower than the specified value, it may indicate that the material is out of specification, but it does not always follow that this is the root cause of the failure. There are cases observed whereby components were made from a material that was out of specification (Charpy values were low), yet the failure was attributable to other issues, such as fatigue due to poor operation. This sort of test is rarely a failure analysis in itself, but provides information for failure analysts to form an opinion of the root cause of failure. The key difference between a diagnostic test and a failure analysis is that the outcome of a failure analysis is, currently, an opinion, not a fact. It is an opinion that allows the stakeholders to proceed with the investigation that they have. As the investigation continues, the opinion becomes stronger and more firmly based in observations drawn from diagnostic tests. It still remains an opinion, largely because not all of the facts surrounding the failure are known, and at some stage the failure analyst will need to stop the investigation and stand on their opinion. The point at which an investigation stops is generally governed by the resources available to conduct the analysis, which is predominantly dependent on money; the time by which the investigation results must be published; and how prepared the analyst is to stand by their opinion, that is, the probability that they are correct in their analysis, widely accepted as the acceptable level of risk. On occasion, when it is not reasonable to gather sufficient evidence, failure analysts will simply state their conjecture, reinforcing to the audience that that is what it is, and that it is required in order to move forward.

Diagnostic Test Definition

The aim of a diagnostic test is to confirm the presence or absence of a particular condition (Shaikh, 2011). There are distinct demands on the test procedures leading to a diagnosis and on the diagnosis itself (Stengel & Porzsolt, 2006):

1. Results must be precise and reproducible (efficacy).
2. Test findings must prompt actions that are different from those considered without knowledge of the test results (effectiveness).
3. Actions based on the results must lead to an improvement in quality (efficiency).

At the first level (efficacy), analysts are responsible for defining the degree of correlation between test findings and engineering truth. In the case of forensic engineering, the diagnostic purpose of the test is to determine the cause of failure where there are implications for public or environmental safety or major financial losses.

Diagnostic Accuracy

The diagnostic accuracy of any diagnostic procedure or test answers the question of how well the test discriminates between two conditions of interest. Diagnostic accuracy is not a fixed property of a test and is extremely sensitive to the design of the study. There are major sources of bias that originate in methodological deficiencies, in the selection of case studies, in data collection, in executing or interpreting the test, or in data analysis.
These sources of variation in diagnostic accuracy are relevant for those who want to apply the findings of a diagnostic accuracy study to answer a specific question about adopting the test in their environment (Cohen et al., 2016).

Measures of Diagnostic Accuracy

Different measures of diagnostic accuracy relate to different aspects of the diagnostic procedure. Some measures are used to assess the discriminative property of the test; others are used to assess its predictive ability. While discriminative measures are mostly used for policy decisions, predictive measures are most useful in predicting the probability of a failure of a system. Furthermore, it should be noted that measures of test performance are not fixed indicators of test quality and performance. Measures of diagnostic accuracy are very sensitive to the characteristics of the sample population in which the test accuracy is evaluated. Some measures depend largely on the failure prevalence, while others are highly sensitive to the spectrum of the failure in the general environment. It is therefore imperative to know not just how to interpret the results, but also when and under what conditions to apply them. This discriminative ability can be quantified by the following measures of diagnostic accuracy:

a) sensitivity and specificity;
b) positive and negative predictive values;
c) likelihood ratio;
d) Youden's index;
e) diagnostic odds ratio; and
f) the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC).

Sensitivity and Specificity

The sensitivity of a test is its ability to correctly identify the proportion of samples with a failure. A highly sensitive test is useful in early diagnosis, where the detection of a failure is imperative, and is most informative when it is negative. Specificity, by contrast, is the ability to identify the proportion of samples that do not have the failure. A specific test will rarely misclassify a sample as having a failure when it does not, and a highly specific test is most informative when it is positive. Due to the nature of forensic engineering investigations, there is very rarely the opportunity to conduct like-for-like group sampling, and therefore sensitivity and specificity measurements are not relevant.

Positive and Negative Predictive Values

Predictive values reflect the characteristics of a test. The positive predictive value of a test is the probability that a sample with a positive test result truly has the failure, and the negative predictive value is the probability that a sample with a negative test result truly does not have the failure. These measures are relevant in the field of medical diagnostics, but irrelevant in forensic engineering because engineering failures are Boolean; symptoms, however, are related to predictive maintenance.

Likelihood Ratio

The likelihood ratio is a very useful and widely applied measure of diagnostic accuracy (McGee, 2002). It summarises information about the diagnostic test by combining the values of sensitivity and specificity. It indicates how much a positive or negative test result changes the likelihood that a sample would have a failure.

Unlike predictive values, likelihood ratios do not depend on the prevalence of the failure. As a result, the likelihood ratios from one study can be used in another setting, on the condition that the definition of the failure is not changed. Likelihood ratios are applicable in forensic engineering investigations because they can be directly related to the pre-test and post-test probability of a failure in a specific sample, so that the effect of the diagnostic test can be quantified. By specifying the information about the sample, the pre-test odds and the post-test odds of failure can be determined. The pre-test odds are related to the prevalence of the failure, and it is important to specify them, as the diagnostic test will be adapted to the sample rather than the sample to the diagnostic test.

Youden's Index

Youden's index is a global measure of test performance, used in the evaluation of the overall discriminative power of a diagnostic procedure and in the comparison of one test with other tests (Unal, 2017). It is an index which summarises the sensitivity and specificity of a test. The index is not affected by the failure prevalence but is affected by the spectrum of the failure. The prime disadvantage of this index is that it does not distinguish between differences in the sensitivity and specificity of the test; that is, a test with a sensitivity of 0.7 and specificity of 0.8 has the same Youden's index as a test with a sensitivity of 0.9 and specificity of 0.6. For this reason, Youden's index should not be applied to forensic engineering investigations, as there is a high probability of inaccurate diagnostics.

Diagnostic Odds Ratio

The diagnostic odds ratio is an overall measure used to summarise test performance (Glas et al., 2003). It is the positive likelihood ratio divided by the negative likelihood ratio. It is used to estimate the discriminative ability of a diagnostic test procedure and to compare the diagnostic accuracies of two or more diagnostic tests. The diagnostic odds ratio depends on the criteria used to define the failure but not on its prevalence. This measure is useful in forensic engineering when two or more tests are conducted.

The Area Under the ROC Curve (AUC)

AUC is a global measure of diagnostic accuracy (Metz, 1978). A ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The name "Receiver Operating Characteristic" came from part of a field called "Signal Detection Theory", developed during World War II for the analysis of radar images. Radar operators had to decide whether a blip on the screen represented an enemy target, a friendly ship, or just noise. Signal detection theory measures the ability of radar receiver operators to make these important distinctions; their ability to do so was called the Receiver Operating Characteristic. It was not until the 1970s that signal detection theory was recognised as useful for interpreting medical test results. A ROC curve plots True Positive Rate (TPR) against False Positive Rate (FPR) at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both false positives and true positives. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0. AUC is desirable for the following two reasons:

· AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
· AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.

However, both of these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:

· Scale invariance is not always desirable. For example, sometimes we really do need well-calibrated probability outputs, and AUC will not tell us about that.
· Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives versus false positives, it may be critical to minimise one type of classification error.
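To make the measures discussed in this section concrete, the following Python sketch computes them from a single hypothetical 2x2 table of test results against known outcomes, converts a pre-test odds into a post-test odds using the positive likelihood ratio, and estimates AUC as the probability that a failed sample outscores a sound one. All numbers are invented for illustration.

```python
# Sketch (invented numbers) of the diagnostic accuracy measures listed above,
# computed from a hypothetical 2x2 table of test results vs known outcomes.
tp, fn = 45, 5     # failed samples: correctly flagged / missed
fp, tn = 10, 40    # sound samples: falsely flagged / correctly cleared

sensitivity = tp / (tp + fn)                 # 0.90
specificity = tn / (tn + fp)                 # 0.80
ppv = tp / (tp + fp)                         # positive predictive value
npv = tn / (tn + fn)                         # negative predictive value
lr_pos = sensitivity / (1 - specificity)     # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity     # negative likelihood ratio
youden = sensitivity + specificity - 1       # Youden's index
dor = lr_pos / lr_neg                        # diagnostic odds ratio

# Likelihood ratios convert pre-test odds into post-test odds:
pre_test_odds = 0.2                          # assumed prior odds of failure
post_test_odds = pre_test_odds * lr_pos

# A simple AUC: the probability that a randomly chosen failed sample scores
# higher than a randomly chosen sound sample (ties count as half).
def auc(failed_scores, sound_scores):
    wins = sum(1.0 if f > s else 0.5 if f == s else 0.0
               for f in failed_scores for s in sound_scores)
    return wins / (len(failed_scores) * len(sound_scores))

print(sensitivity, specificity, round(ppv, 2), round(npv, 2),
      round(lr_pos, 2), round(youden, 2), round(dor, 1))
print(round(post_test_odds, 2))
print(auc([0.9, 0.8, 0.7, 0.6], [0.65, 0.4, 0.3, 0.2]))
```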

Current International Industry Guidelines

Through a thorough literature search, it has been determined that there is no internationally recognised engineering failure analysis standard. Failure Analysis of Engineering Structures: Methodology and Case Histories (Ramachandran, 2005) does discuss the methodology of failure analysis, including some advanced techniques, but it does not discuss the selection of failure analysis methodologies, nor does it identify a diagnostic accuracy standard that could be applied to the selected methodology. The Proceedings of the Third Forensic Congress (Bosela et al., 2003) contains 55 papers on forensic engineering. Each paper applies engineering principles to investigating failures and performance problems of engineered facilities, and to the development of practices and procedures for reducing the number of future failures; however, at no point is the selection of the failure analysis methodology addressed. There do exist both the ASCE 2003a 'Guidelines for Forensic Engineering Practice' and ASTM E2713-18, Standard Guide to Forensic Engineering. ASCE 2003a is published by the American Society of Civil Engineers and provides the fundamentals of developing a practice that includes forensic engineering, whilst ASTM E2713-18 was developed in 2018 by Subcommittee E58.01 of ASTM International, formerly known as the American Society for Testing and Materials. ASCE 2003a addresses commonly accepted education and experience requirements for forensic engineers, and ASTM E2713-18 is a voluntary consensus technical standard. Both are very US-centric, and neither is an internationally recognised engineering failure analysis standard that would ensure the accurate application of the methodology. Industry groups from many sectors provide information in design guidance documents or standards on how to avoid failures, and provide examples of failure modes and commonly encountered damage mechanisms. These are starting points for individuals to assess potential failure modes and damage mechanisms for use in decision making. One source of information from the American Petroleum Institute (API) on specific types of failure mechanisms commonly found in the oil and gas industry is API RP 571, a recommended practice that provides an in-depth look at over 60 different damage mechanisms that can occur to fixed process equipment in refineries. Another source is the AIChE CCPS Guidelines for Investigating Process Safety Incidents, 3rd Edition (2019). Failure analysis is discussed in various sections where causal analysis is considered. On pages 184 to 187 the specific topic of mechanical failure analysis is discussed; however, that section provides few details on the damage mechanisms that might cause failure.

The National Fire Protection Association (NFPA) 921 (2017) provides only a superficial discussion of failure analysis through fault trees and Failure Modes and Effects Analysis (FMEA), but indicates that it is often required or useful. In Chapter 22, Failure Analysis and Analytical Tools, it acknowledges the requirement for additional tools requiring special expertise and explains the practical application of timelines relating to fault trees in assigning probability to conditions and events. This is caveated with the statement that assigning reliable probabilities to events or conditions is often difficult and may not be possible. In Germany, the Verband Deutscher Metallhandler e.V. (VDM), which translates as the Association of German Metal Traders, has published some circulars internally to its members that offer a step-by-step guide to failure analysis, but this is little more than a basic explanation of an FMEA. The VDI (the Association of German Engineers), however, has published a series of publicly available guidelines, in both German and English, that offer a step-by-step guide to failure analysis, providing definitions of terms and types of failures, and direction in systematically performing failure analyses to ensure the comparability of the results and comprehensible documentation. The scope of VDI Guideline 3822, "Failure analysis; Fundamentals and performance of failure analysis", states:

"The success of a failure analysis depends to a great extent on the care with which it is planned, the type and extent of the individual steps in the investigation process, as well as the quality of their performance. In order to systematically assess experience gained through failure analyses, and to share this experience with others, uniformity is required. Hence, the purpose of this guideline is to:

· provide definitions of terms;
· designate and describe types of failure in a uniform manner;
· provide direction in systematically performing failure analyses;
· ensure the comparability of the results obtained by different analytical laboratories; and
· establish requirements for comprehensible documentation."

The VDI 3822 Guideline is a very detailed guideline for engineering failure analysis and is recognised at least in Germany, Austria, Switzerland and some other European countries; however, it is still not recognised as an international standard. Similarly, TapRoot, CAST and Apollo are methodologies that compete with VDI, and although they are somewhat holistic in nature, they are commercial tools that require payment to the entities that developed them, which reiterates that there is no universal standard. Risk of bias and concerns about applicability are the two key components of QUADAS-2, a quality assessment tool for diagnostic accuracy studies applied to the health industry. Although not an international standard, the development of QUADAS-2 was led by a team based at the Bristol Medical School: Population Health Sciences at the University of Bristol, and it forms the basis for the Cochrane Database of Systematic Reviews (CDSR), which was established in 1993 to organise medical research findings in such a way as to facilitate evidence-based choices about the health interventions faced by health professionals, patients and policy makers (Whiting et al., 2011).


Further investigation of other scientific fields that conduct diagnostic tests requiring a level of regulation identified the field of medicine as having an internationally accepted standard (Šimundić, 2009). Originally released in 2003 and updated in 2015, the Standards for Reporting of Diagnostic Accuracy Studies 2015, or STARD 2015, was inspired by the Consolidated Standards of Reporting Trials (CONSORT) statement for reporting randomised controlled trials.

Standards for Reporting of Diagnostic Accuracy Studies 2015 (STARD 2015)

Background to STARD

The British Medical Journal (BMJ) published an explanation and elaboration of STARD 2015 in BMJ Open (Cohen et al., 2016). Prompted by the need for readers of study reports to be informed about study design and conduct in sufficient detail to judge the trustworthiness and applicability of the study findings, the STARD statement (Standards for Reporting of Diagnostic Accuracy Studies) was developed to improve the completeness and transparency of reports of diagnostic accuracy studies. STARD contains a list of essential items that can be used as a checklist by authors, reviewers and other readers to ensure that a report of a diagnostic accuracy study contains the necessary information. Comprising a 30-item checklist (see Annex 1), STARD 2015 describes what is expected from authors in developing sufficiently informative study reports.

The STARD 2015 30 essential items (Enclosure 1)

Based on the need for complete reporting, by removing bias, variability and other issues, the STARD 30 essential items list assists scientists in writing fully informative study reports, and helps peer reviewers, editors and other readers in verifying that submitted and published manuscripts of diagnostic accuracy studies are sufficiently detailed.

Discussion

At present, the outcome of failure analysis is an opinion, albeit a considered one, based on the best facts available, informed by a series of non-standardised diagnostic tests and by experience. An engineering failure analysis diagnostic accuracy test standard is required to make the process repeatable, so that the results can be peer reviewed, lessons learned, and the outcome made objective. STARD 2015 was developed to improve the completeness and transparency of reports of diagnostic accuracy studies. It could be said that doctors only provide opinions, because ultimately a medical diagnosis is a medical opinion. Sometimes they are wrong. They try to base their diagnosis on as much fact as possible, but there have been cases observed where some medical opinions are based on insubstantial evidence (Hanscom, 2018). As observed in failure analysis, in the medical field a preliminary diagnosis is usually made by a doctor on the basis of a relatively cursory physical examination (weight, blood pressure, discussion with the patient, etc). The doctor then prescribes diagnostic tests to assess the patient.

The diagnostic tests seem to fall into two main categories: exploratory (inductive) and hypothesis-driven (hypothetico-deductive). Exploratory tests can be things like blood tests to see what the general condition of the patient is like. The equivalent of these in failure analysis is non-destructive testing (NDT) and condition assessment. The doctor examines a "non-failed" body to see if anything looks unhealthy and should be looked at further. The patient may feel a bit off-colour but might also just be in for a regular check-up. It could be argued that a prostate exam to check for prostate cancer is really a form of NDT. The second type of test is associated with diagnosing the cause of an illness and is more closely aligned to failure analysis. It starts with the doctor having a theory that the reason a person is suffering certain symptoms is that there is some underlying pathology. The next stage, for example, is a microbiological test to see if the symptoms of a patient are due to malaria or Ross River fever, or bone marrow biopsies are carried out to test the hypothesis that a patient has leukaemia. Bone marrow biopsies would rarely be prescribed unless a doctor already had a strong suspicion that the symptoms indicated that leukaemia was a possibility. This statistical probability reasoning is classic heuristic hypothetico-deductive reasoning. These tests are done to help the doctor to test hypotheses and develop an opinion. The same can be said for failure analysis. When a failure occurs, the failure analyst examines the "patient", gathers background information and develops a working hypothesis concerning the possible mechanisms (and possible causes) of the failure. If done well, this preliminary stage already eliminates many possibilities and will narrow onto a few areas of interest. At this point, the investigator generally has a series of scenarios already in mind, and a series of possible competing opinions is developed. Once these hypotheses start to form, minimally invasive diagnostic tests need to be carried out to test them, which could include:

· Non-destructive testing (i.e. crack testing)
· Tensile tests
· Impact tests
· Chemical analysis
· Microstructural analysis
· Electron microscopy of surfaces and surface deposits
· Water analysis, if it is a corrosion process

Once this series of tests is done, there is an increased probability of being able to answer three key questions:

1. Was the component or system manufactured in accordance with the specifications of the designer? Was the alloy grade correct, with the correct heat treatment and properties?
2. How did the failure occur? What were the stages of failure? This is a matter of establishing the "failure story", that is, developing the story of the progression of the degradation of the component from when it was manufactured to when it failed.
3. Were there any manufacturing defects in, or modifications to, the component, and had these defects or modifications influenced the progression of the degradation?

It is quite possible that after this form of analysis has been conducted, there is still no clearly identified "root cause" of the failure. Failure investigation is a systematic approach to the collection and analysis of evidence and data relating to the failure.

The scope of testing depends on the situation and the available evidence. When all the evidence and data have been analysed, a root cause of failure may be identified, allowing the stakeholders to reduce the likelihood of the failure recurring. Further work may be required, and this may involve simulation and modelling, statistical analysis of large numbers of failures, FMEA, etc., or be incorporated into a more formal RCA process. Having conducted a comprehensive literature review, including reviewing over 132 failure analysis case studies (Table 1), it has become evident that the stage of examining a failed component to answer the three questions outlined above is critical at the early stages of a component failure analysis, because without this sort of investigation the entire efficacy of the analysis is questionable. For example, in one case the investigators carried out a failure analysis of a truck chassis which had cracked solely by carrying out a Finite Element Analysis (FEA) of the truck using design loads. By initiating their investigation with FEA, they made a number of assumptions: first, that the truck was correctly made from the correct material; second, that failure was by fatigue as a result of the normal loads on the truck; and third, that there were no manufacturing defects or modifications to the component. These assumptions MAY be correct, but we do not KNOW, as the early-stage failure analysis work was not done. The analysts have not done the basic investigative work. Taking into consideration that this may not always be possible, where possible it SHOULD be done to set the framework for the future work, or an explanation for omitting it should be provided. These observations, coupled with the three questions above, have led to the development of some investigation guidelines. These are:

1. Develop a complete understanding of how the failure occurred. Establish a timeline in order to understand the failure from the manufacture of the system, through commissioning, qualification and validation, all the way to the failure point. Each observation may be able to be interpreted as being the result of different sources, which makes it complex, but ALL of the observations MUST fit into a coherent failure story. Failed components don't lie. They do not have political agendas. Establishing the failure story is critical to understanding why the failure occurred.
2. Establish a thorough understanding of the manufacturing process of the failed structure, including the design specifications and specifically what materials were used. This will help with the failure story.
3. Establish whether there are any manufacturing defects or whether the component has been modified by any of the stakeholders.

As such, the following divisions concerning the origin of the failure can be applied:

· Design – Everything that was the responsibility of the design team (OEM).
· Manufacture – Everything that was the responsibility of the OEM (including subcontracted manufacturers) in ensuring that the design was correctly realised in a manufactured form and delivered to the end user in that form.
· Use – Everything that was the responsibility of the end user, including method of operation, repairs and unauthorised modifications.
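One possible way of capturing the investigation guidelines and the Design/Manufacture/Use divisions above in a structured, reviewable form is sketched below. The record type, field names and example entries are hypothetical; they simply illustrate a timeline of observations, each attributed to one origin division.

```python
# Hypothetical sketch of a structured record for the investigation guidelines
# above: a timeline of observations, each attributed to one origin division.
from dataclasses import dataclass, field
from typing import List

ORIGINS = ("Design", "Manufacture", "Use")   # divisions described above

@dataclass
class Observation:
    when: str            # point on the failure timeline, e.g. "commissioning"
    finding: str
    origin: str          # one of ORIGINS

@dataclass
class FailureRecord:
    component: str
    timeline: List[Observation] = field(default_factory=list)

    def add(self, when: str, finding: str, origin: str) -> None:
        if origin not in ORIGINS:
            raise ValueError(f"unknown origin division: {origin}")
        self.timeline.append(Observation(when, finding, origin))

    def by_origin(self, origin: str) -> List[Observation]:
        return [o for o in self.timeline if o.origin == origin]

# Invented example entries for a notional component.
record = FailureRecord("conveyor drive shaft")
record.add("manufacture", "weld repair not recorded in OEM documentation", "Manufacture")
record.add("service, year 3", "unauthorised bracket welded to shaft", "Use")
for obs in record.by_origin("Use"):
    print(obs.when, "-", obs.finding)
```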


Table 1: Case Studies by Industry Division

INDUSTRY DIVISIONS*                                       COUNT   PERCENTAGE
A — Agriculture, Forestry and Fishing                       0       0%
B — Mining                                                 17      13%
C — Manufacturing                                          37      28%
D — Electricity, Gas, Water and Waste Services             33      25%
E — Construction                                            6       5%
F — Wholesale Trade                                         0       0%
G — Retail Trade                                            0       0%
H — Accommodation and Food Services                         0       0%
I — Transport, Postal and Warehousing                      27      20%
J — Information Media and Telecommunications                0       0%
K — Financial and Insurance Services                        0       0%
L — Rental, Hiring and Real Estate Services                 0       0%
M — Professional, Scientific and Technical Services         1       1%
N — Administrative and Support Services                     0       0%
O — Public Administration and Safety                        1       1%
P — Education and Training                                  0       0%
Q — Health Care and Social Assistance                       8       6%
R — Arts and Recreation Services                            2       2%
Total                                                     132     100%

*Australian and New Zealand Standard Industrial Classification (ANZSIC), 2006 (Revision 1.0)

The usefulness of an opinion is heavily dependent on the questions asked by the stakeholders. This paper focuses on the need for an internationally recognised failure analysis standard, which is important if the efficacy of the diagnostic test is to rely on reporting accuracy and diagnostic accuracy. Without a diagnostic accuracy standard, however, the current failure analysis process is subjective, and therefore an opinion. This raises the question of whether it is possible to develop a method for evaluating the efficacy of an opinion and to rate the opinion on some form of scale. It first of all presupposes that there is an underlying truth that can be known, against which to match the opinion. Even if that "truth" can be determined absolutely, how much use is a partial answer? What other factors should be considered in developing this rating? Speed of analysis? Cost? What use is a low-cost failure analysis that is carried out in a short period of time if it is completely wrong? There are occasions when incomplete failure analyses may be of significance. Often it is not possible to divorce the analysis from the questions asked by the stakeholders. Forensic engineers are often presented with a series of well-crafted questions during legal cases. The lawyers do not always want an opinion on the cause of failure, but sometimes want specific questions answered. For instance, there was a case where the investigation was examining a crack in a rail which led to a derailment. The question asked was "Can you determine if the crack was at a size that should have been detected at the last inspection?" They weren't interested in the cause of the crack or the root cause of the failure. They were only interested in working out whether the inspection company commissioned to detect cracks had missed this one.

They weren't even interested in an accurate model of how fast the crack grew. All they really wanted to know was whether the crack could have grown from an undetectable size to final fracture in the time between the last inspection and the failure. A reasonable upper-bound crack growth rate estimate was perfectly acceptable. Ultimately, what the client wanted was an opinion, and the purpose of the modelling in this case was to provide support to that opinion. Experience has shown that only a small proportion of clients really want a full root cause analysis of a failure. Most have an underlying question or questions which they want answered. If this question can be answered short of a full root cause failure analysis, then they should be satisfied. Questions are usually "Can I claim a warranty?", "Am I, or is my client, liable for this incident?", "Could this failure have been reasonably foreseen?", "What can I do to mitigate the incident happening again?" These are all commercially reasonable questions, and if one or more of them is answered by the analysis, is that not a successful outcome, even if the full root cause of the failure is not established? So, how do we frame the idea of determining a measure for the efficacy of a failure analysis, bearing in mind that a failure analysis is currently a subjective opinion and not an experimental observation? How applicable are the diagnostic efficacy methods discussed above to the assessment of opinions when there is not a structured, codified way of assessing what success is in a failure investigation? In practice, the absolute determination of the root cause of failure is often not necessary if the questions asked by the client are answered satisfactorily. There is a wide spectrum of requirements in this matter. The military are happy to allocate significant resources to investigating the minutiae associated with a failure, which is not unreasonable if the cost of a failure can be a billion dollars and the loss of life of a pilot. The other end of the scale is usually constrained by lack of data. There is rarely Health Usage Monitoring (HUM) data available for drill rigs drilling blast holes in quarries. Sometimes there are not even operational hours logged. There are also limited budgets and time constraints. In these situations, the best result expected is an opinion that can help the client resolve a dispute. Significant effort is expended to make those opinions as accurate as possible. In the current environment of failure analysis, the methods need to be better articulated and test criteria need to be established. The efficacy of a failure analysis could be established by setting a baseline of self-consistency within the analysis and the way this is tested, i.e. to what extent do the diagnostic tests carried out in the analysis support one another and the general hypothesis? For instance, if we are trying to establish the grade of a material, the chemical analysis of the metal is (for a metallurgist) the primary tool we would use, but even this can be ambiguous at times. If, however, there are also metallographic tests and hardness tests, and we are told what the grade should be, then each of these would support (and possibly add to) the identification of the grade and provide a check against the chemical analysis.
An error in the chemical analysis determination (for instance, reporting the wrong result or misreading the output of the machine) could be picked up through the other tests. Although this is a simple example, it is important to ask the question "If … occurred, then we should see X, Y and Z". It is unsatisfactory to rely only on one observation to come to a conclusion. Therefore, a competent failure analysis should have a series of internal checks in which the main conclusions are supported by a series of independent observations, and should not be reliant on only one observation. If this process of internal checking throws up something that does not fit the hypothesis, the hypothesis must be re-thought. At present, this way of thinking is not explicit in reports. Therefore, looking at a system in which these internal checks are made explicit could be a way of helping to assess the efficacy of a failure analysis.
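A minimal sketch of how such internal checks might be made explicit is given below. The class and function names are hypothetical; the example mirrors the material-grade cross-check described above, where a reported chemistry that conflicts with the metallographic and hardness results is flagged for re-examination.

```python
# Hypothetical sketch of making the "If ... occurred, then we should see
# X, Y and Z" internal checks explicit. A conclusion resting on a single
# observation, or contradicted by another test, is flagged for re-examination.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Hypothesis:
    name: str
    predicted: List[str]   # observations this hypothesis says we should see

def assess(hyp: Hypothesis, observations: Dict[str, bool]) -> str:
    support = [o for o in hyp.predicted if observations.get(o) is True]
    conflict = [o for o in hyp.predicted if observations.get(o) is False]
    if conflict:
        return (f"{hyp.name}: contradicted by {conflict}; "
                "re-check the hypothesis or the conflicting test result")
    if len(support) < 2:
        return f"{hyp.name}: rests on a single observation; add independent checks"
    return f"{hyp.name}: supported by independent observations {support}"

# Mirror of the material-grade example in the text: the reported chemistry is
# off-grade, but the microstructure and hardness both fit the specified grade,
# so the internal check flags a possible error in the chemical analysis.
observations = {
    "chemical analysis matches the specified grade": False,
    "microstructure consistent with the specified grade": True,
    "hardness within the specified range": True,
}
correct_grade = Hypothesis(
    "Component made from the specified grade",
    list(observations.keys()),
)
print(assess(correct_grade, observations))
```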

does not fit the hypothesis, the hypothesis must be re-thought. At present, this way of thinking is not explicit in reports. Therefore, looking at a system in which these internal checks are made explicit could be a way of helping assess the efficacy of a failure analysis. Conclusions Most failure analysts have never thought about how the selection and rigour of failure analysis methodology has a direct impact on the accuracy and expediency of the decision-making process. It has been established that no internationally recognised standard of diagnostic accuracy or guideline exists for engineering failure analysis. Coupled with the scarcity of literature investigating the impact of the optimised methodology selection, has therefore led to the hypothesis that currently a diagnostic methodology is; a) b)

chosen by the analyst without due diligence, and applied according to their interpretation,

which in turn yields a diagnostic decision. Consequently, the methodology chosen is then subsequently applied. The result is a methodology that may not be the most accurate or expedient, which in turn has the potential to increase the risk and cost of the decision-making process. Subsequently, it has been established that there is a need for an engineering failure analysis diagnostic accuracy standard that: 1. Is repeatable so that the results can be peer reviewed and lessons learned; 2. Can be quality assured by being regulated and conducted in accordance with the standard; and 3. Can mitigate risk of wrong diagnosis by ensuring that all procedural steps were executed correctly. Now that the need for robust guidelines for a diagnostic accuracy standard for engineering failure analysis has been established, the first stage is to develop a set of guidelines, and then apply them to the selection of an engineering failure analysis methodology. Once these guidelines have been used and accepted by the engineering fraternity, an international standard does need to be established. Acknowledgments The authors would like to thank Prof Jeff Wong and Dr Mark Cheong from the School of Chemical and Biomedical Engineering, Nanyang Technological University, Singapore, for their contribution towards allowing the principal author to testing of ideas with their students. In addition, the authors would also like to thank Jerry L. Jones of Advanced Engineering Resources, Inc. for his valuable contribution towards current guidelines, anonymous reviewers who have assisted to improve the quality and integrity of this paper and Richard Tree, Vice President (North America), Commissioning Agents, Inc. who afforded the Principal Author time to develop this paper. 19

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

1. Anderson, B. & Fagerhaug, T. (2000). Root Cause Analysis - Simplified Tools and Techniques.
2. AIChE CCPS Guidelines for Investigating Process Safety Incidents, 3rd Edition (2019).
3. API 571 Damage Mechanisms Affecting Fixed Equipment in the Refining Industry.
4. ASM Handbook, Volume 11 (2002): Failure Analysis and Prevention, Metals Handbook, Ninth Edition. ASM International. Edited by Becker W.T. & Shipley R.J. ISBN 978-1-62708-180-1.
5. ASTM E2713-18, Standard Guide to Forensic Engineering, ASTM International, West Conshohocken, PA, 2018.
6. Bosela, P. A., Delatte, N. J. & Rens, K. L. (2003) ASCE 2003b. Forensic engineering: proceedings of the Third Congress, October 19-21, 2003, San Diego, California. Published by ASCE, Reston, VA.
7. Brooks, Charlie R. & Choudhury, Ashok (2002). Failure analysis of engineering materials. McGraw-Hill, New York.
8. Cherry, B. W. (2002). Don't try to Save your Bacon - Hypothetico-Deductive Methodology of Failure Analysis. In J. Price (Ed.), Failure Analysis, Proceedings (pp. 1-8). Melbourne, Vic, Australia: The Institute of Materials Engineering Australasia.
9. Cohen JF, Korevaar DA, Altman DG, et al (2016). STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open 2016;6:e012799. doi: 10.1136/bmjopen-2016-012799.
10. Cooke, N.J. & Durso, F. (2007) "Stories of Modern Technology Failures and Cognitive Engineering Successes", 1st Edition, CRC Press, published September 19, 2007. ISBN 9780805856712.
11. Davies, T. (2017) "Why You Should Forget The Awesome, For Now." Published on LinkedIn, April 23, 2017.
12. Davis, R. (2004) "Case-Based Reasoning." MIT Computer Science and Artificial Intelligence Laboratory, Knowledge Based Application Systems, Lecture 17, 13 Apr 2006.
13. Del Frate, L. (2013) "Failure: Analysis of an Engineering Concept." Simon Stevin Series in the Philosophy of Technology.
14. Del Frate, L., Zwart, S.D. & Kroes, P.A. (2011). "Root cause as a U-turn." Engineering Failure Analysis 18, pp 747–758.
15. Edwards, M. & Lewis, P. (2007) "Forensic engineering: Modern methods", The Open University, 9 May 2007.
16. Edwards, W. (1954). The theory of decision making. Psychological Bulletin, 51(4), 380-417.
17. Eiselstein, L. E. and R. Huet (2011). Chapter 1: Corrosion Failure Analysis with Case Histories. Uhlig's Corrosion Handbook, Third Edition. R. W. Revie. New York, John Wiley & Sons, Inc.
18. Fisk, D. (2011). "The 2008 financial collapse: Lessons for engineering failure." Engineering Failure Analysis 18 (2011) 550–556.
19. Fox, J. (2015) From "Economic Man" to Behavioral Economics. Harvard Business Review, pp. 78-85, May 2015.
20. Fry, P.B. (2002) "Loads and Stresses – The Real Cause Of Failures In Surface Mining Machinery." WBM Consulting Engineers, Spring Hill, Brisbane, 4000.
21. Gigerenzer, G. & Gaissmaier, W. (2011) "Heuristic Decision Making." Annual Review of Psychology, Vol. 62:451-482 (volume publication date January 2011).
22. Gigerenzer, G. & Todd, P. M. (1999). Simple heuristics that make us smart. New York: Oxford University Press.
23. Gigerenzer, G. (1991). How to make cognitive illusions disappear: Beyond 'heuristics and biases'. In W. Stroebe & M. Hewstone (Eds.), European review of social psychology (Vol. 2, pp. 83–115). Chichester, UK: Wiley.
24. Gigerenzer, G. (1996). On narrow norms and vague heuristics: A reply to Kahneman and Tversky. Psychological Review, 103(3), 592–596.
25. Gigerenzer, G. (2013) "Simple Heuristics That Make Us Smart", Max Planck Institute for Human Development, Berlin.
26. Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M. & Woloshin, S. (2007). Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest, 8(2), 53–96.
27. Gladwell, M. (2005) Blink: The Power of Thinking Without Thinking. New York: Little, Brown and Co., 2005.
28. Glas, A.S., Lijmer, J.G., Prins, M.H., Bonsel, G.J. & Bossuyt, P.M. (2003). The diagnostic odds ratio: a single indicator of test performance. Journal of Clinical Epidemiology, 56(11), 1129-35.
29. Hammond, J.S., Keeney, R.L. & Raiffa, H. (1998) "The Hidden Traps in Decision Making." Harvard Business Review, Sept-Oct 1998.
30. Hanscom, R., Small, M. & Lambrecht, A. (2018) Diagnostic Accuracy: Room for Improvement. Coverys, 13 March 2018.
31. Heintze, N. & Jaffar, J. (1991) "Set-Based Program Analysis (Extended Abstract)." School of Computer Science, Carnegie Mellon University, Pittsburgh, PA & IBM Thomas J. Watson Research Center, Yorktown Heights, NY. 1 Jan 1991.
32. Hoffrage, U. & Gigerenzer, G. (1998) "Using natural frequencies to improve diagnostic inferences." Academic Medicine, 73(5):538-40, May 1998.
33. Kahneman, D. & Frederick, S. (2005). A model of heuristic judgment. In K. J. Holyoak & R. G. Morrison (Eds.), The Cambridge handbook of thinking and reasoning (pp. 267–293). New York: Cambridge University Press.
34. Kahneman, D. & Tversky, A. (1996). On the reality of cognitive illusions: A reply to Gigerenzer's critique. Psychological Review, 103, 582–591.
35. Kahneman, D. (2003) "A Perspective on Judgement and Choice - Mapping Bounded Rationality." American Psychologist, September 2003.
36. Kletz, T. A. (1999) "The origins and history of loss prevention." Process Safety and Environmental Protection 77.3 (1999): 109-116.
37. Koffka, K. (1922). Perception: An introduction to the gestalt theory. Psychological Bulletin 19:531-585.
38. Lees, F. (2012). Lees' Loss Prevention in the Process Industries: Hazard Identification, Assessment and Control. Butterworth-Heinemann, 2012.
39. Lewis, G. L. (2003) ASCE 2003a. Guidelines for forensic engineering practice, Technical Council on Forensic Engineering, Reston, VA.
40. Marewski, J.N. & Gigerenzer, G. (2012) "Heuristic decision making in medicine." Dialogues in Clinical Neuroscience, Vol. 14, No. 1, 2012, pages 77-89.
41. Mascarenhas, S. (2010) "Case-Based Reasoning." Aprendizagem Simbolica e SubSimbolica, 2010.
42. McGee, S. (2002). "Simplifying likelihood ratios." Journal of General Internal Medicine, 17(8): 647–650. doi:10.1046/j.1525-1497.2002.10750.x. ISSN 0884-8734. PMC 1495095. 1 Aug 2002.
43. Metz, C.E. (1978) "Basic principles of ROC analysis." Seminars in Nuclear Medicine, Volume 8, Issue 4, October 1978, pages 283-298.
44. Munasque, A. (2009) "Thinking about Thinking - Heuristics and the Emergency Physician." EMNow's ACEP Scientific Assembly Edition: October 2009.
45. NFPA 921 (2017) Guide for Fire and Explosion Investigations.
46. Papadopoulos, Y., Walker, M., Parker, D., Rüde, E., Hamann, R., Uhlig, A., Grätz, U. & Lien, R. (2011), "Engineering failure analysis and design optimisation with HiP-HOPS." Engineering Failure Analysis 18, pp 590–608.
47. Popper, K. (1963) Conjectures and Refutations. Harper and Row.
48. Ramachandran, V. (2005). Failure Analysis of Engineering Structures: Methodology and Case Histories. ASM International.
49. Reeves, R. (2015). Misbehaving: The Making of Behavioural Economics by Richard H. Thaler review – why don't people pursue their own best interests? The Guardian, 4 July 2015.
50. Resnick, M. (2014) "Gigerenzer-Kahneman debate on Decision Making." Ergonomics in Design: The Quarterly of Human Factors Applications, 15 July 2014.
51. Schiff, G.D., Kim, S., Abrams, R., Cosby, K., Lambert, B., Elstein, A.S., Hasler, S., Krosnjar, N., Odwazny, R., Wisniewski, M.F. & McNutt, R.A. (2005) "Diagnosing Diagnosis Errors - Lessons from a Multi-institutional Collaborative Project." Advances in Patient Safety: Vol. 2, pp 255-278, Feb 2005.
52. Shaikh, S.A. (2011) Measures Derived from a 2 x 2 Table for an Accuracy of a Diagnostic Test. J Biomet Biostat 2:128. doi:10.4172/2155-6180.1000128.
53. Simon, H. A. (1957) Models of Man, Social and Rational: Mathematical Essays on Rational Human Behavior in a Social Setting. New York: John Wiley and Sons, 1957.
54. Šimundić, A-M. (2009) Measures of Diagnostic Accuracy: Basic Definitions. EJIFCC 2009;19(4):203-211.
55. Stengel, D. & Porzsolt, F. (2006) Efficacy, Effectiveness, and Efficiency of Diagnostic Technology. In: Porzsolt F., Kaplan R.M. (eds) Optimizing Health: Improving the Value of Healthcare Delivery. Springer, Boston, MA.
56. Svenson, O. (1979) "Process descriptions of decision making." Organizational Behavior and Human Performance, Volume 23, Issue 1, February 1979, pages 86-112.
57. Thaler, R. (2015). Misbehaving: The Making of Behavioural Economics. New York: W. W. Norton & Company.
58. Tversky, A. & Kahneman, D. (1973) "Availability - A Heuristic for Judging Frequency and Probability." Cognitive Psychology 1973, 4, 207-232.
59. Tversky, A. & Kahneman, D. (1979) "Prospect Theory: An Analysis of Decision Under Risk." Econometrica, Vol. 47, No. 2 (Mar. 1979), pp. 263-292.
60. Tversky, A. & Kahneman, D. (1981) "The Framing of Decisions and the Psychology of Choice." Science, New Series, Vol. 211, No. 4481 (Jan. 30, 1981), pp. 453-458.
61. Unal, I. (2017) "Defining an Optimal Cut-Point Value in ROC Analysis: An Alternative Approach." Computational and Mathematical Methods in Medicine, vol. 2017, Article ID 3762651.
62. Verbitsky, D.E. (2012) "Improving reliability and profitability using the systemic failure analysis methodology during reliability stress testing." Conference paper, SAE 2012 World Congress and Exhibition, Detroit, MI, United States, 24-26 April 2012.
63. Wang, J. (2012). "A novel data analysis methodology in failure analysis." Conference proceeding: 2nd International Conference on Intelligent Systems Design and Engineering Applications, ISDEA 2012, Sanya, Hainan, China, 6-7 January 2012.
64. Whiting, P., Rutjes, A., Reitsma, J., Bossuyt, P. & Kleijnen, J. (2003) The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Medical Research Methodology 2003;3:25.
65. Whiting, P.F., Rutjes, A.W.S., Westwood, M.E., Mallett, S., Deeks, J.J., Reitsma, J.B., Leeflang, M.M., Sterne, J.A.C. & Bossuyt, P.M.M. (2011) QUADAS-2: A Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies. Ann Intern Med 155(8):529-536, 2011.
66. Wilkes, G. (2015). Review of 'Misbehaving: The Making of Behavioural Economics' by Richard Thaler. The Financial Times, 16 May 2015.
67. Wilson, K.M., Helton, W.S. & Wiggins, M.W. (2013). "Cognitive engineering." Wiley Interdisciplinary Reviews: Cognitive Science, 4(1): 17–31. doi:10.1002/wcs.1204. PMID 26304173.
68. Wulpi, D.J. (1985) Understanding How Components Fail. American Society for Metals, 1985.


ANNEX 1

Table 1: STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration (Cohen JF, Korevaar DA, Altman DG, et al, 2016)

TITLE OR ABSTRACT
1. Identification as a study of diagnostic accuracy using at least one measure of accuracy (such as sensitivity, specificity, predictive values, or AUC)

ABSTRACT
2. Structured summary of study design, methods, results, and conclusions (for specific guidance, see STARD for Abstracts)

INTRODUCTION
3. Scientific and clinical background, including the intended use and clinical role of the index test
4. Study objectives and hypotheses

METHODS
Study design
5. Whether data collection was planned before the index test and reference standard were performed (prospective study) or after (retrospective study)

Participants
6. Eligibility criteria
7. On what basis potentially eligible participants were identified (such as symptoms, results from previous tests, inclusion in registry)
8. Where and when potentially eligible participants were identified (setting, location and dates)
9. Whether participants formed a consecutive, random or convenience series

Test methods
10a. Index test, in sufficient detail to allow replication
10b. Reference standard, in sufficient detail to allow replication
11. Rationale for choosing the reference standard (if alternatives exist)
12a. Definition of and rationale for test positivity cut-offs or result categories of the index test, distinguishing pre-specified from exploratory
12b. Definition of and rationale for test positivity cut-offs or result categories of the reference standard, distinguishing pre-specified from exploratory
13a. Whether clinical information and reference standard results were available to the performers/readers of the index test
13b. Whether clinical information and index test results were available to the assessors of the reference standard

Analysis
14. Methods for estimating or comparing measures of diagnostic accuracy
15. How indeterminate index test or reference standard results were handled
16. How missing data on the index test and reference standard were handled
17. Any analyses of variability in diagnostic accuracy, distinguishing pre-specified from exploratory
18. Intended sample size and how it was determined

RESULTS
Participants
19. Flow of participants, using a diagram
20. Baseline demographic and clinical characteristics of participants
21a. Distribution of severity of disease in those with the target condition
21b. Distribution of alternative diagnoses in those without the target condition
22. Time interval and any clinical interventions between index test and reference standard

Test results
23. Cross tabulation of the index test results (or their distribution) by the results of the reference standard
24. Estimates of diagnostic accuracy and their precision (such as 95% confidence intervals)
25. Any adverse events from performing the index test or the reference standard

DISCUSSION
26. Study limitations, including sources of potential bias, statistical uncertainty, and generalisability
27. Implications for practice, including the intended use and clinical role of the index test

OTHER INFORMATION
28. Registration number and name of registry
29. Where the full study protocol can be accessed
30. Sources of funding and other support; role of funders

(Cohen JF, Korevaar DA, Altman DG, et al STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration BMJ Open 2016;6:e012799. doi: 10.1136/bmjopen-2016-012799)
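Several of the STARD items above (for example items 14, 23 and 24) concern how measures of diagnostic accuracy are estimated from a cross tabulation of index test results against a reference standard. As a purely illustrative sketch, and not part of STARD or of any existing failure analysis standard, the snippet below shows how the common 2 x 2 measures discussed in the diagnostic accuracy literature cited here (sensitivity, specificity, predictive values, likelihood ratios and the diagnostic odds ratio) are calculated; the counts are invented, and the simple normal-approximation confidence interval is only one of several possible choices.

```python
# Illustrative calculation of common 2 x 2 diagnostic accuracy measures.
# The counts below are invented for illustration only.

import math

def diagnostic_measures(tp, fp, fn, tn):
    """Return a dict of accuracy measures from a 2 x 2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)          # positive predictive value
    npv = tn / (tn + fn)          # negative predictive value
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    dor = lr_pos / lr_neg         # diagnostic odds ratio (cf. Glas et al., 2003)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": ppv,
        "NPV": npv,
        "LR+": lr_pos,
        "LR-": lr_neg,
        "DOR": dor,
    }

def wald_ci(successes, total, z=1.96):
    """Simple normal-approximation 95% confidence interval for a proportion."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

if __name__ == "__main__":
    tp, fp, fn, tn = 40, 10, 5, 45  # hypothetical counts
    for name, value in diagnostic_measures(tp, fp, fn, tn).items():
        print(f"{name}: {value:.2f}")
    print("sensitivity 95% CI:", wald_ci(tp, tp + fn))
    print("specificity 95% CI:", wald_ci(tn, tn + fp))
```

In an engineering failure analysis setting, the "index test" could be, for example, a routine non-destructive inspection and the "reference standard" the subsequent destructive confirmation of a defect; the arithmetic transfers directly, which is part of the argument for adapting such guidelines to engineering failure analysis.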

