Reliability Engineering and System Safety 35 (1992) 1-11
Empirical evaluation of THERP, SLIM and ranking to estimate HEPs
Bernhard Zimolong
Ruhr University of Bochum, Bochum, FRG
(Received 15 May 1990; accepted 15 August 1990)
The most important techniques for assessing human reliability in risky technologies (such as chemical and manufacturing industries or nuclear power plants) all use human error data for tasks to estimate the overall reliability of the system. Because empirical data are not usually available, these assessment techniques use judgments made by subject-matter experts about the likelihood of human error in task performance. An experimental study was carried out to compare and evaluate three different estimation techniques: the technique for human error rate prediction (THERP), success-likelihood index methodology (SLIM) and a rank ordering procedure. Human error probabilities (HEPs) were derived empirically under 12 different task conditions in a batch manufacturing scenario. THERP was applied to estimate the overall failure probability. Results indicate a satisfactory match between empirical HEPs and THERP estimates. Six judges with backgrounds in human factors and/or mechanical engineering assessed the likelihood of failures, using SLIM and rank ordering. Results indicate a poor match between empirical HEPs and their estimates. The reasons are analyzed and suggestions for improved estimation techniques are outlined.

INTRODUCTION

Numerous risks are associated with complex technologies. These risks vary in complexity and emanate from three different domains of the technologies' operations: systems, hardware and people. Estimating industrial risks in the people domain (that is, in human errors or human reliability) is much more affected by a fundamental methodological difficulty than estimations of risk in the domains of hardware or systems. This difficulty is due to the fact that empirical data from which to derive frequency-based probability estimates of risks in the people domain are often meager or not available.
Furthermore, the likelihood that such data will be available in the foreseeable future is quite small, despite the fact that considerable efforts have been undertaken to establish domain-dependent human error data banks.1 This methodological difficulty poses a two-part problem to analysts: the estimation of human risks must be accomplished through subjective estimation techniques, and the meaningfulness of risk estimates must be evaluated in the absence of empirical or other objective criteria.
Reliability Engineering and System Safety 0951-8320/91 / $03-50 © 1991 Elsevier Science Publishers Ltd, England.
This paper deals with the first problem: the estimation of risks. Two general approaches are available for eliciting expert judgments about risks: holistic and decomposition methods. 2 Holistic methods assume that experts can provide reasonable estimates of the likelihood of success or failure of any given human task. The task requires experts to make global assessments of the error probability of a given event or action. The way they go about processing relevant information is generally unavailable to outside observers. A typical example is a ranking procedure to rank order error probabilities of given tasks. A variety of holistic elicitation techniques have been developed. They cover formal estimation techniques such as ranking/rating procedures, and informal evaluations such as the derivation of experts' opinions. All the techniques are variations of psychological scaling methods, well known in the literature. The most frequently used of these techniques have been reviewed by Seaver & Stillwell. 3 A number of studies have shown (see, e.g. Ref. 4) that holistic judgment has two sources of inconsistency: in the application of weights attributed to the constituent components, and in the aggregation of information across the components. 5 In this study, a
simple ranking procedure is used to estimate the rank order of human error probabilities (HEPs). Decomposition techniques assume that experts are better suited to assessing not the given task as a whole, but the separate elements constituting a task: features of the task, task environment, and conditions under which the task must be accomplished. By separating a problem into its constituent components and by assessing each component separately, the principal features of the problem and the process of assessing each are revealed. More specifically, the decomposition technique must provide an appropriate composition rule and assign a set of numbers to the decomposed judgments of task features, elements or factors, which must then be aggregated into an index. The specification usually results in some kind of linear model. The linear model and deviations from linearity have been discussed by Slovic et al. 6 Just as was the case with holistic techniques, numerous decomposition techniques for application in business and industry are available. A review of the most widely used of these can be found in Humphreys & Wisudha, 7 and in Zimolong & Rohrman 8. Here, we focus on the technique for human error rate prediction (THERP 9) and on the success likelihood index methodology (SLIM). Both techniques will be employed to derive appropriate HEPs.
PROBABILISTIC RISK ASSESSMENT

Probabilistic risk assessment (PRA) is a systematic identification of the levels of damage from plant operations and a quantitative assessment of the likelihood of such occurrences. Basically, the assessments are derived from the likelihood of technical and human errors. The most widespread application of a judgment procedure to obtain human error probability estimates in industrial settings is THERP. Other methods, such as the human cognitive reliability model 10 and the maintenance personnel performance simulation model, 11 have been discussed by Svenson. 12 THERP has been used frequently in PRAs of nuclear power plant operations in the USA and in Europe. The human error probability is defined as the ratio of the number of errors to the number of opportunities for each type of error of interest in a task. The overall HEP of a task or a sequence of tasks is calculated from the error probabilities of the elements of a given task or the individual task, respectively. As one approach to quantify HEPs due to missing empirical data, holistic judgements of HEPs were usually obtained from just a few experts who were thoroughly familiar with the tasks and the relevant working conditions. Their opinions were elicited informally and pooled to arrive at the extrapolated error probability. Judgements were
based on information from tasks that most nearly resembled the task in question, and the magnitude of the uncertainty bounds of the errors was adjusted in accordance with the judged similarities or differences between the tasks. 9 In summary, data on HEPs from the 'Handbook of Human Reliability Analysis' are primarily based on extrapolations from similar tasks or on holistic, informal judgments of the expert opinions of the authors.

To conduct a PRA in process industries, the calculation of the overall HEP from the nominal HEPs of the subtasks listed in the handbook follows a decomposition approach. Basically, the task under consideration is broken up into elementary units. For each unit the appropriate HEP is obtained from the tables in the handbook, a decision is made if and how to modify the nominal HEP according to the influence of the most relevant performance shaping factors (PSF) in the task studied, and finally, the HEPs of the units are recombined in fault and event trees to model tasks and relevant parts of the complete system. The probability of success or failure to perform the task is derived by means of probability calculus.

In modeling human performance for PRA, it is important to consider those factors that have the most effect on performance. According to the taxonomy derived by Swain & Guttmann, 9 some of the PSFs are external to the person and some are internal. External PSFs include the entire work environment, particularly the task and equipment design. Internal PSFs represent the individual characteristics of the person, his/her skills, motivations, and expectations that influence his/her performance. Other kinds of PSF are psychological and physiological stressors that result from a work environment in which the demands placed on the human by the system do not conform to his/her capabilities and limitations. Amongst others, some of the stressors are task speed, task load and fatigue.
Swain & Guttmann 9 recommend approximating the effects of PSFs and stressors by applying modifying factors to the nominal HEPs of the task elements. However, only a few modifiers are available from the handbook. As an example, the modifiers for the combined effects of stress and experience level range from factor 2 (very low task load, novice) to factor 10 (extremely high stress, novice; pp. 20-32, Table 20-16). A 'misclassification' between two adjacent categories may lead to an error of twice to five times the true value. With regard to findings in probabilistic decision making, it seems very doubtful if expert judgment is a solution to the problem of missing adjustment factors. This research indicates that people systematically violate principles of rational decision making when judging probabilities, making predictions, or otherwise attempting to cope with probabilistic data. Experts' judgments appear to be prone to many of the
same biases as those of laymen, particularly when experts are forced to go beyond the limits of available data and rely on intuition. An event is judged likely or frequent if it is easy to imagine or to recall under relevant circumstances. Small probabilities are likely to be overestimated, while those familiar to the judge are strongly underestimated. 13 There is substantial literature on biases in probability judgments, and the results demonstrate people's systematic misconceptions about how probabilities are generated and combined normatively. 14 In addition, the expertise of 'experts' varies considerably in degree. Shanteau 15 summarized characteristics of expert judges, and suggested that three levels should be used: naive judges (laymen); novice judges, who have considerable experience and skills; and experts, who have been performing the judgment task for a long time (more than 10 years). Svenson 12 critically summarizes and evaluates expert judgments in safety analyses in process industries.

The SLIM technique tries to overcome these difficulties through the use of a decomposition judgment method to generate the weights of the PSFs. SLIM is formally identical to the simple multiattribute rating technique (SMART) introduced by Edwards.16 It serves as an easily applicable tool within multi-attribute utility theory (MAUT 2). Embrey et al. 17 applied the SLIM technique to scale task success likelihood as a function of PSFs of tasks performed in nuclear industrial settings. It assumes that experts can assess the extent to which PSFs influence the outcome of human actions and that these assessments can be accomplished with numerical ratings. The ratings can then, after proper weighting, be combined into an overall index that reflects the expert's expectation of whether the task will be completed successfully or not.
The resulting index is called the Success Likelihood Index (SLI) and is derived from a linear combination of the weighted PSFs of the given task. A major problem of the method lies in the identification of the most important and relevant PSFs in the task studied, which is usually accomplished by expert evaluation. Another assumption states the independence of PSFs on the contribution of human performance. It is also assumed that the influence on the modification of HEPs can be estimated by assigning a linear additive function of products of the PSF weights and ratings. The linearity assumption is based on theoretical evidence and has not been proven empirically. The objectives of the present study are to validate HEP estimates of judges, to compare one holistic estimation technique with two decomposition methods and to reveal the major components of the underlying judgmental process. First, HEPs were empirically generated with a simulated batch manufacturing scenario and matched with the estimates of judges.
THERP, SLIM and a rank ordering procedure were applied to derive the estimates. Secondly, an analysis was made of the judgment process. It included the comparison of the estimated weights of the PSFs as well as an analysis of the linear and nonlinear contributions of the PSFs to the overall HEPs of the tasks under consideration.
METHOD

Overview

The experiment has two parts. First, the empirical HEPs of an operator's task under 12 task conditions were derived from a simulated batch manufacturing scenario. A total of 48 subjects served as operators in this part of the experiment. Secondly, six judges estimated the HEPs of the 12 task conditions by using the SLIM and, thereafter, the ranking procedure. Task analysis and probability calculus required by THERP were conducted by the author after the experiment, but before the results were computed.

Operators

Subjects were 48 undergraduate students in psychology at Ruhr University, Bochum, FRG. Subjects received course credits for serving as operators in the experiment.

Apparatus

The simulation program FACTORY 18 was run to generate a discrete part manufacturing scenario, for example, a flexible manufacturing system (FMS). This program includes an automated routing and transportation system for six workstations, such as press machines and/or CNC lathes. The individual characteristics of each machine were adjusted with respect to processing time per part and expected lives of the machine tools. The tool-wear state, which was defined as the tool's probability of a breakdown, was controlled by the parts completion rate. A tool failure could occur only when a tool's actual processing life exceeded 80% of its expected life. The tool wear states of the machines were updated continuously on the screen. A detailed description of the simulation and its underlying FORTRAN program is available. 18 One application example covering risk-taking behavior of operators in FMS is given by Zimolong. 19

Procedure

Subjects had to machine as many parts as possible. The frequency of parts completed was mainly affected by the speed of the machining cycles and the decision
of the operator to shut down the machine for scheduled maintenance. The beginning of scheduled maintenance was indicated on the lower display, which displayed the tool wear state. If the critical wearout of 80% was reached, the operator was required to shut down the machine for maintenance. A limit mark on the display indicated a wearout of 80%. However, it had no specially drawn bar or mark. Beyond the limit the probability of a breakdown increased sharply as a power function of the machining cycles completed. As a consequence, a breakdown of a machine could only occur if the operator failed to close down the machine in time for scheduled maintenance. By definition, a breakdown was considered a failure caused by human action. The HEP value was computed as the ratio of the number of failures to the number of opportunities, i.e. to the frequency of critical wearouts (80%) per session. Operators usually delayed scheduled maintenance in order to get the part which was in the machine completed. As appears in the set-up in Fig. 1, the upper horizontal bar graph displayed the state of machining cycles completed. The operator had to decide whether to stop the machine for scheduled maintenance and to lose the part or to take an increased risk into account and wait another few cycles for the part to be completed. In case of a breakdown, the part would be lost and the operator would have to accept a 'repair time' interval which was four times longer than a 'maintenance' interval.
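The HEP definition used in the procedure (failures divided by opportunities, where an opportunity is one critical wearout of 80%) can be sketched in a few lines of Python. The function name and the session counts below are hypothetical and serve only as an illustration; they are not data from the experiment.

```python
def human_error_probability(failures: int, opportunities: int) -> float:
    """Return the HEP as the ratio of failures to opportunities for error."""
    if opportunities <= 0:
        raise ValueError("opportunities must be positive")
    if not 0 <= failures <= opportunities:
        raise ValueError("failures must lie between 0 and opportunities")
    return failures / opportunities


# Hypothetical session (numbers invented for illustration): the tool wear
# state reached the 80% limit 25 times, and the operator let the machine
# break down on 2 of those occasions.
session_hep = human_error_probability(failures=2, opportunities=25)
print(f"HEP = {session_hep:.3f}")  # prints "HEP = 0.080"
```

A condition with no breakdowns at all yields an HEP of exactly zero under this definition.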
Fig. 1. The set-up of the simulated batch manufacturing scenario. Machines 2 to 5 are running; 1 and 6 are idle due to maintenance and tool wear failure, respectively. Degree of completion of each of the parts is indicated in the upper horizontal bar-graphs. The tool wear state is shown in the lower graphs. Machine 2 was checked by the operator and indicates a tool wear state of 95% (upper-right window). The indicative numbers are in the center of each machine. Figures in the waiting lines below indicate frequency of parts to be processed. The overall measure of performance is the number of parts completed and appears on the upper right side of the display.
Task conditions

Three PSFs were varied in the experimental conditions under which the frequencies of human errors were measured: motivation (M), task load (T) and information load (I). Motivation and task load had two levels, information load had three levels, resulting in a 2 × 2 × 3 (M × T × I) experimental design with repeated measures on the last factor. Subjects practiced for 30 min. They were equally assigned to the experimental conditions and counterbalanced with respect to risk-taking behavior, which was measured with a 30-min test run of the simulation program. This procedure proved to be superior to a paper-and-pencil test of risk-taking behavior. 19 The experimental session lasted 1 h and 40 min for each subject.

Motivation

This internal PSF was varied as the type of incentive at two levels. Subjects were urged to machine as many parts as possible by allowing no machine failures to occur (M1). In the second condition (M2), subjects got another course credit worth 30 German marks if they happened to machine more than 978 parts, which was the average rate from a pretest.

Task load

Variation of task load was introduced as a modification of a psychological stress factor at two levels. Subjects had to monitor the menu information of the tool wear state of each machine and the corresponding display indicating the machining cycles completed (T1). In the second condition (T2), an additional time-critical task was required. Subjects had to fill out a checklist before they could shut down the machine for maintenance. The checklist required subjects to write down the indicative number of the machine, actual degree of utilization and degree of wearout at this very moment.

Information load

This external PSF was modeled as a function of the speed of the machining cycles and subsequently of the number of maintenance intervals required by the machine.
The frequency of maintenance intervals was programmed as a stochastic process with lower and upper boundaries: two machines (2 and 5) reached the critical states 20 to 29 times per session (I1), two of them (1 and 4) 10 to 19 times (I2) and the remaining machines (3 and 6) 1 to 9 times (I3). Information load included the following cognitive requirements for the operator: check-reading quantitative indications from the displays, comparing information of two bar-graphs
out of 12 graphs, assessing the probability of a breakdown, assessing the time for completion of the part in the machine, decision making, and selection of the right pushbutton on the panel to carry out the decision.

JUDGMENT PROCESS

Judges

Six judges were asked to estimate the relative frequency of human errors under the 2 × 2 × 3 = 12 task conditions. They used two different estimation techniques, the SLIM technique and a ranking method. With respect to the undergraduate students serving as subjects and the tasks to be accomplished, emphasis was given to selecting judges who had experience with students at the university and a strong background in human factors. The six practitioners serving as judges all had teaching and/or research experience of 2 to 4 years at the Ruhr University, were trained in human factors and had a degree either in industrial psychology or in production engineering. Their ages varied from 26 to 32 years; their professional experience from 1 to 3 years. They all were knowledgeable about batch production in small and medium-sized companies; in particular, they were familiar with the tasks of operators of CNC machines. Judges had no experience in making HEP judgments. Therefore, emphasis was given to explaining carefully and demonstrating the experimental conditions as well as to practicing under these conditions.

Judgmental procedure

Judges all practiced for at least 20 min under the experimental conditions. With respect to the SLIM procedure, judges were asked to assess the generic importance (numerical weight between 1 and 100) of each PSF on the operator's error frequency. The Delphi technique was employed to get a convergence of these estimates. In a second step, an individual rating was made of the specific influence of each PSF with reference to the specific task condition under consideration, resulting in a total of 3 × 12 = 36 ratings per judge. A detailed description of the procedure is given by Embrey. 20 With respect to the second estimation procedure, the rank ordering technique, judges performed a simple ranking of the 12 task conditions with respect to the assumed error frequency. THERP required a task analysis, identification of types of error, and allocation of the adjusted nominal HEPs from the tables of the handbook to reveal the overall human error probability. The procedure was conducted by the author. Results were cross-checked and corrected by a second individual, who was familiar with THERP.

RESULTS

Empirical HEPs
The empirically derived HEPs for the 12 task conditions and the estimated HEPs derived from a regression equation of the SLIs are summarized in Table 1. Empirical HEPs vary from 0.00 to 0.098. Under the information load condition I3, zero failure was obtained. The HEPs in the remaining two conditions I1 and I2 vary between 0.019 and 0.098. The reliability of the obtained HEPs was computed with the split-half technique. The split-half reliability of the HEPs is r = 0.88 (n = 24), which is significant at the 1% level. The range of individual HEPs in group A is between 0.006 and 0.08; in group B between 0.03 and 0.12. The allocation of subjects to the two equal parts A and B (n = 24) followed the time order of their participation in the experiment.

A 2 × 2 × 2 ANOVA on the empirical HEPs with repeated measurements on the last factor yielded significant main effects of all PSFs: Motivation (F(1,44) = 55.51; p < 0.01); Task load (F(1,44) = 4.21; p < 0.05) and Information load (F(1,44) = 60.13; p < 0.01). The M × I load interaction was also statistically significant (F(1,44) = 12.05; p < 0.05). As a result, all PSFs had statistically significant effects on the number of HEPs and therefore on the reliability of the task performance. The data at least suggest that a nonlinearity between M and I load exists in the sense that HEPs are relatively more increased under the high M condition as compared with the low incentive group.

Table 1. Empirical HEPs of 12 task conditions and corresponding SLI estimates (SLI-Est.) of six judges (PSFs: I = information load (I1 to I3); M = motivation (M1, M2); T = task load (T1, T2).)

Experimental condition    I1       I2       I3
M1 T1                     0.066    0.034    0.000
  SLI-Est.                0.040    0.029    0.022
M1 T2                     0.098    0.047    0.000
  SLI-Est.                0.055    0.044    0.022
M2 T1                     0.023    0.022    0.000
  SLI-Est.                0.038    0.035    0.022
M2 T2                     0.042    0.019    0.000
  SLI-Est.                0.057    0.053    0.022

The extent to which the experimental treatments actually account for variance in the dependent variable (HEP) is given by the η² coefficient. 21 The
sample proportions, which can be interpreted as the empirical weights of the PSFs, were calculated independently for the 'within' and 'between' factors and were then normalized to account for 100% of the variance of both designs. The weights are 0.41 for the M factor, 0.08 for T load, 0.39 for I load and 0.08 for the M × I interaction. The three factors and the interaction thus account for a total of 96% of the variance in the empirical HEPs observed. The results demonstrate that the M factor is the most influential factor in this experimental design, followed by I load, which was modeled as the frequency of scheduled maintenance intervals. The least influential factor is T load, which was varied as a time-critical additional task. PSFs considerably affected the HEP values: the upper value 0.098 of the task condition M1 T2 I1 (low motivation, additional task, most frequent maintenance intervals) is five times the lower value of 0.019 (M2 T2 I2, strong motivation, additional task, medium frequency of maintenance intervals).

THERP
THERP was employed to compute the overall human error probability for the batch manufacturing scenario. Efforts were also made to compute the HEPs of the 12 task conditions. However, no information was available, neither from the literature nor from experience, as to how to adjust nominal HEPs for the influence of the experimental variables, i.e. the levels of the PSFs and their interactions, respectively. A task analysis of the operator's activities was conducted and revealed the five steps and possibilities (A1 to A5) of types of error:

A1. Selection of a critical unannunciated display from a row of six physically similar displays (wearout displays): selection error.
A2. Check-reading analog information from the wearout display: check-reading error. Check-reading means reference to a display merely to see if the pointer is within allowable limits; no quantitative reading is taken.
A3. Check-reading analog information from the machining cycles display: check-reading error.
A4. Decision-making: comparison of both displays checked, assessment of the probability of a breakdown, assessment of the machining time of the part to be completed: error in decision making.
A5. Selection of the wrong pushbutton on the keyboard from an array of six similar appearing pushbuttons: control-activation error. Pushbuttons were arranged directly below the displays and followed a 1:1 mapping.
Nominal HEPs and uncertainty bounds (UCBs) from the tables of the handbook were assigned to the failure limbs of the event tree (see steps A1-A5 in Table 2). UCBs are estimates of the spread of HEPs associated with a log normal distribution. They include 90% of the true HEPs for a given task or activity. UCBs are presented in Table 2 as error factors (EF), which correspond to the square roots of the ratio of the upper to lower bounds. Subjects in this experiment were treated as novices with respect to the modifiers for nominal HEPs. The general stress level, although modified by the experimental variable task load, was rated between low and optimum by the author and his coworker, who conducted the simulation study. For step A4, decision making, no HEP values were available. However, HEP modifiers for dynamic tasks were helpful. According to Swain & Guttmann, 9 'dynamic tasks require a higher degree of man-machine interaction, such as decision-making, keeping track of several functions, controlling several functions, or any combination of these' (pp. 20-32). Nominal HEPs were modified for the effects of stress, experience level and decision making by a factor of two as recommended in Table 20-16, #3.

To arrive at the overall failure probability, the exact failure equation involves summing the probabilities of all failure paths in the event tree. When all the HEPs are 0.01 or smaller, the exact failure equation can be approximated by summing only the primary failure paths, ignoring all the success limbs. The total failure probability $F_T$ for the simulation experiment is

$$F_T = A_1 + A_2 + A_3 + A_5 = 0.002 + 0.006 + 0.006 + 0.002 = 0.016$$

The upper bound of the failure probability becomes

$$U_B = 0.05$$

The lower bound becomes

$$U_L = 0.005$$

The average HEP value from 12 tasks of the simulation study is HEP = 0.029, with upper and lower bounds 0.098-0.00. As compared to the THERP estimate, the empirical failure probability is only approximately twice the THERP value. If the modification factor for stress, experience and decision making is shifted to 3, which is a value between optimum and moderately high stress, the total failure probability becomes

$$F_T^* = 0.024, \qquad U_B = 0.07, \qquad U_L = 0.008$$

which matches the empirical HEP value with some accuracy. (See Table 3.)
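The arithmetic behind the two THERP totals can be retraced with a short Python sketch. It sums the modified nominal HEPs of the primary failure paths (the approximation described in the text) and, for comparison, evaluates 1 minus the product of the success probabilities. The latter assumes that the four steps form an independent series in which any single error fails the task; that is an assumption made here, not a statement about the author's event tree. The bound calculation (total multiplied and divided by an error factor of 3) is likewise only an illustrative guess that happens to land near the reported bounds; it is not the handbook's bound-propagation procedure.

```python
from math import prod

# Nominal HEPs of the failure limbs that enter the sum (A4 has no tabled HEP).
NOMINAL_HEP = {"A1": 0.001, "A2": 0.003, "A3": 0.003, "A5": 0.001}
ASSUMED_ERROR_FACTOR = 3  # assumption: EF of 3 applied to the total


def approx_failure_probability(modifier: float) -> float:
    """Sum of the primary failure paths, ignoring all success limbs."""
    return sum(hep * modifier for hep in NOMINAL_HEP.values())


def series_failure_probability(modifier: float) -> float:
    """1 - product of success probabilities (independent series assumed)."""
    return 1.0 - prod(1.0 - hep * modifier for hep in NOMINAL_HEP.values())


for modifier in (2, 3):
    total = approx_failure_probability(modifier)
    series = series_failure_probability(modifier)
    upper = total * ASSUMED_ERROR_FACTOR
    lower = total / ASSUMED_ERROR_FACTOR
    print(f"modifier x{modifier}: approx={total:.3f}, series={series:.4f}, "
          f"upper~{upper:.3f}, lower~{lower:.4f}")
```

With HEPs this small, the summed approximation and the series value differ only in the fourth decimal place, which is why ignoring the success limbs is acceptable here.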
Table 2. Type of errors and HEP values. Summary of the task analysis with failure limbs A1 to A5, HEPs, error factors (EF) and modifiers for the effects of stress, experience and dynamic tasks (Source: tables of the handbook 9)

Failure limb    Estimated HEP        Source                    EF and source
A1              0.001 × 2 = 0.002    T20-9, #3; T20-16, #3     3; T20-9, #3
A2              0.003 × 2 = 0.006    T20-11, #4; T20-16, #3    3; T20-11, #4
A3              0.003 × 2 = 0.006    T20-11, #4; T20-16, #3    3; T20-11, #4
A4              -                    T20-16, #13               -
A5              0.001 × 2 = 0.002    T20-12, #3; T20-16, #3    3; T20-12, #3

A1. Selection error. Operator fails to select correctly the critical unannunciated display during the course of check-reading quantitative indications from the displays.
A2. Check-reading error. Operator fails to check correctly the bar-graph for scheduled maintenance. The display has no specially drawn limit mark to indicate acceptable readings.
A3. Check-reading error. Operator fails to check-read correctly the bar-graph for machining cycles completed.
A4. Error in decision making. Comparison of two displays, evaluation of the probability of a breakdown, decision making. No HEPs are available; nominal HEPs in steps A1-A3, A5 are adjusted with respect to the dynamic nature of the task.
A5. Control-activation error. Operator inadvertently selects wrong pushbutton on the panel from an array of six similar controls arranged in a well-delineated functional group.
Table 3. Empirical human error probabilities and upper/lower bounds from simulation study; corresponding values from SLIM and THERP with two different modifiers for stress, experience and decision making

                        Overall failure      Upper        Lower
                        probability (FT)     bound (UB)   bound (LB)
Simulation study        0.029                0.098        0.000
SLIM                    0.037                0.057        0.022
THERP (modifier: ×2)    0.016                0.050        0.005
THERP (modifier: ×3)    0.024                0.070        0.008
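As a cross-check, the simulation-study and SLIM rows of Table 3 can be recomputed from the Table 1 values as the mean, maximum and minimum over the 12 task conditions. The sketch below assumes that the flattened Table 1 is read column by column (I1, I2, I3) within each motivation block; the dictionary layout and the function name are illustrative choices, not part of the original analysis.

```python
from statistics import mean

# Table 1 values keyed by (motivation, task load); tuples are (I1, I2, I3).
EMPIRICAL_HEP = {
    ("M1", "T1"): (0.066, 0.034, 0.000),
    ("M1", "T2"): (0.098, 0.047, 0.000),
    ("M2", "T1"): (0.023, 0.022, 0.000),
    ("M2", "T2"): (0.042, 0.019, 0.000),
}
SLI_ESTIMATE = {
    ("M1", "T1"): (0.040, 0.029, 0.022),
    ("M1", "T2"): (0.055, 0.044, 0.022),
    ("M2", "T1"): (0.038, 0.035, 0.022),
    ("M2", "T2"): (0.057, 0.053, 0.022),
}


def summarize(table):
    """Return (mean, upper bound, lower bound) over all 12 task conditions."""
    values = [value for row in table.values() for value in row]
    return mean(values), max(values), min(values)


for label, table in (("Simulation study", EMPIRICAL_HEP), ("SLIM", SLI_ESTIMATE)):
    overall, upper, lower = summarize(table)
    print(f"{label}: overall={overall:.3f}, upper={upper:.3f}, lower={lower:.3f}")
```

Rounded to three decimals, the means come out at 0.029 and 0.037, in agreement with the first two rows of Table 3.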
SLIM

Consistent with the SMART procedure, the SLIs are obtained by taking the product of each PSF weight with its associated rating for the specific task condition concerned, and then summing the products:

$$\mathrm{SLI}_j = \sum_i w_i R_{ij}$$

where $\mathrm{SLI}_j$ = the combined utility of the various PSFs in enhancing the likelihood of success for task j (in this experiment, the SLIs range from 0 to 100, where 0 indicates a high probability of success (no error), and 100 a high probability of failure); $w_i$ = the normalized importance weight for the ith PSF (normalized so the weights of the PSFs sum to 1.0); and $R_{ij}$ = scale value on an equal interval scale of the jth task on the ith PSF. The conversion of the resultant SLI value to an error probability, the calibration, is recommended with the following logarithmic equation: 20

$$\log P = a\,\mathrm{SLI} + b$$

where P = probability of success (1 - P = HEP) and a and b are empirically derived constants.

The weighting assessments of the three PSFs were carried out independently by the six judges. They were then reviewed by the group as a whole and in some cases the judges modified their individual assessments. No attempt, however, was made to force a consensus. The normalized generic weights of PSFs are indicated in Table 4.

Table 4. Generic weights of PSF

Judge     Motivation    Task load    Information load
A         0.47          0.16         0.37
B         0.09          0.48         0.43
C         0.28          0.32         0.40
D         0.28          0.40         0.32
E         0.08          0.77         0.15
F         0.27          0.67         0.06
Median    0.28          0.44         0.35

Table 5. … and normalized estimated weight of three performance shaping factors (PSF) on the human …

The interjudge consistency of the generic weights was calculated from the average of the Kendall rank
order coefficient τ. It is the correlation between six sets of rankings and the criterion ranking. The result was not significant (τ = -0.22; k = 6; n.s.) and shows the inconsistency of the judges when assessing the generic weights of the PSFs on the likelihood of human errors.

The rating assessments of the 12 tasks were made for each of the PSFs separately. The SLI was calculated on the basis of the individually elicited PSF weights multiplied by the rating values. Finally, the median of all the judges' SLI values was calculated to give an estimate of the overall SLI for each task condition. The correlation coefficient, as a degree of association between HEPs and SLIs, is r = 0.54 (n = 12, p < 0.05). A logarithmic regression equation does not give better results (r = 0.53, n = 12, p < 0.05), contrary to the recommendation of Embrey et al. (Ref. 17). The regression equation was calibrated with two empirically derived HEPs. The estimated HEPs are summarized in Table 1. The overall failure probability of the SLIM estimates (Table 3) was computed as the mean of the 12 estimates derived from the calibrated regression equation. In contrast to the THERP estimates, which were computed without any empirical HEP adjustment, the close match of the overall SLIM estimate is basically a result of the empirical calibration.

The calculation of the interjudge consistency of the SLIs with an ANOVA, using the individual SLIs as dependent variables, resulted in r = 0.75 (n = 12, p < 0.01). Given the inconsistencies in the judgements of the generic weights, the strong agreement among judges concerning the SLIs is quite unexpected. It indicates that most of the variability of SLIs can be attributed to the differences between the task conditions, not between judges. The estimates of the judges account for only 29% of the variance of the empirical HEPs (r = 0.54, so r squared is about 0.29). A better understanding of this result can be reached by analyzing the empirical and estimated weights of the PSFs.
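The SLI computation and calibration described above can be sketched in a few lines. This is a minimal illustration, not the procedure as implemented in the study: the two weight vectors are those of judges A and B in Table 4, but the ratings, the two anchor tasks and the choice of a linear calibration line (HEP = a * SLI + b) are assumptions made only for the example, and the function names `sli` and `two_point_calibration` are hypothetical.

```python
from statistics import median


def sli(weights, ratings):
    """Weighted sum of PSF ratings for one judge and one task condition."""
    return sum(w * r for w, r in zip(weights, ratings))


def two_point_calibration(sli_a, hep_a, sli_b, hep_b):
    """Return (a, b) of the line HEP = a * SLI + b through two anchor tasks."""
    a = (hep_b - hep_a) / (sli_b - sli_a)
    b = hep_a - a * sli_a
    return a, b


# Generic weights (motivation, task load, information load) of judges A and B, Table 4.
weights = {"A": (0.47, 0.16, 0.37), "B": (0.09, 0.48, 0.43)}
# Hypothetical ratings on the 0-100 scale for one task condition (not from the paper).
ratings = {"A": (20, 50, 80), "B": (30, 40, 70)}

# One SLI per judge, then the median across judges as the task-level SLI.
judge_slis = [sli(weights[j], ratings[j]) for j in weights]
task_sli = median(judge_slis)

# Hypothetical anchor tasks with known empirical HEPs: SLI 20 -> 0.02, SLI 80 -> 0.08.
a, b = two_point_calibration(20, 0.02, 80, 0.08)
estimated_hep = a * task_sli + b

print(f"task SLI = {task_sli:.1f}, estimated HEP = {estimated_hep:.4f}")
```

Because the line is forced through the two anchor tasks, the estimates necessarily agree with the empirical values at those points, which is one way to read the remark that the close overall match of SLIM is largely a product of the calibration.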
The η² values of the empirical HEPs are listed in Table 5, together with the medians of the estimated normalized weights of the judges. The only weight closely matched by the judges is
Table 5 (values; weights on the human error probability)

PSF                    Eta squared (η²)   Estimated weight (median)
Motivation (M)         0.41               0.26
Task load (T)          0.08               0.41
Information load (I)   0.39               0.33
M x I                  0.08               -
Sum                    0.96               1.00
that of information load, while the weight of the task load is strongly overestimated and that of motivation heavily underestimated. The agreement of the judges on the significance of the three factors accounting for the variance of HEPs was analyzed with a 2 x 2 x 2 within-subjects ANOVA. The ANOVA on the SLIs of six judges resulted in one significant main effect: task load (F(1,5) = 10.10; p < 0.05). The remaining two factors and the interactions were not significant. Thus, the results indicate that the judges did not weight the PSFs and their interactions according to the empirical findings on the HEPs. They agreed only on the significance of the task load factor, which actually had the least impact on the variance of the HEPs, and did not agree on the significance of the remaining two factors and the interaction effects.
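The comparison between estimated and empirical weights can be reproduced with a short calculation. The sketch below assumes that the estimated weights in Table 5 are the Table 4 medians renormalized to sum to 1.0; this is an inference from the reported numbers rather than something the text states. The variable names are illustrative.

```python
# Medians of the judges' generic weights as reported in Table 4.
median_weights = {"motivation": 0.28, "task load": 0.44, "information load": 0.35}
# Eta-squared values of the three main effects as reported in Table 5.
eta_squared = {"motivation": 0.41, "task load": 0.08, "information load": 0.39}

# Renormalize the medians so that they sum to 1.0 (assumed origin of Table 5).
total = sum(median_weights.values())
normalized = {psf: round(w / total, 2) for psf, w in median_weights.items()}

# Positive difference: the judges overestimate the factor; negative: they underestimate it.
difference = {psf: round(normalized[psf] - eta_squared[psf], 2) for psf in normalized}

for psf in normalized:
    print(f"{psf:17s} estimated {normalized[psf]:.2f}  "
          f"empirical {eta_squared[psf]:.2f}  difference {difference[psf]:+.2f}")
```

Under that assumption the renormalized medians come out as 0.26, 0.41 and 0.33, and the differences show the pattern described in the text: task load is overestimated by roughly 0.33, motivation is underestimated by roughly 0.15, and information load is off by only about 0.06.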
Rank ordering
A further analysis was conducted for the ranking procedure of the 12 task conditions with respect to the HEP estimates. The correlation between the estimated and the empirical rank number of HEPs is a Spearman rank correlation coefficient of rho = 0.38 (n = 6; n.s.); the interjudge consistency derived from an ANOVA for rank data gives an average correlation coefficient of r = 0.31 (n = 6; n.s.). The ranking procedure yielded poor results: the rankings of the judges did not match the empirical HEP rank order of the 12 task conditions. Additionally, individual judges did not agree with the rank orders of the other judges.
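For readers unfamiliar with the statistic, the Spearman rank correlation used above can be sketched as follows. The helper `spearman_rho` is a hypothetical name, the closed-form expression applies only to rankings without ties, and the two rank vectors are invented for illustration; they are not the judges' rankings from the study.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation for two untied rankings of the same items."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    # Sum of squared rank differences between the two rankings.
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: empirical rank order of 12 task conditions and one judge's ranking.
empirical_ranks = list(range(1, 13))
judged_ranks = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]

rho = spearman_rho(empirical_ranks, judged_ranks)
print(f"rho = {rho:.3f}")
```

A judge who only swaps neighbouring conditions, as in this invented ranking, still obtains a coefficient close to 1, so a value as low as the reported 0.38 implies substantial reordering of the task conditions.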
DISCUSSION
The simulation study is an attempt to validate HEP estimation of human performance reliability by an experimental approach. The results demonstrate that the experimental variation of three PSFs--motivation, task load and information load--significantly changed
HEP values of required tasks. The range of HEPs is between 0.098 and 0.019; the upper value is about five times the lower value. Each of the PSFs accounts for part of the variance in the HEP values: motivation for 41%, information load for 39%, and task load for 8%. There also exists a significant interaction between motivation and information load, which accounts for 8% of the variance. Although the generated HEPs are well within the range of industrial activities, absolute HEP values of task performance from the FACTORY simulation program cannot be generalized without modifications to real life operations. This is primarily due to the artificial task situation, and to the limited qualification and expertise developed by the operators.

The THERP procedure yielded close estimates of the overall failure probabilities. In contrast to SLIM, no calibration process of HEP estimates has been taken into account. Estimates of THERP are based solely on absolute figures from handbook tables. Within the realm of this experiment, the question as to how reliable this outcome proves to be cannot be answered. Possible error sources, which affect the reliability of the procedure, are omission of steps in the task analysis, false assessments of the nominal HEPs and false assessment of the modifiers. For example, the results demonstrate that a modifier of three, which accounts for moderately high stress, would predict the empirical HEPs better than the modifier of two selected by the judges. Another, more important error source is the set of HEP values from the handbook, which are based on holistic expert estimates and are not empirically validated. Possible error sources in the application of THERP are discussed by Svenson (Ref. 12). The decomposition approach of THERP requires the identification of task units and the application of a HEP value to each unit.
The procedure makes it nearly impossible to cope with higher level cognitive processes, such as making a decision whether to stop a machine for maintenance or to continue the production process. One problem is the decomposition of the cognitive process into units. Even in this relatively simple case, it is doubtful whether subjects take into account the probability of a breakdown and the completion time of the part as assumed in Table 2. It is not likely that existing databanks will provide anything other than examples of higher level cognitive errors. As a consequence, the authors of a report from the OECD Nuclear Energy Agency (Ref. 22) concluded that the only feasible alternative is to use the judgments of individuals who have experienced higher level cognitive errors in a plant or in a simulation study. The results of the SLIM procedure indicate, however, that individual judgments are a solution to the problem of missing data only if the knowledge base of judges can be considerably improved.
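Returning to the THERP decomposition discussed above, the effect of a stress modifier on an overall failure probability can be sketched as follows. This is a simplified illustration under stated assumptions, not the task analysis of the study: the task units are assumed to be independent and all required for success, the same modifier is applied to every unit, and the nominal HEPs are hypothetical values rather than handbook entries. The function name `overall_failure_probability` is likewise illustrative.

```python
def overall_failure_probability(nominal_heps, modifier=1.0):
    """Failure probability of a task whose units must all succeed.

    Assumes independent task units; each nominal HEP is multiplied by the
    modifier and capped at 1.0.
    """
    p_success = 1.0
    for hep in nominal_heps:
        adjusted = min(1.0, hep * modifier)
        p_success *= 1.0 - adjusted
    return 1.0 - p_success


# Hypothetical nominal HEPs for three task units (not taken from the handbook).
nominal_heps = [0.01, 0.003, 0.001]

p_x2 = overall_failure_probability(nominal_heps, modifier=2)
p_x3 = overall_failure_probability(nominal_heps, modifier=3)

print(f"modifier x2: {p_x2:.4f}")
print(f"modifier x3: {p_x3:.4f}")
print(f"ratio x3/x2: {p_x3 / p_x2:.2f}")
```

For small probabilities the overall estimate scales almost linearly with the modifier, so moving from a modifier of two to a modifier of three raises the estimate by a factor of roughly 1.5. This shows how strongly the analyst's choice of modifier drives the THERP result.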
The SLIM procedure yielded a disappointing match between experts' estimates and empirical HEPs. The correlation of r = 0.54 accounts for only 29% of the variance of the HEPs. The much higher correlation coefficients of r = 0.98 (n = 6) and r = 0.71 (n = 18) obtained with the same technique by Embrey (Ref. 23) and Embrey et al. (Ref. 17) could not be reproduced. This may be because the experience and skill of the judges in this study are not comparable with those of experts who have been working on jobs for a couple of years. Typically, these experts can extract more information from task conditions with which they have developed a high degree of familiarity. Perhaps more practice by the judges on the experimental tasks would have produced better estimates. The data suggest either a linear or a logarithmic relationship between SLIs and HEPs, although a linear equation gave a slightly better fit.

In any case, the SLIM procedure proved to be superior to the holistic rank ordering procedure as far as HEP estimates and the interjudge consistency of the ratings are concerned. These findings support the results of comparison studies performed by Comer et al. (Ref. 24) and Embrey et al. (Ref. 17). Expert judges used holistic direct estimation techniques to assess the very same tasks evaluated in the SLIM decomposition technique. The ratings between judges derived from SLIM were consistently more stable than those derived from direct estimation techniques. No data, however, were reported in the two studies on the validity of these techniques, because no empirical HEPs were available. The validity data of this study indicate a poor performance of the judges despite the fact that they were fully familiar with the tasks under consideration.
The analysis of the weighting procedure revealed at least two main reasons: the improper assessment of the generic weights of the PSFs as compared with their empirical weights, and the disagreement among the judges about how to assess the actual contribution of the PSFs under the different task conditions. One way to improve the estimation procedure, therefore, is to provide the judges with empirically derived generic weights, or at least with a rank order of generic weights of PSFs. The problems usually encountered with this approach are the poor availability of empirical data and the specific interaction pattern of PSFs, which will be discussed later. Some severe constraints also arise from key assumptions of the SLIM technique. The relative weights and the significance of each attribute influencing the course of action were determined separately by the judges, while the generic weighting procedure for the PSFs was performed consensually. As Rosa & Humphreys (Ref. 25) pointed out, this procedure is not theoretically optimal, especially with key assumptions
of monotonicity and preference independence fundamental to rigorous applications of MAUT. The monotonicity assumption requires that each PSF be scaled in such a way that larger numerical values imply greater preference at the levels they index. Sometimes, elicited preferences on a PSF may not be monotonically related to increasing numerical assessments on the scale, such as on a preference scale for stress or workload. To overcome these constraints, the SLIM-MAUD approach was introduced by Embrey et al. (Ref. 17). MAUD (multi-attribute utility decomposition) is a thoroughly developed and tested decision making system (Refs 26 and 27). It comprises a structure for decomposing and organizing expert judgments, based on an interactive microcomputer program that aids in the elicitation and assessment of choice alternatives. MAUD automatically detects violations of the monotonicity assumption and rescales the values in such a way that the monotonicity assumption is met. If the preference independence assumption is violated, MAUD drops the critical factors and restructures the remaining ones in such a way that the retained factors meet the independence condition. The method adopted for testing preference independence within MAUD is based on testing for strong conditional utility independence between each factor in the set and all other factors in turn (Ref. 25).

The empirical data obtained suggest at least one interaction effect, between the factors motivation and information load. Although in this case the interaction accounts for only 8% of the variance, other factor combinations not introduced in this study could probably account for more variance. Under these circumstances, a linear equation of PSFs would be inappropriate to predict the outcome of the empirical HEPs. The linear model can be made sensitive to the interaction effect by incorporating a cross-product term into the regression equation of the HEP.
Examples are provided by Slovic & Lichtenstein (Ref. 28) and Winterfeldt & Edwards (Ref. 2). The proper decomposition of the task into crucial task elements which are believed to influence the overall likelihood of human error is usually done by experts. The preferred procedure for this specification is through a consensus process by the judges performing the evaluation. There are also computer programs available, such as MAUD, to assist judges in the elicitation process. The subjective decomposition procedure, however, raises questions about the representativeness of the factors elicited by the judges. From research on human error and reliability, various frameworks of contributing factors to human errors have emerged (e.g. Refs 9 and 29). However, as critical evaluations of such frameworks reveal (e.g. Ref. 30), there is no agreement on what the major contributing factors to human error are, how they
affect each other and, more specifically, how they interact in the environment under study. Most of the frameworks are based on plausible evidence and, with a few exceptions (e.g. Ref. 9), no quantitative data about the weights of the PSFs are available. Where such data exist, they are based on experience and intuition. This leads back to the problem of missing data, which cannot be solved without further studies on the weight and contribution of PSFs in industrial settings.
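The cross-product term mentioned in the discussion of the motivation by information load interaction can be illustrated with a small sketch. It assumes a balanced 2 x 2 x 2 design with each PSF coded -1/+1; in that special case the design columns are orthogonal, so each least-squares coefficient is simply the mean of column value times response. The coefficients used to generate the responses are hypothetical and are not the empirical HEPs of the study; `fit_effect_coded` is an illustrative helper name.

```python
from itertools import product


def fit_effect_coded(cells, responses):
    """Least-squares coefficients for a balanced -1/+1 coded factorial design.

    The columns of a full factorial design coded -1/+1 are orthogonal, so each
    coefficient equals the mean of (column value * response).
    """
    n = len(cells)
    columns = {
        "intercept": [1 for _ in cells],
        "M": [m for m, t, i in cells],
        "T": [t for m, t, i in cells],
        "I": [i for m, t, i in cells],
        "MxI": [m * i for m, t, i in cells],  # cross-product (interaction) term
    }
    return {
        name: sum(x * y for x, y in zip(col, responses)) / n
        for name, col in columns.items()
    }


# All eight cells of the 2 x 2 x 2 design: (motivation, task load, information load).
cells = list(product((-1, 1), repeat=3))

# Hypothetical generating coefficients (not estimated from the paper's data).
true = {"intercept": 0.05, "M": -0.02, "T": 0.005, "I": 0.015, "MxI": -0.01}
responses = [
    true["intercept"] + true["M"] * m + true["T"] * t + true["I"] * i + true["MxI"] * m * i
    for m, t, i in cells
]

coefficients = fit_effect_coded(cells, responses)
for name, value in coefficients.items():
    print(f"{name:9s} {value:+.4f}")
```

A purely additive model fitted to the same responses would return the same main-effect coefficients but would leave the MxI contribution in the residuals, which is the sense in which a linear equation of PSFs without the cross-product term is insensitive to the interaction.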
ACKNOWLEDGEMENTS
I wish to acknowledge the assistance of Barbara Stolte in designing and running the experiments. I am grateful to Don McGregor for providing the opportunity for stimulating discussions and to Paul Slovic for helping to shape the ideas in this paper.
REFERENCES
1. Miller, D. P. & Swain, A. D., Human error and human reliability. In Handbook of Human Factors, ed. G. Salvendy. J. Wiley, New York, 1987, pp. 219-50.
2. Winterfeldt, D. V. & Edwards, W., Decision Analysis and Behavioral Research, Cambridge University Press, New York, 1984.
3. Seaver, D. A. & Stillwell, W. G., Procedures for Using Expert Judgment to Estimate Human Error Probabilities in Nuclear Power Plant Operations. US Nuclear Regulatory Commission, NUREG/CR-2743, Washington, DC, 1983.
4. Aschenbrenner, K. M. & Kasubek, W., Challenging the cushing syndrome: Multiattribute evaluation of cortisone drugs. Organizational Behavior and Human Performance, 22 (1978) 216-34.
5. Hogarth, R., Judgement and Choice, 2nd edn, J. Wiley, New York, 1987.
6. Slovic, P., Fischhoff, B. & Lichtenstein, S., Behavioral decision theory. Ann. Rev. Psychol., 28 (1977) 1-39.
7. Humphreys, P. C. & Wisudha, A., Building a Decision Problem Structuring Library: A Review of Some Possibilities. Technical Report 88-1, Decision Analysis Unit, London School of Economics and Political Science, London, 1988.
8. Zimolong, B. & Rohrmann, B., Entscheidungshilfetechnologien. In Angewandte Psychologie, eds C. Graf Hoyos, D. Frey & D. Stahlberg. Urban & Schwarzenberg, München, 1988, S. 624-46.
9. Swain, A. D. & Guttmann, H. E., Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications. US Nuclear Regulatory Commission, NUREG/CR-1278, Washington, DC, 1983.
10. Hannaman, G. W., Spurgin, A. J. & Lukic, Y. D., Human Cognitive Reliability Model for PRA Analysis. NUS-4531, EPRI, Palo Alto, 1984.
11. Siegel, A. I., Bartter, W. D., Wolf, J. J., Knee, H. E. & Haas, P. M., Maintenance Personnel Performance Simulation (MAPPS) Model: Summary Description. Applied Psychological Services and Oak Ridge National Laboratory, US Nuclear Regulatory Commission, NUREG/CR-3626, Washington, DC, 1984.
12. Svenson, O., On expert judgments in safety analyses in the process industries. Reliability Engineering and System Safety, 25 (1989) 219-56.
13. Zimolong, B., Hazard perception and risk estimation in accident causation. In Trends in Ergonomics/Human Factors, eds R. E. Eberts & C. G. Eberts. Elsevier, Amsterdam, 1985, pp. 463-70.
14. Kahneman, D., Slovic, P. & Tversky, A. (eds), Judgment under Uncertainty, Cambridge University Press, New York, 1982.
15. Shanteau, J., Psychological characteristics of expert decision makers. In Expert Judgment and System, ed. J. L. Mumpower. Springer-Verlag, Berlin, 1987, pp. 289-304.
16. Edwards, W., How to use multiattribute utility measurement for social decision making. IEEE Transactions Systems, Man and Cybernetics, SMC-7, 1977, pp. 326-40.
17. Embrey, D. E., Humphreys, P., Rosa, E. A., Kirwan, B. & Rea, K., SLIM-MAUD: An Approach to Assessing Human Error Probabilities Using Structured Expert Judgement, Nuclear Regulatory Commission, NUREG/CR-3518, Washington, DC, 1984.
18. Künzel, R., FACTORY: A microcomputer program for simulation of production lines as a versatile tool for psychological experiments. Behavior Res. Meth., Instruments & Computers, 19 (1987) 49-50.
19. Zimolong, B., Decision aids and risk taking in flexible manufacturing systems: a simulation study. In Cognitive Engineering in the Design of Human Computer Interaction and Expert Systems, Vol. 2, ed. G. Salvendy. Elsevier, Amsterdam, 1987, pp. 265-72.
20. Embrey, D. E., SLIM-MAUD: The assessment of human error probabilities using an interactive computer based approach. In Effective Decision Support Systems, ed. J. Hawgood & P. Humphreys. Technical Press, Aldershot, UK, 1987, pp. 20-32.
21. Hays, W. L., Statistics for the Social Sciences, Holt, Rinehart & Winston, London, 1963.
22. OECD Nuclear Energy Agency, Expert Judgment of Human Reliability. Rep. No. 88, Paris, 1985.
23. Embrey, D. E., Use of Performance Shaping Factors and Quantified Expert Judgement in the Evaluation of Human Reliability: An Initial Appraisal, US Nuclear Regulatory Commission, NUREG/CR-2986, Washington, DC, 1983.
24. Comer, M. K., Seaver, D. A., Stillwell, W. G. & Gaddy, C. D., Generating Human Reliability Estimates Using Expert Judgement, Vols 1 and 2, US Nuclear Regulatory Commission, NUREG/CR-3688, Washington, DC, 1984.
25. Rosa, E. A. & Humphreys, P. C., A decomposition approach to measuring human error probabilities in nuclear power plants: a case example of the SLIM-MAUD methodology. Paper presented at the Managerial Decisions and Risk Section of the Fourth International Conference on the Foundations and Applications of Utility, Risk and Decision Theory, Budapest, Hungary, 1988.
26. Humphreys, P. C. & McFadden, W., Experiences with MAUD. Aiding decision structuring versus bootstrapping the decision makers. Acta Psychol., 45 (1980) 51-69.
27. Humphreys, P. C. & Wisudha, A., MAUD 4. Technical Report 83-5, Decision Analysis Unit, London School of Economics and Political Science, London, 1983.
28. Slovic, P. & Lichtenstein, S., Comparison of bayesian and regression approaches to the study of information processing in judgment. Organizational Behavior and Human Performance, 6 (1971) 649-744.
29. Rasmussen, J., Human errors. A taxonomy for describing human malfunction in industrial installations. J. Occup. Accidents, 4 (1982) 311-33.
30. Hoyos, C. Graf & Zimolong, B., Occupational Safety and Accident Prevention: Behavioral Strategies and Methods, Elsevier, Amsterdam, 1988.