International Journal of Industrial Ergonomics, 5 (1990) 91-104


Elsevier

DIFFICULTIES IN TROUBLESHOOTING AUTOMATED MANUFACTURING SYSTEMS

Susan R. Bereiter *

Department of Engineering and Public Policy, Carnegie Mellon University, Schenley Park, Pittsburgh, PA 15213-3890 (U.S.A.)

and

Steven M. Miller

Graduate School of Industrial Administration, Carnegie Mellon University, Schenley Park, Pittsburgh, PA 15213-3890 (U.S.A.)

(Received October 18, 1988; accepted in revised form May 15, 1989)

ABSTRACT

Based on the results of an empirical study of the process of troubleshooting in computer-controlled manufacturing systems, we highlight sources of difficulty encountered by expert maintenance personnel in acquiring and using information to diagnose faults. We focus on difficulties in collecting background information, testing hypotheses, making repair attempts, and generating beliefs about causes of failures. Recommendations are provided for how to make people more effective in the troubleshooting task by (i) changing management policies and practices at the plant level, (ii) modifying the process control hardware and software that already exists on the plant floor, and (iii) improving the design of new hardware and software for the next generation of process control systems.

RELEVANCE TO INDUSTRY

This paper provides a specific discussion of difficulties encountered by expert maintenance technicians who troubleshoot equipment faults in automated manufacturing systems. An understanding of why troubleshooters have difficulty with diagnosing faults in automated factory systems provides a foundation for planning how to improve maintainability through redesigning process control hardware and software, and by designing computer-based decision support aids.

KEYWORDS

Troubleshooting, fault diagnosis, manufacturing systems, automation.

* Currently, Member of the Technical Staff, Operations Engineering Department, AT&T Bell Laboratories.

© 1990 Elsevier Science Publishers B.V.


INTRODUCTION

Troubleshooting automated equipment in a modern computer-controlled manufacturing system is often difficult for the maintenance specialist. Underlying this difficulty is the fact that manufacturing systems are getting more complex. For example, process control computers on the shop floor are performing more sophisticated tasks, and these controllers are being linked together into more highly integrated systems. A troubleshooter working with a modern manufacturing system typically faces a large system of many parts, where each part is complex in itself, and where parts are interconnected in complex ways. It is no wonder that the troubleshooter often encounters difficulties with understanding system functions, isolating symptoms, and using information to structure a diagnostic search.

We recently completed an empirical study of the process of troubleshooting in computer-controlled manufacturing systems. This empirical study consisted of three components:
- semi-structured interviews with expert troubleshooters,
- field-based data collection based on observation of actual plant-floor troubleshooting episodes,
- a computer-based experiment to investigate the effects of process controller display design variables on troubleshooting performance.

The participants in all three parts of the study were maintenance specialists from six plants in the automotive industry. A detailed description of the research methodology and the sample is given in Bereiter (1988).

The empirical results of our study are used to identify difficulties encountered by expert maintenance personnel while they collect background information, test hypotheses, make repair attempts, and generate beliefs about the causes of failures in automated manufacturing systems. Following the discussion of difficulties in troubleshooting automated manufacturing systems is a presentation of recommendations which we believe would eliminate or substantially mitigate many of the problems we describe. Finally, in our conclusions, we call for substantially more attention and effort to be focused on designing complex automation systems for ease of maintainability.

FRAMEWORK FOR ANALYZING THE TROUBLESHOOTER'S TASK

Our analysis follows the general principles of the "user-centered" design framework advocated by Rouse (1988a,b). There are four key conceptual elements to Rouse's framework:
1. Understanding user-system tasks, and characterizing these tasks by means of a constrained set of terminology.
2. Identifying means for enhancing human abilities for tasks of interest.
3. Identifying means of overcoming human limitations for tasks of interest.
4. Fostering user acceptance of the means identified for enhancing abilities and overcoming limitations.

Rouse (1988a, p. 155) proposes that most aspects of the human role in complex human-machine systems can be characterized by three general user-system tasks: execution and monitoring; situation assessment; and planning and commitment. Execution and monitoring involve implementing the solution or plan, and determining whether the consequences of the solution will be acceptably close to the desired result. Situation assessment is concerned with the formulation of the problem, or deciding what is happening. Planning and commitment are concerned with devising a solution to the problem, or deciding what to do about it.

Rouse's approach for characterizing user-system tasks in terms of a constrained set of terminology is shown in Fig. 1. The three general user-system tasks mentioned above are decomposed into 13 more specific tasks. Most of these more specific tasks involve generation, evaluation, and selection among alternatives. Based on extensive literature reviews, Rouse claims that these 13 tasks are sufficient to classify and describe the role of the human decision maker in aerospace and process control domains. He also suggests that this task taxonomy should be applicable to the domain of advanced manufacturing systems as well.

EXECUTION AND MONITORING
1. Implementation of plan
2. Observation of consequences
3. Evaluation of deviations from expectations
4. Selection between acceptance and rejection

SITUATION ASSESSMENT: INFORMATION SEEKING
5. Generation/identification of alternative information sources
6. Evaluation of alternative information sources
7. Selection among alternative information sources

SITUATION ASSESSMENT: EXPLANATION
8. Generation of alternative explanations
9. Evaluation of alternative explanations
10. Selection among alternative explanations

PLANNING AND COMMITMENT
11. Generation of alternative courses of action
12. Evaluation of alternative courses of action
13. Selection among alternative courses of action

Fig. 1. General user-system tasks (from Rouse, 1988a, b).
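For concreteness, the taxonomy in Fig. 1 can be written down as a simple data structure, for example to code observed troubleshooting activities against Rouse's categories. This encoding is our own illustration; the key names are not Rouse's:

```python
# Illustrative encoding of Rouse's 13 user-system tasks (Fig. 1).
# The dictionary keys are our own shorthand for the four headings.
ROUSE_TASKS = {
    "execution_and_monitoring": [
        "implementation of plan",
        "observation of consequences",
        "evaluation of deviations from expectations",
        "selection between acceptance and rejection",
    ],
    "situation_assessment_information_seeking": [
        "generation/identification of alternative information sources",
        "evaluation of alternative information sources",
        "selection among alternative information sources",
    ],
    "situation_assessment_explanation": [
        "generation of alternative explanations",
        "evaluation of alternative explanations",
        "selection among alternative explanations",
    ],
    "planning_and_commitment": [
        "generation of alternative courses of action",
        "evaluation of alternative courses of action",
        "selection among alternative courses of action",
    ],
}

# The taxonomy contains 13 specific tasks in total.
assert sum(len(v) for v in ROUSE_TASKS.values()) == 13
```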

Our approach to characterizing the task of the human who is troubleshooting faults in an advanced manufacturing system is shown in Fig. 2. Based on findings from our interview and observational studies, we describe the process of troubleshooting factory automation in terms of three major kinds of tasks: information collection; hypothesis testing; and repair attempts. Information collection involves gathering background information to assess the situation rather than to test a particular hypothesis (e.g., reading documentation of the system design). Hypothesis tests involve collecting information for the specific purpose of supporting or disconfirming a particular hypothesis. An example hypothesis test would be testing with a voltmeter to determine whether there is an electrical short in a cable. A repair attempt is much like a hypothesis test, except that the hypothesis is tested by trying to fix the hypothesized fault to see if the repair attempt works. An example repair attempt would be replacing a cable, then attempting to run the machine to see whether or not the cable was the fault.

Within each of these major tasks, there are three more specific tasks: selecting the action to be performed, performing the action, and evaluating the results and incorporating new information into a person's understanding of the situation. Selecting an action involves choosing both the kind of information desired and a way to collect this information. Performing an action involves using an information source to collect information. In the case of hypothesis tests and repair attempts, results must be evaluated. This means that the troubleshooter must decide whether the results indicate the presence of a normal or an abnormal condition. Incorporating new information involves evaluating whether the evidence confirms or disconfirms a hypothesis, and updating the understanding of the situation so that a new set of possible faults can be generated.

Upon completing a major task, the troubleshooter reevaluates the situation to generate a set of possible faults. This can involve revising or refining the set of possibilities considered previously, or it can involve changing to a different set of possibilities. If the person believes he has identified the fault, then he attempts to repair the fault (if the repair task falls within his job classification and expertise), or he notifies his supervisor that another person is needed to perform the repair. If the troubleshooter does not believe that he has identified the fault, then he can repeat the cycle of either collecting information, testing a hypothesis, or making a repair attempt.

The characterization of the troubleshooting task that we derived from our empirical data to fit our particular domain of troubleshooting in automated factories closely corresponds to Rouse's more general characterization of the human role in complex human-machine systems. The major tasks of information collection and hypothesis testing in our characterization would be part of what Rouse refers to as situation assessment. Our repair attempt would be part of Rouse's execution and monitoring task. The devising of possible solutions and the planning of what to do to evaluate these possibilities occurs at the top of Fig. 2, in the boxes labeled "Generate Set of Possibilities" and "Select Type of Activity". These tasks in our scheme would be part of the planning and commitment task in Rouse's scheme. As in the Rouse scheme, we decompose our major tasks into selection and evaluation subtasks. The generation of alternatives is also a prominent task in our framework. We break it out separately as a major task, while Rouse treats it as a subtask of his major tasks.

Fig. 2. Characterization of the troubleshooting process.

In essence, we are following Rouse's characterization, since the detailed information processing operations carried out by the human in our description of the troubleshooter's task are the generation, selection, and evaluation of alternatives.

In the next sections of the paper, we present our results on difficulties encountered by expert

maintenance technicians as they went about their task of troubleshooting faults in automated manufacturing systems. The characterization of troubleshooting shown in Fig. 2 is used to organize the discussion. There is a separate discussion of the difficulties encountered during each of the major tasks shown in Fig. 2: collecting information; testing a hypothesis; attempting a repair; and generating a possible belief as to the cause of the problem. Following the discussion of difficulties are recommendations for facilitating the task of troubleshooting in automated manufacturing systems, as well as conclusions.

The effort in our original study (Bereiter, 1988) concentrated on developing a rich, empirically based understanding of the task of the troubleshooter working in automated factories in the motor vehicle industry. In this paper and in Bereiter (1988) we use our results to make design suggestions for enhancing human abilities and overcoming human limitations to improve the task of fault diagnosis in automated manufacturing systems. But we did not progress to the stage of implementing these suggestions. Thus, our efforts addressed only the first conceptual element of the four key conceptual elements in Rouse's user-centered design framework.
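The cycle depicted in Fig. 2 can be sketched in code. This is an illustrative rendering only: the function, the fault names, and the simple one-at-a-time elimination strategy are our own inventions, not part of the study:

```python
def troubleshoot(candidate_faults, test):
    """Illustrative sketch of the cycle in Fig. 2: generate a set of
    possibilities, select and perform an activity (here, always a
    hypothesis test), evaluate the result, and incorporate it until the
    fault is isolated.  `test(fault)` stands in for any information
    source (voltmeter check, repair attempt, etc.)."""
    possibilities = list(candidate_faults)   # generate set of possibilities
    episode_log = []                         # record of activities performed
    while len(possibilities) > 1:            # fault not yet identified
        hypothesis = possibilities[0]        # select hypothesis and test
        confirmed = test(hypothesis)         # perform test
        episode_log.append((hypothesis, confirmed))
        # Evaluate test results and incorporate new info into knowledge:
        if confirmed:
            possibilities = [hypothesis]
        else:
            possibilities.remove(hypothesis)
    return possibilities[0], episode_log

# Toy usage: the true fault is a shorted cable among three candidates.
found, log = troubleshoot(
    ["loose connector", "shorted cable", "bad limit switch"],
    test=lambda fault: fault == "shorted cable",
)
```

Real episodes are far less tidy: tests can mislead, replacement parts can themselves be faulty, and the set of possibilities may have to be regenerated entirely, as the following sections describe.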

DIFFICULTIES ENCOUNTERED DURING INFORMATION COLLECTION

The first part of this section reviews the difficulties of acquiring information from particular types of sources consulted by troubleshooters, such as people, diagnostic tools, and documentation. The remaining parts of this section describe difficulties which are not specific to particular information sources: difficulties resulting from the multiplicity of information sources used, from missing information, and from discrepancies between information collected and information used.

People

The machine operators are the most important of all of the types of people who are used as information sources. Operators provide a troubleshooter with information about the symptoms of the fault, and how the symptoms differ from the way the machine normally works. Sometimes, they can also provide informal information about the history of the machine, including frequency of failure occurrence and recent changes or repairs. An operator is a good source of information if the following conditions are met: (i) he is familiar with how the machine normally operates, (ii) he

was watching the machine at the time of the failure, (iii) he remains at the site so that he is available to respond to a troubleshooter's questions, and (iv) he willingly provides accurate information in a cooperative manner. The difficulty with acquiring information from operators is that all of these conditions often are not met. For example, sometimes a replacement operator is at the machine when it fails. Typically, replacement operators are not familiar with the nuances of the machine's cycle. Also, failures often occur suddenly at unpredictable moments. If the operator was not watching the machine at the precise moment of failure, he might not be able to specify what went wrong. In addition, an operator sometimes gets reassigned to a different work station if his machine fails, and it appears that it will be down for an extended period for diagnosis and repair. If the operator is reassigned to another location, the troubleshooter loses one of his best information sources. There are also difficulties associated with acquiring information from other troubleshooters, especially when they work on different shifts. When a particular troubleshooting task gets passed from maintenance specialists on one shift to those on the next, the communication channel is often through supervisors: the troubleshooters on the first shift explain the status of the work to their supervisors; the supervisors relay the information to the supervisors on the next shift, who then communicate the information to the troubleshooters taking over. Since these information channels are imperfect, the information transferred is often vague or incorrect by the time it arrives at the people who need it. Similar problems occur even for troubleshooting events that do not cross shift boundaries.
When different troubleshooters within or across shifts work on the same equipment, recurring problems are often not recognized because of problems of transferring information from one maintenance specialist to another.

Diagnostic tools

There are difficulties with using diagnostic tools as information sources. This discussion focuses on difficulties in acquiring information from computerized process controllers, such as programmable logic controllers, and the panel lights and displays of control cabinets.

Error messages displayed by the process controller are meant to help narrow the troubleshooter's diagnostic search. Yet we found that these error messages often provide misleading, wrong, uninterpretable, or obvious information. In very complex systems, which involve tight coupling between components, a fault in one part of the system can cause ripple effects in other parts. In most process control systems, the error message is triggered by one of the ripple effects that result from the propagation of the original fault, as opposed to the original fault itself. In such cases, the error messages are misleading because they point to the wrong component. Also, error messages are often false alarms. If false alarms occur regularly, error messages are ignored. Researchers have found that in cases where many messages and warnings are regularly displayed because of false alarms, all of them are ignored, including the genuine error messages (Wickens, 1984). Often, error messages are presented as alphanumeric codes which need to be interpreted by looking up the code in a separate code book. This requires the troubleshooter to consult an additional information source if he does not have the meaning of the error code memorized. Error messages are usually designed when a machine is installed initially. Later, after months or years of operating experience, these same messages are considered obvious and unhelpful to the troubleshooter who is already familiar with the machine.

Some problems with diagnostic tools stem from the language used to describe process control logic. In the factories where we worked, most process control logic is written in a language called "ladder logic". This is an imitation of relay logic, a predecessor of computerized process control in which a process is controlled automatically by the opening and closing of individual binary electrical switches.
In ladder logic notation, variables are shown as binary switches or "contacts" and relationships between variables are lines connecting these contacts, as if the display were actually a schematic of a relay circuit. Since the display width is limited, the logic is arranged in rows or "rungs", with the output of one rung as an input to another rung. The CRT is used to show both system design and

system status. System design is shown by the interconnections between contacts. To show system status, the CRTs highlight contacts that are closed at the time the screen is updated.

One major problem with current CRT displays deals with the level of detail used to describe process control logic. Since there is only one level of detail available, this level has to provide the most detail that is ever needed in any situation. Thus, the only view of the system which the troubleshooters can see is specific detail of only a very small portion of the process control logic. Many ladder logic programs can be thousands of rungs long, but a CRT typically can show eight rungs at a time. This makes it difficult, for example, for a troubleshooter to obtain a higher level understanding of the functions of the process. The importance of the ability to view the higher level organization of control logic in a complex automated system has been pointed out by Rasmussen (1985). He has argued that higher level information about functions can be very important for structuring one's understanding of a complex system in diagnostic tasks. Another aspect of large programs written in low level ladder logic is that the troubleshooter typically must search across many pages of a computer display while looking for problems in the sequence of logical operations. If a person must observe multiple pages to obtain all of the information he needs, then he needs to store much of that information in his own short term (working) memory. It is well established that a human can hold only a few chunks of information in working memory (Simon, 1979). A person who has to search across many pages of ladder logic in a CRT display is likely to run up against the size constraints of working memory. This constraint might account for repeated reports by expert troubleshooters of getting lost or being confused when searching through pages of ladder logic.
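To make the rung notation concrete, a rung can be thought of as a boolean function of its contacts, with series contacts ANDed, parallel branches ORed, and one rung's output coil feeding later rungs as an input contact. The sketch below is a minimal model in ordinary code; the contact and coil names are invented for illustration, and no vendor's ladder language works exactly this way:

```python
# Minimal model of ladder logic: each rung computes one output coil from
# series ("AND") and parallel ("OR") combinations of binary contacts.
# All contact and coil names here are hypothetical.

def rung_clamp_advance(state):
    # Two series contacts AND together; a parallel branch ORs in.
    return (state["part_present"] and not state["clamp_retracted"]) \
        or state["manual_override"]

state = {"part_present": True, "clamp_retracted": False,
         "manual_override": False, "spindle_ready": True}

# The output of one rung becomes an input contact of a later rung:
state["clamp_advance"] = rung_clamp_advance(state)

def rung_drill_start(state):
    return state["clamp_advance"] and state["spindle_ready"]

print(rung_drill_start(state))  # True: clamp advanced and spindle ready
```

A real program chains thousands of such rungs, which is exactly why viewing them eight at a time on a CRT forces so much into the troubleshooter's working memory.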
There is another way in which the need to look at multiple pages of information poses difficulties. When looking for timing problems, a person needs to watch the system status via the CRT while the machine operates. He needs to compare the relative time of execution of several different operations. If the software is written such that these operations are not adjacent in the ladder logic,


then the troubleshooter can find the timing problem nearly impossible to trace. By the time he changes the screen from one page to another, the operation he is looking for has already happened. Lack of software standards causes difficulties for troubleshooters in situations where they have to use multiple process controllers. Alphanumeric identification codes for software variables, and the design of software segments which control similar actions on different machines, often are not standardized across different machines within the same factory. As a result, a troubleshooter is faced with unnecessary variety in nomenclature and in programming procedures when he works with multiple machines.

Observation of physical equipment

Even an activity as simple as looking at a machine can have difficulties associated with it, particularly when a troubleshooter is looking for problems in timing or sequencing. Many manufacturing machines, particularly assembly machines, operate very quickly. It is difficult to see whether the actual sequence of operations is what it should be. The problem is compounded on large machines, which sometimes require that the troubleshooter be in two places at once in order to detect timing problems.

Documentation

Maintenance personnel report that insufficient documentation accounts for much of the difficulty associated with troubleshooting. Printouts of software are essential for interpreting the information obtained from CRT displays to process controllers, since the printouts contain meaningful labels for variables that are only identified by number on the CRT displays. We found that these printouts often are not kept up-to-date because of lax practices with respect to updating documentation. Often, these printouts are missing, torn, too dirty to read, or misplaced. Other problems associated with documentation include the need for standardized terminology, for detailed schematics showing test point voltages and oscilloscope waveforms at key test points, and for setup instructions on new equipment.

Multiplicity of information sources

Some difficulties related to information collection stem from the fact that information comes from a multiplicity of sources, and this information comes in a variety of formats. Some types of sources require a setup time when they are used (e.g., a display has to be connected to a process controller, documentation has to be taken from a cabinet, or test instruments need to be obtained). When more information sources are used, more time is spent in setup. Sometimes, the need for this type of setup interrupts problem solving. Also, the data from each type of source is usually in its own unique data format. The troubleshooter has to make mental translations between data formats when multiple sources are consulted.

Missing information

There is the problem of needing information that is not available. The most obvious example of missing information is the case where documentation of the correct sequence of operations is not accessible. In such a case, the troubleshooter has difficulty establishing what is supposed to happen when the machine operates properly. Lack of knowledge of the proper sequence of events in the machine cycle makes the detection of discrepancies from the proper sequence much more difficult. A less obvious example of missing information results from limitations of the hardware and software components of shop floor process control systems. Many CRT displays to process controllers have slow update rates compared to the rate at which the machine changes its state. Typically, CRTs update every few seconds, yet many manufacturing processes operate so quickly that several operations can take place within this amount of time. Operations that happen very quickly might never be detected if they occur between intervals when the display is refreshing. CRT displays of most process controllers do not provide good information about rapidly occurring events. Thus, the information about the status of the manufacturing process provided by the CRT displays can only be used as a basis for guessing about what the real status is.
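A small simulation illustrates the sampling problem; the refresh interval and event times below are invented for illustration. Any state that appears and disappears between two refresh instants is never displayed at all:

```python
# Illustrative simulation: a display refreshing every 2.0 s samples the
# state of a process whose (hypothetical) events occur at the times below.
# A state is visible only if it still holds at some refresh instant, so
# transient states shorter than the refresh interval can vanish entirely.
refresh_interval = 2.0                      # seconds between display updates
events = [(0.3, "clamp open"), (0.7, "clamp closed"),
          (1.1, "drill advance"), (2.5, "drill retract")]

def visible_states(events, refresh_interval, horizon=4.0):
    seen = set()
    t = 0.0
    while t <= horizon:
        # State shown at refresh time t = last event at or before t.
        current = None
        for time, state in events:
            if time <= t:
                current = state
        if current:
            seen.add(current)
        t += refresh_interval
    return seen

print(visible_states(events, refresh_interval))
```

Under these assumed numbers, the "clamp open" and "clamp closed" states are never displayed: the troubleshooter sees only "drill advance" and "drill retract" and must guess at what happened in between.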

Lack of information about the first thing to go wrong in a series of events also causes difficulties. Typically, when a fault occurs a machine continues to operate until it cannot operate further. This means that a troubleshooter must backtrack from the last thing that the machine did to find the root cause of the failure. Troubleshooters in our study reported that this backtracking is often difficult and time consuming. Many process controllers lack the ability to record and archive process control events. Sometimes, the events surrounding a failure happen so quickly that it is difficult for a person to process all of the information that is shown on the display. Without a way to record these events, the troubleshooter must try to replicate the failure (if possible), or work with only the subset of information that he is able to remember. Also, a troubleshooter trying to track down an intermittent problem must station himself by the machine, waiting to see if the fault reoccurs. People consider this a wasteful way to use time.
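An illustrative sketch of the missing capability (we assume nothing about any particular controller's actual facilities): a fixed-size ring buffer of timestamped state changes would let the troubleshooter backtrack from the failure instant instead of waiting for the fault to recur. The variable names and timestamps are hypothetical:

```python
from collections import deque
import time

class EventRecorder:
    """Illustrative ring buffer of timestamped process events; keeps only
    the most recent `capacity` events so memory stays bounded."""
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def record(self, variable, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.buffer.append((ts, variable, value))

    def events_before(self, t):
        """Replay everything recorded at or before time t, oldest first."""
        return [e for e in self.buffer if e[0] <= t]

# Usage: record a burst of events, then backtrack from the failure time.
rec = EventRecorder(capacity=3)
rec.record("limit_switch_4", 1, timestamp=10.0)
rec.record("clamp_solenoid", 0, timestamp=10.4)
rec.record("fault_lamp", 1, timestamp=10.9)
rec.record("motor_run", 0, timestamp=11.2)   # oldest event is evicted
print(rec.events_before(11.0))
```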

Difference between information collected versus used

In our field study, we observed a discrepancy between the information acquired during information collection activities and the kinds of information used throughout the troubleshooting process (Bereiter and Miller, 1989). Nearly all of the information collection activities we observed involved the acquisition of data about symptoms or system design. Very few activities involved the acquisition of other kinds of information. Yet, troubleshooters reported that the criteria used to make troubleshooting decisions often related to the relative ease of performing tests, or to historical information about a machine (such as frequency of fault occurrence), or to a reference to recent changes. If these types of information were obtained from external sources during a troubleshooting episode, we would have coded it as an information collection activity. To our surprise, in all of the episodes we observed, there were very few activities where information on the ease of testing or on machine history was collected. Therefore, we conclude that the troubleshooters in our sample must have been making subjective

assessments of these quantities, based on their general knowledge about what was happening in the factory. While troubleshooters have well developed informal networks for keeping abreast of the status of the systems they maintain, studies have conclusively shown that the heuristics used by people to make subjective judgments of numerical quantities such as frequencies often lead to inaccurate estimates (Tversky and Kahneman, 1974; Fischhoff, 1986).

DIFFICULTIES ENCOUNTERED DURING HYPOTHESIS TESTING

The mechanics of collecting information are essentially the same as the mechanics of testing a hypothesis. The difficulties described above concerning the collection of information are also difficulties encountered in using information sources for hypothesis testing.

The difficulties associated with evaluating test results and incorporating new information are all related to getting confused or being misled. In some cases, we found that a troubleshooter does not know what to conclude from the symptoms, or he draws the wrong conclusions and pursues an incorrect search path. For example, sometimes test results are misleading, so that a troubleshooter concludes that a component is not faulty when it actually is. As a result, he focuses his effort anywhere except where the fault is. Also, we observed situations where test results appeared conflicting, so that one test indicated that a component was functioning properly, while another test showed that the same component was faulty.

Multiple concurrent problems also lead to confusion. Normally, a troubleshooter confronting a failure assumes that there is only one fault responsible for the failure. When there are actually two or more contributing faults, the troubleshooter gets confused when he persists in seeking a single cause for symptoms that are actually brought on by multiple faults.

Complexity in the interaction between system components (in both hardware and software) also contributes to difficulty in evaluating test results. As pointed out by other researchers who have studied maintainability, a specific definition of complexity is difficult to develop (Bond, 1987).

However, when troubleshooters describe sources of complexity from their perspective, they seem to be alluding to a system made up of a large number of parts that interact in a nonsimple way. This is the conceptualization of complexity articulated by Simon (1981). The troubleshooters we worked with found it difficult to understand what the intended operations of a complex system were. Poor understanding of intended operations makes isolating differences between actual and normal operations even more difficult. Also, complexity makes it difficult to isolate and eliminate sections of a machine or a software program as possible faults.

Sometimes symptoms are vague and do not give the troubleshooter a good indication of where to focus his search for the fault. In these cases, the troubleshooter is forced to enumerate and selectively eliminate each possibility one at a time. This process of refining a set of possibilities is very tedious. Particular kinds of symptoms that are difficult to evaluate include intermittent problems and problems in which a machine works partially. In both cases, the problem stems from difficulties in finding reasons for machines that partially operate. Troubleshooters reported that the kinds of problems that cause a machine to not work at all are easier to trace than the kinds of problems that cause partial breakdowns.

DIFFICULTIES ENCOUNTERED DURING REPAIR ATTEMPTS

One might think that a repair is only attempted when the troubleshooter has clearly identified the problem, and that there would be only one repair attempt per troubleshooting episode. Our field study showed us that this is not the case. Many troubleshooting episodes involve multiple repair attempts. There are situations in which a troubleshooter chooses to attempt a repair because he believes it is easier and/or cheaper to isolate the fault in a trial-and-error mode than by other means. In other cases, a troubleshooter chooses to attempt a repair because he believes that he has identified the fault, yet the repair attempt fails, and his hypothesis is disconfirmed.

Sometimes, in the course of making a repair attempt, a troubleshooter creates a new problem

while trying to fix the original one. This results in multiple faults. This is a difficult situation since, as noted previously, troubleshooters report that they get confused when they try to find one fault to account for all of the symptoms, when in fact the symptoms are caused by two or more faults. Faulty replacement parts are another source of difficulty encountered during repair attempts. When a troubleshooter attempts to repair a problem and finds that the fault still exists after the attempt, he assumes that his hypothesis is wrong. This is a reasonable inference, assuming the replacement part is functioning properly. This logic leads to confusion when the replacement part is faulty. When the problem exists after the part is replaced, the troubleshooter often starts to generate new possibilities to consider, when in fact he already has considered and tested the correct fault. In generating new possibilities to consider, the troubleshooter works down a wrong solution path. The switch from a wrong solution path to a correct solution path does not come easily for the troubleshooter. The switch can be even more difficult when it involves returning to a solution path that had already been ruled out as a possibility. The problem of faulty replacement parts happens regularly enough that it was cited as a key source of troubleshooting difficulty by maintenance personnel.
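The inference trap can be quantified with a small Bayesian calculation; all probabilities below are invented for illustration, not measured in the study. Even with only a 5% bad-part rate, one in twenty repair attempts that test the correct hypothesis will appear to disconfirm it:

```python
# Illustrative Bayes calculation: a repair attempt "fails" (the symptom
# persists).  How likely is it that the hypothesis was nevertheless
# correct, because the replacement part itself was bad?  All numbers
# are assumed for illustration.
p_hyp = 0.5            # prior: troubleshooter's hypothesis is correct
p_bad_part = 0.05      # assumed rate of faulty replacement parts

# The symptom persists if the hypothesis was correct but the part was
# bad, or if the hypothesis was simply wrong.
p_persist_given_hyp = p_bad_part
p_persist_given_not_hyp = 1.0
p_persist = (p_hyp * p_persist_given_hyp
             + (1 - p_hyp) * p_persist_given_not_hyp)

# Posterior that the hypothesis was correct despite the failed repair:
p_hyp_given_persist = p_hyp * p_persist_given_hyp / p_persist
print(round(p_hyp_given_persist, 3))  # 0.048
```

Under these assumed numbers the posterior is about 0.048: small, but not zero, so treating a failed repair as conclusive disconfirmation will wrongly rule out the correct path a few percent of the time, consistent with the troubleshooters' reports that faulty replacement parts cause this regularly.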

DIFFICULTIES ENCOUNTERED WHILE GENERATING A SET OF POSSIBLE BELIEFS

Difficulties in generating beliefs can be understood in terms of the evolution of beliefs as a troubleshooting episode progresses. In general, the difficulties involve fixation, or "getting stuck". De Keyser et al. (1988) define fixation as failure to revise situation assessments and planned actions when the situation is changing. They cite recent evidence which suggests that fixation is a major source of human error in complex systems. By fixation, we mean generating the same set of beliefs repeatedly. This can happen in two ways: difficulty in shifting from a wrong thought path to a correct one, and difficulty in refining and focusing a thought path when it is correct. Analysis of our observational data yielded insight into these two types of fixation (Bereiter and Miller, 1989).

In some of our episodes, troubleshooters have difficulty getting onto the correct thought path. They explore two or three incorrect paths before eventually getting onto the correct one. Our data indicate that once a troubleshooter starts off on the wrong path, he is hesitant to abandon that path. We observed episodes where the person remains on an incorrect thought path despite repeated tests which provide evidence that he is not on the path to discovering the true problem.

In other episodes, troubleshooters have difficulty refining the thought path. They perform repeated activities without refining their belief about the set of possible faults. Some of these episodes are characterized by a large number of information collection activities, indicating some confusion about how to interpret the results of hypothesis tests. These episodes also include repeated tests of the same hypothesis, with the troubleshooters citing the desire to verify and double-check information already collected. A troubleshooter's consideration of data reliability and choice to double-check information was also found by Mehle (1982) in his experiments on automotive diagnosis. This indicates that a lack of trust in information can lead to longer troubleshooting episodes.

In a subset of our episodes where most of the effort is focused on refining the correct thought path, difficulty stems from the need to sort through a number of possibilities one at a time. The more complex a system (e.g., the more highly integrated), the more difficult it is to partition the search space and eliminate those parts of the system which do not contain the fault. If sections of a system cannot be eliminated, each component must be tested individually. Thus, system complexity can be a contributing factor to troubleshooting difficulty in these episodes.
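The value of being able to partition the search space can be shown with a simple sketch. Assuming a purely serial chain of components and a probe that reports whether the signal is still good after a given component (both hypothetical), a half-split strategy needs only O(log n) tests instead of testing every component in turn:

```python
def locate_fault(chain, signal_ok_at):
    """Binary-search a serial chain of components for the single faulty one.

    `signal_ok_at(i)` is a hypothetical probe: True if the signal is still
    good immediately after component i. With n components this needs only
    O(log n) probes; if the chain cannot be partitioned, each component
    must be tested individually."""
    lo, hi = 0, len(chain) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if signal_ok_at(mid):      # signal good here: fault lies downstream
            lo = mid + 1
        else:                      # signal bad: fault is at mid or upstream
            hi = mid
    return chain[lo]

# Hypothetical 8-component line with component "C5" at fault:
chain = [f"C{i}" for i in range(8)]
faulty = 5
probe = lambda i: i < faulty       # signal stays good until the fault
print(locate_fault(chain, probe))  # → C5
```

In a highly integrated system the probe itself may be unavailable or ambiguous, which is exactly why such systems are harder to troubleshoot.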

RECOMMENDATIONS

The presentation of the recommendations outlined in the tables is intentionally brief; more discussion of these recommendations can be found in Bereiter and Miller (1988) and in Bereiter (1988). The contribution of the discussion presented here is to take the ideas elaborated elsewhere and present them in a different way. The organization is based on the level of capital investment required to implement the recommendations, and on the target group in the industrial community that is capable of executing them. Our recommendations for improving troubleshooting are summarized in Tables 1-3.

Table 1 outlines recommendations which would be implemented by changing some of the existing policies and practices of plant-level management. There are two reasons for including these types of suggestions. First, it is important to explicitly acknowledge that the maintenance of large-scale automated systems is a task performed by groups of people in complex organizational settings. Evaluating the design and functioning of organizational policies and practices is as important as analyzing the performance of hardware and software components when considering how to improve the maintainability of complex equipment systems in a factory. While this point seems obvious, it is worthwhile to emphasize because of the tendency in the ergonomics research community to ignore the organizational setting

TABLE 1

Management policies and practices that will facilitate troubleshooting

1. Insure availability of documentation that is accurate and up-to-date.
2. Make critical information readily available and easy to use. For each major machine, this includes records of:
   - long and short term failure history (frequency of occurrence)
   - recent maintenance and engineering changes
   - relative ease and difficulty of performing tests (especially repair attempts)
3. Protect the human communication channels which are critical to troubleshooting. These are the channels between:
   - the troubleshooter and the machine operator/attendant
   - troubleshooters on different shifts working on the same machines
   - troubleshooters on the same shift working on the same set of machines
4. Totally eliminate the occurrence of faulty replacement parts being installed during repair attempts.

TABLE 2

Modifications to existing process control hardware and software on the plant floor that will facilitate troubleshooting

1. Standardize nomenclature and programming procedures in the process control software (e.g., the ladder logic in the programmable logic controller).
2. Improve error messages by insuring they are:
   - accurate, up-to-date, and reliable
   - understandable (e.g., use of messages versus codes, and use of meaningful terminology)
   - able to be modified over time to reflect changes to the machine and to match the evolving experience level of troubleshooters.
3. Incorporate variable names and other types of critical information routinely obtained from paper printouts of documentation into the software itself.
4. Provide abilities to replay and trace events in process control logic. For example, abilities to:
   - identify the "primal fault" in situations where there is fault propagation (i.e., a first fault detector)
   - detect subtle timing differences by replaying process control events in slow motion
   - record and archive a log of process control events so that the troubleshooter can study event histories to see how problems evolve over time.
when studying how an equipment troubleshooter does fault diagnosis. A second reason for including these types of recommendations is that they do not require capital outlays or changes to the existing base of hardware and software within a plant. These suggestions are targeted at managers at the plant level who are seeking ways to improve maintenance operations that can be implemented without justifying additional capital resources and without making changes to the existing physical process. Altering managerial policies and practices in ways that are intended to improve maintenance operations is not necessarily easy. Numerous examples have been documented where efforts to change policies and practices related to the management of technology have been only partially successful, or are not sustained over time (e.g., Liker et al., 1987; Goodman, 1982).

Table 2 outlines recommendations that would be implemented by modifying the process control hardware and software that already exists on the floor of today's modern, automated factory. These

types of suggestions are on the scale of engineering projects which could be executed by maintenance and engineering personnel at the plant level. Implementing these types of recommendations would require technical specialists to make changes to the "guts" of hardware and software systems. This type of work would be time intensive and require a minor level of capital outlay. However, such changes could be carried out within a short- to medium-term horizon, and would not require major efforts to develop new types of technology. For these reasons, these are the types of technically oriented projects that might appeal to a plant's engineering and maintenance staff, especially if they understood that these improvements would help the troubleshooter to carry out his job more effectively.

TABLE 3

Design concepts to be implemented in new process control hardware and software that would facilitate troubleshooting

1. Reduce the perceived complexity of the process control system.
   - Simplify the actual complexity of the process control hardware and software by rationalizing and modularizing the interactions between major subsystems and components.
   - Simplify the user's perception of system complexity by providing representations and user interfaces that help the user to better understand physical and logical interactions.
   - Design subsystems and interfaces that facilitate the isolation of faults in order to make testing easier.
2. Change the way that process control logic is represented to the troubleshooter.
   - The user interface should be able to provide a hierarchical overview of the process control system.
   - The overview should be able to summarize system function and status at each level of the hierarchy.
   - At any level within the hierarchy, the interface should be able to represent the process control system in terms of logical and temporal relationships, in addition to physical relationships.
3. Design hardware and software so that failure modes are evident.
   - Eliminate situations where it is not obvious whether or not a screen display (or another component) has failed.
4. Provide real-time information about process control events for rapidly occurring sequences of operations.
   - Make it possible to track timing and sequencing problems in fast-occurring events by (i) improving refresh rates of display screens, (ii) making it possible to rapidly search across screens of control logic, and (iii) insuring that displays accurately reflect the changing state of the process.
   - Combine the ability to capture real-time information in rapid sequences with the ability to replay and trace event histories.

Table 3 outlines recommendations that would be implemented by developing new design concepts for the next version or generation of process control hardware and software. These types of changes would be difficult to accomplish, and would require major development efforts by the firm utilizing the technology, by the vendor supplying it, or by both working in cooperation. These types of recommendations are targeted towards the division- or corporate-level engineering staff in a firm responsible for writing specifications for the desired features of new automation systems. They are also targeted at the designers and product planners who work for the equipment suppliers who develop and produce this type of technology. It is especially important that the people who control the direction of development of process control technology understand how hardware and software design impacts the performance of the maintenance worker who has to use information from these systems to solve problems on the plant floor.

Several of these recommendations would also be relevant to aiding an "artificially intelligent" information processor (a machine) in acquiring and using process control information to diagnose faults in complex equipment systems. For example, simplifying and modularizing interactions between major hardware and/or software components in ways that facilitate the isolation of faults would make testing easier for either a machine-based or human troubleshooter. Similarly, providing real-time information would aid a machine, as well as a human, in tracking timing and sequencing problems in rapidly occurring process control events.
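The hierarchical overview called for in item 2 of Table 3 can be sketched as a tree of subsystems in which status rolls up from the leaves. The node and station names below are hypothetical, and a real interface would of course draw on live controller state rather than static objects:

```python
class Node:
    """One level of a hypothetical hierarchical process-control overview.
    A node's status rolls up from its children, so a top-level summary can
    flag the whole line while still letting the user drill down level by
    level to the failing component."""
    def __init__(self, name, children=None, status="OK"):
        self.name, self.children = name, children or []
        self._status = status

    def status(self):
        if any(c.status() == "FAULT" for c in self.children):
            return "FAULT"       # any faulted child faults this level
        return self._status

    def overview(self, depth=0):
        lines = ["  " * depth + f"{self.name}: {self.status()}"]
        for c in self.children:
            lines += c.overview(depth + 1)
        return lines

# Hypothetical line with a clamp fault at Station_A:
line = Node("Line_1", [
    Node("Station_A", [Node("Clamp", status="FAULT"), Node("Weld_gun")]),
    Node("Station_B", [Node("Conveyor")]),
])
print("\n".join(line.overview()))
```

Even this toy summary shows the intent of the recommendation: the troubleshooter sees at a glance that Line_1 is faulted, that Station_B can be eliminated, and that the search can be focused on Station_A.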

CONCLUSIONS

Why did the maintenance personnel in our study encounter difficulties as they acquired and used information to diagnose faults in modern, automated factories? The fact that expert troubleshooters encountered so many types of difficulties on such a regular basis leads us to conclude that the automated systems they worked with were not designed for ease of maintainability. The designers of the equipment and the accompanying process control technology undoubtedly set out to insure that the automated production lines would meet specified requirements for volume and quality (i.e., cycle time and tolerances) when the system was functioning properly. Apparently, these designers placed a lower priority on insuring that, when the system was not functioning properly, the task of fault diagnosis by a maintenance technician would not be overly complicated. Perhaps the concentration of technical effort required to automate the production process left designers with little time to systematically consider how the human (in this case, the troubleshooter) would interact with such a system. As pointed out by Bond (1987), in systems which prove to be continuing maintenance problems, as the systems in our study have proven to be, "it is often possible to trace this undesirable condition to definite human factors aspects that either were ignored, or were recognized too late".

We hope that this study will serve as a reminder to those in the manufacturing community involved with automation systems design that emphasis must be placed on human factors issues in order to minimize the complexities of fault diagnosis. Implementing recommendations such as those given in Tables 1-3 should facilitate fault diagnosis in complex automated systems. Such efforts should lead to manufacturing systems which are easier to maintain, and which suffer less from losses of capacity due to unscheduled equipment-related downtime.

We argue that it is critically important for designers of manufacturing systems, and, in particular, designers of process control technology, to understand ergonomic issues in order to improve the maintainability of automated systems. Yet, within the manufacturing community, very little research has been published on human factors and troubleshooting which could provide a theoretical and conceptual foundation for guiding future efforts to improve the designs of automated manufacturing systems. Many of the types of difficulties identified in our study are also identified in studies of troubleshooting in other domains, particularly in the operation of complex military electronic systems (Bond, 1987; Wohl, 1980) and industrial process control plants (Rasmussen, 1986; Wickens, 1984, Chap. 12). Given the commonalities across these several studies of how humans diagnose faults in complex equipment systems, ergonomic researchers focusing on automated manufacturing systems should draw on studies done in other application domains as they begin to develop the knowledge required to improve systems design in the domain of manufacturing.

Within the ergonomics/human factors research community, there is a distinguished tradition, as exemplified by Rouse and Hunt (1984), of analyzing how the capabilities and limitations of the information processing of the individual troubleshooter affect fault diagnosis performance. Following somewhat in this tradition, our study was designed to understand the problems encountered by the individual troubleshooter in field settings as he worked with the process control hardware and software used in fault diagnosis. While our study focused on the performance of individual troubleshooters, we could not avoid recognizing the importance of organizational issues within the field settings. As noted in Table 1 of the recommendations, there are a variety of specific ways in which organizational policies and practices impact the information processing of individual maintenance technicians. Since troubleshooting of manufacturing systems always takes place in complex organizational settings, it is important to understand how organizational issues impact the information processing and performance of the individuals responsible for maintenance tasks. As Bond (1987) suggests, the social and "soft" variables in maintenance work need to be addressed together with the more strictly technical issues if we really intend to have complex manufacturing systems which are maintainable.
There is an established tradition in the social sciences of analyzing human information processing in complex organizational settings (Larkey and Sproull, 1984). However, this body of work has focused more on administrative and managerial types of problem solving, and has largely ignored how the organizational setting impacts the performance of "low level" technical tasks such as equipment maintenance.

Recently, a distinguished study panel evaluated promising directions for the future of ergonomic research as applied to nuclear power plants (Moray and Huey, 1988). They called for the research community to adopt a broader, more systems-oriented approach when analyzing human factors issues which impact the operating performance and reliability of a nuclear power plant. The report states (pp. 3-4): "The operator/maintainer-plant interface is extremely important; but other factors arising from the way in which a plant is organized, staffed, managed, and regulated, and the way it interacts with other elements of the industry can also affect human performance, induce human error, and increase the level of risk of a plant". The same comment applies to the study of ergonomic issues in modern manufacturing settings.

In summary, this paper and the study underlying it illustrate why fault diagnosis is becoming more difficult for a human to do as manufacturing systems become increasingly automated and integrated. Our research demonstrates that by employing a variety of empirical approaches, researchers can understand problems with human performance, even in a setting as complex as a large-scale industrial environment. Finally, our research provides several directions for facilitating the task of the human troubleshooter in highly automated factory settings. General recommendations are summarized in this paper for how to make people more effective in the task of acquiring and utilizing information to troubleshoot problems by (i) changing management policies and practices at the plant level, (ii) modifying existing hardware and software that is already installed on the factory floor, and (iii) improving the design of new hardware and software for the next generation of process control systems.

ACKNOWLEDGEMENT

We acknowledge the support of the General Motors Corporation and the National Science Foundation through Grant DMC-86-173-30 from the Manufacturing Systems Program.


REFERENCES

Bereiter, S.R., 1988. Looking for trouble: Troubleshooters' information utilization in computer-controlled manufacturing systems. Unpublished Ph.D. Dissertation, Department of Engineering and Public Policy, Carnegie Mellon University, Pittsburgh, PA.
Bereiter, S.R. and Miller, S.M., 1988. Ideas for improving maintainability in computer-controlled manufacturing systems. Working Paper No. 88-012, Center for the Management of Technology, Graduate School of Industrial Administration, Carnegie Mellon University.
Bereiter, S.R. and Miller, S.M., 1989. A field-based study of troubleshooters' information utilization in computer-controlled manufacturing systems. IEEE Trans. Syst., Man, Cybern., 19 (1): 205-219.
Bond, N.A., Jr., 1987. Maintainability. In: G. Salvendy (Ed.), Handbook of Human Factors. Wiley, New York, Chap. 10.3.
De Keyser, V., Woods, D.D., Masson, M. and Van Daele, A., 1988. Fixation errors in dynamic and complex systems: Descriptive forms, psychological mechanisms, and potential countermeasures. Technical Report, prepared for the NATO Division of Scientific Affairs, Brussels, Belgium.
Fischhoff, B., 1986. Decision making in complex systems. In: E. Hollnagel, G. Mancini and D. Woods (Eds.), Intelligent Decision Support in Process Environments. NATO Advanced Science Institute Series, Springer, New York.
Goodman, P.S., 1982. Why productivity programs fail: Reasons and solutions. Natl. Productiv. Rev., Autumn: 369-380.
Larkey, P.D. and Sproull, L.D., 1984. Introduction. In: L. Sproull and P.D. Larkey (Eds.), Advances in Information Processing in Organizations, Vol. 1. JAI Press, Greenwich, CT.
Liker, J., Roitman, D. and Roskies, E., 1987. Changing everything all at once: Work life and technological change. Sloan Manage. Rev., Summer: 29-41.
Mehle, T., 1982. Hypothesis generation in an automobile malfunction inference task. Acta Psychol., 52: 87-106.
Moray, N.P. and Huey, B.V., 1988. Human Factors Research and Nuclear Safety. National Academy Press, Washington, DC.
Rasmussen, J., 1985. The role of hierarchical knowledge representation in decisionmaking and system management. IEEE Trans. Syst., Man, Cybern., SMC-15 (2): 234-243.
Rasmussen, J., 1986. Information Processing and Human-Machine Interaction: An Approach to Cognitive Engineering. North-Holland, New York.
Rouse, W.B., 1988a. The human role in advanced manufacturing systems. In: W.D. Compton (Ed.), Design and Analysis of Integrated Manufacturing Systems. National Academy Press, Washington, DC.
Rouse, W.B., 1988b. Intelligent decision support for advanced manufacturing systems. Manuf. Rev., 1 (4): December.
Rouse, W.B. and Hunt, R.M., 1984. Human problem solving in fault diagnosis tasks. In: W.B. Rouse (Ed.), Advances in Man-Machine Systems Research, Vol. 1. JAI Press, Greenwich, CT, pp. 195-222.
Simon, H.A., 1979. How big is a chunk? In: Models of Thought. Yale University Press, New Haven, CT, Chap. 2.2.
Simon, H.A., 1981. The Sciences of the Artificial, 2nd edn. MIT Press, Cambridge, MA.
Tversky, A. and Kahneman, D., 1974. Judgment under uncertainty: Heuristics and biases. Science, 185: 1124-1131.
Wickens, C.D., 1984. Engineering Psychology and Human Performance. Charles E. Merrill Publishing Company, Columbus, OH.
Wohl, J.G., 1980. System complexity, diagnostic behavior, and repair time: A predictive theory. In: J. Rasmussen and W.B. Rouse (Eds.), Human Detection and Diagnosis of System Failures. Plenum Press, New York.