Assessing the quality of expert judgment: Issues and analysis




Decision Support Systems 11 (1994) 1-24 North-Holland


Fergus Bolger and George Wright


Bristol Business School, Bristol, UK

Frequently the same biases have been manifest in experts as in students in the laboratory, but expertise studies are often no more ecologically valid than laboratory studies because the methods used in both are similar. Further, real-world tasks vary in their learnability, or the availability of the outcome feedback necessary for a judge to improve performance with experience. We propose that good performance will be manifest when both ecological validity and learnability are high, but that performance will be poor when one of these is low. Finally, we suggest how researchers and practitioners might use these task-analytic constructs in order to identify true expertise for the formulation of decision support.

Keywords: Expert judgment; Task analysis; Decision support.

Fergus Bolger gained his PhD in cognitive psychology from University College London in 1988. Since then he has been engaged in research in judgment and decision making, and decision support, first at Bristol Business School and then at University College London.

George Wright is professor of business administration at Strathclyde Graduate Business School. He has published widely on the human aspects of decision making. His books include Behavioural Decision Theory (Harmondsworth: Penguin, 1984), Behavioural Decision Making (New York: Plenum, 1985), Investigative Design and Statistics (Harmondsworth: Penguin, 1986) and Judgmental Forecasting (Chichester: Wiley, 1987). Together the authors edited Expertise and Decision Support (New York: Plenum, 1992).

Correspondence to: Dr. F. Bolger, Department of Psychology, University College London, Gower Street, London WC1E 6BT, UK.

1. Introduction

Assessment of expertise is becoming an increasingly important issue due to recent developments in Information Technology, arising out of research into Artificial Intelligence (AI). Expert Systems (ES) and Intelligent Decision Support Systems (IDSS) are gradually finding favour in industry and commerce. Given that such systems model aspects of expert judgment, it is evident that the quality of the resulting systems is dependent on the quality of the information originally elicited. Accurate identification of high-quality expertise is, therefore, essential if ES and IDSS are to become commercially viable propositions. Failure in the past to apply the fruits of AI research to commercial problems has been attributed to what Feigenbaum [18] has termed the "knowledge acquisition bottleneck". Part of this bottleneck has largely been removed due to the development of some powerful techniques for eliciting knowledge from experts (see, for example, [65]). In our view, the only major obstacle now remaining to commercially viable ES and IDSS is locating true expertise in the domains of interest.

Judgment is also a prime input to decision analysis, which is a major management decision aiding technique [62,59]. In forecasting there are several purely judgmental techniques such as Delphi, Scenario Analysis and Cross-Impact Analysis. In addition, the outputs of quantitative forecasting models are commonly adjusted by practitioner judgment [35,55,27]. In [68] we discuss these issues and provide an elaboration and evaluation of the various "gateways" for the incorporation of judgment in decision analysis, forecasting and knowledge-based systems. Such judgmental interventions are usually those of informed, expert opinion.

In this paper we shall focus on the issue of identifying the factors contributing to expertise and propose some theoretical constructs relating to the conditions under which valid expert judgment will be demonstrated. To this end, we will initially examine the approaches taken towards the definition of expertise by writers in various fields. Having identified the issues which we feel to be the most pertinent to our current thesis, we proceed to assess the relevant empirical studies of expert judgment and decision making. In order to account for the somewhat contradictory findings of these studies, we propose some theoretical constructs which permit the prediction of expert performance in different task domains. We feel that these constructs will be of value both to academic researchers and to practitioners in the field.

2. What is expertise?

Definitions of expertise are rare in the literature on expert judgment and decision making. Normally it seems to be accepted that we know an expert when we see one. However, in many practical situations, expertise may be credited to individuals who hold particular roles rather than on the basis of knowledge or skill. For example, Steadman discussed psychiatrists' predictions of future violent behaviour among the mentally ill: "The mark of a mature profession seems to be its ability to make believable predictions, not necessarily accurate, just believable" [56, p. 381]. He argued that a focus on predictive accuracy is central to the profession of psychiatry in the public's view. However, when research on clinical forecasts has been conducted, the accuracy of clinical estimations rarely exceeds chance levels. Psychiatrists, Steadman argued, should stop participating in civil and criminal court proceedings that relate to guesses about future behaviour and instead concentrate on diagnosis and treatment, i.e. medical judgment.

Hoenig focussed on the legal aspects of product liability and noted that "...legions of eager consultants from a variety of disciplines are ready, willing and able to provide services ranging from initial consultations and claim reviews all the way to testimony at trial...(but) is an instructor in a vocational high school who teaches automotive repairs qualified to express opinions about the cornering limits of a tractor-trailer going around a curve?" [24, p. 337].

Our view of expertise is in some sympathy with the concerns of Steadman and Hoenig; expertise, to be of practical use, should be measurable as improved performance over forecasts or diagnoses given by those people or systems thought of as "inexpert". Expertise should not, in our view, be defined in terms of believability, social role (high media profile, etc.), seniority or earning power.

Where authors have explicitly attempted to define expertise, this generally takes the form of an outline, in broad terms, of the characteristics which they believe an expert decision maker should ideally possess. For example, according to Johnson: "An expert is a person who, because of training and experience, is able to do things the rest of us cannot; experts are not only proficient but smooth and efficient in the actions they take. Experts know a great many things and have tricks and caveats for applying what they know to problems and tasks; they are also good at plowing through irrelevant information in order to get at the basic issues, and they are good at recognising problems they face as instances of types with which they are familiar. Underlying the behaviour of experts is the body of operative knowledge we have termed expertise" [29].

From this we see that experts are thought to have: well-learnt, highly-practised skills ("smooth and efficient"); a large body of knowledge ("know a great many things"); heuristics or rules-of-thumb to allow them to apply their knowledge to real-world situations ("tricks and caveats"); and certain general problem-solving skills ("plowing through irrelevant information in order to get at basic issues"), which constitute a form of filtering, as does "recognising problems they face as instances of types with which they are familiar".

Shanteau [51], in addition to the above, identifies some social/personality characteristics of experts including:
- effective communication skills
- a sense of responsibility for decisions
- confidence in judgments
- high stress tolerance.

Further, Shanteau distinguishes between "substantive" experts (whose skill lies in analysing large bodies of data) and "assessment" experts (whose skill lies in making judgments under uncertainty). Also, Anderson and Lopes [1] distinguish between "perceptual" and "cognitive" experts (e.g. plumbers versus lawyers).


Attempts such as these to define the ideal characteristics of experts (which we shall term "competence" approaches) have some advantages over the performance approach we outlined earlier. Specifically, once competence models are perfected, they should enable experts to be clearly identified and their performance predicted. However, building accurate competence models is a long process which naturally must rely heavily on studies of performance. Further, competence models have only been successful in well-defined and/or simple domains (e.g. linguistics [9] and counting [22]) - expertise appears to be neither well-defined nor simple. Thirdly, competence models, by their nature, tend to emphasize endogenous/personality variables to the detriment of situational factors. We have reason to doubt, from published research, whether personality factors, such as cognitive style, have much of an influence on decision making and judgment in comparison to situational factors [45,64]. In sum, we feel that while competence models are a worthwhile long-term goal of expertise research, we are justified in focussing on performance in the immediate term.

Another approach to the nature of expertise comes from the fields of sociology and management science, where it is conceived under the heading "professionalism". Here we also find a school of thought which attempts to identify professionals in terms of the characteristics ideally possessed by members of professional groups as distinct from non-professional groups. For example, Elliot [17] proposes a series of continua which can be used to differentiate between professionals and non-professionals. These continua include "knowledge", "tasks", "identity", "work", "education" and "role". With respect to the "knowledge" continuum, professionals are seen as possessing broad, theoretical knowledge, while non-professionals possess technical or craft skills. As for the "work" continuum, professionals are regarded as viewing their work as their central life interest and basis for achievement, while for non-professionals work is a means to the end of breadwinning. Such a view is also held by Goode [21], who suggests that the main characteristic of professionals is that they have acquired specialist knowledge through training and possess a service orientation. Barber [5] also lists the ideal characteristics of professionals but places greater emphasis on the trust and moral responsibility demanded by the professional role. Specifically, by virtue of possessing expertise, professionals have power over society. The professional "culture" consists of a code of conduct which has the primary function of ensuring that this power is used to serve the interests of the professional community and society as a whole, rather than the self-interests of individuals. Johnson [28] expands on the notion of expertise as a source of power but sees the regulation of this power as stemming from the interaction of professionals with clients rather than from self-regulatory processes within professionals themselves.

In sum, the professionalism approach is in many ways similar to the competence approach and, therefore, is subject to the same criticisms. However, where the two approaches differ is in professionalism's emphasis on the socio-political context of expertise. We recognise that considerations of this nature are non-trivial - especially to those attempting to implement ES and IDSS technology, which is often seen as a threat to professionals' power - however, detailed treatment of the social and political context of expert judgment is outside the scope of this paper. Rather, we must be mindful of such issues inasmuch as they impact on the assessment of expert performance.

3. Research

The bulk of expertise research has tended to be carried out for one or other of the following reasons:
- academic interest in the nature of human cognitive processes
- academic and/or commercial interest in building expert/decision-support systems.

Research has tended to be carried out by cognitive psychologists for the first of these reasons and by AI researchers for the second. Hammond [23] compared both types of research and noted that AI researchers try to produce a simulation that will be as good as the expert, whereas cognitive psychologists have tended to focus on judgmental performance (including expert judgment). Early cognitive research viewed judgment as biased, flawed and suboptimal. For this reason, cognitive psychologists have tended to focus on decision aids and interventions that improve upon or replace expert judgment [53,49].

As Hammond has noted, the two approaches to the study of expertise have proceeded in parallel. The AI approach to the modelling of expert judgment can be traced to the problem-solving approach used by Newell and Simon in 1972 [42], whereas the cognitive psychologists' interest in judgment and decision making began in earnest in the late 1950s when Ward Edwards [14] introduced what was essentially a normative theory of economic decision making to the psychological literature. Up to the early 1980s, most psychological research attempted to assess human decision making and judgment against the normative standards.

In this paper we shall not address the rapidly growing body of work by cognitive scientists (including both AI and psychological researchers) who are attempting to model expert judgment processes. As we have already stated, we feel that data on performance must precede attempts to model specific experts' abilities, or expert competence generally. Put another way, it seems to us premature to assume that experts' cognitive processes are any different from anyone else's. As we shall see, it is still moot as to whether experts perform any better than those regarded as inexpert. In view of this, the hypothesis that expertise is anything more than a social construct has yet to be supported. Even if a performance advantage for experts can be demonstrated, it still remains to be shown that this advantage is attributable to process rather than to experts knowing more (i.e. a difference in knowledge content).

The purpose of this paper is, then, to assess critically the evidence for a behavioural basis to the label 'expert' and to propose means for resolving any ambiguities that there may be in this evidence. To this aim we first wish to examine briefly the methods and findings of some early laboratory studies of (non-expert) judgment, because this sets the scene for the later studies of expertise. As will become apparent, most experimental paradigms - and also many experimenters' expectations - were carried over from the laboratory to the 'real world'. It is therefore necessary to know something about the early laboratory studies in order to understand fully the context and implications of the later studies of expert judgment.

In addition, as we shall see, the criticisms of the laboratory studies more or less constitute the raison d'être for the psychological investigation of expertise.

4. Early laboratory studies of judgment

In "The Theory of Decision-Making", Edwards [14] specified that normative decisions can be made by determining potential courses of action and the possible alternative outcomes (or "scenarios") resulting from these actions. Subjective probabilities as to the likelihood of outcomes and subjective values (or "utilities") of outcomes must then be assessed. This then permits the calculation of "subjective expected utilities" (known as SEUs) for each outcome. The highest SEU represents the best choice within the terms of Edwards' normative decision theory, which has become known as subjective expected utility theory.
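To make the SEU calculation just described concrete, the sketch below (ours, not taken from Edwards [14] or from any study reviewed here) computes SEUs for two hypothetical actions and selects the one the theory prescribes; the action names, probabilities and utilities are invented, and Python is used purely for illustration.

    # Illustrative subjective expected utility (SEU) calculation.
    # Actions, outcomes, probabilities and utilities are invented;
    # they are not taken from Edwards [14] or from any study cited here.

    actions = {
        "launch product": [
            # (subjective probability of outcome, subjective utility of outcome)
            (0.6, 100),   # strong sales
            (0.4, -40),   # weak sales
        ],
        "delay launch": [
            (0.9, 20),    # modest but safer return
            (0.1, -5),    # small loss
        ],
    }

    def seu(outcomes):
        """Subjective expected utility: sum of probability x utility."""
        return sum(p * u for p, u in outcomes)

    for action, outcomes in actions.items():
        print(f"{action}: SEU = {seu(outcomes):.1f}")

    # The normative choice under SEU theory is the action with the highest SEU.
    best = max(actions, key=lambda a: seu(actions[a]))
    print("Normative choice:", best)

The point of the exercise is only to show the decomposition the theory requires: exhaustive actions and outcomes, a probability and a utility for each outcome, and a mechanical recomposition into a choice.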

Early research showed that SEU did not describe human decision making, at least not in the context of choices between simple gambles [64]. This research has, therefore, been taken by many as evidence for the necessity to develop and apply decision analysis as a decision aid [50]. An important component of decision analysis is the generation of 'decision trees' consisting of actions that can be taken and the possible outcomes of these actions. It is important that the actions and outcomes thus represented should be exhaustive for the particular decision problem in question, or else the SEUs calculated on the basis of the decision tree are likely to be incorrect. However, in a study by Fischhoff, Slovic and Lichtenstein [19], it was found that laboratory subjects failed to consider options other than those explicitly presented to them and, correspondingly, underestimated the probability of occurrence of a catch-all category. Fischhoff et al. labelled this insensitivity to the incompleteness of scenarios 'out of sight, out of mind'.

By the mid-1960s Edwards had turned his attention to Bayes' theorem, another normative theory, this time concerned with the revision of subjective probabilities attached to the truth of hypotheses in the light of new information. A number of laboratory studies were conducted to assess human revision of probabilistic opinion relative to the axioms of Bayes' theorem.


For example, in one study [47] it was found that unaided human revision of opinion was often less than the theorem would prescribe. In other words, posterior opinion, given updated information, was not as extreme as that calculated by Bayes' theorem. This result has been termed conservatism.

Early research on the conservatism phenomenon used the "book-bag-and-poker-chip" paradigm. The basic paradigm was this: the experimenter holds three opaque book bags. Each contains one hundred poker chips in different, but stated, proportions of red to blue. The experimenter chooses one bag at random, shuffles the poker chips inside and successively draws a single poker chip from the bag. After each draw, the poker chip is replaced and all the poker chips are shuffled before the next drawing. The subjects' task is to say how confident (in probability terms) he or she is that the chosen bag is Bag 1, Bag 2 or Bag 3. The colour of the poker chip drawn on each occasion from the same bag is information on which to revise prior probabilities of 1/3, 1/3 and 1/3 for the three bags. The data from a large number of laboratory studies, using tasks very similar to the one we have described, showed that the amount of opinion revision is often less than the theorem would prescribe.

On the basis of experiments of the "book-bag-and-poker-chip" kind, there has been considerable research effort into the development of computer-aided "Probabilistic Information Processing Systems" (PIP systems) which implement Bayes' theorem. In these systems, the probability assessor makes a probability judgment after each item of information arrives, but the computer aggregates these assessments.
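The normative standard against which this 'conservative' revision of opinion is measured can be made explicit with a short sketch. The following fragment (our illustration in Python, not code from the original studies or from any PIP system) applies Bayes' theorem to the book-bag-and-poker-chip task described above; the red-chip proportions and the sequence of draws are invented.

    # Normative Bayesian benchmark for the book-bag-and-poker-chip task.
    # Three bags with stated proportions of red chips and equal priors of 1/3.
    # The proportions and the sequence of draws are invented for illustration.

    bags = {"Bag 1": 0.70, "Bag 2": 0.50, "Bag 3": 0.30}   # P(red chip | bag)
    posterior = {bag: 1.0 / 3.0 for bag in bags}           # prior = 1/3 each

    draws = ["red", "red", "blue", "red"]  # chips drawn with replacement

    for chip in draws:
        # Likelihood of this draw under each bag
        likelihood = {bag: (p_red if chip == "red" else 1.0 - p_red)
                      for bag, p_red in bags.items()}
        # Bayes' theorem: posterior is proportional to prior x likelihood
        unnormalised = {bag: posterior[bag] * likelihood[bag] for bag in bags}
        total = sum(unnormalised.values())
        posterior = {bag: value / total for bag, value in unnormalised.items()}

    for bag, p in posterior.items():
        print(f"{bag}: posterior probability = {p:.3f}")

    # 'Conservatism' is the finding that subjects' revised probabilities are
    # typically less extreme than the posteriors computed here.

In the PIP systems mentioned above, the human supplies the probability assessments for each item of information, while an aggregation of this kind is left to the computer.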


Another line of research concerned with the quality of human judgment of probability has stemmed from the work of Amos Tversky and Daniel Kahneman. In a series of well-written and accessible papers starting in the early 1970s, they outlined some of the heuristics, or rules of thumb, that people use for probability assessment [58]. Much of this research is now incorporated into general introductory texts on psychology, perhaps because the research studies are easily understood, non-technical and appealing to the non-specialist reader.

Tversky and Kahneman have demonstrated that we judge the probability of an event by the ease with which relevant information about that event is imagined. Instances of frequent events are typically easier to recall than instances of less frequent events, thus availability is often a valid cue for the assessment of frequency and probability. However, since availability is also influenced by factors unrelated to likelihood, such as familiarity, recency and emotional saliency, reliance on it may result in systematic biases. Other heuristics identified by Tversky and Kahneman which may result in biased judgment include anchoring and adjustment - where an arbitrary quantity is used as a baseline for subsequent judgments - and representativeness, a form of stereotyping.

Another paradigm which has been used extensively in laboratory studies of judgment is the measurement of calibration. For a person to be perfectly calibrated, assessed probability should equal percentage correct over a number of assessments of equal probability. In other words, for all events assessed as having a 0.XX probability of occurrence, XX% should actually occur. Early research on the calibration of subjective probabilities used general knowledge questions of the form: "Which is longer? (a) Suez Canal (b) Panama Canal. How sure are you that your answer is correct (0.5 means not at all sure, 1 means absolutely certain)?" Questions of this sort have been used extensively because the answers are known to the experimenter and, therefore, subjects' calibration can easily be computed. A general finding of laboratory calibration experiments is that subjects are overconfident. In other words, for a set of events assessed as having a 0.XX probability of occurrence, less than XX% actually occur.
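To show how calibration and overconfidence are typically scored, the sketch below (ours, using invented data and Python) groups two-alternative general-knowledge judgements by the probability the subject stated and compares each group's proportion correct with that probability.

    # Illustrative calibration analysis for half-range (0.5-1.0) confidence
    # judgements on two-alternative general-knowledge questions.
    # The judgements below are invented for illustration.

    from collections import defaultdict

    # (stated probability that the chosen answer is correct, was it correct?)
    judgements = [
        (0.5, True), (0.5, False), (0.6, True), (0.6, False), (0.6, False),
        (0.7, True), (0.7, True), (0.7, False), (0.8, True), (0.8, False),
        (0.9, True), (0.9, True), (0.9, False), (1.0, True), (1.0, True),
    ]

    by_category = defaultdict(list)
    for stated, correct in judgements:
        by_category[stated].append(correct)

    print("stated  proportion correct  n")
    for stated in sorted(by_category):
        outcomes = by_category[stated]
        hit_rate = sum(outcomes) / len(outcomes)
        print(f"{stated:6.1f}  {hit_rate:18.2f}  {len(outcomes)}")

    # Perfect calibration: proportion correct equals the stated probability in
    # every category; overconfidence: proportion correct falls below it.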

In general, the conclusion from these and other laboratory studies is that human judgment is suboptimal in many respects. In fact Slovic [53] goes as far as to say: "This work has led to the sobering conclusion that, in the face of uncertainty, man may be an intellectual cripple, whose intuitive judgments and decisions violate many of the fundamental principles of optimal behaviour."

Various decision aids and procedures have been devised to help people make better judgments and decisions. As we have seen, decision analysis (implementing SEU theory) and Probabilistic Information Processing Systems (implementing Bayes' theorem) have been propounded as ways of improving human decision making. Both techniques, of course, involve the input of human judgment. However, both are also predicated on a decompose-then-recompose rationale. The logic is that human judgment is unsuitable for complex overall assessments but is suitable for modular inputs to decision aiding technologies. For example, in decision analysis, the decision maker provides assessments of subjective probability and utility. Probabilistic information processing systems require inputs of prior probabilities and likelihood ratios.

The overall message from the diversity of laboratory studies of human judgment and decision making that we have just outlined is clear: holistic judgment is flawed and suboptimal relative to normative theories. This research conclusion has underpinned the development of decision aiding technologies which have been implemented as a way of improving upon human judgment. However, more recently the research emphasis has moved away from the psychological laboratory to "ecologically valid" studies of judgment and decision making by people who care, in situations that matter. Consequently, the focus has turned to the study of professional experts. This has brought the psychological research closer to the content concerns of AI, although the latter approach has focused on methods for modelling expert judgment and decision making without first systematically assessing the quality of expert judgment.

We first turn to criticisms of the laboratory studies and then to substantive studies of expert judgment. The criticisms of the laboratory studies are important not only because they provide the motivation for expertise research but also because, as we shall see, many of the same criticisms are applicable to the more recent studies of judgment in the real world.

5. Criticisms of the laboratory studies

A substantial amount of criticism has been levelled against the early laboratory studies, which leads one to question the conclusion that people are "intellectual cripples" with respect to judgment and decision making.

The majority of these criticisms relate to problems with the ecological validity of experiments. The key issue is: do the laboratory results generalise to work-a-day contexts? Such problems may lie with the type of experimental tasks used, the kind of procedures used to measure performance, or the nature of the subjects participating in experiments. A lack of ecological validity in one or more of these aspects of an experiment could mean that it is difficult to apply the findings of the research outside the experimental setting, or worse, that the findings are artefactual.

For example, criticism has been levelled against the laboratory studies which found human revision of opinion to be "conservative" in comparison to norms computed from Bayes' theorem. Conservatism has been found to be half that typically found in "book-bag-and-poker-chip" studies for a more ecologically valid task [13]. Subjects were asked to determine whether males or females were being sampled on the basis of information about height randomly selected from one of these two populations. In the usual laboratory paradigm the data available to subjects are discrete (e.g. either a red or a blue poker chip) and generally only two or three revisions are possible on the basis of this information. The researchers argue that their task is more representative of situations encountered in everyday life due to the continuous nature of the information available to subjects and because of the range of revisions they can make. Other investigators [63] have extended the ecological validity argument to show that the typical book-bag-and-poker-chip paradigm differs from real-world situations in terms of the lack of redundancy, the constancy, the discriminability and the diagnosticity of data available for opinion revision. They conclude that "conservatism may be an artefact caused by dissimilarities between the laboratory and the real world."

The findings of Kahneman and Tversky with respect to biases and heuristics in human judgment have also been subject to attack. For example, several writers [7,37,38,48] argue that the "word problems" used by Kahneman and Tversky may not be fully understood by subjects and also may not generalize to everyday judgment and reasoning situations. In addition, the Kahneman and Tversky studies have been criticized for their failure to utilise subjects who have any expertise in the subject matter of the word problems or who have any need to act upon their decisions [7].


The validity of studies of calibration has also been questioned on a number of grounds. In particular, the almost exclusive use of dichotomous questions about general knowledge items has been criticised. Keren [33] points out that general knowledge items are unrelated in the sense that a subject's knowledge of one item is independent of his/her knowledge of another item. It is therefore impossible for a subject to use his or her knowledge about the occurrence or non-occurrence of an item to assess the probability of other items in the set. Keren proposes that feedback from experience in making judgments about related events/items is a necessary pre-requisite for good calibration. Such feedback is not available to subjects in the majority of laboratory studies, therefore it is not surprising that they are poorly calibrated. However, in many real-world situations feedback about related events is available for utilisation by an experienced judge (e.g. horse racing, weather forecasting, contract bridge and prediction of stocks and shares).

In view of these criticisms regarding the lack of ecological validity of the laboratory studies - and the consequent doubts about the veracity of the conclusions of this research - the research focus has recently moved to studies of judgment and decision making in the "real world". However, many of the paradigms popular amongst the laboratory researchers have been adopted for the study of experts. As we shall see, the result of this is that many of the same criticisms regarding lack of ecological validity in laboratory studies also apply to the real-world research. We turn next to studies of expert judgment.

6. Studies of expert judgment

In most fields, once one starts to look, one finds a surprisingly large number of papers on a particular topic - expertise is no exception. To date we have unearthed more than 60 papers addressing the subject of expert judgment more or less directly, and there are surely many others which have not yet come to our attention and/or which touch upon the topic. From these 60-plus papers, we have selected 20 for further scrutiny. Over 20 papers could be put aside immediately as being theoretical rather than empirical treatments of expertise.


Of the remaining 40, many lacked sufficient detail regarding subjects and methods for a task analysis to be performed. Some others produced equivocal results or tended towards qualitative analyses rather than quantitative measures of performance - only those studies reporting measures of reliability or validity (accuracy) of judgment have been included in the analysis. Also, studies of medical expertise seem to predominate in the literature, so some of the earlier experiments of this type have been excluded in order to have as wide a spread of different experts represented as possible.

It must be stressed that direct comparison of individual studies is difficult, due to the fact that they differ with respect to the type of performance measured and the paradigms used to investigate expert judgment. For example, some studies focus on judges' ability to discriminate, weight and combine cues from a multidimensional stimulus [2,16] into a global judgment. Such studies tend to utilize "psychometric" measures of performance in that inter- and intra-judge reliability is assessed. Such reliability is seen to be a pre-requisite for valid judgments. A contrasting approach is to investigate judges' ability to generate accurate probability estimates for outcomes [12,33,40]. In these studies the focus tends to be on the accuracy of forecasts (e.g. calibration) and/or on identifying judgmental biases such as anchoring and adjustment (e.g. halo), under/over-estimation (e.g. availability) and representativeness (e.g. stereotyping).

However, despite the differences in approach, it is possible to ask of each study the following questions:
- Are the subjects used (i.e., trained/experienced) to making the sort of judgments asked for?
- Are objective models/data potentially and actually available to judges?
- Are judges familiar with expressing their decisions in the metric of the experimental task?
- Is feedback available? reliable? usable?

In Table 1 each of the 20 studies is listed along with a brief analysis of the experimental tasks with respect to the above questions. An indication of the conclusion of each study regarding the quality of expert judgment is also given.
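The four questions above lend themselves to a simple tabular encoding. The sketch below (ours; the boolean answers are only our rough reading of three of the entries in Table 1, given to show how such an analysis might be organised when screening candidate domains for decision support; Python is assumed) records each study as answers to the four questions alongside the verdict reported in the table.

    # Purely illustrative encoding of the task-analysis questions applied to
    # three of the studies in Table 1. The boolean answers are our own rough
    # reading of the table, not data from the original papers.

    from dataclasses import dataclass

    @dataclass
    class TaskAnalysis:
        study: str
        subjects_used_to_judgments: bool   # trained/experienced in this kind of judgment?
        objective_models_or_data: bool     # objective models/data available to judges?
        familiar_with_metric: bool         # used to the response metric asked for?
        usable_feedback: bool              # feedback available, reliable and usable?
        conclusion: str                    # verdict reported in Table 1

    studies = [
        TaskAnalysis("Murphy and Brown 1985 (weather forecasters)",
                     True, True, True, True, "Good"),
        TaskAnalysis("Gaeth and Shanteau 1984 (soil judges)",
                     True, True, True, False, "Bad"),
        TaskAnalysis("Christensen-Szalanski et al. 1983 (medical interns)",
                     False, True, False, False, "Equivocal"),
    ]

    for s in studies:
        favourable = sum([s.subjects_used_to_judgments, s.objective_models_or_data,
                          s.familiar_with_metric, s.usable_feedback])
        print(f"{s.study}: {favourable}/4 task conditions favourable -> {s.conclusion}")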

Table 1

Experts, task and source: 85 weather forecasting stations; short-term forecasting of precipitation, maximum and minimum temperature and cloud amount (Murphy and Brown 1985).
Objectivity of models and/or data: Weather forecasting is based on a well-understood physical model. Also a large database of objective data relevant to prediction is available, including the outcomes of past forecasts. The main problem is one of information overload rather than data scarcity.
Availability and utility of feedback: In short-term forecasting feedback is rapid and reliable. Also outcomes are not influenced by the forecasts as in, e.g., economic forecasting. Feedback is relevant to refining models and thereby making better predictions (forecasters undertake model refinement as a matter of course). Motivation for utilizing and improving forecasts is supplied by accountability to 'users'.
Judges' experience with experimental task: Specific details of individual forecasters are not provided but we assume that the level of experience is quite high (i.e. there is probably more than one forecaster per station and each station prepares a large number of forecasts each day). The quantification of uncertainty in the manner asked for in the study is also common practice.
Conclusion: S's forecasts were found to be very well calibrated and in some cases more accurate than those based on statistical models. Some seasonal variation in accuracy and reliability was found but overall there was a significant improvement in forecasts during the 10-year period of the study. Good.

Experts, task and source: 80 professional real-estate negotiators; negotiation of refrigerator sales and purchases (Neale and Northcraft 1986).
Objectivity of models and/or data: Formal models of negotiation are available (e.g. game theory) and S's are probably familiar with them, but how rigorous are these models? And are they used in practice?
Availability and utility of feedback: Objective feedback about the profitability of transactions is rapidly available and can be used to modify behaviour in order to produce better deals. However, feedback may have limited utility due to the reactivity of other negotiators.
Judges' experience with experimental task: The S's were highly experienced at negotiation (average 10.5 years, average professional time spent negotiating 32%) BUT S's were not familiar with the refrigerator market (deliberately). The negotiating task appears to be fairly realistic although much simplified.
Conclusion: The results supported the view that experts are susceptible to the same framing biases as naive S's (the control group). However, the experts still outperformed the novices in terms of the number and the profitability of transactions, although there was little difference between the 2 groups by the end of the session. Equivocal.

Experts, task and source: 12 advanced students of soil morphology; coding of soil samples according to the ratio of sand:silt:clay (Gaeth and Shanteau 1984).
Objectivity of models and/or data: Precise codings are possible in the laboratory BUT there is a subjective element to judgements in the field, due mainly to perceptual limitations. Judgements are potentially highly quantifiable.
Availability and utility of feedback: In training there is feedback about the accuracy of judgement from a senior soil scientist who has the opportunity to validate his/her own judgements in the laboratory. Useable feedback is therefore available but is subject to errors and biases, as seniors are shown to disagree and make errors in some codings.
Judges' experience with experimental task: The task is the same as that normally practised by these S's but the mode of presentation is novel, i.e. samples are artificially prepared, therefore there is no contextual information available. The samples may also be unrepresentative of naturally occurring soils. The S's were NOT highly experienced, although all had some practical experience of soil judging and the coding system used.
Conclusion: The influence of irrelevant factors and training on the accuracy of judgement was measured. Accuracy was found to be significantly influenced by irrelevancies. S's had particular problems in differentiating sand from silt in one sample (the instructor was also poor at this task). The training was shown to significantly improve performance. Bad.

Experts, task and source: 12 maize judges; forecasting yield from perceptual analyses of ears (Hughes 1917; Wallace 1923).
Objectivity of models and/or data: There is a potential for reasonably reliable and objective data re. factors influencing yield but it appears that judges lacked access to this. Also the ability to discriminate perceptually between cues is likely to be a limiting factor for performance.
Availability and utility of feedback: Feedback about yield is potentially available but would be only moderately reliable due to: causal links between yield and judgements; timespan; and seasonal and/or regional variations.
Judges' experience with experimental task: The S's were experienced at making the type of discriminations asked for but it is unclear as to the degree of practice they had at predicting yield on the basis of their discriminations. There is a suggestion that the S's real skill is in evaluating the aesthetic merits of ears for shows.
Conclusion: Low accuracy (0.2) and moderately high agreement (0.7) were found for judgements of yield. Bad.

Experts, task and source: 23 medical interns; estimating death rates from potentially lethal events (Christensen-Szalanski et al. 1983).
Objectivity of models and/or data: Highly objective data available (e.g. actuarial tables and government death rates) but it is unlikely that the S's had access to these or were familiar with them.
Availability and utility of feedback: Moderate to highly reliable feedback available from statistical sources, but the main sources of feedback for the S's are the mass media and patients. The former is distorted and the latter is subject to sampling, control and timespan problems.
Judges' experience with experimental task: The S's are moderately experienced with diagnosis and treatment but have no prior formal experience of the experimental task. S's may have incidentally learnt to judge the probability of dying once an ailment has been contracted (or learnt through their reading of medical journals) but NOT the simple probabilities asked for.
Conclusion: Accuracy of judgement of simple death rates was measured, as well as personal contact with sufferers and medical journal coverage of the potential causes of death. High performance was found for rankings of causes but absolute accuracy was poor (although significantly better than novices). The S's demonstrated an over-estimation bias. Equivocal.

Experts, task and source: 12 R&D managers; forecasting the technical success of pharmaceutical R&D projects (Balthasar et al. 1978).
Objectivity of models and/or data: The factors influencing the success or failure of R&D projects are complex and ever-changing, so data are only moderately objective. However, there are quantifiable success criteria and typical project life-cycle models available to the S's.
Availability and utility of feedback: The reliability and useability of feedback is fairly poor due to the long timespans involved (5 to 10 years), the domain complexity, control and sampling problems, and the difficulties of making analogies with prior examples.
Judges' experience with experimental task: The S's are highly experienced at judging the success/failure of projects but mainly on a year-to-year basis, not long-term. Expression of judgements as probabilities is also not usual. NB a Delphi panel was used.
Conclusion: Forecasting calibration was measured both overall and year by year. No significant deviation from perfect calibration was found (but very few observations). Year by year, 67% of predictions fell within 1 standard deviation of the actual outcome. Good.

Experts, task and source: 40 grain inspectors; detecting percentages of foreign material and damaged ears in wheat samples (Trumbo et al. 1962).
Objectivity of models and/or data: Certain materials and degrees of damage are difficult to determine objectively (the 'experts' used to prepare the samples disagreed on some criteria).
Availability and utility of feedback: It is unclear from the paper, but feedback is probably from more senior judges, hence learning is by means of convergence with potentially inaccurate standards (inaccurate due to lack of objectivity and control). There is also evidence of social/commercial pressure to make favourable judgements (e.g. underestimate the extent of damage).
Judges' experience with experimental task: Ecological validity is fairly high as the S's had a median of 16 years of experience at making the type of judgements asked for. However, the expression of these judgements as percentages is not normal and the prepared samples may have been unrepresentative.
Conclusion: Accuracy of discriminations and both inter- and intra-judge reliability were measured. Generally poor accuracy, particularly on judgements of a particular type of material. S's varied greatly in the extent of their under/over-estimation. This variation seems to be a function of experience and place of training. Poor intra-judge reliability. Bad.

Experts, task and source: 5 professional horse-race handicapping sources; prediction of the subjective odds the public creates through parimutuel betting (Snyder 1978).
Objectivity of models and/or data: Objective data about the public's betting behaviour is available (but do the S's use it?). Formal models of betting behaviour have been constructed by psychologists and others, but it is unlikely that the handicappers make use of these. Both odds and rates of return required for predictions are easily and commonly quantified.
Availability and utility of feedback: If the S's are to learn the task then they must match the predicted rates of return for each odds class to those of the punters. Feedback relevant to this is available rapidly after each race (but is it used?). One problem with the feedback is that of causality, i.e. the handicappers publish their forecasts in the papers and punters use these forecasts to make their bets; therefore, is professional handicapping an act of self-fulfilling prophecy?
Judges' experience with experimental task: Degree of experience varied over the 5 sources used, from 1 part-time handicapper to 4 full-time handicappers. The task seems highly ecologically valid as forecasts of the type asked for are very frequently made by the S's. However, there is a query over whether the handicappers are actually predicting the outcome of races rather than punters' betting. It is also suggested that handicappers intentionally bias their predictions in order not to over-influence betting behaviour.
Conclusion: Any bias on the part of the S's will be manifest as non-random deviation from the public's pattern of betting behaviour. The match between the S's and the public's betting varied as a function of experience - the team of 4 full-time S's matched well, but the single part-time S departed significantly from the punters' pattern. The less experienced S's were worst on odds above 10 to 1, where they accentuated the betting bias previously demonstrated in punters. Good.

Experts, task and source: 16 professional bridge players; forecasting the number of tricks to be made (Keren 1987).
Objectivity of models and/or data: Highly objective data available in the form of outcomes of similar games in the past. Although the outcome of a particular game depends on the play, variation in outcome is strongly constrained by the rules of the game. Uncertainty is quantified explicitly in the bidding.
Availability and utility of feedback: Feedback is good because it is continuous, immediate and quantified. Also individual items of feedback are related, in that feedback from one game can be used to predict the outcome of another. BUT feedback during a game can influence outcome.
Judges' experience with experimental task: S's have a high degree of experience (at national/international tournament level). Also the experimental task and the response mode are the same as normal for the S's (i.e. bidding in ordinary games of bridge).
Conclusion: The accuracy (calibration) of forecasts was measured and compared to that of novices. A high degree of accuracy was demonstrated by S's, especially in comparison to novices. The S's were also well-resolved. Good.

Experts, task and source: 35 audit managers; assessment of the 'source reliability' of subordinates' analyses (Bamber 1993).
Objectivity of models and/or data: The experimenters utilize a formal model of the relation between source reliability and the inferential value of information, but this is not available to the S's. The subordinates' initial analyses of internal control are largely subjective (see our comments on Ashton '74), as is the data available to managers for the assessment of the reliability of these analyses (the consensus of other managers).
Availability and utility of feedback: Relevant feedback about the accuracy of subordinates' analyses is probably not available until several months after decisions have been made. Also accuracy will be difficult to assess due to the vagueness of the analyses in the first place. Spectacular successes and failures are likely to be the chief source of data influencing managerial policy re. his/her subordinates. There is a danger of self-fulfilling prophecies due to the effects of managers' decisions on their subordinates' behaviour.
Judges' experience with experimental task: The task seems to be fairly ecologically valid in that the S's are reported to be highly experienced and frequently make judgements about the source reliability of their subordinates' analyses. However, it is clear that the managers would normally have access to more information than they were presented with in this study. Also S's were asked to present their judgements as numeric ratings and probabilities, which is unlikely to be their common practice.
Conclusion: The results indicate that the S's were sensitive to the experimental manipulation of source reliability but significantly underestimated the diagnosticity of this information relative to the normative model. Bad.

Experts, task and source: 60 restaurant managers; estimating the likelihood of restaurant failure as the result of various causes (Dube-Rioux and Russo 1988).
Objectivity of models and/or data: Data/models are not very objective due to the size and complexity of the domain, e.g. a number of factors interact to cause failure. A database of statistics re. restaurant failure was probably not available because 10 'experts' were used to construct the fault trees. Original data re. the consensus of these 'experts' is "unavailable".
Availability and utility of feedback: The items were divided into long- and short-term causes of failure. More (and more frequent) feedback would be available for the latter than the former. Also short-term causes are likely to be more salient (e.g. customer dissatisfaction). BUT short-term causes may be more susceptible to the causal influences of judgements. In general feedback seems to be limited and unreliable.
Judges' experience with experimental task: The experience of the S's was high (average = 8 years) BUT they were not familiar with diagnosing fault trees or assigning probabilities explicitly to outcomes. It is also unlikely that the S's had any direct experience of the causes of failure (otherwise they would be ex-restaurant managers!). The level of formal training varied between S's but was a poor predictor of bias.
Conclusion: S's exhibited an under-estimation bias in that they failed to weight adequately the probability of failure from unlisted causes. Bias was significantly less for short-term than long-term causes (because of lack of feedback re. the latter?). Bad.

Experts, task and source: 63 independent auditors; evaluation of clients' systems of 'internal control' (Ashton 1974).
Objectivity of models and/or data: 'Internal' control is rather an abstract concept, although it is based on objective data such as cash receipts. Also the characteristics of good internal control are clearly defined in the auditing literature.
Availability and utility of feedback: It is unclear as to the precise nature of feedback available to auditors, but it seems likely to consist of recommendations from more senior auditors and of referrals back from clients. Feedback is therefore likely to suffer from the problems of a long timespan and lack of reliability. It is also unclear what the criteria for performance validation would be.
Judges' experience with experimental task: The S's had 2-3 years of auditing experience but it is suggested that the type of assessments required by the experimenters may not be everyday practice. The modes of responding in the experimental task (y/n and numeric ratings of internal control) appear artificial. The task is also severely simplified, thereby reducing any contextual information.
Conclusion: S's consistency and consensus were measured with the view to demonstrating reliability as an indicator of validity. An analysis of the type of processing used by S's was also conducted. Reliability was found to be good (average consensus = 0.7; average consistency = 0.81). No evidence was found for configural processing. Equivocal.

Experts, task and source: 3 medical pathologists; diagnosis of the severity of Hodgkin's disease from biopsy slides (Einhorn 1974).
Objectivity of models and/or data: Quantitative ratings are imposed upon what are essentially subjective judgements. Further, the cue characteristics are not clearly defined. Quantification of the cues other than in a dichotomous manner (i.e. present/not present) seems to be inappropriate. However, judgement will be underpinned by fairly objective medical models.
Availability and utility of feedback: Feedback is probably from convergence with seniors, otherwise S's must wait for data from autopsy. Therefore there is a lack of reliability or long timespans. Also judgements will affect treatment, which will affect the outcome. However, feedback is related, so it is useable for model refinement in principle.
Judges' experience with experimental task: The level of the S's experience was not specified but we assume that it was reasonably high. The cues were preselected by the experimenter so may be unrepresentative. The response mode (ratings) was also unnatural.
Conclusion: 2 of the S's agreed but the other did not (despite being a protégé of one of the other S's!). Test-retest reliability was 0.69 and no evidence of biases was found. The S's agreed on how cues clustered but not on how to weight them. Bad.

Experts, task and source: 22 professional blackjack dealers; assessing the probability of card combinations in blackjack (Wagenaar and Keren 1985).
Objectivity of models and/or data: There is plenty of objective data (hands) to base judgements upon and also a statistical theory of combinations and permutations, BUT S's were not familiar with this theory.
Availability and utility of feedback: Rapid and objective feedback is available, although a lot of data would have to be encoded in order to derive probability distributions empirically. However, in principle good judgement is learnable.
Judges' experience with experimental task: Although the S's were very experienced, the task was unfamiliar in that their job does not require them to encode frequency distributions since they play a fixed strategy. Also the expression of probabilities was novel.
Conclusion: Overall the S's were well-calibrated but not significantly better than naive controls. Also the S's demonstrated biases, particularly with respect to conditional questions. Equivocal.

Experts, task and source: Horse-race bettors (data collected from total betting on 1,825 races); prediction (subjective probabilities) of particular horses winning (Hoerl and Fallin 1974).
Objectivity of models and/or data: Quantified data are available for predictions in the form of odds, outcomes and historical data ('form'). Probability models implicitly underlie betting, but in practice these are undoubtedly overlain by superstitious behaviour in many cases.
Availability and utility of feedback: Feedback is good because it is continuous, immediate, quantified and widely published. However, outcomes of individual races are only loosely related because of, e.g., different combinations of horses, track variability, changes in form, etc. In theory at least, betting should not influence the outcome of races.
Judges' experience with experimental task: Individual bettors were not investigated; rather the total betting on each race was considered. We assume that levels of experience varied but that the mean was fairly high (inexperienced bettors tend to bet on only a few popular events). The task is also high incentive as well as being identical to what these S's normally do.
Conclusion: Good agreement was found between the bettors' subjective probability of winning and the actual frequencies of wins (no significant differences were found between the 2). There was a slight bias on long odds which was attributed to attempts by bettors to recoup previous losses. This bias was not found to be statistically significant. Good.

Experts, task and source: 66 real-estate agents; estimating property values (Northcraft and Neale 1987).
Objectivity of models and/or data: Property value can be calculated in a pseudo-objective manner from the size of the property and the price per square foot of comparable property, with adjustments made for condition and special features. However, a substantial amount of judgment enters into the calculation of each component. Also, estate agents rarely use this method, relying even more heavily on subjective estimates. Further, a process of negotiation and market forces lead evaluations to be regularly changed.
Availability and utility of feedback: Feedback is received both during the negotiation of a sale (i.e. particular offers) and from each sale. The latter is relatively infrequent because sales often take many months to complete. Since property values are largely subjective, and each sale consists of a number of unique factors, both short- and long-term feedback is likely to be relatively undiagnostic. Also, the S's evaluations and the consequent price will influence the market and subsequent evaluations.
Judges' experience with experimental task: The S's degree of experience was relatively high (av. 8.3 yrs), although the number of transactions each year was surprisingly low (av. 15.2). The nature of the judgements and the manner in which they were elicited in the experimental task were very similar to the S's everyday experience (although they received more information about the property than they would normally receive).
Conclusion: Strong evidence was found for an anchoring-and-adjustment heuristic leading to biased judgement. The list price provided by the experimenter/seller significantly over-influenced the experts' valuations in a similar manner to the non-expert controls. However, unlike the controls, the experts demonstrated little awareness of their reliance on list price. Bad.

Experts, task and source: 7 banking executives; predicting the value of interest rates 90 days ahead (and attaching subjective probabilities to estimates) (Kabus 1976).
Objectivity of models and/or data: Interest rates are quantified, so judgement can be validated against published values. The factors influencing change in rates are known but are complex. However, historical time series are usually good predictors because rates do not normally fluctuate widely.
Availability and utility of feedback: Accurate and reliable feedback re. interest rates is rapidly available. Factors influencing interest rates are largely outside the control of the S's but there is still a small possibility of judgements influencing the true outcomes. Unique (e.g. political) events occasionally influence rates, which reduces the utility of feedback.
Judges' experience with experimental task: Details are limited with respect to the experience of the S's. However, we presume it to be high as S's are referred to as 'top' executives who routinely make predictions of interest rates. The manner of eliciting these forecasts ('histograms') was, however, novel, particularly with respect to the confidence assessments.
Conclusion: Data from the 7 S's were pooled for analysis. A close match was found between predicted and actual rates (15 trials) and the direction of movement of rates was predicted correctly in each case. S's were correct for all predictions on which they placed confidence ratings of 75%+. Levels of inter-judge agreement were claimed to be high but some individual biases were detected. Good.

Personality theory is well-developed but it is generally agreed that its utility is descriptive rather than predictive. In other words, there is a weak relationship between personality and behaviour. Reliable description requires the use of standardized tests which was not an option in this study,

A large literature of relevant statistics is potentially available for validation of judgements. However, the S's probably didn't refer to t h i s - relied more on personal experience? (the basis for estimates was not elicited from S's). The 'objective' statistics probably only moderately reliable due to e.g. the varying quality of diagnostic procedures. There is a large published body of data re. share performance as well as widely-used timeseries models of the domain, However, many of the factors influencing share performance are difficult to measure while time-series are very crude models. Share performance can vary dramatically over time and across companies,

8 Clinical psychologists Determining personality and predicting behaviour from case studies, Oskamp 1965

236 Physicians Estimating percentage mortality and non-fatal complications after 10 named invasive diagnostic procedures, Manu et al. 1984

88 Security analysts (88 firms made forecasts of the performance of their own shares and independent analysts' forecasts for these same shares also taken), Forecasting point and range earnings per share, Basi et al. 1976

Feedback about forecasting performance is regular and fairly rapid, but forecasts are likely to influence subsequent share values. Also the complexity of the domain and external influences (e.g. government policy, world events) make model building difficult.

Any feedback from experience is likely to be unreliable due to sampling and treatment effects as well as other factors (e.g. spontaneous remission). The potentially long timespan between judgement and outcome will also often render feedback unusable.

The long timespans between judgement and outcome, the effects of treatment on outcome, and the complexity of factors influencing outcome all mean that feedback is either unavailable or, if available, lacking in utility.

The level of experience of individual S's was not specified but we assume that it was fairly high. The independent analysts were probably more experienced at forecasting than the company analysts but would have had less information re. the share performance of particular companies. The experimental task is identical to normal practice.

The level of experience of the sample varied considerably, from practicing interns to graduating medical students. Further, the questionnaire format lacked potentially valuable contextual information. Also, how often are the S's required to assess percentage mortality etc.?

The S's are described as "experienced" but the experimental task is unusual for them because they are asked to make predictions rather than diagnoses. Also placing numeric confidence ratings on judgements is outside the S's normal experience. The information available to S's is (deliberately) restricted relative to normal.

The analysts overestimated earnings on average by 9% (range: 25 to 150%) while the corporate forecasters overestimated by 6% (range: -37.5 to 126.4%). The analysts were significantly less accurate than the firms. The shorter the forecast period and the lower the business risk, the greater the accuracy of the forecasts. Bad.

The S's significantly overestimated the risk of 7 of the 10 diagnostic procedures featured in the study, and underestimated the risk of 1 other. Overall the S's demonstrated a tendency to overestimate small frequencies and underestimate large ones. Age and experience were positively correlated with accuracy of risk assessment. Bad.

The S's demonstrated overconfidence (i.e. their estimations of % correct were significantly greater than the true % correct). S's overall accuracy was poor (approx. 27%). Confidence was related to the amount of information provided and NOT to accuracy. Bad.

~ "~"

,9~

e~

~

'~ "~

.~

14

F. Bolger, G. Wright / The quality of expert judgment

down this 'conclusions' column in Table 1 you will see that these studies have produced contradictory findings. The 'out of sight, out of mind' finding that laboratory subjects often fail to consider some important outcomes of an event has also been replicated in a real-world setting. Research has found that professional managers in the catering industry underestimated the probability of occurrence of certain causes of restaurant failure and thereby overestimated the likelihood of the few outcomes they had been supplied with [12]. Similarly, a number of other studies have demonstrated that professionals do not always select the most appropriate information as the basis of their decisions. For example, it has been found that expert soil judges, when categorising samples, referred to soil materials which were irrelevant to the discriminations they were trying to make [20]. The applicability to professional settings of the laboratory findings regarding revision of opinion remains unresolved due to a paucity of empirical studies. However, as we have seen, some powerful arguments have been presented to the effect that "conservatism may be an artifact caused by dissimilarities between the laboratory and the real world" [63], and experiments which have used more 'ecologically valid' tasks have demonstrated significantly reduced levels of conservatism [13,71]. In view of this work a plausible hypothesis is that expert opinion revision conforms relatively closely to the norms of Bayes' Theorem. Negatively again, several judgmental biases have been demonstrated in professionals. For example, in a study of estate agents, it was found that arbitrary valuations tendered by property owners influenced both estate agents' and students' subsequent property valuations, thereby demonstrating the anchoring and adjustment heuristic [43]. Availability of information has also been shown to affect the judgments of experts in the same way as non-experts. For example, it has been shown that expert physicians overestimated the risk of certain diseases in a similar manner to a comparison group of students [10]. The source of this particular bias was identified as being the physicians' exposure to patients suffering from the diseases in question.

Perhaps the earliest research question with respect to the issue of expertise was the accuracy of expert judgments. The basic research paradigm here has been to validate expert judgments against some external criterion. A re-analysis of data [61] on the performance of corn judges at predicting crop yield [26] found that judges were in good agreement with each other, but that these forecasts correlated poorly with actual crop yield. Similarly, it has been found that grain judges misclassified more than a third of ears that they regularly graded [57]. Again the research findings are contradictory, as other researchers have found expert judges to be accurate decision makers. For example, it has been found that expert auditors agreed .89 with statistically derived estimates [2]. Similarly, good calibration has been found amongst experts in a number of domains. For example, bankers have been shown to be fairly well calibrated in their predictions of future interest rates [31]. Reasonably well calibrated judgments have also been obtained regarding the success of R and D projects [3]. Also, in studies of weather forecasters [40], bridge players [33] and horse-racing bettors [25], probability estimates were well calibrated. Related to the issue of accuracy is the reliability and repeatability of judgments, because these are seen to be necessary, but not sufficient, prerequisites for accurate decision making. Returning to the study of grain judges that we mentioned earlier [57], we find that when the judges were asked to re-analyse samples they changed approximately one-third of their original gradings. Similarly, a lack of reliability over time has been found in expert medical diagnosis [16], likewise for the judgment of stock brokers [52] and for the analyses of clinical psychologists [44]. Conversely, weather forecasters' subjective forecasts of precipitation, maximum and minimum temperatures and cloud amount have been shown to be highly repeatable [40]. Also, the subjective evaluations of the future successes of pharmaceutical R and D projects by a panel of R and D managers and clinical specialists have been found to be reliable both in the short term (1 year) and long term (5 years) [3]. Other reliable judgments have been found for the forecasting of future interest rates by merchant bankers [31] and for the prediction of the outcome of horse races by experienced bettors [25].


In this section we have summarized examples of research that have evaluated the judgment of experts. On the face of it, the obtained patterns of both poor and good judgmental performance are difficult to reconcile. In the next sections of our paper we develop a theory of the factors underlying the expression of both good and poor judgmental performance. We then use our theory to retrospectively account for performance in published studies of expert judgment conducted under varying situational conditions.

7. Why the contradictory findings of expertise research?

As we have indicated in the previous section, the "ecologically valid" studies of judgments by people who care, in situations which matter, have produced contradictory findings. Both "good" and "bad" judgment have been demonstrated among experts: weather forecasters, bridge players, merchant bankers and bookies are amongst those exhibiting high performance, while clinicians, catering managers and corn judges demonstrate low judgmental performance. The overall impression from the literature, however, is that, in general, expert judgment is poor and that experts exhibit the same biases as naive subjects. There are nevertheless reasons to question such a conclusion. As we have already indicated, one possible explanation for the conflicting results may be the variety of procedures which researchers have used to assess expert performance. In short, do some measures make it easier for experts to manifest good performance than others? This is, we feel, a crucial yet often ignored dimension of task analysis in studies of expert judgment. Some ways in which performance measures may vary include: the definition or coarseness of grain of the analysis; the facility or ease with which the judgments measured can be made; the 'hardness' of the measure (e.g. is it of the reliability or validity of performance?); and the statistical power of the test used. We will briefly examine how each of these in turn impacts upon the quality of performance ultimately attributed to judges. Our intention here is to alert the reader to some of the subtle difficulties inherent in attempting to reconcile, and theoretically explain, patterns of poor and good expert judgmental performance.

Definition:

There is a continuum in measures of judgment in terms of the depth to which performance is analyzed. For example, the accuracy of a subject's judgment can be crudely assessed in terms of his or her proportion of correct judgments. Alternatively, components of the judgments can be 'unpacked'. Three components of subjective probability judgment accuracy commonly measured are 'resolution', 'calibration' and 'knowledge' [69,70]. Briefly, resolution refers to a judge's ability to discriminate individual occasions when an event occurs (it rains, a statement is true, the correct alternative has been selected, etc.). Calibration, meanwhile, is the ability of a judge to track changes in an event's base-rate with his or her probability judgments. Knowledge is an indicator of the variability of the target event and is therefore really a measure of task difficulty, as an event is harder to predict the more it varies. However, if the target event is the subject's own judgmental ability, then knowledge can also be seen as a measure of the consistency of that skill. These components of judgment can be shown to be distinct skills in that, for example, a judge can be well-resolved, well-calibrated, both, or neither. One hypothesis is that it is harder to achieve good performance when the analysis of this performance is fine-grained (i.e. examines sub-components of a skill) than when it is coarse-grained. The rationale for this hypothesis is that an overall conclusion of 'good performance' may only be reached if, say, the judge was found to be both well-resolved and well-calibrated. Alternatively it could be hypothesized that, since coarse-grained analyses tend to summarize over sub-skills, mediocre performance is likely to be the central tendency, and so one might argue that good performance is more easily attained on (at least one component of) a fine-grained analysis. However, an examination of Table 1 favours the second hypothesis because those studies where good performance has been demonstrated are all fine-grained calibration studies (i.e. weather forecasting, bridge, horse-racing and R&D).
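To make these components concrete, the short sketch below computes a Murphy-style decomposition of the mean Brier (probability) score into the three terms just described - outcome variability ('knowledge'), resolution and calibration. The function name, the grouping of forecasts by their stated probability value, and the illustrative data are our own assumptions rather than anything prescribed by the studies cited.

    import numpy as np

    def brier_decomposition(forecasts, outcomes):
        # Murphy-style decomposition: mean Brier = uncertainty - resolution + calibration.
        # Forecasts are grouped by their stated probability value (an illustrative choice).
        forecasts = np.asarray(forecasts, dtype=float)
        outcomes = np.asarray(outcomes, dtype=float)
        n = len(forecasts)
        base_rate = outcomes.mean()
        uncertainty = base_rate * (1 - base_rate)            # variability of the target event
        resolution = calibration = 0.0
        for f in np.unique(forecasts):
            idx = forecasts == f
            n_k, o_k = idx.sum(), outcomes[idx].mean()
            resolution += n_k * (o_k - base_rate) ** 2 / n   # discrimination between occasions
            calibration += n_k * (f - o_k) ** 2 / n          # mismatch of stated and observed rates
        brier = np.mean((forecasts - outcomes) ** 2)
        return {"brier": brier, "uncertainty": uncertainty,
                "resolution": resolution, "calibration": calibration}

    # A judge who always states the base rate: perfectly calibrated, zero resolution.
    print(brier_decomposition([0.3] * 10, [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]))

On these invented data the judge is perfectly calibrated but shows no resolution at all, a distinction that a coarse 'proportion correct' measure would simply average away.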

Difficulty: If an expert is asked to make an incredibly easy judgment then she/he is obviously going to demonstrate very good performance relative to a difficult judgment. For example, an expert pet show judge should have no problems discriminating between cats and dogs but might make mistakes distinguishing between different breeds of cat or dog. This is a rather trivial example, but task difficulty is likely to vary more subtly than this in the domains covered by the expertise literature; hence here is another way in which studies of expert judgment may not be comparing like with like. In other words, differences in performance may simply be a function of variations in task difficulty. However, assessment of the level of task difficulty across different domains of expertise is in itself extremely difficult since difficulty, like beauty, is in the eye of the beholder. This is the reason why so much of the literature compares human performance with optimal performance. Optimal performance calibrates the difficulty measure at least to the extent of setting an upper limit. Subjective difficulty is rarely reported, whilst behavioural measures such as proportion correct are, as we have seen, indicative of a number of factors other than difficulty (e.g. resolution and calibration). Thus here is another argument for the necessity of controlled experiments - in this case to test for the influence of difficulty on performance within a single domain.

The studies in Table 1 where good performance has been demonstrated are all 'difficult' tasks in the sense that large amounts of information have to be considered, but all tend to have:
- a quantitative character;
- an underlying rule-based model;
- relevant and useable outcome feedback,

all of which could be regarded as making the judgmental task easier. Conversely, studies manifesting poor performance tend to have a stronger subjective component, both with respect to how judgments are expressed and to the models underlying decision making. Also, in some of these latter studies there are perceptual limitations on discrimination and/or relevant outcome feedback is unavailable or unusable.

Type: A number of different measurement types are employed in psychological investigation. Perhaps the most important distinction is between measures of reliability and measures of validity.

The usual view is that reliability is a necessary, but not sufficient, prerequisite for validity. For example, a ruler which changes length between each measurement is unreliable and consequently its readings must be invalid. However, a ruler that, for example, consistently measures too short will be reliable but its measurements will still not be valid. Given this relationship, measures of reliability of judgment may be regarded as 'softer' than measures of validity - in other words the former are indicative of the latter, the latter being what we are ultimately interested in. However, reliability is often all that one can measure because objective criteria against which to validate performance are often lacking. The relevant question here is whether it is easier to make reliable judgments (i.e. ones which demonstrate consistency or consensus) or valid judgments (i.e. ones which are accurate or predictive). From what we have just said about the availability of objective criteria for validation, it seems that good performance is more frequently going to be manifest when a reliability measure is employed than a validity measure. To be specific, it is easier to learn to make reliable than valid judgments because relevant data for improving performance are more likely to be available. As already stated above under 'Definition', good judgment seems to be associated with validation approaches. However, reliability measures seem to produce more equivocal outcomes in the studies considered in Table 1: maize judges and independent auditors agreed whilst grain judges and pathologists disagreed; auditors and pathologists were consistent, while both weather forecasters and soil judges became significantly more accurate over time.
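The ruler analogy can be put in numbers. In the toy sketch below, with all gradings invented for illustration, a hypothetical judge regrades the same samples almost identically, and so looks reliable, yet is systematically off against an objective criterion, and so is not valid.

    import numpy as np

    # Invented gradings of the same 8 samples on two occasions, plus an objective reference.
    first_pass  = np.array([3, 4, 4, 2, 5, 3, 4, 2])
    second_pass = np.array([3, 4, 4, 2, 5, 3, 4, 3])   # near-identical regrades
    reference   = np.array([1, 2, 2, 1, 3, 1, 2, 1])   # the judge grades consistently too high

    reliability = np.corrcoef(first_pass, second_pass)[0, 1]   # test-retest consistency (~0.95)
    validity = np.mean(first_pass == reference)                # exact agreement with criterion (0.0)

    print(f"reliability r = {reliability:.2f}, agreement with criterion = {validity:.2f}")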

Power: A fourth reason why different studies might vary in terms of the quality of expert judgment detected concerns the power of the statistical test used to demonstrate significance (or otherwise) of performance differences between experts and novices or normative models. Power refers to the probability of detecting an effect at an acceptable significance level, given that an effect above chance level exists. Power depends on a number of factors, chiefly:
- sample size;
- the probability of accepting a significant effect when one doesn't in fact exist (although this should be set in advance, preferably at a level below 0.01, in reality levels vary considerably, often because experimenters fail to compensate adequately for the effects of using multiple and/or post-hoc tests);
- test sensitivity (this depends largely on the quality of the data, which dictates the sort of test that one can use; thus non-parametric tests are usually, but not always, less sensitive than parametric tests).

Problems of lack of power bedevil experimental psychology but often go unremarked [11]. However, there is reason to believe that experiments with experts may generally employ less powerful tests than the norm. The main reason for this is that many studies are forced to use very small sample sizes due to the unavailability of experts: experts, by their very nature, are less thick on the ground than, say, students, and their time is more valuable. Lack of power may mean that expertise is not detected where it is actually present; however, this should only mean that, overall, there is a less favourable view of expert judgment than is actually the case. It is possible, though, that things may work against the expert the more expert she/he becomes. For example, it seems likely that the more expert the experts, the harder they are to come by. Hence sample size, and therefore power, might be negatively correlated with degree of expertise. Also, in more difficult and more judgmental domains, 'soft' measures may be all that are available due to the lack of objective data for validation. For example, the average opinions of a panel may be the only criteria against which to assess the performance of long-range forecasters of unique events. If this is so, less sensitive tests may have to be utilized, again leading to a reduction in power the greater the expertise of the judges studied. Small sample sizes have been associated with studies showing both good and bad performance (good: race handicappers (5), bridge players (16) and R and D (12); bad: clinical psychologists (8), soil judges (12), maize judges (12) and pathologists (3)). On the whole the former studies tend to collect more data per subject than the latter, which to some extent alleviates the power problems of small sample sizes. Thus it seems that there is better data, and therefore more sensitive tests, for experiments showing good as opposed to bad expert performance - probably due to the more quantitative nature of the tasks in the former than in the latter.

In summary, there is some suggestion that it might be easier to demonstrate expertise in certain studies because:
- they are more specific with respect to the judgmental skill being assessed;
- there are features of the task which permit good performance;
- there exist relevant criteria for assessing the validity of judgment;
- there is more, and better quality, performance data.
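The small-sample point above can be made concrete with a rough power calculation. The sketch below is our own illustration, using a normal approximation to a two-sided, two-group comparison; the effect size and group sizes are invented rather than taken from any of the studies reviewed.

    from scipy.stats import norm

    def approx_power(effect_size, n_per_group, alpha=0.05):
        # Normal approximation to the power of a two-sided, two-sample comparison of means
        # (it slightly overstates the power of the exact t-test for small samples).
        se = (2.0 / n_per_group) ** 0.5           # SE of the standardized mean difference
        z_crit = norm.ppf(1 - alpha / 2)          # two-sided critical value
        ncp = effect_size / se                    # expected z under the alternative
        return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

    print(round(approx_power(0.5, 8), 2))    # ~0.17: a medium effect with 8 experts per group
    print(round(approx_power(0.5, 64), 2))   # ~0.81: the same effect with 64 per group

With only a handful of experts per group, even a medium-sized difference between experts and novices is far more likely to be missed than detected.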

In the next section we shall consider how task features, and in particular the availability of reliable and objective data regarding judgment, influence performance. Another, perhaps more serious, set of objections to the conclusions from studies of expert judgment and decision making is again centered around the issue of ecological validity. In some instances expert performance has been tested outside the domain of the professional experience of the experts in question. For example, clinicians have been asked to make predictions rather than the diagnoses normally required of them (e.g. [44,10,39]). Corn judges have been required to predict yield on the basis of an examination of ears of maize, whereas their usual task is to judge the aesthetic qualities of ears presented at shows [26,61], and restaurant managers have been asked to make judgments about the probability of various causes of restaurant failure, of which most have had no immediate experience [12]. It can be argued that in these cases the expert is at no particular advantage over naive subjects and, therefore, it is unsurprising that they demonstrate the same deficiencies in judgment. In other studies experts have been required to express judgments in unfamiliar metrics. For example, grain inspectors have been asked to identify percentages of foreign material in samples whereas they are used to estimating the weight of these materials [57]; and probability estimates have been elicited from R and D managers for the success of pharmaceutical projects over a 5 to 10 year period whereas they are familiar only with identifying particular successful projects on a year-to-year basis [3].


Keren [33] has proposed that there are two stages to making any judgment. The first stage involves forming the judgment in some subjective metric, the second stage converting this "feeling" into some conventional form. Poor judgmental performance can stem from difficulties with either or both of these two stages. Assuming that skill at expressing subjective judgmental feelings in conventional metrics comes largely from practice at doing so, and allowing that there may be only a limited degree of transfer between different conventional metrics, then many of the findings of poor judgment amongst experts may be a consequence of their inexperience with the mode of responding required by experimenters. Indeed, it has been argued that research should focus on what people can do under favourable conditions, and that subjects should be trained, for example, in probabilistic thinking if they are unfamiliar with probability concepts [46]. A third reason to question the ecological validity of some expertise research is that the subjects studied may not have been sufficiently experienced to qualify as "experts". The authors of the majority of papers we have reviewed do not specify the professional qualifications of their chosen experts, nor do they indicate the length of time spent "on the job", or describe the nature of the subjects' work duties in any detail. In no instance have we found any attempt to objectively screen potential subjects for expertise by, for example, the use of a pretest (of course a pretest on a similar performance measure to the test itself would be redundant). It therefore seems likely that in most cases expertise is attributed on the basis of role rather than proven performance, but this is to be expected since, in the main, the hypothesis to be tested is that social experts are de facto experts. In some other studies it is clear that the experts are not highly experienced in the domain in question as, for example, is the case with the soil judges [20] and independent auditors [2]. In addition to measurement problems and the issue of ecological validity, a third reason for the contradictory findings of expertise research might lie in the nature of the task domain in which we are looking for expert performance to be demonstrated.

For example, we have already presented Keren's argument regarding the importance of experience with the outcome of related predicted events for good calibration performance [33]. If no feedback about the accuracy of judgment is available, or if there is feedback but it is meaningless or in some other way unusable, then, however experienced the judge, she/he is unlikely to be any more accurate than a total novice. However, a judge who applies good rules without attending to outcome feedback is still likely to perform better than a judge with no rules at all. This said, the poor expert performance demonstrated in some studies may be the consequence of the poor learnability of the domains studied. Analysis of the features of domains where expert performance has been high reveals a series of factors which potentially contribute to or detract from the learnability of tasks, and which consequently influence the degree to which experts can be truly expert. We turn to this analysis next.

8. A theory of expertise based on task analysis

One domain in which expert judgment has generally been found to be good (reliable, accurate, consistent, well-calibrated etc.) is weather forecasting. This finding runs contrary to popular belief so it must be stressed that weather forecasters' judgments are good relative to the best non-judgmental performance. Murphy and Brown [40] point out that there are characteristics of the weather forecasting task which allow for a high quality of expert performance. These characteristics include:
- the existence of a physical model of the weather which, although not fine-grained enough for operational forecasts, provides a firm basis for the weather forecaster's predictions and subsequent evaluations;
- the existence of similarities between forecasting situations which allow both objective and subjective predictions to be made on the basis of analogy [34];
- the existence of a large body of objective analyses and forecasts for weather forecasters to refer to;
- the opportunity for weather forecasters to gain rapid experience in developing and implementing operational forecasting procedures (including the refinement and testing of models);
- the availability of objective forecasts which are similar in form to the forecasts which weather forecasters actually make, thereby providing relatively stable benchmarks for decision making and acting as examples for novice weather forecasters;
- the requirement to make judgments on specific variables (i.e. the task is well defined);
- the requirement to quantify uncertainty in an explicit manner, with clear definitions and guidelines for the use of probabilities.

From this we propose that the following three factors contribute to the learnability of a domain:
- the availability of accurate, relevant and objective data and/or domain models upon which decisions can be based;
- the possibility of expressing judgments in a coherent, quantifiable and potentially verifiable manner;
- the existence of rapid and meaningful feedback about the accuracy of judgments.

Of these, we believe that the third, feedback, is the most crucial for, as we have already suggested, without usable feedback the decision maker is unable to improve on his or her own judgmental performance. As Steadman eloquently points out in his discussion of expertise in psychiatry: "In its most basic structure, the scientific enterprise amounts to a feedback loop between ideas (theories, propositions and hypotheses) and empirical observations systematically collected to test the ideas. Based on the empirical results and the inferences drawn from them, hypotheses are modified to be again tested. It is precisely the absence of such an empirical feedback loop in psychiatry's predictions of dangerousness that doom current practices to rely mainly on magic and art"

[56, p. 385]. However, the availability of feedback in itself is not enough to help the expert improve his or her performance. The feedback must also be usable. As we have already indicated, the items upon which judgments are made and the outcomes fed back must be related in a way that permits future judgments to be informed by past judgments. In other words, the decision maker must be able at least to revise his or her model of the domain on the basis of the feedback received. Second, in order for feedback to be used constructively it must be fairly immediate.


A feedback loop of months or even years may give the expert little or no opportunity (or indeed motivation) to revise his or her judgments. This is especially true if the nature of the domain changes over time, as is the case in many real-world situations (e.g. the development of new drugs requires continual revision of estimates of risk associated with particular diseases). Third, as Einhorn points out, in many natural settings there is a causal link between judgments and their outcomes, so feedback may be unreliable: "Imagine that you are a waiter in a busy restaurant and because you cannot give good service to all the people at your station, you make a judgement regarding which people will leave good or poor tips. You then give good or bad service depending on your judgement. If the quality of service, in itself, has an effect on the size of the tip, outcome feedback will "confirm" the predictions ('they looked cheap and left no tip - just as I thought')" [15, p. 6].
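Einhorn's waiter can be simulated in a few lines. In the sketch below the 'prediction' has no validity whatsoever, yet because the service given follows the prediction and itself influences the tip, the outcome feedback appears to confirm the judgment about 70% of the time; all probabilities are invented for illustration.

    import random

    def waiter_feedback(n_customers=10000, base_tip_rate=0.3, service_effect=0.4, seed=1):
        # The waiter's 'judgment' is a coin flip, but service quality follows the judgment
        # and itself raises the chance of a tip, so the feedback looks confirmatory.
        rng = random.Random(seed)
        confirmed = 0
        for _ in range(n_customers):
            predicted_good_tipper = rng.random() < 0.5          # judgment with zero validity
            good_service = predicted_good_tipper                # action follows judgment
            p_tip = base_tip_rate + (service_effect if good_service else 0.0)
            tipped = rng.random() < p_tip
            if tipped == predicted_good_tipper:                 # feedback 'confirms' the judgment
                confirmed += 1
        return confirmed / n_customers

    print(f"apparent hit rate: {waiter_feedback():.2f}")   # roughly 0.70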

In order to separate judgment from the effects of actions taken on the basis of this judgment, it is necessary to perform a controlled experiment. This is often impractical and/or too expensive in real-world situations. For instance, the waiter in the above example would have to give good service to those he thought would leave bad tips, and bad service to those he believed would give good tips, with the consequence that accurate judgment would result in a loss of income for the duration of the "experiment". Task domains will vary in the extent to which actions based on judgments influence the subject of the judgments themselves. For instance, in weather forecasting a forecast of rain will not affect the probability of rain falling, but the prediction of future values of stocks and shares might in itself influence market trends. Self-fulfilling prophecies may, of course, act in the forecaster's favour so that, for example, the broker's judgments lead the market to behave in the way she/he predicted, thus making his or her forecasts appear highly accurate. However, in general, a causal link between judgment and outcome will render feedback difficult to interpret and thereby reduce the learnability of the domain. In summary, we propose that the contradictory findings of expertise research can be attributed to two main factors:
- the degree to which the expert is experienced at making the type of judgment asked for by the researchers, which can broadly be labelled
as the ecological validity of the experimental tasks;
- the extent to which it is possible to master decision making and judgment in the task domain under investigation, specifically by making use of feedback to refine reliable domain models as a basis for subsequent judgment; this we term the learnability of the experimental task.
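Stated as a rule, the prediction that follows from these two factors is simply their conjunction; the sketch below is a trivial rendering of it, with the example cases drawn from the discussion in the text.

    def predicted_performance(high_ecological_validity, high_learnability):
        # Good judgment is expected only when the task is both ecologically valid
        # for the judge and learnable; otherwise poor performance is predicted.
        return "good" if (high_ecological_validity and high_learnability) else "poor"

    print(predicted_performance(True, True))    # e.g. weather forecasting
    print(predicted_performance(True, False))   # e.g. familiar judgments with unusable feedback
    print(predicted_performance(False, False))  # e.g. unfamiliar tasks in unlearnable domains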

On the basis of this analysis we have assigned each of the 20 studies presented in Table 1 to one of the quadrants of the two-by-two table shown in Figure 1. The quadrants represent combinations of high/low ecological validity with high/low learnability. An indication of the conclusion of each experiment with respect to expert performance is also given in this Figure.

[Figure 1: the 20 studies arranged in a two-by-two table, with ecological validity (low/high) on one axis and learnability (low/high) on the other. Legend: O = good performance, X = bad performance, <> = equivocal performance.]

Fig. 1. The 20 experimental studies of expertise cast into a two-by-two table according to assessed level of learnability and ecological validity of the respective judgmental task.

If both ecological validity and learnability are
high then we predict that good judgmental performance will be demonstrated by experts. Conversely, in experimental tasks where ecological validity is low, and/or the same applies to learnability, we predict poor judgmental performance in terms of accuracy and reliability. Where both ecological validity and learnability are low, we suggest that the "expert" is in essentially the same position as naive subjects in laboratory studies. In such a situation judges will have little or no knowledge relevant to the task in hand and will be forced to fall back upon the heuristics they use in everyday reasoning. As we know from an extensive body of research in human reasoning [30,32], heuristics are often inappropriate to the task in hand, resulting in flawed and biased judgments. In fact it is the nature of heuristics that they only bring partial success and/or work some of the time. Heuristics are what we use when no appropriate algorithm is available. Consequently, we predict that when relevant domain knowledge is not, or cannot be, brought to bear on a judgment task, heuristics will tend to be used and poor judgment will tend to be demonstrated. Of course, it must be stressed that experience in making judgments in a learnable domain is not sufficient qualification for expertise in itself. The expert is also likely to need some or all of the qualities identified by Johnson [29] and Shanteau [51], such as the ability to communicate effectively, to integrate information from disparate sources and to possess high stress tolerance. However, the particular qualities required by an expert are likely to depend on the nature of his or her work, and it remains both a theoretical and an empirical question precisely how qualities of the person and the situation interact to produce expert performance.

9. Summary, conclusions and directions

To summarize, we argue in this paper that the assessment of the quality of expert judgment is becoming a central issue for both researchers and practitioners in a number of fields. This is partly because increasing specialisation leads to ever more reliance being placed upon expert judgment, and partly because of current trends in IT, whereby attempts are being made to "capture" expertise in computer systems. In either case it is important to be able to accurately identify the level of expertise being offered. A pre-requisite for the assessment of expert judgment would seem to be a coherent definition of expertise. Such definitions are lacking in the literature, with expertise often being attributed on the basis of social role or believability. Those definitions of expertise and professionalism which do appear tend to be ill-defined competence models which pre-empt the empirical data necessary to support them. We argue that such competence models represent a desirable end-point for research but that, for the moment, the focus should be on collecting data regarding expert performance; hence expertise should be defined in terms of increased performance relative to people or systems regarded as "inexpert". Our review of the early laboratory studies documented a generally accepted conclusion that human judgmental performance is suboptimal relative to normative models. However, we showed that this conclusion has been seriously questioned, largely on the grounds that the laboratory research lacked ecological validity. These criticisms have resulted in the research focus being shifted to judgment by people who have some knowledge of, and investment in, the judgmental domains investigated - in other words, "experts". In our review of the expertise research, we showed that findings have been contradictory, with some investigators demonstrating "good" performance by their experts, and other investigators showing similar sub-optimal performance and biases as were manifest in the laboratory studies of non-experts. However, we argued that the goal of ecological validity has not always been completely realised in the expertise research, with a number of studies requiring subjects to make judgments outside the domain of their professional experience and/or to express their judgments in unfamiliar metrics. This was offered as one explanation for the contradictory findings of expertise research. A second explanation, we suggest, lies in the degree to which good judgment can be learnt in a task domain. If objective data and models and/or reliable and usable feedback are unavailable, then it may not be possible for a judge in that domain to improve his or her performance significantly with experience.
In such cases the performance of novices and "experts" is likely to be equivalent. We conclude that judgmental performance is largely a function of the interaction between these two dimensions of ecological validity and learnability - if both are high then good performance will be manifest, but if one or both are low then performance will be poor. Further, when a subject is operating in a domain about which she/he knows little, or where accurate judgment is in principle unattainable, we propose that she/he will be forced to resort to using heuristics, which may well result in biased performance. We tested our theory of expertise retrospectively on a sample of studies of expert judgment and found the dimensions of ecological validity and learnability to be reasonably good predictors of the performance outcome of these studies. However, we drew attention to the fact that there are also likely to be subtle factors, other than these task variables, contributing to obtained expert performance. Such factors as the level of performance measurement, task difficulty, measurement type, and the power of the statistical tests utilized need careful consideration in any evaluation of the quality of expert judgment in published studies. It follows that our analysis will enable the practitioner to identify whether the expert whose judgments he is about to elicit is likely to show better performance relative to lesser experts or novices. Specifically, if learnability and ecological validity of the particular task domain are high, then greater experience should have allowed the expert to refine his or her model of the domain on the basis of feedback received. In many instances where expert systems are commercially built, the type of judgment modelled by the knowledge engineer will tend to show high ecological validity. However, learnability may be low, with feedback being poor. In one example, we developed a life underwriting expert system which modelled the chief underwriter of a major insurance company as he evaluated proposals for life insurance from the public [8]. On the basis of evidence revealed in the proposal forms and other evidence subsequently collected (e.g. medical reports), he accepted proposals at normal age-related rates or added premium weightings to balance life-threatening impairments. For extremely impaired lives he rejected proposals.

In this example, the chief underwriter had little feedback on the adequacy of his judgments. People do not often die within the term of the policy, and only if a life insured died within 3 years of taking out a policy did the underwriter receive any feedback at all. In addition, junior underwriters acquire their expertise in one-to-one learning situations with senior underwriters, as they work through proposal forms. The result was high reliability within the insurance company in terms of the judgments made, but expertise was attributed on the basis of convergence with more senior underwriters rather than on the basis of performance measured against some objective standard. Indeed, experimental studies have shown that similarly completed proposal forms sent to different insurance companies resulted in different acceptance decisions. In decision analysis practice, where decision takers are asked to assess subjective probabilities for the occurrence of novel future events, there is likely to be low ecological validity and low learnability. In addition, the probability metric will be unfamiliar to most managers. In short, task conditions more akin to psychiatrists' predictions of dangerousness than to weather forecasters' predictions of weather events are likely to be commonplace. Since one-off subjective probability judgments in decision analysis are seldom, if ever, checked for validity (i.e. calibration), the issue of judgmental performance is not addressed. At best, assessments are checked for reliability, and indirect methods of probability assessment are utilized which do not require direct numerical statements of likelihood [66]. It follows that factors such as the relative desirability of outcomes may influence the perceived likelihood of their occurrence [52,67]. In addition, actions taken to facilitate desired futures [15] will tend to distort the veracity of any feedback that may occur for particular judgments, so resulting in inappropriate confidence in judgmental performance and reducing learnability both within and between domains.

References

[1] N.H. Anderson and L.L. Lopes, Some Psycholinguistic Aspects of Person Perception, Memory and Cognition 2 (1974) 67-74.

[2] R.H. Ashton, Cue Utilization and Expert Judgments: A Comparison of Independent Auditors With Other Judges, Journal of Applied Psychology 59 (1974) 4, 437-444.
[3] H.U. Balthasar, R.A. Boschi and M.M. Menke, Calling the Shots in R and D, Harvard Business Review (1978) May-June, 151-160.
[4] E.M. Bamber, Expert Judgment in the Audit Team: A Source Reliability Approach, Journal of Accounting Research 21 (1983) 2, 396-412.
[5] B. Barber, Some Problems in the Sociology of the Professions, Daedalus 92 (1963) 671.
[6] B.A. Basi, K.J. Carey and R.D. Twark, A Comparison of the Accuracy of Corporate and Security Analysts' Forecasts of Earnings, The Accounting Review 51 (1976) 244-254.
[7] L.R. Beach, J.J. Christensen-Szalanski and V. Barnes, Assessing Human Judgment: Has It Been Done, Can It Be Done, Should It Be Done?, in: G. Wright and P. Ayton, Eds. Judgmental Forecasting (Wiley, Chichester, 1987).
[8] F. Bolger, G. Wright, E. Rowe, J. Gammack and R. Wood, Lust for Life: Developing Expert Systems for Life Assurance Underwriting, in: N. Shadbolt, Ed. Research and Development in Expert Systems III (CUP, Cambridge, 1989).
[9] N. Chomsky, Aspects of a Theory of Syntax (MIT Press, Cambridge, Mass., 1965).
[10] J.J. Christensen-Szalanski, D.E. Beck, C.M. Christensen-Szalanski and T.D. Keopsell, Effects of Expertise and Experience on Risk Judgments, Journal of Applied Psychology 68 (1983) 278-284.
[11] J. Cohen, Statistical Power Analysis for the Behavioral Sciences (Academic Press, New York, 1977).
[12] L. Dube-Rioux and J.E. Russo, An Availability Bias in Professional Judgment, Journal of Behavioral Decision Making 1 (1988) 223-237.
[13] W.M. DuCharme and C.R. Peterson, Intuitive Inference About Normally Distributed Populations, Journal of Experimental Psychology 78 (1968) 269-275.
[14] W. Edwards, The Theory of Decision Making, Psychological Bulletin 51 (1954) 380-417.
[15] H.J. Einhorn, Learning From Experience and Suboptimal Rules in Decision Making, in: T. Wallsten, Ed. Cognitive Processes in Choice and Decision Behaviour (pp. 1-20) (Hillsdale, N.J., Erlbaum, 1980).
[16] H.J. Einhorn, Expert Judgment: Some Necessary Conditions and an Example, Journal of Applied Psychology 59 (1974) 562-571.
[17] P. Elliot, The Sociology of the Professions (London, Macmillan, 1972).
[18] E.A. Feigenbaum, Themes and Case Studies in Knowledge Engineering, in: D. Michie, Ed. Expert Systems in the Microelectronic Age (Edinburgh University Press, 1989).
[19] B. Fischhoff, P. Slovic and S. Lichtenstein, Fault Trees: Sensitivity of Estimated Failure Probabilities to Problem Representation, Journal of Experimental Psychology: Human Perception and Performance 4 (1978) 330-334.
[20] G.J. Gaeth and J. Shanteau, Reducing the Influence of Irrelevant Information on Experienced Decision Makers,


Organizational Behaviour and Human Performance 33 (1984) 263-282.
[21] W.J. Goode, Encroachment, Charlatanism and the Emerging Professions: Psychology, Sociology and Medicine, American Sociological Review 25 (1960) 902-914.
[22] J.G. Greeno, M.S. Riley and R. Gelman, Conceptual Competence and Children's Counting, Cognitive Psychology 16 (1984) 94-143.
[23] K.R. Hammond, Towards a Unified Approach to the Study of Expert Judgment, in: J. Mumpower, Ed. Expert Judgment and Expert Systems (Berlin, Springer-Verlag, 1987).
[24] M. Hoenig, Drawing the Line on Expert Opinions, Journal of Products Liability 8 (1985) 335-345.
[25] A. Hoerl and H.K. Fallin, Reliability of Subjective Evaluations in a High Incentive Situation, Journal of the Royal Statistical Society 137 (1974) 227-230.
[26] H.D. Hughes, An Interesting Seed Corn Experiment, Iowa Agriculturalist 17 (1917) 424-425.
[27] J.M. Jenks, Non-Computer Forecasts to Use Right Now, Business Marketing 68 (1983) 82-84.
[28] T.J. Johnson, Professions and Power (London, Macmillan, 1972).
[29] P.E. Johnson, What Kind of Expert Should a System Be?, Journal of Medical Philosophy 8 (1983) 1, 77-97.
[30] P.N. Johnson-Laird and P.C. Wason, A Theoretical Analysis of Insight Into a Reasoning Task, in: P.N. Johnson-Laird and P.C. Wason, Eds. Thinking: Readings in Cognitive Science (Cambridge, CUP, 1977).
[31] I. Kabus, You Can Bank on Uncertainty, Harvard Business Review (1976) May-June, 95-105.
[32] D. Kahneman, P. Slovic and A. Tversky, Eds. Judgment Under Uncertainty: Heuristics and Biases (Cambridge, CUP, 1982).
[33] G. Keren, Facing Uncertainty in the Game of Bridge: A Calibration Study, Organizational Behaviour and Human Decision Processes 39 (1987) 98-114.
[34] S. Kruizinga and A.H. Murphy, Use of an Analogue Procedure to Formulate Objective Probabilistic Temperature Forecasts in the Netherlands, Monthly Weather Review 111 (1983) 2244-2254.
[35] R.W. Lawson, Traffic Usage Forecasting: Is it an Art or a Science?, Telephony, February (1981) 19-24.
[36] S. Lichtenstein, B. Fischhoff and L.D. Phillips, Calibration of Subjective Probabilities: The State of the Art to 1980, in: D. Kahneman, P. Slovic and A. Tversky, Eds. Judgment Under Uncertainty: Heuristics and Biases (New York, CUP, 1982).
[37] R.R. MacDonald, Credible Conceptions and Implausible Probabilities, British Journal of Mathematical and Statistical Psychology 39 (1986) 15-27.
[38] R.R. MacDonald, More About Linda, or Conjunctions in Context, paper presented at SPUDM-I, Cambridge, England (1987).
[39] P. Manu, L.A. Runge, J.Y. Lee and A.D. Oppenheim, Judged Frequency of Complications After Invasive Diagnostic Procedures: Systematic Biases of a Physician Population, Medical Care 22 (1984) 366-370.
[40] A.H. Murphy and B.G. Brown, A Comparative Evaluation of Objective and Subjective Weather Forecasts in



the United States, in: G. Wright, Ed. Behavioral Decision Making (New York, Plenum 1985). [41] M.A. Neale and G.B. Northcraft, Experts, Amateurs and Refrigerators: Comparing Expert and Amateur Negotiators in a Novel Task, Organisational Behaviour and Human Decision Processes 38 (1986) 305-317. [42] A. Newell and H.A. Simon, Human Problem Solving (New Jersey, Prentice-Hall, 1972). [43] G.B. Northcraft and M.A. Neale, Experts, Amateurs and Real-Estate: An Anchoring and Adjust Perspective in Property Pricing Decisions, Organisational Behaviour and Human Decision Processes 39 (1987) 84-97. [44] S. Oskamp, The Relationship of Clinical Experience and Training Methods to Several Criteria of Clinical Prediction, Psychological Monographs 76 (1962). [45] J.W. Payne, Contingent decision behaviour, Psychological Bulletin 92 (1982) 383-402. [46] L.D. Phillips, On the Adequacy of Judgmental Forecasts, in: G. Wright and P. Ayton, Eds. Judgmental Forecasting (Chichester, Wiley, 1987). [47] L.D. Phillips and W. Edwards, Conservatism in a Simple Probability Influence Task, Journal of Experimental Psychology 72 (1966) 346-354. [48] G.E. Pitz, Decision Making and Cognition, in: H. Jungerman and G. de Zeeuw, Eds. Decision Making and Change in Cognitive Affairs (Amsterdam, D. Reidel, 1977). [49] G.E. Pitz and N.J. Sachs, Judgment and Decision Making: Theory and Applications, Annual Review of Psychology 35 (1984). [50] H. Raiffa and R. Schlaifer, Applied Statistical Decision Theory, Boston, Mass: Harvard Business School Division of Research (1961). [51] J. Shanteau, Psychological Characteristics of Expert Decision Makers, in: J. Mumpower, L.D. Phillips, O. Renn and Y.R.R. Uppuluri, Eds. Expert Judgment and Expert Systems, pp. 289-304 (Berlin, Springer-Verlag, 1987). [52] P. Slovic, Value as a Determiner of Subjective Probability, IEEE Transactions on Human Factors in Electronics HFE-7 (1966) 22-28. [53] P. Slovic, Towards Understanding and Improving Decisions, in: W.C. Howell and E.A. Fleishmann, Eds. Human Performance and Productivity Vol. 12, Information Processing and Decision Making (Hillsdale N.J., Erlbaum, 1982). [54] W. Snyder, Decision Making with Risk and Uncertainty: The Case of Horse Racing, American Journal of Psychology 91 (1978) 2, 201-209.

[55] R.F. Soergel, Probing the Past for the Future, Sales and Marketing Management 130 (1983) 39-43.
[56] H.J. Steadman, Predicting Dangerousness Among the Mentally Ill: Art, Magic and Science, International Journal of Law and Psychiatry 6 (1983) 381-390.
[57] D.A. Trumbo, C.K. Adams, M. Milner and L. Schipper, Reliability and Accuracy in the Inspection of Hard Red Winter Wheat, Cereal Science Today 7 (1962) 3, 62-71.
[58] A. Tversky and D. Kahneman, Judgment Under Uncertainty: Heuristics and Biases, Science 185 (1974) 1124-1131.
[59] D. Von Winterfeldt and W. Edwards, Decision Analysis and Behavioral Research (Cambridge, Cambridge University Press, 1986).
[60] W.A. Wagenaar and G.B. Keren, Does the Expert Know? The Reliability of Predictions and Confidence Ratings of Experts, in: E. Hollnagel et al., Eds. Intelligent Decision Support in Process Environments (Berlin, Springer-Verlag, 1986).
[61] H.A. Wallace, What is in the Corn Judge's Mind?, Journal of the American Society of Agronomy 15 (1923) 300-304.
[62] S.R. Watson and D.M. Buede, Decision Synthesis (Cambridge, Cambridge University Press, 1987).
[63] R.L. Winkler and A.H. Murphy, Experiments in the Laboratory and the Real-World, Organizational Behaviour and Human Performance 10 (1973) 252-270.
[64] G. Wright, Behavioral Decision Theory (Harmondsworth, Penguin, 1984).
[65] G. Wright and P. Ayton, Eliciting and Modelling Expert Knowledge, Decision Support Systems 3 (1987) 13-26.
[66] G. Wright and P. Ayton, Judgmental Forecasting: Personologism, Situationism or Interactionism?, Personality and Individual Differences 9 (1988) 1, 109-120.
[67] G. Wright and P. Ayton, Judgmental Probability Forecasts for Personal and Impersonal Events, International Journal of Forecasting 5 (1989) 117-125.
[68] G. Wright and F. Bolger, Eds. Expertise and Decision Support (New York, Plenum, 1992).
[69] J.F. Yates, External Correspondence: Decompositions of the Mean Probability Score, Organizational Behaviour and Human Performance 30 (1982) 132-156.
[70] J.F. Yates, Judgment and Decision Making (Englewood Cliffs, Prentice Hall, 1990).
[71] Z.I. Yousef and C.R. Peterson, Intuitive Cascaded Inferences, Organizational Behaviour and Human Performance 10 (1973) 349-358.