
ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES

Vol. 69, No. 3, March, pp. 205–219, 1997 ARTICLE NO. OB972682

The Importance of the Task in Analyzing Expert Judgment

THOMAS R. STEWART
Center for Policy Research, University at Albany, State University of New York

PAUL J. ROEBBER
Department of Geosciences, University of Wisconsin at Milwaukee

AND

LANCE F. BOSART
Department of Atmospheric Science, University at Albany, State University of New York

The authors thank Jeryl Mumpower, Claudia González-Vallejo, and James Shanteau for helpful comments on an earlier version of this paper. This research was supported in part by the National Science Foundation under Grants SES-9109594 (Division of Social and Economic Sciences and Division of Atmospheric Sciences, Thomas R. Stewart, Principal Investigator) and ATM-9114598 (Division of Atmospheric Sciences, Lance F. Bosart, Principal Investigator). Address correspondence and reprint requests to Thomas R. Stewart, Center for Policy Research, Milne 300, University at Albany, 135 Western Avenue, Albany, NY 12222-1011. Fax: 518-442-3398. E-mail: [email protected].

The accuracy of judgmental forecasts of temperature and precipitation was analyzed. In contrast to the findings of many studies of expert judgment and forecasting, forecasts were highly accurate and forecaster agreement was high. Human forecasters performed better than an operational forecasting model and about the same as a linear regression model. Differences between temperature and precipitation forecasts could be predicted from differences between the tasks. In general, differences between tasks were much greater than differences between forecasters. Task predictability was an excellent indicator of forecast accuracy. The characteristics of the environment for forecasting temperature and precipitation that contribute to accuracy include high-quality information and the availability of "guidance" based on a computer model. It is concluded that an understanding of the properties of the task is essential for understanding the accuracy of expert judgment. © 1997 Academic Press

INTRODUCTION

Why are some experts more accurate than others? Differential accuracy of judgment is the central issue in research on expert judgment and judgmental forecasting. Explanations are found both in characteristics of the judges, such as experience (Oskamp, 1965; Roebber & Bosart, 1996a; Roebber et al., 1996), and in characteristics of the task (Hammond, 1996; Bolger & Wright, 1994; Shanteau, 1992). This paper exploits a unique data set to address differential accuracy and several other key issues in research on expert judgment. The data, derived from a weather forecasting exercise, include: (a) multiple forecasters with different levels of experience, (b) multiple forecasting tasks with varying levels of difficulty, (c) meteorological observations for objective evaluation of the accuracy of forecasts, (d) measures of the information (cues) used in forecasting, and (e) a set of computer-generated forecasts. Finally, and most importantly, the forecasts were made under natural conditions, thus addressing a need for naturalistic judgment studies that has been stressed by many authors (e.g., Einhorn, 1974; Ebbesen & Konecni, 1980; Brehmer & Brehmer, 1988; Bunn & Wright, 1991; Hammond, 1993; Klein et al., 1993; Bolger & Wright, 1994). In addition to accuracy, this study yields evidence on the following related questions:

1. How reliable are judgments?
2. How much do experts agree/disagree?
3. How well does a linear model fit the judgments? How much of the valid variance in the judgments is accounted for by a linear model?
4. How important are nonlinear, configural models in analyzing judgment?
5. How does the performance of models compare to human judgments?

6. How does the performance of a group average compare to the performance of individuals?

An important issue that cuts across all the above questions is the amount and nature of the influence of task characteristics on properties of judgment. In the next section, we briefly summarize the literature on task characteristics.

THE IMPORTANCE OF TASK CHARACTERISTICS

Our analysis is heavily influenced by Brunswik's (1952, 1956) Probabilistic Functionalism and its extensions, Hammond's Social Judgment Theory (Hammond et al., 1975; Doherty & Kurz, 1996) and Cognitive Continuum Theory (Hammond, 1980, 1981), which argue that the focus of the study of judgment is the relation between the judge and the environment (or task), and that judgment cannot be understood without understanding the properties of the task. Task properties are important both because they can facilitate or limit judgmental accuracy and because they describe the environment in which the judgment process was learned.

Cognitive Continuum Theory (CCT) states that task characteristics tend to induce intuitive and analytical modes of cognition and that most tasks induce a combination of modes called "quasirational" thought. CCT specifies three categories of important task characteristics: (a) complexity of task structure, (b) ambiguity of task content, and (c) form of task presentation. In this study, we will focus on complexity of task structure, which is determined by the formal, statistical properties of the tasks.

Because judgment tasks contain inherent and irreducible uncertainty (Brunswik, 1952; Hammond, 1996), they are not perfectly predictable. Task predictability influences judgmental accuracy in at least two ways. Not only does task predictability place an upper bound on potential accuracy (Tennekes, 1988; Stewart, 1990; Stewart & Lusk, 1994), but a number of studies have found evidence that the reliability of judgment is lower for less predictable tasks (Brehmer, 1976, 1978; Camerer, 1981; Harvey, 1995). Judges respond to unpredictable tasks by behaving less predictably themselves.

Another important task property is the amount of information (usually operationalized as the number of cues) available when a judgment is made. It has been suggested that reliability decreases as the amount of information available increases (Einhorn, 1971; Hogarth, 1987; Stewart et al., 1992; Lee & Yates, 1992). Faust (1986) reviewed several studies that suggest that judges are not able to make proper use of large numbers of cues. Lee and Yates (1992) found that G (see the description of the lens model equation below) decreases with increasing numbers of cues.

Other authors have recognized the importance of task characteristics (Bolger & Wright, 1994; Shanteau, 1992), but, despite their importance, many studies of expert judgment do not address task characteristics. In some cases, the data necessary to analyze important task properties are not obtained. In other cases, the data are obtained but not analyzed.

LENS MODEL ANALYSIS OF ACCURACY

The conceptual and analytical framework for this study is based on Brunswik's Lens Model and Hammond's Social Judgment Theory (Cooksey, 1996; Hammond, 1996). Our measures of both forecast accuracy and agreement among forecasters are based on the correlation coefficient. One objection to the correlation is that it is not sensitive to the lack of calibration (bias) that is important to the users of forecasts. Stewart (1990) showed how bias in judgment could be incorporated into lens model studies using Murphy's (1988) decomposition of the skill score. In the present study, however, bias was very low for all tasks and all forecasters (ranging from 0.000 to 0.014) and contributed nothing to understanding the differential accuracy of forecasts. This is consistent with previous research showing that weather forecasters are well calibrated when making familiar forecasts (Murphy & Winkler, 1974; Murphy & Daan, 1984). In order to simplify this paper, we will not discuss bias.

The lens model equation (LME) for decomposing the correlation coefficient (Hursch, Hammond, & Hursch, 1964; Tucker, 1964; Castellan, 1973; Stewart, 1976; Cooksey & Freebody, 1985) shows how the correlation is influenced by properties of the task, the cognitive system of the judge, and the relations between them. The LME assumes that the variables representing the observed event (O) and the judgment (Y) have been partitioned into two linearly independent components: one that is a function of the cues and another that is unrelated to them. This partitioning can be written as

O = M_{O.X}(X_1, X_2, \ldots, X_n) + E_{O.X}
Y = M_{Y.X}(X_1, X_2, \ldots, X_n) + E_{Y.X},

where the X_i are the cues; M_{O.X} and M_{Y.X} represent models that describe the relations between the cues and the event and between the cues and the forecast, respectively; and the Es, which represent the residuals or "errors" of the models, are not related to the Ms.


Based on such a partitioning, Tucker (1964) developed the following form of the LME:

r_{YO} = R_{O.X} G R_{Y.X} + C \sqrt{1 - R_{O.X}^2} \sqrt{1 - R_{Y.X}^2},

where RO.X is the correlation between O and MO.X, G is the correlation between MO.X and MY.X, RY.X is the correlation between Y and MY.X, and C is the correlation between EY.X and EO.X. In a forecasting context, the terms can be interpreted as follows:

RO.X: The strength of the relation between the observed event and the cues. If MO.X is the optimal model of the observed event (that is, it exhausts the systematic relations between the cues and the observed event), then this component measures the maximum predictability of the observed event for the given set of cues. In lens model studies, RO.X is typically the multiple correlation obtained by a linear regression analysis. For reasons summarized by Dawes and Corrigan (1974) and Goldberg (1991), a linear regression model often provides a good estimate of task predictability even when the optimal model is not linear.

G: The match between the environmental model and the forecast model. This is the correlation between (a) the model of the systematic relations between the observed event and the cues and (b) the model of the forecaster's cue utilization. In other words, it is a measure of how well the model of the forecast matches the environmental model.

RY.X: The strength of the relation between the forecast and the cues. If MY.X exhausts the systematic relations between the cues and the forecast, then this component could be considered a measure of the reliability of the forecast.

C: The correlation between the unmodeled component of the forecast event and the unmodeled component of the forecast. If the event and forecast models capture all the systematic variance, then C should differ from zero only by chance. If the event and forecast models are both linear, then C measures the validity of the nonlinear component of judgment, and the second term of the lens model equation measures the contribution of the nonlinear component to skill.

Measures of forecast performance based on the LME are summarized in Table 1.
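To make the decomposition concrete, the following sketch (our illustration, not the authors' code) estimates the LME terms with ordinary least-squares regressions. It assumes NumPy and scikit-learn are available and uses hypothetical arrays `cues`, `events`, and `forecasts`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def lens_model_terms(cues, events, forecasts):
    """Estimate the terms of the lens model equation with linear models.

    cues:      (n_cases, n_cues) array of cue values X
    events:    (n_cases,) observed events O
    forecasts: (n_cases,) judgmental forecasts Y
    """
    def corr(a, b):
        return float(np.corrcoef(a, b)[0, 1])

    # Linear environmental model M_O.X and linear judgment model M_Y.X
    o_hat = LinearRegression().fit(cues, events).predict(cues)
    y_hat = LinearRegression().fit(cues, forecasts).predict(cues)

    R_ox = corr(events, o_hat)                    # task predictability
    R_yx = corr(forecasts, y_hat)                 # linear fit of the forecast
    G = corr(o_hat, y_hat)                        # matching index, modeled variance
    C = corr(events - o_hat, forecasts - y_hat)   # matching index, residual variance

    linear = R_ox * G * R_yx
    nonlinear = C * np.sqrt(1 - R_ox**2) * np.sqrt(1 - R_yx**2)
    r_yo = corr(events, forecasts)                # accuracy; equals linear + nonlinear
    return r_yo, linear, nonlinear, R_ox, G, R_yx, C
```

Because both models are least-squares fits on the same cue set, the residuals are orthogonal to the fitted values, so the identity r_YO = linear component + nonlinear component holds exactly in-sample.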

Since agreement between two forecasters is indicated by the correlation between their forecasts, agreement can also be decomposed using the lens model equation. It will be obvious from the context whether we are discussing forecast accuracy (forecast–event correlations) or forecaster agreement (forecast–forecast correlations).

In order to avoid repetition of long phrases, we will use the following terms to refer to different components of variation in forecasts and events: Linear (residual) variance refers to the variance accounted for (not accounted for) by a linear, additive regression model. Residual variance may include a systematic and a nonsystematic (random error) component. Nonlinear variance is the systematic part of the residual variance, which may include variance that could be accounted for by a curvilinear, additive model; a non-additive model (also called an interactive or configural model); or use of information not included in the cue set (Einhorn, 1974).

METHOD

The data used in this study were gathered as part of an ongoing weather forecasting exercise conducted at the University at Albany. Forecasts of daily temperature and precipitation have been made routinely by members of the Department of Atmospheric Science during the Fall and Spring academic semesters since September 1969 (Bosart, 1975, 1983). In this paper, we analyze results from the game for the period September 1988 through December 1992.

Forecasts of maximum (daytime) and minimum (nighttime) temperature, measurable precipitation probability (precipitation greater than 0.01 inches, hereafter referred to as POP), and precipitation amount are made for the four sequential 12-h periods beginning at 0000 UTC of the following day. (In Albany, local standard time is 5 h behind UTC.) Thus, a forecast submitted by 2045 UTC of Day 0 (the approximate forecast submission time) would include a minimum temperature forecast for the period 0000–1200 UTC of Day 1, a maximum temperature forecast for 1200–0000 UTC of Day 1, a minimum temperature forecast for 0000–1200 UTC of Day 2, and a maximum temperature forecast for 1200–0000 UTC of Day 2. In addition, POP and precipitation amount forecasts would also be submitted for each of these four periods. Figure 1 summarizes (in both UTC and local time) the forecast times and verification periods for the forecasts used in this study.

Daily forecasts (Monday through Friday) are made by a collective group of 10–20 students and faculty of varying experience and background, with approximately 66 forecast days per semester. Scores, which are public and posted weekly, are based on absolute error for temperature and precipitation amount forecasts and on the Brier score (Brier, 1950) for POP forecasts.
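For reference (our addition, using the standard definition rather than notation from the paper), the Brier score for N probability-of-precipitation forecasts p_i verified against binary outcomes o_i (1 if measurable precipitation occurred, 0 otherwise) is the mean squared probability error, with lower values indicating better forecasts:

```latex
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2
```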


TABLE 1
Summary of Measures of Forecast Performance

Measure | Symbol | Description
Accuracy correlation | rYO | Correlation between the forecast and the observed event
Task predictability (a) | RO.X | The strength of the relation between the observed event and the cues
Forecast reliability (b) | RY.X | The strength of the relation between the forecast and the cues
Matching index (modeled variance) | G | The correlation between the environmental model and the forecast model
Matching index (unmodeled variance) | C | The correlation between the variance not captured by the environmental model or the forecast model (residual variance)
Linear component of accuracy | RO.X G RY.X | The part of the correlation that has been analyzed by environmental and forecast models
Non-linear component of accuracy | C √(1 − R2O.X) √(1 − R2Y.X) | The part of the correlation that has not been analyzed by environmental and forecast models

(a) Assumes that the environmental model exhausts the systematic variance in the environment.
(b) Assumes that the forecast model exhausts the systematic variance in the forecast.

The information available to the Albany forecasters included virtually everything that was available to operational forecasters in a local National Weather Service (NWS) Forecasting Office, including satellite images, radar, surface and upper air data and analyses, soundings, and numerical model output from the full suite of National Meteorological Center products.

Forecasters and Forecast Events

In order to examine issues of individual differences among forecasters and combination of forecasts, we analyzed the subset of 178 forecast days on which all four individuals participated. These four forecasters included a faculty member, a staff member who teaches forecasting and has extensive operational forecasting experience, a graduate student with operational forecasting experience, and an undergraduate. In addition, we analyzed forecasts made by the National Weather Service Albany Forecasting Office (WSFO), forecasts made by a numerical model (NGM-MOS, or simply MOS), and a "consensus" forecast obtained by averaging the four forecasts.

We selected four forecasts for analysis to represent a range of lead times and forecast difficulty. The forecasts were: (1) minimum (nighttime) temperature 0–12 h from the time the forecast is made, (2) maximum (daytime) temperature 12–24 h ahead, (3) probability of precipitation (POP) during the 0–12 h period, and (4) POP during the 12–24 h period. For brevity, forecast lead times are identified by their maximum time horizon (12 or 24 h).

Temperature forecasts are, in principle, continuous. Although in practice only integer values are forecast and observed temperatures are rounded to integers before recording, the range of temperatures (about 70°F) clearly justifies treatment as a continuous scale.

Traditionally, POP forecasts are given in whole numbers ranging from 0 to 10, corresponding to probabilities of 0.00 to 1.00. Since the criterion for accuracy of POP forecasts is occurrence of measurable (greater than 0.01 inches) precipitation at the official NWS gauge, POP refers to the probability of precipitation at that point during the forecast period. Murphy et al. (1980), however, surveyed a group of citizens and found that 56% of the sample believed that POP applied to an area rather than a point. The forecasters themselves, although clearly aware that only one point in the forecast area is used for determining accuracy, are also aware of the difficulty of forecasting for a single point. Consequently, there may be occasions where the forecast is "correct" because precipitation occurs in the forecast area, but the forecast is not verified because no measurable precipitation occurs at the official station. Evaluation of forecasts based on observed precipitation may therefore underestimate actual forecasting skill. Although the degree of underestimation has not been determined, it is believed to be small. This is not a problem for observed temperatures, which are very good, though obviously not perfect, measures of the "true" temperature.

Cues

FIG. 1. Time line for the Albany forecasting game.

In order to conduct a lens model analysis of forecasts, it is necessary to obtain measures of the information or “cues” used by the forecasters. In typical laboratory studies, cues are determined by the researcher. In a naturalistic study, the forecaster can select any cue


information that is available, and the researcher must discover what information was used. We relied on physical principles and forecasting experience to suggest the cues that should be useful to forecasters and that should correlate well with observed temperatures and precipitation. Table 2 lists the cues identified for the temperature and precipitation forecasts, along with the physical basis for the proposed predictive validity of each cue. The list of cues is grounded in sound physical principles and reflects the extensive forecasting experience of two of the authors (LFB and PJR). All of the important information available to the forecasters is included in the cue set.

Since this was a retrospective study, it was necessary to reconstruct the cues that were available to the forecasters but had not been routinely recorded in the Albany forecasting game. In some cases, cues that would have been available to the forecasters were obtained from weather records. In other cases (variables marked by a superscript a in Table 2), the cues had to be reconstructed and their values were estimated from data and modeling results that were not available until after the forecast was made. This estimation procedure involved a variation on what atmospheric scientists call the "perfect prog" approach (short for perfect prognosis; Klein et al., 1959; Carter et al., 1989). As described above, the forecasters had at their disposal a wide array of observational and numerical modeling data up to the time of forecast submission. For the purposes of rebuilding these data, we have assumed that the numerical models can provide a perfect picture of the atmosphere (within the limits of observational error) for short-range forecasts out to 24 h. Consequently, we use actual observations valid during the time period of the forecast verification to simulate the numerical modeling data that were available to the forecaster. Although this procedure may produce data that are more accurate than the data available to the forecasters, it is tenable provided that we restrict our analyses to periods up to roughly 24 h. The procedure has been shown to produce values of atmospheric variables with errors that compare favorably to the measurement errors involved in direct measurement of atmospheric characteristics (DiMego et al., 1992). A detailed description of the method for estimating the cues is available from the authors.

Model Output Statistics (MOS) Forecasts

MOS forecasts are based on a set of regression models developed by the Techniques Development Laboratory

TABLE 2
Forecast Cues for Temperature and POP Forecasts

Temperature cues (13 cues for 12-h minimum, 12 cues for 24-h maximum) | Physical basis
Climatological temperature for date | Seasonal cycles
Persistence (previous day's temperature) | Synoptic-scale weather pattern
NGM MOS (temperature predicted by model) | MLR of forecast model output
Surface dewpoint temperature (0000 and 1200 UTC) (a) | Saturation temperature limit/radiation
Surface temperature (0000 UTC, min only) (a) | Stability (with 850 mb temperature)
850 mb temperature (0000 and 1200 UTC) (a) | Air mass temperature/mixing
Wind speed (0000 and 1200 UTC) (a) | Mechanical mixing
Log of precipitable water (0000 and 1200 UTC) (a) | Tropospheric moisture/radiation
Snow on ground | Radiation

POP cues (24 cues for both 12- and 24-h forecasts) | Physical basis
Climatological probability for date | Seasonal cycles
Persistence (previous day's precipitation) | Synoptic-scale weather pattern
NGM MOS (POP predicted by model) | MLR forecast model output
Surface dewpoint temperature (0000 and 1200 UTC) (a) | Boundary layer moisture
Trenberth (1978) ascent forcing (0000 and 1200 UTC) (a) | Quasi-geostrophic theory
Minimum Trenberth ascent forcing (within 12-h period) | Quasi-geostrophic theory
250 mb vorticity advection (0000 and 1200 UTC) (a) | Quasi-geostrophic theory
Minimum vorticity advection (within 12-h period) | Quasi-geostrophic theory
850 mb wind components (U and V at 0000 and 1200 UTC) (a) | Orographic influences
K-index (0000 and 1200 UTC) (a) | Convective instability
Lifted index (0000 and 1200 UTC) (a) | Convective instability
Mid-level mean relative humidity (0000 and 1200 UTC) (a) | Mid-level moisture
Cloud thickness (a) | Low- to mid-level moisture
Precipitable water (0000 and 1200 UTC) (a) | Tropospheric moisture

(a) Variable estimated by the perfect prog method.


of the National Oceanic and Atmospheric Administration (Glahn & Lowry, 1972). These regression models use the output of numerical models, which produce measures (such as mid-level atmospheric humidity) that are only of interest to meteorologists, to produce forecasts that are of direct interest to the public, such as temperature and precipitation.

Since MOS is an important and widely used forecasting tool, it was included as a cue in all analyses of human forecasts. It is important to remember, however, that MOS is a unique cue because it was derived from other information. The other cues characterize current weather conditions or trends, but MOS is a higher-order cue that aggregates relevant information into an automated forecast. Such derived, or "Decision Support System-based," cues are likely to play an increasingly important role in professional judgment.1 Since MOS is also a forecast, it was analyzed as a separate forecaster, but, for obvious reasons, when MOS was analyzed as a forecast, it was not included as a cue.

Representativeness of Cues and Forecast Context

It is important to ask to what extent this data set is representative of what forecasters do in operational settings with regard to information available, deadline pressure, and related issues. One of us (PJR) has had considerable experience in both operational forecast settings and university games, while another of us (LFB) is quite familiar with NWS restrictions. Information available to forecasters is virtually identical in both settings. In our view, the most important distinction is that the deadline is softer in the university setting, so that if the forecaster chooses, he or she can examine the data to a greater extent than can the operational forecaster, who has many duties and a much stricter timetable. However, we also suspect that someone in continuous contact with the evolving weather situation (such as an operational forecaster) is in a better position by the end of a work session to have a "feel" for what pattern is emerging than a person with intermittent contact throughout the day (such as a university researcher or student). Existing data, such as those presented by Bosart (1983) and Sanders (1986), suggest that consensus forecasts from university games compete quite favorably with WSFO products, and that the best individual forecasters have skill comparable to NWS forecasters. Since WSFO forecasts were included in the data, it was possible to examine differences between the Albany game forecasts and operational forecasts.

1 This was pointed out to us by James Shanteau.

RESULTS

We first examine the formal properties of the forecasting tasks and make some predictions about the forecasts based on these properties. We then turn to an analysis of the forecasts.

Analysis of the Forecasting Tasks

Although potentially important task characteristics have been described by Murphy and Brown (1984), Shanteau (1992), Klein (1993), Bolger and Wright (1994), and others, we have chosen to focus on a different set of characteristics: the formal (statistical) task properties that determine the "complexity of task structure" in Cognitive Continuum Theory.2

Relations among the cues. There were substantial correlations among pairs of forecast cues for both temperature and POP forecasts. For temperature cues, intercorrelations varied from −0.414 to 0.911 with a median of 0.630. Principal components analysis showed that three factors would account for 80% of the variance, and seven factors would account for 95% of the variance in all 13 cues. For POP cues, intercorrelations varied from −0.726 to 0.874 with a median of 0.013. Principal components analysis of the 24 cues for POP forecasts showed that eight factors would account for 81% of the variance, and it would require 15 factors to account for 95% of the variance. Not only were there fewer cues for temperature forecasts than for POP forecasts, the cues could be reduced to a smaller number of linearly independent variables, indicating that POP forecasting involved more independent items of information and was therefore more cognitively complex. Because of the linear dependencies in both sets of cues, estimates of the relative weighting of various cues by different forecasters were highly unstable. Another consequence of the linear dependencies is that accuracy was not sensitive to departures from optimal weights (von Winterfeldt & Edwards, 1986; see below).

Fit of a linear model. The results of linear multiple regression analysis of the four events, using the appropriate cues as predictors, are summarized in Table 3. For temperature events, the linear model was an excellent fit. Precipitation events were predicted moderately well by a linear combination of the 24 cues. MOS was the best single predictor for all events. Stepwise regression indicated that, in all cases, the R2 obtained with the full cue set could be closely approximated using only MOS and 1, 2, or 3 other cues.

2 In the interest of space, certain analyses have been omitted. A complete copy of the task analysis may be obtained from the authors.
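As an illustration of this kind of cue analysis (a sketch under the assumption that the cue values are available as a NumPy array; not the authors' code), the cumulative variance explained by principal components can be computed from the cue correlation matrix:

```python
import numpy as np

def cumulative_variance_explained(cues):
    """Cumulative fraction of cue variance explained by principal components.

    cues: (n_cases, n_cues) array of cue values (hypothetical example data).
    Entry k of the result is the proportion of total variance captured by the
    first k+1 components of the cue correlation matrix.
    """
    corr = np.corrcoef(cues, rowvar=False)        # cue intercorrelation matrix
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted largest first
    return np.cumsum(eigenvalues) / eigenvalues.sum()

# e.g., number of components needed for 95% of the variance:
# cum = cumulative_variance_explained(cues)
# n_factors = int(np.argmax(cum >= 0.95)) + 1
```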


TABLE 3
Summary of Linear Regression Models of Forecasting Tasks

Event | Number of cues | Adjusted R2 | Number of cues required for 99% of adjusted R2 (a) | Adjusted R2 for single best predictor (MOS) | Adjusted R2 if MOS excluded from model
Minimum temperature (n = 178) | 13 | .946 | 4 (.941) | .903 | .928
Maximum temperature (n = 169) | 12 | .929 | 2 (.920) | .906 | .904
12-h precipitation (n = 149) | 24 | .579 | 4 (.574) | .516 | .526
24-h precipitation (n = 150) | 24 | .570 | 3 (.572) | .506 | .501

(a) Adjusted R2 in parentheses.
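For reference (our addition; the paper does not spell out the formula, so we assume the standard one), adjusted R2 penalizes the ordinary R2 for the number of predictors p relative to the number of cases n:

```latex
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```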

The difference between the adjusted R2 for all cues and for all the cues except MOS indicates that MOS added to the predictive power of the other cues for all tasks. That additional predictive power was greater for precipitation events than for temperature events and was greater for 24-h lead times than for 12-h lead times. For temperature events, the results in Table 3 are consistent with the principal components analysis showing that only a few linearly independent variables could account for the variance in the cues. For precipitation events, there were more linearly independent factors than are needed for a linear forecasting model. This suggests either (a) there was a substantial amount of information in the cues that was not needed to forecast precipitation or (b) nonlinear or nonadditive use of some of the cues would contribute to the accuracy of precipitation forecasts.

Sensitivity to weights. Following Dawes and Corrigan (1974), we examined the effects of deviations from optimal weights by calculating G for equal-weight and random-weight models. The weights were given the correct signs in all models. G was the only term calculated because it is the only component of accuracy that is affected by changes in weights. The Gs for equal-weight temperature forecasts were 0.971 and 0.949 (12 and 24 h). The Gs for equal-weight POP forecasts were 0.678 and 0.687, suggesting that the precipitation forecasting task is more sensitive to deviations from optimal weights. Gs were also calculated for 10,000 random-weight models using Monte Carlo simulation. Median Gs were 0.961 and 0.931 for temperature and 0.667 and 0.675 for precipitation, providing further indication of the greater weight sensitivity of the precipitation task. Since the G for an optimal linear model is 1.0 by definition, the Gs for alternative weighting systems indicate the losses in predictability due to weights that are not statistically optimal.
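A minimal sketch of this kind of sensitivity analysis (our illustration, assuming the fitted environmental-model coefficients are available; the paper does not specify the random-weight distribution, so uniform weights with the correct signs are used here) computes G as the correlation between the optimally weighted and randomly weighted cue composites:

```python
import numpy as np

def weight_sensitivity_G(cues, optimal_weights, n_trials=10_000, seed=0):
    """Median G between the optimal linear composite and random-weight composites.

    cues:            (n_cases, n_cues) array of cue values
    optimal_weights: regression weights of the environmental model
    Random weights are drawn uniformly from [0, 1) and given the optimal signs.
    """
    rng = np.random.default_rng(seed)
    optimal_composite = cues @ optimal_weights
    signs = np.sign(optimal_weights)
    gs = []
    for _ in range(n_trials):
        random_weights = signs * rng.uniform(0.0, 1.0, size=len(optimal_weights))
        random_composite = cues @ random_weights
        gs.append(np.corrcoef(optimal_composite, random_composite)[0, 1])
    return float(np.median(gs))

# Equal weights with correct signs are the special case random_weights = signs.
```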

For example, the rYO of a forecaster using equal weights to forecast minimum temperature would be 97.1% of that for a forecaster using optimal linear weights, assuming that all other components of accuracy are equal. For temperature forecasts, non-optimal weighting of information would have only a slight impact on accuracy (as long as the signs of the weights were correct). Getting the weights correct would be more critical for precipitation forecasting. One reason for this is the amount of irrelevant information in the cue set that should be ignored (Gaeth & Shanteau, 1984). If that information is given weight, it dilutes the valid variance in the forecast.

Predicted properties of forecasts based on task analysis. Analysis of formal task properties provides insights into why some tasks are more difficult than others and describes the environment in which the forecasters have learned to make forecasts. With the understanding of the environment gained from the task analysis, we can make the following predictions about comparative properties of the forecasts:

1. In general, temperature forecasts will be relatively more accurate than precipitation forecasts. The temperature forecasting task poses little inherent challenge. The event is highly predictable using linear regression, and the low sensitivity to weights indicates that many different strategies can produce good forecasts. For precipitation forecasts, the linear regression is less accurate and there is greater sensitivity to weights.

2. Nonlinear variance will be more important in POP forecasts than in temperature forecasts. Since the temperature forecasting task is highly linear, we would expect the forecasts to be linear also. A linear model does not predict POP as well, so it is possible that configural forecasting strategies may have developed.


3. Because of the lower predictability and larger number of cues in the POP forecasting task, there will be both lower agreement among forecasters and lower reliability of forecasts.

4. Because of the lower expected reliability of POP forecasts, linear models will result in more improvement for POP than for temperature forecasts, and the improvement due to group averaging will be greater for POP forecasts than for temperature forecasts.

The results bearing on these predictions will be examined below.

Linear Regression of Forecasts

Linear multiple regression models were fit to the four forecasts made by each of five human forecasters (four Albany game forecasters and the WSFO forecast). The resulting multiple correlations (RY.X) are presented in Table 4. Squared multiple correlations indicate that from 84.1 to 98.2% of the variance in forecasts could be accounted for by a linear additive model. Although it

is common in lens model studies to find that a linear model fits judgments well (Slovic, Fischhoff, & Lichtenstein, 1977; Brehmer & Brehmer, 1988; Cooksey, 1996), the multiple correlations for temperature forecasts were remarkably high. For reasons argued by Anderson and Shanteau (1977), Birnbaum (1973), McClelland and Judd (1993), and others, the discovery that linear multiple regression models account for large proportions of variance does not necessarily mean that nonlinear cue use is absent or unimportant. We will show below that, despite the high multiple correlations, there is clear evidence for the presence of nonlinear variance in the forecasts.

Agreement among Forecasters

Agreement was measured by computing the correlations between all pairs of human forecasters (10 possible pairs of five forecasters). The median correlations

TABLE 4
Lens Model Equation Analysis of Forecasts

Forecaster | rYO | Linear component of rYO (a) | Nonlinear component of rYO (a) | Nonlinear contribution to accuracy | RO.X | G | RY.X | C

12-h minimum temperature forecasts (n = 178)
Teacher A | 0.968 | 0.955 | 0.013 | 1.39% | 0.975 | 0.994 | 0.986 | 0.362**
Teacher B | 0.971 | 0.953 | 0.018 | 1.83% | 0.975 | 0.992 | 0.985 | 0.466**
Student A | 0.965 | 0.952 | 0.013 | 1.39% | 0.975 | 0.992 | 0.985 | 0.344**
Student B | 0.961 | 0.947 | 0.015 | 1.51% | 0.975 | 0.986 | 0.985 | 0.373**
WSFO | 0.964 | 0.950 | 0.014 | 1.47% | 0.975 | 0.990 | 0.985 | 0.361**
Median | 0.965 | 0.952 | 0.014 | 1.47% | 0.975 | 0.992 | 0.985 | 0.362

24-h maximum temperature forecasts (n = 169)
Teacher A | 0.970 | 0.947 | 0.015 | 1.66% | 0.967 | 0.996 | 0.991 | 0.458**
Teacher B | 0.971 | 0.945 | 0.020 | 1.96% | 0.967 | 0.997 | 0.987 | 0.469**
Student A | 0.963 | 0.944 | 0.015 | 1.72% | 0.967 | 0.996 | 0.983 | 0.356**
Student B | 0.955 | 0.939 | 0.017 | 0.81% | 0.967 | 0.995 | 0.985 | 0.173*
WSFO | 0.962 | 0.942 | 0.016 | 1.51% | 0.967 | 0.997 | 0.983 | 0.312**
Median | 0.963 | 0.944 | 0.016 | 1.66% | 0.967 | 0.996 | 0.985 | 0.356

12-h POP forecasts (n = 149)
Teacher A | 0.749 | 0.716 | 0.033 | 4.35% | 0.805 | 0.926 | 0.961 | 0.197*
Teacher B | 0.755 | 0.722 | 0.033 | 4.38% | 0.805 | 0.924 | 0.971 | 0.233**
Student A | 0.734 | 0.708 | 0.026 | 3.49% | 0.805 | 0.930 | 0.946 | 0.134
Student B | 0.717 | 0.695 | 0.022 | 3.08% | 0.805 | 0.912 | 0.946 | 0.115
WSFO | 0.777 | 0.710 | 0.067 | 8.63% | 0.805 | 0.932 | 0.947 | 0.351**
Median | 0.749 | 0.710 | 0.033 | 4.35% | 0.805 | 0.926 | 0.947 | 0.197

24-h POP forecasts (n = 150)
Teacher A | 0.726 | 0.682 | 0.044 | 6.04% | 0.799 | 0.892 | 0.956 | 0.249**
Teacher B | 0.738 | 0.696 | 0.041 | 5.62% | 0.799 | 0.916 | 0.951 | 0.224**
Student A | 0.707 | 0.652 | 0.055 | 7.78% | 0.799 | 0.889 | 0.917 | 0.230**
Student B | 0.683 | 0.659 | 0.024 | 3.52% | 0.799 | 0.881 | 0.936 | 0.114
WSFO | 0.752 | 0.696 | 0.056 | 7.46% | 0.799 | 0.922 | 0.944 | 0.283**
Median | 0.726 | 0.682 | 0.044 | 6.04% | 0.799 | 0.892 | 0.944 | 0.230

(a) rYO = Linear component + Nonlinear component.
* Denotes C value with p < .05. ** Denotes C value with p < .01.


were 0.990 and 0.991 for 12- and 24-h temperature forecasts, respectively. The corresponding medians for 12- and 24-h POP forecasts were 0.951 and 0.947. Agreement is extremely high for both forecasts, but, as predicted, there was less agreement for the POP forecasts.

Agreement among these forecasters is higher than that found in most studies of expert judgment (Brehmer & Brehmer, 1988; Shanteau, 1995), where findings of large individual differences and disagreement are typical. Agreement is also higher than that found among meteorologists in previous studies of forecasts of severe weather (Stewart et al., 1989, 1992) and microburst forecasting (Lusk et al., 1990).

The LME was used to decompose agreement into linear and nonlinear components. Gs measuring the correlation between the linear components of forecasts were extremely high, ranging from 0.996 to 0.999 for temperature forecasts and from 0.967 to 0.995 for POP forecasts. Cs, measuring the correlation between the components of forecasts not accounted for by the linear model, were lower, ranging from 0.631 to 0.854 for temperature forecasts and from 0.356 to 0.729 for POP forecasts. Importantly, all Cs were positive and statistically significant (p < .01) for both types of forecasts. This is strong evidence that the forecast variance that is not accounted for by the linear model is not entirely random error, but is shared among forecasters. This is not surprising for the Albany game forecasters, since they interacted frequently and could be expected to share forecasting strategies. As would be expected, Cs computed between Albany game forecasters and the WSFO forecast were slightly lower than those among the Albany game forecasters, but they were still substantial and statistically significant.

The nonlinear component of agreement (the second term in the LME) accounted for only approximately 2% of agreement among temperature forecasts and 4 to 11% of agreement among POP forecasts. This supports the prediction of a greater role for nonlinear variance in POP forecasting than in temperature forecasting.

Reliability of Forecasts

Direct measurement of forecast reliability requires the ability to compare pairs of forecasts made under identical conditions. Since weather conditions do not repeat themselves, reliability cannot be directly measured, and indirect indicators of reliability must be used. RY.X is such an indirect, and imperfect, indicator of reliability (Stewart & Lusk, 1994). Table 4 shows that this indicator was higher for temperature forecasts than for precipitation forecasts, as expected. If RY.X is


adjusted for the greater number of predictor variables in the regression equation for POP forecasts, the difference is even greater. Median adjusted R2 for temperature forecasts is 0.967; for POP forecasts, it is 0.876.

Another indicator of the reliability of a forecast is its correlation with other forecasts. These correlations are estimates of a lower bound on reliability (Guilford, 1954). As reported above, agreement was higher for temperature forecasts than for POP forecasts. Thus, both indicators of reliability suggest lower reliability for POP forecasts, as predicted.

Accuracy of Forecasts

Overall forecast accuracy (rYO in Table 4) was very high for temperature forecasts. As predicted, accuracy is somewhat lower for POP forecasts. There was little difference in accuracy between 12- and 24-h forecasts.

The results of the LME decomposition for all forecasters are also presented in Table 4. The G values for the human forecasters were uniformly high, ranging from 0.881 to 0.997. Gs for POP forecasts were lower than for temperature forecasts.

Table 4 shows that the linear component of rYO (the first term in the LME) accounted for nearly all of the forecast accuracy. The nonlinear component of accuracy amounted to less than 2% of rYO for temperature forecasts and from 3.08 to 8.83% for POP forecasts. The small contribution to accuracy of the nonlinear component is consistent with findings in a number of studies (Camerer, 1981). The greater nonlinear contribution for POP is consistent with our prediction that nonlinear variance would be more important for POP than for temperature.

Although nearly all of the valid variance in the forecasts is linear, the nonlinear variance did make a small positive contribution. All of the values of C for temperature forecasts are statistically significant, and 7 out of 10 values for POP forecasts were statistically significant (p < .05), indicating that some of the variance not captured by the linear regression was systematic, valid variance. Although the actual contribution to accuracy of the nonlinear component was small, it was positive for all forecasts. If the nonlinear component were comprised entirely of nonsystematic error variance, its relation to the event would be equally likely to be negative or positive. Since the importance of any increment in forecasting accuracy, no matter how small, depends on how that forecast is used (Roebber & Bosart, 1996b), the value of a small positive component of accuracy could be substantial for certain users of forecasts.

The source of the valid nonlinear component could not be determined. One possibility is nonlinear or configural use of cues. In principle, this possibility could be investigated, but due to the small amount of nonlinear


variance, the intercorrelations among the cues, and the limited size of the sample of forecasts, we chose not to pursue it. Since the cues were measured retrospectively, two other possibilities cannot be ruled out. One is that forecasters used information not included in the cues. Another is that a cue used by the forecasters was not included in the set, but was nonlinearly related to one or more of the cues that were included.
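As one way to check such a C value for significance (our illustration of a standard test; the paper does not state which procedure was used), the Fisher z-transform gives an approximate two-sided p-value for a correlation r based on n cases:

```python
import numpy as np
from scipy import stats

def corr_p_value(r, n):
    """Approximate two-sided p-value for H0: correlation = 0 (Fisher z-transform)."""
    z = np.arctanh(r) * np.sqrt(n - 3)
    return 2 * stats.norm.sf(abs(z))

# Example: C = 0.197 with n = 149 12-h POP forecasts gives p of roughly .016 (p < .05).
```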

Human vs Model Forecasts

The comparison between computer models and human forecasters has been a topic of considerable interest both in the weather forecasting literature (Snellman, 1977; Murphy & Brown, 1984; Clemen & Murphy, 1986; Murphy et al., 1988) and the expert judgment literature (Dawes et al., 1989; Blattberg & Hoch, 1990; Bunn & Wright, 1991). We compared two models with human forecasts: (a) the multiple linear regression model, fit to the events based on the sample data, and (b) MOS, which is used operationally for forecast guidance.

Multiple linear regression (MLR) vs human forecasts. A direct comparison of rYO for humans and MLR would be unfair because the MLR forecast is based on a model that was fit to the sample data. Since we did not have enough cases for cross-validation, an "adjusted" rYO was computed by taking the square root of the adjusted R2 from the regression analysis. The results (Table 5) show that for temperature forecasts, MLR accuracy was approximately equal to the median for human forecasters, while for POP forecasts, it was slightly above the human median. This provides weak support for the predicted greater advantage of the linear regression model for POP than for temperature forecasts.

MOS vs human forecasts. Humans compared more favorably to MOS than to MLR. For temperature forecasts, the accuracy of MOS was lower than that of any human forecaster. For 12-h POP forecasts, MOS performed better than only one human forecaster. For 24-h POP forecasts, MOS performed better than two of the human forecasters.

The accuracy of MLR was greater than that of MOS for all forecasts, but this comparison is unfair because MOS was one of the predictors in the MLR model. A fairer comparison was made by removing MOS from the MLR model (MLR-MOS). The performance of the resulting regression model was similar to that of MOS for all forecasts. This is not surprising, since MOS is a linear regression model based on variables that are similar to our cues.

The advantage that humans had over MOS may have been because: (a) forecasters were aware of MOS forecasts and could therefore interpret them in the context of other information; (b) over time, forecasters developed an understanding of when MOS predicted well and when it did not; and (c) forecasters may have had access to information that was not included in MOS. For example, Roebber and Bosart (1996a) have shown that experienced forecasters know that radiative effects on temperature forecasts are handled relatively poorly by MOS; such effects are apparent from particular synoptic patterns, and the forecasters, who are aware of the MOS forecasts, can adjust temperatures accordingly. In addition, since the numerical models are run only twice daily, the MOS output is based on data which may already be several hours old at forecast time. Under certain rapidly evolving conditions, such as a frontal passage, additional observations beyond the initial time of the model forecast may help forecasters better interpret the timing of such meteorologically significant events.

Table 5 shows that MOS performed better than the human forecasters in one regard: both C and the nonlinear contribution to accuracy are higher for MOS than for the human forecasts in every case. This indicates that, despite the fact that MOS was based on linear regression, there was a component that was not linearly related to the cues used in this study, and that component had some validity. That component must have been due to either (a) variables included in MOS but not included in the cues, or (b) nonlinear relations between one or more of the variables included in MOS and one or more of the cues used in this study. Evidence in favor of (a) is that MOS was based on model output that is updated every three hours, while the present study was based on cues updated every 12 h. While it might, in principle, be possible to examine the MOS equations to determine the source of the nonlinear component, that was beyond the scope of this study.

Although human forecasters outperformed the operational model, the model seemed to contribute to the accuracy of forecasts. It had, by far, the greatest weight of any cue in all the regression models of the forecasts, and it contributed to the predictability for all tasks (Table 3). It is likely that the performance of the human forecasters would be poorer if they were denied access to MOS, but this has not been empirically tested.

The results of our comparison between MOS and humans would not have been predicted from much of the judgment literature on comparisons between humans and models. The superiority of models over people generally results from the superior reliability of models (Dawes & Corrigan, 1974; Camerer, 1981). Although MOS was perfectly reliable, its primary advantage over the human forecasters was in its nonlinear component.


TABLE 5 Lens Model Equation Analysis of Model and Group Average Forecasts

rYO

Linear componenta of rYO

MLR MOS MLR-MOS Group

0.973b 0.951 0.963b 0.969

0.973 0.913 0.963 0.954

MLR MOS MLR-MOS Group

0.964b 0.952 0.951b 0.968

0.964 0.901 0.951 0.946

MLR MOS MLR-MOS Group

0.761b 0.721 0.725b 0.756

0.761 0.617 0.725 0.723

12-h POP forecast (n 5 149) 0.000 0.00% 0.104 14.42% 0.000 0.00% 0.033 4.39%

MLR MOS MLR-MOS Group

0.755b 0.714 0.708b 0.716

0.755 0.582 0.708 0.680

24-h POP forecast (n 5 150) 0.000 0.00% 0.131 18.38% 0.000 0.00% 0.035 4.95%

Forecaster

Nonlinear componenta of rYO

Nonlinear contribution to accuracy

RO.X

G

RY.X

C

12-h minimum temperature forecasts (n 5 178) 0.000 0.00% 0.975 0.038 3.99% 0.966 0.000 0.00% 0.966 0.015 1.54% 0.975

1.000 0.987 1.000 0.991

1.000 0.957 1.000 0.987

0.000 0.507** 0.000 0.424**

24-h maximum temperature forecasts (n 5 169) 0.000 0.00% 0.967 0.044 4.22% 0.954 0.000 0.00% 0.954 0.017 1.58% 0.967

1.000 0.990 1.000 0.996

1.000 0.966 1.000 0.989

0.000 0.520** 0.000 0.403**

0.805 0.774 0.774 0.805

1.000 0.905 1.000 0.928

1.000 0.880 1.000 0.969

0.000 0.346** 0.000 0.226**

0.799 0.760 0.760 0.799

1.000 0.904 1.000 0.896

1.000 0.848 1.000 0.950

0.000 0.381** 0.000 0.189*

rYO 5 Linear component 1 Nonlinear component. Square root of adjusted R2. *Denotes C value with p , .05. **Denotes C value with p , .01.

a b

An explanation for the differences between the results of traditional MLR-human comparisons and MOS-human comparisons may be found in differences in the "rules" of the comparison. In nearly all previous studies comparing MLR models with humans (and in our MLR comparison), it has been assumed that the model and the human have access to an identical set of cues (Dawes & Corrigan, 1974; Dawes et al., 1989). In natural settings, however, this assumption may not be valid, and therefore the research based on MLR models may not generalize to such settings (Blattberg & Hoch, 1990; Bunn & Wright, 1991). Consequently, comparisons between human judges and models in natural settings may be expected to differ from comparisons between humans and MLR models.

Combined Forecasts vs Individual Forecasters

Table 5 includes a row for the "Group" forecast, the mean of the four Albany game forecasters. In weather forecasting terminology, this is called a "consensus" forecast. For weather and other types of forecasting, combining forecasts in this way generally improves accuracy (Makridakis & Winkler, 1983; Clemen & Winkler, 1987; Clemen, 1989).

For all tasks except 24-h POP, the group forecast was approximately equal to the best humans. The prediction that the group average would perform better relative to the individuals for POP forecasts than for temperature forecasts was not confirmed (the only prediction that failed). An examination of the LME decomposition revealed no clear pattern of superiority for the group forecast in any area. The group average was a good forecast, but not dramatically better than the humans. This is probably due, at least in part, to the high quality of the human forecasts and the high correlations between them (Clemen & Winkler, 1985; Morrison & Schmittlein, 1991).
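A minimal sketch (our illustration with hypothetical arrays, not the authors' code) of forming the consensus forecast and computing its accuracy correlation:

```python
import numpy as np

def consensus_accuracy(forecasts, events):
    """Accuracy correlation r_YO of the group-average ("consensus") forecast.

    forecasts: (n_days, n_forecasters) array of individual forecasts
    events:    (n_days,) array of observed values
    """
    group = forecasts.mean(axis=1)   # simple unweighted average across forecasters
    return float(np.corrcoef(group, events)[0, 1])
```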

Summary of Key Findings

Quality of judgment. In contrast with the negative view of human judgment reported in much of the literature, the quality of the judgmental forecasts studied in this paper was high. In particular:

1. Forecaster bias was negligible. Forecasts were well calibrated.
2. Forecasters were extremely accurate, particularly when forecasting temperature.
3. There was a high level of agreement among forecasters, suggesting that the forecasts were also reliable.


Linearity of judgment. Linear models accounted for nearly all of the valid variance in the judgments. However, there was shared nonlinearity among the forecasters (with respect to the cues), and that nonlinearity represented a small positive contribution to the accuracy of the forecasts.

Models vs human forecasts. Although humans did not outperform a linear regression model, they were able to improve upon an operational forecasting model.

Combining forecasts. The group average forecast performed about as well as the best humans.

Task analysis. All but one of the predictions from the analysis of the formal properties of the tasks were confirmed:

1. Temperature forecasts were more accurate than POP forecasts.
2. The nonlinear variance was greater for POP forecasts than for temperature forecasts.
3. Agreement among forecasters was lower for POP forecasts than for temperature forecasts, and the indirect evidence suggested that reliability was also lower.
4. Relative to the median of the human forecasters, MLR performed better for POP than for temperature, as predicted. Contrary to prediction, the same was not true for the group average forecast.

DISCUSSION

This study is unique in judgment research. Unlike previous laboratory studies, forecasters worked in a natural setting and their access to information was not restricted. Unlike previous field studies, several forecasters, working as individuals but with access to the same information, made forecasts of the same events. Unlike nearly all studies, four different forecasting tasks were analyzed and compared.

Since we did not control the information available to the forecasters, and the information for forecasting is not routinely recorded, it was necessary to reconstruct retrospectively the cues that were available. As a result, perfect validity of cue measurement was not assured. We believed that this was an acceptable price to pay for the opportunity to analyze accuracy in a naturalistic study. As in studies where the cues are known, the linear regression accounted for most of the variance in both the forecasts and the events that were forecast, suggesting that the reconstructed cues were an adequate approximation to the actual cues.

In contrast to the results of many previous studies, our results paint a positive picture of expert judgment. The forecasts were highly reliable and inter-forecaster

agreement was high. No dramatic gains in performance were to be obtained by simply averaging forecasts or using a linear model to produce forecasts. The human forecasters were able to improve upon a model that is used in operational forecasting.

The good performance of these weather forecasters contrasts sharply with that of one of the most studied, and most maligned, groups of expert judges: clinical psychologists (Faust, 1986). The accuracy of their diagnoses of psychosis and neurosis from psychological test profiles was poor, they disagreed, and their performance was worse than that of a simple statistical model (Dawes et al., 1989). How is it possible that weather forecasters perform so much better than clinical psychologists (and many other experts)? One possible explanation is that they have developed cognitive processes that are more powerful than those of psychologists. This is unlikely since, when confronted with the task of predicting severe weather, meteorologists also disagree and their forecasts are inaccurate (Stewart et al., 1989, 1992; Lusk et al., 1990; Heideman et al., 1993). A more likely explanation is to be found in the nature of the tasks confronting the clinical psychologist and the weather forecaster.

The most important single factor related to forecast accuracy in this study was task predictability, as measured by the multiple correlation for the events (RO.X). In Fig. 2, accuracy (rYO) is plotted against task predictability (RO.X) for the four tasks. The figure clearly shows that differences in accuracy among temperature and POP forecasters were small relative to differences between forecasting tasks, and that differences in accuracy were closely related to task predictability. Also shown in Fig. 2 are data from seven meteorologists' forecasts of probability of hail (Stewart, 1990) and 29 clinical psychologists' diagnoses of patients with psychosis or neurosis from MMPI profiles

FIG. 2. Environmental predictability vs. forecast accuracy for weather forecasters and clinical psychologists.


(Goldberg, 1965). Neither was a naturalistic study, but in both cases accuracy was consistent with task predictability and was much lower than that found in the present study. Figure 2 shows that when meteorologists confront a task with low predictability (predicting hail), their performance is similar to that of the clinical psychologists confronted with a task with a similar level of predictability. Both the meteorologists and the clinical psychologists performed below the level of a linear multiple regression model (although one meteorologist did outperform the model). Therefore, the high skill in temperature and POP forecasts relative to predictions of psychosis and neurosis from the MMPI can be attributed to differences in the tasks. It is not that meteorologists are better judges than clinical psychologists, but rather that temperature and precipitation forecasting have higher limits of predictability. This explanation for differences between weather forecasters and clinical psychologists was overlooked in a recent paper comparing these two types of experts (Monahan & Steadman, 1996).

High task predictability for the weather forecasting tasks is partially a result of the high quality of information available to forecasters. That information is expensive, and it has taken decades to develop the technology to make it routinely available to forecasters. If we had studied forecasts of temperature and precipitation 75 years ago, the results might have looked much more like those for the clinical psychologists. Another major reason for the high accuracy and agreement among forecasters is the availability of a good model-based forecast. This provides a starting point for the forecasting process, and the forecasters are able to improve upon it. We cannot estimate the importance of that improvement because that depends on the forecast user (see Roebber & Bosart, 1996b). Our results simply indicate that the human contribution is positive.

The strong relation between task predictability and accuracy shown in Fig. 2 is maintained when the results of many additional studies are considered (Stewart, 1997). Typically, the results illustrated in Fig. 2 are described in terms such as "the repeated failure of judges to outperform linear regression" and are interpreted with reference to the limitations of the human judgment process (Slovic & Lichtenstein, 1971; Faust, 1986; Hogarth, 1987; Johnson, 1988; Dawes, 1988). If, as noted above, the multiple correlation is a reasonable measure of task predictability, then a more positive view emerges: in many fields of expert judgment, performance is near the limit imposed by environmental uncertainty.

REFERENCES

Anderson, N. H., & Shanteau, J. C. (1977). Weak inference with linear models. Psychological Bulletin, 84, 1155–1170.
Birnbaum, M. H. (1973). The devil rides again: Correlation as an index of fit. Psychological Bulletin, 79, 239–242.
Blattberg, R. C., & Hoch, S. J. (1990). Database models and managerial intuition: 50% model + 50% manager. Management Science, 36, 887–899.
Bolger, F., & Wright, G. (1994). Assessing the quality of expert judgment: Issues and analysis. Decision Support Systems, 11, 1–24.
Bosart, L. F. (1975). SUNYA experimental results in forecasting daily temperature and precipitation. Monthly Weather Review, 103, 1013–1020.
Bosart, L. F. (1983). An update on trends in skill of daily forecasts of temperature and precipitation at the State University of New York at Albany. Bulletin of the American Meteorological Society, 64, 346–354.
Brehmer, A., & Brehmer, B. (1988). What have we learned about human judgment from thirty years of policy capturing? In B. Brehmer & C. R. B. Joyce (Eds.), Human judgment: The social judgment theory view (pp. 75–114). Amsterdam: North-Holland.
Brehmer, B. (1976). Note on clinical judgment and the formal characteristics of clinical tasks. Psychological Bulletin, 83, 778–782.
Brehmer, B. (1978). Response consistency in probabilistic inference tasks. Organizational Behavior and Human Performance, 22, 103–115.
Brehmer, B. (1988). The development of social judgment theory. In B. Brehmer & C. R. B. Joyce (Eds.), Human judgment: The social judgment theory view (pp. 13–40). Amsterdam: North-Holland.
Brehmer, B. (1994). The psychology of linear judgment models. Acta Psychologica, 87, 137–154.
Brier, G. W. (1950). Verification of weather forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.
Brunswik, E. (1952). The conceptual framework of psychology. Chicago: University of Chicago Press.
Brunswik, E. (1956). Perception and the representative design of psychological experiments. Berkeley: University of California Press.
Bunn, D., & Wright, G. (1991). Interaction of judgemental and statistical forecasting methods: Issues and analysis. Management Science, 31, 501–518.
Camerer, C. (1981). General conditions for the success of bootstrapping models. Organizational Behavior and Human Performance, 27, 411–422.
Carter, G. M., Dallavalle, J. P., & Glahn, H. R. (1989). Statistical forecasts based on the National Meteorological Center's numerical weather prediction system. Weather and Forecasting, 4, 401–412.
Castellan, N. J., Jr. (1973). Comments on the "lens model" equation and the analysis of multiple-cue judgment tasks. Psychometrika, 38, 87–100.
Chi, M. T. H., Glaser, R., & Farr, M. J. (Eds.) (1988). The nature of expertise. Hillsdale, NJ: Erlbaum.
Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5, 559–583.
Clemen, R. T., & Murphy, A. H. (1986). Objective and subjective precipitation probability forecasts: Statistical analysis of some interrelationships. Weather and Forecasting, 1, 56–65.
Clemen, R. T., & Winkler, R. L. (1985). Limits for the precision and value of information from dependent sources. Operations Research, 33, 427–442.
Clemen, R. T., & Winkler, R. L. (1987). Calibrating and combining precipitation probability forecasts. In R. Viertl (Ed.), Probability and Bayesian statistics (pp. 97–110). New York: Plenum.
Cooksey, R. W. (1996). Judgment analysis: Theory, methods, and applications. New York: Academic Press.
Cooksey, R. W., & Freebody, P. (1985). Generalized multivariate lens model analysis for complex human inference tasks. Organizational Behavior and Human Decision Processes, 35, 46–72.
Dawes, R. M. (1988). Rational choice in an uncertain world. New York: Harcourt Brace Jovanovich.
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychological Bulletin, 81, 95–106.
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.
DiMego, G. J., Mitchell, K. E., Petersen, R. A., Hoke, J. E., Gerrity, J. P., Tuccillo, J. J., Wobus, R. L., & Juang, H.-M. H. (1992). Changes to NMC's regional analysis and forecast system. Weather and Forecasting, 7, 185–198.
Doherty, M. E., & Kurz, E. (1996). Social judgement theory. Thinking and Reasoning, 2, 109–140.
Ebbesen, E. B., & Konecni, V. J. (1980). On the external validity of decision-making research: What do we know about decisions in the real world? In T. Wallsten (Ed.), Cognitive processes in choice and decision behavior (pp. 21–45). Hillsdale, NJ: Erlbaum.
Einhorn, H. J. (1971). Use of nonlinear, noncompensatory models as a function of task and amount of information. Organizational Behavior and Human Performance, 6, 1–27.
Einhorn, H. J. (1974). Cue definition and residual judgment. Organizational Behavior and Human Decision Processes, 12, 30–49.
Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.
Faust, D. (1986). Research on human judgment and its application to clinical practice. Professional Psychology, 17, 420–430.
Gaeth, G. J., & Shanteau, J. (1984). Reducing the influence of irrelevant information on experienced decision makers. Organizational Behavior and Human Performance, 33, 263–282.
Glahn, H. R., & Lowry, D. A. (1972). The use of model output statistics (MOS) in objective weather forecasting. Journal of Applied Meteorology, 11, 1203–1211.
Goldberg, L. R. (1965). Diagnosticians vs. diagnostic signs: The diagnosis of psychosis vs. neurosis from the MMPI. Psychological Monographs, 79 (Whole No. 602).
Goldberg, L. R. (1991). Human mind versus regression equation: Five contrasts. In D. Cicchetti & W. M. Grove (Eds.), Thinking clearly about psychology: Vol. 1. Matters of public interest (pp. 173–184). Minneapolis: University of Minnesota Press.
Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw-Hill.
Hammond, K. R. (1955). Probabilistic functioning and the clinical method. Psychological Review, 62, 255–262.
Hammond, K. R. (1980). The integration of research in judgment and decision theory (Tech. Rep. No. 226). Boulder, CO: University of Colorado, Center for Research on Judgment and Policy.
Hammond, K. R. (1981). Principles of organization in intuitive and analytical cognition (Tech. Rep. No. 231). Boulder, CO: University of Colorado, Center for Research on Judgment and Policy.
Hammond, K. R. (1993). Naturalistic decision making from a Brunswikian viewpoint: Its past, present, future. In G. Klein, J. Orasanu, R. Calderwood, & C. E. Zsambok (Eds.), Decision making in action: Models and methods (pp. 205–227). Norwood, NJ: Ablex.
Hammond, K. R. (1996). Human judgment and social policy: Irreducible uncertainty, inevitable error, unavoidable injustice. New York: Oxford University Press.
Hammond, K. R., Hamm, R. M., & Grassia, J. (1986). Generalizing over conditions by combining the multitrait-multimethod matrix and the representative design of experiments. Psychological Bulletin, 100, 257–269.
Hammond, K. R., Stewart, T. R., Brehmer, B., & Steinmann, D. O. (1975). Social judgment theory. In M. F. Kaplan & S. Schwartz (Eds.), Human judgment and decision processes (pp. 271–312). New York: Academic Press.
Harvey, N. (1995). Why are judgments less consistent in less predictable task situations? Organizational Behavior and Human Decision Processes, 63, 247–263.
Heideman, K. F., Stewart, T. R., Moninger, W. R., & Reagan-Cirincione, P. (1993). The weather information and skill experiment (WISE): The effect of varying levels of information on forecast skill. Weather and Forecasting, 8, 25–36.
Hogarth, R. (1987). Judgement and choice (2nd ed.). New York: Wiley.
Hursch, C. J., Hammond, K. R., & Hursch, J. L. (1964). Some methodological considerations in multiple-cue probability studies. Psychological Review, 71, 42–60.
Johnson, E. J. (1988). Expertise and decision under uncertainty: Performance and process. In M. Chi, R. Glaser, & M. Farr (Eds.), The nature of expertise (pp. 209–228). Hillsdale, NJ: Erlbaum.
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–291.
Kahneman, D., Slovic, P., & Tversky, A. (Eds.). (1982). Judgment under uncertainty: Heuristics and biases. Cambridge: Cambridge University Press.
Klein, G. A. (1993). Twenty questions—Suggestions for research in naturalistic decision making. In G. A. Klein, J. Orasanu, R. Calderwood, & C. E. Zsambok (Eds.), Decision making in action: Models and methods (pp. 389–403). Norwood, NJ: Ablex.
Klein, G. A., Orasanu, J., Calderwood, R., & Zsambok, C. E. (Eds.). (1993). Decision making in action: Models and methods. Norwood, NJ: Ablex.
Klein, W. H., Lewis, B. M., & Enger, I. (1959). Objective prediction of five-day mean temperatures during winter. Journal of Meteorology, 16, 672–682.
Lee, J.-W., & Yates, J. F. (1992). How quantity judgment changes as the number of cues increases: An analytical framework and review. Psychological Bulletin, 112, 363–377.
Lusk, C. M., Stewart, T. R., Hammond, K. R., & Potts, R. J. (1990). Judgment and decision making in dynamic tasks: The case of forecasting the microburst. Weather and Forecasting, 5, 627–639.
Makridakis, S., & Winkler, R. L. (1983). Averages of forecasts: Some empirical results. Management Science, 29, 987–996.
McClelland, G. H., & Judd, C. M. (1993). Statistical difficulties of detecting interactions and moderator effects. Psychological Bulletin, 114, 376–390.
Meehl, P. E. (1954). Clinical vs. statistical prediction. Minneapolis, MN: University of Minnesota Press.
Monahan, J., & Steadman, H. J. (1996). Violent storms and violent people: How meteorology can inform risk communication in mental health law. American Psychologist, 51(9), 931–938.
Morrison, D. G., & Schmittlein, D. C. (1991). How many forecasters do you really have? Mahalanobis provides the intuition for the surprising Clemen and Winkler result. Operations Research, 39, 519–523.
Murphy, A. H. (1988). Skill scores based on the mean square error and their relationships to the correlation coefficient. Monthly Weather Review, 116, 2417–2424.
Murphy, A. H., & Brown, B. G. (1984). A comparative evaluation of objective and subjective weather forecasts in the United States. Journal of Forecasting, 3, 369–393.
Murphy, A. H., & Daan, H. (1984). Impacts of feedback and experience on the quality of subjective probability forecasts: Comparison of results from the first and second years of the Zierikzee experiment. Monthly Weather Review, 112, 413–423.
Murphy, A. H., & Winkler, R. L. (1974). Probability forecasts: A survey of National Weather Service forecasters. Bulletin of the American Meteorological Society, 55, 1449–1453.
Murphy, A. H., Chen, Y., & Clemen, R. T. (1988). Statistical analysis of interrelationships between objective and subjective temperature forecasts. Monthly Weather Review, 116, 2121–2131.
Murphy, A. H., Lichtenstein, S., Fischhoff, B., & Winkler, R. L. (1980). Misinterpretations of precipitation probability forecasts. Bulletin of the American Meteorological Society, 61, 695–701.
Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice-Hall.
Oskamp, S. (1965). Overconfidence in case study judgments. Journal of Consulting Psychology, 29, 261–265.
Roebber, P. J., & Bosart, L. F. (1996a). The complex relationship between forecast skill and forecast value: A real world analysis. Weather and Forecasting, 11, 544–559.
Roebber, P. J., & Bosart, L. F. (1996b). The contributions of education and experience to forecast skill. Weather and Forecasting, 11, 21–40.
Roebber, P. J., Bosart, L. F., & Forbes, G. J. (1996). Does distance from the forecast site affect skill? Weather and Forecasting, 11, 582–589.
Sanders, F. (1986). Trends in skill of Boston forecasts made at MIT, 1966–1984. Bulletin of the American Meteorological Society, 67, 170–176.
Shanteau, J. (1992). Competence in experts: The role of task characteristics. Organizational Behavior and Human Decision Processes, 53, 252–266.
Shanteau, J. (1995). Expert judgment and financial decision making (Paper prepared for the First International Stockholm Seminar on Risk Behavior and Risk Management, Stockholm University, June 12–14). Manhattan, KS: Kansas State University.
Slovic, P., & Lichtenstein, S. (1971). Comparison of Bayesian and regression approaches to the study of information processing in judgment. Organizational Behavior and Human Performance, 6, 649–744.
Slovic, P., Fischhoff, B., & Lichtenstein, S. (1977). Behavioral decision theory. Annual Review of Psychology, 28, 1–39.
Snellman, L. W. (1977). Operational forecasting using automated guidance. Bulletin of the American Meteorological Society, 58, 1036–1044.
Stewart, T. R. (1976). Components of correlations and extensions of the lens model equation. Psychometrika, 41, 101–120.
Stewart, T. R. (1990). A decomposition of the correlation coefficient and its use in analyzing forecasting skill. Weather and Forecasting, 5, 661–666.
Stewart, T. R. (1997). Meta-analysis of the relation between task predictability and accuracy of expert judgment. Albany, NY: Center for Policy Research, University at Albany, State University of New York.
Stewart, T. R., & Lusk, C. M. (1994). Seven components of judgmental forecasting skill: Implications for research and the improvement of forecasts. Journal of Forecasting, 13, 579–599.
Stewart, T. R., Moninger, W. R., Grassia, J., Brady, R. H., & Merrem, F. H. (1989). Analysis of expert judgment in a hail forecasting experiment. Weather and Forecasting, 4, 24–34.
Stewart, T. R., Moninger, W. R., Heideman, K. F., & Reagan-Cirincione, P. (1992). Effects of improved information on the components of skill in weather forecasting. Organizational Behavior and Human Decision Processes, 53, 107–134.
Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285–1293.
Tennekes, H. (1988). The outlook: Scattered showers. Bulletin of the American Meteorological Society, 69, 368–372.
Trenberth, K. E. (1978). On the interpretation of the diagnostic quasi-geostrophic omega equation. Monthly Weather Review, 106, 131–137.
Tucker, L. R. (1964). A suggested alternative formulation in the developments by Hursch, Hammond, and Hursch, and by Hammond, Hursch, and Todd. Psychological Review, 71, 528–530.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105–110.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131.
von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. Cambridge, England: Cambridge University Press.
Wigton, R. S. (1988). Use of linear models to analyze physicians' decisions. Medical Decision Making, 8, 241–252.

Received: February 6, 1996