Combining the judgments of experts: How many and which ones?

ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES 38, 405-414 (1986)

ROBERT H. ASHTON

University of Alberta

R. M. Hogarth (1978, Organizational Behavior and Human Performance, 21, 40-46) presents an analytical model which, under certain conditions, may be useful for estimating both the number of experts to include in a staticized group and which experts. This paper reports a study of an important real-world judgment task used to evaluate (1) the extent to which the specified conditions hold and (2) the effect of violations of the conditions on the ability of the model to approximate the actual empirical validities that result from forming such groups. The results indicate that, although the specified conditions hold only moderately well, the model suggested by Hogarth provides a remarkably good approximation to the actual empirical validities of the staticized groups. © 1986 Academic Press, Inc.

A growing body of empirical research finds that the validity of experts’ judgments can be improved by forming “staticized groups” of two or more experts. It also finds that a simple average of the judgments of the individual experts is quite effective, and that only a small number of experts must be included to achieve most of the total improvement possible with a much larger set of experts (e.g., A. H. Ashton & Ashton, 1985; Libby & Blashfield, 1978; Makridakis & Winkler, 1983; Winkler & Makridakis, 1983). Such studies follow a common research approach: Given a group of n experts, all possible combinations of k experts (k = 2, . . . , n) are formed, the validity of the resulting groups is computed, and summary measures of group validity are evaluated for all levels of k. In contrast, Hogarth (1978) has presented an analytical model which, under certain conditions, yields group validity as a function of the number of experts, their mean individual validity, and the mean intercorrelation among their judgments. Thus, Hogarth’s model may be useful for determining how many experts should be included in a staticized group. Moreover, by focusing on changes in mean individual validity and mean intercorrelation that result from adding another expert to an existing group, the model also may suggest which expert(s) should be added.

In this paper, an important real-world judgment setting is used to evaluate the extent to which the specified conditions hold. More importantly, the study evaluates the effect of violations of the conditions on the ability of Hogarth’s model to approximate the actual empirical validities that result from forming such groups. Consequently, the study may be viewed as an empirical demonstration of the extent to which the suggested analytical model might be useful in a practical setting. The next section summarizes the proposed model and the judgment setting used to evaluate it. This is followed by presentation of the results and a brief discussion and conclusion.

Requests for reprints should be sent to Robert H. Ashton, Fuqua School of Business, Duke University, Durham, NC 27706. 0749-5978/86 $3.00. Copyright © 1986 by Academic Press, Inc. All rights of reproduction in any form reserved.

THE MODEL AND SETTING

Hogarth’s model draws on an analogy with test theory (Ghiselli, 1964), where the question of interest is the correlation of an equally weighted composite with a criterion. In the judgmental setting, the model is

$$\rho_{y\bar{x}} = \frac{k^{1/2}\,\bar{\rho}_{yx}}{\left[1 + (k - 1)\,\bar{\rho}_{x_i x_j}\right]^{1/2}}, \qquad (1)$$

where $\rho_{y\bar{x}}$ is the validity of a group of k experts, expressed as the correlation between $\bar{x}$ ($= \frac{1}{k}\sum_{j=1}^{k} x_j$), or the mean of the (j = 1, . . . , k) experts’ individual judgments for each observation, and y, or the corresponding criterion values; $\bar{\rho}_{yx}$ is the mean validity of the individual experts, i.e., $\frac{1}{k}\sum_{j=1}^{k} \rho_{yx_j}$; and $\bar{\rho}_{x_i x_j}$ is the mean intercorrelation of the experts’ judgments. For k ≥ 2, the validity of the experts’ mean judgments, $\rho_{y\bar{x}}$, will equal or exceed the mean validity of the individual experts, $\bar{\rho}_{yx}$. The limiting case, where k = ∞, is $\bar{\rho}_{yx} \div \bar{\rho}_{x_i x_j}^{1/2}$, or the ratio of mean individual validity to the square root of the mean intercorrelation of the experts’ judgments.

From Eq. (1) it can be seen that group validity is an increasing function of the number of experts and their mean individual validity, and a decreasing function of the mean intercorrelation among the experts’ judgments. Hogarth (1978) used several combinations of values for number of experts, mean individual validity, and mean intercorrelation to demonstrate their effects on group validity. He concluded that when mean individual validity exceeds mean intercorrelation, perhaps as many as 20 to 25 experts must be included before group validity is near its maximum. However, when mean individual validity is equal to (less than) mean intercorrelation, group validity is near its maximum with only about 10 (6) experts in the group.

When the (k + 1)st expert is added to an existing group of k experts, both mean individual validity and mean intercorrelation are likely to change. Defining $\rho_{yx_{k+1}}$ as the individual validity of the (k + 1)st expert and $\bar{\rho}'_{x_i x_j}$ as the mean intercorrelation of the “new” group of experts, i.e., including the (k + 1)st person, Hogarth (1978) derives the following condition which he suggests must hold for addition of the (k + 1)st expert to increase group validity:

[Eq. (2): the displayed expression is missing from this copy.]

Equation (2) implies that group validity will not necessarily be increased simply by adding an expert whose individual validity, $\rho_{yx_{k+1}}$, exceeds the validity of the “old” group, $\rho_{y\bar{x}}$. Instead, it also may be necessary that mean intercorrelation be reduced, i.e., that $\bar{\rho}'_{x_i x_j} < \bar{\rho}_{x_i x_j}$.

For Eqs. (1) and (2) to provide good approximations to group validity, and to changes in group validity when another expert is added, Hogarth suggests that certain conditions are important. For example, in order for small groups (say, 8 to 12 members) to have near-maximum validity, Hogarth argues that mean intercorrelation must not be low (<.3, approximately) and/or mean expert validity must not exceed mean intercorrelation. Moreover, he argues that there must be no or “small” bias in the mean judgments. A useful measure of bias for this purpose is

$$B = \left| (x_t - \mu) \div \sigma \right|, \qquad (3)$$

where B is the standardized bias, $x_t$ is the true value to be predicted, $\mu$ is the mean of a population distribution of individual judgments, and $\sigma$ is the population standard deviation (Einhorn, Hogarth, & Klempner, 1977). For mean judgments to be completely unbiased, B must equal zero. However, Einhorn et al. found that B had to be about .70 before mean judgments were outperformed by a realistic alternative weighting scheme. Finally, it may be important that the judgments of the individual experts be expressed in standardized form. Hogarth argues, however, that failure to express the judgments in standardized form is irrelevant if the standard deviations of the experts’ judgments are equal, and that “small” deviations from equality of standard deviations “are not liable to affect results greatly” (p. 45).

The extent to which these conditions hold is an empirical question that must be considered in the context of a particular judgment setting. Similarly, the effect of violations of these conditions on the ability of Eqs. (1) and (2) to provide good estimates of how many (and which) experts should be included is also an empirical issue. An important real-world judgment setting used to evaluate the usefulness of Eqs. (1) and (2) is described below.

The task involved quarterly estimates of short-run advertising sales at Time magazine. Such estimates are extremely important at all Time, Inc., magazines because they are prerequisites to editorial and production commitments, as well as to financial budgeting. Subjects estimated the number of pages of advertising sold annually by Time magazine over a 14-year period, using five variables normally available to Time, Inc., personnel for such estimates. The estimates were made after the first, second, and third quarter of each year, for a total of 42 estimates. The experimental task was patterned closely after the process in which the subjects actually are involved at their own magazines.

The subjects, who had a minimum of 8 years of experience, were 13 executives, managers, and sales personnel who worked for Time, Inc., magazines other than Time (e.g., Life and Sports Illustrated). They included people who actually prepare and update advertising sales estimates, as well as people who play important supporting roles such as supervising the compilation of historical data and preparing client-specific sales estimates that provide input to the estimates of total annual sales. A more detailed description of the study can be found in A. H. Ashton (1982) and A. H. Ashton & R. H. Ashton (1985).

TABLE 1
DESCRIPTIVE STATISTICS FOR INDIVIDUAL EXPERTS

Subject no.   Mean judgment (in pages)   Standard deviation   Individual validity ($\rho_{yx_j}$)
 1            2946.79                    318.48               .8786
 2            3387.02                    387.82               .6501
 3            2986.24                    449.00               .8242
 4            2980.95                    317.95               .7464
 5            3156.76                    592.14               .6781
 6            3123.21                    405.37               .6913
 7            2766.64                    316.07               .9156
 8            3089.62                    333.44               .6186
 9            3242.83                    465.74               .6888
10            3100.95                    529.96               .6072
11            2831.83                    472.92               .8981
12            2914.02                    590.02               .6145
13            3029.76                    414.09               .8570

RESULTS

The means and standard deviations of the 42 estimates made by each expert are shown in Table 1, along with the individual validity of each expert, $\rho_{yx_j}$. Mean individual validity, $\bar{\rho}_{yx}$, is .7437, which exceeds the mean intercorrelation among the experts’ judgments, $\bar{\rho}_{x_i x_j}$, of .6005. Hogarth (1978) argues that $\bar{\rho}_{yx} > \bar{\rho}_{x_i x_j}$ is not the typical finding in studies of expert judgment (p. 44), and that this is the type of judgment setting in which “much is to be gained by adding additional experts to a group (up to, say, 20 or 25)” (p. 42). The limit of group validity in this case, $\lim_{k \to \infty} \rho_{y\bar{x}}$, is .9597, i.e., $\bar{\rho}_{yx} \div \bar{\rho}_{x_i x_j}^{1/2}$.

The conditions that Hogarth says are important for the suggested analytical approach to be useful hold only moderately well in this judgment setting. For example, the condition that mean intercorrelation exceed approximately .3 is satisfied, but the condition that mean individual validity not exceed mean intercorrelation is not satisfied. Moreover, some bias is evident in the mean judgments. Using the total sample of 13 individual judgments, $\mu$ and $\sigma$ can be estimated, for each of the 42 observations, by the sample mean and sample standard deviation, and estimates of the standardized bias, B, can be computed. The absolute values of B range from 0.02 to 1.96, with a median of 0.69. Finally, the standard deviations of the experts’ judgments are not equal, ranging from 316 to 592 pages. Whether these departures from the specified conditions are sufficiently important to nullify the usefulness of Eqs. (1) and (2) is evaluated below.

The actual, or empirical, mean validity of groups of size k (k = 2, . . . , 13) is compared to the corresponding level of validity estimated by Eq. (1) in Table 2. Note that both the empirical validity and the validity estimated by Eq. (1) increase at a decreasing rate as more experts are added to the group. More importantly, the differences between the empirical and estimated validities are extremely small at all levels of k. When the validities are rounded to three decimal places, the largest difference, at k = 2, is only .007. Moreover, the differences decrease consistently as k increases, until they are less than .001 at k ≥ 10. Thus, in the present setting the model proposed by Hogarth provides a remarkably good approximation to the actual (mean) empirical validities that result from including additional experts in a group. Therefore, the results suggest that Eq. (1) can be quite useful for determining how many experts should be included.

TABLE 2
ACTUAL AND ESTIMATED GROUP VALIDITY

No. of experts    No. of groups    Group validity
in group (k)      of size k        Empirical(a)   Estimated(b)
 2                  78             .8236          .8314
 3                 286             .8616          .8683
 4                 715             .8836          .8887
 5                1287             .8977          .9016
 6                1716             .9076          .9106
 7                1716             .9150          .9171
 8                1287             .9206          .9222
 9                 715             .9250          .9261
10                 286             .9286          .9293
11                  78             .9316          .9320
12                  13             .9341          .9342
13                   1             .9362          .9361

(a) Mean computed validity for groups of size k.
(b) Estimated validity from Eq. (1).

The usefulness of Eq. (2) for determining which of the available experts should be added to an existing group can also be evaluated. First, assume that a three-person group, consisting of Subjects 1, 2, and 7, currently is providing equally weighted estimates for each of the 42 observations. Subjects 1, 2, and 7 are chosen as an example because these subjects actually are responsible for the final estimates for their respective magazines at Time, Inc. Now assume that the estimates of one of the remaining 10 subjects are to be added to form a four-person group. The question is which of these subjects should be chosen to achieve the largest increase in group validity.

The empirical validity of the existing three-person group was computed and found to be .8991. This performance level is superior to that of approximately 82% of all the three-person groups that can be constructed. Moreover, it exceeds the individual validity of all 10 of the remaining experts. Equation (2) says that for the addition of the (k + 1)st expert to increase the validity of an existing group of k experts, the value of the left-hand term (the individual validity of the (k + 1)st expert, $\rho_{yx_{k+1}}$) must exceed the value of the right-hand term. These two components of Eq. (2) are shown for each potential addition to the group in Table 3, along with the empirical validity of each of the new four-person groups, $\rho_{y\bar{x}_{k+1}}$. The condition specified in Eq. (2) holds for five of the available subjects (3, 4, 5, 11, and 13) and does not hold for the other five (6, 8, 9, 10, and 12).
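As a rough check on the figures discussed above, the following Python sketch evaluates Eq. (1) from the two summary statistics reported in the text (.7437 and .6005) and applies the left-hand versus right-hand comparison of Eq. (2) to the values listed in columns (2) and (3) of Table 3. The function and variable names are illustrative assumptions, not part of the original study, and because the inputs are rounded to four decimals the output should agree with the tabled values only to about the fourth decimal place.

```python
from math import sqrt

# Summary statistics reported in the text for the 13 experts.
MEAN_VALIDITY = 0.7437    # mean individual validity
MEAN_INTERCORR = 0.6005   # mean intercorrelation among the experts' judgments


def group_validity(k, mean_validity=MEAN_VALIDITY, mean_intercorr=MEAN_INTERCORR):
    """Eq. (1): estimated validity of the equally weighted mean of k experts."""
    return sqrt(k) * mean_validity / sqrt(1.0 + (k - 1) * mean_intercorr)


def limiting_validity(mean_validity=MEAN_VALIDITY, mean_intercorr=MEAN_INTERCORR):
    """Limit of Eq. (1) as k grows without bound."""
    return mean_validity / sqrt(mean_intercorr)


# Table 3, columns (2) and (3), keyed by the subject added to Subjects 1, 2, and 7:
# subject -> (individual validity, right-hand term of Eq. (2)).
TABLE_3 = {
    3: (0.8242, 0.7425),
    4: (0.7464, 0.7096),
    5: (0.6781, 0.6171),
    6: (0.6913, 0.6988),
    8: (0.6186, 0.6219),
    9: (0.6888, 0.7223),
    10: (0.6072, 0.6522),
    11: (0.8981, 0.8361),
    12: (0.6145, 0.6318),
    13: (0.8570, 0.7674),
}


def satisfies_eq2(individual_validity, right_hand_term):
    """Eq. (2) as described in the text: left-hand term must exceed right-hand term."""
    return individual_validity > right_hand_term


if __name__ == "__main__":
    for k in range(2, 14):
        print(f"k={k:2d}  estimated validity={group_validity(k):.4f}")
    print(f"limit as k grows: {limiting_validity():.4f}")
    admitted = sorted(s for s, (lhs, rhs) in TABLE_3.items() if satisfies_eq2(lhs, rhs))
    print("subjects satisfying Eq. (2):", admitted)
```

The sketch takes the right-hand term of Eq. (2) directly from Table 3 and does not recompute it, since that would require the intercorrelations of each candidate four-person group, which are not reported here.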
Four-person groups that include any of Subjects 3, 4, 5, 11, and 13 are, in fact, more valid than the original three-person group, i.e., $\rho_{y\bar{x}_{k+1}} > .8991$. Moreover, four of the five remaining four-person groups (those involving Subjects 6, 9, 10, and 12) are less valid than the original group. The addition of Subject 8, however, for whom the condition specified in Eq. (2) does not hold, nevertheless increases group validity slightly (from .8991 to .9002).

Further insight into the usefulness of Eq. (2) can be gained by comparing the rankings of subjects in columns (4) and (5) of Table 3. Column (4) shows the difference between the left-hand and right-hand terms of Eq. (2), while column (5) shows the actual empirical validity of each “new” group. The two sets of ranks differ only in that Ranks 4 and 5, and Ranks 8 and 9, are reversed. The Spearman correlation between the two sets of ranks is .989, which suggests that Eq. (2) does an excellent job of ranking the available experts according to their contribution to the enlarged group.

TABLE 3
ACTUAL AND ESTIMATED EFFECTS OF ADDING A FOURTH EXPERT TO AN EXISTING THREE-PERSON GROUP

(1)               (2)                     (3)               (4)               (5)
Subject added     Individual validity     Right-hand term   Column (2) minus  Empirical validity
to group(a)       of subject added(b)     of Eq. (2)        Column (3)        of group
 3                .8242 (3)(c)            .7425              .0817 (2)        .9264 (2)
 4                .7464 (4)               .7096              .0368 (5)        .9125 (4)
 5                .6781 (7)               .6171              .0610 (4)        .9118 (5)
 6                .6913 (5)               .6988             -.0075 (7)        .8977 (7)
 8                .6186 (8)               .6219             -.0033 (6)        .9002 (6)
 9                .6888 (6)               .7223             -.0335 (9)        .8811 (8)
10                .6072 (10)              .6522             -.0450 (10)       .8683 (10)
11                .8981 (1)               .8361              .0620 (3)        .9217 (3)
12                .6145 (9)               .6318             -.0173 (8)        .8773 (9)
13                .8570 (2)               .7674              .0896 (1)        .9291 (1)

(a) In all cases Subjects 1, 2, and 7 comprise the original group.
(b) Left-hand term of Eq. (2).
(c) Ranks are shown in parentheses.

It is also interesting to compare the rankings in column (5) with those in column (2). Notice that column (2) ranks the experts according to their individual validity, while column (5) ranks them according to their contribution to group validity. The various differences in the two sets of ranks illustrate Hogarth’s suggestion (p. 44) that it is not necessarily the most expert of the available individuals that should be added to an existing group. It is interesting to note that although none of the 10 individual experts is as valid as the existing three-person group, 6 of them increase group validity when the group is enlarged to include four members.

The foregoing comparisons of empirical and estimated group validities are based on mean empirical validities computed from large numbers of groups of k experts. Another relevant issue is the variability of these group validities. Some insight into this issue can be gained by examining Table 4 and Fig. 1. Table 4 indicates that the variance of $\rho_{y\bar{x}}$ decreases rapidly as k increases. Thus, as a practical matter the importance of choosing the “best” group decreases, on average, as group size increases. A similar conclusion is suggested by Fig. 1, which shows the (empirical) mean, minimum, and maximum values of $\rho_{y\bar{x}}$ as a function of k. Observe that as k increases the minimum and maximum values of $\rho_{y\bar{x}}$ converge to the mean $\rho_{y\bar{x}}$, since both the number of groups and the differences among them are smaller as the upper limit of k is approached.

TABLE 4
DESCRIPTIVE STATISTICS FOR GROUP VALIDITY

No. of experts    Mean validity         Variance of
in group (k)      ($\rho_{y\bar{x}}$)   $\rho_{y\bar{x}}$   Minimum   Median   Maximum
 1                .7437                 .01339              .6072     .6913    .9156
 2                .8236                 .00330              .7071     .8237    .9185
 3                .8616                 .00148              *         *        .9286
 4                .8836                 .00083              *         *        .9354
 5                .8977                 .00052              *         *        .9387
 6                .9076                 .00035              *         *        .9439
 7                .9150                 .00024              *         *        .9448
 8                .9206                 .00017              *         *        .9447
 9                .9250                 .00011              *         *        .9448
10                .9286                 .00008              *         *        .9446
11                .9316                 .00005              *         *        .9443
12                .9341                 .00002              *         *        .9437
13                .9362                 -                   -         -        -

* For k = 3 to 12, only nine of the ten entries in each of the Minimum and Median columns are legible in this copy, so they are listed here in their original order without row assignment. Minimum: .7727, .8107, .8331, .8535, .8684, .8802, .8943, .9121, .9249. Median: .8875, .9004, .9097, .9167, .9217, .9258, .9297, .9329, .9353.

Further examination of the minimum and maximum values of $\rho_{y\bar{x}}$ may be instructive. Notice that the minimum value of $\rho_{y\bar{x}}$ at k ≥ 3 exceeds $\bar{\rho}_{yx}$, the mean validity of the individual experts, indicating that the worst of the 286 three-person groups is more valid, on average, than the individuals. In addition, the minimum $\rho_{y\bar{x}}$ at k = 12 exceeds the maximum $\rho_{y\bar{x}}$ at k = 1 or k = 2. Further, observe that the maximum values of $\rho_{y\bar{x}}$ increase from k = 1 to k = 7, indicating that even the best expert, as well as the best groups of k = 2, . . . , 6 experts, can be improved by including additional experts.

Another interesting feature of the maximum $\rho_{y\bar{x}}$ values is that they level off at approximately .944 for k ≥ 6. This level of performance is notable because a linear regression of the number of advertising pages on the values of the five independent variables included in the study produces an (unadjusted) R² of .944 (see A. H. Ashton, 1982). In other words, some groups of six or more judges perform at a level equivalent to the linear predictability of the task.

DISCUSSION AND CONCLUSIONS

The principal conclusions of this research are twofold: First, combining the judgments of the experts studied is quite effective for improving validity. Second, Hogarth’s (1978) analytical model provides an excellent approximation to the level of improvement found in this particular setting. With respect to the former, it was shown that mean group validity increased rapidly (and its variance decreased rapidly) as additional experts were included, and that only three experts had to be included, on average, to achieve most of the improvement that was possible. With respect to the latter, Eq. (1) provided estimates of validity that were virtually identical to those found empirically, and Eq. (2) provided a highly accurate rank ordering of the validity increases achievable from adding additional experts to an existing group. Thus, in the current setting the analytical approach suggested by Hogarth was effective for answering both the “how many” and “which” questions posed in the title of this paper, even though the conditions on which the approach is based held only moderately well.

FIG. 1. Group validity as a function of the number of experts in the group (horizontal axis: k).

These positive results notwithstanding, the proposed model has a potentially important limitation that must be recognized. As Hogarth (1978) points out, the model is appropriate only if “the judgmental task consists of rank ordering alternatives – that is, the level of judgment is not important” (p. 41). This limitation is due to reliance on a correlational measure of validity. The extent to which a correlational validity measure is an appropriate surrogate for loss functions encountered in applied judgment settings is an open question (see, e.g., R. H. Ashton & Kessler, 1986). For the task analyzed here, an alternative measure of validity (mean absolute error in pages) was employed, and the experts were ranked on that measure as well as on the correlational measure. Mean absolute error was used because the experts involved consider it to be the most appropriate measure in this setting. The Spearman correlation between the two sets of ranks is .835, which suggests that correlational validity is a good surrogate in this particular case. Moreover, when performance is measured in terms of mean absolute error, group judgments produce the same pattern of results as that shown in Table 4 and Fig. 1 (see A. H. Ashton & Ashton, 1985).

In summary, this study demonstrates in a realistic task setting that the analytical approach offered by Hogarth can be extremely useful for estimating the effect on group validity of including additional experts in a staticized group. However, the generalizability of this result to other settings depends on the empirical distribution of such task characteristics as mean individual validity, mean intercorrelation, and bias in mean judgments, and on the ability of a correlational validity measure to serve as a surrogate for other measures in applied settings. Continued research along these lines could be extremely useful.

REFERENCES

Ashton, A. H. (1982). An empirical study of budget-related predictions of corporate executives. Journal of Accounting Research, 20, Pt. 2, 440-449.
Ashton, A. H., & Ashton, R. H. (1985). Aggregating subjective forecasts: Some empirical results. Management Science, 31, 1499-1508.
Ashton, R. H., & Kessler, L. (1986). Consistency among alternative performance measures in an applied judgment setting. Unpublished manuscript, Duke University.
Einhorn, H. J., Hogarth, R. M., & Klempner, E. (1977). Quality of group judgment. Psychological Bulletin, 84, 158-172.
Ghiselli, E. E. (1964). Theory of psychological measurement. New York: McGraw-Hill.
Hogarth, R. M. (1978). A note on aggregating opinions. Organizational Behavior and Human Performance, 21, 40-46.
Libby, R., & Blashfield, R. K. (1978). Performance of a composite as a function of the number of judges. Organizational Behavior and Human Performance, 21, 121-129.
Makridakis, S., & Winkler, R. L. (1983). Averages of forecasts: Some empirical results. Management Science, 29, 987-996.
Winkler, R. L., & Makridakis, S. (1983). The combination of forecasts. Journal of the Royal Statistical Society, Series A, 146, Pt. 2, 150-157.

RECEIVED: July 17, 1985