ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES 51, 1-21 (1992)

Knowledge, Calibration, and Resolution: A Linear Model
MATS BJÖRKMAN
Uppsala University

Calibration expresses the correspondence between subjective and objective probability, i.e., relative frequency. A subject is perfectly calibrated if, for all propositions assigned the probability .XX, XX% of them are true. When proportion correct is plotted against categorized assessments of subjective probability one gets a calibration curve. Resolution expresses the degree to which the subject can sort correct and incorrect items into different categories. A model which approximates the calibration curve by a linear function $c(x_t) = a + bx_t$ is suggested, where $c(x_t)$ is the predicted proportion correct when the probability assessment is $x_t$. Perfect calibration means that $a = 0$, $b = 1$. When $c(x_t)$ is combined with the distribution of assessments a linear relation $\hat{c} = a + b\bar{x}$ is obtained; $\hat{c}$ is the predicted mean proportion correct and $\bar{x}$ is the mean of the probability assessments. The model was tested on (1) individual data from an experiment with 250 general knowledge items and (2) group data from a study by Lichtenstein and Fischhoff (1977, Organizational Behavior and Human Decision Processes, 20, 159-183). The fit between predicted ($\hat{c}$) and observed ($\bar{c}$) proportion correct was satisfactory for 31 of 33 subjects in the individual study and for all 19 conditions in the group study. The two studies were compared in some detail with respect to relationships between knowledge on the one hand and calibration and resolution on the other. © 1992 Academic Press, Inc.
Judgments of confidence reflect a person’s certainty about the occurrence/nonoccurrence of future events, or about the truth of his/her knowledge of facts. In accordance with the Bayesian position, certainty expressed by a percentage number can be treated as a probability. An important aspect of subjective probability (besides consistency with respect to the rules of probability) is the correspondence with the true states of the world. Calibration is the term used to denote the extent to which subjective probability matches “objective probability,” defined as relative frequency. If a subject assesses the probability of the occurrence of an event to be .6 and, over a number of occasions, the event occurs in 60% of the occasions, the subject is perfectly calibrated.

This study was supported by a grant from the Swedish Council for Research in the Humanities and Social Sciences. The author is indebted to Peter Juslin for preparing the items in the individual study and to Anna Nordlinder for computational work. I also thank J. Frank Yates and an anonymous reviewer for valuable comments on an earlier version of the paper. Address correspondence and reprint requests to Mats Björkman, Department of Psychology, Uppsala University, Box 1854, S-751 48 Uppsala, Sweden.

0749-5978/92 $3.00. Copyright © 1992 by Academic Press, Inc. All rights of reproduction in any form reserved.
In this paper we will be concerned with knowledge tasks of the two-alternative type, e.g., “Which city is further north? (a) Rome (b) New York.” The subject makes two responses. He/she first answers the question by circling the alternative he/she believes is the correct one and then assesses the probability that the answer is correct. This can be done on a scale of fixed values 50, 60, ..., 100%, or by using any number from 50 to 100%. In the latter case the experimenter makes the categorization of the probability assessments after the experiment. For each category the proportion of correct responses is computed. The plot of proportion correct against subjective probability is the so-called calibration curve. If the curve coincides with the identity line calibration is perfect. Should the calibration curve fall below the identity line the subject is overconfident, in the opposite case underconfident.

In an early paper on calibration Lichtenstein and Fischhoff (1977) asked the question “Do those who know more also know more about how much they know?” In other words, is there a positive relationship between knowledge and calibration? Their answer was: “Those who know more do not generally know more about how much they know. Calibration improves up to about 80% correct, and then becomes worse” (Lichtenstein & Fischhoff, 1977, pp. 179-180).

Besides the empirical approach to the question “Do those who know more ...” there is a formal possibility to investigate how knowledge is related to calibration. In this paper this will be done by taking a linear approximation of the calibration curve as a point of departure. By combining the linear calibration function with the distribution of the subjective probabilities one obtains a simple model for the relationship between knowledge and calibration. The model will be tested on individual data from an experiment with general knowledge items. In addition, it will be applied to group data from the study by Lichtenstein and Fischhoff (1977). First, however, it is necessary to present and discuss the so-called mean probability score and its decomposition. This serves as a background to the presentation of the knowledge-calibration model and gives precise meanings to two central notions of the paper: calibration and resolution.

THE MEAN PROBABILITY SCORE
The mean probability score, $\overline{PS}$, was introduced by Brier (1950) to measure the goodness of weather forecasts. This score, often called the Brier score after its originator, is defined by

$$\overline{PS} = \frac{1}{N} \sum_{i=1}^{N} (f_i - d_i)^2, \tag{1}$$
where $f_i$ is the forecast probability of the occurrence of an event, e.g., “rain,” on the $i$th occasion; $0 \le f_i \le 1$. The outcome on the $i$th occasion is indexed by $d_i$ such that

$d_i = 1$ if the event occurs,
$d_i = 0$ if the event does not occur.

The mean probability score ranges from 0 to 1. A perfect judge (a miniature variety of the Laplacean demon) always assigns 1 when the event occurs and 0 when the event does not occur. The other extreme is just the opposite; the probability 1 is always assigned when the event does not occur and 0 when the event occurs. Brier and others after him have emphasized that $\overline{PS}$ is a proper measure of goodness of probability assessment since no cognitive strategy other than expressing one’s true beliefs can improve the score.

The Brier score was introduced to measure the goodness of weather forecasts. In psychology the use of the score has been extended to the study of subjects’ confidence in their answers to knowledge items (e.g., Koriat, Lichtenstein, & Fischhoff, 1980; Lichtenstein & Fischhoff, 1977; Lichtenstein & Fischhoff, 1980; Lichtenstein, Fischhoff, & Phillips, 1982; Sharp, Cutler, & Penrod, 1988). There are decisive differences between forecast tasks, e.g., forecasts of the weather, of the outcome of football games, of the rise and fall of stock prices on the one hand and knowledge tasks with items like “Which of the following two countries has the largest production of wheat: (a) USA (b) China?” on the other. To emphasize that we deal with different tasks we rewrite the Brier score with new notation,

$$\overline{PS} = \frac{1}{N} \sum_{i=1}^{N} (x_i - r_i)^2, \tag{2}$$
where $x_i$ is the probability assessment that the $i$th item in a set of $N$ items was answered correctly. The correctness index $r_i$ takes either of two values,

$r_i = 1$ if the answer is correct,
$r_i = 0$ if the answer is wrong.

The fundamental difference between (1) and (2) concerns the status of $d_i$ and $r_i$. The former indexes the occurrence and nonoccurrence of the event under consideration. Whatever the causes of the event, it belongs to “nature” and is beyond the subject’s control. The same of course is true of various statistics, e.g., $\bar{d} = \frac{1}{N}\sum d_i$, the relative frequency of the occurrence of the event. In (2), however, $r_i$ is under the subject’s control. For example, $\bar{c} = \frac{1}{N}\sum r_i$, the overall proportion of correct responses, is
a measure of the subject’s knowledge with respect to the $N$ items. Whereas $\bar{d}$ is a characteristic of the forecast task indicating its uncertainty, $\bar{c}$ is a characteristic of the subject indicating his/her knowledge.

Yates (1982) used the concept of external correspondence (in contrast to internal consistency) for the extent to which probabilistic forecasts anticipate the occurrence and nonoccurrence of an event. It is obvious from what was said about the difference between (1) and (2) that external correspondence applies to (1), but not to (2). In the latter case it might rather be appropriate to talk about internal correspondence, since $\overline{PS}$ in this case measures the extent to which subjective probabilities indicate the subject’s correct and incorrect responses.

Decomposition of $\overline{PS}$
A useful property of the mean probability score is that it can be partitioned into additive parts. This can be done in different but related ways (see Yates, 1982, for a review of different decompositions). Among psychologists the so-called Murphy partition (Murphy, 1972, 1973; Lichtenstein et al., 1982) seems to be the one most widely used. This partition requires that $x$ can assume only a finite set of values $x_t$ ($t = 1, \ldots, T$) and is

$$\overline{PS} = \bar{c}(1 - \bar{c}) + \frac{1}{N} \sum_{t=1}^{T} n_t (x_t - c_t)^2 - \frac{1}{N} \sum_{t=1}^{T} n_t (c_t - \bar{c})^2. \tag{3}$$

Here $\bar{c}$ is the overall proportion correct, $c_t$ is the proportion correct in category $t$, $c_t = \frac{1}{n_t}\sum r_i$ (summed over the $n_t$ items in category $t$), and $n_t$ is the number of assessments in category $t$, $\sum_{t=1}^{T} n_t = N$.
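As a numerical illustration of Eqs. (2) and (3), the following sketch computes the mean probability score and its three components for a small set of made-up assessments and checks that the partition reproduces the score. The data and the function names (`brier_score`, `murphy_partition`) are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def brier_score(x, r):
    """Mean probability score, Eq. (2): (1/N) * sum of (x_i - r_i)^2."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r)) / len(x)


def murphy_partition(x, r):
    """Return (knowledge, calibration, resolution), the three terms of Eq. (3)."""
    n = len(x)
    c_bar = sum(r) / n  # overall proportion correct

    # Collect the correctness indices r_i by assessment category x_t.
    by_category = defaultdict(list)
    for xi, ri in zip(x, r):
        by_category[xi].append(ri)

    calibration = 0.0
    resolution = 0.0
    for x_t, rs in by_category.items():
        n_t = len(rs)
        c_t = sum(rs) / n_t  # proportion correct in category t
        calibration += n_t * (x_t - c_t) ** 2
        resolution += n_t * (c_t - c_bar) ** 2

    return c_bar * (1 - c_bar), calibration / n, resolution / n


# Hypothetical assessments (x) and correctness indices (r) for ten items.
x = [0.5, 0.5, 0.7, 0.7, 0.7, 0.9, 0.9, 0.9, 1.0, 1.0]
r = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1]

ps = brier_score(x, r)
knowledge, calibration, resolution = murphy_partition(x, r)
print(ps, knowledge + calibration - resolution)
```

The two printed values agree up to floating-point rounding, which is the additive property the partition relies on.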
The first term on the right side of Eq. (3), the variance of $r_i$, expresses the subject’s knowledge. This score ranges from 0 (perfect knowledge) to 0.25 (true guessing over all $N$ items). The second term is a measure of calibration. This is the component of the Murphy partition that has received most attention in psychological studies, probably because it is a direct expression of the extent to which subjective probability agrees with relative frequency. When $c_t$ is plotted against $x_t$ one gets a calibration curve. Should this curve coincide with the identity line $c_t = x_t$ calibration is perfect. The third term, resolution, measures the subject’s ability to sort items into categories such that $c_t$ is either 0 or 1. If all $c_t$ are equal, and thus equal to $\bar{c}$, resolution is 0. Maximum resolution occurs if correct responses and incorrect responses are sorted into different categories.

The three components of the mean probability score in Eq. (3) are not independent of each other. For example, if a constant is added to $c_t$, resolution will be unaffected but $\bar{c}$ will increase with the same constant,
and calibration will increase or decrease depending on its initial value. Yates (1982, p. 141) demonstrated that calibration and resolution “are algebraically confounded with each other in a very direct way.” Direct or not, knowledge, calibration, and resolution are entangled in ways which are difficult to penetrate. An experimental manipulation which exerts an influence on one of the scores will necessarily influence one of the other two, or both.

Knowledge and the Calibration Function

The partition of $\overline{PS}$ is useful for studying to what extent knowledge, calibration, and resolution contribute to $\overline{PS}$. Equation (3), however, is of no help when the primary interest, as in the present paper, is the relation between knowledge ($\bar{c}$) on the one hand and the calibration function on the other. What is needed for this purpose, as a complement to the decomposition of $\overline{PS}$, is a model that shows how knowledge depends on the parameters of the calibration function. This will concern us next.

MODEL
Let a person’s assessment of subjective probability be represented by a variable $x$, which assumes $T$ discrete values $x_1, \ldots, x_t, \ldots, x_T$. Each $x_t$ represents a category (interval) on $x$. The number of assessments in category $t$ is $n_t$ and the total number of assessments is $N$, $\sum n_t = N$. (If not stated otherwise summation goes from 1 to $T$.) We denote the relative frequency of assessments in category $t$ by $f(x_t)$. Hence $f(x_t) = n_t/N$ and $\sum f(x_t) = 1$. The mean of the distribution of assessments is $\bar{x}$ and its variance $s^2$; $\bar{x} = \sum f(x_t)x_t$ and $s^2 = \sum f(x_t)(x_t - \bar{x})^2$.

The proportion of correct responses in category $t$ is $c_t$. This of course is the mean of the correctness index $r_i$ for the $n_t$ items in category $t$. The mean proportion correct, taken over all items, is $\bar{c} = \frac{1}{N}\sum r_i$, which is equivalent to $\bar{c} = \sum f(x_t)c_t$.

Now, let the calibration curve be approximated by a linear function $c(x_t) = a + bx_t$, called the calibration function. Here $c(x_t)$ is the predicted proportion correct in category $t$, to be distinguished from $c_t$, the observed proportion correct in $t$. The predicted proportion correct, taken over all $T$ categories, is denoted by $\hat{c}$ and is obtained by summing the products $f(x_t)(a + bx_t)$ for $t = 1, \ldots, T$, $\hat{c} = \sum f(x_t)(a + bx_t)$. Since $\sum f(x_t) = 1$ and $\bar{x} = \sum f(x_t)x_t$ we get

$$\hat{c} = a + b\bar{x}. \tag{4}$$
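Equation (4) can be checked with a few lines of arithmetic. The sketch below uses the response distribution that the text reports for Subject 28 and the values of $a$ and $b$ listed for that subject in Table 1; the variable names are illustrative only. Because the reported $f(x_t)$ are rounded to two decimals, the computed mean assessment (0.859) differs slightly from the tabled 0.858.

```python
# Half-range response scale and the response distribution f(x_t)
# reported in the text for Subject 28.
x = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
f = [0.14, 0.05, 0.07, 0.08, 0.14, 0.52]

# Calibration-function parameters listed for Subject 28 in Table 1.
a, b = 0.311, 0.424

# Mean of the assessment distribution.
x_bar = sum(ft * xt for ft, xt in zip(f, x))

# Predicted proportion correct, computed two ways:
# (i) by summing f(x_t) * c(x_t) over categories, (ii) by Eq. (4).
c_hat_sum = sum(ft * (a + b * xt) for ft, xt in zip(f, x))
c_hat_eq4 = a + b * x_bar

print(round(x_bar, 3), round(c_hat_sum, 3), round(c_hat_eq4, 3))
```

Both routes give a predicted proportion correct of about 0.675, the value tabled for this subject.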
The predicted proportion correct $\hat{c}$ is a linear function of $\bar{x}$ with the same parameters as in the calibration function. It is worth noting that the result in (4) is independent of the particular form, that is, skewness and kurtosis, of the distribution of probability assessments $f(x_t)$. However, when it
comes to experimental studies of the fit between $\bar{c}$ (observed proportion correct) and $\hat{c}$ (predicted proportion correct) characteristics of the distribution may be important (see below).

Comments

The parameters $a$ and $b$ of the calibration function contain all information about calibration and resolution there is. Calibration is perfect if $a = 0$ and $b = 1$. Roughly, a subject’s probability assessments are good or poor depending on how far $a$ is from 0 and $b$ from 1. As will appear below, calibration and resolution in the Murphy partition of $\overline{PS}$ will assume simple and easily interpretable expressions when rewritten in terms of the calibration function.

Equation (4) is one answer to the issue raised in the introduction, namely the relationship between knowledge and calibration. Once we know a subject’s mean probability assessment and his calibration parameters $a$ and $b$ his knowledge $\hat{c}$ can be inferred. A second answer has to do with how $a$ and $b$ vary as a result of experimental manipulations and individual differences. This issue will be addressed by the experimental data below.

The measures of calibration and resolution in the decomposition of $\overline{PS}$ are means taken over all observations. They do not show how proportion correct increases with probability assessments. Therefore the linear approximation of the calibration curve is a useful complement to the Murphy scores. Moreover, $a$ and $b$ are not “algebraically confounded” as are the calibration and resolution scores in the partition of $\overline{PS}$.

About Linearity and Fit of the Model

Although the proportion correct ($c_t$) increases with probability judgments ($x_t$) the increase can be quite irregular. In particular this is the case for individual calibration curves based as they are on a smaller number of responses than group curves (a few examples are shown in Fig. 1). Sometimes a slight tendency toward curvilinearity (convex or concave) can be discerned.
Under these conditions it might seem futile to describe calibration curves by linear functions. However, in the absence of theoretical arguments it is not a very fruitful strategy to select different functions for different subjects and experimental conditions. In several cases the curves are fairly smooth and linear; when they are “jagged” and/or with a tendency toward curvilinearity a linear function may serve as a good approximation. The present purpose is not to make point predictions of proportion correct from probability judgments. The main advantages of the linear model are (a) the collapsed information in terms of a and b it provides for comparing individual subjects and groups of subjects under
different experimental conditions and (b) the possibility to predict knowledge according to Eq. (4). With this purpose in mind the test of the linear model will be made in two steps. First $a$ and $b$ are determined from the $T$ data pairs of $c_t$ and $x_t$ by the method of least squares. Second, $a$, $b$, and $\bar{x}$ are inserted in Eq. (4) to determine $\hat{c}$. Finally, $\hat{c}$ is compared to $\bar{c}$ to establish whether the agreement is satisfactory.

At first the agreement between $\hat{c}$ and $\bar{c}$ might appear trivial. Should not the two means agree except for rounding errors? Consider first the determination of $a$ and $b$. According to the criterion of least squares the best-fitting line $c(x_t) = a + bx_t$ always passes through the centroid (= center of gravity of the data), which has the coordinates $\frac{1}{T}\sum x_t$ and $\frac{1}{T}\sum c_t$. Notice that the $x$ coordinate is the same for all subjects in the experiment (in the experiments below the $x$ coordinate of the centroid is 0.75). The location of the centroid in the other direction ($\frac{1}{T}\sum c_t$) varies over subjects and conditions. Perfect calibration requires that $\sum c_t = \sum x_t$. Although the coordinates of the centroid are arithmetic means they are not identical to $\bar{x}$ and $\bar{c}$. The latter are arithmetic means which take $f(x_t)$ into account, $\bar{x} = \sum f(x_t)x_t$ and $\bar{c} = \sum f(x_t)c_t$. The same is true of the difference between predicted ($\hat{c}$) and observed ($\bar{c}$) proportion correct, $\hat{c} - \bar{c} = \sum f(x_t)(c(x_t) - c_t)$. However, the method of least squares minimizes $\sum (c(x_t) - c_t)^2$. Hence, what makes a difference is $f(x_t)$. Suppose, to take an example, that $f(x_t)$ is very large, say 0.5-0.6, for one $x_t$ (such cases are not unusual for extreme values of $x_t$). If the difference $(c(x_t) - c_t)$ is large for this particular $x_t$ then $(\hat{c} - \bar{c})$ gets a considerable contribution, positive or negative, which is not balanced by other $(c(x_t) - c_t)$ because their $f(x_t)$ are necessarily small. Considerations like these about the possible effects of $f(x_t)$ on $(\hat{c} - \bar{c})$ lead to the conclusion that the fit between $\hat{c}$ and $\bar{c}$ may be good or poor.
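The argument above can be made concrete with a short sketch. It fits the line by ordinary (unweighted) least squares to six category proportions, then compares the predicted and observed means, both of which are weighted by $f(x_t)$. The category proportions `c` are invented for illustration; the skewed distribution `f` is the one the text reports for Subject 28, and the helper `fit_line` is a hypothetical name.

```python
def fit_line(xs, ys):
    """Ordinary (unweighted) least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b


x = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
c = [0.48, 0.66, 0.70, 0.74, 0.72, 0.74]   # hypothetical observed c_t
f = [0.14, 0.05, 0.07, 0.08, 0.14, 0.52]   # skewed f(x_t), as for Subject 28

# Step 1: determine a and b from the T = 6 data pairs.
a, b = fit_line(x, c)

# Step 2: insert a, b, and the mean assessment in Eq. (4).
x_bar = sum(ft * xt for ft, xt in zip(f, x))
c_hat = a + b * x_bar

# Observed proportion correct, weighted by f(x_t).
c_bar = sum(ft * ct for ft, ct in zip(f, c))

# Unweighted residuals sum to zero (the line passes through the centroid),
# but the f-weighted residuals, which equal c_hat - c_bar, need not.
unweighted = sum((a + b * xt) - ct for xt, ct in zip(x, c))
weighted = sum(ft * ((a + b * xt) - ct) for ft, xt, ct in zip(f, x, c))

print(unweighted, weighted, c_hat - c_bar)
```

With a distribution this skewed, the residual at $x_t = 1.0$ carries more than half of the weight, so the weighted sum is not forced to vanish even though the unweighted sum is.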
Experimental tests are required to find out whether the linear model works.

Calibration and Resolution

The definitions of calibration and resolution in Eq. (3) capture essential intuitive features of these concepts (although as Yates, 1982, pointed out, resolution is not as easily available intuitively as calibration). Therefore it is of interest to consider these two scores also from the point of view of the linear model.

In the Murphy partition in Eq. (3) calibration is the mean of the squared deviations of $c_t$ from the identity line. Accordingly we have to consider the expression $\sum f(x_t)(x_t - c(x_t))^2$. Denote this quantity by MC, where M stands for model and C for calibration. The calibration function is $c(x_t) = a + bx_t$ and thus $MC = \sum f(x_t)((1 - b)x_t - a)^2$. By developing the square, summing, and using the definitions of arithmetic mean and variance one gets $MC = (1 - b)^2(\bar{x}^2 + s^2) - 2a(1 - b)\bar{x} + a^2$, which can be written $MC = ((1 - b)\bar{x} - a)^2 + (1 - b)^2 s^2$. According to Eq. (4) $\hat{c} = a + b\bar{x}$. Thus, the first term on the right side of the last equation for MC is $(\bar{x} - \hat{c})^2$. Finally, for MC,

$$MC = (\bar{x} - \hat{c})^2 + (1 - b)^2 s^2. \tag{5}$$
Resolution in Eq. (3) is the mean of the squared deviations of the category relative frequencies ($c_t$) from the overall proportion correct ($\bar{c}$). When this scoring rule is applied to model resolution (MR) we have $MR = \sum f(x_t)(c(x_t) - \hat{c})^2$. By inserting $c(x_t) = a + bx_t$ and $\hat{c} = a + b\bar{x}$ in the expression for MR we get $MR = b^2 \sum f(x_t)(x_t - \bar{x})^2$. Finally, by using the definition of variance we arrive at the following simple expression for MR:

$$MR = b^2 s^2. \tag{6}$$
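The closed forms in Eqs. (5) and (6) can be compared with the defining sums numerically. The sketch below does this with the response distribution the text reports for Subject 11 and the $a$ and $b$ listed for that subject in Table 1; variable names are illustrative, and because the reported $f(x_t)$ are rounded the computed $s^2$ need not match the tabled value exactly.

```python
x = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
f = [0.15, 0.11, 0.08, 0.12, 0.33, 0.21]  # f(x_t) reported for Subject 11
a, b = 0.185, 0.568                       # Table 1, Subject 11

# Mean and variance of the assessment distribution.
x_bar = sum(ft * xt for ft, xt in zip(f, x))
s2 = sum(ft * (xt - x_bar) ** 2 for ft, xt in zip(f, x))

# Predicted mean proportion correct, Eq. (4).
c_hat = a + b * x_bar

# Model calibration and resolution from their defining sums.
mc_sum = sum(ft * (xt - (a + b * xt)) ** 2 for ft, xt in zip(f, x))
mr_sum = sum(ft * ((a + b * xt) - c_hat) ** 2 for ft, xt in zip(f, x))

# The same quantities from the closed forms, Eqs. (5) and (6).
mc_closed = (x_bar - c_hat) ** 2 + (1 - b) ** 2 * s2
mr_closed = b ** 2 * s2

print(mc_sum, mc_closed)
print(mr_sum, mr_closed)
```

Each pair of printed values agrees up to floating-point rounding.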
In terms of the model the partition of the mean probability score is $M\overline{PS} = \hat{c}(1 - \hat{c}) + MC - MR$. By inserting the parametric expressions for MC and MR, Eqs. (5) and (6), in this equation the model $\overline{PS}$ is

$$M\overline{PS} = \hat{c}(1 - \hat{c}) + (\bar{x} - \hat{c})^2 + (1 - b)^2 s^2 - b^2 s^2, \tag{7}$$

which can also be written

$$M\overline{PS} = \hat{c}(1 - \hat{c}) + (\bar{x} - \hat{c})^2 + s^2 - 2bs^2. \tag{8}$$
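For readers who want the intermediate step, the passage from Eq. (7) to Eq. (8) only involves the last two terms; the first two are unchanged. Written out:

```latex
\begin{align*}
(1 - b)^2 s^2 - b^2 s^2
  &= (1 - 2b + b^2)\, s^2 - b^2 s^2 \\
  &= s^2 - 2b s^2 .
\end{align*}
```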
With the qualification that $\hat{c}$ and $b$ are model notions, Eq. (8) is formally identical to Yates’ (1982) covariance decomposition of $\overline{PS}$ for forecast tasks. The first term $\hat{c}(1 - \hat{c})$ is the model variance of the response index (outcome index in forecasting) and $bs^2$ ($= 1 \times s \times bs$) is the model covariance between assessments of subjective probability and proportion correct (corresponding to Yates’ covariance of forecasts and outcomes).

Comments

Calibration, or “calibration in the small” as Yates (1982) called it, in Eq. (5), contains two parts, $(\bar{x} - \hat{c})^2$ and $(1 - b)^2 s^2$. To be perfectly calibrated it is necessary to be perfectly “calibrated in the large,” Yates’ (1982) term for $(\bar{x} - \hat{c})^2$, and to have a calibration function with a slope of one. This of course is only a different way of saying that perfect calibration requires that $a = 0$ and $b = 1$.

Yates (1982) noted that resolution is difficult to interpret intuitively. The correct interpretation is that resolution “reflects the ability of the assessor to sort events into categories for which the hit rates are either 0 or 1” (Yates, 1982, p. 151). Let the proportion $\bar{c}$ of correct responses be sorted into one category $x_t$ and the proportion $(1 - \bar{c})$ of incorrect responses into another, $x_t + h$. The mean of the two-valued response distribution is $\bar{x} = x_t + (1 - \bar{c})h$ and its variance is $s^2 = \bar{c}(1 - \bar{c})h^2$. The
slope of the calibration function is $b = 1/h$ and resolution ($b^2 s^2$) becomes $MR = \bar{c}(1 - \bar{c})$. This shows that perfect resolution is independent of the particular categories into which correct and incorrect responses are sorted. However, only if the correct responses are assigned to $x_t = 1$ and the incorrect responses to $x_t = 0$ will perfect resolution and calibration occur concurrently.

Assume that a person is perfectly calibrated, $MC = 0$ and $MR = s^2$. In this case the mean probability score is $M\overline{PS} = \hat{c}(1 - \hat{c}) - s^2$. Further improvement of $M\overline{PS}$ can come about in two ways. The first way is to be a perfect knower, in which case $\hat{c} = 1$ and $s^2 = 0$. The second way to improve $M\overline{PS}$ is to increase $s^2$. This can be done up to a limit where the proportion of assessments $x_t = 1$ is $\hat{c}$, and the proportion of assessments $x_t = 0$ is $(1 - \hat{c})$. In this case $s^2 = \hat{c}(1 - \hat{c})$ and $M\overline{PS} = 0$.

The standard deviation of the distribution of probability assessments is $s$ and it follows from $c(x_t) = a + bx_t$ that the standard deviation of the proportion of correct responses is $bs$. Thus, resolution, $MR = b^2 s^2$, is the variance of this distribution. It also follows that $(1 - b)s$ is the difference between the standard deviations of the two distributions. The second component of calibration, $MC = (\bar{x} - \hat{c})^2 + (1 - b)^2 s^2$, is the square of this difference.

The observant reader may have noticed that “knowledge” has been used in two different ways. In the partition of the Brier score in Eqs. (7) and (8) knowledge refers to $\hat{c}(1 - \hat{c})$ and in Eq. (4) to $\hat{c}$. This is no serious problem since the one definition is an exact transformation of the other.

TEST OF THE MODEL: INDIVIDUAL DATA

The purpose of the following experiment was twofold: (1) to test the model on individual data and (2) to get information about individual differences with respect to knowledge, calibration, and resolution. Specifically, are “those who know more” better calibrated and resolved?

Method

Task. The subjects responded to a questionnaire containing 250 general knowledge items which covered the topics economics, geography, history, politics, and population statistics. Each item had two alternative answers marked by a and b. The subjects gave their answers by circling a or b. After this they assessed their certainty on a half-range scale by circling 50%, 60%, 70%, 80%, 90%, or 100%. The extreme categories were described in the instructions as “tossing a coin” (50%) and “absolute certainty” (100%).

Subjects. Thirty-three undergraduate students taking a half-time, introductory, evening course in psychology served as subjects. Their participation was part of a course requirement. The average age was 27 years
with a standard deviation of 9 years. The subjects completed the questionnaire at their own rate. The time they needed varied between 60 and 110 min.

Results

Individual differences. The test of Eq. (4) was made in two steps. First, $c(x_t) = a + bx_t$ ($t = 1, \ldots, 6$) was fitted to the calibration data of each subject by the method of least squares. In this way individual values of $a$ and $b$ were obtained. Equation (4) was then used to determine $\hat{c}$ by inserting $a$, $b$, and $\bar{x}$ in the equation.

The first column of Table 1, labeled $c(0.5)$, shows proportion correct at $x_t = 0.5$. For optimal performance $c(0.5)$ should of course be 0.5. Due to the moderating effect of $b$ the variation of $c(0.5)$ is smaller than that for $a$. It ranges from 0.363 to 0.670. Although the average subject is near the optimal value of 0.5 the individual variation is considerable.

The variation of $b$ covers quite precisely the entire range from 0.00 to 1.00. Very few subjects approach what is required for perfect calibration ($b = 1$). Seven subjects have $b$ values below 0.25, 19 are in the interval 0.25-0.75, and 7 have $b$ values above 0.75. From this one can conclude that resolution ($= b^2 s^2$) in general is poor. For the vast majority, 25 of 33 subjects, resolution is below and usually far below 50% of what it should be for perfect resolution and calibration.

The agreement between $\hat{c}$ and $\bar{c}$. The standard error of estimate ($s_c\sqrt{1 - r_{xc}^2}$) is a measure of the scatter around the regression line. These measures range from 0.022 (Subject 4) to 0.149 (Subject 11) with a mean of 0.083. The correlation between the standard errors of estimate and the absolute differences between $\hat{c}$ and $\bar{c}$ is 0.39. That there is a correlation at all is due to the combined effect of skewed $f(x_t)$ and poor fit of $c(x_t) = a + bx_t$ (see discussion under Model). It is instructive in this connection to compare Subjects 11 and 28 (Fig. 1). The fit is poor in both cases. However, for Subject 11 the difference $\hat{c} - \bar{c} = 0.000$ and for Subject 28 it is -0.057.
The response distribution Ax,) for Subject 1I is 0.15, 0.11,0.08,0.12,0.33, and 0.21 and for Subject 28 0.14, 0.05, 0.07, 0.08, 0.14, and 0.52. The negative deviation from the regression line at x, = 1 is not balanced by the positive deviations at xI = 0.6 and x, = 8 because of low&x,) at these points. This is the reason why ? - T is large and negative for Subject 28. The same mechanism accounts for the large difference ? - C for Subject 21. Table 2 gives the distribution of absolute differences 17 - Cl. The mean is 0.015 with a standard deviation of 0.015. About 50% of the subjects have deviations less than 0.010 and roughly 80% less than 0.020. What can be said in general about the agreement between 7 and C? Is the model close enough to the data? To give a precise answer we would
KNOWLEDGE,
CALIBRATION,
AND
11
RESOLUTION
TABLE 1 MEASURESOF CALIBRATION FORINDIVIDUAL
SUBJECTS
Sub-
1 c(O.5)
2 a
3 b
4 f
,Ia
ject 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
0.457 0.516 0.575 0.555 0.545 0.363 0.638 0.575 0.530 0.473 0.469 0.479 0.488 0.583 0.458 0.482 0.495 0.564 0.611 0.5 19 0.367 0.417 0.589 0.584 0.473 0.519 0.670 0.523 0.661 0.504 0.494 0.553 0.556
- 0.033 0.048 0.387 0.397 0.259 - 0.026 0.618 0.469 0.279 0.156 0.185 0.199 0.334 0.261 0.111 0.129 0.234 0.464 0.556 0.095 -0.125 0.009 0.593 0.412 0.049 0.199 0.391 0.311 0.546 0.222 0.311 0.328 0.461
0.980 0.936 0.383 0.315 0.571 0.778 0.040 0.211 0.502 0.633 0.568 0.559 0.307 0.644 0.693 0.705 0.522 0.200 0.109 0.847 0.983 0.815 -0.009 0.344 0.848 0.639 0.557 0.424 0.229 0.563 0.365 0.449 0.190
0.714 0.787 0.808 0.730 0.716 0.849 0.885 0.620 0.708 0.731 0.801 0.802 0.839 0.667 0.677 0.748 0.720 0.656 0.73 1 0.750 0.930 0.837 0.700 0.699 0.717 0.731 0.698 0.858 0.865 0.7% 0.821 0.790 0.769
0.0516 0.0385 0.0343 0.0231 0.0391 0.0252 0.0292 0.0193 0.0291 0.0372 0.0302 0.0314 0.0347 0.0262 0.0289 0.0336 0.0179 0.0285 0.0267 0.0363 0.0172 0.0314 0.0256 0.0327 0.0253 0.0242 0.0458 0.0347 0.0263 0.0271 0.0272 0.0297 0.0354
0.667 0.785 0.6% 0.627 0.668 0.635 0.653 0.600 0.634 0.619 0.640 0.647 0.592 0.691 0.580 0.656 0.610 0.595 0.636 0.730 0.789 0.691 0.587 0.652 0.657 0.666 0.780 0.675 0.744 0.670 0.611 0.683 0.607
0.699 0.780 0.715 0.628 0.688 0.643 0.683 0.584 0.632 0.635 0.640 0.624 0.614 0.680 0.584 0.660 0.612 0.600 0.628 0.736 0.856 0.708 0.592 0.648 0.635 0.648 0.7% 0.732 0.754 0.668 0.626 0.692 0.618
- 0.032 0.005 -0.019 -0.001 - 0.020 -0.008 -0.030 0.016 0.002 -0.016 0.000 0.023 - 0.022 0.011 -0.004 -0.004 -0.002 -0.005 0.008 -o.cKm -0.067 -0.017 -0.005 0.022 0.018 -0.016 - 0.057 -0.010 0.002 -0.015 -0.009 -0.011
0.015 0.007 0.093** 0.102** 0.028 0.206*** 0.202*** 0.036 0.076’ 0.096** 0.161*** 0.178*** 0.225*** -0.013 0.093** 0.088.’ 0.108** 0.056 0.103” 0.014 0.074*** 0.129*** 0.108*** 0.051 0.082** 0.083** -0.098*** 0.126;” 0.111*** 0.128’** 0.195*** 0X198*** 0.151***
Mean
0.524
0.268
0.512
0.762
0.0304
0.660
0.668
-0.008
0.094
SD
0.0726
0.1924
0.2697
0.0726
0.0073
0.0559
0.0643
a sz = 0.0292 for a rectangular * p -=z.05. *t p -=c.Ol. *** p < ml.
6 7
7 7
8 P-T
0.004
0.0195
9 I-F
0.0687
distribution
need to know the sampling distribution of (7 - Z)), in which the sampling variances of a, b, E, and T and the correlation between 7 and i? enter. In lack of knowledge of the sampling variance of (7 - T) we can take a look at the sampling variance of F. This is probably an underestimate of the sampling variance of (7 - C). Thus, the bias should be in the right direction. The standard error of F is a7 = [(l/N)(F(l - F)]“*, where N = 250 (in a few cases N is somewhat lower because the subject has missed one or a couple of items). The standard error oF varies from 0.022 for Subject 21 to 0.031 for Subject 15. With the exception of Subjects 21 and 28 the
12
MATS
BJdRKMAN 1.0
1.0 .9
.9
.e
.I3
Y k 8
.7
6 .” z h:
.5
LI
: k z
.6
.-z -b hB
.4 .3 Subject
.2
2
.7 .6 .5 .4 .3 Subject
.2
4
.l
.l .5
.6
.7
.E
.9
I
.O
I
.O
1.0
.5
.6
.7
.8
.9
1.0
Assessment
Assessment
1.0
r
.9
.a : k E
.7 .6
P
Subject
j, .5
,
,
,
,
(
.6
.7
.E
.9
1.0
Assessment
28
1:t-----.5
.6
.7
.8
.9
1.0
Assessment
FIG. I. Four examples of calibration curves and the calibration function c(q) = n + bx,. See Table I for the values of (I and b and other calibration data.
differences ($\hat{c} - \bar{c}$) fall far below one standard error (Subjects 1 and 7 are borderline cases). The agreement between model and data can be considered satisfactory. The conclusion is the same when the sampling variance is taken to be the one for the difference ($\bar{x} - \bar{c}$). The standard error $\sigma_{\bar{x}-\bar{c}}$, which is the adequate one to use for testing the significance of ($\bar{x} - \bar{c}$), ranges from 0.020 (Subject 21) to 0.034 (Subject 10). Again, the differences between $\hat{c}$ and $\bar{c}$ are within one standard error (and usually much lower), and again the two exceptions are Subjects 21 and 28. The general conclusion is that for 31 of 33 subjects the model has a satisfactory fit to the data.

TABLE 2
DISTRIBUTION OF ABSOLUTE DIFFERENCES BETWEEN PREDICTED AND OBSERVED PROPORTION CORRECT (INDIVIDUAL DATA)

Interval       Number of subjects
0.000-0.010    16
0.011-0.020    10
0.021-0.030     4
0.031-0.040     1
0.041-0.070     2

Over/underconfidence. The last column of Table 1 shows the differences ($\bar{x} - \bar{c}$). A positive difference indicates overconfidence, a negative difference underconfidence. A common finding on group data is overconfidence. This is the case also in this study, with a mean difference between $\bar{x}$ and $\bar{c}$ of 0.094. However, we also find subjects (Nos. 1, 2, 5, 8, 14, 18, 20, and 24) with no significant bias toward overconfidence and one subject (No. 27) with significant underconfidence.

Calibration and resolution. The 33 subjects were rank-ordered according to proportion correct ($\bar{c}$) and then divided into three groups (Low, Medium, High) with 11 subjects in each. The group means are shown in Table 3. Calibration and resolution, computed according to Eqs. (5) and (6), are given in the last two columns of the table. Both calibration and resolution improve with better knowledge. Note also that both components of calibration, $(\bar{x} - \hat{c})^2$ and $(1 - b)^2 s^2$, contribute to improved calibration. These changes can be traced back to a decrease of a (from 0.375 to 0.202) and an increase of b (from 0.315 to 0.652). Due to the negative correlation between a and b, a decrease of a must necessarily result in an increase of b. From the compensatory relation between a and b it might be expected that it would be of little (or no) importance to the knowledge score if a decreases and b increases. It is obvious, however, that more is gained from an increase of b than is lost from the corresponding decrease of a. If, for example, the a and b values of the High group are applied to the Low group by using Eq. (4), the knowledge score increases from 0.61 to 0.68.

TABLE 3
CALIBRATION AND RESOLUTION FOR GROUPS OF SUBJECTS DIVIDED ACCORDING TO KNOWLEDGE

Group means (n = 11)

Knowledge group  Knowledge      $\bar{x}$  $\bar{c}$  a      b      $r_{ab}$  $(\bar{x}-\hat{c})^2$  $(1-b)^2 s^2$  MC     MR
Low              0.584-0.628    0.733      0.610      0.375  0.315  -0.988    0.0194                 0.0138         0.033  0.004
Medium           0.632-0.683    0.758      0.652      0.226  0.570  -0.964    0.0162                 0.0068         0.023  0.011
High             0.6&X-0.856    0.7%       0.742      0.202  0.652  -0.960    0.0114                 0.0062         0.018  0.018
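The arithmetic behind the last example can be retraced with a short Python sketch. It applies Eq. (4), $\hat{c} = a + b\bar{x}$, to the Low-group mean assessment, first with the Low group's own parameters and then with those of the High group, and it also checks that the two tabled components of Eq. (5) add up to MC. The dictionary layout and the function name `predict_proportion_correct` are illustrative assumptions, not part of the original analysis.

```python
# Group means taken from Table 3 (Low and High knowledge groups).
# Names are illustrative; they do not appear in the paper.
TABLE3 = {
    "Low":  {"x_bar": 0.733, "a": 0.375, "b": 0.315,
             "bias_sq": 0.0194, "slope_part": 0.0138, "MC": 0.033},
    "High": {"a": 0.202, "b": 0.652,
             "bias_sq": 0.0114, "slope_part": 0.0062, "MC": 0.018},
}

def predict_proportion_correct(a, b, x_bar):
    """Eq. (4): predicted mean proportion correct, c_hat = a + b * x_bar."""
    return a + b * x_bar

low, high = TABLE3["Low"], TABLE3["High"]

# Low group with its own parameters, then with the High group's a and b.
own = predict_proportion_correct(low["a"], low["b"], low["x_bar"])
swapped = predict_proportion_correct(high["a"], high["b"], low["x_bar"])
print(f"Low group, own a and b:     {own:.2f}")
print(f"Low group, High-group a, b: {swapped:.2f}")

# Eq. (5): the calibration score is the sum of its two components.
for name, row in TABLE3.items():
    total = row["bias_sq"] + row["slope_part"]
    print(f"{name}: components sum to {total:.4f}, tabled MC = {row['MC']:.3f}")
```

Because the tabled values are group means rounded to three or four decimals, agreement is only expected to within rounding.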
It is quite clear from the data in Table 3 that there is a tendency for more knowledge to be accompanied by better calibration and resolution. The differences between the three knowledge groups are significant for resolution, F(2,30) = 6.01, p < .01, but not for calibration, F(2,30) = 1.83, p > .05.

GROUP DATA: A PREVIOUS STUDY
Among previous studies which report the data ($\bar{x}$, $\bar{c}$, and calibration curves) necessary to test Eq. (4), I have chosen the study by Lichtenstein and Fischhoff (1977). This is because of (1) the great variation of $\bar{c}$ over different tasks and groups; (2) the large number of experiments and subgroups within experiments, altogether 19 conditions; and (3) the presentation (in graphical form) of response distributions, which is unusual in calibration studies.

Method

Tasks. All tasks were of the two-alternative type. After the subjects had selected one of the alternatives as the correct response they chose a number between 0.5 and 1 to indicate the probability that the chosen alternative was correct. In Experiments 1a and 1b the subjects had "severely limited knowledge." In 1a (children's art) the subjects were shown a set of drawings and had to decide whether the artist was a European or an Asian child. In 1b the task was to predict stock prices (higher or lower) on the basis of information taken from an 8-month period. Handwritten specimens of a Latin phrase written by European and American adults were used as stimuli in Experiment 2. The specimens were divided into two sets, one of which was used as training stimuli and the other as test stimuli. One group of subjects studied the training stimuli with their correct labels for 5 min. Immediately after training they were shown the test stimuli and made their decisions (European or American) and probability assessments. The control group had the same procedure except that the stimuli during the first phase were not assigned the correct labels. In Experiment 3 general knowledge items were used. This was the case also in Experiment 4, in which specially written psychology items were intermixed with general knowledge items. Experiment 5 contained two tests of 50 items, each selected according to percentage correct in Experiment 3.
Each item in the hard test matched an item in the easy test such that the difference in percentage correct was 20%; for the hard test the mean percentage correct was 60.5 and for the easy test 80.5.

Subjects. In Experiment 4 the subjects were graduate students in the Psychology Department of the University of Oregon. In the other experiments the subjects were paid volunteers who responded to advertisements in the University of Oregon student newspaper.

Determination of the calibration parameters. The parameters a and b of the calibration function were determined graphically from the calibration curves in Lichtenstein and Fischhoff (1977). Two independent judges (P.J. and M.B.) each fitted a plastic ruler to the calibration curve by eye and read off the proportion correct at $x_t = 0.5$ and $x_t = 1$. From these readings a and b were computed separately for each judge, and the means were taken as the final values of a and b. The agreement between the two judges was surprisingly good: the differences between their readings never exceeded 0.02. The method was "tested" on the 33 subjects in the study of individual subjects. The differences in $\hat{c}$ between the method of least squares and the graphic method were $\le 0.005$ in 23 cases, in the interval 0.006-0.010 in 7 cases, and in the interval 0.011-0.015 in 3 cases.

Results

Fit of the model. The first three columns of Table 4 coincide with Table 1 in Lichtenstein and Fischhoff (1977, p. 166). The following columns give the values of c(0.5), a, b, and $\hat{c}$, the last computed by inserting a, b, and $\bar{x}$ in Eq. (4). As can be seen from the figures in the eighth column, the differences between predicted ($\hat{c}$) and observed ($\bar{c}$) proportion correct are, in absolute value, 0 or 0.01 in 16 of the 19 task conditions and 0.02 in the remaining 3. To get some point of reference for judging the magnitude of the differences ($\hat{c} - \bar{c}$), the standard errors of $\bar{c}$ are presented in the last column of Table 4. These values were computed by the formula $\sigma_{\bar{c}} = [(1/N)\,\bar{c}(1 - \bar{c})]^{1/2}$, where N is the number of responses given in the first column. For the three largest differences of 0.02 the standard errors are 0.018, 0.013, and 0.015, respectively. In these cases the deviations between predicted and observed proportion correct are slightly above one standard error of $\bar{c}$.
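Both computations described above are simple enough to retrace. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, and since the judges' readings at $x_t = 1$ are not tabled, that reading is back-computed here as a + b from the tabled parameters of one condition. The second part evaluates the standard error formula for the three conditions with the largest differences, using N and $\bar{c}$ from Table 4.

```python
import math

def line_from_readings(c_at_half, c_at_one):
    """Intercept a and slope b of c(x) = a + b*x from readings at x = 0.5 and x = 1."""
    b = (c_at_one - c_at_half) / (1.0 - 0.5)
    a = c_at_half - 0.5 * b
    return a, b

def standard_error(c_bar, n):
    """Standard error of an observed proportion correct: sqrt(c_bar * (1 - c_bar) / N)."""
    return math.sqrt(c_bar * (1.0 - c_bar) / n)

# Experiment 2, training group (Table 4): c(0.5) = 0.505, a = 0.125, b = 0.76.
# The reading at x = 1 is reconstructed as a + b (an assumption, see lead-in).
a, b = line_from_readings(0.505, 0.125 + 0.76)
c_hat = a + b * 0.78  # Eq. (4) with x_bar = 0.78
print(f"a = {a:.3f}, b = {b:.2f}, c_hat = {c_hat:.2f}")

# The three conditions with |c_hat - c_bar| = 0.02: (N, observed c_bar).
largest = {
    "1b Stock charts": (756, 0.47),
    "3 Middle subjects, hard items": (1453, 0.48),
    "4 Best subjects, hard items": (1050, 0.66),
}
errors = {label: standard_error(c_bar, n) for label, (n, c_bar) in largest.items()}
for label, se in errors.items():
    print(f"{label}: standard error = {se:.3f}")
```

Reading the line at the two end points of the response scale is convenient because the slope is then simply twice the difference between the readings.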
In the other cases the differences ($\hat{c} - \bar{c}$) are about equal to (three cases) or below the standard errors of $\bar{c}$.

Skewness. Figure 12 in Lichtenstein and Fischhoff (1977, p. 179) shows the distributions of subjects' probability assessments. Not unexpectedly, the response distributions for hard tasks tend to concentrate in the left part of the response scale (positive skewness). An opposite, though weaker, tendency toward negative skewness obtains for easy tasks. There are six conditions with clear positive skewness. For these six conditions the mean difference ($\hat{c} - \bar{c}$) is 0.002. Negative skewness obtains for three conditions. Here the mean difference ($\hat{c} - \bar{c}$) is 0.003. The fit of $c(x_t) = a + b x_t$ to the data is closer for the group data, which has the effect that skewness of $f(x_t)$ does not affect the difference $\hat{c} - \bar{c}$.

Easy vs hard tasks. Another observation in Table 4 concerns the differences between easy and hard items. Including the easy and hard tests
TABLE 4
MEASURES OF CALIBRATION FOR GROUPS OF SUBJECTS (a)

Columns: (1) N; (2) $\bar{x}$; (3) $\bar{c}$; (4) c(0.5); (5) a; (6) b; (7) $\hat{c}$; (8) $\hat{c} - \bar{c}$; (9) $\sigma_{\bar{c}}$

Experiment, group            (1)    (2)    (3)    (4)     (5)     (6)    (7)    (8)     (9)
1a Children's drawings       1104   0.68   0.53   0.515   0.465   0.10   0.53    0.00   -
1b Stock charts               756   0.65   0.47   0.485   0.485   0.00   0.49    0.02   0.018
2  Training                   520   0.78   0.71   0.505   0.125   0.76   0.72    0.01   0.020
   No training                570   0.65   0.51   0.500   0.450   0.10   0.52    0.01   0.021
3  Best subjects             3000   0.76   0.71   0.555   0.260   0.59   0.71    0.00   -
   Middle subjects           2925   0.71   0.64   0.530   0.260   0.54   0.64    0.00   -
   Worst subjects            3075   0.71   0.56   0.455   0.220   0.47   0.55   -0.01   0.009
   Best subjects
     Easy items              1532   0.80   0.85   0.710   0.480   0.46   0.85    0.00   -
     Hard items              1468   0.72   0.58   0.490   0.280   0.42   0.58    0.00   -
   Middle subjects
     Easy items              1472   0.75   0.80   0.690   0.445   0.49   0.81    0.01   0.010
     Hard items              1453   0.67   0.48   0.405   0.235   0.34   0.46   -0.02   0.013
   Worst subjects
     Easy items              1516   0.73   0.70   0.565   0.260   0.61   0.71    0.01   0.012
     Hard items              1559   0.68   0.43   0.375   0.265   0.22   0.42   -0.01   0.013
4  Best subjects
     Easy items              1450   0.86   0.92   0.755   0.510   0.49   0.93    0.01   0.007
     Hard items              1050   0.71   0.66   0.510   0.110   0.80   0.68    0.02   0.015
   Worst subjects
     Easy items              1450   0.82   0.85   0.670   0.395   0.55   0.85    0.00   -
     Hard items              1050   0.68   0.51   0.420   0.130   0.58   0.52    0.01   0.015
5  Easy test                 2250   0.78   0.80   0.625   0.315   0.62   0.80    0.00   -
   Hard test                 2490   0.74   0.62   0.500   0.255   0.49   0.62    0.00   -

(a) Data from Lichtenstein & Fischhoff (1977).
of Experiment 5 there are altogether six easy-hard pairs to consider. The mean values relevant for Eq. (4) are as follows: Easy: a = 0.401, b = 0.537, $\bar{x}$ = 0.790, $b\bar{x}$ = 0.423, and $\hat{c}$ = 0.825; Hard: a = 0.212, b = 0.475, $\bar{x}$ = 0.700, $b\bar{x}$ = 0.334, and $\hat{c}$ = 0.547. The differences between easy and hard tasks are a(diff.) = 0.189, b(diff.) = 0.062, $\bar{x}$(diff.) = 0.090, $b\bar{x}$(diff.) = 0.089, and $\hat{c}$(diff.) = 0.278. Notice first the small difference between the b values. The mean slopes of the calibration curves for easy and hard tasks are practically the same. This has the further consequence that the difference in $b\bar{x}$ is about the same as in $\bar{x}$. On the rough assumption that the response variance is the same for easy and hard tasks, resolution ($= b^2 s^2$) should not differ between the two conditions. The mean Murphy resolution for easy tasks (Lichtenstein & Fischhoff, 1977, Table 1, p. 166) is 0.012 and for hard tasks 0.009. The major contribution to the difference in overall proportion correct derives from the difference in a. This difference is 0.189, to which should be added 0.089 ($b\bar{x}$(diff.)) to reach 0.278, the mean difference in proportion correct. The increase of a from hard to easy tasks accounts for about two-thirds of the increase in overall proportion correct; about one-third is accounted for by the increase of $\bar{x}$.

According to Eq. (5) calibration consists of two parts, $(\bar{x} - \hat{c})^2$ and $(1 - b)^2 s^2$. Still on the assumption that $b^2$ and the variance $s^2$ are equal for the easy and the hard task in a pair, differences in calibration should be due solely to differences in $(\bar{x} - \hat{c})^2$. The differences in Murphy calibration (Lichtenstein & Fischhoff, 1977, Table 1, p. 166) and the differences computed according to Eq. (5) by inserting $\bar{x}$ and $\hat{c}$ and keeping $(1 - b)^2 s^2$ constant are as follows: Experiment 3, Best subjects: 0.018 and 0.017; Middle subjects: 0.043 and 0.041; Worst subjects: 0.078 and 0.067; Experiment 4, Best subjects: -0.006 and -0.004; Worst subjects: 0.029 and 0.025; Experiment 5: 0.019 and 0.014. The differences in Murphy calibration scores are somewhat larger than the differences according to Eq. (5), as they should be, since we have ignored all variation in $(1 - b)^2 s^2$. However, the discrepancies are so small that one can conclude that the lower calibration scores for easy tasks are almost entirely due to the decrease of $(\bar{x} - \hat{c})^2$. Thus, it is a reduction of overconfidence that accounts for the better calibration in easy tasks.

Training vs no training. A different pattern occurs when training is compared to no training in Experiment 2. Here the value of c(0.5) is unaffected whereas b increases from 0.10 (no training) to 0.76 (training). The Murphy calibration scores are 0.033 and 0.009 for no training and training, respectively (Lichtenstein & Fischhoff, 1977, Table 1). How much of the improvement of 0.024 is due to reduction of overconfidence? The values of $(\bar{x} - \hat{c})^2$ for no training and training are $(0.13)^2$ and $(0.06)^2$, which gives a difference of 0.013. This should be compared to 0.024, showing that reduced overconfidence accounts for about 50% of the improved calibration. The remaining 50% falls on the second part of the calibration score, $(1 - b)^2 s^2$.
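The figures attributed to Eq. (5) in the two comparisons above can be retraced from the Table 4 entries. The Python sketch below uses illustrative names and takes, for each pair, the hard-minus-easy difference in $(\bar{x} - \hat{c})^2$, holding $(1 - b)^2 s^2$ constant as the text assumes. The same helper is then applied to the training comparison of Experiment 2, and the last lines split the mean easy-hard gain in $\hat{c}$ into its two sources.

```python
# (x_bar, c_hat) for the easy and the hard member of each pair, from Table 4.
PAIRS = {
    "Exp. 3, Best":   {"easy": (0.80, 0.85), "hard": (0.72, 0.58)},
    "Exp. 3, Middle": {"easy": (0.75, 0.81), "hard": (0.67, 0.46)},
    "Exp. 3, Worst":  {"easy": (0.73, 0.71), "hard": (0.68, 0.42)},
    "Exp. 4, Best":   {"easy": (0.86, 0.93), "hard": (0.71, 0.68)},
    "Exp. 4, Worst":  {"easy": (0.82, 0.85), "hard": (0.68, 0.52)},
    "Exp. 5":         {"easy": (0.78, 0.80), "hard": (0.74, 0.62)},
}

def bias_squared(x_bar, c_hat):
    """First component of the calibration score in Eq. (5): (x_bar - c_hat)^2."""
    return (x_bar - c_hat) ** 2

# Hard-minus-easy difference in the first component, pair by pair.
diffs = {
    label: bias_squared(*pair["hard"]) - bias_squared(*pair["easy"])
    for label, pair in PAIRS.items()
}
for label, d in diffs.items():
    print(f"{label}: hard minus easy = {d:+.4f}")

# Experiment 2: no training (x_bar = 0.65, c_hat = 0.52) vs training (0.78, 0.72).
training_gain = bias_squared(0.65, 0.52) - bias_squared(0.78, 0.72)
share = training_gain / 0.024  # 0.024 = 0.033 - 0.009, the Murphy calibration gain
print(f"Training: reduction of first component = {training_gain:.4f} ({share:.0%} of 0.024)")

# Mean easy-minus-hard differences quoted in the text: intercept a and product b * x_bar.
delta_a, delta_bx = 0.189, 0.089
delta_c_hat = delta_a + delta_bx
print(f"a accounts for {delta_a / delta_c_hat:.0%} of the gain of {delta_c_hat:.3f}")
```

The differences are printed to four decimals so that they can be compared with the three-decimal values in the text without rounding ambiguity.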
Training, in contrast to the task manipulation easy vs hard items, has a considerable effect on the slope of the calibration function. This is also reflected in the resolution score (Lichtenstein & Fischhoff, 1977, Table 1), which increases from 0.002 (no training) to 0.022 (training).

CALIBRATION AND RESOLUTION IN THE TWO STUDIES

Lichtenstein and Fischhoff (1977) found that calibration was better for easy as compared to hard tasks. Furthermore, as was demonstrated above, improved calibration for easy tasks was an effect of a decrease of the first component of the calibration score, $(\bar{x} - \hat{c})^2$. Imagine that the calibration function moves from a low level (hard items, overconfidence) upward without changing its slope. Calibration will first improve as $(\bar{x} - \hat{c})^2$ decreases and then deteriorate as $(\bar{x} - \hat{c})^2$ begins to increase (easy items, underconfidence). The relationship between proportion correct and calibration should therefore be U-shaped, with poor calibration for extreme over- and underconfidence and the best calibration when $\bar{x} = \hat{c}$. Had the underconfidence for easy items been as large as the overconfidence for hard items, calibration would have been equal for easy and hard items. Now, the mean of ($\bar{x} - \bar{c}$) for easy items was -0.03 and for hard items 0.15, which explains why easy items were associated with better calibration scores. Lichtenstein and Fischhoff (1977) also computed individual Brier partition scores for 93 subjects in their Experiment 5. They found a fairly strong quadratic relationship between calibration and proportion correct (R = .62, p < .001) with the best calibration around 78% correct. Hence, whether one varies item difficulty from hard to easy or the subject's ability, in both cases over a range that covers both overconfidence and underconfidence, the relationship between calibration and knowledge is bound to be U-shaped. Only if the second factor of the calibration score, $(1 - b)^2 s^2$, increases and decreases to balance the changes of $(\bar{x} - \hat{c})^2$ can this U-shaped relationship be violated. It is very unlikely, however, that this should occur. Either b is fairly constant as in Lichtenstein and Fischhoff (1977) or it increases with knowledge as in the author's data, where the correlation between b and $\bar{c}$ was significant (r = .45, p < .05). The calibration measures in Table 3 can be compared to the Best, Middle, and Worst subjects of Experiment 3 in Lichtenstein and Fischhoff (1977). These calibration measures were 0.011, 0.012, and 0.030. In the author's experiment the corresponding measures were 0.018, 0.023, and 0.033, indicating a weaker relation between calibration and knowledge. Neither the group differences nor the correlation between MC and $\bar{c}$ (r = -.31, p > .05) reached significance.
Among the 33 subjects only one showed significant underconfidence, which prevented the U-shaped relation between calibration and knowledge from appearing. Although the two studies accord in showing an association between knowledge and calibration, there is an interesting difference. In the author's data both parts of the calibration score, $(\bar{x} - \hat{c})^2$ and $(1 - b)^2 s^2$, contribute to improved calibration. In Lichtenstein and Fischhoff (1977), however, changes in calibration are due solely to the first part, $(\bar{x} - \hat{c})^2$. This leads us directly to their findings concerning resolution, namely that resolution had no systematic relationship to knowledge. Neither the individual variation in Experiment 5 nor the variation of item difficulty in Experiments 3-5 produced any systematic variation of resolution. This result is a direct consequence of the fact that b did not vary with proportion correct, and it contradicts the author's result in the individual study.
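The U-shaped relation argued for above follows directly from the parametric forms of Eqs. (5) and (6) and can be illustrated numerically. The Python sketch below raises the intercept a in small steps while the slope b and the response variance $s^2$ stay fixed. All numerical values (b = 0.5, $\bar{x}$ = 0.76, $s^2$ = 0.02, the grid of intercepts) are hypothetical choices made for illustration; they are not estimates from either study.

```python
def calibration_score(a, b, x_bar, s2):
    """Eq. (5): (x_bar - c_hat)^2 + (1 - b)^2 * s^2, with c_hat = a + b * x_bar."""
    c_hat = a + b * x_bar
    return (x_bar - c_hat) ** 2 + (1.0 - b) ** 2 * s2

def resolution_score(b, s2):
    """Eq. (6): b^2 * s^2."""
    return b ** 2 * s2

# Hypothetical values chosen for illustration only.
B, X_BAR, S2 = 0.5, 0.76, 0.02

# Move the calibration line upward: a = 0.15, 0.16, ..., 0.60, slope unchanged.
intercepts = [i / 100 for i in range(15, 61)]
curve = [(a + B * X_BAR, calibration_score(a, B, X_BAR, S2)) for a in intercepts]

# The score is lowest where predicted proportion correct equals mean confidence.
best_c_hat, best_score = min(curve, key=lambda point: point[1])
print(f"lowest calibration score {best_score:.4f} at c_hat = {best_c_hat:.2f}")
print(f"score at lowest intercept:  {curve[0][1]:.4f}")
print(f"score at highest intercept: {curve[-1][1]:.4f}")
print(f"resolution is {resolution_score(B, S2):.4f} for every intercept")
```

In this sketch the resolution score does not depend on a at all, which mirrors the observation that resolution stays flat when only the level of the calibration function changes.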
The mean of b is practically the same (approximately 0.5) in Experiments 3-5 and in the author's experiment. However, there is a considerable interindividual variation (from 0 to 1) in the latter study, and this variation is significantly correlated with proportion correct (r = .45, p < .05). Again we can compare the three knowledge groups in Experiment 3 in Lichtenstein and Fischhoff (1977) with those in Table 3. The resolution scores are 0.011, 0.013, and 0.009 for Best, Middle, and Worst subjects, to be compared with 0.018, 0.011, and 0.004 for High, Medium, and Low knowledge in Table 3. The latter group differences as well as the correlation between MR and proportion correct are significant (r = .42, p < .05). The contradictory results with respect to resolution are hard to explain. An obvious difference is that the variation of $\bar{c}$ is larger in Lichtenstein and Fischhoff (1977). It ranges from 0.42 to 0.96 over the 93 subjects in Experiment 5 and from 0.43 to 0.92 in the easy-hard comparisons. High values of $\bar{c}$ are connected to underconfidence, which was almost absent in the author's experiment, where the range of $\bar{c}$ goes from 0.58 to 0.86 (only one subject was above 0.80). Now, if there is at all an association between proportion correct and resolution, it seems that it would show up when the variation in $\bar{c}$ is large. But it did not. The Oregon subjects increased their proportion correct by increasing a (and $\bar{x}$) in $\hat{c} = a + b\bar{x}$, whereas the Uppsala subjects increased their b's and gained more in proportion correct than they lost by the accompanying decrease of a.

SUMMARY
The model presented above contains two parts. First, the calibration curve is approximated by a linear function $c(x_t) = a + b x_t$, in which $c(x_t)$ is the relative frequency of correct responses when the probability assessment is $x_t$ ($t = 1, \ldots, T$). Second, the parameters a and b of the calibration function together with the mean $\bar{x}$ of the distribution $f(x_t)$ of the responses predict the mean proportion correct $\hat{c}$, $\hat{c} = a + b\bar{x}$. If Yates' (1982) terminology is extended to calibration functions, $c(x_t) = a + b x_t$ accounts for "calibration in the small" and $\hat{c} = a + b\bar{x}$ for "calibration in the large." The two parameters a and b are all that is needed to go into all particulars concerning calibration. For example, over/underconfidence is $\bar{x} - \hat{c} = (1 - b)\bar{x} - a$. Other examples are the parametric expressions of calibration and resolution in Eqs. (5) and (6). The primary advantage of the model is that it provides a convenient tool for treating various aspects of calibration within a single framework. The test of the model is made in two steps: (1) a and b are determined from the calibration curve and (2) a, b, and $\bar{x}$ are inserted in $\hat{c} = a + b\bar{x}$ to determine $\hat{c}$. The decisive question is how close $\hat{c}$ is to the observed proportion correct $\bar{c}$. Deviations of $\hat{c}$ from $\bar{c}$ may arise if the fit of $c(x_t) = a + b x_t$ to the calibration curve is poor and $f(x_t)$ is skewed.

The model was tested on two sets of data: (1) an individual study in which 33 subjects answered 250 general knowledge items of the two-alternative type and assessed the probability that the chosen answer was correct; (2) group data from a study by Lichtenstein and Fischhoff (1977). The items in their Experiments 3-5 were of the same type as in the individual study. In the individual study 50% of the subjects had differences $\hat{c} - \bar{c}$ which were less than 0.010; 80% had differences less than 0.020. In the group data of Lichtenstein and Fischhoff (1977) the differences $\hat{c} - \bar{c}$ were 0 or 0.01 in 16 of the 19 conditions and 0.02 in the remaining 3. Attempts were made to judge the magnitude of the differences ($\hat{c} - \bar{c}$) by comparing them with the standard errors of $\bar{c}$ and ($\bar{x} - \bar{c}$). Although these comparisons are not the perfect way of testing the null hypothesis, $\hat{c} = \bar{c}$, they seem to indicate that the agreement between $\hat{c}$ and $\bar{c}$ is satisfactory. With two exceptions in the study of individual data, the differences ($\hat{c} - \bar{c}$) are about equal to and most often less than one standard error.

In the individual study both calibration and resolution improved with knowledge. Only resolution, however, showed a significant association with knowledge. This result contradicts Lichtenstein and Fischhoff (1977), who found a significant U-shaped relation between calibration and knowledge but no systematic relationship for resolution. Finally, the individual study makes it clear that group averages do not tell the whole story. Knowledge, calibration, and resolution are attributes of individuals, with a considerable variation from subject to subject. For example, in the sample of psychology undergraduates, which is quite homogeneous compared to an unselected group, the slope of the calibration function varies from 0 to 1.
A few subjects show no resolving capacity at all; a few at the other extreme have the resolution required for perfect calibration.

REFERENCES

Brier, G. W. (1950). Verification of forecasts in terms of probability. Monthly Weather Review, 78, 1-3.
Koriat, A., Lichtenstein, S., & Fischhoff, B. (1980). Reasons for confidence. Journal of Experimental Psychology: Human Learning and Memory, 6, 107-118.
Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also know more about how much they know? Organizational Behavior and Human Performance, 20, 159-183.
Lichtenstein, S., & Fischhoff, B. (1980). Training for calibration. Organizational Behavior and Human Performance, 26, 149-171.
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). Calibration of probabilities: The state of the art to 1980. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 306-334). New York: Cambridge Univ. Press.
Murphy, A. H. (1972). Scalar and vector partitions of the probability score: Part I. The two-state situation. Journal of Applied Meteorology, 11, 273-282.
Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology, 12, 595-600.
Sharp, G. L., Cutler, B. L., & Penrod, S. D. (1988). Performance feedback improves the resolution of confidence judgments. Organizational Behavior and Human Decision Processes, 42, 271-283.
Yates, J. F. (1982). External correspondence: Decompositions of the mean probability score. Organizational Behavior and Human Performance, 30, 132-156.

RECEIVED: August 23, 1990