Acta Psychologica 64 (1987) 167-185 North-Holland
PROCEDURAL DEBIASING *

Lola L. LOPES
University of Wisconsin, Madison, USA
Accepted March 1986
This paper presents two experiments that illustrate how Bayesian inferences can be debiased by analyzing and correcting the cognitive procedures that lead to the biases. In the first experiment, a training procedure is used that corrects a common error in the adjustment process used by subjects when integrating evidence. In the second experiment, a focusing technique is used to improve the relative weighting of samples in the overall judgment. These results are discussed in terms of a model of the judgment process comprising four basic stages: (a) initial scanning of stimulus information; (b) selection of items for processing in order of importance; (c) extraction of scale values on the judgment dimension; and (d) adjustment of a composite value that summarizes already-processed components.
As more is learned about judgment processes, it becomes reasonable to expect that this knowledge can be used to help people make better judgments. This is particularly true for situations in which failures of judgment seem to be orderly manifestations of processing mechanisms and not merely unsystematic errors attributable to carelessness and inattention. Unfortunately, it has usually been easier to imagine such improvement than to effect it. Fischhoff has suggested that 'debiasing' efforts can be classified according to whether they lay the blame for the bias at 'the doorstep of the judge, the task, or some mismatch between the two' (1982: 424). All three approaches have been taken by researchers in Bayesian inference seeking to reduce the tendency of naive subjects to produce overly 'conservative' judgments (Edwards 1968).

* The writing of this paper and the research reported therein were supported by the Office of Naval Research (Contract N00014-81-K-0069). I am grateful to Rose Reed and Sandra Schneider for care in running the experiments, and to Gregg Oden, David Ronis, and Tom Wallsten for helpful comments on an earlier draft of the paper. Requests for reprints should be sent to L.L. Lopes, Brogden Psychology Building, University of Wisconsin, Madison, WI 53706, USA.
Initially, researchers tried to improve the task by removing potential biases in the response scale and by using feedback and monetary payoffs to improve subjects' understanding and motivation. In general, such methods had only small success (Phillips and Edwards 1966). More recently, Eils et al. (1977) focused on improving the match between the task and the subject. Observing that people's judgments are often more like averages (or estimates of population proportion) than like inferences (Beach et al. 1970; Marks and Clarkson 1972, 1973; Shanteau 1970, 1972), Eils et al. hypothesized that subjects might be better at judging the mean log likelihood ratio for a set of samples than at judging the cumulative log likelihood ratio. Such averages could then be converted mechanically to log odds. The procedure worked as predicted.

The present research approaches debiasing by attempting to improve the judges. Like the work of Eils et al. (1977), the approach rests on the observation that subjects seem to average in Bayesian tasks, but no attempt is made to 'engineer' the task to human proclivities. Instead, the approach involves: (a) analyzing the procedures that subjects use when they produce averages, (b) warning subjects about procedures that are inappropriate, and (c) providing subjects with alternative appropriate procedures.
Averaging and adjustment in Bayesian inference

Consider a task in which subjects judge the relative likelihood of two well-specified hypotheses (H1 and H2) involving populations of binary events. The subject is shown a sample, D1, and is asked to rate the probability of H1 given D1. The normative response for such situations is given by Bayes's theorem, in which p(H1) and p(H2) are the prior probabilities of the two hypotheses:

p(H1|D1) = p(D1|H1)p(H1) / [p(D1|H1)p(H1) + p(D1|H2)p(H2)].   (1)

Information from a second sample, D2, is integrated in the same way: p(H1|D1) and p(H2|D1) simply replace p(H1) and p(H2), respectively.
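To make the sequential use of eq. (1) concrete, the following sketch (not part of the original paper; the function name and the numerical likelihoods are purely illustrative) updates the probability of H1 one sample at a time, feeding each posterior back in as the next prior.

```python
def update(p_h1, lik_h1, lik_h2):
    """One application of eq. (1): posterior probability of H1 after one sample.

    p_h1   -- current (prior) probability of H1
    lik_h1 -- likelihood of the observed sample under H1, p(D|H1)
    lik_h2 -- likelihood of the observed sample under H2, p(D|H2)
    """
    num = lik_h1 * p_h1
    return num / (num + lik_h2 * (1.0 - p_h1))

# Two samples are integrated by reusing the first posterior as the new prior.
p = 0.5                        # no prior opinion
p = update(p, 0.30, 0.10)      # first sample, D1 (made-up likelihoods)
p = update(p, 0.20, 0.25)      # second sample, D2
print(round(p, 3))
```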
Subjects, however, do not appear to follow Bayes's theorem. Shanteau (1970) proposed that their judgments could be modeled by an algebraic rule in which the response, R, at any serial position, n, is a weighted average of the scale values, s_i, of the previous and current sample events:

R_n = w_0 s_0 + w_1 s_1 + ... + w_n s_n.
In this equation the w_i are weights that sum to unity and the term w_0 s_0 signifies the weight and scale value of a neutral initial impression. Note that the process is necessarily conservative because averages always lie within the range of their component values, whereas inferences can be more extreme than any of their component values. Shanteau's model accounts nicely for many of the quantitative features of inference data, but it does not suggest either why or how averaging occurs.

Previous research (Lopes 1982, 1985) has suggested that averaging may occur because subjects integrate 'new' information with 'old' composite judgments by adjusting the old value so as to make the new composite lie somewhere between the old composite and the value of the new information. Such a process is qualitatively equivalent to averaging since it produces judgments that lie within the range of their component values, but it does not presuppose that subjects ever 'compute' an average in any algebraic (or even any conscious) sense of the term.

The adjustment model sketched above is a process model, that is, a model of what people do when they make judgments. Averaging rules, on the other hand, are primarily models of the data people produce. Wallsten (1976; Wallsten and Barton 1982), working independently, has proposed an alternative process model that corresponds mathematically to an adding rule. It is not the purpose of the present paper to distinguish between these models. However, the question of whether the process is algebraically averaging or adding bears fundamentally on the relationship between procedural models and algebraic models of judgment. These issues will be taken up in the General Discussion.

Directional errors

One prediction of the present model is that subjects may sometimes make adjustments that are strictly in the wrong direction. Consider two
samples supporting the same hypothesis but to different degrees. If a subject who has no prior opinion is first shown the weaker sample, we suppose that some weak preliminary judgment will be made in favor of the supported hypothesis. When the subject is later shown the stronger sample, adjustment will be made in the direction of increased support for that hypothesis. This is appropriate qualitatively. But if the samples are reversed, qualitatively inappropriate adjustment ought to result. That is, the preliminary judgment ought to produce a relatively strong result. When the weaker sample is later integrated into the judgment, adjustment should be toward 'neutral' since the value of the weaker sample is 'more neutral' than the preliminary judgment. Such adjustment is strictly inappropriate since movement toward neutral is de facto movement toward the alternative or non-supported hypothesis.

Previous research has suggested that subjects produce directional errors in judgment tasks like that used here (Lopes 1985). The aim of the present research is to find out whether judgmental accuracy can be improved by training that focuses on correcting directional errors. Two experiments are presented. The first focuses on eliminating adjustment errors in the specific cases in which they are expected to occur. The second extends the training to include instruction in a selectional procedure that is hypothesized to improve accuracy.

Experiment 1

Method

Experimental task
Subjects were asked to imagine that their job is to make judgments concerning the maintenance of milling machines. The judgment concerns whether or not a critical spring has broken inside the machine. Subjects were told that normal machines produce about 12 rejected parts per 1000 parts produced (H12/1000), whereas machines with broken springs produce about 20 rejected parts per 1000 (H20/1000). Thus, the subjects were required to decide between alternative Bernoulli processes, one with p = 0.012 and the other with p = 0.020, with p being the probability of a rejected part.

Design
The stimulus design was a 9 × 9, first-sample × second-sample, factorial design for which the levels of both factors comprised the same samples of parts. These were 12, 13, 14, 15, 16, 17, 18, 19 and 20 rejects per 1000 parts, respectively. Subjects comprised two groups: trained subjects who were warned about directional errors and control subjects who received no training.
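For readers who want to see what the normative judgments look like for this design, here is a minimal sketch (not part of the original procedure; it assumes equal priors and binomial sampling of 1000 parts, which the paper does not state explicitly) that applies eq. (1) to any first-sample/second-sample pair.

```python
from math import comb

N = 1000                          # parts per sample
P_NORMAL, P_FAULTY = 0.012, 0.020

def lik(rejects, p):
    """Binomial likelihood of observing `rejects` rejected parts in N parts."""
    return comb(N, rejects) * p**rejects * (1 - p)**(N - rejects)

def p_faulty(samples, prior=0.5):
    """Posterior probability that the machine is faulty after a list of samples."""
    post = prior
    for r in samples:
        num = lik(r, P_FAULTY) * post
        post = num / (num + lik(r, P_NORMAL) * (1 - post))
    return post

# Theoretical responses for the 9 x 9 design (rows: first sample, cols: second sample).
levels = range(12, 21)
for r1 in levels:
    print(r1, [round(p_faulty([r1, r2]), 3) for r2 in levels])
```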
Procedure
Subjects were run individually using a computer-controlled video terminal in sessions that took about 40 minutes for control subjects and 50 minutes for trained subjects. All subjects were read general instructions about the task and were shown how to read the stimulus display. This contained the current sample (e.g., '17 rejects'), a notation as to whether the sample was the first or the second, and an unmarked response scale labeled at the left by the words 'machine normal' and at the right by the words 'machine faulty'. ¹ Initially, a rating arrow was positioned at the middle of the response scale.

¹ Unmarked response scales do not allow for perfect response accuracy. However, they reduce response biases such as 'whole number' biases, which are potentially very serious.

The procedure for all trials was identical. Subjects read the first sample and made an initial rating using a joystick to move the rating arrow along the response scale. Then they pushed a button on the response box. The initial rating was then saved by the computer and the first sample was replaced by a second sample from the same machine. Subjects revised their initial rating and saved it by pushing the response button. Then they initiated the next trial by returning the response arrow to the middle of the scale and pushing the button again.

The instructions for trained subjects were essentially identical to those for control subjects up through the explanation of the stimulus display and the rating response, except that trained subjects were told at the outset that they would be taught a procedure for avoiding a common judgment error. The actual training took place during the early practice trials. The first practice trial was a weak-strong pair (17, 19) that was chosen to elicit correct responses from all subjects. For this trial, all subjects produced directionally correct adjustments. The second trial was a strong-weak pair (13, 14) chosen to elicit the directional error. On this trial subjects were shown the sample of 13 rejects and were allowed to make and save their initial rating. Then they were shown the sample of 14 rejects and were allowed to make their adjustment, but they were stopped before they saved the response. Most of the subjects (20 of 31) made their adjustment in the wrong direction and were read the instructions reproduced below. The others were read similar instructions, but with wording changed to accommodate the fact that they had, in fact, responded correctly on this trial.

'Before you transmit your response, let me talk with you about your response. You shouldn't feel bad, but remember I told you that many people make an error in this task. Well, you just made it. Let me explain it to you. Most people, if they are given a sample of 14 rejected parts as a first sample, say that the machine is more likely to be functioning normally than not. But when they are given a sample of 14 rejected parts after they have just been given a sample of 13 rejected parts, they tend to adjust their judgment toward the right, that is, toward the machine being broken. Now if you think about it, this is an error of adjustment since a sample of 14 rejects favors the normal machine and therefore provides additional evidence that the machine is normal. Thus, the adjustment should be toward the left, that is, toward the machine functioning normally.'

After subjects indicated that they understood the error, the experimenter taught them a procedure for avoiding it. Basically, this was to separate each judgment operation into two steps: (a) the labeling of each sample as either favoring the 'normal' hypothesis or the 'faulty' hypothesis and (b) the adjustment of the response (i.e., movement of the rating arrow) in the direction given by the label.

After teaching subjects the judgment procedure, the experimenter had them respond to several trials on their own, while verbalizing what they were doing. This allowed the experimenter to check that they were explicitly separating the labeling and the adjustment steps and that they were adjusting at each step in the direction given by the labeling operation. Among these training trials were two for which the samples were identical. When the first such trial (17, 17) appeared, the experimenter waited to see whether the subject would adjust for the second sample without prompting and then stopped the trial for further instruction. Subjects who had failed to adjust (8 of 31) were told, 'Now this kind of trial also causes errors. Let me explain. Your first sample was 17 rejects and you judged the machine as likely to be broken. Then you got new evidence of 17 rejects also favoring the machine being broken, but you didn't adjust. Actually you should have adjusted since that is additional evidence in favor of the machine being broken'. Subjects who adjusted correctly were read similar instructions, but modified to accord with their correct response.

Altogether there were 13 trials for practice and training for the trained subjects. Control subjects received the same 13 trials for practice, but with no training. Then both groups of subjects received two replications of the stimulus design, bringing the total number of trials to 175 per subject. Experimental trials within each replication were ordered randomly but with the restriction that no sample appear either as first-sample or second-sample on two consecutive trials.

Subjects
The subjects for control and trained groups were, respectively, 30 and 31 students
from University of Wisconsin-Madison. Approximately half were males and half females. They served for credit to be applied to their grades in introductory psychology.

Results and discussion
Two questions are of interest in this experiment. The first is whether training concerning directional adjustment errors can reduce their prevalence in the inference task. The second is whether, given that such reduction in errors is possible, this leads to improvement in the accuracy of the final judgment. Data bearing on the first question are given in table 1. Five subjects have been dropped from each condition since they appeared to base their final judgment entirely on the second sample. The basic results are the same, however, whether these subjects are retained or not.

As expected from pilot data, average responses to first samples ranged essentially evenly between the value for 12 rejects (judged probability of faulty machine = 0.062) and the value for 20 rejects (judged probability of faulty machine = 0.913). Agreement among subjects was very high for all stimuli except the sample of 16 rejects (which actually favors a faulty machine slightly). Some subjects treated such samples as neutral, others treated them as favoring one or the other hypothesis, and still others treated them inconsistently across trials. Because of this variability, sample pairs involving 16 rejects are not considered further.
Table 1
Mean number of adjustment errors per subject for experiment 1.

Pairs          Control   Trained   Maximum possible
Weak-strong      0.40      0.42          24
Strong-weak     13.40      3.11          24
Diagonal         3.24      1.08          16

Note: Errors were scored for weak-strong cells and strong-weak cells only if there was actual adjustment and if it was in the wrong direction. Errors were scored for diagonal cells only if there was room for adjustment and if there was adjustment in the wrong direction or no adjustment.
Taken together, there were 20 pairs in which adjustment errors might have been
expected. These comprised the eight pairs along the diagonal of the stimulus design in which the two samples are identical (i.e., 12/12, 13/13, 14/14, 15/15, 17/17, 18/18, 19/19, 20/20) and the twelve nondiagonal pairs in which a (stronger) sample favoring a particular hypothesis is followed by a weaker sample favoring the same hypothesis (i.e., 12/13, 12/14, 12/15, 13/14, 13/15, 14/15, 20/19, 20/18, 20/17, 19/18, 19/17, 18/17). These pairs are listed in the table as 'diagonal' pairs and 'strong-weak' pairs, respectively. The table also gives results for the set of 'weak-strong' pairs. These are the same set as the strong-weak pairs except that the stronger sample in each pair is preceded by the weaker. Since for these pairs the intuitive direction of adjustment is normatively correct, they provide an estimate of the rate of adjustment errors that occur for reasons other than the incompatibility of the normative response with the intuitive direction of adjustment (i.e., inattention, misreading the stimulus, etc.).

Looking first at strong-weak pairs, it is clear that training is effective in reducing directional errors. For the control group there is an average of 13.40 errors per subject (out of 24 maximum) for strong-weak pairs compared to an average of only 0.40 errors per subject for weak-strong pairs, F(1,24) = 60.43, p < 0.01. For trained subjects, however, the average is 3.11 errors for strong-weak pairs compared to 0.42 errors for weak-strong pairs, F(1,25) = 8.33, p < 0.01. Comparing across groups, trained subjects have significantly fewer errors than control subjects for strong-weak pairs, F(1,49) = 113.46, p < 0.01, but not for weak-strong pairs, F < 1.

The final row of the table gives the results for diagonal pairs. Adjustment errors have been scored for these pairs only if there was room for the adjustment to occur (i.e., the response was not already at the end of the scale) and if there was either no adjustment at all or adjustment in the wrong direction. (Errors of the latter type were very rare.) Mean errors (out of a maximum of 16) were 3.24 for the control group and 1.08 for the trained group. Both these error rates are significantly different from zero, F(1,24) = 26.18 and F(1,25) = 8.99, respectively, p < 0.01, and the rate for the control group is significantly higher than that for the trained group, F(1,49) = 10.29, p < 0.01.

Thus, it appears that training reduced (but did not completely eliminate) directional errors, particularly for strong-weak pairs. The question remains, however, as to whether trained subjects became more accurate. One way to answer this is to compare
subjects' judgments to theoretically optimal judgments. Analysis of the 20 pairs that should have been affected by training shows that training did, indeed, improve accuracy on these pairs. Figured on group means, the root-mean-squared deviations (RMSD's) between obtained and theoretical values were, for the control group, 0.0723 for strong-weak pairs and 0.0227 for diagonal pairs. For the trained subjects, however, these values were 0.0187 and 0.0047, respectively. These data are not surprising, but an unexpected finding is that improved performance on the critical strong-weak pairs generalized to the corresponding weak-strong pairs. Although the training procedure did not in any way attempt to modify subjects' procedures for judging weak-strong pairs, trained subjects did about as well on these (RMSD = 0.0220) as they did on the strong-weak pairs. In the same way, the control subjects did about as poorly on weak-strong pairs (RMSD = 0.0626) as they did on strong-weak pairs.

It is misleading, however, to focus entirely on quantitative results because no attempt was made to train subjects to be accurate in judging individual samples. A better approach is to determine whether training makes subjects qualitatively more Bayesian. By this standard, the answer is clearly no. Theoretically, a Bayesian judge would produce data like those in fig. 1. Graphically, the data should have a 'barrel' shape, and in terms of analysis of variance there should be a sizeable interaction.
Fig. 1. Data predicted by Bayes’s theorem. Values are plotted as a function of the number of rejects in the first sample with number of rejects in the second sample as the row parameter.
The data for both control and trained subjects were, however, essentially parallel. Although both data sets had significant interactions (F(64,1536) = 2.66 for control subjects and 4.98 for trained subjects, both p's < 0.01), these were disorderly and small, accounting for only 0.7% and 1.2%, respectively, of the total systematic sums of squares. By way of contrast, the interaction accounts for 4.66% of the systematic sums of squares of the theoretical values.

The generalization of training from strong-weak to weak-strong pairs suggests that training may be effective not only in helping subjects avoid specific adjustment errors, but also in helping them better understand the task. But if trained subjects understand the task better than untrained subjects, why do their data show the same tendency toward parallelism? The answer may lie in the weights that subjects give to the various samples. Consider a situation like the current one except that subjects are instructed to estimate the proportion of rejected parts for a particular machine. If the two samples are of equal size and equal reliability, the subject ought to give them equal weight and simply average the values. A subject who follows such a 'constant weighting' strategy will produce a parallel pattern of data. The Bayesian task, in contrast, requires a 'differential weighting' strategy: samples that are extreme (e.g., 12 or 20) are more diagnostic than samples that are nearer neutral (e.g., 15 or 17), and should be weighted more heavily in the inference process.

Thus, it is possible that subjects in the trained condition do understand the Bayesian task better than their control condition counterparts, at least in the sense that they are really integrating evidence and not merely integrating samples, but they may not understand that more extreme samples are more diagnostic and should be accorded greater weight. Experiment 2 investigates whether this hypothesized problem with intuitive weighting of information can be alleviated by a modification of the training procedure used in experiment 1.
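To illustrate why constant weighting yields a parallel pattern while the normative rule yields the 'barrel' shape, here is a small sketch (illustrative only; treating the equal-weight judgment as the mean of the two single-sample posteriors is one way to formalize constant weighting, not something the paper specifies).

```python
from math import comb

N, P_NORMAL, P_FAULTY = 1000, 0.012, 0.020
levels = range(12, 21)

def lik(r, p):
    return comb(N, r) * p**r * (1 - p)**(N - r)

def bayes_single(r, prior=0.5):
    num = lik(r, P_FAULTY) * prior
    return num / (num + lik(r, P_NORMAL) * (1 - prior))

def bayes_pair(r1, r2):
    return bayes_single(r2, prior=bayes_single(r1))

def equal_weight(r1, r2):
    # Constant weighting: simply average the two single-sample judgments.
    return 0.5 * (bayes_single(r1) + bayes_single(r2))

# Gap between the r2 = 20 and r2 = 12 curves at each level of the first sample:
# constant for equal weighting (parallel curves) but varying for Bayes (interaction).
for r1 in levels:
    print(r1,
          round(equal_weight(r1, 20) - equal_weight(r1, 12), 3),
          round(bayes_pair(r1, 20) - bayes_pair(r1, 12), 3))
```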
Experiment 2

Method

Task and design
The task and design were the same as in experiment 1 except that the samples were presented simultaneously. 'First' samples appeared directly above 'second' samples in the stimulus display.

Procedure
The procedure for the control subjects was essentially the same as for experiment 1 except for the differences occasioned by the simultaneous stimulus display. However, the instructions for the trained subjects were more detailed and applied to every kind of stimulus pair.

Trained subjects were told that they would be taught a simple procedure that would allow them to make good judgments in a kind of inference task (i.e., the machine
maintenance problem). After they had been instructed on how to read the stimulus display and use the response device, they were taught a four-step procedure to be applied to every stimulus pair. The steps were explained to subjects as they worked through a series of practice trials. During training subjects were asked to verbalize what they were doing so that the experimenter could check on their understanding and use of the procedure. The steps were as follows:
(a) Judge for each sample separately whether it supports the 'normal' hypothesis or the 'faulty' hypothesis or whether it is neutral.
(b) Decide which sample supports its own hypothesis more strongly.
(c) Make an initial rating based only on the stronger of the two samples. If they are equally strong, use either.
(d) Adjust the initial rating in order to take into account the second sample. (i) If the second sample favors the same hypothesis as the first, then 'consider the portion of the response scale between [the] original rating and the [appropriate] end of the scale and move the arrow into this region according to how strong the remaining evidence is'.
Fig. 2. Data for control subjects in experiment 2. Values are plotted as a function of the number of rejects in the first (upper) sample with number of rejects in the second (lower) sample as the row parameter.
(ii) If the second sample favors the opposite hypothesis, then 'consider the portion of the rating scale between [the] original rating and the neutral position and adjust back into this region according to how strongly the sample [supports the other hypothesis]'.
Note that although the procedure sounds complicated, it was simple to follow in the context of the physical display and actual stimulus pairs (a schematic version is sketched at the end of this Method section).

Altogether there were 20 trials for practice and training for the trained subjects. Control subjects received the same 20 practice trials, but with no training. Then both groups of subjects received two replications of the stimulus design, bringing the total number of trials to 182 per subject. Experimental trials within replication were ordered randomly but with the restriction that no given sample appear on two consecutive trials. The full experiment required about 40 minutes for control subjects and about 55 minutes for trained subjects.
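The following sketch is an illustrative formalization of the four-step procedure, not code used in the experiment; the strength function, the rating mapping, and the adjustment amounts are made-up stand-ins for the subject's intuitive judgments.

```python
NEUTRAL = 0.5  # midpoint of the response scale (0 = machine normal, 1 = machine faulty)

def label(rejects):
    """Step (a): does the sample favor 'normal', 'faulty', or neither?"""
    return 'normal' if rejects < 16 else 'faulty' if rejects > 16 else 'neutral'

def strength(rejects):
    """Step (b): intuitive strength of a sample, here just its distance from 16."""
    return abs(rejects - 16)

def four_step_judgment(sample1, sample2):
    # Step (b): take the stronger sample first.
    anchor, other = sorted([sample1, sample2], key=strength, reverse=True)

    # Step (c): initial rating based only on the stronger sample (made-up mapping).
    sign = 1 if label(anchor) == 'faulty' else -1 if label(anchor) == 'normal' else 0
    rating = NEUTRAL + sign * 0.1 * strength(anchor)

    # Step (d): adjust toward the appropriate end of the scale or back toward neutral.
    if label(other) == label(anchor) and label(other) != 'neutral':
        endpoint = 1.0 if label(other) == 'faulty' else 0.0
        rating += (endpoint - rating) * 0.25 * strength(other) / 4   # step (d)(i)
    elif label(other) != 'neutral':
        rating += (NEUTRAL - rating) * 0.25 * strength(other) / 4    # step (d)(ii)
    return min(max(rating, 0.0), 1.0)

print(four_step_judgment(13, 14))   # both favor 'normal': adjusts further toward normal
print(four_step_judgment(17, 19))   # anchors on 19, adjusts further toward faulty
```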
Subjects
The subjects were 56 student volunteers from the University of Wisconsin-Madison, split evenly between the control and the trained conditions. About half were males
Fig. 3. Data for trained subjects in experiment 2. Values are plotted as a function of the number of rejects in the first (upper) sample with number of rejects in the second (lower) sample as the row parameter.
and half females. Most subjects served for pay, although a few served for credit to be applied to their course grade in introductory psychology.

Results and discussion
The data for the control subjects are given in fig. 2 pooled over subjects and replications. Samples designated 'first' appeared above the other sample in the display. Note that the pattern of judgments is essentially parallel. This is confirmed by an analysis of variance showing that the interaction, although significant, F(64,1728) = 1.81, p < 0.01, accounts for only 0.42% of the systematic sums of squares. Calculation of the RMSD's between theoretical and obtained values reveals an overall RMSD of 0.1043 for the entire data array, which breaks down to RMSD's of 0.0968 for homogeneous cells (in which the samples favor the same hypothesis), 0.1116 for heterogeneous cells (in which the samples favor different hypotheses), and 0.0322 for diagonal cells.

The data for the trained subjects are in fig. 3. Clearly, the training procedure has been effective in bringing subjects' responses closer to optimal. In terms of analysis of variance, the interaction, F(64,1728) = 33.89, p < 0.01, now accounts for 4.55% of the systematic sums of squares compared to the optimal value of 4.66%. The overall RMSD between theoretical and obtained is 0.0480, which breaks down to RMSD's of 0.0321 for homogeneous cells, 0.0567 for heterogeneous cells, and 0.0077 for diagonal cells.
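For readers who want to compute comparable summary statistics on their own data, here is a minimal sketch based on a table of cell means with equal cell counts; the paper's analyses were run on the full subject-level data, so exact values will differ, and the function and argument names are illustrative.

```python
import numpy as np

def summary_stats(obtained, theoretical):
    """RMSD against a theoretical table, plus the interaction's share of systematic SS."""
    obtained = np.asarray(obtained, dtype=float)
    rmsd = np.sqrt(np.mean((obtained - np.asarray(theoretical)) ** 2))

    grand = obtained.mean()
    row = obtained.mean(axis=1, keepdims=True) - grand     # first-sample main effect
    col = obtained.mean(axis=0, keepdims=True) - grand     # second-sample main effect
    interaction = obtained - grand - row - col              # residual cell effects

    ss_rows = obtained.shape[1] * np.sum(row ** 2)
    ss_cols = obtained.shape[0] * np.sum(col ** 2)
    ss_inter = np.sum(interaction ** 2)
    return rmsd, ss_inter / (ss_rows + ss_cols + ss_inter)

# Hypothetical usage: compare a 9 x 9 table of mean judgments with the Bayesian table.
# rmsd, interaction_share = summary_stats(mean_judgments, bayes_table)
```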
General discussion

Before discussing the implications of the present research, it is important to point out exactly what the training procedures did and did not 'teach' the subjects. In experiment 1, the training procedure taught the subjects only that adjustments of the initial rating made after presentation of the second sample should always be in the direction of the hypothesis favored by the second sample. In experiment 2, the training included the same information about adjustment direction but also taught subjects to process the two samples in order of their apparent relative strength. At no time in either training procedure did the experimenter teach the subjects anything about which samples favored which hypothesis or how strongly they did so, nor did she suggest how diagnostic or 'weighty' the samples should be considered to be.

In light of the limited training to which subjects were exposed, the amount of debiasing that occurred is impressive. In experiment 1, explicit training was directed at only the 12 strong-weak pairs and the
8 diagonal pairs. It would have been entirely within reason for subjects' responses to the other 61 pairs to be unaffected by the training since, so far as was indicated, there was nothing wrong with their intuitions concerning such pairs. As it turned out, however, improvement generalized from the strong-weak pairs to analogous weak-strong pairs. There is no way to know for sure why this improvement occurred, but an appealing possibility is that the training focused subjects' attention on the inferential nature of the task and prevented the apparently common tendency among subjects to fall into judging the sample proportion rather than the relative likelihood of the two hypotheses. Thus, trained subjects may have benefitted not only from prior instruction concerning how to prevent a particular error, but also by being forced, so to speak, to better understand what it was they were judging.

In experiment 2 the extent of debiasing was even more remarkable. In evaluating this result, it is important to understand that there was nothing in the instructions that would have prevented subjects from continuing to weight information equally regardless of diagnosticity. That is, the subjects were only instructed to begin their judgment using the 'stronger' sample; they were not instructed to give it more weight in the final judgment. Nevertheless, the net effect of the manipulation was for judgments to closely approximate optimal values. Whether this occurred because subjects intended to give the more important stimulus more weight is, of course, not clear. One might argue that the improved weighting pattern occurred due to unintentional primacy effects that were outside of the subjects' comprehension of the task. But it is worth noting that in natural judgment situations, people often 'put first things first', considering those items that are deemed to be important before they consider other, less important items. Thus, one pre-existing strategy for differential weighting may be to attend to items in order of importance and to make smaller and smaller adjustments for items that are of lesser and lesser importance.

What do judges do?
For more than 20 years, evidence has been accumulating that human judgments often follow algebraic rules (cf. Anderson 1981). But algebraic models have had limited appeal for some judgment researchers because they have only ‘as if’ status: although the data look as if they have been produced by application of an algebraic rule, there is no
principled reason to suppose that the psychological processes of the judge in any way resemble 'paper and pencil' algebraic manipulation.

The 'debiasing' research reported here was based on a procedural theory of how people generate data that have algebraic patterns (Lopes 1982). The approach rests on the assumption that if a person's judgments in inference tasks look more like averages than like inferences, then at some point during the judgment process the person must be performing one or more operations that are nearer to those required for averaging than they are to those required for inferencing. Thus, debiasing the process entails discovering the inappropriate operations and replacing them by operations that are better suited to inference.

Fig. 4 outlines the major steps that are hypothesized to occur during judgment. In the scanning step, the judge assesses what information has been presented. For tasks in which stimulus information is presented sequentially, scanning will be rudimentary since there is only one stimulus to scan. For simultaneous tasks, however, scanning will be more clearly distinguishable from other judgment operations.

Next, an item is selected for use as an 'anchor point' (cf. Einhorn and Hogarth 1982; Tversky and Kahneman 1974). If only one item has been presented, that item must be the anchor. But if more than one item is available, an item is chosen that seems relatively more important than the others. Such importance may reflect the a priori importance of the category to which the item belongs or it may reflect diagnosticity within category as, for example, when anchors are selected by virtue of their extremity. Only if the various items seem equally important will the subject resort to ad hoc choice schemes such as taking items in order as they appear in the stimulus array.

Fig. 4. Flow diagram of hypothesized model of judgment process.

Once an anchor has been chosen, it is evaluated relative to the scale of judgment. This 'valuation' operation may in some cases yield a quantity that serves directly as the initial judgment. For example, previous research on the present task suggests that subjects may anchor their judgment at a scale position that is proportional to the number of rejects in the first sample (Lopes 1985). In other cases, however, the initial judgment may be somewhat less extreme than the scale value associated with the anchor stimulus. In these cases subjects act as though their initial judgment is a compromise between the value of the stimulus information and some internal, neutral 'initial impression' (Anderson 1967).

After anchoring, the subject chooses another item and then 'adjusts' the initial value. This is the step that is crucial in determining the algebraic form of the overall judgment. In the case of 'averaging' rules, the adjustment operation is assumed to involve two stages: (1) location of the new information on the scale of judgment relative to the initial judgment, and (2) adjustment of the initial judgment toward the new information. This produces a new judgment that lies between the first two values and is, in that sense, an average of the two. Other algebraic rules (e.g., multiplying rules, ratio rules) can also result from the adjustment stage, but the particulars of their adjustment processes differ in important ways (Lopes 1982).

After each adjustment step, the judge considers whether there are still important items left unaccounted for. If there are, the process is repeated with each new item of information leading to adjustment of the previous judgment, and with the adjustments ordinarily becoming smaller as the perceived importance of the new information becomes smaller relative to previously considered information. At some point, either when the stimulus information runs out or when the subject judges that nothing important remains to be considered, the final response is output on whatever response scale has been provided.
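As a compact restatement of fig. 4, the sketch below is purely illustrative: the importance and value functions, the shrinking step size, and the example mappings are placeholders for whatever the judge actually does, and are not taken from the paper.

```python
def judge(items, importance, value, step=0.5):
    """Anchor-and-adjust sketch of the judgment process outlined in fig. 4.

    items      -- the stimulus information that is scanned
    importance -- function giving an item's perceived importance
    value      -- function giving an item's scale value on the judgment dimension
    """
    # Scanning and selection: take items in order of perceived importance.
    remaining = sorted(items, key=importance, reverse=True)

    # Valuation: the most important item anchors the initial judgment.
    anchor = remaining.pop(0)
    judgment = value(anchor)

    # Adjustment: move the composite toward each new item's value (an averaging-type
    # adjustment); the proportional step shrinks, so later items get less weight.
    for n, item in enumerate(remaining, start=2):
        judgment += (value(item) - judgment) * (step / n)
    return judgment

# Hypothetical usage with milling-machine samples: importance = extremity,
# value = a crude linear mapping of rejects onto the 0-1 response scale.
print(judge([14, 13], importance=lambda r: abs(r - 16),
            value=lambda r: (r - 12) / 8))
```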
Relations between procedural and algebraic models

Typically, algebraic models focus on the pattern among the various responses produced by a judgment process whereas procedural models focus on the mental operations that produce a single response. In the present case, the connection between averaging as an algebraic phenomenon and averaging as a procedural phenomenon resides in the directional character of the adjustment process. Procedures that produce responses lying between old responses and new information are, qualitatively, averaging procedures. But there may be no simple mapping between other features of the adjustment procedure and any particular averaging rule.

For example, it was convenient and empirically reasonable in the present case to assume that subjects would place little or no weight on 'prior' opinions. However, setting the weight of the initial opinion to zero in the algebraic formulation has far-reaching implications. One that has particular interest is what has been called 'the set-size effect'. This refers to increases in the extremity of a judgment following presentation of two or more stimulus items having the same scale value. Set-size effects follow algebraically from adding, but they follow from averaging only if there is an initial impression with non-zero weight. Thus, algebraically speaking, one might predict that control subjects would also make errors for diagonal pairs (failures to display set-size effects). But errors for diagonal pairs were less common than errors for strong-weak pairs. Thus, the data are ambiguous vis-a-vis the algebraic distinction between adding and averaging.

Procedurally speaking, however, there is no ambiguity. Set-size effects may occur procedurally in cases in which the subject really has an initial impression that provides the 'anchor' for the judgment. However, set-size effects may also occur even in the absence of an initial impression if the subject believes that every new sample calls for adjustment. In this
case, it would seem more natural to adjust 'outward' (i.e., toward greater extremity) than to adjust back toward neutral.

What is most important is that models of judgment processes, whether procedural or algebraic, are precisely that, models. If they are good models they should capture important attributes of the underlying process, but models from different conceptual levels may not agree in all their details. Algebraic models are ordinarily mathematically compact, but such compactness implies a degree of judgmental homogeneity that may not exist in nature. Procedural models, on the other hand, may be tuned to fit most quirks of the judgment process, but they do not reveal very well the quantitative orderliness in data. Obviously, both are useful and both are needed for proper description of the judgment process.

The present adjustment model is a model of averaging. Wallsten and Barton (1982) have suggested a very similar procedural model, but one that is basically an adding model (although with some allowance for configurality). In the case of these two models, the algebraic differences are crucial procedurally because they reside in the qualitative (i.e., directional) features of the adjustment process. The present data on adjustment errors and those of Lopes (1985) tend to support the averaging procedure, although Wallsten and Sapp (1977) have reported some contrary data involving choices. Also of interest are data displaying forms of nonadditivity that are consistent with differentially weighted averaging, but not with adding (Lopes 1982; Lopes and Oden 1980).

Changing what judges do
In order to debias human judgments, it is necessary to change the judgment process. But how? As Fischhoff (1982) pointed out, that depends on one’s theory of why the judgments are biased. The earliest attempts at debiasing Bayesian inference assumed implicitly that the bias could be ‘fixed’ without the necessity of knowing how the subject actually generated the biased judgment. The present approach differs from these early methods in that it rests on an analysis of what subjects do when they erroneously produce averages rather than inferences. The tacit assumption is that although subjects have access to some components of the judgment process (and hence that they can control the sequence in which these components are executed and the content on
which they operate), they do not have access to the algebraic implications of what these procedures do. Thus, there is little point in enjoining subjects to be less conservative, or to report their 'true' probabilities, or to produce ratios rather than averages.

Serious engineering in any domain rests on knowledge of the medium to be engineered. In the case of judgmental engineering, one must understand the judge as well as the judgment. People do some things well, other things badly; they have access to some processes and not to others. Building better judges requires that we know what people do and how they do it. After that, debiasing is, if not easy, at least possible.

References

Anderson, N.H., 1967. Averaging model analysis of set-size effect in impression formation. Journal of Experimental Psychology 75, 158-165.
Anderson, N.H., 1981. Foundations of information integration theory. New York: Academic Press.
Beach, L.R., J.A. Wise and S. Barclay, 1970. Sample proportions and subjective probability revisions. Organizational Behavior and Human Performance 5, 183-190.
Edwards, W., 1968. 'Conservatism in human information processing'. In: B. Kleinmuntz (ed.), Formal representations of human judgment. New York: Wiley.
Eils, L.C., D.A. Seaver and W. Edwards, 1977. Developing the technology of probabilistic inference: aggregating by averaging reduces conservatism. Research Report 77-3, Social Science Research Institute, University of Southern California.
Einhorn, H.J. and R.M. Hogarth, 1982. A theory of diagnostic inference: I. Imagination and the psychophysics of evidence. Technical Report, University of Chicago Graduate School of Business.
Fischhoff, B., 1982. 'Debiasing'. In: D. Kahneman, P. Slovic and A. Tversky (eds.), Judgment under uncertainty: heuristics and biases. Cambridge: Cambridge University Press.
Lopes, L.L., 1982. Toward a procedural theory of judgment. Technical Report, Wisconsin Human Information Processing Program (WHIPP 17), Madison, WI.
Lopes, L.L., 1985. Averaging rules and adjustment processes in Bayesian inference. Bulletin of the Psychonomic Society 23, 509-512.
Lopes, L.L. and G.C. Oden, 1980. Comparison of two models of similarity judgment. Acta Psychologica 46, 205-234.
Marks, D.F. and J.K. Clarkson, 1972. An explanation of conservatism in the bookbag-and-pokerchips situation. Acta Psychologica 36, 145-160.
Marks, D.F. and J.K. Clarkson, 1973. Conservatism as non-Bayesian performance: a reply to De Swart. Acta Psychologica 37, 55-63.
Phillips, L.D. and W. Edwards, 1966. Conservatism in a simple probability inference task. Journal of Experimental Psychology 72, 346-357.
Shanteau, J.C., 1970. An additive model for sequential decision making. Journal of Experimental Psychology 85, 181-191.
Shanteau, J.C., 1972. Descriptive versus normative models of sequential inference judgment. Journal of Experimental Psychology 93, 63-68.
Tversky, A. and D. Kahneman, 1974. Judgment under uncertainty: heuristics and biases. Science 185, 1124-1131.
Wallsten, T.S., 1976. Using conjoint-measurement models to investigate a theory about probabilistic information processing. Journal of Mathematical Psychology 14, 144-185.
Wallsten, T.S. and C. Barton, 1982. Processing probabilistic multidimensional information for decisions. Journal of Experimental Psychology: Learning, Memory, and Cognition 8, 361-384.
Wallsten, T.S. and M.M. Sapp, 1977. Strong ordinal properties of an additive model for the sequential processing of probabilistic information. Acta Psychologica 41, 225-253.