ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES 43, 385-405 (1989)

The Quest for Limits on Noncomplementarity in Opinion Revision

LORI ROBINSON VAN WALLENDAEL

Northwestern University
Two experiments are reported exploring opinion revision with multiple competing hypotheses. Subjects read brief murder mysteries, each involving a basic scenario followed by a series of clues, and rated probabilities of guilt for each suspect after each new piece of information was received. As in earlier studies (L. B. Robinson and R. Hastie, 1985, Journal of Experimental Psychology: Human Perception and Performance, 11, 443-456), strength of belief in a given hypothesis was found to be independent of belief in competing hypotheses. Subjects showed strong tendencies to respond to any given clue by increasing or decreasing the probability of a specific target hypothesis, but to make no complementary revisions to the probabilities of other hypotheses under consideration. The popularity of such noncomplementary revision is shown to decrease as greater amounts of information are presented. An anchor-and-adjustment model of opinion revision is presented to account for these data in qualitative and quantitative terms. © 1989 Academic Press, Inc.

A doctor attempts to discover which disease or disorder might be responsible for her patient’s symptoms. A business executive wants to know why a new sales campaign is failing. A scientist conducts experiments to determine which of several rival theories best explains a certain phenomenon. A detective searches for clues to identify which of 10 guests at a house party has murdered the host. These situations illustrate a classic problem in the psychology of judgment: How do people go about testing hypotheses? The process involves several components: the generation of the hypotheses to be tested; the search for information; the combining of information from different sources; and the formation of an opinion concerning the relative belief in competing hypotheses, given the information that has been obtained. A great part of the literature of hypothesis testing has centered on the combining of information and the

This research was supported in part by National Science Foundation Grant SES-8208132 to Reid Hastie. The author thanks Reid Hastie for his useful advice on the research reported here and on the preparation of the SLEUTH model, and Glenn Shafer for his helpful comments on the relationship of these results to the theory of belief functions. Requests for reprints should be sent to Lori R. Van Wallendael, who is now at the Department of Psychology, University of North Carolina at Charlotte, UNCC Station, Charlotte, NC 28223.

0749-5978/89 $3.00
Copyright © 1989 by Academic Press, Inc.
All rights of reproduction in any form reserved.

revision of opinions, and particularly on the comparison of human behavior to normative models of probability revision. In study after study, the normative Bayes's Theorem has been shown to be inadequate as a description of human behavior (Edwards, 1968; Fischhoff & Lichtenstein, 1978; Tversky & Kahneman, 1974; among many others). For example, human subjects have been described as "conservative" in the sense that they do not use information as efficiently as mathematics dictates; when Bayes's Theorem leads to very high or low probabilities for a given hypothesis, human subjects are still uncertain, and their estimates of the probabilities in question are less extreme than those prescribed by Bayes (Edwards, 1968; Phillips & Edwards, 1966; Phillips, Edwards, & Hays, 1966). In the controversy over whether subjects do or do not follow Bayesian rules in revising their opinions about a hypothesis, another feature of the hypothesis testing process has been taken for granted. Hypotheses do not exist in isolation; we test not one hypothesis alone, but many. At the very least, we test not only "A" but its complement "not A." In more complex situations, we test "A," "B," "C," and so forth. The doctor knows of several diseases which may be associated with the given pattern of symptoms; the detective knows of several suspects who had access to the typewriter on which the phony suicide note was typed. Indeed, the pitting of several hypotheses against one another is the basis for the scientific philosophy of falsificationism: one designs an experiment to disprove one theory in order to lend credence to its rivals. This critical relationship between a hypothesis and its complement has been assumed to follow mathematical probability rules in almost all current theories of opinion revision. As the probability of "A" increases, the probability of all competing hypotheses is assumed to decrease.
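To make the complementarity assumption concrete, the Bayesian revision that these normative models prescribe can be sketched in a few lines of Python; the suspects, priors, and likelihoods here are hypothetical illustrations, not values taken from the experiments:

```python
# Sketch of complementary (Bayesian) opinion revision over a closed,
# mutually exclusive and exhaustive set of hypotheses.

def bayes_update(priors, likelihoods):
    """Return posterior P(H_i | clue) for each hypothesis H_i."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

priors = [1/3, 1/3, 1/3]          # three equally likely suspects
likelihoods = [0.9, 0.3, 0.3]     # a "guilty" clue implicating suspect A

posteriors = bayes_update(priors, likelihoods)
# Suspect A's probability rises, and the competitors' probabilities
# necessarily fall, because the posteriors must still sum to 1.
```

Because the posteriors are renormalized to sum to 1, an upward revision for the implicated hypothesis forces downward revisions for all competitors; this is precisely the complementary pattern that subjects in the studies below often fail to show.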
This is a crucial feature of not only the normative Bayes's Theorem, but also of proposed descriptive models such as those of Anderson (1981), Edwards (1968), Einhorn and Hogarth (1985), and Schum and Martin (1980). Only a few theorists admit the potential for different types of opinion revision; Dulany and Carlson (1983), for instance, state that "when belief changes, it should move some proportion of the distance to certainty controlled by the convincingness of the evidence, perhaps [italics added] attenuated . . . by the number of subjectively competing hypotheses" (p. 8). Recent evidence, however, suggests that revision of opinions regarding several competing hypotheses may not be done in a complementary manner at all. Teigen (1983) has shown that the probability assigned to a given hypothesis is independent of the number of competing hypotheses in a closed set. And in a series of studies involving murder mysteries, Robinson and Hastie (1985) showed that information regarding a particular "target" hypothesis may have no impact on competing hypotheses, and
that even the elimination of a hypothesis from a closed set often has no effect on subjects' beliefs about the likelihood of its competitors. The noncomplementarity of the opinion revision process is of significance in the development of any model of hypothesis testing. A better understanding of the conditions which produce such behavior is essential. In the present series of studies, I undertook to gain more information about the limits of noncomplementary opinion revision. These empirical results were used, together with those from earlier studies (Robinson & Hastie, 1985), to propose a model of the opinion revision process which accounts for noncomplementarity in the testing of multiple competing hypotheses.

EXPERIMENT 1

If subjects revise their beliefs about multiple hypotheses in a noncomplementary fashion, then we would expect the addition of new hypotheses to have no effect on the probabilities of those hypotheses currently under consideration. This prediction was tested in the following experiment: Subjects attempted to solve two murder mysteries. As clues were presented, subjects were asked to rate each suspect's probability of guilt; at two different points in the series of clues, new suspects were added to the set under consideration and new ratings were made after each of these additions. The impact of disconfirming evidence (clues pointing toward the innocence of a particular suspect) was also examined. Robinson and Hastie (1985) found that such innocent clues tended to decrease in impact as more and more were presented, but this result may have been confounded by variations in the strengths of individual clues. Here, I compared the changes in a target suspect's rating after presentation of one, two, or three innocent clues, counterbalancing the clue presentation orders across three groups of subjects.

Method

Subjects.
Subjects were 120 male and female undergraduate students at Northwestern University, who participated in the experiment in order to fulfill a course requirement. Subjects were run in groups of size 2-10 in sessions lasting approximately 90 min. Subjects were seated in wooden booths that prevented them from seeing or communicating with one another, and their responses will be treated as statistically independent in the present data analysis.

Materials. Two mystery stories, "The Murdered Banker" and "The Poisoned Philanthropist," were used in this experiment, as in Robinson and Hastie (1985). These stories were composed of two parts: a brief plot scenario which introduced the murder victim and an initial set of three
suspects, and a set of 13 clues. Clues were of three types: (a) 4 guilty clues, which presented information implicating a particular suspect; (b) 7 innocent clues, which pointed toward the innocence of a suspect; and (c) 2 add-on clues, which added a new suspect. Add-on clues typically involved the discovery of an additional character with motive who was at or near the scene of the crime; for example, in “The Poisoned Philanthropist”: Police have just learned that George Robner’s ex-wife Linda checked into a local hotel two days before the murder [of George’s father, Marshall Robner]. She claims to be in town on business and says she did not contact any of the Robners upon her arrival. It is well known that Linda blamed Marshall Robner for the break-up of her marriage to George.

These add-on clues were presented as the 4th and 12th clues in the series. Two versions of each case were constructed to counterbalance the order of suspect additions; in "The Murdered Banker," for example, Suspect O was added first and Suspect R second in one version, while Suspect R was added first in the second version. For each case, one of the three original suspects was chosen to be the target of three innocent clues. A 3 x 3 Latin square was used to construct three clue sequences; one-third of the subjects received each sequence of innocent clues. In this way, each clue appeared equally often as the first, second, or third innocent clue presented regarding the suspect. Subjects' probability ratings were made on 5-in. (12.7 cm) calibrated scales running from 0.0 (labeled no chance) to 1.0 (labeled Sure thing), marked off in intervals of .10. Sets of rating scales for all suspects were printed on separate pages for each clue and bound together in booklet form.

Design and procedure. Subjects were informed that they were participating in an experiment on human judgment, and that the judgments they would be asked to make involved suspects in two different murder mysteries. The nature and range of the probability scale were explained, and subjects were given five sample probabilities to estimate before solving their first mystery (for example, the probability of a toss of "heads" on a fair coin, or the probability of rain on a given day). Although these sample probability estimations often illustrated probability rules such as additivity, subjects were not explicitly told that their subsequent judgments ought to behave according to the same rules. Every subject attempted to solve both cases, "The Murdered Banker" and "The Poisoned Philanthropist," with the order of case presentation counterbalanced across subjects. For each story, the experimenter read the plot scenario aloud. Subjects were then given a recall test to ensure general comprehension and correct
memory for suspects' names and relevant facts about the crime. Subjects were told that the guilty party would always be one of the suspects they were asked to rate, and that there would be no conspiracy among two or more suspects; as far as subjects knew at the time, an exhaustive set of mutually exclusive hypotheses was in effect. After being tested on the plot scenario, subjects were instructed to turn to the first page of rating scales in their booklets and mark a slash through the point on each suspect's scale which they thought corresponded to the probability that this suspect was guilty of the murder. When these baseline ratings were completed, the experimenter read the first clue and asked the subjects to turn to the next page of scales and again rate each suspect's probability of guilt. This procedure was repeated for each of the 13 clues, at the end of which the experimenter read the solution to the case. The above procedures were then repeated for the second mystery.

Results

Similar to the results of Robinson and Hastie (1985), subjects showed strong tendencies to adjust only the ratings of the target suspect (that suspect specifically named in the clue) after receiving new evidence. In 42% of all probability revisions (N = 2,640), no adjustments were made to nontarget suspects' probabilities. On the average, subjects' ratings remained stable for a given suspect except after clues specifically referring to that suspect; adjustments to nontargets, when they were made, tended to be smaller than those which would be prescribed by probability theory (see Fig. 1). Median probability ratings exhibit the same basic pattern as the means shown here. Even after the addition of a new suspect, subjects were likely to make no adjustments to the ratings of current suspects. Forty-four percent of all such additions (N = 480) resulted in no decrease in the probabilities of competing suspects.
Only 8% elicited the Bayesian strategy of decreasing the probability of all competing suspects; another 17% resulted in decreased probabilities for only the suspect with the highest current probability rating. The remainder were fairly evenly split among other types of adjustment strategies (decreasing the probability ratings of the longshots only, decreasing probabilities for some suspects while increasing others, and so forth). The tendency to make no adjustments after the addition of a new suspect was stronger for the first addition (49%) than for the second (38%); these differences were statistically significant [χ2(8, N = 240) = 25.81, p < .01]. Some type of adjustment (particularly subtracting from the probability of the favorite suspect) was more likely when the added suspect came in at a high probability rating (for example, Suspect R in "The Murdered Banker," whose average initial probability rating was .64; see

FIG. 1. Mean probability of guilt ratings for Experiment 1, "The Murdered Banker." (Clues are presented on the abscissa, with capital letters indicating the target suspect [B, C, F, O, R] and subscripts indicating the valence of the clue [g = guilty, i = innocent, a = add-on]. Prior probabilities were estimated after presentation of the plot scenario but before exposure to clues.)

Fig. 1) than when the added suspect was given a lower rating (e.g., Suspect O in "The Murdered Banker," average rating .43). Subjects made no adjustments in 35% of the cases in which a highly probable suspect was added and in 48% of the cases in which the added suspect was less probable [χ2(7, N = 120) = 18.41, p < .05]. Innocent clue impact was found to decrease with the presentation of additional innocent clues. An analysis of variance indicated a significant effect of clue ordinal position [F(2,570) = 29.14, p < .001]. Presentation of a second innocent clue caused less of a decrease in ratings than did the first innocent clue, with the third clue having a still smaller impact (see Fig. 2).

Discussion

The effects of adding a suspect provided further support for the noncomplementarity of subjective probabilities. Even when subjects have been specifically told that the set of hypotheses is mutually exclusive and exhaustive, they do not seem to grasp the logical implications of this fact; they behave as if each new suspect were independent of all others. Thus, the addition of new suspects is treated as irrelevant with respect to old suspects. However, this noncomplementary approach does break down

FIG. 2. Mean decrease in a suspect's probability rating after presentation of first, second, and third innocent clues, Experiment 1. (Clue ordinal position is plotted on the abscissa.)

at times. The probability of the added suspect appears to be critical: a new suspect who seems very likely to be guilty may trigger a different strategy on the part of the subject. This effect parallels the greater tendency toward adjustment when a high probability suspect is eliminated from a set, as seen in previous studies (Robinson & Hastie, 1985). A new phenomenon seen in the present experiment is a difference in strategies associated with the timing of a suspect addition. A relatively early addition (within the first few clues) produces fewer adjustments to the current suspects' ratings than does a later addition. This suggests an intriguing possibility. In the early stages of hypothesis evaluation, it may be that the subjects' beliefs are less clearly delineated, causing the subject to adopt a simple "deal with one hypothesis at a time" strategy. As more evidence is accumulated, beliefs about guilt become stronger and information becomes more and more integrated, leading to a greater tendency to adjust probabilities in a complementary fashion. In this interpretation, the subject is merely "gathering" information in the early stages, but beginning to "integrate" it at some later point. Data from this study show some support for this hypothesis. If we look at the number of noncomplementary adjustments (changes in the rating of the target but not of any other suspects) for each clue in the series, we see that this number decreases reliably as more and more clues are presented (r = -.28, p < .05; see Fig. 3). Subjects appear to be more likely to use probabilities in a complementary fashion as evidence accumulates. Note, however, that even after 13 clues, over 35% of all responses are still noncomplementary, suggesting that many subjects never begin integrating evidence with respect to competing hypotheses. A second experiment moved to the realm of hypothesis elimination in an attempt to gain new evidence regarding the effect of the accumulation

FIG. 3. Percentage of noncomplementary adjustments across all subjects for each clue in the series, Experiment 1. (Clue ordinal position is plotted on the abscissa.)

of clues, and also to further explore the generality of the noncomplementarity effect.

EXPERIMENT 2

The previous experiment has shown a marked difference in subjects' treatment of a hypothesis addition depending on its position within the clue set. However, since each subject was exposed to two separate suspect additions within each case, there is some confounding between the position of an addition and the number of additions which have occurred. To remedy this, I ran a second experiment in which I returned to the hypothesis elimination paradigm of Robinson and Hastie (1985); only one hypothesis was eliminated from each set, with this elimination occurring early in the clue series for half of the subjects and late in the series for the other half. According to Bayes's Theorem, when one hypothesis is eliminated from a set under consideration, probability is redistributed among the remaining hypotheses while preserving the same ratios of probabilities that existed before the elimination. Thus, the probability of a given hypothesis depends to some extent on the number of hypotheses with which it is competing. However, when probabilities are treated in a noncomplementary manner, the probability of a given hypothesis is independent of the total number of hypotheses. Teigen (1983) has shown this in an experiment in which subjects were given a description of the weather on a certain date and asked to rate the likelihood that this description was
obtained in various world cities; when new cities were added to the list, most subjects failed to change their earlier ratings for the competing cities. In this experiment, I attempted to replicate Teigen's findings by varying the size of a suspect set and looking at its effect on initial ratings of those suspects common to all set sizes. Finally, in a further test of the robustness of the noncomplementarity effect, I looked at the consequences of manipulating similarity between suspects. I reasoned that a superficial similarity between two suspects might encourage a subject to see the dependencies between suspects in the set and thus to make more complementary adjustments.

Method

Subjects. Subjects were 80 Northwestern University undergraduates who participated in order to fulfill a course requirement. Subjects were run in groups of size 2-10 in single sessions lasting approximately 45 min.

Materials. The two cases from the preceding experiment, "The Murdered Banker" and "The Poisoned Philanthropist," were again used, now modified in the following ways: Two versions of each plot scenario were constructed, one involving three suspects and one involving six suspects (the three from the first version plus three new characters). Twelve clues were used in each version, distributed as follows: in the 3-suspect version, 3 guilty clues, 2 innocent clues, 6 neutral clues and 1 eliminator; in the 6-suspect version, 5 guilty clues, 6 innocent clues and 1 eliminator. Within each version, two clue sequences were used; in the "early elimination" sequence, the eliminator was positioned as clue 3 (with clue 12 being a guilty clue), while in the "late elimination" sequence, clue 12 was the eliminator and clue 3 was a guilty clue. The order of the remaining clues was constant across both sequences. In both cases, one suspect was described in the plot scenario as being very similar to the to-be-eliminated suspect, in terms of age, sex, physical appearance, and occupation.
These similarities were chosen so as to be salient but not logically relevant to the case; there was no reason to expect, for example, that a particular occupation shared by two suspects made them any more or less likely to be guilty of the murder.

Design and procedure. Procedures were identical to those used in Experiment 1, with each subject attempting to solve only one of the two cases. Half of the subjects received "The Murdered Banker" and half received "The Poisoned Philanthropist."

Results

Subjects were more likely to make some sort of adjustment to the
ratings of the remaining suspects after a late elimination than after an early one. No adjustments were made by 22% of subjects in the early-elimination condition, whereas only 10% of the subjects in the late-elimination condition made no adjustments [χ2(5, N = 80) = 10.03, p < .10]. Of those subjects who did make adjustments, 28% of the late-elimination subjects added to the probability of only the favorite suspect, as compared to 22% usage of this strategy for the early-elimination group. Noncomplementary adjustments after guilty and innocent clues were again more prevalent in the early stages of clue presentation. (Since the 3-suspect versions of each case contained a large percentage of neutral clues, I confine my current analysis to the 6-suspect versions of the two cases.) Across the entire sequence of 12 clues, no significant correlation between clue position and number of noncomplementary adjustments was seen. However, nonlinearity was basically confined to the first 4 clues, before any elimination had occurred; similarity between suspects could have resulted in the unusual response pattern for these clues. When we look at clues 5-12 within the early elimination group (n = 20), where one of the similar suspects has been eliminated from consideration and thus similarity no longer exists among remaining suspects, we see a strong negative correlation between noncomplementary adjustments and clue position (r = -.73, p < .05), indicating a tendency for noncomplementary adjustments to decrease in frequency as the number of clues increases. The initial probability ratings of the suspects common to both versions of a case, made after presentation of the plot scenario only, were independent of the number of suspects in the set, replicating the results of Teigen (1983). See Table 1.

TABLE 1
AVERAGE INITIAL PROBABILITY RATINGS AS A FUNCTION OF NUMBER OF SUSPECTS (EXPERIMENT 2)

Case and suspect                  Number of suspects
                                      3        6
"The Poisoned Philanthropist"
  George Robner                     .556     .542
  Carla Dunbar                      .352     .380
  Mimi Taylor                       .352     .380
"The Murdered Banker"
  Rita Frawson                      .352     .366
  Wellington Blakely                .540     .533
  Hubert Reardon                    .474     .532
n                                     20       20

As a necessary consequence of this, the
summed probability ratings for all suspects in the set were consistently higher for the 6-suspect versions than the 3-suspect versions of each case. The number of suspects did have an effect on post-elimination adjustment strategies. Subjects tended to make fewer adjustments after an elimination in the 6-suspect condition than in the 3-suspect condition. Twenty-two percent of all subjects in the 6-suspect condition (n = 40) made no adjustments after an elimination, while only 10% of subjects in the 3-suspect condition did not adjust their ratings. Further, subjects rating only 3 suspects tended to make adjustments more in line with probability theory; 42% of these subjects added to the probabilities of all remaining suspects after an elimination, as compared to only 2% of subjects rating 6 suspects [χ2(5, N = 80) = 20.42, p < .01]. Similarity did have an effect on subjects' adjustments after an elimination. Twenty-eight percent of all subjects (N = 80) made adjustments favoring the similar suspect over all other suspects after an elimination; these adjustments occurred whether the similar suspect was the current favorite, the longshot, or somewhere in the middle. Adjustments to the similar suspect took several forms: some subjects added only to this suspect's probability, others added to all but in a proportion favoring the similar suspect, and some even decreased the probabilities of the nonsimilar suspects while increasing the similar suspect's ratings.

Discussion

Noncomplementarity of probabilities was again demonstrated in the tendency to make no adjustments after a suspect was eliminated, although this occurred to a lesser degree than in previous studies. The presence of the similar suspect probably accounts for the greater tendency toward complementary adjustments; the number of postelimination strategies favoring the similar suspect suggests that subjects were indeed aware of this suspect's characteristics, and this may have encouraged subjects to see the relationship between one suspect's probability and the others' probabilities. Tendencies toward nonadjustment after an elimination were influenced by the number of suspects under consideration; this is consistent with previous studies (Robinson & Hastie, 1985). Time of elimination was also an important factor. As more and more clues are presented, an elimination seems to have a greater and greater tendency to "make a difference" to a subject. This tendency actually applies to any type of clue; noncomplementary probability revisions become less frequent later in the clue sequence for guilty and innocent clues alike. Since such a trend is potentially an important factor in modeling hypothesis testing behavior, a short probe study was run to show that the
phenomenon was not dependent upon the particular sequences of clues used in the previous two experiments. Ten different sequences of 10 clues from "The Murdered Banker" were constructed, such that each clue was equally likely to appear in each of the 10 clue positions. These 10 sequences were given to 3 subjects each, for a total of 30 subjects. Noncomplementary revisions were counted across subjects for each clue position. The result paralleled the findings of Experiments 1 and 2: a reliable negative correlation between the number of noncomplementary revisions and clue position (r = -.52, p = .06) indicated a decrease in noncomplementary revisions as more and more clues were presented. Of those subjects who do make adjustments to more than the target suspect, we see a difference in the types of strategies preferred for early and late eliminations. After a late elimination, we see a greater tendency to add to the probability of only the favorite suspect. This finding suggests a possible interpretation of the decrease in noncomplementarity discussed above. Perhaps in the early stages of a case, subjects have no clear idea of who might be guilty, and thus tend to gather evidence regarding each suspect independently; as evidence accumulates, however, subjects begin to focus on one or two highly likely hypotheses and to integrate a greater range of evidence into their "picture" of this favorite. For example, a subject who begins to concentrate on Suspect R in "The Murdered Banker" might then begin to adjust R's rating on all clues, increasing her probability as evidence of her fellow suspects' innocence is revealed, and decreasing it when one of her fellow suspects begins to look more guilty. This type of behavior would lead to the decrease in noncomplementary adjustments which we see in both of these experiments. It should be noted at this point that such noncomplementary adjustment of probabilities does not necessarily represent a "bias" in human judgment.
It is true that one normative system, Bayesian probability, requires complementarity; subjects' behavior in revising opinions about multiple competing hypotheses clearly violates Bayesian norms in this respect. However, there are other mathematical systems which do allow noncomplementarity: Baconian probabilities (Cohen, 1977) and the theory of belief functions (Shafer, 1976) are noteworthy examples. (Although a thorough comparison of my experimental results with the tenets of Baconian probabilities and belief functions would require a separate paper, a few accordances will be mentioned here; the interested reader is referred to Cohen's and/or Shafer's books for full analyses of these systems.) Baconian probabilities have no normalization requirement; thus, they need not sum to any particular number. In a system of belief functions, noncomplementarity is also an integral part of belief revision, and it is permissible for belief in one hypothesis to change while belief in other hypotheses remains constant. Belief functions also allow for the tendency of noncomplementarity to occur less frequently as evidence accumulates. There are thus some aspects of subjects' behavior in these studies which are consistent with at least some formal, normative theories of likelihood. The theory of belief functions, however, was not intended as a descriptive theory of human behavior (G. Shafer, personal communication, May 19, 1987), and in several respects is still inconsistent with such behavior. For example, noncomplementarity in beliefs always takes the form of subadditivity, not superadditivity as we see here (although a second aspect of the theory of belief functions, the degree of plausibility, does show superadditivity). The representation of disconfirming evidence within the framework of belief functions also seems hard to reconcile with subjects' descriptions of what they are doing in solving these mysteries; in the language of belief functions, a clue which suggests that Suspect A is innocent would be represented as a clue implying that "Suspect B or Suspect C or Suspect D is guilty." Think-aloud protocols from Robinson and Hastie (1985) suggest that subjects have a more direct representation of innocence. Finally, the computational complexity of belief functions, like that of Bayes's Theorem, makes them unlikely candidates for descriptive models of human behavior. Baconian probabilities, too, suffer from descriptive drawbacks. For example, Baconian probabilities have only ordinal properties; in a Baconian system, we may say that one hypothesis has greater probability than another, but we may not say how much greater. Think-aloud protocols from Robinson and Hastie (1985) indicate that subjects do attempt to represent the "how much greater" issue in their probability ratings. How, then, do we describe subjects' behavior in testing multiple competing hypotheses?
I believe a simple anchor-and-adjust model (as in Lopes, 1982, and Einhorn & Hogarth, 1985) captures many of the interesting features of this task and of subjects' approach to it.

SLEUTH: A Model of Hypothesis Testing

Let us assume that a certain amount of background information regarding a problem (e.g., a murder mystery) has been received by the subject (see Fig. 4). The subject identifies the hypotheses (the suspects) and makes initial judgments as to their respective likelihoods of being guilty [P(i)]. A new piece of information is then received. The subject first determines the type of information it is: Is it neutral (irrelevant to the decision), does it make a particular hypothesis look more likely (a guilty clue), or does it make some hypothesis look less likely (an innocent clue)? If the information is judged to be neutral, the subject does nothing with it, and simply waits for the next clue to be received. If the information is not neutral, then the subject determines which hypothesis (suspect) is implicated by the clue. The subject then determines the strength of the clue: Is it a strong indicator of guilt/innocence, or only moderate or weak? Clue strength may be only as great as the distance to certainty (probability of 0 or 1), and can be described as the "convincingness" of the evidence (see Dulany & Carlson, 1983) or, in Bayesian terms, the diagnosticity of the evidence. The probability of the targeted hypothesis is then adjusted upward or downward, depending on the type of clue, by an amount corresponding to the clue's strength.

Nontarget hypotheses may or may not be adjusted after an adjustment is made to the target. In this model, there exists a certain probability p that adjustments to nontarget hypotheses will be made; this probability is assumed to be equal to a criterion (labeled C in the model diagram) which varies from subject to subject and may depend on such factors as prior statistical training. Conversely, there is a 1 - C probability that no adjustments will be made to nontarget hypotheses. As each new piece of information is received, the criterion C is modified by an amount k, usually positive. In this way the likelihood of a subject making complementary adjustments can increase or decrease as subsequent clues are received; later clues are usually more likely to elicit adjustments to the probabilities of more than one hypothesis. It should be remembered, however, that C and k are individual difference variables; many subjects appear to have very low values for both, and thus rarely (or never) make complementary adjustments even after a large amount of information has been received. A minority of subjects, however, seem to have fairly large C values, leading them to make complementary adjustments for almost all clues; some seem to have large k values and start out making noncomplementary adjustments, but then quickly switch to a more complementary strategy. It is also possible that some subjects may have negative k values, making their adjustments less likely to be compensatory as more information is received.

FIG. 4. Flowchart diagram of SLEUTH model. Clue types are represented by letters G (guilty), I (innocent), N (neutral); clue strength = S; C and k are individual difference variables representing the subject's initial probability of complementary adjustment and the increment in C for each new clue received, respectively.
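The adjustment cycle just described can be sketched in code. The following is a minimal, hypothetical rendering of one simulated run; the function and variable names are mine, not part of the published model:

```python
import random

def sleuth_run(priors, clues, C, k):
    """One simulated run of the SLEUTH adjustment cycle (illustrative sketch).

    priors -- dict mapping suspect -> initial probability rating P(i)
    clues  -- list of (clue_type, target, strength) triples, where clue_type
              is 'G' (guilty), 'I' (innocent), or 'N' (neutral)
    C, k   -- criterion for complementary adjustment and its per-clue change
    Returns a list of probability snapshots, one per clue.
    """
    p = dict(priors)
    history = []
    for clue_type, target, strength in clues:
        if clue_type != 'N':  # neutral clues are simply ignored
            # Adjust the target up (guilty) or down (innocent) by the clue
            # strength, staying within [0, 1].
            sign = 1.0 if clue_type == 'G' else -1.0
            p[target] = min(1.0, max(0.0, p[target] + sign * strength))
            # With probability C, also make complementary adjustments to the
            # nontargets, each by a random fraction of the clue strength.
            if random.random() < C:
                for s in p:
                    if s != target:
                        p[s] = min(1.0, max(0.0, p[s] - sign * random.random() * strength))
            C += k  # criterion drifts toward (or away from) complementarity
        history.append(dict(p))
    return history
```

Because the complementary step is probabilistic, a single run is not itself a prediction; the fitting exercise reported in this article averages the ratings from 100 such runs per subject.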
This simple version of the SLEUTH model can be modified to include such phenomena as the decreasing strengths of innocent clues and the greater likelihood of adjustments after eliminations of high-probability suspects. Qualitatively, the model is capable of handling all of the behaviors exhibited by our subjects (see Table 2). A more quantitative fit of the model to actual data is described below.

Fitting the Model

A simplified version of the basic murder mystery task was used in this attempt to fit the model to individual subjects' data. The 5-suspect plot scenario for "The Murdered Banker" was coupled with a series of 10 clues; no eliminations were used in this demonstration, as I wished to test only the simplest form of the model at this time. A computer simulation of the SLEUTH model was used to predict

TABLE 2
HOW THE SLEUTH MODEL HANDLES QUALITATIVE RESULTS

Empirical phenomenon 1: Popularity of noncomplementary adjustments after eliminations and additions of hypotheses, as well as after confirming and disconfirming evidence about a target hypothesis (a, b, c, d, e).
Model: Handled through assumption that C and k are very low (often zero) for most subjects.

Empirical phenomenon 2: Greater likelihood of complementary adjustment after a high-probability hypothesis is eliminated (a, b, c).
Model: Can be handled in the following way: Elimination of hypothesis i operates like an innocent clue with strength = P(i). Probability of complementary revision (C) is adjusted according to P(i); thus, C is larger for larger values of P(i).

Empirical phenomenon 3: Greater likelihood of complementarity when number of remaining hypotheses is small (a, b, c, e).
Model: Can be handled by the addition of a mechanism which keeps count of the number of hypotheses for which P(i) > 0, and increases C for every decrease in this count.

Empirical phenomenon 4: Popularity of "add-to-all" or "add-to-favorites" strategies among complementary adjustments after a hypothesis is eliminated (a, b, c, d, e).
Model: Model currently assumes an "add-to-all" adjustment, although the sizes of adjustments to nontarget hypotheses are random. "Add-to-favorites" could be handled with a mechanism which keeps track of nontarget P(i)'s and adds only to the hypothesis with the greatest P(i) value. The use of "add-to-all" vs "add-to-favorites" strategies could be a matter of individual differences, or may be linked to clue strength or number of clues presented; this is an empirical question requiring further study.

Empirical phenomenon 5: Decreasing impact of innocent clues (c, d).
Model: Could be handled through a mechanism which counts the number of innocent clues for each target and decreases clue strength (S) for subsequent innocent clues after the first.

Empirical phenomenon 6: Decreasing likelihood of noncomplementary revision as more information is presented (d, e).
Model: Handled through parameter k, which is assumed to be positive for most subjects, thus increasing the likelihood of complementary revisions as more information is received.

Empirical phenomenon 7: Popularity of adjustments favoring a hypothesis similar to the eliminated hypothesis (c).
Model: Not currently handled by the model. A way to define and measure "similarity" between hypotheses is needed before it may be incorporated into the model.

Note. a: Robinson and Hastie (1985), Experiment 1. b: Robinson and Hastie (1985), Experiment 2. c: Robinson and Hastie (1985), Experiment 3. d: This article, Experiment 1. e: This article, Experiment 2.


individual subjects' behavior. The program takes as input a subject's prior probabilities for the five suspects (estimated after receiving only the information in the plot scenario), clue strength values (also assumed to vary from subject to subject), and values of the parameters C and k. With this information, the model predicts the subject's probability ratings for each suspect for each of the 10 clues (a total of 50 predictions per subject). To do this, the program first reads the clue type (guilty, innocent, or neutral) and target suspect of the clue, takes the clue's strength rating, and adjusts the probability of the target suspect accordingly. A random number is then generated between 0.0 and 1.0; if this number is less than the criterion C, the probability ratings of nontarget suspects are adjusted by random percentages of the clue strength; otherwise, no adjustments are made to the nontarget suspects. C is then incremented by k, and the next clue is considered. Since a probabilistic process is central to the workings of the model, any single run with a given set of parameters and clue strengths cannot be said to represent a "typical" behavior pattern for the subject. For this reason, the computer was programmed to run 100 simulations with each subject's parameters, and the model's predictions were taken as the averages of the probability ratings obtained from these 100 simulations.

Subjects were five Northwestern University students enrolled in a summer psychology course, who participated as part of an in-class experiment. Subjects were run over a 5-day period, in two sessions. In the first session, I attempted to assess clue strengths for each subject by two methods.
(1) In the direct method, I gave subjects the plot scenario from "The Murdered Banker" and the series of 10 clues, and asked subjects to rate on a calibrated scale (from -30 to +30), "How much would this clue alone change the suspect's probability of being guilty?" (2) In the indirect method, subjects first made preliminary ratings of suspects' probabilities after only the plot scenario had been presented. Subjects were then given the 10 clues, and for each clue were asked, "What would this suspect's probability of guilt be if this (plus the background information you received earlier) were THE ONLY CLUE you had about the suspect?" With this method, the strength of the clue was then calculated by taking the difference between these ratings and the preliminary ratings obtained after reading the plot scenario only. Subjects received a short break between these two assessments. Half of the subjects received the direct method first, and half the indirect method first. The consistency of the clue strength values obtained by these two methods varied greatly from subject to subject; to simplify matters, I took the average of these two clue strength estimates for each clue and for each subject, and used these averages as the input to the model simulation.

In the second testing session, subjects were told that they were again going to read "The Murdered Banker" and rate the suspects' probabilities, but this time they would not be forced to look at each clue in isolation from other clues. They were asked to try to imagine that they had never seen any of the clues before, and to make their ratings after each clue on the basis of all of the evidence they had read that day up to that point (but not on their memory for other clues that they might recall from the first session). Most of the subjects reported later that they had little difficulty complying with these instructions, and that they recalled very few clues from their first viewing of the case 5 days earlier.

The estimation of parameters for each subject was done in a simple manner. Since C is defined as the probability of making a complementary set of revisions after the first clue, with k the amount of increase/decrease in C for every succeeding clue, a good way to define these parameters operationally is as the intercept and slope of the regression line found by plotting the percentage of complementary revisions made by a subject at each clue position. Ideally, data from a large number of different mystery cases for each subject would be needed to obtain a reliable estimate. With only one case to work with, the percentage of complementary revisions reduces to a dummy variable (complementary vs noncomplementary) and the slope and intercept of the regression line are less diagnostic of a subject's typical behavior. Nevertheless, these estimates led to acceptable predictive accuracy for the model (see Table 3), and some interesting individual differences in C and k values were seen among these five subjects. One subject showed totally noncomplementary opinion revisions, with both parameters estimated at 0. One subject had a C of 0 but a low positive k = .07, indicating a slight increase in the tendency toward complementary revision.
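The operational definition above, with C as the fitted line's value at the first clue and k as its slope, amounts to an ordinary least-squares fit through the per-clue complementarity rates. A hypothetical sketch (names are mine):

```python
def estimate_C_k(complementary):
    """Estimate C and k from a subject's per-clue complementarity record.

    complementary -- one value per clue position: the proportion of
    complementary revision sets made at that position (with only a single
    case, just 0 or 1 for each clue, as noted in the text).
    Returns (C, k): the regression line evaluated at clue 1, and its slope.
    """
    n = len(complementary)
    xs = range(1, n + 1)  # clue positions
    mx = sum(xs) / n
    my = sum(complementary) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, complementary))
    sxx = sum((x - mx) ** 2 for x in xs)
    k = sxy / sxx                 # slope: per-clue change in P(complementary)
    C = (my - k * mx) + k * 1     # intercept shifted to the first clue
    return C, k
```

A subject who never makes complementary revisions yields C = 0 and k = 0; one who always does yields C = 1 and k = 0.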
Two subjects had positive C values but k = 0, indicating a rather haphazard use of complementary revisions; a larger sample of cases might have revealed a more definite pattern to the behavior of these subjects. A negative k value was also found for one subject, meaning that this subject's likelihood of making a complementary revision actually decreased over time.

TABLE 3
FIT OF SLEUTH MODEL TO INDIVIDUAL SUBJECTS' DATA

           Parameters      Indexes of fit
Subject    C       k       r       MAD
1          0       0       .94     .0576
2          0       .07     .82     .1122
3          .4      0       .87     .0532
4          .6      0       .81     .1360
5          .6      -.10    .80     .1370

Note. C = initial probability of a complementary set of adjustments; k = increment in C for each new clue received; r = Pearson correlation coefficient between predicted probabilities and subjects' actual ratings; MAD = Mean Absolute Difference between predicted and actual probability ratings.

Results

As shown in Table 3, fits of the model for individual subjects ranged from moderately good to excellent. Two indexes of goodness of fit are reported. I calculated the correlation coefficient between the predictions of the SLEUTH model and the actual ratings of each subject; r was consistently high, ranging from a low of .80 to a high of .94 (see Table 3, column 4). I also report the Mean Absolute Difference (MAD) between the model's predictions and the actual ratings of each subject. The range of MAD values reported here compares favorably with similar models of hypothesis testing in other situations, such as those of Einhorn and Hogarth (1985), who report MADs varying between .006 and .112 on a 0.00 to 1.00 scale, and Dulany and Carlson (1983), who report MADs between .11 and .38 on a -1.00 to +1.00 scale. For our own model, MAD was very low for subjects #1 and #3, indicating a close fit, and somewhat higher for the remaining subjects. The size of the MAD seemed to be dependent on the accuracy of our clue strength ratings for the subject; those subjects with the lowest MAD values also had clue strength ratings which closely paralleled their actual adjustments for the target suspects (differing by an average of only .06 for subject #1 and .03 for subject #3). Subjects with higher MAD values tended to have their target adjustments less well predicted by the clue strength ratings used here.

These results suggest several ways to improve the quantitative predictive accuracy of the model. First, as mentioned above, to obtain a more reliable estimate of parameters for each subject, more extensive pretesting with a large number of different cases would need to be done. Second, a more accurate, independent method of assessing clue strength could improve the accuracy of the model greatly. One promising route is suggested by Dulany and Carlson (1983), who describe causal reasoning in terms of the linkages between pieces of information (clues) and hypothesized causes of an event.
To assess subjective evidence (what I have here called "clue strength"), they presented subjects with a clue such as, "Some suspects had recently had violent quarrels with Mr. Phelps [the victim]." They then assessed the forward linkage by asking subjects to rate "the relative likelihood that this clue would be true of the real murderer or some innocent suspect." Dulany and Carlson obtained strong predictive accuracy using these methods; it is quite possible that this type of approach would be useful in assessing strengths of clues in the SLEUTH model as well.
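The two fit indexes reported in Table 3 are straightforward to compute from paired predicted and actual ratings. A hypothetical helper (not from the original study) showing both calculations:

```python
def fit_indexes(predicted, actual):
    """Pearson r and Mean Absolute Difference (MAD) between a model's
    predicted probability ratings and a subject's actual ratings."""
    n = len(predicted)
    mp = sum(predicted) / n
    ma = sum(actual) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(predicted, actual))
    sp = sum((p - mp) ** 2 for p in predicted) ** 0.5
    sa = sum((a - ma) ** 2 for a in actual) ** 0.5
    r = cov / (sp * sa)  # Pearson correlation coefficient
    mad = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
    return r, mad
```

Note that r rewards predictions that track the shape of a subject's ratings even when systematically offset, while MAD penalizes any absolute discrepancy; reporting both, as Table 3 does, guards against either index flattering the model alone.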


The current test of the model provides preliminary support for its depiction of hypothesis testing in both qualitative and quantitative terms. Qualitatively, the model provides a significant advance in our understanding of belief revision by defining the complementarity of hypotheses as a probabilistic function dependent on characteristics of the subject and the task, not simply as a given element as it is in Bayesian theory. It provides a mechanism to explain the tendency of many subjects to make increasingly complementary judgments as more and more information is received. Perhaps most importantly, it allows for individual differences in judgment strategies, as well as differences which are specific to a given situation.

Admittedly, correlational fitting of data to a model does not provide a true discriminative test of the model as it compares to other models. However, the SLEUTH model is designed to permit further tests involving the prediction and verification of new phenomena. For example, further tests of the SLEUTH model could focus on the parameters C and k, and how they may vary independently of one another: C (the initial probability of making a complementary adjustment) might be varied by training the subject on formal probability rules, or by varying the initial number of suspects in a set; k (the rate at which one moves toward complementarity) might be manipulated through changes in the difficulty of the problem. Such studies would increase our understanding of the nature of complementarity and the circumstances under which it is likely to be a part of our hypothesis-testing behavior.

REFERENCES

Anderson, N. H. (1981). Foundations of information integration theory. New York: Academic Press.
Cohen, L. J. (1977). The probable and the provable. Oxford: Clarendon Press.
Dulany, D. E., & Carlson, R. A. (1983, November). Consciousness in the structure of causal reasoning. Paper presented at the meeting of the Psychonomic Society, San Diego.
Edwards, W. (1968). Conservatism in human information processing. In B. Kleinmuntz (Ed.), Formal representation of human judgment (pp. 17-52). New York: Wiley.
Einhorn, H. J., & Hogarth, R. M. (1985). Ambiguity and uncertainty in probabilistic inference. Psychological Review, 92, 433-461.
Fischhoff, B., & Lichtenstein, S. (1978). Don't attribute this to Reverend Bayes. Psychological Bulletin, 85, 239-243.
Lopes, L. L. (1982). Toward a procedural theory of judgment. Unpublished manuscript, University of Wisconsin (Madison).
Phillips, L. D., & Edwards, W. (1966). Conservatism in a simple probability inference task. Journal of Experimental Psychology, 72, 345-354.
Phillips, L. D., Edwards, W., & Hays, W. L. (1966). Conservatism in complex probabilistic inference. IEEE Transactions on Human Factors in Electronics, HFE-7, 7-18.
Robinson, L. B., & Hastie, R. (1985). Revision of beliefs when a hypothesis is eliminated from consideration. Journal of Experimental Psychology: Human Perception and Performance, 11, 443-456.
Schum, D. A., & Martin, A. W. (1980). Probabilistic opinion revision on the basis of evidence at trial: A Baconian or a Pascalian process? (Report No. 80-02). Houston, TX: Rice University.
Shafer, G. (1976). A mathematical theory of evidence. Princeton, NJ: Princeton Univ. Press.
Teigen, K. H. (1983). Studies in subjective probability III: The unimportance of alternatives. Scandinavian Journal of Psychology, 24, 97-105.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.

RECEIVED: July 29, 1987