RESPONSE BIAS IN RECOGNITION MEMORY
Caren M. Rotello and Neil A. Macmillan
The Psychology of Learning and Motivation, Vol. 48. DOI: 10.1016/S0079-7421(07)48002-1. Copyright 2007, Elsevier Inc. All rights reserved. 0079-7421/08 $35.00

I. Introduction

In this chapter we explore the ways in which subjects’ response bias can influence the conclusions that memory researchers draw from their data. We show that subjects can, in principle, minimize their memory errors and maximize their accuracy by choosing an unbiased criterion location, but that a host of factors influence their actual bias in generally sensible ways. Moreover, we demonstrate that researchers’ interpretation of subjects’ biases across conditions depends on their choice of bias measure, and that a failure to take account of response bias can lead to erroneous conclusions about memory performance.

Most recognition studies are intended to learn something about memory sensitivity, and their authors wish to avoid the possibly confounding effects of changes in bias toward one or another of the available responses. When this goal is paramount, the point of studying response bias is effectively to eliminate its influence, either within the experiment or in data analysis. A full account of recognition performance, however, must explain the presumed decision process that leads to a memory judgment, and we view an understanding of bias effects as a step in that direction.

Fig. 1. Distributions of memory strengths for targets and lures, with the criterion marked: (A) equal-variance distributions, (B) unequal-variance distributions.

Whether designed to study accuracy or bias, most recognition memory experiments adopt as dependent variables the parameters of a model, and our discussion is largely couched in terms of signal detection theory. Figure 1A illustrates the simplest model of this type, in which memory sensitivity is measured by $d'$, the separation between the means of the underlying Gaussian distributions corresponding to lures and targets. Many investigators report data in terms of $d'$ and acknowledge the assumptions of signal-detection theory (SDT), but other dependent variables are equally theory-laden. For example, proportion correct [p(c)], another very common summary statistic, is equivalent to the mean difference between two distributions that are rectangular in shape (Macmillan & Creelman, 1990, 2005). Receiver-operating characteristic (ROC) data almost always support SDT (Gaussian) over threshold (rectangular) models, and it is for that reason that we focus on SDT here.

The SDT construct that captures the response bias concept is the criterion: “old” decisions are made for observations above this level, “new” decisions for those below. We ask what determines the level at which the criterion is set and how that level changes in response to different experimental conditions or variations within a condition. In tasks requiring a choice among more than two responses (we consider rating tasks and the remember–know design), an
SDT analysis assumes that there are multiple criteria, each of which may be fixed or variable, and we extend our survey to these paradigms.

Among the variables that are known to affect the estimated value of the criterion is the nature of the instructions; apparently, therefore, the criterion can be consciously manipulated. Nothing in SDT requires, however, that all criterion settings are available to introspection. The separation between response bias and accuracy, one of the primary accomplishments of SDT, does not map onto the distinction between conscious and unconscious processing.
II. Measuring Response Bias
Defining “response bias” to mean “value of the criterion,” as we do here, does not uniquely specify the variable to be examined. The criterion can be interpreted as a location or as a likelihood ratio, and location must be defined relative to some origin. In the simplest cases, all of these measures are equally useful, but in more complex situations each comes with theoretical strings attached.

A. MEASURING RESPONSE BIAS IN A SINGLE CONDITION

The simplest empirical question about response bias is whether it exists in a single experimental condition. The criterion location, relative to the crossover point between the (equal variance) lure and target distributions, is

$$c = -\tfrac{1}{2}\,[z(H) + z(F)] \tag{1}$$

where H is the hit rate, F is the false-alarm rate, and z(·) is the inverse of the standard normal distribution function. The dependence of $c$ on both the hit and false-alarm rates is one of its important characteristics. The value of $c$ is 0 for unbiased responses, in which the miss and false-alarm rates are equal and therefore the overall error rate is minimized. Negative values of $c$ reflect liberal bias settings, whereas positive values indicate conservative locations.

Two variants of $c$ that have been popular in memory research locate the criterion relative to the mean of one of the underlying distributions. Relative to the lure mean the criterion is

$$c_L = -z(F) \tag{2}$$

and relative to the target mean it is

$$c_T = -z(H) \tag{3}$$
Each of these indices depends on only one of the two response proportions: $c_L$ sets the lure mean to 0 and is positive whenever F < 0.5, whereas $c_T$ sets the target distribution mean to 0 and is negative whenever H > 0.5. For example, if $d' = 2$ and the criterion is unbiased, then $c_L = -z(0.16) = 1$ and $c_T = -z(0.84) = -1$: The criterion is one unit above the lure mean and one unit below the target mean. Notice that $c = \tfrac{1}{2}[c_L + c_T]$; in this example, $c = \tfrac{1}{2}[1 + (-1)] = 0$, that is, the criterion is unbiased.

The final measure of bias is the likelihood ratio $\beta$, the ratio of the target and lure distribution ordinates at the criterion. For the equal-variance model of Fig. 1, $\beta$ approaches 0 for very low strengths, equals 1 at the point where the distributions cross, and approaches $\infty$ for very high levels of strength. For many purposes, it is more useful to examine $\ln(\beta)$, which bears a simple relation to the criterion location:

$$\ln(\beta) = c\,d' \tag{4}$$
Many researchers would prefer a measure of response bias that makes no reference to a model, and the yes rate $YR = \tfrac{1}{2}(H + F)$, the average of the hit rate H and the false-alarm rate F, appears at first glance to fill this bill. In fact, however, the yes rate is mathematically equivalent to (monotonic with) the criterion location in the rectangular model (Macmillan & Creelman, 1990) and commits the user to threshold assumptions. Similarly, the purportedly nonparametric bias measure $B''$ is known to be equivalent to likelihood ratio if the underlying distributions are logistic (Macmillan & Creelman, 1996). We believe that all statistics purporting to measure response bias are model bound.¹

As shorthand for summarizing the amount of bias in a single condition, any of these measures is satisfactory. It is only when multiple conditions must be compared (that is, in all interesting applications) that complications arise.
¹ We also believe this to be true of sensitivity measures, with one exception: the area under the full ROC is nonparametric, because it equals proportion correct by an unbiased observer in two-alternative forced-choice (Green, 1964). An area-based bias measure could hold promise for similar status, and such an index, denoted K, has recently been proposed by Kornbrot (2006). Like Green’s sensitivity measure, K requires a complete ROC, as can be obtained with a ratings design (discussed later in this chapter). For the nonparametric claim to be plausible, the ROC must be based on a large number of points.
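As a rough illustration of Eqs. (1)-(4), the following Python sketch computes the single-condition measures from a hit rate and a false-alarm rate under the equal-variance Gaussian model. The names `z` and `bias_measures` are hypothetical, chosen for this example only; `statistics.NormalDist` from the Python standard library supplies the z-transform. The example values reproduce the worked case in the text ($d' = 2$ with an unbiased criterion).

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal; its inv_cdf plays the role of z(.)


def z(p: float) -> float:
    """Inverse of the standard normal CDF (the z-transform of a proportion)."""
    return _STD_NORMAL.inv_cdf(p)


def bias_measures(hit_rate: float, fa_rate: float) -> dict:
    """Single-condition measures under the equal-variance Gaussian model."""
    c_lure = -z(fa_rate)             # Eq. (2): criterion relative to the lure mean
    c_target = -z(hit_rate)          # Eq. (3): criterion relative to the target mean
    c = 0.5 * (c_lure + c_target)    # Eq. (1): criterion relative to the crossover
    d_prime = z(hit_rate) - z(fa_rate)
    return {
        "c": c,
        "c_L": c_lure,
        "c_T": c_target,
        "d_prime": d_prime,
        "ln_beta": c * d_prime,                   # Eq. (4)
        "yes_rate": 0.5 * (hit_rate + fa_rate),   # average of H and F
    }


# Worked example from the text: the criterion sits one unit above the lure
# mean and one unit below the target mean, so H is about 0.84 and F about 0.16.
H = 1.0 - _STD_NORMAL.cdf(-1.0)
F = 1.0 - _STD_NORMAL.cdf(1.0)
example = bias_measures(H, F)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Hit or false-alarm rates of exactly 0 or 1 have no finite z-score, so `inv_cdf` rejects them; how such rates should be adjusted is a separate decision that this sketch does not make.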
B. COMPARING BIAS AMONG CONDITIONS WITH EQUAL SENSITIVITY
A common question in memory research is whether an experimental variable affects sensitivity, bias, or both; to answer this question clearly requires an experiment with multiple conditions. If sensitivity is known to be constant, all the SDT-based measures described above are equivalent for the model of Fig. 1A. To determine whether sensitivity really is constant requires a model with a sensitivity parameter; for example, the model in Fig. 1A requires that $d'$ be constant. If sensitivity is equal in two conditions, then values of a response-bias statistic can be directly compared. All the SDT statistics described so far are monotonically related in this situation and any of them can be used.

Even within the constraints of detection theory, however, the choice of a sensitivity statistic requires care. In particular, the equal-variance model of Fig. 1A can often be rejected in favor of an unequal-variance model like that shown in Fig. 1B (Glanzer, Kim, Hilford, & Adams, 1999; Ratcliff, Sheu, & Gronlund, 1992), and $d'$ does not measure sensitivity in that model. To see why, consider the formula for $d'$:

$$d' = z(H) - z(F) \tag{5}$$

If z(F) = 0, then $d'$ depends only on z(H) and, therefore, only on the standard deviation of the target distribution; if z(H) = 0, then $d'$ depends only on the characteristics of the lure distribution. When the variances of the lures and targets are equal, all is well. If the variances are unequal, however, the computed values of $d'$ when z(H) = 0 and when z(F) = 0 differ; $d'$ depends on bias in that case. The sensitivity measure should depend on both standard deviations, and an excellent option is to use the root mean square standard deviation. Fixing the standard deviation of the lures at 1 (without any loss of generality) and that of the targets at s, we obtain a new measure of sensitivity

$$d_a = \left(\frac{2}{1 + s^2}\right)^{1/2} [z(H) - s\,z(F)] \tag{6}$$
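The dependence of $d'$ on bias under unequal variance, and its absence in $d_a$, can be checked numerically with a short Python sketch. Several things here are assumptions made for illustration rather than taken from the text: the function names, the parameter values, and in particular the scaling. The sketch reads s in Eq. (6) as the slope of the zROC (the lure standard deviation expressed in units of the target standard deviation), so it sets the target standard deviation to 1 and the lure standard deviation to s; this is the reading under which Eq. (6), as printed, does not change with the criterion.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def z(p):
    """Inverse of the standard normal CDF."""
    return _STD_NORMAL.inv_cdf(p)


def d_prime(hit_rate, fa_rate):
    """Eq. (5)."""
    return z(hit_rate) - z(fa_rate)


def d_a(hit_rate, fa_rate, s):
    """Eq. (6), with s taken to be the zROC slope (an assumption of this sketch)."""
    return (2.0 / (1.0 + s * s)) ** 0.5 * (z(hit_rate) - s * z(fa_rate))


def rates(criterion, target_mean, s):
    """Hit and false-alarm rates for a criterion on the strength axis.

    Assumed scaling: lures ~ N(0, s) and targets ~ N(target_mean, 1),
    where the second argument of N is a standard deviation.
    """
    hit = 1.0 - NormalDist(target_mean, 1.0).cdf(criterion)
    fa = 1.0 - NormalDist(0.0, s).cdf(criterion)
    return hit, fa


TARGET_MEAN, SLOPE = 1.5, 0.8      # hypothetical values
CRITERIA = (0.25, 0.75, 1.25)      # three bias settings, same memory strengths

d_prime_values, d_a_values = [], []
for k in CRITERIA:
    h, f = rates(k, TARGET_MEAN, SLOPE)
    d_prime_values.append(d_prime(h, f))
    d_a_values.append(d_a(h, f, SLOPE))
    print(f"criterion={k:.2f}  d'={d_prime_values[-1]:.3f}  d_a={d_a_values[-1]:.3f}")
```

Only the criterion varies across the three rows, yet the computed $d'$ changes while $d_a$ stays fixed, which is the sense in which $d'$ confounds sensitivity with bias when the variances differ.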
When s = 1, the variances are equal and Eq. (6) reduces to Eq. (5).

Taking account of unequal variance can dramatically change the interpretation of data. For example, consider recent work on the revelation effect, the tendency for subjects to say “old” more often in a recognition test after the test item (whether target or lure) is revealed in fragmentary or distorted form immediately prior to a recognition judgment. Hicks and Marsh (1998) summarized the literature using $d'$ and concluded that revelation decreased memory sensitivity. However, Verde and Rotello (2003) collected ROC data, which can be used to estimate the ratio of target and lure standard deviations,
and showed that $d_a$ was equal across all conditions; the effect of revelation was entirely one of response bias. Using similar methods, Dougal and Rotello (in press) showed that emotion-arousing words affected response bias rather than memory accuracy. We discuss these results in more detail later.

C. COMPARING BIAS AMONG CONDITIONS WITH UNEQUAL SENSITIVITY
If “sensitivity” has been defined and two conditions are found to differ in it, then whether response bias has changed necessarily depends on the bias parameter chosen. Figure 2 provides a possible representation for an experiment in which the same set of lures is compared with targets of two different strengths. The criterion location for the weaker targets is shown, together with the predicted locations for the stronger targets if $c$, $c_L$, $c_T$, or $\beta$ remains the same.

Fig. 2. Equivalence of various possible criterion locations as memory strength for targets is increased. (A) The solid line shows the criterion location; dashed lines indicate the mean of the lure strengths (distance $c_L$ from criterion), the mean of the target strengths (distance $-c_T$ from criterion), and the strength at which the false-alarm and miss rates are equal (distance $c$ from criterion). (B) The target distribution has a higher mean. Four possible criterion locations are noted, each of which keeps the criterion constant under one possible definition of its location ($\beta$, $c$, $c_T$, or $c_L$).

The example illustrates an asymmetry between the assessment of sensitivity and bias. There is no ambiguity in asserting that condition 1 produces a higher sensitivity than condition 2, but “higher criterion” is ambiguous unless the origin against which it is calculated is specified. Likelihood ratio measures of bias, such as $\beta$, are ratio scaled and thus unambiguous. Of the three strength-based criterion measures, $c$, $c_L$, and $c_T$, which is to be preferred? An alleged advantage of $c$ (that it depends on both H and F) is only compelling if subjects are able to accurately estimate their response rates for each stimulus class. Whether this is possible turns out to be a substantive question.
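The ambiguity illustrated in Fig. 2 can be put in numbers. The Python sketch below takes a criterion set for weaker targets and computes where it would have to sit for stronger targets if each candidate measure were held constant. It assumes the equal-variance model with unit standard deviations and the lure mean at 0; the function name and the example values are hypothetical.

```python
def shifted_criteria(k, d_old, d_new):
    """Criterion locations (measured from the lure mean) for the stronger targets.

    k     : criterion location used with the weaker targets
    d_old : d' for the weaker targets
    d_new : d' for the stronger targets
    """
    c_old = k - d_old / 2.0  # criterion relative to the crossover point
    return {
        # c_L constant: the distance from the lure mean does not change.
        "same c_L": k,
        # c_T constant: the criterion moves as far as the target mean does.
        "same c_T": k + (d_new - d_old),
        # c constant: the criterion moves as far as the crossover point does.
        "same c": c_old + d_new / 2.0,
        # beta constant: ln(beta) = c * d' is fixed, so c shrinks as d' grows.
        "same beta": c_old * d_old / d_new + d_new / 2.0,
    }


# Hypothetical values: a slightly conservative criterion, then d' doubles.
locations = shifted_criteria(k=0.8, d_old=1.0, d_new=2.0)
for rule, location in locations.items():
    print(f"{rule}: criterion at {location:.2f}")
```

With these values the four rules place the criterion at four different strengths, so a statement that bias “did not change” is meaningful only once the measure has been named.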
III. Explaining Response Bias
A definitive choice among the various measures could justify a claim that bias has or has not been changed by an experimental manipulation. Our satisfaction with such a conclusion is likely to be short-lived, though, because it immediately raises the questions of experimental strategy (what principles guided the subject’s decisions?) and of invariance (what decision statistic is the subject attempting to keep fixed?).

A. SEEKING INVARIANTS WITHIN THE DATA
As an example of an invariant, consider the well-known mirror effect. Recognition memory is better (has a higher $d'$) for low-frequency (LF) words than for those of high frequency (HF) (analogous results have been obtained for other stimulus characteristics), and this improved sensitivity arises from the combination of a higher hit rate and a lower false-alarm rate for the LF words. Has “response bias” changed? Figure 3 shows a (somewhat idealized) representation of a typical outcome in which it does not: The criterion falls at the midpoint of the HF lure and target distributions (panel A) and also at the midpoint of the LF distributions (panel B). However, this interpretation is inconsistent with the common assumption that a single strength dimension supports all decisions and that studying an item produces, on average, the same strength increment for all items. One solution to this paradox is to assume that subjects first determine whether the test item is of LF or HF, then evaluate its strength compared only to items of similar frequency. In this example, what is invariant is the criterion location as measured by $c$ or $\beta$.
Fig. 3. The mirror effect. High-frequency (HF) words (panel A) yield a lower hit rate and a higher false-alarm rate than low-frequency (LF) words (panel B). A constant criterion, measured by $c$ or $\beta$, is shown.
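The idealized pattern in Fig. 3 follows directly from placing the criterion midway between the lure and target means (c = 0) for each word class. The following Python sketch assumes the equal-variance Gaussian model; the function name and the two sensitivity values are hypothetical and serve only to show the direction of the effect.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def rates_at_midpoint(d_prime):
    """Hit and false-alarm rates when the criterion is at the midpoint (c = 0)."""
    hit = PHI(d_prime / 2.0)    # targets lie d'/2 above the criterion
    fa = PHI(-d_prime / 2.0)    # lures lie d'/2 below the criterion
    return hit, fa


# Hypothetical sensitivities: LF words are better remembered than HF words.
hf_hit, hf_fa = rates_at_midpoint(1.0)
lf_hit, lf_fa = rates_at_midpoint(2.0)
print(f"HF: H={hf_hit:.3f}, F={hf_fa:.3f}")
print(f"LF: H={lf_hit:.3f}, F={lf_fa:.3f}")
```

Because c = 0 in both classes, the hit rate rises and the false-alarm rate falls together as sensitivity increases, which is the mirror pattern without any change in bias as measured by $c$ or $\beta$.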
B. DIVINING SUBJECT STRATEGY
Can the criterion location selected by a subject be predicted? Treisman and Williams (1984) suggested three general principles. The reference principle determines an initial criterion based on several aspects of the task, such as the expected rates of targets and lures on the test, their strengths, and task demands. The other two principles work to optimize the location of the criterion on a trial-to-trial basis and thus create criterion variability. In the world (but not usually the lab), target events tend to extend over a period of time, so an “old” judgment on trial n should tend to make a target more likely to appear on trial n + 1; therefore, the tracking process shifts the criterion slightly (and temporarily) to create sequential response dependencies. To counteract the tracking process, a stabilization process evaluates the median strength of a set of recently tested items and shifts the criterion toward that value. Recognition memory judgments are not usually evaluated in terms of sequences of responses, so the potential roles of the tracking and stabilization processes are unknown. We shall see later, however, that allowing criterion location to vary can be important theoretically.

Treisman and Williams’ (1984) reference principle is supported by an extensive series of experiments and analyses by Maddox and Bohil (2004), Maddox (2002), and Bohil and Maddox (2001, 2003). Their perceptual classification experiments simultaneously manipulate payoffs, base rates, and discriminability, and their primary conclusion is captured by the acronym COBRA: COmpetition Between Reward and Accuracy. When there is a conflict between payoffs (reward) and accuracy (proportion correct), subjects appear to compromise.

The experiments that support these conclusions differ in many ways from recognition memory studies. The stimuli are perceptual, for example, rectangular bars of various heights. Two artificial, overlapping normal distributions A and B are created with different mean heights. Subjects are aware of the base rates of A and B, and are given explicit payoffs for correct answers.

The most important limitation in extending Maddox and Bohil’s conclusions to memory designs may be the absence of feedback in the vast majority of memory experiments. When feedback is provided, subjects can easily estimate their hit and false-alarm rates and adjust their yes rate (which, in signal-detection models, determines their criterion) so as to balance the proportions of these outcomes and therefore minimize errors. Without feedback, they must rely on subjective memorability to meet their goals. One goal is presumably to follow instructions, but instructions in recognition memory studies are typically vague, merely urging accuracy (and speed without sacrificing accuracy). How might a subject set the criterion so as to follow such instructions?

First, a likely tactic is probability matching.
Few experiments offer payoffs (but see Van Zandt, 2000, Experiment 1), so there is no “competition” that includes rewards. Instead, subjects may be guided by what Maddox and Bohil (2004) call COBRM: COmpetition Between Reward and probability Matching. In their perceptual experiments, COBRM describes behavior in early blocks and a shift to COBRA occurs with learning, but without rewards this transition may not occur. To probability match is to set the rate of responding “A” equal to the base rate of stimuli from distribution A. In most recognition experiments the base rates are equal, so by this principle subjects should try to use the two responses about equally often. When base rates have been manipulated (or subjects have been misinformed about them), the response rates change in the same direction (Rotello, Macmillan, Hicks, & Hautus, 2006; Strack & Forster, 1995; Van Zandt, 2000, Experiment 2).
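To make the probability-matching tactic concrete, the sketch below finds the criterion at which the overall rate of “old” responses equals the base rate of targets. It assumes the equal-variance Gaussian model with the lure mean at 0 and unit standard deviations. The function names, the value of $d'$, and the base rates are all hypothetical, and bisection is simply one convenient way to solve for the criterion; nothing here is meant as a model of how subjects actually carry out the computation.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def yes_rate(k, d_prime, base_rate):
    """Overall probability of an "old" response for criterion k (from the lure mean)."""
    hit = 1.0 - PHI(k - d_prime)
    fa = 1.0 - PHI(k)
    return base_rate * hit + (1.0 - base_rate) * fa


def matching_criterion(d_prime, base_rate, lo=-10.0, hi=10.0, tol=1e-10):
    """Criterion whose yes rate equals the base rate of targets (bisection)."""
    # The yes rate falls monotonically as the criterion rises, so bisection works.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if yes_rate(mid, d_prime, base_rate) > base_rate:
            lo = mid   # too many "old" responses: raise the criterion
        else:
            hi = mid   # too few "old" responses: lower the criterion
    return 0.5 * (lo + hi)


D_PRIME = 1.5  # hypothetical sensitivity

k_equal = matching_criterion(D_PRIME, base_rate=0.5)
c_equal = k_equal - D_PRIME / 2.0   # criterion relative to the crossover point

k_rich = matching_criterion(D_PRIME, base_rate=0.7)
c_rich = k_rich - D_PRIME / 2.0

print(f"base rate 0.5: c = {c_equal:.3f}")
print(f"base rate 0.7: c = {c_rich:.3f}")
```

With equal base rates the matched criterion coincides with the unbiased location, and raising the proportion of targets pushes the matched criterion in the liberal direction, in line with the direction of the base-rate effects cited above.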
An important class of experiments that we discuss below involves a change in either the target or lure distribution. The COBRA and COBRM rules imply that when discriminability changes, the criterion will remain on the same side of the neutral point but move in proportion to the change in sensitivity. If the difference between conditions is not announced to the subjects, information sufficient to infer it is available if feedback is provided. In most memory experiments, there is no such information and the subjects must rely on what can be observed, namely the strengths of the test items.

Hirshman (1995) proposed that criterion setting was largely determined by the item strengths observed during study. His list-strength experiments, like most recognition memory designs, did not vary the parameters considered by Maddox and Bohil. Hirshman’s conclusion that subjects base their criterion setting on the range of strengths that they can recall immediately prior to the recognition test, rather than estimating the mean target–lure difference or attempting to probability match, applies specifically to the list-strength paradigm. In addition, Hirshman counterbalanced order of test condition and did not report any block-to-block changes; thus his data and analyses do not speak to changes that occur between conditions. We incorporate the range of strengths as a possibly useful statistic, but do not rule out the other decision models considered by Hirshman.

This summary leads us to some heuristic expectations of criterion setting when feedback is not provided. Subjects attempt to probability match, setting the “old” rate equal either to the base rates contained in the instructions or, by default, to 0.5. If the base rate is in fact 0.5, this strategy also maximizes accuracy by setting $c = 0$. By this principle, a shift to a new test should change the criterion, but we speculate that this tendency is tempered by two other tactics.
First, subjects are motivated to keep the false-alarm rate from increasing. This intuition is consistent with COBRA if the psychological “reward” of a false alarm is imagined to be strongly negative. Second, subjects display inertia, changing the criterion only if necessary to control the false-alarm rate. These ideas are informal and we offer them only as a framework within which to understand experimental results.

To illustrate their application, however, consider the effect of shifting between test lists of different discriminability. In one case, the task is made harder by using weaker targets (Fig. 4A). If the criterion ($c_L$) is held fixed, the yes rate will decrease. To keep the yes rate constant, however, would require an increase in the false-alarm rate, so we expect no change in criterion. In the other case, the task is made easier by using stronger targets (Fig. 4B). Now a more conservative criterion would lower the yes rate, move toward probability matching, and lower the false-alarm rate, all desirable outcomes. Working against such a change, however, is the inertial principle, and no strong prediction can be made.

We now consider the factors influencing criterion setting, and distinguish several kinds of results. Criterion location has been shown to depend on group membership, on the nature of study or test lists, and on the nature of items within test lists. In the first two cases, we refer to criterion differences, reserving the phrase criterion shifts for the within-list case. Most of our heuristic principles apply equally to all these cases, but the inertial principle has the most force in mixed test lists, where adjusting the yes rate or false-alarm rate would have to be accomplished on a trial-by-trial basis. As a working hypothesis, we predict that such changes will not be made unless the classes of items being tested require very different processing. Even this vague expectation serves to tie together a large class of results.

Fig. 4. Distributions of target and lure memory strengths. Each panel marks the criterion for a fixed $c_L$ and for a constant yes rate. (A) Subjects encounter weaker targets at some point in the test list. Maintaining a constant yes rate would require using a more liberal response criterion that would increase the false-alarm rate. (B) Subjects encounter stronger targets at some point in the test list. Maintaining a constant yes rate would require setting a more conservative criterion that would lower both the hit and false-alarm rate. Here, though, accuracy improves even if the subject does not change the criterion.
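The two list-shift cases of Fig. 4 can be worked through numerically. The Python sketch below compares holding $c_L$ fixed with keeping the yes rate constant when the targets on a second list are weaker or stronger than those on the first. It assumes the equal-variance Gaussian model, equal numbers of targets and lures, and an unbiased starting criterion, so that the constant-yes-rate criterion is simply the midpoint between the lure mean and the new target mean. The sensitivity values and all names are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def rates(k, d_prime):
    """Hit, false-alarm, and yes rates for criterion k measured from the lure mean.

    Assumes equal-variance Gaussians and equal numbers of targets and lures.
    """
    hit = 1.0 - PHI(k - d_prime)
    fa = 1.0 - PHI(k)
    return hit, fa, 0.5 * (hit + fa)


# Hypothetical sensitivities for the first list and the two possible second lists.
D_FIRST, D_WEAKER, D_STRONGER = 1.5, 1.0, 2.0
k_first = D_FIRST / 2.0  # unbiased start: the yes rate on the first list is 0.5

summary = {}
for label, d_new in (("weaker", D_WEAKER), ("stronger", D_STRONGER)):
    # Option 1: hold c_L fixed, that is, leave the criterion where it was.
    fixed = rates(k_first, d_new)
    # Option 2: keep the yes rate at 0.5, which (under the assumptions above)
    # puts the criterion midway between the lure mean and the new target mean.
    matched = rates(d_new / 2.0, d_new)
    summary[label] = {"fixed_cL": fixed, "constant_yes_rate": matched}
    print(label, "targets")
    print("  fixed c_L:         H=%.3f F=%.3f yes=%.3f" % fixed)
    print("  constant yes rate: H=%.3f F=%.3f yes=%.3f" % matched)
```

Under these assumptions the weaker-target case reproduces the conflict described in the text: a fixed criterion lowers the yes rate, and restoring the yes rate raises the false-alarm rate. In the stronger-target case a constant yes rate lowers both the hit and false-alarm rates.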
IV. Between-Group Criterion Differences
It is common to ask whether different types of subjects have better or poorer memory sensitivity; an equally reasonable question is whether such groups employ different decision criteria when responding to the same stimuli. The answer to the latter is clearly affirmative. For example, older adults have been shown to set more liberal response criteria than younger adults (Benjamin, 2001, Experiment 1) and amnesics set more liberal criteria than control subjects (Dorfman, Kihlstrom, Cork, & Misiaszek, 1995). Subjects tend to set more conservative criteria for easier tasks (Hirshman, 1995), which could easily explain these between-group differences: Amnesics and older adults have poorer memory, which would make the task more difficult for them and a liberal criterion more appropriate because it increases their hit rate.

For other groups, the memory sensitivity effects are less obvious but criterion differences exist nonetheless: Patients with panic disorder use a more liberal criterion than controls (Windmann & Krüger, 1998), as do dementia patients (Woodard, Axelrod, Mordecai, & Shannon, 2004). Women who claim to have recovered previously forgotten memories of sexual abuse have a more liberal response criterion in the Deese–Roediger–McDermott (DRM) paradigm than controls, whereas women who have always remembered such abuse have a more conservative criterion (Clancy, McNally, Schacter, & Pitman, 2000). Similarly, individuals who either reported memories of having been abducted by aliens or who thought they had been abducted despite the absence of any explicit memories of the experience had more liberal response criteria than controls (Clancy, McNally, Schacter, Lenzenweger, & Pitman, 2002).
V. Between-List Criterion Differences
Considering the more typical subject population of healthy college students, it is easy to find examples of criterion shifts that occur between different test lists. For example, subjects are sensitive to changes in task demands in roughly the manner expected from Maddox and Bohil’s modeling. They set a more liberal criterion when the ratio of studied items to lures on the test is greater, whether they are (Van Zandt, 2000) or are not informed of the ratio changes (Heit, Brockdorff, & Lamberts, 2003). The consequence of this liberal shift is a greater yes rate. Subjects also establish a more liberal criterion when they are misled into believing that the proportion of studied items on the test is increased (Rotello et al., 2006; Strack & Forster, 1995). Few memory experiments use payoff schemes, but Van Zandt (2000, Experiment 2) showed that subjects are responsive to manipulations of the reward structure and set their criteria accordingly. Subjects also set their decision criteria in response to experimental instructions: A number of studies have demonstrated that subjects are more willing to say “old” if they are instructed to “guess old” when uncertain, for example (Postma, 1999; Rotello et al., 2006).

Criterion location is also affected by between-list manipulations of study strength or task difficulty. Benjamin (2005) asked some subjects to make recognition judgments on studied items and lures from the same semantic classes as the targets, whereas other subjects identified studied items amid lures from different (unstudied) semantic classes. Because the constant element across conditions was the studied items, the criterion location was measured by the hit rate ($c_T$). Subjects who were given the easier task set a more conservative criterion, producing a lower hit rate, than subjects who took the harder test. Using the same materials in a within-subject design, Benjamin and Bawa (2004) showed that subjects shift their criterion between lists when their first test list is easy and the second is hard, but not if the first test is hard and the second is easy.

In a list strength design in which subjects are tested after studying words that were all weak (studied once each or at a fast presentation rate), all strong (studied repeatedly or at a slow presentation rate), or a mixture of strong and weak items, subjects consistently set $c_L$ more conservatively on tests of strong rather than weak items (Balakrishnan & Ratcliff, 1996; Hirshman, 1995; Stretch & Wixted, 1998a). This observation is consistent with the subject’s presumed goal of maximizing accuracy by balancing false alarms to lures against missed targets. In addition, the false-alarm rate is greater for the weak items in the pure (weak-only) list than for weak items in the mixed lists, and is lower for the strong items on a pure-strong list than for strong items on a mixed list (Hirshman, 1995).
All of these results point to criterion settings that are responsive to the average strength of test lists, being more conservatively placed on easier tests (yielding a lower false-alarm rate) and more liberally placed on harder tests (yielding a higher false-alarm rate).

Hirshman (1995, Experiment 4) also showed that the study conditions are sufficient to allow a criterion to be set: Subjects studied either a mixed list or a pure weak list, and were tested on the weak items only. Despite being tested on identical items, subjects who had studied the mixed list set a higher criterion ($c_L$) that resulted in fewer false alarms. Hirshman’s explanation was that the subjects estimated the range of the underlying memory strengths of studied items by covertly recalling a few items prior to the recognition test. The presence of stronger items on the study list would lead the subject to expect a greater range of strengths at test (i.e., a higher target distribution), and thus a higher criterion would be needed to maximize accuracy. In contrast, Verde and Rotello (2007) demonstrated that, when the study conditions are held constant across conditions, the criterion is set by the strength of the first few items on the test list: Being tested initially on strong items resulted in a higher criterion (lower false-alarm rate) than being tested initially on weak items. Their result suggests that the subjects’ predictions about the strength of the test items, which help to locate the criterion, can be revised over the very short term.
VI. Within-Test Criterion Shifts

A. EVIDENCE AGAINST SHIFTS: STRENGTH MANIPULATIONS
The results we have described so far are, in some ways, fairly obvious. After all, a cognitive system that could not set the decision criterion in response to encoding or testing conditions would be quite inefficient. A more interesting question is how adaptive that response criterion is to condition changes over the shorter term: Can we adjust our criterion on the fly, to respond optimally to different kinds of test items presented in a single list?

Several striking failures to observe strength-based criterion shifts have been reported. Stretch and Wixted (1998b) repeated one set of study words five times each; these were colored red. A second set of study words, randomly mixed with the red words, were presented once each in blue. At test, accuracy could be maximized by setting a higher criterion for the stronger items than for the weaker items. Across five experiments that made the strength manipulation increasingly obvious to the subjects, no shifts in $c_L$ were observed: The false-alarm rate for lures presented in the color associated with the stronger items (red) equaled that for lures tested in the color (blue) associated with the weaker study items. Similarly, Benjamin (2001) gave subjects DRM lists to study; some of those lists were repeated three times. The false-alarm rate to the noncritical lures did not differ with list strength, consistent with a constant decision criterion being used throughout the test list. Verde and Rotello (2004b) used an associative recognition task in which some pairs were strengthened via repetition; both strong and weak pairs were tested in a random order, along with rearranged pairs (lures made from studied words). The false-alarm rate to weak and strong pairs was the same, revealing no evidence of a criterion shift in response to stimulus strength.

In the most dramatic demonstration of this basic result, Morrell, Gaitan, and Wixted (2002) made the strong and weak classes obvious: Strong and weak items differed in both content and format (names of professions vs pictures of birds). Still, subjects showed no evidence that they shifted their criterion to be more conservative for the stronger class of stimuli, thus missing an opportunity to improve their accuracy: The false-alarm rate to lures from the weaker class was the same as to lures from the stronger class.
One explanation for this reluctance to shift decision criterion on a trial-by-trial basis is, as suggested earlier, inertial: Cognitive effort is required to constantly adjust criteria (Morrell et al., 2002; Stretch & Wixted, 1998b). An alternative view is that changing the criterion takes some amount of time, so that subjects’ criterion shifts continually lag behind the actual changes in task difficulty. Support for the lag hypothesis comes from Brown and Steyvers (2005), who asked subjects to perform a speeded lexical decision task in which the nonwords were either relatively easy or relatively difficult to distinguish from actual words. These conditions are analogous to the strong and weak conditions, respectively, studied by Wixted and colleagues (without the memory component, of course), except that the easy and hard conditions appeared on the test list in alternating blocks (that were not identified for the subjects). Modeling work revealed that subjects did shift their criteria in response to the changing test conditions, but that the criterion shift lagged behind the task shift by an average of 14 test trials.

A similar study using a recognition task was recently reported by Verde and Rotello (2007). They manipulated the strength of different classes of study items, like Stretch and Wixted (1998b), but controlled the test order so that all the strong items were presented before the weak items (Experiments 1–3 and 5), or all the weak items were tested before the strong ones (Experiment 4). The boundary between the easy and hard blocks was not distinguished for the subjects, and the blocks were fairly long (80 items each) so that the criterion would have sufficient time to catch up. The false-alarm rates were equal in the strong and weak blocks in four of the five experiments.
Thus, it does not appear that subjects attempted to shift their criterion cL in accordance with task difficulty: If they had, even at some lag after a shift between blocks, Verde and Rotello would have detected a change in the false‐alarm rate. The only experiment in which a criterion shift was detected was one in which subjects were given feedback on the accuracy of the recognition decisions (Experiment 5), presumably because feedback gave the subjects important information about the trade‐off between their false‐alarm and miss rates. In summary, in the absence of feedback to the subject, no evidence has been found for criterion shifts that occur in response to a selective strengthening of one class of studied items in memory relative to another. The false‐alarm rate to the weaker class of stimuli equals that to the stronger class, regardless of whether those classes are randomly intermixed (Morrell et al., 2002; Stretch & Wixted, 1998b) or blocked at test (Verde & Rotello, 2007). One might conclude that criterion shifts do not occur within lists. After all, subjects in these experiments would have performed better (had higher accuracy levels) if they had shifted their criterion, and Wixted and his colleagues took pains to ensure that their subjects understood the memory
improvements that come with greater strength. As we describe next, however, there is actually a great deal of evidence for criterion shifts that occur within a single test list.
B.
EVIDENCE FOR SHIFTS: PROCESSING AND STIMULUS MANIPULATIONS
The observation that subjects do not shift their recognition criterion within list in response to changes in the strength of the test items has often been over‐interpreted to imply that subjects do not make such shifts in response to any experimental or stimulus factor. For example, Diana, Reder, Arndt, and Park (2006) asserted that ‘‘if participants do not shift their criterion within list even under these [Stretch & Wixted’s] very encouraging circumstances, it is unlikely that they will do so under the types of conditions used by [other researchers]’’ (p. 2). Contrary to this plausible inference, there is considerable evidence for within‐list criterion shifts.
1.
Two Processing Effects: Processing Time and Revelation
In the response‐signal paradigm, subjects are asked to make their recognition decisions within a narrow time window after a signal‐to‐respond; the timing of the signal is unpredictable, occurring at a variety of shorter (e.g., 100 and 300 ms) and longer (1000 and 2000 ms) lags after the onset of the test probe. Thus, there is a variable amount of time during which a stimulus may be processed before a decision is required. Virtually every response‐signal experiment has shown that the false‐alarm rate declines with processing time. Because information is continually extracted from the memory probes during processing, these false‐alarm rate changes may reflect changes in the form of the distribution with time rather than changes in the location of the decision criterion (Lamberts, Brockdorff, & Heit, 2003). However, a clever experiment by Heit et al. (2003) clearly showed that subjects do shift their criterion across signal lags. They used a response‐signal paradigm in which the ratio of lures to studied items either increased, decreased, or was constant with lag. Although memory sensitivity was equal across these three conditions, subjects’ response criterion (measured with c and compared across conditions for a particular response‐signal lag) shifted in correspondence with the proportion of studied items tested at that lag. As the ratio of new to old fell, the criterion became more liberal; as the ratio rose, so too did the criterion. Contrary to the claims of Wixted and colleagues (Morrell et al., 2002; Stretch & Wixted, 1998b), subjects in the Heit et al. experiment were anything but reluctant to shift their criterion; they did so on a trial‐by‐trial basis while under severe time pressure.
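For readers who want to see how a pattern of equal sensitivity and unequal bias appears in the standard indexes, here is a minimal sketch using the usual equal‐variance formulas, d′ = z(H) − z(F) and c = −[z(H) + z(F)]/2. The hit and false‐alarm rates are hypothetical and are not data from Heit et al. (2003).

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # inverse of the standard normal CDF


def d_prime(hit_rate, fa_rate):
    """Sensitivity under the equal-variance Gaussian model."""
    return z(hit_rate) - z(fa_rate)


def criterion_c(hit_rate, fa_rate):
    """Criterion location c, measured from the midpoint between the
    lure and target means; negative values are liberal."""
    return -0.5 * (z(hit_rate) + z(fa_rate))


# Hypothetical (hit, false-alarm) rates at one signal lag, for conditions
# in which targets are common versus rare at that lag.
targets_common = (0.80, 0.40)
targets_rare = (0.60, 0.20)

for label, (h, f) in (("common", targets_common), ("rare", targets_rare)):
    print(f"targets {label}: d'={d_prime(h, f):.2f}, c={criterion_c(h, f):+.2f}")
```

In this made‐up example the two conditions have the same d′, but c is negative (liberal) when targets are common and positive (conservative) when they are rare.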
Another example of a processing manipulation that results in a criterion shift is the revelation effect, which we discussed earlier. In revelation experiments, the trials on which revelation does and does not occur are randomly intermixed, suggesting that subjects might be changing their response criterion on a trial‐by‐trial basis. Indeed, Niewiadomski and Hockley (2001) and Verde and Rotello (2003, 2004a) demonstrated that the revelation effect is often due to a liberal shift in the response criterion that occurs within list.
2.
Two Stimulus Effects: Emotional Valence and Subjective Memorability
Intuitively, emotional events in our lives are remembered better than neutral, nonemotional events. This intuition is an assumption in the emotion literature, and the data often support that view (Phelps, LaBar, & Spencer, 1997). Emotionally valenced stimuli, particularly negative stimuli, are often recalled more easily than neutral stimuli, but intrusion data are typically not reported. To evaluate memory sensitivity effects, we must understand the role of response bias. A possible interpretation of those data is simply that subjects are more willing to report negative events, both true and false, and data from recognition tasks support that view. In recognition, memory sensitivity is sometimes enhanced for negatively valenced stimuli, but that effect is not always observed (Maratos, Allen, & Rugg, 2000; Windmann & Krüger, 1998; Windmann & Kutas, 2001). In contrast, all recognition studies show clear effects of emotion on response bias: Subjects have a more liberal response criterion for negatively charged stimuli than for neutral or positively valenced items (Dougal & Rotello, in press; Windmann & Krüger, 1998; Windmann & Kutas, 2001; Windmann, Urbach, & Kutas, 2002). These experiments presented the different classes of stimuli (emotional or not) in a random order within a single test, again demonstrating that subjects can and do change their response criterion on a trial‐by‐trial basis.
Wixted (1992) asked subjects to study HF, LF, and rare words (those that occurred less than once in 7 million words). Compared with the HF words, the rare words elicited both a higher hit rate and a higher false‐alarm rate; these data are consistent with a liberal criterion shift that is made on a trial‐by‐trial basis. Such a criterion shift is an inappropriate response to those rare words (it increases the false‐alarm rate), but it might occur if subjects thought that rare words would be easier to remember than more common words. [Stretch and Wixted (1998b) also make this point.]
Relatedly, subjects set a more liberal criterion for saying ‘‘old’’ to common rather than bizarre test pictures (Worthen & Eller, 2002), perhaps reasoning that the bizarre stimuli would be easier to recognize and thus justifying a stricter criterion for those items. A more extreme example of the role of subjective memorability comes
from research on stimuli that subjects are ‘‘sure’’ they would remember had they been studied: Those items rarely elicit false alarms, as if the criterion were shifted very high (Brown, Lewis, & Monk, 1977; Strack & Bless, 1994; but see Rotello, 1999).
VII.
An Interim Summary
General evolutionary principles can be invoked to account for some of the effects summarized so far. For some limited types of stimuli, automatic criterion shifts might be expected, even on a trial‐by‐trial basis. For example, a low criterion that allows false alarms to negatively valenced or fear‐producing stimuli, but few misses, may have survival benefits (LeDoux, 1986). In contrast, there is little survival advantage to either missing or false‐alarming to positive or neutral stimuli. This line of theorizing is limited: It stretches credulity to invoke an adaptive rationale for criterion shifts to revelation‐style tasks, or to changes in the nature of the lures. Instead, we have found support for some heuristic principles, proposed earlier, for establishing and adjusting response bias. (1) Subjects place their criteria so as to match their yes rate to the known or implied base rate of targets at test. (2) Subjects take into account the range of strengths they observe, making more ‘‘old’’ responses when the average familiarity of probes is high than when it is low. Together, these heuristics tend to maximize accuracy by balancing false‐alarms and misses. (3) When the experimental situation changes, criteria may shift in accordance with these same principles, provided that subjects are aware that a change has occurred and are sufficiently motivated to overcome the inertia principle. Awareness of changed conditions could come about through feedback from the experimenter, or when there is an obvious change in the nature of the task: Lures that come from a different semantic class (Benjamin & Bawa, 2004), altered properties of the test stimuli (such as emotionality), or new aspects of the processing task (revelation tasks, processing time). Although bias shifts have been observed in all these situations, they seem to be tempered by resistance to increasing the false‐alarm rate and by a more general inertia.
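Heuristic (1) can be made concrete with a minimal sketch that finds the criterion at which the overall rate of ‘‘old’’ responses equals the base rate of targets. It assumes the equal‐variance Gaussian model (lures N(0, 1), targets N(d′, 1)); the function names, the d′ value, and the base rates are hypothetical choices for illustration.

```python
from statistics import NormalDist

PHI = NormalDist().cdf


def yes_rate(criterion, d_prime, p_target):
    """Overall proportion of "old" responses on a test list."""
    hit_rate = 1.0 - PHI(criterion - d_prime)
    fa_rate = 1.0 - PHI(criterion)
    return p_target * hit_rate + (1.0 - p_target) * fa_rate


def matching_criterion(d_prime, p_target, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection search for the criterion whose yes rate equals p_target.

    The yes rate falls monotonically as the criterion rises, so the
    bracket [lo, hi] contains a single solution for 0 < p_target < 1.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if yes_rate(mid, d_prime, p_target) > p_target:
            lo = mid  # too many "old" responses: move the criterion up
        else:
            hi = mid
    return 0.5 * (lo + hi)


for p in (0.25, 0.50, 0.75):
    k = matching_criterion(d_prime=1.5, p_target=p)
    print(f"P(target)={p:.2f}: criterion={k:.3f}")
```

With equal numbers of targets and lures the matching criterion sits at d′/2, the unbiased location; it moves down (more liberal) as targets become more common.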
Another factor that seems like it should matter is memory strength. Our survey reveals, however, that although subjective memorability influences criterion, real memorability (item strength) does not (Verde & Rotello, 2007). Perhaps the strength effects in the literature are too weak to be noticed, but d′ differences alone are not good predictors of within‐list criterion changes: Singer and Wixted (2006) pointed out that they and others (Morrell et al., 2002) used large differences in memory strength yet observed no shift in criteria.
VIII.
Distribution Shifts Masquerading as Criterion Shifts
To this point, we have interpreted positively correlated changes in hit and false‐alarm rates as criterion shifts, assuming that the distributions remain fixed (Fig. 5A). The reverse interpretation, however, is often equally possible: The strength of one or both of the underlying memory distributions may shift while the criterion remains fixed (Fig. 5B). If only the lure distribution shifts, then the false‐alarm rate will change in the absence of a hit rate change, and vice versa if only the target distribution shifts. Although these alternatives are substantively very different, it is not possible to distinguish between them in the absence of a set of test items that is truly constant across conditions (e.g., a set of completely unrelated lures). Instead, arguments for one or the other interpretation are typically based on parsimony. We have already described one empirical phenomenon that has been interpreted either as a criterion shift or as a distribution change: the mirror effect. Next, we describe two effects that have been interpreted as response‐bias differences but could also be taken as evidence for distribution shifts.
Fig. 5. The equivalence of criterion‐shift and distribution‐shift interpretations of data. (A) Two conditions differ only in criterion location. (B) Both target and lure distributions shift upward by the same amount; the criterion does not move. Performance in panels (A) and (B) is identical.
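The equivalence of the two interpretations can be checked numerically. The following minimal sketch assumes unit‐variance Gaussian distributions; the means, the criterion, and the size of the change (DELTA) are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

PHI = NormalDist().cdf


def rates(lure_mean, target_mean, criterion):
    """Return (hit rate, false-alarm rate) for unit-variance Gaussians."""
    hit = 1.0 - PHI(criterion - target_mean)
    fa = 1.0 - PHI(criterion - lure_mean)
    return hit, fa


DELTA = 0.4  # hypothetical size of the change between conditions

baseline = rates(lure_mean=0.0, target_mean=1.5, criterion=1.0)
# Interpretation A: distributions fixed, criterion drops by DELTA.
criterion_shift = rates(lure_mean=0.0, target_mean=1.5, criterion=1.0 - DELTA)
# Interpretation B: criterion fixed, both distributions rise by DELTA.
distribution_shift = rates(lure_mean=DELTA, target_mean=1.5 + DELTA, criterion=1.0)

print("baseline          ", baseline)
print("criterion shift   ", criterion_shift)
print("distribution shift", distribution_shift)
```

Only the distance between the criterion and each mean enters the computation, so the two scenarios yield the same hit and false‐alarm rates, both higher than at baseline.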
A.
STUDY‐TEST DELAY
Study‐test delays have been found to influence the false‐alarm rate, even when the delays are manipulated within a single test list, and these changes in the false‐alarm rate have been interpreted as criterion shifts (changes in cL). For example, Singer, Gagnon, and Richards (2002) asked their subjects to read two different sets of stories, separated by a delay. After subjects read the second set of stories, a recognition test was given on brief sentences from each story, paraphrases of studied sentences, and new but related sentences. The results were clear: Subjects had a higher false‐alarm rate when responding to sentences and lures from the stories studied first (i.e., those tested after a delay). Singer and Wixted (2006) adapted this methodology to a more standard memory paradigm in which two lists of words were studied, again separated by a delay. The words on each list were exemplars of particular semantic categories (e.g., birds, diseases). Items from both study lists were mixed in a random order on the same test list, along with new exemplars from studied categories. At shorter interstudy delays (40 min), no evidence for differences in the false‐alarm rates was found. With a two‐hour interstudy delay, however, the false‐alarm rate to the items from the delayed‐test condition was greater than in the immediate‐test condition. In both papers, Singer and his colleagues interpreted the false‐alarm rate difference as a within‐list criterion shift, but it is equally plausible that the familiarity of the targets and their corresponding lures declined with delay. Because the lures were directly tied to particular study stories or, by virtue of their semantic category, to the study list, the familiarity of the targets and lures would be expected to decline together.
B.
CONTEXT EFFECTS
A similar explanation can account for changes in the ‘‘old’’ response rate in experiments that manipulate the match between study and test contexts. A number of recognition experiments have posed the question of whether testing items in the same context (background color, font, screen location, voice of speaker, etc.) as that in which they were studied increases memory sensitivity. Although some studies have found sensitivity benefits, many have found what appear to be bias effects instead: Subjects are more willing to say ‘‘old’’ to studied items in a studied context and to lures tested in a studied context (Dougal & Rotello, 1999; Feenan & Snodgrass, 1990; Goh, 2005; Murnane & Phelps, 1994, 1995; Murnane, Phelps, & Malmberg, 1999). In these experiments, the same‐context probes are mixed randomly with the different‐context probes at test. Thus, one might interpret the greater ‘‘old’’ rate to same‐context probes as a trial‐by‐trial shift in the response criterion. However, the presence of matching context features increases the familiarity
of both targets and lures; in the absence of criterion shifts, this familiarity enhancement increases both hits and false alarms. Singer and Wixted (2006) also interpreted their study‐test delay results as a kind of context effect, noting greater encoding‐test match in the immediate testing than the delay condition.
C.
IDENTIFYING THE CHANGE
As we suggested earlier, a relatively simple method for disentangling distribution shifts from criterion shifts is to create an ‘‘anchor’’ distribution that is common across all conditions. Criterion shifts may then be evaluated against this ‘‘fixed’’ distribution; this is the method adopted by Benjamin and Bawa (2004) and Verde and Rotello (2007), for example. Although this method appears straightforward, it is not always possible. Consider the effects of study‐test delay: It is not possible to have a common target distribution because, by definition, those targets were studied at different points in time. It is also not possible to have a common lure distribution, unless there is a single (unrelated) set of lures. Of course, with only one set of lures and a single test list, only a single false‐alarm rate is observable. An alternative solution, based on event‐related potential (ERP) signals of criterion shifts, is more speculative. The prefrontal cortex has been found to play an important role in decision making (Bechara, Damasio, & Damasio, 2000; Bechara, Damasio, Damasio, & Lee, 1999). Windmann and Kutas (2001) found that the ERP signal for hits and false alarms differed according to the emotional valence of the memory probe. The effects were seen very early in processing (300 ms) and over prefrontal sites. They argued that the frontal cortex may be responsible for relaxing the response criterion for negative stimuli. Windmann et al. (2002) also asked subjects to recognize words of different emotional valences while ERP signals were recorded; they then divided subjects into liberal and conservative groups based on their ‘‘old’’ response rates. The ERP signals supported Windmann and Kutas’s assertion that the prefrontal cortex plays a role in setting the decision criterion, and that it does so early (300–500 ms poststimulus) for these emotional stimuli: The largest differences in the ERP signals of conservative and liberal subjects were observed there.
If these results generalize, then it might be possible to distinguish criterion shifts from distributional shifts based on the early frontal ERP signals.
IX.
Designs with Multiple Responses
It is a truism of signal‐detection theory that binary responses provide less information to the experimenter than the subject has available. Most of the experiments described so far require a simple choice between ‘‘old’’ and
‘‘new,’’ and the corresponding models contain a single criterion. In this section and the next, we expand our discussion to multiresponse paradigms.
A.
THE REMEMBER–KNOW PARADIGM
Our modest first stop on this journey is the three‐response remember–know design (Tulving, 1985), in which ‘‘old’’ responses are divided by the subject into those that are explicitly ‘‘remembered’’ and those that are merely familiar, or ‘‘known.’’ As with any other design, evaluation of response bias in the remember–know paradigm requires a model. We employ the one‐dimensional SDT model sketched in Fig. 6 (Donaldson, 1996; Wixted & Stretch, 2004). Targets and lures generate distributions in SDT fashion; a high criterion separates remember from know responses and a lower criterion separates know from new responses. In this model, then, remembers are viewed simply as high‐confidence old judgments. The first thing to notice about this representation is its claim that subjects can adopt two criteria—two levels of response bias—simultaneously. Indeed, part of the appeal of the remember–know paradigm is that, unlike other putative methods of separating explicit and implicit memory processes, it does not require multiple testing conditions. The ability to make comparisons between two simultaneously maintained biases avoids the possible confounds that arise when measuring criterion shifts across trials or conditions. Evidence that subjects can indeed hold multiple criteria in place has been provided by rating experiments, which we discuss in the next section.
Fig. 6. The equal‐variance one‐dimensional model of remember–know judgments. Targets and lures differ in average strength. To decide whether a memory probe is remembered, subjects use a high criterion (r); the lower criterion (k) separates known items from those judged to be new.
The one‐dimensional model allows us to ask some interesting questions about remembering and knowing. How do the criteria shift in response to changes in the task conditions? If subjects are asked to shift one of the criteria—say the old–new bound—how do the other criteria react, if at all? Subjects do shift their old–new bound under biasing instructions, and this change is accompanied by a shift in the remember–know bound (Rotello et al., 2006). Similarly, subjects set a more conservative criterion for saying they ‘‘remember’’ a particular stimulus when a narrow, restrictive definition of remembering is provided. This shift in the remember–know bound produces a corresponding numerical (but statistically nonsignificant) shift in the old–new criterion (Rotello, Macmillan, Reeder, & Wong, 2005). The most popular use of the remember–know design is to uncover interactions involving the remember and know response rates (Gardiner & Richardson‐Klavehn, 2000). For example, Gregg and Gardiner (1994) found that auditory and visual presentations led to about the same remember rate (0.10 for auditory, 0.11 for visual) but very different know rates (0.27, 0.52). Their two‐process conclusion was that presentation modality affected familiarity but not recollection. No mention is made of response bias in this interpretation, and it was implicitly understood that these are sensitivity effects. However, Dunn (2004) applied the one‐dimensional, equal‐variance model and found that (1) overall sensitivity was somewhat greater for visual presentations, and (2) the remember criterion measured as cT remained fixed (1.26 vs 1.22), yielding a constant remember rate; but the old–new criterion decreased (0.33 vs −0.33), producing the increase in knows. Figure 7A illustrates Dunn’s modeling, and supports the conclusion that Gregg and Gardiner’s results are primarily due to response bias.
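The logic of this reanalysis can be sketched directly from the reported response rates. The sketch below assumes that cT denotes the distance of a criterion above the target mean (in target standard deviation units), so that cT = −z(proportion of targets above the criterion), and that the remember and know rates quoted above are rates for studied items. Both are assumptions made for illustration; values computed from the rates alone need not match Dunn’s (2004) fitted estimates exactly.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf


def criterion_from_target_mean(response_rate):
    """Distance of a criterion above the target mean, in target SD units,
    given the proportion of targets that fall above that criterion."""
    return -z(response_rate)


# Remember and know rates reported by Gregg and Gardiner (1994).
rates = {"auditory": (0.10, 0.27), "visual": (0.11, 0.52)}

estimates = {}
for modality, (remember, know) in rates.items():
    old = remember + know  # every "remember" or "know" response is an "old"
    estimates[modality] = (
        criterion_from_target_mean(remember),  # remember-know criterion
        criterion_from_target_mean(old),       # old-new criterion
    )
    r_crit, o_crit = estimates[modality]
    print(f"{modality}: remember criterion={r_crit:.2f}, old-new criterion={o_crit:.2f}")
```

Under these assumptions the remember criterion barely moves between modalities, whereas the old–new criterion falls from above the target mean (auditory) to below it (visual).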
In a second data set examined by Dunn (2004), Gardiner and Java (1990) found that words were remembered more often than nonwords (0.28 vs 0.19) but known less often (0.16 vs 0.30); they concluded that words tended to be recognized more on the basis of recollection, nonwords more on the basis of familiarity. The one‐dimensional model (Fig. 7B) leads to a very different conclusion: Both sensitivity (0.95 vs 1.01) and the old–new criterion (measured as cL = 1.07 vs 1.04) were about the same for the two stimulus categories, but the remember criterion (cL = 1.55 vs 1.89) was much lower for words. In this case, the effect is almost entirely bias driven: Subjects are less willing to use the ‘‘remember’’ response for strange nonwords than for words they have encountered outside the laboratory. Dunn’s analyses bear only weakly on whether the one‐dimensional model is correct (although the fits of the model are excellent), but make clear the importance of using models to interpret remember–know data. In fact, other models that have been applied to these data—including a model that attempts to capture the conventional, direct interpretation
Fig. 7. One‐dimensional model‐based interpretations of remember–know dissociation experiments. The estimated locations of target distributions, old–new, and remember–know decision criteria are shown for two experiments. (A) Auditory versus visual stimulus presentation. (B) Word versus nonword stimuli. For both experiments, the model accounts for differences between conditions in terms of response bias.
(Murdock, 2006)—also reveal a mixture of sensitivity and bias effects (Macmillan & Rotello, 2006). The existence of response bias in remember–know experiments can be neither demonstrated nor rejected without an explicit model.
B.
CONFIDENCE RATINGS
Many recognition memory experiments use confidence ratings instead of, or as a supplement to, the old–new judgment. The response options may be described verbally (e.g., they may range from ‘‘sure new’’ to ‘‘sure old’’) or be numerical values (often from 1 to 6). The SDT interpretation is that the subject establishes n – 1 criteria to divide the decision axis into n regions, assigning observations above the highest criterion to the response indicating the most confidence that the test item is old, observations between the top two criteria the next response, and so forth. If this interpretation is correct,
then ratings are an extremely efficient way to observe multiple criterion settings. Rating data can be used to construct ROC curves, which provide several advantages in interpreting response bias. First, as noted earlier, the promise of SDT to separate sensitivity and bias cannot be fulfilled with binary data, because the relative variance of the target and lure distributions cannot be determined. The rating design allows this parameter to be estimated. Second, the form of the ROC limits the possible representations, so that Gaussian models (with which most empirical ROCs are consistent) can be distinguished from threshold and hybrid models (which are sometimes supported). Third, when bias varies across conditions in a rating experiment, sensitivity and bias can be distinguished in a model‐free manner. As an example, consider the data of Dougal and Rotello (in press), shown in Fig. 8. Notice that the curves for the negative, neutral, and positive stimuli all fall at the same height in the space, reflecting equal memory sensitivity. However, the operating points for the negative items each fall to the upper‐right of the others, reflecting a more liberal response bias (higher H and F). Throughout this chapter we have referred to criteria as points on the strength axis, but as noted earlier they may also be viewed as having different values of likelihood. Likelihood ratio is the optimal decision variable (Green & Swets, 1966), but in applications where sensitivity is fixed it is monotonic with, and thus indistinguishable from, strength [see Eq. (3)]. Varying sensitivity provides a tool for deciding between these measures, but the simple old–new design is not rich enough for a convincing decision.
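As a minimal sketch of how rating responses become ROC operating points, the code below cumulates response counts from the most confident ‘‘old’’ category downward and then estimates the zROC slope by least squares. The counts are hypothetical, and the least‐squares fit is a simplification chosen for brevity rather than the fitting method used in any study cited here.

```python
from itertools import accumulate
from statistics import NormalDist

z = NormalDist().inv_cdf


def roc_points(target_counts, lure_counts):
    """Cumulate rating counts from the most confident "old" category down.

    Counts are ordered from "sure old" to "sure new". The final cumulative
    point is always (1, 1), so it is dropped, leaving n - 1 operating points.
    """
    n_targets, n_lures = sum(target_counts), sum(lure_counts)
    hits = [c / n_targets for c in accumulate(target_counts)][:-1]
    fas = [c / n_lures for c in accumulate(lure_counts)][:-1]
    return list(zip(fas, hits))


def zroc_slope(points):
    """Least-squares slope of z(H) on z(F)."""
    zf = [z(f) for f, _ in points]
    zh = [z(h) for _, h in points]
    mean_f, mean_h = sum(zf) / len(zf), sum(zh) / len(zh)
    sxy = sum((f - mean_f) * (h - mean_h) for f, h in zip(zf, zh))
    sxx = sum((f - mean_f) ** 2 for f in zf)
    return sxy / sxx


# Hypothetical counts for a 6-point scale, "sure old" first.
targets = [60, 40, 30, 25, 25, 20]
lures = [10, 20, 30, 40, 50, 50]

points = roc_points(targets, lures)
print(points)
print(f"zROC slope: {zroc_slope(points):.2f}")
```

A six‐category scale yields five operating points, one for each of the n − 1 criteria, and the slope of the z‐transformed points estimates the ratio of the lure to the target standard deviation.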
Fig. 8. Receiver operating characteristic (ROC) data from Dougal & Rotello (in press, Experiment 1b), plotting hit rate against false‐alarm rate. The filled circles denote the observed responses to neutral stimuli, the minus signs denote the responses to emotionally negative stimuli, and the plus signs indicate the responses to emotionally positive words. Reprinted with permission of the Psychonomic Society.
Rating experiments can in principle resolve this ambiguity. In a list‐strength rating design, for example, three models have been proposed for how criteria shift between lists differing in strength. According to the lockstep model (also called the ‘‘distance from criterion’’ view), the spacing between criteria is maintained: Shifting the location of one criterion shifts all of them by the same amount. The likelihood ratio rule assumes that the criteria are placed in such a way as to maintain their relative locations in likelihood terms. That is, if ‘‘sure old’’ was the label used when the relative heights of the distributions were at least 6:1, then that criterion location would remain at 6:1 under different strength conditions. This model requires that the criteria spread out on the strength axis as discriminability decreases. Finally, the range model assumes that subjects try to place their criteria so as to span the same range of the distributions, yielding the same proportions of high‐confidence misses and false alarms as strength changes: As discrimination decreases, the criteria must be placed more closely together. Hirshman (1995) endorsed this range model, but he did so based only on binary yes–no response rates. Although these models make strikingly different predictions about criterion placement in the face of changing memory sensitivity, the existing data do not clearly support one model over another. One type of analysis, using group‐level data, involves the calculation of the location of each criterion relative to the mean of the (constant) lure distribution. Stretch and Wixted (1998a) found that the criteria fanned out as sensitivity decreased, a result that is most consistent with the likelihood ratio model. Likelihood‐based decision axes have been assumed in modern models of memory (REM: Shiffrin & Steyvers, 1997; SLiM: McClelland & Chappell, 1998). In contrast, Balakrishnan and Ratcliff (1996) adopted a more complicated analysis, using individual data.
They compared the forms of the cumulative distribution functions (CDFs) estimated from rating data across conditions that differed in study repetitions. The key idea underlying their analysis is that, according to a likelihood decision rule, a particular strength in memory is not equally likely to be a target in different conditions. Thus, the probability of saying ‘‘old’’ will differ across conditions, even if the memory probe has identical strength. Because of this property, the CDFs will not be parallel to one another as study repetitions are manipulated; instead, the CDFs will cross at the memory strength or confidence level at which the likelihoods equal 1 in each condition. However, in three experiments, including one recognition memory experiment, Balakrishnan and Ratcliff observed CDFs that simply paralleled one another. Parallel CDFs are consistent with a lockstep model in which memory judgments are based simply on strength. The lockstep model is also most consistent with the constant false‐alarm rates seen in within‐list manipulations of memory strength (Stretch & Wixted, 1998b).
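The contrasting predictions of the lockstep and likelihood ratio models can be sketched numerically. The code assumes the equal‐variance Gaussian model (lures N(0, 1), targets N(d′, 1)), for which ln LR(x) = d′(x − d′/2). The particular likelihood ratios, the lockstep offsets, and the choice to center the lockstep criteria on d′/2 are hypothetical choices made only for illustration; the range model is not implemented here.

```python
from math import log

# Likelihood ratios attached to five rating criteria; 6:1 is "sure old".
BETAS = [1 / 6, 1 / 2, 1, 2, 6]


def likelihood_ratio_criteria(d_prime):
    """Strengths at which the target/lure likelihood ratio equals each beta.

    For lures ~ N(0, 1) and targets ~ N(d', 1), ln LR(x) = d' * (x - d'/2),
    so x = d'/2 + ln(beta) / d'.
    """
    return [d_prime / 2 + log(beta) / d_prime for beta in BETAS]


def lockstep_criteria(d_prime, offsets=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Criteria that keep fixed spacing and move together. Centering them
    on d'/2 is an assumption made only for this illustration."""
    return [d_prime / 2 + offset for offset in offsets]


def spread(criteria):
    """Distance between the most liberal and most conservative criterion."""
    return max(criteria) - min(criteria)


for d in (2.0, 1.0, 0.5):
    lr = likelihood_ratio_criteria(d)
    ls = lockstep_criteria(d)
    print(f"d'={d}: LR spread={spread(lr):.2f}, lockstep spread={spread(ls):.2f}")
```

Under the likelihood ratio rule the criteria fan out on the strength axis as d′ falls, because the distance between extreme criteria is proportional to 1/d′; under the lockstep rule the spacing is constant by construction.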
Van Zandt (2000) challenged both conclusions, arguing that the decision axis (and therefore criterion location) was based on neither memory strength nor likelihood ratio. If either of those assumptions were correct, then changing subjects’ response bias through payoffs or the ratio of targets and lures on the test would not influence the form of the underlying memory distributions. Instead, the ratings decision criteria would simply shift their locations along the strength or likelihood axis. Confirming an earlier study by Schulman and Greenberg (1970), Van Zandt found that the slope of the zROC, which measures the ratio of the standard deviation of the lure distribution to that of the target distribution, was systematically influenced by the biasing conditions: Steeper slopes were observed when more targets were tested and when rewards were higher for confident ‘‘old’’ judgments. Hirshman and Hostetter (2000) reported a similar, but smaller scale, result: Providing subjects with response scales that included an equal or unequal number of ‘‘old’’ and ‘‘new’’ ratings categories influenced the slope of the zROC. Both of these results are problematic for the simplest models in which either strength or likelihood serves as the decision axis in recognition memory tasks and decision criteria are stable across trials. Treisman and Williams’ (1984) analysis, discussed earlier, provides an answer to this challenge. Recall that their presumed stabilization process shifts the old–new criterion toward the median strength of recently observed stimuli. In a rating design, the magnitude of the criterion shift is influenced by the desired location of each criterion. For example, for the most confident ‘‘old’’ response category, upward (more conservative) criterion shifts are larger in size than downward shifts.
(In the absence of this assumption, that criterion would eventually collapse onto lower confidence criteria because there are relatively few targets of extremely high strength.) Less extreme confidence criteria have less variability due to the stabilization process. Together, these assumptions explain the zROC slope effects observed by Van Zandt (2000) and others, and they do so within the context of a strength‐based decision model. When more targets are tested, the ‘‘old’’ response criteria, and especially the most conservative criteria, are more likely to be shifted than the ‘‘new’’ response criteria. Thus, the conservative criteria have greater variability than the liberal criteria. Conversely, if more lures than targets are tested, the liberal criteria have greater variability. The net result is a systematic change in observed zROC slope, as shown in a series of simulations by Treisman and Faulkner (1984). This argument, if correct, rebuts Van Zandt’s plague‐on‐both‐your‐houses conclusion, but does not address the discrepancy between the Stretch and Wixted (1998a) and Balakrishnan and Ratcliff (1996) findings. Recent work by Benjamin and Wee (submitted for publication) supports the general importance of criterion variability in recognition memory performance, but does not
address the specific argument about whether some criteria have more variability than others.
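The link between unequal criterion variability and zROC slope can be illustrated with a minimal sketch. It does not implement the Treisman and Williams (1984) stabilization process; it simply assumes that each rating criterion varies from trial to trial as an independent Gaussian with its own standard deviation, on top of equal‐variance strength distributions. All parameter values are hypothetical.

```python
from math import sqrt


def zroc_points(criteria, criterion_sds, d_prime=1.0):
    """z-transformed operating points when each criterion is noisy.

    Strengths: lures ~ N(0, 1), targets ~ N(d_prime, 1). Criterion i is drawn
    on each trial from N(criteria[i], criterion_sds[i] ** 2), independently
    of strength, so the decision variable (strength - criterion) has
    variance 1 + sd ** 2.
    """
    points = []
    for k, sd in zip(criteria, criterion_sds):
        scale = sqrt(1.0 + sd ** 2)
        points.append((-k / scale, (d_prime - k) / scale))  # (z(F), z(H))
    return points


def slope(points):
    """Least-squares slope of z(H) on z(F)."""
    n = len(points)
    mean_f = sum(f for f, _ in points) / n
    mean_h = sum(h for _, h in points) / n
    sxy = sum((f - mean_f) * (h - mean_h) for f, h in points)
    sxx = sum((f - mean_f) ** 2 for f, _ in points)
    return sxy / sxx


CRITERIA = [-0.5, 0.0, 0.5, 1.0, 1.5]  # liberal to conservative
RISING = [0.0, 0.2, 0.4, 0.6, 0.8]     # hypothetical criterion SDs

equal_noise = slope(zroc_points(CRITERIA, [0.4] * 5))
conservative_noisier = slope(zroc_points(CRITERIA, RISING))
liberal_noisier = slope(zroc_points(CRITERIA, RISING[::-1]))
print(equal_noise, conservative_noisier, liberal_noisier)
```

Under these assumptions, equal noise on every criterion leaves the slope at 1, greater variability in the conservative criteria steepens the zROC, and greater variability in the liberal criteria flattens it, which is the direction of the effects described above.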
X.
Conclusions and Recommendations
Response bias, according to our survey, is pervasive in recognition memory experiments. Investigators may take any of several attitudes toward bias, depending in part on whether they view it as an inconvenient truth or a cognitive skill of inherent interest. Among their aims are the measurement, control, optimization, and elimination of response bias. We have had the most to say about measurement; here we summarize our recommendations on this issue and add a few comments about the other, more ambitious, goals.
A.
CHOOSE THE RIGHT SENSITIVITY MEASURE
Even if two experimental conditions differ only in response bias, some estimates of sensitivity (such as d′) are dependent on criterion if the underlying memory distributions are not of equal variance (see Fig. 1B). In this case, substantive conclusions about memory accuracy require a more appropriate measure such as da (or Az, the area under the ROC; see Verde, Macmillan, & Rotello, 2006). Failing to use a proper measure of sensitivity can lead researchers to attribute what are actually response‐bias effects to sensitivity differences (Dougal & Rotello, in press; Verde & Rotello, 2003). Once sensitivity is correctly evaluated and found to be fixed across conditions, it matters little whether response bias is summarized with c, β, cT, cL, or some other measure.
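The dependence of d′ on criterion under unequal variances can be illustrated numerically. The sketch below assumes lures distributed as N(0, 1) and targets as N(1.5, 1.25); these parameter values and all function names are hypothetical choices for illustration. It uses the standard definitions d′ = z(H) − z(F), da = sqrt(2 / (1 + s²)) · (z(H) − s · z(F)), and Az = Φ(da / √2), where s is the zROC slope.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def d_prime(hit, fa):
    """Equal-variance sensitivity estimate: z(H) - z(F)."""
    return STD_NORMAL.inv_cdf(hit) - STD_NORMAL.inv_cdf(fa)


def d_a(hit, fa, slope):
    """Unequal-variance sensitivity; slope is the zROC slope
    (lure SD divided by target SD)."""
    z_h = STD_NORMAL.inv_cdf(hit)
    z_f = STD_NORMAL.inv_cdf(fa)
    return sqrt(2 / (1 + slope ** 2)) * (z_h - slope * z_f)


def a_z(hit, fa, slope):
    """Area under the Gaussian ROC implied by d_a."""
    return STD_NORMAL.cdf(d_a(hit, fa, slope) / sqrt(2))


def rates_at(criterion, target_mean, target_sd):
    """Hit and false-alarm rates for lures ~ N(0, 1) and
    targets ~ N(target_mean, target_sd) at a given criterion."""
    hit = 1 - NormalDist(target_mean, target_sd).cdf(criterion)
    fa = 1 - STD_NORMAL.cdf(criterion)
    return hit, fa


if __name__ == "__main__":
    target_mean, target_sd = 1.5, 1.25  # assumed values, for illustration
    slope = 1 / target_sd
    for criterion in (0.2, 0.8, 1.4):  # liberal to conservative
        hit, fa = rates_at(criterion, target_mean, target_sd)
        print(f"criterion={criterion:.1f}  d'={d_prime(hit, fa):.3f}  "
              f"d_a={d_a(hit, fa, slope):.3f}  A_z={a_z(hit, fa, slope):.3f}")
```

In this setup only the criterion changes across the three rows, yet d′ drifts with it while da and Az stay fixed, which is the pattern that can be misread as a sensitivity difference.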
B.
CHOOSE THE RIGHT RESPONSE‐BIAS MEASURE
If two experimental conditions differ in sensitivity, the choice of criterion measure does matter: Figure 2 demonstrates that conclusions about constancy of bias vary systematically with one’s choice of measure. The appropriate criterion location measure depends on the question being asked of the data: For example, if it is important to know whether the criterion is fixed relative to the lure distribution, then cL is the best choice. A decision between location and likelihood indexes depends on the specific model being employed.
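A small numerical sketch may make the point concrete. It assumes the equal-variance Gaussian model and the standard formulas c = −[z(H) + z(F)] / 2 and ln β = c · d′. The quantity labeled `c_lure` below, −z(F), is our reading of a criterion measured from the mean of the lure distribution; that mapping onto cL is an assumption, as are the function name and the example values.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bias_measures(hit, fa):
    """Equal-variance SDT summaries from one hit/false-alarm pair."""
    z_h = STD_NORMAL.inv_cdf(hit)
    z_f = STD_NORMAL.inv_cdf(fa)
    d = z_h - z_f
    c = -0.5 * (z_h + z_f)  # criterion relative to the midpoint of the means
    return {
        "d_prime": d,
        "c": c,
        "ln_beta": c * d,   # log likelihood ratio at the criterion
        "c_lure": -z_f,     # assumed: criterion relative to the lure mean
    }


if __name__ == "__main__":
    criterion = 1.0  # held fixed relative to the lure mean (assumed value)
    for true_d in (1.0, 2.0):  # weak and strong conditions (assumed values)
        hit = 1 - STD_NORMAL.cdf(criterion - true_d)
        fa = 1 - STD_NORMAL.cdf(criterion)
        print(true_d, bias_measures(hit, fa))
```

With the criterion fixed relative to the lures, `c_lure` is the same in both conditions while c and ln β change, so whether bias counts as "constant" depends on which index is reported.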
C.
APPLY AN EXPLICIT MODEL TO THE DATA
Some model is always necessary. We have already noted, for example, that adopting an equal‐variance model when variances are unequal can lead to false conclusions. A more complex situation arises in the many memory domains in which dissociations between conditions are sought. Such dissociations may reflect distinct underlying processes, though they provide weak evidence for that conclusion (Dunn & Kirsner, 1988). But it is at least as likely that the dissociations reflect either sensitivity differences or criterion changes across conditions, or both, as Dunn (2004) demonstrated in the remember–know domain. In the absence of a model, an erroneous inference is probable.
D.
CONSIDER THE USE OF FEEDBACK AND/OR A FORCED‐CHOICE DESIGN
Few recognition memory experiments provide feedback to subjects, whereas many perceptual experiments do include this feature. The difference may have developed because in many perception experiments the same stimuli are repeated across trials, and learning relevant features might be aided by feedback. In memory, contrariwise, every test item is (typically) different. Nonetheless, feedback does allow the subject to estimate the hit and false‐alarm rates, and that information can be used to adjust the criterion. In many cases, this adjustment can be expected to encourage neutral responding, making feedback a tool for controlling or optimizing bias rather than measuring it.
In the two‐alternative forced‐choice (2AFC) design, two stimuli are presented on each test trial and the subject chooses the one that was studied. Forced‐choice response bias tends to be small; distinguishing sensitivity from response‐bias changes is easier when the latter are minimal. The relation between the forced‐choice and yes–no paradigms may not be the same in memory as in perception (Kroll, Yonelinas, Dobbins, & Frederick, 2002; Smith & Duncan, 2004), but this relation is of little consequence if the design is fixed throughout an experiment. Performance in 2AFC is better than in yes–no, so experimental parameters like length of the study and test lists must be altered to avoid ceiling effects. To the extent that feedback and 2AFC eliminate bias, they of course eliminate the opportunity for studying it. Real‐life recognition memory is, arguably, more often yes–no than forced‐choice, and this ecological validity supports the continued use of yes–no even at the cost of increased complexity.
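The size of the 2AFC advantage mentioned above can be sketched under the textbook equal-variance Gaussian model, in which an unbiased yes–no observer achieves p(c) = Φ(d′ / 2) and an unbiased 2AFC observer achieves p(c) = Φ(d′ / √2). As the cited studies caution, this relation may not hold for recognition memory, so the numbers are illustrative only; the function names and d′ values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pc_yes_no_unbiased(d_prime):
    """Proportion correct in yes-no with equal numbers of targets and
    lures and the criterion midway between the two means."""
    return STD_NORMAL.cdf(d_prime / 2)


def pc_2afc_unbiased(d_prime):
    """Proportion correct in 2AFC with no position bias, assuming the two
    alternatives yield independent, equal-variance Gaussian strengths."""
    return STD_NORMAL.cdf(d_prime / sqrt(2))


if __name__ == "__main__":
    for d in (0.5, 1.0, 1.5, 2.0, 2.5):  # assumed sensitivity levels
        print(f"d'={d:.1f}  yes-no={pc_yes_no_unbiased(d):.3f}  "
              f"2AFC={pc_2afc_unbiased(d):.3f}")
```

Under these assumptions the 2AFC column approaches ceiling at smaller d′ than the yes–no column, which is why list lengths or similar parameters may need adjusting when the design changes.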
E.
USE RATINGS AND PLOT ROCS
Our strongest recommendation, to ask for confidence ratings and use the resulting data to construct ROC curves, helps implement the others. ROCs provide the safest way to estimate sensitivity without the risk of contamination by bias effects, and they constrain the possible representations (e.g., by revealing the equality or inequality of variances). In many cases, ROCs allow nonparametric conclusions, as we saw in the case of Dougal and Rotello’s (in press) data for emotion‐laden words displayed in Fig. 8. For a given false‐alarm rate, memory sensitivity is a monotonic function of the hit rate, so that differences in memory strength are reflected by ROC heights. Alterations in response bias are indicated by points that fall along equal‐sensitivity curves (that is, the ROCs): Points in the left region of a curve reflect more conservative biases and those to the right more liberal biases. Comparing two conditions in ROC space, then, can indicate at a glance whether they differ in underlying sensitivity (curves of different heights), in bias (placement of the points along a common curve), or both.
ACKNOWLEDGMENT
The authors are supported by a grant from the National Institutes of Health (MH60274).
REFERENCES
Balakrishnan, J. D., & Ratcliff, R. (1996). Testing models of decision making using confidence ratings in classification. Journal of Experimental Psychology: Human Perception and Performance, 22, 615–633.
Bechara, A., Damasio, H., & Damasio, A. R. (2000). Emotion, decision making, and the orbitofrontal cortex. Cerebral Cortex, 10, 295–307.
Bechara, A., Damasio, H., Damasio, A. R., & Lee, G. P. (1999). Different contributions of the human amygdala and ventromedial prefrontal cortex to decision‐making. Journal of Neuroscience, 19, 5473–5481.
Benjamin, A. S. (2001). On the dual effects of repetition on false recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27, 941–947.
Benjamin, A. S. (2005). Recognition memory and introspective remember/know judgments: Evidence for the influence of distractor plausibility on ‘remembering’ and a caution about purportedly nonparametric measures. Memory & Cognition, 33, 261–269.
Benjamin, A. S., & Bawa, S. (2004). Distractor plausibility and criterion placement in recognition. Journal of Memory and Language, 51, 159–172.
Benjamin, A. S., & Wee, S. (submitted for publication). Signal detection with criterial variability: Applications to recognition memory.
Bohil, C. J., & Maddox, W. T. (2001). Category discriminability, base‐rate, and payoff effects in perceptual categorization. Perception & Psychophysics, 63, 361–376.
Bohil, C. J., & Maddox, W. T. (2003). A test of the optimal classifier’s independence assumption in perceptual categorization. Perception & Psychophysics, 65, 478–493.
Brown, J., Lewis, V. J., & Monk, A. F. (1977). Memorability, word frequency and negative recognition. Quarterly Journal of Experimental Psychology, 29, 461–473.
Brown, S., & Steyvers, M. (2005). The dynamics of experimentally induced criterion shifts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 587–599.
Clancy, S. A., McNally, R. J., Schacter, D. L., Lenzenweger, M. F., & Pitman, R. K. (2002). Memory distortion in people reporting abduction by aliens. Journal of Abnormal Psychology, 111, 455–461.
Clancy, S. A., McNally, R. J., Schacter, D. L., & Pitman, R. K. (2000). False recognition in women reporting recovered memories of sexual abuse. Psychological Science, 1, 26–31.
Diana, R. A., Reder, L. M., Arndt, J., & Park, H. (2006). Models of recognition: A review of arguments in favor of a dual‐process account. Psychonomic Bulletin & Review, 13, 1–21.
Donaldson, W. (1996). The role of decision processes in remembering and knowing. Memory & Cognition, 24, 523–533.
Dorfman, J., Kihlstrom, J. F., Cork, R. C., & Misiaszek, J. (1995). Priming and recognition in ECT‐induced amnesia. Psychonomic Bulletin & Review, 2, 244–248.
Dougal, S., & Rotello, C. M. (1999). Context effects in recognition. American Journal of Psychology, 112, 277–295.
Dougal, S., & Rotello, C. M. (in press). “Remembering” emotional words is based on response bias, not recollection. Psychonomic Bulletin & Review.
Dunn, J. C. (2004). Remember‐know: A matter of confidence. Psychological Review, 111, 524–542.
Dunn, J. C., & Kirsner, K. (1988). Discovering functionally independent mental processes: The principle of reversed association. Psychological Review, 95, 91–101.
Feenan, K., & Snodgrass, J. G. (1990). The effect of context on discrimination and bias in recognition memory for pictures and words. Memory & Cognition, 18, 515–527.
Gardiner, J. M., & Java, R. I. (1990). Recollective experience in word and nonword recognition. Memory & Cognition, 18, 23–30.
Gardiner, J. M., & Richardson‐Klavehn, A. (2000). Remembering and knowing. In E. Tulving and F. I. M. Craik (Eds.), The Oxford handbook of memory (pp. 229–244). Oxford: Oxford University Press.
Glanzer, M., Kim, K., Hilford, A., & Adams, J. K. (1999). Slope of the receiver‐operating characteristic in recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 500–513.
Goh, W. D. (2005). Talker variability and recognition memory: Instance‐specific and voice‐specific effects. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 40–53.
Green, D. M. (1964). General prediction relating yes‐no and forced‐choice results. Journal of the Acoustical Society of America, 36, 1042 (Abstract).
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Gregg, V. H., & Gardiner, J. M. (1994). Recognition memory and awareness: A large effect of study‐test modalities on “know” responses following a highly perceptual orienting task. European Journal of Cognitive Psychology, 6, 131–147.
Heit, E., Brockdorff, N., & Lamberts, K. (2003). Adaptive changes of response criterion in recognition memory. Psychonomic Bulletin & Review, 10, 718–723.
Hicks, J. L., & Marsh, R. L. (1998). A decrement‐to‐familiarity interpretation of the revelation effect from forced‐choice tests of recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 1105–1120.
Hirshman, E. (1995). Decision processes in recognition memory: Criterion shifts and the list‐strength paradigm. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 302–313.
Hirshman, E., & Hostetter, M. (2000). Using ROC curves to test models of recognition memory: The relationship between presentation duration and slope. Memory & Cognition, 28, 161–166.
Kornbrot, D. E. (2006). Signal detection theory, the approach of choice: Model‐based and distribution‐free measures and evaluation. Perception & Psychophysics, 68, 393–414.
Kroll, N. A., Yonelinas, A. P., Dobbins, I. G., & Frederick, C. M. (2002). Separating sensitivity from response bias: Implications of comparisons of yes‐no and forced‐choice tests for models and measures of recognition memory. Journal of Experimental Psychology: General, 131, 241–254.
Lamberts, K., Brockdorff, N., & Heit, E. (2003). Feature‐sampling and random‐walk models of individual‐stimulus recognition. Journal of Experimental Psychology: General, 132, 351–378.
LeDoux, J. E. (1986). Sensory systems and emotion: A model of affective processing. Integrative Psychiatry, 4, 237–248.
Macmillan, N. A., & Creelman, C. D. (1990). Response bias: Characteristics of detection theory, threshold theory, and “nonparametric” measures. Psychological Bulletin, 107, 401–413.
Macmillan, N. A., & Creelman, C. D. (1996). Triangles in ROC space: History and theory of “nonparametric” measures of sensitivity and response bias. Psychonomic Bulletin & Review, 3, 164–170.
Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user’s guide (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Macmillan, N. A., & Rotello, C. M. (2006). Deciding about decision models of remember and know judgments: A reply to Murdock (2006). Psychological Review, 113, 657–665.
Maddox, W. T. (2002). Toward a unified theory of decision criterion learning in perceptual categorization. Journal of the Experimental Analysis of Behavior, 78, 567–595.
Maddox, W. T., & Bohil, C. J. (2004). Probability matching, accuracy maximization, and a test of the optimal classifier’s independence assumption in perceptual categorization. Perception & Psychophysics, 66, 104–118.
Maratos, E. J., Allen, K., & Rugg, M. D. (2000). Recognition memory for emotionally negative and neutral words: An ERP study. Neuropsychologia, 38, 1452–1465.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective‐likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 734–760.
Morrell, H. E. R., Gaitan, S., & Wixted, J. T. (2002). On the nature of the decision axis in signal‐detection‐based models of recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 1095–1110.
Murdock, B. (2006). Decision‐making models of remember/know judgments. Psychological Review, 113, 648–656.
Murnane, K., & Phelps, M. P. (1994). When does a different environmental context make a difference in recognition? A global activation model. Memory & Cognition, 22, 584–590.
Murnane, K., & Phelps, M. P. (1995). Effects of changes in relative cue strength on context‐dependent recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 158–172.
Murnane, K., Phelps, M. P., & Malmberg, K. (1999). Context‐dependent recognition memory: The ICE theory. Journal of Experimental Psychology: General, 128, 403–415.
Niewiadomski, M. W., & Hockley, W. E. (2001). Interrupting recognition memory: Tests of familiarity‐based accounts of the revelation effect. Memory & Cognition, 29, 1130–1138.
Phelps, E. A., LaBar, K. S., & Spencer, D. D. (1997). Memory for emotional words following unilateral temporal lobectomy. Brain and Cognition, 35, 85–109.
Postma, A. (1999). The influence of decision criteria upon remembering and knowing in recognition memory. Acta Psychologica, 103, 65–76.
Ratcliff, R., Sheu, C.‐F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518–535.
Rotello, C. M. (1999). Metacognition and memory for nonoccurrence. Memory, 7, 43–63.
Rotello, C. M., Macmillan, N. A., Hicks, J. L., & Hautus, M. (2006). Interpreting the effects of response bias on remember‐know judgments using signal‐detection and threshold models. Memory & Cognition, 34, 1598–1614.
Rotello, C. M., Macmillan, N. A., Reeder, J. A., & Wong, M. (2005). The remember response: Subject to bias, graded, and not a process‐pure indicator of recollection. Psychonomic Bulletin & Review, 12, 865–873.
Schulman, A. I., & Greenberg, G. Z. (1970). Operating characteristics and a priori probability of the signal. Perception & Psychophysics, 8, 317–320.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM–retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145–166.
Singer, M., Gagnon, N., & Richards, E. (2002). Strategies of text retrieval: A criterion shift account. Canadian Journal of Experimental Psychology, 56, 41–57.
Singer, M., & Wixted, J. T. (2006). Effect of delay on recognition decisions: Evidence for a criterion shift. Memory & Cognition, 34, 125–137.
Smith, D. G., & Duncan, M. J. J. (2004). Testing theories of recognition memory by predicting performance across paradigms. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 615–625.
Strack, F., & Bless, H. (1994). Memory for nonoccurrences: Metacognitive and presuppositional strategies. Journal of Memory and Language, 33, 203–217.
Strack, F., & Forster, J. (1995). Reporting recollective experiences: Direct access to memory systems? Psychological Science, 6, 352–358.
Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 1397–1410.
Stretch, V., & Wixted, J. T. (1998b). On the difference between strength‐based and frequency‐based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 1379–1396.
Treisman, M., & Faulkner, A. (1984). The effect of signal probability on the slope of the receiver operating characteristic given by the rating procedure. The British Journal of Mathematical and Statistical Psychology, 37, 199–215.
Treisman, M., & Williams, T. C. (1984). A theory of criterion setting with an application to sequential dependencies. Psychological Review, 91, 68–111.
Tulving, E. (1985). Memory and consciousness. Canadian Journal of Psychology, 26, 1–12.
Van Zandt, T. (2000). ROC curves and confidence judgments in recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 582–600.
Verde, M. F., Macmillan, N. A., & Rotello, C. M. (2006). Measures of sensitivity based on a single hit rate and false‐alarm rate: The accuracy, precision, and robustness of d′, Az, and A′. Perception & Psychophysics, 68, 643–654.
Verde, M. F., & Rotello, C. M. (2003). Does familiarity change in the revelation effect? Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 739–746.
Verde, M. F., & Rotello, C. M. (2004a). ROC curves show that the revelation effect is not a single phenomenon. Psychonomic Bulletin & Review, 11, 560–566.
Verde, M. F., & Rotello, C. M. (2004b). Strong memories obscure weak memories in associative recognition. Psychonomic Bulletin & Review, 11, 1062–1066.
Verde, M. F., & Rotello, C. M. (2007). Memory strength and the decision process in recognition memory. Memory & Cognition, 35, 254–262.
Windmann, S., & Krüger, T. (1998). Subconscious detection of threat as reflected by an enhanced response bias. Consciousness and Cognition: An International Journal, 7, 603–633.
Windmann, S., & Kutas, M. (2001). Electrophysiological correlates of emotion‐induced recognition bias. Journal of Cognitive Neuroscience, 13, 577–592.
Windmann, S., Urbach, T. P., & Kutas, M. (2002). Cognitive and neural mechanisms of decision biases in recognition memory. Cerebral Cortex, 12, 808–817.
Wixted, J. T. (1992). Subjective memorability and the mirror effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 681–690.
Wixted, J. T., & Stretch, V. (2004). In defense of the signal detection interpretation of remember/know judgments. Psychonomic Bulletin & Review, 11, 616–641.
Woodard, J. L., Axelrod, B. N., Mordecai, K. L., & Shannon, K. D. (2004). Value of signal detection theory indexes for Wechsler memory Scale‐III recognition measures. Journal of Clinical and Experimental Neuropsychology, 26, 577–586.
Worthen, J. B., & Eller, L. S. (2002). Test of competing explanations of the bizarre response bias in recognition memory. The Journal of General Psychology, 129, 36–48.