Commensurability in memory for frequency

JOURNAL OF MEMORY AND LANGUAGE 29, 501-523 (1990)

DOUGLAS L. HINTZMAN AND ANN L. HARTRY

University of Oregon

A fundamental property of the Minerva 2 model of memory for frequency is that frequencies accumulated under two different conditions, A and B, should be commensurable; that is, the frequency signals for all items, regardless of frequency or condition, should fall along a unidimensional scale. An experimental and analytic technique is developed that tests for commensurability, with the added constraint that the variance be linearly related to the mean. The method is applied to artificial data, some displaying commensurability and some violating commensurability in specific ways. It is then applied to frequency-discrimination experiments in which the A and B conditions are defined by: (1) type (“level”) of encoding, (2) exposure duration, (3) recency, and (4) spacing of repetitions. Type of encoding and exposure duration yield data that are consistent with commensurability, while recency and spacing violate the property in what appear to be two different ways. Tentative explanations of these violations are proposed: Subjects may adjust for differences in recency by rescaling frequency signals; and optimal comparisons of massed and spaced frequencies may require some translation between two different frequency codes. Application of the method to questions concerning recognition memory is also discussed. © 1990 Academic Press, Inc.

Artificial systems (i.e., models) that perform analogs of cognitive tasks have several potential advantages in psychological research. One is that they often display interesting properties that may also hold for human cognition. This article concerns a property of the simulation model, Minerva 2, as it has been applied to the task of judging presentation frequencies from memory (Hintzman, 1988). This property, which we call commensurability, could be fundamental to memory for frequency and recognition memory and so should be of theoretical interest beyond concern with the Minerva 2 model itself.

The model assumes that judgments of frequency are based on values falling along a single continuum, a kind of familiarity signal called echo intensity. Such a signal is triggered by each retrieval cue. The strength of this signal is determined, in part, by the frequency with which the tested item occurred in the memorized list. Test items of presentation frequency F give rise to a distribution of signal strengths, whose mean increases with F (see top panel of Fig. 1). For reasons that are unimportant for the current discussion, the distributions of signal strength are approximately normal in form. Thus, the model encourages viewing memory for frequency in terms analogous to those used in psychophysical scaling and, more specifically, suggests a Thurstonian model of the way frequency judgments are made (Hintzman, 1969; Torgerson, 1958). Absolute judgments of frequency require that the strength continuum be partitioned by response criteria (to separate judgments of 0 from 1, 1 from 2, etc.). Relative frequency judgments, or frequency discrimination, can be modeled by simply assuming that the subject chooses the member of each test pair that gives the higher signal strength (Hintzman, 1988). Because modeling the latter task requires fewer free parameters, we will be concerned here only with the frequency discrimination task.

This is based upon work supported by the National Science Foundation under Grant No. BNS-87-11218. Correspondence and reprint requests should be addressed to Douglas L. Hintzman, Psychology Department, University of Oregon, Eugene, OR 97403.

0749-596X/90 $3.00 Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

FIG. 1. Echo intensity (signal strength) distributions from the Minerva 2 model, for two encoding conditions (A and B), and presentation frequencies 0-5. The frequency = 0 distribution is shown in both panels. [Panels: Condition A, Condition B; abscissa: Signal Strength.]

Frequency discrimination performance is influenced by a number of variables in addition to presentation frequency. For example, “deep” (e.g., semantic) orienting tasks during study lead to higher discrimination accuracy than do “shallow” (visual or auditory) tasks, and performance declines regularly as a function of delay. In the Minerva 2 model, all such effects are mediated by echo intensity, or signal strength. Of two such conditions, let us call the one giving better performance A, and the one giving poorer performance B. In general, according to the model, distributions for condition A are further separated and higher (further to the right) than those of condition B, as shown in Fig. 1. If a single signal-strength continuum underlies performance in both conditions, as assumed, then the distributions that predict A vs. A and B vs. B frequency discrimination performance should predict A vs. B performance, as well. That is, all A and B frequencies should be commensurate, or measured on the same scale. This implies

that a single set of values locating all A and all B frequency distributions along the scale should accurately describe performance regardless of the type of test pair: A vs. A, B vs. B, or A vs. B. This is the property we call commensurability, which the present experiments were designed to test. The A and B conditions investigated in separate experiments were: semantic vs. phonemic encoding, long vs. short exposure duration, immediate vs. delayed testing, and massed vs. spaced repetition.

Investigating this question required the development of an appropriate methodology. The next section describes the analytic technique and illustrates it on simulated data. We apply the method to data generated by the Minerva 2 model, and also, for comparison purposes, to data produced by variations of the model for which commensurability does not hold.

ANALYSIS OF ARTIFICIAL DATA

Commensurability Hypothesis

The distributions shown in Fig. 1 were generated by the Minerva 2 model. Details of the simulation may be found in Appendix A. Readers interested in a more complete treatment of the model as it is applied to memory for frequency are referred to Hintzman (1988). Regarding Fig. 1, for present purposes it is sufficient to note that memory traces were stored under two conditions and that in condition A traces were more complete (“stronger”) than in condition B. This difference is reflected in the strength of the familiarity signals, indicated along the abscissa of either graph. Condition A and condition B both include items occurring with frequencies 1-5, and items having frequency = 0 were also tested (the frequency = 0 distribution is shown in both panels of Fig. 1). One consequence of the assumptions of the model is that, for a given set of parameters, the variance of signal strength increases with frequency as an approximately linear function of the mean. For the distributions shown in Fig. 1, the best-fitting function is Variance = .016 + .116 Mean ($r^2$ = .981).

Ignoring within-pair order, the 11 conditions shown in Fig. 1 allow 55 types of test pairs (e.g., A2 vs. B4). For each type of test pair (X vs. Y), choice proportions can be derived from the signal-strength distributions by estimating the likelihood that a strength chosen at random from distribution X will exceed one chosen at random from distribution Y (Hintzman, 1988, p. 531). Table 1 shows the choice percentages derived in this way from the 11 distributions of Fig. 1.

TABLE 1
SIMULATED DATA: CHOICE PERCENTAGES, ROW CONDITION OVER COLUMN CONDITION

Condition and
frequency^a      0     B1    B2    B3    B4    B5    A1    A2    A3    A4
B1             62.2
B2             73.9  62.5
B3             81.4  71.6  60.0
B4             87.6  79.5  69.2  59.6
B5             92.5  86.3  77.5  68.9  59.7
A1             72.0  60.7  48.2  38.3  29.3  21.1
A2             85.0  76.5  66.0  56.6  47.1  37.8  67.6
A3             93.0  87.5  79.8  72.0  63.7  54.8  80.9  65.9
A4             96.8  93.5  88.3  82.6  76.2  68.5  89.2  77.7  63.4
A5             98.5  96.6  93.4  89.4  84.8  78.8  93.9  85.7  74.0  62.0

a Letter indicates condition, digit indicates frequency.

The basic question of interest is whether the choice proportions in the triangular pure-A and pure-B submatrices of Table 1 are consistent with those of the rectangular mixed (A vs. B) submatrix. One way to approach this problem would be to find a scaling solution for the entire matrix and to look for systematic outliers to the solution. A potentially more powerful method is used here. This method finds the best solution to the mixed data and then uses this solution to predict the values in the two triangular matrices. The rationale is that the A vs. B comparisons should be sufficient to fix the locations of all 11 distributions along a single dimension, so the pure-A and pure-B choices should be predictable from those for A vs. B.¹ Because only the mixed choices are used to locate the distributions, the fit of the scaling model to the pure-A and pure-B choice data is parameter free.

1 The opposite is not true. Scaling pure-A and pure-B choices would establish two sets of distances, but no common point of reference relating the sets to each other.

To further constrain the fit of the scaling model, we took advantage of the fact that the distributions generated by the Minerva 2 model are approximately normal in form. This implies that the correct scaling solution should be one in which the choice proportions are linearly related to distance, after the proportions have been converted to the normal-deviate scale, or d'.

FIG. 2. d' as predicted by scaled distance, for the basic Minerva 2 model. The solution is based on the mixed data only, and the pure-A and pure-B data are predicted. Pairs including frequency = 0 (recognition pairs) are shown separately. [Abscissa: Scaled Distance; ordinate: d'.]

The analysis was carried out using the multidimensional scaling (MDS) module of SYSTAT (Wilkinson, 1986). First, choice proportions, P, were converted to d' values via the polynomial $d' = 1.84D + 1.45D^{6}$, where $D = 2\,|P - .5|$. This function closely approximates the 2-alternative forced-choice d' values listed by Hacker and Ratcliff (1979), over the range .50 < P < .99. Note that, since variances cannot be independently estimated from forced-choice data, d' is given by

$$ d'_{ik} = \frac{\sqrt{2}\,(\mu_i - \mu_k)}{\sqrt{\sigma_i^2 + \sigma_k^2}} \approx \frac{2(\mu_i - \mu_k)}{\sigma_i + \sigma_k} \qquad (1) $$
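The conversions in this step can be checked numerically. The following Python sketch is illustrative only. It assumes that the polynomial reads d' = 1.84D + 1.45D^6 with D = 2|P - .5|, that the reference 2-alternative forced-choice values follow the equal-variance relation d' = sqrt(2) z(P), and that the two sides of Eq. (1) are sqrt(2)(mu_i - mu_k)/sqrt(sigma_i^2 + sigma_k^2) and 2(mu_i - mu_k)/(sigma_i + sigma_k). The function names are hypothetical and are not part of SYSTAT or of the original analysis.

```python
import math
from statistics import NormalDist


def dprime_poly(p):
    """Polynomial conversion as read from the text (an assumption):
    d' = 1.84*D + 1.45*D**6, with D = 2*|P - .5|."""
    d = 2.0 * abs(p - 0.5)
    return 1.84 * d + 1.45 * d ** 6


def dprime_2afc(p):
    """Equal-variance 2AFC reference value, d' = sqrt(2) * z(P)."""
    return math.sqrt(2.0) * NormalDist().inv_cdf(p)


def dprime_eq1(mu_i, mu_k, sd_i, sd_k):
    """Return (left, right) sides of Eq. (1) for two normal distributions."""
    diff = mu_i - mu_k
    left = math.sqrt(2.0) * diff / math.sqrt(sd_i ** 2 + sd_k ** 2)
    right = 2.0 * diff / (sd_i + sd_k)
    return left, right


if __name__ == "__main__":
    # Compare the polynomial with the reference relation at a few proportions.
    for p in (0.60, 0.75, 0.90, 0.98):
        print(p, round(dprime_poly(p), 3), round(dprime_2afc(p), 3))
    # Hypothetical means and standard deviations, for illustration only.
    print(dprime_eq1(1.0, 0.5, 0.36, 0.27))
```

With equal standard deviations the two sides of Eq. (1) coincide. For a positive difference in means the right-hand side is never smaller than the left, because (sigma_i + sigma_k)^2 <= 2(sigma_i^2 + sigma_k^2).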

The approximation to d' given by the right-hand term of (1) holds as long as the discrepancy between $\sigma_i$ and $\sigma_k$ is not too large.² Second, the values for all mixed (A vs. B) pairs were entered in the MDS module as a triangular distance matrix in which the pure-A and pure-B values were missing. For scaling purposes, frequency = 0 was treated as condition B. Third, the solution was constrained to linearity on d' (metric scaling) and to one dimension.

2 The right-hand term of Eq. (1) is a linear transform of Thurstone’s Case IV (Torgerson, 1958; Thurstone, 1927).

As was expected with these artificial data, an excellent fit was obtained. Figure 2 plots the d' values from the simulation against the scaled distances from MDS. The horizontal dashes represent the predictors, i.e., the mixed data that were used to derive the scaling solution. The predicted pure-A and pure-B points are also depicted in Fig. 2. Recognition pairs (those including frequency = 0) are plotted separately. All points fall close to a linear function. Thus, not only are the mixed data amenable to a simple, linear solution, but the pure-A and pure-B data are predictable from the same function; i.e., commensurability holds.

An important point about commensurability, as defined by this method, is that it requires the additivity of d' (i.e., $d'_{ik} = d'_{ij} + d'_{jk}$). Of course, such additivity is guaranteed when distances between means are additive and $\sigma$ is constant. Less obviously, additivity is approximated whenever distances between means are additive and $\sigma^2$ is a linear function of $\mu$ (a proof that it holds for the right-hand expression of Eq. (1) is given in Appendix B). As was pointed out earlier, $\sigma^2$ is essentially linear on $\mu$ for the Minerva 2 model. One consequence of the requirement that $\sigma^2$ and $\mu$ be related in this way is that $d'_{ij}$, and hence the scaling solution, is not necessarily linear on $\mu_i - \mu_j$. (For the data in Fig. 1, the scaling solution was approximately logarithmic on $\mu_i - \mu_j$.) Another consequence is that additivity of distances among means is not a sufficient condition for commensurability as determined by the method. Violations of the constraint on variances can result in a failure of commensurability even when additivity holds among means.

Because the artificial data of Fig. 1 came from a model known to approximately satisfy the conditions of the scaling method, the foregoing analysis serves only to illustrate the analytic technique on an ideal data set, and to validate our use of the approximation to d' given in Eq. (1). The question of interest here will be the extent to which data from human subjects conform to the picture shown by Fig. 2. It is worth noting in this regard that the simulated data were based on 4000 observations per data point, a figure that is impractical in experimental work.
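The logic of fitting the mixed pairs and then predicting the pure pairs can be sketched in a few lines of Python. The example is not the SYSTAT MDS procedure: it substitutes an ordinary least-squares fit of signed d' values (row and column means of the mixed A-by-B matrix), and it uses hypothetical distribution means together with the variance function reported for Fig. 1. Under these assumptions, when sigma^2 = a + b*mu the right-hand term of Eq. (1) reduces to 2(sigma_i - sigma_k)/b, so it is exactly additive, and positions fitted to the mixed pairs alone reproduce the pure-A and pure-B values.

```python
import math
from itertools import combinations

# Variance = .016 + .116 * Mean, the function reported for Fig. 1.
A0, B0 = 0.016, 0.116


def sd(mu):
    """Standard deviation implied by the linear variance-mean relation."""
    return math.sqrt(A0 + B0 * mu)


def dprime(mu_i, mu_k):
    """Right-hand term of Eq. (1): 2 (mu_i - mu_k) / (sigma_i + sigma_k)."""
    return 2.0 * (mu_i - mu_k) / (sd(mu_i) + sd(mu_k))


# Hypothetical distribution means (illustrative, not taken from the paper).
mu_a = {"A1": 0.35, "A2": 0.70, "A3": 1.05}
mu_b = {"B1": 0.20, "B2": 0.40, "B3": 0.60}

# Mixed (A vs. B) d' values: the only input to the scaling step.
mixed = {(a, b): dprime(mu_a[a], mu_b[b]) for a in mu_a for b in mu_b}

grand = sum(mixed.values()) / len(mixed)
row = {a: sum(mixed[a, b] for b in mu_b) / len(mu_b) for a in mu_a}
col = {b: sum(mixed[a, b] for a in mu_a) / len(mu_a) for b in mu_b}

# Least-squares positions on one dimension (additive two-way fit), chosen so
# that pos[a] - pos[b] equals the fitted mixed-pair d'.
pos = {a: row[a] - grand / 2 for a in mu_a}
pos.update({b: grand / 2 - col[b] for b in mu_b})


def max_pure_error():
    """Largest gap between predicted and actual d' over pure-A and pure-B pairs."""
    errors = []
    for group in (mu_a, mu_b):
        for i, k in combinations(group, 2):
            errors.append(abs((pos[i] - pos[k]) - dprime(group[i], group[k])))
    return max(errors)


def max_additivity_error():
    """Largest violation of d'_ik = d'_ij + d'_jk over all triples."""
    mu = {**mu_a, **mu_b}
    return max(
        abs(dprime(mu[i], mu[k]) - dprime(mu[i], mu[j]) - dprime(mu[j], mu[k]))
        for i, j, k in combinations(mu, 3)
    )


if __name__ == "__main__":
    print(max_pure_error(), max_additivity_error())
```

Replacing the linear variance function with one that is not linear in the mean is a simple way to see the pure-pair predictions drift away from the mixed-pair line.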
Actual data will be based on N’s of around 300, and will therefore be much “noisier” than what is seen in Fig. 2. This is especially true at the upper end of the scale (e.g., P > .90), because as P approaches 1.0, d' approaches ∞.

Rescaling Hypothesis

At least two hypotheses can be identified for which commensurability does not hold and which therefore predict data patterns different from the one shown in Fig. 2. One of these we call the rescaling hypothesis. Suppose that a subject can determine at the time of test whether an item was learned under condition A or condition B and that the subject has a metacognitive belief that the A vs. B manipulation should affect signal strength. Faced with the task of making mixed comparisons, such a subject might try to make A and B signal strengths more comparable, by rescaling strengths of one type. For example, to keep from systematically erring in favor of A over B, the subject might multiply each B strength by a constant, c > 1. In effect, the subject tries to counter a suspected bias in the memory system, by interposing a compensating bias in the judgment process, after the signal strength has been retrieved.

Such a mechanism can easily be added to Minerva 2. The data in the upper left panel of Fig. 3 were generated by a version of the model identical to the original except that the signal strengths of the B items are always multiplied by a constant, 2.2, prior to making a choice. (Under the assumption that frequency = 0 is more likely to be identified with the weaker of the two conditions, frequency = 0 items were treated as B items in this regard.) As previously, data from mixed pairs were used to obtain a one-dimensional scaling solution that is linear on d'. Note three things about the graph. First, the mixed data are still described well by a straight line (dashed line of the figure). Second, the d' values for pure-A pairs are underpredicted, and those for pure-B pairs are overpredicted, by the mixed data. Pure-B discriminability is not affected by rescaling, because multiplying B strengths by a constant does not change the overlap among the distributions. Pure-A discriminability is not affected because the multiplication is not applied to A distributions. Thus, rescaling with the parameter 2.2 improves performance on mixed comparisons, in the sense that it reduces the tendency to systematically select A over B, but it does not change pure comparisons in any way. Third, half the recognition pairs fall on the predicted line (colinear with the mixed data) and half fall below (colinear with pure B). Those on the line represent test pairs pitting condition A against frequency = 0 and were included in the input to MDS. For the purpose of rescaling, these were treated as mixed pairs, in that frequency = 0 signal strengths were multiplied by 2.2. Likewise, the recognition points falling below the line are essentially “pure B,” in that they represent condition B vs. frequency = 0.³

FIG. 3. d' as predicted by scaled distance, for four models that violated commensurability. Upper panels: two versions of the rescaling hypothesis. Condition B and frequency = 0 signal strengths are rescaled by multiplying by 2.2 (left) or by adding 0.2 (right). Lower panels: two versions of the translation hypothesis. Mixed-pair d' values only are degraded by a factor of 0.5 (left panel) or by the function MAX{d' - 0.5, 0} (right). [Abscissa: Scaled Distance.]

One might wish to consider another version of the rescaling hypothesis, which involves adding a constant to all B and frequency = 0 signal strengths, i.e., moving these distributions to the right by a fixed amount (the equivalent of a criterion shift). The data in the upper right panel of Fig. 3 show the result of adding a constant of .2. If all $\sigma$ values were the same, this operation would maintain commensurability. However, the $\sigma$ values are different, and adding a constant changes the means without making the corresponding changes in $\sigma$. Consequently, the shifted B distributions are not variable enough for their locations, and the scaling solution for the mixed pairs overpredicts performance on pure-A and underpredicts that on pure-B pairs. This addition version of the rescaling hypothesis appears quite distinct in its predictions from the multiplication version. Unfortunately, it seems inappropriate for the frequency discrimination task, because the same shift is applied to all B distributions, and a shift sufficient to eliminate the bias favoring A at frequency = 1 becomes increasingly inadequate as frequency rises further. We have seen no data showing this pattern, and so we will not consider the rescaling-by-addition model further.
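The effect of rescaling by multiplication can be illustrated with a small Python sketch. It treats signal strengths as independent normal variables, which the text describes as only approximately true, and the means and standard deviations below are hypothetical values chosen for illustration; only the constant 2.2 comes from the simulation described above. Multiplying a normal strength by c > 0 multiplies both its mean and its standard deviation by c, so a pure-B choice probability is unchanged while a mixed choice probability shifts away from A.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def p_choose(mu_x, sd_x, mu_y, sd_y):
    """P(strength drawn from X exceeds strength drawn from Y), independent normals."""
    return PHI((mu_x - mu_y) / math.sqrt(sd_x ** 2 + sd_y ** 2))


def rescale(mu, sd, c):
    """Multiplying a normal strength by c > 0 scales its mean and SD by c."""
    return c * mu, c * sd


# Hypothetical (mean, SD) pairs; not values from the Minerva 2 simulation.
a1 = (0.60, 0.30)
b1 = (0.30, 0.22)
b2 = (0.50, 0.27)
C = 2.2  # rescaling constant used for the upper left panel of Fig. 3

# Pure-B pair: both members are rescaled, so their overlap is unchanged.
pure_before = p_choose(*b2, *b1)
pure_after = p_choose(*rescale(*b2, C), *rescale(*b1, C))

# Mixed pair: only the B member is rescaled, so the preference for A drops.
mixed_before = p_choose(*a1, *b1)
mixed_after = p_choose(*a1, *rescale(*b1, C))

if __name__ == "__main__":
    print(pure_before, pure_after)
    print(mixed_before, mixed_after)
```

Adding a constant instead of multiplying would shift the B means while leaving their standard deviations alone, which is the situation the rescaling-by-addition discussion describes.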

Translation Hypothesis

Commensurability will also fail if subjects have particular trouble comparing frequencies accumulated under different conditions, as would be the case if A frequencies and B frequencies were represented in different formats or codes. Mixed-pair frequency discrimination would require some kind of translation between codes, and, if translation is not completely reliable, would entail the potential introduction of random “noise.” A complete failure of translation would mean that all mixed comparisons would have d' = 0. More likely is that d' values for mixed comparisons would be somewhat degraded compared to the commensurability ideal. For purposes of illustration, we simply use the distributions in Fig. 1, but reduce all mixed-pair d' values, without changing those for pure-A, pure-B, or recognition pairs. The results of two schemes for doing this are shown at the bottom of Fig. 3. The left panel shows the result of degrading mixed-pair d's by multiplying by a constant factor of .5. Here, the mixed data are closely approximated by a linear function with a slope half that for the data of Fig. 2. The right panel shows the effect of reducing mixed-pair d's by a constant and eliminating negative values (i.e., computing the function max{d' - .5, 0}). In both schemes, d' is degraded only for mixed comparisons, and so both the pure-A and pure-B choice data fall above the mixed-choice prediction line. However, multiplying by a constant (left panel) shifts the mixed-pair slope downward while maintaining the zero intercept, while subtracting a constant (right panel) changes the intercept without changing the slope.

The commensurability hypothesis, the two rescaling hypotheses, and the two translation hypotheses thus make different predictions regarding the outcome of the scaling technique. In the experiments that follow, we put the technique to use.

3 Because half the signal strengths are positive and half are negative, the multiplication operation distorts the shape of the frequency = 0 distribution. This does not affect relative judgments, however, because multiplication leaves all ordinal relations unchanged.

EXPERIMENT 1: TYPE OF ENCODING

One manipulation that is known to affect memory for frequency is the type of encoding, or level of processing. Semantic orienting tasks during encoding generally lead to better frequency-judgment performance than visual or auditory tasks do (Fisk & Schneider, 1984; Greene, 1986; Rose & Rowe, 1976; Rowe, 1974). In the Minerva 2 model, this result can be understood in terms of two sets of signal-strength distributions like the A and B distributions of Fig. 1 (Hintzman, 1988, p. 534). In Experiment 1, subjects rated some words on a semantic scale (goodness) and other words on an auditory scale (number of rhymes). Frequency discrimination was tested on pure semantic pairs, pure auditory pairs, and mixed pairs for which one word had been rated on the semantic scale and the other on the auditory scale.

Method

Subjects and procedure. Subjects were 160 undergraduate students at the University of Oregon who participated in partial fulfillment of a course requirement (data of six additional subjects were excluded due to failure to correctly follow instructions). They were tested in groups of about 10. Subjects performed two tasks: first, they rated words on a questionnaire, and then they were given a frequency discrimination test. The printed questionnaire consisted of 280 words, each next to a scale on which a check mark was to be made. For half the words the scale required a rating of the “goodness” of the word, and for half it required a rating of the number of rhymes the word has in English, for example:

bad __________________ good    tale
few __________________ many    silk

The two types of item were randomly ordered. Words were repeated as many as four times on the questionnaire. A given word was rated on the same scale each time it occurred, and repetitions were separated by at least 20 other items. The frequency judgment test form listed 56 pairs of words from the questionnaire. Subjects were instructed to circle the member of each pair that had occurred more often.

Materials and design. The words were 112 common four-letter English nouns, selected to minimize the number of rhymes among them. On each questionnaire, 56 words were rated for goodness and 56 for rhymes. Of each set of 56, fourteen occurred at each of the frequencies 1-4; thus there were eight presentation conditions in all. Including all tokens of each type, there were 280 items on the questionnaire, randomly ordered except for the minimum spacing of 20. Eight conditions can be paired in 28 ways, each of which was represented by two word pairs on the frequency-discrimination test list. Right-left ordering was determined randomly for each pair. The same test list was used for all subjects, and counterbalancing was accomplished through the use of 16 forms of the study questionnaire. Eight versions of the questionnaire were constructed by systematically rotating words through the eight presentation conditions, and eight more versions were constructed from these by switching the condition assignments of words that were paired together on the test. Approximately equal numbers of subjects were given each form of the questionnaire.

Results and Discussion

Table 2 presents the frequency-discrimination choice percentages. In this and subsequent analyses, means for each version of the presentation list were computed, and the summary data are the means of these means. The data appear quite regular and consistent with previous findings. Discrimination generally improved with increasing differences in frequencies, but a given difference became less effective as the frequencies increased. With regard to the rating manipulation, goodness ratings led to better discrimination than rhyme judgments did, and the mixed data (inside the box of Table 2) show a general preference for words rated for goodness over words rated for number of rhymes. This is seen most clearly in the values on the diagonal of the rectangle, where the two frequencies were the same. To obtain inferential statistics relevant to the above observations, we performed two t tests on values in Table 2. The first asked whether the six values in the pure-goodness

TABLE 2
DATA FROM EXPERIMENT 1: CHOICE PERCENTAGES, ROW CONDITION OVER COLUMN CONDITION

Condition and frequency^a: R1 R2 R3 R4 G1 G2 G3

65.0

a R = rhyme judgments, G = goodness judgments, digit indicates frequency.

triangular matrix reliably exceeded the corresponding values for pure-rhyme pairs. A direct-difference t test was significant (M = 5.07, SD = 4.44, t(5) = 2.79, p < .02). The second t test asked whether the mixed-choice data show a reliable bias for goodness items over rhyme items. To answer this question, we obtained one mean for each frequency comparison. For the four equal-frequency pairs, these means were the values on the diagonal of the mixed-pair rectangular matrix. For the six unequal-frequency pairs, they were the means of corresponding values above and below the diagonal (e.g., the mean of G3, R1 and G1, R3). Because all values in the table reflect the tendency to choose the goodness item over the rhyme item, these means measure this tendency with differential frequency factored out. The mean of the four diagonal and six off-diagonal values was significantly different from 50% (M = 59.74, SD = 3.93, t(9) = 7.84, p < .001). Similar analyses are reported for the following experiments. There, we refer to them simply as the pure-pair and mixed-pair analyses, respectively, in order to be brief.

The choice percentages were converted to d' values as previously described, and d's for the mixed comparisons alone were submitted to MDS for linear, one-dimensional scaling. The scaled values are shown in Table 3, and the obtained d' is plotted against scaled distance in Fig. 4. The mixed data are well described by a straight line, and both the pure-rhyme and pure-goodness choices fall close to the predicted values. A least-squares regression line fitted to the mixed-pair data yields $r^2$ = .982 for mixed pairs alone, and $r^2$ = .925 for mixed pairs and pure pairs combined. Importantly, neither the pure-rhyme data nor

TABLE 3
SCALING SOLUTIONS FOR ALL EXPERIMENTS

Exp. 1         Exp. 2         Exp. 3         Exp. 4         Exp. 5
R1  -1.61      S1  -1.32      D1  -1.61      0   -1.84      M2  -1.29
R2  -0.42      S2   0.06      D2  -0.87      D1  -0.88      M3  -0.85
R3   0.10      S3   0.99      D3  -0.14      D2   0.29      M4  -0.54
R4   0.71      L1  -1.27      D4   0.42      D3   1.02      S2   0.13
G1  -1.17      L2   0.28      I1  -0.81      D4   1.21      S3   1.08
G2  -0.10      L3   1.26      I2   0.31      I1  -1.12      S4   1.47
G3   0.85                     I3   1.21      I2  -0.07
G4   1.62                     I4   1.50      I3   0.56
                                             I4   0.83

the pure-goodness data fall systematically above or below the prediction line, as in the four panels of Fig. 3. Apparently, if there were any violations of commensurability due to type of encoding, they were too small to be detected by our experimental design.

FIG. 4. d' as predicted by scaled distance for Experiment 1 (type of encoding). [Abscissa: Scaled Distance; ordinate: d'; legend includes “Mixed.”]

EXPERIMENT 2: EXPOSURE DURATION

Another variable that one might expect to affect memory for frequency is exposure duration. When duration is manipulated within subjects, however, its effect on frequency judgments can be small. Hintzman (1970) found little effect of duration on the judged frequency of words, but found a similarly small effect on free recall, suggesting that the result may have been due to strategies in which short-duration words were rehearsed as the intermixed longer durations allowed. A somewhat larger effect of duration was obtained on judged frequency for vacation slides, which may be less easily rehearsed than are words (Hintzman, Summers, & Block, 1975b). The present experiment used computer-displayed line drawings as stimuli. It also differed from previous studies in using the frequency discrimination task rather than requiring numerical judgments of frequency to individual items.

Method

Subjects and procedure. Subjects were 113 undergraduate students, recruited as before and tested individually at a Macintosh Plus computer. Subjects were first told that they should attend to a series of pictures to be shown on the computer screen. A total of 240 bit-mapped line drawings were then shown. Half the pictures were displayed for 500 ms (condition Short) and the other half for 1500 ms (condition Long), with the two durations randomly intermixed. Individual pictures were presented from one to three times, with repetitions of a given item always having the same duration. Repetitions were separated by at least 55 other items. The subjects next were given the frequency-discrimination test. Sixty pairs of pictures were presented side by side on the screen, and subjects were told to indicate which member of each pair had occurred more often by pressing a left or right key on the computer keyboard. A given pair appeared for a maximum of 5 s, with the next pair delayed until the subject had made a response. Presentation of study and test lists was controlled by Psychlab software (Gum & Bub, 1988).

Materials and design. The pictures were 120 black-and-white drawings of familiar objects, scavenged from various sources of “clip art.” Examples are: binoculars, cyclist, army knife, eye, Model T Ford, sheep, pencil sharpener, strawberry, and nurse. In each presentation list, 60 drawings were shown at the short duration and 60 at the long duration. Of each of these sets, 20 pictures occurred at each of the frequencies 1-3, yielding a total of six presentation conditions. Including repetitions, the presentation lists were 240 items long. The test included four pairs of pictures representing each of the 15 possible pairwise comparisons among six conditions. Right-left ordering was determined randomly for each pair. As in Experiment 1, counterbalancing was achieved through different versions of the presentation list. Six versions of the list were constructed by rotating each picture through the six conditions, and another six were made from these by reversing the frequency assignments of pictures paired together on the test. Approximately equal numbers of subjects were given each of the 12 presentation lists.

Results and Discussion

Table 4 shows the choice percentages, which reveal a small but consistent advantage for long-duration pictures over short-duration ones. The pure-pair analysis (see Experiment 1) favored long pairs over short pairs but was not significant (M = 1.94, SD = 1.66, t(2) = 1.95, p = .19), while the mixed-pair analysis showed a reliable advantage for long-duration items (M = 54.78, SD = 4.12, t(5) = 2.85, p < .05). The mixed data were converted to d's and scaled in MDS as previously described. Scaled values are shown in Table 3, and d' is plotted against scaled distance in Fig. 5. As was the case with type of processing, exposure duration produced no apparent deviation from commensurability: the regression line fitted to the mixed data yields $r^2$ = .995 for the mixed data alone and $r^2$ = .955 for the mixed and pure data combined, and neither the pure-long nor the pure-short data deviate systematically from that line. In part, this may reflect the small effect that duration had on judgments overall, which is consistent with previous results (Hintzman, 1970; Hintzman, Summers, & Block, 1975b). The good fit is nevertheless consistent with the model, and serves to further validate our analytic technique.

TABLE 4
DATA FROM EXPERIMENT 2: CHOICE PERCENTAGES, ROW CONDITION OVER COLUMN CONDITION

Condition and
frequency^a     S1    S2    S3    L1    L2
S2            82.0
S3            89.5  72.1
L1            52.9  20.6   7.8
L2            84.2  56.0  28.0  85.4
L3            93.4  81.0  62.3  89.6  74.2

a S = short (500 ms), L = long (1500 ms), digit indicates frequency.

FIG. 5. d' as predicted by scaled distance for Experiment 2 (exposure duration). [Abscissa: Scaled Distance.]

EXPERIMENT 3: RECENCY OF WORDS

Frequency discrimination, like other forms of memory performance, declines over time (Hintzman & Stern, 1984). In the Minerva 2 model, this effect is mediated by differences in echo intensity distributions like those shown in Fig. 1 (Hintzman, 1988). Thus, in its simplest form the model predicts that commensurability should hold for items that vary in recency. This prediction was tested here by using two presentation lists, separated by one week, and giving a frequency-discrimination test including pure-Day 1, pure-Day 2, and mixed Day 1 vs. Day 2 pairs, immediately following the second list.

Method

Subjects and procedure. Undergraduate students, recruited as before, volunteered for two experimental sessions seven days apart. A total of 156 subjects appeared for both sessions and were tested in groups of about 10 subjects each. On the first day, each subject worked through a questionnaire, rating words on the same semantic goodness scale that was used for half the items in Experiment 1. Words occurred from 1 to 4 times each. At the end of the session the subject was reminded to return one week later. During the first part of the second session, the subject performed the same task, rating a different set of words. Following this second list, the subject was given a printed frequency-discrimination test, with the instruction to choose the member of each pair that had occurred more often in the experiment, disregarding whether it had been rated during the first or second experimental session.

Materials and design. Stimuli were the same 112 words that had been used in Experiment 1. They were divided into two sets of 56 words each, one to serve as list 1 and the other as list 2. Within each list, 14 words occurred with each of the frequencies 1-4, giving a total of eight conditions and making each list 140 items long. Ordering was random except for the restriction that repetitions had to be separated by at least 15 other words. Each list had the form of a questionnaire, with the “bad-good” scale, illustrated earlier, depicted next to every word. The frequency-discrimination test form was identical to that used in Experiment 1. Counterbalancing was again accomplished by rotating words through the eight conditions and switching the condition assignments of members of each test pair, giving a total of 16 different presentation lists.

Results and Discussion

Table 5 presents the choice percentages. The data appear fairly orderly, with performance on pure Immediate (day 2) pairs generally exceeding that on pure Delayed (day 1) pairs, and with Immediate items tending to win over Delayed in the cases of mixed pairs. The pure-pair analysis showed a nonsignificant advantage for Immediate over Delayed (M = 4.77, SD = 6.61, t(5) = 1.77, p = .14), and the mixed-pair analysis showed a highly significant one (M = 58.93, SD = 2.03, t(9) = 13.89, p < .001). Overall, however, performance was surprisingly poor. For example, the Immediate-4 condition beat the Immediate-1 condition only 75.6% of the time. This performance can be compared directly with the pure-Goodness data shown in Table 2 (Experiment 1), because the words, encoding condition, and recency were about the same. It may be that the random alternation of goodness and rhyme judgments in Experiment 1 kept subjects more alert than did the more monotonous encoding task of Experiment 3. (Other possible explanations will be advanced after presentation of the results of Experiment 4.)

Table 3 and Fig. 6 show the result of scaling the mixed-pair d’ values. A good one-dimensional, linear fit to the mixed data was obtained. The pure-Immediate and pure-Delayed data appear quite variable, however. The regression line fitted to the mixed data yields r² = .967 for the mixed data alone, dropping to r² = .838 for the mixed data and pure data combined. A crucial question is whether the magnitude of this drop reflects a systematic violation of commensurability, or unreliability in the data. This experiment had essentially the same design and about the same number of subjects as Experiment 1, and so the amount of power should be about the same. Indeed, the greater irregularity of the present data may be an illusion, created by Fig. 6’s compressed d’ scale, which has less than half the range of the ordinate of Fig. 4. The same amount of random error is bound to appear larger when the ordinate is compressed. The lower d’ values in the present experiment, in turn, reflect the generally poor discrimination performance that we already remarked upon in connection with Table 5.

One way of evaluating the commensurability hypothesis is to compute the difference between each observed pure-choice d’ value and the d’ predicted by the mixed-choice regression line, and test the null hypothesis that the mean difference is zero. For the pure-Immediate data, M and SD were .122 and .126, respectively, t(5) = 2.37 (p = .06, 2-tailed); and for the pure-Delayed data M and SD were -.007 and .172, t(5) = .10 (p > .8). Thus, there is some suggestion that commensurability is violated by the pure-Immediate, but not by the pure-Delayed pairs. The general ambiguity of these results appears due to subjects’ poor overall performance, which depressed differences among conditions relative to the standard error of the mean. The primary purpose of Experiment 4, therefore, was to supplement these data by using materials that are generally better remembered on a frequency judgment test than are words.

TABLE 5
DATA FROM EXPERIMENT 3: CHOICE PERCENTAGES, ROW CONDITION OVER COLUMN CONDITION
[Table body not legible in the source.]
ᵃ D = delayed test, I = immediate test; digit indicates frequency.

FIG. 6. d’ as predicted by scaled distance for Experiment 3 (delay, with words).

EXPERIMENT 4: RECENCY OF PICTURES

Past research has shown that both absolute and relative frequency judgments can be quite good when the stimuli are photographs of color scenes (Hintzman & Rogers, 1973; Hintzman et al., 1975b). Accordingly, Experiment 4 was designed to test for effects of recency on commensurability in memory for the frequencies of vacation slides. A second purpose was to add a frequency = 0 condition. The Minerva 2 model views recognition memory (i.e., discriminating frequencies greater than 0 from frequency = 0) as a special case of memory for frequency, with the frequency = 0 signal strength distribution falling at the lower end of the signal-intensity scale (see Fig. 1). Recognition decisions, therefore, should fit the same scaling solution that describes discrimination among frequencies greater than zero.

Method

Subjects and procedure. A total of 177 subjects, recruited as previously described, attended two experimental sessions scheduled one week apart. They were tested in groups of about 10 each. On the first day, they were told that they would be shown a series of pictures and that they should study them for a later memory test, the nature of which was not revealed. A slide projector then displayed on the front wall of the room 160 color slides depicting a wide variety of objects and scenes. The pictures were shown at a 3-s rate. Different pictures were shown from 1 to 4 times each, with repetitions separated by at least 15 other items. On the second day, the subjects were given the same instructions and shown another series of slides having the same general makeup as the first. No pictures appeared in both of the lists. Following the second list, subjects spent about 5 min on a word-rating task, and then were given the frequency-discrimination test. A sequence of 72 pairs of pictures was shown, side by side, on the front wall, and subjects were told to identify one member of each pair as more frequent, by circling either L (left) or R (right) on a form listing 72 “L-R” pairs. They were instructed to choose the picture that had occurred more often in the experiment as a whole, regardless of the session in which it occurred. The test list consisted of pure Day-1 (Delayed) pairs, pure Day-2 (Immediate) pairs, mixed pairs, and pairs including frequency = 0 pictures, which had not been shown on either day. The presentation rate on the test was 5 s per pair.

Materials and design. Stimuli were 144 color photographs. They were divided into three subsets: 64 appeared in the first list, 64 appeared in the second list, and 16 were designated frequency = 0 and not shown in either list. Of the 64 stimuli assigned to each list, 16 appeared at each of the frequencies 1-4. Thus there were nine conditions altogether, including frequency = 0. Within-list ordering was random except for the minimum spacing restriction of 15. The frequency-discrimination test was identical for all subjects, and counterbalancing was achieved by systematically changing the presentation lists. Right-left ordering was determined randomly for each pair. Two pairs on the test list represented each of the possible 36 pairwise combinations of nine conditions. Pictures were rotated systematically through the nine conditions, and, for each such rotation, condition assignments were switched within test pairs, yielding a total of 18 material sets.

Results and Discussion

Choice percentages appear in Table 6. As anticipated, performance was better overall than it was in Experiment 3. There are puzzling aspects to these data, however. First, note that values in the pure-Immediate (lower right) part of the matrix generally exceed corresponding values in the pure-Delayed part. The pure-pair analysis showed that this difference was reliable (M = 3.45, SD = 2.96, t(5) = 2.86, p < .05). The direction of this difference is as expected, the only surprise being that the Immediate advantage was so small (for comparison data, see Hintzman & Stern, 1984).
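The pure-pair statistic just reported can be checked from its summary values alone. The following is a minimal sketch, assuming the test is an ordinary one-sample t test of the mean Immediate-minus-Delayed difference against zero with n = df + 1 observations. The function name `one_sample_t` is ours and is not taken from the article.

```python
import math

def one_sample_t(mean, sd, df, null_value=0.0):
    """One-sample t statistic computed from summary statistics.

    Assumes n = df + 1 observations and that `sd` is the sample
    standard deviation (an assumption; the text reports only M, SD, t, p).
    """
    n = df + 1
    standard_error = sd / math.sqrt(n)
    return (mean - null_value) / standard_error

# Pure-pair analysis of Experiment 4: M = 3.45, SD = 2.96, reported t(5) = 2.86.
t_pure = one_sample_t(3.45, 2.96, df=5)
print(f"t(5) = {t_pure:.3f}")  # close to the reported value; M and SD are rounded
```

The same function applies to the mixed-pair analyses if `null_value` is set to 50, since those test the mean choice percentage against chance.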

TABLE 6
DATA FROM EXPERIMENT 4: CHOICE PERCENTAGES, ROW CONDITION OVER COLUMN CONDITION

Condition and
frequencyᵃ      0      D1     D2     D3     D4     I1     I2     I3
D1             68.1
D2             79.6   67.2
D3             85.3   72.1   59.9
D4             87.0   82.7   70.2   59.5
I1             63.1   45.4   23.6   18.1   20.8
I2             80.7   61.4   43.6   35.4   23.8   67.0
I3             80.0   71.9   59.4   41.3   40.2   76.8   64.0
I4             83.7   77.0   60.8   55.2   43.6   83.7   78.5   62.9

ᵃ D = delayed test, I = immediate test; digit indicates frequency.

FIG. 7. d’ as predicted by scaled distance for Experiment 4 (delay, with pictures).

Second, although the difference is not statistically reliable (M = -2.98, SD = 2.84, t(3) = -2.10, p = .13), recognition performance was generally better when frequency = 0 items were paired with Delayed (day 1) items than with Immediate (day 2) items. This result is both contrary to expectations and difficult to reconcile with the first observation, showing the superiority of pure-Immediate to pure-Delayed pairs. Third, and equally surprising, the mixed data show a preference for Delayed over Immediate items, as well. This is most evident on the equal-frequency diagonal, where all four values are under 50%, but it is true of the rectangular matrix as a whole. The mixed-pair analysis confirmed that the difference from 50% was highly reliable (M = 44.77, SD = 2.56, t(9) = -6.46, p < .001).

For scaling purposes, the frequency = 0 condition was considered Delayed (i.e., frequency = 0 vs. Immediate pairs were treated as mixed). A linear, one-dimensional solution provided a satisfactory fit to the mixed data. Scaled values are found in Table 3, and the plot of d’ against distance can be seen in Fig. 7. Performance on recognition pairs was consistent with that on mixed pairs in general, as the Minerva 2 model predicts. Regarding prediction of the pure-Immediate and pure-Delayed data, matters look much the same as they did in Experiment 3. The regression line fitted to the mixed data provides a good fit to the mixed data alone (r² = .948) and a somewhat poorer fit to the mixed and pure data combined (r² = .883). More importantly, five of the six Immediate points fall above the prediction line and one falls on it, while half the Delayed points fall on either side of the line. Tests of the hypothesis of no difference between predicted and obtained d’ values yielded M = .198, SD = .199, t(5) = 2.40 for pure-Immediate pairs (p = .06), and M = .009, SD = .173, t(5) = 0.14 for pure-Delayed pairs (p > .8). These statistics are almost identical to the ones obtained in Experiment 3.

Although neither Experiment 3 nor 4 may make a compelling case against commensurability by itself, in combination they do. Statistically, this can be demonstrated in at least three ways: (a) The p values from the pure-Immediate t tests from both experiments can be combined as described by Winer (1962, pp. 43-44), yielding χ²(2) = 6.04, p < .05. (b) Of the 12 pure-Immediate data points in the two studies, 10 were above the prediction line, 1 was below, and 1 fell on the line. Assuming P(below) = .5, the binomial probability of 1 hit out of 11 (discounting the tie) is .006. (c) Across experiments, 10 Immediate points fell above the regression lines and 1 below, while 6 Delayed points fell above and 6 below, giving χ²(1) = 4.54, p < .05. Thus, Immediate-test performance was better than the mixed data predict, while Delayed-test performance was not.

The data pattern obtained in Experiments 3 and 4 is most like the one produced by the rescaling-by-multiplication model and shown in the upper-left panel of Fig. 2. That model assumes that subjects know that test items were presented under two different conditions, and believe that the manipulation is one that affects signal strength.
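Two of the combined tests reported for Experiments 3 and 4, the sign test in (b) and the 2 x 2 chi-square in (c), are simple enough to recompute by hand. The sketch below does so under two assumptions that the text does not state explicitly: that the sign-test probability is the cumulative one-tailed value P(X ≤ 1), and that the chi-square is the Pearson statistic without continuity correction. The helper names are ours.

```python
import math

def binomial_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi_square_p_1df(x):
    """Upper-tail p value of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# (b) 1 point below the line out of 11 untied pure-Immediate points.
p_sign = binomial_cdf(1, 11)

# (c) Immediate: 10 above, 1 below; Delayed: 6 above, 6 below.
chi2 = chi_square_2x2(10, 1, 6, 6)
p_chi2 = chi_square_p_1df(chi2)

print(f"sign test p = {p_sign:.4f}")
print(f"chi-square(1) = {chi2:.2f}, p = {p_chi2:.3f}")
```

Under these assumptions the recomputed values agree with the reported .006 and χ²(1) = 4.54.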
They therefore attempt to compensate for the memory bias by rescaling upward signals from the weaker condition, in order to make them more comparable to the stronger one. For the rescaling model to apply to a given manipulation, two assumptions must be met: First, the manipulation must be one that subjects are likely to believe affects memory. Second, it must be one that can be remembered; in the present case, this means subjects must be fairly good at judging whether a particular item occurred a week earlier or just a few minutes before the test. The manipulation of recency certainly meets the first test, as the notion that memories fade over time is firmly ingrained in folk psychology. It also meets the second. As a general rule, recency discrimination increases with temporal separation and declines with the time since the more recent event (Yntema & Trask, 1963). The data most directly relevant here may be those collected on normal subjects by Huppert and Piercy (1978). The subjects were able to judge quite accurately whether a picture had been seen “yesterday” or “today” (10 min earlier), averaging 78% correct on pictures that had been presented once and 91% on ones that had been presented three times. (Amnesics, who were also tested, did much worse.)

The rescaling hypothesis can also explain the peculiarities noted in the choice percentages of Table 6. Subjects in both Experiments 3 and 4 appear to have tried to compensate for differences in recency, but those in Experiment 4 overcompensated; thus the paradoxical preference for Delayed items in mixed pairs. If the rule that was adopted was, “rescale if you don’t remember seeing it today,” the difference between Immediate and Delayed in recognition performance can also be explained. That is, if subjects rescaled upward the retrieved signal strengths for frequency = 0 items, as well as for Delayed items, performance on Immediate-vs.-new pairs would be degraded, while that on Delayed-vs.-new pairs would not. It is not clear why overcompensation should occur with pictures (Exp. 4) and undercompensation with words (Exp. 3). One possibility is that recency discrimination is more accurate with pictures than with words, so that rescaling was more uniformly applied in the case of pictures.

This still leaves unexplained the surprisingly poor performance on pure-Immediate items in both Experiments 3 and 4. Two hypotheses suggest themselves. One is that encoding was poorer during the second session of either experiment than during the first session. This might occur if for some reason subjects were more poorly motivated and therefore less attentive during list presentation on the second day. A more interesting explanation is that the requirement to retrieve from memory events that are a week old as well as events of today inhibits retrieval of the more recent events, as compared with an effort focused on events of today. Such a phenomenon could be explained in several different ways; e.g., the effective “list length” may be greater, or contextual cues may be less precise when both sessions are involved. Stern (1981, Experiment 2) specifically tested for such an effect. He presented lists of words on two consecutive days, and then cued subjects 2 s prior to each test word as to whether it would have occurred “today,” “yesterday,” or on “either day.” Contrary to the hypothesis, he found no effect of the cue on recognition d’. However, Stern’s experiment could have failed because 2 s is not enough time to refocus the retrieval effort. In any case, regardless of the reason that pure-Immediate performance in the present Experiments 3 and 4 was so low, a rescaling hypothesis appears generally consistent with the violations of commensurability we obtained.

EXPERIMENT 5: SPACING OF REPETITIONS

The spacing of repetitions has a powerful effect on judgments of frequency (e.g., Hintzman, 1974, 1976), although recent evidence suggests that its effect may be eliminated under certain incidental learning tasks (Greene, 1989). A reasonable proposal is that, for whatever reason, items learned under massed repetition produce smaller signal strengths at retrieval than items learned under spaced or distributed repetition do (Hintzman, 1988). If a difference in signal strengths is all that is involved in the superiority of spaced over massed items, then commensurability should hold in the discrimination of frequencies of massed and spaced items. This prediction was tested in Experiment 5. In addition to investigating massed vs. spaced repetition, the experiment included items occurring with frequencies of 0 and 1.

Method

Subjects and procedure. A total of 178 subjects served in partial fulfillment of a course requirement. They were tested in groups of about 10 subjects each. They were shown a sequence of 282 color vacation slides, which they rated for pleasantness on a scale from 1 to 10. Ratings were made on printed forms, and the presentation rate was 4 s per slide. Different pictures occurred from 1 to 4 times in the list. For half the repeated items (Massed), all presentations were successive; for the other half (Spaced), the minimum spacing between repetitions was 15. Following the presentation list, subjects spent 5 min rating words on a filler task and then were given a frequency-discrimination test. The test was conducted as in Experiment 4, except that there were only 56 pairs of pictures on the test.

Materials and design. The stimuli were 112 color photographs from the same pool as was used in Experiment 4. Fourteen slides were assigned to the frequency = 0 condition, 14 to frequency = 1, and 42 each to the Massed- and Spaced-repetition conditions. Of the Massed and Spaced items, 14 occurred with each of the frequencies 2-4. Thus, there were a total of eight conditions, manipulated within subjects. The ordering of conditions was random, subject to the spacing requirements and an attempt to prevent the longer clusters of Massed repetitions from occurring together in the list. As before, the frequency-discrimination test was the same for all subjects, and counterbalancing was achieved by using different presentation lists. Two test pairs represented each of the 28 possible combinations of conditions. Pictures were rotated through the eight conditions, and for each such rotation the condition assignments within test pairs were switched, yielding 16 different presentation lists.

Results and Discussion

Some pictures were accidentally assigned to more than one condition, and all data for pairs including these items were removed prior to analysis. The data thus rejected accounted for no more than 9.4% of the data in any one condition and constituted 5.3% of the total. Table 7 shows the choice percentages. The pure-Massed, pure-Spaced, and mixed data all show a strong advantage for the Spaced condition, as previous findings would lead one to expect. This advantage was reliable in the pure-pair analysis (M = 13.13, SD = 4.14, t(2) = 5.50, p < .05) and in the mixed-pair analysis (M = 75.6, SD = 5.32, t(5) = 11.78, p < .001). Performance on pairs involving frequency = 0 and 1 items was generally quite good.

TABLE 7
DATA FROM EXPERIMENT 5: CHOICE PERCENTAGES, ROW CONDITION OVER COLUMN CONDITION

Condition and
frequencyᵃ      0      1      M2     M3     M4     S2     S3
1              91.8
M2             86.3   63.6
M3             92.6   71.0   63.7
M4             94.8   73.3   69.9   60.0
S2             93.5   83.8   70.7   60.4   48.0
S3             96.8   94.6   86.8   77.6   77.3   74.2
S4             96.0   97.8   90.2   87.5   80.2   87.8   71.0

ᵃ M = massed, S = spaced; digit indicates frequency.

All choice percentages were converted to d’ values, and several attempts were made to find a mixed-data scaling solution that included data from frequency = 0 and 1. These attempts failed to yield satisfactory fits. One likely reason is that d’ values for frequency = 0 and 1 pairs were very unstable, because choice percentages were close to the ceiling (see Table 7). Small and essentially random differences in choice percentages just under 100% translate into large differences in d’. Another likely reason is that the mixed data are truly incommensurable with the data from other comparisons. This is shown by the more modest scaling attempt to which we now turn.

The mixed d’ values alone were submitted to MDS for a linear, one-dimensional solution yielding the values shown in Table 3, and the fit is shown in Fig. 8. Although the mixed data are well described by the regression line, all pure-Massed and pure-Spaced points fall well above the line. The regression line fitted to the mixed-pair data yielded r² = .974 for the mixed data alone, but only r² = .688 for the mixed and pure data combined. Despite there being only three points of either type, differences between observed and predicted d’ were large and uniform enough to be statistically significant: M = 0.58, SD = .01, t(2) = 68.5 for Massed (p < .0002); and M = 0.82, SD = 0.22, t(2) = 6.38 for Spaced (p < .025).
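The ceiling problem noted above for the frequency = 0 and 1 pairs can be illustrated numerically. The sketch below assumes the textbook equal-variance conversion for two-alternative forced choice, d’ = √2 · z(p); the article's own conversion is described in an earlier section and may differ in detail. The function name `dprime_2afc` is ours.

```python
import math
from statistics import NormalDist

def dprime_2afc(percent_correct):
    """Convert a two-alternative forced-choice percentage to d'.

    Assumes the textbook equal-variance relation d' = sqrt(2) * z(p);
    this is an illustrative assumption, not the article's stated formula.
    """
    p = percent_correct / 100.0
    return math.sqrt(2) * NormalDist().inv_cdf(p)

# The same 3.2-point gap, once near the ceiling and once in mid-range.
# 94.6 and 97.8 are values from Table 7; 60.0 and 63.2 are illustrative.
gap_ceiling = dprime_2afc(97.8) - dprime_2afc(94.6)
gap_middle = dprime_2afc(63.2) - dprime_2afc(60.0)

print(f"d' gap near ceiling: {gap_ceiling:.2f}")
print(f"d' gap in mid-range: {gap_middle:.2f}")
```

Because the normal quantile function steepens sharply as p approaches 1, an equal difference in percentages corresponds to a several-fold larger difference in d’ near the ceiling than at mid-range.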
FIG. 8. d’ as predicted by scaled distance for Experiment 5 (spacing).

This pattern of data, with both pure data sets falling above the mixed-pair regression line, resembles those for the translation hypothesis, as shown in the two panels at the bottom of Fig. 3. A separate regression line was fitted to the pure data of Fig. 8. Its slope was not significantly different from that of the mixed-pair regression line, t(11) = 1.05, but the difference between the two intercepts was significant, t(11) = 3.05, p < .02. Moreover, a linear model with two intercepts and a single slope provided a good fit to the pure and mixed data combined (r² = .934). Thus the data appear particularly compatible with the version of the translation hypothesis shown in the lower right of Fig. 3, in which the mixed-pair d’ values from the original Minerva 2 simulation were reduced by a constant amount.

A plausible interpretation of this sort of violation of commensurability, which was stated earlier, is that Massed and Spaced frequencies are represented in different codes and that the process of translating between them is subject to error. According to this view, the pure-Massed and pure-Spaced comparisons are better than the mixed data predict because they are not subject to these code-translation errors. It is less obvious why the subtraction version of this hypothesis should fit better than the multiplication version. Considering the importance of just two or three data points for discriminating between the two versions, there may be little point in such speculation until we know whether the two-intercept, single-slope result will replicate.

There is other evidence suggesting that massed and spaced repetition may produce somewhat different frequency codes. Hintzman and Block (1970) varied both frequency and spacing and then asked some subjects to give ordinary judgments of frequency and others to give judgments of successive repetition (i.e., judging the “number times in a row” each test item had occurred). Frequency judgments followed the usual pattern of being higher for spaced than for massed items, but judgments of successive repetition, correctly, were higher on average for massed items than for spaced. About half the subjects displayed sensitivity to the number of massed repetitions in judgments of successive repetition, while the other half did not. Hintzman and Block (1970) suggested that immediate repetitions may have been especially striking to some subjects and, so, may have been encoded or labeled in a special way. Retrieval of this information on a memory test would support accurate judgments of successive repetition and, in the present experiment, could have enhanced performance on pure-Massed pairs over what signal strength, by itself, would allow.
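The two-intercept, single-slope model mentioned in the discussion of Fig. 8 can be sketched as an ordinary least-squares fit in which pure and mixed pairs share one slope but each has its own intercept. The data below are hypothetical and are not the values plotted in Fig. 8; the function name `fit_common_slope` is ours, and the article does not say how its fit was computed.

```python
def fit_common_slope(groups):
    """Least-squares fit of y = a_g + b * x with one slope b and one intercept per group.

    `groups` maps a group label to a list of (x, y) points. The pooled
    within-group slope is the closed-form least-squares solution.
    """
    sxy = sxx = 0.0
    means = {}
    for label, points in groups.items():
        n = len(points)
        x_bar = sum(x for x, _ in points) / n
        y_bar = sum(y for _, y in points) / n
        means[label] = (x_bar, y_bar)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in points)
        sxx += sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercepts = {
        label: y_bar - slope * x_bar for label, (x_bar, y_bar) in means.items()
    }
    return slope, intercepts

# Hypothetical (scaled distance, d') points, NOT taken from the article:
# the pure pairs lie on a line 0.5 d' units above the mixed-pair line.
data = {
    "mixed": [(0.5, 0.4), (1.0, 0.8), (2.0, 1.6), (3.0, 2.4)],
    "pure": [(0.5, 0.9), (1.0, 1.3), (2.0, 2.1)],
}
slope, intercepts = fit_common_slope(data)
print(slope, intercepts)
```

In this framing, a constant intercept difference with a shared slope corresponds to the subtraction version of the translation hypothesis, whereas a difference in slopes would point toward the multiplication version.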

AND HARTRY

that variances be the same. The method is thus appropriate for the Minerva 2 memory model, which approximates the linearity assumption fairly well. Using this method, we were able to identify two manipulations-type of encoding and exposure duration-that appear to follow the commensurability pattern and two other manipulations-recency and spacing-that do not. The two violations of commensurability appear to have different forms. We have tentatively identified the violation of commensurability produced by the recency manipulation as due to rescaling, and that resulting from massed vs. spaced repetition as due to deficient translation between frequency codes. The resealing hypothesis assumes that if subjects recall an item’s experimental condition and believe the manipulation is one that affects signal strength, they will try to compensate by introducing a bias into the decision process. By resealing strengths retrieved for the weaker condition upward (or, equivalently, resealing the stronger condition downward), subjects reduce the effect that the manipulation would otherwise have on mixed-condition accuracy, without affecting pure-condition accuracy. Within the framework of this hypothesis, our data suggest that subjects resealed in both Experiment 3 and Experiment 4, but GENERAL DISCUSSION when words were the stimuli (Experiment The purpose of this study was to develop 3) they undercompensated for the effect of a method of testing the commensurability a week’s delay, and when pictures were hypothesis and to apply it to several vari- used (Experiment 4) they overcompensated ables known to affect memory for fre- for the delay. quency. The adopted method involves titThree ingredients should be essential for ting a one-dimensional scaling model to the resealing: (1) For a substantial percentage mixed-choice data and using the scaling so- of items, subjects must be able to retrieve lution to predict pure-choice data. 
This the encoding condition or context (e.g., a method assumes a scaling model in which week ago vs. today). Thus, for manipuladistance is linearly related to forced-choice tions that are poorly remembered, resealing d’, a condition that is approximately met if should not occur. (2) Metamemory is inthe means and variances of distributions volved, in that the subjects must believe underlying the choices are linearly related. that the remembered manipulation is one As such, it is more restrictive than a model that affects signal strength. This implies that makes no assumption about variances that manipulations whose effects on memand less restrictive than one that requires ory are not widely known (e.g., type of pro-

COMMENSURABILITY

IN MEMORY

cessing) should not produce rescaling, that is, not unless the experimenter manages to implant a new metamemorial belief. The results of Experiment 1 thus appear consistent with this view. Exposure duration seems like a reasonable candidate for rescaling, since, at least in the case of pictures, it can be remembered to some degree (Hintzman et al., 1975b). However, Experiment 2 showed no evidence of rescaling. This could be because the duration manipulation was not well enough remembered, or because subjects saw no need for rescaling; indeed, performance on short-duration pictures was only slightly worse than on long-duration pictures, which is consistent with such a belief.

(3) Rescaling is assumed to be an effortful act. Thus, subjects must have sufficient time, capacity, and motivation to carry it out. This suggests that manipulations that eliminate the extra time, capacity, or motivation might reinstate commensurability where, otherwise, evidence of rescaling is obtained.

As a general rule one should expect to find evidence of rescaling when conditions 1-3 are present, but if any of the conditions is absent, such evidence should not be found.

The rescaling hypothesis may help to explain a curious result reported recently by Greene (1988). Subjects performed a frequency-discrimination task on words that they had either generated or merely copied. Greene examined separately the effects of generating or copying the less frequent and the more frequent member of each pair, reporting that generation of the more frequent member improved accuracy, while generation of the less frequent member had no effect. In his Experiment 5, for example, correct choice rates were as follows: C4C2, 60%; C4G2, 62%; G4C2, 6%; G4G2, 66%, where G = generate, C = copy, and 2 and 4 are the frequencies. These data can be seen as a violation of commensurability. The first two figures are particularly surprising, since a generated item should be stronger than a copied one. Although the 60% vs. 62% difference was not significant, it suggests that subjects may have overcompensated in attempting to make copied items more comparable to generated ones.4 A rescaling account of Greene's results, of course, requires that subjects be able to recall whether they copied or generated a word. We know of no data that are relevant to this question.

4 Greene (1988) discussed possible effects of subjects' adjustment of frequency estimates, arguing that when variables are manipulated within subjects, such adjustments should apply equally to all items (p. 302). This is contrary to the present rescaling hypothesis, which holds that subjects may adjust individual items differently, according to their memory of the conditions under which encoding occurred.

Our original expectation regarding Experiment 5 was that the spacing manipulation would obey commensurability, because subjects have little or no metamemorial understanding of the spacing effect (Zechmeister & Shaughnessy, 1980). Instead, we obtained a violation of commensurability similar to what the translation hypothesis predicts (Fig. 3). The proposal is that when subjects discriminate frequencies of pure-Massed pairs and of pure-Spaced pairs, they rely to some extent on qualitatively different representations of frequency, and that they are not capable of the perfect translation between codes that would be necessary for mixed-pair performance to be commensurable with that on pure pairs.

What is the nature of the two hypothetical frequency codes? It was pointed out earlier that when called upon to do so, many subjects can make accurate judgments of successive repetition, suggesting that massed repetitions were especially noticed and tagged in some distinctive way (Hintzman & Block, 1970). Thus, some kind of propositional code might be associated with massed items, progressing over repetitions (e.g., "X," then "X again," then "X yet again," then "Oh, no, not X again, I'm really getting tired of looking at X."). In the extreme case, massed repetitions might even


be counted, a strategy that would be too taxing and inefficient for coding the frequencies of items whose repetitions are widely spaced. The other representation of frequency, differentiating frequencies of spaced items, could be an analog representation such as signal strength (echo intensity in Minerva 2). This quantity, we assume, increases much more with spaced repetitions than when repetitions are massed. Of course, propositions formed in a kind of internal dialogue could play a role in the judged frequency of spaced items, too. The evidence, however, suggests a qualitative distinction between codes, because the primary representation underlying spaced frequency appears to be obligatory and to be too accurate or discriminating to be the same one on which pure-Massed frequency discriminations are based (cf. Hintzman, 1976). The translation problem arises because two propositions, or two signal strengths, can be compared directly, while it is difficult to know exactly how to compare a frequency-relevant proposition with a signal strength.

The foregoing argument partially saves the notion that frequency judgments are primarily based on signal strength, but it does so at a cost: It allows information other than signal strength to enter into the judged frequency of massed items (not just into judgments of successive repetition), and it implicitly opens the possibility that propositional information could influence judged frequency of spaced items, as well. To explain subjects' ability to remember the spacing of repetitions, for example, it has been proposed that when an item is repeated it retrieves the trace of its previous presentation, and an implicit recency judgment is stored (Hintzman & Block, 1973; Hintzman, Summers, & Block, 1975a). Such retrieved, re-processed, and re-stored information could play a role in judgments of frequency, although our present position is that its role in memory for frequency is secondary to that of signal strength.


From the standpoint of the Minerva 2 memory model, the rescaling hypothesis constitutes a less serious violation of commensurability than does the translation hypothesis, because rescaling is assumed to occur following memory retrieval, and so it entails no change in assumptions about the operation of memory itself. By contrast, the translation hypothesis requires at least some new assumptions about how frequency information is encoded and retrieved, although we have proposed that the role of the propositional code is secondary to that of signal strength. One question for further research is whether such elaborations and refinements of the basic model will be sufficient to explain violations of commensurability or whether the notion that signal strength underlies memory for frequency will have to be replaced. Another question is how other models that adopt a signal-strength account of recognition memory might be generalized to deal with memory for frequency (e.g., Gillund & Shiffrin, 1984; Humphreys, Bain, & Pike, 1989; Murdock, 1982), and whether such extended models predict commensurability under the same conditions as the Minerva 2 model does.

According to the Minerva 2 model, recognition memory is a special case of memory for frequency, and so forced-choice recognition and frequency discrimination should be commensurable. In Experiment 4, pairs including frequency = 0 pictures appear generally commensurable with Delayed pictures, which makes sense if both were rescaled. However, because commensurability between Immediate and Delayed pictures was not found, the experiment does not provide a good test. In the data of Experiment 5, we were unable to obtain a satisfactory scaling solution that included pairs involving frequency = 0 and frequency = 1 pictures, combined with either pure-Massed or pure-Spaced. In retrospect, it appears that the extremely good performance of subjects on these pairs, and


the consequent wild fluctuations in d′, may be responsible for this failure. An adequate test of the commensurability of forced-choice recognition and frequency discrimination will require that recognition performance be further from ceiling than it was in Experiment 5.

The present rescaling hypothesis is relevant to an earlier experiment on the relationship between recognition and memory for frequency, by Wells (1974). As each word appeared in a running memory task, subjects judged whether it had occurred 0, 1, or 2 times previously. In her analysis, Wells plotted Pr(judgment = 1 or 2) (the recognition decision) against Pr(judgment = 2) in a scatter diagram, to see whether the function relating the judgment criteria corresponding to the two probabilities was the same, regardless of actual frequency (1 or 2). However, to get the ranges of recognition probabilities for both frequencies to overlap, Wells had to test the frequency = 2 words at much longer lags than the frequency = 1 words. She found that words of the two frequencies followed different functions and concluded (in different terms) that recognition and memory for frequency are incommensurable. One explanation of Wells's results may be that her subjects retrieved recency information and adjusted frequency-judgment criteria differently for items of different recencies (Hintzman, 1988). Although Wells's recencies were much shorter than those used here, the present results suggest that such rescaling based on recency can occur.

Readers familiar with the recognition-memory literature on the "mirror effect" (Glanzer & Adams, 1985, 1990) may have already observed that commensurability is an issue where decisions are restricted to frequency = 0 or 1, and more than one category of experimental items is used (e.g., common vs. uncommon and/or concrete vs. abstract words). We believe that the present method may be helpful in understanding the mirror effect. That question,

however, is beyond the scope of the present report.

APPENDIX A: SIMULATIONS

In the Minerva 2 model (Hintzman, 1986, 1988), each memory trace is a vector of N features. For the present simulations, N = 15. In a representation of each stimulus, the features took on the values +1, -1, and 0, each with probability 1/3. Encoding in memory is a probabilistic process, having probability L. Thus for each element in a memory trace, P[+1] = P[-1] = L/3, and P[0] = 1 - 2L/3. Every event, even if it is identical to an earlier event, is represented by its own memory trace. Thus following a list of M items, including repetitions, M trace vectors are stored in memory. When a retrieval cue or probe is presented, its similarity to each memory trace is computed as

$$S_i = \frac{1}{N_i} \sum_{j=1}^{N} P_j T_{ij},$$

where $P_j$ is the value of feature j in the probe, $T_{ij}$ is the value of feature j in trace i, and $N_i = N - Z_i$. $Z_i$ is the number of "irrelevant" features, for which both $P_j = 0$ and $T_{ij} = 0$. Activation of trace i is a function of its similarity: $A_i = S_i^3$. Finally, the echo intensity produced by the probe is summed over the M traces in memory:

$$I = \sum_{i=1}^{M} A_i.$$

The simulations were done by presenting the model with 16 different stimulus vectors, four having each of the presentation frequencies 1-5. This resulted in M = 60 trace vectors. For half the items of each frequency, L = .45 (condition A), and for the other half, L = .35 (condition B). The original version of each item was used as a probe, as were four newly generated items


(frequency = 0). This process was repeated for a total of 2000 simulated subjects. Values of I were accumulated in bins of width .05, yielding the distributions in Fig. 1. Choice probabilities were estimated from these distributions by the formula

$$\Pr(Y > X) = \sum_{k} \Pr\{I_X = k\}\,\bigl[\Pr\{I_Y > k\} + .5\,\Pr\{I_Y = k\}\bigr],$$

where $I_X$ and $I_Y$ are the echo intensities produced by items in distributions X and Y, respectively, and k indexes the bins. The simulation of the rescaling model was identical in every respect, except that whenever the probe was a B item or a frequency = 0 item, the retrieved echo intensity value (I) was multiplied by 2.2.

APPENDIX B: ADDITIVITY OF FORCED-CHOICE d′

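To prove. Given three points on a line, if

$$\sigma_i^2 = b\mu_i + a \tag{1}$$

and

$$d'_{ik} = \frac{2(\mu_i - \mu_k)}{\sigma_i + \sigma_k}, \tag{2}$$

then

$$d'_{ik} = d'_{ij} + d'_{jk}. \tag{3}$$

Proof. Solve (1) for $\mu_i$:

$$\mu_i = \frac{1}{b}(\sigma_i^2 - a). \tag{4}$$

Substitute (4) into (2):

$$d'_{ik} = \frac{2\bigl[(1/b)(\sigma_i^2 - a) - (1/b)(\sigma_k^2 - a)\bigr]}{\sigma_i + \sigma_k}. \tag{5}$$

Equation (5) simplifies to

$$d'_{ik} = \frac{2}{b}(\sigma_i - \sigma_k). \tag{6}$$

Then

$$d'_{ik} = \frac{2}{b}(\sigma_i - \sigma_j) + \frac{2}{b}(\sigma_j - \sigma_k) \tag{7}$$

$$= d'_{ij} + d'_{jk} \tag{8}$$

from (6).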
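The additivity result of Appendix B can be checked numerically. The following is a minimal sketch, not part of the original analysis: the values of a, b, and the three means are hypothetical, chosen only for illustration, and the function names are ours. It evaluates Equation (2) directly under the linear variance assumption of Equation (1), and contrasts it with a case in which the standard deviation is proportional to the mean, where additivity is not expected to hold.

```python
import math

def sigma(mu, a, b):
    """Standard deviation when the variance is linear in the mean (Eq. 1)."""
    return math.sqrt(b * mu + a)

def d_prime(mu_hi, mu_lo, a, b):
    """Forced-choice d' between two points on the scale (Eq. 2)."""
    return 2.0 * (mu_hi - mu_lo) / (sigma(mu_hi, a, b) + sigma(mu_lo, a, b))

def d_prime_sd_proportional(mu_hi, mu_lo, c=0.5):
    """Contrast case (assumption for illustration): SD = c * mean."""
    return 2.0 * (mu_hi - mu_lo) / (c * mu_hi + c * mu_lo)

# Hypothetical parameters and scale values.
a, b = 0.5, 0.3
mu_k, mu_j, mu_i = 1.0, 2.5, 4.0

d_ik = d_prime(mu_i, mu_k, a, b)
d_ij = d_prime(mu_i, mu_j, a, b)
d_jk = d_prime(mu_j, mu_k, a, b)

# Eq. (3): additivity under the linear mean-variance relation.
additive = math.isclose(d_ik, d_ij + d_jk)
# Eq. (6): closed form in terms of the standard deviations.
closed_form = math.isclose(
    d_ik, (2.0 / b) * (sigma(mu_i, a, b) - sigma(mu_k, a, b))
)
# Contrast: additivity fails when SD is proportional to the mean.
contrast_additive = math.isclose(
    d_prime_sd_proportional(mu_i, mu_k),
    d_prime_sd_proportional(mu_i, mu_j) + d_prime_sd_proportional(mu_j, mu_k),
)

print("additive under Eq. (1):", additive)
print("matches Eq. (6):", closed_form)
print("additive when SD is proportional to mean:", contrast_additive)
```

The check relies only on the identity $\sigma_i^2 - \sigma_k^2 = (\sigma_i - \sigma_k)(\sigma_i + \sigma_k)$, which is what makes Equation (5) collapse to Equation (6).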

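The simulation procedure of Appendix A can also be sketched in code. This is a minimal illustration under stated assumptions, not the program used for the reported simulations: all function and variable names are hypothetical, four items are generated at each of the frequencies 1-5 (which yields M = 60 traces), two of the four items at each frequency are assigned to each condition, and only 100 simulated subjects are run to keep the example fast. The optional `rescale` argument multiplies the echo intensity of B probes and new probes, as described for the rescaling model.

```python
import math
import random

N = 15                               # features per vector
FREQS = (1, 2, 3, 4, 5)              # presentation frequencies
ITEMS_PER_FREQ = 4                   # assumed: two per condition per frequency
L_BY_COND = {"A": 0.45, "B": 0.35}   # encoding probabilities

def random_item(rng):
    """Stimulus vector: each feature is +1, -1, or 0 with probability 1/3."""
    return [rng.choice((-1, 0, 1)) for _ in range(N)]

def encode(item, L, rng):
    """Copy each feature into the trace with probability L, else store 0."""
    return [f if rng.random() < L else 0 for f in item]

def echo_intensity(probe, traces):
    """Sum over traces of the cubed similarity to the probe."""
    total = 0.0
    for trace in traces:
        dot = sum(p * t for p, t in zip(probe, trace))
        relevant = sum(1 for p, t in zip(probe, trace) if p != 0 or t != 0)
        if relevant:                 # skip the degenerate all-zero case
            total += (dot / relevant) ** 3
    return total

def simulate_subject(rng, rescale=1.0):
    """Return ({(condition, frequency): [I, ...]}, M) for one subject."""
    items, traces = [], []
    for freq in FREQS:
        for n in range(ITEMS_PER_FREQ):
            cond = "A" if n % 2 == 0 else "B"
            item = random_item(rng)
            items.append((cond, freq, item))
            # One trace per presentation; order does not affect the sum.
            traces.extend(encode(item, L_BY_COND[cond], rng) for _ in range(freq))
    out = {}
    for cond, freq, item in items:
        factor = rescale if cond == "B" else 1.0
        out.setdefault((cond, freq), []).append(factor * echo_intensity(item, traces))
    for _ in range(4):               # newly generated probes, frequency = 0
        value = rescale * echo_intensity(random_item(rng), traces)
        out.setdefault(("new", 0), []).append(value)
    return out, len(traces)

def choice_probability(xs, ys, width=0.05):
    """Pr(Y > X) from binned values, counting same-bin ties as one half."""
    def binned(values):
        counts = {}
        for v in values:
            k = math.floor(v / width)
            counts[k] = counts.get(k, 0) + 1
        return {k: c / len(values) for k, c in counts.items()}
    px, py = binned(xs), binned(ys)
    total = 0.0
    for k, p in px.items():
        above = sum(q for j, q in py.items() if j > k)
        total += p * (above + 0.5 * py.get(k, 0.0))
    return total

def run(n_subjects=100, seed=1, rescale=1.0):
    """Pool echo intensities over simulated subjects."""
    rng = random.Random(seed)
    pooled, m = {}, 0
    for _ in range(n_subjects):
        out, m = simulate_subject(rng, rescale)
        for key, values in out.items():
            pooled.setdefault(key, []).extend(values)
    return pooled, m

pooled, M = run()
print("M =", M)
print("Pr(A4 > A2) ~", round(choice_probability(pooled[("A", 2)], pooled[("A", 4)]), 3))
print("Pr(B4 > A2) ~", round(choice_probability(pooled[("A", 2)], pooled[("B", 4)]), 3))
```

Calling `run(rescale=2.2)` would give the corresponding pooled distributions for the rescaling variant; estimates from so few simulated subjects are noisy and are meant only to show the mechanics.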
REFERENCES

FISK, A. D., & SCHNEIDER, W. (1984). Memory as a function of attention, level of processing, and automatization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10, 181-197.

GILLUND, G., & SHIFFRIN, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

GLANZER, M., & ADAMS, J. K. (1985). The mirror effect in recognition memory. Memory and Cognition, 13, 8-20.

GLANZER, M., & ADAMS, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 5-16.

GREENE, R. L. (1986). Effects of intentionality and strategy on memory for frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 489-495.

GREENE, R. L. (1988). Generation effects in frequency judgment. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 298-304.

GREENE, R. L. (1989). Spacing effects in memory: Evidence for a two-process account. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 371-377.

GUM, T., & BUB, D. (1988). PsychLab Manual. Montreal: PsychLab.

HACKER, M. J., & RATCLIFF, R. (1979). A revised table of d′ for M-alternative forced choice. Perception & Psychophysics, 26, 168-170.

HINTZMAN, D. L. (1969). Apparent frequency as a function of frequency and the spacing of repetitions. Journal of Experimental Psychology, 80, 139-145.

HINTZMAN, D. L. (1970). Effects of repetition and exposure duration on memory. Journal of Experimental Psychology, 83, 435-444.

HINTZMAN, D. L. (1974). Theoretical implications of the spacing effect. In R. L. Solso (Ed.), Theories in cognitive psychology: The Loyola symposium (pp. 77-99). Potomac, MD: Erlbaum.

HINTZMAN, D. L. (1976). Repetition and memory. In G. H. Bower (Ed.), The psychology of learning and motivation (pp. 47-91). New York: Academic Press.

HINTZMAN, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.


HINTZMAN, D. L., & BLOCK, R. A. (1970). Memory judgments and the effects of spacing. Journal of Verbal Learning and Verbal Behavior, 9, 561-566.

HINTZMAN, D. L., & BLOCK, R. A. (1973). Memory for the spacing of repetitions. Journal of Experimental Psychology, 99, 70-74.

HINTZMAN, D. L., & ROGERS, M. K. (1973). Spacing effects in picture memory. Memory and Cognition, 1, 430-434.

HINTZMAN, D. L., & STERN, L. D. (1984). A comparison of forgetting rates in frequency discrimination and recognition. Bulletin of the Psychonomic Society, 22, -12.

HINTZMAN, D. L., SUMMERS, J. J., & BLOCK, R. A. (1975a). Spacing judgments as an index of study-phase retrieval. Journal of Experimental Psychology: Human Learning and Memory, 1, 31-40.

HINTZMAN, D. L., SUMMERS, J. J., & BLOCK, R. A. (1975b). What causes the spacing effect? Some effects of repetition, duration, and spacing on memory for pictures. Memory and Cognition, 3, 287-294.

HUMPHREYS, M. S., BAIN, J. D., & PIKE, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

HUPPERT, F. A., & PIERCY, M. (1978). The role of trace strength in recency and frequency judgements by amnesic and control subjects. Quarterly Journal of Experimental Psychology, 30, 347-354.

MURDOCK, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

ROSE, R. J., & ROWE, E. J. (1976). Effects of orienting task and spacing of repetitions on frequency judgments. Journal of Experimental Psychology: Human Learning and Memory, 2, 142-152.

ROWE, E. J. (1974). Depth of processing in a frequency judgment task. Journal of Verbal Learning and Verbal Behavior, 13, 638-643.

STERN, L. D. (1981). Time of encoding as a retrieval cue in recognition memory. American Journal of Psychology, 94, 99-112.

THURSTONE, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273-286.

TORGERSON, W. S. (1958). Theory and methods of scaling. New York: Wiley.

WELLS, J. E. (1974). Strength theory and judgments of recency and frequency. Journal of Verbal Learning and Verbal Behavior, 13, 378-392.

WILKINSON, L. (1986). SYSTAT: The system for statistics. Evanston, IL: SYSTAT, Inc.

WINER, B. J. (1962). Statistical principles in experimental design. New York: McGraw-Hill.

YNTEMA, D. B., & TRASK, F. P. (1963). Recall as a search process. Journal of Verbal Learning and Verbal Behavior, 2, 65-74.

ZECHMEISTER, E. B., & SHAUGHNESSY, J. J. (1980). When you know that you know and when you think that you know but you don't. Bulletin of the Psychonomic Society, 15, 41-44.

(Received December 7, 1989)
(Revision received April 16, 1990)