JOURNAL OF EXPERIMENTAL CHILD PSYCHOLOGY 43, 96-111 (1987)

Test Monitoring in Young Grade School Children

MICHAEL PRESSLEY
University of Western Ontario

JOEL R. LEVIN
University of Wisconsin-Madison

ELIZABETH S. GHATALA

AND

MAHEEN AHMAD
University of Western Ontario
Younger (6- to 8-year-olds) and older (9- to 11-year-olds) children took a multiple-choice test that yielded comparable performances at the two age levels. When subjects estimated their overall performance at the end of the entire test, older children were more accurate and less variable than younger children. This developmental shift in monitoring of performance could not be explained by the development of estimation skills per se, nor could it be explained by developmental changes in monitoring the correctness of responses to individual items. Younger children were less accurate in judging the correctness of responses to individual items than were older children, with older children manifesting higher signal-detection indices. Statistically significant developmental improvements in monitoring at the item level, however, occurred in girls only. (c) 1987 Academic Press, Inc.

This research was principally supported by a grant to the first author from the Natural Sciences and Engineering Research Council of Canada. Joel Levin's participation was financed by a Romnes Faculty Fellowship from the University of Wisconsin. The authors are grateful to Patricia Worden, Jeff Bisanz, and two anonymous reviewers for helpful critiques of an earlier version of this paper. Reprints may be requested from Michael Pressley, Department of Psychology, University of Western Ontario, London, Ontario, N6A 5C2, Canada.

0022-0965/87 $3.00
Copyright (c) 1987 by Academic Press, Inc.
All rights of reproduction in any form reserved.


There have been an enormous number of studies tapping the many aspects of children's metacognition (Forrest-Pressley, MacKinnon, & Waller, 1985a, 1985b), with metacognition that controls ongoing thinking considered to be especially important (e.g., Baker & Brown, 1982; Yussen, 1985). This regulatory metacognition follows from the execution of general strategies, including but not limited to planning (figuring out how to proceed), predicting (estimating how well one will do or how long a task will take), and monitoring (checking whether or not a goal has been reached). For example, when a task is presented to proficient learners, they make predictions about it with respect to factors such as difficulty level and time requirements, and then use these predictions as they plan a strategy, begin to carry out the plan, and monitor progress along the way. If they perceive that their goals are being met, there is no reason to modify their processing; if they sense no progress, however, the search for a new approach to the problem is likely.

Although monitoring presumably takes place throughout the encoding (studying) process, especially important monitoring opportunities are available during testing. In support of this claim, adults' awareness of how much they learned during study is improved by giving them a test over the material (e.g., King, Zechmeister, & Shaughnessy, 1980; Lovelace, 1984; Pressley, Levin, & Ghatala, 1984). Consistent with metacognitive theory, knowledge about performance that is gained through test monitoring can be an important determinant of whether learners realize that more study is necessary to accomplish a goal. Such knowledge can also affect how additional study is carried out (e.g., Pressley et al., 1984; Pressley, Snyder, Levin, Murray, & Ghatala, in press).

Despite the potential for tests to inform children about their performance, and despite the enormous number of tests in most formal educational settings, there have been no developmental evaluations of test monitoring per se. The studies that come closest to doing so have involved children's monitoring of test outcomes after both studying and testing (e.g., Berch & Evans, 1973; Bisanz, Vesonder, & Voss, 1978; Kelly, Scholnick, Travers, & Johnson, 1976; Masur, McIntyre, & Flavell, 1973). The results obtained in these investigations are difficult to interpret with respect to test monitoring, however, inasmuch as it is known that monitoring during study improves with age (e.g., Worden & Sladewski-Awig, 1982). Thus, the main purpose of the present study was to determine if there are developmental increases in the metacognitive benefits derived solely from taking a test.

We expected that developmental increases in test monitoring might be especially likely during the early grade school years. Experience-based, age-related shifts in test monitoring seemed especially likely during this period, since children first encounter frequent, formal tests of their knowledge with their entry into school. A second reason for focusing on the early grade school years is that other forms of monitoring, ones that share characteristics with test monitoring, develop between 6 and


10 years of age. For instance, both communications monitoring (e.g., Flavell, Speer, Green, & August, 1981; Robinson, 1981) and text comprehension monitoring (Wagoner, 1983) involve evaluating one's own performance (i.e., understanding a message or passage) in light of the quality of an external stimulus (e.g., the ambiguity and/or completeness of the message or passage). Similarly, test monitoring involves evaluation of the internal state of knowing test answers in light of attributes of the external test (e.g., its difficulty level). Thus, it seemed likely that the improvements in both communications and comprehension monitoring that occur during the elementary school years might be paralleled by changes in test monitoring.

There are two important obstacles to collecting developmental data on test monitoring, however, that preclude simply comparing monitoring after study with monitoring after both study and test, as has been done with adults (Pressley et al., 1984, Experiments 1-3). First, there are potential developmental differences in ease of learning that could produce different attitudes or degrees of confidence during study that might be carried over to judgments about test performance. Our solution to this problem was to give a test that does not require study, but rather taps preexperimental knowledge. A second possible confounding variable is level of test performance, with older subjects making postdictions about a test that is functionally much easier for them than for younger subjects. To counteract this, we devised a test that produced approximately the same level of performance at each age studied. In particular, subjects were presented two types of items from a standardized vocabulary test: (1) extremely easy ones that would be answered at a ceiling level by almost all elementary school children; and (2) some of the most difficult items from the standardized measure, ones well beyond the lexical knowledge of almost all elementary school children. We assumed that average test performance for all children would be approximately equal to 100% of the easy items + 25% of the hard items (25% being the "chance" value for the four-option multiple-choice questions used here). Monitoring of overall test performance was tapped by asking the subjects how many of the vocabulary items they thought they had answered correctly.

Two manipulated variables were also included in the design. Half of the subjects at each age level evaluated their certainty of correctness for every item, with these ratings collected immediately after completing an item. The remaining subjects made no such item-by-item judgments. In addition to permitting more fine-grained assessment of test monitoring than was provided by the overall monitoring data, collection of the item-by-item ratings made possible the evaluation of a mechanism that might mediate overall test-monitoring accuracy. Better overall monitoring in the item-by-item evaluation condition would suggest that children do not attend to such information spontaneously, but that overall test monitoring


can be improved by directing children's attention to their performance on individual items. This hypothesis follows from gains in other types of monitoring when children's attention is guided to relevant dimensions of a problem (e.g., Elliott-Faust & Pressley, 1986; Markman, 1979).

In order that the results would not be limited to one level of performance, the difficulty of the test was also manipulated. Some of the children took a test composed of two-thirds hard items and one-third easy items; other children received a test composed of equal numbers of easy and difficult items; and still other participants took a test composed of one-third hard items and two-thirds easy items.

Finally, because overall test monitoring requires gauging how many items were answered correctly, its development might reflect nothing more than developmental changes in general estimation skills (e.g., Newman & Berger, 1984). To separate developmental changes in monitoring from growth of estimation capacity, all subjects were administered six items that required estimation. An estimation inaccuracy score was derived and compared with age in order to determine the relatively more prominent predictor of overall test-monitoring accuracy.

METHOD

The 36 children (17 males, 19 females) in the younger sample (mean age = 7 yr 1 mo; range = 6 yr 1 mo to 8 yr 1 mo) were enrolled in Grades 1 and 2 of a school serving a middle-class neighborhood in a moderately sized, predominantly English-speaking, Canadian city. Twenty-two boys and 14 girls (mean age = 10 yr 4 mo; range = 9 yr 3 mo to 11 yr 3 mo) who were enrolled in the fourth and fifth grades of the same school served as the older sample. All participants could count from 1 to 30 without error when asked to do so by the experimenter, as could all children in Grades 1, 2, 4, and 5 at this school. This was not a specially selected sample of the children in the school.

A set of 40 items from the Peabody Picture Vocabulary Test (Dunn & Dunn, 1981) served as test items for the experiment. Twenty were among the easiest items on the test (e.g., half), whereas 20 were drawn from the most advanced items (e.g., pensile). Each item on the test consists of a vocabulary item and four picture alternatives. The child's task is to point to the picture depicting the referent of the vocabulary item. Half of the subjects at each age level were required to rate the certainty of correct responding for each item immediately after completing the item. Subjects responded by pointing to one of four red dots, each of


which had an interpretive label printed below it. A very small dot (1 cm in diameter) was labeled "NOT SURE AT ALL, JUST GUESSING," a slightly larger one (2 cm in diameter) represented "A LITTLE BIT SURE," a larger one still (3 cm) meant "PRETTY SURE," and the largest (10 cm) carried the label "REALLY, REALLY SURE."

Three levels of test difficulty were devised using the 40 items. One-third of the subjects at each age level responded to a list of 10 easy and 20 hard items (hard list). One-third of the subjects were presented a list with 15 easy and 15 hard items (moderate list). The remaining subjects were tested with 20 easy and 10 hard words (easy list). Six different hard, moderate, and easy lists each were constructed by randomly deleting items to meet the proportional constraints for the list. Two subjects at each age level (one subject who provided both item-by-item ratings and overall test-monitoring data and one subject who rated only overall performance) received each of the specific test lists.

Two decks of 3 x 5-in. cards and four 8 x 11-in. manila cards were used to assess each child's estimation ability per se. One deck was composed of 21 cards with smiling faces drawn on them and nine cards with frowning faces drawn on them. The second deck consisted of 18 cards that had blue dots and 12 that had black dots (one to five dots per card). This deck was used in conjunction with one blue and one black Bic pen. The first manila card had 14 gold stars affixed to it. Card 2 had 17 silver stars. Card 3 had 16 gold stars and 14 blue dots. Card 4 had 5 green stars and 25 red dots.
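The arithmetic behind these list compositions can be made concrete with a small sketch (our illustration in Python, not part of the original report), under the assumption stated in the introduction: ceiling performance on easy items and one-in-four chance guessing on hard items.

    # Expected number correct per 30-item list if every easy item is
    # answered correctly and every hard item is a four-alternative guess.
    LISTS = {"easy": (20, 10), "moderate": (15, 15), "hard": (10, 20)}  # (n_easy, n_hard)

    for name, (n_easy, n_hard) in LISTS.items():
        expected = n_easy + 0.25 * n_hard
        print(f"{name:8s} list: {expected:5.2f}/30 ({expected / 30:.1%})")

    # easy     list: 22.50/30 (75.0%)
    # moderate list: 18.75/30 (62.5%)
    # hard     list: 15.00/30 (50.0%)

These expectations of roughly 75, 62, and 50% correct are close to the obtained means of 74, 62, and 49% reported under Results.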

This was a 3 (test difficulty) x 2 (rating formats: item-by-item monitoring followed by overall test monitoring vs overall test monitoring only) x 2 (age) design. Within age, subjects were randomly assigned to a level of test difficulty and a rating condition. Each subject was seen individually in a small quiet room in the school. Subjects were told that they would be taking a vocabulary test during the session. The nature of the test and the responses required were explained using a sample item provided with the test. Subjects in the item-by-item monitoring condition were also told about the rating scale, with the correspondence between the size of the dots and the degree of certainty explained until the child understood the meaning of the scale. The experimenter probed with questions such as, "Which dot do you point to if you're just guessing and don't really know if you are right?" and "Which dot do you point to if you think you probably got it right but you are not absolutely certain?"

All children were required to provide responses for each of the 30 vocabulary items: an answer as each item was presented in the overall test-monitoring condition, and an answer and a certainty rating in the


item-by-item condition. Although no limit was put on time to respond, children were urged by the experimenter to make a decision when they delayed for more than 10 s on an item. Immediately after completing the final vocabulary item, the overall test-monitoring data were collected. The children were asked, "How many of those 30 test items do you think you got right? Think hard and make your best guess as to how many you got right out of 30. Make your guess as close to the actual number you got right as you can."

After finishing the vocabulary test and the performance-monitoring question following the test, all children were presented the six tasks that tapped estimation skill. The first two tasks (smiling face/frowning face and black/blue pen lifting, presented in counterbalanced order) required subjects to execute one of two actions in response to each of 30 cards. Thus, subjects were making estimations about a total set of items that was the same size as the total set of items on the vocabulary test. Subjects were not told in advance that estimation questions would be posed after smiling/frowning and pen lifting, so that they would not prepare for estimation by counting. Even if a child had surmised that estimation might be called for, she/he was blind to whether smiles or frowns would be estimated or whether the number of black or blue pen liftings would be requested. (Notably, we explicitly monitored whether there was any indication of counting during these tasks and detected none.)

For the smiling face/frowning face task, subjects were shown the faces deck. The task was to smile in response to smiling faces and frown when shown a frowning face. Faces were shown at a rate of 3 s per item. All subjects could smile contingently with high proficiency. Immediately after the last face card was shown, subjects were asked, "I showed you 30 faces altogether. How many smiling faces were there?"

For the pen-lifting task, subjects viewed the blue dot/black dot cards one at a time at a 3-s rate, and lifted the Bic pen that matched the color of the dots on the current card. All participants could do contingent lifting at a high rate of proficiency, with most errors corrected spontaneously and quickly. The experimenter prompted error correction when it did not occur spontaneously. After the last card, subjects were told that 30 dot cards had been presented and were asked, "How many of these 30 cards had black dots on them?"

Following these two tasks, four more estimation problems were presented, with each subject receiving the problems in a different random order. Each problem consisted of an 8 x 11-in. manila card being shown for 2 s (a rate that precluded counting). The subject's task was to estimate the number of items on the card. Card 1 had 14 gold stars. Subjects were told that there could be as many as 30 gold stars and were asked how many there were. Card 2 had 17 silver stars. Subjects were told that there could be as many as 30 silver stars, and then were asked to


estimate how many there were. Card 3 contained 16 gold stars and 14 blue dots. Before viewing Card 3, subjects were told the card had 30 blue dots and gold stars altogether. After seeing the card, they were asked how many of the 30 things on the card were gold stars. Card 4 contained 5 green stars and 25 red dots. Before viewing Card 4, subjects were told that there were 30 green stars and red dots altogether. After viewing the card, they were asked how many of the 30 things were red dots.

RESULTS

Vocabulary Knowledge

We were successful at constructing a vocabulary test containing easy items that almost all participants answered correctly and hard items that were responded to correctly at approximately the one-in-four chance level. Better than 99% of the easy items were answered correctly by participants, in contrast to about 24% for the hard items. Although a 3 (test difficulty) x 2 (rating format) x 2 (age) analysis of variance yielded a significant main effect of age, F(1, 60) = 10.00, p < .01, the variance accounted for by the age factor was small (eta-squared = .03, with the respective means for older and younger subjects being 64 and 59% correct).(1) The age difference did not account for nearly the proportion of variance that the test difficulty factor did (eta-squared = .76, with means for the easy, moderate, and hard tests being 74, 62, and 49%, respectively). In this and all subsequent analyses of test difficulty, all effects were assessed in terms of their two underlying trends (linear and quadratic). The linear component of the test difficulty effect was significant, F(1, 60) = 232.30, p < .001, accounting for virtually all of that factor's variance (eta-squared = .76). No other significant effects emerged in this analysis, all ps > .20.

(1) The goal in constructing the vocabulary measure was to yield a test that would produce no individual differences in overall performance within each cell of the 3 x 2 x 2 design. Given that we came close to achieving that goal, it is not surprising that even very small between-cell effects emerge as statistically significant in the analysis of the recall data (i.e., they are tested against very low within-cell variability).

Monitoring of Overall Test Performance

Overall test monitoring was calculated for each subject as the absolute value of the difference between postdicted and actual vocabulary test performance (|P - A|). The main analysis of overall performance monitoring was based on this measure for two reasons: (1) underpredictions and overpredictions will not cancel each other out, as may occur with signed measures [e.g., P - A or (P - A)/A]; (2) |P - A| does not vary with the level of actual performance, whereas other measures (e.g., |P - A|/A) do. This is an important consideration given the three different test difficulties used in the study.
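The contrast among these measures can be illustrated with a minimal sketch (ours, not part of the original report; the postdiction/actual pairs are invented for illustration only):

    # Hypothetical postdictions (P) and actual scores (A) for three children;
    # the values are invented for illustration.
    pairs = [(20, 15), (10, 15), (18, 18)]

    signed = [p - a for p, a in pairs]          # +5, -5, 0
    unsigned = [abs(p - a) for p, a in pairs]   # 5, 5, 0
    relative = [abs(p - a) / a for p, a in pairs]

    # The signed errors cancel (mean 0.0, misleadingly "perfect"); the
    # unsigned mean (3.33) preserves the inaccuracy. The relative measure
    # shrinks as actual performance rises, which would confound the three
    # test-difficulty conditions; |P - A| stays on the same 0-30 item scale.
    print(sum(signed) / 3, sum(unsigned) / 3, sum(relative) / 3)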


The same three-way analysis of variance that was conducted for vocabulary knowledge was conducted on the |P - A| scores (MSe = 15.77). There was a main effect of age, F(1, 60) = 10.28, p < .01, with the older and younger means being 3.31 and 6.31, respectively. None of the other effects was statistically significant, all ps > .20.(2)

(2) Note that a monitoring index defined in this way penalizes subjects who know they do not know an item, but guess it correctly on the multiple-choice test (i.e., for 25% of the items they do not know). An alternative measure was therefore defined, based on the assumption that subjects actually knew the definitions of all of the easy items and none of the difficult ones. That is, on the easy test, subjects should have postdicted 20 correct; on the moderate test, 15; and on the hard test, 10. Thus, overall test-monitoring accuracies were computed as |P - 20|, |P - 15|, and |P - 10|, respectively, in this supplementary analysis. In contrast to the other overall test-monitoring measure, this one assumes that subjects are not aware that they will get certain items correct by chance. Analysis of this measure confirmed the age difference, F(1, 60) = 4.74, p < .05, with means of 6.58 and 4.06 for younger and older subjects, respectively. In addition, subjects were less accurate in their postdictions as the test they took contained a greater proportion of hard items (mean inaccuracy = 3.62, 5.21, and 7.12 for the easy, moderate, and hard tests, respectively), F(1, 60) = 6.06, p < .025.

Additional examination of these unsigned overall test-monitoring scores revealed significantly greater variability in the younger sample (pooled s-squared = 24.31) than in the older sample (pooled s-squared = 7.23), F(30, 30) = 3.36, p < .01. These variances were associated with 18- and 12-point accuracy ranges for younger and older subjects, respectively. With one outlying subject excluded, the older subjects' accuracy range was reduced to 8 points; in contrast, with the least accurate younger subject excluded, that range dropped only to 15 points.

There was no evidence to suggest that the direction of the subjects' postdiction errors varied systematically with age. Breaking these errors down into over- and underestimates of subjects' actual knowledge, we found that there was a comparable percentage of underestimates at each age level (64 and 58% for younger and older subjects, respectively), chi-squared < 1.
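The variability comparison above is a standard two-sample variance-ratio test; a sketch of the computation from the pooled variances reported in the text (our illustration; it assumes SciPy is available):

    # Variance-ratio test for the spread of |P - A| scores across ages,
    # as in the reported F(30, 30) = 3.36.
    from scipy import stats

    var_young, var_old = 24.31, 7.23  # pooled variances from the text
    F = var_young / var_old           # = 3.36
    p = stats.f.sf(F, 30, 30)         # upper-tail p for F(30, 30)
    print(f"F(30, 30) = {F:.2f}, p = {p:.4f}")  # p < .01, as reported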

Item-by-Item Monitoring

Subjects in the individual rating condition rated their certainty of correctness on a 4-point scale (1 = not sure at all to 4 = certain) for each of the 30 answers. The mean certainty for correct and incorrect items was calculated for each subject. For younger subjects, the mean correct and incorrect ratings were 3.70 and 2.72, respectively. For older subjects, the corresponding values were 3.54 and 2.07. Each subject's mean certainty ratings were analyzed by a 2 (age) x 3 (test difficulty) x 2 (item correctness) ANOVA with repeated measures on the last factor (MSe = 0.248 and 0.160 for the between- and within-subjects effects, respectively). There was a main effect for age, such that the mean ratings of the younger subjects (3.21) were higher than those of the older subjects (2.80), F(1,


30) = 11.84, p < .01. The mean ratings of correct items (3.62) were considerably higher than those of incorrect items (2.40), F(1, 30) = 169.37, p < .001. The grade x rating measures interaction was also significant, F(1, 30) = 6.95, p < .025, such that the difference between mean correct and mean incorrect ratings was greater for older than for younger subjects. No other effects were significant in the analysis of the item ratings, all ps > .10.

The difference between mean certainty for correct and incorrect items was calculated for each subject and correlated with overall test performance (r = .004 across cells). This low correlation suggests that overall test monitoring is not related in any straightforward way to discriminating between correct and incorrect items. Because the correlation was so low, no additional correlation analyses of item-by-item and overall monitoring scores were undertaken.

That the younger subjects gave higher item ratings overall suggests that they might have applied a more liberal criterion for rating correctness than did the older children. The significant interaction between age and rating measures additionally suggests greater discrimination between correct and incorrect items by older children than by younger children. In order to separate developmental differences in criteria for rating an item correct from developmental differences in discrimination between correct and incorrect items, a signal-detection analysis was carried out (Green & Swets, 1966), with ratings of 4 and 3 taken as perceptions that a correct response had been made and ratings of 1 and 2 indicating that subjects believed there was a good chance that their answer was incorrect. The signal-detection parameters of discrimination (d') and response criterion (beta) were calculated for each subject (Baird & Noma, 1978; Green & Swets, 1966; Murdock, 1982). The convention of constraining all measures to fall within the 2nd and 98th percentiles of the normal distribution was followed (Green & Swets, 1966). The d' and beta measures were analyzed according to a 2 (age) x 3 (test difficulty) ANOVA. There were significant age effects for both d' (MSe = 0.684) and beta (MSe = 1.94), consistent with the previously reported certainty data. In particular, the younger subjects' criterion was more liberal than that of the older subjects (beta = 0.52 and 1.76, respectively), F(1, 30) = 7.18, p < .025, and the younger subjects exhibited poorer discrimination than did the older subjects (d' = 1.10 and 1.75, respectively), F(1, 30) = 5.57, p < .025. The only other significant effect was a linear decrease in subjects' ability to discriminate between correct and incorrect responses as test difficulty increased (easy d' = 1.87, moderate d' = 1.29, hard d' = 1.11), F(1, 30) = 5.07, p < .05.(3)

(3) We present the traditional d' and beta measures because these are undoubtedly more familiar to readers than the corresponding parameters in nonparametric signal-detection analyses (Grier, 1971; Pastore & Scheirer, 1974), and they are the most frequently reported signal-detection measures. The corresponding nonparametric analyses were conducted, however, and confirmed developmental changes in both discrimination and decision criteria. The linear trend in discrimination as a function of test difficulty was not significant in that analysis, however, p > .10.
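A minimal sketch of this scoring, under our reading of the procedure (ratings of 3 and 4 collapsed to "judged correct," hit and false-alarm rates clamped to the 2nd and 98th percentiles before the z transformation); the function, variable names, and example data are ours, not the authors':

    # d' and beta from item-level certainty ratings, following the
    # binary collapse described above. Uses only the Python stdlib.
    from statistics import NormalDist
    import math

    def d_prime_beta(ratings, correct, lo=0.02, hi=0.98):
        """ratings: 1-4 certainty per item; correct: True/False per item."""
        judged = [r >= 3 for r in ratings]  # 3-4 = "I was correct"
        hit_rate = sum(j for j, c in zip(judged, correct) if c) / sum(correct)
        fa_rate = (sum(j for j, c in zip(judged, correct) if not c)
                   / (len(correct) - sum(correct)))
        h = min(max(hit_rate, lo), hi)  # clamp to the 2nd-98th percentiles
        f = min(max(fa_rate, lo), hi)
        zh, zf = NormalDist().inv_cdf(h), NormalDist().inv_cdf(f)
        # beta is the ratio of the normal densities at the criterion
        return zh - zf, math.exp((zf ** 2 - zh ** 2) / 2)

    # Invented example: six correct items rated mostly high, six
    # incorrect items rated mostly low.
    dp, beta = d_prime_beta([4, 4, 3, 4, 3, 2, 2, 1, 3, 2, 1, 1],
                            [True] * 6 + [False] * 6)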


The locus of these developmental shifts was specified additionally by analyses of within-test easy-correct, easy-incorrect, hard-correct, and hard-incorrect items. There were only three incorrect responses to easy items, all made by younger subjects giving certainty ratings of 1. For each of the other three item types, each subject's mean ratings were calculated for the item type. The mean of subjects' easy-correct ratings was 3.97 at the younger age level and 4.00 at the older age level. In short, there was little variability on easy items, and thus, such items were not a main source of the observed developmental shift. At the younger age level, the mean of subjects' hard-correct items was 2.55 and the mean of subjects' hard-incorrect items was 2.71. The corresponding values among older subjects were 1.96 and 2.08. These hard-item data were analyzed according to a 2 (age) x 3 (test difficulty) x 2 (item correctness) ANOVA, with repeated measures on the last factor. There was a main effect for age, F(1, 30) = 11.34, p < .01, and no other significant effects, all ps > .20. The shifts with age in both criteria and discrimination obtained in the signal-detection analysis were due primarily to higher hard-item ratings by younger than by older participants.

Subjects' sex was not entered into any of the preceding ANOVAs because to do so would have resulted in very low cell sizes. Nonetheless, because there are important contemporary hypotheses about sex differences in perceptions of achievement (e.g., Stipek, 1983), all of the monitoring data were formally reexamined for sex effects. In doing so, we collapsed across the test difficulty and rating factors for the overall test-monitoring data. The main effect of sex and the age x sex interaction were both nonsignificant, both |t|s < 1. Log-linear analyses applied to the frequency data similarly revealed no significant sex effects in the children's over- and underestimation of their overall test performance, both ps > .05.

The item-by-item monitoring measures were also reexamined, although, of course, by collapsing only across the test difficulty factor for these analyses. The mean difference in ratings of correct vs incorrect items (MSe = 0.262 for this variable) was greater for girls than for boys, t(32) = 3.14, p < .005, with a significant age x sex interaction, t(32) = 2.19, p < .05. This difference in correct versus incorrect ratings increased in


size with age for girls [younger = 1.06, older = 2.05, t(32) = 3.92, p < .001], but not for boys [younger = 0.91, older = 1.11, t(32) = 0.84, p > .20]. Similarly, on the mean discrimination measure, d' (MSe = 0.538), girls exhibited a larger mean than boys, t(32) = 3.23, p < .005, with a significant age x sex interaction, t(32) = 2.28, p < .05. d' increased significantly with age for girls [younger = 1.30, older = 2.58, t(32) = 3.86, p < .001], but not for boys [younger = 0.97, older = 1.21, t(32) = 0.69, p > .20]. These effects are reflected primarily in the incorrect hard items (MSe = 0.313), where the boys' mean certainty ratings were higher than the girls', t(32) = 2.45, p < .05, again qualified by an age x sex interaction, t(32) = 1.98, p < .06. Girls' mean certainty decreased significantly with age [younger = 2.67, older = 1.56, t(32) = 4.10, p < .001], but boys' did not [younger = 2.76, older = 2.40, t(32) = 1.36, p > .10]. Finally, there were sex effects on the criterion measure, beta (MSe = 1.68), where girls were more conservative than boys, t(32) = 3.18, p < .005. The age x sex interaction was also significant, t(32) = 2.08, p < .05, however. There was a significant increase with age in conservatism for girls [younger = 0.46, older = 2.80, t(32) = 3.66, p < .001], but not for boys [younger = 0.59, older = 1.10, t(32) = 0.85, p > .20]. In short, developmental improvements in monitoring of individual items were significant for girls, but not for boys.

In light of these item-by-item monitoring differences as a function of sex, we also reexamined the vocabulary test data for sex differences, in that these might reveal general ability differences that could account for the sex differences in the monitoring data. No sex differences emerged in the Peabody data, all relevant Fs < 1.

A subject's estimation inaccuracy was calculated from performance on the six general estimation items administered at the conclusion of the session. The absolute value of the difference between the actual number of objects and the subject's estimated number of objects was calculated for each of the six items. A subject's estimation inaccuracy score was the sum of these six absolute differences. These estimation inaccuracy scores were analyzed in a 2 (age) x 3 (test difficulty) x 2 (rating formats) ANOVA (MSe = 114.10). The only significant effect was for age, F(1, 60) = 26.32, p < .001. For younger subjects, mean estimation inaccuracy = 32.11; for older subjects, mean estimation inaccuracy = 19.19. The variability in estimation inaccuracy was greater in the younger sample (pooled s-squared = 162.07) than in the older sample (pooled s-squared = 66.13), F(30, 30) = 2.45, p < .05.

Again, in an attempt to account for the sex differences in the item-by-item monitoring data, the estimation inaccuracy scores were checked


for sex effects following the same analysis plan used in the supplementary inspection of the overall monitoring data. No sex differences were detected, all |t|s < 1.

Two hierarchical regression analyses were conducted to compare the relative contributions of age (in months) and estimation inaccuracy. Although these two variables were significantly related, the association was far from perfect (r = -.53, p < .05). In both analyses, the list difficulty and rating format factors (and their interaction) were forced into the equation as five dummy variables (e.g., Draper & Smith, 1981). Thus, the effects of age and estimation inaccuracy were evaluated independently of the contributions of the two manipulated factors. Consistent with the ANOVA results reported earlier, the collective effect of the experimental factors was not significantly related to monitoring inaccuracy, R-squared = .039, F < 1. Both estimation inaccuracy and age significantly improved prediction of overall test monitoring when estimation inaccuracy was entered first, partial F(1, 65) = 4.84, p < .05, followed by age, partial F(1, 64) = 8.70, p < .01. When the variables were entered in the age-followed-by-estimation-inaccuracy order, however, only age significantly incremented the prediction, partial F(1, 65) = 13.99, p < .001, and partial F < 1, respectively. The total R-squared = .106 with the dummy variables and estimation inaccuracy in the equation; the total R-squared = .210 with the dummy variables and age in the equation; and the total R-squared = .213 with all variables in the equation. In short, improvements in global monitoring accuracy as a function of age were not due to developmental improvements in general estimation skills.(4)
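The hierarchical procedure can be sketched as follows (our illustration with placeholder data, not the authors' analysis code; it reproduces the logic of forcing in the five dummy variables and then testing each added predictor's increment with a partial F):

    # Hierarchical regression with forced-in design codes, then
    # estimation inaccuracy and age added in sequence.
    import numpy as np

    def r_squared(X, y):
        X1 = np.column_stack([np.ones(len(y)), X])  # add intercept
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

    def partial_F(X_reduced, X_full, y):
        r2_r, r2_f = r_squared(X_reduced, y), r_squared(X_full, y)
        k = X_full.shape[1] - X_reduced.shape[1]  # predictors added
        df_err = len(y) - X_full.shape[1] - 1     # error df, with intercept
        return ((r2_f - r2_r) / k) / ((1 - r2_f) / df_err)

    rng = np.random.default_rng(0)
    n = 72
    dummies = rng.integers(0, 2, size=(n, 5)).astype(float)  # stand-in design codes
    age = rng.uniform(72, 136, n)                            # age in months
    estim = 40 - 0.15 * age + rng.normal(0, 5, n)            # placeholder inaccuracy
    monitor = 12 - 0.06 * age + rng.normal(0, 2, n)          # placeholder |P - A|

    base = dummies                             # forced-in block
    step1 = np.column_stack([base, estim])     # + estimation inaccuracy
    step2 = np.column_stack([step1, age])      # + age
    print("partial F, estimation after dummies:", partial_F(base, step1, monitor))
    print("partial F, age after estimation:   ", partial_F(step1, step2, monitor))

With n = 72 and the five forced-in dummies, the error degrees of freedom in this scheme (65 for the sixth predictor, 64 for the seventh) match those of the partial Fs reported above.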

(4) The smiling faces and pen-lifting tasks required estimation following a sequence of discrete behaviors (smile-frown and lift blue-lift black), tasks that are structurally more similar to the overall test-monitoring task (estimating number correct following a sequence of answers) than were the four problems involving estimation of the number of objects on single cards. Moreover, because the smiling faces and pen-lifting tasks were administered first in the series, they are relatively uncontaminated by fatigue/boredom effects. Thus, all estimation analyses were reconducted using only the scores on the smiling faces and pen-lifting problems. Because the correlation between estimation skill based on the two sequential problems and estimation skill based on all six problems was .71, however, we felt it unlikely that there would be a change in the conclusions. That was in fact the case, with the patterns of significance identical in the reanalyses to the patterns produced when all six estimation problems were used to construct estimation-ability scores.

DISCUSSION

We observed the developmental shifts in overall test monitoring that had been hypothesized. Older children were both more accurate and less variable than younger children in monitoring global performance. The older children's estimates were more likely to be "in the ballpark" of their actual performance than were the estimates of younger children.

Several potential determinants of this developmental shift were evaluated in this study. First, we found no support for the hypothesis that age-related changes in overall test monitoring were tied to the development of general estimation skills. Second, direct mediation of overall monitoring by item monitoring also seems unlikely in light of three aspects of the data reported here: (a) Getting the children to attend to individual item performance by having them rate their certainty of correctness did not improve global monitoring. (b) Item-by-item monitoring did not correlate with overall test monitoring. (c) If overall monitoring were directly determined by item-by-item monitoring, it would be expected that developmental changes at the item and overall levels would be comparable. That was not the case. In particular, there were sex differences in the item-by-item monitoring data, but none in the overall monitoring data.

That girls were more aware than boys that they were unlikely to be correct on difficult items is consistent with results in performance prediction tasks (i.e., when children gauge performance before taking a test). Boys more often than girls are oblivious to their past failures as they make predictions about future performance, whereas girls make lower predictions about performance following failure than do boys (Parsons & Ruble, 1977). Most striking, Parsons and Ruble obtained an age x sex interaction in performance prediction accuracy that resembled the age x sex interactions in the item-by-item data reported here. Because of the small sample performing item-by-item monitoring in our study, we are reluctant to offer additional interpretation of these interactions. More study of potential age x sex interactions seems justified, however, given the conflicting implications of such interactions. If girls are better at item-by-item monitoring, a metacognitive theoretical prediction would be that these perceptions would permit them to make more realistic decisions about how to direct future study. On the other hand, the accuracy of females' perceptions might actually be a disadvantage, in that their more accurate perceptions of item failures are also more pessimistic. These perceptions could reduce their motivation compared to boys (e.g., Dweck & Elliott, 1983). There is yet a third possibility, however: sex differences at the item level may be of no consequence if motivation and decisions about cognition are much more a function of overall test monitoring than of item-by-item monitoring.

It is disappointing that neither of the potential explanations of overall test-monitoring development that were investigated here (i.e., estimation, item-by-item checking) proved to be telling. The striking test-monitoring differences that were obtained, however, motivate additional research that will attempt to tap more directly children's processing as they monitor. In particular, we plan to elicit concurrent reports as monitoring judgments are made (Ericsson & Simon, 1984), followed by experiments that include


manipulations of factors suggested as important by analyses of the "think-aloud" protocols.

Additional motivation for test-monitoring research is provided by contrasting the data reported here with data documenting the development of post-study prediction skills. Our younger subjects' pattern of overall test-monitoring inaccuracies (i.e., a mix of over- and underestimations) contrasts with young grade-school children's errors when predicting future test performance. Children in the 6- to 8-year-old range tend to be unrealistically optimistic about future performances (Clifford, 1975, 1978; Entwisle & Hayduk, 1978; Flavell, Friedrichs, & Hoyt, 1970; Goss, 1968; Levin, Yussen, DeRose, & Pressley, 1977; Parsons & Ruble, 1977; Phillips, 1963; Stipek, 1983; Stipek & Hoffman, 1980; Yussen & Berman, 1981; Yussen & Levy, 1975). Although dampened, this tendency holds even when predictions about future test performance are made after having test experience in the task situation (e.g., Clifford, 1978; Levin et al., 1977; Parsons & Ruble, 1977; Stipek & Hoffman, 1980). These overpredictions have been interpreted as "wishful thinking" (e.g., Stipek, 1983; Stipek, Roberts, & Sanborn, 1984). Intuitively, wishful thinking should not play as great a role when reflecting on the concrete reality of something that has already happened (i.e., when reflecting on test performance) as it does when making estimates about events that have yet to transpire (i.e., when making predictions). Given the probable differences between prediction and overall test monitoring, and given that both prediction and test monitoring are presumed to play important roles in regulating cognition (Baker & Brown, 1982; Yussen, 1985), direct within-experiment comparisons of prediction and test monitoring (and of their relative consequences) should be a high priority.

REFERENCES

Baird, J. C., & Noma, E. (1978). Fundamentals of scaling and psychophysics. New York: Wiley.

Baker, L., & Brown, A. L. (1982). Metacognitive skills and reading. In P. D. Pearson (Ed.), Handbook of reading research (pp. 353-394). New York: Longman.

Berch, D. B., & Evans, R. C. (1973). Decision processes in children's recognition memory. Journal of Experimental Child Psychology, 16, 148-164.

Bisanz, G. L., Vesonder, G. T., & Voss, J. F. (1978). Knowledge of one's own responding and the relation of such knowledge to learning. Journal of Experimental Child Psychology, 25, 116-128.

Clifford, M. M. (1975). Validity of expectation: A developmental function. Alberta Journal of Educational Research, 21, 11-17.

Clifford, M. M. (1978). The effects of quantitative feedback on children's expectation of success. British Journal of Educational Psychology, 48, 220-226.

Draper, N. R., & Smith, H. (1981). Applied regression analysis (2nd ed.). New York: Wiley.

Dunn, L. M., & Dunn, L. M. (1981). Peabody Picture Vocabulary Test-Revised. Circle Pines, MN: American Guidance Service.

Dweck, C. S., & Elliott, E. S. (1983). Achievement motivation. In P. Mussen (Series Ed.) & E. M. Hetherington (Vol. Ed.), Handbook of child psychology: Vol. 4. Socialization, personality, and social development. New York: Wiley.

Elliott-Faust, D. J., & Pressley, M. (1986). Self-controlled training of comparison strategies increases children's comprehension monitoring. Journal of Educational Psychology, 78, 27-33.

Entwisle, D. R., & Hayduk, L. A. (1978). Too great expectations: The academic outlook of young children. Baltimore, MD: Johns Hopkins University Press.

Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.

Flavell, J. H., Friedrichs, A. G., & Hoyt, J. D. (1970). Developmental changes in memorization processes. Cognitive Psychology, 1, 324-340.

Flavell, J. H., Speer, J. R., Green, F. L., & August, D. L. (1981). The development of comprehension monitoring and knowledge about communication. Monographs of the Society for Research in Child Development, 46(5, Serial No. 192).

Forrest-Pressley, D. L., MacKinnon, G. E., & Waller, T. G. (Eds.). (1985a). Metacognition, cognition, and human performance: Vol. 1. Theoretical perspectives. Orlando, FL: Academic Press.

Forrest-Pressley, D. L., MacKinnon, G. E., & Waller, T. G. (Eds.). (1985b). Metacognition, cognition, and human performance: Vol. 2. Instructional practices. Orlando, FL: Academic Press.

Goss, A. M. (1968). Estimated versus actual physical strength in three ethnic groups. Child Development, 39, 283-290.

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.

Grier, J. B. (1971). Nonparametric indexes for sensitivity and bias: Computing formulas. Psychological Bulletin, 75, 424-429.

Kelly, M., Scholnick, E. K., Travers, S. H., & Johnson, J. W. (1976). Relations among memory, memory appraisal, and memory strategies. Child Development, 47, 648-659.

King, J. F., Zechmeister, E. B., & Shaughnessy, J. J. (1980). Judgments of knowing: The influence of retrieval practice. American Journal of Psychology, 93, 329-343.

Levin, J. R., Yussen, S. R., DeRose, T. M., & Pressley, M. (1977). Developmental changes in assessing recall and recognition memory capacity. Developmental Psychology, 13, 608-615.

Lovelace, E. A. (1984). Metamemory: Monitoring future recallability during study. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10, 756-766.

Markman, E. M. (1979). Realizing that you don't understand: Elementary school children's awareness of inconsistencies. Child Development, 50, 643-655.

Masur, E. F., McIntyre, C. W., & Flavell, J. H. (1973). Developmental changes in apportionment of study time among items in a multitrial free recall task. Journal of Experimental Child Psychology, 15, 237-246.

Murdock, B. B., Jr. (1982). Recognition memory. In C. R. Puff (Ed.), Handbook of research methods in human memory and cognition (pp. 1-26). New York: Academic Press.

Newman, R. S., & Berger, C. F. (1984). Children's numerical estimation: Flexibility in the use of counting. Journal of Educational Psychology, 76, 55-64.

Parsons, J., & Ruble, D. (1977). The development of achievement-related expectancies. Child Development, 48, 1075-1079.

Pastore, R. E., & Scheirer, C. J. (1974). Signal detection theory: Considerations for general application. Psychological Bulletin, 81, 945-958.

Phillips, B. N. (1963). Age changes in accuracy of self-perceptions. Child Development, 34, 1041-1046.

Pressley, M., Levin, J. R., & Ghatala, E. S. (1984). Memory strategy monitoring in adults and children. Journal of Verbal Learning and Verbal Behavior, 23, 270-288.

Pressley, M., Snyder, B. L., Levin, J. R., Murray, H. G., & Ghatala, E. S. (in press). Perceived readiness for examination performance (PREP) produced by initial reading of text and text containing adjunct questions. Reading Research Quarterly.

Robinson, E. J. (1981). The child's understanding of inadequate messages and communication failure: A problem of ignorance or egocentrism? In W. P. Dickson (Ed.), Children's oral communication skills (pp. 167-188). New York: Academic Press.

Stipek, D. (1983). Young children's performance expectations: Logical analysis or wishful thinking? In J. Nicholls (Ed.), The development of achievement motivation. Greenwich, CT: JAI Press.

Stipek, D., & Hoffman, J. (1980). Development of children's performance-related judgments. Child Development, 51, 912-914.

Stipek, D. J., Roberts, T. A., & Sanborn, M. E. (1984). Preschool-age children's performance expectations for themselves and another child as a function of the incentive value of success and the salience of past performance. Child Development, 55, 1983-1989.

Wagoner, S. A. (1983). Comprehension monitoring: What it is and what we know about it. Reading Research Quarterly, 18, 328-346.

Worden, P. E., & Sladewski-Awig, L. J. (1982). Children's awareness of memorability. Journal of Educational Psychology, 74, 341-350.

Yussen, S. R. (1985). The role of metacognition in contemporary theories of cognitive development. In D. L. Forrest-Pressley, G. E. MacKinnon, & T. G. Waller (Eds.), Metacognition, cognition, and human performance: Vol. 1. Theoretical perspectives (pp. 253-283). Orlando, FL: Academic Press.

Yussen, S. R., & Berman, L. (1981). Memory predictions for recall and recognition in first-, third-, and fifth-grade children. Developmental Psychology, 17, 224-229.

Yussen, S. R., & Levy, V. M., Jr. (1975). Developmental changes in predicting one's own span of short-term memory. Journal of Experimental Child Psychology, 19, 502-508.

RECEIVED: February 25, 1986; REVISED: June 9, 1986, August 12, 1986.