Performance Assessment in the Third International Mathematics and Science Study: An International Perspective 1

Studies in Educational Evaluation 25 (1999) 243-262

Maryellen Harmon

Center for the Study of Testing, Evaluation, and Educational Policy Boston College, Massachusetts, USA

The Performance Assessment 2 component of the Third International Mathematics and Science Study (TIMSS) represents an unusual research effort in a number of ways. Its sample was a subsample of the students who responded to the main survey written test: results for these students can therefore be linked with results from the written test. In addition, the sampling design ensured that the sample conformed to the same rigorous guidelines required for the Main Survey. Each step in the administration and scoring of the performance assessment was controlled by careful training and by international manuals which provided detailed guidelines for equipment, station set-up and student rotation and scoring (Harmon et al., 1994a,b; 1996; Martin & Kelly, 1996). As with the written tests, translations into the language of each country were monitored and any variations in language or equipment had to be approved centrally. Thus every effort was made to guarantee for the Performance Assessment the same rigorous standards as those governing the more traditional tests. However, results of the Performance Assessment have been much less publicized in the public media than those of the corresponding written tests. The purpose of this article is to highlight the findings of this component of TIMSS, to discuss its implications for future assessments, and to bring out some of the issues it raises for further analysis and research.


This presentation of the international results will be followed by short profiles from five of these countries, reflecting their experiences and results on the TIMSS Performance Assessment in the light of their national goals, curricula and pedagogical practices.

The Third International Mathematics and Science Study (TIMSS) was conducted in 41 countries with over half a million students. It included several forms of assessment: written multiple-choice questions, short and long free-response items, and a performance assessment. All countries participating in TIMSS administered the 13-year-old student written test (Population 2), but all other segments of the TIMSS testing were optional. For a variety of reasons, including availability of resources and the timing of test administration, only about half the countries were able to participate in the Performance Assessment.

The sample of students for the Performance Assessment was drawn from the upper of the two grades already tested in the written survey, usually the 8th and 4th grades. Twenty-one countries administered the Performance Assessment at the 8th grade level; of these, 10 also administered the 4th grade test.

The TIMSS Performance Assessment consisted of 12 tasks at each grade level, five each in mathematics and science and two combining both mathematics and science. The content areas sampled in science were human physiology, heat, light, electricity, magnetism, elasticity, and various experimental and measurement procedures; and in mathematics, number theory and operations, measurement and computation (with and without a calculator), probability, spatial relations and geometry. Each task consisted of a problem to be solved or investigated, and a series of subquestions designed to elicit evidence of particular cognitive and procedural skills called performance expectations. These skills included experimental design, data collection and organization; measurement; graphing, data analysis and interpretation; drawing conclusions and making connections; and describing strategies, reasoning, and explaining results. The subquestions (items) proceeded from easier, more procedural items at the beginning of each task to analytical and reasoning items near the end. Tasks usually culminated in a demand for the student to explain findings using concept knowledge.

Table 1 shows the overall national averages by task in terms of percent correct for each country for each of the 12 tasks of the Performance Assessment. In the interests of brevity, only the 13-year-old (Population 2) results are presented here, but a more complete description of tasks, results and analyses may be found in the report Performance Assessment in IEA's Third International Mathematics and Science Study (1997). As can be seen, the scores of some countries ranged from 50 to 95% correct, while others ranged from 11 to 51% correct. Although there was a wide range of scores, the tasks for which scores were high, or low, were the same for almost all countries. Most countries showed their highest scores on Magnets, Dice (Iran an exception on these two), and Folding and Cutting, and their lowest scores on Shadows, Pulse, and Packaging, with other tasks falling into the average range for that individual country. Hence it would seem that something in the nature of the task is operative in determining difficulty, rather than national curricular differences alone.
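As a rough illustration of what the percent-correct entries in Table 1 represent, the sketch below averages item-level correctness points over the students tested on one task. It is a simplified sketch with invented numbers: it ignores sampling weights, and the naive standard error it reports is not the procedure behind the published standard errors, which reflect the TIMSS sampling design.

```python
# Rough illustration only: form a national percent-correct figure for one task
# from item-level correctness points. Ignores sampling weights; the naive
# standard error is not the published TIMSS procedure. Numbers are invented.

def task_percent_correct(score_rows, max_points):
    """score_rows: one list of correctness points per student (one entry per
    item of the task); max_points: maximum points available on each item."""
    per_student = [100.0 * sum(row) / sum(max_points) for row in score_rows]
    n = len(per_student)
    mean = sum(per_student) / n
    var = sum((x - mean) ** 2 for x in per_student) / (n - 1)
    naive_se = (var / n) ** 0.5
    return mean, naive_se

# Three hypothetical students on a four-item task with item maxima 2, 1, 2, 3.
rows = [[2, 1, 1, 0], [1, 1, 2, 1], [2, 0, 1, 2]]
mean, se = task_percent_correct(rows, [2, 1, 2, 3])
print(round(mean, 1), round(se, 1))   # 58.3 4.2
```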

Table 1: Performance Assessment in TIMSS: Average Achievement by Task

Country | All Tasks | Pulse | Magnets | Batteries | Rubber Band | Solutions | Shadows | Plasticine | Dice | Calculator | Folding & Cutting | Around the Bend | Packaging
Singapore | 71 (1.7) | 60 (2.7) | 95 (0.9) | 79 (2.1) | 80 (1.5) | 68 (2.7) | 50 (3.5) | 66 (3.5) | 84 (1.6) | 60 (2.8) | 80 (2.6) | 63 (1.5) | 65 (2.5)
Switzerland | 65 (1.2) | 51 (1.9) | 97 (1.2) | 75 (2.1) | 67 (1.9) | 57 (1.9) | 41 (2.1) | 73 (2.1) | 79 (1.4) | 61 (1.6) | 79 (1.9) | 54 (2.2) | 47 (3.3)
Sweden | 64 (1.2) | 45 (2.6) | 95 (1.6) | 71 (2.9) | 70 (2.4) | 50 (2.2) | 45 (1.9) | 72 (2.9) | 74 (2.4) | 51 (2.3) | 80 (2.5) | 65 (1.9) | 47 (2.3)
Scotland | 62 (1.7) | 55 (2.9) | 98 (0.9) | 68 (2.4) | 75 (1.8) | 51 (2.3) | 36 (2.4) | 61 (2.5) | 76 (1.6) | 49 (3.1) | 71 (3.9) | 58 (2.1) | 51 (3.9)
Norway | 62 (0.8) | 48 (1.6) | 91 (2.0) | 67 (1.7) | 63 (1.9) | 42 (1.8) | 39 (2.0) | 67 (2.3) | 72 (1.9) | 59 (1.6) | 73 (2.1) | 61 (1.3) | 59 (2.4)
Czech Republic | 61 (1.3) | 46 (2.9) | 82 (2.3) | 66 (2.8) | 65 (3.6) | 59 (2.3) | 37 (1.9) | 68 (2.6) | 73 (2.5) | 54 (2.0) | 73 (3.2) | 58 (1.5) | 43 (4.6)
Canada | 60 (1.3) | 46 (2.4) | 92 (1.5) | 62 (2.1) | 71 (2.0) | 48 (2.1) | 35 (1.6) | 65 (1.9) | 77 (1.8) | 60 (1.5) | 59 (2.5) | 53 (2.0) | 57 (3.2)
New Zealand | 60 (1.4) | 44 (2.1) | 93 (1.6) | 68 (1.6) | 65 (1.8) | 48 (2.1) | 29 (2.0) | 63 (2.2) | 73 (1.2) | 55 (1.6) | 75 (2.3) | 60 (1.4) | 44 (2.5)
Spain | 54 (0.8) | 36 (2.1) | 96 (1.4) | 73 (1.7) | 51 (2.0) | 41 (2.3) | 36 (1.7) | 45 (2.5) | 73 (2.2) | 53 (2.1) | 61 (3.1) | 53 (1.9) | 28 (2.3)
Iran, Islam. Rep. | 52 (2.0) | 55 (4.5) | 45 (4.9) | 52 (4.0) | 56 (5.4) | 50 (3.5) | 43 (1.5) | 81 (2.6) | 58 (1.8) | 48 (3.7) | 58 (2.9) | 34 (3.2) | 43 (5.0)
Portugal | 47 (1.1) | 24 (2.5) | 94 (1.6) | 50 (2.2) | 51 (2.3) | 36 (2.4) | 25 (1.5) | 41 (2.5) | 76 (1.8) | 39 (2.1) | 58 (3.1) | 43 (1.8) | 31 (3.2)
Cyprus | 46 (1.0) | 33 (2.1) | 86 (2.3) | 66 (2.2) | 59 (2.3) | 29 (2.9) | 18 (1.5) | 52 (2.4) | 68 (2.2) | 40 (1.9) | 48 (2.4) | 42 (1.5) | 14 (2.1)
Countries not satisfying guidelines for sample participation rates
Australia | 65 (1.2) | 54 (2.6) | 92 (1.4) | 71 (1.8) | 64 (2.4) | 59 (2.2) | 36 (1.9) | 73 (2.9) | 78 (2.4) | 59 (1.9) | 74 (3.3) | 58 (1.8) | 55 (2.8)
England | 67 (0.9) | 59 (2.2) | 99 (0.6) | 77 (2.0) | 79 (1.4) | 68 (2.1) | 46 (2.3) | 55 (2.4) | 79 (1.6) | 62 (1.4) | 69 (3.1) | 63 (1.5) | 53 (2.5)
Netherlands | 60 (1.3) | 45 (2.6) | 94 (2.1) | 63 (2.9) | 70 (1.9) | 43 (2.7) | 35 (2.8) | 44 (2.5) | 76 (2.2) | 59 (2.3) | 71 (2.4) | 67 (1.9) | 53 (2.9)
United States | 55 (1.3) | 50 (2.0) | 85 (2.5) | 56 (1.9) | 63 (2.4) | 48 (2.2) | 28 (1.9) | 53 (2.1) | 71 (2.1) | 56 (1.9) | 68 (2.0) | 48 (1.8) | 28 (2.5)
Countries not meeting age/grade specifications
Colombia | 39 (1.8) | 11 (1.0) | 96 (1.3) | 55 (2.2) | 40 (3.7) | 26 (2.3) | 22 (2.5) | 51 (2.7) | 49 (4.0) | 31 (1.6) | 43 (5.7) | 34 (4.4) | 20 (3.0)
Romania | 63 (1.9) | 41 (3.6) | 83 (3.5) | 75 (2.2) | 45 (3.0) | 63 (2.6) | 36 (2.8) | 63 (4.1) | 76 (2.3) | 66 (2.6) | 84 (2.3) | 58 (3.1) | 51 (4.1)
Slovenia | 61 (1.0) | 40 (3.2) | 92 (1.9) | 71 (1.8) | 64 (1.7) | 49 (2.0) | 31 (1.8) | 63 (1.9) | 78 (1.4) | 58 (1.5) | 82 (2.0) | 55 (1.9) | 45 (3.8)
International Average | 59 (0.3) | 44 (0.6) | 90 (0.5) | 67 (0.5) | 63 (0.6) | 49 (0.5) | 35 (0.5) | 60 (0.6) | 73 (0.5) | 54 (0.5) | 69 (0.7) | 54 (0.5) | 44 (0.7)

* For detail on sampling and age/grade specifications see Performance Assessment in IEA's Third International Mathematics and Science Study (1997), Appendix A.


Another pattern, interesting in the light of the results of the written tests, is that there was virtually no difference in the overall averages for mathematics and for science tasks at the 8th grade level. (Fourth grade students internationally did less well on mathematics tasks as a whole than on the science tasks as a whole. This is similar to the pattern on the written test at the fourth grade.)

In spite of commonalities in the difficulty levels of various tasks among participating countries and contrasts between countries on overall achievement, aggregation of results at the task level tells us little that can influence policy and instruction. Because each task consisted of subquestions (items), each constructed to measure different, although interdependent, processes, we need to look at results within task at the item level for further clarity.

In the Pulse task the student was asked to determine the effect of exercise on pulse rate. 3 In this task students had to measure their pulse at rest and during exercise, organize and display pulse rates, summarize the trend in the data and explain the physiological processes underlying the results.

Table 2: National Correctness Score Averages at 13-Year-Old Level for Pulse Task

Country | Data Organization | Data Quality | Analysis | Explanation | Task Average
Australia | 68.5 | 45.3 | 71.5 | 30.7 | 54.0
Canada | 53.0 | 44.0 | 60.0 | 26.0 | 45.8
Colombia | 9.5 | 4.0 | 20.0 | 11.0 | 11.1
Cyprus | 30.5 | 31.7 | 55.5 | 14.7 | 33.1
Czech Republic | 45.0 | 37.7 | 72.0 | 27.3 | 45.5
England | 64.5 | 59.0 | 74.5 | 39.0 | 59.3
Iran, Islam. Rep. | 76.5 | 57.7 | 53.5 | 33.3 | 55.3
Netherlands | 50.5 | 43.7 | 56.0 | 29.0 | 44.8
New Zealand | 51.5 | 37.3 | 60.5 | 28.0 | 44.3
Norway | 44.5 | 47.7 | 72.0 | 29.7 | 48.5
Portugal | 31.5 | 24.0 | 26.0 | 16.3 | 24.5
Romania | 45.0 | 29.0 | 62.5 | 27.7 | 41.0
Scotland | 61.0 | 56.0 | 67.0 | 34.3 | 54.6
Singapore | 59.5 | 55.7 | 82.5 | 42.7 | 60.1
Slovenia | 54.5 | 33.3 | 53.5 | 18.7 | 40.0
Spain | 36.0 | 29.7 | 52.0 | 26.3 | 36.0
Sweden | 45.5 | 50.3 | 62.5 | 22.0 | 45.1
Switzerland | 57.5 | 43.3 | 75.0 | 27.0 | 50.7
United States | 54.0 | 42.3 | 71.5 | 32.7 | 50.1
International Average | 49.4 | 40.6 | 60.4 | 27.2 | 44.4

For each item, criteria for a complete response formed the basis for the coding scheme. This scheme consisted of a 2-digit code, wherein the first digit represented the number of correctness points and the second digit, in combination with the first, served as a flag to identify particular approaches, errors, or misconceptions (see Zuzovsky, Performance assessment in science: Lessons from the practical assessment of 4th grade students in Israel, in this issue).

Table 2 shows national correctness score averages at the 13-year-old level (Population 2) for the Pulse task. Figure 1 highlights the internal variability across the task, depending on the demands of the item. Clearly students world-wide were about average when using procedural skills to organize data, slightly less competent in the accuracy of their data collection, but scored well in ability to identify the trend in the data (analysis). The lower average scores for hypothesizing or constructing a concept-based explanation for their findings are not surprising. Difficulty with explanation was found in all tasks, regardless of content area. The results on the Pulse task were typical of tasks whose difficulty was in the average range.
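As an illustration of this two-digit coding, the sketch below splits a code into its correctness points (the first digit) and a diagnostic label; the specific codes and labels are invented for the example and are not the actual TIMSS rubric for any item.

```python
# Hypothetical illustration of the two-digit response coding described above:
# the first digit gives the correctness points, and the full two-digit code,
# read together, flags a particular approach, error, or misconception.
# The example codes and labels below are invented, not an actual TIMSS rubric.

DIAGNOSTIC_LABELS = {
    "20": "fully correct; trend explained using the underlying concept",
    "21": "fully correct; trend explained from the data pattern only",
    "10": "partially correct; trend identified but not explained",
    "01": "incorrect; common misconception",
    "09": "incorrect; other or omitted",
}

def split_code(code):
    """Return (correctness_points, diagnostic_label) for a two-digit code."""
    points = int(code[0])                 # first digit = correctness points
    return points, DIAGNOSTIC_LABELS.get(code, "other")

for code in ("21", "10", "01"):
    print(code, split_code(code))
```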


Figure 1: Internal Variability Across Pulse Task

Calculators

The Calculator task was intended to measure students' ability to use the equipment for multiplication, to identify and describe a pattern in the data (products), to apply that pattern to solve a routine and a non-routine problem, and to explain their strategy
for solving these problems. At the 8th grade level an additional and different type of problem was presented in the last two items. Students were given the number 455 and asked to find two factors such that both were 2-digit numbers, and none was greater than 50, and, using these criteria, to explain why two numbers proposed could not be right. Results on this task are shown in Table 3. Internal variability depending on demands of each item is shown in Figure 2.

Figure 2: Internal Variability Across Calculators Task

Thirteen-year-old students all over the world were highly competent in using a calculator and quite competent in the routine application of a pattern even when their description of the pattern was incomplete. However, incomplete identification of the pattern interfered with non-routine applications and with explaining their problem solving strategies. Correctness scores for the non-routine application were 20-25% below those for the routine application. The other weak area was factoring (the last item). Although many students recognized the number properties of the product (preceding item) and hence those required of the factors, almost no students used the obvious short route on the last item, i.e. find prime factors first and use them to locate the correct factor combinations. Instead, some used their insights about the nature of the numbers to bound their experiments with factors, and so solved the problem within the time limits, while others randomly tried one set of factors after another and ran out of time.
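To make the short route concrete (an editorial worked example, not part of the original task materials): 455 = 5 × 7 × 13, and the only way to combine these prime factors into two two-digit factors no greater than 50 is 13 × 35. A brute-force search over the allowed range, the slower route many students attempted, reaches the same pair:

```python
# Brute-force check of the 455 factoring item: find factor pairs in which both
# factors are two-digit numbers and neither exceeds 50. (Editorial sketch, not
# part of the original task materials.)
pairs = [(a, 455 // a) for a in range(10, 51)
         if 455 % a == 0 and 10 <= 455 // a <= 50]
print(pairs)   # [(13, 35), (35, 13)]
```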

Table 3: National Correctness Score Averages at 13-Year-Old Level for Calculators Task

Country | Use of Equipment | Pattern Analysis | Application: Routine | Application: Non-Routine | Strategy | Number Sense | Factoring | Task Average
Australia | 99.3 | 50.4 | 86.0 | 67.0 | 49.5 | 35.7 | 26.5 | 59.2
Canada | 97.0 | 44.0 | 86.0 | 63.5 | 47.5 | 50.3 | 30.0 | 59.8
Colombia | 93.7 | 20.5 | 46.5 | 26.5 | 10.0 | 13.0 | 6.5 | 31.0
Cyprus | 97.3 | 23.5 | 56.0 | 39.0 | 19.5 | 38.0 | 9.0 | 40.3
Czech Republic | 96.3 | 45.0 | 75.5 | 58.0 | 45.5 | 43.7 | 15.5 | 54.2
England | 98.0 | 50.5 | 85.5 | 59.0 | 61.5 | 52.3 | 29.5 | 62.3
Iran, Islam. Rep. | 96.0 | 42.5 | 59.5 | 54.5 | 30.0 | 51.0 | 6.0 | 48.5
Netherlands | 97.0 | 37.5 | 76.5 | 58.0 | 42.0 | 78.7 | 25.0 | 59.2
New Zealand | 94.7 | 43.0 | 78.5 | 56.5 | 40.5 | 47.3 | 23.5 | 54.9
Norway | 99.0 | 44.0 | 79.0 | 51.0 | 46.5 | 69.3 | 24.5 | 59.0
Portugal | 94.7 | 22.5 | 62.0 | 44.5 | 26.0 | 21.3 | 5.5 | 39.5
Romania | 98.0 | 51.0 | 82.5 | 79.5 | 57.0 | 47.7 | 44.0 | 65.7
Scotland | 97.0 | 44.0 | 65.5 | 43.5 | 45.0 | 35.7 | 15.0 | 49.4
Singapore | 98.0 | 33.5 | 84.0 | 63.5 | 44.5 | 53.0 | 45.5 | 60.3
Slovenia | 99.3 | 34.5 | 84.0 | 68.0 | 34.5 | 61.3 | 22.5 | 57.7
Spain | 98.3 | 48.5 | 76.5 | 53.0 | 52.5 | 28.7 | 12.0 | 52.8
Sweden | 94.7 | 40.0 | 69.0 | 52.5 | 48.5 | 38.7 | 10.5 | 50.5
Switzerland | 99.0 | 51.0 | 85.5 | 63.5 | 55.0 | 40.3 | 33.0 | 61.0
United States | 97.0 | 43.5 | 79.5 | 51.5 | 44.0 | 54.0 | 20.0 | 55.6
International Average | 97.1 | 40.5 | 74.6 | 55.4 | 42.1 | 45.3 | 21.3 | 53.7

Shadows

The Shadows task was one of those combining both mathematics and science knowledge and skills. Students were provided with a screen, meter stick and smaller ruler, a card placed in the path of light, and a flashlight so mounted that it would cast the shadow of the card on the screen. By manipulating the card and the flashlight, students were to investigate the effects on the shadow of changing distances, and to find combinations of distances between light and card, light and screen, or card and screen, such that the shadow would always be twice the width of the card. The final item of the task required that students develop a generalization or rule to define when the shadow would always be twice the width of the card. In explaining why the shadow size changed as it did, and when it would always be twice the size of the card, students needed to know, at least intuitively, that light travels in straight lines, and how shadows are formed. For those who visualized the set-up geometrically, the card and the screen formed the bases of similar triangles with the light source at the apex. But even without this level of sophistication, some students made measurements so carefully that they were able to develop the generalization by looking at the ratios of the distances. Nonetheless, this item proved one of the most difficult for 8th grade students (21% international average achievement). Except for the first item, which was a simple description of their observations, and on which 66% of the students internationally achieved full points (average correctness score including partial credits, 75%), the averages for most items of this task ranged between 20 and 33%, and the percentages of students achieving full points ranged between 12 and 20% at the international level.

A number of factors contributed to the difficulty of the Shadows task, both at the national level and internationally. In the first item students had only to explore and describe their observations, the effect on shadow size of shortening or lengthening distances. There was no requirement for careful data collection or tabulation; probably as a consequence, scores were high. In the second, however, students needed to invoke knowledge about how a shadow is formed to achieve full points. While the responses revealed many student misconceptions about the nature of light and of shadows, useful information at the national level for instructional planning, only three countries achieved a correctness level above 50%. The third item was a problem solving item, but students were not cued to make measurements. Instead they were only told to "find 3 positions where the shadow is twice the size of the card." Some found and recorded measurements for one position, some for 3, some estimated the distances "by eye."
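For reference, the generalization the final item was after can be written out from the similar-triangles picture described above; this is an editorial sketch (treating the mounted flashlight as a point source), not the wording of the task or its scoring rubric:

\[
\frac{\text{shadow width}}{\text{card width}} \;=\; \frac{D_{\text{light to screen}}}{d_{\text{light to card}}},
\]

so the shadow is exactly twice the width of the card whenever the screen is twice as far from the light as the card is, that is, whenever the card sits midway between the light and the screen.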

Table 4: National Correctness Score Averages at 13-Year-Old Level for Shadows Task

Country | Observation | Explanation | Problem Solving | Investigating | Data Display | Generalizing | Task Average
Australia | 66.5 | 24.5 | 38.5 | 32.5 | 27.5 | 25.0 | 35.8
Canada | 75.5 | 20.5 | 34.5 | 30.0 | 28.0 | 19.5 | 34.7
Colombia | 54.0 | 22.0 | 21.5 | 17.5 | 14.5 | 4.5 | 22.3
Cyprus | 64.0 | 13.5 | 8.5 | 12.5 | 3.5 | 8.5 | 18.4
Czech Republic | 87.5 | 48.5 | 32.0 | 27.5 | 8.5 | 19.0 | 37.2
England | 77.0 | 33.0 | 23.5 | 47.5 | 71.5 | 23.0 | 45.9
Iran, Islam. Rep. | 84.5 | 57.0 | 33.5 | 23.0 | 24.0 | 37.0 | 43.2
Netherlands | 55.0 | 50.5 | 33.0 | 27.0 | 24.5 | 22.5 | 35.4
New Zealand | 70.5 | 17.0 | 15.0 | 21.0 | 35.5 | 13.0 | 28.7
Norway | 75.5 | 27.5 | 51.5 | 25.5 | 18.5 | 35.0 | 38.9
Portugal | 65.5 | 27.0 | 24.0 | 16.0 | 11.5 | 7.0 | 25.2
Romania | 92.5 | 27.5 | 24.0 | 26.5 | 17.5 | 26.5 | 35.8
Scotland | 82.5 | 24.5 | 31.0 | 28.0 | 36.0 | 16.5 | 36.4
Singapore | 90.0 | 55.0 | 40.5 | 39.0 | 45.5 | 29.5 | 49.9
Slovenia | 76.0 | 29.5 | 24.5 | 24.0 | 11.5 | 19.0 | 30.8
Spain | 78.5 | 40.5 | 28.5 | 37.0 | 16.0 | 16.0 | 36.1
Sweden | 82.5 | 42.5 | 57.5 | 30.5 | 26.5 | 32.0 | 45.3
Switzerland | 80.0 | 44.0 | 42.5 | 28.5 | 21.5 | 31.5 | 41.3
United States | 64.5 | 19.5 | 12.5 | 26.5 | 34.0 | 10.5 | 27.9
International Average | 74.8 | 32.8 | 30.3 | 27.4 | 25.1 | 20.8 | 35.2

Some simply failed to record any measurements. Somehow the problem-solving aspect of this interesting task obliterated from view the fact that this was still a science investigation and careful, replicable recording of data was required. Results on this task are shown in Table 4. Internal variability depending on demands of each item is shown in Figure 3.


Figure 3: Internal Variability Across Shadows Task

In the fifth item they were explicitly told to "investigate" the phenomenon and "record data," but many either repeated the single measurement from the earlier item, perhaps assuming they had "already done that problem"; others provided no data. Scores on these items certainly reflected student perceptions of the task, as described by Per Morten Kind (1996). Still there were some students who attached a particular meaning to the word "investigate" and were cued by it to collect and tabulate data clearly and precisely and use it to compute ratios. The last item, to find a general rule whereby the shadow would always be twice the width of the card, was too difficult for most students at the theoretical level, although some discovered the generalization from their computations. There was an omission rate of almost 30% on this item. Only in seven countries did scores exceed the chance level.

Gender Differences

Table 5 highlights a unique feature of the performance assessment. In contrast to the marked dominance of boys over girls on the 13-year-old written tests (Beaton et al., 1996), there are practically no gender differences in the Performance Assessment results.

Table 5.1: Gender Differences in Science, TIMSS Performance Assessment
Table 5.2: Gender Differences in Mathematics, TIMSS Performance Assessment

(Average percent correct for boys and for girls, by country and task. Table 5.1 covers the Pulse, Magnets, Batteries, Rubber Band, Solutions and Shadows tasks together with the overall average; Table 5.2 covers the Plasticine, Dice, Calculator, Folding & Cutting, Around the Bend and Packaging tasks together with the overall average. Countries are reported in the same three groups as in Table 1.)

Source: Performance Assessment in IEA's Third International Mathematics and Science Study.
* Difference from other gender statistically significant at the 0.05 level, adjusted for multiple comparisons across each row.
** For further details, see report: Performance Assessment in IEA's Third International Mathematics and Science Study.


At the 8th grade level on only one task each, and for a single country in science and three countries in math, are there significant gender differences. In three of these four cases girls outranked boys. (At the 4th grade level, on one task in each of two countries, boys outranked girls in one country and girls outranked boys in the other.) One of the hypotheses sometimes advanced to account for gender differences in large-scale assessment is that females do less well in more abstract content areas. An examination at the item level did not support this hypothesis. For example, the first three items of the Folding and Cutting task are solved with scissors and paper; the last one requires that students locate the axes of symmetry ("fold lines") as a paper-and-pencil task, without actually folding or cutting. On this item, as well as on the entire task, there was no gender difference in the results of boys and girls. At the item level on one of the mathematics tasks, Around the Bend, there was a significant difference favoring boys on the two computational items. In both, students had to convert measurements from centimeters to meters or vice versa using a scale. On the Rubber Band task, two items required analysis of a student-constructed data table, and making predictions based on it. On these two items there were very small differences in favor of girls. The gender neutrality of the Performance Assessment needs further investigation.
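The significance flags in Table 5 are described as adjusted for multiple comparisons across each row. A Bonferroni-type correction is one standard way to make such an adjustment (whether it is the exact procedure used here is an assumption of this illustration): with k gender contrasts in a row, each contrast is tested at

\[
\alpha' \;=\; \frac{\alpha}{k} \;=\; \frac{0.05}{k}
\]

so that the chance of any false positive across the row stays near 0.05.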

What Do Our Students Do Well? Where Do We Need To Improve?

With only one task in most content areas (4-6 items), the Performance Assessment does not provide a broad enough sample to be representative of knowledge in individual content areas and needs to be supplemented by information from written, observation, or other modes of assessment. What it does show is the degree to which students see the relevance of what they have learned and whether they understand the concepts well enough to know what variables to measure and how to interpret their findings. In the figures provided for Pulse, Calculators, and Shadows these performance expectations were contrasted to show the variability of achievement with particular cognitive demands within a single task and content area. In the following tables we look at the larger performance areas. These categories, used both in the design of all TIMSS tasks and for reporting, may be found in Curriculum Frameworks for Mathematics and Science (Robitaille, Schmidt, Raizen, McKnight, Britton, & Nicol, 1993). The categories reported for science are:

• problem solving, and specifically, using scientific principles to solve problems or explain findings;
• using equipment, procedures and scientific processes; and
• investigating the natural world.


International results for these cognitive skills are shown in Table 6.

Table 6: Performance Expectations in Science

Country | Overall Average Percent Correct* | Problem Solving: Applying Concepts | Using Scientific Procedures | Scientific Investigating
Singapore | 71 (1.7) | 59 (3.0) | 75 (1.8) | 74 (1.9)
Switzerland | 65 (1.2) | 55 (1.6) | 63 (1.4) | 70 (1.3)
Sweden | 64 (1.2) | 56 (2.3) | 59 (1.4) | 70 (1.3)
Scotland | 62 (1.7) | 48 (2.1) | 69 (1.8) | 65 (1.5)
Norway | 62 (0.8) | 48 (1.6) | 57 (1.2) | 63 (1.1)
Czech Republic | 61 (1.3) | 53 (2.2) | 57 (2.0) | 65 (1.6)
Canada | 60 (1.3) | 53 (2.2) | 57 (2.0) | 65 (1.6)
New Zealand | 60 (1.4) | 47 (1.6) | 65 (2.1) | 57 (1.2)
Spain | 54 (0.8) | 39 (1.8) | 45 (1.8) | 57 (1.2)
Iran, Islam. Rep. | 52 (2.0) | 61 (2.0) | 53 (3.4) | 56 (2.7)
Portugal | 47 (1.1) | 32 (1.8) | 47 (1.4) | 45 (1.4)
Cyprus | 46 (1.0) | 37 (1.9) | 48 (1.7) | 50 (1.1)
Countries not satisfying guidelines for sample participation rates**
Australia | 65 (1.2) | 54 (2.0) | 67 (1.9) | 66 (1.1)
England | 67 (0.9) | 49 (2.0) | 77 (1.4) | 73 (1.0)
Netherlands | 60 (1.3) | 39 (1.9) | 63 (1.7) | 57 (1.4)
United States | 55 (1.3) | 43 (1.5) | 61 (2.2) | 55 (1.4)
Countries not meeting age/grade specifications
Colombia | 39 (1.8) | 32 (2.2) | 35 (2.4) | 41 (1.5)
Romania | 62 (1.9) | 48 (3.3) | 53 (2.5) | 61 (2.2)
Slovenia | 61 (1.0) | 48 (1.5) | 60 (1.3) | 59 (1.3)
International Average | 59 (0.3) | 47 (0.5) | 59 (0.4) | 60 (0.4)
Source: IEA Third International Mathematics and Science Study (TIMSS), 1994-5.
* This column lists overall averages by task, included here for reference.
** See Performance Assessment in IEA's Third International Mathematics and Science Study, Appendix, for detail on sampling and age/grade specifications.

In mathematics the reporting categories are:

• performing simple and complex mathematical procedures,
• investigating and problem solving, and
• mathematical reasoning. 4

International results for these categories are shown in Table 7. Not surprisingly, at the international level 13-year-olds were less competent in problem-solving and explaining findings than in use of procedural skills. Surprisingly, competence in investigating, which required procedural knowledge but also planning, analyzing and interpreting data, proved not to be significantly different from competence in procedural skills alone. In mathematics the lines of demarcation are much sharper. Internationally, 13-year-olds are more competent in using simple and complex procedural knowledge than in problem-solving and reasoning - no surprise. Only two of the mathematics tasks required use of concept knowledge to explain findings; most of the other reasoning items involved developing and describing strategies to solve problems.


Table 7: Performance Expectations in Mathematics

Country | Overall Average Percent Correct by Task | Performing Procedures | Problem Solving and Reasoning
Singapore | 71 (1.7) | 80 (1.3) | 62 (2.3)
Switzerland | 65 (1.2) | 76 (1.8) | 60 (1.8)
Sweden | 64 (1.2) | 73 (1.3) | 60 (1.6)
Scotland | 62 (1.7) | 75 (1.2) | 58 (1.3)
Norway | 62 (0.8) | 73 (1.6) | 58 (1.3)
Czech Republic | 61 (1.3) | 73 (1.6) | 56 (1.7)
Canada | 60 (1.3) | 74 (1.4) | 54 (1.3)
New Zealand | 60 (1.4) | 72 (1.1) | 55 (1.6)
Spain | 54 (0.8) | 66 (1.4) | 46 (1.3)
Iran, Islamic Republic | 52 (2.0) | 61 (1.8) | 49 (1.8)
Portugal | 47 (1.1) | 66 (1.2) | 36 (1.6)
Cyprus | 46 (1.0) | 58 (1.3) | 38 (1.4)
Countries not satisfying guidelines for sample participation rates*
Australia | 65 (1.2) | 75 (1.4) | 61 (1.9)
England | 67 (0.9) | 77 (1.1) | 54 (1.3)
Netherlands | 60 (1.3) | 77 (1.7) | 50 (1.5)
United States | 55 (1.3) | 64 (1.6) | 49 (1.4)
Countries not meeting age/grade specifications
Colombia | 39 (1.8) | 49 (2.7) | 30 (2.7)
Romania | 62 (1.9) | 74 (1.9) | 60 (2.4)
Slovenia | 61 (1.0) | 72 (1.2) | 57 (1.1)
International Average | 59 (0.3) | 70 (0.4) | 52 (0.4)
Source: IEA Third International Mathematics and Science Study (TIMSS), 1994-5.
* See Performance Assessment in IEA's Third International Mathematics and Science Study, Appendix, for detail on sampling and age/grade specifications.

Figure 4 contrasts the performance on these categories for both mathematics and science. Although communication skills were required in more than half the items, they are not reported, as the principal reporting category for each item was its science or mathematics cognitive skill. How marked is content dependency in the exercise of these cognitive skills? Figure 5 shows how one of the performance expectations, scientific problem solving, varies at the 8th grade level according to the content area in which it is exercised. The first Batteries problem was a multiple-choice question: above 90% international achievement. The following item required explaining why, in terms of electricity concepts, students selected the option chosen in the multiple-choice item: international achievement on this item, 42%. The four Plasticine subtasks combined both mathematical and science skills. Each consisted of a problem to be solved followed by explanation of the strategy used. The first was purely procedural: to make and weigh a 20-gram lump of clay, given a 20 g standard mass, and describe how it was done (G2.2A,B). The following problems required students to weigh 10 g, 15 g, and 35 g lumps of clay, in each case describing the strategy used. For the 35 g lump they had available both 20 g and 50 g standard masses.



Figure 4: International average percent correct by performance expectation category: in science, scientific problem-solving (applying concepts), using scientific procedures, and investigating the natural world; in mathematics, mathematical procedures and problem-solving and mathematical reasoning; together with the overall average.

Figure 5: Scientific problem-solving (applying concepts): international average percent correct by item (S1.3 Pulse, S3.3 Batteries multiple choice, S3.4 Batteries, S4.6 Rubber Band, S5.4 Solutions, G1.2 Shadows, G2.2A Plasticine and G2.2B Strategy 20 g, G2.3A Plasticine and G2.3B Strategy 15 g, G2.4A Plasticine and G2.4B Strategy 35 g, and the average for problem solving).

What Have We Learned From The Performance Assessment In TIMSS?

1. We have learned much about feasibility and the technical aspects that guarantee both internal and interrater reliability (Linn, 1994; Haertel & Linn, 1996). TIMSS demonstrates that it is possible, practical, but not cheap to conduct practical tests in a large-scale assessment across many languages and cultures, and maintain good standardization of translation, procedures, equipment, and scoring. So as not to disadvantage countries for which costs of equipment might be a serious obstacle, specifications were provided (after field testing and revision) so that almost all the equipment could be home-made. Reliabilities computed by KR-21 across all 66 items on the Performance Assessment ranged between .86 and .94 at the 8th grade level, with an international median of .90. Interrater agreements for correctness scoring (first digit) averaged 91%. Complete agreement for both correctness and diagnostic codes, i.e. identifying the precise type of approach, error, or misconception as well as the point value of the answer, averaged 84% (Harmon et al., 1997, Appendix A).

2. Task achievement was dependent to some degree both on the familiarity of the content area (e.g., human physiology was much less familiar to middle school students than batteries) and on the student's perception of what the task was calling for, the latter a factor affected by culture and style of pedagogy in each country. But the dominating factor causing differential performance between the tasks seems to have been the mix of procedural and higher order thinking items within each task.

3. It is possible to measure across tasks, with a high degree of reliability, interesting constructs like non-routine problem solving, experimental design, data analysis and interpretation, and strategy development. Although the level of achievement in these various thinking skills is content dependent, with unfamiliar content rendering even procedures like measurement less effective, these crests and troughs in achievement with task seemed to follow the same pattern in almost all countries and so could be perceived as only one factor contributing to task difficulty. The most difficult items across all content areas were those asking for extended writing, i.e., descriptions of procedures and problem-solving strategies, and concept-based explanations of findings.

4. There is an interesting gender neutrality in the Performance Assessment results. This is in sharp contrast to the marked dominance of boys over girls on the written test at the middle school and the secondary levels. Further research is needed to determine whether this gender neutrality is a function of the absence of multiple-choice questions, of the security provided by equipment and the freedom to "try things out," or whether it is somehow related to the high level of enthusiasm most students showed in addressing themselves to this kind of test, or something in the holistic nature of contextualized real-world tasks in contrast to more abstract items
found on the written test. Whatever the hypothesis, the near-absence of gender differences requires further investigation.

There is, however, one area wherein the TIMSS Performance Assessment still has to prove itself: the question of meaning (Messick, 1989, 1994). This question staggers anyone who tries to compare countries based on data aggregated at the whole task level. What exactly do these numbers mean? To what, if anything, do they generalize? A single task, however well-constructed, is too small a sample of a content domain to meet the test of content representativeness, although it has some face validity to evidence what was taught, or whether certain concepts were learned. But the potential for domain-irrelevant variance is evident in many items: the entanglement with language skills, and the impact of student perception of task demand, to name only two. Student ability to recall and appropriately apply certain concepts attests to grasp of those concepts by its presence, but absence of such application need not attest to ignorance of the concepts; it may reflect some other factor such as the student's reading ability, or perception of what the item is asking (a simple solution to a problem? or an elaborated and methodical "investigation" with replication of data, careful recording, and linking of evidence with hypothesis?).

Scoring keys and rubrics which generate the scores are, in fact, theory-driven, i.e. they are constructed by experts according to how they (the experts or experienced teachers) think students should think while moving through a task. In TIMSS the cognitive path is further constrained by subquestions demanding explicit evidence of certain procedural or cognitive skills but also cueing students' thinking in a particular direction. But students, like scientists and mathematicians, may well think completely differently, and move toward task solution by utterly different paths, paths not susceptible to measurement by the predesigned rubrics. Although TIMSS adjusted its rubrics based on empirical pilot data from 19 countries, each task was still scaffolded according to a particular cognitive model, and then subquestions were inserted to pinpoint whether students had mastered certain kinds of processes such as multiplying with a calculator, constructing a graph or data table, honoring the conventions of an experimental design (including materials, method, and controls), or identifying the trends in data. Results cannot be generalized to students' understanding of how a calculator works, or whether the data have any intrinsic meaning beyond showing a pattern, and certainly not to an inference that students understand what is going on that such a pattern should emerge (see Calculator task). So what, in fact, do we know when we line up countries based on national averages of these data aggregated at the task level? What do such scores mean?

More interesting, but not necessarily more valid, are aggregations at the construct level. Because some constructs were measured across many domains, there is clearer evidence of the strength or weakness of these skills, but this needs further study to disentangle content dependency effects. For example, as has been shown, TIMSS contained a variety of items sampling problem-solving, sometimes by building boxes which would just constrain four balls, or identifying lines of symmetry by folding and cutting out certain designs with just one cut, or trying to figure out a formula for the
dimensions of an object and a corridor such that the object would always go "around the bend," or determining multiple ways to prove which magnets were stronger, or which batteries were alive or dead. That students can demonstrate a viable solution path to a problem, even when the answer may be partially vitiated by small mechanical errors, indicates a direction for future instructional planning. The second digits of the coding system (diagnostic codes) can assist this exploration by indicating certain misconceptions or correct or erroneous patterns. Finding the answer to the problem, by whatever means, also speaks to student initiative, perseverance, and strategy sense in problem-solving. But TIMSS asked students to describe the strategy. Here, in many countries world-wide, students foundered. Similarly, explanation of results was weak world-wide, with scores about 20 percentage points lower than on other constructs. Can we conclude that students were unfamiliar with the concepts? Did not see their applicability? Were unused to being asked "why" in class or laboratory? Or did they simply lack written language skills to evidence the connections they saw? It is equally difficult to interpret scores for mathematical reasoning or for "understanding scientific investigation" because of their demonstrated dependence on procedural knowledge before higher order thinking about data can come into play. These questions need further study, and data from performance assessment need to be supplemented from observations, interviews or other sources. This may help us find ways of crafting response sheets so that they will not needlessly constrain student thinking but will allow interpretation of what kind of thinking is going on, and at the same time eliminate some of the sources of construct-irrelevant variance.

Notes

1. An earlier version of this article was presented at the Annual Meeting of the American Educational Research Association, April 1998, in San Diego, USA.

2. It is recognized that all testing measures "performance." The term Performance Assessment is used in TIMSS to represent that mode of assessment whereby students use equipment in solving practical problems and conducting investigations.

3. An item similar in content is found in the written test, X01, in which students were asked to describe a procedure to determine the effect of exercise on heart rate. The international average correctness score for this item at the Middle School level was 14%. It should be noted that this item draws on the same conceptual knowledge as the final item of the performance task, but requires the student to describe procedures and equipment, whereas the performance task provided these and required rather the interpretation and explanation of actual findings. The overall average for the performance task was 44%, but for the explanation item, 28%.

4. Because of small numbers of items in the reasoning category and overlapping thinking skills, the reasoning category was combined with problem solving for reporting purposes.

References

Beaton, A.E., Martin, M.O., Mullis, I.V.S., Gonzalez, E.J., Smith, T.A., & Kelly, D.L. (1996). Science achievement in the middle school years; Mathematics achievement in the middle school years. Chestnut Hill, MA: TIMSS International Study Center.


Haertel, E.H., & Linn, R.L. (1996). Comparability. Washington, DC: National Center for Education Statistics.

Harmon, M., Kind, P.M., Garden, R., & Orpwood, G. (1994). Third International Mathematics and Science Study: Coding guide for performance assessment. Chestnut Hill, MA: TIMSS International Study Center.

Harmon, M., Foxman, D., Garden, R., Gonzalez, E., Kind, P.M., Orpwood, G., & Quellmalz, E. (1994). Third International Mathematics and Science Study: Performance assessment administration manual. Chestnut Hill, MA: TIMSS International Study Center.

Harmon, M., & Kelly, D. (1996). Development and design of the TIMSS performance assessment. In M.O. Martin & D. Kelly (Eds.), Technical report, volume 1: Design and development. Chestnut Hill, MA: TIMSS International Study Center.

Harmon, M., Smith, T.A., Martin, M.O., Kelly, D.L., Beaton, A.E., Mullis, I.V.S., Gonzalez, E.J., & Orpwood, G. (1997). Performance assessment in IEA's Third International Mathematics and Science Study. Chestnut Hill, MA: Boston College, Center for the Study of Testing, Evaluation, and Educational Policy.

Kind, P.M. (1996). Exploring performance assessment in science. Doctoral dissertation, University of Oslo, Norway.

Lie, S., Taylor, A., & Harmon, M. (1996). Scoring techniques and criteria. In M.O. Martin & D. Kelly (Eds.), Technical report, volume 1: Design and development. Chestnut Hill, MA: TIMSS International Study Center.

Linn, R.L. (1994). Performance assessment: Policy promises and technical measurement standards. Educational Researcher, 23(9), 4-14.

Martin, M., & Kelly, D. (1996). Third International Mathematics and Science Study. Technical report, volume 1: Design and development. Chestnut Hill, MA: TIMSS International Study Center.

Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.

Messick, S. (1994). Alternative modes of assessment, uniform standards of validity. Princeton, NJ: Educational Testing Service, RR 94-60.


Robitaille, D.F., Schmidt, W.H., Raizen, S.A., McKnight, C.C., Britton, E.D., & Nicol, C. (1993). Curriculum frameworks for mathematics and science. Vancouver, Canada: Pacific Educational Press.

Schmidt, W.H., McKnight, C.C., Valverde, G.A., Houang, R.T., & Wiley, D.E. (1996). Many visions, many aims: A cross-national investigation of curricular intentions, Volume I: Mathematics. Boston: Kluwer.

Schmidt, W.H., Raizen, S.A., Britton, E.D., Bianchi, L.J., & Wolfe, R.G. (1996). Many visions, many aims: A cross-national investigation of curricular intentions, Volume II: Science. Boston: Kluwer.

The Author

MARYELLEN HARMON is Senior Researcher at the Center for the Study of Testing, Evaluation, and Educational Policy at Boston College, Chestnut Hill, MA. She obtained her B.S. and M.S. in Physical Chemistry and M.A.T. in Mathematics from the University of Detroit, and an Ed.D. in Humanistic Education, with specializations in Curriculum, Organizational Development, and Human Relations, from the University of Massachusetts, Amherst. She was International Coordinator for Performance Assessment for the Third International Mathematics and Science Study (TIMSS), Associate Director of the Center for the Study of Testing, Evaluation, and Educational Policy, Boston College (1988-90), and co-director and faculty member of the Teacher Education Program, Madonna University, Livonia, MI.