3.8 ASSESSING PROGRESS IN MATHEMATICAL MODELLING
John Izard
RMIT University, Melbourne, Australia
Abstract - Evidence of mathematical modelling skill may be obtained through performance of a number of complete tasks or by identifying the appropriate actions in a number of sub-tasks. Success is often gauged by adding the scores on each performance or sub-task. Students of different levels of modelling skill are assumed to receive different scores and students of the same level are assumed to receive the same scores. This paper reports an application of item response modelling procedures to samples from an item collection to investigate this assumption. The results inform the choice of items to meet technical and practical requirements for better assessments in mathematical modelling.
1. CONTEXT FOR THIS RESEARCH
At ICTMA-6, held at the University of Delaware in 1993, Werner Blum asked "What needs to be done for applications and modelling in practice and in research?" (Blum, 1995, p13). Blum saw four measures as necessary for applications and modelling in the future. His first measure was developing appropriate modes of assessment for applications and modelling. This paper focuses on that issue.
As Haines and Crouch (2004) have noted, teaching and learning paradigms for mathematical modelling are varied. Some approaches are holistic, commencing with simple cases and moving to an increasingly complex sequence of cases. Others focus on processes and stages that are presumed to relate to key issues in the gaining of expertise by novices. These teaching and learning paradigms assume that expertise may be developed. The task of assessment strategies is to document the qualities of that expertise as well as describe its extent.
Item response modelling (IRM) strategies were used in this research. The assumptions underlying IRM (sometimes known as IRT) are similar to those for traditional test analysis. Both assume that item marks are added to give a measure of achievement, but traditional approaches usually fail to check whether this assumption is valid. As McGaw et al. (1990, p30) state, "Whenever it is believed appropriate to add results then it is appropriate to express them on a common scale before doing so". IRM aims to separate item complexity from student achievement using iterative procedures, and to reflect on the common elements in items of comparable difficulty. These IRM strategies are used widely by examination boards (Izard, 1992) and in international studies (such as TIMSS [www.timss.org] and PISA
[www.pisa.oecd.org]) to interpret test results from open-ended and multiple-choice items, responses to rating scales, and other performance judgments.
IRM strategies have been applied in mathematical modelling contexts since 1989 (following ICTMA-4 in Roskilde, Denmark). For example, Izard and Haines (1993) used IRM in assessing oral communications about an authentic mathematical modelling task. Houston, Haines and Kitchen (1994) reported on rating scales for evaluating major projects and shorter projects and investigations, using IRM to interpret the responses. Haines, Crouch and Davis (2001) and Haines, Crouch and Fitzharris (2003) reported on the use of partial-credit multiple-choice questions (see Figure 1) to capture evidence about student skills at key developmental stages without the students being required to carry out the complete modelling task. IRM was used by Izard, Haines, Crouch, Houston and Neill (2003) to ensure that the progress of students using such items was reported on a common scale regardless of fluctuation in test difficulty.

Consider the real-world problem (do not try to solve it!): A large supermarket has a great many sales checkouts which at busy times lead to frustratingly long delays, especially for customers with few items. Should express checkouts be introduced for customers who have purchased fewer than a certain number of items?
In the following unfinished problem statement, which one of the five options should be used to complete the statement? Given that there are five checkouts and given that customers arrive at the checkouts at regular intervals with a random number of items (less than 30), find by simulation methods the average waiting time for each customer at 5 checkouts operating normally and compare it with
A. the average waiting time for each customer at 1 checkout operating normally whilst the other 4 checkouts are reserved for customers with 8 items or less.
B. the average waiting time for each customer at 4 checkouts operating normally whilst the other checkout is reserved for customers with fewer items.
C. the average waiting time for each customer at 1 checkout operating normally whilst the other 4 checkouts are reserved for customers with fewer items.
D. the average waiting time for each customer at some checkouts operating normally whilst other checkouts are reserved for customers with 8 items or less.
E. the average waiting time for each customer at 4 checkouts operating normally whilst the other checkout is reserved for customers with 8 items or less.
Preferred responses are A [score = 1], B [0], C [0], D [0], E [score = 2].
Figure 1. Partial-credit multiple-choice item (from Haines and Crouch, 2004).

When sampling from a collection of such mathematical modelling items it is clear that one should use at least one item from each developmental stage. But items assessing the same stage may differ in difficulty, so more information is needed
about which items better serve the purpose of describing development. Because the partial-credit format shown in Figure 1 was a potential complicating factor, it was decided to use another data set, where items were scored correct [score=1] or incorrect [score=0], in modelling the effects of difficulty.
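To make this concrete, here is a minimal sketch (not drawn from the study itself) of one common IRM formulation, the one-parameter Rasch model for dichotomously scored items: the probability of success depends only on the difference between student achievement and item difficulty, both expressed in logits on a common scale. The ability and difficulty values below are illustrative assumptions.

```python
import math

def p_correct(theta, delta):
    """Rasch model: probability that a student with achievement theta (logits)
    answers an item of difficulty delta (logits) correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# Illustrative values only: an able student (theta = +2) and a weaker student
# (theta = -1) attempting an easy item (delta = -1.5) and a hard item (delta = +2.5).
for theta in (2.0, -1.0):
    for delta in (-1.5, 2.5):
        print(f"theta={theta:+.1f}, delta={delta:+.1f}: P(correct) = {p_correct(theta, delta):.2f}")
```

Because achievement and difficulty sit on the same logit scale, adding raw item marks is defensible only after checking that the items behave consistently with such a model.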
2. THE REAL-WORLD PROBLEM
We know that problem tasks vary in difficulty. This variation may be a consequence of differing intellectual demands of the tasks, the quality of the information available, or the number of alternative plausible solutions. But the usual procedure of adding component marks when calculating scores assumes that components are equally difficult and interchangeable, even though tests with different components are not equivalent (Izard et al., 2003). Further, the number of tasks correct is not necessarily a true indication of achievement. When comparing results on two different assessments we do not know whether results vary because the items differ, or because the students differ, or both. If the results do not vary from the initial results, we do not know whether this is due to an inappropriate range of items (ceiling effects), to unrelated tests differing in difficulty, or to a lack of progress. Sound evidence of achievement requires multiple demonstrations: completion of a single task is not convincing.
Implications: Students in courses assessed in traditional ways may not receive due credit for their achievements. Assessment of progress during a course is impossible when the tests differ, because there is no relation between the assessment strategy for one year and the corresponding strategy for the following year: when the "rulers" are unrelated and lack common units we have no valid measure of added value (Izard, 2002a, 2002b).
3. ASSUMPTIONS MADE IN CREATING THE MODEL
Performance on real-life tasks provides evidence of achievement. Tasks used for the assessment strategy (items) are assumed to be from the same domain. (There seems little point in using irrelevant items.) All other things being equal, higher scores are assumed to indicate higher achievement, and larger differences between scores indicate larger differences in achievement. We also assume that items are stable in difficulty (relative to the other items) over occasions and that learning does not occur to a substantial extent during an assessment event. Changes in achievement on comparable assessments are inferred to be evidence of learning (or lack of learning).
Implications: Assessment items on a test should be internally consistent if they assess the same dimension. Students with high scores on the test as a whole should do better on each item than those with low scores on the test as a whole. Items that do not distinguish between able and less able students contribute little to a test. Assessment of progress implies that assessments are on the same dimension.
4. FORMULATION OF THE MATHEMATICAL PROBLEM
Assessing progress requires multiple test items (or problem tasks) as indicators of achievement over a range of difficulty consistent with the range of achievement to
be measured. Sampling of items or tasks should be representative of difficulty levels as well as of process and content. Higher credit should be given for difficult or more complex items than for less difficult or simple items. (This requirement can be met if more able students are successful on both the easier items and the more difficult items.) Several hypothetical tests (with open-ended items scored right/wrong) can be generated from items of known difficulty. If a test is useful, students of the same level should receive the same scores. Conversely, students of different levels of expertise should receive different scores, and students with differing expertise should not receive the same scores.
Implications: Students receive credit for their achievements through scoring on the easier items and, progressively, on the more difficult items. Students are unlikely to be correct on very difficult items when they are incorrect on easier items. Students who are correct on difficult items but who cannot achieve success on easy items are unusual. Such instances need investigation: possible explanations are carelessness, gaps in learning, students being assessed in other than their first language, or failures in test security.
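The last point about unusual response patterns can be checked mechanically. The sketch below is a hypothetical illustration (not the procedure used in the study): it flags any student who succeeds on an item substantially harder, by an assumed one-logit margin, than some easier item they got wrong.

```python
def flag_unusual_patterns(responses, difficulties, gap=1.0):
    """Flag students who are correct on an item at least `gap` logits harder than
    an item they got wrong. `responses` maps student id -> list of 0/1 scores
    aligned with `difficulties` (item difficulties in logits)."""
    flagged = {}
    for student, scores in responses.items():
        wrong = [d for s, d in zip(scores, difficulties) if s == 0]
        right = [d for s, d in zip(scores, difficulties) if s == 1]
        if wrong and right and max(right) - min(wrong) >= gap:
            flagged[student] = (min(wrong), max(right))
    return flagged

# Hypothetical data: five item difficulties (logits), easiest to hardest.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = {
    "X1": [1, 1, 1, 1, 0],   # expected pattern: success on the easier items only
    "X9": [0, 1, 0, 1, 1],   # wrong on the easiest item, right on the hardest
}
print(flag_unusual_patterns(responses, difficulties))  # flags X9 only
```

Flagged students are candidates for the follow-up investigations listed above (carelessness, gaps in learning, language of assessment, or test security), not automatic exclusions.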
5. SOLVING THE MATHEMATICAL PROBLEM
The notion of an item bank of valid and practical assessment strategies as a resource for teachers of mathematical modelling assumes that teachers can select items from the bank to construct their own valid tests. In principle this appears sound. The next assumption to be checked is the issue of sampling from the item bank to obtain valid tests - tests that can distinguish between differing levels of mathematical modelling expertise. A rectangular distribution of item difficulties may be used to generate a series of tests, representing possibilities expected in practice. Scores on each of the tests can be calculated for hypothetical students of specified achievement levels. Apparent achievement levels from test results can then be compared with true achievement levels known from other evidence.
Earlier in this paper a useful test was described as one where students of different achievement levels receive different scores and students of the same achievement level receive the same scores. The usefulness of some tests in assessing some hypothetical students is illustrated in Figure 2 (adapted from Izard et al. (1983); original in Izard (2004)). For each test, each student is shown by an X placed on a vertical linear continuum. A numeral has been added to distinguish between students. Higher achieving students are shown in the top part of the diagram and lower achieving students are shown in the lower part of the diagram. For example, student X1 shows high achievement, and X4 shows low achievement. Items for each of the five-item tests (A, B, F and J) are shown to the right of the vertical line representing the achievement continuum. The vertical placement of each item reflects its difficulty. Easy items like A1, A2, A3, A4, A5, F1, F2, and J2 are near the bottom of the diagram. Difficult items like B5, F4, and J5 are near the top of the diagram. The commentary following Figure 2 (adapted from Izard (2004)) considers the theoretical usefulness of each hypothetical test; a brief simulation sketch of the underlying idea is given first.
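The simulation idea outlined above can be sketched numerically. Under an assumed Rasch-style success probability, with illustrative ability and difficulty values that only loosely mirror Tests A, B, F and J, the expected scores of four hypothetical students can be computed for each test; none of the values below are taken from the study.

```python
import math

def p_correct(theta, delta):
    # Rasch-style probability of success, as in the earlier sketch.
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def expected_score(theta, difficulties):
    # Expected raw score on a test built from items with the given difficulties.
    return sum(p_correct(theta, d) for d in difficulties)

# Hypothetical achievement levels (logits) for students X1 (highest) to X4 (lowest).
students = {"X1": 2.5, "X2": 1.0, "X3": -0.5, "X4": -2.0}

# Hypothetical five-item tests: far too easy, far too hard, easy/hard mix, evenly spread.
tests = {
    "A (too easy)": [-5.0, -4.6, -4.3, -4.0, -3.8],
    "B (too hard)": [4.5, 4.8, 5.1, 5.4, 5.7],
    "F (mixed)":    [-5.0, -4.5, 4.5, 5.0, 5.5],
    "J (spread)":   [-3.0, -1.5, 0.0, 1.5, 3.0],
}

for name, items in tests.items():
    scores = {s: round(expected_score(t, items), 1) for s, t in students.items()}
    print(name, scores)
# Test J spreads the four students (roughly 4, 3, 2 and 1 expected correct); Tests A, B
# and F bunch them near the maximum, near zero, and near 2 respectively.
```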
[Figure 2 (diagram): students X1 (highest) to X4 (lowest) are shown on a vertical achievement continuum; the five items of each of Tests A, B, F and J are placed alongside, positioned according to their difficulty.]
Figure 2. Alternative possibilities for tests.
Test A would not be useful in distinguishing between the 4 students because the item difficulties are much lower than most of the student achievement levels. If used as an initial test, most or all of the students would get perfect scores. After a period of learning, this test would not be able to detect improvements in the achievement of the 4 students because they cannot obtain any more than a perfect score. In the context of evaluating an intervention, such lack of improvement may be interpreted as a failure of the intervention, but the fault lies with the choice of Test A.
Test B would not be useful in distinguishing between the 4 students because the item difficulties are much higher than most of the student achievement levels. If used as an initial test, most or all of the students would get zero scores. After a period of learning, this test may be able to detect improvements in the achievement of the 4 students because they can obtain a higher score than zero. But all would have to progress from their earlier level to the level above that of X4 to receive credit for their learning. In the context of evaluating an intervention, a lack of improvement may be interpreted as a failure of the intervention, but the fault lies with the choice of Test B.
Test F would not be useful in distinguishing between the students. Probably all would be correct on 2 of the items and wrong on the remaining 3. Results on Test F would imply that the 4 students were the same even though they are at different attainment levels.
Test J would be useful in distinguishing between each of the students. Probably X1 would be correct on 4 items, X2 would be correct on 3 items, X3 would be
correct on 2 items, and X4 would be correct on 1 item. Further, the differences between the attainment levels are reflected in the differences between the scores.
6. INTERPRETING THE SOLUTION
Did students with the same score have different achievement levels? Tests A, B and F show that all students were the same when they were not. Tests that are much too difficult or much too easy obscure the real differences between students. Combining very easy test items with very difficult test items also obscures the differences between students. Did students with different scores have different achievement levels? Only the results of Test J reflected the actual differences between students. Did students of the same achievement level receive the same scores? This possibility was not tested in the set of alternatives. But if we add a student X5 with the same achievement level as student X4, then both students would receive the same score on Test J, and each of the other tests would also give both the same score.
Test formats and scoring procedures that do not reflect true achievement should be rejected in favour of tests that do reflect true achievement. For Tests A, B and F there were many instances where students at different achievement levels were judged to be at the same level. This denies students appropriate credit for their work and discriminates against some students. Clearly, examinations and assessment strategies like these should be replaced with tests like Test J, provided that real tests behave in a comparable way to this hypothetical set of tests.
7. VALIDATION OF THE MODEL
A test of 21 items was administered to more than 700 students and the results were analysed using item response modelling software. The 54 students who obtained a perfect score were excluded from the study because their achievement level could not be determined. (We need to know what they cannot do as well as what they can do to estimate their achievement level: with perfect scores we have no evidence of what they cannot do.) The results were sorted to obtain groups of students with the same scaled achievement level according to the overall test of 21 items. Students were sampled from these larger groups to match as closely as possible the hypothetical students shown in Figure 2 above. Generally five achievement levels were used. Items were chosen to replicate the series of tests with 5 items. Raw and scaled scores were calculated for each student. Apparent achievement levels reflected by the test results were compared with the actual achievement levels.
A sample set of results is presented in Figure 3. Each graph has a vertical line or scale representing the continuum of achievement as indicated by the 21 items on the overall test. The difficulty level of each item (numbered with a bold numeral on the right hand side of each scale) is shown by its position on the vertical scale. The position of each student sampled is shown by the score obtained, on the left of each vertical scale. For example, the top 3 students on Test A had the same achievement level according to the 21-item test and all scored 5 on Test A. They are shown at the top of the graph and are well clear of the next achievement levels, as indicated by the separation on the vertical scale.
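The sampling and re-scoring steps described at the start of this section can be outlined in code. The study used dedicated item response modelling software; the sketch below, shown before Figure 3, is only a simplified reconstruction with assumed data structures: drop students with perfect scores, keep each remaining student's scaled achievement from the full 21-item test, and recompute a raw score on a chosen five-item subset.

```python
def subtest_scores(responses, scaled, subtest_items, n_items=21):
    """responses: student id -> dict of item id -> 0/1 score on the full test.
    scaled: student id -> scaled achievement (logits) estimated from the full test.
    subtest_items: the item ids making up a five-item test such as A, B, F or J."""
    results = {}
    for student, resp in responses.items():
        if sum(resp.values()) == len(resp) == n_items:
            continue  # perfect scores give no evidence of what the student cannot do
        raw = sum(resp[i] for i in subtest_items)
        results[student] = (scaled[student], raw)
    return results

# Hypothetical inputs: two students and a five-item subset drawn from the 21-item test.
responses = {
    "s01": {f"item{k:02d}": 1 for k in range(1, 22)},                      # perfect, excluded
    "s02": {f"item{k:02d}": (1 if k <= 15 else 0) for k in range(1, 22)},  # 15 of 21 correct
}
scaled = {"s01": None, "s02": 1.76}
subtest = ["item01", "item05", "item10", "item15", "item20"]
print(subtest_scores(responses, scaled, subtest))  # {'s02': (1.76, 4)}
```

Grouping the resulting (scaled achievement, subtest raw score) pairs is what allows the apparent levels on each small test to be compared with the actual levels, as reported in Table 2.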
[Figure 3 (diagram): for each of Tests A, B and F (N=15, L=5) and Test J (N=18, L=5), the sampled students' scores are plotted on the left of a vertical achievement scale, with the item difficulties from the 21-item test shown on the right.]
Figure 3. Student scores on alternative tests.
Test A was designed as a very easy test (with all items near the bottom of the scale). The majority of students achieved the same score of 5. Although they are clearly different on the 21-item test, the results of Test A imply that most are identical in achievement, as shown in Table 2. Test B was designed as a very difficult test (with all items near the top of the scale). The majority of students achieved the same score of 0. Although they are clearly different on the 21-item test, the results of Test B imply that most are identical in achievement, as shown in Table 2. Test F was designed to have very difficult items and very easy items (with some items near the top of the scale and some items near the bottom of the scale). The majority of students achieved the same score of 2. Although they are clearly
different on the 21-item test, the results of Test F imply that most are identical in achievement, as shown in Table 2. Test J was designed to have a range of items spread from very difficult to very easy (with some items near the top of the scale and some items near the bottom of the scale). The majority of students achieved scores consistent with their level of achievement on the larger test of 21 items. Students are clearly different on the overall test, and the results of Test J reflect most of those differences in achievement, as shown in Table 2.

Table 2. Comparisons of apparent and actual achievement levels: 5-item tests compared with the actual test of 21 items (n = 3 students in each group).

Test A
Actual levels, scaled scores (logits): Group 1, 3.47; Group 2, 2.67; Group 3, 2.16; Group 4, 1.76; Group 5, 0.56.
Apparent levels on Test A: 13 of the 15 students are at the same level.
Decision based on Test A: there is no difference between the 13 students.

Test B
Actual levels, scaled scores (logits): Group 1, -0.16; Group 2, -1.14; Group 3, -1.43; Group 4, -1.77; Group 5, -2.67.
Apparent levels on Test B: 14 of the 15 students are at the same level.
Decision based on Test B: there is no difference between the 14 students.

Test F
Actual levels, scaled scores (logits): Group 1, 0.12; Group 2, -0.12; Group 3, -0.36; Group 4, -0.61; Group 5, -0.87.
Apparent levels on Test F: 13 of the 15 students are at the same level.
Decision based on Test F: there is no difference between the 13 students.

Test J
Actual levels, scaled scores (logits), and apparent levels on Test J: Group 1, 3.47, all scored 5; Group 2, 1.76, scored 4, 5, 4; Group 3, 1.13, scored 4, 3, 3; Group 4, 0.12, all scored 2; Group 5, -0.61, all scored 1; Group 6, -0.87, all scored 0.
Decision based on Test J: 16 of the 18 differences between the students matched their actual differences.
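A hedged sketch of why the apparent levels in Table 2 bunch together on Tests A, B and F, and why even a very able student occasionally drops a point (a matter taken up in the next section): under a Rasch-style model the raw score on a five-item test follows a Poisson-binomial distribution, computed here with illustrative (assumed) ability and difficulty values rather than the study's estimates.

```python
import math

def p_correct(theta, delta):
    # Rasch-style probability of success on one item.
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def score_distribution(theta, difficulties):
    """Probability of each raw score 0..n for a student of ability theta (logits),
    computed by convolving the per-item success probabilities (Poisson-binomial)."""
    dist = [1.0]  # with no items attempted, the score is 0 with probability 1
    for d in difficulties:
        p = p_correct(theta, d)
        new = [0.0] * (len(dist) + 1)
        for score, prob in enumerate(dist):
            new[score] += prob * (1 - p)   # item answered incorrectly
            new[score + 1] += prob * p     # item answered correctly
        dist = new
    return dist

# Illustrative: a high-achieving student (theta = 3.5 logits) on a very easy five-item test.
easy_test = [-2.0, -1.5, -1.0, -0.5, 0.0]
for score, prob in enumerate(score_distribution(3.5, easy_test)):
    print(f"P(score = {score}) = {prob:.3f}")  # most of the mass on 5, a little on 4
```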
8. USING THE MODEL TO EXPLAIN AND PREDICT
One needs to be aware of sampling fluctuations in the way representatives of each group were chosen for Tables 2 to 5. For example, for the 81 students with the highest (not perfect) score, 78 had a score of 5 on Test A, and 3 had a score of 4. Although the three students sampled had a higher probability of scoring 5, some
scores of 4 were possible. Similarly, with the next highest score, 73 of the 77 students scored 5 and 4 scored 4. The next two scores split in different ways. For the higher score, 78 of the next 81 scored 5, while 13 scored 4. For the lower score, 57 of the 73 scored 5, 14 scored 4, and 2 scored 3. Similar concerns apply to the other tests. Small samples of items from a larger population need to be sampled with care to ensure that the inferences from the results are well founded.
Without prior knowledge of the properties of the test items, those constructing assessment strategies cannot know whether they have tests like Tests A, B and F (with all their faults) or like Test J - capable of providing useful information at all achievement levels. Trial data must be obtained so that those assembling tests from the items can ensure the resulting tests are like Test J rather than Tests A, B and F. In assembling a bank of assessment strategies, as much care needs to be taken in ensuring that the sampling of items is appropriate and suitable for the students being assessed, that the scoring protocols are used correctly, and that the results are interpreted appropriately, as is taken to ensure that the items are mathematically sound. For the purposes of explanation this discussion has been confined to small tests. In practice the tests should be much longer (but retain the rectangular item difficulty distribution) in order to make valid inferences about achievement and improved learning.
9. CONCLUDING COMMENTS
The traditional ways of reporting scores are not in terms of how well the student has satisfied each component of the curriculum. They compare students with students rather than compare each student's achievements with the curriculum intentions. We do not learn from the data collected what students know or do not know, because this information is ignored in the interpretation of the evidence. We also lack information on the progress made by students over several year levels, as a consequence of using different tests at different stages of learning without ever asking how the scores on each test relate to the overall continuum of achievement in that subject. The assessment techniques described in this paper and in earlier ICTMA papers can contribute to avoiding these problems. A quality assessment strategy for mathematical modelling complements quality teaching approaches to mathematical modelling: we need the former to provide evidence about the latter.
REFERENCES
Blum, W. (1995) Applications and modelling in mathematics teaching and mathematics education - Some important aspects of practice and research. In C. Sloyer, W. Blum and I. Huntley (eds.) Advances and Perspectives in the Teaching of Mathematical Modelling and Applications. Yorklyn, Delaware: Water Street Mathematics, 1-20.
Haines, C.R., Crouch, R.M. and Davis, J. (2001) Understanding students' modelling skills. In J.F. Matos, W. Blum, K. Houston and S.P. Carreira (eds) Modelling and Mathematics Education: ICTMA 9 Applications in Science and Technology. Chichester: Horwood Publishing, 366-381.
Haines, C.R., Crouch, R.M. and Fitzharris, A. (2003) Deconstructing mathematical modelling: Approaches to problem solving. In Q-X. Ye, W. Blum, K. Houston
and Q-Y. Jiang (eds) Mathematical Modelling in Education and Culture: ICTMA 10. Chichester: Horwood Publishing, 41-53.
Haines, C.R. and Crouch, R.M. (2004) Real world contexts: Assessment frameworks using multiple choice questions in mathematical modelling and applications. Paper presented at the IAEA Conference in Philadelphia, 13-18 June 2004.
Houston, S.K., Haines, C.R. and Kitchen, A. (1994) Developing rating scales for undergraduate mathematics projects. Coleraine, Northern Ireland: University of Ulster.
Izard, J.F. (2004) Best practice in assessment for learning. Paper presented at the Third Conference of the Association of Commonwealth Examinations and Accreditation Bodies on Redefining the Roles of Educational Assessment, March 8-12, 2004, Nadi, Fiji. South Pacific Board for Educational Assessment. [http://www.spbea.org.fj/aceab-conference.html]
Izard, J.F. (2002a) Constraints in giving candidates due credit for their work: Strategies for quality control in assessment. In F. Ventura and G. Grima (eds.) Contemporary Issues in Educational Assessment. MSIDA MSD 06, Malta: MATSEC Examinations Board, University of Malta for the Association of Commonwealth Examinations and Accreditation Bodies, 5-28.
Izard, J.F. (2002b) Describing student achievement in teacher-friendly ways: Implications for formative and summative assessment. In F. Ventura and G. Grima (eds.) Contemporary Issues in Educational Assessment. MSIDA MSD 06, Malta: MATSEC Examinations Board, University of Malta for the Association of Commonwealth Examinations and Accreditation Bodies, 241-252.
Izard, J.F. (1992) Assessment of learning in the classroom. (Educational studies and documents, 60.) Paris: UNESCO.
Izard, J.F. et al. (1983) ACER Review and Progress Tests in Mathematics. Hawthorn, Victoria: Australian Council for Educational Research.
Izard, J.F. and Haines, C.R. (1993) Assessing oral communications about mathematics projects and investigations. In M. Stephens, A. Waywood, D. Clarke and J.F. Izard (eds.) Communicating Mathematics: Perspectives from Classroom Practice and Current Research. Hawthorn, Victoria: Australian Council for Educational Research, 237-251.
Izard, J.F., Haines, C.R., Crouch, R.M., Houston, S.K. and Neill, N. (2003) Assessing the impact of the teaching of modelling: Some implications. In S.J. Lamon, W.A. Parker and K. Houston (eds.) Mathematical Modelling: A Way of Life: ICTMA 11. Chichester: Horwood Publishing, 165-177.
McGaw, B., Eyers, V., Montgomery, J., Nicholls, B. and Poole, M. (1990) Assessment in the Victorian Certificate of Education: Report of a Review Commissioned by the Victorian Minister of Education and the Victorian Curriculum and Assessment Board. Melbourne, Australia: Victorian Curriculum and Assessment Board.