Studies in Educational Evaluation 64 (2020) 100812
Does difficulty-based item order matter in multiple-choice exams? (Empirical evidence from university students)☆
Süleyman Nihat Şad, Measurement and Assessment, Faculty of Education, İnönü University, TR-44800 Malatya, Turkey
ARTICLE INFO
Keywords: Multiple-choice exams; Difficulty-based item sequencing; Item order; Test performance

ABSTRACT
This empirical study aimed to investigate the impact of easy first vs. hard first ordering of the same items in a paper-and-pencil multiple-choice exam on the performances of low, moderate, and high achieving examinees, as well as on the item statistics. Data were collected from 554 Turkish university students using two test forms that included the same multiple-choice items in reverse order, i.e. easy first vs. hard first. The tests included 26 multiple-choice items on the introductory unit of a "Measurement and Assessment" course. The results suggested that sequencing the multiple-choice items in either direction, from easy to hard or vice versa, did not affect the test performances of the examinees, regardless of whether they were low, moderate, or high achievers. Finally, no statistically significant difference was observed between the item statistics of the two forms, i.e. the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) coefficients.
1. Introduction

Multiple-choice items are one of the most commonly used objective item formats; they consist of a stem and several prefabricated options, among which one is correct and the others are foils or distractors. Though they are practical for measuring a variety of learning outcomes at the knowledge and understanding levels, they are rather inappropriate for measuring complex mental abilities such as writing and mathematical problem solving (Haladyna, Downing, & Rodriguez, 2002) or the ability to recall, organize, or articulate ideas in one's own words (Linn & Gronlund, 1995). Multiple-choice items are highly objective and widely applicable; however, they are vulnerable to cheating, since students can easily copy the correct answers by looking, whispering, or signaling, especially in large groups. To reduce the risk of cheating in large groups, especially when seating space is limited, the examiner needs to develop multiple forms of the same test (Balch, 1989; Bandalos, 2018; Gray, 2004; Moses, Yang, & Wilson, 2007; Nath & Lovaglia, 2009; Vander Schee, 2009). However, producing good items to cover the same content area is difficult and time consuming (Aamodt & McShane, 1992). A practical way to overcome this challenge is to mix up the order of the same items in alternative forms (Kagundu & Ross, 2015; Noland, Hardin, & Madden, 2015; Pettijohn & Sacco, 2007; Tan, 2001, 2009). However, this strategy has also been criticized for possibly affecting exam scores, since beginning an exam with easier or
more difficult items might affect the examinees' performance for the rest of the test (Aamodt & McShane, 1992). As a rule of thumb, it is generally suggested that items should be arranged in order of increasing difficulty, i.e. an easy-to-difficult progression (Linn & Gronlund, 1995; Zimmaro, 2004). Otherwise, if students encounter difficult items early in the test, they often become demotivated and spend a disproportionate amount of time on them, forcing them to omit potentially easier questions ahead (Leary & Dorans, 1985; Linn & Gronlund, 1995). Thus, it can be inferred that an unfavorable item order may produce problematic or unfair results (Pettijohn & Sacco, 2007). In this respect, an unfavorable or inequivalent item order in terms of item difficulty can be a source of random error, posing a threat to the reliability of the test scores. According to Classical Test Theory (CTT), random error is characterized by uncontrolled and unpredictable changes in students' test performances and scores. Inconsistencies across test forms can also be regarded as a common source of such errors. Thus, the present empirical study investigated the effect of difficulty-based item order in multiple-choice exams on the test performances of low, moderate, and high achievers, as well as on the item statistics regarding item difficulty and discrimination. Though there is a substantial body of literature on test characteristics and their effects on test quality and examinees' performance, little is known about whether the order of items according to difficulty affects the performances of low, moderate, and high achievers differently.
☆ The findings of this research were presented as a paper at the 3rd International Eurasian Educational Research Congress, held between 31 May and 3 June 2016 in Turkey.
E-mail address: [email protected]
https://doi.org/10.1016/j.stueduc.2019.100812
Received 29 August 2018; Received in revised form 4 September 2019; Accepted 6 September 2019
0191-491X/ © 2019 Elsevier Ltd. All rights reserved.
2. Literature review
2.1. Item order and performance

The presentation of the items is among the test characteristics that may affect an examinee's performance on a specific test. Thus, the physical location of test items across a test has been investigated empirically by many researchers. Among them, some investigated the order of items in terms of course content order, which generally revealed results in favor of items ordered according to content (Aamodt & McShane, 1992; Balch, 1989; Carlson & Ostrosky, 1992; Taub & Bell, 1975), while others found no difference for content-based sequencing (Neely, Springston, & McCann, 1994; Pettijohn & Sacco, 2007; Vander Schee, 2009). Gray (2004) also found that the place of an item in a test matters, as students perform better on a test item when they are asked a related item before it rather than a preceding unrelated control item (i.e. a related item is likely to be a cue for students to answer the following item correctly). Moreover, the order of options has also been an issue of research interest, which revealed that as the correct option moves toward the end of the list of options, the item becomes slightly more difficult (Hohensinn & Baghaei, 2017).

2.2. Does difficulty-based item order matter?

When it comes to whether difficulty-based item order across a test matters or not, the question turns out to be a rather limited and scarcely visited research idea. Furthermore, the findings of this relatively small number of studies are rather controversial. Findings are so mixed that one can read contrasting results even within the same study. For example, in a recent study, Ollennu and Etsey (2015) tested the effect of easy-to-hard, hard-to-easy, and randomly ordered versions of multiple-choice English Language, Mathematics, and Science tests on the performances of 810 Junior Secondary School students in Ghana. Results revealed that, in the English Language test, students performed similarly on both the easy-to-hard (mean = 18.7) and hard-to-easy (mean = 19.8) ordered test forms, while students taking the randomized test performed significantly higher (mean = 33.0) than the other two groups. For the Mathematics test, students who took the easy-to-hard (mean = 18.3) version performed better than both students taking the hard-to-easy (mean = 10.4) and randomized (mean = 15.2) forms. Finally, in the Science test the hard-to-easy order (mean = 16.7) revealed significantly better results than the easy-to-hard (mean = 13.5) order. Similarly, Noland et al. (2015) also reported mixed results with six exams administered to small groups: while test scores showed no significant difference for some groups, for other groups there were some significant differences in favor of the group taking the easy-to-hard test or in favor of the group taking the hard-to-easy test. Moreover, Newman, Kundert, Lane, and Bull (1988) found that item arrangement by item difficulty does not affect students' responses based on total scores, while students receiving the test form with items ordered in increasing cognitive difficulty performed better on hard items than students who received the form with items ordered in decreasing cognitive difficulty.

Unlike the studies above with mixed results, some research findings clearly indicated a lack of significant effect of various difficulty-based item sequencings. Perlini, Lind, and Zumbo (1998), for example, found no significant difference between easy-to-hard, hard-to-easy, or random item arrangements in terms of their effect on students' test performance. Plake (1980) also found no significant effect of three item orderings (easy-to-hard, spiral cyclical, and random) on test scores or students' perceptions of the test. Likewise, Laffitte (1984) found that college students are undisturbed by different item orders based on either difficulty or content (easy-to-hard within topic; easy-to-hard across topic; random within topic; and random across topic) in classroom tests. Earlier, Brenner (1964) experimented with ordering the same items according to difficulty, which revealed no significant difference in either overall test difficulty or test reliability. In a more recent study, Weinstein and Roediger (2012) reported that examinees' performance on both arrangements did not differ significantly, even though they felt more optimistic about their performance when following an item arrangement from the easiest item to the hardest. In contrast, Aamodt and McShane (1992) found that even the easy-to-hard order did not significantly affect the perceived difficulty of the exam or examinees' level of anxiety.

Furthermore, some research findings were in agreement with the generally known rule of thumb, i.e. that items should be arranged in an easy-to-difficult order. For example, in a meta-analysis of 20 studies with 26 samples investigating the order of items by difficulty, Aamodt and McShane (1992) reported that exams beginning with easier items and then moving to more difficult items are slightly easier than exams with randomly ordered items (d = .11) or exams beginning with difficult items (d = .22). They also reported that beginning an exam with hard items rather than randomly arranged items results in a significant but small reduction (by 1.49 points on 100 items) in exam scores. Hambleton and Traub (1974) found that the mean number of correct answers on a math test with items arranged in difficult-to-easy order was significantly lower than the mean obtained on the reversely ordered form. Besides, Plake, Ansorge, Parker, and Lowry (1982) found that, beyond better or worse performance, the easy-to-hard test yielded more positive perceptions among learners. Some earlier research also attempted to distinguish the effect of difficulty-based item order according to the type of the test. For example, Flaugher, Melton, and Myers (1968) proposed that a practical effect in favor of easy-to-hard order is evident especially for speeded tests, but not for power tests. Likewise, in an early study Towle and Merrill (1975) also suggested that timed standardized tests should not begin with hard items, since when students take the hard items first they may not have the chance to complete the remaining easier items. Tan (2009) also investigated the impact of asking the same items in different orders on item difficulties and item discrimination powers in multiple-choice tests. Contrary to the research findings above, he found that 10 items out of 50 received more correct answers (i.e. had higher item difficulty indices) in the form where they were placed earlier.
2.3. The purpose and significance

The purpose of the present empirical study was to investigate the impact of easy first and hard first ordering of the same items in a paper-and-pencil multiple-choice exam on the performances of low, moderate, and high achieving examinees, as well as on the item statistics, including the item difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) coefficients. Thus, this study answers the following research questions:

1. Does difficulty-based item order affect examinees' test scores?
2. Does difficulty-based item order affect low, moderate, and high achievers differently?
3. Does difficulty-based item order affect item statistics?

The present study differs from previous studies because it attempts to analyze specifically whether the order of items according to difficulty affects low, moderate, and high achievers' test performances
differently or not. This should be of particular interest to test developers, because low, moderate, and high achievers might react differently to easy first and hard first item orders. Thus, in order to discover whether they react differently, data from low, moderate, and high achievers should be handled separately rather than pooled altogether.
3. Method

3.1. Design

This study was structured as an experimental study using a 2 (Item order: easy first vs. hard first) × 3 (Achievement level: low vs. moderate vs. high) between-subjects factorial design, in which the effect of the item order × ability level interaction on examinees' test scores was tested.

3.2. Study group

Data were collected from a total of 554 university students attending a pedagogical formation course at İnönü University in Malatya, Turkey, during the fall semester of the 2015–2016 academic year. The pedagogical formation program is a special teacher training program offering 25 credits of coursework over two semesters to students coming from faculties or schools other than colleges of education. Recently, in Turkey, obtaining a certificate from this paid program has become a common means of becoming a teacher in various subject fields, as an alternative to holding a degree from a college of education. To obtain the certificate, students have to successfully complete 14 credits of theoretical courses (including a 2-credit Measurement and Assessment course) and 11 credits of practical courses (including a 6-credit teaching practicum). The participants of the present study included 138 (24.9%) theology students, 73 (13.2%) students from the Turkish language and literature department, 57 (10.3%) history students, 54 (9.8%) students from the sports department, 46 (8.3%) sociology students, 43 (7.8%) philosophy students, 30 (5.4%) nursing students, 28 (5.1%) mathematics students, 26 (4.7%) music students, 13 (2.3%) students from the western languages department, 11 (2%) biology students, 11 (1.9%) visual art students, 9 (2%) students from the communication faculty, 6 (1.1%) chemistry students, 4 (0.7%) business students, 3 (0.5%) politics and public management students, 1 (0.2%) physics student, and 1 (0.2%) economics student. Out of 554 participants, 371 (67%) were women, while 183 (33%) were men.

3.3. Data collection tools

To test the effect of item order in terms of difficulty, two test forms were developed, which included the same multiple-choice items in reverse order. Each test included 26 five-option multiple-choice items on the introductory unit of the "Measurement and Assessment" course delivered in the pedagogical formation program. This preliminary unit aimed to provide students with an understanding of the basic terminology, concepts, classifications, and certain principles regarding measurement and assessment at the undergraduate level. The content covered introductory topics including "measurement", "assessment", "criteria", "norm-referenced and criterion-referenced assessment", "validity and reliability", "measurement error", and so on (see Table A1 in Appendix A for a list of topics matching the numbers of the test items). Items were selected from a larger item pool after analyzing three data sets obtained from three former exams administered by the researcher to his former students. In addition to ensuring content validity with a table of specifications, for the sake of construct validity and reliability, items were selected based on the criteria that they consistently yielded favorable discrimination indices and a large spectrum of difficulty levels from easy to hard, as required in the present study. However, though satisfactory on average, some items showed inconsistent item statistics across the three previous administrations (e.g., Items 2, 10, 24), most probably due to differences in administration conditions or students. Also, some items at the extremes (Items 1, 24, 25, and 26) were right at the borderline of acceptability, mainly because item discrimination suffers as an item gets too easy or too difficult. Despite these defects, these items were chosen as the best options available to serve the purpose of developing a test with a large spectrum of difficulty. Calculated as the average of the three previously piloted data sets, the minimum and maximum item difficulty (p) indices for the 26 items were 0.08 and 0.91, with a mean and standard deviation of 0.51 and 0.21, respectively. Item discrimination (d) indices ranged between 0.14 and 0.70, with a mean and standard deviation of 0.42 and 0.14, respectively. Finally, adjusted item-total correlation (r) coefficients ranged between 0.17 and 0.50, with a mean and standard deviation of 0.32 and 0.09, respectively. In Form A, items were ordered from easy to hard (easy first), and in Form B vice versa (hard first). The test spanned three A4 pages, with 9, 9, and 8 items per page, respectively.

3.4. Experimental procedure

Tests were administered as a quiz in paper-and-pencil format during the 5th to 7th weeks of the course, separately in 18 groups. While the first student was given either the easy first form or the hard first form at random, the remaining forms were distributed systematically, i.e. one type of form after the other. Thus, it was assumed that the easy first and hard first forms were randomly administered to two equivalent groups: Form A (easy first) = 276 and Form B (hard first) = 278. Since the test was not a speeded one, the students were given 35 min to complete the exam. This time proved to be long enough for every student to read and answer all the questions, since items were rarely left unanswered towards the end of the test. Examinees in each class were monitored by at least one attendant during the administration. It should be borne in mind that, although there were two forms ordered reversely according to item difficulty, students were free to jump forward or backward in the test (possibly to find the easy items), because it is not possible to force or control all students in this respect in a paper-and-pencil exam (unlike an online or computer-based test). This freedom to jump forward or backward in the test is considered a natural part of the experimental procedure, since this study did not test the effect of forcing students to answer easy or hard questions first, but of placing the items in such an order in a paper-and-pencil exam.
3.5. Data analysis

To answer the research questions, test scores (on a 0–100 metric) and the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) statistics were calculated using the Test Analysis Program (TAP, Version 14.7.4) (Brooks & Johanson, 2003) for each group taking either the easy first or the hard first test form. All scoring, difficulty, and discrimination estimations were done using the CTT model. Next, data were analyzed using two-way ANOVA and paired samples t tests, with a significance level of p < 0.05. Two-way ANOVA was adopted to test the interaction between test type (i.e. easy first and hard first forms) and ability group [i.e. low achievers (bottom 27%), moderate achievers (central group), and high achievers (top 27%)]. Since this study investigated how difficulty-based item order affected the performances of students at different ability levels, there was a need to define three ability groups, and the classical cut-off point of 27% was an ideal solution. Introduced by Kelley (1939), the 27% cut-off yields a sensitive and stable index of discrimination when defining the upper (most successful) and lower (least successful) groups of respondents in the case of a normal distribution (Bandalos, 2018; Crocker & Algina, 2008). Before conducting the factorial ANOVA, the assumptions of normality of data and homogeneity of variances were checked. Regarding normality, skewness and kurtosis statistics across all groups were found to be between ±1, indicating acceptable deviation from normality. Also, the insignificant Levene's test result (p > .05) suggested that error variances were equal across groups. The differences between the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) coefficients for the 26 twin items in the two forms were compared using paired samples t tests. Before conducting the tests, the assumption of normal distribution of the differences between scores was checked using the Shapiro-Wilk test, which yielded insignificant results indicating normal distribution.
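For readers who wish to verify or reproduce the item-level indices outside TAP, the following is a minimal sketch of how the CTT statistics used throughout this study (difficulty p, upper–lower discrimination d based on 27% groups, point biserial r, and adjusted point biserial adj. r) are commonly computed from a 0/1-scored response matrix. The function and variable names are illustrative assumptions, not part of TAP or of the original analysis, and TAP's exact definitions may differ in minor details.

```python
import numpy as np

def ctt_item_statistics(responses: np.ndarray, cut: float = 0.27):
    """Compute CTT item statistics from a 0/1-scored matrix
    (rows = examinees, columns = items). Returns difficulty (p),
    upper-lower 27% discrimination (d), point biserial (r), and
    adjusted point biserial (adj_r, item removed from the total)."""
    n_examinees, n_items = responses.shape
    total = responses.sum(axis=1)

    # Item difficulty: proportion of correct answers per item.
    p = responses.mean(axis=0)

    # Upper and lower 27% groups based on total score (Kelley, 1939).
    n_group = int(round(cut * n_examinees))
    order = np.argsort(total)
    lower, upper = order[:n_group], order[-n_group:]
    d = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)

    # Point biserial: Pearson correlation of each (dichotomous) item with
    # the total score; the adjusted version removes the item from the total.
    r = np.empty(n_items)
    adj_r = np.empty(n_items)
    for j in range(n_items):
        r[j] = np.corrcoef(responses[:, j], total)[0, 1]
        adj_r[j] = np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    return p, d, r, adj_r

# Illustration with simulated data (554 examinees, 26 items of varying difficulty).
rng = np.random.default_rng(0)
sim = (rng.random((554, 26)) < np.linspace(0.9, 0.1, 26)).astype(int)
p, d, r, adj_r = ctt_item_statistics(sim)
print(p.round(2), d.round(2), r.round(2), adj_r.round(2), sep="\n")
```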
4. Results

Tables 1 and 2 below present the descriptive statistics and the results regarding the effect of the test type (i.e. easy first and hard first forms) × ability group (i.e. low, moderate, and high achievers) interaction on examinees' test performances. As seen in Table 1, test scores were calculated on a 0–100 metric representing the percentage of correct answers to the 26 items. The mean score of the 554 participants on the test was 38.66. This average score implies that the test was rather difficult. Considering the average item difficulty coefficients obtained from the three pilot studies (mean = 0.51, or 51 on a 0–100 metric), this performance was lower than expected. This decrease in average test performance can be attributed to the lack of invariance of the CTT estimates used in this study (unlike the more complex and standardized estimation methods under IRT). As DeMars (2018) suggests, when difficulty coefficients are obtained from a group of examinees with higher ability, the items can appear easier, and item difficulty and discrimination statistics may not be expected to be consistent across different groups. As a matter of fact, the students in the three pilot groups were different and had been taught by the researcher himself, while the 554 participants of this study were taught by five different instructors. Moreover, students in the pilot studies had taken the course as part of their major, while the participants of this study, attending the pedagogical formation certificate program, came from a large spectrum of programs or departments and took the course in addition to their major. This almost doubles their workload and possibly reduces their performance, since they had to study both their majors and the pedagogical formation certificate program. Thus, considering the contextual conditions affecting students' performances (e.g. motivation, workload, quality of instruction), it is not plausible to generalize these results broadly to all exams.

Table 1
Descriptive statistics on test scores of low, moderate and high achievers from easy first vs. hard first forms.

Ability group                 Forms             n     Mean    Std. Deviation
Low achievers (Bottom 27%)    Easy first        97    24.46   5.53
                              Hard first        105   25.27   5.25
                              Total             202   24.89   5.39
Moderate achievers            Easy first        98    39.44   4.29
                              Hard first        98    38.35   3.24
                              Total             196   38.89   3.83
High achievers (Top 27%)      Easy first        81    56.79   6.56
                              Hard first        75    55.59   8.88
                              Total             156   56.21   7.76
Total                         A (Easy first)    276   39.27   14.05
                              B (Hard first)    278   38.06   13.43
                              Total             554   38.66   13.74

As seen in Table 2, the results revealed a significant main effect of Ability Group on the test scores of the examinees (as expected, because ability groups were ranked low, moderate, and high according to their scores), F(2, 548) = 1321.504, p < .05, ω2 = 0.827. The Bonferroni post hoc test suggested that the high achievers' mean test score (mean = 56.21) was significantly higher than both the moderate achievers' (mean = 38.89) and the low achievers' (mean = 24.89) mean test scores, and that the moderate achievers' mean test score was significantly higher than that of the low achievers (p < .05). On the other hand, the research question "Does difficulty-based item order affect examinees' test scores?" was tested through the main effect of test type (i.e. Form) on examinees' test scores, which was insignificant, F(1, 548) = 1.032, p = .310, ω2 = .000. This indicated that, in general, the test scores of learners taking either form (easy first = 39.27 vs. hard first = 38.06) did not differ significantly. Finally, the research question "Does difficulty-based item order affect low, moderate and high achievers differently?" was tested through the interaction effect of ability group and type of form on students' test performances, which was also insignificant, F(2, 548) = 1.879, p = .154, ω2 = .003. This indicated that the low, moderate, and high achiever groups were affected similarly by either way of item ordering (i.e. easy first or hard first).
Table 2
Factorial ANOVA results regarding the main and interaction effects of ability group and forms.

Source             Type III Sum of Squares    df     Mean Square    F           p       Partial ω2    Observed Power(b)
Corrected Model    86546.418(a)               5      17309.284      530.543     .000    .827          1.000
Intercept          873109.779                 1      873109.779     26761.516   .000    .980          1.000
Ability Group      86229.654                  2      43114.827      1321.504    .000    .827          1.000
Forms              33.660                     1      33.660         1.032       .310    .000          0.174
AG*F               122.597                    2      61.299         1.879       .154    .003          0.391
Error              17878.814                  548    32.626
Total              932569.602                 554
Corrected Total    104425.231                 553

a. R Squared = .829 (Adjusted R Squared = .827).
b. Computed using alpha = .05.
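As an illustration of the analysis summarized in Table 2, the sketch below shows how a 2 (form) × 3 (ability group) between-subjects factorial ANOVA of this kind could be run in Python with statsmodels, assuming the raw data are available in long format with one row per examinee. The file name and the column names (score, form, ability) are hypothetical; the paper does not specify the software used for the ANOVA itself.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per examinee with the 0-100 test
# score, the form taken (easy_first / hard_first), and the ability group
# (low / moderate / high, defined by the 27% rule).
df = pd.read_csv("exam_scores.csv")  # columns: score, form, ability

# 2 x 3 between-subjects factorial ANOVA with the interaction term.
model = smf.ols("score ~ C(form) * C(ability)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=3)  # Type III sums of squares
print(anova_table)

# Partial eta squared per effect, a common effect-size companion to the
# omega-squared values reported in the paper (not identical to them).
ss_error = anova_table.loc["Residual", "sum_sq"]
ss_effects = anova_table["sum_sq"].drop("Residual")
print((ss_effects / (ss_effects + ss_error)).round(3))
```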
Table 3
Descriptive results on item statistics for easy first vs. hard first test forms.

Item analysis statistics for                 Item analysis statistics for
easy first item order (Form A)               hard first item order (Form B)
Item No    p     d     r     adj. r          Item No    p     d     r     adj. r
Item 01   .85   .16   .24   .15              Item 26   .85   .18   .24   .14
Item 02   .79   .31   .37   .27              Item 25   .71   .42   .41   .29
Item 03   .54   .56   .49   .37              Item 24   .63   .40   .36   .23
Item 04   .86   .17   .18   .09              Item 23   .82   .17   .22   .11
Item 05   .50   .51   .42   .30              Item 22   .51   .43   .40   .27
Item 06   .61   .41   .38   .26              Item 21   .62   .27   .28   .15
Item 07   .52   .37   .31   .17              Item 20   .44   .43   .34   .21
Item 08   .46   .23   .18   .04              Item 19   .52   .27   .22   .08
Item 09   .32   .39   .38   .26              Item 18   .29   .34   .34   .22
Item 10   .43   .61   .50   .39              Item 17   .38   .48   .40   .27
Item 11   .47   .41   .36   .23              Item 16   .50   .41   .34   .21
Item 12   .47   .22   .23   .10              Item 15   .40   .35   .31   .18
Item 13   .30   .45   .42   .31              Item 14   .31   .54   .50   .39
Item 14   .20   .30   .33   .23              Item 13   .22   .30   .30   .18
Item 15   .32   .53   .47   .36              Item 12   .31   .36   .38   .25
Item 16   .25   .34   .39   .28              Item 11   .26   .27   .28   .16
Item 17   .30   .40   .36   .25              Item 10   .27   .15   .19   .06
Item 18   .32   .41   .36   .25              Item 09   .29   .39   .35   .23
Item 19   .27   .22   .23   .11              Item 08   .26   .30   .34   .22
Item 20   .24   .21   .17   .06              Item 07   .18   .18   .18   .07
Item 21   .27   .26   .23   .11              Item 06   .24   .44   .44   .33
Item 22   .21   .25   .27   .16              Item 05   .26   .23   .23   .11
Item 23   .28   .24   .20   .08              Item 04   .23   .20   .18   .06
Item 24   .20   .13   .15   .04              Item 03   .15   .09   .14   .04
Item 25   .13   .20   .26   .17              Item 02   .17   .16   .19   .09
Item 26   .13   .13   .22   .13              Item 01   .08   .13   .23   .15

Form A: Mean Item Difficulty = .393; Mean Discrimination Index = .323; Mean Point Biserial = .312; Mean Adj. Point Biserial = .198; KR20 (Alpha) = .638
Form B: Mean Item Difficulty = .381; Mean Discrimination Index = .303; Mean Point Biserial = .300; Mean Adj. Point Biserial = .181; KR20 (Alpha) = .607

Note: Each row refers to the same item in both panels; the item numbered k in the easy first form (Form A) appears as item 27-k in the hard first form (Form B).
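Table 3 also reports KR-20 reliability coefficients for each form. As a point of reference, the following is a minimal sketch of the KR-20 formula applied to a 0/1-scored response matrix; the function name is illustrative, and the coefficients in the study were computed by TAP.

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson 20 reliability for a 0/1-scored matrix
    (rows = examinees, columns = items):
    KR-20 = k/(k-1) * (1 - sum(p*q) / var(total)),
    with the sample variance of total scores used here."""
    k = responses.shape[1]
    p = responses.mean(axis=0)   # item difficulties
    q = 1.0 - p
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)
```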
Table 4
Comparison of the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) item-total correlations of the same items in reversely ordered forms.

Pairs                Mean    N     Std. Deviation    Std. Error Mean    t        df    p
p (Form A)           0.39    26    0.21              0.04               1.495    25    .148
p (Form B)           0.38    26    0.21              0.04
d (Form A)           0.32    26    0.14              0.03               1.051    25    .303
d (Form B)           0.30    26    0.12              0.02
r (Form A)           0.31    26    0.10              0.02               .743     25    .465
r (Form B)           0.30    26    0.09              0.02
adj. r (Form A)      0.20    26    0.10              0.02               1.049    25    .304
adj. r (Form B)      0.18    26    0.09              0.02
Table 5
Paired samples correlations of item statistics for easy first and hard first test forms.

Pairs                  N     r      p
p (Form A & B)         26    .977   .000
d (Form A & B)         26    .714   .000
r (Form A & B)         26    .662   .000
adj. r (Form A & B)    26    .600   .001
The analysis above, which compared test scores, essentially tested the hypothesis that the total test becomes more difficult if items are sequenced from difficult to easy. However, since the comparison was based on average scores, its results cannot inform us about how reversing the item sequence affects individual items' difficulty and their capacity to discriminate between high- and low-knowledge students, i.e. the construct validity. Thus, to answer the research question "Does difficulty-based item order affect item statistics?", the next analysis examined the effect of reversing the order of items on the item statistics, including the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) coefficients. This is because the difficulty of individual items in each form might have increased or decreased randomly, thus resulting in an insignificant difference in each group. Comparing the item statistics therefore allows us to test whether the characteristics of each item (either difficulty or discrimination) are preserved when its position in the test is changed. In other words, the aim was to explore whether placing a difficult item at the beginning makes it more difficult and less discriminative, or whether placing a difficult item towards the end of a test makes it less difficult and more discriminative. Table 3 shows the descriptive results of the abovementioned statistics for the twin test forms with reversed items.

As seen in Table 3, the reversed item order yielded rather similar item statistics for most items. While the mean difficulty coefficient for items in the easy first form was .393, it was .381 for the hard first form. Likewise, the mean item discrimination indices were .323 and .303 for the easy first and hard first forms, respectively. The mean point biserial item-total correlation coefficients were .312 and .300 for the easy first and hard first forms, respectively. Lastly, the mean adjusted point biserial item-total correlation coefficients were .198 and .181 for the easy first and hard first forms, respectively.

When individual parameters are compared, it is also evident that the discrimination statistics of some items (e.g. the 6th, 17th, and 21st items in the easy first form) differed considerably in their reversed twin form (the 21st, 10th, and 6th items in the hard first form, respectively). For example, the 6th item in the easy first form (the 21st item in the hard first form) individually shows that, when asked earlier, this relatively easy (p = .61) item becomes more discriminative (d = .41) than when it is asked later (d = .27), while yielding almost the same level of difficulty (p = .62). Much to our surprise, the 21st item in the easy first form (the 6th item in the hard first form) individually suggests that, when asked later, this relatively difficult (p = .27) item becomes less discriminative (d = .26) than when it is asked earlier (d = .44), again yielding almost the same level of difficulty (p = .24). While these individual examples suggest that asking both easy and difficult items earlier ensures better discrimination, this is not the case for the rest of the test in general, since the paired samples t test results (see Table 4) indicated a lack of association between an item's location in the test and its discriminative power. Another vivid inconsistency was the apparent change in the rank of some items (e.g. the 3rd, 5th, and 14th items in the easy first form) across the difficulty continuum. The difficulty statistics for these items decreased (i.e. they became more difficult) in this study compared to the three previous pilot exams. In other words, participants' attainment of the learning outcomes tested by these items was lower. Although the reliability coefficients for scores obtained from the two test forms were similar (KR20 = .638 and KR20 = .607 for the easy first and hard first test forms, respectively), they were considerably lower than those obtained in the three previous pilot administrations (see Table A1 in Appendix A).

As explained above, paired samples t tests were used to further test statistically whether the item statistics (p, d, r, and adj. r) of the same items as a whole differ significantly when they are ordered reversely in terms of difficulty. The results of the paired samples t tests are given in Tables 4 and 5 above. The results (see Table 4) showed that there were no statistically significant differences between the difficulty indices (t(25) = 1.495, p > .05; r = .977, p < .05), discrimination indices (t(25) = 1.051, p > .05; r = .714, p < .05), point biserial (t(25) = 0.743, p > .05; r = .662, p < .05), and adjusted point biserial item-total correlations (t(25) = 1.049, p > .05; r = .600, p < .05) of the 26 paired items in the easy first and hard first test forms, with moderate-to-high significant correlations between the forms (see Table 5). Taken as a whole, the results suggested that, in the present context of senior university students taking the Measurement and Assessment test, the order of the items, either from easy to hard or from hard to easy, does not
matter in practice in terms of students' test scores and item statistics. In other words, it can be said that placing difficult items at the beginning or towards the end of a test did not affect the test's level of difficulty or discrimination in the present context.
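The paired comparisons in Tables 4 and 5 can be reproduced with standard paired-samples procedures. Below is an illustrative sketch using scipy, with the 26 item difficulty values of each form taken row by row from Table 3 (each row pairs the same item across the two forms). Because Table 3 reports values rounded to two decimals, the output will approximate rather than exactly reproduce Table 4; the comparisons for d, r, and adj. r would follow the same pattern.

```python
import numpy as np
from scipy import stats

# Item difficulty (p) of the 26 items, paired by item across the two forms
# (values transcribed from Table 3; Form B listed in the same row order).
p_form_a = np.array([.85, .79, .54, .86, .50, .61, .52, .46, .32, .43, .47, .47, .30,
                     .20, .32, .25, .30, .32, .27, .24, .27, .21, .28, .20, .13, .13])
p_form_b = np.array([.85, .71, .63, .82, .51, .62, .44, .52, .29, .38, .50, .40, .31,
                     .22, .31, .26, .27, .29, .26, .18, .24, .26, .23, .15, .17, .08])

# Paired-samples t test (items paired across forms) and Pearson correlation
# between the two sets of difficulty estimates (cf. Tables 4 and 5).
t_stat, p_value = stats.ttest_rel(p_form_a, p_form_b)
corr, corr_p = stats.pearsonr(p_form_a, p_form_b)
print(f"t(25) = {t_stat:.3f}, p = {p_value:.3f}")
print(f"r = {corr:.3f}, p = {corr_p:.3f}")
```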
5. Discussion

The present empirical study investigated the effect of difficulty-based item order in multiple-choice exams on the test performances of low, moderate, and high achieving examinees, as well as on the item statistics. The present study differs from most of the previous research in that it compared the effect of easy first vs. hard first item ordering specifically across low, moderate, and high achievers. This allowed us to see how low, moderate, and high achievers react to different item arrangements according to difficulty. Also, one strength of the present study over previous ones was the large sample analyzed (n = 554), compared with rather small data sets such as the 50 used in Vander Schee (2009), 82 in Laffitte (1984), and 66 in Pettijohn and Sacco (2007). Another strength of the present study comes from the quality of the items used. The items were selected from a pool of items tested and validated across three previous exams. The mean discrimination index and adjusted item-total correlation coefficient of the items were .42 and .32, respectively. That means the difficulty levels of the items used in this study were not arranged independently of their construct validity (i.e. how well they measure a given construct or ability). As a matter of fact, an "item may be difficult because it is complex or … obscure" (Leary & Dorans, 1985, p. 410). Likewise, some items might also be easy because their answers are too evident or simple to validate the construct of the instrument. Thus, the items used in the present study are not simply easy or difficult ones selected in disregard of validity and reliability criteria.

As a result of the study, it was found that there was no statistically significant difference between easy first and hard first item arrangements in terms of their effects on examinees' test scores, either in general or across the different ability groups. In other words, the difficulty-based item order affected high, moderate, and low achievers in the same way. The results of this limited study provided evidence that, especially in a rather difficult test (in the present case, the overall test difficulty was 0.38), sequencing the multiple-choice items in either direction, from easy to hard or vice versa, did not affect the test performances of the examinees, regardless of whether they were low, moderate, or high achievers. In other words, contrary to the general suggestion that items should be sequenced from easy towards hard (Linn & Gronlund, 1995; Zimmaro, 2004), placing difficult items at the beginning of the test did not adversely affect the reliability of the test scores in the present study. These results substantially concur with the findings of some previous research which also supported that item order does not matter in terms of test performance (Brenner, 1964; Huck & Bowers, 1972; Laffitte, 1984; Newman et al., 1988; Ollennu & Etsey, 2015; Perlini et al., 1998; Plake, 1980; Weinstein & Roediger, 2012). By contrast, the findings of the present study disagree with those of some previous studies which argued that item order does matter in terms of test performance (Flaugher et al., 1968; Hambleton & Traub, 1974; Noland et al., 2015; Ollennu & Etsey, 2015; Towle & Merrill, 1975).

The item statistics, including the difficulty and discrimination indices, were further analyzed in the present study, since they might be affected by the items' positions in the test. Thus, the difficulty and discrimination indices of the reversed twin items were compared using paired samples t tests. It was assumed that comparing total scores alone, or analyzing the items individually, could yield misleading results regarding the effect of item position. For example, Leary and Dorans (1985) argued that a rather easy item located towards the end of a test can mistakenly yield a low p coefficient (i.e. appear more difficult), since many examinees who could have correctly answered the item run out of time. Similarly, the item-total correlation coefficient of an easy item can be higher when it appears towards the end of a speed-based test, since the less able examinees cannot reach the item, while the more able students reach it and respond correctly (Leary & Dorans, 1985). However, the results of the present study did not support this expectation, since there were no statistically significant differences between the item statistics, including the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) coefficients. This may be because the test administered in the present study was not based on speeded performance and the given time (35 min for 26 items) was long enough even for poor examinees taking the hard first test form to complete the whole test.
6. Conclusions and limitations

Though research on educational and psychological testing has long suggested that the location of items across a test can matter (Hohensinn & Baghaei, 2017; Munz & Smouse, 1968; Newman et al., 1988), the results of this specific study suggested, for this limited context (university students and a "measurement and assessment" test), that neither easy first nor hard first item order matters in practice in terms of students' test scores, regardless of examinees' achievement levels, or in terms of item statistics. Thus, it can be inferred that shuffling multiple-choice items in either direction as a precaution against student cheating (a more evident source of random error) may not necessarily introduce random error due to a possible early encounter with a hard item. This is an important finding, not only because the fairness of exams is important for validity and reliability purposes, but also because "if an exam is fair then students are more prone to study the subject rather than studying the exam" (McCoubrie, 2004, p. 710). However, it should be borne in mind that these results come from a test with 26 items on an introductory unit of an undergraduate course, with a large spectrum of difficulty indices. It should be further investigated whether the same results hold when the examinees are younger, when the topics are deeper or more specific, or when there are more items in the test. Moreover, in the present study all estimations were done using the CTT model. Considering that item parameters in CTT are sample dependent and are not generally comparable across different samples of examinees, using the more complicated and standardized models of IRT would be a better method, since they "allow for item parameter estimates from different groups to be transformed so that they are on the same scale" (Bandalos, 2018, p. 440). Lastly, unlike in online or computer-based tests, in paper-and-pencil tests students cannot practically be constrained to answer the questions in a certain order and are free to jump forward or backward in the test to find the easy items. Thus, the question "Does difficulty-based item order matter in multiple-choice exams?" can be tested in its fullest sense only when examinees are required to answer the items from easy to hard or vice versa. This was another important limitation of the present study, because the researcher did not have access to sufficient computer facilities to constrain all students in this way.
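To illustrate the kind of rescaling the Bandalos (2018) quotation refers to, the following is a minimal sketch of mean-sigma linking, one common method for placing 2PL item parameter estimates from two separately calibrated groups onto a common scale. It is offered only as an illustration of the general idea; the parameter values are hypothetical, and no IRT calibration was performed in the present study.

```python
import numpy as np

def mean_sigma_link(b_ref, b_new, a_new):
    """Mean-sigma linking: rescale difficulty (b) and discrimination (a)
    estimates from a new calibration onto the reference scale, using the
    means and standard deviations of the difficulty estimates of common
    items. Returns (A, B, b_linked, a_linked), where b* = A*b + B and
    a* = a / A."""
    b_ref, b_new, a_new = map(np.asarray, (b_ref, b_new, a_new))
    A = b_ref.std(ddof=1) / b_new.std(ddof=1)   # slope of the linear transformation
    B = b_ref.mean() - A * b_new.mean()         # intercept
    return A, B, A * b_new + B, a_new / A

# Hypothetical 2PL estimates of five common items from two examinee groups.
b_group1 = [-1.2, -0.4, 0.1, 0.8, 1.5]   # reference calibration
b_group2 = [-1.0, -0.2, 0.4, 1.1, 1.9]   # new calibration, different scale
a_group2 = [0.9, 1.1, 1.3, 0.8, 1.0]

A, B, b_linked, a_linked = mean_sigma_link(b_group1, b_group2, a_group2)
print(A, B, b_linked.round(2), a_linked.round(2))
```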
Appendix A

Table A1
Item and test statistics of the items used in the study instrument obtained across three previous test administrations.

                                               Test#1 (n = 189;       Test#2 (n = 185;       Test#3 (n = 138;
                                               KR20 = .788)           KR20 = .794)           KR20 = .721)
Items (Topic)                                  p     d     r          p     d     r          p     d     r          Mean p   Mean d   Mean r
Item 1 (Authentic evaluation)                  .90   .32   .16        .88   .22   .19        .95   .22   .15        .91      .25      .17
Item 2 (Indirect measurement)                  .87   .38   .33        .81   .16   .13        .89   .16   .14        .86      .23      .20
Item 3 (Natural unit)                          .89   .29   .27        .75   .58   .43        .89   .21   .26        .84      .36      .32
Item 4 (Error)                                 .81   .29   .23        .87   .30   .20        .84   .15   .15        .84      .25      .19
Item 5 (Measurement rule)                      .64   .51   .34        .71   .36   .32        .79   .27   .28        .71      .38      .31
Item 6 (Validity)                              .63   .55   .33        .65   .16   .14        .64   .30   .20        .64      .34      .22
Item 7 (Accuracy)                              .63   .22   .15        .65   .34   .19        .63   .33   .27        .64      .30      .20
Item 8 (Diagnostic assessment)                 .57   .38   .23        .59   .61   .43        .55   .39   .25        .57      .46      .30
Item 9 (Types of scales)                       .55   .63   .40        .46   .40   .27        .68   .48   .33        .56      .50      .33
Item 10 (Types of measurement)                 .47   .51   .39        .49   .69   .52        .66   .35   .22        .54      .52      .38
Item 11 (Measurement unit)                     .52   .55   .35        .58   .46   .38        .47   .15   .26        .52      .39      .33
Item 12 (Measurement and evaluation)           .42   .35   .21        .54   .19   .18        .58   .39   .30        .51      .31      .23
Item 13 (Types of zero)                        .51   .59   .32        .45   .83   .65        .53   .68   .53        .50      .70      .50
Item 14 (Criterion-referenced assessment)      .46   .72   .47        .48   .54   .33        .50   .46   .44        .48      .58      .41
Item 15 (Measurement and evaluation)           .44   .47   .38        .47   .62   .47        .50   .40   .19        .47      .50      .35
Item 16 (Norm-referenced assessment)           .45   .64   .38        .44   .64   .46        .50   .77   .56        .46      .68      .47
Item 17 (Systematical error)                   .46   .31   .23        .42   .45   .32        .45   .44   .27        .44      .40      .27
Item 18 (Indirect measurement)                 .44   .43   .26        .38   .55   .44        .42   .34   .28        .41      .44      .33
Item 19 (Measurement)                          .37   .59   .44        .40   .45   .36        .47   .41   .25        .41      .48      .35
Item 20 (Measure)                              .33   .49   .36        .36   .51   .38        .50   .70   .45        .40      .57      .40
Item 21 (Random error)                         .39   .55   .38        .39   .59   .45        .32   .52   .45        .37      .55      .43
Item 22 (Reliability)                          .24   .43   .40        .36   .61   .46        .42   .54   .39        .34      .53      .42
Item 23 (Internal consistency)                 .27   .48   .38        .28   .25   .36        .37   .68   .54        .31      .47      .43
Item 24 (Objectivity)                          .27   .32   .18        .28   .41   .27        .08   .17   .18        .21      .30      .21
Item 25 (Measurement and evaluation)           .08   .16   .37        .15   .12   .21        .12   .14   .20        .12      .14      .26
Item 26 (Test-retest reliability)              .09   .24   .37        .11   .21   .17        .05   .20   .41        .08      .22      .32
References

Aamodt, M., & McShane, T. (1992). A meta-analytic investigation of the effect of various test item characteristics on test scores. Public Personnel Management, 21(2), 151–160.
Balch, W. R. (1989). Item order affects performance on multiple-choice exams. Teaching of Psychology, 16(2), 75–77.
Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. New York: Guilford Publications.
Brenner, M. H. (1964). Test difficulty, reliability, and discrimination as functions of item difficulty order. The Journal of Applied Psychology, 48(2), 98–100.
Brooks, G. P., & Johanson, G. A. (2003). TAP: Test Analysis Program. Applied Psychological Measurement, 27(4), 303–304.
Carlson, J. L., & Ostrosky, A. L. (1992). Item sequence and student performance on multiple-choice exams: Further evidence. The Journal of Economic Education, 23(3), 232–235.
Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. Mason, Ohio: Cengage Learning.
DeMars, C. E. (2018). Item response theory. In D. L. Bandalos (Ed.), Measurement theory and applications for the social sciences (pp. 403–441). New York: Guilford Publications.
Flaugher, R. L., Melton, R. S., & Myers, C. T. (1968). Item rearrangement under typical test conditions. Educational and Psychological Measurement, 28(3), 813–824.
Gray, K. E. (2004). The effect of question order on student responses to multiple choice physics questions (Unpublished master's thesis). Kansas: Kansas State University.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–333.
Hambleton, R. K., & Traub, R. E. (1974). The effects of item order on test performance and stress. Journal of Experimental Education, 43(1), 40–46.
Hohensinn, C., & Baghaei, P. (2017). Does the position of response options in multiple-choice tests matter? Psicológica, 38(1), 93–109.
Huck, S. W., & Bowers, N. D. (1972). Item difficulty level and sequence effects in multiple-choice achievement tests. Journal of Educational Measurement, 9(2), 105–111.
Kagundu, P., & Ross, G. (2015). The impact of question order on multiple choice exams on student performance in an unconventional introductory economics course. Journal for Economic Educators, 15(1), 19–36.
Kelley, T. L. (1939). Selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 30, 17–24.
Laffitte, R. G., Jr. (1984). Effects of item order on achievement test scores and students' perception of test difficulty. Teaching of Psychology, 11(4), 212–214.
Leary, L. F., & Dorans, N. J. (1985). Implications for altering the context in which test items appear: A historical perspective on an immediate concern. Review of Educational Research, 55(3), 387–413.
Linn, R. L., & Gronlund, N. E. (1995). Measurement and assessment in teaching (7th ed.). Ohio: Prentice Hall.
McCoubrie, P. (2004). Improving the fairness of multiple-choice questions: A literature review. Medical Teacher, 26(8), 709–712.
Moses, T., Yang, W. L., & Wilson, C. (2007). Using kernel equating to assess item order effects on test scores. Journal of Educational Measurement, 44(2), 157–178.
Munz, D. C., & Smouse, A. D. (1968). Interaction effects of item-difficulty sequence and achievement-anxiety reaction on academic performance. Journal of Educational Psychology, 59(5), 370–374.
Nath, L., & Lovaglia, M. (2009). Cheating on multiple-choice exams: Monitoring, assessment, and an optional assignment. College Teaching, 57(1), 3–8.
Neely, D. L., Springston, F. J., & McCann, S. J. (1994). Does item order affect performance on multiple-choice exams? Teaching of Psychology, 21(1), 44–45.
Newman, D. L., Kundert, D. K., Lane, D. S., & Bull, K. S. (1988). Effect of varying item order on multiple-choice test scores: Importance of statistical and cognitive difficulty. Applied Measurement in Education, 1(1), 89–97.
Noland, T. G., Hardin, J. R., & Madden, E. K. (2015). Does question order matter? An analysis of multiple choice question order on accounting exams. Journal of Advancements in Business Education, 2, 1–15.
Ollennu, S. N. N., & Etsey, Y. K. A. (2015). The impact of item position in multiple-choice test on student performance at the Basic Education Certificate Examination (BECE) level. Universal Journal of Educational Research, 3(10), 718–723.
Perlini, A. H., Lind, D. L., & Zumbo, B. D. (1998). Context effects on examinations: The effects of time, item order and item difficulty. Canadian Psychology, 39(4), 299–307.
Pettijohn, T., & Sacco, M. (2007). Multiple-choice exam question order influences on student performance, completion time, and perceptions. Journal of Instructional Psychology, 34(3), 142–149.
Plake, B. S. (1980). Item arrangement and knowledge of arrangement on test scores. Journal of Experimental Education, 49(1), 56–58.
Plake, B. S., Ansorge, C. J., Parker, C. S., & Lowry, S. R. (1982). Effects of item arrangement, knowledge of arrangement, test anxiety and sex on test performance. Journal of Educational Measurement, 19(1), 49–57.
Tan, Ş. (2001). Measures against cheating in exams. Education and Science, 122(26), 32–40.
Tan, Ş. (2009). The effect of question order on item difficulty and discrimination. e-Journal of New World Sciences Academy: Educational Sciences, 4(2), 486–493.
Taub, A. J., & Bell, E. B. (1975). A bias in scores on multiple-form exams. The Journal of Economic Education, 7(1), 58–59.
Towle, N. J., & Merrill, P. F. (1975). Effects of anxiety type and item-difficulty sequencing on mathematics test performance. Journal of Educational Measurement, 12(4), 241–249.
Vander Schee, B. A. (2009). Test item order, academic achievement and student performance on principles of marketing examinations. Journal for Advancement of Marketing Education, 14(1), 23–29.
Weinstein, Y., & Roediger, H. L. (2012). The effect of question order on evaluations of test performance: How does the bias evolve? Memory & Cognition, 40(5), 727–735.
Zimmaro, D. M. (2004). Writing good multiple-choice exams. Measurement and Evaluation Center, University of Texas at Austin. Retrieved from http://www6.cityu.edu.hk/edge/workshop/seminarseries/2010-11/Seminar03WritingGoodMultipleChoiceExams.pdf