Effects of linguistic complexity and accommodations on estimates of ability for students with learning disabilities

Journal of School Psychology 50 (2012) 293–316

Stephanie W. Cawthon ⁎, Alyssa D. Kaye, L. Leland Lockhart, S. Natasha Beretvas
The University of Texas at Austin, USA


Article history: Received 16 August 2010; received in revised form 5 January 2012; accepted 11 January 2012.
Keywords: Learning disabilities; Accommodations; Linguistic complexity

Abstract: Many students with learning disabilities (SLD) participate in standardized assessments using test accommodations such as extended time, having the test items read aloud, or taking the test in a separate setting. Yet there are also aspects of the test items themselves, particularly the language demand, which may contribute to the effects of test accommodations. This study entailed an analysis of linguistic complexity (LC) and accommodation use for SLD in grade four on 2005 National Assessment of Educational Progress (NAEP) reading and mathematics items. The purpose of this study was to investigate (a) the effects of test item LC on reading and mathematics item difficulties for SLD; (b) the impact of accommodations (presentation, response, setting, or timing) on estimates of student ability, after controlling for LC effects; and (c) the impact of differential facet functioning (DFF), a person-by-item-descriptor interaction, on estimates of student ability, after controlling for LC and accommodations' effects. For both reading and mathematics, the higher an item's LC, the more difficult it was for SLD. After controlling for differences due to accommodations, LC was not a significant predictor of mathematics items' difficulties, but it remained a significant predictor for reading items. There was no effect of accommodations on mathematics item performance, but for reading items, students who received presentation and setting accommodations scored lower than those who did not. No significant LC-by-accommodation interactions were found for either subject area, indicating that the effect of LC did not depend on the type of accommodation received. © 2012 Published by Elsevier Ltd. on behalf of Society for the Study of School Psychology.

⁎ Corresponding author at: Department of Educational Psychology, The University of Texas at Austin, 1 University Station, MC D5800, Austin, TX 78712, USA. Tel.: +1 512 471 0287; fax: +1 512 475 7641. E-mail address: [email protected] (S.W. Cawthon).
Action Editor: Andrew Roach.
doi:10.1016/j.jsp.2012.01.002


1. Introduction

Students with a learning disability (SLD) comprise the largest group of students with disabilities in the United States (IDEA Child Count, U.S. Department of Education, 2008). In fact, SLD now make up over half of students receiving special education in K-12 settings (Fletcher, Lyon, Fuchs, & Barnes, 2007). One goal of current accountability reform is to ensure that schools, districts, and states work together to close achievement gaps among all students, including the gap between students with learning disabilities and their nondisabled peers (e.g., No Child Left Behind Act of 2001, 2002).

Students with learning disabilities face unique challenges in demonstrating their knowledge and skill on standardized assessments. Experts in the field have long cautioned that the text-based format of standardized assessments, often via paper and pencil, may result in inaccurate representations of what students know (McDonnell, McLaughlin, & Morison, 1997). For example, a student who has a reading disability may struggle to answer questions on a mathematics problem-solving task. The interpretation of the resultant test item score may be invalid because it confounds whether the performance represents the student's reading skill or mathematics problem solving ability (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [NCME], 1999). 1 Furthermore, overall scores on standardized assessments may measure different skills for students with disabilities than for their peers without a disability (Abedi & Lord, 2001).

Previous research in the field has investigated the impact of factors such as student disability status, student English language proficiency, content area, linguistic complexity (LC) of test items, and accommodation use. With a unique focus on SLD, the purpose of this project was to evaluate the effects of accommodations and test item LC on item scores on the 4th grade National Assessment of Educational Progress (NAEP). This work builds upon research that has investigated factors that influence NAEP item scores for children with disabilities as a whole and English Language Learner (ELL) students more specifically (e.g., Abedi, Leon, & Kao, 2008a, 2008b; Middleton & Laitusis, 2007; Stone, 2009). This previous research found that high levels of LC, or language demand, can serve as a barrier for students with disabilities and ELL students on standardized assessments, including the NAEP. Yet this work has not yet been done for SLD, students who may face similar challenges in participating in standardized assessments. The purpose of this article was thus twofold: first, to investigate the potential effect of LC on item functioning for SLD and, second, to understand potential interactions of accommodation use and LC on item difficulty for this student population.

A student with a learning disability is characterized as one who achieves substantially below expectations on reading, mathematics, or written expression. These expectations can be based on a range of factors, including the student's age, exposure to quality instruction, and level of intelligence. According to the Individuals with Disabilities Education Act (IDEA, 1997, 2004), a learning disability is a disorder of one or more of the psychological processes involved in understanding or using language that manifests itself in a deficient ability to listen, speak, read, write, spell, or do mathematical calculations.
This disorder includes conditions such as perceptual disabilities, brain injury or minimal brain dysfunction, dyslexia, and developmental aphasia (IDEA, 1997, 2004). Within the context of school, particularly reading and mathematics tasks, the eight areas of struggle that can lead to a diagnosis are reading comprehension, reading fluency, basic reading skills, written expression, mathematics calculation, mathematics problem solving, listening comprehension, and oral expression (IDEA, 1997, 2004).

1 Central to the discussion of fair and appropriate accommodation use is the issue of test score validity. A valid interpretation of an accommodated score is one where the accommodation allowed students to access an assessment without changing the construct being assessed. Validity here refers to the interpretation of the score, because validity comes into play in how the score is used and in what it is assumed to represent in terms of student proficiency. However, the term validity has been used in multiple ways in the research literature, muddying the discussion of this construct. In this paper, an accommodated score will be described in terms of its accuracy, whereas an accommodation will be described in terms of its degree of effect and fairness, to keep the distinction from validity clear. A fair accommodation must thus in some way "speak to the nature of the disability," addressing the barriers created by the interaction between the student's disability and the test item format (Fuchs, Fuchs, & Capizzi, 2005, p. 5). A valid interpretation of the accommodated score must therefore account for both the characteristics of the test and the test taker (Abedi et al., 2008a, 2008b; Middleton & Laitusis, 2007; Stone, 2009).


1.1. Assessment accommodations

Specific learning disabilities often impact how students perform on school assessments. With the increased emphasis on assessment results as an indicator of educational quality, the question of how to appropriately assess the performance of students with learning disabilities takes on new significance. IDEA, along with guidelines from Section 504 of the Rehabilitation Act of 1973, requires that schools accommodate students with disabilities to allow them to access an appropriate education. In current practice, accommodations are frequently given to allow students with disabilities to access test material and meaningfully participate in high-stakes assessment (Bolt & Thurlow, 2004). Under IDEA, any student with an Individualized Education Program (IEP) plan must be allowed to receive accommodations specified in classroom instruction, on classroom assessments, and on standardized assessments. Assessment accommodations involve changes to test presentation, timing, setting, or response format intended to allow students with disabilities access to the content of the test (Bolt & Thurlow, 2004). Accommodations are thought to "even the playing field" for students with disabilities so that they have access to the same test content as students without disabilities.

Regulations about which accommodations are allowed on an assessment can depend, in part, on the high-stakes context of the assessment. An assessment is high-stakes if it is used to make decisions about student progress, such as whether the score is to be used for high school graduation or college admissions, or to evaluate school effectiveness under the No Child Left Behind Act of 2001 (NCLB) accountability reforms. In a system where scores carry such weight, the effect of accommodations on the accuracy of test scores, including the potential to over- or underestimate student proficiency, requires special consideration.

Research on the effects of accommodations on student scores indicates a great deal of variability in results that depend on multiple factors such as type of accommodation, the student's characteristics (e.g., proficiency in the subject area or type of disability), the language demand of the test item, and the subject of the test (Cawthon, Ho, Patel, Potvin, & Trundt, 2009). For example, there is some evidence that extended time, a commonly used timing accommodation that involves additional time ranging from time and a half to double time, can benefit SLD (Sireci, Scarpati, & Li, 2005; Stretch & Osborne, 2005; Zuriff, 2000). The purpose of an extended time accommodation is to reduce the barriers of an item (or items) for students who have the targeted academic skill but who struggle to express it, or who have trouble accessing the material due to reading problems, and who thus need more time to decode the written question or to work through calculations. Consistent with this purpose, there appears to be a greater effect of extended time on test content that involves high levels of language demand, a key construct in this study and a particular area of difficulty for SLD (Crawford, Helwig, & Tindal, 2004; Fuchs, Fuchs, Eaton, Hamlett, & Karns, 2000; Munger & Lloyd, 1991; Runyan & Smith, 1991). Extended time is often combined with a setting accommodation, such as taking a test in a separate room or at a study carrel, to reduce the distractions involved in being in a room full of other test takers.
In and of themselves, setting accommodations rarely change the nature of the test format, but they are often given in conjunction with other accommodations that might do so. For example, if a student has a presentation accommodation such as having the items presented in American Sign Language, or a response accommodation such as using a computer instead of a pencil-and-paper format, this accommodation would likely occur in a separate room from the standard test administration. It is thus important to be aware of both individual accommodations and packages of accommodations when evaluating the impact of accommodations on test scores.

The most prevalent presentation accommodation, and one of the most controversial, is when the test administrator reads the test items out loud to the students. The basic premise of this accommodation is that reading the test item removes the difficulty of decoding written text from the process of understanding and responding to the test item. However, student factors and test factors, features of the assessment situation beyond the read aloud accommodation itself, may be relevant in predicting its effects (Cawthon et al., 2009). For example, the read aloud accommodation appears to improve student scores when used in mathematics assessments, particularly for students who have low levels of reading proficiency and for test items that have high levels of linguistic demand (e.g., Bolt & Ysseldyke, 2006; Elbaum, 2007; Johnson, 2000; Meloy, Deville, & Frisbie, 2002). For reading assessments, there appears to be a benefit to all students, thus calling into question whether the read aloud accommodation increases access to test content by reducing a barrier that is faced by students with disabilities or by making the test easier for all students


because decoding the text is part of the content area being assessed. In short, the subject of the assessment matters when implementing a read aloud accommodation.

In response to concerns about accommodated test scores and the resultant interpretation of student proficiency, educational researchers and test developers look at factors that differentially affect the test scores of students with disabilities. A significant proportion of this research has focused on administrations of the NAEP, particularly in recent years when accommodated test scores have been included in reports of student performance (Zenisky & Sireci, 2007). Previous research has investigated the impact of factors such as student disability status, student English language proficiency, content area, LC of test items, and accommodation use. This study draws upon this research and proposes a potential interaction between test accommodation and item characteristics in their effect on item scores for SLD.

This study focused on four kinds of accommodations used in NAEP: setting, timing, presentation, and response accommodations. A summary of the NAEP accommodations for this project is provided in Table 1. Decisions about who receives which accommodations are guided by the NAEP framework but made at the local level as determined by school officials. Different kinds of accommodations are likely to have differential effects on item functioning or, in turn, have their effects differentially moderated by external variables (in this study, LC). We focused on these four categories of accommodation formats because they are either widely used in practice or currently have mixed results in the research literature on their effects on standardized assessment scores.

1.2. Linguistic complexity

In addition to questions about the effects of accommodations, previous research has shown that language background variables may confound assessment results, adding an extra source of measurement error and impacting both the reliability of the assessment and the validity of interpretation of accommodated test scores (Abedi, 2002, 2006). For ELL students, research has shown that unnecessarily complicated language, or LC, becomes a potential source of measurement error (Abedi, 2006) and construct-irrelevant variance (Abedi & Lord, 2001). This source of error may be exacerbated when test items draw heavily from academic language (Cognitive Academic Language Proficiency, or CALP) when a student is proficient only in the everyday language of an emerging language user (Basic Interpersonal Communication Skills, or

Table 1
Categories of NAEP accommodations in study sample.

Setting
  Small group
  Individual
  Study carrel
  Preferential seating
  Familiar person
Timing
  Extended time
  Frequent breaks
Presentation
  Read aloud or repeated*
  Assistance to understand
  Large print
  Magnifying equipment
Response
  Points or oral to scribe
  Computer (no spell or grammar check)
  Large pen
  Write in booklet

Note. *An accommodation for only Mathematics.


BICS, Cummins, 1984). Language may hinder a student from answering correctly not because the student lacks the content knowledge, but because the student may struggle to understand the language of the item itself, increasing the item's cognitive load. Cognitive load refers to the idea that all tasks impose demands on a learner's cognitive system. Independent tasks that can be processed serially require low cognitive load; tasks with multiple elements that are interrelated and must be processed simultaneously require high cognitive load (van Merriënboer & Sweller, 2005). Furthermore, there are different types of cognitive load, some that are mutable and some that are not (Paas, Renkl, & Sweller, 2003). Intrinsic cognitive load is inherent to the content itself and thought not to be alterable by either instructional or assessment design. Extraneous cognitive load, on the other hand, refers to demand that is due to the method of instruction or presentation of information. The concept of cognitive load has also been applied to models of effective assessment for students with disabilities (Elliott et al., 2010). The goal of accessible testing is to reduce the extraneous cognitive load that puts demands on working memory so as to only activate intrinsic cognitive load. LC on test items especially becomes an issue of extraneous cognitive load in content assessments (e.g., mathematics and science) where reading is not necessarily the targeted skill. Research has shown that unnecessarily complex language (e.g., uncommon vocabulary, words with multiple meanings, or complicated word clauses) may interfere with a student's ability to exhibit content knowledge on a standardized assessment and that this interference can hinge on student English language proficiency (e.g., Abedi, Hofstetter, & Lord, 2004; Abedi & Lord, 2001; Abedi et al., 2005). For example, it has been shown that ELL students have a more difficult time than their non-ELL student peers accurately answering questions with a high discourse demand, or language processing beyond the sentence level (Abedi et al., 2005). In addition to the effects of accommodations, the concept of language demand of assessment items is an important consideration when interpreting the scores of students with potential language processing challenges. The current study explored how LC and accommodations may affect SLD performance on assessments. Although these effects may not be consistent across all items, in aggregate on an assessment, the effects of LC and accommodations may form a different test for students with diverse language backgrounds. Abedi et al. (2005) found a difference in the internal consistency of test scores for ELL students versus non-ELL students; LC and accommodations may lead to similar results for SLD. Although the reasons for challenges with high LC are different for both ELL students and SLD, high LC may be a potential obstacle to optimal performance on assessment items (and particularly, reading items) for both groups. There could be similar challenges due to attentional demand and available working memory in processing complex language present in written text. For example, during assessment, an ELL student may translate the English test item back into their native language, and items with high LC slow down this process due to the higher linguistic demand. 
In comparison, a student with a learning disability may understand the English word meaning, but when reading the test item, difficulties with text comprehension slow down the reading process and interrupt the cognitive connections necessary to successfully respond to the test item. These similarities between SLD and ELL students and the impact of LC tie back to the overall concept of extraneous cognitive load and its impact on a student's ability to answer test items correctly when they have the knowledge needed to do so. Higher levels of LC raise the extraneous cognitive load of that task, compounding the challenge of text comprehension, which, depending on the item, may be intrinsic cognitive load. In either case, high LC impacts the transition from decoding text, to reading comprehension, to test item response. Although the reasons for an increased extraneous cognitive load due to high LC may vary by individual student and student group, attention to unnecessary increases in LC may benefit students across student groups.

1.3. Policy context

LC is an important component of a larger process of addressing test item accessibility in standardized assessment. Potential changes to test items, including reducing LC, are often referred to as item modifications (AERA et al., 1999). Test item modifications go beyond test accommodations in that they change the structure of the test item itself and are often controlled by the test developers and not the local IEP team (Roach, Beddow, Kurz, Kettler, & Elliott, 2010). The range of test item modifications can be limited to one aspect of the test item format, such as addressing issues of language demand (Shaftel, Belton-Kocher, Glassnap, & Poggio, 2006), or be comprehensive in nature and address format and layout issues as well


(Kettler, Elliott, & Beddow, 2009). The goal of the test modifications is similar to that of accommodations, to reduce construct-irrelevant variance in the test process by reducing extraneous cognitive load caused by test format characteristics. However, the extent of the modification process is typically more significant than for accommodations, and the results of the assessment are often not seen as measuring the same level of difficulty as an accommodated assessment (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [NCME], 1999). The current policy context for test item modifications is largely framed by the development of the Alternate Assessment based on Modified Academic Achievement Standards (AA-MAS) as well as the Alternate Assessment based on Alternate Academic Achievement Standards (AA-AAS). Both of these assessments are tied to the NCLB testing mandates and are efforts to create assessments that allow a greater range of students with disabilities to participate in accountability reforms (U.S. Department of Education [USDOE], 2007). Whereas the AA-AAS was meant for students with severe disabilities, the more recent AA-MAS is designed to meet the needs of students who are not severely disabled but are still not working close enough to grade level to participate in the regular assessment (even with accommodations). More specifically, the AA-MAS were designed to address the following situation: The assessments based on grade-level achievement standards are too difficult and, therefore, do not provide data about a student's abilities or information that would be helpful to guide instruction. The alternate assessment based on alternate academic achievement standards is too easy and is not intended to assess a student's achievement across the full range of grade-level content. Such an assessment, therefore, would not provide teachers and parents with information to help these students continue to progress toward grade-level achievement. Modified academic achievement standards, and assessments based on those standards, are intended to fill this gap and provide a more appropriate measure of these students' performance against academic content standards for the grade in which they are enrolled, as well as provide teachers and parents with information that will help guide instruction. (USDOE, 2007)

1.4. Purpose of the study

It is in this assessment policy context that test developers and researchers are interested in the construct of LC and how it affects student test scores. The link between item LC and the performance of ELL students has been well demonstrated in the research literature (Abedi, 2006). However, much less is known about the link between accommodations, item LC, and SLD. This article thus addresses two current efforts to improve the accessibility of standardized assessments for SLD: issues of LC and the use of test accommodations. Although these elements were not experimentally manipulated in this study, this article uses statistical modeling and secondary data analysis to estimate the effects of these two components of test accessibility for SLD. This study focused on an analysis of accommodations and LC as predictors of item difficulty for SLD; in other words, we investigated whether these variables predict whether a student would respond correctly to the test items. This study was designed to answer the following research questions:

1. Does the LC of the test item affect item difficulty for students with learning disabilities?
2. After controlling for LC's effect on item difficulty, do presentation, response, setting, or timing accommodations impact estimates of student ability?
3. Does the effect of LC on the probability of answering an item correctly depend on receipt of an accommodation (for each of three accommodation types)?

2. Method

2.1. NAEP data set

This project was a secondary data analysis of outcomes from the 2005 NAEP Grade 4 reading and mathematics data sets. The NAEP is a national assessment that uses stratified sampling approaches in data


collection. Sampling for the overall NAEP assessment occurred in several stages. The first stage (i.e., the primary sampling unit) was the selection of public schools within each state. (There was a separate sampling process for private schools.) Schools were selected within groups on the basis of demographic characteristics such as whether the school was in a rural or urban area, the percentage of students from diverse backgrounds, and the median income in the local area. This process led to the identification of up to 10,000 schools to participate in the NAEP. The second stage was the selection of students (in the tested grades) within each school. Approximately 60 students were selected within each school in the grades tested that year (e.g., 4th, 8th, and 12th). We focused on 4th grade scores because accommodations have been shown, in some cases, to have a greater benefit for younger students than for those in high school (Fletcher et al., 2006). If the selected student had a disability, school staff evaluated whether the student could participate in the NAEP, with or without an accommodation(s). 2 In 2005, 5.3% and 7.4% of 4th grade students received accommodations on the NAEP reading and mathematics assessment due to disability status, respectively. Finally, students were then assigned to the tested subject areas that were a part of that year's NAEP assessment.

2.2. Study sample

This study focused on SLD who participated in the NAEP with or without the accommodations listed in Table 1. All data in this study are from SLD in the NAEP assessment; students without disabilities were not eligible to use accommodations and thus were not included in this analysis. Although the NAEP does not include information about the learning disability diagnostic process, the data set is from 2005, when the discrepancy model was the primary means for identifying a learning disability. This practice is in contrast to the current emphasis on the Response to Intervention approach to identification. There may therefore be differences between the characteristics of SLD in this sample and those one might find in a more recent data set. There is also no direct information about the type of learning disability beyond a description of the types of services the student is eligible for. For example, the NAEP questionnaire asked whether the student is working on grade level, and this information was used to help determine whether or not the student should participate in the NAEP assessment (i.e., if the student was working too far below grade level, they may not have been included in this test). The student questionnaire did allow individuals to identify more than one disability category for each student, although without information about which category is considered the student's "primary" label. For example, in the reading and mathematics subsamples, 10 students were listed as having a hearing impairment, fewer than 10 with a visual impairment, 310 with a speech impairment, and fewer than 10 with mental retardation. Due to this lack of specificity as to whether the SLD eligibility was primary or secondary, we included all students with the SLD category, including those with multiple disability eligibilities. Our exclusionary criteria within the SLD category focused on additional factors that would potentially confound the interpretation of the results about LC of test items and use of accommodations in large-scale assessment for SLD.
Students who were also designated as ELL or who attended non-regular education schools were excluded from these analyses. In the NAEP sample, approximately 10% and 7% of students did not have a teacher identifier in the reading and mathematics subsamples, respectively, so they were excluded. Preliminary analyses suggested that models including the school identifier as a cross-classification failed to converge, likely because the majority (over 80%) of schools had only one teacher with an SLD student participating in the NAEP. As such, we eliminated the school cross-classifier from the analysis. In our study sample, one SLD per teacher was randomly sampled to eliminate the need to model clustering of students within classrooms for this set of analyses. There was also a significant challenge with the level of missing data and how to develop relatively equal sample sizes for the reading and mathematics subsamples. In order to obtain sufficient sizes for both subsamples, students were included if they responded to at least 1 of the reading items and at least 10 of the mathematics items. The reason for the discrepancy in missing data cutoffs was that there were considerably more missing values for reading than for mathematics, and we wanted to provide adequate sample sizes for both analyses.

2 It should be noted that the percentage of students with disabilities who participate in the NAEP may be lower than in state standardized assessments for accountability, such as for NCLB requirements. Students with disabilities participated in the NAEP if they were able to meaningfully do so using the accommodations offered by this assessment.
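The selection rules described above can be summarized as a small filtering routine. The sketch below is not part of the article's analysis; it only illustrates the logic under assumed, hypothetical column names (sld, ell, regular_school, teacher_id), since the actual NAEP restricted-use variable names differ.

```python
import pandas as pd

def select_sld_subsample(df, item_cols, min_items, seed=0):
    """Apply selection rules analogous to those described in Section 2.2 (illustrative only)."""
    out = df[df["sld"] == 1]                  # keep students with an SLD designation
    out = out[out["ell"] == 0]                # drop students also designated as ELL
    out = out[out["regular_school"] == 1]     # drop non-regular education schools
    out = out[out["teacher_id"].notna()]      # drop records with no teacher identifier
    answered = out[item_cols].notna().sum(axis=1)
    out = out[answered >= min_items]          # require a minimum number of answered items
    # retain one randomly sampled SLD per teacher so classroom clustering
    # does not need to be modeled
    return out.groupby("teacher_id", group_keys=False).sample(n=1, random_state=seed)

# e.g., at least 1 answered reading item, or at least 10 answered mathematics items:
# reading_sample = select_sld_subsample(naep, reading_item_cols, min_items=1)
# math_sample = select_sld_subsample(naep, math_item_cols, min_items=10)
```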


After applying these selection rules, the final reading and mathematics subsamples contained 2170 and 2180 SLD, respectively. (Figures are rounded to the nearest 10 per NAEP restricted data use guidelines.) Demographic characteristics of the two subsamples are provided in Table 2. Demographic information for the two SLD subsamples is accompanied by demographic information from the overall NAEP sample, which is included as a reference point on the representativeness of the study data. The proportions of students within each demographic category were similar for the two subject areas. Approximately two-thirds of the students in the study subsamples were boys, reflecting the overall higher rates of male students who are identified as having a learning disability (Altarac, 2007). The racial/ethnic background of students was approximately two-thirds White, one-fifth Black, and one-tenth Hispanic. The remainder of the student population comprised students from Asian, Pacific Islander, Native American, or Native Alaskan backgrounds. Based on comparison to the total NAEP sample, the SLD subsamples overrepresented students who were White and underrepresented students who were Hispanic. These discrepancies may be because we excluded students who were ELL from the study subsamples. Although about two-thirds of the students in the study subsamples were reported to have never spoken a language other than English in the home, nearly one-third did have some level of non-English language exposure at home. The representativeness here parallels the race/ethnicity demographic in that the study subsamples have a smaller percentage of students who have a language other than English spoken in the home (e.g., for mathematics, 39% in the overall SLD sample vs. 34% in the study SLD sample). This sample also represents students from diverse home income levels, with more than half of students receiving free or reduced lunch (for an excellent discussion of the concerns about using free or reduced lunch variables as a proxy for broader socio-economic status, see Harwell & LeBeau, 2010). In terms of comparison between the study sample and the overall demographics of SLD in the NAEP, students in the study sample were less likely to be eligible for free lunch. Finally, students were diverse as far as the type of location they lived in, with one-third from large or mid-size cities, one-third from urban fringe or large towns, and one-third from rural or small town areas.

Table 2
Demographics of students with learning disabilities in NAEP sample and study subsamples.

                                        Reading                                 Mathematics
Characteristic                          Overall NAEP      Study sample          Overall NAEP      Study sample
                                        (N = 11,990)      (N = 2170)            (N = 12,030)      (N = 2180)
Sex
  Boy                                   65%               64%                   65%               66%
  Girl                                  35%               36%                   35%               34%
Race/Ethnicity
  White                                 57%               66%                   57%               65%
  Black                                 21%               19%                   20%               20%
  Hispanic                              16%               9%                    17%               10%
  Asian/Pacific Islander                2%                2%                    2%                2%
  American Indian/Alaskan Native        3%                3%                    3%                2%
  Other                                 1%                1%                    1%                1%
Language other than English at home
  Never                                 56%               61%                   58%               63%
  Sometimes to always                   40%               35%                   39%               34%
Lunch program eligibility
  Not eligible                          38%               47%                   38%               42%
  Free lunch                            50%               40%                   51%               46%
  Reduced lunch                         9%                10%                   9%                10%
Location
  Large/mid-size city                   35%               33%                   35%               30%
  Urban fringe/large town               34%               36%                   34%               36%
  Small town/rural                      31%               31%                   31%               34%

Note. Some columns do not add to totals due to missing data or to rounding. Per NAEP restricted data guidelines, all sample sizes have been rounded to the nearest "10."


2.3. Accommodation use

All students in this study used at least one of the four categories of accommodations included in this study's analysis (see Table 1 for an overview of accommodation types). Data represent reported usage by the administering teacher or staff member; students have the option of using or not using an accommodation at the time of the assessment, thus it is not possible to know from the NAEP data set what variations may have occurred during implementation. A summary of the percentage of students reported using an accommodation, by type and by subject, is provided in Table 3. The three most prevalent accommodations were setting, presentation, and timing, with the majority of students using all three simultaneously. This outcome is not uncommon in practice, as a change in the administration of an assessment often involves changes to the location of the assessment, the way the directions are provided, or how long the student has to complete the assessment. Table 4 provides a summary of the number of students who used none, one, two, three, or all four categories of accommodations. About half of the students used accommodations in three categories for both reading and mathematics. Detailed information on the joint distributions of accommodations use is available from the corresponding author upon request.

There were some restrictions on accommodations available on the 2005 NAEP that might result in differences between those used in the NAEP and accommodations students use in other assessment contexts. Although the NAEP does not collect information about accommodations the individual student might be eligible for on other assessments, there are elements of the NAEP assessment that are different than a state assessment or a classroom end-of-unit exam. For example, students could only have test items read to them for the mathematics assessment and not for the reading assessment. In a classroom exam, it may be that some students receive read aloud accommodations for more subject areas than on a state or NAEP assessment. Also, students could not take the assessment over the course of several days because the NAEP administration happens on a one-day basis only. This scheduling constraint limited the extent to which students who needed breaks or shorter sessions could complete the NAEP assessment during the administration timeframe. Accommodation restrictions may therefore limit the generalizability of these findings to students who participate using the accommodations available in this assessment. Furthermore, although guidelines apply to all settings, decisions about accommodations are made at the local level. Categories of students with disabilities who use accommodations are therefore variable, and guidelines may be applied differently from site to site (since the time of the 2005 NAEP data used in this study, NAEP has released new guidelines addressing the uniform protocol for the inclusion of students with disabilities [National Assessment Governing Board, 2010]).

2.4. Test item content areas

2.4.1. Reading

The original sample included scores from 29 released test items in reading. Items were a mixture of multiple choice, constructed response (a.k.a., short answer), and extended constructed response (a.k.a., essay). Nine items were scored polytomously, leaving 20 dichotomously scored multiple-choice items for analysis.
Table 3
Descriptive statistics for accommodations use in reading and mathematics.

Type of accommodation    Reading %    Mathematics %
Setting                  89           90
Presentation             77           85
Response                 25           25
Timing                   85           81

Note. There were a total of 2170 students in the reading sample and 2180 students in the mathematics sample. Per NAEP restricted data guidelines, all sample sizes have been rounded to the nearest "10." Students could receive more than one accommodation.

Table 4
Combined use of accommodations in reading and mathematics.

Number of accommodation categories used    Reading, Number (%)    Mathematics, Number (%)
None                                       60 (3%)                70 (3%)
One                                        180 (8%)               130 (6%)
Two                                        410 (19%)              400 (18%)
Three                                      1110 (51%)             1110 (51%)
Four                                       410 (19%)              470 (22%)
Total                                      2170                   2180

Note. Per NAEP restricted data guidelines, all sample sizes have been rounded to the nearest "10."
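As a minimal illustration of how a tally such as Table 4 can be derived from the four accommodation category indicators, the sketch below counts the number of categories used per student. The data frame and its values are hypothetical, not NAEP data.

```python
import pandas as pd

# Hypothetical 0/1 indicators for the four accommodation categories of Table 1.
accom = pd.DataFrame({
    "setting":      [1, 1, 0, 1],
    "timing":       [1, 1, 0, 1],
    "presentation": [1, 0, 0, 1],
    "response":     [0, 0, 0, 1],
})

n_categories = accom.sum(axis=1)                   # categories used by each student
print(n_categories.value_counts().sort_index())    # tallies analogous to Table 4
```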

The 2005 NAEP Reading assessment framework focused on two main categories of reading goals: reading for literary experience and reading for information. Reading for literary experience is defined as reading a variety of text genres such as novels, short stories, poems, and plays (National Assessment Governing Board, 2008). Reading for information is defined as reading to gain information and understand the world, usually across diverse sources. It should be noted that the Reading assessment framework was revised for the 2009 assessment, and that the domains measured in future NAEP assessments will vary slightly from the 2005 test administrations.

2.4.2. Mathematics

The original sample included scores from 32 released test items in mathematics. Items were a mixture of multiple choice, constructed response, and extended constructed response. Five items were scored polytomously, leaving 27 dichotomously scored items for analysis, all in the multiple choice format. The 2005 NAEP Mathematics assessment framework focused on five content areas: number properties and operations, measurement, geometry, data analysis/probability, and algebra. In addition to content areas, the framework also focuses on the cognitive demand, or mathematical complexity, of the item task. The purpose of the mathematical complexity designation is to indicate the level of abstraction or analysis required to complete the task. For example, recall of a mathematical property is a low complexity task. In contrast, examining the relation between two properties is a more complex task than recall. The 2005 NAEP items are each coded according to content area and the level of mathematical complexity. The items in this sample spanned the range of mathematical complexity scores.

2.5. Linguistic complexity

2.5.1. Approach to assessment

In accordance with previous research (Abedi, 2006; Abedi & Hejri, 2004; Butler, Bailey, Stevens, Huang, & Lord, 2004; Ketterlin-Geller, Yovanoff, & Tindal, 2007), the LC of each test item was assessed based on three overarching features: vocabulary, syntax, and discourse. For this study, we created a tool based on the process reported in Abedi et al. (2005). The Abedi et al. (2005) approach was developed using a multi-step review of potential language demands in test items and definitions of key terms identified as linguistically complex. Inter-rater agreement for the Abedi et al. (2005) measure was calculated as the percentage of exact agreements between two coders, and values ranged from 60% to 100% depending on the subject area (reading, mathematics, or science) and the complexity component (vocabulary, syntax, and discourse, described below).

2.5.2. Coding components and criteria

We used the components of linguistic demand and the coding criteria from Abedi et al. (2005) to identify potentially challenging areas. In our coding scheme, each feature was defined and coded in the following ways for each item.

2.5.2.1. Vocabulary. We counted the number of complex vocabulary words in each text unit. Complex vocabulary items were defined as words that have multiple meanings, non-literal usage of words, and manipulation of lexical forms (Martinello, 2008). For example, if the item included the word "plane," this


word would be counted as a complex vocabulary word because it has different meanings depending on the context of the sentence.

2.5.2.2. Syntax. The syntax of the text unit was also assessed for complexity. The syntax score is a composite of a checklist of the following items: atypical parts of speech, uncommon syntactic structures, complex syntax and academic syntactic form, long nominals, conditional clauses, relative clauses, and complex questions. The presence or absence of passive voice was also included in the syntax score.

2.5.2.3. Discourse. Complex discourse was defined in this study as uncommon genre, the need for multiclausal processing, or the use of academic language (Abedi, Lord, & Plummer, 1997). Item discourse was also considered complex if students are required to synthesize information across sentences or to make clausal connections between concepts and sentences. Discourse was coded as a discrete variable based on the presence or absence of one or more of these features.

2.5.3. Coding procedure

Three raters convened to discuss and rate the NAEP test items used in this project. The raters first participated in a training session using examples from the released NAEP items. The training period ended once participants reached at least a 90% agreement rate. They then independently rated all items. Finally, they came together to discuss each item and to come to a consensus on the item rating. When all three raters could not reach consensus for a specific component sub-score, such as the vocabulary rating, the item sub-score was given the score of the two raters who had the same result and was counted as an agreement. This event occurred twice for the mathematics items and eight times for the reading items. The final inter-rater agreement values for the project reading items were 90%, 93%, and 100% for syntax, vocabulary, and discourse, respectively. Final inter-rater agreement values for mathematics items were 97%, 100%, and 100% for syntax, vocabulary, and discourse, respectively. These reliability values were, on the whole, higher than those found by Abedi et al. (2005).

2.6. Cross-classified multilevel measurement model (MMM) analysis

Rasch models for dichotomous outcomes can be parameterized as generalized hierarchical linear models (see Adams & Wilson, 1996; Adams, Wilson, & Wu, 1997; Cheong & Raudenbush, 2000; Fischer, 1995; Kamata, 2001). Although several specifications exist in the literature, the current investigation focuses on the cross-classified family of models described elsewhere (Beretvas, 2008; Goldstein, 2003; Hox, 2002; Rasbash & Browne, 2001). Unlike conventional hierarchical linear models for dichotomous outcomes, the cross-classified family of models allows for the investigation of both student-level and item-level predictors (see Beretvas, Cawthon, Lockhart, & Kaye, in press, for a comparison of conventional and cross-classified multilevel measurement models). This consideration is especially relevant in the current analysis, as accommodation presence (student-level) and LC (item-level) are thought to affect student ability estimates and item difficulties, respectively. Fig. 1 contains a network graph depicting the data structure specified by the cross-classified family of models.

In this parameterization, student scores on each item are coded zero if the student answered incorrectly and one if the student answered correctly. As seen by the placement of each dichotomously

Fig. 1. Network graph depicting the cross-classification of item scores by item (level two) and student (level two) lined-up within student.


coded outcome Y between both student (j1) and item (j2) identifiers in Fig. 1, the score for a specific person on a specific item is cross-classified by both item and person. Put differently, each item score, i, is uniquely identified by two descriptors: the student, j1, who provided the response, and the item number, j2, for which the response was provided. In Fig. 1, scores are lined up within students, resulting in crossed lines for the connections between item numbers and item scores. An equivalent conceptualization could first line scores up within items, resulting in crossed lines for the connections between students and item scores. This conceptualization is depicted in Fig. 2. These conceptualizations are mathematically equivalent, resulting in identical models termed, here, cross-classified multilevel measurement models (cross-classified MMM; see Meulders & Xie, 2004; van den Noortgate, De Boeck, & Meulders, 2003).

The fully unconditional cross-classified MMM is specified as follows at level one:

\eta_{i(j_1,j_2)} = \log\left[\frac{p_{i(j_1,j_2)}}{1 - p_{i(j_1,j_2)}}\right] = \beta_{0(j_1,j_2)}    (1)

where j1 and j2 represent person and item, respectively, and i represents the dichotomously coded score for a particular item. Using Rasbash and Browne's (2001) cross-classified random effects model notation, subscripts within a set of parentheses containing the same letter refer to two factors cross-classified at the same level (in this case, level two). At level two, the intercept is decomposed into two residuals as follows:

\beta_{0(j_1,j_2)} = u_{0j_1 0} + u_{00j_2}    (2)

where u_{0j_1 0} and u_{00j_2} represent the person and item residuals, respectively. In multilevel measurement models, fixed intercepts are omitted. This procedure allows item difficulty and student ability estimates to be obtained directly as opposed to through the use of a reference indicator (see Beretvas & Kamata, 2005, for an example). The residuals for each cross-classified factor are assumed independent with means of zero and constant variances of τ_{j_1} and τ_{j_2}, respectively. Substituting Eq. (2) into Eq. (1) results in the following combined model for the log odds of a correct response, η_{i(j_1,j_2)}:

\eta_{i(j_1,j_2)} = u_{0j_1 0} + u_{00j_2}    (3)

Solving for the probability of a correct response to a specific item, q, for a specific person, s:

p_{i(s,q)} = \frac{1}{1 + \exp\left[-\left(u_{0s0} + u_{00q}\right)\right]}    (4)
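To make the unconditional model concrete, the following sketch simulates dichotomous item scores from Eqs. (3) and (4). It is illustrative only; the residual variances and sample sizes are arbitrary assumptions, not quantities estimated in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 200, 20     # illustrative sizes, not the study's sample sizes

# Person and item residuals of the unconditional model; the standard deviations
# (1.0, 0.8) are assumed values for illustration.
u_person = rng.normal(0.0, 1.0, size=n_students)   # u_{0j_1 0}: ability
u_item = rng.normal(0.0, 0.8, size=n_items)         # u_{00j_2}: item "easiness"

eta = u_person[:, None] + u_item[None, :]            # Eq. (3): log odds of a correct response
p = 1.0 / (1.0 + np.exp(-eta))                       # Eq. (4): probability of a correct response
y = rng.binomial(1, p)                               # dichotomous scores cross-classified by student and item
print(y.shape, round(y.mean(), 3))
```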

In this model, the Rasch estimate of student ability corresponds to the person residual, u_{0s0}. Similarly, the Rasch item difficulty parameter for item q corresponds to the cross-classified MMM item residual, −u_{00q}. In contrast to conventional Rasch models, larger positive item residuals in the cross-classified MMM indicate easier items. The smaller u_{00q} is, the more difficult the item. Hence, the magnitude of the cross-classified MMM's item-level residual, u_{00q}, is better interpreted, colloquially, as an "easiness"

Fig. 2. Network graph depicting the cross-classification of item scores by item (level two) and student (level two) lined-up within item.


parameter rather than an item "difficulty." Multiplying the value of the item-level residual by negative one allows it to be interpreted as a Rasch model item difficulty parameter.

The unconditional cross-classified MMMs in Eqs. (1) and (2) can be extended to model the potential impact of student-level characteristics. Impact, interpreted as the difference in student ability that is due to a person-level descriptor, is assessed by including a student-level predictor in the cross-classified MMM. In the current investigation, receipt of an accommodation is hypothesized to affect a student's probability of a correct response. The impact of accommodations is then modeled by including a dummy-coded accommodation variable, X, in the level-two equation:

\beta_{0(j_1,j_2)} = \gamma_{010} X_{j_1} + u_{0j_1 0} + u_{00j_2}    (5)

where γ_{010} can be used to assess the impact of accommodation receipt on students' test performance. Modeling impact in this manner, the ability of a specific person, s, who received an accommodation would be predicted to be (u_{0s0} + γ_{010}), and the ability of a different person, k, who did not receive the accommodation would be predicted to be u_{0k0}. In this context, a positive value for γ_{010} is interpreted as the accommodation raising the ability estimate.

In addition to modeling impact, this specification of the cross-classified MMM permits the inclusion of item-level descriptors as a means of explaining variability in items' difficulties. In the current investigation, items' LC is hypothesized to explain variance in item difficulties. Including an LC variable, Z, in the cross-classified MMM, in addition to the student-level accommodation variable, X, results in the following specification at level two:

\beta_{0(j_1,j_2)} = \gamma_{010} X_{j_1} + u_{0j_1 0} + \gamma_{001} Z_{j_2} + u_{00j_2}    (6)

where the coefficient for Z, γ_{001}, represents the effect of LC on items' difficulties. Recall that, with the cross-classified MMM, item difficulty is the item-level residual multiplied by negative one. Hence, the inclusion of the item-level predictor Z results in a predicted difficulty for item q equal to (−u_{00q} − γ_{001}Z_q). So, for example, the predicted difficulty of item r with an LC score of one is (−u_{00r} − γ_{001}). Similarly, the predicted difficulty of another item, t, with an LC score of two is (−u_{00t} − 2γ_{001}). The larger the quantity in parentheses, the more difficult the item. Because the current study hypothesizes that items with higher LC scores should be more difficult, a negative γ_{001} coefficient would increase item difficulty estimates, supporting the expected direction of the effect of LC on items' difficulties.

In the context of accommodated testing, the use of an accommodation should eliminate the effect of nuisance variables known to adversely affect test scores for students with disabilities. In the current study, LC is considered to be a nuisance rather than a necessary component of the mathematics ability that math items are intended to measure. If this is indeed the case, receiving an accommodation should remove the effect of LC on item difficulty. Put differently, the effect of LC on math items' difficulties should depend on whether a student (who needs an accommodation) receives an accommodation. In the cross-classified MMM, this effect is modeled by including an interaction term between the student-level accommodation descriptor, X, and the item-level LC descriptor, Z. This procedure results in the following level-two specification for the cross-classified MMM:

\beta_{0(j_1,j_2)} = \gamma_{010} X_{j_1} + u_{0j_1 0} + \gamma_{001} Z_{j_2} + \gamma_{002}\left(Z_{j_2} \times X_{j_1}\right) + u_{00j_2}    (7)

where γ_{002} represents the magnitude of the interaction between item descriptor Z and person descriptor X. The current investigation hypothesizes that accommodations do not sufficiently control for the effect of LC on math item performance for students with disabilities. In this study, we coded the variables such that X = 1 and X = 0 for children receiving and not receiving an accommodation, respectively. The LC variable, Z, is scored such that the higher the value on Z, the more linguistically complex the item. A negative value for γ_{002} would support this hypothesis, as it would suggest that the log odds of a correct response depends on whether or not a student (who needs an accommodation) receives one and is lower for those students who do receive an accommodation. This result, along with a negative (or zero) value for the LC coefficient, γ_{001}, would indicate that more linguistically complex items are harder for students and that this effect is exacerbated for students using an accommodation.
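A small sketch of how Eqs. (5) through (7) combine accommodation receipt, item LC, and their interaction into a predicted probability of a correct response is given below. The coefficient values in the usage lines are illustrative assumptions, not estimates reported in this article.

```python
import numpy as np

def p_correct(u_person, u_item, x_accom, z_lc,
              gamma_010=0.0, gamma_001=0.0, gamma_002=0.0):
    """Probability of a correct response under the level-two model of Eq. (7)."""
    eta = (gamma_010 * x_accom            # impact of accommodation receipt
           + gamma_001 * z_lc             # effect of item LC on difficulty
           + gamma_002 * z_lc * x_accom   # accommodation-by-LC interaction (DFF)
           + u_person + u_item)           # person ability and item "easiness" residuals
    return 1.0 / (1.0 + np.exp(-eta))

# An average student on an average item with LC = 3, without and with an accommodation;
# all coefficient values here are made up for illustration.
print(p_correct(0.0, 0.0, x_accom=0, z_lc=3, gamma_001=-0.3))
print(p_correct(0.0, 0.0, x_accom=1, z_lc=3, gamma_001=-0.3, gamma_010=0.2, gamma_002=0.1))
```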


The use of a person-by-item descriptor interaction is referred to as differential facet functioning (DFF) in the literature (Meulders & Xie, 2004). This terminology stems from the conceptualization of groups of items with the same value on the item descriptor (e.g., LC) as single facets. However, to the best of our knowledge, use of DFF is not particularly abundant, although its inclusion in models assessing accommodated test seems both promising and warranted. The current investigation uses SAS PROC GLIMMIX (v 9.2) to extend the cross-classified multilevel measurement model described above for each of the reading and mathematics subsamples. In this analysis, four accommodation variables are considered for each student, requiring the inclusion of terms in the model. The first model estimated the log-odds of a correct response for person j1 on item j2 as follows: " ηiðj1 ;j2 Þ ¼ log

#

piðj1 ;j2 Þ

1−piðj1 ;j2 Þ

¼ γ001 Z j2 þ u0j1 0 þ u00j2

ð8Þ

where p (the probability of a correct response) was modeled as a function of Zj2 (the LC of item j2) and the item's difficulty, u00j2, and person ability,u0j10. The coefficient for LC, γ001, represents the effect of LC on item difficulty with a negative value indicating that the more LC an item has, the more difficult it will be. Next, four dummy-coded accommodation type variables, X, were included in the model as follows: ηiðj1 ;j2 Þ ¼ γ001 Z j2 þ

4 X

γ0k0 X kj1 þ u0j1 0 þ u00j2

ð9Þ

k¼1

Table 5 Linguistic complexity ratings for reading items. Item

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Average SD

Type

SCR MC SCR SCR MC SCR ECR MC SCR MC SCR SCR MC SCR SCR MC ECR MC SCR SCR MC MC SCR MC ECR MC MC SCR SCR

Linguistic complexity component

Total

Syntax

Vocabulary

Discourse

0 0 3 3 0 1 0 1 0 0 1 0 0 1 1 1 1 3 0 2 1 1 2 0 3 2 2 2 2 1.14 1.06

0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 1 1 2 1 1 1 1 1 1 0 1 0.66 0.55

1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 0.34 0.48

Note. MC = multiple choice; SCR = Student constructed response; ECR = Extended constructed response.

1 1 5 4 1 2 2 1 0 1 1 0 1 1 1 1 3 4 1 3 3 2 4 1 5 3 3 3 4 2.14 1.46

S.W. Cawthon et al. / Journal of School Psychology 50 (2012) 293–316

307

In the model estimated in the current study, each Xkj1 refers to one of four specific accommodation types, coded 1 if the student received that accommodation type and 0 if not. The reference group for each Xkj1 consists of students with disabilities who did not receive that accommodation. Last, the model including interactions between LC and each of the four X variables was estimated:

\eta_{i(j_1,j_2)} = \gamma_{001} Z_{j_2} + \sum_{k=1}^{4} \gamma_{0k0} X_{kj_1} + \sum_{k=2}^{5} \gamma_{00k} \left( X_{(k-1)j_1} \times Z_{j_2} \right) + u_{0j_1 0} + u_{00j_2} \qquad (10)

to assess whether receiving the accommodation changed LC's effect on estimates of student ability.
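As a data-preparation sketch, the following Python fragment shows how the variables entering Eqs. (8)–(10) can be laid out in a long (student-by-item) file: an item-level LC score, four student-level accommodation dummies, and their cross-products. The column names and toy values are hypothetical; the actual estimation in this study was carried out in SAS PROC GLIMMIX with crossed random effects for students and items.

import pandas as pd

# Hypothetical long-format data: one row per student-by-item response.
responses = pd.DataFrame({
    "student_id":   [101, 101, 102, 102],
    "item_id":      [1, 2, 1, 2],
    "score":        [1, 0, 0, 1],   # dichotomous item score (correct/incorrect)
    "lc_total":     [1, 5, 1, 5],   # Z: item-level linguistic complexity rating
    "presentation": [1, 1, 0, 0],   # X1-X4: student-level accommodation dummies
    "response":     [0, 0, 0, 0],
    "setting":      [1, 1, 0, 0],
    "timing":       [0, 0, 1, 1],
})

# Cross-product terms for Eq. (10): the accommodation-by-LC (DFF) interactions.
for acc in ["presentation", "response", "setting", "timing"]:
    responses[f"{acc}_x_lc"] = responses[acc] * responses["lc_total"]

print(responses.head())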

3. Results

3.1. Descriptive statistics for LC ratings

Table 6
Linguistic complexity ratings for mathematics items.

Item      Type   Syntax   Vocabulary   Discourse   Total
1         SCR    0        0            0           0
2         MC     0        0            0           0
3         MC     1        0            0           1
4         MC     1        0            0           1
5         SCR    1        1            0           2
6         MC     0        1            0           1
7         SCR    0        2            0           2
8         SCR    2        1            1           4
9         MC     0        0            1           1
10        SCR    1        0            1           2
11        SCR    2        0            1           3
12        SCR    1        0            1           2
13        SCR    1        1            1           3
14        SCR    2        1            1           4
15        ECR    0        0            0           0
16        MC     0        0            0           0
17        MC     0        0            0           0
18        MC     0        0            1           1
19        MC     1        0            0           1
20        SCR    1        0            0           1
21        MC     0        0            1           1
22        MC     0        0            1           1
23        SCR    0        0            0           0
24        MC     0        0            1           1
25        MC     0        0            1           1
26        MC     1        0            1           2
27        SCR    3        0            1           4
28        MC     0        0            0           0
29        MC     1        0            1           2
30        MC     0        0            1           1
31        SCR    1        1            1           3
32        SCR    0        0            0           0
Average          0.63     0.25         0.53        1.41
SD               0.79     0.51         0.51        1.24

Note. MC = multiple choice; SCR = student constructed response; ECR = extended constructed response.

Tables 5 and 6 provide the LC ratings for the reading and mathematics items used in this analysis. As modeled in Abedi et al. (2005), the total LC score consisted of the sum of each of the three areas. For reading, the average LC score was 2.14 (SD = 1.46), with a range from 0 (for two items) to 5 (for one item). Individual component scores show where the emphasis was in the reading LC ratings:

Syntax scores were the highest (M = 1.14, SD = 1.06), followed by vocabulary (M = 0.66, SD = 0.55), and then discourse ratings (M = 0.35, SD = 0.48). In some sense, the findings are not surprising because the discourse score options are either "0" or "1," whereas the syntax and vocabulary score options have a much wider range. One possible limitation of this study is that the analyses were calculated on the overall LC score, a sum that combines both scale and dichotomous sub-scores, thus weighting the overall score toward the vocabulary and syntax components.

The mathematics item LC ratings were, overall, lower than those for the reading items. The mean total score for mathematics was 1.41 (SD = 1.24), which is about two-thirds the level of the reading items. The individual components tell a somewhat different story about how LC was determined for mathematics items than for reading items. The syntax average score was 0.62 (SD = 0.79), about half of the average score for reading. The average vocabulary score was 0.25 (SD = 0.51), roughly one-third of the average for reading. Finally, the average discourse score was 0.53 (SD = 0.51), which was actually higher than the discourse score for reading. Because the syntax and vocabulary levels were, overall, lower than in reading, a greater proportion of the LC rating for mathematics was attributed to discourse elements than in the reading total scores. This result may be because mathematics items sometimes use an initial statement to introduce vocabulary words or a concept to be used in completing the item, creating a multi-step process that was coded as "discourse."

3.2. Effects of LC and accommodations

Table 7 contains the fixed effects estimates for the reading and mathematics items. Model 1 (see Eq. (8)) represents the effect of LC on the probability that SLD would respond to an item correctly. For both reading and mathematics items, the LC coefficient was statistically significant (p < .05), with the negative value indicating that the more complex an item, the harder it was for an SLD in this sample. In other words, LC did have an effect, and it had an effect in the predicted direction, such that the higher the LC, the harder the item. The magnitude of the effect was greater for reading (γ̂001,R = −0.44) than for mathematics (γ̂001,M = −0.20).

Model 2 (see Eq. (9)) analyzed whether LC still had an effect after controlling for the accommodations' effects. As evident in Table 7, once the set of accommodation use variables was included in the model, the main effect for LC remained significant only for reading items (γ̂001,R = −0.27, p < .05) and not for mathematics (γ̂001,M = −0.11, p > .05). The results also indicate that, after controlling for the effect of LC, accommodations' use had a significant effect on estimates only in reading, and only for presentation (γ̂010,R = −0.29, p < .05) and setting (γ̂030,R = −0.30, p < .05) accommodations. These values were negative, indicating that after adjusting for the effects of items' LC, students who received the relevant accommodation were less likely to answer the question correctly than students who did not receive the accommodation.

Table 7
Fixed effects estimates for each model estimated for reading and mathematics items.

Model     Parameters                        Reading estimates (SE)    Mathematics estimates (SE)
Model 1   −2LL                              70235.09                  139348.90
          LC, γ010                          −0.44 (.12)*              −0.20 (.09)*
Model 2   −2LL                              70333.29                  139387.30
          LC, γ010                          −0.27 (.12)*              −0.11 (.10)
          ACC - Presentation, γ020          −0.29 (.07)*              −0.11 (.07)
          ACC - Response, γ040              −0.02 (.07)               0.01 (.05)
          ACC - Setting, γ050               −0.30 (.10)*              −0.05 (.09)
          ACC - Timing, γ060                0.12 (.08)                −0.09 (.07)
Model 3   −2LL                              70359.99                  139409.70
          LC, γ010                          −0.27 (.13)*              −0.12 (.10)
          ACC - Presentation, γ020          −0.23 (.11)*              −0.10 (.09)
          ACC - Response, γ040              −0.03 (.10)               0.06 (.07)
          ACC - Setting, γ050               −0.18 (.14)               −0.07 (.10)
          ACC - Timing, γ060                −0.01 (.12)               −0.12 (.08)
          LC ∗ ACC - Presentation, γ070     −0.03 (.05)               −0.01 (.03)
          LC ∗ ACC - Response, γ090         0.01 (.05)                −0.04 (.03)
          LC ∗ ACC - Setting, γ0100         −0.07 (.06)               0.02 (.04)
          LC ∗ ACC - Timing, γ0110          0.08 (.06)                −0.02 (.03)

Note. ACC = accommodation.
* p < .05.
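As a rough aid to interpreting the Model 1 estimates in Table 7, the following Python lines convert the reported LC coefficients to odds ratios and to the change in predicted probability from an arbitrary 50% baseline (all other model terms held fixed); this back-of-the-envelope translation is illustrative and not part of the original analysis.

import math

def shifted_probability(coef, baseline_p=0.5):
    # Probability after adding `coef` to the log odds of a baseline probability.
    baseline_logit = math.log(baseline_p / (1 - baseline_p))
    return 1 / (1 + math.exp(-(baseline_logit + coef)))

for subject, coef in [("reading", -0.44), ("mathematics", -0.20)]:
    odds_ratio = math.exp(coef)       # multiplicative change in the odds per LC point
    p = shifted_probability(coef)     # probability after a one-point LC increase
    print(f"{subject}: odds ratio = {odds_ratio:.2f}, probability .50 -> {p:.2f}")

# reading: odds ratio = 0.64, probability .50 -> 0.39
# mathematics: odds ratio = 0.82, probability .50 -> 0.45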


Model 3 (see Eq. (10)) included the cross-product terms representing the interactions between LC and each type of accommodation to assess potential differential facet functioning. None of the interaction terms' coefficients were statistically significant (p > .05) for either reading or mathematics items (see Table 7), implying that the effect of LC on student scores in mathematics and in reading did not significantly depend on the type of accommodation received. In this final model, the main effects of LC and of receipt of presentation accommodations remained statistically significant for reading items. This finding meant that, after controlling for the effect of LC on items' difficulties, students who received presentation accommodations in reading had significantly lower reading scores than students who did not receive this accommodation (γ̂010,R = −0.23). In addition, after controlling for the effect of accommodations, reading items with higher LC scores were found to be significantly more difficult than items with lower LC scores (γ̂001,R = −0.27).

4. Discussion

There were three main findings in this study. First, LC was found to have a significant effect on both reading and mathematics item difficulties for SLD. The magnitude of the LC effect was stronger for reading than for mathematics items, meaning that larger differences in item difficulties resulting from LC were found for reading than for math items. Second, whereas LC remained an effect for reading across all three models, LC's effect in mathematics was not significant when the effects of accommodations were included in the analysis (see results for Model 2 estimates in Table 7). This finding implies that including the effects of accommodations in the model reduces the estimated impact of LC on test performance for SLD in mathematics. Finally, the four kinds of accommodations did not have a main effect on student outcomes for mathematics items, but two of the accommodations did have significant effects on reading scores. The direction of the accommodation effect was interesting: Students who received setting accommodations and students who received presentation accommodations had lower scores than students who did not receive those accommodations. The results of the third model (see Table 7) indicated that there were no significant interactions between LC and accommodation type in this study and suggested that receiving an accommodation did not moderate the effect of LC on student performance within the SLD population. These findings are discussed further below.

4.1. Linguistic complexity

The first research question focused on the effect of LC on the probability of students answering items correctly in reading and mathematics. The results of these analyses point to the difficulty of reading items as the most robust area of impact for LC, but there is some potential effect of LC on item difficulty in mathematics if an SLD does not also have accommodations. For example, vocabulary terminology is often relevant to the construct being measured, but syntactic components are less likely to be related to the item construct.
Further content analysis might be necessary to assess whether the LC dimension provides an essential component of the construct being measured or, in contrast, whether it is interfering with what the test is intended to assess. The specific role of LC may differ for reading and mathematics test items. For example, consider this released NAEP item in which students were asked to identify why Ellis Island was called "the doorway to America" (NAEP, 2005). The item was as follows:

Ellis Island was called "the doorway to America" because it:

(a) was the place most immigrants had to pass through before entering the United States;
(b) had a large and famous entranceway that immigrants walked through;
(c) was the only port in the United States where foreign ships could dock safely; and
(d) was actually a large ship that carried the immigrants to the United States.

This test item seeks to assess an individual's ability to read a passage, understand the information, and infer and explain why the term "doorway" is used to describe Ellis Island. However, the high linguistic demand of this item may be interfering with the ability of SLD to demonstrate such skills. The LC ratings for this item (item 18 in Table 5) are particularly high in syntax (rated a 3 versus the average syntax score of 1.14). This high score is due to multiple relative clauses and the passive voice. Are the syntax components necessary to assess a student's understanding of the "doorway" concept? Although the vocabulary component of language demand may not be high on this item, ancillary features of the text, specifically the relative clauses, may serve as access barriers for students for whom language demands create additional difficulties in test taking.

Fig. 3. Mathematics item with highest syntax score. A piece of metal in the shape of a rectangle was folded as shown above. In the figure on the right, the ? symbol represents what length? (a) 3 inches; (b) 6 inches; (c) 8 inches; (d) 11 inches. Note. This is a released NAEP item.

Mathematics items were, overall, less linguistically complex than the reading items. This finding is perhaps not surprising given that mathematics items, even when presented in narrative form as word problems, often require students to carry out non-linguistic procedures to complete the item. For the mathematics items in this study, there were three items with an LC score of "4," and only one item with a syntax component comparable to the Ellis Island example shown previously. What does a low versus a high LC item look like in mathematics word problems? We provide two examples to illustrate this contrast. The item with the highest syntax score was item 27 in Table 6 (see Fig. 3). The syntax components rated in this item include the long nominal (i.e., "a piece of metal"), the passive voice, and the complex question ("in the figure on the right" plus the question about what the ? symbol designates). In contrast, the item in Fig. 4 (item 6 in Table 6) had a zero score for syntax because the question stem was a single direct sentence, and no relative clauses were embedded in the question format. Although both items measure students' knowledge of geometric properties, the first item requires a complex scenario, whereas the second item represents the scenario using visual tools (i.e., the grid). Thus, a student could infer information from the visual in the second item in a way that he or she could not with the text-only presentation of the first item. When test items are developed, it may be helpful for item writers to be attentive to what information is provided in the text versus what can be interpreted from the visual (if present). In other words, it may be that an item can be written without a visual aid, but if it is possible to

Fig. 4. Mathematics item with zero syntax score. What is the area of the shaded figure? (a) 9 square centimeters; (b) 11 square centimeters; (c) 13 square centimeters; (d) 14 square centimeters. Note. This is a released NAEP item.


include one, students who struggle with interpreting written text may still be able to demonstrate comprehension of the mathematical concept with the additional visual support. Although LC may not be as embedded in the structure of mathematics items as it is in reading items, the findings from this study indicate that care may still be necessary across subject areas in looking at LC's potential effects for SLD. In mathematics, the effect of LC was non-significant when accommodations were also included in the model. In reading, the effect of LC was still significant, but it was smaller than in the first model, which did not include the effects of accommodations. The analysis in this study took the level of LC into account, meaning that the larger effect on outcomes in reading is not because reading items had higher levels of LC than mathematics items; the relative level of LC was accounted for in the same way across the two analyses. Within each subject area, therefore, the level of LC in the items may be due to different LC components (e.g., syntax or vocabulary).

4.2. Accommodations

The second research question focused on the effects of accommodations on students' reading and mathematics scores. The purpose of accommodations is to remove barriers from the assessment process, including some of the construct-irrelevant language demands in test items. Analyses in this study were separated by accommodation type: presentation, response, setting, and timing. In mathematics, there were no significant effects on estimates of student ability for any of the accommodation categories. This result implies that having an individual accommodation did not change the likelihood that an SLD would answer the question correctly (or incorrectly). However, as noted previously, the more important finding for mathematics was that the LC effect was not significant when the set of accommodation effects was included in the model. The focus, here, is on how an SLD may benefit from an accommodation because it removes construct-irrelevant LC from the test-taking process.

In reading, there were significant effects for presentation and setting accommodations but not for response or timing accommodations. For both significant effects of accommodations in reading, the presence of the accommodation predicted lower scores for the accommodated student. Although we did not posit a priori hypotheses regarding the specific types of accommodations and their impact on ability estimates, the research literature does support some differentiation between accommodations that might benefit SLD more than others. For example, setting accommodations are largely used to provide a distraction-free environment for students, and we did not hypothesize that their effects would exist in one subject (e.g., reading) over the other (mathematics). Timing and presentation accommodations, in contrast, would likely have an effect on test items with high linguistic demand, particularly for reading assessments. For SLD, we expected that accommodations would affect ability estimates, which, in turn, would mitigate the influence of item difficulties on student performance. If accommodations increase student performance by reducing construct-irrelevant barriers, students should be better able to correctly answer harder (i.e., more difficult) questions than without the accommodations. Regardless of the type of accommodation, a priori hypotheses for this study would have most likely been in the positive direction.
In other words, we would have presumed from the literature and from the underlying purpose of accommodation use that students would see an increase, and not a decrease, in test scores. Seeing a negative effect of presentation and setting accommodations on reading outcomes suggests at least two explanations. First, it is possible that accommodations are assigned to students who are lower performing on assessments, in hopes that the accommodation can help offset some of the challenges those students have in answering test items. There may be a selection effect that cannot be controlled for in the analysis of this particular dataset. Students who did not receive accommodations in this study were still eligible for them (i.e., all students had an IEP), so those accommodation assignments were likely made based on perceived need. In this scenario, the accommodated students in this study may have fared worse than students without the accommodations because of a more severe learning disability or a lower level of academic proficiency. Accommodations, in practice, are assigned according to student need, thus not allowing us to use random assignment to control for the effects of underlying student characteristics that may also contribute to test performance.

A second possible explanation is that the accommodations function in a way that makes the assessment process more challenging for students. For example, does having the assessment in an individualized or small group setting mean that students are tested at a different time of day, with a different test


administrator, factors that may have a distracting effect on students? Or, perhaps more plausibly, does having a person available to provide assistance in understanding the test items (one of the presentation accommodations) slow students down or provide information that confuses the student? These questions are certainly raised by the negative effects found in this analysis, but the reasons behind the data remain elusive. What is meaningful in this discussion is to see that accommodations may not have the desired effect in some cases and to continue to check the assumption that accommodations always remove barriers for students.

The non-significant results for all of the accommodation types in mathematics are perhaps more in line with some of the previous literature than those in reading. The power of the statistical tests, with over 2000 individuals, was certainly strong enough to detect small effects. Although it is more tenuous to interpret non-significant results than statistical significance, the consistency of the results across all accommodation types in mathematics points to a possibly meaningful finding: that including accommodations like those considered in this study did not appear to affect how students performed in mathematics.

4.3. Limitations of this study

There were several significant limitations to this study. The sample was from a "live" assessment cycle, with no randomization of item LC or accommodation conditions. Furthermore, students without an LD were not eligible for the accommodations, leaving out a critical comparison group for an analysis of the potential differential effects of LC and accommodations on students with and without an LD. One related aspect of this sample that needs to be taken into consideration is the fact that many SLD used multiple accommodations on the NAEP. Although there was some variability in which students used accommodations in any one of the four categories used in this analysis, there were certainly many students who received a package of accommodations. Strong conclusions about the relative benefits of one accommodation over another would therefore not reflect how many of the students actually used them. A different analysis might look at the effects of presentation accommodations when provided singly versus when provided as a package with one or more other categories of accommodations. This approach would be more appropriate in a study that is focused on the relative effects of individual accommodations or the cumulative effects of multiple accommodations.

A second limitation is the lack of information about each student's LD status, diagnostic process, and characteristics. The NAEP data set includes information from a checklist completed by school personnel, including disability categories, perceived severity of disability, and types of services for which the student is eligible. Missing from this data set is information about student characteristics that may be relevant to test performance, most importantly the type of learning disability (e.g., specific reading or mathematics disabilities). SLD are a heterogeneous group, with a wide range of specific test-taking skills that may be affected by the disability. To further complicate matters, it was unclear whether the LD was considered the primary condition or a comorbid condition to another primary category such as deafness or autism. Without an ability to control for, or at least provide descriptive information about, the scope of the learning disabilities of the NAEP participants, it is difficult to know whether the specific types of learning disability may confound study results.

A third potential limitation was the LC coding schema used in this analysis. This study utilized an adapted version of an LC rating scale that focused primarily on vocabulary and syntax. The discourse component was relatively simplistic, addressing a range of potential features, such as more than one sentence stem or the use of academic language, using a single rating point (1 = yes, 0 = no). It may be that because the components were on different scales (vocabulary and syntax had a broader range of possible scores than the dichotomous discourse component), this rubric characteristic weakened the variability that could be captured by the LC rating scale. It should be noted that only the total score was used in this analysis, not the component scores, so an individual component analysis is not possible in this study. Furthermore, this scale has not been validated against student performance to understand how much a "low" LC score may differentially affect student item responses compared with a "high" LC score on the same test item construct.
Future research might consider a different LC rating scale, or perhaps even a more global measure of item accessibility that incorporates features such as the format of the item on the page and graphical features of the item (Kettler et al., 2009).


Finally, one weakness of the cross-classified MMM used in this study is that its use assumes a Rasch model for item scores. This means that the items within the analysis set are assumed to have the same item discriminations and zero pseudo-guessing parameter values. This assumption does not match the reality of NAEP items, which are scaled assuming a non-Rasch item response theory model that permits multiple-choice items to differ in their pseudo-guessing and discrimination parameters. Unfortunately, it is not possible to concurrently assess differential item and facet functioning using a non-Rasch model. There are currently no measures of individual model fit available for cross-classified models. In addition, recent methodological research (Beretvas & Murphy, submitted for publication) has found that comparisons of cross-classified models' information criteria do not perform accurately in terms of correct model identification. Therefore, it is not possible to assess individual models' fit or compare models' fit for the cross-classified models examined in this study. A further limitation of the current study is that only dichotomous item scores were analyzed. Future research is necessary to extend the cross-classified models used here to allow their use with mixed-format items. Development and estimation of a non-Rasch version of the cross-classified MMM described here provides a useful direction for future methodological research that can enhance applied research on the effects of accommodations on item scores. Despite these limitations, the current study provides an important starting point for larger-scale intervention designs that control for student, item, and test characteristics.

4.4. Implications for future research and practice

The results of this study lay the groundwork for future research that could clarify how accommodations counteract LC elements that unnecessarily affect student performance. For example, there is the potential to connect what we know about language processing in SLD (or other disability groups) with how these students access the language of the items themselves. Accommodations currently tend to be general in nature and do not specifically address language abilities or processing skills that may be an issue for some students. Further research might focus on how SLD access items and on which aspects of specific accommodations help students demonstrate their knowledge, particularly on high-LC items. Finally, it is necessary to expand on these findings and determine whether there are differences between three groups: students without disabilities, SLD with accommodations, and SLD without accommodations (but who would typically be assigned accommodations). The sample would need to have accommodations randomly assigned to SLD so that item difficulty results can be compared across the three groups without the potential confounds present in a "live" testing process such as the NAEP. This kind of analysis would allow for a more direct interpretation of how LC and accommodations affect student performance above and beyond difficulties students may have due to their learning disability that cannot be mitigated through altered testing procedures.

Methodologically, the cross-classified MMM utilized in the current analysis provides a viable means of assessing accommodated test scores. The model specification outlined above allows for the exploration of item-level covariates as predictors of both item difficulty and student performance.
This feature is particularly relevant to the current investigation, as LC, an item-level descriptor, was hypothesized to affect both item difficulties and student achievement. Although the cross-classified MMM is exposited elsewhere in the literature, it is, to the best of the authors' knowledge, novel in the context of accommodated testing. Additionally, the specific parameterization utilized in the current analysis is one of many viable options for investigating the effects of interest.

Together with additional research studies such as those described above, this line of research has the potential to provide a more substantial basis for practitioners to use in their decisions about accommodations for SLD. Although LC has been the focus of test item development for ELL students, this study illustrates how those language components may also be significant for SLD. State assessment review boards may want to expand the populations they consider when looking at the levels of complex language and their potential contribution to construct-irrelevant variance. Second, knowing that accommodations might have a mitigating effect on the impact of LC in mathematics could be an important finding to keep in mind when evaluating a student's need for accommodations. Although reading passages are typically more text-heavy, this study's findings indicate that the LC for mathematics items is still a potential concern. In practice, many accommodations are provided across all subject areas tested, so this finding may simply confirm an overall attention to how accommodations help provide access for SLD. On the


other hand, some caution may be needed when thinking about accommodations for reading, or at least an understanding that accommodations may not necessarily result in a greater likelihood that an SLD will respond with a correct answer. Accommodations that are assumed to benefit students, or at the very least be benign, may actually have a deleterious effect. In the long run, this literature can help administrators accurately assess SLD, which should increase the validity of accommodated test scores for SLD, particularly those used in high-stakes accountability frameworks.

4.5. Conclusion

SLD represent the largest group of students with disabilities, a group that frequently participates in standardized assessments with at least one type of test accommodation. Although the ELL and SLD populations are often served under very different educational structures and policies, recent accountability reforms pull together such diverse groups under the umbrella of achievement testing and high-stakes decisions based on those test scores. Like the inclusion reforms before them, accountability reforms seek to measure academic progress and to put mechanisms in place to improve student outcomes, such as developing the AA-MAS framework. Increasing the validity of how we interpret standardized test scores is a high priority, not just for the NAEP but also for schools, districts, and states nationwide. The LC of test items is one malleable variable that warrants careful attention, both in test construction and in understanding the interaction between student characteristics, test content, and test accommodations.

References

Abedi, J. (2002). Standardized achievement tests and English language learners: Psychometrics issues. Educational Assessment, 8, 231–257. Abedi, J. (2006). Language issues in item development. In S. Downing, & T. Haladyna (Eds.), Handbook of test development (pp. 377–399). Mahwah, NJ: Erlbaum. Abedi, J., Bailey, A., Butler, F., Castellon-Wellington, M., Leon, S., & Mirocha, J. (2005). The validity of administering large-scale content assessments to English language learners: An investigation from three perspectives. CSE Report, 663, Los Angeles: University of California, National Center for Research on Evaluation Standards and Student Testing. Abedi, J., & Hejri, F. (2004). Accommodations for students with limited English proficiency in the National Assessment of Educational Progress. Applied Measurement in Education, 17, 371–392. Abedi, J., Hofstetter, C., & Lord, C. (2004). Assessment accommodations for English language learners: Implications for policy-based empirical research. Review of Education Research, 74, 1–28. Abedi, J., Leon, S., & Kao, J. C. (2008). Examining differential item functioning in reading assessments for students with disabilities. CRESST Tech. Rep., 744, Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Abedi, J., Leon, S., & Kao, J. C. (2008). Examining differential distractor functioning in reading assessments for students with disabilities. CRESST Tech. Rep., 743, Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14, 219–234. Abedi, J., Lord, C., & Plummer, J. (1997). Language background as a variable in NAEP mathematics performance. CSE Tech.
Rep., 429, Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing. Adams, R. J., & Wilson, M. (1996). Formulating the Rasch model as a mixed coefficients multinomial logit. In G. Engelhard, & M. Wilson (Eds.), Objective measurement: Theory and practice, 3. (pp. 143–166)Norwood, NJ: Ablex. Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22, 47–76. Altarac, S. E. (2007). Lifetime prevalence of learning disability among US children. Pediatrics, 119, 77–83. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: Authors. Beretvas, S. N. (2008). Cross-classified random effects models. In A. A. O'Connell, & D. B. McCoach (Eds.), Multilevel modeling of educational data (pp. 161–197). Charlotte, SC: Information Age Publishing. Beretvas, S. N., Cawthon, S., Lockhart, L. & Kaye, A. (in press). Assessing impact, DIF and DFF in accommodated item scores: A comparison of multilevel measurement model parameterizations. Educational and Psychological Measurement. Beretvas, S. N., & Kamata, A. (2005). The multilevel measurement model: Introduction to the special issue. Journal of Applied Measurement, 6, 247–254. Beretvas, S. N., & Murphy, D. L. (submitted for publication). An evaluation of the performance of information criteria for correct crossclassified random effects model selection. Bolt, S., & Thurlow, M. (2004). Five of the most frequently allowed testing accommodations in state policy: Synthesis of research. Remedial and Special Education, 25, 141–152. Bolt, S. E., & Ysseldyke, J. E. (2006). Comparing DIF across math and reading/language arts tests for students receiving a read-aloud accommodation. Applied Measurement in Education, 19, 329–355. Butler, F., Bailey, A., Stevens, R., Huang, B., & Lord, C. (2004). Academic English in fifth-grade mathematics, science, and social studies textbooks. CSE Report, 642, : Center for Research on Evaluation Standards and Student Testing (CRESST).


Cawthon, S., Ho, E., Patel, P., Potvin, D., & Trundt, K. (2009). Towards a multiple construct model of measuring the validity of assessment accommodations. Practical Assessment, Evaluation, and Research, 14(21). Available online: http://pareonline.net/genpare. asp?wh=0&abt=14 Cheong, Y. F., & Raudenbush, S. W. (2000). Measurement and structural models for children's problem behaviors. Psychological Methods, 5, 477–495. Crawford, L., Helwig, R., & Tindal, G. (2004). Effects of a student reads-aloud accommodation on the performance of students with and without learning disabilities on a test of reading comprehension. Exceptionality, 12(2), 71–88. Cummins, J. (1984). Bilingual education and special education: Issues in assessment and pedagogy. San Diego, CA: College Hill. Elbaum, B. (2007). Effects of an oral testing accommodation on the mathematics performance of secondary students with and without learning disabilities. Journal of Special Education, 40(4), 218–229. Elliott, S. N., Kettler, R., Beddow, P., Kurz, A., Compton, E., McGrath, D., et al. (2010). Using modified items to test students with and without persistent academic difficulties: Effects on groups and individual students. Exceptional Children, 76(4), 475–495. Fischer, G. H. (1995). Linear logistic test model. In G. H. Fischer, & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 131–155). New York: Springer-Verlag. Fletcher, J. M., Francis, D. J., Boudousquie, A., Copeland, K., Young, V., Kalinowski, S., et al. (2006). Effects of accommodations on highstakes testing for students with reading disabilities. Exceptional Children, 72(2), 136–150. Fletcher, J. M., Lyon, G. R., Fuchs, L. S., & Barnes, M. A. (2007). Learning disabilities: From identification to intervention. New York, NY: Guilford Press. Fuchs, L. S., Fuchs, D., & Capizzi, A. (2005). Identifying appropriate test accommodations for students with learning disabilities. Focus on Exceptional Children, 37(6), 1–8. Fuchs, L., Fuchs, D., Eaton, S., Hamlett, C. L., & Karns, K. (2000). Supplementing teachers' judgments of mathematics test accommodations with objective data sources. School Psychology Review, 29, 65–85. Goldstein, H. (2003). Multilevel statistical models (3 rd Ed.). London: Hodder Arnold. Harwell, M. R., & LeBeau, B. (2010). Student eligibility for a free lunch as an SES measure in educational research. Educational Researcher, 39, 120–131. Hox, J. J. (2002). Multilevel analysis: Techniques and applications. Mahwah, NJ: Erlbaum. Individuals with Disabilities Education Act Amendments of 1997, 20 U.S.C. § 1400 et seq. Individuals with Disabilities Education Improvement Act (IDEA) of 2004, Public Law 108-446. Johnson, E. S. (2000). The effects of accommodations on performance assessments. Remedial and Special Education, 21, 261–268. Kamata, A. (2001). Item analysis by the hierarchical generalized linear model. Journal of Educational Measurement, 38, 79–93. Ketterlin-Geller, L., Yovanoff, P., & Tindal, G. (2007). Developing a new paradigm for conducting research on accommodations in mathematics testing. Exceptional Children, 73, 331–347. Kettler, R. J., Elliott, S. N., & Beddow, P. A. (2009). Modifying achievement test items: A theory-guided and data-based approach for better measurement of what students with disabilities know. Peabody Journal of Education, 84, 529–551. Martinello, M. (2008). Language and the performance of English-language learners in math word problems. Harvard Educational Review, 78. 
McDonnell, L., McLaughlin, M., & Morison, P. (1997). Educating one & all: Students with disabilities and standards-based reform. Washington, DC: National Academy Press. Meloy, L. L., Deville, C., & Frisbie, D. A. (2002). The effects of a read aloud accommodation on test scores of students with and without a learning disability in reading. Remedial and Special Education, 23, 248–255. Meulders, M., & Xie, Y. (2004). Person-by-item predictors. In P. De Boeck, & M. Wilson (Eds.), Explanatory item response models: A generalized linear and nonlinear approach. New York, NY: Springer. Middleton, K., & Laitusis, C. C. (2007). Examining test items for differential distractor functioning among students with learning disabilities. Research Report, 07-43, Princeton, NJ: Educational Testing Service. Munger, G. F., & Lloyd, B. H. (1991). Effect of speededness on test performance of handicapped and nonhandicapped examinees. The Journal of Educational Research, 85, 53–58. National Assessment Governing Board (2008). National Assessment of Educational Progress 1992–2007 Framework. Retrieved from. http://nces.ed.gov/nationsreportcard/reading/whatmeasure.asp#sec4 National Assessment Governing Board (2010). NAEP testing and reporting on students with disabilities and English language learners: Policy statement. Retrieved from. ttp://nces.ed.gov/transfer.asp?location=nagb.org/policies/PoliciesPDFs/ ReportingandDissemination/naep_testandreport_studentswithdisabilities.pdf. March National Assessment of Educational Progress (2005). NAEP Sample Items. Retrieved from: http://nces.ed.gov/nationsreportcard/about/ naeptools.asp. No Child Left Behind Act of 2001 (2002), 20 U.S.C. 6301 et seq. Paas, F., Renkl, A., & Sweller, J. (2003). Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 38, 1–4. Rasbash, J., & Browne, W. J. (2001). Modeling non-hierarchical structures. In A. H. Leyland, & H. Goldstein (Eds.), Multilevel modeling of health statistics (pp. 93–105). Chichester, UK: Wiley. Roach, A. T., Beddow, P. A., Kurz, A., Kettler, R. J., & Elliott, S. N. (2010). Incorporating student input in developing alternate assessments based on modified achievement standards. Exceptional Children, 77, 61–80. Runyan, M. K., & Smith, J. (1991). Identifying and accommodation learning disabled law school students. Journal of Legal Education, 41, 317–349. Shaftel, J., Belton-Kocher, E., Glassnap, J., & Poggio, J. (2006). The impact of language characteristics in mathematics test items on the performance of English language learners and students with disabilities. Educational Assessment, 11, 105–126. Sireci, S. G., Scarpati, S. E., & Li, S. (2005). Test accommodations for students with disabilities: An analysis of the interaction hypothesis. Review of Educational Research, 75, 457–490. Stone, E. (2009). Examining the fairness of an English-language arts assessment for students who are deaf or hard of hearing using differential distractor functioning analysis. Paper presented at the annual meeting of the National Council on Measurement in Education; San Diego, CA April. Stretch, L. S., & Osborne, J. W. (2005). Extended time test accommodation: Directions for future research and practice. Practical Assessment, Research & Evaluation, 7, 1–6.


U.S. Department of Education (2007). Title I—Improving the Academic Achievement of the Disadvantaged; Individuals With Disabilities Education Act (IDEA); final rule (34 C.F.R. parts 200 and 300). Washington, DC: Author April 9. U.S. Department of Education (2008). Thirtieth annual report to Congress on the implementation of the Individuals with Disabilities Education Act. Washington, D.C.: Author. van den Noortgate, W., De Boeck, P., & Meulders, M. (2003). Cross-classification multilevel logistic models in psychometrics. Journal of Educational and Behavioral Statistics, 28, 369–386. van Merriënboer, J. J. G., & Sweller, J. (2005). Cognitive load theory and complex learning: Recent developments and future directions. Educational Psychology Review, 17, 147–177. Zenisky, A. L., & Sireci, S. G. (2007). A summary of the research on the effects of test accommodations: 2005–2006 (Technical Report 47). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes Available at: http://cehd.umn.edu/nceo/ OnlinePubs/Tech47/default.html Zuriff, G. E. (2000). Extra examination time for students with learning disabilities: An examination of the maximum potential thesis. Applied measurement in education, 13, 99–117.