Assessing Writing 17 (2012) 150–173


Rater effects: Ego engagement in rater decision-making

Cynthia S. Wiseman
Borough of Manhattan Community College, City University of New York, 199 Chambers Street, N436, New York, NY 10007, United States


Article history: Received 9 March 2011; received in revised form 1 December 2011; accepted 3 December 2011; available online 7 April 2012.
Keywords: L2 writing assessment; Rasch analysis; Rater decision-making; Think aloud protocols; Mixed methods research design

Abstract
The decision-making behaviors of 8 raters when scoring 39 persuasive and 39 narrative essays written by second language learners were examined, first using Rasch analysis and then through think aloud protocols. Results based on Rasch analysis and think aloud protocols recorded by raters as they were scoring holistically and analytically suggested that rater background may have contributed to rater expectations that might explain individual differences in the application of the performance criteria of the rubrics when rating essays. The results further suggested that rater ego engagement with the text and/or author may have helped mitigate rater severity and that self-monitoring behaviors by raters may have had a similar mitigating effect.

As with any performance-based assessment, the direct assessment of L2 writing ability, particularly at the program level, presents threats to reliability. Besides examinee L2 writing ability, other facets of an assessment context, particularly rater effects, can contribute to variability in examinee scores. Past studies (Bachman, Lynch, & Mason, 1995; Barkaoui, 2010; Breland, Bridgeman, & Fowles, 1999; Brennan, Gao, & Colton, 1995; Fulcher, 2003; Henning, 1996; Lynch & McNamara, 1998; Van Weeren & Theunissen, 1987; Weir, 2005) have repeatedly shown the effects of rater judgments to be a source of variance. Indeed, subjective rater judgment can have a significant impact on the reliability of scores as measures of L2 writing ability, resulting in inaccurate promotion decisions or classification errors in placement. Rater decision-making, therefore, warrants examination. In most direct writing exam situations where the samples are scored by human raters, even when raters receive extensive training and rate writing samples according to a set of scoring criteria, there is evidence of subjectivity involved in the rating of essays by human raters (Weigle, 1998; Weir,


2005), which is a “complex and error-prone cognitive process” (Cronbach, 1990, p. 584). As Myford and Wolfe (2000) explained, subjective ratings are “rooted in observation, interpretation, and perhaps most importantly, the exercise of personal and professional judgments” (p. 465). Indeed, observational methods, such as the subjective scoring involved in rating essays, are generally more “vulnerable to the fallibilities of human perceivers than almost any other method” (Spool, 1978, p. 853). Some studies (e.g., Engelhard, 1994; Myford & Wolfe, 2000) have focused on types of rater judgments linked to systematic tendencies in rating performances, commonly referred to as “rater effects.” These include halo effects, central tendency, restriction of range, and rater severity/leniency, which Cronbach (1990) referred to as the most serious rater error. Rater severity is the relative likelihood of raters to award lower scores, and leniency, is its counterpart for higher scores. According to Saal, Downey, and Lahey (1980), severity is the tendency of a rater to rate examinees below the midpoint of a scoring scale, while rater leniency is the tendency to rate examinees above the midpoint of a scoring scale. It has been shown that even after rater training, tendencies of raters to be severe or lenient persist (Weigle, 1994, 1998) and could require adjustment of examinee estimates. While some variability in rater judgments is expected and may even be desirable (McNamara, 1996), systematic tendencies of some raters to rate performances either more or less severely than other raters may contribute to construct-irrelevant variability and unfairly affect the success of a particular examinee. It is conceivable, for example, that one examinee whose paper was scored by the most severe raters could receive a score just below the designated cut score as compared to an examinee of equal or lesser writing ability whose paper was scored by the most lenient raters and received a score just above passing. The variance in these test scores would not just reflect variance in examinee ability but also differences in rater severity. As McNamara (1996) pointed out, rater severity thus has serious implications, particularly for examinees that score near the cut score. At the program level in a department, this could mean an additional semester or even a year for a student before graduating. The persistence of rater effects raises the question of what is involved in rater decision-making. Some studies (e.g., Shohamy, Gordon, & Kraemer, 1992; Stansfield & Ross, 1988) have examined the effects of rater characteristics, such as academic discipline (Vann, Lorenz, & Meyer, 1990) or L1 (Johnson, 2009; Kim, 2009; Kobayashi, 1992). Although Shohamy et al. (1992) found rater background (i.e., professional EFL-speakers vs. professional native English speakers) to have little effect on rater reliability, other studies (Mendelsohn & Cumming, 1987; Santos, 1988) suggested that characteristics of the individual raters, such as educational background or L1, affect rater judgments. Mendelsohn and Cumming (1987) found that raters who were ESL specialists applied different criteria than did raters from other disciplines. In examining differences between English and ESL faculty evaluations of native and ESL student essays, Brown (1991) discovered that the English faculty paid attention to sentence-level features, such as cohesion and syntax, while the ESL faculty put strong emphasis on organization. 
In an exploratory study for the Test of English as a Foreign Language (TOEFL), Cumming (2001) found that, overall, ESL/EFL-composition raters attended more extensively to language than to rhetoric and ideas, while the attention of English-as-a-Mother-Tongue-composition raters was more evenly balanced than their ESL/EFL counterparts. Sweedler-Brown (1993) found a similar tendency in English instructors who were not trained in ESL. Although the English instructors agreed that rhetorical competence was more important than grammatical accuracy, when scoring the essays they placed greater emphasis on ESL sentence-level errors. Although the results have been mixed, the literature clearly suggests that rater characteristics, such as rater background may have an influence on rater judgments. Another factor impacting rater judgment is rater engagement, or what Vaughan (1991) described as a “psychological link with the writers of the paper” (p. 118). Rater engagement with the examinee might be likened to the relationship that a reader develops with a writer. It might also be related to what Guilford (1954) and Myford and Wolfe (2004) referred to as “ego-involvement.” According to Guilford, “. . . [R]aters tend to rate those whom they know well, or in whom they are ego-involved, higher than they should” (p. 278). That is, as the rater personalizes the examinee, he feels he “knows” the examinee and this involvement might have an impact on scoring patterns. Huot (1993) also found this “personal engagement” of the rater in his study of raters’ decision-making behaviors as the raters were scoring holistically. Indeed, upon examining raters’ think aloud protocols, Huot found that expert


raters seemed to make a greater variety of comments, "taking the time to interact with student writing without having to evaluate what they were reading" (p. 219).

Beyond involvement with the writer, engagement with the text, as described in an information processing model (Anderson, 1983) or as schema activation, may be regarded as ego engagement. In order to understand a text, an individual requires prior knowledge of a particular domain, either procedural or declarative. That is, a reader reads and interprets any text based on information already available from prior experience or knowledge. Raters of L2 writing scripts are readers, and they bring with them mental representations of genre, conventions, and their own set of values. Indeed, Stock and Robinson (1987) maintained that rater expectations may be as important as the quality of the text itself, given the nature of the process of reading. Hamp-Lyons and Matthias (1994) found that essay topics judged by raters to be more difficult tended to receive higher scores than topics judged to be easier. Sakyi (2000) found that raters' expectations and/or biases influenced ratings in his investigation of raters' use of a holistic scoring scale; not all the raters focused on the scoring rubric, and some simply assigned a score based on their personal reactions to the text. In sum, raters appear to engage in a dialogue with the text and, in so doing, bring their own expectations to the rating process.

Given the potential impact of rater judgments on scoring decisions, it is important to understand the nature of raters' decision-making processes. A number of studies have examined the decision-making behaviors of raters (Connor & Carrell, 1993; Cumming, 1990; Kondo-Brown, 2002; Lumley, 2005; Milanovic, Saville, & Shuhong, 1996; Orr, 2002; Sakyi, 2000; Smith, 2000; Vaughan, 1991). Some have explored rater behavior through think aloud protocols (Cumming, Kantor, Powers, Santos, & Taylor, 2000; Cumming, 1990; Weigle, 1998), an approach that has proven to be an important methodology for investigating rater decision-making behavior.

1. Context of the study

The current study investigated the decision-making behaviors of raters as they scored a performance-based assessment of second language writing ability in an urban community college. Concerns about the reliability of test scores used for promotion decisions prompted an investigation of this second language writing assessment, and concerns about the impact of rater effects on scores were the focus of this part of the study.

The total enrollment in this college was approximately 19,000 students, representing more than 100 countries, and more than 85% of the student body was of Asian, Hispanic, or Black ethnicity. Many of those students were first- or second-generation immigrants; those who were native speakers of English tested into either remedial English or freshman English composition. ESL students who could not pass the basic writing proficiency exam were placed in ESL intensive writing courses (i.e., ESL 054, ESL 062, ESL 094, or ESL 095) in Developmental Skills, a department that served from 1,500 to 1,700 students each academic year. To exit one of the lower ESL levels, students had to pass an L2 writing assessment administered and scored by the department. The exit exam was an impromptu, timed essay in which students were offered a choice of prompts for a narrative or a persuasive essay.
Essays were holistically scored by two faculty raters and awarded a composite score that was the average of the two ratings. The question of rater consistency and inter-rater reliability was a recurring concern each semester after midterm and final exam scoring sessions.

2. Method

To investigate rater effects in this test situation, 78 writing samples were collected from 39 students in the three ESL levels. The 78 samples included 39 narrative and 39 persuasive essays on a related topic (the subway), with a representative number from each of the three ESL levels (054, 062, and 094). Eight faculty raters, all of whom had obtained certification as CUNYACT (American College Testing) raters, were normed for this particular assessment, first on a holistic rubric and then on an analytic rubric. In each norming session, raters reviewed the rubric, scored a sample of six essays (three persuasive and


Table 1
Participating examinees.

ACT score   BMCC level   Proficiency level    No. of examinees
2–3         ESL 054      Beginning            13
4           ESL 062      Intermediate         13
5–6         ESL 094      High intermediate    13

three narrative), and through negotiation and discussion, raters reached consensus regarding scores for the sample essays.

Based on Bachman and Palmer's (1996) model of communicative language ability, the six-point holistic rubric measured L2 writing ability in five domains: task fulfillment, control of topic development, organizational control (logical structure and use of rhetorical devices), sociolinguistic competence, and grammatical control. This rubric was created through a preliminary study to align domains with departmental curricular objectives and rater criteria for scoring, as reported during previous departmental norming sessions (Wiseman, 2005). The analytic scoring rubric measured L2 writing ability on a six-point scale in each of the same five domains (see Appendices A and B).

A norming session modeled after the ACT certification norming session and the department exam norming session was conducted prior to each scoring session. Raters read and reviewed the rubric and discussed the performance criteria for each category of each domain. They subsequently scored a set of six papers, and raters who were scoring too high or too low were normed through a discussion of the writing samples and performance criteria. Each rater then scored all writing samples, first using the holistic rubric and then using the analytic rubric. Each writing sample was, therefore, assigned a total of six scores by each rater, constituting a fully-crossed data set. As raters scored the last six essays (two from each ESL level, three of them narrative and three persuasive), they recorded their decision-making processes using think aloud protocols. The think aloud protocols for all raters were transcribed and then coded by three different people for references to the different facets of the test context. The coded think alouds were then examined for patterns to gain insights into the decision-making of raters, particularly with regard to rater severity.

3. Participants

Participants in the current study included 39 adult students (ages 18 and over) of diverse first language backgrounds, including Spanish, Chinese, Korean, Japanese, Polish, and Russian. The student population in this community college ranged from recent high school graduates to international students to older returning students. They had resided in the U.S. anywhere from two months to 15 or more years. This group of students represented a range of L2 writing ability levels, from beginning ESL 054 to intermediate ESL 062 to high intermediate ESL 094. Placement in ESL was generally based on the student's ACT Sample Writing Test score on a six-point holistic scale (see Table 1). The participants were selected from these naturally assembled groups of ESL students through a stratified random sampling procedure, with equal allocation (n = 13) from each of the three strata in the targeted population (ESL 054, ESL 062, and ESL 094), representing the levels required to take the departmental final exam.

The study also included eight experienced raters who were L2 developmental writing instructors in an ESL Developmental Skills department. All had advanced degrees in TESOL or related fields, but they had come to teaching from varied backgrounds. In addition to his studies in ESL and bilingual education, Edward was also a novelist. Dorothy had a master's degree in English literature. Helen had specialized in French. Carol had experience in writing and editing screenplays.
All raters had considerable experience teaching ESL and/or writing. All but Gail had taught at this community college for five or more years. In addition, all regularly participated in the departmental norming and scoring sessions for midterm and final exams (Table 2).

Table 2
Rater profiles.

Rater     Yrs teaching   Yrs in dev skills   Yrs teach wr   Yrs rating writing   Yrs rating dept final   ACT certified   Sex   Age    Highest degree earned   Discipline
Ann       15             10                  8              3.5                  4                       Y               F     52     PhD                     Linguistics
Betty     45             34                  38             34                   38                      Y               F     62     EdD                     TESOL
Carol     17             9                   14             11                   9                       Y               F     61     MA                      English Ed
Dorothy   11             5                   5              5                    5                       Y               F     40     MA                      English
Edward    26             24                  24             24                   24                      Y               M     53     EdD                     Bilingual Ed/TESOL
Fran      30             20                  25             25                   20                      Y               F     57     EdD                     Applied Ling
Gail      17             1                   14             14                   1                       Y               F     33     MA                      TESOL/Applied Ling
Helen     32             12                  19             19                   10                      Y               F     54     MAT                     ESL/French
Mean      24             14                  18             17                   14                                            51.5

4. Procedures

To address concerns regarding rater reliability, this study investigated the degree to which raters were consistent in their judgments of L2 writing ability. It also looked at scoring bias between raters and examinee L2 writing ability, scoring method, and prompt type. Finally, through verbal reports, the study explored the differential justifications, considerations, and concerns of severe and lenient or inconsistent raters in making judgments about L2 writing ability based on timed writing samples.

To this end, a mixed methods embedded design was used (Cresswell & Clark, 2007). That is, quantitative and qualitative data sets were mixed at the design level. The qualitative data set was embedded within the quantitative data, which was framed within the Rasch measurement model (Caracelli & Greene, 1993). Quantitative data were first collected to examine the main effects of each of the facets and then to examine any scoring bias among those facets, particularly between rater severity and each of the other facets. Qualitative data, which played a supplemental role in this investigation, were then collected to explain patterns of rater decision-making, such as rater severity/leniency. This study used a one-phase approach in that all data were collected in the same time frame.

The main effects of rater severity were explored using two different Multi-Faceted Rasch Measurement (MFRM) models, i.e., Masters' Partial Credit Model and Andrich's Rating Scale Model. FACETS (Linacre, 2005) provided several indices of main effects, including ability/difficulty estimates (i.e., observed score, observed average, observed count, fair measure average, and logits); an examinee separation ratio (an index of the spread of examinee proficiency estimates relative to their precision, i.e., variability); an examinee separation index (the number of measurably different levels of examinee proficiency); the reliability of the examinee separation (how well we can differentiate among examinees in terms of level of proficiency); and a chi-square statistic (a test of the fixed-effect hypothesis that all examinees in the sample have the same level of proficiency). Using these MFRM models, the study also explored potential bias on examinee scores attributable to rater severity by identifying unexpected responses for individual examinees by rater, based on the observed score as compared to the expected score/count for each examinee. FACETS produced a precise estimate, in logits, of the degree of severity or leniency with which an examinee was rated by a particular rater on a particular item using a particular rubric, which was used to determine the significance of bias. The identification of unexpected responses and of scoring bias between raters and particular examinees provided information regarding the rating patterns of particular raters.

5. Results

5.1. Holistic scoring session

The All-Facet Vertical Map for the holistic scoring session below (FACETS, Linacre, 2005) presents estimates in logits for examinee ability, with examinees demonstrating the highest ability listed at the top and those with the lowest ability at the bottom of the scale; estimates for rater severity, with the most severe raters listed at the top and the most lenient at the bottom; estimates for the difficulty of the prompts, with the most difficult prompt at the top and the easiest at the bottom; and, finally, estimates for the range of difficulty of the scoring categories in the holistic rubric (Fig. 1).

5.2. Rater severity measures

Table 3 summarizes the main effects of rater severity based on the Rasch analysis.
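For reference, the measurement model underlying these estimates can be stated in its general form (this is a standard formulation of the many-facet Rasch model, not an equation reproduced from the article):

\[ \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k \]

where \(B_n\) is the ability of examinee n, \(C_j\) the severity of rater j, \(D_i\) the difficulty of item (scoring domain) i, and \(F_k\) the difficulty of the step up to category k; a term for prompt difficulty can be added in the same way. In Andrich's Rating Scale Model the steps \(F_k\) are shared across items, whereas in Masters' Partial Credit Model each item carries its own steps \(F_{ik}\). In the fully-crossed design described in Section 2, each examinee contributes 2 prompts × 8 raters × 1 holistic item = 16 holistic observations and 2 × 8 × 5 = 80 analytic observations, which correspond to the observed counts reported later in the examinee measure tables.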
The degree of severity each rater exercised when rating holistically ranged from the most lenient rater, Gail, at −.43 logits to the most severe rater, Dorothy, at .45 logits. This .88-logit spread suggested some inconsistencies among the raters in their ratings; however, rater severity measures were distributed over a much narrower range than examinee ability measures (a 9.67-logit spread), showing less variation in severity among raters than variation in the ability of examinees. This suggested that the impact of individual differences in rater severity on examinee scores was relatively small. The standard error, an indicator of the precision of the severity measure, was acceptable for all raters, ranging from .15 to .16. The chi-square was statistically significant, χ2(7, N = 8) = 27.9, p < .01, which indicated that even after


Fig. 1. All Facet Vertical Map, holistic scoring session.

allowing for measurement error, the probability that these eight raters exercised the same degree of severity when rating examinees was virtually zero. Indeed, the rater separation index (1.73) was high and indicative of substantial differences in severity among raters. The FACETS analyses thus suggested that these raters did not function interchangeably but differed in the degree of severity they exercised in rating examinees, although the impact of those individual differences on examinee scores was small.
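For readers unfamiliar with these summary statistics, the separation and reliability figures reported at the foot of Table 3 are related by the standard definitions (stated here for reference; the article itself does not give the formulas):

\[ G = \frac{SD_{\mathrm{true}}}{RMSE}, \qquad R = \frac{G^2}{1+G^2} \]

With the rounded values reported for the holistic session (adjusted true S.D. = .27, RMSE = .16), G is approximately 1.7, close to the reported 1.73, and R = 1.73^2/(1 + 1.73^2) ≈ .75, matching the reported reliability; the analytic-session values reported later (separation 5.25, reliability .96) satisfy the same relation.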


Table 3
Rater measures (N = 8), holistic scoring session.

Rater     Observed average   Fair average   Measure (in logits)   S.E.   Infit MnSq   Infit ZStd   Outfit MnSq   Outfit ZStd
Ann       3.2                3.07           −.19                  .15    1.12          .70         1.07           .40
Betty     3.2                3.06           −.17                  .16     .64        −2.50          .65         −2.30
Carol     3.2                3.10           −.24                  .15    1.32         1.80         1.24          1.30
Dorothy   2.9                2.74            .45                  .16    1.13          .70         1.32          1.60
Edward    3.0                2.84            .25                  .16     .85         −.90          .87          −.70
Fran      3.1                2.98           −.02                  .16     .66        −2.30          .66         −2.10
Gail      3.3                3.20           −.43                  .15     .70        −2.00          .70         −1.90
Helen     2.9                2.79            .35                  .16    1.47         2.50         1.35          1.80
Mean      3.8                3.79            .00                  .21     .98          .98          .98           .98
SD         .1                 .16            .37                  .00     .30         2.00          .28          1.80

Model, Sample – RMSE: .16, Adj (True) S.D.: .27, separation: 1.73, reliability: .75.
Model, Fixed (all same) chi-square: 27.9, d.f.: 7, significance (probability): .00.

Fit indices for holistic scoring session showed that several raters did not score as predicted by the Rasch model. Showing less variation than predicted were over-fitting raters Fran (infit −2.30; outfit −2.10) and Betty (infit −2.50; outfit −2.30); that is, these raters adhered closer to the model predictions than expected. This might be an issue if this were the result of a central tendency or halo effect in the scoring patterns of these raters. Of more concern was the only misfitting rater Helen. This rater’s somewhat inconsistent rating patterns (infit 2.50; outfit 1.80) called for a review of her unexpected responses to determine if further training were necessary. Rater severity effects on examinee scores were further explored by examining unexpected responses and fair averages of examinee ability estimates. Table 4 below shows six unexpected responses assigned by Dorothy and five by Helen. Lenient rater Carol scored examinees 11 and 31

Table 4
Unexpected responses (holistic scoring session).

Category   Step   Exp.   Resd   St. res.   N   R         Nu   Ex   N   Items   N   Rubric
2          2      4.3    −2.3   −2.8       1   Ann       35   35   2   Pers    1   H
1          1      4.3    −2.3   −2.8       1   Ann       28   28   1   Narr    1   H
2          2      3.7    −1.7   −2.1       1   Ann       11   11   2   Pers    1   H
4          4      2.6     1.4    2.1       1   Ann        5    5   1   Narr    1   H
4          4      5.4    −1.4   −2.1       1   Ann       37   37   2   Pers    1   H
5          5      3.2     1.8    2.4       2   Betty     28   28   2   Pers    1   H
6          6      3.5     2.5    3.3       3   Carol     11   11   1   Narr    1   H
5          5      3.0     2.0    2.7       3   Carol     24   24   2   Pers    1   H
2          2      4.1    −2.1   −2.6       3   Carol     31   31   1   Narr    1   H
6          6      4.2     1.8    2.3       3   Carol     17   17   2   Pers    1   H
2          2      1.0     1.0    4.5       4   Dorothy    1    1   2   Pers    1   H
1          1      3.0    −2.0   −2.7       4   Dorothy   19   19   2   Pers    1   H
2          2      3.9    −1.9   −2.3       4   Dorothy   26   26   2   Pers    1   H
2          2      3.6    −1.6   −2.0       4   Dorothy   34   34   1   Narr    1   H
5          5      3.5     1.5    2.0       4   Dorothy   32   32   2   Pers    1   H
2          2      1.2      .8    2.0       4   Dorothy   39   39   2   Pers    1   H
3          3      5.0    −2.0   −2.6       5   Edward    37   37   1   Narr    1   H
3          3      1.6     1.4    2.4       5   Edward     3    3   1   Narr    1   H
3          3      1.7     1.3    2.1       5   Edward     3    3   2   Pers    1   H
5          5      3.4     1.6    2.1       6   Fran       8    8   2   Pers    1   H
5          5      2.8     2.2    3.1       7   Gail      12   12   1   Narr    1   H
5          5      3.3     1.7    2.2       7   Gail      27   27   2   Pers    1   H
1          1      3.7    −2.7   −3.4       8   Helen     35   35   1   Narr    1   H
2          2      4.3    −2.3   −2.8       8   Helen     20   20   2   Pers    1   H
5          5      3.2     1.8    2.5       8   Helen     21   21   2   Pers    1   H
5          5      3.2     1.8    2.4       8   Helen     10   10   1   Narr    1   H
4          4      2.5     1.5    2.1       8   Helen     16   16   2   Pers    1   H


Table 5 Examinee ability measures (holistic scoring session). Obs score

Obs count

Obs avg

Fair M avg

Meas

Model S.E.

Infit MnSq

Zstd

Outfit MnSq

ZStd

Est discrm

Ex

Level

17 60 28 36 41 35 44 52 30 56 55 43 55 33 49 42 62 40 50 71 52 50 49 45 42 65 51 42 55 59 65 58 34 64 65 80 83 46 20

16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 14 16 16 16 16 16 16 16 16 16 16 16

1.1 3.8 1.8 2.3 2.6 2.2 2.8 3.3 1.9 3.5 3.4 2.7 3.4 2.1 3.1 2.6 3.9 2.5 3.1 4.4 3.3 3.1 3.1 2.8 2.6 4.1 3.2 30 3.4 3.7 4.1 3.6 2.1 4.0 4.1 5.0 5.2 2.9 1.3

1.06 3.74 1.75 2.25 2.56 2.19 2.75 3.24 1.88 3.49 3.45 2.69 3.43 2.07 3.06 2.62 3.87 2.50 3.12 4.44 3.24 3.12 3.06 2.81 2.62 4.06 3.18 3.00 3.43 3.68 4.06 3.62 2.13 3.99 4.06 5.00 5.19 2.87 1.24

−6.70 .6 −3.40 −2.15 −1.46 −2.29 −1.08 −.15 −3.07 .27 .16 −1.20 .16 −2.59 −.48 −133 .85 −1.59 −.37 1.70 −.15 −.37 −.48 −.96 −1.33 1.14 −.26 −.60 .16 .57 1.14 .47 −2.44 1.04 1l14 2.61 2.97 −.83 −5.12

1.02 .31 .41 .38 .36 .38 .35 .33 .40 .32 .32 .36 .32 .39 .34 .36 .31 .37 .34 .31 .33 .34 .34 .35 .36 .31 .33 .36 .32 .31 .31 .32 .39 .31 .31 .33 .35 .35 .56

1.03 .59 1.23 .46 .82 .94 .93 .94 .79 .80 2.00 1.28 .56 .93 .25 .66 .6 .95 .91 1.29 .83 .87 1.02 .98 .82 .86 1.10 2.26 .80 1.00 1.23 .93 .99 .77 1.79 .96 1.41 .77 .85

.3 −1.3 .7 −1.7 −.4 .0 .0 .0 −.5 −.5 2.3 .8 −1.3 .0 −2.9 −.9 −1.0 .0 −.1 .9 −.4 −.2 .1 .0 −.4 −.3 .1 2.5 −.4 .1 .7 .0 .0 −.6 2.0 .0 1.1 −.5 −.2

1.33 .59 1.26 .46 .83 .96 .93 .95 .78 .83 1.99 1.25 .57 .95 .27 .66 .65 .96 .93 1.28 .84 .87 1.02 .96 .81 .86 .99 2.27 .79 1.00 1.22 .94 .98 .78 1.80 .98 1.36 .77 .81

.6 −1.3 .8 −1.7 −.3 .0 .0 .0 −.5 −.4 2.3 7 −1.3 .0 −2.8 −.9 −1.0 .0 .0 .9 −.3 −.2 .1 .0 −.4 −.3 .0 2.5 −.5 .1 .7 .0 .0 −.6 .0 .0 1.0 −.5 −.3

.95 1.46 .69 1.55 1.19 1.03 1.09 1.06 1.25 1.19 .01 .82 1.41 1.06 1.75 1.36 1.39 1.03 .98 .57 1.19 1.11 .96 1.10 1.19 1.12 1.03 −.29 1.20 1.01 .74 1.11 .99 1.24 −.21 .99 .55 1.22 1.15

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

054 054 054 054 054 054 054 054 054 054 054 054 062 062 062 062 062 062 062 062 062 062 062 062 062 094 094 094 094 094 094 094 094 094 094 094 094 094 094

lower than expected on the narrative essay and examinees 24 and 31 higher than expected on the persuasive essay. To further explore the effect of this variability in rater severity, the fair averages of examinees were examined and were found to fall within .1 raw score units from the unadjusted raw score averages for all examinees (see Table 5 below).
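The unexpected-response and fit indices used in this section follow the usual Rasch definitions, summarized here for reference (the article itself does not state them). For an observed rating x with model expectation E and model variance W, the residual is x − E and the standardized residual is

\[ z = \frac{x - E}{\sqrt{W}}, \]

with |z| of roughly 2 or more flagging an unexpected response; for example, the first row of Table 4 pairs an observed category of 2 with an expectation of 4.3, a residual of −2.3. Outfit is the unweighted mean of the squared standardized residuals and infit the information-weighted mean,

\[ \mathrm{outfit} = \frac{1}{N}\sum z^2, \qquad \mathrm{infit} = \frac{\sum W z^2}{\sum W}, \]

both with expectation 1, so that values well below 1 indicate less variation than the model predicts (overfit) and values well above 1 indicate more (misfit).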

5.3. Scoring bias

Table 6 shows the 12 bias interactions between raters and examinees in the holistic scoring session for which the t-value exceeded ±2, as determined by the t-score fit statistic (t > +2 or t < −2). Dorothy, usually severe in rating, showed significantly greater leniency than expected in rating examinees 1 and 32, but rated examinees 34 and 26 with significantly more severity than expected. Severe rater Helen scored examinee 28 with significantly greater leniency than expected and examinees 20, 8, and 35 with greater severity.
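As a point of reference (this is the usual FACETS convention rather than a formula given in the article), each bias measure is an estimate in logits of how far a particular rater–examinee combination departs from the model's expectation, and the accompanying t is approximately that estimate divided by its standard error,

\[ t \approx \frac{\hat{B}_{jn}}{SE(\hat{B}_{jn})}, \]

with |t| > 2 treated as a significant interaction. In the analytic-session bias report (Table 10), where the standard errors are listed explicitly, Dorothy's 2.39-logit bias for examinee 39 divided by its .43 error gives roughly 5.6, in line with the reported t of 5.59.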


Table 6
Scoring bias between raters and examinees, holistic scoring session (higher score = higher bias measure) (part).

Rater     Examinee   Obsvd score   Exp. score   Obsvd count   Bias (logit)   Fit t-score
Ann       28         3             6.2          2             −3.73          −2.95
Betty     28         9             6.2          2              2.23           2.56
Carol     24         8             5.9          2              1.77           2.04
Dorothy    1         3             2.1          2              3.01           2.38
Dorothy   32         9             6.7          2              1.78           2.04
Dorothy   34         5             7.4          2             −2.19          −2.12
Dorothy   26         5             7.5          2             −2.28          −2.21
Edward     3         6             3.3          2              3.05           3.16
Helen     28         8             5.7          2              1.90           2.19
Helen     20         6             8.4          2             −1.95          −2.02
Helen      8         4             6.1          2             −2.26          −2.02
Helen     35         5             7.7          2             −2.39          −2.30

5.4. Analytic scoring session

The All-Facet Vertical Map for the analytic scoring session in Fig. 2 below presents estimates for examinee ability, rater severity, item difficulty, and difficulty estimates for each category in the analytic scale.

5.5. Rater severity measures

Table 7 shows that rater severity measures for the analytic scoring session ranged from the most lenient rater, Betty, at −.55 logits to the most severe rater, Dorothy, at .49 logits. This 1.04-logit spread was smaller than that of examinee ability measures (a 7.15-logit spread), showing that there was less variation in severity among raters than there was variation in the ability of examinees and suggesting that the impact of individual differences in rater severity on examinee scores was small. The chi-square for the rater severity estimates based on the analytic rubric was statistically significant, χ2(7, N = 8) = 198.4, p < .01. Indeed, the rater separation index (5.25) was high and indicative of substantial differences in severity among raters.

Fit statistics for each rater reported in Table 7 showed unexpected variation in the ratings of some raters. Dorothy and Carol were misfitting; that is, the fit indices for these raters showed more variation than expected. For Dorothy, both fit indices (infit 4.40; outfit 5.20) were high, indicative of a large proportion of unexpected ratings, a "random" type of error. Overfitting raters Gail, Edward, and Betty showed less variation than expected. Further examination of unexpected responses for both the misfitting and overfitting raters could help identify rating patterns, such as central tendency, and suggest the need for an intervention component in rater training sessions to modify these rater effects.

Table 7
Rater summary table (N = 8), analytic scoring session.

Rater     Observed average   Fair average   Measure (in logits)   S.E.   Infit MnSq   Infit ZStd   Outfit MnSq   Outfit ZStd
Ann       2.9                2.87           −.04                  .06     .98          −.20          .96          −.50
Betty     3.3                3.22           −.55                  .06     .79         −3.10          .84         −2.40
Carol     2.7                2.57            .43                  .07    1.22          2.90         1.21          2.60
Dorothy   2.6                2.53            .49                  .07    1.35          4.40         1.45          5.20
Edward    3.0                2.97           −.19                  .06     .79         −3.10          .78         −3.30
Fran      2.8                2.74            .16                  .07    1.09          1.20         1.05           .60
Gail      3.0                2.90           −.09                  .06     .80         −2.90          .79         −3.00
Helen     3.0                2.99           −.21                  .06    1.03           .40         1.01           .10
Mean      2.9                2.85            .00                  .06    1.01          −.10         1.01          1.01
SD         .2                 .21            .32                  .00     .20          2.90          .22           .23

Model, Sample – RMSE: .06, Adj (True) S.D.: .34, separation: 5.25, reliability: .96.
Model, Fixed (all same) chi-square: 198.4, d.f.: 7, significance (probability): .00.


Fig. 2. All Facet Map (N = 39), analytic scoring session.

Table 8 summarizes the unexpected responses of raters when scoring analytically. Carol gave unexpected responses in scoring five examinees 37, 1, 30, 3, and 35. She gave examinee 37 a low score of 1 on task fulfillment when the expected score was 4.6, while she assigned examinees 1 and 4 more lenient scores on task fulfillment. Carol also gave an unexpectedly lenient score (6) to examinee 35 in content development. Dorothy, the most severe rater who demonstrated a characteristically rigid focus on the rubric in the think aloud protocols, veered from this pattern of detachment from the examinee, spoke of the writer and text, and gave examinee 39 unexpectedly high scores for the persuasive essay in four domains: content development, grammatical control, task fulfillment, and organizational control. She gave examinee 32 higher scores than predicted by the Rasch model in the two domains of content and organizational control, but was particularly severe in scoring examinee 37 in grammatical control.


Table 8
Unexpected responses, analytic scoring session.

Category   Step   Exp.   Resd   St. res.   N   R         Nu   Ex   N   Items   N   Rubric
2          2      4.7    −2.7   −3.7       1   Ann       37   37   1   Narr    1   Task fulfillment
1          1      4.4    −3.4   −4.3       2   Betty     20   20   2   Pers    1   Task fulfillment
1          1      4.6    −3.6   −4.9       3   Carol     37   37   2   Pers    1   Task fulfillment
2          2      1.0     1.0    4.4       3   Carol      1    1   1   Narr    1   Task fulfillment
6          6      3.0     3.0    3.6       3   Carol     30   30   2   Pers    5   Grammatical control
4          4      1.8     2.2    3.2       3   Carol      3    3   1   Narr    1   Task fulfillment
6          6      3.4     2.6    3.1       3   Carol     35   35   1   Narr    2   Content development
4          4      1.3     2.7    5.6       4   Dorothy   39   39   2   Pers    2   Content development
4          4      1.3     2.7    5.6       4   Dorothy   39   39   2   Pers    5   Grammatical control
3          3      1.2     1.8    3.9       4   Dorothy   39   39   2   Pers    1   Task fulfillment
3          3      1.3     1.7    3.4       4   Dorothy   39   39   2   Pers    5   Organizational control
2          2      4.5    −2.5   −3.1       4   Dorothy   37   37   2   Pers    5   Grammatical control
5          5      2.5     2.5    3.1       4   Dorothy   32   32   1   Narr    2   Content development
5          5      2.6     2.4    3.0       4   Dorothy   32   32   1   Narr    3   Organizational control

As with the holistic scoring analysis, the fair averages for all examinees fell within .1 raw score units of the unadjusted raw score averages (Table 9).

5.6. Scoring bias in the analytic scoring session

Although participants were all ACT COMPASS certified, experienced department exam raters, and had undergone an extensive norming session on the holistic rubric and subsequently on the analytic rubric, Rasch scoring bias analysis showed that, when scoring analytically, all raters scored certain examinees more leniently than predicted by the model (Table 10). Given that each rater scored 2 essays written by each examinee and assigned each essay 5 scores, for a total of 10 observations when scoring analytically, the interactions are easily influenced by one or two strange observations. Given the size of this sample, there is not much evidence that these bias interactions would replicate in a future data collection (Linacre, 2011, personal communication). Even so, of particular interest were the results for the severe raters Dorothy, who showed bias with 18 examinees, and Carol, who showed bias in scoring 10 examinees. Dorothy showed leniency when rating examinee 39 (2.39 logits, t = 5.59) and examinee 7 (1.62 logits, t = 4.34). Carol too showed significant leniency in rating examinees 39, 26, 20, and 3. These two severe raters also showed significant bias in scoring certain examinees more severely than expected by the model: Dorothy was severe in scoring examinee 37 (−2.34-logit bias, t = −6.01), examinee 30 (−1.24-logit bias, t = −2.90), examinee 38 (−1.27 logits, t = −2.68), and examinee 21 (−1.22 logits, t = −2.66). Carol was severe in rating examinee 36 (−1.20 logits, t = −3.20) and examinee 28 (−1.40 logits, t = −2.86).

5.7. Summary of rater severity/leniency based on MFRM

Based on the Rasch analysis of the main effects of rater severity, Dorothy was the most severe rater across scoring sessions. At the other end of the spectrum were Gail, the most lenient when scoring holistically, and Betty, the most lenient when scoring analytically. Helen and Carol were inconsistent across the scoring sessions. Ann, Edward, and Fran showed moderation in scoring both analytically and holistically (Table 11).

6. Think aloud protocols

Finally, to supplement the MFRM investigation and further investigate rater decision-making behaviors, think aloud protocols were audio-recorded by each rater while scoring a sample of six essays in two scoring sessions, first when using the holistic rubric and then when using the analytic rubric. Before recording, raters were given a training session on the think aloud protocol. Written instructions for the think alouds were provided, and the procedure was demonstrated. Whereas in Ericsson


Table 9 Examinee ability measures (analytic scoring). Obs score

Obs count

Obs avg

Fair M avg

Meas

Model S.E.

Infit MnSq

Zstd

Outfit MnSq

ZStd

Est discrm

Ex

Level

90 278 173 189 182 165 215 254 148 225 268 216 279 170 218 212 284 205 263 307 231 224 214 230 193 314 215 192 259 261 303 241 168 311 306 354 377 222 112

80 80 80 80 75 80 79 80 80 80 80 80 80 79 80 80 79 80 80 80 80 80 80 80 80 80 80 72 80 79 80 80 80 80 80 80 80 80 80

1.1 3.5 2.2 2.4 2.4 2.1 2.7 3.2 1.9 2.8 3.4 2.7 3.5 2.2 2.7 2.7 3.6 2.6 3.3 3.8 2.9 2.8 2.7 2.9 2.4 3.9 2.7 2.7 3.2 3.3 3.8 3.0 2.1 3.9 3.8 4.4 4.7 2.8 1.4

1.3 3.56 2.22 2.42 2.48 2.12 2.79 3.26 1.90 2.89 3.43 2.77 3.54 2.18 2.77 2.69 3.65 2.60 3.34 3.89 2.93 2.85 2.71 2.92 2.45 3.79 2.56 2.55 3.10 3.17 3.65 2.88 2.00 3.76 3.69 4.31 4.61 2.65 1.34

−5.20 .09 −1.95 −1.59 −1.51 −2.14 −1.01 −.33 −2.58 −.87 −.09 −1.04 .07 −2.02 −1.04 −1.16 .21 −1.30 −.22 .56 −.80 −.93 −1.12 −.81 −1.55 .41 −1.37 −1.39 −.55 −.46 .22 −.88 −2.38 .36 .27 .6 1.64 −1.23 −4.09

.33 .13 .15 .15 .15 .16 .14 .13 .17 .14 .13 .14 13 .15 .14 .14 .13 .14 .13 .13 .14 .14 .14 .14 .15 .13 .14 .15 .13 .13 .13 .14 .15 .13 .13 .14 .15 .14 .21

.86 .63 1.36 .47 .44 .74 .98 1.19 .94 1.49 1.04 .99 .86 .96 1.00 .80 .73 .80 .72 1.21 .74 .69 .76 .60 .91 1.467 .72 2.24 .78 1.17 .90 1.26 .83 .80 1.71 .84 1.83 1.22 1.70

−.3 −2.8 2.0 −4.1 −4.3 −1.7 .0 1.2 −.3 2.8 .3 .0 −.9 −.1 .0 −1.3 −1.9 −1.3 −2.0 1.3 −1.8 −2.2 −1.6 −3.0 −.5 2.7 −1.9 5.7 −1.5 1.0 −.6 1.6 −1.0 −1.4 4.0 −1.0 4.2 1.3 3.5

.81 .63 1.36 .47 .43 .73 .99 1.18 .92 1.45 1.05 .99 .86 .97 1.01 .80 .73 .81 73 1.25 .74 .68 .75 .59 .92 1.45 .72 2.23 78 1.19 .90 1.27 .84 .80 1.70 .83 1.77 1.22 1.93

−5 −2.8 2.1 −4.1 −4.4 −1.8 .0 1.1 −.4 2.6 .3 .0 −.9 −.1 .1 −1.3 −1.9 −1.2 −1.9 1.5 −1.8 −2.2 −1.7 −3.0 −.5 2.7 −1.9 5.6 −1.5 1.2 −.6 1.6 −1.0 −1.3 3.9 −1.1 4.0 1.3 4.3

1.08 1.43 .65 1.56 1.60 1.27 1.02 .79 1.05 .50 .93 .97 1.17 1.03 .97 1.21 1.33 1.21 1.29 .75 1.27 1.34 1.27 1.44 1.11 .46 1.28 −.34 1.27 .77 1.15 .73 1.17 1.25 .19 1.20 .30 .79 .40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

054 054 054 054 054 054 054 054 054 054 054 054 062 062 062 062 062 062 062 062 062 062 062 062 062 094 094 094 094 094 094 094 094 094 094 094 094 094 094

and Simon’s (1984) study in which raters were trained in think aloud protocols using math problems, these raters practiced think aloud protocols as they scored a paragraph from an actual writing sample. They were instructed to read and score the essays as they normally would, but remembering to say aloud whatever they were thinking as they scored each paper. The protocol was open-ended so as not to cue the raters to notice certain features. The think aloud protocols were then transcribed and coded by two coders, using a coding schema and procedure initially modeled on the procedures developed by DeRemer (1998), but modified through an iterative process. Raters’ references to the four facets of this study were identified and coded, as follows: • Comments re examinee L2 writing ability, e.g., evaluative remarks about the examinee or the examinee’s L2 writing ability or references to the domains of L2 writing ability (task fulfillment, content development, organization, sociolinguistic competence, or language control); • Comments re raters, e.g., self-observations of rating decision-making behaviors, rater affective states, rater attitudes or beliefs, fatigue, rater interest in topic or genre, and reactions to handwriting;


Table 10 Summary of rater-by-examinee bias (by rater) analytic scoring session (higher score = higher bias measure) (part). Rater

Examinee

Obsdscore

Expd score

Obs-exp average

Bias (logit)

Error

Fit t-score

Fit MnSq

Ann Ann Ann Ann Ann Ann Ann Ann Betty Betty Betty Betty Betty Betty Betty Betty Carol Carol Carol Carol Carol Carol Carol Carol Carol Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Dorothy Edward Edward Edward Edward Fran Fran Fran Fran Fran Fran Fran Gail Gail Gail Gail Gail Gail Gail

26 3 9 8 10 29 12 35 18 8 1 28 7 9 11 13 11 29 31 39 26 3 28 36 20 39 7 18 32 24 28 16 20 12 3 2 31 27 10 21 38 30 37 30 32 26 10 31 37 2 9 7 6 33 22 8 36 28 23 25 37

51 30 25 39 23 27 20 23 23 28 19 38 38 27 32 33 37 24 29 18 44 26 17 33 44 22 35 32 36 33 30 29 41 29 14 25 28 18 19 19 18 22 28 42 26 32 38 47 54 39 13 20 14 13 35 39 51 22 22 19 58

39.5 21.8 18.6 32.0 28.3 32.6 27.2 38.5 29.1 35.6 12 30.3 30.9 21.1 37.4 38.8 30.5 29.4 34.8 12.8 36.2 19.4 24.1 41.5 35.3 12.7 24.2 22.7 26.9 25.6 23.7 23.5 34.9 24 19.1 31.3 34.4 23.9 25 25.7 24.7 29.7 44.3 34.4 31.4 40.6 29.3 36.7 46.3 33.6 17.7 26.2 19.8 20.1 28.5 32.3 44.9 27.2 27.3 24.6 47.7

1.15 .82 .64 .70 −.53 −.56 −.72 −1.55 −.61 −.76 .70 .77 .71 .59 −.54 −.58 .65 −.54 −.58 .52 .88 .66 −.71 −.85 .87 .93 1.08 .93 .91 .74 .63 .55 .61 .50 −.51 −.63 −.64 −.59 −.60 −.67 −.67 −.77 −1.63 .76 −.54 −.86 .87 1.03 .77 .54 −.47 −.62 −.58 −.71 .65 .67 .61 −.52 −.53 −.56 1.03

1.92 1.34 1.19 .98 −.86 −.82 −1.25 −2.31 −.98 −1.09 2.19 1.08 1.00 1.00 −.76 −.81 .91 −.85 −.83 1.53 1.28 1.19 −1.40 −1.20 1.25 2.39 1.62 1.45 1.32 1.10 .99 .88 .86 .80 −1.32 −.96 −.92 −1.13 −1.11 −1.22 −1.27 −1.24 −2.34 1.07 −.81 −1.20 1.22 1.55 1.60 .75 −1.39 −1.09 −1.45 −1.90 .92 .93 1.13 −.88 −.88 −1.04 2.83

.46 .38 .41 .38 .42 .39 .45 .42 .42 .39 .46 .37 .37 .39 .38 .37 .37 .41 .39 .47 .40 .40 .49 .37 .40 .43 .37 .38 .37 .37 .38 .39 .38 .39 .58 .41 .39 .47 .46 .46 .47 .43 .39 .39 .40 .38 .37 .42 .51 .38 .65 .45 .58 .65 .37 .38 .46 .43 .43 .46 .77

4.18 3.50 2.94 2.61 −2.05 −2.09 −2.80 −5.50 −2.34 −2.81 4.76 2.88 2.67 2.54 2.01 −2.16 2.45 −2.06 −2.16 3.22 3.19 2.97 −2.86 −3.20 3.15 5.59 4.34 3.85 3.54 2.95 2.58 2.27 2.25 2.07 −2.25 −2.36 −2.36 −2.40 −2.42 −2.66 −2.68 −2.90 −6.01 2.79 −2.02 −3.20 3.26 3.71 3.10 2.00 −2.14 −2.44 −2.48 −2.93 2.48 2.47 2.44 −2.05 −2.06 −2.26 3.68

1.2 1.1 .4 .3 .6 .6 2.0 .5 .5 .7 .2 1.9 .3 .3 .2 .6 1.0 .5 .8 .3 1.3 .9 3.5 .4 .7 2.5 .3 .3 2.1 .2 .4 .8 .9 .5 1.0 .3 .2 .9 1.1 .8 1.4 1.1 .2 .6 .9 .2 .7 .4 .5 .8 1.1 1.0 .9 1.1 .4 .8 .5 1.4 .8 .8 .8


Table 10 (Continued) Rater

Examinee

Obsdscore

Expd score

Helen Helen Helen Helen Helen Helen Helen Helen

10 16 13 26 7 20 11 35

36 34 42 35 22 32 27 48

29.5 27.8 36.4 40.8 28.6 39.9 35.0 39.8

Obs-exp average .65 .62 .56 −.58 −.66 −.79 −.80 .82

Bias (logit)

Error

Fit t-score

Fit MnSq

.92 .89 .80 −.81 −1.08 −1.10 −1.16 1.30

.37 .37 .39 .37 .43 .38 .39 .43

2.45 2.40 2.07 −2.17 −2.52 −2.93 −2.93 3.06

1.4 .5 .7 1.8 .2 .6 1.2 .4

Table 11
Summary of rater severity/leniency.

Rater     Holistic   Analytic
Ann       M          M
Betty     M          L
Carol     M          S
Dorothy   S          S
Edward    M          M
Fran      M          M
Gail      L          M
Helen     S          M

• Comments re prompt type, e.g., references to the prompt, topic, or genre of the writing sample or the appropriacy of criteria for scoring the two different genres;
• Comments re rubric, e.g., rubric type, categories of the scale (e.g., "minimally successful"), or scoring method.

The transcriptions of the think aloud protocols were then summarized, and three coders subsequently coded the transcript summaries for references to each facet: examinee L2 writing ability, rater severity, prompt type, and rubric type. There were thus a total of 368 occasions for coder agreement on raters' comments (8 raters × 6 essays × 4 facets × 2 rating sessions). Inter-coder reliability was calculated using the proportion of exact agreement. The three coders who coded each transcript were in agreement on 272 out of a total of 368 rater references to the four facets; that is, inter-coder reliability was .75. Responses were then used to describe patterns of rater decision-making behaviors revealed through the Rasch analysis.

Various themes emerged post hoc from the categorization of these facets in the think aloud protocols, namely the impact of poor handwriting, fatigue, rater expectations with regard to prompt type, and rater preferences regarding scoring method. Of particular interest, however, was ego engagement with the text or writer, or what Guilford (1954) and Myford and Wolfe (2004) referred to as "ego-involvement," and the tendency for raters to rate those whom they know well, or with whom they are ego-involved, more leniently (Guilford, 1954, p. 278). As the rater personalizes the examinee, he feels he "knows" the examinee. This ego engagement with the examinee might thus be likened to the relationship that a reader develops with a writer, and in this study it was demonstrated in the think aloud protocols by a rater's tendency to personalize the examinee as writer or student, speculate on what the student could or could not do in English, comment on the writer's personality, or conjecture about the writer's circumstances. As evidenced by the think alouds, this tendency seemed to be related to rater severity: the lenient-to-moderate raters tended to personalize the writer and/or text, whereas the severe raters focused very rigidly on the performance criteria of the rubric.

6.1. Engagement with the examinee and/or the text

Moderate and lenient raters consistently made references to the examinee, describing him/her as "student" or "writer," and, to varying degrees, personalizing the test-taker. For example, in 6 out of


6 holistic think alouds, moderate raters Ann, Betty, Carol and Gail explicitly referred to the writer, the person or the student. The personalization of the examinee suggested that these raters wanted to understand the perspective of the writer as part of the scoring process and this involvement may have contributed to a more sympathetic reading of the essays and corresponding moderation in severity. Moderate rater Ann, for example, repeatedly explained what “the writer” did: “The writer takes a stand in the first body paragraph” (examinee 35, persuasive). Moderate rater Fran’s stance was similar: “You know, the writer tends to make up some numbers, which is something I don’t like personally” (examinee 35, persuasive). Lenient rater Gail systematically interpreted each essay as the expression of what “the student” was trying to say or do: “The student wants to give an opinion about that, saying that the subway is dangerous. . . .” (examinee 16, narrative). Moderate rater Carol allowed herself to draw conclusions about the examinee that went beyond observable performance: “Talks about expenses, uses himself as a student with children, talks about how much things cost . . . This person has to take the subway late at night” (examinee 16, persuasive); “Interesting touch that she had a nightmare” (examinee 12, narrative); and “He [the examinee] has a good storytelling device” (examinee 35, narrative). Some moderate raters went even further in their observations about the test-takers to routinely form what seemed to resemble a personal description of the examinee. For example, Betty made conclusions about the extent of one examinee’s knowledge of English, how another examinee felt about the Metropolitan Transit Authority [MTA], the psychological state of another examinee, and the suitability of a vocation in politics for another examinee: This person doesn’t know a lot of English, knows some, but not a lot. [Betty, examinee 16, narrative, holistic scoring session] [This] person doesn’t like the MTA service. [Betty, examinee 12, persuasive, holistic scoring session] Person likes numbers and seems to be up on the issues, which is good . . . there’s a little bit of paranoid theory here. [Betty, examinee 35, persuasive, holistic scoring session] Here’s an organized person . . . the person has been, is, on the side of the oppressed and the poor people, so is speaking up for them – he should probably go into politics. [Betty, examinee 16, persuasive, holistic scoring session] Betty seemed to consistently attempt to connect with and understand the examinee through reading the text. In contrast, severe raters engaged minimally with the test-taker or text and focused more on the performance criteria in the rubric or some internal criteria based on their own knowledge of writing, language, or teaching. Instead, both Helen and Dorothy tended to refer to the rubric. They seemed to be more engaged with the rubric than with the examinee, or even the essay text. To illustrate, in scoring the narrative essay written by examinee 16, Dorothy systematically read the performance criteria in the rubric: “I would say this essay is around a 2, with limited success in writing a persuasive [sic] essay, provides limited development of the topic. Although it is organized, there are extremely frequent errors. 
The writing lacks sentence variety." Dorothy did, however, refer to 'the writer' in 2 out of the 6 holistic think alouds, and when she wavered from her focus on the rubric in scoring the persuasive essay of one examinee (35), she scored the examinee less severely than predicted by the model. Dorothy drew a conclusion about the examinee, ". . . but I'm going to give this essay a 5, 'cause I'm actually quite impressed with their background knowledge. It shows they read the newspapers and paid attention to the news, and the fact that they were able to add to their essay that it was a 33% raise shows a bit of thinking." In this case, severe rater Dorothy seemed to visualize the writer before assigning him/her the highest score of all examinees. This personalization of the examinee by this characteristically severe rater, who typically focused on the rubric, corresponded with a more lenient rating. Helen, also a severe rater, followed the same general pattern as Dorothy in focusing on performance criteria defined in the rubric rather than on the examinee when scoring holistically. This rater primarily


focused on the essay (“it” or “this”) and the scoring criteria, with particular regard to grammatical errors, rather than to the message of the writer. In scoring examinee 12’s narrative essay, for example, Helen was focused on spelling and performance criteria listed in the rubric: Ok, now I’m on 19 . . .. One day, my friend . . .” [reads silently]. “Pulse.” I guess they mean ‘purse.’ – “The black man ran always – always with . . .” Now, we have “purse.” [Reads the entire paragraph] She calls the police officer. [Reads silently] Ok, there’s definitely a story. I’m not so sure that it is an essay. [Summary comment] There’s a discernible organizational pattern. It lacks clear focus in the development of the central idea. That is, what is the point of this? “That it’s a bad?” Ok, I’m going to give this a high 3. [Assignment of score] [The writer] demonstrates a minimal range of vocabulary and grammar forms. [Justification for score] I think that’s right. Yeah. [Helen, examinee 12, narrative, holistic scoring session] Like Dorothy, when Helen personalized the test-taker (1 script out of 6), Rasch analysis revealed bias in scoring the essay more leniently than expected. Helen’s response to the essay of examinee 35 included a reference to the author, “Ok, it’s [sic] a good storyteller,” with the observation that the essay, “. . . was a very good narrative.” Helen gave that essay a 6 when the expected score was 3.5. Think aloud protocols of analytical scoring sessions also provided evidence of a similar pattern that ego involvement corresponded with leniency although when scoring analytically raters did not focus on the examinee to the same extent as when scoring holistically.

6.2. Rater background

Certainly, educational background and experience are facets of ego, and the data collected in the think aloud protocols provided food for thought about the relationship between rater background and patterns of severity. As Brown's (1991) study suggested, raters with an academic background in English literature attended to sentence-level structure more than their ESL-trained counterparts. Similarly, Dorothy, the severe rater with a master's in English literature, resembled what Vaughan (1991) termed the "grammar-oriented" rater (p. 120). Dorothy seemed to give language control greater consideration in assessing L2 writing ability. In the verbal reports, Dorothy remarked that she made her initial determination of the scores for a particular essay based on whether the language errors were predominantly global or local and on how much elaboration there was in the development of the essay. This rater considered language control to be very important, as demonstrated in the think aloud protocols. Her comments about the essays invariably addressed grammatical features, as in the following example:

The next essay is 194. [Reads the essay aloud] This essay is going to get a 2. [Assignment of score] I see a lot of errors, a lot of spelling mistakes, words omitted, wrong words, and incorrect verb tenses. [Justification for score based on grammatical control] [Dorothy, examinee 12, narrative, holistic scoring session]

The influence of rater background as a dimension of ego-involvement surfaced in the think aloud protocols of Edward and Carol, who were both creative writers. These raters seemed to have strong expectations and high standards for the narrative essay as compared to the persuasive essay, regardless of the scoring method. Indeed, when scoring holistically, Carol explained that she was giving one of the examinees a 6 on the narrative essay based on the strength of the story. She compared the narrative to the direction of action in a film:


It’s like the camera getting in closer and closer, only instead of in space, it’s in time . . .. I like it . . . very descriptive of the atmosphere . . .. All of a sudden, so this is what the story has been building up to, . . . here’s the payoff. He produces the conductors’ words and then tells how time increased, then goes to describe Laura . . . then pulls out and describes people in the train.” Furthermore, this made it difficult to score the narrative in the sociolinguistic domain, “Sociolinguistic? This is hard in a narrative. It seems to me very good vocabulary. . . . I’m going to have to give it a 5, just because of the constraints of narrative. [Carol, examinee 35, narrative, analytic scoring session] Similarly, in scoring another narrative, Carol recounted the action blow-by-blow (“It plunges right into it. . .. First paragraph, boom!”) but in the end, she assigned this essay a 2 for content development, explaining, “. . .an interesting, interesting promise of narration but lots of weaknesses. . .I can hear somebody telling this story aloud, somebody who isn’t a writer:” Now, oh, I hope I remember which one was first, Oh, yeah, I do. Ok 194874N. [Identification of paper.] This is definitely a narrative. [Identification of prompt type.] It plunges right into it, which I don’t mind. First paragraph, boom! Plunges you right into the action but doesn’t develop the action and doesn’t describe anything. Second paragraph, I’m hearing the after effect on the friend. (Reads silently.) Now, I’m getting more description. Fourth paragraph, maybe a bit of dialogue, a little more description “in a low voice,” she responded. The grammar is really breaking down here. Searching for the right vocabulary, “The robbed man,” uh huh. Then the next paragraph is at the police precinct. It’s getting repetitious. Very, very oral at this point, and basic oral. Ummm, this is the weakest paragraph of this and it’s almost unnecessary, other than she called her parents and they got her. I think the fact that she took a shower is a good detail, and that she had a nightmare. [Reading of essay with asides.]Interesting, interesting promise of narration, but lots of weaknesses, so let’s see.[Summary appraisal of essay.] Hmmmmmmm. I’m going to give it, ok, for task fulfillment, I’m going to say this is a minimally successful narrative. I know what happened. I know who was involved. I know where it was. I know how her friend felt. The content development? Hmmmm. It’s between a 2 and a 3. ‘Cause I think you know from my talk aloud, it falls more on a 2 because it falls down at the end. Now, organizational control, which is always so hard on these narratives, some people might object to the fact that she plunges into the action, but I don’t. In the end, some people might not think this is an ending but we’re not reading a New Yorker story. Each paragraph is about one aspect, I’d say. I’m going to give that a 3 for organization. Now, sociolinguistic appropriateness, pretty basic language, and, ummm, let’s see, grammar forms, pretty basic. Let’s see, I’m going to give it a 2. Let’s see for grammar control, which really overlaps, I’m going to look at sentence variety. I didn’t pay that much attention to it. “After she too, when the police officer came, so her parents . . .” Ok, I’m giving it a 2 because it’s so simple, but it’s a high 2. I can hear somebody just telling this story aloud, somebody who isn’t a writer telling this story. 
[Assignment of scores with references to rubric] (Carol, examinee 12, narrative, analytic scoring session)

Curiously, Carol seemed more inclined to judge persuasive essays with greater leniency: "Sometimes the persuasive prompt produced a decent, if deadeningly formulaic, piece of writing. If the student had clever ideas, the quality of language and writing seemed to take off as the essay went on. Surprisingly, many writers seemed lost in their narrative, confusing it with the persuasion formula. They may have come from schools that were not interested in having them explore their own lives. A few writers, of course, told their story with descriptive relish."

Carol seemed to bring internal criteria for narration, related to her professional background, to the rating of the narratives, particularly when considering whether the writer had fulfilled the task. She seemed to believe that a narrative should be neither stream-of-consciousness nor generalizations: "This goes right into – 'this actually happened to a friend,' – so this is a narrative . . . this seems like stream-of-consciousness, although this is making sense. It's not really a story; it's just about a sentence or two and then generalizing about how things can be dangerous and bad . . . let me get out the rating sheet, ok, yes, the task is fulfilled. I'd say in a limited way, however, because I was struggling to determine what the essay was, persuasive or narrative." [Carol, examinee, narrative, analytic scoring session]

Focusing on the quality of the narrative, the novelist Edward, an overall moderate rater both holistically and analytically, also showed unusual severity by rating a particular examinee below the score predicted by the model. It seemed that the standards for narrative essays were quite high for these two creative writers, and their expectations were reflected in the way they scored particular papers. These ratings and protocols thus suggested that rater background and/or expectations related to prompt type, particularly to genre, may influence rater decision-making behavior in assigning a rating.

6.3. Self-monitoring

Finally, what might be considered another dimension of ego engagement surfaced in the think aloud protocols. For some raters, a certain amount of self-reflection on their own rating process may have had a mitigating effect on the severity with which they rated some examinees. Moderate raters showed evidence of self-monitoring, particularly when they were fatigued, annoyed by handwriting, or adjusting to the analytical scoring scale, which was a new experience for these raters. For example, when scoring holistically, moderate Carol made self-cautionary comments to guard against making prejudgments about examinee 12's essay because of the handwriting: "I've got to be careful because when I see handwriting like this, I have to be neutral." Carol gave examinee 12's narrative essay a 3, as compared to the fair measure average of 2.69 for this examinee.

When scoring holistically, another moderate rater, Edward, repeatedly commented on the handwriting as "disturbing," and he even equated the handwriting with a low score. His comments about examinee 12's persuasive essay suggested that handwriting could have a potentially prejudicial effect on rater judgment: "The handwriting right away is disturbing. I see that handwriting, and I think a 1–2." In the end, however, Edward paid close attention to content, language control, and vocabulary. He then gave examinee 12's narrative a 3, slightly higher than the fair measure average of 2.7 for this examinee. It appeared that once he acknowledged handwriting as a potential influence in scoring this examinee, Edward then attended to other criteria to make the final decision. This self-monitoring may thus have helped to moderate rater severity influenced by poor handwriting.

7. Discussion and conclusion

Think aloud protocols recorded by all the raters revealed some interesting differences in their decision-making processes. Raters identified as lenient-to-moderate by Rasch analysis demonstrated a common pattern in that they engaged primarily with the examinee and/or the text when scoring holistically and analytically.
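As a point of reference for the severity, leniency, and fair measure estimates cited in this and the preceding sections: these figures derive from a many-facet Rasch analysis. The formulation below is a standard form of the many-facet Rasch rating scale model as commonly implemented in FACETS (Linacre, 2005); it is included only as general background and is not reproduced from the exact specification used in this study.

\[ \log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k \]

Here P_nijk is the probability that examinee n receives a score in category k rather than k-1 on domain i from rater j; B_n is the examinee's ability, D_i the difficulty of the domain, C_j the severity of the rater, and F_k the step difficulty of scale category k. Under such a model, a fair measure average is the score an examinee would be expected to receive from a rater of average severity, which is why it serves above as the benchmark against which individual ratings (e.g., Carol's 3 versus the fair average of 2.69) are compared.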
In contrast to the moderate and lenient raters, the severe raters Dorothy and Helen made few observations about the text, prompt, or examinee/writer and no overt observations about their own rating process when scoring holistically; instead, they seemed to concentrate more on the scoring criteria of the rubric and/or the task of assigning a score. Indeed, they distinguished themselves from the others by consistently reading directly from the rubric. They appeared to rely heavily on the performance descriptors, sometimes reading them verbatim. The occasional evidence of ego engagement of these normally severe raters with a particular examinee coincided with a more lenient rating of that examinee's paper. These tendencies suggested that rater severity may be inversely related to what Guilford (1954) and Myford and Wolfe (2004) have labeled "ego-involvement" with the examinee or text. Vaughan (1991) referred to this type of rater, i.e., the rater who demonstrates interaction with examinees when scoring an essay, as "the laughing rater."

Aside from engagement with the examinee, the think aloud protocols showed that, depending on their background, individual raters engaged with the text or prompt type. Rater background, which might be regarded as another facet of ego, seemed to contribute to raters' expectations of criteria for narrative vs. persuasive essays. For example, raters with a creative writing background expressed disappointment that the examinees did not write a story that taught the reader a lesson but seemed to just recount a series of events chronologically; to these raters, a chronological list of events did not fulfill the requirements of a successful narrative essay. The rater who was a novelist had very high standards for the execution of the narrative and rated the narrative essays more severely than the persuasive essays. The screenwriter/editor, lenient in scoring narratives, was thoroughly entertained by the lively, descriptive narratives, recounting the events while reading the essay as though she were directing the scenes depicted in the story. That ego-involvement with the examinee and/or text can have a moderating impact on rating decision-making suggests that, as part of their training or certification process, raters should be encouraged to engage with the text and the author when reading the writing samples.

The think aloud protocols of moderate raters also suggested that what might be described as self-awareness had a moderating effect. Raters who were consistently moderate across scoring sessions showed evidence of self-monitoring, particularly when they were fatigued, annoyed by handwriting, or scoring with the analytical scoring scale, which was a new experience for these raters, who were accustomed to scoring holistically. This suggests that training of novice raters for this assessment might include guidelines to encourage this type of self-monitoring in order to mitigate the effect of construct-irrelevant factors in the rating process.

Past studies have shown that raters benefit from feedback on their performance. To reduce extremes in leniency or severity, feedback from Rasch analysis can be used to help severe or lenient raters understand their tendency to score either leniently or severely and to adjust their rating behavior accordingly (Weigle, 1994). Raters like Dorothy, for whom severity appears to operate like a personality trait regardless of the scoring context (Guilford, 1954), might benefit from feedback and additional training. The results suggest that scoring protocols might also include discussion of the rating process, as well as of norming papers, genres, the rubric, and performance criteria. Norming sessions could incorporate discussions of the genre or prompt type to reveal individual raters' expectations or personal criteria, e.g., those related to rater background, and thereby reach consensus regarding scoring criteria among raters of diverse backgrounds. Raters must also feel that they can acknowledge, for example, that fatigue or poor handwriting may be influencing the rating process; they should be encouraged to be sensitive to the influence of these extraneous variables and to take needed breaks. It is essential to institute a process whereby raters can deal with the influences of such extraneous factors as fatigue or noise without penalizing the learner with a low score. An understanding of rater decision-making behaviors would perhaps be most useful for chief readers and rater trainers, who have to develop and employ protocols that adequately norm raters and ensure as much objectivity as possible in the scoring process so that the measurements used at the department and program level are reliable and fair.

Finally, one of the objectives of rater training and rater certification is to ensure that raters can come to agreement regarding performance criteria to define a construct of second language writing ability.
Typically, rubrics for assessing L2 writing include the domains of content development, organization, and language control, as the rubric for this assessment did. However, if writers write to communicate a message, and if the hope is indeed that the reader will engage in a dialogue with the text, we might consider whether the ability to engage the reader is itself a dimension of writing ability. It may be that a definition of second language writing ability should also include a parameter measuring the degree to which the text engages the reader, particularly given that a primary objective in the subjective scoring of writing scripts is for raters to avoid leniency or severity by applying the criteria outlined in the rubric. Regardless of whether such a criterion for accurately measuring engagement with the text could be articulated and incorporated into a construct of L2 writing ability, an understanding and acknowledgement that ego engagement with texts and/or writers mitigates rater severity could result in positive washback in the second language writing classroom. Even though, as readers and writers in our personal lives, we know that engagement with the text or author has a huge impact on our assessment of a text, as instructors and learners we are frequently under pressure to prepare for the test and to concentrate on meeting the criteria outlined in the rubric. Evidence that rater engagement with the text could mitigate rater severity on the exam might help to shift the classroom focus to writing for self-expression and communication with one's audience, furthering the formative role of assessment in supporting the development of L2 writing skills.

Appendix A. Holistic scoring scale

Criteria for grading ESL papers

6: This exceptionally executed essay takes a clear position and succeeds in expressing a point of view or telling a story. The thorough development of ideas includes at least two outstanding points directly related to the topic, and the examples used, particularly those from personal experience, are rich. The essay is clearly and logically organized without digressions; the writer demonstrates evidence of skillful use of cohesive devices. The writer demonstrates the ability to write in the appropriate academic register and demonstrates extensive range of vocabulary for academic purposes, with few problems in word choice or usage. A few grammatical errors are noticeable, but rarely do the grammar errors interfere with meaning. Sentence variety and complexity reflect a sufficient command of standard written English to ensure reasonable clarity of expression.

5: The focus of this competently executed essay is clear, but there may be a few digressions. The writer provides substantial support in the development of the essay, although all examples may not be entirely relevant or appropriate for the topic. The essay is effectively organized, demonstrating systematically competent use of cohesive devices. The writer demonstrates ability to use a variety of patterns of sentence construction but with some errors. Range of vocabulary for academic purposes is generally competent, and the writer demonstrates accurate and generally appropriate control of word choice, word forms, and idiomatic expressions for academic writing. Some errors in language use, but errors do not generally interfere with meaning.

4: In this adequately executed essay, the writer's position is clear despite some possible digressions and contradictions. The writer provides adequately detailed support of two or more points that directly relate to the topic. The essay is generally organized, demonstrating generally accurate and appropriate use of cohesive devices. The writer demonstrates some sentence variety with simple, compound, and some complex sentences, though not always correctly. The essay may contain frequent errors that may occasionally interfere with meaning. Vocabulary is adequate in range, but there are some inappropriate or inaccurate word choices and word forms.

3: The essay minimally succeeds in taking a position or relating a narrative with a discernable organizational pattern (introduction, body, conclusion) but may lack clear focus in development of the central idea. The writer makes an attempt at development, although examples are sometimes irrelevant. The writer makes minimal use of cohesive devices and demonstrates a minimal range of sentence variety and vocabulary, with some inaccurate and/or inappropriate word choices or inappropriate register. The essay demonstrates minimal control of language, with frequent errors, some of which interfere with meaning.

2: The paper represents limited success in writing a persuasive or narrative essay. The writer provides limited development of the topic with one or more points that directly or indirectly relate to the supporting argument or story. The writing shows limited evidence of organization of ideas (paragraphs are often one sentence) or of accurate or appropriate use of cohesive devices. The range of vocabulary and word choice appropriate to academic writing is limited. The control of language is uneven, with frequent errors, many of which obscure meaning. The writing lacks sentence variety.

1: The paper is a failed attempt to write an essay. The writer does not fully develop the topic, lacking related support. There is often no clear organizational pattern, lacking a clear beginning, middle, and end. The writer does not use cohesive devices. The writer demonstrates a narrow range of vocabulary. There is little evidence of appropriate word choice or usage or of academic register. The writer demonstrates little control, with frequent errors of all types. The errors generally obscure meaning. The writing lacks basic sentence structure and variety. In some cases, the paper may even be written in the writer's first language.

Appendix B. Analytic scoring scale

Criteria for grading ESL papers

Task fulfillment
1 point: Fails to take a position and write an agree/disagree essay; fails to write a narrative
2 points: Takes a position and, with limited success, writes an agree/disagree essay; with limited success, writes a rough narrative
3 points: Takes a position and minimally succeeds in writing an agree/disagree essay; writes a minimally successful narrative
4 points: Takes a clear position and adequately succeeds in writing an agree/disagree essay; writes an adequately successful narrative
5 points: Takes a clear position and competently succeeds in writing an agree/disagree essay; writes a competently successful narrative
6 points: Takes a clear position and achieves exceptional success in expressing a point of view in an agree/disagree essay; writes an exceptional narrative

Content development
1 point: Fails to develop topic; fails to provide related support
2 points: Limited development of topic; provides 1+ point(s) directly or indirectly related to topic
3 points: Provides minimal development of topic; provides 1–2 points mostly related to topic, with digressions
4 points: Adequate development of topic; provides 2+ points that directly relate to topic, with some digressions
5 points: Substantial development of topic; provides 2+ points that adequately support topic, with occasional digressions
6 points: Thorough development of topic; provides 2+ outstanding points directly related to topic, without digressions

Organizational control
1 point: Follows no clear organizational pattern; no evidence of systematic use of cohesive devices
2 points: Limited evidence of clear organizational pattern; limited use (mechanical or inaccurate) of cohesive devices
3 points: Discernable organizational pattern; minimal use (accurate and appropriate) of cohesive devices
4 points: Generally organized; demonstrates generally accurate and appropriate use of cohesive devices
5 points: Effectively organized; demonstrates systematically competent use of cohesive devices
6 points: Clearly and logically organized; demonstrates skillful command of cohesive devices

Sociolinguistic competence
1 point: Demonstrates narrow range of vocabulary for academic purposes; little evidence of appropriate word choice and/or usage; little evidence of appropriate register
2 points: Demonstrates limited range of vocabulary for academic purposes; limited evidence of appropriate word choice and/or usage; demonstrates limited use of appropriate academic register
3 points: Demonstrates minimal range of vocabulary for academic purposes; minimal evidence of appropriate word choice and/or usage; demonstrates minimal use of appropriate academic register
4 points: Demonstrates adequate range of vocabulary for academic purposes; some evidence of appropriate word choice and/or usage; demonstrates use of appropriate academic register but with some use of informal register
5 points: Demonstrates competent range of vocabulary for academic purposes; general evidence of appropriate word choice and mostly appropriate usage; demonstrates generally competent use of academic register with occasional use of informal register
6 points: Demonstrates extensive range of vocabulary for academic purposes; few problems with word choice and/or usage; demonstrates command of appropriate academic register

Grammatical control
1 point: Frequent errors of all types, with little control; errors generally obscure meaning; lacks basic sentence structure and variety
2 points: May make frequent errors, with uneven control of language; errors often obscure meaning; lacks sentence variety (mostly simple sentences)
3 points: Frequent errors but demonstrates minimal language control; errors sometimes interfere with meaning; minimal sentence variety (simple and some compound or occasional complex sentences)
4 points: Frequent errors but demonstrates developing grammatical control; errors occasionally interfere with meaning; demonstrates some sentence variety with simple, compound, and some complex sentences, but with errors (e.g., fragments, run-ons, errors in subordination)
5 points: Some errors, but control of language is apparent; few, if any, errors that interfere with meaning; demonstrates sentence variety but with some errors
6 points: A few errors are noticeable; errors rarely interfere with meaning; variety of simple and complex sentence structures with few errors

References

Anderson, J. R. (1983). The architecture of cognition. Cambridge: Harvard University Press.
Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgments in a performance test of foreign language speaking. Language Testing, 12(2), 238–257.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.
Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54–74.
Breland, H., Bridgeman, B., & Fowles, M. E. (1999). Writing assessment in admission to higher education: Review and framework (ETS RR-99-3). Princeton, NJ: Educational Testing Service.
Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of work key listening and writing tests. Educational and Psychological Measurement, 55, 157–176.
Brown, J. D. (1991). Do English and ESL faculties rate writing samples differently? TESOL Quarterly, 25, 587–603.
Caracelli, V. J., & Greene, J. C. (1993). Data analysis strategies for mixed-method evaluation designs. Educational Evaluation and Policy Analysis, 15(2), 195–207.
Connor, U., & Carrell, P. (1993). The interpretation of tasks by writers and readers in holistically rated direct assessment of writing. In J. G. Carson & I. Leki (Eds.), Reading in the composition classroom (pp. 141–160). Boston, MA: Heinle and Heinle.
Creswell, J. W., & Clark, V. L. P. (2007). Designing and conducting mixed methods research. Thousand Oaks, CA: Sage Publications, Inc.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: Harper and Row.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31–51.
Cumming, A. (2001). Assessing L2 writing: Alternative constructs and ethical dilemmas. Assessing Writing, 8, 73–83.
Cumming, A., Kantor, R., Powers, D., Santos, T., & Taylor, C. (2000). TOEFL 2000 writing framework: A working paper (TOEFL Monograph Series Report No. 18). Princeton, NJ: Educational Testing Service.
DeRemer, M. L. (1998). Writing assessment: Raters' elaboration of the rating task. Assessing Writing, 5(1), 7–29.
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93–112.
Ericsson, K., & Simon, H. (1984). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.
Fulcher, G. (2003). Testing second language speaking. Essex, England: Pearson Professional Education.
Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw Hill.
Hamp-Lyons, L., & Matthias, S. P. (1994). Examining expert judgments of task difficulty on essay tests. Journal of Second Language Writing, 3(1), 49–68.
Henning, G. H. (1996). Accounting for nonsystematic error in performance ratings. Language Testing, 13, 53–61.
Huot, B. (1993). The influence of holistic scoring procedures on reading and rating student essays. In M. A. Williamson & B. A. Huot (Eds.), Validating holistic scoring for writing assessment. Cresskill, NJ: Hampton Press, 307–236.
Johnson, J. S. (2009). The influence of rater language background on writing performance assessment. Language Testing, 26(4), 485–505.
Kim, Y. H. (2009). An investigation into native and non-native teachers' judgments of oral English performance: A mixed methods approach. Language Testing, 26(2), 187–217.
Kobayashi, T. (1992). Native and nonnative reactions to ESL compositions. TESOL Quarterly, 26, 81–112.
Kondo-Brown, K. (2002). A FACETS analysis of rater bias in measuring Japanese second language writing performance. Language Testing, 19(1), 3–31.
Linacre, J. M. (2005). A user's guide to FACETS: Rasch measurement computer program (Version 3.57). Chicago, IL.
Lumley, T. (2005). Assessing second language writing: The rater's perspective. Frankfurt am Main: Peter Lang.
Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15(2), 158–180.
McNamara, T. F. (1996). Measuring second language performance. London and New York: Longman.


Mendelsohn, D., & Cumming, A. (1987). Professors' ratings of language use and rhetorical organizations in ESL compositions. TESL Canada Journal, 5(1), 9–26.
Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision-making behavior of composition markers. In M. Milanovic & N. Saville (Eds.), Studies in language testing, Vol. 3: Performance testing, cognition and assessment (pp. 92–111). Cambridge University Press.
Myford, C., & Wolfe, E. (2000). Monitoring sources of variability within the Test of Spoken English assessment system (Research Project 65). Princeton, NJ: Educational Testing Service.
Myford, C., & Wolfe, E. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part 1. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 460–517). Maple Grove, MN: JAM Press.
Orr, M. (2002). The FCE speaking test: Using rater reports to help interpret test scores. System, 30, 143–154.
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413–428.
Sakyi, A. (2000). Validation of holistic scoring for ESL writing assessment: How raters evaluate ESL compositions. In A. Kunnan (Ed.), Fairness and validation in language assessment (pp. 129–152). Cambridge: Cambridge University Press.
Santos, T. (1988). Professors' reactions to the academic writing of non-native speaking students. TESOL Quarterly, 20, 38–95.
Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effect of raters' background and training on the reliability of direct writing tests. The Modern Language Journal, 76(1), 27–33.
Smith, D. (2000). Rater judgments in the direct assessment of competency-based second language writing ability. In G. Brindley (Ed.), Studies in immigrant English language assessment (pp. 159–189). Sydney, Australia: National Centre for English Language Teaching and Research, Macquarie University.
Spool, M. D. (1978). Training programs for observers of behaviors: A review. Personnel Psychology, 31, 853–888.
Stansfield, C., & Ross, J. (1988). A long-term research agenda for the Test of Written English. Language Testing, 5, 160–186.
Stock, P. L., & Robinson, J. L. (1987). Taking on testing: Teachers as testers researchers. English Education, 19, 93–121.
Sweedler-Brown, C. O. (1993). ESL essay evaluation: The influences of sentence-level and rhetorical features. Journal of Second Language Writing, 2, 3–17.
Van Weeren, J., & Theunissen, T. J. J. (1987). Testing pronunciation: An application of generalizability theory. Language Learning, 37, 109–122.
Vann, R. J., Lorenz, F. L., & Meyer, D. M. (1990). Error gravity: Faculty response to errors in the written discourse of nonnative speakers of English. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 181–195). Norwood, NJ: Ablex.
Vaughan, C. (1991). Holistic assessment: What goes on in the raters' minds? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111–126). Norwood, NJ: Ablex.
Weigle, S. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. New York: Palgrave MacMillan.
Wiseman, C. (2005). A validation study comparing an analytic scoring rubric and a holistic scoring rubric in the assessment of L2 writing samples. Unpublished paper, Teachers College, Columbia University, NY.

Cynthia S. Wiseman earned a doctorate in Applied Linguistics with a concentration in second language assessment from Teachers College, Columbia University. She is an assistant professor at Borough of Manhattan Community College, City University of New York.