Features of difficult-to-score essays


Assessing Writing 27 (2016) 1–10


Edward W. Wolfe a,∗, Tian Song b,1, Hong Jiao c,2

a Research & Innovations Network, Pearson, 3974 Roberts Ridge NE, Iowa City, IA 52240, USA
b Pearson, 5604 E Galbraith Rd., Cincinnati, OH 45236, USA
c University of Maryland, 1230C Benjamin Building, College Park, MD 20742, USA

Article history: Received 31 October 2014; received in revised form 11 June 2015; accepted 18 June 2015.

Keywords: Rater; Scoring; Writing assessment

Abstract

Previous research that has explored potential antecedents of rater effects in essay scoring has focused on a range of contextual variables, such as rater background, rating context, and prompt demand. This study predicts the difficulty of accurately scoring an essay based on that essay's content by utilizing linear regression modeling to measure the association between essay features (e.g., length, lexical diversity, sentence complexity) and raters' ability to assign scores to essays that match those assigned by expert raters. We found that two essay features, essay length and lexical diversity, account for 25% of the variance in ease of scoring measures, and these variables are selected in the predictive modeling whether or not the essay's true score is included in the equation. We suggest potential applications of these results to rater training and monitoring in direct writing assessment scoring projects.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

Direct writing assessments, as they are employed in many educational contexts, require examinees to compose an essay in response to a prompt or stimulus material, sometimes following a sequence of guided prewriting activities. These assessments may be informal, taking place within the classroom and providing formative information regarding instruction, or they may be formal, taking place within a standardized assessment setting and providing information upon which summative and/or accountability decisions are based. Whether the purpose of the assessment is formal or informal, those essays are typically scored by human raters who employ a scoring rubric as a guide to classifying essays into ordered categories intended to indicate increasing levels of writing quality. Those scoring rubrics may focus on and require assignment of several "trait" scores, each of which depicts the quality of a particular aspect of the writing (e.g., mechanics, organization, development, voice), or the rubrics may require the rater to make a holistic judgment by jointly considering all relevant aspects of the essay in arriving at a single score of the overall quality of the writing. Because it involves subjective judgments, that rating process may result in different scores being assigned to the same essay by different raters. In fact, the decision making process is sufficiently subjective that fluctuations in a rater's attention and mood, changes in the scoring context, and even variations in the presentation of the essays being scored can cause a single rater to assign different scores to the same essay on different occasions (Shohamy, Gordon, & Kraemer, 1992).

∗ Corresponding author. Tel.: +1 319 321 4633. E-mail addresses: [email protected] (E.W. Wolfe), [email protected] (T. Song), [email protected] (H. Jiao).
1 Tel.: +1 734 546 4239.
2 Tel.: +1 301 405 3627.
http://dx.doi.org/10.1016/j.asw.2015.06.002
1075-2935/© 2015 Elsevier Inc. All rights reserved.


Prior research concerning this decision making process identifies several components of the assessment context that may influence the accuracy of rater decision-making. This manuscript reports the results of a study of one of those components, the features of essays, which are associated with disagreements between raters, with the intent of providing insights that may inform rater training and monitoring practices in direct writing assessments.

1.1. Terminology

Before discussing the various potential sources of rater disagreements, we define important terms that are relevant to our focus. The term interrater reliability is often used loosely to refer to the degree to which several potential types of measurement error may influence scores assigned by human raters, and we avoid that term due to the potential confusion that it may introduce. We prefer the specificity of the terms interrater agreement and rater accuracy (sometimes referred to as rater validity). Interrater agreement typically refers to the degree to which a rater assigns scores to a particular set of examinee responses that are consistent with scores assigned to those responses by other raters. Rater accuracy, on the other hand, refers to the degree to which a rater assigns scores to a set of responses that match validity scores (i.e., scores that are assumed to be accurate, typically consensus scores assigned by expert raters). The only difference between rater agreement and rater accuracy is the frame of reference against which we evaluate the rater's performance. We attribute rater disagreement or rater inaccuracy to patterns within the scores assigned by a particular rater that may have diagnostic utility, which we call rater effects. Those patterns exhibit some degree of predictability and may cause the assigned scores for the rater to be consistently high or low (severity or leniency), to be tightly or widely spread (centrality or extremity), or to be highly consistent or inconsistent (accuracy or inaccuracy) when compared to the target scores. Hence, when rater effects are identified, raters can be provided with information concerning why low interrater agreement or low levels of rater accuracy have occurred and what can be done to correct those errors. Our point is that rater errors can be depicted at either the more general rater agreement/accuracy level or the more diagnostically specific rater effect level, with rater agreement/accuracy being more comprehensive and rater effects being more diagnostically specific. In this manuscript, we focus on the level of rater agreement/accuracy because our purpose is to discover sources of rater errors that may arise from how raters respond to the content and quality of a particular essay rather than to diagnose the types of errors that raters commit.

1.2. Influences on rater agreement and accuracy

Several conceptual models have been proposed to explain how various components of the rating process and context may contribute to rater disagreement and inaccuracy, and we synthesize some of those models in this section in order to identify the context within which our research is situated. Generally, four broad aspects of the rating process have been proposed as potential antecedents to the emergence of rater effects. First, although it has not been subjected to much formal research, the design of the assessment clearly has the potential to impact the quality of assigned scores.
Here, we are referring to decisions that have been made on the part of assessment designers, such as the purpose of the assessment, the administration medium, and the focus of the scoring criteria. One example of research that has focused on the impact of direct writing assessment design on rating quality concerns the comparability, and raters' perceptions, of essays that are composed by hand versus with a word processor. For example, Powers, Fowles, Farnum, and Ramsey (1994) performed an early study of this topic by asking students to compose essays in both of these composition media, transcribing each essay to the other medium, and then having raters assign scores to both versions of each essay. Their results indicated that handwritten essays received higher scores, regardless of composition medium. That is, raters, on average, exhibited a leniency effect for handwritten essays. It is worth acknowledging a potential overlap between assessment design decisions and the next feature that we discuss, response content. Our decision to discuss essay composition medium as an example of assessment design is based on the fact that composition medium in direct writing assessments is typically a decision that is imposed on the examinee by assessment designers, while other characteristics of the response, such as handwriting quality, are more directly under the control of the examinee. Classifying other examples of research relating to assessment design, such as the assessment purpose and the nature of the scoring criteria, would be more straightforward. Second, the content of the response that raters review, which is the primary focus of the research that we conducted, may also influence the quality of assigned scores. In the context of direct writing assessment, when we refer to response content, we mean the visual appearance of the response (e.g., handwriting quality, font choices, and page layout), textual features (e.g., length, word choice, mechanics), and content included in the response (e.g., author clues, ideas). A great deal of research has been conducted regarding the impact of visual appearance on rating quality, but relatively less research exists concerning the impact of textual features and informational content. We briefly summarize this research in the following section. Third, rater characteristics have long been posited as potential influences on rating quality (Pula & Huot, 1993). Rater characteristics include rater experiences (e.g., educational, demographic, and professional), stable rater cognitive and affective traits (e.g., temperament, cognitive style), and temporary rater states (e.g., mood). A good bit of research has been conducted regarding the impact of rater experiences (Wolfe, 1997; Meadows & Billington, 2010; Pula & Huot, 1993; Shohamy et al., 1992; Sweedler-Brown, 1985) and stable rater traits (Crisp, 2012; Huot, 1993; Vaughan, 1991) on rating quality.


Generally, these studies suggest that the more general the rater background characteristic (e.g., demographics), the less likely it is to impact rating quality, and the more specific and relevant the characteristic is to the rating task (e.g., rating experience and task-specific training), the more likely it is to do so. Overall, stable rater traits, particularly rater cognitive style, seem to have a relatively important role in determining rating quality. Due to the fleeting nature of temporary rater states and the difficulty of measuring those states, little research exists concerning their impact on rating quality. Fourth, the rating context, which includes the medium and process through which responses are distributed to raters, rater training procedures, rater monitoring and feedback practices, and temporal and physical features of the rating environment, also influences the quality of ratings. Several studies have supported the notion that rater training increases the quality of ratings (Shohamy et al., 1992; Sweedler-Brown, 1985) and that the self-pacing that is afforded by an online training module increases the efficiency of that training process (Wolfe, Matthews, & Vickers, 2010; Elder, Barkhuizen, Knoch, & von Randow, 2007; Elder, Knoch, Barkhuizen, & von Randow, 2005; Knoch, Read, & von Randow, 2007). Less is known about the contribution to rating quality of response distribution systems, rater monitoring and feedback processes, and the setting within which ratings are assigned.

1.3. Response content

In this study, we focus on the contribution of response content to rater accuracy. As described in Section 1.2, response content includes a writing sample's visual appearance and textual features, as well as the personalizing details that an author chooses to include (i.e., author clues and style). Concerning the visual appearance of an essay, early research focused on the influence of handwriting legibility on essay scores. Some of those studies indicated that the quality of handwriting does indeed influence the ratings assigned by scorers, with neater handwriting resulting in higher scores (Briggs, 1980; Hughes, Keeling, & Tuck, 1983). However, at least one of those studies indicated that the impact of handwriting quality may interact with several additional response content variables, particularly clues about the author's identity (Chase, 1986). More recently, a study by Klein (2005) suggested that additional visual adornments (e.g., color and underlining), whether essay composition takes place via handwriting or word processing, may interact with legibility and font selection to produce similar effects. Other lines of research have focused on differences between raters' perceptions of handwritten and typed texts, with typed essays receiving higher scores when controlling for language ability (Wolfe & Manalo, 2004; Breland, Lee, & Muraki, 2005). Essay length has also been shown to be highly correlated with measures of writing quality, with higher scores being associated with longer essays (Barkaoui, 2010; Kobrin, Deng, & Shaw, 2011). However, little attention has been directed toward the degree to which this effect is due to a true relationship between writing quality and essay length versus bias on the part of raters. In terms of assessing the quality of scores, these studies focus only on mean score differences at the group level. That is, at best, they consider a shared tendency for raters to exhibit severity or leniency, and they do not consider other measures of the quality of the scores, such as rater agreement or accuracy.
In addition, a review by Meadows and Billington (2005) points out that studies like these focus on "non-professional" raters and that at least two studies in the United Kingdom (UK) fail to replicate these trends when focusing on raters hired to score essays for examinations like the General Certificate of Secondary Education (GCSE) (Baird, 1998; Massey, 1983). Hence, it is not clear whether these visual features of essays would have as strong an impact on raters who assign scores in high-stakes contexts. A good bit of the existing research has focused on the possibility that clues to an author's identity could invoke biases on the part of raters. For example, Fajardo (1985) noted, in a study that focused on author race, that raters tended to assign higher scores to black examinees than to white examinees and that the score difference was greatest for essays of moderate levels of writing quality. Similarly, Chase (1986) identified a complex interaction between an examinee's stated academic expectations (i.e., grades), race, gender, and penmanship as predictors of scores assigned to essays. Chase replicated the results of Fajardo with respect to racial bias when handwriting legibility was good. However, when handwriting quality was poor, the gender-by-race effects were less pronounced. It is worth noting, however, that some of these results may not generalize to contexts in which essays are scored by raters who are blind to the author's identity. For example, some studies of rater-author background biases explicitly provide raters with fictitious information about the essay's author. In most large-scale scoring settings, raters are not provided with any such information regarding the author's identity (e.g., no names or background variables are available), so the only clues available to the raters would be those mentioned directly by the authors in the body of the essay. In light of this fact, a study by Bridgeman, Trapani, and Attali (2012) is particularly interesting because it utilized a blinded design in addition to comparing scores assigned by humans to those assigned by an automated scoring engine, which could be argued to be, potentially, less likely to be influenced by perceived examinee characteristics. That study focused on both mean differences and interrater correlations for different nationalities, language groups, and genders on both the TOEFL iBT and the GRE. The authors concluded that, although differences between human and automated engine scores were not large, those differences were substantively important because they resulted in different rank orderings of groups and individual examinees. Hence, their results suggest that population-level bias concerning author background clues may exist with respect to both rater severity/leniency and rater accuracy/agreement in large-scale scoring projects. Several studies have focused on substantive content and stylistic elements and their potential impact on assigned scores and on rater perceptions. Freedman (1979) conducted seminal research in this area by examining the impact of essay content, organization, sentence structure, and mechanics on assigned scores, and she found that content and organization were most strongly related to essay score. However, a subsequent study, conducted by Rafoth and Rubin (1984), found just the opposite: mechanics was a better predictor of assigned scores than was content.


Mixed results such as these are common in studies of the relationship between essay content and essay quality. For example, Barkaoui (2010) indicated that communicative quality, argumentation, and linguistic accuracy were all significant predictors of essay scores. In terms of the quality of essay scores, Schoonen (2005) indicates that language use scores are more generalizable than are scores of content and organization. In terms of raters' subjective weighting of these various essay features, Weigle, Boldt, Valsecchi, Elder, and Golombek (2003) found that raters ranked content as the most important consideration for essays receiving high scores while ranking both content and grammar as the most important considerations for essays receiving low scores. Clearly, other features of the rating context (e.g., rubric focus, instructions to raters, rater characteristics, etc.) mediate the relationship between substantive features of the essays and essay scores. As was true for the previously described research, there are more studies of non-professional raters than there are of professional raters. Wolfe (2006) conducted a study that directly compared the cognitive processing of professional essay raters who exhibited different levels of rater agreement, using a think-aloud methodology to reveal the content focus of those raters. That is, rather than focusing on population-level differences in rater severity/leniency, Wolfe (2006) focused on population-level differences in rater accuracy/agreement. Those analyses revealed that raters who were more accurate in that writing assessment context were also more likely to focus on storytelling features of the narrative essays that students wrote. Less proficient raters, on the other hand, while still being most likely to focus on storytelling features, were less likely to do so than were the most proficient raters, and they were also more likely to consider organizational and stylistic features of those essays.

1.4. Purpose

While nearly all of the literature reviewed here indicates that features of the essay may influence the scores that raters assign, most of these studies focus solely on whether scores increase or decrease as a result (i.e., severity/leniency). Numerous other rater effects exist, and each contributes to rater disagreement/inaccuracy. Given that measures of those general rating qualities (e.g., percentage of perfect agreement, coefficient kappa, intraclass correlation) are often the "bottom line" when it comes to evaluating the quality of essay scores, it seems reasonable to consider whether essay features influence raters' agreement/accuracy when those score characteristics are defined more generally. Furthermore, by better understanding how features of an essay's content influence the agreement/accuracy of scores, leaders of scoring projects will be in a better position to train raters and provide them with diagnostic feedback in an effective and efficient manner. Hence, this study seeks to determine whether some characteristics are more likely to be found in essays that are difficult for raters to score accurately. That is, we focus on the following research question: Which text features make an essay difficult for human raters to score?

2. Method

We conducted a correlational study in which each of 131 professional raters assigned holistic scores on a 4-point rating scale to essays that were handwritten by 400 middle-school students in response to an explanatory prompt on a state-wide writing assessment in the United States (US).
Each essay was also assigned a consensus "validity score" by a panel of experts.

2.1. Variables

Initially, we measured 42 variables through detailed text analysis (as implemented in a proprietary automated scoring engine), supplemented by our own visual inspection of the handwritten essays. To obtain the text analysis variables, we transcribed each original handwritten essay to electronic text and submitted that text to the text analysis algorithm of the Intelligent Essay Assessor (Foltz, Lochbaum, & Landauer, 2013). We did not produce or utilize any automated scores in this study. Rather, we focused exclusively on a range of 38 countable and measurable textual features of the essay that were considered by that automated scoring engine. Broadly speaking, those features focus on the content, lexical sophistication, grammar, mechanics, style, organization, and development within the essay. Providing a detailed definition of all 38 of these variables would detract from the focus of this manuscript, so we refer readers to Foltz et al. (2013) and Foltz, Laham, and Landauer (1999) for more detail, and we provide an overview of those variables here. Content variables include those that depict the latent semantic structure of the essay (i.e., the similarity of structural elements of an essay, such as words and paragraphs, to the use of those elements in, in this case, the English language). These variables include essay characteristics such as latent semantic analysis measures of the similarity of the essay to other essays written on the same topic (e.g., cosines between vectors that represent essays included in the latent semantic analysis, which represent the similarities between those essays). Lexical sophistication variables include those that depict the maturity of the vocabulary that an essay contains. Such variables include characteristics like the number of syllables per word, the number of non-functioning words, and the diversity of the vocabulary used in the essay. Mechanics and grammar variables include those that depict the appropriateness of language use in an essay, and these variables depict characteristics such as counts of run-on sentences, use of possessives, use of punctuation and capitalization, and correctness of spelling. Style, organization, and development variables include those that depict the level of development of the topic within the essay. These include measures that characterize the coherence of the essay, measures that characterize the organization and flow of the writing in the essay, and measures of the volume of text produced (e.g., word counts and word type counts).
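As a rough illustration of what such countable textual features look like in practice, the sketch below computes simple stand-ins for a few of the variables just described (lexical diversity, sentence length, word complexity, and punctuation counts) from a transcribed essay. This is only a minimal approximation under our own assumptions; it is not the Intelligent Essay Assessor's feature set, and the function and variable names are ours.

```python
import re

def simple_text_features(essay_text: str) -> dict:
    """Rough stand-ins for a few of the countable features described in
    Section 2.1; these are not the Intelligent Essay Assessor's definitions."""
    # Crude sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", essay_text) if s.strip()]
    # Lowercased word tokens (letters and apostrophes only).
    words = re.findall(r"[a-zA-Z']+", essay_text.lower())
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)
    return {
        # Lexical diversity: unique words / total words (a type-token ratio).
        "lexical_diversity": len(set(words)) / n_words,
        # Sentence length: average words per sentence.
        "avg_words_per_sentence": len(words) / n_sentences,
        # Word complexity: average characters per word.
        "avg_chars_per_word": sum(len(w) for w in words) / n_words,
        # Sentence complexity proxy: number of commas used.
        "comma_count": essay_text.count(","),
        # Advanced punctuation proxy: number of exclamation points.
        "exclamation_count": essay_text.count("!"),
    }

# Example on a toy essay.
print(simple_text_features("The dog ran. The dog ran fast, very fast!"))
```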


Table 1
Regression predictors summary.

Variable | Definition | ES correlation¹ | Easier to score
Essay length | Rated apparent length | –.39 | Shorter essays
Lexical diversity | Ratio of unique words to total words | .32 | More lexically diverse
Semantic typicality | Similarity to typical language use in the remaining essays | –.18 | Less semantically typical
Sentence complexity | Number of commas used | –.18 | Less complex
Spelling accuracy | Number of misspelled words | –.15 | Fewer misspelled words
Advanced punctuation frequency | Number of exclamation points | –.15 | Less advanced punctuation
Sentence length | Average words per sentence | .15 | Longer sentences
Legibility | Rated handwriting neatness | –.13 | Less legible
Language predictability | Commonality of word sequences in comparison to common English language use | .09 | Less predictable language
Word complexity | Average characters per word | –.06 | Shorter words

¹ N = 400. All correlations are statistically significant versus H0: ρ = 0 at α = .05 except for word complexity, which is not statistically significant.

We also used our own visual inspection of the original handwritten essays to capture four essay features that could not be codified by the text analysis algorithm. Specifically, we used four-point ordinal rating scales to depict the handwriting quality, visual structure, and apparent length of each essay. Combined, these 42 variables were subjected to an exploratory factor analysis through which we identified 11 factors that we determined to be substantively interpretable. Those 11 factors grouped variables into those that measure essay length (e.g., number of words and apparent length), lexical diversity (e.g., ratio of unique words to total number of words), semantic typicality (e.g., latent semantic similarity of an essay to typical English language use), sentence complexity (e.g., number of commas used), spelling accuracy (e.g., number of misspelled words), advanced use of punctuation (e.g., counts of exclamation points, colons, or question marks), sentence length (e.g., average number of words per sentence), legibility (e.g., rated handwriting neatness or visual structure of the essay), language predictability (e.g., commonality of word sequences in the essay as compared to their commonality in typical English language usage), and word complexity (e.g., average number of characters or syllables per word). For each of those 11 factors, we chose a single variable for subsequent regression analyses based on the magnitude of the raw Pearson product-moment correlation with the proportion of times raters assigned ratings that matched a validity score.

2.2. Analysis

The score assigned by each rater was converted to an "accuracy" index according to whether the score matched the validity score for that essay (1 = match, 0 = no match). These scores were then scaled using a dichotomous Rasch model so that parameter estimates for the essays indicated the ease of assigning an accurate score to each essay (Engelhard, 1996), and we focused our attention on these estimates. That is, we defined Ease of Scoring (ES) as the outcome variable. Regression procedures were then used to model ES as a linear function of the 11 text feature variables, which were specified to be the predictors in our analyses. We applied an arcsine transformation to the two variables that were measured as proportions (lexical diversity and semantic typicality), and we applied a log transformation to counted variables that produced highly skewed distributions (sentence complexity, spelling accuracy, advanced punctuation frequency, word complexity, language predictability). We conducted regression analyses with and without controlling for true score values (i.e., expert consensus scores). We conducted variable selection via a stepwise process.
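To make the scaling step concrete, the sketch below shows one way the accuracy indices and essay-level Ease of Scoring estimates could be obtained. Instead of a full dichotomous Rasch calibration in specialized software (as in Engelhard, 1996), it approximates the same idea with a logistic model containing additive essay and rater effects, fitted with statsmodels; the simulated data, column names, and modeling shortcut are our assumptions, not the authors' procedure.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated long-format ratings: one row per (essay, rater) pair. In the
# study, "match" comes from comparing each assigned score with the expert
# validity score; here it is simulated. All names are hypothetical.
rng = np.random.default_rng(0)
n_essays, n_raters = 40, 15
essay_ease = rng.normal(0.0, 1.0, n_essays)   # latent ease of scoring
rater_skill = rng.normal(0.0, 0.5, n_raters)  # latent rater accuracy

rows = []
for e in range(n_essays):
    for r in range(n_raters):
        p_match = 1.0 / (1.0 + np.exp(-(0.5 + essay_ease[e] + rater_skill[r])))
        rows.append({"essay_id": f"e{e:03d}",
                     "rater_id": f"r{r:02d}",
                     "match": int(rng.random() < p_match)})
ratings = pd.DataFrame(rows)

# Rasch-style scaling of the 0/1 accuracy indices, approximated here as a
# logistic model with additive essay and rater effects. The essay
# coefficients play the role of the Ease of Scoring (ES) measures
# (higher = easier to score accurately, relative to the reference essay).
fit = smf.glm("match ~ C(essay_id) + C(rater_id)",
              data=ratings,
              family=sm.families.Binomial()).fit()

es_measures = fit.params[fit.params.index.str.startswith("C(essay_id)")]
print(es_measures.head())
```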
3. Results

3.1. Text features and correlations

Table 1 identifies and defines the 11 textual feature variables that we considered as the predictor variables in our regression analyses. That table also displays the correlation of each variable with the ES measures and describes the nature of that correlation in terms of which kinds of essays were easier to score. Essay length, which was based on visual inspection of the apparent length of the original handwritten essays, exhibited the highest correlation with ease of scoring, with shorter essays being easier to score. Lexical diversity, measured as the ratio of unique words to the total number of words in the essay, also exhibited a relatively strong correlation with ES measures. Essays with a greater amount of lexical diversity were easier to score. Several of the remaining variables exhibited somewhat weaker correlations with ES measures. Semantic typicality, a measure of the similarity of language use in an essay to the typical language use in all other essays in the body of scored essays, exhibited a tendency for essays that were less similar to the body of essays to be easier to score. Sentence complexity, measured via comma frequency, was also weakly correlated with ease of scoring, with essays containing fewer commas being easier to score. Table 1 also reveals weak relationships with ES measures for spelling accuracy (fewer spelling errors is easier to score), use of advanced punctuation (fewer uses are easier to score), sentence length (longer sentences are easier to score), and legibility (poorer handwriting is easier to score).
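Before the regression results in Tables 2 and 3, it may help to see the transformation and variable-selection steps of Section 2.2 in code form. The sketch below applies an arcsine square-root transform to the proportion-scaled predictors (the paper reports only an "arcsine transformation"), a log transform to the skewed counts, and then a naive forward stepwise selection over OLS models. The data frame layout, column names, and selection rule are illustrative assumptions; the authors do not describe their software or exact stepwise criteria.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def transform_predictors(df: pd.DataFrame) -> pd.DataFrame:
    """Transform predictors roughly as described in Section 2.2 (column
    names are hypothetical)."""
    out = df.copy()
    # Proportion-scaled predictors: arcsine square-root transform (one
    # common form; the paper reports only an "arcsine transformation").
    for col in ["lexical_diversity", "semantic_typicality"]:
        out[col] = np.arcsin(np.sqrt(out[col].clip(0, 1)))
    # Skewed count-like predictors: log transform (log1p guards against zeros).
    for col in ["sentence_complexity", "spelling_accuracy", "advanced_punctuation",
                "word_complexity", "language_predictability"]:
        out[col] = np.log1p(out[col])
    return out

def forward_stepwise(y: pd.Series, X: pd.DataFrame, alpha: float = 0.05):
    """Naive forward selection on OLS p-values, standing in for the
    unspecified stepwise procedure used in the paper."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for cand in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
            pvals[cand] = model.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return sm.OLS(y, sm.add_constant(X[selected])).fit(), selected

# Toy data standing in for the 400-essay data set.
rng = np.random.default_rng(1)
features = pd.DataFrame({
    "essay_length": rng.integers(1, 5, 400),
    "lexical_diversity": rng.uniform(0.3, 0.8, 400),
    "semantic_typicality": rng.uniform(0.1, 0.9, 400),
    "sentence_complexity": rng.poisson(8, 400),
    "spelling_accuracy": rng.poisson(3, 400),
    "advanced_punctuation": rng.poisson(1, 400),
    "sentence_length": rng.uniform(5, 25, 400),
    "legibility": rng.integers(1, 5, 400),
    "language_predictability": rng.poisson(4, 400),
    "word_complexity": rng.uniform(3.5, 5.5, 400),
})
ease_of_scoring = pd.Series(rng.normal(0, 1, 400) - 0.4 * features["essay_length"])
fit, chosen = forward_stepwise(ease_of_scoring, transform_predictors(features))
print(chosen)
print(fit.params)
```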


Table 2
Regression results omitting true score.

Step | Variable | Parameter estimate | SE | p-value | Partial R²
1 | Essay length | –.19 | .06 | .0001 | .20
2 | Lexical diversity¹ | 2.37 | .67 | .0001 | .04
3 | Sentence complexity² | –.15 | .04 | .001 | .02
4 | Semantic typicality¹ | –1.44 | .57 | .02 | .01
5 | Spelling accuracy² | –.15 | .05 | .02 | .01

¹ Lexical diversity and semantic typicality were transformed via an arcsine transformation.
² Sentence complexity and spelling accuracy were transformed via a log transformation.

We observed very weak relationships between ES measures and language predictability (dissimilarity of consecutive words, or n-grams, in the essay relative to their appearance in common English language use) and word complexity (average number of characters per word). It is important to note, however, that several of these variables are known to be correlated with measures of essay quality. For example, essay length is commonly identified as an essay feature that is highly predictive of measures of essay quality in applications of automated scoring. Hence, each of these relationships could, potentially, be moderated by the true score of the essay. To take this into account, we examined the correlation between true scores (i.e., expert consensus scores) and ES measures to determine whether that correlation was sufficiently strong to warrant controlling for essay true score when examining the relationship between essay features and ES measures. We observed a correlation between true scores and ES measures equal to −.19 (p < .0001), with essays of lower quality being easier to score. Given that this correlation is non-trivial and is greater than the correlation between ES measures and several of the essay features that we considered, we chose to conduct the regression analyses two ways: forcing true score into the regression equation and leaving true scores out of the equation entirely.

3.2. Regression analyses

Table 2 summarizes the results of the stepwise entry approach to regression modeling when true score was not forced into the model. The results indicate that five predictors were selected for the model using a significance level of α = .05. However, only two of these variables exhibited an effect size that warrants discussion. Specifically, essay length exhibited a large effect size (partial R² = .20), with a negative parameter estimate indicating that, for approximately every one-point increase on the four-point essay length rating scale, the logit of whether an essay would be scored accurately versus inaccurately decreased by a value of .19. To illustrate, Fig. 1 displays the empirical probability of a match for each level of the essay length rating scale. Generally, as essays became longer, they became more difficult to score. Specifically, essays that were about one-half of a page in length were rated accurately about 86% of the time, while essays that were between one and one and one-half pages in length were scored accurately 62% and 55% of the time, respectively. Essays that were two pages in length were rated accurately about 52% of the time. The effect size for lexical diversity is considerably smaller (partial R² = .04), but the effect appears to be large enough to be relevant.
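The empirical probabilities plotted in Figs. 1–3 are simply conditional accuracy rates: the share of ratings that matched the validity score at each level of an essay feature. A minimal sketch of that computation, with hypothetical column names and toy data, is given below.

```python
import pandas as pd

# Long-format ratings with a 0/1 accuracy flag and the rated essay length
# in half-page increments; values and column names are hypothetical.
ratings = pd.DataFrame({
    "essay_length_pages": [0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.5, 1.5, 2.0, 2.0],
    "match":              [1,   1,   0,   1,   0,   1,   1,   0,   0,   1],
})

# Empirical probability of an accurate rating at each essay-length level,
# analogous to the points plotted in Fig. 1.
p_accurate = ratings.groupby("essay_length_pages")["match"].mean()
print(p_accurate)
```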

Fig. 1. Probability of accurate ratings as a function of essay length. [Line plot: x-axis, Essay Length (Pages), 0.00–2.00; y-axis, Probability of Accurate Rating, 0–1.]


Fig. 2. Probability of accurate ratings as a function of lexical diversity. [Line plot: x-axis, Unique Words / Total Words, 0.00–1.00; y-axis, Probability of Accurate Rating, 0–1.]

Recall that lexical diversity is operationally defined as the proportion of the words written by an examinee that were unique (i.e., not repeated within the essay). Fig. 2 displays the empirical probability of a match for five equally-spaced values of the non-transformed lexical diversity variable. When the proportion of unique words is relatively low (e.g., about .35), the probability of raters matching the true score was only .51. As the proportion of unique words increased slightly (e.g., to about .55), the probability of matching the true score increased to about .60. At the highest level of lexical diversity (i.e., a proportion of unique words around .75), raters assigned highly accurate scores (i.e., the probability of raters matching the true score equals .97). Table 3 summarizes the results of the stepwise entry approach to regression modeling when true score was forced into the model. The results are remarkably similar to those obtained when true score was not forced into the model. That is, two variables, essay length and lexical diversity, are both statistically significant and account for a sufficient amount of the variance to warrant attention. In addition, all parameter estimates are of the same sign and approximate magnitude in the two analyses. This means that the statements made concerning the relationship between textual features and ease of scoring hold regardless of the overall quality of the essay (i.e., conditioning on true score). It is interesting to note, however, that sentence complexity now approaches substantive importance (R² = .03), with essays exhibiting less sentence complexity being rated more accurately. However, as shown in Fig. 3, the small effect size supports the conclusion that this effect is probably not of substantive interest; there appears to be only a small amount of variability in the probability of raters assigning an accurate score across levels of sentence complexity.

4. Discussion

To summarize, our results reveal that several textual features of essays are related to the ease with which raters assign accurate scores to essays. However, only two of those essay features exhibit a substantively important prediction of essay scoring accuracy, whether we control for essay quality or not. Specifically, essay length accounts for about 20% of the variance in ease of scoring, while lexical diversity accounts for an additional 4% of that variance. Other textual variables that we considered fail to account for a sufficient amount of additional variance to make them worthy of additional comment. It is also important to acknowledge that true score accounted for 6% of the total variance in ease of scoring measures, and when true score was forced into the regression equation, the unique percentage of variance accounted for by essay length dropped to 14%. In short, these results indicate that (a) longer essays are more difficult to score, (b) essays that contain less lexical diversity are more difficult to score, and (c) essays of higher quality are more difficult to score.

Table 3
Regression results including true score.

Step | Variable | Parameter estimate | SE | p-value | Partial R²
0 | True score | .14 | .06 | .03 | .06
1 | Essay length | –.24 | .07 | .0001 | .14
2 | Lexical diversity¹ | 2.40 | .71 | .0001 | .04
3 | Sentence complexity² | –.17 | .05 | .0001 | .03
4 | Spelling accuracy² | –.16 | .05 | .03 | .01
5 | Language predictability² | .16 | .09 | .02 | .01

¹ Lexical diversity was transformed via an arcsine transformation.
² Language predictability, sentence complexity, and spelling accuracy were transformed via a log transformation.


Fig. 3. Probability of accurate ratings as a function of sentence complexity. [Line plot: x-axis, Comma Frequency, 0.00–50.00; y-axis, Probability of Accurate Rating, 0–1.]

It is also important to note that the results regarding essay length and lexical diversity hold whether or not we control for overall essay quality. Our results are interesting for two reasons. Of primary importance are our regression analysis results. These results jibe with our expectations and are informative in terms of their implications for the provision of rater training and feedback. Specifically, our model that forced true score into the regression equation makes it clear that the overall quality of the essay is an important factor in how difficult it will be to score. Low quality essays are easier to score than are higher quality essays. This is consistent with what we would expect because, typically, poorly written essays exhibit numerous characteristics that are easily identified (e.g., few words written, poor mechanics, simple word choice and sentence structure). However, the most highly predictive variable, even when true score was forced into the regression equation, was essay length, with longer essays being more difficult to score. The fact that this relationship holds even when true score is included in the equation is noteworthy because it means that, even when we compare essays assigned the same true score (i.e., controlling for true score), shorter essays within that score category are easier to score. This outcome may indicate that the more evidence a rater needs to consider, the more difficult it is to arrive at a correct scoring decision. We also found that essays that exhibit more lexical diversity are easier to score; perhaps raters treat redundancy within a writing sample differently, leading to differences in agreement between them. Of secondary importance, our study illustrates an interesting methodological approach to studying the production of rater errors by focusing on those errors as a function of substantive features of the essays. Previously, Wolfe and McVay (2012) criticized much of the extant research regarding rater effects because it failed to jointly consider quantitatively precise definitions and measures of rating quality with substantive variables that have been thoughtfully selected based on detailed models of the content (in our case, writing). That is, our study demonstrates one way that traditional psychometric methods can be coupled with automated scoring technologies, and the deep substantive knowledge base upon which those technologies are based, to produce results that are important from both quantitative and substantive perspectives. In terms of the implications of this research for direct writing assessment practice, although we believe that it is premature to immediately begin changing how we train, monitor, and provide feedback to raters, we do believe that our results have the potential to inform such changes, pending additional research in this area. That additional research should focus on other aspects of the rating context, which may account for some of the remaining 75% of unexplained variance. Specifically, we did not consider how characteristics of the raters may interact with essay features in the production of rater errors. It may be that raters with some backgrounds are better able to incorporate knowledge of essay features into their decision making processes.
Similarly, it may be that some features of the rating setting and the processes employed allow raters to deal more effectively with potentially irrelevant or confusing essay features during the decision making process. In addition, we should acknowledge that our study is very limited in scope: we considered a single direct writing assessment, a single prompt and rubric within that assessment, a single pool of raters, and so on. It will be important to determine the generalizability of our results to other assessment contexts and designs. Should additional research support or elaborate upon the conclusions that we have drawn, this research may play an important role in future refinements to rater training, monitoring, and feedback procedures in direct writing assessment. For example, by focusing on essay characteristics that make an essay difficult to score, training leaders in essay scoring projects may be better able to identify training examples that communicate and test raters' understandings of the vital characteristics that differentiate high quality essays from low quality essays.


In the case of the writing assessment that we investigated, it seems that essay length and lexical diversity may be important sources of rater errors. Hence, rater training might be improved by acknowledging this fact and spending time focusing on specific examples of essays that contain these features. Of course, doing so would require rater trainers to think carefully about and then select essays that demonstrate these features so that rater training affords opportunities to present and discuss essays that illustrate these challenges for raters. In this way, results of studies like ours can inform not only the process of rater training but also the production of rater training materials. Similarly, training leaders in those projects may be better able to incorporate discussions of misconceptions that raters may allow to interfere with assigning accurate scores. That is, by better understanding the features of essays that may lead to inaccurate scoring, training procedures could be altered to communicate those tendencies to raters, so that raters might better understand their own decision making processes and avoid making errors of judgment. In short, we are suggesting that rater training could be modified to include instructing raters about their own metacognitive states (i.e., metacognitive awareness, or being aware of one's own thinking) so that raters may be better able to monitor and correct their own rating behaviors, possibly producing scores of higher quality as a result. Finally, the notion of purposively selecting rater training materials to exemplify essay features that might contribute to rater errors could be expanded to the practice of selecting materials used during operational rating for the purpose of rater monitoring. That is, rater monitoring and feedback procedures could be made more efficient by allowing those who monitor raters during scoring projects to select essays that are known to exhibit features that are likely to interfere with raters making accurate decisions. Raters could then be informed of how the features of a particular essay make it more difficult to score, again possibly allowing raters to self-correct for their own tendencies to allow those features to influence them. In short, we believe that our results suggest that potentially interesting and important applications can be realized by taking into account the relationship between essay features and the ease with which raters can assign accurate scores to those essays, and we hope that future research can illuminate and expand upon our results by taking into account additional characteristics of the rating context (e.g., rater characteristics and rating setting) as well as examining different assessment contexts (e.g., other types of examinee populations, rater populations, prompts and rubrics, etc.).

Acknowledgements

This research was funded jointly by the Performance Scoring Center and the Research & Innovations Network at Pearson.

References

Baird, J. A. (1998). What's in a name? Experiments with blind marking in A-level examinations. Educational Research, 40, 191–202.
Barkaoui, K. (2010). Explaining ESL essay holistic scores: A multilevel modeling approach. Language Testing, 27, 515–535.
Breland, H., Lee, Y. W., & Muraki, E. (2005). Comparability of TOEFL CBT essay prompts: Response-mode analyses. Educational and Psychological Measurement, 65, 577–595.
Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25, 27–40.
Briggs, D. (1980). A study of the influence of handwriting upon grades using examination scripts. Educational Review, 32, 186–193.
Chase, C. I. (1986). Essay test scoring: Interaction of relevant variables. Journal of Educational Measurement, 23, 33–41.
Crisp, V. (2012). An investigation of rater cognition in the assessment of projects. Educational Measurement: Issues and Practice, 31, 10–20.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online training program for L2 writing assessment. Language Testing, 24, 37–64.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2, 175–196.
Engelhard, G. J. (1996). Evaluating rater accuracy in performance assessments. Journal of Educational Measurement, 33, 56–70.
Fajardo, D. M. (1985). Author race, essay quality, and reverse discrimination. Journal of Applied Social Psychology, 1985, 255–268.
Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The Intelligent Essay Assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1. Retrieved from http://imej.wfu.edu/Articles/1999/2/04/index.asp
Foltz, P. W., Streeter, L. A., Lochbaum, K. E., & Landauer, T. K. (2013). Implementation and applications of the Intelligent Essay Assessor. In M. Shermis, & J. Burstein (Eds.), Handbook of automated essay evaluation (pp. 68–88). New York, NY: Routledge.
Freedman, S. W. (1979). How characteristics of student essays influence teachers' evaluations. Journal of Educational Psychology, 71, 328–338.
Hughes, D. C., Keeling, B., & Tuck, B. F. (1983). Effects of achievement expectations and handwriting quality on scoring essays. Journal of Educational Measurement, 20, 65–70.
Huot, B. A. (1993). The influence of holistic scoring procedures on reading and rating student essays. In M. M. Williamson, & B. A. Huot (Eds.), Validating holistic scoring for writing assessment (pp. 206–236). Cresskill, NJ: Hampton Press.
Klein, J. (2005). The effect of variations in handwriting and printing on evaluation of student essays. Assessing Writing, 10, 123–148.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12, 26–43.
Kobrin, J. L., Deng, H., & Shaw, E. J. (2011). The association between SAT prompt characteristics, response features, and essay scores. Assessing Writing, 16, 154–169.
Massey, A. (1983). The effects of handwriting and other incidental variables on GCE "A" level marks in English literature. Educational Review, 35, 45–50.
Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability. London: National Assessment Agency.
Meadows, M., & Billington, L. (2010). The effect of marker background and training on the quality of marking in GCSE English. Manchester: AQA Education.
Powers, D. E., Fowles, M. E., Farnum, M., & Ramsey, P. (1994). Will they think less of my handwritten essay if others word process theirs? Effects on essay scores of intermingling handwritten and word-processed essays. Journal of Educational Measurement, 31, 220–233.
Pula, J. J., & Huot, B. A. (1993). A model of background influences on holistic raters. In M. M. Williamson, & B. A. Huot (Eds.), Validating holistic scoring for writing assessment: Theoretical and empirical foundations (pp. 237–265). Cresskill, NJ: Hampton Press.
Rafoth, B. A., & Rubin, D. L. (1984). The impact of content and mechanics on judgments of writing quality. Written Communication, 1, 446–458.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22, 1–30.
Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effect of raters' background and training on the reliability of direct writing tests. The Modern Language Journal, 76, 27–33.


Sweedler-Brown, C. O. (1985). The influence of training and experience on holistic essay evaluations. English Journal, 74, 49–55.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 11–125). Norwood, NJ: Ablex.
Weigle, S. C., Boldt, H., Valsecchi, M. I., Elder, C., & Golombek, P. (2003). Effects of task and rater background on the evaluation of ESL student writing: A pilot study. TESOL Quarterly, 37, 345–354.
Wolfe, E. W. (1997). The relationship between essay reading style and scoring proficiency in a psychometric scoring system. Assessing Writing, 4, 83–106.
Wolfe, E. W. (2006). Uncovering rater's cognitive processing and focus using think aloud protocols. Journal of Writing Assessment, 2, 37–56.
Wolfe, E. W., & Manalo, J. R. (2004). Composition medium comparability of direct writing assessments for non-native English speakers. Language Learning and Technology. Retrieved from http://llt.msu.edu/vol8num1/pdf/wolfe.pdf
Wolfe, E. W., Matthews, S., & Vickers, D. (2010). The effectiveness and efficiency of distributed online, regional online, and regional face-to-face training for writing assessment raters. Journal of Technology, Learning, and Assessment, 10, 1–21.
Wolfe, E. W., & McVay, A. (2012). Applications of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31, 31–37.

Edward W. Wolfe is a Principal Research Scientist at Pearson, where he conducts research relating to human raters and automated scoring. He focuses on applications of measurement models to detecting rater effects, modeling rater cognition, evaluating automated scoring, and applying multidimensional and multifaceted latent trait models to instrument development.

Tian Song is a Research Scientist at Pearson. Her research interests are in the area of psychometrics and measurement, specifically on topics of rater effects, computerized adaptive testing (CAT), and multidimensional item response theory (IRT).

Hong Jiao is an Associate Professor at the University of Maryland, specializing in psychometrics for large-scale assessments. Her current research focuses on applying multilevel item response theory (IRT) modeling to rater effects and to item and person clustering effects. She has explored Bayesian estimation methods for multidimensional, multilevel, and mixture IRT models.