Assessing Writing special issue: Assessing writing with automated scoring systems

Assessing Writing 18 (2013) 1–6

Automated essay scoring (AES)—the use of computer programs to evaluate the quality of an essay by providing a single score, a detailed evaluation of essay features, or both—has a rich history. Predating the common availability of word processors such as WordStar in the early 1980s, the seminal research on AES conducted by Page (1966) is now 47 years old. While using computers to score essays may have seemed like science fiction when first conceived, such an application may seem remarkably mundane when compared to the contemporary capabilities of computers to win game shows (Markoff, 2011), respond amicably to our vocal commands (Johnson, 2011), and drive cars (Claburn, 2012). That dismissive view fades, of course, when we recall that writing is among the most complex of human behaviors, both vehicle and subject of thought and culture. Those who have devoted their careers to the study, teaching, and assessment of writing view writing not as a simple behavior easily evaluated through computer algorithms; rather, writing researchers view composition as a rich and nuanced activity informed by myriad discursive and non-discursive purposes, practices, and perspectives.

Because writing is an activity that is so deeply human, its association with formulation is double-edged. Because students are encouraged to write fluently or to achieve competency in their knowledge of conventions, a certain degree of formulation is necessary. But when these formulations are used by machines as the basis for assessing writing beyond fluency or knowledge of grammar (Attali & Powers, 2008), there is an inherent suspicion that technology can corrupt the essence of a fundamentally human activity (Ericsson & Haswell, 2006; Herrington & Moran, 2012).

When machines first impacted the practice of writing with the creation of the first word processors, the skepticism was palpable (Nemerov, 1967; Schwartz, 1982). Collier (1983) undertook case studies with first-year composition students regarding the impact of word processor use on revision strategies. Some of his conclusions suggest how the world has changed. Students, he warned, would have to become sufficiently “computer literate” to make revising on a word processor a practical alternative to writing by hand. Other conclusions stand across time: “Our electronics engineers will have to redesign the word processor so that it demonstrably supports and enhances the writing process” (Collier, 1983, p. 154).

From the outset, construct representation (the extent to which the components recognized by machines are true to the meaning of writing) was a central concern. So too was context—the embedded circumstance of the use of such systems. In a response to Collier’s article, Pufahl (1984) argued that the most significant relationship in writing instruction is not computer to construct but teacher to student. Computers, Pufahl predicted, will not have much impact on improving writing if teachers do not use them effectively in the writing situation. As we have come to learn, technology alone is neither autonomous nor deterministic (Ellul, 1964); rather, it is simply a product of and for human activity, and there is no reason to think of technology as beyond our control (Pitt, 2011).


In order to exercise that control, however, we must understand the goals that are being facilitated by technology and be ready to evaluate how well technology facilitates these human goals. With this special issue of Assessing Writing we hope to foster ongoing investigation of AES in meeting human goals by presenting a range of invited papers that encourage continuing dialog and research on the design and use of AES.

The papers were invited to represent multiple parts of an overall validity argument related to AES. In the tradition of Kane (2006), these papers reflect the fact that validity is not one unitary concept that is either present or absent for the use of test scores but a judgment based on the totality of evidence that can be brought to bear to argue the merits of test score use. As such, no single study, nor a set of studies investigating only a single aspect of a validity argument (and ignoring others), is sufficient to draw a conclusion about the validity of a test. Instead, there are many questions that have bearing on validity, some of which may be focused on the degree to which the test scores embody the essence or meaning of the tested construct (construct representation), while others investigate the degree to which test scores are associated (or not) with external measures that are similar to (or divergent from) the ability being measured. Still other questions may be focused on aspects of fairness of test use for different groups of examinees. Each of these (and more) is but one part of a potential body of evidence constituting a validity argument for the use of test scores. So, too, are there multiple aspects to the evaluation of AES; therefore, a true validation of AES would only be possible in the context of a particular application of the technology.

Table 1 presents a set of sample approaches to portions of a validation argument, following Kane’s (2006) representation of the four major areas of argument for validity: scoring, generalization, extrapolation, and implication. For each of these we offer sample areas of investigation and corresponding research questions. These are but a few of the many possible questions that could be posed, and the investigative pattern established there provides a systematic way for researchers to pose questions, design validation processes, and conceptualize arguments bearing on the validity of AES systems (Williamson, Xi, & Breyer, 2012). We invite the readers of this special issue to read each of the invited papers and consider how the topics addressed in each correspond to the branches of inquiry provided in Table 1.

The evaluation of general characteristics of AES in different environments can provide insight into how AES might be expected to perform in new assessment or learning environments, and the invited papers in this special issue reflect a range of such investigations. Defined by Shermis and Burstein (2003) as a process of scoring (assigning a number to an essay) and evaluating (providing feedback on organization or language use) written products, an effective AES system is the product of many elements that must work in harmony. Each article grapples with a different set of issues raised by this basic definition; each represents different perspectives on the challenges and the potential of AES.
In “On the Relation between Automated Essay Scoring and Modern Views of the Writing Construct,” Paul Deane examines the range of construct measurement in AES systems. After providing a list of high-volume, high-stakes uses in the Graduate Record Examination (GRE®), the Test of English as a Foreign Language (TOEFL®), and the Graduate Management Admission Test (GMAT®), he then turns to AES technological capabilities that address contemporary definitions of the writing construct. Deane calls special attention to the importance of high fluency and control of text production—factors of writing ability measured by AES—in their relationship to the cognitive resources required to master a broader writing construct. In taking a position of moderation and recognizing the limitations of AES systems, Deane cautions that no assessment technology should be adopted blindly in any context. He equally notes that no assessment method should be rejected without considering how it can be used to support effective instruction. Deane’s article envisions a possible future by conceptualizing AES as the basis for large-scale, embedded forms of automated writing analysis in which social and cognitive aspects of the writing process are taken more richly into account. In his vision for the next generation of testing, Deane follows Tucker (2009) in acknowledging the unique convergence of powerful computer technologies and developments in cognitive science that promise a new generation of testing positioned to advance student learning.

In that spirit of exposition, Chaitanya Ramineni and David Williamson provide an overview of psychometric procedures and guidelines used at Educational Testing Service (ETS) to evaluate AES for operational use as one of two scores (the other being a human grader) in high-stakes assessments.


Table 1
A sampling of validity arguments for AES.

Scoring
Interpretive argument: Evaluation of the relationship between the observed performance on the essay and the observed essay score; quality of the score.
Sample areas of investigation: (1) Construct representation. (2) Association with human scores.
Sample research questions: (1) Does AES measure the same essay characteristics that are valued by human raters? (2) Do the scores from AES match the scores from qualified human raters?

Generalization
Interpretive argument: Evaluation of the relationship between the observed essay score and the score that would be expected from administering all possible similar essay tasks; representativeness of this score in comparison to scores from other possible essay tasks.
Sample areas of investigation: (1) Representativeness of the tasks used in assessment to the broader set of writing tasks of interest. (2) Consistency of scores across responses to different writing tasks.
Sample research questions: (1) Are the assessment tasks fully representative of the areas of writing that are valued, or do these tasks under-represent the writing construct? (2) When examinees respond to multiple writing tasks of similar design, are the scores similar across responses?

Extrapolation
Interpretive argument: Evaluation of the relationship between scores that would be expected from administering all possible similar essay tasks and scores from other measures in the domain of writing; conclusions about writing ability.
Sample areas of investigation: (1) Association of AES scores with other measures of writing.
Sample research questions: (1) What is the relationship between AES scores and scores from writing measures such as multiple choice tests, constructed response tasks, portfolios, ePortfolios, and writing course grades?

Implication
Interpretive argument: Evaluation of the relationship between the measure of writing ability from the assessment and subsequent interpretation for decision-making and prediction; utility of the conclusion about writing ability.
Sample areas of investigation: (1) Use of AES scores for decision-making. (2) Fairness of AES scores. (3) Robustness to construct-irrelevant test-taking strategies. (4) Consistency of writing ability.
Sample research questions: (1) Are AES scores sufficiently predictive of course performance to be used for placement—or of academic performance more broadly to be used for admissions? (2) Do examinees with similar levels of ability, but different demographic backgrounds, receive similar AES scores? (3) Can examinees use construct-irrelevant or construct-impoverished strategies to achieve AES scores higher than would be expected? (4) Are AES scores for an examinee consistent over time for similar types of writing tasks?

The evaluation procedures they note cover multiple criteria, including association with human scores, distributional differences, subgroup differences, association with external variables of interest, and analysis against expected levels of performance. While they argue that the a priori establishment of performance expectations and the evaluation of AES systems against these expectations help to ensure that automated scoring provides a positive contribution to the large-scale assessment of writing, the authors also call for continuing transparency in the design of automated scoring systems and for clear and consistent expectations of automated scoring performance before operational use.

As an example of the application of these procedures, Ramineni presents a case study in which AES is evaluated for postsecondary placement purposes. Her paper describes the use of AES models customized for a particular population at a specific institution. Based on human-scored essays submitted at that institution, the customized models were evaluated against ETS criteria. The study identifies a reasonable relationship to each of the study’s outcome measures, with a pattern of increasing correlation as the outcome measure progresses from the more general measure of ability in grade point average to the writing-specific measure of portfolio scores. This study presents the first analysis of the ability of AES systems to be tailored for local use, a hallmark principle of validation for the field of rhetoric and composition/writing studies (Huot, 1996).
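To make such criteria concrete, the short Python sketch below computes a few of the agreement and distributional statistics named above (exact and adjacent agreement, correlation, and a standardized mean difference) for paired human and automated scores. The function name, the sample data, and the threshold values are illustrative assumptions for this sketch, not the actual ETS flagging criteria.

```python
import numpy as np

def evaluate_aes(human, machine, min_r=0.70, max_abs_smd=0.15):
    """Compare automated scores with human scores on common evaluation criteria.

    The thresholds are placeholders; operational programs set their own
    a priori expectations before deployment.
    """
    human = np.asarray(human, dtype=float)
    machine = np.asarray(machine, dtype=float)

    exact = np.mean(human == machine)                  # identical scores
    adjacent = np.mean(np.abs(human - machine) <= 1)   # within one score point
    r = np.corrcoef(human, machine)[0, 1]              # association with human scores
    smd = (machine.mean() - human.mean()) / human.std(ddof=1)  # distributional difference

    return {
        "exact_agreement": exact,
        "adjacent_agreement": adjacent,
        "correlation": r,
        "standardized_mean_difference": smd,
        "meets_expectations": (r >= min_r) and (abs(smd) <= max_abs_smd),
    }

# Hypothetical scores on a 1-6 scale for ten essays
human_scores = [4, 3, 5, 2, 4, 3, 6, 1, 4, 5]
machine_scores = [4, 3, 4, 2, 5, 3, 6, 2, 4, 5]
print(evaluate_aes(human_scores, machine_scores))
```

In practice, subgroup analyses repeat these statistics within each demographic group of interest and compare the results across groups.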


Departing from the field of educational assessment, evaluation, and research, writing researcher Andrew Klobucar and his colleagues examine AES used for rapid assessment of admitted students during the early weeks of first-year college writing instruction. Using a methodology focusing on construct modeling, response processes, disaggregation, extrapolation, generalization, and consequence, Klobucar and his colleagues raise questions regarding the disjuncture between American secondary and post-secondary education; the relationship between the writing construct of the AES under examination and other national and local models of the writing construct; use of the AES by instructors and their students; performance of diverse groups of students; stability among writing measures; and consequences of using AES for rapid assessment. Complementary to research by Grimes and Warschauer (2010), Klobucar’s case study is part of the effort to validate AES score use within specific institutions and to propose systems of validation that may prove useful across institutional sites.

In “English Language Learners and Automated Scoring of Essays,” Sara Cushing Weigle attends to the importance of formulating a construct of second language writing for instruction and assessment. At the center of her reflective paper, Weigle reports on her research with automated scores on TOEFL® writing tasks. Although there were differences in scores across writing tasks for the individual features used to generate total e-rater scores, Weigle found that the AES was more consistent across tasks than individual human raters in scoring essays of non-native speakers of English. She also found, however, that the algorithm used by the AES under examination was not capturing some of the features of non-native writing that human raters are sensitive to when they make their evaluations. As she concludes in the case of English Language Learners, the more system developers know about the particular needs of students—and the more instructors and administrators know about AES design—the greater the chance that this technology will be used to address the needs of this important and growing population.

Emphasizing AES limitations, William B. Condon turns the tables on large-scale testing, challenging the industry to meet the standards of locally developed assessment at institutional sites. In a figure depicting the universe of writing assessment, Condon offers a map on which measures of writing ability can be located according to their construct representation, termed validity in his paper. Condon advocates robust assessments: evaluative occasions which link tests with authentic tasks, provide rich descriptions of a writer’s competencies, and allow evaluators to make judgments about the writer’s learning context, the writer, and the writing itself (MacMillan, 2012; Wardle & Roozen, 2012). Citing the 4000 rising-junior portfolios that are read each year at Washington State University as an example of these assessments (Kelly-Riley, 2012), Condon calls for attention to writing tasks that reflect actual writing contexts, use of portfolio assessment, and blurred boundaries between testing and teaching in the service of student learning. In so doing, Condon backgrounds research comparing AES to human raters. He notes that a recent national competition (Shermis & Hamner, 2012) revealed that human quadratic-weighted kappas ranged from 0.61 to 0.85, closely mirrored by AES ranges of 0.60 to 0.84.
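Quadratic-weighted kappa, the agreement statistic behind those figures, penalizes disagreements between two raters in proportion to the squared distance between their scores. The following Python sketch is a minimal illustration of the statistic under an assumed 1–6 score scale and invented sample data; it is not the code used in the Shermis and Hamner study.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=6):
    """Quadratic-weighted kappa between two sets of integer essay scores."""
    a = np.asarray(rater_a) - min_score
    b = np.asarray(rater_b) - min_score
    k = max_score - min_score + 1

    # Observed matrix: how often each pair of scores co-occurs
    observed = np.zeros((k, k))
    for i, j in zip(a, b):
        observed[i, j] += 1

    # Expected matrix from each rater's marginal score distribution
    expected = np.outer(np.bincount(a, minlength=k),
                        np.bincount(b, minlength=k)) / len(a)

    # Quadratic disagreement weights: zero on the diagonal, largest at the corners
    idx = np.arange(k)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Invented scores: a machine that tracks human raters closely yields a high kappa
human = [4, 3, 5, 2, 4, 3, 6, 1, 4, 5]
machine = [4, 3, 4, 2, 4, 3, 6, 2, 4, 5]
print(round(quadratic_weighted_kappa(human, machine), 2))
```

A kappa of 1 indicates perfect weighted agreement, so the 0.60–0.85 range cited above reflects strong but imperfect agreement for both human and automated raters.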
Yet such human-machine measures of inter-rater reliability are based on “pseudo-essays,” as he terms them: writing samples classified as originating from artificial contexts.

While each of the invited papers addresses a different set of issues that are important in the evaluation of AES for a variety of potential uses, they have in common an orientation toward broad categories of construct representation, benefits, and challenges. Some of these challenges are not specific to AES but inherent in the assessment of writing as a complex socio-cognitive process globally manifested over the last five millennia (Prior & Lunsford, 2008). Within this context, a fully mature AES system would have the potential to provide a number of benefits in large-scale scoring: quality improvements over human scoring; consistency, tractability, specificity, and attention to detail; speed of scoring and score reporting; reduced recruitment and training overhead; provision of annotated feedback on performance; and cost savings.

No single issue of a journal, no matter its focus, can address all the potential questions associated with AES, including noteworthy challenges such as construct, task, and scoring investigation; human and automated score agreement; AES associations with external measures; AES generalizability across alternate tasks and test forms; and claims, disclosures, and consequences (Williamson et al., 2012). Indeed, an entire program of research on AES can be imagined by expanding Table 1 along just such lines.

For international language testing issues in the evaluation and use of automated scoring, readers are referred to the journal Language Testing. There may be found accounts of language testing in China, for example, as 13 million students in 2006 prepared for the College English Test, an examination designed to ensure English proficiency in Chinese undergraduate students (Zheng & Cheng, 2008).


As well, readers interested in technical aspects of AES development are directed to the proceedings of the Association for Computational Linguistics, the Asian Federation of Natural Language Processing, and the International Computer Assisted Assessment Conference. For AES use in languages other than English, such as the research of the Hebrew Language Project (Ben-Simon & Cohen, 2011) or automated Chinese essay scoring (Chang & Lee, 2009), readers should consult conference papers from the International Association for Educational Assessment and the International Conference on Asian Language Processing, respectively. Readers interested in the scoring function of AES will find general information in Shermis and Burstein (2013); for an introduction to the evaluation function of AES, readers should consult Leacock, Chodorow, Gamon, and Tetreault (2010). For investigations of AES in specific contexts, readers are referred to the Journal of Technology, Learning, and Assessment, with special attention to the research of Attali and Burstein (2006), Dikli (2006), and Wang and Brown (2007). For research on relationships among composition instructors, writing program administrators, and AES, readers should follow publications associated with the National Council of Teachers of English and the Council of Writing Program Administrators. Assessing Writing remains an essential source of evidence-based research, as does the Journal of Writing Assessment.

It is our hope that this special issue of Assessing Writing will stimulate informed debate through the study of best validation practice. It is our belief that the invited papers advance a constructive dialog that will lead to greater understanding of the place of AES among writing assessment systems. Beyond this hope, history—ever human—will reveal how far AES technological capabilities, bound by our own values, will carry us.

Note from the Editors of the Special Issue

This article is part of a special issue of Assessing Writing on the automated assessment of writing. The invited articles are intended to provide a comprehensive overview of the design, development, use, applications, and consequences of this technology. Please find the full contents list available online at: http://www.sciencedirect.com/science/journal/10752935.

References

Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater V.2. Journal of Technology, Learning and Assessment, 4(3). Retrieved from: http://ejournals.bc.edu/ojs/index.php/jtla/article/view/1650/
Attali, Y., & Powers, D. (2008). A developmental writing scale (ETS Research Report No. RR-08-19). Princeton, NJ: Educational Testing Service. Retrieved from: http://www.ets.org/Media/Research/pdf/RR-08-19.pdf
Ben-Simon, A., & Cohen, Y. (2011). The Hebrew Language Project: Automated essay scoring and readability analysis. In Proceedings of the International Association for Educational Assessment. Retrieved from: http://www.iaea.info/documents/paper 4e1237ae.pdf
Chang, T. H., & Lee, C. H. (2009). Automated Chinese essay scoring using connections between concepts in paragraphs. In Proceedings of the International Conference on Asian Language Processing (pp. 265–268). http://dx.doi.org/10.1109/IALP.2009.63
Claburn, T. (2012, October). Google autonomous cars get green light in California. InformationWeek. Retrieved from: http://www.informationweek.com/government/policy/google-autonomous-cars-get-green-light-i/240008033
Collier, R. M. (1983). The word processor and revision strategies. College Composition and Communication, 34, 149–155.
Dikli, S. (2006). An overview of automated scoring of essays. Journal of Technology, Learning and Assessment, 5(1). Retrieved from: https://ejournals.bc.edu/ojs/index.php/jtla/article/viewFile/1640/1489
Ellul, J. (1964). The technological society (J. Wilkinson, Trans.). New York, NY: Knopf.
Ericsson, P. F., & Haswell, R. (Eds.). (2006). Machine scoring of student essays: Truth and consequences. Logan, UT: Utah State University Press.
Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal of Technology, Learning and Assessment, 8. Retrieved from: http://www.jtla.org
Herrington, A., & Moran, C. (2012). Writing to a machine is not writing at all. In N. Elliot & L. Perelman (Eds.), Writing assessment in the 21st century: Essays in honor of Edward M. White (pp. 219–232). New York, NY: Hampton Press.
Huot, B. (1996). Towards a new theory of writing assessment. College Composition and Communication, 47, 549–566.
Johnson, A. (2011, November). Now your phone talks back and humors you. The New York Times. Retrieved from: http://www.nytimes.com/2011/11/06/fashion/when-your-phone-humors-you-noticed.html? r=1
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Kelly-Riley, D. (2012). Getting off the boat and onto the bank: Exploring validity of shared evaluation methods for students of color in college writing assessment. In A. B. Inoue & M. Poe (Eds.), Race and writing assessment (pp. 29–43). New York, NY: Peter Lang.
Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. (2010). Automated grammatical error detection for language learners. San Rafael, CA: Morgan and Claypool.
MacMillan, S. (2012). The promise of ecological inquiry in writing research. Technical Communication Quarterly, 24, 346–361. http://dx.doi.org/10.1080/10572252.2012.674873
Markoff, J. (2011, February). Computer wins on “Jeopardy!”: Trivial, it’s not. The New York Times. Retrieved from: http://www.nytimes.com/2011/02/17/science/17jeopardy-watson.html?pagewanted=all
Nemerov, H. (1967). Speculative equations: Poems, poets, computers. The American Scholar, 36, 394–414.
Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 48, 238–243.
Pitt, J. C. (2011). Doing philosophy of technology: Essays in a pragmatist spirit. Dordrecht, NL: Springer.
Prior, P., & Lunsford, K. J. (2008). History of reflection, theory and research on writing. In C. Bazerman (Ed.), Handbook of research on writing: History, society, school, individual, text (pp. 81–96). New York, NY: Lawrence Erlbaum.
Pufahl, R. M. (1984). Response to Richard M. Collier. College Composition and Communication, 35, 91–93.
Schwartz, H. J. (1982). Monsters and mentors: Computer applications for the humanities. College English, 44, 141–152.
Shermis, M. D., & Burstein, J. (2003). Introduction. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. xiii–xvi). Mahwah, NJ: Lawrence Erlbaum Associates.
Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. New York, NY: Routledge.
Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays: Analysis. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, BC, Canada. Retrieved from: http://www.scoreright.org/NCME 2012 Paper3 29 12.pdf
Tucker, B. (2009). The next generation of testing. Educational Leadership, 67, 48–53.
Wang, J., & Brown, M. S. (2007). Automated essay scoring vs. human scoring: A comparative study. Journal of Technology, Learning and Assessment, 6(2). Retrieved from: http://ejournals.bc.edu/ojs/index.php/jtla/article/viewFile/1632/1476
Wardle, E., & Roozen, K. (2012). Addressing the complexity of writing development: Toward an ecological model of assessment. Assessing Writing, 17, 106–119. http://dx.doi.org/10.1016/j.asw.2012.01.001
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31, 2–13. http://dx.doi.org/10.1111/j.1745-3992.2011.00223.x
Zheng, Y., & Cheng, L. (2008). College English Test (CET) in China. Language Testing, 25, 408–417. http://dx.doi.org/10.1177/0265532208092433

Norbert Elliot is professor of English at New Jersey Institute of Technology.

David M. Williamson is the Senior Research Director for the Assessment Innovations Center in the Research & Development Division of Educational Testing Service, where he oversees fundamental and applied research at the intersection of cognitive modeling, technology, and multivariate scoring models, with an emphasis on the development, evaluation, and implementation of automated scoring systems for text and spoken responses.

Norbert Elliot a,∗
David M. Williamson b,1

a New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
b Educational Testing Service, Rosedale Road, Princeton, NJ 08541, USA

∗ Corresponding author at: Department of Humanities, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA. Tel.: +1 973 596 3266. E-mail addresses: [email protected] (N. Elliot), [email protected] (D.M. Williamson).
1 Assessment Innovations Center, Research & Development Division, Educational Testing Service, 660 Rosedale Road, Princeton, NJ 08541, USA. Tel.: +1 609 734 1303.