Journal of Second Language Writing 22 (2013) 217–230
Using multiple texts in an integrated writing assessment: Source text use as a predictor of score

Lia Plakans (The University of Iowa, Department of Teaching and Learning, N259 Lindquist Center, Iowa City, IA 52242, USA)
Atta Gebril (The American University in Cairo, English Language Institute, P.O. Box 74, New Cairo 11835, Egypt)

http://dx.doi.org/10.1016/j.jslw.2013.02.003
Abstract

Interest in integrated tasks is increasing in second language writing, accompanied by a concern for appropriate interpretation of performances and scores from these tasks. Integrated writing adds an element not found in traditional independent writing: the use of source text material. This study investigates how source text use appears in performances on an integrated writing task, and how it differs across score levels and task topics. Educational Testing Service (ETS) supplied 480 performances on the writing section of the Internet-based Test of English as a Foreign Language (TOEFL iBT) to explore these questions. The integrated TOEFL task involves a comparative summary of listening and reading texts that present differing views on a topic. In this study, multiple regression analysis was used to consider three areas of source text use: (1) the importance of source text ideas that writers included in their summary, (2) the use of ideas from a reading source text and from a listening text, and (3) the borrowing of exact wording from the source texts (verbatim source use). These three areas were analyzed across nine score levels and indicated that score and source use are related. Overall, these features of source text use explained over 50% of the variance in scores on the reading–listening–writing task. The use of the listening text and the inclusion of important ideas from source texts explained the most variance, while use of the reading text and verbatim source use were less predictive. The latter two held a negative correlation with score, indicating that the lower scoring essays had more of these features. These findings support the claim that integrated writing assessment elicits academic writing processes, which is reflected by score. High-scoring writers selected important ideas from the source texts and used the listening text as the task prompt instructed. Low scoring writers depended heavily on the reading texts for content and direct copying of words and phrases. These findings support the validity of interpreting integrated task scores as a measure of academic writing but provide a nuanced look at the contribution of certain source use features.

© 2013 Elsevier Inc. All rights reserved.

Keywords: Integrated writing; Writing assessment; Source use
Introduction

Assessment tasks that isolate writing have been prominent in second language research and learning for decades; however, recently, tasks that integrate this ability with other skills are emerging in both high-stakes testing and classroom contexts. Multiple skills, such as reading and listening, are combined in one task that requires writers either to summarize or state their opinions on a topic presented in source texts. The rationale behind this approach,
particularly in English for Academic Purposes (EAP) writing, is to simulate language used in academic tasks that require comprehension and integration of source material (Plakans, 2008; Read, 1990; Weigle, 2004). In teaching, this alignment of classroom tasks and the "real world" increases student motivation and can improve transfer of language skills to academic courses (Leki & Carson, 1994, 1997). In language assessment, this connection substantiates the validity of inferences from test scores (Chapelle, Enright, & Jamieson, 2008) and strengthens test usefulness (Bachman, 2002; Bachman & Palmer, 2010).

Our study investigated scores on an integrated writing assessment task from the Internet-based Test of English as a Foreign Language (TOEFL iBT) developed by Educational Testing Service (ETS). TOEFL test scores have been used by North American universities in making admission decisions since the 1960s to determine whether the English skills of non-native speakers are adequate to undertake academic coursework. In 2000, ETS merged the Test of Written English (TWE) with the TOEFL; the resulting hybrid included a writing section, which consisted of one independent prompt. This change foreshadowed a major revision that came in 2005 when the TOEFL iBT was released. The revision was based on the recommendations of a number of studies conducted by ETS (e.g., Cumming, Kantor, Powers, Santos, & Taylor, 2000; Cumming, Kantor, & Powers, 2001). In the current test, the independent writing-only task is accompanied by an integrated writing task that prompts test-takers to read a passage, listen to a lecture on a topic, and then write a summary that connects the two source texts. This new task adds depth to the writing score on the TOEFL and improves the authenticity of the writing elicited. However, integrated tasks have had less research attention, and thus, many questions exist about how to interpret scores and the role of source texts in the writing from such tests.

Cumming et al. (2005) conducted a large-scale study of three types of tasks piloted for the TOEFL iBT, two of which were integrated tasks: a reading–writing and a listening–writing task (the third was an independent task). By comparing language and textual features across these tasks and across score levels, the researchers provided important discussion about the impact of source texts. Their results indicated that the performances between the three task types differed in complexity, rhetorical style, and pragmatics, but not in grammatical accuracy. When ETS decided which task type to use for the iBT, however, reading and listening were combined into one task, compounding the use of multiple texts with multiple skills. How are test takers affected by this complexity? The goal of our study was to consider features related to source use in TOEFL integrated writing task performances and to analyze how they relate to score.

Background

Over the past ten years, research on integrated writing tasks has blossomed, delving into issues of task comparison, characteristics of performances across levels, and writers' processes in composing these tasks. The studies have been conducted with TOEFL tasks, as well as other academic writing tests, such as university placement exams. This section will briefly review the research in these areas.
Research on integrated task performances has compared them to independent tasks (Gebril, 2009, 2010; Lewkowicz, 1994; Watanabe, 2001), defined language features at different proficiency or score levels (Cumming et al., 2005; Gebril & Plakans, 2009), and uncovered writers' processes in composing these tasks (Ascención, 2005; Esmaeili, 2002; Plakans, 2008, 2009b; Yang, 2009). Several researchers have correlated scores from the two types of writing tasks. In a study of reliability, Watanabe (2001) found that the correlation between two different integrated tasks (r = .69) was actually similar to the correlation between an integrated and independent task (r = .62). In contrast, Gebril (2006) and Lee and Kantor (2005) discovered much higher correlations between independent and integrated tasks with values of .93 and above. These differences could be explained by different scoring scales across studies and tasks as well as different participants. However, it is important to recognize that, while the two task types both seek to evaluate writing, some differences may lie in the underlying constructs they elicit. Most significantly, integrated tasks include elicitation of multiple skills as well as the ability to use sources to build one's own writing.

Looking further than holistic scores has provided more explanation of the similarities and differences between these task types. Lewkowicz (1994) found that holistic scores did not differentiate the two tasks and that writing from the two tasks was comparable in response length. However, she identified a significant difference in performances in the number of points introduced in the essays, with more points made in the reading-to-write task. Her conclusion was that since the integrated task writing had more points but was not longer, then each point was less developed than those in the independent writing responses. Cumming et al. (2005) compared discourse features in TOEFL pilot tasks, which included integrated and independent tasks, finding significant differences across areas such as lexical/syntactic complexity and argument structure.
In sum, holistic scores have indicated that a strong relationship exists between independent and integrated tasks, but differences emerge when performances are closely scrutinized. An obvious and critical difference in these task types is the use of source materials in the writing of integrated tasks, a feature which obviously cannot be compared to independent tasks.

In addition to comparison studies, another vein of research investigates discourse features in integrated writing tasks across proficiency/score levels. Research has shown that fluency consistently increases with score level (Cumming et al., 2005; Gebril & Plakans, 2009); however, differences in lexical complexity depend on the measures used. Cumming et al. (2005) found significant differences across score levels using a type/token ratio to measure lexical diversity, while Gebril and Plakans (2009) found no difference in average word length across scores. Adding to this picture, Gebril and Plakans (unpublished manuscript) found that, when using vocd [2], lexical diversity differed across levels and that the lowest-scoring writers, in fact, did not have the least variety in words. In investigating syntactic complexity, different measures may again have led to different results. Cumming et al. (2005) found significant differences in words per T-unit [3] between score levels but none when measuring clauses per T-unit. Gebril and Plakans (2009) also found no significance with T-units per sentence. Lastly, grammatical accuracy, like fluency, has been found to differentiate score levels on integrated tasks (Gebril & Plakans, 2009).

Studies have looked at these features and their increasing levels or quantity with proficiency levels; however, this assumes a linear progression and does not consider a ceiling, i.e., a point where these features stop increasing. Recent research (Gebril & Plakans, 2009) has suggested, with closer inspection across score levels, that for a number of discourse features, such as grammatical accuracy, differences are stronger at lower score levels and less pronounced at higher ones. Perhaps something other than traditional linguistic features is separating performances at higher levels. The variation potentially lies in the use of source materials.

A third area of integrated task research has explored writers' processes in completing the tasks. These studies have concluded that reading skills are important in integrated reading–writing tasks (Esmaeili, 2002; Plakans, 2009a) although they are not directly interpretable from the final score (Ascención-Delaney, 2008; Watanabe, 2001), and that high-scoring writers seem to employ discourse synthesis, a process of selecting, connecting, and organizing as they read and write for these tasks (Ascención, 2005; Plakans, 2009b; Yang, 2009; Yang & Plakans, 2012). This research contributes to our knowledge of integrated tasks, and, while process is not the focus of this study, it can help explain the results of research on written products. It also calls attention to an important facet of integrated writing: how source texts are used.

Source use research

Two research areas have received a good deal of attention in clarifying source use in integrated task products: integration style and verbatim use of source texts (i.e., plagiarism).
Often these issues are combined in research since they are clearly related (Campbell, 1990; Cumming et al., 2005; Currie, 1998; Gebril & Plakans, 2009; Johns & Mayes, 1990; Pennycook, 1996; Shi, 2004; Watanabe, 2001). Watanabe (2001) identified implicit and explicit source use in 47 reading-to-write responses and found that writers tended to use quotation (explicit) most, along with some partial paraphrase and summary (implicit). Gebril and Plakans (2009) used a similar scheme to code 145 English writing samples from Arabic speakers and discovered that, overall, higher-scoring writers used source texts more than lower-scoring writers. Cumming et al. (2005) also found differences across scores and source use. The most proficient writers summarized more than less proficient writers, mid-range writers tended to paraphrase and plagiarize more than high and low proficiency writers, while the least proficient writers summarized, paraphrased, and copied less than all other levels of writers. The researchers hypothesized that low proficiency writers were not able to understand the source text well enough to use it even for direct copying. It is important to note that in most of these studies, "proficiency" is defined by the writers' score on the test itself; however, in some cases, an independent measure of proficiency was used. This disparity can lead to divergent explanations of how source use differs across proficiency levels.
[2] vocd is a software program that computes lexical diversity based on mathematical modeling that is impacted less by word count than type/token ratio. For more information, readers are referred to Malvern, Richards, Chipere, and Duran (2009).
[3] T-units are similar to independent clauses plus their dependent clauses. They are defined as the smallest unit of a sentence that can stand alone grammatically.
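For readers who wish to compute comparable indices on their own data, the sketch below shows simple versions of two of the measures discussed above: type/token ratio (lexical diversity) and mean words per T-unit (syntactic complexity). It is only a minimal illustration, not the instrument used in any of the studies cited (and not vocd); it assumes essays have already been segmented into T-units, here represented as plain strings.

import re

def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique word forms divided by total words.
    Sensitive to text length, which is why measures such as vocd are
    often preferred when essays differ in length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_words_per_t_unit(t_units: list[str]) -> float:
    """Syntactic complexity: average number of words per T-unit,
    assuming the essay has already been divided into T-units."""
    lengths = [len(re.findall(r"[a-z']+", t.lower())) for t in t_units]
    return sum(lengths) / len(lengths) if lengths else 0.0

# Hypothetical essay segmented into two T-units
t_units = [
    "Fish farming can damage coastal ecosystems",
    "although it supplies a growing share of the seafood people eat",
]
essay = " ".join(t_units)
print(round(type_token_ratio(essay), 2))         # 1.0 (all 17 words are unique)
print(round(mean_words_per_t_unit(t_units), 1))  # 8.5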
The issue of copying text, often referred to as verbatim source use, has generally received a great deal of attention in L2 writing research (e.g., Currie, 1998; Johns & Mayes, 1990; Pennycook, 1996; Shi, 2004) and was recently featured in a special issue of Journal of Second Language Writing in 2012 on textual appropriation (Polio & Shi, 2012). In an early study on verbatim source use, Johns and Mayes (1990) investigated direct copying in 80 writing samples divided into high and low proficiency groups. Their results revealed the low proficiency writers copied more directly; however, no significant differences occurred between the two groups in "correct paraphrasing." Further, the higher proficiency group displayed more combinations of idea units from the source texts but were also found to distort the ideas from the source texts. While not considering levels of L2 proficiency directly, Shi (2004) compared native and non-native English writers composing two types of tasks: opinion and summary. The results indicated that the L2 writers borrowed more from source texts, and that the summary task elicited more verbatim source use. Abasi, Akbari, and Graves (2006) investigated authors' identity and source use in academic coursework requiring source use. Through deep data collection with five non-native English writing graduate students, Abasi and colleagues found that those who were more experienced had stronger authorial identities and those with less experience plagiarized more as they saw the source texts as more authoritative than themselves. Campbell's (1990) study considered L2 proficiency as well as the L1/L2 distinction, finding that all groups composing a reading–writing task displayed explicit use of sources, but that the L2 writers included citation of the sources considerably more than the native speakers. These studies provide evidence that integrated writing task scores correspond to integration style and verbatim source use in complex ways. Whether these results transfer to tasks that include listening with reading and writing remains unstudied.

Other critical issues in source use require attention but have not benefited from such substantial consideration. One concern is the key ideas that writers include in their essays from source texts. In the L1 literature, Spivey (1984, 1990) traced this feature with reading–writing integration and found that more competent writers distinguished important and less important source text ideas when selecting content for their writing. In L2 testing, research on these tasks has not focused on this issue, and yet the scoring of integrated tasks points to this skill, as shown below in the descriptor from score Level 4 on the TOEFL iBT integrated task rubric:

A response at this level is generally good at selecting the important information from the lecture and in coherently and accurately presenting this information in relation to the relevant information in the reading, but may have some minor omissions, inaccuracy, vagueness or imprecision of some content from the lecture or in connection to points made in the reading. (italics added for emphasis)

Another under-researched topic in integrated assessment is the role of multiple sources. Frequently in integrated tasks, multiple source texts are given to test takers; this could be two reading texts or, in the case of the TOEFL, a reading and a listening text. How do writers navigate between or across such texts?
A few studies in L1 academic writing research have considered the question of multiple reading texts (Nash, Shumacher, & Carlson, 1993; Stahl, Hynd, Gylnn, & Carr, 1996) and found that writers use the first text read to frame their understanding of a topic and use the second and later texts less. However, studies have not yet addressed second language tasks that require multiple source use or that have included listening sources. Given these gaps in understanding source use in integrated assessment tasks, our study sought to investigate the following research questions regarding the source texts in integrated writing:

1. How does the selection of important ideas from the source text predict scores on a reading–listening–writing integrated assessment task?
2. How does the use of ideas originating from reading or listening texts predict scores?
3. How does integration style predict scores?
4. How do the areas of importance, origin, and integration predict scores?

Methods

Integrated writing tasks

Educational Testing Service supplied 480 writing performances from two administrations of the TOEFL iBT for this study. The data represented writers from 73 different countries with 47 different native languages. A wide range in ages was represented, from 14 to 51 years of age. Composite scores on the writing section of the test, a combination of independent and integrated writing, spanned the full range of possible scores from 1 to 10 with M = 5, SD = 2.64. On the integrated writing task in particular, scores ranged from 1 to 5.5. Table 1 shows the number of essays at each of the nine score levels.

Table 1
Distribution of scores.

Score   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0 (a)   Total
N       47    32    49    49    69    64    58    52    59        479

(a) Only one response was scored at the 5.5 level and was removed as it was an outlier for most source use features.

The integrated writing task required writers to first read a passage on a topic, and then listen to a short academic lecture on the same topic that presented a different viewpoint. Two test forms were included in this study with the following topics: Form 1 on bird migration and Form 2 on fish farming. The test takers were allowed to take notes during both the reading and the lecture. They could only hear the lecture once; however, the reading passage was available throughout the task. After this first stage, writers were asked to compose a summary of the lecture that included the points of contrast with the reading, a kind of comparative summary. The time for writing a response was 20 minutes. While using two tasks with different topics can have an effect on performance (Jennings, Fox, Graves, & Shohamy, 1999; Krekeler, 2006; Lee & Anderson, 2007; Tedick, 1990), we conducted a preliminary analysis that indicated that, for this data set, the effect of topic on the source use areas of interest was minimal.

The performances were scored by trained raters employed by ETS. The score was based on a holistic rubric that included descriptors regarding the selection of appropriate information from source texts, coherent and organized presentation of the information, and degree of error in usage and grammar. The scale used by ETS for rating integrated tasks is publicly available online (http://www.ets.org/Media/Tests/TOEFL/pdf/Writing_Rubrics.pdf).

Source use features

As detailed in the research questions, three areas of interest were pursued: (1) importance of ideas from the source texts, (2) origin of ideas taken from the source texts, and (3) integration style (Table 2 provides the measurements used for each of these areas). The first step in identifying these features was to divide the essays into T-units, defined as the smallest possible grammatical unit that could stand alone, for example, an independent clause or an independent clause plus its dependent clause. This unit was chosen to target individual ideas specifically, as other possible units of measure, such as sentences, can include multiple independent clauses with separate ideas and make linking ideas to the source texts difficult. However, the T-unit marking posed some potential challenges as the essays included numerous non-standard grammatical structures. Thus, an initial coding was conducted with two raters to calibrate T-unit marking. Using a set of 98 responses (20% of the full data set), two raters read essays in Word documents and marked T-units. The markings were compared for agreement, finding 91.33% of the T-units were agreed upon. After this check, one rater coded the rest of the T-units. Then T-units in the essays that had ideas from the source texts were marked. The next step was to consider the three features for each of these T-units. A pilot study with 40 essays was conducted to refine this process. An example illustrating the markings described for the source use features is included at the end of this section.

Table 2
Source use features.

Source use feature                             Variable measure
Importance of ideas taken from source text     Sum of scores for source text ideas included in essay
Origin of source texts used                    Total number of T-units from reading texts; total number of T-units from listening texts
Integration style                              Total number of T-units with paraphrasing or summary; total number of words borrowed from the source text
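To make the coding scheme in Table 2 concrete, the sketch below shows one way the per-essay feature totals could be tallied once every essay T-unit has been linked to a source T-unit. It is a hypothetical illustration of the manual procedure rather than the actual instrument: the CodedTUnit structure and its field names are our own shorthand here, and the importance ratings, source-attribution decisions, and the verbatim word counts it assumes are explained in the subsections that follow.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CodedTUnit:
    """One essay T-unit, linked (or not) to a source-text T-unit."""
    text: str
    source: Optional[str] = None       # "reading", "listening", or None
    importance: int = 0                # 1-4 rating of the linked source idea
    integration: Optional[str] = None  # "implicit" (paraphrase/summary) or "explicit"
    verbatim_words: int = 0            # words in copied strings of 3+ words

def tally_features(units: list[CodedTUnit]) -> dict:
    """Compute the five per-essay totals listed in Table 2."""
    return {
        "importance_score": sum(u.importance for u in units if u.source),
        "reading_t_units": sum(1 for u in units if u.source == "reading"),
        "listening_t_units": sum(1 for u in units if u.source == "listening"),
        "implicit_t_units": sum(1 for u in units if u.integration == "implicit"),
        "verbatim_words": sum(u.verbatim_words for u in units),
    }

# Hypothetical coded essay with three T-units
essay = [
    CodedTUnit("The lecture disagrees with the reading"),
    CodedTUnit("Farmed fish can spread disease to wild fish",
               source="listening", importance=4, integration="implicit"),
    CodedTUnit("escaped fish compete with wild populations for food",
               source="reading", importance=2, integration="explicit",
               verbatim_words=5),
]
print(tally_features(essay))
# {'importance_score': 6, 'reading_t_units': 1, 'listening_t_units': 1,
#  'implicit_t_units': 1, 'verbatim_words': 5}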
Importance score

To evaluate the importance of the ideas writers selected from the source texts, we followed a process of analyzing the content of the source texts similar to Spivey (1984). First, each source text was divided into T-units, and then each T-unit was scored for importance in the text using the following rating:

4: very important (key idea)
3: important (supporting idea)
2: less important (specific details or examples)
1: not important
The level of importance related to whether the idea should be included in a summary of the text. For example, each text centered around three main points, and each of these points was ranked a "4" by raters. Ideas that were too detailed for a summary, such as the T-unit in the bird migration text stating, "Magnetite, as the name suggests, is magnetic," were considered a "1." Two raters used this four-point scale to score T-units in the four source texts and their agreement determined the importance value for each source text T-unit. Then each test essay T-unit was read and connected to a T-unit in the source text and assigned the value of the source text T-unit. Next, essay T-unit scores were totaled to get an importance score. Thus, the importance score reflects the inclusion of key ideas from source texts. Rater agreement was checked for each T-unit in the pilot study of 40 essays and was found to be sufficient, showing raters agreeing on 83% of the rankings. Of the remaining 17%, over two-thirds of the disagreements were within one score level. The contested T-units in the pilot were discussed, and agreement was sought with a third rater. In the final analysis, we checked reliability through the correlation of importance score totals, which yielded an r = .87 (Pearson's). Given the size of the data set and the pilot agreement, we felt confident about the trustworthiness of these scores.

Origin of T-units: reading and listening source texts

The first two steps – marking T-units in the essays and connecting these to T-units in the source texts – provided a link between the essays and the source texts; therefore, identifying the origin of the T-units was straightforward. First the number of T-units from the reading text in an essay was calculated, and then the same process was followed for the listening texts. With the initial pilot of 40 essays, interrater reliability was checked for each T-unit, revealing agreement at 81%. The raters then focused on the problematic T-units and discussed their interpretations. This discussion helped raters find agreement and shared understanding on some of the ambiguity of the source origin decision. For example, some T-units could be ascribed to either source text; in these cases, the raters considered which source text the idea mostly came from, so that an idea drawn mainly from the reading, for example, was assigned as originating from the reading. In other cases, a T-unit could be attributed to either because its idea was mentioned in both, but in the lecture it was a main point and in the reading it was a counter point. In these cases, the T-unit was attributed to the source text in which the idea was a main point. Similar to the importance score, reliability was checked for the full 480 essays by correlating the totals for each essay between raters, showing r = .87 for reading and r = .86 for listening.

Integration style

T-units in the essays were also marked for integration style, particularly with regard to whether writers used the source text explicitly (quoting or direct copying without quotation marks) or implicitly (paraphrasing or summarizing) (Ackerman, 1991; Watanabe, 2001). For this variable, raters read each essay T-unit that was connected to a source text T-unit and marked it as either explicit or implicit source use. In piloting, we found higher agreement on implicit source use than explicit source use, which also appeared in the final coding.
Furthermore, when the explicit T-units were closely inspected, no instances of quoting were found, only direct copying without quotation marks (i.e., verbatim source use). Another complication with this explicit source use measure was its absence in 73% of the summaries, leaving a high number of zero counts and positively skewing the data. Given these issues, only implicit source use was included, and we conducted an analysis of verbatim source use (described in the next paragraph) to capture writers' direct copying from source texts.

To analyze verbatim source use, raters marked essays when strings of 3+ words were taken directly from the source text. When experimenting with different lengths of strings (for example, 2, 3, or 4), Cumming et al. (2005) found that the length of three words did not over- or under-identify verbatim source use. This was important, as there were many key words and phrases that writers needed to use in their summaries, such as "bird migration" or "fish farming," that we did not wish to mark as verbatim source use. Once the strings were identified, the number of copied words was recorded. While very time-intensive to mark, the rating for the variable showed high reliability. In the pilot, raters had reliability at r = .89, and in the final coding, r = .79. The number of words was totaled for each essay for analysis.

Example of coding

The coding in this study was conducted manually; raters read and marked essays in Word documents for the various features. Totals were counted based on the markings and entered in an Excel spreadsheet for further calculation in SPSS. The following example, an excerpt from an essay, illustrates the coding used (the example is not from the TOEFL test, but designed to illustrate our process in this study):

The listening has three theories from the reading // and talks they are different.// First, video games can teach skills to young people.// [L-4, I] For example, hand eye coordination develops through cognitively challenging tasks.// [L-2, E, VB-8]

A double slash (//) indicates the end of a T-unit. After the second sentence, in brackets, "L" stands for listening text source idea; the number following is the importance score of four, and "I" stands for implicit source use. The coding in brackets after the third sentence denotes that the T-unit is from the listening text with an importance score of two; the T-unit is explicitly ("E") from the source texts and the eight underlined words are used verbatim ("VB-8").

Statistical analysis

A hierarchical linear regression was used to explore the research questions. The coding provided five predictor variables: (1) importance score, (2) reading source use, (3) listening source use, (4) implicit source use, and (5) total verbatim words from the source text. There was one criterion variable, integrated assessment score. The data were checked before carrying out the regression analysis. The verbatim word use variable was positively skewed as there were many essays with zero borrowing, so a log10 transformation was carried out to achieve an approximately normal distribution; however, this transformation should be kept in mind when interpreting the results for verbatim source use. Correlations between the independent variables were checked for collinearity, using a cutoff of .7. Two correlations were found to be high: between importance score and implicit source use (r = .91, p < .01) and implicit source use and listening source use (r = .71, p < .01). For this reason, implicit source use was dropped from the regression study and reported only in the descriptive statistics and correlations, while importance score and listening source use were retained. This left verbatim source use as the single variable regarding integration style, but it allowed for more information about source idea selection and origin to be illustrated. Linearity was also considered. Based on scatterplots and regression analysis, three of the four independent variables had a curvilinear relationship with integrated writing score (importance score, listening origin, and verbatim source use). This curvilinearity will be discussed in more detail in the results section; however, in terms of the hierarchical regression analysis, we addressed this issue by entering a polynomial term in the equation for each of the three variables.
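The analysis itself was run in SPSS; the sketch below is only a rough, hypothetical illustration of the same steps with NumPy: the log10 transformation of the skewed verbatim counts, polynomial terms for the three curvilinear predictors, and sequential (hierarchical) entry of the predictor blocks. The toy data, the column order, and the centering of variables before squaring (a common way to limit collinearity between a term and its square, not something reported in our procedure) are all assumptions of the sketch.

import numpy as np

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 from an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

# Hypothetical per-essay features (one row per essay):
# columns: importance score, listening T-units, reading T-units, verbatim words
rng = np.random.default_rng(0)
feats = rng.poisson([30, 6, 4, 12], size=(480, 4)).astype(float)
score = rng.uniform(1, 5, size=480)        # placeholder criterion variable

imp, listen, read, verb = feats.T
verb = np.log10(verb + 1)                  # log10 transform for the skewed verbatim counts

def poly(x: np.ndarray) -> np.ndarray:
    """Centered linear and quadratic terms for a curvilinear predictor."""
    c = x - x.mean()
    return np.column_stack([c, c ** 2])

steps = {                                  # hierarchical (sequential) entry order
    "1 importance (+ quadratic)": poly(imp),
    "2 listening (+ quadratic)": poly(listen),
    "3 reading": read[:, None],
    "4 verbatim (+ quadratic)": poly(verb),
}

X, prev_r2 = np.empty((480, 0)), 0.0
for label, block in steps.items():
    X = np.column_stack([X, block])
    r2 = r_squared(X, score)
    print(f"Step {label}: R^2 = {r2:.3f}, change = {r2 - prev_r2:.3f}")
    prev_r2 = r2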
Results

The descriptive statistics (see Table 3) indicate certain patterns in the integrated writing scores related to the predictor variables. When scores increased, so did the use of important source text ideas and ideas from the listening text as well as implicit source use. On the other hand, some aspects of source use decreased as score increased: Use of the reading text and verbatim source use appear more frequently with lower levels of writing than with higher levels. From these means, it appears that these patterns are related; as score increases, so does the use of the listening text over the reading text. In addition, with increased use of listening comes more paraphrasing over direct copying.

As noted previously, some of the variables had a curvilinear relationship with integrated writing score, meaning that the increasing and decreasing of source use variables was not uniform across the score ranges. For example, in the case of verbatim source use, the lowest scoring writers (Levels 1 and 2) had a greater tendency to directly copy (M = 27.68 and 21.44 words) than all the other levels (M range: 13.55–6.78). The Level 1 writers also stood out more in terms of the amount of reading text use (M = 6.30 T-units) as well as in their limited use of the listening source text (M = .77 T-units).
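The per-level means and standard deviations in Table 3 below can be reproduced with a simple group-by once the coded features are in tabular form; a minimal sketch, assuming a pandas data frame with hypothetical values and column names:

import pandas as pd

# Hypothetical tidy data: one row per essay with its score level and feature totals
df = pd.DataFrame({
    "score":      [1.0, 1.0, 3.0, 3.0, 5.0, 5.0],
    "importance": [20, 24, 31, 33, 35, 37],
    "listening":  [1, 0, 6, 7, 9, 8],
    "reading":    [7, 6, 4, 3, 2, 3],
    "verbatim":   [30, 25, 12, 11, 7, 6],
})

# Mean and SD of every feature at each score level (cf. Table 3)
summary = df.groupby("score").agg(["mean", "std"]).round(2)
print(summary)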
Table 3
Descriptive statistics for source use variables.

Level   Importance score   Reading text   Listening text   Implicit source use   Verbatim source use
        M       SD         M      SD      M      SD        M       SD            M       SD
1.0     22.19   9.71       6.30   2.92    .77    1.59      5.51    3.15          27.68   31.75
1.5     25.03   6.45       4.50   2.50    3.50   1.93      7.00    2.58          21.44   17.15
2.0     26.31   7.08       3.61   2.10    4.65   2.43      7.76    2.34          13.55   16.22
2.5     30.00   8.13       3.22   2.27    6.18   2.52      9.08    2.71          10.51   12.34
3.0     31.84   7.92       3.59   2.10    6.39   2.07      9.68    2.51          11.70   14.21
3.5     34.94   7.79       3.39   2.49    7.42   2.20      10.47   2.44          12.91   18.19
4.0     35.40   9.68       3.40   2.73    7.60   2.42      10.71   3.43          13.09   13.15
4.5     35.96   9.74       2.96   2.79    8.42   2.40      11.25   3.0           9.88    11.57
5.0     36.02   10.35      2.54   2.08    8.64   2.21      11.08   3.0           6.78    8.38
Total   31.49   9.90       3.63   2.62    6.23   3.18      9.41    3.32          13.51   17.45
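The bivariate correlations and the linear-versus-quadratic comparisons reported next can be approximated along the following lines; again this is a sketch with invented toy data rather than the SPSS analysis itself:

import numpy as np

rng = np.random.default_rng(1)
score = rng.uniform(1, 5, 480)                  # placeholder integrated scores
# Toy feature with a built-in curvilinear relationship to score
listening = 2 + 1.5 * score - 0.1 * score**2 + rng.normal(0, 1, 480)

# Pearson correlation between a feature and score (cf. Table 4)
r = np.corrcoef(listening, score)[0, 1]

def fit_r2(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return 1 - ((y - X1 @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Quadratic curve estimation: does adding a squared term raise R^2?
linear_r2 = fit_r2(listening[:, None], score)
quad_r2 = fit_r2(np.column_stack([listening, listening**2]), score)
print(f"r = {r:.2f}, linear R^2 = {linear_r2:.2f}, quadratic R^2 = {quad_r2:.2f}")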
To investigate the relationships among the variables, bivariate correlations were conducted (Table 4). The use of listening source material has the highest correlation with integrated writing score (r = .69), followed by implicit source use (r = .52) and the selection of important ideas from the source texts (r = .46), while the other two variables, use of reading text (r = −.31) and verbatim source use (r = −.20), both had low negative correlations with integrated writing score. These results fit what we might expect given the scoring rubric for the integrated writing task that emphasizes including key ideas from the source texts. The correlations also align with the task instructions, which asked writers to summarize the listening text with some comparison to the reading, putting the listening source text in a more primary position in the task. As mentioned previously, implicit source use was highly correlated with selecting important ideas (r = .91) and use of listening sources (r = .71). The use of listening texts had a moderately sized correlation with selecting important ideas (r = .61) and was negatively correlated with the use of the reading text (r = −.44) and verbatim copying from the source texts (r = −.20), although these were much smaller than the positive correlations. Lastly, it is interesting to note that the correlation between selecting important ideas and the verbatim copying of words from the text has the smallest correlation (r = .11), which suggests that the writers who were copying from the source texts were not copying the most important ideas from the texts, making this a very unsuccessful strategy.

Table 4
Correlations between variables.

             Importance   Reading   Listening   Implicit   Verbatim
Score        .46**        −.31**    .69**       .52**      −.20**
Importance                .40**     .61**       .91**      .11*
Reading                             −.44**      .26**      .35**
Listening                                       .71**      −.20**
Implicit                                                   .23**

* p < .01.
** p < .001.

While the correlations indicate a relationship between the scores and the independent variables, examination of the scatterplots indicated that it is not a strictly linear relationship, except for the use of the reading source text. This was confirmed for the three variables by checking the quadratic regression curve estimation. For listening, the R² increased from .47 to .52 when the quadratic term was added (p < .001), indicating that a curve is a better fit than a line for this relationship. For importance score, R² increased from .21 to .22 (p < .001) with the quadratic term, again indicating a curve is a better fit. The quadratic regression lines for these two variables, as shown in Figs. 1 and 2, tend to bend at the 4.0 score level and flatten out somewhat. For verbatim words in an essay, R² increased from .04 to .08 when the quadratic term was added (p < .001). The quadratic regression line for this variable, shown in Fig. 3, slopes downward with a gradual curve from the 3.5 score level. These results further the evidence from the descriptive statistics that these variables do not follow a straight line as they increase and decrease with score, suggesting that the difference between score levels for these features cannot be assumed to have the same degree of difference from level to level. In other words, they may be better at differentiating essays at some levels than other levels.

Fig. 1. Curvilinear regression line between listening text use and integrated score.

Fig. 2. Curvilinear regression line between importance score and integrated score.

Fig. 3. Curvilinear regression line between verbatim word use and integrated score.

The hierarchical regression analysis indicated that the source use variables significantly accounted for the score variance on the integrated writing assessment, R² = .55, F(7, 471) = 75.14, p < .001 (adjusted R² = .54). The unique effect for each of the four variables was examined in the hierarchical regression analysis (see Table 5). Importance of source text selection was entered first (Step 1) because it is explicitly mentioned in the scoring rubric. This variable was the second greatest contributor to the score variance with R² = .22, F(2, 473) = 67.41, p < .001. Next, listening text use was included (Step 2) as it should have more emphasis based on the task instructions; it had the highest predictive value of the four variables with R² = .31, F(2, 473) = 152.57, p < .001. The other two variables, those which were negatively correlated to score, had much less contribution to the score variance once the stronger predictors were accounted for; Step 3, reading text use, had R² = .01, F(1, 473) = 9.12, p < .01 and Step 4, verbatim word use, had R² change = .01, F(2, 473) = 6.77, p < .01.

Table 5
Hierarchical regression analysis predicting integrated score with source use features.

Step   Predictor variable    b      R²     ΔR²
1      Importance score      .88    .22    .22**
       Importance score²     .43
2      Listening text        1.35   .53    .31**
       Listening text²       .75
3      Reading text          1.5    .53    .01*
4      Verbatim words        .19    .55    .01*
       Verbatim words²       .28

* p < .01.
** p < .001.

In summary, over half of the variance (55%) in integrated writing scores was predicted by the source use features. The largest contributor to score was the inclusion of ideas from the listening source texts, closely followed by writers' inclusion of the important ideas from the source texts. This relationship was also evidenced by the correlations between these features and integrated score. Verbatim copying of words from source texts and using ideas from the reading texts were significant at the p < .01 level but made relatively small contributions to explaining integrated score variance after listening and importance were accounted for. These two features were negatively related to score on integrated writing as they appeared less when scores were higher. For all the features, except the use of reading texts, the relationship with score is curvilinear, meaning that there is a bend or a flattening out in the line that represents this relationship. Thus, the impact of selecting important ideas, using ideas from the listening text, and direct copying from sources is not equal from level to level.

Discussion

The results of this study show that source use features impact the level of writing for integrated listening–reading–writing tasks, which confirms the intent of such tasks. In order to accurately represent academic tasks that require multiple skills, source material inclusion should carry weight. Integrated assessment tasks require test-takers to use academic writing processes, such as including key ideas from sources, using multiple sources, and integrating the source material appropriately. The connection between these skills and the scores on this writing task suggests that these processes are elicited by the tasks and captured in the rating scale. This evidence provides support for the validity of using integrated tasks to assess this kind of academic writing (Braine, 1989; Horowitz, 1986; Leki & Carson, 1994, 1997). These results correspond to other research that has looked at writers' processes (Plakans, 2009a, 2009b; Plakans & Gebril, 2012) in using source texts, as well as in modeling the resulting products (Ascención, 2005; Yang & Plakans, 2012).
Most of the past research on integrated writing has focused on reading–writing tasks, while this study included a third skill, listening. Interestingly, this third element has a significant impact on score, more so than other features that may be more commonly associated with integrated writing, such as verbatim source use. Considering the listening results in more detail, their impact on score may tap into high proficiency writers’ automaticity in language comprehension. Since the lecture is only played once, test takers need to understand the content with some ease and write useful notes. Without this requisite skill, they will not be able to include the listening ideas in their writing. The results also support higher performing writers’ ability to understand the instructions, which direct writers to summarize the listening with the reading text as counter evidence. Such a relationship shows that these writers are forming an appropriate representation of the task (Allen, 2004; Connor & Carrell, 1993; Plakans, 2010; Ruiz-Funes, 2001; Wolfersberger, 2007). The correlation between score and listening material does not mean that reading was not included in the higher scoring essays, but that reading texts are not the primary source of content. While the results suggest that the integrated task elicits similar abilities to academic tasks, the strong predictive contribution of listening is important to consider. While a good deal has been written on the presence of reading in academic writing (Horowitz, 1986; Leki & Carson, 1994), little research has unearthed the importance of listening and note-taking in academic writing. Research reports or literature reviews have a clear line to reading, but what is the nature of academic listening in such writing? The test in this study is drawing on lecture listening, which surely impacts students’ learning in courses and contributes to their understanding of a subject, but the direct impact on writing is not clear. This presents a question and perhaps a needed area for follow up in multi-skill integrated tasks. The inclusion of important ideas was also found to be a predictor for score. Higher scoring writers may have the ability to discriminate between more and less important ideas in source texts to include in their writing. This conclusion corroborates findings by Spivey (1984, 1990) that suggest L1 writers with high proficiency are better readers and can spot important details in a text then successfully integrate them in their writing. Research with L2 integrated writing has also found selection contributes to a writers’ discourse synthesis process, which is reflected in score using similar listening–reading–writing tasks (Yang & Plakans, 2012), as well as an argumentative reading– writing task (Plakans, 2009b). The source use features of verbatim word use and the use of the reading text were less predictive of score and in fact, negatively correlated. Ideas in lower scoring essays were largely from the reading text and showed more direct copying. The reading text may be a lifeline for these writers, providing not just content but also wording, a finding of our other study of source use centering on process (Plakans & Gebril, 2012). In addition, despite the task instructions, these writers did not incorporate the listening text, but focused more on the reading material. 
Potentially, these writers did not have the aural skills to comprehend the listening passage, did not understand the instructions, and/or did not have the note-taking skills to capture the main ideas from the listening. For any or all of these reasons, the lower-scoring writers did not complete the multi-text task successfully. A level of proficiency across skills, not just in writing, may be required in integrated tasks. Several studies that have looked at the correlation between integrated writing scores and reading test scores have found minimal variation due to reading (Ascención-Delaney, 2008; Watanabe, 2001). However, looking specifically at the source text features of these essays indicates that the ability in skills combined with writing may have an impact not captured by the holistic scores in the previous studies.

Implications

The study holds a number of implications for L2 writing research, teaching, and testing. The findings that source use features contribute significantly to scores on integrated writing provide important considerations for the field of second language writing. First of all, integrated writing is highly impacted by writers' use of the source material, which supports the premise that it is a different, albeit related, measure of writing than independent writing. Thus, our field's research body, which consists largely of data from independent writing, should be considered carefully in application to integrated writing.

Our study substantiates that source use in integrated writing tasks is demanding and requires writers to draw on a number of skills and to make challenging decisions. To complete these tasks successfully, writers have to comprehend the source material in a second language, select important ideas, juggle several source texts, and finally synthesize information from these sources in their writing. Therefore, writing classes should prioritize these skills in instruction and practice. While advanced writing courses often include instruction on integration strategies such as paraphrasing and quotation with strong warnings about plagiarism, guided instruction is needed on the steps prior to integration. Students would benefit from strategy instruction on approaching source texts (i.e., choosing texts, skimming, listening for key ideas, and note-taking), selecting the important information from these sources, and appropriately synthesizing the source materials in their writing. Since the lower-level writers in this study had significantly lower performances in these skill areas, we would argue that integrated writing processes should be introduced at low-intermediate levels of language learning, rather than being postponed until more advanced writing courses. Higher-order synthesis skills could then be the focus at upper levels.

We also propose implications for writing assessment. First, the source materials used in an integrated task clearly affect the way L2 writers approach the task and how they include these materials in their writing. Designing integrated writing tasks to achieve anticipated outcomes requires selecting, adapting, or developing source materials in a carefully planned way. Second, the results of this study showed the inextricably complex nature of source use, which should be reflected in the scoring rubrics used with integrated writing tasks. More research is needed in this direction to determine how many distinct levels of source use exist and what key descriptors define these.

Limitations

This study will hopefully be followed by others that continue to probe questions of source use in integrated tasks. Some limitations appear in our study. For the features of source use, we used counts or totals for all variables, which leaves them vulnerable to the impact of essay length. For this reason, as a follow-up study, we converted counts to ratios with total number of T-units in each essay as the denominator and ran the statistical analysis again. The patterns found did not change, and thus the study reports on the counts with confidence. However, future studies may need to look for methods, such as probability, to eliminate the impact of essay length. Our research design attempted to capture the basic picture of source use in integrated writing. However, raters indicated that occasionally writers were either misparaphrasing or misusing ideas from the source texts. The accurate transfer of ideas is important in source use and would be a valuable feature for further research.

Conclusions

The impact of source use on scores for writers taking tests that include integration, such as the TOEFL iBT, has been probed by this study. Questions remain, such as why this impact is not linear across levels. The results emphasize the need to distinguish integrated writing from independent writing, which is not influenced by source materials, as well as the importance of differentiating various types of integrated tasks. Some work has compared different genres of reading–writing tasks, such as summary and argumentative writing (Ascención, 2005; Watanabe, 2001), but the variation in skills involved (reading vs. reading and listening) also needs attention. Still, test-takers, test-users, and teachers should be confident that these tasks require academic writing skills. Research on the writing processes of L2 test-takers on integrated tasks has provided support for this argument (Ascención, 2005; Esmaeili, 2002; Plakans, 2008, 2009a, 2009b; Plakans & Gebril, 2012), and now evidence is emerging about the written products (Cumming et al., 2005; Gebril & Plakans, 2009; Yang & Plakans, 2012).
The implications for this alignment between test task and real-world context should lead to stronger validity arguments regarding the scores from such tests, as well as encouraging ESL/EFL writing classrooms to provide more instruction and practice with source text integration. Scholars such as Leki and Carson (1994, 1997), Grabe (2003), and Hirvela (2004) have called for such a shift in the field of second language writing for some time. The findings from this study provide empirical support for tasks that seek such alignment.

Acknowledgements

Our study was funded by Educational Testing Service TOEFL Committee of Examiners, and we thank them for their support. We also had valuable comments from numerous reviewers along the way, for which we are grateful. However, the authors are solely responsible for the content and any inaccuracies.

References

Ackerman, J. M. (1991). Reading, writing, and knowing: The role of disciplinary knowledge in comprehension and composing. Research in the Teaching of English, 25, 133–178.
Allen, S. (2004). Task representation of a Japanese L2 writer and its impact on the usage of source text information. Journal of Asian Pacific Communication, 14, 77–89.
Abasi, A. R., Akbari, N., & Graves, B. (2006). Discourse appropriation, construction of identities, and the complex issue of plagiarism: ESL students writing in graduate school. Journal of Second Language Writing, 15, 102–117.
Ascención, Y. (2005). Validation of reading-to-write assessment tasks performed by second language learners. Unpublished doctoral dissertation. Northern Arizona University: Flagstaff.
Ascención-Delaney, Y. (2008). Investigating the reading-to-write construct. Journal of English for Academic Purposes, 7, 140–150.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19, 453–476.
Bachman, L., & Palmer, A. (2010). Language assessment in practice. Oxford, UK: Oxford University Press.
Braine, G. (1989). Writing in science and technology: An analysis of assignments from ten undergraduate courses. English for Specific Purposes, 8, 3–15.
Campbell, C. (1990). Writing with other's words: Using background reading texts in academic compositions. In B. Kroll (Ed.), Second language writing (pp. 211–230). Cambridge, UK: Cambridge University Press.
Chapelle, C., Enright, M., & Jamieson, J. (2008). Score interpretation and use. In C. Chapelle, M. Enright, & J. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 1–25). New York: Routledge.
Connor, U., & Carrell, P. (1993). The interpretation of tasks by writers and readers in holistically rated direct assessment of writing. In J. G. Carson & I. Leki (Eds.), Reading in the composition classroom (pp. 159–175). Mahwah, NJ: Lawrence Erlbaum Associates.
Cumming, A., Kantor, R., Baba, K., Erdosy, U., Eouanzoui, K., & James, M. (2005). Differences in written discourse in independent and integrated prototype tasks for next generation TOEFL. Assessing Writing, 10, 5–43.
Cumming, A., Kantor, R., Powers, D., Santos, T., & Taylor, C. (2000). TOEFL 2000 writing framework: A working paper (TOEFL Monograph Series Report No. 18). Princeton, NJ: Educational Testing Service.
Cumming, A., Kantor, R., & Powers, D. (2001). Scoring TOEFL essays and TOEFL 2000 prototype writing tasks: An investigation into raters' decision making and development of a preliminary analytic framework (TOEFL Monograph Series Report No. 22). Princeton, NJ: Educational Testing Service.
Currie, P. (1998). Staying out of trouble: Apparent plagiarism and academic survival. Journal of Second Language Writing, 7, 1–18.
Esmaeili, H. (2002). Integrated reading and writing tasks and ESL students' reading and writing performance in an English language test. Canadian Modern Language Journal, 58, 599–622.
Gebril, A. (2006). Independent and integrated academic writing tasks: A study in generalizability and test method. Unpublished doctoral dissertation. The University of Iowa, Iowa City.
Gebril, A. (2009). Score generalizability of academic writing tasks: Does one test method fit it all? Language Testing, 26, 507–531.
Gebril, A. (2010). Bringing reading-to-write and writing-only assessment tasks together: A generalizability analysis. Assessing Writing, 15, 100–117.
Gebril, A., & Plakans, L. (2009). Investigating source use, discourse features, and process in integrated writing tests. Spaan Working Papers in Second or Foreign Language Assessment, 7, 47–84.
Gebril, A., & Plakans, L. Borrowed words: The impact of source texts, proficiency, and topic on the lexical diversity of integrated writing. Unpublished manuscript.
Grabe, W. (2003). Reading and writing relations: Second language perspectives on research and practice. In B. Kroll (Ed.), Exploring the dynamics of second language writing (pp. 242–262). Cambridge: Cambridge University Press.
Hirvela, A. (2004). Connecting reading and writing in second language writing instruction. Ann Arbor, MI: University of Michigan Press.
Horowitz, D. M. (1986). What professors actually require: Academic tasks for the ESL classroom. TESOL Quarterly, 20, 445–462.
Jennings, M., Fox, J., Graves, B., & Shohamy, E. (1999). The test-takers' choice: An investigation of the effect of topic on language-test performance. Language Testing, 16, 426–456.
Johns, A. M., & Mayes, P. (1990). An analysis of summary protocols of university ESL students. Applied Linguistics, 11, 253–271.
Krekeler, C. (2006). Language for special academic purposes (LSAP) testing: The effect of background knowledge revisited. Language Testing, 23, 99–130.
Lee, H., & Anderson, C. (2007). Validity and topic generality of a writing performance test. Language Testing, 24, 307–330.
Lee, Y., & Kantor, R. (2005). Dependability of new ESL writing test scores: Tasks and alternative rating schemes (TOEFL Monograph Series No. 31). Princeton, NJ: ETS.
Leki, I., & Carson, J. G. (1994). Students' perception of EAP writing instruction and writing across the disciplines. TESOL Quarterly, 28, 81–101.
Leki, I., & Carson, J. (1997). "Completely different worlds": EAP and the writing experiences of ESL students in university courses. TESOL Quarterly, 31, 39–69.
Lewkowicz, J. A. (1994). Writing from sources: Does source material help or hinder students' performance? In N. Bird, et al. (Eds.), Language and learning: Papers presented at the annual international language in education conference, ERIC Document (ED 386 050).
Malvern, D., Richards, B. J., Chipere, N., & Duran, P. (2009). Lexical diversity and language development: Quantification and assessment. New York: Palgrave Macmillan.
Nash, J. G., Shumacher, G., & Carlson, B. (1993). Writing from sources: A structure mapping model. Journal of Educational Psychology, 85, 159–170.
Pennycook, A. (1996). Borrowing others' words: Text, ownership, memory, and plagiarism. TESOL Quarterly, 30, 201–230.
Plakans, L. (2008). Comparing composing processes in writing-only and reading-to-write test tasks. Assessing Writing, 13, 79–150.
Plakans, L. (2009a). The role of reading strategies in integrated L2 writing tasks. Journal of English for Academic Purposes, 8, 252–266.
Plakans, L. (2009b). Discourse synthesis in integrated second language writing assessment. Language Testing, 26, 561–587.
Plakans, L. (2010). Independent versus integrated writing tasks: A comparison of task representation. TESOL Quarterly, 44, 185–194.
Plakans, L., & Gebril, A. (2012). A close investigation into source use in L2 integrated writing tasks. Assessing Writing, 17, 18–34.
Polio, C., & Shi, L. (Eds.). (2012). Special issue: Textual appropriation and source use in L2 writing. Journal of Second Language Writing, 21, 95–186.
Read, J. (1990). Providing relevant content in an EAP writing test. English for Specific Purposes, 9, 109–121.
Ruiz-Funes, M. (2001). Task representation in foreign language reading-to-write. Foreign Language Annals, 34, 226–234.
Shi, L. (2004). Textual borrowing in second-language writing. Written Communication, 21, 171–200.
Spivey, N. (1984). Discourse synthesis: Constructing texts in reading and writing. Outstanding Dissertation Monograph. Newark, DE: International Reading Association.
Spivey, N. (1990). Transforming texts: Constructive processes in reading and writing. Written Communication, 7, 256–287.
Stahl, S., Hynd, C., Gylnn, S. A., & Carr, M. (1996). Beyond reading to learn: Developing content and disciplinary knowledge through texts. In C. Cornoldi & J. Oakhill (Eds.), Developing engaged readers in school and home communities (pp. 15–31). Mahwah, NJ: Lawrence Erlbaum.
Tedick, D. J. (1990). ESL writing assessment: Subject-matter knowledge and its impact on performance. English for Specific Purposes, 9, 123–143.
Watanabe, Y. (2001). Read-to-write tasks for the assessment of second language academic writing skills: Investigating text features and rater reactions. Unpublished doctoral dissertation. University of Hawaii, Manoa.
Weigle, S. C. (2004). Integrating reading and writing in a competency test for non-native speakers of English. Assessing Writing, 9, 27–55.
Wolfersberger, M. A. (2007). Second language writing from sources: An ethnographic study of an argument essay task. Unpublished doctoral dissertation. The University of Auckland.
Yang, H. C. (2009). Exploring the complexity of second language writers' strategy use and performance on an integrated writing test through structural equation modeling and qualitative approaches. Unpublished doctoral dissertation. The University of Texas-Austin.
Yang, H. C., & Plakans, L. (2012). Second language writers' strategy use and performance on an integrated reading–listening–writing task. TESOL Quarterly, 46, 80–103.

Lia Plakans is an assistant professor in Foreign Language/ESL Education at the University of Iowa. She teaches courses in language assessment, language planning & policy, materials design, and second language learning. Her research interests include teacher education, L2 reading–writing connections, integrated skills assessment, and test use. She has been an English language educator in Texas, Ohio, Iowa, and Latvia. She earned her PhD at the University of Iowa.

Atta Gebril is an assistant professor in the TEFL program at the American University in Cairo (Egypt). He obtained his PhD from the University of Iowa in foreign language and ESL education. He teaches courses in second language assessment and research methods in applied linguistics. His research interests include writing assessment, reading–writing connections, and generalizability theory. He has taught in a number of countries including the USA, United Arab Emirates, and Egypt.