System 58 (2016) 64–81
A critical evaluation of text difficulty development in ELT textbook series: A corpus-based approach using variability neighbor clustering

Alvin Cheng-Hsien Chen

Department of English, National Changhua University of Education, Changhua City, 500, Taiwan
Article info

Article history: Received 30 June 2015; received in revised form 7 March 2016; accepted 11 March 2016

Abstract
Although the importance of English Language Teaching (ELT) textbooks is widely acknowledged, previous evaluation of ELT materials has paid little attention to the appropriateness of the text difficulty development in a textbook series. The present study aims to assess the progression of text difficulty in different textbook series in Taiwan, the rationale of which is argued to be generalizable to other ELT contexts. Specifically, there are two methodological emphases. First, text difficulty has been quantitatively measured by the BNC corpus-based frequency lists and a comprehensive set of well-established readability formulas, considering both vocabulary and structure complexity of the texts; second, a clustering-based statistical algorithm, variability neighbor clustering, is utilized to identify the developmental stages in text difficulty on an empirical basis. This corpus-based computational method not only objectively determines the developmental gaps in a textbook series, but also identifies the direction of the difficulty progression in vocabulary and structure complexity. This rigorous textbook evaluation provides a common framework for the assessment of text difficulty progression in the ELT materials. Several pedagogical implications are drawn for EFL learners and teachers as well as ELT textbook developers. © 2016 Elsevier Ltd. All rights reserved.
Keywords: Corpus-based analysis; Readability; Text difficulty; Clustering; Textbooks; Word lists; Vocabulary coverage
http://dx.doi.org/10.1016/j.system.2016.03.011
0346-251X/© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Although the importance of English Language Teaching (ELT) textbooks is widely acknowledged, evaluations for ELT materials development are still not a “well supported project” (Ghorbani, 2011). Many scholars have suggested checklists for a thorough examination of the course book contents, considering a wide range of critical features, such as practical considerations, four-skill balance, exercises and activities, pedagogical analyses, appropriateness in language and grammar, and supplementary materials (Ghorbani, 2011; Mukundan & Ahour, 2010; Tomlinson, 2012; Tsagari & Sifakis, 2014). However, these checklists tend to be researcher-dependent and context-dependent, paying little attention to the appropriateness of the text difficulty development in a textbook series.

The purpose of this study is to address this issue in the context of the official English curriculum in Taiwan senior high school. After the full-scale deregulation of textbooks in 2001, many publishers in Taiwan began to edit and publish their own versions of course books, following the guidelines and wordlists regulated by the national government. One of the major
criticisms from teachers is the inadequacy of the difficulty development in the current textbook series, in which volumes used by the higher graders are not necessarily more difficult than those used by the lower graders. This inconsistency often results in a compromise that teachers may select different versions of the textbooks for different academic years. This ad hoc decision may accidentally deprive learners of a chance for a systematic exposure to a grammar illustration provided by the publisher. Previous research on textbook development is somewhat limited in scope, often focusing on a particular linguistic aspect of the materials rather than the overall appropriateness of the text difficulty development. For example, a series of studies has been dedicated to the appropriateness of idiomaticity in the ELT materials by comparing the vocabulary distribution (Hsu, 2009), collocation patterns (Tsai, 2015), phrasal verbs (Chuang & Tsai, 2009), multi-word sequences (Lin, 2014), or formulaic sequences (Hsu, 2014b) of the ELT textbooks in Taiwan with those of the native-speaker production in different representative corpora. These comparisons have taken the ELT materials as a whole in their assessment and failed to provide a critical evaluation for the text difficulty progression in different volumes of the textbook series. Furthermore, the diversity of the linguistic emphases in previous studies has also suggested that a principled framework for ELT materials evaluation is yet to come (cf. Tomlinson, 2012). There have been two main lines of research in the literature on the assessment of text difficulty in high-school textbooks in Taiwan. One of the research lines was to provide a descriptive account of the vocabulary distribution in the textbooks across different frequency-based word lists (Chen, 2014; Kao, 2014; Ting, 2005). They were concerned with the lexical coverage of the text materials, i.e. 
the percentage of the words that a reader understands or a reference word list covers. While the vocabulary coverage of different textbook versions has been compared, little attention has been paid to the progression between volumes in one series. Moreover, the discussion in the previous studies was often limited to a descriptive comparison with respect to the designated frequency word bands (e.g. the percentage differences in the first 1000-word band, 2000-word band etc.). The variation across all relevant frequency word bands was not analyzed as a whole, thus failing to provide a holistic account of the progression of the overall vocabulary difficulty across different volumes/versions.

Another line of research on high school course books focused on the changes of readability across different book levels in a series (Chiu, 2010; Lin, 2008; Lo, 2010; Yeh, 2003). Quite a range of readability formulas were adopted in different projects depending on the limitation and the availability of the analytic instruments and computer programs (e.g. Gunning Fog Index in Chiu (2010), Fry in Lin (2008), Lix in Yeh (2003), and Flesch Reading Ease in Lo (2010)). Yet different readability formulas commit to different structural parameters (e.g. the average number of characters per word, the average number of syllables per word, or the average number of words per sentence) in computing the grammatical difficulty of the text materials. It would be more comprehensive if the analysts could consider the variation in different readability formulas. Furthermore, the assessment of the difficulty progression in the previous research relies mostly on an inspection of the visual graph (i.e. a line plot of the readability index values).

The present study aims to bridge this gap by proposing an objective and quantitative method to evaluate the arrangement of text difficulty for a textbook series.
Both vocabulary and structural complexities will be included in our assessment of text difficulty by incorporating corpus-based frequency lists and well-established readability formulas. Most importantly, a clustering-based statistical algorithm, variability-based neighbor clustering (VNC) (Gries & Hilpert, 2008), is adopted to identify the developmental stages of text difficulty on a more empirical basis. It is argued that the rationales of our critical textbook assessment are generalizable to other ELT contexts. On the one hand, our operational definition for text difficulty provides a common ground on which the difficulty levels of ELT materials from different cultural communities can be easily compared with each other. On the other hand, the algorithm of VNC further implements a robust analysis on the progression of the text difficulty in a given textbook series, thus shedding light on the developmental gaps across different book levels in the series.

2. Measuring text difficulty

2.1. Vocabulary levels and reading comprehension

Comprehensive reading is one of the keys to successful language learning. EFL learners are expected to possess a critical mass of L2 knowledge, including inferring the meanings of the unknown words from context, identifying the argument structure, and distinguishing idiomatic constructions. Most importantly, learners' vocabulary knowledge plays a crucial role in their reading proficiency (Hu & Nation, 2000; Laufer, 1992; Laufer & Ravenhorst-Kalovski, 2010; Lin, Hue, Lin, & Hsu, 2003). Quite a few studies have been devoted to investigating how many words an L2 reader needs to know (i.e. their vocabulary coverage) for an adequate reading comprehension (Hirsh & Nation, 1992; Hu & Nation, 2000; Laufer, 1989, 1992; Laufer & Ravenhorst-Kalovski, 2010; Lorge & Chall, 1963). Before we discuss the results from these studies, two factors need to be more carefully considered with respect to the methodology.
First of all, “adequate” reading comprehension is a tricky idea and its operational definition may vary from scholar to scholar. The confusion may result from (1) the types of evaluation an analyst adopts for reading comprehension; (2) the threshold an analyst defines as an adequate level of reading comprehension. On the one hand, a learner's reading comprehension of a given text can be evaluated by two types of assessments: reading tests or cloze tests. Reading tests refer to the typical comprehension-checking tests, as often seen in most standardized English proficiency tests such as GEPT, TOEFL, and TOEIC. The other type is a cloze test, first introduced by Taylor (1953), which uses a text with regularly deleted words (usually every fifth word) and requires the subjects to fill in the blanks. A score from a reading comprehension test may not necessarily be comparable to one from a cloze test. On the other
hand, no matter which type of test learners are required to take, it remains unclear at which point they may be argued to have an “adequate” comprehension. For example, in Laufer's (1989) study, the adequate reading comprehension was set at a score of 55% as the passing score, while in Hu and Nation (2000), it was set at a score of 87.5%. It is suggested that the interpretation of the research findings on vocabulary coverage should be discussed in relation to the threshold defined for adequate comprehension (Laufer & Ravenhorst-Kalovski, 2010).

Furthermore, methods may vary with respect to the measurement and estimation of vocabulary coverage. Given a self-defined threshold for adequate comprehension, previous studies have been concerned with the question of how many words readers need to know in order to reach the passing score of adequate reading comprehension. Two approaches have been suggested. An analyst may investigate directly learners' vocabulary size and relate it to the adequate reading comprehension (Laufer, 1989, 1992). Alternatively, an analyst may also examine the coverage that a set of frequency-based word lists from representative corpora may provide to the text (Chujo, 2004; Hsu, 2014a; Nation, 2006). It is the second approach to which this study is indebted for its contribution of measuring lexical coverage with corpus-based frequency lists; however, the present study poses some rather different questions and extends it to more practical applications of textbook assessment.

The first attempt to account for the relationship between reading comprehension and vocabulary coverage was Laufer (1989), in which the lexical coverage was based on the students' self-report and adequate comprehension was measured by reading comprehension tests set at a score of 55%. The results showed that a significant improvement in comprehension (i.e. more participants with a passing score of 55%) was observed when the lexical coverage reached 95%.
Later, Hu and Nation (2000) also investigated the correlation between unknown word density and reading comprehension. The lexical coverage was controlled by creating different proportions of the text words with non-words, preserving only 80%, 90%, 95%, and 98% of the original words. The other words in the text were familiar words, taken from the 2000 most frequent vocabulary. A typical comprehension test was adopted to assess the students' comprehension, with the adequate comprehension level set at a score of 87.5%. They concluded that no adequate comprehension could be found at 80% of lexical coverage and 98% coverage was the ideal lexical coverage for adequate comprehension. Compared to Laufer (1989), the marginal increase in vocabulary coverage in Hu and Nation (2000) may possibly be attributed to their higher threshold for “adequate” comprehension (i.e. 87.5% vs. 55%). As vocabulary coverage in the previous studies was defined on a more subjective basis (e.g. learners' self-report), scholars started to look for more objective ways to characterize the construct of vocabulary coverage. Of particular relevance was the development of language corpora. The use and application of language corpora in EFL learning and teaching has received tremendous interest in the past few decades, such as for its use in respect to concordancing (Cobb, 1997; Todd, 2001), collocation (Sinclair, 1991), cohesion (Conrad, 1999), writing (Sun, 2007), formulaic expressions (Hsu, 2014b; Qin, 2014), and language assessment (Hawkins & Buttery, 2010). Most of the corpus-based applications capitalize on the value of the frequency data provided by the corpora. Specifically, corpus word frequency is found to closely reflect native speakers' judgments on word usefulness (Laufer & Ravenhorst-Kalovski, 2010; Okamoto, 2015). 
Nation (2006) investigated the lexical coverage for 98% of the texts with fourteen 1000 word-family lists from the British National Corpus (BNC).1 In Nation (2006), vocabulary size was defined as the number of word families (in 1000-word-family bands) counted from the top that would account for 98% of the texts for unassisted comprehension of written and spoken English. He further concluded that an 8000 to 9000 word-family vocabulary was needed for 98% coverage for written English while a 6000 to 7000 word-family vocabulary was needed for spoken English. Similarly, Chujo (2004) also adopted the BNC word lists and compared the vocabulary sizes of Japanese junior and senior high school texts, college entrance exams, and college textbooks. Her study differed from Nation's research paradigm in two aspects. First of all, she used lemmas, rather than word-families, as the basic unit of the frequency lists. Second, she set the adequate comprehension at the 95% lexical coverage of the texts. Both studies have paved the way for a study of lexical coverage via a corpus-based approach using frequency-based word lists from representative corpora (Hsu, 2009, 2014b; Lin et al., 2003).

It should be noted that units on the frequency lists are important in a corpus-based assessment of vocabulary coverage. The choice of the basic units on the list may reflect researchers' different degrees of optimism in learners' performance. One of the earliest word lists was the influential General Service List of English Words (West, 1953), which included the 2000 most useful words of English. The list consisted of only headwords, representing a word family that was only loosely defined by West (1953). For example, the following derived forms were all listed under the headword EFFECT: effective, effectively, efficient, efficiency, efficiently.
If all related forms are understandable to a learner who knows the headword, this may have a rather optimistic estimation of an L2 learner's learning progress. Furthermore, West's inclusion of related forms under a headword was less systematic, rendering the list less legitimate for further applications. The inconsistency problem of identifying a word-family was specifically addressed in Bauer and Nation (1993), who proposed a graded set of seven-level word families where relevant words were related to their headword. Differences in levels lie in the frequency, regularity, productivity, and predictability of both inflectional and derivational affixes in English morphology. At Level 1, each form is a different word. Level 2 defines a word-family containing the words with the same base and inflections, such as plural, third person, tense-aspect markers. Level 3 further collapses words with the most frequent and
1 Paul Nation's word family lists have been built in his self-developed Range program (available from http://www.victoria.ac.nz/lals/about/staff/paulnation).
regular derivational affixes, including -able, -er, -ish, -less, -ly, -ness, -th, -y, non-, un-. At Level 4, the word family extends to words with frequent and orthographically regular affixes (e.g. -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize, -ment, -ous, in-). At Level 5, words with regular but infrequent affixes are included in the word family. Level 6 includes words with frequent but irregular affixes while Level 7 includes words with bound morphemes. In the later analysis of vocabulary coverage, Nation (2006) adopted a level-6 definition for a word family and generated a series of word family lists from the BNC. For instance, the word family of ACCESS may include accessed, accesses, accessing, inaccessible, accessibility, and inaccessibility. According to Bauer and Nation (1993, p. 253), “[t]he important principle behind the idea of a word family is that once the base word or even a derived word is known, the recognition of other members of the family requires little or no extra effort.” However, this is often not the case. For most EFL learners in high school, one word-family does not represent a single learning unit. Therefore, using frequency lists based on word families may be the most optimistic estimation for learners' vocabulary coverage. In this study, we will instead take a more conservative approach and adopt lemmas as our basic unit on the frequency lists, or a level-2 word-family in Bauer and Nation's definition, to estimate the vocabulary coverage. Frequency lists from corpora provide considerable flexibility in the analysis of text difficulties. First of all, vocabulary sizes for different coverage rates (i.e. 95% or 98%) can be easily operationally defined. Second, they provide a common ground for comparison of different studies. 
Most importantly, the coverage rate of each 1000-word band enables the analyst to evaluate the composition and distribution of the vocabulary in the text in a more fine-grained resolution, on which this study will capitalize for a more rigorous assessment of text difficulty development.
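The band-by-band coverage described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function names `band_coverage` and `vocabulary_level`, the toy reference list, and the band size of two are all hypothetical and chosen only to keep the arithmetic visible. The sketch assumes the text has already been lemmatized and that the reference list is sorted from most to least frequent.

```python
from collections import Counter


def band_coverage(text_lemmas, ranked_lemmas, band_size=1000, n_bands=13):
    """Return the percentage of running words covered by each frequency band.

    ranked_lemmas is a reference list sorted from most to least frequent;
    band b (0-based) holds ranks b*band_size .. (b+1)*band_size - 1.
    """
    rank = {lemma: i for i, lemma in enumerate(ranked_lemmas)}
    hits = Counter()
    for lemma in text_lemmas:
        r = rank.get(lemma)
        if r is not None and r // band_size < n_bands:
            hits[r // band_size] += 1
    total = len(text_lemmas)
    return [100.0 * hits[b] / total for b in range(n_bands)]


def vocabulary_level(coverages, threshold=95.0):
    """Return the 1-based band at which cumulative coverage reaches threshold."""
    cumulative = 0.0
    for band, pct in enumerate(coverages, start=1):
        cumulative += pct
        if cumulative >= threshold:
            return band
    return None  # threshold not reached within the bands examined


# Toy data (hypothetical): a six-lemma reference list split into bands of two.
reference = ["be", "have", "go", "make", "book", "river"]
text = ["be", "be", "have", "go", "be", "make", "book", "have", "be", "zebra"]

coverages = band_coverage(text, reference, band_size=2, n_bands=3)
level_80 = vocabulary_level(coverages, threshold=80.0)
level_95 = vocabulary_level(coverages, threshold=95.0)
```

In this toy case the three bands cover 60%, 20%, and 10% of the running words, so an 80% threshold is reached at the second band, while a 95% threshold is never reached because one token is off the list.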
2.2. Structural complexity and readability formulas

For an evaluation on the difficulty level of a reading text, it may not be sufficient to consider only the vocabulary complexity, using the frequency-based lists from corpora. Vocabulary coverage takes into account only parts of the critical mass of L2 knowledge for reading comprehension (Laufer & Ravenhorst-Kalovski, 2010). Structural complexity at other grammatical levels may introduce additional difficulty for learners as well. In the 1920s, educators came up with an idea to use word (phonological) complexity and sentence length to predict the difficulty level of a reading text and proposed readability formulas for computation. By the 1980s, there were about 200 formulas, with thousands of studies being devoted to their theoretical and statistical validity (Crossley, Greenfield, & McNamara, 2008; DuBay, 2004; Greenfield, 2004).

Different from frequency-based vocabulary coverage, most readability formulas consider structural complexity at different levels. Utilizing parameters for different structural complexities, readability formulas aim to assess the suitability of the materials by predicting the appropriate grade levels or ages of learners for a given text. Table 1 gives a summary of a list of popular readability formulas along with their embedded parameters used for computation. At the phonological level, some readability formulas consider the average length of a word in terms of the number of either syllables (Sy/W) or characters (C/W), while some may take into account the proportion of the multisyllabic words in the text (W1Sy/W and W>2Sy/W). At the grammatical level, most readability formulas compute the average length of a sentence in terms of word number (W/St). The Coleman index (Coleman, 1965) further incorporates the proportion of pronouns and prepositions in the formula as an approximation of the structural complexity (Wpro/W, Wprep/W).
At the semantic level, the Dale-Chall index (Dale & Chall, 1948) adopts a pre-defined word list and includes in the formula the percentage of the words that are not on the list as an approximation of the semantic complexity in the text (W-WL/W).

Table 1
Summary of the parameters used in a set of readability formulas.
Names of readability formulas: Flesch (Flesch, 1948); Flesch-Kincaid (Kincaid, Fishburne, Rogers, & Chissom, 1975); Dale-Chall (Dale & Chall, 1948); FOG (Gunning, 1952); SMOG (McLaughlin, 1969); FORCAST (Caylor, Sticht, Fox, & Ford, 1973); ARI (Smith & Senter, 1967); NRI (Kincaid et al., 1975); Coleman (Coleman, 1965).
Parameters: W/St; Sy/W; C/W; W-WL/W; W1Sy/W; W>2Sy/W; Wpro/W; Wprep/W.
Notes. W stands for the number of words; St for the number of sentences; C for the number of characters (usually meaning letters); Sy for the number of syllables; W1Sy for the number of monosyllabic words; W>2Sy for the number of words with more than two syllables; W-WL for the number of words which are not on a certain pre-defined word list; Wpro for the number of pronouns; Wprep for the number of prepositions.

In spite of their wide application in pedagogy, readability formulas have received considerable critique for their limitations in taking into account more contextual factors such as cohesion, required schemata (Crossley et al., 2008; DuBay, 2004; Zamanian & Heydari, 2012), and psychological factors such as semantic categorization (Lin, Su, Lai, Yang, & Hsieh, 2009). For example, Graesser, McNamara, Louwerse, and Cai (2004) adopted a computational approach and proposed a program, Coh-Metrix, to model readability with more than 200 parameters, considering textual cohesion, structural hierarchy, and
discourse contexts. While a more delicate set of parameters has proven to effectively increase the accuracy of the readability prediction (Tanaka-Ishii, Tezuka, & Terada, 2010), the cost is sometimes too high in the sense that the constructs measured by some parameters may not be intuitive to the ELT educators at first sight. Take a parameter, stem overlap, in Coh-Metrix, for instance. Although this parameter is intended to capture the referential cohesion between two neighboring sentences, it is sometimes less straightforward for EFL teachers to connect the construct of cohesion to counting the number of the overlapped stems in the nouns of two neighboring sentences. In addition, it may be overwhelming when an analyst is faced with more than 200 measures for interpretation at the same time, with a possibility of contradictory results on different dimensions.

While we acknowledge the importance of a more sophisticated assessment of textual cohesion, to which another project of ours has been devoted, the present study will first utilize a range of popular readability indexes as our initial consideration of the structural complexities for the assessment of text difficulty. Readability formulas may suffice for the purpose of the present study for two reasons. On the one hand, as our goal is to compare the text difficulty in a textbook series, it is the relative increase or decrease in difficulty that our proposed algorithm aims to capture. The precision of the grade-level prediction is not our major concern. By considering a wide range of popular readability formulas, it is hoped that the structural complexity can be accounted for to a considerable degree. On the other hand, the parameters in most readability formulas are more intuitive to the teachers and learners. A higher predicted grade level by the formula would imply texts with longer sentences and words, which are indeed more difficult to deal with from an EFL learner's perspective.
Therefore, the present study will use a comprehensive set of readability formulas as a complement to the frequency-based coverage in capturing other grammatical aspects of the reading difficulty.
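To make the parameters discussed in this section concrete, the following Python sketch computes four of the indexes from raw counts. The coefficients are the ones commonly published for each formula and are not taken from this article, so the software used in the study may differ in detail. The function names and the toy counts are hypothetical; counting syllables and sentences is left outside the sketch on purpose, since it depends on the tokenizer.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text (uses W/St, Sy/W)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level (uses W/St and Sy/W)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(words, sentences, characters):
    """ARI grade level (uses C/W and W/St)."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def gunning_fog(words, sentences, polysyllabic_words):
    """FOG grade level (uses W/St and the share of words with 3+ syllables)."""
    return 0.4 * ((words / sentences) + 100.0 * polysyllabic_words / words)


# Toy counts (hypothetical): a 100-word passage in 5 sentences.
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fk = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
ari = automated_readability_index(words=100, sentences=5, characters=450)
fog = gunning_fog(words=100, sentences=5, polysyllabic_words=10)

# Same words packed into 4 longer sentences: the predicted grade rises.
fk_longer_sentences = flesch_kincaid_grade(words=100, sentences=4, syllables=150)
```

The last line reflects the point made above: with the word-level counts held constant, longer sentences alone raise the predicted grade level, which is the kind of relative change the study is interested in.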
2.3. Research questions

From a perspective of curriculum design, a textbook series should provide a systematic arrangement of the materials in text difficulty. It is an empirical question whether a six-volume textbook series indeed presents six transitional stages in the development of text difficulty. Also, it remains unclear whether the transitions from one volume to another present an appropriate increase in text difficulty. A more objective method other than intuition-based judgment is needed for the assessment of difficulty development. Crucially, it is the expectation of the curriculum designers that the transition should present a positive increase in both vocabulary and structure complexities. Therefore, two research questions are addressed in the present study:

1. What are the developmental stages of text difficulty in three different versions of the six-volume textbook series (i.e. Far East, Lungteng, Sanmin) used in Taiwan senior high school?
2. Does the development of text difficulty conform to our expectation for a positive increase in both vocabulary and structure complexity?

The present study adopted a clustering-based algorithm to analyze the development of text difficulty. Text difficulty was operationally defined by two sets of parameters: (1) the vocabulary coverage rates at different word bands of corpus-based frequency lists; (2) the predicted grade levels by a range of readability formulas. It should be noted that we do not intend to make an exact prediction for the threshold of vocabulary coverage for a specific volume in the textbook series. Nor do we aim to claim a precise predicted grade level. It is the relative differences in text difficulty that are central to this study, in the hope of identifying the gaps in the development of text difficulty.
3. Methods

3.1. Data collection and processing

Our ELT materials came from three leading officially-approved versions of the Taiwan senior high school textbooks published by Far East (FE), Lungteng (LT) and Sanmin (SM), respectively. The bibliographical information for each version is summarized in Table 2.

Table 2
Bibliographical information of the textbooks selected in the present study for evaluation.

Publishers          Far East                                            Lungteng      Sanmin
Editors             Shih, Y.H.; Lin, M.S.; Huang, C.S.; Brooks, Sarah   Chou, C.Y.    Che, P.C.
Publishing year     2009                                                2013          2013
Number of volumes   6                                                   6             6
Number of lessons   68                                                  68            70

Table 3
CLAWS parts-of-speech tags that were included in the analysis of vocabulary coverage.

Parts-of-speech included   BNC CLAWS5 Tagset
Nouns                      nn0/nn1/nn2/pni
Verbs                      vbb/vbd/vbg/vbi/vbn/vbz/vvb/vvd/vvg/vvi/vvn/vvz/vdb/vdd/vdg/vdi/vdn/vdz/vhb/vhd/vhg/vhi/vhn/vhz
Adjectives                 aj0/ajc/ajs/ord/at0/xx0/dt0
Adverbs                    av0

Notes. For a detailed description for each part-of-speech tag, please refer to the website of the CLAWS POS Tagger at http://ucrel.lancs.ac.uk/claws/.

These three major ELT textbook series have been widely used in high schools across the island. In the
three-year English curriculum, EFL learners in Taiwan are expected to improve their English proficiency level from the CEF (Common European Framework) A2 level to B1 level upon the completion of the high school English curriculum. Each series consists of six volumes designed for six semesters (i.e. three academic years) of the official English curriculum in Taiwan. Each volume normally includes twelve lessons, each of which contains materials divided into different parts, such as pre-reading activities, reading texts, post-reading activities, vocabulary, idioms and phrases, grammar, pronunciation, and other language-skill activities. This study examined only the reading texts in each lesson, as they served as the main reading materials for the senior high school students in their English learning. These reading texts amounted to 206 reading texts in total,2 constituting our ELT corpus for difficulty assessment.

The ELT corpus was further processed for parts-of-speech (POS) tags with the Python Natural Language Toolkit (NLTK) (Bird, 2006). Our POS tagset followed the same convention as the BNC, i.e. CLAWS5 tagset, which was developed by University Center for Computer Corpus Research on Language (UCREL), Lancaster University. For the later feature extractions, we utilized the POS annotations and filtered out irrelevant tags by including only content words for the measurement of vocabulary coverage (i.e. nouns, verbs, adjectives, and adverbs). Tags included in our analysis of vocabulary coverage are provided in Table 3. All words were lemmatized using the off-the-shelf lemmatizer from the NLTK, i.e. the WordNet Lemmatizer, which removed the inflections based on the dictionary entry. After the processing of the raw texts in the ELT corpus, frequency-based word (lemma) lists were generated and arranged according to volumes and publishers. Our basic units for frequency lists were comparable to Bauer and Nation (1993) level-2 word families.

3.2. Quantification of text difficulty

3.2.1. Vocabulary coverage rates using BNCCWL

In order to investigate the vocabulary coverage based on the BNC frequency-based word lists, we adopted Adam Kilgarriff's unlemmatized BNC word list as our “base list”. This list could be downloaded from his website, consisting of words occurring over 5 times in the entire BNC, amounting to 208,656 word types. The base list was first converted into American spelling conventions, as textbooks in Taiwan followed the convention of American spellings. This conversion was done automatically by a dictionary lookup using a self-developed R script. As each word on the base list was listed with its frequency and POS tag, the same POS filtering was also applied to the BNC word lists by including only content words. Finally, words on the list were lemmatized as in the ELT corpus, and the final BNC content word list (BNCCWL) combined words of the same spellings, irrespective of their differences in parts of speech. Our final criterial BNCCWL consisted of 81,967 word types, amounting to eighty-two 1000-word lists.

For each volume in the ELT corpus, the vocabulary coverage was computed based on a comparison of its word lists and the BNCCWL. Counted from the top of the BNCCWL, the coverage rates of each 1000-word band were computed in percentage, representing the proportion of the words in the target text that could be accounted for by the specific 1000-word band of the BNCCWL. Specifically, the word list at which the cumulative vocabulary coverage reached 95% of the running text was defined as the vocabulary level of the target text. Considering the learners' expected English proficiency as well as previous literature on vocabulary levels for 95% coverage, we adopted the top thirteen 1000-word lists from BNCCWL for the assessment, as words in the higher lists might be considered beyond the scope of the high school English curriculum.3 The coverage rates on each of
2 One of the anonymous reviewers raised a critical issue of the length and number of the texts for each book level, which may potentially influence our results here. First, we argue that our computation of the text difficulty with corpus-based frequency lists and well-established readability formulas is not subject to the variation of the text lengths. Second, all these textbooks are designed for the official English curriculum of the senior high school in Taiwan. They strictly follow the guidelines regulated by the Ministry of Education, in terms of the number of lessons per volume, the expected teaching hours per lesson, etc. Therefore, each level of the textbook series is designed to be used for the same amount of hours in the classroom, i.e., a five-month semester. Therefore, it should be legitimate to assume that the number of the texts for each volume is equal.
3 The official English instruction in Taiwan starts in the third year of the elementary school curriculum. The English curriculum of the senior high school consists of six-semester English instructions, geared toward intermediate EFL learners who have reached CEF (Common European Framework) A2 level and expect to achieve B1 level upon the completion of the three years in the senior high school. There are 4–5 h of English instructions per week in every semester. Each semester will focus on one volume of the textbook series investigated here.
Table 4
The vocabulary coverage rates (%) at the top thirteen 1000-word lists in the criterial BNCCWL for Sanmin textbook series.

BNCCWL Nth 1000-word   SMB01   SMB02   SMB03   SMB04   SMB05   SMB06
1                      71.62   72.71   70.68   67.83   65.91   67.01
2                       9.74    9.98    9.13   10.35   11.01   10.25
3                       6.00    5.83    5.88    6.40    6.29    6.28
4                       2.66    2.48    2.66    3.70    3.64    3.03
5                       2.38    2.16    2.49    1.95    2.42    2.56
6                       1.75    0.58    1.65    1.80    1.44    1.55
7                       0.72    0.78    0.89    1.01    1.18    1.33
8                       0.44    0.17    1.06    1.01    0.65    0.81
9                       0.16    0.37    0.54    0.52    0.57    0.32
10                      0.44    0.69    0.54    0.59    0.65    0.32
11                      0.20    0.43    0.30    0.35    0.78    0.74
12                      0.12    0.14    0.27    0.30    0.33    0.44
13                      0.44    0.20    0.11    0.17    0.17    0.44
the top thirteen 1000-word lists were computed in percentage for each volume in the ELT corpus. An example from the SM series is provided in Table 4, where each column represents one volume with a set of percentages in rows representing the coverage rates of each 1000-word band on the BNCCWL. These measures quantitatively profiled the vocabulary difficulty of the texts in each volume of the textbook series. The vocabulary level for 95% coverage was also measured for each volume to serve as an indicator for the progression of text difficulty in terms of vocabulary complexity.
3.2.2. Readability composite indexes
For every volume of the textbook series, a range of readability indexes was computed, including ARI, Dale-Chall, Flesch, Flesch-Kincaid, FOG, FORCAST, and SMOG. The grade levels predicted by these formulas were used as our measures for grammatical complexity, as shown in Table 5 for the SM series. As most readability formulas have been suggested to utilize similar parameters, they were expected to be highly correlated (DuBay, 2004). Multi-collinearity between classifying features can be a serious methodological issue for most multivariate analyses. Table 6 summarizes the pairwise correlations for all the readability formulas computed in this study. Given the potential impact of multi-collinearity, a common statistical strategy is to convert this set of readability measures into mutually independent principal components via principal component analysis (PCA), as shown in Table 7. It was observed that the first two principal components accounted for over 97% of the variance in our data. While different readability formulas might capture subtle structural differences (Brown, 1998; Chall & Dale, 1995; DuBay, 2004), it is argued that the first two principal components preserve the explanatory power of the full set, with the statistical advantage of being mutually independent.
Therefore, the values of the first two principal components from PCA were taken as our readability composite indexes, characterizing the grammatical complexity of the reading text. The mean score of the predicted grade levels from these seven readability formulas was also computed to serve as an indicator for the progression of text difficulty in terms of structure complexity.
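The indicators defined in Sections 3.2.1 and 3.2.2 can be traced by hand on the Sanmin figures. The following Python sketch is an illustration only (the study itself used R scripts); it takes the SMB01 and SMB04 columns of Table 4 and the rows of Table 5 as input. The function names `vocabulary_level` and `pearson` are hypothetical helpers introduced here, and the 95% threshold is passed as a parameter.

```python
from math import sqrt

# Coverage rates (%) of BNCCWL bands 1..13, copied from Table 4.
coverage = {
    "SMB01": [71.62, 9.74, 6.00, 2.66, 2.38, 1.75, 0.72,
              0.44, 0.16, 0.44, 0.20, 0.12, 0.44],
    "SMB04": [67.83, 10.35, 6.40, 3.70, 1.95, 1.80, 1.01,
              1.01, 0.52, 0.59, 0.35, 0.30, 0.17],
}

# Predicted grade levels for SMB01..SMB06, copied from Table 5.
grade_levels = {
    "ARI":            [5.51, 4.13, 8.32, 8.37, 6.2, 10.78],
    "DALE-CHALL":     [7.5, 7.5, 11.5, 11.5, 9.5, 11.5],
    "FLESCH":         [7, 7, 8.5, 8.5, 8.5, 10.5],
    "FLESCH-KINCAID": [6.18, 5.16, 8.61, 8.73, 7.19, 10.79],
    "FOG":            [8.03, 6.96, 10.4, 10.71, 8.89, 12.82],
    "FORCAST":        [9.45, 9.53, 10.2, 10.27, 10.12, 10.38],
    "SMOG":           [9.05, 8.72, 11.25, 11.22, 10.25, 12.72],
}


def vocabulary_level(rates, threshold=95.0):
    """Return the first 1000-word band at which cumulative coverage
    reaches the threshold, or None if it is never reached."""
    total = 0.0
    for band, rate in enumerate(rates, start=1):
        total += rate
        if total >= threshold:
            return band
    return None


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Vocabulary level (VL): band at which 95% cumulative coverage is reached.
vl = {volume: vocabulary_level(rates) for volume, rates in coverage.items()}
print(vl)  # {'SMB01': 8, 'SMB04': 10}

# Two formulas built on similar parameters correlate very strongly,
# which is the multi-collinearity concern that motivates the PCA step.
r_ari_fk = pearson(grade_levels["ARI"], grade_levels["FLESCH-KINCAID"])
print(round(r_ari_fk, 3))

# Mean predicted grade level (RDB) per volume, SMB01..SMB06.
rdb = [sum(col) / len(col) for col in zip(*grade_levels.values())]
print([round(v, 2) for v in rdb])
```

Because the table values are rounded to two decimals, a correlation recomputed this way may differ slightly in the last digits from the coefficient reported in Table 6.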
3.3. Statistical procedures
In order to identify the developmental stages of text difficulty in volumes used across different semesters, a clustering-based algorithm was adopted. This exploratory method has been widely used in dealing with tasks of categorization in a wide range of quantitative linguistic analyses (Baayen, 2008; Moisl, 2015), such as semantic profiles (Divjak & Gries, 2006), typology (Croft, 2008), language phylogeny (Atkinson & Gray, 2005; Dunn, Terrill, Reesink, Foley, & Levinson, 2005), historical developments of constructions (Hilpert, 2007), and language development (Wiechmann, 2008). Of particular importance to the present study was the variability neighbor clustering (VNC) proposed by Gries and Hilpert (2008), which used this data-driven bottom-up method to identify the stages of diachronic linguistic developments in corpora. While the algorithm is similar to other hierarchical clustering methods in successively merging cohesive groups of data points, the key to the success of VNC lies in its control of merging only temporally adjacent data points, thus preserving the chronological order in the final grouping. Specific procedures are explained as follows.
Similar to a typical hierarchical clustering-analytic procedure, VNC follows a two-step process: the first step evaluates the differences between each data point to be clustered, and the second constructs a hierarchical tree by successive merging of the data points. A rationale for how the tree construction proceeds is best explained by working through a hypothetical example. Given a class of 20 students, one may be interested in the question of whether there is systematic variation in students' academic performance in this class. In order to uncover the underlying grouping of these students, a set of academic variables descriptive of their academic performance may be defined. 
For the purpose of illustration, Math and English grades are adopted as criterial indexes for students' academic achievement, as shown in Table 8. There is no theoretical limit on the number of variables for the description of the subjects to be clustered. It is crucial, however, that the researcher should account for the validity of using specific variables (i.e. in our current hypothetical
Table 5
Grade levels for each volume of the Sanmin textbooks predicted by different readability formulas.

RDB index        SMB01   SMB02   SMB03   SMB04   SMB05   SMB06
ARI               5.51    4.13    8.32    8.37    6.2    10.78
DALE-CHALL        7.5     7.5    11.5    11.5     9.5    11.5
FLESCH            7       7       8.5     8.5     8.5    10.5
FLESCH-KINCAID    6.18    5.16    8.61    8.73    7.19   10.79
FOG               8.03    6.96   10.4    10.71    8.89   12.82
FORCAST           9.45    9.53   10.2    10.27   10.12   10.38
SMOG              9.05    8.72   11.25   11.22   10.25   12.72
Table 6
Pairwise correlation for all the readability formulas.

                 ARI      Dale-Chall   Flesch   Flesch-Kincaid   FOG      FORCAST   SMOG
ARI              1.0000   0.9027       0.9060   0.9967           0.9910   0.8119    0.9896
Dale-Chall       0.9027   1.0000       0.8273   0.9134           0.8815   0.8592    0.9199
Flesch           0.9060   0.8273       1.0000   0.9141           0.8858   0.8507    0.9259
Flesch-Kincaid   0.9967   0.9134       0.9141   1.0000           0.9888   0.8345    0.9946
FOG              0.9910   0.8815       0.8858   0.9888           1.0000   0.7665    0.9859
FORCAST          0.8119   0.8592       0.8507   0.8345           0.7665   1.0000    0.8343
SMOG             0.9896   0.9199       0.9259   0.9946           0.9859   0.8343    1.0000
Table 7
Rotation matrix of all the principal components for Sanmin textbooks PCA.

Principal components   SMB01   SMB02   SMB03   SMB04   SMB05   SMB06
PC1                    2.72    3.62    2.16    2.34    0.02    5.23
PC2                    0.42    0.45    0.28    0.34    0.91    0.65
PC3                    0.20    0.25    0.79    0.79    0.20    0.32
PC4                    0.22    0.16    0.10    0.26    0.05    0.43
PC5                    0.17    0.11    0.01    0.02    0.21    0.03
PC6                    0.01    0.04    0.05    0.10    0.02    0.01
PC7                    0.05    0.03    0.06    0.00    0.05    0.05
example, Math and English grades) to measure the construct (i.e. students' academic performance). Any two students' academic performances will be more or less similar depending on how similar their respective variable values are (e.g. Math and English grades). Cluster analysis would help the analyst to group the 20 students according to how similar their academic performances in Math and English are. Fig. 1 shows a typical graphic representation of a cluster analysis applied to this data. This tree structure is commonly referred to as a dendrogram.
Table 8
A hypothetical dataset of 20 students' academic performance.

Student Index   Math   English
SP01            24     69
SP02             6     97
SP03             6     89
SP04             2     71
SP05             4     71
SP06            40     98
SP07            34     80
SP08            87     92
SP09            80     73
SP10            95     73
SP11            80     72
SP12            62     79
SP13            91     90
SP14            61     62
SP15            98     21
SP16            88      5
SP17            73      3
SP18            90     28
SP19            68     40
SP20            81     28
The labels in the terminals of the tree refer to all the students in Table 8. Initially, every student is interpreted as a cluster of his or her own. The first step of the hierarchical clustering is to determine which two students are the most similar in terms of their Math and English grades. This similarity measurement requires a comparison of the rows in Table 8 to determine which two rows are mathematically the most similar. Different clustering algorithms may propose different computational methods. After the first-round pairwise comparison, the most similar two students will be grouped into a superordinate cluster in which their degree of similarity is graphically represented by the length of the vertical branches joining the clusters in Fig. 1. In the first-round comparison, SP09 and SP11 are the most similar, as shown by the relative shortness of the branches, and are joined into a composite cluster (SP09, SP11). At the next step, the data is searched again to look for the next-most-similar pair among the 19 clusters. Visual inspection of Fig. 1 indicates that SP04 and SP05 are joined into a composite cluster. This merging proceeds iteratively and at some point SP10 is joined with the composite cluster (SP09, SP11) into another composite cluster, (SP10, (SP09, SP11)). This iterative process of merging continues until all the students are merged into one cluster.
Once a tree structure of the students has been identified, the analyst would be able to see the systematic variation of the 20 students' academic performance and potential partitioning of the class. In response to the research question in our hypothetical example, the analyst may still need a method to determine the number of the clusters in the data. A dataset like this could theoretically be grouped into different numbers of clusters by drawing a horizontal line at any height across the dendrogram in Fig. 1. The vertical lines that it crosses indicate the final clusters. 
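The merging procedure just described can be sketched in a few lines of Python on the data in Table 8. This is a minimal illustration, not the procedure behind Fig. 1: the text does not state which distance or linkage was used for the hypothetical example, so Euclidean distance and average linkage are assumptions made here, and `average_linkage` is a hypothetical helper name.

```python
from math import dist  # Euclidean distance, Python 3.8+

# (Math, English) grades from Table 8.
students = {
    "SP01": (24, 69), "SP02": (6, 97),  "SP03": (6, 89),  "SP04": (2, 71),
    "SP05": (4, 71),  "SP06": (40, 98), "SP07": (34, 80), "SP08": (87, 92),
    "SP09": (80, 73), "SP10": (95, 73), "SP11": (80, 72), "SP12": (62, 79),
    "SP13": (91, 90), "SP14": (61, 62), "SP15": (98, 21), "SP16": (88, 5),
    "SP17": (73, 3),  "SP18": (90, 28), "SP19": (68, 40), "SP20": (81, 28),
}


def average_linkage(points):
    """Agglomerative clustering; returns one (height, merged_cluster)
    tuple per merging step, in the order the merges happen."""
    clusters = [[name] for name in points]  # every student starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Compare every pair of current clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                d = sum(dist(points[p], points[q]) for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        merges.append((round(d, 2), merged))
    return merges


merges = average_linkage(students)
print(merges[0])  # (1.0, ['SP09', 'SP11'])
print(merges[1])  # (2.0, ['SP04', 'SP05'])
```

Under these assumptions the first two merges agree with the ones read off the dendrogram in the text: SP09 and SP11 first, then SP04 and SP05. The recorded heights are the inter-cluster distances that a scree plot displays.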
A plot of the inter-cluster distance by the number of clusters extracted, commonly referred to as a scree plot, is often consulted to determine the appropriate number of clusters. Scree plots are often used in principal component analyses or factor analyses to define the importance of the factors. Fig. 2 shows the scree plot for our hypothetical dataset. The x-axis in the scree plot is the number of the potential clusters, ranging from 1 (i.e. all the 20 students forming one super cluster) to 20 (i.e. every student being a cluster of his or her own). The y-axis refers to the inter-cluster distance (i.e. dissimilarity) at each step of the merging. It is argued that the optimal number of clusters may emerge because the inter-cluster distance will decrease considerably after a few merging steps (Gries & Hilpert, 2008; Kaufman & Rousseeuw, 2005; Moisl, 2015). A visual heuristic to look for a “local minimum” or “an elbow” in Fig. 2 would suggest that a three-cluster solution to the data partitioning is optimal. That is, at this step of merging, the three clusters of
Fig. 1. Dendrogram of a cluster analysis applied to the hypothetical dataset of 20 students.
Fig. 2. Scree plot of the clustering results for the hypothetical dataset.
students differ significantly from one another in terms of their Math and English grades. A larger number of clusters (i.e. 4 or more clusters) would decrease the inter-cluster distance, yielding clusters that are somewhat more similar, while a smaller number of clusters (1 or 2 clusters) would increase the inter-cluster distance, yielding clusters that might be too heterogeneous in nature. As the cluster analysis suggests a three-cluster grouping of the 20 students in the class, a post-hoc analysis of the three clusters in terms of their Math and English grades will give the analyst a clear picture of the systematic variation of the 20 students' academic performance. Fig. 3 illustrates the distribution of the Math grades (i.e. the left panel) and the English grades (i.e. the right panel) of these three clusters. It may thus be concluded that Cluster 2 features students with good academic performance in both Math and English while students in Clusters 1 and 3 may perform better in one subject than the
Fig. 3. Boxplots of the Math and English grades for the three clusters generated by the clustering analysis in the hypothetical dataset.
other. This post-hoc analysis is analogous to our post-hoc analyses of the vocabulary levels and the readability indexes.
The algorithm of the VNC follows exactly the mechanism of the hierarchical clustering discussed above and differs only in that each data point can only be merged with its neighboring data points in the successive amalgamation, because the data points follow a temporal sequence (e.g. developmental data, historical data). In the present study, the data points to be clustered are the six volumes of the textbook series from each publisher. VNC is used to identify the developmental stages in the volumes that differ from each other quantitatively in the measures of text difficulty. The operational definitions for text difficulty are: (1) a set of vocabulary coverage rates across the top thirteen 1000-word lists on our BNCCWL and (2) two PCA-transformed readability composite indexes. These 15 measures are used to quantitatively profile the text difficulty of a specific volume. Mathematically speaking, each volume of the textbook series is represented by a 15-dimensional vector of difficulty measures. The next step is to compute a similarity measure for each consecutive pair of volumes to see how similar they are in terms of their text difficulty. In VNC, Pearson's correlation coefficient is computed as an indicator of the similarity for each pair of vectors. The amalgamation of the clustering is then based on the average link. Like other hierarchical clustering algorithms, the development of text difficulty is represented in a tree-like structure in VNC, i.e. a dendrogram. In addition, a scree plot is also generated in VNC, showing the between-cluster distance or dissimilarity measured by the VNC algorithm at different steps of the successive amalgamations. 
As described in our earlier hypothetical example, this plot shows the distance between clusters at each iterative stage of merging in a reverse order, allowing us to judge at which point the algorithm would give us an optimal solution to the cluster partitioning. The best number of clusters is a partitioning where the inter-cluster dissimilarity is maximized and the intra-cluster dissimilarity is minimized. An ideal pattern in a scree plot is a steep curve, followed by a bend and then a flat or horizontal line. Take, for example, the scree plot for the mergers of the six volumes of FE, as shown in Fig. 4. The values on the y-axis refer to the average inter-cluster dissimilarity in relation to the solutions with different numbers of clusters (i.e. the x-axis). The starting point of the leveling in the line (i.e. a local minimum at 3) would indicate the optimal number of clusters for the dataset.
The VNC was conducted with an R script kindly provided by Stefan Th. Gries. Other textual preprocessing and the computation of the measures for text difficulty were implemented in R scripts written by the author. Readability indexes were computed using an R library, koRpus (Michalke, 2014). The following results will be discussed in terms of the dendrogram and the scree plot from VNC, answering our first research question on the development of text difficulty in each textbook series. For a more thorough assessment of text difficulty development, vocabulary level for 95% coverage (VL) and the mean score of the grade level predicted by the readability formulas (RDB) will be specifically discussed to shed light on our second research question of identifying the direction of the difficulty development in terms of vocabulary and structure complexities.
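The adjacency constraint that distinguishes VNC from ordinary hierarchical clustering can be made concrete with a short Python sketch. It is not the R script used in the study. Following the description above, it assumes that dissimilarity is one minus Pearson's correlation and that clusters are compared by average link. The five four-dimensional vectors and the labels B1 to B5 are invented toy data, not measures from any textbook, and `vnc` is a hypothetical function name.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vnc(labels, vectors):
    """Merge only temporally adjacent clusters, smallest average
    dissimilarity first. Returns (height, partition) after each merge."""
    clusters = [[i] for i in range(len(vectors))]  # kept in temporal order
    history = []
    while len(clusters) > 1:
        best_k, best_d = None, None
        # Only neighbours k and k + 1 are candidates for merging.
        for k in range(len(clusters) - 1):
            left, right = clusters[k], clusters[k + 1]
            d = sum(1 - pearson(vectors[i], vectors[j])
                    for i in left for j in right) / (len(left) * len(right))
            if best_d is None or d < best_d:
                best_k, best_d = k, d
        clusters[best_k:best_k + 2] = [clusters[best_k] + clusters[best_k + 1]]
        history.append((best_d, [[labels[i] for i in c] for c in clusters]))
    return history


# Toy difficulty profiles for five consecutive volumes (hypothetical values).
labels = ["B1", "B2", "B3", "B4", "B5"]
vectors = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [4, 3, 2, 1],
    [6, 3, 2, 1],
    [1, 5, 1, 5],
]

history = vnc(labels, vectors)
for height, partition in history:
    print(round(height, 3), partition)

three_stage = history[1][1]
print(three_stage)  # [['B1', 'B2'], ['B3', 'B4'], ['B5']]
```

Each partition is a list of contiguous runs of volumes, so the chronological order survives every merge. Plotting the recorded heights against the number of clusters gives the kind of scree plot consulted in the Results section.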
4. Results
This section reports the results of the VNC analysis for the three versions of the ELT textbook series used in senior high schools in Taiwan. For each version, the scree plot and the dendrogram are presented, visualizing the development of text difficulty
Fig. 4. Scree plot of the VNC results for Far-East series.
on an empirical basis. In order to further diagnose the directionality in difficulty changes, the vocabulary level (VL) for 95% coverage and the mean score of the predicted grade levels from the set of readability formulas (RDB) are highlighted.
Fig. 4 provides the scree plot for FE, showing the inter-cluster distance at different stages of merging in the clustering. The favored solution for the number of developmental stages is a compromise between capturing as much distance between volumes as possible and positing as few clusters as possible. In the FE graph, the bend occurs approximately at the three-cluster solution, suggesting three developmental stages in the FE series. The results of the VNC for FE are represented in Fig. 5, as a dendrogram, overlaid with the VL (dashed line) and the RDB (dotted line) for each volume. As the scree plot motivates a three-cluster solution, three developmental stages are adopted here: stage 1 comprises Book 1, Book 2, and Book 3; stage 2 comprises Book 4 and Book 5; stage 3 is Book 6. A closer look at the VL and RDB across different developmental stages in Fig. 5 offers more insights into the changes of text difficulty in respective dimensions. It is interesting to see that the first gap (B3–B4) is attributed to an increase in RDB but to a decrease in VL while the second (B5–B6) is mainly due to a decrease in both VL and RDB. In other words, the transition from Book 3 to Book 4 in FE reflects an increase of text difficulty in its structural complexity only. As far as the vocabulary is concerned, the complexity does not increase. It is even counter to our expectation that a higher-level volume, Book 6, turns out to be less challenging in both vocabulary and readability complexity.
As for the LT series, the scree plot is provided in Fig. 6. Based on the bending point of the scree plot, the favored solution for the number of developmental stages is a four-stage development in the LT textbook series. 
The results of the VNC for LT are represented in Fig. 7. As the scree plot motivates a four-cluster solution, four transitional stages are adopted here: stage 1 comprises Book 1, Book 2, and Book 3; stages 2, 3, and 4 each comprise a single volume, namely Book 4, Book 5, and Book 6 respectively. Based on diagnostic statistics of VL and RDB, the first gap (B3–B4) might be attributed to an increase in both VL and RDB while the other two gaps display inconsistent variation in VL and RDB. The second gap (B4–B5) shows an increase in VL but a decrease in RDB while the third gap (B5–B6) shows an increase in RDB but a decrease in VL. In other words, only the transition from Book 3 to Book 4 in LT is consistent with our expectation that both the vocabulary and structure complexity should increase. As far as the vocabulary is concerned, however, the difficulty level increases until Book 5, while the materials in Book 6 become more difficult in structure complexity only.
For SM, the bending point of the scree plot, as shown in Fig. 8, suggests a favored solution of a four-stage development in the SM textbook series. The four developmental stages are illustrated in the VNC dendrogram, as shown in Fig. 9. Stage 1 comprises Book 1 and Book 2; stage 2 comprises Book 3 and Book 4; stages 3 and 4 comprise Book 5 and Book 6 respectively. According to the post-hoc diagnostic statistics of VL and RDB, the first and the third gaps (B2–B3 and B5–B6) demonstrate an increase in both VL and RDB while the second gap (B4–B5) shows an increase in VL but a decrease in RDB. In other words, in the SM series, only one transitional gap, i.e. B4–B5, deviates somewhat from our expectation, in that the volume used in the later semester does not increase in structure complexity. However, as far as the vocabulary complexity is concerned, SM provides a consistent accumulation in its design of the textbook series.
Fig. 5. Dendrogram of the Far-East version overlaid with the vocabulary levels (VL) and the mean score of the predicted grade levels (RDB) for each volume.
Fig. 6. Scree plot of the VNC results for Lungteng series.
Fig. 7. Dendrogram of the Lungteng version overlaid with the vocabulary levels (VL) and the mean score of the predicted grade levels (RDB) for each volume.
5. Discussion
A coherent series of ELT textbooks is expected to provide an appropriate increase in text difficulty as learners' proficiency progresses in different years. Also, the general expectation is that the difficulty development should be unidirectional in the sense that the complexity in vocabulary and structure should reasonably increase with the volumes (i.e. the semesters) in the
Fig. 8. Scree plot for the VNC results for Sanmin series.
Fig. 9. Dendrogram of the Sanmin version overlaid with the vocabulary levels (VL) and the mean score of the predicted grade levels (RDB) for each volume.
design of the English curriculum (Mukundan & Ahour, 2010; Tomlinson, 2012). The present study started out to assess ELT textbooks in this aspect by analyzing the development of text difficulty in a textbook series from two perspectives: vocabulary coverage rates across the top thirteen 1000-word lists of the BNCCWL and the PCA-transformed readability composite indexes.
Our results of the developmental stages for all the versions of textbooks are graphically summarized in Fig. 10, where each textbook version is represented by a horizontal bar with developmental stages marked with different patterns. Transitions that fully conform to our expectation (i.e. both vocabulary and structural complexity increase) are marked by the black solid vertical bars while those which partially match our expectation (i.e. either vocabulary or structural complexity increases) are marked by the black dotted vertical bars. The white vertical bars mark the transition that is counter to our expectation (i.e. both vocabulary and structural complexity decrease). Our general conclusions are (1) that not every volume constitutes a coherent developmental stage in terms of the progression in text difficulty, and (2) that not all the transitions are due to a positive increase in both vocabulary and structure complexity. Specifically, for the two transitional gaps identified in the FE series, none of them fully conforms to our expectation, while in the LT series, one transitional gap does, namely from Book 3 to Book 4. With two transitional gaps identified to fully match our expectation of increasing text difficulty, the SM series is found to best reflect an expected progression of text difficulty, suggesting that learners may be able to receive materials of increasing text difficulty in both vocabulary and structure. Our findings may be compared with Lo's (2010) textbook assessment, in which the text difficulty was mainly based on the readability formula of the Flesch Reading Ease. In her analysis of the same three sets of textbook series, Lo (2010) made a similar observation that the SM series showed the most appropriate trend within its six volumes in terms of the readability progression. The FE did not show the appropriate trend of readability levels being progressively distributed from easy to difficult, while the LT stood in between. 
Lo's analysis of the appropriateness of the progression relied on whether the readability index of a volume showed more difficulty than that of the previous volume. According to her observations, the incoherent transitions in the textbook series (e.g. no indication of increasing difficulty in readability) were found in B3–B4 in SM, B2–B3 in LT, and B2–B3, B4–B5, B5–B6 in FE. Even though the transitional gaps identified by Lo were different from our results presented earlier, it should be noted that Lo's interpretation of the difficulty progression was purely based on the inspection of the readability line chart. More importantly, her study considered only one readability index, which featured parameters of average sentence length and average number of syllables only.
We would like to argue that our current method for textbook difficulty assessment is more advantageous than Lo's study in the following aspects. First of all, for the evaluation of structure complexity, a more comprehensive set of readability formulas has been adopted and PCA-transformed into independent readability composite indexes, thus statistically securing the accountability of the predicted grade levels by different readability formulas. Second, a more sophisticated algorithm, VNC, has been adopted to empirically determine the transitional gap instead of relying on an intuitive inspection of the descriptive numbers and line plots. Finally, the difficulty progression has been evaluated in terms of not only the structural complexity in readability but also the vocabulary complexity. While Lo's study may reach a similar conclusion at the outset that SM provides a better transition in terms of readability, our analyses may give a more fine-grained account of the transitional gaps in terms of their respective progression in vocabulary and structure complexity. 
Even though the present study focuses on the textbook series in a local context of Taiwan high school curriculum, we believe that our method for text difficulty assessment can be applied to other ELT materials in other global contexts. ELT textbooks often serve as the main input for most EFL learners in their English learning. This is especially true for the official English curriculum. They are considered important materials in curriculum design and in the learning process, thus being a second-to-teacher factor in most EFL classrooms (Davison, 1976), or even playing a more prominent role than teachers (Hutchinson & Torres, 1994). Given the vital and positive part of textbooks in teaching and learning English, we are concerned with the appropriateness for each target level in a textbook series. In the discussion of how to improve the textbook selection process, Young and Riegeluth (1988) suggest that a selection committee for textbooks may need to consider at least five aspects: subject-matter content, social content, readability, instructional design, and production quality. Of particular
Fig. 10. A graphic summary of the developmental stages for the three different versions of textbook series identified by VNC.
importance to the present study is the aspect of readability, in that students may struggle if the reading level is too difficult while they may not be challenged to improve their reading skills and even feel bored if the reading level is too easy (Young & Riegeluth, 1988, p. 16).
The textbooks analyzed in this study are the officially approved versions that are being widely used in senior high schools in Taiwan. While students may spend as much as 90%–95% of classroom time interacting with textbooks, there have been sporadic complaints from high-school teachers about the inconsistent progression in the difficulty of the textbook series. Some high schools, especially prestigious ones, may even select the first four books from one version and change to another version for Books 5 and 6 simply because teachers have an impressionistic judgment of the insufficient complexities in the higher-level volumes of the original version. This problem is non-trivial not only for textbook selection committees in high schools, but also for the publishers. While publishers follow the guidelines regulated by the Ministry of Education in Taiwan for textbook writing, their selection of reading texts is usually based on an impressionistic judgment of the editors with respect to the text difficulty (Tomlinson, 2008). Another confounding factor that may lead to the inconsistency of the difficulty development is text transformation. As these high-school textbooks are geared towards lower-intermediate EFL learners (Common European Framework A2 level) with a goal of achieving B1 level, most of the reading texts in the textbooks have been re-written by native speakers into simpler versions after the original texts are collected. This may lead to another intuition-based judgment of the difficulty levels in the textbook-writing process.
Our corpus-based quantitative approach to textbook assessment has two advantages. 
First, we use quantitative measures to characterize the text difficulty of the ELT materials by resorting to BNC frequency-based word lists and well-established readability formulas. When different materials developed in different EFL contexts are evaluated according to the same operational criteria for text difficulty, they can be critically compared on the same ground. Our method provides a platform for materials developers around the world to compare the difficulty levels of the materials in a principled framework. Second, our method goes one step further by applying a robust statistical algorithm, VNC, to visualize the developments in text difficulty for a given textbook series, thus shedding light on a more sophisticated selection and arrangement of the reading texts for each grade level. VNC computes the trend of difficulty progression based on the measures of vocabulary and structure complexities, thus offering insights into the appropriateness of the transitional gaps.
Our agenda for future studies is to develop a more rigorous method for identifying an optimal transition based on our measures of vocabulary and structure complexities. While the transitional gaps identified by VNC may have statistically visualized a significant gap in the progression of text difficulty, it is possible that these difficulty developments may be too challenging for students instead, thus leading to unexpected frustrations in learning. A careful match of book difficulties and learners' reading skills should be a prerequisite for maximum learning gains (Paul, 2003). The idea is inspired by Vygotsky's Zone of Proximal Development (Vygotsky, 1978), which refers to a range of difficulty levels at which reading is challenging for learners, yet without too much frustration. Therefore, further psychological experiments are required to decide what constitutes an optimal difficulty range for EFL learners to move from one volume to another in a textbook series. 
Differences across individual learners and communities would render this a rather complicated issue for future research.
6. Conclusion
The present study proposed a quantitative corpus-based approach to assess the appropriateness of difficulty development in a textbook series, using two types of quantitative measures of text difficulty: vocabulary coverage rates of the first thirteen 1000-word lists of our BNCCWL and two PCA-transformed readability composite indexes. It is argued that these two sets of measures cover two important dimensions for text difficulty: vocabulary and structure complexity. Variability neighbor-based clustering, a variant of hierarchical clustering proposed in quantitative corpus linguistics, was applied to identify the developmental stages in text difficulty and to analyze the appropriateness of the transitional gaps in terms of the expected increase in vocabulary and structure complexity.
Our textbook evaluation for text difficulty development differs from previous checklist-based evaluations in considerable ways. Firstly, this corpus-based evaluation brings attention back to the focus on text difficulty as a crucial factor for building linguistic competence. For the evaluation of vocabulary complexity, the reference corpus, the BNC, offers a set of criteria that are more consensus-reached and theory-neutral, rendering them legitimate and objective criteria for cross-community comparisons. For the evaluation of structure complexity, a wide range of readability formulas are adopted and PCA-transformed into independent readability composite indexes, thus statistically securing the accountability of the predicted grade levels by different readability formulas. 
For the interpretation of the directionality in text difficulty development, vocabulary levels for 95% coverage and readability mean scores for predicted grade levels provide the analyst with more empirical evidence to pinpoint the progression in text difficulty in two crucial dimensions, vocabulary and structure complexity. It is, thus, concluded that our corpus-based evaluation for difficulty development in the textbook series may not only correspond to the local needs of the Taiwan senior high school curriculum, but also provide a common framework for the assessment of the ELT materials in other global contexts.
Acknowledgments
An earlier version of this paper was presented at the 14th International Conference of the Asia Association of Computer-Assisted Language Learning (AsiaCALL 2014) and the 32nd International Conference on English Teaching & Learning (ROCTEFL
2015). The author would like to thank the audience for their valuable feedback. The author is also grateful to the anonymous reviewers and the editor for their particularly insightful comments and suggestions. This study was partly funded by the Taiwan Ministry of Science and Technology (Project number: MOST 103-2410-H-018-006).