Evaluating CALL use across multiple contexts


Available online at www.sciencedirect.com

System 38 (2010) 357–369

www.elsevier.com/locate/system

Joan Jamieson a,*, Carol A. Chapelle b

a Department of English, Northern Arizona University, Flagstaff, AZ 86011-6032, USA
b Department of English, 203 Ross Hall, Iowa State University, Ames, IA 50011, USA

Received 16 December 2009; revised 7 April 2010; accepted 3 June 2010

Abstract

For an audience of teachers, materials developers, students, and other applied linguists, three key issues in materials evaluation are establishing a means of drawing on professional knowledge about language learning in the evaluation process; recognizing the tension between the utility of robust evaluation results and the inherent context-specific nature of evaluation; and articulating the need for a procedure that produces stable, defensible results. This study explored one approach to addressing these issues when evaluating computer-assisted language learning (CALL) materials. A survey was developed and used in a multiple case study in twelve different English language classes in four countries. Results on six different criteria indicated generally good internal consistency. Findings from classes both inside and outside the USA suggest that the survey approach combined with multiple case studies had the sensitivity to reveal some differences both within and across classroom contexts. This study illustrates an additional method that can easily complement the more traditional process- and outcomes-based approaches for CALL evaluators.

© 2010 Elsevier Ltd. All rights reserved.

Keywords: Computer-assisted language learning; Evaluation; Second language learning; Case studies; Surveys; Applied linguistics

1. Introduction

Many second language students and teachers use computer-assisted language learning (CALL) materials as a routine part of a range of language learning opportunities. CALL materials are intended to be attractive to and beneficial for learners, and publishers tend to claim that their materials succeed in achieving these goals. In view of the investment CALL materials demand and particularly in view of the scope of their reach as they are used all over the world, these intentions and claims need to be examined. Three key issues in materials evaluation are establishing a means of drawing on professional knowledge about language learning in the evaluation process; reconciling the tension between the utility of robust evaluation results and the inherent context-specific nature of evaluation; and articulating a procedure that produces stable, defensible results. We briefly examine how each of these issues has been addressed in the professional literature and then we describe a project whose purpose was to illustrate an additional approach to CALL evaluation.

* Corresponding author. Tel.: +1 928 523 6277 (office); fax: +1 928 523 7074. E-mail addresses: [email protected] (J. Jamieson), [email protected] (C.A. Chapelle).
0346-251X/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.system.2010.06.014


2. Materials evaluation

Second language learning can be evaluated in a manner similar to school learning through examination of comparison and treatment groups, consisting of students who have experienced different instructional presentations or activities. Such research designs have been used in the evaluation of CALL, and they can provide the desired information in contexts where the researchers aim to select one instructional method over another based on results. However, such methods are typically not very informative in cases where CALL is one of many factors that may affect learning. Moreover, if the aim of the research is to identify particular areas of strength and weakness of materials, more specific criteria are needed. Such evaluations challenge researchers to appropriately draw upon professional knowledge about language and language learning, to interpret results relative to the context, and to apply defensible research methods.

2.1. Professional knowledge in materials evaluation

In applied linguistics, frameworks such as those used for evaluation of L2 learning tasks (e.g., Pica et al., 1993), L2 course materials (e.g., Tomlinson et al., 2001), and L2 assessments (e.g., Bachman and Palmer, 1996) provide a means of synthesizing professional knowledge in a manner that allows for evaluation heuristics to be developed. This type of framework has been developed for the evaluation of CALL as well (e.g., Chapelle, 2001). Chapelle’s framework, summarized in Table 1, reflects the perspective that language is best learned when it is used for a realistic purpose, while recognizing that language use should be fluent, accurate, and complex. For these reasons, language learning potential, meaning focus, and authenticity are three criteria for CALL evaluation. Learner fit is included to reflect both the need for practice and the ways in which individuals differ such as by age, learning style, and stages of development. Positive impact is included because of the importance of the learner’s attitude in language learning. Finally, practicality is included because of the necessity of having the resources required for using the CALL activity. These six criteria are intended to be used in framing claims about CALL that can be supported using professional judgment and empirical data.

Table 1
Criteria for CALL evaluation.

Language learning potential: The degree of opportunity present for beneficial focus on form
Meaning focus: The extent to which learners’ attention is directed toward the meaning of the language
Learner fit: The amount of opportunity for engagement with language under appropriate conditions given learner characteristics
Authenticity: The degree of correspondence between the learning activity and target language activities of interest to learners out of the classroom
Positive impact: The positive effects of the CALL activity on those who participate in it
Practicality: The adequacy of resources to support the use of the CALL activity

2.2. The role of context

The second issue that appears throughout discussion of materials evaluation is whether the quality of learning materials is inherently context specific. How can materials produced in the United States by authors and editors whose primary point of reference is English as it functions and is taught in American classrooms be transported successfully to learners in a variety of classes throughout the world and used by students who may have little if any shared background knowledge? Do results of a CALL evaluation conducted in one school have any relevance for another school?

On one hand, second language learning can be seen as an inherently psycholinguistic process which is the same regardless of where students are learning. From this perspective, some stable truths about quality materials can be identified. For example, Tomlinson (1998) lists a number of basic principles of second language acquisition relevant for materials developers. Similarly, CALL developers and researchers have attempted to identify general principles that are transportable across contexts (Chapelle, 1998; Doughty and Long, 2003; Egbert et al., 1999). This perspective is evident in CALL research seeking evidence for the value of particular features of instructional design such as subtitles for listening or informative feedback for grammar. The assumption is that results from a study generalize at least to the learning of other similar students, and that the knowledge about subtitles or feedback strengthens professional knowledge about general principles of CALL instructional design.

On the other hand, materials succeed or fail in particular classes and schools due to how they interact with other contextual factors. From this perspective, materials can be evaluated only within the particular context where they are considered for use, since even if the same evaluation criteria are used, a course book may be successful in one situation, but not in another (Sheldon, 1988). Hubbard (2006, p. 313) wrote that evaluation “refers to the process of (a) investigating a piece of CALL software to judge its appropriateness for a given language learning setting, (b) identifying ways it may be effectively implemented in that setting.” The context specificity of evaluation should be seen as a challenge for all professionally-developed language learning materials. The problem may be exacerbated for CALL, however, because of the reduced role of the teacher in making on-the-spot additions and changes in the path through the CALL materials relative to a textbook.

Context-specific provisions are possible for Chapelle’s (2001) criteria even though they are not explicitly identified. The types of evidence required for their support are outlined in Table 2. Much evidence relies on opinions, thus including subjective data not often captured in outcome-based and process-oriented studies.

2.3. Articulating procedures for defensible results

The purpose of materials evaluation performed by professionals for professionals is to produce defensible evidence about the quality of materials. The two issues described above require clearly stated procedures.
A procedure needs to specify how evidence reflecting the professional knowledge about language learning is operationalized and obtained. Another procedure needs to clarify how the role of context is included in the research design in a way that allows evaluators to both see the effects of different settings and to see similarities in language learning across these settings.

Table 2
Context-sensitive aspects of the CALL evaluation framework.

Language learning potential: The degree of opportunity present for beneficial focus on form
  Context-specific aspect of criteria: The potential and opportunity indicate an open possibility which may or may not be realized in a particular context.
  Context specificity of evidence: The learners’ use of help options that focus their attention on the language

Meaning focus: The extent to which learners’ attention is directed toward the meaning of the language
  Context-specific aspect of criteria: Factors affecting learners’ attention will vary across contexts. Creation of meaning is an interactive process between learner and materials.
  Context specificity of evidence: The learners’ recall of the storyline of the materials

Learner fit: The amount of opportunity for engagement with language under appropriate conditions given learner characteristics
  Context-specific aspect of criteria: Appropriate conditions have to be defined in terms of learners, and learner characteristics will vary across contexts.
  Context specificity of evidence: The learners’ satisfaction with the challenge presented by the language of the materials

Authenticity: The degree of correspondence between the learning activity and target language activities of interest to learners out of the classroom
  Context-specific aspect of criteria: The language of the activity is judged in relation to the language of context of interest to the learners.
  Context specificity of evidence: Teachers’ and learners’ estimation of the correspondence of the language to the language students are exposed to outside of class

Positive impact: The positive effects of the CALL activity on those who participate in it
  Context-specific aspect of criteria: Impact refers to effects on users.
  Context specificity of evidence: Benefits teachers and learners obtain from working with the materials

Practicality: The adequacy of resources to support the use of the CALL activity
  Context-specific aspect of criteria: Resources refer to the infrastructure in a particular setting.
  Context specificity of evidence: Teachers’ and students’ perspectives on the usability of the materials

In course book evaluation, professional knowledge is marshaled into sets of criteria. Much of the evaluation described in the course book literature takes place before students ever use the book (e.g., Chambers, 1997; McDonough and Shaw, 2003; McGrath, 2002), reflecting teachers’ judgments as to what will work best for their courses. In contrast, because much CALL work is new to teachers and is often used independently by learners, more extensive evaluation including trialing CALL with students is seen as desirable. CALL use in a course and specific features of CALL materials have been evaluated through quasi-experimental research designs that compare students’ performances across conditions. For example, Scida and Saury (2006) reported on a study comparing two Spanish courses (one a hybrid course) taught by the same instructor; Ramirez Verdugo and Alonso Belmonte (2007) compared the effects of digital versus text stories on listening comprehension. Evaluating specific features of CALL design, Borrás and Lafayette (1994) compared the effects of video with and without subtitles and Nagata (1993) and Hincks and Edlund (2009) compared the effects of different types of feedback. These and other studies of CALL that appear in many journals and edited volumes exist because of the need to better understand the use of technology for language learning. The comparison method used in these studies sought evidence favoring one learning condition over another. Based on its preponderance in studies evaluating courses and pedagogical features, one can infer this quasi-experimental design is a credible basis for evidence.

However, when the goal is to evaluate course materials that are intended to be supplementary, it does not seem realistic to compare classes with and without supplementary materials, because so many other factors affect any learning outcomes. As there is no real comparison condition for the materials, the quasi-experimental method may not be the most appropriate option.

When gathering data on CALL materials as they are used, akin to what Breen (1989) called “tasks-in-process,” a case study method may be more appropriate. Yin (2003) defined a case study as an empirical inquiry that investigates a contemporary phenomenon within its real-life setting, especially when the boundaries between the phenomenon and its context are not clear. The more context-sensitive approach of case study can be used to evaluate how CALL is used and why it may or may not be successful. By conducting multiple case studies, results can be compared. This comparison allows for replication, which is akin to the notion of generalization in quasi-experimental studies.
By using the well-established procedure of multiple case studies, the tension between context specificity and robust results is relaxed.

3. Research questions

The goal of this study was to determine the viability of a survey in different contexts to evaluate the appropriateness of CALL materials by seeking the extent to which evidence could be found to indicate language learning potential, meaning focus, learner fit, authenticity, positive impact, and practicality. These targets for evaluation were based on the criteria for CALL appropriateness that codify professional knowledge about language learning in a manner that is intended for use in CALL evaluation. Based on the evidence obtained concerning these qualities, an additional goal of the study was to summarize the individual attributes into an evaluative statement for each context. The final goal was to determine the degree to which the results were similar across contexts.

4. Method

These goals were addressed through the use of a multiple case study in which each class investigated was seen as a case. Guided by the evaluation framework for CALL appropriateness, data were systematically gathered from each class to reveal the perceptions of students over several periods throughout the semester. These class-level results were then compared statistically to assess the extent to which results for factors contributing to appropriateness were the same across contexts.

4.1. CALL materials

The CALL material used in the study was Longman English Interactive (LEI; Rost and Fuchs, 2004), a video-based multimedia software package. A judgmental analysis and pilot study resulted in a positive evaluation of LEI in terms of Chapelle’s six criteria (2001) as seen by software developers, and a teacher and students at one school (Jamieson et al., 2004, 2005). The publisher donated LEI for this study.
LEI aims to develop integrated language skills of beginning to intermediate learners, offering four levels of increasing difficulty (i.e., LEI 1; LEI 2; LEI 3; LEI 4). LEI levels 1 and 2 have 15 units; their video-based segments are based on the interactions among a group of characters and include topics such as greetings and finding an apartment. LEI 3 and 4 have 12 units each; their video-based segments portray an ongoing drama involving a soccer star and a young journalist. Video-based listening provides students with aural input which appears and sounds authentic and which is intended to focus attention on the meaning of the language. The language is intended to be useful both inside and outside of the classroom and to be at a medium level of difficulty and complexity for the learners. Comprehension of the storyline and focus on specific language is supported with a variety of selection-type items, including multiple-choice, drag-and-drop, and listen/record/playback; there are a few short constructed response items such as fill-in-the-blank, but no sentence/paragraph writing item types. Each LEI unit provides focus on form through explanations and practice in vocabulary, grammar, and pronunciation, and focus on meaning through explanations and practice in speaking, listening, and reading.

4.2. Participants

Participants were learners of English as a second/foreign language in twelve classes at six schools in four countries, as shown in Table 3. (For the sake of confidentiality, the names of the schools used below and throughout this article have been altered.) The language programs were selected so that classes both inside and outside the USA would be included to operationalize context, though essentially participants represented a sample of convenience. Three schools were located in the USA: United Community College (UCC) in New Jersey, Nogales American University (NAU) in Arizona, and LeGrand County College (LGCC) in New York. Three schools were located outside of the USA: Kansai Sanyo University (KSU) in Japan, Korn Kahn University (KKU) in Thailand, and Instituto Christobal Norte (ICN) in Chile. Two hundred and twenty-one students and ten teachers from these six schools participated in the study.

The twelve classes were “intact”; that is, the students had been assigned to their classes using the normal procedures in place at each school. The schools, classes, home countries, first languages, ages, and English proficiency measured by the Institutional TOEFL of the students are displayed in Table 3. At a glance, it can be seen that the students in the USA formed a more heterogeneous group of individuals than those outside the USA in terms of home country/first language (L1) and age. Those students in the USA also had somewhat higher TOEFL scores, although all of the students in the study seemed to be at the lower end of the proficiency continuum rather than the higher end, as only two participants out of 221 had TOEFL scores above 550.

Table 3
Participants in the study.

1. United Community College (UCC)/NJ, USA. Total N of students: 63
   Students’ home country (n): Haiti (15); Colombia (11); Peru (10); Ecuador (5); Brazil (3); Dominican Republic (3); El Salvador (2); Puerto Rico (2); Poland (2); Cuba (1); Ghana (1); Greece (1); Mexico (1); Pakistan (1); The Philippines (1); Slovakia (1); Turkey
   Students’ L1 (n): Spanish (36); French (8); Portuguese (3); Creole (2); English (1); Greek (1); Polish (1); Slovak (1); Tagalog (1); Turkish (1); Urdu (1); other (5)
   Class    n of students    Age mean (SD)    TOEFL mean (SD)
   UCC1     22               30 (9.9)         469 (48), n = 4
   UCC2     18               27 (9.8)         484 (53), n = 13
   UCC3     23               29 (9.7)         418 (42), n = 19

2. Nogales American University (NAU)/AZ, USA. Total N of students: 8
   Students’ home country (n): China (2); Taiwan (2); Ecuador (1); Mexico (1); Peru (1); Russia (1)
   Students’ L1 (n): Chinese (4); Spanish (3); Russian (1)
   Class    n of students    Age mean (SD)    TOEFL mean (SD)
   NAU      8                29.5 (11.7)      440 (45), n = 8

3. LeGrand County College (LGCC)/NY, USA. Total N of students: 50
   Students’ home country (n): Korea (18); Colombia (5); Ecuador (5); Japan (4); Peru (3); Cyprus (2); Poland (2); Taiwan (2); Thailand (2); Germany (1); Ivory Coast (1); The Philippines (1); Mexico (1)
   Students’ L1 (n): Korean (17); Spanish (13); Japanese (5); Chinese (2); Greek (2); Polish (2); Thai (2); French (1); German (1); Tagalog (1); other (1)
   Class    n of students    Age mean (SD)    TOEFL mean (SD)
   LGCC1    23               25 (7.1)         421 (44), n = 20
   LGCC2    27               27 (6.8)         415 (31), n = 20

4. Kansai Sanyo University (KSU)/Japan. Total N of students: 63
   Students’ home country (n): Japan (60)
   Students’ L1 (n): Japanese (60)
   Class    n of students    Age mean (SD)    TOEFL mean (SD)
   KSU1     14               18 (.3)          383 (29), n = 13
   KSU2     35               18 (.6)          377 (31), n = 30
   KSU3     14               18 (.4)          379 (24), n = 11

5. Korn Kahn University (KKU)/Thailand. Total N of students: 33
   Students’ home country (n): Thailand (33)
   Students’ L1 (n): Thai (33)
   Class    n of students    Age mean (SD)    TOEFL mean (SD)
   KKU1     21               18 (.7)          398 (38), n = 21
   KKU2     12               20 (.7)          433 (70), n = 12

6. Instituto Christobal Norte (ICN)/Chile. Total N of students: 4
   Students’ home country (n): Chile (4)
   Students’ L1 (n): Spanish (4)
   Class    n of students    Age mean (SD)    TOEFL mean (SD)
   ICN      4                30 (14)          N/A

The 63 students at UCC were enrolled in three classes in the intensive English as a second language (ESL) program offered at the community college. Many of the UCC students were adult immigrants who had full-time jobs and supported their families. The eight full-time students at NAU were enrolled in a university’s intensive English program taking a content-based course modeled after typical North American university classes. The 50 students at LGCC were enrolled in two classes in an intensive English program, part of the Adult and Continuing Education division of the community college. Both classes were taught by the same teacher. The students were enrolled in Level 4, the last level of fluency building before college preparation began.

The 63 students at KSU (Japan) were enrolled in three classes in the English as a foreign language (EFL) course offered by the university’s English Department. They were non-English majors, taking the EFL course to fulfill their undergraduate foreign language requirement and assigned to one of three classes based on their years of study at the university. The 33 students at KKU (Thailand) were enrolled in two classes offered by the university’s Department of Education for an undergraduate TESOL teacher training program. The four students at ICN (Chile) were enrolled in an optional extra-curricular class at a beginning to low-intermediate level offered by the language school.

At each institution, teachers were asked to examine the scope and sequence of each LEI level to select the one that they felt was most appropriate for their class.
Their choices are shown in Table 4.

Table 4
Classes, LEI levels, and English levels.

Country          Class    LEI Level    School Level           Average TOEFL
United States    UCC1     3            Intermediate           469
                 UCC2     4            Advanced               484
                 UCC3     4            High Intermediate      418
                 NAU      3            High Intermediate      440
                 LGCC1    3            Intermediate           421
                 LGCC2    3            Intermediate           415
Japan            KSU1     1            Beginner               383
                 KSU2     1            Beginner               377
                 KSU3     2            Beginner-Year 2        379
Thailand         KKU1     3            Intermediate           398
                 KKU2     4            Intermediate-Year 2    433
Chile            ICN      2            N/A                    N/A

4.3. Instruments

A survey, called Unit Reflections, was developed for the study and administered to the students at different points in time. Steps were taken to design the survey so that students’ reactions to the CALL material would be logically linked to the six criteria for CALL evaluation. Questions were generated for each of the six criteria, and those for the Unit Reflections ultimately used in the study are shown in Table 5. Students answered separately for listening, vocabulary, speaking, grammar, pronunciation, reading, and (sometimes) quiz, so that one question could have six or seven items. The survey consisted of selected-response items, mainly on a 4-point Likert scale. The online versions were created using Flashlight Online, a Web-based system for developing questionnaires, administering them online, and downloading responses for subsequent analyses.

Table 5
Linking questions in Unit Reflections to criteria.

Language learning potential
  Which of these parts did you work on in Unit ___? (Format: Likert; # of items: 7)
  For [grammar, vocabulary, pronunciation], how much did you understand? (Format: Likert; # of items: 3)
  Overall, how much do you feel you learned from Unit ___? (Format: Likert; # of items: 6)
  How often did you use the following tools? (e.g., transcripts, video control, grammar/dictionary help, culture notes) (Format: Likert; # of items: 7)

Learner fit
  How interesting were the activities in the following parts? (Format: Likert; # of items: 2)
  For each part that you worked on, how much did you already know? (Format: Likert; # of items: 4)
  How difficult were the activities in Unit ___ that you worked on? (Format: Likert; # of items: 6)

Meaning focus
  How well did you understand [listening, reading, speaking]? (Format: Likert; # of items: 3)

Authenticity
  How much were you able to use what you had learned in the LEI lessons while outside of the classroom? (Format: Likert; # of items: 6)
  For each part you worked on, how often outside of the classroom did you see or hear what you studied? (Format: Likert; # of items: 6)

Impact
  How much did you enjoy the following practice activities? (Format: Likert; # of items: 6)

Practicality
  How easy to understand were the instructions and explanations for the activities you did in Unit __? (Format: Likert; # of items: 1)
  Did you have any trouble with the computer or LEI in Unit __? (Format: Yes/No; # of items: 1)

4.4. Procedure

The data from the participating schools were collected between January and September, 2005. After teachers agreed to participate and chose the LEI level appropriate for their students, they had the software installed in a computer lab with technical support from the publisher. (At the time of the study, LEI was distributed through local area networks which did not provide easy access to online records, in contrast to its Internet capabilities at the time of this writing.) The teachers were asked to start using LEI with their students at the beginning of a new term. Teachers explained the study to their students and asked for their participation. The students who agreed to participate in the study filled out a brief questionnaire (which had been approved by the Institutional Review Board as a legitimate substitution for a written consent document) and submitted it electronically. The institutional TOEFL (which was contributed to the project by Educational Testing Service) was administered to the students between the beginning and the middle of the term.

Over the term, students were asked to use LEI 3–5 h per week, in and outside of class. Every time students finished one unit, they were asked to complete and submit a Unit Reflection. Except for one site, students proceeded through the units at their own pace, with occasional guidance from teachers regarding their pace. Some students, therefore, completed more units than others. At KSU (Japan), students proceeded through the units uniformly due mainly to curricular constraints.

4.5. Analysis

The following steps were taken to analyze the responses to the Unit Reflections. First, the students’ responses were checked to be sure to include only those data pertaining to sections of the materials that the learners had actually worked on. At the beginning of the survey, students were asked to indicate which parts of the unit they had worked on. If a student provided answers about a part (e.g., Listening) previously indicated as not having been worked on, any subsequent responses were recoded as missing values.
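The first analysis step, recoding answers about parts a student did not report working on as missing values, could be sketched as below. This is only an illustration under stated assumptions: the function name `mask_unworked_parts`, the dictionary layout, and the use of `None` for a missing value are hypothetical and are not taken from the study's actual data files.

```python
from typing import Dict, Optional, Set


def mask_unworked_parts(
    worked_on: Set[str],
    responses: Dict[str, Optional[int]],
) -> Dict[str, Optional[int]]:
    """Return a copy of `responses` in which any answer about a part the
    student did not report working on is recoded as missing (None).

    `worked_on` holds the parts ticked at the start of the survey;
    `responses` maps a part name to a 1-4 Likert answer (hypothetical layout).
    """
    return {
        part: (value if part in worked_on else None)
        for part, value in responses.items()
    }


# A student who reported working only on listening and grammar
# but also answered a question about reading.
worked = {"listening", "grammar"}
answers = {"listening": 3, "grammar": 4, "reading": 2}
cleaned = mask_unworked_parts(worked, answers)
```

Keeping the masked entries as explicit missing values, rather than deleting them, leaves every student with the same set of item keys for the later averaging step.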
Second, with a small subset of items, the scores based on the original Likert scales were recoded so that they would reflect the pre-defined desirability of the responses. Third, individual students’ responses to each Unit Reflection item were averaged based on the number of units completed. Fourth, these item-based average scores were grouped according to the six criteria, and internal consistency reliability was computed for each criterion. For example, after the third step each student had an average score for each item. Then, these average item scores were grouped according to the six criteria. Specifically for language learning potential, there were 23 different items. These 23 items needed to provide similar information if they were to form a scale, and so internal consistency reliability was estimated using Cronbach’s alpha. This procedure was followed for all items grouped by each of the six criteria. At this stage, items that did not contribute to the reliability of their criterion measure were dropped from further analysis.

Fifth, individual students’ item-based scores for each criterion were summed, which yielded six criterion-based Unit Reflection scores for the students. Continuing with the example, once a group of items, such as the 23 for language learning potential, was found to form a reliable scale, then each student’s scores on those 23 items were added up to give that student a “language learning potential score.” The possible range of scores was from 23 to 92; in other words, the lowest value was 23 × 1, or 23, and the highest value was 23 × 4, or 92.

Next, for each class descriptive statistics for the six criteria were computed to find the mean and standard deviation for each set of scores on the six qualities. This resulted in an average score for each class for each of the six criteria. The number of items used to assess the six criteria was different; language learning potential had 23 items, so scores ranged from 23 to 92; meaning focus had 3 items, ranging from 3 to 12; learner fit and authenticity had 12 items, ranging from 12 to 48; positive impact had 6 items, ranging from 6 to 24. (Because reliability could not be established for practicality, its two items were treated separately, one on a 1–4 scale, and the other on a 1–2 scale.)

In order to interpret the scores for each class across these various ranges, a procedure was needed. Almost all of the items (except for one practicality item) used a 4-point scale in which 4 reflected the best evidence for appropriate use of CALL, and 1 reflected the worst evidence. We decided that a class’ average score at or above 3.5 would provide excellent evidence for a criterion.
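The averaging, reliability, and summing steps described in this section (the third through fifth steps) can be illustrated with a minimal sketch. It assumes the standard Cronbach's alpha formula with sample variances, alpha = k/(k − 1) × (1 − Σ item variances / variance of total scores); the function names and the toy data are hypothetical, and the sketch does not reproduce the item-dropping decisions made in the study.

```python
from statistics import mean, variance
from typing import List, Optional, Sequence


def item_averages(units: Sequence[Sequence[Optional[int]]]) -> List[float]:
    """Average each item over the units a student completed, skipping
    missing answers (None). Assumes each item was answered at least once."""
    n_items = len(units[0])
    averages = []
    for i in range(n_items):
        answered = [unit[i] for unit in units if unit[i] is not None]
        averages.append(mean(answered))
    return averages


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for scores[student][item].

    Needs at least two students and two items, and total scores that vary."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def criterion_score(averages: Sequence[float]) -> float:
    """Sum a student's item averages into one criterion-based score."""
    return sum(averages)


# Toy data: three students, three items, one or two completed units each.
per_student = [
    item_averages([[1, 1, 1], [1, 1, 1]]),
    item_averages([[2, 2, None], [2, 2, 2]]),
    item_averages([[4, 4, 4]]),
]
alpha = cronbach_alpha(per_student)
totals = [criterion_score(row) for row in per_student]
```

With k items on a 1–4 scale, each value returned by `criterion_score` falls between k × 1 and k × 4, which matches the 23 to 92 range given above for the 23 language learning potential items.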
For example, for Language Learning Potential with 23 items, a class score of 80 or above (23  3.5), was considered “excellent.” For Meaning Focus with 3 items, a class score of 10.5 or above was considered “excellent.” For the rating of “good,” the top score was the cut-off for excellent and the bottom score was set by multiplying the number of items by 2.5; for the rating of “weak,” the number of items was multiplied by 1.5 to set the lower limit; anything below the lower limit of weak was considered “poor.” The evaluative labels and their respective score ranges are shown in Table 6. In order to address the overall appropriateness of LEI for each of the classes in the study, the evaluative ratings by the students for each criterion were examined. To determine the degree of similarity across classes, the statistical significance of mean differences across groups was tested. In separate analyses, each of the six criterion-based scores was the dependent variable; students’ classes were used as the independent variable. One-way Analyses of Variance (ANOVA) were run to test the significance of group differences when the underlying assumptions for this parametric procedure were met; if assumptions were not met, then the non-parametric KruskaleWallis test was used. Post-hoc analyses were carried out when the initial ANOVA/KruskaleWallis indicated significant between-group difference(s), in order to specify which groups differed from each other. A significance level of .05 was set for the study, but it was corrected in view of the number of comparisons made by using the Bonferroni adjustment. For example, if a post-hoc comparison was made among 12 classes, the significance level was .05/12 ¼ .004 to diminish the possibility of Type I error. 5. Results The data allowed for estimation of a level of support for each of the qualities, the overall appropriateness of the materials for each class, and the statistically significant differences among classes. 
Overall, support was good for the six criteria, though stronger for some than others, as results were not entirely uniform across classes.

Table 6
Interpreting scores on unit reflections.

| Criteria | N items | N of scale points per item | Excellent (≥3.5) | Good (3.45–2.5) | Weak (2.45–1.5) | Poor (<1.5) |
|---|---|---|---|---|---|---|
| Language learning potential | 23 | 4 | 80 or above | 58–79 | 35–57 | Less than 35 |
| Meaning focus | 3 | 4 | 10.5 or above | 7.5–10.4 | 4.5–7.4 | Less than 4.5 |
| Learner fit | 12 | 4 | 42 or above | 30–41 | 18–29 | Less than 18 |
| Authenticity | 12 | 4 | 42 or above | 30–41 | 18–29 | Less than 18 |
| Impact | 6 | 4 | 21 or above | 15–20 | 9–15 | Less than 9 |
| Practicality^a | 1 | 2 | 1.75 or above | 1.5–1.75 | 1.25–1.4 | Less than 1.25 |
| | 1 | 4 | 3.5 or above | 2.5–3.4 | 1.5–2.4 | Less than 1.5 |

^a The two items for practicality were conceptually distinct and did not correlate, so they were not combined to form a scale.
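The cut-off rule behind Table 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sketch covers summed scales of 4-point items only (the 2-point practicality item has its own limits in Table 6), and it uses the unrounded products, whereas Table 6 reports rounded limits (for example, 80 rather than 80.5), so borderline means could be labeled differently.

```python
def cutoffs(n_items):
    """Lower limits for 'excellent', 'good', and 'weak' on a summed scale.

    Assumes every item uses the 4-point scale described in the text.
    """
    return n_items * 3.5, n_items * 2.5, n_items * 1.5


def rate(class_mean, n_items):
    """Map a class mean on a summed scale to an evaluative label."""
    excellent, good, weak = cutoffs(n_items)
    if class_mean >= excellent:
        return "excellent"
    if class_mean >= good:
        return "good"
    if class_mean >= weak:
        return "weak"
    return "poor"


# Meaning focus has 3 items, so the limits are 10.5, 7.5, and 4.5.
print(cutoffs(3))
# A class mean of 16.77 on the 12-item authenticity scale.
print(rate(16.77, 12))
```

For instance, `rate(16.77, 12)` returns `"poor"`, which agrees with the rating reported for the JAP1 authenticity mean in Table 7.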


Table 7
Mean, standard deviation, reliability, and evaluation from unit reflections.

| Class | 1 UCC1 | 2 UCC2 | 3 UCC3 | 4 NAU | 5 LAG1 | 6 LAG2 | 7 JAP1 | 8 JAP2 | 9 JAP3 | 10 THAI1 | 11 THAI2 | 12 CHILE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Teacher | A | B | C | D | E | E | F | G | H | I | I | J |
| N (valid / missing) | 22 / 0 | 17 / 1 | 21 / 2 | 8 / 0 | 23 / 0 | 25 / 2 | 14 / 0 | 33 / 2 | 12 / 2 | 20 / 1 | 11 / 1 | 4 / 0 |
| Language learning potential | 68.31 (6.76) .77 G | 63.45 (15.96) .90 G | 67.67 (10.92) .75 G | 70.17 (5.88) .88 G | 66.65 (8.38) .92 G | 68.02 (12.29) .89 G | 62.13 (4.37) .74 G | 56.50 (10.25) .93 W | 51.11 (9.35) .96 W | 74.79 (6.29) .86 G | 68.84 (8.93) .70 G | 71.78 (4.24) .51 G |
| Meaning focus | 9.40 (1.70) .75 G | 8.86 (2.77) .72 G | 9.39 (1.72) .85 G | 9.68 (1.14) .85 G | 9.60 (1.18) .86 G | 9.41 (1.51) .86 G | 9.28 (2.12) .99 G | 7.87 (1.39) .91 G | 7.39 (1.26) .80 W | 9.65 (1.18) .71 G | 10.17 (1.27) .69 G | 9.55 (2.29) .99 G |
| Learner fit | 32.47 (6.29) .76 G | 25.23 (9.93) .80 W | 33.08 (7.70) .85 G | 38.17 (2.08) .21 G | 35.39 (2.84) .44 G | 36.90 (3.40) .58 G | 34.18 (7.29) .91 G | 37.72 (3.11) .71 G | 38.16 (1.80) .39 G | 37.15 (3.73) .65 G | 35.42 (6.28) .68 G | 35.29 (4.82) .76 G |
| Authenticity | 33.31 (7.23) .86 G | 32.15 (11.34) .88 G | 35.38 (7.23) .97 G | 33.37 (4.16) .94 G | 30.97 (6.61) .97 G | 31.78 (6.65) .96 G | 16.77 (6.20) .99 P | 26.25 (4.92) .97 W | 20.49 (6.76) .98 W | 36.91 (5.09) .85 G | 38.14 (6.44) .78 G | 35.03 (4.62) .96 G |
| Positive impact | 20.19 (2.51) .85 G | 17.57 (6.18) .84 G | 19.98 (3.78) .93 G | 19.10 (4.36) .98 G | 18.45 (2.90) .90 G | 19.36 (3.07) .90 G | 18.05 (2.85) .98 G | 16.63 (2.72) .96 G | 14.47 (2.47) .90 W | 21.60 (2.23) .87 E | 19.16 (3.01) .62 G | 22.65 (1.57) .89 E |
| Practicality: problems | 1.70 (.39) G | 1.97 (.12) E | 1.94 (.17) E | 1.72 (.26) G | 1.77 (.36) E | 1.87 (.26) E | 1.90 (.20) E | 1.92 (.15) E | 1.42 (.30) W | 1.39 (.40) W | 1.30 (.34) W | 1.57 (.51) G |
| Practicality: instructions | 3.63 (.47) E | 3.55 (.50) E | 3.29 (.50) G | 3.36 (.39) G | 3.14 (.49) G | 3.21 (.57) G | 2.72 (.60) G | 2.46 (.51) W | 2.36 (.41) W | 3.08 (.45) G | 3.43 (.43) G | 3.39 (.54) G |

Note. Each cell gives the mean, the standard deviation (in parentheses), Cronbach’s alpha (for the five multi-item scales), and the evaluative judgment. Evaluative judgment: E = excellent; G = good; W = weak; P = poor.
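Table 7 gives the class means that were subsequently compared across classes. As a rough sketch of two ingredients of that comparison, the Bonferroni-adjusted significance level and the Kruskal-Wallis H statistic, the Python code below uses only the standard library. The function names and the example scores are hypothetical; the code omits the tie correction and the p-value (which would come from a chi-square distribution with k − 1 degrees of freedom for k groups), and it is not a reconstruction of the authors’ actual analysis.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Significance level after dividing by the number of comparisons."""
    return alpha / n_comparisons


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for lists of scores."""
    pooled = sorted(score for group in groups for score in group)
    n = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # Sum over groups of (rank sum squared) / (group size).
    weighted = sum(
        sum(rank[score] for score in group) ** 2 / len(group)
        for group in groups
    )
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)


# Adjusted level for a comparison among 12 classes, as in the text.
print(round(bonferroni_alpha(0.05, 12), 3))

# Hypothetical summed scores for three small classes (not study data).
classes = [[61, 64, 70], [55, 58, 52], [72, 75, 69]]
print(kruskal_wallis_h(classes))
```

In practice a statistics package would normally supply both the tie-corrected statistic and its p-value; the hand-rolled version is shown only to make the rank-based logic visible.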

5.1. Evaluation of qualities by each class

The statistics on the Unit Reflections for each of the classes, provided for each of the six criteria, are presented in Table 7. The 12 classes are listed across the top, identified by both number and abbreviation (e.g., UCC1). The capital letter under each abbreviation indicates the teacher (for example, the same teacher, “E,” taught classes LAG1 and LAG2). The first row of the table lists the number of students (both those who answered questions, labeled valid, and those who did not, labeled missing). This is followed by statistical information including the class’s average score, the standard deviation, Cronbach’s alpha estimate of reliability, and an evaluative judgment. The mean score for each class was the statistic used to make the evaluative judgment.

5.1.1. Language learning potential

The student data for language learning potential consisted of the results from the Unit Reflections. Reading across the language learning potential row in Table 7, most of the classes’ mean scores were interpreted as “good,” falling between 60 and 75. The exceptions were two classes in Japan (JAP2 and JAP3), whose scores for language learning potential indicated a “weak” level. In fact, the scores for all three of the Japanese classes were lower than those for the other classes.

5.1.2. Meaning focus

Table 7 also shows the descriptive statistics for each of the classes on the questions used to assess meaning focus. The class means ranged from 7.39 to 10.17. The same two classes in Japan, JAP2 and JAP3, whose mean scores on language learning potential were interpreted as “weak,” also had the lowest mean scores for meaning focus. The mean score for JAP3 fell into the “weak” range, whereas all others were “good.”

5.1.3. Learner fit

Most of the classes’ mean scores fell close to 35, which was interpreted as “good” learner fit. As shown in Table 7, the one exception was the UCC2 class in the USA, with a mean of 25.23, which was interpreted as “weak” learner fit for that class.


5.1.4. Authenticity

Except for the classes in Japan, the mean scores for all groups fell in the 30s, which was interpreted as “good.” The scores of the three classes in Japan were lower than the others. The mean score of 16.77 for JAP1 was interpreted as “poor” authenticity, and the mean scores of 26.25 for JAP2 and 20.49 for JAP3 were interpreted as “weak.”

5.1.5. Positive impact

The mean score for most of the classes for impact was between 17 and 20, interpreted as “good.” Two classes with higher means, THAI1 (21.60) and CHILE (22.65), were both interpreted as “excellent.” One class in Japan, JAP3, had a low mean of 14.47, interpreted as “weak.”

5.1.6. Practicality

Table 7 also summarizes the students’ responses to the question of whether or not they had difficulty in working with the computer. A response indicating no problems received a score of 2, and it is therefore evident that the majority of the students reported few problems. The exceptions to this overall trend were the JAP3 class in Japan and the two classes in Thailand. Results concerning the other aspect of practicality, ease of comprehension of the instructions, are given in the last row of Table 7. A score of 4 indicates that instructions were very easy to understand, so the mean scores in the 3s that appeared for most of the classes were interpreted as “good.” The exceptions to the “good” comprehension of the instructions were the classes in Japan.

5.2. Evaluation of overall appropriateness of LEI by class

The perspectives of students in the individual classes provide a means of examining the appropriateness of LEI in each context. Reading down the columns in Table 7, students in five of the classes in the USA (UCC1, UCC3, NAU, LAG1, LAG2) rated the LEI materials as “good” on all of the criteria except practicality, which they rated as either “good” or “excellent.” The other class in the USA, UCC2, gave similar ratings except for learner fit, which students thought was “weak.”

Students in the classes outside the USA rated the materials somewhat differently. In general, students in class JAP1 gave “good” ratings to each criterion except for authenticity, which they rated as “poor.” Students in class JAP2 gave “weak” ratings for language learning potential, authenticity, and instructions, and “good” ratings for meaning focus, learner fit, and positive impact. Students in class JAP3 gave the lowest ratings; they rated every criterion as “weak,” except for learner fit. Students in both Thai classes gave the same ratings, except for positive impact; students in class THAI1 thought the materials had an “excellent” impact, whereas those in class THAI2 thought it was “good.” Both classes had technical problems, as indicated by their “weak” ratings on the practicality item about problems with the computer. On the other criteria, the two Thai classes gave “good” ratings. Finally, the Chilean participants rated positive impact as “excellent” and all of the other criteria as “good.”

5.3. Statistical comparisons of criteria across classes

Classes’ average scores for each criterion were compared statistically. The results in Table 8 indicate the cases where differences were statistically significant. Looking at language learning potential, for example, class JAP1 gave significantly lower ratings than class THAI2; class JAP2 gave significantly lower ratings than classes UCC1, UCC3, NAU, LAG2, THAI1, and THAI2; and class JAP3 gave significantly lower ratings than classes UCC1, UCC3, NAU, LAG2, and THAI1. The table illustrates the fact that the students in Japan often rated the LEI materials lower than other students.

Table 8
Summary of differences among scores on each scale of the unit reflections.

| Criteria | Classes with statistically significantly different means (p < .000) | Classes with which differences were statistically significant |
|---|---|---|
| Language learning potential | JAP1 lower than | THAI2 |
| | JAP2 lower than | UCC1, UCC3, NAU, LAG2, THAI1, THAI2 |
| | JAP3 lower than | UCC1, UCC3, NAU, LAG2, THAI1 |
| Meaning focus | JAP2 lower than | LAG1, LAG2, THAI1, THAI2 |
| | JAP3 lower than | LAG1, LAG2, NAU, THAI1, THAI2 |
| Learner fit | UCC2 lower than | LAG2, JAP2, JAP3, THAI1 |
| Authenticity | JAP1 lower than | UCC1, UCC2, UCC3, NAU, LAG1, LAG2, JAP2, THAI, THAI2 |
| | JAP2 lower than | UCC1, UCC3, THAI, THAI2 |
| | JAP3 lower than | UCC1, UCC2, UCC3, NAU, LAG1, LAG2, THAI, THAI2 |
| Positive impact | JAP2 lower than | UCC1, UCC3, THAI1 |
| | JAP3 lower than | UCC1, UCC3, LAG1, LAG2, THAI1, THAI2 |
| | THAI1 higher than | UCC2, JAP2, JAP3 |
| Practicality | JAP1 lower than | UCC1, UCC2, THAI2 |
| | JAP2 and JAP3 lower than | UCC1, UCC2, UCC3, NAU, LAG1, LAG2, THAI1, THAI2 |
| | THAI lower than | UCC1 |
| | JAP3, THAI1, THAI2 lower than | UCC1, UCC2, UCC3, LAG1, LAG2, JAP1, JAP2 |

6. Discussion

In general, for all but two Japanese classes, the students felt that the LEI materials had good language learning potential and adequately directed learners’ attention toward the meaning of the language they were exposed to. All students except for those in one class in Japan evaluated LEI as having good meaning focus. One might raise questions about the way in which the qualities were defined as well as the need to make claims about these versus other aspects of the materials. For example, meaning focus, a critical construct for current approaches to second language acquisition, was operationalized partly through questions asking students about their understanding and recall of the
major events of the video story and reading. Like many constructed response questions, these were difficult to evaluate in view of the range of unpredictable responses and were dropped from the analysis. Additional questions, more focused on specific aspects of the story, should be added to increase the reliability of this measure. Still, these findings do provide evidence that authors and publishers were able to develop materials that captured important aspects of learning materials across a wide range of contexts, except for two classes in Japan.

Did LEI fit the learners well in terms of the level of language, instructional design, and content of materials? All students except for those in one class at UCC evaluated LEI as having good learner fit. This provides evidence that the series as a whole provided the range of materials required to fit students in both intensive programs in the US and EFL classes in other countries. The class at UCC was the most advanced of all the classes in the study. Even though they worked on the most advanced version of LEI, the students felt that LEI 4 was not sufficiently challenging. The results would support a claim that LEI is appropriate for beginning through high intermediate learners in terms of the level of language, instructional design, and content of materials. This claim held for students across contexts.

Was there a good correspondence between the language of the learning activity in LEI and that of the target language activities of interest to learners out of the classroom? The data from students provided either good or in a few cases excellent support for this claim in all classes in the USA. Data from students outside the USA provided evidence ranging from poor to good authenticity. The claim about LEI supported by these data would be that for students studying English in the USA and in some cases outside the USA, good correspondence exists between the language of the learning activity in LEI and that of the target language activities of interest to learners out of the classroom.

Authenticity was defined as the degree of correspondence between the learning activity and target language activities of interest to learners out of the classroom. This definition of authenticity as a construct that is relative to a particular target language context reflects current perspectives in applied linguistics, replacing the hegemonic conception linking authenticity to native speaker discourse or any categorically defined context. This definition was straightforward to operationalize with questions for students about the degree of similarity between the language of LEI and what they are exposed to out of the classroom. At the same time, the definition poses a challenge for authenticity in an EFL context. Learners have access to English outside of class through the Internet and some other sources, but these are idiosyncratically accessed. How can a current definition of authenticity be operationalized for materials evaluation? Authenticity is a term that is used regularly in materials evaluation, and therefore an operational definition (or definitions) that makes sense in both ESL and EFL contexts seems a worthy goal. In fact, if authenticity cannot be defined operationally, what does it mean to claim that CALL materials are authentic?


Did LEI have a positive impact on learners? Results supported the claim of positive impact in all cases except for the three classes in Japan. Regarding practicality, students outside the USA reported problems to a greater degree than students in the USA.

The LEI materials appeared to be appropriate for all but two classes, both of which used LEI 1 and 2 and were in Japan. The trend of negativity should be cause for a careful look at what might be wrong with the match between learners and these materials in classes in Japan. These Japanese classes differed from the others in the study in several ways. They had the lowest TOEFL scores of any in the study. The students in the Japanese classes were the youngest participants, and they were taking an English course to fulfill their undergraduate foreign language requirement rather than as a major area of interest (students in THAI1 were the same age, but they were English majors). On the basis of the results for the individual criteria, in every case except learner fit, one or more of the Japanese classes did not respond favorably. In sum, although positive evidence for the appropriateness of LEI was found in most of the classes, it was not found in all.

The statistical significance of the mean differences across groups underscores the results reported above. Two of the classes in Japan (JAP2 and JAP3) had significantly lower scores than many of the other classes regarding language learning potential, meaning focus, learner fit, authenticity, positive impact, and understanding instructions. Potential reasons for these differences include the students’ age, proficiency level, and reasons for learning English, and the specific LEI materials for Levels 1 and 2, in addition to culture. The design of the research did not allow for isolating which one or more factors came into play in the results. However, what is important here is the fact that there were quantifiably discernible differences among the different groups. The overall good scores of LEI cannot be generalized uncritically.

7. Conclusion

This study demonstrated the viability of a survey to provide evidence for CALL appropriateness in different contexts, which could then be compared. It revealed the extent to which LEI multimedia language learning materials were appropriate for ESL and EFL learners across a variety of classes. Overall, the research provided evidence relevant to the claims about LEI’s language learning potential, meaning focus, learner fit, authenticity, positive impact, and practicality. Results were consistent across classes except for two in Japan, in which the results were mixed to negative. Thus, the more supportable positive claims to make about LEI would be claims pertaining to LEI 3 and LEI 4, not LEI 1 and LEI 2, which were used in Japan. This finding, in addition to the positive findings from other classes outside the USA, suggests that multimedia materials created in an English-speaking context may be appropriate more broadly, but this is a proposition that needs to be evaluated rather than assumed.

While the research provided some informative results pertaining to LEI, it also addressed some important issues regarding how one obtains empirical evidence concerning qualities of learning materials of interest to applied linguists. We relied on quantitative data gathered from multiple administrations of a survey that operationalized professional knowledge, in order to examine the potential benefits of using such a survey in different contexts to evaluate the appropriateness of CALL materials. The administration of a survey on different occasions, in different classes, taught by different teachers, in different cities and countries yielded fairly reliable results and made it practical to conduct a multi-site, multi-country evaluation. Such self-report data can be used together with other data such as records of learners’ actual use of the materials, their knowledge gained from the materials, and how it transfers to other settings. Such data would all contribute to the basic framework of the criteria for CALL evaluation but could not be included in this study because of the limited funding for the project.

Language materials evaluation represents one example of research in applied linguistics which is concerned with the study of context-embedded language-related issues. In this study we did not assume context specificity for the evaluation of LEI, but rather we tested the possibility that LEI might be evaluated differently in different classes. The results showed some differences, but the predominant finding was similar evaluations across contexts, particularly pertaining to LEI 3 and 4. These findings are positive for publishers who develop materials intended to have a broad reach, but they also underscore the need for empirical evaluation to identify settings where customization needs to be considered. When the goal is to inform prospective users and those responsible for subsequent editions and customization, empirically demonstrated quality of such materials across a relevant sample of contexts should be regarded as indispensable. The judgmental analysis the profession is accustomed to reading in journals
should provide only a starting point for such evaluation, which needs to be conducted in a manner that produces defensible results.

Acknowledgements

Support for this project was provided by a grant from TIRF (The International Research Foundation for English Language Education), as well as by materials from Educational Testing Service and Pearson Longman. We are particularly grateful to our research assistant, Reiko Komiyama, for her thorough and extensive work.