Journal of English for Academic Purposes 12 (2013) 288–298
In-house or commercial speaking tests: Evaluating strengths for EAP placement

Joan Jamieson a,*, Linxiao Wang a, Jacqueline Church b

a Department of English, Northern Arizona University, Flagstaff, AZ 86011-6032, USA
b Second Language Testing, Inc., 6135 Executive Boulevard, Rockville, MD 20852, USA
Abstract
When language program administrators consider changing a placement test, there are many issues to address. Will the scores help us place students into our curriculum? Will the scores reflect real differences in students' abilities? Will the administration of the test be feasible? This article describes one program's deliberations over keeping an in-house test or adopting a commercial test for speaking. Two speaking tests were compared according to curricular coverage, statistical distributions, and practicality. One test, PIE Speaking, was developed in-house. The other test, Versant English, was developed by Pearson Knowledge Technologies. Both covered many but not all curricular objectives. Internal consistency estimates were higher for Versant English than for PIE Speaking. The comparison of distribution patterns suggested that PIE Speaking better discriminated between mid-level students, but Versant English better discriminated between low- and high-ability students. PIE Speaking took approximately 60 staff hours, costing about $1200. Versant English took about 10 staff hours at an estimated cost of $6500. Cost weighed most heavily in the decision to keep the in-house speaking test. Modeling the steps taken to answer specific questions may provide structure for other language programs when evaluating their placement tests.

Keywords: ESL, Speaking, Assessment, Computers, Versant English
1. Introduction

Language testing can be seen as a gateway, as it checks for adequate preparedness before individuals embark on their future academic studies and ultimately on their careers. For students entering an English-medium university, tests are administered in order to gauge their language proficiency as it relates to readiness for academic classes. Students who are not ready are often placed in an English for Academic Purposes (EAP) program, also referred to as a foundation program, a university preparation program, or an intensive English program. Although all EAP programs need a way to determine students' English abilities, no single test is used by all; instead, a myriad of placement tests are in use. This article reports on one intensive English program's decision-making process when considering whether to keep an in-house speaking placement test or to replace it with a commercially available, standardized, computerized speaking test. Three factors that were particularly important for this decision (content coverage, statistical characteristics, and practicality) could serve as a frame of reference for other EAP programs faced with a similar choice between keeping or adopting a placement test of students' speaking abilities.
2. Background

2.1. Two purposes for EAP testing

Essentially, purpose answers the question: Why are we giving a test? Purpose is related to the decisions that will be made based on the test's score. It represents a fundamental consideration before selecting or developing a test, administering it, and interpreting its scores (Bachman & Palmer, 2010; Stoynoff & Chapelle, 2005). Two purposes that universities and EAP programs deal with are admission and placement of students.

For many universities in the United States, a student takes an English test such as IELTS or TOEFL iBT to determine readiness for university study long before arriving on campus; the purpose of this test is admission. A student whose score exceeds an established standard or threshold (also known as a cut score) will be admitted to the university with no or few additional language requirements (assuming, of course, that other requirements have been met). The cut score is often decided on by the consensus of teachers and administrators who feel that it represents a threshold ability level. Oftentimes tests for admission purposes are designed and developed based on a general theory or model of language ability rather than any particular course of study; they are known as proficiency tests.

A student who scores below the cut score is thought to have insufficient English ability for university study. Depending on the university's policy, the student may still arrive on campus, but will need to take English courses before beginning full-time university study. Such a student often takes an EAP test; the purpose of this test is placement in an appropriate level of intensive English instruction. Achievement tests that are designed and developed based on a set of objectives for a course of study may be used (although this is sometimes done by reference to a TOEFL or IELTS score). Achievement tests are intended to have a criterion-referenced interpretation in which each student is compared to mastery of a content domain, in this case the EAP program's objectives, rather than to other students.

Apart from students who take the IELTS or TOEFL before arriving on campus, in some cases international students arrive without a TOEFL or IELTS score. These students are directed to the EAP program, where they are required to take the EAP test. In these circumstances, the EAP test serves a dual purpose: to place students in a pre-university EAP program or to certify readiness for full-time academic standing in a university. This is our situation. While certainly not unheard of, it is not what is traditionally depicted in language testing textbooks. Instead of using two tests, one for admission and one for placement, it is important for our test both to adequately measure university-level language ability and also to match our curricular, program-wide objectives. Dual interpretation tests can be designed to spread students out, but in terms of defined learning objectives (Miller, Linn, & Gronlund, 2009). This point was described in the Standards for Educational and Psychological Testing:

Results of an educational assessment might be reported on a scale consisting of several ordered proficiency levels, defined by the types of tasks students at each level were able to perform.
That would be a criterion-referenced scale, but once the distribution of scores was reported ... individual students' scores would also convey information about their standing relative to that tested population. ... Interpretations based on cut scores may likewise be either criterion-referenced or norm-referenced. (AERA, APA, & NCME, 1999, p. 50)

An idea for a combined type of test similar to this idea of dual interpretation was described by John Clark. At a small conference at Educational Testing Service in 1982, he coined the term prochievement testing: "This approach thus combined both 'proficiency' and 'achievement' assessment elements, with at least the theoretical possibility of providing information of some degree of utility for both assessment purposes" (1989, p. 214).

In our situation, we have employed these ideas of dual interpretation and prochievement. We have designed the reading and listening sections of our EAP "placement test" by taking passages and questions from achievement tests at four different proficiency levels in the curriculum. Thus, together, performance on these items is intended to reflect a scale that consists of ordered ability levels. We have done this so that incoming international students with no standardized English proficiency score on tests such as IELTS or TOEFL iBT can either enter the university or be placed at appropriate levels in the intensive English program. One cut score is used to certify academic readiness; other cut scores allow placement into English classes with students at relatively homogeneous proficiency levels.

2.2. The speaking section

Because the number of students in our EAP program grew rapidly in recent years and our curriculum evolved in a more reactive than proactive manner, we felt the need to reexamine the speaking section of the EAP program's placement test for three main reasons. First, perhaps the speaking scores were not placing students appropriately in our language curriculum; teachers were reporting informally that the scores did not seem to discriminate among test-takers. Second, the tasks on the test might not be effective; that is, they might elicit identical aspects of speaking ability. Third, perhaps the speaking section had become impractical, taking too many resources.

Keeping these issues in mind, we began surveying how other programs were testing speaking. A review of approximately 100 EAP programs in the US listed on the Commission on English Language Program Accreditation website indicated that nearly 50% used an interview to estimate students' speaking ability, 15% used a standardized speaking test, and 35% did not include a speaking test to inform placement decisions.
Each EAP program decides whether to develop its own in-house placement test or to use a commercial test. Tests such as the IELTS and TOEFL, and more recently PTE Academic, although clearly designed as admission tests, have been used by educational institutions to facilitate placement decisions regarding learners' studying abroad or other training purposes (Saville & van Avermaet, 2005). Other commonly used tests such as the University of Michigan's English Placement Test, College Board's ACCUPLACER, and ACT's Compass English as a Second Language Placement Test contain a combination of listening comprehension, grammar, vocabulary, reading comprehension, and writing, but no speaking. Since the curricula of many EAP programs include both skills-based classes and content-based classes to prepare international students for the university, it seems reasonable to include all four skills in placement tests. The fact that the one skill most often left out is speaking may be attributed to issues of practicality.

There is a wide range of speaking tests and tasks which can provide specific information about test takers' speaking ability. "When we are assessing speaking, we guide the examinees' talk by the tasks that we give them. These outline the content and general format of the talk to be assessed and they also provide the context for it" (Luoma, 2004, p. 29). Interviews are certainly widely used, though their shortcomings have been recognized. For example, the interviewer's behavior can influence examinees' performance due to personality (Berry, 2007) and familiarity (O'Sullivan, 2002). The interviewer might also bias the interaction by leading and controlling the conversation (Lazaraton, 1996; Young & Milanovic, 1992).

What is the best way to proceed? On one hand, Brown (1989) suggested that if a program had the resources and expertise, developing placement tests within the program seemed to be the best practice. In an EAP program, the speaking portion of the placement test is an opportunity for students to demonstrate their ability to speak at length in a way that simulates an academic speaking experience in a university. On the other hand, a computerized speaking test may be able to address the time-intensive problems of administration and scoring that keep programs from including speaking in their placement test battery. Bernstein, Van Moere, and Cheng (2010) suggested two main categories of computerized tests measuring spoken language: semi-direct and automated tests. Semi-direct speaking tests use computers mainly to deliver the test and collect examinees' responses; the scoring, however, is completed by human raters following developed rubrics. Compared with semi-direct spoken language tests, fully automated speaking tests use computer technology not only for test development and administration but also for scoring test-takers' spoken responses. Examples of computerized speaking tests include the Computerized Oral Proficiency Instrument (COPI; Malabonga & Kenyon, 1999), TOEFL iBT (Zechner, Higgins, Xi, & Williamson, 2009), and Versant English (Pearson, 2011).

Fully automated computerized speaking tests have various advantages (e.g., Eyckmans, Van de Velde, Van Hout, & Boers, 2007; Qian, 2009; Van Moere, 2010). Firstly, if the test does not require on-site examiners, it can be taken at many locations by examinees from different areas and regions, which greatly improves the test's accessibility.
Secondly, a single speaking test can be delivered to a large number of test takers, which economizes test administration. Thirdly, standardized instructions and prompts improve the test's reliability and fairness.

Practitioners, however, should be cautious when introducing a computerized test to replace traditional tests for dual interpretation purposes. The score distribution must be examined to ensure separation not only between readiness and non-readiness, but also among students whose scores fall below the cut score. The match with curricular objectives also needs to be examined; this is very important for EAP programs that desire to place students into homogeneous ability levels. It is also important that scores are not adversely affected by lack of computer familiarity and that students perceive the test as fair. While it may appear simpler to administer a test via computer, there may still be administrative costs. For example, ease of accessibility needs to be balanced against reduced security, and a smaller number of test administrators needs to be balanced against an increased need for computer technicians.

2.3. Statement of the problem

In order to decide whether to continue with an in-house speaking test or to replace it with a commercial, computerized speaking test, we compiled the following list of questions relating to the curriculum, statistical distribution, and practicality that guided our decision process.

Curriculum
Does the test adequately measure university-level language ability?
Is it important to have a match between the test and curricular objectives?
Does the commercial test include all skills that your program wants to be assessed?
Do the tasks on the test correspond with your program's view of language ability?

Statistical distribution
Does the test allow for a norm-referenced interpretation?
Does the test adequately discriminate between students' language ability in all testing areas?
Does the test provide enough information to place students in a level within your program?
Does the test provide valid and reliable results to determine which students go to the university?
Practicality
Does your program want to administer the test on-site or in various locations?
Does your program want to ensure a standardized administration protocol and instructions?
Does your program have the resources needed, in terms of time and cost, to design, develop, administer, and score the test?
Does your program have the equipment and IT support needed to administer the test?

This list of questions serves as a resource for compiling the information needed to decide which test should be used in an EAP program. While all of these questions were considered by the authors, this paper discusses in depth the points that are generalizable to a larger audience and whose procedures may be informative for others facing a similar testing decision. With this aim, we investigated the following three research questions:

1. Which test's tasks (in-house test versus commercial) better reflected the speaking objectives of the curriculum?
2. Which test's scores (in-house test versus commercial) better distinguished among students at different proficiency levels?
3. Which test (in-house test versus commercial) was more practical?

3. Methods

This descriptive study compared performance on two speaking tests in order to determine whether one was more appropriate than the other in a particular setting. In this section, the participants are described first, followed by an explanation of the two speaking tests and the questionnaire investigating the users' attitudes towards the computerized test.

3.1. Participants

A convenience sample of students who were enrolled in an EAP program at a university in the southwestern United States was used. One group of students (n = 81) took the tests in December, 2011; a second group (n = 43) took the tests a few weeks later in January, 2012. In total, 124 EAP students participated in the study: 103 men and 21 women. The native language of the majority of the students was Arabic (84%); other language groups included Chinese and Korean, together accounting for the remaining 16% of the participants. All participants signed an informed consent document, which was available in Arabic, Chinese, and English.

3.2. Measures

Three measures were used in the study: an in-house speaking test (hereafter referred to as PIE Speaking), a commercial speaking test (Versant English), and a questionnaire.

PIE Speaking consisted of two tasks, an independent speaking task and an integrated speaking task. These tasks were similar to those on the TOEFL iBT and were intended to reveal how students transmitted and demonstrated knowledge in an academic context. It was expected that the independent task would elicit evidence for intelligibility and fluency, and that the integrated task (that is, listening and then speaking) would elicit evidence not only for intelligibility and fluency, but also for content, coherence, and organization (Butler, Eignor, Jones, McNamara, & Suomi, 2000; Jamieson, Eignor, Grabe, & Kunnan, 2008). The independent task had students prepare for 1 min and then respond for 1 min to a familiar prompt such as, "For vacation, would you rather travel to a different country or stay in your home country?" The integrated task had students listen to two short lectures on a topic such as popular culture. They were then given 2 min to prepare and 1 min to respond to a prompt, "Use the information from the two listenings about popular culture (movies, news, music) and summarize the main points.
Then explain how popular culture influences your daily life. Give specific details and reasons to support your answer." Each student recorded his or her response to the prompt through a microphone in a computer lab; as such, one may consider this a semi-direct test. Sound files were then copied to a shared folder. Two raters were assigned to score each response, using the TOEFL iBT speaking rubrics. The two assigned raters gave a holistic score on each task, which ranged from 0 to 4, based on impressions of delivery, language use, and topic development. (The rubrics are available online: http://www.ets.org/Media/Tests/TOEFL/pdf/Speaking_Rubrics.pdf) The two raters' scores on both tasks were summed, yielding a score from 0 to 16; the summed score was then multiplied by 1.875 to yield a score which ranged from 0 to 30, putting the speaking score on the same 0–30 range as the other three scores on the placement test (i.e., reading, writing, and listening).

The raters were EAP teachers in the program. They had been teaching listening and speaking classes in the program and generally had rich experience rating speaking tasks in the program. All raters participated in a 1-h training session in which they became familiar with the rubrics and their usage by listening to and rating benchmark samples at each score level. If raters gave scores that differed by more than two points, they listened again and discussed their ratings, which usually resulted in one rater changing a score. If agreement could not be reached, an outside expert was brought in to mediate.

Three types of correlations were run to examine the reliability and validity of the in-house speaking test. First, inter-rater reliability was estimated by computing Pearson correlations between the two raters; due in part to the raters' discussions and the calibration in advance, the correlations were quite high. In December, 2011, the correlation between raters was .93 for the
independent task and .95 for the integrated task; in January, 2012, the correlation between raters was .96 for the independent task and .96 for the integrated task. Second, internal consistency was estimated for each administration using Cronbach's alpha (December, .75; January, .77). This form of reliability is used to determine whether the items or tasks fit well enough together that we are justified in reporting one score. In this case, because the speaking score would be used with other section scores to report a total score, we can say the items had adequate internal consistency (Iowa Technical Adequacy Project, 2003). Third, the internal structure of the speaking measure was examined to inform our theoretical questions, as we wondered whether the two tasks were measuring the same underlying abilities (e.g., Campbell & Fiske, 1959; Enright, Bridgeman, Eignor, Lee, & Powers, 2008; Hatch & Lazaraton, 1991; Huff et al., 2008; Lehman, 1988; Pae, 2012). The disattenuated correlation between the two speaking tasks was .69 in December and .58 in January (the relationship between the scores having been adjusted for inconsistency due to lack of inter-rater reliability; Bachman, 2004). These moderate correlations provided evidence to support the notion that the independent and integrated tasks tapped into somewhat different speaking abilities.

The commercial test chosen for this study was Versant English. Versant English, originally known as PhonePass and later known as Set-10, was chosen because it is easily available via the Internet and has a well-documented 15-year history in testing spoken English (e.g., Bernstein, DeJong, Pisoni, & Townsend, 2000; Townsend, Bernstein, Todic, & Warren, 1998). Versant English measures "facility in a spoken language," which was defined as "the ability to understand spoken language on everyday topics and to respond appropriately at a native-like conversational pace in intelligible English" (Pearson, 2011, p. 8). This construct reflects both receptive and productive aspects of spoken English ability in a natural context of spoken conversation. As Bernstein and Cheng (2007) claimed, a person participating in conversations needs to process what is being said and to formulate appropriate responses to what he or she heard and understood.

Versant English is a fully automated speaking test. Although the test can be administered over the telephone or on a computer, in the current study students took Versant English on computers. In general, the test took about 18 min for each test-taker to complete. Students listened to the recorded instructions, which were also displayed on the computer screen, and spoke their responses into a microphone. The test was automatically scored in about 30 s. Versant English includes six sections: (a) reading, (b) repetition, (c) short questions and answers, (d) sentence building, (e) story retelling, and (f) open questions. In total, 57 out of 63 items in Versant English are automatically scored (four practice items and the two open questions are not included in the score). Responses are scored based on content and manner-of-speaking; the two are weighted equally since they are considered indispensable parts of successful communication (Pearson, 2011). The content of a response (sentence mastery and vocabulary) is scored according to the presence or absence of expected correct words in correct sequences.
The manner-of-speaking quality (fluency and pronunciation) is scored by "measuring the latency of the response, the rate of speaking, the position and length of pauses, the stress and segmental forms of the words, and the pronunciation of the segments in the words within their lexical and phrasal context" (Pearson, 2011, p. 13).

The test report consisted of an overall score (ranging from 20 to 80) and four subscores for sentence mastery, vocabulary, fluency, and pronunciation (Pearson, 2011). Sentence mastery focuses on understanding, recalling, and producing phrases and clauses in complete sentences. Vocabulary reflects the ability to understand words spoken in the context of a sentence. Together, content accuracy, including both sentence mastery and vocabulary, accounts for 50% of the overall score. Fluency is measured on the basis of rhythm, phrasing, and timing, as evident in reading and repeating sentences. Pronunciation is assessed in terms of intelligible consonants, vowels, and lexical stress. Fluency and pronunciation, together as the manner of speaking, account for the remaining 50% of the overall score.

A summary of validation efforts by the test publisher (Pearson, 2011) provided evidence for Versant English's internal consistency (.92 to .99), test-retest reliability (.97), and correlations between human and computer scores (.88 to .97). Correlations among the subpart scores typically ranged from .53 to .80, providing evidence that each part of the test could be measuring somewhat different aspects of speaking. Correlations with other speaking tests ranged from .75 to .94; the lowest correlation was with TOEFL iBT, and the highest correlations were found in three experiments that included an oral interaction scale based on the Common European Framework. In our administration, the internal consistency of the four parts was estimated (using Cronbach's alpha) at .88 in December and .92 in January, which is high enough to support giving one overall score. Correlations among the four subpart scores ranged from .55 to .87.

A paper-and-pencil questionnaire was developed to elicit students' feedback on Versant English. The questionnaire contained six questions which covered: (a) how well the test measured the student's speaking ability, (b) its difficulty, (c) its fairness, (d) how easy the directions were to understand, (e) how easy the items were to understand, and (f) its user interface. Each question was followed by a five-point Likert-type scale which asked for the student's opinion and ranged in value from 1, "not at all," to 5, "yes, very much." There was also space for comments. Because the PIE Speaking test was used as part of an operational placement test which included reading, listening, and writing, the decision was made not to administer a questionnaire after it. The responses we might have received, especially from beginning students, would have been affected by the entire placement test, resulting in unusable data. Also, the use of questionnaires to elicit test-taker feedback could potentially endanger the face validity of the placement test and the resulting decisions.

3.3. Procedures

Both tests and the paper-and-pencil questionnaire were administered in a 22-station computer lab running the Windows 7 operating system. Data were collected at two different times, as illustrated in Table 1. The first group of participants
Table 1
Test dates and number of participants.

Test date         Number    Percent
December, 2011    81        65
January, 2012     43        35
Total             124       100
took the PIE Speaking test, Versant English, and the questionnaire in December, 2011. While there were 28 students at Level 3 and 24 at Level 4, there were only 11, 8, and 10 students at Levels 1, 2, and 5, respectively. So, when the PIE Speaking test was administered to students in January, 2012, students from Levels 1, 2, and 5 were recruited to take Versant English and the questionnaire a few weeks after the PIE Speaking test. The distribution of these students by proficiency level in the EAP program is illustrated in Table 2. Students' proficiency level was mainly decided based on the total score of the placement test; level assignment could also be decided by other factors, such as grades from a previous semester. Each level had a similar number of participants except for Level 5.

In order to estimate the degree to which the tests covered important content, a chart of speaking objectives was created based on the speaking syllabi at each level of the EAP curriculum. Two of the researchers independently completed the chart by marking whether or not these objectives were measured by the tests. In order to include other perspectives, a subsequent analysis was undertaken. Two disinterested, yet experienced, teachers were invited to review the content match between the curriculum and the two tests. Both had more than ten years of experience teaching EAP, and both were familiar with the EAP speaking curriculum and placement tests. Before the review process, both teachers received training. In the training, the teachers went over the list of course objectives for speaking in the EAP program. The PIE Speaking test and relevant rubrics were provided. Then the content and scoring criteria of Versant English were introduced and discussed. Each reviewer then took the Versant test. Later, the two teachers worked individually to mark whether the tests reflected the EAP program's speaking objectives. They were also encouraged to attach comments to their judgments about the content coverage.

4. Results

The steps taken to answer each of the three research questions will be presented in turn.

4.1. Research Question 1. Which test's tasks better reflected the speaking objectives of the curriculum?

As mentioned, the EAP program had five levels representing low proficiency (Level 1) to high-intermediate/advanced proficiency (Level 5). First, the speaking objectives of each level were listed and then repetitions were collapsed, leaving seven objectives, as shown in the first column of Table 3. These objectives varied in both context and level of sophistication. For example, an objective in Level 1 was "Relate and tell information about everyday topics," whereas a parallel objective for Level 4 was to "Formulate, express, and defend your individual ideas, opinions, predictions, and inferences in academic contexts and responding appropriately to those of others." The wording of the objectives has been simplified into a keyword format. For example, "Produce the sounds and intonation of English that are crucial to understanding meaning and pragmatic function in academic contexts" was rewritten as "Produce sounds and intonation." The tasks, rubrics, and score reports of both tests were analyzed by two of the authors and by two independent teachers to determine whether or not the EAP's objectives were covered. All four raters agreed on the classifications, with one exception. As we can see in Table 3, both tests included sounds and intonation, vocabulary and grammar, and summarizing information in their measures.
Both tests represented monologic speech, so neither one included conversations or group discussions, and neither test required students to deliver presentations (although Pearson claims that the language targeted by Versant English is "conversational English," all language was one-way, so we did not give a check for Objective 3). PIE Speaking did include formulation and expression of ideas (though not in response to others'); it also included evaluating information from listening passages. The one exception in agreement was on Objective 4: one rater thought that the open question task on Versant did ask for formulating and expressing opinions. However, as mentioned previously, the two open questions are not included in Versant's score.

Table 2
Number of participants in the EAP's five proficiency levels.

Level    Number    Percent
1        25        20.16
2        30        24.19
3        28        22.58
4        24        19.35
5        17        13.70
Total    124       100.00
Table 3
EAP speaking objectives and tests' content coverage.

Objectives                                                 PIE Speaking    Versant English
1. Produce sounds and intonation                           O               O
2. Integrate vocabulary and grammar                        O               O
3. Participate in conversations and group discussions      X               X
4. Formulate and express opinions and respond to others'   O               X
5. Summarize information from listening passages           O               O
6. Evaluate new information from listening passages        O               X
7. Deliver presentations using effective skills            X               X

Notes: O = yes; X = no. All objectives use either everyday or content-specific academic contexts.
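For readers who want to work with these judgments programmatically, the pattern in Table 3 can be re-expressed and tallied directly. The short Python sketch below simply encodes the table; the dictionary structure and variable names are our own illustration, not part of either test or of the reviewers' procedure.

```python
# Coverage judgments from Table 3: True = covered (O), False = not covered (X).
coverage = {
    "1. Produce sounds and intonation":                      {"PIE": True,  "Versant": True},
    "2. Integrate vocabulary and grammar":                   {"PIE": True,  "Versant": True},
    "3. Participate in conversations and group discussions": {"PIE": False, "Versant": False},
    "4. Formulate and express opinions, respond to others'": {"PIE": True,  "Versant": False},
    "5. Summarize information from listening passages":      {"PIE": True,  "Versant": True},
    "6. Evaluate new information from listening passages":   {"PIE": True,  "Versant": False},
    "7. Deliver presentations using effective skills":       {"PIE": False, "Versant": False},
}

# Count how many of the seven program objectives each test covers.
for test in ("PIE", "Versant"):
    covered = sum(coverage[obj][test] for obj in coverage)
    print(f"{test} Speaking covers {covered} of {len(coverage)} objectives")
# Output: PIE Speaking covers 5 of 7 objectives; Versant covers 3 of 7.
```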
4.2. Research Question 2. Which test's scores better distinguished among students at different proficiency levels?

To answer this question, four analyses were undertaken. First, histograms were generated for each test to see the spread of scores. Next, scores at each EAP level were examined to determine the degree to which the scores at each level were different. A correlation between students' scores on the two tests was then computed to investigate the degree to which speaking was being measured in a similar way on the two tests. Finally, students' reactions to Versant English on the questionnaire items were summarized.

Because of the desire to spread out students according to their speaking ability, histograms were generated so that the distributions on the two tests could be compared. As shown below in Figs. 1 and 2, the scores patterned differently. As we can see in Fig. 1, almost 20 students scored 0 on the PIE Speaking test. Also, the stretching of the 0–16 PIE Speaking scale to 0–30 may have led to some imprecision in measurement (Bachman, 1990, pp. 35–37; Wang, Eignor, & Enright, 2008, pp. 278–282). The distribution appears relatively flat, albeit with dips, and the scores varied from 0 to about 26. The distribution of Versant English (Fig. 2) appears positively skewed and perhaps a bit bimodal. This means the bulk of the scores lie to the left, which indicates that many EAP students scored low on Versant English. These scores appear closer to interval-level measurement than the PIE Speaking scores, which were ordinal data transformed to be treated as continuous. Both tests appear to spread out the students, though in different ways.

In order to get an understanding of how well the tests distinguished students at different levels, it is necessary to examine two columns in Tables 4 and 5: the means (M) and the 95% confidence intervals (CI). Looking at the actual distributions of PIE Speaking scores by levels in the EAP program as displayed in Table 4, we can see that the mean score at each level increased except between Level 4 and Level 5. When we take error into account, the 95% confidence intervals (CI) show us the extent to which the scores at each level are distinct. If the confidence intervals of two levels do not overlap, the scores are clearly distinguishing between placements. As we can see in Table 4, there was no overlap between Level 2 and Level 3, or between Level 3 and Level 4, meaning that students placed in these adjacent levels had clearly different speaking scores. This was not the case for PIE Speaking scores at Levels 1 and 2, where there was a small degree of overlap, or at Levels 4 and 5, which almost totally overlapped. As shown in Table 5, the mean scores on Versant English increased as the levels increased, which is desirable. Examining the 95% confidence intervals, however, we see that there was overlap between scores at all adjacent levels, which is not desirable. It can be seen, though, that the overlaps between Levels 2 and 3 and between Levels 4 and 5 are quite small.
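As a worked illustration of this comparison, the sketch below recomputes approximate 95% confidence intervals from the PIE Speaking means, standard deviations, and group sizes reported in Table 4 and checks adjacent levels for overlap. It uses the conventional normal approximation (M plus or minus 1.96 times SD divided by the square root of n), so the intervals differ slightly from those printed in Table 4; the function and variable names are ours.

```python
from math import sqrt

# level: (n, mean, sd), values copied from Table 4 (PIE Speaking)
pie_levels = {
    1: (25, 3.46, 4.59),
    2: (30, 6.20, 5.32),
    3: (28, 14.03, 5.64),
    4: (24, 17.93, 3.88),
    5: (17, 17.68, 3.89),
}

def ci95(n, mean, sd):
    """Approximate 95% CI for the mean: M +/- 1.96 * SD / sqrt(n)."""
    half_width = 1.96 * sd / sqrt(n)
    return (mean - half_width, mean + half_width)

intervals = {level: ci95(n, m, sd) for level, (n, m, sd) in pie_levels.items()}

# Adjacent levels are clearly distinguished when their intervals do not overlap.
levels = sorted(intervals)
for lower, upper in zip(levels, levels[1:]):
    overlap = intervals[lower][1] > intervals[upper][0]
    print(f"Levels {lower} and {upper}: {'overlap' if overlap else 'no overlap'}")
```

With the Table 4 values this reproduces the pattern described above (overlap at Levels 1–2 and 4–5, no overlap at Levels 2–3 and 3–4); running the same check on the Table 5 values shows overlap at every adjacent pair of Versant English levels, with only small overlaps at Levels 2–3 and 4–5.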
Fig. 1. Histogram of PIE Speaking scores.
Fig. 2. Histogram of Versant English scores.
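Both the within-test task correlations reported in Section 3.2 and the between-test correlation reported in the next paragraph rely on the standard correction for attenuation (Bachman, 2004): an observed correlation is divided by the square root of the product of the two reliability estimates. The following minimal sketch shows that computation; the observed correlation of .65 below is a hypothetical placeholder, not a value reported in this study.

```python
from math import sqrt

def disattenuate(observed_r, reliability_x, reliability_y):
    """Estimate the correlation between two measures with measurement error removed."""
    return observed_r / sqrt(reliability_x * reliability_y)

# Hypothetical example: an observed correlation of .65 between two speaking scores
# whose internal consistency estimates are .77 and .92 is corrected upward.
print(round(disattenuate(0.65, 0.77, 0.92), 2))  # -> 0.77
```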
In order to determine the degree to which the scores on the two tests might be measuring speaking in similar ways, a correlation coefficient was calculated after taking into account error in the internal consistency estimates. The correlation corrected for attenuation was .77. While this is considered a moderate to high correlation, accounting for about 60% of shared variance, it suggests that the tests were not measuring speaking identically, but rather similarly, with each test including some aspects of speaking ability that were not included in the other.

As for the Versant English questionnaire results, 113 of the 124 participants responded. Students disagreed that the test was easy (i.e., they correctly thought the test was difficult for them; Mean = 1.76, S.D. = 1.12). This corresponds to the statistical evidence (positive skewness). Some students wrote comments like "The test is very hard and difficult" and "Repeat the questions are difficult part because it is too quick. Moreover, we have to give us easy vocabulary because that help you to know what ability for speaking we have." At the other end of the spectrum, many students agreed that the interface of Versant English was user-friendly (Mean = 4.08, S.D. = 1.31). Overall, responses were mixed regarding whether the Versant speaking test reflected their real speaking ability (Mean = 3.15, S.D. = 1.31), whether the test was fair as a tool for measuring speaking ability (Mean = 3.21, S.D. = 1.18), and whether they understood both the directions (Mean = 3.73, S.D. = 1.22) and the questions (Mean = 3.78, S.D. = 1.17).

4.3. Research Question 3. Which test was more practical?

The practicality analysis took into account time and cost for four stages of testing: development, administration, scoring, and reporting. These varied between the two tests, as one was developed in-house whereas the other was developed by a publishing company.

For PIE Speaking, the entire test process took approximately 60 h and involved about 10 people. It took 5½ h to create the tasks and edit both the tasks and the audio files. One computer lab, which held a maximum of 22 students, was used; with students coming in and out, it took 3–4 h to administer the test to 150–200 students. Three people assisted with administering and proctoring the test. Scoring included finding benchmarks (2–3 h per task involving six people), training raters (1 h involving twelve people), and rating the speech samples (2 h for 5 pairs of raters). In all, it took approximately 35 h to score. Scores then had to be entered into an Excel spreadsheet for each student, by each rater, per task; this took about 1.5 h per task, or 3 h. It then took about 30 min to tabulate total scores. The PIE Speaking test cost about $1200, estimating a rate of pay for staff at $20 per hour (note that all costs are estimated in USD).

On the other hand, the EAP program encountered no development costs for Versant English. Staff was needed to download the Pearson software into the lab and onto each computer (2 h for one person). Staff time was needed for initial training
Table 4
Descriptive statistics for PIE total speaking scores by level.

Level    n      Minimum    Maximum    M        SD      Std. error    95% CI
1        25     .00        11.28      3.46     4.59    .92           1.89–5.03
2        30     .00        20.68      6.20     5.32    .97           4.54–7.86
3        28     3.76       22.56      14.03    5.64    1.07          12.21–15.85
4        24     7.52       22.56      17.93    3.88    .79           16.58–19.28
5        17     11.28      26.32      17.68    3.89    .94           16.04–19.32
Total    124    .00        26.32      11.26    7.57    .68           10.13–12.39

Note: PIE total speaking scores range from 0 to 30.
Table 5
Descriptive statistics for students' Versant total speaking scores by level.

Level    n      Minimum    Maximum    M        SD      Std. error    95% CI
1        25     20.00      46.00      30.40    5.55    1.11          28.50–32.30
2        30     20.00      48.00      33.57    6.66    1.22          31.50–35.64
3        28     23.00      56.00      38.07    7.84    1.48          35.55–40.59
4        24     30.00      57.00      40.50    6.24    1.27          38.32–42.68
5        17     35.00      51.00      44.29    4.25    1.03          42.49–46.09
Total    124    20.00      57.00      36.76    7.83    .70           35.60–37.92

Note: Versant English total speaking scores range from 20 to 80.
(40 min for 3 people) and for administration: assigning ID numbers to students and making a handout for each student with his or her name and number (2 h for 1 person), and leading students to the computer lab and staffing the lab in case of questions or trouble (1 h for 3 people). Because only one lab with 22 computers was used, students had to take the test in turns. Versant English could have been taken by students at home, but because we could not be sure the "real" student would take the test, this option was not used for security reasons. Once students completed the test, Pearson sent an Excel file with all ID numbers and scores (no personal information was saved by Pearson; all identification was done through the ID numbers that were sent). The matching of test IDs to students' names took about 1 h. If we again estimate staff cost at $20 per hour, the 10 h needed to administer Versant English cost about $200. For this study Pearson graciously gave us a reduced price per test of $6.11; however, their university list price is $32.20 per test. Were we to test 150–200 students, the cost would range from $4830 to $6440. The total cost for both staff and the test itself would range from about $5000 to $6700.

5. Discussion

Three research questions guided the overall considerations that resulted in the decision of whether to keep the in-house PIE Speaking test or to replace it with the commercial test, Versant English. Each of them will be discussed in turn.

First, which test's tasks better reflected the speaking objectives of the curriculum? Both tests reflected the linguistic subskills of pronunciation, stress and intonation, grammar, and vocabulary, as well as the rhetorical skill of summarizing information. However, PIE Speaking covered more objectives in the academic context: formulating and expressing opinions and evaluating content. This conclusion is not meant to criticize Versant English; indeed, in online and print materials the publishers make it quite clear that its purpose is everyday English (Pearson, 2011).

Our desire to match PIE Speaking to the EAP's curriculum may lead some to criticize the perception of a fully fixed or predetermined curriculum that does not reflect a more fluid and interactional view of language learning. We acknowledge that lining up students according to their exact abilities and expecting them all to progress in like manner does not match the realities of teaching and learning in an EAP program. Still, students do have different levels of ability, and a curriculum, while not fully fixed, can certainly outline a range of objectives according to progressively increasing levels. The construct definition used to design the PIE Speaking test is naturally reflective of the EAP program's definition of language ability. This definition of language ability informs the curriculum, and thus, while the placement test is not directly tied to the curriculum, there is an interaction between the two. The PIE Speaking test cannot comprehensively assess the EAP program's curriculum even if it were fixed. However, by identifying common elements and tasks from the placement test specifications and the curriculum's course objectives, the test scores can result in more useful and accurate information. For the students who are continuing in the EAP program, the prochievement placement test that is designed to allow for dual interpretations of student ability in reference to the curriculum should lead to more accurate placement decisions within the program.
We would be remiss if we did not alert the reader to John Clark's caution about dual purpose tests. He wrote that "It is necessary to keep in mind that the results of a test that is deliberately restricted to particular ... domains ... cannot legitimately be extrapolated beyond the bounds of the subject matter actually covered in the particular course and its associated test" (1989, p. 214). He was writing about FSI/ILR types of speaking tests, which are different in type and in their orientation to underlying ability from what we have been discussing. Still, it serves as a caution to realize that by combining proficiency and achievement tests, and norm-referenced and criterion-referenced interpretations, we are perhaps limiting the strengths of each.

Second, we asked which test's scores better distinguished among students at different proficiency levels. Overall, Versant English was somewhat better than PIE Speaking in this regard. In terms of the statistical distributions, both tests spread students out, with Versant English demonstrating a better spread of students at the lower end of the scale and a better interval-level scale. The distribution of Versant English in this study was positively skewed, but it must be remembered that it was designed with native speaker performance tied to the high end of the scale, whereas PIE Speaking was designed with advanced non-native speakers of English at the high end of the scale. Versant English had higher reliability in terms of its internal consistency, indicating that its single score could be used more confidently than PIE Speaking's. Neither test clearly separated students at the five levels in the EAP program. PIE Speaking did a better job of distinguishing students at the mid levels in the curriculum, whereas Versant English distinguished students better at the high levels. The moderate correlation between the two tests may be explained by the different tasks and contexts that were targeted: PIE Speaking targeted integrated and extended speaking tasks in an academic, university context, whereas Versant English targeted more linguistic,
discrete tasks in an everyday context. Such an explanation finds support in the similar correlation between Versant English and TOEFL iBT speaking scores and the fact that the TOEFL iBT rubrics were used to score PIE Speaking. The results of the questionnaire indicated that students generally accepted Versant English as an easy-to-follow test in terms of the wording of its directions and questions and the design of its interface. However, most students found the test very difficult for them, especially because of the fast input sentences and difficult vocabulary. This is not surprising, as the scoring scale is anchored by native speakers at its highest end. The fact that no similar questionnaire was given to students for PIE Speaking is a shortcoming of the study. From its score distribution, we might imagine that many students would also report that the test was relatively difficult for them.

The third research question asked which test was more practical. Based on time, there is no doubt that the fully automated computerized test was more efficient. PIE Speaking took more staff time than Versant English in terms of design, administration, and scoring, but it was much less expensive. As Bachman and Palmer (2010, p. 263) wrote, "If the resources required for developing and using the assessment exceed the resources available, the assessment will be impractical ... [It] will not be used at all, or it will not be sustainable unless resources can be managed more efficiently, or additional resources can be allocated." At the present time, cost is more important than time. In the future, the idea of charging students a fee for the test could be explored.

Based on the answers to the three research questions, at this time PIE Speaking will continue to be used at this EAP program. The two most important considerations were practicality and match with curricular objectives. Cost was an important criterion for us, but if price were not an issue, then time saved in staff hours and in generating results could lead another EAP program to choose Versant English. The match with curricular objectives speaks to the validity of the test in terms of content relevance, or domain description; in this situation, PIE Speaking had a somewhat better match than Versant English, though this may not be the case in another EAP program. Another related criterion is the nature of the test tasks. PIE Speaking's independent and integrated tasks were more communicatively focused than the linguistically focused tasks of Versant English; one could argue that this made it more authentic for our university setting. Were cost not an issue, then time saved, along with the statistical criteria of higher reliability, better separation of students with little speaking ability, and better separation of students at high levels in the EAP program, would favor Versant English.

Although these statistical criteria did not mark PIE Speaking as a psychometrically flawed test, it can still be improved. Work will soon be undertaken to move away from the TOEFL iBT scoring rubrics to bring the scoring bands more in line with the levels in the EAP's curriculum and the expectations of university faculty, and hopefully to improve discrimination at the higher end of the scale. An additional speaking task, a picture description task, will be added in an effort to better discriminate among students at the lower levels.
Although interviews had been used at the EAP program several years ago, when enrollment grew it was found that they were impractical in terms of scheduling and administrative time, and that interviewers were inconsistent. Perhaps the time has come to reconsider interviews, particularly in light of the content coverage of the conversation/group discussion speaking objective.

Comparison of the statistical properties of these two tests, their match with our EAP program's curriculum, and practical issues provides an example other EAP programs may use to evaluate placement tests. Considering our circumstances, we might have been better off beginning by determining how much money we were willing to spend. Cost was the most important factor in our decision to keep the PIE Speaking test. If we had known at the outset that we could not afford another test, then perhaps such a comparison need not have been made. That said, having found positive answers to the questions we asked about the curriculum and the statistical distributions gave us confidence in our decision. Finding that PIE Speaking adequately measured university-level language ability, matched our curricular objectives, and corresponded with our EAP program's view of language ability gave us more confidence in keeping it. If we had found that the commercial test, Versant, answered our curriculum questions much better than PIE Speaking, the decision might have been made to spend more money. Also, finding that PIE Speaking resulted in a spread of scores and that those scores had adequate reliability added to our confidence. Granted, Versant's psychometric characteristics were somewhat better than PIE Speaking's. If we had found severe problems with the statistical distribution of PIE Speaking, we might have been more inclined to pay the price of the commercial test, but that was not the case here.

We imagine that many EAP programs in a variety of contexts will face the decision of using commercial or in-house tests. The three criteria we examined provided a useful framework that other programs may also find helpful. We suggest that, depending on the context, the criteria be prioritized with stopping points, like cut scores that have not been met. That way, all along the process, tests that do not have the necessary qualities will be dismissed from further consideration. Hopefully, this will lead to the use of a test in which a program has confidence.
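To make this suggestion concrete, the sketch below shows one way such a prioritized screening might be expressed. The criteria, their order, the thresholds, and the candidate profiles are hypothetical illustrations only, loosely inspired by the figures reported above; each program would substitute its own criteria and cut-offs.

```python
# A hypothetical sketch of prioritized evaluation criteria with stopping points.
# All names, thresholds, and profile values are illustrative, not prescriptive.

def screen(test_profile, criteria):
    """Check criteria in priority order; dismiss at the first unmet criterion."""
    for name, passes in criteria:
        if not passes(test_profile):
            return f"dismissed (did not meet: {name})"
    return "retained for final comparison"

# Criteria in priority order, each with a stopping-point threshold.
criteria = [
    ("cost per student within budget", lambda t: t["cost_per_student"] <= 10.00),
    ("covers most curricular objectives", lambda t: t["objectives_covered"] >= 5),
    ("adequate internal consistency", lambda t: t["alpha"] >= 0.70),
]

candidates = {
    "in-house speaking test":   {"cost_per_student": 7.00,  "objectives_covered": 5, "alpha": 0.76},
    "commercial speaking test": {"cost_per_student": 32.20, "objectives_covered": 3, "alpha": 0.90},
}

for name, profile in candidates.items():
    print(f"{name}: {screen(profile, criteria)}")
```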
References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Bachman, L. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press.
Bachman, L. (2004). Statistical analyses for language assessment. Cambridge, UK: Cambridge University Press.
Bachman, L., & Palmer, A. (2010). Language assessment in practice. Oxford, UK: Oxford University Press.
Bernstein, J., & Cheng, J. (2007). Logic and validation of fully automatic spoken English test. In M. Holland & F. P. Fisher (Eds.), The path of speech technologies in computer assisted language learning: From research toward practice (pp. 174–194). Florence, KY: Routledge.
Bernstein, J., DeJong, J., Pisoni, D., & Townsend, B. (2000). Two experiments on automatic scoring of spoken language proficiency. In P. Delcloque (Ed.), Proceedings of InSTIL (integrating speech technology in learning) (pp. 57–61). Dundee, Scotland: University of Abertay. Retrieved from http://www.pearsonpte.com/research/Documents/Bernstein_DeJong_Pisoni_and_Townshend_2000.pdf
Bernstein, J., Van Moere, A., & Cheng, J. (2010). Validating automated speaking tests. Language Testing, 27, 355–377.
Berry, V. (2007). Personality differences and oral test performance. Frankfurt, Germany: Peter Lang.
Brown, J. D. (1989). Improving ESL placement tests using two perspectives. TESOL Quarterly, 23, 65–83.
Butler, F., Eignor, D., Jones, S., McNamara, T., & Suomi, B. (2000). TOEFL 2000 speaking framework: A working paper. TOEFL Monograph Series, 18. Princeton, NJ: Educational Testing Service.
Campbell, A., & Fiske, D. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Clark, J. (1989). Multipurpose language tests: Is a conceptual and operational synthesis possible? Georgetown University round table on languages and linguistics. Washington, DC: Georgetown University Press.
Commission on English Language Program Accreditation. (2012). http://www.cea-accredit.org/directory.html
Enright, M., Bridgeman, B., Eignor, D., Lee, Y.-W., & Powers, D. (2008). Prototyping measures of listening, reading, speaking, and writing. In C. Chapelle, M. Enright, & J. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 145–186). New York, NY: Routledge.
Eyckmans, J., Van de Velde, H., Van Hout, R., & Boers, F. (2007). Learners' response behaviour in yes/no vocabulary tests. In H. Daller, J. Milton, & J. Treffers-Daller (Eds.), Modeling and assessing vocabulary knowledge (pp. 59–76). Cambridge, UK: Cambridge University Press.
Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. New York, NY: Newbury House.
Huff, K., Powers, D., Kantor, R., Mollaun, P., Nissan, S., & Schedl, M. (2008). Prototyping a new test. In C. Chapelle, M. Enright, & J. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 187–225). New York, NY: Routledge.
Iowa Technical Adequacy Project. (2003). Procedures for estimating internal consistency reliability. http://www.education.uiowa.edu/archives/itap/information/pdf/Coef_Alpha_Reliability.pdf
Jamieson, J., Eignor, D., Grabe, W., & Kunnan, A. (2008). Frameworks for a new TOEFL. In C. Chapelle, M. Enright, & J. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 55–95). New York, NY: Routledge.
Lazaraton, A. (1996). Interlocutor support in oral proficiency interviews: The case of CASE. Language Testing, 13, 151–172.
Lehman, D. (1988). An alternative procedure for assessing convergent and discriminant validity. Applied Psychological Measurement, 12, 411–423.
Luoma, S. (2004). Assessing speaking. New York, NY: Cambridge University Press.
Malabonga, V., & Kenyon, D. M. (1999). Multimedia computer technology and performance-based language testing: A demonstration of the Computerized Oral Proficiency Instrument (COPI). In M. B. Olsen (Ed.), Computer mediated language assessment and evaluation in natural language processing: Proceedings of a symposium sponsored by the Association for Computational Linguistics and the International Association of Language Learning Technology (pp. 16–23). New Brunswick, NJ: Association for Computational Linguistics.
Miller, M. D., Linn, R., & Gronlund, N. (2009). Measurement and evaluation in teaching (10th ed.). Upper Saddle River, NJ: Merrill/Prentice Hall.
O'Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance. Language Testing, 19(3), 277–295.
Pae, H. (2012). Research note: Construct validity of the Pearson Test of English Academic: A multitrait multimethod approach. http://www.pearsonpte.com/research/Documents/ResearchNoteConstructvalidityfinal2012-10-02gj.pdf
Pearson. (2011). Versant English test: Test description and validation summary. Palo Alto, CA: Pearson Knowledge Technologies. Retrieved from http://www.versanttest.com/technology/VersantEnglishTestValidation.pdf
Qian, D. (2009). Comparing direct and semi-direct modes for speaking assessment: Affective effects on test takers. Language Assessment Quarterly, 6(2), 113–125.
Saville, N., & van Avermaet, P. (2005). Language testing for migration and citizenship: Contexts and issues. In L. Taylor & C. J. Weir (Eds.), Studies in language testing: Multilingualism and assessment (pp. 265–275). Cambridge, UK: Cambridge University Press.
Stoynoff, S., & Chapelle, C. (2005). ESOL tests and testing. Alexandria, VA: TESOL.
Townsend, B., Bernstein, J., Todic, O., & Warren, E. (1998). Estimation of spoken language proficiency. In ETRW on speech technology in language learning. Retrieved from http://www.isca-speech.org/archive_open/archive_papers/still98/stl8_179.pdf
Van Moere, A. (2010). Automated spoken language testing: Test construction and scoring model development. In L. Araújo (Ed.), Computer-based assessment of foreign language speaking skills (pp. 84–99). Luxembourg: Publications Office of the European Union.
Wang, L., Eignor, D., & Enright, M. (2008). A final analysis. In C. Chapelle, M. Enright, & J. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 259–318). New York, NY: Routledge.
Young, R., & Milanovic, M. (1992). Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition, 14, 403–424.
Zechner, K., Higgins, D., Xi, X., & Williamson, D. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51, 883–895.

Joan Jamieson is a Professor of English/Applied Linguistics at Northern Arizona University, USA. She teaches classes in second language testing, computer-assisted language learning, and research methods in the MA-TESL program and the PhD program in Applied Linguistics. Her publications in these areas include articles, books, computer software, technical reports, tests, and textbooks.

Linxiao Wang is currently a Ph.D. student in Applied Linguistics at Northern Arizona University, USA. She also teaches and works for the assessment team in the Program of Intensive English at NAU. Her research interests include second language assessment (especially listening and speaking) and second language acquisition.

Jacqueline Church received an MA-TESL degree from Northern Arizona University, USA. She was assessment coordinator at the Program in Intensive English at NAU, and is currently a test developer at Second Language Testing, Inc.