System, Vol. 20, No. 3, pp. 365-372, 1992
Printed in Great Britain
0346-251X/92 $5.00 + 0.00 Pergamon Press Ltd

THE ACTFL ORAL PROFICIENCY INTERVIEW: VALIDITY EVIDENCE

GRANT HENNING
Educational Testing Service, Princeton, NJ, USA
A number of researchers have raised concerns about the validity of the ACTFL Guidelines in general and the Oral Proficiency Interview in particular. This study reports results of a variety of validity analyses involving the ACTFL Oral Proficiency Interview as it was administered to 59 learners of English and 60 learners of French. Scalability of the level descriptors was investigated using the one parameter Rasch model. Based on estimates of internal consistency, inter-rater reliability, concurrence between ratings from ACTFL-trained raters and from naive native speakers, and scalability of levels, it is concluded that the Guidelines can be useful as an assessment tool and offer advantages which warrant serious consideration in the development of language testing procedures.
INTRODUCTION
The Oral Proficiency Interview (OPI) of the American Council on the Teaching of Foreign Languages (ACTFL) and the accompanying ACTFL Guidelines have been with us in one form or another since 1982, when the provisional guidelines were first published (ACTFL, 1982). The current revised form of the Guidelines was developed with support from the United States Department of Education and first appeared in 1986 [ACTFL (1986): see the Appendix]. The ACTFL Guidelines have subsequently gained popularity as a basis for language proficiency assessment in many academic institutions where foreign languages are taught. Although this particular method of assessment had its roots in the earlier OPI of the Foreign Service Institute (FSI) and the Interagency Language Roundtable (ILR) (Jones, 1975; Lowe and Stansfield, 1988) the current ACTFL version has been adapted so that the behavioral level descriptors are more applicable to the introductory levels of language speaking ability that might be encountered in secondary and tertiary academic settings.
The ACTFL Guidelines are an attempt to respond in part to a need for a practical method of foreign language assessment that is more precise than the traditional use of a number of academic terms of foreign language study successfully completed (Clifford, 1980) and that in some sense offers a common metric that may potentially be generalizable across languages (Educational Testing Service, 1981). Further, the Guidelines emphasize authentic language use in communicative contexts. The Guidelines have also been expanded to provide proficiency level descriptors in the skills of listening, reading and writing, and to include less generic descriptors in a growing variety of languages.
Predictably, the ACTFL proficiency movement in general and the Guidelines in particular have not been without critics. Bachman and Savignon (1986) have argued the need for empirical investigation of the reliability and validity of the ACTFL ratings, for investigation of the generalizability of the scale across languages, for control and study of method effects such as situational testing context and test instructions, and for more precise definitions of proficiency. Lantolf and Frawley (1985) have criticized the curricular relevance of the Guidelines in terms of unrealistic expectations of achievement, failure to control for proficiency variation due to topical interest, use of normative reference to average linguistic performance of an idealized native speaker, scalability of ACTFL proficiency levels, and lack of sufficient basis on empirical data. Douglas (1988) has criticized the Guidelines for insufficient specification of cultural content at respective levels, for inadequate relation to what is known of the stages of language acquisition, and for insufficient discussion of the purposes for language use. Some of these criticisms are beginning to receive research attention. Dandonoli and Henning (1990) investigated the utility of the Guidelines in terms of the construct validity of language tests developed according to the Guidelines in the four major language skills across the two languages of English (N = 59) and French (N = 60). They reported results indicating high reliability estimates (0.85-0.96 internal consistency; 0.93-0.98 inter-rater) for unrevised tests of speaking, listening, writing and reading ability developed according to the Guidelines in both languages. They also reported meaningful and significant convergent validity coefficients for all skills in both languages, and high levels of discriminant validity for all skills except listening, which showed disappointing levels of discriminant validity in both languages (Campbell and Fiske, 1959).
Because that study examined the validity of tests developed according to the Guidelines and, by inference, the validity of the Guidelines as a guide to test development, it was thought important that the experimental tests not be given the advantage of a pilot analysis and revision cycle that would have unduly enhanced their reliability and validity. Thus, the results of that study were intentionally conservative. The results partially addressed those criticisms related to the need for empirical evidence of reliability and validity, for investigation of the generalizability of procedures across languages, for analysis of the scalability of ACTFL proficiency levels, and for the need to control and examine test method effects. Additional research funded by the U.S. Department of Education is currently investigating the generalizability of these findings across the languages of Arabic, Japanese, Mandarin Chinese and Russian, and should have publishable results within the next 2 years. Of interest in the present article is evidence for the validity of the ACTFL Oral Proficiency Interview for some educational assessment applications. Therefore, this article will summarize and expand on earlier results with regard to the ACTFL OPI only.

METHOD

Subjects
Subjects were 59 ESL students drawn primarily from Brandeis University (Boston) and 60 French language students at Northwestern University (Chicago). Subjects were given a $50.00 honorarium for their full-day participation in the study.

Testers and raters
ACTFL-certified oral proficiency testers were employed both to conduct and to rate the oral proficiency interviews. Four certified testers rated the ESL tapes and nine certified testers rated the French tapes. In addition, eight untrained native speakers were paid to listen to selected interviews taped in their native languages and to rank-order them with regard to comparative speaking proficiency of the examinees. Four of these naive raters were native speakers of English and four were native speakers of French.

Procedures and instrumentation
For each speaking sample, a minimum of two independent judgments was required. In cases where certified testers disagreed on assignment of ACTFL level designation to speaking performance, a third certified tester was enlisted to rate the performance. The final judgment of speaking level was determined as that level for which two of the three judges agreed. In practice, this adjudication procedure was used only to provide a total classification of each taped performance for use with the naive raters. Otherwise, all of the reliability and validity estimation was done using unadjudicated expert ratings in order to provide the most conservative possible results. Dandonoli and Henning (1990) considered the convergent and discriminant validity of the OPI by comparison with other language skills or traits when assessed using different methods. For the purpose of the study, independent raters were employed as representative of different methods or perspectives on assessment. Speaking results were analyzed both with regard to the Campbell and Fiske (1959) multitrait-multimethod correlational procedure, and by using Rasch model scalar analysis (Wright and Masters, 1982). For data coding and analysis purposes, numerical equivalents were assigned for ACTFL level assignments according to the Lange and Lowe (1987) index as follows: novice-low = 0.1, novice-mid = 0.3, novice-high = 0.8, intermediate-low = 1.1, intermediate-mid = 1.3, intermediate-high = 1.8, advanced = 2.3, advanced-high = 2.8, superior = 3.3.
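The coding and adjudication steps described above can be illustrated with a minimal Python sketch. The numerical equivalents are the ones quoted in the text. The names ACTFL_INDEX, adjudicate and to_score are hypothetical, and returning None when all three judges disagree is an assumption, since the text does not say how such a case was handled.

```python
# Numerical equivalents quoted in the text (Lange and Lowe, 1987 index).
# "advanced-high" is the label used in the text's list; Table 2 and the
# Appendix call the same level "advanced-plus".
ACTFL_INDEX = {
    "novice-low": 0.1,
    "novice-mid": 0.3,
    "novice-high": 0.8,
    "intermediate-low": 1.1,
    "intermediate-mid": 1.3,
    "intermediate-high": 1.8,
    "advanced": 2.3,
    "advanced-high": 2.8,
    "superior": 3.3,
}


def to_score(level):
    """Map an ACTFL level label to its numerical equivalent."""
    return ACTFL_INDEX[level]


def adjudicate(first, second, third=None):
    """Return the level on which two judges agree.

    A third rating is consulted only when the first two disagree.
    Returning None when no two judges agree is an assumption of this
    sketch, not a rule stated in the study.
    """
    if first == second:
        return first
    if third is not None and third in (first, second):
        return third
    return None


# Two testers disagree, so a third rating settles the level.
final_level = adjudicate("advanced", "intermediate-high", "advanced")
print(final_level, to_score(final_level))
```

Note that the index is not equally spaced (for example, 0.3 to 0.8 versus 0.8 to 1.1), so any statistic computed on the coded values inherits that spacing.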
AGGREGATED OPI RESULTS
Descriptive results are reported for both language groups in Table 1 (note that the average over raters A and B is lower than the mean of either rater, as some students were not rated by both raters).
Table 1. Means, standard deviations, internal consistency (alpha) and inter-rater (r) reliability estimates for English and French versions of the ACTFL Oral Proficiency Interview (N = 59 ESL, N = 60 FFL)

                     English                        French
Rater                Mean    SD      Alpha/r        Mean    SD      Alpha/r
Rater A              1.646   0.713                  1.733   0.915
Rater B              1.649   0.733                  1.715   0.895
Rater (A + B)/2      1.615   0.728   0.85/0.98      1.703   0.678   0.89/0.97
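As a rough illustration of two points behind Table 1, the sketch below computes an inter-rater correlation on the students rated by both raters and shows how the mean of the averaged ratings can fall below either rater's own mean when some students have only one rating. The ratings are invented for illustration, and the use of a Pearson product-moment correlation is an assumption, since the type of coefficient is not specified in the text.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical coded ratings; None marks a student that rater did not rate.
rater_a = [1.1, 1.3, 1.8, 2.3, 2.8, None]
rater_b = [1.1, 1.8, 1.8, 2.3, None, 3.3]

# Each rater's mean uses every student that rater scored.
mean_a = mean([a for a in rater_a if a is not None])
mean_b = mean([b for b in rater_b if b is not None])

# The averaged rating exists only for students scored by both raters.
pairs = [(a, b) for a, b in zip(rater_a, rater_b)
         if a is not None and b is not None]
mean_avg = mean([(a + b) / 2 for a, b in pairs])
r = pearson_r([a for a, _ in pairs], [b for _, b in pairs])

print(round(mean_a, 3), round(mean_b, 3), round(mean_avg, 3), round(r, 3))
```

In this invented data the two singly rated students happen to be the strongest ones, which is why dropping them pulls the averaged mean down.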
Note that the unadjudicated results in Table 1 suggest that the students of French as a foreign language on average were at a slightly higher proficiency level than were the students of English as a second language (mean = 1.703 versus 1.615, i.e. approximately intermediate-high). Note also that the reliability estimates based on the unadjudicated ratings were acceptably high. Table 2 reports the mean Rasch model scalar analysis calibrations for students of English and French at all observed proficiency levels.
Table 2. Numbers of ratings per level, mean Rasch model rating scale calibrations and standard error estimates for ACTFL OPI ratings of students of English and French

                       English                                   French
Level                  N            Mean         SE              N     Mean         SE
Novice-low             [illegible]  -            -               0     na           na
Novice-mid             [illegible]  [illegible]  [illegible]     12    na           na
Novice-high            2            -1.25        0.41            12    [illegible]  0.36
Intermediate-low       24           -2.16        0.35            13    -1.95        0.23
Intermediate-mid       21           -0.70        0.16            22    -1.24        0.19
Intermediate-high      29           -0.05        0.13            18    -0.08        0.17
Advanced               14           1.36         0.18            17    0.69         0.20
Advanced-plus          8            2.32         0.29            13    1.97         0.25
Superior               1            3.97         0.53            13    3.86         0.47
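For readers unfamiliar with the rating scale model of Wright and Masters (1982), the following Python sketch shows how category probabilities follow from a person's ability and a set of ordered step calibrations, together with a simple check that each standard error is smaller than the gap to the neighbouring calibration. The step values, standard errors and function names are hypothetical, and the pairwise reading of "interval between levels" is an assumption. The sketch is not a reproduction of the estimation software used in the study.

```python
from math import exp


def category_probabilities(theta, thresholds):
    """P(X = k | theta) for k = 0..K under the Rasch rating scale model.

    theta is the person ability in logits; thresholds holds the K step
    calibrations (tau_1..tau_K) separating the K + 1 ordered categories.
    """
    # Cumulative sums of (theta - tau_j); the empty sum for k = 0 is 0.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def ses_smaller_than_gaps(calibrations, standard_errors):
    """True if, for every adjacent pair of calibrations, both standard
    errors are smaller than the distance between the two calibrations."""
    for i in range(len(calibrations) - 1):
        gap = calibrations[i + 1] - calibrations[i]
        if standard_errors[i] >= gap or standard_errors[i + 1] >= gap:
            return False
    return True


# Hypothetical step calibrations for a five-category scale.
steps = [-2.0, -0.5, 1.0, 2.5]
probs = category_probabilities(-0.5, steps)
print([round(p, 3) for p in probs])

# Hypothetical standard errors for the same steps.
print(ses_smaller_than_gaps(steps, [0.3, 0.2, 0.2, 0.4]))
```

When ability equals a step calibration, the two categories on either side of that step are equally likely, which is what makes ordered, well-separated calibrations a sign that the levels behave as a scale.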
Note from Table 2 that there were more French learners rated at the advanced-plus and the superior levels than there were English learners rated at those levels. This may account for the slight mean proficiency advantage for French learners noted in Table 1.
Note from the mean calibration columns that, in general, there was regular progression in scaled proficiency from the lower to higher levels. The one exception to this observation came at the novice-high level for English learners, where the unexpectedly high mean calibration (-1.25) was probably attributable in part to the fact that there were only two ratings assigned at that level. Also note that the standard errors associated with the calibrations at each level were uniformly smaller than the corresponding interval between levels. This last observation is an important feature of any rating scale where true measurement is said to occur.
Of further interest in the scalar analysis is the observation that there were no persons classified as misfitting the model. The estimated misfit t value did not exceed 2.00 for any person in either language group. This is further evidence that the implementation of the ACTFL OPI Guidelines considered here represented a practicable assessment procedure. There was also no observed misfit of defined proficiency levels in English or French, apart from the English novice-high level for which it was observed earlier that there were only two ratings assigned at that level so that the misfit could be attributed to sampling limitations. It is, however, possible that the novice-high level descriptors for English were inadequately articulated in the Guidelines so that too few persons could be found to match those descriptors. ACTFL-certified examiners associated with this project have reported that this is not usually the case, and that usually novice-high students appear in expected proportions.
A final result of interest concerns the outcomes of the ratings by naive native speakers.
Recall that four untrained native speakers of English and four untrained native speakers of French were requested each to rank-order on proficiency the students from a set of taped interviews.
The taped interviews had previously been reliably rated by two ACTFL-certified language testers as being at particular ACTFL proficiency levels. Procedurally, one tape at each ACTFL level (excluding novice-low) was chosen at random to form a set of eight tapes. Two such sets were gathered in each language. Each set was rank-ordered independently by two naive native speakers of the target language on the tapes. Spearman’s rho was calculated as an index of agreement between the ACTFL-certified raters and the naive native speakers. Table 3 reports the results of this analysis.
Table 3. Spearman rank correlation means, ranges and standard deviations as indices of agreement between native speakers and certified ACTFL raters of ACTFL OPI performance in French and English (N = 16 raters on 32 tapes)

Language     Mean correlation     Range            SD
English      0.934                0.904-1.000      0.045
French       0.927                0.857-1.000      0.058
Note that the high observed correspondence between proficiency judgment of ACTFL-certified language raters and naive native speakers provides further criterion-related validity evidence supporting the utility of the ACTFL OPI. Again, these results are conservative in the sense that none of the ratings or rankings were adjudicated prior to analysis. All correlations were meaningful and significant (p < 0.05).
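The agreement index used for the naive raters can be sketched in a few lines of Python. The rankings below are invented for illustration, and the function assumes rankings without ties, which fits a task in which each rater places eight tapes in strict order.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho for two untied rankings of the same n items."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Order implied by the certified ratings for a set of eight tapes.
expert = [1, 2, 3, 4, 5, 6, 7, 8]
# Hypothetical naive ranking with two pairs of adjacent tapes swapped.
naive = [1, 2, 4, 3, 5, 6, 8, 7]

rho = spearman_rho(expert, naive)
print(round(rho, 3))
```

With only eight tapes per set, rho can take a limited number of values. For example, a summed squared rank difference of 12 gives roughly 0.857 and a difference of 8 gives roughly 0.905, which is close to the lower ends of the ranges reported in Table 3.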
DISCUSSION AND CONCLUSION
One common method used for the assessment of speaking proficiency in a foreign language is the oral proficiency interview. The American Council on the Teaching of Foreign Languages has developed and applied a particular oral proficiency interview with accompanying guideline level descriptors. The Guidelines have encountered some legitimate criticisms regarding the absence of supporting evidence of reliability, validity, generalizability, scalability, meaningfulness and utility of the ratings they generate. The present study has provided some limited research evidence related to the reliability, construct and criterion-related validity, scalability and generalizability of ratings obtained according to the ACTFL Oral Proficiency Interview. While the Guidelines no doubt contain abiding imperfections that should be investigated in future research efforts, none of the evidence gathered in this study has managed to negate their utility as one kind of assessment tool. This is not to maintain that they can properly be applied for all language assessment purposes, nor that tests developed according to the Guidelines will necessarily be reliable and valid in every instance. However, some of the presumed advantages of this mode of assessment for some assessment purposes certainly warrant serious consideration.
REFERENCES

AMERICAN COUNCIL ON THE TEACHING OF FOREIGN LANGUAGES (1982) ACTFL Provisional Proficiency Guidelines. Hastings-on-Hudson, NY: Author.

AMERICAN COUNCIL ON THE TEACHING OF FOREIGN LANGUAGES (1986) ACTFL Proficiency Guidelines. Hastings-on-Hudson, NY: Author.

BACHMAN, L. F. and SAVIGNON, S. J. (1986) The evaluation of communicative language proficiency: a critique of the ACTFL oral interview. The Modern Language Journal 70(4), 380-390.

CAMPBELL, D. T. and FISKE, D. W. (1959) Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin 56, 81-105.

CLIFFORD, R. (1980) Curricular and comprehensive program evaluation. In Proceedings of the National Conference on Professional Priorities. Boston: American Council on the Teaching of Foreign Languages.

DANDONOLI, P. and HENNING, G. (1990) An investigation of the construct validity of the ACTFL proficiency guidelines and oral interview procedure. Foreign Language Annals 23(1), 11-22.

DOUGLAS, D. (1988) Testing listening comprehension in the context of the ACTFL proficiency guidelines. Studies in Second Language Acquisition 10(2), 245-260.

EDUCATIONAL TESTING SERVICE (1981) A Common Metric for Language Proficiency (Final Report for Department of Education Grant No. G008001739). Princeton, NJ: Author.

JONES, R. L. (1975) Testing language proficiency in the U.S. government. In Jones, R. L. and Spolsky, B. (eds), Testing Language Proficiency. Arlington, VA: Center for Applied Linguistics.

LANGE, D. L. and LOWE, P. L., Jr. (1987) Grading reading passages according to the ACTFL/ETS/ILR reading proficiency standard: can it be learned? Selected papers from the 1986 language testing research colloquium, pp. 111-127. Monterey, CA: Defense Language Institute.

LANTOLF, J. P. and FRAWLEY, W. (1985) Oral proficiency testing: a critical analysis. The Modern Language Journal 69(4), 337-345.

LOWE, P., Jr. and STANSFIELD, C. W. (1988) Second Language Proficiency Assessment: Current Issues. Englewood Cliffs, NJ: Prentice Hall Regents.

WRIGHT, B. D. and MASTERS, G. N. (1982) Rating Scale Analysis: Rasch Measurement. Chicago: MESA Press.

APPENDIX. ACTFL PROFICIENCY GUIDELINES: GENERIC DESCRIPTIONS-SPEAKING
Novice
The novice level is characterized by the ability to communicate minimally with learned material.
Novice-low:
Oral production consists of isolated words and perhaps a few high-frequency phrases. Essentially no functional communicative ability.
Novice-mid:
Oral production continues to consist of isolated words and learned phrases within very predictable areas of need, although quality is increased. Vocabulary is sufficient only for handling simple, elementary needs and expressing basic courtesies. Utterances rarely consist of more than two or three words and show frequent long pauses and repetition of interlocutor’s words. Speaker may have some difficulty producing even the simplest utterances. Some novice-mid speakers will be understood only with great difficulty.
Novice-high:
Able to satisfy partially the requirements of basic communicative exchanges by relying heavily on learned utterances but occasionally expanding these through simple recombinations of their elements. Can ask questions or make statements involving learned material. Shows signs of spontaneity although this falls short of real autonomy of expression. Speech continues to consist of learned utterances rather than of personalized, situationally adapted ones. Vocabulary centers on areas such as basic objects, places and most common kinship terms. Pronunciation may still be strongly influenced by first language. Errors are frequent and, in spite of repetition, some novice-high speakers will have difficulty being understood even by sympathetic interlocutors.
Intermediate
The intermediate level is characterized by the speaker’s ability to: create with the language by combining and recombining learned elements, though primarily in a reactive mode; initiate, minimally sustain, and close in a simple way basic communicative tasks; and ask and answer questions.
Intermediate-low:
Able to handle successfully a limited number of interactive, task-oriented and social situations. Can ask and answer questions, initiate and respond to simple statements and maintain face-to-face conversation, although in a highly restricted manner and with much linguistic inaccuracy. Within these limitations, can perform such tasks as introducing self, ordering a meal, asking directions and making purchases. Vocabulary is adequate to express only the most elementary needs. Strong interference from native language may occur. Misunderstandings frequently arise, but with repetition, the intermediate-low speaker can generally be understood by sympathetic interlocutors.
Intermediate-mid:
Able to handle successfully a variety of uncomplicated, basic and communicative tasks and social situations. Can talk simply about self and family members. Can ask and answer questions and participate in simple conversations on topics beyond the most immediate needs; e.g. personal history and leisure time activities. Utterance length increases slightly, but speech may continue to be characterized by frequent long pauses, since the smooth incorporation of even basic conversational strategies is often hindered as the speaker struggles to create appropriate language forms. Pronunciation may continue to be strongly influenced by first language and fluency may still be strained. Although misunderstandings still arise, the intermediate-mid speaker can generally be understood by sympathetic interlocutors.
Intermediate-high:
Able to handle successfully most uncomplicated communicative tasks and social situations. Can initiate, sustain and close a general conversation with a number of strategies appropriate to a range of circumstances and topics, but errors are evident. Limited vocabulary still necessitates hesitation and may bring about slightly unexpected circumlocution. There is emerging evidence of connected discourse, particularly for simple narration and description. The intermediate-high speaker can generally be understood even by interlocutors not accustomed to dealing with speakers at this level, but repetition may still be required.
Advanced
The advanced level is characterized by the speaker’s ability to: converse in a clearly participatory fashion; initiate, sustain and bring to closure a wide variety of communicative tasks, including those that require an increased ability to convey meaning with diverse language strategies due to a complication or an unforeseen turn of events; satisfy the requirements of school and work situations; and narrate and describe with paragraph-length connected discourse.
Advanced:
Able to satisfy the requirements of everyday situations and routine school and work requirements. Can handle with confidence but not with facility complicated tasks and social situations, such as elaborating, complaining and apologizing. Can narrate and describe with some details, linking sentences together smoothly. Can communicate facts and talk casually about topics of current public and personal interest, using general vocabulary. Shortcomings can often be smoothed over by communicative strategies, such as pause fillers, stalling devices and different rates of speech. Circumlocution which arises from vocabulary or syntactic limitations very often is quite successful, though some groping for words may still be evident. The advanced-level speaker can be understood without difficulty by native interlocutors.
Advanced-plus:
Able to satisfy the requirements of a broad variety of everyday, school and work situations. Can discuss concrete topics relating to particular interests and special fields of competence. There is emerging evidence of ability to support opinions, explain in detail, and hypothesize. The advanced-plus speaker often shows a well developed ability to compensate for an imperfect grasp of some forms with confident use of communicative strategies, such as paraphrasing and circumlocution. Differentiated vocabulary and intonation are effectively used to communicate fine shades of meaning. The advanced-plus speaker often shows remarkable fluency and ease of speech but under the demand of superior-level complex tasks, language may break down or prove inadequate.
Superior
The superior level is characterized by the speaker’s ability to: participate effectively in most formal and informal conversations on practical, social, professional and abstract topics; and support opinions and hypothesize using nativelike discourse strategies.
Superior:
Able to speak the language with sufficient accuracy to participate effectively in most formal and informal conversations on practical, social, professional and abstract topics. Can discuss special fields of competence and interest with ease. Can support opinions and hypothesize, but may not be able to tailor language to audience or discuss in depth highly abstract or unfamiliar topics. Usually the superior level speaker is only partially familiar with regional or other dialectical variants. The superior level speaker commands a wide variety of interactive strategies and shows good awareness of discourse strategies. The latter involves the ability to distinguish main ideas from supporting information through syntactic, lexical and suprasegmental features (pitch, stress, intonation). Sporadic errors may occur, particularly in low-frequency structures and some complex high-frequency structures more common to formal writing, but no patterns of error are evident. Errors do not disturb the native speaker or interfere with communication.