English for Specific Purposes 33 (2014) 77–86


Using subject specialists to validate an ESP rating scale: The case of the International Civil Aviation Organization (ICAO) rating scale

Ute Knoch
Language Testing Research Centre, School of Languages and Linguistics, University of Melbourne, Parkville, 3010 Victoria, Australia

Article history: Available online 13 October 2013

Keywords: Rating scale validation; LSP assessment; Speaking assessment; Indigenous criteria; Subject specialist; ICAO language proficiency requirements

Abstract

As part of the English-language proficiency requirements for pilots and air traffic controllers, the International Civil Aviation Organization (ICAO) published a rating scale designed to assess pilots' and air traffic controllers' aviation English proficiency. However, it is not clear how this scale was developed. As part of an attempt to address the need for validation, this paper presents a study involving focus group interviews with pilots. Ten pilots listened to performances of test takers taking a variety of aviation English tests. The pilots were asked to rate the acceptability of each test taker's language for (a) communicating with other pilots and (b) radiotelephony communications with air traffic control. The focus groups had two aims: (1) to establish the 'indigenous' assessment criteria pilots use when assessing the language ability of peers and (2) to establish what level is sufficient as the operational level. The results showed that the pilots focused on some but not all of the criteria on the ICAO scale. Whilst listening to the performances, they also often focused on the speakers' technical knowledge. The paper proposes a model of how industry professionals can be involved in the validation of LSP rating scales. © 2013 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. ICAO language proficiency requirements

Having recognized that inadequate English proficiency on the part of pilots or air traffic controllers has played a role in the chain of events leading to accidents or incidents, the International Civil Aviation Organization (ICAO) decided to strengthen the language proficiency provisions for radiotelephony communication and established a set of Language Proficiency Requirements (ICAO, 2004, 2010). That is, airline pilots and air traffic controllers who engage in international flight operations must be able to show that their level of English proficiency is at or above the operational level (Level 4) to practice their professions. In the scale published as part of the proficiency requirements, there are six proficiency levels from Pre-Elementary (Level 1) to Expert (Level 6) across six assessment criteria: Pronunciation, Structure, Vocabulary, Fluency, Comprehension, and Interactions. The initial deadline for the requirements was 5 March 2008. However, three years' grace was given to those ICAO member states which were not prepared to abide by the testing requirements by the time of the initial deadline. Therefore, the requirements came into effect on 5 March 2011. Unfortunately, very little information is available on how the rating scale and the proficiency requirements were developed. According to the ICAO, the Proficiency Requirements in Common English (PRICE) Study Group, which developed the requirements, consisted of industry and linguistic experts with a background in aviation.


This study aims to establish what criteria pilots use when evaluating the speech of their peers and what passing standards in relation to the ICAO scale levels are applied by pilots. It therefore attempts to address the lack of validation studies on the ICAO proficiency requirements.

1.2. Nature of radiotelephony communications

It is important here to describe the nature of the language used by pilots and air traffic controllers during radiotelephony communication. This language can be categorized into two types: standard phraseology and plain language. Phraseology, which is used in the majority of communications in this context, consists of a restricted repertoire; the language needed is strictly controlled and standardized. A list of the basic principles of standard phraseology can be found in ICAO (2001). A further feature of standard phraseology is that it is a simplified language that emphasizes certain features, for example whether an instruction or advice is negative or positive. Standard phraseology aims for clarity and the avoidance of ambiguity in meaning and pronunciation. Examples of specific instances of phraseology can be found in ICAO (2001) and in Kim (2012). Standard phraseology is part of the regular training of pilots and air traffic controllers as it requires considerable practice; however, it has been shown that it is not always used when required (see e.g. Howard, 2008). Plain language, on the other hand, is used in contexts in which phraseology does not suffice. When using plain language, pilots and air traffic controllers are required to simplify their language as much as possible and avoid ambiguous language (see e.g. Kim, 2012).

1.3. Rating scales in LSP assessment

Most LSP assessment systems make use of a rating scale which raters use to judge spoken or written performances. Such scales are generally used because they are a representation of the test construct. To best represent the test construct at hand, rating scales should be grounded in a theory that describes the type of language used in the target language use domain (McNamara, 2002; Turner, 2000). However, rating scales are often not developed in a way that accounts for such a theory; in fact, often very little information is available on how rating scales were developed (e.g. Brindley, 1998; McNamara, 1996; Turner, 2000; Upshur & Turner, 1995). In the context of LSP testing, Douglas (2001) writes that the content of the target language use (TLU) domain that serves as the basis for the content of the test tasks is usually fairly well understood; however, assessment criteria and rating scales should be developed not through an analysis of the TLU situation, but rather through an understanding of what it means to know and use a language in the specific context (Jacoby & McNamara, 1999). That is, rather than focusing on the specific tasks used for assessment, knowledge of what makes a successful performance in the TLU context is important. Douglas therefore argues that in the development of LSP assessment criteria, theoretically based approaches should be supplemented by taking into account the criteria that experienced professionals in the relevant field employ when evaluating communicative language use. Jacoby (1998) first coined the term 'indigenous assessment criteria' to refer to such criteria used by subject specialists when assessing communication in their respective professional fields.
Such criteria can vary widely from being linguistic in focus to commenting on professional competence and even commenting on a professional’s appearance (see e.g. Douglas & Myers, 2000). Because of this, some authors have cautioned of potential problems of transferring or superimposing these highly context-specific indigenous criteria back onto the criteria used in language assessments (see e.g. Jacoby & McNamara, 1999) and argued that their fit to the language test needs to be critically evaluated (see e.g. Douglas, 2000). Overall, the literature on the use of indigenous criteria in LSP is promising but more work needs to be done to understand how such criteria can be incorporated into rating scales based on linguistic criteria. Several studies on LSP assessment have made use of subject specialists for rating scale validation (Douglas & Myers, 2000; Elder, 1993; Elder et al., 2012; Jacoby, 1998; Jacoby & McNamara, 1999). Among the aims of these studies was to elicit the indigenous criteria of professionals in the field and then either feed these back into the assessment cycle or compare them to already existing criteria. Even though most LSP scales use linguistic criteria (Douglas, 2000), using subject specialists’ judgements of language performance adds to the validity of the resulting assessment criteria as they will more closely reflect norms expected in the workplace. Lumley (1998) argues that it is common practice to involve industry specialists in the test design phase but that it is less common to use this group of stakeholders in the formation of rating criteria and standard-setting. Lumley (1998), for example, compared the ratings of ESL professionals and healthcare professionals on the Occupational English Test (OET) speaking section. He found that there were similarities in the ratings. Banerjee and Taylor (2005) conducted a similar study using the IELTS test and also found a general acceptable level of agreement. Most previous research was done in the domain of testing the English proficiency of health professionals (Douglas & Myers, 2000; Elder et al., 2012; Ryan, 2007). A recent validation study of the criteria used in the OET speaking sub-test (Elder et al., 2012), for example, shows that the educators in the health professions of medicine, nursing and physiotherapy hardly mentioned language skills when evaluating the performance of trainee–patient interactions, and what they commented on is not reflected in the current OET assessment criteria (the health professionals commented for example on how well the test taker had managed the interaction). Few studies have been conducted using aviation professionals as informants despite the high stakes of the ICAO language proficiency requirements. Also, very few studies have employed industry professionals as informants for post hoc validation of both the linguistic criteria of an LSP rating scale and a validation of the cut-scores or passing criteria. The aim of this study, therefore, is to explore the utility of using pilot informants in the context of post hoc validation of an aviation-related LSP rating scale.


2. Methodology

To gather the necessary data for such post hoc validation of the scale, experienced pilots were invited to take part in focus group interviews with the aim of eliciting information on two aspects: (a) what criteria pilots use when evaluating the language competency of their colleagues and (b) what level of English language competence was deemed sufficient to work in an operational environment. The specific research questions are as follows:

1. What criteria do pilots use when evaluating the effectiveness of speech samples of other pilots (especially those around the Level 3/4 cut-off)?
2. Do language-trained raters and pilots agree on the appropriate proficiency level for operational flying?

During the focus group interviews, pilots were played extracts of recordings from three types of aviation English tests: (a) a semi-direct speaking test,1 (b) a structured interview, and (c) an oral proficiency interview (OPI). Each pilot was asked to listen to eight speech samples and decide whether they felt the English proficiency of each candidate was sufficient to work in an operational environment.

2.1. Instruments

2.1.1. Speech samples

As mentioned above, three types of speech samples were selected for use in this study. The semi-direct speaking tests were sourced from an aviation test used in New Zealand which has been accredited by the New Zealand Civil Aviation Authority. The structured interviews were sourced from an aviation test used in Australia and accredited by the Civil Aviation Safety Authority of Australia (CASA). These tests were used because they were available to the researcher (convenience sampling) and because they represented commonly used task types in aviation English tests. The semi-direct speaking test included a range of task types, all designed so that they could only be completed by pilots who hold at least a private pilot's license. Task types included making announcements to passengers and communicating with air traffic control. The direct interview included a range of questions, all designed for practicing pilots. For example, pilots were asked to describe and evaluate certain routine and emergency situations. Both the semi-direct speaking test and the structured interview were designed to elicit aviation language, therefore making them specific-purpose tests. The OPI (from the ICAO rated speech sample CD2) and the structured interviews differed in that the OPI was completely unscripted, while the structured interviews were mainly scripted with some room for follow-up questions by the interviewer. As the OPI was taken from the ICAO speech samples CD, no further information about the test specifications (e.g. the typical sections of the interview and whether any of the questions were used in all interviews) was available. The speech samples were selected to represent a range of proficiency levels and L1 backgrounds. All had been rated previously by two trained aviation English raters. Three different test types were selected to represent the types of speaking performances commonly elicited in aviation English tests. However, how much these tests actually tap into the rather broad and varied domain of aviation English is not entirely clear. For the purpose of the speech samples, extracts that the researcher judged as tapping as closely as possible into the TLU domain were selected. For example, items that involved making an announcement to air traffic control were selected over more general items (e.g.
where the pilot had to describe their favourite phase of flight). A summary of the eight speech samples can be found in Table 1. As can be seen from Table 1, samples chosen for eliciting data were originally rated 3 or 4 because these levels were deemed to be the most interesting for the purpose of the focus group interviews. The larger sample (of 270 performances) from which these speech samples were purposefully sampled included a wider range of performances but these were not included for the purpose of this study. For example, we did not expect a performance rated at Level 6 to elicit a lot of discussion. Performances at the very low spectrum (rated e.g. at Level 2) often did not include a lot of rateable language. Two performances at Level 5 were included as some involved in aviation testing have argued that the operational level should be set at Level 5. The speech samples had previously been rated by two aviation English raters who were specifically trained to evaluate aviation English tests. Only speech samples on which the two original raters had agreed were selected for the purpose of this study. Two speech samples from the ‘ICAO rated samples CD’ were also included (one was an OPI and one a semidirect speaking test). This CD was published by ICAO in 2007 and included samples from a number of different aviation tests with benchmark ratings. The speech samples ranged in length from 3 minutes to 4.5 minutes. They were extracts from the longer test performances by the candidates. The extracts were chosen from the middle and later sections of the tests and excluded the introductory sections of the tests which are usually not rated in real test situations. We attempted to include items that tapped into routine procedures in a flight context as well as more unusual, emergency scenarios.
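The sampling step described above can be expressed as a simple filter. The sketch below (Python) is illustrative only: the pool records, field names and level band are hypothetical assumptions made here for illustration, not the actual test data or selection script used in the study.

# Hypothetical pool of previously double-rated performances. The filter mirrors
# the selection described above: keep only samples on which the two trained
# raters agreed and whose agreed level lies in the band of interest around the
# Level 3/4 operational cut-off (with a small number of Level 5 performances).
pool = [
    {"id": "S01", "rater1": 4, "rater2": 4, "l1": "Japanese", "test": "Semi-direct"},
    {"id": "S02", "rater1": 3, "rater2": 4, "l1": "Korean",   "test": "Direct - scripted"},
    {"id": "S03", "rater1": 5, "rater2": 5, "l1": "Urdu",     "test": "Direct - OPI"},
]

def eligible(performance, target_levels=(3, 4, 5)):
    """True if both raters agreed and the agreed level is in the target band."""
    return (performance["rater1"] == performance["rater2"]
            and performance["rater1"] in target_levels)

selected = [p for p in pool if eligible(p)]
for p in selected:
    print(p["id"], p["rater1"], p["l1"], p["test"])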

1 A semi-direct speaking test refers to a speaking test in which the test taker does not directly speak to a human interlocutor. Questions are usually prerecorded and the test taker records answers into a computer system or on an audio file which is later rated.
2 This CD was created for the benefit of rater training so raters can understand the standards underlying the ICAO proficiency scale.

Table 1. Speech samples.

Speech sample | ICAO level as rated by trained raters | L1 background of speaker | Type of test
1 | 5 | Hindi | Semi-direct
2 | 3 | Japanese | Direct – scripted
3 | 3 | Korean | Semi-direct
4 | 4 | Arabic | Semi-direct
6 | 3 | Chinese | Direct – scripted
7 | 4 | Japanese | Direct – scripted
8 | 3 | Chinese | Direct – OPI
9 | 5 | Urdu | Direct – scripted

Fig. 1. Questionnaire.

Table 2. Background information – pilot participants.

Pilot | Years of flying experience | Domestic/international experience | Experience working with colleagues from non-English speaking backgrounds?
1 | 10 | Both | No
2 | 31 | Both | No
3 | 40 | Both | Yes
4 | 26 | Both | Yes
5 | 19 | Domestic | Yes
6 | 30 | Domestic | Yes
7 | 42 | Both | Yes
8 | 12 | Both | Yes
9 | 37 | Both | Yes
10 | 18 | Both | Yes

2.1.2. Speech sample questionnaire

While the pilots listened to the speech samples, they were asked to complete a short questionnaire about each speaker. This questionnaire was designed to elicit the pilots' thoughts about each speaker; the questions can be found in Figure 1. As part of the questionnaire, the participants were asked to judge whether the speaker in each speech sample would be able to communicate effectively with (a) other pilots or (b) ATC (air traffic control). These groups were selected because they are the main professions pilots communicate with about safety-related issues when in the air. The questionnaire was kept purposefully short as it was designed to be completed while the focus group participants were listening to the speech samples. The main purpose of the questionnaire was to stimulate discussion in the focus groups.

2.2. Participants

Ten experienced pilots participated in the interviews (Table 2). The majority were from a large airline in Australia, while others were recruited through personal networks. About half the pilots were working in the capacity of pilot trainers. All participants were native speakers of English. One criterion for the recruitment of the pilots was that they needed to have sufficient experience working with air traffic control (ATC), as less experienced pilots mainly flying in visual flight rule conditions3 would not have much opportunity to talk to ATC. Because pilots only flying domestically in Australia have very little opportunity to encounter speakers from backgrounds other than English in their professional life, great care was taken to recruit pilots with some overseas flying experience. Pilots 1 and 2, who indicated not having any colleagues from non-English-speaking backgrounds, did have extensive experience flying into Asia and therefore worked with ATCs whose first language was not English.

3 Visual flight rules refers to flights in clear weather conditions that allow the pilot to rely on visual orientation.


Of the ten pilots taking part in the study, six were from the same major airline, three were working for a flight-training school training overseas student pilots, and one was a retired pilot working in pilot training in Asia. While it would have been preferable to also include pilots from a wider range of backgrounds (including pilots from non-native speaking backgrounds), this was not possible for practical reasons.

2.3. Procedures

2.3.1. Data collection

The data were collected using focus group interviews. This format was chosen for two reasons: (1) previous studies (e.g. Elder et al., 2012; Ryan, 2007) had found that it led to rich discussion between the participants, which resulted in richer data than interviews with individual participants, and (2) this format was the most practical as the data were collected on a day when most participants were attending a professional development workshop. The focus group interviews were conducted in groups of three to four pilots in either a conference room at the airline office or in the office of the researcher. Before listening to the speech samples, the pilots were asked to review the very brief questionnaire and ask any clarification questions they had. The pilots were then asked to listen to the first speech sample and complete the questionnaire while they were listening. Once the audio stopped, the questionnaire questions were used as the basis for the discussion among the participants. The pilots were asked to comment on their perceptions of the test taker's ability to communicate effectively. The researcher asked follow-up questions to elicit which aspects of the test taker's speech were particularly effective or problematic. The focus group participants were given little guidance on what aspects of speech to comment on (as the primary motive was to elicit their indigenous assessment criteria). However, the researcher asked follow-up questions where necessary when the participants used unknown terminology or struggled to express their thoughts about communicative effectiveness. The researcher also clarified questions about the contexts of the collection of the speech samples. This happened frequently, as the pilots needed to understand the professional background of the test takers (e.g. whether they were student pilots or experienced pilots). In all instances, a healthy discussion between the focus group participants was elicited. It is important to note that the pilots were not asked to agree in their judgements. All focus group interviews were recorded.

2.3.2. Data analysis

All interviews were transcribed by a research assistant and then checked by the researcher for accuracy. The next step was to develop the coding categories. The categories in the ICAO rating scale were used as the starting point for the analysis, as it was assumed that at least some of these would be mentioned in the focus groups. Careful reading and rereading of the interview data also resulted in a number of other categories being identified. This analysis is in line with Douglas' (2001, p. 182) and Douglas and Selinker's (1994) recommendations on how such commentary about primary data (in this case the speech samples) should be analysed. To answer research question 1, eleven coding categories were identified, and the frequency and percentages of mentions of each category were compared. To answer research question 2, the rating results by the language-trained raters and by the focus group participants were compared.
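The tallying step just described can be illustrated with a short script. The sketch below (Python) assumes a hypothetical list of coded tokens, one (pilot, category) pair per coded mention, and computes the frequency, percentage and number of distinct pilots per category in the manner of Table 3. It illustrates the counting only, not the qualitative coding procedure actually used in the study.

from collections import Counter

# Hypothetical coded tokens from the focus group transcripts: one
# (pilot_id, category) pair per coded mention. Category labels follow Table 3.
coded_tokens = [
    (1, "Technical knowledge, experience, training level"),
    (1, "Pronunciation"),
    (3, "Comprehension"),
    (7, "Pronunciation"),
    # ... one tuple per coded mention in the transcripts ...
]

def summarise(tokens):
    """Frequency (N), percentage of all mentions, and number of distinct
    pilots mentioning each category, ranked by frequency (cf. Table 3)."""
    freq = Counter(category for _, category in tokens)
    total = sum(freq.values())
    rows = []
    for rank, (category, n) in enumerate(freq.most_common(), start=1):
        pilots = len({pilot for pilot, cat in tokens if cat == category})
        rows.append((rank, category, n, round(100 * n / total, 2), pilots))
    return rows

for rank, category, n, pct, pilots in summarise(coded_tokens):
    print(f"{rank:>2}  {category:<50}  N={n:<3}  {pct:5.2f}%  pilots={pilots}")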
3. Results

The results of the two research questions are presented in turn below.

3.1. RQ 1: What criteria do pilots use when evaluating the effectiveness of speech samples of other pilots (especially around the Level 3/4 cut-off)?

To answer the question, the data were coded according to recurring themes. The categories on the ICAO rating scale were automatically included as possible coding categories. These were:

- Pronunciation (including accent)
- Structure
- Vocabulary
- Fluency (including 'speed' or 'rate of delivery')
- Comprehension
- Interaction (including 'immediacy' or 'delay' in responses)

References to all categories on the ICAO scale were found. The following additional themes were also identified:

- Pilots' technical knowledge, experience and level of training
- Overall evaluation of level of speech
- Transition from standard phraseology to plain language
- Non-verbal cues4
- Appropriacy of answer

4 The term ‘non-verbal cues’ generally refers to any discussions about the lack of these cues in radiotelephony communication as well as any mention of the use of these cues in the cockpit.


Table 3. Focus group interview categories.

Rank | Category | N | Percentage | ICAO scale?
1 | Technical knowledge, experience, training level | 50 | 23.26 | No
2 | Pronunciation | 45 | 20.93 | Yes
3 | Comprehension | 27 | 12.58 | Yes
4 | Overall evaluation | 24 | 11.16 | No
5 | Fluency | 22 | 10.23 | Yes
6 | Transition from standard phraseology to plain speech | 13 | 6.05 | No
7 | Vocabulary | 12 | 5.58 | Yes
8 | Interaction | 8 | 3.72 | Yes
9 | Non-verbal cues | 6 | 2.79 | Yes
10= | Appropriacy of answer | 4 | 1.86 | No
10= | Sentence structure | 4 | 1.86 | Yes
Total | | 215 | 100 |

Table 4. Pilots' speech sample questionnaire responses (collated).

Sample | Language raters | Work with other pilots (%) | Work with ATC (%)
1 | High 4 | 90 | 80
2 | 3 | 0 | 0
3 | 3 | 30 | 10
4 | 4 | 20 | 30
6 | 3 | 80 | 60
7 | 4 | 100 | 90
8 | 3 | 60 | 70
9 | 5 | 100 | 100
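The collation reported in Table 4 amounts to simple percentage agreement. The following sketch (Python, with invented yes/no judgements) shows how the two percentage columns could be computed and how each sample's majority pilot verdict could be compared with the language raters' Level 4 operational cut-off. The 50% majority threshold and the example data are assumptions made here for illustration, not values used in the study.

# Hypothetical questionnaire data: for each speech sample, one True/False
# judgement per pilot on whether the speaker could (a) work with other pilots
# and (b) work with ATC. Values are invented for illustration.
judgements = {
    1: {"pilots": [True] * 9 + [False], "atc": [True] * 8 + [False] * 2},
    2: {"pilots": [False] * 10, "atc": [False] * 10},
}
rater_levels = {1: 5, 2: 3}  # ICAO levels assigned by the language-trained raters

def percent_yes(votes):
    """Percentage of pilots answering 'yes' for one sample and criterion."""
    return round(100 * sum(votes) / len(votes))

for sample, votes in judgements.items():
    raters_pass = rater_levels[sample] >= 4           # ICAO operational cut-off
    pilots_pass = percent_yes(votes["pilots"]) >= 50  # assumed majority criterion
    print(f"Sample {sample}: pilots {percent_yes(votes['pilots'])}%, "
          f"ATC {percent_yes(votes['atc'])}%, "
          f"{'agree' if raters_pass == pilots_pass else 'disagree'} with raters")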

Altogether, eleven thematic categories were coded in the data. Each token of these categories was coded in the transcripts and percentages were calculated for each category (Table 3). The second to last column indicates how many of the pilots commented on a certain item. We included this column to provide an indication as to whether the frequency count was mainly achieved by a small number of pilots repeatedly mentioning a category. The final column indicates whether each category is represented in the ICAO rating scale. Table 3 shows that a speaker's level of technical knowledge, experience, and training was most often mentioned in the pilot interviews, and it can therefore be assumed that these played an important role in the pilots' evaluation of the effectiveness of the speech samples. The second most mentioned category was pronunciation. Comprehension, overall evaluation and fluency were mentioned at similar frequencies. The point of transition from standard phraseology to plain speech5 and a test taker's vocabulary were mentioned almost equally as often. References to interaction, non-verbal cues, appropriacy of an answer and sentence structure were less common. The table shows that the pilots drew on a wider range of criteria than those included in the ICAO rating scale, and that some categories mentioned frequently are not represented in the current scale while others which are included (e.g. structure) seem less important to the pilots who participated in this study.

3.2. RQ 2: Do language-trained raters and pilots agree on the appropriate proficiency level for operational flying?

To answer this research question, the ratings from the raters and the pilots were compared. The results are summarized in Table 4. The first column shows the number of the speech sample. The rating provided by the two language-trained raters can be found in column 2. Columns 3 and 4 indicate the percentage of pilots who thought that the language proficiency of the speakers was high enough to communicate adequately with other pilots and ATC. In general, the pilots seemed to agree with the ratings by the language experts, therefore confirming the ICAO cut-scores. All pilots agreed that the two candidates with the highest language proficiency, Speakers 1 and 9, are proficient enough to work as pilots. The pilots also agreed with the ICAO rating of Speaker 7. However, the other speaker rated at Level 4 – Speaker 4 – was generally judged as very low, and most pilots thought that this person was not proficient enough to work as a pilot. This speaker will be examined in more detail later. Four speakers were rated as being at Level 3 (i.e. not operational on the ICAO scale) by the language-trained raters. However, the pilots were divided when making their ratings. They generally agreed that Speakers 2 and 3 were not proficient enough to work as pilots where English is required; however, nearly all thought that Speaker 6 was proficient enough to work in such an environment, and more than half considered Speaker 8 proficient enough.

5 This is the point in the speech when participants are not able to rely on standard phraseology anymore but need to use plain language to communicate with the other party (e.g. air traffic control). This point is often critical in emergency situations and is often the point where speakers of lower proficiency levels encounter problems.


Speaker 8 was a benchmark sample6 from the ICAO rated speech sample CD, and a group of aviation English test raters who re-rated this sample thought that he should actually have been rated at Level 4. The ICAO rated speech sample CD had been criticized for similar inconsistencies and has since been replaced by a newer version with different samples. This discrepancy can therefore be explained. This leaves two speakers on whom the language-trained raters and the pilot informants did not agree: Speaker 4 and Speaker 6. Each of these will be examined in more detail below.

3.2.1. Speaker 4

The pilot raters' remarks focused on a number of linguistic features in Speaker 4's speech.7 Four pilot participants commented on the lack of vocabulary in Speaker 4's speech. For example, Participant 7 said: "It seems to me his vocabulary is not great, if you just gotta explain going around ... aircraft in front has a technical problem, so you don't panic everybody by telling them it's on fire8 ...". Similarly, Participant 5 said: "He just didn't have the ... he didn't have the words to explain it ...". Four pilots mentioned in the interviews that they thought Speaker 4 also lacked comprehension. Participant 1, for example, said: "Again, I thought he had trouble actually understanding the question." Pilot 9 was concerned with Speaker 4's lack of overall English language ability. He particularly thought that he would have trouble communicating with other L2 speakers or with certain L1 English speakers less familiar with foreign accents. Below are some excerpts from the interview:

[Working with] other pilots, it would depend on who the other pilot is, but if it was someone who was marginal in English again, communicating in English would be a problem. He couldn't express ... couldn't get the words to express the situation. I think that's gonna depend a little bit on the country people come from. Australians now are getting quite adept at hearing a lot of English variation. Someone from America ... still has a little bit of difficulty even with me. ... American ATC, they would understand him on the set script, and it's gonna depend on who the person is with as to how they get by. If you're gonna say, "can he just go with anybody?" – no, I don't think so. (Participant 9)

There was concern among the pilot participants that although this speaker would be fine operating using standard phraseology, his language ability would not be sufficient when using plain language. The excerpts below by Participants 7 and 9 exemplify this concern:

Communicating effectively with ATC, he's doubtful on that one, because it's scripted what they say. But if they step off the script, this is the problem. If he just followed everything, as per SOP [Standard Operating Procedure], he could probably get by, but the moment they step off that ... so I'd have to say, you know, from a safety angle, "No." (Participant 7)

In the area where you use set script stuff, I'm sure [he] would be fine. So if he had to respond to ATC in a standard fashion, I don't think there'd be a problem. Ah ... communicating in plain language with other pilots ... difficult. (Participant 9)

But the main concern among the pilots seemed to be Speaker 4's lack of technical knowledge. This concern was raised by nine of the ten pilots participating in the study and seemed to be one of the main reasons the test-taker pilot was rated as communicatively ineffective.
Participant 4 below exemplifies other pilots' comments:

I got the impression that he didn't know the answer. That airspeed indicator example9 ... he certainly had communication issues getting it out but I don't think he knew what to say even in his own language, to be honest. It just sounded like he was largely making it up. (Participant 4)

Similarly, referring to the same test item, Participant 5 noted:

He was trying to explain why the air speed was basically low. ... I don't think he knew the actual aerodynamics of why. (Participant 5)

Summing up, it seems that the reasons why most of the pilots did not rate Speaker 4's language ability as sufficient for operational flying had to do with two concerns: (a) the speaker's lack of technical knowledge and (b) the concern that the speaker would struggle in the transition from standard phraseology to plain language.

3.2.2. Speaker 6

Speaker 6 was the opposite case to Speaker 4. He was rated at Level 3 by the language-trained raters, but more than half of the pilot participants considered him adequate to work in an operational environment.10 Again, the interview comments were scrutinized to arrive at an explanation for this disparity.

6 This sample had been rated by several raters.
7 Overall, Speaker 4's speech was marked by many pauses and hesitations spent searching for appropriate vocabulary.
8 The test item required the pilot to make an announcement to the passengers explaining why they had to overshoot the runway. The reason given in the test item was that a preceding aircraft had caught fire. The pilot informants were remarking on the lack of ability to 'package' the reasons for the delay in arrival into something less worrying to passengers.
9 In this particular test item test takers were asked to describe why an airspeed indicator could give a wrong reading on a frosty morning. Speaker 4 struggled giving an answer on this item – his speech was marked by hesitations and re-starts.
10 Speaker 6's speech was very mechanical and sounded pre-rehearsed at times. His speech was very monotonous but in contrast to Speaker 4, Speaker 6 always continued speaking and did not pause to search for vocabulary, although the vocabulary used was very basic.


Just as was reported for Speaker 4 above, the pilot participants commented on a range of linguistic features in Speaker 6's speech sample. In fact, they mentioned a wider range of linguistic features than they had done in response to Speaker 4's sample. But just as technical knowledge seemed to be the deciding factor in the case of Speaker 4 described above, most comments related to Speaker 6's technical knowledge. In this case, the participants thought that the speaker had sufficient knowledge to work operationally. Below are two typical excerpts:

I don't know if there was a bit of a misunderstanding with the VFR [Visual Flight Rules11] on the first questions, but ... yeah, they seemed to sort of explain all the stuff that they would do ... they understood what they were talking about [...] and all the relevant aviation type terms were there. Yeah, it would've been interesting to hear, for example, that overshoot question12 ... and see what they would come up with then. But yeah, all the aviation related stuff's all there. (Participant 8)

I could understand him and he put enough information out to give me confidence ... (Participant 3)

Although the language raters thought that Speaker 6's language ability was below the operational level, the majority of the pilots thought that he should be allowed to fly. It seems that the pilots' judgment of Speaker 6's technical knowledge was the deciding factor.

4. Discussion and conclusion

The findings show that the pilot informants drew on a wider range of criteria than those included in the ICAO scale to judge the language ability of the pilots who provided the speech samples. They put much weight on the technical knowledge of the speaker. It is possible that pilots rely on this evidence as they are not trained in the assessment of linguistic criteria. Many comments were also made about the pronunciation of the speakers; this criterion is represented in the ICAO rating scale. The issue of pronunciation seems to be more challenging when radiotelephony communication takes place in a non-English-speaking context. A study of the effect of accents in radiotelephony communication conducted with discourse data collected at Bangkok International Airport (Tiewtrakul & Fletcher, 2010) reveals that while communication flowed smoothly between Thai pilots and Thai air traffic controllers, there was much more disruption when Thai air traffic controllers communicated with native English-speaking pilots, followed by non-Thai non-native English-speaking pilots. Similar findings were also reported in the Korean context. The results of a survey show that Korean pilots and air traffic controllers feel more comfortable with Korean interlocutors, followed by native English-speaking and then non-Korean non-native English-speaking interlocutors (Kim, 2012). The pilots also relied on an overall evaluation of a pilot's speech. This criterion is currently not included in the ICAO scale, although other ESP rating scales make use of it (e.g. the Occupational English Test, designed to assess the language proficiency of health professionals: McNamara, 1990). The pilots also frequently commented on the potential ability of the speakers to make the transition from standard phraseology to plain language if needed.
This transition was not the focus of any of the speech samples played in the focus group interviews; however, it is often cited as the critical point at which miscommunication can happen (for a further discussion of the nature of radiotelephony refer to Kim & Elder, 2009). This criterion is also not specifically included in the ICAO rating scale. Overall, the pilot informants drew on a wider range of criteria than is included in the ICAO scale. Some criteria included in the current scale were mentioned rarely (e.g. structure, non-verbal cues and interaction). It can be argued that the criteria used by industry specialists need to be understood and possibly included in future versions of the scale to arrive at ratings that are more in line with the criteria used by professionals working in a certain profession and to avoid complaints about the language proficiency of professionals who have passed an ESP test. The decision on whether to include such criteria is, however, often dictated by policy. For example, in the case of the Occupational English Test, the test is required by law to assess only English language competence (and not professional competence) (see McNamara, 1996). Whether this decision has been explicitly made in the case of the ICAO language proficiency requirements is not clear. When judging the communicative effectiveness of the speakers, the technical specialists generally agreed with the language specialists. When disagreement occurred, this was generally about the speakers' level of technical knowledge. Although this study was based on a very small sample, the following tentative conclusions can be drawn about how industry professionals might judge the communicative effectiveness of their peers (Table 5). From this table it can be seen that there is a possible interaction between the level of technical knowledge of a pilot and their language ability. In fact, it seems that evidence of technical knowledge can, to a certain degree, outweigh unfavourable impressions of linguistic shortcomings. This is a judgement that language-trained raters cannot make, as they are trained to focus on the language ability of a speaker only (see also Long, 2005). A possible conclusion from this finding is that ESL raters should not be used in isolation to rate specific-purpose language samples; ESL raters and profession-specific experts should work together to judge the communicative effectiveness of professionals. The findings from the study generally supported the current ICAO cut-scores, although Table 5 shows that the conclusion is far from straightforward. It was certainly shown that the ranking of individuals around the cut-scores differed between the language-trained raters and the industry professionals, and that therefore no clear answer to this research question has been found. A possible solution to this issue will be discussed in the following section.

11 The test item required test takers to describe the rules that need to be followed when flying under visual flight rules.
12 See Footnote 3.

Table 5. Possible interaction of language ability and technical knowledge in the ratings of communicative effectiveness by industry professionals.

Judgment of language ability | Judgment of technical knowledge specific to profession | Overall communicative effectiveness
Sufficient | Sufficient | Sufficient
Sufficient | Not sufficient | Not sufficient
Not sufficient | Sufficient | Sufficient
Not sufficient | Not sufficient | Not sufficient
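The pattern in Table 5 can be read as a tentative decision rule in which the pilots' judgement of technical knowledge is decisive. A minimal sketch of that reading follows (Python); it abstracts the table only and is not proposed as a normative model of how communicative effectiveness should be rated.

# Tentative reading of Table 5: in the pilots' judgements, perceived technical
# knowledge appears decisive for overall communicative effectiveness,
# outweighing the judgement of language ability in either direction.
def overall_effectiveness(language_sufficient: bool, technical_sufficient: bool) -> bool:
    """Abstraction of Table 5 only; not a normative model of rating."""
    return technical_sufficient

labels = {True: "Sufficient", False: "Not sufficient"}
for lang in (True, False):
    for tech in (True, False):
        verdict = labels[overall_effectiveness(lang, tech)]
        print(f"language={labels[lang]:<14}  technical={labels[tech]:<14}  ->  {verdict}")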

4.1. Using industry professionals in rating scale development and validation

This study has implied that close co-operation between language-testing specialists and industry professionals is crucial in the development and validation of ESP rating criteria. Both groups of professionals bring invaluable experiences to the table and neither group should be involved in isolation, as is often the case in ESP testing. I propose that industry professionals should be involved at several points in the test development phase: (1) in the conception phase of the test to help with insight into the TLU domain and the authenticity of test tasks; (2) during the initial development of the rating scale to suggest criteria which might lend themselves to inclusion in the scale; (3) as second raters used in combination with ESL-trained professionals; (4) in test validation work; and (5), as was the case in this study, in the post hoc validation phase of the scale to validate the criteria used and to set industry-appropriate cut-scores. The results from the focus groups that were set up to establish the indigenous assessment criteria used by pilots were able to show that such research is useful for post hoc validation studies. The findings show that some ICAO rating scale criteria were less important to pilots (e.g. structure). Research conducted in the Korean aviation context (Kim, 2012) reveals similar results in that the majority of Korean pilots and air traffic controllers perceived interaction and comprehension as appropriate criteria, but they were unconvinced that structure and fluency were critical criteria. This study also shows how much importance the pilots put on the speakers' technical knowledge and experience. Jacoby and McNamara (1999) make the point that it is not straightforward to make the transition from the indigenous assessment criteria derived from a TLU situation to criteria that will be employed in a test. It is therefore possible that such post hoc validation work, as was employed for this study, could be useful to suggest changes to a previously existing scale such as the one developed by ICAO. For example, pronunciation, a criterion currently included in the ICAO rating scale, was mentioned frequently in the focus groups, and this criterion could possibly be weighted more heavily than other criteria. Of course, as one reviewer pointed out, it is possible that the participants were not always able to describe what they heard in the speech samples and labelled any problems as 'pronunciation'. Interestingly, however, other research has shown that pronunciation is often mentioned as being important to industry specialists (see for example Brown, 1995; Ryan, 2007), and the same was the case in this study. The reason for this is that pronunciation problems can severely interfere with the understanding of the message, especially in radiotelephony communication. Pronunciation problems often affect larger sections of a speech sample and therefore probably interfere with understanding more than, for example, individual wrongly chosen lexical items or slow speech. The transcripts of the focus groups show just how difficult it is for industry professionals such as pilots to separate language ability from aviation knowledge, a point that has been made by others working on LSP assessment (e.g.
Davies, 2001; Ryan, 2007). However, O’Sullivan (2002) has argued that discipline-specific knowledge should be assessed separately. I would like to argue that in this specific context – namely aviation English testing, which is quite different from, for example, testing English for medical professionals because of the prominence of the use of standard phraseology – the testing of language and technical knowledge cannot and should not be separated. This also relates to the use of the policy surrounding raters from aviation and non-aviation backgrounds. ICAO recommends (but does not mandate) the use of both aviation experts and English language experts. This study was able to show that the standards industry specialists and language experts used when rating the speech samples were slightly different. They agreed on the ratings of most candidates, but there were some significant differences. This was also found in other, similar studies (e.g. Douglas & Myers, 2000; Ryan, 2007). The findings of this study suggest that it is important to involve industry specialists in the rating process, especially around the Level 3/4 cut-off, as language experts might not have sufficient insight into the TLU domain. How this would best be implemented is subject to trials. It is possible that industry specialists are best asked to make simple yes/no judgements rather than using complicated scale criteria. More work in this area is clearly necessary. As with most studies, there are a number of limitations to this research. Firstly, the sample size was small, with only ten pilots taking part in the study. Secondly, most of the pilots were from a very similar background, working in large multi-crew environments. None of the pilots were from a non-English speaking background and none were private pilots (mainly because private pilots in Australia have very little opportunity to interact with non-native speakers via radiotelephony). The speech samples that were selected were mainly produced by non-native speakers from Asian backgrounds. It would have been good to include a wider variety of speech samples, and this could be a suggestion for future research. The speech samples were selected to focus the participants on the ICAO Level 3/4 cut-off and it is therefore not clear how the results generalize to the other ICAO levels. However, two Level 5 performances were part of the speech samples and one of the speech


samples rated at Level 3 was a very low performance. A future larger study should focus on a wider range of performances. It is possible that the pilots focus on different features at different proficiency levels. It can further be argued that the use of speech samples collected from aviation tests is fundamentally flawed, as it is possible that the tests are not actually successfully tapping into what happens in the target language use domain and are therefore not representative of the construct (for a detailed discussion of the issues surrounding task types in aviation English tests refer to Moder & Halleck, 2009). If the test performances played to the pilots are not sufficiently representative of the TLU domain, then of course the indigenous assessment criteria elicited from the pilots might not be generalizable back to the TLU domain. However, the language produced during radiotelephony communications consists mostly of standard phraseology, with small parts of plain language when standard phraseology does not suffice. Because aviation English tests should mainly test candidates' ability to produce plain English in an aviation context, the speech samples from the selected tests were the best samples available for this research. For this reason, speech samples from a number of tests were chosen. A further domain of future research, as was pointed out by one anonymous reviewer, would be to elicit the indigenous assessment criteria of pilots and air traffic controllers using real pilot–ATC recordings. The study presented above is an attempt to involve stakeholders on a number of levels in post hoc validation work of an LSP rating scale. The findings show that it is possible to engage different stakeholder groups in such a process. The interpretations we make based on candidates' test performance have a good chance of being improved and made more generalizable if we engage in such validation work. However, it is clear that this study provides just one piece in the puzzle necessary to fully validate the ICAO rating scale.

References

Banerjee, J., & Taylor, L. (2005). Setting the standard: What English language abilities do overseas trained doctors need? Paper presented at the Language Testing Research Colloquium (LTRC), Ottawa, Canada, July 2005.
Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research. Cambridge: Cambridge University Press.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 1–15.
Davies, A. (2001). The logic of testing languages for specific purposes. Language Testing, 18(2), 133–147.
Douglas, D. (2000). Assessing languages for specific purposes. Cambridge: Cambridge University Press.
Douglas, D. (2001). Language for specific purposes assessment criteria: Where do they come from? Language Testing, 18(2), 171–185.
Douglas, D., & Myers, R. K. (2000). Assessing the communication skills of veterinary students: Whose criteria? In A. Kunnan (Ed.), Fairness and validation in language assessment. Selected papers from the 19th Language Testing Research Colloquium. Studies in Language Testing 9. Cambridge: Cambridge University Press.
Douglas, D., & Selinker, L. (1994). Research methodology in context-based second language research. In E. Tarone, S. Gass, & A. Cohen (Eds.), Methodologies for eliciting and analyzing language in context. Northvale, NJ: Lawrence Erlbaum Associates.
Elder, C. (1993). How do subject specialists construe classroom language proficiency? Language Testing, 10(3), 235–254.
Elder, C., Pill, J., Woodward-Kron, R., McNamara, T., Manias, E., Webb, G., et al. (2012). Health professionals' view of communication: Implications for assessing performance on a health-specific English language test. TESOL Quarterly, 46(2), 409–419.
Howard, J. W. (2008). Tower, am I cleared to land? Problematic communication in aviation discourse. Human Communication Research, 34, 370–391.
ICAO (2001). Annex 10: Aeronautical telecommunications. Communication procedures including those with PANS status (Vol. II). Montreal: International Civil Aviation Organization.
ICAO (2004). Manual on the implementation of the language proficiency requirements (Doc 9835). Montreal: International Civil Aviation Organization.
ICAO (2010). Manual on the implementation of ICAO language proficiency requirements (2nd ed.). Montreal: International Civil Aviation Organization.
Jacoby, S. (1998). Science as performance: Socializing scientific discourse through conference talk rehearsals. Unpublished doctoral dissertation. University of California, Los Angeles.
Jacoby, S., & McNamara, T. (1999). Locating competence. English for Specific Purposes, 18(3), 213–241.
Kim, H. (2012). Exploring the construct of aviation communication: A critique of the ICAO language proficiency policy. Unpublished doctoral thesis. University of Melbourne, Melbourne.
Kim, H., & Elder, C. (2009). Understanding aviation English as a lingua franca: Perceptions of Korean aviation personnel. Australian Review of Applied Linguistics, 32(3), 23.1–23.17.
Long, M. (2005). Methodological issues in learner needs analysis. In M. Long (Ed.), Second language needs analysis. Cambridge: Cambridge University Press.
Lumley, T. (1998). Perceptions of language-trained raters and occupational experts in a test of occupational English language proficiency. English for Specific Purposes, 17(4), 347–367.
McNamara, T. (1996). Measuring second language performance. London & New York: Longman.
McNamara, T. (2002). Discourse and assessment. Annual Review of Applied Linguistics, 22, 221–242.
McNamara, T. F. (1990). Assessing the second language proficiency of health professionals. Unpublished PhD thesis. University of Melbourne, Melbourne.
Moder, C. L., & Halleck, G. B. (2009). Planes, politics and oral proficiency: Testing international air traffic controllers. Australian Review of Applied Linguistics, 32(3), 21.1–21.11.
O'Sullivan, B. (2002). Some theoretical perspectives on testing language for business. Cambridge ESOL Research Notes, 8, 2–4.
Ryan, K. (2007). Assessing the OET: The nurse's perspective. Unpublished Masters thesis. The University of Melbourne, Melbourne.
Tiewtrakul, T., & Fletcher, S. R. (2010). The challenge of regional accents for aviation English language proficiency standards: A study of the difficulties in understanding in air traffic control–pilot communication. Ergonomics, 53(2), 229–239.
Turner, C. E. (2000). Listening to the voices of rating scale developers: Identifying salient features for second language performance assessment. The Canadian Modern Language Review, 56(4), 555–584.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49(1), 3–12.

Dr Ute Knoch is a senior research fellow and the Acting Director of the Language Testing Research Centre at the University of Melbourne.
Her research interests are in the areas of second language writing assessment, writing development, and assessing languages for academic and specific purposes.