Practical issues in the application of speech technology to network and customer service applications




Speech Communication 31 (2000) 279–291

www.elsevier.nl/locate/specom

David Attwater a,*, Mike Edgington a,b, Peter Durston a, Steve Whittaker a

a BT Laboratories, Mobility and Network Services, Martlesham Heath, Ipswich, Suffolk, IP5 3RE, UK
b SRI International, Menlo Park, CA, USA

Received 26 April 1999; received in revised form 14 June 1999; accepted 25 November 1999

Abstract

This paper proposes a simple model to characterise the different stages of short telephone transactions. It also discusses the impact of the context of the caller when entering an automated service. Three different styles of service were then identified, namely, large vocabulary information gathering, spoken language command and natural language task identification for helpdesks. By considering human dialogue equivalents, the requirements for each style are considered. Consequently, it is shown that each style pushes different technological limits. Three case studies, selected from current projects at BT Laboratories, are presented to highlight the practical design issues in these different styles. The styles and case studies presented are:
· Information gathering – UK name and address recognition.
· Spoken language command – network service configuration.
· Natural language helpdesks – BT operator services.
It is shown that large vocabulary information gathering systems require high accuracy, careful data modelling and well-designed strategies to boost confidence and accuracy. Spoken language command requires dialogue and grammar design and test complexity to be managed. Natural language task identification requires large volumes of training data, good learning algorithms and good data generalisation techniques. These styles can be mixed into a single interaction, meaning that design frameworks of the future will have to address all of the aspects of the different interaction styles. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Speech recognition; Dialogue modelling; Network service automation; Address recognition; Natural language processing; Semantic classification

1. Introduction

1.1. The model

* Corresponding author. E-mail address: [email protected] (D. Attwater).

1.1.1. Caller context

The context of a call to an automated service is very important. We note two important related dimensions:

0167-6393/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0167-6393(99)00062-X


· Victim or volunteer – was the caller expecting automation, or were they unsuspecting victims?
· Frequent or infrequent – is the caller well primed and experienced, or do they rarely call the service?
It is the clear experience of the authors that these two dimensions strongly dictate what can be achieved, and in what style, for a given service. It is also extremely common for these two dimensions to pair up into frequent volunteers and infrequent victims. By definition, frequent callers to a service will quickly come to expect automation and become volunteers if they continue to call. The term victim is deliberately emotive. In the UK, IVR services, especially those based on touch tone, are widely disliked when callers are not expecting them (Attwater et al., 1998a). Early indications are that acceptance of dialogue-based speech recognition systems is higher, but there are currently no well-established norms for talking with machines. Consequently, spoken language behaviour from callers who have not been primed for a service can be difficult to predict.

1.1.2. Four-layer call handling model

There are typically four phases during a transaction with a service:
· Problem specification – in which the problem to be solved is identified.
· Task identification – in which the customer intent is identified within the framework of available services.
· Information gathering – in which all details necessary to achieve the task are gathered from the customer.
· Task completion – in which the customer receives the service or information they require.
In practice, when a customer calls a human agent there is often significant overlap between these various phases. For example, there may be several stages of negotiation in order to discover the actual problem experienced by the customer, during which several potential services may be offered to the customer. Fig. 1 shows a real call to a BT international operator, annotated into these four phases.
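As a rough illustration (not from the paper), the four phases can be represented as an enumeration used to annotate dialogue turns; the sample call and the helper function below are hypothetical:

```python
from enum import Enum, auto

class Phase(Enum):
    PROBLEM_SPECIFICATION = auto()
    TASK_IDENTIFICATION = auto()
    INFORMATION_GATHERING = auto()
    TASK_COMPLETION = auto()

# Hypothetical annotated call, loosely in the spirit of Fig. 1.
call = [
    ("I'm trying to reach a number in France but it won't connect",
     Phase.PROBLEM_SPECIFICATION),
    ("So you'd like me to place the call for you?", Phase.TASK_IDENTIFICATION),
    ("What is the number, please?", Phase.INFORMATION_GATHERING),
    ("Connecting you now", Phase.TASK_COMPLETION),
]

def phases_in_order(annotated_call):
    """Return the distinct phases in order of first appearance."""
    seen = []
    for _, phase in annotated_call:
        if phase not in seen:
            seen.append(phase)
    return seen
```

In a real call the phases overlap and recur, so an annotation scheme would attach a phase to each turn rather than assume a strict sequence.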
This model is helpful for analysing operator-based and automatic interactions. It is important to note that the first two phases of a transaction

may also be implicitly satisfied. For example, the BT directory enquiries service on ``192'' uses a human operator to achieve information gathering and then automates the task completion phase by use of recorded number announcement. Since the directory enquiry service is very well known, the first two phases are implicitly fulfilled when the customer dials the ``192'' access number.

1.2. Dialogue styles

This paper suggests that the point in the model at which a caller engages a dialogue system will be a deciding factor in the style of dialogue that the caller and agent conduct. The style may even change as the caller advances through the stages of the four-layer model. We propose a progression of dialogue styles, based on the patterns of dialogue which have been observed to be successful. These are:
· Information gathering – ``answer the question''
· Command – ``tell me what to do''
· Helpdesk – ``what's the problem?''

In information gathering dialogues it is often satisfactory for the agent to take the initiative and follow a structured question and response style (Bennacef et al., 1995). Once the information is gathered, the task may be completed. Automation of this style of dialogue lends itself to large vocabulary isolated word speech recognition and highly structured dialogues. In command dialogues the caller will take the initiative and give a clear direction initially to the agent, which is often followed by information gathering. A key element of this style of dialogue is that callers may spontaneously give the whole command and information in a single utterance. Automation of this style lends itself to hand-coded finite state speech recognition grammars and slot-filling natural language style dialogues. Finally, with assistance dialogues, callers tend to take the initiative and describe the problem which they are experiencing to the agent, with the expectation that the agent will propose a potential solution. Once agreed, subsequent information gathering may occur. An important element of


Fig. 1. Straightforward call to an international operator annotated as four-layer call handling model.

these helpdesk styles of dialogue is that the language callers use to describe problems can be complicated, but the solution space is often limited. Automation of this dialogue style lends itself to statistical language modelling techniques and robust topic identification with very flexible dialogues. Finally, we recognise another style of dialogue – enquiry dialogues. These are very similar to command dialogues, with callers taking the initiative by asking a well-formed question, followed by information gathering of missing details and the offering of the information. This area has been extensively investigated in areas such as air or rail travel information and is not mentioned further in this paper. These dialogue styles are not mutually exclusive, but represent a progression of increasing levels of sophistication as callers are given more scope to express themselves with a wider range of language. Each progressive layer is likely to require all the features of the previous layer. The following sections present three case studies exploring the key points of these styles in more

detail. All of the systems described use BT's STAP speech recognition toolkit (Attwater and colleagues, 1998).

2. Case study – transcription of UK addresses

2.1. Introduction – information gathering

As has been discussed briefly, given that the problem and task are clear between caller and agent, subsequent information gathering often follows a structured dialogue that is almost entirely agent led. In many helpdesk instances this is even reinforced by an explicit script. Isolated word recognition and a structured dialogue may then suffice for automated information gathering speech dialogues. Such systems often require very large vocabulary sizes of proper nouns, e.g., names and addresses, or alpha-numeric grammars. Current speech recognition algorithms benefit from keeping the perplexity of the language model low, therefore ``natural language'' capabilities, such as multiple feature entry in a single utterance,


increase design complexity and often reduce recognition accuracy. The similarity of dialogue between automated services and agent-based services means that asking for information one field at a time is also an approach suitable for use with frequent or infrequent, victim or volunteer callers. This case study discusses an information gathering task – a study of the recognition of UK addresses given by infrequent volunteer callers. It illustrates how an isolated word approach is suitable for very large vocabulary information gathering tasks, and discusses the language behaviour of callers and the need to optimise for recognition accuracy and confidence.

2.2. UK address recognition task

In many telephone-based commercial transactions, a call-centre agent is required to enter the address of a caller. This is in many ways analogous to the directory enquiries problem (Attwater and colleagues, 1998) – except that the caller will be providing information about themselves, rather than a third party. They can be expected not just to know additional information, such as their postcode, but also to know such information with a high level of confidence. In many cases, such as requests for catalogues, census data collection (Cole et al., 1997) and address changes for loyalty card management, the transaction is straightforward and can even be handled off-line. Fig. 2 shows a typical UK address – made up of a house number, road name, postal town, county and postcode. In many cases this information is

Fig. 3. Form of UK postcodes.
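As an illustrative sketch only, the alpha-numeric shape shown in Fig. 3 can be approximated by a simple pattern; like the grammar described in the paper, this pattern over-generates, accepting strings that are not real postcodes:

```python
import re

# One or two letters, a digit, an optional digit or letter, a space,
# then a digit and two letters -- deliberately over-generating, as in Fig. 3.
POSTCODE_SHAPE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]?\s[0-9][A-Z]{2}$")

def looks_like_postcode(text: str) -> bool:
    """Check only the shape of a candidate postcode, not its existence."""
    return bool(POSTCODE_SHAPE.match(text.strip().upper()))
```

A deployed system would additionally validate candidates against the actual set of postcodes in use.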

supplemented by house names, flat numbers, apartment block names and village or locality names. There are over 30,000 place-names in common usage in the UK, relating to over 10,000 communities ranging from small villages to major conurbations. Unlike the US and many European countries, postcodes in the UK are not numeric but alpha-numeric and have the form shown in Fig. 3. (Note that the space is not completely occupied and this grammar over-generates with respect to the actual set of postcodes in use.)

A prototype system was constructed to investigate the appropriateness of speech recognition for transcribing UK addresses (Attwater et al., 1998b). A data collection was undertaken through a competition in a BT internal newspaper, and 2000 addresses were collected from callers throughout the UK. Callers were asked to leave the information both as isolated fields and in a fluent style as if addressing an envelope. The experimental results outlined below refer to the isolated field data. The data was split into a ``training set'' (1500 calls) and ``test set'' (500 calls). The test set was further segmented into two portions: straightforward and difficult calls. The difficult calls (48%) contained out-of-vocabulary responses, incomplete calls, extraneous words and all London calls (London addresses have special requirements not fully addressed by the study). The straightforward calls contained the remaining well-formed responses.

2.3. User language behaviour

Fig. 2. Typical UK address.

It was important to establish whether callers could say information reliably as isolated utterances in the structured dialogue style. It was found

Table 1
Frequency of clarification terms

Frequency   Keyword
32          Sugar
16          Freddie
12          Apple, Mother
11          Peter
9           Tommy
7           Edward, Robert
6           Charlie, Norman
5           Echo, November

that for spoken proper nouns, around 12% of callers spoke more than just the word desired. This was generally words such as ``it's'' and occasional spontaneous spelling following surnames. In addition, about 4% of callers sought to clarify alpha-numeric entries, such as postcodes, with pseudo-phonetic alphabets (e.g., ``s for sugar'') – often corresponding to predictable confusions such as the e-set, s–f and m–n. See Table 1.

The UK is split into administrative regions referred to as counties. These require careful modelling; e.g., many people quote county names which have been formally defunct for decades, e.g. ``It used to be Cleveland, I think it's North Yorkshire now'' (often due to latency and regional loyalty!). Other synonymous forms included abbreviations (``Herts'' for ``Hertfordshire'') and the use of conurbations rather than counties (``Birmingham'' rather than ``West Midlands''). User language behaviour therefore indicates that accurate modelling of spontaneously helpful behaviour and good data modelling to capture aliases is important when gathering information about proper nouns.

2.4. System performance

Recognition grammars were implemented for each of the isolated fields, with care being taken over the pronunciation lexicons. Only the training set was used to optimise pronunciation lexicons and grammars. No acoustic model re-training was performed. The vocabularies were very large and the underlying recognition performance reflected this. Using the 88% of postcodes in the test data in which callers said well-formed postcodes, the correct postcode was recognised 66% of the time. The remaining poor quality postcode utterances were recognised 43% of the time. The lower result was caused by background noise or ill-formed utterances, such as postcodes containing phonetic spelling. County synonyms were carefully modelled and the optimal county vocabulary was found to be 160 words. This gave a coverage of about 95% (i.e., 95% of counties in the training set are in the vocabulary). The test results based on this vocabulary gave an in-vocabulary accuracy of 93%, which falls to 90% when out-of-vocabulary utterances are included in the test. For a 1000-road vocabulary an accuracy of 86% was observed. This level of performance was only achieved after very careful attention was paid to the pronunciation models for the vocabulary. This highlights the importance of good pronunciation models for the large proper noun vocabularies in common use. It is important to note that top-10 accuracies for these tasks were significantly higher and that these data fields are not independent from one another. These two facts enable much higher composite accuracies to be achieved. The architectural framework and ``track'' approach used in BT's Brimstone directory application (discussed in detail in (Attwater and colleagues, 1998; Attwater and Whittaker, 1996)) was used to control the propagation of candidate lists and scores across dialogue turns and to produce a set of competing database hypotheses against a UK address database. Comparisons of the resulting hypotheses were used to categorise the transcriptions into three confidence bands – ``high'', ``medium'' and ``low''. Table 2 shows the banded accuracy of the transcriptions (in terms of the correct postcode being in the final postulated address) within each categorised group for each test set. The figures show the proportion of calls from each test set which were categorised into each confidence level and the accuracy within those groups.

2.5. Address recognition conclusions

The accuracy achieved was promising, given the difficulty of the task. General conclusions relating


Table 2
Accuracy of postcode task, using additional fields to boost confidence and accuracy

                   High confidence      Medium confidence    Low confidence
                   Calls   Accuracy     Calls   Accuracy     Calls   Accuracy
Straightforward    43%     97%          22%     72%          35%     61%
Difficult          20%     81%          23%     71%          57%     56%
All                35%     92%          28%     72%          37%     52%

to information gathering parts of dialogues may be drawn. Isolated word entry does not seem to present a problem for volunteer callers who are familiar with the information. Careful attention, however, needs to be paid to modelling spontaneously helpful behaviour (such as phonetic spelling) in the speech recognition grammars. Also, data models often need to be built to relate the spoken vocabulary of the task to the underlying data (e.g., synonyms for counties). In addition, it is found that with large vocabularies, poor recognition performance can often be compensated for by utilising top-N information from the recogniser and utilising the known redundancy in the relationships between fields.
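The use of top-N lists and cross-field redundancy can be sketched as follows. The scores, field names and toy database here are invented for illustration and are not the actual scheme used in the framework described above:

```python
# Hypothetical n-best output for two fields (score = recogniser confidence).
postcode_nbest = [("IP5 3RE", 0.40), ("IP5 3RB", 0.35), ("IP5 3PE", 0.25)]
county_nbest = [("Suffolk", 0.7), ("Norfolk", 0.3)]

# Toy "database" of valid (postcode, county) pairs.
database = {("IP5 3RE", "Suffolk"), ("NR1 1AA", "Norfolk")}

def best_consistent(pc_nbest, county_nbest, db):
    """Rank joint hypotheses, keeping only those consistent with the database."""
    candidates = [
        (pc_score * c_score, pc, county)
        for pc, pc_score in pc_nbest
        for county, c_score in county_nbest
        if (pc, county) in db
    ]
    return max(candidates)[1:] if candidates else None
```

Even when the top postcode hypothesis is wrong, a lower-ranked hypothesis that is consistent with a confidently recognised county can win, which is how redundancy between fields boosts composite accuracy.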

3. Case study – natural language control of BT network services

3.1. Introduction – command dialogues

Where agents or automated services perform one of a number of tasks, the tasks required need to be identified before information gathering and completion can occur. In many instances a caller may have pre-decided the task, in which case problem specification has happened implicitly. This may, e.g., be a caller ringing an agent to set up a direct debit, or a call to change a network service option. We term this style of dialogue ``command'' as it implies a dialogue style with strong caller initiative and a caller with a well-established task model. The high degree of caller initiative in this style of interaction suggests that, excluding hierarchical menu designs, more natural dialogues are required. For example, enabling a caller to say ``give me an alarm call at three please'' will provide a

very natural form of interface. Notice that in this case the task selection and information gathering are simultaneous and spontaneous. Where services are too numerous to list explicitly, the caller must be well primed. This will rarely be possible with infrequent victim callers unless the tasks are universally known by well-known words or phrases. Command dialogues are best suited to situations with frequent volunteer callers. Techniques for building natural command dialogues tend to be implemented using quite complex hand-designed dialogues and finite state or statistical grammars discovered from a corpus. The relationship between language and meaning is usually hand-coded into parse rules of some form. This raises a number of issues:
· Dialogues can be complex or difficult to represent.
· The degree to which the user can arbitrarily shift the focus of the dialogue must somehow be made explicit.
· It can be harder for the user to identify the scope and capabilities of the system.
· The greater number of paths through the dialogue makes the specification of prompts more complex.
· Testing can be difficult as the number of dialogue paths can grow rapidly.
In this section we present a case study of how a command style dialogue was designed to provide natural language access to the control of BT network services. A particular architecture and methodology is described which was designed to address the issues described above.

3.2. BT network services – the Freedom system

BT Select Services™ are a range of value added services offered on BT residential lines in the UK.


They permit callers to perform tasks such as setting reminder calls, barring incoming and outgoing calls, giving the number of the last incoming caller and simple voice mail. They are currently accessed using TouchTone commands. The Freedom application (Attwater et al., 1997) is a prototype spoken language system, which was developed to investigate the use of structured dialogue techniques combined with the flexible entry of information using fluent speech recognition. An example shows the style of the system:

System: How may I help you?
Caller: Can I have a reminder call on Tuesday please?
System: Set reminder call, thank you. You currently have a single reminder call booked at five thirty on Thursday morning. Would you like to replace it?
Caller: Yes please
System: What time on Tuesday would you like the reminder call?
Caller: Three thirty
System: Is that in the morning or the afternoon?
Caller: The morning
System: A single reminder call will be booked at three thirty on Tuesday morning. Is this OK?
Caller: Yes
System: Confirmed. Reminder call booked...

The system has the flexibility to deal with both fluent and isolated modes of speech. In the former, the caller can choose to spontaneously give all of the required information and only confirmation is needed. In the latter the caller responds more simply (perhaps with a single phrase such as ``reminder call'') and hence must be guided through the dialogue in a structured fashion. The architecture therefore can support isolated word information entry, such as discussed in the previous section.

3.3. System architecture

Fig. 4 shows a simplified system architecture for the Freedom system. This includes a parser, blackboard and prompt generation subsystems to cater for the additional complexities of a fluent system. All of the components were found to be necessary to conduct a viable natural command dialogue.

Fig. 4. Simplified Freedom architecture.

The blackboard contains three logical stores of information:
· Task feature values – a cumulative record of all of the information entered by the user within the interaction.
· Dialogue history – key dialogue events in the interaction.
· Customer data – persistent data corresponding to the user profiles, e.g., service options.

3.4. Recogniser/parser

In the Freedom system, the recognition network grammar and the parser grammar were derived from a single definition. This meant that the recogniser could not return sentences that the parser did not know how to parse. The grammar designs in Freedom were represented as hand-coded context-free word-based grammars. For the parser, non-terminals and terminals in the grammar could be associated with values for pre-defined slot types. Simple inference and algebraic rules were also permitted. This approach appears to be sufficient for command style dialogues with predictable syntax and vocabulary. Multiple grammars


were defined for different parts of the dialogue (see below), but they shared common portions where appropriate to keep design complexity to a minimum. In command dialogues, iterative trialling can be used to increase grammar coverage.

3.5. Assumptions and inference

Blackboard-based inferences were found to be vital in the Freedom system in order for it to maintain a usable dialogue and avoid redundant or annoying questions to the caller. In addition, a similar mechanism to inference, assumptions, was also found to be very helpful. Inference rules were used to codify relationships between feature-values which are always true. Their role is to eliminate redundant questions based on simple logical relationships. Consider gathering information about the time, for example. An afternoon caller requesting a reminder call at ``four o'clock today'' must be referring to four o'clock that afternoon, and the relevant feature for a.m./p.m. may then be inferred to be ``p.m.''. Assumptions were similar to inferences. They codify relationships between feature-values which are usually true in the context of the application. They provided a way of offering callers default behaviour, resulting in faster interactions for the majority of callers. For example, callers were assumed to want a single reminder call today if they requested a reminder call at a certain time without mentioning a day or frequency of repetition. The initial user analysis indicated that this assumption would suit most callers. The dialogue allowed callers to correct assumptions at the confirmation stage if they were found to be incorrect.
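The distinction between the two rule types might be sketched as follows; the feature names and rules are hypothetical stand-ins for Freedom's actual rule set:

```python
def infer_am_pm(features, now_hour):
    """Inference: an always-true relationship. 'Four o'clock today', spoken in
    the afternoon, can only mean 4 p.m. -- the a.m. reading has already passed."""
    h = features.get("hour")
    if features.get("day") == "today" and h is not None and "am_pm" not in features:
        if now_hour >= 12 and h < 12:
            features["am_pm"] = "pm"
    return features

def assume_defaults(features):
    """Assumption: a usually-true default the caller can later correct.
    No day or repetition mentioned -> a single call today."""
    features.setdefault("day", "today")
    features.setdefault("repeat", "single")
    return features
```

The practical difference is that an inferred value never needs confirming, whereas an assumed value is surfaced at the confirmation stage so the caller can override it.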

Fig. 5. Target block.

3.6. Dialogue modelling

The dialogue modelling used a standard graphical application for constructing finite state dialogues (BT Visage). This implied a hierarchical approach to dialogue design. The smallest unit of dialogue was the question block. Each question block logically asks a particular question, though the exact wording may be different, and has its own recognition and parse grammar. Therefore, a question block can be expected to return one or more of a pre-defined set of features and values. A number of question blocks are used to construct ``target blocks'' – each expressing a dialogue target such as the service in question, the time of a reminder call or its regularity (Fig. 5). The target block aims to achieve its completion criteria by ascertaining the value of its target before exiting. Target blocks can also log additional information given by the caller onto the blackboard. In this way, task feature values are collated in slot-filling fashion. If a target block is instantiated and its completion criteria can already be met given information on the blackboard, it exits immediately via a mechanism termed ``fall-through''. It is fall-through which enables a relatively inflexible hierarchical dialogue structure to model more complex dialogue behaviour. The final dialogue is constructed by combining target blocks (Fig. 6). In this way, the dialogue provides a preferred order in which targets will be sought, whilst allowing the user to answer open questions in a very fluent fashion.

Fig. 6. Chaining targets to give a dialogue structure based on sequential goals.

3.7. Prompt generation

In structured dialogues the path to any given dialogue node is predictable and each message will be preceded by one of a small set of pre-determined prompts. However, in a fluent system such as Freedom, the order in which information is given by the user can vary substantially, and the fall-through mechanism also increases the number of potential dialogue paths. To cater for this, the prompt generation component makes use of dialogue history (nodes visited) and task feature-value pairs (the information given to date by the caller) to manage increasing complexity and enable anaphoric effects in the speech output. In structured dialogue design, confirmation tends to be either explicit (``Did you say Edinburgh?'') or implicit (``Edinburgh, thank you''). In the Freedom system, to aid the smooth flow of the dialogue, we also make use of embedded confirmation – as in the phrase ``What time on Tuesday would you like the reminder call?'' used within the dialogue segment above. This reassures the caller that the initial input of the day has not been lost, without breaking the flow. In Freedom, pre-recorded prompts were concatenated to give very high quality speech output. Text-to-speech increases the flexibility for prompt generation but can dramatically reduce acceptability (McInnes et al., 1999). It was found to be possible to build a complex dialogue with a manageable number of recorded prompts. This issue may be the limiting factor in command dialogues for some time.

3.8. Conclusion – control dialogues

Control dialogues demand more user initiative and therefore more complex dialogue modelling. Given frequent callers who can to some degree learn the available sublanguage and dialogue, very natural dialogues can be offered with relatively simple technology. Of greater importance, this keeps the design complexity down and reduces the lead time for deploying such systems. Speech output may become one of the limiting factors for applications of this kind – also reinforcing that sophisticated natural language may not be appropriate in many cases.

4. Case study – BT operator services

4.1. Introduction – helpdesk dialogues

What happens when a caller has a general enquiry or problem and wishes to get help from an agent on a helpdesk? These calls are generally infrequent. Callers may have a clear idea of the task they wish to be performed (i.e., a command dialogue); however, often their language behaviour is difficult to predict and the relationship between this language and the required task is less than obvious. Frequently, task names will not be known to the caller or are too numerous to enumerate. In addition, callers will often present their own problem scenario to the agent rather than directly request a solution. The authors are not aware of any currently existing systems that consciously address the problem specification phase, though this is crucial to many customer handling calls. Commercial case-based reasoning tools designed to improve operator efficiency come closest to this, but currently have no speech interface. A few research studies are addressing aspects of the problem, namely, the AT&T ``How May I Help You'' project (Gorin et al., 1997) and the Lucent call steering banking trials (Lee et al., 1998). Successful techniques for addressing this problem have focused on the use of statistical language models and robust topic identification learned from an example corpus. This contrasts with more traditional


hand-coding of grammars and templates or parse rules. The statistical approach lends itself well to task identification and frequent problem specifications. Once the task is identified, more traditional information gathering and command style solutions may be used. This section presents a case study considering the automation of operator services in BT – the OASIS project.

4.2. Operator services within BT

BT has the largest call centre capability in Europe, with around 115 call centres dealing with BT customer calls alone. These are staffed by the equivalent of nearly 20,000 full time operators, taking over 1,000,000,000 calls per year. The most general contact point for BT is the operator assistance (OA) service, accessed through the well-known ``100'' code. Calls to OA operators cover an extremely wide range of topics, including: simple requests for information, malicious or inappropriate calls, explicit requests for non-OA services, requests for various BT services and various miscellaneous calls such as confusing requests, confused customers and the plain odd. Since the OA service has such a broad functionality and customer profile, it was seen as a very challenging case study for helpdesk style dialogues.

4.3. OASIS database collection

In order to get an understanding of the language and dialogue behaviour of customers on the OA service, a data collection exercise was undertaken. The first phase collected almost 1000 calls to the Cambridge OA centre over the course of a typical week, using an analogue connection. All calls were fully transcribed, including hesitations and restarts, and classified into detailed semantic classes. This database has been used in the pilot study to investigate general issues of dialogue structure, trends in language use and simple classification strategies. A second database is currently in preparation containing around 25,000 calls to the OA service over a one-month period.
This has been collected digitally and is therefore suitable for recognition experimentation and the training of

®ne-grained statistical language models and topic identi®cation models. 4.4. Language characteristics and speech recongition 4.4.1. Vocabulary growth The transcription of the initial customer utterance was analysed to give information about growth of the customer vocabulary. Fig. 7 shows how the vocabulary (i.e., number of distinct words) grows with the number of calls observed. Since the detailed shape of the vocabulary growth curve is determined by the actual order of call arrival, the graph shows a smoothed plot of the average vocabulary size over several permutations of call order. A total of 1228 distinct words were observed across all initial customer utterances, and as can be seen from the gradient, after all 752 calls vocabulary growth is still around 0.8 new words/ call. Simple extrapolation of this trend (Gorin et al., 1997) predicts a vocabulary size of around 4500 words after 10,000 calls, growing at around 0.25 new words/call. 4.4.2. Speech recognition Given the language complexity of the initial utterance in this task, hand-crafted speech recognition grammars are not feasible. Recognition language models must be learned from a corpus of examples. It has also been shown above that sizeable data sets will still not capture all possible vocabulary or usage, therefore generalisation based on the training set will be bene®cial. Statistical N-grams with smoothing have been investigated by the authors for this purpose and similar approaches have been taken by other similar

Fig. 7. Vocabulary growth of the pilot study OA calls.

D. Attwater et al. / Speech Communication 31 (2000) 279–291

studies (Gorin et al., 1997; Lee et al., 1998). Recognition accuracies are not available at the time of publishing, but state-of-the-art figures for conversational speech generally have high word error rates, around 40–60% (Peskin et al., 1997). The speech in the OASIS corpus contains large amounts of disfluency (mostly ums, ers and restarts) and initial indications are that error rates will be similarly high.

4.5. Classification of the initial utterance

4.5.1. Types of request

By considering the detailed semantic classes, it was possible to classify the customer's initial response into one of four broad request types, which are summarised and exemplified in Table 3. This can be compared to the simpler taxonomy of short initial user utterances in a banking application described in (Lee et al., 1998). These request types can also be described with reference to the four-layer model of customer call handling in Section 2:
· A – an explicit named service request, where the customer makes a direct request for a specific service. The customer has resolved the problem specification stage and offers their solution to the task identification phase to the operator.
· B – an implicit service request, where the customer may give details of the problem without explicitly asking for a specific service. This is similar to A, but the customer does not explicitly cooperate in the task identification phase; they immediately request a specific solution


instead, and this request often contains elements of information gathering as well. In this case the operator will generally explicitly complete the task identification phase by way of confirmation.
· C – general problem description, where the customer is unaware of what service they require, but knows that the operator should be able to help. The customer is at the problem specification phase and expects the operator to engage in a dialogue in order to move to the task identification stage.
· D – other. There is evident confusion within the problem specification phase or about what the operator can do.

Table 3
Primary request types in OA calls

Request type  Description                    Examples
A             Explicit service request       ``Can I er, could you put me through to directory enquiries please''; ``Can I want a reverse charge please''
B             Implicit service request       ``Can I have the number for the er Probation Office on Dover Street, S E 1''; ``I'm very sorry is this 0 8 3 1 a mobile number''
C             General problem specification  ``Hi my name is XXXX XXXXXX and we have a problem here, there's someone whose trying to er call er who is calling us the whole time trying to fax us something, we haven't got a fax machine so must, must have the wrong number, it's been going on the whole day''
D             Others                         ``Yeah I wanted them er er to find out the how to spell a place that I wanted to send a telegraph to in Cornwall please''

The proportion of these different classes in the pilot data is shown in Fig. 8. Almost half of all calls are problem specifications, making this a very challenging application.

Fig. 8. Proportion of initial requests falling into the different classes of enquiry.

4.5.2. Utterance length

The transcription of the initial customer response was analysed for language use. The initial customer response is defined as the customer's utterance following the operator greeting, up until the point of the operator's next productive response. Any operator utterances which act merely as a confirmation that they are listening without interrupting the customer are ignored (e.g., ``uh-ha'', ``yea'', etc.). The detailed semantic classifications of initial customer utterance were grouped into broader classes, aligning with the more common tasks which operators are called to perform. The more common of these classes were: reverse charge calls, booking an alarm call, erroneous requests for directory assistance, coins stuck or lost in payphones, problems getting through to a particular number, and all remaining requests which have not been given a class. The distribution of initial customer utterance length in words for these different classes of request is shown in Fig. 9. Across all classes, the average utterance length is 17.7 words with a median of 13.0 words. The useful range is from 1 word (``hello'') to 163 words (a caller having problems reaching the Italian tax office). There were a small number of calls where the customer never spoke – presumably the customer dialled ``100'' in error. These results show strong similarity to the AT&T study (Gorin et al., 1997). It is interesting to note the different utterance lengths of the different types of request. Reverse charge and alarm calls are well-known service names and callers tend to request them succinctly.
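The length statistics above reduce to a simple word count over the transcribed initial utterances. The sketch below shows the computation on a handful of invented stand-in utterances (not OASIS data), so its figures naturally differ from the corpus mean of 17.7 and median of 13.0 words.

```python
# Word-count summary of initial customer utterances. The utterances are
# invented examples, not OASIS data; the real corpus gave a mean of 17.7
# and a median of 13.0 words.
from statistics import mean, median

utterances = [
    "hello",
    "can i want a reverse charge please",
    "er i put fifty pence in the payphone and it's not come back",
    "i've been trying to get through to a number in italy all morning",
]

lengths = [len(u.split()) for u in utterances]
print(min(lengths), max(lengths))       # shortest and longest utterance
print(mean(lengths), median(lengths))   # mean and median length in words
```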

Fig. 9. Distribution and cumulative total of initial customer utterance length in words.

These are command style requests with explicit, well-known service names (type A). Directory assistance requests usually take the form of a direct request for a particular number; directory assistance is rarely asked for directly. This is therefore a command task without the explicit use of a service name (type B). The lost money and line test categories mostly represent problem specifications (type C). Correspondingly, utterance lengths are longer and more complicated. Finally, the ``other'' category represents a range of utterance lengths, including more involved problem specifications on a range of different topics (mostly types C and D).

4.5.3. Topic identification

For similar reasons to the need for learned language models, the authors have taken the approach of learning the relationship between the language of the initial utterance and the class of task required. The use of advanced topic identification techniques to identify many of the type C requests is the subject of ongoing research. The OASIS project took an information-theoretic approach to classification broadly similar to that described in (Gorin et al., 1997; Garner, 1997). Preliminary results for the OASIS pilot data are shown in Fig. 10. It shows the text-based performance of the classifier using the pilot database for training and test. The classifier is trained on 80% of the data with the set of classes discussed above. Correct rejection occurred when the classifier correctly identified the ``other'' class as the highest ranking class. This set of curves shows the recall

Fig. 10. Preliminary call classi®cation ROC curve showing correct acceptance versus false rejection.


performance of the classifier on seen data (i.e., the training set) and the classifier recognition performance on unseen data (the remaining 20% test set). It can be seen that the performance on unseen data is promising, with around 93% recognition confidence at 30% false rejections. Note that the probability that the correct classification result is in the top-2 choices is over 97%. The dialogue-based nature of the task means that this information can be exploited using confirmation and disambiguation subdialogues. Performance of the algorithm with real recognition rates is yet to be determined.

5. Conclusions

This paper has presented a simple model to characterise different stages of short telephone transactions. It also discussed the impact of the context of the caller when entering an automated service. Three different styles of service were then identified, namely, large vocabulary information gathering, spoken language command and natural language task identification for helpdesks. These three styles of interaction mirror the different types of dialogue that humans use in these instances. Each style, however, pushes different technological limits. With the aid of three case studies, some of these design issues were investigated. Large vocabulary information gathering systems require high accuracy, careful data modelling and well-designed strategies to boost confidence and accuracy. Spoken language command requires dialogue and grammar design and test complexity to be managed. Natural language task identification requires large volumes of training data, good learning algorithms and good data generalisation techniques. It is perfectly feasible to have all three styles as different phases of the same interaction. For this reason, successful design frameworks of the future will have to address all of the aspects of the different interaction styles.


Automation approaches for helpdesks where callers present problems to be resolved, rather than request solutions, are an active area of research. Careful analysis has been shown to be important before selecting a particular technical solution and style of dialogue.
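The confirmation and disambiguation subdialogues mentioned in Section 4.5.3 suggest one concrete policy over ranked classifier output: confirm a confident top choice, disambiguate between the top two when the correct class is very likely among them, and hand over to a human operator otherwise. The sketch below is a hypothetical illustration of such a policy; the thresholds, dialogue-act names and class labels are invented, not part of the OASIS design.

```python
# Hypothetical confirmation/disambiguation policy over ranked classifier
# output. Thresholds, act names and class labels are invented for
# illustration; they are not taken from the OASIS project.
def next_dialogue_act(ranked):
    """ranked: list of (task_class, confidence) pairs, best first."""
    top_class, top_conf = ranked[0]
    if top_conf >= 0.9:
        return ("confirm", top_class)            # e.g. "A reverse charge call?"
    if len(ranked) > 1 and ranked[0][1] + ranked[1][1] >= 0.9:
        # Correct answer is very likely in the top two: offer both choices.
        return ("disambiguate", [ranked[0][0], ranked[1][0]])
    return ("handover", None)                    # pass to a human operator

print(next_dialogue_act([("reverse_charge", 0.95), ("alarm_call", 0.03)]))
print(next_dialogue_act([("reverse_charge", 0.55), ("alarm_call", 0.40)]))
```

A policy of this shape is one way a classifier whose top-2 accuracy exceeds its top-1 accuracy, as reported above, could still drive a usable dialogue.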

References

Attwater, D.J., Whittaker, S.J., 1996. Large vocabulary access to corporate directories. In: Proceedings of the IOA 18, Part 9.
Attwater, D.J., Greenhow, H.R., Fisher, H.R., 1997. Towards fluency – structured dialogues with natural speech input. In: Proceedings of AVIOS 1997.
Attwater, D.J., Edgington, M.D., Durston, P.J., Cape, L., 1998a. Popping the question – natural language for customer handling applications. In: Proceedings of Voice Europe, 1998.
Attwater, D.J., Greenhow, H.R., Durston, P.J., 1998b. What's in an address? Issues in UK address recognition. In: Proceedings of AVIOS 1998.
Attwater and colleagues, 1998. In: Westall, Johnston, Lewis (Eds.), Speech Technology for Telecommunications. BT Telecommunications Series, Vol. 11. Chapman & Hall, London.
Bennacef, S.K., Neel, F., Maynard, H.B., 1995. An oral dialogue model based on speech acts categorisation. In: Proceedings of the ESCA Workshop on Spoken Dialogue Systems: Theories and Applications, Vigsø, May 1995.
Cole et al., 1997. Experiments with a spoken dialogue system for taking the US census. Free Speech Journal, Issue 3. http://cslu.cse.ogi.edu/fsj/issues/issue3/main.html.
Garner, P.N., 1997. On topic identification and dialogue move recognition. Computer Speech and Language 11, 275–306.
Gorin, A.L., Riccardi, G., Wright, J.H., 1997. How may I help you? Speech Communication 23, 113–127.
Lee, C.H., Carpenter, R., Chou, W., Chu-Carroll, J., Reichl, W., Saad, A., Zhou, Q., 1998. A study on natural language call routing. In: Proceedings of the Workshop on Interactive Voice Technology for Telecommunication Applications, Turin, pp. 37–42.
McInnes, F., Attwater, D., Edgington, M., Schmidt, M., Jack, M., 1999. User attitudes to concatenated natural speech and text-to-speech synthesis in an automated information service. In: Proceedings of Eurospeech, Budapest, 1999, to be published.
Peskin, B., Gillick, L., et al., 1997. Progress in recognising conversational telephone speech. In: ICASSP-97.