Speech Communication 15 (1994) 341-353
Spontaneous speech dialogue system TOSBURG II and its evaluation

Shigenobu Seto a,*, Hiroshi Kanazawa a, Hideaki Shinchi b, Yoichi Takebayashi c

a Toshiba Corporation, Kansai Research Laboratory, 6-26, Motoyama-Minami-cho 8-chome, Higashinada-ku, Kobe-shi, 658 Japan
b Toshiba Software Engineering Co. Ltd., 1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki-shi, 210 Japan
c Toshiba Corporation, Research and Development Center, 1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki-shi, 210 Japan

* Corresponding author. E-mail: [email protected]
Received 11 May 1994; revised 15 June 1994
Abstract

We have developed a spontaneous speech dialogue system, TOSBURG II, employing keyword-based spontaneous speech understanding and multimodal response generation with adaptive speech response cancellation. In multimodal interaction the user often understands the system's response from the visual output before the speech response is completed, and therefore often interrupts the system's speech response. Our adaptive speech response cancellation facilitates natural human-computer interaction by allowing such interruptions. We have also developed an evaluation environment for dialogue data collection and for evaluating the performance of TOSBURG II. Unlike conventional data collection systems, TOSBURG II collects in this environment not only speech data and the final results of speech understanding but also the intermediate results as dialogue data, to be used for the evaluation and improvement of the system. The results of our dialogue experiments with TOSBURG II demonstrate the effectiveness of adaptive speech response cancellation for natural interaction, confirming that the dialogue data and the evaluation environment will contribute to the further development of spontaneous speech dialogue systems.
Zusammenfassung

We have developed a system for the understanding of spontaneous spoken dialogue, called TOSBURG II. In this system the understanding of spontaneous spoken dialogue is based on the use of keywords. Furthermore, the system permits selective cancellation of the spoken response, so that the user can interrupt the system's spoken reply. To develop a practically usable and reliable spoken dialogue understanding system (and not one usable only in the laboratory), the system must be improved through repeated testing and evaluation of dialogues on experimentally constructed prototypes. For this reason we have created a dedicated evaluation environment to raise the performance of the TOSBURG II spontaneous spoken dialogue understanding system. In contrast to the widespread Wizard of Oz procedure for collecting dialogue data, in the system presented here the dialogue data are written automatically to a log file in real time during the dialogue, and the sequence of keywords used by the user, together with the explanations and comments produced by a human operator that are needed for evaluation, are added to the dialogue data during the dialogue.

Résumé
This article describes a spoken dialogue system, called TOSBURG II, which adopts a keyword-detection approach to the understanding of spontaneous speech. To allow the user to interrupt the system's responses, TOSBURG II uses a specific adaptive technique. Developing a dialogue system that is reliable in real situations requires refining it by testing successive prototype versions in real dialogue contexts. To increase the performance of TOSBURG II, we have therefore concentrated on building a powerful evaluation environment. In contrast to the prevailing Wizard of Oz method of dialogue data collection, TOSBURG II outputs the dialogue data in real time. These data comprise not only the results of utterance understanding but also the results of intermediate processing, which makes it possible to analyze the overall behaviour of the system precisely. To these automatically collected data are added data entered by the experimenter (for example, the sequence of keywords or the user's expressions), useful for improving the processing of spontaneous language. The results of the experimental evaluation demonstrate the effectiveness of the new version of TOSBURG II.
Keywords: Speech recognition; Speech dialogue; Spontaneous speech; Word spotting; Multimodal dialogue; Active noise cancellation
1. Introduction

Much effort has been made to achieve natural speech communication between humans and computers. It is crucial that speech recognition systems cope with background noise and spontaneous speech in order to increase the robustness of speech dialogue systems in a practical environment. Various phenomena in spontaneous speech have been investigated by analyzing spontaneous speech data and comparing spontaneous speech with read speech. These studies indicate the need to deal with such phenomena as unintentional utterances, ellipses, pauses, out-of-vocabulary words, corrections and interruptions (Komatsu et al., 1988; Murakami and Sagayama, 1991). The performance of many speech recognition systems decreases in the presence of background noise, because they are not designed to work in an actual noisy environment or to deal with spontaneous speech (Lee, 1989). Moreover, since the above-mentioned phenomena are difficult to represent in terms of the grammatical rules that form the basis of conventional systems, they pose significant problems for the spoken language systems now under development in the DARPA programs and the European projects, which aim to deal with spontaneous speech (Gerbino et al., 1993; Peckham, 1991; Tuback and Doignon, 1991). Several speech dialogue systems and large-vocabulary speech recognition systems that deal with spontaneous speech have been developed; however, operating in real time while maintaining recognition accuracy for spontaneous speech remains an important problem (Bates et al., 1993; Hayamizu et al., 1991; Minami et al., 1994; Kuroiwa et al., 1993).
We previously developed a prototype of a spontaneous speech dialogue system, TOSBURG (Task-Oriented Speech dialogue system Based on speech Understanding and Response Generation), to achieve natural human-computer interaction (Takebayashi et al., 1992). This system understands spontaneous speech based on keywords and generates a multimodal response to the user. To increase the efficiency and naturalness of dialogues, we have extended the system to TOSBURG II, which allows the user's spoken interruption (Takebayashi et al., 1993).
A dialogue database enables efficient research and development of speech processing systems. Large-scale speech corpora of read speech or
spontaneous speech have been collected and compiled by many laboratories and projects (Kobayashi et al., 1992; Thompson et al., 1993; Mariani, 1992). The Wizard of Oz (WOZ) technique is the prevailing method for large-scale dialogue speech data collection, used by the DARPA programs and many other laboratories (Hirschman, 1992; Kuroiwa et al., 1993; Moore and Morris, 1992). Generally, to refine a prototype system until it is usable in practical applications, it must be iteratively evaluated and improved through large-scale user testing (Cole and Novick, 1993; Nielsen, 1993). Improving the robustness of a speech dialogue system in a practical environment requires large-scale evaluation of dialogue data collected in dialogue experiments with the actual system (Hayamizu et al., 1991; Shriberg et al., 1992; Gerbino et al., 1993).
To this end, we have developed an evaluation environment for TOSBURG II. In this environment, TOSBURG II outputs dialogue processing data and dialogue speech data while performing the dialogue. The dialogue processing data consist of the timing of utterances, the keyword lattice, the candidate semantic representations of the user's utterance, the semantic representation selected from those candidates, the dialogue state and the contents of the speech response. The operator adds annotations and the keyword sequences of the user's utterances to the dialogue processing data through a graphical user interface. The annotations mark user utterances that the system currently cannot handle, and serve as retrieval keys for those utterances in the dialogue data. The keyword sequences are used as the correct keywords for evaluating the system's recognition and understanding performance. The collected data are thus used to improve the performance of each subsystem (i.e. the keyword spotter, the keyword lattice parser, the dialogue manager, the multimodal response generator and the speech response canceller) and to achieve high system usability.
This paper first presents our approach to real-world speech dialogue systems. Next, usability problems of TOSBURG are discussed. Then the further developed version, TOSBURG II, is introduced. Finally, a description of the TOSBURG II evaluation environment is given.
2. TOSBURG II spontaneous speech dialogue system

2.1. Approach to speech dialogue systems

To implement advanced speaker-independent speech recognition in a real-world speech dialogue system, it is necessary to improve the real-time performance and robustness of the system. It is also desirable to understand spontaneous speech without restrictions on grammar and manner of utterance. A spontaneous speech understanding system should be able to deal with unintentional utterances, ellipses, ambiguous utterances, pauses and out-of-vocabulary words. However, these speech phenomena are difficult to represent in terms of the grammatical rules that form the basis of conventional systems. Keeping these points in mind, we developed a first version of a task-oriented speech dialogue system, named TOSBURG, for a fast food ordering task. It has the following features:
1. Spontaneous speech understanding based on keywords. To cope with the phenomena mentioned above, we have employed keyword-based speech understanding. Fig. 1 shows an example keyword lattice. This approach assumes that the meaning of an utterance can be extracted by a combination of keyword spotting and keyword lattice parsing, which also makes it possible to resolve the ambiguity that is inevitable in speech understanding. Fig. 2 shows an example of a tree structure and semantic representation. To cope with ellipsis in spontaneous speech, TOSBURG supplies default values for the name, size and number of ordered food items in the semantic representation of the user's utterance if the meaning of the utterance is obvious from the dialogue history (a sketch of this supplementation follows below).
2. User-initiated dialogue management. Many dialogue systems employ a computer-initiative dialogue model, in which the user is forced into a dialogue situation by the system and must follow the request messages from the computer. In contrast, we have employed a user-initiated dialogue model which allows the user to change the topic. For example, even if the system expects the user to confirm the ordered items, it accepts the user's utterance for an additional order.
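As an illustration of the default-value supplementation in feature 1, the following minimal sketch fills unspoken slots from the dialogue history; the function, data format and default values are our assumptions, not taken from the paper.

    def supplement_defaults(items, history, default_size="REGULAR", default_number=1):
        """Fill unspoken size/number slots of ordered items (illustrative sketch).

        items: list of (name, size, number) triples, with None marking a slot
        the user did not say; history: map from item name to the (size, number)
        most recently established in the dialogue.
        """
        completed = []
        for name, size, number in items:
            hist_size, hist_number = history.get(name, (default_size, default_number))
            completed.append((name,
                              size if size is not None else hist_size,
                              number if number is not None else hist_number))
        return completed

    # The user says only "hamburger, please" after having ordered two small ones:
    print(supplement_defaults([("HAMB", None, None)], {"HAMB": ("SMALL", 2)}))
    # -> [('HAMB', 'SMALL', 2)]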
Fig. 1. An example keyword lattice for the utterance "Eh... one hamburger and... uhm... one french fries, please".
Our dialogue model consists of user states and system states, as shown in Fig. 3. The dialogue state changes from a user state to a system state when the system understands the user's utterance, and from a system state back to a user state after the system response has been generated. At the beginning of a dialogue, when the system detects that the user is stepping on the floor mat, it outputs a spoken greeting, asks the user to order, changes the dialogue state from the "initial state" to the "dialogue continuation state", and gets ready to recognize the user's utterance. If the contents of the user's utterance are unexpected, the state changes to the "dialogue stagnation state" and the system asks the user to repeat.
3. Multimodal response generation. To realize a friendly and efficient interface, both visual and audio media are used for the system's response. The multimodal response includes synthesized speech, text, animated facial expressions and pictures of ordered food items, as shown in Fig. 4. The text is the same sentence as the speech response. The animated facial expressions change according to the dialogue states.
Fig. 2. An example of tree structure and semantic representation: the utterance "Eh... one hamburger and... uhm... coffee three please" is parsed into the semantic representation ((act ORDER) (item HAMB NOSIZE NONUM) (item COFFEE NOSIZE 3)).

Fig. 3. State transitions in the user-initiative dialogue model: system states and user states, with transitions such as greeting and asking for the order (initial state to dialogue continuation state), confirmation, and asking the user to repeat after an unexpected utterance (dialogue stagnation state).
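The transitions of Fig. 3 can be summarized in a small state machine. The sketch below is illustrative, with hypothetical state and event names, and covers only the transitions described in the text.

    from enum import Enum, auto

    class DialogueState(Enum):
        INITIAL = auto()        # before the user arrives
        CONTINUATION = auto()   # "dialogue continuation state"
        STAGNATION = auto()     # "dialogue stagnation state"

    def next_state(state: DialogueState, event: str) -> DialogueState:
        """Illustrative transition function for the user-initiative model."""
        if state is DialogueState.INITIAL and event == "user_on_floor_mat":
            # Greet the user, ask for the order, get ready to recognize speech.
            return DialogueState.CONTINUATION
        if event == "unexpected_utterance":
            # Ask the user to repeat.
            return DialogueState.STAGNATION
        if event == "understood_utterance":
            # Any understood utterance (order, correction, confirmation)
            # keeps or brings the dialogue back to the continuation state.
            return DialogueState.CONTINUATION
        return state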
Fig. 4. Examples of visual response in a multimodal dialogue system: animated facial expression, pictures of the ordered food items, text (e.g. "Your order is one hamburger, one fry ...", "Sorry, say that again, please.") and a lip-shaped icon.
TOSBURG has four subsystems: a keyword spotter, a keyword lattice parser, a dialogue manager and a multimodal response generator. The keyword spotter extracts keyword candidates from spontaneous speech and passes a keyword lattice to the syntactic and semantic keyword lattice parser. The parser generates semantic utterance representation candidates, which are passed to the dialogue manager. The dialogue manager interprets the candidates using the dialogue history and then determines a semantic response representation. The response generator creates a multimodal response from this representation. The system runs in real time on three general-purpose workstations and DSP accelerators (520 MFLOPS) (Tsuboi et al., 1990).
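The dataflow through the four subsystems can be pictured as a simple pipeline. The sketch below is schematic, with hypothetical function names and stubbed-out processing; the actual system runs the subsystems as communicating processes on workstations and DSP accelerators.

    from typing import List, Tuple

    Keyword = Tuple[str, float, float]  # word, start time [s], end time [s]

    def keyword_spotter(samples: List[float]) -> List[Keyword]:
        # Stub: a real spotter scores vocabulary words against the signal.
        return [("HANBAAGAA", 0.4, 1.0), ("FUTATSU", 1.2, 1.6)]

    def keyword_lattice_parser(lattice: List[Keyword]) -> List[str]:
        # Stub: syntactic/semantic parsing of the lattice into candidates.
        return ["((act ORDER) (item HAMB NOSIZE 2))"]

    def dialogue_manager(candidates: List[str], history: List[str]) -> str:
        # Stub: select a candidate using the dialogue history and decide
        # on a semantic response representation.
        history.append(candidates[0])
        return "CONFIRM " + candidates[0]

    def multimodal_response_generator(response_sem: str) -> dict:
        # Stub: render synthesized speech, text, facial expression, pictures.
        return {"speech": response_sem, "text": response_sem}

    def dialogue_turn(samples: List[float], history: List[str]) -> dict:
        """One user turn through the four TOSBURG subsystems."""
        lattice = keyword_spotter(samples)
        candidates = keyword_lattice_parser(lattice)
        return multimodal_response_generator(dialogue_manager(candidates, history))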
2.2. Extension to TOSBURG II

Dialogue experiments were conducted with TOSBURG, in which spoken interruption by the user was often observed.
Since the speech response transmitted from the loudspeaker contaminates the user's input speech at the microphone, recognition accuracy decreases if the system detects keywords directly from the microphone signal. To maintain recognition accuracy, TOSBURG was designed to accept the user's speech input only after the system's speech response is completed. In addition, the system tells the user either to speak or to wait, using lip-shaped icons. However, such restrictions on the timing of the user's utterance reduce the efficiency and friendliness of the dialogue. Moreover, in multimodal interaction the user often ignores these icons and interrupts the computer-generated speech response, because the meaning of the response can be understood naturally, with the aid of the visual media, before the system's speech response ends. We have therefore introduced adaptive speech response cancellation into TOSBURG, extending it to the spontaneous speech dialogue system TOSBURG II (Takebayashi et al., 1993). This system allows the user's spoken interruption by means of adaptive speech response cancellation based on active noise control technology. The cancellation is adaptive in order to cope with the user's body movements during the dialogue, which distinguishes it from the echo cancellers of telephonic speech recognition systems. The configuration of TOSBURG II is shown in Fig. 5. A new component, the speech response canceller, is added to the components of TOSBURG. When the floor mat detects that the user is in position, the keyword lattice parser activates the word spotter and sends a message to the dialogue manager to initialize its state.
Fig. 5. Configuration of the spontaneous speech dialogue system TOSBURG II: speech input passes through the speech response canceller to the keyword spotter and keyword lattice parser; the dialogue manager drives the multimodal response generator, which produces speech output and visual output.
The floor mat information is transferred to the parser in order to avoid a delay in activating the word spotter caused by inter-process communication between the dialogue manager and keyword spotter processes.
2.3. Speech response cancellation
Fig. 6 shows a block diagram of the speech dialogue system with a speech response canceller. The canceller subtracts the system's speech response from the microphone input signal using an adaptive filter that estimates the impulse response between the loudspeaker and the microphone. Although a cardioid microphone can be used to attenuate the speech response, it imposes positioning restrictions on the loudspeaker and microphone and cannot exclude the reflected sound of the speech response. Moreover, preliminary experiments revealed that reflection from the user's body affects the impulse response (Nagata et al., 1992). Since users often move their bodies during dialogues with TOSBURG, adaptive estimation is necessary to follow the changes of the impulse response (Fig. 7). We also verified that word recognition accuracy was maintained by synthetic speech cancellation when the ratio of the residual synthetic speech power after cancellation to the original synthetic speech power was less than -12 dB.
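The -12 dB criterion is a simple power ratio; a sketch of its computation follows (the function name is ours, not the paper's).

    import numpy as np

    def residual_power_ratio_db(residual, original):
        """Ratio of residual synthetic speech power (after cancellation) to
        the original synthetic speech power, in dB. In the experiments
        reported in the text, word recognition accuracy was maintained
        below about -12 dB."""
        residual = np.asarray(residual, dtype=float)
        original = np.asarray(original, dtype=float)
        return 10.0 * np.log10(np.sum(residual ** 2) / np.sum(original ** 2))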
Fig. 6. Speech dialogue system with a speech response canceller: an adaptive filter estimates the loudspeaker-to-microphone impulse response from the speech response x(k) and subtracts the estimated echo from the microphone signal d(k); the residual e(k) is passed to the word spotter, semantic analyzer and dialogue manager, and the synthesizer drives the loudspeaker.

Fig. 7. Examples of reflected sounds: besides the direct sound from the loudspeaker, the microphone receives sound reflected from the user's body and from the walls, and these reflections change as the user moves.
In order to achieve stable and fast estimation of the impulse response, the system employs the Normalized Least Mean Square (NLMS) algorithm with an adaptive filter (about 300 taps, sampling frequency 12 kHz), together with an adaptation switch and spectral pre-whitening. The adaptation switch, which gates the adaptation, prevents the estimation from becoming unstable due to rapid changes in signal power. Spectral pre-whitening using a first-order difference signal accelerates convergence without increasing the computational load. The speech response canceller has been implemented on a DSP accelerator.
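A minimal sketch of such an NLMS canceller follows. The update rule and pre-whitening follow the description above; the step size, the adaptation-switch thresholds and the power gate itself are our assumptions, since the paper does not give concrete values.

    import numpy as np

    def nlms_canceller(x, d, n_taps=300, mu=0.5, eps=1e-8,
                       p_min=1e-6, p_max=1e2):
        """Adaptive speech response cancellation by NLMS (illustrative sketch).

        x: loudspeaker samples (the system's speech response); d: microphone
        samples (user speech plus loudspeaker echo). Returns e, the microphone
        signal with the estimated speech response subtracted. p_min/p_max are
        hypothetical thresholds for the adaptation switch.
        """
        x = np.asarray(x, dtype=float)
        d = np.asarray(d, dtype=float)
        # First-order difference for spectral pre-whitening; differencing
        # commutes with the linear echo path, so weights adapted on the
        # whitened pair also cancel the raw echo.
        xw = np.diff(x, prepend=0.0)
        dw = np.diff(d, prepend=0.0)

        w = np.zeros(n_taps)                      # estimated impulse response
        e = np.zeros(len(d))                      # canceller output (residual)
        for k in range(n_taps - 1, len(d)):
            u = x[k - n_taps + 1:k + 1][::-1]     # raw reference, newest first
            uw = xw[k - n_taps + 1:k + 1][::-1]   # whitened reference
            e[k] = d[k] - w @ u                   # subtract estimated echo
            p = (uw @ uw) / n_taps                # short-term reference power
            if p_min < p < p_max:                 # adaptation switch: freeze the
                ew = dw[k] - w @ uw               # filter outside the power gate
                w += mu * ew * uw / (uw @ uw + eps)   # NLMS update
        return e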
Fig. 8 shows a comparison of the timing of utterances between a spontaneous dialogue on TOSBURG II and a conventional dialogue.

Fig. 8. Comparison of utterance timing between spontaneous and conventional dialogue. In the spontaneous dialogue, the user's "Yes, it is" overlaps the system's "Your order is one hamburger and a cup of coffee", which fades out before the system continues with "Thank you"; in the conventional dialogue, the user answers only after the system's response is completed.
3. Performance evaluation
3.1. Large-scale data collection using the actual system

To improve robustness in real-world applications, a speech processing system needs to be evaluated with large-scale spontaneous human-computer dialogue data collected using the actual system. Note that although the Wizard of Oz technique is widely used for large-scale data collection, it falls short of our aim because it collects only simulated dialogue data. Collection with the actual system involves two major problems.
First, human-computer spontaneous dialogues characteristically contain ambiguities, so it is difficult to convert the speech data into written form, which is done to supplement information about the user's utterances; it is also difficult to ensure notational consistency throughout a large-scale database, given disfluencies, unintentional utterances, ellipses and inserted pauses. Conventionally, full transcription is used for this type of data collection, but it requires considerable time and effort and lacks efficiency. Consequently, a more efficient method should be devised by automating the work.
Second, unlike human-human dialogues, where confirmation is possible whenever necessary, human-computer spontaneous dialogues contain recognition errors, and a person's utterance style and dialogue strategy change depending on the dialogue partner, in this case the system itself. To account for these recognition errors, the evaluation should be conducted with the actual system in a practical environment; furthermore, data collection and natural human-computer dialogue should be carried out in parallel.
3.2. Evaluation environment for TOSBURG II

To solve these problems, we have developed an evaluation environment for TOSBURG II, shown in Fig. 9. This environment collects the following dialogue data while a dialogue is being conducted.
Fig. 9. Configuration of the evaluation environment for TOSBURG II: the user dialogues with the system while an operator annotates the data, and the collected dialogue data are stored in a database.
Dialogue speech data: speech data of the system's responses and the user's utterances.
Dialogue processing data: intermediate results of the dialogue processing in each component of TOSBURG II.
Keyword sequence: the sequence of keywords contained in the user's utterance. Listening to the dialogue speech data, an operator inputs the keywords with a keyword input tool.
After collecting the above data, the operator inputs the following data using another tool, called the dialogue evaluation tool.
Semantic representation: the correct semantic representation of the user's utterance, inferred from the keyword sequence by the dialogue evaluation tool.
Correspondence of keyword sequence with dialogue processing data: a mapping from each keyword detected by the system to a keyword entered with the keyword sequence input tool. This mapping is used for the evaluation of speech recognition and understanding performance.
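To make the contents of one collected record concrete, the following sketch shows one plausible shape for a logged utterance; the field names, value formats and example values are hypothetical, since the paper specifies the items but not their encoding.

    # One dialogue-data record for a single user utterance (hypothetical format).
    record = {
        "utterance_timing": {"start_s": 3.10, "end_s": 7.71},
        "keyword_lattice": [                      # spotted keyword candidates
            {"word": "HANBAAGAA", "start_s": 3.35, "end_s": 3.92},
            {"word": "FUTATSU",   "start_s": 4.15, "end_s": 4.55},
        ],
        "semantic_candidates": ["((act ORDER) (item HAMB NOSIZE 2))"],
        "selected_semantics": "((act ORDER) (item HAMB NOSIZE 2))",
        "dialogue_state": "CONTINUATION",
        "speech_response": "GOCHUUMON HA, HANBAAGAA WO FUTATSU ...",
        # Added afterwards by the operator with the GUI tools:
        "correct_keywords": ["HANBAAGAA", "FUTATSU"],
        "annotations": ["out-of-vocabulary word"],
    }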
The evaluation environment has the following features.
1. Description with keyword sequences. Since TOSBURG II understands the user's utterance based on keywords, speech recognition and understanding can be evaluated from various points of view by describing the user's utterance with a keyword sequence rather than a full transcription. A keyword sequence entered by a human operator is regarded as the correct result of speech recognition and is used for the performance evaluation of word spotting, whereas the semantic representation, which is inferred from the sequence and confirmed by the operator, is regarded as the correct speech understanding result and is used for the performance evaluation of syntactic and semantic analysis and of dialogue management. Description by keyword sequence thus reduces time and cost, leading to higher efficiency in transcribing data.
2. On-line data collection. TOSBURG II outputs dialogue speech data and dialogue processing data (log files) during the dialogue with no noticeable delay. It records the dialogue speech data of the user's utterances and the system's responses in two channels.
3. Use of intermediate processing results. In addition to the speech data and the final result, the dialogue processing data of TOSBURG II include intermediate processing results, such as keyword lattices, semantic representation candidates of the user's utterance, the sequence of dialogue states and the contents of the system's responses, because these interact at various levels of data processing and thereby play an important role in determining the final result of speech understanding. Such intermediate results are useful for finding out at which stage errors occur.
Fig. 10. Visual interface for inputting correct keywords: one button per vocabulary keyword, e.g. HANBAAGAA (hamburger), POTETO (fries), KOOHII (coffee), KOORA (cola), ORENJIJUUSU (orange juice), KUDASAI (give me), IRANAI (don't need), plus Utterance End, Dialogue End and Experiment End/Stop buttons.
4. Dialogue data collection tools. For highly efficient evaluation with a large-scale database, our data collection tools, namely the keyword sequence input tool and the dialogue evaluation tool, help to speed up the data collection by providing the following three functions.
Fig. 11. Visual interface of the dialogue evaluation tool: speech waveforms with the timing of the system and user turns, a romanized transcription of the dialogue, the operator's keyword sequence, the detected keyword candidates with their times, the semantic representation candidates, and the confirmed correct meaning (e.g. [ORDER] HAMBURGER 2, POTATO LARGE 1).
First, the keyword sequence input tool has a visual interface, shown in Fig. 10, with which an operator can enter correct keyword sequences on line during the dialogue. Each button of the tool corresponds to a keyword; when the buttons are pressed, the tool appends the keyword sequence to the dialogue data. Second, the dialogue evaluation tool has a visual interface, shown in Fig. 11, which allows the operator to browse through and confirm the meanings of utterances (semantic representations) that the system inferred from these keyword sequences. In this process, each correct keyword entered by the operator is mapped onto a keyword detected by the system, and system performance can be evaluated by comparing the correct processing results with a log file. Finally, the results are computed automatically for later use, reducing human work. The dialogue evaluation tool also enables the operator to add annotations to an utterance. The annotations indicate current system problems, such as failure to correctly detect the user's utterance or out-of-vocabulary words.
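The mapping from operator-entered keywords to detected keywords could be computed, for instance, by a greedy nearest-in-time match. The sketch below is an assumption about how such a correspondence might be built (the paper does not describe the algorithm), with a hypothetical time tolerance max_gap_s.

    def align_keywords(correct, detected, max_gap_s=0.5):
        """Map each operator-entered keyword to one detected keyword.

        correct:  list of (word, time_s) entered by the operator;
        detected: list of (word, start_s) spotted by the system.
        Returns {index into correct: index into detected}; detected keywords
        left unmapped count as false alarms, unmapped correct ones as misses.
        """
        mapping, used = {}, set()
        for i, (word, t) in enumerate(correct):
            best = None
            for j, (dword, dt) in enumerate(detected):
                if j in used or dword != word or abs(dt - t) > max_gap_s:
                    continue
                if best is None or abs(dt - t) < abs(detected[best][1] - t):
                    best = j
            if best is not None:
                mapping[i] = best
                used.add(best)
        return mapping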
3.3. Experiments

3.3.1. Speech recognition and understanding performance
Dialogue experiments were carried out to evaluate the performance of speech recognition and understanding. The subjects were four Japanese males, two of whom were completely unfamiliar with the system. The unfamiliar subjects were given a simple explanation that the system takes orders in a fast food shop, watched a video with two or three dialogue examples, and then tried the system. Each subject was asked beforehand to decide his order, and performed three dialogues. Each subject was asked to order according to his own preference, so as to place no restrictions on word-use frequency; the results of this experiment may therefore be affected by variation in the subjects' preferences. The twelve dialogues contained 185 keywords and 68 utterances.
The keyword detection rate, that is, the ratio of correctly spotted keywords to the total of 185 keywords, was 96%, with 605 false alarms. The sentence understanding rate, that is, the ratio of correct semantic representations to the total of 68 utterances, was 82%. The sentence recognition rate, that is, the ratio of utterances in which all keywords were recognized, was 76%. The sentence understanding rate is more than 5% higher than the sentence recognition rate, confirming the effectiveness of supplementing unspoken information such as the food item's name, size and number. Each spotted keyword speech datum, obtained from the dialogue speech data, is tagged with information indicating whether the keyword spotting result is correct. These data are used to train the keyword spotting dictionary and thereby improve keyword detection performance.
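From the aligned logs, the three rates above can be computed as follows; this is a sketch with hypothetical field names (correct_keywords, spotted_keywords, understanding_ok), not the actual evaluation code.

    from collections import Counter

    def evaluation_rates(utterances):
        """Keyword detection, sentence recognition and sentence understanding
        rates, computed from one dict per utterance (field names hypothetical)."""
        total_kw = spotted_kw = sent_rec = sent_und = 0
        for u in utterances:
            correct = Counter(u["correct_keywords"])
            detected = Counter(u["spotted_keywords"])
            hits = sum((correct & detected).values())   # multiset intersection
            total_kw += sum(correct.values())
            spotted_kw += hits
            sent_rec += (hits == sum(correct.values())) # all keywords spotted
            sent_und += bool(u["understanding_ok"])     # semantics matched
        n = len(utterances)
        return {"keyword_detection_rate": spotted_kw / total_kw,    # 96% reported
                "sentence_recognition_rate": sent_rec / n,          # 76% reported
                "sentence_understanding_rate": sent_und / n}        # 82% reported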
3.3.2. Effectiveness of visual response
To evaluate the effectiveness of the visual response, the total dialogue time was measured for different combinations of visual responses. The combination of the system's visual responses was selected from the following three types: type-1, all visual responses, including animated facial expressions, text (the same sentence as the speech response) and pictures of ordered food items; type-2, visual responses without pictures of ordered food items; type-3, visual responses without text or pictures of ordered food items. In this experiment the subjects were four Japanese males. For comparison among the three types of visual response combination, each subject was asked to order five predetermined items. Fig. 12 shows the resulting total dialogue times. The animated facial expression and the pictures of ordered food items indicate the system's internal states; with type-2 and type-3, the visual responses therefore carry less information about the internal state. Dialogues with type-2 and type-3 tend to take longer than those with type-1. With type-3, subjects often waited for the completion of the system's speech response. These results indicate that the visual, supplementary indication of the system's internal states shortens the total time of a speech dialogue even though the main communication medium is speech. We also carried out a preliminary dialogue experiment without the animated facial expression. In this case the total dialogue time was as long as with the type-1 visual response, and a subject said that he felt awkward speaking loudly without the face of a dialogue partner.
Fig. 12. Dialogue time using TOSBURG II: maximum, average and minimum total dialogue times for type-1 (all output), type-2 (without tray pictures) and type-3 (without text and tray pictures).

3.3.3. Effectiveness of allowing the user's interruption
To evaluate the effectiveness of allowing the user's spoken interruption, the total dialogue time was also examined with TOSBURG. The subjects were five Japanese males. Fig. 13 shows the total dialogue times for five dialogues. Dialogues tend to take longer with TOSBURG than with TOSBURG II (Fig. 12). The average dialogue time is close to the minimum dialogue time in each of the output combinations. Although the amount of dialogue data is not large, these results show that dialogues with TOSBURG II are smoother and more natural than those with TOSBURG. Comparing the total time of type-1 dialogues with the others, the visual information limits the increase of the total dialogue time, and the increase with TOSBURG II is smaller than with TOSBURG. Since speech is a medium that conveys information serially, the total dialogue time is reduced by allowing the user's interruptions, for example an intermediate correction of the order when the user notices a system error. Although the results of the experiment may be affected by variance in the users' utterance styles, they show that speech response cancellation, by enabling the user's spoken interruption, is effective for natural human-computer interaction.
Fig. 13. Dialogue time using TOSBURG: maximum, average and minimum total dialogue times for the same three visual response combinations (all output, without tray pictures, without text and tray pictures).
4. C o n c l u s i o n
We have developed a real-time spontaneous speech dialogue system, TOSBURG II, and an evaluation environment for it. The system understands spontaneous speech based on keywords and allows the user's spoken interruption by employing adaptive speech response cancellation. In our environment, actual dialogue data are collected with the system in real time, and the user's utterances are described with keyword sequences. The collected data, including the real-time dialogue speech data and the final and intermediate results of the dialogue system's processing, are used for the improvement of TOSBURG II and also for the
further development of advanced spontaneous speech dialogue systems. Furthermore, the description with keyword sequences and the facilities of our tools considerably reduce the cost of transcribing the user's utterances. We have conducted dialogue experiments to evaluate the system's effectiveness. The fact that the average total dialogue time is shortened by allowing the user's spoken interruption confirms that such interruption is important for facilitating spontaneous human-computer interaction, and that visual indication of the system's internal state is useful for higher efficiency. In these experiments, subjects unfamiliar with the system rarely interrupted the system's spoken response, while those familiar with it often did. Such differences in dialogue style due to the user's familiarity with the system should be investigated and reflected in future speech dialogue systems.
Acknowledgments

We would like to thank Miwako Shimazu and David Culley for their kind help in preparing this paper. We would also like to thank Hiroyuki Tsuboi, Yoshifumi Nagata and Hideki Hashimoto for their participation in the development of the systems and for their help in preparing this draft.
References

M. Bates, R. Bobrow, P. Fung, R. Ingria, F. Kubala, J. Makhoul, L. Nguyen, R. Schwartz and D. Stallard (1993), "The BBN/HARC spoken language understanding system", Proc. Internat. Conf. Acoust. Speech Signal Process. '93, 27-30 April 1993, Minneapolis, pp. II-111-II-114.
R.A. Cole and D.G. Novick (1993), "Rapid prototyping of spoken language systems: The year 2000 Census project", Proc. ISSD-93, 10-12 November 1993, Tokyo, pp. 19-23.
E. Gerbino, P. Baggia, A. Ciaramella and C. Rullent (1993), "Test and evaluation of a spoken dialogue system", Proc. Internat. Conf. Acoust. Speech Signal Process. '93, 27-30 April 1993, Minneapolis, pp. II-135-II-138.
S. Hayamizu, K. Itou and K. Tanaka (1991), A spoken language dialogue system for spontaneous speech collection, IEICE Technical Report, SP91-101.
L. Hirschman (1992), "Multi-site data collection for a spoken language corpus", Proc. DARPA Speech and Natural Language Workshop, 23-26 February 1992, Harriman, pp. 7-14.
J. Junqua (1991), "Robustness and cooperative multimodal man-machine communication application", Proc. Second Workshop on the Structure of Multimodal Dialogue, 16-20 September 1991, Italy.
T. Kobayashi, S. Itahashi, S. Hayamizu and T. Takezawa (1992), "ASJ continuous speech corpus for research", J. Acoust. Soc. Jpn., Vol. 48, pp. 888-893.
A. Komatsu, E. Oohira and A. Ichikawa (1988), "Conversational speech understanding based on sentence structure inference using prosodics, and word spotting", Trans. Institute Electronics Information Communication Engineers, Vol. J71-D, No. 7, pp. 1218-1228.
S. Kuroiwa, K. Takeda, N. Inoue, I. Nogaito and S. Yamamoto (1993), "Online collection of spontaneous speech using a voice-activated telephone exchanger", Proc. ISSD-93, 10-12 November 1993, Tokyo, pp. 25-31.
K.F. Lee (1989), Automatic Speech Recognition: The Development of the SPHINX System (Kluwer Academic Publishers, Boston).
J. Mariani (1992), "Spoken language processing in the framework of human-machine communication at LIMSI", Proc. DARPA Speech and Natural Language Workshop, 23-26 February 1992, Harriman, pp. 55-58.
T. Minami, T. Yamada, K. Shikano and T. Matsuoka (1994), "Very large vocabulary continuous speech recognition algorithms for telephone directory assistance", Trans. Institute Electronics Information Communication Engineers, Vol. J77-A, No. 2, pp. 190-197.
R. Moore and A. Morris (1992), "Experiences collecting genuine spoken enquiries using WOZ techniques", Proc. DARPA Speech and Natural Language Workshop, 23-26 February 1992, Harriman, pp. 61-63.
J. Murakami and S. Sagayama (1991), A discussion of acoustic and linguistic problems in spontaneous speech recognition, IEICE Technical Report, SP91-100.
Y. Nagata, H. Tsuboi, H. Shinchi and Y. Takebayashi (1992), Speech dialogue system using DSP accelerators, IEICE Technical Report, EA92-84, pp. 1-8.
J. Nielsen (1993), "Iterative user-interface design", Computer, pp. 32-40.
J. Peckham (1991), "Speech understanding and dialogue over the telephone: An overview of progress in the SUNDIAL project", EUROSPEECH '91, pp. 1469-1472.
J. Polifroni, L. Hirschman, S. Seneff and V. Zue (1992), "Experiments in evaluating interactive spoken language systems", Proc. DARPA Speech and Natural Language Workshop, 23-26 February 1992, Harriman, pp. 28-33.
E. Shriberg, E. Wade and P. Price (1992), "Human-machine problem solving using spoken language systems (SLS): Factors affecting performance and user satisfaction", Proc. DARPA Speech and Natural Language Workshop, 23-26 February 1992, Harriman, pp. 49-54.
Y. Takebayashi, H. Tsuboi, Y. Sadamoto, H. Hashimoto and H. Shinchi (1992), "A real-time speech dialogue system using spontaneous speech understanding", Proc. ICSLP 92, 12-16 October 1992, Banff, pp. 651-654.
Y. Takebayashi, Y. Nagata and H. Kanazawa (1993), "Noisy spontaneous speech understanding using noise immunity keyword spotting with adaptive speech response cancellation", Proc. Internat. Conf. Acoust. Speech Signal Process. '93, 27-30 April 1993, Minneapolis, pp. II-115-II-118.
H.S. Thompson, A.H. Anderson, M. Bader, E.G. Bard, E. Boyle, G. Doherty-Sneddon, S. Garrod, S.D. Isard, J. Kowtko, J. McAllister, J.E. Miller, C. Sotillo and R. Weinert (1993), "The HCRC map task corpus: A natural spoken dialogue corpus", Proc. ISSD-93, 10-12 November 1993, Tokyo, pp. 33-36.
H. Tsuboi, H. Kanazawa and Y. Takebayashi (1990), "An accelerator for high-speed spoken word-spotting and noise immunity learning system", Proc. ICSLP 90, 18-22 November 1990, Kobe, pp. 273-276.
J. Tuback and P. Doignon (1991), "A system for natural spoken language queries: Design, implementation and assessment", EUROSPEECH '91, pp. 1473-1476.
V. Zue, S. Seneff, J. Polifroni and M. Phillips (1993), "PEGASUS: A spoken dialogue interface for on-line air travel planning", Proc. ISSD-93, 10-12 November 1993, Tokyo, pp. 157-160.