Agent-typed Multimodal Interface Using Speech, Pointing Gestures and CG

Haru Ando, Hideaki Kamada, and Nobuo Hataoka
Central Research Laboratory, Hitachi, Ltd., Kokubunji, Tokyo 185, JAPAN
Contact: H. Ando, e-mail: [email protected]

ABSTRACT
This paper proposes a sophisticated agent-typed user interface using speech, pointing gestures and CG technologies. An "Agent-typed Interior Design System" has been implemented as a prototype for evaluating the proposed agent-typed interface. It accepts speech and pointing gestures as input modalities, and the agent is realized by three-dimensional CG (3-D CG) and speech guidance. This paper describes the details of the system implementation and the evaluation results, which clarify the effectiveness of the agent-typed interface.

1. INTRODUCTION
Recently, studies concerning multimodal interfaces have attracted much attention[1][2]. Multimodal interfaces have multiple input and output means, and it has been argued that they make machines more approachable because users can communicate with machines much as they communicate with other people. To investigate effective multimodal interfaces, we previously developed an "Interior Design System" using speech and pointing gestures as input means, serving as a vehicle for multimodal interface research, and clarified desirable specifications for multimodal interfaces through various experiments[3]. In these experiments, we compared multimodal interfaces with unimodal interfaces, and compared command utterances with sentence utterances, to determine the best speech input method. With this system we confirmed the effectiveness of the proposed multimodal interface. However, we also found that the system lacked an adequate dialogue mechanism between users and machines, and that it offered no help functions and no compensation for speech recognition errors.

To cope with these problems, especially the lack of a dialogue mechanism, in this paper we extend the multimodal interface to an agent-typed multimodal interface which has a dialogue mechanism. There are two major issues in realizing agent-typed interfaces: first, how to make the dialogue between users and agents user-friendly in terms of input and output means; second, how to deal with speech recognition errors using agents. In this paper, we mainly focus on methods for combining input and output means to realize a user-friendly agent-typed interface, and we assess the effectiveness of the interface in order to clarify desirable specifications for agent-typed interfaces.

2. INTRODUCTION OF AGENT-TYPED INTERFACE
2.1. Definition of agent-typed interface
In the area of software engineering, the agent has become an important concept for post-object-oriented technologies. An agent is software that acts like a human being: it moves autonomously within computers to carry out users' orders as their representative. There are two major usages of agents, first in agent-typed interfaces, and second in agent-typed systems. In the first usage, the agent is usually displayed on the screen and acts as a representative of the
system to communicate with users, providing help guidance and error correction[4]. In the second usage, as an agent-typed system, multiple agents work autonomously and cooperatively according to users' demands, for example to check damaged parts in communication networks. By using an agent-typed system, performance and reliability can be increased and communication costs can be reduced. In this paper, we focus on the first concept, the agent-typed interface, and implement an agent-typed interface which has a speech dialogue mechanism.

2.2. Merit of agent-typed interface
Table 1 summarizes the merits of agent-typed interfaces from two sides, the user's and the system's. A mutual understanding of limitations and of the current situation is essential on both sides to increase operation efficiency. In agent-typed interfaces, agents which have a dialogue mechanism play a role in improving this mutual understanding. Through the dialogue, the knowledge gap between users and machines concerning operations can be reduced. For example, users can recognize the situation of the operating machine and get information from the machine through a dialogue with the agent. On the other hand, the system can ask users for indispensable missing information through a dialogue conducted by the agent.

Table 1 Merits of Agent-typed Interfaces
User side:
- knowledge acquisition by help guidance: 1) explanation of operation usage, 2) rules for furniture location
- acquisition of the system situation: 1) understanding the limits of system functions, 2) understanding the causes of speech recognition errors
- friendliness: natural dialogue function
System side:
- input information acquisition: demand for missing input
- error reduction: dialogue instructed by the system
3. PROTOTYPING AGENT-TYPED INTERIOR DESIGN SYSTEM
In this chapter, the "Agent-typed Interior Design System," developed as a vehicle for evaluating agent-typed multimodal interfaces, is described. This system is a simulation system in which a room layout is designed. Users can employ pointing gestures on a touch panel and speech input at the same time as input means, and can communicate with an agent realized by speech output and 3-D CG.

3.1. System configuration
Fig. 1 shows a block diagram of the system. Conceptually, five functionally different components can be distinguished.

(1) Speech Input Processing Unit: In this unit, speech is converted into characters[5]. We use Hidden Markov Models (HMMs) to match the vector characteristics of the input speech against the standard patterns of HMM networks; the word string with the highest matching score is taken as the recognition result. The speech input processing unit consists of two parts, a speech analysis unit and a speech recognition unit. In the speech analysis unit, speech is converted into digital signals and analyzed; speech parameters such as LPC (Linear Predictive Coding) cepstrum and power information are extracted and converted into VQ (Vector Quantization) codes. In the speech recognition unit, word strings are extracted by matching these VQ codes to the standard patterns of the HMM networks.

(2) Pointing Gesture Processing Unit: In this unit, pointing gestures, which are input on a touch panel, are sampled at 180 points per second and converted into X-Y coordinates.

(3) Interior Design Main Unit: In this unit, the information from the several input means is integrated to extract the user's intention. Words recognized in the speech input processing unit are filled into an information integration table[3], an example of which is shown in Table 2. This table is used to extract a command using a case grammar. The grammar first looks for a verb in the sentence; then, according to the verb found, words denoting objects and positions are extracted. Subsequently, the words extracted in the speech processing unit and the X-Y coordinates extracted in the pointing gesture processing unit are integrated according to their input order. As a result, the objects and positions that the user indicates are specified and the layout design is performed.

(4) Speech Output Processing Unit: In this unit, pre-recorded speech sentences are output from two loudspeakers as the agent's responses.

(5) 3-D Presentation Processing Unit: This unit displays a woman agent composed of polygons on the screen. The polygons are redrawn according to the agent's responses corresponding to the information integration results, so the agent's appearance changes.

Fig. 1 Block diagram of the Agent-typed Interior Design System

Table 2 An Example of the Information Integration Table
  command | object    | position  | adverb
  move    | S/P (ind.) | S/P (ind.) | S (opt.)
  copy    | S/P (ind.) | S/P (ind.) | S (opt.)
  (S: speech, P: pointing gestures; ind.: indispensable, opt.: option)

The detailed system specification for the interfaces is shown in Table 3; "output" in the table means the system response through the agent.

Table 3 Specification of the "Agent-typed Interior Design System"
- speech recognition: 1) number of sentences: 163, 2) number of words: 63
- gesture recognition: on a touch panel
- semantic analysis: regular grammar, case grammar
- information integration: an information integration table
- dialogue instruction: control by dialogue networks
- speech output: files of recorded output; number of recorded sentences: 25
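To make the two-stage speech input processing concrete, the following is a minimal illustrative sketch, not the system's implementation: Python is used purely for exposition, the VQ codebook and per-word discrete-HMM parameters (log-transition matrix, log-emission matrix, log-initial vector) are assumed to be already trained, and isolated words are scored by Viterbi search, whereas the actual system matches whole word strings against HMM networks.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each analysis frame (e.g. an LPC-cepstrum + power vector)
    to the index of its nearest VQ codebook centroid."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def viterbi_log_prob(codes, log_A, log_B, log_pi):
    """Best-path log-likelihood of a VQ code sequence under a discrete
    HMM (transitions log_A, emissions log_B, initial probs log_pi)."""
    delta = log_pi + log_B[:, codes[0]]
    for c in codes[1:]:
        delta = (delta[:, None] + log_A).max(axis=0) + log_B[:, c]
    return delta.max()

def recognize(frames, codebook, word_hmms):
    """Return the word whose HMM scores the input best; word_hmms maps
    each word to its (log_A, log_B, log_pi) parameter triple."""
    codes = quantize(frames, codebook)
    return max(word_hmms, key=lambda w: viterbi_log_prob(codes, *word_hmms[w]))
```

A continuous-sentence recognizer of the kind described in the paper would replace the per-word maximization with a single search over the connected HMM network.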
3.2. System operation
Users can use speech and pointing gestures synergistically as input means. For example, a user operates the system by saying "Please move this at this place" while pointing at the touch panel to indicate an object and a position at the same time. Given such an utterance accompanied by pointing gestures, the system first recognizes the speech and pointing input. Subsequently, the system integrates the recognized information to understand the user's intention, and extracts the command "move," the object "this," and the subcommand information "at this place" as the position for the "move" command. Finally, the system rearranges the room layout, outputting 2-D CG that shows the room layout and 3-D CG that shows the woman agent moving, accompanied by speech output. In case the system cannot recognize a command that the user inputs, the agent asks the user again, "Which is correct, move or copy?", and the user can reply to the system, "Copy, please." In this system, seven commands can be used: "move," "copy," "enlarge or shrink," "exchange," "color," "delete," and "newly input."
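The integration behaviour described in this section can be sketched as follows. This is a hypothetical simplification, not the system's code: CASE_FRAMES stands in for the information integration table of Table 2, the case grammar is reduced to verb lookup, recognized words are assumed to arrive already grouped into phrases, and deictic words are bound to touch-panel coordinates in input order, as the paper describes.

```python
# Toy version of the information integration table (Table 2): each
# command (verb) declares which case slots it requires. The slot
# requirements here are simplified assumptions.
CASE_FRAMES = {
    "move":   {"object": "indispensable", "position": "indispensable"},
    "copy":   {"object": "indispensable", "position": "indispensable"},
    "delete": {"object": "indispensable"},
}

DEICTIC = {"this", "here", "at this place"}  # phrases resolved by pointing

def integrate(words, touches):
    """Fill the integration frame from recognized words and pointing
    gestures; deictic phrases are bound to touches in input order."""
    touches = list(touches)
    frame = {"command": None, "object": None, "position": None}
    for w in words:
        if w in CASE_FRAMES:                # case grammar: find the verb first
            frame["command"] = w
        elif w in DEICTIC and touches:      # bind "this"/"here" to next touch
            slot = "object" if frame["object"] is None else "position"
            frame[slot] = touches.pop(0)    # (x, y) from the touch panel
        else:
            frame["object"] = frame["object"] or w
    return frame

# e.g. "Please move this at this place" with two touches:
print(integrate(["move", "this", "at this place"], [(120, 80), (400, 310)]))
# -> {'command': 'move', 'object': (120, 80), 'position': (400, 310)}
```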
32 "copy," "enlarge or shrink," "exchange," "color," "delete," and "newly input." 3.3. Role of a~ent-typed interface There are a couple of advantages to use agent-typed interfaces. Firstly, users can obtain necessary information about the condition of the system working and the restriction of functions on the system, through a dialogue with interface agents. Secondly, systems can obtain information that users have forgot to input, or that systems have not been able to recognize, by asking users to input again. Fig.2 shows an example of the system display. In this example, a command message is missing in an utterance such as "Please ....... this desk at this place." Subsequently, the system checks whether necessary information for information integration has been input or not. In case that there is only object information and place information, the agent asks the user, "which, move or copy?" This dialogue function reduces the number of possible words for input and improves speech recognition performance. If layout arrangements are user impossible by some reason, the agent shows alternatives Fig.2 Example of System Display to users, and makes user's operations smooth.
4. EVALUATION OF AGENT-TYPED INTERIOR DESIGN SYSTEM
We evaluated the effectiveness of the proposed agent-typed multimodal interface using the prototype. In particular, we compared agent-typed and non-agent-typed interfaces from the viewpoint of usability. Below, we report the aims of the evaluation, the outline of the experiments, and the evaluation results.

4.1. Aims of evaluation
The aims of the evaluation experiments are as follows:
(a) to evaluate the effectiveness of agent-typed interfaces from the viewpoints of friendliness, usability, and operation efficiency;
(b) to clarify the functions necessary for agent-typed interfaces.

4.2. Outline of experiments
Subjects: 5 males and 5 females.
Input means: sentence utterances and pointing gestures.
Output interface: the following three types of interfaces were examined to check the usefulness of agent-typed interfaces:
(a) display of a 2-D CG interior layout only;
(b) agent-typed interface with speech output only;
(c) agent-typed interface with a woman agent drawn by 3-D CG and with speech output.
Procedure: Fig. 3 shows the initial scene and the goal scene used in the editing task.
(a) Scene A and scene B were presented on the screen.
(b) Subjects were asked to make scene A into scene B using speech input and pointing gestures.
(c) Subjects compared the three output interfaces on a 3-degree scale.
(d) Subjects had to report the reasons for their comparisons.

Fig. 3 Task for display editing: (A) initial display, (B) goal display
Before the experiment, the subjects practiced how to speak to the system. Subjects could use any sentence consisting of the words in Table 4.

Table 4 Available Words (the Japanese object, position and command words accepted by the system)

4.3. Evaluation items
There are three evaluation items: (1) friendliness, (2) usability, and (3) operation efficiency. Table 5 shows the evaluation items and their sub-items. In ranking the three output interfaces, the same preference order could be given to more than one output interface if there was no clear difference between them.

Table 5 Evaluation Items
- friendliness: relevance; effect of the speech response; effect of the agent presentation
- usability: smoothness of operation; effect on speech recognition errors; timing of response; effect of guidance
- operation efficiency: operation time; easy-to-handle; easy-to-learn; fatigue

4.4. Results and discussions
Figure 4 and Table 6 show the results for each evaluation item and each sub-item, respectively. In Table 6, the number of subjects giving each preference order is shown. We confirmed the following from the evaluation results in Fig. 4 and Table 6.

(1) Friendliness: The interface with an agent realized by 3-D CG and speech output was the friendliest of the three output interfaces. The results for the sub-items in Table 6 suggest that the agent-typed interface is better than the normal interface with no agent, even when the agent is not drawn on the screen. Notable comments from subjects were: a) the display of the agent is necessary, because subjects felt uncomfortable when the output interface was speech only; b) the speech response corresponding to the input felt natural; c) the speech response was a good cue for timing input operations.

(2) Usability: Fig. 4 shows that the output interface with an agent was better than the one with only a 2-D CG layout. In particular, Table 6 shows that the speech response was very effective against speech recognition errors. Comments were: a) guidance by speech was useful, especially for beginners learning the system operation; b) the output interface with speech output and an agent display made it easier to understand the system's situation, so it was easy for users to decide on the next action.

(3) Operation efficiency: The results for operation efficiency varied, as shown in Fig. 4, because the evaluation tasks were small. However, the results still suggest the effectiveness of the agent-typed interface. Notable comments were: a) the instruction by the agent reduced users' fatigue; b) as users learn the operation, textual output alone may become sufficient, rather than speech output.
5. CONCLUSION
The agent-typed multimodal interface was proposed as a sophisticated user interface. The "Agent-typed Interior Design System," using speech input, pointing gestures and 3-D CG, was implemented as a prototype for evaluating the proposed interface. Evaluation experiments were carried out with 10 subjects, 5 males and 5 females, to determine the best output interface among three output types: first, a 2-D CG interior layout only; second, a 2-D CG interior with speech output; and third, a 2-D CG interior with speech output and the agent displayed. From the results, we confirmed that the proposed agent-typed interface, which has a dialogue mechanism through an agent using speech and 3-D CG, is effective, and that a user adaptation function will be necessary.
Fig. 4 Evaluation Results for Each Evaluation Item (number of subjects rating each output as the best, the second best, and the worst; outputs: (1) a 2-D CG interior layout only, (2) a 2-D CG interior with speech output, (3) a 2-D CG interior with speech output and the agent displayed by 3-D CG)
Table 6 Evaluation Results for Each Sub-item (number of subjects giving each output interface each preference order, 1st/2nd/3rd or no difference, for: (a) friendliness: relevance, effect of the speech response, effect of the agent presentation; (b) usability: effect on speech recognition errors, timing of response, effect of guidance; (c) operation efficiency: operation time, easy-to-handle, easy-to-learn, fatigue; outputs (1)-(3) as in Fig. 4)

REFERENCES
[1] R. A. Bolt, "Put-That-There": Voice and Gesture at the Graphics Interface, ACM Computer Graphics, 14, 3, pp. 262-270 (1980).
[2] J. J. Mariani, Speech in the Context of Human-Machine Communication, ISSD-93, pp. 91-94 (Nov. 1993).
[3] H. Ando et al., Evaluation of Multimodal Interface Using Speech and Pointing Gesture on an Interior Design System, Trans. IEICE, J77-D-II (Aug. 1994).
[4] P. Maes, Learning Interface Agents, FRIEND21 '94 International Symposium (Feb. 1994).
[5] A. Amano et al., An Experimental Spoken Dialogue System, Paper of Autumn Meeting of the Acoustical Society of Japan, pp. 39-40 (Oct. 1992).