
J. COMMUN. DISORD. 21 (1988), 11-20

SPEECH INTELLIGIBILITY OF TWO VOICE OUTPUT COMMUNICATION AIDS

PATRICIA KANNENBERG, JAYNE LARSON, and THOMAS P. MARQUARDT
Program in Communication Disorders, Department of Speech Communication, University of Texas, Austin

The purpose of the study was to investigate the intelligibility of two voice output communication aids. Twenty words and 20 sentences, synthesized with the Personal Communicator and with the SpeechPAC, were presented to 20 listeners. Listener transcriptions of the stimuli were used to compare the intelligibility of the two communication aids. Analysis of variance revealed significantly higher intelligibility scores for the Personal Communicator compared to the SpeechPAC and higher scores for sentences than for words. The implications of these findings for selection of augmentative communication devices are discussed.

INTRODUCTION

Recent advances in technology have led to the development of several augmentative communication aids with synthesized speech output. These devices are termed voice output communication aids, or VOCAs (Chial, 1984). VOCAs use direct selection, encoding, or scanning as input methods. Output methods, in addition to taped or synthesized speech, include printed, video, visual, and/or liquid crystal (LCD) displays. Many are small and portable (e.g., Phonic Ear VOIS machines, SpeechPAC); others are designed as peripheral devices for microcomputers (e.g., Votrax Type-N-Talk, Echo II). VOCAs offer instantaneous communication, output that can be understood by nonreaders, and an efficient means of communicating in groups, across a room, and over the telephone. VOCAs also offer greater communicative independence; the nonspeaking individual can use his or her own "voice" rather than relying on others to translate gestures, signs, speech, and/or pointing responses. A major disadvantage of many of these systems, however, is the poor intelligibility of their synthesized speech output.

Address correspondence to Thomas P. Marquardt, Ph.D., Director, Program in Communication Disorders, Department of Speech Communication, University of Texas at Austin, Austin, TX 78712-1089.



Research on the intelligibility of synthesized speech has focused on military and industrial applications such as telecommunications, response and warning systems, and interactive computer systems (Fons and Gargagliano, 1981; Simpson et al., 1985). Few studies have evaluated the intelligibility of the synthesized speech used in augmentative communication devices. Kraat and Levinson (1984) noted that this information is needed to make intelligent decisions about the selection of communication aids, the synthesized speech used in these devices, and the strategies that optimize the listener's ability to understand synthesized speech used for communication.

Research on VOCAs has concentrated on identifying the signal (e.g., rate, fundamental frequency, amplitude, accuracy of pronunciation), message, and listener variables that improve or reduce the intelligibility of synthesized speech. Many of these unpublished studies have been reviewed by Kraat (1985). In general, it has been demonstrated that increased exposure time to synthesized speech increases intelligibility, that intelligibility is better for polysyllabic words and short phrases than for single words, and that increased pause time between words in sentences significantly increases intelligibility. In terms of listeners' ability to understand and respond to synthesized speech, factors such as the context in which the stimulus items are presented, the type of response required of the listener, the presence of background noise, and the listener's familiarity with synthesized speech have been shown to contribute to the overall intelligibility of the system.

The interaction of signal, message, and listener variables produces what Simpson and Navarro (1984) term "operational intelligibility." Operational intelligibility is basic phoneme intelligibility conditioned by physical, pragmatic, and linguistic contexts. The physical context includes extraneous noise and vibration produced by the system as well as other environmental influences. Elements of the pragmatic context include the ongoing event, influences of past and present events, and social constraints filtered through the listener's knowledge of and experience with real-world interchanges. Linguistic context is provided by the listener's knowledge of probable linguistic events. The concept of operational intelligibility is particularly applicable to the evaluation of speech synthesizers used in VOCAs because it is the summation of signal, listener, and message factors.

Until recently (Kraat and Levinson, 1984; Punzi and Kraat, 1985), formalized evaluations of synthesized speech (Voiers, 1983) have used assessment models, materials, and listeners that are inappropriate for examining aspects of language interaction or the efficacy of synthesized speech communication. Chial (1984) has argued that while there is a need to examine the individual components of operational intelligibility, there also is a need to study less structured communication where vocabularies, listeners, messages, and


signals collectively contribute to intelligibility and impact on communicative effectiveness.

Intelligibility indices have been the primary method used to determine the effectiveness of speakers, including those with dysarthria (e.g., Beukelman and Yorkston, 1979), hearing impairment (Sitler, Schiavetti, and Metz, 1983), and alaryngeal speech (Tikofsky, 1965). Beukelman and Yorkston (1979) found a close relationship between the intelligibility scores of speakers with dysarthria and the amount of information that was successfully communicated to listeners. They concluded that intelligibility measures can serve as an overall index of a particular communication disorder and as an indication of communicative performance.

Several methods have been employed to measure the intelligibility of speakers, including rating scales (e.g., Platt, 1980), single word or sentence transcription (Beukelman and Yorkston, 1980), identification of key words in sentences (Voiers, 1983), and percentage estimates (Beukelman and Yorkston, 1980). Yorkston and Beukelman (1978) compared several techniques for measuring the intelligibility of dysarthric speech. Single word transcription was compared to percentage estimates, rating scales, multiple choice tests, and sentence completion tasks. All techniques ranked speakers in a similar manner. While greater test-retest reliability was obtained with the objective measures, there was a wide dispersion between the average estimates and average transcription scores for individual subjects. Multiple choice tests yielded the highest intelligibility scores, sentence completion yielded midrange scores, and transcription tasks yielded the lowest scores. The authors concluded that subjective measures such as percentage estimates and rating scales were impractical owing to the large number of judges required for a reliable assessment of intelligibility. Objective measures, using a large number of samples from the speaker, were advocated because of their greater reliability with fewer judges.

Some of the same techniques have been applied to the measurement of the intelligibility of synthesized speech. Clark (1983), Kraat and Levinson (1984), and Punzi and Kraat (1985) used word/sentence transcription or identification tasks. The listeners either wrote the word or sentence they heard or identified the stimulus from a group of words, sentences, or pictures. Voiers (1983) recommended the use of the Diagnostic Rhyme Test (DRT) to evaluate computer processed speech. The DRT, adapted from the Modified Rhyme Test (House et al., 1965), consists of rhyming word pairs that differ by a single distinctive feature. Tests such as the DRT, which use carefully controlled listeners, materials, and contexts, are appropriate for analyzing basic phoneme intelligibility (Chial, 1984). However, when the purpose of the assessment is to determine the communicative effectiveness of a system, measurements of intelligibility should use listeners, materials, and procedures that more closely resemble the communicative interactions of augmentative device users.


The purpose of this study was to use intelligibility scores to compare the effectiveness of two voice output communication aids: the Personal Communicator and the SpeechPAC.

METHOD

Communication Devices

The communication aids used in this study were the Personal Communicator (AudioBionics, Eden Prairie, Minnesota) and the SpeechPAC (Adaptive Communication Systems, Pittsburgh, Pennsylvania). Both systems are battery operated and portable and have visual (LCD) and synthesized speech output. The SpeechPAC also has printed output.

The Personal Communicator has a preprogrammed 1700-word vocabulary (480 root words and their derivations). Access is primarily through direct selection, with some encoding available for commonly used preprogrammed words and phrases and user-programmable messages. The maximum memory capacity is 8000 characters. Words that are not included in the 1700-word vocabulary are spelled out. The device also provides telecommunications via voice, TDD, or computer modes (Kraat and Sitver-Kogut, 1985).

The SpeechPAC is an Epson HX-20 portable computer with a speech synthesis unit attached. Special software permits access through direct selection, encoding, or scanning. Methods of operation include standard keyboard, expanded keyboard, switch-activated scan, joystick-directed scan, light pointer, and Morse code. The SpeechPAC is fully user programmable, with a memory capacity of 26,000 characters. The speech synthesis algorithm is phoneme based, providing text-to-speech synthesis in which the synthesizer translates text (orthography) into connected speech. Word pronunciations can be programmed into memory to eliminate the need to alter spelling for correct pronunciation. The user can adjust the volume, pitch, speed, and filter of the voice to resemble that of a man, woman, or child, or to add stress and intonation patterns to words and sentences. The SpeechPAC has computer access capabilities through an RS232C port (Kraat and Sitver-Kogut, 1985).

Stimuli

Twenty single words and 20 sentences were used as stimuli. The words and sentences were chosen to represent functional utterances a communication aid user might produce in real-life interactions, either to respond to others or to initiate conversation. All words were included on a list of the 500 most frequently occurring words produced by adult communication augmentation system users (Beukelman et al., 1984), with the restriction that they were available in the Personal Communicator preprogrammed vocabulary.


Table 1. Word Stimuli and SpeechPAC Spelling Revisions

Presentation order   Word      Spelling changes
1                    today     toodaiy
2                    stop      stawpp^a
3                    home      hoamme
4                    love      luvv^a
5                    wish      wish
6                    money     muhnee
7                    food      ffoo'dd
8                    please    ppleaz
9                    problem   problum^a
10                   listen    lissen^a
11                   Friday    friday
12                   where     wairre
13                   eat       eett
14                   hurt      hirrt^a
15                   hello     hello
16                   help      hellpp
17                   play      pplaiy
18                   hand      hann'dd
19                   sorry     sorry
20                   school    school

^a Epson SpeechPAC Talk Best, 1985.

Sentences were taken from or adapted from suggestions made by Bristow and Fristoe (1984) and Non-Oral Communication (1980) regarding items to include on communication boards. The lists of word and sentence stimuli are shown in Tables 1 and 2. Also shown are the variations from standard orthography and the pitch and pause adjustments used with the SpeechPAC to maximize the intelligibility of the synthesized output.

The stimulus items were tape recorded in a sound-treated room on a Sharp tape recorder, model RD-6667AV, with a Sharp microphone, model MC-78DV, positioned approximately 2 in. from the speaker of the device. The SpeechPAC was set at volume 15, pitch 65, speed 11, and filter 232, which produced an adult male voice. These settings offered the best apparent intelligibility for the device. The Personal Communicator does not have adjustments for speed (except for the speed at which words are spelled), pitch, or filter. It was set at the higher of its two volume levels. The Personal Communicator has an adult female voice. No attempt was made to match the pitch levels of the two devices because of the potential effect this adjustment could have on the intelligibility of the SpeechPAC output. The tape was rerecorded on a Nakamichi BX-1 tape recorder equipped with a VU meter and Dolby noise reduction. The volume meter was adjusted to peak at 0 dB for each stimulus item.

Table 2. Sentence Stimuli and SpeechPAC Spelling Revisions

Presentation order   Sentence or phrase         Spelling changes
1                    I need money.              i-need-muhnee^c
2                    That's right.              that's right
3                    How are you?               /L-/RT14-how----/L-/RT16-r-uh
4                    I work next week.          i-wirrkk'-next-week
5                    Who are you?               /L-/RT14-who-/L-/RT15-are----/L-/RT15-u^c
6                    Turn on the light.         turn-on-theh-light
7                    They are coming tonight.   they-are-kumming-tuhnite
8                    Call the doctor.           call-the-dawkterr
9                    I like cars.               i-like-carrz
10                   It's in my room.           it's-in-my-room
11                   Wait for me.               wait-fer-me
12                   Let's go tomorrow.         lets-go-tuhmorow
13                   Get my paper.              get-my-paper
14                   I forgot.                  i-fergot
15                   What time is it?           /L-/RT16-whuh----/L-/RT14-time'is-/L-/RT15-ith
16                   See you later.             see-ya-layder
17                   Give me a minute.          givv-me-uh-minit
18                   Take a walk.               take'a-wawkk
19                   Go away.                   gowuhway
20                   What's your name?          /L-/RT14-whutts'-/L-/RT15-yername^c

^a "-" denotes a space or pause.
^b The sequence "/L-/RT" followed by a number is the code for raising or lowering the relative pitch of the word or words that follow the code.
^c Epson SpeechPAC Talk Best, 1985.
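Conceptually, the spelling revisions and pause/pitch codes in Tables 1 and 2 amount to a lookup table applied to standard orthography before text reaches the phoneme-based synthesizer. The Python sketch below illustrates that preprocessing idea; the respellings are drawn from Tables 1 and 2, but prepare_utterance and the gap convention are hypothetical illustrations, not actual SpeechPAC software.

```python
# Illustrative sketch of orthographic respelling for a text-to-speech
# device, in the spirit of Tables 1 and 2. The respellings below come
# from the tables; prepare_utterance() is a hypothetical helper, not
# an actual SpeechPAC routine.

RESPELLINGS = {
    "today": "toodaiy",
    "stop": "stawpp",
    "money": "muhnee",
    "problem": "problum",
    "doctor": "dawkterr",
}

def prepare_utterance(text, word_gap="-"):
    """Swap each word for its tuned respelling (if one exists) and
    join words with an explicit gap marker; per Table 2, "-" denotes
    a space or pause between words."""
    words = text.lower().split()
    return word_gap.join(RESPELLINGS.get(word, word) for word in words)

print(prepare_utterance("call the doctor"))  # -> call-the-dawkterr
```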

Listeners

Judges were 20 adult listeners between 20 and 55 years of age without a history of hearing loss. None of the listeners was experienced in making judgments about synthesized speech. Stimuli were presented individually in a quiet room under earphones (Telex 610-2) on a Sharp model RD-6667AV tape recorder. The judges were instructed to write what they heard. Each judge was allowed to listen to the tape at his or her most comfortable listening level, and up to three repetitions of each stimulus item were allowed. The order of presentation of the word and sentence lists and of the communication devices (SpeechPAC versus Personal Communicator) was fully counterbalanced to control for learning effects.

Intelligibility scores were calculated from the judges' responses to the stimulus items. The score was based on the percentage of whole words correctly identified. No credit was assigned for partially correct words. There were 20 words in the single word condition and 63 words in the sentence condition.
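As a concrete illustration of this scoring rule (score = 100 x whole words correct / total target words), the sketch below computes an intelligibility score from a target utterance and a listener transcription. It assumes the transcription aligns word for word with the target, a simplification of the hand scoring actually used; the function name is ours, not the authors'.

```python
# A minimal sketch of the whole-word scoring rule described above:
# only exact whole-word matches earn credit, with no partial credit.
# Assumes the listener's transcription aligns word-for-word with the
# target, which simplifies the original hand scoring.

def intelligibility_score(target, transcription):
    """Percentage of target words reproduced exactly, position by position."""
    target_words = target.lower().split()
    heard_words = transcription.lower().split()
    correct = sum(t == h for t, h in zip(target_words, heard_words))
    return 100.0 * correct / len(target_words)

# Example: 3 of 4 words correct -> 75.0
print(intelligibility_score("turn on the light", "turn on the night"))
```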


RESULTS

Mean single word and sentence intelligibility scores for each synthesizer condition are shown in Table 3. Mean intelligibility scores were greater for single words (90.5) and sentences (95.2) produced by the Personal Communicator than for single words (61.0) and sentences (83.1) produced by the SpeechPAC.

A two-way analysis of variance with repeated measures was used to evaluate the effects of synthesizer (SpeechPAC, Personal Communicator) and stimulus length (single word, phrase/sentence) on intelligibility. A significant effect was found for synthesizer (F = 116.35; df = 1,76; p < .0001) and for length (F = 48.27; df = 1,76; p < .0001). A significant interaction effect also was found (F = 20.5; df = 1,76; p < .0001). These results suggest that single words and sentences produced by the Personal Communicator are more intelligible than those synthesized by the SpeechPAC, and that sentences are more intelligible than single words produced by the synthesizers. The interaction between synthesizers and

Table 3. Word and Sentence Intelligibility Scores for Personal Communicator and SpeechPAC

                     Personal Communicator      SpeechPAC
Subjects             Words     Sentences        Words     Sentences
1                    90        95.2             45        93.7
2                    80        95.2             60        88.9
3                    70        96.8             50        79.4
4                    90        93.7             70        92.1
5                    95        92.1             65        85.7
6                    100       95.2             85        88.9
7                    95        95.2             55        85.7
8                    95        93.7             65        92.1
9                    100       93.7             75        88.9
10                   90        93.7             75        88.9
11                   95        95.2             60        87.3
12                   85        92.1             30        73.0
13                   90        92.1             65        71.4
14                   100       100.0            65        81.0
15                   80        95.2             80        98.4
16                   95        98.4             65        79.4
17                   90        98.4             50        73.0
18                   95        92.1             50        66.7
19                   90        98.4             65        68.3
20                   85        96.8             60        85.7
Mean                 90.5      95.2             61.0      83.1
Standard deviation   7.59      2.37             12.42     8.91
Range                30.0      7.9              55.0      31.7


length was the result of a major reduction in intelligibility between the word and sentence conditions for the SpeechPAC. Post hoc t-test comparisons revealed significant differences (p < .05) between the word and sentence conditions for the Personal Communicator and for the SpeechPAC. Significant differences (p < .05) also were found between the intelligibility scores for the SpeechPAC and the Personal Communicator for words and for sentences. Therefore, it can be concluded that synthesized speech produced by the Personal Communicator is more intelligible than the output from the SpeechPAC for both words and sentences, and that intelligibility is higher for sentences than for words for both systems.
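For readers who wish to check the device contrasts against Table 3, the per-subject scores can be re-analyzed with a modern statistics package. The sketch below uses scipy paired t-tests as a stand-in for the original post hoc comparisons; it is a reconstruction that assumes the table values are exact, so results may differ slightly from the published statistics (e.g., through rounding in the table).

```python
# Reconstruction sketch: recompute the device comparisons from the
# per-subject scores in Table 3 using scipy. This is not the authors'
# original 1988 analysis; values may differ slightly from those
# reported because of rounding in the published table.

from scipy import stats

pc_words = [90, 80, 70, 90, 95, 100, 95, 95, 100, 90,
            95, 85, 90, 100, 80, 95, 90, 95, 90, 85]
sp_words = [45, 60, 50, 70, 65, 85, 55, 65, 75, 75,
            60, 30, 65, 65, 80, 65, 50, 50, 65, 60]
pc_sents = [95.2, 95.2, 96.8, 93.7, 92.1, 95.2, 95.2, 93.7, 93.7, 93.7,
            95.2, 92.1, 92.1, 100.0, 95.2, 98.4, 98.4, 92.1, 98.4, 96.8]
sp_sents = [93.7, 88.9, 79.4, 92.1, 85.7, 88.9, 85.7, 92.1, 88.9, 88.9,
            87.3, 73.0, 71.4, 81.0, 98.4, 79.4, 73.0, 66.7, 68.3, 85.7]

# Paired comparisons across the same 20 listeners, mirroring the
# post hoc device contrasts reported above.
for label, pc, sp in [("words", pc_words, sp_words),
                      ("sentences", pc_sents, sp_sents)]:
    t, p = stats.ttest_rel(pc, sp)
    print(f"{label}: PC mean={sum(pc)/len(pc):.1f}, "
          f"SpeechPAC mean={sum(sp)/len(sp):.1f}, t={t:.2f}, p={p:.5f}")
```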

DISCUSSION

A major finding of this study was the significantly higher intelligibility scores for single words and sentences produced by the Personal Communicator compared to the SpeechPAC. Several factors may account for the better intelligibility of this communication aid. Since the Personal Communicator has a fixed vocabulary, memory can be devoted to maximizing the intelligibility of individual words. The text-to-speech design of the SpeechPAC allows for an unlimited vocabulary, but more memory is required for this flexibility and the intelligibility of individual words is compromised.

The Personal Communicator also may be more intelligible because the pause time between words in sentences was greater. Although stimulus duration was not analyzed, several of the listeners noted that the extra time between words in the sentences produced with the Personal Communicator made these stimuli more understandable. This observation is in keeping with the finding of Kraat and Levinson (1984) that increased pause time between words increases the intelligibility of synthesized sentences because it allows the listener additional processing time.

The procedures used to maximize the intelligibility of the SpeechPAC output may have reduced intelligibility scores. In several cases, space (time) between words was eliminated to make a sentence sound more "natural." The effect of this change may have been to reduce intelligibility. However, it is unlikely that these small adjustments could account for the major difference in intelligibility between the sentence-length outputs of the two systems.

The finding of higher intelligibility scores for sentences compared to words is not surprising (Punzi and Kraat, 1985). Sentences and phrases offer linguistic cues that aid the listener in narrowing the pool of possible words and in predicting what words follow. The significant interaction between synthesizers and stimulus length was produced primarily by improved intelligibility scores on sentence output for the SpeechPAC. The


finding of improved intelligibility for sentences with a system that has relatively poor intelligibility at the word level has important implications for the VOCA user. Individuals using a synthesizer with these characteristics should use sentences and phrases in order to maximize communicative efficiency. Given that a patient is a candidate for either device, the Personal Communicator might be viewed as the VOCA of choice. However, since the user's abilities, needs, and preferences must be matched to a communication aid, the additional access options and unlimited user-programmable vocabulary of the SpeechPAC may hold important advantages that make it the more appropriate VOCA option for a given user.

REFERENCES

Beukelman, D. R., and Yorkston, K. M. (1979). The relationship between information transfer and speech intelligibility of dysarthric speakers. J. Commun. Disord. 12:189-196.

Beukelman, D. R., and Yorkston, K. M. (1980). Influence of passage familiarity on intelligibility estimates of dysarthric speech. J. Commun. Disord. 13:33-41.

Beukelman, D. R., Yorkston, K. M., Poblete, M., and Naranjo, C. (1984). Frequency of word occurrence in communication samples produced by adult communication aid users. J. Speech Hear. Disord. 49:360-367.

Bristow, D., and Fristoe, M. (1984). Systematic evaluation of the nonspeaking child. Miniseminar presented at the Convention of the American Speech-Language-Hearing Association, San Francisco.

Chial, M. (1984). Evaluating microcomputer hardware. In A. J. Schwartz (ed.), Handbook of Microcomputer Applications in Communication Disorders. San Diego: College Hill Press.

Clark, J. E. (1983). Intelligibility comparisons for two synthetic and one natural speech source. J. Phonetics 11:37-49.

Epson SpeechPAC Talk Best (1985). Sioux Falls, SD: Crippled Children's Hospital.

Fons, K., and Gargagliano, T. (1981). Articulate automata: An overview of speech synthesis. Byte 6:164-187.

House, A. S., Williams, C. E., Hecker, M. H., and Kryter, K. D. (1965). Articulation-testing methods: Consonantal differentiation with a closed-response set. J. Acoust. Soc. Amer. 37:158-166.

Kraat, A. (1985). Communication Interaction Between Aided and Natural Speakers: A State of the Art Report. Toronto: Canadian Rehabilitation Council for the Disabled.

Kraat, A., and Levinson, E. (1984). Intelligibility of two speech synthesizers used in augmentative communication devices for the severely speech impaired. Paper presented at the Third International Conference on Augmentative and Alternative Communication. Boston: Massachusetts Institute of Technology.

Kraat, A., and Sitver-Kogut, M. (1985). Features of Commercially Available Communication Devices. Flushing, NY: Queens College.

Non-Oral Communication: A Training Guide for the Child Without Speech (1980). Fountain Valley, CA: Fountain Valley School District.

Platt, L. J. (1980). Dysarthria of adult cerebral palsy: I. Intelligibility and articulation impairment. J. Speech Hear. Res. 23:28-40.

Punzi, L. M., and Kraat, A. (1985). The effect of context on preschool children's understanding of synthetic speech: A pilot study. Working Papers Speech Lang. Pathol. 13:84-106.

Simpson, C. A., McCauley, M. E., Roland, E. F., Ruth, J. C., and Williges, B. H. (1985). System design for speech recognition and generation. Human Factors 27:115-141.

Simpson, C. A., and Navarro, T. N. (1984). Intelligibility of computer generated speech as a function of multiple factors. Proc. Natl. Aerospace Elect. Conf. New York: IEEE, pp. 932-940.

Sitler, R. W., Schiavetti, N., and Metz, D. E. (1983). Contextual effects in the measurement of hearing-impaired speakers' intelligibility. J. Speech Hear. Res. 26:30-34.

Tikofsky, R. S. (1965). A comparison of the intelligibility of esophageal and normal speakers. Folia Phon. 17:19-32.

Voiers, W. D. (1983). Evaluating processed speech using the Diagnostic Rhyme Test. Speech Technol. 30-39.

Yorkston, K., and Beukelman, D. (1978). A comparison of techniques for measuring intelligibility of dysarthric speech. J. Commun. Disord. 11:499-512.