Production and perception of final consonant voicing in speech during simultaneous communication


DALE EVAN METZ, NICHOLAS SCHIAVETTI, AMY LESSLER, and YVONNE LAWE
Department of Communicative Disorders and Sciences, State University of New York, Geneseo, New York

ROBERT L. WHITEHEAD and BRENDA H. WHITEHEAD
National Technical Institute for the Deaf, Rochester Institute of Technology, Rochester, New York

Simultaneous communication combines both spoken and manual modes to produce each word of an utterance. This study investigated the potential influence of alterations in the temporal structure of speech produced during simultaneous communication on the perception of final consonant voicing. Experienced signers recorded words that differed only in the voicing characteristic of the final consonant under two conditions: (a) speech alone and (b) simultaneous communication. The words were digitally edited to remove the final consonant and played to 20 listeners who, in a forced-choice paradigm, circled the word they thought they heard. Results indicated that accurate perception of final consonant voicing was not impaired by changes in the temporal structure of speech that accompany simultaneous communication. © 1997 by Elsevier Science Inc.

Educational Objectives: The reader will (1) acquire knowledge and understanding of simultaneous communication and its role in the education and communication processes with children who are deaf; and (2) understand the relationship between temporal elongation of speech during simultaneous communication and perception of final consonant voicing.

KEY WORDS: Speech perception; Simultaneous communication.

Address correspondence to Dale Evan Metz, Ph.D., Department of Communicative Disorders and Sciences, State University of New York at Geneseo, Geneseo, NY 14454.

J. COMMUN. DISORD. 30 (1997), 495–505 © 1997 by Elsevier Science Inc. 655 Avenue of the Americas, New York, NY 10010

0021-9924/97/$17.00 PII S0021-9924(97)00047-6

INTRODUCTION

Simultaneous communication (SC) combines speech and various forms of manually coded English (e.g., signs, fingerspelling) in an attempt to produce each word of an utterance in both spoken and manual modes (Akamatsu & Stewart, 1989; Marmor & Petitto, 1979; Maxwell & Bernstein, 1985; Strong & Charlson, 1987). A suggested advantage of using SC with persons who are deaf or hard of hearing is that a more accurate representation of English is provided than by lipreading alone, whereas suggested disadvantages of SC include alterations in the linguistic integrity of manual and oral forms of communication (e.g., deletions of grammatical sign markers) and a slowing of speech (Vernon & Andrews, 1990). A specific criticism of the effects of SC on oral communication was leveled by Huntington and Watton (1984), who observed that teachers who used SC showed disruptions in the normal rhythm of speech and concluded that SC may not always expose deaf and hard-of-hearing children to the typical prosodic and segmental features of normal speech.

Research findings have, in fact, indicated that the overall temporal structure of speech is altered dramatically during SC. Notable temporal alterations in speech include a slower rate of articulation; increased sentence, word, and vowel durations; increased voice onset times and interword interval durations; and more pauses immediately after a signed word (Bellugi & Fischer, 1972; Huntington & Watton, 1984; Schiavetti, Whitehead, Metz, Whitehead, & Mignerey, 1996; Whitehead, Schiavetti, Whitehead, & Metz, 1995; Windsor & Fristoe, 1989, 1991).

Temporal aspects of speech play a pivotal role in the perception of certain phonemic contrasts spoken in English. Salient cues for the perception of final consonant voicing, for example, appear to be carried by the duration of the vowel preceding the final consonant. Vowel durations preceding voiced consonants are generally 50 to 100 msec longer than vowel durations preceding voiceless consonants (House, 1961; Peterson & Lehiste, 1960). Chen (1970) has reported that the average ratio of English vowel durations preceding voiceless versus voiced consonants was 0.61.
According to Kent and Read (1992), durational differences of such magnitude should be readily perceived and serve as a salient cue for the perception of final consonant voicing. Kent and Read’s (1992) assertion is supported by research indicating that when the acoustic information representing the final consonant has been digitally removed or synthetically edited from a word, listeners can identify accurately the voicing characteristic of the final consonant by hearing only the preceding vowel (Hillenbrand, Ingrisano, Smith, & Flege, 1984; Raphael, 1972). Moreover, Wardrip-Fruin (1982) has demonstrated that the overall syllable duration of monosyllabic (CVC) words has an even greater effect on final consonant voicing perception than preceding vowel duration cues alone. Specifically, when listeners were presented syllables in which the final voiced or voiceless consonant was deleted, they typically perceived the final consonant to be voiceless when the preceding syllable duration (i.e., the CV portion of the CVC syllable) was less than 200 msec and voiced when the preceding syllable duration was greater than 200 msec. When the preceding syllable duration was between 375 and 400 msec, listeners overwhelmingly judged the final consonant to be voiced.


Preceding vowel duration is probably not the only cue for final consonant voicing perception. Hogan and Rozsypal (1980) have shown that approximately 48% of the variance accounted for in final consonant voicing perception was attributable to four acoustic variables: (a) voice bar duration [22%], (b) vowel duration [21%], (c) burst friction duration [4%], and (d) silent closure duration [1%]. In addition, Hillenbrand et al. (1984) investigated several acoustic parameters (including preceding vowel duration) putatively associated with final consonant voicing distinctions, but were unable to find a unitary acoustic measure or combination of acoustic measures that “clearly explained . . . listeners’ voiced–voiceless decisions.” Despite these findings regarding the influence of multiple variables on the perception of final consonant voicing, Kent and Read (1992) suggest that duration cues do play an important role when other cues are absent or ambiguous. In light of the evidence provided by Chen (1970) and Wardrip-Fruin (1982) and the observed temporal lengthening of speech segments that occurs during SC, it is reasonable to ask if such temporal alterations could have an influence on the perception of final consonant voicing. Specifically, could increased segment durations associated with SC introduce acoustic cues that would be sufficient to cause voiceless final consonants to be perceived as voiced? Digital signal processing techniques could be used to evaluate this question by removing final voiced or voiceless consonants from words produced under speech alone and SC conditions and having listeners judge final consonant voicing of the digitally edited words. Digital removal of the final consonant may eliminate some acoustic cues present during the closure interval of voiced final stops that are absent during voiceless final stops. 
Even with the potential loss of certain acoustic cues intrinsic to the final consonant, listeners can make accurate final consonant voicing decisions based on information carried in the preceding vowel (Hillenbrand et al., 1984). As such, removal of the final consonant provides a reasonably uncorrupted view of the potential effects increased segment durations associated with SC may have on final consonant voicing perception when other perceptual cues are absent or ambiguous (cf. Kent & Read, 1992). Therefore, the specific purpose of this study was to compare final consonant voicing perception of digitally edited words that had been produced in speech alone and SC conditions.

METHOD

Speakers

Four females and four males who were highly skilled users of SC served as speakers. They were normally hearing faculty members at the National Technical Institute for the Deaf who had taught young deaf adults for at least 8 years using SC. English was the first language of all the speakers. Each speaker’s sign language performance was evaluated on the Sign Communication Proficiency Index (Caccamise, Updegraff, & Newell, 1990), and all speakers were classified at the advanced level or higher on this instrument. Thus, the speakers were considered to be fluent in the use of speech combined with signed English and fingerspelling, i.e., SC.

Speech Stimuli

The speech stimuli consisted of six pairs of CVC English experimental words (Table 1). Each pair differed only in the voicing characteristic of the final consonant (e.g., hit vs. hid, but vs. bud, etc.). Each target word was embedded in the carrier phrase “I can say ________ again clearly and with ease” and presented to the speakers on 3 × 5 flash cards.

Recording Procedures

The speakers produced each sentence, with its embedded experimental word, under two conditions: (a) speech alone; and (b) speech combined with signed English and fingerspelling, or SC. The speakers were instructed to produce the sentences orally at a comfortable rate and loudness level for the speech alone and SC conditions. Additionally, for the SC condition, the speakers were instructed to sign all the words of the carrier phrase except the experimental word, which was fingerspelled. The rationale for having the speakers fingerspell the experimental word was twofold: (a) adding fingerspelling to the task made it more representative of typical SC; and (b) because some of the experimental words like “bud” and “hud” do not have conventional signs, the instruction to fingerspell the experimental word provided uniformity of production across speakers. The experimental words were arranged in a random order and the order of experimental condition (speech alone vs. SC) was randomized for each speaker. Speech samples for both conditions were obtained in a sound-attenuating room with a microphone (Shure 545SD) that was placed 30 cm from the speaker’s mouth and connected to a Nakamichi (1000II) tape recorder.

Table 1. Stimulus Words

| Voiced Final Consonant | Voiceless Final Consonant |
| --- | --- |
| Hid | Hit |
| Bid | Bit |
| Bad | Bat |
| Had | Hat |
| Bud | But |
| Hud | Hut |


Digital Editing Procedures

The audio recordings were low-pass filtered at 8 kHz and digitized with 16-bit precision at 20 kHz by a Kay Elemetrics Computerized Speech Laboratory (CSL) system (4300B) and stored on disk. A trained research assistant displayed the digital representation of the test sentences on a graphics terminal, and the CVC experimental word was identified and its duration was measured. A cursor was then placed and stored at the onset of acoustic energy associated with the release of the initial consonant of each experimental word. A second cursor was placed and stored at the zero-crossing (to avoid introducing audible clicks) preceding the last cycle of the experimental word’s vowel. These two cursors demarcated the CV syllable portion of the CVC experimental word; the last two vowel cycles and the final consonant’s closure phase and release were eliminated. The duration of the CV syllable portion of the experimental word was measured and then submitted to the CSL’s D-to-A processor. The line output of the D-to-A processor was connected to a high-quality tape recorder (Tascam 202MKII). Separate listening tapes were made for each subject that comprised a random order of the CV syllable portions of the experimental words from both conditions (speech alone and SC).
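The idea of cutting a waveform at a zero crossing can be illustrated with a short sketch. This is not the CSL procedure used in the study: the function names, the synthetic 100 Hz sine wave standing in for a vowel, and the use of upward (negative-to-positive) crossings as cycle boundaries are all assumptions made for illustration. Only the 20 kHz sampling rate is taken from the text.

```python
import math

SAMPLE_RATE_HZ = 20_000  # digitization rate reported in the text


def synthetic_vowel(f0_hz=100.0, duration_s=0.1, sample_rate_hz=SAMPLE_RATE_HZ):
    """Hypothetical stand-in for a vowel: a sine wave offset by half a sample
    so that no sample lands exactly on zero."""
    n = int(round(duration_s * sample_rate_hz))
    return [math.sin(2 * math.pi * f0_hz * (i + 0.5) / sample_rate_hz) for i in range(n)]


def upward_zero_crossings(samples):
    """Indices i where the signal passes from negative (i - 1) to non-negative (i)."""
    return [i for i in range(1, len(samples)) if samples[i - 1] < 0 <= samples[i]]


def truncate_before_last_cycles(samples, cycles_to_drop=1):
    """Keep everything before the upward zero crossing that begins the
    last `cycles_to_drop` cycles (assumed cycle boundary, see lead-in)."""
    crossings = upward_zero_crossings(samples)
    if cycles_to_drop < 1 or len(crossings) < cycles_to_drop:
        raise ValueError("not enough zero crossings to drop that many cycles")
    cut = crossings[-cycles_to_drop]
    return samples[:cut]


def duration_msec(samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of a sample sequence in milliseconds."""
    return 1000.0 * len(samples) / sample_rate_hz
```

Because the retained segment ends on a sample adjacent to a zero crossing, its final amplitude is close to zero, which is the property that avoids an audible click at the cut point. With the synthetic signal above, dropping one cycle keeps 1,800 of the 2,000 samples, or 90 msec.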

CV Syllable and Vowel Duration Measurement Reliability

Reliability of the durational measurements was assessed by remeasuring the entire data corpus of one randomly selected female speaker and one randomly selected male speaker using the procedures described above. Replicate measurements were made by the research assistant (intrajudge reliability) and by the senior author (interjudge reliability). For intrajudge reliability, the correlations for the female and male speakers were 0.953 and 0.955 for CV syllable duration and 0.935 and 0.968 for vowel duration. The average magnitude differences between the research assistant’s original and replicate measurements for the female and male speakers were 8.92 msec and 4.32 msec for CV syllable duration and 7.04 msec and 6.63 msec for vowel duration. For interjudge reliability, the correlations for the female and male speakers were 0.916 and 0.911 for CV syllable duration and 0.855 and 0.892 for vowel duration. The average magnitude differences between the research assistant’s original measurements and the senior author’s replicate measurements for the female and male speakers were 12.42 msec and 5.90 msec for CV syllable duration and 9.57 msec and 10.72 msec for vowel duration. These results indicate substantial intra- and interjudge agreement for the durational measurements.
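The two reliability indices reported above can be sketched as follows. The sketch assumes that "correlation" means the Pearson product-moment correlation and that "average magnitude difference" means the mean absolute difference between paired measurements; the article does not spell out either formula. The duration values are hypothetical and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_difference(x, y):
    """Average magnitude of the paired differences, in the units of the input."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty, equal-length sequences")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical CV syllable durations in msec (NOT data from the study).
original = [250.0, 300.0, 275.0, 320.0, 260.0, 310.0]
replicate = [255.0, 296.0, 280.0, 318.0, 266.0, 305.0]

r = pearson_r(original, replicate)
mad = mean_absolute_difference(original, replicate)
```

Reporting both indices is useful because a high correlation alone would not reveal a constant offset between two judges, whereas the mean absolute difference would.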

Listeners

Twenty undergraduate students in communicative disorders served as listeners. All listeners passed a hearing screening at 20 dB HL (American National Standards Institute, 1989) at 0.5, 1, and 2 kHz; spoke English as their first language; and were unaware of the nature of the recordings they would audit.

Listening Procedures

The 20 listeners were assembled into four separate listening groups in a sound-attenuating room. The recordings of the speakers were played to the listeners who were seated in groups of five in chairs arranged along an arc traced 2 meters from the center of the playback speaker. Individual response sheets were made for each speaker that listed in pairs the experimental word and its foil (e.g., bit–bid) in the order read by that speaker. The actual words read by the speaker were placed randomly in either the first or second position of the pair. The words were presented in the sound field at 70 dB SPL (C-scale of a B&K 2204 sound level meter positioned 2 meters in front of the loudspeaker). Listeners were instructed to circle one of the words in the pair. Correct word recognition (i.e., correct perception of final consonant voicing) percentages were computed for both the speech alone and SC conditions.
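The scoring step described above, computing percent correct separately for each speaking condition, can be sketched as below. The tuple layout, the function name, and the example entries are hypothetical; they are not the study's response sheets or data.

```python
from collections import defaultdict


def percent_correct_by_condition(judgments):
    """Score forced-choice judgments.

    judgments: iterable of (condition, spoken_word, circled_word) tuples.
    Returns a dict mapping each condition to its percentage of correct choices.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, spoken, circled in judgments:
        totals[condition] += 1
        if spoken == circled:
            hits[condition] += 1
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


# Hypothetical response-sheet entries (NOT data from the study).
example = [
    ("speech alone", "bit", "bit"),
    ("speech alone", "bid", "bit"),
    ("SC", "hat", "hat"),
    ("SC", "had", "had"),
]

scores = percent_correct_by_condition(example)
```

In a two-alternative forced-choice task such as this one, a listener who guesses at random is expected to score about 50%, which is the baseline against which the percentages should be read.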

RESULTS

Syllable and Vowel Durations

Table 2 shows the average edited CV syllable durations, the average vowel durations, and the voiceless-to-voiced vowel duration ratios for both speaking conditions. For the speech-alone condition, the edited CV syllables from words that had ended with final voiced consonants were on average 41.38 msec longer than those that had ended with final voiceless consonants and the vowel duration preceding the voiced consonant was 48.10 msec longer than the vowel duration preceding the voiceless consonant. For the SC condition, the edited CV syllables from words that had ended with final voiced consonants were on average 79.63 msec longer than those that had ended with final voiceless consonants and the vowel duration preceding the voiced consonant was 64.72 msec longer than the vowel duration preceding the voiceless consonant. These results are consistent with English voicing rules and reflect slower speech rate during SC. The vowel duration ratio (voiceless/voiced) was 0.75 for the speech alone condition and 0.72 for the SC condition. These ratios are similar to those reported by Chen (1970) but are slightly higher, presumably because our experimental tasks yielded a somewhat stylized manner of speaking.

Table 2. Means and Standard Deviations of the CV Syllable Durations and Vowel Durations for Words Ending with Voiced and Voiceless Consonants for Speech Produced during the Speech-Alone and Simultaneous Communication Conditions

| Speaking Condition | CV Syllable Duration (msec), Voiced | CV Syllable Duration (msec), Voiceless | Vowel Duration (msec), Voiced | Vowel Duration (msec), Voiceless | Vowel Duration Ratio (a) |
| --- | --- | --- | --- | --- | --- |
| Speech alone | 297.72 (81.09) | 256.34 (61.80) | 193.16 (42.15) | 145.06 (27.29) | 0.75 |
| SC | 358.20 (66.05) | 278.57 (62.96) | 230.98 (28.27) | 166.26 (26.83) | 0.72 |

Durations are given as mean (standard deviation).
Abbreviations: CV = consonant-vowel, SC = simultaneous communication.
(a) Vowel duration ratios (voiceless/voiced) for both experimental conditions are shown in the column.
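The duration differences and vowel duration ratios quoted in this section follow directly from the Table 2 means. The short sketch below recomputes them as a consistency check; the dictionary layout and function name are illustrative choices, while the numbers are the means reported in Table 2.

```python
# Mean durations in msec, copied from Table 2.
TABLE_2_MEANS_MSEC = {
    "speech alone": {
        "cv_voiced": 297.72, "cv_voiceless": 256.34,
        "vowel_voiced": 193.16, "vowel_voiceless": 145.06,
    },
    "SC": {
        "cv_voiced": 358.20, "cv_voiceless": 278.57,
        "vowel_voiced": 230.98, "vowel_voiceless": 166.26,
    },
}


def summarize(means):
    """Voiced-minus-voiceless differences and the voiceless/voiced vowel ratio."""
    return {
        "cv_difference_msec": round(means["cv_voiced"] - means["cv_voiceless"], 2),
        "vowel_difference_msec": round(means["vowel_voiced"] - means["vowel_voiceless"], 2),
        "vowel_ratio": round(means["vowel_voiceless"] / means["vowel_voiced"], 2),
    }


summary = {condition: summarize(m) for condition, m in TABLE_2_MEANS_MSEC.items()}
# summary["speech alone"] -> differences of 41.38 and 48.10 msec, ratio 0.75
# summary["SC"]           -> differences of 79.63 and 64.72 msec, ratio 0.72
```

The recomputed values match the text: the absolute voiced-voiceless differences grow under SC, while the voiceless/voiced ratio changes only from 0.75 to 0.72.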

Listener Judgments

There were 3840 judgments (12 experimental words × two conditions × eight speakers × 20 listeners) made by the listeners in this study. Listeners correctly identified the voicing characteristic of the deleted final consonant for 69.2% of the CV words produced during the speech alone condition and for 77.0% of the CV words produced during the SC condition. Although a slightly larger number of correct voicing judgments was made in the SC than in the speech alone condition, a paired t-test indicated that this was not a significant difference, t(7) = 2.28, p = 0.056.
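The judgment count and the relation between the reported t statistic and its p-value can be checked with a small standard-library sketch. It assumes the reported p-value is two-tailed, and it obtains the tail area by numerically integrating the Student t density with Simpson's rule, which is an illustrative choice rather than the authors' software. The function names are hypothetical.

```python
import math


def total_judgments(words=12, conditions=2, speakers=8, listeners=20):
    """Number of forced-choice judgments implied by the design."""
    return words * conditions * speakers * listeners


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value via Simpson's rule over [0, |t|]; steps must be even."""
    if steps < 2 or steps % 2:
        raise ValueError("steps must be a positive even integer")
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t|)
    return 1.0 - 2.0 * central_area


n_judgments = total_judgments()    # 12 * 2 * 8 * 20
p_value = two_tailed_p(2.28, df=7)  # t(7) = 2.28 as reported
```

The product of the design factors is 3840, and the computed two-tailed p-value falls just above 0.05, in line with the reported nonsignificant result. The 7 degrees of freedom correspond to a paired comparison across the eight speakers.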

DISCUSSION

The results of the present investigation demonstrated that perception of final consonant voicing was not impaired by the durational changes accompanying the typically slower speech pattern of SC. Typical duration changes for SC were found in this study in the form of both prolonged syllable and vowel durations. However, perception of final consonant voicing was probably unaffected by the change in speech rate during SC because the ratios of the vowel durations preceding voiced and voiceless consonants were approximately the same for both the SC and speech alone conditions. Thus, it appears that, although speech was slowed during SC, the relative durations of vowels preceding voiced and voiceless consonants were unaffected, allowing preceding vowel duration to remain as one intact acoustical cue for final consonant perception. Further research should investigate the degree to which other potential acoustic cues that are important for perception of other phonemic characteristics (cf. Hillenbrand et al., 1984) are influenced by rate changes during SC.

Although this study addressed only one specific perceptual task, it should be pointed out that this task is somewhat difficult for listeners, as indicated by our subjects’ error rates of roughly 23% to 31% (the complements of the 77.0% and 69.2% correct rates). This is important because the experimental task takes advantage of a redundant acoustical cue to consonant identification that is carried by the preceding vowel, thus indicating that altered speech rate alone did not produce deleterious carryover effects to this important acoustical redundancy. Such redundancies are important for speech perception under unfavorable listening conditions (e.g., in masking noise or when speaking to hearing-impaired listeners) where multiple linguistic and acoustical cues are important for accurate speech perception. The maintenance of this type of redundant cue, then, may be particularly important for accurate perception when speaking to hearing-impaired listeners. Further research is necessary to test this assumption and should examine the relative intelligibility of recordings of speech alone versus speech produced during SC when these recordings are presented to normal listeners in noise or to hearing-impaired listeners.

The results of this study are consistent with those of Schiavetti et al. (1996), who found that rate-altered speech demonstrated acoustical characteristics with potentially positive perceptual consequences. They reported that the voice onset times (VOTs) of word-initial plosives were elongated during SC, but that the elongation was greater for voiceless than for voiced members of cognate pairs, thus increasing the voicing contrasts. These results would predict better perception of initial voiced versus voiceless phonemes for speech produced during SC than for speech produced alone. In addition, Picheny, Durlach, and Braida (1986) studied speech made intentionally clear for the hearing impaired and found a number of acoustical changes accompanying rate alteration in their speech samples. Clear speech, which was approximately 15% more intelligible than normal speech, showed a slower rate and enhanced VOT distinctions between cognate pair members that were similar to the speech changes during SC reported by Schiavetti et al. (1996). Further research is needed to determine whether speech produced during SC demonstrates any of the other acoustical and perceptual consequences found in the clear speech studied by Picheny et al. (1986).
Although some authors have criticized SC for the speech rate changes evidenced in this mode (Huntington & Watton, 1984), this experiment is the first study to address a specific perceptual consequence of the rate alteration. If the speech rate changes typically found in SC did have deleterious perceptual effects, then serious reconsideration of the continued use of SC would be necessitated. This study addressed only one of the many possible perceptual consequences of slowed speech rate during SC and found no deleterious perceptual consequence. More research is needed to examine other perceptual consequences of speech produced during SC in order to determine its viability as a method for communicating with deaf and hard-of-hearing persons.

A portion of this research was conducted at the National Technical Institute for the Deaf in the course of an agreement between the Rochester Institute of Technology and the United States Department of Education. The authors thank Ms. Jessica Jerome for her assistance with the data collection phase of this research. Part of this research was supported by funds from the Geneseo Foundation provided through the Research Council.


REFERENCES

Akamatsu, C.T., & Stewart, D.A. (1989). The role of fingerspelling in simultaneous communication. Sign Language Studies, 65, 361–374.

Bellugi, U., & Fischer, S. (1972). A comparison of sign language and spoken language. Cognition, 1, 173–200.

Caccamise, F., Updegraff, D., & Newell, W. (1990). Staff sign skills assessment-development at Michigan School for the Deaf: Achieving an important need. Journal of the Academy of Rehabilitative Audiology, 23, 27–41.

Chen, M. (1970). Vowel length as a function of the voicing consonant environment. Phonetica, 22, 129–159.

Hillenbrand, J., Ingrisano, D.R., Smith, B.L., & Flege, J.E. (1984). Perception of the voiced–voiceless contrast in syllable-final stops. Journal of the Acoustical Society of America, 76, 18–26.

Hogan, J.T., & Rozsypal, A.J. (1980). Evaluation of vowel duration as a cue for the voicing distinction in the following word-final consonants. Journal of the Acoustical Society of America, 67, 1764–1771.

House, A.S. (1961). On vowel duration in English. Journal of the Acoustical Society of America, 33, 1174–1178.

Huntington, A., & Watton, F. (1984). Language and interaction in the education of hearing-impaired children (Part 2). Journal of the British Association of Teachers of the Deaf, 8(5), 137–144.

Kent, R.D., & Read, C. (1992). The acoustic analysis of speech. San Diego, CA: Singular.

Marmor, G.S., & Petitto, L. (1979). Simultaneous communication in the classroom: How well is English grammar represented? Sign Language Studies, 23, 99–136.

Maxwell, M., & Bernstein, M.E. (1985). The synergy of sign and speech in simultaneous communication. Applied Psycholinguistics, 6, 63–81.

Peterson, G.E., & Lehiste, I. (1960). Duration of syllable nuclei in English. Journal of the Acoustical Society of America, 32, 693–703.

Picheny, M.A., Durlach, N.I., & Braida, L.D. (1986). Speaking clearly for the hard of hearing. II. Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research, 29, 434–446.

Raphael, L. (1972). Preceding vowel duration as a cue to the perception of the voicing characteristic of word-final consonants in English. Journal of the Acoustical Society of America, 51, 1296–1303.

Schiavetti, N., Whitehead, R.L., Metz, D.E., Whitehead, B.H., & Mignerey, M. (1996). Voice onset time in speech produced during simultaneous communication. Journal of Speech and Hearing Research, 39, 565–572.

Strong, M., & Charlson, E.S. (1987). Simultaneous communication: Are teachers attempting an impossible task? American Annals of the Deaf, 132, 376–382.

Vernon, M., & Andrews, J.F. (1990). The psychology of deafness. New York: Longman.

Wardrip-Fruin, C. (1982). On the status of phonetic cues to phonetic categories: Preceding vowel duration as a cue to voicing in final stop consonants. Journal of the Acoustical Society of America, 71, 187–195.

Whitehead, R.L., Schiavetti, N., Whitehead, B.H., & Metz, D.E. (1995). Temporal characteristics of speech produced during simultaneous communication. Journal of Speech and Hearing Research, 38, 1014–1024.

Windsor, J., & Fristoe, M. (1989). Key word signing: Listeners’ classification of signed and spoken narratives. Journal of Speech and Hearing Disorders, 54, 374–382.

Windsor, J., & Fristoe, M. (1991). Key word signing: Perceived and acoustic differences between signed and spoken narratives. Journal of Speech and Hearing Research, 34, 260–268.

CONTINUING EDUCATION

Production and Perception of Final Consonant Voicing in Speech During Simultaneous Communication

QUESTIONS

1. As used in this study, simultaneous communication (SC) involved:
a. Speech combined with American Sign Language
b. Speech combined with signed English
c. Speech combined with signed English and fingerspelling
d. Speech combined with key word signing
e. Speech combined with fingerspelling

2. A suggested advantage of SC is to:
a. Provide a more accurate representation of English language than lipreading alone
b. Alter the linguistic integrity of manual and oral representations of English
c. Enhance the learning of suprasegmental features of speech by deaf children
d. All of the above
e. None of the above

3. In the present article, final consonant voicing cues were removed by:
a. Mixing speech with masking noise
b. Low-pass filtering speech
c. Digitally editing speech
d. a and b
e. b and c

4. The results reported in this article indicate that:
a. Final consonant voicing was perceived better in speech alone than in SC
b. Final consonant voicing was perceived better in SC than in speech alone
c. Final consonant voicing perception was essentially the same in SC and in speech alone
d. Final consonant voicing was not perceived in speech alone
e. Final consonant voicing was not perceived in SC

5. The results reported in this article indicate that:
a. Vowel duration ratios preceding voiceless/voiced consonants were longer in SC than in speech alone
b. Vowel duration ratios preceding voiceless/voiced consonants were longer in speech alone than in SC
c. Vowel duration ratios preceding voiceless/voiced consonants were the same in SC and in speech alone
d. Vowel duration ratios preceding voiceless/voiced consonants could not be calculated in SC
e. Vowel duration ratios preceding voiceless/voiced consonants could not be calculated in speech alone