Influence of speaker age on perceptual cue distribution


Journal of Phonetics (1978) 6, 275-282

Martha M. Parnell University of Missouri-Columbia, U.S.A.

James D. Amerman University of Missouri-Columbia, Columbia, U.S.A.

Conrad W. LaRiviere University of Maine, Orono, Maine, U.S.A.

Received 25th February 1977

Abstract:

Twenty adult listeners were presented with electronically gated segments from nine voiceless stop + vowel syllables spoken by a 4·2-year-old child and by an adult. Listener recognition of syllabic and sub-syllabic segments in both the child-speaker and adult-speaker tasks reveals patterns of relative perceptual cue distribution similar to those previously reported by LaRiviere et al. (1975). Recognition percentages were consistently lower across all gating conditions in the child-speaker task. Implications regarding maturational influences on encoding and coarticulatory behavior are discussed.

Introduction

Previous investigations of cues for speech perception by Harris (1953), Delattre et al. (1955), Liberman et al. (1967), Cole & Scott (1974), Carpenter (1976) and others primarily utilized speech synthesis and tape-splicing techniques for segmentation of adult speech, and concentrated mainly on presentation of the manipulated stimuli to adult listeners. These studies have not only provided useful information on perceptual cue utilization by adults but have also permitted indirect observation of the encoding process. However, they have typically been limited by a lack of control over stimulus segmentation of real speech. In addition, only a few investigators have attempted to observe the development of speech perceptual behavior in children.

Availability of precision real-speech segmentation techniques utilizing an electronic gating procedure allows for a more comprehensive investigation of the speech decoding process. More specifically, precise temporal segmentation of real speech permits observation of how and to what extent the listener utilizes the speaker's coarticulatory behavior in making perceptual judgments. Analysis of perceptual responses to temporally segmented speech stimuli may serve as an indirect but possibly quite sensitive method for the observation of speech production behavior.

0095-4470/040275+08 $02.00

© 1978 Academic Press Inc. (London) Ltd.


This investigation constitutes the first of a series of experiments planned to examine maturational influences on the encoding and decoding of speech by varying speaker-listener age relationships. The purpose of this first experiment was to extend the findings of LaRiviere et al. (1975) concerning perceptual cue distribution, and to analyze, for comparison, listener recognition of stimuli spoken by a young child. Evaluation of the influence of a child's speech on adult perceptual strategies would also permit speculation regarding maturational influences on encoding and coarticulatory behavior.

Method

The subject group consisted of 20 adults between the ages of 20 and 25. All were speakers of General American Dialect, with no past or present evidence of speech or language impairment. Each subject passed a pure-tone audiometric sweep test at 500, 1k, 2k, and 4k Hz bilaterally at 20 dB re normal (ANSI, 1969). The same 20 listeners participated in both experimental tasks.

Both tasks involved presentation under headphones of segments of CV syllables containing the voiceless stops /p, t, k/ and the vowels /i, a, u/. These CV monosyllables were segmented using an electronic gating procedure described by LaRiviere et al. (1975). Oscillographic and spectrographic displays were used as references to verify the accuracy of segmentation procedures.

For Experiment I, which we will designate as the "adult-speaker task", the CV syllables were produced by an adult male speaker. The segments isolated from each syllable for presentation on the test tape were:
(1) the burst plus the aspiration portion, which contains the aspirated or voiceless transitions (B + A);
(2) the burst, aspiration, and vocalic transition (B + A + VT);
(3) the vocalic transition alone (VT);
(4) the vocalic transition plus the entire vowel (VT + V).
The adult-speaker tape included four repetitions of both the original syllable and each segment type, yielding a total of 180 items.
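The item totals reported for the two tapes follow directly from the design: nine syllables, four repetitions, and the original syllable plus each gated segment type. The arithmetic can be sketched as below; the function and variable names are illustrative only and do not come from the study.

```python
# Illustrative names; only the counts (9 syllables, 4 repetitions,
# segment types per task) are taken from the Method section.
SYLLABLES = ["pi", "pa", "pu", "ti", "ta", "tu", "ki", "ka", "ku"]
REPETITIONS = 4

ADULT_SEGMENTS = ["B+A", "B+A+VT", "VT", "VT+V"]
# The child-speaker tape adds the burst-alone (B) segment.
CHILD_SEGMENTS = ["B"] + ADULT_SEGMENTS


def item_count(segment_types, syllables=SYLLABLES, repetitions=REPETITIONS):
    """Number of tape items: (original syllable + each segment type)
    for every syllable, times the number of repetitions."""
    conditions = 1 + len(segment_types)  # the 1 is the intact syllable
    return len(syllables) * conditions * repetitions


print(item_count(ADULT_SEGMENTS))  # 180
print(item_count(CHILD_SEGMENTS))  # 216
```

The two totals match the 180-item adult-speaker tape and the 216-item child-speaker tape.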
This tape was originally developed by LaRiviere et al. (1975) in their investigation of perceptual cue distribution.

The syllables for the second experimental task, which we will refer to as the "child-speaker task", were spoken by a four-year, two-month-old male child. The stimuli for this experiment were prepared using the same segmentation procedure used for the adult-speaker task, with one modification: an additional segment type, burst alone (B), was isolated from each syllable, yielding a 216-item listening task.

The order of presentation of the two experimental tasks was randomized within the subject group. Each subject was given a brief training session in identification of the IPA phonetic symbols on the response sheet. Subjects were told to listen to each item and to circle which of the nine response alternatives /pi, pa, pu, ti, ta, tu, ki, ka, ku/ they believed they heard. An interstimulus interval of six seconds was maintained between items. The subjects were informed that this was a forced-choice task and were required to guess even if they were not sure of the correct response. To ensure that the subjects understood the nature of the task and correctly followed instructions for marking the response sheet, 10 practice items representative of the experimental stimuli were presented prior to the initiation of the first experimental task.

Results and discussion

As revealed in Fig. 1, the results of the adult-speaker task indicate that perceptual cue distribution closely parallels the results of LaRiviere et al. (1975). The informational load, or pattern of perceptual cue distribution throughout the syllable, remained relatively stable for the child-speaker task.

Figure 1 Percentage syllable identification based on listener responses to the normal (N) utterance and the segment conditions: burst plus aspiration (B + A), burst plus aspiration plus vocalic transition (B + A + VT), vocalic transition (VT), and vocalic transition plus vowel (VT + V). Open circle = child-speaker task (Parnell et al., 1976); closed circle = adult-speaker task (Parnell et al., 1976); closed triangle = adult-speaker task (LaRiviere et al., 1975).

The segment type rank orderings displayed in Table I were derived from the mean percentage of correct target recognition in response to each segment type. The resulting rankings can be viewed as an index of cue strength. The rank order was identical for the adult- and child-speaker tasks for consonant and vowel recognition as well as for syllable recognition. The burst alone (B) condition was not included in the adult-speaker task. It is shown in Table I in parentheses, indicating its rank order position relative to the other conditions in the child-speaker experiment.

Table I Segment type rank orderings based on mean percentage correct target recognition. Rank order was equivalent for both adult and child speaker tasks

Syllable recognition   Consonant recognition   Vowel recognition
B+A+VT                 B+A+VT                  VT+V
B+A                    B+A                     B+A+VT
VT+V                   (B)                     B+A
VT                     VT+V                    VT
(B)                    VT                      (B)
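The rank orderings in Table I rest on a simple computation: average the percentage correct over the nine syllables for each segment type, then sort. A minimal sketch follows, using the child-speaker vowel percentages of Table IV as read here from the printed table. The assignment of values to columns is an assumption wherever the printed layout is ambiguous, and the function names are illustrative.

```python
# Child-speaker vowel identification (%), one value per syllable in the
# order /pi, pa, pu, ti, ta, tu, ki, ka, ku/, as read from Table IV.
# Assumption: column assignment follows this reading of the printed table.
CHILD_VOWEL_PERCENT = {
    "B":      [48, 61, 28, 78, 29, 19, 60, 83, 39],
    "B+A":    [72, 95, 60, 95, 64, 33, 93, 95, 50],
    "B+A+VT": [90, 93, 76, 91, 84, 39, 96, 95, 65],
    "VT":     [44, 58, 46, 45, 34, 41, 70, 75, 60],
    "VT+V":   [71, 100, 93, 38, 99, 94, 71, 99, 88],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def rank_segments(percent_by_segment):
    """Segment types ordered from highest to lowest mean percentage correct."""
    means = {seg: mean(vals) for seg, vals in percent_by_segment.items()}
    return sorted(means, key=means.get, reverse=True)


print(rank_segments(CHILD_VOWEL_PERCENT))
# ['VT+V', 'B+A+VT', 'B+A', 'VT', 'B']
```

Under this reading, the computed order agrees with the vowel recognition column of Table I.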


The burst alone contributed minimal information for vowel recovery, ranked only third for consonant recovery, and was the least effective of the acoustic segments for syllable identification. For both speakers, vowel recognition percentages were highest for the VT + V condition, but the VT alone was relatively insignificant for either consonant or vowel recovery. The B + A + VT segment, containing both aspirated and vocalic transitions, was the most effective cue for syllable and consonant recognition for both the child- and adult-speaker tasks, and was ranked second for vowel recognition. The B + A segment (burst plus aspirated or voiceless transitions) provides important information for syllable and consonant identification. It becomes less important for vowel recognition, but still constitutes a more effective perceptual cue than the VT alone even for vowel recovery, indicating that the aperiodic portion provides perceptual information relative to both consonant and vowel targets.

We should point out that these relationships in cue load among the various segments reflect general trends across all syllables and subjects. There are, of course, deviations from these patterns by individual subject and syllable. In accord with previous investigations by LaRiviere et al. (1975) and Liberman et al. (1967), our data demonstrate that the strength of an acoustic segment as a perceptual cue is context conditioned; that is, cue strength for a particular phoneme target changes in different phonetic environments.

Group percentages for recognition accuracy for the two tasks ranged from 4% to 100%, with higher percentages generally associated with the original (intact) syllable, B + A + VT, and VT + V conditions and lower percentages associated with the B and VT conditions. For a closed set of nine response alternatives, a recognition accuracy of greater than 21% must be attained in order to achieve statistical significance at the 0·05 level.
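The paper does not state how the 21% criterion was derived. One common way to obtain such a criterion is an exact one-tailed binomial test against the guessing rate of 1/9. The sketch below is that calculation, not the authors' documented procedure. The number of trials per score is an assumption (36, i.e. nine syllables by four repetitions for one listener and one segment type), and the function names are illustrative.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n, p, alpha=0.05):
    """Smallest number correct whose one-tailed chance probability is <= alpha."""
    for k in range(n + 1):
        if upper_tail(k, n, p) <= alpha:
            return k
    return None  # no count reaches significance (only for very small n)


N_ALTERNATIVES = 9  # closed response set /pi, pa, pu, ti, ta, tu, ki, ka, ku/
N_TRIALS = 36       # ASSUMPTION: 9 syllables x 4 repetitions per listener and segment type

k = critical_count(N_TRIALS, 1 / N_ALTERNATIVES)
print(k, round(100 * k / N_TRIALS, 1))  # 8 22.2
```

Under these assumptions a listener needs 8 of 36 correct (22.2%), which is compatible with the "greater than 21%" criterion. A different number of trials per score would shift the threshold.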
Very few of our subjects' syllable (Table II), consonant (Table III), or vowel (Table IV) recognition scores fell at or below chance performance. Those non-significant scores we did observe occurred only for the B and VT conditions, although the majority of percentages even for these segment types were above chance level.

Table II Percentage correct syllable identification from each segment for child speaker (CS) and adult speaker (AS) tasks

CV     Original    B           B+A         B+A+VT      VT          VT+V
       CS    AS    CS    AS    CS    AS    CS    AS    CS    AS    CS    AS
/pi/   88    100   23    -     60    46    78    85    31    26    35    48
/pa/   98    76    48    -     95    88    95    79    30    91    50    81
/pu/   59    100   15    -     48    73    61    75    15    68    38    58
/ti/   84    98    78    -     91    76    90    95    13    9     16    30
/ta/   86    100   26    -     56    71    75    94    5     4     30    27
/tu/   59    100   10    -     30    51    41    39    6     5     29    53
/ki/   80    83    21    -     63    50    65    86    35    25    46    33
/ka/   75    78    51    -     63    71    58    80    23    36    29    55
/ku/   78    100   15    -     23    59    27    93    16    9     25    25

Table III Percentage correct consonant identification from each segment for child speaker (CS) and adult speaker (AS) tasks

CV     Original    B           B+A         B+A+VT      VT          VT+V
       CS    AS    CS    AS    CS    AS    CS    AS    CS    AS    CS    AS
/pi/   94    100   43    -     73    50    82    89    60    76    40    50
/pa/   99    99    74    -     100   96    99    99    71    93    51    81
/pu/   91    100   60    -     86    85    98    89    58    86    45    61
/ti/   100   99    76    -     96    98    100   99    29    11    28    33
/ta/   91    100   75    -     93    99    88    100   18    8     30    29
/tu/   66    100   38    -     93    79    99    45    18    21    33    53
/ki/   86    85    31    -     63    51    66    78    39    29    54    34
/ka/   76    85    55    -     65    84    55    70    33    41    29    55
/ku/   86    100   33    -     34    99    38    99    19    9     28    26

Table IV Percentage correct vowel identification from each segment for child speaker (CS) and adult speaker (AS) tasks

CV     Original    B           B+A         B+A+VT      VT          VT+V
       CS    AS    CS    AS    CS    AS    CS    AS    CS    AS    CS    AS
/pi/   93    100   48    -     72    89    90    94    44    39    71    100
/pa/   99    78    61    -     95    89    93    94    58    96    100   93
/pu/   75    100   28    -     60    83    76    85    46    74    93    94
/ti/   78    99    78    -     95    78    91    95    45    91    38    96
/ta/   100   100   29    -     64    71    84    94    34    88    99    100
/tu/   90    100   19    -     33    58    39    69    41    15    94    100
/ki/   90    84    60    -     93    89    96    88    70    91    71    98
/ka/   99    90    83    -     95    79    95    84    75    85    99    99
/ku/   99    100   39    -     50    64    65    95    60    60    88    99
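Between-speaker comparisons of the kind made in the discussion can be obtained from Table IV by averaging each column over the nine syllables and subtracting. The sketch below does this for the VT and VT + V vowel scores. The values are as read here from the printed table, so the column assignment is an assumption, and the names are illustrative.

```python
# Vowel identification (%) per syllable, order /pi, pa, pu, ti, ta, tu, ki, ka, ku/.
# Assumption: values and their CS/AS assignment follow this reading of Table IV.
VOWEL_PERCENT = {
    "VT": {
        "CS": [44, 58, 46, 45, 34, 41, 70, 75, 60],
        "AS": [39, 96, 74, 91, 88, 15, 91, 85, 60],
    },
    "VT+V": {
        "CS": [71, 100, 93, 38, 99, 94, 71, 99, 88],
        "AS": [100, 93, 94, 96, 100, 100, 98, 99, 99],
    },
}


def column_mean(values):
    """Mean percentage correct across syllables."""
    return sum(values) / len(values)


def speaker_gap(segment):
    """Adult-speaker mean minus child-speaker mean, in percentage points."""
    columns = VOWEL_PERCENT[segment]
    return column_mean(columns["AS"]) - column_mean(columns["CS"])


for segment in VOWEL_PERCENT:
    print(segment, round(speaker_gap(segment), 1))
```

Under this reading the gaps come out at about 18.4 points for VT and 14.0 points for VT + V, close to the decreases of 18·5% and 14·0% cited in the discussion. The small discrepancy for VT may reflect rounding of the tabled percentages.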

The high frequency of above-chance recognition percentages for syllable, consonant and vowel from subsyllabic segments provides additional evidence that the speech perceptual mechanism can utilize coarticulatory information supplied by the speaker's encoding strategies. Evidently the listener can utilize the acoustic segment information spread throughout the CV syllable to help identify a particular phoneme target. To establish definitively whether the listener does, in fact, use all of this temporally distributed information to identify each target in conversational speech would necessitate a more involved segmentation procedure.

Interpretations based on our limited sampling of speakers must be guarded. However, the similarity of perceptual cue distribution between speakers suggests that children as young as 4·2 years may have developed fairly sophisticated coarticulatory strategies, including a scanning-ahead mechanism similar to that characteristic of adult speech.

Although the pattern of relative distribution of perceptual cues between segment types was similar for the adult and child speakers, our subjects' accuracy levels for syllable recognition were consistently lower across all gating conditions for the child-speaker task. Their responses indicated that they were experiencing some particular difficulties with the child-speaker tape that they did not experience to as great an extent with the adult-speaker tape.

The first of these problems involved vowel recognition differences between child and adult speaker VT and VT + V segments. Listeners' vowel recognition percentages for the child-speaker task decreased by 18·5% and 14·0% for the VT and VT + V conditions, respectively. This substantial decrease in recognition between tasks may reflect more variability (less precision) in the child's articulation of the CV syllables. The transition between consonant and vowel targets may not have been as well controlled (smooth) as the adult transitions. Also, once the vowel target was approximated or achieved by the child, more lingual "oscillation" or lack of vowel steady-state might have occurred.

A second difficulty involved recovery of /u/ from all consonant environments for all segment conditions except VT + V. Cinefluorographic investigation of adult speech (Daniloff & Moll, 1968) revealed that an anticipatory lip rounding gesture for a rounded vowel may begin during the approach toward articulatory contact for the initial consonant in a C1-4V sequence. Depressed recognition accuracy for the child-speaker task may indicate that the child's lip rounding gesture was not initiated as early in the CV sequence as has been observed for adult subjects.
On the other hand, similar anticipatory lip rounding may occur for child and adult speakers, but the acoustic consequences of the child's gesture may not constitute perceptual cues that are as effective as those observed in adult speech.

Additional difficulties characteristic of the child-speaker task include: (1) recognition of the vowel in all /t/ + vowel syllables across all segment conditions and (2) recovery of /k/ in all vowel contexts across all segment conditions. Determining whether these difficulties, as well as the more general tendency towards lowered recognition accuracy during the child-speaker task, reflect a subtle lack of the sophistication characteristic of adult coarticulatory skills would require further investigation with larger numbers of speakers.

Acoustic durations of the original syllable and each segment condition indicated significant differences between adult and child speakers, with the exception of VT segments. The B + A and B + A + VT durations were significantly longer for the child's productions, and the original syllables and VT + V segments were significantly longer for the adult's productions. The child's productions also displayed wider ranges of durations than the adult speaker's for all original syllables and segment conditions. We examined the possibility that these durational differences might have significantly affected recognition. Results revealed, however, that segments characterized by large significant durational differences between speakers (such as B + A segments) resulted in relatively small perceptual shifts. Similarly, the VT segments, which manifested nonsignificant durational differences between speakers, elicited the largest number of significant differences in perceptual accuracy.

The vocalic transition dependency model proposed by LaRiviere et al. (1975) emphasizes the importance of the vocalic transition in the "restructuring of phonetic elements". Specifically, they hypothesized that the vocalic transition serves as a "primary cue for stop consonant identification". The model was not supported either by LaRiviere's investigation or subsequently by our data. In both instances, the vocalic transition appears to be neither a necessary nor a sufficient cue for identification of syllable-initial voiceless stops. If the aspirated transition as well as the vocalic transition were included within the model, redefining it as simply the "transition-dependency model", then support could be found for such a model from both LaRiviere's results and our own. Analysis of subjects' perceptual responses does provide evidence that the entire transition from consonant to vowel, which includes both voiceless and voiced components, may provide the most important cues for consonant as well as vowel recognition. Isolation of this complete transition segment, A + VT (aspirated and vocalic transitions), will be necessary to validate this alternate hypothesis.

At the present time, interpretation of the results of investigations regarding the role of transitional cues is difficult because the descriptive terminology used often does not correspond clearly to the actual transitional segment used in the study. For example, the terms "formant transition" or simply "transition" are frequently employed without clear specification as to whether these terms describe acoustic segments that include aspirated transitions, vocalic transitions, or both. A similar lack of clarity appears in descriptions of aperiodic segments. Acoustic stimuli described variously as the "noise portion", "noise burst", "burst", "release", "explosion", or "aspiration" also fail to define the references for segment boundaries relative to aspirated transitions. It is imperative that subsequent investigations concerning cue strength of transitions specifically define the segmentation criteria used in developing their acoustic stimuli. Investigators must specify what portions of the total transition, which may include both voiceless and voiced components, have been segmented or synthesized for presentation. Specific explanations of segmentation or synthesis criteria and consistency in usage of acoustic terminology would ensure more accurate, productive integration of existing research regarding speech acoustics and perception.

Our observations of similar perceptual cue distribution in the child's productions may provide indirect evidence of coarticulatory strategies similar to those employed by adults. The consistently lower (although generally well above chance) recognition scores in response to the child-speaker stimuli additionally suggest that the coarticulatory skills of young children may be incompletely developed and that the acoustic consequences of children's encoding behaviors may not be processed by the adult speech perceptual mechanism in a manner identical with the perception of adult productions. Although speculations based on observations of two speakers must be guarded, the results of this investigation indicate that a more extensive study of developmental influences on encoding-decoding associations is warranted.

Analysis of children's speech has been limited by the lack of methodologies appropriate for use with younger subjects. Cinefluorographic and electromyographic techniques routinely employed in observation of adult subjects involve potential hazards that render them less desirable as tools for analysis of children's speech production. Accurate interpretation of spectrographic displays is often difficult because of the higher fundamental frequency associated with children's voices. Experimental methodologies that provide indirect access to speech production processes by means of analysis of perceptual responses may permit more comprehensive investigation of the speech of young children than has previously been possible.


A portion of this paper was presented at the Annual Convention of the American Speech and Hearing Association, Houston, Texas, 1976. Requests for reprints should be addressed to: Martha M. Parnell, 103 Parker Hall, Area of Speech Pathology-Audiology, University of Missouri-Columbia, Columbia, Missouri, 65201, U.S.A.

References

Carpenter, R. L. (1976). Development of acoustic cue discrimination in children. Journal of Communication Disorders 9, 7-17.
Cole, R. A. & Scott, B. (1974). Toward a theory of speech production. Psychological Review 81, 348-374.
Delattre, P. C., Liberman, A. M. & Cooper, F. S. (1955). Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America 27, 769-773.
Harris, C. M. (1953). A study of the building blocks of speech. Journal of the Acoustical Society of America 25, 962-969.
LaRiviere, C., Winitz, H. & Herriman, E. (1975). Vocalic transitions in the perception of voiceless initial stops. Journal of the Acoustical Society of America 57, 470-475.
Liberman, A. M., Delattre, P. C., Shankweiler, D. F. & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review 74, 431-461.
Parnell, M. M., Amerman, J. D. & LaRiviere, C. (1976). Influence of speaker age on perceptual cue distribution. Paper presented at the Annual Convention of the American Speech and Hearing Association, Houston, Texas.