Journal of Phonetics (1984) 12, 237-243
Correlation between the production and perception of the English glides /w, r, l, j/
W. A. Ainsworth and K. K. Paliwal*
Department of Communication and Neuroscience, University of Keele, Keele, Staffordshire, England
Received 16th April 1984
Abstract:
An experiment has been performed to test the hypothesis that a listener refers to his own articulation when perceiving speech. It has recently been found that this is probably not so in the case of vowel sounds, but it remains an open question in the case of consonants. In this experiment the production and perception of the glides /w, r, l, j/ were studied. The second and third formant frequencies at the onset of the CV transition were measured for 10 speakers from both a production experiment and a perceptual-identification experiment. Correlations between production and perception values were computed for all four glides, but were found not to be statistically significant. Frequency transformations (Hertz-to-mel and Hertz-to-Bark) also failed to make these correlations statistically significant. Thus the results from this experiment tend to reject the above hypothesis.
*On leave from Tata Institute of Fundamental Research, Bombay, India. Present address: Division of Telecommunications, University of Trondheim, Trondheim-NTH, Norway.
0095-4470/84/030237 + 07 $03.00/0
© 1984 Academic Press Inc. (London) Limited

Introduction
In an earlier paper (Paliwal, Lindsay & Ainsworth, 1983), an experiment was reported in which the hypothesis that a listener refers to his own articulation when perceiving vowel sounds was examined. For this, a test was made to examine whether the differences in acoustic parameters (the frequencies of the first two formants, F1 and F2) resulting from the production systems of different subjects for a given vowel sound were isomorphic with the differences in F1 and F2 used by the perception systems of the same set of subjects for the same vowel sound. The formant frequencies of 11 English vowels for 10 subjects were measured in two different ways: first, from the utterances of the vowels in /h/-vowel-/d/ context produced by these subjects, and secondly by presenting a set of synthesised vowel stimuli, in /h/-vowel-/d/ context, covering the entire F1-F2 plane to the same set of subjects for identification. The isomorphism was measured in terms of the correlation between the production and perception formant frequencies. It was found, however, that this correlation was not statistically significant, and hence the above hypothesis was rejected.

In many experiments, vowel sounds have been found to behave differently from consonant sounds. For example, Liberman et al. (1957) found that plosive sounds were more easily discriminated across phoneme boundaries than within phoneme boundaries, whereas Fry et al. (1962) found that there was no difference for vowel sounds. Also
Studdert-Kennedy & Shankweiler (1970) found hemispheric specialization for the perception of consonants, but no significant right-ear advantage for vowel sounds. It is thus of interest to examine whether there is any correlation between the production and perception of consonant sounds. In the present paper this question is addressed.

It proved difficult to generate a set of stimuli that elicited the entire set of English plosives as responses, because of the multidimensional nature of such stimuli. The glides /w, r, l, j/, however, proved more amenable to experimental manipulation.

The acoustic parameters used for computing the production-perception correlation were measured first from a production experiment and then from a perception experiment. As noted earlier (Paliwal et al., 1983), these parameters must satisfy two important requirements. First, the acoustic parameters measured from the production experiment must differ significantly from speaker to speaker, and secondly these acoustic parameters must identify the glides /w, r, l, j/ distinctly in the perception experiment.

In several earlier studies, a formant synthesizer has been used for conducting perception experiments on glides (O'Connor et al., 1957; Lisker, 1957; Ainsworth, 1968). It was found that the parameters which served to distinguish between the glides were the onset frequencies of the second and third formant transitions. These will be referred to as the formant locus frequencies. Other distinguishing features consist of the first formant locus frequency and the shape of the first formant transition. However, if these secondary cues are neutralized by giving them average values, it is possible to construct a set of stimuli differing from each other only in their second and third formant locus frequencies which are identified as /w/, /r/, /l/ or /j/ by most listeners.

Moreover, it was observed that the formant locus frequencies of the glides measured from a production experiment show statistically significant differences across speakers. Thus if the hypothesis is correct that a listener refers to his own articulation when perceiving speech, we would expect to see a correlation between the production and perception formant locus frequencies.
Methods
The second and third formant locus frequencies of the four English glides /w, r, l, j/ were measured for ten subjects, first from a perception experiment and then from a production experiment. In the perception experiment, a set of CV synthetic stimuli with their second and third formant locus frequencies covering the entire F2-F3 plane was presented to the subjects for identification. The responses of these subjects clustered differently in the F2-F3 plane for different glides. The centroids of these clusters were used as an estimate of the perception formant locus frequencies. In the production experiment, the glides /w, r, l, j/ were recorded in CV context by the same set of subjects. The production formant locus frequencies were measured from the power spectra estimated by linear prediction analysis of the speech signal.

Perception experiment
The glides /w, r, l, j/ differ from the other sounds of speech in the duration of their formant transitions. These are shorter than those of diphthongs but longer than those of voiced plosives (Liberman et al., 1956). They each have a rising first formant, reflecting an opening of the vocal tract, but they are distinguished from each other by their second and third formant transitions. In the present experiment a set of consonant-vowel stimuli was generated with a parallel
formant synthesiser of the type described by Holmes et al. (1964). Each stimulus consisted of a steady-state portion, 20 ms in length, followed by a linear transition of 80 ms duration and a vowel of /ɜ/ quality 400 ms long. The initial portion had a first formant frequency of 460 Hz, and second and third formants whose frequencies had different values in each stimulus. During the transition the formant frequencies changed linearly to frequencies appropriate for an /ɜ/ vowel (580 Hz for F1, 1420 Hz for F2, and 2620 Hz for F3). During the transition the intensities of the formants increased from 35 to 51 dB for F1, from 28 to 45 dB for F2, and from 21 to 33 dB for F3. In order to make the sounds more speech-like, a fourth formant with a constant frequency of 3500 Hz was added, whose intensity increased from 14 to 26 dB during the formant transition period. The formants were excited by a train of periodic pulses whose fundamental frequency rose from 120 to 150 Hz during the formant transitions, then fell to 100 Hz during the vowel.

A set of 100 stimuli was produced. The second formant frequency of the initial portion took 10 values from 760 Hz to 2380 Hz in steps of 180 Hz, and the third formant frequency took 10 values from 1540 Hz to 3160 Hz, also in steps of 180 Hz. A diagram of the stimuli is shown in Fig. 1.

The stimuli were presented in random order to a group of listeners. In each session the listeners heard two sets of 100 stimuli in different random orders. They each took part in five sessions held on approximately consecutive days. Ten volunteers, staff and postgraduate students of the University of Keele, having a variety of British English accents, acted as listeners. There were seven male and three female subjects.

The listeners heard the stimuli binaurally over Sennheiser 414 headphones. They were asked to try to identify the initial consonant in each syllable they heard, and to press an
[Figure 1. Formant patterns of the glide-vowel stimuli: formant frequency (Hz) against time (ms) for F1, F2, F3 and F4.]
appropriately labelled switch on a box in front of them. The switches were labelled L, R, W, Y and OTHER. The listeners were instructed to press the OTHER switch only if they were unable to classify the consonant as a glide.

Groups of up to four listeners took part simultaneously. The stimuli were presented and the responses recorded by a computer-controlled system described by Ainsworth & Millar (1971). After each stimulus had been presented, the computer waited for all listeners to respond before generating the next sound.

The responses of each subject were analysed separately. A typical set of responses is shown in Table I. It can be seen that the sounds were identified as /w/ if they had a low F2 locus, and as /j/ if they had a high F2 locus. Sounds having a mid F2 locus were classified as /r/ if they had a low F3 locus and as /l/ if they had a high F3 locus. The centroid of each cluster was computed in order to obtain an estimate of the F2 and F3 loci for each glide. (Stimuli having an F2 locus higher than their F3 locus were excluded from this analysis.)

Table I. Typical set of responses obtained from listening to 100 glide-vowel synthetic stimuli
3160Hz
Third formant locus frequency 1540Hz
w w w w w w w w w w
w w w w w w
w w w w
760Hz
w L w w w R w w w R w R w R w R w R R
R
R L R R R R R R R R
L L R R R L R R R R
L L L L L L L
y
L R
Second formant locus frequency
y y y y y y y y y y
y y y y y y y y y y
y y y y y y y y y y 2380Hz
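The centroid estimate described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function name `glide_centroids`, the array layout (rows indexed by F3 locus, columns by F2 locus) and the made-up response grid `toy` are all assumptions introduced here. Only the locus values and the exclusion of stimuli with F2 locus above F3 locus come from the text.

```python
import numpy as np

# Locus grid described in the text: 10 values per formant, 180 Hz apart.
F2_LOCI = 760 + 180 * np.arange(10)    # 760, 940, ..., 2380 Hz
F3_LOCI = 1540 + 180 * np.arange(10)   # 1540, 1720, ..., 3160 Hz


def glide_centroids(responses):
    """Return {label: (mean F2 locus, mean F3 locus)} for one listener.

    `responses` is a hypothetical 10 x 10 array of response labels indexed
    as [F3 index, F2 index]. Stimuli whose F2 locus exceeds their F3 locus
    are left out, as stated in the text.
    """
    f2, f3 = np.meshgrid(F2_LOCI, F3_LOCI)   # f2[r, c] = F2_LOCI[c]; f3[r, c] = F3_LOCI[r]
    valid = f2 <= f3
    centroids = {}
    for label in ("W", "R", "L", "Y"):
        mask = (responses == label) & valid
        if mask.any():
            centroids[label] = (float(f2[mask].mean()), float(f3[mask].mean()))
    return centroids


# Made-up responses for illustration only (not the data of Table I).
toy = np.full((10, 10), "R", dtype="<U5")
toy[:, :3] = "W"      # low F2 locus
toy[:, 7:] = "Y"      # high F2 locus
toy[5:, 3:7] = "L"    # mid F2 locus, high F3 locus
print(glide_centroids(toy))
```

Responses labelled OTHER simply never match one of the four glide labels, so they drop out of every centroid without special handling.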
Production experiment
In the production experiment the same ten subjects were used as in the perception experiment. The subjects were asked to utter five repetitions of the four syllables /wɜ/, /jɜ/, /rɜ/ and /lɜ/. These were recorded in an ordinary laboratory with a Revox A77 tape recorder and microphone. The utterances were digitized at a sampling rate of 10 kHz with a Computer Automation Alpha minicomputer using a 12-bit analogue-to-digital converter. A low-pass filter with a cut-off frequency of 5 kHz was used prior to digitisation as an anti-aliasing filter.

The waveforms of these utterances were displayed on a CRT, and a 25.6 ms segment of the sound at the onset of the CV transition was carefully selected for spectral analysis by means of a manually-controlled cursor. The speech signal at this point was weighted by a 25.6 ms Hamming window, and a 10th-order linear prediction analysis was performed using the autocorrelation method (Makhoul, 1975). The log-power spectrum was computed from the 10 linear prediction coefficients by taking a 256-point discrete Fourier transform using a fast Fourier transform algorithm. The formant frequencies were measured from the log-power spectrum by a peak-picking method. For this, the log-power spectrum was displayed on the CRT and the peaks were selected by means of a manually-controlled cursor. A three-point parabolic interpolation around the peak was performed in order to obtain a more accurate location of the peak. In this manner the second and third production formant locus frequencies of all four glides were measured for all five repetitions by each of the 10 subjects.
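The analysis chain described above (Hamming window, 10th-order autocorrelation-method linear prediction, 256-point spectrum, three-point parabolic interpolation) might be sketched as below. Several things here are assumptions rather than the authors' procedure: the function names are hypothetical, the Levinson-Durbin recursion is one standard way of solving the autocorrelation equations, peaks are picked automatically as local maxima instead of by a manually-controlled cursor, and the test frame is a synthetic mixture of sinusoids rather than speech.

```python
import numpy as np


def lpc_autocorrelation(frame, order=10):
    """Prediction-error filter a[0..order] (a[0] = 1), autocorrelation method."""
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):          # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_log_spectrum(a, nfft=256):
    """Log-power spectrum (dB, up to a constant gain) of the all-pole model 1/A(z)."""
    return -20.0 * np.log10(np.abs(np.fft.rfft(a, nfft)))


def spectral_peaks(log_spectrum, fs=10_000, nfft=256):
    """Local maxima of the spectrum, refined by three-point parabolic interpolation."""
    peaks = []
    for b in range(1, len(log_spectrum) - 1):
        lo, mid, hi = log_spectrum[b - 1], log_spectrum[b], log_spectrum[b + 1]
        if mid > lo and mid >= hi:
            offset = 0.5 * (lo - hi) / (lo - 2.0 * mid + hi)
            peaks.append((b + offset) * fs / nfft)
    return peaks


# Synthetic 25.6 ms frame (256 samples at 10 kHz); not speech data.
fs = 10_000
t = np.arange(256) / fs
rng = np.random.default_rng(0)
frame = sum(np.sin(2 * np.pi * f * t) for f in (500.0, 1500.0, 2500.0))
frame = frame + 0.01 * rng.standard_normal(256)
a = lpc_autocorrelation(frame * np.hamming(256))
peaks = spectral_peaks(lpc_log_spectrum(a))
print([round(p) for p in peaks])
```

With a 256-point transform at 10 kHz the bin spacing is about 39 Hz, which is why some interpolation around the peak is useful for locating it more finely.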
In order to show that these formant locus frequencies were significantly different across the different speakers, the technique of analysis of variance was employed. The ratio of the between-speakers variance to the within-speakers variance (denoted by $F_{9,40}$, where the subscripts are the degrees of freedom for the numerator and the denominator) was computed for each of the four glides for each formant locus frequency. With 10 speakers and five repetitions each, these degrees of freedom are 10 - 1 = 9 and 10 × (5 - 1) = 40. The values of the $F_{9,40}$-ratios are listed in Table II. It can be seen from this table that these ratios are much greater than $F_{9,40;0.99} = 2.89$. Thus it can be concluded that differences in the production formant locus frequencies across the 10 speakers are statistically significant for all the glides (P < 0.01). Once this was established, the production formant locus frequencies of the five repetitions were averaged for each of the glides and for each of the subjects.

Table II. $F_{9,40}$-ratios for the F2 and F3 locus frequencies computed by analysis of variance for the different glides. The critical value of the $F_{9,40}$-ratio for concluding that the glides of different speakers differ at the 1% level is 2.89

          F-ratio for             F-ratio for
Glide     F2 locus frequency      F3 locus frequency
/w/       70.9                    167.5
/r/       9.4                     12.0
/l/       37.1                    48.9
/j/       85.3                    19.2
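The variance ratio behind Table II can be illustrated with a short sketch of a one-way analysis of variance. The function name `f_ratio` and the randomly generated measurements are hypothetical; only the layout of 10 speakers by five repetitions, and hence the degrees of freedom (9, 40), comes from the text.

```python
import numpy as np


def f_ratio(x):
    """One-way ANOVA F-ratio for an array of shape (speakers, repetitions).

    Returns (F, df_between, df_within).
    """
    n_speakers, n_reps = x.shape
    grand_mean = x.mean()
    speaker_means = x.mean(axis=1)
    df_between = n_speakers - 1
    df_within = n_speakers * (n_reps - 1)
    ms_between = n_reps * np.sum((speaker_means - grand_mean) ** 2) / df_between
    ms_within = np.sum((x - speaker_means[:, None]) ** 2) / df_within
    return ms_between / ms_within, df_between, df_within


# Made-up F2 locus measurements (Hz): 10 speakers x 5 repetitions.
rng = np.random.default_rng(0)
speaker_offsets = rng.normal(0.0, 80.0, size=(10, 1))   # speaker-to-speaker spread
loci = 1100.0 + speaker_offsets + rng.normal(0.0, 30.0, size=(10, 5))
f, df1, df2 = f_ratio(loci)
print(f"F({df1},{df2}) = {f:.1f}")
```

The resulting ratio would then be compared with the 1% critical value of 2.89 quoted in the text.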
Results
In the preceding section it was demonstrated that the second and third formant locus frequencies measured from the production experiment show statistically significant differences across subjects. The earlier studies (O'Connor et al., 1957; Lisker, 1957; Ainsworth, 1968) have also shown that these formant locus frequencies are the main cues for identifying the glides /w, r, l, j/. Thus both of the requirements mentioned earlier are satisfied, and it is expected that isomorphism will be seen in terms of these frequencies if the hypothesis that a listener makes reference to his own articulation when perceiving speech is correct.

In order to measure this isomorphism, two types of correlation between the production and perception formant locus frequencies were computed: first, when the ordering of the subjects for the two sets of frequencies was the same (i.e. the within-subjects correlation), and second, when it was different (i.e. the between-subjects correlation). In order to be able to say that the production-perception isomorphism exists, it must be shown that the within-subjects correlation is significantly higher than the between-subjects correlation. The correlation for the jth glide and the kth formant locus frequency may be computed as follows:

$$
C(j,k) = \frac{\sum_{i=1}^{10} F_{pr}[m(i), j, k]\, F_{pe}[n(i), j, k]}{\left[\sum_{i=1}^{10} \{F_{pr}(i, j, k)\}^{2}\right]^{1/2} \left[\sum_{i=1}^{10} \{F_{pe}(i, j, k)\}^{2}\right]^{1/2}}
$$

where $F_{pr}(i, j, k)$ and $F_{pe}(i, j, k)$ are the production and perception formant locus frequencies of the ith subject, and [m(i), i = 1, ..., 10] and [n(i), i = 1, ..., 10] are the orders in
which the subjects are arranged for production and perception, respectively, in computing the correlation C(j, k). When these two orders are the same (i.e. m(i) = n(i), i = 1, ..., 10), the within-subjects correlation is obtained. When these orders are different, the between-subjects correlation is computed. Since there are 10!(10! - 1) possible combinations when the ordering of subjects is different, the between-subjects correlation can be computed, in principle, for that many combinations. In practice, however, 180 different combinations were randomly selected, and the values of the between-subjects correlation were computed for these combinations. The means and standard deviations of these values were then calculated. These are listed in Table III along with the within-subjects correlations for the glides /w, r, l, j/ for the second and third formant locus frequencies.

Table III. Within- and between-subjects correlation for different glides for F2 and F3 locus frequencies
          F2 locus frequency                        F3 locus frequency
Glide     Within-subjects   Between-subjects        Within-subjects   Between-subjects
          correlation       correlation             correlation       correlation
                            (mean ± s.d.)                             (mean ± s.d.)
/w/       0.7931            0.8815 ± 0.0344         0.9427            0.9375 ± 0.0115
/r/       0.9948            0.9936 ± 0.0020         0.9936            0.9952 ± 0.0016
/l/       0.9853            0.9866 ± 0.0046         0.9958            0.9946 ± 0.0014
/j/       0.9934            0.9908 ± 0.0027         0.9972            0.9967 ± 0.0004
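The within- and between-subjects correlations defined by C(j, k) can be sketched for a single glide and formant as follows. This is an illustration under stated assumptions: the function names and the locus values are made up, the production order m is held fixed while only the perception order n is shuffled (C depends only on the relative ordering), a "different" ordering is taken to mean any permutation other than the identity, and the sample standard deviation (`ddof=1`) is used, which the text does not specify.

```python
import numpy as np


def correlation(f_pr, f_pe, m, n):
    """C for one glide and one formant, with subject orderings m and n."""
    numerator = np.dot(f_pr[m], f_pe[n])
    denominator = np.sqrt(np.sum(f_pr ** 2)) * np.sqrt(np.sum(f_pe ** 2))
    return numerator / denominator


def within_and_between(f_pr, f_pe, n_draws=180, seed=0):
    """Within-subjects C, plus mean and s.d. of C over random re-orderings."""
    rng = np.random.default_rng(seed)
    identity = np.arange(len(f_pr))
    within = correlation(f_pr, f_pe, identity, identity)
    between = []
    while len(between) < n_draws:
        n = rng.permutation(len(f_pr))
        if np.array_equal(n, identity):
            continue                      # skip the matched ordering
        between.append(correlation(f_pr, f_pe, identity, n))
    between = np.array(between)
    return within, between.mean(), between.std(ddof=1)


# Made-up locus frequencies (Hz) for 10 subjects; not the measured data.
f_pr = np.array([1050., 1120., 980., 1200., 1010., 1090., 1150., 1030., 1180., 1070.])
f_pe = np.array([1100., 1060., 1140., 1080., 1120., 1040., 1090., 1130., 1070., 1110.])
print(within_and_between(f_pr, f_pe))
```

Because C is not mean-centred, any two sets of positive frequencies of similar magnitude give values close to 1, which is consistent with the uniformly high entries in Table III.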
It can be seen from Table III that, of the four glides, the within-subjects correlation is higher than the between-subjects correlation for the two glides /j/ and /r/ for the F2 locus frequency, and for the three glides /w/, /j/ and /l/ for the F3 locus frequency; but in no case is the difference between the within-subjects correlation and the between-subjects correlation statistically significant. Thus there appears to be no isomorphism between the perception and production formant locus frequencies.

The within- and between-subjects correlations have been computed, so far, in terms of the formant locus frequencies measured in the usual physical units of Hertz. It is likely, however, that some psychophysical unit of frequency, such as the mel or the Bark, is more appropriate for the perceptual measurements. Accordingly, the frequency values were transformed to mels and Barks (see Paliwal et al. (1983) for the Hertz-to-mel and Hertz-to-Bark transformations). The within- and between-subjects correlations were computed in terms of the F2 and F3 locus frequencies measured in these units. The difference between the within-subjects correlation and the between-subjects correlation was again found not to be statistically significant. Thus we can conclude that no isomorphism exists between the production and perception formant locus frequencies, and the present experiment provides no evidence for the hypothesis that a listener refers to his own articulation when perceiving speech.
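As an illustration of what such frequency transformations look like, the sketch below uses two widely quoted closed-form approximations. These are assumptions made here for concreteness: the formulas actually used are those given in Paliwal et al. (1983), which may differ, and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f_hz):
    """A common Hertz-to-mel approximation: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def hz_to_bark(f_hz):
    """A common Hertz-to-Bark approximation:
    13 * arctan(0.00076 f) + 3.5 * arctan((f / 7500)^2)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)


# The F2 locus values of the stimulus grid, expressed in the three units.
f2_loci = 760 + 180 * np.arange(10)
for hz, mel, bark in zip(f2_loci, hz_to_mel(f2_loci), hz_to_bark(f2_loci)):
    print(f"{hz:5d} Hz  {mel:7.1f} mel  {bark:5.2f} Bark")
```

Both mappings are monotonic but compressive at high frequencies, so they change the relative spacing of the locus values across subjects; this is why the correlations have to be recomputed after transforming rather than carried over from the Hertz values.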
Conclusion
An experiment has been performed with the glides /w, r, l, j/ to test the hypothesis that a listener refers to his own articulation when perceiving speech. The second and third formant locus frequencies of these four glides in CV context were measured for ten subjects, first from a perception experiment and then from a production experiment. The within- and between-subjects correlations were computed for all the glides for both the F2 and F3 locus
frequencies. In no case, however, was the difference between the within-subjects correlation and the between-subjects correlation found to be significant at the 1% level. Thus the results of this experiment tend to reject the above-mentioned hypothesis.

References
Ainsworth, W. A. (1968). First formant transitions and the perception of semivowels. Journal of the Acoustical Society of America, 44, 689-694.
Ainsworth, W. A. & Millar, J. B. (1971). A simple time-sharing system for speech perception experiments. Behavioural Research Methods and Instrumentation, 3, 21-24.
Fry, D. B., Abramson, A. S., Eimas, P. D. & Liberman, A. M. (1962). The identification and discrimination of synthetic vowels. Language and Speech, 5, 171-189.
Holmes, J. N., Mattingly, I. G. & Shearme, J. N. (1964). Speech synthesis by rule. Language and Speech, 7, 127-143.
Liberman, A. M., Delattre, P. C., Gerstman, L. J. & Cooper, F. S. (1956). Tempo of frequency change as a cue for distinguishing classes of speech sounds. Journal of Experimental Psychology, 52, 127-137.
Liberman, A. M., Harris, K. S., Hoffman, H. S. & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358-368.
Lisker, L. (1957). Minimal cues for separating /w, r, l, j/ in intervocalic position. Word, 13, 256-267.
Makhoul, J. (1975). Linear prediction: a tutorial review. Proceedings IEEE, 63, 561-580.
O'Connor, J. D., Gerstman, L. J., Liberman, A. M., Delattre, P. C. & Cooper, F. S. (1957). Acoustic cues for the perception of initial /w, j, r, l/ in English. Word, 13, 24-43.
Paliwal, K. K., Lindsay, D. & Ainsworth, W. A. (1983). Correlation between production and perception of English vowels. Journal of Phonetics, 11, 77-83.
Studdert-Kennedy, M. & Shankweiler, D. (1970). Hemispheric specialisation for speech perception. Journal of the Acoustical Society of America, 48, 579-594.