Journal of Phonetics (1981) 9, 333-342
Duration as a factor in the recognition of synthetic vowels W. A. Ainsworth Department of Communication and Neuroscience, University of Keele, Keele, Staffordshire ST5 5BG, England Received 1st August 1980
Abstract:
Experiments have been performed with short vowels (50 ms) and long vowels (500 ms) in order to investigate the effects of duration on vowel recognition . Various models are proposed to account for the observed confusions. The models employ linear and euclidean distance measures, and measure frequency on the Hertz and Bark scales. Each of the models perform similarly, and suggest that differences in duration of 250 ms have about the same effect on vowel recognition as differences in formant frequency of about I 00 Hz of I Bark.
Introduction It is generally accepted that the frequencies of the first two formants are the most important factors in the recognition of vowel sounds (Delattre et al., 1952; Peterson & Barney, 1952; Pols et al., 1969). Temporal factors, however, are also important (Stevens, 1959; Cohen eta!., 1967; Bennett, 1968; Ainsworth, 1972; Klatt, 1976; Nooteboom & Doodeman, 1980). The question therefore arises as to how formant frequency information is combined with duration information in order for a vowel sound to be identified. The process of the recognition of vowel sounds can be formalised by means of a multidimensional space in which points in that space represent idealised forms of vowels. The dimensions of the space will be the frequency of the first formant (F1 ), the frequency of the second formant (F2), the duration of the vowel (T), and possibly others such as the bandwidths of the formants and the frequencies of the higher formants. An unknown vowel can also be represented by a point in this space. It is reasonable to suppose that it will be perceived as belonging to that vowel category whose ideal form is represented by the nearest point. The dimensions of the space, however, are not known, nor are the relative effects of differences in formant frequency and differences in duration. The purpose of the present experiments was to investigate these questions. When listening to a set of vowels of normal duration, the effects of duration will be minimised, but if the durations of the vowels are deliberately distorted, by making them all very long or very short, errors of identification will be induced which reflect the effects of duration. By studying the error pattern obtained it should be possible to estimate the contribution of duration to vowel recognition. Method The stimuli consisted of the eleven vowels of British English shown in Table I. Their formant frequencies and intensities and their durations were those parameters in a speech 0095-4470/81/030333+10 $02.00/0
© 1981 Academic Press Inc. (London) Ltd.
334
W. A. Ainsworth Table I Vowel
E
re Q
D
::> 0
u A
3
Parameters of the synthesised vowels
Fl
F2
F2
AI
A2
A3
A4
(Hz)
(Hz)
(Hz)
(dB)
(dB)
(dB)
(dB)
250 400 640 790 790 610 490 340 250 700 580
2320 2080 2020 1780 1060 880 820 1000 880 1360 1420
2740 2680 2800 2800 2500 2260 2500 2440 2080 2560 2620
51 51 51 51 51 51 51 51 51 51 51
33 37 42 47 49 48 45 42 38 44 45
37 35 38 38 30 23 30 28 17 31 33
31 30 31 31 23 16 17 23 10 24 26
T (ms)
280
ISO 200 250 400 230 400
ISO 350 230 400
synthesis-by-rule programme (Ainsworth, 1974) which were known would be recognised with an accuracy of about 90% by typical British English listeners (Ainsworth, 1978). They were generated on-line by a synthesis-by-rule system consisting of a parallel-formant speech synthesiser (Holmes et al., 1964) controlled by a minicomputer (Ainsworth & Millar, 1971). The listeners heard the stimuli via Sennheiser HD 414 headphones. They were asked to try to identify the sounds they heard as the vowel in one of the words in the set: heed, hid, head, had, hard, hod, hoard, hood, who'd, hud, heard; and to press an appropriately labelled switch on a box in front of them. An extra switch was provided in case the sound that they heard did not correspond to any of the vowels in the response set. The programme waited until all of the listeners had responded before generating the next stimulus. The listeners were seven unpaid volunteers who were undergraduates at the University of Keele. They were initially unfamiliar with synthesised speech. They listened to the sounds in approximately 20 min sessions at weekly intervals. In the first session they heard the eleven vowels ten times each in a randomised order in an /h-d/ context preceded by the introductory phrase "the next syllable is", synthesised with fairly natural intonation. In the next session the /h-d/ words were presented in isolation with a falling intonation contour. In the third session the stimuli were similar but contained only two formants instead of the original four. In the fourth and fifth sessions the vowels were presented in isolation with a steady fundamental of 120Hz and with the durations shown in Table I. In the sixth and seventh the vowels were presented with a duration of 50 ms, and in the eighth and ninth with a duration of 500 ms. The normal range of duration of vowels was 150-400 ms, so 50 ms was 100 ms less than the shortest, and 500 ms was 100 ms more than the longest. Results The recognition scores for the 'normal' duration vowels, the 'long' vowels (500 ms), and the 'short' vowels (50 ms) are shown in Table II. For the normal vowels the individual scores ranged from 69 to 98% (mean 87.1%), for the long vowels from 61 to 98% (mean 79 .3%), and for the short vowels from 51 to 87% (mean 68.4%). The difference between the normal and long vowels was 7.8% which is significant (p < 2.5%) by at-test. The difference between normal and short vowels was 18.7% (significant at p < 0.05%), and the difference between long and short vowels was 10.9% (significant at p < 1%). Hence the durations of the vowels do have a significant effect on their recognition. The overall confusion matrices are shown in Table III. With normal vowels the errors
335
Recognition of synthetic vowels
Table II Individual and overall scores (%) obtained with (a) normal vowels, (b) 50 ms vowels, and (c) 500 ms vowels
Subject
Normal
50 ms
500 ms
7
98 88 82 69 96 84 93
87 61 73 51 66 59 82
98 79 79 61 77 68 93
Mean S.D.
87.1 10.7
68.4 13.3
79.3 12 .9
I
2 3 4 5
6
were mainly between near neighbours in an Fl - F2 space, but with the long vowels there was a tendency for some phonetically short vowels to be heard as phonetically long vowels, and with the short vowels some phonetically long vowels were heard as phonetically short vowels. Model The first model to be investigated was one in which an unknown vowel is classified as the nearest ideal vowel in an F1-F2-T space. In order to determine the nearest vowel it is necessary to define a distance measure such as :
(1) where t.F1 = F1i- F1h t.F2 = F2i- F2j, and t.T = Ti- Tj. With n having a value of 1, equation (1) becomes the sum of the differences between two points projected onto the axes. This will be called the linear distance measure (LDM). With n having a value of 2, equation (1) becomes the distance between two points in a three dimensional space, or the euclidean distance measure (EDM). The appropriate units in which to measure F1 and F2 are not known. The units which were investigated were the physical units (Hz) and the psychophysical units (Barks). A Bark is defined as the width of the critical band at that frequency (Zwicker & Fast!, 1972) and may be approximated by
f = 650 sinh (x/7),
(2)
where f is the frequency in Hz and x is equivalent psychophysical unit in Barks (Schroeder et al., 1979). In order to test whether an unknown vowel is most likely to be classified as the nearest ideal vowel, it is necessary to define some function which is a measure of this. One such function can be obtained by calculating du for each pair of vowels, and then ordering the vowels according to the value of the distance measure. If there are m vowels in the set, define a vector [O;] such that the nearest point to the unknown vowel has a value m, the next nearest m -1, etc. An order matrix [Oul can then be constructed for each of them vowels in the set. A function can then be defined as m
S =
I.ij
[Mul
* [Ou]/m 2 ,
(3)
W. A . Ainsworth
336
Table III (a) Confusion matrix with normal duration vowels (a total of 140 responses were obtained from each stimulus). (b) Confusion matrix with the 50 ms vowels. (c) Confusion matrix with the 500 ms vowels
Heard
(a) Spoken
E
124 2 E
16 129 1
Q
8 134
re
5 134
Q
D
4 2
6 1
3
Spoken
E
81 3 E
59 122 15
re
Q
14 118 1
6 138 73
Q
D
117 1
21 132
23
::>
::>
D
1 59 126 55
u u
1 2
A
3
E
130 2 Q
D ::>
5 7
13 2
27 5
8 103
34 132
re
6 136
Q
D
4 139 11
64 8
u
3
2 4
2 1 93 12 12
2 10 127
u
u
37 73
20 137
A
3
3 2
2 14 46 1 1 2 2
6 25 86 34 13 5
3 23 95
u
u
9 30 10 72 27
5 86
2 2
::>
A
3
2 1 2
u A
3
Heard
(c) Spoken
E
A
Heard
(b)
re
u
2
u u A
u
2
2 134
::>
::>
D
29
2
65 132 5 4
86 12
30 123
15 44 1
4 1 65 138
where [Mu ] is the confusion matrix and [Ou] is the order matrix. Each element Mu is multiplied by the corresponding element Ou and the results summed. The function S will have a maximum value if the largest elements of each row of M occupy the same positions as the maximum elements of 0 , or at the minimum distances from the ideal vowels. Any
Recognition of synthetic vowels
337
errors of classification will reduce S, and will reduce S by a greater amount the larger the value of the corresponding du . As well as misclassification induced by distortions of vowel durations, there will also be errors caused by such factors as the listeners' attention wandering, accidentally pressing the wrong switch, or their internal idealised vowels being different from those of the set of test vowels. Some of these factors will result in responses which are dependent on the distance measure, but others will not. In order to allow for such factors another function can be defined m
s' = Iij [Mu1 * [Oil1/m 2 ,
.
(4)
where [o;i] is an order matrix derived from the confusion matrix by placing m in the position in each row which contains the largest number, m- 1 in the position which contains the next largest, etc. The difference between these two functions ~::,.s =
s'-s
(5)
is then a measure of how well a particular definition of distance measure predicts the confusions which are actually obtained. Effect of Fl and F2 With normal duration vowels /::,.T for the unknown vowel and the perceived vowel should be relatively small, so that equation (1) can be simplified to
(6) Using this distance measure /::,.S was calculated for the LDM and EDM for both the Hz and Bark scales of frequency fork ranging from 0 to 2. The results are shown in Fig. 1. It can be seen that /::,.S is minimised from k;;;. 1, and that neither the distance measure nor the frequency scale had any noticeable effect on the result. The distribution of errors in the confusion matrix was examined next. It was expected that the number of confusions between intended and perceived vowels would be related in some way to the distance between them. Histograms of the number of errors within each range of distance were constructed, and the correlation between number of errors and distance were calculated. The results are shown in Table IVa. The correlation was not very great, but where the calculation was repeated using a logarithmic scale for the number of errors, the correlations were much higher. The greatest correlation was obtained with the EDM and the Bark frequency scale. It is probable that when a vowel is misclassified it is just as likely to be perceived as the neighbour to the left as the neighbour to the right. In some cases, however, there will be no neighbour to the right. The vowel /i/, for example, is only likely to be confused with /r/, whereas the vowel /A/ could be confused with /re/, /a/, /o/, /u/ or /3/. It is necessary, therefore, to take into account the distribution of vowels in the vicinity of the intended vowel. One way of doing this is to construct a histogram of the number of ideal vowels which occur at a given distance from the intended vowel. If the histogram of the number of errors at a given distance is divided by this histogram, a new histogram is produced which gives the ratio of the number of actual errors to the number of potential errors. The correlations of this function with distance on both the linear and logarithmic scales are shown in Table IV. The highest correlation once again occurs with the EDM and the Bark frequency scale, suggesting that the distribution of errors can be described by:
338
W. A. Ainsworth 6 (a)
(b)
(c)
(d)
5
4
~
3
c:
2
.,u ~
~
'6
., :; ., 0
E c:
0
·v;
5
"
'E 0 u
0·5
0
0 5
1·0
I 5
k
Figure 1
Differences between a measure of observed and predicted confusions as a function of kin equation (6) (see text) for (a) linear distance measure in hertz, (b) linear distance measure in Barks, (c) euclidean distance measure in hertz, and (d) euclidean distance measure in Barks.
NE a PE exp (-dii/d 0 ),
(7)
where NE is the actual number of errors at a distance dii, PE is the potential number of errors, and d 0 is a constant (with a value of about 0.7 Bark). Effect of duration As Fl and F2 appear to have approximately equal effects, equation (1) can be simplified to: dii = (IL\FW
+ IL\F21" + lkL\TI") 11".
(8)
The next task is to estimate k in order to discover the relative effects of formant frequency and vowel duration. The function L\S was calculated for the normal duration vowels, the short vowels (50 ms), and the long vowels (500 ms) for a range of values of k. These were added together and plotted against k for the LDM and the EDM for both the hertz and Bark scales. The results are shown in Fig. 2. It can be seen that for the LDM a minimum occurs at k = 0.4 for the hertz scale and at 0.7 for the Bark scale. With the EDM, duration has no effect for the hertz scale over a range of k of 0 to 0.4, and for the Bark scale the minimum occurs at k = 0.4. The lowest minimum
339
Recognition of synthetic vowels
Table IV Correlation coefficients of: (i) histogram of number of confusions with distance; (ii) ratio of histogram of number of confusions over histogram of potential categories, with distances; (iii) logarithm of (i) and (iv) logarithm of (ii). (b) Distance measure given by equation (8) (see text) (all 0.1% signifi· cance). (c) Distance measure given by equation (9) (see text) (all at 0.1% significance)
(a) Distance measure Frequency scale (i) (ii)
(iii) (iv) (b) Distance measure Frequency scale (i) (ii) (iii)
(iv) (c) Distance measure Frequency scale (i)
(ii) (iii) (iv)
Linear - 0.509 - 0.538 -0 .745 -0.796
Euclidean (Bark)
(Hz)
(n.s.) (n.s.) (5%) (2%)
-- 0.680 - 0.688 -0.872 -0.884
( 10%) ( 10%) (2%) (1 %)
-0.577 (10%) -0.571 (10%) -0.826(1%) -0 .815 (1%)
Linear -0.646 -0.730 -0 .818 -0.842
-0.599 -0.668 ·- 0.839 -0.838
(5%) (5%) (0.1%) (0.1 %)
-0.685 -0.820 -0.822 -0.870
(Bark)
(Hz)
-0.604 -0.756 -0.832 - 0.811
Linear (Hz)
-0 .736 -0.731 -0.918 - 0.934
Euclidean (Bark)
(Hz)
(Bark)
(Hz)
-0.743 -0.757 -0.869 -0.877 Euclidean
(Bark) -0 .702 -0.713 - 0.870 -0.880
(Hz)
-0.670 -0.708 -0.902 -0.899
(Bark) -0.755 -0.809 -0.891 -0.920
occurs for the EDM with the hertz scale, but the differences between the various measures are small. The distribution of errors as a function of distance was investigated in the same way as described in the previous section. Histograms were constructed for each condition for the value of k which gave minimum D.S. Correlation coefficients were then calculated as shown in Table IVb. Although the correlations between the error histograms and distance were highly significant in all cases, the highest correlations occurred with error histogram being measured as the logarithm of the ratio of the number of actual errors to the number of potential errors. The highest correlation of all occurred, as before, with the EDM and the Bark scale of frequency. The distribution of errors can thus be described by equation (7) with d 0 having a value of about 1.0 and du given by equation (8) with k = 0.4. Mixed distance measures It has so far been assumed that duration differences are combined with formant frequency differences in the same way that differences in F1 are combined with differences in F2. This need not necessarily be the case. A more general form of equation (8) is: (9) where m and n may equal 1 or 2. This reduces to equation (8) when m = n. The calculations outlined in the last section were repeated for the cases where m and n are not equal. The results are also shown in Fig. 2. With F1 and F2 measured in hertz, combined with the LDM, and then combined with T with the EDM, a minimum in D.S
340
W. A. Ainsworth I
(a)
I
(b)
I I
I
I
I
I
6
I
I
, .... ..I
I
I
I
,I
_I
,,--~ __ I
5
; I
I
3 -
- - - Mixed distance measure
Simple distance measure
~
"
(d)
(c)
V>
c
"'E g
·u;
7
.2 c u
I
!
0
I
I I I I
6
I I
I I
I
I
I
,,- ....... I-
I
I
/,
I
I
I I
I
/
,-/
--1
I I
/
I
I
I
/
I
3
2
0
05
0
I 5
I 5
k
Figure 2
Differences between a measure of observed and predicted confusions as a function of kin equation (8) (solid lines) and equation (9) (broken lines) (see text) for (a) linear distance measure in hertz, (b) linear distance measure in Barks, (c) euclidean distance measure in hertz, and (d) euclidean distance measure in Barks.
occurred when k = 0.5. With Fl and F2 measured in Barks, the minimum occurred at 0.2. With Fl and F2 measured in hertz, combined with the EDM, and then combined with T with the LDM, the minimum occurred at k = 0.4, and with Fl and F2 measured in Barks the minimum also occurred at 0.4. The lowest minimum of all occurred with Fl and F2 measured in hertz combined with the EDM, and then combined with the LDM. It is interesting to note that in every c
Recognition of synthetic vowels
341
Discussion The results show that both duration information and formant frequency information may be used in vowel recognition. They suggest, however , that duration information is relatively unimportant . With the distance measures discussed in the previous sections, a difference in duration of about 250 ms (k = 0.4) has the same effect as a difference in formant frequency of about I 00 Hz or 1 Bark. As the range of duration of the normal vowels employed in these experiments was from 150 to 400 ms , this suggests that duration can only contribute one bit of information towards the decision about the identity of an unknown vowel sound. This , of course , fits well with the view of traditional phonetics that vowels should be described as 'long' or 'short' with no finer gradations being necessary. None of the models appears to be much superior to the others. This is perhaps not surprising as the Bark scale of frequency increases monotonically with the hertz scale, and the LDM is never longer than 1.4 of the EDM. A similar conclusion was reached by Carlson & Granstrom (1980) who compared the predictions of vowel dissimilarity of various models. It would be interesting to compare the predictions of their models , and the psychoacoustic model of Bladon & Lindblom (1979), with the results of the present experiments on vowel confusions. The results have been interpreted in terms of a model which assumed that the set of ideal vowels was the same for each listener, and that these had the characteristics of the set of 'normal' vowels employed in the first experiment. This is clearly only an approximation as the scores for each listener were different, and the overall score was only 87 .1%. Using the distance measure described by equation (8) it might be possible , using iterative techniques, to calculate the parameters of a set of ideal vowels from the confusion matrix obtained from an individual listener. This could be useful in assessing hearing deficiencies and in foreign language training . There are many other instances in speech perception where it is known that a number of diverse factors contribute to a decision. For example, with initial plosives the voicedvoiceless distinction depends upon voice onset time, first formant cut-back, the presence of a voice bar, and the slope of the intonation contour. It would be interesting to apply the techniques discussed in the previous sections to define appropriate distance measures and to assess the relative importance of each of these factors . The results obtained from such experiments could well have application in programming machines to recognise speech automatically . Conclusions The effect of duration on the recognition of synthetic vowels has been investigated by studying the confusions obtained with very long and very short vowels. The results support the hypothesis that when vowels are misclassified, the perceived vowel is most likely to be a new neighbour in an Fl-F2-T space. Various models have been proposed involving linear and euclidean distance measures of this space, but no one model was much superior to the others in accounting for the observed confusions. The results suggest that only two categories of duration , long and short, are necessary in vowel recognition. References Ainsworth, W. A. (1972). Duration as a cue in the recognition of synthetic vowels. Journal of the Acoustical Society of America, 51,648-651. Ainsworth , W. A. (1974). Performance of a speech synthesis system. International Journal Man-Machin e Studies , 6, 493- 511.
342
W. A. Ainsworth
Ainsworth, W. A. (1978). Perception of speech sounds with alternate formants presented to opposite ears. Journal of the Acoustical Society of America, 63, 1528-1534. Ainsworth, W. A. & Millar, J. B. (1971). A simple time-sharing system for speech perception experiments. Behavioral Research Methods and Instrumentation, 3, 21-24. Bennett, D. C. (1968). Spectral form and duration cues in the recognition of English and German vowels. Language and Speech, 11, 65-85 . Bladon, R. A. W. & Lindblom, B. (1979). Spectral and temporal domain questions for an auditory model of vowel perception. Institute of Acoustics Autumn Conference, A22, 79-83. Carlson, R. & Granstrom, B. (1980). Model predictions of vowel dissimilarity. STL-QPSR, 3-4/1979, 84-104. Cohen, A., Slis, I. H. & t'Hart, J. (1967). On tolerance and intolerance in vowel perception. Phonetica, 16, 65-70. Delattre, P., Liberman, A.M., Cooper, F. S. & Gerstman, L. J. (1952). An experimental study of the acoustic determinants of vowel color. Word , 8, 195-210. Holmes, J. N., Mattingly, I. G. & Shearme, J. N. (1964). Speech synthesis by rule. Language and Speech, 7, 127-143. Klatt, D. H. (1976). Linguistic uses of segmental duration in English: acoustic and perceptual evidence. Journal of the Acoustical Society of America , 59, 1208-1221. Nooteboom, S. G. & Doodeman, G. J. N. (1980). Production and perception of vowel length in spoken sentences. Journal of the Acoustical Society of America, 6 7, 276-287. Peterson, G. E. & Barney, H. L. (195 2). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175-184. Pols, L. W. C., van der Kamp, L. J. & Plomp, R. (1969). Perceptual and physical space of vowel sounds. Journal of the Acoustical Society of America , 46, 458-467. Schroeder, M. R., A tal, B.S. & Hall, J. L. (1979). Objective measure of certain speech signal degradations based on masking properties of human auditory perception. In Frontiers of Speech Communication Research (B. Lindblom and S. Ohman, eds) London: Academic Press, pp. 217-229. Stevens, K. N. (1959). Effect of duration on identification. Journal of the Acoustical Society of America, 31, 109(A). Zwicker, E. & Fast!, H. (1972). On the development of the critical band. Journal of the Acoustical Society of America, 52,699-702.