Journal of Phonetics (1974) 2, 117-123
Perceptual processing of coarticulation: a case study of /l/
Mark Haggard
Department of Psychology, Queen's University, Belfast BT7 1NN, Northern Ireland
Received 1st January 1974
Abstract:
Spectrographic data show effects of both initial and postvocalic /l/ upon average vowel formant frequencies. The present study investigated preference judgements for F1 and F2 in three vowels in /l/ context. Subjects were asked to choose the vowel of a pair that was more similar in quality to a preceding isolated vowel. Significant preferences were obtained chiefly on vowels preceding /l/ and only for F2 variations. A supplementary experiment showed poor discriminability of contextual differences in both F1 and F2 values during /l/ steady states. The data were interpreted as favourable to the proposition that perception can solve the problem of coarticulation in at least two different ways: by modelling it in detail (weak form of motor theory) or by applying an information-losing categorical strategy.
Experiment I
It has long been appreciated that the acoustical structure of a phone reflects both its own intrinsic articulatory characteristics and those of neighbouring phones. Generally the contextual influences give acoustical assimilation as a result of articulatory assimilation, whether from articulatory overlapping of non-critical features (coarticulation) or from articulatory undershoot of incompatible critical features (reduction). It is widely assumed that human perceptual processes thrive upon these contextual variations. However, few precise data exist on the detailed extent to which the perceptual system expects and needs articulatory constraints to appear in the acoustical stimulus. As well as being of practical interest in the optimisation of speech synthesis by rule, such data are of theoretical interest in the modelling of speech perception because they can specify just how detailed and complex the perceptual modelling of the articulatory constraints in speech must be.
The present study sought to determine the degree of perceptual relevance of certain context effects reported for American English in certain vowels adjacent to /l/ phones (Lehiste, 1964). These constitute cases favourable to the hypothesis of perceptual sensitivity to articulatory constraints because vowels are not categorically perceived (Pisoni, 1973). Lindblom & Studdert-Kennedy (1967) used identification of /ɪ/ and /ʊ/ vowels in the context of the similar and dissimilar semivowels /w/ and /j/ to demonstrate the perceptual relevance of acoustical assimilation by reduction of, chiefly, the lip-rounding gesture. The present study had a basically similar motive to theirs but used a technique not restricted to the phoneme boundary between artificial minimal pairs. In seeking to determine whether effects of coarticulation are perceptually significant, it asked whether the acoustical quality of a particular vowel influenced by the various overlapping gestures of an adjacent /l/
was perceptually preferred to the intrinsic or ideal quality, given that the vowel is presented in an unmistakable /l/ context.
Method
Sets of stimuli readily identifiable as /li, lɝ, lu, il, ɝl, ul/ were synthesised on the Haskins Laboratories parallel formant synthesiser. Initial /l/ stimuli had a 50 ms steady state, for which F2 was 9 dB below F3 in level, followed by a 75 ms transition and a 150 ms vowel. Final /l/ stimuli had a 150 ms vowel followed by a 100 ms transition and a 70 ms steady state, for which F2 was 3 dB below F3 in level. The series of vowel formant frequency values used was based on average values that closely followed Lehiste's data (Table I). The first row of Table II shows the trends in her data that were tested perceptually in the experiment. The signed value in Table II represents the difference between the given context value and the overall average for the phoneme. In initial /l/ cases Lehiste found heterogeneous F1 differences, and in one of them, /lu/, there was also a paradoxical F2 raising. For final cases, only F2 was involved, the general effect reported by Lehiste being a lowering of F2 before /l/. /ɝl/ was a notable exception, related presumably to the already low F2 in /ɝ/.
Table I. Stimuli for vowel allophone experiment
Formant varied   Syllable   Segment   F1 (Hz)   F2 (Hz)   F3 (Hz)
F1               /li/       /l/       260       1230      3530
F1               /li/       /i/       310       2235      2860
F1               /lɝ/       /l/       260       920       3195
F1               /lɝ/       /ɝ/       435       1230      1850
F1               /lu/       /l/       260       1075      3530
F1               /lu/       /u/       310       920       2525
F2               /lu/*      /l/       260       1075      3530
F2               /lu/*      /u/       285       845       2525
F2               /il/       /i/       285       2235      2860
F2               /il/       /l/       410       995       3195
F2               /ɝl/       /ɝ/       490       1230      1850
F2               /ɝl/       /l/       410       845       3195
F2               /ul/       /u/       310       920       2525
F2               /ul/       /l/       335       695       3195
* /u/ following initial /l/ was diphthongized, and the whole F2 contour was raised or lowered by the value given in Table II.
For each of the seven cases a series of five related stimuli was drawn up. The first was an isolated vowel with steady-state formant frequency values for 150 ms but with an amplitude and F0 trajectory similar to the vowels in context for the sake of naturalness. The latter peaked at 120 Hz. The last four in each series were /l/-vowel or vowel-/l/ syllables in which F1 or F2 varied. The quantization step for this variation is given in the second row of Table II. Either the second or third member of each of these series had identical vowel formant frequencies to those of the isolated average vowel. This was so arranged for each case as to enable experimental determination of any preference (compared to this average) for a stimulus differing by one or by two quantization steps in the expected direction, and for any preference of one interval in the unexpected direction. A difference of two quantization steps was in general between one and two times the expected context effect. In the series involving testing of F1 in /lu/ the F2 value was made equal to the average for that context and when testing F2, F1 was similarly set to average.
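The layout of the four context stimuli in each series can be sketched as follows. This is a minimal illustration, assuming the series consists of one step against the expected effect, the average itself, and one and two steps in the expected direction; the function name `build_series` and the 1000 Hz average are hypothetical and are not values taken from the experiment.

```python
def build_series(average_hz, step_hz, expected_direction):
    """Return the four context-vowel formant values, lowest first.

    expected_direction is +1 if the context is expected to raise the
    formant and -1 if it is expected to lower it.
    """
    if expected_direction not in (1, -1):
        raise ValueError("expected_direction must be +1 or -1")
    # One step in the unexpected direction, the average itself, and
    # one and two steps in the expected direction.
    offsets = (-1, 0, 1, 2)
    return sorted(average_hz + expected_direction * k * step_hz for k in offsets)


# Hypothetical average of 1000 Hz with a 75 Hz step.
lowering = build_series(1000, 75, -1)
raising = build_series(1000, 75, +1)
print(lowering, "average is member", lowering.index(1000) + 1)
print(raising, "average is member", raising.index(1000) + 1)
```

Under these assumptions the average falls third in the series when lowering is expected and second when raising is expected, which matches the statement that either the second or third member carried the average values.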
Table II. Summary of vowel allophone matching experiment
Columns: (a) value of formant frequency in context relative to average value for the same vowel, in Hz [analysis data of Lehiste (1964)]; (b) experimental series step value, in Hz; (c) 2-step preferences of 10 subjects relative to average value.
Formant     Case    (a)             (b)   (c)
Formant 1   /li/    -50             25    N.S.
Formant 1   /lɝ/    +95             50    N.S.
Formant 1   /lu/    -80             25    N.S.
Formant 2   /lu/    +155            75    10 out of 10 Higher, p < 0·01, predicted effect
Formant 2   /il/    -100            75    10 out of 10 Higher, p < 0·01, reverse effect
Formant 2   /ɝl/    No difference   75    8 out of 10 Lower, p < 0·10, minimal effect
Formant 2   /ul/    -105            75    9 out of 10 Lower, p < 0·02, predicted effect
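The probability levels attached to the subject counts in Table II come from a binomial test on the proportion of the 10 subjects preferring one direction. A minimal sketch of such a test is given below, assuming that each direction is equally likely under the null hypothesis; the function name `sign_test` is hypothetical, and both one- and two-tailed values are printed so that they can be compared with the levels quoted in the table.

```python
from math import comb


def sign_test(k, n):
    """Exact binomial tail probabilities for k of n subjects agreeing,
    under the null hypothesis that each direction is equally likely."""
    extreme = max(k, n - k)
    # Probability of a split at least this lopsided in one direction.
    one_tailed = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    # Doubling covers the equally extreme split in the other direction.
    two_tailed = min(1.0, 2 * one_tailed)
    return one_tailed, two_tailed


for k in (10, 9, 8):
    one, two = sign_test(k, 10)
    print(f"{k} of 10: one-tailed {one:.4f}, two-tailed {two:.4f}")
```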
Trials for paired comparison were constructed as illustrated in Table III. The isolated average vowel was followed after 500 ms by two versions of that vowel in /l/ context which were separated by 250 ms. The inter-trial interval was 5 s. Experimental tapes were spliced containing a randomised sequence of the four /l/-vowel and the three vowel-/l/ cases. All the six possible pairings of the four stimuli in each series were used, each appearing in both possible orderings. Thus "first" and "second" were equiprobable as correct responses. Subjects heard each tape (4 initial cases × 6 pairings × 2 orderings) and (3 final cases × 6 pairings × 2 orderings) four times each. This gave 192 trials in all for /l/-vowel cases, and 144 trials for vowel-/l/ cases. To prevent learning effects two versions of each tape were prepared by reversing the order of the first and second halves.
Table III. Trial format in vowel allophone experiment
Isolated vowel, followed by two different syllables (both /l/-vowel or both vowel-/l/).
The task is to indicate which of the second two syllables contains the vowel more like the isolated vowel heard first.
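The trial counts stated for the two tapes follow from enumerating every pairing of the four context stimuli in both orders. A short sketch of that arithmetic, with hypothetical names (`trials_per_tape`, `REPETITIONS`) chosen for illustration:

```python
from itertools import permutations


def trials_per_tape(n_cases, n_stimuli=4):
    """Count paired-comparison trials on one pass through a tape."""
    # permutations(..., 2) yields every pairing in both orders:
    # 6 pairings x 2 orderings = 12 ordered pairs for four stimuli.
    ordered_pairs = list(permutations(range(n_stimuli), 2))
    return n_cases * len(ordered_pairs)


REPETITIONS = 4
initial_trials = trials_per_tape(4) * REPETITIONS  # /l/-vowel cases
final_trials = trials_per_tape(3) * REPETITIONS    # vowel-/l/ cases
print(initial_trials, final_trials)  # 192 144
```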
Ten undergraduate student subjects were tested at the University of Connecticut. They were seated in a quiet room between 5 ft and 12 ft from a high-quality loudspeaker. They were given 15 practice trials and had no difficulty with the task. They wrote "1st" or "2nd" on standard response sheets. The subjects heard each tape once; after a preliminary analysis of results they were recalled to hear the tape containing the four /l/-vowel cases a second time, doubling the number of trials for these cases.
Results
The pattern of 1-step preferences obtained was in general not systematic. It was decided to concentrate on the two 2-step pairings available for each 4-item series. These data had the intermediate degree of preference necessary to avoid floor and ceiling effects, and
bridged the region of the "average" stimulus and the stimulus corresponding to the expected context effect. The basic datum was the number out of 16 judgements (2 orderings × 2 pairings × 4 repetitions) on which a lower or higher formant frequency was preferred from the point of view of giving a subjective quality similar to the isolated standard, the "average" vowel. A preliminary analysis of variance was performed on the data from the 10 subjects on the 7 cases examined. The cases differed significantly in their number of preferences for the lower formant frequency (F(6, 54) = 3·425; P < 0·005). This meant that the cases could be scrutinised individually, but required a more stringent significance level than might be used when testing for an effect on one single case. A significance level of P < 0·02 was adopted. The individual cases were examined with the binomial test on the proportion out of 10 subjects preferring lower or higher values on average. These numbers and the associated 2-tail probability levels are given in Table II. It can be seen that in general, consistent preferences did emerge for the cases where F2 varied between members of a pairing.
Discussion
The absence of consistent preferences for F1 variations following initial /l/ has two possible explanations. First is the possibility that the 2-step differences of 50 Hz were too small to discriminate. They are certainly disproportionately smaller than the 2-step F2 differences of 150 Hz, even when the better differential frequency discriminability for the former is taken into account. However, the same cannot be said of the /lɝ/ case where the two-step (100 Hz) spacing was examined and where the expected F1 effect is as large as most final F2 effects. This consideration renders more likely a second explanation in terms of a different degree of use of contextual place information carried on F2 and F1. The contextual F1 differences reported by Lehiste are not systematically in one direction. This could lead to the perceptual mechanism setting broad F1 boundaries for vowel areas and not extracting contextual information from F1. On the other hand it may be that F1 generally carries many other types of information and is processed in this less precise way in other contexts besides that of initial /l/. It would be well worth investigating this latter possibility; the experiments would need to take detailed account of basic differential discriminability before investigating contextual preferences.
The vowel-/l/ cases produced more satisfactory results. The /ul/ case showed a consistent preference for F2 lowering. There was a slight tendency in the same direction with /ɝl/ but this was not significant, again in line with the analysis data. The /il/ case presented problems as a synthetic stimulus, apparently connected with the long F2 transition. In the present stimuli, forcing the long transition into the time-frame appropriate for shorter transitions made the /il/ stimuli sound almost bisyllabic. This may have enabled subjects to isolate the vowel from the liquid, inhibiting the context effects and making possible preference of the higher F2 values, nearer on average to the isolated vowel presented. With this qualification the experiment confirms the perceptual relevance of context effects reported in Lehiste's analysis data.
The most interesting case of all is that of F2 in /lu/. The effect is not a straightforward assimilatory one, in that the F2 values in /l/ and in /u/ are so close. They would be even closer did not the /u/ context paradoxically raise the F2 of /l/ as well as being raised by it (Lehiste, 1964). There is a consistent perceptual preference for higher than average F2 in the vowel nevertheless, despite the unique acoustical manifestations of this particular coarticulation. It would be interesting to establish whether the effect in production is due
to avoidance of /u/ rounding after /l/ or to a tendency of the tongue to remain in the forward part of the mouth. These two articulatory possibilities would lead to differences in F3 but these might be difficult to test perceptually due to the low overall level of F3 in /u/.
Experiment II
Lehiste's data also show systematic variations in the formant frequencies of the /l/ phones themselves as well as in adjacent vowels. An attempt was made in pilot experiments to investigate these variations with the matching technique employed in Experiment I. It rapidly emerged that this could be done only by making the variations much larger than those reported in Lehiste's data, such that compensatory distortions became necessary in some cases to retain the identifications in the /l/ category at all. This seemed to provide an instance of the usefulness of categorical perception in liberating the perceptual mechanism from the obligation to compute and discount many heterogeneous and small context effects. A short experiment was run to determine whether this impracticability of contextual matching for the /l/ steady-state frequencies was attributable to failure to discriminate differences of similar magnitude to those involved in the context effects.
Method
The basic stimuli described in Table I were supplemented with /lɑ/ and /ɑl/ syllables in which the first three formant frequency values were respectively 335, 920, 3530 and 490, 845, 3195 Hz. Three further stimuli were derived from each of these 8 basic syllables by making changes to the steady-state /l/ formant frequencies as shown in Table IV. Two of the changes involved F2 and one involved F1 in each case. The beginning of each transition displayed the experimental differences, but the end, being continuous with the vowel, did not.
Table IV
Summary of /l/ discrimination experiment
Syllable   First formant (Hz)   Second formant (Hz)
/li/       +75                  -150      -300
/lɑ/       -75***               +150      +300
/lu/       +75                  -150      +150
/lɝ/       +75                  +150      +300**
/il/       -75                  -150*     -300
/ɑl/       -75                  +150      -150
/ul/       +75                  +300      +150
/ɝl/       -75                  +150      -150
The asterisks indicate significance levels of ABX discriminations. * p < 0·05; ** p < 0·01; *** p < 0·001.
ABX trials were constructed to measure discrimination, with a 250 ms interstimulus interval between A and B, and a 500 ms interval between B and X. Intertrial interval was 5 s. One of A or B was always one of the 8 basic syllables. The other was always one of the corresponding set of three derived syllables given in Table IV. The X was equally frequently identical with A and with B and the task of the subject was to say with which, A or B, on each trial. A tape was spliced incorporating the 24 possible AB pairings each occurring
4 times. This was necessary so that the basic stimulus could occur as both A or B, and so that under each of these contingencies the X could be identical on one occasion with the A or B member of the trial. Thus 4 presentations of each type of stimulus difference occurred. The tape was heard by 10 undergraduate subjects under similar conditions to Experiment I. Orthodox ABX method instructions were employed.
Results
The numbers of correct responses were totalled over the whole group of subjects and examined for significant deviation from chance level (20 correct) by the normal approximation to the binomial. Table IV depicts the significant results. Out of 24 discriminations inspected one result of P < 0·05 would occur under chance alone. Thus only two of the discriminations can be meaningfully discussed, those on /l/ in /lɑ/ and /lɝ/.
Discussion
Again there is some correspondence between our perceptual results and the context effects seen in spectrograms. According to Lehiste's data the highest F1 values for initial /l/ occur preceding /ɑ/ and the lowest F2 values preceding /ɝ/. The extreme values must be due respectively to anticipatory coarticulated tongue backing and anticipatory coarticulated tongue backing and retroflexion. Perception apparently takes some account of the most extreme case. The effect preceding /ɝ/ could possibly be attributed to the 300 Hz F2 increment banishing the F2 change between liquid and vowel in the present stimuli and hence is not strong evidence for a perceptual modelling of coarticulation. In the case of /lɑ/ the discriminability of lowering F1 by 75 Hz could be ascribed primarily to the more extensive transition required. However, listening to the sounds involved gives a distinct impression of an /l/ that is simply the wrong allophone and which implies unnaturally rapid, non-coarticulated, tongue movements rather than an inadequacy of the synthesis at the acoustical level.
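The chance-level test reported in the Results of Experiment II can be sketched as follows. This is an illustration only: it assumes a two-tailed test without continuity correction, neither of which is stated in the text, and the scores passed to the hypothetical function `abx_z_test` are invented examples rather than the observed totals.

```python
from math import erfc, sqrt


def abx_z_test(correct, n_trials, p_chance=0.5):
    """Normal approximation to the binomial for a pooled ABX score."""
    mean = n_trials * p_chance
    sd = sqrt(n_trials * p_chance * (1 - p_chance))
    z = (correct - mean) / sd
    # Two-tailed probability of a deviation at least this large.
    p_two_tailed = erfc(abs(z) / sqrt(2))
    return z, p_two_tailed


N_TRIALS = 10 * 4  # 10 subjects x 4 presentations, so chance is 20 correct
for correct in (20, 27, 31):  # hypothetical group totals
    z, p = abx_z_test(correct, N_TRIALS)
    print(f"{correct}/{N_TRIALS} correct: z = {z:.2f}, two-tailed p = {p:.4f}")

# Expected number of spurious results at the 0.05 level over 24 tests.
expected_false_positives = 24 * 0.05
print(f"expected by chance: {expected_false_positives:.1f}")
```

The last figure is the basis for treating a single result at P < 0·05 among 24 discriminations as uninterpretable.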
There is thus only one overridingly clear case in the /l/ discriminations where perceptual mechanisms appear to expect the articulatory context effect to the extent of making its absence discriminable, that which involves the most conflicting position of the mass of the tongue. Otherwise physical differences of the magnitude of the context effects seen in spectrograms are, broadly speaking, not discriminable.
Although not directed at the question of discriminability per se, our results confirm indirectly the poorer discriminability of differences in certain types of consonant than in vowel sounds (Pisoni, 1973). It has been suggested (Fujisaki & Kawashima, 1969) that these and other apparent psychological differences between consonants and vowels are not intrinsic properties of the phonological classes, but arise from different ways in which their normal acoustical properties, especially duration, interact with the short-term memory requirements of discrimination tasks, especially ABX. This cannot seriously be doubted. But it is a mistake to imagine that this is a purely physical explanation that banishes the need for a psychological explanation. Obviously the statistics of the durations of different types of acoustical segments will, with their spectral characteristics, help determine the optimum and characteristic mechanisms of their own processing. We have elsewhere (Haggard, 1974) suggested that the coherence and simplicity of acoustical-articulatory relationships obtaining while the vocal tract is in vocalic mode, plus the characteristically long times involved, lead to more detailed processing of all types of contextual information from vowels. This is consistent with a weak form of motor theory. Hypothetically the short transitions and the more complicated aspects of the
steady-state spectra of certain consonants are problems for which simple analyses and approximate heuristic categorising mechanisms have to be used. The steady states of the /l/ stimuli used here plus their transitions were of comparable length to the vowels and of comparable intensity. However, F2 differences underlying the liquid quality were not discriminable while comparable differences underlying the vowel quality were discriminable to an extent permitting systematic preferences. Duration in the experiment may, then, be less important than duration in the perceiver's early development.
As well as documenting the necessary fidelity of reproduction of context effects in synthesis-by-rule, therefore, the present results give some confirmation to the idea that different modes of processing operate for vowels and consonants. The context-sensitivity of (e.g.) the place of articulation of consonants argues for, rather than against, approximate and crude processing for consonant features, in that the contextual information arises in the vowel. It is misleading to talk of vowel perception as more 'complex' than consonant perception, in that complex processing is used in different ways in the two cases; also processing of the two types of information proceeds in parallel and is independent at some levels. Making the distinction vowel/consonant at all is in this sense only a prescientific taxonomy, hopefully to be replaced by detailed models of what happens to all acoustical information in perceptual processing.
This work was performed while the author was a guest researcher at Haskins Laboratories. The hospitality of Dr F. S. Cooper is gratefully acknowledged, as is the assistance of Mr Donald Moldover in the running of the experiments.
References
Fujisaki, H. & Kawashima, T. (1969). On the modes and mechanisms of speech perception. Annual Report of Engineering Research Institute 28, University of Tokyo, pp. 67-73.
Haggard, M. P. (1974). The perception of speech. In (S. Gerber, Ed.) Physics and Psychology of Hearing. Philadelphia: W. B. Saunders Co.
Lehiste, I. (1964). Acoustical Characteristics of Selected English Consonants. Publication 34, Research Center in Anthropology, Folklore and Linguistics. Bloomington, Ind.: Indiana University.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P. & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review 74, 431-461.
Pisoni, D. (1973). Auditory and phonetic memory codes in the discrimination of consonants and vowels. Perception and Psychophysics 13, 253-260.