CV- and VC-transitions: a spectral study of coarticulation—Part II

Journal of Phonetics (1979) 7, 205-224

M. E. H. Schouten and L. C. W. Pols
Institute for Perception TNO, Soesterberg, The Netherlands
Received 13th February 1978

Abstract

Locus positions of six Dutch consonants are determined on the basis of all voiced 10-ms samples representing the CV- and VC-transitions from 120 CVC words, read five times by five speakers in two conditions (text and isolated). In an earlier article, those locus positions had been provisionally estimated using only the vowel portions of the transitions. In addition, the positions of three Dutch glides are determined.

Introduction

The purpose of the present investigation was to validate and extend our knowledge of the consonant loci found in our previous investigation (Schouten & Pols, 1979). In that investigation we restricted ourselves mainly to vowels in CVC (consonant-vowel-consonant) words. The vowel segments consisted of fairly stationary parts with short transitions to the surrounding consonants on both sides, and everything that could possibly be thought to belong to a consonant had been rigorously removed from those transitions. Both the steady-state vowel positions and the transitions were studied. We found that the only useful way of reducing the information contained in the transitions was in terms of the area of the vowel space towards which all transitions sharing a common initial or final consonant pointed. The coordinate values of the centre of such a consonant area, or "locus" as we called it, were estimated by extrapolating the short transitions in the required direction. Those short transitions were reliable enough in themselves, but we had no way of knowing whether a linear extrapolation would be a good description of what happens in reality between consonant and vowel. For that reason we decided to follow up our previous research by investigating the same material, but this time from the point of view of the consonants. However, the same vowel space was used, in which only voiced speech samples could find a place. The traces of these samples may be interpreted as being, in a way, similar to transitions in a formant representation.

Brief description of the previous investigation

In Schouten & Pols (1979) we used vowel segments lifted from CVC syllables, pronounced both as separate words and as stressed syllables in a text. We used six initial consonants, /t, d, n, p, x, r/; four vowels, /i, u, ɛ, a/; and five final consonants, /t, n, p, x, r/ (final /d/ does not occur in Dutch). This resulted in 120 CVC combinations. Each of five male speakers of Dutch recorded every combination five times in both conditions (isolated words and text). The recordings were analysed by means of 17 one-third-octave filters, with centre frequencies ranging from 120 to 8000 Hz. The filter outputs were sampled every 10 ms. The resulting filter levels, expressed in dB, were stored along with the linear level, the number of zero crossings, and the fundamental frequency.

0095-4470/79/030205 + 20 $02.00/0 © 1979 Academic Press Inc. (London) Ltd.

A spectral subspace was determined by means of a principal-components analysis of the covariance matrix based on the filter levels of all vowel samples from a different data set. The coordinate values along the first three dimensions of the subspace were calculated for the voiced samples in the isolated words and in the text. This made it possible to display every utterance as a trace in a plane, using two of the three dimensions at a time. The relevant CVC words were excised from the text with the aid of a display. It showed all the basic parameters of about 75 10-ms samples, along with a two-dimensional trace of those 75 samples in the vowel subspace. In cases of uncertainty one or two extra samples were included, to ensure that the whole word was lifted from the text. Every vowel from every (isolated or text) CVC word was segmented into a stationary part and two transitions: one from the preceding consonant (CV-transition) and one to the following consonant (VC-transition). At the time we only wanted to study the vowels, so great care was taken to ensure that no voiced consonantal samples were included in these transitions. Every sample clearly belonged to the vowel, but not every sample was stationary relative to its neighbours. Next, for every possible consonant-vowel and vowel-consonant combination, the average transitions were calculated (after a linear time normalisation). Each average was taken over the 25 to 30 words containing that transition, regardless of the consonant at the other end of the CVC syllable. The numbers 25 and 30 apply to one speaker and one condition: speakers and conditions were kept separate throughout.
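The subspace computation described above can be sketched in plain Python. It consists of a principal-components analysis of the covariance matrix of the filter levels, followed by projection of each 10-ms sample onto the first few dimensions. This is a minimal illustration under stated assumptions, not the authors' procedure. The eigenvectors are found here by power iteration with deflation, and the function names are hypothetical. Real input would be lists of 17 filter levels in dB, not small toy vectors.

```python
import math
import random

def mean_vector(rows):
    """Column means of a list of equal-length spectra (e.g. 17 filter levels in dB)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows):
    """Sample covariance matrix of the spectra (one row per 10-ms sample)."""
    mu = mean_vector(rows)
    d, n = len(mu), len(rows)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        c = [x - m for x, m in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j]
    return [[v / (n - 1) for v in row] for row in cov]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def leading_components(cov, k, iters=500, seed=0):
    """Top-k eigenvectors of a symmetric covariance matrix (power iteration + deflation).

    The sign of each returned unit vector is arbitrary, as in any PCA."""
    rng = random.Random(seed)
    d = len(cov)
    m = [row[:] for row in cov]
    comps = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        s = math.sqrt(sum(x * x for x in v))
        v = [x / s for x in v]
        for _ in range(iters):
            w = mat_vec(m, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigval = sum(a * b for a, b in zip(v, mat_vec(m, v)))
        comps.append(v)
        # Deflation: remove the component just found so the next pass finds the next one.
        for i in range(d):
            for j in range(d):
                m[i][j] -= eigval * v[i] * v[j]
    return comps

def project(sample, mu, comps):
    """Coordinates of one spectrum along the retained dimensions (I, II, III, ...)."""
    c = [x - m for x, m in zip(sample, mu)]
    return [sum(a * b for a, b in zip(c, comp)) for comp in comps]
```

Note that the basis is computed once, from one data set (here, vowel samples), and then applied to other samples. For example, `project(sample, mean_vector(vowel_rows), leading_components(covariance(vowel_rows), 3))` would give a sample's coordinates along dimensions I to III.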
Whenever possible, the four average transitions containing the four different vowels but the same initial or final consonant were extrapolated backward or forward, respectively. The extrapolation continued until the transitions met in an area, which was then regarded as the locus area for that particular consonant. We felt that this was the best way of describing the transitions systematically. The considerable interspeaker differences were reduced by means of a very simple linear form of speaker normalisation, based on each speaker's overall vowel average. The normalised loci were averaged over the speakers, and the averages were shown in Figs 14 and 15 in Schouten & Pols (1979). The differences between text words and isolated words were quite small, except for final /-t/ and /-x/. The latter finding was not unexpected: for final /-t/ and /-x/ hardly any converging transitions were found. In both conditions, therefore, we had to rely on guesswork to a greater extent than we found acceptable.
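The text describes the speaker normalisation only as "a very simple linear form ... based on each speaker's overall vowel average". One plausible reading, assumed here and not confirmed by the source, is a pure translation. Each speaker's points are shifted so that his overall vowel average coincides with the grand average over speakers, and the shifted loci are then averaged per consonant. The function names are hypothetical.

```python
def centroid(points):
    """Mean of a list of (I, II) coordinate pairs."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))

def normalise_speakers(vowels_by_speaker, loci_by_speaker):
    """Translate each speaker's loci by (grand vowel mean - that speaker's vowel mean).

    Assumption: the 'simple linear' normalisation is a shift only, without scaling."""
    speaker_means = {s: centroid(v) for s, v in vowels_by_speaker.items()}
    grand = centroid(list(speaker_means.values()))
    normalised = {}
    for s, loci in loci_by_speaker.items():
        shift = tuple(g - m for g, m in zip(grand, speaker_means[s]))
        normalised[s] = {c: tuple(x + d for x, d in zip(p, shift)) for c, p in loci.items()}
    return normalised

def average_loci(normalised):
    """Average each consonant's locus over the speakers for whom a locus was found."""
    consonants = sorted({c for loci in normalised.values() for c in loci})
    return {c: centroid([loci[c] for loci in normalised.values() if c in loci])
            for c in consonants}
```

Averaging only over the speakers who have a locus matters here, because some loci (e.g. final /x/) could not be found for every speaker.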

Aims of this investigation

The consistency of the average transitions over speakers and conditions was quite high. Even so, we felt that we could not be certain about consonant loci based on CV- and VC-transitions from which every hint of a consonant had been carefully removed, because we were investigating only the vowels. We therefore decided to take a closer look at the consonants themselves, to see: (1) whether the directions of the short transitional segments surrounding the stationary part of each vowel are similar to the directions of lengthened transitions containing all the voiced samples up to and, wherever applicable, including those of the actual consonant itself; (2) whether the steady-state portions of voiced consonants could be said to occupy areas of the vowel space comparable to the loci estimated on the basis of those transitions (voiceless consonants cannot be displayed in a vowel space). Our original recorded material had, by design, also included a great many instances of the Dutch glides /w, j, l/, because we wanted to find out: (3) whether glides, which are nothing if not transitional, have anything in common with the transitions between consonants and vowels. We had already seen (see, e.g., Fig. 2 in our previous article) that /d/ can be described as a vocal murmur close to the /u/-area, followed by a spectral jump towards the /t/-area close to /i/. We therefore thought that it might be difficult to assign /d/ to one particular area of the vowel space. For that reason we restricted our treatment of question (2) to the consonants /n/ and /r/, and investigated /d/ only in the manner of question (1), i.e. by adding all voiced /d/-samples, on both sides of the explosion, to the transitions already established. Question (1) was studied in relation to all CV- and VC-transitions.

Extended CV- and VC-transitions

Method

As was mentioned above, the CV- and VC-transitions studied in our previous article had been severely curtailed, leaving in only totally vowel-like samples. This time, however, only two restrictions were imposed: the first voiced sample was the first sample of the CV-transition, and the last voiced sample was the last sample of the VC-transition. Any incidental voiceless sample occurring between those two endpoints was recalculated in the vowel space. These were mostly borderline cases with respect to the voiced/voiceless criteria, which were based on the number of zero crossings and a spectral weighting. The transition once again ended or began at the stationary vowel segment as defined in our previous article, namely as the three samples with the shortest mutual spectral distances ("inner boundaries"). The obvious space to use here is one based on all the voiced samples of a suitable database, rather than the vowel space used in our previous article. However, transitions calculated and displayed in two different spaces would probably be hard to compare, which would thwart aim (1) of the present investigation. We therefore studied two spaces: the vowel space of our previous article, and a voiced space based on all 27 881 voiced 10-ms samples from a database containing all Dutch vowels and consonants in 270 CVC words pronounced by three speakers (see Pols, 1977). It turned out that the first two dimensions of the voiced space and the first two dimensions of the vowel space defined the same plane, and that the third dimensions of both spaces were virtually identical. No information is lost, therefore, by using one space in preference to the other. From now on only the vowel space will be discussed, to make comparison easier.
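The "inner boundaries" definition of the stationary vowel segment (the three samples with the shortest mutual spectral distances) can be illustrated with a short sketch. It makes three assumptions beyond what the text states. The three samples are consecutive. Spectral distance is Euclidean distance in the subspace. "Shortest mutual distances" means the smallest sum of the three pairwise distances. The helper names are hypothetical.

```python
import math

def stationary_triple(trace):
    """Start index of the three consecutive samples with the smallest summed pairwise distance."""
    if len(trace) < 3:
        raise ValueError("need at least three samples")
    best_start, best_cost = 0, math.inf
    for i in range(len(trace) - 2):
        a, b, c = trace[i], trace[i + 1], trace[i + 2]
        cost = math.dist(a, b) + math.dist(b, c) + math.dist(a, c)
        if cost < best_cost:
            best_start, best_cost = i, cost
    return best_start

def split_transitions(trace):
    """Split a voiced trace into (CV-transition, stationary part, VC-transition).

    Assumption: each transition shares its inner-boundary sample with the stationary part."""
    s = stationary_triple(trace)
    return trace[:s + 1], trace[s:s + 3], trace[s + 2:]
```

With the toy trace `[(-20, 0), (-10, 5), (0, 10), (1, 10), (1, 11), (10, 5), (20, 0)]`, the stationary part is the triple starting at index 2.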
The only thing keeping us from stating that the vowel space and the voiced space are identical is that in the latter the second dimension is slightly shrunk relative to that of the former. At low overall (linear) levels per sample, individual filter levels could fall below the lower end of the dynamic range (60 dB). This would result in a form of "valley-clipping" or "valley-flattening", in which low-level filter values are set to a fixed lower limit (a procedure similar to Klatt's (1976) "noise floor normalization"). We investigated the general effect of valley-clipping on the position of a sample in the vowel space. We selected a number of vowel and voiced-consonant samples from all over the space and raised, in steps, the threshold below which all filter levels are made equal to the threshold. No matter which sample we chose, the effect was always the same. With every increase in the threshold, the sample position shifted along a fairly straight line towards the same general area, where the coordinate values along both dimensions are very low: the third quadrant of the I-II vowel plane. Consequently, whenever a trace in the vowel space combines a sudden change of direction towards the bottom left-hand corner with a low linear level, we have to be on our guard against valley-clipping. As we shall see later, valley-clipping is very common at the end of isolated words, where the overall level dies away very gradually.

Results

[Figure 1. Average CV- and VC-transitions by speaker 2, calculated over all voiced samples in the text words. Panels (a) to (c): CV-transitions, n = 25; panels (d) and (e): VC-transitions, n = 30. Axes: dimension I (horizontal) against dimension II (vertical) of the vowel space. The labelled crosses indicate the vowel positions, whereas the underlined labels indicate the consonantal ends of traces. Plus signs represent the estimated locus positions of the relevant consonants.]

[Figure 2. As Fig. 1, but for isolated words. Panels (a) to (c): CV-transitions, n = 25; panels (d) and (e): VC-transitions, n = 30.]

After a linear time normalization, average CV- and VC-transitions were determined separately for each of the five speakers in each of the two conditions. Each CV-transition was averaged over 5 (final consonants) × 5 (replications) = 25 tokens, and each VC-transition over 6 (initial consonants) × 5 (replications) = 30 tokens. We then set out to determine where the transitions involving the same initial or final consonant, but different vowels, met. For voiced consonants they met unaided; for voiceless consonants they met only after extrapolation. We called these meeting points the new consonant locus positions in the vowel space.
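The new loci are defined as the points where traces with the same consonant but different vowels meet, unaided or after extrapolation. The authors placed these points by inspection. As a substitute, and not their procedure, the meeting point can be computed as the least-squares point closest to the straight-line extrapolations of the consonant ends of the traces. A large residual then signals that, as for initial /x/, there is no unique meeting point. Each trace is assumed to be a list of (I, II) coordinates ordered from vowel to consonant, so CV-traces would first be reversed (`trace[::-1]`). The function names are hypothetical.

```python
import math

def line_from_trace(trace):
    """(point, unit direction) of the last segment of a trace ordered vowel -> consonant."""
    (x0, y0), (x1, y1) = trace[-2], trace[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("last two samples coincide; direction undefined")
    return (x1, y1), (dx / norm, dy / norm)

def meeting_point(lines):
    """Point minimising the summed squared perpendicular distance to 2-D lines.

    Solves sum(I - d d^T) x = sum(I - d d^T) p over lines (p, d) with unit d."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), (dx, dy) in lines:
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += m11
        a12 += m12
        a22 += m22
        b1 += m11 * px + m12 * py
        b2 += m12 * px + m22 * py
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("lines are (nearly) parallel: no unique meeting point")
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def rms_distance(point, lines):
    """RMS perpendicular distance from point to the lines (small = traces converge)."""
    x, y = point
    total = 0.0
    for (px, py), (dx, dy) in lines:
        # 2-D cross product with a unit direction gives the perpendicular distance.
        total += ((x - px) * dy - (y - py) * dx) ** 2
    return math.sqrt(total / len(lines))
```

Using only the last segment makes the extrapolation as short as possible. Fitting a line to several consonant-end samples would be a reasonable alternative.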

It would have been impossible to show all average transitions for all five speakers. Speaker 2 will therefore serve, as he did in our previous article, as an example of how the locus positions for all speakers were arrived at. Figures 1 and 2 display the average transitions produced by speaker 2, in text words and in isolated words, respectively. The small crosses represent the average inner-boundary vowel position for each CV- or VC-combination. Phonetic symbols indicating the CV- or VC-combinations have been placed at both ends of the traces; at the consonant ends they are underlined. The symbols themselves are well known, except perhaps for that of the Dutch velar fricative /x/. The estimated locus positions are indicated by plus signs. In the following discussion we shall take the transitions consonant by consonant, comparing text words and isolated words. Readers not interested in the details of how the loci were estimated may skip this discussion.

Initial /n/: Figs 1(a) and 2(a)

In the text words [Fig. 1(a)] it is clear that the /nV/-traces tend to converge from various directions before moving off towards their respective vowels. It seems obvious that the point of convergence should be regarded as the /n/-locus. For the text words we estimated it as (13, 20) (coordinate values along the two dimensions). The different directions from which this locus position is approached are attributable to the various speech sounds preceding /n/ in the text. Text words taken from a context in which /n/ was preceded by /s/ (e.g. /s nut/) or by an unreleased /t/ (e.g. /dat nit/) approach the /n/-locus from below. With a preceding /k/ the approach is much more from the left. Preceding vowels in the text had no systematic effect. The first parts of the /n/-traces in Fig. 1(a) are simply the resultant of the different words in varying contexts contributing to the average. In Fig. 2(a) (isolated words), the first part of the /nɛ/-trace deviates. This is because /nɛx/ was the only CVC combination in the word list that did not occur on its own: it came embedded as /knɛxt/, introducing /k/-influences in five out of twenty-five cases. The locus position of /n/ in isolated words [Fig. 2(a)] is considerably more extreme than that in the text words; we estimated it as (16, 32). A closer look revealed that none of the initial-/n/-traces in isolated words contained any samples that had undergone valley-clipping. It therefore seems reasonable to conclude that the more or less unanimous course of the /ni/-, /nu/- and /na/-traces towards the /n/-locus position is, on average, an integral part of initial /n/ in isolated words.

Initial /r/: Figs 1(a) and 2(a)

The rather erratic beginning of the text-word /ri/- and /rɛ/-traces in Fig. 1(a) is difficult to explain. This speaker's /r/ is very often a tongue-tip rattle, so it is bound to (and does) contain samples with a very low level. Such samples tend to shift the average /r/-position to the bottom left (valley-clipping). One should therefore be wary of any parallel movements at the beginnings of the initial-/r/-traces. Whenever possible, the /r/-locus should be placed at the point where those traces diverge towards the various vowel positions. Using some extrapolation, this results in (-20, 8) being chosen as the locus for initial /r/ in text words. Average initial /r/ in isolated words [Fig. 2(a)] shows parallel movements in two stages. The traces first move from a rather central position to a point in the lower left quadrant. From there they move to an area centred around (-14, 2), from which they head towards the various vowels. The explanation is, in all likelihood, that speaker 2 precedes initial /r/ in isolated words with an /ə/-like sound. After that sound, the probability of very low-level samples increases for a while, and then decreases again as the speaker approaches the vowel. This varying probability arises because the individual /r/-traces which make up the averages shown in Fig. 2(a) are of unequal duration. A given sample number may therefore refer to a high-level sample in one individual /r/-trace and to a low-level sample in another. As a result, individual traces consisted mostly of wild movements to and fro. In the average traces of Fig. 2(a), the position of an /r/-sample can only measure how many of the individual samples it averages are high-level and how many are low-level.
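The averaging problem just described follows from the linear time normalisation used throughout. Traces of unequal duration are resampled onto a common number of points and then averaged sample by sample. One averaged point can therefore combine high-level samples from some tokens with low-level ones from others. The sketch below is minimal and rests on two assumptions: linear interpolation between successive 10-ms samples, and ten normalised time points by default. Both the number of points and the function names are hypothetical.

```python
def resample(trace, n):
    """Linearly interpolate a trace of (I, II) points onto n equally spaced normalised-time points."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(trace) == 1:
        return [tuple(trace[0])] * n
    last = len(trace) - 1
    out = []
    for k in range(n):
        t = k * last / (n - 1)          # position in original sample units
        i = min(int(t), last - 1)       # left neighbour, clamped to the final segment
        f = t - i                       # fractional distance towards the right neighbour
        (x0, y0), (x1, y1) = trace[i], trace[i + 1]
        out.append((x0 + f * (x1 - x0), y0 + f * (y1 - y0)))
    return out

def average_trace(traces, n=10):
    """Time-normalise each trace to n points, then average them point by point."""
    resampled = [resample(t, n) for t in traces]
    m = len(resampled)
    return [(sum(p[0] for p in pts) / m, sum(p[1] for p in pts) / m)
            for pts in zip(*resampled)]
```

For example, averaging a two-sample trace with a three-sample trace at `n=3` aligns their beginnings, midpoints and ends, regardless of their original lengths.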

Initial /t/: Figs 1(b) and 2(b)

We see the same pattern for text words as for isolated words: /tu/ and /tɛ/ come from the same area, /ta/ points backward to that area, and /ti/ leads its own life, away from the others. In Schouten & Pols (1979), transitions involving /i/ often turned out to deviate from the pattern of the other three vowels; here we see that adding more samples does not alter that picture. The initial-/t/-loci were estimated at (-18, 17) for the text words and (-20, 20) for the isolated words.

Initial /d/: Figs 1(b) and 2(b)

Initial /d/ appears to be best characterized as a vocal murmur, followed by a sudden and quite large jump towards the /t/-/d/-locus as soon as the oral cavity is opened. This pattern is clearer in the isolated words of Fig. 2(b) than in the text words of Fig. 1(b), but there can be little doubt that it is similar in both cases. In Fig. 2(b) the vocal murmur seems to consist of a movement towards the top right-hand corner. That movement could easily (and probably does) imply a diminishing probability of low-level samples. It therefore seems best to place the isolated-word vocal-murmur locus, which we call d', at the point where the trace suddenly turns around, i.e. at (15, 38). The text-word /d/-locus [Fig. 1(b)] is even more clearly similar to the /t/-locus, particularly since the /di/-trace seems to originate in the /t/-locus area. We decided to place the vocal-murmur locus d' for text words at (5, 23). In both text words and isolated words, /d/ moves first from the vocal-murmur locus to the /t/-/d/-locus, before turning in the direction of the vowel.
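The /d/ pattern described here (a murmur, then a sudden jump at the release) suggests a simple way to locate the jump automatically. Take the longest step between successive samples, and call it sudden if it is much longer than the typical step. This is a hypothetical heuristic, not a procedure from the text. The threshold factor is an assumption. The test coordinates are invented, loosely echoing the d' position quoted above, and are not measured data.

```python
import math
import statistics

def largest_jump(trace):
    """Return (i, length) for the longest step trace[i] -> trace[i + 1]."""
    steps = [math.dist(a, b) for a, b in zip(trace, trace[1:])]
    i = max(range(len(steps)), key=steps.__getitem__)
    return i, steps[i]

def is_sudden(trace, factor=3.0):
    """True if the longest step exceeds `factor` times the median of the other steps."""
    steps = [math.dist(a, b) for a, b in zip(trace, trace[1:])]
    i, longest = largest_jump(trace)
    others = steps[:i] + steps[i + 1:]
    return bool(others) and longest > factor * statistics.median(others)
```

The sample just before the jump, `trace[i]`, is then a candidate for the vocal-murmur locus.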

Initial /p/: Figs 1(c) and 2(c)

In our previous article, in which we worked only with vowel segments, the estimated initial-/p/-locus did not agree with backward extrapolation of the /pi/-trace. This time, however, all voiced samples are allowed to be part of the traces. Only /pi/ and /pu/ retain the same direction, whereas the other two (/pa/ and /pɛ/) now turn around and change direction. Only a little extrapolation is required to make all traces meet, with the exception of /pɛ/ in the text words. The /p/-loci now change to (-28, -5) for text words and (-40, -8) for isolated words. We investigated the possibility that the starting directions of the four traces resulted from valley-clipping in low-level samples. All the individual words we looked at showed a pattern similar to that in Figs 1(c) and 2(c), without the slightest trace of distorted spectra. The conclusion seems warranted that the initial directions of /pɛ/ and /pa/ are quite realistic. Combining what we have seen in relation to /t/, /d/ and /p/, we should now be able to predict the course of initial /b/. It should start as a vocal murmur, followed by a sudden spectral jump to the /p/-locus, followed in turn by traces similar to the /p/-traces in Figs 1(c) and 2(c). This has, however, not yet been investigated.

Initial /x/: Figs 1(c) and 2(c)

In Figs 1(c) and 2(c), /xa/ and /xu/ have starting points that are very close together. Backward extrapolation of /xi/ and /xɛ/ would result in a different meeting point, so there is no unique locus position for initial /x/.

Final /n/: Figs 1(d) and 2(d)

The downward move at the end of every average final-/n/-trace is almost entirely due to valley-clipping. The short upward move that usually precedes it seems real enough, and is probably a simple reversal of the course of initial /n/ [see Figs 1(a) and 2(a)]. Disregarding /in/, the final-/n/-loci for text words and isolated words become (10, 20) and (20, 30), respectively. A separate /n/-locus may have to be defined for /in/, although that seems really necessary only for text words [Fig. 1(d)].

Final /r/: Figs 1(d) and 2(d)

The same can be said of final /r/ as of final /n/: after convergence of the traces, the number of low-level samples begins to increase, resulting in valley-clipping. The traces do not always converge completely. With some extrapolation, they give final-/r/-loci of (-20, 0) for text words and (-26, -5) for isolated words.
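The valley-clipping effect invoked here is easy to reproduce. The Method section defines it as setting filter levels that fall below a floor to that floor. Clipping a spectrum at a series of rising floors, and projecting each result onto a set of basis vectors, traces out the path of the sample in the subspace. The sketch below assumes that the basis vectors and the mean spectrum are supplied, for instance from a principal-components analysis. The toy three-filter spectrum and two-vector basis are hypothetical. The straight-line drift towards the third quadrant reported in the text is an empirical property of the authors' space; this code does not by itself guarantee it.

```python
def valley_clip(levels_db, floor_db):
    """Raise every filter level below floor_db to floor_db."""
    return [max(v, floor_db) for v in levels_db]

def clipping_path(levels_db, basis, thresholds, mean=None):
    """Subspace coordinates of the spectrum after clipping at each threshold in turn."""
    if mean is None:
        mean = [0.0] * len(levels_db)
    path = []
    for th in thresholds:
        centred = [x - m for x, m in zip(valley_clip(levels_db, th), mean)]
        path.append(tuple(sum(a * b for a, b in zip(centred, vec)) for vec in basis))
    return path
```

Only filters below the current floor are affected, so low-level samples, whose spectra have many such filters, move furthest.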

Final /t, p, x/: Figs 1(e) and 2(e)

As was only to be expected with these voiceless consonants, the traces had to be extrapolated if a locus was to be found. In most cases only a rough estimate could be made, although the various directions the traces take from the vowel to the three final consonants are distinct enough. In Fig. 1(e), the /t/-, /p/- and /x/-loci for text words were estimated as (-5, 28), (3, 28) and (-32, -5), respectively. For isolated words [Fig. 2(e)] the loci for the same consonants were (7, 20), (4, 25) and (-40, -16), if we ignore /ix/.
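For the final voiceless consonants, only a general direction, rather than a meeting point, can be given. A direction-based summary may therefore be more appropriate than a locus. One way to express it, assumed here and not taken from the text, is the angle of the last segment of each VC-trace in the I-II plane, summarised across traces by a circular mean. The mean resultant length (1 for identical directions, near 0 for scattered ones) gives a rough measure of agreement. The function names are hypothetical.

```python
import math

def heading_deg(trace):
    """Angle in degrees (anticlockwise from the positive dimension-I axis) of the last segment."""
    (x0, y0), (x1, y1) = trace[-2], trace[-1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

def circular_summary(headings_deg):
    """Circular mean direction (degrees) and mean resultant length (0..1) of a set of angles."""
    s = sum(math.sin(math.radians(h)) for h in headings_deg)
    c = sum(math.cos(math.radians(h)) for h in headings_deg)
    n = len(headings_deg)
    return math.degrees(math.atan2(s, c)), math.hypot(s, c) / n
```

A circular mean is used because an arithmetic mean of, say, 170° and -170° would wrongly give 0° instead of 180°.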

[Figure 3. Estimated consonant loci for the five speakers. Panels: (a) initial consonant loci in text words; (b) final consonant loci in text words; (c) initial consonant loci in isolated words; (d) final consonant loci in isolated words. Encircled dots indicate average vowel positions.]

Consonant locus positions over the 5 speakers

In the same way as for speaker 2, locus positions were determined for all the initial and final consonants for all five speakers in both conditions. In our previous article we found that the locus positions varied considerably across speakers. That variation could, however, be reduced considerably by means of a linear speaker normalization based only on vowels. Those locus positions had been obtained by extrapolating traces that contained exclusively vowel-like samples. The present locus positions are still situated in the vowel space, but they are based on traces containing at least a few consonantal samples. They therefore incorporate any changes of direction taking place in the neighbourhood of the consonants. This is probably why, this time, our vowel-based speaker normalization did not lead to any great reduction in variation among the locus positions of the various speakers. There was a slight reduction for voiced consonants, but a considerable increase for voiceless ones. The locus positions depicted in Fig. 3 have not been speaker-normalized. A comparison of the four panels of Fig. 3 yields the following observations: (1) agreement between text words and isolated words is greater for final consonants than for initial consonants: loci of the latter are farther apart in isolated words than in text words; (2) for initial /x/ no locus positions could be found at all; for final /x/ they could not be found for all speakers, and those that were found lie in an area of the vowel space which is probably articulatorily impossible, since it is considerably to the left of the /i-a/ line; (3) inter-speaker variation is very small for the initial-consonant loci; the only exception appears to be speaker 1's initial /n/ in isolated words, which deviates in a way similar to that speaker's /u/ (see our previous article); nevertheless, it is curious to see how the other speakers' /n/ approaches that of speaker 1 in the text words; (4) the spread among the locus positions of final /t/ and /p/ is enormous: it seems justifiable to conclude that there are no such things as locus positions for final voiceless consonants, and that there is only a general direction in which the vowels trail off on their way to
/t/, /p/ or /x/, with /p/ tending to move slightly more to the right than /t/, and with transitions to /x/ going in the opposite direction; (5) it seems unlikely that there is much to be gained from any general, i.e. not consonant-specific, form of speaker normalisation: apart from the cases mentioned in (2), (3) and (4), the consonant locus areas seem to be quite stable over the various speakers. Figure 4 displays the average locus positions over the five speakers, after speaker normalization. The averages are displayed as crosses, connected by dashed lines to dots that indicate the locus positions from our previous article, which were estimated on the basis of vowel traces. Everything that was said in connection with Fig. 3 also applies here; for example, the locus positions of the final voiceless consonants should not be taken too seriously. However, the close proximity of final /x/ in text words and in isolated words is quite surprising in view of the earlier difference. In general, large differences between the earlier locus estimates (dots in Fig. 4) and the present ones (crosses) reflect direction changes in the transitional traces. Such differences make it necessary to describe the corresponding CV- and VC-transitions by means of three points in the vowel space. These are the vowel position, the consonant locus, and an intermediate point specifying an earlier direction in the transition. The point along the trace at which the change in direction takes place is better defined in some cases than in others. Let us, therefore, look at the data points in Fig. 4 in greater detail.

Initial /t/: no change of direction, and no difference between text words and isolated words.

Initial /d/: as we remarked in the Introduction, initial /d/ begins with a vocal murmur in the neighbourhood of /n/, and then jumps to the /t/-/d/-locus when the vocal tract suddenly opens. It therefore seems permissible to describe /d/ spectrally as /t/-/d/ preceded by a vocal murmur.

Initial /n/: Figs 1(a) and 2(a) show that the change of direction here is only a very gradual one. Defining both a locus and a turning point therefore seems of doubtful use, and it is probably best to use only one locus. The difference between text and isolated words is again considerable.

Initial /p/: as we saw in Figs 1(c) and 2(c), the change of direction in /pɛ/ and /pa/ is quite sudden. It is easy to see why there is no change of direction in the /pu/-transition: vowel position, originally estimated locus and new locus lie roughly on one line. The difference between text and isolated words is fairly small.

Initial /x/: no loci found.

Initial /r/: no change of direction and no difference between the two conditions.

Final /t/: the change of direction in the text words does not mean much, but the more centralized position of final /t/ in isolated words [see the /t/-traces in Fig. 2(e)] seems real enough. However, it remains better not to speak of a locus for final /t/, because there is too much inter-speaker variation.

Final /n/: no change or difference.

Final /p/: we saw in Figs 1(e) and 2(e) that transitions to final /p/ do tend to bend in the direction of the centre. However, the inter-speaker variation is too great to conclude more than that these transitions start out as moves to the right and tend to turn more towards the centre as /p/ is approached.

Final /x/: there seems to be an enormous change of direction here, but this is misleading. The locus positions originally estimated from vowel traces were averages over widely divergent locus positions for only three speakers. This time the inter-speaker variation was much smaller, leading to a far more reliable position. We still hesitate, however, to call these average positions "loci".

Final /r/: no great change of direction, and very little difference between text and isolated words.

[Figure 4. Comparison of the "old" loci (based on vowels only; dots) and the "new" loci (based on all voiced samples; crosses), after speaker normalisation. Panels: (a) initial consonants, text words; (b) final consonants, text words; (c) initial consonants, isolated words; (d) final consonants, isolated words. The length of a dashed line is a measure of the difference between an "old" and a "new" locus.]

Discussion

Our previous study dealt exclusively with vowel behaviour. It used "predicted" consonant locus positions in the vowel space to describe the regularities in the vowel traces, and the present study was meant to check the validity of those preliminary predictions. In a number of cases the original estimates turned out to be correct in themselves but to be simplifications, so that the new estimates provided additional information. In the cases where we had originally had very little confidence in our estimates, we are now slightly more confident in the new ones. Our description of consonant loci on the basis of extrapolated transitions shows some resemblance to the locus theory (Delattre, Liberman & Cooper, 1955; Delattre, 1969). In their data the second-formant locus for initial plosives was found to be relatively low for labials, somewhat higher for alveolars, and higher still for velars. Since our study contained no velar plosives, we could only compare labials and alveolars. As can be seen in nearly every figure in this article, our first dimension appears to be roughly proportional to a reversed log F2. A lower F2 locus therefore corresponds to a larger value along dimension I. So if labial /p/ is to have a lower locus position than alveolar /t/, it has to be situated to the right of the latter.
This was true for the originally estimated locus positions [the dots in Figs 4(a) and 4(c)], but is no longer true now. It is easy enough to think up a large number of possible explanations for this (such as the difference in aspiration between Dutch and English), but all of them would be mere guesses. Our data, however, are solid enough, whatever the cause of the discrepancy may be. The locus positions of the voiceless consonants are still based on extrapolation, although considerably less so than formerly, when only vowel samples were taken into account. It would be useful to investigate whether the locus positions found can be used for automatic consonant recognition, at least in isolated or excised words: this could help us find out what additional information, if any, is needed to specify CV- and VC-transitions satisfactorily.

Consonants /n/ and /r/ in the vowel space

This part of the investigation served to find out what the actual average positions of the voiced consonants /n/ and /r/ are in the vowel space (question (2) of the Introduction); if there is any systematic movement within the consonants, these average positions will probably differ from the loci.

Method

Exactly the same words were used as those underlying Figs 1(a) and 2(a). There, every voiced sample between the vowel target ("inner boundaries") and the beginning or end of the word was used. Now, however, we wanted only the consonants /n/ and /r/, i.e. either the voiced samples to the left or those to the right of the vowel part of a transition.

Segmentation was done by hand, according to the following criteria: initial /r/: all samples not contributing to the specific /r/-sound were eliminated, that is, any samples resembling vocal murmur or a vowel were removed, leaving only the rattle itself; despite appearances, this was an unambiguous, if time-consuming, criterion: in most cases the rattle consisted of three samples, two with a normal linear level surrounding a low-level middle sample. Longer rattles were little more than rather irregular but easily recognizable extensions of that pattern, and the only difficulties occurred in those text words in which /r/ had not been articulated properly, so that we had to rely solely on auditory impressions; final /r/: since final /r/ very often does not result in a rattle, we had to be a little more careful, and decided to eliminate only those samples in which the preceding vowel was still audible, and those samples whose spectra had been subject to "valley-clipping"; initial /n/: since the transition from /n/ to the following vowel is nearly always a gradual one, all we could do was to listen carefully to each sample and eliminate all samples containing the merest hint of the identity of the following vowel; final /n/: it was soon discovered that at the point along the time axis where "valley-clipping" begins, the trace in the I-II vowel space, consisting of interpolations of sample coordinate values, suddenly turns around by about 180°; this made elimination of unwanted samples at the end of the trace much easier. Once again, of course, samples containing a hint of vowel quality were eliminated.
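Segmentation was done by hand and by ear, but two of the cues described above are concrete enough to be checked automatically: the roughly 180° turn of a final-/n/ trace where valley-clipping begins, and the low-level middle sample of a typical initial-/r/ rattle. The sketch below is a hypothetical illustration of those two cues, not the procedure used in this study; the function names, the 150° turn threshold and the 6 dB dip criterion are assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (dimension I, dimension II) coordinates of one 10 ms sample


def turning_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees between segments a->b and b->c (0 = straight on, 180 = full reversal)."""
    v1 = (b[0] - a[0], b[1] - a[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # a stationary step has no defined direction; treat it as no turn
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))  # clamp rounding errors


def first_reversal(trace: Sequence[Point], min_angle: float = 150.0) -> Optional[int]:
    """Index of the first sample at which the trace turns back by at least min_angle degrees."""
    for i in range(1, len(trace) - 1):
        if turning_angle(trace[i - 1], trace[i], trace[i + 1]) >= min_angle:
            return i
    return None


def trim_final_n(trace: Sequence[Point], min_angle: float = 150.0) -> List[Point]:
    """Drop the samples after the reversal point (assumption: the reversal sample is kept)."""
    idx = first_reversal(trace, min_angle)
    return list(trace) if idx is None else list(trace[: idx + 1])


def rattle_dips(levels_db: Sequence[float], min_dip_db: float = 6.0) -> List[int]:
    """Indices of samples lying at least min_dip_db below both neighbours (rattle 'gaps')."""
    return [
        i
        for i in range(1, len(levels_db) - 1)
        if levels_db[i - 1] - levels_db[i] >= min_dip_db
        and levels_db[i + 1] - levels_db[i] >= min_dip_db
    ]
```

Such a detector could only pre-mark candidate boundaries; the auditory criterion (no hint of vowel quality in the retained samples) would still have to be applied by ear.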

Results

Ten sets of averages were calculated, one for each of the five speakers in each of the two conditions. Every set consisted of the average traces in the vowel space of each of the two consonants /n/ and /r/, calculated separately for each of the CV- and VC-combinations. An example of such a set (text words by speaker 2) is shown in Fig. 5, where each initial-consonant trace is the average of 5 (final consonants) × 5 (replications) = 25 instances, and each final-consonant trace the average of 6 (initial consonants) × 5 (replications) = 30 instances of that particular combination. The average vowel positions in Fig. 5 were those found for this speaker's text words, shown in Fig. 3 of our previous article. Phonetic symbols indicate the beginning of each trace. The time normalization used before averaging was a linear one: each individual instance of /n/ and /r/ was made to have the average number of samples, by deleting or adding samples at regular intervals.

Figure 5. Average /r/- and /n/-traces in text words by speaker 2 (initial and final consonants; average vowel positions marked).

From Fig. 5 a number of tendencies emerge which are representative of all five speakers: (1) the position of initial and final /r/ within the overall /r/-area is, to a great extent, a function of the preceding and of the following vowel: the relative positions of the /r/-traces seem to reproduce in miniature the relative positions of the vowels; (2) initial /n/ is positionally independent of the following vowel; (3) final /n/ begins with a "tail" pointing in the direction of the preceding vowel.

Initial and final /n/ were separated from the following and preceding vowels respectively according to exactly the same criteria, so it is somewhat surprising to find that final /n/ should contain a part of the transition from the preceding vowel, while no similar transition is present for initial /n/. The only tentative explanation we can offer is that complete closure of the oral cavity seems to take place while the tongue body is still moving from the vowel to final /n/, whereas the "release" of initial /n/ seems to occur before any other tongue movement has taken place. But whatever the reason may be, the final-/n/-traces did not contain any samples betraying the identity of the preceding vowel.

Next, the average positions of the two consonants were calculated separately for each speaker.
This was done by first determining the average sample position of each of the average traces, such as the ones shown in Fig. 5, and then averaging over those averages, keeping text words, isolated words, initial consonants, and final consonants separate. The results are shown in Fig. 6; the only noticeable features are the distinct /n/-areas for speakers 1 and 3. In our previous article we noticed that speaker 1's /u/ had a very unusual centralized position, so it is perhaps not surprising that a neighbouring vowel should exhibit a similar pattern.

Figure 6. Average positions of /r/ and /n/ over the five speakers (initial and final consonants in text and isolated words; average vowel positions marked). The position of average /s/ is not shown, owing to lack of room.

Another way of looking at the data is to determine the influence of the preceding or following vowels on the positions of /r/ and /n/. Since the pattern for speaker 2, shown in Fig. 5, was common to all speakers, we decided to average the influence of the vowels over all five speakers. To do this in a meaningful way, some form of speaker normalization had to be applied first; once again we used the same vowel-based normalization as in our
previous article, for reasons of convenience. After normalization, a new average sample position over each average trace was calculated, and the resulting points were averaged over the five speakers. The final averages, every one of them based on 125 to 150 utterances, are plotted in Fig. 7.

Figure 7. Average positions of /r/ and /n/ as a function of the adjoining vowel, after speaker normalization (initial and final consonants in text and isolated words; average vowel positions marked).

The influence of the vowel on the position of initial and final /r/ is, once again, plainly visible, in particular on initial /r/. The consonant /n/ appears to be concentrated in a narrow diagonal strip, with /u/ and /i/ tending to push /n/ in one direction, and /e/ and /a/ in the opposite direction. We have been unable to find a contextual reason why /n/ after /a/ in text words should be so different from the rest, although it is always possible to think up a speculative explanation, such as: the trajectory from /a/ to /n/ is so much longer than that from the other vowels to /n/ that in running speech the "target" of /n/ is often not reached. The predictive power of such speculations, however, is usually low. Diagrams showing the behaviour of the various speakers in the different vowel contexts yielded no extra information compared to what has already been given, but there was one thing that called for a little more attention. Speakers 2 and 4 had a tongue-tip /r/, and
speakers 1, 3, and 5 a uvular one. To our amazement, this articulatory difference did not result in any systematic spectral differences, except when /r/ was followed by /u/: as can be seen in Fig. 8, both in text words and in isolated words, the back vowel /u/ appears to pull uvular /r/ (speakers 1, 3, and 5) towards itself. If this is caused by the relative tongue positions for the two phonemes, we should perhaps expect the front vowel /i/ to exert the same attraction on tongue-tip /r/. That is why /i/ has also been included in Fig. 8, but the expected tendency is absent, suggesting that the interaction between tongue-tip and /i/-constriction is much less significant than that between the places of articulation of uvular /r/ and /u/. All in all, we have here another allophone to join the few found for the vowels in our previous article (/i/ preceded by /n/, and most vowels followed by /r/): here it is uvular /r/ followed by /u/.

Figure 8. Comparison of the influence of /u/ and /i/ on the positions of initial tongue-tip /r/ (speakers 2 and 4) and uvular /r/ (speakers 1, 3 and 5), in text and isolated words.

Finally, the data points in Fig. 7 were averaged over the vowels, and the resulting averages were compared both with the "old" loci for /n/ and /r/, found in our previous article, and with the "new" loci discussed earlier in the present article. This comparison is shown in Fig. 9, where straight lines have been drawn connecting the "old" loci first to the "new" loci, and then to the average /n/- and /r/-positions. It is evident from Fig. 9 that the "new" loci differ very little from the average positions of the consonants. If there is a trend, it is for the "new" locus to be situated between the "old" locus and the average consonant position.

Conclusions

(1) The locus positions in the vowel area for /r/ and /n/ agree very well with the actual measured consonant positions.
(2) The position of /r/ within the /r/-area is a function of the preceding and of the following vowel.
(3) Final /n/ contains a "tail" pointing back to the preceding vowel.
(4) Back vowel /u/ pulls a preceding uvular /r/ towards itself.

Figure 9. Comparison of the "old" and "new" /n/- and /r/-loci and the average positions of those consonants (initial and final consonants in text and isolated words).
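The linear time normalization used before averaging the /n/- and /r/-traces (each instance made to have the average number of samples by deleting or adding samples at regular intervals), followed by sample-by-sample averaging and the computation of an average sample position per trace, can be sketched as follows. This is one possible reading of the description, not the original implementation: the function names are hypothetical, and picking the nearest existing sample at evenly spaced positions is an assumption about how samples were deleted or repeated.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (dimension I, dimension II) coordinates of one sample


def linear_time_normalize(trace: Sequence[Point], target_len: int) -> List[Point]:
    """Resample a trace to target_len samples by dropping or repeating samples at
    regular intervals (nearest-sample picking; Python's round() sends ties to even)."""
    n = len(trace)
    if n == 0 or target_len <= 0:
        raise ValueError("need a non-empty trace and a positive target length")
    if target_len == 1:
        return [trace[0]]
    # Evenly spaced positions from the first to the last sample, rounded to an index.
    return [trace[round(k * (n - 1) / (target_len - 1))] for k in range(target_len)]


def average_traces(traces: Sequence[Sequence[Point]]) -> List[Point]:
    """Normalize every instance to the mean length, then average sample by sample."""
    target_len = round(sum(len(t) for t in traces) / len(traces))
    normalized = [linear_time_normalize(t, target_len) for t in traces]
    m = len(normalized)
    return [
        (sum(t[k][0] for t in normalized) / m, sum(t[k][1] for t in normalized) / m)
        for k in range(target_len)
    ]


def mean_position(trace: Sequence[Point]) -> Point:
    """Average sample position of one (average) trace."""
    return (sum(p[0] for p in trace) / len(trace), sum(p[1] for p in trace) / len(trace))
```

With 25 instances behind each initial-consonant trace and 30 behind each final-consonant trace per speaker, averaging over the five speakers gives the 125 to 150 utterances per point quoted for Fig. 7 (5 × 25 = 125, 5 × 30 = 150).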

Glides

Method

The three glides which we investigated, /w, j, l/, had been added to the list of words and to the text discussed above. All three were combined, as far as possible, with the same four vowels /i, u, e, a/, both as initial and as final consonants, but no care was taken over which consonant to place on the other side of the vowel. Each glide-vowel and vowel-glide combination occurred only once in list and text, so that only five replications per speaker were available. Furthermore, a number of intervocalic glides were included in the material in a non-systematic way. The words used were, for initial glides: /jet, jip, jun, wer, wix, wust, war, let, lip, lur, lat/; for final glides: /taj, ruj, niw, tel, vil, dul, bal/; for intervocalic glides: /yjybəs, bəmujal, drajən, jaja, rujən, drijən, rywə, jawel, slaləm, julən, sxelak/. Segmentation of initial and final glides was such that at one end of the glide all samples were included whose linear levels were no more than 9 dB below the highest found in the word, while at the other end the stationary part of the vowel (defined in the same way as the inner boundaries in the CVC words, see Schouten & Pols, 1979) served as the boundary. Intervocalic glides were segmented by cutting off both surrounding vowels at their stationary parts. We defined the "steady-state" or "target" position of an initial glide as the average of the first three included samples, that of a final glide as the average of the last three samples, and that of an intervocalic glide as the average of the three spectrally closest samples among the ones segmented, excluding the vowel samples. The transition was then formed by the samples between the glide "target" and the vowel "target" (i.e. the inner boundaries).
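The glide segmentation and target definitions above translate fairly directly into code. The sketch below is a hypothetical illustration, not the authors' software: the function names are invented, and "three spectrally closest samples" is interpreted (an assumption) as the three samples with the smallest summed pairwise Euclidean distance in the I-II plane, not necessarily adjacent in time.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (dimension I, dimension II) coordinates of one 10 ms sample


def within_level_window(levels_db: Sequence[float], window_db: float = 9.0) -> List[bool]:
    """Mark samples whose level is no more than window_db below the loudest sample in the word."""
    peak = max(levels_db)
    return [peak - level <= window_db for level in levels_db]


def centroid(points: Sequence[Point]) -> Point:
    """Mean (I, II) position of a set of samples."""
    return (sum(p[0] for p in points) / len(points), sum(p[1] for p in points) / len(points))


def initial_glide_target(glide: Sequence[Point]) -> Point:
    """Initial glide: average of the first three included samples."""
    return centroid(glide[:3])


def final_glide_target(glide: Sequence[Point]) -> Point:
    """Final glide: average of the last three included samples."""
    return centroid(glide[-3:])


def intervocalic_glide_target(glide: Sequence[Point]) -> Point:
    """Intervocalic glide: average of the three mutually closest samples.

    Assumption: 'spectrally closest' = smallest summed pairwise Euclidean
    distance in the I-II plane; the samples need not be adjacent in time.
    """
    if len(glide) < 3:
        raise ValueError("need at least three glide samples")
    best = min(
        itertools.combinations(glide, 3),
        key=lambda tri: sum(math.dist(a, b) for a, b in itertools.combinations(tri, 2)),
    )
    return centroid(best)
```

In the study itself, closeness was presumably judged on the full spectral representation rather than on two dimensions only; restricting the search to the I-II plane keeps the sketch short.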

Results

For every glide-vowel and vowel-glide combination, and for every intervocalic glide, the average transitions were calculated over the five replications per speaker per condition. This was done after the linear time normalization mentioned earlier; the three samples defined as the target did not take part in that normalization. A few examples of the resulting traces are shown in Fig. 10, namely speaker 2's initial glides from text words [Fig. 10(a)], final glides from isolated words [Fig. 10(b)], and intervocalic glides from text words [Fig. 10(c)]. The average positions of the stationary vowel parts in Figs 10(a) and 10(b) are indicated by crosses, and the phonetic symbols have been placed at the glide end of each trace. The plus signs represent this speaker's average target positions of the glides. In Fig. 10(c) every trace consists of two stationary vowel positions surrounding a glide; directions are indicated by arrows.

Figure 10. Some glides by speaker 2: (a) initial glides, text words (n = 5); (b) final glides, isolated words (n = 6); (c) intervocalic glides, text words (n = 5). Crosses once again represent vowel positions, and the phonetic labels have been placed near the glide "targets".

Figure 11 shows the average target positions of the three glides, in 11(a) and 11(c) as a function of the various speakers (not normalized), and in 11(b) and 11(d) as a function of the following vowel (initial glides) or of the preceding vowel (final glides), after speaker normalization. Points belonging to the same glide are connected by straight lines; the phonetic symbols for the glides are either preceded by a dash, indicating final glide position, or followed by one, indicating initial glide position. In general, the points
in Figs 11(b) and 11(d) each represent the average over 5 (speakers) × 5 (replications) = 25 glide targets; the numbers underlying Figs 11(a) and 11(c) are more variable, and depend on the number of different words containing the same glide. Comparing Figs 11(a) and 11(c), we see that both initial and final /j/ are, as was to be expected, very close to average /i/, lying a little to the right of the latter, with only speakers 4 and 5 deviating in their final /j/. In text words, speaker 4 also has his initial /j/ in a different position. On the basis of Figs 11(b) and 11(d) it appears that the precise position of final /j/ is more dependent on the accompanying vowel than is that of initial /j/. Initial /l/ and /w/ appear to be more or less interchangeable in text words, with final /l/ occupying a separate position; in isolated words, /w/ is situated between initial and final /l/. In text words, the position of /l/ and /w/ relative to the vowel /u/ is much more centralized than it is in isolated words; also, dependence on the adjoining vowel seems to be much smaller in the latter case. Consideration of the intervocalic glides did not yield any extra information, so presentation will be restricted to Fig. 12, which shows the average target positions of the intervocalic glides as a function of their vowel environment.

Figure 11. Average glide "target" positions as a function of speaker, (a) and (c), and of adjacent vowel, (b) and (d): (a) glides per speaker, text words, before normalization; (b) glides per adjacent vowel, text words, after normalization; (c) glides per speaker, isolated words, before normalization; (d) glides per adjacent vowel, isolated words, after normalization.

Discussion

O'Connor, Gerstman, Liberman, Delattre & Cooper (1957) showed that F2 is the primary cue for distinguishing /w, j, l/ in initial position: /w/ requires an F2 of around 700 Hz, /l/ one of around 1100 Hz, and /j/ one of about 2300 Hz. This seems to agree very well with
our measurements of isolated words [see Fig. 11(c)], if we assume that our first dimension is very similar to an inverted log F2. The separate position of final /l/ in all panels of Fig. 11 is due to the fact that in Dutch final /l/ is more "lax" or "darker" than its initial counterpart. A description of Dutch glides can now be given in terms of the general area of each glide. The glide itself consists of a fairly rapid movement (see the examples in Fig. 10) from that glide area to a vowel, or the other way around. In most cases, particularly with /l/, there were a few samples that could be regarded as the steady-state portion of the glide, but this portion was much shorter than it usually is with vowels; in text words there was quite often only a transition.

Figure 12. Average "target" positions of the intervocalic glides (dots: text words; crosses: isolated words). The average /i/-position coincides with the cross indicating the position of /j/ in /yjy/ from isolated words.

General Discussion

The main results of this article are to be found in Fig. 4, where the locus positions in the vowel space are shown for each of the initial and final consonants used. Wherever the loci based on all voiced samples of a transition ("new" loci) differ from those based only on vowel samples ("old" loci), this is usually because the transition in question changes course somewhere along the line, so that it cannot adequately be described in terms of two points. No reliable locus positions could be established for initial /x/ and for final /t, p, x/; in the case of these three final consonants one could still speak of a general direction, but with initial /x/ even that is out of the question, which is why it has not been included in Fig. 4. The "new" locus for initial /d/ is the locus of the vocal murmur preceding that consonant, which we expect to be the same for every initial voiced plosive. Figure 9 compares the "old" and the "new" locus positions with the actual positions in the vowel space of the voiced consonants /n/ and /r/. It can be seen that the "new" loci
are very close to the actual consonant positions, and this increases our confidence in the other locus positions. Figure 11 shows the average positions in the vowel space of the very brief steady-state portions of the glides /w, j, l/; quite often these steady-state portions can hardly be said to exist, and the indicated position then becomes no more than the beginning or the end of the transition to or from the vowel. The precise position of the glides appears to depend very much on the spectral position of the preceding or the following vowel [see Figs 11(b) and 11(d)], although this is much clearer in the text words than in the isolated words.

Conclusion

A number of Dutch consonants, including the three glides, have been described as "locus" points in the vowel space; for some consonants two points were needed, because the transitions to or from them did not follow straight lines. Transitions to or from voiceless consonants never quite reach their loci: they leave (enter) the vowel space too soon (late). However, the voiced-consonant loci do represent the positions of those consonants in the vowel space, which was found to be almost identical to the voiced space.

References

Delattre, P., Liberman, A. M. & Cooper, F. S. (1955). Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America 27, 769-73.
Klatt, D. H. (1976). A digital filter bank for spectral matching. 1976 IEEE International Conference on ASSP, 573-6.
O'Connor, J. D., Gerstman, L. J., Liberman, A. M., Delattre, P. & Cooper, F. S. (1957). Acoustic cues for the perception of initial /w, j, r, l/ in English. Word 13, 24-43.
Pols, L. C. W. (1977). Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words. Doctoral thesis, Free University, Amsterdam.
Schouten, M. E. H. & Pols, L. C. W. (1979). Vowel segments in consonantal contexts: a spectral study of coarticulation, Part I. Journal of Phonetics 7, 1-24.