Journal of Phonetics (1991) 19, 465-471
© 1991 Academic Press Limited

Coarticulation rules in an articulatory model

Georg Heike, Reinhold Greisbach and Bernd J. Kröger
Institut für Phonetik, Universität Köln, Greinstr. 2, 5 Köln 41, Germany

Received 12th August 1990, and in revised form 7th December 1990
Coarticulation rules are of primary importance for good speech quality in an articulation-based speech synthesis system. This is demonstrated by a simple perception experiment. Auditory feedback strategies combined with general phonetic knowledge about the relation between articulation and acoustics provide a reliable basis for the generation of such coarticulation rules.
1. Introduction

A text-to-speech system which uses an articulatory model for signal generation has to transform a discrete string of symbols into a continuous stream of articulatory control parameters. The efficiency of this transformation is of primary importance to guarantee both a realistic simulation of the articulatory movements and acoustic sound generation and a good quality of the resulting synthetic speech. Real articulatory movements of human speakers, whether observed by articulatory measurements (e.g. by X-ray or electromagnetic transduction procedures) or reanalysed from the acoustic speech signal (e.g. Heike, Greisbach, Hilger & Kröger, 1989), exhibit very complex traces when parameterized into articulatory parameters. Articulatory and acoustic reality may therefore contradict the simplicity of the transformation rules needed in a text-to-speech system, particularly in the extent to which coarticulation is taken into account. This paper examines this problem in detail and presents an example of how it is solved in the German-language text-to-speech system KOLLE (KÖLner vorLEsesystem), which succeeds the LISA system (Heike & Philipp, 1985).
2. The articulatory model

If fed with a string of orthographic symbols, the KOLLE synthesis system produces a film of the articulatory movement and a synthetic speech signal. The system incorporates an articulatory model for speech synthesis, which aims at a realistic articulatory and acoustic simulation, and consists of three modules: a transformation module, an articulatory module, and an acoustic module (see Fig. 1).

Input to the articulatory module is a vector of articulatory parameters. These parameters define the positions of the articulatory organs in a midsagittal plane. At present, nine parameters are used to control the position of the tongue, the lips, the lower jaw and the velum, and the height of the larynx. These parameters not only define the positions of the articulatory organs (the actual configuration is generated
[Figure 1 here: block diagram in which an orthographic-to-phonetic module feeds the articulatory model, whose outputs are a film of the articulatory movement and the synthetic speech signal.]

Figure 1. The KOLLE text-to-video/speech system.
by interpolation between stored geometrical patterns, which correspond to the extreme values of the parameters), but also the boundary of the vocal tract, so that the area function of the vocal tract can be derived via the calculation of the midsagittal distance. For a detailed description of the parameters and of the procedures to generate a picture of the vocal tract and the area function, see Greisbach (1986). Three more physiologically defined parameters are used to control a two-mass model of the vocal cords (cord tension, subglottal pressure and phonation neutral area). This model also allows for preadjustment of the length of the glottis and the coupling stiffness (for details see Kröger, 1990).

The input to the acoustic module is the area function of the vocal tract. After transformation of this function into an equidistant step function (0.875 cm) the propagation of sound within the vocal tract is modelled by computing the forward and backward travelling sound pressure waves. Various forms of losses are simulated: series losses (laminar flow losses), shunt losses (wall vibration, losses from wall interaction), glottal losses and losses due to lip radiation. A nasal tract can be coupled; most of its sections (except for the first one behind the coupling tube) have fixed cross-sectional areas. Turbulence noise is generated if the Reynolds number exceeds a critical value inside the vocal tract. In this case turbulence noise is inserted by a volume-velocity source one section downstream. The kinetic pressure drop is modelled at the junction before the tube of smallest diameter.

The input to the transformation module is a string of phonetic symbols derived from the original orthographic input string. Each symbol is assigned an articulatory target position, which may be regarded as a snapshot of the articulatory motion to be synthesized. These articulatory target positions are initially defined by a set of articulatory parameters.
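Two numerical ingredients of the acoustic module described above, the resampling of the area function into equidistant 0.875 cm sections and the Reynolds-number criterion for turbulence onset, can be sketched as follows. This is an illustrative sketch, not the system's implementation; the function names, the critical Reynolds number of 1800 and the kinematic viscosity value are our assumptions.

```python
import numpy as np

def resample_area_function(x, area, step=0.875):
    """Resample a vocal-tract area function (x in cm, area in cm^2)
    onto an equidistant grid of sections of the given length."""
    n_sections = int(round(x[-1] / step))
    centres = (np.arange(n_sections) + 0.5) * step  # section midpoints
    return np.interp(centres, x, area)

def turbulence_onset(volume_velocity, area, nu=0.15, re_crit=1800.0):
    """Flag the sections in which Re = u * d / nu exceeds a critical
    value, i.e. where a noise source would be inserted one section
    downstream.  u is the particle velocity, d the equivalent circular
    diameter, nu the kinematic viscosity of air in cm^2/s.  The value
    re_crit = 1800 is a common textbook figure, not from the paper."""
    d = 2.0 * np.sqrt(area / np.pi)  # equivalent diameter (cm)
    u = volume_velocity / area       # particle velocity (cm/s)
    return (u * d / nu) > re_crit
```

A narrow constriction raises the particle velocity and hence the Reynolds number, which is why noise generation concentrates at the tube of smallest diameter.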
In the case of consonants the "non-distinctive" parameters are left free. These free slots are filled depending on the neighbouring symbols in
TABLE I. Orthographic conversion and transformation (first step) for the German word Nase. The orthographic symbol string is rewritten as a phonetic string, which includes symbols specifying accentuation and timing. The values of some of the articulatory parameters before and after the transformation are listed

Orthographic:               N      a      s      e
Phonetic:                   n      a:     z      ə

Before transformation
P1  VE  velum height        100    0      0      0
P5  ZH  tongue height       *      -80    *      0
P6  ZP  tongue position     *      0      *      20
P7  ZSH tongue tip position 100    0      100    0
P10 Ps  subglottal pressure -      -      -      -

After transformation
P1  VE  velum height        100    0      0      0
P5  ZH  tongue height       0      -80    0      0
P6  ZP  tongue position     0      0      10     20
P7  ZSH tongue tip position 100    0      100    0
P10 Ps  subglottal pressure 80     90     80     70

* Free slot.
the string, on physiological and dynamic restrictions of possible articulator movements, and on language-specific restrictions. This first step of the transformation also includes the setting of the phonatory parameters using rules of accentuation and intonation (Table I).

To derive an articulatory movement from these target positions, they must be arranged in time and the articulatory parameters have to be traced from one target position to the next, so that an articulatory configuration is defined for every time instant (in practice every 6.4 ms). Linear interpolation between one target position and the next would be the simplest form of tracing the articulatory parameters, but in the model the traces of the parameters may exhibit nearly any form of transition, such as step functions or very fast transitions, as well as non-simultaneous parameter changes. Figure 2 shows traces of the articulatory parameters for the synthesis of the German word Nase.
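The tracing step can be illustrated with a small sketch (our own illustration, not the system's code): a parameter is sampled every 6.4 ms between successive targets, and a speed factor above 1 crudely mimics the model's faster-than-linear transitions by completing the movement early in the interval.

```python
import numpy as np

FRAME_MS = 6.4  # control-rate interval used in the model

def trace_parameter(targets, times_ms, speed=1.0):
    """Trace one articulatory parameter from target to target.
    `targets` holds the parameter values at the instants `times_ms`.
    speed = 1 gives plain linear interpolation; speed > 1 reaches the
    next target after 1/speed of the interval and then holds it."""
    frames = np.arange(0.0, times_ms[-1] + FRAME_MS, FRAME_MS)
    out = np.empty_like(frames)
    for i, t in enumerate(frames):
        # index of the target interval containing time t
        k = np.searchsorted(times_ms, t, side="right") - 1
        k = min(k, len(times_ms) - 2)
        t0, t1 = times_ms[k], times_ms[k + 1]
        frac = min(1.0, speed * (t - t0) / (t1 - t0))
        out[i] = targets[k] + frac * (targets[k + 1] - targets[k])
    return frames, out
```

With a very large speed factor the trace degenerates into a step function, one of the transition shapes the model explicitly allows.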
3. Modelling of coarticulation

Interpreted on a more theoretical linguistic level, the transformation module obviously performs coarticulation between the input segments. The transformation module incorporates two stages, in both of which information is spread from one segment (or, to be more precise, from one articulatory target position) to its environment. In the first stage, where the free parameters of the consonants are set with respect to the surrounding symbol chain, this spreading works on discrete entities, and might therefore be called "static coarticulation". The second stage of the transformation might then be called "dynamic coarticulation", because it controls the dynamics of the articulation process by continuously connecting two neighbouring articulatory targets.
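The first stage can be sketched as follows. This is a hypothetical illustration using the parameter names of Table I; the averaging rule for filling a free slot from its neighbours is our own assumption, standing in for the system's actual language-specific rules.

```python
# "Static coarticulation" sketch: the free ("non-distinctive")
# parameter slots of a consonant target are filled from the
# neighbouring targets before any interpolation in time takes place.

FREE = None  # marker for a free slot

def fill_free_slots(targets):
    """targets: list of dicts mapping parameter name -> value or FREE.
    Each free slot is replaced by the mean of the nearest specified
    values on either side (falling back to one side at the edges)."""
    filled = [dict(t) for t in targets]
    for i, t in enumerate(filled):
        for p, v in t.items():
            if v is not FREE:
                continue
            left = next((targets[j][p] for j in range(i - 1, -1, -1)
                         if targets[j].get(p) is not FREE), None)
            right = next((targets[j][p] for j in range(i + 1, len(targets))
                          if targets[j].get(p) is not FREE), None)
            vals = [x for x in (left, right) if x is not None]
            t[p] = sum(vals) / len(vals) if vals else 0.0
    return filled
```

For example, with a tongue-position value of 0 on the left and 20 on the right, a consonant's free ZP slot is filled with the intermediate value 10.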
[Figure 2 here: three panels over a common time axis (0-400 ms), with segment boundaries marked for n, a:, z, ə.]

Figure 2. Transformation (second step) of Nase. (a) Synthetic oscillogram. (b) Spectrogram. (c) Traces of selected parameters.
In the development of a rule-based text-to-speech system one is faced with the fact that quite a lot is known about "static coarticulation" phenomena (e.g. on the basis of general phonetic and phonological knowledge), whereas very little is known about "dynamic coarticulation" (except for the very rare X-ray or magnetic measurement data that have been published). For the German-language text-to-speech system KOLLE, therefore, mainly auditory feedback strategies are used to generate rules. Using auditory control in this case means an improvement over acoustic control (e.g. by use of spectrograms), as it is much faster and in some respects much sharper, too. This auditory control must focus on segmental features, such as the quality of the intended segments (especially the consonants) and the correct number of segments (i.e. no additional or missing sounds), as well as on suprasegmental features such as accentuation, intonation and timing.

4. Perception tests

To demonstrate some of the effects of variation in the articulatory parameter domain quantitatively, two hearing tests were performed with a total of 12 native
TABLE II. Result of the first perception test. High values indicate good acceptability. The corresponding values of parameter ZH (tongue height) for the preceding targets [a] and [i] are -80 and 80, respectively

ZH            Nase    niese
80 (high)       2       1
40             17      21
0              25      28
-40            21      21
-80 (low)      23      15
listeners of German. In both tests, synthetic stimuli of the German minimal pair Nase [na:zə] "nose" and niese [ni:zə] "to sneeze" were presented. The vowels [a:] and [i:] are extremes in terms of tongue (and jaw) height and, consequently, in terms of the influence of this parameter on the succeeding consonant [z].

In the first test, the parameter of the free articulatory slot, tongue (and jaw) height, in the specification of the sound [z] was varied. The listeners were asked to judge the [z]-like quality on a five-step rating scale. The test (Table II) shows that setting the tongue height parameter too high for [z] (e.g. equal to the value for [i]) will lead to unintended auditory impressions.

In the second test, the transition speeds of the parameters tongue (and jaw) height and tongue tip height were varied in the transition phase from the vowel to the [z], as shown in Fig. 3. The corresponding synthetic stimuli were presented in an AB-test arrangement for judging the relative quality as [na:zə] or [ni:zə]. Each stimulus pair was rated on a five-step scale ranging from +2 (for "A very much better
[Figure 3 here: schematic time courses of tongue tip height (up/down) and tongue (and jaw) height (high/low, scale down to -80) for the transitions [a:]-[z] and [i:]-[z].]

Figure 3. Variation of the articulatory parameters as functions of time for the second perception test. Transition from vowel to [z]: dashed, very fast; dotted, fast; solid, slow.
TABLE III. Results of the second perception test. Slow, fast and very fast indicate the transition speed from vowel ([i] or [a]) to [z] of the parameter tongue (and jaw) height. Positive values indicate that stimulus A was classified as better than stimulus B, negative values the opposite. Theoretically the values in the diagonals ought to be zero; their actual deviation from zero represents the intrinsic error level of the test

Nase                          B
A             Slow    Fast    Very fast
Slow            -9      -4       -8
Fast            -2     -16        4
Very fast        3      -5       -6

Niese                         B
A             Slow    Fast    Very fast
Slow             8      25       30
Fast           -15       2        7
Very fast      -15     -14       -2
than B"), through 0 (for "A equal to B"), to -2 (for "A very much worse than B"). The results show no significant effect of the variation in the speed of the tongue tip height parameter, so Table III gives only the ratings for the changing speed of the tongue (and jaw) height parameter. This result expresses numerically the perceived impression that, just as with the tongue tip parameter, varying the speed of the tongue height parameter in the transition from [a] to [z] has no significant influence on the perception of the word Nase (the off-diagonal absolute values in the left part of Table III are in most cases smaller than the absolute values in the diagonal). But extrapolation of this "rule" to the synthesis of the word niese is not possible; in the slow (and even in the fast) case a diphthongization of the [i] (to [i:ə]) will result, the perception of which is clearly reflected in the right part of Table III, where the off-diagonal absolute values are significantly higher than the diagonal ones.
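This way of reading the AB matrices, comparing diagonal magnitudes (intrinsic error level) with off-diagonal magnitudes (real perceptual effect), can be expressed as a small sketch. The function and comparison logic are our own illustration; the matrices transcribe the summed ratings of Table III.

```python
import numpy as np

def ab_matrix_effect(m):
    """m: square matrix of summed AB ratings (rows: stimulus A,
    columns: stimulus B).  Returns the mean absolute diagonal value
    (intrinsic error level) and the mean absolute off-diagonal value
    (effect size)."""
    m = np.asarray(m, dtype=float)
    diag = np.abs(np.diag(m))
    off = np.abs(m[~np.eye(len(m), dtype=bool)])
    return diag.mean(), off.mean()

# Values transcribed from Table III (rows/columns: slow, fast, very fast)
nase  = [[-9, -4, -8], [-2, -16, 4], [3, -5, -6]]
niese = [[8, 25, 30], [-15, 2, 7], [-15, -14, -2]]
```

Applied to the two matrices, the off-diagonal mean stays below the diagonal mean for Nase but clearly exceeds it for niese, matching the interpretation in the text.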
5. Conclusion

The modelling of coarticulation is an inherent feature of a text-to-speech system based on an articulatory model. In the KOLLE system two aspects of this process can be distinguished, corresponding to the two stages of the transformation module. Both aspects may raise problems in the course of rule generation, as the perception experiments exemplify. "Static coarticulation" rules have to be modelled to guarantee proper sound quality, whereas "dynamic coarticulation" rules are necessary to exclude, for example, intrusive elements (which may lead to diphthongizations of vowels) in the perceived signal.
Certainly the problems concerning the "static coarticulation" phenomena are of minor importance, because they can generally be avoided on the basis of known facts about articulation. As long as information on articulatory dynamics is not easily obtainable, and probably even then, auditory feedback strategies are an important means of establishing rules for "dynamic coarticulation" in a text-to-speech system.

References

Greisbach, R. (1986) M4: Ein Produktionsmodell mit einer allgemeinphonetischen Steuerung [M4: a production model with general phonetic control], Institut für Phonetik der Universität zu Köln: IPKöln-Berichte, 13, 41-58.

Heike, G. & Philipp, J. (1985) LISA: Ein Verfahren zur artikulatorischen Sprachsynthese [LISA: a procedure for articulatory speech synthesis]. In Sprachsynthese (B. Müller, editor), pp. 39-53. Hildesheim.

Heike, G., Greisbach, R., Hilger, S. & Kröger, B. J. (1989) Speech synthesis by acoustic control. In Eurospeech '89: Proceedings of the European Conference on Speech Communication and Technology, Paris, September 1989, Vol. 2 (J. P. Tubach & J. J. Mariani, editors), pp. 20-22. Edinburgh: CEP Consultants.

Kröger, B. J. (1990) Three glottal models with different degrees of glottal source-vocal tract interaction, Institut für Phonetik der Universität zu Köln: IPKöln-Berichte, 16, 43-58.