Speech Communication 13 (1993) 45-51 North-Holland
Estimation and generation of articulatory motion using neural networks

Katsuhiko Shirai
Department of Information and Computer Science, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo, 169 Japan

Received 10 May 1993; revised 20 May 1993
Abstract. In this paper, neural networks are applied to two problems concerning articulatory motion: the estimation of articulatory motion from speech waves, and the generation of articulatory movement for a sequence of phonemic symbols. In the former problem, since the estimation of articulatory parameters can be regarded as a nonlinear mapping between the acoustic parameters and the articulatory ones, a neural network is expected to be a suitable method. In the latter problem, a nonlinear control system that produces articulatory motion is successfully constructed by combining neural networks.
Keywords. Speech production; articulatory model; neural networks; articulatory dynamics.
1. Introduction
Effective modeling of the articulatory mechanism can not only help to clarify the speech production process, but also bring out essential properties of speech sounds and support the development of high-quality speech coding methods, synthesis techniques and speech recognition systems (Schröter et al., 1987, 1990). For example, coarticulation compensation and speaker adaptation can be treated most systematically in the articulatory domain, since physical and physiological constraints can be effectively organized in the model (Shirai, 1981; Shirai and Kobayashi, 1982). Several advantages are also expected in speech synthesis from the introduction of an articulatory model, compared with deriving the waveform or the acoustic parameters directly from phonemic symbols.
(1) Many factors are combined in the acoustic parameters in a very complex way, so it is difficult to derive a simple general framework for their dynamic characteristics. The articulatory parameters, on the other hand, are physical parameters corresponding to the articulatory organs, so their dynamic characteristics can be modeled by relatively simple rules.
(2) The values of the articulatory parameters represent the shape of the vocal tract directly, so the generation of turbulence or the timing of the transition to the next phoneme can be controlled autonomously on the basis of the parameter values.
However, the articulatory mechanism includes various types of nonlinearity, so suitable nonlinear components should be contained in the model. Neural networks are very useful elements for this purpose, since they can approximate an arbitrary nonlinear function and can learn complicated characteristics through training (Funahashi, 1988). In the following, two uses of neural networks for articulatory modeling are discussed: the estimation of articulatory motion from speech waves, and the generation of speech from a sequence of phonemic symbols.
Fig. 1. Configuration of the articulatory model.
Table 1
Qualitative characteristics of the articulatory parameters

Parameter   Organ            +          -
X_T1        tongue           back       front
X_T2        tongue dorsum    high       low
X_J         jaw              open       closed
X_L         lip              rounded    unrounded
X_G         glottis          open       closed
X_N         velum            open       closed
2. Articulatory model

The overall configuration of the model and the qualitative characteristics of the articulatory parameters are shown in Figure 1 and Table 1. The first five parameters (X_T1, X_T2, X_J, X_L, X_G) determine the shape of the oral and pharyngeal cavities, and the nasalization parameter X_N describes the cross-sectional area of the velopharyngeal part. Movements of the articulatory organs in normal articulation are constrained physiologically and phonologically, and it is desirable to introduce these constraints into the articulatory model. For that purpose, the model employed in this study is constructed through statistical analysis of real data.
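For concreteness, the parameter set of Table 1 can be held in a small data structure such as the following sketch; the Python field names and method are illustrative assumptions, not part of the model itself.

```python
from dataclasses import dataclass

@dataclass
class ArticulatoryState:
    """Six articulatory parameters following Table 1 (sign convention: + / -)."""
    x_t1: float  # tongue:        back / front
    x_t2: float  # tongue dorsum: high / low
    x_j:  float  # jaw:           open / closed
    x_l:  float  # lip:           rounded / unrounded
    x_g:  float  # glottis:       open / closed
    x_n:  float  # velum:         open / closed (nasalization)

    def vocal_tract_shape(self):
        # The first five parameters determine the oral and pharyngeal shape;
        # x_n describes the cross-sectional area of the velopharyngeal part.
        return (self.x_t1, self.x_t2, self.x_j, self.x_l, self.x_g)
```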
3. Estimation of articulatory motion
3.1. Model matching method

The acoustic features of the speech waves generated by the model can be expressed as a nonlinear function of the articulatory parameters, so the estimation problem amounts to inverting a nonlinear function. The conventional technique for this problem is a nonlinear optimization method called the model matching method (MM). In this method, the model parameters are iteratively changed to optimize the following criterion function:
J(x) = (y - h(x))^T P (y - h(x)) + x^T Q x + (x - x_0)^T R (x - x_0),

where y denotes the acoustic parameters measured from the speech waves, x denotes the articulatory parameters, and h(x) is the nonlinear function that calculates the acoustic parameters from the articulatory parameters. P, Q and R are weight matrices, and x_0 is the estimate at the previous frame. The problem is solved by a hill-climbing method; details of the algorithm are described in (Shirai and Honda, 1976, 1978; Shirai and Kobayashi, 1986).
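As a minimal illustration of this criterion, the following sketch minimizes J(x) by a simple coordinate-wise hill climb. The forward mapping h() is assumed to be given (in the model it is the articulatory-to-acoustic computation), and the paper's actual algorithm (Shirai and Honda, 1976, 1978) is more elaborate than this toy search.

```python
import numpy as np

def model_matching(y, x0, h, P, Q, R, step=0.05, iters=200):
    """Estimate articulatory parameters x from acoustic parameters y by
    minimizing J(x) = (y-h(x))' P (y-h(x)) + x' Q x + (x-x0)' R (x-x0)."""
    def J(x):
        r = y - h(x)          # acoustic residual
        d = x - x0            # deviation from the previous frame's estimate
        return r @ P @ r + x @ Q @ x + d @ R @ d

    x = x0.copy()
    for _ in range(iters):
        improved = False
        for i in range(len(x)):            # try a move along each coordinate
            for delta in (step, -step):
                trial = x.copy()
                trial[i] += delta
                if J(trial) < J(x):
                    x, improved = trial, True
        if not improved:
            step *= 0.5                    # refine once no single move helps
            if step < 1e-4:
                break
    return x
```

The Q and R terms regularize the search: Q penalizes extreme articulatory configurations, and R penalizes jumps away from the previous frame's estimate, which keeps the estimated motion smooth across frames.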
3.2. Neural network based approach

In our experiment, a four-layer feedforward network is adopted. The network consists of a 12-node input layer (for 12th-order cepstral coefficients), two 24-node hidden layers and a 4-node output layer (for four articulatory parameters: two tongue parameters, a jaw parameter and a lip parameter). The weight coefficients are determined as follows. First, vowel data are selected from the training dataset. Then, using MM, articulatory parameters are estimated for all frames of these data (including transient parts). Twelfth-order LPC cepstral coefficients are calculated for the same data, and cepstrum-articulatory parameter pairs are prepared. Finally, the weights of the network are determined by applying backpropagation to these data pairs. To estimate articulatory parameters from speech waves, cepstral coefficients are calculated by LPC analysis and input to the neural network (Shirai and Kobayashi, 1991).
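A minimal numpy sketch of such a network is given below: 12 inputs, two hidden layers of 24 sigmoid units, and 4 linear outputs, trained by stochastic backpropagation on squared error. The activation function, learning rate and initialization are assumptions for illustration, since the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [12, 24, 24, 4]        # cepstrum -> hidden -> hidden -> articulatory
Ws = [rng.normal(0.0, 0.3, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x):
    """Activations of every layer: sigmoid hidden units, linear output."""
    acts = [x]
    for i, (W, b) in enumerate(zip(Ws, bs)):
        z = acts[-1] @ W + b
        acts.append(z if i == len(Ws) - 1 else sigmoid(z))
    return acts

def backprop_step(x, t, lr=0.01):
    """One gradient step on a (cepstrum, articulatory-parameter) pair."""
    acts = forward(x)
    delta = acts[-1] - t        # squared-error gradient at the linear output
    for i in reversed(range(len(Ws))):
        grad_W = np.outer(acts[i], delta)
        prev_delta = (delta @ Ws[i].T) * acts[i] * (1.0 - acts[i]) if i > 0 else None
        Ws[i] -= lr * grad_W
        bs[i] -= lr * delta
        delta = prev_delta
```

Training simply repeats backprop_step over the cepstrum-articulatory pairs prepared with MM; at run time, forward(cepstrum)[-1] gives the four estimated articulatory parameters.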
3.3. Experiment of articulatory parameter estimation

The evaluation test was performed using the vowel data in 5200 tokens of the ATR word database. The average difference between the articulatory parameter values estimated by MM and by NN was only 3.3% of the value range, showing that the neural network estimates articulatory parameters well. Figure 2 shows the estimated articulatory parameters and spectra for the utterance /niou/; the solid line denotes the parameters obtained by MM and the dotted line those obtained by NN. The articulatory parameters estimated with NN are almost equal to those estimated with MM except in the /i/ part, where a large difference can be seen between the two methods. The contour of the spectra (formant structure, etc.) obtained by NN is closer to the real one than that obtained by MM. The parameters (X_T1, X_T2, X_J, X_L) should be (front, high, closed, unrounded) for the /i/ sound; the articulatory parameters estimated by NN satisfy this condition but those estimated by MM do not, so the NN estimates can be regarded as more appropriate for /i/. Since MM is based on a hill-climbing method, it sometimes finds only a local minimum and makes serious errors.
Fig. 2. Movements of articulatory parameters and spectra obtained by the model matching method (solid line) and the neural network (dotted line) for /niou/.
Table 2
Relative frequency of distance d from the average estimate

Method    Relative frequency [%]
          16 <= d    26 <= d    36 <= d
MM        1.78       0.71       0.55
NN        1.44       0.60       0.26

d: Mahalanobis distance.
The NN method, on the other hand, performs a global mapping, because the NN is a kind of associative memory, so there is little risk of serious errors. Moreover, the strong constraints among the articulatory parameters are embedded in the network structure during training, so unnatural combinations of parameters are automatically excluded. As this example shows, large differences between MM and NN arise for a few data, but this does not imply mis-estimation by NN. Table 2 shows the distribution of the estimated articulatory parameters; the values in the table are the relative frequencies of the distances from the average estimate of each vowel. The distribution of the articulatory parameters obtained by NN is more compact than that obtained by MM. Since data far from the average estimate can be considered misestimates, this result suggests the stability of NN in the articulatory parameter estimation problem. In particular, the percentage of estimates whose Mahalanobis distance from the average estimate of each phoneme is greater than 36 is, for NN, less than half of that for MM. Table 3 compares the calculation times: NN is more than twice as fast as MM.
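The relative frequencies of Table 2 can be computed as in the following sketch; the per-vowel mean and covariance are taken over all estimates of that vowel, and the thresholding convention is an assumption based on the discussion above.

```python
import numpy as np

def outlier_rate(estimates, k):
    """Fraction of estimated parameter vectors for one vowel whose
    Mahalanobis distance from the vowel's average estimate exceeds k.
    `estimates` has shape (n_frames, n_params)."""
    mu = estimates.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(estimates, rowvar=False))
    diffs = estimates - mu
    d = np.sqrt(np.einsum('ij,jk,ik->i', diffs, inv_cov, diffs))
    return float(np.mean(d > k))
```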
Table 3
Calculation speed (CPU time on a Sun3/60)

Data          CPU time [sec]         Ratio MM/NN
              MM         NN
/aimai/       33.3       13.0        2.57
/shokuiN/     53.7       18.1        2.97
/jiQci/       40.3       16.6        2.43

4. Generation of articulatory motion

4.1. Model of transfer function from command to articulatory motion

In this section, the reverse process of the estimation is discussed, using the relation between the acoustic and phonemic events and the gestures of the articulatory organs estimated in the previous section. First, the nonlinear property of articulatory dynamics is shown. We introduce a linear second-order system to model the dynamics of articulatory motion, as shown in Figure 3. The model has two components: (a) a command, the input of a phonemic unit, regarded as a step function corresponding to a target in articulatory space; and (b) a transfer function G(z), which produces the response movement of the articulatory parameters corresponding to the command sequence. G(z) expresses the overall transfer characteristics of the articulatory dynamics.

Fig. 3. The articulatory control model.

The following second-order system is considered:

X(z) = [b z^-1 / (1 + a_1 z^-1 + a_2 z^-2)] U(z),    b = 1 + a_1 + a_2,

where X(z) is the z-transform of an articulatory parameter x(n), U(z) is the z-transform of a command u(n), and a_1 and a_2 are the parameters of the dynamic characteristics. (The choice b = 1 + a_1 + a_2 makes the gain unity at z = 1, so the output settles at the commanded target value.) The parameters a_1 and a_2 are assumed to remain constant during each transition and are estimated by applying the model to various kinds of data.
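The behaviour of this system is easy to reproduce: rewriting the transfer function as a difference equation gives x(n) = b u(n-1) - a_1 x(n-1) - a_2 x(n-2). The sketch below simulates the response to a step command; the parameter values are chosen for illustration only.

```python
import numpy as np

def second_order_response(u, a1, a2):
    """x(n) = b*u(n-1) - a1*x(n-1) - a2*x(n-2), with b = 1 + a1 + a2,
    i.e. the transfer function b z^-1 / (1 + a1 z^-1 + a2 z^-2)."""
    b = 1.0 + a1 + a2          # unity gain at z = 1: x settles at the target
    x = np.zeros(len(u))
    for n in range(len(u)):
        x[n] = (b * (u[n - 1] if n >= 1 else 0.0)
                - a1 * (x[n - 1] if n >= 1 else 0.0)
                - a2 * (x[n - 2] if n >= 2 else 0.0))
    return x

# Step command toward a target of 1.0 (one phonemic transition).
u = np.ones(50)
x = second_order_response(u, a1=-1.0, a2=0.25)   # a2 = a1**2/4: critically damped
```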
The parameters (a_1, a_2) are estimated from 80 utterances of two-vowel sequences, covering all kinds of transition between two vowels, independently for each articulatory parameter. The distributions of (a_1, a_2) for the tongue parameter X_T1 and for the jaw parameter X_J are shown in Figures 4(a) and 4(b), respectively. These figures show that the dynamics cannot be regarded as constant when the transfer function is modeled by this linear expression. However, the estimates are found to be distributed along a straight line lying near the curve of the critically damped second-order system (Shirai and Kobayashi, 1986). Furthermore, in another experiment using VCV (vowel-consonant-vowel) data, almost the same distribution was obtained. The system is therefore nonlinear, and it is appropriate to model it with a network containing nonlinear elements.

Fig. 4. Distribution of the dynamics parameters (a_1, a_2) obtained from data of the /VV/ type: (a) tongue parameter X_T1; (b) jaw parameter X_J. The curve indicates the relation between a_1 and a_2 in the case of critical damping. SM, KS and AS are subject identifiers.
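For a single transition with known step command u(n) and observed trajectory x(n), the dynamics parameters can be fitted by linear least squares, since the difference equation is linear in (a_1, a_2); the paper does not spell out its estimation procedure, so the following sketch is an assumption. Rearranging x(n) = (1 + a_1 + a_2) u(n-1) - a_1 x(n-1) - a_2 x(n-2) gives x(n) - u(n-1) = a_1 (u(n-1) - x(n-1)) + a_2 (u(n-1) - x(n-2)).

```python
import numpy as np

def fit_dynamics(x, u):
    """Least-squares estimate of (a1, a2) from one observed articulatory
    trajectory x and its step command u (1-D arrays of equal length)."""
    A = np.column_stack([u[1:-1] - x[1:-1],   # coefficient of a1
                         u[1:-1] - x[:-2]])   # coefficient of a2
    y = x[2:] - u[1:-1]
    (a1, a2), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a1, a2
```

Fitted pairs can then be compared with the critical-damping curve a_2 = a_1^2/4 shown in Figure 4.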
4.2. Model using neural networks

The block diagram of the speech generation system is shown in Figure 5. The system receives a sequence of phoneme symbols and outputs the speech signal. The main part, indicated by the broken line, is modeled using neural networks.

Fig. 5. Diagram for speech production (phoneme sequence, control, conversion to articulatory state, speech output, sensory feedback).

The articulatory control system is composed of two neural networks, as shown in Figure 6. The first network produces the target values of the articulatory movements from the sequence of phonemic symbols, and the second is a recurrent network used to model the dynamic characteristics of articulatory motion. In general, the system should also receive various sensory feedback signals from the subsequent stages of the speech production process.

Fig. 6. Structure of the neural network (inputs: preceding, present and succeeding phoneme categories; outputs: target values of the articulatory parameters, passed to the articulatory dynamics network).
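A sketch of how the two networks might cooperate is given below. The interfaces (a target network taking three phoneme categories, a recurrent dynamics network advanced one frame at a time, and the frames-per-phoneme constant) are assumptions for illustration; both networks are assumed to be already trained.

```python
import numpy as np

def generate_trajectory(phonemes, target_net, dynamics_net, frames_per_phone=10):
    """Phoneme sequence -> stepwise targets -> smooth articulatory motion."""
    targets = []
    for i, p in enumerate(phonemes):
        prev_p = phonemes[i - 1] if i > 0 else p
        next_p = phonemes[i + 1] if i + 1 < len(phonemes) else p
        tgt = target_net(prev_p, p, next_p)        # context-dependent target vector
        targets.extend([tgt] * frames_per_phone)   # hold the target as a step command

    x = np.zeros_like(targets[0])
    state = None                                   # recurrent network's hidden state
    trajectory = []
    for tgt in targets:
        x, state = dynamics_net(tgt, x, state)     # one frame of articulatory motion
        trajectory.append(x)
    return np.array(trajectory)
```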
4.3. Target generation network

It is assumed in this model that the articulatory motion is driven by consecutive step functions which give the target values of the articulatory state. A target value is a fictitious positional command that can eventually generate articulatory motion sufficient for the production of a given phoneme sequence. It is important that each target value depends strongly on the phonemic context. Therefore, the input of the network contains the phoneme categories of the preceding and succeeding phonemes as well as the present one. The effect of taking the phonemic context into account is shown in Figure 7: the error of the target value is especially large for some parameters if only the category of the present phoneme is given to the network.

Fig. 7. Error of the target value calculated by the target generation network, with and without phoneme context, for each articulatory parameter.
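One plausible input coding for the target network is a concatenation of one-hot vectors for the three context positions, as sketched below; the paper states only that the three phoneme categories are given as input, so the coding itself (and the illustrative inventory) is an assumption.

```python
import numpy as np

PHONEMES = ['a', 'i', 'u', 'e', 'o']   # illustrative inventory

def context_input(prev_p, present_p, next_p):
    """Concatenated one-hot coding of (preceding, present, succeeding) phonemes."""
    def one_hot(p):
        v = np.zeros(len(PHONEMES))
        v[PHONEMES.index(p)] = 1.0
        return v
    return np.concatenate([one_hot(prev_p), one_hot(present_p), one_hot(next_p)])
```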
4.4. Network generating the articulatory dynamics

This network is used to model the nonlinear dynamics of articulatory motion. In this experiment, sequences of two vowels are used for training, and the results are evaluated by comparing the root mean square errors at the transitions. Figures 8 and 9 show the motions generated by the second-order linear system and by the neural network, and Figure 10 shows the average error. It is clear that the neural network model is more suitable for representing the dynamic characteristics of articulatory motion.

Fig. 8. Articulatory movement produced by the second-order linear system (actual vs. generated values).

Fig. 9. Articulatory movement produced by the neural network (actual vs. generated values).

Fig. 10. Error in the transient part of the articulatory movement for the linear system and the NN.
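The evaluation measure can be computed as in the sketch below: the root mean square error between the generated and the actual trajectories, restricted to a window around the transition. The window width is an assumption; the paper does not state it.

```python
import numpy as np

def transition_rmse(generated, actual, onset, width=10):
    """RMS error over the transient part of a trajectory: `width` frames
    on either side of the transition onset (1-D arrays, per parameter)."""
    lo, hi = max(onset - width, 0), onset + width
    diff = generated[lo:hi] - actual[lo:hi]
    return float(np.sqrt(np.mean(diff ** 2)))
```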
5. Conclusion

Neural networks have been applied not only to articulatory parameter estimation but also to the generation of articulatory motion. The neural network has several advantages for estimation: the computational cost is rather small, and the number of misestimates is reduced compared with the nonlinear optimization method. For generation, the nonlinear characteristics inherent in the articulatory dynamics can be effectively modeled using a neural network, and the articulatory movements produced by the model for vowel sequences are quite similar to real ones.

References

G. Bailly, C. Abry, L.J. Boé, R. Laboissière, P. Perrier and J.L. Schwartz (1992), "Inversion and speech recognition", Signal Processing VI, pp. 159-164.
K. Funahashi (1988), "On the capability of neural networks", Proc. ATR Workshop on Neural Networks and Parallel Distributed Processing.
P. Jospa, A. Soquet and M. Saerens (1992), "Acoustical sensitivity functions and the control of a vocal tract model", Signal Processing VI, pp. 171-174.
J. Schröter, J.N. Larar and M.M. Sondhi (1987), "Speech parameter estimation using a vocal tract/cord model", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 87, pp. 308-311.
J. Schröter, P. Meyer and S. Parthasarathy (1990), "Evaluation of improved articulatory codebooks and codebook access distance measures", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 90, pp. 393-396.
K. Shirai (1981), "Vowel identification in continuous speech using articulatory parameters", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 81, pp. 1172-1175.
K. Shirai and M. Honda (1976), "An articulatory model and the estimation of articulatory parameters by nonlinear regression model", Trans. IECE Japan, Vol. J59-A, No. 8, pp. 668-674.
K. Shirai and M. Honda (1978), "Estimation of articulatory parameters from speech waves", Trans. IECE Japan, Vol. J61-A, No. 5, pp. 409-416.
K. Shirai and T. Kobayashi (1982), "Recognition of semivowels and consonants in continuous speech using articulatory parameters", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 82, pp. 2004-2007.
K. Shirai and T. Kobayashi (1986), "Estimating articulatory motion from speech wave", Speech Communication, Vol. 5, No. 2, pp. 159-170.
K. Shirai and T. Kobayashi (1991), "Estimation of articulatory motion using neural networks", J. Phonetics, Vol. 19, pp. 379-385.