Philips J. Res. 49 (1995) 367-379
A STATISTICAL APPROACH TO MULTILINGUAL PHONETIC TRANSCRIPTION

by STEFAN BESLING

Philips GmbH Forschungslaboratorien Aachen, Postfach 1980, D-52021 Aachen, Germany
Abstract

In this paper we present a statistical method for generating phonetic transcriptions from written words. It uses Bayes' decision rule to find the most likely phonetic transcription. The method is illustrated by applying it to English, French and German. For these three languages it produces transcriptions that differ from the correct ones in at most two phonemes for more than 97% of all words. An advantage of the statistical approach lies in the fact that phonotactical knowledge is automatically learned from background lexica and does not have to be explicitly coded. Thus, the system is basically language independent.

Keywords: phonetic transcription; statistics; Bayes' decision rule; multilinguality.
1. Introduction
Most continuous-speech recognition systems are based on phonemes and thus need phonetic transcriptions for all words in the recognition vocabulary. As the vocabulary becomes rather large (e.g. 64000 words in the NAB task), this poses a problem since a considerable portion of words usually cannot be found in available lexica. Furthermore, in a commercial dictation system like the one presented in [1] and [2] the user(s) will constantly keep adding words, which need phonetic transcriptions too. Therefore we are in need of an automatic system that generates these transcriptions from written words, optionally using additional information like available speech data.

Different methods have been tried in automatic systems to generate phonetic transcriptions from written words. Two major approaches are rule-based programs, which generate transcriptions following certain phonotactical rules formulated by experts [3-5], and heuristical programs that use similarities between words in order to find the pronunciation [6,7].
Statistical systems [8] use an approach that is mostly based on Bayes' decision rule. Background lexica are used to estimate probabilities for the fact that a given graphemic word has a certain phonetic transcription. Thereby, the system is able to learn automatically from the expert knowledge that is implicitly present in these lexica. Optionally, spoken utterances may be used in addition to the written words; this was done e.g. in [9], [10] or [11]. A considerable advantage of this type of system lies in the fact that it is language independent, which is important for the use in commercial dictation systems like the one described in [1].

In Section 2, we will briefly describe how the training material is prepared in order to use it in our system. Section 3 explains the theoretical background of the statistical method and in Section 4 we report on the tests we performed.

2. Preparation of the training data

As far as the generation of phonetic transcriptions from graphemic words is concerned, there are significant differences between languages, stemming to a good part from the scope of variations in pronunciation that a sequence of letters has. In this respect, German is rather simple since a fixed sequence of letters is pronounced the same in almost every context. In English, however, a sequence of letters can have a large number of completely different pronunciations, as is shown by the following examples:¹

through /Tru:/    though /D@U/    tough /tVf/    bought /bO:t/

Since the system is intended to work for several different languages, it has to be able to cope with the simpler as well as with the more difficult ones. We therefore take into account longer dependences between letters than would be necessary for, say, German. The details will be explained in Section 3.

In order to make use of the transcriptions in a background lexicon, the method presented needs to know which phoneme sequence was produced by which letter sequence.
Our modelling assumes that each letter generates at least one phoneme. Hence, a matching between letters and sequences of phonemes can be obtained by using an alignment between the graphemic and the phonetic representations. After this matching has been performed, we can 'stretch' graphemes and/or phonemes in order to have grapheme and phoneme strings of the same length. Fig. 1 shows an example of this alignment, which is done by dynamic programming. We use distance measures that are based on probability distributions for the production of a phoneme by a grapheme. These are obtained by starting from a uniform distribution and performing several iterations of alignment and reestimation [12]. To give an example, in German the letter 'm' will most likely produce the phoneme 'm' but practically never the phoneme 'p'. So the distance measure will assign a small penalty to the grapheme-phoneme pair (m,m) and a large one to (m,p). The distance measures depend, of course, on the language the system is intended to work on, so they have to be prepared once for each new language. This, however, can easily be done within about an hour as it is independent of the size of the training data.

The dots in Fig. 1 indicate the optimal path obtained by dynamic programming; there are cases of 'grapheme stretching' as for the 'y' to be aligned to the phoneme sequence /aI/, as well as 'phoneme stretching' as for the first /I/ to be aligned to the grapheme sequence 'gh'. Accordingly, the entire stretched grapheme and phoneme sequences would be 'nightskyy' and /naIItskaI/, respectively.

¹ Throughout this paper we use the SAMPA notation for phonetic transcriptions.

Fig. 1. Grapheme-phoneme alignment for the word nightsky (shown horizontally) and its phonetic transcription (shown vertically) obtained by dynamic programming. Dots indicate the optimal path.

3. The system in theory

Using a statistical approach based on Bayes' decision rule for transcribing a word w = g_1 ... g_n, we have to find the phoneme sequence p*_1 ... p*_m that is the most likely transcription of w, i.e. that maximizes the conditional probability Pr(p_1 ... p_m | g_1 ... g_n). Using Bayes' formula, we can rewrite this as follows:

$$ p_1^* \ldots p_m^* = \operatorname*{argmax}_{p_1 \ldots p_m} \big( \Pr(p_1 \ldots p_m) \cdot \Pr(g_1 \ldots g_n \mid p_1 \ldots p_m) \big). \qquad (1) $$
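The grapheme-phoneme alignment of Section 2 can be sketched as a small dynamic program. This is a toy version in which the production probabilities are passed in as a hand-set dictionary; the paper instead re-estimates these tables iteratively from uniform starting values [12], and all names here are illustrative:

```python
import math

def align(graphemes, phonemes, prob, floor=1e-4):
    """Grapheme-phoneme alignment by dynamic programming (cf. Fig. 1).

    prob[(g, p)] is the probability that grapheme g produces phoneme p;
    unseen pairs fall back to a small floor.  A diagonal move pairs one
    grapheme with one phoneme; a horizontal move stretches the current
    grapheme over an extra phoneme, and a vertical move stretches the
    current phoneme over an extra grapheme.  Returns the aligned pairs.
    """
    def pen(g, p):                        # negative log-probability penalty
        return -math.log(prob.get((g, p), floor))

    n, m = len(graphemes), len(phonemes)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:           # grapheme i+1 produces phoneme j+1
                c = cost[i][j] + pen(graphemes[i], phonemes[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i > 0 and j < m:           # grapheme stretching ('y' -> /aI/)
                c = cost[i][j] + pen(graphemes[i - 1], phonemes[j])
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
            if i < n and j > 0:           # phoneme stretching ('gh' -> /I/)
                c = cost[i][j] + pen(graphemes[i], phonemes[j - 1])
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
    pairs, (i, j) = [], (n, m)
    while (i, j) != (0, 0):               # backtrace along the optimal path
        pairs.append((graphemes[i - 1], phonemes[j - 1]))
        i, j = back[i][j]
    return list(reversed(pairs))
```

The stretched grapheme and phoneme strings of equal length are then simply the first and second components of the returned pairs.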
The actual modelling of estimators for Pr(p_1 ... p_m) and Pr(g_1 ... g_n | p_1 ... p_m) will be referred to as the phonotactic model and the matching model, respectively. The phonotactic model is currently realized as the linear interpolation of zerogram, unigram etc. up to 7-gram estimators, i.e.

$$ \Pr(p_1 \ldots p_m) = \prod_{i=1}^{m} \Pr(p_i \mid p_1, \ldots, p_{i-1}) \approx \prod_{i=1}^{m} p_7(p_i \mid p_{i-6}, \ldots, p_{i-1}). $$

On the background lexicon that we use as a training set we first perform a grapheme-phoneme alignment as described in the previous section, and stretch the phoneme sequence accordingly. Denoting the count of the phoneme k-gram q_1 ... q_k on this stretched sequence by N(q_1, ..., q_k), the j-gram estimators are chosen as follows:

$$ p_0(p_i) := \frac{1}{\text{number of phonemes}} $$

$$ p_j(p_i \mid p_{i-j+1}, \ldots, p_{i-1}) := \begin{cases} (1 - \alpha_{j-1})\, p_{j-1}(p_i \mid p_{i-j+2}, \ldots, p_{i-1}) + \alpha_{j-1} \dfrac{N(p_{i-j+1}, \ldots, p_i)}{N(p_{i-j+1}, \ldots, p_{i-1})} & \text{if } i \geq j \text{ and } N(p_{i-j+1}, \ldots, p_{i-1}) > 0 \\[1ex] p_{j-1}(p_i \mid p_{i-j+2}, \ldots, p_{i-1}) & \text{otherwise} \end{cases} $$

for 1 ≤ j ≤ 7. The matching model is given by

$$ \Pr(g_1 \ldots g_n \mid p_1, \ldots, p_n) = \prod_{i=1}^{n} \Pr(g_i \mid g_1, \ldots, g_{i-1}, p_1, \ldots, p_n) \approx \prod_{i=1}^{n} \left( \alpha\, p_3(g_i \mid g_{i-2}, g_{i-1}, p_{i-2}, p_{i-1}, p_i) + (1 - \alpha) \frac{1}{G} \right), $$

where G denotes the number of graphemes.
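The recursive interpolation of the phonotactic model can be sketched as follows. This is a simplified version: a single weight `alpha` stands in for the order-dependent weights alpha_{j-1}, whose trained values the paper does not list, and the class name and padding symbols are illustrative:

```python
import math
from collections import Counter

class PhonotacticModel:
    """Interpolated j-gram model over phoneme strings (simplified sketch).

    Counts come from the (stretched) phoneme sequences of a background
    lexicon.  Each order j backs off recursively to order j-1, down to the
    uniform zerogram p0.
    """

    def __init__(self, sequences, order=7, alpha=0.9):
        self.order, self.alpha = order, alpha
        self.ngrams = Counter()
        self.inventory = set()
        for seq in sequences:
            self.inventory.update(seq)
            padded = ["<s>"] * (order - 1) + list(seq) + ["</s>"]
            self.ngrams[()] += len(padded)          # total token count
            for k in range(1, order + 1):
                for i in range(len(padded) - k + 1):
                    self.ngrams[tuple(padded[i:i + k])] += 1

    def prob(self, p, history, j=None):
        """p_j(p | history), the recursively interpolated estimator."""
        if j is None:
            j = self.order
        if j == 0:                                   # p0: uniform distribution
            return 1.0 / (len(self.inventory) + 1)   # +1 for the end marker
        h = tuple(history)[-(j - 1):] if j > 1 else ()
        lower = self.prob(p, history, j - 1)
        den = self.ngrams[h]
        if den == 0:                                 # history unseen: back off
            return lower
        return (1 - self.alpha) * lower + self.alpha * self.ngrams[h + (p,)] / den

    def score(self, seq):
        """Log-probability of a complete phoneme string."""
        hist = ["<s>"] * (self.order - 1)
        logp = 0.0
        for p in list(seq) + ["</s>"]:
            logp += math.log(self.prob(p, hist))
            hist = hist[1:] + [p]
        return logp
```

Because p_0 is strictly positive, every interpolated estimate is strictly positive as well, so even phoneme strings never seen in training receive a finite log-probability.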
The probabilities in the matching model are given by

$$ p_3(g_i \mid g_{i-2}, g_{i-1}, p_{i-2}, p_{i-1}, p_i) := \begin{cases} \dfrac{N(g_{i-2}, g_{i-1}, g_i, p_{i-2}, p_{i-1}, p_i)}{N(g_{i-2}, g_{i-1}, p_{i-2}, p_{i-1}, p_i)} & \text{if } N(g_{i-2}, g_{i-1}, p_{i-2}, p_{i-1}, p_i) \neq 0 \\[1ex] p_2(g_i \mid g_{i-1}, p_{i-1}, p_i) & \text{if } N(g_{i-2}, g_{i-1}, p_{i-2}, p_{i-1}, p_i) = 0 \end{cases} $$

$$ p_2(g_i \mid g_{i-1}, p_{i-1}, p_i) := \begin{cases} \dfrac{N(g_{i-1}, g_i, p_{i-1}, p_i)}{N(g_{i-1}, p_{i-1}, p_i)} & \text{if } N(g_{i-1}, p_{i-1}, p_i) \neq 0 \\[1ex] \beta_1 Q_1(g_i \mid g_{i-1}, p_i) + \beta_2 Q_2(g_i \mid g_{i-1}, p_{i-1}) + \beta_3 Q_3(g_i \mid p_{i-1}, p_i) & \text{if } N(g_{i-1}, p_{i-1}, p_i) = 0 \end{cases} $$

with β_1 + β_2 + β_3 = 1 and each β_j > 0. The Q_j are given by

$$ Q_1(g_i \mid g_{i-1}, p_i) := \begin{cases} \dfrac{N(g_{i-1}, g_i, p_i)}{N(g_{i-1}, p_i)} & \text{if } N(g_{i-1}, p_i) \neq 0 \\[1ex] \dfrac{1}{G} & \text{if } N(g_{i-1}, p_i) = 0 \end{cases} $$

$$ Q_2(g_i \mid g_{i-1}, p_{i-1}) := \begin{cases} \dfrac{N(g_{i-1}, g_i, p_{i-1})}{N(g_{i-1}, p_{i-1})} & \text{if } N(g_{i-1}, p_{i-1}) \neq 0 \\[1ex] \dfrac{1}{G} & \text{if } N(g_{i-1}, p_{i-1}) = 0 \end{cases} $$

$$ Q_3(g_i \mid p_{i-1}, p_i) := \begin{cases} \dfrac{N(g_i, p_{i-1}, p_i)}{N(p_{i-1}, p_i)} & \text{if } N(p_{i-1}, p_i) \neq 0 \\[1ex] \dfrac{1}{G} & \text{if } N(p_{i-1}, p_i) = 0 \end{cases} $$
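This backoff chain can be sketched as follows; counts are collected from aligned, equal-length grapheme/phoneme string pairs. The word-start markers are reduced to simple '#' padding here, the split into word-beginning, word-middle and word-end sections is omitted, and all identifiers are illustrative:

```python
from collections import Counter

class MatchingModel:
    """Backoff matching model Pr(g_i | g_{i-2}, g_{i-1}, p_{i-2}, p_{i-1}, p_i).

    p3 backs off to p2 when its context is unseen; p2 in turn backs off to a
    beta-weighted mix of the reduced distributions Q1, Q2, Q3, each of which
    drops one conditioning variable, and finally to the uniform 1/G.
    """

    def __init__(self, pairs, betas=(0.03, 0.03, 0.94), num_graphemes=30):
        self.betas, self.G = betas, num_graphemes
        self.N = Counter()
        for gs, ps in pairs:
            gs = ["#", "#"] + list(gs)       # word-start padding
            ps = ["#", "#"] + list(ps)
            for i in range(2, len(gs)):
                g2, g1, g = gs[i - 2], gs[i - 1], gs[i]
                q2, q1, q = ps[i - 2], ps[i - 1], ps[i]
                # tagged keys keep numerator/denominator counts apart
                for key in (("p3n", g2, g1, g, q2, q1, q), ("p3d", g2, g1, q2, q1, q),
                            ("p2n", g1, g, q1, q), ("p2d", g1, q1, q),
                            ("q1n", g1, g, q), ("q1d", g1, q),
                            ("q2n", g1, g, q1), ("q2d", g1, q1),
                            ("q3n", g, q1, q), ("q3d", q1, q)):
                    self.N[key] += 1

    def _q(self, num_key, den_key):
        den = self.N[den_key]
        return self.N[num_key] / den if den else 1.0 / self.G

    def prob(self, g, g2, g1, q2, q1, q):
        """Backoff probability of grapheme g in the given context."""
        den = self.N[("p3d", g2, g1, q2, q1, q)]
        if den:                                       # p3: full context seen
            return self.N[("p3n", g2, g1, g, q2, q1, q)] / den
        den = self.N[("p2d", g1, q1, q)]
        if den:                                       # p2: reduced context seen
            return self.N[("p2n", g1, g, q1, q)] / den
        b1, b2, b3 = self.betas                       # beta-mix of Q1, Q2, Q3
        return (b1 * self._q(("q1n", g1, g, q), ("q1d", g1, q))
                + b2 * self._q(("q2n", g1, g, q1), ("q2d", g1, q1))
                + b3 * self._q(("q3n", g, q1, q), ("q3d", q1, q)))
```

Since β_1 + β_2 + β_3 = 1, a context that was never observed at any level yields exactly the uniform probability 1/G.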
Experimentally we determined that the values α = 0.99, β_1 = β_2 = 0.03 and β_3 = 0.94 lead to a good performance of the system. Fig. 2 visualizes the components of the matching score. The dots in the boxes for Q_1, Q_2 and Q_3 mark those variables that were omitted in the respective probability distribution. As mentioned above, we use word start and word end markers to capture the special behaviour at these points. Furthermore, the matching model is split into three sections dealing with the beginning, middle and end of words in order to separate effects that occur often at the beginning or end of words but almost never in the middle. Since most languages have a lot of common
Fig. 2. Components of the matching score. Dots indicate the variables that were omitted for the respective distribution.
endings (like e.g. ...tion, ...ing in English; ...ung, ...ten in German or ...tion, ...ment in French etc.), this method is particularly useful here.

Using the statistical approach, we determine the phonetic transcription of a word as follows. Each grapheme can generate one out of a list of phonemes and phoneme strings (like y → /aI/ as in Fig. 1). Starting with the leftmost letter in the word to be transcribed, we generate all new hypotheses by appending the possible phoneme strings from this list to the current hypothesis (initially consisting of the word start symbol). All these hypotheses are then evaluated using the phonotactic as well as the matching model and sorted according to their probabilities. Pruning is used to eliminate those hypotheses that differ from the currently best one by more than a certain threshold. Furthermore, we use a histogram pruning [13] that restricts the number of hypotheses active for each grapheme. Recursively, we investigate the remaining hypotheses further, starting with the most probable one, until we have reached the final letter of the word. As soon as we have reached the first hypothesis giving a transcription of the entire word, we use its probability to pre-emptively stop the consideration of competing hypotheses whenever they are already less likely than the complete transcription, thereby speeding up the search process.

4. The system in practice

In the evaluation of transcriptions, it usually is not possible to just say that one is 'wrong' whereas the other is 'right'. Even if we restrict ourselves to a standard pronunciation, there will often be several transcriptions that differ only minimally and should all be considered correct. This minimal difference, however, though easily recognized by humans, is quite difficult to define in a rigid sense.
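The pruned search described at the end of Section 3 can be sketched as follows. This is a simplified breadth-first variant: the recursive best-first expansion and the early stopping on the first complete transcription are omitted, `score` is an assumed callback combining the phonotactic and matching models, and the pruning parameters are illustrative:

```python
def transcribe(word, productions, score, beam=10.0, histogram=50):
    """Left-to-right hypothesis search with beam and histogram pruning.

    productions[g] lists the phoneme strings grapheme g may generate
    (e.g. 'y' -> ['aI', 'I']); score(prefix, phonemes) returns the combined
    log-probability of a partial transcription.
    """
    hyps = [(0.0, [])]                      # (log-probability, phoneme list)
    for i, g in enumerate(word):
        expanded = [(score(word[:i + 1], ph + [prod]), ph + [prod])
                    for logp, ph in hyps
                    for prod in productions.get(g, [])]
        if not expanded:                    # grapheme with no known production
            return None
        expanded.sort(key=lambda h: -h[0])
        best = expanded[0][0]
        # beam pruning: drop hypotheses too far below the current best;
        # histogram pruning [13]: keep at most `histogram` hypotheses
        hyps = [h for h in expanded if best - h[0] <= beam][:histogram]
    return hyps[0][1]                       # most probable transcription
```

Rescoring each prefix from scratch keeps the sketch short; an efficient implementation would of course extend the model scores incrementally.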
A common approach in the literature is that of removing critical words like proper names or abbreviations, which have uncommon pronunciations, from the test set and then checking the generated transcriptions manually (see e.g. [3]). We considered this to be somewhat arbitrary and extremely tedious, taking into account that we were testing on 15 000 to 20 000 words,² and we therefore decided to choose the simplest and strictest possible error count, i.e. we considered the transcriptions given in the lexicon to be the only correct ones and counted each differing transcription as wrong. We refer to these as word errors.
² Except for French, where this amount of data is not currently available.

Since it is rather frequent that the transcription is wrong in this word-error sense, yet differs from the correct version in just a single place, we also measure the phoneme error rate. This is done in strict similarity to the error count in speech recognition, i.e. we perform an alignment between the two transcriptions using the distance

$$ d(p, q) := \begin{cases} 0 & \text{if } p = q \\ 1 & \text{if } p \neq q \end{cases} $$

and then determine the places where a phoneme has been deleted, inserted or substituted by a different one. In the following tables, WER denotes the word error rate given by

$$ \mathrm{WER} := \frac{\text{number of wrong transcriptions}}{\text{number of all transcriptions}}. $$
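The alignment-based phoneme error count described above is the standard Levenshtein computation used in speech recognition scoring; a sketch with illustrative names:

```python
def phoneme_errors(ref, hyp):
    """Count (deletions, insertions, substitutions) between two transcriptions.

    ref and hyp are lists of phoneme symbols; the alignment uses the unit
    distance d(p, q) = 0 if p == q else 1.
    """
    n, m = len(ref), len(hyp)
    # D[i][j]: minimal edit cost between ref[:i] and hyp[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    # backtrace, classifying each step as match/substitution, deletion or insertion
    dels = ins = subs = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                D[i][j] == D[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)):
            subs += 0 if ref[i - 1] == hyp[j - 1] else 1
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return dels, ins, subs
```

Summing the three counts over a test set and dividing by the total number of reference phonemes yields the PER defined next.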
PER denotes the phoneme error rate, which is decomposed into insertions, deletions and substitutions:

$$ \mathrm{PER} := \frac{\text{deletions} + \text{insertions} + \text{substitutions}}{\text{number of all phonemes}}. $$

We present test results for German, English and French, where the latter is somewhat limited by the fact that work on French data started very recently and we do not yet have a sufficiently large amount of training material.

4.1. English

For English, we performed two different groups of tests. In the first group, we randomly chose about 20% of the words from each of our two background lexica, namely a modified version of the Moby lexicon by Grady Ward and the lexicon compiled at Carnegie-Mellon University (CMU), as test sets, while the remaining 80% of the lexica were used to train the system. We did not process or alter the words in the test set in any way; in particular we did not remove abbreviations or proper names. In the second group, we determined the words that occurred in both the Moby and the CMU lexicon and had the same transcription in both of them.³ These were used as test set whereas the remainders of the two lexica served as training material.

Table I gives the sizes of the background/training lexica as well as the test sets in numbers of words and phonemes. The Train and Test rows indicate the sizes of the train and test parts of the respective lexicon; the Common

³ Both lexica were translated into SAMPA notation.
TABLE I
Words and phonemes in training and test lexica, English task

Lexica       No. of words    No. of phonemes
MobyTrain    56 218          461 938
MobyTest     14 044          115 651
CMUTrain     79 416          522 106
CMUTest      19 854          130 561
MobyNew      56 201          488 531
CMUNew       83 694          552 441
Common       14 108          90 163
row shows the number of words that occur in both lexica with the same transcription, and finally the New rows give the sizes of the remaining parts (i.e. the 'non-common' words) in each lexicon.

TABLE II
Results of transcription tests, English task

Test            WER      PER
MobyTest        36.7%    7.9%
CMUTest         42.8%    11.3%
Common/Moby     29.6%    7.4%
Common/CMU      35.4%    9.2%
Common/Both     25.0%    6.2%
Common/Mixed    23.7%    5.7%

Table II summarizes our test results. The rows labelled MobyTest and CMUTest give the error rates for the tests of group one. The Common rows give the error rates for the tests of group two: Moby and CMU indicate experiments in which only the remainder of the respective lexicon was used for training; in Both, the two training lexica were just combined, whereas in Mixed we trained separately on both of them and used a linear interpolation of the two resulting models during transcription.

The results demonstrate that the size of the background lexicon is of great importance, as was to be expected. By just combining the Moby and the CMU lexica, we were able to reduce the error rates on the Common test set by about 15% (relative to Common/Moby); using the more sophisticated method of training the models separately and combining them through linear interpolation during the transcription process even gained about 20%. The latter effect also indicates that the two lexica are quite inhomogeneous in character (see [14]), which certainly stems from the fact that our version of the Moby lexicon was hand-checked whereas the CMU lexicon contains manually corrected portions as well as unchecked transcriptions from several different automatic systems. This also explains the poor results obtained with the CMU lexicon alone. As compared to the German results shown below, those given here are about 37% worse in word error rate and 60% worse in phoneme error rate. The results obtained suggest that the difference is caused partly by the fact that in English the scope of pronunciational variations is far greater than in German and partly by the lack of sufficiently consistent training material.

TABLE III
Distribution of phoneme errors for Common/Mixed

No. of phoneme errors    All      Wrong
0                        77.4%    -
1                        12.1%    53.9%
2                        7.8%     34.7%
3                        1.9%     8.3%
4                        0.5%     2.1%
≥5                       0.3%     1.0%

Table III shows the distribution of phoneme errors among all words and among the wrongly transcribed ones (in the columns labelled All and Wrong, respectively) for the test performed on the Common/Mixed test set with linear interpolation of the models trained on MobyTrain and CMUTrain. It is seen that almost 90% (53.9% + 34.7%) of all word errors are caused by one or two phoneme errors in the respective word.

4.2. German

For German, our test set consisted of the words that occurred in both of our lexica, namely the Duden pronunciation lexicon and a lexicon compiled at the University of Bonn, and had the same transcription in both of them. We did not process or alter these words in any way; in particular we did not remove abbreviations or proper names. As background and training lexicon, we used the remaining words from either lexicon as well as the union of these two (the latter being denoted by DuBo). Table IV gives the sizes of the background/training lexica as well as
TABLE IV
Words and phonemes in background lexica, German task

Lexica      No. of words    No. of phonemes
Bonn        45 432          437 007
Duden       67 533          532 741
DuBo        103 766         896 524
Test set    28 674          214 033
the test set in numbers of words and phonemes. Table V summarizes our test results. As for the English tests, the results demonstrate that the size of the background lexicon is of supreme importance.

TABLE V
Results of transcription tests, German task

Test set    WER       PER
Bonn        29.74%    6.59%
Duden       20.82%    4.37%
DuBo        17.28%    3.55%

Table VI shows the distribution of phoneme errors among all words and among the wrongly transcribed ones (in the columns labelled All and Wrong, respectively) for the test performed with the DuBo background lexicon. Again, it can be seen that almost 90% (63.05% + 24.99%) of all word errors are caused by one or two phoneme errors in the respective word.

TABLE VI
Distribution of phoneme errors for DuBo

No. of phoneme errors    All       Wrong
0                        82.92%    -
1                        10.77%    63.05%
2                        4.27%     24.99%
3                        1.50%     8.76%
4                        0.39%     2.30%
≥5                       0.15%     0.90%

4.3. French

For the French task, we currently have just one lexicon available, namely Brulex, compiled at the University of Brussels. In size, it is far smaller than the data used for English and German, as Table VII shows. Still, we were able to obtain results, given in Table VIII, that are slightly better than those for German. The distribution of phoneme errors among the words as given in Table IX is also very much the same as for German. Here, too, about 90% (70.58% + 21.86%) of all word errors are caused by at most two phoneme errors in the wrongly transcribed word.

TABLE VII
Words and phonemes in background lexica, French task

Lexica      No. of words    No. of phonemes
Brulex      25 827          174 251
Test set    6 456           43 614

TABLE VIII
Results of transcription tests, French task

Test set    WER      PER
Brulex      13.4%    2.77%

TABLE IX
Distribution of phoneme errors for Brulex

No. of phoneme errors    All       Wrong
0                        86.68%    -
1                        9.4%      70.58%
2                        2.91%     21.86%
3                        0.73%     5.46%
4                        0.14%     1.05%
≥5                       0.14%     1.05%

4.4. Comparison of the languages

The tests described above indicate that as far as pronunciation is concerned,
there are significant differences between English, French and German. As already pointed out in Section 2, English shows a considerable variation in the pronunciation of a given grapheme string, and this can be seen in the higher error rates obtained on this language. German is much more well-behaved in the sense that a given graphemic string seldom has more than one pronunciation. Accordingly, the error rates for German are considerably better than those for English. French, finally, appears to have the most consistent pronunciation of the three languages. Even though the error rates differ significantly among the three languages investigated, they always remain within a reasonable interval. Thus we can say that while certain languages are more suitable for our approach than others, it works sufficiently well on all of them.

5. Conclusion

We have presented a multilingual statistical method for generating phonetic transcriptions from graphemic words. It produces correct transcriptions for 77% to 86% of all words in the test corpora for English, French and German, with the remaining 23% to 14% of wrong transcriptions differing from the correct ones by at most two phonemes in at least 87% of the cases. Since one or two wrong phonemes usually still leave the word intelligible, we altogether obtain useful phonetic transcriptions for more than 97% of all words presented to the system. As compared to rule-based methods, our method has the advantage that it is easily adaptable to new languages, learns automatically from background lexica and can be improved by just adding more training material.

Acknowledgements
We are grateful to H.-G. Meier and V. Steinbiss for helpful discussions.
REFERENCES
[1] V. Steinbiss, H. Ney, R. Haeb-Umbach, B.-H. Tran, U. Essen, R. Kneser, M. Oerder, H.-G. Meier, X. Aubert, C. Dugast, D. Geller, W. Hollerbauer and H. Bartosik, The Philips research system for large-vocabulary continuous-speech recognition, Proc. EUROSPEECH, Berlin, pp. 2125-2128 (1993).
[2] H. Ney, V. Steinbiss, R. Haeb-Umbach, B.-H. Tran and U. Essen, An overview of the Philips research system for large vocabulary continuous-speech recognition, Int. J. Pattern Recognition and Artificial Intelligence, 8(1), 33-70 (1994).
[3] H.S. Elovitz, R. Johnson, A. McHugh and J.E. Shore, Letter-to-sound rules for automatic translation of English text to phonetics, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-24(6) (Dec. 1976).
[4] M. Kommenda, Automatische Wortstrukturanalyse für die akustische Ausgabe von deutschem Text, Doctoral Thesis, Technical University Vienna, Austria (1991).
[5] K. Wothke, Morphologically based automatic phonetic transcription, IBM Syst. J., 32(3), 486-511 (1993).
[6] S. Besling, Heuristical and statistical methods for grapheme-to-phoneme conversion, Proc. KONVENS, Vienna, Austria, pp. 23-31 (1994).
[7] A.V.D. Bosch and W. Daelemans, Data-oriented methods for grapheme-to-phoneme conversion, Proc. EACL, Utrecht, pp. 45-53 (1993).
[8] S. Besling, A statistical system for grapheme-to-phoneme conversion, Proc. Tenth Annual Conf. of UWC for New OED, Waterloo, Ontario, Canada, pp. 5-13 (1994).
[9] L.R. Bahl, S. Das, P.V. De Souza, M. Epstein, R.L. Mercer, B. Merialdo, D. Nahamoo, M.A. Picheny and J. Powell, Automatic phonetic baseform determination, Proc. ICASSP, Toronto, pp. 173-176 (1991).
[10] E. Thelen, Untersuchung akustisch definierter Einheiten zur Vokabularerweiterung von Spracherkennungssystemen, Diplom Thesis, RWTH Aachen, Germany (1994).
[11] R. Haeb-Umbach, P. Beyerlein and E. Thelen, Automatic transcription of unknown words in a speech recognition system, Proc. ICASSP, Detroit, pp. 840-843 (1995).
[12] B.V. Coile, On the development of pronunciation rules for text-to-speech synthesis, Proc. EUROSPEECH, Berlin, pp. 1455-1458 (1993).
[13] V. Steinbiss, B.-H. Tran and H. Ney, Improvements in beam search, Proc. ICSLP, Yokohama, pp. 2143-2146 (1994).
[14] H.-G. Meier, Personal communication, Philips Forschungslaboratorien, Aachen (1994).