Speech Communication 49 (2007) 453–463
Language-dependent state clustering for multilingual acoustic modelling *

Thomas Niesler
Department of Electrical and Electronic Engineering, University of Stellenbosch, Stellenbosch, South Africa
E-mail address: [email protected]; Tel.: +27 21 808 4118

Received 14 November 2006; received in revised form 29 March 2007; accepted 2 April 2007

* An expanded version of this paper was presented at the ISCA Tutorial and Research Workshop on Multilingual Speech and Language Processing (Stellenbosch, South Africa, 2006).
Abstract

The need to compile annotated speech databases remains an impediment to the development of automatic speech recognition (ASR) systems in under-resourced multilingual environments. We investigate whether it is possible to combine speech data from different languages spoken within the same multilingual population to improve the overall performance of a speech recognition system. For our investigation, we use recently collected Afrikaans, South African English, Xhosa and Zulu speech databases. Each consists of between 6 and 7 h of speech that has been annotated at the phonetic and the orthographic level using a common IPA-based phone set. We compare the performance of separate language-specific systems with that of multilingual systems based on straightforward pooling of training data as well as on a data-driven alternative. For the latter, we extend the decision-tree clustering process normally used to construct tied-state hidden Markov models to allow the inclusion of language-specific questions, and compare the performance of systems that allow sharing between languages with those that do not. We find that multilingual acoustic models obtained in this way show a small but consistent improvement over separate-language systems as well as systems based on IPA-based data pooling.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Multilinguality; Multilingual acoustic models; Multilingual speech recognition
1. Introduction

The first step in the development of a speech recognition system for a new language is normally the recording and annotation of a large quantity of spoken audio data. In general, the more data are available, the better the performance of the system. However, data gathering, and especially annotation, is very expensive in terms of both money and time. In a multilingual environment, this problem is compounded by the need to obtain such material in several languages. It is in this light that we would like to determine whether data from different languages that coexist in a multilingual environment can be combined in order to improve speech recognition performance. This process involves determining phonetic similarities between the languages, and exploiting these to obtain more robust and effective acoustic models. The eventual aim of this work is the development of speech recognition systems that are able to deal with multiple languages without the need to explicitly implement a set of parallel language-specific systems.

Multilingual speech recognition is particularly relevant in South Africa, which has 11 official languages and among whose population monolinguality is almost entirely absent. There is consequently a large geographic overlap between the mother-tongue speakers of different languages, as well as habitual and frequent code-mixing and code-switching. We study the four languages Afrikaans, English, Xhosa and Zulu, for which a certain amount of phonetically and orthographically annotated speech data is available.
Table 1
Percentage of the population who use Afrikaans, English, Xhosa and Zulu as first and second languages respectively (Statistics South Africa, 2004)

Language    First language speakers (%)   Second language speakers (%)
Afrikaans   13.3                          16.5
English      8.2                          18.5
Xhosa       17.6                          18.0
Zulu        23.8                          24.2
Detailed statistics reflecting the degree of multilinguality in South Africa are unfortunately not available. However, Table 1 shows the percentage of the population found to use each of the four languages under study as a mother tongue, and also as a second language. It has furthermore been estimated that Xhosa and Zulu speakers on average each speak four languages, while Afrikaans and English speakers are usually bilingual (Webb, 2002).

Multilingual speech recognition has received attention from several authors over the last decade. The seminal work by Schultz and Waibel dealt with 10 languages, each of which is spoken in a different country and forms part of the GlobalPhone speech corpus (Schultz and Waibel, 1998, 2000, 2001). Multilingual recognition systems were developed by applying decision-tree clustering to tied-mixture HMMs. While these techniques proved an effective means of constructing recognition systems for an unseen language not part of the training set, the authors were unable to improve on the performance of separate monolingual recognition systems. Similar conclusions were reached by Köhler (2001), who considered agglomerative clustering of monophone models in six languages, and by Uebler (2001), who employed IPA-based clustering in two and three languages. Finally, work by Uebler et al. (1998) on a bilingual speech recognition system showed some improvement in recognition accuracy for non-native speech, but again not for first-language speakers.

In this work, we consider the application of decision-tree clustering to the development of tied-state multilingual triphone HMMs. Furthermore, we focus on languages spoken within the same country, and hence related at least to some degree by the extensive phonetic and lexical borrowing, sharing and mixing that takes place in a multilingual society. Finally, strong links exist between certain groups of indigenous languages (such as the Nguni language group, which includes Xhosa and Zulu) that may allow speech recognition applications to exploit data sharing more fruitfully.
2. Speech databases

We have based our experiments on the African Speech Technology (AST) databases, which consist of recorded and annotated speech collected over both mobile and fixed telephone networks (Roux et al., 2004). For their compilation, speakers were recruited from targeted language groups and given unique datasheets with items designed to elicit a phonetically diverse mix of read and spontaneous speech. The datasheets included read items such as isolated digits, digit strings, money amounts, dates, times and spellings, as well as phonetically rich words and sentences. Spontaneous items included references to gender, age, mother tongue, place of residence and level of education. Information regarding other languages spoken by the speakers was unfortunately not recorded. The AST databases were collected in five different languages, as well as in a number of non-native variations. In this work we have made use of the Afrikaans, English, Xhosa and Zulu mother-tongue databases.

Afrikaans is a Germanic language with its origins in 17th-century Dutch brought to South Africa by settlers from Holland. It incorporates lexical and syntactic borrowings from Malay, Bantu and Khoisan languages, as well as from Portuguese and other European languages. English was brought to South Africa by British occupying forces at the end of the 18th century, and is today regarded as a specific variety of so-called World Englishes. Xhosa and Zulu both belong to the Nguni group of languages, which also includes Ndebele and Swazi. The orthography of both languages, as it is commonly accepted today, was standardised as recently as the 1960s. Both are tone languages, and exhibit a range of 'exotic' sounds, such as implosive bilabial stops as well as a range of clicks and associated phonetic accompaniments.

A global set of 130 phones, based on the International Phonetic Alphabet, was used during phonetic transcription of the speech databases (International Phonetic Association, 2001). These phone units were chosen to reflect the variety of distinct sounds observed during transcription in each of the four languages. The phone set includes labels for diphthongs and triphthongs that occur across word boundaries in spontaneous speech. Furthermore, in several cases diacritic symbols, in combination with consonants or vowels, have been considered to be unique phones. These diacritics include ejection, aspiration, syllabification, voicing, devoicing and duration. A table summarising the phone set is presented in the Appendix, while a more detailed analysis of the phonetic differences between the four languages is given in Niesler et al. (2005).

Together with the recorded speech waveforms, both orthographic (word-level) and phonetic (phone-level) transcriptions were available for each utterance. The orthographic transcriptions were produced and validated by human transcribers. Initial phonetic transcriptions were obtained from the orthography using grapheme-to-phoneme rules (Louw et al., 2001; Wissing et al., 2004; Louw, 2005), except for English, where a pronunciation dictionary was used instead. These initial transcriptions were subsequently corrected and validated manually by human experts.
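To make the conversion step concrete, the following is a minimal sketch of the longest-match mechanism by which grapheme-to-phoneme rules of this kind are typically applied. The rules and phone labels shown are hypothetical illustrations; the actual AST rule sets (Louw et al., 2001; Louw, 2005) are far more extensive and language-specific.

# Minimal longest-match grapheme-to-phoneme converter. The rules below are
# hypothetical illustrations, not the actual AST rule sets.
XHOSA_RULES = {
    "tsh": "tS_h",   # aspirated post-alveolar affricate
    "hl": "K",       # voiceless alveolar lateral fricative
    "th": "t_h",     # aspirated alveolar plosive
    "a": "a", "e": "E", "i": "i", "o": "O", "u": "u",
    "b": "b", "l": "l", "s": "s", "t": "t'", "h": "h",
}

def g2p(word: str, rules: dict) -> list:
    """Convert an orthographic word to a phone sequence by greedy
    longest-match application of grapheme-to-phone rules, so that
    e.g. 'tsh' is not split into 't' + 's' + 'h'."""
    phones, i = [], 0
    graphemes = sorted(rules, key=len, reverse=True)  # longest first
    while i < len(word):
        for g in graphemes:
            if word.startswith(g, i):
                phones.append(rules[g])
                i += len(g)
                break
        else:
            i += 1  # skip characters not covered by any rule
    return phones

print(g2p("thatha", XHOSA_RULES))   # ['t_h', 'a', 't_h', 'a']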
Table 2
Training sets for each database

Database name   Speech (h)   No. of utterances   No. of speakers   Phone types   Phone tokens   Word tokens
Afrikaans       6.18         9520                234                82           180,904        47,383
English         6.02         9904                271                71           167,986        47,941
Xhosa           6.98         8538                219               105           177,843        36,676
Zulu            7.03         8295                203                99           187,249        35,568

Table 5
Test sets for each database

Database name   Speech (min)   No. of utterances   No. of speakers   Phone tokens   Word tokens
Afrikaans       24.4           750                 20                11,441         3051
English         24.0           702                 18                10,338         3652
Xhosa           26.8           609                 17                10,925         2480
Zulu            27.1           583                 16                11,008         2385
2.1. Training and test sets

Each database was divided into a training and a test set. The four training sets each contain between 6 and 7 h of audio data, as indicated in Table 2. Phone types refer to the number of different phones that occur in the data, while phone tokens indicate their total number.

Table 3 shows the extent to which the phone set of each language, which corresponds to the set of unique phone types in its training set, covers the phone types and tokens found in the training sets of the other languages. For example, 75.6% of phone types and 98.4% of phone tokens in the Afrikaans training set are also present among the English phone types. From the table we see that Xhosa and Zulu in particular have many phones in common. This is also true for Afrikaans and English, but to a lesser degree.

Since our acoustic models will be context dependent, it is useful also to consider the similarity between the triphones of each language. Accordingly, Table 4 shows the extent to which the training set triphone types of each language cover the training set triphone tokens of the other languages. Once again, there is a great deal of commonality between Xhosa and Zulu. English triphones show their greatest overlap with Afrikaans, but the converse does not hold: Afrikaans triphone types cover the Xhosa and Zulu triphone tokens slightly better than they cover the English tokens.
Table 3
Degree to which the monophone types of each language cover the training set monophone types and tokens of every other language

Phone types of language   Covers % of training set phone types/tokens in:
                          Afrikaans    English      Xhosa        Zulu
Afrikaans                 100/100      87.3/99.5    61.9/87.5    60.6/88.1
English                   75.6/98.4    100/100      55.2/86.6    51.5/86.5
Xhosa                     79.3/97.4    81.7/92.3    100/100      90.9/99.9
Zulu                      73.2/92.6    71.8/85.7    85.7/99.8    100/100

Table 4
Degree to which the triphone types of each language cover the training set triphone tokens of every other language

Triphone types            Covers % of training set triphone tokens in:
Language     Number       Afrikaans    English      Xhosa        Zulu
Afrikaans     8287        100          25.6         34.5         32.9
English      11,297       29.3         100          18.1         13.1
Xhosa         8700        31.9         26.6         100          88.3
Zulu          8626        34.8         22.9         87.3         100
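The figures in Tables 3 and 4 follow directly from type and token counts. The following is a minimal sketch of how such coverage percentages can be computed, assuming each language's training transcription is available as a flat list of phone (or triphone) labels; the function and variable names are illustrative, not taken from the paper.

from collections import Counter

def coverage(covering, covered):
    """Percentage of types and of tokens in `covered` that also occur
    among the types of `covering`."""
    cover_types = set(covering)
    counts = Counter(covered)
    types_hit = sum(1 for t in counts if t in cover_types)
    tokens_hit = sum(c for t, c in counts.items() if t in cover_types)
    return (100.0 * types_hit / len(counts),
            100.0 * tokens_hit / sum(counts.values()))

# e.g. how well the English phone inventory covers the Afrikaans data:
# type_pct, token_pct = coverage(english_phones, afrikaans_phones)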
The test set for each language contains approximately 25 min of speech data, as shown in Table 5. There was no speaker overlap between the test and training sets, and each contained both male and female speakers. Finally, a separate development set, consisting of approximately 15 min of speech from 10 speakers in each language, was also prepared. This data was used only for the optimisation of recognition parameters, before final evaluation on the test set. There is no overlap between the development set and either the test or training sets.

3. General experimental method

The HTK tools were used to develop and test recognition systems (Young et al., 2002). The speech audio data was parameterised as 39-dimensional feature vectors consisting of 13 Mel-frequency cepstral coefficients (MFCCs) and their first and second differentials, with cepstral mean normalisation (CMN) applied on a per-utterance basis. From this parameterised training set and its phonetic transcription, diagonal-covariance cross-word triphone models with three states per model and eight Gaussian mixtures per state were trained by embedded Baum–Welch re-estimation and decision-tree state clustering (Young et al., 1994).

The decision-tree state clustering process begins by pooling all context-dependent phones found in the training corpus that correspond to the same context-independent phone, termed the basephone hereafter. A set of linguistically motivated questions is defined with which these clusters can be split. Such questions may, for example, ask whether the left context of a particular context-dependent phone is a vowel, or whether the right context is silence. The clusters are subdivided repeatedly, at each iteration applying the question that affords the largest improvement in training set likelihood. The process ends either when this likelihood gain falls below a certain threshold, or when the number of occurrences remaining in a cluster becomes too small. Hence the clustering process results in a binary decision tree for each state of each basephone. The leaves of this tree are clusters of context-dependent phones whose training data are subsequently pooled.

A great advantage of this clustering method is that context-dependent phones not encountered in the training data at all can easily be synthesised by means of the decision trees that have been determined for the corresponding basephone. This is important when using cross-word context-dependent models, or when the phone set is large or the training set small and hence sparse.
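The split criterion can be sketched as follows, in the spirit of Young et al. (1994). Each candidate cluster is scored by the log-likelihood of a single diagonal-covariance Gaussian fitted to its pooled sufficient statistics, and a question is evaluated by the gain its split yields. This is a simplified illustration under these assumptions, not the HTK implementation itself; all names are ours.

import numpy as np

def cluster_loglik(occ, mean_acc, sq_acc):
    """Approximate log-likelihood of the data in a cluster under a single
    diagonal Gaussian fitted to the pooled statistics:
    L = -0.5 * occ * (d*log(2*pi) + sum(log var) + d)."""
    mu = mean_acc / occ
    var = sq_acc / occ - mu ** 2           # per-dimension variance
    d = len(mu)
    return -0.5 * occ * (d * np.log(2 * np.pi) + np.sum(np.log(var)) + d)

def split_gain(states, question):
    """Gain in training-set log-likelihood obtained by splitting a cluster
    of states with a yes/no question. Each state is assumed to be a dict
    holding its occupancy 'occ' and accumulators 'sum' and 'sqsum'."""
    def pooled(group):
        occ = sum(s["occ"] for s in group)
        m = sum(s["sum"] for s in group)
        v = sum(s["sqsum"] for s in group)
        return cluster_loglik(occ, m, v)
    yes = [s for s in states if question(s)]
    no = [s for s in states if not question(s)]
    if not yes or not no:
        return -np.inf                      # degenerate split
    return pooled(yes) + pooled(no) - pooled(states)

At each iteration the question with the largest gain is applied, and splitting stops once the best gain falls below the chosen threshold or a cluster's occupancy becomes too small.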
Table 6
Bigram language model perplexities measured on the corresponding test sets

Database     Bigram types   Perplexity
Afrikaans    1420           11.84
English      1900           14.08
Xhosa        2003           12.72
Zulu         1886           12.57
The related research by Schultz and Waibel applied decision-tree clustering to multilingual acoustic modelling using tied-mixture systems (Schultz and Waibel, 2001). In such systems the HMMs share a single large set of Gaussian distributions, with state-specific mixture weights. This configuration allows the clustering process to employ an entropy-based distance measure, computed from the mixture weights, to determine the similarity between states. Our configuration may in contrast be described as a tied-state system. Each set of clustered states shares a much smaller Gaussian mixture distribution, but this distribution is completely separate for each set of clustered states. Since the entropy-based distance measure cannot be used for this configuration, our clustering procedure is based on the reduction in training set likelihood associated with a cluster subdivision.

Since the AST databases contain too little data for word-based language modelling, the comparison of recognition performance will be based on phoneme error rates, as is also proposed in (Schultz and Waibel, 1998; Waibel et al., 2000). All speech recognition experiments are performed using a backoff bigram language model obtained for each language from the training set phoneme transcriptions (Katz, 1987). Absolute discounting was used for the estimation of language model probabilities (Ney et al., 1994). Language model perplexities are shown in Table 6. Word-insertion penalties and language model scale factors used during recognition were optimised on the development set.

Note that, as indicated in Table 2, the phone sets of the four languages differ considerably. Of the 130 different phone labels present in the annotations, only 53 are common to all four languages. It is apparent from Table 6 that English has the highest perplexity, although from Table 2 we know it has the smallest number of phone types. In contrast, the perplexities are lower for both Xhosa and Zulu, despite their larger phone sets. A stronger sequential relationship therefore exists between consecutive phones in Xhosa and Zulu than in English. This is believed to be a consequence of the strong /CV/ syllable structure characterising these languages, which limits the variety of sequential phone combinations. Table 4 reflects the same pattern in the smaller number of Xhosa and Zulu triphone types, when compared with English.
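The phone-level language model can be illustrated with a minimal sketch of a backoff bigram with absolute discounting, in the spirit of Ney et al. (1994). The discount D and the handling of sentence boundaries are simplifying assumptions; the toolkit actually used may differ in detail.

from collections import Counter, defaultdict

def train_bigram(sentences, D=0.5):
    """Return p(w | v): a backoff bigram with absolute discounting."""
    uni, bi, hist = Counter(), Counter(), Counter()
    for s in sentences:
        uni.update(s)
        bi.update(zip(s, s[1:]))
        hist.update(s[:-1])                 # counts of v as a history
    total = sum(uni.values())
    followers = defaultdict(set)
    for v, w in bi:
        followers[v].add(w)

    def prob(w, v):
        p_uni = uni[w] / total              # unigram backoff distribution
        if hist[v] == 0:
            return p_uni                    # unseen history: back off fully
        # probability mass freed by discounting, redistributed via unigrams
        lam = D * len(followers[v]) / hist[v]
        return max(bi[v, w] - D, 0) / hist[v] + lam * p_uni

    return prob

p = train_bigram([["a", "b", "a", "b", "b"]])
print(p("b", "a"))   # 0.9 for this toy corpus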
Fig. 1. Language-specific acoustic models. (The figure shows separate Xhosa and Zulu three-state HMMs for the triphone [b]–[a]+[d], each with its own states and transition probabilities.)
4. Language-specific acoustic models

To serve as a baseline, a fully language-specific system was developed that allows no sharing between languages. Model development begins by pooling all triphones with the same basephone separately for each language. The decision-tree clustering process then employs only questions relating to the phonetic character of the left and the right context. The structure of the resulting acoustic models is illustrated in Fig. 1 for two languages (Xhosa and Zulu) and a single triphone of basephone [a] in the left context of [b] and the right context of [d]. Since no overlap is allowed between the triphones of different languages, this baseline corresponds to a completely separate set of acoustic models for each language.

5. Language-independent acoustic models

In addition to the four language-specific acoustic model sets, a single language-independent acoustic model set was trained by pooling the data across all four languages for phones with the same IPA classification. This no longer allows the models to differentiate between subtly different spectral qualities of the same IPA phone used in different languages. It also no longer allows language-specific co-articulation effects to be taken into account during the decision-tree triphone clustering step of model development. Hence the performance of these models is expected to be a lower bound for the performance of the multilingual state clustering that will be introduced in the next section.
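The difference between these two baselines amounts to the key under which triphone states are pooled before clustering begins. The following minimal sketch assumes a hypothetical naming convention in which each phone label carries a language tag, such as a_xh for Xhosa [a]; the paper does not prescribe any particular scheme.

def basephone_key(phone: str, language_specific: bool) -> str:
    """Key under which triphone states are pooled before clustering.

    language_specific=True : Xhosa [a] and Zulu [a] stay separate
                             (Section 4, Fig. 1).
    language_specific=False: all phones with the same IPA label share
                             one pool regardless of language
                             (Section 5, Fig. 2)."""
    ipa, lang = phone.rsplit("_", 1)
    return phone if language_specific else ipa

print(basephone_key("a_xh", True))   # 'a_xh' -> separate Xhosa model
print(basephone_key("a_zu", False))  # 'a'    -> shared IPA pool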
Fig. 2. Language-independent acoustic models. (The figure shows the Xhosa and Zulu HMMs for the triphone [b]–[a]+[d] sharing a single set of states and transition probabilities.)
Fig. 2 illustrates the structure of the language-independent models, again for just two languages, Xhosa and Zulu, and a single triphone. Both languages share the same Gaussian mixture probability distributions, as well as the same HMM transition probabilities.

6. Multilingual acoustic models

The development of our multilingual model set is similar to that of the language-independent model set. The state tying process begins by pooling all triphones of all four languages corresponding to the same basephone. However, in this case the set of decision-tree questions takes into account not only the phonetic character of the left or right context, but also the language of the basephone. The HMM states of two triphones with the same IPA symbols but from different languages can therefore be kept separate if there is a significant acoustic difference between them, or merged if there is not. For example, a pool of triphones with basephone [a] can be split by a question separating the Zulu triphones from the triphones belonging to the other languages. This allows tying across languages when the triphone states are acoustically similar, and separate modelling of the same triphone state for different languages when there are differences. The use of language questions in decision trees was first proposed in Schultz and Waibel (1998) and in Ward et al. (1998).

The structure of such multilingual acoustic models is shown in Fig. 3. Here the centre state of the triphone [b]–[a]+[d] is tied across languages, but the first and last states are modelled separately for each language. In our experiments, the transition probabilities of all triphones with the same basephone were tied across all four languages before embedded re-estimation. The language-specific, language-independent and multilingual acoustic models described in this and the previous two sections respectively resemble the "ML-sep", "ML-mix" and "ML-tag" approaches applied to tied-mixture HMM systems by Schultz and Waibel (2001).
Fig. 3. Multilingual acoustic models. (The figure shows Xhosa and Zulu HMMs for the triphone [b]–[a]+[d] in which the centre state is shared between the two languages while the first and last states remain language-specific.)
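In implementation terms, only the question set of the clustering procedure sketched in Section 3 needs to be extended. The sketch below reuses the split_gain() routine from that sketch and assumes, as before, that each state record carries its left and right phonetic contexts and the language of its basephone; all names are illustrative.

VOWELS = {"a", "E", "i", "O", "u"}

def make_questions(languages=("afr", "eng", "xho", "zul")):
    questions = []
    # conventional phonetic-context questions
    questions.append(("L-context vowel", lambda s: s["left"] in VOWELS))
    questions.append(("R-context silence", lambda s: s["right"] == "sil"))
    # language questions: split the basephone pool by language membership
    for lang in languages:
        questions.append((f"Lang={lang}", lambda s, l=lang: s["lang"] == l))
    return questions

def best_split(states, questions):
    """Choose the question with the largest training-set likelihood gain;
    language and phonetic questions compete on an equal footing."""
    return max(questions, key=lambda q: split_gain(states, q[1]))

Because the two question types compete on likelihood gain alone, the tree is free to keep the states of acoustically distinct languages apart while tying those that are similar.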
7. Experimental results

We have applied the acoustic modelling approaches described in Sections 4–6 to the combination of the Afrikaans, English, Xhosa and Zulu training data described in Section 2. Since the optimal number of parameters for the acoustic models was not known, several sets of HMMs were produced by varying the likelihood-improvement threshold used during decision-tree clustering, as described in Section 3. Decision-tree clustering was carried out using HMM sets with a single Gaussian density per state. Clustering was followed by four iterations of embedded Baum–Welch re-estimation. The number of mixtures per state was then gradually increased to eight, each such increase being followed by a further four iterations of embedded training.

7.1. Analysis of recognition performance

The performance, in terms of average phone accuracy, of the final 8-mixture HMM sets measured on the evaluation test set is shown in Fig. 4. A single curve indicating the average accuracy over all four test sets is shown for each system type. The number of states in the language-specific configuration was taken to be the sum of the number of states in each component language-specific HMM set. The number of states in the multilingual system is the total number of unique states remaining after decision-tree clustering, and hence takes cross-language sharing into account.
Fig. 4. Average evaluation test set phone accuracies of language-specific, multilingual and language-independent systems as a function of the total number of distinct HMM states.

Fig. 4 indicates that, for most systems, pooling the data between languages for phones with the same IPA symbol leads to a deterioration in performance. This agrees with the findings of other studies (Köhler, 2001; Schultz and Waibel, 2001; Uebler, 2001). However, in contrast to previous findings, the results also indicate that the multilingual systems achieved modest average performance improvements relative to both the language-specific and the language-independent systems over all the model sizes in the range considered. This suggests that beneficial cross-lingual sharing is in fact taking place. The improvement achieved by the best multilingual system relative to the best language-specific system shown in Fig. 4 is approximately 0.1% absolute, and has been determined to be statistically significant only at the 80% level using bootstrap confidence interval estimation (Bisani and Ney, 2004). Nevertheless, the improvements are achieved consistently over different model sizes.

Closer inspection of the results shows that the average improvement is usually accompanied by per-language gains. Fig. 5 shows the recognition performance of the multilingual and the language-specific systems measured separately on the evaluation test set of each language. The differing phone recognition accuracies are a result of the differing phone sets used by each language, as well as of the different language model perplexities. In particular, Afrikaans, which has the highest phone accuracies, has the lowest language model perplexity. In contrast, Xhosa and Zulu, which have the lowest accuracies, have the largest phone sets. Finally, English displays the second highest phone accuracies and has the smallest phone set, but has the highest language model perplexity of all.

Fig. 5 indicates that in the majority of cases the multilingual system delivers improved performance relative to the corresponding language-specific system. Xhosa and Zulu usually show larger improvements than Afrikaans or English. This may be due to the greater phonetic similarity between the former two languages, and their associated larger mutual triphone coverage, as reflected in Table 4. Furthermore, an analysis of how all possible groupings of three languages cover the triphone tokens in the remaining language is shown in Table 7. The percentages in this table indicate that Xhosa and Zulu triphones are well accounted for by the triphones of Afrikaans, English and Zulu, and of Afrikaans, English and Xhosa, respectively. Afrikaans and English triphones, on the other hand, are less well represented. Since the clustering process optimises the overall likelihood, improvements for each individual language are not guaranteed. Indeed, Fig. 5 demonstrates that unchanged or slightly deteriorated per-language performance does sometimes occur, despite an improvement in the associated overall average.
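The significance estimate can be sketched as follows, in the spirit of Bisani and Ney (2004): test utterances are resampled with replacement, and the accuracy difference between two systems is recomputed for each resample. Per-utterance error counts and reference lengths are assumed to be available; the details of the published method may differ.

import numpy as np

def bootstrap_diff(errs_a, errs_b, lengths, n_resamples=10_000, seed=1):
    """Distribution of the accuracy difference (system B minus system A)
    under resampling of test utterances with replacement."""
    rng = np.random.default_rng(seed)
    errs_a, errs_b, lengths = map(np.asarray, (errs_a, errs_b, lengths))
    n = len(lengths)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, n)
        tot = lengths[idx].sum()
        # (errA - errB) / tot equals the difference in accuracy
        diffs[i] = (errs_a[idx].sum() - errs_b[idx].sum()) / tot
    return diffs

# e.g. the fraction of resamples in which system B beats system A:
# conf = (bootstrap_diff(e_base, e_multi, ref_lens) > 0).mean()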
Fig. 5. Phone accuracies of multilingual (ML) and language-specific (LS) systems as a function of the total number of distinct HMM states, measured separately for the Afrikaans (A), English (E), Xhosa (X) and Zulu (Z) evaluation test sets.

Table 7
Degree to which the triphone types of each combination of three languages cover the training set triphone tokens of the remaining language

Triphone types of the pool of three languages   Remaining language   Triphone coverage (%)
Afrikaans, English and Xhosa                    Zulu                 89.7
Afrikaans, English and Zulu                     Xhosa                89.3
Afrikaans, Xhosa and Zulu                       English              35.9
English, Xhosa and Zulu                         Afrikaans            49.5

7.2. Analysis of the decision tree

Inspection of the type of question most frequently used during clustering reveals that language-based questions are most common at the root of the decision tree, and become increasingly less frequent towards the leaves. Fig. 6 analyses the decision tree of the largest multilingual system (with almost 20,000 states), and shows that approximately 45% of all questions at the root nodes are language-based, and that this proportion drops to 22% and 15% for the roots' children and grandchildren respectively.

This behaviour is also reflected in the contribution to the reduction in overall log-likelihood made by the language-based and the phonetically-based questions respectively during the decision-tree growing process. In Fig. 7 this contribution is shown, again as a function of the depth within the decision tree. It is evident that, at the root nodes, the greatest log-likelihood reduction is afforded by the language-based questions (approximately 74% of the total reduction), while the phonetically-based questions make the greatest contribution thereafter. This indicates that the decision tree often quickly partitions the models into language-based groups, after which further refinements to the tree are based more on phonetic distinctions.
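The statistics underlying Figs. 6 and 7 can be gathered by walking each decision tree and tallying, per depth, the number of splits and the likelihood gain attributable to each question type. The node structure and the Lang= question-naming convention below follow the earlier illustrative sketches rather than the paper.

from collections import defaultdict

def tally(node, depth=0, counts=None, gains=None):
    """Per-depth counts and log-likelihood gains of language-based versus
    phonetically-based questions; index 0 holds the language-based totals,
    index 1 the phonetically-based ones."""
    counts = counts if counts is not None else defaultdict(lambda: [0, 0])
    gains = gains if gains is not None else defaultdict(lambda: [0.0, 0.0])
    if node is not None and not node.is_leaf:
        kind = 0 if node.question.startswith("Lang=") else 1
        counts[depth][kind] += 1
        gains[depth][kind] += node.gain
        tally(node.yes, depth + 1, counts, gains)
        tally(node.no, depth + 1, counts, gains)
    return counts, gains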
Fig. 6. Analysis showing the percentage of questions that are language-based at various depths within the multilingual decision tree.

Fig. 7. Analysis showing the contribution made to the reduction in overall log-likelihood by the language-based and phonetically-based questions respectively, at various depths within the multilingual decision tree.

7.3. Analysis of cross-language data sharing
In order to determine to what extent, and for which languages, data sharing ultimately takes place in the various multilingual systems, we have determined the proportions of decision-tree leaf nodes (which correspond to the state clusters) that are populated by states of exactly one, two, three or four languages respectively. A cluster populated by states of a single language is a unilingual cluster, and indicates that no sharing with other languages has taken place. A cluster populated by states of all four languages, on the other hand, indicates that sharing across all four languages has taken place. Fig. 8 illustrates how these proportions change as a function of the total number of clustered states in the system.

Fig. 8. Proportion of state clusters combining data from one (unilingual), two (bilingual), three (trilingual) and four (quadrilingual) languages, as a function of the number of clustered states in the multilingual HMM set.

From Fig. 8 it is apparent that, as the number of clustered states is increased, the proportion of clusters consisting of a single language also increases. This indicates that the progressive specialisation of the decision tree tends to produce separate clusters for each language, as one would find in a language-specific system. The proportion of clusters containing two, three and all four languages shows a commensurate decrease as the number of clustered states increases. Nevertheless, for an acoustic model with 10,000 states, which according to Fig. 4 yields approximately optimal recognition accuracy, around 20% of the state clusters contain a mixture of languages, demonstrating that a significant degree of sharing is taking place.

In order to determine which languages are shared most often by the clustering procedure, Figs. 9 and 10 analyse the proportion of state clusters that consist of groups of two and of three languages respectively. From Fig. 9 we see that the largest proportion of two-language clusters is due to the combination of Xhosa and Zulu. This agrees with our earlier observations of the high phonetic similarity of these languages, even at the triphone level. Afrikaans and English are the second most frequent language combination, but are much less common than the Xhosa and Zulu clusters. Fig. 10 shows that, in cases where three languages are clustered together, the combinations of Afrikaans, Xhosa and Zulu, and of English, Xhosa and Zulu are similarly frequent. In contrast, clusters containing Afrikaans, English and one of the Nguni languages are much less common.
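The proportions shown in Figs. 8–10 can be read directly off the decision-tree leaves. The following minimal sketch again assumes state records carrying an illustrative lang field, as in the earlier sketches.

from collections import Counter

def sharing_profile(clusters):
    """Histogram over the number of distinct languages per state cluster,
    plus a count of each specific language combination."""
    sizes = Counter()
    combos = Counter()
    for cluster in clusters:
        langs = frozenset(s["lang"] for s in cluster)
        sizes[len(langs)] += 1          # 1=unilingual ... 4=quadrilingual
        combos[langs] += 1              # e.g. {'xho','zul'} bilingual pairs
    total = sum(sizes.values())
    return {k: 100.0 * v / total for k, v in sizes.items()}, combos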
Fig. 9. Proportion of state clusters combining data from various combinations of two languages (bilingual clusters).

Fig. 10. Proportion of state clusters combining data from various combinations of three languages (trilingual clusters).

8. Summary and conclusions

We have demonstrated that decision-tree state clustering can be employed to obtain tied-state multilingual acoustic models by allowing sharing between basephones of different languages and introducing decision-tree questions that relate to the language of a particular basephone. This mode of clustering was used to combine Afrikaans, English, Xhosa and Zulu acoustic models. Improvements over language-specific as well as language-independent systems were observed. Further analysis showed that language-based decision-tree questions play a dominant role at and near the root node, indicating that language specialisation occurs quickly during tree construction. However, a significant proportion of state clusters was also observed to contain more than one language, from which we conclude that language sharing occurs regularly in the multilingual acoustic models. Hence this technique promises to be useful for the development of acoustic models for use within multilingual speech recognition systems.

Acknowledgements

The author thanks Febe de Wet, Thomas Hain and Tanja Schultz for their very helpful suggestions. This work was supported by the South African National Research Foundation (NRF) under grant number FA2005022300010.

Appendix

The following table presents a list of the phones used in the Afrikaans, English, Xhosa and Zulu databases. An example of a word in which the phone occurs is given for each language. When an example word appears in italics, it indicates that it has been borrowed from one of the other languages during code switching.
Description | IPA | Afrikaans | English | Xhosa | Zulu

Stops
Voiceless Bilabial Plosive | p | pop | spit | Cape | Cape
Aspirated Bilabial Plosive | pʰ | | | phila | phila
Ejective Bilabial Plosive | p' | | | pasa | impi
Voiced Bilabial Plosive | b | baba | baby | imbuzi | imbuzi
Devoiced Bilabial Plosive | b̥ | | | bhala | bholoho
Voiced Bilabial Implosive | ɓ | | | ubawo | ubaba
Voiceless Alveolar Plosive | t | tot | total | Atlanta | Atlanta
Aspirated Alveolar Plosive | tʰ | | | thatha | isikhathi
Ejective Alveolar Plosive | t' | | | itakane | ikati
Voiced Alveolar Plosive | d | daar | death | indoda | isidala
Devoiced Alveolar Plosive | d̥ | | | amadoda | ngishadile
Ejective Palatal Plosive | c' | | | ukutya | uMatyakalana
Aspirated Palatal Plosive | cʰ | | | ityhefu |
Voiced Palatal Plosive | ɟ | | | indyevo |
Voiceless Velar Plosive | k | koek | kick | Brakpan | Brakpan
Aspirated Velar Plosive | kʰ | | | yikha | -khipha
Ejective Velar Plosive | k' | | | kakubi | kanti
Voiced Velar Plosive | g | berge | gun | ingubo | ugogo
Glottal Stop | ʔ | ver_as | co_operative | i_oyile | u_opharetha

Fricatives
Voiceless Labiodental Fricative | f | vier | four | ukufa | ukufa
Voiced Labiodental Fricative | v | water | vat | ukuvula | ukuvula
Voiceless Dental Fricative | θ | Martha | thing | thousand | think
Voiced Dental Fricative | ð | Northam | this | the | the
Voiceless Alveolar Fricative | s | sies | some | sala | ukusala
Voiced Alveolar Fricative | z | zoem | zero | ukuzama | ukuzama
Voiceless Post-Alveolar Fricative | ʃ | sjoe | shine | kushushu | shisa
Voiced Post-Alveolar Fricative | ʒ | genre | genre | genre | genre
Voiceless Velar Fricative | x | gaan | Gauteng | irhafu | Gauteng
Voiceless Glottal Fricative | h | Bethlehem | hand | huhuza | huhuluza
Voiced Glottal Fricative | ɦ | hand | Johannes | ihashe | ihhashi
Voiceless Alveolar Lateral Fricative | ɬ | Umhlanga | | hlala | hlala
Voiced Alveolar Lateral Fricative | ɮ | | | dlala | dlala

Affricates
Ejective Alveolar Lateral Affricate | tɬ' | | | intloko | inhloko
Post-Alveolar Affricate | tʃ | tjokker | chocolate | tshixa | intsha
Aspirated Post-Alveolar Affricate | tʃʰ | | | ukutshisa | uputshuza
Ejective Alveolar Affricate | ts' | | | ukutsiba | tsotsi
Aspirated Alveolar Affricate | tsʰ | | | isithsaba | Botshabelo
Voiced Post-Alveolar Affricate | dʒ | John | jug | inja | juba
Voiced Alveolar Affricate | dz | | | amanzi | amanzi
Voiced Lateral Alveolar Affricate | dɮ | | | indlovu | indlala
Ejective Velar Affricate | kx' | | | ikrele | -kleleba

Clicks
Dental Click | ǀ | | | cinga | cima
Nasalised Dental Click | ŋǀ | | | nceda | ncane
Voiced Nasalised Dental Click | ŋɡǀ | | | iingcango | ingcazi
Voiced Dental Click | ɡǀ | | | gcina | gcina
Aspirated Dental Click | ǀʰ | | | chaza | chitha
Alveolar Lateral Click | ǁ | | | xela | uxolo
Nasalised Alveolar Lateral Click | ŋǁ | | | nxila | nxusa
Voiced Nasalised Alveolar Lateral Click | ŋɡǁ | | | ingxelo | ingxenye
Voiced Alveolar Lateral Click | ɡǁ | | | gxeka | gxuma
Aspirated Alveolar Lateral Click | ǁʰ | | | xhasa | xhuma
Palatal Click | ǃ | | | qiqa | qoma
Nasalised Palatal Click | ŋǃ | | | nqaba | inqola
Voiced Nasalised Palatal Click | ŋɡǃ | | | ngqukuva | ngqatha
Voiced Palatal Click | ɡǃ | | | gquba | gqabula
Aspirated Palatal Click | ǃʰ | | | qhuba | qhaqha

Trills and Flaps
Alveolar Trill | r | roer | Hartenbos | isitrato | egaraji
Uvular Trill | ʀ | roer | | ndigraya |
Alveolar Flap | ɾ | vat dit | forty | quarter | quarter

Approximants
Alveolar Approximant | ɹ | Brixton | red | | red
Alveolar Lateral Approximant | l | lag | legs | lala | lala
Palatal Approximant | j | jas | yes | yima | yima
Voiced Labio-velar Approximant | w | William | west | wela | wathuza

Nasals
Bilabial Nasal | m | ma | man | mama | umama
Syllabic Bilabial Nasal | m̩ | | | umfana | ngiyambona
Alveolar Nasal | n | non | not | iyana | -na
Palatal Nasal | ɲ | mandjie | | inyama | inyama
Velar Nasal | ŋ | lang | | ingubo | ingane

Vowels
High Front Vowel | i | siek | Piet | impilo | impilo
High Front Vowel with duration | iː | ski | keep | impilo | impilo
Lax Front Vowel | ɪ | Brixton | him | Cecil | Cecil
Rounded High Front Vowel | y | u | | u | u
Rounded High Front Vowel with duration | yː | ekskuus | | |
High Back Vowel | u | boek | | vulani | vulani
High Back Vowel with duration | uː | boer | blue | vula | vula
Lax Back Vowel | ʊ | Woodstock | push | Newtown | Newtown
Mid-high Front Vowel | e | been | | ndithengile | ngithengile
Mid-high Front Vowel with duration | eː | | | isefu | sengizwe
Rounded Mid-high Front Vowel | ø | neus | | |
Rounded Mid-high Front Vowel with duration | øː | keuse | | |
Rounded Mid-high Back Vowel | o | Sibongile | | |
Rounded Mid-high Back Vowel with duration | oː | hoop | | ukubonisa | ukubonisa
Mid-low Front Vowel | ɛ | mes | nest | Themba | thenga
Mid-low Front Vowel with duration | ɛː | lê | fairy | aneesenti | nembeza
Rounded Mid-low Front Vowel | œ | brug | nurse | |
Rounded Mid-low Front Vowel with duration | œː | brue | burst | |
Central Vowel with duration | ɜː | wîe | turn | third | third
Rounded Mid-low Back Vowel | ɔ | mos | Hartenbos | molo | ubhalo
Rounded Mid-low Back Vowel with duration | ɔː | môre | bore | anethoba | ncokola
Low Back Vowel | ɑ | McDonald's | hot | box | box
Lax Mid-low Vowel | ʌ | public | hut | hut | hut
Low Central Vowel | a | agter | Garsfontein | sala | ukusala
Low Central Vowel with duration | aː | saak | Klerksdorp | apha | ukusala
Low Back Vowel with duration | ɑː | master's | harp | | August
Central Vowel (Schwa) | ə | niks | the | degree | degree
Mid-low Front Vowel | æ | eg, help | average | camp | camp
Mid-low Front Vowel with duration | æː | ver | dad | camp | camp

Diphthongs and Triphthongs
The phone set further includes a substantial inventory of diphthongs and triphthongs, several of which occur across word boundaries in spontaneous speech. Recoverable entries include [ɔi] (Afrikaans rotjie, English point, Zulu Joyce), [əi] (Afrikaans allerlei; the spelled letter "H"), [iu] (Afrikaans geoloog, English Beaufort), [iə] (Afrikaans inisiatief, English financial) and [iɔ] (Afrikaans Centurion). Further example words appearing in this part of the table include, for Afrikaans: Pretorius, vegetariër, luiaard, moeiliker, nou, phone, poegaai, baaie, Alpine, Camperdown, Landsdown, weggooi, eeu, Kleinmond, Francois, twee en and hy is; and, for English and (as borrowings) Xhosa and Zulu: here, hear, nearly, hope, Laaiplek, fine, power, south, road, drive, Cape, February, boy, gymnasium, waste, play, there, poor, Georgalli, Bryan, the spelled sequence "TON" and thousand.
References

Bisani, M., Ney, H., 2004. Bootstrap estimates for confidence intervals in ASR performance evaluation. In: Proc. ICASSP, Montreal, Canada.
International Phonetic Association, 2001. Handbook of the International Phonetic Association. Cambridge University Press.
Katz, S., 1987. Estimation of probabilities from sparse data for the language model component of a speech recogniser. IEEE Trans. Acoust. Speech Signal Process. 35 (3), 400–401.
Köhler, J., 2001. Multi-lingual phone models for vocabulary-independent speech recognition tasks. Speech Comm. 35 (1–2), 21–30.
Louw, P., 2005. A new definition of Xhosa grapheme-to-phoneme rules for automatic transcription. South Afr. J. Afr. Lang. 25 (2), 71–91.
Louw, P., Roux, J., Botha, E., 2001. African Speech Technology (AST) telephone speech databases: corpus design and contents. In: Proc. Eurospeech, Aalborg, Denmark.
Ney, H., Essen, U., Kneser, R., 1994. On structuring probabilistic dependencies in stochastic language modelling. Comput. Speech Lang. 8 (1), 1–38.
Niesler, T., Louw, P., Roux, J., 2005. Phonetic analysis of Afrikaans, English, Xhosa and Zulu using South African speech databases. South. Afr. Linguist. Appl. Lang. Stud. 23 (4), 459–474.
Roux, J., Louw, P., Niesler, T., 2004. The African Speech Technology project: an assessment. In: Proc. LREC, Lisbon, Portugal.
Schultz, T., Waibel, A., 1998. Language independent and language adaptive large vocabulary speech recognition. In: Proc. ICSLP, Sydney, Australia.
Schultz, T., Waibel, A., 2000. Polyphone decision tree specialisation for language adaptation. In: Proc. ICASSP, Istanbul, Turkey.
Schultz, T., Waibel, A., 2001. Language independent and language adaptive acoustic modelling for speech recognition. Speech Comm. 35 (1–2), 31–51.
Statistics South Africa (Ed.), 2004. Census 2001: Primary tables South Africa: Census 1996 and 2001 compared. Statistics South Africa.
Uebler, U., 2001. Multilingual speech recognition in seven languages. Speech Comm. 35 (1–2), 53–69.
Uebler, U., Schüßler, M., Niemann, H., 1998. Bilingual and dialectal adaptation and retraining. In: Proc. ICSLP, Sydney, Australia.
Waibel, A., Geutner, P., Mayfield-Tomokiyo, L., Schultz, T., Woszczyna, M., 2000. Multilinguality in speech and spoken language systems. Proc. IEEE 88 (8), 1297–1313.
Ward, T., Roukos, S., Neti, C., Gros, J., Epstein, M., Dharanipragada, 1998. Towards speech understanding across multiple languages. In: Proc. ICSLP, Sydney, Australia.
Webb, V., 2002. Language policy development in South Africa. In: World Congress on Language Policies, Barcelona, Spain.
Wissing, D., Martens, J.-P., Janke, U., Goedertier, W., 2004. A spoken Afrikaans language resource designed for research on pronunciation variations. In: Proc. LREC, Lisbon, Portugal.
Young, S., Evermann, G., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2002. The HTK Book, version 3.2.1. Cambridge University Engineering Department.
Young, S., Odell, J., Woodland, P., 1994. Tree-based state tying for high accuracy acoustic modelling. In: Proc. Workshop on Human Language Technology. Morgan Kaufmann, pp. 307–312.