Speech Communication 48 (2006) 532–548

Boosting HMM acoustic models in large vocabulary speech recognition

Carsten Meyer, Hauke Schramm
Philips Research Laboratories, Weisshausstrasse 2, D-52066 Aachen, Germany

Received 15 April 2005; received in revised form 5 September 2005; accepted 21 September 2005
Abstract

Boosting algorithms have been successfully used to improve performance in a variety of classification tasks. Here, we suggest an approach to apply a popular boosting algorithm (called "AdaBoost.M2") to Hidden Markov Model based speech recognizers, at the level of utterances. In a variety of recognition tasks we show that boosting significantly improves the best test error rates obtained with standard maximum likelihood training. In addition, results in several isolated word decoding experiments show that boosting may also provide further performance gains over discriminative training, when both training techniques are combined. In our experiments this also holds when comparing final classifiers with a similar number of parameters and when evaluating in decoding conditions with lexical and acoustic mismatch to the training conditions. Moreover, we present an extension of our algorithm to large vocabulary continuous speech recognition, allowing online recognition without further processing of N-best lists or word lattices. This is achieved by using a lexical approach for combining different acoustic models in decoding. In particular, we introduce a weighted summation over an extended set of alternative pronunciation models representing both the boosted models and the baseline model. In this way, arbitrarily long utterances can be recognized by the boosted ensemble in a single pass decoding framework. Evaluation results are presented on two tasks: a real-life spontaneous speech dictation task with a 60k word vocabulary and Switchboard.

Keywords: Boosting; AdaBoost; Machine learning; Acoustic model training; Spontaneous speech; Automatic speech recognition
1. Introduction

Boosting is a general method for improving the accuracy of almost any learning algorithm (Schapire, 1990). The technique works by sequentially training and combining a collection of classifiers in such a way that the later classifiers focus more and more on misclassified examples. To this end, in the AdaBoost algorithm (Freund and Schapire,
1997), a probability distribution is introduced and updated on the input space. Initially, every training example gets the same weight. In the following iterations, the weights of misclassified examples are increased, according to the classification result of the current classifier on the training set. Using the calculated training weights, a new classifier is trained. This process is repeated, thus producing a set of individual (‘‘base’’) classifiers. The output of the final classifier is then determined from a linear combination of the individual classifier scores.
Experimental results in a variety of classification tasks have shown AdaBoost's remarkable success in improving the classifier accuracy and generalization properties (see e.g. Schapire, 2002). This was also supported by theoretical analysis (Schapire et al., 1998), showing that AdaBoost is especially effective at increasing the margins of the training examples, i.e. the difference between the classifier score for the correct label and the score for the closest incorrect label.

The main motivation to apply AdaBoost to speech recognition is its theoretical foundation providing explicit bounds on the training error and, in terms of margins, on the generalization error (Schapire et al., 1998; Koltchinskii and Panchenko, 2002). Improving generalization performance on independent test data is still a crucial issue in pattern recognition, and in particular in speech recognition. This holds for "matched conditions" as well as for "mismatched conditions", exhibiting any kind of mismatch (environment, domain, noise etc.) to the training data.

In contrast to most previous applications of boosting, focusing on systems with a rather moderate number of classes and internal degrees of freedom, speech recognition is an extremely complex "large scale" classification problem involving tens of thousands of words to be recognized and classifiers with up to millions of internal parameters. This might be the reason why there are only a few studies so far applying boosting to acoustic model training.

In previous work, boosting has mainly been applied to classifying each individual feature vector to a phoneme symbol. This phoneme classification was performed either with neural networks in hybrid neural network/Hidden Markov Model speech recognizers (Cook and Robinson, 1996; Schwenk, 1999), or with Gaussian mixtures (Zweig and Padmanabhan, 2000). On the other hand, in "conventional" HMM speech recognizers (i.e. recognizers where the HMMs employ probabilities for direct emission of the feature vectors given a state), there is no intermediate phoneme classification with associated posterior probabilities for each individual feature vector. Therefore, the application of frame-level boosting to conventional HMM recognizers is not obvious. Recently, an alternative approach for HMM speech recognizers has been suggested based on classifying feature vector sequences to phoneme symbols (Dimitrakakis and Bengio, 2004). This, however, requires a presegmentation of the training data into phonemes, which is not easily accomplished or, if automatically
performed, may strongly depend on the acoustic model used for the alignment.

On the other hand, an intuitive way of applying boosting to HMM speech recognizers is at the utterance level, since each HMM speech recognizer calculates scores for the correct and the competing utterance hypotheses. Therefore, we proposed to apply AdaBoost to utterance classification, i.e. to the mapping of the whole feature vector sequence to the corresponding word sequence (the "utterance approach" for boosting HMM acoustic models; Meyer, 2002). Thus, boosting is used to improve upon an initial ranking of candidate word sequences, similar to algorithms developed for natural language parsing (Collins, 2000) and named entity extraction (Collins, 2002). Apart from being applicable to recognizers without explicit phoneme classification as well as to non-segmented data, the utterance approach has two advantages: First, it is directly related to the sentence error rate, since a monotonous function of the total decoding score is used as classification score (see Section 4.1). Second, it is computationally much less expensive than boosting applied at the level of feature vectors.

In this paper, we show in a variety of recognition tasks that boosting HMM recognizers at the utterance level significantly improves the best test error rates obtained with standard maximum likelihood training. To this end, we summarize and extend previous experimental results in large vocabulary isolated word recognition, comparing utterance-level boosting to maximum likelihood training as a function of the number of densities of the final classifier (Meyer, 2002). In addition, we provide results in isolated word decoding tasks demonstrating that boosting can even improve upon results obtained with discriminative training, by combining both training techniques (Meyer and Beyerlein, 2002). In our experiments, this holds even when comparing classifiers with a similar number of parameters, and is not limited to matched decoding conditions.

The second issue of our paper is the extension of our algorithm to large-scale continuous speech recognition. Recently, Zhang and Rudnicky (2003) applied the utterance approach to a dialogue task in the travel domain with a 10k word vocabulary using an offline decoding scheme. Here, we suggest an online (single pass) decoding framework, by combining the boosted and the baseline acoustic models at the lexical level. Thus, there is no need for further processing of N-best lists or word lattices in recognition, which allows decoding of arbitrarily
long utterances with the boosted ensemble in the first decoding pass. The basic idea of our approach is to introduce an additional set of "symbolic" pronunciation models in each boosting iteration, representing the underlying boosted acoustic model. A related approach has been introduced by Zheng et al. (2000) to model different speaking rates, where at each word position only the best matching pronunciation variant was selected. In our approach, the key point is the weighted summation over the extended set of pronunciation variants representing the boosted and the baseline models, thus realizing the summation of the contributions of the underlying acoustic models. We present experimental results for a real-life spontaneous speech dictation task with a 60k word vocabulary as well as for the Switchboard task. Preliminary results of this work have been reported in (Meyer and Schramm, 2004).

The rest of the paper is organized as follows: In the next section, we describe the standard AdaBoost.M2 algorithm. In Section 3, we briefly review the maximum likelihood and discriminative training criteria for acoustic model training. Then, in Section 4, we outline our utterance approach for applying AdaBoost to Hidden Markov Model based speech recognizers. In particular, we introduce the lexical approach to combine boosted acoustic models in decoding of continuous speech (Section 4.2.2). Experimental results are presented in Section 5 for isolated word recognition and in Section 6 for large-scale continuous speech recognition. Finally, in Section 7, we summarize our findings and discuss future directions of work. The theory for acoustic model combination at the lexical level is presented in Appendix A.
2. The AdaBoost.M2 algorithm

The AdaBoost algorithm (Schapire, 1990) was presented for transforming a "weak" learning rule into a "strong" one (in the sense of the "probably approximately correct" (PAC) learning model; Valiant, 1984). It takes as input a labeled training set (x1, y1), ..., (xN, yN), where xi ∈ X denotes the input patterns and yi ∈ Y = {1, ..., k} the (correct) label from a set of k labels. The basic idea is to train a series of classifiers based on the classification performance of the previous classifier on the training data. In multi-class classification (k > 2), a popular variant is the AdaBoost.M2 algorithm (Freund and Schapire, 1997), which is a particular case of a more general AdaBoost algorithm for multi-class problems based on ranking loss (Schapire and Singer, 1999). AdaBoost.M2 (in the following simply referred to as AdaBoost) is applicable when a mapping ht: X × Y → [0, 1] can be defined for classifier ht which is related to the classification criterion (i.e. the output ŷ of classifier ht for the input pattern x is given by ŷ = arg max_{y ∈ Y} ht(x, y)).

The basic algorithm is outlined in Table 1. AdaBoost iterates a distribution Dt(i, y) over the example/label pairs (xi, y) (being 0 for the correct label yi: Dt(i, yi) = 0). This distribution is updated according to the output of the current classifier on the training set: weights for example/label pairs belonging to confidently classified examples are reduced relative to those of misclassified examples. Then, in each iteration t, a new classifier is trained with respect to the distribution Dt. In recognition, the classification scores ht(x, y) of the individual classifiers are linearly combined, with weights ln(1/β_t) being inversely related to the training error of the individual classifiers, to give the final output

  H(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \ln\left(\frac{1}{\beta_t}\right) h_t(x, y)                    (1)

(see Table 1). The update rule is designed to guarantee an upper bound on the training error of the combined classifier which is exponentially decreasing with the number of individual classifiers (Freund and Schapire, 1997). The generalization performance has been analyzed in terms of separation margins (Schapire et al., 1998; Koltchinskii and Panchenko, 2002).

Table 1
The AdaBoost.M2 algorithm

Input:    Sequence of examples (x1, y1), ..., (xN, yN) with xi ∈ X and labels yi ∈ Y = {1, ..., k}.
Init:     Define B = {(i, y): i ∈ {1, ..., N}, y ≠ yi}, D1(i, y) = 1/|B| for all (i, y) ∈ B, else 0,
          where |B| is the number of elements of the set B.
Repeat:   1. Train the weak learner using distribution Dt.
          2. Get the weak hypothesis ht: X × Y → [0, 1] and calculate the pseudo-loss
             ε_t = (1/2) Σ_{i=1}^{N} Σ_{y ≠ yi} Dt(i, y) (1 − ht(xi, yi) + ht(xi, y)).
          3. Set β_t = ε_t / (1 − ε_t).
          4. Update the distribution Dt:
             D_{t+1}(i, y) = (Dt(i, y) / Zt) · β_t^{(1/2)(1 + ht(xi, yi) − ht(xi, y))},
             where Zt is a constant normalizing D_{t+1}.
Output:   Final hypothesis: H(x) = arg max_{y ∈ Y} Σ_t (ln 1/β_t) ht(x, y).
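For concreteness, the following minimal sketch implements one round of the Table 1 recipe on precomputed hypothesis scores. The array layout and all names are ours, chosen for illustration, and are not part of any existing toolkit.

import numpy as np

def adaboost_m2_round(h, D, correct):
    """One AdaBoost.M2 round on precomputed scores (illustrative sketch, not from the paper).

    h       : array (N, K), h[i, y] in [0, 1] for every example i and label y
    D       : array (N, K), current distribution over example/label pairs,
              with D[i, correct[i]] = 0 and sum(D) = 1
    correct : array (N,), index of the correct label of each example
    Returns the classifier weight ln(1/beta_t) and the updated distribution.
    """
    N = len(correct)
    h_corr = h[np.arange(N), correct]                  # h_t(x_i, y_i)
    # Pseudo-loss: eps_t = 1/2 * sum_i sum_{y != y_i} D(i, y) (1 - h(x_i, y_i) + h(x_i, y))
    eps = 0.5 * np.sum(D * (1.0 - h_corr[:, None] + h))
    beta = eps / (1.0 - eps)
    # Update: D_{t+1}(i, y) proportional to D_t(i, y) * beta^{(1/2)(1 + h(x_i, y_i) - h(x_i, y))}
    D_new = D * beta ** (0.5 * (1.0 + h_corr[:, None] - h))
    D_new[np.arange(N), correct] = 0.0                 # keep correct labels at zero
    D_new /= D_new.sum()                               # Z_t normalization
    return np.log(1.0 / beta), D_new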
The margin of a classification example is defined as the difference between the classifier score ht(xi, yi) for the correct class and the highest score ht(xi, y) for any wrong class y ≠ yi. Thus, the margin is positive if and only if the example is classified correctly. Moreover, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. In a theoretical analysis, Schapire et al. (1998) proved that larger margins on the training set translate into a reduced upper bound on the generalization error, and that boosting is especially effective in increasing the margins of the training examples.

In multi-class problems where the learning rule does not explicitly take alternative hypotheses y into account (as is the case e.g. for maximum likelihood training), the weights Dt(i, y) are summed up (Schwenk, 1999; Zweig and Padmanabhan, 2000) to give a weight wt(i) for each training pattern i:

  w_t(i) = \sum_{y \neq y_i} D_t(i, y).                                                                    (2)
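Continuing the hypothetical arrays of the previous sketch, margins and the per-pattern weights of Eq. (2) can be read off as follows:

import numpy as np

def margins_and_weights(h, D, correct):
    """Margins and per-pattern weights (illustrative sketch, array layout assumed as above).

    margin_i = h(x_i, y_i) - max_{y != y_i} h(x_i, y)   (positive iff correctly classified)
    w_t(i)   = sum_{y != y_i} D_t(i, y)                 (Eq. (2))
    """
    N = len(correct)
    h_corr = h[np.arange(N), correct]
    h_rival = h.copy()
    h_rival[np.arange(N), correct] = -np.inf            # exclude the correct label
    margins = h_corr - h_rival.max(axis=1)
    weights = D.sum(axis=1)                             # D[i, y_i] is already 0
    return margins, weights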
3. Speech recognition: common training criteria

For decoding a word sequence ŷ (out of a set of possible word sequences y), given a spoken utterance x (acoustic observation), the Bayes decision rule is commonly applied:

  \hat{y} = \arg\max_{y} p(y \mid x)                                                                       (3)
          = \arg\max_{y} \left( p(x \mid y) \, p(y) \right).                                               (4)

Here, the language model p(y) measures the prior probability for the word sequence y. The acoustic model p(x|y) denotes the probability for observing an acoustic feature sequence x, given a word sequence y. In our experiments, we use Hidden Markov Models (HMMs) for representing the acoustic model. In automatic speech recognition (ASR), both probability models are approximated on training data, employing a range of thousands to millions of parameters.

For acoustic model training, the standard training paradigm is maximum likelihood (ML) training. Given a set of training utterances xi, i = 1, ..., N with spoken (correct) word sequences yi, the criterion amounts to maximizing

  F_{\mathrm{ML}}(\theta) = \sum_{i=1}^{N} \log p_\theta(x_i \mid y_i),                                     (5)

where θ represents the set of acoustic model parameters (which, for large scale applications, involves about 10^5–10^6 probability density functions in 30–40 dimensions). ML training maximizes the function (5), i.e. the log-likelihood for the correct word sequence to generate the acoustic observations. Competing word hypotheses y are not taken into account. Due to incorrect modeling assumptions and insufficient training data, this results in sub-optimal models yielding non-optimal recognition accuracy.

A second training paradigm that is more directly related to the classification rule (3) is discriminative training. Several approaches have been investigated. "Minimum classification error" (MCE) training (Juang and Katagiri, 1992) directly minimizes a smoothed function of the training sentence error rate, by gradient descent methods. Another approach aims at maximizing the (empirical) mutual information between the acoustic observations and the corresponding word sequences ("MMI training"; Bahl et al., 1986):

  F_{\mathrm{MMI}}(\theta) = \sum_{i=1}^{N} \log \frac{p_\theta(x_i \mid y_i)}{\sum_{y} p_\theta(x_i \mid y) \, p(y)}.        (6)

By simultaneously decreasing the likelihood of competing hypotheses y to generate the observed acoustics, discriminative training optimizes class separability. Eq. (6) can only be solved in a complex iteration process, approximating the denominator in each iteration by data obtained from a recognition pass on the training corpus (the same holds for MCE training). This recognition pass on the training data for each iteration renders discriminative training computationally extremely involved and time-intensive (the same is true for boosting). The computations for MMI can be simplified by further restricting the denominator to the recognized text. The resulting algorithm, called "corrective training" (CT), however, gives only slight improvements if the training error is very low. An extension of the CT algorithm, called "rival training" (RT), has been proposed (Meyer and Rose, 2000), which is computationally less expensive than lattice-based discriminative training methods like MMI, but gives significantly better performance than CT. Instead of a set of competing hypotheses, RT employs only the best scored incorrect hypothesis (the "rival") in the denominator of (6), if it is sufficiently "close" to the correct hypothesis.
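To make the difference between the criteria explicit, here is a small illustrative sketch that evaluates the ML and MMI objectives from per-hypothesis log-likelihoods and log-priors; the function names and data layout are assumptions for illustration, not an interface of any particular recognizer.

import numpy as np

def ml_objective(loglik_correct):
    """F_ML (Eq. (5)): sum over utterances of log p_theta(x_i | y_i). Illustrative sketch."""
    return float(np.sum(loglik_correct))

def mmi_objective(loglik_correct, loglik_hyps, logprior_hyps):
    """F_MMI (Eq. (6)): numerator log-likelihood minus the log of the prior-weighted
    sum over the competing hypotheses (correct hypothesis included).
    Rival training (RT) would instead keep only the best scored incorrect
    hypothesis in the denominator. Illustrative sketch, data layout assumed.
    """
    total = 0.0
    for ll_corr, ll_hyps, lp_hyps in zip(loglik_correct, loglik_hyps, logprior_hyps):
        denom = np.logaddexp.reduce(np.asarray(ll_hyps) + np.asarray(lp_hyps))
        total += ll_corr - denom
    return total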
We investigate the performance of boosting applied to maximum likelihood training in Sections 5 and 6 and to discriminative training in Section 5.3.2. In the following, we outline our utterance approach for boosting HMM acoustic models.

4. The utterance approach for boosting in ASR

4.1. General framework

In most previous applications to speech recognition, boosting was applied to classifying each individual feature vector to a phoneme symbol. To this end, phoneme posterior probabilities were calculated for each feature vector, either by neural networks (Cook and Robinson, 1996; Schwenk, 1999) or Gaussian mixtures (Zweig and Padmanabhan, 2000). Then, the boosting algorithm was run on the phoneme posterior probabilities, thus reducing the phoneme classification error rate and (indirectly) the word error rate.

Since conventional HMM speech recognizers do not involve an intermediate phoneme classification step for individual feature vectors, the frame-level boosting approach cannot straightforwardly be applied. On the other hand, each HMM speech recognizer calculates scores for the correct and competing hypotheses at an utterance level. Therefore, we proposed to apply boosting at the utterance level, i.e. to classifying the whole feature vector sequence to the corresponding word sequence (Meyer, 2002). This involves a suitable choice of the function ht, related to the decoding criterion. We suggested using the (sentence) posterior probability, which can be approximated by the a posteriori confidence measure (Rüber, 1997), calculated on the basis of a list of the N most likely hypotheses ("N-best list") provided by the recognizer for each utterance. This provides an easy way to apply AdaBoost to any HMM speech recognizer.

In more detail, in our utterance approach we define the input patterns xi to be the sequence of feature vectors corresponding to the entire utterance i (Meyer, 2002). Correspondingly, y denotes one possible candidate word sequence of the speech recognizer (yi being the correct word sequence for utterance i). In isolated word recognition, the label space Y is simply defined by the recognition lexicon. It is, however, convenient to restrict the label space, for each utterance individually, to the most likely hypotheses, as determined e.g. by a baseline recognition on the training corpus (Zweig and Padmanabhan, 2000). This is mandatory for applying the utterance approach to continuous speech recognition.
The a posteriori confidence measure is calculated on the basis of the N-best list Li for utterance i. It involves the acoustic probability pt(xi|y) and the language model probability[1] p(y):

  h_t(x_i, y) = \frac{p_t(x_i \mid y)^{\lambda} \, p(y)^{\lambda}}{\sum_{z \in L_i} p_t(x_i \mid z)^{\lambda} \, p(z)^{\lambda}}.        (7)

The scaling parameter λ is used to compensate for the intrinsic scaling and offset of the decoding scores with language model factor and word penalty.[2] For any value of λ > 0, Eq. (7) defines a suitable function ht with values in [0, 1] which can be used in the boosting algorithm. The value of λ affects the "distribution" of confidence values ht(x, y) among the list of hypotheses y; for larger values of λ, the confidence values exhibit a stronger decay with increasing N-best list index.

Based on the confidence values (7), we use AdaBoost.M2 (Table 1) to update the distribution Dt(i, y). Then, we calculate an utterance weight wt(i) for each training utterance i according to Eq. (2). The weights wt(i) are then used in subsequent maximum likelihood and discriminative training of Gaussian or Laplacian mixture density HMMs, where the next parameter set θ for the acoustic model is determined according to the following criteria for boosted maximum likelihood training:

  F_{\mathrm{ML},t}(\theta) = \sum_{i=1}^{N} w_t(i) \log p_\theta(x_i \mid y_i)                             (8)

and boosted maximum mutual information training:

  F_{\mathrm{MMI},t}(\theta) = \sum_{i=1}^{N} w_t(i) \log \frac{p_\theta(x_i \mid y_i)}{\sum_{y} p_\theta(x_i \mid y) \, p(y)}.        (9)
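As an illustration of how Eqs. (7), (2) and (8) fit together in practice, the following sketch turns the total decoding log-scores of one N-best list into confidence values and indicates where the resulting utterance weight enters the boosted ML criterion. The data layout and names are hypothetical.

import numpy as np

def nbest_confidences(log_scores, lam):
    """Eq. (7): a posteriori confidences from total decoding log-scores of one N-best list.
    Illustrative sketch; log-domain normalization is used for numerical stability.

    log_scores : array (N,), log[p_t(x_i|z) p(z)] for each hypothesis z in the list
    lam        : scaling parameter lambda > 0
    """
    scaled = lam * np.asarray(log_scores, dtype=float)
    scaled -= scaled.max()                    # stabilize before exponentiation
    conf = np.exp(scaled)
    return conf / conf.sum()

# Hypothetical usage for one training utterance i:
#   conf = nbest_confidences(nbest_log_scores, lam=0.05)
#   h_correct = conf[idx_correct]             # h_t(x_i, y_i)
# The AdaBoost.M2 update of D_t (Table 1) then yields the utterance weight
#   w_t(i) = sum_{y != y_i} D_t(i, y)         # Eq. (2)
# which simply multiplies utterance i's contribution in the boosted ML
# criterion (8): F_ML,t = sum_i w_t(i) * log p_theta(x_i | y_i).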
[1] We are focusing on boosting only the acoustic model, i.e. the language model p(y) is assumed to be independent of t.
[2] Theoretically, the recognizer scores of the N-best sentences should correspond to the negative logarithms of the probabilities; however, due to the application of word penalty and language model factor this relation is lost. See also (Rüber, 1997).

4.2. Utterance approach for continuous speech recognition

The application of the utterance approach to isolated word recognition and small vocabulary continuous speech recognition (e.g. digit strings) is straightforward. In this case, N-best lists generally cover a large portion of the space of candidate
hypotheses. For large-scale continuous speech applications, however, which consist of long sentences, this might not be the case, due to their excessive complexity: Consider a 60k word vocabulary and a sentence consisting of 20 words. Theoretically, there are 60,000^20 ≈ 10^95 possible classification results. Thus, N-best lists of reasonable length (e.g. N = 100) generally contain only a tiny fraction of the possible classification results. This means that for a large number of hypotheses y, the classification score ht(x, y) as defined by Eq. (7) is not available. This has two consequences: In training, it may lead to sub-optimal utterance weights (see Table 1). In recognition, Eq. (1) cannot be applied appropriately. In the following, we discuss our approach to address this problem, focusing separately on training (Section 4.2.1) and decoding (Section 4.2.2).

4.2.1. Training

A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in "chopping" the training data. For long sentences, this simply means inserting additional sentence break symbols at silence intervals with a given minimum length. This reduces the number of possible classifications of each sentence "fragment", so that the resulting N-best lists should cover a sufficiently large fraction of hypotheses. Then, AdaBoost.M2 can be applied as described in Section 4.1.

4.2.2. Decoding: lexical approach for model combination

Whereas chopping may be accepted in training, it is certainly difficult and undesired in decoding. Therefore, we follow a different strategy in decoding, which avoids chopping of the test data. Specifically, we suggest a single pass decoding setup, where the combination of the boosted acoustic models is realized at the lexical level. The basic idea is to add a new pronunciation model by "replicating" the set of phoneme symbols in each boosting iteration t (e.g. by appending the suffix "_t" to the phoneme symbol). In this way, the new pronunciation model, i.e. the new phoneme symbols, represents the underlying acoustic model of boosting iteration t (example: for the phoneme "dh", additional variants "dh_1" and "dh_2" are added in boosting iterations 1 and 2, respectively). Then, we add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding phoneme
set Mt (example: to the baseform "dh uh" in the decoding lexicon, a new variant "dh_1 uh_1" is added in the first boosting iteration, representing the acoustic parameters of the boosted model). Note that the phoneme sequences of the baseline lexicon are not modified; instead, only the suffixes of the phoneme symbols are changed to represent the underlying acoustic model of boosting iteration t. A similar approach has been introduced in (Zheng et al., 2000) to model different speaking rates. The idea of boosting, however, is to use the reweighted training data to train the boosted classifier t, i.e. the acoustic model corresponding to the phoneme set Mt. Here, the training weights are given by the utterance weights wt(i) calculated for boosting iteration t. Decoding is then performed using the extended lexicon and the set of acoustic models representing the boosted classifiers. The key point of our approach is the weighted summation (Schramm and Aubert, 2000) over the extended set of pronunciations.[3] Alternative pronunciations are weighted by their unigram prior probabilities, which are estimated on the training data (Schramm and Aubert, 2000).

[3] We use the term "pronunciation variant" here also for the transcriptions added in each boosting iteration, although this term is somewhat misleading, since the phoneme sequences of the baseline decoding lexicon are not modified.

In more detail, our algorithm proceeds as follows:

• In each boosting iteration, we generate a set of new phoneme symbols Mt by applying a suffix "_t" to each of the speech phonemes of the baseline system.
• In iteration t, the spoken word sequences of the training corpus are phonetically transcribed with the phoneme symbols from Mt. Using the respective utterance boosting weights calculated for iteration t according to Section 2, the acoustic model corresponding to Mt is trained by the usual acoustic training algorithm (e.g. standard ML training).
• The final classifier consists of the various phoneme sets Mt, t = 1, ..., T with corresponding acoustic models. As explained above, in each boosting iteration t, for every pronunciation variant vi in the base decoding lexicon, a new variant vi,t with phonemes from Mt is appended. For words w in the base decoding lexicon having
more than one pronunciation variant vi, the unigram prior probabilities of the set {vi,t} are estimated and normalized separately in each boosting iteration t, according to a forced alignment.
• Normalization of the complete set of pronunciation variants {vi,t}, i, t = 1, ..., T, for a given word w is achieved in one of two ways:
  – either by introducing a uniform factor 1/T to the unigram priors estimated in the T boosting iterations ("unweighted model combination"),
  – or by multiplying the (normalized) unigram priors estimated in boosting iteration t with the respective boosting classifier weight α′_t = c · ln(1/β_t) calculated from boosting theory ("weighted model combination", see Appendix A). Here, c is a normalization constant which guarantees Σ_{t=1}^{T} α′_t = 1.
• Decoding is based on a weighted summation over the contributions of all pronunciation variants {vi,t}, i, t = 1, ..., T, of a given word with their respective unigram priors, as described in (Schramm and Aubert, 2000). Note that this summation is carried out in the probability domain and not in the logarithmic domain.

Applied at the sentence level, i.e. keeping the model index t fixed within each sentence, the weighted summation over the extended set of pronunciation variants realizes the final output (1) given by boosting theory, as we show in Appendix A. However, we abandon the restriction of a fixed model label within an utterance, first, because the weighted summation over pronunciation variants as described in (Schramm and Aubert, 2000) is performed at the word level and, second, since recent results reported in (Zhang and Rudnicky, 2003) supported word level model combination. Thus, we also take into account sequences of pronunciation variants where the model index changes within an utterance.

Note that in our approach the boosted acoustic models are directly combined in decoding, due to the weighted summation over the extended set of pronunciation variants representing the baseline and the boosted models. Thus, our approach is integrated in a single pass decoding framework, and neither postprocessing of word graphs or N-best lists nor chopping of the test data is required in recognition. This allows us to apply our algorithm even to arbitrarily long sentences, which is especially important for dictation tasks.
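The lexicon construction described above can be sketched as follows; the data layout and function names are illustrative only (the actual system realizes this inside the decoder), covering both the unweighted and the weighted combination scheme.

def extend_lexicon(base_lexicon, T, classifier_weights=None):
    """Build the extended decoding lexicon for T boosted models (illustrative sketch).

    base_lexicon : dict word -> list of (phoneme_tuple, unigram_prior),
                   priors per word assumed to sum to 1
    classifier_weights : optional list alpha'_t = c*ln(1/beta_t), normalized to sum to 1
                         ("weighted model combination"); if None, the uniform
                         factor 1/T is used ("unweighted model combination").
    """
    if classifier_weights is None:
        classifier_weights = [1.0 / T] * T
    extended = {}
    for word, variants in base_lexicon.items():
        entries = []
        for t in range(1, T + 1):
            for phones, prior in variants:
                # replicate the phoneme symbols with suffix "_t" to tag model t
                tagged = tuple(p + "_%d" % t for p in phones)
                entries.append((tagged, classifier_weights[t - 1] * prior))
        extended[word] = entries
    return extended

# In decoding, the contributions of all variants of a word are summed in the
# probability domain, roughly p(x|word) = sum_k prior_k * p(x|variant_k),
# which (with the model index kept fixed per sentence) realizes the boosted output (1).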
We evaluate our algorithm in Section 6. In the next section, we present experimental results for utterance-level boosting applied to isolated word recognition.

5. Experiments: isolated word recognition

5.1. Database

Our first set of experiments was performed on a telephone-bandwidth large vocabulary isolated word recognition task. The "standard" training corpus is based on SpeechDat(II) German material[4] and consists of 18k utterances (about 4.3 h of speech) of city, company, first and family names and other single words. By using only every second utterance, we also created the "small" training corpus. Evaluations were carried out in matched and in mismatched decoding conditions:

• For matched conditions, we used the "LILI" (long item list) test corpus of 10k single word utterances from SpeechDat(II) German material (about 3.5 h of speech). The decoding lexicon consisted of about 10k words (city names, company names, first and family names etc.). Apart from small modifications, this lexicon was also used for the recognitions on the training data involved in boosting and discriminative training.
• The "names" corpus is an inhouse collection of 676 utterances of German last names, recorded over a telephone line (about 0.5 h of data). We used two different decoding lexica: the "10k lex" and the "190k lex", consisting of 10k and 190k different last names, respectively. Here, the acoustic conditions are matched, whereas there is a lexical mismatch.[5]
• The "office" corpus consists of 3.2k utterances of German city names, recorded over microphone in clean (office) conditions (about 1.5 h of data), introducing an acoustic mismatch to the (telephone-bandwidth) training conditions. Here, we used a decoding lexicon of 20k different city names.

[4] For more information on the SpeechDat corpus see www.speechdat.org.
[5] Note that boosting as well as discriminative training both involve a recognition pass on the training data, generally performed with a lexicon which is similar to the decoding lexicon. Using a different decoding lexicon in testing compared to training introduces a lexical mismatch between the training and test conditions.
5.2. Baseline system description

We employed standard feature extraction using mel-frequency cepstral coefficients (MFCC) (Davis and Mermelstein, 1980), optionally including a linear discriminant analysis (LDA). The acoustic model is made of context dependent HMM based phone models ("triphones"), with 6 states each. State clustering is applied for parameter reduction. For each state, the probability density function is assumed to be a continuous Gaussian mixture density distribution.

5.3. Boosting

For our boosting experiments, we iterated the AdaBoost.M2 algorithm as explained in Section 4.1. In view of the linearity of the ML estimation Eq. (8), the utterance weights wt(i) were normalized to a mean value of 1.0. Note that due to the complexity of each (base) speech recognizer, only a few boosting iterations are feasible, in contrast to other boosting applications. The final output of the boosting ensemble is determined from the outputs (N-best lists) of the individual boosted speech recognizers, according to Eq. (1) with confidence scores (7).

5.3.1. Boosting ML models

In an initial set of experiments, we evaluated the performance of our utterance-level boosting algorithm compared to standard maximum likelihood training (Meyer, 2002). These experiments were performed on the "small" training corpus, without LDA. Example histograms of the calculated utterance weights are shown in Fig. 1. At iteration t = 1 (initialization) all the weights are equal to 1. As t increases, the distribution of the weights expands, assigning larger weights to the misrecognized utterances and lower weights to the correctly recognized sentences. In this way, the new base classifiers are more and more focused on the misclassified examples.

Fig. 2 presents evaluation results on the "LILI" test corpus (matched conditions) for ML training and boosting, as a function of the number of densities of the final classifier. We found that
Fig. 1. (Normalized) utterance weight histograms for boosting the ML baseline model with 53k densities, trained on the "small" training corpus (λ = 0.005). For the first classifier, all utterance weights are 1.0. The histogram for training of classifier t = 2 is shown by the solid line, the one for training of classifier t = 5 by the dashed line.
boosting significantly reduced the test error rates compared to the non-boosted initial system t = 1. On the other hand, comparing classifiers with a similar number of densities, boosting outperformed ML training only in the case of high accuracy acoustic models, where a further increase of parameters in ML training resulted in overfitting (see solid line in Fig. 2). In this regime, boosting the 53k ML baseline model improved the best ML test error rate by 10% relative (dashed line in Fig. 2).[6]

This performance improvement could not be obtained by simply combining non-boosted models: for example, combining the ML baseline models with 53k and 74k densities, respectively, we only achieved a word error rate of 10.94% (127k densities), even when optimizing the classifier weights a posteriori on the test data (Meyer, 2002).
[6] Due to the inherent pruning and the finite size of the N-best lists, the scores ht(x, y) of classifier t are obtained only for a limited number of hypotheses y. With more boosting iterations, in general the fraction of hypotheses for which the score is not available for one or more classifiers increases (in these cases, we assign a small default value, e.g. ht(x, y) = 10^-5, to the respective hypothesis y of classifier t). This approximation to Eq. (1) might explain why we do not observe further improvements for more than four boosting iterations.
Fig. 2. Word error rates (WER, in %) on the LILI test corpus (matched conditions) for maximum likelihood (ML) training (solid line) and boosting, as a function of the number of densities of the final classifier. The ML baseline results were obtained by further splitting the ML models. Boosting was applied to the ML 11k density baseline model (dotted line, λ = 0.01), to the ML 53k density baseline model (long-dashed line, λ = 0.005) and to the ML 94k density baseline model (short-dashed line, λ = 0.005). Note that the training error for the ML 94k density baseline model is as low as 0.28%. Training on the "small" training corpus.
Similar improvements by boosting have been observed in experiments with LDA, if the LDA matrix of the baseline classifier is kept fixed during the boosting iterations; for experimental results, see (Meyer and Beyerlein, 2002).

5.3.2. Combining boosting and discriminative training

The improvements obtained by boosting (Section 5.3.1) come at the expense of an increased complexity (number of parameters) of the final, boosted classifier, since in each boosting iteration, an additional speech recognizer is added to the ensemble. Therefore, the recognition costs (CPU time and memory) scale linearly with the number of boosting iterations. This is in contrast to discriminative training, where the decoding costs remain constant, since the number of parameters is kept fixed. In spite of these drawbacks, we argue that boosting may provide additional benefits on top of discriminative training, i.e. when combined with discriminative training. To this end, we performed a second set of experiments, applying boosting to discriminatively trained models (Meyer and Beyerlein, 2002). The basic idea is to use discriminative training for each of the ensemble models, e.g. 10 iterations of rival training (RT) after an initial ML training.
Fig. 3. Cumulative distribution of margins for boosting applied to maximum likelihood (ML) and rival training (RT). Baseline model: 35k densities, trained on the "standard" training corpus (ML training: solid line, plus 10 iterations RT: short-dashed line). Boosting (a) ML training: long-dashed line, (b) rival training: dotted line. λ = 0.005.
Thus, the hypothesis ht(x, y) is calculated from a discriminatively trained model for each t, and used to calculate utterance weights according to Section 4.1. These experiments were performed on the "standard" training corpus, selecting a 35k density model as baseline model t = 1, including LDA.

It is interesting to analyze the training process in terms of margins (see Section 2). In Fig. 3, we show the cumulative distribution of margins on the training corpus for four training paradigms:[7] (a) standard ML training (solid line), (b) boosting ML training (long-dashed line), (c) rival training (short-dashed line) and (d) boosting rival training (dotted line). It is seen that both boosting and rival training tend to increase the margins of the training patterns (for training patterns with not too large margins). Moreover, this effect is additive when combining boosting and rival training (see also Meyer and Beyerlein, 2002).

Table 2 presents evaluation results for the different training paradigms (column "LILI": matched test conditions). First, we see again that boosting provided substantial test error reductions
[7] The discontinuity at margin 1.0 (a margin of 1.0 corresponds to 100%) stems from the fact that, due to the inherent pruning applied in speech recognition, for a considerable portion of training utterances (about 30%) the recognition output consists of only a single hypothesis (N-best list of length 1). This leads to a confidence value and margin of exactly 1.0.
Table 2
Word error rates (WER) in % for maximum likelihood (ML) training, rival training (RT), and boosting ML and RT, on different test corpora in matched ("LILI") and mismatched ("names", "office") conditions (see Section 5.1)

Training criterion    # dens.   WER on test corpus
                                LILI (10k lex)   Names (10k lex)   Names (190k lex)   Office (20k lex)
ML (t = 1)            35k       8.00             28.25             56.07              48.35
ML (t = 1)            101k      7.23             25.00             56.07              47.20
Boosting ML, t = 2    68k       7.55             26.18             53.25              46.80
Boosting ML, t = 3    101k      7.35             25.30             54.14              45.99
RT (t = 1)            35k       7.10             25.15             52.37              44.31
RT (t = 1)            101k      7.05             23.67             53.70              46.02
Boosting RT, t = 2    68k       6.81             22.93             51.92              44.12
Boosting RT, t = 3    101k      6.56             22.93             53.55              43.25

t is the boosting iteration index (t = 1: baseline system). AdaBoost.M2 was applied to the 35k density ML and RT models. "# dens." refers to the number of densities of the final (combined) classifier. "Standard" training corpus.
compared to the initial base classifier t = 1, for ML training as well as for rival training.[8] In fact, combining boosting and rival training led to a nearly additive performance gain over the initial ML base classifier t = 1. Second, boosting RT outperformed rival training even when comparing recognizers with a similar number of densities. This is because discriminative training of the 101k density model resulted in smaller performance gains (due to its higher model accuracy) than for the 35k density model, in line with previous observations (Woodland and Povey, 2000; Schlüter et al., 1999). This demonstrates the benefit of boosting even when applied to discriminative training.

[8] Since the boosted ML model (t = 3) does not outperform ML training with a comparable number of densities (101k), we again note that boosting is mostly relevant for high accuracy models, where the word error rate cannot be further reduced by ML training with more parameters. In Table 2, the acoustic resolution in ML training has not been optimized before applying boosting.

5.3.3. Evaluation in mismatched decoding conditions

It is interesting to evaluate whether the improvements obtained by boosting are limited to matched conditions, i.e. if they imply "tuning" to the respective training conditions. To this end, we evaluated the boosted models of Section 5.3.2 also in selected mismatched conditions (see Section 5.1). Results are given in Table 2, columns "names" and "office". Note that no adaptation was performed in any of our mismatched decoding experiments apart from long-term spectrum normalization. The experiments demonstrate that the performance gains obtained by boosting are not limited to decoding tasks matching the training conditions (apart from overfitting for t = 3 on the "names 190k" task). Even when comparing recognizers with a similar number of densities, we see from Table 2 that boosted ML (RT) models performed better than or comparable to ML (RT) models, respectively. Moreover, by combining boosting and rival training, again nearly additive performance gains are observed compared to the initial ML base classifier t = 1 (apart from t = 3 classifiers on the "names 190k" task).

To summarize, the experiments in isolated word recognition showed that boosting may improve the best test error rates obtained with ML or discriminative training. By combining boosting and discriminative training, we obtained nearly additive performance gains over the initial ML base classifier. Moreover, boosting RT outperformed a purely discriminatively trained model with a similar number of densities (i.e. with comparable decoding costs), in matched and in mismatched decoding conditions. This shows that boosting can also be successfully applied in the context of discriminative training. On the other hand, our results indicate that boosting is mostly relevant for high accuracy acoustic models, where ML or discriminative training do not provide further improvements when increasing the number of parameters.

6. Experiments: continuous speech recognition

6.1. Database

In large vocabulary continuous speech recognition (LVCSR), we evaluated our algorithm on two tasks: professional dictation and Switchboard.
6.1.1. Professional dictation

For this task, we used an inhouse data collection of real-life recordings of medical reports, spontaneously spoken over long distance telephone lines by various (male) speakers from all over the US. This database contains a variety of speaking styles, accents and speaking rates and a large amount of spontaneous speech effects like filled pauses, partial words, repetitions and restarts (filled pauses and non-speech events are annotated). The acoustic training corpus consists of about 58 h of data (426 speakers, 0.5M words). This dictation task is particularly challenging for boosting due to the large average sentence length of about 66 words per sentence. Thus, we chopped the training utterances at silence intervals of sufficient length (e.g. 0.3 s), in order to reduce the average training sentence length to about 7 words per sentence (see also Section 4.2.1).

Evaluations were carried out on two (non-chopped) test corpora: the development corpus (DEV set) consists of 5.0 h of speech (11 male speakers, 38.0k words), and the evaluation corpus (EVAL set) of 3.3 h of speech (another 11 male speakers, 26.5k words).

6.1.2. Switchboard

As a second task, we performed evaluations on the Switchboard corpus, consisting of spontaneous conversations recorded over telephone line (Godfrey et al., 1992). Our male (female) training corpus consists of about 57 h (73 h) of data and 86k (114k) utterances, respectively. Due to the small average sentence length of about 7 words per utterance, we did not chop the training data. We performed evaluations on the development corpus of the Johns Hopkins University summer workshop 1997, containing about 1 h (0.5 h) of data and 1.6k (0.9k) utterances for the male (female) corpus, respectively.

6.2. Baseline system description

The baseline system is a recent upgrade of the Philips Research LVCSR system which has already been applied to numerous tasks including Broadcast News transcriptions (Aubert and Blasig, 2000; Beyerlein et al., 2002), Switchboard (Beyerlein et al., 2001) and the present professional dictation task (Schramm et al., 2003). All the experiments reported here concern the first decoding pass without any acoustic or language model adaptation. The main system components can be summarized as follows:
• The signal front-end relies on standard MFCC coefficients. In the professional dictation task, additional voicing features are appended.
• These feature vectors are subjected to LDA to yield 35-component vectors.
• Gender dependent continuous mixtures of Laplacian densities are estimated using maximum likelihood Viterbi training.
• Phonetic contexts are within-word triphones tied with CART.
• The baseline decoding lexicon consists of 60k words and 17k additional pronunciation variants for the professional dictation task and 22k words and 750 pronunciation variants for Switchboard. The pronunciation variants are weighted by unigram prior probabilities.
• For the professional dictation task, filled-pause specific phones are applied in combination with various weighted filled-pause pronunciation variants of variable length; for more details see (Schramm et al., 2003) and references therein.
• The language model is a word-based 64K trigram. For professional dictation, the language model histories are free from filled pauses, which are however predicted probabilistically based on the two preceding words (Schramm et al., 2003; Peters, 2003).
• Decoding proceeds from left to right using a prefix-tree lexicon (Aubert, 1999), and contributions of simultaneously active pronunciations of the same word are summed up (Schramm and Aubert, 2000).
• Language model and phoneme look-ahead techniques are conservatively applied (Aubert and Blasig, 2000).

The parameterization of the search has been tuned for the baseline setup in the professional dictation task to give a real time factor of about 16.0 on an Intel Xeon 3.0 GHz processor for the ML 6 split baseline model. The same search setup is used in all experiments described in this section (including the baseline and the boosted systems).

6.3. Boosting

Each boosting iteration involves a recognition pass on the training data, generating N-best lists of reasonable length (e.g. 100 for Switchboard). According to the average length of the training utterances, we used non-chopped (Switchboard) or chopped training data (professional dictation, see
Section 4.2.1). If the spoken utterance was not contained in the N-best list, it was added with an assigned confidence value of 0.0 (assuming a very low score for the spoken words).[9] Contrary to results found for discriminative training (Schlüter et al., 1999), we used a trigram language model (LM) for decoding on the training data. This is to reduce the training error and the pseudo-loss ε_t in order to obtain a sufficient range of calculated utterance weights. Using the corresponding N-best lists, the utterance weights were calculated according to Table 1 and Eq. (2), employing the a posteriori confidence measure (7) calculated from time-normalized total decoding scores. The scaling factor λ was chosen on the training corpus in order to provide a minimal pseudo-loss and thus a maximal range of calculated utterance weights. Note that no parameter tuning for the boosted systems was performed apart from choosing λ. The utterance weights were normalized to a mean value of 1.0. An example utterance weight histogram is shown in Fig. 4.

[9] Note that we could not use a forced alignment to determine the decoding score of the spoken words, since in a forced alignment only the best pronunciation variant is considered. In contrast, in decoding we sum over different pronunciation variants, weighted by their unigram prior probabilities. Thus, a forced alignment yields different scores from those obtained in decoding.

In each boosting iteration t, a new phoneme set Mt was introduced and trained on the reweighted training data as explained in Section 4.2.2. A complete training pass was performed for each phoneme set Mt (i.e. each boosted acoustic model), involving training of context-independent phonemes, Viterbi selection of pronunciation variants, CART clustering and training of context-dependent units. Thus, the training effort for each boosted classifier comprises a recognition pass on the training data and a standard training pass on the reweighted training data. Finally, the decoding lexicon has been created according to Section 4.2.2.

Fig. 4. (Normalized) utterance weight histograms for boosting on the professional dictation training corpus, λ = 0.05. For the first classifier t = 1, all utterance weights are 1.0. The histogram for training of classifier t = 2 is shown by the dashed line, the one for training of classifier t = 3 by the solid line. The filled area under the solid line marks the histogram for those utterances which are correctly recognized by the second classifier t = 2.

6.3.1. Results: professional dictation

In the professional dictation task, evaluations were carried out on the (unchopped) DEV and EVAL test sets. The baseline decoding lexicon contains about 60k baseforms and 17k pronunciation variants. For the boosted system, the decoding lexicon contains about 155k (232k) entries for t = 2 (t = 3), respectively.
Evaluation results[10] are summarized in Table 3. Applying ML baseline training, the best test error rate was obtained for 349k densities; further increasing the number of parameters of the acoustic model did not improve results, due to overfitting. In contrast, AdaBoost.M2 significantly improved the best ML result on both evaluation corpora for the tested values of λ (Table 3). Results for weighted model combination were similar to those of unweighted combination, since the theoretical classifier weights calculated from boosting theory were nearly uniform (e.g. for λ = 0.05: theoretical classifier weights 1.07 and 0.93 for the first (second) classifier, respectively). Note also that for word level model combination, the weighted combination scheme is an approximation to the final boosting output (1) (compare Appendix A).

[10] The word error rate on the EVAL corpus is higher than on the DEV corpus due to large fluctuations of the error rates between individual speakers.
Table 3
Professional dictation: Word error rates (WER) in % on the DEV and EVAL test sets and their average ("WER ALL") for ML training and boosting

Algorithm                  t   # dens.   Model combination   WER DEV   WER EVAL   WER ALL
Baseline ML (6 splits)     1   349k      –                   22.04     29.20      24.98
Baseline ML (7 splits)     1   511k      –                   22.03     29.40      25.06
Baseline ML (8 splits)     1   682k      –                   22.32     29.41      25.23
AdaBoost.M2, λ = 0.03      2   659k      Unweighted          21.37     28.92      24.47
                           2   659k      Weighted            21.38     28.92      24.48
                           3   982k      Unweighted          21.41     28.68      24.39
AdaBoost.M2, λ = 0.05      2   659k      Unweighted          21.51     28.65      24.44
                           2   659k      Weighted            21.56     28.63      24.46
                           3   982k      Unweighted          21.26     28.56      24.26
AdaBoost.M2, λ = 0.1       2   658k      Unweighted          21.39     28.55      24.33
                           2   658k      Weighted            21.38     28.66      24.37
                           3   979k      Unweighted          21.15     28.46      24.15

AdaBoost.M2 was applied to the ML 349k density baseline model. t is the boosting iteration index, "# dens." refers to the number of densities of the final (combined) classifier. Weighted/unweighted model combination: see text.
The performance improvements come at the expense of a larger decoding real time factor, which increased from about 16.0 for the ML 349k density baseline model to about 21.2 (27.6) for the boosted system with 2 (3) classifiers, respectively (Intel Xeon 3.0 GHz processor). The increased decoding time for the boosted models can be attributed to the increased lexicon size and the increased number of densities of the boosted systems, leading to a larger search effort. On the other hand, it can be expected that the addition of a new classifier in each boosting iteration leads to a large amount of redundancy in the final boosted system in terms of symbolic pronunciation variants and densities. Thus, there should be significant potential to reduce the complexity of the final classifier and to increase the decoding efficiency of the boosted systems. Note that the search space has not been optimized for
the boosted systems. In contrast experiments, we found that by increasing the search space, the word error rate of the boosted systems with t = 3 could be reduced more strongly than that for t = 2, whereas the baseline system showed only a marginal error reduction. Thus, more efficient search strategies for such large systems may provide further performance improvements.

6.3.2. Results: Switchboard

On Switchboard, AdaBoost.M2 was applied separately to the male and the female ML baseline
Table 4
Switchboard: Word error rates (WER) in % on the Switchboard male and female test sets and their average ("ALL") for ML training and boosting

                            Male                   Female                 All
Algorithm               t   # dens.   WER          # dens.   WER          WER
Baseline ML (7 splits)  1   368k      38.60        389k      40.03        39.05
Baseline ML (8 splits)  1   514k      39.36        558k      40.34        39.67
Baseline ML (9 splits)  1   674k      39.13        753k      40.79        39.65
AdaBoost.M2, λ = 0.05   2   726k      38.20        771k      39.54        38.62
                        3   1085k     38.22        1153k     39.12        38.50
AdaBoost.M2, λ = 0.1    2   726k      37.93        770k      39.43        38.40
                        3   1085k     37.75        1150k     39.32        38.24
AdaBoost.M2, λ = 0.2    2   728k      38.25        770k      39.49        38.64
                        3   1082k     38.31        1149k     39.32        38.62

AdaBoost.M2 was applied to the (gender dependent) maximum likelihood 7 splits baseline model. t is the boosting iteration index, "# dens." refers to the number of densities of the final (combined) classifier. Decoding with unweighted model combination (see Section 4.2.2).
models. In the baseline decoding lexicon, there are about 22k baseforms and 750 pronunciation variants. For t = 2 (t = 3) boosted classifiers, the decoding lexicon contained about 41k (60k) entries, respectively. Evaluation results are given in Table 4. Again, boosting clearly improved the best (overall) test error rate obtained by ML training, by up to 0.8% absolute. Note that on Switchboard, the training word error rate was quite large, leading to a large pseudo-loss and a quite restricted range of utterance weights.

The decoding real time factor increased from about 18.2 for the 368k ML baseline model (male corpus) to about 30.3 (42.4) for the boosted system with t = 2 (t = 3), respectively, on an Intel Xeon 3.0 GHz processor. Again, we remark that we did not spend any effort to optimize the parameterization of the search setup or to increase the decoding efficiency for the boosted models, which remains an issue for future work. Regarding the search space, contrast experiments led to similar observations as in the professional dictation task, namely that increasing the search space results in larger performance improvements for the boosted systems (especially for t = 3) than for the ML baseline system.

7. Conclusions and discussion

We presented and evaluated a boosting approach, based on utterance classification, which can be applied to any Hidden Markov Model (HMM) based speech recognizer. Specifically, we used the well-known AdaBoost.M2 algorithm to calculate utterance weights used in acoustic model training of subsequent boosted classifiers.

In a large vocabulary isolated word recognition task we obtained significant performance improvements compared to the maximum likelihood (ML) baseline model. On the other hand, taking into account the increased complexity of the final, boosted classifier and the increased training effort, our results indicate that boosting is mostly relevant for high accuracy acoustic models, which fully exploit the potential of an increased acoustic resolution (i.e. number of densities of the acoustic model). In this case, further increasing the number of densities does not improve test performance in the context of standard training, due to overfitting. Boosting, instead, significantly reduced the best test error rate.

The increased recognizer complexity and thus decoding effort of the boosted systems is a major
drawback compared to other training techniques such as discriminative training. In spite of this, we presented examples showing that boosting can provide further performance gains over discriminative training, when both training techniques are combined. Compared to the initial ML base classifier, nearly additive performance gains were achieved. Moreover, combining boosting and discriminative training clearly outperformed a purely discriminatively trained model with a similar number of densities. This demonstrates the benefit of boosting also in the context of discriminative training. On the other hand, this is again mostly relevant for high accuracy acoustic models, where the improvements obtained by discriminative training are limited by the large acoustic resolution. Here, boosting can compensate for this effect. Interestingly, these findings were not restricted to matched decoding conditions; similar conclusions could be drawn in a set of recognition experiments exhibiting lexical or acoustic mismatch to the training conditions.

In the second part of our paper, we extended our algorithm to large vocabulary continuous speech recognition (LVCSR). To apply our utterance-based training approach to continuous speech tasks, it may be convenient to chop the training data, depending on the (average) length of the training utterances. To avoid chopping in decoding, we introduced a model combination scheme at the lexical level, based on an extended set of pronunciation variants employing a separate phoneme set, with an underlying acoustic model, for each of the boosted classifiers. Realizing the acoustic model combination by a weighted summation over the extended set of pronunciation variants, we arrive at an online (single-pass) recognition mode, without any need for further offline processing of word lattices or N-best lists.

We presented experimental results on two tasks: a real-life spontaneous speech dictation task with a 60k word vocabulary, and Switchboard. On both corpora, boosting significantly improved the best test error rates obtained with ML training. In view of the high complexity of our spontaneous speech decoding tasks, our results demonstrate that our approach provides a feasible way of boosting HMM speech recognizers even in (very) large scale continuous speech recognition tasks, integrated in an online decoding framework. Due to the increased complexity (lexicon size and number of densities) of the boosted system with each boosting iteration, the performance gains come
at the expense of an increased decoding time. However, it can be speculated that the acoustic representation of the boosted system exhibits some degree of redundancy (in terms of densities and symbolic pronunciation variants). Thus, an important direction for future research is to reduce the complexity of the boosted ensemble, e.g. by density clustering algorithms. Also, more efficient search strategies for such large systems may further improve performance.

For LVCSR, the benefit of boosting in combination with discriminative training remains to be shown. Here, efficient techniques (to reduce the training runtime) are needed to combine both algorithms. This might include boosting algorithms operating on word lattices or confusion networks instead of N-best lists. Finally, further focusing the algorithm on the specifics of speech recognition (e.g. outliers and noise) remains to be investigated. An interesting approach was recently presented in (Raetsch, 2003).

Acknowledgements

We are grateful to Xavier Aubert for his investigations on tuning the search setup. Many thanks also to the Center for Language and Speech Processing at Johns Hopkins University for providing us with the Switchboard training and development corpus and the pronunciation dictionary. Finally, we would like to thank the reviewers for their useful comments, which helped to improve the clarity of the manuscript.

Appendix A. Acoustic model combination at the lexical level: theory

In this section we introduce the theory for the lexical approach to acoustic model combination and show its relation to the final output (1) given by boosting theory.

First we remark that alternative sequences of pronunciation variants v_1^N := v_1, ..., v_N may be integrated as follows into the decision rule (3) for decoding a word sequence y = w_1^N := w_1, ..., w_N, given a sequence of feature vectors x = x_1^M := x_1, ..., x_M:

\hat{w}_1^N = \arg\max_{w_1^N} p(w_1^N | x_1^M) = \arg\max_{w_1^N} p(w_1^N, x_1^M)    (A.1)

with

p(w_1^N, x_1^M) = \sum_{v_1^N \in R(w_1^N)} p(w_1^N, v_1^N, x_1^M)    (A.2)

= \sum_{v_1^N \in R(w_1^N)} p(w_1^N) \, p(v_1^N | w_1^N) \, p(x_1^M | v_1^N)    (A.3)

= \sum_{v_1^N \in R(w_1^N)} \prod_{i=1}^{N} p(w_i | w_{i-m+1}^{i-1}) \, p(v_i | w_i) \, p(x_i | v_i).    (A.4)

Here, the notation v_1^N \in R(w_1^N) denotes all sequences of pronunciation variants v_1^N consistent with the word sequence w_1^N. A specific pronunciation variant sequence v_1^N \in R(w_1^N) is obtained as follows: for each word position i = 1, ..., N, select one pronunciation variant v_i from the set of variants belonging to word w_i as defined by the lexicon. The term p(v_i | w_i) refers to the unigram prior for the pronunciation variant v_i, the term p(x_i | v_i) to the acoustic probability for pronunciation variant v_i (on the respective part x_i of the feature vector sequence), and the term p(w_i | w_{i-m+1}^{i-1}) denotes the language model probability of word w_i in an m-gram context. In Eq. (A.3) we applied Bayes' rule and assumed probabilistic dependence of the feature vector sequence x_1^M only on the pronunciation variant sequence v_1^N. In Eq. (A.4) we assumed only ‘‘local’’ dependencies of x_i on v_i and of v_i only on w_i. In (Schramm and Aubert, 2000), an algorithm for approximating the summation over all pronunciation variant sequences as given in Eq. (A.4) has been described. This algorithm has been used in our experiments.

In the second step, we extend this framework, integrating the contribution of different acoustic models which are represented by an extended set of pronunciation variants (see Section 4.2.2). The main idea is to realize the contribution of the set of acoustic models by a summation over respective pronunciation variant sequences. Two cases have to be distinguished:

1. ‘‘Sentence level model combination’’: Here, we introduce a scalar value t \in {1, ..., T} labeling the different acoustic models at the sentence level. In detail, this means that each element in a pronunciation variant sequence v_1^N considered in the summation (see below) belongs to the same acoustic model, i.e. t_i = t for all i = 1, ..., N.
2. ‘‘Word level model combination’’: In this case, we introduce an N-dimensional variable t_1^N, where t_i \in {1, ..., T} for i \in {1, ..., N}. This variable labels the acoustic model individually at each word position. Here, we also consider sequences of pronunciation variants with alternating model labels within a sentence.

Now we introduce the notation \mathbf{t} to represent both cases, i.e. \mathbf{t} = t for sentence level model combination and \mathbf{t} = t_1^N for word level model combination. Then, the summation over the contributions of the different acoustic models can be introduced as follows:

p(w_1^N, x_1^M) = \sum_{\mathbf{t}} \sum_{v_1^N \in R(w_1^N)} p(w_1^N, v_1^N, \mathbf{t}, x_1^M)    (A.5)

= \sum_{\mathbf{t}} \sum_{v_1^N \in R(w_1^N)} p(\mathbf{t}) \, p(w_1^N | \mathbf{t}) \, p(v_1^N | w_1^N, \mathbf{t}) \, p(x_1^M | v_1^N, \mathbf{t})    (A.6)

= \sum_{\mathbf{t}} p(\mathbf{t}) \sum_{v_1^N \in R(w_1^N)} p(w_1^N) \, p(v_1^N | w_1^N, \mathbf{t}) \, p(x_1^M | v_1^N, \mathbf{t})    (A.7)

= \sum_{\mathbf{t}} p(\mathbf{t}) \sum_{v_1^N \in R(w_1^N)} \prod_{i=1}^{N} p(w_i | w_{i-m+1}^{i-1}) \, p(v_i | w_i, t_i) \, p(x_i | v_i, t_i).

In Eq. (A.7) we assumed independence of the word sequence w_1^N from the model label \mathbf{t}. The unigram priors p(v_i | w_i, t_i) are estimated on the training data for each model t, performing a forced alignment restricted to pronunciation variants of model t only. The acoustic probability p(x_i | v_i, t_i) is determined by using the respective pronunciation variant v_{i,t} for the model t = t_i.

For sentence level model combination, we replace the model prior p(\mathbf{t}) = p(t) by the prior \alpha_t = \ln(1/\beta_t) determined from boosting theory. Defining

p_t(w_1^N, x_1^M) = \sum_{v_1^N \in R(w_1^N)} \prod_{i=1}^{N} p(w_i | w_{i-m+1}^{i-1}) \, p(v_i | w_i, t) \, p(x_i | v_i, t)    (A.8)

and dividing by the prior p(x_1^M), the posterior probability p(w_1^N | x_1^M) can be written as a weighted summation over posterior probabilities p_t(w_1^N | x_1^M) from the individual models t:

p(w_1^N | x_1^M) = \sum_{t=1}^{T} \alpha_t \, p_t(w_1^N | x_1^M) = \sum_{t=1}^{T} \alpha_t \, h_t(x, y).    (A.9)

The last equation follows since the function h_t(x, y) is, according to Section 4.1, modeled by the posterior probability p_t(w_1^N | x_1^M) for model t. Thus, we arrive at the final output (1) for the combined classifier given by boosting theory.

For word level model combination we factorize the model prior p(\mathbf{t}) = p(t_1^N) = \prod_{i=1}^{N} p(t_i) and arrive at

p(w_1^N, x_1^M) = \sum_{t_1^N} \sum_{v_1^N \in R(w_1^N)} \prod_{i=1}^{N} p(w_i | w_{i-m+1}^{i-1}) \, p(v_i | w_i, t_i) \, p(t_i) \, p(x_i | v_i, t_i).    (A.10)

For simplicity, the model priors p(t_i) may again be replaced by the respective boosting classifier prior \alpha_{t_i} = \ln(1/\beta_{t_i}). This can simply be realized at the lexical level, by multiplying the estimated maximum likelihood prior for each pronunciation variant v_{i,t} in the lexicon with the respective (normalized) boosting classifier weight \alpha_t (preserving normalization of the unigram priors). Replacing the product of the model priors by a single prior leads to the final boosting output (1). This relation is exact for isolated word recognition (N = 1), where Eq. (A.10) can be written (dropping the index i and setting p(t) = \alpha_t):

p(w, x_1^M) = \sum_{t=1}^{T} \alpha_t \sum_{v} p(w) \, p(v | w, t) \, p(x_1^M | v, t).    (A.11)
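To illustrate the lexical combination scheme, the following Python sketch builds an extended lexicon in which each pronunciation variant is tagged with its model index t and its maximum likelihood prior is scaled by the normalized boosting weight \alpha_t, and then scores a word according to Eq. (A.11) for the isolated word case. This is a minimal sketch under our own assumptions: the function names (combine_lexicons, score_word, acoustic_likelihood) and the data layout are illustrative, and the acoustic likelihood p(x | v, t) is left as a placeholder rather than an actual HMM evaluation.

```python
import math

def combine_lexicons(lexicons, betas):
    """Build one extended lexicon from the per-model lexicons.

    lexicons: list indexed by model t; each entry maps a word to a list of
              (variant, ml_prior) pairs, where ml_prior stands for p(v | w, t).
    betas:    per-model AdaBoost.M2 parameters beta_t; the classifier weights
              are alpha_t = ln(1 / beta_t), normalized here to sum to one.
    Returns a dict: word -> list of (variant, model index t, scaled prior).
    """
    alphas = [math.log(1.0 / b) for b in betas]
    z = sum(alphas)
    alphas = [a / z for a in alphas]  # normalized boosting weights

    extended = {}
    for t, lexicon in enumerate(lexicons):
        for word, variants in lexicon.items():
            for variant, ml_prior in variants:
                # Scale the ML pronunciation prior by the boosting weight of
                # model t; this keeps the unigram priors of the extended
                # variant set of each word normalized.
                extended.setdefault(word, []).append(
                    (variant, t, alphas[t] * ml_prior))
    return extended


def score_word(word, features, extended_lexicon, acoustic_likelihood,
               word_prior=1.0):
    """Approximate p(w, x) = sum_t alpha_t sum_v p(w) p(v|w,t) p(x|v,t)."""
    total = 0.0
    for variant, t, scaled_prior in extended_lexicon.get(word, []):
        # acoustic_likelihood(features, variant, t) stands for p(x | v, t),
        # i.e. the likelihood under the acoustic model of boosting round t.
        total += scaled_prior * acoustic_likelihood(features, variant, t)
    return word_prior * total
```

Decoding then simply selects the word with the largest score; in continuous decoding the same scaled priors enter the summation over pronunciation variant sequences of Eq. (A.10), which is why no word lattice or N-best post-processing is required.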
References

Aubert, X., 1999. One pass cross word decoding for large vocabularies based on a lexical tree search organization. In: Proc. EUROSPEECH-99, Budapest, Hungary, pp. 1559–1562.
Aubert, X., Blasig, R., 2000. Combined acoustic and linguistic look-ahead for one-pass time-synchronous decoding. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP-00), Vol. 3, Beijing, China, pp. 802–805.
Bahl, L.R., Brown, P.F., de Souza, P.V., Mercer, R.L., 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP-86), Tokyo, pp. 49–52.
Beyerlein, P., Aubert, X., Harris, M., Meyer, C., Schramm, H., 2001. Investigations on conversational speech recognition. In: Proc. EUROSPEECH-01, Aalborg, Denmark, pp. 499–503.
Beyerlein, P., Aubert, X., Haeb-Umbach, R., Harris, M., et al., 2002. Large vocabulary continuous speech recognition of broadcast news—The Philips/RWTH approach. Speech Commun. 37, 109–137.
Collins, M., 2000. Discriminative reranking for natural language parsing. In: Proc. Seventeenth Internat. Conf. on Machine Learning (ICML-00), Stanford, USA.
Collins, M., 2002. Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In: Proc. ACL 2002.
Cook, G.D., Robinson, A.J., 1996. Boosting the performance of connectionist large vocabulary speech recognition. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP-96), Philadelphia, PA, USA, pp. 1305–1308.
Davis, S.B., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28, 357–366.
Dimitrakakis, D., Bengio, S., 2004. Boosting HMMs with an application to speech recognition. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP-04), Montreal, Canada.
Freund, Y., Schapire, R.E., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139.
Godfrey, J., Holliman, E., McDaniel, J., 1992. SWITCHBOARD: Telephone speech corpus for research and development. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP-92), San Francisco, CA, USA.
Juang, B.H., Katagiri, S., 1992. Discriminative learning for minimum error classification. IEEE Trans. Signal Process. 40, 3043–3054.
Koltchinskii, V., Panchenko, D., 2002. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 (1).
Meyer, C., 2002. Utterance-level boosting of HMM speech recognizers. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP-02), Orlando, FL, USA, pp. 109–112.
Meyer, C., Beyerlein, P., 2002. Towards ‘‘Large Margin’’ speech recognizers by boosting and discriminative training. In: Proc. Nineteenth Internat. Conf. on Machine Learning (ICML-02), Sydney, Australia, pp. 419–426.
Meyer, C., Rose, G., 2000. Rival training: efficient use of data in discriminative training. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP-00), Beijing, pp. 632–635.
Meyer, C., Schramm, H., 2004. Boosting acoustic models in large vocabulary speech recognition. In: Proc. Sixth IASTED Internat. Conf. on Signal and Image Processing (SIP-2004), Honolulu, Hawaii, USA.
Peters, J., 2003. LM studies on filled pauses in spontaneous medical dictation. In: Proc. Human Language Technology Conf. (HLT-NAACL 2003), short papers, Edmonton, Alberta, Canada, pp. 82–84.
Raetsch, G., 2003. Robust multi-class boosting. In: Proc. EUROSPEECH-03, Geneva, Switzerland, pp. 997–1000.
Rüber, B., 1997. Obtaining confidence measures from sentence probabilities. In: Proc. EUROSPEECH-97, Rhodes, Greece, pp. 739–742.
Schapire, R.E., 1990. The strength of weak learnability. Mach. Learn. 5, 197–227.
Schapire, R.E., 2002. The boosting approach to machine learning: an overview. In: Proceedings of the MSRI Workshop on Nonlinear Estimation and Classification. Springer-Verlag, pp. 149–172.
Schapire, R.E., Singer, Y., 1999. Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37 (3), 297–336.
Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S., 1998. Boosting the margin: a new explanation of the effectiveness of voting methods. Ann. Statist. 26, 1651–1686.
Schlüter, R., Müller, R., Wessel, F., Ney, H., 1999. Interdependence of language models and discriminative training. In: Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU-99), Keystone, Colorado, pp. 119–122.
Schramm, H., Aubert, X., 2000. Efficient integration of multiple pronunciations in a large vocabulary decoder. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP-00), Istanbul, Turkey, Vol. 3, pp. 1659–1662.
Schramm, H., Aubert, X., Meyer, C., Peters, J., 2003. Filled-pause modeling for medical transcriptions. In: Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR 2003), Tokyo, Japan, pp. 143–146.
Schwenk, H., 1999. Using boosting to improve a hybrid HMM/neural network speech recognizer. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP-99), Phoenix, AZ, pp. 1009–1012.
Valiant, L.G., 1984. A theory of the learnable. Commun. ACM 27 (11), 1134–1142.
Woodland, P.C., Povey, D., 2000. Large scale discriminative training for speech recognition. In: Proc. ASR 2000 Conference—Automatic Speech Recognition: Challenges for the new Millenium, Paris, France, pp. 7–16.
Zhang, R., Rudnicky, A.I., 2003. Comparative study of boosting and non-boosting training for constructing ensembles of acoustic models. In: Proc. EUROSPEECH-03, Geneva, Switzerland, Vol. III, pp. 1885–1888.
Zheng, J., Franco, H., Weng, F., Sankar, A., Bratt, H., 2000. Word-level rate of speech modeling using rate-specific phones and pronunciations. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP-00), Istanbul, Turkey, Vol. III, pp. 1775–1778.
Zweig, G., Padmanabhan, M., 2000. Boosting Gaussian mixtures in an LVCSR system. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP-00), Istanbul, Turkey, pp. 1527–1530.