Computer Speech and Language (2000) 14, 241–259 Article No. 10.1006/csla.2000.0145 Available online at http://www.idealibrary.com on
Multi-resolution sub-band features and models for HMM-based phonetic modelling P. M. McCourt, S. V. Vaseghi† and B. Doherty School of Electrical & Electronic Engineering, Queens University, Belfast, U.K. Abstract HMM acoustic models are typically trained on a single set of cepstral features extracted over the full bandwidth of mel-spaced filterbank energies. In this paper, multi-resolution sub-band transformations of the log energy spectra are introduced based on the conjecture that additional cues for phonetic discrimination may exist in the local spectral correlates not captured by the full-band analysis. In this approach the discriminative contribution from sub-band features is considered to supplement rather than substitute for full-band features. HMMs trained on concatenated multi-resolution cepstral features are investigated, along with models based on linearly combined independent multi-resolution streams, in which the sub-band and full-band streams represent different resolutions of the same signal. For the stream-based models, discriminative training of the linear combination weights to a minimum classification error criteria is also applied. Both the concatenated feature and the independent stream modelling configurations are demonstrated to outperform traditional full-band cepstra for HMM-based acoustic phonetic modelling on the TIMIT database. Experiments on context-independent modelling achieve a best increase on the core test set from an accuracy of 62.3% for full-band models to a 67.5% accuracy for discriminately weighted multi-resolution sub-band modelling. A triphone accuracy of 73.9% achieved on the core test set improves notably on full-band cepstra and compares well with results previously published on this task. c 2000 Academic Press
1. Introduction Insights into the physiology and performance of the human hearing process have found significant application in achieving goals in speech and audio signal processing. In coding, for example, the phenomenon of masking forms the basis for the audio compression standard of MPEG 2 and is used successfully to improve the quality of low rate coded speech through perceptual post-filtering. For automatic speech recognition, exploitation of our knowledge of the human hearing process has primarily influenced the front-end feature extraction from which the back-end acoustic models are trained. Examples extend from the use of mel-spaced filterbank spectral analysis, and perceptual linear prediction (PLP) features (Hermansky, 1990) to the more direct modelling of basilar membrane “excitation” distributions such as that represented by the EIH (ensemble interval histogram) (Ghitza, 1994). † Now at Department of Electronic and Computer Engineering, Brunel University, Uxbridge, Middlesex, UB8 3PH, U.K.
0885–2308/00/030241 + 19 $35.00/0
Allen (1994) provided an excellent summary of rediscovered elementary work performed over 40 years ago by Fletcher and his colleagues at the Bell Labs on the nature of human speech recognition. One of the primary conclusions of this work was that, for the human auditory system, meaningful features are independently recognized in sub-bands and merged at some higher processing layer into recognition of basic phonemes and syllables. Whilst precise knowledge of how the merging of this partial recognition information is performed by humans is as yet far from being fully understood, these insights none the less provided a stimulus for recent exploration of the usefulness of sub-band-based features and models for the statistical pattern recognition problem of ASR (Bourlard, Dupont, Hermansky & Morgan, 1996; Bourlard & Dupont, 1997; Cerisara, Haton, Mari & Mohr, 1997; Tibrewala & Hermansky, 1997; Okawa, Bocchieri & Potamianos, 1998). This interest was initially driven by the proposition that, whilst any form of band-limited noise will corrupt all features extracted from a full-bandwidth analysis, some form of band-limited feature extraction and modelling will limit the corruption. In our recent work (Vaseghi, Harte & Milner, 1997; McCourt, Vaseghi & Harte, 1998) we have taken the approach of exploring multi-resolution sub-band features and models. By “sub-band” we imply any region of the full-signal bandwidth specifically represented by a sub-set of the standard mel-spaced log energy vector, across which cepstral analysis is separately performed.1 The use of such “sub-band cepstra” is based on the conjecture that important cues for phonetic discrimination may exist in the local spectral correlates that are not captured by full-band cepstral features. The multi-resolution contribution acknowledges that sub-band discriminative information may be used, not purely in substitution of, but rather to supplement full-band features and models. 
The level of sub-band decomposition can be increased such that features may be selected from a pyramid or hierarchy of increasing resolution levels. As already stated, most of the published work on multi-band recognition has concentrated on the noise robustness issue. Comments on baseline performance in clean conditions generally state that, for the tasks in question, performance close to full band was achieved. The objective of the investigations reported in this paper is to fully ascertain the contribution of multi-resolution sub-band features and models in improving the performance of HMM-based acoustic phonetic modelling. Refinement of the usefulness of a sub-band decomposition to wide-band and non-stationary noise in particular remains none the less of ongoing importance.
In exploring the multi-resolution feature set in HMM-based modelling, two approaches can be adopted.
• The multi-resolution sub-band cepstra can be concatenated to create a larger feature space from which to directly train acoustic models.
• Independent sub-band and resolution models can be created for each feature set, in which case the “scores” from the independent models must be recombined in order to make decoding decisions.
Okawa et al. (1998) proposed a similar view for sub-bands only, calling the approaches “feature combination” and “likelihood combination”, respectively. Two research issues raised by Bourlard et al. (1996) and Bourlard and Dupont (1997) concerning sub-band modelling are: (i) the type of merging to be used (linear or non-linear) and (ii) the timing of the recombination, if independent modelling is employed.
Footnote 1: In this work we have exclusively explored the use of cepstral features, though other feature types (e.g. PLP, PLPCC) can be naturally accommodated.
Discussion on the timing issue reflects a desire to mimic the temporal merging of information as performed by humans, as well as suggesting that the asynchrony of state transitions permitted by independent sub-band models may benefit modelling accuracy. Implementation of this approach implies that combination of independent scores for different models for the same class is best performed at the phone or syllable segmental level. For isolated word or phone classification experiments this occurs naturally as boundaries are given. However, when performing continuous recognition, choosing appropriate segmental anchor points at which recombination is performed requires some prior estimation of the most likely phone or syllable boundaries for sequences of competing strings. As reported by Cerisara et al. (1997), independent sub-band models produce very different phone sequences and boundaries. This requires post-processing of decoded utterances to group similar phone sequences to perform combined path searches in order to find the most reliable overall sequence. Mirghafori and Morgan (1998) investigated asynchrony at the class transition level by comparing timing boundaries found by forced alignment of phone strings from independently trained sub-band models. This reported different timing segmentations for broad class transitions that tended to increase in the higher frequency bands. Tibrewala and Hermansky (1997) reported however that, for continuous recognition, initial explorations showed that non-linear merging of likelihood scores at the acoustic frame level gave better performance than word level merging. Acoustic frame-level likelihood combination is indeed the most immediately tractable way in which to achieve single-pass decoding of continuous speech using multi-resolution features and models. 
This automatically implies that in training HMM-based models, it is the combined likelihood from the independent feature distributions which influences state decoding, with state transitions effectively tied. This is the modelling assumption maintained throughout the investigations reported here. This is a reasonable modelling constraint, as although independently trained sub-band models may suggest different timing segmentations at the state and class level, the baseline acoustic modelling accuracy and discriminative capabilities of such models may be low for continuous recognition, especially with narrow-bandwidth subbands. As the problem of automatic recognition is essentially one of discriminative pattern matching, the usefulness of sub-band features and models is not necessarily in “replicating” the implementation of human recognition, but rather in determining what new discriminative information can be gleaned from considering a sub-band decomposition. The other experimental issue (as stated above) is that of choosing a model combination strategy if the independent stream modelling framework is followed. The MLP (multi-layer perceptron) merging network used by Tibrewala and Hermansky (1997) attempts to find a single discriminative non-linear merging function, globally trained across the frame-based sub-band model scores from all classes. A similar approach is used by Bourlard and Dupont (1997). Okawa et al. (1998) and Bourlard et al. (1996) used linear combination of the loglikelihood scores employing various heuristic weightings, primarily for sub-band model performance in narrow-band noise. In this work we have concentrated on linear combination of the log-likelihood probabilities of the multi-resolution models based on the simple assumption of statistical independence between streams. It could be suggested that, particularly for multi-resolution features, this assumption seems violated. 
However, our early experiments [in common with similar work by Halberstadt and Glass (1998) on multiple classifier combination], demonstrated that this simple form of combination performs as well as other more complex options such as proportional weighting of normalized scores. Maximum-likelihood (ML)-trained HMMs are used in our experiments to explore different multi-resolution feature and model decompositions. We also however investigate re-training of the log-likelihood linear combination weights (or if preferred, likelihood exponents) to the
minimum classification error (MCE) criterion using generalized probabilistic descent (GPD) training (Juang & Katagiri, 1992; Juang, Chou & Lee, 1997). The most similar application is that of Normandin, Cardin and De Mori (1994), where training of exponent weights for separate acoustic feature and delta time trajectory streams was investigated, also from an independent modelling assumption starting point. Linear weight re-combination MCE training has also been used by Potamianos and Graf (1998) for multi-modal (i.e. audio and video) recognition, blending ML-trained models. The issue of model combination is also of wider developing interest, as more generally it offers the potential for jointly harnessing the discriminative abilities of several acoustic or language models (e.g. Halberstadt & Glass, 1998; Beyerlin, 1998; McCourt, Harte & Vaseghi, 1999).
This paper is organized as follows. Section 2 formally defines multi-resolution sub-band cepstral feature extraction. In Section 3, alternative multi-resolution model composition strategies are presented. A formulation for discriminative training of linear combination weights to the MCE objective is also included. Section 4 reports on experiments carried out on the TIMIT database. Comparisons between mono-resolution sub-band and multi-resolution concatenated feature and independent stream modelling are discussed. Improvements gained from discriminative weight training are also described. Finally, context-dependent modelling with the multi-resolution sub-band features and models is explored. Section 5 furthers discussion of the usefulness of these models, with Section 6 concluding the paper.
2. Multi-resolution features
We now define what is meant by multi-resolution features in the domain of cepstral feature extraction from mel-spaced filterbank log energies, though the principle can be extended to other forms of spectral representation.
For the standard mel-frequency cepstral coefficient (MFCC) features, cepstral analysis is performed on the mel-spaced filterbank log energy vector $E$ of each short-time analysis frame, as expressed by the linear transformation
$$ X = AE \tag{1} $$
where $A$ typically represents the DCT basis functions, although discriminatively derived linear transforms have been used (e.g. Chengalvarayan & Deng, 1997). The log energy vector $E$ can be split into $B$ sub-vectors, $E = [E_1^T, \ldots, E_b^T, \ldots, E_B^T]^T$ (where $T$ indicates vector transpose), such that each sub-vector $E_b$ effectively represents a grouped bandwidth of log energies. Separate cepstral analysis using appropriately dimensioned DCT transforms $A_b$ yields a new feature vector $X_r$ in Equation (2), created from the set of sub-band cepstral vectors. This is illustrated in Figure 1 for the case of two sub-bands ($r$ is a resolution index identifying a user-defined sub-band decomposition). Thus
$$ X_r = [X_1^T, \ldots, X_b^T, \ldots, X_B^T]^T \tag{2} $$
is formed by
$$ [X_1^T, \ldots, X_b^T, \ldots, X_B^T]^T = [(A_1 E_1)^T, \ldots, (A_b E_b)^T, \ldots, (A_B E_B)^T]^T. \tag{3} $$
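The sub-band analysis of Equations (1) to (3) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the unnormalised DCT-II basis, the retention of the zeroth coefficient and the 10/10 split of 20 filterbank channels are all assumptions made for the example.

```python
import math

def dct_basis(num_coeffs, band_size):
    """Rows of an unnormalised DCT-II matrix A (assumed form of the basis)."""
    return [[math.cos(math.pi * k * (n + 0.5) / band_size)
             for n in range(band_size)]
            for k in range(num_coeffs)]

def cepstra(log_energies, num_coeffs):
    """X = A E for one band, as in Equation (1)."""
    basis = dct_basis(num_coeffs, len(log_energies))
    return [sum(a * e for a, e in zip(row, log_energies)) for row in basis]

def subband_cepstra(log_energies, band_sizes, coeffs_per_band):
    """Split E into contiguous sub-vectors E_b, transform each with its own
    DCT A_b and concatenate the results, as in Equations (2) and (3)."""
    if sum(band_sizes) != len(log_energies):
        raise ValueError("band sizes must cover the whole energy vector")
    features, start = [], 0
    for size, keep in zip(band_sizes, coeffs_per_band):
        features.extend(cepstra(log_energies[start:start + size], keep))
        start += size
    return features

# Hypothetical frame of 20 log filterbank energies.
E = [float(i % 5) for i in range(20)]
x_r0 = subband_cepstra(E, [20], [13])         # full band, 13 coefficients
x_r1 = subband_cepstra(E, [10, 10], [7, 6])   # two bands, 7 + 6 coefficients
```

Both calls return 13-element vectors, so models trained on either feature set have the same input dimension.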
In creating conventional MFCC feature vectors, the DCT is applied to the full-band mel-frequency energy vector in order to de-correlate the features so as to better fit a diagonal covariance assumption predominantly used in Gaussian mixture modelling. Sub-band DCT analysis is based on the conjecture that important cues for discrimination may exist in the
[Figure 1. Multi-resolution sub-band cepstral feature extraction. Panels: Resolution 0, full band, one DCT giving X(r = 0); Resolution 1, two bands, one DCT per band (b = 1, b = 2) giving X(r = 1). Axes: log energy against mel frequency.]
local spectral correlates that may not be captured by the full-band cepstral analysis and emphasizes more the sense of cepstra as local spectral envelope shape analysis while retaining the de-correlation benefit. It may be argued, particularly for the concatenated features, that there is a slight weakening of the de-correlation assumption. However, as will be seen, this does not inhibit performance for a small number of bands. Multi-resolution analysis simply involves performing the above sub-band feature extraction at several decomposition levels, using a different number of sub-bands at each level. A multi-resolution cepstral vector $X^{MR}$ hence implies concatenation of the mono-resolution vector as defined in Equations (2) and (3) at different resolutions of analysis, i.e.
$$ X^{MR} = [X(r=0)^T, X(r=1)^T, X(r=2)^T, \ldots]^T. \tag{4} $$
Throughout this paper $r = 0$ specifies the standard full-band analysis, with $r \geq 1$ denoting a user-defined sub-band decomposition.
3. Multi-resolution sub-band models
In addition to the option of training acoustic models with concatenated multi-resolution cepstra as defined above, acoustic modelling which mirrors the sub-band decomposition used for the feature extraction is also possible. Combination or merging of the log likelihoods from a set of multi-resolution sub-band streams for the same acoustic class must then be performed. If $M_l^{(rb)}$ identifies the independent acoustic stream for each class $l$ trained for band $b$ of resolution $r$, and using the notation $X^{(rb)}$ to represent the appropriate sub-band cepstral features, a linearly combined log-likelihood function is given as
$$ \log p(X|M_l) = \sum_{r=0}^{R-1} \sum_{b=1}^{B_r} \omega_l^{(rb)} \log p(X^{(rb)}|M_l^{(rb)}) \tag{5a} $$
where $M_l$ defines the model set $\{M_l^{(rb)} : r = 0, \ldots, R-1;\ b = 1, \ldots, B_r\}$ for phone class $l$ with $R$ decomposition configurations in which each decomposition $r$ is composed of $B_r$ sub-bands. This recombination strategy is illustrated in Figure 2(a) for the case of a full-band model supplemented by two sub-band models. Equation (5a) can be reduced to purely sub-band model combination for a particular decomposition $r$, given by
$$ \log p(X|M_l) = \sum_{b=1}^{B_r} \omega_l^{(b)} \log p(X^{(b)}|M_l^{(b)}) \tag{5b} $$
or also combination of independent resolution models, trained on concatenated sub-band features, given by
$$ \log p(X|M_l) = \sum_{r=0}^{R-1} \omega_l^{(r)} \log p(X^{(r)}|M_l^{(r)}). \tag{5c} $$
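The linear combination in Equation (5a), followed by a choice of the best-scoring class, might be sketched as follows. The function names, the stream keys `(r, b)` and all numerical scores are hypothetical and chosen only for illustration. With no weights supplied, every weight is taken as 1.

```python
def combined_log_likelihood(stream_scores, weights=None):
    """Weighted sum of per-stream log-likelihoods, as in Equation (5a).

    stream_scores maps a (resolution, band) key to log p(X^(rb) | M_l^(rb)).
    weights maps the same keys to linear weights; unit weights if omitted.
    """
    if weights is None:
        weights = {key: 1.0 for key in stream_scores}
    return sum(weights[key] * score for key, score in stream_scores.items())

def classify(scores_by_class, weights_by_class=None):
    """Return the class whose combined score is highest."""
    def total(label):
        w = None if weights_by_class is None else weights_by_class[label]
        return combined_log_likelihood(scores_by_class[label], w)
    return max(scores_by_class, key=total)

# Hypothetical scores: full band (0, 1) plus lower (1, 1) and upper (1, 2) bands.
scores_by_class = {
    "aa": {(0, 1): -50.0, (1, 1): -30.0, (1, 2): -25.0},
    "iy": {(0, 1): -48.0, (1, 1): -35.0, (1, 2): -27.0},
}
best = classify(scores_by_class)
```

In this made-up case the full-band stream alone prefers "iy", while the combined score prefers "aa", which is the kind of supplementary evidence the sub-band streams are intended to supply.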
These cases are illustrated in Figure 2(b) and 2(c). For all schemes, $\{\omega_l^{(rb)} : r = 0, \ldots, R-1;\ b = 1, \ldots, B_r\}$ (or a sub-set thereof as appropriate) defines the set of linear weights applied to the log-likelihood scores from the independent models. These can also be viewed as exponents of the direct likelihood measures. The default case investigated here is for $\omega_l^{(rb)} = 1$ across all streams, implying an independence assumption. Although it could be suggested, particularly for combining multi-resolution models, that this assumption seems violated, our early experiments demonstrated that this simplest form of combination performs as well as others. The same conclusion is also reported by Halberstadt and Glass (1998) in their recent work investigating multiple classifier combination. Restrictions that the weights sum to one, while intuitively appealing, are not strictly essential within the prescribed probabilistic framework (5a) unless the scores themselves are normalized and proportioned to a known summed score.
3.1. Discriminative weight training
Parameter estimation techniques have been recently developed to optimize discriminative objective functions such as the maximum mutual information (MMI) (Bahl, Brown, De Souza & Mercer, 1986) and the minimum classification error (MCE) criteria (Juang et al., 1997). These have been used in direct estimation of the parameters of HMMs (e.g. Juang & Katagiri, 1992; Valtchev, 1995; Herman & Sukkar, 1998) or aspects of the feature extraction (Biem, McDermott & Katagiri, 1995; Chengalvarayan & Deng, 1997). In the experiments reported here we have employed the MCE criterion specifically in training the set of class-dependent linear weights $\{\omega_l^{(rb)} : r = 0, \ldots, R-1;\ b = 1, \ldots, B_r\}_{l=1}^{L}$, based on combining the output from ML-trained acoustic models. The most similar application is that of Normandin et al. (1994), where training of exponent weights for separate acoustic feature and delta feature streams was investigated from an independent modelling assumption starting point. The weightings were in that case class-independent global exponents. If $\omega_l^{(rb)} \neq 1$ for any stream after training, it is no longer strictly true for the stream that $\int_X p(X^{(rb)})\,dX = 1$. However, as noted by Normandin et al. (1994), it can be argued that, because in practice likelihood-trained HMMs are not completely accurate acoustic representations of speech, this minor deviation from a strict probabilistic framework is justifiable.
In this section the parameter estimation equations specific to this application are developed with respect to the MCE criterion. This is specifically with respect to multi-resolution sub-band combination as defined by Equation (5a). The derivations transfer easily to the other model combinations (5b) and (5c). To simplify subsequent notation the following definition for the log-likelihood is introduced:
$$ P_l(X^{(rb)}) = \log p(X^{(rb)}|M_l^{(rb)}). \tag{6} $$
[Figure 2. (a) Multi-resolution sub-band likelihood combination. (b) Sub-band likelihood combination. (c) Multi-resolution likelihood combination. In each panel the stream log-likelihoods are multiplied by their weights and combined to give log p(X|Ml): (a) full band (resolution "0") with the lower and upper bands of resolution "1"; (b) lower and upper bands of resolution "1" only; (c) full band (resolution "0") with the two-band resolution "1" model.]
This defines a partial recognition score for a sub-band random variable $X^{(rb)}$ given an associated model. The log-likelihood score of a multi-resolution sub-band model for class $l$ is given by
$$ g_l(X) = \sum_{r=0}^{R-1} \sum_{b=1}^{B_r} \omega_l^{(rb)} P_l(X^{(rb)}). \tag{7} $$
It is assumed that $X$ represents a sequence of $T$ multi-resolution observation vectors, i.e. $X = \{X^{MR}_{t=1}, X^{MR}_{t=2}, \ldots, X^{MR}_{t=T}\}$, and hence $g_l(X)$ is the final score for the optimal decoded state sequence. The classifier operates according to the decision rule, selecting the class $h$ with the highest score
$$ C(X) = C_h \quad \text{if} \quad g_h(X) = \max_l g_l(X). \tag{8} $$
This is the discriminant function. The result of this decision rule over a set of observations yields a classification error rate, based on the proportion of all observations where the chosen class according to Equation (8) does not correspond to the correct class. In this “loss function”, a correct decision is assigned a zero loss and an incorrect decision a unit loss. The innovation of the MCE objective function is the proposition of a smooth loss function amenable to optimization. This is defined (Juang & Katagiri, 1992) as follows. Let a misclassification measure $d_k(X)$ for an observation known to belong to class $k$ be given by
$$ d_k(X) = -g_k(X) + \max_{j \neq k} g_j(X) = -g_k(X) + g_\eta(X) \tag{9} $$
where $g_k(X)$ is the score for the correct class and $g_\eta(X)$ represents the score of the “most confusable” class $\eta$ chosen from among the remaining set of classes according to the decision rule Equation (8). $d_k(X) \leq 0$ therefore implies a correct decision, with $d_k(X) > 0$ indicating misclassification. A smoothed continuous loss function is defined as a sigmoidal function of $d_k(X)$
$$ \ell_k(X) = \frac{1}{1 + e^{-\gamma d_k(X)}} \tag{10} $$
with “good” classification tending towards zero and incorrect classification tending towards one, but generating a measure in the continuous range $0 < \ell_k(X) < 1$. The parameter $\gamma$ controls the slope of the sigmoid function. The loss function over the entire training set is thus no longer a simple summation of incorrect classifications based on the binary decision measure, but the sum of $\ell_k(X)$ over all training tokens. Although it is the overall expected loss function which is to be minimized, the use of GPD (or stochastic descent) token-by-token training in effect means it is the gradient of the local loss function for each training token, i.e. $\partial \ell_k(X)/\partial \omega_k$, which drives the parameter updates (Chengalvarayan & Deng, 1997). Thus the sub-band weight update equation based on GPD is
$$ \omega_k^{i+1} = \omega_k^{i} - \varepsilon \frac{\partial \ell_k(X)}{\partial \omega_k^{i}} \tag{11} $$
where $\omega_k^{i}$ is a weight parameter for model $k$ after the $i$th iteration, $\partial \ell_k(X)/\partial \omega_k$ is the gradient of the local loss function and $\varepsilon$ is a small positive learning constant. The gradient function is expanded according to the chain rule of calculus as
$$ \frac{\partial \ell_k(X)}{\partial \omega_k} = \frac{\partial \ell_k(X)}{\partial d_k(X)} \, \frac{\partial d_k(X)}{\partial g_k(X)} \, \frac{\partial g_k(X)}{\partial \omega_k}. \tag{12} $$
For the sake of brevity, the subsequent weight update equations are quoted below, based on derivation of the first two elements according to Chengalvarayan and Deng (1997), with the last element based on the relevant partial derivative of Equation (6) easily determined. Thus, considering observation sequence $X$ belonging to class $k$ and $\eta$ being the most confusable class as defined by Equation (9), we have the weight update equations
$$ \omega_k^{(rb),i+1} = \omega_k^{(rb),i} - \varepsilon\,(\ell_k(X)[\ell_k(X) - 1])\,P_k(X^{(rb)}) \tag{13a} $$
$$ \omega_\eta^{(rb),i+1} = \omega_\eta^{(rb),i} + \varepsilon\,(\ell_k(X)[\ell_k(X) - 1])\,P_\eta(X^{(rb)}). \tag{13b} $$
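One token-by-token update following Equations (7), (9), (10), (13a) and (13b) can be sketched as below. The function names, the dictionary layout, the values of ε and γ and the stream scores are assumptions for illustration. As in the quoted update equations, the factor γ arising from the derivative of the sigmoid is treated as absorbed into ε.

```python
import math

def class_score(weights, scores):
    """g_l(X): weighted sum of stream log-likelihoods, Equation (7)."""
    return sum(weights[s] * scores[s] for s in scores)

def mce_gpd_step(weights, scores, correct, epsilon=0.001, gamma=0.1):
    """Apply one GPD update in place and return (d_k, loss) before the update.

    weights and scores map class label -> {stream key -> value}.
    """
    g = {c: class_score(weights[c], scores[c]) for c in scores}
    # Most confusable class: best-scoring class other than the correct one.
    rival = max((c for c in g if c != correct), key=lambda c: g[c])
    d = -g[correct] + g[rival]                      # Equation (9)
    loss = 1.0 / (1.0 + math.exp(-gamma * d))       # Equation (10)
    factor = epsilon * loss * (loss - 1.0)          # never positive
    for s in scores[correct]:
        weights[correct][s] -= factor * scores[correct][s]   # Equation (13a)
    for s in scores[rival]:
        weights[rival][s] += factor * scores[rival][s]       # Equation (13b)
    return d, loss

# Hypothetical token of class "iy" that unit weights misclassify as "aa".
scores = {
    "aa": {(0, 1): -50.0, (1, 1): -30.0, (1, 2): -25.0},
    "iy": {(0, 1): -48.0, (1, 1): -35.0, (1, 2): -27.0},
}
weights = {c: {s: 1.0 for s in scores[c]} for c in scores}
d_before, loss_before = mce_gpd_step(weights, scores, correct="iy")
d_after, loss_after = mce_gpd_step(weights, scores, correct="iy")
```

Because the scores here are negative, the update lowers the weights of the correct class and raises those of the rival, so the misclassification measure returned by the second call is smaller than that returned by the first.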
3.2. State-based weights
It is readily accepted that the states within an HMM represent feature distributions which may display quite different spectral content. It may be proposed, therefore, that the balance of discriminative importance between sub-band streams may vary according to each HMM state. For this case, the parameter set to be estimated is thus extended to $\{\omega_{l,s}^{(rb)} : r = 0, \ldots, R-1;\ b = 1, \ldots, B_r;\ s = 1, \ldots, S\}_{l=1}^{L}$, where $S$, the number of HMM states, is typically 3 for phonetic modelling.
Given the sequence of multi-resolution feature vectors $X = \{X^{MR}_{t=1}, X^{MR}_{t=2}, \ldots, X^{MR}_{t=T}\}$ and the optimal state sequence $\Theta_k = [\theta_1, \theta_2, \theta_3, \ldots, \theta_T]$ for class $k$, the score for a particular band (not taking into account state-to-state transition probabilities) is
$$ P_k(X^{(rb)}, \Theta_k) = \sum_{t=1}^{T} \omega_{k,\theta_t}^{(rb)} \log p_{k,\theta_t}(X_t^{(rb)}) \tag{14} $$
where $\omega_{k,\theta_t}^{(rb)}$ represents the linear weight associated with state $\theta_t$. We define $\mathcal{T}_{k,s}$ to represent the set of time indices in the state sequence such that the state association of the feature vector $X_t^{(rb)}$ at time $t$ is state $s$:
$$ \mathcal{T}_{k,s} = \{t \mid \theta_t = s\}, \quad 1 \leq s \leq S, \quad 1 \leq t \leq T. \tag{15} $$
This formulation is similar to that used by Chengalvarayan and Deng (1997) on state-dependent linear transforms trained according to the MCE objective. Following through the gradient calculation Equation (12) using this formulation, the weight update equations for the correct class $k$ and most confusable class $\eta$ are subsequently refined to
$$ \omega_{k,s}^{(rb),i+1} = \omega_{k,s}^{(rb),i} - \varepsilon\,(\ell_k[\ell_k - 1]) \sum_{t \in \mathcal{T}_{k,s}} \log p_{k,t}(X_t^{(rb)}) \tag{16a} $$
$$ \omega_{\eta,s}^{(rb),i+1} = \omega_{\eta,s}^{(rb),i} + \varepsilon\,(\ell_k[\ell_k - 1]) \sum_{t \in \mathcal{T}_{\eta,s}} \log p_{\eta,t}(X_t^{(rb)}). \tag{16b} $$
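The state-dependent update of Equations (16a) and (16b) differs from the class-level update only in that frame log-likelihoods are first accumulated per state along the aligned state sequence. A minimal sketch for a single stream follows. The function name, argument layout and numbers are assumptions, and the loss value is taken as given rather than computed.

```python
def state_weight_update(weights, frame_log_probs, state_seq,
                        loss, epsilon, is_correct_class):
    """Update per-state weights of one stream in place.

    weights:          {state -> weight} for one class and one stream
    frame_log_probs:  log p(X_t^(rb)) for t = 1..T under the aligned states
    state_seq:        state index theta_t for each frame
    is_correct_class: True applies Equation (16a), False applies (16b)
    """
    # Sum the frame log-likelihoods over the frames assigned to each state.
    per_state = {}
    for log_p, state in zip(frame_log_probs, state_seq):
        per_state[state] = per_state.get(state, 0.0) + log_p
    factor = epsilon * loss * (loss - 1.0)
    sign = -1.0 if is_correct_class else 1.0
    for state, total in per_state.items():
        weights[state] += sign * factor * total
    return weights

# Hypothetical three-state alignment over six frames.
w = {1: 1.0, 2: 1.0, 3: 1.0}
state_weight_update(w, [-4.0, -5.0, -6.0, -3.0, -2.0, -1.0],
                    [1, 1, 2, 3, 3, 3],
                    loss=0.6, epsilon=0.01, is_correct_class=True)
```

States that account for more of the (negative) log-likelihood receive the larger adjustment, which is how the balance between streams can come to differ from state to state.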
4. Phone modelling experimental results
The use of the multi-resolution features and models described in Sections 2 and 3 is investigated in this section for improved acoustic phonetic modelling on the TIMIT database (Garofolo et al., 1993). The TIMIT database was designed to give broad phonetic coverage, and is thus appropriate for assessing the performance of new approaches to improved phonetic modelling. As recently reiterated by Lippmann (1997), improvements to basic acoustic modelling remain an important objective in ASR research. The database consists of 6300 sentences with manually segmented and labelled phonetic transcriptions, 80% of which are apportioned to training and the remainder to testing. All the results presented are for the core TIMIT test set of 192 sentences.
4.1. Concatenated multi-resolution sub-band features
As discussed in the introduction, acoustic models trained on concatenated multi-resolution sub-band features offer one approach to modelling. In this section, both sub-band feature concatenation for a given number of bands, referred to as mono-resolution modelling, and multi-resolution feature concatenation, as described in Section 2, are explored.
TABLE I. TIMIT core test set results for context-independent models trained on concatenated sub-band features

Resolution  No. of bands  Band edges (kHz)                             Cepstral coeffs   Correct (%)  Accuracy (%)
r0          1             0, 7.9                                       13                65.5         62.3
r1          2             0, 2.0, 7.9                                  7,6               68.0         64.4
r2          3             0, 0.9, 2.7, 7.9                             5,4,4             66.7         62.3
r3          4             0, 0.6, 1.7, 3.7, 7.9                        4,3,3,3           65.6         61.1
r4          8             0, 0.3, 0.6, 1.0, 1.7, 2.7, 3.7, 5.3, 7.9    2,2,2,2,2,1,1,1   61.7         55.3
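The Correct and Accuracy columns follow the usual definitions, %Correct = (H/N) × 100 and %Accuracy = ((H − I)/N) × 100, where N is the number of test tokens, H the number of correctly recognized labels and I the number of label insertions. A small sketch of the two measures is given below. The function names and the token counts are hypothetical; the counts were picked only so that the output matches the percentages of the r0 row and are not the actual counts from the experiments.

```python
def percent_correct(num_tokens, num_hits):
    """%Correct = (H / N) * 100."""
    return 100.0 * num_hits / num_tokens

def percent_accuracy(num_tokens, num_hits, num_insertions):
    """%Accuracy = ((H - I) / N) * 100, which also penalizes insertions."""
    return 100.0 * (num_hits - num_insertions) / num_tokens

# Hypothetical counts: N = 1000 tokens, H = 655 hits, I = 32 insertions.
correct = percent_correct(1000, 655)
accuracy = percent_accuracy(1000, 655, 32)
```

Accuracy can never exceed Correct, and the gap between the two columns reflects the insertion rate.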
4.1.1. Concatenated sub-band cepstral features
Our first set of experiments set out to assess the performance of different sub-band decompositions forming separate resolution levels. Context-independent continuous density HMMs with 20 mixtures were trained directly for each of the 39 phoneme classes specified by Lee and Hon (1989). Concatenated feature vectors created from cepstral analysis of the mel-frequency log energy spectra grouped into sub-bands were formed according to Equation (3). Equal bandwidth decomposition in the mel-frequency domain was applied. The number of coefficients retained from each sub-band cepstrum was chosen to be approximately equal (with extra coefficients retained from the lowest band if necessary) to achieve a matched feature length of 13 coefficients. This allowed initial direct comparisons to be made with acoustic models of identical complexity trained on the ubiquitous 13 MFCC features. Delta and delta-delta coefficients were appended in all cases. Table I gives the sub-band decompositions used (with the band-edge frequencies given in kHz) and the number of cepstral coefficients retained. Our spectral feature extraction uses 20 overlapped mel-frequency filterbanks. Features were extracted from frames of length 25 ms at a rate of 10 ms. The HTK toolkit (Young, Woodland, Odell, Ollason & Valtchev, 1996) was used for training and recognition. In Table I, particular sub-band decompositions are denoted by a resolution-level identifier, e.g. r0 (full-band), r1 (two-bands), etc. These identifications are maintained in subsequent experimental descriptions. Results are given for percentage correct and percentage accuracy.2 The accuracy figures are used in all subsequent discussion as the primary basis for performance comparison. No language model was applied in any of the context-independent experiments.
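The allocation of a fixed budget of 13 coefficients across bands can be illustrated with a short sketch. This is one reading of the rule described above (roughly equal shares, with any surplus given to the lowest bands first); the function name is an assumption and the authors' exact procedure is not stated.

```python
def allocate_coefficients(total, num_bands):
    """Share `total` coefficients across bands as evenly as possible,
    giving any extra coefficients to the lowest-frequency bands first."""
    base, extra = divmod(total, num_bands)
    return [base + 1 if band < extra else base for band in range(num_bands)]

allocations = {bands: allocate_coefficients(13, bands) for bands in (1, 2, 3, 4, 8)}
```

Under this reading the allocations for 1, 2, 3, 4 and 8 bands come out as 13; 7,6; 5,4,4; 4,3,3,3 and 2,2,2,2,2,1,1,1.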
The most notable result from Table I is the increase in accuracy and recognition obtained using two sub-bands compared to the standard full-band cepstrum. Whilst the result is given for 20 mixtures, the improvement of two-band over full-band cepstra is consistent at all mixture levels, indeed improving with increasing mixtures. As the number of bands is increased to three, accuracy is similar to that of full-bandwidth cepstra, with further decomposition to four bands giving similar recognition performance with a decrease in accuracy. For a larger number of narrow sub-bands the performance begins to fall off more significantly. The low-order coefficients in the eight-band example primarily encode the ratio of energies between neighbouring mel-frequency filterbanks. This feature set thus loses broader correlation information. The de-correlation assumption is thus also stretched in validity for this feature set.
Footnote 2: As is common practice, if N is the number of test tokens, H is the number of correctly recognized labels and I is the number of label insertions, then %Correct = (H/N) × 100 and %Accuracy = ((H − I)/N) × 100.
The first set of results in Table II presents alternative choices for the cepstral coefficients selected from the two sub-bands (again limited initially to a total of 13) in order to assess
TABLE II. Effects of different sub-band decomposition and feature retention

No. of bands  Band edges (kHz)        Cepstral coeffs  Correct (%)  Accuracy (%)
2             0, 2.0, 7.9             7,6              68.0         64.4
2             0, 2.0, 7.9             8,5              67.4         63.7
2             0, 2.0, 7.9             5,8              67.8         64.4
2             0, 1.5, 7.9             6,7              67.5         64.3
2             0, 3.7, 7.9             8,5              66.6         63.2
2             0, 2.0, 7.9             8,8              68.1         64.4
3             0, 0.9, 2.7, 7.9        7,7,6            67.2         64.0
4             0, 0.6, 1.7, 3.7, 7.9   5,5,5,5          67.3         63.7
the relative importance of discriminative information captured by one band at the expense of another. This demonstrates a decrease in accuracy if the number of coefficients retained from the higher sub-band is reduced. Results for alternative dual sub-band frequencies, for band splitting at 1.5 and 3.7 kHz, respectively, again demonstrate a small loss from under-emphasizing the upper band contribution. Results from the last rows of Table II demonstrate that increasing the number of coefficients retained increases performance for the three and four bands compared to Table I, giving better performance than 13 full-band MFCCs. These results confirm the proposition that the localized cepstra, particularly with respect to keeping high-order coefficients, do indeed provide features with greater discriminative potential. It is also noted, from a comparison of the confusion matrices for the full- and sub-band results, that certain classes were significantly better recognized by the models trained from one feature set over the other. Individual increases in class recognition scores of 5% or greater were noted for the two-band case over full-band for fricatives /v/, /f/ and /ch/, stops /d/, /dx/ and /p/, and also surprisingly for silence. It is noted that, for the 39 phonetic model set (Lee & Hon, 1989), all the short closure classes (e.g. /bcl/, /kcl/ etc.) are folded into the /sil/ label for training and testing. This would suggest that localized features do indeed improve performance for these types of sounds, with more envelope detail captured in the higher band. Balancing advantages between the feature sets existed for two liquid and two nasal classes, with vowel discrimination remaining similar to full band. Finally, the use of overlapping sub-band groups (above that of underlying overlapping mel-frequency filterbank analysis), whilst not exhaustively explored, appeared to give no benefit.
4.1.2. Concatenated multi-resolution sub-band cepstral features
As a first step to exploring the joint discriminative capabilities of our proposed multi-resolution feature set, acoustic models were trained on multi-resolution concatenated features as introduced in Section 2. This leads to a much greater dimensional feature space. Results on the TIMIT core test set are given in Table III for multi-resolution feature vector concatenation using an equal number of coefficients per resolution level (identified according to the decompositions presented in Table I). A colon is used to separate the choice of features from each resolution. The combination of full-band features with sub-band features yields a performance improvement over the mono-resolution results in Table I when full-band features are concatenated with those from a three- or four-band decomposition. With two bands, the performance appears bounded by the two-band mono-resolution result given in Table I. Similarly, when features from the three- and four-band decomposition are combined, an improvement over
252
P. M. McCourt et al. TABLE III. TIMIT core test results for multi-resolution concatenated features Resolution r0 r1 r0 r2 r0 r3 r1 r2 r2 r3
Cepstral Features (13:7,6) (13:5,4,4) (13:4,3,3,3) (7,6:5,4,4) (5,4,4:4,3,3,3)
Correct (%) 67.9 67.2 67.8 67.5 68.0
Accuracy (%) 64.3 63.3 63.7 63.3 63.2
either feature set in isolation is observed. However, when the two-band features are joined to the three-band features, the result is lower than for the two-band features on their own, but higher than for the three-band case. As the feature space is quite large, experiments were carried out reducing the dimension of the concatenated multi-resolution vector to match the length of the original 13-element MFCC vector. A simple approach was to halve the number of retained coefficients from each resolution relative to those originally given in Table I. In this case, the sub-band cepstral features could be considered as replacing the higher coefficients of the full-band cepstrum. Initially, the reduced dimension vector was formed by retaining the lower-order coefficients from each sub-band, but we also experimented with retaining the higher-order coefficients for the higher-frequency bands. However, these all generally led to a reduction in performance compared to the concatenated results (Table III) or the mono-resolution results (Table I).

4.2. Independent multi-resolution sub-band models

In this configuration, the multi-resolution and sub-band decomposition used for the feature extraction is matched by acoustic modelling in independent streams, with the likelihood function calculated according to Equation (5a,b,c) as appropriate. As already discussed in the introduction, estimation of the stream distributions within each state is based on a frame-level likelihood combination assumption. Independent stream modelling results are presented in Table IV for the three configurations discussed in Section 3. These are identified in Table IV as follows.

(i) Sub-band modelling. Within a given resolution decomposition as defined in Table I, each sub-band can represent a single stream, e.g. the model naming convention r1b1–r1b2 used in Table IV identifies the two sub-bands from r1 modelled independently.
(ii) Multi-resolution modelling. From the modelling combination viewpoint we can also consider separate independent resolution streams formed from the respective concatenated sub-band features, e.g. r0–r1 represents resolution 0 (full band) and the concatenated feature set from resolution 1 (two-band) modelled by independent streams.
(iii) Multi-resolution sub-band modelling. Finally, sub-bands across a number of resolutions can be modelled by independent distributions, e.g. r0–r1b1–r1b2 implies the full band and each of the two sub-bands, considered as independent streams.
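As an illustration only, the three stream configurations above can be sketched in Python with NumPy. The 24-channel filterbank, the even channel split for the two-band case, the inclusion of the zeroth cepstral coefficient and the helper names (`dct_ii`, `subband_cepstra`, `combined_log_likelihood`) are all assumptions made for this sketch. The actual band boundaries are those of Table I, and in practice the per-stream scores would come from the trained mixture models rather than from placeholder values.

```python
import numpy as np

def dct_ii(x, n_keep):
    """First n_keep unnormalised DCT-II coefficients of a 1-D array."""
    n = len(x)
    k = np.arange(n_keep)[:, None]   # cepstral coefficient index
    m = np.arange(n)[None, :]        # filterbank channel index
    basis = np.cos(np.pi * k * (m + 0.5) / n)
    return basis @ x

def subband_cepstra(log_energies, edges, n_keep):
    """One cepstral vector per sub-band; edges are channel boundaries."""
    return [dct_ii(log_energies[a:b], k)
            for a, b, k in zip(edges[:-1], edges[1:], n_keep)]

def combined_log_likelihood(stream_scores, weights=None):
    """Weighted linear sum of per-stream log-likelihoods (unity weights by default)."""
    scores = np.asarray(stream_scores, dtype=float)
    if weights is None:
        weights = np.ones_like(scores)
    return float(np.dot(weights, scores))

# Hypothetical frame of 24 log mel filterbank energies (channel count assumed).
rng = np.random.default_rng(0)
frame = rng.normal(size=24)

r0 = subband_cepstra(frame, edges=[0, 24], n_keep=[13])         # full band, (13)
r1 = subband_cepstra(frame, edges=[0, 12, 24], n_keep=[7, 6])   # two bands, (7,6)

streams = {
    "r1b1-r1b2": [r1[0], r1[1]],              # (i) sub-band streams
    "r0-r1": [r0[0], np.concatenate(r1)],     # (ii) multi-resolution streams
    "r0-r1b1-r1b2": [r0[0], r1[0], r1[1]],    # (iii) multi-resolution sub-band streams
}

# Placeholder per-stream log-likelihoods for one frame and one state.
unity_score = combined_log_likelihood([-10.0, -12.5])
```

With unity weights the combined score is simply the sum of the stream log-likelihoods; passing a `weights` argument gives the weighted linear combination form.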
TABLE IV. Independent stream modelling results on the TIMIT core test set

Stream models              Stream features          No. of streams   Correct (%)   Accuracy (%)
r1b1                       (7)                      1                56.1          52.6
r1b2                       (6)                      1                45.4          42.1
r1b1–r1b2                  (7)–(6)                  2                66.8          63.0
r0–r1                      (13)–(7,6)               2                70.6          65.1
r0–r1b1–r1b2               (13)–(7)–(6)             3                70.4          65.1

r2b1                       (5)                      1                42.5          39.2
r2b2                       (4)                      1                46.7          43.8
r2b3                       (4)                      1                35.6          32.7
r2b1–r2b2–r2b3             (5)–(4)–(4)              3                65.2          60.5
r0–r2                      (13)–(5,4,4)             2                69.8          64.0
r0–r2b1–r2b2–r2b3          (13)–(5)–(4)–(4)         4                69.9          64.0

r3b1                       (4)                      1                36.2          32.7
r3b2                       (3)                      1                38.6          35.8
r3b3                       (3)                      1                37.7          34.9
r3b4                       (3)                      1                31.9          29.3
r3b1–r3b2–r3b3–r3b4        (4)–(3)–(3)–(3)          4                63.2          58.3
r0–r3                      (13)–(4,3,3,3)           2                70.7          64.4
r0–r3b1–r3b2–r3b3–r3b4     (13)–(4)–(3)–(3)–(3)     5                69.5          63.5
Also included in Table IV are results for each sub-band in isolation. In all these experiments each stream is modelled by 20 mixture Gaussian pdfs, such that the number of parameters to be trained is the same as for the concatenated case. The results in Table IV represent the case where all the log-likelihood recombination weights are set to unity. It can be concluded from the results in Table IV that linearly summing the log-likelihoods from different acoustic streams improves on the performance of any of the constituent streams in isolation. In particular, the combination of sub-band streams for each resolution decomposition significantly improves on any individual sub-band performance. These results demonstrate that discriminative information in the different sub-bands is efficiently captured by simple linear combination based on an independence assumption. The performance of the independent sub-bands is, however, lower than that achieved by sub-band feature concatenation in a single stream (Table I), the difference becoming greater with an increasing number of bands. As will be discussed in Section 4.3, weighted combination leads to a better performance match. Both the multi-resolution and multi-resolution sub-band configurations again achieve better performance than the full-band or sub-band models on their own, e.g. for full band with a two-band analysis, both r0–r1 and r0–r1b1–r1b2 from Table IV give an accuracy of 65.1%, which improves on the mono-resolution results for r0 and r1 of 62.2% and 64.4%, respectively (from Table I), and on the 63% attained by the two-band independent stream model r1b1–r1b2 (from Table IV). A similar pattern also occurs for the three- and four-band cases. The results suggest less information independence, as the increase in accuracy is smaller compared to the respective resolutions in isolation.
None the less, the result confirms that multi-resolution modelling does provide supplementary information beyond that of full-band features in isolation. Unlike the sub-band case, however, the multi-resolution stream models fare better than the equivalent concatenated options (Table III). This is probably because separate mixture modelling within the streams better exploits the acoustic feature distributions than a single high-dimension vector space. It also seems that, except for the four-band case, multi-resolution and multi-resolution sub-band modelling perform more or less identically. It is also apparent from Table IV that for multi-resolution streams the improvement in percentage recognition of correct labels is greater than that in accuracy. This implies that the benefits of reductions in substitution and deletion errors are offset somewhat by a rise in the insertion rate. For the r0–r1 and r0–r1b1–r1b2 models, based on full-band and two-band acoustic features, vowels /ih/, /eh/ and the fricatives /ch/, /th/ show significant improvement (i.e. an individual performance increase of 5% or greater in correct classification) compared to the concatenated feature two-band model, with the majority of labels performing better than or the same as the isolated streams.

4.3. Discriminative weight training results

Discriminative training as described in Section 3 was used to train state-independent and state-dependent log-likelihood linear combination weight sets. This was again performed for each of the stream modelling options described in Section 4. Training was performed using only dialect region 1 and dialect region 5 utterances from the TIMIT training set (these provide more than sufficient data to estimate the relatively small number of weight parameters). In Table V, results are presented for monophone accuracy on the core test set, for unity weighting and for state-independent (si) and state-dependent (sd) trained weights. As before, 20 mixture Gaussian models were used. In the three-band case, extra features were included compared to Table IV to improve the baseline score with unity weights.

TABLE V. Discriminatively trained weighted model combination on the TIMIT core test set (si, state-independent; sd, state-dependent)

Stream models          Feature dimensions    Stream weighting   Correct (%)   Accuracy (%)
r0                     (13)                  -                  65.5          62.3
r1                     (7,7)                 -                  68.0          64.4
r0–r1                  (13)–(7,7)            unity              70.6          65.1
                                             trained (si)       72.3          67.2
r1b1–r1b2              (7)–(7)               unity              66.8          63.0
                                             trained (si)       68.1          64.8
                                             trained (sd)       67.3          63.8
r0–r1b1–r1b2           (13)–(7)–(7)          unity              70.6          65.1
                                             trained (si)       72.5          67.5
                                             trained (sd)       71.7          66.5
r2b1–r2b2–r2b3         (7)–(7)–(6)           unity              67.1          62.2
                                             trained (si)       70.0          65.1
r0–r2b1–r2b2–r2b3      (13)–(7)–(7)–(6)      unity              71.1          65.1
                                             trained (si)       73.2          67.3

Discriminatively trained stream weights are shown to benefit recognition accuracy. A clearer benefit is gained from state-independent weights than from state-dependent weights. This is perhaps because, while different state segmentations for the same phoneme normally occur in different contexts during decoding, the use of state-dependent trained weights will tend towards a particular state segmentation, thus losing a little in generality of modelling. The discriminatively weighted sub-band models match or surpass the result from the equivalent concatenated feature models. In terms of the changes to the weights,
these are very slight, of the order of ±5% of their starting values of unity. As the update Equations (13) show, the changes in the weights are proportional to the independent model score, with lower log-likelihood scoring models creating the largest change. The effect of the trained weights is therefore to achieve small changes to the relative dynamic ranges of the stream combinations. This none the less produces an improvement in performance, with the highest accuracy of 67.5% achieved by the full-band with two-band stream model, extending the advantage of the multi-resolution model over full band in isolation. As specified, MCE training is only applied to the stream weights using scores from ML-trained models. Re-iteration of ML training with the new weights produced only a marginal improvement to 67.6% accuracy.

4.4. Context-dependent modelling results

Final experiments were carried out on context-dependent phone modelling on the TIMIT database. The sub-band boundaries as originally defined in Table I and used throughout previous experiments were employed. State-tied word-internal context-dependent models were created using decision tree clustering (Odell, 1995) with a minimum occupation threshold also applied. Approximately the same number of tied states were created in each case to facilitate comparison on the basis of feature selection and model configuration alone. The triphones were created from an initial 48 monophone set (Young & Woodland, 1994) with recognition labels collapsed to the 39 set for final scoring. In all cases 10 mixture CDHMMs were trained. Unity weights were used in the stream-based models. In the reported experiments a phone bigram language model built from the training set was applied. Results are presented in Table VI.
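The Corr, Acc, Sub, Del and Ins columns of Table VI appear to follow the usual HTK-style scoring identities, in which percentage correct ignores insertions and accuracy penalises them. The short sketch below assumes those definitions (they are not stated explicitly in the text) and uses a hypothetical helper name; it recomputes the full-band r0 figures from the reported error rates.

```python
def score_from_errors(substitutions, deletions, insertions):
    """Return (%correct, %accuracy) from error rates given in percent.

    Assumed identities: correct = 100 - Sub - Del, accuracy = correct - Ins.
    """
    correct = 100.0 - substitutions - deletions
    accuracy = correct - insertions
    return round(correct, 1), round(accuracy, 1)

# Full-band r0 error rates as reported in Table VI: Sub 16.7, Del 7.1, Ins 3.2.
r0_correct, r0_accuracy = score_from_errors(16.7, 7.1, 3.2)
print(r0_correct, r0_accuracy)
```

Under these identities, a configuration that lowers substitutions and deletions but raises insertions can show a clear gain in Corr with only a small gain in Acc, which is the pattern reported for the r0–r1b1–r1b2 model.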
TABLE VI. Context-dependent modelling results on the TIMIT core test set

Model definition    Features         Corr %   Acc %   Sub %   Del %   Ins %
r0                  (13)             76.2     73.0    16.7    7.1     3.2

r1                  (7,7)            77.2     73.9    15.7    7.1     3.3
r1b1–r1b2           (7)–(7)          77.3     73.2    16.3    6.4     4.0

r2                  (7,7,6)          77.8     72.7    16.6    6.4     4.2
r2b1–r2b2–r2b3      (7)–(7)–(6)      77.6     72.9    16.6    5.8     4.7

r0 r1               (13,7,7)         77.5     73.0    16.7    5.7     4.5
r0–r1               (13)–(7,7)       77.7     73.1    16.6    5.7     4.6
r0–r1b1–r1b2        (13)–(7)–(7)     78.4     73.4    16.4    5.2     5.0

From Table VI, it can be seen that the accuracy of the independent sub-band stream models and the equivalent concatenated sub-band feature models is closer than was observed in the monophone experiments (where the concatenated feature models displayed an advantage). This is useful because, as well as there being comparable performance (in the three-band case) or improved performance (for the two-band case) relative to full band, the use of independent sub-band streams opens up the potential benefits discussed in Section 5. For the multi-resolution models, the configuration based on independent full-band and sub-band streams performed best, with an accuracy of 73.4% improving on either the full-band model (73.0%) or the two-band stream model (73.2%). However, a higher insertion rate seems to contribute to a disadvantage for this configuration compared to the mono-resolution concatenated two-band model. This situation is in contrast to context-independent modelling, where the multi-resolution models worked best overall. Although the multi-resolution result is still better than for full-band models, the degree of improvement is also less than in the context-independent case. It could therefore be argued that much of the original advantage afforded by the multi-resolution models may have been due to an inherent increase in model complexity. None the less, an improvement over full band is still observed. In the multi-resolution case, higher degrees of tying were employed in order to reduce the overall number of state parameters to be estimated. This had little impact on performance, however. The best result achieved, 73.9% accuracy, is for the two-band concatenated features model, outperforming state-clustered HMM modelling from full-band features with an accuracy of 73.0%. This compares well with other results previously published on this task using full-band features with various model paradigms and training approaches, e.g. 71.1% (Lamel & Gauvain, 1993) (using gender-dependent HMMs), 72.3% (Young & Woodland, 1994) (state-clustered biphones; this figure is for a slightly larger test set than the core test set),
73.5% (Deng & Sameti, 1996) (polynomial state HMMs) and 73.9% (Robinson, 1994) (recurrent neural nets). As this increase is primarily from a better discriminative feature choice, incorporation within other training strategies, e.g. Ming and Smith (1998) (efficiently building triphones from biphone models), may extend performance further.

5. Future work

Beyond the baseline superiority in performance, the success of multi-resolution sub-band modelling at different decompositions presents certain potential additional benefits and issues for further investigation. These are now discussed.

(i) While the present work has used a fixed class-independent sub-band decomposition, our intention is to explore phonetic or linguistic class-dependent sub-band decomposition. One approach for vowel sounds could be based on a dual harmonic and unvoiced band decomposition of the training vectors, such as that employed by the harmonic noise model of speech (Stylianou & Cappe, 1998). This is based on the assertion that uniformity of spectral content may create localized cepstral features with less variance. This separation may be more distinct for telephone bandwidth speech where, with increased resolution of a mel-frequency filterbank analysis, variations in the harmonic band separations are more apparent. An analysis of the harmonic-noise split from labelled data could indicate initially how class-specific this information would be. As an alternative, the training data could be directly decomposed, and although the sub-band dimensions would vary, a constant dimension feature vector could be obtained by retaining a fixed number of coefficients from variable-length sub-band cepstra. This is similar to our recent work on segmental models based on trajectory features derived from cepstral analysis of features across time (Harte, Vaseghi & Milner, 1998), where a fixed number of coefficients were retained from variable duration segments with variable-length DCT basis functions. These features could be directly employed in training, where any model-specific decomposition information may be directly captured. Our results indicated that sub-band features in particular helped discriminate the fricative class set and stop class set better than full band. Some pre-analysis of labelled data or the spectral characteristics from trained classified acoustic models would again be necessary to reveal any class-specific spectral divisions. The use of class-dependent features none the less creates certain challenges to be resolved for embedded training.

(ii) The benefits of sub-band acoustic models are currently being explored for speaker adaptation using maximum likelihood linear regression (MLLR) transforms (Leggetter & Woodland, 1995). As well as improvements in recognition accuracy, the reduced dimensions of independent sub-band cepstra permit reductions in the complexity of the matrix inversion
required to compute the linear transform, the reduction being particularly significant for full covariance modelling. The decomposition also permits investigation as to whether speaker characteristics useful for adaptation are constrained within specific bands, and consequently whether limited bandwidth adaptation can lead to further complexity reductions. Some promising results have been produced thus far.

(iii) While much of the work on sub-band modelling showed advantages for spectrally constrained noise, in general no definite advantage compared to full band is reported for broadband noise sources. Some very initial work by ourselves (McCourt et al., 1998) did show that multi-resolution models performed better in experiments with low SNR white noise. This area requires more research, however, to understand how different SNRs in different resolution bands affect the comparative reliability of discriminative information and how this can be effectively weighted and combined for increased robustness.

(iv) The combination of alternative acoustic modelling strategies for speech recognition is receiving increasing interest in an effort to jointly exploit the discriminative strengths of particular models into a global optimum (Halberstadt & Glass, 1998; McCourt et al., 1999). Discriminative model combination of our multi-resolution sub-band acoustic models and segmental models (Harte et al., 1998) has already produced significant advantages in phonetic classification. Work on successfully incorporating both models in continuous recognition is currently in progress.

6. Conclusions

The use of sub-band linear transformations of the log energy spectra to capture localized envelope information provides observation features which have been shown to outperform traditional full-band cepstra for HMM-based phonetic acoustic modelling on the TIMIT database. This is demonstrated both for mono-resolution features, chosen from a sub-band decomposition, and for multi-resolution features chosen from a full-band analysis and a small number of bands. This advantage was obtained both for models trained on concatenated features and for those based on multiple independent streams. Discriminative training of the class-dependent stream weights to a minimum classification error criterion was demonstrated to improve the performance of the stream-based models. Training the stream weights at the state level yielded no particular benefit, however. In the mono-resolution case, the best performance is achieved with features extracted across two sub-bands, with good performance attained using up to four bands. Beyond this the performance decreases relative to full band. For multi-resolution modelling, features chosen from full band and one resolution of sub-band analysis perform well. Experiments on context-independent modelling of a 39 phone set from the TIMIT database achieved a best accuracy of 67.5% for a stream-based model with class-dependent weights combining full-band features with features chosen from a two-band analysis. This improves on the 62.3% given by full-band models in isolation. For context-dependent phonetic modelling, an accuracy of 73.4% was achieved by the same model configuration with unity weightings, compared to 73.0% for full-band models. However, models trained on concatenated features from a two-band analysis gave an accuracy of 73.9%. This result compares favourably with others previously published on this task, and represents a useful increase on standard full-band models, with the proposed models operating easily within an HMM-based single-pass decoding strategy for continuous speech.

This work is supported by a grant from the Engineering and Physical Sciences Research Council of the United Kingdom under EPSRC Grant GR/L60463. Amendments suggested by the reviewers are gratefully acknowledged.
References

Allen, J. (1994). How do humans process and recognise speech? IEEE Transactions on Speech and Audio Processing, 2, 567–577.
Beyerlin, P. (1998). Discriminative model combination. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, volume I, pp. 481–484.
Biem, A., McDermott, E. & Katagiri, S. (1995). A discriminative filter-bank model for speech recognition. Proceedings of Eurospeech, Madrid, pp. 545–548.
Bourlard, H. & Dupont, S. (1997). Subband-based speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, volume II, pp. 1251–1254.
Bourlard, H., Dupont, S., Hermansky, H. & Morgan, N. (1996). Towards sub-band based speech recognition. Proceedings of EUSIPCO, Trieste, pp. 1579–1582.
Cerisara, C., Haton, J.-P., Mari, J.-F. & Mohr, D. (1997). Multi-band continuous speech recognition. Proceedings of Eurospeech, Rhodes, pp. 1235–1238.
Chengalvarayan, R. & Deng, L. (1997). HMM-based speech recognition using state-dependent, discriminatively derived transforms on mel-warped DFT features. IEEE Transactions on Speech and Audio Processing, 5, 243–256.
Deng, L. & Sameti, H. (1996). Transitional speech units and their representation by regressive Markov states: application to speech recognition. IEEE Transactions on Speech and Audio Processing, 4, 301–306.
Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D. & Dahlgren, N. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM, NTIS order number PBO1-100354, available from LDCO.
Ghitza, O. (1994). Auditory models and human performance in tasks related to speech coding and speech recognition. IEEE Transactions on Speech and Audio Processing, 2, 115–132.
Halberstadt, A. K. & Glass, J. R. (1998). Heterogeneous measurements and multiple classifiers for speech recognition. Proceedings of the International Conference on Speech and Language Processing, Sydney.
Harte, N., Vaseghi, S. & Milner, B. (1998). Joint recognition and segmentation using phonetically derived features and a hybrid phoneme model. Proceedings of the International Conference on Speech and Language Processing, Sydney, pp. 999–1002.
Herman, S. & Sukkar, R. (1998). Joint MCE estimation of VQ and HMM parameters for Gaussian mixture selection. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, volume I, pp. 485–488.
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87, 1738–1752.
Juang, B.-H., Chou, W. & Lee, C.-H. (1997). Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 5, 257–265.
Juang, B.-H. & Katagiri, S. (1992). Discriminative learning for minimum error classification. IEEE Transactions on Signal Processing, 40, 3043–3054.
Lamel, L. & Gauvain, J. (1993). High performance speaker-independent phone recognition using CDHMM. Proceedings of Eurospeech ’93, Berlin, pp. 121–124.
Lee, K. F. & Hon, H. W. (1989). Speaker independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 37, 1641–1648.
Leggetter, C. J. & Woodland, P. C. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9, 171–186.
Lippmann, R. H. (1997). Speech recognition by machines and humans. Speech Communication, 22, 1–15.
McCourt, P., Harte, N. & Vaseghi, S. (1999). Combined temporal and spectral multi-resolution phonetic modelling. Proceedings of Eurospeech, Budapest.
McCourt, P., Vaseghi, S. & Harte, N. (1998). Multi-resolution cepstral features for phoneme recognition across speech sub-bands. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, volume I, pp. 557–560.
Ming, J. & Smith, F. J. (1998). Improved phone recognition using Bayesian triphone models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, volume I, pp. 409–412.
Mirghafori, N. & Morgan, N. (1998). Transmissions and transitions: a study of two common assumptions in multiband ASR. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, volume II, pp. 713–716.
Normandin, Y., Cardin, R. & De Mori, R. (1994). High performance connected digit recognition using maximum mutual information estimation. IEEE Transactions on Speech and Audio Processing, 2, 299–311.
Odell, J. J. (1995). The use of context in large vocabulary speech recognition. PhD Thesis. Cambridge University.
Okawa, S., Bocchieri, E. & Potamianos, A. (1998). Multi-band speech recognition in noisy environments. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, volume II, pp. 641–644.
Potamianos, G. & Graf, H. P. (1998). Discriminative training of HMM stream exponents for audio-visual speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, volume IV, pp. 3733–3737.
Robinson, A. (1994). An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5, 298–305.
Stylianou, Y. & Cappe, O. (1998). A system for voice conversion based on probabilistic classification and a harmonic plus noise model. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, volume I, pp. 281–284.
Tibrewala, S. & Hermansky, H. (1997). Sub-band based recognition of noisy speech. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, volume II, pp. 1255–1258.
Valtchev, V. (1995). Discriminative methods in HMM-based speech recognition. PhD Thesis. Cambridge University.
Vaseghi, S., Harte, N. & Milner, B. (1997). Multi-resolution phonetic/segmental features and models for HMM-based speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, pp. 1263–1266.
Young, S. J. & Woodland, P. C. (1994). State clustering in hidden Markov model-based continuous speech recognition. Computer Speech and Language, 8, 369–383.
Young, S. J., Woodland, P. C., Odell, J., Ollason, D. & Valtchev, V. HTK-Hidden Markov Model Toolkit, Entropic, Cambridge.

(Received 26 August 1999 and accepted for publication 31 March 2000)