Computer Speech and Language (1989) 3, 239–251
Semi-continuous hidden Markov models for speech signals

X. D. Huang* and M. A. Jack
Centre for Speech Technology Research, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, U.K.
(* Also Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.)
Abstract

A semi-continuous hidden Markov model, which can be considered as a special form of continuous mixture hidden Markov model with the continuous output probability density functions shared in a mixture Gaussian density codebook, is proposed in this paper. The semi-continuous output probability density function is represented by a combination of the discrete output probabilities of the model and the continuous Gaussian density functions of a mixture Gaussian density codebook. The amount of training data required, as well as the computational complexity of the semi-continuous hidden Markov model, can be significantly reduced in comparison with the continuous mixture hidden Markov model. Parameters of the vector quantization codebook and hidden Markov model can be mutually optimized to achieve an optimal model/codebook combination, which leads to a unified modelling approach to vector quantization and hidden Markov modelling of speech signals. Experimental results are included which show that the recognition accuracy of the semi-continuous hidden Markov model is measurably higher than that of both the discrete and the continuous hidden Markov model.
1. Introduction

Hidden Markov models (HMM), which can be based on either discrete output probability distributions (discrete HMM) or continuous output probability density functions (continuous HMM), have been shown to represent one of the most powerful statistical tools available for modelling speech signals (Jelinek, 1976, 1985; Levinson, Rabiner & Sondhi, 1983; Juang, 1985; Rabiner, Juang, Levinson & Sondhi, 1985; Levinson, 1986; Chow, Dunham, Kimball, Krasner, Kubala, Makhoul, Price, Roucos & Schwartz, 1987; Lee, 1988). In the continuous mixture HMM, parameter estimation of the model is usually based on maximum likelihood methods on the assumption that the observed signals have been generated by a mixture Gaussian process (Juang, 1985) or an autoregressive process (Juang & Rabiner, 1985). Algorithms based on the Parzen estimator (Parzen, 1962) using some kernel function (Soudoplatoff, 1986) have also been used. The continuous mixture HMM usually offers more powerful modelling ability
than the continuous HMM (single mixture) when sufficient training data exist. However, the continuous mixture HMM involves considerable computational complexity and is also very sensitive to initial estimates of several of the model parameters (Juang & Rabiner, 1985; Rabiner et al., 1985). For the left-to-right HMM, a modified segmental k-means clustering procedure (Rabiner, Wilpon & Juang, 1986), similar to the vector quantization (VQ) procedure (Makhoul, Roucos & Gish, 1985) for the discrete HMM, has been developed to obtain reliable initial estimates. In the Parzen estimator, the key problem is how to choose the value of the radius, which has been determined from the sample data using a topological approach. In the estimation of probability density functions, non-parametric methods such as the Parzen estimator usually require more training data than parametric methods such as those based on a Gaussian assumption to achieve similar performance.

For the discrete HMM, VQ makes it possible to use a non-parametric, discrete output probability distribution to model the observed speech signals. The objective of VQ is to find the set of reproduction vectors, or codebook, that represents an information source with minimum expected distortion. By using VQ, the discrete HMM offers faster computation in comparison with the continuous HMM, since computing the discrete output probability of an observation reduces to a table-lookup operation. On the other hand, in the continuous HMM, many multiplication operations are required even when using the simplest single-mixture, multivariate Gaussian density with a diagonal covariance matrix. However, with the discrete HMM some information may be lost in the VQ operations, and the recognition accuracy of the discrete HMM can be considerably lower than that of the continuous HMM (Rabiner et al., 1985; Poritz & Richter, 1986), although opposing results have also been reported (Brown, 1987). As the discrete output probability generally contains more free parameters, the discrete HMM usually requires more training data in comparison with the continuous HMM. Various smoothing techniques have been proposed to eliminate the errors introduced by conventional VQ (Nishimura & Toshioka, 1987; Tseng, Sabin & Lee, 1987; Lee, 1988).

In this paper, a semi-continuous hidden Markov model (SCHMM) is proposed, which extends the discrete HMM by replacing the discrete output probabilities with a combination of the original discrete output probabilities and the continuous probability density functions of a mixture Gaussian codebook, modelled as a parametric family of mixture Gaussian densities. The EM algorithm (E for expectation and M for maximization) (Hasselblad, 1966; Dempster, Laird & Rubin, 1977), developed for estimation of mixture probability densities, works in a similar manner to the Baum-Welch algorithm (Baum, Petrie, Soules & Weiss, 1970). In the case of maximum likelihood estimation of HMM parameters using the Baum-Welch algorithm, the VQ codebook could in principle be adjusted together with the HMM parameters in order to obtain the optimum likelihood of the HMM, although this may not lead to optimal VQ distortion minimization. A unified modelling approach can therefore be applied to vector quantization and hidden Markov modelling of speech signals to achieve an optimum combination of HMM parameters and VQ codebook parameters.
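To make the computational contrast concrete, the following sketch (not from the paper; all names and numbers are illustrative) compares the table lookup used by a discrete HMM output probability with the per-frame Gaussian density evaluation needed by a continuous HMM:

```python
import numpy as np

def discrete_output_prob(b, state, vq_index):
    """Discrete HMM: b is an (N states x L codewords) table; one lookup."""
    return b[state, vq_index]

def gaussian_log_density(x, mean, var):
    """Continuous HMM: log N(x; mean, diag(var)); O(K) multiplications per frame."""
    k = x.shape[0]
    return -0.5 * (k * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((x - mean) ** 2 / var))

# Example with made-up numbers: 5 states, a 128-level codebook, 10-dim frames.
rng = np.random.default_rng(0)
b = rng.dirichlet(np.ones(128), size=5)
x = rng.standard_normal(10)
print(discrete_output_prob(b, 2, 37))
print(gaussian_log_density(x, np.zeros(10), np.ones(10)))
```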
In the SCHMM, the continuous probability density functions of the codebook can be used either in the modified Viterbi decoding algorithm, or in estimation of the SCHMM parameters with the forward-backward algorithm to bridge between continuous observations and discrete parameters. From the continuous HMM point of view, the SCHMM can be considered as a special form of continuous mixture HMM with tied mixture continuous density functions. Because of the binding of the continuous density functions, in the
SCHMM, the number of free parameters and the computational complexity are reduced in comparison with the continuous mixture HMM, while retaining the modelling power of the mixture HMM. From the discrete HMM point of view, the SCHMM can effectively minimize the errors involved in VQ operations. For speaker-dependent and speaker-independent isolated digit recognition, it has been shown that the SCHMM decoder, based directly on discrete HMM parameters, can offer improved recognition accuracy in comparison with the discrete HMM (Huang & Jack, 1988a,b).

This paper is organized as follows. In Section 2, the mathematical formulation of the HMM is reviewed and the concept of the SCHMM using parameter feedback to the VQ codebook is developed. In Section 3, the implementation of the SCHMM is discussed and experimental results for speaker-independent phoneme recognition are presented to permit comparison between the semi-continuous HMM, the discrete HMM, and the continuous HMM. Finally, Section 4 contains a summary and discussion of potential applications.

2. Semi-continuous hidden Markov models and codebook optimization

2.1. Discrete HMM and continuous HMM
An N-state Markov chain is considered with state transition matrix $A = [a_{ij}]$, $i, j = 1, 2, \ldots, N$, where $a_{ij}$ denotes the transition probability from state i to state j; a discrete output probability distribution, $b_j(O_k)$, or a continuous output probability density function, $b_j(x)$, is associated with each state j of the unobservable Markov chain. Here $O_k$ represents discrete observation symbols (usually VQ indices), and x represents continuous observations (usually speech frame vectors) of K-dimensional random vectors. With the discrete HMM, there are L discrete output symbols from an L-level VQ, and the output probability is modelled with discrete probability distributions of these discrete symbols. Let O be the observed sequence, $O = O_{k_1}, O_{k_2}, \ldots, O_{k_T}$, observed over T samples. Here $O_{k_t}$ denotes the VQ codeword $k_t$ observed at time t. The observation probability of such an observed sequence, $\Pr(O|\lambda)$, can be expressed as:

\[ \Pr(O|\lambda) = \sum_{S} \Pr(O, S|\lambda) = \sum_{S} \Pr(O|S, \lambda)\,\Pr(S|\lambda) \tag{1} \]
where S is a particular state sequence, $S = (s_1, s_2, \ldots, s_T)$, $s_t \in \{1, 2, \ldots, N\}$, and the summation is taken over all possible state sequences, S, of the given model λ, which is represented by (π, A, B), where π is the initial state probability vector, A is the state transition matrix, and B is the output probability distribution matrix. In the discrete HMM, classification of $O_{k_t}$ from $x_t$ in the VQ may not be accurate, such as when an acoustic vector $x_t$ is intermediate between two VQ indices. The effects of VQ errors may cause the performance of the discrete HMM to be inferior to that of the continuous HMM (Rabiner et al., 1985).

If the observation to be decoded is not vector quantized, then the probability density function, $f(X|\lambda)$, of producing an observation of continuous vector sequences given the model λ, would be computed instead of the probability of generating a discrete
observation symbol, $\Pr(O|\lambda)$. Here X is a sequence of continuous (acoustic) vectors, $X = x_1, x_2, \ldots, x_T$. The principal advantage of using the continuous HMM is the ability to model speech parameters directly without involving VQ. However, the continuous HMM requires considerably longer training and recognition times, especially when several mixture Gaussian distributions are used to represent the output probability. In the continuous HMM, Equation (1) can be rewritten as:
\[ f(X|\lambda) = \sum_{S} f(X|S, \lambda)\,\Pr(S|\lambda) \tag{2} \]

where the output probability density function can be represented by the Gaussian probability density function. More generally, in the continuous (M-component) Gaussian mixture HMM (Juang, 1985), the output probability density of state j, $b_j(x)$, can be represented as:

\[ b_j(x) = \sum_{k=1}^{M} c_{jk}\, N(x, \mu_{jk}, \Sigma_{jk}) \tag{3} \]
where $N(x, \mu, \Sigma)$ denotes a multi-dimensional Gaussian density function with mean vector μ and covariance matrix Σ. Here $c_{jk}$ is a weighting coefficient for the kth Gaussian component.
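As an illustration of Equation (3), here is a minimal sketch in Python, assuming diagonal covariances (as the experiments in Section 3 also do); the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def mixture_density(x, c, means, variances):
    """Evaluate b_j(x) = sum_k c_jk N(x; mu_jk, diag(var_jk)) for one state.
    c: (M,) mixture weights; means/variances: (M, K) per-component parameters."""
    k = x.shape[0]
    diff = x - means                                     # (M, K)
    log_n = -0.5 * (k * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)
                    + np.sum(diff ** 2 / variances, axis=1))
    return np.sum(c * np.exp(log_n))

# Example: a 3-component mixture over 10-dimensional frames.
rng = np.random.default_rng(1)
x = rng.standard_normal(10)
c = np.array([0.5, 0.3, 0.2])
print(mixture_density(x, c, rng.standard_normal((3, 10)), np.ones((3, 10))))
```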
2.2. Semi-continuous hidden Markov models
In the discrete HMM, the discrete probability distributions are sufficiently powerful to model any random event with mixture Gaussian density distributions. The major problem with the discrete output probability is that the VQ operation partitions the acoustic space into separate regions according to some distortion measure. This introduces errors, as the partition operations may destroy the original signal structure. The VQ codebook can instead be modelled as a family of Gaussian density functions, such that the distributions overlap rather than partition. Each codeword of the codebook can then be represented by one of the Gaussian probability density functions and may be used together with others to model an acoustic event. The use of a parametric family of finite mixture densities (a mixture density VQ) can be closely combined with the HMM methodology. From the continuous mixture HMM point of view, the output probability in the continuous mixture HMM can be shared among the Gaussian probability density functions of the VQ. This can reduce the number of free parameters to be estimated, as well as the computational complexity. From the discrete HMM point of view, the partitioning of the VQ becomes unnecessary and is replaced by mixture density modelling with overlap, which can effectively minimize the VQ errors.
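The following sketch shows one way such a Gaussian codebook could be constructed: a plain k-means pass (standing in for the LBG algorithm used in Section 3) followed by fitting a mean and diagonal covariance to each cell, so that codewords become overlapping densities. This is an illustrative reconstruction under stated assumptions, not the paper's procedure:

```python
import numpy as np

def gaussian_codebook(frames, L, n_iter=20, seed=0):
    """Turn an L-level VQ codebook into a family of Gaussian densities.
    frames: (T, K) array of training vectors; returns (L, K) means/variances."""
    rng = np.random.default_rng(seed)
    means = frames[rng.choice(len(frames), L, replace=False)]
    for _ in range(n_iter):
        # assign each frame to its nearest codeword (minimum distortion)
        d = ((frames[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(L):
            cell = frames[labels == j]
            if len(cell):
                means[j] = cell.mean(0)
    # fit a diagonal covariance to each cell (floored for numerical safety)
    variances = np.stack([
        frames[labels == j].var(0) + 1e-3 if np.any(labels == j)
        else np.ones(frames.shape[1])
        for j in range(L)])
    return means, variances
```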
The problem of estimating the parameters which determine a mixture density has been the subject of a large and diverse body of literature spanning some 90 years (Redner & Walker, 1984). The EM algorithm (Dempster et al., 1977) is a specialization, to the mixture density context, of a general algorithm for obtaining maximum likelihood estimates from incomplete data: the mixture density estimation problem is regarded as an estimation problem involving incomplete data by treating an unlabelled observation on the mixture as missing the label indicating its component population of origin. This was formulated early on by Baum et al. (1970) in a similar way and has been widely used in HMM-based speech recognition methods. Thus, the VQ problem and the HMM modelling problem can be unified under the same probabilistic framework to obtain an optimized VQ/HMM combination, which forms the foundation of the SCHMM.

Provided that each codeword of the VQ codebook is represented by a Gaussian density function, then for a given state $s_i$ of the HMM, and assuming that the continuous observation x depends on the state only through the codeword, the probability density function with which state $s_i$ produces a vector x can be written as:
\[ f(x|s_i) = \sum_{j=1}^{L} f(x|O_j, s_i)\,\Pr(O_j|s_i) \simeq \sum_{j=1}^{L} f(x|O_j)\, b_i(O_j) \tag{4} \]
where L denotes the VQ codebook level. Given the VQ codebook index $O_j$, the probability density function $f(x|O_j)$ can be estimated with the EM algorithm (Redner & Walker, 1984), or by maximum likelihood clustering (Huang & Jack, 1988c). It can also be obtained directly from the HMM parameter estimation, as explained later.

Using Equation (4) to represent the semi-continuous output probability density, it is possible to combine the codebook distortion characteristics with the parameters of the discrete HMM under a unified probabilistic framework. Here, each discrete output probability is weighted by the continuous conditional Gaussian probability density function derived from the VQ. If these continuous VQ density functions are considered as the continuous output probability density functions of a continuous mixture HMM, the model resembles an L-mixture continuous HMM with all the continuous output probability density functions shared in the VQ codebook. The discrete output probabilities in state i, $b_i(O_j)$, become the weighting coefficients for the mixture components. In the decoding process, this results in an L-mixture continuous HMM with a computational complexity comparable to that of the continuous (single-mixture) HMM. In practice, the sum in Equation (4) can be restricted to the M most significant values of $f(x|O_j)$ (in practice, some two to five values) over all possible codebook indices $O_j$, which can easily be obtained in the VQ procedure. This can significantly reduce the computational load of the subsequent output probability computation, since M is of much lower order than L. Experimental results show this to perform well in speech recognition, as discussed in Section 3.

In the decoding process, the continuous probability density functions of the VQ codebook can be used to bridge between the non-VQ observation sequence X and the discrete HMM parameters. The semi-continuous output probability density function defined in Equation (4) can be used directly in the Viterbi algorithm (Viterbi, 1967; Rabiner & Juang, 1986), which finds the single best state path with the highest probability for the observation sequence.
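A minimal sketch of the semi-continuous output probability of Equation (4) with the top-M simplification described above (function and variable names are illustrative assumptions):

```python
import numpy as np

def semicontinuous_output_prob(x, b_i, cb_means, cb_vars, top_m=3):
    """Approximate Equation (4): sum, over the M codewords with the largest
    f(x|O_j), of f(x|O_j) * b_i(O_j). Codebook Gaussians have diagonal
    covariances; b_i is the (L,) discrete output distribution for state i."""
    k = x.shape[0]
    diff = x - cb_means                                  # (L, K)
    log_f = -0.5 * (k * np.log(2 * np.pi)
                    + np.sum(np.log(cb_vars), axis=1)
                    + np.sum(diff ** 2 / cb_vars, axis=1))
    top = np.argsort(log_f)[-top_m:]                     # M most significant codewords
    return np.sum(np.exp(log_f[top]) * b_i[top])
```

Note that $f(x|O_j)$ depends only on the frame, so in a full decoder it would be computed once per frame during vector quantization and shared across all states, which is where the computational saving arises.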
2.3. Mutual parameter re-estimation of the SCHMM and VQ codebook
The SCHMM can be considered as a special form of continuous mixture HMM with
tied mixture continuous density functions, as discussed above. Because of the tying of the continuous density functions, the number of free parameters as well as the computational complexity of the SCHMM are reduced in comparison with the continuous mixture HMM, while retaining the modelling power of the mixture HMM. Here, if the $b_i(O_j)$ are considered as the weighting coefficients of the different mixture output probability density functions in the continuous mixture HMM, the re-estimation algorithm for the weighting coefficients can be extended to re-estimate the $b_i(O_j)$ of the SCHMM (Juang, 1985). The re-estimation formulations can be more readily computed by defining a forward partial probability, $\alpha_t(i)$, and a backward partial probability, $\beta_t(i)$, for any time t and state i as:

\[ \alpha_t(i) = \Pr(x_1, x_2, \ldots, x_t, s_t = i \mid \lambda) \]
\[ \beta_t(i) = \Pr(x_{t+1}, x_{t+2}, \ldots, x_T \mid s_t = i, \lambda) \]
The forward and backward probabilities can be computed recursively, with $\alpha_1(i) = \pi_i b_i(x_1)$ and $\beta_T(i) = 1$, as:

\[ \alpha_t(i) = \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}\, b_i(x_t), \qquad 2 \le t \le T, \]
\[ \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j), \qquad 1 \le t \le T - 1. \]
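A sketch of these recursions, assuming the output probabilities $b_i(x_t)$ have already been computed (e.g. via Equation (4)); scaling against numerical underflow, which a practical implementation needs, is omitted:

```python
import numpy as np

def forward_backward(pi, a, out_prob):
    """pi: (N,) initial probabilities; a: (N, N) transition matrix with
    a[i, j] = Pr(i -> j); out_prob[t, i] = b_i(x_t). Returns alpha, beta."""
    T, N = out_prob.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * out_prob[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ a) * out_prob[t]      # forward recursion
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = a @ (out_prob[t + 1] * beta[t + 1])    # backward recursion
    return alpha, beta
```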
The intermediate probabilities $\chi_t(i,j,k)$, $\gamma_t(i,j)$, $\gamma_t(i)$, $\zeta_t(i,k)$ and $\eta_t(k)$ can be defined for efficient re-estimation of the model parameters. They are:

\[ \chi_t(i,j,k) = \Pr(s_t = i,\, s_{t+1} = j,\, O_{k_{t+1}} \mid X, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{k_{t+1}})\, f(x_{t+1} \mid O_{k_{t+1}})\, \beta_{t+1}(j)}{f(X \mid \lambda)} \tag{5} \]
\[ \gamma_t(i,j) = \Pr(s_t = i,\, s_{t+1} = j \mid X, \lambda) \]
\[ \gamma_t(i) = \Pr(s_t = i \mid X, \lambda) \]
\[ \zeta_t(i,k) = \Pr(s_t = i,\, O_{k_t} \mid X, \lambda) \]
\[ \eta_t(k) = \Pr(O_{k_t} \mid X, \lambda) \]

All these intermediate probabilities can be represented by $\chi_t(\cdot)$, with computational advantages, as:

\[ \gamma_t(i,j) = \sum_{k=1}^{L} \chi_t(i,j,k) \tag{6a} \]
\[ \gamma_t(i) = \sum_{j=1}^{N} \gamma_t(i,j) \tag{6b} \]
\[ \zeta_t(i,k) = \sum_{j=1}^{N} \chi_{t-1}(j,i,k) \tag{6c} \]
\[ \eta_t(k) = \sum_{i=1}^{N} \zeta_t(i,k) \tag{6d} \]
Using Equations (5) and (6), the re-estimation equations for $\pi_i$, $a_{ij}$ and $b_i(O_k)$ can be written as:

\[ \bar{\pi}_i = \gamma_1(i) \tag{7} \]
\[ \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \gamma_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{8} \]
\[ \bar{b}_i(O_k) = \frac{\sum_{t=1}^{T} \zeta_t(i,k)}{\sum_{t=1}^{T} \gamma_t(i)} \tag{9} \]
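Assuming the intermediate probabilities of Equations (5) and (6) have been collected into arrays, the re-estimates of Equations (7) to (9) reduce to sums over time; a minimal sketch (array names are illustrative):

```python
import numpy as np

def reestimate(gamma, gamma_ij, zeta):
    """gamma[t, i] = gamma_t(i) for t = 1..T;
    gamma_ij[t, i, j] = gamma_t(i, j) for t = 1..T-1;
    zeta[t, i, k] = zeta_t(i, k) for t = 1..T."""
    pi_new = gamma[0]                                        # Equation (7)
    a_new = gamma_ij.sum(0) / gamma[:-1].sum(0)[:, None]     # Equation (8)
    b_new = zeta.sum(0) / gamma.sum(0)[:, None]              # Equation (9)
    return pi_new, a_new, b_new
```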
Equations (7) to (9) can be extended to the case of multiple training sequences (Juang & Rabiner, 1985). The means and covariances of the Gaussian probability density functions can also be re-estimated, to update the VQ codebook, with Equations (5) and (6). The feedback from the HMM estimation results to the VQ codebook implies that the VQ codebook is optimized on the basis of HMM likelihood maximization rather than minimization of the total distortion errors over the set of training data. Although re-estimation of the means and covariances across different models involves interdependencies, the density functions being re-estimated are strongly correlated, so the new estimates can be expected to improve the discrimination ability of the VQ codebook. To re-estimate the parameters of the VQ codebook, i.e. the means, $\mu_j$, and covariance matrices, $\Sigma_j$, of codebook index j, it is not difficult to extend the continuous mixture HMM re-estimation algorithm. In general, it can be written as:
\[ \bar{\mu}_j = \frac{\sum_{v} \left[ \sum_{t=1}^{T} \eta_t(j)\, x_t \right]}{\sum_{v} \left[ \sum_{t=1}^{T} \eta_t(j) \right]}, \qquad 1 \le j \le L; \tag{10} \]

and

\[ \bar{\Sigma}_j = \frac{\sum_{v} \left[ \sum_{t=1}^{T} \eta_t(j)\,(x_t - \bar{\mu}_j)(x_t - \bar{\mu}_j)' \right]}{\sum_{v} \left[ \sum_{t=1}^{T} \eta_t(j) \right]}, \qquad 1 \le j \le L, \tag{11} \]
where v denotes an index over the whole vocabulary of models, and the expressions in square brackets are the variables of model v. In Equations (10) and (11), the re-estimation of the means and covariance matrices in the output probability density function of the SCHMM is tied across all the HMMs. This is similar to the approach with tied transition probabilities inside a model (Jelinek & Mercer, 1980). From Equations (10) and (11), it can be observed that these are merely a special form of the EM algorithm for parameter estimation of mixture density functions (Redner & Walker, 1984), closely welded into the HMM re-estimation equations. Thus, a unified modelling approach to vector quantization and hidden Markov modelling of speech signals can be established for both the VQ codebook and the HMM parameter estimation.

2.4. Computational considerations

It should be noted that the computational complexity of decoding with this new SCHMM is less than that of the continuous mixture HMM whenever the size of the VQ codebook is less than the total number of output probability density functions in the continuous mixture HMM, since $f(x|O_j)$ need only be calculated once for each codebook index, as opposed to once for each state with several mixture density functions; for large vocabulary recognition, the total number of output probability density functions will usually exceed the codebook size. The memory requirements of the SCHMM are almost the same as those of the discrete HMM, except that memory for $f(x|O_j)$ must be added for all the VQ codebook entries and observation sequences X. In practice, if Equation (4) is simplified by using only the three most significant values of $f(x|O_j)$, the computational load of the SCHMM is comparable to that of the discrete HMM. For parameter re-estimation, the computational complexity is similar to that of a continuous HMM with three mixtures. If the simplified methods explained in Section 3 are used, the computational load can be reduced to that of a single-mixture continuous HMM.
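As a concrete illustration of the tied re-estimation in Equations (10) and (11), the following sketch pools the codeword posteriors $\eta_t(j)$ across all models in the vocabulary before normalizing; diagonal covariances are assumed and the array names are illustrative:

```python
import numpy as np

def update_codebook(eta_per_model, frames_per_model):
    """eta_per_model[v]: (T_v, L) array with eta_t(j) for model v;
    frames_per_model[v]: (T_v, K) observations for model v's training data.
    Returns tied (L, K) codebook means and diagonal variances."""
    num, den = None, None
    for eta, x in zip(eta_per_model, frames_per_model):
        n = eta.T @ x                        # (L, K): sum_t eta_t(j) x_t
        d = eta.sum(0)                       # (L,):   sum_t eta_t(j)
        num = n if num is None else num + n
        den = d if den is None else den + d
    means = num / den[:, None]                                    # Equation (10)
    sq = None
    for eta, x in zip(eta_per_model, frames_per_model):
        diff2 = (x[:, None, :] - means[None, :, :]) ** 2          # (T, L, K)
        s = np.einsum('tj,tjk->jk', eta, diff2)
        sq = s if sq is None else sq + s
    variances = sq / den[:, None]                                 # Equation (11)
    return means, variances
```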
3. Experimental evaluation

Experiments have been carried out to compare the performance of (a) the discrete HMM; (b) the SCHMM without re-estimation; (c) the SCHMM with re-estimation; and (d) the continuous HMM in a phoneme classification task, as a useful exemplar domain for investigating the differences between the SCHMM, continuous HMM, and discrete HMM algorithms. Other phoneme classification problems, such as coarticulation effects, speech modelling units, and model structure, are not discussed here.

3.1. Simplified SCHMM parameter re-estimation

For parameter re-estimation of the SCHMM, due to the extensive computational load of Equations (10) and (11), a simplified method is adopted here. The re-estimation procedure is divided into two stages. In the first stage, the discrete HMM re-estimation is run, and the discrete HMM parameters are then used as initial parameters for the SCHMM. Only one mixture, determined by the most significant value of $b_i(O_j)$, is used to represent the semi-continuous output probability density during parameter re-estimation. The re-estimates (means and covariances) can be used either to replace the original means and covariances in the VQ codebook or to form an average with other re-estimates according to the preselected codeword index. This can be considered as a special case of Equations (10) and (11) within an individual model v, where the re-estimation of the means and covariances is constrained to the most significant $b_i(O_j)$ only. These means and covariances are fed back to the VQ codebook for updating, and this procedure can be run together with re-estimation of the transition probabilities. In the second stage, after re-estimation of the transition probabilities along with the means and covariance matrices, the re-estimation algorithm for the weighting coefficients (Equation (9)) is used again, together with the transition probabilities, on the updated codebook to obtain the final discrete output probability distributions for the SCHMM. With the simplified approach used here, the computational complexity of the SCHMM parameter re-estimation can be significantly reduced in comparison with Equations (10) and (11).

3.2. Analysis conditions

For both training and evaluation, the analysis conditions were as follows: sampling rate, 16 kHz; analysis method, 10 cepstrum coefficients derived from 12th-order autocorrelation LPC analysis (Rabiner et al., 1985); window type, Hamming; window length and shift, 20 ms and 10 ms; pre-emphasis, $1 - 0.97z^{-1}$; HMM structure, left-to-right model as shown in Fig. 1.

The database consists of two repetitions of the same continuous speech sentences from a male speaker. Each set has 98 sentences with 579 words. The sentences have been hand-labelled to the level of individual phonemes. Each of the 47 individual HMMs is trained and decoded with hand-labelled phonemes, with the (unbalanced) number of phonemes used to derive individual models varying from 3 to 191. The VQ codebook is generated from training data with a total of 20 000 frames using the LBG algorithm (Linde, Buzo & Gray, 1980). The first set of sentences is used to estimate the HMM parameters with the forward-backward algorithm, and the second set is used in decoding with the Viterbi algorithm. The forward-backward algorithm is iterated three to six times, and the final output probability is smoothed as suggested in Levinson et al. (1983).

3.3. Experimental results

The discrete HMM was the first to be evaluated in the experiments, as shown in Table I.
[Figure 1. HMM structure.]
Table I. Comparison of discrete HMM, continuous HMM, and semi-continuous HMM.
Average recognition (number of tests = 1748)

Model                             VQ level   Accuracy (correctly recognized tests)
Discrete HMM                      128        50.0% (874)
Discrete HMM                      256        50.1% (875)
Continuous HMM                    --         55.2% (966)
SCHMM without re-estimation       128        53.3% (923)
Discrete HMM with updated VQ      128        55.3% (968)
SCHMM with updated VQ             128        58.3% (1019)
Discrete HMM with new VQ          141        53.7% (940)
SCHMM with new VQ                 141        58.5% (1023)
When the VQ level varies from 128 to 256, the average phoneme recognition accuracy of the discrete HMM is 50.0% and 50.1%, respectively, for the 47 phonemes. The fact that increasing the VQ level from 128 to 256 does not improve phoneme recognition accuracy indicates that, for the limited training data used here, a VQ level of 128 is adequate. Due to the limited training data, the covariance matrices used in the continuous HMM and the SCHMM are all assumed to be diagonal.

Using the discrete HMM parameters, the recognition accuracy of the SCHMM (without re-estimation) was tested. Varying the number of most significant values of $f(x|O_j)$ in Equation (4) from 1 to 5, the recognition accuracy of the SCHMM is shown in Fig. 2. Choosing the top three values is appropriate here, and this choice is used in the following experiments. The recognition accuracy of the SCHMM without re-estimation is 53.5%. When Equation (9) is used to re-estimate the output probability while keeping the VQ codebook unchanged, marginally higher accuracy is obtained for the SCHMM.
[Figure 2. Performance comparison using different numbers of mixtures for decoding: SCHMM without re-estimation; SCHMM with updated VQ; and SCHMM with VQ codebook obtained from the continuous HMM.]
However, when the continuous HMM is used, the measured accuracy improves to 55.2%, which is higher than both the discrete HMM and the SCHMM without the re-estimated means and covariances. In the continuous HMM used here, the parameters of the discrete HMM are used as initial parameters: the means and covariances corresponding to the maximum codeword in the discrete output probability serve as initial values and are then iterated with the forward-backward algorithm.

The simplified SCHMM parameter re-estimation technique is used for evaluation. The experimental results show that the replacement and averaging techniques produce similar recognition accuracy during re-estimation, which suggests strong correlation between the re-estimated codewords. With re-estimated parameters and the updated VQ codebook, the recognition accuracy of the SCHMM is 58.3%, which is measurably better than both the discrete HMM and the continuous HMM. Using the updated VQ codebook, it is interesting to note that the discrete HMM (55.3%) is comparable to the continuous HMM (55.2%). Here, the discrete output probability distribution is obtained from the SCHMM re-estimation formula, Equation (9).

To avoid the loss of information in the VQ codebook caused by the replacement or averaging operation described above, a novel approach is used here in which the means and covariances of the continuous HMM output probability density functions form a new codebook. There are in total 47 different phoneme models with three output probability density functions in each model; a VQ codebook of 141 levels is therefore constructed. Here again, Equation (9) is used to re-estimate the discrete output probability for the SCHMM. The recognition accuracy of the SCHMM with the new codebook is 58.5%, and that of the discrete HMM with the new codebook is 53.8%.

From Table I, it can be seen that the re-estimated SCHMM with the modified VQ codebook performs consistently better than both the continuous and the discrete HMM. In addition, with the VQ codebook updated from, or collected from, the continuous HMM, the performance of the discrete HMM (55.3%) improves towards that of the continuous HMM (55.2%). This strongly suggests that unified modelling of vector quantization and the HMM is necessary. To retain robustness, the continuous mixture HMM can also be applied with the same strategy; although this will result in a large codebook, information-theoretic clustering can easily be extended to merge similar codewords.

4. Discussion and summary

The SCHMM takes advantage of both the discrete HMM and the continuous HMM, and results in a powerful tool for modelling time-varying signal sources. The distortion produced by VQ operations has been shown to be accommodated within the HMM methodology, such that parameter estimation of the VQ codebook can be combined with that of the HMM under the same probabilistic framework. The SCHMM technique described here can be considered as a method in which a VQ codebook is updated with the HMM re-estimation algorithm, and thus represents a special form of maximum likelihood vector quantization. Although both maximum likelihood VQ and the HMM forward-backward algorithm attempt to maximize the likelihood, the new SCHMM approach has the advantage that it maximizes the likelihood of each HMM using pre-classified data, in contrast to other approaches, which maximize the likelihood of the VQ codebook with unclassified data.
As a result, the later estimates are more suitable for classification, since the pre-classified data and the HMM parameters are optimized together.
In summary, the SCHMM can be considered as a special form of continuous mixture hidden Markov model in which the continuous output probability density functions of the model are merged into the VQ codebook. This can effectively reduce the amount of training data required, as well as the computational complexity, in comparison with the continuous mixture HMM. The VQ codebook obtained from maximizing the likelihood function of the HMM provides better discrimination power than a conventional VQ codebook obtained from minimizing distortion errors. Vector quantization and hidden Markov modelling of speech signals are unified under the EM algorithm, which has been widely used in the context of mixture density estimation. Experimental results have demonstrated that the SCHMM algorithm proposed here offers improved phoneme recognition accuracy in comparison with both the discrete HMM and the continuous HMM, with only limited increases in computational complexity. We conclude that the SCHMM is a powerful new technique for modelling non-stationary stochastic processes with multi-modal, non-symmetric probabilistic functions of Markov chains.

The authors wish to thank Professor J. Laver, Dr Y. Ariki, and Dr F. McInnes, Centre for Speech Technology Research, University of Edinburgh, and Professor D. T. Fang, Computer Science and Technology Department, Tsinghua University, for their support and contributions to the research work.
References

Baum, L. E., Petrie, T., Soules, G. & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171.
Brown, P. F. (1987). Acoustic-phonetic modelling problem in automatic speech recognition. Ph.D. Thesis, Department of Computer Science, Carnegie-Mellon University.
Chow, Y. L., Dunham, M. O., Kimball, O. A., Krasner, M. A., Kubala, G. F., Makhoul, J., Price, P. J., Roucos, S. & Schwartz, R. M. (1987). BYBLOS: The BBN continuous speech recognition system. IEEE ICASSP 87, Dallas, pp. 89–92.
Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39, 1–38.
Hasselblad, V. (1966). Estimation of parameters for a mixture of normal distributions. Technometrics, 8, 431–444.
Huang, X. D. & Jack, M. A. (1988a). Performance comparison between semi-continuous and discrete hidden Markov models. IEE Electronics Letters, 24(3), 149–150.
Huang, X. D. & Jack, M. A. (1988b). On several problems of hidden Markov models. Seventh FASE Symposium, SPEECH 88, Edinburgh, pp. 17–22.
Huang, X. D. & Jack, M. A. (1988c). Maximum likelihood clustering applied to semi-continuous hidden Markov models for speech recognition. IEEE International Symposium on Information Theory, Kobe, Japan, p. 71.
Jelinek, F. (1976). Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64, 532–556.
Jelinek, F. (1985). The development of an experimental discrete dictation recognizer. Proceedings of the IEEE, 73, 1616–1624.
Jelinek, F. & Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. Proceedings of the Workshop on Pattern Recognition in Practice. Amsterdam: North-Holland.
Juang, B. H. (1985). Maximum-likelihood estimation for mixture multivariate stochastic observations of Markov chains. AT&T Technical Journal, 64, 1235–1249.
Juang, B. H. & Rabiner, L. R. (1985). Mixture autoregressive hidden Markov models for speech signals. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-33, 1404–1413.
Lee, K. F. (1988). Large-vocabulary speaker-independent continuous speech recognition: The SPHINX system. Ph.D. Thesis, Department of Computer Science, Carnegie-Mellon University.
Levinson, S. E. (1986). Continuously variable duration hidden Markov models for automatic speech recognition. Computer Speech and Language, 1, 29–45.
Levinson, S. E., Rabiner, L. R. & Sondhi, M. M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62, 1035–1074.
Linde, Y., Buzo, A. & Gray, R. M. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, COM-28, 84–95.
Makhoul, J., Roucos, S. & Gish, H. (1985). Vector quantization in speech coding. Proceedings of the IEEE, 73, 1551–1588.
Nishimura, M. & Toshioka, K. (1987). HMM-based speech recognition using multi-dimensional multi-labeling. IEEE ICASSP 87, Dallas, pp. 1163–1166.
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1076.
Poritz, A. B. & Richter, A. G. (1986). On hidden Markov models in isolated word recognition. IEEE ICASSP 86, Tokyo, Japan, pp. 705–708.
Rabiner, L. R. & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, January, 4–16.
Rabiner, L. R., Wilpon, J. G. & Juang, B. H. (1986). A segmental k-means training procedure for connected word recognition. AT&T Technical Journal, 65, 21–31.
Rabiner, L. R., Juang, B. H., Levinson, S. E. & Sondhi, M. M. (1985). Recognition of isolated digits using hidden Markov models with continuous mixture densities. AT&T Technical Journal, 64, 1211–1234.
Redner, R. A. & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195–239.
Soudoplatoff, S. (1986). Markov modelling of continuous parameters in speech recognition. IEEE ICASSP 86, Tokyo, Japan, pp. 4–8.
Tseng, H. P., Sabin, M. & Lee, E. (1987). Fuzzy vector quantization applied to hidden Markov modelling. IEEE ICASSP 87, Dallas, pp. 641–644.
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13, 260–269.