A dynamic parameter compensation method for noisy speech recognition


Speech Communication 48 (2006) 1283–1293 www.elsevier.com/locate/specom

Geng-xin Ning a, Shu-hung Leung b,*, Kam-keung Chu b, Gang Wei a

a School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, China
b Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong, China

Received 19 December 2005; received in revised form 15 June 2006; accepted 22 June 2006

Abstract

Model-based compensation techniques have been successfully used for speech recognition in noisy environments. Popular model-based compensation methods such as the Log-Normal PMC and Log-Add PMC generally use approximate compensation for dynamic parameters, so their recognition accuracy is degraded at low and very low signal-to-noise ratios. In this paper we use time derivatives of static features to derive a dynamic parameter compensation method (DPCM). In this method, the static features are assumed to be independent of the dynamic features of speech and noise. This assumption simplifies the compensation of the delta and delta–delta parameters. The new compensated dynamic model, together with any known compensated static model, forms a new corrupted-speech recognition model. Experimental results show that the recognition model using the DPCM scheme gives better recognition accuracy than the original model compensation method for different additive noises, at the expense of a slight increase in computational complexity.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Noisy speech recognition; Model compensation; Dynamic parameter combination

The work described in this paper was substantially supported by a grant from the City University of Hong Kong (Project No. 7001607).
* Corresponding author. Tel.: +852 2788 7784; fax: +852 2788 7791. E-mail address: [email protected] (S.-h. Leung).
0167-6393/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2006.06.005

1. Introduction

It is well known that the performance of a speech recognizer trained on a clean speech database usually degrades drastically when it operates in noisy environments. This degradation is mainly due to the mismatch between the training and testing conditions. In testing, ambient noise, channel effects, and speaker stress are the major sources of mismatch. Various schemes have been proposed to deal with these problems. They can be roughly categorized into three classes: inherently robust feature representation (Mansour and Juang, 1989), speech enhancement (Lockwood and Boudy, 1992; Hansen and Clements, 1991; Yao et al., 2004), and model-based compensation (Varga and Moore, 1990; Gales and Young, 1996; Cerisara et al., 2002; Parssinen et al., 2002; Rigazio and Junqua, 2004). More references and details of these methods can be found in (Gong, 1995).

Model-based compensation (MC) mainly combines a clean speech model with an additive noise model to generate a corrupted-speech model. It has been shown that parallel model compensation (PMC) (Gales and Young, 1993) is an effective MC method that yields robust corrupted-speech models in additive noise. Different PMC schemes have been proposed to obtain the corrupted-speech model parameters. Some of them, such as the numerical integration PMC (NI-PMC) and the data-driven PMC (D-PMC), have very high computational complexity but give good model estimation (Gales, 1998; Moreno et al., 1998). Others, such as the Log-Add PMC and Log-Normal PMC, use simple approximations that give unsatisfactory performance at low and very low SNRs. The Log-Normal PMC method provides a rigorous compensation for the static models, but only the means are compensated for the dynamic models. Although this dynamic mean compensation increases the recognition accuracy for noisy speech, it still needs to be improved, and an effective compensation for the covariance matrix of the delta parameters needs to be developed.

In this paper, a dynamic parameter compensation method (DPCM) for compensating the dynamic models is proposed. The method takes the difference between the noise and the speech in the Log-Spectral domain as an auxiliary random variable and assumes that it is independent of the time derivatives of the speech and the noise. With this assumption, the compensation of the dynamic models can be derived mathematically. Any compensated static model can be combined with the dynamic model generated by the DPCM to form a new speech recognition model. Experimental results show that the new compensated model constructed in this way provides better recognition accuracy than its original counterpart.

This paper is organized as follows. The model-based compensation approach is briefly reviewed in Section 2. The static and dynamic model compensation procedures of the Log-Normal, Log-Add, split mixture, and Vector Taylor Series (VTS) (Acero et al., 2000) methods are briefly described in Sections 3 and 4, respectively. In Section 5, we use time derivatives of static features to derive the new dynamic parameter compensation method. Experimental results and discussion are presented in Section 6. Conclusions are given in the final section.

2. Model-based technique

Model compensation combines the statistical models of clean speech features and noise features to form corrupted speech models (Moreno et al., 1998). Fig. 1 shows the basic model compensation framework. The inputs to the combination process are the clean speech models and a noise model. Broadly speaking, MC methods can be categorized into two classes: the Linear-Spectral domain approach and the Log-Spectral domain approach.

For the Log-Normal PMC (see Fig. 1-II), the clean speech models and the noise model are combined in the Linear-Spectral domain. The model parameters of the clean speech and the noise are first mapped from the Cepstral domain to the Log-Spectral domain, and then to the Linear-Spectral domain. The clean speech models and the noise model are then combined into corrupted speech models. After the combination, the parameters of the corrupted speech models are transformed to the Log-Spectral domain and then back to the Cepstral domain. On the other hand, the Log-Add PMC and VTS (see Fig. 1-I) combine the clean speech models and the noise model in the Log-Spectral domain.

Two kinds of noise corrupt speech signals: convolutional noise (the frequency response of the channel) and additive noise. In this paper, we assume the noise is additive only. Assuming the additive noise is statistically independent of the speech, the kth component of the corrupted speech "observations" in the Linear-Spectral domain, $Y_k$, is expressed as

$$Y_k = X_k + G \cdot N_k \tag{1}$$

where $X_k$ and $N_k$ are the kth components of the clean speech "observations" and the background noise "observations", respectively, in the Linear-Spectral domain. The gain $G$ is a matching term introduced to account for the level difference between the clean speech and the noisy speech. For the sake of simplicity, $G$ is dropped in the sequel.

Throughout this paper, the superscript $c$ denotes a Cepstral domain parameter and the superscript $l$ denotes a Log-Spectral domain parameter. No superscript denotes a Linear-Spectral domain parameter. The statistical parameters of the background noise model are capped with a tilde (~), while the parameters of the noise-corrupted speech model are capped with a hat (^).


Fig. 1. Basic model compensation framework.

3. Static parameter compensation of conventional MC schemes

For the Log-Normal PMC, the mismatch function of the corrupted speech "observations" in the Log-Spectral domain is expressed as (Gales and Young, 1993)

$$Y^l_k = \log(X_k + N_k) \tag{2}$$

$$Y^l_k = \log\left(e^{X^l_k} + e^{N^l_k}\right) \tag{3}$$

where $\{X^l_k, N^l_k\}$ and $\{X_k, N_k\}$ are the clean speech and background noise "observations" in the Log-Spectral and Linear-Spectral domains, respectively.

3.1. Compensation techniques in the linear-spectral domain

3.1.1. Log-normal PMC

It starts by using the inverse discrete cosine transform (IDCT) to map the mean and covariance parameters of the speech models from the Cepstral domain to the Log-Spectral domain as follows:

$$\mu^l = Q^{-1}\mu^c \tag{4}$$

$$\Sigma^l = Q^{-1}\Sigma^c (Q^{-1})^T \tag{5}$$

where $Q$ and $Q^{-1}$ are the matrices of the DCT and IDCT operations, respectively. The kth elements of the mean vectors and the (k, l)th elements of the covariance matrices of the clean speech models in the Linear-Spectral domain are related to the Log-Spectral domain as

$$\mu_{X_k} = \exp\left(\mu^l_{X_k} + \tfrac{1}{2}\Sigma^l_{X_{kk}}\right) \tag{6}$$

$$\Sigma_{X_{kl}} = \mu_{X_k}\mu_{X_l}\left(e^{\Sigma^l_{X_{kl}}} - 1\right) \tag{7}$$

In the Linear-Spectral domain, the noise is assumed to be additive and independent of the speech. The corrupted speech model parameters in this domain are obtained by combining the clean speech models and the noise model as

$$\hat{\mu}_Y = \mu_X + \tilde{\mu}_N \tag{8}$$

$$\hat{\Sigma}_Y = \Sigma_X + \tilde{\Sigma}_N \tag{9}$$

After model combination, the model parameters are mapped back to the Log-Spectral domain as

$$\hat{\mu}^l_{Y_k} = \log(\hat{\mu}_{Y_k}) - \frac{1}{2}\log\left(\frac{\hat{\Sigma}_{Y_{kk}}}{\hat{\mu}^2_{Y_k}} + 1\right) \tag{10}$$

$$\hat{\Sigma}^l_{Y_{kl}} = \log\left(\frac{\hat{\Sigma}_{Y_{kl}}}{\hat{\mu}_{Y_k}\hat{\mu}_{Y_l}} + 1\right) \tag{11}$$

and then back to the Cepstral domain via the DCT as

$$\hat{\mu}^c_Y = Q\hat{\mu}^l_Y \tag{12}$$

$$\hat{\Sigma}^c_Y = Q\hat{\Sigma}^l_Y Q^T \tag{13}$$

3.1.2. Split mixture PMC

The basic concept of the split-mixture PMC is to split each Gaussian mixture during the domain transformation process, thus increasing the number of mixtures and improving the domain transformation. The procedure of splitting a mixture into two in the Log-Spectral domain is briefly described as follows (Hung et al., 2001):

$$(\mu_{X1})^l_k = \mu^l_{X_k} + a\sqrt{\Sigma^l_{X_{kk}}}, \qquad (\mu_{X2})^l_k = \mu^l_{X_k} - a\sqrt{\Sigma^l_{X_{kk}}} \tag{14}$$

$$(\Sigma_{X1})^l_{kl} = (\Sigma_{X2})^l_{kl} = \Sigma^l_{X_{kl}} - a^2\sqrt{\Sigma^l_{X_{kk}}\Sigma^l_{X_{ll}}} \tag{15}$$

where $a$ is found to minimize the following distance:

$$d = \sum_m \left\| \mu_m - \tfrac{1}{2}\left[\mu_{m1}(a) + \mu_{m2}(a)\right] \right\|_2^2 \tag{16}$$

Here $\mu_m$ is the mean vector of a certain mixture $m$ of the speech HMMs in the Linear-Spectral domain, $\mu_{m1}(a)$ and $\mu_{m2}(a)$ are the corresponding two Linear-Spectral vectors produced by the split mixture approach, and the summation over $m$ should cover all mixtures of all states in all models. An alternative way of estimating $a$ is to choose the value that gives the best recognition performance over the training data. More details about the split mixture PMC can be found in (Hung et al., 2001). After the two mixtures are compensated by the Log-Normal PMC method, they are recombined into one as

$$\mu^l_Y = 0.5\left(\mu^l_{Y1} + \mu^l_{Y2}\right) \tag{17}$$

$$\Sigma^l_Y = 0.5\left(\Sigma^l_{Y1} + \Sigma^l_{Y2}\right) + 0.5\left(\mu^l_Y - \mu^l_{Y1}\right)\left(\mu^l_Y - \mu^l_{Y1}\right)^T + 0.5\left(\mu^l_Y - \mu^l_{Y2}\right)\left(\mu^l_Y - \mu^l_{Y2}\right)^T \tag{18}$$
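The Log-Normal PMC steps of Section 3.1.1, Eqs. (6)–(11), can be illustrated for a single Log-Spectral channel. The following is only a minimal sketch under stated assumptions: it handles the diagonal case (k = l) only, omits the DCT/IDCT mappings of Eqs. (4), (5), (12) and (13), and the function name `lognormal_pmc_channel` is hypothetical rather than taken from the paper.

```python
import math


def lognormal_pmc_channel(mu_x_log, var_x_log, mu_n_log, var_n_log):
    """Compensate one log-spectral channel (diagonal covariance only).

    Inputs are the log-spectral mean and variance of clean speech (x)
    and noise (n). Returns the log-spectral mean and variance of the
    corrupted speech.
    """
    # Eqs. (6)-(7): log-spectral -> linear-spectral (log-normal moments)
    mu_x = math.exp(mu_x_log + 0.5 * var_x_log)
    var_x = mu_x * mu_x * (math.exp(var_x_log) - 1.0)
    mu_n = math.exp(mu_n_log + 0.5 * var_n_log)
    var_n = mu_n * mu_n * (math.exp(var_n_log) - 1.0)

    # Eqs. (8)-(9): additive noise, independent of the speech
    mu_y = mu_x + mu_n
    var_y = var_x + var_n

    # Eqs. (10)-(11): linear-spectral -> log-spectral
    var_y_log = math.log(var_y / (mu_y * mu_y) + 1.0)
    mu_y_log = math.log(mu_y) - 0.5 * var_y_log
    return mu_y_log, var_y_log
```

With a negligible noise level the function returns the clean parameters, which is a quick sanity check that the forward mapping (6)–(7) and the backward mapping (10)–(11) are mutually consistent.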

3.2. Compensation techniques in the log-spectral domain

Many PMC methods, such as the Log-Add PMC, NI-PMC, and VTS, are compensation techniques in the Log-Spectral domain. For the Log-Add PMC, the mean compensation is described as

$$\hat{\mu}^l_{Y_k} = \log\left(e^{\mu^l_{X_k}} + e^{\tilde{\mu}^l_{N_k}}\right) \tag{19}$$

This method compensates only the mean, not the variance. It thus has low computational complexity (Gales, 1998). However, its performance becomes unsatisfactory at low SNR. This scheme can be viewed as the zeroth-order VTS (denoted VTS-0, Acero et al., 2000). The VTS method approximates the mismatch function by a finite-length Taylor series, and the expectation of this Taylor series is taken to find the corrupted speech model parameters. A higher-order Taylor series can yield a better solution, but at a much higher computational cost. Thus VTS-0 and first-order VTS (VTS-1) (Kim et al., 1998; Moreno et al., 1996; Acero et al., 2000) are commonly employed. Using the VTS-1 method (Acero et al., 2000), the compensation of the mean is the same as in the Log-Add PMC, and the covariance matrix $\hat{\Sigma}^l_Y$ is compensated as

$$\hat{\Sigma}^l_Y = M\Sigma^l_X M^T + (I - M)\tilde{\Sigma}^l_N (I - M)^T \tag{20}$$

where $M$ is the diagonal matrix whose elements are expressed as

$$M_k = \frac{1}{1 + e^{\left(\tilde{\mu}^l_{N_k} - \mu^l_{X_k}\right)}} \tag{21}$$

Only additive noise is considered in this work.
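As an illustration of Eqs. (19)–(21), the sketch below evaluates the Log-Add mean and the VTS-1 variance for one Log-Spectral channel. It assumes diagonal covariance matrices so that Eq. (20) reduces to a scalar expression; the function name `vts1_channel` is hypothetical.

```python
import math


def vts1_channel(mu_x, var_x, mu_n, var_n):
    """Log-Add mean, Eq. (19), and VTS-1 variance, Eqs. (20)-(21).

    All arguments are log-spectral statistics of one channel:
    clean speech (x) and noise (n). Diagonal covariances are assumed.
    """
    mu_y = math.log(math.exp(mu_x) + math.exp(mu_n))    # Eq. (19)
    m = 1.0 / (1.0 + math.exp(mu_n - mu_x))             # Eq. (21)
    var_y = m * m * var_x + (1.0 - m) ** 2 * var_n      # Eq. (20), scalar case
    return mu_y, var_y, m
```

When the speech and noise means are equal, the weight $M_k$ is 0.5, so the speech and noise variances contribute equally; as the noise mean falls far below the speech mean, $M_k$ tends to 1 and the clean variance is recovered.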


4. Dynamic parameter compensation of conventional MC schemes

The compensation of the dynamic models of the Log-Add PMC, Log-Normal PMC, and VTS is briefly described in this section. The estimation of the delta parameters of the Log-Add PMC is described as (Gales, 1995)

$$\hat{\mu}^l_{\Delta Y_k} = \log\left(e^{\mu^l_{\Delta X_k} + \mu^l_{X_k}} + e^{\tilde{\mu}^l_{\Delta N_k} + \tilde{\mu}^l_{N_k}}\right) - \log\left(e^{\mu^l_{X_k}} + e^{\tilde{\mu}^l_{N_k}}\right) \tag{22}$$

Only the means of the delta models are compensated in the Log-Add PMC. This method has low computational complexity but unsatisfactory performance at low SNR.

For the Log-Normal PMC, the compensation of the delta model parameters is difficult to derive and is mostly approximated. Gales (1995) and Hung et al. (2001) used (23) and (24), respectively, to compensate the means of the delta models:

$$\hat{\mu}^l_{\Delta Y_k} = \frac{e^{\mu^l_{X_k}}}{e^{\hat{\mu}^l_{Y_k}}}\,\mu^l_{\Delta X_k} \tag{23}$$

$$\hat{\mu}^l_{\Delta Y_k} = \frac{e^{\mu^l_{X_k}}}{e^{\mu^l_{X_k}} + e^{\tilde{\mu}^l_{N_k}}}\,\mu^l_{\Delta X_k} \tag{24}$$

These approximations are rough, and there is no compensation for the covariance matrices of the delta models.

In (Hung et al., 2001), experimental results showed that the VTS method can provide very good performance compared with other model compensation schemes. The compensation of the dynamic parameters of the VTS-1 (Acero et al., 2000) is described as

$$\hat{\mu}^l_{\Delta Y} = M\mu^l_{\Delta X} \tag{25}$$

$$\hat{\Sigma}^l_{\Delta Y} = M\Sigma^l_{\Delta X} M^T + (I - M)\tilde{\Sigma}^l_{\Delta N}(I - M)^T \tag{26}$$

It is noted that the mean compensation of VTS-1 in (25) is the same as that of Hung et al. (2001) for the Log-Normal PMC in (24).

5. A new dynamic parameter compensation method (DPCM)

In this paper, we use time derivatives of static "observations" to represent dynamic "observations". In doing so, we let the delta and delta–delta features equal the first- and second-order time derivatives of the static features, respectively.

5.1. Delta parameter compensation

Using (3), the delta feature is expressed as

$$\Delta Y^l_k = \frac{\partial Y^l_k}{\partial t} = \frac{e^{X^l_k}}{e^{X^l_k} + e^{N^l_k}}\frac{\partial X^l_k}{\partial t} + \frac{e^{N^l_k}}{e^{X^l_k} + e^{N^l_k}}\frac{\partial N^l_k}{\partial t} \tag{27}$$

We define an auxiliary random variable (r.v.) $Z^l_k$ as $Z^l_k = N^l_k - X^l_k$. Since $X^l_k$ and $N^l_k$ are normally distributed, $Z^l_k$ is also normally distributed (Papoulis, 1991). The mean vector and covariance matrix of the auxiliary random vector $Z^l = \{Z^l_k\}$ are expressed as $\mu^l_Z = \tilde{\mu}^l_N - \mu^l_X$ and $\Sigma^l_Z = \tilde{\Sigma}^l_N + \Sigma^l_X$, respectively. The dynamic feature in (27) is then expressed in terms of $Z^l_k$ as

$$\Delta Y^l_k = \frac{1}{1 + e^{Z^l_k}}\frac{\partial X^l_k}{\partial t} + \frac{1}{e^{-Z^l_k} + 1}\frac{\partial N^l_k}{\partial t} \tag{28}$$

We assume the background noise to be stationary, so the mean of the delta features of the noise can be set to zero. We also assume that the auxiliary r.v. and the delta features of the speech and noise are uncorrelated. This assumption is the essence of the DPCM. It is reasonable because static features are only loosely correlated with their differential changes. Using (28) and the above assumption, the mean of the delta features of the corrupted speech is expressed as

$$\hat{\mu}^l_{\Delta Y_k} = E\{\Delta Y^l_k\} = \varphi_1(Z^l_k)\,\mu^l_{\Delta X_k} \tag{29}$$

where

$$\varphi_1(Z^l_k) = E\left\{\frac{1}{1 + e^{Z^l_k}}\right\} \tag{30}$$

Using Gaussian–Hermite numerical integration (Abramowitz and Stegun, 1972), $\varphi_1(Z^l_k)$ can be obtained as

$$\varphi_1(Z^l_k) = \int_{-\infty}^{\infty} \frac{1}{1 + e^{Z^l_k}}\,\frac{1}{\sqrt{2\pi\Sigma^l_{Z_{kk}}}}\exp\left(-\frac{(Z^l_k - \mu^l_{Z_k})^2}{2\Sigma^l_{Z_{kk}}}\right) dZ^l_k = \frac{1}{\sqrt{\pi}}\int_{-\infty}^{\infty} \frac{e^{-t^2}}{1 + e^{\mu^l_{Z_k} + \sqrt{2\Sigma^l_{Z_{kk}}}\,t}}\,dt \approx \frac{1}{\sqrt{\pi}}\sum_{i=1}^{n}\frac{\omega_i}{1 + e^{A}} \tag{31}$$

where $A = \mu^l_{Z_k} + \sqrt{2\Sigma^l_{Z_{kk}}}\,t_i$. The parameters $t_i$ and $\omega_i$ for $i = 1$ to $n$ are the abscissas and weights of the Hermite polynomial $H_n(t)$ (Abramowitz and Stegun, 1972).
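The delta mean compensation of Eqs. (29)–(31) can be sketched for a single channel as follows. This is an illustrative sketch, not the authors' implementation: it hard-codes the standard two-point Gauss–Hermite rule (nodes ±√2/2, weights √π/2), and the names `phi1` and `delta_mean` are hypothetical.

```python
import math

# Two-point Gauss-Hermite rule (n = 2) for the weight function exp(-t^2)
NODES = (math.sqrt(2.0) / 2.0, -math.sqrt(2.0) / 2.0)
WEIGHTS = (math.sqrt(math.pi) / 2.0, math.sqrt(math.pi) / 2.0)


def phi1(mu_z, var_z):
    """Approximate E{1 / (1 + exp(Z))} for Z ~ N(mu_z, var_z), Eq. (31)."""
    total = 0.0
    for t, w in zip(NODES, WEIGHTS):
        a = mu_z + math.sqrt(2.0 * var_z) * t
        total += w / (1.0 + math.exp(a))
    return total / math.sqrt(math.pi)


def delta_mean(mu_x, var_x, mu_n, var_n, mu_dx):
    """Compensated delta mean of one log-spectral channel, Eq. (29)."""
    mu_z = mu_n - mu_x      # mean of Z = N - X
    var_z = var_n + var_x   # variance of Z (speech and noise independent)
    return phi1(mu_z, var_z) * mu_dx
```

If the variance of $Z^l_k$ is set to zero, `phi1` collapses to $1/(1 + e^{\mu^l_{Z_k}})$, which is the weight $M_k$ of Eq. (21) used by the VTS-1 mean compensation in (25); the quadrature therefore differs from (25) only through the spread of $Z^l_k$.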

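It is noted that the mean compensation formula of both VTS-1 and Hung et al. (2001) is a rough approximation of (29) and (30). Instead, we use Gaussian–Hermite numerical integration, which provides a good approximation of the mean formula, as shown by the experimental results. The elements of the covariance matrices of the corrupted speech models are given by

$$\hat{\Sigma}^l_{\Delta Y_{kl}} = E\{\Delta Y^l_k \Delta Y^l_l\} - \hat{\mu}^l_{\Delta Y_k}\hat{\mu}^l_{\Delta Y_l} \tag{32}$$

$$\hat{\Sigma}^l_{\Delta Y_{kl}} = \varphi_2(Z^l_{kl})\left(\Sigma^l_{\Delta X_{kl}} + \mu^l_{\Delta X_k}\mu^l_{\Delta X_l}\right) + \varphi_2(-Z^l_{kl})\,\tilde{\Sigma}^l_{\Delta N_{kl}} - \hat{\mu}^l_{\Delta Y_k}\hat{\mu}^l_{\Delta Y_l} \tag{33}$$

where

$$\varphi_2(Z^l_{kl}) = E\left\{\frac{1}{\left(1 + e^{Z^l_k}\right)\left(1 + e^{Z^l_l}\right)}\right\} \approx \frac{1}{\pi}\sum_{j=1}^{n}\left(\frac{\omega_j}{1 + e^{A}}\sum_{i=1}^{n}\frac{\omega_i}{1 + e^{B}}\right) \tag{34}$$

with $A$ evaluated at the abscissa $t_j$ and

$$B = \mu^l_{Z_l} + \sqrt{2\Sigma^l_{Z_{ll}}}\left(r_{kl}\,t_j + \sqrt{1 - r^2_{kl}}\;t_i\right) \tag{35}$$

$$r^2_{kl} = r^2_{lk} = \frac{\Sigma^l_{Z_{kl}}\Sigma^l_{Z_{lk}}}{\Sigma^l_{Z_{ll}}\Sigma^l_{Z_{kk}}} \tag{36}$$

Using the auxiliary r.v. $Z^l_k$ and the assumption that this r.v. is uncorrelated with the dynamic features of the speech and the noise, we reduce the dimensionality of the problem. This reduction, together with Gaussian–Hermite numerical integration, makes the new DPCM an efficient approach for generating compensated dynamic models.

5.2. Delta–delta parameter compensation

Similarly, the delta–delta feature is written as the second-order time derivative of the static features and expressed as

$$\Delta^2 Y^l_k = \frac{1}{1 + e^{Z^l_k}}\frac{\partial^2 X^l_k}{\partial t^2} + \frac{1}{1 + e^{-Z^l_k}}\frac{\partial^2 N^l_k}{\partial t^2} + \frac{1}{2 + e^{Z^l_k} + e^{-Z^l_k}}\left[\left(\frac{\partial X^l_k}{\partial t}\right)^2 + \left(\frac{\partial N^l_k}{\partial t}\right)^2 - 2\frac{\partial X^l_k}{\partial t}\frac{\partial N^l_k}{\partial t}\right] \tag{37}$$

Thus the delta–delta mean of the corrupted speech model can be obtained as

$$\hat{\mu}^l_{\Delta^2 Y_k} = \varphi_1(Z^l_k)\,\mu^l_{\Delta^2 X_k} + \varphi_3(Z^l_k)\left(\Sigma^l_{\Delta X_{kk}} + \left(\mu^l_{\Delta X_k}\right)^2 + \tilde{\Sigma}^l_{\Delta N_{kk}}\right) \tag{38}$$

with

$$\varphi_3(Z^l_k) = E\left\{\frac{1}{2 + e^{Z^l_k} + e^{-Z^l_k}}\right\} \approx \frac{1}{\sqrt{\pi}}\sum_{i=1}^{n}\frac{\omega_i}{2 + e^{A} + e^{-A}} \tag{39}$$

The elements of the covariance matrices of the corrupted delta–delta models are given by

$$\hat{\Sigma}^l_{\Delta^2 Y_{kl}} = \varphi_2(Z^l_{kl})\left(\Sigma^l_{\Delta^2 X_{kl}} + \mu^l_{\Delta^2 X_k}\mu^l_{\Delta^2 X_l}\right) + \varphi_2(-Z^l_{kl})\,\tilde{\Sigma}^l_{\Delta^2 N_{kl}} + \varphi_4(Z^l_{kl})\,C_1 + \varphi_4(Z^l_{lk})\,C_2 + \varphi_5(Z^l_{kl})\,C_3 - \hat{\mu}^l_{\Delta^2 Y_k}\hat{\mu}^l_{\Delta^2 Y_l} \tag{40}$$

with

$$\varphi_4(Z^l_{kl}) = E\left\{\frac{1}{\left(1 + e^{Z^l_k}\right)\left(2 + e^{Z^l_l} + e^{-Z^l_l}\right)}\right\} \approx \frac{1}{\pi}\sum_{j=1}^{n}\left(\frac{\omega_j}{1 + e^{A}}\sum_{i=1}^{n}\frac{\omega_i}{2 + e^{B} + e^{-B}}\right) \tag{41}$$

$$\varphi_5(Z^l_{kl}) = E\left\{\frac{1}{\left(2 + e^{Z^l_k} + e^{-Z^l_k}\right)}\cdot\frac{1}{\left(2 + e^{Z^l_l} + e^{-Z^l_l}\right)}\right\} \approx \frac{1}{\pi}\sum_{j=1}^{n}\left(\frac{\omega_j}{2 + e^{A} + e^{-A}}\sum_{i=1}^{n}\frac{\omega_i}{2 + e^{B} + e^{-B}}\right) \tag{42}$$

and

$$C_1 = E\left\{\frac{\partial^2 X^l_k}{\partial t^2}\right\}\left(E\left\{\left(\frac{\partial X^l_l}{\partial t}\right)^2\right\} + E\left\{\left(\frac{\partial N^l_l}{\partial t}\right)^2\right\}\right) = \mu^l_{\Delta^2 X_k}\left(\Sigma^l_{\Delta X_{ll}} + \left(\mu^l_{\Delta X_l}\right)^2 + \tilde{\Sigma}^l_{\Delta N_{ll}}\right) \tag{43}$$

$$C_2 = E\left\{\frac{\partial^2 X^l_l}{\partial t^2}\right\}\left(E\left\{\left(\frac{\partial X^l_k}{\partial t}\right)^2\right\} + E\left\{\left(\frac{\partial N^l_k}{\partial t}\right)^2\right\}\right) = \mu^l_{\Delta^2 X_l}\left(\Sigma^l_{\Delta X_{kk}} + \left(\mu^l_{\Delta X_k}\right)^2 + \tilde{\Sigma}^l_{\Delta N_{kk}}\right) \tag{44}$$

$$C_3 = E\left\{\left(\frac{\partial X^l_k}{\partial t}\right)^2\left(\frac{\partial X^l_l}{\partial t}\right)^2\right\} + E\left\{\left(\frac{\partial N^l_k}{\partial t}\right)^2\left(\frac{\partial N^l_l}{\partial t}\right)^2\right\} + E\left\{\left(\frac{\partial X^l_k}{\partial t}\right)^2\left(\frac{\partial N^l_l}{\partial t}\right)^2\right\} + E\left\{\left(\frac{\partial N^l_k}{\partial t}\right)^2\left(\frac{\partial X^l_l}{\partial t}\right)^2\right\} + 4E\left\{\frac{\partial X^l_k}{\partial t}\frac{\partial X^l_l}{\partial t}\frac{\partial N^l_k}{\partial t}\frac{\partial N^l_l}{\partial t}\right\}$$

$$C_3 = \Sigma^l_{\Delta X_{kk}}\Sigma^l_{\Delta X_{ll}} + \Sigma^l_{\Delta X_{kk}}\left(\mu^l_{\Delta X_l}\right)^2 + \Sigma^l_{\Delta X_{ll}}\left(\mu^l_{\Delta X_k}\right)^2 + 2\left(\Sigma^l_{\Delta X_{kl}}\right)^2 + 4\Sigma^l_{\Delta X_{kl}}\mu^l_{\Delta X_k}\mu^l_{\Delta X_l} + \left(\mu^l_{\Delta X_k}\right)^2\left(\mu^l_{\Delta X_l}\right)^2 + \tilde{\Sigma}^l_{\Delta N_{kk}}\tilde{\Sigma}^l_{\Delta N_{ll}} + 2\left(\tilde{\Sigma}^l_{\Delta N_{kl}}\right)^2 + \tilde{\Sigma}^l_{\Delta N_{ll}}\left(\Sigma^l_{\Delta X_{kk}} + \left(\mu^l_{\Delta X_k}\right)^2\right) + \tilde{\Sigma}^l_{\Delta N_{kk}}\left(\Sigma^l_{\Delta X_{ll}} + \left(\mu^l_{\Delta X_l}\right)^2\right) + 4\tilde{\Sigma}^l_{\Delta N_{kl}}\left(\Sigma^l_{\Delta X_{kl}} + \mu^l_{\Delta X_k}\mu^l_{\Delta X_l}\right) \tag{45}$$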

Eqs. (29), (33), (38), and (40) are the expressions for compensating the clean dynamic models.

6. Experimental results

The TIDigits database is used for the evaluation of the proposed DPCM. It contains 20 isolated words, comprising the digits '0' to '9' plus ten commands. The speech utterances are spoken by 16 speakers (8 males and 8 females). The numbers of training and testing utterances spoken by each speaker for each word are 2 and 16, respectively. Altogether there are 640 utterances for training and 5081 utterances for testing. The length of the analysis frame (windowed by Hamming weights) is 32 ms, and the frame rate is 104 frames/s (9.6 ms/frame). The feature vector is composed of the static features and their derivatives, each with 13 cepstral coefficients. The dynamic features of frame $t$ are computed from the static features as

$$\Delta Y^l_k(t) = \frac{\sum_{m=-3}^{3} m\,Y^l_k(t + m)}{\sum_{m=-3}^{3} m^2} \tag{46}$$

$$\Delta^2 Y^l_k(t) = \frac{\sum_{m=-3}^{3} m\,\Delta Y^l_k(t + m)}{\sum_{m=-3}^{3} m^2} \tag{47}$$

We refer to the static and delta features as two-stream features, and to the two-stream features plus the delta–delta features as three-stream features.

Four kinds of noise from the NOISEX-92 database are used for testing: white, babble, pink, and destroyerengine noise. The average noise power spectrum is estimated from 200 non-overlapping frames of noise data. It is scaled according to the specified global SNR. The global SNR of an utterance is defined as

$$\mathrm{SNR}_{global} = 10\log_{10}\frac{\sum_{m=1}^{H}\sum_{k=0}^{L/2} P_m(k)}{H\sum_{k=0}^{L/2} g^2\,\bar{N}(k)} \tag{48}$$

where $\{P_m(k)\}$ is the clean speech power spectrum of the mth frame, $\{\bar{N}(k)\}$ is the non-scaled average noise power spectrum, $H$ is the number of frames of the utterance, $L$ is the FFT size, and $g$ is the factor that scales the noise according to the specified $\mathrm{SNR}_{global}$. Thus the corrupted speech is produced as

$$y(i) = x(i) + g \cdot n(i) \tag{49}$$

where $y(i)$ is the corrupted speech signal, and $x(i)$ and $n(i)$ are the clean speech and non-scaled noise, respectively.

We use a word-based HMM as the speech recognizer. It has six emitting states, each with four Gaussian output probability distributions. In the training mode, we train the system with the clean speech utterances to produce clean models. In testing, the ten speech recognition methods listed in Table 1 are carried out. These ten methods are: the mismatched case without noise compensation; the speech enhancement method using non-linear spectral subtraction (Lockwood and Boudy, 1992); the Log-Add PMC, Log-Normal PMC, split-mixture PMC, and VTS-1; and the same four MC methods with their dynamic models obtained from the DPCM.

Table 2 shows the word recognition rates (WRR) of the VTS-1 plus DPCM versus SNR for the two-stream and three-stream features in different noise environments. It can be seen that the recognition rate decreases slightly as the number of streams increases from 2 to 3 at low SNR for all noises; nevertheless, the three streams provide slightly better accuracy than the two streams at high SNR. The performance degradation of the three-stream features at low SNR could mean that the delta–delta features are affected by noise more seriously than the two-stream features. Hence the two-stream feature vector is practically sufficient for noisy speech recognition.
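The regression formula (46) and the noise scaling implied by (48) and (49) can be sketched as follows. This is an illustration under stated assumptions: edge frames are replicated when the ±3 window runs past the ends of the utterance (the text does not specify the edge handling), and the function names `delta` and `noise_scale` are hypothetical.

```python
import math


def delta(frames, k=3):
    """Regression delta of Eq. (46) for one feature component over time.

    frames is a list of scalar feature values, one per frame. Frames
    outside the utterance are replaced by the nearest edge frame
    (an assumption, not stated in the text).
    """
    denom = sum(m * m for m in range(-k, k + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        num = sum(m * frames[min(max(t + m, 0), last)]
                  for m in range(-k, k + 1))
        out.append(num / denom)
    return out


def noise_scale(speech_power, noise_power, snr_db):
    """Scaling factor g for which Eq. (48) equals snr_db.

    speech_power: list of H frames, each a list of P_m(k), k = 0..L/2.
    noise_power: list of the non-scaled average noise power, k = 0..L/2.
    """
    num_frames = len(speech_power)
    speech = sum(sum(frame) for frame in speech_power)
    noise = num_frames * sum(noise_power)
    return math.sqrt(speech / (noise * 10.0 ** (snr_db / 10.0)))
```

Applying `delta` to its own output gives the delta–delta feature of Eq. (47), and the returned `g` is the amplitude factor used in Eq. (49).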

Table 1
Index table for the ten methods

Index   Method
⟨1⟩     Mismatched case
⟨2⟩     Non-linear spectral subtraction
⟨3⟩     Log-Add PMC^{S&D}
⟨4⟩     Log-Add PMC^{S} + DPCM
⟨5⟩     Log-Normal PMC^{S&D}
⟨6⟩     Log-Normal PMC^{S} + DPCM
⟨7⟩     Split Mix. PMC (a = 0.6)^{S&D}
⟨8⟩     Split Mix. PMC (a = 0.6)^{S} + DPCM
⟨9⟩     VTS-1^{S&D}
⟨10⟩    VTS-1^{S} + DPCM

S: Compensation for static models.
S&D: Compensation for static and delta models.


Table 2
Word recognition rate (%) versus SNR for two and three streams for VTS-1 + DPCM (n = 2)

Noise–stream         100 dB   30 dB   10 dB   5 dB    0 dB    −5 dB
White–2 (a)          99.04    98.28   95.77   93.68   86.05   64.16
White–3 (b)          99.16    98.30   95.52   92.19   84.57   61.35
Pink–2               99.04    98.40   97.07   94.95   88.72   68.17
Pink–3               99.16    98.38   96.37   94.36   87.12   63.33
Babble–2             99.04    98.9    97.46   95.21   86.88   62.85
Babble–3             99.16    99.06   97.67   95.29   86.00   61.35
Destroyerengine–2    99.04    97.96   95.51   92.62   85.71   64.89
Destroyerengine–3    99.16    98.75   95.94   93.62   86.26   63.66

(a) Includes static and delta features (26 cepstral coefficients).
(b) Includes static, delta and delta–delta features (39 cepstral coefficients).

show that n = 2 is sufficient to provide the accuracy for the Gaussian–Hermite integration of DPCM, pffiffiffi pffiffiffi where x1 ¼ x2 ¼ p=2 and t1 ¼ t2 ¼ 2=2 (Abramowitz and Stegun, 1972). The experimental results of recognition accuracy under the four different noises are shown in Tables 3–6. The results show that the model-based compensation methods can achieve good performance in the noise environments at low SNR (note the recog-

In the following experiments, the two-stream features are adopted. Fig. 2 plots the WRRs and computational complexities of the Log-Add, Log-Normal, and VTS-1 using the DPCM versus the degree n of the Gauss–Hermite integrals for white noise. For the three methods, it can be seen that the WRR has no obvious change with the degree; however, the complexity is increasing linearly with n. The results

Log−add + DPCM

Log−normal + DPCM

100

100 95

90 90 85

70

WRR (%)

WRR (%)

80

−5 dB 0 dB 5 dB 10 dB 30 dB

60

80 75

−5 dB 0 dB 5 dB 10 dB 30 dB

70 65

50 60 40

2

3

4

degree n

5

55

VTS−1 + DPCM 6

95

x 10

degree n

4

5

Computational Complexity

Number of Operations

5

90

WRR (%)

3

4

100

85 80

−5 dB 0 dB 5 dB 10 dB 30 dB

75 70

4 Log−add + DPCM Log−normal + DPCM VTS−1 + DPCM

3 2 1

65 60

2

2

3

degree n

4

5

0

2

3

4

5

degree n

Fig. 2. Word recognition rate (WRR) and computational complexity of different methods using DPCM versus degree n for white noise.

G.-x. Ning et al. / Speech Communication 48 (2006) 1283–1293

nition accuracy at SNR = 5 dB) in comparison with the mismatched case and non-linear spectral subtraction. Under the four noises, the experimental results show that the four MC methods modified with the DPCM has recognition accuracy better than the original counterparts particularly at low and very low SNRs (note the recognition accuracy at SNR = 0, 5 dB). For the sake of comparison, we define an average performance gain Gave of a Table 3 Word recognition rates (%) of the ten methods in white noise environment SNR

30 dB

10 dB

5 dB

0 dB

5 dB

h1i h2i h3i h4i h5i h6i h7i h8i h9i h10i

98.31 98.41 98.47 98.65 98.57 98.27 98.55 98.41 97.86 98.28

51.89 84.04 93.61 94.56 93.97 95.56 94.40 95.75 94.49 95.77

27.10 64.76 87.71 89.78 85.03 88.03 89.79 92.33 89.96 92.68

14.28 38.97 73.05 76.93 73.41 80.92 78.99 85.04 81.60 86.05

6.74 21.58 42.79 49.20 51.61 58.08 54.11 61.90 60.29 64.16

Table 4 Word recognition rates (%) of the ten methods in pink noise environment SNR

30 dB

10 dB

5 dB

0 dB

5 dB

h1i h2i h3i h4i h5i h6i h7i h8i h9i h10i

98.80 98.71 98.63 98.69 98.76 98.63 98.71 98.42 97.37 98.40

67.70 92.78 94.77 95.62 95.50 96.64 96.31 96.98 95.81 97.07

29.80 78.55 88.28 90.29 90.29 94.00 92.08 94.97 93.47 94.95

9.28 48.79 63.19 74.52 73.73 81.82 80.95 87.09 86.90 88.72

5.24 20.60 37.71 38.51 50.85 57.77 56.00 60.51 67.53 68.17

Table 5 Word recognition rates (%) of the ten methods in destroyerengine noise environment SNR

30 dB

10 dB

5 dB

0 dB

5 dB

h1i h2i h3i h4i h5i h6i h7i h8i h9i h10i

98.37 97.51 98.59 98.76 98.71 98.71 98.80 98.80 97.96 97.96

59.77 77.26 95.60 95.30 94.32 95.51 96.35 96.35 94.05 95.51

24.59 51.58 88.78 90.58 85.63 88.87 91.75 93.18 89.72 92.62

11.30 23.07 70.71 79.16 74.06 80.49 77.81 84.46 82.26 85.71

7.27 9.77 35.84 51.04 51.80 58.26 45.11 55.82 61.94 64.89

1291

Table 6 Word recognition rates (%) of the ten methods in babble noise environment SNR

30 dB

10 dB

5 dB

0 dB

5 dB

h1i h2i h3i h4i h5i h6i h7i h8i h9i h10i

98.96 98.51 98.65 98.73 98.98 99.00 98.94 99.02 98.96 98.90

82.96 86.26 94.95 95.60 97.15 97.57 97.10 97.55 97.45 97.46

53.89 70.41 88.05 89.46 92.97 94.93 93.20 94.89 94.38 95.21

25.67 47.31 66.15 68.82 79.18 84.65 81.13 86.07 85.84 86.88

10.17 21.24 28.09 29.85 49.01 57.69 52.30 60.51 62.59 62.85

MC method as the average of the difference of the recognition rates in absolute percentage of the MC method using DPCM and its original counterpart over the four noises. The average performance gains of the four PMC methods using DPCM are tabulated in Table 7. For 0 dB case, the Gave of the Log-Add PMC using DPCM, the Log-Normal PMC using DPCM and split-mixture PMC using DPCM are 7%, 7% and 6%, respectively. For 5 dB case, the Gave of the three methods are 6%, 7% and 8%, respectively. The experimental results also show that the DPCM scheme can enhance the performance of VTS-1 method under the four noises for all SNR cases. The performance gain of the PMC with DPCM over its counterpart is attributed to the more rigorous delta parameters compensation which includes the mean and covariance compensation. The DPCM has been shown to be a general approach to improve the dynamic model compensation for any MC methods. We also use the real noisy speech database Aurora 2.0 (Hirsch and Pearce, 2000) for the illustration of the performance of the DPCM scheme. The recognition performances of the ten methods listed in Table 1 are tabulated in Table 8. The experimental results prove the DPCM scheme to be a general approach to improve model compensation again. Table 9 lists the number of multiplications, divisions, logarithm and exponential operations of each

Table 7 Gave (%) of different methods for different global SNRs SNR

10 dB

5 dB

0 dB

5 dB

h4i over h3i h6i over h5i h8i over h7i h10i over h9i

0.54 1.09 0.62 1.16

1.82 2.98 2.14 1.98

6.58 6.88 6.04 2.69

6.04 7.13 7.81 1.93

1292

G.-x. Ning et al. / Speech Communication 48 (2006) 1283–1293

Table 8 Word recognition rates (WRR %) of five methods for using Aurora 2.0 noisy speech database Environment

SNR/dB

h1i

h2i

h3i

h4i

h5i

h6i

h7ia

h8ia

h9i

h10i

Exhibition hall noise

Clean 20 15 10 5 0 5

99.62 64.46 39.12 20.70 16.11 9.82 9.49

99.62 88.66 75.86 63.50 36.84 17.37 11.32

99.62 96.06 95.10 90.78 84.23 56.48 26.36

99.62 97.48 95.10 91.54 85.02 59.75 29.55

99.62 95.74 94.60 90.02 80.51 57.68 30.92

99.62 96.10 95.63 92.60 83.56 62.47 36.58

99.62 95.98 94.30 91.17 82.35 56.19 32.04

99.62 96.49 96.09 92.65 83.62 61.12 34.88

99.62 96.39 94.01 91.26 84.54 61.03 33.11

99.62 96.71 95.37 91.57 86.17 65.71 37.91

Suburban train noise

Clean 20 15 10 5 0 5

99.29 68.97 43.54 22.74 12.63 8.50 9.09

99.29 92.69 82.37 62.48 38.43 21.03 13.09

99.29 97.93 97.91 94.16 89.04 73.36 44.66

99.29 98.28 97.52 94.76 90.05 75.14 44.89

99.29 98.28 97.24 92.71 90.14 95.69 50.47

99.29 98.31 97.62 94.37 89.32 78.73 54.76

99.29 97.62 97.24 92.71 90.14 76.94 51.60

99.29 98.66 97.58 94.76 90.09 79.04 54.33

99.29 98.36 97.67 94.04 89.14 79.04 54.41

99.29 98.31 97.58 93.65 90.92 81.95 58.90

Babble noise

Clean 20 15 10 5 0 5

99.59 89.34 66.18 32.19 20.27 13.59 11.78

99.59 88.89 81.36 74.89 48.97 22.85 15.18

99.59 98.02 96.56 95.17 88.95 67.02 30.23

99.59 98.02 96.84 96.10 89.28 67.95 30.35

99.59 98.02 97.43 96.65 91.04 75.09 36.50

99.59 98.33 97.73 97.41 90.47 77.01 38.45

99.59 98.02 97.43 96.69 90.52 75.96 36.54

99.59 98.02 97.43 96.67 90.85 76.71 38.14

99.59 98.02 97.30 95.65 90.58 78.38 47.37

99.59 98.02 97.71 96.81 90.24 78.90 48.49

Car noise

Clean 20 15 10 5 0 5

98.57 84.68 71.33 40.74 27.93 20.10 11.01

98.57 95.63 89.99 80.34 59.59 23.40 11.32

98.57 98.73 97.47 94.74 88.90 68.99 28.47

98.57 98.72 97.49 95.43 89.81 70.83 30.13

98.57 97.26 96.91 96.01 89.42 69.04 36.35

98.57 96.23 96.57 95.74 91.50 74.75 38.55

98.57 97.26 97.39 96.01 89.73 69.68 36.40

98.57 97.74 97.39 97.41 90.22 72.52 39.05

98.57 96.28 95.05 92.74 88.79 70.74 36.93

98.57 97.41 96.97 95.42 90.53 74.88 37.85

a

For split mixture PMC, a = 0.4 is employed.

technique to update the parameters of a single mixture of static and dynamic models, where N and M are the dimensions of the features in the Cepstral and Log Spectral domains, respectively. For the Log-Add, only the means of the models are compensated. It can be seen that, by using n = 2 in the Gaussian–Hermite integral, the computational complexity of the DPCM is comparable to that of the compensation of the dynamic models of the Log-Normal, VTS-1 and split-mixture methods. However, the DPCM is more effective than the dynamic model compensation of these methods.

Table 9
Computational complexity of the ten methods

Method   Total                                            M = 25, N = 13, n = 2
⟨1⟩      None                                             None
⟨2⟩      None                                             None
⟨3⟩      4M(N + 1) + 2M                                   1450
⟨4⟩      4M(N + 1) + (2 + 2n)M                            1500
⟨5⟩      2MN(2M + N + 3) + 2M(3M + 4)                     46 850
⟨6⟩      2MN(2M + N + 3) + (3n + 10)M² + (7n + 10)M       53 500
⟨7⟩      2MN(2M + N + 3) + M(19M + 13)                    55 100
⟨8⟩      2MN(2M + N + 3) + (3n + 23)M² + (7n + 17)M       61 800
⟨9⟩      2MN(2M + N + 3) + 12M² + 13M                     50 725
⟨10⟩     2MN(2M + N + 3) + (3n + 10)M² + (7n + 14)M       53 600

7. Conclusion

A dynamic parameter compensation method for noisy speech recognition is presented. Using the first- and second-order time derivatives to represent the delta and delta–delta features, the equations for the compensation of the dynamic models are derived based on the reasonable assumption that the static features are independent of their derivatives. Experimental results show that the MC method which applies the DPCM for the compensation of dynamic models can cope with different noises with improved recognition accuracy, particularly at low and very low SNRs. Moreover, the computational complexity of the DPCM is comparable to that of most MC methods.
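As a quick check on Table 9, the following sketch evaluates each formula in the Total column at M = 25, N = 13 and n = 2. The function name `table9_counts` and the dictionary layout are illustrative assumptions and do not come from the paper; only the formulas are transcribed from the table. With these values the formulas reproduce the tabulated totals for the third method and for the fifth to tenth methods. For the fourth method the formula as printed evaluates to 1550, whereas the table lists 1500.

```python
# Hypothetical helper (not from the paper): evaluates the complexity
# formulas of Table 9 for given feature dimensions.

def table9_counts(M, N, n):
    """Return {method index: total from the Table 9 formula, or None}.

    M: feature dimension in the log-spectral domain
    N: feature dimension in the cepstral domain
    n: order used in the Gaussian-Hermite integral
    """
    base = 2 * M * N * (2 * M + N + 3)  # term shared by methods 5 to 10
    return {
        1: None,  # Table 9 lists "None" for the first two methods
        2: None,
        3: 4 * M * (N + 1) + 2 * M,
        4: 4 * M * (N + 1) + (2 + 2 * n) * M,
        5: base + 2 * M * (3 * M + 4),
        6: base + (3 * n + 10) * M**2 + (7 * n + 10) * M,
        7: base + M * (19 * M + 13),
        8: base + (3 * n + 23) * M**2 + (7 * n + 17) * M,
        9: base + 12 * M**2 + 13 * M,
        10: base + (3 * n + 10) * M**2 + (7 * n + 14) * M,
    }


if __name__ == "__main__":
    # Settings used in the last column of Table 9.
    for method, total in table9_counts(M=25, N=13, n=2).items():
        print(method, total)
```

Calling the function with a different n shows how the integral order changes the totals of the rows whose formulas contain n, while the remaining rows stay fixed.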