Maximum likelihood subband polynomial regression for robust speech recognition

Applied Acoustics 74 (2013) 640–646 Contents lists available at SciVerse ScienceDirect Applied Acoustics journal homepage: www.elsevier.com/locate/a...



Applied Acoustics journal homepage: www.elsevier.com/locate/apacoust

Yong Lü (a, b, *), Zhenyang Wu (b)

(a) College of Computer and Information Engineering, Hohai University, Nanjing 210098, China
(b) School of Information Science and Engineering, Southeast University, Nanjing 210096, China

Article info

Article history: Received 10 January 2012; received in revised form 2 September 2012; accepted 20 November 2012; available online 28 December 2012.

Keywords: Model adaptation; Subband polynomial regression; Hidden Markov model; Robust speech recognition

Abstract

In this paper, we propose a model adaptation algorithm based on maximum likelihood subband polynomial regression (MLSPR) for robust speech recognition. In this algorithm, the cepstral mean vectors of prior trained hidden Markov models (HMMs) are converted to the log-spectral domain by the inverse discrete cosine transform (DCT) and each log-spectral mean vector is divided into several subband vectors. The relationship between the training and testing subband vectors is approximated by a polynomial function. The polynomial coefficients are estimated from adaptation data using the expectation-maximization (EM) algorithm under the maximum likelihood (ML) criterion. The experimental results show that the proposed MLSPR algorithm is superior to both the maximum likelihood linear regression (MLLR) adaptation and the maximum likelihood subband weighting (MLSW) approach. In the MLSPR adaptation, only a very small amount of adaptation data is required and therefore it is more useful for fast model adaptation.

© 2012 Elsevier Ltd. All rights reserved.

* Corresponding author at: College of Computer and Information Engineering, Hohai University, Nanjing 210098, China. Tel.: +86 25 58099120. E-mail address: [email protected] (Y. Lü).
0003-682X/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.apacoust.2012.11.016

1. Introduction

Robustness is crucial for speech recognition in real applications because the mismatch between training and testing conditions often degrades the recognition performance significantly. This mismatch is due to additive background noise, channel distortion (convolutional noise), speaker differences and other factors. Generally speaking, the methods used to reduce the environmental mismatch can be classified into two major categories: the front-end feature domain methods and the back-end model domain methods.

In the feature domain, spectral subtraction (SS) [1], cepstral mean normalization (CMN) [2] and relative spectra (RASTA) [3] are commonly used to reduce the impact of noise on automatic speech recognition. Model-based feature compensation is another effective approach to noise-robust speech recognition; it was proposed simultaneously by Erell and Weintraub [4] and Acero [5], and further studied in [6–9]. This algorithm typically employs a Gaussian mixture model (GMM) to represent the distribution of the speech feature and uses the minimum mean squared error (MMSE) method to reconstruct the original speech feature. The noise parameters, which are used for transforming the prior trained speech model to the testing condition, are estimated from the noisy speech [6,7] or from the silence duration of the incoming speech [8,9]. In order to obtain a closed-form solution of the noise parameters from noisy speech features, the vector Taylor series (VTS) expansion [6] is used to approximate the nonlinear relationship between clean and noisy speech. Another VTS-based noise log-spectral estimation approach is studied in [7], where the maximum a posteriori (MAP) criterion is used instead of the maximum likelihood (ML) criterion. Although the feature domain methods can achieve significant performance improvements in noisy speech recognition, they are not effective in reducing the environmental mismatch resulting from other factors, such as the speaker.

In the model domain, the MAP adaptation [10] and maximum likelihood linear regression (MLLR) [11] are two popular adaptation algorithms. The MAP adaptation estimates model parameters by optimally interpolating the associated adaptation data with the prior parameters of hidden Markov models (HMMs). This algorithm only updates the parameters of models which are observed in the adaptation data, and thus fairly large amounts of adaptation data are generally required. MLLR is a transform-based adaptation algorithm, which transforms the overall HMMs by a set of linear regression functions. Compared to MAP, MLLR performs better when only a small amount of adaptation data is available. A model adaptation algorithm using piecewise-linear transformation is studied in [12], where various types of noise are clustered according to their spectral property and signal-to-noise ratio (SNR), and a set of noisy speech HMMs is trained for each cluster. In the recognition phase, the HMM set that best matches the input noisy speech is selected and further adapted using the MLLR method.

The maximum likelihood estimation is frequently unreliable in the case of sparse adaptation data. In the maximum a posteriori linear regression (MAPLR) adaptation [13], the prior knowledge of regression parameters is incorporated into the linear regression estimation to solve the sparseness problem. In theory, MAPLR works better than MLLR when the prior probability density of regression parameters is estimated properly. Besides, some noise adaptation methods, such as parallel model combination (PMC) [14], linear spectral transformation (LST) [15,16] and the VTS adaptation [17], have been proposed for noisy speech recognition. These methods are very effective for noise compensation, but do not take account of other environmental factors such as the speaker. Therefore, they cannot be applied to speaker adaptation or to minimizing other environmental mismatches.

In recent years, the subband approaches [18–22] have been proposed to improve the robustness of speech recognition against band-limited additive noise. In these approaches, the channels in the full-band filter bank are divided into several subbands, usually of equal partitions, and the subband feature vector is computed from each partition using the discrete cosine transform (DCT). The subband approaches can improve the recognition performance in the presence of narrow-band noise, but may degrade the baseline performance for clean speech because the subband features lose the correlation among subbands. In [23], a maximum likelihood subband weighting (MLSW) approach is proposed, where the subband features or subband means are multiplied by weighting factors and then combined and converted to the cepstral domain. The experimental results show that MLSW is more robust than both full-band approaches and conventional subband approaches. However, it achieves higher performance than MLLR only in the case of a very small amount of adaptation data, and its performance does not improve noticeably with the growth of adaptation data.

In this paper, we propose a model adaptation algorithm based on maximum likelihood subband polynomial regression (MLSPR) for robust speech recognition. Firstly, the channels of the Mel filter bank (Mel channels) are divided into several subbands of equal partitions, and the cepstral mean vectors of prior trained HMMs are converted to the log-spectral domain by the inverse DCT. Then each log-spectral mean vector is divided into several subband vectors and each subband vector is transformed to the testing condition by a polynomial function. In other words, all the channels of each subband share a polynomial transformation, which can further improve the robustness of transform-based model adaptation algorithms. The polynomial coefficients are estimated from adaptation data using the expectation-maximization (EM) algorithm [24] under the maximum likelihood criterion. Finally, the estimated subband mean vectors are combined and converted to the cepstral domain.

The rest of this paper is organized as follows. Section 2 describes the MLSPR algorithm. The estimation of polynomial coefficients is given in Section 3. The experimental procedures and results are presented and discussed in Section 4. Section 5 concludes the paper with a summary.

2. Subband polynomial regression

In the MAP adaptation [10], the prior trained HMM parameters, including the state transition probabilities, mixture weights, mean vectors and covariance matrices, are adapted to the test condition. If enough adaptation data is provided, the MAP estimate asymptotically approaches the retrained system. Nevertheless, in the test environment, the available adaptation data is often limited. Thus the transform-based adaptation methods, such as MLLR [11], are proposed to improve the robustness of model adaptation, where the data that belong to various Gaussian mixture components are combined to estimate a set of transformation parameters. However, data sparseness is still a difficult problem when only a small amount of adaptation data is available.

The Mel-frequency cepstral coefficient (MFCC) vector is the most commonly used speech feature for speech recognition. In the cepstral domain, there are only weak correlations among the different components of the MFCC vector and thus the different components of the HMM mean vector cannot share the same transformation. In the log-spectral domain, by contrast, the adjacent channels of the Mel filter bank overlap and therefore it can be assumed that the transformation function of one channel is similar to those of its adjacent channels. In this work, the channels of the Mel filter bank (Mel channels) are divided into several subbands of equal partitions and all the channels of each subband share a polynomial transformation.

There are two reasons why the polynomial regression is used to approximate the relationship between the training and testing subbands. One is that the environmental transformation of each channel is nonlinear and the polynomial regression can approximate any nonlinear function. The other is that there are always some differences among the different channels of one subband and the polynomial regression can cover the different channels better than the linear regression and weighting adaptation.

The proposed subband polynomial regression proceeds as follows:

1. The prior trained cepstral HMM mean vector $\mu$ is converted to the log-spectral domain by the following equation:

$$u = C^{-1} \mu \tag{1}$$

where $C^{-1}$ denotes the inverse DCT matrix and $u$ is the log-spectral mean vector. If the high-order coefficients of the MFCC vector are ignored, the full MFCC vector can be extracted from training data and then the MAP adaptation [10] is employed to estimate the full cepstral mean vector for Eq. (1). The original cepstral mean vector can also be padded with zeros to obtain the approximate value of the full mean vector.

2. The log-spectral mean vector $u = [u_1, u_2, \ldots, u_D]^T$ is divided into $K$ subbands:

$$u = [m_1^T, m_2^T, \ldots, m_K^T]^T \tag{2}$$

where $D$ is the number of Mel channels and the superscript $T$ denotes the transpose of the matrix or vector. The mean vector $u$ can also be expressed as the sum of $K$ subband mean vectors,

$$u = \sum_{k=1}^{K} u_k \tag{3}$$

where $u_k = [0, \ldots, 0, m_k^T, 0, \ldots, 0]^T$. The decomposition procedure is illustrated in Fig. 1, where the log-spectral mean vector is decomposed into four subband mean vectors.

[Figure 1: the vector $u = [m_1; m_2; m_3; m_4]$ shown as the sum of $u_1 = [m_1; 0; 0; 0]$, $u_2 = [0; m_2; 0; 0]$, $u_3 = [0; 0; m_3; 0]$ and $u_4 = [0; 0; 0; m_4]$.]

Fig. 1. Decomposition of mean vector.
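Steps 1 and 2 (Eqs. (1)–(3)) can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: it assumes an orthonormal DCT-II for the matrix $C$ (so that $C^{-1} = C^T$), uses the zero-padding option mentioned in step 1 to obtain a full-length cepstral mean, and splits the channels with `np.array_split`. The function names are hypothetical.

```python
import numpy as np


def dct_matrix(num_channels):
    """Orthonormal DCT-II matrix (assumed form of C), so inv(C) == C.T."""
    n = np.arange(num_channels)
    C = np.sqrt(2.0 / num_channels) * np.cos(
        np.pi * n[:, None] * (2 * n[None, :] + 1) / (2 * num_channels))
    C[0, :] /= np.sqrt(2.0)
    return C


def cepstral_to_log_spectral(mu, num_channels):
    """Eq. (1): u = C^{-1} mu, after zero-padding mu to num_channels."""
    C = dct_matrix(num_channels)
    mu_full = np.zeros(num_channels)
    mu_full[: len(mu)] = mu          # truncated high-order terms set to zero
    return C.T @ mu_full             # orthonormal C: inverse equals transpose


def subband_decompose(u, num_subbands):
    """Eqs. (2)-(3): return [u_1, ..., u_K], each zero outside its subband."""
    parts = []
    for idx in np.array_split(np.arange(len(u)), num_subbands):
        u_k = np.zeros_like(u)
        u_k[idx] = u[idx]
        parts.append(u_k)
    return parts
```

For example, a 13-dimensional cepstral mean with 20 Mel channels and four subbands yields four 20-dimensional vectors whose sum is the log-spectral mean, as in Fig. 1.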


3. The relationship between the testing mean sub-vector $\bar{u}_k$ and the original value $u_k$ is approximated by a polynomial function,

$$\bar{u}_k = a_k^0 e_k + \sum_{p=1}^{P} a_k^p (u_k)^p \tag{4}$$

where $a_k^p$ is the $p$th coefficient of the $k$th subband; $e_k$ is the unit subvector where the values of the $k$th subband are a series of ones and the others are zeros; $(u_k)^p$ denotes the element-by-element multiplication. The new log-spectral mean vector $\bar{u}$ can be represented by,

$$\bar{u} = \sum_{k=1}^{K} a_k^0 e_k + \sum_{k=1}^{K} \sum_{p=1}^{P} a_k^p (U e_k)^p \tag{5}$$

where $U = \mathrm{diag}(u)$ is a diagonal matrix and $U e_k = u_k$.

4. Applying the DCT on both sides of Eq. (5), the testing cepstral mean $\bar{\mu}$ can be expressed as,

$$\bar{\mu} = \sum_{k=1}^{K} a_k^0 C e_k + \sum_{k=1}^{K} \sum_{p=1}^{P} a_k^p C (U e_k)^p \tag{6}$$

where $C$ denotes the DCT matrix.

3. Maximum likelihood estimation of subband polynomial coefficients

This paper uses the HMM to model each basic speech unit and the probability density function of the $i$th state of the HMM can be expressed as,

$$b_i(o_t) = \sum_{m=1}^{M} c_{im} (2\pi)^{-d/2} |\Sigma_{im}|^{-1/2} \exp\left\{ -\frac{1}{2} (o_t - \mu_{im})^T \Sigma_{im}^{-1} (o_t - \mu_{im}) \right\} \tag{7}$$

where $o_t$ is the $t$th MFCC vector; $c_{im}$, $\mu_{im}$ and $\Sigma_{im}$ denote the mixture coefficient, mean vector and covariance matrix of the $m$th Gaussian mixture component in the $i$th state, which are estimated from training data.

We assume that the environmental mismatch only affects the mean vectors of HMM. According to Eq. (6), the relationship between the testing mean $\bar{\mu}_{im}$ and the training mean $\mu_{im}$ can be represented by,

$$\bar{\mu}_{im} = \sum_{k=1}^{K} a_k^0 C e_k + \sum_{k=1}^{K} \sum_{p=1}^{P} a_k^p C (U_{im} e_k)^p \tag{8}$$

where $U_{im}$ is computed using the following equation:

$$U_{im} = \mathrm{diag}(C^{-1} \mu_{im}) \tag{9}$$

The subband polynomial coefficient $a_k^p$ can be estimated from adaptation data using the EM algorithm [24] under the maximum likelihood criterion and the auxiliary function is defined as:

$$Q(\bar{\lambda}|\lambda) = -\sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T} \gamma_{im}(t) (o_t - \bar{\mu}_{im})^T \Sigma_{im}^{-1} (o_t - \bar{\mu}_{im}) \tag{10}$$

where $\gamma_{im}(t) = P(\theta_t = i, k_t = m \mid O, \lambda)$ is the posterior probability of being in state $i$ and mixture component $m$ at time $t$ given the observation sequence $O = \{o_1, \ldots, o_t, \ldots, o_T\}$ and the prior parameter set $\lambda$. Eq. (8) can be expressed in a matrix form,

$$\bar{\mu}_{im} = D_{im} \bar{a} \tag{11}$$

where $D_{im}$ and $\bar{a}$ are given by,

$$D_{im} = \left[ C e_1, \ldots, C e_K, C U_{im} e_1, \ldots, C U_{im} e_K, \ldots, C (U_{im} e_1)^P, \ldots, C (U_{im} e_K)^P \right] \tag{12}$$

$$\bar{a} = [a_1^0, \ldots, a_K^0, a_1^1, \ldots, a_K^1, \ldots, a_1^P, \ldots, a_K^P]^T \tag{13}$$

Let the derivative of $Q(\bar{\lambda}|\lambda)$ with respect to $\bar{a}$ be equal to zero and thereby the following equation can be obtained,

$$\sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T} \gamma_{im}(t) (D_{im})^T \Sigma_{im}^{-1} o_t = \sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T} \gamma_{im}(t) (D_{im})^T \Sigma_{im}^{-1} D_{im} \bar{a} \tag{14}$$

When multiple HMMs are combined to estimate a set of parameters, Eq. (14) becomes:

$$\sum_{\lambda_j=\lambda_1}^{\lambda_J} \sum_{r=1}^{R} \sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T_r} \gamma_{im}^{j,r}(t) (D_{im}^j)^T (\Sigma_{im}^j)^{-1} o_t^{j,r} = \sum_{\lambda_j=\lambda_1}^{\lambda_J} \sum_{r=1}^{R} \sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T_r} \gamma_{im}^{j,r}(t) (D_{im}^j)^T (\Sigma_{im}^j)^{-1} D_{im}^j \bar{a} \tag{15}$$

where $o_t^{j,r}$ is the $t$th MFCC vector of the $r$th observation sequence $O^{j,r} = \{o_1^{j,r}, \ldots, o_t^{j,r}, \ldots, o_{T_r}^{j,r}\}$ of the $j$th HMM $\lambda_j$; $\gamma_{im}^{j,r}(t) = P(\theta_t = i, k_t = m \mid O^{j,r}, \lambda_j)$ is the posterior probability of being in state $i$ and mixture component $m$ at time $t$ given the observation sequence $O^{j,r}$ and the prior parameter set $\lambda_j$. Finally, the polynomial coefficient vector $\bar{a}$ is computed using the following equation:

$$\bar{a} = \left[ \sum_{\lambda_j=\lambda_1}^{\lambda_J} \sum_{r=1}^{R} \sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T_r} \gamma_{im}^{j,r}(t) (D_{im}^j)^T (\Sigma_{im}^j)^{-1} D_{im}^j \right]^{-1} \left[ \sum_{\lambda_j=\lambda_1}^{\lambda_J} \sum_{r=1}^{R} \sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T_r} \gamma_{im}^{j,r}(t) (D_{im}^j)^T (\Sigma_{im}^j)^{-1} o_t^{j,r} \right] \tag{16}$$
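The closed-form update of Eqs. (9), (12) and (16) could be sketched as below for a single HMM whose Gaussians have diagonal covariance matrices. This is a hedged illustration rather than the authors' code. It assumes an orthonormal square DCT matrix for $C$, a flat list of Gaussians (states and mixtures merged into one index), full-length static cepstral means, and posteriors `gammas` that have already been computed by a forward-backward pass, which is not shown. All function and argument names are hypothetical.

```python
import numpy as np


def dct_matrix(num_channels):
    """Orthonormal DCT-II matrix (assumed form of C), so inv(C) == C.T."""
    n = np.arange(num_channels)
    C = np.sqrt(2.0 / num_channels) * np.cos(
        np.pi * n[:, None] * (2 * n[None, :] + 1) / (2 * num_channels))
    C[0, :] /= np.sqrt(2.0)
    return C


def subband_indicators(num_channels, num_subbands):
    """Rows are the unit sub-vectors e_k: ones inside subband k, zeros elsewhere."""
    E = np.zeros((num_subbands, num_channels))
    for k, idx in enumerate(np.array_split(np.arange(num_channels), num_subbands)):
        E[k, idx] = 1.0
    return E


def design_matrix(C, u, E, order):
    """Eq. (12): columns C e_k for p = 0, then C (U e_k)^p for p = 1..P."""
    cols = [C @ e for e in E]
    for p in range(1, order + 1):
        cols.extend(C @ ((u * e) ** p) for e in E)   # element-wise power
    return np.stack(cols, axis=1)


def estimate_coefficients(C, means, variances, gammas, obs, E, order):
    """Eq. (16) for one HMM. Shapes: means, variances (G, d); gammas (T, G); obs (T, d)."""
    n_params = E.shape[0] * (order + 1)
    lhs = np.zeros((n_params, n_params))
    rhs = np.zeros(n_params)
    for g in range(means.shape[0]):
        u = np.linalg.solve(C, means[g])             # Eq. (9): C^{-1} mu
        D = design_matrix(C, u, E, order)
        Dt_prec = D.T / variances[g]                 # D^T Sigma^{-1}, diagonal Sigma
        lhs += gammas[:, g].sum() * (Dt_prec @ D)
        rhs += Dt_prec @ (gammas[:, g] @ obs)        # sum_t gamma(t) o_t
    return np.linalg.solve(lhs, rhs)


def adapt_mean(C, mean, E, order, a):
    """Eq. (11): adapted cepstral mean D a."""
    u = np.linalg.solve(C, mean)
    return design_matrix(C, u, E, order) @ a
```

A quick sanity check on the parameterisation: with all $a_k^1 = 1$ and every other coefficient zero, `adapt_mean` returns the original mean unchanged, which makes that vector a natural starting point.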

Once $\bar{a}$ is obtained, $\bar{\mu}_{im}$ can be calculated by Eqs. (8) and (9).

4. Experimental results and discussion

4.1. Experimental conditions

In this work, the TIMIT database is used to evaluate the proposed algorithm and the two dialect sentences spoken by each speaker in the database are segmented into 21 words for isolated word recognition. The 8400 utterances spoken by 400 speakers are used to train the HMM of each word for recognition. The 2100 utterances spoken by 100 speakers are contaminated with noise at different signal-to-noise ratio (SNR) levels to produce noisy speech for testing. Three types of noise (white, pink and factory noise) are selected from the NOISEX-92 database. The original 16 kHz speech has been down-sampled to 8 kHz with a low-pass filter and the useful frequency band, which lies between 64 Hz and 4 kHz, is divided into 20 equidistant channels in the Mel-frequency domain. The speech is coded into 16 ms frames, with a frame shift of 8 ms. Each frame is represented by a 39-dimensional vector consisting of 13 Mel-frequency cepstral coefficients and their first and second time derivatives. For the sake of convenience, the 0th cepstral coefficient is used instead of the log energy in the feature vector. The HMM of each word is composed of 16 states with 4 Gaussian mixtures per state, and the covariance matrices of all the Gaussian mixtures are diagonal.

4.2. Convergence of adaptation procedures

The convergence property of the MLSPR algorithm is illustrated in Figs. 2 and 3, where the clean test speech is mixed with white noise at 5 dB SNR for testing and five utterances are selected from the test data for adaptation. The number of subbands and the polynomial order are set to five and two, respectively. In Fig. 2, the average likelihood increases significantly after the first iteration and afterwards increases monotonically until it converges to a stationary point. This indicates that the mean vectors of the HMMs are matched with the adaptation data after the adaptation procedure. The monotonicity and convergence are guaranteed by the EM algorithm. Fig. 3 shows that the word error rate (WER) converges quickly and does not change much once two or more iterations are performed. Typically, the MLSPR algorithm converges after three or four iterations and therefore four iterations were performed in the following experiments.

Fig. 2. Average log-likelihood with the increase of iteration number.

Fig. 3. WERs with the increase of iteration number.

Fig. 4. Average WERs of MLSPR1 with different numbers of subbands.

Fig. 5. Average WERs of MLSPR2 with different numbers of subbands.
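The front-end settings of Section 4.1 imply a few concrete sizes, which the short sketch below works out. It is illustrative only. The paper does not say which Mel formula or filter shape it uses, so the `2595 * log10(1 + f / 700)` mapping and the convention of `NUM_CHANNELS + 2` band edges (as for triangular filters) are assumptions, and all names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 8000            # after down-sampling from 16 kHz
FRAME_MS, SHIFT_MS = 16, 8
NUM_CHANNELS = 20
F_LOW_HZ, F_HIGH_HZ = 64.0, 4000.0
NUM_CEPSTRA = 13                 # c0..c12, c0 replacing log energy


def hz_to_mel(f_hz):
    # Assumed Mel formula; the paper does not state which variant it uses.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_channels, f_low_hz, f_high_hz):
    """num_channels + 2 edge frequencies, equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (num_channels + 1)
    return [mel_to_hz(lo + i * step) for i in range(num_channels + 2)]


frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000      # samples per frame
frame_shift = SAMPLE_RATE_HZ * SHIFT_MS // 1000    # samples per shift
feature_dim = NUM_CEPSTRA * 3                      # static + delta + delta-delta
edges_hz = mel_band_edges(NUM_CHANNELS, F_LOW_HZ, F_HIGH_HZ)

print(frame_len, frame_shift, feature_dim)         # prints: 128 64 39
```

Under these settings each frame covers 128 samples with a 64-sample shift, and each word model has 16 × 4 = 64 Gaussians whose means are adapted.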

4.3. Number of subbands

This experiment shows how to select the number of subbands for the MLSPR adaptation. The clean test speech is mixed with white, pink and factory noise at 5 dB SNR for testing. The recognition results are averaged over the three types of noise, and the average WERs of the first-order MLSPR (MLSPR1) and the second-order MLSPR (MLSPR2) with different numbers of subbands are shown in Figs. 4 and 5, respectively. The channels of the Mel filter bank are divided into 1, 2, 3, 4, 5, 7 and 10 approximately uniform subbands. The figures show that one or two subbands should be selected in the case of a very small amount of adaptation data, such as one or two utterances, and the number of subbands should be increased to further improve the recognition performance with the growth of adaptation data. In the following experiments, 1, 2, 3, 4 and 5 subbands are selected for 1, 2, 3, 4 and 5 utterances respectively, and the number of subbands is set to 7 for 6–10 utterances.
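The subband partitioning and the selection rule described above can be written down compactly. The sketch below is an assumption-laden illustration: the paper only says the subbands are "approximately uniform", so `np.array_split` is one plausible way to split 20 channels, and the behaviour outside the tested range of 1–10 utterances is not specified in the paper (the helper simply rejects it). Function names are hypothetical.

```python
import numpy as np


def partition_channels(num_channels=20, num_subbands=5):
    """Split channel indices into approximately uniform contiguous subbands."""
    chunks = np.array_split(np.arange(num_channels), num_subbands)
    return [chunk.tolist() for chunk in chunks]


def select_num_subbands(num_utterances):
    """Rule used in the experiments: K = n for n = 1..5, K = 7 for n = 6..10."""
    if not 1 <= num_utterances <= 10:
        raise ValueError("rule is only stated for 1 to 10 adaptation utterances")
    return num_utterances if num_utterances <= 5 else 7


# Subband sizes for each partitioning tried in the experiment.
for k in (1, 2, 3, 4, 5, 7, 10):
    print(k, [len(band) for band in partition_channels(20, k)])
```

With a polynomial of order P, the number of coefficients to estimate is K(P + 1), so five subbands with a second-order polynomial require 15 parameters, which is consistent with increasing K only as more adaptation data becomes available.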

4.4. Polynomial order

This experiment shows how to select the polynomial order of the MLSPR algorithm. The same test utterances used in the previous subsection are employed to test the performance of MLSPR with various polynomial orders. Fig. 6 shows the average WERs over the three types of noise. From these results, we see that the performance of MLSPR improves when the polynomial order increases from 1 to 3, whereas the fourth-order and fifth-order polynomials give worse performance because a larger number of parameters has to be estimated. The performance of the third-order polynomials is only slightly better than that of the second-order polynomials. Considering the performance and computational complexity, the first-order and second-order polynomials are chosen for the experiments.

Fig. 6. Average WERs with various polynomial orders.

4.5. Comparison with SS, MLLR and MLSW

Firstly, MLSPR is compared with MLLR [11] and MLSW [23] with various numbers of adaptation utterances. The diagonal transformation matrix is used for the implementation of the MLLR adaptation. Two MLSPR methods, the first-order MLSPR (MLSPR1) and the second-order MLSPR (MLSPR2), are considered in this experiment. The clean test speech is contaminated with white, pink and factory noise at 10 dB SNR to produce three types of noisy speech. Figs. 7–9 show the WERs of MLLR, MLSW, MLSPR1 and MLSPR2 with various numbers of adaptation utterances for white, pink and factory noise environments, respectively. From these figures, it can be observed that MLSW works better than MLLR when only several adaptation utterances are available, but its performance does not improve noticeably with the growth of adaptation data and is worse than that of MLLR when more adaptation data is available. This is because the single variable linear regression (MLLR with diagonal transformation matrices) can model the environmental mismatch better than the subband weighting. Figs. 7–9 also show that MLSPR1 is superior to MLLR and MLSW both with several utterances and with more utterances. This is because MLSPR1 combines the advantages of MLLR and MLSW. On the one hand, multiple channels are combined to estimate the same set of parameters, which alleviates the data sparseness problem; on the other hand, the linear regression (first-order polynomial) can transform the original mean vector to the testing condition well. Comparing MLSPR2 with MLSPR1, we see that MLSPR2 has an advantage over MLSPR1, and its performance is affected only slightly by the number of adaptation utterances. This suggests that the second-order polynomial regression achieves an approximately optimal balance between the number of parameters and the approximation capability. For example, in the case of only one adaptation utterance, the WERs of MLSPR1 are 21.5%, 12.6% and 13.8% for white, pink and factory noise respectively, while the corresponding results of MLSPR2 are 10.8%, 6.2% and 6.7%. The absolute performance improvements of MLSPR2 over MLSPR1 are 10.7%, 6.4% and 7.1% for the three testing conditions, respectively.

Fig. 7. Performance comparison of MLLR, MLSW, MLSPR1 and MLSPR2 with various numbers of adaptation utterances in white noise environment.

Fig. 8. Performance comparison of MLLR, MLSW, MLSPR1 and MLSPR2 with various numbers of adaptation utterances in pink noise environment.

Fig. 9. Performance comparison of MLLR, MLSW, MLSPR1 and MLSPR2 with various numbers of adaptation utterances in factory noise environment.

Then we compare the performance of SS [1], MLLR, MLSW and MLSPR2. Fig. 10 illustrates the average WERs over the three types of noise (white, pink and factory) with different SNRs. In this experiment, ten adaptation utterances are used for MLLR, MLSW and MLSPR2. In the spectral subtraction (SS) method, the subtraction factor and flooring factor are set to 4.0 and 0.2, respectively. As shown in the figure, the conventional spectral subtraction method can effectively reduce the WERs at high SNR levels. However, it does not achieve a significant performance improvement at low SNR levels compared to MLLR and MLSPR2. Also, the performance of MLSW is lower than that of MLLR and MLSPR2 because the subband weighting cannot model the actual environmental mismatch effectively.

Fig. 10. Average WERs of SS, MLLR, MLSW and MLSPR2 over the three types of noise (white, pink, and factory) with different SNRs.

Furthermore, MLSPR2 is compared with SS, MLLR and MLSW with different SNRs for the band-limited white noise and the experimental results are illustrated in Fig. 11, where a bandpass filter (passband 1000–2000 Hz) is used for generating the band-limited white noise. From the figure, it can be seen that MLSW outperforms MLLR in the band-limited noise conditions because the cepstral linear regression cannot model the subband mismatch well. Comparing Figs. 10 and 11, we find that the performance improvements of MLSPR2 over the other methods are more significant in the band-limited noise conditions than in the broadband noise conditions. For example, in the case of 0 dB SNR, the WERs of SS, MLLR, MLSW and MLSPR2 are 51.3%, 31.8%, 26.4% and 8.6%, and the absolute performance improvement of MLSPR2 over MLSW is 17.8%. This shows that the subband polynomial regression is more effective for the subband mismatch than the other methods.

Fig. 11. Performance comparison of SS, MLLR, MLSW and MLSPR2 with different SNRs for the band-limited white noise.

Finally, we investigate the fast adaptation performance of the four algorithms at different SNRs. The recognition results are shown in Table 1, where only one utterance is used for adaptation. From these results, it can be seen that MLLR often degrades the performance of speech recognition systems in the case of sparse data, especially at higher SNRs, but MLSW, MLSPR1 and MLSPR2 can improve the recognition performance. Besides, we also observe that MLSPR2 consistently achieves the best performance over a range of SNRs for the three types of noise when only one utterance is available. Therefore, the second-order MLSPR algorithm is best for fast adaptation.

Table 1. WERs (%) of the baseline system, MLLR, MLSW, MLSPR1 and MLSPR2 with different SNRs for the three types of noise.

Noise type   Algorithm   SNR 0 dB   5 dB   10 dB   15 dB   20 dB
White        Baseline    90.1       78.7   41.7    17.0    6.6
White        MLLR        80.5       54.8   45.8    24.8    19.7
White        MLSW        72.4       49.7   31.7    15.1    6.4
White        MLSPR1      63.1       32.7   21.5    11.3    5.8
White        MLSPR2      39.6       22.8   10.8    6.7     4.3
Pink         Baseline    94.2       57.0   28.1    11.7    5.0
Pink         MLLR        68.1       46.9   34.3    27.2    24.6
Pink         MLSW        62.7       42.9   18.4    9.9     5.2
Pink         MLSPR1      45.2       21.8   12.6    8.0     5.1
Pink         MLSPR2      36.7       16.4   6.2     4.2     3.8
Factory      Baseline    90.9       54.2   23.8    10.4    4.8
Factory      MLLR        82.1       70.8   41.4    33.7    30.3
Factory      MLSW        77.0       44.8   19.4    9.9     5.5
Factory      MLSPR1      42.8       24.6   13.8    8.1     5.2
Factory      MLSPR2      30.4       14.3   6.7     5.0     4.4

5. Conclusions

Model adaptation is an effective approach to robust speech recognition, which adapts acoustic models to testing conditions. However, generally, only a small amount of adaptation data is available in real applications. Thus data sparseness is an outstanding and difficult problem to be solved for fast model adaptation. This paper proposes a model adaptation algorithm based on maximum likelihood subband polynomial regression, where the channels of the Mel filter bank are divided into several subbands and all the channels of each subband share the same polynomial transformation. The number of subbands is increased with the growth of adaptation data. Considering the performance and computational complexity, the second-order polynomial is best for our algorithm. The experimental results show that the proposed MLSPR algorithm overcomes the data sparseness problem well and requires only a small amount of adaptation data. Therefore, it is more useful for fast model adaptation.

Acknowledgements

This work is supported by the Fundamental Research Funds for the Central Universities (No. 2011B09214) and the National Natural Science Foundation of China (No. 60971098).

References

[1] Boll S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process 1979;27:113–20.
[2] Atal B. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J Acoust Soc Am 1974;55:1304–12.
[3] Hermansky H, Morgan N. RASTA processing of speech. IEEE Trans Speech Audio Process 1994;2:578–89.
[4] Erell A, Weintraub M. Filterbank-energy estimation using mixture and Markov models for recognition of noisy speech. IEEE Trans Speech Audio Process 1993;1:68–76.
[5] Acero A. Acoustical and environmental robustness in automatic speech recognition. Norwell: Kluwer Academic Publisher; 1993.
[6] Moreno PJ. Speech recognition in noisy environments, Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA; 1996.
[7] Ding GH. Maximum a posteriori noise log-spectral estimation based on first-order vector Taylor series expansion. IEEE Signal Process Lett 2008;15:158–61.
[8] Kim W, Hansen JHL. Feature compensation in the cepstral domain employing model combination. Speech Commun 2009;51:83–96.
[9] Sasou A, Asano F, Nakamura S, et al. HMM-based noise-robust feature compensation. Speech Commun 2006;48:1100–11.
[10] Gauvain JL, Lee CH. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans Speech Audio Process 1994;2:291–8.
[11] Leggetter CJ, Woodland PC. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 1995;9:171–85.
[12] Zhang Z, Furui S. Piecewise-linear transformation-based HMM adaptation for noisy speech. Speech Commun 2004;42:43–58.
[13] Chesta C, Siohan O, Lee CH. Maximum a posteriori linear regression for hidden Markov model adaptation. In: Proceedings of Eurospeech; 1999. p. 211–4.
[14] Gales MJF, Young SJ. Robust speech recognition in additive and convolutional noise using parallel model combination. Comput Speech Lang 1995;9:289–307.
[15] Kim D, Yook D. Fast channel adaptation for continuous density HMMs using maximum likelihood spectral transform. Electron Lett 2004;40:632–3.
[16] Kim D, Yook D. Linear spectral transformation for robust speech recognition using maximum mutual information. IEEE Signal Process Lett 2007;14:496–9.
[17] Li J, Deng L, Yu D, et al. High-performance HMM adaptation with joint compensation of additive and convolutive distortions via vector Taylor series. In: Proceedings of ASRU; 2007. p. 65–70.
[18] Bourlard H, Dupont S. A new ASR approach based on independent processing and recombination of partial frequency bands. In: Proceedings of ICSLP; 1996. p. 426–9.
[19] Hermansky H, Tibrewala S, Pavel M. Towards ASR on partially corrupted speech. In: Proceedings of ICSLP; 1996. p. 462–5.
[20] Bourlard H, Dupont S. Subband-based speech recognition. In: Proceedings of ICASSP; 1997. p. 1251–4.
[21] Tibrewala S, Hermansky H. Sub-band based recognition of noisy speech. In: Proceedings of ICASSP; 1997. p. 1255–8.
[22] Okawa S, Bocchieri E, Potamianos A. Multi-band speech recognition in noisy environments. In: Proceedings of ICASSP; 1998. p. 641–4.
[23] Zhu D, Nakamura S, Paliwal KK, et al. Maximum likelihood sub-band adaptation for robust speech recognition. Speech Commun 2005;47:243–64.
[24] Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc 1977;39:1–38.
