Robust features for noisy speech recognition based on temporal trajectory filtering of short-time autocorrelation sequences


Speech Communication 28 (1999) 13–24

www.elsevier.nl/locate/specom

Kuo-Hwei Yuo, Hsiao-Chuan Wang *

Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30043, Taiwan

Received 9 September 1997; received in revised form 24 April 1998; accepted 1 December 1998

Abstract

This paper introduces a new representation of speech that is invariant to noise. The idea is to filter the temporal trajectories of short-time one-sided autocorrelation sequences of speech such that the noise effect is removed. The filtered sequences are denoted the relative autocorrelation sequences (RASs), and the mel-scale frequency cepstral coefficients (MFCC) are extracted from the RAS instead of the original speech. This new speech feature set is denoted RAS-MFCC. Experiments were conducted on a task of multispeaker isolated Mandarin digit recognition to demonstrate the effectiveness of RAS-MFCC features in the presence of white noise and colored noise. The proposed features are also shown to be superior to other robust representations and compensation techniques. © 1999 Published by Elsevier Science B.V. All rights reserved.

Keywords: Noisy speech recognition; Temporal trajectory filtering; Relative autocorrelation sequences

1. Introduction

It is well known that the performance of automatic speech recognition systems may drastically degrade in the case of a mismatch between training and test environments. This problem is inevitably encountered when speech recognizers are deployed in real environments, where noise or channel effects always exist. The goal of robust speech recognition is to improve the performance of speech recognizers across diverse environments. Environmental effects are often dominated by noise and short-time convolutional distortion.

* Corresponding author. Tel.: +886 3 574 2587; fax: +886 3 571 5971; e-mail: [email protected]

Short-time convolutional distortion appears as an additive bias to the speech signal in the logarithmic spectral and cepstral domains. Various techniques based on this property have been developed to suppress it. Cepstral mean subtraction or normalization (CMN) (Acero and Stern, 1991) removes the bias by subtracting the global average cepstral vector from each cepstral vector. Signal bias removal (SBR) (Rahim and Juang, 1996) uses a recursive procedure to remove the bias. Both methods assume that the short-time convolutional distortion is time-invariant. The RASTA (RelAtive SpecTrA) technique (Hermansky and Morgan, 1994) is another approach; its idea is to suppress constant factors in each spectral component, and it is effective in the suppression of additive and convolutional noise. Moreover, Nadeu and Juang (1994) and Nadeu et al. (1997) applied frequency analysis to time sequences of spectral parameters (TSSPs) and indicated that the use of properly filtered parameter sequences, with no supplementary parameters, results in improved recognition rates even for clean speech.

The techniques mentioned above benefit from the fact that the short-time convolutional distortion can be removed in the cepstral domain, which is also an effective way to represent speech features. In contrast, noise is additive in the power spectral domain, so one may expect to remove noise directly there. Spectral subtraction (Boll, 1979) is such a technique: it subtracts the DFT magnitude coefficients of the noise from those of the noisy speech. Nonlinear spectral subtraction (Lockwood and Boudy, 1992) is an improved version of this technique. Usually, the noise spectrum is estimated from a silence period. However, due to the random fluctuation of the noise, the subtraction process can produce negative values when the signal-to-noise ratio (SNR) is low, and special procedures are required to handle these negative values. With a concept similar to RASTA's temporal filtering in the logarithmic spectral domain, Hirsch et al. (1991) improved noisy speech recognition accuracy by using a high-pass filter on subband envelopes, and Avendano and Hermansky (1996) extended this temporal filtering method to DFT magnitude trajectories for the dereverberation of speech. However, the outputs of both temporal filtering methods may also take negative values. Methods that attempt to remove the noise in the power spectral domain are referred to as speech enhancement algorithms. One particular characteristic of these methods is that their output may be transformed back into the time domain and listened to by humans to evaluate the resulting speech quality.
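The negative-value issue can be seen in a minimal magnitude-domain subtraction sketch. This is an illustration, not the exact procedure of Boll (1979); the spectral floor and the toy signals are assumptions:

```python
import numpy as np

def spectral_subtract(noisy_frame, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from a noisy frame.

    Negative differences are clamped to a small fraction of the noisy
    magnitude -- the kind of special handling the text refers to. The
    floor value is an illustrative choice, not a tuned one.
    """
    spec = np.fft.rfft(noisy_frame)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # clamp negatives
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy_frame))

# Toy usage: a sine in white noise; the noise magnitude is estimated from a
# separate "silence" stretch, as is usual.
rng = np.random.default_rng(0)
n = np.arange(256)
noisy = np.sin(2 * np.pi * 0.1 * n) + 0.5 * rng.standard_normal(256)
noise_mag = np.abs(np.fft.rfft(0.5 * rng.standard_normal(256)))
enhanced = spectral_subtract(noisy, noise_mag)
```

Clamping to a fraction of the noisy magnitude, rather than to zero, is one common way to limit the artifacts that hard zeroing produces.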
Besides speech enhancement, there are two other important approaches to dealing with noise. One is the use of robust feature representations, which attempt to find characteristic representations of speech that are invariant or resistant to noise corruption. The short-time modified coherence

(SMC) (Mansour and Juang, 1989a,b) and one-sided autocorrelation LPC (OSALPC) (Hernando et al., 1994; Hernando and Nadeu, 1997) methods are typical examples. The other is to adapt the reference templates to include the effect of noise; parallel model combination (PMC) (Gales and Young, 1995) and the projection measure (Mansour and Juang, 1989a,b; Carlson and Clements, 1994) belong to this category. Furthermore, some techniques employ robust feature representations together with other noise compensation methods: cepstral-time matrices (Vaseghi and Milner, 1997), compensated by noise-adaptive HMMs, state-integrated Wiener filters and spectral subtraction respectively, have been well studied. While speech enhancement methods and robust feature representations may seem similar, they are not. Speech enhancement methods attempt to derive clean speech from noisy speech and then compute features from the cleaned speech, while robust feature representations attempt to produce reliable speech features directly from noisy speech. In this paper, we describe a robust feature representation. Our approach is also based on the idea of temporal filtering. However, the filtering is applied in neither the subband domain nor the DFT magnitude domain, where negative outputs are a problem; instead, it is performed in the autocorrelation domain. We derive new features that are robust not only to white noise but also to colored noise. The power spectrum has a corresponding time-domain signal, the autocorrelation sequence. When speech is corrupted by uncorrelated noise, the noise component is additive with the speech not only in the power spectral domain but also in the autocorrelation domain. In our approach, the noise is removed by filtering the temporal trajectories of the short-time one-sided autocorrelation sequences of speech. These filtered short-time sequences are denoted the relative autocorrelation sequences (RASs).
We regard the RASs as another short-time representation of speech, and propose that mel-scale frequency cepstral coefficients (MFCC) be extracted from the RAS instead of from the original speech. This new robust feature is denoted RAS-MFCC, and its delta coefficients are denoted delta-RAS-MFCC.


The remainder of the paper is organized as follows. Section 2 describes the mathematical foundations of the RAS. In Section 3, a procedure for computing RAS-MFCC is presented and compared with those for the SMC and OSALPC methods. In Section 4, we first evaluate various parameters used in RAS-MFCC. Then we present experimental results comparing the performance of RAS-MFCC with the SMC and OSALPC methods on white noise, and comparing RAS-MFCC with spectral subtraction, with nonlinear spectral subtraction, and with the projection measure on colored noise. In addition, we compare temporal filtering in the autocorrelation domain with filtering in the DFT spectrum, subband and logarithmic subband domains. Finally, Section 5 gives a summary and conclusions.

2. Trajectory filtering of autocorrelation sequences

In this paper, the focus is on the effects of additive noise on speech recognition. The noisy speech signal is blocked into M frames of N samples, and is modeled as

y(m, n) = x(m, n) + w(m, n),  0 ≤ m ≤ M − 1,  0 ≤ n ≤ N − 1,  (1)

where m is the frame index, n the discrete time index within a frame, x(m, n) the clean speech, y(m, n) the noisy speech and w(m, n) the noise. If the noise is uncorrelated with the speech, the autocorrelation of the noisy speech is the sum of the autocorrelation of the clean speech x(m, n) and the autocorrelation of the noise w(m, n):

r_yy(m, k) = r_xx(m, k) + r_ww(m, k),  0 ≤ m ≤ M − 1,  0 ≤ k ≤ N − 1,  (2)

where r_yy(m, k), r_xx(m, k) and r_ww(m, k) are the one-sided autocorrelation (OSA) sequences of the noisy speech, clean speech and noise, respectively, and k is the autocorrelation lag index. Moreover, if the noise is stationary, the autocorrelation sequences of the noise in all frames can be assumed to be identical, so r_ww(m, k) depends only on the lag k. Hence, we drop the index m of r_ww(m, k) and Eq. (2) becomes

r_yy(m, k) = r_xx(m, k) + r_ww(k),  0 ≤ m ≤ M − 1,  0 ≤ k ≤ N − 1.  (3)
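The additivity in Eqs. (2) and (3) is easy to check numerically: for uncorrelated signals the cross terms average out, so the autocorrelation of the sum is close to the sum of the autocorrelations. A sketch with toy signals (signal choice and lag count are illustrative):

```python
import numpy as np

def autocorr(sig, max_lag):
    """Biased autocorrelation r(k) = (1/N) * sum_j sig[j] * sig[j + k]."""
    N = len(sig)
    return np.array([np.dot(sig[:N - k], sig[k:]) / N for k in range(max_lag)])

rng = np.random.default_rng(1)
n = np.arange(4096)
x = np.sin(2 * np.pi * 0.05 * n)   # stand-in for a (long) speech frame
w = rng.standard_normal(4096)      # white noise, uncorrelated with x
r_sum = autocorr(x + w, 16)
r_parts = autocorr(x, 16) + autocorr(w, 16)
# r_sum is close to r_parts: the cross terms average out, as Eq. (2) assumes.
```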

Differentiating both sides of Eq. (3) with respect to the frame index m for all k yields

∂r_yy(m, k)/∂m = ∂r_xx(m, k)/∂m,  0 ≤ m ≤ M − 1,  0 ≤ k ≤ N − 1.  (4)

The sequence {∂r_yy(m, k)/∂m}, k = 0, ..., N − 1, is named the RAS of the noisy speech at the m-th frame. Eq. (4) demonstrates that, in each frame, the RAS of the noisy speech is equal to the RAS of the clean speech. This implies that the effect of the noise is removed. The RASs can be obtained by polynomial approximation in a manner similar to the derivation of delta cepstral coefficients from cepstral coefficients (Furui, 1986). Therefore, the RASs are approximated by

∂r_yy(m, k)/∂m ≈ (1/T_L) Σ_{t=−L}^{L} t · r_yy(m + t, k),  0 ≤ m ≤ M − 1,  0 ≤ k ≤ N − 1,  (5)

where

T_L = Σ_{t=−L}^{L} t²  (6)

and the summation range, t = −L, −L + 1, ..., L, is the frame range for the polynomial fitting. From the signal processing point of view, the operation of Eq. (5) can be interpreted as a filtering process on each temporal autocorrelation trajectory using an FIR filter with the transfer function

H(z) = (1/T_L) Σ_{t=−L}^{L} t · z^t.  (7)
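For L = 2, Eq. (6) gives T_L = 10 and Eq. (7) becomes a five-tap FIR filter with impulse response (2, 1, 0, −1, −2)/10 applied along the frame axis. Its coefficients sum to zero, which is why a frame-invariant noise term such as r_ww(k) is rejected. A minimal sketch (the toy trajectories are illustrative):

```python
import numpy as np

L = 2
T_L = sum(t * t for t in range(-L, L + 1))   # Eq. (6): T_L = 10 for L = 2
# Convolution kernel realizing Eq. (5) along the frame axis: (2, 1, 0, -1, -2)/10
taps = np.array([(L - i) / T_L for i in range(2 * L + 1)])

dc_gain = taps.sum()   # zero: a frame-invariant trajectory is rejected

# Toy autocorrelation trajectories r[m, k]: a varying "speech" part plus a
# constant "stationary noise" part, as in Eq. (3).
M, K = 40, 8
r_xx = np.sin(np.linspace(0.0, 6.0, M))[:, None] * np.ones(K)
r_ww = 0.7 * np.ones((M, K))
filt = lambda r: np.array([np.convolve(col, taps, mode="same") for col in r.T]).T
ras_noisy, ras_clean = filt(r_xx + r_ww), filt(r_xx)
# Away from the edge frames, ras_noisy equals ras_clean: Eq. (4) in action.
```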

Eq. (7) is a high-pass filter. When the temporal autocorrelation trajectory of the noise is a DC signal or a slowly varying signal, its effect is suppressed by this high-pass filter.

3. Procedure to compute RAS-MFCC

Because the RASs are invariant to noise, we may derive noise-robust features from them. We regard the RAS as another short-time representation of speech and compute MFCC from it. We call this new feature set RAS-MFCC. Delta features may be derived from the RAS-MFCC features in the usual manner (Furui, 1986). Our proposed method differs from the SMC and OSALPC methods as illustrated in Fig. 1. The procedure for computing RAS-MFCC is summarized as follows:

1. The original speech is segmented into overlapping frames of N samples each, and the N-point one-sided autocorrelation sequence of each frame is computed using the unbiased autocorrelation estimator

r_yy(m, k) = (1/(N − k)) Σ_{j=0}^{N−1−k} y(m, j) y(m, j + k),  0 ≤ k ≤ N − 1.  (8)
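Step 1, together with the trajectory filtering and mel-cepstrum steps described next, can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: the frame size, the filter-bank construction and the magnitude-spectrum choice are assumptions, and the autocorrelation is computed via the FFT for speed.

```python
import numpy as np

def unbiased_autocorr(frame):
    """One-sided unbiased autocorrelation, Eq. (8), via FFT.

    The zero-padded power spectrum gives the raw lag sums
    sum_j y[j] * y[j + k]; dividing by (N - k) makes the estimate unbiased.
    """
    N = len(frame)
    spec = np.fft.rfft(frame, n=2 * N)
    raw = np.fft.irfft(np.abs(spec) ** 2)[:N]
    return raw / (N - np.arange(N))

def ras_filter(osa, L=2):
    """Step 2: filter each lag's trajectory across frames with the Eq. (7) filter."""
    T_L = sum(t * t for t in range(-L, L + 1))
    taps = np.array([(L - i) / T_L for i in range(2 * L + 1)])  # (2,1,0,-1,-2)/10
    return np.apply_along_axis(lambda c: np.convolve(c, taps, mode="same"), 0, osa)

def mel_filterbank(n_filt, n_fft, sr):
    """A toy triangular mel filter bank (spacing details are illustrative)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def ras_mfcc(frames, sr=8000, n_filt=20, n_ceps=12):
    osa = np.array([unbiased_autocorr(f) for f in frames])    # step 1
    ras = ras_filter(osa)                                     # step 2
    N = frames.shape[1]                                       # step 3:
    spec = np.abs(np.fft.rfft(ras * np.hamming(N), axis=1))   # windowed RAS -> spectrum
    logmel = np.log(spec @ mel_filterbank(n_filt, N, sr).T + 1e-10)
    k = np.arange(n_filt)                                     # DCT-II -> cepstra
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), (2 * k + 1) / (2.0 * n_filt)))
    return logmel @ dct.T

frames = np.random.default_rng(2).standard_normal((30, 256))  # 30 frames, 32 ms at 8 kHz
feats = ras_mfcc(frames)  # one 12-coefficient vector per frame
```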

2. The RASs are obtained by processing all temporal trajectories of the one-sided autocorrelation coefficients with the FIR high-pass filter given by Eq. (7).

3. Each RAS is multiplied by an N-point Hamming window and processed with a simulated mel-scale filter bank. The log-energy outputs of the filter bank are then cosine transformed to produce the cepstral coefficients.

Two related robust feature representation methods, SMC and OSALPC, are based on AR modeling in the autocorrelation domain. SMC uses a coherence estimator with the zeroth-lag term set to zero and a spectral shaper to obtain a robust representation of speech. OSALPC uses the conventional biased estimator to compute one-sided autocorrelation sequences, sets the zeroth-lag term to half of its original value, and models the autocorrelation sequence by an AR process. Both techniques produce LPC parameters that are further transformed to cepstra. Fig. 1(b) and (c) show the block diagrams for calculating SMC cepstral coefficients (SMCC) and OSALPC cepstral coefficients (OSALPCC), respectively. These two methods are restricted to the case of additive white noise, whereas the proposed RAS-MFCC method can be applied to any type of stationary noise. RAS-MFCC is based on inter-frame processing to suppress the noise effect; SMC and OSALPC, in contrast, are based on intra-frame processing.

Fig. 1. Calculation of the (a) RAS-MFCC, (b) SMCC and (c) OSALPCC.

Fig. 2 compares these robust features for the same input with white noise added at different SNR levels. Fig. 2(a) shows that the traditional MFCC are severely influenced by white noise corruption. On the other hand, Fig. 2(b) shows the high robustness of RAS-MFCC, which is relatively unaffected by noise corruption even at low SNR. Fig. 2(c) and (d) show the feature variation for SMC and OSALPC, respectively. When the SNR is greater than 15 dB, both SMC and OSALPC are robust; as the noise power increases, however, both representations are affected. As mentioned above, RAS-MFCC is robust not only to white noise but also to colored noise. One can observe this property in Fig. 3. The traditional MFCC, shown in Fig. 3(a), are affected by colored noise such as F16 noise or factory noise, whereas the RAS-MFCC, shown in Fig. 3(b), are relatively unchanged under these types of colored noise corruption.

Fig. 2. Comparison of various feature types at different levels of SNR in white noise corruption: (a) effect of additive white noise on MFCC; (b) effect of additive white noise on RAS-MFCC; (c) effect of additive white noise on SMCC and (d) effect of additive white noise on OSALPCC.

Fig. 3. Comparison of MFCC and RAS-MFCC in colored noise corruption: (a) effect of additive colored noise on MFCC; (b) effect of additive colored noise on RAS-MFCC.

As far as computational load is concerned, RAS-MFCC requires the additional computation of autocorrelation sequences. Direct computation based on Eq. (8) is tedious, but the load can be reduced significantly by using the FFT and inverse FFT, exploiting the property that the power spectrum of a signal is equal to the spectrum of its biased two-sided autocorrelation sequence (Oppenheim and Schafer, 1989).

4. Experiments

This section presents the application of RAS-MFCC and delta-RAS-MFCC features to multispeaker isolated Mandarin digit recognition in the presence of noise. Both white noise and colored noise are considered. Four experiments are conducted. The first evaluates the parameters of RAS-MFCC. The second compares the performance of the proposed features with other representations in the presence of white noise. The third compares RAS-MFCC features with other noise compensation methods that use traditional MFCC features in the presence of colored noise. In the last experiment, various temporal filtering domains with various filters are compared.

4.1. Database and classifier

A Mandarin isolated-digit database, collected from 100 speakers (50 males and 50 females) at an 8 kHz sampling rate in a noise-free environment, served as the clean speech database. Three sessions were recorded; in each, every speaker uttered one repetition of the ten isolated digits. The first two sessions were used for training and the third for testing. The white Gaussian noise was artificially generated by computer, while the colored noise (factory, F16 and babble noise) was obtained from NOISEX-92. The noisy speech was generated by adding noise to the clean speech at a specified SNR. No pre-emphasis was performed. Each digit was modeled by a left-to-right HMM without skips; each HMM had seven to nine states depending on the duration of the digit. Each state was represented by a mixture of four Gaussian densities with a diagonal covariance matrix. The first and last states of each HMM were tied together as silence states. Note

that all the HMM models are trained on clean speech.

4.2. Parameter evaluation of RAS-MFCC

In the RAS-MFCC representation, we use a high-pass filter to filter the temporal trajectories of the one-sided autocorrelation sequences. This high-pass filter is given by Eq. (7), where L is a parameter of the filtering operation. The autocorrelation sequences are obtained from the unbiased autocorrelation estimator specified by Eq. (8); however, we are also interested in using a biased estimator. In this experiment, the influence of the parameter L and of the estimator type on RAS-MFCC is examined. For comparison, the performance of the MFCC features is also evaluated. The RAS-MFCC and MFCC features were computed using 32-ms frames with a 16-ms shift. For each speech frame, a 20-channel filter-bank spectrum with mel-scale frequency was obtained, and the log-energy outputs of the filter bank were transformed into a set of 12 cepstral coefficients.

Table 1 shows the recognition rates for RAS-MFCC features with various frame ranges and estimator types under white noise corruption; the recognition rate using MFCC features is also included. The recognition rate using MFCC features is seriously degraded by white noise, while the RAS-MFCC features are robust to it. For RAS-MFCC features using either the biased or the unbiased estimator, increasing L improves the recognition accuracy at low SNR (for example, 0 or 5 dB) but degrades it at high SNR (for example, clean or 20 dB). Therefore, the filter with L = 2 is used in subsequent experiments. We also find that the biased and unbiased estimators have almost the same performance at high SNR, but the unbiased estimator attains better recognition accuracy at low SNR. Therefore, the default estimator used in RAS-MFCC is the unbiased one.

Table 1. The recognition rates for RAS-MFCC at various parameters and comparison to MFCC with white noise corruption. (L = 1 corresponds to a three-frame window, L = 4 to a nine-frame window.)

                     RAS-MFCC, unbiased estimator     RAS-MFCC, biased estimator
SNR (dB)   MFCC      L=1    L=2    L=3    L=4         L=1    L=2    L=3    L=4
Clean      0.957     0.938  0.932  0.900  0.885       0.953  0.933  0.889  0.852
20         0.784     0.927  0.912  0.883  0.878       0.938  0.924  0.878  0.850
15         0.602     0.903  0.893  0.857  0.852       0.911  0.897  0.867  0.833
10         0.461     0.859  0.859  0.834  0.831       0.822  0.828  0.835  0.823
5          0.319     0.735  0.775  0.781  0.788       0.607  0.681  0.730  0.750
0          0.130     0.467  0.580  0.620  0.653       0.354  0.509  0.542  0.508

4.3. Comparison of robust features in white noise corruption

Three types of robust acoustic representations, SMCC, OSALPCC and RAS-MFCC, were tested under white noise corruption. The OSALPCC and SMCC features shown in Fig. 1 are derived via LP analysis. To make the comparison fair, we also derive OSA and SMC cepstral parameters on the mel-scale frequency axis instead of via LP analysis; these parameters are denoted OSA-MFCC and SMC-MFCC. The traditional MFCC and LPCC features are also included for comparison. A 64-ms frame was used to compute a 32-ms coherence sequence for the SMCC and SMC-MFCC features. Likewise, a 64-ms frame was used to compute a 32-ms biased autocorrelation sequence for the OSALPCC and OSA-MFCC features, and 32-ms frames were used to compute 32-ms unbiased autocorrelation sequences for RAS-MFCC. The LPCC and MFCC features were also computed using 32-ms frames. For all feature sets, a frame step of 16 ms was used. The AR-based

features, LPCC, SMCC and OSALPCC, were modeled with 12th-order LPC coefficients, which were converted to 12 cepstral coefficients. The MFCC, OSA-MFCC, SMC-MFCC and RAS-MFCC features were generated using a 20-channel mel-scale filter bank, whose log-energy outputs were transformed into a set of 12 cepstral coefficients. In addition, frame-differential features were computed over a window of five cepstral vectors as described in (Furui, 1986); these 12 delta cepstral coefficients can be appended to form a 24-element feature vector. Both cepstral features alone and cepstral features augmented with delta cepstral features were evaluated in this experiment. Table 2 shows the recognition rates using the different features for speech recognition in the presence of white noise corruption. The results using cepstral features are shown in Table 2(a) and Fig. 4(a). The recognition rates using LPCC and MFCC both degrade under white noise corruption, but MFCC is more robust than LPCC. All robust features perform better than the traditional LPCC and MFCC features in noise. SMC-MFCC improves over SMCC when the SNR is greater than 15 dB, especially on clean speech, and OSA-MFCC performs better than OSALPCC. Among the SMCC, SMC-MFCC, OSALPCC and OSA-MFCC features, SMC-MFCC attains the best performance when the SNR is greater than 15 dB, while OSA-MFCC attains the best performance when the SNR is less than 15 dB.
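The frame-differential (delta) features mentioned above can be sketched with the usual regression formula over a five-frame window (the edge-padding choice below is an illustrative assumption):

```python
import numpy as np

def delta(ceps, theta_max=2):
    """Delta coefficients over a (2*theta_max + 1)-frame window.

    d[t] = sum_theta theta * (c[t+theta] - c[t-theta]) / (2 * sum_theta theta^2);
    the first/last vectors are repeated at the edges (an illustrative choice).
    """
    denom = 2 * sum(th * th for th in range(1, theta_max + 1))
    pad = np.pad(ceps, ((theta_max, theta_max), (0, 0)), mode="edge")
    return np.array([
        sum(th * (pad[t + theta_max + th] - pad[t + theta_max - th])
            for th in range(1, theta_max + 1)) / denom
        for t in range(len(ceps))
    ])

ceps = np.random.default_rng(3).standard_normal((30, 12))   # toy cepstra
feats = np.hstack([ceps, delta(ceps)])                      # 24-element vectors
```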


Table 2. Comparison of recognition rates for the various feature types with white noise corruption

(a) Cepstral features

Feature type   Clean   20 dB   15 dB   10 dB   5 dB    0 dB
LPCC           0.951   0.703   0.517   0.378   0.276   0.130
MFCC           0.957   0.784   0.602   0.461   0.319   0.130
SMCC           0.884   0.876   0.852   0.775   0.578   0.432
SMC-MFCC       0.941   0.908   0.885   0.712   0.477   0.369
OSALPCC        0.859   0.845   0.807   0.649   0.517   0.356
OSA-MFCC       0.892   0.873   0.855   0.807   0.655   0.484
RAS-MFCC       0.932   0.912   0.893   0.859   0.775   0.580

(b) Cepstral and delta-cepstral features

Feature type   Clean   20 dB   15 dB   10 dB   5 dB    0 dB
LPCC           0.983   0.825   0.676   0.480   0.270   0.113
MFCC           0.985   0.871   0.780   0.621   0.373   0.108
SMCC           0.913   0.895   0.871   0.808   0.626   0.456
SMC-MFCC       0.962   0.935   0.913   0.836   0.641   0.458
OSALPCC        0.904   0.883   0.856   0.721   0.581   0.402
OSA-MFCC       0.950   0.936   0.923   0.903   0.815   0.618
RAS-MFCC       0.969   0.957   0.943   0.933   0.879   0.754

Fig. 4. Comparison of recognition rates for the various feature types with white noise corruption: (a) cepstral features and (b) cepstral and delta-cepstral features.

RAS-MFCC outperforms the other features in noise and performs well even in severe conditions such as 0 dB SNR; in clean conditions, however, RAS-MFCC is slightly less accurate than MFCC, LPCC and SMC-MFCC. Table 2(b) and Fig. 4(b) show the performance when both cepstral and delta-cepstral features are

used. The results are consistent with those obtained with cepstral features alone: RAS-MFCC still outperforms the other features. The inclusion of frame-differential features generally improves the recognition rate, with the improvement for RAS-MFCC ranging from 3.7% on clean speech to 17.4% at an SNR of 0 dB.


4.4. Comparison with other noise compensation techniques

This experiment compared RAS-MFCC with traditional MFCC paired with alternative noise-compensation techniques. Spectral subtraction (SS), nonlinear spectral subtraction (NSS) and the projection measure (PM) were evaluated, because these techniques are well known and have become almost standard for noisy speech recognition. For spectral subtraction, the procedure described in (Boll, 1979) was used; for nonlinear spectral subtraction, the procedure described in (Lockwood and Boudy, 1992) was used; and for the projection measure, a system based on (Carlson and Clements, 1994) was implemented. In this part of the experiment, cepstral and delta cepstral features were used. Note that the projection distance measure was applied only to the cepstral coefficients. The experiment was conducted on four types of noise: white, F16, factory and babble. The results are shown in Fig. 5, where the notations NC, SS, NSS and PM in parentheses denote no compensation, compensation by spectral subtraction, compensation by nonlinear spectral subtraction, and compensation by the projection measure, respectively. In almost all cases, MFCC(NSS) performs better than MFCC(SS). MFCC(NC) is always the worst, and RAS-MFCC achieves the best performance among all methods. This result shows that RAS-MFCC is robust to both white and colored noise corruption.

Fig. 5. Comparison of recognition rates for the various noise compensation methods with colored noise corruption: (a) white noise; (b) factory noise; (c) F16 noise and (d) babble noise.

4.5. Comparison with other temporal filtering techniques

Temporal filtering techniques have been applied in several domains: the subband domain (Hirsch et al., 1991) using a high-pass FIR filter, the logarithmic subband domain (the RASTA technique; Hermansky and Morgan, 1994) using a band-pass IIR filter, and the DFT spectrum domain (Avendano and Hermansky, 1996) using FIR filters. Hanson and Applebaum (1993) compared several filters applied in the logarithmic subband or cepstral domain for noisy-Lombard and channel-distorted speech. In this paper, the temporal filtering technique is applied in the autocorrelation domain using a high-pass FIR filter. We are interested in what happens when other filters are used to derive the RAS-MFCC representation, and when our high-pass FIR filter is used in the other temporal filtering domains. We therefore evaluate three filters, H_RAS, H_RASTA and H_Hirsch, denoting the RAS filter, the RASTA filter and Hirsch's filter, and apply each of them in the DFT spectrum, subband, logarithmic subband and autocorrelation domains. The three filters are

H_RAS(z) = (1/10)(2z^2 + z − z^-1 − 2z^-2),  (9)

H_RASTA(z) = z^4 (2 + z^-1 − z^-3 − 2z^-4) / (10 (1 − 0.98 z^-1))  (10)

and

H_Hirsch(z) = 1 − (Σ_{t=1}^{16} 0.94^t z^-t) / (Σ_{t=1}^{16} 0.94^t),  (11)

where H_RAS follows Eq. (7) with L = 2, H_RASTA follows the equation used in (Hermansky and

Morgan, 1994), and H_Hirsch follows the equation used in (Hanson and Applebaum, 1993). Temporal filtering in the autocorrelation domain is shown in Fig. 1(a), and the calculation of features based on temporal filtering in the DFT spectrum, subband and logarithmic subband domains is shown in Fig. 6. Note that the outputs in the filtered DFT spectrum and subband domains are post-processed by full-wave rectification (Avendano and Hermansky, 1996) to avoid negative values. In this part of the experiment, both cepstral and delta cepstral features were used. Table 3 shows the results of the three filters applied in the four domains under white or factory noise corruption. For comparison, traditional MFCC features, denoted "No filter", are also included. Table 3(a) and (b) are consistent. Compared with traditional MFCC features, temporal filtering in all four domains demonstrates robustness. Temporal filtering in the autocorrelation domain outperforms filtering in the DFT spectrum, subband and logarithmic subband domains in all cases, which suggests that the autocorrelation domain is a good choice for temporal filtering to improve robustness to noise corruption. When the three filters are applied in the autocorrelation domain, H_RAS performs better than H_RASTA and H_Hirsch under white noise corruption, and much better than both under factory noise corruption.

5. Conclusion

RAS-MFCC has been proposed as a new feature set for robust speech recognition. Experimental results demonstrate remarkable improvements using the proposed RAS-MFCC and delta-RAS-MFCC features over the traditional MFCC and delta-MFCC features. This approach also outperforms other robust representations, such as SMC and OSALPC, and other compensation techniques, such as spectral subtraction and the projection measure, in noisy speech recognition.
Since the autocorrelation sequences can be computed using the FFT and inverse FFT, the computational load is comparable to that of traditional MFCC


Fig. 6. Calculation of robust features based on temporal filtering in (a) the DFT-magnitude domain, (b) the subband domain and (c) the logarithmic subband domain.

Table 3. The recognition rates for various temporal filtering techniques

(a) With white noise corruption

           No       DFT spectrum               Subband                    Log subband                Autocorrelation
SNR (dB)   filter   H_RAS  H_RASTA  H_Hirsch   H_RAS  H_RASTA  H_Hirsch   H_RAS  H_RASTA  H_Hirsch   H_RAS  H_RASTA  H_Hirsch
Clean      0.985    0.986  0.987    0.980      0.989  0.976    0.987      0.975  0.986    0.987      0.969  0.938    0.964
20         0.871    0.941  0.953    0.908      0.910  0.895    0.907      0.913  0.900    0.915      0.957  0.933    0.940
15         0.780    0.870  0.892    0.832      0.849  0.844    0.834      0.859  0.836    0.855      0.943  0.916    0.923
10         0.621    0.745  0.796    0.753      0.719  0.711    0.683      0.718  0.659    0.688      0.933  0.893    0.873
5          0.373    0.476  0.665    0.517      0.531  0.533    0.507      0.522  0.371    0.465      0.879  0.860    0.752
0          0.108    0.172  0.462    0.238      0.319  0.378    0.341      0.327  0.243    0.281      0.754  0.758    0.484

(b) With factory noise corruption

           No       DFT spectrum               Subband                    Log subband                Autocorrelation
SNR (dB)   filter   H_RAS  H_RASTA  H_Hirsch   H_RAS  H_RASTA  H_Hirsch   H_RAS  H_RASTA  H_Hirsch   H_RAS  H_RASTA  H_Hirsch
Clean      0.985    0.986  0.987    0.980      0.989  0.976    0.987      0.975  0.986    0.987      0.969  0.938    0.964
20         0.960    0.968  0.971    0.965      0.954  0.897    0.948      0.932  0.944    0.962      0.963  0.930    0.963
15         0.863    0.917  0.917    0.900      0.913  0.813    0.890      0.884  0.885    0.912      0.950  0.920    0.950
10         0.605    0.747  0.778    0.722      0.807  0.683    0.789      0.744  0.681    0.792      0.925  0.888    0.916
5          0.322    0.540  0.551    0.471      0.552  0.476    0.551      0.554  0.392    0.479      0.874  0.773    0.831
0          0.146    0.258  0.344    0.261      0.322  0.331    0.334      0.263  0.208    0.191      0.688  0.568    0.554


and delta-MFCC. Moreover, it works not only for white noise corruption but also for colored noise corruption.

Acknowledgements

This research was partially sponsored by the National Science Council, Taiwan, ROC, under contract number NSC 86-2745-E007-010.

References

Acero, A., Stern, R.M., 1991. Robust speech recognition by normalization of the acoustic space. In: Proc. IEEE Internat. Conf. Acoust., Speech, Signal Process. '91, pp. 893–896.

Avendano, C., Hermansky, H., 1996. Study on the dereverberation of speech based on temporal envelope filtering. In: Proc. ICSLP '96, Vol. 2, pp. 889–892.

Boll, S.F., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust., Speech, Signal Processing 27, 113–120.

Carlson, A., Clements, M.A., 1994. A projection-based likelihood measure for speech recognition in noise. IEEE Trans. Acoust., Speech, Signal Processing 2 (1), part 1, 97–102.

Furui, S., 1986. Speaker-independent isolated word recognition based on emphasized spectral dynamics. In: Proc. IEEE Internat. Conf. Acoust., Speech, Signal Process. '86, Tokyo, April 1986, pp. 1991–1994.

Gales, M.J.F., Young, S.J., 1995. Robust speech recognition in additive and convolutional noise using parallel model combination. Computer Speech and Language, 289–307.

Hanson, B.A., Applebaum, T.H., 1993. Subband or cepstral domain filtering for recognition of Lombard and channel-distorted speech. In: Proc. IEEE Internat. Conf. Acoust., Speech, Signal Process. '93, Vol. 2, pp. 79–82.

Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Processing 2, 578–589.

Hernando, J., Nadeu, C., 1997. Linear prediction of the one-sided autocorrelation sequence for noisy speech recognition. IEEE Trans. Speech Audio Processing 5 (1), 80–84.

Hernando, J., Nadeu, C., Villagrasa, C., Monte, E., 1994. Speaker identification in noisy conditions using linear prediction of the one-sided autocorrelation sequence. In: Proc. ICSLP '94, Vol. 4, pp. 1847–1850.

Hirsch, H.G., Meyer, P., Ruehl, H., 1991. Improved speech recognition using high-pass filtering of subband envelopes. In: Proc. EUROSPEECH '91, Genova, pp. 413–416.

Lockwood, P., Boudy, J., 1992. Experiments with a nonlinear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars. Speech Communication 11, 215–228.

Mansour, D., Juang, B.H., 1989a. The short-time modified coherence representation and noisy speech recognition. IEEE Trans. Acoust., Speech, Signal Processing 37 (6), 795–804.

Mansour, D., Juang, B.H., 1989b. A family of distortion measures based upon projection operation for robust speech recognition. IEEE Trans. Acoust., Speech, Signal Processing 37 (11), 1659–1671.

Nadeu, C., Juang, B.H., 1994. Filtering of spectral parameters for speech recognition. In: Proc. ICSLP '94, Vol. 4, pp. 1927–1930.

Nadeu, C., Paches-Leal, P., Juang, B.-H., 1997. Filtering the time sequences of spectral parameters for speech recognition. Speech Communication 22 (4), 315–332.

Oppenheim, A.V., Schafer, R.W., 1989. Discrete-Time Signal Processing. Prentice Hall, Englewood Cliffs, NJ.

Rahim, M.G., Juang, B.H., 1996. Signal bias removal by maximum likelihood estimation for robust telephone speech recognition. IEEE Trans. Acoust., Speech, Signal Processing 4 (1), 19–30.

Vaseghi, S.V., Milner, B.P., 1997. Noise compensation methods for hidden Markov model speech recognition in adverse environments. IEEE Trans. Speech Audio Processing 5 (1), 11–21.