Modified Wiener filtering


Signal Processing 86 (2006) 267–272 www.elsevier.com/locate/sigpro

Levent M. Arslan

Electrical & Electronics Engineering Department, 80815 Bebek, Istanbul, Turkey

Received 1 November 2000; received in revised form 2 January 2005; accepted 15 May 2005. Available online 21 July 2005.

Abstract

In this paper, a new Wiener filtering-based method for speech enhancement is described. The standard Wiener filtering formulation requires an iterative estimation of the clean speech spectrum. The proposed algorithm, in contrast, is non-iterative, so the computation is much faster. In addition, it employs a time-varying noise suppression factor based on the frame-by-frame SNR. This makes it possible to suppress those parts of the degraded signal where speech is unlikely to be present, while applying little suppression, and hence little distortion, to the speech segments. The algorithm is tested under simulated and actual car noise conditions and is shown to perform substantially better than the well-known spectral subtraction method in both subjective and objective speech quality evaluations. The proposed method also outperformed the well-known minimum mean-square error (MMSE) short-time spectral amplitude estimator in terms of subjective quality. In addition, the proposed method is shown to improve the robustness of a speech recognition system significantly.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Speech enhancement; Noise suppression; Time-varying SNR; Wiener filter

1. Introduction

[Footnote to the title: This work is based in part on Levent M. Arslan's work during his summer internship at Texas Instruments, Dallas, TX. Submitted Nov. 3, 2000 to Signal Processing; first revision March 27, 2001; second revision March 26, 2002. Tel.: +90 212 359 6421; fax: +90 212 287 2465. E-mail address: [email protected].]

There has been renewed interest in speech enhancement after recent developments in communications. Especially in applications where hands-free telephone usage is required (e.g., talking on the phone while driving a car), the level of acoustic background noise can be disturbing to the person on the other end of the line [1]. In such an environment, speech enhancement is desirable to reduce the amount of background noise, to improve the performance of a speech processing module such as a speech coder or a speech recognizer, and to reduce listener fatigue. Speech enhancement can be either single-channel or multi-channel. In single-channel enhancement, speech is available from only a single microphone, whereas multi-channel systems make

0165-1684/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2005.05.021


use of more than one microphone to better characterize and attenuate the noise. In this study we concentrate on single-channel systems. Single-channel enhancement methods can be divided into two main groups: (i) spectral subtraction-based methods [2–4] and (ii) Wiener filtering-based methods [5–11]. Most Wiener filtering-based algorithms are iterative, since an estimate of the clean speech power spectrum is required in the formulation, whereas spectral subtraction methods are non-iterative. Spectral subtraction methods are therefore computationally more attractive in practical applications. In this paper, a non-iterative Wiener filtering technique is described. The main advantage of the proposed method is that it makes use of a time-varying, signal-to-noise ratio (SNR) dependent noise suppression factor. This makes it possible to suppress those parts of the degraded signal where speech is unlikely to be present, while applying little suppression, and hence little distortion, to the speech segments. A non-iterative Wiener filter was also proposed in [11], where a decision-directed approach [6] was used. In this paper, however, we use a different SNR estimation method, as explained in the algorithm description section. The non-iterative Wiener filtering technique proposed in this paper produces enhanced speech that is significantly better than that of standard spectral subtraction [3]. The outline of the paper is as follows. First, the non-iterative modified Wiener filtering (MWF) method is explained in Section 2, and a step-by-step description of the algorithm is given in Section 3. In Section 4, evaluations of the proposed method by both subjective and objective measures are presented. Finally, Section 5 presents discussion and conclusions.

2. Modified Wiener filter

The basic idea behind modified Wiener filtering is to emphasize the frequencies where the speech signal is dominant over the noise signal, and to attenuate the frequencies where the speech signal is weak relative to the noise. Suppose that the noise is additive, i.e.,

    y(t) = s(t) + n(t),                                            (1)

where y(t) is the noisy speech, s(t) is the noise-free speech, and n(t) is the noise signal. Then, a generalized Wiener filter can be formulated as

    H(ω) = [ P̂_s(ω) / ( P̂_s(ω) + α P_n(ω) ) ]^β,                  (2)

where P̂_s(ω) is the clean speech power spectrum estimate, P_n(ω) is the noise power spectrum, α is the noise suppression factor, and β is the power of the filter (β = 0.5 in all our evaluations). This filter adjusts the amplitude at each frequency and preserves the original phase. Thus, we have

    Ŝ(ω) = H(ω) Y(ω),    ŝ(t) = F⁻¹{Ŝ(ω)},                         (3)
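As an illustrative sketch (not the author's implementation), the gain of Eq. (2) and its phase-preserving application in Eq. (3) can be written in Python with NumPy; the clean-speech and noise power spectra below are toy stand-ins:

```python
import numpy as np

def wiener_gain(ps_est, pn, alpha=1.0, beta=0.5):
    """Generalized Wiener gain of Eq. (2): H = (Ps / (Ps + alpha*Pn))**beta."""
    return (ps_est / (ps_est + alpha * pn)) ** beta

def apply_gain(y_frame, h):
    """Eq. (3): scale the magnitude of each bin, keep the noisy phase."""
    Y = np.fft.rfft(y_frame)                      # spectrum of the noisy frame
    return np.fft.irfft(h * Y, n=len(y_frame))    # real-valued inverse transform

# Toy usage: a noisy sine frame with an assumed flat noise-power spectrum.
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * 0.1 * np.arange(256)) + 0.1 * rng.standard_normal(256)
Py = np.abs(np.fft.rfft(y)) ** 2
Pn = np.full_like(Py, Py.mean() * 0.01)           # assumed noise power spectrum
h = wiener_gain(np.maximum(Py - Pn, 1e-12), Pn)
s = apply_gain(y, h)
```

Because 0 < H(ω) < 1 at every bin whenever P_n(ω) > 0, the output energy can never exceed the input energy, in contrast to a spectral-subtraction gain that must be truncated at zero.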

where Y(ω) is the Fourier transform of the noisy speech, and Ŝ(ω) is the estimate of the Fourier transform of the clean speech. In this formulation, it is assumed that we have an estimate of the clean speech power spectrum, P̂_s(ω). This estimate is calculated from the AR-smoothed spectrum of the noisy speech, P_y(ω), by only a DC gain modification (note that P̂_s(ω) retains the original shape of P_y(ω)):

    P̂_s(ω) = (ĝ_s² / ĝ_y²) P_y(ω),                                 (4)

where ĝ_s is the DC gain of the noise-free speech signal and ĝ_y is the DC gain of the noisy speech signal. If we assume that noise and speech are uncorrelated,

    P_y(ω) = P̂_s(ω) + P_n(ω).                                      (5)

If we integrate both sides of the equation over ω and use the expression for P̂_s(ω) stated in Eq. (4),

    ∫_{−π}^{π} P_y(ω) dω = (ĝ_s² / ĝ_y²) ∫_{−π}^{π} P_y(ω) dω + ∫_{−π}^{π} P_n(ω) dω.   (6)

Using Parseval's relation, the above equation can be simplified to

    ĝ_s² / ĝ_y² = (E_y − E_n) / E_y   if E_y > E_n,
                = 0                   otherwise,                    (7)
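The energy bookkeeping behind Eqs. (5)–(7) can be checked numerically. The sketch below is an illustration, not part of the original paper: it uses a synthetic tone as a stand-in for speech and white noise, so the speech–noise cross-term averages out over a long record:

```python
import numpy as np

# Numerical sanity check of Eqs. (5)-(7): for uncorrelated signals,
# energies (approximately) add, so the DC-gain ratio reduces to (Ey - En)/Ey.
rng = np.random.default_rng(1)
n_samples = 200_000
s = np.sin(2 * np.pi * 0.05 * np.arange(n_samples))   # stand-in "speech"
n = 0.5 * rng.standard_normal(n_samples)              # uncorrelated white noise
y = s + n

Es = np.sum(s ** 2)   # clean speech energy (unknown in practice)
En = np.sum(n ** 2)   # noise energy (estimated during speech pauses in practice)
Ey = np.sum(y ** 2)

# Eq. (5) integrated over frequency (Parseval): Ey ~ Es + En,
# up to the cross-term 2*sum(s*n), which is small relative to Ey here.
print(Ey, Es + En)

# Eq. (7): the ratio used to rescale Py(w) into a clean-speech estimate
# approximates the true speech fraction Es/Ey.
ratio = max(Ey - En, 0.0) / Ey
print(ratio, Es / Ey)
```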


where E_y is the noisy speech energy, E_n is the noise energy, and E_s denotes the noise-free speech energy. Substituting the expression for ĝ_s²/ĝ_y² into Eq. (4), the clean speech spectrum estimate becomes

    P̂_s(ω) = [(E_y − E_n) / E_y] P_y(ω).                           (8)

Using the above expression in Eq. (2), and introducing a time-dependent noise suppression factor α_t, we get

    H(ω) = { [(E_y − E_n)/E_y] P_y(ω) / ( [(E_y − E_n)/E_y] P_y(ω) + α_t P_n(ω) ) }^β.   (9)

After simplification, the above expression becomes

    H(ω) = { P_y(ω) / ( P_y(ω) + [E_y/(E_y − E_n)] α_t P_n(ω) ) }^β.   (10)

One desirable property would be to make α_t inversely dependent on the SNR (i.e., E_s/E_n), and to allow it to change from frame to frame. This ensures stronger suppression for noise-only frames and weaker suppression during speech segments, which are not corrupted as much to begin with. The desired SNR dependence is achieved simply by replacing α_t with (E_n/E_y) α_0. Then the expression for H(ω) becomes

    H(ω) = { P_y(ω) / ( P_y(ω) + [E_n/(E_y − E_n)] α_0 P_n(ω) ) }^β.   (11)

The value of α_0 can be increased to obtain higher noise suppression, and this does not produce fluctuations in the speech as much as it does in spectral subtraction. The reason is that H(ω) is always non-negative, whereas in spectral subtraction the equivalent filter may fluctuate between negative and positive values. Although the negative values are set to zero, sharp discontinuities remain, which cause a musical noise. The α_t value is calculated from the present frame and the past frame, and its minimum value is set to 1. The equation for α_t is then

    α_t = max(1, γ ᾱ_t + (1 − γ) α_{t−1}),                         (12)

where ᾱ_t denotes the current-frame value [E_n/(E_y − E_n)] α_0 before smoothing.


Ideally the γ value would be 1; however, a fast change in the value of α_t implies a fast change in the amount of suppression applied to the noisy speech signal. In order to reduce artifacts across frame boundaries, a smaller value of γ should be chosen. After running test simulations, a suitable value of γ was found to be 0.8. A further modification to the above algorithm is to clamp the lowest value of H(ω) to H_min. H_min may range from 0.10 to 0.31 on a linear scale, which corresponds to between 20 and 10 dB of attenuation on a log scale. Clamping the filter gain provides smoothing in the noise-only parts and reduces the audible artifacts. To further reduce discontinuities after clamping, a 5-point moving-average smoothing is applied to H(ω).
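A minimal sketch of the smoothing and clamping just described (the helper names are our own, and the smoothed term in Eq. (12) is read as the current-frame value before smoothing):

```python
import numpy as np

def suppression_factor(En, Ey, alpha0, alpha_prev, gamma=0.8):
    """Time-varying factor of Eqs. (11)-(12): SNR-dependent, smoothed
    across frames with gamma, and floored at 1."""
    alpha_now = (En / max(Ey - En, 1e-12)) * alpha0   # current-frame value
    return max(1.0, gamma * alpha_now + (1.0 - gamma) * alpha_prev)

def clamp_and_smooth(h, h_min=0.1):
    """Clamp the gain at H_min (0.10, i.e. about 20 dB of attenuation)
    and apply the 5-point moving-average window [0.1, 0.2, 0.4, 0.2, 0.1]."""
    h = np.maximum(h, h_min)
    win = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # weights sum to 1
    return np.convolve(h, win, mode="same")
```

Note that `mode="same"` leaves a slight roll-off at the first and last two bins of the gain curve; a practical implementation might pad the edges before convolving.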

3. Step-by-step algorithm description

The steps followed in the implementation of the algorithm are:

1. Take a frame of speech data (25 ms frame length with 12.5 ms skip length).
2. Apply a Hanning window to the frame.
3. Calculate the 8th-order LPC coefficients. (In informal subjective listening tests, an LPC order of 8 resulted in less distorted speech output than an LPC order of 10, probably because a lower LPC order allows a smoother transition of the time-varying filter between successive frames.) Take the discrete Fourier transform (DFT) of the LPC filter to find P_y(ω) in Eq. (11). If this is the first frame, let P_n(ω) = P_y(ω); otherwise update the noise spectrum according to P_y(ω) at each frequency as

       P_n(ω) = u P_n(ω)   if P_y(ω) > u P_n(ω),
              = d P_n(ω)   if P_y(ω) < d P_n(ω),
              = P_n(ω)     otherwise,                              (13)

   where u is set to 1.01 (3.5 dB/s) and d is set to 0.95 (−17.8 dB/s). A counter keeps track of the number of times the condition P_y(ω) > u P_n(ω) occurs successively. Whenever this number exceeds 75 frames (0.94 s), u is multiplied by 1.03 and the counter is set to zero. Whenever P_y(ω) < d P_n(ω) occurs, u is reset to its original value, 1.01. The adaptive update of u provides faster adjustment to abrupt increases in noise level.
4. Calculate α_t according to the rule stated in Eq. (12).
5. Calculate the Wiener filter gain for each frequency from Eq. (11).
6. Clamp the Wiener filter gain by H_min (between 10 and 20 dB of attenuation according to the desired level of noise suppression; 20 dB of attenuation means more suppression than 10 dB).
7. Smooth the clamped H(ω) with a moving-average window ([0.1, 0.2, 0.4, 0.2, 0.1]).
8. Calculate the other half of H(ω), for ω = 4–8 kHz, by symmetry.
9. Calculate the DFT of the noisy-speech signal, Y(ω).
10. Multiply Y(ω) by H(ω) at each frequency bin to get Ŝ(ω).
11. Take the inverse DFT of Ŝ(ω), and take the real part as the filtered speech.
12. Overlap-add the filtered speech from successive frames to reconstruct the enhanced signal.
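The steps above can be condensed into the following Python sketch. It is an approximation, not the paper's implementation: the raw periodogram stands in for the 8th-order LPC-smoothed spectrum of step 3, the adaptive-u counter logic is omitted, the real-input FFT makes the symmetry of step 8 implicit, and all parameter names are our own:

```python
import numpy as np

# Assumed sampling rate of 16 kHz: 25 ms frame, 12.5 ms skip.
FRAME, SKIP = 400, 200
U, D, H_MIN, BETA, ALPHA0 = 1.01, 0.95, 0.1, 0.5, 1.0
WIN = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # 5-point smoothing window

def mwf_enhance(y, gamma=0.8):
    """Per-frame modified Wiener filtering (sketch of steps 1-12)."""
    out = np.zeros_like(y)
    pn = None            # running noise power spectrum estimate
    alpha_prev = 1.0
    hann = np.hanning(FRAME)
    for start in range(0, len(y) - FRAME + 1, SKIP):
        frame = y[start:start + FRAME] * hann                # steps 1-2
        Y = np.fft.rfft(frame)                               # step 9
        py = np.abs(Y) ** 2          # stand-in for the LPC-smoothed spectrum
        if pn is None:                                       # first frame
            pn = py.copy()
        else:                                                # Eq. (13) update
            pn = np.where(py > U * pn, U * pn,
                          np.where(py < D * pn, D * pn, pn))
        ey, en = py.sum(), pn.sum()
        a_now = (en / max(ey - en, 1e-12)) * ALPHA0          # Eq. (11) factor
        alpha_t = max(1.0, gamma * a_now + (1 - gamma) * alpha_prev)  # Eq. (12)
        alpha_prev = alpha_t
        h = (py / (py + alpha_t * pn)) ** BETA               # step 5
        h = np.maximum(h, H_MIN)                             # step 6, clamp
        h = np.convolve(h, WIN, mode="same")                 # step 7, smooth
        out[start:start + FRAME] += np.fft.irfft(h * Y, n=FRAME)  # steps 10-12
    return out
```

On a stationary noise input the noise estimate tracks the periodogram, the suppression factor stays large, and the gain hugs H_MIN, so the output energy drops well below the input energy.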

4. Evaluations

For the performance evaluation of the proposed method, both subjective and objective tests were performed. We investigated the performance of the system under two types of noise: artificially added white Gaussian noise and actual car noise. Finally, we evaluated the performance of the system as a preprocessor to an automatic speech recognition system.

4.1. Artificially added noise

The database for the first experiment was prepared by adding white Gaussian noise to clean speech files (20 TIMIT database utterances). We used this database to compare the spectra of the clean, noisy and processed speech signals. The Itakura–Saito distance was used as the spectral match criterion. We compared the proposed method against a standard spectral subtraction algorithm. The implementation of the spectral subtraction was based on the theory in Boll's original paper [3]. However, in order to cope with non-stationary noise and to make a fair comparison with the proposed method, we applied the same adaptive noise spectrum estimation algorithm that we used for the MWF method (Eq. (13)). The results of the objective evaluation are shown in Table 1. Both methods reduce the spectral distance from the clean speech signal significantly. Standard spectral subtraction achieves a slightly better spectral match. However, standard spectral subtraction introduces a musical noise artifact which is disturbing to the ear.

Table 1
Itakura–Saito distance measure values between the original clean speech and the enhanced speech files for white Gaussian noise with 10 dB SNR

Processing type                    IS distance
Noisy speech                       0.355
Standard spectral subtraction      0.257
Modified Wiener filter             0.267

4.2. Actual car noise

In the second experiment, we used actual noisy speech data collected in a moving car. Twelve sentences (one sentence each from 6 male and 6 female speakers) collected in a car driven along a highway were used as the test database for the MWF method. The proposed method was again compared against standard spectral subtraction. Both objective (SNR) and subjective (MOS) tests were performed to compare the two algorithms. Since we did not have access to clean speech signals in this experiment, we could only use global SNR as an indicator of the degree of noise suppression. For noisy speech, the average SNR value was 14 dB, and when the speech was processed using standard spectral subtraction the SNR increased to 18 dB. When the MWF method was employed, the



SNR level increased to 23 dB, a significant improvement over standard spectral subtraction. We conducted MOS tests on the same 12 sentences using 15 listeners. The MOS scores are shown in Table 2. The average MOS score was 2.7 for noisy speech, 1.9 for traditional spectral subtraction, and 3.4 for the MWF method.

Table 2
Mean opinion scores on a scale from 1 to 5 over 12 sentences recorded in a car on a highway (6 male, 6 female) for three different conditions: (i) original noisy speech, (ii) enhanced speech using standard spectral subtraction and (iii) enhanced speech using the MWF method

Listener     Degraded   Spec. Sub.   MWF
Subject 1    2.333      1.833        3.667
Subject 2    2.333      2.917        3.667
Subject 3    1.917      2.750        2.833
Subject 4    2.667      2.000        3.083
Subject 5    3.083      2.000        2.583
Subject 6    3.417      2.167        4.083
Subject 7    3.000      1.750        3.333
Subject 8    3.250      2.500        3.667
Subject 9    3.083      1.917        3.917
Subject 10   3.000      1.083        3.583
Subject 11   2.500      1.333        3.417
Subject 12   2.400      1.214        3.364
Subject 13   2.667      1.000        3.750
Subject 14   1.917      1.000        2.417
Subject 15   3.000      2.333        4.083
Mean         2.704      1.853        3.430
St. Dev.     0.463      0.625        0.508

4.3. Simulated car noise

In the third experiment, we wanted to compare the performance of the proposed method to a well-known speech enhancement technique [6,7]. On a web page prepared by Gustafsson after his thesis [12], the utterance "I am often perplexed with rapid advances in state of the art technology" is degraded with simulated car noise at 5 dB SNR. We used the result of Gustafsson's implementation of Ephraim's MMSE technique and the result of our proposed method to test the performance of our algorithm. An MOS subjective listening test was performed with 15 listeners. The MOS scores and their standard deviations are shown in Table 3. The proposed method received a score of 3.67, while the score for the MMSE technique was 2.93.

Table 3
MOS scores for the enhancement of speech degraded with 5 dB car noise

Processing type              Mean MOS score   Standard deviation
Enhanced speech with MMSE    2.93             0.88
Enhanced speech with MWF     3.67             0.49

Table 4
Speech recognition rates of stock names for clean speech, speech degraded with 10 dB white Gaussian noise, and enhanced speech

Processing type              Recognition rate (%)
Normal speech                96.5
Degraded speech              83.3
Enhanced speech with MWF     92.0

4.4. Speech recognition experiment

As a last experiment, we investigated the performance of the system as a preprocessor to a speech recognition engine. The training database for the recognition experiment consisted of 300 stock names from the Istanbul Stock Exchange. We collected utterances of these 300 stock names spoken by 200 speakers of varying age, dialect and gender over various handsets. 160 speakers were used in the training set to train a triphone-based hidden Markov model speech recognition system. The recognition rate for the remaining 40 speakers was 96.5% (Table 4). When 10 dB white Gaussian noise was added to the recordings, the speech recognition rate dropped to 83.3%. After applying the MWF method to the degraded


recordings, the speech recognition rate increased to 92.0%, which indicates a significant improvement.

5. Conclusion and discussion

In this paper, a new method for high-quality speech enhancement has been described. The proposed method, MWF, uses an SNR-dependent noise suppression factor to apply more aggressive enhancement in non-speech intervals and milder filtering in speech segments. Using this factor within the Wiener filtering formulation results in very large SNR improvements without audible speech signal distortion. In addition, the method is non-iterative, unlike most other Wiener filter-based techniques, which makes it attractive for real-time implementation. The MWF method produced significantly better results in both subjective and objective tests than the traditional spectral subtraction method for actual car noise. The proposed method also outperformed the well-known MMSE technique in terms of subjective quality. In addition, the use of the proposed method as a preprocessor to a speech recognizer was demonstrated.

References

[1] C.T. Hemphill, R. Agarwall, Y.K. Muthusamy, Y. Gong, Voice-driven information access in the automobile, IEEE Vehicular Technology Society News, 2000, pp. 8–11.
[2] M. Berouti, R. Schwartz, J. Makhoul, Enhancement of speech corrupted by acoustic noise, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1979, pp. 208–211.
[3] S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust. Speech Signal Process. ASSP-27 (2) (1979) 113–120.
[4] R.J. McAulay, M.L. Malpass, Speech enhancement using a soft-decision noise suppression filter, IEEE Trans. Acoust. Speech Signal Process. ASSP-28 (2) (April 1980) 74–82.
[5] L.M. Arslan, J.H.L. Hansen, Minimum cost based phoneme class detection for improved iterative speech enhancement, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, Adelaide, Australia, April 1994, pp. 45–48.
[6] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (1984) 1109–1121.
[7] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. ASSP-33 (1985) 443–445.
[8] J.H.L. Hansen, L.M. Arslan, Markov model based phoneme class partitioning for improved constrained iterative speech enhancement, IEEE Trans. Speech Audio Process. 3 (1) (1995) 98–104.
[9] J.H.L. Hansen, M.A. Clements, Constrained iterative speech enhancement with application to speech recognition, IEEE Trans. Signal Process. 39 (4) (1991) 795–805.
[10] J.S. Lim, Speech Enhancement, Prentice-Hall, Englewood Cliffs, NJ, 1983.
[11] P. Scalart, J. Vieira Filho, Speech enhancement based on a priori signal to noise estimation, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, Atlanta, USA, May 1996, pp. 629–632.
[12] S. Gustafsson, Enhancement of audio signals by combined acoustic echo cancellation and noise reduction, Ph.D. Thesis, ABDN Band 11, P. Vary, Hrsg. Verlag der Augustinus Buchhandlung, Aachen, 1999.