Digital Signal Processing 21 (2011) 54–65

Perceptual improvement of Wiener filtering employing a post-filter

Md. Jahangir Alam*, Douglas O'Shaughnessy
INRS-EMT, University of Quebec, Montreal QC H5A 1K6, Canada

Available online 18 April 2010

A major drawback of many speech enhancement methods in speech applications is the generation of an annoying residual noise with a musical character. Although the Wiener filter introduces less musical noise than spectral subtraction methods, such noise nevertheless exists and is perceptually annoying to the listener. A potential solution to this artifact is the incorporation of a psychoacoustic model in the suppression filter design. In this paper a frequency-domain optimal linear estimator with perceptual post-filtering is proposed, which incorporates the masking properties of the human hearing system to render the residual noise distortion inaudible. The proposed post-processing presents a modified way to measure the tonality coefficient and relative threshold offset for an optimal estimation of the noise masking threshold. The performance of the proposed enhancement algorithm is evaluated by the segmental SNR, Modified Bark Spectral Distortion (MBSD) and Perceptual Evaluation of Speech Quality (PESQ) measures under various noisy environments and yields better results compared to Wiener filtering based on the Ephraim–Malah decision-directed approach. © 2010 Elsevier Inc. All rights reserved.

Keywords: Speech enhancement; Perceptual post-filter; Musical critical band; MMSE; Modified masking threshold

1. Introduction

The performance of speech communication systems in applications such as hands-free telephony degrades considerably in adverse acoustic environments. The presence of noise can cause loss of intelligibility as well as listener discomfort and fatigue. Speech enhancement methods seek to improve the performance of these systems and to make the corrupted speech more pleasant to the listener. These methods are also useful in other applications such as automatic speech recognition. The removal of additive noise from speech has been an active area of research for several decades, and numerous methods have been proposed by the signal processing community. Among the most successful signal enhancement techniques have been spectral subtraction [1,2], Wiener filtering [3,4] and signal subspace methods [17]. Although these techniques improve speech quality, they suffer from an annoying residual noise known as musical noise. Tones at random frequencies, resulting from poor estimation of the signal and noise statistics, are at the origin of this artifact. The quality and intelligibility of the enhanced speech signal could be improved by reducing or, ideally, eliminating this kind of musical residual noise.

Many variations have been developed to cope with the musical residual noise phenomenon, including spectral subtraction techniques based on masking properties of the human hearing system. A number of methods have been developed to improve intelligibility by modeling several aspects of the enhancement function present in the auditory system [5–7,24,25]. These attractive methods use a noise masking threshold (NMT) as a crucial parameter to empirically adjust either thresholds

* Corresponding author. Fax: +1 514 875 0344. E-mail addresses: [email protected] (Md.J. Alam), [email protected] (D. O'Shaughnessy). doi:10.1016/j.dsp.2010.04.002


or gain factors. This approach is based on the fact that the human ear cannot perceive additive noise when the noise level falls below the NMT. Masking is the phenomenon whereby the perception of one sound is obscured by the perception of another. Masking occurs when two sounds occur at the same time or when they are separated by a small delay. The former is known as simultaneous masking (or frequency masking) and the latter as temporal masking (or non-simultaneous masking). Temporal masking can be further classified as forward masking and backward masking. In this paper, we have only considered simultaneous masking, i.e., a weak signal is made inaudible by a strong signal occurring simultaneously. This phenomenon is modeled via a noise-masking threshold, below which all spectral components are inaudible. The masking-based speech enhancement approach incorporates the noise masking properties into a speech enhancement algorithm.

It has been well established that noise masks tones more effectively than tones mask noise [26]. Researchers have suggested that the bandwidth and temporal characteristics of the target and masker contribute to this asymmetry. Excitation patterns (EP), which represent the output of the auditory filters, are intrinsically associated with auditory masking [27]; if the EP of a target signal falls below that of a masker, the target stimulus is no longer audible. Coding applications have used these properties to compress audio and suppress noise [11]. Specifically, the EP is calculated by convolving the basilar membrane spreading function with the critical band densities. The EP is adjusted in accordance with the notion that tones and noise are asymmetrical maskers. This adjustment is the "relative threshold offset" term. For a tone masking a noise, the EP is reduced by a factor of (14.5 + i) dB, where i is the critical band number. For noise masking a tone, the EP is reduced by a factor of 5.5 dB across critical bands.
These values are based on results from [28] and [26], respectively. Both calculations are scaled, based on the degree to which a signal is noise-like versus tone-like (tonality), by computing a spectral flatness measure (SFM). The SFM is the ratio of the geometric mean of the power spectrum to the arithmetic mean of the power spectrum. An SFM approaching 0 (a large negative value in dB) indicates that the signal is tone-like; an SFM approaching 1 indicates that the signal is noise-like.

In this paper, we have developed a post-processing method with a modified masking threshold to reduce the musical residual noise, in each critical band, generated by classical speech enhancement methods. The human ear has critical bands around each frequency and behaves like a bank of band-pass filters [8]. We tried to detect critical band (CB) musical noise for the CBs between 9 and 18, as the annoying musical noise is situated only in the frequency range between 1 kHz and 4 kHz [13,14]. For frequencies below 1 kHz the annoying musical noise is masked by the presence of real tones of clean speech [16]. The tonality coefficient in each critical band is utilized to characterize the residual musical noise. The total tonality coefficient used by Johnston gives only a general idea about the nature of the power spectrum. Again, since we only have the noisy speech signal, the preliminary estimate of the speech signal is usually not very accurate. Consequently, the estimation of the relative threshold offset, and hence of the noise masking threshold, is also affected. In order to estimate the noise masking threshold more accurately we propose to correct the relative threshold offset by merging both the fixed relative threshold offset [7] and the variable relative threshold offset [11,13]. Experimental results show that the proposed method outperforms the Wiener denoising method based on a decision-directed a priori SNR estimator.
This paper is organized as follows: Section 2 provides a description of the standard filtering approach for a speech enhancement system. In Section 3, descriptions of the a priori SNR estimation, noise estimation and the proposed method are given. Experimental results are discussed in Section 4, and conclusions are drawn in Section 5.

2. Standard filtering method

Let the noisy signal be expressed as

y(n) = x(n) + d(n),   (1)

where x(n) is the clean signal and d(n) is the additive random noise signal, uncorrelated with the original signal. Taking the FFT of the observed signal gives

Y(m,k) = X(m,k) + D(m,k),   (2)

where m = 1, 2, . . . , M is the frame index, k = 1, 2, . . . , K is the frequency bin index, M is the total number of frames, K is the frame length, and Y(m,k), X(m,k) and D(m,k) represent the short-time spectral components of y(n), x(n) and d(n), respectively. Basic speech enhancement methods estimate every frequency component X̂(m,k) of the clean speech by

X̂(m,k) = H(m,k) Y(m,k),   (3)

where H (m, k) is the noise suppression filter (denoising filter) chosen according to a suitable criterion. The error signal generated by this filter is

e(m,k) = X̂(m,k) − X(m,k) = [H(m,k) − 1] X(m,k) + H(m,k) D(m,k).   (3a)


The first term in Eq. (3a) describes the speech distortion caused by the spectral weighting that can be minimized using H (m, k) = 1. The second term in the above equation is the residual noise distortion that can be minimized if the spectral weighting H (m, k) = 0. Musical residual noise results from the pure tones present in the residual noise. In general, the noise suppression filter can be expressed as a function of the a posteriori SNR γ (m, k) and a priori SNR ξ(m, k) given by

γ(m,k) = |Y(m,k)|² / Γd(m,k),   (4)

ξ(m,k) = Γx(m,k) / Γd(m,k),   (5)

where Γd (m, k) = E {| D (m, k)|2 }, by definition, is the noise power spectrum, an estimate of which can be made easily during speech pauses and Γx (m, k) = E {| X (m, k)|2 }. The instantaneous SNR can be defined as

ϑ(m,k) = γ(m,k) − 1.   (6)

An estimate ξˆ (m, k) of ξ(m, k) is given by the well-known decision-directed approach [9] and is expressed as

ξ̂(m,k) = max{ α |H(m−1,k) Y(m−1,k)|² / Γd(m,k) + (1 − α) P[ϑ(m,k)], ξmin },   (7)

where P[x] = x if x ≥ 0 and P[x] = 0 otherwise. In this paper we have chosen α = 0.98 and ξmin = 0.0032 (i.e., −25 dB) by simulations and informal listening tests. Several variants of the noise suppression gain H(m,k) have been reported in the literature; here, without loss of generality, the gain function is chosen as the Wiener filter expressed as

H(m,k) = ξ(m,k) / (1 + ξ(m,k)).   (8)

There are many algorithms in the literature; it is extremely difficult if not impossible to find a universal analytical tool that can be applied to any speech enhancement algorithm. We choose the Wiener filter as the basis since it is the most fundamental approach, and many algorithms are closely connected with this technique. Moreover, the Wiener filter introduces less musical noise than spectral subtraction methods [9]. The temporal-domain denoised speech is obtained with the following relation







x̂(n) = IFFT{ X̂(m,k) · e^(j arg(Y(m,k))) }.   (9)
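The analysis-modification-synthesis pipeline of Eqs. (2)-(9) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the frame length, the Hamming window, and the assumption that the first few frames are speech-free (used here for the noise PSD Γd of Eq. (4)) are our own choices for the sketch.

```python
import numpy as np

def wiener_enhance(y, frame_len=256, noise_frames=10):
    """Sketch of Eqs. (2)-(9): STFT, decision-directed Wiener gain,
    noisy-phase resynthesis with 50% overlap-add."""
    hop = frame_len // 2
    win = np.hamming(frame_len)
    n_frames = (len(y) - frame_len) // hop + 1
    Y = np.array([np.fft.rfft(win * y[m*hop:m*hop+frame_len])
                  for m in range(n_frames)])
    # Noise PSD from the first few frames, assumed speech-free
    gamma_d = np.maximum(np.mean(np.abs(Y[:noise_frames])**2, axis=0), 1e-12)
    xhat = np.zeros(len(y))
    xi_prev = np.ones(gamma_d.shape)          # running a priori SNR seed
    alpha, xi_min = 0.98, 10**(-25/10)        # values quoted after Eq. (7)
    for m in range(n_frames):
        gamma = np.abs(Y[m])**2 / gamma_d                     # Eq. (4)
        xi = np.maximum(alpha * xi_prev
                        + (1 - alpha) * np.maximum(gamma - 1, 0), xi_min)  # Eq. (7)
        H = xi / (1 + xi)                                     # Eq. (8)
        X = H * np.abs(Y[m]) * np.exp(1j * np.angle(Y[m]))    # noisy phase, Eq. (9)
        xhat[m*hop:m*hop+frame_len] += win * np.fft.irfft(X, frame_len)
        xi_prev = H**2 * gamma        # |H(m,k) Y(m,k)|^2 / Γd for the next frame
    return xhat
```

The decision-directed recursion keeps the a priori SNR smooth across frames, which is precisely what limits the musical noise relative to plain spectral subtraction.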

Since human listeners are largely insensitive to phase and perceive speech primarily through the magnitude spectrum, we have used the noisy signal phase to obtain the temporal-domain denoised speech. Although Wiener filtering based on the decision-directed a priori SNR estimator reduces the level of musical noise, the noise nevertheless remains and is perceptually annoying.

3. Overview of the proposed method

The block diagram of Fig. 1 summarizes the different steps involved in the proposed speech enhancement method discussed in this section. Fig. 2 depicts the different steps of the proposed perceptual post-filter that takes advantage of the human auditory system to make the residual musical noise inaudible. The noisy signal is windowed using a Hamming window (32 ms duration) with 50% overlap and is converted into the frequency domain using the DFT to obtain the noisy speech amplitude and phase spectra. The noisy phase spectrum is used to obtain the temporal-domain enhanced signal.

3.1. Estimation of a priori SNR

An important parameter of numerous speech enhancement techniques is the a priori SNR. In the well-known decision-directed approach, the a priori SNR depends on the speech spectrum estimated in the previous frame [10], which results in degradation of the speech enhancement performance. In order to alleviate this problem while keeping its benefits, we have used the MMSE-based two-step a priori SNR estimation approach proposed in [18], which is expressed as

ξ̂MMSE = [ξ̂ / (1 + ξ̂)] ( 1 + [ξ̂ / (1 + ξ̂)] γ ),   (10)

where ξ̂ and γ are given by Eqs. (7) and (4), respectively.
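With H = ξ̂/(1 + ξ̂), Eq. (10) is simply H(Hγ + 1), so the refinement reduces to a couple of array operations. A minimal sketch (the function name is ours):

```python
import numpy as np

def a_priori_snr_mmse(xi_dd, gamma):
    """Eq. (10): refine the decision-directed estimate xi_dd using the
    MMSE relation xi_MMSE = H * (H * gamma + 1), where H = xi/(1+xi)."""
    xi_dd = np.asarray(xi_dd, dtype=float)
    H = xi_dd / (1.0 + xi_dd)     # Wiener gain built from the first-step estimate
    return H * (1.0 + H * np.asarray(gamma, dtype=float))
```

For example, with ξ̂ = 1 and γ = 2 the refined estimate is 0.5 · (1 + 0.5 · 2) = 1.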


Fig. 1. Block diagram of the perceptual Wiener denoising technique.

Fig. 2. Block diagram of the proposed perceptual post-filter.

3.2. Noise estimation Noise estimation is also an important factor in speech enhancement systems. In this paper the noise power spectrum is estimated during speech pauses using the following recursive relation [12]:

Γd(m,k) = λD Γd(m−1,k) + (1 − λD) W(m,k) |Y(m,k)|²   if W(m,k) > 0,
Γd(m,k) = Γd(m−1,k)   if W(m,k) = 0,   (11)

where λ D is a smoothing factor satisfying 0 < λ D < 1 and W (m, k) is the weighting factor on the noisy power spectrum. The weighting factor is designed so that it is almost inversely proportional to the estimated SNR (dB) given by



γ̃(m,k) = 10 log10( |Y(m,k)|² / Γd(m−1,k) ),   (12)

and the weighting factor is given by the following relation


W(m,k) = 1   if γ̃(m,k) ≤ 0,
W(m,k) = −(1/τ) γ̃(m,k) + 1   if 0 < γ̃(m,k) ≤ ε,
W(m,k) = 0   if γ̃(m,k) > ε,   (13)

where τ is a constant deciding the slope of (13) and ε is a threshold to eliminate an unreliable γ̃(m,k). In this paper we have used λD = 0.98, τ = 12, and ε = 6 on the basis of simulations.

3.3. Spectral weight calculation

The spectral weight H(m,k) for the Wiener denoising technique is given by (8). The spectral weight Hr(m,k) for the reference signal is computed from H(m,k) as

Hr(m,k) = H(m,k) + η/q   if H(m,k) + η/q ≤ 1,
Hr(m,k) = 1   otherwise,   (14)

where η and q are adjustment constants chosen experimentally; Eq. (14) shifts the Wiener denoising filter up by η/q. In this paper we have used η = 0.35 and 3 ≤ q ≤ 10. It is assumed that Hr(m,k) introduces minimal distortion and results in a reference signal that does not contain residual musical noise. The reference signal obtained using Hr(m,k) is used instead of the noisy speech signal to improve the accuracy of the musical noise detector.

3.4. Calculation of tonality coefficients

In the denoised signal obtained by subtractive methods, annoying musical tones appear in the power spectrum, which leads to an increase of the tonality coefficient. Thus it is possible to detect the presence or absence of musical tones by means of a tonality coefficient. The steps for calculating the tonality coefficient are taken from [11] and described below:

I. Frequency analysis of both signals (reference and denoised) along the critical bands (CB): Critical band analysis of a signal yields its power spectral density on a Bark scale. The Bark scale is the critical band scale simulating the human auditory system. The Bark frequency scale can be approximated as

i = 13 arctan(76 f / 100 000) + 3.5 arctan( (f / 7500)² ),   (15a)

where i is the critical band index and f is the linear frequency in Hz. Theoretically, the range of human auditory frequency spreads from 20 Hz to 20 kHz and covers approximately 25 Bark, as shown in Table 1. The power spectra of the denoised signal and of the reference signal are partitioned into critical bands. We have considered CBs between 0 kHz and 4 kHz, as we chose the Aurora database, with a sampling frequency of 8 kHz, for this experiment. In the frequency range between 0 and 4 kHz there are 18 CBs. In this paper we tried to detect CB musical noise for the CBs between 9 and 18 (i.e., i = 9, 10, 11, . . . , 18), as the annoying musical noise is situated only in the frequency range between 1 kHz and 4 kHz; for frequencies under 1 kHz it is masked by the presence of real tones of clean speech [16].

II. Calculation of the tonality coefficients: The tonality coefficient is measured using the ratio of the geometric mean (GM) and the arithmetic mean (AM) of the signal power spectrum, known as the spectral flatness measure (SFM). The SFM is used to determine whether the signal is tone-like or noise-like. The coefficient of tonality is expressed as

tc(i) = min( SFMdB(i) / (−60), 1 ),   (15)

where i is the CB index and SFMdB (i ) is given as



SFMdB(i) = 10 log10( GM(i) / AM(i) ),   (16)

where GM(i) = [ ∏_{j=1}^{M(i)} P(j) ]^{1/M(i)} and AM(i) = (1/M(i)) ∑_{j=1}^{M(i)} P(j), I is the number of critical bands, and M(i) and P(j) denote the number of frequency bins and the power spectral density, respectively, in each critical band i. A tonality coefficient of unity indicates that the signal is tone-like and a tonality coefficient close to zero indicates that the signal is noise-like.
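Eqs. (15)-(16) can be sketched per critical band as follows. The helper name and the `band_bins` argument (a precomputed list of DFT-bin index arrays, one per CB) are our own conventions for the sketch; the small floor on the power values simply guards the logarithm.

```python
import numpy as np

def tonality_per_band(power, band_bins):
    """Eqs. (15)-(16): spectral flatness (dB) and tonality coefficient per CB.
    power     : power spectrum of one frame
    band_bins : list of index arrays, one per critical band (precomputed)."""
    tc = []
    for bins in band_bins:
        p = np.maximum(power[bins], 1e-12)       # guard against log(0)
        gm = np.exp(np.mean(np.log(p)))          # geometric mean GM(i)
        am = np.mean(p)                          # arithmetic mean AM(i)
        sfm_db = 10.0 * np.log10(gm / am)        # Eq. (16)
        tc.append(min(sfm_db / -60.0, 1.0))      # Eq. (15)
    return np.array(tc)
```

A perfectly flat band (all bins equal) gives SFMdB = 0 and hence tc = 0 (noise-like); a band dominated by a single bin drives SFMdB far below −60 dB and clips tc to 1 (tone-like).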

Using (15) and (16) the tonality coefficients for the denoised signal tcd (i ) and that of the reference signal tcr (i ) are computed for each CB. Then, the tonality coefficient difference for each CB is given by

Δtc(i) = tcd(i) − tcr(i).   (17)


Table 1
List of critical bands.

Band No.   Lower freq. (Hz)   Center freq. (Hz)   Upper freq. (Hz)
1          0                  50                  100
2          100                150                 200
3          200                250                 300
4          300                350                 400
5          400                450                 510
6          510                570                 630
7          630                700                 770
8          770                840                 920
9          920                1000                1080
10         1080               1170                1270
11         1270               1370                1480
12         1480               1600                1720
13         1720               1850                2000
14         2000               2150                2320
15         2320               2500                2700
16         2700               2900                3150
17         3150               3400                3700
18         3700               4000                4400
19         4400               4800                5300
20         5300               5800                6400
21         6400               7000                7700
22         7700               8500                9500
23         9500               10 500              12 000
24         12 000             13 500              15 500
25         15 500             19 500              –
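The mapping of Eq. (15a) from linear frequency to Bark, and from Bark to the band numbers of Table 1, can be sketched as below. Taking the ceiling of the Bark value as the band index (band i covering the Bark interval (i−1, i]) is our reading of the table, not a convention stated in the text.

```python
import math

def bark(f):
    """Eq. (15a): linear frequency f (Hz) to Bark."""
    return 13.0 * math.atan(76.0 * f / 100_000.0) + 3.5 * math.atan((f / 7500.0) ** 2)

def critical_band(f):
    """Critical band index (1-25); band i assumed to cover Bark (i-1, i]."""
    return max(1, math.ceil(bark(f)))
```

For instance, 1000 Hz maps to roughly 8.5 Bark, i.e., band 9, whose edges in Table 1 are 920 and 1080 Hz.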

Fig. 3. Block diagram of the modified Johnston model for the masking threshold.

Musical residual noise appears in the ith CB if tcd(i) > tcr(i), and it becomes audible if Δtc(i) > T'(i), where T'(i) is the threshold for the ith CB, which depends on the order of the CB and the masking properties of the human ear. For the calculation of the threshold, we used a narrow-band noise and a sinusoidal signal and computed tonality coefficients for each of the signals. The differences between the tonality coefficients of the two signals have been taken as the thresholds T'(i). The threshold T'(i) is found to be approximately constant across all CBs, with T'(i) = 0.06 [13,14].

3.5. Modified relative threshold offset and noise-masking threshold computation

The noise-masking threshold is obtained by modeling the frequency selectivity of the human auditory system and its masking properties [15]. The steps for calculating the modified relative threshold offset and the modified masking threshold are shown in Fig. 3 and are described below:

I. Partition the signal power spectrum into CBs and add the energies E(i) within each CB.

II. Calculate the spread CB spectrum C(i) by convolving the spreading function SF(i) with the Bark spectrum E(i), in order to take into account the masking effect between different CBs.

III. In the Johnston model [11], an offset O(i) is determined according to the tonality coefficient and the CB order as





O(i) = tc(i)(14.5 + i) + (1 − tc(i)) 5.5 dB.   (18)

Fig. 4. Fixed relative threshold offset.

The total tonality coefficient used by Johnston gives a general idea about the nature of the power spectrum. In our context, we seek to detect musical noise in the selected CBs, i.e., between 9 and 18. The CB tonality coefficient of the denoised signal was taken into account when calculating the threshold offset O(i) in the proposed method. In order to calculate the modified relative threshold offset, a Boolean flag F(i) is constructed first, based on the tonality coefficient difference Δtc(i) and the threshold T'(i) over which an additive tone becomes audible in the presence of narrow-band noise. Δtc(i) and T'(i) were determined in the previous section. The Boolean flag F(i) indicates the presence or absence of CB musical noise and is given as

F(i) = 1   if Δtc(i) ≥ T'(i),
F(i) = 0   otherwise.   (19)

In the case of critical-band musical noise (F(i) = 1), the tonality coefficient used to calculate O(i) is close to one. However, for a better estimate of the masking threshold it should not be. We therefore propose to correct the tonality coefficient of each musical critical band by replacing it with the tonality coefficient of the corresponding critical band of the reference signal. Thus the corrected offset threshold for the modified Johnston model becomes [13,14]





OM'(i) = tcm(i)(14.5 + i) + (1 − tcm(i)) 5.5 dB,   (20)

where

tcm(i) = tcr(i)   for F(i) = 1,
tcm(i) = tcd(i)   for F(i) = 0.   (21)
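The flag, substitution, and offset steps can be sketched together as follows. This is a sketch under our reading of the text (reference-signal tonality substituted in the flagged musical bands); the function names are ours, and the second helper corresponds to the merge of Eq. (22) given below.

```python
import numpy as np

def corrected_offset(tc_d, tc_r, band_idx, T_prime=0.06):
    """Eqs. (19)-(21): flag musical CBs and compute the corrected offset O_M'(i) in dB.
    tc_d, tc_r : tonality coefficients of the denoised and reference signals
    band_idx   : the critical band numbers i (e.g. 9..18)."""
    tc_d = np.asarray(tc_d, float)
    tc_r = np.asarray(tc_r, float)
    band_idx = np.asarray(band_idx, float)
    F = (tc_d - tc_r) >= T_prime              # Eq. (19): CB musical-noise flag
    tc_m = np.where(F, tc_r, tc_d)            # Eq. (21): substitute reference tonality
    return tc_m * (14.5 + band_idx) + (1.0 - tc_m) * 5.5   # Eq. (20)

def merged_offset(O_Mp, O_F, beta=0.98):
    """Eq. (22): merge the modified and fixed relative threshold offsets."""
    return beta * np.asarray(O_Mp, float) + (1.0 - beta) * np.asarray(O_F, float)
```

For a band flagged as musical (tcd = 1, tcr = 0.1, i = 9), the corrected offset uses tcr, giving 0.1 · 23.5 + 0.9 · 5.5 dB instead of the full tone-like offset.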

Again, since we only have the noisy speech signal, the preliminary estimate of the speech signal is usually not very accurate. Consequently, the estimation of the relative threshold offsets obtained using (18) and/or (20) are also affected. This is because the residual noise of the preliminary estimated speech may severely change the original tonality of the speech signal [19]. As a result of the inaccurate relative threshold offset, the estimation error of the noise masking threshold increases. In [7], the fixed relative threshold offset is exploited and compensated with a slight modification by taking into account the tone-like nature of the musical noise for the CB i > 15. In this paper, we propose a modified relative threshold offset by merging both the fixed relative threshold offset O F (i ) (as shown in Fig. 4) and the modified relative threshold offset O M  (i ) (i.e., (20)) to achieve a more effective application of the masking properties. The final modified relative threshold offset, O M (i ), is expressed as

OM(i) = β OM'(i) + (1 − β) OF(i),   (22)

where β is a weighting constant, and O F (i ) is the fixed relative threshold offset in each CB as shown in Fig. 4. In this paper we have used β = 0.98, on the basis of simulations. The modified relative threshold offset is then subtracted from the spread CB spectrum to yield the spread threshold estimate T (i )

T(i) = 10^[log10(C(i)) − OM(i)/10].   (23)

IV. In the final step, the modified noise-masking threshold (NMT) is estimated as





NMT(i) = max{ Tq(i), T(i) },   (24)

where T q (i ) is the absolute threshold of hearing [14] given by

Tq(f) = 3.64 (f/1000)^(−0.8) − 6.5 e^(−0.6 (f/1000 − 3.3)²) + 10^(−3) (f/1000)⁴ dB SPL,   (25)

where f is the frequency in Hz. The absolute threshold of hearing is modeled by taking into consideration the transfer function of the outer ear and middle ear and the effect of the neural suppression in the inner ear.


Fig. 5. MBSD measures of the noisy signal and enhanced signals for (a) subway noise, (b) car noise, and (c) white Gaussian noise.

The proposed perceptual post-filtering technique described above aims to detect not only musical peaks but also larger intervals of 1 Bark where musical noise is dominant. In fact, it is important to recall that the human ear operates as a filter bank which subdivides the frequency axis into the so-called critical bands [15], frequency bands in which two sounds are perceptually heard as one sound with energy equal to the sum of the two sounds' energies. We have exploited this property to detect residual musical noise in a whole critical band instead of in separate tones, and we call it "critical band musical noise" [13,14]. Critical-band musical noise is detected if the estimated spectrum exceeds the modified masking threshold in the considered critical band, and it is eliminated by forcing the musical critical band spectrum under the estimated masking threshold. The perceptual post-filter provides a musical-tone-eliminated denoised spectrum, which is then recombined with the original noisy phase spectrum and converted back to a time-domain signal using an inverse DFT with the overlap-and-add (OLA) method. The phase of the input noisy signal is used for reconstruction of the estimated speech spectrum, based on the fact that, for human perception, the short-time spectral amplitude (STSA) is more important than the phase for intelligibility and quality.

4. Performance evaluation

In order to evaluate the performance of the proposed perceptual Wiener denoising technique, we conducted extensive objective quality tests under various noisy environments. The frames were chosen to be 256 samples (32 ms) long with 50% overlap; a sampling frequency of 8 kHz and a Hamming window were applied. To evaluate and compare the performance of the proposed perceptual Wiener denoising technique, we carried out simulations with the TEST A database


Fig. 6. (a) Segmental SNR and (b) PESQ measures of the enhanced and noisy signals when the signals are degraded with subway noise.

of the Aurora corpus [22]. Speech signals were degraded with three types of noise at global SNR levels of 0 dB, 5 dB, 10 dB and 15 dB: subway noise, car noise and white Gaussian noise. In this experiment we used a simple energy-based voice activity detector (VAD) for separating (speech + noise) and noise-only segments of the noisy speech signal. The objective quality measures used for the evaluation of the proposed speech enhancement method are the segmental SNR, MBSD and PESQ measures. It is well known that the segmental SNR indicates speech distortion more accurately than the overall SNR; a higher segmental SNR indicates weaker speech distortion [20]. The MBSD and PESQ measures prove to be highly correlated with subjective listening tests [21,23]. A higher PESQ score indicates better perceived quality, and a lower MBSD measure corresponds to better quality of the processed signal. Fig. 5 depicts the MBSD measures of the noisy signal and of the enhanced signals obtained using the Wiener denoising technique and the proposed method. These results demonstrate that the proposed method offers a better spectral approximation of the clean speech than does the Wiener filter. Figs. 6, 7, and 8 show the segmental SNR and the PESQ scores for the enhanced speech signals of the Wiener denoising technique and the proposed method when the speech signals are degraded with subway noise, car noise and white Gaussian noise, respectively. It is observed that the proposed approach yields a better segmental SNR than the Wiener denoising technique under all tested noisy environments. In the case of the PESQ measure, the proposed perceptual Wiener denoising technique also gives better PESQ scores than the Wiener denoising technique. The time-frequency distribution of signals provides more accurate information about the residual noise and speech distortion than the corresponding time-domain waveforms.
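As an illustration of the frame-based evaluation, segmental SNR can be computed as a frame-wise average as sketched below. The frame length, hop, and the common clamping of per-frame SNRs to [−10, 35] dB are illustrative choices, not necessarily the exact settings used in the experiments above.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, hop=128, limits=(-10.0, 35.0)):
    """Frame-wise SNR between the clean and enhanced signals, averaged over
    frames, with each per-frame SNR clamped to the given dB limits."""
    snrs = []
    for s in range(0, len(clean) - frame_len + 1, hop):
        x = np.asarray(clean[s:s+frame_len], float)
        e = x - np.asarray(enhanced[s:s+frame_len], float)   # residual error
        num = np.sum(x**2)
        if num > 0:                                          # skip silent frames
            snr = 10.0 * np.log10(num / max(np.sum(e**2), 1e-12))
            snrs.append(np.clip(snr, *limits))
    return float(np.mean(snrs))
```

The clamping keeps near-perfect or near-silent frames from dominating the average, which is why segmental SNR tracks perceived distortion better than a single overall SNR.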
We compared the spectrograms for each of the methods and confirmed a reduction of the residual noise and speech distortion. To contrast the performance of our proposed method with Wiener filtering without post-processing, we present the spectrograms of the enhanced signals under three different noisy conditions. Figs. 9, 10 and 11 show the spectrograms of the clean speech signal, the noisy signal and the enhanced speech signals obtained using the Wiener denoising technique (without post-processing) and the proposed perceptual technique when the signals are distorted with the three noises. The speech spectrograms presented in Figs. 9–11 use a Hamming window of 256 samples with 50% overlap; the signal-to-noise ratios were 0 dB, 0 dB and 5 dB, respectively. It is seen that the musical noise is almost entirely removed in Figs. 9(d), 10(d) and 11(d). The experimental results show that the proposed algorithm almost always achieves a better segmental SNR, PESQ score, MBSD measure and spectrogram appearance than Wiener filtering without perceptual post-filtering. The proposed technique is simple to implement, and the additional step of estimating the modified masking threshold does not add a demanding computational cost, as the algorithm works in the spectral domain using the DFT.


Fig. 7. (a) Segmental SNR and (b) PESQ measures of the enhanced and noisy signals when the signals are degraded with car noise.


Fig. 8. (a) Segmental SNR and (b) PESQ measures of the enhanced and noisy signals when the signals are degraded with white Gaussian noise.

Fig. 9. Speech spectrograms, car noise, SNR = 0 dB. (a) Clean signal, (b) noisy signal, (c) enhanced signal (Wiener filter), and (d) enhanced signal (proposed method).

5. Conclusion

The goal of our study was to develop a novel speech enhancement technique that maximizes noise attenuation while reducing speech distortion. In this paper we presented a new perceptual post-filter to improve the performance of a subtractive-type speech enhancement algorithm such as the Wiener filter. The proposed post-processing is based on the detection of musical critical bands using the corrected tonality coefficient and a modified noise masking threshold. In order to obtain a more accurate estimate of the masking threshold, a modified way to compute the relative threshold offset is presented. The proposed method has been shown to perform better than Wiener filtering without post-processing in all tested noisy conditions. It does not introduce additional speech distortion and results in a significant reduction of


Fig. 10. Speech spectrograms, Gaussian noise, SNR = 0 dB. (a) Clean signal, (b) noisy signal, (c) enhanced signal (Wiener filter), and (d) enhanced signal (proposed method).

Fig. 11. Speech spectrograms, subway noise, SNR = 5 dB. (a) Clean signal, (b) noisy signal, (c) enhanced signal (Wiener filter), and (d) enhanced signal (proposed method).

the musical phenomenon. Experimental results, i.e., the segmental SNR measure, PESQ scores and MBSD measure, plotted spectrograms for three different noisy environments, and informal listening tests confirm this conclusion.

References

[1] S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust. Speech Signal Process. 27 (April 1979) 113–120.
[2] M. Berouti, R. Schwartz, J. Makhoul, Enhancement of speech corrupted by acoustic noise, in: Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, vol. 1, Washington, DC, April 1979, pp. 208–211.
[3] H.L.V. Trees, Detection, Estimation, and Modulation: Part I – Detection, Estimation and Linear Modulation Theory, 1st ed., John Wiley and Sons, Inc., 1968.
[4] T.F. Quatieri, R.B. Dunn, Speech enhancement based on auditory spectral change, in: IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, Orlando, FL, USA, 2002, pp. 257–260.
[5] Y.M. Cheng, D. O'Shaughnessy, Speech enhancement based conceptually on auditory evidence, IEEE Trans. Signal Process. 39 (9) (1991) 1943–1954.
[6] D. Tsoukalas, M. Paraskevas, J. Mourjopoulos, Speech enhancement based on audible noise suppression, IEEE Trans. Audio Speech Process. 5 (6) (November 1997) 497–514.

Md.J. Alam, D. O’Shaughnessy / Digital Signal Processing 21 (2011) 54–65

65

[7] N. Virag, Single channel speech enhancement based on masking properties of the human auditory system, IEEE Trans. Speech Audio Process. 7 (2) (1999) 126–137. [8] E. Zwicker, H. Fastl, Psychoacoustics: Facts and Models, 2nd ed., Springer-Verlag, 1999. [9] Y. Ephraim, D. Mallah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimation, IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (6) (Dec. 1984) 1109–1121. [10] O. Cappe, Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor, IEEE Trans. Speech Audio Process. 2 (1) (April 1994) 345–349. [11] A.J. Accardi, R.V. Cox, A modular approach to speech enhancement with an application to speech coding, in: Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1999, pp. 201–204. [12] M. Kato, A. Sugiyama, M. Serizawa, Noise suppression with high speech quality based on weighted noise estimation and MMSE STSA, IEICE Trans. Fundament. E85-A (7) (July 2002) 1710–1718. [13] Md.J. Alam, S.A. Selouani, D. O’Shaughnessy, S. Ben Jebara, Speech enhancement using a Wiener denoising technique and musical noise reduction, in: Proceedings of INTERSPEECH’08, Brisbane, Australia, September 2008, pp. 407–410. [14] A. Ben Aicha, S. Ben Jebara, Perceptual musical noise reduction using critical bands tonality coefficients and masking thresholds, in: INTERSPEECH Conf., Antwerp, Belgium, August 2007, pp. 822–825. [15] J.D. Johnston, Transform coding of audio signals using perceptual noise criteria, IEEE J. Selected Areas Comm. 6 (Feb. 1988) 314–323. [16] A. Ben Aicha, S. Ben Jebara, D. Pastor, Speech denoising improvement by musical tones shape modification, in: International Symposium on Communication, Control and Signal Processing, ISCCSP, Morocco, 2006. [17] K. Hermus, P. Wambacq, H. Van Hamme, A review of signal subspace speech enhancement and its application to noise robust speech recognition, EURASIP J. Adv. Signal Process. 2007 (1) (January 2007) 195. 
[18] Md.J. Alam, D. O’Shaughnessy, S.A. Selouani, Speech enhancement based on novel two-step a priori SNR estimators, in: Proceedings of INTERSPEECH’08, Brisbane, Australia, September 2008, pp. 565–568. [19] C.H. You, S.N. Koh, S. Rahardja, Masking-based β -order MMSE speech enhancement, Speech Comm. 48 (2006) 57–70. [20] Y. Hu, P.C. Loizou, Evaluation of objective quality measures for speech enhancement, IEEE Trans. Audio Speech Language Process. 16 (1) (January 2008) 229–238. [21] S. Quackenbush, T. Barnwell, M. Clements, Objective Measures of Speech Quality, Prentice Hall, Englewood Cliffs, NJ, 1988. [22] H. Hirsch, D. Pearce, The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy environments, ISCA ITRW ASR, September 2000. [23] M. Dixon, W. Yang, R. Yantorno, Modified bark spectral distortion measure which uses noise masking threshold, in: IEEE Speech Coding Workshop, 1997, pp. 55–56. [24] P. Scalart, C. Beaugeant, V. Turbin, A. Gilloire, New optimal filtering approaches for hands-free telecommunication terminals, Signal Process. 64 (15) (Jan. 1998) 33–47. [25] W.H. Holmes, L. Lin, E. Ambikairajah, Speech denoising using perceptual modification of Wiener filtering, IEE Electron. Lett. 38 (Nov. 2002) 1486–1487. [26] R. Hellman, Asymmetry of masking between noise and tone, Percept. Psychophys. 11 (1972) 241–246. [27] B. Moore, Perceptual Consequences of Cochlear Damage, Oxford Psychology Series, 1995. [28] M. Schroeder, J. Hall, B. Atal, Optimizing digital speech coders by exploiting the masking properties of the human ear, JASA 6 (66) (1979) 1647–1652.
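To make two of the quantities named in the conclusion concrete, the sketch below shows the classical spectral-flatness-based tonality coefficient (Johnston's formulation, the baseline that the paper's corrected coefficient modifies) and the segmental SNR measure used in the evaluation. This is an illustrative sketch, not the paper's implementation: the frame length, the -60 dB SFM reference, and the [-10, 35] dB clipping range are conventional choices rather than values taken from this paper.

```python
import numpy as np

def tonality_coefficient(power_spectrum, sfm_db_max=-60.0):
    # Johnston-style tonality: the spectral flatness measure (SFM) is the
    # ratio of the geometric to the arithmetic mean of the power spectrum,
    # in dB (<= 0). It is mapped linearly onto [0, 1], where values near 1
    # indicate a tone-like band and values near 0 a noise-like band.
    ps = np.asarray(power_spectrum, dtype=float) + 1e-12  # guard against log(0)
    sfm_db = 10.0 * np.log10(np.exp(np.mean(np.log(ps))) / np.mean(ps))
    return float(min(sfm_db / sfm_db_max, 1.0))

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    # Frame-averaged SNR in dB; per-frame values are clipped to [lo, hi] dB,
    # a common convention that keeps silent frames from dominating the mean.
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n = (min(len(clean), len(enhanced)) // frame_len) * frame_len
    snrs = []
    for i in range(0, n, frame_len):
        s = clean[i:i + frame_len]
        err = s - enhanced[i:i + frame_len]
        snr = 10.0 * np.log10(np.sum(s ** 2) / (np.sum(err ** 2) + 1e-12) + 1e-12)
        snrs.append(np.clip(snr, lo, hi))
    return float(np.mean(snrs))
```

For example, a flat power spectrum gives a tonality coefficient near 0 (noise-like), while a spectrum concentrated in a single bin gives 1 (tone-like); a perfectly enhanced signal scores the clipping ceiling of 35 dB segmental SNR.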

Md. Jahangir Alam received his M.Sc. degree in Telecommunications Engineering from INRS-EMT, University of Quebec, Montreal, Canada, in December 2008. Prior to that, he received his B.Sc. degree in Electrical and Electronic Engineering from the Khulna University of Engineering and Technology (KUET), Bangladesh, in December 2002. He then joined the same university as a lecturer (March 2003–July 2006) and subsequently as an assistant professor (July 2006 to present), and held the position of Consultant and Research Testing Officer for more than three years before starting his M.Sc. degree at INRS-EMT. He is currently pursuing his Ph.D. degree in Telecommunications Engineering at INRS-EMT. His research interests include single- and multichannel statistical signal processing, binaural signal processing for hearing aids, speech enhancement algorithms, acoustical beamforming, speaker diarization and localization, and Bayesian speaker and channel adaptation for speech and speaker recognition.

Douglas O’Shaughnessy has been a professor at INRS-EMT (formerly INRS-Telecommunications), a constituent of the University of Quebec, in Montreal, Canada, since 1977. During this same period, he has also taught as an adjunct professor at McGill University in the Department of Electrical and Computer Engineering. For the periods 1991–1997 and 2001–present, he has also been Program Director of INRS-EMT. Dr. O’Shaughnessy has worked as a teacher and researcher in the speech communication field for more than 30 years. His interests and research include automatic speech synthesis, analysis, coding, enhancement, and recognition. His research team is currently working to improve various aspects of automatic voice dialogues. Dr. O’Shaughnessy was educated at the Massachusetts Institute of Technology, Cambridge, MA (B.Sc. and M.Sc. in 1972; Ph.D. in 1976).
He is a Fellow of the Acoustical Society of America (1992) and of the IEEE (2006), and was recently a member of the ASA Technical Committee on Speech. From 1995 to 1999 he served as an Associate Editor for the IEEE Transactions on Speech and Audio Processing, and he has been an Associate Editor for the Journal of the Acoustical Society of America since 1998. He also served as a member of the IEEE Technical Committee for Speech Processing during 1981–1985. Dr. O’Shaughnessy was the General Chair of the 2004 International Conference on Acoustics, Speech and Signal Processing (ICASSP) in Montreal, Canada. He recently finished a three-year elected term as a Member-at-Large of the IEEE SPS Board of Governors, and was a member of the IEEE SPS Conference Board. Dr. O’Shaughnessy has served on several Canadian research grant panels: for FCAR (Quebec, 1991–1995) and for NSERC (Natural Sciences and Engineering Research Council, 1995–2002). He has also served on organizing and technical committees for ICSLP and Eurospeech. Dr. O’Shaughnessy is the author of the textbook Speech Communications: Human and Machine (first edition in 1986 from Addison-Wesley; completely revised edition in 2000 from IEEE Press). In 2003, with Li Deng, he co-authored the book Speech Processing: A Dynamic and Optimization-Oriented Approach (Marcel Dekker Inc.). He has presented tutorials on speech recognition at ICASSP-96 in Atlanta, ICASSP-2001 in Orlando, ICC-2003 in Anchorage, and ICASSP-09 in Taipei. He has published more than 35 articles in the major speech journals, is a regular presenter at the major speech conferences Eurospeech and ICSLP, and has had papers at almost every ICASSP since 1986 (more than 150 conference papers).