Digital Signal Processing 20 (2010) 1572–1578
Improved minima controlled recursive averaging technique using conditional maximum a posteriori criterion for speech enhancement ✩
Jong-Mo Kum, Yun-Sik Park, Joon-Hyuk Chang ∗
School of Electronic and Electrical Engineering, Inha University, Incheon 402-751, Republic of Korea
Article info
Article history: Available online 2 February 2010
Keywords: Speech enhancement; Minima controlled recursive averaging (MCRA); Conditional maximum a posteriori (CMAP)

Abstract
In this paper, we propose a novel approach to improve the performance of minima controlled recursive averaging (MCRA) based on a conditional maximum a posteriori (MAP) criterion. An investigation of the MCRA scheme shows that it cannot take full account of the inter-frame correlation of voice activity, since the noise power estimate is adjusted by a speech presence probability that depends on the observation in the current frame. To overcome this limitation, the proposed MCRA approach incorporates the conditional MAP criterion, in which the noise power estimate is obtained using the speech presence probability conditioned on both the current observation and the speech activity decision in the previous frame. Experimental results show that, compared to the conventional MCRA method, the proposed technique yields a lower estimation error and, when integrated into a speech enhancement system, achieves improved speech quality. © 2010 Elsevier Inc. All rights reserved.
1. Introduction

The robustness of a speech enhancement system is significantly affected by its capability to reliably track noise statistics under adverse environments involving nonstationary noise, weak speech components and low signal-to-noise ratio (SNR) [1–9]. One class of speech enhancement techniques is based on minimum statistics (MS), which tracks the minima of a smoothed power estimate of the noisy signal and multiplies the result by a factor that compensates for the bias. However, this noise estimate is sensitive to the minima tracking, which is its primary procedure. A computationally efficient minima tracking scheme was proposed by Doblinger [5], but this method is known to be slow in updating the noise estimate, especially in the case of a sudden change in the noise level [6]. More recently, one successful noise power estimator is the minima controlled recursive averaging (MCRA) technique, which obtains the noise power estimate by averaging past spectral power values using a smoothing parameter adjusted by the speech presence probability [10–13]. The resultant noise power estimate, which depends on the probability of speech presence, is known to be computationally efficient and robust with respect to noise environments. However, this method is sensitive to temporal variation since the presence of speech in each subband is inherently estimated from the ratio between the local energy and its minimum in a given frame. Recently, Shin et al. proposed a novel voice activity detection (VAD) technique that considers the inter-frame correlation of voice activity based on the conditional maximum a posteriori (MAP) criterion [9]. The conditional MAP scheme was originally motivated by the observation that the speech presence of a given frame is dependent not only
✩ This work was supported by an Inha University Research Grant.
∗ Corresponding author. E-mail address: [email protected] (J.-H. Chang).
1051-2004/$ – see front matter © 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.dsp.2010.01.011
J.-M. Kum et al. / Digital Signal Processing 20 (2010) 1572–1578
on the current observation but also on the voice activity in the previous frame. This method is considered efficient because it exploits the inter-frame correlation to substantially enhance the performance of the VAD [14,15].

In this paper, we propose an MCRA-based noise power estimation method incorporating the conditional MAP criterion for speech enhancement. To determine the speech presence probability of the MCRA method efficiently, we consider the inter-frame correlation of speech activity based on the conditional MAP, which depends on both the current observation of a given frame and the speech presence decision in the previous frame. Based on this, we first derive multiple thresholds for updating the noise power in the MCRA framework and further improve the performance by incorporating a soft decision scheme, which results in time-varying thresholds. The proposed technique is then applied to a speech enhancement system and evaluated with objective and subjective quality tests under various noise conditions.

2. Review of minima controlled recursive averaging technique

Let $y(n)$ denote a noisy speech signal that is the sum of a clean speech signal $x(n)$ and an uncorrelated additive noise signal $d(n)$, i.e., $y(n) = x(n) + d(n)$. Applying a discrete Fourier transform (DFT), we have in the time-frequency domain
$$Y(k,l) = X(k,l) + D(k,l) \tag{1}$$

where $k$ is the frequency bin index and $l$ is the frame index. Given two hypotheses, $H_0(k,l)$ and $H_1(k,l)$, which respectively indicate speech absence and presence, it is assumed that

$$\begin{aligned} H_0(k,l):&\quad Y(k,l) = D(k,l), \\ H_1(k,l):&\quad Y(k,l) = X(k,l) + D(k,l). \end{aligned} \tag{2}$$
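As a small numerical check of the additive model in (1), the sketch below transforms one frame of a synthetic signal. It is only an illustration: the helper name `frame_spectrum` and the random stand-in signals are assumptions, while the frame length of 104 samples (13 ms at 8 kHz) and the 128-point DFT follow the setup reported in Section 4.

```python
import numpy as np

def frame_spectrum(frame, n_fft=128):
    """Zero-pad one time-domain frame to n_fft samples and return its DFT (hypothetical helper)."""
    return np.fft.fft(frame, n=n_fft)

rng = np.random.default_rng(0)
x = rng.standard_normal(104)          # stand-in for a clean speech frame x(n)
d = 0.3 * rng.standard_normal(104)    # stand-in for an additive noise frame d(n)
y = x + d                             # noisy frame y(n) = x(n) + d(n)

Y = frame_spectrum(y)
X = frame_spectrum(x)
D = frame_spectrum(d)

# Because the DFT is linear, the additive model carries over to every bin k.
additive = np.allclose(Y, X + D)
```

Since the DFT is linear, `additive` holds regardless of the signals chosen, which is all that (1) states for a single frame.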
Letting $\lambda_d(k,l) = E[|D(k,l)|^2]$ be the variance of the noise, the MCRA method obtains the noise power estimate by applying a temporal recursive smoothing to the noise measurement during periods of speech absence as follows:
$$\begin{aligned} H'_0(k,l):&\quad \hat\lambda_d(k,l+1) = \alpha_d \hat\lambda_d(k,l) + (1-\alpha_d)|Y(k,l)|^2, \\ H'_1(k,l):&\quad \hat\lambda_d(k,l+1) = \hat\lambda_d(k,l) \end{aligned} \tag{3}$$

where $\alpha_d$ ($0 < \alpha_d < 1$) is a smoothing parameter. Also, $H'_0$ and $H'_1$ designate hypothetical speech absence and presence, respectively. In [10], a clear distinction is made between the hypotheses in (2), which are used for estimation of the clean speech, and the hypotheses in (3), which control adaptation of the noise statistics. In other words, deciding that speech is absent ($H_0$) when speech is present ($H_1$) is more destructive when estimating the signal than when estimating the noise power spectrum. Therefore, different decision rules are employed, in that $H_1$ is decided with a higher confidence than $H'_1$, i.e., $P(H_1(k,l)|Y(k,l)) \ge P(H'_1(k,l)|Y(k,l))$ [10]. In what follows, the prime is dropped and $P(H_1(k,l)|Y(k,l))$ denotes the a posteriori probability of speech presence that controls the noise adaptation. Then, (3) implies
$$\begin{aligned} \hat\lambda_d(k,l+1) &= \hat\lambda_d(k,l)\, P(H_1(k,l)|Y(k,l)) + \left[\alpha_d \hat\lambda_d(k,l) + (1-\alpha_d)|Y(k,l)|^2\right]\left[1 - P(H_1(k,l)|Y(k,l))\right] \\ &= \hat\alpha_d(k,l)\hat\lambda_d(k,l) + \left[1 - \hat\alpha_d(k,l)\right]|Y(k,l)|^2 \end{aligned} \tag{4}$$

where $\hat\alpha_d$ is a time-varying smoothing parameter that is adjusted by the speech presence probability as follows:

$$\hat\alpha_d(k,l) = \alpha_d + (1-\alpha_d)\, P(H_1(k,l)|Y(k,l)). \tag{5}$$
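The recursion in (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `update_noise_psd` and the array layout (one value per frequency bin) are assumptions, and the default `alpha_d = 0.95` is the value reported later in the experiments.

```python
import numpy as np

def update_noise_psd(noise_psd, noisy_spectrum, p_speech, alpha_d=0.95):
    """One frame of the noise update in (4)-(5) (hypothetical helper).

    noise_psd      : current estimate of the noise power, one value per bin
    noisy_spectrum : complex DFT coefficients Y(k, l)
    p_speech       : speech presence probability per bin, in [0, 1]
    """
    noise_psd = np.asarray(noise_psd, dtype=float)
    p_speech = np.asarray(p_speech, dtype=float)
    noisy_power = np.abs(noisy_spectrum) ** 2
    # Eq. (5): the smoothing factor approaches 1 as speech becomes more likely.
    alpha_hat = alpha_d + (1.0 - alpha_d) * p_speech
    # Eq. (4): recursive averaging with the time-varying smoothing factor.
    return alpha_hat * noise_psd + (1.0 - alpha_hat) * noisy_power

# Three bins in the same state but with different speech presence probabilities.
lam = np.array([1.0, 1.0, 1.0])
Y = np.array([2.0 + 0.0j, 2.0 + 0.0j, 2.0 + 0.0j])
p = np.array([0.0, 0.5, 1.0])
lam_next = update_noise_psd(lam, Y, p)
```

With a probability of 1 the estimate is frozen, and with a probability of 0 the update reduces to the speech-absence branch of (3); intermediate probabilities interpolate between the two.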
The conditional speech presence probability, $P(H_1(k,l)|Y(k,l))$, is calculated as

$$\hat P(H_1(k,l)|Y(k,l)) = \begin{cases} \alpha_p \hat P(H_1(k,l-1)|Y(k,l-1)) + (1-\alpha_p), & \text{if } S_r(k,l) > \delta, \\ \alpha_p \hat P(H_1(k,l-1)|Y(k,l-1)), & \text{otherwise} \end{cases} \tag{6}$$

where $\alpha_p$ ($0 < \alpha_p < 1$) is a smoothing parameter and $\delta$ is a threshold for deciding speech presence. Also, $S_r(k,l) = S(k,l)/S_{\min}(k,l)$ denotes the ratio between the local energy of the noisy speech, $S(k,l)$, and its determined minimum, $S_{\min}(k,l)$, for the current frame. To obtain $S(k,l)$, recursive averaging is employed such that

$$S(k,l) = \zeta_s S(k,l-1) + (1-\zeta_s)|Y(k,l)|^2 \tag{7}$$
where $\zeta_s$ ($0 < \zeta_s < 1$) is a smoothing parameter. In addition, to obtain the minimum value $S_{\min}(k,l)$ of the current frame, a samplewise comparison of the local energy and the local minimum value of the previous frame is performed [10] such that $S_{\min}(k,l) = \min\{S_{\min}(k,l-1), S(k,l)\}$. Here, the $S_r(k,l)$-based decision rule is inherently derived from a Bayes minimum-cost decision rule given by
$$\frac{P(S_r(k,l)|H_1(k,l))}{P(S_r(k,l)|H_0(k,l))} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{C_{10}\, P(H(k,l)=H_0)}{C_{01}\, P(H(k,l)=H_1)} \triangleq \alpha \tag{8}$$
where $P(H(k,l)=H_0) = 1 - P(H(k,l)=H_1)$ is the a priori probability of speech absence and $C_{ij}$ is the cost of deciding $H_i$ when $H_j$ is true. Since the likelihood ratio in (8) is a monotonic function of $S_r(k,l)$, the decision rule (8) can be expressed as $S_r(k,l) \underset{H_0}{\overset{H_1}{\gtrless}} \delta$, as in (6), in which $\delta$ is a given threshold.

3. Enhanced MCRA based on conditional MAP

In the previous section, we noted that the crucial parameter in the MCRA method is $P(H_1(k,l)|Y(k,l))$, which controls the smoothing parameter for the recursively averaged power spectrum of the noisy signal. However, $P(H_1(k,l)|Y(k,l))$ does not fully account for the inter-frame correlation, since the conditional probability of speech presence incorporating a Bayes decision rule is based simply on the current observation $Y(k,l)$. It is known that speech activities in adjacent frames are strongly correlated, i.e., $P(H(k,l)=H_1|H(k,l-1)=H_1) > P(H(k,l)=H_1)$, where $H(k,l)$ denotes the correct hypothesis at the $k$th frequency bin in the $l$th frame [9,16]. We therefore first consider the decision rule conditioned on both the current observation and the speech presence decision in the previous frame $(l-1)$ as follows:
$$\frac{P(H(k,l)=H_1|S_r(k,l), H(k,l-1)=H_i)}{P(H(k,l)=H_0|S_r(k,l), H(k,l-1)=H_i)} \underset{H_0}{\overset{H_1}{\gtrless}} \alpha, \quad i = 0, 1. \tag{9}$$
Using Bayes’ rule, this criterion results in the following likelihood ratio test (LRT) which is analogous to (8).
$$\frac{P(S_r(k,l)|H(k,l)=H_1, H(k,l-1)=H_i)}{P(S_r(k,l)|H(k,l)=H_0, H(k,l-1)=H_i)} \underset{H_0}{\overset{H_1}{\gtrless}} \alpha_i, \quad i = 0, 1 \tag{10}$$

where $\alpha_i = \alpha \dfrac{P(H(k,l)=H_0|H(k,l-1)=H_i)}{P(H(k,l)=H_1|H(k,l-1)=H_i)}$. It is assumed that speech presence of the current frame dominantly determines the distribution of the observed signal in the current frame [9]. Therefore (10) can be simplified as follows:
$$\frac{P(S_r(k,l)|H(k,l)=H_1)}{P(S_r(k,l)|H(k,l)=H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \alpha_i, \quad i = 0, 1. \tag{11}$$
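For reference, the step from the conditional MAP rule (9) to the thresholds $\alpha_i$ in (10) can be written out explicitly. The block below only restates that step using Bayes' rule; the arguments $(k,l)$ are abbreviated, with $H_i$ standing for the event $H(k,l-1)=H_i$, which is a notational shortcut introduced here.

```latex
\begin{align*}
\frac{P(H = H_1 \mid S_r, H_i)}{P(H = H_0 \mid S_r, H_i)}
  &= \frac{P(S_r \mid H = H_1, H_i)\, P(H = H_1 \mid H_i)}
          {P(S_r \mid H = H_0, H_i)\, P(H = H_0 \mid H_i)}
  \underset{H_0}{\overset{H_1}{\gtrless}} \alpha \\[4pt]
\Longleftrightarrow\quad
\frac{P(S_r \mid H = H_1, H_i)}{P(S_r \mid H = H_0, H_i)}
  &\underset{H_0}{\overset{H_1}{\gtrless}}
  \alpha\, \frac{P(H = H_0 \mid H_i)}{P(H = H_1 \mid H_i)} = \alpha_i ,
  \qquad i = 0, 1 .
\end{align*}
```

The common factor $P(S_r \mid H_i)$ cancels in the ratio, so only the conditional prior ratio moves to the right-hand side and scales the threshold.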
From this, we know that the threshold $\alpha_0 \left(= \alpha \dfrac{P(H(k,l)=H_0|H(k,l-1)=H_0)}{P(H(k,l)=H_1|H(k,l-1)=H_0)}\right)$ is used if absence of the speech signal is detected in the previous frame (i.e., $P(H(k,l-1)=H_0|Y(k,l-1)) \approx 1$), while the threshold $\alpha_1 \left(= \alpha \dfrac{P(H(k,l)=H_0|H(k,l-1)=H_1)}{P(H(k,l)=H_1|H(k,l-1)=H_1)}\right)$ is adopted otherwise. It should be noted that multiple thresholds can provide more accurate speech presence probabilities, producing an improved noise power estimator. Finally, the proposed method combines the above two decision rules using the soft decision scheme as follows:

$$S_r(k,l) \underset{H_0}{\overset{H_1}{\gtrless}} \eta \tag{12}$$
where $\eta = P(H(k,l-1)=H_1|Y(k,l-1))\, g(\alpha_0) + \left(1 - P(H(k,l-1)=H_1|Y(k,l-1))\right) g(\alpha_1)$ and $g(\cdot)$ is a mapping function between the LRT in (11) and $S_r(k,l)$ in (12). From (12), we can see that $g(\alpha_0)$ replaces $\eta$ in the case of speech presence at the previous frame and $g(\alpha_1)$ becomes more dominant as the speech presence probability of the previous frame decreases. This approach is desirable because it makes full use of the soft decision. In this regard, Fig. 1 suggests that the resulting speech presence probability is more accurate than that of the conventional method. Likewise, Fig. 2, which shows a typical example of the estimated noise power contour, indicates that the proposed method offers a more accurate noise power estimate than the previous method.

As a main application of the proposed technique, we adopt a speech enhancement algorithm based on the minimum mean-square error (MMSE) criterion as follows [1,8]:
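$$\hat X(k,l) = G(\xi(k,l), \gamma(k,l))\, Y(k,l) \tag{13}$$

where $\hat X(k,l)$ is the estimated clean speech spectrum and $G(\cdot)$ indicates the noise suppression gain. Also, $\gamma(k,l)$ is the a posteriori SNR and $\xi(k,l)$ is the a priori SNR, defined by

$$\gamma(k,l) \equiv \frac{|Y(k,l)|^2}{\lambda_d(k,l)}, \tag{14}$$

$$\xi(k,l) \equiv \frac{\lambda_x(k,l)}{\lambda_d(k,l)}, \tag{15}$$

with $\lambda_x(k,l) = E[|X(k,l)|^2]$ denoting the variance of the clean speech.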
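The two SNR definitions in (14) and (15) amount to element-wise ratios per frequency bin. The sketch below is an illustration under stated assumptions: the helper name `snr_pair` and the small `floor` guarding against division by zero are not from the paper, and since this section does not say how $\lambda_x(k,l)$ is obtained, it is simply passed in as an argument.

```python
import numpy as np

def snr_pair(noisy_spectrum, noise_var, speech_var, floor=1e-12):
    """Return (gamma, xi) per bin following (14) and (15) (hypothetical helper).

    noisy_spectrum : complex DFT coefficients Y(k, l)
    noise_var      : noise variance lambda_d(k, l), or its estimate
    speech_var     : clean speech variance lambda_x(k, l), assumed to be given
    floor          : assumed lower bound on noise_var to avoid division by zero
    """
    noise_var = np.maximum(np.asarray(noise_var, dtype=float), floor)
    gamma = np.abs(noisy_spectrum) ** 2 / noise_var       # a posteriori SNR, eq. (14)
    xi = np.asarray(speech_var, dtype=float) / noise_var  # a priori SNR, eq. (15)
    return gamma, xi

gamma, xi = snr_pair(
    noisy_spectrum=np.array([3.0 + 4.0j, 1.0 + 0.0j]),
    noise_var=np.array([5.0, 2.0]),
    speech_var=np.array([10.0, 1.0]),
)
```

In the enhancement system, the noise variance passed in would be the estimate produced by the recursion in (4).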
Here, the MMSE noise suppression gain is given by
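$$G(\hat\xi(k,l), \hat\gamma(k,l)) = \frac{\sqrt{\pi\,\nu(k,l)}}{2\hat\gamma(k,l)} \exp\left(-\frac{\nu(k,l)}{2}\right) \left[ (1+\nu(k,l))\, I_0\!\left(\frac{\nu(k,l)}{2}\right) + \nu(k,l)\, I_1\!\left(\frac{\nu(k,l)}{2}\right) \right] \tag{16}$$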
Fig. 1. Comparison of the speech presence probability (k = 5, around 310 Hz) under F16 noise (SNR = 10 dB). (a) Clean speech waveform. (b) Speech presence probability in short-time frames: probability of conventional MCRA (dashed line), probability of proposed algorithm (bold line).
Fig. 2. Comparison of noise power estimation (k = 5, around 310 Hz) under F16 noise (SNR = 10 dB). (a) Actual noise power (solid line), estimated noise power based on MCRA (dashed line), estimated noise power based on proposed algorithm (bold line). (b) Noisy speech. (c) Clean speech.
Table 1
Relative estimation error obtained from the MCRA and proposed method.

Noise     Method      SNR (dB)
                      5        10       15
White     MCRA        0.379    0.475    0.578
White     Proposed    0.370    0.384    0.421
Babble    MCRA        0.779    0.786    1.486
Babble    Proposed    0.680    0.702    0.730
F16       MCRA        0.470    0.620    1.444
F16       Proposed    0.379    0.367    0.432
Office    MCRA        0.532    0.608    0.591
Office    Proposed    0.505    0.593    0.582
Street    MCRA        0.965    0.816    0.984
Street    Proposed    0.739    0.715    0.723
in which $I_0$ and $I_1$ are the modified Bessel functions of zero and first order. Also, $\nu(k,l)$ is defined based on (4), (14) and (15) as given by

$$\nu(k,l) = \frac{\hat\xi(k,l)}{1 + \hat\xi(k,l)}\, \hat\gamma(k,l) \tag{17}$$
where $\hat\xi(k,l)$ and $\hat\gamma(k,l)$ are obtained by substituting the estimator $\hat\lambda_d(k,l)$ resulting from (4) for $\lambda_d(k,l)$.

4. Experiments and results

The proposed approach was adopted for the MCRA-based speech enhancement method as given in [1,8,10] and was evaluated with a quantitative comparison and a subjective quality test under various noise environments. One hundred test phrases from the NTT database, spoken by four male and four female speakers, were used as experimental data. Each phrase consisted of two different meaningful sentences and lasted 8 s. The MCRA-based noise power estimation was performed with a trapezoidal window of length 13 ms for each frame of 10 ms at a sampling frequency of 8 kHz. Subsequently, each frame of the windowed signal was transformed to the corresponding spectrum through the 128-point DFT after zero-padding. Three types of noise from the NOISEX-92 database (white, babble and F16) were added to the clean speech waveform at SNRs of 5, 10 and 15 dB. In all cases, speech enhancement was conducted with parameter values optimized experimentally over a variety of the experimental data: $\alpha_d = 0.95$, $\alpha_p = 0.2$, $\zeta_s = 0.45$.

To determine the parameters $\alpha_0$ and $\alpha_1$, we investigated the joint probabilities $P(H(l)=H_0, H(l-1)=H_0)$ (= 39.1%), $P(H(l)=H_0, H(l-1)=H_1)$ (= 1.2%), $P(H(l)=H_1, H(l-1)=H_0)$ (= 1.2%) and $P(H(l)=H_1, H(l-1)=H_1)$ (= 58.5%) based on speech material with a total length of 456 s. Accordingly, $\alpha_0$ and $\alpha_1$ were chosen experimentally to yield an effective conditional MAP, such that $\alpha_0 = 9$ and $\alpha_1 = 3.5$. Also, we incorporated two additional noise types, office and street, to demonstrate that the selected parameters are not biased toward the noises used for the parameter setting.
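The joint probabilities quoted above determine the conditional prior ratios that multiply $\alpha$ in the definitions of $\alpha_0$ and $\alpha_1$. The following sketch works through that arithmetic; the dictionary layout and the helper names `conditional` and `prior_ratio` are assumptions made for illustration.

```python
# Joint probabilities P(H(l) = a, H(l-1) = b) reported in the text, keyed as (a, b).
joint = {(0, 0): 0.391, (0, 1): 0.012, (1, 0): 0.012, (1, 1): 0.585}

def conditional(joint, current, previous):
    """P(H(l) = current | H(l-1) = previous) computed from the joint table."""
    marginal_prev = joint[(0, previous)] + joint[(1, previous)]
    return joint[(current, previous)] / marginal_prev

def prior_ratio(joint, previous):
    """P(H(l)=H0 | H(l-1)=H_i) / P(H(l)=H1 | H(l-1)=H_i), the factor multiplying alpha."""
    return conditional(joint, 0, previous) / conditional(joint, 1, previous)

p_h1 = joint[(1, 0)] + joint[(1, 1)]          # marginal P(H(l) = H1)
p_h1_given_h1 = conditional(joint, 1, 1)      # P(H(l) = H1 | H(l-1) = H1)
ratio_prev_absent = prior_ratio(joint, 0)     # factor appearing in alpha_0
ratio_prev_present = prior_ratio(joint, 1)    # factor appearing in alpha_1
```

With these numbers, the probability of speech given speech in the previous frame (about 0.98) exceeds the marginal speech probability (0.597), which is the inter-frame correlation the method relies on. The two ratios come out at roughly 32.6 and 0.021, so the threshold after a speech-absent frame is the larger one. The values $\alpha_0 = 9$ and $\alpha_1 = 3.5$ in the text were chosen experimentally rather than computed from these ratios.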
First, the performance of noise power estimation was evaluated frame by frame based on the normalized relative estimation error $\varepsilon_n$ defined by [10]

$$\varepsilon_n = \frac{1}{N} \sum_{l=0}^{N-1} \frac{\sum_k \left[\lambda_d(k,l) - \hat\lambda_d(k,l)\right]^2}{\sum_k \lambda_d^2(k,l)} \tag{18}$$
where $\lambda_d(k,l)$ is the actual noise power obtained directly from the noise signal [10], $\hat\lambda_d(k,l)$ is the noise power estimated by the conventional MCRA method or the proposed method, and $N$ is the number of frames. Table 1 shows the relative estimation error for the given noise power estimation methods under the various noise conditions. From the results, it can be seen that the proposed scheme improves on the previous MCRA approach: in all of the environmental conditions, including the office and street noises, the proposed method yields a lower estimation error. In particular, it should be noted that the performance gain tends to become larger as the SNR gets higher [10].

To evaluate the performance of the presented method in terms of speech quality, we first measured the perceptual evaluation of speech quality (PESQ) based on the ITU-T P.862 tests [17]. Even though the PESQ cannot assess the quality of a signal that includes acoustic terminals and background noises, we employed the PESQ due to its widespread use as a decent objective measure of subjective quality in speech enhancement systems [18]. Table 2 shows that the proposed conditional MAP-based MCRA approach outperformed the original MCRA-based scheme under the given noise conditions, indicating that incorporating the conditional MAP has a positive effect on the objective measure of subjective quality.

Secondly, to evaluate the subjective quality of the proposed scheme, we carried out a set of informal listening tests under the same noise conditions. Subjective opinions were given by a group of 20 listeners; each listener gave a score for each test sentence: 5 (Excellent), 4 (Good), 3 (Fair), 2 (Poor), 1 (Bad). All listener scores were then averaged to yield a mean opinion score (MOS) [8]. The MOS test results are summarized in Table 3, in which a higher value indicates preference. Table 3
Table 2
PESQ scores of the MCRA and proposed method.

Noise     Method      SNR (dB)
                      5        10       15
White     MCRA        1.947    2.295    2.636
White     Proposed    2.034    2.373    2.701
Babble    MCRA        2.316    2.646    2.927
Babble    Proposed    2.339    2.668    2.956
F16       MCRA        2.030    2.442    2.757
F16       Proposed    2.123    2.515    2.820
Office    MCRA        2.147    2.514    2.799
Office    Proposed    2.184    2.551    2.839
Street    MCRA        2.513    2.814    3.031
Street    Proposed    2.563    2.860    3.068
Table 3
MOS of the MCRA and proposed method.

Noise     Method      SNR (dB)
                      5       10      15
White     MCRA        2.17    2.56    3.03
White     Proposed    2.23    2.65    3.07
Babble    MCRA        2.93    3.32    3.69
Babble    Proposed    2.95    3.42    3.76
F16       MCRA        2.23    2.77    3.20
F16       Proposed    2.33    2.90    3.46
Office    MCRA        2.00    2.68    3.40
Office    Proposed    2.10    2.90    3.60
Street    MCRA        2.71    3.48    3.85
Street    Proposed    3.05    3.53    4.33
demonstrates that the proposed approach improved over the conventional MCRA method under the given conditions. It is noted that the performance improvement was found to be greater for the F16 noise case at all SNRs. This could be attributed to the fact that the speech presence probability can be estimated effectively because the F16 noise spectra are dominantly located in the low frequency range. These results confirm that the proposed conditional MAP scheme substantially improves the MCRA method in speech quality enhancement.

5. Conclusions

In this paper, we have proposed a novel approach to incorporate the conditional MAP into the conventional MCRA-based noise power estimation scheme for speech enhancement. The noise power estimate was given by the recursive smoothing parameter incorporating the speech absence probability conditioned on both the current observation and the voice activity decision in the previous frame. Compared to the conventional MCRA method, it has been demonstrated that the proposed technique provides better noise power estimates in speech enhancement systems.

References

[1] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (6) (Dec. 1984) 1109–1121.
[2] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. ASSP-33 (2) (Apr. 1985) 443–445.
[3] S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust. Speech Signal Process. ASSP-27 (2) (Apr. 1979) 113–120.
[4] R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE Trans. Speech Audio Process. 9 (5) (Jul. 2001) 504–512.
[5] G. Doblinger, Computationally efficient speech enhancement by spectral minima tracking in subbands, in: Proc. EUROSPEECH, Madrid, Spain, Sept. 1995, pp. 1513–1516.
[6] J. Meyer, K.U. Simmer, K.D. Kammeyer, Comparison of one- and two-channel noise-estimation techniques, in: Proc. IWAENC, London, UK, Sept. 1997, pp. 137–145.
[7] I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments, Signal Process. 81 (Nov. 2001) 2403–2418.
[8] N.S. Kim, J.-H. Chang, Spectral enhancement based on global soft decision, IEEE Signal Process. Lett. 7 (5) (May 2000) 108–110.
[9] J.W. Shin, H.J. Kwon, S.H. Jin, N.S. Kim, Voice activity detection based on conditional MAP criterion, IEEE Signal Process. Lett. 15 (Feb. 2008) 257–260.
[10] I. Cohen, B. Berdugo, Noise estimation by minima controlled recursive averaging for robust speech enhancement, IEEE Signal Process. Lett. 9 (1) (Jan. 2002) 12–15.
[11] I. Cohen, Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging, IEEE Trans. Speech Audio Process. 11 (5) (Sept. 2003) 466–475.
[12] V. Stouten, H.V. Hamme, P. Wambacq, Application of minimum statistics and minima controlled recursive averaging methods to estimate a spectral noise model for robust ASR, in: Proc. ICASSP, Toulouse, France, May 2006, pp. 765–768.
[13] N. Fan, J. Rosca, R. Balan, Speech noise estimation using enhanced minima controlled recursive averaging, in: Proc. ICASSP, Honolulu, Hawaii, USA, Apr. 2007, pp. 581–584.
[14] Y.D. Cho, K. Al-Naimi, A. Kondoz, Improved voice activity detection based on a smoothed statistical likelihood ratio, in: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, Salt Lake City, UT, May 2001, pp. 7–11.
[15] A. Davis, S. Nordholm, R. Togneri, Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold, IEEE Trans. Audio Speech Lang. Process. 14 (2) (Mar. 2006) 412–424.
[16] J. Sohn, N.S. Kim, W. Sung, A statistical model-based voice activity detection, IEEE Signal Process. Lett. 6 (1) (Jan. 1999) 1–3.
[17] ITU-T P.862, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001.
[18] J.-H. Chang, Q.-H. Jo, D.K. Kim, N.S. Kim, Global soft decision employing support vector machine for speech enhancement, IEEE Signal Process. Lett. 16 (1) (Jan. 2009) 57–60.
Jong-Mo Kum received the B.S. degree in electronic engineering from Inha University, Incheon, Korea, in 2008. He is now working toward the M.S. degree at the same university.

Yun-Sik Park received the B.S. and M.S. degrees in electronics engineering from Inha University, Incheon, Korea, in 2006 and 2008, respectively. He is now working toward the Ph.D. degree at the same university.

Joon-Hyuk Chang received the B.S. degree in electronics engineering from Kyungpook National University, Daegu, in 1998 and the M.S. and Ph.D. degrees in electrical engineering from Seoul National University, Seoul, Korea, in 2000 and 2004, respectively. From March 2000 to April 2005, he was with Netdus Corp., Seoul, as a Chief Engineer. From May 2004 to April 2005, he was with the University of California, Santa Barbara, in a postdoctoral position to work on adaptive signal processing and audio coding. In May 2005, he joined Korea Institute of Science and Technology, Seoul, as a Research Scientist to work on speech recognition. Currently, he is an assistant professor with the Faculty of Electronic Engineering, Inha University, Incheon, Korea. His research interests are in speech coding, speech enhancement, speech recognition, audio coding, and adaptive signal processing.