The Journal of China Universities of Posts and Telecommunications September 2011, 18(Suppl. 1): 13–18 www.sciencedirect.com/science/journal/10058885
http://jcupt.xsw.bupt.cn
Speech enhancement using super gauss mixture model of speech spectral amplitude
WANG Hai-yan, ZHAO Xiao-hui, GU Hai-jun
Key Laboratory of Information Science, School of Communication Engineering, Jilin University, Changchun 130012, China
Abstract
Most speech enhancement algorithms are derived under the Gaussian hypothesis for decorrelated speech samples. Recently it has been accepted that the real pdf of the speech spectral amplitude lies between the Laplace and Gamma amplitude approximations. This paper presents a super gauss mixture model of the speech spectral amplitude that accurately approximates the real speech spectral amplitude distribution. With this model, a new speech enhancement algorithm based on the MMSE estimator is derived. Simulations show that this algorithm has better noise reduction performance and improves the output SNR, measured as segmental SNR.
Keywords speech enhancement, super gauss mixture model (SGMM), MMSE
1 Introduction
Speech enhancement is required in many speech processing applications, such as speech recognition, telephony, wireless communication, teleconferencing and hearing aids. Various types of speech enhancement algorithms have been proposed over the past several decades. Most of them were developed independently of each other along different lines, such as the Wiener filter, statistical methods and the signal subspace approach. Statistical algorithms in particular have been the subject of intensive research. Under the assumption that the discrete Fourier transform (DFT) coefficients of speech are independent Gaussian random variables, Ephraim [1] proposed an algorithm that estimates the speech DFT coefficients with minimum mean square error (MMSE). More recently, researchers have paid more attention to super gauss models, such as the Laplace [2] and Gamma [3] distributions, in place of the Gaussian assumption. Lotter and Vary [4] derived a more general expression for super gauss models, and computationally efficient speech estimators were obtained by applying the maximum a posteriori (MAP) estimation rule. Erkelens et al. [5–6] gave a minimum mean square error estimator for the super gauss distribution. These algorithms perform better than those based on the Gaussian assumption, which indicates that the real speech spectral amplitude distribution is closer to super gauss.
In practice, the probability distribution of the speech signal is not known and cannot be accurately described by a single function. The aim is therefore to find the mathematical expression closest to the real speech spectral amplitude distribution, which is known to lie between the Laplace and Gamma distributions. We use the sum of a series of one basic function with different variances and coefficients to describe the real speech spectral amplitude distribution, and the real distribution is measured on a very long speech signal from a speech database. Obtaining the variances and coefficients is essentially a parameter estimation problem. With the parameters determined, we derive a new speech enhancement algorithm.
Received date: 06-17-2011
Corresponding author: WANG Hai-yan, E-mail: [email protected]
DOI: 10.1016/S1005-8885(10)60217-8
2 Problem formulation
In this section we review speech enhancement based on the Gaussian and super gauss assumptions. Enhancing speech that has been degraded by noise is an estimation problem: the clean speech has to be estimated from the noisy speech by minimizing a distortion measure between the clean speech signal and its estimate. Consider an additive noise model in which the speech and noise signals are statistically independent:

$$y(n) = s(n) + w(n) \tag{1}$$

where $y(n)$ represents the noisy speech, and $s(n)$ and $w(n)$ represent the clean speech signal and the noise signal respectively. Speech enhancement estimates $s(n)$ from $y(n)$. Because the speech signal is non-stationary, short-time Fourier transform (STFT) analysis is widely used in speech enhancement:

$$Y(m,k) = S(m,k) + W(m,k) \tag{2}$$

where $(m,k)$ denotes the $k$-th frequency value of the $m$-th STFT frame. Because human ears are not sensitive to phase information, we consider only the amplitude information. Let $R$, $A$ and $W$ represent the amplitudes of the noisy signal, the clean speech signal and the noise signal respectively. The estimator of the speech spectral amplitude that uses the minimum mean square error as the distortion measure is

$$\hat{A} = E\{A \mid R\} = \frac{\int_0^\infty a\, f_{R|A}(r|a)\, f_A(a)\, da}{\int_0^\infty f_{R|A}(r|a)\, f_A(a)\, da} \tag{3}$$

where $f_A(a)$ is the probability density function of $A$, and $f_{R|A}(r|a)$ [7] is the conditional probability density function.
Suppose the noise signal is a zero-mean Gaussian process. Then $f_{R|A}(r|a)$ can be represented by

$$f_{R|A}(r|a) = \frac{2r}{\sigma_W^2}\exp\left(-\frac{r^2+a^2}{\sigma_W^2}\right) I_0\left(\frac{2ar}{\sigma_W^2}\right) \tag{4}$$

where $I_0$ is the zero-order modified Bessel function of the first kind and $\sigma_W^2$ denotes the noise spectrum variance. Different models for $f_A(a)$ lead to different estimators. Recently, super gauss models such as the Gamma and Laplace distributions have been used for the speech spectral amplitude distribution, and the real pdf of the speech spectral amplitude lies between them. Lotter and Vary [4] gave a general form of the super gauss distribution,

$$f_A(a) \propto \frac{a^{\nu}}{\sigma_S^{\nu}}\exp\left(-\mu\frac{a}{\sigma_S}\right) \tag{5}$$

which, normalized to integrate to one, reads

$$f_A(a) = \frac{\mu^{\nu+1}}{\Gamma(\nu+1)}\,\frac{a^{\nu}}{\sigma_S^{\nu+1}}\exp\left(-\mu\frac{a}{\sigma_S}\right) \tag{6}$$

where $\sigma_S^2$ denotes the speech spectrum variance and $\Gamma(\cdot)$ is the gamma function. By using different sets of parameters $(\nu, \mu)$, we can accurately approximate the Gamma and Laplace distributions. Under the super gauss assumption, the estimator of the speech spectral amplitude using the minimum mean square error measure is

$$\hat{A} = \frac{\int_0^\infty a^{\nu+1}\exp\left(-\frac{a^2}{\sigma_W^2}-\mu\frac{a}{\sigma_S}\right) I_0\left(\frac{2ar}{\sigma_W^2}\right) da}{\int_0^\infty a^{\nu}\exp\left(-\frac{a^2}{\sigma_W^2}-\mu\frac{a}{\sigma_S}\right) I_0\left(\frac{2ar}{\sigma_W^2}\right) da} \tag{7}$$

When $x$ is small, the value of $I_0(x)$ can be approximated by the truncated series

$$I_0(x, K) = \sum_{k=0}^{K-1}\left(\frac{x}{2}\right)^{2k}\left(\frac{1}{k!}\right)^{2} \tag{8}$$

Then the estimator under the minimum mean square error measure becomes

$$\hat{A}_{K} = \frac{1}{\sqrt{2\gamma}}\,\frac{\sum_{k=0}^{K-1}\left(\frac{1}{k!}\right)^{2}\left(\sqrt{\frac{\gamma}{2}}\right)^{2k}\Gamma(\nu+2+2k)\,D_{-(\nu+2+2k)}\left(\frac{\mu}{\sqrt{2\xi}}\right)}{\sum_{k=0}^{K-1}\left(\frac{1}{k!}\right)^{2}\left(\sqrt{\frac{\gamma}{2}}\right)^{2k}\Gamma(\nu+1+2k)\,D_{-(\nu+1+2k)}\left(\frac{\mu}{\sqrt{2\xi}}\right)}\,R \tag{9}$$

where $D_{p}(\cdot)$ denotes the parabolic cylinder function, and $\xi$ and $\gamma$ are called the a priori SNR and the a posteriori SNR, defined by

$$\gamma = \frac{R^2}{\sigma_W^2}, \qquad \xi = \frac{\sigma_S^2}{\sigma_W^2} \tag{10}$$

When $x$ is large, $I_0(x)$ is closer to

$$I_0(x) \approx \frac{1}{\sqrt{2\pi x}}\exp(x) \tag{11}$$

and the estimator becomes

$$\hat{A} = \frac{1}{\sqrt{2\gamma}}\,\frac{\Gamma(\nu+1.5)\,D_{-(\nu+1.5)}\left(\frac{\mu}{\sqrt{2\xi}}-\sqrt{2\gamma}\right)}{\Gamma(\nu+0.5)\,D_{-(\nu+0.5)}\left(\frac{\mu}{\sqrt{2\xi}}-\sqrt{2\gamma}\right)}\,R \tag{12}$$

It has been shown that the super gauss distribution gives better performance than the Gaussian assumption. When matched with experimental data, the super gauss distribution with parameters $\nu = 0.126$, $\mu = 1.74$ is the closest to the real speech spectral amplitude distribution. However, a gap remains between the super gauss distribution and the real speech spectral amplitude distribution, which shows that the pdf of the spectral amplitude cannot be represented by a single function.
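As a rough numerical illustration of Eqs. (7), (8) and (11), the following Python sketch evaluates the ratio of integrals in Eq. (7) directly with a midpoint rule instead of using the closed forms of Eqs. (9) and (12). It is only a sketch under stated assumptions: the function names, the integration range and the step count are choices made for this example and do not come from the paper, and the super gauss prior is taken as proportional to $a^{\nu}\exp(-\mu a/\sigma_S)$.

```python
import math


def bessel_i0_series(x, K=None):
    """I_0(x) from the power series of Eq. (8).

    With K given, exactly K terms (k = 0..K-1) are summed; with K=None the
    series is summed until the next term no longer changes the result.
    """
    total, term, k = 0.0, 1.0, 0
    while True:
        total += term
        k += 1
        if K is not None and k >= K:
            break
        term *= (x / 2.0) ** 2 / (k * k)
        if K is None and term < 1e-17 * total:
            break
    return total


def bessel_i0_asymptotic(x):
    """Large-argument approximation of Eq. (11): exp(x) / sqrt(2*pi*x)."""
    return math.exp(x) / math.sqrt(2.0 * math.pi * x)


def mmse_amplitude(r, sigma_w2, sigma_s2, nu=0.126, mu=1.74, steps=4000):
    """Numerical evaluation of Eq. (7) (assumed midpoint-rule sketch)."""
    sigma_s = math.sqrt(sigma_s2)
    # Assumption: the integrand is negligible beyond r + 8 noise std devs.
    a_max = r + 8.0 * math.sqrt(sigma_w2)
    h = a_max / steps
    num = den = 0.0
    for n in range(steps):
        a = (n + 0.5) * h  # midpoints avoid evaluating at a = 0
        weight = (a ** nu
                  * math.exp(-a * a / sigma_w2 - mu * a / sigma_s)
                  * bessel_i0_series(2.0 * a * r / sigma_w2))
        num += a * weight
        den += weight
    return num / den


if __name__ == "__main__":
    print("3-term series at x=1:", bessel_i0_series(1.0, K=3))
    print("asymptotic / series at x=20:",
          bessel_i0_asymptotic(20.0) / bessel_i0_series(20.0))
    print("amplitude estimate:", mmse_amplitude(2.0, 1.0, 1.0))
```

Comparing the two Bessel approximations against the converged series gives a feel for where each of Eqs. (8) and (11) is usable before committing to one of the closed-form estimators.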
3 Super gauss mixture model
In general, the probability distribution of a natural random variable is determined by the complex physics of the phenomenon. The accurate distribution can be investigated by measurement and statistical tests. In this section, we present a mixture of one basic function with different variances and coefficients as the pdf of the speech spectral amplitude, based on the actual pdf measured from a speech database. Because the actual pdf is very different from the Rayleigh distribution, researchers have focused on super gauss distributions such as the gamma and Laplace distributions, and the super gauss assumption has been shown to perform better than the Gaussian assumption. Because the real pdf is closer to super gauss, we consider a simpler function to solve the speech enhancement problem:

$$f_A(a) = \exp\left(-\frac{a}{\sigma_S}\right) \tag{13}$$

To reduce the complexity, we retain the linear term for $A$:

$$f_A(a) = \frac{a}{\sigma_S^2}\exp\left(-\frac{a}{\sigma_S}\right) \tag{14}$$

Fig. 1 Function of Eq. (14) with different $\sigma_S$

Fig. 1 shows the function of Eq. (14) with different variances. Like the super gauss distribution, this single function cannot represent the actual pdf of the speech spectral amplitude. Hence we use a mixture of functions of the form of Eq. (14) to approximate the actual pdf:

$$f_A(a) = \sum_{i=1}^{7} c_i\,\frac{a}{\sigma_i^2}\exp\left(-\frac{a}{\sigma_i}\right) \tag{15}$$

where $c_i$ and $\sigma_i$ are the coefficients and variances. Normally, we restrict

$$\int_0^\infty f_A(a)\, da = 1 \tag{16}$$

Because the coefficients and variances are calculated from the experimental data, we can ensure Eq. (16) and

$$\sum_{i=1}^{7} c_i = 1 \tag{17}$$

Each term of Eq. (15) integrates to $c_i$, so Eq. (16) holds exactly when Eq. (17) does. The variance $\sigma_i^2$ does not have the traditional meaning for a distribution function, but in experiments it is linear in $\sigma_S^2$. To obtain the coefficients and variances, we need to measure the pdf of the speech spectral amplitude. First, a histogram is built using speech segments from a speech database that includes speech from speakers of different ages and genders, all with equal variance. Estimating the coefficients and variances is then a parameter estimation problem under the constraint of Eq. (17). The data in Table 1 are one such set of estimates.

Table 1 Parameters of SGMM of speech spectral amplitude

| $c_i$ | $\sigma_i$ |
|---|---|
| 0.007 252 75 | 0.079 773 09 |
| 0.004 092 93 | 0.072 743 06 |
| 0.813 598 22 | 0.214 56 |
| 0.006 406 90 | 0.071 908 41 |
| 0.028 354 95 | 0.070 000 26 |
| 0.014 037 04 | 0.076 625 06 |
| 0.130 619 75 | 0.814 728 2 |

There are many sets of estimates that satisfy Eq. (15). We choose the set in Table 1 because it is easy to calculate. Fig. 2 shows the histogram of the speech spectral amplitude, comparing the real pdf of the speech spectral amplitude with the Rayleigh, gamma, Laplace and super gauss ($\nu = 0.126$, $\mu = 1.74$) distributions and the mixture of Table 1. It confirms that the actual pdf of the speech spectral amplitude lies between the gamma and Laplace distributions. The SGMM is closer to the pdf of the real speech spectral amplitude than any of the other distributions. Because the speech signal is a non-stationary random process, however, the SGMM cannot accurately describe the time-varying speech signal either.
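The mixture of Eq. (15) can be checked numerically with a short Python sketch. It assumes that the Table 1 entries are read as alternating $(c_i, \sigma_i)$ pairs, which is the reading under which the coefficients sum to roughly one (about 1.004); the names `SGMM_PARAMS`, `sgmm_pdf` and `integrate` are hypothetical helpers introduced only for this illustration. Each term $c_i\,a/\sigma_i^2\,\exp(-a/\sigma_i)$ integrates to $c_i$, so the area under the mixture should match the sum of the coefficients.

```python
import math

# Assumed reading of Table 1: one (c_i, sigma_i) pair per mixture component.
SGMM_PARAMS = [
    (0.00725275, 0.07977309),
    (0.00409293, 0.07274306),
    (0.81359822, 0.21456),
    (0.00640690, 0.07190841),
    (0.02835495, 0.07000026),
    (0.01403704, 0.07662506),
    (0.13061975, 0.8147282),
]


def sgmm_pdf(a, params=SGMM_PARAMS):
    """Eq. (15): sum_i c_i * a / sigma_i^2 * exp(-a / sigma_i) for a >= 0."""
    if a < 0.0:
        return 0.0
    return sum(c * a / (s * s) * math.exp(-a / s) for c, s in params)


def integrate(f, upper, steps=100000):
    """Midpoint rule on [0, upper]; upper and steps are illustrative choices."""
    h = upper / steps
    return h * sum(f((n + 0.5) * h) for n in range(steps))


weight_sum = sum(c for c, _ in SGMM_PARAMS)   # left-hand side of Eq. (17)
area = integrate(sgmm_pdf, 30.0)              # left-hand side of Eq. (16)

if __name__ == "__main__":
    print("sum of c_i:", weight_sum)
    print("area under the mixture:", area)
```

If exact normalization matters in an implementation, dividing every $c_i$ by `weight_sum` is one simple option; whether the original experiments did so is not stated.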
Fig. 2 Histogram of speech spectral amplitude, Rayleigh, gamma, Laplace, super gauss ($\nu = 0.126$, $\mu = 1.74$) and SGMM

4 New speech enhancement algorithm
Given the SGMM of the speech spectral amplitude, we derive a new speech enhancement algorithm using the minimum mean square error measure. As for the MMSE estimator under the super gauss assumption, $I_0(x)$ can be approximated by Eq. (8) when $x$ is small. The estimator based on the proposed SGMM is therefore

$$\hat{A}_{K} = \frac{1}{\sqrt{2\gamma}}\,\frac{\sum_{k=0}^{K-1}\left(\frac{1}{k!}\right)^{2}\left(\sqrt{\frac{\gamma}{2}}\right)^{2k}\Gamma(3+2k)\sum_{i=1}^{7}\frac{c_i}{\xi_i}\exp\left(\frac{1}{8\xi_i}\right)D_{-(3+2k)}\left(\frac{1}{\sqrt{2\xi_i}}\right)}{\sum_{k=0}^{K-1}\left(\frac{1}{k!}\right)^{2}\left(\sqrt{\frac{\gamma}{2}}\right)^{2k}\Gamma(2+2k)\sum_{i=1}^{7}\frac{c_i}{\xi_i}\exp\left(\frac{1}{8\xi_i}\right)D_{-(2+2k)}\left(\frac{1}{\sqrt{2\xi_i}}\right)}\,R \tag{18}$$

When $x$ is large, $I_0(x)$ can be approximated by Eq. (11), and the estimator based on the SGMM becomes

$$\hat{A} = \frac{1}{\sqrt{2\gamma}}\,\frac{\Gamma(2.5)\sum_{i=1}^{7}\frac{c_i}{\xi_i}\exp\left(\frac{1}{8\xi_i}-\frac{1}{2}\sqrt{\frac{\gamma}{\xi_i}}\right)D_{-2.5}\left(\frac{1}{\sqrt{2\xi_i}}-\sqrt{2\gamma}\right)}{\Gamma(1.5)\sum_{i=1}^{7}\frac{c_i}{\xi_i}\exp\left(\frac{1}{8\xi_i}-\frac{1}{2}\sqrt{\frac{\gamma}{\xi_i}}\right)D_{-1.5}\left(\frac{1}{\sqrt{2\xi_i}}-\sqrt{2\gamma}\right)}\,R \tag{19}$$

The derivation of Eq. (18) is given in the Appendix, and Eq. (19) is derived in the same way. Here $\gamma$ has the same meaning as in Sect. 2. The value of $\xi_i$ is equal to $\xi$ multiplied by $\sigma_i^2$. The a priori SNR $\xi$ is estimated using the recursive approach proposed by Ephraim and Malah:

$$\xi(m,k) = \alpha\,\frac{\hat{A}^2(m-1,k)}{\sigma_W^2(m,k)} + (1-\alpha)\,F\left[\gamma(m,k)-1\right] \tag{20}$$

where $F[x] = x$ for $x > 0$ and $F[x] = 0$ otherwise, and $\alpha$ is usually 0.98.
The gain functions of the speech enhancement algorithms based on the super gauss assumption and on the SGMM are compared in Fig. 3. The gain functions of super gauss ($\nu = 0.126$, $\mu = 1.74$) and of the new speech model are drawn for a priori SNR $\xi$ of -5, 0 and 5 dB and a posteriori SNR from -20 dB to 15 dB. We can see that the gain function of the SGMM is better than that of super gauss ($\nu = 0.126$, $\mu = 1.74$).
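The recursive a priori SNR estimate of Eq. (20) can be sketched in Python for a single frequency bin as follows. This is a minimal sketch under assumptions: the previous amplitude estimate is taken to enter Eq. (20) squared, as in the usual decision-directed form; the first frame starts from a zero previous estimate; and a Wiener-type gain $\xi/(1+\xi)$ is used purely as a stand-in where the gain implied by Eq. (18) or Eq. (19) would go. The function names are hypothetical.

```python
def decision_directed_xi(a_prev, noise_var, gamma, alpha=0.98):
    """Eq. (20): a priori SNR from the previous amplitude estimate.

    a_prev    -- amplitude estimate of the previous frame, A_hat(m-1, k)
    noise_var -- noise spectrum variance sigma_W^2(m, k)
    gamma     -- a posteriori SNR gamma(m, k) = R^2 / sigma_W^2
    """
    # F[x] = x for x > 0 and 0 otherwise
    return alpha * a_prev ** 2 / noise_var + (1.0 - alpha) * max(gamma - 1.0, 0.0)


def wiener_gain(xi, gamma):
    """Stand-in gain (NOT the SGMM gain); gamma is unused here."""
    return xi / (1.0 + xi)


def enhance_bin(noisy_amplitudes, noise_var, gain=wiener_gain, alpha=0.98):
    """Run the recursion of Eq. (20) over the frames of one frequency bin."""
    a_prev = 0.0  # assumed initial condition for the first frame
    estimates = []
    for r in noisy_amplitudes:
        gamma = r * r / noise_var
        xi = decision_directed_xi(a_prev, noise_var, gamma, alpha)
        a_hat = gain(xi, gamma) * r
        estimates.append(a_hat)
        a_prev = a_hat
    return estimates


if __name__ == "__main__":
    print(decision_directed_xi(1.0, 1.0, 3.0))
    print(enhance_bin([2.0, 2.0, 2.0], 1.0))
```

Passing a different `gain` callable, for example one that evaluates Eq. (18) or Eq. (19), leaves the recursion itself unchanged.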
Fig. 3 Comparison of the gain function of super gauss ($\nu = 0.126$, $\mu = 1.74$) and SGMM

5 Experiments
To measure the quality of speech enhancement based on the proposed pdf of the speech spectral amplitude, we use the segmental signal-to-noise ratio (SNR) to track speech quality. The segmental SNR is defined by

$$\mathrm{SNR}_{seg} = \frac{1}{P}\sum_{p=0}^{P-1} 10\log_{10}\left(\frac{\sum_{i=1}^{I} s^2(i+pI)}{\sum_{i=1}^{I}\left[\hat{s}(i+pI) - s(i+pI)\right]^2}\right) \tag{21}$$

where $P$ denotes the number of speech signal segments and $I$ denotes the number of samples in each segment, so that $PI$ is equal to the length of the speech signal.
As described in Sect. 3, the proposed SGMM of the speech spectral amplitude is closer to the real speech spectral amplitude. Table 2 compares the SGMM, Gaussian, gamma, Laplace and super gauss ($\nu = 0.126$, $\mu = 1.74$) models with the real speech spectral amplitude pdf in terms of the root mean square error (RMSE), which is defined by

$$G = \sqrt{\frac{\sum_{a=1}^{N}\left[\hat{f}_A(a) - f_A(a)\right]^2}{N}} \tag{22}$$

It shows that the SGMM approximates the real speech spectral amplitude better than the other distributions.

Table 2 RMSE of Rayleigh, gamma, Laplace, super gauss ($\nu = 0.126$, $\mu = 1.74$) and SGMM

| Rayleigh | Laplace | Gamma | Super gauss | SGMM |
|---|---|---|---|---|
| 0.514 6 | 0.364 9 | 0.183 0 | 0.164 5 | 0.095 1 |

Table 3 Output SNR of estimators of Rayleigh, gamma, Laplace, super gauss ($\nu = 0.126$, $\mu = 1.74$) and SGMM for speech corrupted with white noise

| Input SNR | Rayleigh | Laplace | Gamma | Super gauss | SGMM |
|---|---|---|---|---|---|
| -3 dB | 2.785 3 | 3.619 4 | 4.074 7 | 4.099 1 | 4.416 7 |
| 0 dB | 4.679 6 | 5.378 2 | 5.703 0 | 5.684 0 | 5.743 4 |
| 3 dB | 6.624 3 | 6.851 0 | 7.086 1 | 7.046 8 | 7.107 6 |

Table 4 Output SNR of estimators of Rayleigh, gamma, Laplace, super gauss ($\nu = 0.126$, $\mu = 1.74$) and SGMM for speech corrupted with babble noise

| Input SNR | Rayleigh | Laplace | Gamma | Super gauss | SGMM |
|---|---|---|---|---|---|
| -3 dB | 0.381 5 | 0.906 3 | 1.165 4 | 1.217 0 | 1.160 0 |
| 0 dB | 2.223 7 | 2.685 0 | 2.862 4 | 2.864 8 | 2.870 8 |
| 3 dB | 4.199 0 | 4.638 8 | 4.745 0 | 4.696 1 | 4.751 8 |
Table 3 and Table 4 give an evaluation in terms of segmental SNR for input SNRs of -3 dB, 0 dB and 3 dB when the speech is degraded with white noise and babble noise respectively. For comparison we take Rayleigh, Laplace, gamma and super gauss ($\nu = 0.126$, $\mu = 1.74$) as references. The SGMM gives the highest output SNR in every condition except babble noise at -3 dB, where the super gauss and gamma estimators score slightly higher (1.217 0 and 1.165 4 against 1.160 0).
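The two evaluation measures, the segmental SNR of Eq. (21) and the RMSE of Eq. (22), can be sketched in Python as below. The function names and the small constant `eps` (added to avoid division by zero and the logarithm of zero in silent or perfectly reconstructed segments) are assumptions of this example and are not part of the paper's definitions; samples beyond the last complete segment are ignored.

```python
import math


def segmental_snr(clean, enhanced, seg_len, eps=1e-12):
    """Eq. (21): mean over segments of 10*log10(signal energy / error energy)."""
    num_segments = len(clean) // seg_len
    if num_segments == 0:
        raise ValueError("signal is shorter than one segment")
    total = 0.0
    for p in range(num_segments):
        idx = range(p * seg_len, (p + 1) * seg_len)
        signal_energy = sum(clean[i] ** 2 for i in idx)
        error_energy = sum((enhanced[i] - clean[i]) ** 2 for i in idx)
        total += 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
    return total / num_segments


def rmse(model_pdf, measured_pdf):
    """Eq. (22): root mean square error between two sampled pdfs."""
    if len(model_pdf) != len(measured_pdf) or not model_pdf:
        raise ValueError("pdf samples must be non-empty and of equal length")
    n = len(model_pdf)
    squared = sum((m - t) ** 2 for m, t in zip(model_pdf, measured_pdf))
    return math.sqrt(squared / n)


if __name__ == "__main__":
    clean = [1.0, 1.0, 1.0, 1.0]
    enhanced = [0.9, 0.9, 0.9, 0.9]
    print("segmental SNR (dB):", segmental_snr(clean, enhanced, seg_len=2))
    print("RMSE:", rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Averaging the per-segment values in dB, rather than taking one ratio over the whole signal, keeps loud segments from dominating the score.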
6 Conclusions
A super gauss mixture model (SGMM) of the speech spectral amplitude is proposed and shown to be closer to the real speech spectral amplitude distribution. Using this model, an MMSE estimator for speech enhancement is presented. Experiments show that the new estimator leads to better performance, even though the SGMM still has a computational drawback. The proposed algorithm also shows another way to improve speech enhancement algorithms, evaluated by RMSE and segmental SNR, and it can be extended to noise analysis in order to simplify MMSE estimators.
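Appendix
In this section we give the derivation of the new speech enhancement estimator under the SGMM. Based on Eq. (3), we obtain the estimator by substituting the SGMM of Eq. (15) into it. As for the MMSE estimator under the super gauss assumption, when $x$ is small the value of $I_0(x)$ can be approximated by Eq. (8). Then we have

$$\hat{A}_{K} = \frac{\sum_{k=0}^{K-1}\left(\frac{1}{k!}\right)^{2}\frac{R^{2k}}{\sigma_W^{4k}}\sum_{i=1}^{I}\frac{c_i}{\sigma_i^2}\int_0^\infty a^{2k+2}\exp\left(-\frac{a^2}{\sigma_W^2}-\frac{a}{\sigma_i}\right)da}{\sum_{k=0}^{K-1}\left(\frac{1}{k!}\right)^{2}\frac{R^{2k}}{\sigma_W^{4k}}\sum_{i=1}^{I}\frac{c_i}{\sigma_i^2}\int_0^\infty a^{2k+1}\exp\left(-\frac{a^2}{\sigma_W^2}-\frac{a}{\sigma_i}\right)da} \tag{23}$$

where $I$ is the number of mixture components ($I = 7$ in Eq. (15)). The integrals in Eq. (23) can be evaluated with [9, Eq. 3.462.1]:

$$\int_0^\infty x^{v-1}e^{-\beta x^2-\gamma x}\,dx = (2\beta)^{-v/2}\,\Gamma(v)\exp\left(\frac{\gamma^2}{8\beta}\right)D_{-v}\left(\frac{\gamma}{\sqrt{2\beta}}\right) \tag{24}$$

where $\beta$ and $\gamma$ are the parameters of the integral, not the SNRs of Eq. (10). Substituting Eq. (24) into Eq. (23) gives

$$\hat{A}_{K} = \frac{\sigma_W}{\sqrt{2}}\,\frac{\sum_{k=0}^{K-1}\left(\frac{1}{k!}\right)^{2}\Gamma(2k+3)\left(\sqrt{\frac{R^2}{2\sigma_W^2}}\right)^{2k}\sum_{i=1}^{I}\frac{c_i}{\sigma_i^2}\exp\left(\frac{1}{8\sigma_i^2/\sigma_W^2}\right)D_{-(2k+3)}\left(\frac{1}{\sqrt{2\sigma_i^2/\sigma_W^2}}\right)}{\sum_{k=0}^{K-1}\left(\frac{1}{k!}\right)^{2}\Gamma(2k+2)\left(\sqrt{\frac{R^2}{2\sigma_W^2}}\right)^{2k}\sum_{i=1}^{I}\frac{c_i}{\sigma_i^2}\exp\left(\frac{1}{8\sigma_i^2/\sigma_W^2}\right)D_{-(2k+2)}\left(\frac{1}{\sqrt{2\sigma_i^2/\sigma_W^2}}\right)} \tag{25}$$

Let

$$\xi_i = \frac{\sigma_i^2}{\sigma_W^2}, \qquad \gamma = \frac{R^2}{\sigma_W^2}$$

Finally, we get Eq. (18). The derivation of Eq. (19) is similar to that of Eq. (18).

Acknowledgements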
This work was supported by the Specialized Research Fund for the Doctoral Program of Higher Education (200801830037).
References
1. Ephraim Y, Malah D. Speech enhancement using a minimum mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(6): 443–445
2. Martin R, Breithaupt C. Speech enhancement in the DFT domain using Laplacian speech priors. International Workshop on Acoustic Echo and Noise Control, 2003, Kyoto, Japan, 2003: 87–90
3. Cohen I. Speech enhancement using MMSE short time spectral estimation with gamma distributed speech estimators. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Vol. I, 2002, Orlando, FL, USA, 2002: 253–256
4. Lotter T, Vary P. Speech enhancement by MAP spectral amplitude estimation using a super gauss speech model. EURASIP Journal on Applied Signal Processing, 2005(1): 1110–1126
5. Erkelens J S, Hendriks R C, Heusdens R, et al. Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(6): 1741–1752
6. Hendriks R C, Erkelens J S, Jensen J, et al. Minimum mean-square error amplitude estimators for speech enhancement under gamma distribution. IWAENC 2006, Paris, France, 2006: 1–4
7. McAulay R J, Malpass M L. Speech enhancement using a soft-decision noise suppression filter. IEEE Transactions on Acoustics, Speech and Signal Processing, 1980, 28(2): 137–145